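# Preparing the corpus
- This notebook uses the TECCL corpus as an example.
- Ten-thousand English Compositions of Chinese Learners (the TECCL Corpus) V1.1 is an open-access learner corpus created by Professor Xu Jiajin of Beijing Foreign Studies University (BFSU). URL: http://corpus.bfsu.edu.cn/content/teccl-corpus
- Download the corpus and unzip it.
- Copy the "01TECCL_V1.1_RAW" folder, together with the files inside it, into the current directory.
- **Note**: this corpus is too small to train good word vectors; it is used here for demonstration only.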

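# Parameter settings

- <font color="red" size=4>**This is the only part you need to edit: the parameters, the corpus path, the model save path, and the corpus preprocessing options.**</font>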

## Word vector parameters
- Adjust the parameters below as needed, and keep the trailing comma at the end of each line.

- <b>size</b>: Dimensionality of the word vectors.
- <b>window</b>: Maximum distance between the current and predicted word within a sentence.
- <b>min_count</b>: Ignores all words with total frequency lower than this.
- <b>iter</b>: Number of iterations (epochs) over the corpus.

In [79]:
paras = {
    "size": 100,
    "window": 5,
    "min_count": 5,
    "iter": 5,
}
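
The parameter names above follow the gensim 3.x API that this notebook was written against. In gensim 4.0 and later, `size` is named `vector_size` and `iter` is named `epochs`. Below is a minimal sketch of a renaming helper, assuming you run a newer gensim; `LEGACY_TO_V4` and `to_gensim4_params` are hypothetical names and are not part of this notebook or of gensim.

```python
# Hypothetical helper (not part of the original notebook): rename the
# gensim 3.x keyword names to their gensim 4.x equivalents.
LEGACY_TO_V4 = {"size": "vector_size", "iter": "epochs"}

def to_gensim4_params(params):
    """Return a copy of params with legacy keys renamed; other keys pass through."""
    return {LEGACY_TO_V4.get(key, key): value for key, value in params.items()}

legacy_paras = {"size": 100, "window": 5, "min_count": 5, "iter": 5}
v4_paras = to_gensim4_params(legacy_paras)
print(v4_paras)
```

The original dictionary is left unchanged, so the same settings can be used with either version.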

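## Corpus path and model save path
- Use the forward slash "/" to separate the parts of a file or directory path.
- Enclose the path in straight (ASCII) quotation marks.

- Corpus path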

In [80]:
corpus_path = "01TECCL_V1.1_RAW"

- Path where the word vector model is saved

In [81]:
saved_path = "teccl.model"
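
Before starting a long training run, it can help to confirm that both paths are usable. The following is a minimal sketch, not part of the original notebook; `check_paths` is a hypothetical helper name, and the two path values simply repeat the ones set above.

```python
import os

def check_paths(corpus_path, saved_path):
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    if not os.path.isdir(corpus_path):
        problems.append("corpus directory not found: %s" % corpus_path)
    # An empty dirname means the model is saved in the current directory.
    save_dir = os.path.dirname(saved_path) or "."
    if not os.path.isdir(save_dir):
        problems.append("directory for the model file not found: %s" % save_dir)
    return problems

for problem in check_paths("01TECCL_V1.1_RAW", "teccl.model"):
    print(problem)
```

If nothing is printed, the corpus folder exists and the model file can be written next to the notebook.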

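## Corpus preprocessing settings


- Split into sentences: True for yes, False for no.
- If the corpus already has one sentence per line, False is usually fine; otherwise decide whether each paragraph needs to be split into sentences.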

In [82]:
sent_tokenize = True

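- Tokenize into words: True for yes, False for no.
- If every word and punctuation mark in the corpus is already separated by spaces, choose False; otherwise decide whether each sentence needs to be tokenized.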

In [83]:
word_tokenize = True

- Convert everything to lowercase: True for yes, False for no.

In [84]:
lower_case = True

- Remove stopwords: True for yes, False for no.

In [85]:
remove_stopwords = False

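- The default, None, uses NLTK's English stopword list. This setting only takes effect when `remove_stopwords` is True.
- To use a different list, supply it here, for example: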
```python
stopwords = ['is','a','the','an']
```

In [86]:
stopwords = None
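
To see what the lowercase and stopword options do to a line of text, here is a minimal sketch. It is an illustration under simplifying assumptions, not the notebook's implementation: it splits on whitespace instead of calling NLTK (which corresponds to `sent_tokenize = False` and `word_tokenize = False`), and `preprocess_tokens` is a hypothetical name.

```python
def preprocess_tokens(line, lower_case=True, remove_stopwords=False, stopwords=None):
    """Whitespace-split a line, then apply the lowercase and stopword options."""
    stopword_set = set(stopwords or [])
    tokens = []
    for token in line.split():
        if lower_case:
            token = token.lower()
        # Stopwords are matched after lowercasing, as in the Corpus class below.
        if remove_stopwords and token in stopword_set:
            continue
        tokens.append(token)
    return tokens

sample = "The cat is on a mat"
print(preprocess_tokens(sample))
print(preprocess_tokens(sample, remove_stopwords=True,
                        stopwords=['is', 'a', 'the', 'an']))
```

Because matching happens after lowercasing, a capitalized "The" is only removed when `lower_case` is True or the stopword list itself contains "The".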

# Training the word vectors

In [87]:
import os,re
import nltk
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [88]:
class Corpus(object):    
    def __init__(self,corp_dir = None,suffixes =None,**kwargs):
        self.fdir = corp_dir
        suffixes = [suffixes] if isinstance(suffixes,str) else suffixes
        self.suffixes = suffixes if suffixes else ['.txt']
        self.files = self._get_files(self.fdir)
        self.kwargs = kwargs
        if kwargs['remove_stopwords']:
            if not kwargs['stopwords']:
                from nltk.corpus import stopwords
                try:
                    self.stopwords = stopwords.words('english')
                except LookupError:
                    # NLTK raises LookupError when the stopword data is not installed; download it once.
                    nltk.download('stopwords')
                    self.stopwords = stopwords.words('english')
            else:
                self.stopwords = kwargs['stopwords']
        
    def _get_files(self,path):
        files = []
        for f in os.listdir(path):
            fpath = os.path.join(path,f)
            if os.path.isdir(fpath): continue            
            if self._check_file_type(f):
                files.append(fpath)
        logging.info('%d file(s) loaded!'%len(files))
        return files
                    
    def _check_file_type(self,f):
        for suffix in self.suffixes:
            if f.endswith(suffix):
                return True
        return False

    def __iter__(self):
        for f in self.files:
            # Use a context manager so each file handle is closed promptly.
            with open(f, encoding='utf-8') as fh:
                for line in fh:
                    line = line.strip()
                    if not line: continue
                    for words in self.preprocess(line):
                        yield words
                    
    def preprocess(self,line):
        sents = nltk.tokenize.sent_tokenize(line) if self.kwargs['sent_tokenize'] else [line]
        for sent in sents:
            words = nltk.tokenize.word_tokenize(sent) if self.kwargs['word_tokenize'] else sent.split()
            out_words = []
            for w in words:
                if not w.strip(): continue  
                if self.kwargs['lower_case']:
                    w = w.lower()
                if self.kwargs['remove_stopwords']:
                    if w in self.stopwords:
                        continue
                out_words.append(w)
            
            yield out_words


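def train_w2vmodel(sentences, save2path, **kwargs):
    # Create an untrained model from the hyperparameters in kwargs.
    model = gensim.models.Word2Vec(**kwargs)
    # First pass over the corpus: count words and build the vocabulary.
    model.build_vocab(sentences)
    # model.epochs holds the epoch count set via "iter"; model.iter is a deprecated alias.
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(save2path, ignore=[])
    return model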
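
`train_w2vmodel` passes over `sentences` more than once: once in `build_vocab`, then once per epoch in `train`. That is presumably why `Corpus` is a class with an `__iter__` method rather than a plain generator, which can be consumed only once. The sketch below shows the difference on toy data; `ToyCorpus` is a hypothetical stand-in, not the notebook's class.

```python
class ToyCorpus:
    """Restartable iterable: every call to __iter__ starts a fresh pass."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

lines = ["i like apples", "you like pears"]

corpus = ToyCorpus(lines)
first_pass = list(corpus)
second_pass = list(corpus)      # yields the same sentences again

generator = (line.split() for line in lines)
gen_first = list(generator)
gen_second = list(generator)    # empty: the generator is already exhausted

print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```

Passing a one-shot generator to a multi-pass trainer would leave the later passes with no data, so a restartable iterable is the safer choice.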

In [89]:
mycorpus = Corpus(corpus_path, suffixes=['.txt'],
                  sent_tokenize=sent_tokenize,
                  word_tokenize=word_tokenize,
                  lower_case=lower_case,
                  remove_stopwords=remove_stopwords,
                  stopwords=stopwords)

2019-06-07 21:30:59,046 : INFO : 9864 file(s) loaded!


- <font color="red" size=3>**Start training the word vectors**</font>
- Once the parameters are set, start training as follows:
- (1) In this notebook's menu bar, click Cell.
- (2) In the Cell drop-down menu, click Run All.

In [90]:
model = train_w2vmodel(mycorpus,saved_path,**paras)

2019-06-07 21:30:59,054 : INFO : collecting all words and their counts
2019-06-07 21:30:59,074 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-06-07 21:31:03,168 : INFO : PROGRESS: at sentence #10000, processed 116736 words, keeping 13165 word types
2019-06-07 21:31:07,127 : INFO : PROGRESS: at sentence #20000, processed 236963 words, keeping 20263 word types
2019-06-07 21:31:11,330 : INFO : PROGRESS: at sentence #30000, processed 354036 words, keeping 26195 word types
2019-06-07 21:31:15,171 : INFO : PROGRESS: at sentence #40000, processed 468567 words, keeping 31382 word types
2019-06-07 21:31:19,233 : INFO : PROGRESS: at sentence #50000, processed 585958 words, keeping 36301 word types
2019-06-07 21:31:23,538 : INFO : PROGRESS: at sentence #60000, processed 713515 words, keeping 41646 word types
2019-06-07 21:31:27,796 : INFO : PROGRESS: at sentence #70000, processed 837378 words, keeping 46367 word types
2019-06-07 21:31:32,104 : INFO : PROGRESS: at 

2019-06-07 21:32:29,864 : INFO : EPOCH 3 - PROGRESS: at 8.94% examples, 34005 words/s, in_qsize 0, out_qsize 0
2019-06-07 21:32:30,944 : INFO : EPOCH 3 - PROGRESS: at 13.76% examples, 34338 words/s, in_qsize 0, out_qsize 0
2019-06-07 21:32:32,033 : INFO : EPOCH 3 - PROGRESS: at 18.22% examples, 34402 words/s, in_qsize 0, out_qsize 0
2019-06-07 21:32:33,067 : INFO : EPOCH 3 - PROGRESS: at 22.57% examples, 34794 words/s, in_qsize 0, out_qsize 0
2019-06-07 21:32:34,136 : INFO : EPOCH 3 - PROGRESS: at 26.98% examples, 34831 words/s, in_qsize 0, out_qsize 0
2019-06-07 21:32:35,196 : INFO : EPOCH 3 - PROGRESS: at 31.55% examples, 34920 words/s, in_qsize 0, out_qsize 0
2019-06-07 21:32:36,275 : INFO : EPOCH 3 - PROGRESS: at 36.28% examples, 34869 words/s, in_qsize 0, out_qsize 0
2019-06-07 21:32:37,356 : INFO : EPOCH 3 - PROGRESS: at 41.16% examples, 34854 words/s, in_qsize 0, out_qsize 0
2019-06-07 21:32:38,409 : INFO : EPOCH 3 - PROGRESS: at 45.72% examples, 34918 words/s, in_qsize 0, out_q

2019-06-07 21:33:42,182 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-07 21:33:42,184 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-07 21:33:42,192 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-07 21:33:42,194 : INFO : EPOCH - 5 : training on 1123711 raw words (846375 effective words) took 26.0s, 32503 effective words/s
2019-06-07 21:33:42,195 : INFO : training on a 5618555 raw words (4231201 effective words) took 123.8s, 34180 effective words/s
2019-06-07 21:33:42,195 : INFO : saving Word2Vec object under teccl.model, separately None
2019-06-07 21:33:42,306 : INFO : saved teccl.model


# Next steps
- Loading the word vectors
- Querying similar words
- Computing similarity
- Analogical reasoning
- Visualization
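
Similar-word queries and similarity scores on word vectors are typically based on cosine similarity. As a preview, here is a minimal pure-Python sketch of that measure. The vectors are made-up three-dimensional toy values, not output of the trained model, and `cosine_similarity` and `most_similar` are hypothetical helpers written for this illustration rather than gensim calls.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only; real ones come from the trained model.
toy_vectors = {
    "good": [1.0, 2.0, 0.0],
    "great": [2.0, 4.0, 0.0],
    "table": [0.0, 0.0, 3.0],
}

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to word, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("good", toy_vectors))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.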