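# Preparing the corpus
- The TECCL corpus is used as the example
- Ten-thousand English Compositions of Chinese Learners (the TECCL Corpus), V1.1, is an open-access learner corpus created by Professor Xu Jiajin (许家金) of Beijing Foreign Studies University. URL: http://corpus.bfsu.edu.cn/content/teccl-corpus
- Download the corpus and unzip it
- Copy the "01TECCL_V1.1_RAW" folder, together with the files inside it, into the current directory
- **Note** This corpus is too small to train good word vectors; it is used here for demonstration only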
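
Before training, it can help to confirm that the folder was copied to the right place. The following is a minimal sketch, assuming the folder name given above and that the compositions are stored as `.txt` files directly inside it; the helper name `count_corpus_files` is hypothetical and not part of the notebook.

```python
import os

def count_corpus_files(corpus_dir, suffix=".txt"):
    """Count regular files directly inside corpus_dir whose names end with suffix."""
    return sum(
        1
        for name in os.listdir(corpus_dir)
        if name.endswith(suffix) and os.path.isfile(os.path.join(corpus_dir, name))
    )

folder = "01TECCL_V1.1_RAW"  # folder name taken from the steps above
if os.path.isdir(folder):
    print(count_corpus_files(folder), "text file(s) found")
else:
    print("Folder not found; copy it into the current directory first")
```

If the count is zero or the folder is reported missing, the later cells will have nothing to train on.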

# Parameter settings

- <font color="red" size=4>**This is the only section you need to edit: set the parameters, the corpus path, the model save path, and the corpus preprocessing options**</font>

## Word-vector parameters
- Adjust the parameters below as needed; keep the trailing ASCII comma at the end of each line

- <b>size</b>: Dimensionality of the word vectors.
- <b>window</b>: Maximum distance between the current and predicted word within a sentence.
- <b>min_count</b>: Ignores all words with total frequency lower than this.
- <b>iter</b>: Number of iterations (epochs) over the corpus.
- <b>sg</b>: Training algorithm: 1 for skip-gram; otherwise CBOW.

In [1]:
paras = {
    "size": 100,
    "window": 5,
    "min_count": 5,
    "iter": 5,
    "sg": 1,
}
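
The parameter names `size` and `iter` are the ones used by gensim 3.x, which this notebook appears to target. In gensim 4.0 and later they were renamed to `vector_size` and `epochs`. The sketch below shows one way to translate the dictionary; the helper `to_gensim4_params` is hypothetical, and it only renames the keys. Other parts of the notebook that rely on gensim 3.x internals would need separate changes.

```python
# gensim 3.x parameter names and their gensim 4.x equivalents.
RENAMED = {"size": "vector_size", "iter": "epochs"}

def to_gensim4_params(params):
    """Return a copy of params with gensim 3.x keys renamed for gensim 4.x."""
    return {RENAMED.get(key, key): value for key, value in params.items()}

paras = {"size": 100, "window": 5, "min_count": 5, "iter": 5, "sg": 1}
print(to_gensim4_params(paras))
# {'vector_size': 100, 'window': 5, 'min_count': 5, 'epochs': 5, 'sg': 1}
```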

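## Corpus path and model save path
- Use forward slashes ("/") to separate the components of a file or directory path
- Enclose the path in ASCII (straight) quotation marks

- Corpus path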

In [2]:
corpus_path = "01TECCL_V1.1_RAW"

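- Save path for the word-vector model
- If the extension is "bin" or "txt", the vectors are saved in the common format written by Google's original word2vec tool
- If the extension is "model" or anything else, the model is saved in Gensim's own format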

In [3]:
saved_path = "teccl.txt"
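
The extension rule above can be summarised in a few lines. This is an illustrative sketch only; the function name `describe_save_format` is hypothetical, and it mirrors the `endswith` checks in the training code later in this notebook rather than calling gensim.

```python
def describe_save_format(path):
    """Return which output format a given save path would select."""
    if path.endswith(".bin"):
        return "word2vec binary"
    if path.endswith(".txt"):
        return "word2vec text"
    # Any other extension, including ".model", falls through to Gensim's own format.
    return "gensim native"

for candidate in ("teccl.txt", "teccl.bin", "teccl.model"):
    print(candidate, "->", describe_save_format(candidate))
```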

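## Corpus preprocessing settings


- Split into sentences: True for yes, False for no
- If the corpus already has one sentence per line, False is usually fine; otherwise decide whether each paragraph should be split into sentences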

In [4]:
sent_tokenize = True

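- Tokenize into words: True for yes, False for no
- If every word and punctuation mark in the corpus is already separated by spaces, choose False; otherwise decide whether each sentence should be tokenized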

In [5]:
word_tokenize = True

- Convert everything to lowercase: True for yes, False for no

In [6]:
lower_case = True

- Remove stopwords: True for yes, False for no

In [7]:
remove_stopwords = False

- The default, None, uses NLTK's English stopword list
- To use a different list, supply your own, for example:
```python
stopwords = ['is','a','the','an']
```

In [8]:
stopwords = None
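
To see how the four switches interact, here is a simplified sketch of the preprocessing pipeline. It is an assumption-laden stand-in: it replaces NLTK's `sent_tokenize` and `word_tokenize` with rough regular expressions so that it runs without downloading any NLTK data, and it treats `stopwords=None` as an empty list instead of loading NLTK's default. The function name `simple_preprocess` is hypothetical.

```python
import re

def simple_preprocess(line, sent_tokenize=True, word_tokenize=True,
                      lower_case=True, remove_stopwords=False, stopwords=None):
    """Yield one token list per sentence, mimicking the four switches above."""
    stop = set(stopwords or [])
    # Rough stand-in for NLTK sentence splitting: break after ., ! or ? plus whitespace.
    sents = re.split(r"(?<=[.!?])\s+", line.strip()) if sent_tokenize else [line]
    for sent in sents:
        # Rough stand-in for NLTK word tokenization: words and single punctuation marks.
        words = re.findall(r"\w+|[^\w\s]", sent) if word_tokenize else sent.split()
        if lower_case:
            words = [w.lower() for w in words]
        if remove_stopwords:
            words = [w for w in words if w not in stop]
        yield words

text = "I like English. It is a useful language!"
print(list(simple_preprocess(text)))
# [['i', 'like', 'english', '.'], ['it', 'is', 'a', 'useful', 'language', '!']]
print(list(simple_preprocess(text, remove_stopwords=True,
                             stopwords=['is', 'a', 'the', 'an'])))
# [['i', 'like', 'english', '.'], ['it', 'useful', 'language', '!']]
```

Real NLTK tokenizers handle abbreviations, contractions, and quotation marks far better than these patterns, so the output on actual learner essays will differ.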

# Training word vectors

In [9]:
import os,re
import nltk
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim.models import utils_any2vec
import numpy as np

In [10]:
class Corpus(object):    
    def __init__(self,corp_dir = None,suffixes =None,**kwargs):
        self.fdir = corp_dir
        suffixes = [suffixes] if isinstance(suffixes,str) else suffixes
        self.suffixes = suffixes if suffixes else ['.txt']
        self.files = self._get_files(self.fdir)
        self.kwargs = kwargs
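        # Load a stopword list only when stopword removal is requested.
        if kwargs['remove_stopwords']:
            if not kwargs['stopwords']:
                from nltk.corpus import stopwords
                try:
                    self.stopwords = set(stopwords.words('english'))
                except LookupError:
                    # The NLTK stopword data is missing: download it once, then retry.
                    nltk.download('stopwords')
                    self.stopwords = set(stopwords.words('english'))
            else:
                # A set makes the per-word membership test in preprocess() fast.
                self.stopwords = set(kwargs['stopwords'])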
        
    def _get_files(self,path):
        files = []
        for f in os.listdir(path):
            fpath = os.path.join(path,f)
            if os.path.isdir(fpath): continue            
            if self._check_file_type(f):
                files.append(fpath)
        logging.info('%d file(s) loaded!'%len(files))
        return files
                    
    def _check_file_type(self,f):
        for suffix in self.suffixes:
            if f.endswith(suffix):
                return True
        return False

    def __iter__(self):
        for f in self.files:
            # Use a context manager so each file is closed after it is read.
            with open(f,encoding='utf-8') as fin:
                for line in fin:
                    line = line.strip()
                    if not line: continue
                    for words in self.preprocess(line):
                        yield words
                    
    def preprocess(self,line):
        sents = nltk.tokenize.sent_tokenize(line) if self.kwargs['sent_tokenize'] else [line]
        for sent in sents:
            words = nltk.tokenize.word_tokenize(sent) if self.kwargs['word_tokenize'] else sent.split()
            out_words = []
            for w in words:
                if not w.strip(): continue  
                if self.kwargs['lower_case']:
                    w = w.lower()
                if self.kwargs['remove_stopwords']:
                    if w in self.stopwords:
                        continue
                out_words.append(w)
            
            yield out_words

def gensim2wordvec(model,w2v_path):        
    dels = [w for w in model.wv.vocab if ' 'in w]
    for w in dels: del model.wv.vocab[w]
    vectors = []
    i = 0
    for w in model.wv.vocab:
        vectors.append(model.wv[w])
        model.wv.vocab[w].index = i
        i += 1
    vectors = np.array(vectors)
    binary = w2v_path.endswith('.bin')
    utils_any2vec._save_word2vec_format(w2v_path, model.wv.vocab,vectors,binary=binary)

def train_w2vmodel(sentences,save2path,**kwargs):     
    model = gensim.models.Word2Vec(**kwargs)   
    model.build_vocab(sentences)
    model.train(sentences,total_examples=model.corpus_count, epochs=model.epochs)
    if save2path.endswith('.txt') or save2path.endswith('.bin'):
        gensim2wordvec(model,save2path)
    else:
        model.save(save2path,ignore=[])
    
    return model
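
`train_w2vmodel` passes the same `sentences` object to `build_vocab` and then to `train`, which itself reads the corpus once per epoch. This is why `Corpus` is a class with an `__iter__` method and not a plain generator: a generator is exhausted after one pass. The toy sketch below illustrates the difference; the names `RestartableSentences` and `one_shot` are hypothetical and the two-line "corpus" is invented for the demonstration.

```python
class RestartableSentences:
    """An iterable that starts from the beginning every time it is iterated."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

def one_shot(lines):
    """A plain generator: it can be consumed only once."""
    for line in lines:
        yield line.split()

lines = ["a b c", "d e"]

corpus = RestartableSentences(lines)
first_pass = sum(1 for _ in corpus)
second_pass = sum(1 for _ in corpus)

gen = one_shot(lines)
gen_first = sum(1 for _ in gen)
gen_second = sum(1 for _ in gen)

print(first_pass, second_pass, gen_first, gen_second)  # 2 2 2 0
```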

In [11]:
mycorpus = Corpus(corpus_path, suffixes=['.txt'],
                  sent_tokenize=sent_tokenize,
                  word_tokenize=word_tokenize,
                  lower_case=lower_case,
                  remove_stopwords=remove_stopwords,
                  stopwords=stopwords)

2019-06-26 10:13:20,205 : INFO : 9864 file(s) loaded!


- <font color="red" size=3>**Start word-vector training**</font>
- Once the parameters are set, start training as follows:
- (1) In this Notebook's menu bar, click Cell
- (2) In the Cell drop-down menu, click Run All

In [12]:
model = train_w2vmodel(mycorpus,saved_path,**paras)

2019-06-26 10:13:20,237 : INFO : collecting all words and their counts
2019-06-26 10:13:20,252 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-06-26 10:13:23,690 : INFO : PROGRESS: at sentence #10000, processed 208368 words, keeping 13300 word types
2019-06-26 10:13:27,018 : INFO : PROGRESS: at sentence #20000, processed 419787 words, keeping 20400 word types
2019-06-26 10:13:30,769 : INFO : PROGRESS: at sentence #30000, processed 626892 words, keeping 26334 word types
2019-06-26 10:13:34,019 : INFO : PROGRESS: at sentence #40000, processed 832074 words, keeping 31523 word types
2019-06-26 10:13:37,176 : INFO : PROGRESS: at sentence #50000, processed 1039883 words, keeping 36442 word types
2019-06-26 10:13:40,489 : INFO : PROGRESS: at sentence #60000, processed 1263703 words, keeping 41787 word types
2019-06-26 10:13:43,927 : INFO : PROGRESS: at sentence #70000, processed 1484104 words, keeping 46508 word types
2019-06-26 10:13:47,521 : INFO : PROGRESS: 

2019-06-26 10:14:43,907 : INFO : EPOCH 3 - PROGRESS: at 45.82% examples, 63540 words/s, in_qsize 0, out_qsize 0
2019-06-26 10:14:44,907 : INFO : EPOCH 3 - PROGRESS: at 50.72% examples, 63829 words/s, in_qsize 0, out_qsize 0
2019-06-26 10:14:45,938 : INFO : EPOCH 3 - PROGRESS: at 55.87% examples, 63974 words/s, in_qsize 0, out_qsize 0
2019-06-26 10:14:47,016 : INFO : EPOCH 3 - PROGRESS: at 60.75% examples, 64220 words/s, in_qsize 0, out_qsize 0
2019-06-26 10:14:48,032 : INFO : EPOCH 3 - PROGRESS: at 65.90% examples, 64270 words/s, in_qsize 0, out_qsize 0
2019-06-26 10:14:49,048 : INFO : EPOCH 3 - PROGRESS: at 71.04% examples, 64377 words/s, in_qsize 0, out_qsize 0
2019-06-26 10:14:50,110 : INFO : EPOCH 3 - PROGRESS: at 75.97% examples, 64681 words/s, in_qsize 0, out_qsize 0
2019-06-26 10:14:51,220 : INFO : EPOCH 3 - PROGRESS: at 81.34% examples, 64454 words/s, in_qsize 0, out_qsize 0
2019-06-26 10:14:52,220 : INFO : EPOCH 3 - PROGRESS: at 86.15% examples, 64530 words/s, in_qsize 0, out_

# Next steps
- Loading the word vectors
- Querying similar words
- Computing similarity
- Analogical reasoning
- Visualization
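
As a preview of the first three items, the sketch below reads vectors in the word2vec text format (a `vocab_size dim` header line followed by one `word v1 ... vN` line per word) and ranks neighbours by cosine similarity. It is a pure-Python illustration under stated assumptions: the three two-dimensional vectors are invented toy data, not output from the TECCL model, and the function names are hypothetical. In practice gensim's `KeyedVectors.load_word2vec_format` and `most_similar` cover the same ground.

```python
import io
import math

def load_word2vec_text(handle):
    """Parse word2vec text format into a {word: vector} dictionary."""
    _vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(vectors, word, topn=3):
    """Return the topn (word, similarity) pairs closest to the given word."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy data standing in for a saved file such as "teccl.txt".
toy_file = io.StringIO("3 2\nteacher 1.0 0.0\nstudent 0.9 0.1\nbanana 0.0 1.0\n")
vectors = load_word2vec_text(toy_file)
print(most_similar(vectors, "teacher", topn=1))
```

With a real file, replace the `io.StringIO` object with `open("teccl.txt", encoding="utf-8")`.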