# Facebook FastText

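fastText is an open-source word-embedding and text-classification tool from Facebook. It offers little academic novelty; its strengths are a simple model and very fast training. A quick trial shows that it is pleasant to use and that its results are good enough for production use.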

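In short, fastText maps every word in a document to a vector through a lookup table, averages those vectors, and feeds the average directly into a linear classifier to get the class prediction. It closely resembles the deep averaging network (DAN, shown in the figure below) from ACL 2015; fastText is a simplified version that drops the intermediate hidden layers. The paper's point is that for some simple classification tasks, a complex network architecture is not needed to reach comparable results.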
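
To make the architecture concrete, here is a minimal pure-Python sketch of the forward pass described above: look up one vector per word, average them, apply a linear layer, and take a softmax. The vocabulary, the two-dimensional embeddings, and the classifier weights are made-up toy values for illustration only; nothing here is taken from fastText's own code.

```python
import math

# Toy lookup table: word -> 2-dimensional embedding (hypothetical values)
EMBEDDINGS = {
    "warship": [1.0, 0.0],
    "exercise": [0.8, 0.2],
    "match": [0.0, 1.0],
    "goal": [0.1, 0.9],
}

# Linear classifier: one weight row per class (hypothetical values)
CLASS_NAMES = ["military", "sports"]
WEIGHTS = [[2.0, -1.0], [-1.0, 2.0]]


def average_embedding(tokens):
    """Average the embeddings of the known tokens (unknown tokens are skipped)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * len(next(iter(EMBEDDINGS.values())))
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens):
    """Return (class name, probability) for the most likely class."""
    doc_vector = average_embedding(tokens)
    scores = [sum(w * x for w, x in zip(row, doc_vector)) for row in WEIGHTS]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs[best]


print(classify(["warship", "exercise"]))
print(classify(["match", "goal"]))
```

There is no hidden layer between the averaged vector and the class scores, which is the simplification relative to DAN mentioned above.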

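The fastText paper describes two tricks:

- hierarchical softmax
    - When the number of classes is large, a Huffman coding tree is built to speed up the softmax layer, the same trick used earlier in word2vec.
- N-gram features
    - Using only unigrams discards word-order information, so N-gram features are added to compensate. Hashing is used to reduce the storage needed for the N-grams.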
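
Both tricks can be sketched in a few lines of plain Python. The first function builds a Huffman tree over class frequencies and reports each class's code length, which is the number of binary decisions a hierarchical softmax would need for that class. The second hashes each bigram into a fixed number of buckets. The class counts, the FNV-1a hash, and the bucket count are illustrative assumptions, not a description of fastText's internals.

```python
import heapq
from itertools import count


def huffman_code_lengths(class_counts):
    """Return {label: code length} for a Huffman tree built from class frequencies."""
    if len(class_counts) == 1:
        return {label: 1 for label in class_counts}
    tie = count()  # tie-breaker so the heap never has to compare the dicts
    heap = [(freq, next(tie), {label: 0}) for label, freq in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq_a, _, depths_a = heapq.heappop(heap)
        freq_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper
        merged = {label: d + 1 for label, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (freq_a + freq_b, next(tie), merged))
    return heap[0][2]


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of text."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) % (1 << 32)
    return h


def hashed_bigram_ids(tokens, num_buckets):
    """Map each adjacent token pair to a bucket id in [0, num_buckets)."""
    return [fnv1a_32(a + " " + b) % num_buckets for a, b in zip(tokens, tokens[1:])]


# Hypothetical class frequencies: frequent classes get shorter codes
lengths = huffman_code_lengths({"sports": 50, "car": 25, "military": 15, "technology": 10})
print(lengths)

# Bigrams keep local word order; the bucket count caps the size of the feature table
print(hashed_bigram_ids(["china", "navy", "exercise"], num_buckets=1000))
```

With hashing, the N-gram embedding table has a fixed number of rows regardless of how many distinct N-grams occur, at the cost of occasional collisions between unrelated N-grams.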

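## fastText supervised learning (classification) example

The package that provides the fastText Python interface can be installed with `pip install fasttext`.

For text classification, fastText expects the text to be stored in the following format: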
```
__label__2 , birchas chaim , yeshiva birchas chaim is a orthodox jewish mesivta high school in lakewood township new jersey . it was founded by rabbi shmuel zalmen stein in 2001 after his father rabbi chaim stein asked him to open a branch of telshe yeshiva in lakewood . as of the 2009-10 school year the school had an enrollment of 76 students and 6 . 6 classroom teachers ( on a fte basis ) for a student–teacher ratio of 11 . 5 1 . 
__label__6 , motor torpedo boat pt-41 , motor torpedo boat pt-41 was a pt-20-class motor torpedo boat of the united states navy built by the electric launch company of bayonne new jersey . the boat was laid down as motor boat submarine chaser ptc-21 but was reclassified as pt-41 prior to its launch on 8 july 1941 and was completed on 23 july 1941 . 
__label__11 , passiflora picturata , passiflora picturata is a species of passion flower in the passifloraceae family . 
__label__13 , naya din nai raat , naya din nai raat is a 1974 bollywood drama film directed by a . bhimsingh . the film is famous as sanjeev kumar reprised the nine-role epic performance by sivaji ganesan in navarathri ( 1964 ) which was also previously reprised by akkineni nageswara rao in navarathri ( telugu 1966 ) . this film had enhanced his status and reputation as an actor in hindi cinema . 
```
Here the leading `__label__` is a prefix, which can be customized; whatever follows `__label__` is the class.
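
As a small illustration of this format, the sketch below builds and parses such lines. The helper names `to_fasttext_line` and `parse_fasttext_line` are hypothetical and exist only for this example; they are not part of the fastText package.

```python
LABEL_PREFIX = "__label__"


def to_fasttext_line(label, tokens, prefix=LABEL_PREFIX):
    """Build one training line: the prefixed label followed by space-separated tokens."""
    return prefix + str(label) + " " + " ".join(tokens)


def parse_fasttext_line(line, prefix=LABEL_PREFIX):
    """Split a training line into (labels, tokens); a line may carry several labels."""
    labels, tokens = [], []
    for field in line.split():
        if field.startswith(prefix):
            labels.append(field[len(prefix):])
        else:
            tokens.append(field)
    return labels, tokens


line = to_fasttext_line(4, ["军舰", "演习"])
print(line)
print(parse_fasttext_line(line))
print(parse_fasttext_line("__label__11 , passiflora picturata"))
```

Note that in the sample data above, the `,` after the label is separated by spaces, so it is simply read as one more token of the text.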

We define our 5 categories as follows:
```
1:technology
2:car
3:entertainment
4:military
5:sports
```

### Generating the text format

In [2]:
import jieba
import pandas as pd
import random

# Map each category name to the integer id written after the __label__ prefix
cate_dic = {'technology':1, 'car':2, 'entertainment':3, 'military':4, 'sports':5}

df_technology = pd.read_csv("./origin_data/technology_news.csv", encoding='utf-8')
df_technology = df_technology.dropna()

df_car = pd.read_csv("./origin_data/car_news.csv", encoding='utf-8')
df_car = df_car.dropna()

df_entertainment = pd.read_csv("./origin_data/entertainment_news.csv", encoding='utf-8')
df_entertainment = df_entertainment.dropna()

df_military = pd.read_csv("./origin_data/military_news.csv", encoding='utf-8')
df_military = df_military.dropna()

df_sports = pd.read_csv("./origin_data/sports_news.csv", encoding='utf-8')
df_sports = df_sports.dropna()

# Keep at most 20,000 articles per category (technology and car skip the first 1,000 rows)
technology = df_technology.content.values.tolist()[1000:21000]
car = df_car.content.values.tolist()[1000:21000]
entertainment = df_entertainment.content.values.tolist()[:20000]
military = df_military.content.values.tolist()[:20000]
sports = df_sports.content.values.tolist()[:20000]

In [3]:
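# quoting=3 is csv.QUOTE_NONE, so quote characters in the stopword file are read literally
stopwords=pd.read_csv("origin_data/stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
# A set makes the stopword membership test below a constant-time lookup per token
stopwords=set(stopwords['stopword'].values)

def preprocess_text(content_lines, sentences, category):
    """Segment each line with jieba, filter the tokens, and append a fastText-formatted line to sentences."""
    for line in content_lines:
        try:
            segs=jieba.lcut(line)
            # Drop single-character tokens, then drop stopwords
            segs = [seg for seg in segs if len(seg) > 1]
            segs = [seg for seg in segs if seg not in stopwords]
            sentences.append("__label__"+str(category)+" , "+" ".join(segs))
        except Exception as e:
            # Print the line that could not be processed and move on
            print(line)
            continue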
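
To see what the two filters in `preprocess_text` do to a single sentence, here is a standalone sketch that skips jieba and starts from an already-segmented token list. The token list and the one-word stopword set are hypothetical; real jieba output and the real stopword file may differ.

```python
def filter_tokens(tokens, stopwords):
    """Keep tokens longer than one character that are not stopwords."""
    return [tok for tok in tokens if len(tok) > 1 and tok not in stopwords]


# Hypothetical, already-segmented input (roughly what jieba.lcut might return)
tokens = ["这", "是", "中国", "第一次", "军舰", "演习", "的", "报道"]
toy_stopwords = {"报道"}

print(filter_tokens(tokens, toy_stopwords))
```

One consequence is that single-character tokens never appear in the training file, so a text passed to the model at prediction time is best segmented and filtered the same way.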

In [4]:
# Generate the training data
sentences = []

preprocess_text(technology, sentences, cate_dic['technology'])
preprocess_text(car, sentences, cate_dic['car'])
preprocess_text(entertainment, sentences, cate_dic['entertainment'])
preprocess_text(military, sentences, cate_dic['military'])
preprocess_text(sports, sentences, cate_dic['sports'])

random.shuffle(sentences)

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/j1/ls86yccj7l5dyscbpmp85ngw0000gn/T/jieba.cache
Loading model cost 1.007 seconds.
Prefix dict has been built succesfully.


In [5]:
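print("writing data to fasttext format...")
# The with-block closes the file, so every buffered line is flushed to disk before training reads it
with open('train_data.txt',encoding="utf-8",mode='w') as out:
    for sentence in sentences:
        out.write(sentence + "\n")
print("done!")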

writing data to fasttext format...
done!


### Training a model with fastText

In [6]:
import fasttext
# label sets the label prefix; '__label__' is also the default value
classifier = fasttext.train_supervised('train_data.txt', label='__label__')
classifier.save_model("fasttext.model")
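
The training call above relies on the default settings. As a sketch of how the two tricks from the paper map onto `train_supervised` parameters, the block below collects `wordNgrams`, `bucket`, and `loss` in one place. The specific values are hypothetical, chosen for illustration and not tuned on this dataset, and the helper name `train_with_tricks` is made up for this example.

```python
# Hypothetical settings chosen for illustration, not tuned values
TRICK_PARAMS = {
    "wordNgrams": 2,   # add bigram features on top of unigrams
    "bucket": 200000,  # number of hash buckets shared by the n-grams
    "loss": "hs",      # hierarchical softmax instead of the full softmax
    "epoch": 25,
    "lr": 0.5,
}


def train_with_tricks(train_path, **overrides):
    """Train a supervised model with the settings above; overrides win on conflict."""
    import fasttext  # imported lazily so the settings can be inspected without it

    params = {**TRICK_PARAMS, **overrides}
    return fasttext.train_supervised(input=train_path, **params)
```

A call such as `train_with_tricks('train_data.txt', epoch=10)` would train on the file written earlier. With only five classes, hierarchical softmax is unlikely to bring a noticeable speedup; it matters mainly when the number of classes is large.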

### Evaluating the model

In [10]:
model = fasttext.load_model("fasttext.model")




In [11]:
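# Caution: this evaluates on the training file itself, so the scores describe fit to the training data, not generalization
result = model.test('train_data.txt')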

In [12]:
result

(87589, 0.973969334048796, 0.973969334048796)
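
The tuple holds the number of examples, precision at 1, and recall at 1. The two scores are identical here, and the sketch below suggests why: precision divides the correct predictions by the number of predicted labels, recall divides them by the number of gold labels, and with one gold label and one prediction per example both denominators equal N. This is a simplified re-implementation on toy data under that assumed definition, not fastText's own evaluation code.

```python
def precision_recall_at_k(gold_labels, predicted_labels):
    """Compute (N, P@k, R@k) from per-example gold and predicted label lists."""
    n_examples = len(gold_labels)
    n_correct = n_predicted = n_gold = 0
    for gold, predicted in zip(gold_labels, predicted_labels):
        n_correct += len(set(gold) & set(predicted))
        n_predicted += len(predicted)
        n_gold += len(gold)
    return n_examples, n_correct / n_predicted, n_correct / n_gold


# Toy data: four single-label examples
gold = [["4"], ["5"], ["1"], ["2"]]
top1 = [["4"], ["5"], ["3"], ["2"]]
top2 = [["4", "1"], ["5", "4"], ["3", "1"], ["2", "5"]]

print(precision_recall_at_k(gold, top1))  # (4, 0.75, 0.75)
print(precision_recall_at_k(gold, top2))  # (4, 0.5, 1.0)
```

With k=2 the two scores diverge: each example now has two predictions but still one gold label, so precision can be at most 0.5 while recall can reach 1.0.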

In [14]:
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('train_data.txt'))

N	87589
P@1	0.974
R@1	0.974
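
The scores above come from running `model.test` on `train_data.txt`, the same file the model was trained on, so they say little about unseen text. A common remedy is to hold out part of the data for validation. The sketch below shows one way to split the lines; the 10% fraction, the seed, and the helper names are assumptions made for this example.

```python
import random


def split_lines(lines, valid_fraction=0.1, seed=42):
    """Shuffle a copy of lines and split it into (train, valid) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def write_lines(path, lines):
    """Write one example per line, UTF-8 encoded."""
    with open(path, mode="w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


# Toy demonstration with 20 fake examples
demo_train, demo_valid = split_lines(["__label__1 , doc%d" % i for i in range(20)])
print(len(demo_train), len(demo_valid))
```

Applied to this notebook, one would call `split_lines(sentences)`, write the two lists to files such as `train_split.txt` and `valid_split.txt` (hypothetical names), train on the first with `fasttext.train_supervised`, and pass the second to `model.test`.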


### Making predictions

In [19]:
label_to_cate = {"__label__1":'technology', "__label__2":'car',
                 "__label__3":'entertainment', "__label__4":'military', "__label__5":'sports'}

#texts = ['中新网 日电 2018 预赛 亚洲区 强赛 中国队 韩国队 较量 比赛 上半场 分钟 主场 作战 中国队 率先 打破 场上 僵局 利用 角球 机会 大宝 前点 攻门 得手 中国队 领先']
texts = ['这 是 中国 第 一 次 军舰 演习']
labels = model.predict(texts)
print(labels)
print(label_to_cate[str(labels[0][0][0])])

([['__label__4']], array([[1.00001001]]))
military


In [None]:
"这是中国第一次军舰演习"

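### Top-K predictions

Going by the cell numbers, this cell (In [18]) was run before the prediction cell above (In [19]), so `texts` presumably held a different input at that point, which would explain why the top label here is `__label__5` rather than `__label__4`.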

In [18]:
labels = model.predict(texts, k=3)
print(labels)

([['__label__5', '__label__4', '__label__1']], array([[9.99993920e-01, 2.34443942e-05, 1.20390805e-05]]))
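
The raw output is a pair: a list of label lists and a matching array of probabilities, one row per input text. A small helper can turn it into readable (category, probability) pairs. The sketch below is hypothetical; it uses plain lists in place of the NumPy array and hard-codes the values printed above so that it runs without a trained model.

```python
label_to_cate = {"__label__1": "technology", "__label__2": "car",
                 "__label__3": "entertainment", "__label__4": "military",
                 "__label__5": "sports"}


def decode_predictions(prediction, label_to_cate):
    """Turn a (labels, probabilities) pair into per-text lists of (category, probability)."""
    labels_per_text, probs_per_text = prediction
    return [
        [(label_to_cate[label], float(prob)) for label, prob in zip(labels, probs)]
        for labels, probs in zip(labels_per_text, probs_per_text)
    ]


# The top-3 output shown above, written with plain lists
raw = ([["__label__5", "__label__4", "__label__1"]],
       [[9.99993920e-01, 2.34443942e-05, 1.20390805e-05]])

for category, prob in decode_predictions(raw, label_to_cate)[0]:
    print(category, prob)
```

In the notebook, `decode_predictions(model.predict(texts, k=3), label_to_cate)` would be the corresponding call.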
