# Language Detection with Naive Bayes

Let's build a language-detection classifier with naive Bayes. As it turns out, naive Bayes achieves quite good accuracy on this task.

<img src='./images/detector.png'/>

<img src='./images/detector2.png'/>

Machine learning algorithms depend on data to perform well.
Here we use Twitter data covering six languages: English, French, German, Spanish, Italian, and Dutch.

In [1]:
in_f = open('./files/data.csv')
lines = in_f.readlines()
in_f.close()

In [3]:
dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]
dataset[:5]

[('1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme',
  'nl'),
 ('1 millón de afectados ante las inundaciones en sri lanka unicef está distribuyendo ayuda de emergencia srilanka',
  'es'),
 ('1 millón de fans en facebook antes del 14 de febrero y paty miki dani y berta se tiran en paracaídas qué harías tú porunmillondefans',
  'es'),
 ('1 satellite galileo sottoposto ai test presso lesaestec nl galileo navigation space in inglese',
  'it'),
 ('10 der welt sind bei', 'de')]
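
The slicing above relies on every line ending in a one-character separator followed by a two-letter language code. As a rough sketch under that assumption (the comma separator, the sample lines, and the `parse_line` helper are illustrative, not taken from the data file itself), the same split can be written with `rsplit`, which also makes it easy to count the samples per language:

```python
from collections import Counter

def parse_line(line):
    """Split a raw line such as 'some text,xx' into (text, language_code)."""
    # maxsplit=1 splits only on the LAST comma, so any commas inside
    # the tweet text stay part of the text.
    text, label = line.strip().rsplit(',', 1)
    return text, label

# Hypothetical raw lines, modelled on the samples printed above
raw_lines = [
    '10 der welt sind bei,de\n',
    '1 december wereld aids dag,nl\n',
    '1 millón de afectados,es\n',
    '1 millón de fans en facebook,es\n',
]
pairs = [parse_line(line) for line in raw_lines]
label_counts = Counter(label for _, label in pairs)
print(pairs[0])
print(label_counts)
```

Checking the label counts before training is a cheap way to spot a badly imbalanced or mis-parsed dataset.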

Split the data into a training set and a test set.

In [12]:
from sklearn.model_selection import train_test_split
x, y = zip(*dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

In [13]:
len(x_train)

6799
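
`train_test_split` shuffles the samples and, when no `test_size` is given, holds out 25% of them for testing. The idea can be sketched in plain Python. `simple_split` below is a hypothetical helper for illustration only, and its shuffling will not reproduce scikit-learn's exact ordering:

```python
import math
import random

def simple_split(samples, test_fraction=0.25, seed=1):
    """Shuffle samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = simple_split(range(100))
print(len(train), len(test))  # 75 25
```

Fixing the seed plays the same role as `random_state=1` above: rerunning the notebook gives the same split.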

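A model can only perform well if the data quality is good.

We use regular expressions to strip out noise: URLs, @mentions, and #hashtags.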

In [15]:
import re

def remove_noise(document):
    # Raw strings keep the regex escapes (\S, \w) from being read as string escapes
    noise_pattern = re.compile('|'.join([r'http\S+', r'@\w+', r'#\w+']))
    clean_text = re.sub(noise_pattern, '', document)
    return clean_text.strip()

remove_noise('Trump images are now more popular than cat gifs. @trump #trends http://www.trumptrends.html')

'Trump images are now more popular than cat gifs.'
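
To see what each alternative in the pattern matches, the three sub-patterns can be tried one at a time. This is only a small illustration; the `find_noise` helper and its labels are made up for this example:

```python
import re

# The same three sub-patterns, written as raw strings
NOISE_PATTERNS = {
    'url': r'http\S+',     # "http" followed by any run of non-whitespace
    'mention': r'@\w+',    # "@" followed by word characters
    'hashtag': r'#\w+',    # "#" followed by word characters
}

def find_noise(document):
    """Return the substrings that each noise pattern would remove."""
    return {name: re.findall(pattern, document)
            for name, pattern in NOISE_PATTERNS.items()}

tweet = 'Trump images are now more popular than cat gifs. @trump #trends http://www.trumptrends.html'
found = find_noise(tweet)
print(found)
```

Note that `re.sub` deletes only the matched tokens, so the spaces around a token removed from the middle of a tweet remain; the final `strip()` trims only the two ends.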

The next step is to extract useful features from the denoised data.

We extract count features over character 1-grams and 2-grams.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(lowercase=True,  # lowercase the text
                     analyzer='char_wb',  # tokenise by character ngrams
                     ngram_range=(1,2),  # use ngrams of size 1 and 2
                     max_features=1000, # keep the most common 1000 grams
                     preprocessor=remove_noise  # function applied to each document before counting
                     )
vec.fit(x_train)

CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 2),
        preprocessor=<function remove_noise at 0x10a0fef28>,
        stop_words=None, strip_accents=None,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
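
With `analyzer='char_wb'`, n-grams are built from characters inside word boundaries, and each word is padded with a space on both sides. The plain-Python function below approximates that behaviour for illustration only (`char_wb_ngrams` is a hypothetical helper, not part of scikit-learn):

```python
from collections import Counter

def char_wb_ngrams(text, min_n=1, max_n=2):
    """Approximate character n-grams within word boundaries."""
    grams = []
    for word in text.lower().split():
        padded = ' ' + word + ' '  # mark the word boundaries with spaces
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams

grams = char_wb_ngrams('de')
print(grams)  # [' ', 'd', 'e', ' ', ' d', 'de', 'e ']

# Counting n-grams over a phrase gives the kind of features the vectorizer builds
counts = Counter(char_wb_ngrams('der die das'))
print(counts.most_common(3))
```

Boundary-aware 2-grams such as `' d'` (a word starting with "d") carry information that a bare character count would miss, which is part of why short character n-grams separate languages well.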

In [18]:
def get_features(x):
    # Return the sparse count matrix; without `return` the function yields None
    return vec.transform(x)

Build the naive Bayes classifier.

In [24]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
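
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter, and `fit_prior=True` means class priors are estimated from the training labels. To make the underlying computation concrete, here is a minimal from-scratch sketch of multinomial naive Bayes. It is a simplified illustration rather than scikit-learn's implementation, and `train_nb` and `predict_nb` are hypothetical names:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a tiny multinomial naive Bayes model.

    docs is a list of feature lists (e.g. characters); labels are class names.
    """
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in zip(docs, labels):
        feature_counts[label].update(features)
        vocab.update(features)

    model = {}
    for label, n_docs in class_counts.items():
        # Smoothed denominator: class total plus alpha for every vocabulary entry
        total = sum(feature_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_likelihood = {f: math.log((feature_counts[label][f] + alpha) / total)
                          for f in vocab}
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, features):
    """Return the class with the highest log posterior score."""
    def log_score(label):
        log_prior, log_likelihood = model[label]
        # Features never seen in training are skipped
        return log_prior + sum(log_likelihood[f] for f in features
                               if f in log_likelihood)
    return max(model, key=log_score)

# Toy data: class 'x' is mostly 'a', class 'y' is mostly 'b'
toy_model = train_nb([list('aaab'), list('bbba')], ['x', 'y'])
print(predict_nb(toy_model, list('aab')))  # x
print(predict_nb(toy_model, list('abb')))  # y
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.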

Evaluate accuracy.

In [29]:
classifier.score(vec.transform(x_test), y_test)

0.9770621967357741
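
For a classifier, `score` returns the mean accuracy: the fraction of test samples whose predicted label matches the true label. A single number can hide which languages get confused with each other, so counting the wrong (true, predicted) pairs is a useful complement. A minimal sketch with made-up labels (`accuracy` and `confusions` are illustrative helpers):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusions(y_true, y_pred):
    """Count each (true, predicted) pair that was misclassified."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

# Hypothetical labels, not results from the model above
y_true = ['en', 'de', 'nl', 'nl', 'es', 'it', 'es', 'fr']
y_pred = ['en', 'de', 'de', 'nl', 'es', 'es', 'es', 'fr']
print(accuracy(y_true, y_pred))    # 0.75
print(confusions(y_true, y_pred))
```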

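Training on only 6,799 tweets gives a classifier with 97.7% accuracy on the held-out test set, which is a solid result.  
With a larger corpus, accuracy would likely improve further.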

## Standardizing the code

In [39]:
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [42]:
class LanguageDetector(object):
    
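    def __init__(self, classifier=None):
        # Create a fresh MultinomialNB per instance instead of sharing one default object
        self.classifier = classifier if classifier is not None else MultinomialNB()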
        self.vectorizer = CountVectorizer(ngram_range=(1,2), max_features=1000, preprocessor=self._remove_noise)
        
    def _remove_noise(self, document):
        noise_pattern = re.compile('|'.join([r'http\S+', r'@\w+', r'#\w+']))
        clean_text = re.sub(noise_pattern, '', document)
        return clean_text
    
    def features(self, X):
        return self.vectorizer.transform(X)
    
    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)
    
    def predict(self, x):
        return self.classifier.predict(self.features([x]))
    
    def score(self, X, y ):
        return self.classifier.score(self.features(X), y)

In [58]:
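with open('./files/data.csv', 'r') as f:
    # Build (text, label) pairs: the last two characters are the language code,
    # preceded by a one-character separator
    dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in f]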

zip

In [52]:
a = zip([1,2],[0,1])

In [60]:
m = (pow(i,2) for i in range(10))

In [61]:
next(m)  # Python 3 generators have no .next() method; use the built-in next()

0
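
In Python 3, `zip` objects and generator expressions are both lazy iterators: they produce values only when asked, and they are advanced with the built-in `next()` function rather than a `.next()` method. A short illustration (all variable names here are just for the example):

```python
pairs = zip([1, 2], [0, 1])       # nothing is computed yet
first_pair = next(pairs)          # (1, 0)
remaining_pairs = list(pairs)     # [(2, 1)]; the iterator is now exhausted

squares = (pow(i, 2) for i in range(10))
first_square = next(squares)      # 0
second_square = next(squares)     # 1

# Passing a default avoids StopIteration once an iterator runs out
after_end = next(iter([]), None)  # None

# zip(*pairs) "unzips" a list of pairs, as in x, y = zip(*dataset)
texts, labels = zip(*[('hola', 'es'), ('hallo', 'de')])
print(first_pair, remaining_pairs, first_square, second_square, texts, labels)
```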