# 20 NewsGroups - Classify by Scikit-learn

Experiments in classifying the 20 newsgroups dataset with scikit-learn.

### Project setup

In [1]:
import datetime
import numpy as np
from nltk.stem.snowball import EnglishStemmer
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


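### Fetching the dataset through the scikit-learn API
To begin with, the training set and the test set use the same data (both are subset='train'), just to see how the algorithms behave.

In addition, the headers, footers and quoted content of each post are removed so that they do not influence the results.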

In [4]:
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='train',
                                     remove=('headers', 'footers', 'quotes'))

train_texts = newsgroups_train['data']
train_labels = newsgroups_train['target']
test_texts = newsgroups_test['data']
test_labels = newsgroups_test['target']
print(len(train_texts), len(test_texts))

11314 11314


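### Wrapping the classifier run in a helper
#### Run the given classifier and print its timing and accuracy
The classifier argument receives the algorithm to run. The text is preprocessed with TfidfVectorizer.
TfidfVectorizer first turns the text into term counts (as CountVectorizer does) and then applies tf-idf term weighting, which reduces the influence of words that occur frequently but carry little meaning.

For a detailed explanation of text feature extraction, see: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction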
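
To make the relationship between the two vectorizers concrete, here is a minimal sketch on a made-up three-document corpus (the corpus is an assumption for illustration, not part of the experiment). It checks that CountVectorizer followed by TfidfTransformer yields the same matrix as TfidfVectorizer with default settings, and that a word present in every document ("the") is weighted lower than a rarer one ("hockey").

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, made up for illustration (not taken from 20 newsgroups).
docs = [
    "the car is fast",
    "the car is red",
    "the hockey game is on tonight",
]

# Two steps: raw term counts, then tf-idf re-weighting of those counts.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer performs both operations.
vectorizer = TfidfVectorizer()
one_step = vectorizer.fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("two-step result equals TfidfVectorizer:", same)

# In the third document both words occur once, but "the" appears in every
# document, so its inverse document frequency (and weight) is lower.
row = one_step[2].toarray().ravel()
vocab = vectorizer.vocabulary_
common_is_downweighted = bool(row[vocab["the"]] < row[vocab["hockey"]])
print("'the' weighted lower than 'hockey':", common_is_downweighted)
```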

In [6]:
def do_classify(tag, classifier, vectorizer=None):
    # Create the default vectorizer per call instead of sharing one instance
    # through a default argument.
    if vectorizer is None:
        vectorizer = TfidfVectorizer()
    text_clf = Pipeline([('tfidf', vectorizer),
                         (tag, classifier)])
    print(tag + " start...")
    start_time = datetime.datetime.now()
    text_clf = text_clf.fit(train_texts, train_labels)
    end_time = datetime.datetime.now()
    print("   training time: " + str(end_time - start_time))
    predicted = text_clf.predict(test_texts)
    end_time = datetime.datetime.now()
    # Measured from start_time, so this figure is training + prediction time.
    print("   classification time: " + str(end_time - start_time))
    print("   accuracy: ", np.mean(predicted == test_labels))
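
The helper above reports "classification time" as the time elapsed since training started. If the two phases are to be compared directly, they can be timed separately. The following is only a sketch: the function name timed_fit_predict and the toy data are hypothetical and are not part of the notebook.

```python
import time

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def timed_fit_predict(classifier, train_x, train_y, test_x, test_y):
    """Return (train_seconds, predict_seconds, accuracy) for one classifier."""
    # A fresh vectorizer per call, so no state is shared between runs.
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", classifier)])

    t0 = time.perf_counter()
    model.fit(train_x, train_y)
    t1 = time.perf_counter()
    predicted = model.predict(test_x)
    t2 = time.perf_counter()

    accuracy = float(np.mean(predicted == np.asarray(test_y)))
    return t1 - t0, t2 - t1, accuracy


# Toy data, made up for illustration.
toy_texts = ["fast red car", "car engine oil",
             "hockey game tonight", "ice hockey team"]
toy_labels = [0, 0, 1, 1]

train_s, predict_s, acc = timed_fit_predict(
    MultinomialNB(), toy_texts, toy_labels, toy_texts, toy_labels
)
print(f"training: {train_s:.4f}s, prediction: {predict_s:.4f}s, accuracy: {acc:.2f}")
```

With the real data, train_texts, train_labels, test_texts and test_labels would be passed in place of the toy lists.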

### Results of the Naive Bayes and K-Nearest Neighbors classifiers

In [7]:
do_classify("MultinomialNB", MultinomialNB())
do_classify("KNeighborsClassifier", KNeighborsClassifier())

MultinomialNB start...
   training time: 0:00:03.205695
   classification time: 0:00:05.598349
   accuracy:  0.8113841258617642
KNeighborsClassifier start...
   training time: 0:00:02.827368
   classification time: 0:00:18.535705
   accuracy:  0.372812444758706


KNN clearly performs poorly; used on its own, it is not a good fit for this task.
The NB result is acceptable (81.1%), but not great either.

### Results with a test set separate from the training set

In [9]:
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'))
test_texts = newsgroups_test['data']
test_labels = newsgroups_test['target']

do_classify("MultinomialNB", MultinomialNB())
do_classify("KNeighborsClassifier", KNeighborsClassifier())

MultinomialNB start...
   training time: 0:00:02.450610
   classification time: 0:00:03.787500
   accuracy:  0.6062134891131173
KNeighborsClassifier start...
   training time: 0:00:02.480272
   classification time: 0:00:10.907317
   accuracy:  0.07992565055762081


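As can be seen, results obtained by training and predicting on the same data can differ from what happens in a realistic setting: accuracy drops once the test set is separate.
This is probably because the classifiers overfit the training data, or because the dataset itself is not large enough, which leads to the unsatisfactory results.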
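
One way to look at this gap directly is to score the same fitted model on the data it was trained on and on data it has not seen. The sketch below does so on a small made-up corpus. The helper name train_and_heldout_accuracy, the 75/25 split and the toy documents are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def train_and_heldout_accuracy(classifier, texts, labels, seed=0):
    """Fit on one part of the data; score on that part and on the unseen rest."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", classifier)])
    model.fit(x_train, y_train)
    train_acc = float(np.mean(model.predict(x_train) == np.asarray(y_train)))
    heldout_acc = float(np.mean(model.predict(x_test) == np.asarray(y_test)))
    return train_acc, heldout_acc


# Toy data, made up for illustration (label 0: cars, label 1: hockey).
toy_texts = [
    "fast red car", "car engine oil", "new car tires", "car brake pads",
    "hockey game tonight", "ice hockey team", "hockey playoff goal",
    "hockey stick and puck",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

train_acc, heldout_acc = train_and_heldout_accuracy(
    MultinomialNB(), toy_texts, toy_labels
)
print(f"accuracy on training part: {train_acc:.2f}")
print(f"accuracy on held-out part: {heldout_acc:.2f}")
```

A large difference between the two numbers is a sign that the model fits the training data more closely than it generalizes.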

### Results with the full dataset as the training set

In [10]:
newsgroups_train = fetch_20newsgroups(subset='all',
                                      remove=('headers', 'footers', 'quotes'))
train_texts = newsgroups_train['data']
train_labels = newsgroups_train['target']

do_classify("MultinomialNB", MultinomialNB())
do_classify("KNeighborsClassifier", KNeighborsClassifier())

MultinomialNB start...
   training time: 0:00:03.930140
   classification time: 0:00:05.294939
   accuracy:  0.7885023898035051
KNeighborsClassifier start...
   training time: 0:00:03.729825
   classification time: 0:00:19.912452
   accuracy:  0.37041954328199683


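The NB result improves somewhat. My reading is that this can roughly be attributed to the larger training set. Note, however, that subset='all' contains the test documents as well, so the test set is again part of the training data and the improvement is not a clean measure of generalization.
It is also clear that KNN, besides giving unsatisfactory results, takes much longer than NB to predict.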
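
Because subset='all' is the union of the train and test subsets, it can be useful to check how much of the test set also appears in the training data. A minimal sketch, with a hypothetical helper count_shared_documents and made-up documents:

```python
def count_shared_documents(train_docs, test_docs):
    """Count how many test documents also occur verbatim in the training documents."""
    train_set = set(train_docs)
    return sum(1 for doc in test_docs if doc in train_set)


# Toy data, made up for illustration.
toy_test = ["hockey game tonight", "fast red car"]
toy_train_only = ["car engine oil", "ice hockey team"]
toy_all = toy_train_only + toy_test  # mimics subset='all' = train + test

shared_separate = count_shared_documents(toy_train_only, toy_test)
shared_all = count_shared_documents(toy_all, toy_test)
print(shared_separate)  # 0
print(shared_all)       # 2
```

With the notebook's variables this would be count_shared_documents(train_texts, test_texts). Posts that become empty strings after the headers, footers and quotes are stripped would also count as matches.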

### Results of some other algorithms

In [11]:
do_classify("RandomForestClassifier", RandomForestClassifier(n_estimators=8))
do_classify("AdaBoostClassifier", AdaBoostClassifier())
do_classify("DecisionTreeClassifier", DecisionTreeClassifier())

RandomForestClassifier start...
   training time: 0:00:19.849521
   classification time: 0:00:21.338131
   accuracy:  0.9641529474243229
AdaBoostClassifier start...
   training time: 0:00:20.527629
   classification time: 0:00:22.147604
   accuracy:  0.3898035050451407
DecisionTreeClassifier start...
   training time: 0:00:53.043192
   classification time: 0:00:54.354173
   accuracy:  0.9726500265533723


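The results of RandomForest and DecisionTree (both of which I have only just learned) look quite satisfying! Keep in mind, though, that the training set here is still subset='all', which includes the test documents, so these figures largely show how well the trees fit data they have already seen.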

### Results after stopwords and stemming preprocessing

In [12]:
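stemmer = EnglishStemmer()
# Default analyzer: lowercases and tokenizes, but does not remove stop words.
analyzer = TfidfVectorizer().build_analyzer()


def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))


# Note: scikit-learn applies stop_words only when analyzer == 'word'; with a
# callable analyzer, as here, the stop_words argument has no effect.
vectorizer = TfidfVectorizer(stop_words="english", analyzer=stemmed_words)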
do_classify("MultinomialNB", MultinomialNB(), vectorizer)
do_classify("KNeighborsClassifier", KNeighborsClassifier(), vectorizer)

do_classify("RandomForestClassifier", RandomForestClassifier(n_estimators=8), vectorizer)
do_classify("AdaBoostClassifier", AdaBoostClassifier(), vectorizer)
do_classify("DecisionTreeClassifier", DecisionTreeClassifier(), vectorizer)

MultinomialNB start...
   training time: 0:01:12.378044
   classification time: 0:01:39.451199
   accuracy:  0.772437599575146
KNeighborsClassifier start...
   training time: 0:01:10.976910
   classification time: 0:01:52.478407
   accuracy:  0.41117896972915563
RandomForestClassifier start...
   training time: 0:01:28.230731
   classification time: 0:01:54.683303
   accuracy:  0.9646840148698885
AdaBoostClassifier start...
   training time: 0:01:31.417845
   classification time: 0:01:57.831992
   accuracy:  0.4171534784917685
DecisionTreeClassifier start...
   training time: 0:01:54.033363
   classification time: 0:02:19.589413
   accuracy:  0.9725172596919809


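As can be seen, adding the stopwords and stemming steps increases the running time enormously, but the results do not improve noticeably.

This may be due to the accuracy of the stopwords and stemming algorithms themselves, so that their benefit never materializes. Note also that TfidfVectorizer ignores stop_words when analyzer is a callable, so stop words were not actually removed in this run and only the stemming took effect.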
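
For stop-word removal and stemming to both take effect with a custom analyzer, the stop words can be given to the vectorizer that builds the base analyzer. The sketch below shows one way to do this. To stay self-contained it uses toy_stem, a hypothetical suffix-stripping stand-in, rather than the notebook's EnglishStemmer. The helper make_stemmed_analyzer is likewise an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def make_stemmed_analyzer(stem):
    """Build an analyzer that drops English stop words, then stems each token."""
    # stop_words is passed to the vectorizer that builds the base analyzer,
    # so the removal happens inside base_analyzer itself.
    base_analyzer = TfidfVectorizer(stop_words="english").build_analyzer()

    def analyzer(doc):
        return [stem(token) for token in base_analyzer(doc)]

    return analyzer


stemmed_analyzer = make_stemmed_analyzer(toy_stem)
tokens = stemmed_analyzer("The cats are running")
print(tokens)

# The callable analyzer now handles stop words itself, so the vectorizer
# does not need (and would ignore) a stop_words argument.
vectorizer = TfidfVectorizer(analyzer=stemmed_analyzer)
vectorizer.fit(["The cats are running", "A dog runs in the park"])
print(sorted(vectorizer.vocabulary_))
```

In the notebook's setting, make_stemmed_analyzer(EnglishStemmer().stem) would plug in the Snowball stemmer instead of the toy one. Whether this changes the accuracy figures would have to be measured.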

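### Outlook

So far, each algorithm has only been used on its own.

A next step could be to combine several algorithms and see whether the accuracy improves, or to try a deep learning approach.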
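
As one possible starting point for combining algorithms, scikit-learn provides VotingClassifier, which lets several classifiers vote on each prediction. The sketch below is an assumption about how such a combination might look, not something that was run in this experiment. The choice of estimators, the hard-voting setting and the toy data are all illustrative.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Each estimator casts one vote per document; the majority label wins.
voting = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=8, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble = Pipeline([("tfidf", TfidfVectorizer()), ("vote", voting)])

# Toy data, made up for illustration (label 0: cars, label 1: hockey).
toy_texts = ["fast red car", "car engine oil",
             "hockey game tonight", "ice hockey team"]
toy_labels = [0, 0, 1, 1]

ensemble.fit(toy_texts, toy_labels)
predicted = ensemble.predict(["red car engine", "hockey team game"])
print(predicted)
```

Since do_classify accepts any classifier, the voting object could also be passed to it directly, for example do_classify("VotingClassifier", voting).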