## References

+ [Tf-idf term weighting](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)
+ [sklearn.feature_extraction.text.TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer)

## Planned experiment workflow

1. Preprocess every file with a modified `wikifil.pl` (called `newfil.pl`)
    + Modification: comment out the statements on lines 13-15 and add a `{` on line 16
2. Read the text preprocessed in step 1 and store it in `data['nearRaw.train']` and `data['nearRaw.test']`
    + The **text content** of each record goes into `data['nearRaw.train']['content']` or `data['nearRaw.test']['content']` (depending on whether the record is training or test data)
    + The **correct class** of each record goes into `data['nearRaw.train']['class']` or `data['nearRaw.test']['class']` (depending on whether the record is training or test data)
3. [Vectorization + TF-IDF] + transformation + storage
    1. Use `sklearn.feature_extraction.text.TfidfVectorizer` for bag-of-words count vectorization and TF-IDF weighting
        + Use [`sklearn.feature_extraction.text.TfidfVectorizer.fit`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit) to learn the vocabulary and the IDF weights
    2. Use [`sklearn.feature_extraction.text.TfidfVectorizer.transform`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) to process the `content` entries of `data['nearRaw.train']` and `data['nearRaw.test']`, writing the results to `data['matrix.train']` and `data['matrix.test']` for later training and testing
    3. Convert `data['matrix.train']` and `data['matrix.test']` to `pandas.DataFrame` format and store them in `df['train']` and `df['test']` (`df` is a dictionary: `String -> DataFrame`)
4. Using `df['train']` as the dataset, pick any learning method and train a classifier `clf`
5. Classify `df['test']` with `clf`, then evaluate how well `clf` performs
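
To make the TF-IDF weighting in step 3 concrete, the sketch below recomputes the weights by hand. It assumes the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`, raw term counts), i.e. idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The function name `tfidf_rows` and the toy corpus are illustrative only, and whitespace tokenization is a simplification of `TfidfVectorizer`'s default tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights.

    Assumes whitespace tokenization, which is simpler than
    TfidfVectorizer's default token pattern.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf, as in scikit-learn's default (smooth_idf=True).
    idf = {t: math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["windows program crash", "windows help", "atheism debate"])
```

In the first toy document, `windows` appears in two of the three documents and therefore receives a lower weight than `crash`, which appears in only one.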


In [1]:
# Step 1: preprocess the data

import os

paths = {}
paths['dir.train'] = os.path.join(os.getcwd(), 'trialdata', 'train')
paths['dir.test'] = os.path.join(os.getcwd(), 'trialdata', 'test')

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for cls in os.listdir(dirpath):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            os.system('mv {} {}.old'.format(fpath, fpath))
            os.system('perl newfil.pl {}.old > {}'.format(fpath, fpath))
            os.system('rm {}.old'.format(fpath))

# Load the stop-word list: one word per line, blank lines skipped
with open(os.path.join(os.getcwd(), 'stoplist2.txt'), 'r') as readf:
    stopwordlist = [line.strip() for line in readf if line.strip()]

print("Step 1 Succeed")

Step 1 Succeed
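
The three `os.system` calls in Step 1 pass file paths through the shell unquoted, so a path containing spaces or shell metacharacters would break them. A hedged alternative, written for Python 3 and assuming `perl` and `newfil.pl` are available as in the cell above, is to run the filter with `subprocess`, write its output to a temporary file, and replace the original only if the filter succeeded. The helper name `filter_in_place` and its `cmd` parameter are illustrative and not part of the original notebook.

```python
import os
import subprocess
import tempfile


def filter_in_place(fpath, cmd):
    """Run `cmd + [fpath]` and overwrite fpath with the command's stdout.

    The original file is replaced only if the command exits with status 0.
    """
    dirname = os.path.dirname(os.path.abspath(fpath))
    # Write to a temp file in the same directory so the final rename is atomic.
    fd, tmppath = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, 'wb') as out:
            subprocess.check_call(list(cmd) + [fpath], stdout=out)
        os.replace(tmppath, fpath)
    except Exception:
        os.remove(tmppath)
        raise


# Usage in the Step 1 loop (assumes perl and newfil.pl exist):
#     filter_in_place(fpath, ['perl', 'newfil.pl'])
```

Because the command is passed as a list, no shell is involved and the path needs no quoting.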


In [2]:
# Step 2: read data and save it in data['nearRaw.train'] and data['nearRaw.test']
import random

data = {}
data['nearRaw.train'] = {'content':[], 'class':[]}
data['nearRaw.test'] = {'content':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                data['nearRaw.{}'.format(tpart)]['content'].append(readf.read())
                data['nearRaw.{}'.format(tpart)]['class'].append(cls)
    tmp = data['nearRaw.{}'.format(tpart)]
    # Print one randomly chosen record as a sanity check
    ind = random.randrange(len(tmp['class']))
    print("sample(transformed) from {}[{}]:\n[content]\n {}\n[class]\n{}".format(tpart, ind, tmp['content'][ind], tmp['class'][ind]))
    print("")

print("Step 2 Succeed")

sample(transformed) from train[821]:
[content]
  from rwag gwl com rodger wagner subject running c exe under windows three one reply to rwag gwl com organization the great west life assurance company x disclaimer the views expressed in this message are those of an individual at the great west life assurance company and do not necessarily reflect those of the company lines one seven preface i am a novice user at best to the windows environment i am trying to execute a ms c seven zero executable program which accesses a btrieve database to build an ascii file when i execute it under windows the screen goes blank and my pc locks up the only way for me to return is to reset the machine does anyone have any insight on what i may have to do in order for the program to correctly under windows by the way it runs fine in dos five zero system gateway four eight six dx two five zero ati graphics ultra card six four zero x four eight zero any help would be greatly appreciated rodger
[class]
comp.o
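
`os.listdir` returns entries in arbitrary order, so the order of the loaded documents (and therefore any index into them) can differ between machines. The sketch below wraps the Step 2 loop in a reusable function with sorted directory listings. `load_split` is a hypothetical helper name, and decoding the files as UTF-8 is an assumption about the filtered data.

```python
import io
import os


def load_split(dirpath):
    """Read <dirpath>/<class>/<file> into parallel content and class lists."""
    split = {'content': [], 'class': []}
    for cls in sorted(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        if not os.path.isdir(clspath):
            continue  # skip stray files next to the class directories
        for fname in sorted(os.listdir(clspath)):
            fpath = os.path.join(clspath, fname)
            # Assumption: the filtered files decode as UTF-8.
            with io.open(fpath, 'r', encoding='utf-8', errors='replace') as readf:
                split['content'].append(readf.read())
            split['class'].append(cls)
    return split


# Usage with the notebook's layout:
#     data['nearRaw.train'] = load_split(paths['dir.train'])
```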

In [43]:
# Step 3: TfidfVectorizer.fit + TfidfVectorizer.transform + save in pandas.DataFrame
#
# A. Fit sklearn.feature_extraction.text.TfidfVectorizer on the training text (vocabulary + IDF weights)
#
# B. Use sklearn.feature_extraction.text.TfidfVectorizer.transform on the 'content' lists of data['nearRaw.train'] and data['nearRaw.test'],
# writing the results to data['matrix.train'] and data['matrix.test'] for later training and testing
#
# C. Convert data['matrix.train'] and data['matrix.test'] to pandas.DataFrame and store them in df['train'] and df['test'] (df is a dict: String -> DataFrame)

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from IPython.display import display
%matplotlib inline

## Substep A: vectorization + TF-IDF calculation
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.01, max_features=500, analyzer='word', stop_words=stopwordlist)
vectorizer.fit(data['nearRaw.train']['content'])

print("Substep A finished.")
print("--------------------------------------------------")

## Substep B: Transformation

for tpart in ['train', 'test']:
    data['matrix.{}'.format(tpart)] = vectorizer.transform(data['nearRaw.{}'.format(tpart)]['content'])
    ind = random.randrange(data['matrix.{}'.format(tpart)].shape[0])
    print("sample for matrix.{}".format(tpart))
    print("from ind: {}".format(ind))
    print(data['matrix.{}'.format(tpart)][ind])
    print("")

print("Substep B finished.")
print("--------------------------------------------------")
    
# Substep C: integrate data into DataFrame format
df = {}
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['nearRaw.{}'.format(tpart)]['class']
    for col in range(data['matrix.{}'.format(tpart)].shape[1]):
        datadict[col]= [i[0] for i in data['matrix.{}'.format(tpart)].getcol(col).toarray()]

    df[tpart] = pd.DataFrame(data=datadict)
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")

print("Substep C finished.")
print("--------------------------------------------------")

print("Step 3 Succeed.")

# Tedious part: working out how to lay the CSR matrix data out in a DataFrame so that each row lines up with its class

Substep A finished.
--------------------------------------------------
sample for matrix.train
from ind: 1462
  (0, 499)	0.358137312902
  (0, 479)	0.278053072003
  (0, 456)	0.121058402835
  (0, 449)	0.165627821606
  (0, 441)	0.262243089413
  (0, 439)	0.262957751299
  (0, 437)	0.231885688549
  (0, 414)	0.211694026078
  (0, 400)	0.208753825778
  (0, 392)	0.115849132967
  (0, 374)	0.178500179345
  (0, 339)	0.118601129748
  (0, 309)	0.227303259274
  (0, 300)	0.122608073865
  (0, 190)	0.121899004356
  (0, 173)	0.213448577895
  (0, 154)	0.193333652082
  (0, 124)	0.111451203652
  (0, 121)	0.260623351981
  (0, 87)	0.271120306946
  (0, 57)	0.223028270579
  (0, 43)	0.211694026078

sample for matrix.test
from ind: 950
  (0, 499)	0.106897750083
  (0, 479)	0.165987998067
  (0, 387)	0.282750435615
  (0, 385)	0.292549098485
  (0, 374)	0.213116778107
  (0, 365)	0.192615890296
  (0, 342)	0.282750435615
  (0, 339)	0.141601486029
  (0, 335)	0.205900239777
  (0, 313)	0.269644597523
  (0, 309)	0.2713842554

Substep B finished.
--------------------------------------------------
See df[train]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,491,492,493,494,495,496,497,498,499,class
0,0.161126,0.000000,0.000000,0.153929,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.038208,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.169109,0.000000,0.000000,0.000000,...,0.000000,0.076088,0.311048,0.000000,0.000000,0.000000,0.000000,0.152456,0.116309,alt.atheism
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
3,0.010477,0.000000,0.000000,0.020018,0.010866,0.010414,0.022087,0.000000,0.000000,0.011206,...,0.000000,0.000000,0.040625,0.019982,0.000000,0.017837,0.010272,0.029868,0.045572,alt.atheism
4,0.000000,0.000000,0.000000,0.075064,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.037264,0.000000,0.000000,0.000000,0.000000,0.077037,0.000000,0.028481,alt.atheism
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.131021,0.000000,...,0.000000,0.104869,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.041429,0.084680,0.000000,0.000000,0.074362,0.000000,0.000000,0.094992,alt.atheism
7,0.000000,0.161841,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.074177,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.109234,...,0.000000,0.024218,0.049502,0.000000,0.000000,0.000000,0.100133,0.048525,0.203611,alt.atheism
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.085865,0.000000,0.000000,0.000000,0.030347,alt.atheism






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,491,492,493,494,495,496,497,498,499,class
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.080766,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.185189,alt.atheism
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.092604,0.000000,0.000000,0.000000,...,0.221761,0.166664,0.085165,0.000000,0.000000,0.000000,0.000000,0.000000,0.095536,alt.atheism
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.314571,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.111272,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.212614,alt.atheism
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.181320,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.104775,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.080080,alt.atheism
8,0.000000,0.075817,0.017109,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.039838,alt.atheism
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.230105,0.043234,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism






Substep C finished.
--------------------------------------------------
Step 3 Succeed.
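
Substep C builds the DataFrame one column at a time, which the closing comment in the cell describes as tedious. A shorter route, sketched here under the assumption that the dense matrix fits in memory (it has at most 500 columns in this notebook), is to densify once with `toarray()` and attach the labels as an extra column. The toy matrix and labels are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-ins for data['matrix.train'] and data['nearRaw.train']['class'].
matrix = csr_matrix(np.array([[0.0, 0.6, 0.8],
                              [1.0, 0.0, 0.0]]))
labels = ['alt.atheism', 'comp.o']

# Row i of the sparse matrix becomes row i of the frame, so labels stay aligned.
frame = pd.DataFrame(matrix.toarray())
frame['class'] = labels
```

Most scikit-learn estimators, including `LinearSVC` and `MultinomialNB`, also accept the sparse matrix directly, so the DataFrame is mainly useful for inspection. `GaussianNB` is an exception that needs dense input.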


In [56]:
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']

# clf = KNeighborsClassifier()
# clf = LinearSVC()
# clf = SGDClassifier()
# clf = GaussianNB()
# clf = DecisionTreeClassifier()
# clf = MultinomialNB()
clf = MLPClassifier()
clf.fit(X_train, y_train)
# print("Training finished with clf with [n_classes, n_features]: {}".format(clf.coef_))
print("Step 4 finished")

Step 4 finished
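
Step 4 switches classifiers by commenting lines in and out. One way to compare the candidates in a single run is to loop over them. The sketch below does this on synthetic data from `make_classification`, since the notebook's data is not available here, so the scores say nothing about the real task. The `candidates` dictionary and the fixed `random_state` values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for df['train'] / df['test'].
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    'knn': KNeighborsClassifier(),
    'linear_svc': LinearSVC(),
    'gaussian_nb': GaussianNB(),
    'tree': DecisionTreeClassifier(random_state=0),
}

# Fit every candidate on the same split and record its test accuracy.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))

# Print the candidates from best to worst.
for name in sorted(scores, key=scores.get, reverse=True):
    print('{:12s} {:.3f}'.format(name, scores[name]))
```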


In [57]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))             # accuracy
print(f1_score(y_test_true, y_test_pred, average='macro'))  # macro-averaged F1
print(f1_score(y_test_true, y_test_pred, average='micro'))  # micro-averaged F1

0.877918612408
0.876934072158
0.877918612408
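
The first and third numbers above are identical, which is expected: for single-label multiclass data, micro-averaged F1 pools true positives, false positives, and false negatives over all classes, and that reduces to plain accuracy. Macro-averaged F1 instead averages the per-class F1 scores, so small classes count as much as large ones. The pure-Python sketch below illustrates both on made-up labels; `f1_scores` is a hypothetical helper, not a scikit-learn function.

```python
def f1_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, micro_f1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2.0 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / float(len(y_true))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return accuracy, macro, micro


y_true = ['a', 'a', 'a', 'a', 'b', 'b']
y_pred = ['a', 'a', 'a', 'b', 'b', 'a']
accuracy, macro, micro = f1_scores(y_true, y_pred)
```

On these toy labels, accuracy and micro F1 are both 4/6, while macro F1 is the mean of the per-class scores 0.75 and 0.5.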


In [6]:
# s = ""
# with open('stoplist.txt', 'w') as stoplistfile:
#     for w in vectorizer.stop_words_:
#         s += "{} ".format(w)
#     stoplistfile.write(s)
    
# print "Output stoplist successfully."

Output stoplist successfully.


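## Improvement log

As the number of classes grows, classification performance drops.

In that case, performance can be improved by tuning the following:

+ The number of features
+ A suitable stop-word list
+ ...?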
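
The first two items above map onto `TfidfVectorizer` parameters (`max_features` and `stop_words`), so they can be searched together with cross-validation. A minimal sketch, assuming a `Pipeline` with `MultinomialNB` as the classifier and a tiny made-up corpus; the grid values are placeholders rather than recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus: two classes, six short documents each.
docs = ['windows program crash', 'windows driver install', 'dos program error',
        'pc screen driver', 'windows screen error', 'install dos program',
        'god belief debate', 'atheism argument belief', 'religion debate god',
        'belief religion argument', 'god atheism religion', 'debate argument atheism']
labels = ['comp'] * 6 + ['atheism'] * 6

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])

param_grid = {
    'tfidf__max_features': [5, 20],          # number of features
    'tfidf__stop_words': [None, 'english'],  # stop-word list
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1_macro')
search.fit(docs, labels)
print(search.best_params_)
```

Fitting the vectorizer inside the pipeline means each cross-validation fold learns its vocabulary from its own training portion only.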