## References

+ [Tf-idf term weighting](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)
+ [sklearn.feature_extraction.text.TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer)

## Planned experiment workflow

1. Preprocess every file with a modified version of `wikifil.pl` (called `newfil.pl`)
    + Modification: comment out the statements on lines 13-15 and add a `{` on line 16
2. Read the text preprocessed in step 1 and store it in `data['nearRaw.train']` and `data['nearRaw.test']`
    + The **text content** of each sample is stored in `data['nearRaw.train']['content']` or `data['nearRaw.test']['content']`, depending on whether the sample belongs to the training set or the test set
    + The **correct class** of each sample is stored in `data['nearRaw.train']['class']` or `data['nearRaw.test']['class']`, depending on whether the sample belongs to the training set or the test set
3. [Vectorization + TF-IDF] + transformation + storage
    1. Use `sklearn.feature_extraction.text.TfidfVectorizer` for bag-of-words count vectorization and TF-IDF weighting
        + Use [`sklearn.feature_extraction.text.TfidfVectorizer.fit`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit) on the training data to learn the vocabulary and compute the IDF weights
    2. Use [`sklearn.feature_extraction.text.TfidfVectorizer.transform`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) to process the `content` of `data['nearRaw.train']` and of `data['nearRaw.test']`, writing the results to `data['matrix.train']` and `data['matrix.test']` for the later training and evaluation steps
    3. Convert `data['matrix.train']` and `data['matrix.test']` to `pandas.DataFrame` format and store them in `df['train']` and `df['test']` (`df` is a dictionary: `String -> DataFrame`)
4. Using `df['train']` as the dataset, pick any learning method and train a classifier `clf`
5. Classify `df['test']` with `clf`, then evaluate how well `clf` performs
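
To make step 3 more concrete, here is a minimal pure-Python sketch of the fit/transform split. It assumes the weighting that `TfidfVectorizer` documents as its defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows, tokens of two or more word characters). The helper names `tokenize`, `fit_idf` and `transform`, and the toy documents, are hypothetical and only serve the illustration; this is not the library's implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term: math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
            for term, df in doc_freq.items()}

def transform(docs, idf):
    """Map each document to an L2-normalized {term: weight} dict, ignoring unseen terms."""
    rows = []
    for doc in docs:
        counts = Counter(t for t in tokenize(doc) if t in idf)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy stand-ins for the training and test text (hypothetical data)
train_docs = ["the mail order program", "the windows driver", "the mail server"]
test_docs = ["mail program for windows", "unknown words only"]

idf = fit_idf(train_docs)              # fitted on the training data only
train_rows = transform(train_docs, idf)
test_rows = transform(test_docs, idf)  # test terms outside the vocabulary are dropped
```

Fitting on the training documents only means the test set contributes neither vocabulary nor document frequencies, so a test document made up entirely of unseen words becomes an all-zero row.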


In [5]:
# Step 1: preprocess the data

import os
import subprocess

paths = {}
paths['dir.train'] = os.path.join(os.getcwd(), 'trialdata', 'train')
paths['dir.test'] = os.path.join(os.getcwd(), 'trialdata', 'test')

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for cls in os.listdir(dirpath):
        clspath = os.path.join(dirpath, cls)
        for f in os.listdir(clspath):
            fpath = os.path.join(clspath, f)
            oldpath = fpath + '.old'
            # Filter the file in place: move it aside, run newfil.pl on the
            # copy, write the result to the original path, then delete the copy.
            # Passing arguments as a list avoids shell quoting problems in paths.
            os.rename(fpath, oldpath)
            with open(fpath, 'w') as out:
                subprocess.check_call(['perl', 'newfil.pl', oldpath], stdout=out)
            os.remove(oldpath)

print("Step 1 Succeed")

Step 1 Succeed


In [205]:
# Step 2: read data and save it in data['nearRaw.train'] and data['nearRaw.test']
import random

data = {}
data['nearRaw.train'] = {'content': [], 'class': []}
data['nearRaw.test'] = {'content': [], 'class': []}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    tmp = data['nearRaw.{}'.format(tpart)]
    # Each subdirectory name is a class label; each file in it is one sample
    for cls in os.listdir(dirpath):
        clspath = os.path.join(dirpath, cls)
        for f in os.listdir(clspath):
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                tmp['content'].append(readf.read())
                tmp['class'].append(cls)
    # Print one randomly chosen sample as a sanity check
    ind = random.randrange(len(tmp['class']))
    print("sample(transformed) from {}[{}]:\n[content]\n {}\n[class]\n{}".format(tpart, ind, tmp['content'][ind], tmp['class'][ind]))
    print("")

print("Step 2 Succeed")

sample(transformed) from train[98]:
[content]
  from cfdeb zero one ux one cts eiu edu dixon berry subject mail order sales billing receivables program organization eastern illinois university lines two zero surely some one of you is familiar with what a mail order company goes through this company has only a few products but thousands of clients i need a sales billing and receivables program to handle the thing but i need to be able to customize it myself own the source etc anyone willing to sell me the basic stuff in any development language i ll be willing to pay about one zero zero zero to it has to be ready now i need this sort of solution immediately with more time i ll just develop one myself if you can have me a prototype in two weeks you can make some quick cash dixon berry i see the light cfdeb zero one ux one cts eiu edu at the end of the tunnel now eastern illinois university thanks bill clinton booth library someone please tell me computer resource center it s not a train 

In [206]:
# Step 3: TfidfVectorizer.fit + TfidfVectorizer.transform + save in Pandas.DataFrame
#
# A. Use sklearn.feature_extraction.text.TfidfVectorizer.fit to learn the vocabulary and compute the IDF weights
#
# B. Use sklearn.feature_extraction.text.TfidfVectorizer.transform to process the content of data['nearRaw.train'] and of data['nearRaw.test'],
# writing the results to data['matrix.train'] and data['matrix.test'] for the later training and evaluation steps
#
# C. Convert data['matrix.train'] and data['matrix.test'] to Pandas.DataFrame format and store them in df['train'] and df['test'] (df is a dictionary: String -> DataFrame)

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from IPython.display import display
%matplotlib inline

## Substep A: vectorization + TF-IDF calculation
vectorizer = TfidfVectorizer(max_df=0.99, min_df=0.01, max_features=100)
vectorizer.fit(data['nearRaw.train']['content'])

print("Substep A finished.")
print("--------------------------------------------------")

## Substep B: Transformation

for tpart in ['train', 'test']:
    data['matrix.{}'.format(tpart)] = vectorizer.transform(data['nearRaw.{}'.format(tpart)]['content'])
    # Show the non-zero TF-IDF entries of one randomly chosen row
    ind = random.randrange(data['matrix.{}'.format(tpart)].shape[0])
    print("sample for matrix.{}".format(tpart))
    print("from ind: {}".format(ind))
    print(data['matrix.{}'.format(tpart)][ind])
    print("")

print("Substep B finished.")
print("--------------------------------------------------")

## Substep C: integrate data into DataFrame format
df = {}
for tpart in ['train', 'test']:
    datadict = {}
    # One DataFrame column per TF-IDF feature, then the class label as the last column
    for col in range(data['matrix.{}'.format(tpart)].shape[1]):
        datadict[col] = [i[0] for i in data['matrix.{}'.format(tpart)].getcol(col).toarray()]
    datadict['class'] = data['nearRaw.{}'.format(tpart)]['class']

    df[tpart] = pd.DataFrame(data=datadict)
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")

print("Substep C finished.")
print("--------------------------------------------------")

print("Step 3 Succeed.")

# Tedious part: working out how to arrange the data from the CSR matrix neatly in a DataFrame while keeping each row aligned with its class

Substep A finished.
--------------------------------------------------
sample for matrix.train
from ind: 1266
  (0, 98)	0.267457475816
  (0, 96)	0.221401445375
  (0, 79)	0.0710299520433
  (0, 75)	0.114212866528
  (0, 74)	0.250293706468
  (0, 73)	0.171150877127
  (0, 65)	0.0912859629198
  (0, 64)	0.101873111338
  (0, 60)	0.0554518346965
  (0, 58)	0.139802232718
  (0, 57)	0.0587119347018
  (0, 55)	0.0657381547077
  (0, 54)	0.191313015965
  (0, 53)	0.125562280378
  (0, 52)	0.104450768257
  (0, 41)	0.306289377983
  (0, 40)	0.367991733607
  (0, 39)	0.275434455543
  (0, 37)	0.103909915677
  (0, 18)	0.436614948826
  (0, 17)	0.205688408838
  (0, 8)	0.214315482238
  (0, 6)	0.113107662527
  (0, 4)	0.222452796076

sample for matrix.test
from ind: 923
  (0, 96)	0.0819892140411
  (0, 90)	0.309966223929
  (0, 88)	0.0880433612174
  (0, 86)	0.0898885249849
  (0, 81)	0.150263550009
  (0, 80)	0.283852754586
  (0, 79)	0.105215030219
  (0, 78)	0.132386524536
  (0, 74)	0.417098583394
  (0, 73)	0.1267609240

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,class
0,0.038474,0.073177,0.000000,0.083647,0.070334,0.148380,0.000000,0.182598,0.033881,0.172158,...,0.000000,0.150986,0.485756,0.000000,0.040404,0.035001,0.143260,0.084563,0.022913,comp.os.ms-windows.misc
1,0.070990,0.067511,0.081472,0.000000,0.064888,0.039112,0.000000,0.056153,0.000000,0.063531,...,0.000000,0.069648,0.149382,0.050602,0.074551,0.000000,0.211469,0.156032,0.126835,comp.os.ms-windows.misc
2,0.000000,0.000000,0.000000,0.196120,0.000000,0.149098,0.083848,0.285415,0.000000,0.161458,...,0.000000,0.000000,0.094909,0.064299,0.094732,0.082064,0.000000,0.000000,0.000000,comp.os.ms-windows.misc
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,comp.os.ms-windows.misc
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.143821,0.080881,0.000000,0.076626,0.000000,...,0.000000,0.000000,0.000000,0.124048,0.000000,0.000000,0.194403,0.191253,0.310931,comp.os.ms-windows.misc
5,0.000000,0.000000,0.000000,0.000000,0.079056,0.142954,0.080393,0.000000,0.076164,0.077402,...,0.000000,0.084854,0.000000,0.123300,0.000000,0.000000,0.064410,0.000000,0.309057,comp.os.ms-windows.misc
6,0.095601,0.000000,0.000000,0.000000,0.000000,0.000000,0.088861,0.075620,0.000000,0.085556,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.086970,0.213584,0.000000,0.113870,comp.os.ms-windows.misc
7,0.103612,0.000000,0.000000,0.000000,0.000000,0.171254,0.000000,0.081957,0.000000,0.092726,...,0.000000,0.000000,0.218027,0.000000,0.000000,0.000000,0.077161,0.000000,0.061707,comp.os.ms-windows.misc
8,0.029439,0.055993,0.000000,0.000000,0.000000,0.178414,0.027364,0.139719,0.051849,0.105384,...,0.036109,0.057765,0.123896,0.083937,0.030916,0.026782,0.175390,0.032353,0.245457,comp.os.ms-windows.misc
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.086606,0.000000,0.000000,0.000000,0.140679,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.117065,0.000000,0.000000,comp.os.ms-windows.misc






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,class
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.230121,0.000000,comp.os.ms-windows.misc
1,0.000000,0.000000,0.000000,0.000000,0.096936,0.058429,0.000000,0.167774,0.000000,0.094909,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.096478,0.000000,0.000000,0.189478,comp.os.ms-windows.misc
2,0.000000,0.000000,0.000000,0.314042,0.264061,0.159164,0.000000,0.000000,0.000000,0.258538,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.215142,0.000000,0.000000,comp.os.ms-windows.misc
3,0.115570,0.000000,0.000000,0.000000,0.000000,0.063673,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.121595,0.000000,0.000000,0.105137,0.172133,0.000000,0.275312,comp.os.ms-windows.misc
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.092130,0.000000,0.132272,0.147256,0.000000,...,0.000000,0.000000,0.175938,0.000000,0.000000,0.000000,0.000000,0.000000,0.099589,comp.os.ms-windows.misc
5,0.097746,0.000000,0.000000,0.000000,0.000000,0.161559,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.308525,0.069673,0.000000,0.000000,0.072793,0.107420,0.174639,comp.os.ms-windows.misc
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.049810,0.000000,0.000000,0.079614,0.000000,...,0.000000,0.000000,0.285361,0.128885,0.000000,0.082246,0.067328,0.000000,0.323055,comp.os.ms-windows.misc
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.187336,0.000000,0.000000,0.000000,0.152149,...,0.000000,0.000000,0.000000,0.121185,0.000000,0.000000,0.253221,0.000000,0.405006,comp.os.ms-windows.misc
8,0.000000,0.000000,0.000000,0.093308,0.156915,0.283744,0.000000,0.000000,0.151175,0.153633,...,0.000000,0.084212,0.000000,0.122366,0.000000,0.156173,0.383535,0.000000,0.051119,comp.os.ms-windows.misc
9,0.179591,0.000000,0.000000,0.000000,0.000000,0.098945,0.000000,0.000000,0.000000,0.160721,...,0.220278,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,comp.os.ms-windows.misc






Substep C finished.
--------------------------------------------------
Step 3 Succeed.
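
Step 3 fills the DataFrame one column at a time. When the matrix is small enough to densify (here it has only 100 feature columns), a shorter route may be to convert the whole CSR matrix in one call and attach the labels afterwards. The sketch below uses a tiny hand-made matrix and made-up labels as hypothetical stand-ins for `data['matrix.train']` and `data['nearRaw.train']['class']`.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for data['matrix.train'] and data['nearRaw.train']['class']
matrix = csr_matrix(np.array([[0.0, 0.5, 0.5],
                              [1.0, 0.0, 0.0]]))
classes = ['comp.os.ms-windows.misc', 'example.other']

# Densify once; row i of the frame corresponds to row i of the matrix,
# so the labels line up as long as they were collected in the same order.
frame = pd.DataFrame(matrix.toarray())
frame['class'] = classes

# Split back into features and labels, as Steps 4 and 5 do
X = frame.drop('class', axis=1)
y = frame['class']
```

This relies on the same assumption as the column-wise version: the list of classes was appended in the same order as the documents that were vectorized.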


In [208]:
# Step 4: train a classifier on the training DataFrame
from sklearn.svm import LinearSVC

X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']

clf = LinearSVC()
clf.fit(X_train, y_train)
# print("Training finished; clf.coef_ has shape [n_classes, n_features]: {}".format(clf.coef_.shape))
print("Step 4 finished")

Step 4 finished


In [213]:
# Step 5: classify the test set and evaluate the classifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

y_test_pred = clf.predict(X_test)
# Printed in order: accuracy, macro-averaged F1, micro-averaged F1
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

0.821186440678
0.819878633131
0.821186440678
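
The first and third numbers above are identical, which is expected: in single-label multiclass classification every misclassified sample counts as exactly one false positive (for the predicted class) and one false negative (for the true class), so micro-averaged F1 reduces to accuracy. The pure-Python sketch below illustrates this on a made-up label list; `f1_scores` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2.0 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, weighting every class equally
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / float(len(y_true))
    return accuracy, macro, micro

# Made-up labels for illustration only
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
accuracy, macro_f1, micro_f1 = f1_scores(y_true, y_pred)
```

Macro F1 differs from the other two because it weights each class equally regardless of how many samples it has, which makes it the more informative of the two averages when classes are imbalanced.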
