## References

+ [Tf-idf term weighting](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)
+ [sklearn.feature_extraction.text.TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer)

## Planned experiment workflow

1. Preprocess every file with a modified `wikifil.pl` (called `newfil.pl`)
    + Modification: comment out the statements on lines 13-15 and add a `{` on line 16
2. Read the text preprocessed in step 1 into `data['nearRaw.train']` and `data['nearRaw.test']`
    + The **text content** of each record is stored in `data['nearRaw.train']['content']` or `data['nearRaw.test']['content']`, depending on whether the record is training or test data
    + The **correct class** of each record is stored in `data['nearRaw.train']['class']` or `data['nearRaw.test']['class']`, depending on whether the record is training or test data
3. [Vectorization + TF-IDF] + transformation + storage
    1. Use `sklearn.feature_extraction.text.TfidfVectorizer` for bag-of-words count vectorization and TF-IDF weighting
        + Use [`sklearn.feature_extraction.text.TfidfVectorizer.fit`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit) to learn the vocabulary and compute the IDF weights
    2. Use [`sklearn.feature_extraction.text.TfidfVectorizer.transform`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) to process the `content` of `data['nearRaw.train']` and of `data['nearRaw.test']`, and write the results to `data['matrix.train']` and `data['matrix.test']` for the later training and evaluation steps
    3. Convert `data['matrix.train']` and `data['matrix.test']` to `pandas.DataFrame` and store them in `df['train']` and `df['test']` (`df` is a dict: `String -> DataFrame`)
4. Train a classifier `clf` on `df['train']` using any learning method
5. Classify `df['test']` with `clf`, then evaluate how well `clf` performs
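
To make step 3 less of a black box, here is a minimal pure-Python sketch of the weighting that scikit-learn documents for its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The function name `tfidf_rows` and the whitespace tokenizer are assumptions for illustration only; `TfidfVectorizer` tokenizes with a regular expression and supports many more options.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, matching the documented default smooth_idf=True.
    idf = {t: math.log((1.0 + n_docs) / (1.0 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["the cat sat", "the dog sat down", "the bird flew"])
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

A term that occurs in every document (here `the`) gets the smallest IDF, so within a row it weighs less than rarer terms with the same count.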


In [1]:
# Step 0: import module

import os
import tensorflow as tf
import pandas as pd
import numpy as np
from IPython.display import display
%matplotlib inline

print "import modules successfully"

import modules successfully


In [2]:
# Step 1: preprocess the data

paths = {}
paths['dir.train'] = os.path.join(os.getcwd(), 'trialdata', 'train')
paths['dir.test'] = os.path.join(os.getcwd(), 'trialdata', 'test')

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for cls in os.listdir(dirpath):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            os.system('mv {} {}.old'.format(fpath, fpath))
            os.system('perl newfil.pl {}.old > {}'.format(fpath, fpath))
            os.system('rm {}.old'.format(fpath))

print "file preprocessing successful"

# Read the stop word list (one word per line), skipping blank lines
stopwordlist = []
with open(os.path.join(os.getcwd(), 'stoplist2.txt'), 'r') as readf:
    stopwordlist = [w.strip() for w in readf.read().split('\n') if w.strip()]

print "read stop word list successfully"

print "Step 1 Succeed"

file preprocessing successful
read stop word list successfully
Step 1 Succeed
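
The preprocessing loop builds shell commands by string formatting, which breaks on paths containing spaces or shell metacharacters and silently continues if `perl` fails. A hedged alternative is sketched below: it passes the arguments as a list, writes the filter output to a temporary file, and replaces the original only when the command succeeds. The helper name `filter_in_place` is hypothetical, and `os.rename` overwriting an existing file assumes a POSIX system (which the original `mv`/`rm` calls already imply).

```python
import os
import subprocess
import sys
import tempfile


def filter_in_place(path, command):
    """Run `command + [path]` and replace `path` with the command's stdout."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as out:
            # A list of arguments bypasses the shell, so no quoting is needed.
            subprocess.check_call(list(command) + [path], stdout=out)
        # Only reached when the command exited with status 0.
        os.rename(tmp_path, path)
    except Exception:
        os.remove(tmp_path)
        raise
```

In the notebook loop this would be called as `filter_in_place(fpath, ['perl', 'newfil.pl'])` in place of the three `os.system` calls.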


In [3]:
# Step 2: read data and save it in data['nearRaw.train'] and data['nearRaw.test']
import random

data = {}
data['nearRaw.train'] = {'content':[], 'class':[]}
data['nearRaw.test'] = {'content':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                data['nearRaw.{}'.format(tpart)]['content'].append(readf.read())
                data['nearRaw.{}'.format(tpart)]['class'].append(cls)
    tmp = data['nearRaw.{}'.format(tpart)]
    ind = (random.sample(range(len(tmp['class'])), 1))[0]
    print "sample(transformed) from {}[{}]:\n[content]\n {}\n[class]\n{}".format(tpart, ind, tmp['content'][ind], tmp['class'][ind])
    print 

print "Step 2 Succeed"

sample(transformed) from train[1218]:
[content]
  from dwilson csugrad cs vt edu david wilson subject need apartment room in boston lines one zero organization virginia tech computer science dept blacksburg va lines one zero i will be in boston cambridge specifically working this summer and am in need of a place to stay if you have a room to sublease or anything of the sort i would appreciate a mail i am a two zero year old white male and am very flexible i can adapt to a smoking or non smoking environment access to the t would be nice though i will have a car thus need a parking space i would need this from late may or early june until aproximately end of july any responses welcome mike mbeck vtssi vt edu
[class]
misc.forsale

sample(transformed) from test[962]:
[content]
  from rtd spectrx saigon com ramesh daryani subject one x one eight zero three six pcs one zero two organization spectrox systems four zero eight two five two one zero zero five cupertino ca lines one three hi fello

In [91]:
# Step 3: TfidfVectorizer.fit + TfidfVectorizer.transform + save in pandas.DataFrame
#
# A. Use sklearn.feature_extraction.text.TfidfVectorizer.fit to learn the vocabulary and IDF weights from the training text
#
# B. Use sklearn.feature_extraction.text.TfidfVectorizer.transform to process the 'content' of data['nearRaw.train'] and of data['nearRaw.test'],
# and write the results to data['matrix.train'] and data['matrix.test'] for the later training and evaluation steps
#
# C. Convert data['matrix.train'] and data['matrix.test'] to pandas.DataFrame and store them in df['train'] and df['test'] (df is a dict: String -> DataFrame)

from sklearn.feature_extraction.text import TfidfVectorizer

## Substep A: vectorization + TF-IDF calculation
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.01, max_features=500, analyzer='word', stop_words=stopwordlist)
vectorizer.fit(data['nearRaw.train']['content'])

print "Substep A finished."
print "--------------------------------------------------"

## Substep B: Transformation

for tpart in ['train', 'test']:
    data['matrix.{}'.format(tpart)] = vectorizer.transform(data['nearRaw.{}'.format(tpart)]['content'])
    ind = (random.sample(range(data['matrix.{}'.format(tpart)].shape[0]), 1))[0]
    print "sample for matrix.{}".format(tpart)
    print "from ind: {}".format(ind)
    print data['matrix.{}'.format(tpart)][ind]
    print 
    
print "Substep B finished."
print "--------------------------------------------------"
    
## Substep C: integrate data into DataFrame format
df = {}
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['nearRaw.{}'.format(tpart)]['class']
    for col in range(data['matrix.{}'.format(tpart)].shape[1]):
        datadict[str(col)]= [i[0] for i in data['matrix.{}'.format(tpart)].getcol(col).toarray()]

    df[tpart] = pd.DataFrame(data=datadict)
    print "See df[{}]".format(tpart)
    display(df[tpart])
    print "\n\n\n"

print "Substep C finished."
print "--------------------------------------------------"

print "Step 3 Succeed."

# Tricky part: working out how to arrange the CSR matrix data neatly in a DataFrame, aligned one-to-one with the class labels

Substep A finished.
--------------------------------------------------
sample for matrix.train
from ind: 2100
  (0, 499)	0.30535864615
  (0, 494)	0.200839171271
  (0, 484)	0.358081573637
  (0, 462)	0.157243457135
  (0, 455)	0.210605624714
  (0, 415)	0.224694376702
  (0, 414)	0.360993389253
  (0, 378)	0.22581504819
  (0, 356)	0.0822594060136
  (0, 346)	0.190160612435
  (0, 302)	0.245362559273
  (0, 299)	0.0903826470481
  (0, 288)	0.208920198452
  (0, 191)	0.188425870237
  (0, 159)	0.16291670074
  (0, 154)	0.164842087457
  (0, 121)	0.296287079653
  (0, 73)	0.2225200611
  (0, 39)	0.213687053016
  (0, 27)	0.100951887144

sample for matrix.test
from ind: 939
  (0, 499)	0.389955947522
  (0, 460)	0.165076093624
  (0, 456)	0.105451049042
  (0, 431)	0.15700026853
  (0, 429)	0.255880142541
  (0, 392)	0.100913379954
  (0, 374)	0.310974384681
  (0, 339)	0.103310577841
  (0, 335)	0.150222101092
  (0, 300)	0.106800930025
  (0, 247)	0.160809290219
  (0, 231)	0.207420381728
  (0, 207)	0.188035418594
 

Unnamed: 0,0,1,10,100,101,102,103,104,105,106,...,91,92,93,94,95,96,97,98,99,class
0,0.161126,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.075761,0.154483,alt.atheism
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
2,0.000000,0.000000,0.000000,0.097940,0.000000,0.000000,0.0,0.089396,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.086914,alt.atheism
3,0.010477,0.000000,0.083615,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.029357,...,0.000000,0.048603,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.073389,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.103266,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
6,0.000000,0.000000,0.116193,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,...,0.000000,0.081048,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.083753,alt.atheism
7,0.000000,0.161841,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.366883,0.000000,0.000000,0.000000,alt.atheism
8,0.000000,0.000000,0.000000,0.000000,0.053324,0.000000,0.0,0.000000,0.0,0.047696,...,0.000000,0.047379,0.000000,0.038244,0.000000,0.000000,0.000000,0.000000,0.048960,alt.atheism
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,...,0.000000,0.077676,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism






See df[test]


Unnamed: 0,0,1,10,100,101,102,103,104,105,106,...,91,92,93,94,95,96,97,98,99,class
0,0.000000,0.000000,0.000000,0.000000,0.177831,0.000000,0.000000,0.000000,0.000000,0.159063,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.326557,alt.atheism
1,0.000000,0.000000,0.058430,0.000000,0.091740,0.000000,0.000000,0.000000,0.000000,0.082058,...,0.000000,0.000000,0.000000,0.000000,0.096432,0.000000,0.000000,0.000000,0.000000,alt.atheism
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.162650,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
5,0.000000,0.000000,0.000000,0.000000,0.244999,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.108842,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
8,0.000000,0.075817,0.024365,0.000000,0.019128,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.054874,0.000000,0.000000,0.000000,0.017226,0.000000,alt.atheism
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.089899,0.000000,0.000000,...,0.000000,0.084579,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism






Substep C finished.
--------------------------------------------------
Step 3 Succeed.
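
Substep C fills the DataFrame one column at a time, and the string column names end up ordered as `0, 1, 10, 100, ...` in the output above. A shorter route, sketched here under the assumption that the matrix is small enough to densify (500 features is), converts the whole sparse matrix at once and zero-pads the column names so that string order matches numeric order. `matrix_to_frame` is a hypothetical helper, and the padded names differ from the `str(col)` names used in the notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def matrix_to_frame(matrix, labels):
    """Densify a sparse document-term matrix and attach the class column."""
    n_features = matrix.shape[1]
    width = len(str(n_features - 1))
    # Zero-padded names keep the column order stable when sorted as strings.
    names = [str(col).zfill(width) for col in range(n_features)]
    frame = pd.DataFrame(matrix.toarray(), columns=names)
    # Row i of the matrix and labels[i] describe the same document.
    frame["class"] = list(labels)
    return frame


toy = csr_matrix(np.array([[0.0, 0.5, 0.0], [0.2, 0.0, 0.8]]))
frame = matrix_to_frame(toy, ["alt.atheism", "sci.med"])
print(frame)
```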


In [111]:
# Use TensorFlow to train the DNN

COL_OUTCOME = 'class'  # name of the label column; defined here so the cell runs top to bottom

cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

print cls2num
print df['train'][COL_OUTCOME].values

{'comp.os.ms-windows.misc': 1, 'sci.med': 3, 'misc.forsale': 2, 'alt.atheism': 0}
['alt.atheism' 'alt.atheism' 'alt.atheism' ..., 'sci.med' 'sci.med'
 'sci.med']
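
The mapping above takes its integer ids from the order in which classes first appear in the DataFrame. As a hedged sketch, sorting the class names first makes the ids independent of row order, and building the inverse map at the same time avoids mismatches when predictions are decoded later. The helper name `build_label_maps` and the variable names are assumptions, not part of the notebook.

```python
def build_label_maps(labels):
    """Map each distinct class name to an integer id, and each id back to its name."""
    names = sorted(set(labels))  # sorted, so ids do not depend on row order
    to_id = {name: idx for idx, name in enumerate(names)}
    to_name = {idx: name for name, idx in to_id.items()}
    return to_id, to_name


to_id, to_name = build_label_maps(
    ["sci.med", "alt.atheism", "misc.forsale", "alt.atheism"])
print(to_id)
print([to_name[i] for i in (0, 1, 2)])
```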


In [135]:
# train the classifier

COL_OUTCOME = 'class'
COL_FEATURE = [str(col) for col in list(df['train'].columns) if col != COL_OUTCOME]

def my_input_fn(dataset):
    # Save dataset in tf format
    feature_cols = {str(col): tf.constant(df[dataset][str(col)].values) for col in COL_FEATURE}
    labels = tf.constant([cls2num[labelname] for labelname in df[dataset][COL_OUTCOME].values])
    # Returns the feature columns and labels in tf format
    return feature_cols, labels

feature_columns = [tf.contrib.layers.real_valued_column(column_name=str(col)) for col in COL_FEATURE]
clf = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, 
    hidden_units=[512], 
    n_classes=4, 
    model_dir='/tmp/tfidf_model'
)

# with tf.Session() as tmss:
#     for i in COL_FEATURE:
#         print tma[i].eval()

clf.fit(input_fn=lambda: my_input_fn('train'), steps=2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe1a7b46750>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tfidf_model/model.ckpt.
INFO:tensorflow:loss = 1.38487, step = 1
INFO:tensorflow:global_step/sec: 20.443
INFO:tensorflow:loss = 1.06527, step = 101
INFO:tensorflow:global_step/sec: 20.452
INFO:tensorflow:loss = 0.66833, step = 201
INFO:tensorflow:global_step/sec: 22.1483
INFO:tensorflow:loss = 0.442633, step = 301
INFO:tensorflow:global_step/sec: 22.1208
INFO:tensorflow:loss = 0.326336, step = 401
INFO:tensorflow:global_step/sec: 21.7908
INFO:tensorflow:loss = 0.25951, step = 501
INFO:tensorflow:global_step/sec: 21.7004
INFO:tensorflow:loss = 0.216658, step = 601
INFO:t

DNNClassifier(params={'head': <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead object at 0x7fe1a7a11d50>, 'hidden_units': [512], 'feature_columns': (_RealValuedColumn(column_name='0', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='1', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='10', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='100', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='101', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='102', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='103', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='104', dimension=1, default_value=None, dtype=tf.float32, norm

In [114]:
accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=1)['accuracy']
print "Tst Accuracy: {}".format(accuracy_score)

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-04-24-07:31:41
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2017-04-24-07:31:41
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.893262, auc = 0.985283, global_step = 2000, loss = 0.288232
Tst Accuracy: 0.893262147903


In [152]:
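# my_input_fn feeds the whole test set as one batch, so steps=1 (previous cell) already
# covers every sample; steps=N evaluates that same batch N times and only adds runtime.
accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=df['test'].shape[0])['accuracy']
print "Tst Accuracy: {}".format(accuracy_score)
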
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-04-24-08:20:47
INFO:tensorflow:Evaluation [1/1499]
INFO:tensorflow:Evaluation [2/1499]
INFO:tensorflow:Evaluation [3/1499]
...
INFO:tensorflow:Evaluation [1406/1499]
INFO:tensorflow:Evaluatio

In [142]:
X_tensor_test, yt = my_input_fn('test')

# tensorPredCls = clf.predict(df['test'].drop('class', axis=1))
tensorPredCls = list(clf.predict(input_fn=lambda: my_input_fn('test')))
# y_test_true = df['test']['class']
# print f1_score(y_test_true, tensorPredCls, average='micro')
# print tensorPredCls
print cls2num

{'comp.os.ms-windows.misc': 1, 'sci.med': 3, 'misc.forsale': 2, 'alt.atheism': 0}


In [147]:
from sklearn.metrics import f1_score  # needed here when cells run top to bottom

num2cls = {v:k for (k, v) in cls2num.items()}
tensorPredClsStr = [num2cls[i] for i in tensorPredCls]
y_test_true = df['test']['class']
print f1_score(y_test_true, tensorPredClsStr, average='micro')

0.896597731821


In [None]:
# Use TensorFlow to train CNN



In [115]:
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']

# clf = KNeighborsClassifier()
clf = LinearSVC()
# clf = SGDClassifier()
# clf = GaussianNB()
# clf = DecisionTreeClassifier()
# clf = MultinomialNB()
# clf = MLPClassifier()
clf.fit(X_train, y_train)
# print "Training finished with clf with [n_classes, n_features]: {}".format(clf.coef_.shape)
print "Step 4 finished"

Step 4 finished


In [131]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

y_test_pred = clf.predict(X_test)
print accuracy_score(y_test_true, y_test_pred)
print f1_score(y_test_true, y_test_pred, average='macro')
print f1_score(y_test_true, y_test_pred, average='micro')

print f1_score(y_test_true, tensorPredCls, average='micro')

0.895263509006
0.894369000208
0.895263509006
0.895263509006
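
The accuracy and the micro-averaged F1 printed above are identical, which is expected for single-label multiclass data: every misclassified sample counts as one false positive for the predicted class and one false negative for the true class, so micro precision, recall and F1 all reduce to accuracy. The pure-Python sketch below illustrates this on toy data; `f1_scores` is a hypothetical helper, not scikit-learn's implementation.

```python
def f1_scores(y_true, y_pred):
    """Return (accuracy, micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2.0 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / float(len(y_true))
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return accuracy, micro, macro


acc, micro, macro = f1_scores(list("aabbc"), list("abbbc"))
print(acc, micro, macro)
```

Macro F1 weights every class equally, so it differs from accuracy whenever the per-class scores are uneven.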


In [6]:
# s = ""
# with open('stoplist.txt', 'w') as stoplistfile:
#     for w in vectorizer.stop_words_:
#         s += "{} ".format(w)
#     stoplistfile.write(s)
    
# print "Output stoplist successfully."

Output stoplist successfully.


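## Improvement notes

As the number of classes grows, classification performance drops.

In that case, performance can be improved by tuning the following parameters:

+ The number of features
+ A suitable stop word list
+ ...?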
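
To reason about the first two knobs without rerunning the whole notebook, the sketch below loosely mirrors the documented meaning of `stop_words`, `min_df`, `max_df` and `max_features` (floats are proportions of documents, integers are absolute counts, and `max_features` keeps the terms with the highest corpus-wide counts). It is an illustration under those assumptions: `prune_vocabulary`, the whitespace tokenizer and the alphabetical tie-breaking are hypothetical and not scikit-learn's implementation.

```python
from collections import Counter


def prune_vocabulary(docs, stop_words=(), min_df=1, max_df=1.0, max_features=None):
    """Select terms by document frequency, then keep the most frequent ones."""
    tokenized = [[t for t in doc.lower().split() if t not in stop_words]
                 for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    term_freq = Counter(t for toks in tokenized for t in toks)
    # Floats are proportions of documents, integers are absolute counts.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    kept = [t for t in doc_freq if low <= doc_freq[t] <= high]
    # Highest corpus-wide count first; ties broken alphabetically (assumption).
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


docs = ["the cat sat", "the dog sat", "the bird flew away"]
print(prune_vocabulary(docs, stop_words={"the"}, min_df=2))
print(prune_vocabulary(docs, max_df=0.9))
print(prune_vocabulary(docs, max_features=2))
```

Comparing the surviving vocabulary for a few settings shows quickly whether a tighter `max_df` or a longer stop word list is removing the uninformative terms.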