## References

+ [Tf-idf term weighting](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)
+ [sklearn.feature_extraction.text.TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer)

## Planned experiment workflow

1. Preprocess every file with a modified `wikifil.pl` (referred to as `newfil.pl`)
    + Modification: comment out the statements on lines 13-15 and add a `{` on line 16
2. Read the text preprocessed in step 1 and store it in `data['nearRaw.train']` and `data['nearRaw.test']`
    + The **text content** of each sample goes into `data['nearRaw.train']['content']` or `data['nearRaw.test']['content']`, depending on whether the sample belongs to the training set or the test set
    + The **correct class** of each sample goes into `data['nearRaw.train']['class']` or `data['nearRaw.test']['class']`, again depending on the split
3. [Vectorization + TF-IDF] + transformation + storage
    1. Use `sklearn.feature_extraction.text.TfidfVectorizer` for bag-of-words count vectorization and TF-IDF weighting
        + Use [`sklearn.feature_extraction.text.TfidfVectorizer.fit`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit) to learn the vocabulary and the IDF weights
    2. Use [`sklearn.feature_extraction.text.TfidfVectorizer.transform`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) to process the `content` of `data['nearRaw.train']` and of `data['nearRaw.test']`, writing the results to `data['matrix.train']` and `data['matrix.test']` for later training and evaluation
    3. Convert `data['matrix.train']` and `data['matrix.test']` to `pandas.DataFrame` and store them in `df['train']` and `df['test']` (`df` is a dictionary: `String -> DataFrame`)
4. Train a classifier `clf` on `df['train']` with a learning method of your choice
5. Classify `df['test']` with `clf` and evaluate how well `clf` performs
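
To make step 3 more concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The toy corpus, the whitespace tokenizer, and the helper names `fit_idf` and `tfidf_rows` are assumptions for illustration only. As in the plan above, the vocabulary and IDF weights are learned from the training documents alone and then reused for the test documents.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    vocab = sorted(doc_freq)
    idf = {t: math.log((1.0 + n_docs) / (1.0 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf

def tfidf_rows(docs, vocab, idf):
    """Turn documents into L2-normalised TF-IDF rows; unknown terms are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm > 0 else 0.0 for v in row])
    return rows

# Hypothetical toy data, not taken from the newsgroup corpus.
train_docs = ["diet weight diet", "windows driver", "weight sale"]
test_docs = ["diet sale unknown"]

vocab, idf = fit_idf(train_docs)
train_matrix = tfidf_rows(train_docs, vocab, idf)
test_matrix = tfidf_rows(test_docs, vocab, idf)
```

The word `unknown` never occurs in the training documents, so it contributes no column to `test_matrix`; this mirrors calling `fit` on the training set only and `transform` on both sets.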


In [1]:
# Step 0: import modules

import os
import tensorflow as tf
import pandas as pd
import numpy as np
from IPython.display import display
%matplotlib inline

print "import modules successfully"

import modules successfully


In [3]:
# Step 1: preprocess the data

paths = {}
paths['dir.train'] = os.path.join(os.getcwd(), 'data', 'trialdata', 'train')
paths['dir.test'] = os.path.join(os.getcwd(), 'data', 'trialdata', 'test')

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for cls in os.listdir(dirpath):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            os.system('mv {} {}.old'.format(fpath, fpath))
            os.system('perl newfil.pl {}.old > {}'.format(fpath, fpath))
            os.system('rm {}.old'.format(fpath))

print "file preprocessing successfully"

stopwordlist = []
with open(os.path.join(os.getcwd(), 'data', 'stoplist-web.txt'), 'r') as readf:
    stopwordlist = readf.read()
    stopwordlist = stopwordlist.split('\n')

print "read stop word list successfully"

print "Step 1 Succeed"

file preprocessing successfully
read stop word list successfully
Step 1 Succeed
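
The `mv` / `perl` / `rm` sequence in Step 1 interpolates file paths into shell command strings, which would break on paths that contain spaces or shell metacharacters. One possible alternative is sketched below, assuming Python 3 (the notebook itself uses Python 2): arguments are passed as a list, and the file is replaced only after the filter has succeeded. The helper name `filter_in_place` is hypothetical, and the demo command is a stand-in for `perl newfil.pl`.

```python
import os
import subprocess
import sys
import tempfile

def filter_in_place(fpath, command):
    """Run `command + [fpath]` and overwrite fpath with its stdout on success."""
    # check=True raises CalledProcessError if the filter exits with an error,
    # in which case the original file is left untouched.
    result = subprocess.run(command + [fpath], stdout=subprocess.PIPE, check=True)
    tmp_path = fpath + '.tmp'
    with open(tmp_path, 'wb') as out:
        out.write(result.stdout)
    os.replace(tmp_path, fpath)

# Demo with a stand-in filter that lower-cases the file content.
demo_filter = [sys.executable, '-c',
               'import sys; sys.stdout.write(open(sys.argv[1]).read().lower())']
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, 'sample file.txt')  # note the space in the name
with open(demo_file, 'w') as f:
    f.write('Hello WORLD')

filter_in_place(demo_file, demo_filter)
with open(demo_file) as f:
    demo_result = f.read()
```

Inside the Step 1 loop the call would then look like `filter_in_place(fpath, ['perl', 'newfil.pl'])`.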


In [4]:
# Step 2: read data and save it in data['nearRaw.train'] and data['nearRaw.test']
import random

data = {}
data['nearRaw.train'] = {'content':[], 'class':[]}
data['nearRaw.test'] = {'content':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                data['nearRaw.{}'.format(tpart)]['content'].append(readf.read())
                data['nearRaw.{}'.format(tpart)]['class'].append(cls)
    tmp = data['nearRaw.{}'.format(tpart)]
    ind = (random.sample(range(len(tmp['class'])), 1))[0]
    print "sample(transformed) from {}[{}]:\n[content]\n {}\n[class]\n{}".format(tpart, ind, tmp['content'][ind], tmp['class'][ind])
    print 
    
print "Step 2 Succeed"

sample(transformed) from train[2198]:
[content]
  from caf omen uucp chuck forsberg wa seven kgx subject re my new diet it works great organization omen technology inc portland rain forest lines three two in article bhjelle carina unm edu writes gordon banks a lot to keep from going back to morbid obesity i think all of us cycle one s success depends on how large the fluctuations in the cycle are some people can cycle only five pounds unfortunately i m not one of them this certainly describes my situation perfectly for me there is a constant dynamic between my tendency to eat which appears to be totally limitless and the purely conscious desire to not put on too much weight when i get too fat i just diet exercise more with varying degrees of success to take off the extra weight usually i cycle within a one five lb range but smaller and larger cycles occur as well i m always afraid that this method will stop working someday but usually i seem to be able to hold the weight gain in check 

In [7]:
# Step 3: TfidfVectorizer.fit + TfidfVectorizer.transform + save in pandas.DataFrame
# A. Fit sklearn.feature_extraction.text.TfidfVectorizer on the training data to build the BOW + TF-IDF model
# B. Use sklearn.feature_extraction.text.TfidfVectorizer.transform to process the 'content' of data['nearRaw.train'] and of data['nearRaw.test'],
#    writing the BOW + TF-IDF representation to data['matrix.train'] and data['matrix.test'] for later training and evaluation
#
# C. Convert data['matrix.train'] and data['matrix.test'] to pandas.DataFrame and store them in df['train'] and df['test'] (df is a dict: String -> DataFrame)

from sklearn.feature_extraction.text import TfidfVectorizer

## Substep A: vectorization + TF-IDF calculation
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.01, max_features=784, analyzer='word', stop_words=stopwordlist)
vectorizer.fit(data['nearRaw.train']['content'])

print "Substep A finished."
print "--------------------------------------------------"

## Substep B: Transformation

for tpart in ['train', 'test']:
    data['matrix.{}'.format(tpart)] = vectorizer.transform(data['nearRaw.{}'.format(tpart)]['content'])
    ind = (random.sample(range(data['matrix.{}'.format(tpart)].shape[0]), 1))[0]
    print "sample for matrix.{}".format(tpart)
    print "from ind: {}".format(ind)
    print data['matrix.{}'.format(tpart)][ind]
    print 
    
print "Substep B finished."
print "--------------------------------------------------"

# Substep C: integrate data into DataFrame format

csvpath_root = os.path.join(os.getcwd(), 'data_CSV')
if not os.path.exists(csvpath_root):
    os.mkdir(csvpath_root)

df = {}
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['nearRaw.{}'.format(tpart)]['class']
    for col in range(data['matrix.{}'.format(tpart)].shape[1]):
        datadict[col]= [i[0] for i in data['matrix.{}'.format(tpart)].getcol(col).toarray()]
#         datadict[str(col)]= [i[0] for i in data['matrix.{}'.format(tpart)].getcol(col).toarray()]

    df[tpart] = pd.DataFrame(data=datadict)
    print "See df[{}]".format(tpart)
    display(df[tpart])
    print "\n\n\n"
    # write data in DataFrame into CSV
    csvpath = os.path.join(csvpath_root, "{}.csv".format(tpart))
    df[tpart].to_csv(csvpath, columns=df[tpart].columns)

print "Substep C finished."
print "--------------------------------------------------"

print "Step 3 Succeed."

# Tedious part: working out how to arrange the CSR matrix data into a DataFrame so that each row lines up with its class

Substep A finished.
--------------------------------------------------
sample for matrix.train
from ind: 594
  (0, 783)	0.0473372893787
  (0, 780)	0.12409735345
  (0, 779)	0.128039552468
  (0, 772)	0.0619350976582
  (0, 763)	0.10492589363
  (0, 756)	0.176067331267
  (0, 721)	0.0975047424023
  (0, 720)	0.272062564452
  (0, 719)	0.100194071446
  (0, 702)	0.131352503714
  (0, 699)	0.112384419295
  (0, 692)	0.166318180646
  (0, 686)	0.09201259308
  (0, 685)	0.139027202352
  (0, 624)	0.10492589363
  (0, 555)	0.0510080504441
  (0, 530)	0.250419474354
  (0, 529)	0.137327533663
  (0, 493)	0.151155006983
  (0, 485)	0.128535181508
  (0, 484)	0.154760199257
  (0, 475)	0.240352954543
  (0, 462)	0.112090345489
  (0, 460)	0.130860390808
  (0, 437)	0.101177348663
  (0, 427)	0.147457915718
  (0, 366)	0.232787363357
  (0, 336)	0.307380398627
  (0, 302)	0.133939076332
  (0, 293)	0.144126596858
  (0, 278)	0.152146318665
  (0, 258)	0.0505113256479
  (0, 249)	0.204433120048
  (0, 223)	0.131128935395
  (0, 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.145416,0.000000,0.000000,0.000000,0.000000,0.138920,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.122551,0.000000,0.000000,0.093494,alt.atheism
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
3,0.009725,0.000000,0.021898,0.000000,0.011967,0.018582,0.010086,0.0,0.000000,0.009667,...,0.018549,0.000000,0.016558,0.000000,0.009535,0.027725,0.012015,0.000000,0.042303,alt.atheism
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.069364,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.071187,0.000000,0.000000,0.000000,0.026318,alt.atheism
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.059393,0.000000,0.000000,0.000000,0.000000,0.000000,0.075871,alt.atheism
7,0.000000,0.123556,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
8,0.000000,0.000000,0.053000,0.000000,0.000000,0.000000,0.000000,0.0,0.060813,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.092313,0.044735,0.000000,0.000000,0.187709,alt.atheism
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.077663,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.027448,alt.atheism






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.154511,alt.atheism
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.078954,alt.atheism
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.262639,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.178762,0.000000,alt.atheism
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.159341,alt.atheism
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.068537,alt.atheism
8,0.000000,0.074479,0.000000,0.016807,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.022231,0.000000,0.039135,alt.atheism
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism






Substep C finished.
--------------------------------------------------
Step 3 Succeed.
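
Substep C builds the DataFrame one column at a time with `getcol(col).toarray()`. A shorter route, sketched here under the assumption that the dense matrix fits in memory (plausible for 784 columns and a few thousand rows), densifies the CSR matrix once and attaches the class column afterwards. Row order is preserved, so row `i` of the matrix still lines up with `classes[i]`. The toy values and variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for data['matrix.train'] and data['nearRaw.train']['class'].
matrix = csr_matrix(np.array([[0.0, 0.5, 0.0],
                              [0.2, 0.0, 0.9]]))
classes = ['alt.atheism', 'sci.med']

# One dense conversion instead of one getcol() call per feature.
frame = pd.DataFrame(matrix.toarray())
frame['class'] = classes
```

The resulting columns are the integer feature indices followed by `class`, which matches the layout of the tables shown above.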


### CNN trial

In [2]:
# Optional: reload the data from the CSV files written in Step 3

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(os.getcwd(), 'data_CSV', '{}.csv'.format(tpart))
    if os.path.exists(csvpath):
        # pd.DataFrame.from_csv is deprecated; read_csv with index_col=0 reads the same layout
        df[tpart] = pd.read_csv(csvpath, index_col=0)
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print "read {} successfully".format(tpart)
        display(df[tpart])


read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,sci.med
1,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.196022,comp.os.ms-windows.misc
2,0.000000,0.000000,0.000000,0.071039,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.055138,comp.os.ms-windows.misc
3,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.270149,misc.forsale
4,0.000000,0.000000,0.000000,0.136546,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,comp.os.ms-windows.misc
5,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.146443,sci.med
6,0.139493,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.0,0.090097,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.176970,sci.med
7,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.168165,0.0,0.000000,sci.med
8,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.145928,sci.med
9,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.136204,comp.os.ms-windows.misc


read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.078843,misc.forsale
1,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.053303,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.558905,misc.forsale
2,0.076496,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.065118,0.000000,0.000000,0.000000,0.000000,0.000000,0.055456,sci.med
3,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
4,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.221145,misc.forsale
5,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.745294,comp.os.ms-windows.misc
6,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.195553,0.491431,misc.forsale
7,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.052262,0.000000,0.000000,0.000000,alt.atheism
8,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.099883,0.000000,0.089162,0.000000,0.000000,0.000000,0.000000,0.000000,0.113898,alt.atheism
9,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.307215,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.434309,misc.forsale


In [3]:
from sklearn.preprocessing import LabelBinarizer

csvpath_root = os.path.join(os.getcwd(), 'data_CSV')

df_new = {}

lb = LabelBinarizer()
lb.fit(df['train']['class'])

for tpart in ['train', 'test']:
    labels = lb.transform(df[tpart]['class'])
    labelsDf = pd.DataFrame(labels, columns=["class-{}".format(i) for i in range(len(lb.classes_))])
    df_new[tpart] = {}
    df_new[tpart]['y'] = labelsDf
    df_new[tpart]['x'] = df[tpart].drop('class', axis=1)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    #save in CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}.csv".format(tpart, subpart))
        df_new[tpart][subpart].to_csv(csvpath)
    
print "label cleaning successfully"

label cleaning successfully
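
For reference, `LabelBinarizer` stores `classes_` in sorted order and, with more than two classes, emits one indicator column per class. The columns `class-0` to `class-3` therefore follow the alphabetical order of the newsgroup names. A small pure-Python sketch of that encoding (the helper name `one_hot` is an assumption for illustration):

```python
def one_hot(labels, classes=None):
    """Return (classes, rows) with one indicator row per label.

    If `classes` is None they are learned from `labels` in sorted order,
    which corresponds to fitting on the training labels.
    """
    if classes is None:
        classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    rows = [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

train_labels = ['sci.med', 'misc.forsale', 'alt.atheism',
                'comp.os.ms-windows.misc', 'sci.med']
classes, train_y = one_hot(train_labels)          # "fit" on the training labels
_, test_y = one_hot(['alt.atheism'], classes)     # reuse the same column order
```

Reusing `classes` for the test labels is what keeps the column meaning identical across both splits, just as the notebook fits `lb` on `df['train']` only.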


In [4]:
tf.logging.set_verbosity(tf.logging.INFO)

sess = tf.InteractiveSession()

COL_OUTCOME = 'class'
COL_FEATURE = [col for col in list(df['train'].columns) if col != COL_OUTCOME]

# cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

count_feature = len(COL_FEATURE)
count_class = len(df['train']['class'].unique())

x = tf.placeholder(tf.float32, shape=[None, count_feature], name='x')  # count_feature is 784 here; the CNN below reshapes it to 28 x 28
y_ = tf.placeholder(tf.float32, shape=[None, count_class], name='y_')

W = tf.Variable(tf.zeros([count_feature, count_class]))
b = tf.Variable(tf.zeros([count_class]))
y = tf.matmul(x, W) + b

# cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

In [5]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

In [6]:
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

In [7]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

In [8]:
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

In [9]:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_text = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_text, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32, name='keep_prob')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, count_class])
b_fc2 = bias_variable([count_class])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
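
The `7 * 7 * 64` in `W_fc1` follows from the layer shapes: the 784 TF-IDF features (`max_features=784`) are reshaped to 28 x 28 x 1, each stride-1 `SAME` convolution keeps the spatial size, and each 2 x 2 max-pool with stride 2 halves it. The arithmetic can be checked with a few lines of plain Python; the helper name `same_out` is an assumption, and no TensorFlow is needed.

```python
import math

def same_out(size, stride):
    """Spatial output size for padding='SAME': ceil(size / stride)."""
    return int(math.ceil(size / float(stride)))

side, channels = 28, 1              # 784 features reshaped to 28 x 28 x 1
shapes = [(side, side, channels)]
for out_channels in (32, 64):       # the two convolution blocks of the notebook
    side = same_out(side, 1)        # conv2d, stride 1: spatial size unchanged
    shapes.append((side, side, out_channels))
    side = same_out(side, 2)        # max_pool_2x2, stride 2: spatial size halved
    shapes.append((side, side, out_channels))

flat_size = side * side * 64        # input width of the first fully connected layer
```

Note that this layout only works while the vectorizer yields exactly 784 features; with a different feature count the reshape to 28 x 28 would fail.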

In [10]:
### Start to train and evaluate the model

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
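
As a rough illustration of what the loss and accuracy nodes compute, the sketch below re-implements softmax cross-entropy and arg-max accuracy in plain Python for a small batch. The helper names and the toy logits are assumptions; this is not TensorFlow's implementation, only the same formulas.

```python
import math

def softmax(logits):
    """Numerically stable softmax of one logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_cross_entropy(batch_logits, batch_onehot):
    """Mean over the batch of -sum(y_true * log(softmax(logits)))."""
    losses = []
    for logits, onehot in zip(batch_logits, batch_onehot):
        probs = softmax(logits)
        # Only the true class (t == 1) contributes to the sum.
        losses.append(-sum(t * math.log(p) for t, p in zip(onehot, probs) if t))
    return sum(losses) / len(losses)

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_of(batch_logits, batch_onehot):
    """Fraction of rows whose largest logit matches the one-hot label."""
    hits = [argmax(l) == argmax(t) for l, t in zip(batch_logits, batch_onehot)]
    return sum(hits) / float(len(hits))

# With all-zero logits every one of the 4 classes gets probability 0.25.
uniform_loss = mean_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [[0, 0, 1, 0]])
toy_accuracy = accuracy_of([[2.0, 1.0, 0.0, 0.0], [0.0, 0.0, 3.0, 1.0]],
                           [[1, 0, 0, 0], [0, 0, 0, 1]])
```

An untrained 4-class model that predicts uniformly has a loss of ln 4, about 1.386, which is a useful reference point when reading the loss values.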

In [11]:
sess.run(tf.global_variables_initializer())

In [12]:
x_input = df_new['train']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['train']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]
# y_input = [np.array([y_input.iloc[i].values]) for i in range(y_input.shape[0])]

# Samples are fed in stored order, one per training step (no random mini-batches).

for i in range(df['train'].shape[0]):
    if 0 == i % 100:
        # Every 100 steps, report the mean accuracy over the next 50 training samples.
        # NOTE: i + j must stay within the training set, otherwise this raises an IndexError.
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x: x_input[i+j],
                    y_: y_input[i+j]
                })
            )
        print "step {}, training accuracy {}".format(i, np.mean(train_accuracy))
    train_step.run(feed_dict={
        keep_prob: 0.5,
        x:  x_input[i],
        y_: y_input[i]
    })

print "finish train for CNN"

step 0, training accuracy 0.319999992847
step 100, training accuracy 0.259999990463
step 200, training accuracy 0.239999994636
step 300, training accuracy 0.319999992847
step 400, training accuracy 0.239999994636
step 500, training accuracy 0.300000011921
step 600, training accuracy 0.419999986887
step 700, training accuracy 0.239999994636
step 800, training accuracy 0.419999986887
step 900, training accuracy 0.319999992847
step 1000, training accuracy 0.479999989271
step 1100, training accuracy 0.360000014305
step 1200, training accuracy 0.439999997616
step 1300, training accuracy 0.439999997616
step 1400, training accuracy 0.519999980927
step 1500, training accuracy 0.579999983311
step 1600, training accuracy 0.600000023842
step 1700, training accuracy 0.340000003576
step 1800, training accuracy 0.5
step 1900, training accuracy 0.5
step 2000, training accuracy 0.620000004768
step 2100, training accuracy 0.479999989271
step 2200, training accuracy 0.5
finish CNN
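
The loop above performs one parameter update per training sample and passes over the data once. A common variation, sketched here as an assumption rather than as what the notebook does, is to shuffle the sample indices and feed them in mini-batches. The generator name `iter_minibatches` and the batch size are illustrative.

```python
import random

def iter_minibatches(n_samples, batch_size, seed=0):
    """Yield lists of sample indices covering 0..n_samples-1 once, in shuffled order."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)   # fixed seed keeps runs reproducible
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

batches = list(iter_minibatches(n_samples=10, batch_size=4))
```

Each index list could then select rows of the feature and label arrays so that `x` receives a `[batch, 784]` block per step instead of a single row; the last batch is simply shorter when the sample count is not a multiple of the batch size.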


In [13]:
x_input = df_new['test']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['test']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]

# Every 100 samples, report the mean accuracy over the next 50 test samples.
# NOTE: the printed label still reads "training accuracy", but these are test-set accuracies.
for i in range(df['test'].shape[0]):
    if 0 == i % 100:
        test_accuracy = []
        for j in range(50):
            test_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x: x_input[i+j],
                    y_: y_input[i+j]
                })
            )
        print "step {}, training accuracy {}".format(i, np.mean(test_accuracy))

print "finish test for CNN"

step 0, training accuracy 0.540000021458
step 100, training accuracy 0.40000000596
step 200, training accuracy 0.5
step 300, training accuracy 0.40000000596
step 400, training accuracy 0.419999986887
step 500, training accuracy 0.360000014305
step 600, training accuracy 0.239999994636
step 700, training accuracy 0.419999986887
step 800, training accuracy 0.5
step 900, training accuracy 0.439999997616
step 1000, training accuracy 0.379999995232
step 1100, training accuracy 0.259999990463
step 1200, training accuracy 0.439999997616
step 1300, training accuracy 0.519999980927
step 1400, training accuracy 0.5
finish test for CNN
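
Each line of the output above is the mean accuracy over 50 consecutive test samples, starting at every 100th index. The 15 blocks therefore cover 750 samples, not the whole test set. Because all blocks have the same size, their unweighted mean is the accuracy over those 750 samples. The short calculation below uses the logged values rounded to two decimals.

```python
# Block accuracies copied from the log above, rounded to two decimals.
block_accuracies = [0.54, 0.40, 0.50, 0.40, 0.42, 0.36, 0.24, 0.42,
                    0.50, 0.44, 0.38, 0.26, 0.44, 0.52, 0.50]
samples_per_block = 50

evaluated = samples_per_block * len(block_accuracies)       # 750 samples
overall = sum(block_accuracies) / len(block_accuracies)     # about 0.421
```

So on the sampled part of the test set the CNN classifies roughly 42% of the documents correctly.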


### DNN trial

In [16]:
# Use TensorFlow to train the DNN

cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

print cls2num
print df['train'][COL_OUTCOME].values

{'sci.med': 0, 'comp.os.ms-windows.misc': 1, 'misc.forsale': 2, 'alt.atheism': 3}
['sci.med' 'comp.os.ms-windows.misc' 'comp.os.ms-windows.misc' ...,
 'sci.med' 'sci.med' 'alt.atheism']
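
One caveat about `cls2num`: `unique()` returns the labels in order of first appearance, and `df['train']` was shuffled with `sample(frac=1)` without a fixed seed, so the label-to-number mapping may differ from run to run. That could matter if a checkpoint in `model_dir` is reused with a different mapping. A deterministic alternative, sketched with a hypothetical helper `build_class_index`, sorts the class names first:

```python
def build_class_index(labels):
    """Map each distinct label to an integer, independent of the row order."""
    return {cls: ind for ind, cls in enumerate(sorted(set(labels)))}

# Two differently ordered label sequences give the same mapping.
run_a = build_class_index(['sci.med', 'alt.atheism', 'misc.forsale', 'sci.med'])
run_b = build_class_index(['misc.forsale', 'sci.med', 'alt.atheism'])
```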


In [17]:
# train the classifier

COL_OUTCOME = 'class'
COL_FEATURE = [str(col) for col in list(df['train'].columns) if col != COL_OUTCOME]

def my_input_fn(dataset):
    # Save dataset in tf format
    feature_cols = {str(col): tf.constant(df[dataset][str(col)].values) for col in COL_FEATURE}
    labels = tf.constant([cls2num[labelname] for labelname in df[dataset][COL_OUTCOME].values])
    # Returns the feature columns and labels in tf format
    return feature_cols, labels

feature_columns = [tf.contrib.layers.real_valued_column(column_name=str(col)) for col in COL_FEATURE]
clf = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, 
    hidden_units=[512], 
    n_classes=len(cls2num),  # 4 newsgroup classes in this data set
    model_dir='/tmp/tfidf_model'
)

# with tf.Session() as tmss:
#     for i in COL_FEATURE:
#         print tma[i].eval()

clf.fit(input_fn=lambda: my_input_fn('train'), steps=2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9239762310>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tfidf_model/model.ckpt.
INFO:tensorflow:loss = 1.38679, step = 1
INFO:tensorflow:global_step/sec: 13.1708
INFO:tensorflow:loss = 1.11517, step = 101
INFO:tensorflow:global_step/sec: 13.831
INFO:tensorflow:loss = 0.715512, step = 201
INFO:tensorflow:global_step/sec: 14.2298
INFO:tensorflow:loss = 0.460608, step = 301
INFO:tensorflow:global_step/sec: 14.5536
INFO:tensorflow:loss = 0.327819, step = 401
INFO:tensorflow:global_step/sec: 14.5296
INFO:tensorflow:loss = 0.252887, step = 501
INFO:tensorflow:global_step/sec: 14.524
INFO:tensorflow:loss = 0.205736, step = 601
INFO

DNNClassifier(params={'head': <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead object at 0x7f92b0351910>, 'hidden_units': [512], 'feature_columns': (_RealValuedColumn(column_name='0', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='1', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='2', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='3', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='4', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='5', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='6', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='7', dimension=1, default_value=None, dtype=tf.float32, normalizer=None

In [18]:
accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=1)['accuracy']
print "Tst Accuracy: {}".format(accuracy_score)

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-04-29-05:19:00
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2017-04-29-05:19:00
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.902602, auc = 0.987447, global_step = 2000, loss = 0.264541
Tst Accuracy: 0.902601718903
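
A note on `steps=1`: `my_input_fn` returns the whole test split as constants, so a single evaluation step already sees every test sample. Assuming the evaluation metrics are accumulated as running averages across steps, repeating the same full batch for more steps should not change the reported accuracy. The plain-Python sketch below illustrates that reasoning with made-up predictions; `streaming_accuracy` is a hypothetical helper, not a TensorFlow function.

```python
def streaming_accuracy(batches):
    """Accumulate correct/total counts over (predictions, labels) batches."""
    correct = 0
    total = 0
    for predictions, labels in batches:
        correct += sum(1 for p, t in zip(predictions, labels) if p == t)
        total += len(labels)
    return correct / float(total)

# Toy full batch: (predicted classes, true classes), 4 of 5 correct.
full_batch = ([0, 1, 2, 3, 1], [0, 1, 2, 2, 1])

one_step = streaming_accuracy([full_batch])
many_steps = streaming_accuracy([full_batch] * 1499)
```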


In [19]:
accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=df['test'].shape[0])['accuracy']
print "Tst Accuracy: {}".format(accuracy_score)

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-04-29-05:19:36
INFO:tensorflow:Evaluation [1/1499]
INFO:tensorflow:Evaluation [2/1499]
INFO:tensorflow:Evaluation [3/1499]
INFO:tensorflow:Evaluation [4/1499]
INFO:tensorflow:Evaluation [5/1499]
INFO:tensorflow:Evaluation [6/1499]
INFO:tensorflow:Evaluation [7/1499]
INFO:tensorflow:Evaluation [8/1499]
INFO:tensorflow:Evaluation [9/1499]
INFO:tensorflow:Evaluation [10/1499]
INFO:tensorflow:Evaluation [11/1499]
INFO:tensorflow:Evaluation [12/1499]
INFO:tensorflow:Evaluation [13/1499]
INFO:tensorflow:Evaluation [14/1499]
INFO:tensorflow:Evaluation [15/1499]
INFO:tensorflow:Evaluation [16/1499]
INFO:tensorflow:Evaluation [1

INFO:tensorflow:Evaluation [73/1499]
INFO:tensorflow:Evaluation [74/1499]
INFO:tensorflow:Evaluation [75/1499]
INFO:tensorflow:Evaluation [76/1499]
INFO:tensorflow:Evaluation [77/1499]
INFO:tensorflow:Evaluation [78/1499]
INFO:tensorflow:Evaluation [79/1499]
INFO:tensorflow:Evaluation [80/1499]
INFO:tensorflow:Evaluation [81/1499]
INFO:tensorflow:Evaluation [82/1499]
INFO:tensorflow:Evaluation [83/1499]
INFO:tensorflow:Evaluation [84/1499]
INFO:tensorflow:Evaluation [85/1499]
INFO:tensorflow:Evaluation [86/1499]
INFO:tensorflow:Evaluation [87/1499]
INFO:tensorflow:Evaluation [88/1499]
INFO:tensorflow:Evaluation [89/1499]
INFO:tensorflow:Evaluation [90/1499]
INFO:tensorflow:Evaluation [91/1499]
INFO:tensorflow:Evaluation [92/1499]
INFO:tensorflow:Evaluation [93/1499]
INFO:tensorflow:Evaluation [94/1499]
INFO:tensorflow:Evaluation [95/1499]
INFO:tensorflow:Evaluation [96/1499]
INFO:tensorflow:Evaluation [97/1499]
INFO:tensorflow:Evaluation [98/1499]
INFO:tensorflow:Evaluation [99/1499]
I

INFO:tensorflow:Evaluation [290/1499]
INFO:tensorflow:Evaluation [291/1499]
INFO:tensorflow:Evaluation [292/1499]
INFO:tensorflow:Evaluation [293/1499]
INFO:tensorflow:Evaluation [294/1499]
INFO:tensorflow:Evaluation [295/1499]
INFO:tensorflow:Evaluation [296/1499]
INFO:tensorflow:Evaluation [297/1499]
INFO:tensorflow:Evaluation [298/1499]
INFO:tensorflow:Evaluation [299/1499]
INFO:tensorflow:Evaluation [300/1499]
INFO:tensorflow:Evaluation [301/1499]
INFO:tensorflow:Evaluation [302/1499]
INFO:tensorflow:Evaluation [303/1499]
INFO:tensorflow:Evaluation [304/1499]
INFO:tensorflow:Evaluation [305/1499]
INFO:tensorflow:Evaluation [306/1499]
INFO:tensorflow:Evaluation [307/1499]
INFO:tensorflow:Evaluation [308/1499]
INFO:tensorflow:Evaluation [309/1499]
INFO:tensorflow:Evaluation [310/1499]
INFO:tensorflow:Evaluation [311/1499]
INFO:tensorflow:Evaluation [312/1499]
INFO:tensorflow:Evaluation [313/1499]
INFO:tensorflow:Evaluation [314/1499]
INFO:tensorflow:Evaluation [315/1499]
INFO:tensorf

INFO:tensorflow:Evaluation [506/1499]
INFO:tensorflow:Evaluation [507/1499]
INFO:tensorflow:Evaluation [508/1499]
INFO:tensorflow:Evaluation [509/1499]
INFO:tensorflow:Evaluation [510/1499]
INFO:tensorflow:Evaluation [511/1499]
INFO:tensorflow:Evaluation [512/1499]
INFO:tensorflow:Evaluation [513/1499]
INFO:tensorflow:Evaluation [514/1499]
INFO:tensorflow:Evaluation [515/1499]
INFO:tensorflow:Evaluation [516/1499]
INFO:tensorflow:Evaluation [517/1499]
INFO:tensorflow:Evaluation [518/1499]
INFO:tensorflow:Evaluation [519/1499]
INFO:tensorflow:Evaluation [520/1499]
INFO:tensorflow:Evaluation [521/1499]
INFO:tensorflow:Evaluation [522/1499]
INFO:tensorflow:Evaluation [523/1499]
INFO:tensorflow:Evaluation [524/1499]
INFO:tensorflow:Evaluation [525/1499]
INFO:tensorflow:Evaluation [526/1499]
INFO:tensorflow:Evaluation [527/1499]
INFO:tensorflow:Evaluation [528/1499]
INFO:tensorflow:Evaluation [529/1499]
INFO:tensorflow:Evaluation [530/1499]
INFO:tensorflow:Evaluation [531/1499]

INFO:tensorflow:Evaluation [722/1499]
INFO:tensorflow:Evaluation [723/1499]
INFO:tensorflow:Evaluation [724/1499]
INFO:tensorflow:Evaluation [725/1499]
INFO:tensorflow:Evaluation [726/1499]
INFO:tensorflow:Evaluation [727/1499]
INFO:tensorflow:Evaluation [728/1499]
INFO:tensorflow:Evaluation [729/1499]
INFO:tensorflow:Evaluation [730/1499]
INFO:tensorflow:Evaluation [731/1499]
INFO:tensorflow:Evaluation [732/1499]
INFO:tensorflow:Evaluation [733/1499]
INFO:tensorflow:Evaluation [734/1499]
INFO:tensorflow:Evaluation [735/1499]
INFO:tensorflow:Evaluation [736/1499]
INFO:tensorflow:Evaluation [737/1499]
INFO:tensorflow:Evaluation [738/1499]
INFO:tensorflow:Evaluation [739/1499]
INFO:tensorflow:Evaluation [740/1499]
INFO:tensorflow:Evaluation [741/1499]
INFO:tensorflow:Evaluation [742/1499]
INFO:tensorflow:Evaluation [743/1499]
INFO:tensorflow:Evaluation [744/1499]
INFO:tensorflow:Evaluation [745/1499]
INFO:tensorflow:Evaluation [746/1499]
INFO:tensorflow:Evaluation [747/1499]

INFO:tensorflow:Evaluation [938/1499]
INFO:tensorflow:Evaluation [939/1499]
INFO:tensorflow:Evaluation [940/1499]
INFO:tensorflow:Evaluation [941/1499]
INFO:tensorflow:Evaluation [942/1499]
INFO:tensorflow:Evaluation [943/1499]
INFO:tensorflow:Evaluation [944/1499]
INFO:tensorflow:Evaluation [945/1499]
INFO:tensorflow:Evaluation [946/1499]
INFO:tensorflow:Evaluation [947/1499]
INFO:tensorflow:Evaluation [948/1499]
INFO:tensorflow:Evaluation [949/1499]
INFO:tensorflow:Evaluation [950/1499]
INFO:tensorflow:Evaluation [951/1499]
INFO:tensorflow:Evaluation [952/1499]
INFO:tensorflow:Evaluation [953/1499]
INFO:tensorflow:Evaluation [954/1499]
INFO:tensorflow:Evaluation [955/1499]
INFO:tensorflow:Evaluation [956/1499]
INFO:tensorflow:Evaluation [957/1499]
INFO:tensorflow:Evaluation [958/1499]
INFO:tensorflow:Evaluation [959/1499]
INFO:tensorflow:Evaluation [960/1499]
INFO:tensorflow:Evaluation [961/1499]
INFO:tensorflow:Evaluation [962/1499]
INFO:tensorflow:Evaluation [963/1499]

INFO:tensorflow:Evaluation [1150/1499]
INFO:tensorflow:Evaluation [1151/1499]
INFO:tensorflow:Evaluation [1152/1499]
INFO:tensorflow:Evaluation [1153/1499]
INFO:tensorflow:Evaluation [1154/1499]
INFO:tensorflow:Evaluation [1155/1499]
INFO:tensorflow:Evaluation [1156/1499]
INFO:tensorflow:Evaluation [1157/1499]
INFO:tensorflow:Evaluation [1158/1499]
INFO:tensorflow:Evaluation [1159/1499]
INFO:tensorflow:Evaluation [1160/1499]
INFO:tensorflow:Evaluation [1161/1499]
INFO:tensorflow:Evaluation [1162/1499]
INFO:tensorflow:Evaluation [1163/1499]
INFO:tensorflow:Evaluation [1164/1499]
INFO:tensorflow:Evaluation [1165/1499]
INFO:tensorflow:Evaluation [1166/1499]
INFO:tensorflow:Evaluation [1167/1499]
INFO:tensorflow:Evaluation [1168/1499]
INFO:tensorflow:Evaluation [1169/1499]
INFO:tensorflow:Evaluation [1170/1499]
INFO:tensorflow:Evaluation [1171/1499]
INFO:tensorflow:Evaluation [1172/1499]
INFO:tensorflow:Evaluation [1173/1499]
INFO:tensorflow:Evaluation [1174/1499]

INFO:tensorflow:Evaluation [1361/1499]
INFO:tensorflow:Evaluation [1362/1499]
INFO:tensorflow:Evaluation [1363/1499]
INFO:tensorflow:Evaluation [1364/1499]
INFO:tensorflow:Evaluation [1365/1499]
INFO:tensorflow:Evaluation [1366/1499]
INFO:tensorflow:Evaluation [1367/1499]
INFO:tensorflow:Evaluation [1368/1499]
INFO:tensorflow:Evaluation [1369/1499]
INFO:tensorflow:Evaluation [1370/1499]
INFO:tensorflow:Evaluation [1371/1499]
INFO:tensorflow:Evaluation [1372/1499]
INFO:tensorflow:Evaluation [1373/1499]
INFO:tensorflow:Evaluation [1374/1499]
INFO:tensorflow:Evaluation [1375/1499]
INFO:tensorflow:Evaluation [1376/1499]
INFO:tensorflow:Evaluation [1377/1499]
INFO:tensorflow:Evaluation [1378/1499]
INFO:tensorflow:Evaluation [1379/1499]
INFO:tensorflow:Evaluation [1380/1499]
INFO:tensorflow:Evaluation [1381/1499]
INFO:tensorflow:Evaluation [1382/1499]
INFO:tensorflow:Evaluation [1383/1499]
INFO:tensorflow:Evaluation [1384/1499]
INFO:tensorflow:Evaluation [1385/1499]

In [20]:
X_tensor_test, yt = my_input_fn('test')

# tensorPredCls = clf.predict(df['test'].drop('class', axis=1))
tensorPredCls = list(clf.predict(input_fn=lambda: my_input_fn('test')))
# y_test_true = df['test']['class']
# print f1_score(y_test_true, tensorPredCls, average='micro')
# print tensorPredCls
print cls2num

{'sci.med': 0, 'comp.os.ms-windows.misc': 1, 'misc.forsale': 2, 'alt.atheism': 3}


In [21]:
num2cls = {v:k for (k, v) in cls2num.items()}
tensorPredClsStr = [num2cls[i] for i in tensorPredCls]
y_test_true = df['test']['class']
print f1_score(y_test_true, tensorPredClsStr, average='micro')

0.90260173449


In [None]:
# Use TensorFlow to train CNN



In [14]:
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']

# clf = KNeighborsClassifier()
clf = LinearSVC()
# clf = SGDClassifier()
# clf = GaussianNB()
# clf = DecisionTreeClassifier()
# clf = MultinomialNB()
# clf = MLPClassifier()
clf.fit(X_train, y_train)
# print "Training finished; clf.coef_ has shape [n_classes, n_features]: {}".format(clf.coef_.shape)
print "Step 4 finished"

Step 4 finished
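
The commented-out lines in the cell above suggest that several classifiers were tried one at a time. A minimal sketch of comparing a few of them in a loop follows. It assumes synthetic data from `make_classification` as a stand-in for `df['train']` and `df['test']`, and the names `candidates` and `scores` are illustrative only. Scores on this toy data say nothing about the newsgroup results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the TF-IDF features (assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# A few of the classifiers listed in the cell above.
candidates = {
    'LinearSVC': LinearSVC(),
    'SGDClassifier': SGDClassifier(random_state=0),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
}

# Fit each candidate on the same split and record its micro-averaged F1.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te), average='micro')

for name in sorted(scores):
    print("{}: {:.3f}".format(name, scores[name]))
```

Using one fixed split for every candidate keeps the comparison fair; with the real data, `X_tr`/`y_tr` would be replaced by `X_train`/`y_train` from the cell above.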


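## Improvement Notes

As the number of classes increases, classification performance drops.

In that case, performance can be improved by tuning the following:

+ The number of features
+ Using a suitable stop-word list
+ ...?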
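
The first two knobs in the list map onto `TfidfVectorizer` parameters: `max_features` caps the vocabulary size and `stop_words` removes common words before counting. A minimal sketch, assuming a toy corpus in place of `data['nearRaw.train']['content']` and illustrative (untuned) parameter values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real training documents (assumption).
docs = [
    "the patient was given the medicine by the doctor",
    "windows crashed and the driver had to be reinstalled",
    "the bike is for sale and the price is negotiable",
    "the argument about religion went on and on",
]

# max_features keeps only the most frequent terms across the corpus;
# stop_words='english' uses scikit-learn's built-in English stop list.
# Both values are illustrative, not tuned for the newsgroup data.
vectorizer = TfidfVectorizer(max_features=10, stop_words='english')
matrix = vectorizer.fit_transform(docs)

print("matrix shape: {}".format(matrix.shape))
print("kept terms: {}".format(sorted(vectorizer.vocabulary_)))
```

A custom list can be passed as `stop_words=[...]` instead of `'english'` when the built-in list does not fit the corpus.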

In [6]:
# s = ""
# with open('stoplist.txt', 'w') as stoplistfile:
#     for w in vectorizer.stop_words_:
#         s += "{} ".format(w)
#     stoplistfile.write(s)
    
# print "Output stoplist successfully."

Output stoplist successfully.


In [15]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

y_test_pred = clf.predict(X_test)
print accuracy_score(y_test_true, y_test_pred)
print f1_score(y_test_true, y_test_pred, average='macro')
print f1_score(y_test_true, y_test_pred, average='micro')

# print f1_score(y_test_true, tensorPredCls, average='micro')

0.906604402935
0.905995511168
0.906604402935
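
The accuracy and the micro-averaged F1 printed above are identical. That is expected when every sample has exactly one label: each wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class), so the pooled precision and recall both equal the accuracy. A minimal pure-Python sketch of this, using made-up labels and hypothetical helper names `micro_f1` and `accuracy`:

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[t] += 1  # class t was the truth but was missed
    tp_sum = sum(tp.values())
    if tp_sum == 0:
        return 0.0
    precision = tp_sum / float(tp_sum + sum(fp.values()))
    recall = tp_sum / float(tp_sum + sum(fn.values()))
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


# Made-up labels for illustration only; 4 of 5 predictions are correct.
y_true = ['sci.med', 'alt.atheism', 'misc.forsale', 'sci.med', 'alt.atheism']
y_pred = ['sci.med', 'misc.forsale', 'misc.forsale', 'sci.med', 'alt.atheism']

print("micro F1: {:.3f}".format(micro_f1(y_true, y_pred)))
print("accuracy: {:.3f}".format(accuracy(y_true, y_pred)))
```

Macro-averaged F1 averages the per-class scores instead, which is why it differs slightly in the output above.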
