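## Changelog

+ [2017/05/11]
    - Reorganized the results of the earlier experiments and consolidated them into this report
    - For the earlier experiment reports, see the other `.ipynb` files in the same directory whose names start with `trial_`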

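## Notes

1. Search for this marker to locate the parts of the report that still need work: 。。。
2. Some of the code could later be wrapped in Python scripts and imported as modules to keep the notebook concise. The alternative is to leave the code inline so that readers do not have to switch between several pages.
3. 。。。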

## References

1. (#miscellaneous) [20 Newsgroup Document Classification Report](http://cn-static.udacity.com/mlnd/Capstone_Poject_Sample01.pdf)
2. (#word2vec, #tensorflow) [Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)
3. (#word2vec) [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
4. (#word2vec, #gensim) [models.word2vec – Deep learning with word2vec](https://radimrehurek.com/gensim/models/word2vec.html)
5. (#CNN, #tensorflow) [Deep MNIST for Experts](https://www.tensorflow.org/get_started/mnist/pros)
6. (#text8) [text8](http://mattmahoney.net/dc/textdata)
7. 。。。

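## Experiment module plan

Rough plan, to be refined later: implement the "representation + training" combinations below (planned as 9 in total; the table currently lists 7).

Representation model | Classifier training method
---------------|---------------------
BOW + TF-IDF | SVM
BOW + TF-IDF | DNN
BOW + TF-IDF | LSA
BOW + TF-IDF | LDA
Word2Vec | SVM
Word2Vec | DNN
Word2Vec | CNN

Where:
+ Glossary
    - Representation models
        * BOW: Bag-of-Words
        * TF-IDF: Term Frequency - Inverse Document Frequency
    - Classifier training methods
        * SVM: Support Vector Machine
        * LSA: Latent Semantic Analysis
        * LDA: Latent Dirichlet Allocation
        * DNN: Deep Neural Network (here, a plain multi-layer perceptron)
        * CNN: Convolutional Neural Network
+ Training tools
    - Representation models
        * BOW+TF-IDF: built with scikit-learn
        * Word2Vec: built with gensim and TensorFlow
    - Classifier training methods
        * Traditional algorithms: SVM, LSA, and LDA are trained with scikit-learn
        * Neural-network algorithms: DNN and CNN are trained with TensorFlow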

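## Experiment workflow plan

1. Module imports
2. Data preprocessing (common to all models)
  + Clean the text, including but not limited to removing special symbols and converting case, so that the text finally contains only words made of lowercase letters a-z, separated by single spaces
  + Every character outside a-z is converted to a space
3. Text representation modeling
  + BOW+TF-IDF representation:
      1. Read in the raw corpus and store it
      2. Use scikit-learn to build the bag of words (BOW) and compute TF-IDF on the raw corpus
      3. Represent each document (the whole corpus: training and test sets) as a vector of the TF-IDF weights of its terms
      4. One-hot encode the label of each document
  + Word2Vec representation:
      1. Build a word embedding (Word2Vec) model with each of gensim and TensorFlow, trained once on text8 and once on the samples to be classified
          + That is, train 4 representation models: gensim+text8, gensim+samples, TensorFlow+text8, and TensorFlow+samples
          + Use the Skip-Gram method
      2. Represent every word in a document by its Word2Vec vector, then build 2 document representations:
          + Sum the word vectors and use the sum to represent the document; out-of-vocabulary words are replaced by a constant, specifically either by adding a zero vector or by multiplying the sum vector by a constant factor
          + Take the arithmetic mean of the word vectors (the sum above divided by the number of words) and use it to represent the document; out-of-vocabulary words are handled as above
      3. One-hot encode the label of each document
4. Classifier training
  + Traditional algorithms: SVM, LSA, and LDA are trained with scikit-learn
  + Neural-network algorithms: DNN and CNN are trained with TensorFlow
5. Classifier evaluation
  + For the traditional algorithms:
    - Use scikit-learn's GridSearchCV and learning_curve to search for the best parameter combination
    - Use scikit-learn's accuracy_score (accuracy, the fraction of correctly classified documents) and f1_score (F1 score, which combines precision P and recall R) to evaluate the results
  + For the neural-network algorithms:
    - For now, train with one manually chosen set of parameters; whether GridSearchCV and learning_curve can be used for the parameter search is still to be investigated
    - For now, compute precision P, recall R, and the F1 score with hand-written code; whether the data can be converted or stored in a form that works with the scikit-learn evaluation tools above is still to be investigated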
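
The two Word2Vec document representations planned in step 3 (sum and arithmetic mean of the word vectors) can be sketched roughly as follows. This is only an illustration under stated assumptions: `document_vector` and `toy_embeddings` are hypothetical names, the 3-dimensional vectors are made-up values rather than trained embeddings, and out-of-vocabulary words are handled with the zero-vector option from the plan.

```python
import numpy as np


def document_vector(tokens, embeddings, dim, mode='sum'):
    """Combine the word vectors of `tokens` into one document vector.

    Out-of-vocabulary words contribute a zero vector (an assumption;
    the plan also allows scaling the sum by a constant instead).
    """
    total = np.zeros(dim)
    for token in tokens:
        total += embeddings.get(token, np.zeros(dim))
    if mode == 'mean':
        # Divide by the number of tokens, counting out-of-vocabulary ones
        return total / max(len(tokens), 1)
    return total


# Hypothetical toy embeddings; a trained Word2Vec model would supply real ones
toy_embeddings = {
    'god': np.array([1.0, 0.0, 2.0]),
    'will': np.array([0.0, 1.0, 1.0]),
}
tokens = ['god', 'will', 'weatherhead']  # 'weatherhead' is out of vocabulary

doc_sum = document_vector(tokens, toy_embeddings, dim=3, mode='sum')
doc_mean = document_vector(tokens, toy_embeddings, dim=3, mode='mean')
print(doc_sum)
print(doc_mean)
```

Because the zero vector is counted as a word here, the mean is taken over all tokens; dividing only by the number of in-vocabulary tokens is an equally plausible design choice.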

## Experiment log

In [1]:
# Step 0: import module

from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import os
import random

import pandas as pd
import numpy as np
import tensorflow as tf

from sklearn.preprocessing import LabelBinarizer

from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from IPython.display import display
%matplotlib inline

print("import modules successfully")

import modules successfully


In [2]:
# Step 1: preprocess the data

import subprocess

paths = {}
paths['dir.dataroot'] = os.path.join(os.getcwd(), '..', 'data')
paths['dir.train'] = os.path.join(paths['dir.dataroot'], 'trialdata', 'train')
paths['dir.test'] = os.path.join(paths['dir.dataroot'], 'trialdata', 'test')

# Run every file through the Perl filter newfil.pl, replacing it in place
filterpath = os.path.join(paths['dir.dataroot'], 'newfil.pl')
for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for cls in os.listdir(dirpath):
        clspath = os.path.join(dirpath, cls)
        for f in os.listdir(clspath):
            fpath = os.path.join(clspath, f)
            oldpath = fpath + '.old'
            os.rename(fpath, oldpath)
            # Passing an argument list avoids shell quoting problems in file names
            with open(fpath, 'w') as outf:
                subprocess.check_call(['perl', filterpath, oldpath], stdout=outf)
            os.remove(oldpath)

print("preprocessed files successfully")

# One stop word per line; skip blank lines
with open(os.path.join(paths['dir.dataroot'], 'stoplist-web.txt'), 'r') as readf:
    stopwordlist = [line.strip() for line in readf if line.strip()]

print("read stop word list successfully")

print("Step 1 Succeed")

preprocessed files successfully
read stop word list successfully
Step 1 Succeed
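
For readers without Perl, the cleaning rule from the workflow plan can be approximated in pure Python. This is a sketch, not a port of `newfil.pl`, whose source is not shown here. The function name `clean_text` is hypothetical, and spelling out digits is an assumption based on the Step 2 sample output (for example "lines seven one").

```python
import re

# Assumption: digits are spelled out, as the Step 2 sample output suggests
DIGIT_WORDS = ['zero', 'one', 'two', 'three', 'four',
               'five', 'six', 'seven', 'eight', 'nine']


def clean_text(text):
    """Reduce text to lowercase a-z words separated by single spaces."""
    text = text.lower()
    # Spell out each digit, padded with spaces so it becomes its own word
    text = re.sub(r'[0-9]',
                  lambda m: ' {} '.format(DIGIT_WORDS[int(m.group())]),
                  text)
    # Every remaining character outside a-z becomes a space
    text = re.sub(r'[^a-z]', ' ', text)
    # Collapse runs of whitespace and trim both ends
    return ' '.join(text.split())


print(clean_text("Lines: 71\nGod's will, re: <1373@geneva.rutgers.edu>"))
```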


### BOW+TF-IDF representation

In [3]:
# Step 2: read data and save it in data['nearRaw.train'] and data['nearRaw.test']
modelChoice = 'TFIDF'

data = {}
data['nearRaw.train'] = {'content':[], 'class':[]}
data['nearRaw.test'] = {'content':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    # Each subdirectory name is the class label of the files it contains
    for cls in os.listdir(dirpath):
        clspath = os.path.join(dirpath, cls)
        for f in os.listdir(clspath):
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                data['nearRaw.{}'.format(tpart)]['content'].append(readf.read())
                data['nearRaw.{}'.format(tpart)]['class'].append(cls)
    # Print one randomly chosen document as a sanity check
    tmp = data['nearRaw.{}'.format(tpart)]
    ind = random.randrange(len(tmp['class']))
    print("sample(transformed) from {}[{}]:\n[content]\n {}\n[class]\n{}".format(
            tpart, ind, tmp['content'][ind], tmp['class'][ind]
        )
    )
    print()

print("Step 2 Succeed")

sample(transformed) from train[1]:
[content]
  from robp landru network com rob peglar subject re did he really rise reply to robp landru network com organization network systems corporation lines seven one in article one three seven three geneva rutgers edu parkin eng sun com michael parkin writes another issue of importance was the crucification the will of god or a tragic mistake i believe it was a tragic mistake god s will can never be accomplished through the disbelief of man i finished reading a very good book the will of god weatherhead this was very helpful to me in applying thought to the subject of the will of god weatherhead broke the will of god into three distinct parts intentional will circumstancial will and ultimate will he weatherhead also refuted the last statement above by michael parkin above quite nicely summarizing despite the failures of humankind god s ultimate will is never to be defeated god s intentions may be interfered with even temporarily defeated by the 

In [20]:
# Step 3: TfidfVectorizer.fit + TfidfVectorizer.transform + save in pandas.DataFrame
# A. Fit sklearn.feature_extraction.text.TfidfVectorizer on the training data to build the BOW+TF-IDF model
# B. Use TfidfVectorizer.transform to convert the text content in data['nearRaw.train'] and data['nearRaw.test'],
#    and store the BOW+TF-IDF representations in data['matrix.train'] and data['matrix.test'] for later training
#
# C. Convert data['matrix.train'] and data['matrix.test'] to pandas.DataFrame format and store them in df['train'] and df['test'] (df is a dict: str -> DataFrame)

from sklearn.feature_extraction.text import TfidfVectorizer

## Substep A: vectorization + TF-IDF calculation
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.01, max_features=784, analyzer='word', stop_words=stopwordlist)
vectorizer.fit(data['nearRaw.train']['content'])

print("Substep A finished.")
print("--------------------------------------------------")

## Substep B: Transformation

for tpart in ['train', 'test']:
    data['matrix.{}'.format(tpart)] = vectorizer.transform(data['nearRaw.{}'.format(tpart)]['content'])
    ind = (random.sample(range(data['matrix.{}'.format(tpart)].shape[0]), 1))[0]
    print("sample for matrix.{}".format(tpart))
    print("from ind: {}".format(ind))
    print(data['matrix.{}'.format(tpart)][ind])
    print() 
    
print("Substep B finished.")
print("--------------------------------------------------")

# Substep C: integrate data into DataFrame format

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')
if not os.path.isdir(csvpath_root):
    os.mkdir(csvpath_root)

df = {}
for tpart in ['train', 'test']:
    # Densify the CSR matrix: one row per document, one column per term index
    df[tpart] = pd.DataFrame(data['matrix.{}'.format(tpart)].toarray())
    # Rows keep the document order, so the labels line up one-to-one
    df[tpart]['class'] = data['nearRaw.{}'.format(tpart)]['class']
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")
    # write data in DataFrame into CSV
    csvpath = os.path.join(csvpath_root, "{}-{}.csv".format(tpart, modelChoice))
    df[tpart].to_csv(csvpath, columns=df[tpart].columns)

print("Substep C finished.")
print("--------------------------------------------------")

print("Step 3 Succeed.")

# Tricky part: working out how to lay the CSR matrix data out in a DataFrame so that every row lines up with its class

Substep A finished.
--------------------------------------------------
sample for matrix.train
from ind: 516
  (0, 783)	0.0844955606429
  (0, 780)	0.0431000836037
  (0, 778)	0.0442701220325
  (0, 777)	0.0677729647583
  (0, 770)	0.0182683902219
  (0, 766)	0.0508657613309
  (0, 764)	0.0520481166174
  (0, 763)	0.151848777412
  (0, 761)	0.02055449068
  (0, 760)	0.027786638291
  (0, 758)	0.041928382426
  (0, 756)	0.0507852352762
  (0, 752)	0.0273401631941
  (0, 750)	0.0837094629954
  (0, 748)	0.0282656119887
  (0, 747)	0.0174029704187
  (0, 725)	0.0202535018237
  (0, 722)	0.126843162329
  (0, 718)	0.0130582337739
  (0, 711)	0.134698496434
  (0, 707)	0.134107749502
  (0, 706)	0.0862898842883
  (0, 705)	0.0571673848629
  (0, 702)	0.0564075989481
  (0, 700)	0.0260683883378
  :	:
  (0, 163)	0.055573276582
  (0, 162)	0.155117050743
  (0, 150)	0.0312855328975
  (0, 149)	0.0434173512182
  (0, 137)	0.0322351258292
  (0, 131)	0.0285188349172
  (0, 130)	0.0259800248037
  (0, 129)	0.0854932270894
  (0

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.151835,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.131008,0.000000,0.140640,0.000000,0.000000,0.000000,0.000000,0.141967,0.116894,soc.religion.christian
1,0.000000,0.000000,0.000000,0.086425,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.017108,0.000000,0.000000,0.000000,0.000000,0.070079,0.000000,0.000000,0.091591,soc.religion.christian
2,0.000000,0.000000,0.097996,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.042926,0.000000,0.184327,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
3,0.073392,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.031662,0.000000,0.067981,0.000000,0.000000,0.000000,0.000000,0.000000,0.113006,soc.religion.christian
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.023204,0.061024,0.000000,0.000000,0.000000,0.000000,0.000000,0.050289,0.041408,soc.religion.christian
5,0.058849,0.000000,0.000000,0.000000,0.070452,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.135919,soc.religion.christian
6,0.000000,0.000000,0.000000,0.000000,0.056674,0.000000,0.0,0.000000,0.000000,0.043449,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.135911,0.000000,0.000000,soc.religion.christian
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.063593,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.170227,soc.religion.christian
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.060522,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.216009,soc.religion.christian
9,0.000000,0.000000,0.000000,0.246793,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.048853,0.128482,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.072939,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.031467,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
1,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.078079,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
2,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.077998,0.103377,0.000000,0.000000,0.000000,soc.religion.christian
3,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.050030,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.044641,soc.religion.christian
4,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.210194,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.081678,0.000000,soc.religion.christian
5,0.000000,0.0,0.000000,0.000000,0.000000,0.053324,0.0,0.000000,0.000000,0.000000,...,0.019539,0.000000,0.125853,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
6,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.092240,0.000000,0.000000,0.080370,soc.religion.christian
7,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.031390,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.068031,0.028008,soc.religion.christian
8,0.000000,0.0,0.000000,0.072953,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.200846,0.295775,0.000000,0.000000,0.051542,soc.religion.christian
9,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.059443,...,0.000000,0.000000,0.000000,0.058781,0.000000,0.114455,0.000000,0.000000,0.124657,soc.religion.christian






Substep C finished.
--------------------------------------------------
Step 3 Succeed.
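
To make the weights in the output above easier to interpret, here is a small pure-Python sketch of how a TF-IDF row can be computed. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which are the documented `TfidfVectorizer` defaults. Whitespace tokenization is a simplification, and the `max_df`, `min_df`, `max_features`, and stop-word pruning used in Step 3 are left out. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted(set(w for doc in tokenized for w in doc))
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    # Smoothed inverse document frequency
    idf = {w: math.log((1.0 + n_docs) / (1.0 + doc_freq[w])) + 1.0
           for w in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[w] * idf[w] for w in vocabulary]
        # Guard against an all-zero row (empty document)
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocabulary, rows


vocab, rows = tfidf_vectors(['god will god', 'car engine will'])
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 4))
```

In the first toy document, "god" gets a larger weight than "will" because it occurs twice and appears in only one of the two documents.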


In [21]:
# Optional: reload the data from the CSV files written in Step 3

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(paths['dir.dataroot'], 'data_CSV', '{}-{}.csv'.format(tpart, modelChoice))
    if os.path.exists(csvpath):
        # read_csv replaces the deprecated DataFrame.from_csv; column 0 holds the row index
        df[tpart] = pd.read_csv(csvpath, index_col=0)
        # Shuffle the rows, then renumber them from 0
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print("read {} successfully".format(tpart))
        display(df[tpart])


read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.083012,0.0,0.000000,0.000000,0.096209,0.000000,0.000000,0.000000,0.370347,rec.motorcycles
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.069024,0.000000,0.000000,0.150354,rec.motorcycles
2,0.000000,0.000000,0.000000,0.026483,0.087284,0.000000,0.000000,0.000000,0.000000,0.044610,...,0.020969,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,alt.atheism
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.122483,...,0.057573,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.205485,soc.religion.christian
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.103417,0.000000,0.000000,0.000000,...,0.036518,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.059723,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.053289,alt.atheism
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.038827,0.0,0.083364,0.000000,0.000000,0.000000,0.000000,0.000000,0.207867,rec.autos
7,0.000000,0.000000,0.131372,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.057545,0.0,0.000000,0.121059,0.000000,0.000000,0.000000,0.124718,0.000000,alt.atheism
8,0.000000,0.000000,0.000000,0.100830,0.000000,0.000000,0.000000,0.000000,0.000000,0.056617,...,0.026613,0.0,0.000000,0.000000,0.000000,0.000000,0.059034,0.028839,0.000000,soc.religion.christian
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.038275,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.546428,rec.motorcycles


read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.000000,0.230146,0.000000,0.000000,0.000000,0.049347,0.0,0.050425,0.000000,0.038467,...,0.018082,0.000000,0.116467,0.038039,0.000000,0.000000,0.040110,0.000000,0.000000,alt.atheism
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.072712,0.000000,0.312232,0.305930,0.000000,0.000000,0.000000,0.000000,0.064879,rec.motorcycles
2,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.222632,rec.autos
3,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.121485,0.000000,0.130417,0.000000,0.000000,0.000000,0.000000,0.000000,0.270995,alt.atheism
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.054815,0.000000,0.117691,0.000000,0.000000,0.000000,0.000000,0.000000,0.586924,comp.graphics
5,0.000000,0.000000,0.000000,0.132286,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.113508,0.000000,alt.atheism
6,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.064346,comp.graphics
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.062083,0.127796,comp.graphics
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.087306,...,0.041039,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.073235,rec.autos
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.023039,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.020557,soc.religion.christian


In [23]:
# Step 4: One-hot representation for labels

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

# Fit on the training labels only, so that both splits share one class order
lb = LabelBinarizer()
lb.fit(df['train']['class'])

# df_new must exist before it is filled per split
df_new = {}
for tpart in ['train', 'test']:
    labels = lb.transform(df[tpart]['class'])
    labelsDf = pd.DataFrame(labels, columns=["class-{}".format(i) for i in range(len(lb.classes_))])
    df_new[tpart] = {}
    df_new[tpart]['y'] = labelsDf
    df_new[tpart]['x'] = df[tpart].drop('class', axis=1)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    # Save to CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelChoice))
        df_new[tpart][subpart].to_csv(csvpath)

print("label cleaning succeeded")

label cleaning succeeded
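
As a reminder of what the one-hot step produces, the following pure-Python sketch encodes a few class names by their position in the sorted class list. It is an illustration only: `one_hot` is a hypothetical helper, not the `LabelBinarizer` implementation, and the four labels are picked from the class names that appear in the tables above.

```python
def one_hot(labels):
    """Return (classes, rows): sorted class names and one 0/1 row per label."""
    classes = sorted(set(labels))
    index = {cls: i for i, cls in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1  # exactly one position is set per label
        rows.append(row)
    return classes, rows


classes, rows = one_hot(['rec.autos', 'alt.atheism', 'rec.autos', 'comp.graphics'])
print(classes)
for row in rows:
    print(row)
```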


In [24]:
# Step 5: Training

In [25]:
# Step 5.1.1: SVM

if 'TFIDF' == modelChoice:
    # TF-IDF features with the original string class labels
    X_train = df['train'].drop('class', axis=1)
    y_train = df['train']['class']
    X_test = df['test'].drop('class', axis=1)
    y_test_true = df['test']['class']
else:
    # Features and one-hot labels prepared in Step 4
    # (LinearSVC expects 1-D labels, so these would need converting back first)
    X_train = df_new['train']['x']
    y_train = df_new['train']['y']
    X_test = df_new['test']['x']
    y_test_true = df_new['test']['y']

clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 5.1.1 finished")

Step 5.1.1 finished
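
The evaluation plan mentions GridSearchCV for choosing parameters, while the cell above trains `LinearSVC` with its defaults. A minimal sketch of such a search is shown below. It runs on synthetic data from `make_classification` rather than on the newsgroup features, and the grid of `C` values and the `f1_macro` scoring are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features (assumption: 3 classes, 20 features)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Hypothetical grid over the regularization strength C
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LinearSVC(random_state=0), param_grid=param_grid,
                      scoring='f1_macro', cv=3)
search.fit(X, y)

best_c = search.best_params_['C']
print("best C:", best_c)
print("best cross-validated macro-F1:", search.best_score_)
```

With the real data, `X` and `y` would be replaced by `X_train` and `y_train`, and the test split would be kept out of the search.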


In [26]:
# Step 5.1.2: Test
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

# print f1_score(y_test_true, tensorPredCls, average='micro')

0.861578947368
0.859091299728
0.861578947368
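
The three numbers above are, in order, accuracy, macro-F1, and micro-F1. For single-label multiclass data, micro-averaged F1 equals accuracy, which is why the first and third values coincide. The workflow plan also calls for hand-written precision, recall, and F1 for the neural-network classifiers. A minimal sketch of that computation follows; `precision_recall_f1` is a hypothetical helper and the labels are toy data.

```python
from collections import Counter


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return float(numerator) / denominator if denominator else 0.0


def precision_recall_f1(y_true, y_pred):
    """Return (per_class, macro_f1, micro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    per_class = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        p = _ratio(tp[cls], tp[cls] + fp[cls])
        r = _ratio(tp[cls], tp[cls] + fn[cls])
        per_class[cls] = (p, r, _ratio(2 * p * r, p + r))
    # Macro: unweighted mean of the per-class F1 scores
    macro_f1 = _ratio(sum(f1 for _, _, f1 in per_class.values()), len(per_class))
    # Micro: pool the counts over all classes first
    micro_p = _ratio(sum(tp.values()), sum(tp.values()) + sum(fp.values()))
    micro_r = _ratio(sum(tp.values()), sum(tp.values()) + sum(fn.values()))
    micro_f1 = _ratio(2 * micro_p * micro_r, micro_p + micro_r)
    return per_class, macro_f1, micro_f1


y_true = ['a', 'a', 'b', 'b', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c']
per_class, macro_f1, micro_f1 = precision_recall_f1(y_true, y_pred)
print(per_class)
print(macro_f1, micro_f1)
```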


In [27]:
# Save the terms the vectorizer ignored (pruned by max_df, min_df, or max_features)
with open(os.path.join(paths['dir.dataroot'], 'stoplist-baseTFIDF.txt'), 'w') as stoplistfile:
    stoplistfile.write(" ".join(vectorizer.stop_words_))

print("Output stoplist successfully.")

Output stoplist successfully.


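### Word2Vec representation

## Summary

### 1. Effect of different representation models on the results for the same classifier training method

### 2. Effect of different training methods on the results for the same representation model

### 3. Overall evaluation of the "representation model + classifier" combinations

### 4. Outlook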