## -2. 更新日志

+ 【2017/05/11】
    - 重新梳理前期实验结果，并整合到该份报告中
    - 前期实验报告参见同一目录下的其他以 `trial_` 开头的 `.ipynb` 文件

## -1. 备注

1. 搜索该符号以定位到报告中待完善的部分：。。。
2. 稍后可考虑将部分代码包装成 Python 脚本，以模块的形式导入，使整个 notebook 更简洁？或者不这样处理，从而使读者更方便阅读（而不用另外切换于多个页面之间）？
3. 。。。

## 0. 参考文献

1. (#miscellaneous) [20 Newsgroup Document Classification Report](http://cn-static.udacity.com/mlnd/Capstone_Poject_Sample01.pdf)
2. (#word2vec, #tensorflow) [Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)
3. (#word2vec) [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
4. (#word2vec, #gensim) [models.word2vec – Deep learning with word2vec](https://radimrehurek.com/gensim/models/word2vec.html)
5. (#CNN, #tensorflow) [Deep MNIST for Experts](https://www.tensorflow.org/get_started/mnist/pros)
6. (#text8) [text8](http://mattmahoney.net/dc/textdata)
7. 。。。

## 1. 实验模块规划

粗略规划如下，稍后精细整理：实现下述共计 6 种「表示 + 训练」组合

表示模型 | 分类器训练法
---------------|---------------------
BOW + TF-IDF | SVM
                     | LSA
                     | LDA
Word2Vec      | SVM
                     | DNN
                     | CNN
                     
其中：
+ 名词解释
    - 表示模型
        * BOW：Bag-of-Words，词袋模型
        * TF-IDF：Term Frequency - Inverse Document Frequency，文档-逆文档频率
    - 分类器训练法
        * SVM：Support Vector Machine，支持向量机
        * LSA：潜在语义分析
        * LDA：Latent Dirichlet Allocation，隐含狄利克雷分布
        * DNN：Deep Neural Network，深度神经网络（普通的多层感知机构成的多层神经网络）
        * CNN：Convolution Neural Network，卷积神经网络
+ 训练工具
    - 表示模型
        * BOW+TF-IDF：使用 scikit-learn 建模
        * Word2Vec：使用 gensim 与 TensorFlow 建模
    - 分类器训练法
        * 传统算法：SVM, LSA, LDA 使用 scikit-learn 训练
        * 神经网络算法：DNN, CNN：使用 TensorFlow 训练

## 2. 实验流程规划

1. 模块导入
2. 数据预处理（通用预处理）
  + 对文本进行清洗，包括但不限于去除特殊符号、进行大小写转换等工作，最终使文本中只包含：由小写字母 a-z 组成的单词、单一空格
  + 不在 a-z 之间的字符将一律被转换为空格
3. 文本表示建模
  + BOW+TF-IDF 表示：
      1. 读入原始语料并保存
      2. 使用 scikit-learn，在原始语料的基础上，进行建立词袋（BOW）、计算 TF-IDF
      3. 通过使用 BOW+TF-IDF 向量表示文档中的每个词，从而表示每篇文档（包括所有语料：训练集和测试集）
      4. 每篇文档对应的标签独热（one-hot）向量化
  + Word2Vec 表示：
      1. 分别使用 gensim 和 TensorFlow 中的每一种，分别在 text8 的基础上、在待学习样本的基础上，建立词嵌入（word embedding）模型（Word2Vec）
          + 即：训练出 gensim+text8, gensim+待学习样本, TensorFlow+text8, TensorFlow+待学习样本 共计 4 种表示模型
          + 使用 Skip-Gram 方法进行建模
      2. 通过使用 Word2Vec 向量表示文档中的每个词，然后建立 2 种文档表示模型：
          + 求这些词向量的和，以求和向量表示每一篇文档；对于不在词汇表中的词，以某常量代替——具体而言，可指定为加入零向量，或在求和向量乘上某个常量系数
          + 对于上述求和向量进行求算术平均，使用算术平均向量表示每一篇文档；对于不在词汇表中的词，同上述处理方法
      3. 每篇文档对应的标签独热（one-hot）向量化
4. 分类器训练
  + 传统算法：SVM, LSA, LDA 使用 scikit-learn 训练
  + 神经网络算法：DNN, CNN：使用 TensorFlow 训练
5. 分类器评估
  + 对于传统算法：
    - 使用 sckit-learn 提供的 GridSearchCV 与 LearningCurve 方法寻找最优参数组合
    - 使用 scikit-learn 提供的 accuracy_score（查准率 P） 与 f1_score（F1 分数，同时考察了查准率 P 与查全率 R） 评估训练结果
  + 对于神经网络算法：
    - 暂定手工选择一组参数进行训练；待考察是否可使用 GridSearchCV 与 LearningCurve 进行参数组合寻找最优参数组合
    - 暂定使用手工编写的方法计算查准率 P、查全率 R、F1 分数；待考察是否可对数据格式进行一定程度上的转换或存储，以使用上述提及的 scikit-learn 提供的评估工具

## 3. 实验记录

In [29]:
# Step 0: import module

from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import os
import random
import datetime

import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelBinarizer

from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

import gensim
import tensorflow as tf

from IPython.display import display
%matplotlib inline

print("import modules successfully")

import modules successfully


In [2]:
%%time

# Step 1: preprocess the data
vecsize = 784

paths = {}
paths['dir.dataroot'] =  os.path.join(os.getcwd(), '..', 'data')
paths['dir.train'] = os.path.join(paths['dir.dataroot'], 'trialdata', 'train')
paths['dir.test'] = os.path.join(paths['dir.dataroot'], 'trialdata', 'test')
        
preprocessedFlag = os.path.join(paths['dir.dataroot'], 'preprocessed')
if not os.path.isfile(preprocessedFlag):
    for tpart in ['train', 'test']:
        dirpath = paths['dir.{}'.format(tpart)]
        for cls in os.listdir(dirpath):
            clspath = os.path.join(dirpath, cls)
            files = os.listdir(clspath)
            for f in files:
                fpath = os.path.join(clspath, f)
                os.system('mv {} {}.old'.format(fpath, fpath))
                os.system('perl {} {}.old > {}'.format(os.path.join(paths['dir.dataroot'], 'newfil.pl'), fpath, fpath))
                os.system('rm {}.old'.format(fpath))
    os.system('touch {}'.format(preprocessedFlag))
                
print("file preporcessing succefully")
            
stopwordlist = []
with open(os.path.join(paths['dir.dataroot'], 'stoplist-web.txt'), 'r') as readf:
    stopwordlist = readf.read()
    stopwordlist = stopwordlist.split('\n')
            
print("read stop word list successfully")
        
print("Step 1 Succeed")

file preporcessing succefully
read stop word list successfully
Step 1 Succeed
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.69 ms


### 3.1 BOW+TF-IDF 表示法

In [6]:
%%time

# Step 2: read data and save it in data['nearRaw.train'] 和 data['nearRaw.test']
modelChoice = 'TFIDF'

data = {}
data['nearRaw.train'] = {'content':[], 'class':[]}
data['nearRaw.test'] = {'content':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                data['nearRaw.{}'.format(tpart)]['content'].append(readf.read())
                data['nearRaw.{}'.format(tpart)]['class'].append(cls)
    tmp = data['nearRaw.{}'.format(tpart)]
    ind = (random.sample(range(len(tmp['class'])), 1))[0]
    print("sample(transformed) from {}[{}]:\n[content]\n {}\n[class]\n{}".format(
            tpart, ind, tmp['content'][ind], tmp['class'][ind]
        )
    )
    print() 
    
print("Step 2 Succeed")

sample(transformed) from train[2279]:
[content]
  from ed cwis unomaha edu ed stastny subject the otis project ftp sites for original art and images keywords mr owl how many licks organization university of nebraska at omaha lines two two seven the otis project nine three the operative term is stimulate this file last updated four two one nine three what is otis otis is here for the purpose of distributing original artwork and photographs over the network for public perusal scrutiny and distribution digital immortality the basic idea behind digital immortality is that computer networks are here to stay and that anything interesting you deposit on them will be around near forever the gifs and jpgs of today will be the artifacts of a digital future perhaps they ll be put in different formats perhaps only surviving on backup tapes but they ll be there and someone will dig them up if that doesn t interest you otis also offers a forum for critique and exhibition of your works a virtual art 

#### 3.1.1 普通 VSM 表示模型

<center>参数列表</center>

参数 |  取值
--------|-----------
max_df | 0.9
min_df | 0.01
max_features | 784
analyzer | 'word'
stop_words | stopwordlist（来自文件）

In [7]:
%%time

# Step 3: TfidfVectorizer.fit + TfidfVectorizer.transform + save in Pandas.DataFrame
# A. 使用 sklearn.feature_extraction.text.TfidfVectorizer.fit 拟合训练数据，建立 BOW+TF-IDF 
# B. 使用 sklearn.feature_extraction.text.TfidfVectorizer.transform 将data['nearRaw.train']中的 stringContent 和 data['nearRaw.test']中的 stringContent 进行处理，
# 将 BOW+TF-IDF 表示结果输出到 data['matrix.train'] 与 data['matrix.test'] 中，供后续学习和训练使用
# 
# C. 将 data['matrix.train'] 与 data['matrix.test'] 转换成 Pandas.DataFrame 格式，保存到 df['train'] 和 df['test'] 中（df 为字典格式：String -> DataFrame）

from sklearn.feature_extraction.text import TfidfVectorizer

## Substep A: vectorization + TF-IDF calculation
vecsize = 784
vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.01, max_features=vecsize, analyzer='word', stop_words=stopwordlist)
vectorizer.fit(data['nearRaw.train']['content'])

print("Substep A finished.")
print("--------------------------------------------------")

## Substep B: Transformation

for tpart in ['train', 'test']:
    data['matrix.{}'.format(tpart)] = vectorizer.transform(data['nearRaw.{}'.format(tpart)]['content'])
    ind = (random.sample(range(data['matrix.{}'.format(tpart)].shape[0]), 1))[0]
    print("sample for matrix.{}".format(tpart))
    print("from ind: {}".format(ind))
    print(data['matrix.{}'.format(tpart)][ind])
    print() 
    
print("Substep B finished.")
print("--------------------------------------------------")

# Substep C: integrate data into DataFrame format

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')
if not os.path.isdir(csvpath_root):
    os.mkdir(csvpath_root)

df = {}
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['nearRaw.{}'.format(tpart)]['class']
    for col in range(data['matrix.{}'.format(tpart)].shape[1]):
        datadict[col]= [i[0] for i in data['matrix.{}'.format(tpart)].getcol(col).toarray()]
#         datadict[str(col)]= [i[0] for i in data['matrix.{}'.format(tpart)].getcol(col).toarray()]

    df[tpart] = pd.DataFrame(data=datadict)
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")
    # write data in DataFrame into CSV
    csvpath = os.path.join(csvpath_root, "{}-{}.csv".format(tpart, modelChoice))
    df[tpart].to_csv(csvpath, columns=df[tpart].columns)

print("Substep C finished.")
print("--------------------------------------------------")

print("Step 3 Succeed.")

# 繁琐点：研究如何把 CSR 矩阵中的数据规整好放到 DataFrame 中，并与 Class 一一对应

Substep A finished.
--------------------------------------------------
sample for matrix.train
from ind: 2385
  (0, 783)	0.285416737335
  (0, 770)	0.185125856098
  (0, 652)	0.219756542021
  (0, 628)	0.112067764994
  (0, 612)	0.11498914104
  (0, 566)	0.180083750663
  (0, 549)	0.170377050063
  (0, 481)	0.0741653714903
  (0, 426)	0.693245185673
  (0, 283)	0.223815276854
  (0, 281)	0.239568315863
  (0, 257)	0.315214396579
  (0, 250)	0.101436255178
  (0, 203)	0.114557050154
  (0, 182)	0.179956076368

sample for matrix.test
from ind: 564
  (0, 783)	0.25037021215
  (0, 781)	0.124487406684
  (0, 775)	0.0561196577336
  (0, 772)	0.141241431695
  (0, 770)	0.0974364719433
  (0, 765)	0.131777847543
  (0, 736)	0.123712296337
  (0, 722)	0.0966474258649
  (0, 711)	0.404116464909
  (0, 708)	0.120212864006
  (0, 691)	0.0996602988254
  (0, 683)	0.103176840655
  (0, 658)	0.316198386017
  (0, 603)	0.0936574438545
  (0, 590)	0.156114988678
  (0, 566)	0.0947826829205
  (0, 549)	0.134510697275
  (0, 523)	0.06

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.151835,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.131008,0.000000,0.140640,0.000000,0.000000,0.000000,0.000000,0.141967,0.116894,soc.religion.christian
1,0.000000,0.000000,0.000000,0.086425,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.017108,0.000000,0.000000,0.000000,0.000000,0.070079,0.000000,0.000000,0.091591,soc.religion.christian
2,0.000000,0.000000,0.097996,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.042926,0.000000,0.184327,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
3,0.073392,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.031662,0.000000,0.067981,0.000000,0.000000,0.000000,0.000000,0.000000,0.113006,soc.religion.christian
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.023204,0.061024,0.000000,0.000000,0.000000,0.000000,0.000000,0.050289,0.041408,soc.religion.christian
5,0.058849,0.000000,0.000000,0.000000,0.070452,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.135919,soc.religion.christian
6,0.000000,0.000000,0.000000,0.000000,0.056674,0.000000,0.0,0.000000,0.000000,0.043449,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.135911,0.000000,0.000000,soc.religion.christian
7,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.063593,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.170227,soc.religion.christian
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.060522,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.216009,soc.religion.christian
9,0.000000,0.000000,0.000000,0.246793,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.048853,0.128482,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.072939,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.031467,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
1,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.078079,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
2,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.077998,0.103377,0.000000,0.000000,0.000000,soc.religion.christian
3,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.050030,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.044641,soc.religion.christian
4,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.210194,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.081678,0.000000,soc.religion.christian
5,0.000000,0.0,0.000000,0.000000,0.000000,0.053324,0.0,0.000000,0.000000,0.000000,...,0.019539,0.000000,0.125853,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,soc.religion.christian
6,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.092240,0.000000,0.000000,0.080370,soc.religion.christian
7,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.031390,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.068031,0.028008,soc.religion.christian
8,0.000000,0.0,0.000000,0.072953,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.200846,0.295775,0.000000,0.000000,0.051542,soc.religion.christian
9,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.059443,...,0.000000,0.000000,0.000000,0.058781,0.000000,0.114455,0.000000,0.000000,0.124657,soc.religion.christian






Substep C finished.
--------------------------------------------------
Step 3 Succeed.
CPU times: user 5.89 s, sys: 96 ms, total: 5.98 s
Wall time: 5.88 s


In [8]:
%%time

# if wanna read data from CSV file

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(paths['dir.dataroot'], 'data_CSV', '{}-{}.csv'.format(tpart, modelChoice))
    if os.path.exists(csvpath):
        df[tpart] = pd.DataFrame.from_csv(csvpath)
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print("read {} successfully".format(tpart))
        display(df[tpart])


read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.088296,rec.motorcycles
1,0.000000,0.000000,0.270996,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.118705,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.052959,rec.motorcycles
2,0.000000,0.000000,0.000000,0.106788,0.00000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.126834,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.037723,alt.atheism
3,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.088524,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.513421,rec.motorcycles
4,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.117773,0.000000,0.000000,0.000000,0.000000,0.099905,comp.graphics
5,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.119460,rec.autos
6,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.095337,...,0.089627,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,rec.motorcycles
7,0.039368,0.000000,0.000000,0.000000,0.04713,0.000000,0.000000,0.000000,0.0,0.036132,...,0.067936,0.000000,0.000000,0.035729,0.000000,0.000000,0.075349,0.000000,0.015154,alt.atheism
8,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.285417,comp.graphics
9,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.080364,0.000000,0.000000,0.169062,0.000000,0.000000,0.000000,0.000000,0.000000,comp.graphics


read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.024113,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.052260,0.000000,alt.atheism
1,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.057582,0.0,0.000000,0.000000,...,0.061000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.199572,rec.autos
2,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.052024,soc.religion.christian
3,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.039734,rec.autos
4,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.057348,0.000000,0.000000,0.120644,0.000000,0.000000,0.000000,0.000000,0.255850,rec.motorcycles
5,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.015311,0.000000,0.000000,0.032210,0.000000,0.000000,0.000000,0.000000,0.068308,soc.religion.christian
6,0.000000,0.0,0.000000,0.0,0.000000,0.109173,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.086700,0.000000,alt.atheism
7,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.089744,soc.religion.christian
8,0.099696,0.0,0.196379,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.076754,soc.religion.christian
9,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.105101,0.000000,0.000000,0.000000,0.000000,0.267465,rec.autos


CPU times: user 632 ms, sys: 0 ns, total: 632 ms
Wall time: 667 ms


In [10]:
# %%time

# # Step 4: One-hot representation for labels

# csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

# lb = LabelBinarizer()
# lb.fit(df['train']['class'])

# df_new = {}
# for tpart in ['train', 'test']:
#     labels = lb.transform(df[tpart]['class'])
#     labelsDf = pd.DataFrame(labels, columns=["class-{}".format(i) for i in range(len(lb.classes_))])
#     df_new[tpart] = {}
#     df_new[tpart]['y'] = labelsDf
#     df_new[tpart]['x'] = df[tpart].drop('class', axis=1)
#     df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
#     #save in CSV
#     for subpart in ['x', 'y', 'all']:
#         csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelChoice))
#         df_new[tpart][subpart].to_csv(csvpath)
    
# print("label cleaning succussfully")

label cleaning succussfully
CPU times: user 2.05 s, sys: 76 ms, total: 2.12 s
Wall time: 2.12 s


In [24]:
# Step 5: Training

In [11]:
%%time

# Step 5.1.1: SVM

# if 'TFIDF' == modelChoice:

#train
X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']
#test
X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

# else:
#     #train
#     X_train = df_new['train']['x']
#     y_train = df_new['train']['y']
#     #test
#     X_test = df_new['test']['x']
#     y_test_true = df_new['test']['y']

clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 4 finished")

Step 4 finished
CPU times: user 68 ms, sys: 16 ms, total: 84 ms
Wall time: 83 ms


In [12]:
%%time

# Step 5.1.2: Test
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

# print f1_score(y_test_true, tensorPredCls, average='micro')

0.861578947368
0.859091299728
0.861578947368
CPU times: user 16 ms, sys: 4 ms, total: 20 ms
Wall time: 19 ms


In [13]:
%%time

temps = ""
with open(os.path.join(paths['dir.dataroot'], 'stoplist-baseTFIDF.txt'), 'w') as stoplistfile:
    for w in vectorizer.stop_words_:
        temps += "{} ".format(w)
    stoplistfile.write(temps)
    
print("Output stoplist successfully.")

Output stoplist successfully.
CPU times: user 40 ms, sys: 0 ns, total: 40 ms
Wall time: 46.3 ms


#### 3.1.2 基于 BOW+TFIDF 的 LSA 表示

<center>参数列表</center>

参数 |  取值
--------|-----------
k(n_components) | 200
n_iter | 10
random_state | 19

In [14]:
%%time

# Step 4: LSA based on BOW+TFIDF
# 注意：sklearn.decomposition.TruncatedSVD 的 fit() 与 transform() 方法接受的参数虽然说要求是稀疏矩阵（sparse matrix），但只要求输入

from sklearn.decomposition import TruncatedSVD

modelChoice = 'LSA'

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

svd = TruncatedSVD(n_components=200, n_iter=10, random_state=19)
svd.fit(df['train'].drop('class', axis=1))

df_new = {}
for tpart in ['train', 'test']:
    df_new[tpart] = {}
    datadict = {}
    X_LSA = svd.transform(df[tpart].drop('class', axis=1))
    for col in range(X_LSA.shape[1]):
        datadict[col] = X_LSA[:, col]
    df_new[tpart]['y'] = df[tpart]['class']
    df_new[tpart]['x'] = pd.DataFrame(data=datadict)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    print('Finish {} for data: {}'.format(modelChoice, tpart))
    display(df_new[tpart]['all'])
    print("\n\n\n")
    #save in CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelChoice))
        df_new[tpart][subpart].to_csv(csvpath)

Finish LSA for data: train


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,class
0,0.271848,0.086854,0.044718,-0.184021,-0.104875,-0.061883,-0.023094,-0.019757,0.072721,-0.174748,...,0.020996,-0.057589,-0.023206,0.044620,0.033468,-0.042031,-0.056432,-0.053197,0.022025,rec.motorcycles
1,0.354713,0.024572,-0.053660,0.027571,-0.112272,0.008970,-0.117263,0.035975,0.068295,-0.064175,...,0.040832,-0.026654,0.012075,0.050080,-0.090460,-0.027934,-0.044482,0.023124,-0.080695,rec.motorcycles
2,0.336806,0.131161,-0.080951,0.061080,0.033372,0.182860,0.146511,0.024233,-0.036709,0.011195,...,0.016687,0.011382,0.001840,-0.011631,0.018478,-0.000071,-0.004590,-0.035387,-0.019772,alt.atheism
3,0.669176,-0.222391,0.154325,-0.113312,-0.008736,0.042402,-0.122434,0.090178,0.086653,-0.010587,...,-0.020793,-0.010716,0.008322,-0.000535,0.042258,-0.040557,-0.000607,-0.007880,-0.022110,rec.motorcycles
4,0.242416,-0.017904,-0.132323,0.110422,0.117907,-0.096068,-0.184019,-0.017587,-0.052250,-0.054797,...,-0.012923,-0.014938,-0.019100,0.031885,-0.027216,-0.009254,0.005577,-0.005521,-0.059192,comp.graphics
5,0.265944,0.053592,-0.089929,-0.205034,-0.135492,0.045844,-0.182254,0.044461,0.185399,0.072694,...,-0.011158,0.000387,0.002401,0.018223,0.032595,-0.072950,-0.046289,0.011751,0.000851,rec.autos
6,0.245897,0.041406,-0.037822,-0.135230,-0.105791,-0.113244,0.051711,0.029380,0.099952,-0.237700,...,0.006944,-0.036590,0.077661,0.012173,0.072640,0.055367,0.008979,0.036428,0.081808,rec.motorcycles
7,0.232699,0.244133,-0.128749,-0.213448,0.189454,0.225212,0.036346,-0.033897,-0.228594,-0.196528,...,-0.034541,0.023703,0.021700,0.006158,-0.039249,0.021697,0.065730,0.032391,0.020442,alt.atheism
8,0.356918,-0.174550,0.037398,0.050916,0.139657,0.043051,-0.035772,-0.085151,0.107688,-0.025660,...,0.061714,0.075787,0.045739,0.027133,-0.002291,-0.013261,-0.026346,-0.009649,0.010169,comp.graphics
9,0.259889,-0.031931,-0.118164,0.042314,-0.168207,0.100546,-0.009619,-0.098718,0.044267,0.007701,...,0.041649,-0.061747,0.029451,-0.027174,-0.066717,-0.023137,-0.021732,0.009132,-0.039401,comp.graphics






Finish LSA for data: test


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,class
0,0.114564,0.164067,-0.024724,-0.030537,0.090902,-0.009459,0.014457,0.019973,-0.052380,-0.020850,...,0.062908,-0.020304,0.072522,0.070766,-0.021639,0.050329,0.061819,0.043197,0.000102,alt.atheism
1,0.519665,-0.021193,0.011763,-0.081025,-0.016817,-0.088999,0.137517,-0.203228,-0.036433,0.073082,...,0.004456,0.041031,-0.002497,0.017334,-0.017513,-0.022711,0.006521,0.051933,0.019353,rec.autos
2,0.400768,-0.004849,0.001060,0.138331,-0.053400,0.017185,0.028488,-0.087491,-0.006689,-0.127574,...,-0.007918,-0.025594,-0.012875,-0.006147,0.002793,-0.037490,0.013872,-0.035135,0.029218,soc.religion.christian
3,0.494432,-0.128792,-0.075977,0.158679,-0.160794,0.070026,0.079511,0.025272,-0.132389,-0.084178,...,-0.042501,-0.037805,-0.003796,-0.010782,0.006794,0.049987,-0.000206,0.030371,0.007290,rec.autos
4,0.440319,-0.061954,-0.023238,-0.122690,0.095754,-0.078904,0.067468,0.146911,0.039260,-0.041248,...,-0.025799,-0.017724,-0.024578,-0.001334,0.025187,0.011958,0.027222,0.035019,-0.032083,rec.motorcycles
5,0.328207,0.181910,0.122593,0.142017,-0.053450,0.042890,0.002027,-0.075838,0.015566,0.030917,...,0.047407,0.066344,-0.022169,0.023059,0.000695,-0.010573,-0.018855,0.054881,0.007070,soc.religion.christian
6,0.272883,0.289963,0.077959,0.143253,0.023714,-0.017974,0.033273,0.009029,0.021902,-0.041701,...,0.011500,0.017489,0.037776,0.001220,-0.016065,0.054652,0.000180,-0.006513,0.015332,alt.atheism
7,0.328090,0.137693,0.135918,0.102294,-0.081237,0.089016,-0.017763,-0.031980,0.150092,0.009629,...,-0.030689,0.006407,-0.001757,-0.077525,0.002236,-0.005762,0.042114,-0.016589,-0.031774,soc.religion.christian
8,0.312789,0.139804,0.065698,0.134523,0.012125,0.002448,-0.053565,0.060769,0.013670,0.032497,...,-0.011592,0.068106,0.020433,-0.011731,0.026763,0.032598,0.016502,-0.018786,0.018138,soc.religion.christian
9,0.511293,-0.162422,0.009341,-0.008767,-0.010792,0.063196,-0.004236,-0.066234,0.123269,-0.025538,...,-0.075358,-0.040903,-0.013458,-0.064500,0.021368,0.049023,-0.035709,0.069203,0.014727,rec.autos






CPU times: user 6.97 s, sys: 3.82 s, total: 10.8 s
Wall time: 2.45 s


In [15]:
%%time

# Step 5: Training

# if 'TFIDF' == modelChoice:
#     #train
#     X_train = df['train'].drop('class', axis=1)
#     y_train = df['train']['class']
#     #test
#     X_test = df['test'].drop('class', axis=1)
#     y_test_true = df['test']['class']
# else:

#train
X_train = df_new['train']['x']
y_train = df_new['train']['y']
#test
X_test = df_new['test']['x']
y_test_true = df_new['test']['y']

clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 4 finished")

Step 4 finished
CPU times: user 164 ms, sys: 0 ns, total: 164 ms
Wall time: 164 ms


In [16]:
%%time

# Step 6: Evaluate

y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

0.867368421053
0.863905609159
0.867368421053
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 15 ms


#### 3.1.3 基于 BOW+TFIDF 的 LDA 表示

<center>参数列表</center>

参数 |  取值
--------|-----------
n_topic | 50
max_iter | 10
random_state | 19

> 注意：实验表明，主题数的选择会影响训练效果。目前测试过 5, 35, 50, 75, 100, 200，似乎以 50 效果最好。因此最后保留 50 的训练结果。

In [29]:
%%time

# Step 4: LDA based on BOW+TFIDF
# 注意：sklearn.decomposition.TruncatedSVD 的 fit() 与 transform() 方法接受的参数虽然说要求是稀疏矩阵（sparse matrix），但只要求输入

from sklearn.decomposition import LatentDirichletAllocation

modelChoice = 'LDA'

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

lda = LatentDirichletAllocation(n_topics=50, max_iter=10, random_state=19, learning_method='batch')
lda.fit(df['train'].drop('class', axis=1))

df_new = {}
for tpart in ['train', 'test']:
    df_new[tpart] = {}
    datadict = {}
    X_LDA = lda.transform(df[tpart].drop('class', axis=1))
    for col in range(X_LDA.shape[1]):
        datadict[col] = X_LDA[:, col]
    df_new[tpart]['y'] = df[tpart]['class']
    df_new[tpart]['x'] = pd.DataFrame(data=datadict)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    print('Finish {} for data: {}'.format(modelChoice, tpart))
    display(df_new[tpart]['all'])
    print("\n\n\n")
    #save in CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelChoice))
        df_new[tpart][subpart].to_csv(csvpath)

Finish LDA for data: train


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,class
0,0.002856,0.002856,0.002856,0.002856,0.002856,0.002856,0.002856,0.002856,0.002856,0.002856,...,0.002856,0.002856,0.002856,0.002856,0.022632,0.002856,0.002856,0.002856,0.002856,rec.motorcycles
1,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,...,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,0.002700,rec.motorcycles
2,0.002795,0.153632,0.002795,0.002795,0.002795,0.002795,0.002795,0.002795,0.002795,0.002795,...,0.002795,0.002795,0.002795,0.002795,0.002795,0.002795,0.132057,0.002795,0.002795,alt.atheism
3,0.002905,0.002905,0.002905,0.002905,0.002905,0.002905,0.002905,0.002905,0.002905,0.002905,...,0.002905,0.002905,0.002905,0.002905,0.019844,0.002905,0.002905,0.002905,0.002905,rec.motorcycles
4,0.003393,0.003393,0.003393,0.003393,0.003393,0.003393,0.003393,0.003393,0.003393,0.003393,...,0.367731,0.003393,0.003393,0.003393,0.003393,0.003393,0.003393,0.003393,0.003393,comp.graphics
5,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,...,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,0.004160,rec.autos
6,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,...,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,0.002825,rec.motorcycles
7,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,...,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,0.002394,alt.atheism
8,0.004751,0.004751,0.004751,0.004751,0.004751,0.004751,0.004751,0.004751,0.004751,0.004751,...,0.004751,0.004751,0.004751,0.004751,0.462776,0.004751,0.004751,0.004751,0.004751,comp.graphics
9,0.003754,0.003754,0.053233,0.003754,0.003754,0.003754,0.003754,0.003754,0.003754,0.003754,...,0.045130,0.003754,0.003754,0.003754,0.003754,0.003754,0.003754,0.003754,0.003754,comp.graphics






Finish LDA for data: test


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,class
0,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,...,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,0.003941,alt.atheism
1,0.002108,0.002108,0.002108,0.002108,0.002108,0.002108,0.002108,0.002108,0.002108,0.002108,...,0.002108,0.002108,0.002108,0.002108,0.002108,0.002108,0.008482,0.002108,0.002108,rec.autos
2,0.002848,0.002848,0.002848,0.002848,0.002848,0.002848,0.002848,0.002848,0.002848,0.002848,...,0.002848,0.002848,0.002848,0.002848,0.254074,0.002848,0.002848,0.002848,0.002848,soc.religion.christian
3,0.003116,0.003116,0.003116,0.003116,0.003116,0.003116,0.003116,0.003116,0.003116,0.003116,...,0.003116,0.003116,0.003116,0.106555,0.024955,0.003116,0.003116,0.003116,0.003116,rec.autos
4,0.002868,0.002868,0.002868,0.002868,0.002868,0.002868,0.002868,0.002868,0.002868,0.002868,...,0.002868,0.002868,0.002868,0.115058,0.002868,0.002868,0.002868,0.002868,0.002868,rec.motorcycles
5,0.002355,0.002355,0.002355,0.002355,0.002355,0.002355,0.002355,0.002355,0.002355,0.002355,...,0.002355,0.002355,0.002355,0.002355,0.002355,0.002355,0.018169,0.002355,0.002355,soc.religion.christian
6,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,...,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,0.002652,alt.atheism
7,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,...,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,0.003309,soc.religion.christian
8,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,...,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,0.003099,soc.religion.christian
9,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,...,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,0.002785,rec.autos






CPU times: user 19.8 s, sys: 58.2 s, total: 1min 17s
Wall time: 10.5 s


In [31]:
%%time

# Step 5: Training

# if 'TFIDF' == modelChoice:
#     #train
#     X_train = df['train'].drop('class', axis=1)
#     y_train = df['train']['class']
#     #test
#     X_test = df['test'].drop('class', axis=1)
#     y_test_true = df['test']['class']
# else:

#train
X_train = df_new['train']['x']
y_train = df_new['train']['y']
#test
X_test = df_new['test']['x']
y_test_true = df_new['test']['y']
    
clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 4 finished")

Step 4 finished
CPU times: user 100 ms, sys: 0 ns, total: 100 ms
Wall time: 97.7 ms


In [32]:
%%time

# Step 6: Evaluate

y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

0.593157894737
0.573493123176
0.593157894737
CPU times: user 12 ms, sys: 4 ms, total: 16 ms
Wall time: 13.5 ms


### 3.2 Word2Vec 表示法

In [3]:
paths['dir.modelroot'] = os.path.join(paths['dir.dataroot'], '..', 'models')
for modeltool in ['gensim', 'TensorFlow']:
    for embedsource in ['text8', 'corpus']:
        dname = os.path.join(paths['dir.modelroot'], '{}.{}'.format(modeltool, embedsource))
        if not os.path.isdir(dname):
            os.mkdir(dname)
        paths['dir.{}.{}'.format(modeltool, embedsource)] = dname

#### 3.2.1 gensim 训练

In [104]:
# Step 0: Import modules

modelFrom = 'gensim'

print("Step 0 succeed.")

Step 0 succeed.


##### 3.2.1.1 基于 text8 建模

In [6]:
%%time

# Step 1: preprocess the data

# import stoplist
stopwords = ""

pathtemp_TFIDF = os.path.join(paths['dir.dataroot'], 'stoplist-baseTFIDF.txt')
with open(pathtemp_TFIDF, 'r') as stoplistfile:
    stopwords = stoplistfile.read()
stopwords = stopwords.split()

pathtemp_web = os.path.join(paths['dir.dataroot'], 'stoplist-web.txt')
with open(pathtemp_web, 'r') as stoplistfile2:
    stopwords2 = stoplistfile2.read()
    stopwords2 = stopwords2.split('\n')
    stopwords = set(stopwords)
    stopwords = list(stopwords.union(set(stopwords)))
    
print("Read stop words successfully.")

Read stop words successfully.
CPU times: user 4 ms, sys: 8 ms, total: 12 ms
Wall time: 23.2 ms


In [7]:
%%time
   
trimmer = lambda word, count, min_count: gensim.utils.RULE_DISCARD if word in stopwords else gensim.utils.RULE_KEEP

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.01 µs


In [8]:
%%time

embedFrom = 'text8'
text8fname = os.path.join(paths['dir.dataroot'], 'text8')
sentences = gensim.models.word2vec.Text8Corpus(text8fname)

vecsize = 784
model = gensim.models.word2vec.Word2Vec(iter=15, size=vecsize, sg=1, window=5, workers=12)
model.build_vocab(sentences=sentences, trim_rule=trimmer)
model.train(sentences=sentences)

CPU times: user 2h 13min 36s, sys: 8.62 s, total: 2h 13min 45s
Wall time: 31min 56s


In [15]:
%%time

# save model

calander = datetime.date.today().timetuple()

modelpath = os.path.join(
    paths['dir.{}.{}'.format(modelFrom, embedFrom)],
    '{}.{}.{}{}{}'.format(modelFrom, embedFrom, calander.tm_year, calander.tm_mon, calander.tm_mday)
)
model.save(modelpath)

print("save model finished")

save model finished
CPU times: user 2.58 s, sys: 228 ms, total: 2.81 s
Wall time: 12.5 s


In [17]:
%%time

# Step 2: read data and save it in data['vec.train'] 和 data['vec.test']

data = {}
data['vec.train'] = {'w2v.mean':[], 'class':[]}
data['vec.test'] = {'w2v.mean':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                tokens = [token for token in readf.read().split()] # if token not in stopwords]#readf.read().split()#
                # Word2Vec representation
                # begin
                vec = np.array([0.0 for i in range(vecsize)])
                expectationVal = np.array([0.0 for i in range(vecsize)])
                countvec = 0
                for token in tokens:
                    try:
                        vec += model[token]
                        countvec += 1
                    except:
                        vec += expectationVal
                vec = vec / float(countvec)#float(len(tokens))
                 # end
            data['vec.{}'.format(tpart)]['w2v.mean'].append(vec)
            data['vec.{}'.format(tpart)]['class'].append(cls)

    tmp = data['vec.{}'.format(tpart)]
    ind = (random.sample(range(len(tmp['class'])), 1))[0]
    print("sample(transformed) from {}[{}]:\n[corpus]\n {}\n[class]\n{}".format(
            tpart, ind, tmp['w2v.mean'][ind], tmp['class'][ind]
        )
    )
    print()
    
print("Step 2 Succeed")

sample(transformed) from train[1294]:
[corpus]
 [  1.42176096e-01   1.62430700e-01  -8.79922659e-02   5.35091242e-02
   8.59380437e-02   2.62035794e-04   7.22310296e-02   2.61135434e-02
  -2.16664064e-02   5.44147822e-02   1.06476713e-01  -5.75761132e-02
  -1.10888660e-01  -2.65366570e-02  -1.12344211e-01   7.02185809e-02
   2.53348814e-02   1.30706466e-01  -1.04799571e-01   4.39703308e-02
  -9.66833411e-03   1.64926087e-02   7.21110973e-02   4.64726405e-02
   1.42359131e-01   2.71736607e-02  -1.44038068e-03  -3.01274558e-02
  -1.66951142e-01  -1.89049134e-01  -1.28849644e-01   1.76089907e-01
   7.72865493e-02  -3.66129199e-02  -1.38183594e-02   2.13841639e-02
   1.38888888e-02   1.29469739e-01   1.23148409e-01  -1.73164004e-01
   2.79120405e-02  -7.15156136e-02  -1.03259663e-02   6.85949264e-03
   5.14812296e-03  -1.38993850e-01  -2.75248113e-02   5.88013921e-02
  -1.28485937e-02   2.55639031e-02  -5.69223881e-03   6.49880191e-03
   1.56568261e-02  -3.84757851e-02  -4.55514945e-02  -5

sample(transformed) from test[1009]:
[corpus]
 [  1.57098071e-01   1.53154620e-01  -9.61496837e-02   7.66679577e-02
   9.88287428e-02   2.56884951e-02   4.97653553e-02   1.61335431e-02
  -4.28836131e-02   8.93512531e-02   8.97204755e-02  -4.42556309e-02
  -1.20880319e-01  -3.39067199e-02  -1.21989825e-01   6.05836994e-02
   2.43342272e-02   1.07924186e-01  -7.54850704e-02   2.36673727e-02
  -1.31180015e-02   3.76038207e-02   5.74989647e-02   4.92838261e-02
   1.01037374e-01   3.15683201e-02   2.09713250e-02  -1.74226246e-02
  -1.54403559e-01  -1.89296066e-01  -1.21736109e-01   1.71895260e-01
   5.00430703e-02  -5.84864492e-02  -1.67442754e-02   1.83596857e-02
   2.22139320e-02   1.44453748e-01   1.37844339e-01  -1.70359024e-01
   1.47091010e-02  -7.39391100e-02  -1.53880306e-02   6.87595693e-03
   1.30805930e-02  -1.23579607e-01  -2.07829833e-02   3.87572804e-02
  -2.03224753e-02   2.31136564e-02  -3.07712576e-02   1.17793162e-02
   9.60251690e-03  -2.09789647e-02  -2.84358605e-02  -3.

In [19]:
%%time

# Step 3: Save in Pandas.DataFrame
#
# 将 data['matrix.train'] 与 data['matrix.test'] 转换成 Pandas.DataFrame 格式，保存到 df['train'] 和 df['test'] 中（df 为字典格式：String -> DataFrame）

df = {}
csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['vec.{}'.format(tpart)]['class']
    datavec = np.array(data['vec.{}'.format(tpart)]['w2v.mean'])
    for col in range(vecsize):
        datadict[col]= datavec[:, col]

    df[tpart] = pd.DataFrame(data=datadict)
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")
    # write data in DataFrame into CSV
    csvpath = os.path.join(csvpath_root, '{}-w2v-{}-{}.csv'.format(tpart, embedFrom, modelFrom))
    df[tpart].to_csv(csvpath, columns=df[tpart].columns)
    
print("Step 3 Succeed.")

# 繁琐点：研究如何把 CSR 矩阵中的数据规整好放到 DataFrame 中，并与 Class 一一对应

See df[train]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.147760,0.191871,-0.123469,0.066009,0.132763,-0.006644,0.044266,0.059715,-0.024833,0.091452,...,-0.099979,-0.074620,0.042913,0.040350,0.134450,0.018293,0.017293,-0.075392,0.078175,soc.religion.christian
1,0.178941,0.187470,-0.079834,0.107205,0.166234,0.019693,0.040066,-0.019030,-0.072647,0.133745,...,-0.118303,-0.016686,0.043755,0.055443,0.155301,0.016082,0.033731,-0.099639,0.073829,soc.religion.christian
2,0.202831,0.170065,-0.097257,0.128243,0.140756,0.013609,0.016742,0.029650,-0.049912,0.088734,...,-0.092249,-0.034035,0.056689,0.049364,0.157539,0.037136,0.029556,-0.080967,0.077829,soc.religion.christian
3,0.208845,0.245270,-0.103284,0.091380,0.157237,0.048984,0.036000,0.005156,-0.052962,0.120857,...,-0.105384,-0.043563,0.055596,0.056120,0.121999,0.023624,0.019472,-0.115223,0.068031,soc.religion.christian
4,0.185805,0.200608,-0.090226,0.083986,0.157810,0.045228,0.077690,0.000514,-0.020258,0.106118,...,-0.108957,-0.053231,0.061687,0.034940,0.137244,0.024679,0.039585,-0.137030,0.093189,soc.religion.christian
5,0.105656,0.135173,-0.043216,0.047694,0.105946,-0.001172,0.115349,0.059904,-0.023608,0.039884,...,-0.107429,-0.093504,0.029772,0.049526,0.099876,0.017826,0.096981,-0.123978,0.081878,soc.religion.christian
6,0.178755,0.182410,-0.090002,0.091815,0.152302,0.048337,0.032308,0.022511,-0.033339,0.115746,...,-0.089736,-0.051238,0.057909,0.061582,0.130873,0.013066,0.029940,-0.105021,0.095332,soc.religion.christian
7,0.090505,0.109543,-0.060146,0.057255,0.110668,-0.001868,0.093157,0.022505,-0.005506,0.042767,...,-0.079171,-0.063464,0.037490,0.063633,0.130327,0.005175,0.059961,-0.070104,0.083359,soc.religion.christian
8,0.111884,0.144636,-0.063390,0.072714,0.108716,0.017135,0.085774,0.080821,-0.028071,0.031066,...,-0.089500,-0.075519,0.033466,0.047627,0.125210,0.003856,0.049750,-0.059231,0.058041,soc.religion.christian
9,0.135486,0.172539,-0.071040,0.081852,0.130687,0.036760,0.045177,0.042558,-0.025867,0.066756,...,-0.093644,-0.043220,0.054607,0.059081,0.128636,0.029543,0.027708,-0.092555,0.107019,soc.religion.christian






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.181281,0.228398,-0.075890,0.098953,0.124880,0.040266,0.058507,0.023693,-0.042918,0.126799,...,-0.098906,-0.036052,0.050036,0.045954,0.131378,0.003953,0.044525,-0.122465,0.071459,soc.religion.christian
1,0.166943,0.141915,-0.058615,0.072855,0.123155,0.022023,0.037875,-0.002635,-0.073110,0.098993,...,-0.061153,-0.044855,0.026496,0.037081,0.146074,-0.016054,0.070424,-0.099942,0.069886,soc.religion.christian
2,0.146988,0.125112,-0.043065,0.094749,0.106156,0.000753,0.046489,0.014582,-0.064784,0.105506,...,-0.087145,-0.038346,0.045473,0.038739,0.150991,0.013829,0.041216,-0.082288,0.088481,soc.religion.christian
3,0.131239,0.177776,-0.054687,0.067915,0.103457,0.034679,0.087555,0.032761,-0.032261,0.086895,...,-0.110234,-0.073667,-0.015017,0.069666,0.147622,-0.010669,0.076071,-0.098396,0.066565,soc.religion.christian
4,0.160831,0.223853,-0.106787,0.097534,0.147625,0.072289,0.030739,0.009655,-0.036697,0.121479,...,-0.113563,-0.052026,0.046215,0.045257,0.130601,0.008114,0.055839,-0.133417,0.061739,soc.religion.christian
5,0.212691,0.201205,-0.110092,0.121360,0.171044,0.069911,0.012311,-0.013675,-0.047791,0.135481,...,-0.125282,-0.044721,0.051155,0.076026,0.135150,0.029299,0.025845,-0.101702,0.086462,soc.religion.christian
6,0.171318,0.173036,-0.055780,0.084935,0.139292,0.023904,0.041544,0.033871,-0.033655,0.101461,...,-0.114612,-0.062470,0.040511,0.047051,0.128280,-0.008847,0.051515,-0.099396,0.076984,soc.religion.christian
7,0.179166,0.204333,-0.074029,0.094361,0.138624,0.024829,0.047687,0.001240,-0.046740,0.105575,...,-0.090121,-0.025891,0.040806,0.040483,0.132452,0.001379,0.031532,-0.087991,0.074717,soc.religion.christian
8,0.194129,0.174792,-0.083872,0.109072,0.140045,0.016814,0.043043,0.039243,-0.058386,0.111187,...,-0.101575,-0.057712,0.077170,0.042844,0.162748,0.002867,0.039176,-0.076533,0.077580,soc.religion.christian
9,0.135488,0.154116,-0.044559,0.073116,0.123987,0.040695,0.047806,0.016156,-0.045030,0.072989,...,-0.100252,-0.058510,0.050253,0.046008,0.146536,0.009696,0.044494,-0.084077,0.061215,soc.religion.christian






Step 3 Succeed.
CPU times: user 2.07 s, sys: 56 ms, total: 2.12 s
Wall time: 2.17 s


In [107]:
%%time

# if wanna read data from CSV file

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(
        csvpath_root, '{}-w2v-{}-{}.csv'.format(
            tpart, embedFrom, modelFrom
        )
    )
    if os.path.exists(csvpath):
        df[tpart] = pd.DataFrame.from_csv(csvpath)
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print("read {} successfully".format(tpart))
        display(df[tpart])

read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.177065,0.220111,-0.144173,0.102411,0.097858,0.019923,0.061870,0.113750,0.052380,0.011849,...,-0.119355,-0.109509,0.017502,0.072398,0.094370,0.002441,0.068058,-0.141721,0.118900,comp.graphics
1,0.150134,0.171049,-0.060894,0.035372,0.127200,0.028285,0.075184,0.016150,-0.004238,0.043623,...,-0.068356,-0.083557,0.029568,0.096670,0.092045,0.050060,0.028716,-0.099589,0.038581,comp.graphics
2,0.112585,0.129564,-0.096344,0.075622,0.157670,0.073118,0.048869,0.016336,-0.020263,0.072829,...,-0.059083,-0.052792,0.033013,0.061560,0.150799,0.016177,0.059237,-0.104495,0.098079,soc.religion.christian
3,0.168117,0.169154,-0.078624,0.084069,0.112711,0.024250,0.064745,0.033847,-0.002577,0.059480,...,-0.086128,-0.097710,0.035511,0.051856,0.113769,-0.002649,0.061280,-0.134222,0.075854,comp.graphics
4,0.141454,0.179205,-0.080936,0.077810,0.106397,0.034038,0.074566,0.020144,0.021634,0.098362,...,-0.105267,-0.063068,0.024644,0.034122,0.119906,-0.002638,0.059990,-0.098128,0.061528,soc.religion.christian
5,0.113995,0.096731,-0.069479,0.079731,0.107044,-0.034205,0.068047,0.042622,-0.001899,0.031603,...,-0.089813,-0.068895,0.031148,0.029334,0.097274,-0.012348,0.045399,-0.068110,0.086690,comp.graphics
6,0.205406,0.228389,-0.157220,0.121508,0.150480,0.032163,0.029472,0.017817,-0.011324,0.138975,...,-0.114809,-0.035353,0.077374,0.071350,0.093895,0.007648,0.033415,-0.093949,0.080868,soc.religion.christian
7,0.091921,0.119454,-0.075911,0.022086,0.119876,0.031191,0.005383,0.010170,-0.056288,0.046024,...,-0.111086,-0.034373,0.022683,0.056666,0.129046,-0.003815,0.075984,-0.101797,0.119218,soc.religion.christian
8,0.136040,0.130605,-0.019988,0.053304,0.095861,0.036624,0.086791,0.022059,-0.032591,0.096640,...,-0.094072,-0.075990,0.031536,0.001119,0.111031,0.004717,0.073187,-0.119843,0.092968,rec.motorcycles
9,0.175618,0.133833,-0.115383,0.085493,0.079358,-0.024408,0.049117,0.039929,-0.013472,0.043383,...,-0.093006,-0.068485,0.006319,0.030944,0.109105,0.016425,0.056066,-0.087943,0.098050,rec.autos


read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,0.055221,0.074701,-0.059779,0.047216,0.091813,-0.038038,0.093739,0.049682,-0.008024,0.054676,...,-0.090123,-0.063039,0.028016,0.049200,0.102679,-0.006638,0.088112,-0.084938,0.119870,soc.religion.christian
1,0.123648,0.142640,-0.050926,0.069142,0.085433,0.033125,0.103106,0.025281,-0.023796,0.034686,...,-0.098824,-0.055011,0.013116,0.027032,0.090893,-0.013875,0.057162,-0.142370,0.086493,rec.motorcycles
2,0.213773,0.273997,-0.092123,0.057986,0.163515,0.074589,0.058103,0.021981,0.015551,0.060478,...,-0.118686,-0.044158,0.036226,0.032314,0.106265,0.045570,0.019126,-0.133571,0.120159,soc.religion.christian
3,0.157166,0.165131,-0.074020,0.041901,0.103741,-0.003904,0.070539,0.010501,0.001023,0.068057,...,-0.061571,-0.086888,0.025970,0.015124,0.097172,-0.011317,0.049156,-0.117782,0.089839,rec.motorcycles
4,0.157254,0.193971,-0.097019,0.025098,0.148397,0.027434,0.037771,0.038489,-0.065917,0.103195,...,-0.081350,-0.064140,0.020080,0.070574,0.075425,-0.009907,0.061753,-0.132262,0.072737,comp.graphics
5,0.094931,0.158852,-0.083298,0.036415,0.061289,0.039578,0.131851,0.079916,-0.036807,0.013662,...,-0.071642,-0.089384,-0.025363,0.029126,0.088830,-0.017244,0.041484,-0.128943,0.082857,rec.motorcycles
6,0.071874,0.081405,-0.021858,0.022914,0.056762,-0.018064,0.104337,0.057867,-0.001098,0.031533,...,-0.097016,-0.073418,-0.008779,0.026716,0.137433,-0.026200,0.063651,-0.092392,0.074638,rec.autos
7,0.042522,0.043513,-0.066467,0.065892,0.002371,-0.048086,0.099314,0.090553,-0.010046,0.003057,...,-0.064708,-0.085967,-0.014625,0.017201,0.129812,-0.008780,0.100066,-0.101729,0.073067,rec.motorcycles
8,0.183396,0.187342,-0.087993,0.083409,0.112866,0.055050,0.091043,0.040141,0.007992,0.079223,...,-0.101577,-0.082538,0.027820,0.054967,0.129546,0.011346,0.042986,-0.125952,0.090539,alt.atheism
9,0.033685,0.095458,-0.030138,0.075265,0.045467,0.012184,0.143523,0.069437,-0.031456,0.016175,...,-0.056311,-0.091113,-0.041184,0.018035,0.124233,-0.034256,0.094540,-0.090154,0.080557,rec.motorcycles


CPU times: user 944 ms, sys: 0 ns, total: 944 ms
Wall time: 944 ms


###### SVM classifier(gensim + text8)

In [108]:
%%time

# Step 5.1.1: SVM

# if 'TFIDF' == modelChoice:

#train
X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']
#test
X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

# else:
#     #train
#     X_train = df_new['train']['x']
#     y_train = df_new['train']['y']
#     #test
#     X_test = df_new['test']['x']
#     y_test_true = df_new['test']['y']

clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 4 finished")

Step 4 finished
CPU times: user 1.99 s, sys: 0 ns, total: 1.99 s
Wall time: 1.99 s


In [109]:
%%time
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Step 5.1.2: Test
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

0.830526315789
0.827658664987
0.830526315789
CPU times: user 24 ms, sys: 0 ns, total: 24 ms
Wall time: 21.3 ms


###### DNN classifier(gensim + text8)

In [110]:
%%time

# Step 4: One-hot representation for labels

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

lb = LabelBinarizer()
lb.fit(df['train']['class'])

df_new = {}
for tpart in ['train', 'test']:
    labels = lb.transform(df[tpart]['class'])
    labelsDf = pd.DataFrame(labels, columns=["class-{}".format(i) for i in range(len(lb.classes_))])
    df_new[tpart] = {}
    df_new[tpart]['y'] = labelsDf
    df_new[tpart]['x'] = df[tpart].drop('class', axis=1)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    #save in CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelFrom))
        df_new[tpart][subpart].to_csv(csvpath)
    
print("label cleaning succussfully")

label cleaning succussfully
CPU times: user 3.64 s, sys: 108 ms, total: 3.75 s
Wall time: 3.75 s


In [111]:
%%time

## Step 5 : Train the classifier

COL_OUTCOME = 'class'
COL_FEATURE = [str(col) for col in list(df['train'].columns) if col != COL_OUTCOME]

cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

def my_input_fn(dataset):
    # Save dataset in tf format
    feature_cols = {
        str(col): tf.constant(
            df[dataset][str(col)].values
        )
        for col in COL_FEATURE
    }
    labels = tf.constant([cls2num[labelname] for labelname in df[dataset][COL_OUTCOME].values])
    # Returns the feature columns and labels in tf format
    return feature_cols, labels

feature_columns = [tf.contrib.layers.real_valued_column(column_name=str(col)) for col in COL_FEATURE]
clf = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, 
    hidden_units=[512], 
    n_classes=len(df['train']['class'].unique())
)

clf.fit(input_fn=lambda: my_input_fn('train'), steps=2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f152ac0e1d0>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}






































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmp26dqmD/model.ckpt.
INFO:tensorflow:loss = 1.61282, step = 1
INFO:tensorflow:global_step/sec: 11.1793
INFO:tensorflow:loss = 1.36607, step = 101
INFO:tensorflow:global_step/sec: 11.0008
INFO:tensorflow:loss = 1.15435, step = 201
INFO:tensorflow:global_step/sec: 10.8384
INFO:tensorflow:loss = 0.987877, step = 301
INFO:tensorflow:global_step/sec: 11.5904
INFO:tensorflow:loss = 0.858602, step = 401
INFO:tensorflow:global_step/sec: 11.781
INFO:tensorflow:loss = 0.762772, step = 501
INFO:tensorflow:global_step/sec: 11.7706
INFO:tensorflow:loss = 0.690341, step = 601
INFO:t

In [112]:
%%time

## Step 6: Evaluate

accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=df['test'].shape[0])['accuracy']
print("Test Accuracy by TensorFlow: {}".format(accuracy_score))

X_tensor_test, yt = my_input_fn('test')
tensorPredCls = list(clf.predict(input_fn=lambda: my_input_fn('test')))
num2cls = {v:k for (k, v) in cls2num.items()}
tensorPredClsStr = [num2cls[i] for i in tensorPredCls]
y_test_true = df['test']['class']
print('Test Accuracy by Scikit-learn: ', f1_score(y_test_true, tensorPredClsStr, average='micro'))







































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-05-13-10:00:32
INFO:tensorflow:Evaluation [1/1900]
INFO:tensorflow:Evaluation [2/1900]
INFO:tensorflow:Evaluation [3/1900]
INFO:tensorflow:Evaluation [4/1900]
INFO:tensorflow:Evaluation [5/1900]
INFO:tensorflow:Evaluation [6/1900]
INFO:tensorflow:Evaluation [7/1900]
INFO:tensorflow:Evaluation [8/1900]
INFO:tensorflow:Evaluation [9/1900]
INFO:tensorflow:Evaluation [10/1900]
INFO:tensorflow:Evaluation [11/1900]
INFO:tensorflow:Evaluation [12/1900]
INFO:tensorflow:Evaluation [13/1900]
INFO:tensorflow:Evaluation [14/1900]
INFO:tensorflow:Evaluation [15/1900]
INFO:tensorflow:Evaluation [16/1900]
INFO:tensorflow:Evaluation [1

INFO:tensorflow:Evaluation [73/1900]
INFO:tensorflow:Evaluation [74/1900]
INFO:tensorflow:Evaluation [75/1900]
INFO:tensorflow:Evaluation [76/1900]
INFO:tensorflow:Evaluation [77/1900]
INFO:tensorflow:Evaluation [78/1900]
INFO:tensorflow:Evaluation [79/1900]
INFO:tensorflow:Evaluation [80/1900]
INFO:tensorflow:Evaluation [81/1900]
INFO:tensorflow:Evaluation [82/1900]
INFO:tensorflow:Evaluation [83/1900]
INFO:tensorflow:Evaluation [84/1900]
INFO:tensorflow:Evaluation [85/1900]
INFO:tensorflow:Evaluation [86/1900]
INFO:tensorflow:Evaluation [87/1900]
INFO:tensorflow:Evaluation [88/1900]
INFO:tensorflow:Evaluation [89/1900]
INFO:tensorflow:Evaluation [90/1900]
INFO:tensorflow:Evaluation [91/1900]
INFO:tensorflow:Evaluation [92/1900]
INFO:tensorflow:Evaluation [93/1900]
INFO:tensorflow:Evaluation [94/1900]
INFO:tensorflow:Evaluation [95/1900]
INFO:tensorflow:Evaluation [96/1900]
INFO:tensorflow:Evaluation [97/1900]
INFO:tensorflow:Evaluation [98/1900]
INFO:tensorflow:Evaluation [99/1900]
I

INFO:tensorflow:Evaluation [290/1900]
INFO:tensorflow:Evaluation [291/1900]
INFO:tensorflow:Evaluation [292/1900]
INFO:tensorflow:Evaluation [293/1900]
INFO:tensorflow:Evaluation [294/1900]
INFO:tensorflow:Evaluation [295/1900]
INFO:tensorflow:Evaluation [296/1900]
INFO:tensorflow:Evaluation [297/1900]
INFO:tensorflow:Evaluation [298/1900]
INFO:tensorflow:Evaluation [299/1900]
INFO:tensorflow:Evaluation [300/1900]
INFO:tensorflow:Evaluation [301/1900]
INFO:tensorflow:Evaluation [302/1900]
INFO:tensorflow:Evaluation [303/1900]
INFO:tensorflow:Evaluation [304/1900]
INFO:tensorflow:Evaluation [305/1900]
INFO:tensorflow:Evaluation [306/1900]
INFO:tensorflow:Evaluation [307/1900]
INFO:tensorflow:Evaluation [308/1900]
INFO:tensorflow:Evaluation [309/1900]
INFO:tensorflow:Evaluation [310/1900]
INFO:tensorflow:Evaluation [311/1900]
INFO:tensorflow:Evaluation [312/1900]
INFO:tensorflow:Evaluation [313/1900]
INFO:tensorflow:Evaluation [314/1900]
INFO:tensorflow:Evaluation [315/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [506/1900]
INFO:tensorflow:Evaluation [507/1900]
INFO:tensorflow:Evaluation [508/1900]
INFO:tensorflow:Evaluation [509/1900]
INFO:tensorflow:Evaluation [510/1900]
INFO:tensorflow:Evaluation [511/1900]
INFO:tensorflow:Evaluation [512/1900]
INFO:tensorflow:Evaluation [513/1900]
INFO:tensorflow:Evaluation [514/1900]
INFO:tensorflow:Evaluation [515/1900]
INFO:tensorflow:Evaluation [516/1900]
INFO:tensorflow:Evaluation [517/1900]
INFO:tensorflow:Evaluation [518/1900]
INFO:tensorflow:Evaluation [519/1900]
INFO:tensorflow:Evaluation [520/1900]
INFO:tensorflow:Evaluation [521/1900]
INFO:tensorflow:Evaluation [522/1900]
INFO:tensorflow:Evaluation [523/1900]
INFO:tensorflow:Evaluation [524/1900]
INFO:tensorflow:Evaluation [525/1900]
INFO:tensorflow:Evaluation [526/1900]
INFO:tensorflow:Evaluation [527/1900]
INFO:tensorflow:Evaluation [528/1900]
INFO:tensorflow:Evaluation [529/1900]
INFO:tensorflow:Evaluation [530/1900]
INFO:tensorflow:Evaluation [531/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [722/1900]
INFO:tensorflow:Evaluation [723/1900]
INFO:tensorflow:Evaluation [724/1900]
INFO:tensorflow:Evaluation [725/1900]
INFO:tensorflow:Evaluation [726/1900]
INFO:tensorflow:Evaluation [727/1900]
INFO:tensorflow:Evaluation [728/1900]
INFO:tensorflow:Evaluation [729/1900]
INFO:tensorflow:Evaluation [730/1900]
INFO:tensorflow:Evaluation [731/1900]
INFO:tensorflow:Evaluation [732/1900]
INFO:tensorflow:Evaluation [733/1900]
INFO:tensorflow:Evaluation [734/1900]
INFO:tensorflow:Evaluation [735/1900]
INFO:tensorflow:Evaluation [736/1900]
INFO:tensorflow:Evaluation [737/1900]
INFO:tensorflow:Evaluation [738/1900]
INFO:tensorflow:Evaluation [739/1900]
INFO:tensorflow:Evaluation [740/1900]
INFO:tensorflow:Evaluation [741/1900]
INFO:tensorflow:Evaluation [742/1900]
INFO:tensorflow:Evaluation [743/1900]
INFO:tensorflow:Evaluation [744/1900]
INFO:tensorflow:Evaluation [745/1900]
INFO:tensorflow:Evaluation [746/1900]
INFO:tensorflow:Evaluation [747/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [938/1900]
INFO:tensorflow:Evaluation [939/1900]
INFO:tensorflow:Evaluation [940/1900]
INFO:tensorflow:Evaluation [941/1900]
INFO:tensorflow:Evaluation [942/1900]
INFO:tensorflow:Evaluation [943/1900]
INFO:tensorflow:Evaluation [944/1900]
INFO:tensorflow:Evaluation [945/1900]
INFO:tensorflow:Evaluation [946/1900]
INFO:tensorflow:Evaluation [947/1900]
INFO:tensorflow:Evaluation [948/1900]
INFO:tensorflow:Evaluation [949/1900]
INFO:tensorflow:Evaluation [950/1900]
INFO:tensorflow:Evaluation [951/1900]
INFO:tensorflow:Evaluation [952/1900]
INFO:tensorflow:Evaluation [953/1900]
INFO:tensorflow:Evaluation [954/1900]
INFO:tensorflow:Evaluation [955/1900]
INFO:tensorflow:Evaluation [956/1900]
INFO:tensorflow:Evaluation [957/1900]
INFO:tensorflow:Evaluation [958/1900]
INFO:tensorflow:Evaluation [959/1900]
INFO:tensorflow:Evaluation [960/1900]
INFO:tensorflow:Evaluation [961/1900]
INFO:tensorflow:Evaluation [962/1900]
INFO:tensorflow:Evaluation [963/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [1150/1900]
INFO:tensorflow:Evaluation [1151/1900]
INFO:tensorflow:Evaluation [1152/1900]
INFO:tensorflow:Evaluation [1153/1900]
INFO:tensorflow:Evaluation [1154/1900]
INFO:tensorflow:Evaluation [1155/1900]
INFO:tensorflow:Evaluation [1156/1900]
INFO:tensorflow:Evaluation [1157/1900]
INFO:tensorflow:Evaluation [1158/1900]
INFO:tensorflow:Evaluation [1159/1900]
INFO:tensorflow:Evaluation [1160/1900]
INFO:tensorflow:Evaluation [1161/1900]
INFO:tensorflow:Evaluation [1162/1900]
INFO:tensorflow:Evaluation [1163/1900]
INFO:tensorflow:Evaluation [1164/1900]
INFO:tensorflow:Evaluation [1165/1900]
INFO:tensorflow:Evaluation [1166/1900]
INFO:tensorflow:Evaluation [1167/1900]
INFO:tensorflow:Evaluation [1168/1900]
INFO:tensorflow:Evaluation [1169/1900]
INFO:tensorflow:Evaluation [1170/1900]
INFO:tensorflow:Evaluation [1171/1900]
INFO:tensorflow:Evaluation [1172/1900]
INFO:tensorflow:Evaluation [1173/1900]
INFO:tensorflow:Evaluation [1174/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1361/1900]
INFO:tensorflow:Evaluation [1362/1900]
INFO:tensorflow:Evaluation [1363/1900]
INFO:tensorflow:Evaluation [1364/1900]
INFO:tensorflow:Evaluation [1365/1900]
INFO:tensorflow:Evaluation [1366/1900]
INFO:tensorflow:Evaluation [1367/1900]
INFO:tensorflow:Evaluation [1368/1900]
INFO:tensorflow:Evaluation [1369/1900]
INFO:tensorflow:Evaluation [1370/1900]
INFO:tensorflow:Evaluation [1371/1900]
INFO:tensorflow:Evaluation [1372/1900]
INFO:tensorflow:Evaluation [1373/1900]
INFO:tensorflow:Evaluation [1374/1900]
INFO:tensorflow:Evaluation [1375/1900]
INFO:tensorflow:Evaluation [1376/1900]
INFO:tensorflow:Evaluation [1377/1900]
INFO:tensorflow:Evaluation [1378/1900]
INFO:tensorflow:Evaluation [1379/1900]
INFO:tensorflow:Evaluation [1380/1900]
INFO:tensorflow:Evaluation [1381/1900]
INFO:tensorflow:Evaluation [1382/1900]
INFO:tensorflow:Evaluation [1383/1900]
INFO:tensorflow:Evaluation [1384/1900]
INFO:tensorflow:Evaluation [1385/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1572/1900]
INFO:tensorflow:Evaluation [1573/1900]
INFO:tensorflow:Evaluation [1574/1900]
INFO:tensorflow:Evaluation [1575/1900]
INFO:tensorflow:Evaluation [1576/1900]
INFO:tensorflow:Evaluation [1577/1900]
INFO:tensorflow:Evaluation [1578/1900]
INFO:tensorflow:Evaluation [1579/1900]
INFO:tensorflow:Evaluation [1580/1900]
INFO:tensorflow:Evaluation [1581/1900]
INFO:tensorflow:Evaluation [1582/1900]
INFO:tensorflow:Evaluation [1583/1900]
INFO:tensorflow:Evaluation [1584/1900]
INFO:tensorflow:Evaluation [1585/1900]
INFO:tensorflow:Evaluation [1586/1900]
INFO:tensorflow:Evaluation [1587/1900]
INFO:tensorflow:Evaluation [1588/1900]
INFO:tensorflow:Evaluation [1589/1900]
INFO:tensorflow:Evaluation [1590/1900]
INFO:tensorflow:Evaluation [1591/1900]
INFO:tensorflow:Evaluation [1592/1900]
INFO:tensorflow:Evaluation [1593/1900]
INFO:tensorflow:Evaluation [1594/1900]
INFO:tensorflow:Evaluation [1595/1900]
INFO:tensorflow:Evaluation [1596/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1783/1900]
INFO:tensorflow:Evaluation [1784/1900]
INFO:tensorflow:Evaluation [1785/1900]
INFO:tensorflow:Evaluation [1786/1900]
INFO:tensorflow:Evaluation [1787/1900]
INFO:tensorflow:Evaluation [1788/1900]
INFO:tensorflow:Evaluation [1789/1900]
INFO:tensorflow:Evaluation [1790/1900]
INFO:tensorflow:Evaluation [1791/1900]
INFO:tensorflow:Evaluation [1792/1900]
INFO:tensorflow:Evaluation [1793/1900]
INFO:tensorflow:Evaluation [1794/1900]
INFO:tensorflow:Evaluation [1795/1900]
INFO:tensorflow:Evaluation [1796/1900]
INFO:tensorflow:Evaluation [1797/1900]
INFO:tensorflow:Evaluation [1798/1900]
INFO:tensorflow:Evaluation [1799/1900]
INFO:tensorflow:Evaluation [1800/1900]
INFO:tensorflow:Evaluation [1801/1900]
INFO:tensorflow:Evaluation [1802/1900]
INFO:tensorflow:Evaluation [1803/1900]
INFO:tensorflow:Evaluation [1804/1900]
INFO:tensorflow:Evaluation [1805/1900]
INFO:tensorflow:Evaluation [1806/1900]
INFO:tensorflow:Evaluation [1807/1900]
INFO:tensorflow:Evaluatio







































Test Accuracy by Scikit-learn:  0.809473684211
CPU times: user 9min 3s, sys: 50 s, total: 9min 53s
Wall time: 1min 45s


###### CNN classifier(gensim + text8)

In [69]:
%%time

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

sess = tf.InteractiveSession()

COL_OUTCOME = 'class'
COL_FEATURE = [col for col in list(df['train'].columns) if col != COL_OUTCOME]

# cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

count_feature = len(COL_FEATURE)
count_class = len(df['train']['class'].unique())

x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
y_ = tf.placeholder(tf.float32, shape=[None, count_class], name='y_')

W = tf.Variable(tf.zeros([count_feature, count_class]))
b = tf.Variable(tf.zeros([count_class]))
y = tf.matmul(x, W) + b

# cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_text = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_text, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32, name='keep_prob')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, count_class])
b_fc2 = bias_variable([count_class])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

print("CNN initialization finished")

CNN initialization finished
CPU times: user 36 ms, sys: 0 ns, total: 36 ms
Wall time: 36.7 ms


In [70]:
%%time

### Start to traini and evaluate the model

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess.run(tf.global_variables_initializer())

x_input = df_new['train']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['train']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]
# y_input = [np.array([y_input.iloc[i].values]) for i in range(y_input.shape[0])]

# not use random input

for i in range(df['train'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, training accuracy {}".format(i, np.mean(train_accuracy)))
    train_step.run(feed_dict={
        keep_prob: 0.5,
        x:  np.array([elem[0] for elem in x_input[i:i+50]]),#x_input.iloc[i].values, #
        y_: np.array([elem[0] for elem in y_input[i:i+50]])#y_input.iloc[i].values#
    })

print("CNN training finished")

step 0, training accuracy 0.211600005627
step 100, training accuracy 0.321599990129
step 200, training accuracy 0.512800037861
step 300, training accuracy 0.61879992485
step 400, training accuracy 0.57279998064
step 500, training accuracy 0.709200084209
step 600, training accuracy 0.688800036907
step 700, training accuracy 0.630800008774
step 800, training accuracy 0.652400076389
step 900, training accuracy 0.734400093555
step 1000, training accuracy 0.644799947739
step 1100, training accuracy 0.745999932289
step 1200, training accuracy 0.767199933529
step 1300, training accuracy 0.762399971485
step 1400, training accuracy 0.737200081348
step 1500, training accuracy 0.761199951172
step 1600, training accuracy 0.827199995518
step 1700, training accuracy 0.82959997654
step 1800, training accuracy 0.682000100613
step 1900, training accuracy 0.841599881649
step 2000, training accuracy 0.853199899197
step 2100, training accuracy 0.77480006218
step 2200, training accuracy 0.798800051212
step

In [71]:
%%time

# Evaluate

x_input = df_new['test']['x']#df_new['test']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['test']['y']#df_new['test']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]

for i in range(df['test'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, testing accuracy {}".format(i, np.mean(train_accuracy)))

        
print("CNN testing finished")

step 0, testing accuracy 0.712000131607
step 100, testing accuracy 0.715600073338
step 200, testing accuracy 0.726800024509
step 300, testing accuracy 0.857200026512
step 400, testing accuracy 0.729999899864
step 500, testing accuracy 0.82439994812
step 600, testing accuracy 0.726400017738
step 700, testing accuracy 0.704400002956
step 800, testing accuracy 0.795599997044
step 900, testing accuracy 0.645599961281
step 1000, testing accuracy 0.723200082779
step 1100, testing accuracy 0.737999975681
step 1200, testing accuracy 0.807200074196
step 1300, testing accuracy 0.759599983692
step 1400, testing accuracy 0.709999978542
step 1500, testing accuracy 0.768800020218
step 1600, testing accuracy 0.777999937534
step 1700, testing accuracy 0.814000070095
step 1800, testing accuracy 0.771599888802
CNN testing finished
CPU times: user 2min 40s, sys: 10.1 s, total: 2min 51s
Wall time: 26 s


##### 3.2.1.2 只用待学习语料建模

In [73]:
embedFrom = 'corpus'

In [85]:
%%time

# collect sentences from raw data
sentences = {}
pathtmp = {}
pathtmp['root'] = os.path.join(paths['dir.dataroot'], 'trialdata')
for tpart in ['train', 'test']:
    pathtmp[tpart] = os.path.join(pathtmp['root'], tpart)
    sentences[tpart] = []
    folderList = os.listdir(pathtmp[tpart])
    for folder in folderList:
        fileList = os.listdir(os.path.join(pathtmp[tpart], folder))
        for eachf in fileList:
            fpathtmp = os.path.join(pathtmp[tpart], folder, eachf)
            with open(fpathtmp, 'r') as f:
                sentences[tpart].append(f.read())
      #save sentences in file
        sentencePath = os.path.join(pathtmp['root'], 'sentences-{}'.format(tpart))
        with open(sentencePath, 'w') as f:
            for sentence in sentences[tpart]:
                f.write(sentence)
                f.write('\n')

pathtmp = os.path.join(pathtmp['root'], 'sentences-train')
sentences = []
with open(pathtmp, 'r') as f:
    buff = f.read()
    sentencesBuffer = buff.split('\n')
    sentences = [stcbuffer.split() for stcbuffer in sentencesBuffer]

print('get sentences from training corpus successfully')
print('example:')
print(sentences[random.randrange(len(sentences))])

get sentences from training corpus successfully
example:
['from', 'livesey', 'solntze', 'wpd', 'sgi', 'com', 'jon', 'livesey', 'subject', 're', 'slavery', 'was', 're', 'why', 'is', 'sex', 'only', 'allowed', 'in', 'marriage', 'organization', 'sgi', 'lines', 'three', 'seven', 'distribution', 'world', 'nntp', 'posting', 'host', 'solntze', 'wpd', 'sgi', 'com', 'in', 'article', 'mas', 'cadence', 'com', 'masud', 'khan', 'writes', 'leonard', 'i', 'll', 'give', 'you', 'an', 'example', 'of', 'this', 'my', 'father', 'recently', 'bought', 'a', 'business', 'the', 'business', 'price', 'was', 'one', 'five', 'zero', 'zero', 'zero', 'zero', 'pounds', 'and', 'my', 'father', 'approached', 'the', 'people', 'in', 'the', 'community', 'for', 'help', 'he', 'raised', 'six', 'zero', 'zero', 'zero', 'zero', 'pounds', 'in', 'interest', 'free', 'loans', 'from', 'friends', 'and', 'relatives', 'and', 'muslims', 'he', 'knew', 'five', 'zero', 'zero', 'zero', 'zero', 'had', 'cash', 'and', 'the', 'rest', 'he', 'got', '

In [87]:
%%time

vecsize = 784
model = gensim.models.word2vec.Word2Vec(iter=15, size=vecsize, sg=1, window=5, workers=12)
print("Model created.")

Model created.
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 238 µs


In [88]:
%%time
   
trimmer = lambda word, count, min_count: gensim.utils.RULE_DISCARD if word in stopwords else gensim.utils.RULE_KEEP

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 6.91 µs


In [89]:
%%time

model.build_vocab(sentences=sentences, trim_rule=trimmer)

print("Vocabulary creation finished.")

Vocabulary creation finished.
CPU times: user 5.72 s, sys: 20 ms, total: 5.74 s
Wall time: 5.64 s


In [90]:
%%time

model.train(sentences=sentences)

print('Training finished.')

Training finished.
CPU times: user 3min 27s, sys: 56 ms, total: 3min 27s
Wall time: 26.8 s


In [91]:
%%time

# save model

calander = datetime.date.today().timetuple()

modelpath = os.path.join(
    paths['dir.{}.{}'.format(modelFrom, embedFrom)],
    '{}.{}.{}{}{}'.format(modelFrom, embedFrom, calander.tm_year, calander.tm_mon, calander.tm_mday)
)
model.save(modelpath)

print("save model finished")

save model finished
CPU times: user 8 ms, sys: 4 ms, total: 12 ms
Wall time: 12.4 ms


In [92]:
%%time

# Step 2: read data and save it in data['vec.train'] 和 data['vec.test']

data = {}
data['vec.train'] = {'w2v.mean':[], 'class':[]}
data['vec.test'] = {'w2v.mean':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                tokens = [token for token in readf.read().split()] # if token not in stopwords]#readf.read().split()#
                # Word2Vec representation
                # begin
                vec = np.array([0.0 for i in range(vecsize)])
                expectationVal = np.array([0.0 for i in range(vecsize)])
                countvec = 0
                for token in tokens:
                    try:
                        vec += model[token]
                        countvec += 1
                    except:
                        vec += expectationVal
                vec = vec / float(countvec)#float(len(tokens))
                 # end
            data['vec.{}'.format(tpart)]['w2v.mean'].append(vec)
            data['vec.{}'.format(tpart)]['class'].append(cls)

    tmp = data['vec.{}'.format(tpart)]
    ind = (random.sample(range(len(tmp['class'])), 1))[0]
    print("sample(transformed) from {}[{}]:\n[corpus]\n {}\n[class]\n{}".format(
            tpart, ind, tmp['w2v.mean'][ind], tmp['class'][ind]
        )
    )
    print()
    
print("Step 2 Succeed")

sample(transformed) from train[2137]:
[corpus]
 [ -6.92075080e-02  -1.10341109e-02  -1.57756309e-03  -4.38288346e-02
  -2.56870093e-03  -1.91868105e-02   2.59707384e-02  -1.38480009e-02
   4.66735698e-02   1.22320529e-02  -4.48232980e-03   3.01153942e-02
   4.60493026e-03   3.00879649e-02   4.19246667e-02   1.45628062e-02
  -5.77818627e-02   2.03765628e-02   2.11208692e-03   1.31738036e-02
   1.65932280e-03  -1.35084295e-02  -2.44918113e-02  -3.92084046e-02
  -5.48109850e-03  -4.01681491e-03   1.42943723e-02  -2.04842861e-03
   8.92381919e-03  -1.54173397e-02  -1.73201344e-04  -3.68162232e-02
   8.38746933e-03   1.47829359e-02   1.49441627e-03   1.45780508e-02
  -3.15133147e-02  -3.66009132e-02  -1.93746889e-02  -1.80645023e-02
  -6.49417490e-02  -2.29584450e-02  -2.50614714e-02  -1.47653579e-02
  -4.99981898e-04  -5.59790347e-02   2.02353659e-02   1.19156180e-03
  -6.45340259e-04  -6.39465189e-03  -4.33630103e-03   4.03514177e-03
   2.22452323e-02   7.41300533e-03  -7.16117787e-03  -5

sample(transformed) from test[1103]:
[corpus]
 [-0.11811182 -0.03760944  0.00637991 -0.10428453  0.02088733 -0.00743363
  0.00273809  0.00321351  0.07759132  0.02099297  0.06763557  0.07964662
  0.04184821  0.11546032  0.05451436 -0.00761499 -0.09659691  0.01623226
  0.00857963  0.02312686 -0.03373558  0.03801262 -0.01173902 -0.07968408
 -0.01005026 -0.01140222 -0.00076246  0.0052872   0.01264707 -0.04825866
  0.01448141 -0.0767749   0.07888501  0.05307805  0.00290516 -0.0237392
 -0.05275487 -0.0825643  -0.01165215 -0.06120615 -0.10834015 -0.03880486
 -0.02553512 -0.02379313 -0.00265643 -0.01575326  0.02981099 -0.05304489
 -0.03617571 -0.01693243 -0.01640014  0.00522742  0.01684704  0.09463365
  0.02964319  0.01782001  0.07333235  0.00321156  0.02612249 -0.03638247
  0.01153358  0.02508477  0.04155179  0.01252466  0.01814196  0.04597889
 -0.03922394  0.01359904 -0.00491344 -0.00076291  0.02965598 -0.04829625
 -0.03138166 -0.01471911 -0.01941514 -0.01868932  0.00892622 -0.0564291
  0.03

In [93]:
%%time

# Step 3: Save in Pandas.DataFrame
#
# 将 data['matrix.train'] 与 data['matrix.test'] 转换成 Pandas.DataFrame 格式，保存到 df['train'] 和 df['test'] 中（df 为字典格式：String -> DataFrame）

df = {}
csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['vec.{}'.format(tpart)]['class']
    datavec = np.array(data['vec.{}'.format(tpart)]['w2v.mean'])
    for col in range(vecsize):
        datadict[col]= datavec[:, col]

    df[tpart] = pd.DataFrame(data=datadict)
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")
    # write data in DataFrame into CSV
    csvpath = os.path.join(csvpath_root, '{}-w2v-{}-{}.csv'.format(tpart, embedFrom, modelFrom))
    df[tpart].to_csv(csvpath, columns=df[tpart].columns)
    
print("Step 3 Succeed.")

# 繁琐点：研究如何把 CSR 矩阵中的数据规整好放到 DataFrame 中，并与 Class 一一对应

See df[train]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.079730,-0.023649,-0.000637,-0.058202,0.000308,-0.016652,0.023493,-0.001410,0.078345,0.011816,...,-0.019484,-0.018181,0.052244,-0.027038,-0.025845,0.038562,0.051001,-0.001862,0.077121,soc.religion.christian
1,-0.056762,0.012930,0.021774,-0.066071,-0.015854,-0.005939,0.023844,0.020822,0.069722,-0.015533,...,-0.022090,-0.035425,0.057279,-0.005960,-0.040210,0.029291,0.056450,0.003017,0.062584,soc.religion.christian
2,-0.041504,-0.028601,-0.020247,-0.063503,-0.017398,-0.028696,0.034888,0.021110,0.054268,0.014402,...,-0.013897,-0.021263,0.051658,-0.011763,-0.060453,0.044683,0.044996,0.003645,0.053631,soc.religion.christian
3,-0.070549,-0.022650,0.004644,-0.057970,-0.013228,-0.018406,0.027923,0.011727,0.048545,-0.005384,...,-0.003719,-0.014226,0.046223,-0.005666,-0.023267,0.044336,0.045128,0.016655,0.063528,soc.religion.christian
4,-0.051495,-0.002146,-0.001128,-0.039290,0.003812,-0.021420,0.038133,-0.002986,0.042595,-0.002844,...,-0.026251,-0.023259,0.044939,-0.017648,-0.030844,0.038878,0.047336,-0.008602,0.066024,soc.religion.christian
5,-0.078105,0.016148,0.027750,-0.095598,-0.005802,-0.008280,0.010347,-0.029389,0.050281,-0.011014,...,-0.012819,-0.004744,0.052907,-0.029764,-0.015968,0.048749,0.035587,0.032801,0.059229,soc.religion.christian
6,-0.069843,-0.013937,-0.007243,-0.041216,-0.010258,-0.023068,0.035144,0.001942,0.050948,-0.000750,...,-0.007474,-0.024936,0.037026,-0.004318,-0.037518,0.038701,0.047312,-0.000626,0.049751,soc.religion.christian
7,-0.065727,0.025335,0.012322,-0.068324,0.001057,0.013478,0.009310,-0.007826,0.068203,-0.009847,...,-0.005894,-0.018491,0.063467,-0.034221,-0.025167,0.043036,0.057216,0.017436,0.056326,soc.religion.christian
8,-0.094956,0.002193,0.023108,-0.090306,0.003462,-0.021560,0.012234,-0.016177,0.068556,-0.004596,...,-0.008637,-0.022019,0.059442,-0.030542,-0.026341,0.048808,0.046653,0.014746,0.056790,soc.religion.christian
9,-0.062736,-0.001663,0.000600,-0.045071,0.000152,-0.020807,0.039427,0.012743,0.064969,-0.005784,...,-0.011838,-0.028456,0.051516,-0.013243,-0.024054,0.050275,0.044571,0.000135,0.050249,soc.religion.christian






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.062496,-0.003992,0.009274,-0.059166,0.002295,-0.010985,0.022264,0.014221,0.069631,-0.004797,...,-0.015165,-0.017867,0.041458,-0.023496,-0.039388,0.032961,0.052439,-0.006439,0.065302,soc.religion.christian
1,-0.087414,-0.038249,0.007048,-0.076555,0.021722,-0.045024,0.028561,0.004363,0.065021,0.018761,...,-0.022608,-0.032574,0.034987,-0.024158,-0.033635,0.040352,0.059029,-0.004064,0.065818,soc.religion.christian
2,-0.061214,-0.011331,0.005822,-0.036661,0.003768,-0.026180,0.026175,-0.001027,0.057774,-0.004146,...,-0.035250,-0.032543,0.036931,-0.025581,-0.038027,0.036342,0.043178,-0.000201,0.075717,soc.religion.christian
3,-0.072991,0.003276,0.025641,-0.074216,0.004734,0.011358,0.000940,-0.004469,0.069140,-0.031780,...,-0.019148,-0.020766,0.044144,-0.021039,-0.032014,0.048216,0.056782,0.018681,0.063605,soc.religion.christian
4,-0.069047,-0.025650,-0.014127,-0.039385,-0.000390,-0.016138,0.033307,0.003785,0.049565,0.004687,...,-0.014008,-0.025831,0.031113,-0.008229,-0.027739,0.044987,0.045992,-0.009662,0.046029,soc.religion.christian
5,-0.053782,-0.001994,-0.000659,-0.049186,-0.004210,-0.010316,0.034825,0.019230,0.055765,-0.015361,...,-0.013282,-0.015699,0.045814,-0.005469,-0.026109,0.030754,0.040282,-0.009578,0.060240,soc.religion.christian
6,-0.078397,-0.027214,-0.019220,-0.039719,0.000998,-0.026669,0.037096,-0.003177,0.048374,-0.011524,...,-0.005910,-0.016200,0.049237,-0.010819,-0.024510,0.045075,0.039118,0.007250,0.061699,soc.religion.christian
7,-0.054396,-0.024790,-0.005471,-0.054194,-0.001723,-0.003584,0.037532,0.009406,0.062509,0.000281,...,-0.030041,-0.012802,0.036459,-0.016810,-0.029783,0.023699,0.036585,-0.013798,0.066923,soc.religion.christian
8,-0.046941,-0.021817,-0.002797,-0.035273,-0.014086,-0.034321,0.030448,-0.002469,0.047158,0.004280,...,-0.017431,-0.031450,0.057079,-0.013610,-0.037398,0.044692,0.050148,0.000716,0.060035,soc.religion.christian
9,-0.063387,-0.027813,-0.007046,-0.044512,-0.000325,-0.018768,0.028955,-0.001471,0.065737,-0.004054,...,-0.016120,-0.025815,0.050199,-0.014001,-0.032361,0.036760,0.040632,-0.007805,0.074645,soc.religion.christian






Step 3 Succeed.
CPU times: user 2.12 s, sys: 32 ms, total: 2.16 s
Wall time: 2.16 s


In [94]:
# if wanna read data from CSV file

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(
        csvpath_root, '{}-w2v-{}-{}.csv'.format(
            tpart, embedFrom, modelFrom
        )
    )
    if os.path.exists(csvpath):
        df[tpart] = pd.DataFrame.from_csv(csvpath)
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print("read {} successfully".format(tpart))
        display(df[tpart])
        print('\n\n\n')

read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.041930,-0.037320,-0.013703,-0.031316,0.000647,-0.041839,0.020111,-0.001131,0.030702,0.015909,...,-0.012208,-0.014173,0.034225,-0.030942,-0.060793,0.055482,0.032378,0.005133,0.042735,comp.graphics
1,-0.053875,-0.037723,-0.001289,-0.044911,-0.009387,-0.031085,0.022400,0.000807,0.055394,0.008216,...,-0.015606,-0.026844,0.035230,-0.013593,-0.040151,0.034606,0.047973,-0.007257,0.045528,alt.atheism
2,-0.082996,-0.006368,0.012487,-0.101907,0.026058,-0.006897,0.030774,0.009709,0.069928,-0.006884,...,-0.008285,-0.003806,0.053953,-0.033277,-0.027587,0.040798,0.049388,0.031073,0.056879,rec.motorcycles
3,-0.079253,-0.038878,-0.018226,-0.040129,0.012973,-0.007393,0.042378,0.015500,0.064230,-0.002062,...,-0.006541,-0.023788,0.039469,-0.031862,-0.025708,0.043924,0.047402,-0.004219,0.059046,alt.atheism
4,-0.098658,-0.062071,-0.013294,-0.096206,0.023424,-0.006853,0.051590,0.035089,0.062530,0.000769,...,-0.030714,-0.019996,0.063740,-0.029146,-0.017373,0.052743,0.052029,0.009889,0.070545,alt.atheism
5,-0.133948,-0.026668,0.026274,-0.126822,0.013906,-0.018019,0.007832,0.001418,0.102060,0.020701,...,-0.047264,-0.010577,0.065511,-0.047223,0.000912,0.063435,0.063396,0.009550,0.071706,rec.autos
6,-0.125721,-0.017746,0.000925,-0.110393,0.008402,0.008770,0.000420,-0.003976,0.060523,-0.011276,...,-0.017145,0.012790,0.061869,-0.043170,-0.009553,0.036598,0.040131,0.029338,0.024254,comp.graphics
7,-0.046582,-0.008309,-0.013047,-0.034686,-0.013737,-0.023121,0.038858,0.005264,0.042349,0.007917,...,-0.009330,-0.028854,0.057401,-0.013395,-0.021866,0.037975,0.050527,-0.002294,0.045047,soc.religion.christian
8,-0.104960,-0.037653,0.011340,-0.054826,0.034294,-0.034874,0.031261,0.015569,0.054510,0.006463,...,-0.022376,-0.023033,0.029968,-0.007767,-0.039972,0.035874,0.046882,0.012881,0.083525,rec.autos
9,-0.141489,-0.066759,0.001601,-0.105917,0.021009,0.006697,0.014165,0.018442,0.091156,0.000600,...,-0.025044,-0.002760,0.082174,-0.046818,-0.028289,0.065187,0.071128,0.026814,0.063396,rec.motorcycles






read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.081731,-0.004764,-0.012101,-0.069220,-0.001379,-0.015316,0.022999,0.020267,0.069385,-0.006278,...,0.013480,-0.008676,0.056640,-0.040310,-0.022069,0.048490,0.034204,0.010040,0.065003,comp.graphics
1,-0.097792,-0.017958,0.001700,-0.055562,0.003669,-0.033278,0.033678,-0.004591,0.046991,-0.001492,...,-0.011707,-0.008645,0.051835,-0.019044,-0.026755,0.046990,0.042069,0.016343,0.057705,comp.graphics
2,-0.116530,0.008329,0.016162,-0.132806,0.017821,0.018096,-0.004903,-0.001694,0.083451,-0.009576,...,-0.014934,-0.007064,0.085981,-0.034492,-0.048087,0.089984,0.068729,0.002414,0.114820,rec.motorcycles
3,-0.106671,-0.017298,-0.018855,-0.059363,0.005475,-0.018509,0.002709,0.005687,0.056503,-0.024181,...,-0.007485,-0.000348,0.057278,-0.028655,-0.016583,0.034875,0.051682,0.024302,0.037784,comp.graphics
4,-0.065294,-0.013719,-0.000415,-0.044229,-0.001409,-0.023134,0.024657,0.002024,0.055189,-0.004980,...,-0.010764,-0.022938,0.039239,-0.010271,-0.025679,0.031539,0.047365,0.000367,0.048799,soc.religion.christian
5,-0.097884,-0.034024,0.035295,-0.085874,0.019200,-0.004469,0.028108,-0.003590,0.047012,-0.007293,...,-0.019252,-0.005879,0.061241,-0.027668,-0.014782,0.042143,0.051885,0.030211,0.033070,rec.motorcycles
6,-0.125321,-0.017656,0.024158,-0.124819,0.014406,0.013520,-0.003531,0.003272,0.110730,-0.018335,...,-0.027205,0.001510,0.084814,-0.056488,0.008497,0.043953,0.063471,0.022955,0.035609,alt.atheism
7,-0.087147,-0.012321,0.000941,-0.064898,0.009898,-0.007535,0.034204,0.007687,0.061507,-0.029589,...,-0.018586,0.004568,0.053618,-0.019018,-0.044329,0.046996,0.031444,0.032712,0.066138,comp.graphics
8,-0.088266,0.009463,0.006250,-0.092894,0.022079,0.011474,0.011496,0.009644,0.109954,-0.002389,...,-0.023825,-0.022800,0.073818,-0.041873,-0.034369,0.051517,0.071210,0.005400,0.059941,soc.religion.christian
9,-0.087879,-0.023612,0.004497,-0.080114,0.024681,-0.007263,0.026019,0.008766,0.046200,0.010603,...,-0.028790,-0.014976,0.051058,-0.036790,-0.010563,0.059333,0.055061,-0.001314,0.102799,rec.motorcycles








In [95]:
%%time

# if wanna read data from CSV file

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(
        csvpath_root, '{}-w2v-{}-{}.csv'.format(
            tpart, embedFrom, modelFrom
        )
    )
    if os.path.exists(csvpath):
        df[tpart] = pd.DataFrame.from_csv(csvpath)
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print("read {} successfully".format(tpart))
        display(df[tpart])

read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.094903,-0.033396,0.023281,-0.068885,0.012803,-0.020081,0.043820,-0.013916,0.045930,0.007536,...,0.015936,-0.009133,0.044235,-0.011449,-0.033533,0.049224,0.059578,-0.008464,0.073953,rec.autos
1,-0.098500,0.003244,0.054230,-0.109855,0.010125,0.003442,-0.017964,-0.000731,0.100552,-0.004235,...,-0.005796,-0.044728,0.058191,-0.029152,-0.057866,0.055716,0.082525,0.012761,0.089680,rec.autos
2,-0.139683,-0.053830,-0.005676,-0.095735,0.025591,-0.004752,0.049037,0.020243,0.075016,-0.003708,...,-0.013632,-0.001994,0.063991,-0.034154,-0.019847,0.055147,0.045145,0.033465,0.041801,rec.motorcycles
3,-0.112694,-0.026646,0.039184,-0.098880,0.024492,0.005924,0.010822,-0.010065,0.068318,0.020327,...,-0.034969,0.002065,0.055133,-0.046471,-0.006569,0.034517,0.063772,0.042543,0.047836,rec.motorcycles
4,-0.102097,0.016169,0.027686,-0.136797,0.000690,0.017346,-0.021475,-0.019313,0.093308,0.001078,...,-0.028761,0.011107,0.102111,-0.065938,-0.002415,0.051332,0.065822,0.018612,0.054210,comp.graphics
5,-0.070430,-0.021697,0.025177,-0.088297,0.017552,-0.004928,-0.013037,0.010590,0.092257,-0.027441,...,-0.015653,-0.017257,0.062221,-0.030948,-0.058316,0.065484,0.044497,-0.004387,0.106836,rec.autos
6,-0.179349,-0.019614,0.001812,-0.121889,0.023463,0.000710,0.033576,0.029739,0.109084,0.003450,...,-0.011901,-0.013323,0.070001,-0.018533,-0.030263,0.080563,0.057742,0.037061,0.049532,comp.graphics
7,-0.127285,-0.005624,0.015655,-0.119179,0.031591,0.046910,0.003071,-0.016354,0.086801,-0.019492,...,-0.052847,-0.000680,0.083374,-0.065415,-0.008515,0.047261,0.042404,0.035419,0.059149,rec.motorcycles
8,-0.089965,-0.017877,0.020218,-0.060343,0.010492,-0.026772,0.019693,-0.027253,0.058246,0.011499,...,-0.014743,-0.010517,0.052166,-0.021592,-0.039748,0.040507,0.046267,0.011517,0.087518,rec.motorcycles
9,-0.078160,-0.023529,0.002218,-0.040569,-0.012595,-0.004889,0.015876,-0.005516,0.055291,-0.011488,...,-0.000654,-0.016180,0.035070,-0.008485,-0.031401,0.056715,0.036504,-0.000791,0.055320,alt.atheism


read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.117148,-0.021425,0.000099,-0.113312,0.024052,0.004719,0.000188,0.012720,0.083197,-0.005689,...,-0.033537,-0.013109,0.059935,-0.033495,-0.015420,0.052427,0.064892,0.036332,0.064498,rec.autos
1,-0.102021,-0.037374,0.000158,-0.067116,0.005629,-0.028177,0.034275,-0.017012,0.071624,-0.003065,...,-0.007797,-0.018987,0.035484,-0.020930,-0.038710,0.043458,0.037201,0.010973,0.079081,rec.autos
2,-0.089513,-0.022315,-0.005322,-0.054825,-0.000899,-0.023198,0.032110,-0.005001,0.061583,0.002839,...,-0.011607,-0.011517,0.045687,-0.018633,-0.039093,0.030294,0.044496,0.008770,0.062570,rec.autos
3,-0.102370,-0.018573,0.005848,-0.061515,0.010681,-0.016677,0.038279,-0.001188,0.054219,-0.010859,...,-0.016656,0.000728,0.068354,-0.041630,-0.023546,0.038867,0.051942,0.020903,0.065162,comp.graphics
4,-0.061355,0.017289,0.028212,-0.070635,-0.003494,0.007862,0.002878,-0.011259,0.062566,-0.024810,...,-0.012964,-0.020334,0.039886,-0.028835,-0.019112,0.025722,0.047179,0.003098,0.071323,soc.religion.christian
5,-0.069710,-0.007212,0.012999,-0.063925,-0.010738,-0.016342,-0.000341,0.002082,0.081737,-0.000286,...,-0.019473,-0.020042,0.057389,-0.014258,-0.044634,0.038760,0.048375,-0.002766,0.075653,alt.atheism
6,-0.105284,-0.018029,0.001020,-0.062485,-0.005475,-0.023145,0.021618,-0.002325,0.051883,-0.012287,...,0.005362,-0.009142,0.051488,-0.016278,-0.026575,0.043514,0.038242,0.020516,0.059948,comp.graphics
7,-0.090577,-0.045687,-0.028171,-0.042403,0.011147,-0.036393,0.021754,0.017434,0.044115,-0.000617,...,-0.003240,-0.017318,0.043766,-0.007699,-0.029511,0.048096,0.035694,0.024916,0.060409,comp.graphics
8,-0.067873,-0.023985,-0.012259,-0.046345,0.004283,-0.012247,0.035525,0.007658,0.066258,-0.006282,...,-0.014699,-0.018004,0.040865,-0.012956,-0.030815,0.038679,0.042452,-0.012858,0.055945,soc.religion.christian
9,-0.113428,-0.046355,-0.011424,-0.061250,0.017173,-0.013115,0.021826,0.002743,0.065482,0.001897,...,-0.003136,-0.017735,0.047614,-0.021488,-0.042859,0.052737,0.054281,0.010689,0.067286,comp.graphics


CPU times: user 864 ms, sys: 16 ms, total: 880 ms
Wall time: 877 ms


###### SVM classifier(gensim + corpus)

In [96]:
%%time

# Step 5.1.1: SVM

# if 'TFIDF' == modelChoice:

#train
X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']
#test
X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

# else:
#     #train
#     X_train = df_new['train']['x']
#     y_train = df_new['train']['y']
#     #test
#     X_test = df_new['test']['x']
#     y_test_true = df_new['test']['y']

clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 4 finished")

Step 4 finished
CPU times: user 1.24 s, sys: 4 ms, total: 1.25 s
Wall time: 1.25 s


In [97]:
%%time
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Step 5.1.2: Test
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

0.827368421053
0.823115546239
0.827368421053
CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 19.7 ms


###### DNN classifier(gensim + corpus)

In [98]:
%%time

# Step 4: One-hot representation for labels

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

lb = LabelBinarizer()
lb.fit(df['train']['class'])

df_new = {}
for tpart in ['train', 'test']:
    labels = lb.transform(df[tpart]['class'])
    labelsDf = pd.DataFrame(labels, columns=["class-{}".format(i) for i in range(len(lb.classes_))])
    df_new[tpart] = {}
    df_new[tpart]['y'] = labelsDf
    df_new[tpart]['x'] = df[tpart].drop('class', axis=1)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    #save in CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelFrom))
        df_new[tpart][subpart].to_csv(csvpath)
    
print("label cleaning succussfully")

label cleaning succussfully
CPU times: user 3.7 s, sys: 52 ms, total: 3.75 s
Wall time: 3.75 s


In [99]:
%%time

## Step 5 : Train the classifier

COL_OUTCOME = 'class'
COL_FEATURE = [str(col) for col in list(df['train'].columns) if col != COL_OUTCOME]

cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

def my_input_fn(dataset):
    # Save dataset in tf format
    feature_cols = {
        str(col): tf.constant(
            df[dataset][str(col)].values
        )
        for col in COL_FEATURE
    }
    labels = tf.constant([cls2num[labelname] for labelname in df[dataset][COL_OUTCOME].values])
    # Returns the feature columns and labels in tf format
    return feature_cols, labels

feature_columns = [tf.contrib.layers.real_valued_column(column_name=str(col)) for col in COL_FEATURE]
clf = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, 
    hidden_units=[512], 
    n_classes=len(df['train']['class'].unique())
)

clf.fit(input_fn=lambda: my_input_fn('train'), steps=2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f1464dfc2d0>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}






































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpjRXwq8/model.ckpt.
INFO:tensorflow:loss = 1.60779, step = 1
INFO:tensorflow:global_step/sec: 11.3903
INFO:tensorflow:loss = 1.3959, step = 101
INFO:tensorflow:global_step/sec: 11.525
INFO:tensorflow:loss = 1.21157, step = 201
INFO:tensorflow:global_step/sec: 11.6707
INFO:tensorflow:loss = 1.07089, step = 301
INFO:tensorflow:global_step/sec: 11.8325
INFO:tensorflow:loss = 0.954396, step = 401
INFO:tensorflow:global_step/sec: 11.7476
INFO:tensorflow:loss = 0.863967, step = 501
INFO:tensorflow:global_step/sec: 11.6798
INFO:tensorflow:loss = 0.794398, step = 601
INFO:ten

In [100]:
%%time

## Step 6: Evaluate

accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=df['test'].shape[0])['accuracy']
print("Test Accuracy by TensorFlow: {}".format(accuracy_score))

X_tensor_test, yt = my_input_fn('test')
tensorPredCls = list(clf.predict(input_fn=lambda: my_input_fn('test')))
num2cls = {v:k for (k, v) in cls2num.items()}
tensorPredClsStr = [num2cls[i] for i in tensorPredCls]
y_test_true = df['test']['class']
print('Test Accuracy by Scikit-learn: ', f1_score(y_test_true, tensorPredClsStr, average='micro'))







































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-05-13-09:41:54
INFO:tensorflow:Evaluation [1/1900]
INFO:tensorflow:Evaluation [2/1900]
INFO:tensorflow:Evaluation [3/1900]
INFO:tensorflow:Evaluation [4/1900]
INFO:tensorflow:Evaluation [5/1900]
INFO:tensorflow:Evaluation [6/1900]
INFO:tensorflow:Evaluation [7/1900]
INFO:tensorflow:Evaluation [8/1900]
INFO:tensorflow:Evaluation [9/1900]
INFO:tensorflow:Evaluation [10/1900]
INFO:tensorflow:Evaluation [11/1900]
INFO:tensorflow:Evaluation [12/1900]
INFO:tensorflow:Evaluation [13/1900]
INFO:tensorflow:Evaluation [14/1900]
INFO:tensorflow:Evaluation [15/1900]
INFO:tensorflow:Evaluation [16/1900]
INFO:tensorflow:Evaluation [1

INFO:tensorflow:Evaluation [73/1900]
INFO:tensorflow:Evaluation [74/1900]
INFO:tensorflow:Evaluation [75/1900]
INFO:tensorflow:Evaluation [76/1900]
INFO:tensorflow:Evaluation [77/1900]
INFO:tensorflow:Evaluation [78/1900]
INFO:tensorflow:Evaluation [79/1900]
INFO:tensorflow:Evaluation [80/1900]
INFO:tensorflow:Evaluation [81/1900]
INFO:tensorflow:Evaluation [82/1900]
INFO:tensorflow:Evaluation [83/1900]
INFO:tensorflow:Evaluation [84/1900]
INFO:tensorflow:Evaluation [85/1900]
INFO:tensorflow:Evaluation [86/1900]
INFO:tensorflow:Evaluation [87/1900]
INFO:tensorflow:Evaluation [88/1900]
INFO:tensorflow:Evaluation [89/1900]
INFO:tensorflow:Evaluation [90/1900]
INFO:tensorflow:Evaluation [91/1900]
INFO:tensorflow:Evaluation [92/1900]
INFO:tensorflow:Evaluation [93/1900]
INFO:tensorflow:Evaluation [94/1900]
INFO:tensorflow:Evaluation [95/1900]
INFO:tensorflow:Evaluation [96/1900]
INFO:tensorflow:Evaluation [97/1900]
INFO:tensorflow:Evaluation [98/1900]
INFO:tensorflow:Evaluation [99/1900]
I

INFO:tensorflow:Evaluation [290/1900]
INFO:tensorflow:Evaluation [291/1900]
INFO:tensorflow:Evaluation [292/1900]
INFO:tensorflow:Evaluation [293/1900]
INFO:tensorflow:Evaluation [294/1900]
INFO:tensorflow:Evaluation [295/1900]
INFO:tensorflow:Evaluation [296/1900]
INFO:tensorflow:Evaluation [297/1900]
INFO:tensorflow:Evaluation [298/1900]
INFO:tensorflow:Evaluation [299/1900]
INFO:tensorflow:Evaluation [300/1900]
INFO:tensorflow:Evaluation [301/1900]
INFO:tensorflow:Evaluation [302/1900]
INFO:tensorflow:Evaluation [303/1900]
INFO:tensorflow:Evaluation [304/1900]
INFO:tensorflow:Evaluation [305/1900]
INFO:tensorflow:Evaluation [306/1900]
INFO:tensorflow:Evaluation [307/1900]
INFO:tensorflow:Evaluation [308/1900]
INFO:tensorflow:Evaluation [309/1900]
INFO:tensorflow:Evaluation [310/1900]
INFO:tensorflow:Evaluation [311/1900]
INFO:tensorflow:Evaluation [312/1900]
INFO:tensorflow:Evaluation [313/1900]
INFO:tensorflow:Evaluation [314/1900]
INFO:tensorflow:Evaluation [315/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [506/1900]
INFO:tensorflow:Evaluation [507/1900]
INFO:tensorflow:Evaluation [508/1900]
INFO:tensorflow:Evaluation [509/1900]
INFO:tensorflow:Evaluation [510/1900]
INFO:tensorflow:Evaluation [511/1900]
INFO:tensorflow:Evaluation [512/1900]
INFO:tensorflow:Evaluation [513/1900]
INFO:tensorflow:Evaluation [514/1900]
INFO:tensorflow:Evaluation [515/1900]
INFO:tensorflow:Evaluation [516/1900]
INFO:tensorflow:Evaluation [517/1900]
INFO:tensorflow:Evaluation [518/1900]
INFO:tensorflow:Evaluation [519/1900]
INFO:tensorflow:Evaluation [520/1900]
INFO:tensorflow:Evaluation [521/1900]
INFO:tensorflow:Evaluation [522/1900]
INFO:tensorflow:Evaluation [523/1900]
INFO:tensorflow:Evaluation [524/1900]
INFO:tensorflow:Evaluation [525/1900]
INFO:tensorflow:Evaluation [526/1900]
INFO:tensorflow:Evaluation [527/1900]
INFO:tensorflow:Evaluation [528/1900]
INFO:tensorflow:Evaluation [529/1900]
INFO:tensorflow:Evaluation [530/1900]
INFO:tensorflow:Evaluation [531/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [722/1900]
INFO:tensorflow:Evaluation [723/1900]
INFO:tensorflow:Evaluation [724/1900]
INFO:tensorflow:Evaluation [725/1900]
INFO:tensorflow:Evaluation [726/1900]
INFO:tensorflow:Evaluation [727/1900]
INFO:tensorflow:Evaluation [728/1900]
INFO:tensorflow:Evaluation [729/1900]
INFO:tensorflow:Evaluation [730/1900]
INFO:tensorflow:Evaluation [731/1900]
INFO:tensorflow:Evaluation [732/1900]
INFO:tensorflow:Evaluation [733/1900]
INFO:tensorflow:Evaluation [734/1900]
INFO:tensorflow:Evaluation [735/1900]
INFO:tensorflow:Evaluation [736/1900]
INFO:tensorflow:Evaluation [737/1900]
INFO:tensorflow:Evaluation [738/1900]
INFO:tensorflow:Evaluation [739/1900]
INFO:tensorflow:Evaluation [740/1900]
INFO:tensorflow:Evaluation [741/1900]
INFO:tensorflow:Evaluation [742/1900]
INFO:tensorflow:Evaluation [743/1900]
INFO:tensorflow:Evaluation [744/1900]
INFO:tensorflow:Evaluation [745/1900]
INFO:tensorflow:Evaluation [746/1900]
INFO:tensorflow:Evaluation [747/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [938/1900]
INFO:tensorflow:Evaluation [939/1900]
INFO:tensorflow:Evaluation [940/1900]
INFO:tensorflow:Evaluation [941/1900]
INFO:tensorflow:Evaluation [942/1900]
INFO:tensorflow:Evaluation [943/1900]
INFO:tensorflow:Evaluation [944/1900]
INFO:tensorflow:Evaluation [945/1900]
INFO:tensorflow:Evaluation [946/1900]
INFO:tensorflow:Evaluation [947/1900]
INFO:tensorflow:Evaluation [948/1900]
INFO:tensorflow:Evaluation [949/1900]
INFO:tensorflow:Evaluation [950/1900]
INFO:tensorflow:Evaluation [951/1900]
INFO:tensorflow:Evaluation [952/1900]
INFO:tensorflow:Evaluation [953/1900]
INFO:tensorflow:Evaluation [954/1900]
INFO:tensorflow:Evaluation [955/1900]
INFO:tensorflow:Evaluation [956/1900]
INFO:tensorflow:Evaluation [957/1900]
INFO:tensorflow:Evaluation [958/1900]
INFO:tensorflow:Evaluation [959/1900]
INFO:tensorflow:Evaluation [960/1900]
INFO:tensorflow:Evaluation [961/1900]
INFO:tensorflow:Evaluation [962/1900]
INFO:tensorflow:Evaluation [963/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [1150/1900]
INFO:tensorflow:Evaluation [1151/1900]
INFO:tensorflow:Evaluation [1152/1900]
INFO:tensorflow:Evaluation [1153/1900]
INFO:tensorflow:Evaluation [1154/1900]
INFO:tensorflow:Evaluation [1155/1900]
INFO:tensorflow:Evaluation [1156/1900]
INFO:tensorflow:Evaluation [1157/1900]
INFO:tensorflow:Evaluation [1158/1900]
INFO:tensorflow:Evaluation [1159/1900]
INFO:tensorflow:Evaluation [1160/1900]
INFO:tensorflow:Evaluation [1161/1900]
INFO:tensorflow:Evaluation [1162/1900]
INFO:tensorflow:Evaluation [1163/1900]
INFO:tensorflow:Evaluation [1164/1900]
INFO:tensorflow:Evaluation [1165/1900]
INFO:tensorflow:Evaluation [1166/1900]
INFO:tensorflow:Evaluation [1167/1900]
INFO:tensorflow:Evaluation [1168/1900]
INFO:tensorflow:Evaluation [1169/1900]
INFO:tensorflow:Evaluation [1170/1900]
INFO:tensorflow:Evaluation [1171/1900]
INFO:tensorflow:Evaluation [1172/1900]
INFO:tensorflow:Evaluation [1173/1900]
INFO:tensorflow:Evaluation [1174/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1361/1900]
INFO:tensorflow:Evaluation [1362/1900]
INFO:tensorflow:Evaluation [1363/1900]
INFO:tensorflow:Evaluation [1364/1900]
INFO:tensorflow:Evaluation [1365/1900]
INFO:tensorflow:Evaluation [1366/1900]
INFO:tensorflow:Evaluation [1367/1900]
INFO:tensorflow:Evaluation [1368/1900]
INFO:tensorflow:Evaluation [1369/1900]
INFO:tensorflow:Evaluation [1370/1900]
INFO:tensorflow:Evaluation [1371/1900]
INFO:tensorflow:Evaluation [1372/1900]
INFO:tensorflow:Evaluation [1373/1900]
INFO:tensorflow:Evaluation [1374/1900]
INFO:tensorflow:Evaluation [1375/1900]
INFO:tensorflow:Evaluation [1376/1900]
INFO:tensorflow:Evaluation [1377/1900]
INFO:tensorflow:Evaluation [1378/1900]
INFO:tensorflow:Evaluation [1379/1900]
INFO:tensorflow:Evaluation [1380/1900]
INFO:tensorflow:Evaluation [1381/1900]
INFO:tensorflow:Evaluation [1382/1900]
INFO:tensorflow:Evaluation [1383/1900]
INFO:tensorflow:Evaluation [1384/1900]
INFO:tensorflow:Evaluation [1385/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1572/1900]
INFO:tensorflow:Evaluation [1573/1900]
INFO:tensorflow:Evaluation [1574/1900]
INFO:tensorflow:Evaluation [1575/1900]
INFO:tensorflow:Evaluation [1576/1900]
INFO:tensorflow:Evaluation [1577/1900]
INFO:tensorflow:Evaluation [1578/1900]
INFO:tensorflow:Evaluation [1579/1900]
INFO:tensorflow:Evaluation [1580/1900]
INFO:tensorflow:Evaluation [1581/1900]
INFO:tensorflow:Evaluation [1582/1900]
INFO:tensorflow:Evaluation [1583/1900]
INFO:tensorflow:Evaluation [1584/1900]
INFO:tensorflow:Evaluation [1585/1900]
INFO:tensorflow:Evaluation [1586/1900]
INFO:tensorflow:Evaluation [1587/1900]
INFO:tensorflow:Evaluation [1588/1900]
INFO:tensorflow:Evaluation [1589/1900]
INFO:tensorflow:Evaluation [1590/1900]
INFO:tensorflow:Evaluation [1591/1900]
INFO:tensorflow:Evaluation [1592/1900]
INFO:tensorflow:Evaluation [1593/1900]
INFO:tensorflow:Evaluation [1594/1900]
INFO:tensorflow:Evaluation [1595/1900]
INFO:tensorflow:Evaluation [1596/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1783/1900]
INFO:tensorflow:Evaluation [1784/1900]
INFO:tensorflow:Evaluation [1785/1900]
INFO:tensorflow:Evaluation [1786/1900]
INFO:tensorflow:Evaluation [1787/1900]
INFO:tensorflow:Evaluation [1788/1900]
INFO:tensorflow:Evaluation [1789/1900]
INFO:tensorflow:Evaluation [1790/1900]
INFO:tensorflow:Evaluation [1791/1900]
INFO:tensorflow:Evaluation [1792/1900]
INFO:tensorflow:Evaluation [1793/1900]
INFO:tensorflow:Evaluation [1794/1900]
INFO:tensorflow:Evaluation [1795/1900]
INFO:tensorflow:Evaluation [1796/1900]
INFO:tensorflow:Evaluation [1797/1900]
INFO:tensorflow:Evaluation [1798/1900]
INFO:tensorflow:Evaluation [1799/1900]
INFO:tensorflow:Evaluation [1800/1900]
INFO:tensorflow:Evaluation [1801/1900]
INFO:tensorflow:Evaluation [1802/1900]
INFO:tensorflow:Evaluation [1803/1900]
INFO:tensorflow:Evaluation [1804/1900]
INFO:tensorflow:Evaluation [1805/1900]
INFO:tensorflow:Evaluation [1806/1900]
INFO:tensorflow:Evaluation [1807/1900]
INFO:tensorflow:Evaluatio







































Test Accuracy by Scikit-learn:  0.818421052632
CPU times: user 9min 1s, sys: 50.7 s, total: 9min 51s
Wall time: 1min 45s


###### CNN classifier(gensim + corpus)

In [101]:
%%time

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

sess = tf.InteractiveSession()

COL_OUTCOME = 'class'
COL_FEATURE = [col for col in list(df['train'].columns) if col != COL_OUTCOME]

# cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

count_feature = len(COL_FEATURE)
count_class = len(df['train']['class'].unique())

x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
y_ = tf.placeholder(tf.float32, shape=[None, count_class], name='y_')

W = tf.Variable(tf.zeros([count_feature, count_class]))
b = tf.Variable(tf.zeros([count_class]))
y = tf.matmul(x, W) + b

# cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_text = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_text, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32, name='keep_prob')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, count_class])
b_fc2 = bias_variable([count_class])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

print("CNN initialization finished")

CNN initialization finished
CPU times: user 36 ms, sys: 4 ms, total: 40 ms
Wall time: 39 ms


In [102]:
%%time

### Start to traini and evaluate the model

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess.run(tf.global_variables_initializer())

x_input = df_new['train']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['train']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]
# y_input = [np.array([y_input.iloc[i].values]) for i in range(y_input.shape[0])]

# not use random input

for i in range(df['train'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, training accuracy {}".format(i, np.mean(train_accuracy)))
    train_step.run(feed_dict={
        keep_prob: 0.5,
        x:  np.array([elem[0] for elem in x_input[i:i+50]]),#x_input.iloc[i].values, #
        y_: np.array([elem[0] for elem in y_input[i:i+50]])#y_input.iloc[i].values#
    })

print("CNN training finished")

step 0, training accuracy 0.175200000405
step 100, training accuracy 0.474399983883
step 200, training accuracy 0.585599958897
step 300, training accuracy 0.564399957657
step 400, training accuracy 0.557599961758
step 500, training accuracy 0.614799976349
step 600, training accuracy 0.706399977207
step 700, training accuracy 0.64359998703
step 800, training accuracy 0.70759999752
step 900, training accuracy 0.7675999403
step 1000, training accuracy 0.782399892807
step 1100, training accuracy 0.740800023079
step 1200, training accuracy 0.716000080109
step 1300, training accuracy 0.711199998856
step 1400, training accuracy 0.87640017271
step 1500, training accuracy 0.841600060463
step 1600, training accuracy 0.770799994469
step 1700, training accuracy 0.743200004101
step 1800, training accuracy 0.870800018311
step 1900, training accuracy 0.76239991188
step 2000, training accuracy 0.836400151253
step 2100, training accuracy 0.773999929428
step 2200, training accuracy 0.790799915791
step 2

In [103]:
%%time

# Evaluate

x_input = df_new['test']['x']#df_new['test']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['test']['y']#df_new['test']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]

for i in range(df['test'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, testing accuracy {}".format(i, np.mean(train_accuracy)))

        
print("CNN testing finished")

step 0, testing accuracy 0.698399960995
step 100, testing accuracy 0.814400017262
step 200, testing accuracy 0.720800101757
step 300, testing accuracy 0.780399918556
step 400, testing accuracy 0.846400022507
step 500, testing accuracy 0.793200075626
step 600, testing accuracy 0.743200063705
step 700, testing accuracy 0.833999931812
step 800, testing accuracy 0.732000052929
step 900, testing accuracy 0.76919990778
step 1000, testing accuracy 0.717200040817
step 1100, testing accuracy 0.86000007391
step 1200, testing accuracy 0.784800112247
step 1300, testing accuracy 0.80119997263
step 1400, testing accuracy 0.761600017548
step 1500, testing accuracy 0.79239988327
step 1600, testing accuracy 0.782799899578
step 1700, testing accuracy 0.815999925137
step 1800, testing accuracy 0.792400062084
CNN testing finished
CPU times: user 2min 43s, sys: 10 s, total: 2min 53s
Wall time: 27.1 s


#### 3.2.2 TensorFlow 训练

In [169]:
import tensorflow as tf

modelFrom = 'TensorFlow'

In [170]:
%%time

import sys
sys.path.append(os.path.abspath('../'))

from modules.embedding.w2v_opt_full_01 import *

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 106 µs


##### 3.2.1.1 基于 text8 建模

In [175]:
%%time

embedFrom = 'text8'

FLAGS.train_data = os.path.join(paths['dir.dataroot'], 'trialdata', embedFrom)
FLAGS.eval_data = os.path.join(paths['dir.dataroot'], 'trialdata', 'questions-words.txt')
FLAGS.save_path = paths['dir.{}.{}'.format(modelFrom, embedFrom)]
FLAGS.epochs_to_train = 15
FLAGS.embedding_size = vecsize

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 34.8 µs


In [176]:
%%time

!cat ../modules/embedding/w2v_opt_full_02.py

import tensorflow as tf

session = tf.InteractiveSession()
"""Train a word2vec model."""
if not FLAGS.train_data or not FLAGS.eval_data or not FLAGS.save_path:
  print("--train_data --eval_data and --save_path must be specified.")
  sys.exit(1)
opts = Options()
#with tf.Graph().as_default() as session:
with tf.device("/cpu:0"):
  model = Word2Vec(opts, session)
  model.read_analogies() # Read analogy questions
for _ in xrange(opts.epochs_to_train):
  model.train()  # Process one epoch
  model.eval()  # Eval analogies.
# Perform a final save.
model.saver.save(session, os.path.join(opts.save_path, "model.ckpt"),
                 global_step=model.global_step)
if FLAGS.interactive:
  # E.g.,
  # [0]: model.analogy(b'france', b'paris', b'russia')
  # [1]: model.nearby([b'proton', b'elephant', b'maxwell'])
  _start_shell(locals())

CPU times: user 0 ns, sys: 36 ms, total: 36 ms
Wall time: 204 ms


In [177]:
%%time

session = tf.InteractiveSession()
"""Train a word2vec model."""
if not FLAGS.train_data or not FLAGS.eval_data or not FLAGS.save_path:
    print("--train_data --eval_data and --save_path must be specified.")
    sys.exit(1)
opts = Options()
#with tf.Graph().as_default() as session:
with tf.device("/cpu:0"):
    model = Word2Vec(opts, session)
    model.read_analogies() # Read analogy questions
for _ in xrange(opts.epochs_to_train):
    model.train()  # Process one epoch
    model.eval()  # Eval analogies.
# Perform a final save.
model.saver.save(session, os.path.join(opts.save_path, "model-{}.ckpt".format(embedFrom)),
                 global_step=model.global_step)
if FLAGS.interactive:
    # E.g.,
    # [0]: model.analogy(b'france', b'paris', b'russia')
    # [1]: model.nearby([b'proton', b'elephant', b'maxwell'])
    _start_shell(locals())

Data file:  /home/sushangjun/gits/Udacity/MLND/capstone.now/proposal/new/notebooks/../data/trialdata/text8
Vocab size:  71290  + UNK
Words per epoch:  17005207
Initialization:  [[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
Eval analogy file:  /home/sushangjun/gits/Udacity/MLND/capstone.now/proposal/new/notebooks/../data/trialdata/questions-words.txt
Questions:  17827
Skipped:  1717
Epoch    1 Step   150876: lr = 0.024 words/sec =    16751
Eval 1372/17827 accuracy =  7.7%
Epoch    2 Step   301785: lr = 0.023 words/sec =    17531
Eval 1649/17827 accuracy =  9.3%
Epoch    3 Step   452698: lr = 0.021 words/sec =     8236
Eval 1640/17827 accuracy =  9.2%
Epoch    4 Step   603616: lr = 0.020 words/sec =     7911
Eval 1701/17827 accuracy =  9.5%
Epoch    5 Step   754493: lr = 0.019 words/sec =    12508
Eval 1949/17827 accuracy = 10.9%
Epoch 

In [180]:
# from scipy.spatial.distance import cosine

# # tmma=model._w_in.eval()
# # tmma
# tmpemb = model._w_out.eval()
# # print(len(tmpemb[model._word2id['slow']]))
# # print(tmpemb[model._word2id['slowly']])
# print(cosine(tmpemb[model._word2id['slow']], tmpemb[model._word2id['slowly']]))
# # print(cosine(tmpemb[model._word2id['man']], tmpemb[model._word2id['men']]))
# # print(cosine(tmpemb[model._word2id['woman']], tmpemb[model._word2id['women']]))
# # print(cosine(tmpemb[model._word2id['woman']], tmpemb[model._word2id['woman']]))
# # print(tmpemb[model._word2id['UNK']])
# print(cosine([1,1], [1.1, 1.3]))

In [182]:
%%time

# Step 1: preprocess the data

# import stoplist
stopwords = ""

pathtemp_TFIDF = os.path.join(paths['dir.dataroot'], 'stoplist-baseTFIDF.txt')
with open(pathtemp_TFIDF, 'r') as stoplistfile:
    stopwords = stoplistfile.read()
stopwords = stopwords.split()

pathtemp_web = os.path.join(paths['dir.dataroot'], 'stoplist-web.txt')
with open(pathtemp_web, 'r') as stoplistfile2:
    stopwords2 = stoplistfile2.read()
    stopwords2 = stopwords2.split('\n')
    stopwords = set(stopwords)
    stopwords = list(stopwords.union(set(stopwords)))
    
print("Read stop words successfully.")

Read stop words successfully.
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 29.3 ms


In [184]:
%%time

# Step 2: read data and save it in data['vec.train'] 和 data['vec.test']

data = {}
data['vec.train'] = {'w2v.mean':[], 'class':[]}
data['vec.test'] = {'w2v.mean':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                tokens = [token for token in readf.read().split() if token not in stopwords]#readf.read().split()#
                # Word2Vec representation
                # begin
                vec = np.array([0.0 for i in range(vecsize)])
                expectationVal = np.array([0.0 for i in range(vecsize)])
                countvec = 0
                for token in tokens:
                    try:
                        vec += tmpemb[model._word2id[token]] #model[token]
                        countvec += 1
                    except:
                        vec += tmpemb[model._word2id['UNK']]
                vec = vec / float(countvec)
                 # end
            data['vec.{}'.format(tpart)]['w2v.mean'].append(vec)
            data['vec.{}'.format(tpart)]['class'].append(cls)

    tmp = data['vec.{}'.format(tpart)]
    ind = (random.sample(range(len(tmp['class'])), 1))[0]
    print("sample(transformed) from {}[{}]:\n[corpus]\n {}\n[class]\n{}".format(tpart, ind, tmp['w2v.mean'][ind], tmp['class'][ind]))
    print()
    
print("Step 2 Succeed")

sample(transformed) from train[1148]:
[corpus]
 [ -9.96603407e-02   1.14235073e-01   4.87520915e-02  -3.99263393e-02
   1.03870710e-01   1.73305357e-02  -8.93831919e-02   1.19042018e-01
  -2.28948947e-02   5.87820504e-02   1.13846079e-02  -6.56907692e-02
  -1.50849392e-02  -6.74119421e-02  -4.09063270e-02   1.12255504e-01
   3.56880805e-02   4.81683357e-02   2.01546599e-02   1.71496643e-01
  -7.46743342e-02   2.97399331e-02   1.70645058e-01  -6.09753110e-02
   2.06357649e-02  -3.08952279e-02   6.20912504e-03   1.57769550e-01
   7.41829704e-02   8.86011216e-02  -2.44455459e-02   1.68424957e-02
  -3.67608798e-02  -1.30859618e-01  -4.36157938e-02  -1.11629976e-01
   9.82143396e-02   9.82492790e-03  -1.35626635e-02  -4.24545004e-02
  -1.03769611e-01   1.96121165e-01  -2.93205626e-02  -1.96371578e-01
  -1.13746641e-01   3.62143000e-02   9.79908944e-02   1.13563232e-01
   4.05143333e-02   1.30632895e-02  -1.14868810e-01  -4.35108331e-02
  -1.07200862e-01  -4.58310493e-02  -1.31418910e-02   1

sample(transformed) from test[540]:
[corpus]
 [ -6.46396886e-02   2.42271544e-02   2.44639674e-02  -2.72202944e-02
   9.98638799e-04   3.78417556e-02  -2.89962393e-02   3.18431460e-02
   9.56417334e-03   1.37226456e-01  -5.49257988e-02  -4.43405140e-02
   1.51597137e-02  -3.92107286e-02  -9.36229662e-03   1.42487843e-01
   8.88744312e-02   1.12198757e-01   6.18111436e-02   1.06159809e-01
  -7.91951507e-02   7.72533927e-02   8.68962515e-02   9.52377094e-03
  -5.92476699e-02   2.26452493e-02  -3.78951147e-02   4.58099847e-02
   5.34265479e-02   1.22909027e-02   1.91986339e-02   4.09438554e-02
  -9.50648633e-02  -1.03207956e-01   2.81913384e-03  -1.13300449e-01
   2.86964725e-02   4.76082927e-02  -2.22589809e-02   9.80216321e-03
  -5.85902861e-02   1.67057980e-01  -2.85485954e-02  -1.23521216e-01
  -1.63160895e-01   5.62528471e-02   7.90062219e-02   5.05483913e-02
   1.81772220e-02   4.77569216e-02  -1.41804892e-01  -6.50451692e-02
  -1.26190987e-01  -5.34384894e-03  -6.58932425e-02   7.9

In [185]:
%%time

# Step 3: Save in Pandas.DataFrame
#
# 将 data['matrix.train'] 与 data['matrix.test'] 转换成 Pandas.DataFrame 格式，保存到 df['train'] 和 df['test'] 中（df 为字典格式：String -> DataFrame）

df = {}
csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['vec.{}'.format(tpart)]['class']
    datavec = np.array(data['vec.{}'.format(tpart)]['w2v.mean'])
    for col in range(vecsize):
        datadict[col]= datavec[:, col]

    df[tpart] = pd.DataFrame(data=datadict)
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")
    # write data in DataFrame into CSV
    csvpath = os.path.join(csvpath_root, '{}-w2v-{}-{}.csv'.format(tpart, embedFrom, modelFrom))
    df[tpart].to_csv(csvpath, columns=df[tpart].columns)
    
print("Step 3 Succeed.")

# 繁琐点：研究如何把 CSR 矩阵中的数据规整好放到 DataFrame 中，并与 Class 一一对应

See df[train]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.058523,0.013638,0.024487,-0.030197,-0.015306,0.033814,-0.023289,0.027039,-0.012052,0.101854,...,0.009520,0.268571,-0.097666,0.022824,0.137221,0.120145,0.042478,0.069818,0.024586,soc.religion.christian
1,-0.035728,0.018440,0.027882,-0.007253,0.000318,0.035095,-0.008307,0.050132,0.021606,0.123589,...,-0.027584,0.254759,-0.099986,0.012034,0.126510,0.103783,0.058195,0.041614,0.015369,soc.religion.christian
2,-0.030835,-0.002645,0.013619,-0.032663,-0.036917,0.044206,0.001154,0.044904,0.007797,0.122667,...,-0.025182,0.242497,-0.106797,-0.005275,0.137966,0.106617,0.063244,0.067828,0.039724,soc.religion.christian
3,-0.019594,0.045219,0.011064,-0.018853,-0.018432,0.039846,-0.012094,0.079092,-0.000249,0.148417,...,-0.007299,0.253940,-0.088050,0.027485,0.112267,0.115603,0.053982,0.075722,0.039394,soc.religion.christian
4,-0.030478,0.033525,0.000909,-0.033771,-0.006535,0.020864,-0.011159,0.073313,0.005399,0.134877,...,-0.009814,0.246666,-0.083068,0.032013,0.116406,0.112866,0.051360,0.046786,0.034479,soc.religion.christian
5,-0.068971,0.022524,0.022240,-0.025342,0.044779,0.054707,-0.002061,0.053398,-0.019471,0.106976,...,0.022863,0.300212,-0.107021,0.036935,0.123293,0.139480,0.059774,0.066357,0.046229,soc.religion.christian
6,-0.040616,0.007487,0.012622,-0.015940,-0.017342,0.044873,-0.014794,0.067907,0.004165,0.114806,...,-0.022901,0.246773,-0.097176,0.027566,0.135201,0.111781,0.068158,0.071924,0.040155,soc.religion.christian
7,-0.088533,0.022957,0.035795,-0.032581,0.024442,0.011931,-0.011633,0.028860,0.019157,0.104923,...,0.005853,0.285087,-0.111841,0.031712,0.138796,0.128962,0.030158,0.049613,0.027840,soc.religion.christian
8,-0.081727,0.021808,0.046780,-0.027344,0.037914,0.025987,-0.022177,0.028687,-0.004210,0.119180,...,0.017514,0.290323,-0.102362,0.039628,0.127815,0.123184,0.051371,0.050606,0.051381,soc.religion.christian
9,-0.052158,0.020152,0.011457,-0.029701,0.010596,0.033001,-0.031144,0.051754,0.013221,0.120225,...,-0.009498,0.253548,-0.113777,0.018805,0.133704,0.126106,0.065190,0.058086,0.026126,soc.religion.christian






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.031296,0.031190,0.033662,-0.017676,-0.000093,0.026671,0.013627,0.064756,-0.002502,0.125967,...,-0.006821,0.259358,-0.100459,0.020410,0.107689,0.117543,0.046935,0.058230,0.042685,soc.religion.christian
1,-0.052576,-0.010007,0.001262,-0.051575,-0.001714,0.037924,-0.017614,0.040380,0.000108,0.110576,...,0.002236,0.252903,-0.090882,0.025433,0.145794,0.133865,0.068060,0.073909,0.042318,soc.religion.christian
2,-0.037210,0.011418,0.011127,-0.038540,0.001013,0.025820,-0.002396,0.024885,0.018824,0.114064,...,0.003935,0.261865,-0.107376,0.013261,0.134763,0.144042,0.055032,0.022252,0.035977,soc.religion.christian
3,-0.058710,0.044231,0.035115,-0.017702,0.017434,0.036446,-0.019522,0.067778,-0.015364,0.083287,...,0.018137,0.267266,-0.106094,0.033556,0.120758,0.113386,0.044914,0.085380,0.044995,soc.religion.christian
4,-0.026366,0.016108,0.020979,-0.009220,-0.011549,0.030171,-0.012905,0.080754,-0.010876,0.129785,...,-0.003263,0.251054,-0.116747,0.026203,0.157351,0.132706,0.053787,0.071162,0.048086,soc.religion.christian
5,-0.003536,0.015019,0.015085,-0.002393,-0.009813,0.031112,-0.019398,0.068297,0.002323,0.115758,...,-0.015482,0.252973,-0.107032,0.024972,0.131752,0.105315,0.054258,0.059199,0.033886,soc.religion.christian
6,-0.039451,0.006575,0.017538,-0.014845,-0.000004,0.029863,-0.007273,0.058649,0.008164,0.117822,...,-0.010667,0.249252,-0.099601,0.017445,0.122034,0.122272,0.067075,0.065184,0.038287,soc.religion.christian
7,-0.030305,0.002403,0.022767,-0.021758,0.003630,0.038488,0.005574,0.057593,-0.000147,0.113814,...,0.003822,0.257768,-0.120562,0.041963,0.117600,0.124138,0.056058,0.045259,0.032298,soc.religion.christian
8,-0.055916,0.002920,0.012610,-0.028886,0.005426,0.037931,-0.010174,0.050305,0.003893,0.125212,...,-0.014562,0.254358,-0.098674,-0.001167,0.127779,0.132079,0.061554,0.043235,0.032576,soc.religion.christian
9,-0.050818,0.014585,0.029024,-0.045672,0.034625,0.016308,-0.019175,0.038426,0.008568,0.079724,...,0.004623,0.276395,-0.110139,0.038464,0.119581,0.128562,0.053603,0.055968,0.030504,soc.religion.christian






Step 3 Succeed.
CPU times: user 2.22 s, sys: 60 ms, total: 2.28 s
Wall time: 3.67 s


In [186]:
%%time

# if wanna read data from CSV file

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(
        csvpath_root, '{}-w2v-{}-{}.csv'.format(
            tpart, embedFrom, modelFrom
        )
    )
    if os.path.exists(csvpath):
        df[tpart] = pd.DataFrame.from_csv(csvpath)
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print("read {} successfully".format(tpart))
        display(df[tpart])

read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.057341,0.014694,0.047682,-0.015018,0.009562,0.019051,-0.005166,0.046955,-0.014300,0.133241,...,0.007038,0.286418,-0.112458,0.025238,0.141784,0.125696,0.055557,0.065492,0.052077,comp.graphics
1,-0.023846,0.001160,0.010394,-0.022836,0.026280,0.051872,0.011654,0.028351,-0.017943,0.135171,...,0.022687,0.283771,-0.096455,0.031670,0.109492,0.150816,0.035601,0.049927,0.042134,rec.motorcycles
2,-0.031411,0.011987,0.002825,-0.031440,-0.009375,0.025572,0.008416,0.062001,0.018286,0.134346,...,0.014888,0.279204,-0.095205,-0.001121,0.131459,0.119123,0.049085,0.053070,0.033233,rec.motorcycles
3,-0.042833,0.015912,0.007204,-0.012253,-0.006918,0.034820,-0.018460,0.020200,-0.004529,0.127584,...,-0.010878,0.261667,-0.089696,0.025584,0.125700,0.163203,0.045023,0.054125,0.026043,rec.autos
4,-0.049115,0.006210,0.029890,-0.039029,0.027411,0.037111,-0.012789,0.038126,-0.004609,0.105153,...,0.004538,0.255003,-0.097393,0.046014,0.131058,0.155517,0.073182,0.065936,0.055590,soc.religion.christian
5,-0.023096,0.017767,-0.004985,-0.026241,-0.032956,0.031708,-0.014613,0.060181,-0.008096,0.118619,...,-0.032374,0.250342,-0.087871,0.025866,0.129169,0.115180,0.049557,0.047132,0.041679,alt.atheism
6,-0.050383,0.005988,0.024275,-0.032639,0.023947,0.033099,-0.007682,0.083294,-0.024961,0.118540,...,0.057446,0.294207,-0.153790,0.032245,0.122300,0.139739,0.065485,0.096452,0.048330,rec.motorcycles
7,-0.075066,0.021640,0.024893,-0.035045,0.033250,0.037919,-0.012097,0.038798,-0.005341,0.130516,...,0.016501,0.278293,-0.116479,0.025677,0.126034,0.133222,0.034075,0.031640,0.028024,rec.autos
8,-0.023873,-0.019982,0.006920,0.027267,0.047651,0.015003,0.023975,0.106521,-0.000969,0.121501,...,0.025447,0.311267,-0.090267,0.024427,0.104483,0.133039,0.044945,0.071998,0.067379,rec.autos
9,-0.082079,-0.002612,0.002923,-0.019352,0.021831,0.037990,0.006890,0.032810,-0.011633,0.116781,...,-0.013299,0.293357,-0.117362,0.023673,0.114332,0.149332,0.030224,0.057418,0.026547,rec.motorcycles


read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.087221,0.037610,0.076202,-0.085028,0.069781,0.011738,0.010871,0.070988,-0.108553,0.028930,...,0.157808,0.386495,-0.191177,0.084594,0.179159,0.077864,0.091934,0.041715,0.065478,comp.graphics
1,-0.058829,-0.032066,0.083235,-0.037977,-0.030693,0.043876,-0.024962,0.004570,-0.054611,0.145081,...,0.019536,0.282640,-0.072797,0.107911,0.157876,0.183181,0.090611,0.080142,0.062756,soc.religion.christian
2,-0.058821,0.007064,0.025833,-0.013926,0.035914,0.035166,0.007778,0.045814,-0.012347,0.119940,...,0.032537,0.292283,-0.101899,0.012648,0.120706,0.151207,0.035528,0.035747,0.028452,rec.autos
3,-0.084904,-0.005975,0.039602,-0.025309,0.045134,0.045177,-0.043969,0.047071,-0.016143,0.124390,...,0.012929,0.289601,-0.109395,0.032417,0.167009,0.103590,0.065869,0.049717,0.066308,rec.autos
4,-0.066778,-0.008851,0.032158,-0.028911,-0.013782,0.029620,-0.032480,0.062637,0.006886,0.111917,...,0.026468,0.247941,-0.114674,0.034530,0.150863,0.120913,0.063900,0.060704,0.025238,alt.atheism
5,-0.083665,-0.089845,0.042763,-0.051298,0.035474,-0.009569,-0.004769,0.124835,0.011801,0.137345,...,0.055882,0.343373,-0.103744,0.029216,0.091094,0.171804,0.040556,0.125682,0.095259,comp.graphics
6,-0.064888,0.017411,0.015754,-0.015379,0.003927,0.020462,-0.006277,0.051208,0.009471,0.111201,...,-0.009833,0.277543,-0.114362,0.020225,0.136011,0.117653,0.052173,0.049221,0.021557,soc.religion.christian
7,-0.082919,-0.024437,0.052487,0.000173,0.092714,0.006752,-0.050026,0.075537,-0.025218,0.118346,...,0.083269,0.288357,-0.075109,0.009719,0.117507,0.139409,0.060297,0.047801,0.042497,alt.atheism
8,-0.061776,0.013502,0.027730,-0.014864,0.036355,0.038239,-0.011143,0.036786,-0.019621,0.140948,...,0.002193,0.271171,-0.087204,0.061817,0.141650,0.122194,0.059561,0.077485,0.019845,alt.atheism
9,-0.046962,0.020122,-0.002065,-0.013625,0.043326,0.014022,-0.051820,0.049011,0.004234,0.101400,...,0.047355,0.280542,-0.106741,0.034514,0.121276,0.121642,0.040239,0.043128,0.018928,rec.autos


CPU times: user 948 ms, sys: 4 ms, total: 952 ms
Wall time: 1.49 s


###### SVM classifier(TensorFlow + text8)

In [187]:
%%time

# Step 5.1.1: SVM

# if 'TFIDF' == modelChoice:

#train
X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']
#test
X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

# else:
#     #train
#     X_train = df_new['train']['x']
#     y_train = df_new['train']['y']
#     #test
#     X_test = df_new['test']['x']
#     y_test_true = df_new['test']['y']

clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 4 finished")

Step 4 finished
CPU times: user 2.17 s, sys: 4 ms, total: 2.17 s
Wall time: 2.85 s


In [188]:
%%time
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Step 5.1.2: Test
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

0.848947368421
0.846123858354
0.848947368421
CPU times: user 24 ms, sys: 0 ns, total: 24 ms
Wall time: 211 ms


###### DNN classifier(TensorFlow + text8)

In [189]:
%%time

# Step 4: One-hot representation for labels

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

lb = LabelBinarizer()
lb.fit(df['train']['class'])

df_new = {}
for tpart in ['train', 'test']:
    labels = lb.transform(df[tpart]['class'])
    labelsDf = pd.DataFrame(labels, columns=["class-{}".format(i) for i in range(len(lb.classes_))])
    df_new[tpart] = {}
    df_new[tpart]['y'] = labelsDf
    df_new[tpart]['x'] = df[tpart].drop('class', axis=1)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    #save in CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelFrom))
        df_new[tpart][subpart].to_csv(csvpath)
    
print("label cleaning succussfully")

label cleaning succussfully
CPU times: user 4.17 s, sys: 84 ms, total: 4.25 s
Wall time: 5.12 s


In [190]:
%%time

## Step 5 : Train the classifier

COL_OUTCOME = 'class'
COL_FEATURE = [str(col) for col in list(df['train'].columns) if col != COL_OUTCOME]

cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

def my_input_fn(dataset):
    # Save dataset in tf format
    feature_cols = {
        str(col): tf.constant(
            df[dataset][str(col)].values
        )
        for col in COL_FEATURE
    }
    labels = tf.constant([cls2num[labelname] for labelname in df[dataset][COL_OUTCOME].values])
    # Returns the feature columns and labels in tf format
    return feature_cols, labels

feature_columns = [tf.contrib.layers.real_valued_column(column_name=str(col)) for col in COL_FEATURE]
clf = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, 
    hidden_units=[512], 
    n_classes=len(df['train']['class'].unique())
)

clf.fit(input_fn=lambda: my_input_fn('train'), steps=2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f152b231950>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}






































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpiR94KS/model.ckpt.
INFO:tensorflow:loss = 1.61255, step = 1
INFO:tensorflow:global_step/sec: 9.96804
INFO:tensorflow:loss = 1.40367, step = 101
INFO:tensorflow:global_step/sec: 12.3857
INFO:tensorflow:loss = 1.22488, step = 201
INFO:tensorflow:global_step/sec: 12.3444
INFO:tensorflow:loss = 1.07681, step = 301
INFO:tensorflow:global_step/sec: 12.235
INFO:tensorflow:loss = 0.946175, step = 401
INFO:tensorflow:global_step/sec: 12.3181
INFO:tensorflow:loss = 0.838036, step = 501
INFO:tensorflow:global_step/sec: 12.2071
INFO:tensorflow:loss = 0.751527, step = 601
INFO:te

In [191]:
%%time

## Step 6: Evaluate

accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=df['test'].shape[0])['accuracy']
print("Test Accuracy by TensorFlow: {}".format(accuracy_score))

X_tensor_test, yt = my_input_fn('test')
tensorPredCls = list(clf.predict(input_fn=lambda: my_input_fn('test')))
num2cls = {v:k for (k, v) in cls2num.items()}
tensorPredClsStr = [num2cls[i] for i in tensorPredCls]
y_test_true = df['test']['class']
print('Test Accuracy by Scikit-learn: ', f1_score(y_test_true, tensorPredClsStr, average='micro'))







































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-05-13-17:00:33
INFO:tensorflow:Evaluation [1/1900]
INFO:tensorflow:Evaluation [2/1900]
INFO:tensorflow:Evaluation [3/1900]
INFO:tensorflow:Evaluation [4/1900]
INFO:tensorflow:Evaluation [5/1900]
INFO:tensorflow:Evaluation [6/1900]
INFO:tensorflow:Evaluation [7/1900]
INFO:tensorflow:Evaluation [8/1900]
INFO:tensorflow:Evaluation [9/1900]
INFO:tensorflow:Evaluation [10/1900]
INFO:tensorflow:Evaluation [11/1900]
INFO:tensorflow:Evaluation [12/1900]
INFO:tensorflow:Evaluation [13/1900]
INFO:tensorflow:Evaluation [14/1900]
INFO:tensorflow:Evaluation [15/1900]
INFO:tensorflow:Evaluation [16/1900]
INFO:tensorflow:Evaluation [1

INFO:tensorflow:Evaluation [73/1900]
INFO:tensorflow:Evaluation [74/1900]
INFO:tensorflow:Evaluation [75/1900]
INFO:tensorflow:Evaluation [76/1900]
INFO:tensorflow:Evaluation [77/1900]
INFO:tensorflow:Evaluation [78/1900]
INFO:tensorflow:Evaluation [79/1900]
INFO:tensorflow:Evaluation [80/1900]
INFO:tensorflow:Evaluation [81/1900]
INFO:tensorflow:Evaluation [82/1900]
INFO:tensorflow:Evaluation [83/1900]
INFO:tensorflow:Evaluation [84/1900]
INFO:tensorflow:Evaluation [85/1900]
INFO:tensorflow:Evaluation [86/1900]
INFO:tensorflow:Evaluation [87/1900]
INFO:tensorflow:Evaluation [88/1900]
INFO:tensorflow:Evaluation [89/1900]
INFO:tensorflow:Evaluation [90/1900]
INFO:tensorflow:Evaluation [91/1900]
INFO:tensorflow:Evaluation [92/1900]
INFO:tensorflow:Evaluation [93/1900]
INFO:tensorflow:Evaluation [94/1900]
INFO:tensorflow:Evaluation [95/1900]
INFO:tensorflow:Evaluation [96/1900]
INFO:tensorflow:Evaluation [97/1900]
INFO:tensorflow:Evaluation [98/1900]
INFO:tensorflow:Evaluation [99/1900]
I

INFO:tensorflow:Evaluation [290/1900]
INFO:tensorflow:Evaluation [291/1900]
INFO:tensorflow:Evaluation [292/1900]
INFO:tensorflow:Evaluation [293/1900]
INFO:tensorflow:Evaluation [294/1900]
INFO:tensorflow:Evaluation [295/1900]
INFO:tensorflow:Evaluation [296/1900]
INFO:tensorflow:Evaluation [297/1900]
INFO:tensorflow:Evaluation [298/1900]
INFO:tensorflow:Evaluation [299/1900]
INFO:tensorflow:Evaluation [300/1900]
INFO:tensorflow:Evaluation [301/1900]
INFO:tensorflow:Evaluation [302/1900]
INFO:tensorflow:Evaluation [303/1900]
INFO:tensorflow:Evaluation [304/1900]
INFO:tensorflow:Evaluation [305/1900]
INFO:tensorflow:Evaluation [306/1900]
INFO:tensorflow:Evaluation [307/1900]
INFO:tensorflow:Evaluation [308/1900]
INFO:tensorflow:Evaluation [309/1900]
INFO:tensorflow:Evaluation [310/1900]
INFO:tensorflow:Evaluation [311/1900]
INFO:tensorflow:Evaluation [312/1900]
INFO:tensorflow:Evaluation [313/1900]
INFO:tensorflow:Evaluation [314/1900]
INFO:tensorflow:Evaluation [315/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [506/1900]
INFO:tensorflow:Evaluation [507/1900]
INFO:tensorflow:Evaluation [508/1900]
INFO:tensorflow:Evaluation [509/1900]
INFO:tensorflow:Evaluation [510/1900]
INFO:tensorflow:Evaluation [511/1900]
INFO:tensorflow:Evaluation [512/1900]
INFO:tensorflow:Evaluation [513/1900]
INFO:tensorflow:Evaluation [514/1900]
INFO:tensorflow:Evaluation [515/1900]
INFO:tensorflow:Evaluation [516/1900]
INFO:tensorflow:Evaluation [517/1900]
INFO:tensorflow:Evaluation [518/1900]
INFO:tensorflow:Evaluation [519/1900]
INFO:tensorflow:Evaluation [520/1900]
INFO:tensorflow:Evaluation [521/1900]
INFO:tensorflow:Evaluation [522/1900]
INFO:tensorflow:Evaluation [523/1900]
INFO:tensorflow:Evaluation [524/1900]
INFO:tensorflow:Evaluation [525/1900]
INFO:tensorflow:Evaluation [526/1900]
INFO:tensorflow:Evaluation [527/1900]
INFO:tensorflow:Evaluation [528/1900]
INFO:tensorflow:Evaluation [529/1900]
INFO:tensorflow:Evaluation [530/1900]
INFO:tensorflow:Evaluation [531/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [722/1900]
INFO:tensorflow:Evaluation [723/1900]
INFO:tensorflow:Evaluation [724/1900]
INFO:tensorflow:Evaluation [725/1900]
INFO:tensorflow:Evaluation [726/1900]
INFO:tensorflow:Evaluation [727/1900]
INFO:tensorflow:Evaluation [728/1900]
INFO:tensorflow:Evaluation [729/1900]
INFO:tensorflow:Evaluation [730/1900]
INFO:tensorflow:Evaluation [731/1900]
INFO:tensorflow:Evaluation [732/1900]
INFO:tensorflow:Evaluation [733/1900]
INFO:tensorflow:Evaluation [734/1900]
INFO:tensorflow:Evaluation [735/1900]
INFO:tensorflow:Evaluation [736/1900]
INFO:tensorflow:Evaluation [737/1900]
INFO:tensorflow:Evaluation [738/1900]
INFO:tensorflow:Evaluation [739/1900]
INFO:tensorflow:Evaluation [740/1900]
INFO:tensorflow:Evaluation [741/1900]
INFO:tensorflow:Evaluation [742/1900]
INFO:tensorflow:Evaluation [743/1900]
INFO:tensorflow:Evaluation [744/1900]
INFO:tensorflow:Evaluation [745/1900]
INFO:tensorflow:Evaluation [746/1900]
INFO:tensorflow:Evaluation [747/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [938/1900]
INFO:tensorflow:Evaluation [939/1900]
INFO:tensorflow:Evaluation [940/1900]
INFO:tensorflow:Evaluation [941/1900]
INFO:tensorflow:Evaluation [942/1900]
INFO:tensorflow:Evaluation [943/1900]
INFO:tensorflow:Evaluation [944/1900]
INFO:tensorflow:Evaluation [945/1900]
INFO:tensorflow:Evaluation [946/1900]
INFO:tensorflow:Evaluation [947/1900]
INFO:tensorflow:Evaluation [948/1900]
INFO:tensorflow:Evaluation [949/1900]
INFO:tensorflow:Evaluation [950/1900]
INFO:tensorflow:Evaluation [951/1900]
INFO:tensorflow:Evaluation [952/1900]
INFO:tensorflow:Evaluation [953/1900]
INFO:tensorflow:Evaluation [954/1900]
INFO:tensorflow:Evaluation [955/1900]
INFO:tensorflow:Evaluation [956/1900]
INFO:tensorflow:Evaluation [957/1900]
INFO:tensorflow:Evaluation [958/1900]
INFO:tensorflow:Evaluation [959/1900]
INFO:tensorflow:Evaluation [960/1900]
INFO:tensorflow:Evaluation [961/1900]
INFO:tensorflow:Evaluation [962/1900]
INFO:tensorflow:Evaluation [963/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [1150/1900]
INFO:tensorflow:Evaluation [1151/1900]
INFO:tensorflow:Evaluation [1152/1900]
INFO:tensorflow:Evaluation [1153/1900]
INFO:tensorflow:Evaluation [1154/1900]
INFO:tensorflow:Evaluation [1155/1900]
INFO:tensorflow:Evaluation [1156/1900]
INFO:tensorflow:Evaluation [1157/1900]
INFO:tensorflow:Evaluation [1158/1900]
INFO:tensorflow:Evaluation [1159/1900]
INFO:tensorflow:Evaluation [1160/1900]
INFO:tensorflow:Evaluation [1161/1900]
INFO:tensorflow:Evaluation [1162/1900]
INFO:tensorflow:Evaluation [1163/1900]
INFO:tensorflow:Evaluation [1164/1900]
INFO:tensorflow:Evaluation [1165/1900]
INFO:tensorflow:Evaluation [1166/1900]
INFO:tensorflow:Evaluation [1167/1900]
INFO:tensorflow:Evaluation [1168/1900]
INFO:tensorflow:Evaluation [1169/1900]
INFO:tensorflow:Evaluation [1170/1900]
INFO:tensorflow:Evaluation [1171/1900]
INFO:tensorflow:Evaluation [1172/1900]
INFO:tensorflow:Evaluation [1173/1900]
INFO:tensorflow:Evaluation [1174/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1361/1900]
INFO:tensorflow:Evaluation [1362/1900]
INFO:tensorflow:Evaluation [1363/1900]
INFO:tensorflow:Evaluation [1364/1900]
INFO:tensorflow:Evaluation [1365/1900]
INFO:tensorflow:Evaluation [1366/1900]
INFO:tensorflow:Evaluation [1367/1900]
INFO:tensorflow:Evaluation [1368/1900]
INFO:tensorflow:Evaluation [1369/1900]
INFO:tensorflow:Evaluation [1370/1900]
INFO:tensorflow:Evaluation [1371/1900]
INFO:tensorflow:Evaluation [1372/1900]
INFO:tensorflow:Evaluation [1373/1900]
INFO:tensorflow:Evaluation [1374/1900]
INFO:tensorflow:Evaluation [1375/1900]
INFO:tensorflow:Evaluation [1376/1900]
INFO:tensorflow:Evaluation [1377/1900]
INFO:tensorflow:Evaluation [1378/1900]
INFO:tensorflow:Evaluation [1379/1900]
INFO:tensorflow:Evaluation [1380/1900]
INFO:tensorflow:Evaluation [1381/1900]
INFO:tensorflow:Evaluation [1382/1900]
INFO:tensorflow:Evaluation [1383/1900]
INFO:tensorflow:Evaluation [1384/1900]
INFO:tensorflow:Evaluation [1385/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1572/1900]
INFO:tensorflow:Evaluation [1573/1900]
INFO:tensorflow:Evaluation [1574/1900]
INFO:tensorflow:Evaluation [1575/1900]
INFO:tensorflow:Evaluation [1576/1900]
INFO:tensorflow:Evaluation [1577/1900]
INFO:tensorflow:Evaluation [1578/1900]
INFO:tensorflow:Evaluation [1579/1900]
INFO:tensorflow:Evaluation [1580/1900]
INFO:tensorflow:Evaluation [1581/1900]
INFO:tensorflow:Evaluation [1582/1900]
INFO:tensorflow:Evaluation [1583/1900]
INFO:tensorflow:Evaluation [1584/1900]
INFO:tensorflow:Evaluation [1585/1900]
INFO:tensorflow:Evaluation [1586/1900]
INFO:tensorflow:Evaluation [1587/1900]
INFO:tensorflow:Evaluation [1588/1900]
INFO:tensorflow:Evaluation [1589/1900]
INFO:tensorflow:Evaluation [1590/1900]
INFO:tensorflow:Evaluation [1591/1900]
INFO:tensorflow:Evaluation [1592/1900]
INFO:tensorflow:Evaluation [1593/1900]
INFO:tensorflow:Evaluation [1594/1900]
INFO:tensorflow:Evaluation [1595/1900]
INFO:tensorflow:Evaluation [1596/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1783/1900]
INFO:tensorflow:Evaluation [1784/1900]
INFO:tensorflow:Evaluation [1785/1900]
INFO:tensorflow:Evaluation [1786/1900]
INFO:tensorflow:Evaluation [1787/1900]
INFO:tensorflow:Evaluation [1788/1900]
INFO:tensorflow:Evaluation [1789/1900]
INFO:tensorflow:Evaluation [1790/1900]
INFO:tensorflow:Evaluation [1791/1900]
INFO:tensorflow:Evaluation [1792/1900]
INFO:tensorflow:Evaluation [1793/1900]
INFO:tensorflow:Evaluation [1794/1900]
INFO:tensorflow:Evaluation [1795/1900]
INFO:tensorflow:Evaluation [1796/1900]
INFO:tensorflow:Evaluation [1797/1900]
INFO:tensorflow:Evaluation [1798/1900]
INFO:tensorflow:Evaluation [1799/1900]
INFO:tensorflow:Evaluation [1800/1900]
INFO:tensorflow:Evaluation [1801/1900]
INFO:tensorflow:Evaluation [1802/1900]
INFO:tensorflow:Evaluation [1803/1900]
INFO:tensorflow:Evaluation [1804/1900]
INFO:tensorflow:Evaluation [1805/1900]
INFO:tensorflow:Evaluation [1806/1900]
INFO:tensorflow:Evaluation [1807/1900]
INFO:tensorflow:Evaluatio







































Test Accuracy by Scikit-learn:  0.801578947368
CPU times: user 9min 13s, sys: 18.9 s, total: 9min 32s
Wall time: 1min 40s


###### CNN classifier(TensorFlow + text8)

In [192]:
%%time

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

sess = tf.InteractiveSession()

COL_OUTCOME = 'class'
COL_FEATURE = [col for col in list(df['train'].columns) if col != COL_OUTCOME]

# cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

count_feature = len(COL_FEATURE)
count_class = len(df['train']['class'].unique())

x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
y_ = tf.placeholder(tf.float32, shape=[None, count_class], name='y_')

W = tf.Variable(tf.zeros([count_feature, count_class]))
b = tf.Variable(tf.zeros([count_class]))
y = tf.matmul(x, W) + b

# cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_text = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_text, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32, name='keep_prob')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, count_class])
b_fc2 = bias_variable([count_class])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

print("CNN initialization finished")

CNN initialization finished
CPU times: user 36 ms, sys: 0 ns, total: 36 ms
Wall time: 36.2 ms


In [193]:
%%time

### Start to traini and evaluate the model

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess.run(tf.global_variables_initializer())

x_input = df_new['train']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['train']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]
# y_input = [np.array([y_input.iloc[i].values]) for i in range(y_input.shape[0])]

# not use random input

for i in range(df['train'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, training accuracy {}".format(i, np.mean(train_accuracy)))
    train_step.run(feed_dict={
        keep_prob: 0.5,
        x:  np.array([elem[0] for elem in x_input[i:i+50]]),#x_input.iloc[i].values, #
        y_: np.array([elem[0] for elem in y_input[i:i+50]])#y_input.iloc[i].values#
    })

print("CNN training finished")

step 0, training accuracy 0.17239998281
step 100, training accuracy 0.411599963903
step 200, training accuracy 0.393200010061
step 300, training accuracy 0.57959997654
step 400, training accuracy 0.571999967098
step 500, training accuracy 0.64519995451
step 600, training accuracy 0.675600111485
step 700, training accuracy 0.639999985695
step 800, training accuracy 0.697600007057
step 900, training accuracy 0.63480001688
step 1000, training accuracy 0.676400005817
step 1100, training accuracy 0.721599936485
step 1200, training accuracy 0.780399918556
step 1300, training accuracy 0.767999947071
step 1400, training accuracy 0.855599999428
step 1500, training accuracy 0.803999960423
step 1600, training accuracy 0.720800101757
step 1700, training accuracy 0.724400043488
step 1800, training accuracy 0.812799990177
step 1900, training accuracy 0.825599968433
step 2000, training accuracy 0.795599997044
step 2100, training accuracy 0.8492000103
step 2200, training accuracy 0.823999941349
step 2

In [194]:
%%time

# Evaluate

x_input = df_new['test']['x']#df_new['test']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['test']['y']#df_new['test']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]

for i in range(df['test'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, testing accuracy {}".format(i, np.mean(train_accuracy)))

        
print("CNN testing finished")

step 0, testing accuracy 0.702000021935
step 100, testing accuracy 0.664400041103
step 200, testing accuracy 0.757999956608
step 300, testing accuracy 0.742399990559
step 400, testing accuracy 0.694000005722
step 500, testing accuracy 0.754399955273
step 600, testing accuracy 0.75
step 700, testing accuracy 0.74640005827
step 800, testing accuracy 0.808000028133
step 900, testing accuracy 0.677200078964
step 1000, testing accuracy 0.738800048828
step 1100, testing accuracy 0.744000017643
step 1200, testing accuracy 0.795199990273
step 1300, testing accuracy 0.712799966335
step 1400, testing accuracy 0.718400061131
step 1500, testing accuracy 0.79240000248
step 1600, testing accuracy 0.734800040722
step 1700, testing accuracy 0.672400057316
step 1800, testing accuracy 0.730400025845
CNN testing finished
CPU times: user 2min 42s, sys: 1.65 s, total: 2min 44s
Wall time: 25 s


##### 3.2.1.2 只用待学习语料建模

In [195]:
embedFrom = 'corpus'

In [196]:
%%time

# collect sentences from raw data
sentences = {}
pathtmp = {}
pathtmp['root'] = os.path.join(paths['dir.dataroot'], 'trialdata')
for tpart in ['train']:#, 'test']:
    pathtmp[tpart] = os.path.join(pathtmp['root'], tpart)
    sentences[tpart] = []
    folderList = os.listdir(pathtmp[tpart])
    for folder in folderList:
        fileList = os.listdir(os.path.join(pathtmp[tpart], folder))
        for eachf in fileList:
            fpathtmp = os.path.join(pathtmp[tpart], folder, eachf)
            with open(fpathtmp, 'r') as f:
                sentences[tpart].append(f.read())
      #save sentences in file
        sentencePath = os.path.join(pathtmp['root'], 'corpus')
        with open(sentencePath, 'w') as f:
            for sentence in sentences[tpart]:
                f.write(sentence)
                f.write('\n')

filepathtmp = os.path.join(pathtmp['root'], 'corpus')
sentences = ""#[]
with open(filepathtmp, 'r') as f:
    buff = f.read()
    sentencesBuffer = buff.split('\n')
    sentencesBucket = [ch for stcbuffer in sentencesBuffer for ch in stcbuffer.split()]
    for ch in sentencesBucket:
        sentences += ch
        sentences += " "

filepathtmp = os.path.join(pathtmp['root'], 'corpus')
with open(filepathtmp, 'w') as f:
    f.write(sentences)
        
print('get sentences from training corpus successfully')
print('example:')
print(len(sentences))
print(sentences[:60]) #random.randrange(len(sentences))])        

get sentences from training corpus successfully
example:
4546437
from jenk microsoft com jen kilmer subject re sex education 
CPU times: user 360 ms, sys: 108 ms, total: 468 ms
Wall time: 3.33 s


In [197]:
FLAGS.train_data = os.path.join(paths['dir.dataroot'], 'trialdata', embedFrom)
FLAGS.eval_data = os.path.join(paths['dir.dataroot'], 'trialdata', 'questions-words.txt')
FLAGS.save_path = paths['dir.{}.{}'.format(modelFrom, embedFrom)]
FLAGS.epochs_to_train = 15
FLAGS.embedding_size = vecsize

In [198]:
!cat ../modules/embedding/w2v_opt_full_02.py

import tensorflow as tf

session = tf.InteractiveSession()
"""Train a word2vec model."""
if not FLAGS.train_data or not FLAGS.eval_data or not FLAGS.save_path:
  print("--train_data --eval_data and --save_path must be specified.")
  sys.exit(1)
opts = Options()
#with tf.Graph().as_default() as session:
with tf.device("/cpu:0"):
  model = Word2Vec(opts, session)
  model.read_analogies() # Read analogy questions
for _ in xrange(opts.epochs_to_train):
  model.train()  # Process one epoch
  model.eval()  # Eval analogies.
# Perform a final save.
model.saver.save(session, os.path.join(opts.save_path, "model.ckpt"),
                 global_step=model.global_step)
if FLAGS.interactive:
  # E.g.,
  # [0]: model.analogy(b'france', b'paris', b'russia')
  # [1]: model.nearby([b'proton', b'elephant', b'maxwell'])
  _start_shell(locals())



In [199]:
session = tf.InteractiveSession()
"""Train a word2vec model."""
if not FLAGS.train_data or not FLAGS.eval_data or not FLAGS.save_path:
    print("--train_data --eval_data and --save_path must be specified.")
    sys.exit(1)
opts = Options()
#with tf.Graph().as_default() as session:
with tf.device("/cpu:0"):
    model = Word2Vec(opts, session)
    model.read_analogies() # Read analogy questions
for _ in xrange(opts.epochs_to_train):
    model.train()  # Process one epoch
    model.eval()  # Eval analogies.
# Perform a final save.
model.saver.save(session, os.path.join(opts.save_path, "model-{}.ckpt".format(embedFrom)),
                 global_step=model.global_step)
if FLAGS.interactive:
    # E.g.,
    # [0]: model.analogy(b'france', b'paris', b'russia')
    # [1]: model.nearby([b'proton', b'elephant', b'maxwell'])
    _start_shell(locals())

Data file:  /home/sushangjun/gits/Udacity/MLND/capstone.now/proposal/new/notebooks/../data/trialdata/corpus
Vocab size:  10790  + UNK
Words per epoch:  843524
Initialization:  [[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
Eval analogy file:  /home/sushangjun/gits/Udacity/MLND/capstone.now/proposal/new/notebooks/../data/trialdata/questions-words.txt
Questions:  3905
Skipped:  15639
Epoch    1 Step     7449: lr = 0.024 words/sec =    27852
Eval   26/3905 accuracy =  0.7%
Epoch    2 Step    14913: lr = 0.023 words/sec =    18191
Eval   42/3905 accuracy =  1.1%
Epoch    3 Step    22373: lr = 0.021 words/sec =    14611
Eval   42/3905 accuracy =  1.1%
Epoch    4 Step    29822: lr = 0.020 words/sec =    15228
Eval   55/3905 accuracy =  1.4%
Epoch    5 Step    37284: lr = 0.019 words/sec =    14999
Eval   70/3905 accuracy =  1.8%
Epoch    6 S

In [200]:
# from scipy.spatial.distance import cosine

# # tmma=model._w_in.eval()
# # tmma
# tmpemb = model._w_out.eval()
# print(len(tmpemb[model._word2id['slow']]))
# # print(tmpemb[model._word2id['slowly']])
# # print(cosine(tmpemb[model._word2id['slow']], tmpemb[model._word2id['slowly']]))
# # print(cosine(tmpemb[model._word2id['man']], tmpemb[model._word2id['men']]))
# # print(cosine(tmpemb[model._word2id['woman']], tmpemb[model._word2id['women']]))
# # print(cosine(tmpemb[model._word2id['woman']], tmpemb[model._word2id['woman']]))
# print(tmpemb[model._word2id['UNK']])

In [201]:
%%time

# Step 1: preprocess the data

# import stoplist
stopwords = ""

pathtemp_TFIDF = os.path.join(paths['dir.dataroot'], 'stoplist-baseTFIDF.txt')
with open(pathtemp_TFIDF, 'r') as stoplistfile:
    stopwords = stoplistfile.read()
stopwords = stopwords.split()

pathtemp_web = os.path.join(paths['dir.dataroot'], 'stoplist-web.txt')
with open(pathtemp_web, 'r') as stoplistfile2:
    stopwords2 = stoplistfile2.read()
    stopwords2 = stopwords2.split('\n')
    stopwords = set(stopwords)
    stopwords = list(stopwords.union(set(stopwords)))
    
print("Read stop words successfully.")

Read stop words successfully.
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 534 ms


In [203]:
%%time

# Step 2: read data and save it in data['vec.train'] 和 data['vec.test']

data = {}
data['vec.train'] = {'w2v.mean':[], 'class':[]}
data['vec.test'] = {'w2v.mean':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                tokens = [token for token in readf.read().split() if token not in stopwords]#readf.read().split()#
                # Word2Vec representation
                # begin
                vec = np.array([0.0 for i in range(vecsize)])
                expectationVal = np.array([0.0 for i in range(vecsize)])
                countvec = 0
                for token in tokens:
                    try:
                        vec += tmpemb[model._word2id[token]] #model[token]
                        countvec += 1
                    except:
                        vec += tmpemb[model._word2id['UNK']]
                vec = vec / float(countvec)
                 # end
            data['vec.{}'.format(tpart)]['w2v.mean'].append(vec)
            data['vec.{}'.format(tpart)]['class'].append(cls)

    tmp = data['vec.{}'.format(tpart)]
    ind = (random.sample(range(len(tmp['class'])), 1))[0]
    print("sample(transformed) from {}[{}]:\n[corpus]\n {}\n[class]\n{}".format(tpart, ind, tmp['w2v.mean'][ind], tmp['class'][ind]))
    print()
    
print("Step 2 Succeed")

sample(transformed) from train[1134]:
[corpus]
 [ -9.14742260e-02   1.97657122e-02   4.88034414e-02  -2.14501246e-02
   2.89537850e-02   3.69584016e-02  -1.96420203e-02   2.52398994e-02
   2.04244950e-02   1.09431348e-01  -1.63828118e-02  -3.04512786e-02
   4.90008398e-02  -4.13084460e-02  -4.08775516e-02   1.13422616e-01
   1.09650938e-01   1.24379964e-01   6.62501065e-02   1.37616521e-01
  -7.81755025e-02   5.58870664e-02   9.81221798e-02  -7.58008440e-03
  -2.30257731e-02   3.67379372e-02  -3.06308693e-02   4.86731597e-02
   5.36724845e-02   5.25526757e-02   1.73793200e-02   5.38565988e-02
  -8.19522027e-02  -1.23996193e-01   3.59019776e-02  -1.04500795e-01
   2.66090137e-02   5.39927276e-02  -1.18805606e-02  -7.62467396e-03
  -5.00652456e-02   1.40609098e-01  -3.06014993e-02  -1.45811353e-01
  -1.23290552e-01   1.14025326e-02   6.53584848e-02   6.93422551e-02
   6.93176080e-03   2.72750730e-02  -1.49903798e-01  -5.76191623e-02
  -1.00610992e-01   1.58461575e-02  -6.41822497e-02   8

sample(transformed) from test[1150]:
[corpus]
 [ -1.03927469e-01   1.28367011e-02   3.54833295e-02  -2.80670408e-02
   5.74810325e-02   3.48400845e-02  -5.37967970e-03   1.45811907e-02
  -6.20764606e-03   9.07164651e-02  -3.09716331e-02  -3.27652136e-02
   3.21756180e-02  -3.75893238e-02  -2.12911729e-02   1.18714038e-01
   1.05336746e-01   1.26588981e-01   6.89988573e-02   1.23069280e-01
  -8.59222234e-02   6.09937534e-02   6.97960741e-02  -9.31329114e-03
  -3.33334676e-02   2.60146730e-03  -3.20336184e-02   6.50785132e-02
   5.16393836e-02   4.05634036e-02   4.33328248e-02   3.59174728e-02
  -9.40747148e-02  -1.00249261e-01   3.92326864e-02  -1.02162152e-01
   1.11504561e-02   5.20214231e-02  -9.51513184e-03   8.30358924e-03
  -7.12003048e-02   1.40510476e-01  -1.49001290e-02  -1.54843942e-01
  -1.17387325e-01   2.95322843e-03   7.48798623e-02   7.76931734e-02
   3.62647636e-03   3.74099804e-02  -1.25890590e-01  -5.80434733e-02
  -1.03830822e-01   3.48429638e-03  -6.69677367e-02   1.

In [204]:
%%time

# Step 3: Save in Pandas.DataFrame
#
# 将 data['matrix.train'] 与 data['matrix.test'] 转换成 Pandas.DataFrame 格式，保存到 df['train'] 和 df['test'] 中（df 为字典格式：String -> DataFrame）

df = {}
csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['vec.{}'.format(tpart)]['class']
    datavec = np.array(data['vec.{}'.format(tpart)]['w2v.mean'])
    for col in range(vecsize):
        datadict[col]= datavec[:, col]

    df[tpart] = pd.DataFrame(data=datadict)
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")
    # write data in DataFrame into CSV
    csvpath = os.path.join(csvpath_root, '{}-w2v-{}-{}.csv'.format(tpart, embedFrom, modelFrom))
    df[tpart].to_csv(csvpath, columns=df[tpart].columns)
    
print("Step 3 Succeed.")

# 繁琐点：研究如何把 CSR 矩阵中的数据规整好放到 DataFrame 中，并与 Class 一一对应

See df[train]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.116876,0.033983,0.030950,-0.042242,0.033010,0.035774,-0.030156,0.055865,0.033010,0.105757,...,0.025690,0.289178,-0.129078,0.047123,0.127470,0.108361,0.029603,0.019417,0.020784,soc.religion.christian
1,-0.106944,0.028698,0.018356,-0.023065,0.009855,0.023431,-0.000504,0.035627,0.017729,0.094470,...,0.006116,0.263038,-0.125299,0.016513,0.128191,0.119239,0.011966,0.035388,0.043331,soc.religion.christian
2,-0.106025,0.013320,0.046333,-0.019221,0.056579,0.037451,-0.001478,0.044321,-0.007963,0.119153,...,0.027512,0.290272,-0.140925,0.039777,0.111820,0.124639,0.031612,0.004810,0.034234,soc.religion.christian
3,-0.098518,0.012643,0.057831,-0.033276,0.055667,0.037273,-0.017768,0.026530,0.008117,0.095474,...,0.027730,0.268816,-0.120517,0.027204,0.125490,0.127358,0.026708,0.046275,0.036046,soc.religion.christian
4,-0.106880,0.023066,0.033355,-0.026325,0.048548,0.032794,0.000838,0.026678,0.000256,0.100428,...,0.019634,0.273445,-0.130329,0.017374,0.119251,0.122077,0.018156,0.050223,0.022818,soc.religion.christian
5,-0.115496,0.003603,0.051215,-0.015904,0.054078,0.045506,-0.014170,0.024706,0.005369,0.074234,...,0.027765,0.238926,-0.118742,0.027739,0.129710,0.116102,0.043534,0.050734,0.011128,soc.religion.christian
6,-0.099007,0.014794,0.044544,-0.028178,0.029779,0.008130,-0.015576,0.022471,0.014737,0.095170,...,0.018207,0.281266,-0.126033,0.020907,0.134576,0.129007,0.030309,0.025600,0.030300,soc.religion.christian
7,-0.113498,0.003352,0.040725,-0.053493,0.014363,0.038982,-0.013571,0.031157,0.009423,0.085821,...,-0.008253,0.287447,-0.133423,0.014019,0.140513,0.132070,0.022193,0.030131,0.019594,soc.religion.christian
8,-0.087977,0.000019,0.049841,-0.039036,0.048831,0.028929,0.001292,0.027166,-0.004165,0.097317,...,0.032238,0.287662,-0.140681,0.030989,0.123054,0.139986,0.037384,0.033287,0.024361,soc.religion.christian
9,-0.072080,0.011728,0.031687,-0.046160,0.015392,0.020370,-0.014864,0.032203,0.009204,0.108036,...,0.007336,0.275966,-0.123287,0.022220,0.123867,0.142963,0.031454,-0.004335,0.028350,soc.religion.christian






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.109175,0.006434,0.052499,-0.035539,0.038042,0.018722,-0.021271,0.022448,0.002775,0.094616,...,0.018121,0.268725,-0.110195,0.022211,0.128320,0.119358,0.023708,0.048878,0.025346,soc.religion.christian
1,-0.086023,0.010532,0.035806,-0.040836,0.001932,0.040264,-0.030591,0.037642,-0.016305,0.110644,...,0.012944,0.294688,-0.136378,0.034916,0.141657,0.126255,0.054169,0.056551,0.029865,soc.religion.christian
2,-0.100463,0.017651,0.039525,-0.041685,0.015738,0.011001,-0.007462,0.040823,-0.010943,0.093741,...,0.022178,0.315021,-0.118551,0.010560,0.149061,0.134064,0.043805,0.027940,0.030051,soc.religion.christian
3,-0.099581,0.019049,0.059957,-0.031027,0.070112,0.034792,-0.003536,0.029869,0.007749,0.068344,...,0.038316,0.275238,-0.129777,0.054773,0.130151,0.129353,0.032414,0.041548,0.027990,soc.religion.christian
4,-0.101563,0.012229,0.047215,-0.022962,0.049003,0.006819,-0.019644,0.016365,0.018876,0.093624,...,0.017613,0.272265,-0.124863,0.031556,0.134030,0.123166,0.030394,0.051655,0.020820,soc.religion.christian
5,-0.090941,0.031712,0.027228,-0.032390,0.019586,0.018775,-0.001527,0.045439,0.009462,0.086524,...,0.004226,0.252909,-0.129342,0.010127,0.124375,0.125785,0.035749,0.032288,0.037939,soc.religion.christian
6,-0.091229,0.019519,0.055969,-0.043914,0.028017,0.020788,-0.000473,0.036728,0.021548,0.091064,...,0.018523,0.292026,-0.134881,0.013887,0.139625,0.113272,0.045624,0.031523,0.025329,soc.religion.christian
7,-0.084787,0.027582,0.035511,-0.017268,0.010101,0.015533,-0.015024,0.031286,-0.001485,0.103618,...,-0.006102,0.265162,-0.112222,0.026784,0.127100,0.119700,0.053049,0.043639,0.031117,soc.religion.christian
8,-0.106846,0.024212,0.025683,-0.037574,0.034163,0.034421,-0.000292,0.035312,0.003690,0.106033,...,0.034857,0.287239,-0.128969,0.036335,0.120857,0.118393,0.033900,0.036942,0.023708,soc.religion.christian
9,-0.086679,0.009507,0.031419,-0.043341,0.038208,0.027166,-0.013665,0.037618,-0.003874,0.094513,...,0.012627,0.289762,-0.132256,0.017143,0.126189,0.128509,0.038124,0.029118,0.035583,soc.religion.christian






Step 3 Succeed.
CPU times: user 2.14 s, sys: 40 ms, total: 2.18 s
Wall time: 2.86 s


In [205]:
%%time

# if wanna read data from CSV file

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(
        csvpath_root, '{}-w2v-{}-{}.csv'.format(
            tpart, embedFrom, modelFrom
        )
    )
    if os.path.exists(csvpath):
        df[tpart] = pd.DataFrame.from_csv(csvpath)
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print("read {} successfully".format(tpart))
        display(df[tpart])

read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.102856,0.036089,0.045112,-0.024410,0.045801,0.025042,-0.017396,0.022539,0.005828,0.109232,...,0.020612,0.265497,-0.105461,0.010558,0.116649,0.131539,0.025551,0.034942,0.033294,alt.atheism
1,-0.089879,0.018595,0.033348,-0.034271,0.051995,0.031616,-0.028541,0.042776,0.010406,0.090177,...,0.047064,0.266561,-0.110345,0.038955,0.133685,0.126593,0.018551,0.014432,0.023872,comp.graphics
2,-0.096920,0.031071,0.032962,0.026218,0.023096,0.052752,0.006651,0.001465,-0.015802,0.119026,...,0.023573,0.277577,-0.117074,0.002563,0.148260,0.135632,0.039746,0.036109,0.032306,rec.motorcycles
3,-0.074621,0.020761,0.032951,0.007336,0.034159,0.019508,-0.003822,0.022748,-0.007425,0.107278,...,0.017103,0.259219,-0.128942,0.028585,0.143614,0.108203,0.037986,0.060950,0.004737,rec.autos
4,-0.093792,0.023535,0.048847,-0.036319,0.073383,0.017483,-0.003060,0.020692,-0.005620,0.086382,...,0.020948,0.261059,-0.132241,0.051624,0.129941,0.135616,0.037920,0.055156,0.040645,rec.motorcycles
5,-0.095458,0.016303,0.053389,-0.042542,0.024726,0.018404,-0.015711,0.030069,0.016334,0.085861,...,0.009553,0.289607,-0.125048,0.017120,0.115562,0.106040,0.035861,0.036947,0.037134,soc.religion.christian
6,-0.072861,0.004685,0.084253,-0.041709,-0.001089,0.033984,-0.018489,0.012183,0.027388,0.125138,...,0.018131,0.344625,-0.123736,0.001368,0.113467,0.146872,0.006362,-0.000244,0.025473,soc.religion.christian
7,-0.081550,0.013147,0.071630,-0.034115,0.004887,0.036768,-0.007111,0.012277,0.027721,0.103201,...,0.031859,0.286007,-0.131502,0.017466,0.125419,0.150931,0.045398,0.052896,0.058480,comp.graphics
8,-0.099867,0.027510,0.046250,-0.010917,0.059423,0.023384,-0.034621,0.041797,0.011871,0.099911,...,0.046114,0.289700,-0.122825,0.044400,0.125233,0.121571,0.016702,0.013863,0.029136,alt.atheism
9,-0.110281,0.025558,0.059280,-0.022295,0.053129,0.026960,0.017374,0.045463,0.004489,0.107307,...,0.017117,0.254028,-0.131809,0.025750,0.126439,0.136016,0.034999,0.015386,0.034598,comp.graphics


read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.085709,0.013900,0.040913,-0.032098,0.070461,0.038555,-0.023598,0.039486,0.005183,0.119037,...,0.035185,0.266568,-0.150982,0.021981,0.134651,0.134242,0.064253,0.021961,0.015164,alt.atheism
1,-0.062669,0.049578,0.052611,0.038462,0.013200,0.062592,-0.012973,0.006689,-0.004989,0.184555,...,0.016667,0.298404,-0.129311,0.017271,0.134963,0.150141,0.044829,0.028969,0.021020,comp.graphics
2,-0.100344,0.002995,0.033974,-0.028160,0.022248,0.025789,0.003818,0.025065,-0.003698,0.098390,...,0.014049,0.292742,-0.124550,0.037781,0.138999,0.125040,0.038925,0.039271,0.037268,soc.religion.christian
3,-0.083875,0.030184,0.075929,-0.046418,0.107492,0.030632,-0.003442,-0.000013,-0.017486,0.106113,...,0.033786,0.281438,-0.127884,0.056665,0.108442,0.144610,0.025621,0.063151,0.006332,comp.graphics
4,-0.091570,0.027541,0.059540,-0.057670,0.075479,0.030094,-0.025848,0.036572,0.003269,0.082793,...,0.029750,0.284570,-0.146682,0.032444,0.099341,0.134048,0.064531,0.043612,0.011866,rec.autos
5,-0.116354,0.022108,0.056102,-0.030260,0.060148,0.022880,-0.019455,0.041818,0.016843,0.098593,...,0.021080,0.289563,-0.142924,0.026885,0.137184,0.140236,0.036400,0.030481,0.025073,rec.autos
6,-0.084998,0.011209,0.026111,-0.047338,0.009979,0.025555,0.004694,0.045804,0.036844,0.110580,...,-0.014714,0.281388,-0.124818,0.007577,0.141873,0.131853,0.038945,0.030298,0.035742,rec.autos
7,-0.071892,0.003674,0.040130,-0.032850,0.012505,0.048784,-0.000445,0.025410,-0.021167,0.117251,...,0.024359,0.266679,-0.112458,0.031398,0.130701,0.118292,0.058422,0.046159,0.016899,rec.autos
8,-0.095329,0.012222,0.022365,-0.008072,0.044362,0.049514,-0.003086,0.019384,-0.019112,0.080876,...,0.017479,0.257510,-0.126674,0.018842,0.131691,0.128399,0.072148,0.074005,0.018638,rec.motorcycles
9,-0.132657,0.037954,0.041172,-0.072653,0.036338,0.005063,-0.019449,0.077277,-0.003921,0.099118,...,0.041017,0.307151,-0.126942,0.078547,0.140684,0.164324,0.073587,0.030354,0.001013,rec.autos


CPU times: user 924 ms, sys: 28 ms, total: 952 ms
Wall time: 1.18 s


###### SVM classifier(TensorFlow + corpus)

In [206]:
%%time

# Step 5.1.1: SVM

# if 'TFIDF' == modelChoice:

#train
X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']
#test
X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

# else:
#     #train
#     X_train = df_new['train']['x']
#     y_train = df_new['train']['y']
#     #test
#     X_test = df_new['test']['x']
#     y_test_true = df_new['test']['y']

clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 4 finished")

Step 4 finished
CPU times: user 2.19 s, sys: 8 ms, total: 2.2 s
Wall time: 2.25 s


In [207]:
%%time
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Step 5.1.2: Test
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

0.766842105263
0.762434507233
0.766842105263
CPU times: user 20 ms, sys: 4 ms, total: 24 ms
Wall time: 132 ms


###### DNN classifier(TensorFlow + corpus)

In [208]:
%%time

# Step 4: One-hot representation for labels

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

lb = LabelBinarizer()
lb.fit(df['train']['class'])

df_new = {}
for tpart in ['train', 'test']:
    labels = lb.transform(df[tpart]['class'])
    labelsDf = pd.DataFrame(labels, columns=["class-{}".format(i) for i in range(len(lb.classes_))])
    df_new[tpart] = {}
    df_new[tpart]['y'] = labelsDf
    df_new[tpart]['x'] = df[tpart].drop('class', axis=1)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    #save in CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelFrom))
        df_new[tpart][subpart].to_csv(csvpath)
    
print("label cleaning succussfully")

label cleaning succussfully
CPU times: user 3.86 s, sys: 72 ms, total: 3.93 s
Wall time: 4.55 s


In [209]:
%%time

## Step 5 : Train the classifier

COL_OUTCOME = 'class'
COL_FEATURE = [str(col) for col in list(df['train'].columns) if col != COL_OUTCOME]

cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

def my_input_fn(dataset):
    # Save dataset in tf format
    feature_cols = {
        str(col): tf.constant(
            df[dataset][str(col)].values
        )
        for col in COL_FEATURE
    }
    labels = tf.constant([cls2num[labelname] for labelname in df[dataset][COL_OUTCOME].values])
    # Returns the feature columns and labels in tf format
    return feature_cols, labels

feature_columns = [tf.contrib.layers.real_valued_column(column_name=str(col)) for col in COL_FEATURE]
clf = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, 
    hidden_units=[512], 
    n_classes=len(df['train']['class'].unique())
)

clf.fit(input_fn=lambda: my_input_fn('train'), steps=2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f14cca7e9d0>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}






































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpnd64UN/model.ckpt.
INFO:tensorflow:loss = 1.60874, step = 1
INFO:tensorflow:global_step/sec: 11.2287
INFO:tensorflow:loss = 1.57178, step = 101
INFO:tensorflow:global_step/sec: 11.9962
INFO:tensorflow:loss = 1.52732, step = 201
INFO:tensorflow:global_step/sec: 12.0191
INFO:tensorflow:loss = 1.45977, step = 301
INFO:tensorflow:global_step/sec: 12.1779
INFO:tensorflow:loss = 1.37355, step = 401
INFO:tensorflow:global_step/sec: 12.2203
INFO:tensorflow:loss = 1.2847, step = 501
INFO:tensorflow:global_step/sec: 12.0566
INFO:tensorflow:loss = 1.20312, step = 601
INFO:tenso

In [210]:
%%time

## Step 6: Evaluate

accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=df['test'].shape[0])['accuracy']
print("Test Accuracy by TensorFlow: {}".format(accuracy_score))

X_tensor_test, yt = my_input_fn('test')
tensorPredCls = list(clf.predict(input_fn=lambda: my_input_fn('test')))
num2cls = {v:k for (k, v) in cls2num.items()}
tensorPredClsStr = [num2cls[i] for i in tensorPredCls]
y_test_true = df['test']['class']
print('Test Accuracy by Scikit-learn: ', f1_score(y_test_true, tensorPredClsStr, average='micro'))







































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-05-14-01:44:47
INFO:tensorflow:Evaluation [1/1900]
INFO:tensorflow:Evaluation [2/1900]
INFO:tensorflow:Evaluation [3/1900]
INFO:tensorflow:Evaluation [4/1900]
INFO:tensorflow:Evaluation [5/1900]
INFO:tensorflow:Evaluation [6/1900]
INFO:tensorflow:Evaluation [7/1900]
INFO:tensorflow:Evaluation [8/1900]
INFO:tensorflow:Evaluation [9/1900]
INFO:tensorflow:Evaluation [10/1900]
INFO:tensorflow:Evaluation [11/1900]
INFO:tensorflow:Evaluation [12/1900]
INFO:tensorflow:Evaluation [13/1900]
INFO:tensorflow:Evaluation [14/1900]
INFO:tensorflow:Evaluation [15/1900]
INFO:tensorflow:Evaluation [16/1900]
INFO:tensorflow:Evaluation [1

INFO:tensorflow:Evaluation [73/1900]
INFO:tensorflow:Evaluation [74/1900]
INFO:tensorflow:Evaluation [75/1900]
INFO:tensorflow:Evaluation [76/1900]
INFO:tensorflow:Evaluation [77/1900]
INFO:tensorflow:Evaluation [78/1900]
INFO:tensorflow:Evaluation [79/1900]
INFO:tensorflow:Evaluation [80/1900]
INFO:tensorflow:Evaluation [81/1900]
INFO:tensorflow:Evaluation [82/1900]
INFO:tensorflow:Evaluation [83/1900]
INFO:tensorflow:Evaluation [84/1900]
INFO:tensorflow:Evaluation [85/1900]
INFO:tensorflow:Evaluation [86/1900]
INFO:tensorflow:Evaluation [87/1900]
INFO:tensorflow:Evaluation [88/1900]
INFO:tensorflow:Evaluation [89/1900]
INFO:tensorflow:Evaluation [90/1900]
INFO:tensorflow:Evaluation [91/1900]
INFO:tensorflow:Evaluation [92/1900]
INFO:tensorflow:Evaluation [93/1900]
INFO:tensorflow:Evaluation [94/1900]
INFO:tensorflow:Evaluation [95/1900]
INFO:tensorflow:Evaluation [96/1900]
INFO:tensorflow:Evaluation [97/1900]
INFO:tensorflow:Evaluation [98/1900]
INFO:tensorflow:Evaluation [99/1900]
I

INFO:tensorflow:Evaluation [290/1900]
INFO:tensorflow:Evaluation [291/1900]
INFO:tensorflow:Evaluation [292/1900]
INFO:tensorflow:Evaluation [293/1900]
INFO:tensorflow:Evaluation [294/1900]
INFO:tensorflow:Evaluation [295/1900]
INFO:tensorflow:Evaluation [296/1900]
INFO:tensorflow:Evaluation [297/1900]
INFO:tensorflow:Evaluation [298/1900]
INFO:tensorflow:Evaluation [299/1900]
INFO:tensorflow:Evaluation [300/1900]
INFO:tensorflow:Evaluation [301/1900]
INFO:tensorflow:Evaluation [302/1900]
INFO:tensorflow:Evaluation [303/1900]
INFO:tensorflow:Evaluation [304/1900]
INFO:tensorflow:Evaluation [305/1900]
INFO:tensorflow:Evaluation [306/1900]
INFO:tensorflow:Evaluation [307/1900]
INFO:tensorflow:Evaluation [308/1900]
INFO:tensorflow:Evaluation [309/1900]
INFO:tensorflow:Evaluation [310/1900]
INFO:tensorflow:Evaluation [311/1900]
INFO:tensorflow:Evaluation [312/1900]
INFO:tensorflow:Evaluation [313/1900]
INFO:tensorflow:Evaluation [314/1900]
INFO:tensorflow:Evaluation [315/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [506/1900]
INFO:tensorflow:Evaluation [507/1900]
INFO:tensorflow:Evaluation [508/1900]
INFO:tensorflow:Evaluation [509/1900]
INFO:tensorflow:Evaluation [510/1900]
INFO:tensorflow:Evaluation [511/1900]
INFO:tensorflow:Evaluation [512/1900]
INFO:tensorflow:Evaluation [513/1900]
INFO:tensorflow:Evaluation [514/1900]
INFO:tensorflow:Evaluation [515/1900]
INFO:tensorflow:Evaluation [516/1900]
INFO:tensorflow:Evaluation [517/1900]
INFO:tensorflow:Evaluation [518/1900]
INFO:tensorflow:Evaluation [519/1900]
INFO:tensorflow:Evaluation [520/1900]
INFO:tensorflow:Evaluation [521/1900]
INFO:tensorflow:Evaluation [522/1900]
INFO:tensorflow:Evaluation [523/1900]
INFO:tensorflow:Evaluation [524/1900]
INFO:tensorflow:Evaluation [525/1900]
INFO:tensorflow:Evaluation [526/1900]
INFO:tensorflow:Evaluation [527/1900]
INFO:tensorflow:Evaluation [528/1900]
INFO:tensorflow:Evaluation [529/1900]
INFO:tensorflow:Evaluation [530/1900]
INFO:tensorflow:Evaluation [531/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [722/1900]
INFO:tensorflow:Evaluation [723/1900]
INFO:tensorflow:Evaluation [724/1900]
INFO:tensorflow:Evaluation [725/1900]
INFO:tensorflow:Evaluation [726/1900]
INFO:tensorflow:Evaluation [727/1900]
INFO:tensorflow:Evaluation [728/1900]
INFO:tensorflow:Evaluation [729/1900]
INFO:tensorflow:Evaluation [730/1900]
INFO:tensorflow:Evaluation [731/1900]
INFO:tensorflow:Evaluation [732/1900]
INFO:tensorflow:Evaluation [733/1900]
INFO:tensorflow:Evaluation [734/1900]
INFO:tensorflow:Evaluation [735/1900]
INFO:tensorflow:Evaluation [736/1900]
INFO:tensorflow:Evaluation [737/1900]
INFO:tensorflow:Evaluation [738/1900]
INFO:tensorflow:Evaluation [739/1900]
INFO:tensorflow:Evaluation [740/1900]
INFO:tensorflow:Evaluation [741/1900]
INFO:tensorflow:Evaluation [742/1900]
INFO:tensorflow:Evaluation [743/1900]
INFO:tensorflow:Evaluation [744/1900]
INFO:tensorflow:Evaluation [745/1900]
INFO:tensorflow:Evaluation [746/1900]
INFO:tensorflow:Evaluation [747/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [938/1900]
INFO:tensorflow:Evaluation [939/1900]
INFO:tensorflow:Evaluation [940/1900]
INFO:tensorflow:Evaluation [941/1900]
INFO:tensorflow:Evaluation [942/1900]
INFO:tensorflow:Evaluation [943/1900]
INFO:tensorflow:Evaluation [944/1900]
INFO:tensorflow:Evaluation [945/1900]
INFO:tensorflow:Evaluation [946/1900]
INFO:tensorflow:Evaluation [947/1900]
INFO:tensorflow:Evaluation [948/1900]
INFO:tensorflow:Evaluation [949/1900]
INFO:tensorflow:Evaluation [950/1900]
INFO:tensorflow:Evaluation [951/1900]
INFO:tensorflow:Evaluation [952/1900]
INFO:tensorflow:Evaluation [953/1900]
INFO:tensorflow:Evaluation [954/1900]
INFO:tensorflow:Evaluation [955/1900]
INFO:tensorflow:Evaluation [956/1900]
INFO:tensorflow:Evaluation [957/1900]
INFO:tensorflow:Evaluation [958/1900]
INFO:tensorflow:Evaluation [959/1900]
INFO:tensorflow:Evaluation [960/1900]
INFO:tensorflow:Evaluation [961/1900]
INFO:tensorflow:Evaluation [962/1900]
INFO:tensorflow:Evaluation [963/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [1150/1900]
INFO:tensorflow:Evaluation [1151/1900]
INFO:tensorflow:Evaluation [1152/1900]
INFO:tensorflow:Evaluation [1153/1900]
INFO:tensorflow:Evaluation [1154/1900]
INFO:tensorflow:Evaluation [1155/1900]
INFO:tensorflow:Evaluation [1156/1900]
INFO:tensorflow:Evaluation [1157/1900]
INFO:tensorflow:Evaluation [1158/1900]
INFO:tensorflow:Evaluation [1159/1900]
INFO:tensorflow:Evaluation [1160/1900]
INFO:tensorflow:Evaluation [1161/1900]
INFO:tensorflow:Evaluation [1162/1900]
INFO:tensorflow:Evaluation [1163/1900]
INFO:tensorflow:Evaluation [1164/1900]
INFO:tensorflow:Evaluation [1165/1900]
INFO:tensorflow:Evaluation [1166/1900]
INFO:tensorflow:Evaluation [1167/1900]
INFO:tensorflow:Evaluation [1168/1900]
INFO:tensorflow:Evaluation [1169/1900]
INFO:tensorflow:Evaluation [1170/1900]
INFO:tensorflow:Evaluation [1171/1900]
INFO:tensorflow:Evaluation [1172/1900]
INFO:tensorflow:Evaluation [1173/1900]
INFO:tensorflow:Evaluation [1174/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1361/1900]
INFO:tensorflow:Evaluation [1362/1900]
INFO:tensorflow:Evaluation [1363/1900]
INFO:tensorflow:Evaluation [1364/1900]
INFO:tensorflow:Evaluation [1365/1900]
INFO:tensorflow:Evaluation [1366/1900]
INFO:tensorflow:Evaluation [1367/1900]
INFO:tensorflow:Evaluation [1368/1900]
INFO:tensorflow:Evaluation [1369/1900]
INFO:tensorflow:Evaluation [1370/1900]
INFO:tensorflow:Evaluation [1371/1900]
INFO:tensorflow:Evaluation [1372/1900]
INFO:tensorflow:Evaluation [1373/1900]
INFO:tensorflow:Evaluation [1374/1900]
INFO:tensorflow:Evaluation [1375/1900]
INFO:tensorflow:Evaluation [1376/1900]
INFO:tensorflow:Evaluation [1377/1900]
INFO:tensorflow:Evaluation [1378/1900]
INFO:tensorflow:Evaluation [1379/1900]
INFO:tensorflow:Evaluation [1380/1900]
INFO:tensorflow:Evaluation [1381/1900]
INFO:tensorflow:Evaluation [1382/1900]
INFO:tensorflow:Evaluation [1383/1900]
INFO:tensorflow:Evaluation [1384/1900]
INFO:tensorflow:Evaluation [1385/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1572/1900]
INFO:tensorflow:Evaluation [1573/1900]
INFO:tensorflow:Evaluation [1574/1900]
INFO:tensorflow:Evaluation [1575/1900]
INFO:tensorflow:Evaluation [1576/1900]
INFO:tensorflow:Evaluation [1577/1900]
INFO:tensorflow:Evaluation [1578/1900]
INFO:tensorflow:Evaluation [1579/1900]
INFO:tensorflow:Evaluation [1580/1900]
INFO:tensorflow:Evaluation [1581/1900]
INFO:tensorflow:Evaluation [1582/1900]
INFO:tensorflow:Evaluation [1583/1900]
INFO:tensorflow:Evaluation [1584/1900]
INFO:tensorflow:Evaluation [1585/1900]
INFO:tensorflow:Evaluation [1586/1900]
INFO:tensorflow:Evaluation [1587/1900]
INFO:tensorflow:Evaluation [1588/1900]
INFO:tensorflow:Evaluation [1589/1900]
INFO:tensorflow:Evaluation [1590/1900]
INFO:tensorflow:Evaluation [1591/1900]
INFO:tensorflow:Evaluation [1592/1900]
INFO:tensorflow:Evaluation [1593/1900]
INFO:tensorflow:Evaluation [1594/1900]
INFO:tensorflow:Evaluation [1595/1900]
INFO:tensorflow:Evaluation [1596/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1783/1900]
INFO:tensorflow:Evaluation [1784/1900]
INFO:tensorflow:Evaluation [1785/1900]
INFO:tensorflow:Evaluation [1786/1900]
INFO:tensorflow:Evaluation [1787/1900]
INFO:tensorflow:Evaluation [1788/1900]
INFO:tensorflow:Evaluation [1789/1900]
INFO:tensorflow:Evaluation [1790/1900]
INFO:tensorflow:Evaluation [1791/1900]
INFO:tensorflow:Evaluation [1792/1900]
INFO:tensorflow:Evaluation [1793/1900]
INFO:tensorflow:Evaluation [1794/1900]
INFO:tensorflow:Evaluation [1795/1900]
INFO:tensorflow:Evaluation [1796/1900]
INFO:tensorflow:Evaluation [1797/1900]
INFO:tensorflow:Evaluation [1798/1900]
INFO:tensorflow:Evaluation [1799/1900]
INFO:tensorflow:Evaluation [1800/1900]
INFO:tensorflow:Evaluation [1801/1900]
INFO:tensorflow:Evaluation [1802/1900]
INFO:tensorflow:Evaluation [1803/1900]
INFO:tensorflow:Evaluation [1804/1900]
INFO:tensorflow:Evaluation [1805/1900]
INFO:tensorflow:Evaluation [1806/1900]
INFO:tensorflow:Evaluation [1807/1900]
INFO:tensorflow:Evaluatio







































Test Accuracy by Scikit-learn:  0.69
CPU times: user 9min 13s, sys: 4.73 s, total: 9min 17s
Wall time: 1min 29s


###### CNN classifier(TensorFlow + corpus)

In [211]:
%%time

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

sess = tf.InteractiveSession()

COL_OUTCOME = 'class'
COL_FEATURE = [col for col in list(df['train'].columns) if col != COL_OUTCOME]

# cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

count_feature = len(COL_FEATURE)
count_class = len(df['train']['class'].unique())

x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
y_ = tf.placeholder(tf.float32, shape=[None, count_class], name='y_')

W = tf.Variable(tf.zeros([count_feature, count_class]))
b = tf.Variable(tf.zeros([count_class]))
y = tf.matmul(x, W) + b

# cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_text = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_text, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32, name='keep_prob')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, count_class])
b_fc2 = bias_variable([count_class])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

print("CNN initialization finished")

CNN initialization finished
CPU times: user 40 ms, sys: 0 ns, total: 40 ms
Wall time: 128 ms


In [212]:
%%time

### Start to traini and evaluate the model

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess.run(tf.global_variables_initializer())

x_input = df_new['train']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['train']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]
# y_input = [np.array([y_input.iloc[i].values]) for i in range(y_input.shape[0])]

# not use random input

for i in range(df['train'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, training accuracy {}".format(i, np.mean(train_accuracy)))
    train_step.run(feed_dict={
        keep_prob: 0.5,
        x:  np.array([elem[0] for elem in x_input[i:i+50]]),#x_input.iloc[i].values, #
        y_: np.array([elem[0] for elem in y_input[i:i+50]])#y_input.iloc[i].values#
    })

print("CNN training finished")

step 0, training accuracy 0.188000023365
step 100, training accuracy 0.180399984121
step 200, training accuracy 0.409599989653
step 300, training accuracy 0.482399970293
step 400, training accuracy 0.457200020552
step 500, training accuracy 0.402399927378
step 600, training accuracy 0.383599996567
step 700, training accuracy 0.559599995613
step 800, training accuracy 0.49279999733
step 900, training accuracy 0.458400040865
step 1000, training accuracy 0.594000041485
step 1100, training accuracy 0.546400010586
step 1200, training accuracy 0.595999956131
step 1300, training accuracy 0.656800031662
step 1400, training accuracy 0.738400042057
step 1500, training accuracy 0.557600021362
step 1600, training accuracy 0.60480004549
step 1700, training accuracy 0.583599984646
step 1800, training accuracy 0.600800037384
step 1900, training accuracy 0.6507999897
step 2000, training accuracy 0.793999910355
step 2100, training accuracy 0.631200015545
step 2200, training accuracy 0.733600020409
step

In [213]:
%%time

# Evaluate

x_input = df_new['test']['x']#df_new['test']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['test']['y']#df_new['test']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]

for i in range(df['test'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, testing accuracy {}".format(i, np.mean(train_accuracy)))

        
print("CNN testing finished")

step 0, testing accuracy 0.665600061417
step 100, testing accuracy 0.600000023842
step 200, testing accuracy 0.609199941158
step 300, testing accuracy 0.561600029469
step 400, testing accuracy 0.722000062466
step 500, testing accuracy 0.772000074387
step 600, testing accuracy 0.748399972916
step 700, testing accuracy 0.624000012875
step 800, testing accuracy 0.529599964619
step 900, testing accuracy 0.729199886322
step 1000, testing accuracy 0.660800039768
step 1100, testing accuracy 0.676400065422
step 1200, testing accuracy 0.694399952888
step 1300, testing accuracy 0.536000013351
step 1400, testing accuracy 0.725200116634
step 1500, testing accuracy 0.678799986839
step 1600, testing accuracy 0.76240003109
step 1700, testing accuracy 0.602800011635
step 1800, testing accuracy 0.668000102043
CNN testing finished
CPU times: user 2min 41s, sys: 728 ms, total: 2min 42s
Wall time: 23.7 s


## 总结

### 1. 对于同一种分类器训练法，不同表示模型对结果的影响

### 2. 对于同一种表示模型，不同训练模型对结果的影响

### 3. 综合来看，「表示模型 + 分类器」组合的效果评估

### 4. 展望