## -2. 更新日志

+ 【2017/05/11】
    - 重新梳理前期实验结果，并整合到该份报告中
    - 前期实验报告参见同一目录下的其他以 `trial_` 开头的 `.ipynb` 文件

## -1. 备注

1. 搜索该符号以定位到报告中待完善的部分：。。。
2. 稍后可考虑将部分代码包装成 Python 脚本，以模块的形式导入，使整个 notebook 更简洁？或者不这样处理，从而使读者更方便阅读（而不用另外切换于多个页面之间）？
3. 。。。

## 0. 参考文献

1. (#miscellaneous) [20 Newsgroup Document Classification Report](http://cn-static.udacity.com/mlnd/Capstone_Poject_Sample01.pdf)
2. (#word2vec, #tensorflow) [Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)
3. (#word2vec) [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
4. (#word2vec, #gensim) [models.word2vec – Deep learning with word2vec](https://radimrehurek.com/gensim/models/word2vec.html)
5. (#CNN, #tensorflow) [Deep MNIST for Experts](https://www.tensorflow.org/get_started/mnist/pros)
6. (#text8) [text8](http://mattmahoney.net/dc/textdata)
7. 。。。

## 1. 实验模块规划

粗略规划如下，稍后精细整理：实现下述共计 6 种「表示 + 训练」组合

表示模型 | 分类器训练法
---------------|---------------------
BOW + TF-IDF | SVM
                     | LSA
                     | LDA
Word2Vec      | SVM
                     | DNN
                     | CNN
                     
其中：
+ 名词解释
    - 表示模型
        * BOW：Bag-of-Words，词袋模型
        * TF-IDF：Term Frequency - Inverse Document Frequency，文档-逆文档频率
    - 分类器训练法
        * SVM：Support Vector Machine，支持向量机
        * LSA：潜在语义分析
        * LDA：Latent Dirichlet Allocation，隐含狄利克雷分布
        * DNN：Deep Neural Network，深度神经网络（普通的多层感知机构成的多层神经网络）
        * CNN：Convolution Neural Network，卷积神经网络
+ 训练工具
    - 表示模型
        * BOW+TF-IDF：使用 scikit-learn 建模
        * Word2Vec：使用 gensim 与 TensorFlow 建模
    - 分类器训练法
        * 传统算法：SVM, LSA, LDA 使用 scikit-learn 训练
        * 神经网络算法：DNN, CNN：使用 TensorFlow 训练

## 2. 实验流程规划

1. 模块导入
2. 数据预处理（通用预处理）
  + 对文本进行清洗，包括但不限于去除特殊符号、进行大小写转换等工作，最终使文本中只包含：由小写字母 a-z 组成的单词、单一空格
  + 不在 a-z 之间的字符将一律被转换为空格
3. 文本表示建模
  + BOW+TF-IDF 表示：
      1. 读入原始语料并保存
      2. 使用 scikit-learn，在原始语料的基础上，进行建立词袋（BOW）、计算 TF-IDF
      3. 通过使用 BOW+TF-IDF 向量表示文档中的每个词，从而表示每篇文档（包括所有语料：训练集和测试集）
      4. 每篇文档对应的标签独热（one-hot）向量化
  + Word2Vec 表示：
      1. 分别使用 gensim 和 TensorFlow 中的每一种，分别在 text8 的基础上、在待学习样本的基础上，建立词嵌入（word embedding）模型（Word2Vec）
          + 即：训练出 gensim+text8, gensim+待学习样本, TensorFlow+text8, TensorFlow+待学习样本 共计 4 种表示模型
          + 使用 Skip-Gram 方法进行建模
      2. 通过使用 Word2Vec 向量表示文档中的每个词，然后建立 2 种文档表示模型：
          + 求这些词向量的和，以求和向量表示每一篇文档；对于不在词汇表中的词，以某常量代替——具体而言，可指定为加入零向量，或在求和向量乘上某个常量系数
          + 对于上述求和向量进行求算术平均，使用算术平均向量表示每一篇文档；对于不在词汇表中的词，同上述处理方法
      3. 每篇文档对应的标签独热（one-hot）向量化
4. 分类器训练
  + 传统算法：SVM, LSA, LDA 使用 scikit-learn 训练
  + 神经网络算法：DNN, CNN：使用 TensorFlow 训练
5. 分类器评估
  + 对于传统算法：
    - 使用 sckit-learn 提供的 GridSearchCV 与 LearningCurve 方法寻找最优参数组合
    - 使用 scikit-learn 提供的 accuracy_score（查准率 P） 与 f1_score（F1 分数，同时考察了查准率 P 与查全率 R） 评估训练结果
  + 对于神经网络算法：
    - 暂定手工选择一组参数进行训练；待考察是否可使用 GridSearchCV 与 LearningCurve 进行参数组合寻找最优参数组合
    - 暂定使用手工编写的方法计算查准率 P、查全率 R、F1 分数；待考察是否可对数据格式进行一定程度上的转换或存储，以使用上述提及的 scikit-learn 提供的评估工具

## 3. 实验记录

In [1]:
# Step 0: import module

from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import os
import random
import datetime

import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelBinarizer

from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

import gensim
import tensorflow as tf

from IPython.display import display
%matplotlib inline

print("import modules successfully")

import modules successfully


In [2]:
%%time

# Step 1: preprocess the data
vecsize = 784

paths = {}
paths['dir.dataroot'] =  os.path.join(os.getcwd(), '..', 'data')
paths['dir.train'] = os.path.join(paths['dir.dataroot'], 'trialdata', 'train')
paths['dir.test'] = os.path.join(paths['dir.dataroot'], 'trialdata', 'test')
        
preprocessedFlag = os.path.join(paths['dir.dataroot'], 'preprocessed')
if not os.path.isfile(preprocessedFlag):
    for tpart in ['train', 'test']:
        dirpath = paths['dir.{}'.format(tpart)]
        for cls in os.listdir(dirpath):
            clspath = os.path.join(dirpath, cls)
            files = os.listdir(clspath)
            for f in files:
                fpath = os.path.join(clspath, f)
                os.system('mv {} {}.old'.format(fpath, fpath))
                os.system('perl {} {}.old > {}'.format(os.path.join(paths['dir.dataroot'], 'newfil.pl'), fpath, fpath))
                os.system('rm {}.old'.format(fpath))
    os.system('touch {}'.format(preprocessedFlag))
                
print("file preporcessing succefully")
            
stopwordlist = []
with open(os.path.join(paths['dir.dataroot'], 'stoplist-web.txt'), 'r') as readf:
    stopwordlist = readf.read()
    stopwordlist = stopwordlist.split('\n')
            
print("read stop word list successfully")
        
print("Step 1 Succeed")

file preporcessing succefully
read stop word list successfully
Step 1 Succeed
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 596 µs


### 3.1 BOW+TF-IDF 表示法

### 3.2 Word2Vec 表示法

In [3]:
paths['dir.modelroot'] = os.path.join(paths['dir.dataroot'], '..', 'models')
for modeltool in ['gensim', 'TensorFlow']:
    for embedsource in ['text8', 'corpus']:
        dname = os.path.join(paths['dir.modelroot'], '{}.{}'.format(modeltool, embedsource))
        if not os.path.isdir(dname):
            os.mkdir(dname)
        paths['dir.{}.{}'.format(modeltool, embedsource)] = dname

#### 3.2.1 gensim 训练

#### 3.2.2 TensorFlow 训练

In [4]:
import tensorflow as tf

modelFrom = 'TensorFlow'

In [5]:
%%time

import sys
sys.path.append(os.path.abspath('../'))

from modules.embedding.w2v_opt_full_01 import *

CPU times: user 12 ms, sys: 4 ms, total: 16 ms
Wall time: 30.9 ms


##### 3.2.1.1 基于 text8 建模

In [9]:
%%time

embedFrom = 'text8'

FLAGS.train_data = os.path.join(paths['dir.dataroot'], 'trialdata', embedFrom) #embedFrom #
FLAGS.eval_data = os.path.join(paths['dir.dataroot'], 'trialdata', 'questions-words.txt') #'questions-words.txt' #
FLAGS.save_path = paths['dir.{}.{}'.format(modelFrom, embedFrom)]
FLAGS.epochs_to_train = 15
FLAGS.embedding_size = 200#vecsize

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 21.9 µs


In [10]:
%%time

!cat ../modules/embedding/w2v_opt_full_02.py

import tensorflow as tf

session = tf.InteractiveSession()
"""Train a word2vec model."""
if not FLAGS.train_data or not FLAGS.eval_data or not FLAGS.save_path:
  print("--train_data --eval_data and --save_path must be specified.")
  sys.exit(1)
opts = Options()
#with tf.Graph().as_default() as session:
with tf.device("/cpu:0"):
  model = Word2Vec(opts, session)
  model.read_analogies() # Read analogy questions
for _ in xrange(opts.epochs_to_train):
  model.train()  # Process one epoch
  model.eval()  # Eval analogies.
# Perform a final save.
model.saver.save(session, os.path.join(opts.save_path, "model.ckpt"),
                 global_step=model.global_step)
if FLAGS.interactive:
  # E.g.,
  # [0]: model.analogy(b'france', b'paris', b'russia')
  # [1]: model.nearby([b'proton', b'elephant', b'maxwell'])
  _start_shell(locals())

CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 112 ms


In [11]:
%%time

session = tf.InteractiveSession()
"""Train a word2vec model."""
if not FLAGS.train_data or not FLAGS.eval_data or not FLAGS.save_path:
    print("--train_data --eval_data and --save_path must be specified.")
    sys.exit(1)
opts = Options()
#with tf.Graph().as_default() as session:
with tf.device("/cpu:0"):
    model = Word2Vec(opts, session)
    model.read_analogies() # Read analogy questions
for _ in xrange(opts.epochs_to_train):
    model.train()  # Process one epoch
    model.eval()  # Eval analogies.
# Perform a final save.
model.saver.save(session, os.path.join(opts.save_path, "model-{}.ckpt".format(embedFrom)),
                 global_step=model.global_step)
if FLAGS.interactive:
    # E.g.,
    # [0]: model.analogy(b'france', b'paris', b'russia')
    # [1]: model.nearby([b'proton', b'elephant', b'maxwell'])
    _start_shell(locals())

Data file:  /home/sushangjun/gits/Udacity/MLND/capstone.now/proposal/new/notebooks/../data/trialdata/text8
Vocab size:  71290  + UNK
Words per epoch:  17005207
Initialization:  [[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
Eval analogy file:  /home/sushangjun/gits/Udacity/MLND/capstone.now/proposal/new/notebooks/../data/trialdata/questions-words.txt
Questions:  17827
Skipped:  1717
Epoch    1 Step   150899: lr = 0.024 words/sec =    61829
Eval 1539/17827 accuracy =  8.6%
Epoch    1 Step   155608: lr = 0.024 words/sec =    78835

KeyboardInterrupt: 

In [180]:
# from scipy.spatial.distance import cosine

# # tmma=model._w_in.eval()
# # tmma
# tmpemb = model._w_out.eval()
# # print(len(tmpemb[model._word2id['slow']]))
# # print(tmpemb[model._word2id['slowly']])
# print(cosine(tmpemb[model._word2id['slow']], tmpemb[model._word2id['slowly']]))
# # print(cosine(tmpemb[model._word2id['man']], tmpemb[model._word2id['men']]))
# # print(cosine(tmpemb[model._word2id['woman']], tmpemb[model._word2id['women']]))
# # print(cosine(tmpemb[model._word2id['woman']], tmpemb[model._word2id['woman']]))
# # print(tmpemb[model._word2id['UNK']])
# print(cosine([1,1], [1.1, 1.3]))

In [182]:
%%time

# Step 1: preprocess the data

# import stoplist
stopwords = ""

pathtemp_TFIDF = os.path.join(paths['dir.dataroot'], 'stoplist-baseTFIDF.txt')
with open(pathtemp_TFIDF, 'r') as stoplistfile:
    stopwords = stoplistfile.read()
stopwords = stopwords.split()

pathtemp_web = os.path.join(paths['dir.dataroot'], 'stoplist-web.txt')
with open(pathtemp_web, 'r') as stoplistfile2:
    stopwords2 = stoplistfile2.read()
    stopwords2 = stopwords2.split('\n')
    stopwords = set(stopwords)
    stopwords = list(stopwords.union(set(stopwords)))
    
print("Read stop words successfully.")

Read stop words successfully.
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 29.3 ms


In [184]:
%%time

# Step 2: read data and save it in data['vec.train'] 和 data['vec.test']

data = {}
data['vec.train'] = {'w2v.mean':[], 'class':[]}
data['vec.test'] = {'w2v.mean':[], 'class':[]}

for tpart in ['train', 'test']:
    dirpath = paths['dir.{}'.format(tpart)]
    for (ind, cls) in enumerate(os.listdir(dirpath)):
        clspath = os.path.join(dirpath, cls)
        files = os.listdir(clspath)
        for f in files:
            fpath = os.path.join(clspath, f)
            with open(fpath, 'r') as readf:
                tokens = [token for token in readf.read().split() if token not in stopwords]#readf.read().split()#
                # Word2Vec representation
                # begin
                vec = np.array([0.0 for i in range(vecsize)])
                expectationVal = np.array([0.0 for i in range(vecsize)])
                countvec = 0
                for token in tokens:
                    try:
                        vec += tmpemb[model._word2id[token]] #model[token]
                        countvec += 1
                    except:
                        vec += tmpemb[model._word2id['UNK']]
                vec = vec / float(countvec)
                 # end
            data['vec.{}'.format(tpart)]['w2v.mean'].append(vec)
            data['vec.{}'.format(tpart)]['class'].append(cls)

    tmp = data['vec.{}'.format(tpart)]
    ind = (random.sample(range(len(tmp['class'])), 1))[0]
    print("sample(transformed) from {}[{}]:\n[corpus]\n {}\n[class]\n{}".format(tpart, ind, tmp['w2v.mean'][ind], tmp['class'][ind]))
    print()
    
print("Step 2 Succeed")

sample(transformed) from train[1148]:
[corpus]
 [ -9.96603407e-02   1.14235073e-01   4.87520915e-02  -3.99263393e-02
   1.03870710e-01   1.73305357e-02  -8.93831919e-02   1.19042018e-01
  -2.28948947e-02   5.87820504e-02   1.13846079e-02  -6.56907692e-02
  -1.50849392e-02  -6.74119421e-02  -4.09063270e-02   1.12255504e-01
   3.56880805e-02   4.81683357e-02   2.01546599e-02   1.71496643e-01
  -7.46743342e-02   2.97399331e-02   1.70645058e-01  -6.09753110e-02
   2.06357649e-02  -3.08952279e-02   6.20912504e-03   1.57769550e-01
   7.41829704e-02   8.86011216e-02  -2.44455459e-02   1.68424957e-02
  -3.67608798e-02  -1.30859618e-01  -4.36157938e-02  -1.11629976e-01
   9.82143396e-02   9.82492790e-03  -1.35626635e-02  -4.24545004e-02
  -1.03769611e-01   1.96121165e-01  -2.93205626e-02  -1.96371578e-01
  -1.13746641e-01   3.62143000e-02   9.79908944e-02   1.13563232e-01
   4.05143333e-02   1.30632895e-02  -1.14868810e-01  -4.35108331e-02
  -1.07200862e-01  -4.58310493e-02  -1.31418910e-02   1

sample(transformed) from test[540]:
[corpus]
 [ -6.46396886e-02   2.42271544e-02   2.44639674e-02  -2.72202944e-02
   9.98638799e-04   3.78417556e-02  -2.89962393e-02   3.18431460e-02
   9.56417334e-03   1.37226456e-01  -5.49257988e-02  -4.43405140e-02
   1.51597137e-02  -3.92107286e-02  -9.36229662e-03   1.42487843e-01
   8.88744312e-02   1.12198757e-01   6.18111436e-02   1.06159809e-01
  -7.91951507e-02   7.72533927e-02   8.68962515e-02   9.52377094e-03
  -5.92476699e-02   2.26452493e-02  -3.78951147e-02   4.58099847e-02
   5.34265479e-02   1.22909027e-02   1.91986339e-02   4.09438554e-02
  -9.50648633e-02  -1.03207956e-01   2.81913384e-03  -1.13300449e-01
   2.86964725e-02   4.76082927e-02  -2.22589809e-02   9.80216321e-03
  -5.85902861e-02   1.67057980e-01  -2.85485954e-02  -1.23521216e-01
  -1.63160895e-01   5.62528471e-02   7.90062219e-02   5.05483913e-02
   1.81772220e-02   4.77569216e-02  -1.41804892e-01  -6.50451692e-02
  -1.26190987e-01  -5.34384894e-03  -6.58932425e-02   7.9

In [185]:
%%time

# Step 3: Save in Pandas.DataFrame
#
# 将 data['matrix.train'] 与 data['matrix.test'] 转换成 Pandas.DataFrame 格式，保存到 df['train'] 和 df['test'] 中（df 为字典格式：String -> DataFrame）

df = {}
csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')
for tpart in ['train', 'test']:
    datadict = {}
    datadict['class'] = data['vec.{}'.format(tpart)]['class']
    datavec = np.array(data['vec.{}'.format(tpart)]['w2v.mean'])
    for col in range(vecsize):
        datadict[col]= datavec[:, col]

    df[tpart] = pd.DataFrame(data=datadict)
    print("See df[{}]".format(tpart))
    display(df[tpart])
    print("\n\n\n")
    # write data in DataFrame into CSV
    csvpath = os.path.join(csvpath_root, '{}-w2v-{}-{}.csv'.format(tpart, embedFrom, modelFrom))
    df[tpart].to_csv(csvpath, columns=df[tpart].columns)
    
print("Step 3 Succeed.")

# 繁琐点：研究如何把 CSR 矩阵中的数据规整好放到 DataFrame 中，并与 Class 一一对应

See df[train]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.058523,0.013638,0.024487,-0.030197,-0.015306,0.033814,-0.023289,0.027039,-0.012052,0.101854,...,0.009520,0.268571,-0.097666,0.022824,0.137221,0.120145,0.042478,0.069818,0.024586,soc.religion.christian
1,-0.035728,0.018440,0.027882,-0.007253,0.000318,0.035095,-0.008307,0.050132,0.021606,0.123589,...,-0.027584,0.254759,-0.099986,0.012034,0.126510,0.103783,0.058195,0.041614,0.015369,soc.religion.christian
2,-0.030835,-0.002645,0.013619,-0.032663,-0.036917,0.044206,0.001154,0.044904,0.007797,0.122667,...,-0.025182,0.242497,-0.106797,-0.005275,0.137966,0.106617,0.063244,0.067828,0.039724,soc.religion.christian
3,-0.019594,0.045219,0.011064,-0.018853,-0.018432,0.039846,-0.012094,0.079092,-0.000249,0.148417,...,-0.007299,0.253940,-0.088050,0.027485,0.112267,0.115603,0.053982,0.075722,0.039394,soc.religion.christian
4,-0.030478,0.033525,0.000909,-0.033771,-0.006535,0.020864,-0.011159,0.073313,0.005399,0.134877,...,-0.009814,0.246666,-0.083068,0.032013,0.116406,0.112866,0.051360,0.046786,0.034479,soc.religion.christian
5,-0.068971,0.022524,0.022240,-0.025342,0.044779,0.054707,-0.002061,0.053398,-0.019471,0.106976,...,0.022863,0.300212,-0.107021,0.036935,0.123293,0.139480,0.059774,0.066357,0.046229,soc.religion.christian
6,-0.040616,0.007487,0.012622,-0.015940,-0.017342,0.044873,-0.014794,0.067907,0.004165,0.114806,...,-0.022901,0.246773,-0.097176,0.027566,0.135201,0.111781,0.068158,0.071924,0.040155,soc.religion.christian
7,-0.088533,0.022957,0.035795,-0.032581,0.024442,0.011931,-0.011633,0.028860,0.019157,0.104923,...,0.005853,0.285087,-0.111841,0.031712,0.138796,0.128962,0.030158,0.049613,0.027840,soc.religion.christian
8,-0.081727,0.021808,0.046780,-0.027344,0.037914,0.025987,-0.022177,0.028687,-0.004210,0.119180,...,0.017514,0.290323,-0.102362,0.039628,0.127815,0.123184,0.051371,0.050606,0.051381,soc.religion.christian
9,-0.052158,0.020152,0.011457,-0.029701,0.010596,0.033001,-0.031144,0.051754,0.013221,0.120225,...,-0.009498,0.253548,-0.113777,0.018805,0.133704,0.126106,0.065190,0.058086,0.026126,soc.religion.christian






See df[test]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.031296,0.031190,0.033662,-0.017676,-0.000093,0.026671,0.013627,0.064756,-0.002502,0.125967,...,-0.006821,0.259358,-0.100459,0.020410,0.107689,0.117543,0.046935,0.058230,0.042685,soc.religion.christian
1,-0.052576,-0.010007,0.001262,-0.051575,-0.001714,0.037924,-0.017614,0.040380,0.000108,0.110576,...,0.002236,0.252903,-0.090882,0.025433,0.145794,0.133865,0.068060,0.073909,0.042318,soc.religion.christian
2,-0.037210,0.011418,0.011127,-0.038540,0.001013,0.025820,-0.002396,0.024885,0.018824,0.114064,...,0.003935,0.261865,-0.107376,0.013261,0.134763,0.144042,0.055032,0.022252,0.035977,soc.religion.christian
3,-0.058710,0.044231,0.035115,-0.017702,0.017434,0.036446,-0.019522,0.067778,-0.015364,0.083287,...,0.018137,0.267266,-0.106094,0.033556,0.120758,0.113386,0.044914,0.085380,0.044995,soc.religion.christian
4,-0.026366,0.016108,0.020979,-0.009220,-0.011549,0.030171,-0.012905,0.080754,-0.010876,0.129785,...,-0.003263,0.251054,-0.116747,0.026203,0.157351,0.132706,0.053787,0.071162,0.048086,soc.religion.christian
5,-0.003536,0.015019,0.015085,-0.002393,-0.009813,0.031112,-0.019398,0.068297,0.002323,0.115758,...,-0.015482,0.252973,-0.107032,0.024972,0.131752,0.105315,0.054258,0.059199,0.033886,soc.religion.christian
6,-0.039451,0.006575,0.017538,-0.014845,-0.000004,0.029863,-0.007273,0.058649,0.008164,0.117822,...,-0.010667,0.249252,-0.099601,0.017445,0.122034,0.122272,0.067075,0.065184,0.038287,soc.religion.christian
7,-0.030305,0.002403,0.022767,-0.021758,0.003630,0.038488,0.005574,0.057593,-0.000147,0.113814,...,0.003822,0.257768,-0.120562,0.041963,0.117600,0.124138,0.056058,0.045259,0.032298,soc.religion.christian
8,-0.055916,0.002920,0.012610,-0.028886,0.005426,0.037931,-0.010174,0.050305,0.003893,0.125212,...,-0.014562,0.254358,-0.098674,-0.001167,0.127779,0.132079,0.061554,0.043235,0.032576,soc.religion.christian
9,-0.050818,0.014585,0.029024,-0.045672,0.034625,0.016308,-0.019175,0.038426,0.008568,0.079724,...,0.004623,0.276395,-0.110139,0.038464,0.119581,0.128562,0.053603,0.055968,0.030504,soc.religion.christian






Step 3 Succeed.
CPU times: user 2.22 s, sys: 60 ms, total: 2.28 s
Wall time: 3.67 s


In [186]:
%%time

# if wanna read data from CSV file

df = {}

for tpart in ['train', 'test']:
    csvpath = os.path.join(
        csvpath_root, '{}-w2v-{}-{}.csv'.format(
            tpart, embedFrom, modelFrom
        )
    )
    if os.path.exists(csvpath):
        df[tpart] = pd.DataFrame.from_csv(csvpath)
        df[tpart] = df[tpart].sample(frac=1)
        df[tpart].reset_index(drop=True, inplace=True)
        print("read {} successfully".format(tpart))
        display(df[tpart])

read train successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.057341,0.014694,0.047682,-0.015018,0.009562,0.019051,-0.005166,0.046955,-0.014300,0.133241,...,0.007038,0.286418,-0.112458,0.025238,0.141784,0.125696,0.055557,0.065492,0.052077,comp.graphics
1,-0.023846,0.001160,0.010394,-0.022836,0.026280,0.051872,0.011654,0.028351,-0.017943,0.135171,...,0.022687,0.283771,-0.096455,0.031670,0.109492,0.150816,0.035601,0.049927,0.042134,rec.motorcycles
2,-0.031411,0.011987,0.002825,-0.031440,-0.009375,0.025572,0.008416,0.062001,0.018286,0.134346,...,0.014888,0.279204,-0.095205,-0.001121,0.131459,0.119123,0.049085,0.053070,0.033233,rec.motorcycles
3,-0.042833,0.015912,0.007204,-0.012253,-0.006918,0.034820,-0.018460,0.020200,-0.004529,0.127584,...,-0.010878,0.261667,-0.089696,0.025584,0.125700,0.163203,0.045023,0.054125,0.026043,rec.autos
4,-0.049115,0.006210,0.029890,-0.039029,0.027411,0.037111,-0.012789,0.038126,-0.004609,0.105153,...,0.004538,0.255003,-0.097393,0.046014,0.131058,0.155517,0.073182,0.065936,0.055590,soc.religion.christian
5,-0.023096,0.017767,-0.004985,-0.026241,-0.032956,0.031708,-0.014613,0.060181,-0.008096,0.118619,...,-0.032374,0.250342,-0.087871,0.025866,0.129169,0.115180,0.049557,0.047132,0.041679,alt.atheism
6,-0.050383,0.005988,0.024275,-0.032639,0.023947,0.033099,-0.007682,0.083294,-0.024961,0.118540,...,0.057446,0.294207,-0.153790,0.032245,0.122300,0.139739,0.065485,0.096452,0.048330,rec.motorcycles
7,-0.075066,0.021640,0.024893,-0.035045,0.033250,0.037919,-0.012097,0.038798,-0.005341,0.130516,...,0.016501,0.278293,-0.116479,0.025677,0.126034,0.133222,0.034075,0.031640,0.028024,rec.autos
8,-0.023873,-0.019982,0.006920,0.027267,0.047651,0.015003,0.023975,0.106521,-0.000969,0.121501,...,0.025447,0.311267,-0.090267,0.024427,0.104483,0.133039,0.044945,0.071998,0.067379,rec.autos
9,-0.082079,-0.002612,0.002923,-0.019352,0.021831,0.037990,0.006890,0.032810,-0.011633,0.116781,...,-0.013299,0.293357,-0.117362,0.023673,0.114332,0.149332,0.030224,0.057418,0.026547,rec.motorcycles


read test successfully


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,class
0,-0.087221,0.037610,0.076202,-0.085028,0.069781,0.011738,0.010871,0.070988,-0.108553,0.028930,...,0.157808,0.386495,-0.191177,0.084594,0.179159,0.077864,0.091934,0.041715,0.065478,comp.graphics
1,-0.058829,-0.032066,0.083235,-0.037977,-0.030693,0.043876,-0.024962,0.004570,-0.054611,0.145081,...,0.019536,0.282640,-0.072797,0.107911,0.157876,0.183181,0.090611,0.080142,0.062756,soc.religion.christian
2,-0.058821,0.007064,0.025833,-0.013926,0.035914,0.035166,0.007778,0.045814,-0.012347,0.119940,...,0.032537,0.292283,-0.101899,0.012648,0.120706,0.151207,0.035528,0.035747,0.028452,rec.autos
3,-0.084904,-0.005975,0.039602,-0.025309,0.045134,0.045177,-0.043969,0.047071,-0.016143,0.124390,...,0.012929,0.289601,-0.109395,0.032417,0.167009,0.103590,0.065869,0.049717,0.066308,rec.autos
4,-0.066778,-0.008851,0.032158,-0.028911,-0.013782,0.029620,-0.032480,0.062637,0.006886,0.111917,...,0.026468,0.247941,-0.114674,0.034530,0.150863,0.120913,0.063900,0.060704,0.025238,alt.atheism
5,-0.083665,-0.089845,0.042763,-0.051298,0.035474,-0.009569,-0.004769,0.124835,0.011801,0.137345,...,0.055882,0.343373,-0.103744,0.029216,0.091094,0.171804,0.040556,0.125682,0.095259,comp.graphics
6,-0.064888,0.017411,0.015754,-0.015379,0.003927,0.020462,-0.006277,0.051208,0.009471,0.111201,...,-0.009833,0.277543,-0.114362,0.020225,0.136011,0.117653,0.052173,0.049221,0.021557,soc.religion.christian
7,-0.082919,-0.024437,0.052487,0.000173,0.092714,0.006752,-0.050026,0.075537,-0.025218,0.118346,...,0.083269,0.288357,-0.075109,0.009719,0.117507,0.139409,0.060297,0.047801,0.042497,alt.atheism
8,-0.061776,0.013502,0.027730,-0.014864,0.036355,0.038239,-0.011143,0.036786,-0.019621,0.140948,...,0.002193,0.271171,-0.087204,0.061817,0.141650,0.122194,0.059561,0.077485,0.019845,alt.atheism
9,-0.046962,0.020122,-0.002065,-0.013625,0.043326,0.014022,-0.051820,0.049011,0.004234,0.101400,...,0.047355,0.280542,-0.106741,0.034514,0.121276,0.121642,0.040239,0.043128,0.018928,rec.autos


CPU times: user 948 ms, sys: 4 ms, total: 952 ms
Wall time: 1.49 s


###### SVM classifier(TensorFlow + text8)

In [187]:
%%time

# Step 5.1.1: SVM

# if 'TFIDF' == modelChoice:

#train
X_train = df['train'].drop('class', axis=1)
y_train = df['train']['class']
#test
X_test = df['test'].drop('class', axis=1)
y_test_true = df['test']['class']

# else:
#     #train
#     X_train = df_new['train']['x']
#     y_train = df_new['train']['y']
#     #test
#     X_test = df_new['test']['x']
#     y_test_true = df_new['test']['y']

clf = LinearSVC()
clf.fit(X_train, y_train)

print("Step 4 finished")

Step 4 finished
CPU times: user 2.17 s, sys: 4 ms, total: 2.17 s
Wall time: 2.85 s


In [188]:
%%time
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Step 5.1.2: Test
y_test_pred = clf.predict(X_test)
print(accuracy_score(y_test_true, y_test_pred))
print(f1_score(y_test_true, y_test_pred, average='macro'))
print(f1_score(y_test_true, y_test_pred, average='micro'))

0.848947368421
0.846123858354
0.848947368421
CPU times: user 24 ms, sys: 0 ns, total: 24 ms
Wall time: 211 ms


###### DNN classifier(TensorFlow + text8)

In [189]:
%%time

# Step 4: One-hot representation for labels

csvpath_root = os.path.join(paths['dir.dataroot'], 'data_CSV')

lb = LabelBinarizer()
lb.fit(df['train']['class'])

df_new = {}
for tpart in ['train', 'test']:
    labels = lb.transform(df[tpart]['class'])
    labelsDf = pd.DataFrame(labels, columns=["class-{}".format(i) for i in range(len(lb.classes_))])
    df_new[tpart] = {}
    df_new[tpart]['y'] = labelsDf
    df_new[tpart]['x'] = df[tpart].drop('class', axis=1)
    df_new[tpart]['all'] = df_new[tpart]['x'].join(df_new[tpart]['y'])
    #save in CSV
    for subpart in ['x', 'y', 'all']:
        csvpath = os.path.join(csvpath_root, "{}-cleanLabels-{}-{}.csv".format(tpart, subpart, modelFrom))
        df_new[tpart][subpart].to_csv(csvpath)
    
print("label cleaning succussfully")

label cleaning succussfully
CPU times: user 4.17 s, sys: 84 ms, total: 4.25 s
Wall time: 5.12 s


In [190]:
%%time

## Step 5 : Train the classifier

COL_OUTCOME = 'class'
COL_FEATURE = [str(col) for col in list(df['train'].columns) if col != COL_OUTCOME]

cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

def my_input_fn(dataset):
    # Save dataset in tf format
    feature_cols = {
        str(col): tf.constant(
            df[dataset][str(col)].values
        )
        for col in COL_FEATURE
    }
    labels = tf.constant([cls2num[labelname] for labelname in df[dataset][COL_OUTCOME].values])
    # Returns the feature columns and labels in tf format
    return feature_cols, labels

feature_columns = [tf.contrib.layers.real_valued_column(column_name=str(col)) for col in COL_FEATURE]
clf = tf.contrib.learn.DNNClassifier(
    feature_columns=feature_columns, 
    hidden_units=[512], 
    n_classes=len(df['train']['class'].unique())
)

clf.fit(input_fn=lambda: my_input_fn('train'), steps=2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': None, '_environment': 'local', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f152b231950>, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_task_id': 0, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_evaluation_master': '', '_keep_checkpoint_every_n_hours': 10000, '_master': ''}






































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpiR94KS/model.ckpt.
INFO:tensorflow:loss = 1.61255, step = 1
INFO:tensorflow:global_step/sec: 9.96804
INFO:tensorflow:loss = 1.40367, step = 101
INFO:tensorflow:global_step/sec: 12.3857
INFO:tensorflow:loss = 1.22488, step = 201
INFO:tensorflow:global_step/sec: 12.3444
INFO:tensorflow:loss = 1.07681, step = 301
INFO:tensorflow:global_step/sec: 12.235
INFO:tensorflow:loss = 0.946175, step = 401
INFO:tensorflow:global_step/sec: 12.3181
INFO:tensorflow:loss = 0.838036, step = 501
INFO:tensorflow:global_step/sec: 12.2071
INFO:tensorflow:loss = 0.751527, step = 601
INFO:te

In [191]:
%%time

## Step 6: Evaluate

accuracy_score = clf.evaluate(input_fn=lambda: my_input_fn('test'), steps=df['test'].shape[0])['accuracy']
print("Test Accuracy by TensorFlow: {}".format(accuracy_score))

X_tensor_test, yt = my_input_fn('test')
tensorPredCls = list(clf.predict(input_fn=lambda: my_input_fn('test')))
num2cls = {v:k for (k, v) in cls2num.items()}
tensorPredClsStr = [num2cls[i] for i in tensorPredCls]
y_test_true = df['test']['class']
print('Test Accuracy by Scikit-learn: ', f1_score(y_test_true, tensorPredClsStr, average='micro'))







































Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Starting evaluation at 2017-05-13-17:00:33
INFO:tensorflow:Evaluation [1/1900]
INFO:tensorflow:Evaluation [2/1900]
INFO:tensorflow:Evaluation [3/1900]
INFO:tensorflow:Evaluation [4/1900]
INFO:tensorflow:Evaluation [5/1900]
INFO:tensorflow:Evaluation [6/1900]
INFO:tensorflow:Evaluation [7/1900]
INFO:tensorflow:Evaluation [8/1900]
INFO:tensorflow:Evaluation [9/1900]
INFO:tensorflow:Evaluation [10/1900]
INFO:tensorflow:Evaluation [11/1900]
INFO:tensorflow:Evaluation [12/1900]
INFO:tensorflow:Evaluation [13/1900]
INFO:tensorflow:Evaluation [14/1900]
INFO:tensorflow:Evaluation [15/1900]
INFO:tensorflow:Evaluation [16/1900]
INFO:tensorflow:Evaluation [1

INFO:tensorflow:Evaluation [73/1900]
INFO:tensorflow:Evaluation [74/1900]
INFO:tensorflow:Evaluation [75/1900]
INFO:tensorflow:Evaluation [76/1900]
INFO:tensorflow:Evaluation [77/1900]
INFO:tensorflow:Evaluation [78/1900]
INFO:tensorflow:Evaluation [79/1900]
INFO:tensorflow:Evaluation [80/1900]
INFO:tensorflow:Evaluation [81/1900]
INFO:tensorflow:Evaluation [82/1900]
INFO:tensorflow:Evaluation [83/1900]
INFO:tensorflow:Evaluation [84/1900]
INFO:tensorflow:Evaluation [85/1900]
INFO:tensorflow:Evaluation [86/1900]
INFO:tensorflow:Evaluation [87/1900]
INFO:tensorflow:Evaluation [88/1900]
INFO:tensorflow:Evaluation [89/1900]
INFO:tensorflow:Evaluation [90/1900]
INFO:tensorflow:Evaluation [91/1900]
INFO:tensorflow:Evaluation [92/1900]
INFO:tensorflow:Evaluation [93/1900]
INFO:tensorflow:Evaluation [94/1900]
INFO:tensorflow:Evaluation [95/1900]
INFO:tensorflow:Evaluation [96/1900]
INFO:tensorflow:Evaluation [97/1900]
INFO:tensorflow:Evaluation [98/1900]
INFO:tensorflow:Evaluation [99/1900]
I

INFO:tensorflow:Evaluation [290/1900]
INFO:tensorflow:Evaluation [291/1900]
INFO:tensorflow:Evaluation [292/1900]
INFO:tensorflow:Evaluation [293/1900]
INFO:tensorflow:Evaluation [294/1900]
INFO:tensorflow:Evaluation [295/1900]
INFO:tensorflow:Evaluation [296/1900]
INFO:tensorflow:Evaluation [297/1900]
INFO:tensorflow:Evaluation [298/1900]
INFO:tensorflow:Evaluation [299/1900]
INFO:tensorflow:Evaluation [300/1900]
INFO:tensorflow:Evaluation [301/1900]
INFO:tensorflow:Evaluation [302/1900]
INFO:tensorflow:Evaluation [303/1900]
INFO:tensorflow:Evaluation [304/1900]
INFO:tensorflow:Evaluation [305/1900]
INFO:tensorflow:Evaluation [306/1900]
INFO:tensorflow:Evaluation [307/1900]
INFO:tensorflow:Evaluation [308/1900]
INFO:tensorflow:Evaluation [309/1900]
INFO:tensorflow:Evaluation [310/1900]
INFO:tensorflow:Evaluation [311/1900]
INFO:tensorflow:Evaluation [312/1900]
INFO:tensorflow:Evaluation [313/1900]
INFO:tensorflow:Evaluation [314/1900]
INFO:tensorflow:Evaluation [315/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [506/1900]
INFO:tensorflow:Evaluation [507/1900]
INFO:tensorflow:Evaluation [508/1900]
INFO:tensorflow:Evaluation [509/1900]
INFO:tensorflow:Evaluation [510/1900]
INFO:tensorflow:Evaluation [511/1900]
INFO:tensorflow:Evaluation [512/1900]
INFO:tensorflow:Evaluation [513/1900]
INFO:tensorflow:Evaluation [514/1900]
INFO:tensorflow:Evaluation [515/1900]
INFO:tensorflow:Evaluation [516/1900]
INFO:tensorflow:Evaluation [517/1900]
INFO:tensorflow:Evaluation [518/1900]
INFO:tensorflow:Evaluation [519/1900]
INFO:tensorflow:Evaluation [520/1900]
INFO:tensorflow:Evaluation [521/1900]
INFO:tensorflow:Evaluation [522/1900]
INFO:tensorflow:Evaluation [523/1900]
INFO:tensorflow:Evaluation [524/1900]
INFO:tensorflow:Evaluation [525/1900]
INFO:tensorflow:Evaluation [526/1900]
INFO:tensorflow:Evaluation [527/1900]
INFO:tensorflow:Evaluation [528/1900]
INFO:tensorflow:Evaluation [529/1900]
INFO:tensorflow:Evaluation [530/1900]
INFO:tensorflow:Evaluation [531/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [722/1900]
INFO:tensorflow:Evaluation [723/1900]
INFO:tensorflow:Evaluation [724/1900]
INFO:tensorflow:Evaluation [725/1900]
INFO:tensorflow:Evaluation [726/1900]
INFO:tensorflow:Evaluation [727/1900]
INFO:tensorflow:Evaluation [728/1900]
INFO:tensorflow:Evaluation [729/1900]
INFO:tensorflow:Evaluation [730/1900]
INFO:tensorflow:Evaluation [731/1900]
INFO:tensorflow:Evaluation [732/1900]
INFO:tensorflow:Evaluation [733/1900]
INFO:tensorflow:Evaluation [734/1900]
INFO:tensorflow:Evaluation [735/1900]
INFO:tensorflow:Evaluation [736/1900]
INFO:tensorflow:Evaluation [737/1900]
INFO:tensorflow:Evaluation [738/1900]
INFO:tensorflow:Evaluation [739/1900]
INFO:tensorflow:Evaluation [740/1900]
INFO:tensorflow:Evaluation [741/1900]
INFO:tensorflow:Evaluation [742/1900]
INFO:tensorflow:Evaluation [743/1900]
INFO:tensorflow:Evaluation [744/1900]
INFO:tensorflow:Evaluation [745/1900]
INFO:tensorflow:Evaluation [746/1900]
INFO:tensorflow:Evaluation [747/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [938/1900]
INFO:tensorflow:Evaluation [939/1900]
INFO:tensorflow:Evaluation [940/1900]
INFO:tensorflow:Evaluation [941/1900]
INFO:tensorflow:Evaluation [942/1900]
INFO:tensorflow:Evaluation [943/1900]
INFO:tensorflow:Evaluation [944/1900]
INFO:tensorflow:Evaluation [945/1900]
INFO:tensorflow:Evaluation [946/1900]
INFO:tensorflow:Evaluation [947/1900]
INFO:tensorflow:Evaluation [948/1900]
INFO:tensorflow:Evaluation [949/1900]
INFO:tensorflow:Evaluation [950/1900]
INFO:tensorflow:Evaluation [951/1900]
INFO:tensorflow:Evaluation [952/1900]
INFO:tensorflow:Evaluation [953/1900]
INFO:tensorflow:Evaluation [954/1900]
INFO:tensorflow:Evaluation [955/1900]
INFO:tensorflow:Evaluation [956/1900]
INFO:tensorflow:Evaluation [957/1900]
INFO:tensorflow:Evaluation [958/1900]
INFO:tensorflow:Evaluation [959/1900]
INFO:tensorflow:Evaluation [960/1900]
INFO:tensorflow:Evaluation [961/1900]
INFO:tensorflow:Evaluation [962/1900]
INFO:tensorflow:Evaluation [963/1900]
INFO:tensorf

INFO:tensorflow:Evaluation [1150/1900]
INFO:tensorflow:Evaluation [1151/1900]
INFO:tensorflow:Evaluation [1152/1900]
INFO:tensorflow:Evaluation [1153/1900]
INFO:tensorflow:Evaluation [1154/1900]
INFO:tensorflow:Evaluation [1155/1900]
INFO:tensorflow:Evaluation [1156/1900]
INFO:tensorflow:Evaluation [1157/1900]
INFO:tensorflow:Evaluation [1158/1900]
INFO:tensorflow:Evaluation [1159/1900]
INFO:tensorflow:Evaluation [1160/1900]
INFO:tensorflow:Evaluation [1161/1900]
INFO:tensorflow:Evaluation [1162/1900]
INFO:tensorflow:Evaluation [1163/1900]
INFO:tensorflow:Evaluation [1164/1900]
INFO:tensorflow:Evaluation [1165/1900]
INFO:tensorflow:Evaluation [1166/1900]
INFO:tensorflow:Evaluation [1167/1900]
INFO:tensorflow:Evaluation [1168/1900]
INFO:tensorflow:Evaluation [1169/1900]
INFO:tensorflow:Evaluation [1170/1900]
INFO:tensorflow:Evaluation [1171/1900]
INFO:tensorflow:Evaluation [1172/1900]
INFO:tensorflow:Evaluation [1173/1900]
INFO:tensorflow:Evaluation [1174/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1361/1900]
INFO:tensorflow:Evaluation [1362/1900]
INFO:tensorflow:Evaluation [1363/1900]
INFO:tensorflow:Evaluation [1364/1900]
INFO:tensorflow:Evaluation [1365/1900]
INFO:tensorflow:Evaluation [1366/1900]
INFO:tensorflow:Evaluation [1367/1900]
INFO:tensorflow:Evaluation [1368/1900]
INFO:tensorflow:Evaluation [1369/1900]
INFO:tensorflow:Evaluation [1370/1900]
INFO:tensorflow:Evaluation [1371/1900]
INFO:tensorflow:Evaluation [1372/1900]
INFO:tensorflow:Evaluation [1373/1900]
INFO:tensorflow:Evaluation [1374/1900]
INFO:tensorflow:Evaluation [1375/1900]
INFO:tensorflow:Evaluation [1376/1900]
INFO:tensorflow:Evaluation [1377/1900]
INFO:tensorflow:Evaluation [1378/1900]
INFO:tensorflow:Evaluation [1379/1900]
INFO:tensorflow:Evaluation [1380/1900]
INFO:tensorflow:Evaluation [1381/1900]
INFO:tensorflow:Evaluation [1382/1900]
INFO:tensorflow:Evaluation [1383/1900]
INFO:tensorflow:Evaluation [1384/1900]
INFO:tensorflow:Evaluation [1385/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1572/1900]
INFO:tensorflow:Evaluation [1573/1900]
INFO:tensorflow:Evaluation [1574/1900]
INFO:tensorflow:Evaluation [1575/1900]
INFO:tensorflow:Evaluation [1576/1900]
INFO:tensorflow:Evaluation [1577/1900]
INFO:tensorflow:Evaluation [1578/1900]
INFO:tensorflow:Evaluation [1579/1900]
INFO:tensorflow:Evaluation [1580/1900]
INFO:tensorflow:Evaluation [1581/1900]
INFO:tensorflow:Evaluation [1582/1900]
INFO:tensorflow:Evaluation [1583/1900]
INFO:tensorflow:Evaluation [1584/1900]
INFO:tensorflow:Evaluation [1585/1900]
INFO:tensorflow:Evaluation [1586/1900]
INFO:tensorflow:Evaluation [1587/1900]
INFO:tensorflow:Evaluation [1588/1900]
INFO:tensorflow:Evaluation [1589/1900]
INFO:tensorflow:Evaluation [1590/1900]
INFO:tensorflow:Evaluation [1591/1900]
INFO:tensorflow:Evaluation [1592/1900]
INFO:tensorflow:Evaluation [1593/1900]
INFO:tensorflow:Evaluation [1594/1900]
INFO:tensorflow:Evaluation [1595/1900]
INFO:tensorflow:Evaluation [1596/1900]
INFO:tensorflow:Evaluatio

INFO:tensorflow:Evaluation [1783/1900]
INFO:tensorflow:Evaluation [1784/1900]
INFO:tensorflow:Evaluation [1785/1900]
INFO:tensorflow:Evaluation [1786/1900]
INFO:tensorflow:Evaluation [1787/1900]
INFO:tensorflow:Evaluation [1788/1900]
INFO:tensorflow:Evaluation [1789/1900]
INFO:tensorflow:Evaluation [1790/1900]
INFO:tensorflow:Evaluation [1791/1900]
INFO:tensorflow:Evaluation [1792/1900]
INFO:tensorflow:Evaluation [1793/1900]
INFO:tensorflow:Evaluation [1794/1900]
INFO:tensorflow:Evaluation [1795/1900]
INFO:tensorflow:Evaluation [1796/1900]
INFO:tensorflow:Evaluation [1797/1900]
INFO:tensorflow:Evaluation [1798/1900]
INFO:tensorflow:Evaluation [1799/1900]
INFO:tensorflow:Evaluation [1800/1900]
INFO:tensorflow:Evaluation [1801/1900]
INFO:tensorflow:Evaluation [1802/1900]
INFO:tensorflow:Evaluation [1803/1900]
INFO:tensorflow:Evaluation [1804/1900]
INFO:tensorflow:Evaluation [1805/1900]
INFO:tensorflow:Evaluation [1806/1900]
INFO:tensorflow:Evaluation [1807/1900]
INFO:tensorflow:Evaluatio







































Test Accuracy by Scikit-learn:  0.801578947368
CPU times: user 9min 13s, sys: 18.9 s, total: 9min 32s
Wall time: 1min 40s


###### CNN classifier(TensorFlow + text8)

In [192]:
%%time

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

sess = tf.InteractiveSession()

COL_OUTCOME = 'class'
COL_FEATURE = [col for col in list(df['train'].columns) if col != COL_OUTCOME]

# cls2num = {cls:ind for (ind, cls) in enumerate(df['train']['class'].unique())}

count_feature = len(COL_FEATURE)
count_class = len(df['train']['class'].unique())

x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
y_ = tf.placeholder(tf.float32, shape=[None, count_class], name='y_')

W = tf.Variable(tf.zeros([count_feature, count_class]))
b = tf.Variable(tf.zeros([count_class]))
y = tf.matmul(x, W) + b

# cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_text = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_text, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32, name='keep_prob')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, count_class])
b_fc2 = bias_variable([count_class])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

print("CNN initialization finished")

CNN initialization finished
CPU times: user 36 ms, sys: 0 ns, total: 36 ms
Wall time: 36.2 ms


In [193]:
%%time

### Start to traini and evaluate the model

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess.run(tf.global_variables_initializer())

x_input = df_new['train']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['train']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]
# y_input = [np.array([y_input.iloc[i].values]) for i in range(y_input.shape[0])]

# not use random input

for i in range(df['train'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, training accuracy {}".format(i, np.mean(train_accuracy)))
    train_step.run(feed_dict={
        keep_prob: 0.5,
        x:  np.array([elem[0] for elem in x_input[i:i+50]]),#x_input.iloc[i].values, #
        y_: np.array([elem[0] for elem in y_input[i:i+50]])#y_input.iloc[i].values#
    })

print("CNN training finished")

step 0, training accuracy 0.17239998281
step 100, training accuracy 0.411599963903
step 200, training accuracy 0.393200010061
step 300, training accuracy 0.57959997654
step 400, training accuracy 0.571999967098
step 500, training accuracy 0.64519995451
step 600, training accuracy 0.675600111485
step 700, training accuracy 0.639999985695
step 800, training accuracy 0.697600007057
step 900, training accuracy 0.63480001688
step 1000, training accuracy 0.676400005817
step 1100, training accuracy 0.721599936485
step 1200, training accuracy 0.780399918556
step 1300, training accuracy 0.767999947071
step 1400, training accuracy 0.855599999428
step 1500, training accuracy 0.803999960423
step 1600, training accuracy 0.720800101757
step 1700, training accuracy 0.724400043488
step 1800, training accuracy 0.812799990177
step 1900, training accuracy 0.825599968433
step 2000, training accuracy 0.795599997044
step 2100, training accuracy 0.8492000103
step 2200, training accuracy 0.823999941349
step 2

In [194]:
%%time

# Evaluate

x_input = df_new['test']['x']#df_new['test']['x']
x_input = [np.array([
            np.float32(x_input.iloc[i].values)
        ])
    for i in range(x_input.shape[0])]
y_input = df_new['test']['y']#df_new['test']['y']
y_input = [np.array([
            np.float32(y_input.iloc[i].values)
        ])
    for i in range(y_input.shape[0])]

for i in range(df['test'].shape[0] - 50):
    if 0 == i % 100:
        train_accuracy = []
        for j in range(50):
            train_accuracy.append(accuracy.eval(feed_dict={
                    keep_prob: 1,
                    x:  np.array([elem[0] for elem in x_input[i+j:i+j+50]]),#x_input.iloc[i+j].values, #
                    y_: np.array([elem[0] for elem in y_input[i+j:i+j+50]])#y_input.iloc[i+j].values #
                })
            )
        print("step {}, testing accuracy {}".format(i, np.mean(train_accuracy)))

        
print("CNN testing finished")

step 0, testing accuracy 0.702000021935
step 100, testing accuracy 0.664400041103
step 200, testing accuracy 0.757999956608
step 300, testing accuracy 0.742399990559
step 400, testing accuracy 0.694000005722
step 500, testing accuracy 0.754399955273
step 600, testing accuracy 0.75
step 700, testing accuracy 0.74640005827
step 800, testing accuracy 0.808000028133
step 900, testing accuracy 0.677200078964
step 1000, testing accuracy 0.738800048828
step 1100, testing accuracy 0.744000017643
step 1200, testing accuracy 0.795199990273
step 1300, testing accuracy 0.712799966335
step 1400, testing accuracy 0.718400061131
step 1500, testing accuracy 0.79240000248
step 1600, testing accuracy 0.734800040722
step 1700, testing accuracy 0.672400057316
step 1800, testing accuracy 0.730400025845
CNN testing finished
CPU times: user 2min 42s, sys: 1.65 s, total: 2min 44s
Wall time: 25 s


## 总结

### 1. 对于同一种分类器训练法，不同表示模型对结果的影响

### 2. 对于同一种表示模型，不同训练模型对结果的影响

### 3. 综合来看，「表示模型 + 分类器」组合的效果评估

### 4. 展望