## Script Usage Guide

### 0. Environment

In [1]:
!head -n 14 cnn.py | nl

     1	# /usr/bin/env python
     2	# -*- coding: utf-8 -*-
       
     3	import pickle
     4	import click
     5	import numpy as np
     6	import tensorflow as tf
     7	from sklearn.metrics import confusion_matrix
     8	from mlxtend.plotting import plot_confusion_matrix
       
     9	from text_cnn import TextCNN
    10	from text_helpers import build_dataset
       
       


In [2]:
%load_ext watermark
%watermark -a 'Scott Ming' -v -m -d -p click,numpy,pandas,scipy,matplotlib,mlxtend,sklearn,tensorflow

Scott Ming 2017-04-14 

CPython 3.6.0
IPython 5.3.0

click 6.7
numpy 1.12.1
pandas 0.19.2
scipy 0.19.0
matplotlib 2.0.0
mlxtend 0.6.0
sklearn 0.18.1
tensorflow 1.0.1

compiler   : GCC 4.9.2
system     : Linux
release    : 3.16.0-4-amd64
machine    : x86_64
processor  : 
CPU cores  : 4
interpreter: 64bit


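#### Note

In a pyenv environment on Linux, importing the mlxtend package directly from the terminal raises an error. There are two workarounds:

* Comment out line 8, `from mlxtend.plotting import plot_confusion_matrix`. The script then runs, but it cannot draw the plot.
* Follow [Tkinter import error for pyenv Pythons · Issue #94 · pyenv/pyenv](https://github.com/pyenv/pyenv/issues/94) to fix the underlying tkinter problem for good.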
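A third option, sketched here as an assumption rather than something `cnn.py` actually does, is to guard the import so the script keeps running when the plotting stack fails to load. The `None` fallback and the `can_plot` helper are hypothetical names; any code that plots would need to check them first.

```python
# Guarded import: keep the script usable even if mlxtend (or the tkinter
# backend that matplotlib may pull in) cannot be imported.
try:
    from mlxtend.plotting import plot_confusion_matrix
except ImportError:  # also raised when the _tkinter module is missing
    plot_confusion_matrix = None  # hypothetical fallback, not part of cnn.py


def can_plot():
    """Return True when the plotting helper was imported successfully."""
    return plot_confusion_matrix is not None


print("plotting available:", can_plot())
```

With this guard in place, training can still run on a machine without tkinter; only the plot is skipped.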

Running tensorflow directly from the terminal also prints warnings. To hide them, see ["The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations" in "Hello, TensorFlow!" program · Issue #7778 · tensorflow/tensorflow](https://github.com/tensorflow/tensorflow/issues/7778)
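One common way to silence these messages from inside Python is to set the `TF_CPP_MIN_LOG_LEVEL` environment variable before tensorflow is imported. The helper name `quiet_tensorflow_logs` below is made up for illustration, and the tensorflow import is left commented out so the snippet runs without it.

```python
import os


def quiet_tensorflow_logs(level="2"):
    """Set the C++ log threshold; must run before `import tensorflow`.

    "0" shows everything, "1" hides INFO, "2" also hides WARNING,
    "3" also hides ERROR.
    """
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = level
    return os.environ["TF_CPP_MIN_LOG_LEVEL"]


quiet_tensorflow_logs()
# import tensorflow as tf  # import only after the variable is set
print(os.environ["TF_CPP_MIN_LOG_LEVEL"])
```

Setting the variable after tensorflow has already been imported has no effect on messages that were printed during the import.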

### 1. Script Overview

The script is split into 2 commands (think of them as subcommands). List them with `--help`:

In [1]:
!python cnn.py --help

Usage: cnn.py [OPTIONS] COMMAND [ARGS]...

  CNN for Text Classification in Tensorflow.

  Examples:

      python cnn.py train  # train

      python cnn.py train --confusion-matrix  # plot confusion matrix

      python cnn.py --train-path train_shuffle.txt --test-path test_shuffle.txt clean  # text clean

Options:
  --train-path TEXT  Default: data/train_data.txt.
  --test-path TEXT   Default: data/test_data.txt.
  --help             Show this message and exit.

Commands:
  clean
  train


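Options placed directly after `python cnn.py` are group options, i.e. global options for the whole script. Both `clean` and `train` need data, and the default paths point to already-cleaned data, so to run the cleaning step again you must pass `--train-path` and `--test-path`.

Each subcommand has its own options, which you can view by appending `--help`: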
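The group-versus-subcommand layout can be sketched with click as follows. This is a minimal reconstruction based only on the help output, not the actual source of `cnn.py`: the function bodies and the use of `ctx.obj` to share the paths are assumptions.

```python
import click
from click.testing import CliRunner


@click.group()
@click.option("--train-path", default="data/train_data.txt")
@click.option("--test-path", default="data/test_data.txt")
@click.pass_context
def cli(ctx, train_path, test_path):
    """Group callback: runs first and stores the global options."""
    ctx.obj = {"train_path": train_path, "test_path": test_path}


@cli.command()
@click.pass_context
def clean(ctx):
    # Placeholder body; the real command cleans the raw text.
    click.echo("clean: {train_path} {test_path}".format(**ctx.obj))


@cli.command()
@click.pass_context
def train(ctx):
    # Placeholder body; the real command trains the model.
    click.echo("train: {train_path} {test_path}".format(**ctx.obj))


runner = CliRunner()
# Group options go before the subcommand name.
custom = runner.invoke(
    cli, ["--train-path", "train_shuffle.txt",
          "--test-path", "test_shuffle.txt", "clean"])
default = runner.invoke(cli, ["train"])
print(custom.output, end="")
print(default.output, end="")
```

In this layout, a group option written after the subcommand name (for example `clean --train-path x.txt`) is rejected by click as an unknown option, which is why the paths must come first.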

In [2]:
!python cnn.py  train --help

Usage: cnn.py train [OPTIONS]

Options:
  --vocab-size INTEGER
  --num-classes INTEGER
  --filter-num INTEGER
  --batch-size INTEGER
  --word-embed-size INTEGER
  --training-steps INTEGER
  --learning-rate FLOAT
  --print-loss-every INTEGER
  --confusion-matrix
  --help                      Show this message and exit.


Almost every parameter can be set on the command line, and each one has a default.

### 2. Data Cleaning

Data cleaning takes a while, roughly 2-3 minutes.

In [6]:
!python cnn.py --train-path train_shuffle.txt --test-path test_shuffle.txt clean

cleaning...
Done!


In [8]:
!ls data/

cleared_data.pkl   stop_words_chinese.txt  train_data.txt
reversed_dict.pkl  test_data.txt	   word_dict.pkl


In [3]:
!head -n 5 data/test_data.txt

3525 0 406 237 144 0 854 495 475 326 180 145 0 0 0 0 0 0 0 0 1
141 3479 1310 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
6429 1093 389 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 138 312 316 718 1898 2164 1013 353 1691 2383 779 1 3 316 1792 697 0 4781 1
41 247 134 65 34 48 14 443 16 2695 3260 41 667 16 325 1612 41 1287 0 782 1


In [4]:
!awk '{print "Column counts: " NF; exit}' data/test_data.txt

Column counts: 21


The last column is the sentiment label.

### 3. Training
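A row of this file can be split into its inputs and label as sketched below. The interpretation of the first 20 columns as word ids, with `0` used as padding, is an assumption inferred from the sample rows; `parse_row` is a hypothetical helper, not a function from `text_helpers`.

```python
def parse_row(line, seq_len=20):
    """Split one whitespace-separated row into (token_ids, label)."""
    fields = [int(x) for x in line.split()]
    if len(fields) != seq_len + 1:
        raise ValueError(
            "expected %d columns, got %d" % (seq_len + 1, len(fields)))
    # Everything except the last column is treated as the input sequence.
    return fields[:seq_len], fields[-1]


# Second sample row shown above: 4 ids, 16 zeros, then the label.
row = "141 3479 1310 11 " + "0 " * 16 + "1"
ids, label = parse_row(row)
print(len(ids), label)
```

Checking the column count up front makes a malformed row fail loudly instead of silently shifting the label.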

In [3]:
%%bash
export TF_CPP_MIN_LOG_LEVEL=2  
python cnn.py train

After 0 training steps, cross entropy on batch data is 0.827254, trian accuracy is 0.47, test accuracy is 0.47
After 2 training steps, cross entropy on batch data is 0.945222, trian accuracy is 0.47, test accuracy is 0.47
After 4 training steps, cross entropy on batch data is 0.814763, trian accuracy is 0.47, test accuracy is 0.47
After 6 training steps, cross entropy on batch data is 0.686127, trian accuracy is 0.48, test accuracy is 0.48
After 8 training steps, cross entropy on batch data is 0.744843, trian accuracy is 0.48, test accuracy is 0.48


In [5]:
%%bash
export TF_CPP_MIN_LOG_LEVEL=2 # hide the warnings printed at startup
python cnn.py train --training-steps 10001 --print-loss-every 1000

After 0 training steps, cross entropy on batch data is 1.332271, trian accuracy is 0.47, test accuracy is 0.47
After 1000 training steps, cross entropy on batch data is 0.574108, trian accuracy is 0.63, test accuracy is 0.58
After 2000 training steps, cross entropy on batch data is 0.419571, trian accuracy is 0.72, test accuracy is 0.65
After 3000 training steps, cross entropy on batch data is 0.341926, trian accuracy is 0.82, test accuracy is 0.73
After 4000 training steps, cross entropy on batch data is 0.188271, trian accuracy is 0.89, test accuracy is 0.79
After 5000 training steps, cross entropy on batch data is 0.103248, trian accuracy is 0.94, test accuracy is 0.84
After 6000 training steps, cross entropy on batch data is 0.122121, trian accuracy is 0.97, test accuracy is 0.87
After 7000 training steps, cross entropy on batch data is 0.035493, trian accuracy is 0.98, test accuracy is 0.89
After 8000 training steps, cross entropy on batch data is 0.089365, trian accuracy is 0.99,

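`--confusion-matrix` is a flag: when it is passed, the confusion matrix is printed, and the plot is displayed when the script runs in a local environment.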
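For readers unfamiliar with the term, here is a dependency-free sketch of what a confusion matrix contains. The script itself imports `confusion_matrix` from `sklearn.metrics` and `plot_confusion_matrix` from `mlxtend.plotting`; the function below is only a stand-in, and the labels and the two-class setting are made up for illustration.

```python
def confusion_counts(y_true, y_pred, num_classes=2):
    """Count predictions: rows are true labels, columns are predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up labels, not taken from the real data set.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

cm = confusion_counts(y_true, y_pred)
for row in cm:
    print(row)

# Accuracy is the diagonal (correct predictions) over the total count.
accuracy = sum(cm[i][i] for i in range(len(cm))) / len(y_true)
print(round(accuracy, 2))
```

The off-diagonal cells show which class is mistaken for which, information that a single accuracy number hides.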

In [6]:
%matplotlib inline

In [7]:
%%bash
export TF_CPP_MIN_LOG_LEVEL=2
python cnn.py train --training-steps 5001 --print-loss-every 500 --confusion-matrix

After 0 training steps, cross entropy on batch data is 0.749080, trian accuracy is 0.47, test accuracy is 0.48
After 500 training steps, cross entropy on batch data is 0.640280, trian accuracy is 0.59, test accuracy is 0.56
After 1000 training steps, cross entropy on batch data is 0.577666, trian accuracy is 0.63, test accuracy is 0.59
After 1500 training steps, cross entropy on batch data is 0.532600, trian accuracy is 0.69, test accuracy is 0.63
After 2000 training steps, cross entropy on batch data is 0.413534, trian accuracy is 0.79, test accuracy is 0.71
After 2500 training steps, cross entropy on batch data is 0.433147, trian accuracy is 0.81, test accuracy is 0.72
After 3000 training steps, cross entropy on batch data is 0.402461, trian accuracy is 0.83, test accuracy is 0.74
After 3500 training steps, cross entropy on batch data is 0.309006, trian accuracy is 0.88, test accuracy is 0.78
After 4000 training steps, cross entropy on batch data is 0.259623, trian accuracy is 0.91, 

### 4. Validation

TODO: add a command for this step.