# CNN

The real power of language lies not only in the meaning of each individual word but, more importantly, in the order and combination of words. The connections between words create depth, information, and complexity.

Meaning is hidden beneath the words.

Below we look at how to extract **latent semantic information** (meaning, emotion, etc.) from a sequence of words.

The difference between human language and machine-generated language lies in tone and flow.

## 1. Learning Meaning

The meaning of a word depends largely on its relationships with the surrounding words. These relationships include:

1. word order
2. word proximity
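
To see why word order matters, here is a minimal sketch. The two sentences are made-up examples, not taken from the dataset. An order-insensitive bag-of-words count cannot tell them apart, while the ordered token sequences differ.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
sentence_a = "the dog chased the cat"
sentence_b = "the cat chased the dog"

tokens_a = sentence_a.split()
tokens_b = sentence_b.split()

# A bag of words ignores order, so both sentences look identical.
same_bag = Counter(tokens_a) == Counter(tokens_b)

# The ordered sequences are different, and so is the meaning.
same_sequence = tokens_a == tokens_b

print(same_bag, same_sequence)
```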

There are two ways to model these relationships:

1. spatially: as in writing
    - viewed through a fixed-width window
2. temporally: as in speech, i.e., time series data
    - can extend for an unknown amount of time
    
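Traditional neural nets (e.g., fully connected neural nets) are good at extracting patterns from data, but they do not capture the relations between tokens. Next we introduce neural networks that can capture these relations: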
- CNN
- RNN

<img src="img/fully_connected_NN.png" alt="drawing" width="500"/>



## 2. CNN Review

**Convolution**: the concept of sliding (or convolving) a small window over the data sample.

#### kernel (filter):
- 3*3
- random initialization with small numbers (close to 0)
- can have n filters -- output n "images"

#### stride (step size):
- smaller than the filter size, so that consecutive snapshots overlap
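
The sliding window, kernel, and stride can be illustrated with a minimal pure-Python sketch on a 1-D signal. The signal and kernel values are made up, and `convolve_1d` is a hypothetical helper, not a library function. Like deep-learning frameworks, it does not flip the kernel.

```python
def convolve_1d(signal, kernel, stride=1):
    """Slide `kernel` over `signal` and return one weighted sum per window."""
    width = len(kernel)
    outputs = []
    for start in range(0, len(signal) - width + 1, stride):
        window = signal[start:start + width]
        outputs.append(sum(x * w for x, w in zip(window, kernel)))
    return outputs


signal = [1, 2, 3, 4, 5, 6]   # toy input, not real data
kernel = [1, 0, -1]           # hypothetical filter weights

print(convolve_1d(signal, kernel, stride=1))  # 4 overlapping windows
print(convolve_1d(signal, kernel, stride=3))  # 2 non-overlapping windows
```

With a stride equal to the kernel width the windows no longer overlap, so patterns that straddle a window boundary can be missed.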

#### padding

adding enough data to the input’s outer edges so that the first real data point is treated just as the innermost data points are.

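- why padding?
    - without padding, the output size differs from the input size
    - the edges of the image are undersampled, because pixels at the border are covered by fewer filter positions (a corner pixel by only one)
        - for short texts such as tweets, undersampling has a large impact
        
- padding methods
    - valid: no padding
    - same: pad the edges (Keras pads with zeros) so that the output has the same size as the input
    - guess at what the padding should be: works for images, not for text

- potential problems with padding:
    - adding potentially unrelated data to the input, which in itself can skew the outcome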
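
The effect of the two padding modes on output size can be sketched as follows. Both helpers are hypothetical names written for illustration. The formulas are the usual ones for `valid` and `same` padding, and the zero fill is an assumption that matches what Keras does.

```python
import math


def conv_output_length(input_length, kernel_size, stride=1, padding="valid"):
    """Number of filter positions for a 1-D convolution."""
    if padding == "valid":
        return (input_length - kernel_size) // stride + 1
    if padding == "same":
        return math.ceil(input_length / stride)
    raise ValueError("padding must be 'valid' or 'same'")


def pad_same(signal, kernel_size, pad_value=0):
    """Pad both ends so a stride-1 convolution keeps the input length."""
    total = kernel_size - 1
    left = total // 2
    right = total - left
    return [pad_value] * left + list(signal) + [pad_value] * right


print(conv_output_length(400, 3, padding="valid"))  # shorter than the input
print(conv_output_length(400, 3, padding="same"))   # same as the input
print(pad_same([1, 2, 3], kernel_size=3))
```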
    
#### pipeline
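- multiple conv layers can be stacked
- the output of the last conv layer is flattened into a single vector and fed into a fully connected layer, which is then used for classification and similar tasks.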

#### filter composition
Each snapshot is independent of all the others, so the computation can take full advantage of the CPU's parallelism.

## 3. CNN for text

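We use a 1-dimensional CNN to classify the sentiment of the IMDB movie review dataset.
- each negative review is labeled 0
- each positive review is labeled 1

Download the dataset: https://ai.stanford.edu/%7eamaas/data/sentiment/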

In [8]:
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Conv1D, GlobalMaxPooling1D

### step 1: load the IMDB data

In [9]:
imdb_datasets = '/Users/chenwang/Workspace/datasets/IMDB/aclImdb/train'

In [10]:
import glob
import os

from random import shuffle

def pre_process_data(filepath):
    """
    This is dependent on your training data source but we will try to generalize it as best as possible.
    """
    positive_path = os.path.join(filepath, 'pos')
    negative_path = os.path.join(filepath, 'neg')
    
    pos_label = 1
    neg_label = 0
    
    dataset = []
    
    for filename in glob.glob(os.path.join(positive_path, '*.txt')):
        with open(filename, 'r') as f:
            dataset.append((pos_label, f.read()))
            
    for filename in glob.glob(os.path.join(negative_path, '*.txt')):
        with open(filename, 'r') as f:
            dataset.append((neg_label, f.read()))
    
    shuffle(dataset)
    
    return dataset



In [11]:
dataset = pre_process_data(imdb_datasets)


In [12]:
len(dataset)  # 25000 reviews in total

25000

In [13]:
dataset[0]  # the first review

(1,
 'The anime that got me hooked on anime...<br /><br />Set in the year 2010 (hey, that\'s not too far away now!) the Earth is now poison gas wasteland of pollution and violence. Seeing as how crimes are happening ever 30 seconds are so and committed by thieves who have the fire power of third world terrorists, the government of the fictional New Port City form the Tank Police to deal with the problem - cops with tanks! Oh the insanity!<br /><br />The "heroes" of this series include the new recruit Leona Ozaki, a red haired Japanese woman (yeah I know, they never match their distinctly Japanese names with a Japanese appearance) who has just been drafted into the Tank Police and is quickly partnered with blond, blue eyed nice guy Al. Leona is new at using tanks and unfortunately she destroys the favorite tank of Tank Police Commander Charles Britain (also known as "Brenten"), a big guy who looks like Tom Selleck on steroids and sporting a pair of nifty sunglasses, a big revolver and a

### step 2: tokenization

In [14]:
from nltk.tokenize import TreebankWordTokenizer
from gensim.models.keyedvectors import KeyedVectors

In [15]:
word_vectors = KeyedVectors.load_word2vec_format('/Users/chenwang/Workspace/datasets/GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)


In [17]:
type(word_vectors)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [20]:
def tokenize_and_vectorize(dataset):
    tokenizer = TreebankWordTokenizer()
    vectorized_data = []
    for sample in dataset:
        # each sample is a 2-tuple: sample[0] is the label, sample[1] is the review text
        tokens = tokenizer.tokenize(sample[1])
        
        sample_vecs = []
        for token in tokens:
            try:
                sample_vecs.append(word_vectors[token])

            except KeyError:
                pass  # No matching token in the Google w2v vocab
        
        # list of list    
        vectorized_data.append(sample_vecs)

    return vectorized_data

In [21]:
def collect_expected(dataset):
    """ Peel off the target values from the dataset """
    expected = []
    for sample in dataset:
        expected.append(sample[0])
    return expected

In [22]:
vectorized_data = tokenize_and_vectorize(dataset)
expected = collect_expected(dataset)

In [23]:
len(vectorized_data)

25000

In [24]:
len(expected)

25000

### step 3. split training and testing sets

The data has already been shuffled, so we simply take the first 80% of the elements as the training set and the rest as the testing set.

In [25]:
split_point = int(len(vectorized_data)*.8)

x_train = vectorized_data[:split_point]
y_train = expected[:split_point]
x_test = vectorized_data[split_point:]
y_test = expected[split_point:]

### step 4. set up CNN parameters

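A CNN requires all inputs to have the same dimensions. We therefore use the `maxlen` parameter: reviews that are too long are truncated, and short reviews are padded. The padding can be Null or 0; either way it means "ignore me".

⚠️ This padding is different from the padding discussed in the CNN review above:
- padding in a CNN prevents the edges of the input from being undersampled
- the padding here keeps all input samples the same size

`kernel_size = 3` means looking at 3-grams of your input text.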
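
As a rough illustration of what "looking at 3-grams" means, the sketch below lists the token windows a filter of width 3 would see with stride 1 and no padding. The sentence and the helper name `sliding_windows` are made up for this example.

```python
def sliding_windows(tokens, kernel_size=3):
    """Return every run of `kernel_size` consecutive tokens."""
    return [tokens[i:i + kernel_size]
            for i in range(len(tokens) - kernel_size + 1)]


tokens = "the movie was surprisingly good".split()  # made-up example sentence
for window in sliding_windows(tokens, kernel_size=3):
    print(window)
```

A sequence of n tokens yields n - kernel_size + 1 windows, which is why a 400-token review produces 398 filter positions with `kernel_size = 3`.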

In [26]:
maxlen = 400            # max review length
batch_size = 32         # How many samples to show the net before backpropagating the error and updating the weights
embedding_dims = 300    # Length of the token vectors we will create for passing into the Convnet
filters = 250           # Number of filters we will train
kernel_size = 3         # The width of the filters; each filter is a matrix of weights of size embedding_dims x kernel_size, or 300 x 3 in our case
hidden_dims = 250       # Number of neurons in the plain feed forward net at the end of the chain
epochs = 2              # Number of times we will pass the entire training dataset through the network

### step 5: preprocessing: padding & truncation

The function below could be replaced with a one-line list comprehension:

```python

[smp[:maxlen] + [[0.] * emb_dim] * (maxlen - len(smp)) for smp in data]

```
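
To check that the one-liner behaves as intended, here is a small self-contained sketch on toy data. The wrapper name `pad_trunc_short` and the 2-dimensional vectors are assumptions made for illustration; `emb_dim` plays the role of `embedding_dims`.

```python
def pad_trunc_short(data, maxlen, emb_dim):
    """Truncate each sample to maxlen vectors, or pad it with zero vectors."""
    return [smp[:maxlen] + [[0.] * emb_dim] * (maxlen - len(smp)) for smp in data]


# Toy data: two "reviews" of 2-dimensional token vectors (values are made up).
toy = [
    [[1., 1.], [2., 2.], [3., 3.], [4., 4.]],  # longer than maxlen: truncated
    [[5., 5.]],                                # shorter than maxlen: padded
]
padded = pad_trunc_short(toy, maxlen=3, emb_dim=2)
print([len(sample) for sample in padded])
print(padded[1])
```

For a sample longer than `maxlen`, the multiplier `maxlen - len(smp)` is negative, which yields an empty list, so nothing is appended after truncation.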

In [27]:
def pad_trunc(data, maxlen):
    """ For a given dataset pad with zero vectors or truncate to maxlen """
    new_data = []

    # Create a vector of 0's the length of our word vectors;
    # data[0] holds the word vectors of all tokens in the first review
    zero_vector = [0.0] * len(data[0][0])

    for sample in data:
        if len(sample) > maxlen:
            temp = sample[:maxlen]
        else:
            # Copy before padding so the caller's list is not modified
            temp = list(sample)
            temp.extend([zero_vector] * (maxlen - len(sample)))
        new_data.append(temp)
    return new_data

In [29]:
len(x_train[0])

446

In [30]:
len(x_train[1])

118

In [31]:
x_train = pad_trunc(x_train, maxlen)
x_test = pad_trunc(x_test, maxlen)

In [32]:
len(x_train[0])

400

In [33]:
len(x_train[1])

400

In [35]:
x_train[1][200]  # inspect an all-zero padding vector: index 200 of the second review

[0.0, 0.0, 0.0, ..., 0.0]

Next, convert the data to numpy arrays to use as input for Keras.

In [37]:
x_train = np.reshape(x_train, (len(x_train), maxlen, embedding_dims))  # 20000 * 400 * 300
y_train = np.array(y_train)

x_test = np.reshape(x_test, (len(x_test), maxlen, embedding_dims))
y_test = np.array(y_test)

In [38]:
x_train.shape  # 20000 reviews, 400 tokens per review, each token a vector of length 300

(20000, 400, 300)

### step 6. CNN architecture

- `padding='valid'` means no padding is added, so the output is shorter than the input.
- `strides=1` moves the filter one step at a time.

In [39]:
print('Build model...')

# standard model definition pattern for keras
model = Sequential()

# we add a Conv1D layer, which will learn word-group filters of width kernel_size:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1,
                 input_shape=(maxlen, embedding_dims)))

Build model...


#### pooling

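Two methods for pooling:
- max
- average

We use max pooling, which lets the network see the most prominent feature of each subsection.

For `MaxPooling1D` the default pooling window size is 2. `GlobalMaxPooling1D`, used below, instead takes the maximum over the entire output of each filter.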

In [40]:
# we use max pooling:
model.add(GlobalMaxPooling1D())

Now, for every input review we have a 1D vector that represents that sample. This vector can be viewed as a semantic representation.
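
What global max pooling does to the convolution output can be sketched in a few lines of plain Python. The feature-map values are made up, and `global_max_pooling_1d` is a hypothetical stand-in for the Keras layer, not its implementation.

```python
def global_max_pooling_1d(feature_maps):
    """feature_maps: list of positions, each a list with one value per filter.

    Returns one value per filter: its maximum over all positions.
    """
    return [max(column) for column in zip(*feature_maps)]


# Toy conv output: 4 positions x 2 filters (values are made up).
feature_maps = [
    [0.1, 0.0],
    [0.7, 0.2],
    [0.3, 0.9],
    [0.0, 0.4],
]
print(global_max_pooling_1d(feature_maps))  # one number per filter
```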

#### dropout

goal: avoid overfitting

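Dropout randomly "turns off" some neurons during training, which reduces the number of active parameters and helps avoid overfitting. A neuron that is turned off outputs 0, so it contributes nothing to the cost function, and its weights are not updated during backpropagation for that pass.

Because some neurons are turned off, the overall signal strength in the network decreases, so Keras automatically scales up the outputs of the remaining neurons in proportion.

⚠️ Dropout is not applied during inference/prediction.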
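
The mechanics can be sketched in plain Python. This is an illustrative version of "inverted" dropout, in which kept units are scaled by 1 / (1 - rate); the function name and the activation values are assumptions, and this is not the Keras implementation itself.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # inference: values pass through unchanged
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale
            for a in activations]


rng = random.Random(0)       # fixed seed so the sketch is repeatable
activations = [1.0] * 10     # made-up layer outputs
dropped = dropout(activations, rate=0.2, rng=rng)

print(dropped)                                          # zeros and scaled values
print(dropout(activations, rate=0.2, training=False))   # unchanged at inference
```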

In [41]:
# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

#### output layer

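This is the actual classifier. Here we use a sigmoid function.

For a multi-class problem, the final output layer could be defined as follows (a sketch: `num_classes` is a placeholder for the number of classes, and softmax is the usual choice when the classes are mutually exclusive).

```python
model.add(Dense(num_classes))
model.add(Activation('softmax'))
```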

In [42]:
# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))


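#### Network structure summary (dimensions of each layer)

- build the doc matrix:
    - tokenize each document, pad those with fewer than 400 tokens, and truncate those with more than 400 tokens, so every document ends up with exactly 400 tokens (note that the figure below does not show the padding step)
    - embed each token (or vectorize it some other way), so each document becomes an embedding_dim * maxlen matrix; in the figure below each embedding is a 6-dimensional vector, so the doc matrix is 6 * 9
- build the filter matrix: embedding_dim * kernel_size, e.g., 6 * 3 in the figure below
- convolve the filter matrix with the document matrix to produce a smaller vector; in the figure below the result is a 1 * 7 vector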

<img src="img/cnn_nlp.png" alt="drawing" width="500"/>

- global max pooling: take the maximum of this vector, which yields a single number, as shown below:

<img src="img/1d_max_pooling.png" alt="drawing" width="400"/>

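- "thought vector": we have `filters` filters, and each one produces a single number through the steps above, so we end up with a vector of length `filters`; with 250 filters we get a 1 * 250 vector.

- fully connected network (dropout can be used)
    - the input size is the length of the thought vector, i.e., the number of filters
    - the output is hidden_dims neurons
    - uses dropout(0.2)
    
- output layer
    - the input size is hidden_dims
    - the output is a single value between 0 and 1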
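
The dimensions above can be traced with a small pure-Python sketch that uses the toy sizes from the figure (6-dimensional embeddings, 9 tokens, kernel size 3). The random numbers stand in for word vectors and untrained filter weights, the bias term is omitted, and `num_filters = 4` is an arbitrary choice for illustration.

```python
import random


def conv1d_single_filter(doc, kernel):
    """doc: list of token vectors; kernel: list of kernel_size weight vectors.

    Returns one ReLU activation per window ('valid' padding, stride 1).
    """
    kernel_size = len(kernel)
    outputs = []
    for start in range(len(doc) - kernel_size + 1):
        total = 0.0
        for token_vec, weight_vec in zip(doc[start:start + kernel_size], kernel):
            total += sum(x * w for x, w in zip(token_vec, weight_vec))
        outputs.append(max(0.0, total))  # ReLU
    return outputs


rng = random.Random(0)
embedding_dim, num_tokens, kernel_size, num_filters = 6, 9, 3, 4  # toy sizes

doc = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
       for _ in range(num_tokens)]
kernels = [[[rng.uniform(-1, 1) for _ in range(embedding_dim)]
            for _ in range(kernel_size)]
           for _ in range(num_filters)]

feature_maps = [conv1d_single_filter(doc, k) for k in kernels]
thought_vector = [max(fm) for fm in feature_maps]  # global max pooling

print(len(feature_maps[0]))  # 9 - 3 + 1 positions per filter
print(len(thought_vector))   # one value per filter
```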


### step 7: compile model

- loss function
    - binary_crossentropy: 1 output neuron, target is 0 or 1
    - categorical_crossentropy: n output neurons, target is a one-hot vector
- optimizer: algorithms to minimize the cost function
    - Stochastic gradient descent
    - Adam
    - RMSprop

- metrics
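
To make the loss concrete, here is a minimal sketch of binary cross-entropy in plain Python. The labels and predictions are made up, and the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Made-up labels and predictions, for illustration only.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident correct predictions give a small loss, while confident wrong predictions are penalized heavily.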

In [43]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

### step 8: train

- `compile` - build model
- `fit` - train model

In [44]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1a61c136a0>

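Judging from the results above, the model is overfitting slightly: training acc is 0.9077 and validation acc is 0.8838.

The overfitting is not severe, however, because both training acc and validation acc are still rising. If training acc keeps improving while validation acc drops, that is a strong sign of overfitting.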
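
The rule of thumb above can be written as a small check over per-epoch accuracies. This is a sketch with hypothetical numbers, not results from this notebook. In Keras, the `History` object returned by `fit` stores per-epoch metrics in its `history` dict, although the exact key names depend on the Keras version.

```python
def overfitting_signal(train_acc, val_acc):
    """Return True if, over the last epoch, training accuracy rose
    while validation accuracy fell. Both arguments are per-epoch lists."""
    if len(train_acc) < 2 or len(val_acc) < 2:
        return False
    return train_acc[-1] > train_acc[-2] and val_acc[-1] < val_acc[-2]


# Hypothetical per-epoch accuracies, not results from this notebook.
print(overfitting_signal([0.80, 0.90], [0.85, 0.88]))  # both rising
print(overfitting_signal([0.90, 0.95], [0.88, 0.87]))  # validation falling
```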

If the model is still in memory, we can train it for one more epoch by calling `fit` again. If it is not, we can load it from disk (step 11).

In [46]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=1,
          validation_data=(x_test, y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/1


<keras.callbacks.History at 0x1a45ce0278>

After training for one more epoch we can see overfitting: training acc went up while validation acc went down.

Next we look at how to use the model for prediction.


### step 9. prediction

In [47]:
sample_1 = "I'm hate that the dismal weather that had me down for so long, when will it break! Ugh, when does happiness return?  The sun is blinding and the puffy clouds are too thin.  I can't wait for the weekend."


In [48]:
# We pass a dummy value in the first element of the tuple 
# just because our helper expects it from the way it processed the initial data.  
# That value won't ever see the network, so it can be whatever.
vec_list = tokenize_and_vectorize([(1, sample_1)])

# Tokenize returns a list of the data (length 1 here)
test_vec_list = pad_trunc(vec_list, maxlen)

test_vec = np.reshape(test_vec_list, (len(test_vec_list), maxlen, embedding_dims))


- `predict` returns the raw value, i.e., the output of the sigmoid in the output layer
- `predict_classes` returns the class label

In [52]:
model.predict(test_vec)


array([[0.01506066]], dtype=float32)

In [53]:
model.predict_classes(test_vec)  # 0 is negative comments

array([[0]], dtype=int32)
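
The relation between the two outputs can be sketched as a simple threshold on the sigmoid value. The helper `to_class_labels` is hypothetical, and the 0.5 cut-off is the usual convention for a single sigmoid output, stated here as an assumption.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to 0/1 labels using a fixed cut-off."""
    return [[int(p > threshold) for p in row] for row in probabilities]


raw_output = [[0.01506066]]          # the value returned by predict above
print(to_class_labels(raw_output))   # 0 means a negative review
```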

### step 10: save the model



In [45]:
model_structure = model.to_json()
with open("cnn_model.json", "w") as json_file:
    json_file.write(model_structure)

model.save_weights("cnn_weights.h5")
print('Model saved.')

Model saved.


To get the same results on every run, set a random seed beforehand.

Add the following two lines before the model definition.

```python
import numpy as np
np.random.seed(1337)
```
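
Note that the data-loading step shuffles with Python's `random.shuffle`, which `np.random.seed` does not control. A minimal sketch, assuming you also want the shuffle order to be repeatable (the helper name and placeholder items are made up):

```python
import random


def seeded_shuffle(items, seed=1337):
    """Return a shuffled copy whose order is the same on every run."""
    rng = random.Random(seed)   # private generator; global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


reviews = ["review_%d" % i for i in range(5)]  # placeholder items
print(seeded_shuffle(reviews) == seeded_shuffle(reviews))
```

Since `glob.glob` does not guarantee a particular file order, sorting the file list before shuffling may also be needed for full repeatability.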

### step 11: load a saved model


In [51]:
from keras.models import model_from_json

with open("cnn_model.json", "r") as json_file:
    json_string = json_file.read()
model = model_from_json(json_string)

model.load_weights('cnn_weights.h5')

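## 4. Summary

#### Why use a CNN?
Efficiency: a CNN discards a lot of information along the way (for example through dropout), yet it remains effective and is comparatively cheap to train.

#### How to improve?
- stack multiple Conv1D layers
- use filters of different lengths, then concatenate the results into one longer vector (thought vector) before passing it to the output layer.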
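
The second idea can be sketched in plain Python: each filter is convolved and max-pooled on its own, and the pooled values from all filter widths are concatenated. The document vectors, the all-ones kernels, and the helper names are assumptions for illustration. In Keras this would more likely be built with several `Conv1D` branches merged by a concatenation layer.

```python
def conv_max(doc, kernel):
    """Valid, stride-1 convolution of one filter, then global max pooling."""
    k = len(kernel)
    scores = []
    for start in range(len(doc) - k + 1):
        window = doc[start:start + k]
        scores.append(sum(x * w
                          for vec, wvec in zip(window, kernel)
                          for x, w in zip(vec, wvec)))
    return max(scores)


def multi_width_features(doc, filter_bank):
    """filter_bank maps kernel_size -> list of kernels of that width.

    Returns the pooled outputs of all filters as one concatenated vector.
    """
    features = []
    for kernel_size in sorted(filter_bank):
        features.extend(conv_max(doc, kernel)
                        for kernel in filter_bank[kernel_size])
    return features


def ones_kernel(width, emb_dim=2):
    """Hypothetical kernel whose weights are all 1.0."""
    return [[1.0] * emb_dim for _ in range(width)]


# Toy document: 5 tokens with 2-dimensional vectors (made-up values).
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
filter_bank = {2: [ones_kernel(2)], 3: [ones_kernel(3), ones_kernel(3)]}

print(multi_width_features(doc, filter_bank))  # 1 + 2 = 3 features
```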