# 7. CNN - Convolutional Neural Networks for Sentence Classification
Yoon Kim's paper introduced a simple, lightweight CNN architecture for sentence classification. When you use a pretrained embedding, the model has fewer than 1 million parameters (361,502 in the run below), yet it is still powerful. In this notebook, you can train the model on several datasets.
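
As a rough check on the parameter claim, the sketch below counts the weights of a Kim-style CNN, assuming the hyperparameters from the paper (300-dimensional embeddings, filter widths 3, 4 and 5 with 100 feature maps each) and two output classes. The function name is hypothetical and the embedding table is excluded because it is pretrained; the layers in `models.CNN` may differ slightly, so the result will not necessarily match the size printed in the training log.

```python
def kim_cnn_param_count(embed_dim=300, filter_widths=(3, 4, 5),
                        num_filters=100, num_classes=2):
    """Count weights and biases, excluding the pretrained embedding table."""
    # Each convolution has width * embed_dim * num_filters weights plus one bias per filter.
    conv = sum(w * embed_dim * num_filters + num_filters for w in filter_widths)
    # The softmax layer reads one max-pooled value per filter.
    dense = len(filter_widths) * num_filters * num_classes + num_classes
    return conv + dense


print(kim_cnn_param_count())  # 360902
```

Under these assumptions the count is about 0.36 million, which is in the same range as the model size reported later in this notebook.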

### References
- [A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification - Zhang et al. 2015](https://arxiv.org/pdf/1510.03820.pdf)
- [CS224n: Natural Language Processing with Deep Learning - Lecture 12](http://web.stanford.edu/class/cs224n/lectures/lecture12.pdf)
- [yoonkim/CNN_sentence](https://github.com/yoonkim/CNN_sentence)
- [harvardnlp/sent-conv-torch](https://github.com/harvardnlp/sent-conv-torch)

## Data Preprocessing
The preprocessing code is borrowed from [harvardnlp/sent-conv-torch](https://github.com/harvardnlp/sent-conv-torch).

It is getting harder and harder to preprocess data inside the model class, so we do as much preprocessing as possible before calling the `fit_to_corpus()` method.

Choose one of the following datasets: `MR/SST1/SST2/Subj/TREC/CR/MPQA`.

In [1]:
import data.sentiment_datasets.preprocess as preprocess
from models import CNN

import random
import numpy as np

In [2]:
random.seed(1004)

In [3]:
w2v, train, train_label, test, test_label, dev, dev_label, word_to_idx = preprocess.build_dataset("SST2")

SST2.pkl exists! loading from pkl..


In [4]:
def train_test_dev_split(train, test, dev, train_label, test_label, dev_label):
    """Create any missing test/dev splits by carving them out of the training set."""
    # Count how many of the three splits are non-empty.
    cnt = sum(1 for data in [train, test, dev] if len(data) != 0)

    if cnt == 0:
        raise ValueError("train, test and dev are all empty")

    elif cnt == 1:  # assumes only the train set is provided: split it 80/10/10.
        train_set = list(zip(train, train_label))
        random.shuffle(train_set)
        idx1 = int(len(train_set) * 0.8)
        idx2 = int(len(train_set) * 0.9)
        train_set, test_set, dev_set = train_set[:idx1], train_set[idx1:idx2], train_set[idx2:]
        train, train_label = list(zip(*train_set))
        test, test_label = list(zip(*test_set))
        dev, dev_label = list(zip(*dev_set))

    elif cnt == 2:  # assumes train/test sets are provided: hold out 10% of train as dev.
        train_set = list(zip(train, train_label))
        random.shuffle(train_set)
        idx1 = int(len(train_set) * 0.9)
        train_set, dev_set = train_set[:idx1], train_set[idx1:]
        train, train_label = list(zip(*train_set))
        dev, dev_label = list(zip(*dev_set))

    # cnt == 3: train/test/dev sets are all provided, so nothing needs splitting.

    return np.array(train), np.array(test), np.array(dev), \
           np.array(train_label), np.array(test_label), np.array(dev_label)
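
To make the split proportions concrete, here is a small standalone sketch of the "only a train set is provided" branch, run on toy data. The helper name `split_80_10_10` and the toy samples are made up for illustration; it uses its own `random.Random` instance instead of the global seed set above, so the shuffle order differs from the notebook's.

```python
import random


def split_80_10_10(samples, labels, seed=1004):
    """Shuffle (sample, label) pairs together, then cut them 80/10/10."""
    pairs = list(zip(samples, labels))
    rng = random.Random(seed)  # local generator, so global random state is untouched
    rng.shuffle(pairs)
    idx1 = int(len(pairs) * 0.8)
    idx2 = int(len(pairs) * 0.9)
    return pairs[:idx1], pairs[idx1:idx2], pairs[idx2:]


# Toy data: 100 two-token "sentences" whose label is the parity of the first token.
samples = [[i, i + 1] for i in range(100)]
labels = [i % 2 for i in range(100)]

train_set, test_set, dev_set = split_80_10_10(samples, labels)
print(len(train_set), len(test_set), len(dev_set))  # 80 10 10
```

Zipping samples with labels before shuffling is what keeps each sentence attached to its label after the split.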

In [5]:
train, test, dev, train_label, test_label, dev_label = \
    train_test_dev_split(train, test, dev, train_label, test_label, dev_label)

In [6]:
train_data = [train, train_label, dev, dev_label, w2v, word_to_idx]
test_data = [test, test_label]

## Training!

In [7]:
model = CNN.CNN(learning_rate=5e-4)

DEBUG: 04180000
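
For intuition about the kind of network this model class is meant to build, the following is a minimal NumPy sketch of the forward pass described in Kim's paper: parallel convolutions over word windows, ReLU, max-over-time pooling, and a softmax layer. It is not the code in the `models.CNN` module; the function names, toy dimensions, and random weights are assumptions for illustration, and dropout is omitted.

```python
import numpy as np


def conv_max_over_time(embedded, filters, bias):
    """embedded: (seq_len, embed_dim); filters: (width, embed_dim, num_filters).

    Assumes seq_len >= width; shorter sentences would need padding first.
    """
    width = filters.shape[0]
    seq_len = embedded.shape[0]
    # Stack every window of `width` consecutive word vectors: (positions, width, embed_dim).
    windows = np.stack([embedded[i:i + width] for i in range(seq_len - width + 1)])
    # Contract over the width and embedding axes: (positions, num_filters).
    feature_maps = np.tensordot(windows, filters, axes=([1, 2], [0, 1])) + bias
    feature_maps = np.maximum(feature_maps, 0.0)  # ReLU
    # Keep the strongest response of each filter over all positions.
    return feature_maps.max(axis=0)


def forward(embedded, conv_params, dense_w, dense_b):
    """Concatenate the pooled features of every filter width, then apply softmax."""
    pooled = np.concatenate([conv_max_over_time(embedded, f, b) for f, b in conv_params])
    logits = pooled @ dense_w + dense_b
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


rng = np.random.default_rng(0)
embed_dim, num_filters, num_classes = 8, 4, 2  # toy sizes, far smaller than the paper's
sentence = rng.normal(size=(10, embed_dim))    # 10 words, already embedded
conv_params = [(rng.normal(size=(w, embed_dim, num_filters)) * 0.1, np.zeros(num_filters))
               for w in (3, 4, 5)]
dense_w = rng.normal(size=(3 * num_filters, num_classes)) * 0.1
dense_b = np.zeros(num_classes)

probs = forward(sentence, conv_params, dense_w, dense_b)
print(probs.shape)  # (2,)
```

Because pooling takes a maximum over positions, the pooled vector has a fixed length regardless of sentence length, which is why a single dense layer can follow it.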


In [8]:
model.fit_to_corpus(train_data)



In [9]:
model.train(20, save_dir="save/07_cnn", log_dir="log/07_cnn", print_every=500)

--------------------------------------------------------------------------------
Created and Initialized fresh model. Size: 361502
--------------------------------------------------------------------------------
Epoch training time: 1.3180923461914062

Finished Epoch 1
train_loss = 0.60351326, train_accruacy = 0.66652174
valid_loss = 0.46909564, valid_accuracy = 0.81411764

Epoch training time: 0.8415021896362305

Finished Epoch 2
train_loss = 0.42120360, train_accruacy = 0.81043478
valid_loss = 0.41845854, valid_accuracy = 0.81764705

Epoch training time: 0.8786766529083252

Finished Epoch 3
train_loss = 0.33527417, train_accruacy = 0.86043478
valid_loss = 0.42911296, valid_accuracy = 0.81529411

Epoch training time: 0.8922944068908691

Finished Epoch 4
train_loss = 0.27726179, train_accruacy = 0.89362318
valid_loss = 0.39775787, valid_accuracy = 0.83529412

Epoch training time: 0.897108793258667

Finished Epoch 5
train_loss = 0.22240816, train_accruacy = 0.92492753
valid_loss = 0.397

In [10]:
model.test(test_data, load_dir="save/07_cnn")

INFO:tensorflow:Restoring parameters from save/07_cnn/epoch020_0.4196.model
--------------------------------------------------------------------------------
Restored model from checkpoint for testing. Size: 361502
--------------------------------------------------------------------------------
test loss = 0.35618254, test accuracy = 0.85055555
test samples: 001800, time elapsed: 0.0993, time per one batch: 0.0028
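
In the training log above, training accuracy keeps climbing while validation loss levels off around 0.40, so it can be useful to check which epoch validated best. The sketch below parses log text in the format printed by `model.train()` and picks the epoch with the lowest validation loss. It only includes the four epochs whose values are fully visible above, the regular expression is an assumption about the log format, and the `train_accruacy` spelling is kept exactly as the log prints it.

```python
import re

# The four complete epochs copied from the training output above.
log = """\
Finished Epoch 1
train_loss = 0.60351326, train_accruacy = 0.66652174
valid_loss = 0.46909564, valid_accuracy = 0.81411764
Finished Epoch 2
train_loss = 0.42120360, train_accruacy = 0.81043478
valid_loss = 0.41845854, valid_accuracy = 0.81764705
Finished Epoch 3
train_loss = 0.33527417, train_accruacy = 0.86043478
valid_loss = 0.42911296, valid_accuracy = 0.81529411
Finished Epoch 4
train_loss = 0.27726179, train_accruacy = 0.89362318
valid_loss = 0.39775787, valid_accuracy = 0.83529412
"""

# Capture the epoch number, skip the train line, then read the validation metrics.
pattern = re.compile(
    r"Finished Epoch (\d+)\n.*\nvalid_loss = ([\d.]+), valid_accuracy = ([\d.]+)"
)
epochs = [(int(e), float(loss), float(acc)) for e, loss, acc in pattern.findall(log)]

best_epoch, best_loss, best_acc = min(epochs, key=lambda row: row[1])
print(best_epoch, best_loss, best_acc)  # 4 0.39775787 0.83529412
```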
