- https://richliao.github.io/supervised/classification/2016/11/26/textclassifier-convolutional/

# Text Classification, Part I - Convolutional Networks
- data
    - [IMDB](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)
    - [GloVe](https://nlp.stanford.edu/projects/glove/)

## 1. Text classification using CNN
- Background: [Convolutional Neural Networks for Sentence Classification - Yoon Kim](http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf)

In [1]:
from bs4 import BeautifulSoup

import re
import sys
import os
import numpy as np
import pandas as pd
from nltk import tokenize

In [61]:
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding, Dropout
from keras.models import Model

In [3]:
def clean_str(string):
    """
    Minimal string cleaning for the dataset:
    remove backslashes and quotes, strip whitespace, and lowercase.
    """
    string = re.sub(r"\\", "", string)  # drop backslashes
    string = re.sub(r"\'", "", string)  # drop single quotes
    string = re.sub(r"\"", "", string)  # drop double quotes
    return string.strip().lower()

In [4]:
data_train = pd.read_csv('IMDB/labeledTrainData.tsv', sep='\t')
data_train.shape

(25000, 3)

In [5]:
data_train[:5]

|   | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [6]:
texts = []
labels = []
for idx in range(data_train.review.shape[0]):
    text = BeautifulSoup(data_train.review[idx], "lxml")  # strip HTML tags; the parser is named explicitly
    texts.append(clean_str(text.get_text()))
    labels.append(data_train.sentiment[idx])
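
The loop above does two things per review: it drops the HTML markup, then normalizes the remaining text. A minimal stdlib-only sketch of the same two steps follows. `strip_tags` is a hypothetical stand-in for `BeautifulSoup(...).get_text()`, and the sample review is made up.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    # Hypothetical stand-in for BeautifulSoup(markup).get_text()
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def clean_str(string):
    # Same rules as the notebook: drop backslashes and quotes, then lowercase
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()


# Made-up review containing <br /> tags and escaped quotes
sample = r'''I've seen it.<br /><br />\"Great\" movie!'''
cleaned = clean_str(strip_tags(sample))
print(cleaned)
```

Because the tags are removed without inserting a space, sentences on either side of a `<br />` end up glued together, which is the same effect visible in `texts[:1]` (for example `mkay.visually`).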





In [8]:
MAX_WORDS = 20000
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)

In [9]:
word2id = tokenizer.word_index
word2id

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'in': 7,
 'it': 8,
 'i': 9,
 'this': 10,
 'that': 11,
 'was': 12,
 'as': 13,
 'for': 14,
 'with': 15,
 'movie': 16,
 'but': 17,
 'film': 18,
 'on': 19,
 'not': 20,
 'you': 21,
 'are': 22,
 'his': 23,
 'have': 24,
 'be': 25,
 'he': 26,
 'one': 27,
 'its': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'who': 34,
 'so': 35,
 'from': 36,
 'like': 37,
 'her': 38,
 'or': 39,
 'just': 40,
 'about': 41,
 'out': 42,
 'if': 43,
 'has': 44,
 'some': 45,
 'there': 46,
 'what': 47,
 'good': 48,
 'more': 49,
 'when': 50,
 'very': 51,
 'up': 52,
 'time': 53,
 'no': 54,
 'she': 55,
 'even': 56,
 'my': 57,
 'would': 58,
 'which': 59,
 'story': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'their': 64,
 'were': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'than': 70,
 'we': 71,
 'much': 72,
 'bad': 73,
 'been': 74,
 'get': 75,
 'will': 76,
 'do': 77,
 'also': 78,
 'people': 79,
 'into': 80,
 'other': 81,
 'great': 82,


In [10]:
print('Found %s unique tokens.' % len(word2id))

Found 81501 unique tokens.
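
The index holds 81501 words even though `MAX_WORDS` is 20000, because the cap is applied when texts are converted to sequences, not when the index is built. The sketch below imitates that behavior in plain Python under simplifying assumptions: whitespace splitting only, and a cap of `index < num_words`, which is how Keras `Tokenizer` behaves to my understanding. `build_word_index` and `to_sequences` are hypothetical helper names.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {
        word: rank
        for rank, (word, _) in enumerate(counts.most_common(), start=1)
    }


def to_sequences(texts, word_index, num_words):
    """Map words to ids, silently dropping words ranked at or above num_words."""
    return [
        [word_index[w] for w in text.split() if word_index[w] < num_words]
        for text in texts
    ]


toy_texts = ["the movie was good", "the movie was bad", "the film"]
word_index = build_word_index(toy_texts)
sequences = to_sequences(toy_texts, word_index, num_words=4)

print(len(word_index))  # the full vocabulary is kept in the index
print(sequences)        # only the top-ranked words survive
```

Index 0 is never assigned to a word, which is why it can later serve as the padding value.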


In [11]:
sequences = tokenizer.texts_to_sequences(texts)
sequences[:3]

[[15, 29, 10, 537, 165, 177, 30, 1, 560, 15, 10066, 200, 644, 2622, 5,
  23, 223, 144, 1, 1025, 657, 128, 2, 46, 291, 1, 19364, 2, 291, 11563,
  169, 275, 9, 40, 180, 5, 75, 3, 807, 2623, 80, 10, 226, 34, 9,
  193, 12, 62, 637, 7, 1, 4262, 40, 5, 275, 93, 52, 57, 325, 724,
  26, 6, 2515, 39, 1346, 11563, 6, 168, 5050, 168, 778, 18, 59, 9, 373,
  165, 5, 63, 30, 1, 432, 50, 8, 12, 1819, 623, 45, 4, 8, 44,
  1297, 3445, 41, 545, 944, 1, 3528, 2, 78, 1, 577, 745, 4, 1653, 22,
  73, 2013, 1155, 17, 4, 259, 10, 6, 29, 41, 485, 1875, 35, 895, 21,
  2594, 37, 10066, 7, 554, 90, 21, 22, 165, 5, 781, 10, 2, 164, 8,
  355, 45, 198, 679, 10066, 32, 14, 5, 1, 227, 4, 10, 16, 17, 10066,
  2, 87, 4, 23, 444, 58,
  ...
  

In [12]:
texts[:1]

['with all this stuff going down at the moment with mj ive started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mjs feeling towards the press and also the obvious message of drugs are bad mkay.visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.the actual feature film bit when it finally starts is only on for 20 minu

In [76]:
MAX_SEQUENCE_LENGTH = 2000
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)  # pad or truncate to MAX_SEQUENCE_LENGTH; by default both happen at the front, so the last tokens are kept
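
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` left-pads short sequences with zeros and cuts long ones from the front. A minimal pure-Python sketch of that default behavior, where `pad_pre` is a hypothetical helper name rather than a Keras function:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` ids of each sequence."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                     # truncate at the front
        padded.append([value] * (maxlen - len(kept)) + kept)  # pad at the front
    return padded


toy = [[1, 2, 3, 4, 5], [1, 2]]
result = pad_pre(toy, maxlen=3)
print(result)
```

For reviews longer than `MAX_SEQUENCE_LENGTH`, this means the model sees the end of the review, not the beginning.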

In [78]:
labels = to_categorical(np.asarray(labels)) # label one-hot encoding
labels

array([[ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       ..., 
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.]])
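
The one-hot layout above puts a 1 in column `k` for class `k`, so sentiment 1 becomes `[0., 1.]` and sentiment 0 becomes `[1., 0.]`. A small pure-Python sketch of the same encoding, where `one_hot` is a hypothetical helper and the labels are the first five sentiments shown in `data_train[:5]`:

```python
def one_hot(labels, num_classes=None):
    """Return one row per label with a single 1.0 in the label's column."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [
        [1.0 if k == y else 0.0 for k in range(num_classes)]
        for y in labels
    ]


first_five = [1, 1, 0, 0, 1]  # sentiments of the first five rows
encoded = one_hot(first_five)
print(encoded)
```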

In [79]:
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (25000, 2000)
Shape of label tensor: (25000, 2)


In [87]:
indices = np.arange(data.shape[0])
indices[:20]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [88]:
np.random.shuffle(indices) # shuffle the indices in place

In [89]:
indices[:20]

array([10395, 13193, 15882,  8130,  3556, 24194,  2637, 18569, 23977,
        6745, 17221, 11317, 10614, 10988, 17612,  1431,  9784, 12498,
       24152, 20647])

In [90]:
data = data[indices] # reorder data and labels with the same shuffled indices
labels = labels[indices]

In [95]:
VALIDATION_SPLIT = 0.3
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

In [96]:
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

In [None]:
print('Number of positive and negative reviews in training and validation set')
print(y_train.sum(axis=0))
print(y_val.sum(axis=0))
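
The shuffle-then-slice split above can be sketched with the standard library alone. This is an illustration under assumptions: `shuffle_split` is a hypothetical helper, the seed is arbitrary, and it requires a validation size of at least one sample (a slice with `-0` would return an empty training set).

```python
import random


def shuffle_split(n_samples, validation_split, seed=0):
    """Shuffle row indices, then hold out the last fraction for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(validation_split * n_samples)
    if n_val < 1:
        raise ValueError("validation_split is too small for this dataset")
    return indices[:-n_val], indices[-n_val:]


train_idx, val_idx = shuffle_split(25000, 0.3)
print(len(train_idx), len(val_idx))
```

With 25000 reviews and a split of 0.3, this gives the 17500 / 7500 partition reported by `model.fit`.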

In [37]:
word2vec = {}
# Each line holds a word followed by its 100 whitespace-separated vector components.
with open(os.path.join('glove.6B', 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word2vec[word] = coefs
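
To see the parsing logic without downloading GloVe, the sketch below runs the same line-splitting on an in-memory file. The two lines and their values are made up and use 3 dimensions instead of 100; `load_vectors` is a hypothetical helper that returns plain lists instead of NumPy arrays.

```python
import io


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        values = line.split()
        if not values:  # skip blank lines
            continue
        vectors[values[0]] = [float(v) for v in values[1:]]
    return vectors


# Made-up stand-in for the GloVe text file
fake_file = io.StringIO("the 0.1 -0.2 0.3\nof -0.4 0.5 0.6\n")
vectors = load_vectors(fake_file)

print(vectors["of"])
print(vectors.get("ofdfdf"))  # unknown words yield None
```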

In [27]:
EMBEDDING_DIM = 100

In [57]:
# +1 because Tokenizer indices start at 1; row 0 is left for the padding value
embedding_matrix = np.random.random((len(word2id) + 1, EMBEDDING_DIM))
embedding_matrix # initialized with random values

array([[ 0.7631436 ,  0.81256759,  0.08622842, ...,  0.53005366,
         0.11106959,  0.34091621],
       [ 0.69605764,  0.98450165,  0.20503831, ...,  0.80643854,
         0.50662894,  0.41638633],
       [ 0.08451954,  0.35699544,  0.95098242, ...,  0.65124377,
         0.13085246,  0.92388928],
       ..., 
       [ 0.49933424,  0.49014808,  0.16895336, ...,  0.34977789,
         0.84806739,  0.86776523],
       [ 0.36604906,  0.13214829,  0.05296055, ...,  0.30474155,
         0.06733064,  0.65668264],
       [ 0.09151193,  0.61138537,  0.68975277, ...,  0.14721679,
         0.12856383,  0.62567053]])

In [58]:
for word, id_ in word2id.items():
    temp_vector = word2vec.get(word)
    if temp_vector is not None:
        embedding_matrix[id_] = temp_vector
    # words missing from word2vec keep their random initialization
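
It can be useful to know how many vocabulary words actually receive a pretrained vector. The sketch below repeats the fill loop in plain Python and counts the hits. Everything in it is illustrative: `build_embedding_matrix` is a hypothetical helper, and the toy vocabulary and 2-dimensional vectors are made up.

```python
import random


def build_embedding_matrix(word_index, vectors, dim, seed=0):
    """Random-initialize all rows, then overwrite rows that have a pretrained vector."""
    rng = random.Random(seed)
    # One extra row: indices start at 1, so row 0 stays unused by any word
    matrix = [[rng.random() for _ in range(dim)]
              for _ in range(len(word_index) + 1)]
    hits = 0
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = list(vec)
            hits += 1
    return matrix, hits


toy_index = {"the": 1, "movie": 2, "zzzunseen": 3}        # made-up vocabulary
toy_vectors = {"the": [0.1, 0.2], "movie": [0.3, 0.4]}    # made-up vectors
matrix, hits = build_embedding_matrix(toy_index, toy_vectors, dim=2)

print("rows:", len(matrix), "pretrained rows:", hits)
```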

In [59]:
word2vec.get('of'), type(word2vec.get('of'))

(array([-0.1529    , -0.24279   ,  0.89837003,  0.16996001,  0.53516001,
         0.48784   , -0.58825999, -0.17982   , -1.35810006,  0.42541   ,
         0.15377   ,  0.24214999,  0.13474   ,  0.41192999,  0.67043   ,
        -0.56418002,  0.42985001, -0.012183  , -0.11677   ,  0.31781   ,
         0.054177  , -0.054273  ,  0.35516   , -0.30241001,  0.31434   ,
        -0.33846   ,  0.71714997, -0.26855001, -0.15837   , -0.47466999,
         0.051581  , -0.33252001,  0.15003   , -0.12989999, -0.54617   ,
        -0.37843001,  0.64261001,  0.82187003, -0.080006  ,  0.078479  ,
        -0.96976   , -0.57740998,  0.56490999, -0.39873001, -0.057099  ,
         0.19743   ,  0.065706  , -0.48091999, -0.20125   , -0.40834001,
         0.39456001, -0.02642   , -0.11838   ,  1.01199996, -0.53171003,
        -2.74740005, -0.042981  , -0.74848998,  1.75740004,  0.59085   ,
         0.04885   ,  0.78267002,  0.38497001,  0.42096999,  0.67882001,
         0.10337   ,  0.63279998, -0.026595  ,  0.5

In [60]:
type(word2vec.get('ofdfdf')) # returns None for words that are not in word2vec

NoneType

## 1) A simplified convolutional network
- Three Conv1D layers, each with 128 filters of kernel size 5, followed by max pooling with pool sizes 5, 5, and 35

In [70]:
embedding_layer = Embedding(len(word2id) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

In [71]:
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_cov1 = Conv1D(128, 5, activation='relu')(embedded_sequences)
l_pool1 = MaxPooling1D(5)(l_cov1)
l_cov2 = Conv1D(128, 5, activation='relu')(l_pool1)
l_pool2 = MaxPooling1D(5)(l_cov2)
l_cov3 = Conv1D(128, 5, activation='relu')(l_pool2)
l_pool3 = MaxPooling1D(35)(l_cov3)  # pool size 35; this is global only when the input length is 35 (here it is 75)
l_flat = Flatten()(l_pool3)
l_dense = Dense(128, activation='relu')(l_flat)
preds = Dense(2, activation='softmax')(l_dense)

In [72]:
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

In [74]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 2000)              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 2000, 100)         8150200   
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 1996, 128)         64128     
_________________________________________________________________
max_pooling1d_10 (MaxPooling (None, 399, 128)          0         
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 395, 128)          82048     
_________________________________________________________________
max_pooling1d_11 (MaxPooling (None, 79, 128)           0         
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 75, 128)           82048     
__________
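
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model code (valid padding, stride 1 for `Conv1D`, pool stride equal to pool size); the helper names are hypothetical. The length after the last pooling layer is my own calculation, since the summary above is cut off before that row.

```python
def conv1d_len(length, kernel):
    """Output length of a valid-padded, stride-1 convolution."""
    return length - kernel + 1


def pool1d_len(length, pool):
    """Output length of max pooling with stride equal to the pool size."""
    return length // pool


def conv1d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * in_channels * filters + filters


length = 2000  # MAX_SEQUENCE_LENGTH
lengths = []
for kernel, pool in [(5, 5), (5, 5), (5, 35)]:
    length = conv1d_len(length, kernel)
    lengths.append(length)
    length = pool1d_len(length, pool)
    lengths.append(length)

embedding_params = (81501 + 1) * 100        # (vocabulary + 1) rows of 100 dims
first_conv_params = conv1d_params(5, 100, 128)
later_conv_params = conv1d_params(5, 128, 128)

print(lengths)
print(embedding_params, first_conv_params, later_conv_params)
```

Under these assumptions the last pooling layer leaves 2 time steps, so `Flatten` would pass 2 × 128 = 256 features to the dense layer.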

In [98]:
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=1024)

Train on 17500 samples, validate on 7500 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x15c5b8cc0>

In [31]:
# MAX_SENT_LENGTH = 100
# MAX_SENTS = 15
# MAX_NB_WORDS = 20000
# EMBEDDING_DIM = 100
# VALIDATION_SPLIT = 0.2

In [36]:
word2vec['the']

array([-0.038194  , -0.24487001,  0.72812003, -0.39961001,  0.083172  ,
        0.043953  , -0.39140999,  0.3344    , -0.57545   ,  0.087459  ,
        0.28786999, -0.06731   ,  0.30906001, -0.26383999, -0.13231   ,
       -0.20757   ,  0.33395001, -0.33848   , -0.31742999, -0.48335999,
        0.1464    , -0.37303999,  0.34577   ,  0.052041  ,  0.44946   ,
       -0.46970999,  0.02628   , -0.54154998, -0.15518001, -0.14106999,
       -0.039722  ,  0.28277001,  0.14393   ,  0.23464   , -0.31020999,
        0.086173  ,  0.20397   ,  0.52623999,  0.17163999, -0.082378  ,
       -0.71787   , -0.41531   ,  0.20334999, -0.12763   ,  0.41367   ,
        0.55186999,  0.57907999, -0.33476999, -0.36559001, -0.54856998,
       -0.062892  ,  0.26583999,  0.30204999,  0.99774998, -0.80480999,
       -3.0243001 ,  0.01254   , -0.36941999,  2.21670008,  0.72201002,
       -0.24978   ,  0.92136002,  0.034514  ,  0.46744999,  1.10790002,
       -0.19358   , -0.074575  ,  0.23353   , -0.052062  , -0.22

In [None]:

print('Found %s unique tokens.' % len(word2id))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)