- https://richliao.github.io/supervised/classification/2016/11/26/textclassifier-convolutional/

# Text Classification, Part I - Convolutional Networks

## 1. Text classification using CNN
- Background: [Convolutional Neural Networks for Sentence Classification - Yoon Kim](http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf)

In [17]:
from bs4 import BeautifulSoup

import re
import sys
import os
import pandas as pd
from nltk import tokenize

In [18]:
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

Using TensorFlow backend.




In [4]:
def clean_str(string):
    """
    Clean a raw review string: remove backslashes, single quotes,
    and double quotes, then trim surrounding whitespace and lowercase.
    """
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()
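
To see what `clean_str` does in practice, here is a small check on a made-up review fragment (the sample string is an assumption, not taken from the dataset). The function is repeated so the snippet runs on its own.

```python
import re

def clean_str(string):
    """Remove backslashes and quotes, then trim and lowercase."""
    string = re.sub(r"\\", "", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()

# Hypothetical sample with an apostrophe and backslash-escaped quotes.
sample = ' I\'ve seen \\"Moonwalker\\" twice. '
cleaned = clean_str(sample)
print(cleaned)  # ive seen moonwalker twice.
```

Apostrophes are deleted rather than replaced by a space, so contractions collapse into a single token ("I've" becomes "ive").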

In [6]:
data_train = pd.read_csv('IMDB/labeledTrainData.tsv', sep='\t')
data_train.shape

(25000, 3)

In [7]:
data_train[:5]

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [14]:
texts = []
labels = []
for idx in range(data_train.review.shape[0]):
    # Strip HTML tags; naming the parser avoids bs4's "no parser specified" warning.
    text = BeautifulSoup(data_train.review[idx], "html.parser")
    texts.append(clean_str(text.get_text()))
    labels.append(data_train.sentiment[idx])
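
The effect of `get_text()` on a review can be approximated with the standard library alone. The sketch below is not BeautifulSoup's implementation; `TextExtractor` and `strip_tags` are hypothetical names, and the sample markup is made up. It collects the text between tags and joins it with no separator, which appears to be why sentences separated only by `<br />` end up glued together in the cleaned reviews (for example "mkay.visually" in the first review).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text found between tags; the tags themselves are ignored.
        self.parts.append(data)

def strip_tags(markup):
    """Return the text content of `markup` with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

stripped = strip_tags("Great film.<br /><br />Would watch again.")
print(stripped)  # Great film.Would watch again.
```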





In [32]:
MAX_WORDS = 20000
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)

In [40]:
word2id = tokenizer.word_index
word2id

{'nips': 46328,
 'wanes': 33763,
 'hungrily': 29887,
 'devoreaux': 49371,
 'shipmans': 49372,
 'bilateral': 39411,
 'ikuru': 69677,
 'onegin': 33764,
 'sibrel': 16753,
 'works1': 49373,
 'vile': 5970,
 'chopra': 12998,
 'mundo': 60328,
 'pleasant': 2199,
 'gorcey': 33765,
 'kiberlain': 33766,
 'caw': 33767,
 'pleaseee': 39412,
 'grievous': 29888,
 'pgby': 49374,
 'meshing': 33768,
 'adhering': 34994,
 'outstandingly': 18428,
 'relations': 4221,
 'tony': 1220,
 'secretly': 4542,
 'altitude': 21652,
 'wholesale': 23103,
 'quarterfinals': 49376,
 'ashitaka': 49377,
 'lemay': 39413,
 'heterogeneous': 49378,
 'portaraying': 49379,
 'cheadles': 18429,
 'tearful': 18430,
 'employees': 8054,
 'alphavilles': 49380,
 'loiret': 76943,
 'gutteridge': 49381,
 'wkw': 49382,
 'valuable': 4543,
 '44yrs': 74930,
 'turners': 24909,
 'grovel': 49385,
 'romeros': 8922,
 'oyl': 27154,
 'faubourg': 20415,
 'binding': 19357,
 'repugnant': 13414,
 'merciless': 12604,
 'ghod': 39414,
 'pyrotechnics': 16754,
 ...

In [41]:
print('Found %s unique tokens.' % len(word2id))

Found 81484 unique tokens.
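
The vocabulary has 81,484 entries even though `num_words` was set to 20,000. In Keras, `word_index` covers every token seen during fitting, and `num_words` is applied only later, when texts are converted to sequences. A rough pure-Python sketch of that behaviour follows; `build_word_index` and `to_sequences` are hypothetical helpers, tokenization is simplified to whitespace splitting, and the toy sentences are made up.

```python
from collections import Counter

def build_word_index(texts):
    """Rank every word by frequency; the most frequent word gets id 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Map words to ids, dropping ids that are not below num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

toy_texts = ["the film was good", "the film was bad", "the plot was thin"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=4)

print(len(toy_index))   # 7 (the full vocabulary is kept)
print(toy_sequences)    # [[1, 3, 2], [1, 3, 2], [1, 2]]
```

The index still holds all seven words, while the sequences contain only ids below the cap. The same pattern shows up in the real data: the dictionary has 81,484 entries, but ids at or above 20,000 are left out of the sequences.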


In [34]:
sequences = tokenizer.texts_to_sequences(texts)
sequences[:3]

[[15, 29, 10, 537, 165, 177, 30, 1, 560, 15,
  10247, 200, 647, 2621, 5, 23, 223, 144, 1, 1026,
  657, 128, 2, 46, 291, 1, 19555, 2, 291, 11709,
  169, 275, 9, 40, 179, 5, 75, 3, 807, 2631,
  79, 10, 226, 34, 9, 193, 12, 62, 638, 7,
  1, 4272, 40, 5, 275, 93, 52, 57, 325, 724,
  26, 6, 2517, 39, 1346, 11709, 6, 168, 5063, 168,
  778, 18, 59, 9, 373, 165, 5, 63, 30, 1,
  431, 50, 8, 12, 1820, 623, 45, 4, 8, 44,
  1296, 3452, 41, 545, 949, 1, 3534, 2, 78, 1,
  579, 746, 4, 1654, 22, 73, 2015, 1156, 17, 4,
  260, 10, 6, 29, 41, 486, 1875, 35, 895, 21,
  2602, 37, 10247, 7, 554, 90, 21, 22, 165, 5,
  781, 10, 2, 164, 8, 355, 45, 198, 679, 10247,
  32, 14, 5, 1, 227, 4, 10, 16, 17, 10247,
  2, 87, 4, 23, 444, 58,
  ...

In [35]:
texts[:1]

['with all this stuff going down at the moment with mj ive started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mjs feeling towards the press and also the obvious message of drugs are bad mkay.visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.the actual feature film bit when it finally starts is only on for 20 minu

In [42]:
MAX_SEQUENCE_LENGTH = 100
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
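
With its default settings, `pad_sequences` pads and truncates at the front of each sequence, so a review longer than `MAX_SEQUENCE_LENGTH` keeps only its last 100 token ids. A minimal pure-Python sketch of that default behaviour, where `pad_pre` is a hypothetical helper and not part of Keras:

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences defaults (padding='pre', truncating='pre') for one list."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    trimmed = sequence[-maxlen:]                        # drop ids from the front
    return [value] * (maxlen - len(trimmed)) + trimmed  # left-pad with `value`

short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 6, 7]
print(long)   # [3, 4, 5, 6, 7]
```

If the opening of a review matters more than its ending, `pad_sequences` also accepts `truncating='post'`.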

In [None]:
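MAX_SENT_LENGTH = 100     # maximum number of tokens kept per sequence
MAX_SENTS = 15            # maximum number of sentences per document
MAX_NB_WORDS = 20000      # vocabulary size cap (same value as MAX_WORDS above)
EMBEDDING_DIM = 100       # dimensionality of each word embedding vector
VALIDATION_SPLIT = 0.2    # fraction of the data held out for validation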
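
`VALIDATION_SPLIT` is defined here but not used in the cells shown so far. One common way to apply such a fraction is to shuffle the examples and hold out the first 20% for validation. The sketch below is an assumption about how the constant might be used, not the author's code; `shuffled_split` and the toy data are hypothetical.

```python
import random

VALIDATION_SPLIT = 0.2

def shuffled_split(examples, labels, validation_split, seed=0):
    """Shuffle paired examples/labels and hold out a fraction for validation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)           # fixed seed for reproducibility
    n_val = int(validation_split * len(examples))  # size of the validation set
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    train = [(examples[i], labels[i]) for i in train_idx]
    val = [(examples[i], labels[i]) for i in val_idx]
    return train, val

toy_examples = [[i, i + 1] for i in range(10)]  # stand-ins for padded sequences
toy_labels = [i % 2 for i in range(10)]         # stand-ins for sentiment labels
train, val = shuffled_split(toy_examples, toy_labels, VALIDATION_SPLIT)
print(len(train), len(val))  # 8 2
```

Shuffling indices rather than the data itself keeps each example paired with its label.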

In [None]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)