# Text Classification

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/
- Keras Documentation: https://keras.io


In [21]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Text classification

Our goal is to perform binary classification on text data. We will work through two examples, spam detection and sentiment analysis, and try three strategies:

1) build naive features based on our own ideas

2) use a well-tested feature extraction technique

3) use deep learning with recurrent models on text

### 1. Spam detection on SMS messages

In [22]:
df = pd.read_csv('../data/sms.tsv', sep='\t')
df.head()

|   | label | msg |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [23]:
df['label'].value_counts() / len(df)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

### Exercise 1: Encode Labels to 0 and 1

Create a variable called y that contains 0 for HAM messages and 1 for SPAM messages. There are several ways to do this.

In [24]:
# TODO: python list comprehension?
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(df['label'])
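
The TODO above asks about a list comprehension. A minimal sketch of two alternatives to `LabelEncoder` follows; the `labels` Series is a hypothetical stand-in for `df['label']`, and the names `y_list` and `y_vec` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['label']; the real column holds the same two values.
labels = pd.Series(['ham', 'ham', 'spam', 'ham', 'spam'])

# Option 1: list comprehension, 1 for spam and 0 for everything else
y_list = np.array([1 if label == 'spam' else 0 for label in labels])

# Option 2: vectorized comparison, then cast the booleans to integers
y_vec = (labels == 'spam').astype(int).to_numpy()

print(y_list)
print(y_vec)
```

`LabelEncoder` assigns integers in sorted order of the class names, so it also maps `ham` to 0 and `spam` to 1.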

### Exercise 2: Build naive features based on keywords

- turn all your sms messages to lowercase
- define a function to count occurrences of a single keyword with the following signature:

        def count_word(word, sentence):
            ....
            return count_word_in_sentence
            
            
- to test your function, try it on these examples and check that the results match:
   
        count_word("the", "quick brown fox") # -> 0
        count_word("fox", "quick brown fox") # -> 1
        count_word("a", "a b a abab") # -> 2
     

- using the function `count_word` you just wrote, create a feature matrix `X` using counts of some keywords of your choice. (This will be a simple bag-of-words representation.)
- create other similar features. You could use:
    - the length of the message
    - the presence of numbers
    - the presence of special characters
    - ...
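
The extra features suggested in the list could be sketched as follows. The sample `messages`, the helper `count_special`, and the column names are all hypothetical; treat this as one possible starting point rather than the expected solution.

```python
import re
import pandas as pd

# Hypothetical sample messages standing in for the lowercased SMS texts.
messages = ["free entry!! call 0800 now", "ok lar... joking wif u oni"]

def count_special(sentence):
    """Count characters that are neither word characters (letters, digits, underscore) nor whitespace."""
    return len(re.findall(r'[^\w\s]', sentence))

features = pd.DataFrame({
    'length': [len(m) for m in messages],                                  # message length in characters
    'has_number': [int(bool(re.search(r'[0-9]', m))) for m in messages],   # 1 if any digit is present
    'num_special': [count_special(m) for m in messages],                   # punctuation and symbols
})
print(features)
```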

In [64]:
docs = df['msg'].values
docs_lower = [d.lower() for d in docs]
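def count_word(word, sentence):
    # Count whole-word matches by splitting on whitespace. A plain
    # sentence.count(word) counts substrings, so "a" would also match inside "abab".
    count_word_in_sentence = sentence.split().count(word)
    return count_word_in_sentence
# print(count_word("the", "quick brown fox"))  # -> 0
# print(count_word("fox", "quick brown fox"))  # -> 1
# print(count_word("a", "a b a abab"))         # -> 2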

In [65]:
X = pd.DataFrame([count_word('free', d) for d in docs_lower], columns=['free'])
for keyword in ['win', 'discount', 'call']:
    X[keyword] = [count_word(keyword, d) for d in docs_lower]
# x = {}
# for sentence in df['msg']:
#     for word in sentence.split(' '):
#         print(word)
#         break
#     break
# x = {word: count_word for word in sentence.split(' ') }
# import random
# x = {s: random.randint(1, 100) for s in "the quick brown fox".split(' ')}
# x
# x

In [66]:
import re

In [67]:
def count_numbers(sentence):
    return len(re.findall('[0-9]', sentence))

In [68]:
X['num_char'] = [count_numbers(d) for d in docs_lower]

In [69]:
X.head()

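|   | free | win | discount | call | num_char |
|---|------|-----|----------|------|----------|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 25 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |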


### Exercise 3: Train first model and evaluate performance

- split the data into train and test sets with `test_size=0.3, random_state=0`. You can use the `train_test_split` function from sklearn, which we have used in previous labs
- train a model of your choice on these features
- evaluate performance on the training and test sets
- discuss with a classmate:
    - how did you evaluate performance?
    - is the model overfitting?
    - is the model better than the benchmark?
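
One common benchmark for the last question is a majority-class baseline, which always predicts the most frequent label. The sketch below uses a small hypothetical label array `y_example`; it only illustrates the calculation and is not necessarily the benchmark the lab intends.

```python
import numpy as np

# Hypothetical labels: 0 = ham, 1 = spam (the real y comes from the label encoding step).
y_example = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

# A majority-class baseline always predicts the most frequent label.
majority_class = np.bincount(y_example).argmax()
baseline_pred = np.full_like(y_example, majority_class)
baseline_acc = (baseline_pred == y_example).mean()

print(majority_class, baseline_acc)
```

On the SMS data, the class proportions printed earlier (ham is about 0.866) mean that always predicting ham already scores roughly 86.6% accuracy, so a useful model has to beat that figure.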

In [74]:
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD, Adam

Using TensorFlow backend.


In [75]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

In [76]:
X.shape

(5572, 5)

In [81]:
model = Sequential()
model.add(Dense(3, input_shape=(5,), activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile('adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [82]:
model.fit(X_train, y_train, epochs=2)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f6282c91b70>

In [85]:
from sklearn.metrics import accuracy_score

In [87]:
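# predict_classes was removed from newer Keras releases; thresholding the
# sigmoid output at 0.5 gives the same 0/1 predictions.
y_train_pred = (model.predict(X_train) > 0.5).astype(int).ravel()
y_test_pred = (model.predict(X_test) > 0.5).astype(int).ravel()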
print("The train accuracy score is {:0.3f}".format(accuracy_score(y_train, y_train_pred)))
print("The test accuracy score is {:0.3f}".format(accuracy_score(y_test, y_test_pred)))

The train accuracy score is 0.942
The test accuracy score is 0.935


In [97]:
# df.values[0]
# model.predict(df.values[0])
# TODO: no idea if this stuff above works!!

In [98]:
# From TA
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.3, random_state=0)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)
model.score(X_train, y_train)
model.score(X_test, y_test)

0.9754784688995215

### Exercise 4: Cross Validation

- perform 5-fold cross validation on your model. You can refer back to lab 8 to refresh your memory on how to do this.
- print the confusion matrix and the classification report on the test data
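
A minimal sketch of both steps with scikit-learn follows. It assumes a random forest as the model and uses synthetic data from `make_classification` in place of the keyword features, so the names `X_demo`, `y_demo`, and `clf` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the keyword-count features and the 0/1 labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross validation on the training portion only
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=5)
print("CV accuracy: {:.3f} +/- {:.3f}".format(cv_scores.mean(), cv_scores.std()))

# Fit once on the full training set, then report on the held-out test set
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
cm = confusion_matrix(y_te, y_pred)  # rows are true classes, columns are predicted classes
print(cm)
print(classification_report(y_te, y_pred))
```

Running cross validation only on the training portion keeps the test set untouched until the final report.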

### Exercise 5: Count Features

- use features based on word counts using the `CountVectorizer` class from Scikit Learn
- use the following function to simplify your code (it encapsulates model training and evaluation):


    def split_fit_eval(X, y, model=None, epochs=10, random_state=0):
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

        if model is None:
            model = Sequential()
            model.add(Dense(1, input_dim=X.shape[1], activation='sigmoid'))

            model.compile(loss='binary_crossentropy',
                          optimizer='adam',
                          metrics=['accuracy'])

        h = model.fit(X_train, y_train, epochs=epochs, verbose=1)

        train_loss, train_acc = model.evaluate(X_train, y_train)
        test_loss, test_acc = model.evaluate(X_test, y_test)

        return train_loss, train_acc, test_loss, test_acc, model, h


- did you improve the performance?
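
To see what `CountVectorizer` produces before applying it to the SMS messages, here is a small sketch on a hypothetical three-document corpus (`toy_docs` is invented for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the SMS data.
toy_docs = ["win a free prize", "free free entry", "see you at lunch"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index
free_col = toy_vect.vocabulary_['free']
print(toy_counts.shape)
print(toy_counts[:, free_col].toarray().ravel())
```

With the default settings the vectorizer lowercases the text and ignores single-character tokens, which is why `a` does not get its own column here.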

In [99]:
def split_fit_eval(X, y, model=None, epochs=10, random_state=0):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

    if model is None:
        model = Sequential()
        model.add(Dense(1, input_dim=X.shape[1], activation='sigmoid'))

        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])

    h = model.fit(X_train, y_train, epochs=epochs, verbose=1)

    train_loss, train_acc = model.evaluate(X_train, y_train)
    test_loss, test_acc = model.evaluate(X_test, y_test)

    return train_loss, train_acc, test_loss, test_acc, model, h

In [120]:
train_loss, train_acc, test_loss, test_acc, model, h = split_fit_eval(X, y, epochs=2)
print(train_acc, test_acc)

Epoch 1/2
Epoch 2/2
0.9272553242687747 0.9224694903942636


In [106]:
from sklearn.feature_extraction.text import CountVectorizer

In [127]:
# # model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
# model = RandomForestClassifier()
# # model.fit(X_train, y_train)
# # model.score(X_train, y_train)
# # model.score(X_test, y_test)
# train_loss, train_acc, test_loss, test_acc, model, h = split_fit_eval(X, y, model=model, epochs=2)
# # print(train_acc, test_acc)

In [128]:
# vectorizer = CountVectorizer()
# # vectorizer.fit_transform()
# # X_train = vectorizer.fit_transform(X_train)
# X_train.head()

In [146]:
# From TA all the way to sentiment analysis
vocab_size = 3000
vect = CountVectorizer(decode_error='ignore',
                      stop_words='english',
                      lowercase=True,
                      max_features=vocab_size)


In [147]:
X = vect.fit_transform(docs)

In [148]:
vect.transform(["hello"])

<1x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 1 stored elements in Compressed Sparse Row format>

In [149]:
Xd = X.todense()
Xd

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [150]:
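# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = list(vect.get_feature_names_out())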
vocab[:10]

['00',
 '000',
 '02',
 '0207',
 '02073162414',
 '03',
 '04',
 '05',
 '06',
 '07123456789']

In [151]:
vocab[-10:]

['yogasana', 'yor', 'yr', 'yrs', 'yummy', 'yun', 'yunny', 'yuo', 'yup', 'zed']

In [152]:
# TODO: see the rest in the solution
# TODO: takeaway: feature engineering is hard and tedious
# TODO: should also understand ML somewhere in there

## Sentiment Analysis

The previous dataset was easy. Let's switch to a harder one and do sentiment analysis on it.

In [153]:
df = pd.read_csv('../data/rt_critics.csv')
df.head()

|   | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|--------|-------|------|-------------|-------|-------------|------|-------|
| 0 | Derek Adams | fresh | 114709.0 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559.0 | Toy story |
| 1 | Richard Corliss | fresh | 114709.0 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559.0 | Toy story |
| 2 | David Ansen | fresh | 114709.0 | Newsweek | A winning animated feature that has something ... | 2008-08-18 | 9559.0 | Toy story |
| 3 | Leonard Klady | fresh | 114709.0 | Variety | The film sports a provocative and appealing st... | 2008-06-09 | 9559.0 | Toy story |
| 4 | Jonathan Rosenbaum | fresh | 114709.0 | Chicago Reader | An entertaining computer-generated, hyperreali... | 2008-03-10 | 9559.0 | Toy story |


In [154]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14072 entries, 0 to 14071
Data columns (total 8 columns):
critic         13382 non-null object
fresh          14072 non-null object
imdb           14072 non-null float64
publication    14072 non-null object
quote          14072 non-null object
review_date    14072 non-null object
rtid           14072 non-null float64
title          14072 non-null object
dtypes: float64(2), object(6)
memory usage: 879.6+ KB


In [155]:
df['fresh'].value_counts() / len(df)

fresh     0.612067
rotten    0.386299
none      0.001634
Name: fresh, dtype: float64

In [156]:
df = df[df.fresh != 'none'].copy() # data cleaning
df['fresh'].value_counts() / len(df)

fresh     0.613069
rotten    0.386931
Name: fresh, dtype: float64

In [157]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df['fresh'])

LabelEncoder()

In [160]:
y = le.transform(df['fresh'])
y

array([0, 0, 0, ..., 0, 0, 0])

### Exercise 6: TFIDF

- build features with TF-IDF weights (term frequency scaled by inverse document frequency). sklearn has a vectorizer for this.
- do a train/test split
- train and evaluate a model
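
As a rough illustration of how TF-IDF weighting behaves, the sketch below runs `TfidfVectorizer` with default settings on a hypothetical three-quote corpus (`toy_quotes` is invented and far smaller than the real data).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the review data.
toy_quotes = ["a great great film", "a dull film", "great acting"]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(toy_quotes).toarray()
col = toy_tfidf.vocabulary_  # token -> column index

# In the first quote 'great' occurs twice and 'film' once, and both appear in two documents.
print(weights[0, col['great']], weights[0, col['film']])

# In the second quote 'dull' is rarer across the corpus than 'film'.
print(weights[1, col['dull']], weights[1, col['film']])

# By default each row is scaled to unit L2 norm.
print(np.linalg.norm(weights, axis=1))
```

A word scores higher when it is frequent within a document and rare across the corpus, which is the property that distinguishes TF-IDF from raw counts.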

In [161]:
# All from TA
from sklearn.feature_extraction.text import TfidfVectorizer

In [162]:
vect = TfidfVectorizer(decode_error='ignore',
                      stop_words='english',
                      max_features=20000)
X = vect.fit_transform(df['quote'])

In [163]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [165]:
model = RandomForestClassifier(n_estimators=100, n_jobs=1)
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [167]:
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

0.9997966239576977
0.708185053380783


### Exercise 7: NLP with deep learning

- Use the Tokenizer from Keras to:
    - Create a vocabulary
    - Convert sentences to sequences of integers
- pad the sequences so that they look like a tensor using the `pad_sequences` function from Keras.
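
To make the two steps concrete, here is a plain-Python sketch of what integer encoding and padding produce. It mimics the general idea (frequent words get small indices, 0 is reserved for padding, zeros are added on the left as `pad_sequences` does by default) but is not the Keras implementation; all names are hypothetical.

```python
from collections import Counter
import numpy as np

# Hypothetical toy corpus.
toy_texts = ["a great film", "great", "a dull but great film"]

# Build a vocabulary: the most frequent words get the smallest indices; 0 is kept for padding.
word_counts = Counter(word for text in toy_texts for word in text.split())
word_index = {word: i + 1 for i, (word, _) in enumerate(word_counts.most_common())}

# Convert each sentence into a list of integer word indices.
toy_sequences = [[word_index[w] for w in text.split()] for text in toy_texts]

# Left-pad with zeros to a common length so the result is a rectangular array.
toy_maxlen = max(len(seq) for seq in toy_sequences)
toy_padded = np.zeros((len(toy_sequences), toy_maxlen), dtype=int)
for row, seq in enumerate(toy_sequences):
    toy_padded[row, toy_maxlen - len(seq):] = seq

print(word_index)
print(toy_padded)
```

The padded array has one row per text and one column per time step, which is the shape an embedding or recurrent layer expects.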

In [169]:
# from TA
from keras.preprocessing.text import Tokenizer

In [176]:
tokenizer = Tokenizer(num_words=20000)
docs = df['quote']
tokenizer.fit_on_texts(docs)
sequences = tokenizer.texts_to_sequences(docs)
# each quote is now a list of integer word indices (not a one-hot encoding)

In [177]:
max_features = max([max(seq) for seq in sequences if len(seq) > 0]) + 1
max_features

20000

In [178]:
maxlen = max([len(seq) for seq in sequences])

In [179]:
from keras.preprocessing.sequence import pad_sequences

In [180]:
X = pad_sequences(sequences, maxlen=maxlen)

### Train / Test split on sequences

In [181]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

### Exercise 8: Build recurrent neural network model
- use what you have learned to build a recurrent model that classifies the sentiment

In [182]:
# See solution
# values selected heuristically
# recap
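
One possible shape for such a model is an `Embedding` layer followed by a single `LSTM` and a sigmoid output. This is a sketch under assumptions, not the course solution: the layer sizes are guesses, and `demo_vocab_size`, `demo_maxlen`, and the random `demo_X` are hypothetical stand-ins for `max_features`, `maxlen`, and the padded quotes.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Hypothetical sizes; in the notebook these would come from max_features and maxlen.
demo_vocab_size = 1000
demo_maxlen = 20

rnn = Sequential()
rnn.add(Embedding(input_dim=demo_vocab_size, output_dim=32))  # word index -> dense vector
rnn.add(LSTM(16))                                             # reads the sequence, returns the final state
rnn.add(Dense(1, activation='sigmoid'))                       # probability of class 1
rnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Random integer sequences stand in for the padded quotes.
demo_X = np.random.randint(0, demo_vocab_size, size=(8, demo_maxlen))
demo_probs = rnn.predict(demo_X, verbose=0)
print(demo_probs.shape)
```

Training would then follow the same pattern as earlier cells, for example `rnn.fit(X_train, y_train, epochs=..., validation_split=...)` with values chosen by experiment.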

### Exercise 9

- Try changing the network architecture and re-train the model at each change. Can you avoid overfitting?
    - change the number of nodes in the LSTM layer
    - change the output dimension of the Embedding layer
    - add dropout and recurrent dropout to the LSTM
    - add a second LSTM layer
    - add kernel regularizers
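
Several of these changes can be combined in one model definition. The sketch below is only an illustration: the helper `build_regularized_rnn` is hypothetical, and the dropout rates, layer sizes, and L2 strength are untuned guesses.

```python
import numpy as np
from keras import regularizers
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

def build_regularized_rnn(vocab_size, embed_dim=16, units=8):
    """Two stacked LSTM layers with dropout and an L2 kernel penalty (all values are guesses)."""
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embed_dim))
    # return_sequences=True so the second LSTM receives one vector per time step
    model.add(LSTM(units, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
    model.add(LSTM(units, dropout=0.2, recurrent_dropout=0.2,
                   kernel_regularizer=regularizers.l2(1e-4)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

reg_rnn = build_regularized_rnn(vocab_size=500)
# Random integer sequences stand in for padded text.
reg_out = reg_rnn.predict(np.random.randint(0, 500, size=(4, 12)), verbose=0)
print(reg_out.shape)
```

Comparing training and validation accuracy after each change shows whether the gap between them, and therefore the overfitting, is shrinking.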