## NLP basics

Even basic NLP approaches, such as bag-of-words (BOW) or TF-IDF, can perform decently on many tasks. In this notebook, I use a subset of IMDB movie reviews and process the data before training a supervised model to detect the sentiment.


In [9]:
import pandas as pd
import numpy as np

In [7]:
# Import the data
data = pd.read_csv('IMDB_sample.csv', index_col=None)
# Drop the first column
data = data.iloc[:, 1:]

In [6]:
data.shape

(7501, 3)

In [8]:
data.head()

Unnamed: 0,review,label
0,This short spoof can be found on Elite's Mille...,0
1,A singularly unfunny musical comedy that artif...,0
2,"An excellent series, masterfully acted and dir...",1
3,The master of movie spectacle Cecil B. De Mill...,1
4,I was gifted with this movie as it had such a ...,0


### Feature for length of review

I will use the nltk module and its word_tokenize function to split each review into tokens. Then I will go over the tokenized reviews and count the tokens in each one. I assign these counts to a new feature that captures the length of a review. Note that word_tokenize treats punctuation marks as separate tokens, so the count includes them.

This is a simple feature, but it can be quite informative in certain contexts and can improve the predictive performance of our models.

First, let's get a feel for how nltk works by testing it on a simple string. The output is a list of the tokens in the string.

In [11]:
from nltk import word_tokenize

my_string = "the weather is great today"
word_tokenize(my_string)

['the', 'weather', 'is', 'great', 'today']

In [13]:
from nltk import word_tokenize

# List comprehension to tokenize the reviews: the result is a list
tokens = [word_tokenize(review) for review in data.review]

# Tokens will be a list of lists, where each inner list is a single review. Let's check whether that's indeed the case.
print('The first review in the data set: ', data.iloc[0,0])
print('The first list of the tokens: ', tokens[0]) 

The first review in the data set:  This short spoof can be found on Elite's Millennium Edition DVD of "Night of the Living Dead". Good thing to as I would have never went even a tad out of my way to see it.Replacing zombies with bread sounds just like silly harmless fun on paper. In execution, it's a different matter. This short didn't even elicit a chuckle from me. I really never thought I'd say this, but "Night of the Day of the Dawn of the Son of the Bride of the Return of the Revenge of the Terror of the Attack of the Evil, Mutant, Alien, Flesh Eating, Hellbound, Zombified Living Dead Part 2: In Shocking 2-D" was a VERY better parody and not nearly as lame or boring.<br /><br />My Grade: F
The first list of the tokens:  ['This', 'short', 'spoof', 'can', 'be', 'found', 'on', 'Elite', "'s", 'Millennium', 'Edition', 'DVD', 'of', '``', 'Night', 'of', 'the', 'Living', 'Dead', "''", '.', 'Good', 'thing', 'to', 'as', 'I', 'would', 'have', 'never', 'went', 'even', 'a', 'tad', 'out', 'of', 

In [17]:
# tokens is a list of lists, so the length of each inner list is the number of tokens in that review.
length_tokens = [len(review_tokens) for review_tokens in tokens]

# Sanity check: the first entry should match the length of the first tokenized review
print(length_tokens[0], len(tokens[0]))
data['n_words'] = length_tokens

155 155


In [18]:
data.head()

Unnamed: 0,review,label,n_words
0,This short spoof can be found on Elite's Mille...,0,155
1,A singularly unfunny musical comedy that artif...,0,646
2,"An excellent series, masterfully acted and dir...",1,121
3,The master of movie spectacle Cecil B. De Mill...,1,128
4,I was gifted with this movie as it had such a ...,0,248
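The same feature can also be built in a single step with pandas. The sketch below is a minimal illustration rather than the notebook's actual code: `simple_tokenize` is a hypothetical regex-based stand-in for word_tokenize (so the example runs without downloading nltk data), and the two toy reviews are made up.

```python
import re

import pandas as pd


def simple_tokenize(text):
    """Hypothetical stand-in for nltk's word_tokenize.

    Words and punctuation marks become separate tokens.
    """
    return re.findall(r"\w+|[^\w\s]", text)


# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"review": ["the weather is great today", "Boring. Really boring!"]}
)

# Tokenize each review and store the token count as a new column
toy["n_words"] = toy["review"].apply(lambda review: len(simple_tokenize(review)))

print(toy)
```

With the real data, replacing `simple_tokenize` with `word_tokenize` should reproduce the n_words column built above.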


Another thing we might want to do is detect the language of each review, because not all of them are necessarily in English. A single review may even mix English sentences with sentences in another language.

We will use the langdetect package (the langid package is an alternative that works very similarly).

In [19]:
# Example of langdetect 

import langdetect
foreign_string = 'ik vind her echt heel saai!'

langdetect.detect_langs(foreign_string)

[nl:0.9999977669757246]

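We see that the output has the form [language:probability]. We might want to capture only the 'nl' part. Note that langdetect is not deterministic by default, so the probability can differ slightly between calls, as it does in the next output.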

In [22]:
# Transform the list to a string, then split on :
str(langdetect.detect_langs(foreign_string)).split(':')

['[nl', '0.9999951271122164]']

In [23]:
# Now extract the first element of this split
str(langdetect.detect_langs(foreign_string)).split(':')[0]

'[nl'

In [24]:
# Since we want to get rid of the [, we need to do one more filtering step, chained to the previous
str(langdetect.detect_langs(foreign_string)).split(':')[0][1:]

'nl'

This is the desired output. We will now follow an approach similar to the one used for the length feature. After langdetect is applied to each review, we still need to perform the cleaning step above to extract only the language code from the output of the detect_langs function.

In [26]:
languages = [] 

for i in range(len(data)):
    languages.append(langdetect.detect_langs(data.iloc[i, 0]))

In [29]:
# Transform the languages list: keep only the language code of the first candidate for each review
languages = [str(lang).split(':')[0][1:] for lang in languages]
data['language'] = languages

In [31]:
languages[0]

'en'

In [32]:
# Let's inspect what the data looks like now
data.head()

Unnamed: 0,review,label,n_words,language
0,This short spoof can be found on Elite's Mille...,0,155,en
1,A singularly unfunny musical comedy that artif...,0,646,en
2,"An excellent series, masterfully acted and dir...",1,121,en
3,The master of movie spectacle Cecil B. De Mill...,1,128,en
4,I was gifted with this movie as it had such a ...,0,248,en
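String slicing works, but it depends on how the result happens to be printed. To my knowledge, each element returned by detect_langs is an object with `lang` and `prob` attributes, so the code can be read directly. The sketch below rests on that assumption: `Candidate` is a hypothetical stand-in used so the example runs without langdetect installed, and `top_language` is an illustrative helper, not part of the package.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by langdetect.detect_langs,
# assumed to expose a language code (.lang) and a probability (.prob)
Candidate = namedtuple("Candidate", ["lang", "prob"])


def top_language(candidates, default="unknown"):
    """Return the language code of the most probable candidate."""
    if not candidates:
        return default
    best = max(candidates, key=lambda candidate: candidate.prob)
    return best.lang


# Made-up detection result for a review that mixes two languages
mixed = [Candidate("en", 0.71), Candidate("nl", 0.29)]
print(top_language(mixed))
print(top_language([]))
```

With langdetect installed, `top_language(langdetect.detect_langs(foreign_string))` should return the same 'nl' as the slicing above.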


### Count/tfidf vectorizer

Sklearn provides the CountVectorizer class, which transforms the review column into a bag-of-words (BOW) representation, and the TfidfVectorizer class, which transforms the reviews into TF-IDF weighted features.

The two classes take almost identical parameters. Here I will explore the TfidfVectorizer.
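To make the difference between the two representations concrete, here is a minimal sketch on a made-up three-review corpus (the toy reviews and variable names are assumptions for illustration). Both vectorizers learn the same vocabulary; CountVectorizer stores raw counts, while TfidfVectorizer stores weighted values and, with its default settings, scales each row to unit length.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only
toy_reviews = ["great movie great cast", "boring movie", "great fun"]

# Bag-of-words: each cell is the number of times a term occurs in a review
bow_vect = CountVectorizer()
bow = bow_vect.fit_transform(toy_reviews)

# TF-IDF: counts are reweighted so that terms found in many reviews count less
tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(toy_reviews)

# Recover the column names in column order from the learned vocabulary
columns = sorted(bow_vect.vocabulary_, key=bow_vect.vocabulary_.get)

print(pd.DataFrame(bow.toarray(), columns=columns))
print(pd.DataFrame(tfidf.toarray(), columns=columns).round(3))
```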

In [35]:
# Importing the required library
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# I expand the default ENGLISH_STOP_WORDS with words related to movies and reviews.
# The result is converted to a list because recent scikit-learn versions do not accept a frozenset for stop_words.
my_stop_words = list(ENGLISH_STOP_WORDS.union(['movie', 'film', 'cinema', 'theatre']))

# Define the vectorizer, specifying the stop_words argument, max_features (keep the 1000 most frequent terms)
# and the token_pattern, which keeps only tokens of at least two characters that contain no digits
vect = TfidfVectorizer(max_features=1000, stop_words=my_stop_words, token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(data.review)
tfidf = vect.transform(data.review)

tfidf

<7501x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 342906 stored elements in Compressed Sparse Row format>
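The token_pattern used above can be tested on its own with Python's re module. The sketch below applies it to a made-up sentence, lowercased first because the vectorizer lowercases text by default; the sample sentence is an assumption for illustration.

```python
import re

# Same pattern as in the vectorizer: two or more word characters, none of them a digit
token_pattern = r'\b[^\d\W][^\d\W]+\b'

# Made-up sentence with digits, one-letter words and a word containing a digit
sample = "I watched 2 movies in 2019, a B-movie and Se7en"

tokens = re.findall(token_pattern, sample.lower())
print(tokens)
# ['watched', 'movies', 'in', 'movie', 'and']
```

One-letter words, pure numbers, and words that contain a digit (such as "se7en") are all dropped.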

In [36]:
# The output is a sparse matrix; we can easily transform it into a pandas DataFrame.
# Note: scikit-learn versions before 1.0 use vect.get_feature_names() instead.
X_df = pd.DataFrame(tfidf.toarray(), columns=vect.get_feature_names_out())
X_df.head()

Unnamed: 0,able,absolutely,accent,act,acted,acting,action,actor,actors,actress,...,wrong,wrote,yeah,year,years,yes,york,young,zombie,zombies
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.218637
1,0.0,0.0,0.0,0.0,0.0,0.049654,0.0,0.0,0.057781,0.0,...,0.0,0.0,0.0,0.140262,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.22474,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.156487,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.137626,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.131886,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.094979,0.0,0.0,0.0,...,0.106637,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we need to concatenate X_df with the rest of the data, including the two features we created.

In [39]:
# Keep everything except the raw review text; .copy() avoids modifying a view of data
data_slice = data.iloc[:, 1:].copy()
# Recode the language to be binary: 1 if the language is en, 0 otherwise
data_slice['language'] = data_slice['language'].apply(lambda x: 1 if x=='en' else 0)
data_slice.head()

Unnamed: 0,label,n_words,language
0,0,155,1
1,0,646,1
2,1,121,1
3,1,128,1
4,0,248,1


In [53]:
# Concatenate the two data sets together

data_for_modelling = pd.concat([X_df, data_slice], axis=1)
data_for_modelling.head()

Unnamed: 0,able,absolutely,accent,act,acted,acting,action,actor,actors,actress,...,year,years,yes,york,young,zombie,zombies,label,n_words,language
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.218637,0,155,1
1,0.0,0.0,0.0,0.0,0.0,0.049654,0.0,0.0,0.057781,0.0,...,0.140262,0.0,0.0,0.0,0.0,0.0,0.0,0,646,1
2,0.0,0.0,0.0,0.0,0.22474,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.156487,0.0,0.0,1,121,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.137626,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.131886,0.0,0.0,1,128,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.094979,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,248,1
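One thing to keep in mind: pd.concat with axis=1 aligns rows on the index, not on position. The concatenation above works because both frames share the same default index. The sketch below uses made-up toy frames to show what happens when the indexes differ (for example after filtering rows) and how resetting the index fixes it.

```python
import pandas as pd

# Toy frames, invented for illustration only
left = pd.DataFrame({"great": [0.1, 0.2, 0.3]})
# Index 0, 2, 5 mimics a frame whose rows were filtered earlier
right = pd.DataFrame({"label": [0, 1, 0]}, index=[0, 2, 5])

# Rows are matched by index label, so unmatched labels produce NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting the index restores positional alignment
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```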


Now we can continue with regular supervised learning, using the label column as the target.
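As a rough sketch of what that next step could look like, the example below splits the features from the label, holds out a test set, and fits a classifier. Everything here is an assumption for illustration: the toy frame stands in for data_for_modelling, its values are randomly generated, and logistic regression with feature scaling is just one reasonable choice, not the model used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data_for_modelling (values are random, not real results)
rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame({
    "great": rng.random(n_rows),
    "boring": rng.random(n_rows),
    "n_words": rng.integers(50, 500, size=n_rows),
    "language": 1,
})
toy["label"] = (toy["great"] > toy["boring"]).astype(int)

# Separate the features from the target so the label does not leak into X
X = toy.drop(columns="label")
y = toy["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling helps because n_words is on a much larger scale than the TF-IDF values
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy on the toy data: {accuracy:.2f}")
```

With the real data, `toy` would be replaced by data_for_modelling, and the accuracy would of course differ.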