# Data cleaning

Download the required NLTK data (the cell below is disabled with `%%script false`; remove that line to run it once):
* `stopwords`: Stopwords Corpus
* `wordnet`: WordNet

In [1]:
%%script false
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

In [1]:
import pandas as pd
from os.path import join

In [2]:
src = 'data'
df = pd.read_csv(join(src, 'labeledTrainData.tsv'), sep='\t')
df.shape, df.columns.values

((25000, 3), array(['id', 'sentiment', 'review'], dtype=object))

In [3]:
df.head(10)

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |
| 5 | 8196_8 | 1 | I dont know why people think this is such a ba... |
| 6 | 7166_2 | 0 | This movie could have been very good, but come... |
| 7 | 10633_1 | 0 | I watched this video at a friend's house. I'm ... |
| 8 | 319_1 | 0 | A friend of mine bought this film for £1, and ... |
| 9 | 8713_10 | 1 | `<br /><br />This movie is full of references. ...` |


## Remove HTML markup

In [4]:
from bs4 import BeautifulSoup

In [5]:
sample = df['review'][9]
sample

'<br /><br />This movie is full of references. Like \\Mad Max II\\", \\"The wild one\\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future."'

In [6]:
cleaned_sample = BeautifulSoup(sample, 'html.parser').get_text()
cleaned_sample

'This movie is full of references. Like \\Mad Max II\\", \\"The wild one\\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future."'
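
One caveat with `get_text()`: by default it joins text nodes with an empty string, so words separated only by a tag can end up fused. A minimal sketch of how the `separator` argument avoids that, using a made-up fragment that is not taken from the dataset:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two phrases separated only by line-break tags
html = 'Great cast<br /><br />Terrible script'

soup = BeautifulSoup(html, 'html.parser')
fused = soup.get_text()                # text nodes joined with ''
spaced = soup.get_text(separator=' ')  # text nodes joined with a single space

print(fused)
print(spaced)
```

For the sample review both calls give the same result, because the tags sit at the very start of the text.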

## Clean text
1. Replace every character that is not an ASCII letter (punctuation, digits, accented characters) with a space
1. Tokenize by converting to lowercase and splitting on whitespace
1. Remove stop words with `NLTK`
1. Lemmatize with `NLTK`

In [22]:
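import re
# Match any character that is not an ASCII letter
pattern = r'[^a-zA-Z]'
# Replace each match with a space so neighbouring words stay separated
cleaned_sample = re.sub(pattern, ' ', cleaned_sample)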

In [23]:
tokens = cleaned_sample.lower().split()
tokens[:5]

['this', 'movie', 'is', 'full', 'of']
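
The pattern `[^a-zA-Z]` keeps only ASCII letters, which has side effects worth knowing about: contractions are split at the apostrophe and accented characters are dropped. A small sketch that wraps the two cells above in a helper (the name `to_tokens` and the example strings are illustrative, not part of the notebook):

```python
import re

def to_tokens(text):
    """Replace every non-letter with a space, lowercase, and split on whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return letters_only.lower().split()

# The apostrophe and the rating are replaced by spaces, leaving a stray 's'
contraction = to_tokens("I dont know, but it's 8/10!")
# 'é' is not an ASCII letter, so the word is truncated
accented = to_tokens('café')

print(contraction)
print(accented)
```

In the sample review, `ladybug´s` and `We´ll` are split the same way. The stop-word output further down shows no stray `s` or `ll`, so for this sample the fragments happen to be filtered out later.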

In [27]:
from nltk.corpus import stopwords
stopwords.words('english')[:5]

['i', 'me', 'my', 'myself', 'we']

In [28]:
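# Build the set once: calling stopwords.words() inside the comprehension
# rebuilds the list for every token and searches it linearly
stop_words = set(stopwords.words('english'))
removed_stopwords = [w for w in tokens if w not in stop_words]
print(removed_stopwords)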

['movie', 'full', 'references', 'like', 'mad', 'max', 'ii', 'wild', 'one', 'many', 'others', 'ladybug', 'face', 'clear', 'reference', 'tribute', 'peter', 'lorre', 'movie', 'masterpiece', 'talk', 'much', 'future']


In [15]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [29]:
# lemmatize() defaults to pos='n', so every token is treated as a noun
lemmatized = [lemmatizer.lemmatize(w) for w in removed_stopwords]
print(lemmatized)

['movie', 'full', 'reference', 'like', 'mad', 'max', 'ii', 'wild', 'one', 'many', 'others', 'ladybug', 'face', 'clear', 'reference', 'tribute', 'peter', 'lorre', 'movie', 'masterpiece', 'talk', 'much', 'future']


## Post-cleaning

In [31]:
print(sample, '\n\n', ' '.join(lemmatized))

<br /><br />This movie is full of references. Like \Mad Max II\", \"The wild one\" and many others. The ladybug´s face it´s a clear reference (or tribute) to Peter Lorre. This movie is a masterpiece. We´ll talk much more about in the future." 

 movie full reference like mad max ii wild one many others ladybug face clear reference tribute peter lorre movie masterpiece talk much future
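
To reuse the four "Clean text" steps on other reviews, they can be collected into one function. The following is a sketch under stated assumptions, not the notebook's own code: the name `clean_text` is hypothetical, the input is assumed to have had its HTML removed already, and the stop-word set and lemmatizer are passed in so that the example runs with toy stand-ins instead of NLTK data.

```python
import re

def clean_text(text, stop_words, lemmatize):
    """Apply the four 'Clean text' steps to a string with HTML already removed."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)      # 1. replace non-letters with spaces
    tokens = letters_only.lower().split()               # 2. lowercase and split on whitespace
    kept = [w for w in tokens if w not in stop_words]   # 3. remove stop words
    return ' '.join(lemmatize(w) for w in kept)         # 4. lemmatize and rejoin

# Toy stand-ins (assumptions for illustration, not NLTK's behaviour)
toy_stop_words = {'this', 'is', 'of', 'a'}

def toy_lemmatize(word):
    # Crude plural stripping, only to make the sketch self-contained
    return word[:-1] if word.endswith('s') else word

result = clean_text('This movie is full of references.', toy_stop_words, toy_lemmatize)
print(result)
```

In the notebook itself one would presumably pass `set(stopwords.words('english'))` and `lemmatizer.lemmatize` instead of the toy versions, and apply the function to each HTML-stripped review, for example through `df['review'].apply(...)`.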
