# Bag of Words Meets Bags of Popcorn
## 1. Clean the unstructured movie reviews
#### Kaggle NLP Training

In [32]:
import pandas as pd
import re
import nltk

from bs4 import BeautifulSoup

nltk.download('stopwords')
nltk.download('wordnet')  # corpus that WordNetLemmatizer looks words up in
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /Users/boukhris-
[nltk_data]     escandon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/boukhris-
[nltk_data]     escandon/nltk_data...


In [33]:
train = pd.read_csv('word2vec-nlp-tutorial/labeledTrainData.tsv', delimiter='\t')

In [34]:
train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |
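
Row 1 of the preview shows stray backslashes and an unmatched quote, which suggests the default quote handling is reacting to quote characters inside the review text. `pd.read_csv` accepts a `quoting` argument that takes the same constants as the standard-library `csv` module, so the effect can be sketched without the dataset. The sample row below is made up, and the stdlib reader is only a stand-in: pandas' own parser may differ in details.

```python
import csv
import io

# Made-up row in the same tab-separated layout; the review text
# contains backslash-escaped double quotes.
raw = 'id\tsentiment\treview\n2381_9\t1\t"\\"The Classic\\" film"\n'

def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting))

default_rows = read_rows(raw, csv.QUOTE_MINIMAL)  # the reader's default mode
literal_rows = read_rows(raw, csv.QUOTE_NONE)     # quotes are ordinary characters

print(default_rows[1][2])  # quote characters were interpreted by the parser
print(literal_rows[1][2])  # field is returned exactly as written in the file
```

If the literal behavior is wanted here, passing `quoting=csv.QUOTE_NONE` to `pd.read_csv` is one option to try; the quote characters left in the text would then be stripped by the later regular-expression step.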


In [35]:
train.shape

(25000, 3)

In [36]:
train.columns

Index(['id', 'sentiment', 'review'], dtype='object')

In [37]:
train['review'].loc[0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [38]:
# Example: strip the HTML tags from a raw review with BeautifulSoup 4
# (naming the parser explicitly avoids the "no parser was explicitly specified" warning)
example = BeautifulSoup(train['review'].loc[0], 'html.parser')

In [39]:
example.get_text()

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi
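
Note that the output above glues sentences together where the `<br />` tags used to be (`m'kay.Visually`). That is harmless here because the next step turns the period into a space, but two words separated only by a tag would be merged. The sketch below shows one way to put a space wherever a tag appeared; it uses the standard-library `html.parser` as a stand-in for BeautifulSoup, and `TagStripper` and `strip_tags` are hypothetical names.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect text nodes, emitting a space wherever a tag appeared."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(' ')

    def handle_endtag(self, tag):
        self.parts.append(' ')

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    """Return the text of `html` with tags replaced by single spaces."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ' '.join(''.join(parser.parts).split())

sample = "drugs are bad m'kay.<br /><br />Visually impressive"
print(strip_tags(sample))
```

With BeautifulSoup itself, `get_text` also accepts a `separator` argument (for example `example.get_text(separator=' ')`) that serves the same purpose.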

In [40]:
# Example: replace every non-letter character (punctuation, digits) with a space using a regular expression (import re)
letters_only = re.sub('[^a-zA-Z]', ' ', example.get_text())
letters_only

'With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi
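
The pattern `[^a-zA-Z]` removes more than punctuation: the `20` in "20 mi..." is gone from the output above, and any accented letter would be dropped as well. The sketch below compares the notebook's pattern with two alternatives on a made-up sentence; which one is appropriate depends on whether numbers and non-ASCII letters matter for the model, which is a judgment call and not something the notebook settles.

```python
import re

sample = "on for 20 minutes. Buñuel made it"  # made-up sentence

# Pattern used in the notebook: ASCII letters only.
letters_only = re.sub('[^a-zA-Z]', ' ', sample)

# Variant that also keeps digits.
keep_digits = re.sub('[^a-zA-Z0-9]', ' ', sample)

# Variant that keeps any Unicode letter: remove non-word characters,
# digits, and underscores (\W is Unicode-aware for str patterns in Python 3).
unicode_letters = re.sub(r'[\W\d_]', ' ', sample)

for cleaned in (letters_only, keep_digits, unicode_letters):
    print(cleaned.split())
```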

In [41]:
# Convert all to lower case
lower = letters_only.lower()
lower

'with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for    mi

In [42]:
# Split unstructured text into a list of words
words = lower.split()
words[0:10]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with']

In [43]:
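# Remove stopwords using the NLTK stopwords corpus
# (build the set once; calling stopwords.words() inside the comprehension rebuilds the list for every word)
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]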
words[0:10]

['stuff',
 'going',
 'moment',
 'mj',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary']
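
The filtering step itself is a plain membership test, which can be sketched without NLTK. The stopword set below is a small hypothetical subset chosen to match the ten words shown earlier; the real list comes from `stopwords.words('english')`. Holding the stopwords in a set keeps each lookup cheap, which matters once the step runs over 25,000 reviews.

```python
# Hypothetical subset of the English stopword list, for illustration only.
STOP_WORDS = frozenset({'with', 'all', 'this', 'down', 'at', 'the'})

def remove_stopwords(words, stop_words=STOP_WORDS):
    """Return the words that are not in `stop_words`, preserving order."""
    return [w for w in words if w not in stop_words]

words = ['with', 'all', 'this', 'stuff', 'going',
         'down', 'at', 'the', 'moment', 'with']
print(remove_stopwords(words))
```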

In [55]:
# Lemmatize verbs using NLTK's WordNetLemmatizer
# (create the lemmatizer once and reuse it for every word)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(w, 'v') for w in words]
lemmatized_words[0:10]

['stuff',
 'go',
 'moment',
 'mj',
 'start',
 'listen',
 'music',
 'watch',
 'odd',
 'documentary']
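
Conceptually, lemmatization maps each inflected form to a dictionary form and leaves unknown words alone. The toy sketch below reproduces the mapping shown above with a four-entry lookup table; `VERB_LEMMAS` and `lemmatize_verbs` are hypothetical names, and the real lemmatizer consults WordNet and its morphology rules instead of a hand-written table.

```python
# Toy lookup table taken from the inputs and outputs shown above.
VERB_LEMMAS = {
    'going': 'go',
    'started': 'start',
    'listening': 'listen',
    'watching': 'watch',
}

def lemmatize_verbs(words, table=VERB_LEMMAS):
    """Replace each word by its lemma if the table knows it, else keep it."""
    return [table.get(w, w) for w in words]

words = ['stuff', 'going', 'moment', 'mj', 'started',
         'listening', 'music', 'watching', 'odd', 'documentary']
print(lemmatize_verbs(words))
```

Because every word is treated as a verb (`'v'`), nouns that happen to look like verb forms are changed too, which is worth keeping in mind when reading the cleaned reviews.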

In [58]:
### Put all the steps together into one function:
# 1. Remove HTML tags
# 2. Replace non-letter characters (punctuation, digits) with spaces
# 3. Convert to lower case
# 4. Split into individual words
# 5. Remove stopwords
# 6. Lemmatize verbs
# 7. Join the words back into a string separated by a single space

# Build the stopword set and the lemmatizer once instead of once per word
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_txt(raw_text):

    # 1. Remove HTML tags
    review = BeautifulSoup(raw_text, 'html.parser')

    # 2. Replace non-letter characters (punctuation, digits) with spaces
    letters_only = re.sub('[^a-zA-Z]', ' ', review.get_text())

    # 3. Convert to lower case
    lower = letters_only.lower()

    # 4. Split into individual words
    words = lower.split()

    # 5. Remove stopwords
    words = [w for w in words if w not in stop_words]

    # 6. Lemmatize verbs
    lemmatized_words = [lemmatizer.lemmatize(w, 'v') for w in words]

    # 7. Join the words back into a string separated by a single space
    return ' '.join(lemmatized_words)


In [59]:
# Verify that the clean_txt function operates as expected
clean_txt(train['review'].loc[0])

'stuff go moment mj start listen music watch odd documentary watch wiz watch moonwalker maybe want get certain insight guy think really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember go see cinema originally release subtle message mj feel towards press also obvious message drug bad kay visually impressive course michael jackson unless remotely like mj anyway go hate find bore may call mj egotist consent make movie mj fan would say make fan true really nice actual feature film bite finally start minutes exclude smooth criminal sequence joe pesci convince psychopathic powerful drug lord want mj dead bad beyond mj overhear plan nah joe pesci character rant want people know supply drug etc dunno maybe hat mj music lot cool things like mj turn car robot whole speed demon sequence also director must patience saint come film kiddy bad sequence usually directors hate work one kid let alone whole bunch perform complex dance scene botto

In [65]:
# Loop through all of the reviews in the training set and clean the text
clean_train = []

for i in range(0, train.shape[0]):
    if (i+1)%1000 == 0:
        # print() already ends the line, so no newline escape is needed
        print('Review %d of %d' % (i+1, train.shape[0]))
    clean_train.append(clean_txt(train['review'][i]))



Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Review 25000 of 25000
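
The same loop is needed again for the test set, so it can be worth wrapping in a helper. The sketch below is one possible shape, with `clean_all` and `report_every` as hypothetical names; it iterates over the reviews directly with `enumerate` and takes the cleaning function as an argument, so it is shown here with `str.lower` as a stand-in for `clean_txt`.

```python
def clean_all(reviews, clean_fn, report_every=1000):
    """Apply `clean_fn` to every review, printing progress periodically."""
    cleaned = []
    total = len(reviews)
    for n, review in enumerate(reviews, start=1):
        if n % report_every == 0:
            print('Review %d of %d' % (n, total))
        cleaned.append(clean_fn(review))
    return cleaned

# Stand-in data and cleaning function, for illustration only.
demo = clean_all(['Good MOVIE', 'Bad Movie', 'OK'], str.lower, report_every=2)
print(demo)
```

In the notebook this would be called as `clean_all(train['review'], clean_txt)`. If progress messages are not needed, `train['review'].apply(clean_txt)` is the usual pandas alternative.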


In [67]:
clean_train[0:3]

['stuff go moment mj start listen music watch odd documentary watch wiz watch moonwalker maybe want get certain insight guy think really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember go see cinema originally release subtle message mj feel towards press also obvious message drug bad kay visually impressive course michael jackson unless remotely like mj anyway go hate find bore may call mj egotist consent make movie mj fan would say make fan true really nice actual feature film bite finally start minutes exclude smooth criminal sequence joe pesci convince psychopathic powerful drug lord want mj dead bad beyond mj overhear plan nah joe pesci character rant want people know supply drug etc dunno maybe hat mj music lot cool things like mj turn car robot whole speed demon sequence also director must patience saint come film kiddy bad sequence usually directors hate work one kid let alone whole bunch perform complex dance scene bott

In [68]:
train['clean_review'] = clean_train

In [69]:
train.head()

| | id | sentiment | review | clean_review |
|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | stuff go moment mj start listen music watch od... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | classic war worlds timothy hines entertain fil... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | film start manager nicholas bell give welcome ... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | must assume praise film greatest film opera ev... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | superbly trashy wondrously unpretentious explo... |


In [70]:
train.tail()

| | id | sentiment | review | clean_review |
|---|---|---|---|---|
| 24995 | 3453_3 | 0 | It seems like more consideration has gone into... | seem like consideration go imdb review film go... |
| 24996 | 5064_1 | 0 | I don't believe they made this film. Completel... | believe make film completely unnecessary first... |
| 24997 | 10905_3 | 0 | Guy is a loser. Can't get girls, needs to buil... | guy loser get girls need build pick stronger s... |
| 24998 | 10194_3 | 0 | This 30 minute documentary Buñuel made in the ... | minute documentary bu uel make early one spain... |
| 24999 | 8478_8 | 1 | I saw this movie as a child and it broke my he... | saw movie child break heart story unfinished e... |


In [72]:
train.to_csv('clean_labeled_train.tsv', sep='\t', header=True, index=False)

In [73]:
# Clean the test data for scoring and submission to Kaggle
test = pd.read_csv('word2vec-nlp-tutorial/testData.tsv', delimiter='\t')

In [74]:
test.head()

| | id | review |
|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression... |
| 4 | 12128_7 | A very accurate depiction of small time mob li... |


In [75]:
# Loop through all of the reviews in the test set and clean the text
clean_test = []

for i in range(0, test.shape[0]):
    if (i+1)%1000 == 0:
        # print() already ends the line, so no newline escape is needed
        print('Review %d of %d' % (i+1, test.shape[0]))
    clean_test.append(clean_txt(test['review'][i]))



Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Review 25000 of 25000


In [76]:
# Use the same column name as in the training set
test['clean_review'] = clean_test

In [77]:
test.head()

| | id | review | clean_review |
|---|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... | naturally film main theme mortality nostalgia ... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... | movie disaster within disaster film full great... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... | movie kid saw tonight child love one point kid... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression... | afraid dark leave impression several different... |
| 4 | 12128_7 | A very accurate depiction of small time mob li... | accurate depiction small time mob life film ne... |


In [78]:
test.to_csv('clean_labeled_test.tsv', sep='\t', header=True, index=False)