# Data preprocessing 

In [1]:
import pandas as pd
import re

In [2]:
import nltk

In [3]:
import os

### 1. Load data into pandas dataframe

In [3]:
df = pd.DataFrame()

In [None]:
# collect records in a list first; DataFrame.append was removed in pandas 2.0
rows = []
for file in os.listdir('../data/train/pos'):
    with open(f'../data/train/pos/{file}', 'r') as input_text:
        rows.append({'comment': input_text.read(), 'sentiment': 1})

for file in os.listdir('../data/train/neg'):
    with open(f'../data/train/neg/{file}', 'r') as input_text:
        rows.append({'comment': input_text.read(), 'sentiment': 0})

df = pd.DataFrame(rows)
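
The loading loop above could be wrapped in a small helper so that the same code serves every folder. This is only a sketch: the name `read_reviews` is hypothetical, and it assumes that the reviews are stored as UTF-8 `.txt` files, which the notebook does not state.

```python
from pathlib import Path

def read_reviews(directory, sentiment):
    """Return one record per text file found in `directory`."""
    records = []
    # sorted() makes the row order reproducible; the order in which a
    # directory is listed is not guaranteed.
    for path in sorted(Path(directory).glob('*.txt')):
        records.append({
            # assumption: the files are UTF-8 encoded
            'comment': path.read_text(encoding='utf-8'),
            'sentiment': sentiment,
        })
    return records

# Possible use (paths as in the notebook):
# rows = read_reviews('../data/train/pos', 1) + read_reviews('../data/train/neg', 0)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame`.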

In [31]:
df = df.reset_index()

In [33]:
df = df.drop(columns=['index'])

In [34]:
# persist
df.to_csv('../data/train/comments.csv')

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
comment      25000 non-null object
sentiment    25000 non-null int64
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [37]:
df.head(n=10)

Unnamed: 0,comment,sentiment
0,For a movie that gets no respect there sure ar...,1
1,Bizarre horror movie filled with famous faces ...,1
2,"A solid, if unremarkable film. Matthau, as Ein...",1
3,It's a strange feeling to sit alone in a theat...,1
4,"You probably all already know this by now, but...",1
5,I saw the movie with two grown children. Altho...,1
6,You're using the IMDb.<br /><br />You've given...,1
7,This was a good film with a powerful message o...,1
8,"Made after QUARTET was, TRIO continued the qua...",1
9,"For a mature man, to admit that he shed a tear...",1


### 2. NLP 

This chapter performs all text preprocessing steps. Their result is used to create the vocabulary and to train the embedding vectors.

We will apply embeddings at the level of documents (comments), since we want to classify whole comments as having positive or negative sentiment.

Later we will think about adding POS tags or creating sentence embeddings instead of whole-document embeddings.

Tokenization (remove non-alphabetic characters), lowercasing, ...


In [4]:
df = pd.read_csv('../data/train/comments.csv', index_col=0)

In [14]:
df.head(n=5)

Unnamed: 0,comment,sentiment
0,For a movie that gets no respect there sure ar...,1
1,Bizarre horror movie filled with famous faces ...,1
2,"A solid, if unremarkable film. Matthau, as Ein...",1
3,It's a strange feeling to sit alone in a theat...,1
4,"You probably all already know this by now, but...",1


Let's look at some comments to see what we need to clean in general.

In [15]:
df.comment.iloc[50]

"I am writing this after just seeing The Perfect Son at the 2002 Gay and Lesbian Mardi Gras Film Festival in Sydney, Australia.<br /><br />When their Father dies, two estranged brothers meet at the funeral and after discovering that one of the brothers is dying from AIDS, they enter on a heart warming journey of reconciliation. The two leads do a magnificent job of creating the gradual warmth and respect that builds up between them as the movie progresses. I do have one qualm about the movie though - whilst the brother who is dying acts sick, he doesn't look it. A person of 0 T4 cells would look quite ill - not even a make up job to make the actor look ill was employed. A small gripe, but one that makes it a bit less realistic. Despite that one small gripe, The Perfect Son is a wonderful movie and should you have the chance to see it- do. I'm hoping for a DVD release in the near future!"

In [16]:
df.comment.iloc[150]

'Easily one of the ten best movies of the 20th century. In Cold Blood is brilliant in the simplicity and realism of its storytelling, and absolutely riveting.<br /><br />Robert Blake walks away with the film. The story seems to be presented almost entirely from Perry\'s viewpoint, despite Dick being the leader and planner of the pair. The viewer will invariable perceive Dick as being more unstable, immature, and generally feel like Perry would not have been pulled into this nightmare but for Dick and his need to be somebody and pull off a big score.<br /><br />Based on a true story with particular attention to accuracy, In Cold Blood depicts the story behind the brutal and senseless murder of a rural Kansas family one cold, windy night, because Dick has bought into an age-old rural myth about prosperous farmers having a safe full of cash in their home. As "prosecutor" (a character that isn\'t given a name in the script), played by Will Geer, so astutely points out, their lives are boug

In [21]:
df.comment.iloc[15000]

'Despite an overall pleasing plot and expensive production one wonders how a director can make so many clumsy cultural mistakes. Where were the Japanese wardrobe and cultural consultants? Not on the payroll apparently. <br /><br />A Japanese friend of mine actually laughed out loud at some of the cultural absurdities she watched unfold before her eyes. In a later conversation she said, "Imagine a Finnish director making a movie in Fnnish about the American Civil War using blond Swedish actors as union Army and Frenchmen as the Confederates. Worse imagine dressing the Scarlet O\'Hara female lead in a period hoop skirt missing the hoop and sporting a 1950\'s hairdo. Maybe some people in Finland might not realize that the hoop skirt was "missing the hoop" or recognize the bizarre Jane Mansfield hair, but in Atlanta they would not believe their eyes or ears....and be laughing in the aisles...excellent story and photography be damned.<br /><br />So...watching Memoirs of a Geisha was painful

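We definitely have some HTML-like tags to remove (replace with a space).
We can also see apostrophes shown as escaped (\\'). This is only how Python displays a string that contains both kinds of quotes; the data itself holds plain apostrophes.
Also, we will probably want to replace .... with a space as well.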
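
A minimal sketch of the two replacements mentioned above, assuming that both tags and runs of dots should become a space so that neighbouring words stay separated. The helper name `strip_markup` and the sample string are illustrative and not part of the notebook.

```python
import re

TAG_RE = re.compile(r'<.*?>')     # non-greedy, so each tag is matched separately
DOTS_RE = re.compile(r'\.{2,}')   # two or more consecutive dots

def strip_markup(text):
    """Replace HTML-like tags and runs of dots with a space."""
    text = TAG_RE.sub(' ', text)
    return DOTS_RE.sub(' ', text)

sample = "eyes or ears....and be laughing.<br /><br />So...watching"
print(strip_markup(sample).split())
```

Without the second substitution, a later filter that simply deletes non-alphabetic characters would glue `ears....and` into `earsand`.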

In [5]:
df['comment'] = df.comment.apply(lambda x: re.sub(r'<.*?>', ' ', x))

We will work with alphabetic characters only. 
Before removing everything else, we first try to convert the n't parts to not.

In [6]:
# can't is special case
df['comment'] = df.comment.apply(lambda x: re.sub(r'can\'t', 'can not', x))

In [7]:
# and the rest of negatives
df['comment'] = df.comment.apply(lambda x: re.sub(r'n\'t', ' not', x))
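
The two substitutions above are case-sensitive and run before lowercasing, so a sentence-initial "Can't" skips the first rule and is turned into "Ca not" by the second, and "won't" becomes "wo not". The sketch below shows one possible way to handle this. The names `IRREGULAR` and `expand_negations` are hypothetical, and the lookup table is an assumption that can be extended.

```python
import re

# Contractions whose stem changes when expanded (assumed, not exhaustive).
IRREGULAR = {"can't": "can not", "won't": "will not", "shan't": "shall not"}

IRREGULAR_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, IRREGULAR)) + r")\b", re.IGNORECASE
)
REGULAR_RE = re.compile(r"n't\b", re.IGNORECASE)

def expand_negations(text):
    """Rewrite n't contractions as an explicit 'not'."""
    # irregular forms first, looked up in lowercase
    text = IRREGULAR_RE.sub(lambda m: IRREGULAR[m.group(1).lower()], text)
    # then the regular pattern: didn't -> did not, isn't -> is not
    return REGULAR_RE.sub(" not", text)

print(expand_negations("Can't say I didn't like it, but I won't rewatch."))
```

The irregular forms come back in lowercase, which is harmless here because the whole text is lowercased in a later step.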

In [8]:
# now we want alpha tokens only to work with
df['comment'] = df.comment.apply(lambda x: re.sub(r'[^A-Za-z ]+', '', x))

In [9]:
# and we want to lowercase everything
df['comment'] = df.comment.apply(lambda x: x.lower())

In [45]:
df['comment'][24998]

'eight academy nominations its beyond belief i can only think it was a very bad year  even by hollywood standards with huston as director and jack nicholson and kathleen turner as leads i probably would have swallowed the bait and watched this anyway but the oscar nominations really sold it to me and i feel distinctly cheated as a result  so its a black comedy is it can anyone tell me where the humour is in prizzis honor its certainly tasteless the shooting in the head of a policemans wife is but another supposedly comic interlude in this intended farce about mafia life but with the exception of a joke about your favourite mexican cigars which i imagine is an old joke for americans who have been officially forbidden from buying anything cuban for the last  years i failed to spot anything of a comic nature  and i did try there is a lot of mafia clich but clich does not constitute humour in my book  is it a romantic comedy of sorts never the characters and their relationships are so comp

In [10]:
# squeeze multiple white-spaces into one
df['comment'] = df.comment.apply(lambda x: re.sub(' +', ' ', x))

In [48]:
df['comment'][24998]

'eight academy nominations its beyond belief i can only think it was a very bad year even by hollywood standards with huston as director and jack nicholson and kathleen turner as leads i probably would have swallowed the bait and watched this anyway but the oscar nominations really sold it to me and i feel distinctly cheated as a result so its a black comedy is it can anyone tell me where the humour is in prizzis honor its certainly tasteless the shooting in the head of a policemans wife is but another supposedly comic interlude in this intended farce about mafia life but with the exception of a joke about your favourite mexican cigars which i imagine is an old joke for americans who have been officially forbidden from buying anything cuban for the last years i failed to spot anything of a comic nature and i did try there is a lot of mafia clich but clich does not constitute humour in my book is it a romantic comedy of sorts never the characters and their relationships are so completel

Now we need to remove stopwords. We will remove language-based (English) stopwords, since we have not yet constructed a list of domain-related (movie) stopwords. 

We get the stopword list from the nltk library.

##### version 2.0 - we decided not to remove stopwords, since they might carry important information as well. 

However, we are leaving the code below (commented out), because it was tried in the first iteration of our project.

In [56]:
# stopwords = nltk.corpus.stopwords.words('english')

In [58]:
# stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

We will not process the negative forms of verbs here, since we do not want to match them as stopwords 
(later, we want to try to keep the 'not' in the sentence). 
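
A minimal sketch of stopword removal that keeps negations, as suggested above. To stay runnable without the NLTK corpus, it uses a short hand-written list as a stand-in for `nltk.corpus.stopwords.words('english')`. The names `SAMPLE_STOPWORDS`, `KEEP` and `remove_stopwords` are hypothetical.

```python
# Stand-in for the NLTK English stopword list (assumption: the real list
# is much longer and needs the NLTK stopwords corpus to be downloaded).
SAMPLE_STOPWORDS = ['i', 'it', 'is', 'the', 'a', 'was', 'not', 'no', 'but']

# Words we want to keep even though the stopword list contains them.
KEEP = {'not', 'no'}

# A set gives fast membership tests; a list would be scanned every time.
stopword_set = set(SAMPLE_STOPWORDS) - KEEP

def remove_stopwords(tokens):
    """Drop stopwords from a list of tokens, preserving negations."""
    return [tok for tok in tokens if tok not in stopword_set]

print(remove_stopwords(['the', 'plot', 'was', 'not', 'a', 'surprise']))
```

Applied to the dataframe, this would replace the nested `filter` call, for example `df.comment.apply(remove_stopwords)`.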

In [62]:
# process stopwords the same way we processed text
# stopwords = list(map(lambda x: re.sub(r'[^A-Za-z ]+', '', x),  stopwords))

In [11]:
# temporarily convert each comment into list of words (tokens) for matching
df.comment = df.comment.apply(lambda x: x.split(' '))

In [65]:
# and remove stopwords
# df.comment = df.comment.apply(lambda x: list(filter(lambda word: word not in stopwords, x)))

In the first iteration of our model, we remove language-based stopwords (for English). 
However, in the next iteration we might identify topic-related stopwords and remove them as well. 

In [66]:
df.comment[150]

['easily',
 'one',
 'ten',
 'best',
 'movies',
 'th',
 'century',
 'cold',
 'blood',
 'brilliant',
 'simplicity',
 'realism',
 'storytelling',
 'absolutely',
 'riveting',
 'robert',
 'blake',
 'walks',
 'away',
 'film',
 'story',
 'seems',
 'presented',
 'almost',
 'entirely',
 'perrys',
 'viewpoint',
 'despite',
 'dick',
 'leader',
 'planner',
 'pair',
 'viewer',
 'invariable',
 'perceive',
 'dick',
 'unstable',
 'immature',
 'generally',
 'feel',
 'like',
 'perry',
 'would',
 'pulled',
 'nightmare',
 'dick',
 'need',
 'somebody',
 'pull',
 'big',
 'score',
 'based',
 'true',
 'story',
 'particular',
 'attention',
 'accuracy',
 'cold',
 'blood',
 'depicts',
 'story',
 'behind',
 'brutal',
 'senseless',
 'murder',
 'rural',
 'kansas',
 'family',
 'one',
 'cold',
 'windy',
 'night',
 'dick',
 'bought',
 'ageold',
 'rural',
 'myth',
 'prosperous',
 'farmers',
 'safe',
 'full',
 'cash',
 'home',
 'prosecutor',
 'character',
 'given',
 'name',
 'script',
 'played',
 'geer',
 'astutely',
 '

Now we have the last preprocessing step in front of us: transforming words into their base forms. 
We have two options: lemmatizers and stemmers. 

To use a lemmatizer and get proper results, we would first need to perform POS tagging, that is, assign each word its role in the sentence. This can be tried in later iterations of our project. For now, we will go with stemming, which reduces each word to a minimal stem. One risk of stemming is that the stem is sometimes so short that the text becomes harder to interpret afterwards. However, it lets us unify different forms of the same word, so we will try it. 

In [12]:
# create stemmer and stem
stemmer = nltk.stem.porter.PorterStemmer()
df.comment = df.comment.apply(lambda x: list(map(stemmer.stem, x)))

In [13]:
# lastly, drop words shorter than 3 characters (keep only len(word) > 2)
df.comment = df.comment.apply(lambda x: list(filter(lambda word: len(word) > 2, x)))

In [69]:
df.comment[0]

['movi',
 'get',
 'respect',
 'sure',
 'lot',
 'memor',
 'quot',
 'list',
 'gem',
 'imagin',
 'movi',
 'joe',
 'piscopo',
 'actual',
 'funni',
 'maureen',
 'stapleton',
 'scene',
 'stealer',
 'moroni',
 'charact',
 'absolut',
 'scream',
 'watch',
 'alan',
 'skipper',
 'hale',
 'polic',
 'sgt']

In [14]:
# persist
df.to_csv('../data/train/comments_processed_v2.csv')
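
Note that the `comment` column now holds lists, and a CSV file stores each list as its string representation, so the column comes back as plain strings when the file is read again. The sketch below simulates that round trip with the standard library only; the in-memory buffer and the sample row are made up for illustration.

```python
import ast
import csv
import io

# Simulate what ends up in the CSV: the list is written via str().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['comment', 'sentiment'])
writer.writerow([str(['movi', 'get', 'respect']), 1])

# Read the row back, as a later notebook would.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

raw = rows[0]['comment']         # a string that only looks like a list
tokens = ast.literal_eval(raw)   # safely parse it back into a real list
print(type(raw).__name__, type(tokens).__name__, tokens)
```

With pandas, the same parsing can be done at load time by passing `converters={'comment': ast.literal_eval}` to `pd.read_csv`.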

In [15]:
del df

## Test dataset
We also need to apply the same operations to our test data set.
(This could have been a pipeline or at least a parametrizable function instead of code duplication. Maybe we will refactor in later iterations.) :)
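
A sketch of what such a parametrizable function might look like, mirroring the cleaning steps used for the training data. The name `preprocess_comment` and its parameters are hypothetical. The `stem` argument defaults to the identity function so that the sketch runs without nltk; in this notebook one would pass `stemmer.stem`.

```python
import re

def preprocess_comment(text, stem=lambda word: word, min_len=3):
    """Clean one raw review and return its list of tokens.

    `stem` is any callable that maps a word to its stem.
    """
    text = re.sub(r'<.*?>', ' ', text)        # replace HTML-like tags
    text = re.sub(r"can't", 'can not', text)  # special-case negation
    text = re.sub(r"n't", ' not', text)       # remaining negations
    text = re.sub(r'[^A-Za-z ]+', '', text)   # keep letters and spaces only
    text = text.lower()
    tokens = text.split()                     # also squeezes repeated spaces
    tokens = [stem(tok) for tok in tokens]
    # drop very short tokens, as in the training pipeline
    return [tok for tok in tokens if len(tok) >= min_len]

print(preprocess_comment("I can't believe it.<br />It wasn't bad!"))
```

Both data sets could then be processed with one line each, for example `df['comment'] = df.comment.apply(lambda x: preprocess_comment(x, stem=stemmer.stem))`.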

In [74]:
# collect records in a list first; DataFrame.append was removed in pandas 2.0
rows = []
for file in os.listdir('../data/test/pos'):
    with open(f'../data/test/pos/{file}', 'r') as input_text:
        rows.append({'comment': input_text.read(), 'sentiment': 1})

for file in os.listdir('../data/test/neg'):
    with open(f'../data/test/neg/{file}', 'r') as input_text:
        rows.append({'comment': input_text.read(), 'sentiment': 0})

df = pd.DataFrame(rows)

In [75]:
df = df.reset_index()
df = df.drop(columns=['index'])
df.to_csv('../data/test/comments.csv')

In [17]:
df = pd.read_csv('../data/test/comments.csv', index_col=0)

In [18]:
df.head(n=10)

Unnamed: 0,comment,sentiment
0,"Based on an actual story, John Boorman shows t...",1
1,This is a gem. As a Film Four production - the...,1
2,"I really like this show. It has drama, romance...",1
3,This is the best 3-D experience Disney has at ...,1
4,"Of the Korean movies I've seen, only three had...",1
5,this movie is funny funny funny my favorite qu...,1
6,I'm just starting to explore the so far wonder...,1
7,There is no need for me to repeat the synopsis...,1
8,"I got this movie with my BBC ""Jane Austen Coll...",1
9,"This was a great movie, I would compare it to ...",1


In [19]:
df['comment'] = df.comment.apply(lambda x: re.sub(r'<.*?>', ' ', x))
df['comment'] = df.comment.apply(lambda x: re.sub(r'can\'t', 'can not', x))
df['comment'] = df.comment.apply(lambda x: re.sub(r'n\'t', ' not', x))
df['comment'] = df.comment.apply(lambda x: re.sub(r'[^A-Za-z ]+', '', x))
df['comment'] = df.comment.apply(lambda x: x.lower())
df['comment'] = df.comment.apply(lambda x: re.sub(' +', ' ', x))


In [None]:
df.comment = df.comment.apply(lambda x: x.split(' '))
# df.comment = df.comment.apply(lambda x: list(filter(lambda word: word not in stopwords, x)))
df.comment = df.comment.apply(lambda x: list(map(stemmer.stem, x)))
df.comment = df.comment.apply(lambda x: list(filter(lambda word: len(word) > 2, x)))

In [None]:
df.to_csv('../data/test/comments_processed_v2.csv')

In [None]:
del df