# Natural Language Processing

We will use natural language processing (NLP) to help with our analysis of the horror movie reviews. NLP is a broad field with many applications; in our particular case I will use it to create sentiment analysis scores and to cluster movies. Let's kick this off by loading the data from our previous script.

In [4]:
# Let's load our pickled data: a list holding the reviews for our horror movies.

import pickle
import json

with open('../data/reviews_list.pkl', 'rb') as file:
    reviews_list = pickle.load(file)

# and my corresponding movie titles

with open("../data/titles_list.json", "r") as file:
    titles_list = json.load(file)


The data above contains a list of reviews for our horror movies. Each element in the list is one review, which is a long string containing several sentences.

In [5]:
print(f'{titles_list[0]} : {reviews_list[0][0:50]}') # grabbing the movie title and the first 50 characters of our first review

Ghost Ship : Sean Murphy and his crew are the top salvage exper


# Text normalization

Let's begin by processing our data. We start at the lowest level with lexical analysis, which deals with the most basic unit of text: individual words.

### Tokenization

The first step is to split each review into its individual words; in NLP jargon, these are referred to as tokens. Note that tokenization can also be done at the sentence level.

In [6]:
import nltk

nltk.download('stopwords') # stopwords corpus

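# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
Token_Pattern = r'\w+'

# gaps=False: the pattern describes the tokens themselves, not the separators between them
regex_wt = nltk.RegexpTokenizer(pattern=Token_Pattern, gaps=False)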

#word tokenization
tokenized_reviews = [regex_wt.tokenize(x) for x in reviews_list]

tokenized_reviews[0][0:9]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\marti\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['Sean', 'Murphy', 'and', 'his', 'crew', 'are', 'the', 'top', 'salvage']
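
To make the `\w+` pattern more concrete, here is a minimal sketch that uses only Python's `re` module. With `gaps=False`, the tokenizer above returns the pattern's matches, which is what `re.findall` does here. The sample text and the sentence-splitting regex are assumptions for illustration; the sentence split is deliberately naive and is not what NLTK's sentence tokenizer does.

```python
import re

# Hypothetical sample text, not taken from the review data.
sample = "The ship was empty. Or so they thought! Nobody's coming back."

# Word level: \w+ keeps runs of letters, digits and underscores,
# so punctuation is dropped and "Nobody's" splits into "Nobody" and "s".
word_tokens = re.findall(r'\w+', sample)

# Sentence level: a naive split on whitespace that follows '.', '!' or '?'.
sentence_tokens = re.split(r'(?<=[.!?])\s+', sample)

print(word_tokens)
print(sentence_tokens)
```

Note how the apostrophe splits a word in two; leftover fragments like the lone `s` are one reason a stop word pass usually follows tokenization.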

### Removing stop words

Stop words, which don't provide much value for our analysis, are identified and removed. We need a corpus, analogous to a dictionary, to help us identify stop words such as 'and', 'or', and 'either'. I also supplement the stop word corpus with additional words that are not very informative for our analysis, such as 'movie' and 'film'.

In [7]:
#function to remove stop words
from nltk.tokenize.toktok import ToktokTokenizer
import pandas as pd
import re

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

stopword_list=stopword_list + ['movie','film','horror','good','show','films','see','much','list', 'com', 'http', 'www', 'imdb',
                               'tvd','characters','one','pretty','really','thing','ever','like','definitely','also','well']


def remove_stopwords(text, is_lower_case=False, stopwords=stopword_list):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

# remove_stopwords expects a string, so join each token list back into text first
stop_words = [remove_stopwords(' '.join(x)) for x in tokenized_reviews]

normalized_reviews=[re.sub('[^A-Za-z0-9]+', ' ', x) for x in stop_words]
normalized_reviews = [x.strip() for x in normalized_reviews]

#Preview our data so far

df = pd.DataFrame({'Movie': titles_list,'Review': normalized_reviews})

df.head()

| | Movie | Review |
|---|---|---|
| 0 | Ghost Ship | Sean Murphy crew top salvage experts land sea ... |
| 1 | The Craft | SPOILERSI thought decent teen flick remember e... |
| 2 | House of 1000 Corpses | opinion House 1000 Corpses fan Fans genre Rob ... |
| 3 | The Haunting of Bly Manor | many people saying right Haunted house tales g... |
| 4 | Attack on Titan | moment watch audiovisual masterpiece immediate... |
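
The `is_lower_case` flag in `remove_stopwords` is easy to misread, so here is a small standalone sketch of the same filtering logic. The stop list, the `filter_tokens` helper, and the sample tokens are hypothetical stand-ins; the notebook itself uses NLTK's English stop words plus the custom additions.

```python
import re

# Tiny stand-in stop list; the notebook uses NLTK's English list plus custom words.
STOPWORDS = {'the', 'and', 'his', 'are', 'movie'}

def filter_tokens(tokens, is_lower_case=False, stopwords=STOPWORDS):
    """Return the tokens that are not stop words."""
    if is_lower_case:
        # Caller states the tokens are already lowercase: compare them as-is.
        return [t for t in tokens if t not in stopwords]
    # Otherwise lowercase each token for the comparison only.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ['The', 'crew', 'and', 'his', 'Movie', 'ship']
kept_default = filter_tokens(tokens)                    # case-insensitive match
kept_as_is = filter_tokens(tokens, is_lower_case=True)  # capitalized words slip through

# Same cleanup idea as the cell above: collapse anything non-alphanumeric into one space.
cleaned = re.sub('[^A-Za-z0-9]+', ' ', "crew's  ship!!").strip()

print(kept_default)
print(kept_as_is)
print(cleaned)
```

Passing `is_lower_case=True` on text that still contains capitals leaves words such as 'The' in place, which is why the default of `False` is the safer choice for these reviews.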


In [15]:
import pickle

# save the DataFrame of titles and normalized reviews to disk
with open('../data/normalized_reviews.pkl', 'wb') as file:
    pickle.dump(df, file)

### Parts of Speech Tagging

We apply a shallow form of syntactic parsing: we only look at individual tokens and do not group them into phrases. For each word we identify its part of speech, that is, whether it is a noun, a verb, an adjective, and so on. We first tag known first names as proper nouns using NLTK's <i>names</i> corpus, and fall back to NLTK's pretrained Treebank tagger (loaded below as <i>default_tagger</i>) for every token the names lookup does not cover.

We use the popular Penn Treebank tagset; the pretrained tagger was trained on the Penn Treebank corpus, which is built from Wall Street Journal text. The tag definitions can be found here:

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
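
For quick reference, the sketch below maps a handful of Penn Treebank tags to their descriptions. The `PENN_TAGS` dictionary and the `describe` helper are hypothetical names introduced for illustration, and the list is far from the full tagset.

```python
# Descriptions follow the Penn Treebank tagset; only a handful of tags are listed.
PENN_TAGS = {
    'NN': 'noun, singular or mass',
    'NNP': 'proper noun, singular',
    'JJ': 'adjective',
    'VBD': 'verb, past tense',
    'VBG': 'verb, gerund or present participle',
}

def describe(tag):
    """Return a readable description, or a placeholder for tags not listed here."""
    return PENN_TAGS.get(tag, 'not in this short list')

for tag in ['NNP', 'VBD', 'JJ', 'NN']:
    print(f'{tag}: {describe(tag)}')
```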


In [12]:
import nltk.tag, nltk.data
from nltk.corpus import names
nltk.download('maxent_treebank_pos_tagger')
# assumes the 'names' corpus and the 'punkt' models used by word_tokenize are already downloaded

tagger_path = 'taggers/maxent_treebank_pos_tagger/english.pickle'
default_tagger = nltk.data.load(tagger_path)

#Supplementing NLTK's pretrained tagger with the names corpus

mlist = names.words('male.txt')
flist = names.words('female.txt')
name_list = mlist + flist # separate variable so the imported names corpus is not shadowed
dict_names = dict.fromkeys(name_list, 'NNP')

#Names are tagged by the unigram lookup; everything else falls back to the pretrained tagger
tagger = nltk.tag.UnigramTagger(model=dict_names, backoff=default_tagger) # try BigramTagger and notice the difference

normalized_tkn = [nltk.word_tokenize(x) for x in normalized_reviews]

reviews = [tagger.tag(x) for x in normalized_tkn] #POS tagging

reviews[0][0:5]

[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     C:\Users\marti\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!


[('Sean', 'NNP'),
 ('Murphy', 'NNP'),
 ('crew', 'VBD'),
 ('top', 'JJ'),
 ('salvage', 'NN')]

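We see above that 'Sean' from the movie Ghost Ship is identified as NNP, a proper noun. Note that 'crew' is tagged VBD (past-tense verb) even though it is a noun in the original sentence; tagging after stop word removal gives the tagger less context to work with.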
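
The lookup-then-backoff idea can be sketched without NLTK. The code below is a simplified stand-in, not NLTK's implementation: the mini lexicon, the suffix rules in `fallback_tag`, and the `tag_with_backoff` helper are all assumptions made for illustration.

```python
# Hypothetical mini lexicon standing in for the names corpus.
NAME_TAGS = dict.fromkeys(['Sean', 'Murphy', 'Maureen'], 'NNP')

def fallback_tag(token):
    """Toy stand-in for the pretrained tagger: crude suffix rules only."""
    if token.endswith('ing'):
        return 'VBG'
    if token.endswith('ed'):
        return 'VBD'
    return 'NN'

def tag_with_backoff(tokens, lexicon=NAME_TAGS, backoff=fallback_tag):
    """Tag from the lexicon first; defer to the backoff for unknown tokens."""
    return [(tok, lexicon.get(tok) or backoff(tok)) for tok in tokens]

tagged = tag_with_backoff(['Sean', 'Murphy', 'salvaged', 'sinking', 'ship'])
print(tagged)
```

The lexicon only ever answers for tokens it knows, so the quality of everything else depends entirely on the backoff tagger.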

In [13]:
import pickle


with open('../data/processed_reviews.pkl', 'wb') as file:
    pickle.dump(reviews, file)
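
A later script can read these results back with `pickle.load`. The sketch below shows the round trip on a small hypothetical list of tagged tokens written to a temporary directory, so it does not touch the files in `../data`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the tagged reviews: one list of (token, tag) pairs per review.
tagged_reviews = [[('Sean', 'NNP'), ('salvage', 'NN')], [('haunted', 'JJ'), ('house', 'NN')]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'processed_reviews.pkl')

    # Write in binary mode ('wb'), matching the dump in the cell above.
    with open(path, 'wb') as file:
        pickle.dump(tagged_reviews, file)

    # Read back in binary mode ('rb'); tuples and nesting survive the round trip.
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == tagged_reviews)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.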