# Appendix

#### Alfredo Di Massimo

#### BrainStation, Data Science

#### April 4th, 2022

___

This notebook performs supporting tasks required by the other notebooks, keeping these computations out of the main analysis. The primary tasks are:
- Importing the reviews
- Creating a custom stopwords list for NLP analysis

In [2]:
# importing libraries
import numpy as np
import pandas as pd
import os, glob
from sklearn.feature_extraction import text
import nltk
import joblib

## Reviews Import

The reviews come from the [IMDb Sentiment Analysis](https://paperswithcode.com/dataset/imdb-movie-reviews) study conducted by Andrew L. Maas et al. The dataset is already divided into train and test sets for machine learning, each further separated by positive and negative sentiment. Given the sheer number of review files, this notebook serves solely to import the 50,000 individual review text files and save them to `.csv`, so that the information can be accessed in a much more computationally efficient way during data processing and exploration.

In [26]:
### WARNING: 1 HOUR CELL
# Specify directory and create a list with all file names

#TEST POSITIVE
os.chdir(r"C:\Users\Alfredo\OneDrive\Desktop\Alfredo's File\Brainstation\Winter 2022 Data Science Bootcamp\Projects\Capstone\Data\Reviews\test\pos")
filenames_test_pos = glob.glob("*.txt")
# Each file is read into its own column (named after the file), so the result is a wide, mostly empty dataframe
df_test_pos = pd.concat([pd.read_table(item, names=[item]) for item in filenames_test_pos])

In [30]:
df_test_pos.head()

Unnamed: 0,0_10.txt,10000_7.txt,10001_9.txt,10002_8.txt,10003_8.txt,10004_9.txt,10005_8.txt,10006_7.txt,10007_10.txt,10008_8.txt,...,9993_10.txt,9994_10.txt,9995_8.txt,9996_10.txt,9997_10.txt,9998_8.txt,9999_10.txt,999_8.txt,99_10.txt,9_7.txt
0,I went and saw this movie last night after bei...,,,,,,,,,,...,,,,,,,,,,
0,,Actor turned director Bill Paxton follows up h...,,,,,,,,,...,,,,,,,,,,
0,,,As a recreational golfer with some knowledge o...,,,,,,,,...,,,,,,,,,,
0,,,,"I saw this film in a sneak preview, and it is ...",,,,,,,...,,,,,,,,,,
0,,,,,Bill Paxton has taken the true story of the 19...,,,,,,...,,,,,,,,,,


In [34]:
df_test_pos.to_csv('df_test_pos.csv')
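
The wide layout above gives every review its own column, so a frame of *n* reviews holds *n* × *n* cells, nearly all of them empty, which likely contributes to the long run time. A minimal sketch of a long-format alternative, with one row per review, is shown below. It uses only the standard library; the function name `read_review_records` and the keys `id`, `rating`, and `review` are hypothetical, and it assumes the `<id>_<rating>.txt` naming visible in the column headers.

```python
from pathlib import Path


def read_review_records(directory):
    """Return one dict per review file found in `directory` (hypothetical helper)."""
    records = []
    for path in sorted(Path(directory).glob("*.txt")):
        # File names are assumed to look like "<id>_<rating>.txt", e.g. "0_10.txt"
        review_id, rating = path.stem.split("_")
        records.append({
            "id": int(review_id),
            "rating": int(rating),
            "review": path.read_text(encoding="utf-8"),
        })
    return records
```

Passing the result to `pd.DataFrame(records)` would give a three-column frame with one row per review, which could then be saved with `to_csv` as above.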

In [27]:
### WARNING: 1 HOUR CELL
#TEST NEGATIVE
os.chdir(r"C:\Users\Alfredo\OneDrive\Desktop\Alfredo's File\Brainstation\Winter 2022 Data Science Bootcamp\Projects\Capstone\Data\Reviews\test\neg")
filenames_test_neg = glob.glob("*.txt")
df_test_neg = pd.concat([pd.read_table(item, names=[item]) for item in filenames_test_neg])

In [31]:
df_test_neg.head()

Unnamed: 0,0_2.txt,10000_4.txt,10001_1.txt,10002_3.txt,10003_3.txt,10004_2.txt,10005_2.txt,10006_2.txt,10007_4.txt,10008_4.txt,...,9993_2.txt,9994_3.txt,9995_2.txt,9996_2.txt,9997_2.txt,9998_1.txt,9999_1.txt,999_3.txt,99_3.txt,9_4.txt
0,Once again Mr. Costner has dragged out a movie...,,,,,,,,,,...,,,,,,,,,,
0,,This is an example of why the majority of acti...,,,,,,,,,...,,,,,,,,,,
0,,,"First of all I hate those moronic rappers, who...",,,,,,,,...,,,,,,,,,,
0,,,,Not even the Beatles could write songs everyon...,,,,,,,...,,,,,,,,,,
0,,,,,Brass pictures (movies is not a fitting word f...,,,,,,...,,,,,,,,,,


In [35]:
df_test_neg.to_csv('df_test_neg.csv')

In [28]:
### WARNING: 1 HOUR CELL
#TRAIN POSITIVE
os.chdir(r"C:\Users\Alfredo\OneDrive\Desktop\Alfredo's File\Brainstation\Winter 2022 Data Science Bootcamp\Projects\Capstone\Data\Reviews\train\pos")
filenames_train_pos = glob.glob("*.txt")
df_train_pos = pd.concat([pd.read_table(item, names=[item]) for item in filenames_train_pos])

In [32]:
df_train_pos.head()

Unnamed: 0,0_9.txt,10000_8.txt,10001_10.txt,10002_7.txt,10003_8.txt,10004_8.txt,10005_7.txt,10006_7.txt,10007_7.txt,10008_7.txt,...,9993_10.txt,9994_10.txt,9995_10.txt,9996_9.txt,9997_7.txt,9998_9.txt,9999_8.txt,999_10.txt,99_8.txt,9_7.txt
0,Bromwell High is a cartoon comedy. It ran at t...,,,,,,,,,,...,,,,,,,,,,
0,,Homelessness (or Houselessness as George Carli...,,,,,,,,,...,,,,,,,,,,
0,,,Brilliant over-acting by Lesley Ann Warren. Be...,,,,,,,,...,,,,,,,,,,
0,,,,This is easily the most underrated film inn th...,,,,,,,...,,,,,,,,,,
0,,,,,This is not the typical Mel Brooks film. It wa...,,,,,,...,,,,,,,,,,


In [36]:
df_train_pos.to_csv('df_train_pos.csv')

In [29]:
### WARNING: 1 HOUR CELL
#TRAIN NEGATIVE
os.chdir(r"C:\Users\Alfredo\OneDrive\Desktop\Alfredo's File\Brainstation\Winter 2022 Data Science Bootcamp\Projects\Capstone\Data\Reviews\train\neg")
filenames_train_neg = glob.glob("*.txt")
df_train_neg = pd.concat([pd.read_table(item, names=[item]) for item in filenames_train_neg])

In [33]:
df_train_neg.head()

Unnamed: 0,0_3.txt,10000_4.txt,10001_4.txt,10002_1.txt,10003_1.txt,10004_3.txt,10005_3.txt,10006_4.txt,10007_1.txt,10008_2.txt,...,9993_4.txt,9994_2.txt,9995_1.txt,9996_4.txt,9997_2.txt,9998_4.txt,9999_3.txt,999_3.txt,99_1.txt,9_1.txt
0,Story of a man who has unnatural feelings for ...,,,,,,,,,,...,,,,,,,,,,
0,,Airport '77 starts as a brand new luxury 747 p...,,,,,,,,,...,,,,,,,,,,
0,,,This film lacked something I couldn't put my f...,,,,,,,,...,,,,,,,,,,
0,,,,"Sorry everyone,,, I know this is supposed to b...",,,,,,,...,,,,,,,,,,
0,,,,,When I was little my parents took me along to ...,,,,,,...,,,,,,,,,,


In [37]:
df_train_neg.to_csv('df_train_neg.csv')

___

## Creating a list of stop words to be applied during vectorization
These are words that occur frequently in the English language or within the film industry, or that simply provide no additional insight. We can begin by importing *stopwords* from **nltk.corpus** and building upon it.

In [4]:
# Requires a one-time nltk.download('stopwords') if the corpus is not already installed
from nltk.corpus import stopwords

ENGLISH_STOP_WORDS = stopwords.words('english')
print(len(ENGLISH_STOP_WORDS))

179


With this pre-defined list of stop-words, we can add stop-words more specific to the film industry, the names of the genres, and adjectives and adverbs that provide no additional insight beyond their inherent meaning.

In [5]:
ENGLISH_STOP_WORDS.extend(['br', 'bad', 'beautiful', 'best', 'better', 'going', 'great', 'truly', 
                           'movies', 'movie', 'film', 'films', 'really', 'make', 'don', 'watch', 'seen', 
                           'actually', 'way', 'screen', 'quite', 'lot', 'drama', 'comedy', 
                           'action', 'romance', 'crime', 'horror', 'thriller', 'adventure', 'fantasy', 'mystery', 
                           'sci-fi', 'family', 'biography', 'music', 'war', 'history', 'animation', 'musical', 
                           'western', 'sport', 'documentary', 'film-noir', 'news', 'adult', 'one', 'like', 'good', 
                           'even', 'get', 'see', 'would', 'much', 'well', 'also', 'dont', 'could', '310', '410', 
                           'terrible', 'worst', 'waste', 'awful', 'terrible', 'horrible', 'boring', 'stupid', 
                           'worse', 'disappointing', 'excellent', 'wonderful', '710', 'favorite', 'perfect', 
                           'loved', 'amazing', 'enjoyed', '810', '1010', 'awesome'])

This will be an iterative process; should additional stop-words appear after vectorizing, I can add them to this list and modify my tokenizer.

In [6]:
print(len(ENGLISH_STOP_WORDS))

260
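
The total of 260 counts repeated entries, since `extend` appends every word whether or not it is already present ('terrible', for example, appears twice in the extension above). As the list grows over several iterations, a small helper can keep it free of duplicates. The following is a sketch only; `add_stop_words` is a hypothetical name and the short lists are stand-ins for the real ones.

```python
def add_stop_words(base, extra):
    """Return base plus any words from extra not already present, preserving order."""
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys([*base, *extra]))


base_words = ["the", "and", "don"]                       # stand-in for the NLTK list
extra_words = ["terrible", "don", "terrible", "movie"]   # stand-in for the custom additions

combined = add_stop_words(base_words, extra_words)
print(combined)  # ['the', 'and', 'don', 'terrible', 'movie']
```

Duplicates do not change which tokens are filtered; removing them only keeps the list, and its reported length, accurate.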


We can now save this list as a `.pkl` file for import into our modeling notebook:

In [7]:
joblib.dump(ENGLISH_STOP_WORDS, 'stopwords.pkl')

['stopwords.pkl']
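
Whether an entry in the saved list ever matches depends on how the modeling notebook tokenizes the reviews. The sketch below illustrates this with the standard library only: it assumes a regex tokenizer that keeps runs of two or more word characters (the same pattern scikit-learn's `CountVectorizer` uses by default), which may differ from the tokenizer actually used. The function name `filter_tokens` and the sample sentence are made up for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # assumed tokenizer: runs of 2+ word characters


def filter_tokens(text, stop_words):
    """Lowercase, tokenize, and drop any token found in stop_words."""
    stop_set = set(stop_words)  # set lookup is fast for repeated membership checks
    return [tok for tok in TOKEN_PATTERN.findall(text.lower()) if tok not in stop_set]


sample = "A great sci-fi movie, 7/10"
print(filter_tokens(sample, ["great", "movie", "sci-fi", "710"]))
# ['sci', 'fi', '10']
```

Under this assumed tokenizer the hyphenated entry `sci-fi` and the rating token `710` never match, so entries of that kind are only useful if the tokenizer produces tokens in the same form.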