# TRANSFORMING AND CLEANING THE DATASET
This notebook shows how we cleaned and transformed the dataset. Tokenizing the reviews and removing stop words are the most important cleaning steps for getting the model to predict sentiment as accurately as possible.

In [1]:
#Import all dependencies
import pandas as pd
import re
import io 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
import string 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from functools import reduce


In [2]:
stop_words = set(stopwords.words("english"))
punctuation = string.punctuation

#Create a function to clean the dataset: tokenize words and remove stop words and punctuation
def tokenize_words(text, stop_words, punctuation):
    text = text.lower()
    text = text.replace("<br />", " ")    #drop HTML line breaks
    text = re.sub(r"[^a-z ]", " ", text)  #keep only lowercase letters and spaces
    text = re.sub(r" +", " ", text)       #collapse repeated spaces
    tokens = word_tokenize(text)
    filtered = [w for w in tokens if w not in stop_words and w not in punctuation]
    #" ".join returns "" for an empty list; reduce() without an initial value raises TypeError
    return " ".join(filtered)
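
To make the cleaning steps concrete, here is a minimal sketch that applies the same lowercase, strip, and filter sequence to a single made-up review. The name `clean_review` and the tiny `SAMPLE_STOP_WORDS` set are hypothetical stand-ins chosen so the example runs without downloading NLTK data; the notebook itself uses `word_tokenize` and NLTK's English stop-word list.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list instead.
SAMPLE_STOP_WORDS = {"this", "was", "a", "and", "the", "it", "i"}

def clean_review(text, stop_words):
    """Lowercase, drop HTML breaks and non-letters, then remove stop words."""
    text = text.lower().replace("<br />", " ")
    text = re.sub(r"[^a-z ]", " ", text)   # anything that is not a-z or space becomes a space
    tokens = text.split()                  # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = clean_review("This was a GREAT movie!<br />I loved it.", SAMPLE_STOP_WORDS)
print(cleaned)  # great movie loved
```

Because the regular expression has already removed every character except letters and spaces, a plain whitespace split should give the same tokens as a full tokenizer at this point.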

In [3]:
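#word_tokenize accepts a string as input, not a file.
#Strip non-letter characters (such as apostrophes) from the stop words.
#Keeping them in a set makes the membership test in tokenize_words fast.
stop_words = set(stopwords.words('english'))
stop_words = {re.sub(r"[^a-z ]", "", w) for w in stop_words}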

#Read in the .txt files (note: these are the training files, even though the variables are named test_*)
test_neg_path = "../train_neg.txt" 
test_pos_path = "../train_pos.txt" #need to change this

test_neg_df = pd.read_table(test_neg_path, sep="\n", header=None, names=['Reviews'])
test_pos_df = pd.read_table(test_pos_path, sep="\n", header=None, names=['Reviews'])

#Encoding each review with 0 for negative and 1 for positive 
test_neg_df['Encoding'] = 0
test_pos_df['Encoding'] = 1

#Concatenating both negative and positive reviews to insert into a dataframe
test_df = pd.concat([test_neg_df, test_pos_df])

#Tokenize words, removing stop words, removing punctuation and creating the dataframe
test_df['Reviews (Cleaned)'] = test_df['Reviews'].apply(tokenize_words, args=(stop_words, punctuation))

test_df

Unnamed: 0,Reviews,Encoding,Reviews (Cleaned)
0,Working with one of the best Shakespeare sourc...,0,working one best shakespeare sources film mana...
1,"Well...tremors I, the original started off in ...",0,well tremors original started found movie quit...
2,Ouch! This one was a bit painful to sit throug...,0,ouch one bit painful sit cute amusing premise ...
3,"I've seen some crappy movies in my life, but t...",0,seen crappy movies life one must among worst d...
4,Carriers follows the exploits of two guys and ...,0,carriers follows exploits two guys two gals st...
...,...,...,...
12495,About a year ago I finally gave up on American...,1,year ago finally gave american television thou...
12496,When I saw the elaborate DVD box for this and ...,1,saw elaborate dvd box dreadful red queen figur...
12497,"Last November, I had a chance to see this film...",1,last november chance see film reno film festiv...
12498,Great movie -I loved it. Great editing and use...,1,great movie loved great editing use soundtrack...
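
The loading and labeling step can also be sketched without `pd.read_table(..., sep="\n")`, which some newer pandas releases reject. The version below reads one review per line from any file-like object; `load_reviews` is a hypothetical helper, and `io.StringIO` stands in for the real files so the example is self-contained.

```python
import io
import pandas as pd

def load_reviews(handle, label):
    """Read one review per line and attach a 0/1 sentiment label."""
    reviews = [line.rstrip("\n") for line in handle if line.strip()]
    return pd.DataFrame({"Reviews": reviews, "Encoding": label})

# In-memory stand-ins for the negative and positive review files.
neg = load_reviews(io.StringIO("bad film\nawful plot\n"), 0)
pos = load_reviews(io.StringIO("great film\n"), 1)

# ignore_index=True gives every row a unique label; a plain concat
# repeats the index values of both frames.
reviews_df = pd.concat([neg, pos], ignore_index=True)
print(reviews_df)
```

With the real data, `open(path, encoding="utf-8")` would take the place of `io.StringIO`. The encoding is an assumption about the files.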


In [4]:
#Creating a for loop to find the word frequency for tokenized words for visualization purposes
wordfreq = {}
for sentence in test_df['Reviews (Cleaned)']:
    tokens = word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq.keys():
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

KeyboardInterrupt: 

In [None]:
#Top 200 most frequent words 
import heapq
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)

#Create dataframe and save into csv 
df_new = pd.DataFrame.from_dict(wordfreq, orient="index")
df_new.to_csv('word_frequency.csv', index=True)
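
The cleaned reviews are already space-separated tokens, so the counting step may not need a second pass through `word_tokenize`. As a sketch, assuming tokens are separated by single spaces, `collections.Counter` does the same bookkeeping and also provides the top-N lookup. The three sample reviews are made up.

```python
from collections import Counter

# Made-up cleaned reviews standing in for test_df['Reviews (Cleaned)'].
cleaned_reviews = ["great movie loved", "bad movie", "great great cast"]

wordfreq = Counter()
for review in cleaned_reviews:
    wordfreq.update(review.split())   # split on whitespace; text is already cleaned

top_two = wordfreq.most_common(2)     # list of (word, count) pairs, highest first
print(top_two)  # [('great', 3), ('movie', 2)]
```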

# TESTING OUR MODEL 
The code below shows how we vectorized the cleaned reviews, fit a logistic regression model, and checked its accuracy.

In [5]:
reviews_np = test_df['Reviews (Cleaned)']

In [6]:
# Vectorizing our words
CV = CountVectorizer(input="content", lowercase=False)
CV

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=False, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [7]:
#Build the document-term count matrix (bag of words); no scaling is applied here
cv_matrix = CV.fit_transform(reviews_np)
cv_matrix = cv_matrix.toarray()
cv_matrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
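
Calling `.toarray()` turns the sparse result of `fit_transform` into a dense array. With a 73,081-word vocabulary, `int64` entries, and about 25,000 rows (as the index values above suggest), that would come to roughly 14 GB. The sketch below uses three made-up documents to show that the sparse matrix can be inspected directly. Treat it as an illustration of the alternative, not as what the notebook did.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great movie loved", "bad movie", "great great cast"]  # made-up reviews

vectorizer = CountVectorizer(lowercase=False)
counts = vectorizer.fit_transform(docs)   # SciPy sparse matrix, one row per document
vocab_index = vectorizer.vocabulary_      # maps each word to its column index

print(issparse(counts), counts.shape)
print(counts[2, vocab_index["great"]])    # how often "great" occurs in the third document
```

scikit-learn's `LogisticRegression.fit` accepts sparse input, so the dense conversion is only needed for steps that truly require a NumPy array.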

In [8]:
#Depicts term frequency vector for each review (bag of words)
vocab = CV.get_feature_names()
df_reviews = pd.DataFrame(cv_matrix, columns=vocab)
df_reviews.head(500)

Unnamed: 0,aa,aaa,aaaaaaah,aaaaah,aaaaatch,aaaahhhhhhh,aaaand,aaaarrgh,aaah,aaargh,...,zyuranger,zz,zzzz,zzzzz,zzzzzzzz,zzzzzzzzzzzz,zzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
497,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
#Make a new column in the dataframe that stores each review's count vector
test_df['matrix'] = list(cv_matrix)

In [10]:
#Logistic Regression 
LogisticRegression

sklearn.linear_model.logistic.LogisticRegression

In [11]:
# Set variables to train dataset
X_train = cv_matrix
y_train = test_df['Encoding']

In [12]:
#Create model variable
model = LogisticRegression()

In [13]:
#Fit linear model 
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [14]:
#Check model accuracy on the training data (the same rows the model was fit on)
model.score(X_train, y_train)

0.99824
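
The score above is measured on the rows the model was trained on, so it may overstate how well the model handles unseen reviews. A common check is to hold out part of the data. The sketch below does this on a tiny made-up corpus; the texts, labels, split size, and `max_iter` value are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up cleaned reviews with 0 = negative, 1 = positive.
texts = ["great movie loved", "wonderful cast great", "loved story great", "great fun film",
         "bad movie boring", "awful plot bad", "boring awful film", "bad acting awful"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_txt, test_txt, y_fit, y_held_out = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

vectorizer = CountVectorizer(lowercase=False)
X_fit = vectorizer.fit_transform(train_txt)   # learn the vocabulary on training rows only
X_held_out = vectorizer.transform(test_txt)   # reuse that vocabulary for held-out rows

clf = LogisticRegression(max_iter=1000)
clf.fit(X_fit, y_fit)
held_out_accuracy = clf.score(X_held_out, y_held_out)
print(held_out_accuracy)
```

Fitting the vectorizer on the training portion only keeps the held-out vocabulary from leaking into the model.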

In [17]:
#Pick one review to inspect (a later cell that calls predict_proba uses review_ex)
review_ex = test_df.iloc[3620]
review_ex

Reviews              When childhood memory tells you this was a sca...
Encoding                                                             0
Reviews (Cleaned)    childhood memory tells scary movie touch go wh...
matrix               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: 3620, dtype: object

In [18]:
# model_example = model.coef_

In [19]:
# model_example

array([[-1.95057656e-01, -1.21714029e-01, -2.21655083e-07, ...,
         3.20608939e-06,  3.20608939e-06,  3.20608939e-06]])

In [26]:
# len(review_ex['matrix'])

73081

In [23]:
# import numpy as np

In [25]:
# np.dot(review_ex['matrix'], model_example[0])

-2.537607604777733

In [28]:
# strength = [ model_example[0][i] * review_ex['matrix'][i] for i in range(73081)]

In [31]:
# look = pd.DataFrame({"Weights": strength, "Vocab": vocab})


Unnamed: 0,Weights,Vocab
0,-0.0,aa
1,-0.0,aaa
2,-0.0,aaaaaaah
3,0.0,aaaaah
4,0.0,aaaaatch
...,...,...
73076,-0.0,zzzzzzzzzzzz
73077,-0.0,zzzzzzzzzzzzz
73078,0.0,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
73079,0.0,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
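
Each entry in the weights table is a coefficient multiplied by the word's count in the review. The same per-word contributions can be computed in one vectorized step instead of a Python loop over a hard-coded length. The arrays below are made-up numbers, not values from the model.

```python
import numpy as np

coefficients = np.array([0.5, -1.0, 0.25])  # stand-in for model.coef_[0]
counts = np.array([2, 1, 0])                # stand-in for one review's count vector

strength = coefficients * counts            # element-wise: contribution of each word
total = strength.sum()                      # equals the dot product of the two vectors

print(strength, total)
```

Words that do not occur in the review have a count of zero and therefore contribute nothing, which is why most rows in the table are `0.0` or `-0.0`.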


In [44]:
# review_weights = look.loc[look['Weights'] != 0]

In [45]:
# review_weights.max()

Weights    1.61446
Vocab         work
dtype: object

In [46]:
# review_weights.min()

Weights   -1.78048
Vocab       across
dtype: object
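
One caveat when reading these two outputs: `DataFrame.max()` and `DataFrame.min()` work on each column independently. The word printed next to an extreme weight is the alphabetically last (or first) word in the frame, not necessarily the word that carries that weight. The sketch below shows the difference on a made-up three-row frame and uses `idxmax` and `idxmin` to fetch the matching rows.

```python
import pandas as pd

# Made-up weights; the column names mirror the notebook's frame.
look = pd.DataFrame({"Weights": [0.4, -1.2, 0.9],
                     "Vocab": ["zebra", "apple", "movie"]})

columnwise_max = look.max()                         # max of each column separately
top_row = look.loc[look["Weights"].idxmax()]        # row holding the largest weight
bottom_row = look.loc[look["Weights"].idxmin()]     # row holding the smallest weight

print(columnwise_max["Vocab"], top_row["Vocab"], bottom_row["Vocab"])
# zebra movie apple
```

Applied to the notebook's frame, `review_weights.loc[review_weights["Weights"].idxmax()]` would return the word that actually has the largest contribution.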

In [51]:
# review_weights["Weights"].sum()

-2.537607604777733

In [52]:
# review_weights.to_csv("example_weights.csv")

In [54]:
# review_ex['Reviews']

'When childhood memory tells you this was a scary movie; it\'s touch and go whether you should revisit it. Anyway, I remembered a scary scene involving a homeless person and a cool villain played by Jeff Kober.<br /><br />"The First Power" is not a very good movie, sad to say. It\'s chock full of those cop clichés and a very poor script with holes a truck could drive through (along with countless convenient "twists" that help the story run along). Lou Diamond Phillips is the over-confident bad ass cop who sends baddie serial killer Kober to the gas chamber only to find out he was a minion of Satan himself and now has the power of resurrection along with the power of possessing every weak minded person who he comes across. Through in the mix a very poorly realized psychic who helps with the case.<br /><br />Ahhh, this is trash. But enjoyable as such, especially if you have fond memories of it. It scared me as a kid and that scene with the homeless person is still pretty good. As for any

In [55]:
# review_ex

Reviews              When childhood memory tells you this was a sca...
Encoding                                                             0
Reviews (Cleaned)    childhood memory tells scary movie touch go wh...
matrix               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Name: 3620, dtype: object

In [60]:
model.predict_proba(review_ex['matrix'].reshape(1,-1))

array([[0.9262405, 0.0737595]])
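
The probability above and the dot product computed earlier are linked by the logistic function. For a binary logistic regression, the positive-class probability is the sigmoid of the dot product plus the model's intercept. The sketch below works backwards from the two numbers printed in this notebook to estimate the intercept, which is not shown anywhere above. The variable names are hypothetical, and the result is only as precise as the rounded probability.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

score_without_intercept = -2.537607604777733  # np.dot output shown earlier
p_positive = 0.0737595                        # second column of predict_proba

# Invert the sigmoid to get the full score (the log-odds), then subtract
# the dot product to see what the intercept must have been.
logit = math.log(p_positive / (1.0 - p_positive))
implied_intercept = logit - score_without_intercept

recovered = sigmoid(score_without_intercept + implied_intercept)
print(implied_intercept, recovered)
```

The implied intercept comes out as a small positive number, so for this review the prediction is driven almost entirely by the word weights.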