# Load user data, preprocess, make feature vector

Preprocessing steps:
 - Remove escape characters
 Needed when the user breaks their input into multiple paragraphs.
 https://stackoverflow.com/questions/8115261/how-to-remove-all-the-escape-sequences-from-a-list-of-strings
 
 - Remove punctuation, e.g. (don't) --> dont
 Take care not to break up words that contain apostrophes.
 
 - Convert to lowercase
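
The apostrophe caveat can be illustrated with a small sketch. The sample sentence and the names `APOSTROPHE`, `joined` and `split` are made up for this illustration: deleting apostrophes keeps a contraction as one token, while replacing them with a space breaks it into fragments.

```python
import re

APOSTROPHE = re.compile(r"'")

sample = "Don't break it, she didn't commit."

# Delete apostrophes: contractions stay together as single tokens
joined = APOSTROPHE.sub("", sample.lower())

# Replace apostrophes with a space: contractions are split into fragments
split = APOSTROPHE.sub(" ", sample.lower())

print(joined.split())
print(split.split())
```

The stray "t" tokens produced by the second variant carry no meaning on their own, which is why the pipeline below deletes apostrophes rather than replacing them with a space.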

References:
* https://machinelearningmastery.com/clean-text-machine-learning-python/
* https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184

### Load the data

In [1]:
reviews_train = []
# One review per line; the context manager closes the file when done
with open('../data/movie_data/full_train.txt', 'r') as f:
    for line in f:
        reviews_train.append(line.strip())

In [2]:
reviews_test = []
# One review per line; the context manager closes the file when done
with open('../data/movie_data/full_test.txt', 'r') as f:
    for line in f:
        reviews_test.append(line.strip())

In [3]:
user_data = reviews_test[75]

In [4]:
reviews_test[75]

'"What\'s his name?" "Loudon." "Loudon what?" "Clear."<br /><br />That gag still gets me, TWENTY ONE years after the film was released.<br /><br />I loved the film back then and I love it today. I must have watched this a hundred times back in the day, and when I bought the DVD recently I could still remember some of the dialogue.<br /><br />Madonna plays Nikki Finn, a young woman jailed for a crime she didn\'t commit. When she gets out she decides to seek revenge.<br /><br />Griffin Dunne (whatever happened to him?), plays an attorney for his fiancée\'s father (John McMartin). The future father-in-law asks Loudon to take Nikki from prison to the bus station and to make sure she gets on the bus, as part of a supposed new public relations programme. A seemingly easy task, but there are complications aplenty, some funny dialogue, and some admittedly stupid-but-funny scenes along the way.<br /><br />Madonna has a stupid voice in this film, which until I was able to watch with subtitles ma

### Remove escape sequences:

In [6]:
escapes = ''.join([chr(char) for char in range(1, 32)]) 
# ASCII control characters (including \t, \n and \r) have code points 1 to 31

In [7]:
escapes

'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f'

In [8]:
test_string = 'Hello, \n \tDoes this really \nwork?'

In [9]:
def remove_escapes(input_string):
    escapes = ''.join([chr(char) for char in range(1, 32)]) 
    # ASCII control characters (including \t, \n and \r) have code points 1 to 31
    
    translator = str.maketrans('', '', escapes) # Different syntax for py2/ py3 / py3.1 +
    output_string = input_string.translate(translator)
    return output_string

In [10]:
remove_escapes(test_string)

'Hello,  Does this really work?'
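
One thing to watch: `remove_escapes` deletes control characters outright, so two words separated only by a newline would be glued together. A possible variant, sketched here under the assumption that a space is an acceptable substitute, maps each control character to a space and then collapses repeated spaces. The name `replace_escapes` is hypothetical and not part of the notebook.

```python
import re

# Same set as in remove_escapes: ASCII control characters 1..31
CONTROL_CHARS = ''.join(chr(c) for c in range(1, 32))

def replace_escapes(input_string):
    """Hypothetical variant: map control characters to spaces, then collapse runs of spaces."""
    translator = str.maketrans(CONTROL_CHARS, ' ' * len(CONTROL_CHARS))
    spaced = input_string.translate(translator)
    return re.sub(r' +', ' ', spaced).strip()

print(replace_escapes('First paragraph.\nSecond paragraph.'))
```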

In [11]:
remove_escapes(reviews_test[75])

'"What\'s his name?" "Loudon." "Loudon what?" "Clear."<br /><br />That gag still gets me, TWENTY ONE years after the film was released.<br /><br />I loved the film back then and I love it today. I must have watched this a hundred times back in the day, and when I bought the DVD recently I could still remember some of the dialogue.<br /><br />Madonna plays Nikki Finn, a young woman jailed for a crime she didn\'t commit. When she gets out she decides to seek revenge.<br /><br />Griffin Dunne (whatever happened to him?), plays an attorney for his fiancée\'s father (John McMartin). The future father-in-law asks Loudon to take Nikki from prison to the bus station and to make sure she gets on the bus, as part of a supposed new public relations programme. A seemingly easy task, but there are complications aplenty, some funny dialogue, and some admittedly stupid-but-funny scenes along the way.<br /><br />Madonna has a stupid voice in this film, which until I was able to watch with subtitles ma

### Remove punctuation. Remove extra characters:

In [13]:
import re

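def preprocess_reviews(input_string):
    """
    Lowercase a review, strip control characters, delete punctuation and digits,
    and replace <br /> pairs, hyphens and slashes with a space.
    """
    # Raw strings avoid invalid-escape warnings for sequences such as \. and \d
    REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\_)|(\d+)")
    REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")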
    NO_SPACE = ""
    SPACE = " "
    
    input_string = remove_escapes(input_string)
    input_string = REPLACE_NO_SPACE.sub(NO_SPACE, input_string.lower())
    input_string = REPLACE_WITH_SPACE.sub(SPACE, input_string)
    
    return input_string

In [14]:
reviews_train_clean = [preprocess_reviews(line) for line in reviews_train]

In [15]:
reviews_test_clean = [preprocess_reviews(line) for line in reviews_test]

In [16]:
reviews_test_clean[75]

'whats his name loudon loudon what clear that gag still gets me twenty one years after the film was released i loved the film back then and i love it today i must have watched this a hundred times back in the day and when i bought the dvd recently i could still remember some of the dialogue madonna plays nikki finn a young woman jailed for a crime she didnt commit when she gets out she decides to seek revenge griffin dunne whatever happened to him plays an attorney for his fiancées father john mcmartin the future father in law asks loudon to take nikki from prison to the bus station and to make sure she gets on the bus as part of a supposed new public relations programme a seemingly easy task but there are complications aplenty some funny dialogue and some admittedly stupid but funny scenes along the way madonna has a stupid voice in this film which until i was able to watch with subtitles made one or two lines of dialogue incomprehensible for me hence only   but on the other hand i ca

In [None]:
# The "100" is gone: the regex removes all digits.

# The word "fiancées" still carries its accent, which could lead to complications later.
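
The accent could be handled in plain Python with the standard `unicodedata` module. This is a sketch of the general technique (NFKD decomposition, then dropping combining marks), which as far as I know is also roughly what `strip_accents='unicode'` in `CountVectorizer` does. The helper name `strip_accents` is an assumption for this example.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('fiancées'))
```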

In [17]:
processed_user_data = preprocess_reviews(user_data)

### Get feature vector:
- Do the extra things from his next session ?

ref:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


Save the count vectorizer object

### Extra: 
 - Use TFIDF transformer. 
This changes the values in the feature vector. Instead of raw counts, each entry is the term's frequency in the document multiplied by its inverse document frequency, so terms that appear in many documents across the corpus are down-weighted.
May help if someone wrote Ashton Kutcher a million times as their review. https://stackoverflow.com/questions/29788047/keep-tfidf-result-for-predicting-new-content-using-scikit-for-python
 - Think about how to deal with emoji
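
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which I believe matches scikit-learn's default, and it leaves out the final L2 normalization step. The corpus and the helper names `idf` and `tfidf` are assumptions for illustration only.

```python
import math
from collections import Counter

docs = [
    "great film great acting",
    "awful film",
    "kutcher kutcher kutcher kutcher",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed variant: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc_tokens):
    # Term count in this document times the corpus-level idf
    counts = Counter(doc_tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[0])
print(weights)
```

"film" appears in two documents and so gets a lower idf than "great". In this unnormalized sketch a term repeated many times in one review still receives a large weight; normalization, or binary counts as used in this notebook, is what limits that effect.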

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle

In [20]:
vectorizer = CountVectorizer(binary=True, decode_error="replace", strip_accents='unicode', 
                             stop_words='english', min_df=0.005 
                            )
# min_df=0.005 drops terms that appear in fewer than 0.5% of the reviews, which removes most rare proper nouns
vectorizer.fit(reviews_train_clean)
X = vectorizer.transform(reviews_train_clean)

In [21]:
X.shape

(25000, 2920)

In [22]:
X[0]

<1x2920 sparse matrix of type '<class 'numpy.int64'>'
	with 36 stored elements in Compressed Sparse Row format>
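
The effect of `min_df` is easier to see on a toy corpus. In this sketch (the four documents are made up), `min_df=0.5` keeps only terms that occur in at least half of the documents, so the rare name and the one-off words drop out of the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = [
    "great film",
    "great acting",
    "awful film",
    "kutcher stars",
]

# min_df=0.5 keeps only terms found in at least half of the 4 documents
toy_vectorizer = CountVectorizer(binary=True, min_df=0.5)
toy_vectorizer.fit(toy_corpus)

print(sorted(toy_vectorizer.vocabulary_))
```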

In [30]:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# You can provide a fixed vocabulary too
# So for reuse, let's store this vocabulary 

In [23]:
vectorizer.vocabulary_

{'high': 1206,
 'cartoon': 348,
 'comedy': 457,
 'ran': 2019,
 'time': 2621,
 'school': 2196,
 'life': 1489,
 'years': 2907,
 'lead': 1457,
 'believe': 223,
 'satire': 2176,
 'closer': 439,
 'reality': 2040,
 'survive': 2520,
 'students': 2462,
 'right': 2125,
 'pathetic': 1836,
 'situation': 2315,
 'remind': 2082,
 'knew': 1421,
 'saw': 2183,
 'episode': 813,
 'student': 2461,
 'tried': 2671,
 'immediately': 1284,
 'classic': 426,
 'line': 1500,
 'im': 1276,
 'welcome': 2833,
 'expect': 857,
 'adults': 43,
 'age': 53,
 'think': 2599,
 'far': 916,
 'pity': 1876,
 'isnt': 1352,
 'george': 1069,
 'stated': 2408,
 'issue': 1353,
 'plan': 1881,
 'help': 1192,
 'street': 2447,
 'considered': 493,
 'human': 1253,
 'did': 667,
 'going': 1091,
 'work': 2881,
 'vote': 2783,
 'matter': 1598,
 'people': 1843,
 'just': 1396,
 'lost': 1531,
 'cause': 360,
 'things': 2598,
 'war': 2799,
 'kids': 1408,
 'succeed': 2477,
 'technology': 2556,
 'end': 785,
 'streets': 2448,
 'given': 1080,
 'bet': 231,


In [24]:
sorted(vectorizer.vocabulary_)   
# The sorted list shows many near-duplicate words (e.g. 'act', 'acted', 'acting', 'acts'),
# which is why stemming the words would help

['abandoned',
 'ability',
 'able',
 'absolute',
 'absolutely',
 'absurd',
 'abuse',
 'academy',
 'accent',
 'accents',
 'accept',
 'accepted',
 'accident',
 'accidentally',
 'according',
 'account',
 'accurate',
 'achieve',
 'achieved',
 'act',
 'acted',
 'acting',
 'action',
 'actions',
 'actor',
 'actors',
 'actress',
 'actresses',
 'acts',
 'actual',
 'actually',
 'ad',
 'adam',
 'adaptation',
 'adapted',
 'add',
 'added',
 'adding',
 'addition',
 'adds',
 'admit',
 'admittedly',
 'adult',
 'adults',
 'advantage',
 'adventure',
 'adventures',
 'advice',
 'affair',
 'afraid',
 'africa',
 'african',
 'afternoon',
 'age',
 'aged',
 'agent',
 'ages',
 'ago',
 'agree',
 'ahead',
 'aint',
 'air',
 'aired',
 'aka',
 'al',
 'alan',
 'alas',
 'albeit',
 'albert',
 'alex',
 'alice',
 'alien',
 'aliens',
 'alike',
 'alive',
 'allen',
 'allow',
 'allowed',
 'allowing',
 'allows',
 'alright',
 'amateur',
 'amateurish',
 'amazed',
 'amazing',
 'amazingly',
 'america',
 'american',
 'americans',
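
As a rough illustration of what stemming would do to entries like the ones above, here is a sketch using NLTK's `PorterStemmer`. This is one possible stemmer, not something the notebook already uses, and it needs no corpus download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Four vocabulary entries from the list above that share one root
words = ['allow', 'allowed', 'allowing', 'allows']
stems = [stemmer.stem(w) for w in words]

print(stems)
```

All four forms collapse to a single stem, so they would share one column in the feature vector instead of four.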
 

In [25]:
#Save vectorizer.vocabulary_
# Dictionary with words and indices
#pickle.dump(vectorizer.vocabulary_,open("./vectorizer_vocabulary.pkl","wb"))

In [26]:
with open("./count_vectorizer_object.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

In [None]:
#X_test = cv.transform(reviews_test_clean)

In [None]:
#Save the count vectorizer object

In [None]:
# corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
# vectorizer = CountVectorizer(decode_error="replace")
# vec_train = vectorizer.fit_transform(corpus)
# #Save vectorizer.vocabulary_
# pickle.dump(vectorizer.vocabulary_,open("feature.pkl","wb"))

# #Load it later
# transformer = TfidfTransformer()
# loaded_vec = CountVectorizer(decode_error="replace",vocabulary=pickle.load(open("feature.pkl", "rb")))
# tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

In [27]:
vectorizer  
# You can provide a list of stop words and strip accents (as in "fiancée")
# max_features 
# minimum document frequency

# max document frequency - helps eliminate corpus-specific stop words, like "movie"
# We would do this if we don't think that word adds much value and we want to compress our representation 

CountVectorizer(analyzer='word', binary=True, decode_error='replace',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=0.005,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents='unicode', token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [65]:
#vectorizer.stop_words_

### Load from saved vocabulary.
In the Flask app, the vectorizer is loaded when the app launches. It's instantiated as a global in the script so that this step isn't repeated every time a new request is made (the user refreshes the page or enters a new string). The global is recreated whenever the server restarts.
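
The load-once idea can be sketched without Flask itself. Here a cached loader stands in for the module-level global. The function names `get_vectorizer` and `handle_request` are hypothetical, and the path is the pickle file written earlier in this notebook. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.

```python
import pickle
from functools import lru_cache

VECTORIZER_PATH = "./count_vectorizer_object.pkl"  # file saved earlier in this notebook

@lru_cache(maxsize=1)
def get_vectorizer(path=VECTORIZER_PATH):
    """Read the pickle from disk on the first call; later calls reuse the cached object."""
    with open(path, "rb") as f:
        return pickle.load(f)

def handle_request(text, path=VECTORIZER_PATH):
    """Hypothetical request handler: reuses the cached vectorizer for every request."""
    vectorizer = get_vectorizer(path)
    return vectorizer.transform([text])
```

In a real Flask script the same effect is usually achieved by loading the object at module level, before the route functions are defined.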

In [28]:
# saved_vocabulary = pickle.load(open('./vectorizer_vocabulary.pkl', 'rb'))
# # write the count vectorizer model parameters to a config file, so you can reinitialize.

# loaded_vec = CountVectorizer(vocabulary=saved_vocabulary)
# #tfidf = transformer.fit_transform(loaded_vec.fit_transform(corpus))

# loaded_vec

In [29]:
user_data

'"What\'s his name?" "Loudon." "Loudon what?" "Clear."<br /><br />That gag still gets me, TWENTY ONE years after the film was released.<br /><br />I loved the film back then and I love it today. I must have watched this a hundred times back in the day, and when I bought the DVD recently I could still remember some of the dialogue.<br /><br />Madonna plays Nikki Finn, a young woman jailed for a crime she didn\'t commit. When she gets out she decides to seek revenge.<br /><br />Griffin Dunne (whatever happened to him?), plays an attorney for his fiancée\'s father (John McMartin). The future father-in-law asks Loudon to take Nikki from prison to the bus station and to make sure she gets on the bus, as part of a supposed new public relations programme. A seemingly easy task, but there are complications aplenty, some funny dialogue, and some admittedly stupid-but-funny scenes along the way.<br /><br />Madonna has a stupid voice in this film, which until I was able to watch with subtitles ma

In [30]:
processed_user_data 

'whats his name loudon loudon what clear that gag still gets me twenty one years after the film was released i loved the film back then and i love it today i must have watched this a hundred times back in the day and when i bought the dvd recently i could still remember some of the dialogue madonna plays nikki finn a young woman jailed for a crime she didnt commit when she gets out she decides to seek revenge griffin dunne whatever happened to him plays an attorney for his fiancées father john mcmartin the future father in law asks loudon to take nikki from prison to the bus station and to make sure she gets on the bus as part of a supposed new public relations programme a seemingly easy task but there are complications aplenty some funny dialogue and some admittedly stupid but funny scenes along the way madonna has a stupid voice in this film which until i was able to watch with subtitles made one or two lines of dialogue incomprehensible for me hence only   but on the other hand i ca

In [31]:
with open('./count_vectorizer_object.pkl', 'rb') as f:
    saved_vectorizer = pickle.load(f)

In [32]:
user_data_vector = saved_vectorizer.transform([processed_user_data])
# transform expects an iterable of documents, so a single review is wrapped in a list

In [33]:
user_data_vector.shape

(1, 2920)

In [36]:
saved_vectorizer.vocabulary_['great']

1114

In [35]:
user_data_vector[0, 234], user_data_vector[0, 235], user_data_vector[0, 236]

(0, 0, 0)

In [37]:
# If you want to inspect it 
user_data_vector.todense()
# note that they're all binary

matrix([[0, 0, 1, ..., 0, 0, 0]])

In [58]:
## See which words are retained:
import operator
import numpy as np

In [None]:
vocab_by_index = sorted(saved_vectorizer.vocabulary_.items(), key=operator.itemgetter(1))

In [43]:
review_mask = np.array(user_data_vector.todense())[0]

In [48]:
words = list(zip(*vocab_by_index))[0]

In [75]:
retained_words = []

In [76]:
for i, w in enumerate(words):
    if review_mask[i]:
        retained_words.append(w)

In [77]:
retained_words

['able',
 'admittedly',
 'asks',
 'better',
 'bought',
 'bus',
 'clear',
 'comic',
 'crime',
 'day',
 'days',
 'decides',
 'dialogue',
 'didnt',
 'doing',
 'dvd',
 'easy',
 'father',
 'film',
 'funny',
 'future',
 'gag',
 'genuinely',
 'gets',
 'great',
 'hand',
 'happened',
 'happily',
 'imagine',
 'john',
 'jokes',
 'lacking',
 'law',
 'lines',
 'love',
 'loved',
 'make',
 'makes',
 'new',
 'normal',
 'perfect',
 'plays',
 'predictable',
 'prison',
 'public',
 'recently',
 'released',
 'remember',
 'revenge',
 'role',
 'scenes',
 'seek',
 'seemingly',
 'shows',
 'sit',
 'station',
 'stupid',
 'subtitles',
 'supposed',
 'sure',
 'task',
 'think',
 'times',
 'today',
 'voice',
 'watch',
 'watched',
 'way',
 'whats',
 'woman',
 'years',
 'yes',
 'young']
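
A shorter route to the same kind of list may be `CountVectorizer.inverse_transform`, which returns the vocabulary terms with non-zero entries for each row. The sketch below uses a made-up two-document corpus rather than the notebook's vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vectorizer = CountVectorizer(binary=True).fit(["good film", "bad film"])
row = toy_vectorizer.transform(["good good film with unseen words"])

# inverse_transform returns, for each row, the vocabulary terms with non-zero entries
kept = sorted(str(term) for term in toy_vectorizer.inverse_transform(row)[0])

print(kept)
```

Words that are not in the fitted vocabulary ("with", "unseen", "words") are silently dropped, just as they are for the user review above.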

### Let's train the model and inspect it now

- Logistic Regression Classifier. 

- There's a regularization parameter called C, the inverse of the regularization strength. The lower the value, the stronger the regularization and the less tendency to overfit to the training data

- Simple binary classifier - positive or negative

https://towardsdatascience.com/the-basics-logistic-regression-and-regularization-828b0d2d206c

(find a better link for regularization - more practical, slightly less math)

- Also load and store the trained model file.
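
The role of C can be checked on synthetic data. In this sketch the toy features and labels are random and made up, not the review data: the size of the learned coefficient vector shrinks as C gets smaller, i.e. as regularization gets stronger.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Toy binary features: 200 samples, 20 features, label driven by the first feature plus noise
X_toy = rng.randint(0, 2, size=(200, 20))
y_toy = (X_toy[:, 0] + rng.rand(200) > 0.9).astype(int)

coef_sizes = {}
for c in [0.01, 1, 100]:
    model = LogisticRegression(C=c, max_iter=1000).fit(X_toy, y_toy)
    # L2 norm of the coefficient vector as a measure of model "freedom"
    coef_sizes[c] = float(np.linalg.norm(model.coef_))

print(coef_sizes)
```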


In [78]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [79]:
# We're able to set the target like this because we know the reviews are ordered as being positive for the first 12500 and negative for the next ones

In [80]:
target = [1 if i < 12500 else 0 for i in range(25000)]

In [81]:
X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75
)

In [83]:
X_test = vectorizer.transform(reviews_test_clean)

In [84]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.86256
Accuracy for C=0.05: 0.87088
Accuracy for C=0.25: 0.864
Accuracy for C=0.5: 0.85792
Accuracy for C=1: 0.85312


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
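
The warning means the solver stopped at its iteration limit before converging. One option it suggests is a larger `max_iter` (the default is 100). A minimal sketch on synthetic stand-in data, which is an assumption and not the review features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the review features (an assumption, not the notebook's data)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# A larger iteration budget lets the solver run until it converges
lr = LogisticRegression(C=0.05, max_iter=1000)
lr.fit(X_toy, y_toy)

print(lr.n_iter_)  # iterations actually used
```

Whether a higher limit changes the validation accuracies reported above would need to be re-run on the real data.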


### Pick the value of c that gives the highest accuracy on the validation set and retrain on the full training set

The target is the same for the test set because it is also ordered.
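
Rather than reading the best value off by eye, the selection can be done in code. This sketch copies the validation accuracies from the printed output of the C sweep; the names `val_accuracy` and `best_c` are assumptions.

```python
# Validation accuracies copied from the printed output of the C sweep
val_accuracy = {0.01: 0.86256, 0.05: 0.87088, 0.25: 0.864, 0.5: 0.85792, 1: 0.85312}

# Pick the C whose validation accuracy is highest
best_c = max(val_accuracy, key=val_accuracy.get)

print(best_c)
```

In the notebook itself the dictionary could be filled inside the loop instead of being typed in by hand.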

In [85]:
final_model = LogisticRegression(C=0.05)
final_model.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_model.predict(X_test)))
# Final Accuracy: 0.88128

Final Accuracy: 0.87092


### Let's inspect what the model has learned

In [86]:
vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out() instead

['abandoned',
 'ability',
 'able',
 'absolute',
 'absolutely',
 'absurd',
 'abuse',
 'academy',
 'accent',
 'accents',
 'accept',
 'accepted',
 'accident',
 'accidentally',
 'according',
 'account',
 'accurate',
 'achieve',
 'achieved',
 'act',
 'acted',
 'acting',
 'action',
 'actions',
 'actor',
 'actors',
 'actress',
 'actresses',
 'acts',
 'actual',
 'actually',
 'ad',
 'adam',
 'adaptation',
 'adapted',
 'add',
 'added',
 'adding',
 'addition',
 'adds',
 'admit',
 'admittedly',
 'adult',
 'adults',
 'advantage',
 'adventure',
 'adventures',
 'advice',
 'affair',
 'afraid',
 'africa',
 'african',
 'afternoon',
 'age',
 'aged',
 'agent',
 'ages',
 'ago',
 'agree',
 'ahead',
 'aint',
 'air',
 'aired',
 'aka',
 'al',
 'alan',
 'alas',
 'albeit',
 'albert',
 'alex',
 'alice',
 'alien',
 'aliens',
 'alike',
 'alive',
 'allen',
 'allow',
 'allowed',
 'allowing',
 'allows',
 'alright',
 'amateur',
 'amateurish',
 'amazed',
 'amazing',
 'amazingly',
 'america',
 'american',
 'americans',
 

In [87]:
final_model.coef_[0]

array([ 0.18642663, -0.1870072 ,  0.22950748, ..., -0.15128324,
       -0.14033211, -0.10895653])

In [88]:
feature_to_coef = {
    word: coef for word, coef in zip(
        vectorizer.get_feature_names(), final_model.coef_[0]
    )
}
for best_positive in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1], 
    reverse=True)[:5]:
    print (best_positive)
    
#     ('excellent', 0.9288812418118644)
#     ('perfect', 0.7934641227980576)
#     ('great', 0.675040909917553)
#     ('amazing', 0.6160398142631545)
#     ('superb', 0.6063967799425831)
    

('excellent', 0.9745851262654305)
('perfect', 0.826031017132604)
('favorite', 0.7028390725567992)
('amazing', 0.7017429487457373)
('wonderful', 0.6819654680349783)


In [89]:
for best_negative in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1])[:5]:
    print (best_negative)

('worst', -1.375134199529046)
('waste', -1.2710344923315653)
('awful', -1.127797419542129)
('poorly', -0.9812138734109177)
('disappointment', -0.9565228851956116)


### Let's check what label it gave to the user input. This is what we return through flask

In [90]:
user_data_vector

<1x2920 sparse matrix of type '<class 'numpy.int64'>'
	with 73 stored elements in Compressed Sparse Row format>

In [94]:
final_model.predict(user_data_vector)[0] # 'Positive'

1

In [92]:
user_data

'"What\'s his name?" "Loudon." "Loudon what?" "Clear."<br /><br />That gag still gets me, TWENTY ONE years after the film was released.<br /><br />I loved the film back then and I love it today. I must have watched this a hundred times back in the day, and when I bought the DVD recently I could still remember some of the dialogue.<br /><br />Madonna plays Nikki Finn, a young woman jailed for a crime she didn\'t commit. When she gets out she decides to seek revenge.<br /><br />Griffin Dunne (whatever happened to him?), plays an attorney for his fiancée\'s father (John McMartin). The future father-in-law asks Loudon to take Nikki from prison to the bus station and to make sure she gets on the bus, as part of a supposed new public relations programme. A seemingly easy task, but there are complications aplenty, some funny dialogue, and some admittedly stupid-but-funny scenes along the way.<br /><br />Madonna has a stupid voice in this film, which until I was able to watch with subtitles ma

### Also show an example with a negative review: 
Package it as well. Port to script

In [99]:
len(reviews_test)

25000

In [103]:
reviews_test[22000]

'A post-apocalyptic warrior goes off to save some kind of Nun and on the way meets some cyber-punks on skates who want to kick his ass. This is one of the hardest to watch films ever, There are scenes with silence that seems to last hours before somebody comes out with the next badly written, badly acted line. There are action sequences that keep repeating - and we\'re not talking the quickfire 1-2-3 action repeat on a particularly good kick that was made popular by eastern directors, we\'re talking many, many repeats of long, bad fight sequences. This is incredibly confusing at first but then quickly becomes annoying as you\'re watching a 30 second sequence for the 2nd, 3rd and 4th time. Any kind of plot or vision is lost within the confusing continuity, the only thing thats keeps this film in the videoplayer (apart from the bet from a friend that i couldn\'t watch it all the way through without begging for it to be turned off and disposed off safely so it may harm no-one else) is the
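
For packaging this into a script, one option is to wrap vectorizing and prediction in a single function that returns a readable label. The sketch below trains a throwaway pipeline on four made-up reviews so that it runs on its own. `predict_sentiment`, `LABELS` and the toy data are hypothetical, and the real script would load the pickled vectorizer and trained model and apply `preprocess_reviews` first.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set; the notebook's real model is trained on 25,000 reviews
toy_reviews = [
    "excellent wonderful film",
    "great perfect acting",
    "worst awful waste",
    "poorly made disappointment",
]
toy_target = [1, 1, 0, 0]

toy_pipeline = make_pipeline(CountVectorizer(binary=True), LogisticRegression())
toy_pipeline.fit(toy_reviews, toy_target)

LABELS = {1: 'Positive', 0: 'Negative'}

def predict_sentiment(text, pipeline=toy_pipeline):
    """Hypothetical helper: vectorize one review and map the predicted class to a label."""
    return LABELS[int(pipeline.predict([text])[0])]

print(predict_sentiment("an awful waste of time"))
print(predict_sentiment("an excellent film"))
```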

# other libraries:
nltk, spacy

In [18]:
import nltk

In [None]:
#nltk.download()

In [None]:
# load data  ### For the training
# filename = 'metamorphosis_clean.txt'
# file = open(filename, 'rt')
# text = file.read()
# file.close()
# # split into sentences
# from nltk import sent_tokenize
# sentences = sent_tokenize(text)
# print(sentences[0])

In [None]:
# Remove stopwords
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

# How to improve this:
 - Remove stopwords
 - Remove proper nouns 