# Homework: Sentiment analysis

You will work with movie reviews: a sample of 12,500 rows, of which 10,000 are labeled (0 = negative sentiment, 1 = positive) and 2,500 are held out for validation.

__Task: predict movie review sentiment__

In [None]:
import pandas as pd
# None means "no limit"; the older value -1 is deprecated in recent pandas
pd.set_option('display.max_colwidth', None)

  


__data__ is for training and testing; __data_validation__ is to be filled in with your best classifier's predictions and sent for evaluation along with the notebook.

In [None]:
data = pd.read_csv("reviews.csv")
data_validation = pd.read_csv("validation_preds.csv")

In [None]:
data.groupby(["Sentiment"]).size()

Sentiment
0    5058
1    4942
dtype: int64

In [None]:
data.head(3) 

Unnamed: 0,SentimentText,Sentiment
0,"Actually I'm surprised many comments movie. saw part Slavic film festival major American University. nobody USA heard it, real shame! dynamics people makes funny sad. stuck together long bus trip--someplace us been!! never one like this!! <br /><br />My favorite scene one stop funeral. man & woman sneak Lovemaking forest everybody follows watch without knowing! raises skirt enters way--the consumptive starts hacking & realize everybody watching!! Talk surprised! But...you really feel even hilariously funny! see ending sort ironic enjoyed did! Serb humor it's best!",1
1,"someone lives near Buffalo, New York, movie scored points even saw it, since story based here. even bit parts real-life news-TV anchor people Buffalo..and, once, doesn't knock area. Hallelujah!<br /><br />Theology-wise, puh-leeze!!! God still made look think like humans...and, course, bit liberal side. lightweight comedy is, it's nothing win awards still entertaining pleasant way kill 102 minutes. <br /><br />There laugh-out-loud slapstick comedy scenes and, hopefully, audiences - Christians atheists.- got something besides laughs, prayer really about. Kudos writers least getting theology correct giving good message.<br /><br />Overall, it's good-hearted film offend few.",1
2,"first half hour movie liked. obvious budding romance Ingrid Bergman Mel Ferrer cute watch wanted see inevitable happen them. However, action switched home Ingrid's fiancé, completely fell apart. Instead romance charm, see excruciatingly dopey parallel characters emerge ruin film. fiancé's boorish son military attaché's vying maid's attention looked stupid--sort like subplot old Love Boat episode. charm elegance first portion film give way dopiness beyond me. film obvious attempt Renoir recapture success RULES GAME, movie similar action switches country estate (just film). huge fan RULES GAME, ELENA MEN appreciating artistry nuances original film.",0


In [None]:
data_validation.head(3)

Unnamed: 0,SentimentText,Sentiment
0,"one best films seen years! Gwyneth Paltrow fan, excellent Emma Woodhouse. Alan Cumming superb Reverand Elton, Emma Thompson's sister, Sophie, hysterical Miss Bates. check gorgeous Jeremy Northam Mr. Knightley; gentleman! Whoever said need sex violence movie make good never seen Emma. think separates many others--it's classy.<br /><br />If you're looking film watch whole family, looking romance yourself, look further. Emma movie. beautiful setting, wonderful costumes, outstanding cast (have mentioned gorgeous Jeremy Northam?), Emma perfect ten!",
1,"excellent, fast paced thriller Wes Craven (Nightmare Elm Street), 85 minutes leaves aside supernatural presents us something even terrifying - evil human beings. far likely encounter benign evil Jackson Rippner Freddy Kruger, Cillian Murphy (Batman Begins) excellent job presenting sociable, friendly, even charismatic killer. performances Murphy Rachel McAdams (Claire, Wedding Crashers)are brilliant. film takes place intimate level, two people, eyes, faces. action small scale, broad sweep canvas, less compelling limitations. cinematography nothing special, though course one much camera confines passenger jet, dialog excellent, story taut. distractions, subplots confuse issue heart battle main characters. keeping focus avoiding distractions, Wes Craven able take minimal plot turning exciting, fast-paced action thriller.",
2,"don't ruin you, I'll brief. There's great acting funny lines attractive cast. young graduate Harvard Med School (Brian White) finds doesn't know much thinks people. goes small hospital Florida internship girlfriend (Mya) left job TV Producer. Senior Resident (Wood Harris), helped marvelously 'creative collaborator'(Zoe Saldana) bring speed. help protect career show wider possibilities come compassionate doctor instead player wants make money (as seems true many pre-med friends).",


## Apply text normalization for train & validation sets


In [None]:
import pickle
import re
from nltk.tokenize import WordPunctTokenizer
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

Load the dictionary that maps negation contractions to their expanded forms:

In [None]:
with open('negations_contractions.pickle', 'rb') as f:
    negations = pickle.load(f)

Compile patterns for HTML tags and web hyperlinks.

re.I makes matching case-insensitive, so there is no need to treat upper and lower case separately.

Use the __r__ prefix (raw strings) in pattern strings so that backslashes reach the regex engine unchanged.

In [None]:
pattern_1 = re.compile(r'(<[^>]+>)|((www\.[^ ]+)\b)|((https?://)\S+)', re.I)
negation_pattern = re.compile(r'\b(' + '|'.join(negations.keys()) + r')\b', re.I)
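
As a quick check of what pattern_1 strips, the sketch below runs the same regular expression on a made-up string. The sample text and the URLs in it are illustrative assumptions, not taken from the dataset.

```python
import re

# Same pattern as above: HTML tags, www-style links, http(s) links
pattern_1 = re.compile(r'(<[^>]+>)|((www\.[^ ]+)\b)|((https?://)\S+)', re.I)

# Made-up sample; the upper-case tag and scheme are still matched thanks to re.I
sample = "Loved it!<BR /><br />More at www.example.com or HTTPS://example.org/review today"
cleaned = pattern_1.sub('', sample)
print(cleaned)  # Loved it!More at  or  today
```

Matches are replaced by an empty string, so words on either side of a removed tag end up adjacent ("it!More"); here the punctuation that remains between them is turned into a space by the later letters-only step.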

A handy way to combine regex search results with the contractions dictionary: group() returns the matched text, which is then used as the dictionary key. The example below is for illustration only:

In [None]:
negation_pattern.sub(lambda x: negations[x.group()], "ain't|aren't|can't")

'is not|are not|cannot'
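
Because negation_pattern is compiled with re.I, it can also match spellings such as "Can't" that are not dictionary keys; data_cleaner sidesteps this by lowercasing the text first. A minimal sketch of a lookup that also works on raw text is given below. It assumes a small hypothetical dictionary holding only the three mappings visible in the output above; the real pickle contents are not reproduced here, and expand_negations is an illustrative name.

```python
import re

# Hypothetical stand-in for the pickled dictionary (three known mappings only)
negations = {"ain't": "is not", "aren't": "are not", "can't": "cannot"}

# re.escape guards against keys that contain regex metacharacters
negation_pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, negations)) + r')\b', re.I)

def expand_negations(text):
    """Replace each contraction, lowercasing the match before the dictionary lookup."""
    return negation_pattern.sub(lambda m: negations[m.group().lower()], text)

expanded = expand_negations("Can't stop, AREN'T you?")
print(expanded)  # cannot stop, are not you?
```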

In [None]:
def data_cleaner(text, tokenizer):
    """Normalize one review: lowercase, strip HTML tags and links, expand negations, keep letters only."""
    lower_case = text.lower()
    # Remove HTML tags and hyperlinks
    preprocessed_1 = pattern_1.sub('', lower_case)
    # Expand contractions such as "can't" -> "cannot"
    w_o_negations = negation_pattern.sub(lambda x: negations[x.group()], preprocessed_1)
    # Replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", w_o_negations)
    tokens = tokenizer.tokenize(letters_only)
    return " ".join(tokens).strip()

In [None]:
def post_process(data, tokenizer):
    """Apply text normalization to the whole corpus (modifies data in place)."""
    # Pass the tokenizer explicitly instead of relying on a global variable
    data['SentimentText'] = data['SentimentText'].progress_map(
        lambda text: data_cleaner(text, tokenizer))
    data.reset_index(drop=True, inplace=True)
    return data

In [None]:
tokenizer = WordPunctTokenizer()
data_processed = post_process(data, tokenizer)

progress-bar: 100%|██████████| 10000/10000 [00:07<00:00, 1423.80it/s]


The validation data should also be normalized:

In [None]:
validation_processed = post_process(data_validation, tokenizer)

progress-bar: 100%|██████████| 2500/2500 [00:01<00:00, 1371.27it/s]


In [None]:
validation_processed.head(3)

Unnamed: 0,SentimentText,Sentiment
0,one best films seen years gwyneth paltrow fan excellent emma woodhouse alan cumming superb reverand elton emma thompson s sister sophie hysterical miss bates check gorgeous jeremy northam mr knightley gentleman whoever said need sex violence movie make good never seen emma think separates many others it is classy if you are looking film watch whole family looking romance yourself look further emma movie beautiful setting wonderful costumes outstanding cast have mentioned gorgeous jeremy northam emma perfect ten,
1,excellent fast paced thriller wes craven nightmare elm street minutes leaves aside supernatural presents us something even terrifying evil human beings far likely encounter benign evil jackson rippner freddy kruger cillian murphy batman begins excellent job presenting sociable friendly even charismatic killer performances murphy rachel mcadams claire wedding crashers are brilliant film takes place intimate level two people eyes faces action small scale broad sweep canvas less compelling limitations cinematography nothing special though course one much camera confines passenger jet dialog excellent story taut distractions subplots confuse issue heart battle main characters keeping focus avoiding distractions wes craven able take minimal plot turning exciting fast paced action thriller,
2,do not ruin you i will brief there is great acting funny lines attractive cast young graduate harvard med school brian white finds does not know much thinks people goes small hospital florida internship girlfriend mya left job tv producer senior resident wood harris helped marvelously creative collaborator zoe saldana bring speed help protect career show wider possibilities come compassionate doctor instead player wants make money as seems true many pre med friends,


## Prepare train/test sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
SEED = 42
x_train, x_validation, y_train, y_validation = train_test_split(
    data_processed.SentimentText, data_processed.Sentiment,
    test_size=.2, random_state=SEED)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Sentiment Prediction & Real Challenge

Use what you have learned to build a machine learning pipeline that gives the most accurate sentiment predictions. The metric to maximize is ROC AUC.

Some variants to try:

 * tune TF-IDF (max_features, max_df, min_df, ngram_range...). You may also want to test CountVectorizer :)
 * test several classifiers
 * tune classifier parameters
 * word embeddings??
 * topic models??
 
 __Do not forget to make predictions for the validation set "validation_processed" created above, save them in validation_preds.csv, and send the file to me along with the notebook.__  
 
 The final score of your work will be assessed on validation_preds.csv.
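
As a starting point for the first variants in the list, here is a minimal sketch of a TF-IDF plus logistic regression baseline. The toy corpus, the parameter values, and the names baseline, toy_texts, toy_labels are illustrative assumptions; in the notebook you would fit on x_train / y_train and score on x_validation / y_validation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# Toy stand-in for the cleaned reviews (assumption, not the real data)
toy_texts = [
    "excellent film wonderful cast", "great acting funny lines",
    "beautiful setting perfect ten", "good hearted entertaining film",
    "dopey subplot ruined film", "boring plot terrible acting",
    "stupid characters awful film", "dull story bad ending",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

baseline = Pipeline(steps=[
    # Unigrams and bigrams; min_df / max_df / max_features are the knobs to tune
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1, max_df=0.9)),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])
baseline.fit(toy_texts, toy_labels)

# ROC AUC is computed from scores, so use the positive-class probability
toy_scores = baseline.predict_proba(toy_texts)[:, 1]
toy_auc = roc_auc_score(toy_labels, toy_scores)
print(toy_auc)
```

The score printed here is measured on the training texts themselves, so it only shows that the pipeline runs; a meaningful estimate needs held-out data.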

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

class Tokenizer_(BaseEstimator, TransformerMixin):
    def __init__(self, num_words):
        # Store the parameter under its own name so that get_params() and clone() work
        self.num_words = num_words
        self.tokenizer = Tokenizer(num_words=num_words)
        
    def fit(self, X, y=None):
        self.tokenizer.fit_on_texts(X.values)
        return self
    
    def transform(self, X, y=None):
        X = self.tokenizer.texts_to_sequences(X.values)
        return X

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

class Padder(BaseEstimator, TransformerMixin):
    """Pad or truncate integer sequences to a fixed length."""
    def __init__(self, maxlen):
        self.maxlen = maxlen
        
    def fit(self, X, y=None):
        # Nothing to learn: padding is stateless
        return self
    
    def transform(self, X, y=None):
        return pad_sequences(X, maxlen=self.maxlen)
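
To make the two preprocessing steps concrete, the following pure-Python sketch imitates what they do on a tiny example. It is a simplified illustration, not the Keras implementation: it assumes whitespace-separated, already cleaned text, and it follows the Keras defaults as I understand them (word indices start at 1 in order of decreasing frequency, only indices below num_words are kept, and padding and truncation both happen at the front). The function names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank, 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # Stable sort: words with equal counts keep their first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ranked)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

def pad_pre(sequences, maxlen):
    """Keep the last maxlen items of each sequence and left-pad with zeros."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

texts = ["good movie good cast", "bad movie"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)
padded = pad_pre(sequences, maxlen=3)
print(word_index)  # {'good': 1, 'movie': 2, 'cast': 3, 'bad': 4}
print(sequences)   # [[1, 2, 1, 3], [2]]
print(padded)      # [[2, 1, 3], [0, 0, 2]]
```

With num_words=4 the least frequent word ("bad", index 4) is dropped, and with maxlen=3 the first review loses its leading token while the second is padded with zeros on the left.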

In [None]:
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, SpatialDropout1D
from tensorflow.keras.layers import Embedding

class Embedding_LSTM(BaseEstimator, TransformerMixin):
    def __init__(self, num_words=5000, maxlen=1000):
        # Store the parameters under their own names so that get_params() and clone() work
        self.num_words = num_words
        self.maxlen = maxlen

        self.model = Sequential()
        self.model.add(Embedding(num_words, 32, input_length=maxlen))
        self.model.add(SpatialDropout1D(0.25))
        self.model.add(LSTM(50, dropout=0.5, recurrent_dropout=0.5))
        self.model.add(Dropout(0.2))
        self.model.add(Dense(1, activation='sigmoid'))
        self.model.compile(loss='binary_crossentropy',optimizer='adam', 
                    metrics=[tensorflow.keras.metrics.AUC()])
        
    def fit(self, X, y=None):
        self.model.fit(X, y, validation_split=0.2, epochs=3, batch_size=32)
        return self
    
    def transform(self, X, y=None):
        # Threshold the predicted probabilities at 0.5 to get hard 0/1 labels
        y_predicted = [1 if (x >= 0.5) else 0 for x in self.model.predict(X).ravel()]
        return y_predicted

In [None]:
words_num = 5000
max_len = 1000

pipeline = Pipeline(steps=[
    ('tokenization', Tokenizer_(words_num)),
    ('padding', Padder(max_len)),
    ('embedding+NN', Embedding_LSTM(words_num, max_len)),
])

pipeline.fit(x_train, y_train)

Epoch 1/3
Epoch 2/3
Epoch 3/3




Pipeline(memory=None,
         steps=[('1', Tokenizer_(num_words=None)), ('2', Padder(maxlen=1000)),
                ('3', Embedding_LSTM(maxlen=None, num_words=None))],
         verbose=False)

In [None]:
from sklearn.metrics import roc_auc_score

# roc_auc_score expects the true labels first, then the predictions
roc_auc_score(y_validation, pipeline.transform(x_validation))

0.8582843365001043
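
ROC AUC measures how well the scores rank positives above negatives, and roc_auc_score takes the true labels as its first argument and the scores as its second. The pure-Python sketch below uses made-up numbers and a hypothetical helper roc_auc (a pairwise definition of the metric, not the scikit-learn code) to show two things: thresholding probabilities into 0/1 labels before scoring can lower the AUC, and swapping the arguments gives a different number.

```python
def roc_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.3, 0.4, 0.9]               # made-up model probabilities
labels = [1 if p >= 0.5 else 0 for p in probs]  # hard labels after a 0.5 threshold

auc_probs = roc_auc(y_true, probs)      # 1.0: every positive outranks every negative
auc_labels = roc_auc(y_true, labels)    # 0.75: thresholding creates ties
auc_swapped = roc_auc(labels, y_true)   # 0.875: arguments in the wrong order
print(auc_probs, auc_labels, auc_swapped)
```

If the final model exposes probabilities, scoring them directly usually reflects the ranking quality better than scoring thresholded labels.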

In [None]:
validation_preds = pipeline.transform(validation_processed.SentimentText)
data_validation['Sentiment'] = validation_preds
data_validation.to_csv('validation_preds.csv')