### Use IMDB reviews to perform sentiment analysis

In [2]:
import pandas as pd

review_df = pd.read_csv("./labeledTrainData.tsv", header=0, sep="\t", quoting=3)
review_df.head(3)

|   | id | sentiment | review |
|---|----|-----------|--------|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |


In [3]:
print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

### Data preprocessing

In [4]:
import re

review_df['review'] = review_df['review'].str.replace('<br />', ' ')

# Replace every character that is not an English letter with a space
review_df['review'] = review_df['review'].apply(lambda x :re.sub("[^a-zA-Z]", " ", x))

In [5]:
print(review_df['review'][0])

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for  
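
The two cleaning steps above can be checked on a short string before running them over the whole column. The sketch below is illustrative only: `clean_review` and the sample text are hypothetical names introduced here, not part of the notebook.

```python
import re

def clean_review(text):
    """Apply the same two steps as the cell above to a single string."""
    # Step 1: turn each HTML line break into a single space.
    text = text.replace("<br />", " ")
    # Step 2: turn every non-letter character into a space.
    return re.sub("[^a-zA-Z]", " ", text)

sample = "drugs are bad m'kay.<br /><br />Visually impressive"
cleaned = clean_review(sample)
print(repr(cleaned))
# Splitting on whitespace shows the tokens a vectorizer would start from.
print(cleaned.split())
```

Note that the apostrophe in `m'kay` becomes a space, so contractions are split into separate fragments (`m`, `kay`), as also seen in the cleaned review printed above.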

### Split into train and test data

In [6]:
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(['id', 'sentiment'], axis=1, inplace=False)

X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size=0.3, random_state=156)

X_train.shape, X_test.shape

((17500, 1), (7500, 1))

In [16]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# CountVectorizer: English stop words, unigrams and bigrams (ngram_range=(1, 2))
# LogisticRegression: inverse regularization strength C=10
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words="english", ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))
])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

print("prediction Accuracy {0: .4f}, ROC-AUC {1: .4f}".format(accuracy_score(y_test, pred),
                                                             roc_auc_score(y_test, pred_probs)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


prediction Accuracy  0.8860, ROC-AUC  0.9503
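
The solver warning above says that optimization stopped at the iteration limit before converging. One option the warning itself suggests is raising `max_iter`. The sketch below shows where that parameter goes, using a tiny made-up corpus (`toy_texts`, `toy_labels`) so that it runs on its own. The value 1000 is an arbitrary assumption, and the scores reported above were produced without it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only to make the sketch self-contained.
toy_texts = [
    "great movie loved it",
    "wonderful acting and great story",
    "terrible movie hated it",
    "boring plot and awful acting",
]
toy_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words="english", ngram_range=(1, 2))),
    # max_iter raises the solver's iteration cap (assumed value, not tuned).
    ('lr_clf', LogisticRegression(C=10, max_iter=1000)),
])

toy_pipeline.fit(toy_texts, toy_labels)
toy_pred = toy_pipeline.predict(["great story", "awful plot"])
print(toy_pred)

# n_iter_ reports how many iterations the solver actually used.
print(toy_pipeline.named_steps['lr_clf'].n_iter_)
```

If `n_iter_` is below `max_iter`, the solver stopped because it converged rather than because it hit the cap.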


In [17]:
# TfidfVectorizer: English stop words, unigrams and bigrams (ngram_range=(1, 2))
# LogisticRegression: inverse regularization strength C=10
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words="english", ngram_range=(1,2))),
    ('lr_clf', LogisticRegression(C=10))
])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

print("prediction Accuracy {0: .4f}, ROC-AUC {1: .4f}".format(accuracy_score(y_test, pred),
                                                             roc_auc_score(y_test, pred_probs)))

prediction Accuracy  0.8936, ROC-AUC  0.9598


### Use VADER to perform sentiment analysis

In [19]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\DooDoo\AppData\Roaming\nltk_data...


{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}


In [22]:
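# Build the analyzer once; constructing it inside the function would rebuild it for every review.
vader_analyzer = SentimentIntensityAnalyzer()

def vader_polarity(review, threshold=0.1):
    # The compound score is VADER's single aggregated polarity value.
    agg_score = vader_analyzer.polarity_scores(review)['compound']
    # Label the review positive (1) when the compound score reaches the threshold.
    return 1 if agg_score >= threshold else 0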

review_df['vader_preds'] = review_df['review'].apply(lambda x : vader_polarity(x, 0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values
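
To see what the threshold does, the rule can be applied by hand to the scores printed earlier for the first review. This sketch copies that dictionary rather than calling VADER, so it needs no lexicon download. `label_from_compound` is a hypothetical helper that mirrors the rule in `vader_polarity`.

```python
# Scores copied from the polarity_scores output for review 0.
first_review_scores = {'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}

def label_from_compound(scores, threshold=0.1):
    """Return 1 (positive) if the compound score reaches the threshold, else 0."""
    return 1 if scores['compound'] >= threshold else 0

predicted = label_from_compound(first_review_scores)
actual = 1  # sentiment of row 0 in the head(3) output
print(predicted, actual)
```

The compound score of -0.7943 falls below 0.1, so this review is labeled 0 even though its true label is 1.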

In [23]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score

print('#### VADER evaluation ####')
print(confusion_matrix(y_target, vader_preds))
print("Accuracy:", accuracy_score(y_target, vader_preds))
print("Precision:", precision_score(y_target, vader_preds))
print("Recall:", recall_score(y_target, vader_preds))

#### VADER evaluation ####
[[ 6736  5764]
 [ 1867 10633]]
Accuracy: 0.69476
Precision: 0.6484722815149113
Recall: 0.85064
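
The three scores follow directly from the confusion matrix, whose rows are the true labels and whose columns are the predicted labels. As a cross-check, this sketch recomputes them with NumPy from the counts printed above. It also derives F1, which the notebook imports but does not print. Variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]].
cm = np.array([[6736, 5764],
               [1867, 10633]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all reviews labeled correctly
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
```

The 5764 false positives against 1867 false negatives show that, at this threshold, VADER leans toward labeling reviews as positive, which is why recall is high while precision is low.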

