<a href="https://colab.research.google.com/github/ale-camer/Data-Science/blob/Finance/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Sentiment analysis is a natural language processing (NLP) technique that identifies the emotional tone of a body of text. In this notebook we evaluate how well unsupervised, lexicon-based models perform compared with a supervised model. For the unsupervised models we compute the polarity of each review with three lexicons (AFINN, TextBlob and VADER); for the supervised model we predict the sentiment with Logistic Regression on TF-IDF features.

We use a toy dataset of movie reviews, which can be found [here](https://github.com/dipanjanS/data_science_for_all/blob/master/tds_deep_transfer_learning_nlp_classification/movie_reviews.csv.bz2).
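
To make the unsupervised approach concrete, the sketch below scores a text by summing per-word valences from a lexicon, which is the general idea behind word-list scorers such as AFINN. The `TOY_LEXICON` dictionary and its weights are invented for illustration and are not taken from any of the lexicons used in this notebook; the real libraries are more elaborate.

```python
# Toy lexicon: words and weights are made up for illustration only.
TOY_LEXICON = {"wonderful": 3, "good": 2, "bad": -2, "idiotic": -3}


def toy_polarity(text, lexicon=TOY_LEXICON):
    """Sum the valence of every known word; unknown words contribute 0."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())


print(toy_polarity("a wonderful little production"))      # 3
print(toy_polarity("bad plot bad dialogue bad acting"))   # -6
print(toy_polarity("a family where a little boy lives"))  # 0
```

A positive total is read as positive sentiment, a negative total as negative sentiment, and zero as neutral.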


## Packages, functions and data

In [None]:
!pip install unidecode --quiet
!pip install afinn --quiet
!pip install vaderSentiment --quiet

import numpy as np
import pandas as pd
from tqdm import tqdm

from afinn import Afinn as afn
from textblob import TextBlob as tb
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer as vd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as acc

def text_normalizer(data, language: str = 'english'):

    """
    Normalize a collection of texts. It takes as inputs:

        - an iterable of strings (e.g. a pandas Series of reviews), in the input 'data',
        - and the language in which the texts were written, in the input 'language', which by default is english.

    It returns a list with one normalized string per input text.
    """

    import re
    import nltk
    import unidecode

    assert isinstance(language, str), "The 'language' must be a string"

    nltk.download('stopwords', quiet=True) # downloading stopwords
    stopword_list = set(nltk.corpus.stopwords.words(language)) # a set gives fast membership tests
    url_regex = re.compile(r'http\S+') # raw string avoids the invalid '\S' escape
    non_alnum_regex = re.compile(r'[^a-zA-Z0-9\s]') # anything that is not a letter, a digit or whitespace

    lista = []
    for dat in data:

        dat = ' '.join([word for word in dat.lower().split() if word not in stopword_list]) # lowercase and remove stopwords
        dat = non_alnum_regex.sub('', dat) # remove special characters
        dat = ' '.join([word for word in dat.split() if not url_regex.match(word)]) # remove URLs
        # replace diacritical marks; accented letters were already dropped by the filter above,
        # so move this step before it if they should be transliterated instead
        dat = ' '.join([unidecode.unidecode(word) for word in dat.split()])
        lista.append(dat)

    return lista

data = pd.read_csv('/content/movie_reviews.csv.bz2',compression='bz2')
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
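
As a quick illustration of what `text_normalizer` does to a single review, the sketch below applies the same steps in the same order using only the standard library. The small `TOY_STOPWORDS` set is a stand-in for the NLTK list and the `unidecode` step is left out, so this is an approximation rather than the function itself.

```python
import re

# Stand-in for nltk.corpus.stopwords.words('english'); illustration only.
TOY_STOPWORDS = {"a", "the", "is"}
URL_REGEX = re.compile(r"http\S+")
NON_ALNUM_REGEX = re.compile(r"[^a-zA-Z0-9\s]")


def toy_normalize(text):
    """Lowercase, drop stopwords, strip special characters, drop URLs."""
    words = [w for w in text.lower().split() if w not in TOY_STOPWORDS]
    text = NON_ALNUM_REGEX.sub("", " ".join(words))
    words = [w for w in text.split() if not URL_REGEX.match(w)]
    return " ".join(words)


print(toy_normalize("A wonderful little production. <br /><br />The end!"))
# wonderful little production br br the end
```

The HTML line breaks leave `br` tokens behind, and `the` survives because it was still attached to a tag when the stopwords were filtered.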


## Preprocessing and Unsupervised Modeling

In [None]:
print("Normalizing data")
reviews, sentiments = np.array(text_normalizer(data['review'])), np.array(data['sentiment']) # normalizing reviews and extracting labels
afinn_model, vader_model = afn(emoticons=True), vd() # lexicon instantiation; new names keep the imported classes usable if the cell is re-run

print("Calculating Polarities")
afn_pols, textblob_pols, vader_pols = [], [], []
for review in tqdm(reviews):

    afn_pols.append(afinn_model.score(review))
    textblob_pols.append(tb(review).sentiment.polarity)
    vader_pols.append(vader_model.polarity_scores(review))

print("Reprocessing Polarity Values")
for i in tqdm(range(len(vader_pols))):

    if vader_pols[i]['neg'] > vader_pols[i]['pos']:
        vader_pols[i] = -1
    elif vader_pols[i]['neg'] == vader_pols[i]['pos']:
        vader_pols[i] = 0
    else:
        vader_pols[i] = 1

names = ['afn','textblob','vader']
pols = [afn_pols, textblob_pols, vader_pols]
for name, pol in zip(names, pols): # one column of polarities per lexicon
    data[name] = pol

for col in names: # reduce every polarity to its sign: -1 (negative), 0 (neutral) or 1 (positive)
    for row in tqdm(data.index):
        if data.loc[row,col] < 0:
            data.loc[row,col] = -1
        elif data.loc[row,col] > 0:
            data.loc[row,col] = 1
        else:
            data.loc[row,col] = 0

print("Evaluating Polarities")
data.sentiment = np.where(data.sentiment == 'positive',1,-1)
for name in names:
    data[f"{name}_same"] = np.where(data['sentiment'] == data[name],1,0)
data

Normalizing data
Calculating Polarities


100%|██████████| 50000/50000 [09:21<00:00, 89.08it/s]


Reprocessing Polarity Values


100%|██████████| 50000/50000 [00:00<00:00, 974984.19it/s]
100%|██████████| 50000/50000 [00:34<00:00, 1435.79it/s]
100%|██████████| 50000/50000 [00:34<00:00, 1435.68it/s]
100%|██████████| 50000/50000 [00:34<00:00, 1437.94it/s]


Evaluating Polarities


Unnamed: 0,review,sentiment,afn,textblob,vader,afn_same,textblob_same,vader_same
0,One of the other reviewers has mentioned that ...,1,-1.0,1.0,-1,0,1,0
1,A wonderful little production. <br /><br />The...,1,1.0,1.0,1,1,1,1
2,I thought this was a wonderful way to spend ti...,1,1.0,1.0,1,1,1,1
3,Basically there's a family where a little boy ...,-1,-1.0,1.0,-1,1,0,1
4,"Petter Mattei's ""Love in the Time of Money"" is...",1,1.0,1.0,1,1,1,1
...,...,...,...,...,...,...,...,...
49995,I thought this movie did a down right good job...,1,1.0,1.0,1,1,1,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",-1,-1.0,-1.0,-1,1,1,1
49997,I am a Catholic taught in parochial elementary...,-1,-1.0,1.0,-1,1,0,1
49998,I'm going to have to disagree with the previou...,-1,-1.0,-1.0,-1,1,1,1
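
The evaluation above reduces every polarity score to its sign and counts a prediction as correct only when it equals the true label. Since the labels are only 1 or -1, a neutral score of 0 is always counted as a miss. The sketch below reproduces that logic in plain Python on made-up scores; the helper names `to_label` and `agreement_rate` are hypothetical.

```python
def to_label(score):
    """Collapse a polarity score to -1, 0 or 1 according to its sign."""
    return (score > 0) - (score < 0)


def agreement_rate(scores, labels):
    """Share of scores whose sign equals the true label (1 or -1)."""
    hits = sum(to_label(s) == y for s, y in zip(scores, labels))
    return hits / len(labels)


# Made-up polarity scores and true labels, for illustration only.
scores = [4.0, -2.5, 0.0, 1.5]
labels = [1, -1, 1, -1]

print([to_label(s) for s in scores])   # [1, -1, 0, 1]
print(agreement_rate(scores, labels))  # 0.5
```

On a DataFrame column the same sign mapping can be obtained in one vectorized call with `np.sign`, which avoids looping over rows.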


## Supervised Modeling and Evaluation
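
The supervised model is trained on TF-IDF features. As a rough sketch of what the vectorizer settings used in this section mean, the plain-Python code below weights unigrams and bigrams (`ngram_range=(1,2)`) with a sublinear term frequency, `1 + ln(tf)`, a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, following the formulas scikit-learn documents for `sublinear_tf=True` with its default `smooth_idf=True` and `norm='l2'`. The two-document corpus and the helper names are made up, and tokenization is simplified to whitespace splitting, so this is not a drop-in replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter


def ngrams(text):
    """Return the unigrams and bigrams of a whitespace-tokenized text."""
    words = text.split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def toy_tfidf(corpus):
    """Return one {term: weight} dict per document."""
    counts = [Counter(ngrams(doc)) for doc in corpus]
    n_docs = len(corpus)
    # number of documents in which each term appears
    doc_freq = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        row = {
            term: (1 + math.log(tf))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        # L2 norm of the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({term: w / norm for term, w in row.items()})
    return rows


# Made-up corpus, for illustration only.
corpus = ["bad plot bad acting", "wonderful little production"]
for row in toy_tfidf(corpus):
    print({term: round(w, 3) for term, w in row.items()})
```

In the first document `bad` occurs twice, so it receives a larger weight than `plot`, but by a factor of `1 + ln(2)` rather than 2 because of the sublinear scaling.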

In [None]:
print("Partitioning data")
n = 35000
train_reviews, test_reviews, train_sentiments, test_sentiments = reviews[:n], reviews[n:], sentiments[:n], sentiments[n:]

print("Feature engineering")
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2), sublinear_tf=True)
tv_train_features, tv_test_features = tv.fit_transform(train_reviews), tv.transform(test_reviews)

print("Modeling, predicting and evaluating")
lr_model = LogisticRegression(penalty='l2', max_iter=100, C=1) # model instantiation
lr_model.fit(tv_train_features, train_sentiments) # training on the first n reviews
lr_acc = round(acc(test_sentiments, lr_model.predict(tv_test_features))*100,2) # stored under a new name so the imported 'acc' function is not overwritten

Partitioning data
Feature engineering
Modeling, predicting and evaluating


In [None]:
print("Unsupervised Models Accuracy")
for name in names:
    print(f"{name}: {round(data[f'{name}_same'].mean()*100,2)}%")
print(f"\n Supervised Model Accuracy \n Logistic Regression: {lr_acc}%")

Unsupervised Models Accuracy
afn: 68.41%
textblob: 69.92%
vader: 66.69%

 Supervised Model Accuracy 
 Logistic Regression: 89.2%
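
The unsupervised accuracies above are averaged over all 50,000 reviews, while the Logistic Regression accuracy is measured on the last 15,000 only. One way to put them on the same footing would be to score the lexicons on that held-out slice as well. The sketch below shows the idea in plain Python with made-up values; `holdout_accuracy` is a hypothetical helper, and in the notebook the equivalent would be averaging each `*_same` column from row `n` onwards.

```python
def holdout_accuracy(same_flags, n_train):
    """Mean of the 0/1 agreement flags for the rows after the first n_train."""
    held_out = same_flags[n_train:]
    return sum(held_out) / len(held_out)


# Made-up agreement flags (1 = lexicon matched the label), illustration only.
same_flags = [1, 0, 1, 1, 0, 1, 0, 1]
n_train = 4  # plays the role of n = 35000 in the notebook

print(round(sum(same_flags) / len(same_flags) * 100, 2))      # 62.5
print(round(holdout_accuracy(same_flags, n_train) * 100, 2))  # 50.0
```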
