
# Sentiment Analysis using Naive Bayes

In this assignment, we label tweets with a sentiment class (positive, neutral, or negative) using a Naive Bayes classifier. Naive Bayes is a simple baseline for this problem, but it can give surprisingly good accuracy.
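
To make the idea concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with additive smoothing. It is an illustration only, not the scikit-learn implementation used later in this notebook; the helper names `train_nb` and `predict_nb` and the toy documents are made up for this example. The `alpha` argument plays the same role as the `clf__alpha` smoothing parameter tuned in the grid search.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit per-class log priors and smoothed log word likelihoods."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for c, n_docs in class_counts.items():
        log_prior = math.log(n_docs / len(labels))
        # additive smoothing: every vocabulary word gets alpha extra counts
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
        model[c] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, (log_prior, log_lik) in model.items():
        # words never seen in training are ignored
        scores[c] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)

# hypothetical toy data, not taken from tweets.csv
toy_docs = ["great speech", "great plan", "terrible plan", "terrible speech awful"]
toy_labels = ["pos", "pos", "neg", "neg"]
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, "great"))       # pos
print(predict_nb(toy_model, "awful plan"))  # neg
```

The "naive" part is the sum over words: each word contributes to the score independently of the others, given the class.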

**Fill in the Blanks**

## Importing required libraries

In [57]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from imblearn.over_sampling import SMOTE

## Reading dataset

In [2]:
data=pd.read_csv('tweets.csv')
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

| | tweets | labels |
|---|---|---|
| 0 | Obama has called the GOP budget social Darwini... | 1 |
| 1 | In his teen years, Obama has been known to use... | 0 |
| 2 | IPA Congratulates President Barack Obama for L... | 0 |
| 3 | RT @Professor_Why: #WhatsRomneyHiding - his co... | 0 |
| 4 | RT @wardollarshome: Obama has approved more ta... | 1 |


In [5]:
data['labels'].unique()

array([1, 0, 2])

## Text processing for the tweets

In [6]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [24]:
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords

# Tokens to drop: English stopwords, punctuation, and the placeholder tokens inserted below.
# Named stop_words so the imported nltk `stopwords` module is not shadowed.
stop_words = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

def processTweet(tweet):
    # tweet is the text we will pass for preprocessing
    # convert passed tweet to lower case
    tweet = str(tweet).lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with the placeholder URL
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with the placeholder AT_USER
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag

    # use word_tokenize imported above to tokenize the tweet
    tweet = word_tokenize(tweet)
    # drop stopwords, punctuation, and the URL / AT_USER placeholders
    return [word for word in tweet if word not in stop_words]
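
The three substitutions above can be traced on a single tweet. The sketch below repeats only the regex steps and uses `str.split` as a rough stand-in for `word_tokenize` so that it runs without NLTK; the helper name `normalize_tweet`, the sample tweet, and the handle in it are hypothetical.

```python
import re

def normalize_tweet(tweet):
    """Apply the lowercasing and regex steps only (no NLTK tokenization)."""
    tweet = str(tweet).lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    return tweet

# hypothetical sample tweet
sample = "RT @someone: Check https://example.com #Election2012"
normalized = normalize_tweet(sample)
print(normalized)  # rt AT_USER check URL election2012

# whitespace split as a simplified stand-in for word_tokenize
placeholders = {'AT_USER', 'URL'}
tokens = [t for t in normalized.split() if t not in placeholders]
print(tokens)  # ['rt', 'check', 'election2012']
```

Because the placeholders are inserted after lowercasing, they stay uppercase and are removed by the stopword filter, which is why no `URL` or `AT_USER` tokens appear in the processed text.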

## Process all tweets

In [25]:
processed=[]

for tweet in data['tweets']:
    
    # process all tweets using processTweet function above - store in variable 'cleaned' 
    cleaned=processTweet(tweet)
    processed.append(' '.join(cleaned))

In [26]:
data['processed'] = processed

In [27]:
data.head()

| | tweets | labels | processed |
|---|---|---|---|
| 0 | Obama has called the GOP budget social Darwini... | 1 | obama called gop budget social darwinism nice ... |
| 1 | In his teen years, Obama has been known to use... | 0 | teen years obama known use marijuana cocaine |
| 2 | IPA Congratulates President Barack Obama for L... | 0 | ipa congratulates president barack obama leade... |
| 3 | RT @Professor_Why: #WhatsRomneyHiding - his co... | 0 | rt whatsromneyhiding connection supporters cri... |
| 4 | RT @wardollarshome: Obama has approved more ta... | 1 | rt obama approved targeted assassinations mode... |


## Create pipeline and define parameters for GridSearch

In [72]:
os_smote = SMOTE(sampling_strategy='auto')
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('os', os_smote),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    # 'os__sampling_strategy': (0.1, 0.2, 0.3),
    'clf__alpha': [1, 1e-1, 1e-2]
}
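
As a rough sense of the search cost, the grid above can be enumerated by hand. This sketch copies the parameter values from `tuned_parameters` and assumes the 10-fold cross-validation used in the GridSearch cell of this notebook; the variable names are only for illustration.

```python
from itertools import product

# same values as tuned_parameters (the commented-out SMOTE option is excluded)
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2],
}

# every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_folds = 10

print(len(combinations))            # 36
print(len(combinations) * n_folds)  # 360 pipeline fits during the search
```

Adding another parameter, such as the commented-out `os__sampling_strategy`, multiplies this count by the number of values tried.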

## Split data into test and train

In [63]:
# split data into train (80%) and test (20%) sets
X = data.processed
y = data.labels

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)

## Perform classification (using GridSearch)

In [74]:
# perform GridSearch CV with 10 fold CV using pipeline and tuned_parameters defined above
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
clf = GridSearchCV(text_clf, tuned_parameters, cv=10 )
clf.fit(x_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        pre

In [67]:
print(clf.best_params_)

{'clf__alpha': 0.01, 'tfidf__norm': 'l1', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
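
The selected `vect__ngram_range` of `(1, 2)` means the vectorizer counts both single words and pairs of adjacent words. The sketch below shows which features that produces for one processed tweet; `word_ngrams` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = "teen years obama known".split()
features = word_ngrams(tokens, 1, 2)
print(features)
# ['teen', 'years', 'obama', 'known', 'teen years', 'years obama', 'obama known']
```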


## Classification report 

In [71]:
# Without oversampling (pipeline run without the SMOTE step)
# print classification report after predicting on test set with best model obtained in GridSearch

print(classification_report(y_test, clf.predict(x_test), digits=4))

              precision    recall  f1-score   support

           0     0.8507    0.9792    0.9104       192
           1     0.8936    0.6269    0.7368        67
           2     0.7500    0.3529    0.4800        17

    accuracy                         0.8551       276
   macro avg     0.8314    0.6530    0.7091       276
weighted avg     0.8549    0.8551    0.8418       276



In [75]:
# With oversampling (pipeline includes the SMOTE step)
# print classification report after predicting on test set with best model obtained in GridSearch

print(classification_report(y_test, clf.predict(x_test), digits=4))

              precision    recall  f1-score   support

           0     0.8978    0.8698    0.8836       192
           1     0.7324    0.7761    0.7536        67
           2     0.4737    0.5294    0.5000        17

    accuracy                         0.8261       276
   macro avg     0.7013    0.7251    0.7124       276
weighted avg     0.8316    0.8261    0.8284       276
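
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows, which helps when comparing the two reports. The sketch below does this for recall, using the values printed above; the function name `macro_and_weighted` is made up for this example.

```python
def macro_and_weighted(values, supports):
    """Unweighted mean over classes, and mean weighted by class support."""
    macro = sum(values) / len(values)
    weighted = sum(v * s for v, s in zip(values, supports)) / sum(supports)
    return macro, weighted

supports = [192, 67, 17]                 # test-set support for classes 0, 1, 2
recall_without = [0.9792, 0.6269, 0.3529]
recall_with = [0.8698, 0.7761, 0.5294]

macro_wo, weighted_wo = macro_and_weighted(recall_without, supports)
macro_w, weighted_w = macro_and_weighted(recall_with, supports)

print(f"without oversampling: macro={macro_wo:.4f}, weighted={weighted_wo:.4f}")
print(f"with oversampling:    macro={macro_w:.4f}, weighted={weighted_w:.4f}")
```

Macro recall treats all three classes equally, so it rises when the small classes improve, while weighted recall is dominated by class 0 and matches the overall accuracy.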



## Important: class imbalance

In [37]:
counts = data.labels.value_counts()
print(counts)

0    947
1    352
2     81
Name: labels, dtype: int64
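
To put these counts in perspective, the short sketch below converts them to class shares. The numbers are copied from the output above; the variable names are only for illustration.

```python
# label counts copied from the value_counts() output
counts = {0: 947, 1: 352, 2: 81}
total = sum(counts.values())

for label, n in counts.items():
    print(f"class {label}: {n / total:.3f}")

# accuracy of always predicting the most frequent class
majority_share = max(counts.values()) / total
# size of the largest class relative to the smallest
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"majority share:  {majority_share:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A model that always predicts class 0 would already be right about 69% of the time on the full dataset, so accuracy alone can hide poor performance on the smaller classes.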


The counts above show that the class distribution is highly imbalanced: class 0 has 947 tweets, while class 2 has only 81, so the classifier sees few examples of the minority classes. For your learning, try using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes, then evaluate Naive Bayes again and compare. In the reports above, oversampling raised recall on class 2 from 0.3529 to 0.5294, while overall accuracy fell from 0.8551 to 0.8261.
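
The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step, on made-up feature vectors; it is a simplified illustration and not the imbalanced-learn implementation, which also handles the neighbor search and how many samples to generate per class.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Pick a random point on the line segment between point and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)

# hypothetical TF-IDF-like feature vectors for two minority-class tweets
minority_point = [0.2, 0.0, 0.5]
nearest_neighbor = [0.4, 0.1, 0.5]

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Each coordinate of the synthetic sample lies between the corresponding coordinates of the two real samples, so the new point stays inside the region already occupied by the minority class.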