<a href="https://colab.research.google.com/github/MuditDubey01/Sentimental-Analysis-Project/blob/main/SentimentalAnalysis_on_Sentiment140dataset_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Twitter Tweets Sentiment Classification

* Sentiment classification is the automated process of identifying opinions in text and labeling them as positive, negative, or neutral, based on the emotions customers express within them.

* The labelled data has only two classes, 0 (negative) and 4 (positive, later re-encoded as 1), so we perform binary classification


Importing necessary libraries and dependencies

In [None]:
import numpy as np 
import pandas as pd 
import os
import matplotlib.pyplot as plt 
import seaborn as sns 
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import re
import nltk
from nltk.corpus import stopwords
from tqdm import tqdm
from nltk.stem import WordNetLemmatizer
import spacy

tqdm.pandas()
spacy_eng = spacy.load("en_core_web_sm")
lemm = WordNetLemmatizer()
sns.set_style("darkgrid")
plt.rcParams['figure.figsize'] = (20,8)
plt.rcParams['font.size'] = 18
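nltk.download('stopwords')  # stop-word list used during text cleaning
nltk.download('wordnet')    # lexical database required by WordNetLemmatizer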

#Data Import
* Reading and sampling the data

* Since the data contains 1.6 million records, building a dense feature matrix for all of them with at least 1,000-2,000 reasonable features is impractical in memory

* Therefore we randomly sample 102,000 records: 100,000 for training and validation, and 2,000 held out as a test set
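
A rough back-of-the-envelope check of the memory argument above. This is only a sketch: it assumes a dense matrix of 64-bit floats (8 bytes per value) and 2,000 features, and the helper name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(n_rows: int, n_features: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense n_rows x n_features matrix."""
    return n_rows * n_features * bytes_per_value / 1024**3

# Assumed sizes: the full dataset versus the 100,000-row training sample
full = dense_matrix_gib(1_600_000, 2000)
sampled = dense_matrix_gib(100_000, 2000)

print(f"Full dataset: {full:.1f} GiB")
print(f"100k sample:  {sampled:.1f} GiB")
```

Under these assumptions the full dataset needs roughly sixteen times the memory of the sample, which is the motivation for sampling.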

In [None]:
data = pd.read_csv("../input/sentiment140/training.1600000.processed.noemoticon.csv",encoding = "latin",header=None)
data = data.iloc[:,[5,0]]
data.columns = ['Text','Class']
data = data.sample(102000)
data.reset_index(drop=True,inplace=True)
data.head(10)

#**Text Cleaning**



1. Removal of extra spaces, lines and indentations
2. Removal of special characters such as punctuation and emojis
3. Conversion of text into lower case
4. Tokenization
5. Filtering and removal of stopwords
6. Lemmatization
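
The first five steps can be sketched on a single string as follows. This is a minimal illustration rather than the notebook's exact function: `TOY_STOP_WORDS` is a hypothetical, tiny stop-word set standing in for NLTK's English list, and lemmatization (step 6) is left out so the example runs without any downloaded corpora.

```python
import re

# Hypothetical stop-word set for illustration only
TOY_STOP_WORDS = {"the", "is", "a", "to", "and"}

def clean_text_sketch(raw: str) -> list[str]:
    text = re.sub(r"\s+", " ", raw)             # 1. collapse spaces, tabs and newlines
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)   # 2. replace special characters with spaces
    text = text.lower()                          # 3. lower-case
    tokens = text.split()                        # 4. whitespace tokenization
    return [t for t in tokens if t not in TOY_STOP_WORDS]  # 5. drop stop words

tokens = clean_text_sketch("The movie is NOT good :(\n   #sad")
print(tokens)
```

Note that "not" survives the filter here, which mirrors the decision to keep negations because they carry sentiment.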



While the other steps are fairly easy to understand, an example of lemmatization is given below. Stemming is an alternative, but we prefer lemmatization because it yields intelligible root words.

__Example of Lemmatization__
<img src='https://user.oc-static.com/upload/2020/10/22/16033589274175_lemmatization%20example%2001.png'>

In [None]:
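stop_words = set(stopwords.words('english'))
stop_words.discard('not')  # keep 'not': negation carries sentiment

def text_cleaning(x):
    """Collapse whitespace, strip special characters, lower-case,
    drop stop words and lemmatize each remaining token as a verb."""
    text = re.sub(r'\s+', ' ', x)               # collapse spaces, tabs and newlines
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)   # replace special characters with spaces
    text = text.lower()
    text = text.split()                         # whitespace tokenization

    text = [lemm.lemmatize(word, "v") for word in text if word not in stop_words]
    text = ' '.join(text)

    return text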

In [None]:
data['Clean Text'] = data['Text'].progress_apply(text_cleaning)

In [None]:
data.head(10)

In [None]:
data.isna().sum()

In [None]:
data.loc[data['Class']==4,'Class'] = 1
data['Class'].value_counts()

#EDA and Visualization
* Exploring class count distribution
* Frequently occurring words in negative and positive tweets
* Sentence Length Distribution Analysis
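
As a lightweight complement to the plots, the last two analyses can be sketched with the standard library alone. The three strings in `toy_tweets` are hypothetical stand-ins for the `Clean Text` column.

```python
from collections import Counter

# Hypothetical cleaned tweets, for illustration only
toy_tweets = ["good day good mood", "bad day", "good vibes"]

# Word frequencies across all tweets
word_counts = Counter(word for tweet in toy_tweets for word in tweet.split())
top_words = word_counts.most_common(2)

# Sentence length = number of whitespace-separated tokens per tweet
lengths = [len(tweet.split()) for tweet in toy_tweets]

print(top_words)
print(lengths)
```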

In [None]:
sns.countplot(data=data,y='Class')
plt.title("Class Count Distribution")
plt.show()  

In [None]:
positive = data[data['Class']==1]['Clean Text'].tolist()
negative = data[data['Class']==0]['Clean Text'].tolist()

#WordClouds and Barplots of Frequently Occurring Words

In [None]:
wordcloud = WordCloud(max_words=1500, width=600, background_color='black').generate(" ".join(positive))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title("Positive Tweets Wordcloud")
plt.axis("off")
plt.show()

In [None]:
x = []
y = []
for key,value in wordcloud.words_.items():
    x.append(key)
    y.append(value)
    if len(x) == 15:
        break
sns.barplot(x=x,y=y,color='black')
plt.title("Normalized Count of Top-15 Frequent Words with Positive Sentiments")
plt.xlabel("Words")
plt.ylabel("Normalized Count")
plt.show()

In [None]:
wordcloud = WordCloud(max_words=1500, width=600, background_color='black').generate(" ".join(negative))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title("Negative Tweets Wordcloud")
plt.axis("off")
plt.show()

In [None]:
x = []
y = []
for key,value in wordcloud.words_.items():
    x.append(key)
    y.append(value)
    if len(x) == 15:
        break
sns.barplot(x=x,y=y,color='black')
plt.title("Normalized Count of Top-15 Frequent Words with Negative Sentiments")
plt.xlabel("Words")
plt.ylabel("Normalized Count")
plt.show()

In [None]:
data['sentence_length'] = data['Clean Text'].progress_apply(lambda x: len(x.split()))

In [None]:
sns.boxplot(data=data,y='sentence_length',x='Class')
plt.title("IQR Analysis of Sentence Lengths")
plt.show()

In [None]:
data['sentence_length'].describe()

In [None]:
test = data.tail(2000)
data = data.iloc[:100000,:]

#Feature Extraction and Modelling
* Extracting features using the TF-IDF method
* We avoid Count Vectorizer here because the corpus is very large: raw counts would give words that are common across the whole corpus disproportionately high values. Since we want to capture how distinctive each term is for a given tweet relative to the corpus, TF-IDF is the preferred choice of feature extraction

<img src='https://miro.medium.com/max/1200/1*V9ac4hLVyms79jl65Ym_Bw.jpeg'>
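
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which matches scikit-learn's documented defaults for `TfidfVectorizer`. The function name `tfidf_sketch` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(docs: list[list[str]]) -> list[dict[str, float]]:
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for doc in docs:
        tf = Counter(doc)                                # raw term counts
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})  # L2-normalised row
    return rows

docs = [["good", "day"], ["bad", "day"], ["good", "good", "movie"]]
rows = tfidf_sketch(docs)
print(rows[1])
```

In the second document, "bad" appears in only one document while "day" appears in two, so "bad" receives the larger weight. A plain count vectorizer would score both terms equally.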




In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [None]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=2500)
X = vectorizer.fit_transform(data['Clean Text'].values.tolist()).toarray()
y = data['Class'].values

* ngram_range=(1,2): the vectorizer considers both unigrams and bigrams, i.e. individual words as well as pairs of adjacent words.

* max_features=2500: the vectorizer keeps only the 2,500 most frequent terms (unigrams or bigrams) across the corpus.

* X = vectorizer.fit_transform(data['Clean Text'].values.tolist()).toarray(): fit_transform() learns the vocabulary and IDF weights from the data and returns a sparse matrix of TF-IDF features. toarray() then converts the sparse matrix into a dense one, which can be passed to the models below. Each row of the resulting X holds the TF-IDF features of one tweet.
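
The effect of `ngram_range` and `max_features` can be sketched without scikit-learn. The helper `ngrams` and the three-document `corpus` are hypothetical, and tie-breaking between equally frequent terms may differ from what `TfidfVectorizer` does internally.

```python
from collections import Counter

def ngrams(tokens: list[str], n_min: int = 1, n_max: int = 2) -> list[str]:
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

corpus = [["not", "good"], ["very", "good"], ["not", "bad"]]

# Count every unigram and bigram across the corpus, then keep the most frequent ones
term_counts = Counter(term for doc in corpus for term in ngrams(doc))
vocabulary = [term for term, _ in term_counts.most_common(2)]  # plays the role of max_features=2

print(ngrams(["not", "good"]))
print(vocabulary)
```

The bigram "not good" shows why bigrams matter for sentiment: as unigrams alone, "not" and "good" lose the negation.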

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score

In [None]:
def model_train(model, X_train, X_test, y_train, y_test):
    model.fit(X_train,y_train)
    y_pred_tr = model.predict(X_train)
    y_pred = model.predict(X_test)

    print("--------------------Training Performance---------------------")
    print(classification_report(y_train,y_pred_tr))
    print("-------------------------------------------------------------")
    print("--------------------Testing Performance----------------------")
    print(classification_report(y_test,y_pred))
    
    sns.heatmap(confusion_matrix(y_test, y_pred),cmap='viridis',annot=True,fmt='.4g',
            xticklabels=['Negative','Positive'],yticklabels=['Negative','Positive'])
    plt.xlabel('Predicted Class')
    plt.ylabel('Actual Class')
    plt.show()
    
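    # y_pred holds hard 0/1 labels, so the curve has a single operating point
    # and the AUC equals the balanced accuracy at the default threshold
    fpr, tpr, _ = roc_curve(y_test,  y_pred)
    auc = roc_auc_score(y_test, y_pred)
    plt.plot(fpr,tpr,label=type(model).__name__+", auc="+str(auc),lw=2)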
    plt.plot([0, 1], [0, 1], color="orange", lw=2, linestyle="--")
    plt.title("ROC Curve")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend(loc=4)
    plt.show()
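
In `model_train`, `roc_curve` and `roc_auc_score` receive the hard predictions `y_pred`. The sketch below illustrates how that differs from passing continuous scores such as `model.predict_proba(X_test)[:, 1]`. It uses a hypothetical pure-Python helper, `auc_from_scores`, that computes AUC as the fraction of positive/negative pairs ranked correctly (ties count as half), and the four probabilities are made up.

```python
def auc_from_scores(y_true: list[int], scores: list[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
probabilities = [0.2, 0.3, 0.4, 0.9]                    # hypothetical model scores
hard_labels = [int(p >= 0.5) for p in probabilities]    # thresholded at 0.5

auc_soft = auc_from_scores(y_true, probabilities)
auc_hard = auc_from_scores(y_true, hard_labels)

print(auc_soft, auc_hard)
```

The scores rank every positive above every negative, yet thresholding discards that ranking and lowers the AUC. AUC values computed from hard labels are therefore best read as balanced accuracy.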

#Naive Bayes

In [None]:
model = MultinomialNB()
model_train(model, X_train, X_test, y_train, y_test)

#Logistic Regression
* Max iterations is set to 1000 to give the solver enough iterations to converge
* With the default number of iterations (100), the model fails to converge

In [None]:
model = LogisticRegression(max_iter=1000)
model_train(model, X_train, X_test, y_train, y_test)

#Random Forest
* Uses 100 decision trees to perform bagging and produce an ensemble result
* Max depth is set to 15 to avoid overfitting on the training set
* Max features is set to the square root of the number of features (performs slightly better than log2 and auto)

In [None]:
model = RandomForestClassifier(n_estimators=100,max_depth=15,max_features='sqrt')
model_train(model, X_train, X_test, y_train, y_test)

| Model | F1-Score | Accuracy | AUC Score |
| --- | --- | --- | --- |
| Naive Bayes | 0.75 | 0.75 | 0.75 |
| __Logistic Regression__ | 0.76 | 0.76 | 0.758 |
| Random Forest | 0.70 | 0.70 | 0.69 |

__Inference:__ Based on accuracy, F1-score and AUC score, Logistic Regression performs best on the validation set
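
For reference, accuracy and F1 can be derived directly from the four cells of a binary confusion matrix. The counts below are hypothetical and are not the notebook's results. The F1 shown is for the positive class only, whereas `classification_report` also prints averaged variants.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only
metrics = binary_metrics(tn=70, fp=30, fn=20, tp=80)
print({name: round(value, 3) for name, value in metrics.items()})
```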

#Post Feature Extraction Analysis
* Here we apply PCA only to visualize how the data looks on a 2D plane
* Since the number of records and features is very high, it is difficult to inspect the correlations directly and use them to motivate PCA

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA()
X_pca = pca.fit_transform(X_train)
variance_explained = np.cumsum(pca.explained_variance_ratio_)
pcs = range(1,len(variance_explained)+1)

#Analysing the Explained Variance
* 2 components explain less than 15% of the total variance of the extracted features, but we will visualize them regardless
* More than 2,000 components are needed to represent over 90% of the variance. There seems to be little correlation among the extracted features, which is plausible since the TF-IDF matrix is sparse
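
The component count behind such a statement can be read off the cumulative explained variance. Below is a minimal standard-library sketch; the helper `components_for_variance` and the ratios in `toy_ratios` are hypothetical, not values from the fitted PCA.

```python
from bisect import bisect_left
from itertools import accumulate

def components_for_variance(explained_variance_ratio: list[float], threshold: float = 0.90) -> int:
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = list(accumulate(explained_variance_ratio))
    index = bisect_left(cumulative, threshold)
    if index == len(cumulative):
        raise ValueError("threshold is never reached")
    return index + 1  # convert 0-based index to a component count

# Hypothetical, rapidly decaying ratios for illustration
toy_ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
k = components_for_variance(toy_ratios)
print(k)
```

With the notebook's `variance_explained` array, which is already cumulative, `np.searchsorted(variance_explained, 0.90) + 1` should give the same kind of count.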

In [None]:
plt.plot(pcs,variance_explained)
plt.title("PCs Explained Variance")
plt.xlabel("Principal Components")
plt.ylabel("Variance Explanation")
plt.show()

#Training Set

In [None]:
vis_df = pd.DataFrame(X_pca[:,:2])
vis_df['class'] = y_train

sns.scatterplot(x=0,y=1,data=vis_df,hue='class')
plt.title("Visualization of TF-IDF embeddings in 2D-Plane using PCA (Training Set)")
plt.xlabel("First PC")
plt.ylabel("Second PC")
plt.show()

#Validation Set

In [None]:
X_test_pca = pca.transform(X_test)
vis_df = pd.DataFrame(X_test_pca[:,:2])
vis_df['class'] = y_test

sns.scatterplot(x=0,y=1,data=vis_df,hue='class')
plt.title("Visualization of TF-IDF embeddings in 2D-Plane using PCA (Validation Set)")
plt.xlabel("First PC")
plt.ylabel("Second PC")
plt.show()

**Inference**: As expected, there is no clearly distinctive pattern. The non-linearities may, however, be captured by feed-forward neural nets or other non-linear models

# Inference Pipeline
- Let's put the above feature extraction and modelling steps inside a pipeline

<img src='https://miro.medium.com/max/1400/1*ah8eEa2j4NULlMUts6UFNA.png'>
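
One common way to build such a pipeline is to fit both steps together on raw text, so that vectorization and classification cannot drift apart. The sketch below assumes scikit-learn is installed and uses four hypothetical toy tweets. It differs from the notebook's approach, which assembles the pipeline from an already fitted vectorizer and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical cleaned tweets and labels (1 = positive, 0 = negative)
toy_texts = ["love great day", "great happy love", "hate bad day", "bad sad hate"]
toy_labels = [1, 1, 0, 0]

toy_pipe = Pipeline([
    ("feature_extraction", TfidfVectorizer(ngram_range=(1, 2))),
    ("logit", LogisticRegression(max_iter=1000)),
])

# A single fit call trains the vectorizer and the classifier in sequence
toy_pipe.fit(toy_texts, toy_labels)
predictions = toy_pipe.predict(["love great", "hate bad"])
print(predictions)
```

A pipeline fitted this way can be passed to utilities such as cross-validation, which refit the vectorizer on each training fold.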

#Test Set

In [None]:
from sklearn.pipeline import Pipeline
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)
# Both steps are already fitted, so the pipeline is used for prediction only
pipe = Pipeline([('feature_extraction', vectorizer), ('logit', model)])

In [None]:
test_input = test['Clean Text'].tolist()
test_label = test['Class']

In [None]:
outputs = pipe.predict(test_input)

In [None]:
print(classification_report(test_label,outputs))

**Conclusion**: *The model gives the same performance on the previously unseen test set, which was cleaned but not vectorized beforehand*