### Tweets Sentiment Prediction using NLP and Classification algorithm

The goal is to predict whether a tweet carries negative, neutral, or positive sentiment. Automatic classification helps a reader or a business sort tweets quickly and make decisions based on their sentiment. It also helps summarize outcomes for performance measurement, which would otherwise be a highly manual process.


#### Dataset:
The dataset contains tweets, each tagged with a sentiment label.
Source: https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset

### Importing dataset

In [22]:
import pandas as pd
import numpy as np
data = pd.read_csv(r"Tweets.csv")

#### Dataset Exploration

In [23]:
data.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [24]:
data["sentiment"].value_counts()

neutral     11118
positive     8582
negative     7781
Name: sentiment, dtype: int64

The classes look reasonably balanced. We'll use only the 'text' and 'sentiment' columns.

In [25]:
data = data[['text','sentiment']]
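
As a rough check on the class balance, the sketch below turns the counts printed by value_counts() into shares. The numbers are copied by hand from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {"neutral": 11118, "positive": 8582, "negative": 7781}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

On the DataFrame itself, `data["sentiment"].value_counts(normalize=True)` should give the same shares directly.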

In [26]:
data.head()

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative


In [41]:
data = data.dropna()

Resetting the index so that row labels are contiguous again after dropping rows

In [42]:
data = data.reset_index(drop=True) # drop=True discards the old index instead of adding it as an 'index' column

In [43]:
data = data[['text','sentiment']]
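
The effect of dropping rows and resetting the index can be seen on a small made-up frame. This is only a sketch; `toy` and its contents are invented for illustration and are not part of the dataset.

```python
import pandas as pd

# Made-up frame with one missing tweet.
toy = pd.DataFrame({
    "text": ["good day", None, "bad day", "ok"],
    "sentiment": ["positive", "neutral", "negative", "neutral"],
})

dropped = toy.dropna()
print(list(dropped.index))        # the label of the dropped row is now missing

# Without drop=True the old labels come back as an extra 'index' column.
with_column = dropped.reset_index()
print(list(with_column.columns))

# With drop=True the old labels are discarded and rows are renumbered from 0.
renumbered = dropped.reset_index(drop=True)
print(list(renumbered.index))
```

A contiguous index matters later because the cleaning loop reads rows by position with `data['text'][i]`.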

In [44]:
data.head()

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD I will miss you here in San Diego!!!,negative
2,my boss is bullying me...,negative
3,what interview! leave me alone,negative
4,"Sons of ****, why couldn`t they put them on t...",negative


Encoding target feature 'sentiment' before feeding to classifier

In [54]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data['sentiment']= le.fit_transform(data['sentiment'])

In [57]:
data.sentiment.unique()

array([1, 0, 2])

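Here we have three classes encoded as 0, 1, 2. LabelEncoder assigns codes in sorted label order, so negative = 0, neutral = 1, and positive = 2.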
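
To read the integer codes back as label names, the encoder's `classes_` attribute can be inspected. The sketch below fits a fresh encoder on the three label strings rather than on the notebook's `data`, so it runs on its own.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["neutral", "negative", "positive"])

# classes_ holds the labels in the order of their integer codes.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns codes back into label strings.
decoded = [str(label) for label in le.inverse_transform([1, 0, 2])]
print(decoded)
```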

Functionalities and explanations for each cleaning step will be added as a part of another detailed readout.

In [None]:
# Libraries for cleaning text data
import re
import nltk
nltk.download('stopwords') # To identify stop words
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer  # To stem the word to its basic form

In this section, we clean the tweet text. All three sentiment classes are kept, so the classifier trained below is a three-class model rather than a binary one.

In [59]:
ps = PorterStemmer()
all_stopwords = set(stopwords.words('english'))   # using English stopwords
all_stopwords.remove('not')   # keep 'not' because it carries sentiment
word_list = []
for i in range(len(data)):
  tweet = re.sub('[^a-zA-Z]', ' ', data['text'][i])  # keep letters only
  tweet = tweet.lower().split()
  tweet = [ps.stem(word) for word in tweet if word not in all_stopwords] # stemming all words that are not stop words
  word_list.append(' '.join(tweet))

In [60]:
word_list[1:5]

['sooo sad miss san diego',
 'boss bulli',
 'interview leav alon',
 'son put releas alreadi bought']
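
To see what the cleaning steps do before any stemming, the sketch below applies the same regex, lowercasing, and stop-word filter to single tweets. `clean_tweet` and `TOY_STOPWORDS` are hypothetical names, and the tiny stop-word set is an assumption standing in for NLTK's full English list.

```python
import re

# Hypothetical mini stop-word set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"i", "will", "you", "here", "in", "is", "me", "my"}

def clean_tweet(text, stop_words=TOY_STOPWORDS):
    """Keep letters only, lowercase, split, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)

cleaned = clean_tweet("Sooo SAD I will miss you here in San Diego!!!")
print(cleaned)

cleaned_second = clean_tweet("my boss is bullying me...")
print(cleaned_second)
```

Comparing these strings with the stemmed list above shows which changes come from stemming alone.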

The stemmed output is rough: stems such as 'bulli' and 'alon' are not real words.
Let us use lemmatization, which reduces words to their dictionary form and usually gives more readable results.

Downloading required resources

In [None]:
import nltk
nltk.download('wordnet')  # WordNet data used by WordNetLemmatizer
nltk.download('omw-1.4')

In [62]:
from nltk.stem import WordNetLemmatizer
 
# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()

In [63]:
all_stopwords = set(stopwords.words('english'))   # using English stopwords
all_stopwords.remove('not')   # keep 'not' because it carries sentiment
word_list = []
for i in range(len(data)):
  tweet = re.sub('[^a-zA-Z]', ' ', data['text'][i])  # keep letters only
  tweet = tweet.lower().split()
  tweet = [wnl.lemmatize(word) for word in tweet if word not in all_stopwords] # lemmatizing all words that are not stop words
  word_list.append(' '.join(tweet))

In [64]:
word_list[1:5]

['sooo sad miss san diego',
 'bos bullying',
 'interview leave alone',
 'son put release already bought']

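The lemmatized list is more readable but still imperfect: 'boss' became 'bos', and 'bullying' was left unchanged because the lemmatizer treats every word as a noun by default. We will use this list to train the model and see how it performs.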

In [65]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(word_list).toarray()
y = data.iloc[:, -1].values

Checking the vocabulary size (the number of columns in X)

In [66]:
len(X[1])

22536

Keeping only the 15,000 most frequent words

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features= 15000)
X = vectorizer.fit_transform(word_list).toarray()
y = data.iloc[:, -1].values
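
How `max_features` trims the vocabulary is easier to see on a tiny made-up corpus. In the sketch below, `toy_corpus` is invented for illustration; only the two most frequent words survive the limit.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: 'good' occurs 4 times, 'bad' 3, 'movie' 2, 'plot' 1.
toy_corpus = ["good good good movie", "bad bad movie", "good bad plot"]

full = CountVectorizer().fit(toy_corpus)
limited = CountVectorizer(max_features=2).fit(toy_corpus)

# vocabulary_ maps each kept word to its column index.
full_vocab = sorted(full.vocabulary_)
limited_vocab = sorted(limited.vocabulary_)
print(full_vocab)
print(limited_vocab)
```

Words outside the kept vocabulary are simply ignored when a text is transformed, so rarer words contribute nothing to the feature vector.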

### Splitting dataset for training and testing

In [68]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

### Model training

Using a Multinomial Naive Bayes classifier for classifying the sentiment

In [69]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB()
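
For intuition about what the classifier learns, the sketch below fits Multinomial Naive Bayes on a made-up two-word, two-class problem and compares scikit-learn's learned word probabilities with a manual Laplace-smoothed calculation. The arrays are invented for illustration and assume the default `alpha=1.0`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Two toy "documents" over a two-word vocabulary, one per class.
X_toy = np.array([[2, 0],    # class 0 uses word 0 twice
                  [0, 2]])   # class 1 uses word 1 twice
y_toy = np.array([0, 1])

model = MultinomialNB()      # default alpha=1.0 (Laplace smoothing)
model.fit(X_toy, y_toy)

# Manual smoothed estimate: (count + alpha) / (class total + alpha * n_words)
alpha = 1.0
n_words = X_toy.shape[1]
manual_probs = (X_toy + alpha) / (X_toy.sum(axis=1, keepdims=True) + alpha * n_words)
print(manual_probs)
print(np.exp(model.feature_log_prob_))

# A new document containing word 0 once.
new_doc = np.array([[1, 0]])
proba = model.predict_proba(new_doc)[0]
predicted = int(model.predict(new_doc)[0])
print(predicted, proba)
```

Here each class has exactly one training document, so the per-class counts equal the rows of `X_toy`; on real data the counts would be summed over all documents of a class.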

### Testing with test data

In [70]:
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[2 2]
 [2 1]
 [2 2]
 ...
 [1 0]
 [2 1]
 [0 0]]


### Measuring accuracy and making confusion matrix

In [71]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[ 912  501  110]
 [ 372 1484  419]
 [  83  455 1160]]


0.6470160116448326
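
The accuracy can be recomputed from the confusion matrix, and the same matrix also gives per-class recall and precision. The sketch below copies the matrix printed above; the class names assume LabelEncoder's sorted order (0 = negative, 1 = neutral, 2 = positive).

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[912, 501, 110],
               [372, 1484, 419],
               [83, 455, 1160]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)        # per true class
precision = np.diag(cm) / cm.sum(axis=0)     # per predicted class

print(f"accuracy: {accuracy:.4f}")
for name, r, p in zip(["negative", "neutral", "positive"], recall, precision):
    print(f"{name:<9} recall={r:.3f} precision={p:.3f}")
```

Per-class figures show where the errors concentrate, which a single accuracy number hides.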

Now limiting the vocabulary to 12,000 words to see whether performance on the test data improves

In [72]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features= 12000)
X = vectorizer.fit_transform(word_list).toarray()
y = data.iloc[:, -1].values

In [73]:
len(X[1])

12000

In [75]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[ 926  488  109]
 [ 381 1473  421]
 [  86  452 1160]]


0.6475618631732168

Accuracy barely changes (0.6470 to 0.6476). Trying a much smaller vocabulary of 1,000 words

In [76]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features= 1000)
X = vectorizer.fit_transform(word_list).toarray()
y = data.iloc[:, -1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)


classifier = MultinomialNB()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
# print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[ 873  554   96]
 [ 322 1608  345]
 [  79  488 1131]]


0.6572052401746725

With a 1,000-word vocabulary, the accuracy reaches about 66% (0.657), up from about 65% with 15,000 or 12,000 words.
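
The repeated cells for different vocabulary sizes could be folded into one helper and a loop. The sketch below is one possible shape, not the notebook's own code: `accuracy_for_vocab_size` is a hypothetical name, and the toy texts and labels are invented so the block runs on its own. In the notebook, `word_list` and the encoded sentiment column would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def accuracy_for_vocab_size(texts, labels, max_features):
    """Vectorize, split 80/20, fit Multinomial NB, and return test accuracy."""
    # The sparse matrix is kept as-is; MultinomialNB accepts sparse input.
    X = CountVectorizer(max_features=max_features).fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.20, random_state=0)
    model = MultinomialNB().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Invented toy data: 1 = positive, 0 = negative.
toy_texts = [
    "love this great day", "great fun love it", "happy great time",
    "love happy smile", "fun happy day", "hate this bad day",
    "bad sad awful time", "sad hate it", "awful bad mood", "sad awful day",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

results = {n: accuracy_for_vocab_size(toy_texts, toy_labels, n) for n in (3, 5, 50)}
for n, acc in results.items():
    print(n, acc)
```

Skipping `.toarray()` also avoids building a large dense matrix, which may matter at the notebook's vocabulary sizes.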