<a href="https://colab.research.google.com/github/chota-mota01/Capstone_Classification_Project_Coronavirus_Tweet_Sentiment_Analysis/blob/main/Coronavirus_Tweet_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Coronavirus Tweet Sentiment Analysis**



##### **Project Type**    - Classification
##### **Contribution**    - Individual

# **Project Summary -**

The project builds a classification model to predict the sentiment of coronavirus-related tweets. The dataset consists of tweets collected from Twitter that were manually tagged with a sentiment label.

The project was conducted individually. The dataset contains 41157 rows and 6 columns, and the 'Location' feature has numerous null values. To avoid altering the original data, we worked on a copy and converted the 'TweetAt' column to a datetime data type. The analysis of the cleaned data provided valuable insights for sentiment analysis.

We visualized the data with libraries such as seaborn and matplotlib. Various types of graphs, including bar graphs, count plots, line charts, box plots, histogram plots, correlation heatmaps, and pair plots, were employed. These visualizations simplify complex data and make it easier to interpret.

In our analysis of COVID-19 tweets, we focused on the "OriginalTweet" and "Sentiment" columns, disregarding irrelevant columns like "UserName" and "ScreenName." This streamlined our analysis pipeline, allowing us to extract relevant data efficiently. We used stemming and lemmatizing for text normalization.

We explored five machine learning models: Logistic Regression, Naive Bayes Classifier, Random Forest, KNN (K-Nearest Neighbors), and Decision Tree. 'Logistic Regression' is a statistical classification method (binary in its basic form, extended here to three sentiment classes), 'Naive Bayes Classifier' is a probabilistic classifier based on Bayes' theorem, 'Random Forest' is an ensemble learning method using multiple decision trees, 'KNN (K-Nearest Neighbors)' is a non-parametric classification algorithm, and 'Decision Tree' is a tree-like structure used for classification and regression tasks. Despite employing grid search cross-validation to optimize these models, we observed minimal improvements in test accuracy across all models.

The accuracy score and classification report are crucial metrics influencing analysis. They offer insights into a model's performance and its ability to correctly classify instances, aiding in real-world applications. The Logistic Regression model with Grid Search CV and count vectorization emerged as the top performer, boasting an accuracy of 79.51% without signs of overfitting.

Our analysis revealed that despite the COVID-19 pandemic's challenges, positive sentiments outweighed negative ones in the tweets. Nonetheless, a notable portion of negative sentiments persists, presenting opportunities for initiatives aimed at addressing public concerns and bolstering morale.










# **GitHub Link -**

https://github.com/chota-mota01/Capstone_Classification_Project_Coronavirus_Tweet_Sentiment_Analysis

# **Problem Statement**


The project entails the development of a classification model aimed at predicting the sentiment expressed in COVID-19-related tweets. The dataset comprises tweets sourced from Twitter, which have undergone manual sentiment tagging. To ensure privacy protection, both names and usernames have been anonymized through encoding.

The primary objective is to leverage machine learning techniques to accurately classify the sentiment conveyed in COVID-19 tweets. By building and deploying a robust classification model, we aim to gain insights into public sentiment surrounding the pandemic, enabling organizations to better understand public perception, sentiment trends, and potential areas of concern or positivity. Through this analysis, we seek to contribute to a deeper understanding of the public discourse surrounding COVID-19 on social media platforms.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # standard-library math; `from numpy import math` fails on recent NumPy versions

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GroupKFold
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

import re
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from pylab import rcParams
from sklearn.metrics import f1_score
import plotly.graph_objects as go

# Importing datetime modules
from datetime import datetime
from datetime import date

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Read Dataset
path = '/content/drive/My Drive/Colab Notebooks/Coronavirus Tweets.csv'
covi_data = pd.read_csv(path,encoding='latin-1')

### Dataset First View

In [None]:
# Dataset First Look
# head() method returns first 5 rows of the dataset
covi_data.head()

In [None]:
# If number is specified, head() returns specified number of first rows
covi_data.head(7)

In [None]:
# Dataset Last Look
# tail() method returns last 5 rows of the dataset
covi_data.tail()

In [None]:
# If number is specified, tail() returns specified number of last rows
covi_data.tail(8)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
covi_data.shape

### Dataset Information

In [None]:
# Dataset Info
covi_data.info()

In [None]:
# Columns present in dataset
list(covi_data.columns)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
covi_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
covi_data.isna().sum().sum()

In [None]:
#Used isnull().sum method to view null value in each column
covi_data.isnull().sum()

In [None]:
# Visualizing the missing values
# Check Null value by plotting Heatmap
plt.figure(figsize=(10,5))
# cbar=False hides the colour bar (pickle.FALSE is a non-empty bytes opcode and evaluates as True)
sns.heatmap(covi_data.isnull(), cbar=False)

### What did you know about your dataset?

The dataset given contains coronavirus tweet information. We need to analyze the important factors in the dataset for tweet sentiment analysis.

The dataset has 41157 rows and 6 columns. It contains 8590 missing/null values and 0 duplicate rows. All of the null values are in the 'Location' column.

Using the seaborn library, we visualized the missing/null values as a heatmap.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
covi_data.columns

In [None]:
# Dataset Describe
covi_data.describe(include='all')

### Variables Description

* **UserName** **-** Unique user ID
* **ScreenName** **-**  Unique screen name of the user
* **Location** **-** Location of the user
* **TweetAt** **-** Date of the tweet
* **OriginalTweet** **-** Original text of the tweet
* **Sentiment** **-** Sentiment label of the tweet


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in covi_data:
  print(covi_data[column].unique())

In [None]:
# Count of Unique Values for each variable.
for col in covi_data:
  print("Count of unique values in",col,"is",covi_data[col].nunique(),".")

In [None]:
# Count of Location
covi_data['Location'].value_counts()

In [None]:
# Top 15 Locations (free-text values, not only countries)
covi_data['Location'].value_counts().head(15)

In [None]:
# Count of TweetAt
covi_data['TweetAt'].value_counts()

In [None]:
# Count of Sentiment
covi_data['Sentiment'].value_counts()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Create copy of dataset
covi_df=covi_data.copy()
covi_df.columns

In [None]:
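# Replace all null values in Location column by NA
# (assignment instead of inplace=True, which is unreliable on a column selection in recent pandas)
covi_df["Location"] = covi_df["Location"].fillna("NA")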

In [None]:
# Converting data type of date column (dates are stored as day-month-year)
covi_df['TweetAt'] = pd.to_datetime(covi_df['TweetAt'], format='%d-%m-%Y')

In [None]:
covi_df.info()

### What all manipulations have you done and insights you found?

While analyzing the dataset, we found many null values. Before manipulating the data, we created a copy of the coronavirus tweet dataset so that changes made to the copy do not affect the original.

In the copy, we replaced the null values of the Location column with "NA" and changed the data type of TweetAt to datetime.

These manipulations make the dataset easier to visualize.
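
The date conversion described above can be sketched with the standard library alone. This is a minimal illustration, assuming TweetAt values follow the day-month-year layout (`%d-%m-%Y`) used in the wrangling code; the sample strings are hypothetical.

```python
from datetime import datetime

# Hypothetical sample values in the day-month-year layout used by TweetAt
raw_dates = ["16-03-2020", "02-04-2020"]

# An explicit format avoids day/month ambiguity
parsed = [datetime.strptime(value, "%d-%m-%Y") for value in raw_dates]

# "02-04-2020" is read as 2 April, not 4 February
iso_dates = [d.strftime("%Y-%m-%d") for d in parsed]
print(iso_dates)
```

Passing the same format string to `pd.to_datetime(..., format='%d-%m-%Y')` applies this parsing to a whole column at once.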

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Line plot of tweet count by TweetAt

In [None]:
# Chart - 1 visualization code
# Count of OriginalTweet with TweetAt
plt.figure(figsize=(12,6))
grp_tweetAt = covi_df.groupby('TweetAt').count()['OriginalTweet'].plot()
plt.ylabel('Count')
plt.title('Tweeting Date', fontweight='bold')
plt.show()

##### 1. Why did you pick the specific chart?

The line plot shows the number of tweets posted on each date, which makes rises and falls over time easy to follow.

I selected this chart to see how tweet volume changes with the tweet date.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the most tweets were posted between dates 16 and 23, peaking at nearly 3500 tweets, and the fewest were posted between dates 28 and 30.

#### Chart - 2 - Histogram of Positive Sentiment

In [None]:
# Chart - 2 visualization code
# Plot number of characters for Positive sentiment
tweet_len=covi_df[covi_df['Sentiment']=="Positive"]['OriginalTweet'].str.len()
plt.hist(tweet_len,color='green')
plt.title('Positive Sentiments')

##### 1. Why did you pick the specific chart?

A histogram consists of bars that show the frequency of data within certain intervals, known as bins. The height of each bar represents the frequency of data falling within that bin.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that positive tweets range from about 10 to 350 characters. The most common length is 250-270 characters, which occurs more than 2500 times.
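
To read bin counts exactly instead of estimating them from bar heights, the binning can be reproduced by hand. The sketch below assumes equal-width bins spanning the minimum to the maximum value, with the maximum falling in the last bin, which is how `plt.hist` bins by default; the helper name `equal_width_bins` and the sample lengths are hypothetical.

```python
def equal_width_bins(values, n_bins=10):
    """Count values in n_bins equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single bin holds everything
        return [len(values)], [lo, hi]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin (right edge inclusive)
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges

# Hypothetical tweet lengths, for illustration only
lengths = [10, 45, 120, 250, 255, 260, 300, 350]
counts, edges = equal_width_bins(lengths, n_bins=4)
print(counts, edges)
```

The tallest entry in `counts` identifies the most common length range, and `edges` gives its exact boundaries.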

#### Chart - 3 - Histplot of TweetAt with different Sentiments

In [None]:
# Chart - 3 visualization code
# Plot Tweet date with different sentiments
plt.figure(figsize=(10,8))
sns.histplot(x= "TweetAt",data=covi_df, hue="Sentiment", multiple="stack")
plt.xticks(rotation=45, ha='right')
plt.title("Tweet Date of Sentiments", fontweight='bold')
plt.ylabel("Count",fontsize = 12)
plt.show()


##### 1. Why did you pick the specific chart?

A histplot is a type of visualization commonly used to display the distribution of a univariate dataset. It represents the frequency or count of observations falling within predefined intervals, called bins, by plotting bars whose heights correspond to these counts.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the highest number of tweets was posted on 20-03-2020, and all sentiment types are well represented on that date. Across the dates, positive sentiment dominates, followed by negative. The fewest tweets were posted on 28-03-2020.
Extremely positive tweets peak on 25-03-2020, whereas extremely negative tweets peak on 20-03-2020.

#### Chart - 4 - Histogram of Negative Sentiment

In [None]:
# Chart - 4 visualization code
# Plot number of characters for Negative sentiment
tweet_len=covi_df[covi_df['Sentiment']=="Negative"]['OriginalTweet'].str.len()
plt.hist(tweet_len,color='brown')
plt.title('Negative Sentiments')

##### 1. Why did you pick the specific chart?

A histogram consists of bars that show the frequency of data within certain intervals, known as bins. The height of each bar represents the frequency of data falling within that bin.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that negative tweets range from about 10 to 350 characters. Tweets of around 250 characters are the most common, occurring more than 1750 times.

#### Chart - 5 - Countplot for top 15 Countries

In [None]:
# Chart - 5 visualization code
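# Top 15 Locations
# iloc[1:16] skips the first entry, which is expected to be the "NA" placeholder for missing locations
plt.figure(figsize=(12,6))
sns.countplot(y=covi_df.Location, order = covi_df.Location.value_counts().iloc[1:16].index, palette ='rocket_r')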
plt.title('Top 15 locations')
plt.show()

##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the top 15 locations by number of tweets. The Location field is free text, so it mixes cities and countries. The most tweets come from London, followed by United States and the other locations.

#### Chart - 6 - Histogram of Neutral Sentiment

In [None]:
# Chart - 6 visualization code
# Plot number of characters for Neutral sentiment
tweet_len=covi_df[covi_df['Sentiment']=="Neutral"]['OriginalTweet'].str.len()
plt.hist(tweet_len,color='yellow')
plt.title('Neutral Sentiments')

##### 1. Why did you pick the specific chart?

A histogram comprises bars representing the frequency of data within specific intervals, referred to as bins. Each bar's height corresponds to the frequency of data falling within that particular bin.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that neutral tweets range from about 10 to 350 characters. The most common length is 110-140 characters, which occurs about 1200 times.

#### Chart - 7 - Histogram of Extremely Positive & Extremely Negative Sentiments

In [None]:
# Chart - 7 visualization code
# Plot number of characters for Extremely Positive & Extremely Negative Sentiments
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(12,5))
tweet_len=covi_df[covi_df['Sentiment']=="Extremely Positive"]['OriginalTweet'].str.len()
ax1.hist(tweet_len,color='darkmagenta')
ax1.set_title('Extremely Positive Sentiments')

tweet_len=covi_df[covi_df['Sentiment']=="Extremely Negative"]['OriginalTweet'].str.len()
ax2.hist(tweet_len,color='grey')
ax2.set_title('Extremely Negative Sentiments')


fig.suptitle("Characters in Tweet Sentiment", size=15,fontweight="bold")
# Showing the plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram comprises bars representing the frequency of data within specific intervals, referred to as bins. Each bar's height corresponds to the frequency of data falling within that particular bin.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that both extremely positive and extremely negative tweets range from about 10 to 350 characters. Tweets of around 250 characters occur nearly 1600 times for extremely positive sentiment, and tweets of 250-300 characters occur more than 1600 times for extremely negative sentiment.

#### Chart - 8 - Countplot of TweetAt

In [None]:
# Chart - 8 visualization code
# Dates of Tweets
plt.figure(figsize=(12,6))
sns.countplot(x='TweetAt', data=covi_df, palette ='icefire')
plt.xticks(rotation=45, ha='right')
plt.title("Date of Tweets")
plt.xlabel("TweetAt")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the most tweets were posted on 20-03-2020, with a count of nearly 3500, and the fewest were posted on 28-03-2020.

#### Chart - 9 - Pie Chart on Sentiment

In [None]:
# Chart - 9 visualization code
# Percentage of each Sentiment class
sentiment_counts = covi_df['Sentiment'].value_counts()
sentiment_counts.plot(kind='pie',
                      figsize=(15,6),
                      autopct="%.2f%%",
                      startangle=90,
                      # take the labels from the data so slices cannot be mislabeled
                      labels=sentiment_counts.index,
                      colors=['pink','brown','r','g','y'],
                      explode=[0.04]*len(sentiment_counts))
plt.legend(title='Sentiment:')


##### 1. Why did you pick the specific chart?

A pie chart compares the contribution of each part to the data. It is a circular statistical graphic which is divided into slices to illustrate numerical proportion.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Positive and Negative are the two largest sentiment classes, at 27.75% and 24.10% of tweets respectively.

#### Chart - 10 - Interactive pie plot in percentage for Top 15 locations

In [None]:
# Chart - 10 visualization code
# Tweet count per location, sorted descending, in a single 'count' column
# (naming the Series explicitly keeps the column name stable across pandas versions)
location_per = covi_df['Location'].value_counts().rename('count').to_frame()

In [None]:
# Plotting the interactive pie plot in percentage for Top 15 locations
data = {
   "values": location_per['count'][1:16],
   "labels": location_per.index[1:16],
   "domain": {"column": 0},
   "name": "Location Name",
   "hoverinfo":"label+percent+name",
   "hole": .4,
   "type": "pie"
}
layout = go.Layout(title="Percentage of Location", legend=dict(x=0.1, y=1.0, orientation="v"))
data = [data]
fig = go.Figure(data = data, layout = layout)
fig.update_layout(title_x=0.5)
fig.show()


##### 1. Why did you pick the specific chart?

A pie chart compares the contribution of each part to the data. It is a circular statistical graphic which is divided into slices to illustrate numerical proportion.

##### 2. What is/are the insight(s) found from the chart?

The interactive pie plot shows the percentage of Top 15 locations. London is the topmost location with 11.7%.

#### Chart - 11 - Barplot for Top 10 Reference Present in Tweets

In [None]:
# Chart - 11 visualization code
# Extract the accounts referenced with @ in each tweet
def Reference(text):
    line=re.findall(r'(?<=@)\w+',text)
    return " ".join(line)
covi_df['Reference']=covi_df['OriginalTweet'].apply(Reference)

# Skip the first entry (expected to be the empty string for tweets without any @) and keep the next 10
temp = covi_df['Reference'].value_counts().iloc[1:11]
# Build explicit 'Reference' and 'count' columns; this works on both older and newer pandas
temp = temp.rename_axis('Reference').reset_index(name='count')

# Plotting the bar plot
plt.figure(figsize=(12,6))
sns.barplot(x="Reference",y="count", data = temp, palette="pastel")
plt.title("Top 10 Reference Present in Tweets", fontweight='bold')

##### 1. Why did you pick the specific chart?

Bar graphs are the pictorial representation of data in the form of vertical or horizontal rectangular bars, where the length of bars are proportional to the measure of data. It is fundamental visualization used for comparing different sets of data and shows the relationship between two axes.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the top 10 references present in the tweets. Among these, 'realDonaldTrump' is referenced most often and 'narendramodi' least often.

#### Chart - 12 - Boxplot of Data

In [None]:
# Chart - 12 visualization code
# Boxplot of numerical columns of dataset
num_cols = covi_df.select_dtypes(include='number').columns

covi_df[num_cols].plot(kind='box', title='Boxplot of Data',color='red')
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots summarize how the values of a variable are distributed. They show the median, the quartiles (the “box”, where the middle half of the data points lie), the whiskers, and any outliers.

##### 2. What is/are the insight(s) found from the chart?

The dataset has 2 numerical columns, UserName and ScreenName. The chart shows that neither column has outliers.

#### Chart - 13 - Correlation Heatmap

In [None]:
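# Correlation Heatmap visualization code
plt.figure(figsize=(20,5))
# Restrict to numerical columns; corr() raises an error on text columns in recent pandas
cor = sns.heatmap(covi_df.select_dtypes(include='number').corr(),annot=True)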

##### 1. Why did you pick the specific chart?

Correlation heatmaps are a type of plot that visualize the strength of relationships between numerical variables. Correlation plots are used to understand which variables are related to each other and the strength of this relationship.

##### 2. What is/are the insight(s) found from the chart?

Only the numerical columns (UserName and ScreenName) appear in the correlation heatmap. Both are user identifiers, so their correlation carries no information about sentiment.

#### Chart - 14- Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(covi_df)

##### 1. Why did you pick the specific chart?

Pairplot visualizes given data to find the relationship between them where the variables can be continuous or categorical. Pairplot allows us to plot pairwise relationships between variables within a dataset.

The chart covers the numerical variables of the dataset, arranged in a matrix: each row of plots shares a y-axis variable and each column of plots shares an x-axis variable.

##### 2. What is/are the insight(s) found from the chart?

The pairplot plots every numerical column against every other. As our data has 2 numerical columns, we can see the relation between them using the pair plot.

## ***5. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
''' Replace all null values in Location column by NA '''

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
'''No outliers present in the data'''

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
'''Not required'''

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
!pip install contractions

In [None]:
# Expand Contraction
# import library
import contractions

def expand_contraction(text):
  expand_contraction = contractions.fix(text)
  return expand_contraction

In [None]:
covi_df['RealTweet']=covi_df['OriginalTweet'].apply(expand_contraction)

#### 2. Lower Casing

In [None]:
# Convert OriginalTweet to Lowercase
covi_df['RealTweet'] = covi_df['RealTweet'].str.lower()
covi_df['RealTweet']

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
# Define function to remove punctuation
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)

In [None]:
# Apply function to remove punctuation
covi_df['RealTweet']=covi_df['RealTweet'].apply(remove_punctuation)
covi_df.head(3)

#### 4. Removing URLs & Removing words and digits containing digits.

In [None]:
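# Remove URLs; punctuation is already stripped, so they appear as tokens starting with "http" or "www"
# regex=True is needed because recent pandas versions treat the pattern as a literal string by default
covi_df['RealTweet'] = covi_df['RealTweet'].str.replace(r'http\S+|www.\S+', '', case=False, regex=True)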

In [None]:
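# write function for removing text that matches a pattern (e.g. @user)
def remove_pattern(txt, pattern):
    # substitute the pattern directly rather than re-using each match as a new regex,
    # which could misbehave if a match contained regex special characters
    return re.sub(pattern, '', txt)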

In [None]:
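# Remove @user mentions from RealTweet
# Note: '@' was already stripped by the punctuation step above, so this pattern
# only has an effect if mentions are removed before punctuation
covi_df['RealTweet'] = np.vectorize(remove_pattern)(covi_df['RealTweet'], r'@[\w]*')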
covi_df.head(3)

In [None]:
# Remove special characters, numbers, punctuations (keep letters and '#')
covi_df['RealTweet'] = covi_df['RealTweet'].str.replace(r'[^a-zA-Z#]+', ' ', regex=True)
covi_df.head(3)
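
The cleaning steps above can be sketched on a single string using only the standard library. This is an illustrative alternative ordering rather than the notebook's pipeline: it removes URLs and @mentions before punctuation, while the `://`, `.` and `@` markers are still present. The function name `clean_tweet` and the sample tweet are hypothetical, and unlike the cells above it also drops the `#` of hashtags.

```python
import re
import string

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, punctuation and digits."""
    text = text.lower()
    # Remove URLs while "://" and "." are still present
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    # Remove @mentions while the "@" marker is still present
    text = re.sub(r"@\w+", " ", text)
    # Drop punctuation, then collapse anything that is not a letter into one space
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()

# Hypothetical tweet, for illustration only
sample = "@Shopper1 Stores are EMPTY!! https://t.co/abc123 #covid19"
cleaned = clean_tweet(sample)
print(cleaned)
```

Applied to a column, the same function could be passed to `Series.apply` in place of the separate steps.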

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
# Extract stopwords
nltk.download('stopwords')
# Extract the stopwords from nltk library
sw = stopwords.words('english')
# Display the stopwords
np.array(sw)

In [None]:
# Define function to remove stopwords
# (named remove_stopwords so it does not shadow the `stopwords` corpus imported from nltk)
def remove_stopwords(text):
    '''a function for removing the stopword'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with space separator
    return " ".join(text)

In [None]:
# Remove stopwords using function
covi_df['RealTweet']=covi_df['RealTweet'].apply(remove_stopwords)
covi_df.tail(4)

#### 6. Rephrase Text

In [None]:
# Rephrase Text
covi_df.head()

#### 7. Tokenization

In [None]:
# Tokenization
''' Tokenization is taken care of by the whitespace split inside Stemming and Lemmatizing'''
#covi_df['RealTweet']=covi_df['RealTweet'].apply(lambda x:str(x).split())


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
# Stemming
# Create an object of stemming function
stemmer = SnowballStemmer("english")

def stemming(text):
    '''a function which stems each word in the given text'''
    text = [stemmer.stem(word) for word in str(text).split( )]
    return " ".join(text)

In [None]:
covi_df['Stem'] = covi_df['RealTweet'].apply(lambda x: stemming(x))
covi_df.head(10)

In [None]:
# Lemmatizing
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
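# Lemmatize word by word and join back into one string, so the vectorizers receive plain text
covi_df['Lemm'] = covi_df['RealTweet'].apply(lambda x: " ".join(lemmatizer.lemmatize(y) for y in x.split()))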

In [None]:
covi_df.head(3)

##### Which text normalization technique have you used and why?

We used stemming and lemmatizing for text normalization. Stemming reduces words to their root form, known as the stem, by stripping suffixes and other affixes, so that inflected forms of the same word are grouped together even if the stem itself is not a valid word (for example, "studies" becomes "studi"). Lemmatizing reduces words to their dictionary form, known as the lemma. Unlike stemming, it uses language-specific dictionaries and morphological analysis to ensure that the result is a valid word (for example, "studies" becomes "study"). The models below are trained on the lemmatized text.

#### 9. Part of speech tagging

In [None]:
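# POS Tagging
# nltk.pos_tag expects a list of tokens, so each tweet is split into words and tagged separately
covi_df['po_tag'] = covi_df['RealTweet'].apply(lambda text: nltk.pos_tag(text.split()))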
covi_df.head()

#### 10. Text Vectorization

In [None]:
# Merge the five sentiment labels into three classes: Positive, Negative, Neutral
covi_df['Sentiment'] = covi_df['Sentiment'].replace("Extremely Positive","Positive")
covi_df['Sentiment'] = covi_df['Sentiment'].replace("Extremely Negative","Negative")

In [None]:
# Class counts after merging the labels
covi_df['Sentiment'].value_counts()

In [None]:
# Train Test Split
#Assigning dependent and independent features
x= covi_df['Lemm']
y= covi_df['Sentiment']
# Applying Train test split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,stratify=y,random_state=1)
#Printing the result
print(" X Train : ", x_train.shape)
print(" X Test : ", x_test.shape)

In [None]:
# Bag of words
co_vect = CountVectorizer(binary=False,max_df=1.0,min_df=5,ngram_range=(1,2))
co_x_train = co_vect.fit_transform(x_train.astype(str).str.strip())


In [None]:
# TF-IDF
tfidf_vect = TfidfVectorizer(use_idf=True,max_df=1.0,min_df=5,ngram_range=(1,2),sublinear_tf=True)
tfidf_x_train = tfidf_vect.fit_transform(x_train.astype(str).str.strip())

In [None]:
tfidf_x_train.shape

In [None]:
co_x_test = co_vect.transform(x_test.astype(str).str.strip())
tfidf_x_test = tfidf_vect.transform(x_test.astype(str).str.strip())

In [None]:
tfidf_x_test.shape

##### Which text vectorization technique have you used and why?

Bag of words and TF-IDF are the techniques used for text vectorization. Bag of Words (BoW) represents text data as a matrix where rows correspond to documents and columns correspond to the unique terms in the corpus (here single words and two-word sequences, since `ngram_range=(1,2)`), with each cell holding a term count. TF-IDF (Term Frequency-Inverse Document Frequency) weights each term by its frequency in the document and down-weights terms that appear in many documents of the corpus.
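
The two representations can be sketched by hand on a toy corpus. This is a minimal illustration, assuming the smoothed IDF, sublinear term frequency and unit-length scaling that `TfidfVectorizer` applies with `sublinear_tf=True` and its default settings; it uses single words only and hypothetical documents, and it omits the `min_df` filter and bigrams used in the cells above.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned documents
docs = ["store shelf empty", "store open", "price rise store"]

# Bag of words: one Counter of term counts per document
bow = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(term for counts in bow for term in counts))

n_docs = len(docs)
# Document frequency: number of documents containing each term
df = {term: sum(term in counts for counts in bow) for term in vocab}
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(counts):
    """Sublinear TF (1 + ln(count)) times IDF, scaled to unit Euclidean length."""
    weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

tfidf = [tfidf_row(counts) for counts in bow]
print(tfidf[1])
```

In the second document, "open" receives a higher weight than "store" because "store" occurs in every document and therefore carries less discriminating information.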

### 5. Data Scaling

In [None]:
# Scaling your data
''' The numerical columns present in the data (UserName, ScreenName) are not useful for Sentiment Analysis,
so data scaling is not performed on the dataset.'''

### 6. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
'''Splitting done before text vectorization'''
'''Splitting the data before text vectorization is essential for preventing data
leakage, ensuring realistic evaluation, simulating real-world deployment scenarios,
and enabling techniques like cross-validation for model assessment and hyperparameter tuning.'''
# checking the split data
print(x_train.head())
y_train.head()

##### What data splitting ratio have you used and why?

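We used an 80:20 train-test split, which leaves about 32,900 tweets for training and about 8,200 for testing: enough data to train on, and a test set large enough for a stable accuracy estimate. The split is stratified on Sentiment, so both sets keep the same class proportions.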
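
What stratification does can be sketched without any library. The helper below is a simplified stand-in for `train_test_split(..., stratify=y)`, not its actual algorithm: it shuffles the indices of each class separately and moves the same fraction of every class into the test set. The function name and the label counts are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels, for illustration only
labels = ["Positive"] * 50 + ["Negative"] * 30 + ["Neutral"] * 20
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 20% of its rows to the test set, so the class mix of the test set matches that of the full data.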

## ***6. ML Model Implementation***

In [None]:
labels = ['Negative', 'Neutral', 'Positive']

### ML Model - 1 - **Logistic Regression**

Logistic Regression is a statistical method for classification. In its basic form it is used for binary tasks, where the target variable has only two possible outcomes, and it models the probability of an outcome by fitting the data to a logistic function. For the three sentiment classes used here (Negative, Neutral, Positive), it is extended to the multiclass case.
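
For more than two classes, the logistic function generalizes to the softmax function, which turns one linear score per class into probabilities that sum to 1. The sketch below assumes this multinomial formulation, which recent scikit-learn versions typically use with the default solver; the scores are hypothetical numbers, not outputs of the fitted model.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Negative", "Neutral", "Positive"]
scores = [0.2, -1.0, 1.5]  # hypothetical linear scores, one per class
probs = softmax(scores)

# The predicted class is the one with the highest probability
predicted = classes[probs.index(max(probs))]
print(dict(zip(classes, probs)), predicted)
```

The predicted label is simply the class with the largest probability, which is what `predict` returns for each tweet.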

In [None]:
# ML Model - 1 Implementation
# Initializing model
lor= LogisticRegression()

# Fit the Algorithm
lor.fit(co_x_train,y_train)

# Predict on the model
pred_lor_cv=lor.predict(co_x_test)

In [None]:
# Accuracy
accuracy_lor_cv = accuracy_score(y_test,pred_lor_cv)
print("Accuracy :",(accuracy_lor_cv))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_lor_cv))

In [None]:
# Plotting Confusion matrix
cf1= (confusion_matrix(y_test,pred_lor_cv))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf1, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (Logistic Regression with Count Vectorizer)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Initializing model
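lr_cv= LogisticRegression()
# Note: the default 'lbfgs' solver supports only the 'l2' penalty, so the 'l1'
# candidates fail to fit and are scored as NaN. To search both penalties,
# pass a solver that supports them, e.g. solver='liblinear' or solver='saga'.
parameters = dict(penalty=['l1', 'l2'],C=[100, 10, 1.0, 0.1, 0.01])

# Hyperparameter tuning by GridSearchCV
lr_gcv = GridSearchCV(lr_cv,parameters)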

# Fit the Algorithm
lr_gcv.fit(co_x_train,y_train)

# Predict on the model
pred_lr_gcv=lr_gcv.predict(co_x_test)

In [None]:
accuracy_lr_gcv = accuracy_score(y_test,pred_lr_gcv)
print("Accuracy :",(accuracy_lr_gcv))

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_lr_gcv))

In [None]:
# Plot Confusion matrix
cf1= (confusion_matrix(y_test,pred_lr_gcv))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf1, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (Logistic Regression with GridSearchCV)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used. It exhaustively evaluates every combination in the supplied parameter grid (here, the penalty type and the regularization strength C) using cross-validation on the training data, then refits the best combination on the full training set. Because hyperparameters are selected on validation folds rather than on the test set, the final test accuracy remains an honest estimate of how the model generalizes to new tweets.
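To show what "exhaustive" means for the grid above, this small sketch enumerates the candidate settings and counts the resulting model fits. It assumes 5-fold cross-validation, which is the default in current scikit-learn when `cv` is not passed; it only counts candidates and does not train anything.

```python
from itertools import product

param_grid = {"penalty": ["l1", "l2"], "C": [100, 10, 1.0, 0.1, 0.01]}
n_folds = 5  # assumed: GridSearchCV's default when cv is not given

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))            # number of hyperparameter combinations
print(len(candidates) * n_folds)  # model fits before the final refit
print(candidates[0])
```

The cost grows multiplicatively with every parameter added to the grid, which is the main reason to keep grids small or to switch to a randomized search.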

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

The accuracy score improved after applying GridSearchCV from 0.7831 to 0.7951.

### ML Model - 2- **Naive Bayes Classifier**

Naive Bayes Classifier is a probabilistic machine learning model based on Bayes' theorem with an assumption of independence among features. It is commonly used for classification tasks, particularly in text classification and spam filtering. Naive Bayes Classifier remains a popular choice for classification tasks, especially in scenarios where the assumption holds reasonably well and computational efficiency is important.
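The following is a minimal sketch of a multinomial Naive Bayes classifier written from scratch, to show how the independence assumption and the smoothing parameter `alpha` (the hyperparameter tuned later in this section) enter the calculation. The functions `train_nb` and `predict_nb` and the three training tweets are hypothetical and are not the notebook's `MultinomialNB` model.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(labels))
        # Additive (Laplace) smoothing keeps unseen words from zeroing a class.
        log_lik = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Sum log probabilities per class (words treated as independent)."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(
            log_lik[w] for w in doc.split() if w in log_lik
        )
    return max(scores, key=scores.get)

docs = ["great news today", "panic buying is terrible", "shops open today"]
labels = ["Positive", "Negative", "Neutral"]
model = train_nb(docs, labels)
print(predict_nb(model, "terrible panic"))
```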

In [None]:
# ML Model - 2 Implementation
nb_clf = MultinomialNB()

# Fit the Algorithm
nb_clf.fit(co_x_train,y_train)

# Predict on the model
pred_nb_clf = nb_clf.predict(co_x_test)

In [None]:
# Accuracy
accuracy_nb_clf = accuracy_score(y_test,pred_nb_clf)
print("Accuracy :",(accuracy_nb_clf))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Classification report of Performance metrics
label = ['neutral','positive','negative']
print(classification_report(y_test,pred_nb_clf))

In [None]:
# Plot Confusion matrix
cf1= (confusion_matrix(y_test,pred_nb_clf))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf1, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (Naive Bayes Classifier with CV)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
nb_clf = MultinomialNB()
parameters = dict(alpha=[100, 10, 1.0, 0.1, 0.01], fit_prior=[True,False])

# Hyperparameter tuning by GridSearchCV
nb_gcv = GridSearchCV(nb_clf,parameters)

In [None]:
# Fit the Algorithm
nb_gcv.fit(co_x_train,y_train)

# Predict on the model
pred_nb_gcv = nb_gcv.predict(co_x_test)

In [None]:
accuracy_nb_gcv = accuracy_score(y_test,pred_nb_gcv)
print("Accuracy :",(accuracy_nb_gcv))

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_nb_gcv))

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used. It evaluates every combination in the parameter grid (here, the smoothing parameter alpha and whether to learn class priors via fit_prior) with cross-validation on the training data and keeps the combination with the best average validation score. This gives a systematic, reproducible way to pick hyperparameters without touching the test set.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After applying grid search cross-validation, there appears to be no significant improvement in the accuracy score.

### ML Model - 3 - **Random Forest Classifier**

Random Forest Classifier is a popular ensemble learning algorithm that combines the power of multiple decision trees to make predictions. Random Forest Classifier is widely used in various machine learning applications, including classification, regression, and anomaly detection. It is known for its robustness, accuracy, and ease of use, making it a popular choice among practitioners and researchers. Proper tuning of hyperparameters such as the number of trees and maximum depth of trees is important for optimizing the performance of Random Forest models.

In [None]:
# ML Model - 3 Implementation
rf_clf = RandomForestClassifier()

# Fit the Algorithm
rf_clf.fit(co_x_train,y_train)

# Predict on the model
pred_rf_clf = rf_clf.predict(co_x_test)

In [None]:
# Accuracy
accuracy_rf_clf = accuracy_score(y_test,pred_rf_clf)
print("Accuracy :",(accuracy_rf_clf))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_rf_clf))

In [None]:
# Plot Confusion matrix
cf2= (confusion_matrix(y_test,pred_rf_clf))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf2, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (Random Forest with CV)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Define the parameter grid for hyperparameter tuning
param_grid_rf = {'n_estimators': [50,80,100],
                 'max_depth': [1,2,6],
                 'min_samples_split':[10,20,30],
                 'min_samples_leaf': [1,2,8]}

rf_clf = RandomForestClassifier()

#fit the parameter with Cross Validation
rf_rcv = RandomizedSearchCV(rf_clf, param_grid_rf,verbose= 3, scoring ='accuracy')

# Fit the Algorithm
rf_rcv.fit(co_x_train, y_train)

In [None]:
print(rf_rcv.best_params_)
print(rf_rcv.best_estimator_)

In [None]:
# Predict on the model
pred_rf_rcv = rf_rcv.predict(co_x_test)

In [None]:
# Accuracy
accuracy_rf_rcv = accuracy_score(y_test,pred_rf_rcv)
print("Accuracy :",(accuracy_rf_rcv))

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_rf_rcv))

In [None]:
# Plot Confusion matrix
cf3= (confusion_matrix(y_test,pred_rf_rcv))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf3, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (Random Forest with RandomizedSearchCV)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

##### Which hyperparameter optimization technique have you used and why?

RandomizedSearchCV is a technique used for hyperparameter optimization in machine learning, including for Random Forest models. It efficiently searches through a specified number of random combinations of hyperparameters and selects the combination that yields the best performance. This method saves time compared to exhaustive grid search by randomly sampling hyperparameters from predefined distributions. After fitting the RandomizedSearchCV object to the training data and selecting the best model based on cross-validated performance, the final model's performance is evaluated on the test set to obtain an unbiased estimate of its performance on unseen data.
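To illustrate the saving over an exhaustive search, the sketch below enumerates the full Random Forest grid used above and draws a random subset of it. It assumes `n_iter=10`, the default number of sampled settings in scikit-learn's `RandomizedSearchCV`, and sampling without replacement because every parameter is given as a list; the fixed seed is only there to make the sketch repeatable.

```python
import random
from itertools import product

param_grid_rf = {
    "n_estimators": [50, 80, 100],
    "max_depth": [1, 2, 6],
    "min_samples_split": [10, 20, 30],
    "min_samples_leaf": [1, 2, 8],
}
n_iter = 10  # assumed: RandomizedSearchCV's default number of sampled settings

names = list(param_grid_rf)
all_combos = [dict(zip(names, values)) for values in product(*param_grid_rf.values())]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled = rng.sample(all_combos, n_iter)  # draw without replacement

print(len(all_combos), len(sampled))
```

Only 10 of the 81 combinations are cross-validated, so the search is roughly eight times cheaper than a full grid, at the cost of possibly missing the best setting.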

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

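The accuracy score notably decreased after applying randomized search cross-validation to the Random Forest classifier. A likely cause is the search space itself: it caps max_depth at 6, whereas the baseline model uses the default of unlimited depth, so every candidate is a much shallower forest that probably underfits the sparse, high-dimensional text features.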

### ML Model - 4 - **K-Nearest Neighbors**

KNN is a versatile algorithm with various applications, including classification, regression, clustering, and outlier detection. It is particularly useful when the decision boundary is nonlinear or when the underlying data distribution is unknown. However, it may not perform well with high-dimensional data or imbalanced datasets. Proper preprocessing, feature scaling, and tuning of hyperparameters are essential for maximizing the performance of KNN.
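A minimal from-scratch sketch of the KNN decision rule is shown below: compute the distance from the query to every training point, keep the `k` closest, and take a majority vote. The function `knn_predict` and the tiny three-feature vectors are hypothetical stand-ins for word-count rows; the notebook itself uses `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical word-count vectors with three features each.
train_x = [(2, 0, 0), (3, 1, 0), (0, 2, 0), (0, 3, 1), (0, 0, 2), (1, 0, 3)]
train_y = ["Negative", "Negative", "Neutral", "Neutral", "Positive", "Positive"]

print(knn_predict(train_x, train_y, (2, 1, 0), k=3))
```

Because every prediction depends on raw distances, the method degrades when most dimensions are zero for most points, which is the situation with sparse word-count features.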

In [None]:
# ML Model - 4 Implementation
# Initializing model
knn_clf = KNeighborsClassifier()

# Fit the Algorithm
knn_clf.fit(co_x_train,y_train)

# Predict on the model
pred_knn_clf=knn_clf.predict(co_x_test)

In [None]:
# Accuracy
accuracy_knn_clf = accuracy_score(y_test,pred_knn_clf)
print("Accuracy :",(accuracy_knn_clf))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_knn_clf))

In [None]:
# Plot Confusion matrix
cf4= (confusion_matrix(y_test,pred_knn_clf))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf4, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (KNN with CV)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 4 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Initializing model
knn_clf= KNeighborsClassifier()

param = {'n_neighbors': [1,2,3,4]}    # Different values for the number of neighbors
knn_gcv = GridSearchCV(estimator=knn_clf,param_grid=param)


In [None]:
# Fit the Algorithm
knn_gcv.fit(co_x_train,y_train)

# Predict on the model
pred_knn_gcv=knn_gcv.predict(co_x_test)

In [None]:
accuracy_knn_gcv = accuracy_score(y_test,pred_knn_gcv)
print("Accuracy :",(accuracy_knn_gcv))

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_knn_gcv))

In [None]:
# Plotting Confusion matrix
cf1= (confusion_matrix(y_test,pred_knn_gcv))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf1, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (KNN with GridSearchCV)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used. It cross-validates every value in the parameter grid (here, the number of neighbors, n_neighbors) on the training data and keeps the value with the best average validation score. With a single small grid, an exhaustive search is cheap and guarantees that no candidate is skipped.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

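After implementing grid search cross-validation, the accuracy score of the KNN model decreased notably. The grid covers only n_neighbors values from 1 to 4 and therefore excludes the default of 5 used by the baseline model, so the search could not select the baseline setting. This suggests that the chosen grid does not contain a good value for this dataset and that larger values of n_neighbors should be included.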

### ML Model - 5 - **Decision Tree**

The Decision Tree Classifier algorithm builds a predictive model in the form of a tree structure. Each internal node in the tree represents a decision based on a feature, and each leaf node represents a class label or a decision. The algorithm partitions the feature space into regions, with each region corresponding to a leaf node in the tree. Decision trees are trained using a recursive partitioning algorithm that selects the best feature and split point at each node based on criteria such as Gini impurity or entropy.
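The two split criteria mentioned above can be computed in a few lines. This sketch defines Gini impurity and entropy for a list of class labels and scores a candidate split by the size-weighted impurity of its two children; the function names and the toy labels are illustrative only.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

node = ["Negative"] * 4 + ["Positive"] * 4
left, right = node[:4], node[4:]  # a split that separates the classes perfectly

print(gini(node), entropy(node))
print(weighted_impurity(left, right, gini))
```

At each node the tree picks the feature and threshold whose split gives the lowest weighted impurity, which is the same as the largest impurity decrease relative to the parent.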

In [None]:
# ML Model - 5 Implementation
# Initializing model
dt_ti_clf=DecisionTreeClassifier()

# Fit the data to model
dt_ti_clf.fit(tfidf_x_train,y_train)

# Prediction
pred_dt_ti_clf=dt_ti_clf.predict(tfidf_x_test)

In [None]:
# Accuracy
accuracy_dt_ti_clf = accuracy_score(y_test,pred_dt_ti_clf)
print("Accuracy :",(accuracy_dt_ti_clf))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_dt_ti_clf))

In [None]:
# Plot Confusion matrix
cf5= (confusion_matrix(y_test,pred_dt_ti_clf))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf5, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (Decision Tree with TF-IDF)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 5 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Initializing model
dt_ti_clf=DecisionTreeClassifier()

# Define parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Perform GridSearchCV
dt_ti_gcv = GridSearchCV(estimator=dt_ti_clf, param_grid=param_grid, scoring='accuracy', n_jobs=-1)
dt_ti_gcv.fit(tfidf_x_train,y_train)

#prediction
pred_dt_ti_gcv=dt_ti_gcv.predict(tfidf_x_test)

In [None]:
accuracy_dt_ti_gcv = accuracy_score(y_test,pred_dt_ti_gcv)
print("Accuracy :",(accuracy_dt_ti_gcv))

In [None]:
# Classification report of Performance metrics
label=['neutral','positive','negative']
print(classification_report(y_test,pred_dt_ti_gcv))

In [None]:
# Plotting Confusion matrix
cf1= (confusion_matrix(y_test,pred_dt_ti_gcv))
plt.figure(figsize=(8,5))
ax= plt.subplot()
sns.heatmap(cf1, annot=True, fmt=".0f",ax = ax)

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=15)
ax.set_ylabel('Actual labels', fontsize=15)
ax.set_title('Confusion Matrix (Decision Tree with TF-IDF, GridSearchCV)', fontsize=20)
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV was used. It cross-validates every combination in the parameter grid (split criterion, maximum depth, and the minimum samples required to split a node or form a leaf) on the training data and keeps the combination with the best average validation accuracy. The grid is small enough that an exhaustive search is affordable.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After applying grid search cross-validation, there was a marginal reduction in the accuracy score of the Decision Tree model with TF-IDF vectorization. The grid limits max_depth to 3 or 5, while the baseline tree grows to the default unlimited depth, so every tuned candidate is considerably shallower than the baseline. This indicates that the grid did not cover the region of hyperparameters that suits this dataset, resulting in a slight decline in performance.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Accuracy and the classification report (per-class precision, recall, and F1-score) were the evaluation metrics considered. Accuracy gives a single overall measure of how often a tweet's sentiment is classified correctly, while the classification report shows whether that performance holds for each sentiment class or is driven by the more frequent ones. Together they let a business judge how far the model's sentiment estimates can be trusted before acting on them.
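As an illustration of how these metrics relate to a confusion matrix, the sketch below derives accuracy and per-class precision, recall, and F1 by hand. The matrix values are hypothetical and are not results from this project.

```python
labels = ["Negative", "Neutral", "Positive"]
# Hypothetical confusion matrix: rows are actual classes, columns are predictions.
cm = [
    [80, 10, 10],
    [5, 40, 5],
    [15, 10, 75],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

report = {}
for i, name in enumerate(labels):
    tp = cm[i][i]
    predicted = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actual = sum(cm[i])                                    # row sum
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (precision, recall, f1)

print(round(accuracy, 3))
for name, (p, r, f1) in report.items():
    print(f"{name:<9} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this made-up example the Neutral class has the lowest precision even though overall accuracy looks reasonable, which is the kind of imbalance a single accuracy number hides.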

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Logistic Regression tuned with GridSearchCV was chosen as the final prediction model, with a test accuracy of 79.51%. GridSearchCV uses cross-validation to select the logistic regression hyperparameters (penalty and regularization strength C) that perform best on the training data.

# **Conclusion**

The primary objective is to leverage machine learning techniques to accurately classify the sentiment conveyed in COVID-19 tweets. We focused our analysis solely on the "OriginalTweet" and "Sentiment" columns, as columns like "UserName" and "ScreenName" do not provide meaningful insights for our analysis. These two columns contain the primary data we need to analyze sentiments expressed in the tweets, thereby streamlining our data processing and analysis pipeline. We used stemming and lemmatizing for text normalization.

In our analysis of COVID-19 tweets, we explored five machine learning models: Logistic Regression, Naive Bayes Classifier, Random Forest, KNN, and Decision Tree. Despite applying grid search and randomized search cross-validation to optimize the models, we did not observe significant improvements in test accuracy across the models.

However, the Logistic Regression model with Grid Search CV and count vectorization stood out, achieving an accuracy of 79.51%. This model demonstrated robust performance without signs of overfitting, indicating its suitability for deployment.

Our analysis revealed that despite the challenging circumstances of the COVID-19 pandemic, positive sentiments outweighed negative ones in the tweets. Nonetheless, a substantial portion of negative sentiments persists, presenting opportunities for government agencies, NGOs, and other entities to implement initiatives aimed at boosting public morale and addressing concerns.

Looking ahead, repeating the analysis in the future and comparing it with the present sentiment analysis could provide valuable insights into the effectiveness of such initiatives over time. This iterative approach allows for ongoing assessment and adjustment of strategies to better address evolving sentiments and needs during unprecedented situations like the COVID-19 pandemic.
