# **Sentiment Analysis**

## **Steps**
- Import dataset
- Inspect dataset
- Data cleaning
  - Remove duplicates
  - Handle null values
  - Remove leading and trailing whitespace
- Univariate data analysis
- Bivariate data analysis
  - Which platforms drive the most positive engagement?
  - Which countries show the highest positive vs negative sentiment?
  - What time of day drives the most engagement (likes and retweets)?
  - Which hashtags drive the most engagement?
  - Which sentiment generates higher engagement?

**Business Insights We Can Expect**

- **Best platforms:** Identify which social platform has the most positive sentiment.
- **Geographic strategy:** Target countries with positive sentiment and address those with negative sentiment.
- **Posting time:** Find the optimal posting hours for engagement.
- **Hashtag strategy:** Focus on top-performing hashtags.
- **Message tone:** Check whether positive posts get more engagement than negative or neutral ones.
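
Most of the engagement questions above reduce to a group-and-aggregate step. The following is a minimal sketch on a toy frame, assuming the column names of this dataset (`Platform`, `Sentiment`, `Hashtags`, `Hour`, `Likes`, `Retweets`). The `Engagement` column (likes plus retweets) and the helper `mean_engagement_by` are hypothetical and introduced only for illustration; the values are made up.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are invented).
toy = pd.DataFrame({
    "Platform": ["Twitter", "Instagram", "Twitter", "Facebook"],
    "Sentiment": ["Positive", "Positive", "Negative", "Neutral"],
    "Hashtags": ["#Nature #Park", "#Fitness #Park", "#Traffic", "#Food"],
    "Hour": [12, 15, 8, 19],
    "Likes": [30.0, 40.0, 10.0, 25.0],
    "Retweets": [15.0, 20.0, 5.0, 12.0],
})

# Assumption: engagement is defined as likes plus retweets.
toy["Engagement"] = toy["Likes"] + toy["Retweets"]


def mean_engagement_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Mean engagement per value of `column`, highest first."""
    return frame.groupby(column)["Engagement"].mean().sort_values(ascending=False)


by_platform = mean_engagement_by(toy, "Platform")
by_sentiment = mean_engagement_by(toy, "Sentiment")
by_hour = mean_engagement_by(toy, "Hour")

# Hashtags are space-separated, so split them and give each tag its own row.
tags = toy.assign(Hashtag=toy["Hashtags"].str.split()).explode("Hashtag")
by_hashtag = mean_engagement_by(tags, "Hashtag")
```

A mean is used here rather than a sum so that platforms or hashtags with many posts do not dominate purely through volume; whether that is the right choice depends on the question being asked.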

## **Import Dataset**

In [5]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [22]:
# import dataset, loading only the columns used in the analysis
columns = ['Text', 'Sentiment', 'Timestamp', 'User', 'Platform', 'Hashtags',
           'Retweets', 'Likes', 'Country', 'Year', 'Month', 'Day', 'Hour']
df = pd.read_csv('dataset/sentiment_dataset.csv', usecols=columns)

## **Data Cleaning**

In [58]:
# count rows that are exact duplicates of an earlier row
print(f"There are {df.duplicated().sum()} duplicated values")

There are 20 duplicated values


In [63]:
# remove duplicates in place
df.drop_duplicates(inplace=True)

# Confirm duplicates removal
df.shape

(712, 13)
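
Note that `drop_duplicates` keeps the original row labels, so the index is no longer contiguous once rows are removed. A minimal sketch on a toy frame (hypothetical data) of the default behaviour next to `ignore_index=True`, which relabels the result from zero:

```python
import pandas as pd

# Toy frame with one exact duplicate (row 2 repeats row 0).
toy = pd.DataFrame({
    "Text": ["a", "b", "a", "c"],
    "Likes": [1, 2, 1, 3],
})

n_duplicates = toy.duplicated().sum()

# Default behaviour: surviving rows keep their original labels.
deduped = toy.drop_duplicates()

# ignore_index=True relabels the result 0..n-1.
deduped_reindexed = toy.drop_duplicates(ignore_index=True)
```

This matters mainly for later positional assumptions: label-based lookups such as `df.loc[0, ...]` still work as long as that label survived.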

In [66]:
# check for null values
print(f"There are {df.isnull().sum().sum()} null values")

There are 0 null values


In [91]:
# remove leading and trailing whitespace from the text columns
for col in ['Text', 'Sentiment', 'User', 'Platform', 'Hashtags']:
    df[col] = df[col].str.strip()

# spot-check the first row
df.loc[0, 'Hashtags']

'#Nature #Park'
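
Stripping matters most for `Sentiment`, where raw labels that differ only by padding (for example `' Confusion    '` and `' Confusion '`) would otherwise be counted as separate categories. A minimal sketch on toy values using the vectorised `.str.strip()` accessor; the frame `toy` and the variable names are illustrative only:

```python
import pandas as pd

# Toy values mimicking the padded labels found in the raw data.
toy = pd.DataFrame({
    "Sentiment": [" Positive  ", " Confusion    ", " Confusion ", " Positive  "],
    "Platform": [" Twitter ", " Instagram ", " Twitter  ", " Facebook "],
})

labels_before = toy["Sentiment"].nunique()

# Strip leading and trailing whitespace from each listed text column.
text_columns = ["Sentiment", "Platform"]
for col in text_columns:
    toy[col] = toy[col].str.strip()

labels_after = toy["Sentiment"].nunique()
```

Comparing `nunique()` before and after is a quick way to confirm that the cleanup actually merged the padded variants.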

## **Exploratory Data Analysis**

In [23]:
# inspect dataset head
df.head()

|  | Text | Sentiment | Timestamp | User | Platform | Hashtags | Retweets | Likes | Country | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Enjoying a beautiful day at the park! ... | Positive | 2023-01-15 12:30:00 | User123 | Twitter | #Nature #Park | 15.0 | 30.0 | USA | 2023 | 1 | 15 | 12 |
| 1 | Traffic was terrible this morning. ... | Negative | 2023-01-15 08:45:00 | CommuterX | Twitter | #Traffic #Morning | 5.0 | 10.0 | Canada | 2023 | 1 | 15 | 8 |
| 2 | Just finished an amazing workout! 💪 ... | Positive | 2023-01-15 15:45:00 | FitnessFan | Instagram | #Fitness #Workout | 20.0 | 40.0 | USA | 2023 | 1 | 15 | 15 |
| 3 | Excited about the upcoming weekend getaway! ... | Positive | 2023-01-15 18:20:00 | AdventureX | Facebook | #Travel #Adventure | 8.0 | 15.0 | UK | 2023 | 1 | 15 | 18 |
| 4 | Trying out a new recipe for dinner tonight. ... | Neutral | 2023-01-15 19:55:00 | ChefCook | Instagram | #Cooking #Food | 12.0 | 25.0 | Australia | 2023 | 1 | 15 | 19 |


In [24]:
# inspect dataset tail
df.tail()

|  | Text | Sentiment | Timestamp | User | Platform | Hashtags | Retweets | Likes | Country | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 727 | Collaborating on a science project that receiv... | Happy | 2017-08-18 18:20:00 | ScienceProjectSuccessHighSchool | Facebook | #ScienceFairWinner #HighSchoolScience | 20.0 | 39.0 | UK | 2017 | 8 | 18 | 18 |
| 728 | Attending a surprise birthday party organized ... | Happy | 2018-06-22 14:15:00 | BirthdayPartyJoyHighSchool | Instagram | #SurpriseCelebration #HighSchoolFriendship | 25.0 | 48.0 | USA | 2018 | 6 | 22 | 14 |
| 729 | Successfully fundraising for a school charity ... | Happy | 2019-04-05 17:30:00 | CharityFundraisingTriumphHighSchool | Twitter | #CommunityGiving #HighSchoolPhilanthropy | 22.0 | 42.0 | Canada | 2019 | 4 | 5 | 17 |
| 730 | Participating in a multicultural festival, cel... | Happy | 2020-02-29 20:45:00 | MulticulturalFestivalJoyHighSchool | Facebook | #CulturalCelebration #HighSchoolUnity | 21.0 | 43.0 | UK | 2020 | 2 | 29 | 20 |
| 731 | Organizing a virtual talent show during challe... | Happy | 2020-11-15 15:15:00 | VirtualTalentShowSuccessHighSchool | Instagram | #VirtualEntertainment #HighSchoolPositivity | 24.0 | 47.0 | USA | 2020 | 11 | 15 | 15 |


In [27]:
# descriptive statistics
df.describe()

|  | Retweets | Likes | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|
| count | 732.0 | 732.0 | 732.0 | 732.0 | 732.0 | 732.0 |
| mean | 21.508197 | 42.901639 | 2020.471311 | 6.122951 | 15.497268 | 15.521858 |
| std | 7.061286 | 14.089848 | 2.802285 | 3.411763 | 8.474553 | 4.113414 |
| min | 5.0 | 10.0 | 2010.0 | 1.0 | 1.0 | 0.0 |
| 25% | 17.75 | 34.75 | 2019.0 | 3.0 | 9.0 | 13.0 |
| 50% | 22.0 | 43.0 | 2021.0 | 6.0 | 15.0 | 16.0 |
| 75% | 25.0 | 50.0 | 2023.0 | 9.0 | 22.0 | 19.0 |
| max | 40.0 | 80.0 | 2023.0 | 12.0 | 31.0 | 23.0 |
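
`Retweets` and `Likes` are stored as floats, although the sample rows show only whole numbers (15.0, 30.0, and so on). If integer counts are preferred, one option is to verify that no value has a fractional part before casting. A minimal sketch on toy values; this check is an optional extra and not a step taken in this notebook:

```python
import pandas as pd

# Toy values copied in spirit from the sample rows (whole numbers stored as floats).
toy = pd.DataFrame({"Retweets": [15.0, 5.0, 20.0], "Likes": [30.0, 10.0, 40.0]})

count_columns = ["Retweets", "Likes"]

# True when every value in the column has no fractional part.
is_whole = {col: bool((toy[col] % 1 == 0).all()) for col in count_columns}

# Cast only the columns that passed the check.
for col in count_columns:
    if is_whole[col]:
        toy[col] = toy[col].astype("int64")
```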


In [28]:
# dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Text       732 non-null    object 
 1   Sentiment  732 non-null    object 
 2   Timestamp  732 non-null    object 
 3   User       732 non-null    object 
 4   Platform   732 non-null    object 
 5   Hashtags   732 non-null    object 
 6   Retweets   732 non-null    float64
 7   Likes      732 non-null    float64
 8   Country    732 non-null    object 
 9   Year       732 non-null    int64  
 10  Month      732 non-null    int64  
 11  Day        732 non-null    int64  
 12  Hour       732 non-null    int64  
dtypes: float64(2), int64(4), object(7)
memory usage: 74.5+ KB


In [34]:
# convert timestamp to datetime 
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Text       732 non-null    object        
 1   Sentiment  732 non-null    object        
 2   Timestamp  732 non-null    datetime64[ns]
 3   User       732 non-null    object        
 4   Platform   732 non-null    object        
 5   Hashtags   732 non-null    object        
 6   Retweets   732 non-null    float64       
 7   Likes      732 non-null    float64       
 8   Country    732 non-null    object        
 9   Year       732 non-null    int64         
 10  Month      732 non-null    int64         
 11  Day        732 non-null    int64         
 12  Hour       732 non-null    int64         
dtypes: datetime64[ns](1), float64(2), int64(4), object(6)
memory usage: 74.5+ KB
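
With `Timestamp` parsed, the stored `Year`, `Month`, `Day` and `Hour` columns can be cross-checked against it through the `.dt` accessor. A minimal sketch using two timestamps from the rows shown earlier; `toy`, `parts` and `consistent` are illustrative names:

```python
import pandas as pd

# Two rows taken from the head/tail output of the dataset.
toy = pd.DataFrame({
    "Timestamp": ["2023-01-15 12:30:00", "2020-02-29 20:45:00"],
    "Year": [2023, 2020],
    "Month": [1, 2],
    "Day": [15, 29],
    "Hour": [12, 20],
})
toy["Timestamp"] = pd.to_datetime(toy["Timestamp"])

# Recompute each date part from the parsed timestamp.
parts = {
    "Year": toy["Timestamp"].dt.year,
    "Month": toy["Timestamp"].dt.month,
    "Day": toy["Timestamp"].dt.day,
    "Hour": toy["Timestamp"].dt.hour,
}

# Compare each recomputed part with the stored column.
consistent = {name: bool((toy[name] == values).all()) for name, values in parts.items()}
```

If any entry in `consistent` were `False` on the real data, the derived columns and the timestamp would disagree, and it would be worth deciding which one to trust before any time-of-day analysis.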


In [36]:
# data shape
df.shape

(732, 13)

In [38]:
# check for null values
df.isnull().sum()

Text         0
Sentiment    0
Timestamp    0
User         0
Platform     0
Hashtags     0
Retweets     0
Likes        0
Country      0
Year         0
Month        0
Day          0
Hour         0
dtype: int64

### **Univariate Data Analysis**

In [41]:
df.columns

Index(['Text', 'Sentiment', 'Timestamp', 'User', 'Platform', 'Hashtags',
       'Retweets', 'Likes', 'Country', 'Year', 'Month', 'Day', 'Hour'],
      dtype='object')

In [45]:
df['Sentiment'].unique()

array([' Positive  ', ' Negative  ', ' Neutral   ', ' Anger        ',
       ' Fear         ', ' Sadness      ', ' Disgust      ',
       ' Happiness    ', ' Joy          ', ' Love         ',
       ' Amusement    ', ' Enjoyment    ', ' Admiration   ',
       ' Affection    ', ' Awe          ', ' Disappointed ',
       ' Surprise     ', ' Acceptance   ', ' Adoration    ',
       ' Anticipation ', ' Bitter       ', ' Calmness     ',
       ' Confusion    ', ' Excitement   ', ' Kind         ',
       ' Pride        ', ' Shame        ', ' Confusion ', ' Excitement ',
       ' Shame ', ' Elation       ', ' Euphoria      ', ' Contentment   ',
       ' Serenity      ', ' Gratitude     ', ' Hope          ',
       ' Empowerment   ', ' Compassion    ', ' Tenderness    ',
       ' Arousal       ', ' Enthusiasm    ', ' Fulfillment  ',
       ' Reverence     ', ' Compassion', ' Fulfillment   ', ' Reverence ',
       ' Elation   ', ' Despair         ', ' Grief           ',
       ' Loneliness     