# Sentiment Analysis of Social Media Content
### Week 5 Presentation

This notebook presents our analysis of a social media sentiment dataset. In the following sections, we:
- Explore and clean the dataset
- Perform sentiment analysis and visualize the sentiment distribution
- Analyze temporal trends using the timestamp data
- Investigate user engagement metrics (Likes and Retweets) and their relationship to sentiment
- Examine platform-specific trends, trending hashtags, and geographical differences
- (Optional) Build a simple predictive model to estimate user engagement


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Enable inline plotting
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')


## Dataset Exploration

We begin by loading the dataset and exploring its structure. Note that our dataset contains extra columns (`Unnamed: 0.1`, `Unnamed: 0`) that we will drop.

In [None]:
# Load the dataset
# Replace 'social_media_sentiment.csv' with the correct file path if needed
try:
    df = pd.read_csv('social_media_sentiment.csv')
except Exception as e:
    print('Error loading dataset:', e)
    raise  # Stop here: every later cell needs df to be defined

# Drop unwanted unnamed columns
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

# Display the first few rows
print(df.head())

# Basic info and summary (df.info() prints directly and returns None)
df.info()

# Check for missing values
print('Missing values in each column:')
print(df.isnull().sum())
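
The checks above can be gathered into a single table. The following is a minimal sketch on a made-up frame; the helper `quality_report` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing share per column (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_share": frame.isna().mean().round(3),
    })

# Toy stand-in for the real dataset (values are made up for illustration)
sample = pd.DataFrame({
    "Text": ["great day", "bad service", None, "great day"],
    "Sentiment": ["Positive", "Negative", "Neutral", "Positive"],
    "Likes": [10, 3, None, 10],
})

report = quality_report(sample)
n_duplicates = int(sample.duplicated().sum())  # rows identical to an earlier row

print(report)
print("Duplicate rows:", n_duplicates)
```

On the real data, `quality_report(df)` would give the same overview in one call.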


## Sentiment Analysis

We visualize the distribution of sentiments across the dataset using the **Sentiment** column.

In [None]:
# Visualize sentiment distribution
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='Sentiment', order=df['Sentiment'].value_counts().index)
plt.title('Distribution of Sentiments')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
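
Raw counts can be complemented with relative shares, which are easier to compare across datasets of different sizes. The sketch below uses made-up labels and assumes, purely for illustration, that some labels may carry stray whitespace.

```python
import pandas as pd

# Toy labels (made up); the padded entry illustrates the whitespace assumption
sample = pd.DataFrame({"Sentiment": [" Positive ", "Positive", "Negative", "Neutral"]})

# Strip whitespace so ' Positive ' and 'Positive' count as the same label
labels = sample["Sentiment"].str.strip()

# normalize=True returns shares that sum to 1 instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares.round(2))
```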


## Temporal Analysis

We convert the **Timestamp** column to a datetime type and plot how the number of posts changes over time.

In [None]:
# Convert 'Timestamp' to datetime (if not already); unparseable values become NaT
if not pd.api.types.is_datetime64_any_dtype(df['Timestamp']):
    df['Timestamp'] = pd.to_datetime(df['Timestamp'], errors='coerce')

# Create a new 'date' column
df['date'] = df['Timestamp'].dt.date

# Plot the number of posts over time
plt.figure(figsize=(12,6))
df.groupby('date').size().plot()
plt.title('Number of Posts Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Posts')
plt.xticks(rotation=45)
plt.show()
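
Daily counts can be noisy, so a coarser aggregation may be easier to read. The sketch below counts posts per week with `resample`; the timestamps and text are made up, and weekly binning is our choice for illustration rather than something the notebook prescribes.

```python
import pandas as pd

# Toy posts (made up): two in the first week, three in the second
sample = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2023-01-02 08:00", "2023-01-03 12:30", "2023-01-10 09:15",
        "2023-01-11 18:45", "2023-01-12 07:05",
    ]),
    "Text": ["a", "b", "c", "d", "e"],
})

# 'W' bins end on Sunday; size() counts the rows that fall in each bin
weekly_counts = sample.set_index("Timestamp")["Text"].resample("W").size()
print(weekly_counts)
```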


## User Engagement Insights

We study the distribution of user engagement metrics (**Likes** and **Retweets**) and analyze how they vary with sentiment.

In [None]:
# Plot distributions for Likes and Retweets
plt.figure(figsize=(10,5))

plt.subplot(1,2,1)
sns.histplot(df['Likes'], kde=True)
plt.title('Distribution of Likes')

plt.subplot(1,2,2)
sns.histplot(df['Retweets'], kde=True)
plt.title('Distribution of Retweets')

plt.tight_layout()
plt.show()

# Boxplots to compare engagement across sentiments
plt.figure(figsize=(12,6))

plt.subplot(1,2,1)
sns.boxplot(data=df, x='Sentiment', y='Likes', order=df['Sentiment'].value_counts().index)
plt.title('Likes by Sentiment')
plt.xticks(rotation=45)

plt.subplot(1,2,2)
sns.boxplot(data=df, x='Sentiment', y='Retweets', order=df['Sentiment'].value_counts().index)
plt.title('Retweets by Sentiment')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# Engagement correlation heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df[['Likes','Retweets']].corr(), annot=True, cmap='coolwarm')
plt.title('Engagement Metrics Correlation')
plt.show()
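
Engagement counts are often skewed by a few very popular posts, so medians and a rank-based correlation can be a useful complement to the plots above. The sketch below uses made-up numbers; computing the Pearson correlation of the ranks is equivalent to Spearman's rank correlation.

```python
import pandas as pd

# Toy engagement data (made up), including one outlier post
sample = pd.DataFrame({
    "Sentiment": ["Positive", "Positive", "Negative", "Negative", "Neutral"],
    "Likes": [10, 200, 5, 7, 20],
    "Retweets": [2, 50, 1, 3, 4],
})

# Medians are less sensitive to the outlier than means
median_engagement = sample.groupby("Sentiment")[["Likes", "Retweets"]].median()

# Spearman correlation: rank each column, then take the Pearson correlation
spearman = sample[["Likes", "Retweets"]].rank().corr()

print(median_engagement)
print(spearman)
```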


## Platform-Specific Analysis

We now compare sentiment trends and post counts across different social media platforms using the **Platform** column.

In [None]:
# Posts per platform
plt.figure(figsize=(8,5))
sns.countplot(data=df, x='Platform', order=df['Platform'].value_counts().index)
plt.title('Posts by Platform')
plt.xlabel('Platform')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

# Sentiment distribution per platform
plt.figure(figsize=(12,6))
sns.countplot(data=df, x='Platform', hue='Sentiment', order=df['Platform'].value_counts().index)
plt.title('Sentiment Distribution by Platform')
plt.xlabel('Platform')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Sentiment')
plt.show()
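
Because platforms differ in post volume, raw counts can hide differences in sentiment mix. One way to compare them is a row-normalized cross-tabulation, sketched below on made-up data (the platform names are placeholders, not values taken from the dataset).

```python
import pandas as pd

# Toy data (made up): four posts on one platform, two on another
sample = pd.DataFrame({
    "Platform": ["Twitter", "Twitter", "Twitter", "Twitter", "Instagram", "Instagram"],
    "Sentiment": ["Positive", "Positive", "Positive", "Negative", "Positive", "Negative"],
})

# normalize='index' makes each platform's row sum to 1
sentiment_share = pd.crosstab(sample["Platform"], sample["Sentiment"], normalize="index")
print(sentiment_share)
```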


## Hashtag and Topic Trends

We analyze trending hashtags and investigate their relationship with user engagement. The **Hashtags** column is processed to extract individual hashtags. In our dataset, hashtags may be separated by commas or whitespace.

In [None]:
# Analyze hashtag trends
if 'Hashtags' in df.columns:
    # Replace missing hashtags with an empty string
    df['Hashtags'] = df['Hashtags'].fillna('')
    
    # Create a new column for hashtags list
    def split_hashtags(x):
        x = x.strip()
        if not x:
            return []
        # If comma exists, split by comma; otherwise, split by whitespace
        if ',' in x:
            return [tag.strip() for tag in x.split(',') if tag.strip()]
        else:
            return [tag.strip() for tag in x.split() if tag.strip()]
    
    df['hashtags_list'] = df['Hashtags'].apply(split_hashtags)
    
    # Explode the list so each hashtag gets its own row
    hashtags_exploded = df.explode('hashtags_list')
    
    # Remove rows without a hashtag (explode turns empty lists into NaN)
    hashtags_exploded = hashtags_exploded.dropna(subset=['hashtags_list'])
    
    # Top 20 hashtags
    top_hashtags = hashtags_exploded['hashtags_list'].value_counts().head(20)
    plt.figure(figsize=(10,6))
    sns.barplot(x=top_hashtags.values, y=top_hashtags.index, orient='h')
    plt.title('Top 20 Hashtags')
    plt.xlabel('Frequency')
    plt.ylabel('Hashtag')
    plt.show()
    
    # Average likes per hashtag (top 10)
    hashtag_engagement = hashtags_exploded.groupby('hashtags_list')['Likes'].mean().sort_values(ascending=False).head(10)
    plt.figure(figsize=(10,6))
    sns.barplot(x=hashtag_engagement.values, y=hashtag_engagement.index, orient='h')
    plt.title('Top Hashtags by Average Likes')
    plt.xlabel('Average Likes')
    plt.ylabel('Hashtag')
    plt.show()
else:
    print('No Hashtags column found in the dataset.')
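
Two refinements may be worth considering: splitting on commas and whitespace in one pass (so mixed separators are handled), and requiring a minimum number of posts before ranking hashtags by average Likes, since a hashtag used once can top the list by chance. The sketch below illustrates both on made-up data; the lowercasing step and the `min_posts` threshold are our assumptions, not part of the original analysis.

```python
import re
import pandas as pd

def split_hashtags(raw):
    """Split on any run of commas and/or whitespace; lowercase is an assumption."""
    return [tag.lower() for tag in re.split(r"[,\s]+", raw.strip()) if tag]

# Toy posts (made up)
sample = pd.DataFrame({
    "Hashtags": ["#Fun, #travel", "#fun #food", "#travel", ""],
    "Likes": [10, 30, 20, 5],
})

exploded = (
    sample.assign(tag=sample["Hashtags"].fillna("").apply(split_hashtags))
    .explode("tag")
    .dropna(subset=["tag"])  # posts without hashtags explode to NaN
)

# Count and mean per hashtag, keeping only hashtags seen often enough
min_posts = 2  # hypothetical threshold
stats = exploded.groupby("tag")["Likes"].agg(["count", "mean"])
reliable = stats[stats["count"] >= min_posts].sort_values("mean", ascending=False)
print(reliable)
```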


## Geographical Trends

We examine how sentiment and engagement vary by geography using the **Country** column.

In [None]:
# Plot posts by country
if 'Country' in df.columns:
    plt.figure(figsize=(10,6))
    sns.countplot(data=df, x='Country', order=df['Country'].value_counts().index)
    plt.title('Posts by Country')
    plt.xlabel('Country')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()
    
    # Sentiment distribution for top 10 countries
    top_countries = df['Country'].value_counts().head(10).index
    plt.figure(figsize=(12,6))
    sns.countplot(data=df[df['Country'].isin(top_countries)], x='Country', hue='Sentiment', order=top_countries)
    plt.title('Sentiment Distribution by Country')
    plt.xlabel('Country')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.legend(title='Sentiment')
    plt.show()
else:
    print('No Country column found in the dataset.')
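
The cell above covers post counts and sentiment by country. Engagement by country could be summarized along the following lines; this is a sketch on made-up data, and the country names and the cutoff of two countries are placeholders.

```python
import pandas as pd

# Toy data (made up)
sample = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "India", "India", "Canada"],
    "Likes": [10, 20, 30, 40, 60, 5],
})

# Restrict to the most frequent countries so the averages rest on several posts
top_countries = sample["Country"].value_counts().head(2).index

likes_by_country = (
    sample[sample["Country"].isin(top_countries)]
    .groupby("Country")["Likes"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)
print(likes_by_country)
```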


## Cross-Feature Analysis

We combine multiple features to gain deeper insights. The following cell examines how average Likes evolve over time by platform.

In [None]:
# Example cross-feature analysis: average Likes over time for each platform
if 'Platform' in df.columns and 'date' in df.columns:
    pivot_df = df.pivot_table(index='date', columns='Platform', values='Likes', aggfunc='mean')
    # DataFrame.plot opens its own figure, so set the size here rather than via plt.figure
    pivot_df.plot(figsize=(12,6))
    plt.title('Average Likes Over Time by Platform')
    plt.xlabel('Date')
    plt.ylabel('Average Likes')
    plt.xticks(rotation=45)
    plt.legend(title='Platform')
    plt.show()
else:
    print('Required columns for cross-feature analysis not found.')
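
Daily averages per platform can fluctuate strongly when only a few posts fall on a given day. A rolling mean over the pivoted table is one way to smooth the lines; the sketch below uses made-up data and a window of two days chosen only to keep the example small.

```python
import pandas as pd

# Toy data (made up): one post per platform per day
sample = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02",
                            "2023-01-02", "2023-01-03", "2023-01-03"]),
    "Platform": ["Twitter", "Instagram"] * 3,
    "Likes": [10, 40, 20, 20, 30, 60],
})

daily = sample.pivot_table(index="date", columns="Platform", values="Likes", aggfunc="mean")

# min_periods=1 keeps the first day instead of turning it into NaN
smoothed = daily.rolling(window=2, min_periods=1).mean()
print(smoothed)
```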


## Predictive Modeling (Optional)

The cell below builds a simple linear regression model that predicts user engagement (**Likes**) from **Sentiment** and **Platform**.

In [None]:
# Import modeling libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Prepare features for prediction
features = []
if 'Sentiment' in df.columns:
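    # Encode Sentiment as integer codes. This imposes an arbitrary order on the
    # categories, and missing values become -1 rather than NaN.
    df['Sentiment_encoded'] = df['Sentiment'].astype('category').cat.codes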
    features.append('Sentiment_encoded')
if 'Platform' in df.columns:
    df['Platform_encoded'] = df['Platform'].astype('category').cat.codes
    features.append('Platform_encoded')

# Ensure we have a target and features
if 'Likes' in df.columns and features:
    X = df[features].fillna(0)
    y = df['Likes'].fillna(0)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Create and train the linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Make predictions and evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print('Mean Squared Error:', mse)

    # Visualize predicted vs. actual Likes
    plt.figure(figsize=(8,6))
    plt.scatter(y_test, y_pred, alpha=0.5)
    plt.title('Predicted vs Actual Likes')
    plt.xlabel('Actual Likes')
    plt.ylabel('Predicted Likes')
    plt.show()
else:
    print('Required columns for predictive modeling not found.')
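
Integer codes give the categories an arbitrary numeric order, which a linear model then treats as meaningful. A common alternative is one-hot encoding, and it also helps to compare the model against a trivial baseline that always predicts the mean. The sketch below illustrates both on made-up data. To keep it dependency-light it fits ordinary least squares with `np.linalg.lstsq` rather than scikit-learn, and it evaluates in-sample only because the toy frame is tiny; on the real data a held-out split, as in the cell above, would be the appropriate comparison.

```python
import numpy as np
import pandas as pd

# Toy data (made up)
sample = pd.DataFrame({
    "Sentiment": ["Positive", "Negative", "Neutral", "Positive", "Negative", "Neutral"],
    "Platform": ["Twitter", "Twitter", "Instagram", "Instagram", "Facebook", "Facebook"],
    "Likes": [30.0, 10.0, 20.0, 40.0, 15.0, 25.0],
})

# One-hot encode; drop_first avoids columns that are collinear with the intercept
X = pd.get_dummies(sample[["Sentiment", "Platform"]], drop_first=True).to_numpy(dtype=float)
X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
y = sample["Likes"].to_numpy()

# Ordinary least squares fit and in-sample predictions
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef

rmse_model = float(np.sqrt(np.mean((y - pred) ** 2)))
rmse_baseline = float(np.sqrt(np.mean((y - y.mean()) ** 2)))  # always predict the mean

print("Model RMSE:", rmse_model)
print("Baseline RMSE:", rmse_baseline)
```

If the model's error on held-out data is not clearly below the baseline's, the features carry little predictive signal for Likes.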


## Conclusion

In this notebook, we explored the social media sentiment dataset along six dimensions: sentiment, temporal trends, user engagement, platform differences, hashtag trends, and geographical variation.

### Future Work
- Enhance predictive models with additional features (e.g., hashtag counts, more granular time features).
- Perform deeper analysis into regional sentiment trends and seasonal variations.
- Explore more advanced sentiment analysis techniques and natural language processing approaches.
