# SenseSmart

*Jerry Zhao, Xuying Yang, Yujuan Zhou, Clement Mo, Haozheng Liu*

---

## Project Name: Social Media Sentiment Analysis Dataset
**Dataset**: [Social Media Sentiment Analysis Dataset](https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset/code)

This project analyzes user-generated content across social media platforms to uncover sentiment trends and user behavior. The dataset includes post text, sentiment labels, timestamps, hashtags, user engagement metrics (likes and retweets), and geographical information. By exploring this data, we can identify how emotions vary over time, across platforms, and by region. We will also investigate how sentiment and topic relate to user engagement.

**Problem Statement:**
The primary goal is to perform sentiment analysis, investigate temporal and geographical trends in user-generated content, and analyze platform-specific user behavior. The project will focus on identifying popular topics through hashtags, exploring engagement levels, and understanding regional differences in sentiment trends. 

**Tasks:**
- **Dataset Exploration:**
  - Gain familiarity with the dataset by understanding its structure and key features such as sentiment, timestamps, and user engagement (likes and retweets).
- **Sentiment Analysis:**
  - Conduct sentiment analysis to classify the user-generated content into categories such as surprise, excitement, and admiration.
  - Visualize the distribution of sentiments and examine the emotional landscape of social media platforms.
- **Temporal Analysis:**
  - Explore temporal patterns in user sentiment over time using the "Timestamp" column.
  - Identify recurring themes, seasonal variations, or any significant trends in the data.
- **User Engagement Insights:**
  - Analyze user engagement by studying the likes and retweets associated with posts.
  - Investigate how sentiment correlates with higher levels of user engagement.
- **Platform-Specific Analysis:**
  - Compare sentiment trends across various platforms using the "Platform" column.
  - Identify how emotions differ depending on the platform.
- **Hashtag and Topic Trends:**
  - Explore trending topics by analyzing the hashtags.
  - Investigate the relationship between hashtags and user engagement or sentiment.
- **Geographical Trends:**
  - Examine regional sentiment variations using the "Country" column.
  - Understand how social media content and sentiment differ across various regions.
- **Cross-Feature Analysis:**
  - Combine features (e.g., sentiment and hashtags, sentiment and platform) to uncover deeper insights about user behavior and content trends.
- **Predictive Modeling (Optional):**
  - Explore the possibility of building predictive models to predict user engagement (likes/retweets) based on sentiment, hashtags, and platform.
  - Evaluate the performance of the model and explore its potential for predicting popular content. 
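
Several of the tasks above reduce to counting and group-wise averaging. The sketch below illustrates the idea in plain Python on five rows taken from the start of the dataset. The helper `mean_by` and the variable names are illustrative assumptions, not part of the project code; the full analysis would more likely use pandas `groupby` on the loaded table.

```python
from collections import Counter, defaultdict
from statistics import mean

# Five rows taken from the start of the dataset; a stand-in for the full table.
posts = [
    {"Sentiment": "Positive", "Platform": "Twitter", "Hashtags": "#Nature #Park", "Retweets": 15.0, "Likes": 30.0},
    {"Sentiment": "Negative", "Platform": "Twitter", "Hashtags": "#Traffic #Morning", "Retweets": 5.0, "Likes": 10.0},
    {"Sentiment": "Positive", "Platform": "Instagram", "Hashtags": "#Fitness #Workout", "Retweets": 20.0, "Likes": 40.0},
    {"Sentiment": "Positive", "Platform": "Facebook", "Hashtags": "#Travel #Adventure", "Retweets": 8.0, "Likes": 15.0},
    {"Sentiment": "Neutral", "Platform": "Instagram", "Hashtags": "#Cooking #Food", "Retweets": 12.0, "Likes": 25.0},
]


def mean_by(rows, key, value):
    """Average `value` within each group defined by `key` (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}


# Sentiment Analysis task: distribution of sentiment labels
sentiment_counts = Counter(row["Sentiment"] for row in posts)

# User Engagement and Platform-Specific tasks: average likes per group
likes_by_sentiment = mean_by(posts, "Sentiment", "Likes")
likes_by_platform = mean_by(posts, "Platform", "Likes")

# Hashtag and Topic Trends task: frequency of individual hashtags
hashtag_counts = Counter(tag for row in posts for tag in row["Hashtags"].split())

print("Sentiment counts:", dict(sentiment_counts))
print("Mean likes by sentiment:", likes_by_sentiment)
print("Mean likes by platform:", likes_by_platform)
print("Most common hashtags:", hashtag_counts.most_common(3))
```

With only five rows the averages are not meaningful; the point is the shape of the computation, which carries over unchanged to the full dataset.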

Students are encouraged to draw connections between data-driven insights and potential policy implications.


## Dataset acquisition
**Dataset**: [Social Media Sentiment Analysis Dataset](https://www.kaggle.com/datasets/kashishparmar02/social-media-sentiments-analysis-dataset/code)

In [1]:
# Install needed packages
%pip -q install kagglehub pandas matplotlib scikit-learn nltk

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Download dataset
from pathlib import Path
import kagglehub, zipfile, shutil

# Download the latest version into the local KaggleHub cache
cache = Path(kagglehub.dataset_download("kashishparmar02/social-media-sentiments-analysis-dataset"))
print("KaggleHub cache:", cache)

# Prepare and clear ./data folder
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
for p in data_dir.iterdir():
    if p.is_file():
        p.unlink() # remove file
    else:
        shutil.rmtree(p)

# Collect .csv files
csv_found = []
for f in cache.rglob("*.csv"):
    dst = data_dir / f.name
    if not dst.exists(): # Avoid duplicates
        shutil.copy2(f, dst)
        csv_found.append(dst.name)

# If none found, scan all zips and extract ONLY .csv files into ./data
if not csv_found:
    for z in cache.rglob("*.zip"):
        try:
            with zipfile.ZipFile(z) as zf:
                for member in zf.infolist():
                    # Filter by extension .csv
                    name = Path(member.filename).name
                    if name.lower().endswith(".csv"):
                        with zf.open(member) as src, open(data_dir / name, "wb") as dst:
                            shutil.copyfileobj(src, dst)
                        csv_found.append(name)
        except Exception as e:
            print("Skip bad zip:", z, "->", e)

# Verify result and show summary
if not csv_found:
    raise FileNotFoundError("No .csv found.")
print("Path to dataset files:", data_dir.resolve(), ", the dataset file name is:", csv_found)


KaggleHub cache: C:\Users\yujua\.cache\kagglehub\datasets\kashishparmar02\social-media-sentiments-analysis-dataset\versions\3
Path to dataset files: C:\Users\yujua\Desktop\F25\DataScience_BootCamp\SenseSmart\data , the dataset file name is: ['sentimentdataset.csv']


## Load Data & Column Standardization
- **Text**: the post text
- **Sentiment**: emotion label (e.g., Positive, Negative, Neutral, ...)
- **Timestamp**: time the post was made
- **Platform**: social platform name
- **Likes**: number of likes
- **Retweets**: number of retweets
- **Country**: country string
- **Hashtags**: raw hashtag text
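
One possible way to standardize these columns after loading is sketched below on a two-row stand-in frame. The padded whitespace in the sample strings, the `sample` frame, and the `HashtagList` column are assumptions made for illustration; the integer cast also assumes the engagement columns contain no missing values.

```python
import pandas as pd

# Two-row stand-in for the loaded table; the padding around strings is assumed.
sample = pd.DataFrame({
    "Text": [" Enjoying a beautiful day at the park!  ", " Traffic was terrible this morning. "],
    "Timestamp": ["2023-01-15 12:30:00", "2023-01-15 08:45:00"],
    "Platform": [" Twitter ", "Twitter"],
    "Hashtags": [" #Nature #Park ", "#Traffic #Morning"],
    "Retweets": [15.0, 5.0],
    "Likes": [30.0, 10.0],
})

# Trim padding so that " Twitter " and "Twitter" count as the same platform
for col in ["Text", "Platform", "Hashtags"]:
    sample[col] = sample[col].str.strip()

# Parse timestamps; unparseable values become NaT instead of raising
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], errors="coerce")

# Engagement counts are whole numbers, so store them as integers
for col in ["Retweets", "Likes"]:
    sample[col] = sample[col].astype("int64")

# Split the raw hashtag text into a list of individual tags
sample["HashtagList"] = sample["Hashtags"].str.findall(r"#\w+")

print(sample.dtypes)
print(sample[["Platform", "Timestamp", "HashtagList"]])
```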

In [3]:
from pathlib import Path
import json
import re
import warnings

import numpy as np
import pandas as pd
from IPython.display import display

# Data directory ./data
DATA_DIR = Path("data")

# Pick the dataset file in ./data
csv_files = sorted(DATA_DIR.glob("*.csv"))
if not csv_files:
    raise FileNotFoundError("No CSV file found in ./data.")
DATA_FILE = csv_files[0]
print("Selected file:", DATA_FILE.name)

# Read the CSV
df = pd.read_csv(DATA_FILE, low_memory=False)
print("Raw shape:", df.shape)
print("Raw columns:", list(df.columns))

# Strip stray whitespace from column names (case is kept so names match desired_order below)
df.columns = df.columns.str.strip()

# Drop leftover index columns (named "Unnamed: ...") saved by earlier CSV exports
unnamed_cols = [c for c in df.columns if re.match(r"^Unnamed", str(c), flags=re.IGNORECASE)]
if unnamed_cols:
    df = df.drop(columns=unnamed_cols)
    print("Dropped Unnamed columns:", unnamed_cols)

# Preview with columns in a preferred order
desired_order = [
    "Text", "Sentiment", "Timestamp", "User",
    "Platform", "Hashtags", "Retweets", "Likes",
    "Country", "Year", "Month", "Day", "Hour",
]

if all(col in df.columns for col in desired_order):
    view = df[desired_order]
else:
    present = [c for c in desired_order if c in df.columns]
    others  = [c for c in df.columns if c not in present]
    view = df[present + others]
print("\nPreview:")
display(view.head())

print("Final shape:", view.shape)
print("Final columns:", list(view.columns))

Selected file: sentimentdataset.csv
Raw shape: (732, 15)
Raw columns: ['Unnamed: 0.1', 'Unnamed: 0', 'Text', 'Sentiment', 'Timestamp', 'User', 'Platform', 'Hashtags', 'Retweets', 'Likes', 'Country', 'Year', 'Month', 'Day', 'Hour']
Dropped Unnamed columns: ['Unnamed: 0.1', 'Unnamed: 0']

Preview:


| | Text | Sentiment | Timestamp | User | Platform | Hashtags | Retweets | Likes | Country | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Enjoying a beautiful day at the park! ... | Positive | 2023-01-15 12:30:00 | User123 | Twitter | #Nature #Park | 15.0 | 30.0 | USA | 2023 | 1 | 15 | 12 |
| 1 | Traffic was terrible this morning. ... | Negative | 2023-01-15 08:45:00 | CommuterX | Twitter | #Traffic #Morning | 5.0 | 10.0 | Canada | 2023 | 1 | 15 | 8 |
| 2 | Just finished an amazing workout! 💪 ... | Positive | 2023-01-15 15:45:00 | FitnessFan | Instagram | #Fitness #Workout | 20.0 | 40.0 | USA | 2023 | 1 | 15 | 15 |
| 3 | Excited about the upcoming weekend getaway! ... | Positive | 2023-01-15 18:20:00 | AdventureX | Facebook | #Travel #Adventure | 8.0 | 15.0 | UK | 2023 | 1 | 15 | 18 |
| 4 | Trying out a new recipe for dinner tonight. ... | Neutral | 2023-01-15 19:55:00 | ChefCook | Instagram | #Cooking #Food | 12.0 | 25.0 | Australia | 2023 | 1 | 15 | 19 |


Final shape: (732, 13)
Final columns: ['Text', 'Sentiment', 'Timestamp', 'User', 'Platform', 'Hashtags', 'Retweets', 'Likes', 'Country', 'Year', 'Month', 'Day', 'Hour']
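
The final columns include `Year`, `Month`, `Day`, and `Hour` alongside `Timestamp`, so the same information is stored twice. A quick consistency check, sketched below in plain Python on two rows from the preview, can confirm that the derived fields agree with the timestamp before either is used for temporal analysis. The function name `time_fields_match` and the timestamp format string are assumptions based on the values shown in the preview.

```python
from datetime import datetime

# Two rows from the preview, restricted to the time-related columns.
rows = [
    {"Timestamp": "2023-01-15 12:30:00", "Year": 2023, "Month": 1, "Day": 15, "Hour": 12},
    {"Timestamp": "2023-01-15 08:45:00", "Year": 2023, "Month": 1, "Day": 15, "Hour": 8},
]


def time_fields_match(row):
    """Return True if Year/Month/Day/Hour agree with the parsed Timestamp."""
    ts = datetime.strptime(row["Timestamp"], "%Y-%m-%d %H:%M:%S")
    derived = (ts.year, ts.month, ts.day, ts.hour)
    stored = (row["Year"], row["Month"], row["Day"], row["Hour"])
    return derived == stored


# Collect the positions of any rows whose stored fields disagree
mismatches = [i for i, row in enumerate(rows) if not time_fields_match(row)]
print("Rows with inconsistent time fields:", mismatches)
```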
