<a href="https://colab.research.google.com/github/uzairname/OtsegoStoryProject/blob/main/experiments/Final_Analysis_Results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This notebook works best in Google Colab.
If the "Open In Colab" badge is visible above, click it to open this notebook in Colab. Otherwise, run `pip install -r requirements.txt` before running the notebook.

If you would like to view the data, it is available at https://docs.google.com/spreadsheets/d/1WMoeCGweQA0xUtEwlK8R_XzRuZOgPqpHQDGWizb9slQ/

To use Gemma, run the cell below and follow the instructions it prints to create a Hugging Face access token, then enter the token when prompted. Gemma is a gated model, so you may also need to accept its license on the model page.

If asked whether to add the token as a git credential, type `n`.


In [None]:
!huggingface-cli login

In [None]:
!pip install -U bertopic bitsandbytes accelerate -q

# Justice for Otsego Sentiment Analysis

This notebook performs sentiment analysis on Facebook posts collected by the Justice for Otsego project using both VADER (lexicon-based) and BERT (transformer-based) approaches.

## ✅ How to Reproduce This Notebook

To ensure full reproducibility:

1. **Environment**  
   Use Python 3.8+ with the following packages installed:
   - `pandas`
   - `matplotlib`
   - `nltk`
   - `transformers`
   - `scikit-learn` (imported as `sklearn`)
   - `numpy`
   - `tqdm`
   - `plotly`

   You can install these via:
   ```bash
   pip install pandas matplotlib nltk transformers scikit-learn numpy tqdm plotly
   ```

## 📊 Preview of Preprocessed Data

The following output displays a sample of the original post content (`content`) alongside the cleaned, sentence-filtered version (`cleaned_content`).

This step is crucial because:

- **Noise Reduction**: Social media posts often contain extra line breaks, emojis, or short fragments. These are removed or cleaned to make the text more analyzable.
- **Sentence Filtering**: We only keep sentences with 20 or more characters to avoid processing fragments like "ugh" or "yes," which don’t carry much sentiment weight.
- **Standardization**: All text is converted to lowercase and stripped of unwanted characters, preserving only punctuation that affects sentence meaning (e.g., `!`, `?`, `.`).
- **Readability**: The cleaned version gives downstream models more meaningful input. In this notebook it feeds the topic model, while the VADER and BERT cells score the original `content` column.

This table gives us an at-a-glance view of how much transformation the original post undergoes before analysis begins.
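
The cleaning steps listed above can be tried on a single string. The sketch below is a simplified stand-in, not the notebook's function: it splits sentences with a regular expression instead of NLTK's `sent_tokenize`, so it runs without downloading tokenizer data. The name `preprocess_sketch` and the sample post are made up for illustration.

```python
import re

def preprocess_sketch(text, min_len=20):
    # Treat line breaks as sentence breaks
    text = re.sub(r'[\r\n]+', '. ', text)
    # Assumption: a regex split stands in for nltk.sent_tokenize
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # Drop fragments shorter than min_len characters
    kept = [s for s in sentences if len(s) >= min_len]
    # Keep letters, digits, whitespace and basic punctuation, then lowercase
    cleaned = [re.sub(r'[^a-zA-Z0-9\s,.!?\"\'’]', '', s).lower() for s in kept]
    return ' '.join(cleaned)

sample = "Ugh!\nThe water smelled like chemicals again today 😡 #Otsego"
print(preprocess_sketch(sample))
```

The short fragment "Ugh!" is dropped by the length filter, and the emoji and `#` are stripped from the sentence that remains.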


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import pipeline
from datetime import timedelta
from sklearn.linear_model import LinearRegression
import numpy as np
import random
import re
from nltk.tokenize import sent_tokenize
from tqdm.notebook import tqdm
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

np.random.seed(42)
random.seed(42)

tqdm.pandas()

# -------------------------------
# 1. Data Loading & Preprocessing
# -------------------------------
nltk.download('vader_lexicon')

# Load CSV (exported directly from the Google Sheet)
csv_url = 'https://docs.google.com/spreadsheets/d/1WMoeCGweQA0xUtEwlK8R_XzRuZOgPqpHQDGWizb9slQ/export?format=csv'
df = pd.read_csv(csv_url)
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df[df['timestamp'].notnull()]  # Remove rows with invalid timestamps
df['content'] = df['content'].fillna("").astype(str)

nltk.download('punkt_tab')

def preprocess_text(text):

    # Replace newlines with a period and a space
    text = re.sub(r'[\r\n]+', '. ', text)

    # Tokenize into sentences
    sentences = sent_tokenize(text)

    # Filter out short sentences
    filtered_sentences = [s for s in sentences if len(s) >= 20]

    # Remove unwanted characters but keep letters, digits, whitespace, and , . ! ? " ' ’
    cleaned_sentences = [
        re.sub(r'[^a-zA-Z0-9\s,.!?\"\'’]', '', s)
        for s in filtered_sentences
    ]

    # Lowercase everything for consistency
    cleaned_sentences = [s.lower() for s in cleaned_sentences]

    return ' '.join(cleaned_sentences)

# Remove "See more"
df['content'] = df['content'].str.replace("\nSee more", "", regex=False)

# Apply the preprocessing function to the content column
df['cleaned_content'] = df['content'].apply(preprocess_text)

# remove empty rows
df = df[df['cleaned_content'] != '']

# Inspect the first few rows of the new column
print(df[['content', 'cleaned_content']].head())

In [None]:
df[(df['content'].str.len() < 21)]

### Apply Sentiment Analysis

The following cell computes sentiment scores with VADER and BERT. It may take a few minutes.

## 💬 Sentiment Analysis: VADER vs. BERT

In this section, we analyze the emotional tone of each Facebook post using **two complementary sentiment analysis models**:

---

### 🧠 VADER (Valence Aware Dictionary and sEntiment Reasoner)

- **What it is**: A lexicon- and rule-based sentiment analysis tool optimized for social media text.
- **Why it's used**: It's **fast**, easy to interpret, and works well with shorter, informal text like Facebook posts.
- **What it returns**: A `compound` score between -1 (most negative) and +1 (most positive), obtained by summing the valence scores of the words in the text, adjusting them with VADER's rules, and normalizing the total.

#### Reproducibility Tips:
- No randomness in outputs — purely deterministic based on text and the internal dictionary.
- Requires downloading the `vader_lexicon` once using `nltk.download()`.
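
To show where the -1 to +1 range comes from, here is a minimal sketch of the normalization step. It assumes the formula used in the VADER reference implementation, `x / sqrt(x² + α)` with `α = 15`; the function name is illustrative, and the real analyzer applies further rule-based adjustments (negation, punctuation, capitalization) before this step.

```python
import math

def normalize_valence(total, alpha=15.0):
    """Squash an unbounded summed valence into the range [-1, 1]."""
    score = total / math.sqrt(total * total + alpha)
    return max(-1.0, min(1.0, score))

# Larger summed valences approach, but never exceed, the bounds
for total in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(total, round(normalize_valence(total), 3))
```

Because the function saturates, a post with many strongly negative words scores close to -1 rather than growing without limit.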

---

### 🤖 BERT (Bidirectional Encoder Representations from Transformers)

- **What it is**: A transformer-based deep learning model. The `pipeline("sentiment-analysis")` call in this notebook does not name a model, so it loads the library default, which at the time of writing is a DistilBERT checkpoint fine-tuned on the SST-2 sentiment dataset.
- **Why it's used**: It takes word order and **context** into account, so it can handle negation and mixed-tone text that a word lexicon may miss.
- **What it returns**: A label, `POSITIVE` or `NEGATIVE`, and a confidence `score` between 0.5 and 1 for that label.

#### How We Convert BERT Output to a Compound Score:
- We use a simple transformation:
  - `POSITIVE` → +score
  - `NEGATIVE` → -score
- This allows direct comparison with VADER’s compound score.
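
The transformation above can be sketched on mock pipeline output. The two result dictionaries are invented for illustration; only their shape (`label` and `score` keys) follows what the sentiment pipeline returns.

```python
def to_compound(result):
    """Map one pipeline result dict to a signed score in [-1, 1]."""
    score = result["score"]
    return score if result["label"] == "POSITIVE" else -score

# Made-up results in the shape returned by the sentiment pipeline
mock_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.61},
]
compound = [to_compound(r) for r in mock_results]
print(compound)  # [0.98, -0.61]
```

Since a two-class model reports at least 0.5 for the label it picks, the signed score never falls between -0.5 and 0.5, which matters when the same neutral band is applied to both models.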

#### Reproducibility Tips:
- BERT models can behave slightly differently across library versions or hardware due to floating-point precision.
- To ensure stable results:
  - Pin the `transformers` version in `requirements.txt`.
  - Pass an explicit `model=` name to `pipeline(...)` so the default model cannot change between library versions.
  - Optionally set `TRANSFORMERS_NO_ADVISORY_WARNINGS=1` to reduce warning output; this does not affect the scores.
  - Set a fixed random seed if doing fine-tuning or sampling.
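
One way to support version pinning is to record which package versions produced a given run. This is a small sketch using only the standard library; the helper name and the package list are illustrative choices, not part of the original notebook.

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

print(installed_versions(["transformers", "nltk", "pandas", "bertopic"]))
```

Saving this output next to the results makes it easier to tell whether a changed score comes from the data or from a library upgrade.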

---

### ⚖️ Why Both?

Using both VADER and BERT gives a **more comprehensive sentiment picture**:
- VADER excels at **speed and simplicity**.
- BERT provides **depth and context sensitivity**.

This dual approach also helps validate whether both models detect similar emotional trends, which boosts confidence in our findings.


In [None]:
# -------------------------------
# 2. Sentiment Analysis: VADER
# -------------------------------
vader = SentimentIntensityAnalyzer()
df['vader_compound'] = df['content'].apply(lambda x: vader.polarity_scores(x)['compound'])

# -------------------------------
# 3. Sentiment Analysis: BERT
# -------------------------------
# Alternative (disabled): a multilingual 5-star review model, with star ratings mapped to [-1, 1].


# bert_pipeline = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")

# def get_bert_compound(text):
#     # Get the sentiment result (returns a list of dicts)
#     result = bert_pipeline(text)
#     # Extract the star rating from the label (e.g., "4 stars")
#     label = result[0]['label']
#     rating = int(label.split()[0])
#     # Map rating (1-5) to a compound score between -1 and 1:
#     # 1 -> -1.0, 2 -> -0.5, 3 -> 0.0, 4 -> 0.5, 5 -> 1.0
#     compound = (rating - 3) / 2.0
#     return compound


# Apply BERT sentiment analysis (this may take a bit, depending on the dataset size)
# df['bert_compound'] = df['content'].progress_apply(get_bert_compound)

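# Load the sentiment analysis pipeline. With no model named, transformers falls back to
# its default English sentiment model (a DistilBERT checkpoint fine-tuned on SST-2).
sentiment_pipeline = pipeline("sentiment-analysis")

# Apply BERT sentiment analysis (this may take a bit, depending on the dataset size).
# truncation=True clips posts longer than the model's maximum input length instead of raising an error.
results = sentiment_pipeline(df['content'].tolist(), batch_size=16, truncation=True)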

compound_scores = [
    result['score'] if result['label'] == 'POSITIVE' else -result['score']
    for result in results
]

# Add the new column to the DataFrame
df['bert_compound'] = compound_scores



In [None]:
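# Bucket scores into negative / neutral / positive. The cut-offs differ by model and from the
# ±0.05 band used in the summary cell: VADER's neutral band here is (-0.01, 0.01], BERT's is (-0.7, 0.7].
vader_dist = pd.cut(df['vader_compound'], bins=[float('-inf'), -0.01, 0.01, float('inf')], labels=['negative', 'neutral', 'positive'], right=True).value_counts()
bert_dist = pd.cut(df['bert_compound'], bins=[float('-inf'), -0.7, 0.7, float('inf')], labels=['negative', 'neutral', 'positive'], right=True).value_counts()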
print('Vader sentiment distribution')
print(vader_dist)
print('BERT sentiment distribution')
print(bert_dist)

In [None]:
df['bert_compound'].value_counts()

# Fit Topic Model

The following cells train a topic model on the posts. It assigns a topic index to each post. We can then analyze the posts by topic, date, and sentiment.
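
When BERTopic is paired with an HDBSCAN-style clusterer, posts that fit no cluster are typically assigned topic `-1`. The sketch below, using made-up topic assignments and a hypothetical helper name, separates those outliers from the real topics so their share is visible before any per-topic analysis.

```python
from collections import Counter

def topic_sizes(topics, outlier_id=-1):
    """Count posts per topic and report outliers separately."""
    counts = Counter(topics)
    n_outliers = counts.pop(outlier_id, 0)
    return dict(counts), n_outliers

example_topics = [0, 1, -1, 0, 2, -1, 0]  # made-up assignments
sizes, n_outliers = topic_sizes(example_topics)
print(sizes, n_outliers)  # {0: 3, 1: 1, 2: 1} 2
```

If the outlier group is large, its mean sentiment describes a mixed bag of posts rather than a coherent theme, so it may be worth reporting separately.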

In [None]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import HDBSCAN
from bertopic.representation import KeyBERTInspired
from umap import UMAP

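model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",
    top_n_words=10,
    min_topic_size=10,  # ignored by BERTopic because a custom hdbscan_model is passed below
    n_gram_range=(1, 2),  # ignored by BERTopic because a custom vectorizer_model is passed below
    vectorizer_model=CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    representation_model=KeyBERTInspired(),
    hdbscan_model=HDBSCAN(min_cluster_size=20),
    # random_state makes the UMAP projection, and therefore the topics, repeatable across runs
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42),
    verbose=True,
    nr_topics=6,
)

# Pass a plain list of strings; posts that fit no cluster are assigned topic -1
topics, probs = model.fit_transform(df['cleaned_content'].tolist())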
df['topic'] = topics


# Generate Topic Names

The following cells use a language model to give each topic a short natural-language title. The model, Gemma, is freely available from Hugging Face. It may take a minute to download.

In [None]:
from transformers import pipeline

# client = OpenAI()
# def chat(prompt):
#   response = client.responses.create(
#       model="gpt-4o-mini",
#       input=prompt
#   )
#   return response.output_text

pipe = pipeline("text-generation", model="google/gemma-3-1b-it")

In [None]:
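def chat(prompt):
  # Send a single-turn chat prompt to the Gemma pipeline and return the reply text
  messages = [
      {"role": "user", "content": prompt},
  ]
  output = pipe(messages)
  return output[0]['generated_text'][-1]['content']

def get_topic_info(model, topic_id, df, n_docs=10):
  # Sample up to n_docs posts from the topic (fixed seed so the prompt is repeatable)
  topic_docs = df[df['topic']==topic_id]
  topic_docs = topic_docs.sample(min(n_docs, len(topic_docs)), random_state=42)
  # BERTopic labels look like "<id>_<word>_<word>..."; keep the words for this topic
  labels = model.generate_topic_labels(nr_words=5)
  label = [l for l in labels if int(l.split('_')[0]) == topic_id][0]
  keywords = ', '.join(label.split('_')[1:])
  return topic_docs['content'].tolist(), keywords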


In [None]:

topic_names = {}
for topic_id in list(set(model.topics_)):
  docs, keywords = get_topic_info(model, topic_id, df)

  docs_str = "\n".join(docs)

  prompt = docs_str + f"\n\nAbove are a subset of facebook posts that follow a common theme. The theme has the key words \"{keywords}\". Please come up with a title/label representing the subject of what people are discussing in 5 words or less. Be specific. Respond with just the label and nothing else."

  print(prompt)

  response = chat(prompt)
  print(response)

  topic_names[topic_id]=response.strip()

df['topic_name'] = df['topic'].map(topic_names)


In [None]:
df['topic_name'].value_counts()

# Figures

## 📈 Sentiment by Topic

This bar chart shows the **average sentiment score (using VADER)** for each of the topics identified by BERTopic.

### 🧪 What It Shows:

- **X-axis (Topic)**: The short topic titles generated by the language model from each topic's BERTopic keywords and sample posts.
- **Y-axis (Mean Sentiment)**: The average VADER compound sentiment score for posts within that topic.
  - Values range from -1 (very negative) to +1 (very positive).
- **Bar Colors**: Colored by sentiment value to visually emphasize more positive or negative themes.

---

### 🔍 Why This Matters:

This visualization helps us understand:
- Which topics are generally associated with **positive, neutral, or negative emotions**.
- Whether certain issues (e.g., water safety, government response, health symptoms) evoke more emotional responses.
- How **emotion and content** are related in community discourse.

This is a key step in linking **quantitative emotion data** to **qualitative themes** from the Otsego community's experiences.
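
The aggregation behind the bar chart is a per-topic mean. The sketch below reproduces it in plain Python on invented rows and also returns the post count, since a mean over a handful of posts is far less stable than one over hundreds. The topic names, scores, and helper name are all illustrative.

```python
from collections import defaultdict
from statistics import mean

def mean_sentiment_by_topic(rows):
    """rows: iterable of (topic_name, compound_score) pairs."""
    grouped = defaultdict(list)
    for topic, score in rows:
        grouped[topic].append(score)
    # Return (mean score, number of posts) per topic
    return {topic: (mean(scores), len(scores)) for topic, scores in grouped.items()}

rows = [
    ("Water safety", -0.5),
    ("Water safety", -0.25),
    ("Community events", 0.75),
]
summary = mean_sentiment_by_topic(rows)
print(summary)
```

Reading the bar heights together with the per-topic counts helps avoid over-interpreting topics that contain only a few posts.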


In [None]:
import plotly.express as px
# Determine mean sentiments by topic
sentiment_by_topic = df.groupby(['topic_name']).agg({'vader_compound': 'mean'}).reset_index()

fig = px.bar(
    sentiment_by_topic,
    x='topic_name',
    y='vader_compound',
    color='vader_compound',
    labels={'topic_name': 'Topic', 'vader_compound': "Mean Sentiment"},
    title='Mean Sentiment by Topic',
)

fig.show()

## 📆 Sentiment Trends by Topic Over Time

This line plot shows how the **average sentiment** of each topic has changed **year by year** in the Justice for Otsego Facebook posts.

---

### 🔍 What This Chart Tells Us:

- **X-axis (Year)**: Time progression based on the year each post was created.
- **Y-axis (Mean Sentiment)**: The average VADER sentiment score for posts in each topic during a given year.
- **Color-coded Lines**: Each line represents a different topic discovered by BERTopic.

---

### 🧠 Why It’s Useful:

- **Reveals emotional shifts**: Spot years when the community felt more hopeful, upset, or neutral about key issues.
- **Tracks community response**: See how public sentiment around certain topics evolved after events like news releases, town halls, or environmental updates.
- **Supports storytelling**: Helps frame emotional arcs across time in response to unfolding local crises.

This visualization is critical for understanding how **sentiment changes in response to external events** and gives researchers a way to map emotion to real-world context.

---

**Note**: Since this uses VADER, the sentiment score is between -1 (very negative) and +1 (very positive).


In [None]:
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Determine mean sentiment by topic and year
sentiment_by_topic_year = (
    df.groupby(['topic_name', df['timestamp'].dt.year])
    .agg({'vader_compound': 'mean'})
    .reset_index()
    .rename(columns={'timestamp': 'year'})  # the grouped year column inherits the name 'timestamp'
)

# Create the Plotly line plot
fig = px.line(
    sentiment_by_topic_year,
    x='year',
    y='vader_compound',
    color='topic_name',
    title='Mean Sentiment by Topic Over Time',
    labels={'year': 'Year', 'vader_compound': 'Mean Sentiment', 'topic_name': 'Topic'},
)

fig.show()

## 🧮 Volume of Posts by Topic Over Time

This line plot illustrates how frequently each topic appeared in the Justice for Otsego Facebook posts, **broken down by year**.

---

### 📊 Chart Overview:

- **X-axis (Year)**: The year the posts were made, extracted from the `timestamp` column.
- **Y-axis (Post Volume)**: The number of posts associated with each topic in that year.
- **Color-coded Lines**: Each line represents a different topic identified by BERTopic.

---

### 🧠 Why This is Important:

- **Tracks community focus**: Highlights which issues were most frequently discussed in different years.
- **Reveals emerging or fading concerns**: For example, a spike might signal a specific event, crisis, or news report that triggered discussion.
- **Supports correlation with sentiment**: Helps contextualize the sentiment trends shown in the previous chart — a rise in post volume might explain emotional spikes.

---

By analyzing **volume and sentiment together**, we gain a more complete picture of how community dialogue has evolved, what topics dominated attention, and when public concern was at its highest.


In [None]:
import plotly.express as px

# Group by topic and year, then count the number of rows
volume_by_topic_year = df.groupby(['topic_name', df['timestamp'].dt.year]).size().reset_index(name='count')

# Rename the year column for clarity
volume_by_topic_year.rename(columns={'timestamp': 'year'}, inplace=True)

# Create the Plotly line plot
fig = px.line(
    volume_by_topic_year,
    x='year',
    y='count',
    color='topic_name',
    title='Volume of Posts by Topic Over Time',
    labels={'year': 'Year', 'count': 'Post Volume', 'topic_name': 'Topic'},
)

fig.show()


## 📊 Sentiment Analysis & Forecasting Results

This section presents a series of visualizations to better understand the **patterns, differences, and future trends** in community sentiment captured through Facebook posts.

---

### 📈 1. Sentiment Trend & Forecast (VADER vs. BERT)

Two side-by-side line plots compare **daily average sentiment** over time using:
- **VADER** (left): Lexicon-based, interpretable, and fast.
- **BERT** (right): Deep-learning-based, context-aware sentiment.

Each plot includes:
- ⚫ **Historical sentiment** (dots)
- ❌ **Forecasted sentiment** using linear regression (dashed line)
- 📉 Threshold markers for **Positive (≥ 0.05)**, **Neutral (-0.05 to 0.05)**, and **Negative (≤ -0.05)** tone

This extends the fitted straight line through the end of **2026**, giving a rough indication of where each model's trend is heading and of how differently the two models perceive emotional trends.
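
The forecast is an ordinary least-squares line fitted to date ordinals. The sketch below shows the same idea in plain Python on five invented daily averages. It also illustrates a limitation worth keeping in mind: a straight line is unbounded, so long-range predictions can leave the valid -1 to +1 range. The `clip` helper is an illustrative addition and is not part of the notebook's forecast.

```python
from datetime import date, timedelta

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def clip(score, low=-1.0, high=1.0):
    """Keep a predicted score inside the valid sentiment range."""
    return max(low, min(high, score))

# Made-up daily averages that fall by 0.02 per day
start = date(2024, 1, 1)
days = [start + timedelta(days=i) for i in range(5)]
scores = [0.10, 0.08, 0.06, 0.04, 0.02]

slope, intercept = fit_line([d.toordinal() for d in days], scores)

def predict(day):
    return slope * day.toordinal() + intercept

near = predict(start + timedelta(days=10))   # short-range extrapolation
far = predict(start + timedelta(days=365))   # long-range extrapolation
print(round(near, 3), round(far, 3), clip(far))
```

In this toy example the one-year prediction falls well below -1, which is why the dashed forecast lines are best read as a direction of travel rather than as expected values.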

---

### 🔁 2. VADER vs. BERT Score Comparison (Scatter Plot)

This scatter plot directly compares the **compound sentiment score** of each post as calculated by both models.

- Each point = one Facebook post
- Red dashed line = perfect agreement between models
- Deviations from the line show where **VADER and BERT disagree**

This is especially helpful to:
- Highlight posts where VADER sees neutral tone but BERT detects strong emotion
- Validate consistency between models
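
Alongside the scatter plot, agreement can be summarized as the share of posts that both models place in the same category. This is a sketch on invented scores; the 0.05 band mirrors the threshold used elsewhere in this notebook, and the function names are illustrative.

```python
def sign_category(score, band=0.05):
    """Return 1 (positive), -1 (negative) or 0 (neutral) for a compound score."""
    if score >= band:
        return 1
    if score <= -band:
        return -1
    return 0

def agreement_rate(vader_scores, bert_scores, band=0.05):
    """Fraction of posts where both models fall in the same category."""
    pairs = list(zip(vader_scores, bert_scores))
    agree = sum(1 for v, b in pairs if sign_category(v, band) == sign_category(b, band))
    return agree / len(pairs)

# Made-up scores for four posts
vader = [0.6, 0.0, -0.4, 0.3]
bert = [0.99, -0.97, -0.8, -0.9]
print(agreement_rate(vader, bert))  # 0.5
```

The second post is the case highlighted above: VADER sees a neutral tone while BERT reports a confident negative.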

---

### 📊 3. Sentiment Distributions (Histograms)

Two side-by-side histograms show how sentiment scores are distributed across all posts for each model:

- **VADER** tends to be slightly more conservative, clustering around 0
- **BERT** scores fall near -1 or +1, because the binary classifier reports a confidence of at least 0.5 for whichever label it picks, so its histogram has no mass around 0

These plots provide a sense of **emotional polarity** and the overall tone of the dataset.

---

Together, these visualizations paint a rich picture of:
- The emotional state of the Otsego community
- How those emotions have evolved
- How two different NLP models interpret public sentiment
- What might be coming in the future based on trend projections


In [None]:
df['date'] = df['timestamp'].dt.date

# Aggregate daily average sentiment for VADER and BERT
daily_vader = df.groupby('date')['vader_compound'].mean().reset_index().rename(columns={'vader_compound': 'avg_compound'})
daily_bert = df.groupby('date')['bert_compound'].mean().reset_index().rename(columns={'bert_compound': 'avg_compound'})

# Forecast function using linear regression
def forecast_sentiment(daily_df):
    # Convert dates to ordinal numbers for regression
    daily_df['date_ordinal'] = pd.to_datetime(daily_df['date']).apply(lambda date: date.toordinal())
    X = daily_df['date_ordinal'].values.reshape(-1, 1)
    y = daily_df['avg_compound'].values
    model = LinearRegression()
    model.fit(X, y)

    # Forecast from the day after the last date until the end of 2026
    last_date = pd.to_datetime(daily_df['date'].max())
    future_dates = pd.date_range(start=last_date + timedelta(days=1), end=pd.Timestamp("2026-12-31"))
    future_ordinals = np.array([d.toordinal() for d in future_dates]).reshape(-1, 1)
    predicted = model.predict(future_ordinals)
    future_df = pd.DataFrame({'date': future_dates, 'predicted_compound': predicted})
    return future_df

future_vader = forecast_sentiment(daily_vader)
future_bert = forecast_sentiment(daily_bert)

# -------------------------------
# 5. Visualization
# -------------------------------

# Figure 1: Trend Plots (side-by-side) for VADER and BERT with Forecasts
fig, axs = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# VADER Plot
axs[0].plot(daily_vader['date'], daily_vader['avg_compound'], marker='o', label='Historical')
axs[0].plot(future_vader['date'], future_vader['predicted_compound'], marker='x', linestyle='--', label='Forecast')
axs[0].set_xlabel('Date', fontsize=12)
axs[0].set_ylabel('Avg Compound Sentiment', fontsize=12)
axs[0].set_title('VADER Sentiment Trend & Forecast', fontsize=14)
axs[0].set_ylim(-1, 1)
axs[0].axhline(y=0.05, color='gray', linestyle='--', linewidth=1)
axs[0].axhline(y=-0.05, color='gray', linestyle='--', linewidth=1)
axs[0].grid(True)
axs[0].legend(fontsize=10)
axs[0].text(daily_vader['date'].iloc[-1], 0.07, 'Positive', color='green', fontsize=10)
axs[0].text(daily_vader['date'].iloc[-1], 0.00, 'Neutral', color='blue', fontsize=10)
axs[0].text(daily_vader['date'].iloc[-1], -0.09, 'Negative', color='red', fontsize=10)

# BERT Plot
axs[1].plot(daily_bert['date'], daily_bert['avg_compound'], marker='o', label='Historical')
axs[1].plot(future_bert['date'], future_bert['predicted_compound'], marker='x', linestyle='--', label='Forecast')
axs[1].set_xlabel('Date', fontsize=12)
axs[1].set_title('BERT Sentiment Trend & Forecast', fontsize=14)
axs[1].set_ylim(-1, 1)
axs[1].axhline(y=0.05, color='gray', linestyle='--', linewidth=1)
axs[1].axhline(y=-0.05, color='gray', linestyle='--', linewidth=1)
axs[1].grid(True)
axs[1].legend(fontsize=10)
axs[1].text(daily_bert['date'].iloc[-1], 0.07, 'Positive', color='green', fontsize=10)
axs[1].text(daily_bert['date'].iloc[-1], 0.00, 'Neutral', color='blue', fontsize=10)
axs[1].text(daily_bert['date'].iloc[-1], -0.09, 'Negative', color='red', fontsize=10)

plt.tight_layout()
plt.show()

# Figure 2: Scatter Plot Comparing VADER vs. BERT for Each Post
plt.figure(figsize=(8, 6))
plt.scatter(df['vader_compound'], df['bert_compound'], alpha=0.6)
plt.xlabel('VADER Compound Score', fontsize=12)
plt.ylabel('BERT Compound Score', fontsize=12)
plt.title('VADER vs. BERT Sentiment Scores', fontsize=14)
plt.grid(True)
lims = [-1, 1]
plt.plot(lims, lims, 'r--', linewidth=1)  # Diagonal reference line
plt.xlim(lims)
plt.ylim(lims)
plt.tight_layout()
plt.show()

# Figure 3: Histograms of Sentiment Distributions for VADER and BERT
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

ax1.hist(df['vader_compound'], bins=20, color='skyblue', edgecolor='black')
ax1.set_title('VADER Sentiment Distribution', fontsize=14)
ax1.set_xlabel('VADER Compound Score', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_xlim(-1, 1)
ax1.grid(True)

ax2.hist(df['bert_compound'], bins=20, color='salmon', edgecolor='black')
ax2.set_title('BERT Sentiment Distribution', fontsize=14)
ax2.set_xlabel('BERT Compound Score', fontsize=12)
ax2.set_xlim(-1, 1)
ax2.grid(True)

plt.tight_layout()
plt.show()

# -------------------------------
# 6. Summary Statistics & Analysis Output
# -------------------------------

## 📅 Additional Visualizations: Weekly Patterns & Statistical Summary

To explore **temporal patterns** in sentiment, we break down scores by **day of the week** and analyze overall statistics across the dataset.

---

### 📦 Visualization: Box Plots of Sentiment Scores by Day of Week

These two side-by-side box plots show the **distribution of per-post sentiment scores** (VADER and BERT) for each of the 7 days of the week.

- Each box represents the **spread and median** of compound sentiment scores for that day.
- Helps identify:
  - Are there days where posts are more emotionally intense?
  - Do certain days tend to be more negative or positive?

This is useful for seeing if emotional tone fluctuates around weekdays/weekends or during heightened activity days (e.g., town halls, news releases).

---

### 📊 Visualization: Stacked Bar Chart of VADER Sentiment Categories by Day

This stacked bar chart visualizes the **volume of posts categorized as Positive, Neutral, or Negative** using VADER, grouped by day of the week.

- Provides a **categorical breakdown** of how emotions are distributed across the week.
- Useful to identify:
  - Which day sees the most negative posts?
  - Which days foster positive or neutral discussions?

---

### 📈 Summary Statistics

A few key statistics help summarize the overall emotional profile of the posts:

- **Total posts analyzed**: Gives the dataset scope.
- **Average compound score** (VADER and BERT): Measures general tone — closer to 0 = more neutral.
- **Standard deviation**: Measures variability in emotional tone.
- **Sentiment category counts**: How many posts are Positive, Neutral, or Negative.
- **Correlation between VADER and BERT scores**:
  - Indicates **how aligned the two models are** in their sentiment evaluations.
  - A strong positive correlation (close to 1) means consistent scoring.

This section supports both a **broad overview** and **granular insights**, combining visual and numeric analysis of the community’s emotional tone.
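
The correlation reported by `Series.corr` is, by default, the Pearson coefficient. For readers who want to see what that number measures, here is a plain-Python sketch on toy data; the function name is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy data: perfectly aligned, then perfectly opposed
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

Pearson correlation captures linear association only, so two models can rank posts similarly and still show a moderate value when one of them, like the binary BERT classifier here, produces scores clustered at the extremes.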


In [None]:
# Additional Visualizations

import matplotlib.pyplot as plt

# --- Visualization 4: Box Plots of VADER and BERT Compound Scores by Day of Week ---

# Create a 'day_of_week' column in the DataFrame with proper ordering
days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df['day_of_week'] = pd.Categorical(df['timestamp'].dt.day_name(), categories=days_order, ordered=True)

# Prepare the data lists for each day for both VADER and BERT scores
vader_box_data = [df.loc[df['day_of_week'] == day, 'vader_compound'].dropna() for day in days_order]
bert_box_data = [df.loc[df['day_of_week'] == day, 'bert_compound'].dropna() for day in days_order]

# Create side-by-side box plots
fig, axs = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

axs[0].boxplot(vader_box_data)
axs[0].set_xticklabels(days_order)  # label ticks here; boxplot's `labels` argument is deprecated in recent matplotlib
axs[0].set_title("VADER Compound Scores by Day of Week")
axs[0].set_xlabel("Day of Week")
axs[0].set_ylabel("Compound Score")
axs[0].grid(True)

axs[1].boxplot(bert_box_data)
axs[1].set_xticklabels(days_order)
axs[1].set_title("BERT Compound Scores by Day of Week")
axs[1].set_xlabel("Day of Week")
axs[1].grid(True)

plt.tight_layout()
plt.show()


# --- Visualization 5: Stacked Bar Chart of VADER Sentiment Categories by Day of Week ---

# Define a helper function to assign sentiment categories for VADER
def categorize(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Apply the function to create a new category column for VADER scores
df['vader_category'] = df['vader_compound'].apply(categorize)

# Create a pivot table: counts of sentiment categories by day of week
# observed=False keeps all seven days in the result even if a day has no posts
pivot = df.groupby('day_of_week', observed=False)['vader_category'].value_counts().unstack().fillna(0).loc[days_order]

# Plot a stacked bar chart for the VADER sentiment categories by day
pivot.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='viridis')
plt.title("VADER Sentiment Categories by Day of Week")
plt.xlabel("Day of Week")
plt.ylabel("Number of Posts")
plt.legend(title="Sentiment Category")
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

print("Summary Statistics:")

total_posts = len(df)
print(f"Total posts analyzed: {total_posts}")

avg_vader = df['vader_compound'].mean()
avg_bert = df['bert_compound'].mean()
std_vader = df['vader_compound'].std()
std_bert = df['bert_compound'].std()

print(f"Average VADER compound score: {avg_vader:.3f}")
print(f"Average BERT compound score: {avg_bert:.3f}")
print(f"Standard Deviation VADER compound score: {std_vader:.3f}")
print(f"Standard Deviation BERT compound score: {std_bert:.3f}")

# Reuse categorize() defined above. BERT scores always have magnitude >= 0.5,
# so the ±0.05 band never labels a BERT score 'Neutral'.
vader_categories = df['vader_compound'].apply(categorize)
bert_categories = df['bert_compound'].apply(categorize)

vader_counts = vader_categories.value_counts()
bert_counts = bert_categories.value_counts()

print("\nVADER Sentiment Category Counts:")
print(vader_counts)
print("\nBERT Sentiment Category Counts:")
print(bert_counts)

corr = df['vader_compound'].corr(df['bert_compound'])
print(f"\nCorrelation between VADER and BERT compound scores: {corr:.3f}")

## Summary Statistics:

- **Total posts analyzed**: 845  
- **Average VADER compound score**: 0.079  
- **Average BERT compound score**: -0.304  
- **Standard Deviation VADER compound score**: 0.452  
- **Standard Deviation BERT compound score**: 0.926  

---

### VADER Sentiment Category Counts:

| Sentiment | Count |
|-----------|-------|
| Positive  | 350   |
| Neutral   | 272   |
| Negative  | 223   |

---

### BERT Sentiment Category Counts:

| Sentiment | Count |
|-----------|-------|
| Negative  | 552   |
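| Positive  | 293   |

BERT has no Neutral row because the binary classifier's scores always have a magnitude of at least 0.5, which lies outside the ±0.05 neutral band.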

## Insights & Recommendations
- **Context Matters:**  
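  VADER's slightly positive average (0.079) hides considerable variability (standard deviation 0.452). BERT's negative average (-0.304) and higher negative count (552 of 845 posts) may better reflect community distress.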
  
- **Multiple Methods:**  
  Using both VADER and BERT provides a fuller picture. Consider fine-tuning BERT on domain-specific data for improved accuracy.

- **Additional Visualizations:**  
  - **Heatmaps/Calendar Plots:** Identify periods with heightened negative sentiment.  
  - **Topic Modeling:** Use LDA to link themes (e.g., water quality, cancer) with sentiment trends.  
  - **Event Annotations:** Overlay key community events on trend graphs to contextualize shifts in sentiment.

## Conclusion
While VADER's average sentiment is slightly positive, BERT's average is negative, and both models show wide variability. These methodological differences suggest that the community's emotional response is more nuanced and polarized than a single average conveys. This underscores the need for a combined quantitative and qualitative approach to fully capture the community's experiences, particularly given the serious issues of water quality and health.

