
# Facebook Post Sentiment Breakdown – Figure Instructions

This notebook provides full instructions for generating sentiment analysis figures from Facebook group post data. These figures are included in the project presentation and final report to visually represent the distribution of sentiment in scraped Facebook posts.



## 1. Installation Instructions

For full installation and setup steps, see the [README.md on GitHub](https://github.com/uzairname/OtsegoStoryProject/blob/main/README.md)


## 2. Data Retrieval

First, obtain the Facebook posts data. You can use the scraping tool in `fb-scraper.ipynb` to generate raw post data (with columns like post content and timestamp). If real-time scraping is not feasible or you want to use prepared data, use the intermediate dataset, `otsego_data_combined.csv`, provided as a final deliverable of this project.

This CSV contains scraped Facebook posts (author, timestamp, content, etc.). Ensure this file is present in the `data/` directory of the project. It will be used for sentiment analysis and figure generation.
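Before running the later cells, it can help to confirm that the CSV exposes the columns the notebook relies on. The sketch below is an illustration only: `missing_columns` is a hypothetical helper, and the required set (`timestamp`, `content`) is inferred from the columns the later cells access, not from a documented schema. It reads only the header row and uses an in-memory sample in place of the real file.

```python
import csv
import io

# Columns the later cells access; inferred from this notebook, not a formal schema.
REQUIRED_COLUMNS = {"timestamp", "content"}

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header, sorted."""
    header = next(csv.reader(csv_file), [])  # empty file -> empty header
    present = {name.strip() for name in header}
    return sorted(set(required) - present)

# In-memory stand-in for the real file (made-up row, for illustration only).
sample = io.StringIO("author,timestamp,content\nA,2024-01-01 09:00,Hello\n")
print(missing_columns(sample))  # an empty list means nothing is missing
```

To check the real dataset, pass an open file handle (opened with `newline=''`) instead of the `StringIO` sample.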

## 3. Data Loading and Sentiment Categorization


In this section, we load the data and perform sentiment analysis using two methods:

- VADER (Valence Aware Dictionary and sEntiment Reasoner): A rule-based sentiment analyzer from NLTK that provides a compound sentiment score between -1 (most negative) and 1 (most positive) for each post.
- BERT-based model: A pre-trained transformer model (nlptown/bert-base-multilingual-uncased-sentiment) that predicts a star rating (1 to 5 stars) for each post, which we convert to a -1 to 1 scale for comparison with VADER.

We will add the sentiment scores as new columns in our DataFrame (vader_compound and bert_compound). Before analysis, we also clean the data by parsing dates and handling missing content.
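The star-to-compound conversion is a linear rescaling, `(rating - 3) / 2`, so it can be checked without loading the model. The snippet below is a minimal sketch of that mapping; `stars_to_compound` is a hypothetical name used here for illustration, and the label strings follow the "N stars" pattern shown in the notebook's comments.

```python
def stars_to_compound(label):
    """Map a label such as '4 stars' to a score in [-1, 1]."""
    rating = int(label.split()[0])  # leading integer of the label
    if not 1 <= rating <= 5:
        raise ValueError(f"expected a 1-5 star rating, got {rating}")
    return (rating - 3) / 2.0

# Print the full mapping for the five possible labels.
for label in ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]:
    print(label, "->", stars_to_compound(label))
```

One consequence of this mapping is that the BERT-derived score can only take five values (-1, -0.5, 0, 0.5, 1), whereas VADER's compound score is continuous.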

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sentiment analysis tools
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from transformers import pipeline
from sklearn.linear_model import LinearRegression
from datetime import timedelta

# Download VADER lexicon for sentiment analysis (if not already downloaded)
nltk.download('vader_lexicon')

# Load the CSV data (ensure the file path is correct relative to this notebook)
df = pd.read_csv('otsego_data_combined.csv', parse_dates=['timestamp'])
# Convert timestamps to datetime and drop any rows with invalid or missing dates
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df[df['timestamp'].notnull()].copy()
# Ensure the post content is a string and handle missing content
df['content'] = df['content'].fillna("").astype(str)

# Initialize VADER sentiment analyzer and compute compound sentiment for each post
vader = SentimentIntensityAnalyzer()
df['vader_compound'] = df['content'].apply(lambda text: vader.polarity_scores(text)['compound'])

# Initialize a BERT sentiment analysis pipeline (pre-trained model)
bert_pipeline = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

# Helper function to convert BERT's star rating output into a -1 to 1 compound score
def get_bert_compound(text):
    """
    Uses the BERT sentiment pipeline to get a star rating (1 to 5 stars) for the text,
    then maps that rating to a compound score between -1 (negative) and 1 (positive).
    """
    # truncation=True keeps long posts within the model's maximum input length
    result = bert_pipeline(text, truncation=True)  # e.g., [{'label': '4 stars', 'score': 0.85}]
    label = result[0]['label']          # e.g., "4 stars"
    rating = int(label.split()[0])      # extract the numeric rating (1-5)
    # Map 1-> -1.0, 2-> -0.5, 3-> 0.0, 4-> 0.5, 5-> 1.0
    compound = (rating - 3) / 2.0
    return compound

# Apply BERT sentiment analysis to each post (this may take a few minutes for the whole dataset)
df['bert_compound'] = df['content'].apply(get_bert_compound)

# Preview the data with new sentiment columns
print(df[['timestamp', 'content', 'vader_compound', 'bert_compound']].head(5))


Output: The DataFrame df now has two new columns (vader_compound and bert_compound) with sentiment scores for each post. Each score ranges from -1 (very negative) to 1 (very positive). At this point, we have sentiment analyses ready for visualization. Next, we will aggregate these sentiment scores by date and prepare a simple forecast.

In [None]:
# Create a date column (without time) for daily aggregation
df['date'] = df['timestamp'].dt.date

# Aggregate daily average sentiment for VADER and BERT
daily_vader = df.groupby('date')['vader_compound'].mean().reset_index()
daily_vader.rename(columns={'vader_compound': 'avg_compound'}, inplace=True)
daily_bert = df.groupby('date')['bert_compound'].mean().reset_index()
daily_bert.rename(columns={'bert_compound': 'avg_compound'}, inplace=True)

# Define a function to forecast sentiment trend using linear regression
def forecast_sentiment(daily_df, forecast_until="2026-12-31"):
    """
    Fit a simple linear regression on the daily average sentiment scores and 
    project the trend forward to the specified end date.
    """
    # Convert dates to ordinal (numeric) form for regression
    daily_df = daily_df.copy()
    daily_df['date_ordinal'] = pd.to_datetime(daily_df['date']).apply(lambda d: d.toordinal())
    X = daily_df['date_ordinal'].values.reshape(-1, 1)
    y = daily_df['avg_compound'].values
    model = LinearRegression()
    model.fit(X, y)
    # Create future dates from the day after the last observed date up to forecast_until
    last_date = pd.to_datetime(daily_df['date'].max())
    future_dates = pd.date_range(start=last_date + timedelta(days=1), end=pd.to_datetime(forecast_until))
    if len(future_dates) == 0:
        return pd.DataFrame(columns=['date', 'predicted_compound'])  # no future dates to forecast
    # Predict sentiment for future dates
    future_ordinals = np.array([d.toordinal() for d in future_dates]).reshape(-1, 1)
    predicted = model.predict(future_ordinals)
    future_df = pd.DataFrame({'date': future_dates.date, 'predicted_compound': predicted})
    return future_df

# Generate forecast dataframes for VADER and BERT sentiment trends
future_vader = forecast_sentiment(daily_vader)
future_bert = forecast_sentiment(daily_bert)


In [None]:
import torch
print(torch.__version__)

In the code above, we:

- Grouped the data by date to compute the daily average sentiment score for each method.
- Built a simple linear regression model for each to project future sentiment trends. The forecast extends from the last date in the dataset through the end of 2026.
  
Now that we have both historical daily sentiment and a forecast, we can create our figures.

## 4. Generate and Export Visualizations

We will create five figures to summarize the sentiment analysis results.

**Figure 1 (Sentiment Trend & Forecast)**
Description: This figure shows the trend of average daily sentiment over time for the Facebook posts, using two different sentiment analysis methods. We plot the historical daily sentiment (solid line) and a linear forecast (dashed line) side by side for VADER and BERT. This visualization is used in the project presentation to illustrate how sentiment in the group has changed over time and to project future sentiment direction into 2026.

In [None]:
# Create the output directory used by plt.savefig below if it doesn't exist
import os
os.makedirs('experiments/figures', exist_ok=True)

# Plot side-by-side sentiment trends for VADER and BERT, including forecasts
fig, axs = plt.subplots(1, 2, figsize=(16, 6), sharey=True)

# VADER Sentiment Trend
axs[0].plot(daily_vader['date'], daily_vader['avg_compound'], marker='o', label='Historical')
if not future_vader.empty:
    axs[0].plot(future_vader['date'], future_vader['predicted_compound'], marker='x', linestyle='--', label='Forecast')
axs[0].set_xlabel('Date', fontsize=12)
axs[0].set_ylabel('Average Compound Sentiment', fontsize=12)
axs[0].set_title('VADER Sentiment Trend & Forecast', fontsize=14)
axs[0].set_ylim(-1, 1)  # sentiment score range
# Mark threshold lines for slight positive/negative sentiment
axs[0].axhline(y=0.05, color='gray', linestyle='--', linewidth=1)
axs[0].axhline(y=-0.05, color='gray', linestyle='--', linewidth=1)
axs[0].grid(True)
axs[0].legend(fontsize=10)

# Add labels for sentiment regions (using last date position as reference)
last_date = daily_vader['date'].iloc[-1]
axs[0].text(last_date, 0.08, 'Positive (>=0.05)', color='green', fontsize=10)
axs[0].text(last_date, -0.02, 'Neutral', color='blue', fontsize=10)
axs[0].text(last_date, -0.10, 'Negative (<=-0.05)', color='red', fontsize=10)

# BERT Sentiment Trend
axs[1].plot(daily_bert['date'], daily_bert['avg_compound'], marker='o', label='Historical')
if not future_bert.empty:
    axs[1].plot(future_bert['date'], future_bert['predicted_compound'], marker='x', linestyle='--', label='Forecast')
axs[1].set_xlabel('Date', fontsize=12)
axs[1].set_title('BERT Sentiment Trend & Forecast', fontsize=14)
axs[1].set_ylim(-1, 1)
# Same threshold lines for BERT plot
axs[1].axhline(y=0.05, color='gray', linestyle='--', linewidth=1)
axs[1].axhline(y=-0.05, color='gray', linestyle='--', linewidth=1)
axs[1].grid(True)
axs[1].legend(fontsize=10)

plt.tight_layout()
# Save the figure to file
plt.savefig('experiments/figures/sentiment_trend_forecast.png')
plt.show()


The Sentiment Trend figure above (saved as experiments/figures/sentiment_trend_forecast.png) has two panels: the left is VADER sentiment over time (with forecast), and the right is BERT sentiment over time (with forecast). The horizontal dashed lines at 0.05 and -0.05 indicate a near-neutral range, with values above considered slightly positive and below considered slightly negative. We can see the overall sentiment trajectory and how it might continue if current trends persist.

**Figure 2 (VADER vs. BERT Scatter)**
Description: This scatter plot compares the sentiment scores produced by VADER and BERT for each individual Facebook post. Each point represents a single post, positioned by its VADER compound score (x-axis) and BERT compound score (y-axis). A red dashed diagonal line is drawn where y = x (i.e., points on this line would indicate equal sentiment scores by both methods). This figure is used in the final presentation to assess the agreement between the two sentiment analysis approaches – points clustering along the diagonal indicate that VADER and BERT often concur on sentiment, whereas deviations indicate differences in sentiment assessment.

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(df['vader_compound'], df['bert_compound'], alpha=0.6, edgecolors='none')
plt.xlabel('VADER Compound Score', fontsize=12)
plt.ylabel('BERT Compound Score', fontsize=12)
plt.title('VADER vs. BERT Sentiment Scores per Post', fontsize=14)
plt.grid(True)

# Diagonal reference line (y = x) for visualizing agreement
lims = [-1, 1]
plt.plot(lims, lims, 'r--', linewidth=1)

plt.xlim(lims)
plt.ylim(lims)
plt.tight_layout()
# Save the scatter plot
plt.savefig('experiments/figures/sentiment_scatter_vader_vs_bert.png')
plt.show()


The Sentiment Comparison Scatter above (saved as experiments/figures/sentiment_scatter_vader_vs_bert.png) helps validate our analysis by showing how similarly (or differently) the two methods rated each post.

**Figure 3 (Sentiment Distributions)**
Description: The final figure shows the distribution of sentiment scores across all posts for each method. We plot two side-by-side histograms: one for VADER compound scores and one for BERT compound scores. This provides an overview of how many posts fall into positive, neutral, or negative sentiment ranges. This figure is included in the final presentation to illustrate the overall sentiment polarity of the group’s content.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

# Histogram for VADER sentiment scores
ax1.hist(df['vader_compound'], bins=20, color='skyblue', edgecolor='black')
ax1.set_title('VADER Sentiment Distribution', fontsize=14)
ax1.set_xlabel('VADER Compound Score', fontsize=12)
ax1.set_ylabel('Number of Posts', fontsize=12)
ax1.set_xlim(-1, 1)
ax1.grid(True)

# Histogram for BERT sentiment scores
ax2.hist(df['bert_compound'], bins=20, color='salmon', edgecolor='black')
ax2.set_title('BERT Sentiment Distribution', fontsize=14)
ax2.set_xlabel('BERT Compound Score', fontsize=12)
ax2.set_xlim(-1, 1)
ax2.grid(True)

plt.tight_layout()
# Save the histogram figure
plt.savefig('experiments/figures/sentiment_score_distribution.png')
plt.show()


The Sentiment Distribution figure above (saved as experiments/figures/sentiment_score_distribution.png) reveals how sentiments are spread out. For example, you might observe a concentration of posts around 0.0 (neutral sentiment), with fewer posts at the extreme positive or negative ends. Comparing the two histograms can also show if one method tends to give more neutral vs. extreme scores than the other. In our analysis, both VADER and BERT histograms show the overall sentiment leaning and variability in the Facebook posts.
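Because the rescaled BERT score can only take five values (-1, -0.5, 0, 0.5, 1), its 20-bin histogram shows at most five non-empty bars. Counting the distinct values directly gives the same information in tabular form. The snippet below is a minimal sketch on made-up scores; `score_counts` is a hypothetical helper.

```python
from collections import Counter

def score_counts(scores):
    """Count how many posts received each distinct score, ordered by score."""
    return dict(sorted(Counter(scores).items()))

# Made-up BERT compound scores for seven posts, for illustration only.
bert_scores = [0.0, 0.5, -1.0, 0.0, 1.0, 0.5, 0.0]
print(score_counts(bert_scores))
```

This also explains why the two histograms look different in shape even when the methods broadly agree: VADER's scores are continuous, while BERT's are discrete.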

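**Figure 4 (Box Plots of Sentiment by Day of Week)**
Description: This figure extends our analysis of sentiment scores by breaking them down by day of the week. Side-by-side box plots show the distribution of VADER and BERT compound scores for each weekday, which helps us understand how sentiment changes based on time context.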

In [None]:
# Boxplots of VADER and BERT scores grouped by day of week
days_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df['day_of_week'] = pd.Categorical(df['timestamp'].dt.day_name(), categories=days_order, ordered=True)

vader_box_data = [df.loc[df['day_of_week'] == day, 'vader_compound'].dropna() for day in days_order]
bert_box_data = [df.loc[df['day_of_week'] == day, 'bert_compound'].dropna() for day in days_order]

fig, axs = plt.subplots(1, 2, figsize=(16, 6), sharey=True)
axs[0].boxplot(vader_box_data, labels=days_order)
axs[0].set_title("VADER Compound Scores by Day of Week")
axs[0].set_xlabel("Day of Week")
axs[0].set_ylabel("Compound Score")
axs[0].grid(True)

axs[1].boxplot(bert_box_data, labels=days_order)
axs[1].set_title("BERT Compound Scores by Day of Week")
axs[1].set_xlabel("Day of Week")
axs[1].grid(True)

plt.tight_layout()
plt.savefig('experiments/figures/sentiment_boxplots_by_day.png')
plt.show()


The box plots above (saved as experiments/figures/sentiment_boxplots_by_day.png) show, for each day of the week, how the per-post compound scores are distributed under each method. The line inside each box marks the median and the box spans the middle half of the scores. Comparing boxes across days indicates whether the typical score or its spread shifts on particular days, and comparing the two panels shows whether VADER and BERT agree on those shifts.
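Each box in a box plot is drawn from a few summary statistics of the scores for that day. As a rough illustration, the sketch below computes a five-number summary for one day's scores with the standard library. `box_summary` is a hypothetical helper and the scores are made up. The `"inclusive"` method is chosen on the assumption that it corresponds to the linear-interpolation percentiles Matplotlib uses by default.

```python
from statistics import quantiles

def box_summary(scores):
    """Return min, quartiles, and max for one group of scores."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    return {"min": min(scores), "q1": q1, "median": median, "q3": q3, "max": max(scores)}

# Made-up compound scores for one weekday, for illustration only.
monday_scores = [-0.5, 0.0, 0.2, 0.4, 0.9]
print(box_summary(monday_scores))
```

Note that Matplotlib's default whiskers extend to the furthest points within 1.5 times the interquartile range, not necessarily to the minimum and maximum, and points beyond that are drawn as outliers.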

**Figure 5 (VADER Sentiment Categories by Day)**
Description: This figure summarizes the number of posts categorized as Positive, Neutral, or Negative for each day of the week, using VADER sentiment scores.

In [None]:
# Categorize VADER sentiment
def categorize(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

df['vader_category'] = df['vader_compound'].apply(categorize)

pivot = df.groupby('day_of_week')['vader_category'].value_counts().unstack().fillna(0).loc[days_order]
pivot.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='viridis')
plt.title("VADER Sentiment Categories by Day of Week")
plt.xlabel("Day of Week")
plt.ylabel("Number of Posts")
plt.legend(title="Sentiment Category")
plt.grid(True, axis='y')
plt.tight_layout()
plt.savefig('experiments/figures/vader_sentiment_stacked_bar_by_day.png')
plt.show()


This stacked bar chart provides a breakdown of sentiment categories—**Positive**, **Neutral**, and **Negative**—for each day of the week using the VADER sentiment scores. It helps visualize how the emotional tone of Facebook posts fluctuates throughout the week. For example, you can observe whether weekends tend to have more positive posts or if certain weekdays see spikes in negativity. The height of each stacked bar represents the total number of posts for that day, while the colored segments indicate how those posts are distributed across the sentiment categories.


## 5. Figure Usage

These figures are used in the final project presentation and in the report’s results section to support our findings:

- **Figure 1 (Sentiment Trend & Forecast)**: Included in the final presentation and report to discuss how sentiment in the group has evolved and the projected future trend.
  - File: `experiments/figures/sentiment_trend_forecast.png`

- **Figure 2 (VADER vs. BERT Scatter)**: Included in the presentation and report to demonstrate the consistency between two different sentiment analysis techniques. This supports the methodology by showing that both tools yield comparable results for most posts.
  - File: `experiments/figures/sentiment_scatter_vader_vs_bert.png`

- **Figure 3 (Sentiment Distributions)**: Used in the presentation and report to provide an overview of the overall sentiment breakdown. It highlights the proportion of posts that are neutral, positive, or negative in tone.
  - File: `experiments/figures/sentiment_score_distribution.png`

- **Figure 4 (Box Plots of Sentiment by Day of Week)**: Included in the presentation and report to analyze how sentiment varies throughout the week. This figure supports temporal sentiment insights.
  - File: `experiments/figures/sentiment_boxplots_by_day.png`

- **Figure 5 (VADER Sentiment Categories by Day)**: Used in the presentation and report to show trends in emotional tone by weekday. Helps identify patterns such as whether posts become more positive or negative on specific days.
  - File: `experiments/figures/vader_sentiment_stacked_bar_by_day.png`


## 6. Notes

- If real-time scraping is not feasible, ensure `otsego_data_combined.csv` is updated and stored in the `data/` directory before running this notebook.
