# Lesson 4: Sentiment Analysis on Toponym Sentences

## Overview

This lesson will cover two sentiment analysis methods:
- Using the **NLTK** library's VADER sentiment analysis tool.
- Using **Hugging Face's RoBERTa** model for sentiment analysis.

We will compare how these two tools perform on sentences containing toponyms, loaded from the `jmu_reddit_geoparsed_long.pickle` file, and we will store the results in a **Pandas DataFrame** for further analysis. The key goal is to understand how different tools analyze sentiment, identify their limitations, and explore why their outputs might differ.

---

## 1. Loading the Dataset

We will begin by loading the data containing the sentences with toponyms into a dataframe.

In [None]:
import pandas as pd

In [None]:
df_reddit_geoparsed_long = pd.read_pickle('data/jmu_reddit_geoparsed_long.pickle')

## 2. Sentiment Analysis with NLTK (VADER)

### 2.1 Overview
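VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is not trained the way a neural model is: it looks up each word in a hand-rated sentiment lexicon and then applies a few rules for things like negation, intensifiers, punctuation, and capitalization. This makes it relatively speedy, but because it mostly scores text word by word, it can miss context and sarcasm.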



### 2.2 Loading VADER

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

You will only need to download the lexicon once.

In [None]:
# Check if 'vader_lexicon' is installed and download if necessary
try:
    nltk.data.find('sentiment/vader_lexicon.zip')  # Resource path includes the 'sentiment/' folder
    print("✅ VADER lexicon already installed")
except LookupError:
    print("📥 Downloading VADER lexicon...")
    nltk.download('vader_lexicon')
    print("✅ VADER lexicon installed successfully")

In [8]:
# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

### 2.3 Using `sia.polarity_scores()`

The sentiment analyzer works by applying the VADER model to any text passed into the method `sia.polarity_scores()`. It returns a dictionary of scores for that particular phrase: `neg`, `neu`, `pos`, and `compound`.

#### 2.3.1 Good Vibes!

In [9]:
sia.polarity_scores('JMU is the best university!')

{'neg': 0.0, 'neu': 0.471, 'pos': 0.529, 'compound': 0.6696}

#### 2.3.2 Bad Vibes!

In [10]:
sia.polarity_scores('UVA is not the best university!')

{'neg': 0.423, 'neu': 0.577, 'pos': 0.0, 'compound': -0.5661}
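
To make the four keys more concrete, here is a minimal sketch that unpacks the two results shown above. The dictionaries are copied from those outputs; the helper name `describe_scores` is hypothetical and is not part of NLTK. In these two outputs, `neg`, `neu`, and `pos` add up to 1.0, while `compound` is a separate summary score.

```python
# Results copied from the two cells above.
good = {'neg': 0.0, 'neu': 0.471, 'pos': 0.529, 'compound': 0.6696}
bad = {'neg': 0.423, 'neu': 0.577, 'pos': 0.0, 'compound': -0.5661}

def describe_scores(scores):
    """Hypothetical helper: return the proportion total and the largest category."""
    proportions = {key: scores[key] for key in ('neg', 'neu', 'pos')}
    total = round(sum(proportions.values()), 3)
    dominant = max(proportions, key=proportions.get)
    return total, dominant

good_summary = describe_scores(good)
bad_summary = describe_scores(bad)
print(good_summary)
print(bad_summary)
```

Notice that the largest category for the second sentence is `neu` even though its compound score is clearly negative. That is one reason this lesson stores the `compound` value rather than the three proportions.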

### 2.4 Critical Thinking Challenge

For the next activity, you are going to try to push the limits of the analyzer. For each challenge, think of a sentence that will get the scores you want, even if those scores don't make sense.


#### 2.4.1 Most Goodest Vibes

Try to create a sentence with a compound polarity score of 1.0.

In [None]:
sia.polarity_scores('')

#### 2.4.2 Most Baddest Vibes

Try to create a sentence with a compound polarity score of -1.0, but keep it PG-13!

In [None]:
sia.polarity_scores('')

#### 2.4.3 Most Strangest Vibes

Try to create a sentence that gets a clearly positive or negative compound score, but whose actual meaning is the exact opposite of that score.

In [None]:
sia.polarity_scores('')

### 2.5 Run VADER on All Sentences

In [11]:
# Perform sentiment analysis on each sentence and store the compound score in the DataFrame
df_reddit_geoparsed_long['nltk_sentiment'] = df_reddit_geoparsed_long['sentences'].apply(lambda x: sia.polarity_scores(x)['compound'])


See result

In [13]:

df_reddit_geoparsed_long[['sentences','nltk_sentiment']].sample(10, random_state=60)


| | sentences | nltk_sentiment |
|---|---|---|
| 1348 | I’ve been to the Lavender Lounge in SSC (which... | 0.8016 |
| 912 | I received this emergency notification on my p... | 0.0258 |
| 1285 | They'll climb in Harrisonburg, climb in studen... | 0.0 |
| 792 | I don't know...I've always been a little off-p... | 0.0 |
| 1612 | When you walk into Dukes and see Topio's has a... | -0.5423 |
| 345 | Get used to working in Harrison Hall. | 0.0 |
| 1063 | The arrested perpetrator is currently detained... | -0.8402 |
| 1114 | So Greek Row is all frats again? | 0.0 |
| 907 | Devon lane is cordoned off now. | 0.0 |
| 1312 | &nbsp; That PDF also has phone numbers listed ... | 0.4389 |
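
If you would rather read labels than raw numbers, a common convention suggested by VADER's authors treats a compound score of 0.05 or higher as positive, -0.05 or lower as negative, and anything in between as neutral. The sketch below applies that convention to a few scores copied from the table above. The function name `label_compound` and its `threshold` parameter are hypothetical, and the cutoff is a rule of thumb rather than a fixed part of the tool.

```python
def label_compound(compound, threshold=0.05):
    """Hypothetical helper: map a VADER compound score to a coarse label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores copied from rows of the table above.
sample_scores = [0.8016, 0.0258, 0.0, -0.5423, -0.8402, 0.4389]
labels = [label_compound(score) for score in sample_scores]
print(labels)
```

In the notebook, something like `df_reddit_geoparsed_long['nltk_sentiment'].apply(label_compound)` would produce a label for every sentence.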


### 2.6 Evaluate the result

The compound score ranges from -1 to 1. A score near -1 means the passage is very negative, a score near 1 means it is very positive, and a score near 0 means it is neutral or mixed. Read through the passages above and try to figure out why these passages received the sentiments they did.
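
Why is it so hard to reach exactly 1.0 or -1.0? As far as I understand the VADER source code, the compound score is a raw sum of word scores squashed into the -1 to 1 range with the formula below, using a constant of 15. Treat this as an illustrative sketch under that assumption, not as the library's actual code; the function name `normalize` and the sample raw sums are my own.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash a raw valence sum into the range -1 to 1 (assumed VADER-style formula)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Larger raw sums get closer and closer to 1 without a big jump at the end.
for raw_sum in (0, 1, 4, 10, 50):
    print(raw_sum, round(normalize(raw_sum), 4))
```

Because the curve flattens out, each extra emotional word moves the compound score less than the one before it.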

### 2.7 Critical Question

How effective is the VADER analyzer in dealing with sentiments?

## 3. Sentiment Analysis with Hugging Face (RoBERTa)

RoBERTa (Robustly Optimized BERT Pretraining Approach) is a transformer-based language model. The version we use here, `cardiffnlp/twitter-roberta-base-sentiment`, has been fine-tuned for sentiment analysis tasks. We will use Hugging Face's `transformers` library to analyze the sentiment of the toponym-containing sentences. This model is available on a site called [Hugging Face](https://huggingface.co/). Check out the sentiment models [here](https://huggingface.co/models?sort=trending&search=sentiment).

### 3.1 Prepping your system

You will need to install yet more libraries.

Open up a new terminal window in Jupyter and type the following commands:

- `pip install transformers`
- `pip install torch`
- `pip install scipy`
- `pip install tqdm`
  

### 3.2 Load Functions into memory

Getting RoBERTa to code the sentiments is a fairly common procedure. There is a great in-depth video [here](https://www.youtube.com/watch?v=QpzMWQvxXWk). I have adapted and updated the code for newer versions of Python. The only thing you need to do is to load the functions into memory.

Step through the code blocks below. 

In [15]:
try:
    from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
    from tqdm import tqdm
    from scipy.special import softmax
    from typing import Dict, Any
    print("✅ All required libraries loaded successfully!")
except ImportError as e:
    print(f"❌ Missing required library: {e}")
    print("\nTo install missing libraries, run these commands in a terminal:")
    print("pip install transformers")
    print("pip install torch")
    print("pip install scipy")
    print("pip install tqdm")
    print("\nThen restart the kernel and try again.")

✅ All required libraries loaded successfully!


In [17]:
# Initialize RoBERTa with caching and error handling
try:
    # Check if model variables already exist (avoid re-downloading)
    if 'tokenizer' not in globals() or 'model' not in globals():
        print("🔄 Loading RoBERTa model (this may take a moment on first run)...")
        MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
        
        # The model files will be cached locally after first download
        tokenizer = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForSequenceClassification.from_pretrained(MODEL)
        
        print("✅ RoBERTa model loaded successfully!")
        print("💡 Model files are now cached locally for faster future loading")
    else:
        print("✅ RoBERTa model already loaded in memory")
        
except Exception as e:
    print(f"❌ Error loading RoBERTa model: {e}")
    print("This might be due to:")
    print("- No internet connection (needed for first download)")
    print("- Insufficient disk space for model cache")
    print("- Missing transformers library")

✅ RoBERTa model already loaded in memory


In [26]:
# Function to calculate RoBERTa sentiment scores

def polarity_scores_roberta(text: str) -> Dict[str, float]:
    """
    Calculate RoBERTa sentiment scores for a given text.
    
    Args:
    - text: The text to analyze
    
    Returns:
    - A dictionary with sentiment scores for negative, neutral, and positive sentiment
    """
    # Tokenize and truncate to max length (512 tokens)
    encoded_text = tokenizer.encode_plus(
        text, 
        max_length=512, 
        truncation=True, 
        return_tensors='pt'
    )
    
    # Get model output and convert to probabilities
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    
    return {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2],
        'roberta_compound': (scores[2] - scores[0]) * (1 - scores[1])  # (Positive minus negative), scaled down by how neutral the text is
    }
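
To see what the last two steps of that function do, here is a small worked sketch using only the standard library. The `softmax` below is a plain-Python stand-in for `scipy.special.softmax`, the logits are made up for illustration, and the helper name `compound_from_probs` is hypothetical. The probabilities in the second half are copied from a row of the sample output in section 3.3.

```python
import math

def softmax(logits):
    """Turn raw model outputs into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def compound_from_probs(neg, neu, pos):
    """Same formula as roberta_compound: (pos - neg) scaled by (1 - neu)."""
    return (pos - neg) * (1 - neu)

# Made-up logits in the order negative, neutral, positive.
probs = softmax([-1.0, 0.5, 2.0])
print([round(p, 3) for p in probs])

# Probabilities copied from row 83 of the sample output in section 3.3.
row_compound = compound_from_probs(neg=0.165166, neu=0.796279, pos=0.038555)
print(round(row_compound, 6))
```

The more weight the model puts on `neu`, the closer the compound score is pulled toward 0, even when negative clearly outweighs positive.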


In [27]:
# Function to attach sentiment analysis to a specific column in the dataframe
def add_sentiment_to_column(
    df: pd.DataFrame, column_name: str, num_rows: int = None
) -> pd.DataFrame:
    """
    Adds RoBERTa sentiment analysis to a specified column in a dataframe.
    
    Args:
    - df: The dataframe to process
    - column_name: The name of the column containing the text to analyze
    - num_rows: The number of rows to process (default: None, which processes all rows)
    
    Returns:
    - df: A dataframe with added sentiment analysis columns
    """
    # If num_rows is specified, limit the dataframe, otherwise process all rows
    if num_rows:
        df_subset = df.head(num_rows).reset_index(drop=True)
    else:
        df_subset = df.reset_index(drop=True)  # Process all rows and reset the index
    
    # Function to process each row and add sentiment analysis
    def process_row(text: str) -> Dict[str, Any]:
        try:
            return polarity_scores_roberta(text)
        except Exception as e:
            print(f"Error processing text: {text}. Error: {e}")
            return {'roberta_neg': None, 'roberta_neu': None, 'roberta_pos': None, 'roberta_compound': None}
    
    # Apply the RoBERTa sentiment analysis to each row
    tqdm.pandas(desc="Processing Sentiment Analysis")
    sentiment_scores = df_subset[column_name].progress_apply(process_row)
    
    # Convert the resulting list of dictionaries into a DataFrame and concatenate it with the original subset
    sentiment_df = pd.DataFrame(sentiment_scores.tolist())
    df_subset = pd.concat([df_subset, sentiment_df], axis=1)
    
    return df_subset

#### 3.2.1 A Very Simple Explanation

The code blocks above are quite complex, but they essentially do one thing: add columns with sentiment scores to a dataframe that contains sentences. The function `add_sentiment_to_column()` is fairly straightforward and has three possible parameters:

- `df` - The dataframe you want to process (`df_virginia_toponyms_compact` in the example below).
- `column_name` - The name of the column where the sentences are stored.
- `num_rows` - (Optional) Integer value of the number of rows you want to process. Since this is very processor intensive, it makes sense to be able to grab just a sample. Leaving this out will process every row.

```python
df_virginia_toponym_sentiment_sample = add_sentiment_to_column(df_virginia_toponyms_compact, 'cleaned_sentences', num_rows=1000)
```
With that explanation in mind, what does the above line of code do?
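
The core mechanics of the function can be sketched on a tiny dataframe without loading the model. In this sketch, `fake_scores` is a hypothetical stand-in that returns fixed numbers; it is not a sentiment model, and the sentences are made up. What it shows is how one dictionary per row becomes a set of new columns.

```python
import pandas as pd

def fake_scores(text):
    """Hypothetical stand-in for polarity_scores_roberta: NOT a real model."""
    pos = 0.9 if 'love' in text else 0.1
    neg = 0.05
    return {'roberta_neg': neg, 'roberta_neu': 1.0 - pos - neg, 'roberta_pos': pos}

df_demo = pd.DataFrame({'sentences': [
    'I love Harrisonburg.',
    'Parking is next to the hall.',
    'This third row is skipped.',
]})

# Same steps as add_sentiment_to_column with num_rows=2
subset = df_demo.head(2).reset_index(drop=True)    # keep the first 2 rows
scores = subset['sentences'].apply(fake_scores)    # one dictionary per row
score_columns = pd.DataFrame(scores.tolist())      # dictionary keys become columns
result = pd.concat([subset, score_columns], axis=1)
print(result)
```

Resetting the index matters because `pd.concat(..., axis=1)` lines rows up by index label, so both pieces need to be numbered 0, 1, 2, and so on.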


### 3.3 Get a Sample Sentiment Column

In [28]:
df_reddit_sentiment_sample = add_sentiment_to_column(df_reddit_geoparsed_long, 'sentences', num_rows=100)

Processing Sentiment Analysis: 100%|██████████| 100/100 [00:05<00:00, 19.99it/s]


#### Evaluate the Sample

If you could not get the model to work, you can load the result by running this line of code:

```python
df_reddit_sentiment_sample = pd.read_pickle('data/df_reddit_sentiment_sample_stable.pickle')
```

In [31]:
df_reddit_sentiment_sample[['sentences','nltk_sentiment','roberta_neg','roberta_pos','roberta_neu','roberta_compound']].sample(10, random_state=42)

| | sentences | nltk_sentiment | roberta_neg | roberta_pos | roberta_neu | roberta_compound |
|---|---|---|---|---|---|---|
| 83 | So students who are off-campus living in 22802... | 0.0 | 0.165166 | 0.038555 | 0.796279 | -0.025793 |
| 53 | &#x200B; \|My students are split in half (150 i... | 0.0 | 0.164073 | 0.043016 | 0.792911 | -0.02507 |
| 70 | Patriot front is the same group that rented a ... | 0.0 | 0.639833 | 0.011513 | 0.348654 | -0.409254 |
| 45 | My SO's parents' neighborhood in Ellicott City... | 0.0 | 0.019201 | 0.600919 | 0.37988 | 0.360736 |
| 44 | But then you go into Howard County, Maryland (... | 0.0 | 0.017535 | 0.837776 | 0.144688 | 0.701562 |
| 39 | LMAO great minds think alike and idk what nort... | 0.8531 | 0.088797 | 0.399247 | 0.511956 | 0.151513 |
| 22 | But if JMU goes under Harrisonburg won’t recov... | 0.0 | 0.561574 | 0.02193 | 0.416495 | -0.314885 |
| 80 | Just remember this as you grow into adults tha... | 0.2716 | 0.74697 | 0.02081 | 0.23222 | -0.557532 |
| 10 | Just from my observation, a lot of gmu student... | 0.0 | 0.116177 | 0.074508 | 0.809314 | -0.007946 |
| 0 | Mary Ann and I both lived in the Washington me... | 0.0 | 0.002597 | 0.563593 | 0.43381 | 0.31763 |


#### 3.3.1 Critical Question

1. How did the model do?
2. Where would you dispute the sentiment?

#### 3.3.2 Critical Activity

1. Cycle through the samples by changing `random_state=` to a different integer.
2. Look through the sentences
3. Identify a sentence where the language model does particularly well or poorly.



## 4. Creating the entire dataset

Running the model on every row takes much longer than running it on a sample. I will create this dataset for you, but if you ever want to do it on your own, the line of code is below. If the line is commented out in your copy, remove the hashtag (`#`) to uncomment it.

In [34]:

df_reddit_sentiment_full = add_sentiment_to_column(df_reddit_geoparsed_long, 'sentences')

Processing Sentiment Analysis: 100%|██████████| 1785/1785 [01:28<00:00, 20.22it/s]


In [None]:
df_reddit_sentiment_full.sample(10, random_state=42)

| | type | date | score | year_month | sentences | toponyms | place | latitude | longitude | feature_name | nltk_sentiment | roberta_neg | roberta_neu | roberta_pos | roberta_compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 152 | comment | 2020-09-19 19:45:29 | -4 | 2020-09 | 1- not everyone in harrisonburg is a student a... | [harrisonburg, harrisonburg, charleston, VT, b... | Vermont | 44.00034 | -72.74983 | | 0.5423 | 0.769441 | 0.218348 | 0.012212 | -0.59189 |
| 315 | comment | 2022-02-08 02:37:02 | 22 | 2022-02 | one duke passed away by suicide, another suici... | [VT] | Vermont | 44.00034 | -72.74983 | | -0.875 | 0.846192 | 0.148544 | 0.005264 | -0.716013 |
| 629 | comment | 2020-08-04 15:32:28 | 18 | 2020-08 | Have a feeling they'll wait to see what Mason ... | [Mason, VT] | Vermont | 44.00034 | -72.74983 | | 0.128 | 0.088114 | 0.873441 | 0.038445 | -0.006286 |
| 1202 | comment | 2022-06-02 21:51:43 | 5 | 2022-06 | I felt VT was ultimately too far for me person... | [VT] | Vermont | 44.00034 | -72.74983 | | 0.0 | 0.903091 | 0.092616 | 0.004293 | -0.815556 |


### 4.1 Check performance

We can check the performance of both tools by looking up "edge cases" where one gives a negative evaluation and the other a positive one.

#### 4.1.1 Finding Edge Cases
How would we go about this? I have stubbed out some of the code below.

In [37]:
df_sentiment_edge_cases= df_reddit_sentiment_full[(df_reddit_sentiment_full['nltk_sentiment']<0 )& 
                         (df_reddit_sentiment_full['roberta_compound']>0)]

df_sentiment_edge_cases[['sentences','nltk_sentiment','roberta_compound']].sample(10, random_state=42)

| | sentences | nltk_sentiment | roberta_compound |
|---|---|---|---|
| 1464 | my roommate and I lived in Shenandoah that yea... | -0.25 | 0.002466 |
| 1540 | 86 degrees here in LA. | -0.3818 | 0.011005 |
| 358 | Apparently, she also stressed the picture as i... | -0.1531 | 0.030693 |
| 827 | The rest of the major will stay difficult, but... | -0.1901 | 0.3828 |
| 1357 | In any case, I've parked in garages overnight ... | -0.5574 | 0.384445 |
| 1141 | I don’t think D-Hall is bad, i just think ther... | -0.3024 | 0.021729 |
| 448 | For one, ppl from jmu are also from other comm... | -0.4588 | 0.013155 |
| 1193 | It looks really similar to the stadium at jmu ... | -0.3612 | 0.629072 |
| 1086 | I wasn't 100% positive on the name bc its been... | -0.2411 | 0.007748 |
| 831 | Ah, nothing like walking from Festival, past I... | -0.0052 | 0.030209 |
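
The filter above only catches one direction: VADER negative and RoBERTa positive. A minimal sketch of the opposite direction, and of both directions at once, might look like the following. The dataframe `df_demo` and its scores are made up for illustration and are not taken from the Reddit data.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
df_demo = pd.DataFrame({
    'sentences': ['sentence a', 'sentence b', 'sentence c', 'sentence d'],
    'nltk_sentiment': [0.60, -0.40, 0.30, 0.0],
    'roberta_compound': [-0.50, 0.20, 0.25, -0.10],
})

# Opposite direction: VADER says positive, RoBERTa says negative.
vader_pos_roberta_neg = df_demo[
    (df_demo['nltk_sentiment'] > 0) & (df_demo['roberta_compound'] < 0)
]

# Either direction: the two scores have opposite signs, so their product is negative.
sign_disagreement = df_demo[
    df_demo['nltk_sentiment'] * df_demo['roberta_compound'] < 0
]

print(vader_pos_roberta_neg)
print(sign_disagreement)
```

Rows where VADER returns exactly 0.0 are left out of both results, which is usually what you want, since a neutral score is not really a disagreement about direction.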


### 4.2 Visualizing the Correlation

Let's create a scatter plot to see how a post's Reddit score relates to its RoBERTa sentiment:

In [41]:
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import pearsonr
import numpy as np

# Create scatter plot comparing Reddit post scores vs RoBERTa sentiment
fig = px.scatter(
    df_reddit_sentiment_full, 
    x='score', 
    y='roberta_compound',
    title='Relationship Between Reddit Post Score and RoBERTa Sentiment',
    labels={
        'score': 'Reddit Post Score',
        'roberta_compound': 'RoBERTa Compound Sentiment Score'
    },
    hover_data=['sentences'],  # Show sentences on hover
    opacity=0.6
)

# Add horizontal line at y=0 (neutral sentiment)
fig.add_hline(y=0, line_dash="dash", line_color="gray", 
              annotation_text="Neutral Sentiment", annotation_position="top right")

# Add vertical line at x=0 (neutral score)
fig.add_vline(x=0, line_dash="dash", line_color="gray",
              annotation_text="Neutral Score", annotation_position="top right")

# Calculate correlation coefficient
correlation, p_value = pearsonr(df_reddit_sentiment_full['score'], df_reddit_sentiment_full['roberta_compound'])

# Add correlation info to the plot
fig.add_annotation(
    x=0.02, y=0.98,
    xref='paper', yref='paper',
    text=f'Correlation: {correlation:.3f}<br>P-value: {p_value:.3e}',
    showarrow=False,
    font=dict(size=12),
    bgcolor="rgba(255,255,255,0.8)",
    bordercolor="black",
    borderwidth=1
)

fig.update_layout(
    width=800,
    height=600,
    showlegend=True
)

fig.show()

print(f"📊 Correlation Analysis: Reddit Score vs RoBERTa Sentiment")
print(f"Pearson correlation coefficient: {correlation:.4f}")
print(f"P-value: {p_value:.2e}")
print(f"Interpretation: ", end="")
if abs(correlation) > 0.7:
    print("Strong correlation")
elif abs(correlation) > 0.5:
    print("Moderate correlation") 
elif abs(correlation) > 0.3:
    print("Weak correlation")
else:
    print("Very weak or no correlation")



📊 Correlation Analysis: Reddit Score vs RoBERTa Sentiment
Pearson correlation coefficient: -0.0499
P-value: 3.50e-02
Interpretation: Very weak or no correlation
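
The plot above compares Reddit score with RoBERTa sentiment. To compare the two sentiment methods with each other directly, you could correlate the `nltk_sentiment` and `roberta_compound` columns instead. The sketch below does that on a small made-up dataframe using pandas' built-in Pearson correlation; the numbers are hypothetical and only show the mechanics.

```python
import pandas as pd

# Hypothetical paired scores, for illustration only.
df_demo = pd.DataFrame({
    'nltk_sentiment':   [0.80, 0.00, -0.54, -0.84, 0.44, 0.00],
    'roberta_compound': [0.70, -0.03, -0.41, -0.56, 0.15, 0.32],
})

# Series.corr() computes the Pearson correlation by default.
method_correlation = df_demo['nltk_sentiment'].corr(df_demo['roberta_compound'])
print(round(method_correlation, 3))
```

In the notebook, the same call on `df_reddit_sentiment_full`, together with a scatter plot that uses `x='nltk_sentiment'` and `y='roberta_compound'`, would show how closely the two tools agree on the real data.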


### 4.3 Most Positive Sentiment by Location

Let's analyze which places are associated with the most positive sentiment scores:

In [45]:
# Group by place and calculate average sentiment scores
city_sentiment_avg = (df_reddit_sentiment_full
                      .groupby('place')['roberta_compound']
                      .agg(['mean', 'count'])
                      .reset_index())

city_sentiment_avg.columns = ['place', 'avg_sentiment', 'post_count']

# Filter cities with at least 3 posts for more reliable averages
city_sentiment_filtered = city_sentiment_avg[city_sentiment_avg['post_count'] >= 3].copy()

# Get top 5 and bottom 5 cities
top_5_cities = city_sentiment_filtered.nlargest(5, 'avg_sentiment')
bottom_5_cities = city_sentiment_filtered.nsmallest(5, 'avg_sentiment')

# Combine top and bottom cities
top_bottom_cities = pd.concat([top_5_cities, bottom_5_cities])
top_bottom_cities['category'] = ['Top 5'] * len(top_5_cities) + ['Bottom 5'] * len(bottom_5_cities)  # Works even if fewer than 5 places pass the filter

# Create interactive plotly chart
import plotly.express as px
import plotly.graph_objects as go

# Create a horizontal bar chart showing top and bottom cities
fig = px.bar(
    top_bottom_cities.sort_values('avg_sentiment'),
    x='avg_sentiment',
    y='place',
    orientation='h',
    color='category',
    title='🌟 Top 5 vs Bottom 5 Cities by Average Sentiment Score',
    labels={
        'avg_sentiment': 'Average RoBERTa Sentiment Score',
        'place': 'City',
        'category': 'Ranking'
    },
    hover_data={
        'post_count': True,
        'avg_sentiment': ':.3f'
    },
    color_discrete_map={
        'Top 5': '#2E8B57',      # Sea Green
        'Bottom 5': '#CD5C5C'    # Indian Red
    },
    height=600
)

# Customize the layout
fig.update_layout(
    xaxis_title="Average RoBERTa Sentiment Score",
    yaxis_title="City",
    font=dict(size=12),
    margin=dict(l=150, r=50, t=80, b=50),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1
    )
)

# Add a vertical line at neutral sentiment (0)
fig.add_vline(x=0, line_dash="dash", line_color="gray", 
              annotation_text="Neutral", annotation_position="top")

# Customize hover template
fig.update_traces(
    hovertemplate="<b>%{y}</b><br>" +
                  "Average Sentiment: %{x:.3f}<br>" +
                  "Number of Posts: %{customdata[0]}<br>" +
                  "<extra></extra>"
)

fig.show()



In [54]:
df_reddit_place_sentiments = df_reddit_sentiment_full.groupby('place').agg(
    location_count=('place', 'size'),  # Count occurrences of each location
    latitude=('latitude', 'first'),    # Take the first latitude 
    longitude=('longitude', 'first'),  # Take the first longitude
    sentences=('sentences', lambda x: list(x)),  # Collect all sentences into an array
    avg_roberta_compound=('roberta_compound', 'mean'),  # Average of roberta_compound
    
    
).reset_index()

df_reddit_place_sentiments.sample(10, random_state=42)


| | place | location_count | latitude | longitude | sentences | avg_roberta_compound |
|---|---|---|---|---|---|---|
| 434 | Stine | 1 | 37.49524 | -114.58889 | [Not an alumni but I found it neat that the St... | 0.687468 |
| 440 | Subway | 1 | 26.07205 | -80.23203 | [Meal plan changes invoke rage I thought the m... | -0.262419 |
| 6 | Airport | 2 | 22.31602 | 113.93663 | ["Airport" was my shit., Airport, that's what ... | -0.336629 |
| 184 | Hall Mountain | 2 | 40.99729 | -77.27803 | [Maury is now Mountain Hall and Ashby is now V... | 0.00055 |
| 78 | Chicago | 3 | 41.85003 | -87.65005 | [Twice as many black people died from black on... | 0.048485 |
| 299 | Mill | 2 | 30.47318 | -83.40015 | [Not saying you get tons of extra amenities, b... | 0.230436 |
| 521 | Ṭikar | 5 | 27.4813 | 83.52467 | [P.S., P.S., P.S., P.S., P.S.] | 0.021212 |
| 484 | Vizslás | 1 | 48.05171 | 19.8199 | [Are we sure it's the guy with vizslas?] | -0.005962 |
| 117 | Duryea | 1 | 41.34397 | -75.73853 | [Babylon, Taste of India (the one near East Ca... | 0.60725 |
| 137 | Engeo | 2 | 53.47231 | 9.13028 | [The top floors of EnGeo and King hall are alw... | 0.063711 |


In [62]:
# Create the map using plotly.express 
fig = px.scatter_map(
    df_reddit_place_sentiments,  #put your dataframe here
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count",        # Bubble size based on location count
    color="avg_roberta_compound",      # Color based on sentiment score
    color_continuous_scale=px.colors.cyclical.IceFire,  # Use IceFire scale (blue to red)
    size_max=20,                  # Maximum size of the bubbles
    labels={
        'place': 'place',
        'avg_roberta_compound': 'RoBERTa Compound Sentiment Score'
    },
    # Show sentences on hover
    
    
    center={"lat": 37.5246322, "lon": -77.5758331},
    zoom=4                        # Adjust zoom level for better visibility
)

# Update the layout to use the OpenStreetMap style (which doesn't need a token)
fig.update_layout(
    map_style="open-street-map",  # scatter_map figures use map_style rather than mapbox_style
    margin={"r":0,"t":0,"l":0,"b":0}  # Remove margins for a cleaner view
)



fig.show()

In [63]:
# Create the map using plotly.express 
fig = px.scatter_mapbox(
    df_reddit_place_sentiments,  # put your dataframe here
    lat="latitude",               # Latitude column
    lon="longitude",              # Longitude column
    size="location_count",        # Bubble size based on location count
    color="avg_roberta_compound", # Color based on sentiment score
    hover_name="place",           # Show place name as the main hover title
    hover_data={
        'location_count': True,   # Show count of posts
        'avg_roberta_compound': ':.3f',  # Show sentiment with 3 decimal places
        'latitude': False,        # Hide latitude in hover
        'longitude': False        # Hide longitude in hover
    },
    color_continuous_scale='RdYlGn',  # Red to Green scale (red=negative, green=positive)
    color_continuous_midpoint=0,      # Center the color scale at neutral (0)
    size_max=50,                      # Maximum size of the bubbles
    title='🗺️ Interactive Sentiment Map<br><sub>Bubble size = post count, Color = sentiment (red=negative, green=positive)</sub>',
    mapbox_style="open-street-map",   # No token needed for this style
    center={"lat": 37.5246322, "lon": -77.5758331},
    zoom=6,                           # Adjust zoom level for better visibility
    height=700
)

# Customize the hover template for better readability
fig.update_traces(
    hovertemplate="<b>%{hovertext}</b><br>" +
                  "Posts: %{customdata[0]}<br>" +
                  "Avg Sentiment: %{customdata[1]}<br>" +
                  "<extra></extra>"  # Removes the trace box
)

# Update the layout
fig.update_layout(
    margin={"r":0,"t":70,"l":0,"b":0},  # Add top margin for title
    coloraxis_colorbar=dict(
        title="Average Sentiment",
        tickvals=[-0.5, -0.25, 0, 0.25, 0.5],
        ticktext=["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
    )
)

fig.show()

# Print some summary info about the locations
print(f"📍 Total locations mapped: {len(df_reddit_place_sentiments)}")
print(f"📊 Total posts represented: {df_reddit_place_sentiments['location_count'].sum()}")
if len(df_reddit_place_sentiments) > 0:
    most_discussed = df_reddit_place_sentiments.loc[df_reddit_place_sentiments['location_count'].idxmax()]
    most_positive = df_reddit_place_sentiments.loc[df_reddit_place_sentiments['avg_roberta_compound'].idxmax()]
    most_negative = df_reddit_place_sentiments.loc[df_reddit_place_sentiments['avg_roberta_compound'].idxmin()]
    
    print(f"🔥 Most discussed: {most_discussed['place']} ({most_discussed['location_count']} posts)")
    print(f"😊 Most positive: {most_positive['place']} (sentiment: {most_positive['avg_roberta_compound']:.3f})")
    print(f"😞 Most negative: {most_negative['place']} (sentiment: {most_negative['avg_roberta_compound']:.3f})")


*scatter_mapbox* is deprecated! Use *scatter_map* instead. Learn more at: https://plotly.com/python/mapbox-to-maplibre/



📍 Total locations mapped: 522
📊 Total posts represented: 1785
🔥 Most discussed: City of Harrisonburg (190 posts)
😊 Most positive: Planetarium (sentiment: 0.983)
😞 Most negative: Hampton Roads (sentiment: -0.839)
