# Lesson 3 Mini-Practice: Exploring JMU Reddit Data

## Overview

In this mini-lesson, you'll practice working with the cleaned JMU Reddit data from Lesson 3. You'll:
- Load the pre-cleaned pickle file
- Modify and experiment with existing visualizations
- Create a new visualization showing keyword trends over time

**Prerequisites:** Complete Lesson 3 first to understand the data cleaning process.

**Time:** 30-45 minutes

## Step 1: Import Libraries and Load Data

First, let's import our required libraries and load the cleaned data from the pickle file we created in Lesson 3.

In [13]:
# Import required libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [14]:
# Load the cleaned JMU Reddit data from pickle file
df = pd.read_pickle('data/jmu_reddit.pickle')

# Quick check of our data
print(f"Data shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()

Data shape: (11375, 6)
Columns: ['type', 'title', 'text', 'date', 'score', 'year_month']


| | type | title | text | date | score | year_month |
|---|---|---|---|---|---|---|
| 0 | post | President Alger leaving to take same job at Am... | President Alger leaving to take same job at Am... | 2024-03-18 12:47:10 | 358 | 2024-03 |
| 1 | comment | President Alger leaving to take same job at Am... | Like him or not, he did help transform this sc... | 2024-03-18 12:49:04 | 82 | 2024-03 |
| 2 | comment | President Alger leaving to take same job at Am... | Massive changes happening at JMU this year. Al... | 2024-03-18 12:50:05 | 34 | 2024-03 |
| 3 | comment | President Alger leaving to take same job at Am... | Rather short tenure as JMU presidents go. He f... | 2024-03-18 13:13:29 | 37 | 2024-03 |
| 4 | comment | President Alger leaving to take same job at Am... | He was very nice and friendly when I spoke to ... | 2024-03-18 12:57:33 | 30 | 2024-03 |
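One likely reason the lesson stores the cleaned data as a pickle rather than a CSV is that a pickle keeps each column's dtype, so `date` comes back as a real datetime column. The sketch below illustrates this with a tiny made-up DataFrame (`toy` and the temporary file name are hypothetical, not part of the lesson data). As a general caution, only load pickle files from sources you trust.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row stand-in for the real Reddit data
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-03-18 12:47:10', '2024-03-18 12:49:04']),
    'score': [358, 82],
})

# Round trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / 'toy.pickle'
    toy.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

# Round trip through CSV text for comparison (no date parsing requested)
csv_text = toy.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

print("pickle dtype:", from_pickle['date'].dtype)  # still a datetime dtype
print("csv dtype:   ", from_csv['date'].dtype)     # plain text until parsed again
```

With the CSV version you would need `pd.to_datetime` (or `parse_dates=`) before `.dt` accessors such as `.dt.to_period` work again.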


## Step 2: Review and Modify Visualization 1 - Posts Over Time

Let's recreate the first visualization from Lesson 3 and then experiment with modifications. 
First, let's try to understand how time is being manipulated here. 
What does the following line do?

```python
df['year_month'] = df['date'].dt.to_period('M')
```

  - How can we change this line to get a different result?
  - What method is being called, and what argument is being passed to it?
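To explore these questions, here is a minimal sketch on two made-up timestamps (the `dates` Series is hypothetical). It calls `.dt.to_period()` with three different frequency strings, `'M'`, `'Q'`, and `'Y'`, and then converts one result back to timestamps with `.dt.to_timestamp()`.

```python
import pandas as pd

# Two hypothetical timestamps standing in for the 'date' column
dates = pd.Series(pd.to_datetime(['2024-03-18 12:47:10', '2023-11-05 09:00:00']))

# The string passed to to_period() sets how coarse each time bucket is
monthly = dates.dt.to_period('M')     # calendar month
quarterly = dates.dt.to_period('Q')   # calendar quarter
yearly = dates.dt.to_period('Y')      # calendar year

print(monthly.astype(str).tolist())
print(quarterly.astype(str).tolist())
print(yearly.astype(str).tolist())

# to_timestamp() turns each period back into a datetime at the period's start
month_starts = monthly.dt.to_timestamp()
print(month_starts.tolist())
```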

In [15]:
# Recreate the monthly posting activity visualization
# Extract year-month from the date column for monthly analysis
df['year_month'] = df['date'].dt.to_period('M')
df.head()

| | type | title | text | date | score | year_month |
|---|---|---|---|---|---|---|
| 0 | post | President Alger leaving to take same job at Am... | President Alger leaving to take same job at Am... | 2024-03-18 12:47:10 | 358 | 2024-03 |
| 1 | comment | President Alger leaving to take same job at Am... | Like him or not, he did help transform this sc... | 2024-03-18 12:49:04 | 82 | 2024-03 |
| 2 | comment | President Alger leaving to take same job at Am... | Massive changes happening at JMU this year. Al... | 2024-03-18 12:50:05 | 34 | 2024-03 |
| 3 | comment | President Alger leaving to take same job at Am... | Rather short tenure as JMU presidents go. He f... | 2024-03-18 13:13:29 | 37 | 2024-03 |
| 4 | comment | President Alger leaving to take same job at Am... | He was very nice and friendly when I spoke to ... | 2024-03-18 12:57:33 | 30 | 2024-03 |


### Critical Question 1:
What do we expect to happen to the count of the comments as we change the interval at which we are measuring it?
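One way to build intuition for this question is to count the same rows at two different intervals. The sketch below uses five made-up dates (the `toy` DataFrame is hypothetical). The total number of rows does not change, but a coarser interval gives fewer buckets with larger counts in each.

```python
import pandas as pd

# Five hypothetical rows spread over two calendar years
toy = pd.DataFrame({
    'date': pd.to_datetime([
        '2023-11-05', '2023-12-24', '2024-01-02', '2024-01-15', '2024-03-18',
    ])
})

# Count rows per month and per year
per_month = toy.groupby(toy['date'].dt.to_period('M')).size()
per_year = toy.groupby(toy['date'].dt.to_period('Y')).size()

print(per_month)
print(per_year)

# Same total either way; only the bucketing differs
print(per_month.sum(), per_year.sum())
```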

In [16]:

# Group the data by year-month and post type, then count entries in each group
monthly_counts = df.groupby(['year_month', 'type'], observed=True).size().reset_index(name='count')

# Convert year_month back to datetime format so Plotly can understand it
monthly_counts['year_month'] = monthly_counts['year_month'].dt.to_timestamp()

monthly_counts.head()


| | year_month | type | count |
|---|---|---|---|
| 0 | 2011-09-01 | comment | 19 |
| 1 | 2011-09-01 | post | 3 |
| 2 | 2011-10-01 | comment | 32 |
| 3 | 2011-10-01 | post | 5 |
| 4 | 2011-11-01 | comment | 15 |


### Critical Question 2:
If we change the interval in the count, how will the chart below look different?

In [17]:

# Create the original line chart
fig = px.line(monthly_counts, 
              x='year_month', 
              y='count', 
              color='type',
              title='Original: Posts and Comments per Month on r/JMU',
              labels={'count': 'Number of Posts/Comments', 'year_month': 'Year-Month'},
              markers=True)

fig.show()

### 🎯 Practice Task 1: Playing with Plotly

Plotly is a powerful visualization library that lets you build interactive graphics in relatively few lines of code. The basic pattern is always the same. You create a Plotly figure by calling a chart function:

```python

figure = px.chart_type(dataframe,
                        x='column_representing_x_data',
                        y='column_representing_y_data', 
                        color='column_indicating_different_data_groups')

```
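**Note:** If you do not supply a title or labels, Plotly falls back to defaults: the chart has no title, and the axis and legend labels use the column names. Every key=value pair (e.g. `color='type'`) must be separated from the next one by a comma; a comma after the last pair is optional.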

In [65]:
# YOUR TURN: Modify the visualization here
# Try changing px.line to px.bar or px.area
# Experiment with labels and titles

# Example modification (you can change this):
fig_modified = px.line(monthly_counts, 
                      x='year_month', 
                      y='count', 
                      color='type',
                      title='YOUR TITLE HERE',
                      labels={'count': 'Your Y-Label', 'year_month': 'Your X-Label'})

fig_modified.show()

### Customizing your chart: Markers

Like most major libraries, Plotly has extensive documentation. 
You can use this documentation to figure out most things. Let's see if we can add markers to our chart. The example is here: [add markers](https://plotly.com/python/line-charts/#line-charts-with-markers)


In [19]:
fig_modified = px.line(
    monthly_counts,
    x="year_month",
    y="count",
    color="type",
    title="YOUR TITLE HERE",
    labels={"count": "Your Y-Label", "year_month": "Your X-Label"},
    #modify code to add markers (see link above)
)

fig_modified.show()

### Customizing your chart: Colors

A colorway is a standard sequence of colors. By default, Plotly uses its own, but there are many more built in.

You can see all the color swatches with the commands:

```python
fig = px.colors.qualitative.swatches()
fig.show()
```
Run the code block below and see what happens.

In [66]:
fig = px.colors.qualitative.swatches()
fig.show()

**Note:** These are all *qualitative* color swatches, which work well for categorical variables such as `type`.
To use a swatch in a chart, pass it as an argument:  
`color_discrete_sequence=px.colors.qualitative.G10`

The value after the last dot is the swatch. In this case `G10`.

Modify the chart below to add your own color swatch.


In [None]:
fig_modified = px.line(
    monthly_counts,
    x="year_month",
    y="count",
    color="type",
    title="YOUR TITLE HERE",
    labels={"count": "Your Y-Label", "year_month": "Your X-Label"},
    #Add color swatch here 
)

fig_modified.show()

## Step 3: Review and Modify Visualization 2 - Keyword Scores

Let's recreate the keyword analysis and tweak the layout to simplify reading the chart.

In [69]:
# Recreate the keyword analysis from Lesson 3
keywords = ['tuition', 'covid', 'party', 'football', 'class', 'library', 'campus']

# Helper function to check if text contains a keyword
def contains_keyword(text, keyword):
    if pd.isna(text):
        return False
    return keyword.lower() in text.lower()

# Calculate average scores for each keyword
keyword_scores = []
for keyword in keywords:
    has_keyword_mask = df['text'].apply(lambda x: contains_keyword(x, keyword))
    avg_score = df[has_keyword_mask]['score'].mean()
    keyword_scores.append({'keyword': keyword, 'score': avg_score})

keyword_df = pd.DataFrame(keyword_scores)

# Original bar chart - unsorted
fig = px.bar(keyword_df, 
             x='keyword', 
             y='score',
             title='Original: Average Post Scores by Keyword')
fig.show()
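The keyword loop above can also be written with pandas' vectorized string methods instead of `apply` and a helper function. The following is a sketch on made-up data (the `toy` DataFrame and its values are hypothetical), not the lesson's own code. `str.contains` with `case=False` ignores capitalization, `na=False` treats missing text as "no match", and `regex=False` does a plain substring search.

```python
import pandas as pd

# Hypothetical stand-in for the Reddit data, including one missing text value
toy = pd.DataFrame({
    'text': ['Tuition went up again', 'Football game tonight', None, 'tuition and football'],
    'score': [10, 4, 7, 2],
})

keywords = ['tuition', 'football']

rows = []
for keyword in keywords:
    # Boolean mask: True where the text contains the keyword
    mask = toy['text'].str.contains(keyword, case=False, na=False, regex=False)
    rows.append({'keyword': keyword, 'score': toy.loc[mask, 'score'].mean()})

toy_keyword_df = pd.DataFrame(rows)
print(toy_keyword_df)
```

Either way, this is substring matching, so a keyword like `'class'` would also match words such as "classic" or "classes".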

#### Critical Question

What are some issues with the layout of this chart?

### 🎯 Practice Task 2: Sort the Data

The chart above shows keywords in their original order, but it's hard to compare scores. Let's improve this visualization using Plotly's `update_layout()` method.

`update_layout()` allows you to modify the layout and styling of a figure after it has been created. As per usual, there are a ton of options, but we'll focus on a common annoyance: category order and category names. 

#### Category Order
In the chart above, the bars are in no particular order, so it is hard to spot the highest value. Plotly can sort the categories automatically when you set the `categoryorder` key. Four useful values are:

- `'total ascending'` - Sort by y-values (low to high)
- `'total descending'` - Sort by y-values (high to low)  
- `'category ascending'` - Sort alphabetically A-Z
- `'category descending'` - Sort alphabetically Z-A
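A different route to the same result, not the one this lesson uses, is to sort the DataFrame in pandas before plotting, since Plotly Express generally draws categories in the order they appear in the data. The sketch below shows the two sort orders on made-up scores (`toy_scores` is hypothetical).

```python
import pandas as pd

# Hypothetical keyword scores
toy_scores = pd.DataFrame({
    'keyword': ['tuition', 'covid', 'party'],
    'score': [4.0, 9.5, 2.5],
})

# Comparable to 'total ascending': order rows by value, low to high
by_score = toy_scores.sort_values('score', ascending=True)

# Comparable to 'category ascending': order rows by name, A-Z
by_name = toy_scores.sort_values('keyword', ascending=True)

print(by_score['keyword'].tolist())
print(by_name['keyword'].tolist())
```

Sorting in pandas changes the data itself, while `categoryorder` only changes how the axis is drawn, which is why the lesson's approach is handy for quick experiments.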

In [68]:
# Step 1: Create the chart
fig_sorted = px.bar(keyword_df, 
                    x='keyword', 
                    y='score',
                    title='Plotly Sorting: Keywords by Score Value',
                    labels={'keyword': 'Keywords', 'score': 'Average Score'})

# Step 2: Sort the x-axis by the y-values (scores) 
fig_sorted.update_layout(
    xaxis={'categoryorder': 'total ascending'}  # This sorts by the y-values!
)

fig_sorted.show()

### 🤔 Critical Question: Category Order

Imagine you want to order the chart alphabetically in ascending order (A-Z). How would you change the code below?

In [71]:
# Step 1: Create the chart
fig_alphabetical = px.bar(
    keyword_df,
    x='keyword',
    y='score',
    title='Alphabetically Sorted Keywords',
    labels={'keyword': 'Keywords', 'score': 'Average Score'},
)

# Step 2: Sort the x-axis alphabetically
fig_alphabetical.update_layout(
    xaxis={'categoryorder': 'total descending'}
)

fig_alphabetical.show()

### Text Case
Another annoyance is that the category labels are all lowercase, which looks sloppy. Plotly can change this: update the `xaxis` layout and add the key `tickfont_textcase`, which tells Plotly how to display the case of the tick labels. The options are:

`'normal' | 'word caps' | 'upper' | 'lower'`

How would we modify the code below if we wanted to capitalize the first letter of each keyword?
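If `tickfont_textcase` turns out not to be available in your environment (an assumption worth checking against your installed Plotly version), one fallback is to change the text itself in pandas before plotting. This sketch uses a made-up table (`toy_scores` is hypothetical) and the pandas string methods `str.title()` and `str.upper()`.

```python
import pandas as pd

# Hypothetical keyword scores with lowercase labels
toy_scores = pd.DataFrame({
    'keyword': ['tuition', 'covid', 'party'],
    'score': [4.0, 9.5, 2.5],
})

# Work on a copy so the original lowercase keywords stay available for matching
display_df = toy_scores.copy()
display_df['keyword_title'] = display_df['keyword'].str.title()  # first letter capitalized
display_df['keyword_upper'] = display_df['keyword'].str.upper()  # all capitals

print(display_df)
```

You would then pass `x='keyword_title'` (or `x='keyword_upper'`) to `px.bar` instead of `x='keyword'`.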

In [None]:
# Step 1: Create the chart
fig_textcase = px.bar(
    keyword_df,
    x='keyword',
    y='score',
    title='Keywords with Proper Text Formatting',
    labels={'keyword': 'Keywords', 'score': 'Average Score'},
)

# Step 2: Sort the x-axis by the y-values (scores)
fig_textcase.update_layout(
    xaxis={'categoryorder': 'total ascending',
           'tickfont_textcase': 'normal'}  # Replace 'normal' with the text case option you want!
)

fig_textcase.show()

### Templates

It is also possible to change the entire look of the chart in one fell swoop with templates. A template is a set of style settings that you can apply to an entire chart.

You set the template as you create the chart. The options are:

- 'ggplot2'
- 'seaborn'
- 'simple_white'
- 'plotly'
- 'plotly_white'
- 'plotly_dark'
- 'presentation'
- 'xgridoff'
- 'ygridoff'
- 'gridon'
- 'none'

Cycle through the templates below to see if there is one you like.

In [72]:
# Step 1: Create the chart with template
fig_template = px.bar(
    keyword_df,
    x='keyword',
    y='score',
    title='Keywords with Custom Template',
    labels={'keyword': 'Keywords', 'score': 'Average Score'},
    template='plotly'  # Try different templates: 'ggplot2', 'seaborn', 'plotly_dark', etc.
)

# Step 2: Sort the x-axis by the y-values (scores)
fig_template.update_layout(
    xaxis={'categoryorder': 'total ascending',
           'tickfont_textcase': 'word caps'}
)

fig_template.show()

## Step 4: Create New Visualization - Keywords Over Time

Now let's create something completely new: tracking how different keywords trend over time!

### 🎯 Practice Task 3: Build Keywords Over Time Analysis

This is more challenging! We'll provide the scaffolding, but you need to fill in the missing pieces.

In [None]:
# Step 1: Choose keywords to track over time
# Feel free to modify this list!
time_keywords = ['covid', 'party', 'football', 'class']

# Step 2: Create a function to count keyword mentions by month
def count_keyword_by_month(df, keyword):
    """
    Count how many rows (posts and comments) contain a keyword each month
    """
    # Create mask for posts containing the keyword
    has_keyword = df['text'].apply(lambda x: contains_keyword(x, keyword))
    
    # Filter dataframe to only posts with the keyword
    keyword_posts = df[has_keyword].copy()
    
    # Group by year_month and count
    monthly_keyword_counts = keyword_posts.groupby('year_month', observed=True).size().reset_index(name='count')
    
    # Add keyword column for identification
    monthly_keyword_counts['keyword'] = keyword
    
    return monthly_keyword_counts

In [84]:
# Step 3: Collect data for all keywords
all_keyword_trends = []

# YOUR TURN: Complete this loop
for keyword in time_keywords:
    keyword_data = count_keyword_by_month(df, keyword)
    all_keyword_trends.append(keyword_data)

# Combine all data into one DataFrame
keyword_trends_df = pd.concat(all_keyword_trends, ignore_index=True)

# Convert year_month back to datetime for plotting
keyword_trends_df['year_month'] = keyword_trends_df['year_month'].dt.to_timestamp()

# Check our data
keyword_trends_df.head(10)

| | year_month | count | keyword |
|---|---|---|---|
| 0 | 2020-03-01 | 3 | covid |
| 1 | 2020-04-01 | 2 | covid |
| 2 | 2020-06-01 | 1 | covid |
| 3 | 2020-07-01 | 12 | covid |
| 4 | 2020-08-01 | 90 | covid |
| 5 | 2020-09-01 | 66 | covid |
| 6 | 2020-10-01 | 9 | covid |
| 7 | 2020-11-01 | 2 | covid |
| 8 | 2020-12-01 | 6 | covid |
| 9 | 2021-01-01 | 2 | covid |
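Notice that the output above has no row for 2020-05: months with zero matches simply do not appear after `groupby`, so a line chart will draw a straight segment across the gap. If you would rather plot an explicit zero, one option is to reindex against a complete range of months. This is a sketch on made-up counts (`counts` is hypothetical), applied while `year_month` is still a monthly period, that is, before the `to_timestamp()` conversion.

```python
import pandas as pd

# Hypothetical monthly counts with May 2020 missing
counts = pd.DataFrame({
    'year_month': pd.PeriodIndex(['2020-03', '2020-04', '2020-06'], freq='M'),
    'count': [3, 2, 1],
})

# Every month from the first to the last observed month
full_range = pd.period_range(
    counts['year_month'].min(),
    counts['year_month'].max(),
    freq='M',
)

# Reindex onto the full range, filling missing months with 0
filled = (
    counts.set_index('year_month')['count']
    .reindex(full_range, fill_value=0)
    .rename_axis('year_month')
    .reset_index()
)

print(filled)
```

Whether to fill gaps is a judgment call: zeros make quiet months visible, but they also add many flat points for rare keywords.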


In [85]:
# Step 4: Create the keywords over time visualization
# YOUR TURN: Complete this visualization

fig_trends = px.line(keyword_trends_df, 
                     x='year_month', 
                     y='count', 
                     color='keyword',
                     title='Keyword Trends Over Time on r/JMU',
                     labels={'count': 'Number of Posts', 'year_month': 'Year-Month'},
                     markers=True,
                     template='plotly_dark')  # Try different templates here!

# Customize the layout
fig_trends.update_layout(
    xaxis_title="Time",
    yaxis_title="Posts Containing Keyword",
    legend_title="Keywords"
)

fig_trends.show()

### Critical Question

How can we modify the function so that we show fewer keywords?