---
title: "30538 Final Project: Reproducible Research - Volunteerism, Engagement, and Polarization in the U.S."
author: "Andrew White, Charles Huang, Justine Silverstein" 
date: "December 7, 2024"
format: pdf
execute:
  echo: false
  warning: false
---

In [None]:
# General
import pandas as pd
import numpy as np

# Visualization
import altair as alt

# shiny framework
from shiny import App, ui, render

# for spatial data
from us import states

# 1. Background

This project began as a shared interest in trends behind volunteering rates in America, as two of our members (Justine and Charles) are AmeriCorps alumni.

For the past few years, concerns about the American public's increasing isolation, declining civic engagement and faith in institutions, and growing political polarization have been prominent in the news and media. Our personal experiences with AmeriCorps and volunteering have taught us that volunteering can be effective at reducing isolation, increasing civic engagement and community awareness, and decreasing negative polarization towards "the other side". However, is volunteering a legitimate part of a public policy solution to these issues, or is it just a red herring?

Our research questions were:

1. What is the current state of volunteerism, political engagement, and polarization in America?
2. What factors make people more likely to volunteer or be civically engaged?


# 2. Data Importing/Cleaning

Our datasets for this project were:

1. AmeriCorps CEV (Civic Engagement and Volunteering Supplement) for 2021
2. U.S. Census Bureau Volunteering and Civic Life (VCL) Supplement - September 2021
3. ANES (American National Election Studies) Time Series Data, 2020

Datasets 1 and 2 primarily contain respondent information about volunteering and measures of civic engagement, while dataset 3 contains information on political affiliation and polarization.

We are importing the data from the AmeriCorps and ANES websites. Because the datasets are over 100 MB, we include a Google Drive link here:

https://drive.google.com/drive/folders/1PUTN2pyh78MLoK0RVtGnf1ZwiM1BAAuV?usp=sharing


In [None]:
#Google Drive link for data: https://drive.google.com/drive/folders/1PUTN2pyh78MLoK0RVtGnf1ZwiM1BAAuV?usp=sharing


cev_2021_raw = pd.read_csv("/Users/charleshuang/Documents/GitHub/student30538/problem_sets/final_project/data/2021_CEV__Current_Population_Survey_Civic_Engagement_and_Volunteering_Supplement_20241031.csv", encoding = 'utf-8')

vcl_supplement_raw = pd.read_csv("/Users/charleshuang/Documents/GitHub/student30538/problem_sets/final_project/data/sep21pub.csv")

polarization = pd.read_csv("/Users/charleshuang/Documents/GitHub/student30538/problem_sets/final_project/data/anes_timeseries_2020_csv_20220210.csv")

As there are over 400 variables in the CEV and VCL data, we focused on the 20 most relevant variables in the following categories:

1. Frequency and Type of Volunteering
2. Political Engagement (did respondents discuss politics, did they write to elected officials, boycott products, etc.)
3. Civic/Community Participation (did respondents belong to groups/associations, interact with neighbors, etc.)
4. Basic demographics (age, gender, race, income, education, etc.)

Because the CEV and VCL data use similar variable names (by design), we were able to merge the two datasets together after cleaning the column names. Since the data values in the CEV/VCL data are entered in numeric codes (-1, 1, 2, etc.), we also created mapping functions with dictionaries to convert the data as needed.

(One notable issue we encountered when cleaning the CEV/VCL data was a heterogeneous mix of numeric codes and qualitative input. We wrote an additional helper function that identifies all values in the data that are not covered by our data dictionaries; this function is located in our config.py file.)
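
The mapping approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual config.py code: the dictionary `example_dict`, the helper name `find_unmapped_values`, and the sample responses are all hypothetical, chosen only to show how values that a dictionary does not cover can be surfaced before mapping.

```python
import pandas as pd

# Hypothetical code-to-label dictionary; the real dictionaries live in config.py
example_dict = {"-1": "Not in universe", "1": "Yes", "2": "No"}


def find_unmapped_values(series, mapping):
    """Return the sorted unique non-null values in `series` that `mapping` lacks."""
    values = series.dropna().unique()
    return sorted(v for v in values if v not in mapping)


# Illustrative column mixing numeric codes (stored as strings) with free-text input
responses = pd.Series(["1", "2", "-1", "Yes", "2", "No Answer"])

# Values the dictionary would silently turn into missing data
unmapped = find_unmapped_values(responses, example_dict)

# Series.map leaves any value that is not a dictionary key as NaN
labels = responses.map(example_dict)
```

Checking for unmapped values first matters because `Series.map` does not raise an error for unknown keys; they simply become missing values.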

# 2b. Data cleaning - ANES data

As with the CEV/VCL data, our goal was to subset the data so that it only contains relevant variables. We accomplished this by making two lists: one designed to capture variables covering geographic information (V201011, V201013a, V201013b, V201014a, V201014b), and one designed to capture variables covering assessments of political positioning (i.e., left, right, center).

Similarly, we also used a mapping function to change numerical codes to qualitative labels in two relevant questions:

V201200 - "Where would you place yourself on this [extremely liberal to extremely conservative] scale, or haven’t you thought much about this?"

V201228 - "Generally speaking, do you usually think of yourself as [a Democrat, a Republican / a Republican, a Democrat], an independent, or what?"

# 3. Custom Variables

We devised two custom measures of political engagement and polarization derived from the survey results:

I. Political Engagement Score 

We chose five of the most relevant questions from the CEV/VCL data and weighted each by the level of effort it requires:

1. “How frequently do you talk to a family member/neighbor about politics?” (15%)
2. “How frequently do you post political views on social media?” (15%)
3. “How frequently do you consume political news/media?” (10%)
4. “Did you contact an elected official to express your opinion in the last 12 months?” (30%) 
5. “Did you boycott a company based on their values in the last 12 months?” (30%)

This generated a score from 0 to 100 that we could use as an (imperfect) proxy for political engagement. We added this score to our dataset as a new variable.
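
A weighted score of this kind could be computed along the following lines. This is only a sketch under stated assumptions: the weights come from the list above, but the column names, the answer labels, the 0 to 1 scaling of each answer, and the function name `engagement_score_sketch` are hypothetical and may differ from the `add_engagement_score` function in config.py.

```python
import pandas as pd

# Weights from the list above, expressed as points out of 100
WEIGHTS = {
    "talk_politics": 15,
    "post_social_media": 15,
    "news_consumption": 10,
    "contacted_official": 30,
    "boycott": 30,
}

# Assumed scaling: the share of a question's weight that each answer earns
FREQUENCY_SCALE = {"Never": 0.0, "Sometimes": 0.5, "Often": 1.0}
YES_NO_SCALE = {"No": 0.0, "Yes": 1.0}
YES_NO_COLUMNS = ("contacted_official", "boycott")


def engagement_score_sketch(df):
    """Weighted sum of scaled answers; a row with any unscored answer gets NaN."""
    score = pd.Series(0.0, index=df.index)
    for column, weight in WEIGHTS.items():
        scale = YES_NO_SCALE if column in YES_NO_COLUMNS else FREQUENCY_SCALE
        score = score + weight * df[column].map(scale)
    return score


# Illustrative respondents only
example = pd.DataFrame({
    "talk_politics": ["Often", "Never", "Sometimes"],
    "post_social_media": ["Often", "Sometimes", None],
    "news_consumption": ["Often", "Often", "Never"],
    "contacted_official": ["Yes", "No", "No"],
    "boycott": ["Yes", "Yes", "No"],
})
scores = engagement_score_sketch(example)
```

Because missing answers propagate as NaN, respondents who skipped a question drop out of the score rather than being counted as unengaged.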

II. Polarization Score

In a paper on quantifying polarization by Aaron Bramson et al. (https://inferenceproject.yale.edu/sites/default/files/688938.pdf), the authors examine a range of polarization indicators. A relatively simple (and in some ways problematic) measurement is called spread, or dispersion. Bramson et al. explain: "Polarization...can be measured as the value of the agent with the highest belief value minus the value of the agent with the lowest belief value."

We (imperfectly) approximate this using two more variables: V201206 and V201207. These ask respondents to position political parties on the political spectrum. We selected the most ideologically distant groups on the personal ideology scale (extremely liberal and extremely conservative) and captured how far apart their conceptions of each party are, on average.

We placed the ideological positions on a numeric scale: "Extremely Liberal", "Liberal", and "Slightly Liberal" are -3, -2, and -1; "Moderate; middle of the road" is 0; and "Slightly Conservative", "Conservative", and "Extremely Conservative" are 1, 2, and 3.

For example, if the average extremely liberal respondent in Texas places Democrats at -1 (slightly liberal) and the average for the extremely conservative respondents is 3 (extremely conservative), then the distance between the two is 4, meaning Texas would have a spread of 4 for this question.
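
The Texas example can be reproduced with a short calculation. The sketch below uses made-up respondents and hypothetical column names (`state`, `self_ideology`, `dem_position`) and a hypothetical helper `spread_by_state`; it illustrates the idea of spread as described here, not the exact code used for our plots.

```python
import pandas as pd

# Illustrative data only: each row is a respondent's own ideology and the
# position (-3 to 3) they assign to the Democratic Party
example = pd.DataFrame({
    "state": ["TX", "TX", "TX", "TX", "IL", "IL"],
    "self_ideology": [
        "Extremely Liberal", "Extremely Liberal",
        "Extremely Conservative", "Extremely Conservative",
        "Extremely Liberal", "Extremely Conservative",
    ],
    "dem_position": [-1, -1, 3, 3, -2, 1],
})


def spread_by_state(df, position_col):
    """Gap between the two extreme groups' average placement, per state."""
    extreme_groups = ["Extremely Liberal", "Extremely Conservative"]
    extremes = df[df["self_ideology"].isin(extreme_groups)]
    # One row per state, one column per extreme group
    means = (
        extremes.groupby(["state", "self_ideology"])[position_col]
        .mean()
        .unstack("self_ideology")
    )
    return (means["Extremely Conservative"] - means["Extremely Liberal"]).abs()


spread = spread_by_state(example, "dem_position")
```

A state with no respondents in one of the two extreme groups would get a missing spread, which is one practical weakness of this measure in small state samples.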


# 4. Static Plots and Outcomes

# Caveats

Before discussing the data, we acknowledge that we cannot discuss patterns over time, as we only have data from 2021 for the AmeriCorps and U.S. Census data and from 2020 for the ANES data. However, the data are still useful: even a snapshot in time is informative, especially one taken right after the height of the COVID pandemic and a highly contentious election. Furthermore (as noted in the Shiny app), many entries were missing from the datasets due to nonresponse. For example, over 80% of respondents did not answer a majority of our political engagement questions, forcing us to exclude them from the analysis. While we believe our analyses still provide useful information, this potential selection bias should be taken into account.

# 4a. Exploratory Analysis - Volunteering

When initially working with the data, we generated multiple plots with different variables to look for noteworthy trends between volunteering frequency and civic indicators such as contacting public officials, boycotts, etc. (These charts are omitted here for space, but the output is included in our code.)

As part of our exploratory analysis, we also ran a groupby on state-level data to see if there was any correlation between the number of volunteers per state and the average hours volunteered; however, there did not seem to be any correlation between the two.
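
A state-level check of this kind might look like the sketch below. The data and the column names (`state`, `volunteered`, `hours`) are invented for illustration, so the resulting correlation says nothing about the real survey.

```python
import pandas as pd

# Illustrative respondents only; hours are missing for non-volunteers
example = pd.DataFrame({
    "state": ["IL", "IL", "TX", "TX", "TX", "NY"],
    "volunteered": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "hours": [10, None, 4, 8, None, 20],
})

# Keep volunteers, then summarize each state in one row
volunteers = example[example["volunteered"] == "Yes"]
state_summary = volunteers.groupby("state").agg(
    n_volunteers=("hours", "size"),
    avg_hours=("hours", "mean"),
).reset_index()

# Pearson correlation between volunteer counts and average hours across states
correlation = state_summary["n_volunteers"].corr(state_summary["avg_hours"])
```

Raw volunteer counts largely track how many people were surveyed in each state, so a volunteer rate may be a fairer comparison than a count.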

We want to highlight two charts out of the ones we produced. The first chart depicts the top and bottom five states by average hours spent volunteering:

![Top 5 and Bottom 5 US States by average hours spent volunteering](top_bottom_5_states_avg_volunteering.png)

The second chart depicts volunteering frequency when measured against voters and nonvoters:

![Volunteering Frequency for Voters and Non-Voters](if_volunteered_last_year.png)

An interesting point from our exploratory analysis is that in each volunteering category, the majority of people did vote in their local election. However, it is unclear if this is simply correlation, as AmeriCorps may disproportionately attract the kind of person already primed to vote in their local election and be politically engaged. We will engage this question of correlation further with our Shiny app results. 


# 4b. Polarization Analysis

With the ANES data, we focused more specifically on measuring political polarization, to try to examine our hypothesis that volunteering would correlate negatively with polarization. 

As mentioned above, we adapted a simple "spread" measure from Bramson et al. to measure polarization. We then graphed each state’s spread alongside its average political engagement score and average volunteer hours. As an example, here is a graph of the spread of views on the Democratic Party's ideological position, including volunteering hours:

![Spread of views on Democrat Party](dem_graph_spread_2.png)

These graphs do not appear to show a meaningful correlation between extreme views and volunteering in either the Democratic or Republican analyses. Although we may have selected some states with very high or very low spread, high spread appears in both higher-volunteer and lower-volunteer states, so there does not seem to be much of a relationship.

In the future, causal techniques such as regression analysis with controls for potential bias from variables such as income and education, or a difference-in-differences approach examining specific states over time, could help measure the relationship between polarization and political engagement more precisely.


# 5. Dynamic Plots in Shiny - Demographics vs. Volunteering Rate/Political Engagement

Building off of our exploratory analysis, we wanted to more easily see the correlations between volunteerism, political engagement, and potential confounders like income and education. We made a Shiny dashboard app using the CEV/VCL data that lets us see demographics (age, education level, income, US state, etc.) on the X-axis and the user's choice of volunteer rates or political engagement on the Y-axis. 

A screenshot of the app is here:
![Shiny dashboard screenshot](shiny_screenshot.png)

As an example, we can see here that volunteering is positively correlated with educational attainment, with over 25% of PhD/professional/master’s degree holders having volunteered in 2021.
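
The aggregation behind a chart like this can be sketched as a grouped mean. The example below uses invented respondents and assumes the cleaned `Volunteered_Past_Year` column holds "Yes"/"No" labels; the helper `volunteer_rate_by` is hypothetical and is not taken from the app's code.

```python
import pandas as pd

# Illustrative respondents only
example = pd.DataFrame({
    "Education_Level": [
        "High school", "High school",
        "Bachelor's", "Bachelor's",
        "Master's or higher", "Master's or higher",
    ],
    "Volunteered_Past_Year": ["No", "No", "Yes", "No", "Yes", "Yes"],
})


def volunteer_rate_by(df, group_col):
    """Percent of respondents in each group who volunteered in the past year."""
    volunteered = df["Volunteered_Past_Year"].eq("Yes")
    # The mean of a True/False column is the share of True values
    return volunteered.groupby(df[group_col]).mean().mul(100)


rates = volunteer_rate_by(example, "Education_Level")
```

Passing a different demographic column as `group_col` gives the same rate for any X-axis choice, which is roughly what a dropdown in a dashboard would switch between.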

Using this app, we were able to find the most common traits associated with volunteering and civic engagement for 2021: family income and education, marital status, age, and being of White or American Indian/Alaska Native heritage.

We also found that women and rural inhabitants were slightly more likely to volunteer than men and urban inhabitants, while civic engagement was about the same across these groups. Additionally (but not surprisingly), posting views on social media was a negative predictor of volunteering, but not of civic engagement.

This analysis reveals something critical about our hypothesis: while volunteering can still be a solution to low civic engagement, we can’t dismiss the possibility that both are simply correlated with overarching demographic factors like income, education, and race. This makes sense, as well-off people tend to have more resources and time to volunteer than others.


# 6. Conclusion/Takeaways

As mentioned before, our data and analysis have several caveats that we cannot fully account for. Nevertheless, our takeaways are as follows:

1. Volunteers are more likely to vote than non-volunteers, but this is more likely a correlation driven by demographic factors than a causal effect.
2. We did not find a meaningful correlation between volunteering and polarized attitudes; the evidence that volunteering in particular reduces polarization or increases engagement seems weak.
3. We found that predictors of volunteering and civic engagement are concentrated in disproportionately well-off, privileged populations.

For organizations like AmeriCorps that want to attract younger or more diverse volunteers and improve civic participation and engagement in general, this has important implications: simple appeals to volunteer more, or attempts to diversify volunteer populations, may not mean much without additional incentives that address the structural barriers to volunteering.

While volunteering can still be a solution to low civic engagement, and we can speak personally to its interpersonal benefits, we can’t dismiss the possibility that both are simply correlated with overarching demographic factors like income, education, and race. This should give us pause about the theory that volunteering is, in and of itself, a neat solution to reducing polarization and improving civic engagement. Future studies could examine specific populations of low-income or other "non-typical" volunteers who reported positive outcomes; perhaps volunteering itself has certain aspects (like community and a sense of purpose) that remain valuable regardless of socioeconomic status.

# 7. Coding Analysis (Not Shown in Writeup)


In [None]:
vcl_supplement_raw.columns = vcl_supplement_raw.columns.str.lower()

cev_2021_raw = cev_2021_raw.astype(str)
vcl_supplement_raw = vcl_supplement_raw.astype(str)

#This finds all the variables in common between the two for a merge
common_keys = list(set(vcl_supplement_raw.columns).intersection(set(cev_2021_raw.columns)))

cev_all_2021 = pd.merge(cev_2021_raw, vcl_supplement_raw, on=common_keys, how="outer")

cev_all_2021.head(5)

#Note: Each row is a person- multiple people in the same household can have same household ID, so should not filter by unique

# We can use a config.py file to keep this readable. Full details are in the config.py file

In [None]:
from config import selected_variables, rename_mapping

# selected_variables = ['hrhhid', 'hrhhid2', etc.]
# This chooses the variables we want from the merged raw data

# rename_mapping = "hrhhid": "Household_ID", "hrhhid2": "Household_ID_2", etc.
# Renaming the variables for clarity


cev_all_2021_filter = cev_all_2021[selected_variables].copy()
#copy avoids potentially modifying original dataframe

cev_all_2021_filter.rename(columns=rename_mapping, inplace=True)

#Debugging: removing duplicate column(s)
cev_all_2021_filter = cev_all_2021_filter.loc[:, ~cev_all_2021_filter.columns.duplicated()]

In [None]:
# Attribution: Asked ChatGPT "why isn't config.py updating when I add dictionaries to it; ChatGPT suggested using importlib reload"
# Import and reload config
import importlib
import config
importlib.reload(config)
from config import *
from us import states  # same import as the first cell; avoids aliasing the whole package as "states"

# Replace FIPS codes with states

fips_to_state = {int(state.fips): state.abbr for state in states.STATES}

cev_all_2021_filter['US State'] = pd.to_numeric(
    cev_all_2021_filter['US State'], errors='coerce')

cev_all_2021_filter['US State'] = cev_all_2021_filter['US State'].map(
    fips_to_state)

# Map numeric survey codes to readable labels using the dictionaries in config.py.
# Each entry pairs a renamed column with its dictionary; the trailing comment
# gives the original survey variable name.
code_maps = {
    'Volunteered_Past_Year': pes16_dict,                  # pes16
    'Volunteering_Frequency': pes16d_dict,                # pes16d
    'Hours_Spent_Volunteering': pts16e_dict,              # pts16e
    'Discussed_Issues_With_Friends_Family': pes2_dict,    # pes2
    'Discussed_Issues_With_Neighbors': pes5_dict,         # pes5
    'Contacted_Public_Official': pes13_dict,              # pes13
    'Boycott_Based_On_Values': pes14_dict,                # pes14
    'Belonged_To_Groups': pes15_dict,                     # pes15
    'Community_Improvement_Activities': pes7_dict,        # pes7
    'Voted_In_Local_Election': pes11_dict,                # pes11
    'Posted_Views_On_Social_Media': pes9_dict,            # pes9
    'Frequency_Of_News_Consumption': pes10_dict,          # pes10
    'Age': prtage_dict,                                   # prtage
    'Gender': pesex_dict,                                 # pesex
    'Race_Ethnicity': ptdtrace_dict,                      # ptdtrace
    'Marital_Status': pemaritl_dict,                      # pemaritl
    'Household_Size': hrnumhou_dict,                      # hrnumhou
    'Family_Income_Level': hefaminc_dict,                 # hefaminc
    'Education_Level': peeduca_dict,                      # peeduca
    'Urban_Rural_Status': gtmetsta_dict,                  # gtmetsta
}

# Apply each dictionary to its column
for column, mapping in code_maps.items():
    cev_all_2021_filter[column] = cev_all_2021_filter[column].map(mapping)

In [None]:
##Note: The below debugging code was suggested by ChatGPT to deal with some column value errors caused by political_engagement_score mutation. We weren't sure how to fix the errors without taking out the debugging code.

# Debugging code - Before engagement score calculation
print("Columns before adding engagement score:")
print(cev_all_2021_filter.columns.tolist())
print("\nShape before:", cev_all_2021_filter.shape)


# Mutate the engagement score variable we came up with in config.py
cev_all_2021_filter = add_engagement_score(cev_all_2021_filter)

# More debugging code
print("\nColumns after adding engagement score:")
print(cev_all_2021_filter.columns.tolist())
print("\nShape after:", cev_all_2021_filter.shape)

# Before saving
print("\nVerifying engagement score columns exist:")
print("'political_engagement_score' exists:",
      'political_engagement_score' in cev_all_2021_filter.columns)
print("'engagement_level' exists:",
      'engagement_level' in cev_all_2021_filter.columns)

In [None]:
# confirm this code exports the data into the 'data' folder into the repo
from pathlib import Path

# Lastly, export the cleaned data as a csv for plotting purposes
shiny_data_path = Path("shiny-app/basic-app/data")

# Create directory if it doesn't exist
shiny_data_path.mkdir(parents=True, exist_ok=True)

# Save dataset
cev_all_2021_filter.to_csv(
    shiny_data_path / "cev_2021_cleaned.csv", index=False)

print(f"Dataset saved to: {shiny_data_path / 'cev_2021_cleaned.csv'}")

In [None]:
'''As with the CEV/VCL data, our goal was to subset the data so that it only contains relevant variables. We accomplish this by making two lists:

List 1: This list is designed to capture variables covering geographic information (V201011, V201013a, V201013b, V201014a, V201014b)

List 2: This list is designed to capture variables covering information about assessments of political positioning (i.e. left, right, center)

'''

# List 1: geographic variables
subset_list_1 = [column for column in polarization if "V2010" in column]

# List 2: political positioning variables
subset_list_2 = [column for column in polarization if "V2012" in column]

# Combine the two lists
subset_list = subset_list_1 + subset_list_2

# Subset the ANES data to the selected columns
sub_polarization = polarization.filter(items=subset_list)

# Add a column that just equals 1 to use for tracking the number of entries
sub_polarization["Observations"] = 1

In [None]:
#Make variables more clear

'''Analyzing Question V201200, which is a question asking:

"Where would you place yourself on this scale, or haven’t you
 thought much about this?
 Value Labels:
-9. Refused
-8. Don’t know
1. Extremely Liberal 
2. Liberal 
3. Slightly Liberal 
4. Moderate; middle of the road 
5. Slightly Conservative 
6. Conservative 
7. Extremely Conservative 
99. Haven’t thought much about this"
'''

crosswalk_polar = pd.DataFrame({
    "Self_Rating_(V201200)":[-9, -8, 1, 2, 3, 4, 5, 6, 7, 99],
    "Ideology_(V201200)": [
        "Refused", "Don't Know", "Extremely Liberal",
        "Liberal", "Slightly Liberal", 
        "Moderate; middle of the road",
        "Slightly Conservative",
        "Conservative",
        "Extremely Conservative",
        "Haven’t thought much about this"
    ]
    })

#merge using crosswalk
sub_polarization = sub_polarization.merge(crosswalk_polar, left_on = "V201200", right_on = "Self_Rating_(V201200)")

#We use this data to make a dataframe aggregated by state, and then we can show correlation between measure of polarity and the share of respondents in a state who did volunteer work.

In [None]:
#Replace FIPS codes with states

# Note that we previously used two ANES variables as US State variables,  with only "US State 2" being used in analysis, purely because it has more in-universe entries. This appears to be partially due to respondent reactions to different questions, and partially due to information restrictions on the dataset. As such, "US State" is only used for CEV data. 

from us import states

fips_to_state = {int(state.fips): state.abbr for state in states.STATES}

sub_polarization["US State 2"] = pd.to_numeric(sub_polarization['V201014b'], errors='coerce')

sub_polarization['US State 2'] = sub_polarization['US State 2'].map(fips_to_state)


In [None]:
'''
We clean data related to V201228, which asks:

"Generally speaking, do you usually think of yourself as [a Democrat, a Republican / a Republican, a Democrat], an
independent, or what?"

-9. Refused
-8. Don’t know
-4. Technical error
0. No preference {VOL - video/phone only}
1. Democrat
2. Republican
3. Independent
5. Other party {SPECIFY}

'''

crosswalk_party_2 = pd.DataFrame({
    "Party_Numbers_(V201228)": [-9, -8, -4, 0, 1, 2, 3, 5],
    "Party_Affiliation_(V201228)": [
        "Refused", "Don't know", "Technical error", "No Preference", "Democrat",
        "Republican", "Independent", "Other party"
    ]})

sub_polarization = sub_polarization.merge(
    crosswalk_party_2, left_on="V201228", right_on="Party_Numbers_(V201228)")


We will use two measures of polarization from the ANES data; each provides some amount of information that can be interpreted to indicate polarization to a certain extent, though both have their drawbacks.

1. Share of Outliers - we create a series of functions that group respondents by party and count ideological outliers: Democrats with conservative-leaning ideologies, and Republicans with liberal-leaning ones.


In [None]:
def find_share(df, state):
    """
    Processes a subset of survey data for a specific state to identify and count party-affiliated respondents and ideological outliers.

    Returns a subset of the input DataFrame, including new columns for:
        - 'Party_Count': 1 for respondents affiliated with a major party, else 0.
        - 'Outliers': 1 for ideological outliers within each party, else 0.
    """
    # Subset by state; copy so the new columns are not written to a view of df
    sub = df[df["US State 2"] == state].copy()
    party_people = ["Democrat", "Republican"]
    # Labels must match the crosswalk labels exactly (including capitalization)
    conservatives = ["Slightly Conservative", "Conservative",
                     "Extremely Conservative"]
    liberals = ["Extremely Liberal", "Liberal", "Slightly Liberal"]

    party = sub["Party_Affiliation_(V201228)"]
    ideology = sub["Ideology_(V201200)"]

    # Partisans: respondents who identify as Democrat or Republican
    sub["Party_Count"] = party.isin(party_people).astype(int)
    # Outliers: Democrat but conservative, or Republican but liberal.
    # The party label in the crosswalk is "Democrat", not "Democratic".
    sub["Outliers"] = (
        ((party == "Democrat") & ideology.isin(conservatives))
        | ((party == "Republican") & ideology.isin(liberals))
    ).astype(int)
    return sub
        
Z = find_share(sub_polarization, "IL")                     

IL_Group = Z.groupby("Party_Affiliation_(V201228)")[["Outliers", "Party_Count"]].sum().reset_index()

IL_Group["Percent_Outliers"] =  IL_Group["Outliers"] / sum(IL_Group["Party_Count"])


In [None]:
def seek_polar(df, state):
    '''
    Aggregates polarization data for a specific state, calculating the share of ideological outliers within party-affiliated populations.

    '''
    mod_df = find_share(df, state)
    mod_df = mod_df.groupby(
        "Party_Affiliation_(V201228)"
        )[["Outliers", "Party_Count"]].sum().reset_index()

    mod_df["Percent_Outliers"] =  mod_df["Outliers"] / sum(mod_df["Party_Count"])
    return mod_df

#Example with IL
seek_polar(sub_polarization, "IL")


In [None]:
def find_share_state(df, state):
    '''
    Generates a modified subset of survey data for a given state, identifying partisans and ideological outliers.
    '''
    # Subset by state; copy so the new columns are not written to a view of df
    sub = df[df["US State 2"] == state].copy()
    party_people = ["Democrat", "Republican"]
    conservatives = ["Slightly Conservative", "Conservative",
                     "Extremely Conservative"]
    liberals = ["Extremely Liberal", "Liberal", "Slightly Liberal"]

    party = sub["Party_Affiliation_(V201228)"]
    ideology = sub["Ideology_(V201200)"]

    # Partisans: respondents who identify as Democrat or Republican
    sub["Party_Count"] = party.isin(party_people).astype(int)
    # Outliers: Democrat but conservative, or Republican but liberal.
    # The party label in the crosswalk is "Democrat", not "Democratic".
    sub["Outliers"] = (
        ((party == "Democrat") & ideology.isin(conservatives))
        | ((party == "Republican") & ideology.isin(liberals))
    ).astype(int)
    return sub


In [None]:
def find_share_nation(df):
    '''
    Processes survey data at the national level to identify party-affiliated respondents and ideological outliers.

    Note: the new columns are added to the input DataFrame itself.
    '''
    party_people = ["Democrat", "Republican"]
    conservatives = ["Slightly Conservative", "Conservative",
                     "Extremely Conservative"]
    liberals = ["Extremely Liberal", "Liberal", "Slightly Liberal"]

    party = df["Party_Affiliation_(V201228)"]
    ideology = df["Ideology_(V201200)"]

    # Partisans: respondents who identify as Democrat or Republican
    df["Party_Count"] = party.isin(party_people).astype(int)
    # Outliers: Democrat but conservative, or Republican but liberal.
    # The party label in the crosswalk is "Democrat", not "Democratic".
    df["Outliers"] = (
        ((party == "Democrat") & ideology.isin(conservatives))
        | ((party == "Republican") & ideology.isin(liberals))
    ).astype(int)
    return df

In [None]:
def polar_table(df):
    '''
    Generates a summary table of polarization data, grouping by state to calculate the share of ideological outliers.
    '''
    mod_df = find_share_nation(df)
    # Sum each state's outliers and the people who identify with a party
    mod_df = mod_df.groupby("US State 2")[
        ["Outliers", "Party_Count"]
        ].sum().reset_index()
    # Divide by each state's own partisan count; dividing by the national
    # total would only rescale the outlier sums
    mod_df["Percent_Outliers"] = mod_df["Outliers"] / mod_df["Party_Count"]
    return mod_df


In [None]:
polar_by_party = polar_table(sub_polarization)

#graphing percentages
percent_graph_outliers = alt.Chart(polar_by_party).mark_bar().encode(
    alt.X("US State 2", title = "State"),
    alt.Y("Percent_Outliers", title = "Outlier Share"),
    alt.Color("US State 2")
)

#graphing sums
sum_graph_outliers = alt.Chart(polar_by_party).mark_bar().encode(
    alt.X("US State 2", title = "State"),
    alt.Y("Outliers", title = "Outlier Sum"),
    alt.Color("US State 2")
)

In [None]:
'''
In a paper on quantifying polarization written by Aaron Bramson et al. (https://inferenceproject.yale.edu/sites/default/files/688938.pdf), the authors examine a range of polarization indicators. A relatively simple (and in some ways problematic) measurement is called spread, or dispersion: essentially the gap between the most extreme political positions.

In the paper, Bramson et al. explain: "Polarization in the sense of spread can be measured as the value of the agent with the highest belief value minus the value of the agent with the lowest belief value (sometimes called the ‘range’ of the data)."

We (imperfectly) approximate this using two more variables: V201206 and V201207. These ask respondents to position political parties on the political spectrum. We can select the most ideologically distant nodes on the personal ideology scale (extremely liberal and extremely conservative) and capture how far apart their conceptions of each party are, on average, and then disaggregate by state.

We will assign the different ideological positions to different points on a spectrum, namely:
-3, -2, and -1 are "Extremely Liberal", "Liberal", and "Slightly Liberal";
0 is "Moderate; middle of the road";
1, 2, and 3 are "Slightly Conservative", "Conservative", and "Extremely Conservative".

We'll then compare average positions by state. For example, if the average extremely liberal respondent in Texas places Democrats at -1 (slightly liberal) and the average for the extremely conservative respondents is 3 (extremely conservative), then the distance between the two is 4, meaning Texas would have a spread of 4 for this question.
'''
crosswalk_V201206 = pd.DataFrame({
    #the dataset's scale
    "Dem_Num_(V201206)":[-9, -8, 1, 2, 3, 4, 5, 6, 7],
    #the dataset's ideology
    "Dem_Ideology_(V201206)": [
        "Refused", "Don't Know", "Extremely Liberal",
        "Liberal", "Slightly Liberal", 
        "Moderate; middle of the road",
        "Slightly Conservative",
        "Conservative",
        "Extremely Conservative"
    ],
    #new values for comparison
    "Dem_Positioning(V201206)":[0, 0, -3, -2, -1, 0, 1, 2, 3]
    })

#They're exactly the same
crosswalk_V201207 = pd.DataFrame({
    #the dataset's scale
    "Repub_Num_(V201207)":[-9, -8, 1, 2, 3, 4, 5, 6, 7],
    #the dataset's ideology
    "Repub_Ideology_(V201207)": [
        "Refused", "Don't Know", "Extremely Liberal",
        "Liberal", "Slightly Liberal", 
        "Moderate; middle of the road",
        "Slightly Conservative",
        "Conservative",
        "Extremely Conservative"
    ],
    #new values for comparison
    "Repub_Positioning(V201207)":[0, 0, -3, -2, -1, 0, 1, 2, 3]
    })

#crosswalking V201206
sub_polarization = sub_polarization.merge(crosswalk_V201206, left_on = "V201206", right_on = "Dem_Num_(V201206)")

#crosswalking V201207
sub_polarization = sub_polarization.merge(crosswalk_V201207, left_on = "V201207", right_on = "Repub_Num_(V201207)")

#drop Dem_Num and Repub_Num for clarity (we'll be using the reassigned numbers)
sub_polarization = sub_polarization.drop(["Dem_Num_(V201206)", "Repub_Num_(V201207)"], axis = 1)
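
For the substantive codes 1 through 7, the crosswalk amounts to subtracting 4 from the raw value. The sketch below shows an alternative recode with `Series.map` on invented codes. Unlike our crosswalk, which assigns 0 to "Refused" and "Don't Know" and filters them out later, this version leaves non-responses as missing so they can never be averaged in as "Moderate". The series and the `position_map` name are hypothetical.

```python
import pandas as pd

# Hypothetical raw codes: 1-7 is the ideology scale, -9 and -8 are non-responses
raw = pd.Series([1, 4, 7, -9, -8, 2], name="V201206")

# Centre the 7-point scale on 0; codes missing from the dict become NaN
position_map = {code: code - 4 for code in range(1, 8)}
positions = raw.map(position_map)

print(positions.tolist())
print(positions.mean())  # NaN values are skipped by default
```

This is only a design alternative; the merge-based crosswalk above gives the same averages as long as the non-response rows are filtered out before any mean is taken.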

In [None]:
'''We use the Positioning columns to compare the average party position selections between the 2 extremes. '''

#Step 1: First grouping

#filter out all self-identifiers that aren't "Extremely Liberal"
#or Extremely Conservative
ex_polarization = sub_polarization[
    (sub_polarization["Ideology_(V201200)"] == "Extremely Conservative") | 
    (sub_polarization["Ideology_(V201200)"] == "Extremely Liberal")
    ]

#filter out entries of Party Ideology Position that are "Don't Know" and "Refused"
#(the mask is built from ex_polarization itself so that its index matches the rows being filtered)
ex_polarization = ex_polarization[
    (ex_polarization["Dem_Ideology_(V201206)"] != "Don't Know") &
    (ex_polarization["Dem_Ideology_(V201206)"] != "Refused") &
    (ex_polarization["Repub_Ideology_(V201207)"] != "Don't Know") &
    (ex_polarization["Repub_Ideology_(V201207)"] != "Refused") 
    ].copy()


#make column for Liberal (i.e. Left tail) and Conservative (i.e. Right tail) 
Extreme_L_C = [
    "Liberal" if row["Ideology_(V201200)"] == "Extremely Liberal" else "Conservative" 
    for index, row in ex_polarization.iterrows()
    ]

#save to dataframe
ex_polarization["Extreme_Position"] = Extreme_L_C

#groupby Extreme_Position and US State 2 and get average Dem and Repub positioning. Then save this to a df called position_groups
position_groups = ex_polarization.groupby(["US State 2", "Extreme_Position"])[["Dem_Positioning(V201206)", "Repub_Positioning(V201207)"]].mean().reset_index()
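
The filtering and labelling in Step 1 can also be written without `iterrows`. The sketch below reproduces the same logic on a small invented frame using `isin` and `np.where`; the rows and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Invented respondents; only the column names mirror our data
toy = pd.DataFrame({
    "US State 2": ["IL", "IL", "IL", "OR", "OR"],
    "Ideology_(V201200)": [
        "Extremely Liberal", "Extremely Conservative",
        "Moderate; middle of the road",
        "Extremely Liberal", "Extremely Conservative",
    ],
    "Dem_Positioning(V201206)": [-1, -3, 0, -2, -3],
})

# Keep only the two tails of the ideology scale
tails = ["Extremely Liberal", "Extremely Conservative"]
extremes = toy[toy["Ideology_(V201200)"].isin(tails)].copy()

# Vectorised label: "Liberal" for the left tail, "Conservative" for the right
extremes["Extreme_Position"] = np.where(
    extremes["Ideology_(V201200)"] == "Extremely Liberal",
    "Liberal", "Conservative",
)

groups = extremes.groupby(
    ["US State 2", "Extreme_Position"]
)["Dem_Positioning(V201206)"].mean().reset_index()
print(groups)
```

Taking `.copy()` after filtering avoids pandas' chained-assignment warning when the new column is added.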

In [None]:
'''
Step 2: We use the variable position_groups to create a dataframe that has 4 columns:
(1) The position Liberals give Democrats on the spectrum
(2) the position Conservatives give Democrats on the spectrum
(3) the position Liberals give Republicans on the spectrum
(4) the position Conservatives give Republicans on the spectrum
'''

#Attribution: Asked ChatGPT for help fixing my list comprehension and for how to flatten a multi-index column. Source consulted for pivoting: https://www.w3resource.com/pandas/dataframe/dataframe-pivot.php


#pivot on Extreme_Position
pivot_position = position_groups.pivot(
    index = "US State 2", 
    columns = "Extreme_Position", 
    values = ["Dem_Positioning(V201206)", "Repub_Positioning(V201207)"]
    )

#reset index to retrieve US State 2 variable from index
pivot_position = pivot_position.reset_index()

#Flatten multi-index by joining the upper and lower
#levels of the multi-indexed columns together
pivot_position.columns = ['_'.join(col).strip() for col in pivot_position.columns.values]

#rename flattened columns
pivot_position = pivot_position.rename(
        columns = {
            "US State 2_":"US State 2", 
            "Dem_Positioning(V201206)_Conservative":"Dem_Position_C",
            "Dem_Positioning(V201206)_Liberal":"Dem_Position_L",
            "Repub_Positioning(V201207)_Conservative":"Repub_Position_C",
            "Repub_Positioning(V201207)_Liberal":"Repub_Position_L"
        }
    )
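
To make the pivot-and-flatten step easier to follow, here is a minimal sketch on an invented long-format table. The short column names `Dem` and `Repub` are stand-ins for our positioning columns, and the `rstrip("_")` variant is an assumption of ours that removes the trailing underscore left on the index column.

```python
import pandas as pd

# Invented group means in long format (one row per state and tail)
groups = pd.DataFrame({
    "US State 2": ["IL", "IL", "OR", "OR"],
    "Extreme_Position": ["Conservative", "Liberal", "Conservative", "Liberal"],
    "Dem": [-3.0, -1.0, -3.0, -2.0],
    "Repub": [2.0, 3.0, 1.0, 3.0],
})

# Wide format: one row per state, one column per (value, tail) pair
wide = groups.pivot(
    index="US State 2",
    columns="Extreme_Position",
    values=["Dem", "Repub"],
).reset_index()

# Columns are now tuples such as ("Dem", "Conservative") and ("US State 2", "")
wide.columns = ["_".join(col).rstrip("_") for col in wide.columns.values]
print(wide.columns.tolist())
```

With the trailing underscore stripped, the state column keeps its original name and only the positioning columns need renaming.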


In [None]:
'''Step 3: Now, we use those 4 columns to create the spread, meaning the absolute value of the difference between:

(1) The position Liberals give Democrats on the spectrum and the position Conservatives give Democrats on the spectrum

(2) The position Liberals give Republicans on the spectrum and the position Conservatives give Republicans on the spectrum
'''

# Attribution: Used this article: https://www.w3resource.com/pandas/dataframe/dataframe-pivot.php
# Asked ChatGPT how to flatten a multi-index column

# capture spread variables
pivot_position["Spread_Dem"] = abs(
    pivot_position["Dem_Position_C"] -
    pivot_position["Dem_Position_L"]
)

pivot_position["Spread_Repub"] = abs(
    pivot_position["Repub_Position_C"] -
    pivot_position["Repub_Position_L"]
)
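
As a small worked example of the two definitions in play: Bramson et al.'s spread is the range of individual belief values, while our state-level version is the absolute gap between two group means. The numbers below are invented, and the variable names are illustrative only.

```python
# Hypothetical individual belief values on the -3..3 scale
beliefs = [-3, -1, 0, 2, 3]

# Spread in the "range" sense: highest value minus lowest value
spread_range = max(beliefs) - min(beliefs)

# Our approximation: where each tail places the Democratic Party, on average
liberal_placements = [-1, -2, -1]
conservative_placements = [-3, -3, -2]

mean_l = sum(liberal_placements) / len(liberal_placements)
mean_c = sum(conservative_placements) / len(conservative_placements)

# Absolute gap between the two group means
spread_dem = abs(mean_c - mean_l)

print(spread_range, round(spread_dem, 2))
```

Because the second measure averages within each tail first, it can never exceed the range of the underlying placements.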

In [None]:
'''
Step 4: Now, we graph those differences and interpret these spreads as an indicator of polarization. We acknowledge that this is too simple an analysis to account for the full complexity of this kind of measurement, as polarization involves not just distance between extremes, but also clustering around them. We also acknowledge that our political scale assumes a linear ideological spectrum, which isn't always the case.


'''

#Opinion on the position of the Democratic Party
Democratic_Party_Spread_Graph = alt.Chart(pivot_position).mark_bar().encode(
    alt.X("US State 2", title = "State"),
    alt.Y("Spread_Dem", 
    title = "Spread of Views"),
    alt.Color("US State 2")
).properties(title = alt.TitleParams("Spread of Views on Democratic Party Ideological Position"))

#Opinion on the position of the Republican Party
Republican_Party_Spread_Graph = alt.Chart(pivot_position).mark_bar().encode(
    alt.X("US State 2", title = "State"),
    alt.Y("Spread_Repub", 
    title = "Spread of Views"),
    alt.Color("US State 2")
).properties(title = alt.TitleParams("Spread of Views on Republican Party Ideological Position"))

In [None]:
'''
Step 5: We merge the two datasets together on "US State 2". Note that this merged dataset could be misleading, because some of the merged variables only apply to the "extreme" respondents in each state, and so it's not representative of the entire state's respondents' positioning of different parties. We've dropped those variables from the merge to try to account for this.
'''
merged_polarization = sub_polarization.merge(pivot_position, left_on = "US State 2", right_on = "US State 2", how = "outer")

#drop variables that don't make sense when merged
merged_polarization = merged_polarization.drop([
    "Dem_Position_C", "Dem_Position_L",
    "Repub_Position_C", "Repub_Position_L"], 
    axis = 1
    )
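
An outer merge keeps states that appear in only one of the two tables, which leaves missing spread values for states with no "extreme" respondents. The sketch below, on invented frames, uses the `indicator` argument of `DataFrame.merge` to list those states. The frames and the `resp_id` column are hypothetical.

```python
import pandas as pd

# Invented respondent table and state-level spread table
respondents = pd.DataFrame({
    "US State 2": ["IL", "IL", "OR", "WY"],
    "resp_id": [1, 2, 3, 4],
})
spreads = pd.DataFrame({
    "US State 2": ["IL", "OR"],
    "Spread_Dem": [2.0, 1.0],
})

# indicator=True adds a _merge column: "both", "left_only", or "right_only"
merged = respondents.merge(spreads, on="US State 2", how="outer", indicator=True)

# States present in only one table end up with missing values after the merge
unmatched = merged.loc[merged["_merge"] != "both", "US State 2"].unique().tolist()
print(unmatched)
```

Checking this list before plotting makes it clear which states have no spread estimate at all.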

In [None]:
'''Step 6: We merge the AmeriCorps CEV/VCL data with the spread data so far'''

#remove 'Not in Universe', 'No Answer', 'Refused'
#from cev_all_2021_filter

Non_Number_List = [
    'Not in Universe', np.nan, "Do Not Know",
    'No Answer', 'Refused']

cev_all_2021_all_number = cev_all_2021_filter[
    ~cev_all_2021_filter["Hours_Spent_Volunteering"].isin(Non_Number_List)
    ]

#average volunteer hours and political engagement score by state
group_cev = cev_all_2021_all_number.groupby(
    "US State"
    )[["Hours_Spent_Volunteering", 
    "political_engagement_score"]].mean().reset_index()

main_group = group_cev.merge(
    pivot_position, left_on = "US State", right_on = "US State 2"
)
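
State averages only make sense if the hours column is numeric. The sketch below shows one defensive way to get there on invented data: coerce with `pd.to_numeric`, drop what could not be converted, then group. The response labels in the toy frame are assumptions.

```python
import pandas as pd

# Invented survey rows mixing numeric hours and text non-responses
toy = pd.DataFrame({
    "US State": ["IL", "IL", "OR", "OR", "OR"],
    "Hours_Spent_Volunteering": [10, "Refused", 40, "Not in Universe", 20],
})

# Anything that is not a number becomes NaN
hours = pd.to_numeric(toy["Hours_Spent_Volunteering"], errors="coerce")

clean = (
    toy.assign(Hours_Spent_Volunteering=hours)
       .dropna(subset=["Hours_Spent_Volunteering"])
)

state_means = clean.groupby("US State")["Hours_Spent_Volunteering"].mean()
print(state_means)
```

Coercion also catches any non-response label missing from an explicit exclusion list.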

In [None]:
'''
Step 7: We create three functions for viewing political engagement alongside spread: one generates a table comparing 2 states, and the other 2 
return graphs
'''

#return table
def engagement_spread_table(df, state_1, state_2):
        sub = df[
            (df["US State"] == state_1) |
            (df["US State"] == state_2)]
        sub = sub.filter(["US State", "political_engagement_score", "Spread_Dem", "Spread_Repub"])
        return sub

#return graph with Democratic positioning 
def engagement_spread_graph_D(df, column):
        graph = alt.Chart(df).mark_bar().encode(
        alt.X("US State 2", title = "State", sort = "-y"),
        alt.Y("Spread_Dem", title = "Spread of Extreme Views"),
         alt.Color(
            column, 
            scale = alt.Scale(range = ["lightblue", "darkblue"])
            )
        ).properties(
            title = alt.TitleParams(
                "Spread of Views on Democratic Party Ideological Position"
                )
            )
        return graph

#return graph with Republican positioning 
def engagement_spread_graph_R(df, column):    
        graph = alt.Chart(df).mark_bar().encode(
        alt.X("US State 2", title = "State", sort = "-y"),
        alt.Y("Spread_Repub", title = "Spread of Extreme Views"),
        alt.Color(
            column, 
            scale = alt.Scale(range = ["lightblue", "darkblue"]),
            )
        ).properties(
            title = alt.TitleParams(
                "Spread of Views on Republican Party Ideological Position"
                )
            )
        return graph

In [None]:
#Below is an example of using our table function to compare political engagement scores and spreads for two states: 

#sample static table for Illinois and Oregon
Sample_Table = engagement_spread_table(main_group, "IL", "OR")

print(Sample_Table)

In [None]:
#And below are sample static graphs on Democratic and Republican positions:

#using average political engagement
Dem_Graph_Spread_1 = engagement_spread_graph_D(main_group, "political_engagement_score")

#using average volunteer hours
Dem_Graph_Spread_2 = engagement_spread_graph_D(main_group, "Hours_Spent_Volunteering")

Dem_Graph_Spread_1.show()
Dem_Graph_Spread_1.save('dem_graph_spread.png', format='png')

Dem_Graph_Spread_2.show()
Dem_Graph_Spread_2.save('dem_graph_spread_2.png', format='png')

In [None]:
#using average political engagement
Repub_Graph_Spread_1 = engagement_spread_graph_R(main_group, "political_engagement_score")

#using average volunteer hours
Repub_Graph_Spread_2 = engagement_spread_graph_R(main_group, "Hours_Spent_Volunteering")

Repub_Graph_Spread_1.show() 
Repub_Graph_Spread_1.save('repub_graph_spread.png', format='png')


Repub_Graph_Spread_2.show()
Repub_Graph_Spread_2.save('repub_graph_spread_2.png', format='png')

Below is a static graph of ideological position in the US nationally as of 2020:


In [None]:
alt.data_transformers.enable("vegafusion")
#create list for graph to sort on 
sorting_list = [
        "Refused", "Don't Know", "Extremely Liberal",
        "Liberal", "Slightly Liberal", 
        "Moderate; middle of the road",
        "Slightly Conservative",
        "Conservative",
        "Extremely Conservative",
        "Haven’t thought much about this"
    ]

#filter out non-responses
global_non_response_list = [
    "Haven’t thought much about this",
    "Refused", "Don't Know"
    ]

local_filter = sub_polarization[
    ~sub_polarization["Ideology_(V201200)"].isin(global_non_response_list)
]

Ideological_Position_US = alt.Chart(local_filter).mark_bar().encode(
    alt.X(
        "Ideology_(V201200):N", 
        title = "Ideological Position",
        sort = sorting_list),
    alt.Y("Observations", title = "Number of Respondents"),
    alt.Color("Ideology_(V201200)", sort = sorting_list)
).properties(title = alt.TitleParams(
            f"Ideological Position in the U.S."
        ))

Ideological_Position_US.show()

#Note: We removed non-responses, likely creating a bias in the relative size of the remaining groups (it's difficult to predict the direction of said bias, however.)

In [None]:
#Additionally, below is a sample static graph of ideological position by state, using IL as an example:

def ideology_by_state(df, state):
        #filter out non-responses
        non_response_list = [
            "Haven’t thought much about this", 
            "Refused", "Don't Know"
            ]
        sub = df[
            ~df["Ideology_(V201200)"].isin(non_response_list)
            ]
        #filter for the selected state
        sub = sub[sub["US State 2"] == state]
        #create list for graph to sort on 
        sorting_list = [
        "Extremely Liberal",
        "Liberal", "Slightly Liberal", 
        "Moderate; middle of the road",
        "Slightly Conservative",
        "Conservative",
        "Extremely Conservative"
        ]
        graph = alt.Chart(sub).mark_bar().encode(
            alt.X(
                "Ideology_(V201200):N", 
                title = "Ideological Position",
                sort = sorting_list
                ),
            alt.Y("count():Q", title = "Number of Respondents"),
            alt.Color("Ideology_(V201200):N", sort = sorting_list)
        ).properties(title = alt.TitleParams(
            f"Ideological Position in {state}"
        ))
        return graph

#Example for Illinois
Ideological_Position_IL = ideology_by_state(sub_polarization, "IL")

Ideological_Position_IL.show()

In [None]:
#Lastly, we will use the merged AmeriCorps and ANES data to plot a correlation between volunteer hours and political engagement:

Correlation_Graph = alt.Chart(main_group).mark_circle().encode(
    alt.X("Hours_Spent_Volunteering", title="Average Volunteer Hours"),
    alt.Y("political_engagement_score", title="Average Political Engagement Score").scale(
        domain=(
            main_group["political_engagement_score"].min() - 3,
            main_group["political_engagement_score"].max() + 3
        )
    ),
    alt.Color("US State 2")
).properties(title=alt.TitleParams(
    "Correlation between volunteer hours and political engagement"
))

Correlation_Graph.show()
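
The scatterplot shows the relationship visually; a Pearson coefficient puts a number on it. The sketch below computes one with `Series.corr` on invented state averages. The values are made up, so the result says nothing about our actual data.

```python
import pandas as pd

# Invented state-level averages
toy = pd.DataFrame({
    "Hours_Spent_Volunteering": [10.0, 20.0, 30.0, 40.0],
    "political_engagement_score": [2.0, 4.0, 5.0, 9.0],
})

# Pearson correlation (the default method of Series.corr)
r = toy["Hours_Spent_Volunteering"].corr(toy["political_engagement_score"])
print(round(r, 3))
```

With one point per state, any such coefficient describes state averages only and should not be read as a claim about individual respondents.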

In [None]:
# Below is the exploratory data analysis (h/t Justine)

# Basic plot to show Frequency of Volunteering Categories

filtered_data = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know")
]

frequency_counts = filtered_data['Volunteering_Frequency'].value_counts(
).reset_index()
frequency_counts = pd.DataFrame(frequency_counts)

frequency_counts.columns = ['Volunteering_Frequency', 'Frequency']

chart = alt.Chart(frequency_counts).mark_bar().encode(
    y=alt.Y('Volunteering_Frequency:O', title='Volunteering Frequency'),
    x=alt.X('Frequency:Q', title='Frequency Count'),
    color=alt.Color('Frequency:Q', scale=alt.Scale(
        scheme='blues'), title='Frequency Count'),
    tooltip=['Volunteering_Frequency', 'Frequency']
).properties(
    title='Frequency of Volunteering Categories (Excluding "No Answer")',
    width=550,
    height=200
)

chart.show()
chart.save('volunteer_frequency.png', format='png')
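
The same non-response filter is repeated in most of the cells that follow. The sketch below shows one way to factor it into a helper. `drop_non_responses` and `NON_RESPONSES` are hypothetical names, and the label list is an assumption that would need to match the labels actually present in the CEV data.

```python
import pandas as pd

# Assumed non-response labels; adjust to the labels in the real column
NON_RESPONSES = ["Not in Universe", "No Answer", "Refusal", "Do Not Know"]

def drop_non_responses(df, column, labels=NON_RESPONSES):
    """Return the rows of df where column is present and is not a non-response label."""
    keep = df[column].notna() & ~df[column].isin(labels)
    return df[keep].copy()

# Invented example rows
toy = pd.DataFrame({
    "Volunteering_Frequency": [
        "Once a Month", "No Answer", None, "Not at All", "Refusal"
    ]
})

clean = drop_non_responses(toy, "Volunteering_Frequency")
print(clean["Volunteering_Frequency"].tolist())
```

Keeping the labels in one place also makes spelling differences such as "Refusal" versus "Refused" easier to spot.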

In [None]:
# Plot to show Volunteering Frequency based on if one volunteered last year (2020)

volunteered_last_year = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know") &  
    (cev_all_2021_filter['Volunteered_Past_Year'] == "Yes")
]

did_not_volunteer_last_year = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know") &  
    (cev_all_2021_filter['Volunteered_Past_Year'] == "No")
]

volunteered_counts = volunteered_last_year['Volunteering_Frequency'].value_counts().reset_index()
volunteered_counts.columns = ['Volunteering_Frequency', 'Frequency']
volunteered_counts['Volunteered_Past_Year'] = 'Yes'

did_not_volunteer_counts = did_not_volunteer_last_year['Volunteering_Frequency'].value_counts().reset_index()
did_not_volunteer_counts.columns = ['Volunteering_Frequency', 'Frequency']
did_not_volunteer_counts['Volunteered_Past_Year'] = 'No'

combined_counts = pd.concat([volunteered_counts, did_not_volunteer_counts])

chart = alt.Chart(combined_counts).mark_bar().encode(
    y=alt.Y('Volunteering_Frequency:O', title='Volunteering Frequency'),  
    x=alt.X('Frequency:Q', title='Count'),  
    color=alt.Color('Volunteered_Past_Year:N', title='Volunteered Last Year', legend=alt.Legend(title="Volunteered Last Year")),  
    tooltip=['Volunteering_Frequency', 'Frequency', 'Volunteered_Past_Year']
).properties(
    title='Volunteering Frequency for Those Who Volunteered Last Year vs. Those Who Did Not',
    width=600,
    height=200
)

chart.show()
chart.save('if_volunteered_last_year.png', format='png')
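
The two filtered frames and the `pd.concat` above produce a table of counts for every combination of volunteering frequency and past-year status. As a sketch on invented rows, the same table can be built in one step with `groupby(...).size()`.

```python
import pandas as pd

# Invented respondents
toy = pd.DataFrame({
    "Volunteering_Frequency": [
        "Once a Month", "Once a Month", "Not at All", "Not at All", "Not at All"
    ],
    "Volunteered_Past_Year": ["Yes", "Yes", "No", "No", "Yes"],
})

# One row per observed combination, with its count
counts = toy.groupby(
    ["Volunteering_Frequency", "Volunteered_Past_Year"]
).size().reset_index(name="Frequency")
print(counts)
```

Only combinations that actually occur appear in the result, which matches what the concatenated `value_counts` tables contain.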

In [None]:
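# Plot to show Volunteering Frequency based on whether one voted in the local election
# (note: the variable names below are reused from the previous cell)

# Filter data for those who did vote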
volunteered_last_year = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know") &  
    (cev_all_2021_filter['Voted_In_Local_Election'] == "Yes")
]

# Filter data for those who did not vote
did_not_volunteer_last_year = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know") &  
    (cev_all_2021_filter['Voted_In_Local_Election'] == "No")
]

volunteered_counts = volunteered_last_year['Volunteering_Frequency'].value_counts().reset_index()
volunteered_counts.columns = ['Volunteering_Frequency', 'Frequency']
volunteered_counts['Voted_In_Local_Election'] = 'Yes'

did_not_volunteer_counts = did_not_volunteer_last_year['Volunteering_Frequency'].value_counts().reset_index()
did_not_volunteer_counts.columns = ['Volunteering_Frequency', 'Frequency']
did_not_volunteer_counts['Voted_In_Local_Election'] = 'No'

combined_counts = pd.concat([volunteered_counts, did_not_volunteer_counts])

chart = alt.Chart(combined_counts).mark_bar().encode(
    y=alt.Y('Volunteering_Frequency:O', title='Volunteering Frequency'),  
    x=alt.X('Frequency:Q', title='Count'),  
    color=alt.Color('Voted_In_Local_Election:N', title='Voted in Local Election', scale=alt.Scale(domain=['Yes', 'No'], range=['blue', 'red']), legend=alt.Legend(title="Voted")),  
    tooltip=['Volunteering_Frequency', 'Frequency', 'Voted_In_Local_Election']
).properties(
    title='Volunteering Frequency for Those Who Voted vs. Those Who Did Not',
    width=600,
    height=200
)

chart.show()
chart.save('if_voted_in_local_election.png', format='png')
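
Raw counts are hard to compare when the volunteering categories differ a lot in size. The sketch below, on invented rows, uses `pd.crosstab` with `normalize="index"` to get the share of voters within each volunteering category instead.

```python
import pandas as pd

# Invented respondents: 4 monthly volunteers and 6 non-volunteers
toy = pd.DataFrame({
    "Volunteering_Frequency": ["Once a Month"] * 4 + ["Not at All"] * 6,
    "Voted_In_Local_Election": [
        "Yes", "Yes", "Yes", "No",
        "Yes", "Yes", "Yes", "No", "No", "No",
    ],
})

# Each row sums to 1: the share of "Yes" and "No" within that category
shares = pd.crosstab(
    toy["Volunteering_Frequency"],
    toy["Voted_In_Local_Election"],
    normalize="index",
)
print(shares)
```

In this toy data both groups contain three voters, yet the voting share is higher among monthly volunteers, a difference a count chart would hide.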

In [None]:
# What is the correlation or relationship between volunteering frequency, news consumption and voting? 


volunteering_frequencies = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know")
]

if_voted = volunteering_frequencies.loc[volunteering_frequencies['Voted_In_Local_Election'].isin(['Yes', 'No'])]

News_Consumption = if_voted.loc[
    if_voted['Frequency_Of_News_Consumption'].notna() & 
    ~if_voted['Frequency_Of_News_Consumption'].isin(['Basically Every Day', 'Refusal', 'Do Not Know', 'No Answer'])
]

News_Consumption_map = {
    'A Few Times a Week': 'blue',       
    'Not at All': 'green',              
    'A Few Times a Month': 'orange',    
    'Once A Month': 'red',   
    'Less Than Once a Month': 'purple'              
}

News_Consumption['News_Consumption_Color'] = News_Consumption['Frequency_Of_News_Consumption'].map(News_Consumption_map)

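#the color column is fully determined by Frequency_Of_News_Consumption and is not
#used in the chart, so we leave it out of the grouping keys; any label missing from
#the map would have a NaN color and be dropped silently by groupby
counts = News_Consumption.groupby(
    ['Volunteering_Frequency', 'Voted_In_Local_Election', 'Frequency_Of_News_Consumption']
).size().reset_index(name='Frequency')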

chart = alt.Chart(counts).mark_point(filled=True).encode(
    y=alt.Y('Frequency_Of_News_Consumption:N', title='News Consumption Frequency'),  
    x=alt.X('Frequency:Q', title='Count'),  
    shape=alt.Shape('Voted_In_Local_Election:N',  
                    title='Voted in Local Election',
                    scale=alt.Scale(domain=['Yes', 'No'], range=['circle', 'square'])),  
    color=alt.Color('Volunteering_Frequency:N',  
                    title='Volunteering Frequency',
                    scale=alt.Scale(domain=['Not at All', 'A Few Times a Month', 'A Few Times a Week', 'Once a Month', 'Less Than Once a Month'],
                                    range=['blue', 'green', 'orange', 'red', 'purple'])),  
    size=alt.SizeValue(100),
    tooltip=['Volunteering_Frequency', 'Frequency', 'Voted_In_Local_Election', 'Frequency_Of_News_Consumption']  
).properties(
    title='News Consumption vs. Volunteering Frequency with Voting Behavior',
    width=600,
    height=200
)

chart.show()
chart.save('news_consumption_volunteering_correlation.png', format='png')

In [None]:
# Among respondents who consume news basically every day, how does volunteering frequency relate to voting?

volunteering_frequencies = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know")
]

if_voted = volunteering_frequencies.loc[volunteering_frequencies['Voted_In_Local_Election'].isin(['Yes', 'No'])]

News_Consumption_daily = if_voted.loc[
    if_voted['Frequency_Of_News_Consumption'] == 'Basically Every Day'  
]
counts = News_Consumption_daily.groupby(
    ['Volunteering_Frequency', 'Voted_In_Local_Election']
).size().reset_index(name='Frequency')


chart = alt.Chart(counts).mark_bar().encode(
    x=alt.X('Volunteering_Frequency:N', title='Volunteering Frequency', 
            axis=alt.Axis(labelAngle=45)),  
    y=alt.Y('Frequency:Q', title='Count'),  
    color=alt.Color('Voted_In_Local_Election:N',  
                    scale=alt.Scale(domain=['Yes', 'No'], range=['blue', 'orange']),
                    title='Voted in Local Election'),

    tooltip=['Volunteering_Frequency', 'Frequency', 'Voted_In_Local_Election']  
).properties(
    title='Volunteering Frequency vs. Voting Behavior',
    width=300,
    height=300
)

chart.show()
chart.save('volunteering_vs_voting_news_daily.png', format='png')

In [None]:
# Does the consumption of news impact hours spent volunteering?

filtered_data = volunteering_frequencies[
    (volunteering_frequencies['Frequency_Of_News_Consumption'].notna()) & 
    ~volunteering_frequencies['Frequency_Of_News_Consumption'].isin(['Basically Every Day', 'Refusal', 'Do Not Know', 'No Answer']) & 
    (volunteering_frequencies['Hours_Spent_Volunteering'].notna()) & 
    ~volunteering_frequencies['Hours_Spent_Volunteering'].isin(['Not in Universe', 'Do Not Know', 'Refused'])
]
filtered_data = filtered_data[filtered_data['Hours_Spent_Volunteering'] <= 250]

chart = alt.Chart(filtered_data).mark_boxplot(extent='min-max').encode(
    x=alt.X('Frequency_Of_News_Consumption:N', title='News Consumption Frequency', sort=None),  
    y=alt.Y('Hours_Spent_Volunteering:Q', 
            title='Hours Spent Volunteering', 
            scale=alt.Scale(domain=[0, 250])), 
    tooltip=['Frequency_Of_News_Consumption', 'Hours_Spent_Volunteering']  
).properties(
    title='Boxplot of Hours Spent Volunteering by News Consumption Frequency',
    width=200,  
    height=600  
)

chart.show()
chart.save('boxplot_news_consumption_vs_volunteering_hours.png', format='png')
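
A boxplot draws quartiles, and those depend on the 250-hour cap applied above. The sketch below computes the same quartiles numerically on invented data with `groupby(...).quantile`, so the effect of dropping a large value is visible. All numbers are made up.

```python
import pandas as pd

# Invented hours, with one very large value in the second group
toy = pd.DataFrame({
    "Frequency_Of_News_Consumption":
        ["A Few Times a Week"] * 5 + ["Not at All"] * 5,
    "Hours_Spent_Volunteering": [0, 10, 20, 30, 40, 0, 5, 10, 15, 300],
})

# Same cap as in the cell above
capped = toy[toy["Hours_Spent_Volunteering"] <= 250]

# Quartiles per news-consumption group, one column per quantile
summary = capped.groupby(
    "Frequency_Of_News_Consumption"
)["Hours_Spent_Volunteering"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Reporting how many rows the cap removes alongside the plot would make the trimming explicit.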

In [None]:
# preparing some data, to list the average number of hours volunteered by state

filtered_data = volunteering_frequencies[
    (volunteering_frequencies['Hours_Spent_Volunteering'].notna()) & 
    ~volunteering_frequencies['Hours_Spent_Volunteering'].isin(['Not in Universe', 'Do Not Know', 'Refusal'])
]

filtered_data['Hours_Spent_Volunteering'] = pd.to_numeric(filtered_data['Hours_Spent_Volunteering'], errors='coerce')

filtered_data = filtered_data[filtered_data['Hours_Spent_Volunteering'].notna()]

average_hours_per_state = filtered_data.groupby('US State')['Hours_Spent_Volunteering'].mean().reset_index()

average_hours_per_state = average_hours_per_state.sort_values(by='Hours_Spent_Volunteering', ascending=False)

In [None]:
# using the above data to show the top and bottom 5 states when it comes to average hours of volunteering

top_bottom_states = pd.concat([average_hours_per_state.nlargest(5, 'Hours_Spent_Volunteering'), average_hours_per_state.nsmallest(5, 'Hours_Spent_Volunteering')])

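scatter_plot = alt.Chart(top_bottom_states).mark_point(size=100, color='blue').encode(
    x=alt.X('US State:N', title='US State', sort='-y'),  
    y=alt.Y('Hours_Spent_Volunteering:Q', title='Average Hours Spent Volunteering')  
)

# top_bottom_labels was not defined above; as an assumption, we label each
# point with its average hours, rounded to one decimal place
top_bottom_labels = scatter_plot.mark_text(dy=-10).encode(
    text=alt.Text('Hours_Spent_Volunteering:Q', format='.1f')
)
chart = scatter_plot + top_bottom_labels

chart = chart.properties(
    title='Top 5 and Bottom 5 States by Average Hours Spent Volunteering',
    width=400,   
    height=400   # the original value read "00"; 400 is an assumed correction
).configure_axisY(
    labelAngle=0  
).configure_axisX(
    labelAngle=45  
)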
chart.show()
chart.save('top_bottom_5_states_avg_volunteering.png', format='png')
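
The top and bottom selection relies on `nlargest` and `nsmallest`. The sketch below shows the pattern on an invented table of six states, tagging each row with the group it came from. The `Group` column and the values are illustrative assumptions.

```python
import pandas as pd

# Invented state averages
avg = pd.DataFrame({
    "US State": ["UT", "MN", "IL", "FL", "NV", "OR"],
    "Hours": [50.0, 45.0, 30.0, 20.0, 15.0, 35.0],
})

# Two highest and two lowest states, each labelled with its group
top = avg.nlargest(2, "Hours").assign(Group="Top")
bottom = avg.nsmallest(2, "Hours").assign(Group="Bottom")

top_bottom = pd.concat([top, bottom], ignore_index=True)
print(top_bottom)
```

A `Group` column like this could drive a color encoding so the top and bottom states are visually distinct. Note that if the table has fewer rows than the two selections combined, the same state appears in both groups.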

In [None]:
# the following chart is total number of volunteers per state and stacked by volunteer frequency
filtered_data = volunteering_frequencies.dropna(subset=['US State'])

aggregated_data = filtered_data.groupby(['US State', 'Volunteering_Frequency']).size().reset_index(name='Count')

total_counts = aggregated_data.groupby('US State')['Count'].sum().reset_index(name='Total_Count')
sorted_states = total_counts.sort_values(by='Total_Count', ascending=False)

aggregated_data_sorted = pd.merge(aggregated_data, sorted_states[['US State']], on='US State', how='inner')

chart = alt.Chart(aggregated_data_sorted).mark_bar().encode(
    x=alt.X('US State:N', title='State', sort=sorted_states['US State'].tolist()),  
    y=alt.Y('Count:Q', title='Count of Volunteers per State'),  
    color='Volunteering_Frequency:N',  
    tooltip=['US State', 'Volunteering_Frequency', 'Count']  
).properties(
    title='Volunteering Frequency by State (Sorted by Total Count)',
    width=600,
    height=400
)

text = chart.mark_text(
    align='center',
    baseline='middle',
    dy=-5,  
    size=6
).encode(
    text='Count:Q'  
)

final_chart = chart + text
final_chart

In [None]:
final_chart.save('number_volunteers_per_state.png', format='png')

In [None]:
# chart depicting the number of people who boycotted based on their values, by volunteering frequency
#boycott chart
#filter for those who did boycott
volunteered_last_year = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know") & 
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not at All") & 
    (cev_all_2021_filter['Boycott_Based_On_Values'] == "Yes")
]

# Filter data for those who did not boycott
did_not_volunteer_last_year = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know") &
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not at All") &  
    (cev_all_2021_filter['Boycott_Based_On_Values'] == "No")
]

volunteered_counts = volunteered_last_year['Volunteering_Frequency'].value_counts().reset_index()
volunteered_counts.columns = ['Volunteering_Frequency', 'Frequency']
volunteered_counts['Boycott_Based_On_Values'] = 'Yes'

did_not_volunteer_counts = did_not_volunteer_last_year['Volunteering_Frequency'].value_counts().reset_index()
did_not_volunteer_counts.columns = ['Volunteering_Frequency', 'Frequency']
did_not_volunteer_counts['Boycott_Based_On_Values'] = 'No'

combined_counts = pd.concat([volunteered_counts, did_not_volunteer_counts])

chart = alt.Chart(combined_counts).mark_bar().encode(
    y=alt.Y('Volunteering_Frequency:O', title='Volunteering Frequency'),  
    x=alt.X('Frequency:Q', title='Count'),  
    color=alt.Color('Boycott_Based_On_Values:N', title='Boycotted', scale=alt.Scale(domain=['Yes', 'No'], range=['blue', 'purple']), legend=alt.Legend(title="Boycotted")),  
    tooltip=['Volunteering_Frequency', 'Frequency', 'Boycott_Based_On_Values']
).properties(
    title='Volunteering Frequency for Those Who Boycotted based on their Values or not',
    width=600,
    height=400
)

chart.show()
chart.save('boycotted.png', format='png')

# same as above except now if person contacted public official

In [None]:
volunteered_last_year = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know") &  
    (cev_all_2021_filter['Contacted_Public_Official'] == "Yes")
]

# Filter data for those who did not contact a public official
did_not_volunteer_last_year = cev_all_2021_filter[
    (cev_all_2021_filter['Volunteering_Frequency'] != "Not in Universe") &  
    (cev_all_2021_filter['Volunteering_Frequency'].notna()) &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "No Answer") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Refusal") &  
    (cev_all_2021_filter['Volunteering_Frequency'] != "Do Not Know") &  
    (cev_all_2021_filter['Contacted_Public_Official'] == "No")
]

volunteered_counts = volunteered_last_year['Volunteering_Frequency'].value_counts().reset_index()
volunteered_counts.columns = ['Volunteering_Frequency', 'Frequency']
volunteered_counts['Contacted_Public_Official'] = 'Yes'

did_not_volunteer_counts = did_not_volunteer_last_year['Volunteering_Frequency'].value_counts().reset_index()
did_not_volunteer_counts.columns = ['Volunteering_Frequency', 'Frequency']
did_not_volunteer_counts['Contacted_Public_Official'] = 'No'

combined_counts = pd.concat([volunteered_counts, did_not_volunteer_counts])

chart = alt.Chart(combined_counts).mark_bar().encode(
    y=alt.Y('Volunteering_Frequency:O', title='Volunteering Frequency'),  
    x=alt.X('Frequency:Q', title='Count'),  
    color=alt.Color('Contacted_Public_Official:N', title='Contacted Elected Official', scale=alt.Scale(domain=['Yes', 'No'], range=['green', 'brown']), legend=alt.Legend(title="Called")),  
    tooltip=['Volunteering_Frequency', 'Frequency', 'Contacted_Public_Official']
).properties(
    title='Volunteering Frequency for Those Who Called their Public Official',
    width=400,
    height=200
)

chart.show()
chart.save('contacted.png', format='png')