<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'>Problem Set III<br><br>
Tiago Ventura</font></center></h1> 


# Introduction to Problem Set 03

This problem set will focus on data exploration and data wrangling using Pandas. 

## Dataset

For this problem set, we will work with a mix of electoral and survey data. Primarily, you will work with data from the 2020 Cooperative Election Study (CCES).

## CCES

The Cooperative Election Study, or CCES, seeks to study how Americans view and hold their representatives accountable during elections, how they voted and their electoral experiences, and how their behavior and experiences vary with political geography and social context. The CCES is a 50,000+ person national stratified sample survey administered by YouGov. 

The survey consists of two waves in election years. In the pre-election wave, respondents answer two-thirds of the questionnaire. This segment of the survey asks about general political attitudes, various demographic factors, assessment of roll call voting choices, political information, and vote intentions. The pre-election wave is in the field from late September to late October. In the post-election wave, respondents answer the other third of the questionnaire, mostly consisting of items related to the election that just occurred. The post-election wave is administered in November.

## Data Documentation

Information about the CCES project can be found on the project website: https://cces.gov.harvard.edu/

Documentation for the 2020 wave can be downloaded from the following links: 

- Pre-Election Survey: https://dataverse.harvard.edu/file.xhtml?fileId=4462965&version=4.0

- Post-Election Survey: https://dataverse.harvard.edu/file.xhtml?fileId=4462966&version=4.0

- Full Guide: https://dataverse.harvard.edu/file.xhtml?fileId=5793681&version=4.0

The full dataset and the three codebooks are available in the repository. This survey data is saved as `cces_2020.csv` inside of a compressed file. The documentation is inside of the documentation folder. 

## 0. Load packages and imports


In [313]:
## basic functionality
import pandas as pd
import numpy as np
import re
import os
import plotnine


# see all columns
pd.set_option('display.max_columns', None)


## 1. Understand the Dataset (10pts)

Our first step as data scientists is to understand the structure of the data. Do the following: 


- Open `cces_2020.csv` as a pandas data frame. Load only the columns listed in `col_to_open`.
- What is the unit of analysis of this data frame?
- Are these units unique? Or are these units duplicated over the rows?
- How is the panel data (pre and post-election surveys) encoded in the data frame?
- How many variables exist in the data frame?
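
As a hedged illustration of the mechanics (not a solution), the sketch below reads only selected columns and checks whether an identifier repeats across rows. The in-memory CSV and its values are invented; only the column names `caseid`, `gender`, and `birthyr` come from `col_to_open`.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (all values are invented).
toy_csv = io.StringIO(
    "caseid,gender,birthyr,unused\n"
    "1,1,1980,a\n"
    "2,2,1975,b\n"
    "3,2,1990,c\n"
)

# usecols restricts parsing to the columns we ask for.
cols = ["caseid", "gender", "birthyr"]
toy = pd.read_csv(toy_csv, usecols=cols)

# Shape gives (rows, columns); is_unique tells whether any id repeats.
n_rows, n_cols = toy.shape
ids_are_unique = toy["caseid"].is_unique
```

With the real data, the same pattern would take the file path and `col_to_open` in place of the toy objects.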


In [314]:
col_to_open = ['caseid', 
        "gender", "gender_post",
        "birthyr", "birthyr_post",
        'educ', "race", "hispanic",
        "pid3",
        "votereg", "votereg_post", 
        "inputstate", "region", 
        "CC20_330a", "CC20_330b", "CC20_330c",
        "CC20_331a", "CC20_331b", "CC20_331c", "CC20_331d", "CC20_331e", 
         "presvote16post", 
         "CC20_363", 
         "CC20_364b"]

In [None]:
# add your responses. Split them across multiple cells!

## 2. Analyzing voter registration (20pts)

In the survey, there are two self-reported measures for voter registration (`votereg` and `votereg_post`). If you want to better understand what voter registration means, read [here](https://electionlab.mit.edu/research/voter-registration).

- Calculate the share of respondents in the entire sample who were registered to vote in the first wave of the survey.
- Analyze the difference in voter registration across gender, race, and party identification. To do so, you need to calculate the proportion of registered voters in each subgroup. 
- Create a new column indicating which respondents reported having registered to vote only in the second wave of the survey. How many respondents reported having registered to vote only in the second wave?
- Filter the dataset to keep only respondents who report having registered to vote only in the second wave. Call this dataset `late_voters`. Look at the racial composition of late voters, and compare it with the racial composition of the entire sample. What is the largest difference between the two groups?
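
A minimal sketch of the subgroup-proportion idea on invented data. The coding `1 = registered` and `2 = not registered` is an assumption here and should be checked against the codebook.

```python
import pandas as pd

# Toy data; codes are assumed (1 = registered, 2 = not registered).
toy = pd.DataFrame({
    "pid3":    [1, 1, 2, 2, 2, 3],
    "votereg": [1, 1, 1, 2, 1, 2],
})

# A boolean indicator: its mean is the share of True values.
toy["registered"] = toy["votereg"] == 1

overall_share = toy["registered"].mean()
share_by_party = toy.groupby("pid3")["registered"].mean()
```

Note that a missing value compared with `== 1` evaluates to `False`, so missing responses would count as not registered unless they are dropped first.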


In [None]:
# add your responses. Split them across multiple cells!

#### Understanding your results

- Write 1-2 sentences analyzing the results about the demographic composition of registered voters in the United States and of the late-registered voters. Use these results to understand the reasons behind efforts in American politics to make voter registration harder. Which demographic and political groups benefit when registration is harder for voters?


In [None]:
## Add here your response. No code. Write in Markdown

## 3. Visualization of Voter Registration (10pts)

Provide a visual representation (bars would be nice here) of the difference in the proportion of registered voters across three socio-demographic groups: gender, race, and party identification. 

- On the y-axis, show the proportion of registered voters. 
- On the x-axis, show the subgroups. 
- Use subplots to separate each of the three groups.
- Remember to label your graph properly, with a caption and meaningful axis labels.
- In addition to plotting the graph, save it as a figure in your GitHub repo. 

In [None]:
# add your responses. Split them across multiple cells!

## 4. Scaling Policy Preferences (20pts)

When working with surveys, we often look for ways to map multiple survey questions onto a single dimension. This process is called dimensionality reduction. We will use our data wrangling skills to perform this task, and you will learn more sophisticated ways to do this type of task in DS II and DS III.

The idea here is that you can ask multiple questions about the same issue, for example, gun control, and apply a statistical technique to aggregate these answers into a single number. This single number works as a summary of the person's position in this one-dimensional space. 

Let's go step by step. 

**Renaming Variables**

- All variables starting with `CC20_330` represent a different grid question about gun control. Read the documentation of the survey to understand those. 
- Rename all these questions so that `CC20_330` is replaced by `gun_control`.
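
One possible way to rename by prefix, sketched on an invented data frame: build an `{old: new}` dictionary for the matching columns and pass it to `rename`. The toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "caseid": [1, 2],
    "CC20_330a": [1, 2],
    "CC20_330b": [2, 1],
})

# Map each column that starts with the prefix to its new name.
prefix = "CC20_330"
mapping = {
    col: col.replace(prefix, "gun_control")
    for col in toy.columns
    if col.startswith(prefix)
}
toy = toy.rename(columns=mapping)
```

Columns without the prefix, such as `caseid`, are left untouched because they never enter the mapping.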

**Recoding Values**

All these questions share the same format, asking about respondents' support for a certain gun control policy. However, the direction of the questions is sometimes flipped. For example, `CC20_330b` asks about support for banning assault rifles, and `CC20_330c` asks about making it easier for people to obtain a concealed-carry permit. In this sense, an answer of `support` for `CC20_330b` means something very different from `support` for `CC20_330c`. Here is your task: 

- Recode the `CC20_330` variables so that all responses point in the same direction. For example, a respondent who strongly supports gun control will always have the same answer (this could be 1 for all items), and a respondent who does not support any gun control policy will be at the opposite end (this could be 0 for all items). Make sure all items go in the same direction. 
- Write a user-defined function that performs the task above for all other `CC20_330` questions. There are many ways you can write this function. The simplest way is to write a function that rotates the values of a variable given as an input. 
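
A hedged sketch of a "rotating" function for a two-category item. The name `flip_binary` is hypothetical, and the assumption that items are coded 1 and 2 should be verified in the codebook.

```python
import pandas as pd

def flip_binary(series, low=1, high=2):
    """Reverse a two-category item: low becomes high, high becomes low."""
    # (low + high) - x swaps the two codes and leaves NaN as NaN.
    return (low + high) - series

toy = pd.Series([1, 2, 2, 1])
flipped = flip_binary(toy)
```

Applying the function only to the items whose wording runs in the opposite direction puts every item on a common scale.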

**Building an index**

Using the recoded answers (mapping all variables in the same direction), you will build a composite index of support for gun control: 

- Create a new column (this is a nice use of the apply method) summing row-wise all the values of the recoded `CC20_330` items.
- Create a new column with the normalized value of the sum. Normalization here means creating an index that has mean equal to zero and variance equal to one. Call this `gun_control_index`.
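
The two steps can be sketched on invented data as follows. The column names `item_a`, `item_b`, `item_c`, `raw_sum`, and `index_z` are placeholders, and `sum(axis=1)` is used here as an equivalent of a row-wise `apply`.

```python
import pandas as pd

toy = pd.DataFrame({
    "item_a": [1, 0, 1, 0],
    "item_b": [1, 0, 0, 0],
    "item_c": [1, 1, 0, 0],
})
items = ["item_a", "item_b", "item_c"]

# Row-wise sum across the recoded items.
toy["raw_sum"] = toy[items].sum(axis=1)

# Standardize: subtract the mean, divide by the standard deviation.
# ddof=0 (population formula) makes the variance exactly one;
# pandas defaults to ddof=1, which is also a common choice.
centered = toy["raw_sum"] - toy["raw_sum"].mean()
toy["index_z"] = centered / toy["raw_sum"].std(ddof=0)
```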

In [None]:
# add your responses. Split them across multiple cells!

## 5 - Visualize the index (10pts)

Plot the `gun_control_index` in two ways: 

- Provide a bar graph with the average value of the `gun_control_index` by state.
    - To do so, you need to recode the state from numbers to their names. Check the codebook of the survey. The easiest way to complete this task is using `.map()` and a dictionary. 
- Provide a density plot with the values of the `gun_control_index` for all participants. 
- Provide a second density plot splitting the values for Democrats and Republicans. 
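
A small sketch of recoding with `.map()` and a dictionary, then averaging by group. The codes and names in `code_to_name` are hypothetical; the real state codes are in the codebook.

```python
import pandas as pd

# Hypothetical codes and labels, for illustration only.
code_to_name = {1: "State A", 2: "State B"}

toy = pd.DataFrame({
    "inputstate": [1, 2, 2, 3],
    "score": [0.5, -1.0, 1.0, 0.0],
})

# Codes missing from the dictionary become NaN after .map().
toy["state_name"] = toy["inputstate"].map(code_to_name)

# groupby drops NaN keys by default, so the unmapped row is excluded.
mean_by_state = toy.groupby("state_name")["score"].mean()
```

Counting missing values in the mapped column is a quick way to confirm the dictionary covers every code.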

In [None]:
# add your responses. Split them across multiple cells!

#### Understanding your results


- Write 1-2 sentences explaining the differences in `gun_control_index` across states and between voters affiliated with Democrats and Republicans.

In [None]:
## Add here your response. No code. Write in Markdown

## 6. Repeat and Plot (10pts)

Repeat all the steps from question 4 with the following changes: 

**Part A:**

- You should use the columns for immigration (`CC20_331`). 


In [None]:
# add your responses. Split them across multiple cells!

**Part B:**

- Present a box plot with values for the immigration index in one axis, and Democrats and Republicans in the other axis 
- Write two lines explaining your results
    

In [None]:
# add your responses. Split them across multiple cells!

## Extra Points

**Part C (extra 10 pts!!)**

If you complete Part A by writing three functions, one for each step from question 4, you will get an extra 10pts. 

Your code should look something like this:


```
# assume a data set called d

## rename
d = rename_columns(d, [list_of_columns_to_be_renamed])


# recode

d = recode_columns(d, [list_columns_to_be_recoded])

# build index

d = build_index(d, [list_of_columns_to_aggregate_for_index])
```

## 8 - Merging Survey Data with Election Returns (20pts)

Now, we will merge the survey data with the election results, and run some analysis with this augmented dataset. 

To do so, download the data with election results from the [Redistricting Data Hub](https://redistrictingdatahub.org/)

- Link for the data: https://redistrictingdatahub.org/dataset/2020-presidential-democratic-republican-vote-share-on-nationwide-2020-census-blocks/

The data is also available in the GitHub repo.

**Processing Electoral Returns**

- Open the dataset
- What is the unit of analysis?
- Aggregate the election results by state. Your output should have columns with the state and the total number of votes received by each candidate in that state.
- Create three new variables with the vote share for each presidential candidate. To calculate the vote share, you need to divide the sum of votes for each candidate by the total votes received by the three candidates in the state. 
- In which state did Joe Biden receive his largest vote share? And Trump?
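
A minimal sketch of aggregating to the state level and computing shares. The rows and the column names `votes_dem` and `votes_rep` are invented placeholders, not the names used in the Redistricting Data Hub file.

```python
import pandas as pd

# Invented sub-state rows; real column names will differ.
toy = pd.DataFrame({
    "state": ["X", "X", "Y", "Y"],
    "votes_dem": [10, 30, 5, 15],
    "votes_rep": [20, 40, 25, 55],
})
vote_cols = ["votes_dem", "votes_rep"]

# Sum the vote counts within each state.
by_state = toy.groupby("state", as_index=False)[vote_cols].sum()

# Share = candidate votes / total votes across the listed candidates.
total = by_state[vote_cols].sum(axis=1)
by_state["share_dem"] = by_state["votes_dem"] / total
by_state["share_rep"] = by_state["votes_rep"] / total

# State with the largest share for one candidate.
top_dem_state = by_state.loc[by_state["share_dem"].idxmax(), "state"]
```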

**Voting Choices in the Survey Data**

- Using the variable `CC20_364b`, calculate the predicted vote share for president for each state. This is a predicted value since you are using survey responses and not real observational data. 
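
One way to turn individual responses into state-level shares is a row-normalized cross-tabulation, sketched here on invented data. The labels `"Biden"` and `"Trump"` are placeholders; the survey stores numeric codes that need to be looked up in the codebook.

```python
import pandas as pd

# Invented responses; labels stand in for the survey's numeric codes.
toy = pd.DataFrame({
    "state": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "vote": ["Biden", "Trump", "Trump", "Biden",
             "Trump", "Trump", "Biden", "Trump"],
})

# normalize="index" makes each row (state) sum to one.
shares = pd.crosstab(toy["state"], toy["vote"], normalize="index")
```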

**Comparison with electoral returns**

- Merge the data with the predicted vote share by state with the data with electoral returns. To do so, you will need to clean the state variable (one dataset uses the full name, while the other uses numbers).

- Compare the vote share for Donald Trump across the states. On average, how far off are the results based only on the self-reported survey responses? To find out, take the average of the squared differences across the states. 

- In which states are the results furthest off target? Write a few sentences to explain why your predictions from the survey were off. A few issues you can explore here are the fact that you are not using the survey weights, sample size across states, voters under-reporting support for Trump, relying on only one survey, and online vs in-person surveys. Use some of these options to explain what might be off here. 
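
The comparison step can be sketched as a merge followed by a mean of squared differences. All numbers and the column names `share_survey` and `share_actual` are invented for illustration.

```python
import pandas as pd

survey = pd.DataFrame({
    "state": ["X", "Y", "Z"],
    "share_survey": [0.50, 0.30, 0.60],
})
returns = pd.DataFrame({
    "state": ["X", "Y", "Z"],
    "share_actual": [0.40, 0.30, 0.70],
})

# validate raises an error if a state appears more than once on either side.
merged = survey.merge(returns, on="state", how="inner", validate="one_to_one")

# Squared difference per state, then the average across states.
merged["sq_diff"] = (merged["share_survey"] - merged["share_actual"]) ** 2
mse = merged["sq_diff"].mean()

# States ordered from largest to smallest error.
worst_first = merged.sort_values("sq_diff", ascending=False)
```

An inner merge silently drops states that fail to match, so comparing the row count before and after the merge helps catch state names that were not cleaned consistently.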

In [None]:
# add your responses. Split them across multiple cells!