# Python and R

This setup allows you to use *Python* and *R* in the same notebook.

To set up a similar notebook, see the quickstart instructions here:

https://github.com/dmil/jupyter-quickstart

Some thoughts on why I like this setup and how I use it are at the [end](notebook.ipynb#Thoughts) of this notebook.

In [24]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore")  # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning)  # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [25]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

This is a Python notebook, but below is an R cell. The `%%R` at the top of the cell indicates that the code in this cell will be R code.

# Get Census Data

You can use https://censusreporter.org/ to identify which variables you need from the US census and at what geographic level (e.g. county, tract, etc.). The code below uses the `tidycensus` package in R to download particular census demographic variables.

In [26]:
%%R

# My commonly used R packages
# library() stops with an error if a package is missing; require() only warns
library(tidyverse)
library(tidycensus)

# Load Census API key from environment
census_api_key(Sys.getenv("CENSUS_API_KEY"))

To install your API key for use in future sessions, run this function with `install = TRUE`.


In [27]:
%%R

# Get demographic data for NY State Assembly Districts from ACS
# Use correct geography string: "state legislative district (lower chamber)"

ny_assembly <- get_acs(
  geography = "state legislative district (lower chamber)",
  state = "NY",
  variables = c(
    total_pop = "B01003_001",      # Total population
    white = "B03002_003",          # White alone, not Hispanic
    black = "B03002_004",          # Black alone, not Hispanic
    asian = "B03002_006",          # Asian alone, not Hispanic
    hispanic = "B03002_012",       # Hispanic or Latino
    median_income = "B19013_001",  # Median household income
    median_age = "B01002_001"      # Median age
  ),
  year = 2024,
  survey = "acs5",
  geometry = FALSE
)

ny_assembly

# A tibble: 1,050 × 5
   GEOID NAME                                 variable      estimate    moe
   <chr> <chr>                                <chr>            <dbl>  <dbl>
 1 36001 Assembly District 1 (2024); New York median_age        48.4    1.3
 2 36001 Assembly District 1 (2024); New York total_pop     129486    748  
 3 36001 Assembly District 1 (2024); New York white          91182   2898  
 4 36001 Assembly District 1 (2024); New York black           4047   1241  
 5 36001 Assembly District 1 (2024); New York asian           2926    684  
 6 36001 Assembly District 1 (2024); New York hispanic       26714   2808  
 7 36001 Assembly District 1 (2024); New York median_income 127022   5774  
 8 36002 Assembly District 2 (2024); New York median_age        46.3    0.8
 9 36002 Assembly District 2 (2024); New York total_pop     136040   3458  
10 36002 Assembly District 2 (2024); New York white         104026   3513  
# ℹ 1,040 more rows
# ℹ Use `print(n = ...)` to see more rows


Getting data from the 2020-2024 5-year ACS
Using FIPS code '36' for state 'NY'


In [28]:
%%R 

# Pivot wider to have one row per district
ny_assembly_wide <- ny_assembly %>%
  select(GEOID, NAME, variable, estimate) %>%
  pivot_wider(names_from = variable, values_from = estimate)

ny_assembly_wide

# A tibble: 150 × 9
   GEOID NAME     median_age total_pop  white black asian hispanic median_income
   <chr> <chr>         <dbl>     <dbl>  <dbl> <dbl> <dbl>    <dbl>         <dbl>
 1 36001 Assembl…       48.4    129486  91182  4047  2926    26714        127022
 2 36002 Assembl…       46.3    136040 104026  4824  2705    20188        116673
 3 36003 Assembl…       38.6    136388  83197 11065  3719    33542        109846
 4 36004 Assembl…       38.2    127743  78215 10228 11675    20815        126580
 5 36005 Assembl…       38.8    136012  91059  5578  8771    25551        123888
 6 36006 Assembl…       36      137763  21815 17509  5220    89341        120383
 7 36007 Assembl…       43.7    124167  91673  4997  3700    18860        126602
 8 36008 Assembl…       44.3    134194 102151  2745  7927    16369        148354
 9 36009 Assembl…       41.6    130483 100716  4700  2023    18810        152006
10 36010 Assembl…       45.7    129316  83656  8121 11644    20642        169831
# ℹ 140 more rows
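
For comparison, here is a minimal sketch of the same long-to-wide reshape in pandas. It is not part of the original workflow (which does this step in R); the toy rows are copied from the first lines of the tibble printed above, and the names `long_df` and `wide_df` are just illustrative.

```python
import pandas as pd

# Toy long-format data shaped like the tidycensus output above
# (values copied from the first rows of the printed tibble).
long_df = pd.DataFrame(
    {
        "GEOID": ["36001", "36001", "36002", "36002"],
        "NAME": ["Assembly District 1 (2024); New York"] * 2
        + ["Assembly District 2 (2024); New York"] * 2,
        "variable": ["median_age", "total_pop", "median_age", "total_pop"],
        "estimate": [48.4, 129486, 46.3, 136040],
        "moe": [1.3, 748, 0.8, 3458],
    }
)

# Drop the margin of error, then spread `variable` into columns,
# giving one row per (GEOID, NAME) pair.
wide_df = (
    long_df.drop(columns="moe")
    .pivot(index=["GEOID", "NAME"], columns="variable", values="estimate")
    .reset_index()
)
wide_df.columns.name = None  # remove the leftover "variable" axis label

print(wide_df)
```

One difference to be aware of: `pivot` sorts the new columns alphabetically, while `pivot_wider` keeps them in order of first appearance.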

In [29]:
%%R 

# Calculate percentages and round to 2 decimal places
ny_assembly_wide <- ny_assembly_wide %>%
  mutate(
    pct_white = round(white / total_pop * 100, 2),
    pct_black = round(black / total_pop * 100, 2),
    pct_asian = round(asian / total_pop * 100, 2),
    pct_hispanic = round(hispanic / total_pop * 100, 2)
  )

head(ny_assembly_wide, 20)

# A tibble: 20 × 13
   GEOID NAME     median_age total_pop  white black asian hispanic median_income
   <chr> <chr>         <dbl>     <dbl>  <dbl> <dbl> <dbl>    <dbl>         <dbl>
 1 36001 Assembl…       48.4    129486  91182  4047  2926    26714        127022
 2 36002 Assembl…       46.3    136040 104026  4824  2705    20188        116673
 3 36003 Assembl…       38.6    136388  83197 11065  3719    33542        109846
 4 36004 Assembl…       38.2    127743  78215 10228 11675    20815        126580
 5 36005 Assembl…       38.8    136012  91059  5578  8771    25551        123888
 6 36006 Assembl…       36      137763  21815 17509  5220    89341        120383
 7 36007 Assembl…       43.7    124167  91673  4997  3700    18860        126602
 8 36008 Assembl…       44.3    134194 102151  2745  7927    16369        148354
 9 36009 Assembl…       41.6    130483 100716  4700  2023    18810        152006
10 36010 Assembl…       45.7    129316  83656  8121 11644    20642        169831
11 36011
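
The percentage step can also be sketched in pandas. This is only an illustration using the first two districts from the table above. The `pct_other` column is an extra I am assuming for the example (it is not in the notebook); it shows that the four shares need not sum to 100, since table B03002 contains more categories than the four requested here.

```python
import pandas as pd

# Counts for the first two districts, copied from the wide table above.
df = pd.DataFrame(
    {
        "total_pop": [129486, 136040],
        "white": [91182, 104026],
        "black": [4047, 4824],
        "asian": [2926, 2705],
        "hispanic": [26714, 20188],
    }
)

groups = ["white", "black", "asian", "hispanic"]

# Share of total population for each group, rounded to 2 decimal places.
for group in groups:
    df[f"pct_{group}"] = (df[group] / df["total_pop"] * 100).round(2)

# Hypothetical extra column: everyone not counted in the four groups above.
df["pct_other"] = ((df["total_pop"] - df[groups].sum(axis=1)) / df["total_pop"] * 100).round(2)

print(df.filter(like="pct_"))
```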

In [30]:
%%R 

# write to csv
write_csv(ny_assembly_wide, "data/census_demographics.csv")

## Combine with Election Data

For demonstration purposes, we load the demographics data from the `data/census_demographics.csv` file written above.

⚠️ I think this is actually incorrect because of redistricting: the district numbers may have changed. I'll have to check more closely.

The demographic data in the `NYS Census Demographics` sheet of `NYC Mayoral Election 2025.xlsx` definitely has the right district numbers.

But for demonstration purposes, we'll keep going with the CSV data for now.
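
One possible way to check the redistricting concern is to compare the same column for the same district number across the two sources. The sketch below assumes the spreadsheet sheet has already been loaded into a DataFrame with `ad` and `total_pop` columns. Those column names, the function name, and the 5% tolerance are all assumptions, and the "sheet" values here are invented purely to show a mismatch.

```python
import pandas as pd


def flag_mismatched_districts(csv_df, sheet_df, column="total_pop", rel_tol=0.05):
    """Return district numbers whose `column` differs between sources by more than rel_tol."""
    merged = csv_df[["ad", column]].merge(
        sheet_df[["ad", column]], on="ad", suffixes=("_csv", "_sheet")
    )
    rel_diff = (
        (merged[f"{column}_csv"] - merged[f"{column}_sheet"]).abs()
        / merged[f"{column}_sheet"]
    )
    return merged.loc[rel_diff > rel_tol, "ad"].tolist()


# Toy data: CSV values are from this notebook; the sheet value for
# district 24 is made up to simulate a renumbered district.
csv_df = pd.DataFrame({"ad": [23, 24, 25], "total_pop": [147825, 136161, 118412]})
sheet_df = pd.DataFrame({"ad": [23, 24, 25], "total_pop": [147825, 120000, 118412]})

mismatched = flag_mismatched_districts(csv_df, sheet_df)
print(mismatched)
```

If the list comes back empty for every shared column, the district numbers are probably aligned; a long list suggests the numbering differs between vintages.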

In [31]:
census_df = pd.read_csv('data/census_demographics.csv')

# The last three digits of GEOID are the Assembly District (ad) number
census_df['ad'] = census_df['GEOID'].astype(str).str[-3:].astype(int)

# grab only columns we need
census_df = census_df[['ad', 'total_pop', 'pct_white', 'pct_black', 'pct_asian', 'pct_hispanic', 'median_income', 'median_age']]

# display
census_df

| | ad | total_pop | pct_white | pct_black | pct_asian | pct_hispanic | median_income | median_age |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 129486 | 70.42 | 3.13 | 2.26 | 20.63 | 127022 | 48.4 |
| 1 | 2 | 136040 | 76.47 | 3.55 | 1.99 | 14.84 | 116673 | 46.3 |
| 2 | 3 | 136388 | 61.00 | 8.11 | 2.73 | 24.59 | 109846 | 38.6 |
| 3 | 4 | 127743 | 61.23 | 8.01 | 9.14 | 16.29 | 126580 | 38.2 |
| 4 | 5 | 136012 | 66.95 | 4.10 | 6.45 | 18.79 | 123888 | 38.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 146 | 130546 | 76.51 | 5.21 | 9.49 | 4.69 | 93483 | 39.1 |
| 146 | 147 | 141366 | 92.34 | 2.13 | 0.39 | 2.31 | 86120 | 46.5 |
| 147 | 148 | 130604 | 89.90 | 1.37 | 1.15 | 2.23 | 61087 | 41.3 |
| 148 | 149 | 132659 | 67.65 | 10.04 | 4.46 | 13.04 | 69378 | 37.4 |
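
A small sketch of the GEOID handling, in case it helps. For state legislative districts the GEOID is the 2-digit state FIPS code (`36` for New York, as in the tidycensus message above) followed by a 3-digit district code. Reading the column as a string is an optional variation on the cell above, not what the notebook does; the inline CSV is a toy stand-in for `data/census_demographics.csv`.

```python
import io

import pandas as pd

# Toy stand-in for the CSV file, with values taken from the tables above.
csv_text = """GEOID,total_pop
36001,129486
36002,136040
36149,132659
"""

# Reading GEOID as a string keeps it as an identifier rather than a number.
df = pd.read_csv(io.StringIO(csv_text), dtype={"GEOID": str})

df["state_fips"] = df["GEOID"].str[:2]       # "36" = New York
df["ad"] = df["GEOID"].str[-3:].astype(int)  # "001" -> 1, "149" -> 149

print(df)
```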


In [32]:
election_df = pd.read_csv('data/election_data.csv')
election_df

| | ad | cuomo_primary | mamdani_primary | total_primary | mamdani_general | cuomo_general | sliwa_general | adams_general | other_general | total_general |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 6688 | 3871 | 11130 | 9072 | 19096 | 7086 | 82 | 152 | 35488 |
| 1 | 24 | 5549 | 6233 | 12200 | 13825 | 11985 | 2256 | 66 | 113 | 28245 |
| 2 | 25 | 4329 | 3928 | 8648 | 8420 | 13485 | 2485 | 59 | 154 | 24603 |
| 3 | 26 | 7190 | 5296 | 13244 | 9978 | 20901 | 5205 | 70 | 196 | 36350 |
| 4 | 27 | 6314 | 3783 | 10555 | 7624 | 16738 | 3495 | 71 | 136 | 28064 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | 83 | 8234 | 3280 | 12099 | 9928 | 7198 | 616 | 102 | 94 | 17938 |
| 61 | 84 | 4812 | 3872 | 9088 | 10111 | 6701 | 810 | 84 | 121 | 17827 |
| 62 | 85 | 5121 | 2990 | 8551 | 8957 | 7222 | 923 | 86 | 111 | 17299 |
| 63 | 86 | 4446 | 2610 | 7369 | 7322 | 5951 | 664 | 81 | 97 | 14115 |


In [33]:
combined_df = pd.merge(election_df, census_df, on='ad', how='inner')
combined_df

| | ad | cuomo_primary | mamdani_primary | total_primary | mamdani_general | cuomo_general | sliwa_general | adams_general | other_general | total_general | total_pop | pct_white | pct_black | pct_asian | pct_hispanic | median_income | median_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | 6688 | 3871 | 11130 | 9072 | 19096 | 7086 | 82 | 152 | 35488 | 147825 | 45.75 | 14.16 | 9.97 | 24.29 | 85222 | 42.0 |
| 1 | 24 | 5549 | 6233 | 12200 | 13825 | 11985 | 2256 | 66 | 113 | 28245 | 136161 | 15.65 | 12.43 | 33.11 | 23.37 | 92749 | 42.8 |
| 2 | 25 | 4329 | 3928 | 8648 | 8420 | 13485 | 2485 | 59 | 154 | 24603 | 118412 | 21.78 | 4.60 | 55.25 | 14.94 | 84098 | 43.6 |
| 3 | 26 | 7190 | 5296 | 13244 | 9978 | 20901 | 5205 | 70 | 196 | 36350 | 128700 | 38.50 | 2.86 | 39.63 | 14.78 | 107053 | 49.3 |
| 4 | 27 | 6314 | 3783 | 10555 | 7624 | 16738 | 3495 | 71 | 136 | 28064 | 123443 | 37.58 | 5.05 | 27.07 | 25.79 | 83162 | 41.9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | 83 | 8234 | 3280 | 12099 | 9928 | 7198 | 616 | 102 | 94 | 17938 | 132556 | 1.98 | 63.31 | 1.84 | 27.17 | 65487 | 37.1 |
| 61 | 84 | 4812 | 3872 | 9088 | 10111 | 6701 | 810 | 84 | 121 | 17827 | 134648 | 2.99 | 23.69 | 0.87 | 69.07 | 35581 | 33.9 |
| 62 | 85 | 5121 | 2990 | 8551 | 8957 | 7222 | 923 | 86 | 111 | 17299 | 130329 | 2.11 | 27.39 | 1.50 | 66.38 | 43267 | 34.3 |
| 63 | 86 | 4446 | 2610 | 7369 | 7322 | 5951 | 664 | 81 | 97 | 14115 | 132471 | 1.91 | 24.94 | 0.81 | 70.28 | 38316 | 33.7 |
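
An inner join silently drops any district that is missing from either table. A minimal sketch of one way to audit that, using the `indicator` and `validate` options of `pd.merge`, is below. The toy frames reuse a few values from the tables above, and district 25 is deliberately left out of the census side to show an unmatched row; the name `audit` is just illustrative.

```python
import pandas as pd

# Toy frames: values from the tables above, with district 25
# deliberately missing from the census side.
election = pd.DataFrame({"ad": [23, 24, 25], "total_general": [35488, 28245, 24603]})
census = pd.DataFrame({"ad": [1, 23, 24], "total_pop": [129486, 147825, 136161]})

# how="outer" keeps unmatched rows from both sides, indicator=True adds a
# `_merge` column saying where each row came from, and validate raises
# an error if `ad` is duplicated in either table.
audit = pd.merge(
    election, census, on="ad", how="outer", validate="one_to_one", indicator=True
)
counts = audit["_merge"].value_counts()

print(audit)
print(counts)
```

Rows marked `left_only` are election districts with no demographics; rows marked `right_only` are census districts with no election results, which is expected here for districts outside the election data.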


In [34]:
combined_df.to_csv('data/combined_data.csv', index=False)