Some questions we might ask are: 
1. What were the fastest and slowest growing population centers in the US? (done)
2. How does population density in one census affect the rate of population growth over the next decade, if at all? (done)
3. To what extent do regions mirror one another in population growth, as opposed to diverging from one another? 
4. What are the best and worst states in terms of people per representative? (done)
5. What is the spread of people per representative across states, and how has this changed over time? (done)
6. Fraction of representatives should correlate with fraction of population, right? (WIP)

Pop_change notes: 
CHANGE EXPRESSED AS PERCENTAGE (0-100).

Pop_density notes: 
DENSITY EXPRESSED AS PEOPLE PER SQUARE MILE.
DENSITY RANKING EXPRESSED IN ORDER OF MOST DENSE (1) TO LEAST DENSE (52).

Interesting observations: 
1. DC was the most dense in every census from 1910 onward.
2. The number of House seats was 433 in 1910 and 1920, then 435 from 1930 onwards.
3. The worst-represented places aren't big states, but small ones like Rhode Island

## Setting Up

I'm going to import and clean up the data here.

In [103]:
import pandas as pd
import numpy as np
import math

'''The population of each state at each decade, as well as the change from the previous decade. 
X_POPULATION and X_CHANGE are the column names (X is a census year such as 1910). 
These data exist for United States, Northeast, Midwest, South, West, 
Puerto Rico, and each individual state'''
pop_change_df = pd.read_csv('data/pop_change.csv', index_col=0, header=0, thousands=',')
# apply returns a new frame, so assign the result back
pop_change_df = pop_change_df.apply(pd.to_numeric)

'''The population density of each state. X_POPULATION, X_DENSITY, X_RANK are the keys'''
pop_density_df = pd.read_csv('data/pop_density.csv', index_col=0, header=0, skiprows=3, thousands=',')
# apply returns a new frame, so assign the result back
pop_density_df = pop_density_df.apply(pd.to_numeric)


'''The apportionment of representatives to the House by state. Keys include X_REPS,
X_PEOPLE_PER_REP, X_'''
apportionment_df = pd.read_csv('data/apportionment.csv', index_col=0, header=0, skiprows=1)
# apply returns a new frame, so assign the result back
apportionment_df = apportionment_df.apply(pd.to_numeric)
# Missing values become 0
apportionment_df = apportionment_df.fillna(0)

'''For some reason, the 1920 people per rep column is all zeroes. We can fix that here by just calculating it.'''
apportionment_df['1920_PEOPLE_PER_REP'] = pop_change_df['1920_POPULATION']/apportionment_df['1920_REPS']
# Rows with zero seats divide to infinity; mark those as missing instead
apportionment_df['1920_PEOPLE_PER_REP'] = apportionment_df['1920_PEOPLE_PER_REP'].replace([np.inf, -np.inf], np.nan)

'''The same dataframes, restricted to the state rows.'''
states_pop_change = pop_change_df.iloc[5:]
states_pop_density = pop_density_df.iloc[1:]
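
The 1920 fix above relies on how pandas handles division by zero. Here is a minimal sketch of that behavior, using made-up rows rather than the real census files (the labels `A`, `B`, `C` and all numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up rows: 'B' and 'C' have no seats (missing values that were filled with 0).
population = pd.Series({'A': 1000.0, 'B': 500.0, 'C': 0.0})
reps = pd.Series({'A': 2.0, 'B': 0.0, 'C': 0.0})

# Float division in pandas does not raise on zero:
# a positive number over zero gives inf, and zero over zero gives NaN.
ratio = population / reps

# Turn the infinities into missing values so they drop out of mean/std/median.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)
print(cleaned)
```

Because `mean`, `std`, and `median` skip NaN by default, marking these rows as missing keeps them out of the summary statistics.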



In [65]:
#Answering 1: What were the fastest and slowest growing population centers in the US? 

def n_max(year, number, top): 
    '''Returns the NUMBER fastest or slowest growing states by population in YEAR.
    TOP is a Boolean. If TOP, we want the NUMBER fastest growing states. If not TOP, 
    we want the NUMBER slowest growing states.'''
    key = str(year) + '_CHANGE'
    if top: 
        return states_pop_change[key].nlargest(n=number)
    else: 
        return states_pop_change[key].nsmallest(n=number)

five_fastest_growing_2010 = n_max(2010, 5, True)
five_slowest_growing_2010 = n_max(2010, 5, False)

## Does population density affect growth in states?

Does population density affect the rate of population growth in a state? That is, if a state is densely populated in one census, does it grow faster or slower over the following decade?

Either direction is plausible. People might flock to a state that is already crowded and growing quickly because it is the place to be. Or they might choose less populated states, which might have cheaper and more abundant land, and more opportunities. 

In [66]:
#Answering 2: How does population density one census affect the rate of population growth the next decade, if at all?

def corr_density_growth(initial_year): 
    density_key = str(initial_year) + '_DENSITY'
    growth_key = str(initial_year + 10) + '_CHANGE'
    return pop_density_df[density_key].corr(pop_change_df[growth_key], method='pearson')

correlations = [corr_density_growth(1910 + (x * 10)) for x in range(10)]
print(correlations)

[0.16759764250300993, -0.033015074352785424, 0.42880048186971909, 0.028952055112537411, -0.18073037791760463, -0.15831595591159223, -0.34969210174663073, -0.19039483919421596, -0.28296925289714092, -0.16320851336858472]
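
The raw list is hard to read without knowing which decade each value belongs to. As a small sketch, the values can be labeled by the decade of growth they describe. The numbers are copied from the output above; the names `labeled` and `negative_decades` are just illustrative:

```python
# Correlations copied from the printed output; entry i pairs density in year
# 1910 + 10*i with population change over the following decade.
correlations = [0.16759764250300993, -0.033015074352785424, 0.42880048186971909,
                0.028952055112537411, -0.18073037791760463, -0.15831595591159223,
                -0.34969210174663073, -0.19039483919421596, -0.28296925289714092,
                -0.16320851336858472]

labeled = {}
for i, r in enumerate(correlations):
    start = 1910 + 10 * i
    labeled[f"{start}-{start + 10}"] = round(r, 2)

# Decades in which denser states tended to grow more slowly.
negative_decades = [decade for decade, r in labeled.items() if r < 0]

for decade, r in labeled.items():
    print(decade, r)
```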


From the 1950-1960 decade onward, a state's population density in one census is negatively correlated with its population growth over the next decade. However, these correlations are fairly weak (between about -0.16 and -0.35), and the earlier decades are mixed in sign. Interestingly, the decade 1930-1940 is a strong exception - in this decade, the two were positively correlated (about 0.43). I would conjecture this has to do with the Great Depression, during which there was a great migration westward in search of jobs (if my faint memory of history class is correct). States in the west were 

## Regional correlation in population growth
3. To what extent do regions mirror one another in population growth, as opposed to diverging from one another?

In [67]:
#First question is, how to define a region? 
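
One possible starting point, offered as a sketch only: the population change file already carries rows for Northeast, Midwest, South, and West, so those Census regions could serve as the definition. The frame below uses made-up percentages laid out like `pop_change_df` (regions as rows, `X_CHANGE` columns); none of the numbers are real data:

```python
import pandas as pd

# Made-up percentage changes, arranged like pop_change_df.
toy = pd.DataFrame(
    {'1990_CHANGE': [3.0, 1.5, 13.0, 22.0],
     '2000_CHANGE': [5.5, 8.0, 17.0, 20.0],
     '2010_CHANGE': [3.0, 4.0, 14.0, 14.0]},
    index=['Northeast', 'Midwest', 'South', 'West'])

# Keep only the growth columns, then transpose so each region becomes
# a column holding its growth rate decade by decade.
change_cols = [c for c in toy.columns if c.endswith('_CHANGE')]
by_decade = toy[change_cols].T

# Pairwise Pearson correlation between regions: values near 1 mean two
# regions rise and fall together, values near -1 mean they diverge.
region_corr = by_decade.corr(method='pearson')
print(region_corr)
```

With the real data, the same three steps would apply to `pop_change_df.loc[['Northeast', 'Midwest', 'South', 'West']]`, assuming those row labels match the file exactly.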

## Number of people per representative
4. What are the best and worst states in terms of people per representative? 

In [83]:
people_per_rep_2010 = apportionment_df['2010_PEOPLE_PER_REP']
pop_2010 = pop_density_df['2010_POPULATION']
#interesting. it seems like the states with the worst representation are not the biggest but in fact the smallest

corr_popsize_people_per_rep = (pop_2010.corr(people_per_rep_2010, method='pearson'))

mean_people_per_rep = {}
std_people_per_rep = {}
median_people_per_rep = {}
for x in range(11): 
    year = 1910 + (x * 10)
    mean_people_per_rep[year] = apportionment_df[str(year) + '_PEOPLE_PER_REP'].mean() 
    std_people_per_rep[year] = apportionment_df[str(year) + '_PEOPLE_PER_REP'].std()
    median_people_per_rep[year] = apportionment_df[str(year) + '_PEOPLE_PER_REP'].median()

median_people_per_rep.values()
#std_people_per_rep.values()
#so population has no correlation whatsoever with people per representative. 
#TODO: Data visualization here. X-axis can be size of population, and y-axis people per rep 
#TODO: Find out state trends in dilution of people per rep. 

dict_values([207599.0, 236809.875, 279447.5, 299770.5, 340023.0, 409902.5, 469647.5, 517160.0, 570807.0, 644637.0, 712400.5])

## Apportionment

We have two interesting findings:

First, the lopsidedness of representation across states is at a historical peak. That is, the standard deviation in people per representative across states is highest in 2010. 

Second, as of 2010, there's no correlation whatsoever between population size and people per representative. This violates my initial intuition that big states have the worst ratio of representatives to population. In fact, the smallest states have it the worst. 
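
One way to see why small states can end up with the worst ratio: seats come in whole numbers, so a small state sitting just below the cutoff for a second seat has a single representative for many people. The sketch below uses hypothetical populations and seat counts chosen only to illustrate the effect; they are not taken from the data files:

```python
# Hypothetical (population, seats) pairs, chosen only to illustrate rounding.
toy_states = {
    'small, just under a second seat': (990_000, 1),
    'small, just over a second seat': (1_050_000, 2),
    'large': (37_000_000, 53),
}

people_per_rep = {name: pop / seats for name, (pop, seats) in toy_states.items()}

# A higher number means each representative speaks for more people.
worst = max(people_per_rep, key=people_per_rep.get)
best = min(people_per_rep, key=people_per_rep.get)

for name, value in people_per_rep.items():
    print(f"{name}: {value:,.0f} people per representative")
```

In this toy setup both extremes are small states, while the large state lands in between because one seat more or less barely moves its ratio.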

A not so surprising trend: 

Representation is getting diluted decade after decade, because the population grows while the number of representatives stays fixed. Mean and median people per representative grow every decade. 
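
To put a number on the dilution, the median values printed earlier can be paired with their census years. This is a small sketch using only those printed medians; the names `growth_factor` and `always_growing` are illustrative:

```python
# Median people per representative, copied from the output above (1910 to 2010).
years = list(range(1910, 2011, 10))
medians = [207599.0, 236809.875, 279447.5, 299770.5, 340023.0, 409902.5,
           469647.5, 517160.0, 570807.0, 644637.0, 712400.5]
median_by_year = dict(zip(years, medians))

# How many times more people a typical representative covers in 2010 than in 1910.
growth_factor = median_by_year[2010] / median_by_year[1910]

# True if the median rose in every single decade.
always_growing = all(later > earlier for earlier, later in zip(medians, medians[1:]))

print(round(growth_factor, 2), always_growing)
```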



## Fraction of population vs fraction of representatives

In [102]:
'''Sanity check: a state's share of House seats should track its share of the population, so the
correlation should be close to if not exactly one. It did not come out that way at first. A plausible
cause (not confirmed): aggregate rows and rows whose missing seat counts were filled with 0 were included.'''
def corr_popfraction_repdensity(year): 
    population = pop_change_df[str(year) + '_POPULATION']
    fraction_of_pop = population/population['United States']
    rep_key = str(year) + '_REPS'
    # Restrict to the state rows and drop rows with no seats
    reps = apportionment_df[rep_key].reindex(states_pop_change.index)
    reps = reps[reps > 0]
    rep_density = reps/reps.sum()
    # corr aligns on the index, so only rows present in both series are compared
    return rep_density.corr(fraction_of_pop, method='pearson')
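
Two properties of Pearson correlation are worth checking when debugging this. It ignores scale, so correlating raw seat counts instead of seat shares cannot be the problem. It is, however, very sensitive to a single row that pairs a huge population share with zero seats. The sketch below demonstrates both on made-up data; whether the real frames contain such a row is an assumption, not something verified here:

```python
import pandas as pd

# Made-up data: three hypothetical states plus a national total row.
population = pd.Series({'United States': 100.0, 'A': 50.0, 'B': 30.0, 'C': 20.0})
# The national row has no seat count of its own, so NaN filling left it at 0.
seats = pd.Series({'United States': 0.0, 'A': 5.0, 'B': 3.0, 'C': 2.0})

pop_share = population / population['United States']

# Scale invariance: raw seats and seat shares give the same correlation.
corr_raw = seats.corr(pop_share)
corr_share = (seats / seats.sum()).corr(pop_share)

# Restricting to the state rows removes the zero-filled aggregate.
states = ['A', 'B', 'C']
corr_states = seats[states].corr(pop_share[states])

print(corr_raw, corr_share, corr_states)
```

In this toy case the aggregate row alone flips the correlation negative, while the state rows on their own correlate almost perfectly.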

