# US Poverty Analysis

The task is to analyze the evolution of poverty and income distribution in the United States.

### Data source: 

https://www.census.gov/did/www/saipe/data/statecounty/data/index.html

Use the given code snippet to access the data files.

The data fields are described here: https://www.census.gov/did/www/saipe/data/statecounty/data/2000.html
Extract the following required fields:
   'State FIPS', 'County FIPS','Poverty Estimate All Ages', 'Poverty Percent All Ages', 
   'Poverty Estimate Under Age 18', 'Poverty Percent Under Age 18', 'Poverty Estimate Ages 5-17', 
   'Poverty Percent Ages 5-17', 'Median Household Income','Name','Postal'
 
###### Pandas Data-Frames

1)
    - create a data frame with just the country-wide data
    - create a data frame with just the data of the states
    - create a data frame with just the data of the counties

    
2) Plot a graph of 'Poverty Percent All Ages' of the entire country vs year (line graph).
   Plot a graph of 'Median Household Income' of the entire country vs year (line graph)

3) Plot the total poverty in each state across the years and compare it with the country-wide poverty

4) Plot county-wide poverty stats

- Create a dataframe with the unique FIPS code (obtained by combining the state and county FIPS codes), the 'Poverty Percent All Ages' of every county in 2000, the 'Poverty Percent All Ages' of every county in 2013, and the change ratio between 2000 and 2013. The change ratio is poverty % in 2013 / poverty % in 2000; divide it by the nationwide change ratio to normalize it. A normalized value > 1 indicates that poverty in the county grew faster than nationwide, and a value < 1 indicates that it grew more slowly than nationwide (or fell).


A FIPS code is a unique code that identifies counties and states in the US. In this data, the state code and the county code are given separately. Combine them to generate a unique code for each place, which will be used for plotting on the map. Before combining, make sure the state code has 2 digits and the county code has 3 digits by padding them with leading zeros.
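
As a minimal sketch of the padding step described above, the helper below zero-pads both parts with `str.zfill` and concatenates them. `make_fips` is a hypothetical name, not part of the exam code, and the sketch assumes the codes arrive as integers or as strings that may carry surrounding blanks.

```python
def make_fips(state_code, county_code):
    """Combine a state code and a county code into a 5-character FIPS string."""
    # Normalize to stripped strings so both ints and padded text are accepted.
    state = str(state_code).strip().zfill(2)    # e.g. 6 -> '06'
    county = str(county_code).strip().zfill(3)  # e.g. 37 -> '037'
    return state + county


los_angeles = make_fips(6, 37)      # '06037'
padded_text = make_fips("48", " 1")  # '48001'
```

Keeping the result as a string matters: converting it to an integer would drop the leading zero of states such as `06`.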


- Plot the 'Poverty Percent All Ages' for each county on the map for the years 2000 and 2013.
- Plot the normalized change ratio on the map.
- Find the counties which witnessed the most positive and most negative change in poverty.

For this visualization, we will use a library called Vincent, which lets you plot data for places using their FIPS code instead of using latitude/longitude. 

To install vincent, run
`pip install vincent` or `conda install vincent`


To use it and display the maps inside the notebook, run

`import vincent`

`vincent.core.initialize_notebook()`

You can find further details about how to use it here - http://wrobstory.github.io/2013/10/mapping-data-python.html and https://github.com/wrobstory/vincent

Before closing your notebook, please clear the output of the vincent maps, as it otherwise becomes difficult to reload the notebook later. For plotting the counties on the map, you will need the file us_counties.topo.json present in the exam folder.

Tips:

- Check the type of the data fields before operating on them. It will also help you debug errors.
- Clean the data before using it: drop rows with missing or invalid values.
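
To illustrate the tips above: every field sliced from the raw files is a string, so numeric columns need an explicit conversion. The following is a minimal sketch in plain Python. It assumes that missing values show up as blanks or as non-numeric markers such as `.`, which is an assumption about the files and worth checking against the real data; `to_number` is a hypothetical helper name and the rows are made up.

```python
def to_number(field):
    """Convert a raw fixed-width text field to float, or None if it is not numeric."""
    text = field.strip().replace(",", "")
    try:
        return float(text)
    except ValueError:
        # Blank fields and placeholder markers end up here.
        return None


raw_rows = [
    {"Name": "County A", "Poverty Percent All Ages": " 11.3"},
    {"Name": "County B", "Poverty Percent All Ages": "    ."},
    {"Name": "County C", "Poverty Percent All Ages": "     "},
]

# Check the type first: the values are str, not float.
field_types = {type(row["Poverty Percent All Ages"]) for row in raw_rows}

# Convert, then drop the rows whose value could not be parsed.
clean_rows = []
for row in raw_rows:
    value = to_number(row["Poverty Percent All Ages"])
    if value is not None:
        clean_rows.append({"Name": row["Name"], "Poverty Percent All Ages": value})
```

The later cells of this notebook use `pd.to_numeric(..., errors='coerce')` and `dropna` for the same purpose on whole columns.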

In [None]:
import urllib2
import numpy as np
import pandas as pd

df=pd.DataFrame(data=None,columns=['State FIPS', 'County FIPS','FIPS','Poverty Estimate All Ages', 'Poverty Percent All Ages', 'Poverty Estimate Under Age 18', 'Poverty Percent Under Age 18', 'Poverty Estimate Ages 5-17', 'Poverty Percent Ages 5-17', 'Median Household Income','Name','Postal','Year'])
urls = ['https://www.census.gov/did/www/saipe/downloads/estmod93/est93ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod95/est95ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod97/est97ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod98/est98ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod99/est99ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod00/est00ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod01/est01ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod02/est02ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod03/est03ALL.dat',
        'https://www.census.gov/did/www/saipe/downloads/estmod04/est04ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod05/est05ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod06/est06ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod07/est07ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod08/est08ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod09/est09ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod10/est10ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod11/est11all.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod12/est12ALL.txt',
        'https://www.census.gov/did/www/saipe/downloads/estmod13/est13ALL.txt']

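def getUrl(urls):
    # Download each yearly file and append one row per record to the global df.
    j=0  # index of the next row to write into df
    for url in urls:
        # The two digits before 'ALL' in the file name give the year (93 -> 1993, 13 -> 2013).
        yy=url[-9:-7]
        print yy
        if int(yy) > 50:
            year='19'+yy
        else:
            year='20'+yy
        response = urllib2.urlopen(url)
        lines = response.read().split('\n')
        # Skip the last element of the split (normally the empty string after the final newline).
        for line in lines[:-1]:
            # County FIPS sits in columns 3-5; replace blanks with zeros to get 3 digits.
            cfip=line[3:6].replace(" ","0")
            # Slice the fixed-width fields in the same order as the columns of df.
            df.loc[j]=[line[0:2],cfip,line[0:2]+cfip,line[7:16],line[34:39],line[49:58],line[76:81],line[92:100],line[118:123],line[134:140],line[193:239],line[239:241],year]
            j+=1

        #TODO -code to use the text data


getUrl(urls)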
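
The slicing in `getUrl` can be easier to check when the column offsets live in one table. The sketch below reuses the offsets from the cell above. `FIELD_SLICES`, `parse_line`, and `year_from_url` are hypothetical names introduced for illustration. Unlike the cell above, the sketch strips the padding from each field and zero-pads the state code as well as the county code.

```python
import re

# (column name, start, end) offsets copied from the slicing in getUrl above.
FIELD_SLICES = [
    ("State FIPS", 0, 2),
    ("County FIPS", 3, 6),
    ("Poverty Estimate All Ages", 7, 16),
    ("Poverty Percent All Ages", 34, 39),
    ("Poverty Estimate Under Age 18", 49, 58),
    ("Poverty Percent Under Age 18", 76, 81),
    ("Poverty Estimate Ages 5-17", 92, 100),
    ("Poverty Percent Ages 5-17", 118, 123),
    ("Median Household Income", 134, 140),
    ("Name", 193, 239),
    ("Postal", 239, 241),
]


def year_from_url(url):
    """Derive the 4-digit year from a file name such as 'est93ALL.dat'."""
    two_digits = re.search(r"est(\d{2})all", url, re.IGNORECASE).group(1)
    century = "19" if int(two_digits) > 50 else "20"
    return century + two_digits


def parse_line(line):
    """Slice one fixed-width record into a dict of stripped strings."""
    record = {name: line[start:end].strip() for name, start, end in FIELD_SLICES}
    # Pad the codes with leading zeros so the combined FIPS always has 5 digits.
    record["State FIPS"] = record["State FIPS"].zfill(2)
    record["County FIPS"] = record["County FIPS"].zfill(3)
    record["FIPS"] = record["State FIPS"] + record["County FIPS"]
    return record
```

Collecting the dicts returned by `parse_line` in a list and building the dataframe once at the end is usually much faster than assigning to `df.loc[j]` row by row.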

### Create the dataframes

Download and parse the data files and create the following three pandas dataframes: (your dataframes should have data in the format shown below)

 * US_stat: statistics for the whole United States.
 * states_stat: statistics for each state.
 * county_stat: statistics for each county.
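
The split into the three dataframes relies on a convention that the filtering cells in this section use: nation-wide rows carry the state FIPS `00`, and state-level rows carry the county FIPS `000`. A minimal sketch of that rule in plain Python follows; `classify_row` is a hypothetical helper, not part of the notebook.

```python
def classify_row(state_fips, county_fips):
    """Return which of the three tables a row belongs to."""
    if state_fips == "00":
        return "US_stat"        # nation-wide totals
    if county_fips == "000":
        return "states_stat"    # one row per state and year
    return "county_stat"        # everything else is a county


rows = [("00", "000"), ("06", "000"), ("06", "037")]
tables = [classify_row(state, county) for state, county in rows]
```

The comparison is done on strings, so it only works if the codes were kept as zero-padded text when the files were parsed.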

In [None]:
US_stat=df[df['State FIPS']=='00']
US_stat=US_stat.set_index(US_stat['Year'])


In [None]:
US_stat.head()

In [None]:
states_stat=df[(df['State FIPS']!='00') & (df['County FIPS']=='000')]
states_stat=states_stat.set_index(states_stat['Postal'])
states_stat=states_stat.set_index(states_stat['Year'],append=True)
states_stat=states_stat.sort_index()

In [None]:
states_stat.head()

In [None]:
states_stat[states_stat['Postal']=='DC']#.unique()

In [None]:
county_stat=df[(df['State FIPS']!='00') & (df['County FIPS']!='000')]
county_stat=county_stat.set_index(county_stat['Postal'])
county_stat=county_stat.set_index(county_stat['FIPS'],append=True)
county_stat=county_stat.set_index(county_stat['Year'],append=True)
county_stat=county_stat.sort_index()

In [None]:
county_stat


### Plot the US-wide statistics on poverty.

Plot the 'Poverty Percent All Ages' and 'Median Household Income' across entire US over the years. 

Compute the change ratio of poverty in the US between 2000 and 2013: [poverty % in 2013] / [poverty % in 2000]. A ratio above 1 means that poverty increased.
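
A ratio and a percentage increase are two views of the same number: a ratio of 1.39, for instance, corresponds to a 39% increase. A small sketch of the conversion follows. The function names are hypothetical and the input values are illustrative, not taken from the data.

```python
def change_ratio(percent_start, percent_end):
    """Ratio of the later poverty percentage to the earlier one."""
    return percent_end / float(percent_start)


def percent_increase(ratio):
    """Express a change ratio as a percentage increase."""
    return (ratio - 1.0) * 100.0


ratio = change_ratio(10.0, 12.5)      # illustrative values only
increase = percent_increase(ratio)    # 25.0, i.e. a 25% increase
```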

In [None]:
import matplotlib.pyplot as plt

print "Poverty percent change ratio from 2000 to 2013 = " + str(float(US_stat['Poverty Percent All Ages'].loc['2013'])/float(US_stat['Poverty Percent All Ages'].loc['2000']))

fig = plt.figure(figsize=(8,2.5))
ax = fig.add_subplot(1,2,1)
ax.plot(US_stat['Median Household Income'])
ax.grid()
ax.set_title('Median Household Income')

ax = fig.add_subplot(1,2,2)
ax.plot(US_stat['Poverty Percent All Ages'])
ax.grid()
ax.set_title('Poverty Percent All Ages')
plt.show()

### 2000 was a good year

We see from these graphs that even though the median household income in the US keeps increasing at a more or less constant rate, poverty levels reached a minimum in 2000 and have since increased dramatically, by about 40%.

We will now attempt to identify the geographic distribution of the rise in poverty since 2000.

We start by plotting the time evolution of poverty for each of the states.

### Plot the change in poverty percentages by state.

For each state, plot the poverty levels across time and compare it with the nation-wide poverty level. Produce a graph similar to the ones below.

In [None]:
states_stat.loc[states_stat.index[3][0]] #['Poverty Percent All Ages']

In [None]:
states_stat.loc[states_stat.index[968][0]]

In [None]:
fig, a = plt.subplots(11, 5,sharex='col', sharey='row',figsize=(7,15))
c=0
s=19
for ax in a:
    for j in range(5):
        if c>=969:
            for k in range(1,5):
                for tick in ax[k].xaxis.get_major_ticks():
                    tick.label.set_fontsize(5)
            break
        ax[j].plot(states_stat.loc[states_stat.index[c][0]]['Poverty Percent All Ages'],color='Red')
        ax[j].plot(US_stat['Poverty Percent All Ages'],color='Blue')
        ax[j].grid()
        ax[j].set_title(states_stat.index[c][0],size=8)
        for tick in ax[j].yaxis.get_major_ticks():
                tick.label.set_fontsize(7)
        for tick in ax[j].xaxis.get_major_ticks():
                tick.label.set_fontsize(5)
        c+=s
plt.show()

In [None]:
states_stat['diff2013']=pd.to_numeric(states_stat[states_stat['Year']=='2013']['Poverty Percent All Ages'])-pd.to_numeric(US_stat[US_stat['Year']=='2013']['Poverty Percent All Ages'])

In [None]:
import string
l=[]
k=[]
m=[]
for i in states_stat[states_stat['diff2013']>.5]['Name'].values:
    l.append(i.translate(None, string.whitespace))

for j in states_stat[states_stat['diff2013']<-.5]['Name'].values:
    k.append(j.translate(None, string.whitespace))
    
for o in states_stat[(states_stat['diff2013']>=-.5) & (states_stat['diff2013']<=.5)]['Name'].values:
    m.append(o.translate(None, string.whitespace))    

In [None]:
state_loc=pd.DataFrame()

In [None]:
lk=r'state_loc.csv'
state_loc=pd.read_csv(lk,header=None, error_bad_lines=False,names=['Postal','Name','Lat','Long'])

In [None]:
state_loc=state_loc[state_loc['Postal']!='AK']

In [None]:
state_loc=state_loc[state_loc['Postal']!='HI']

In [None]:
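# Build a per-row label; str(Series) would embed the whole column in every label, so convert element-wise.
states_stat['state_diff']=states_stat["Postal"].map(str) + states_stat["diff2013"].map(str)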

###### The map below shows state poverty levels compared to the country-wide poverty level for 2013
###### An important thing to note from the plots above is that a few very poor states differ greatly from the national average

States in red have a poverty rate above the national rate; some of them, such as MS, NM, and LA, are more than 4 percentage points above it. Most of the states with high poverty rates are on the East Coast.

The blue states have a lower poverty rate than the national poverty rate.

The green states are close to the national rate (within 0.5 percentage points).

The poorest states are AL, AR, DC, LA, MS, NM, and WV.
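
The coloring rule used for the map follows from the `diff2013` column computed above: a state is red when its poverty rate exceeds the national rate by more than 0.5 percentage points, blue when it is more than 0.5 below, and green otherwise. A minimal sketch of that rule follows; `color_for` is a hypothetical helper and the sample differences are made up.

```python
def color_for(diff, threshold=0.5):
    """Map a state's difference from the national poverty rate to a map color."""
    if diff > threshold:
        return "red"      # noticeably poorer than the nation
    if diff < -threshold:
        return "blue"     # noticeably less poor than the nation
    return "green"        # close to the national rate


sample_diffs = {"State A": 4.2, "State B": -3.1, "State C": 0.3}
colors = {name: color_for(diff) for name, diff in sample_diffs.items()}
```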

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon

plt.figure(figsize=(12,6))
# create the map
map = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49,
        projection='lcc',lat_1=33,lat_2=45,lon_0=-95)

map.readshapefile('st99_d00', name='states', drawbounds=True)

# collect the state names from the shapefile attributes so we can
# look up the shape object for a state by its name
state_names = []
for shape_dict in map.states_info:
    state_names.append(shape_dict['NAME'].translate(None, string.whitespace))

ax = plt.gca() # get current axes instance

# fill the states whose poverty rate is above the national rate in red
for i in l:
    #print i
    #print state_names.index(i)
    seg = map.states[state_names.index(i)]
    poly = Polygon(seg, facecolor='red',edgecolor='red')
    ax.add_patch(poly)
    seg = map.states[106]
    poly = Polygon(seg, facecolor='red',edgecolor='red')
    ax.add_patch(poly)

for x in k:
    #print x
    #print state_names.index(x)
    seg = map.states[state_names.index(x)]
    poly = Polygon(seg, facecolor='blue',edgecolor='blue')
    ax.add_patch(poly)   
    seg = map.states[97]
    poly = Polygon(seg, facecolor='blue',edgecolor='blue',alpha=.4)
    ax.add_patch(poly)  
    seg = map.states[101]
    poly = Polygon(seg, facecolor='blue',edgecolor='blue',alpha=.4)
    ax.add_patch(poly)

for y in m:
    #print y
    seg = map.states[state_names.index(y)]
    poly = Polygon(seg, facecolor='green',edgecolor='green',alpha=.7)
    ax.add_patch(poly)

lons = state_loc['Long'].values
lats = state_loc['Lat'].values
x,y = map(lons, lats)    
    
for label, xpt, ypt in zip(state_loc['Postal'], x,y):
    plt.text(xpt, ypt, label) 
#for label, xpt, ypt in zip(states_stat[states_stat['Year']=='2013']['diff2013'], x,y):
    #plt.text(xpt+10000, ypt+5000, label)
    
plt.show()

##### The graphs below suggest that states with poverty rates above the national average also have median household incomes below the national median
For example, states like AL, AR, LA, MS, NM, and WV have median incomes far below the national median income.

In [None]:
fig, a = plt.subplots(11, 5,sharex='col', sharey='row',figsize=(7,15))
c=0
s=19
for ax in a:
    for j in range(5):
        if c>=969:
            for k in range(1,5):
                for tick in ax[k].xaxis.get_major_ticks():
                    tick.label.set_fontsize(5)
            break
        ax[j].plot(states_stat.loc[states_stat.index[c][0]]['Median Household Income'],color='Red')
        ax[j].plot(US_stat['Median Household Income'],color='Blue')
        ax[j].grid()
        ax[j].set_title(states_stat.index[c][0],size=8)
        for tick in ax[j].yaxis.get_major_ticks():
                tick.label.set_fontsize(7)
        for tick in ax[j].xaxis.get_major_ticks():
                tick.label.set_fontsize(5)
        c+=s
plt.show()



### Plot poverty statistics by county

Using the vincent library and the dataframe `county_stat`, generate the following three maps.

1. Overall percentage of poverty for each county in 2000.
![poverty2000](Poverty2000.jpg)
1. Overall percentage of poverty for each county in 2013.
![poverty2013](Poverty2013.jpg)
1. Ratio of the poverty percentage in 2013 to the poverty percentage in 2000 for each county, divided by the nation-wide change ratio (1.39).
![povertyChange](PovertyChange.jpg)
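
The quantity shown in the third map can be sketched as follows, assuming the nation-wide ratio of 1.39 quoted above. `normalized_change` is a hypothetical helper and the county values are made up.

```python
NATIONAL_RATIO = 1.39  # nation-wide 2013/2000 ratio quoted in the text


def normalized_change(percent_2000, percent_2013, national_ratio=NATIONAL_RATIO):
    """County change ratio divided by the nation-wide change ratio."""
    if not percent_2000:
        return None  # ratio undefined for a zero or missing baseline
    return (percent_2013 / float(percent_2000)) / national_ratio


faster = normalized_change(10.0, 20.0)   # poverty doubled: above 1
slower = normalized_change(10.0, 11.0)   # grew less than the nation: below 1
```

A county whose poverty grew exactly as fast as the nation's gets a value of 1, which is why the color thresholds for this map are centered around 1.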

### Adding map for overall percentage of poverty for each county in 2000.

In [None]:
county_2000=county_stat[county_stat['Year']=='2000']
county_2000=county_2000.dropna(axis=0)
county_2000['Poverty Percent All Ages']=pd.to_numeric(county_2000['Poverty Percent All Ages'], errors='coerce')

In [None]:
import json
import pandas as pd
import vincent

vincent.core.initialize_notebook()

state_topo="https://raw.githubusercontent.com/wrobstory/vincent_map_data/master/us_counties.topo.json"
#r'us_counties.topo.json'
geo_data = [{'name': 'counties',
             'url': state_topo,
             'feature': 'us_counties.geo'}]

vis = vincent.Map(data=county_2000, geo_data=geo_data, scale=1000,
                  projection='albersUsa', data_bind='Poverty Percent All Ages',
                  data_key='FIPS', map_key={'counties': 'properties.FIPS'},brew='YlGnBu')
vis.scales['color'].type = 'threshold'
vis.scales['color'].domain = [0,4,6,8,10,12,20,30]
vis.legend(title='Poverty 2000 (%)')
vis.to_json('vega.json')
vis.display()


### Adding map for overall percentage of poverty for each county in 2013.

In [None]:
county_2013=county_stat[county_stat['Year']=='2013']
county_2013=county_2013.dropna(axis=0)
county_2013['Poverty Percent All Ages']=pd.to_numeric(county_2013['Poverty Percent All Ages'], errors='coerce')

In [None]:
vis = vincent.Map(data=county_2013, geo_data=geo_data, scale=1000,
                  projection='albersUsa', data_bind='Poverty Percent All Ages',
                  data_key='FIPS', map_key={'counties': 'properties.FIPS'},brew='YlGnBu')
vis.scales['color'].type = 'threshold'
vis.scales['color'].domain = [0,4,6,8,10,12,20,30]
vis.legend(title='Poverty 2013 (%)')
vis.to_json('vega.json')
vis.display()


### Adding map for the normalized change in poverty for each county between 2000 and 2013.

In [None]:
county_2000=county_2000.reset_index(level=2,drop=True)
county_2013=county_2013.reset_index(level=2,drop=True)
county_2013=county_2013.rename(index=str, columns={"Poverty Percent All Ages":"Poverty Percent All Ages 2013"})
cs=pd.concat([county_2000,county_2013['Poverty Percent All Ages 2013']], axis=1)
cs['ratio']=cs['Poverty Percent All Ages 2013']/cs['Poverty Percent All Ages']
cs['ratio']=cs['ratio']/1.39
cs=cs.set_index(cs['Year'],append=True)

vis = vincent.Map(data=cs, geo_data=geo_data, scale=1000,
                  projection='albersUsa', data_bind='ratio',
                  data_key='FIPS', map_key={'counties': 'properties.FIPS'},brew='YlGnBu')
vis.scales['color'].type = 'threshold'
vis.scales['color'].domain = [0,0.8,0.9,1,1.1,1.2,1.3]
vis.legend(title='Normalized poverty change ratio 2000-2013')
vis.to_json('vega.json')
vis.display()

### Q5: Identify the extremes.
Find out which are the counties in which the poverty percentage increased or decreased the most during the period 2000 - 2013.
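
One way to sketch this step in plain Python, assuming the normalized ratios are held in a dict keyed by FIPS code (the codes and values below are invented). Entries without a usable ratio are skipped before the extremes are taken; `find_extremes` is a hypothetical helper.

```python
def find_extremes(ratios):
    """Return (key of largest ratio, key of smallest ratio), ignoring missing values."""
    valid = {key: value for key, value in ratios.items()
             if value is not None and value == value}  # value == value is False for NaN
    largest = max(valid, key=valid.get)
    smallest = min(valid, key=valid.get)
    return largest, smallest


sample = {"01001": 1.10, "06037": 0.85, "48001": 1.60, "99999": None}
most_increase, most_decrease = find_extremes(sample)
```

With a pandas Series of ratios, `idxmax()` and `idxmin()` return the corresponding index labels and skip NaN values by default.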


In [None]:
print "County with max increase in poverty -- " + cs[cs['ratio']==cs['ratio'].max()]['Name'].values[0].replace(" ","") + " in state " + cs[cs['ratio']==cs['ratio'].max()]['Postal'].values[0]
print "change= " + str(cs['ratio'].max())
print "County with max decrease in poverty -- " + cs[cs['ratio']==cs['ratio'].min()]['Name'].values[0].replace(" ","") + " in state " + cs[cs['ratio']==cs['ratio'].min()]['Postal'].values[0]
print "change= " + str(cs['ratio'].min())