# Katrina Outflow Analysis
## Organising the Dataset
We import the ```pandas```, ```numpy```, ```csv```, and ```os``` libraries.

In [None]:
import pandas as pd
import numpy as np
import csv
import os

We also import the ```fill_table``` function, which we will use later on.

In [None]:
from fill_table import fill_table

We store the path to the ```IRS_migration_data``` repository folder in a string variable

In [None]:
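# Drop the last 7 characters of the working directory (assumed to be the name of
# the subfolder this notebook runs from) to obtain the repository root
repo_path = os.getcwd()[:-7]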
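
Slicing a fixed number of characters works only while the notebook sits in a subfolder whose name is exactly seven characters long. As a minimal sketch, assuming the notebook runs from a direct subfolder of the repository, the parent folder could instead be derived with `os.path`. The helper name `parent_folder` and the example path are hypothetical.

```python
import os

def parent_folder(path):
    """Return the parent of `path` with a trailing separator (hypothetical helper)."""
    # Drop any trailing separator so dirname() steps up exactly one level
    parent = os.path.dirname(path.rstrip(os.sep))
    # Joining with an empty string appends the trailing separator
    return os.path.join(parent, '')

# Illustrative path only; the real working directory will differ
example_cwd = os.path.join(os.sep, 'home', 'user', 'IRS_migration_data', 'katrina')
example_repo_path = parent_folder(example_cwd)
```

For a seven-character subfolder name such as the one in this example, the result matches what the slice produces.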

We also create a new ```tables``` folder in which we will store the tables produced by this script. It will be a subfolder of your ```IRS_migration_data``` repository.

In [None]:
results_path = repo_path + 'tables/'
if not os.path.exists(results_path):
    os.makedirs(results_path)
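
The existence check and the folder creation can also be folded into a single call. The sketch below uses the `exist_ok` argument of `os.makedirs` and runs inside a temporary directory, so the names `tmp_root` and `demo_results_path` are illustrative and nothing is written to the repository.

```python
import os
import tempfile

# Work inside a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_root:
    demo_results_path = os.path.join(tmp_root, 'tables', '')
    os.makedirs(demo_results_path, exist_ok=True)   # creates the folder
    os.makedirs(demo_results_path, exist_ok=True)   # second call does nothing
    folder_created = os.path.isdir(demo_results_path)
```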

We load the data on outflows from a csv file.
It covers the period 1998-2015.

In [None]:
outflow_df = pd.read_csv(repo_path + 'outflows/outflow.csv')

We display the first five rows of the dataframe we just created to see how it is structured

In [None]:
outflow_df.head()

We store the unique FIPS codes of destination and origin counties in numpy arrays

In [None]:
origins = outflow_df[(outflow_df['state_code_origin']<=56) & ((outflow_df['state_code_dest']<=56))]
origin_codes = pd.unique(origins['origin'].values)
destination_codes = np.append(origin_codes, [58000,59000,57009])
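
To make the filtering step concrete, here is a small sketch on made-up records. The column names follow the notebook, while the toy values and the `toy_` variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy records: the third row has a destination state code above 56
toy_outflow = pd.DataFrame({
    'state_code_origin': [1, 1, 22, 22],
    'state_code_dest':   [1, 22, 57, 1],
    'origin':            [1001, 1001, 22071, 22071],
})

# Keep only rows where both origin and destination state codes are at most 56
toy_origins = toy_outflow[(toy_outflow['state_code_origin'] <= 56)
                          & (toy_outflow['state_code_dest'] <= 56)]

# pd.unique keeps the order of first appearance
toy_origin_codes = pd.unique(toy_origins['origin'].values)

# Append the three extra destination codes used in the notebook
toy_destination_codes = np.append(toy_origin_codes, [58000, 59000, 57009])
```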

We load the cleaned and restructured datasets for the nine periods we will analyse:

* six before Katrina (1999-2004);
* and three after Katrina (2007-2009).

In [None]:
def load_flow_matrix(filename):
    """Load one outflow matrix and convert its column labels to integers."""
    df = pd.read_csv(repo_path + 'outflows/' + filename)
    # The first column holds the origin codes and is read as 'Unnamed: 0'
    df.rename(columns={'Unnamed: 0':''}, inplace=True)
    df.set_index([''], inplace=True)
    # Column labels are read as strings, so convert them to integers
    df.columns = list(map(int, df.columns.values))
    return df

# Pre-disaster periods
pre_1 = load_flow_matrix('9900out.csv')
pre_2 = load_flow_matrix('0001out.csv')
pre_3 = load_flow_matrix('0102out.csv')
pre_4 = load_flow_matrix('0203out.csv')
pre_5 = load_flow_matrix('0304out.csv')
pre_6 = load_flow_matrix('0405out.csv')

# Recovery periods
re_1 = load_flow_matrix('0708out.csv')
re_2 = load_flow_matrix('0809out.csv')
re_3 = load_flow_matrix('0910out.csv')
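
The loading steps can be tried on an in-memory file. The sketch below assumes each csv has an empty first header cell followed by destination codes, which is what the ```'Unnamed: 0'``` rename suggests; the toy contents are invented.

```python
import io
import pandas as pd

# Toy stand-in for one matrix file: the first header cell is empty,
# so pandas names that column 'Unnamed: 0'
toy_csv = io.StringIO(",1001,22071\n1001,50,3\n22071,0,70\n")

toy_matrix = pd.read_csv(toy_csv)
toy_matrix.rename(columns={'Unnamed: 0': ''}, inplace=True)
toy_matrix.set_index([''], inplace=True)

# Column labels are read as strings; convert them to integers to match the index
toy_matrix.columns = list(map(int, toy_matrix.columns.values))
```

After the conversion, rows and columns share the same integer labels, so an expression such as ```toy_matrix.loc[1001, 22071]``` looks up the flow from county 1001 to county 22071.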

We load a set of csv files that contain the FIPS codes for the different groups of counties we will use in the analysis.

In [None]:
disaster_counties_df = pd.read_csv(repo_path + 'county_groups/disaster_kat_counties.csv', usecols = ['fip_code'])
nearby_counties_df = pd.read_csv(repo_path + 'county_groups/nearby_kat_counties.csv', usecols = ['fip_code'])
distant_counties_df = pd.read_csv(repo_path + 'county_groups/distant_kat_counties.csv', usecols = ['fip_code'])
urban_nc_counties_df = pd.read_csv(repo_path + 'county_groups/urban_nc_counties.csv', usecols = ['fip_code'])

We now convert the dataframes into lists and we add one list with all the counties

In [None]:
disaster_counties = list(disaster_counties_df['fip_code'])
nearby_counties = list(nearby_counties_df['fip_code'])
distant_counties = list(distant_counties_df['fip_code'])
urban_nc_counties = list(urban_nc_counties_df['fip_code'])
all_nc_counties = disaster_counties + nearby_counties + distant_counties

Finally, using list comprehensions, we split the groups defined so far into urban and rural counties based on their 2010 Census population. If the proportion of a county's population living in rural areas is 50% or more, we classify the county as rural; otherwise, we classify it as urban. The code below builds the urban subset of each group.

In [None]:
disaster_urban_counties = [x for x in disaster_counties if x in urban_nc_counties]
nearby_urban_counties = [x for x in nearby_counties if x in urban_nc_counties]
distant_urban_counties = [x for x in distant_counties if x in urban_nc_counties]
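
The classification rule can be sketched as follows. The rural shares are invented numbers, and the notebook itself reads the urban list from a csv file rather than computing it, so this is only an illustration of the 50% threshold and of the membership test.

```python
# Hypothetical rural population shares (fraction of residents in rural areas)
rural_share = {1001: 0.42, 1003: 0.50, 22071: 0.01, 28045: 0.73}

# A county is rural when at least half of its population lives in rural areas
urban_counties = [fips for fips, share in rural_share.items() if share < 0.5]

# Converting the urban list to a set makes each membership test constant time
urban_set = set(urban_counties)
group = [22071, 28045, 1003]
group_urban = [fips for fips in group if fips in urban_set]
```

County 1003 sits exactly on the threshold and is therefore treated as rural.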

We now summarize the number of counties in each group:

In [None]:
print('There are', len(disaster_counties), 'disaster counties, of which', 
      len(disaster_urban_counties), 'are urban.' )

print('There are', len(nearby_counties), 'nearby counties, of which', 
      len(nearby_urban_counties), 'are urban.' )

print('There are', len(distant_counties), 'distant counties, of which', 
      len(distant_urban_counties), 'are urban.' )

print('There is a total of', len(all_nc_counties), 'counties, of which', 
      len(urban_nc_counties), 'are urban and the remaining',
      len(all_nc_counties) - len(urban_nc_counties), 'are rural.' )

For later use, we create a list containing all the lists of counties

In [None]:
group_list = [all_nc_counties, 
              disaster_counties, 
              nearby_counties, 
              distant_counties, 
              urban_nc_counties, 
              disaster_urban_counties, 
              nearby_urban_counties, 
              distant_urban_counties]

## Ties Analysis
We create nine dataframes, one for each annual file loaded above, containing the ties that connect each county to the others. Here a tie is defined as the presence of a flow of any size between two counties. We then combine them into one matrix per period, whose rows and columns are the counties in the dataset, where a 1 indicates the presence of a tie between two counties and a 0 its absence. We consider a tie to exist if a positive flow was recorded in at least one of the years composing the before and after periods.

In [None]:
pre_1_ties = pre_1.drop([58000,59000,57009],axis=1).where(pre_1==0,1)
pre_2_ties = pre_2.drop([58000,59000,57009],axis=1).where(pre_2==0,1)
pre_3_ties = pre_3.drop([58000,59000,57009],axis=1).where(pre_3==0,1)
pre_4_ties = pre_4.drop([58000,59000,57009],axis=1).where(pre_4==0,1)
pre_5_ties = pre_5.drop([58000,59000,57009],axis=1).where(pre_5==0,1)
pre_6_ties = pre_6.drop([58000,59000,57009],axis=1).where(pre_6==0,1)

re_1_ties = re_1.drop([58000,59000,57009],axis=1).where(re_1==0,1)
re_2_ties = re_2.drop([58000,59000,57009],axis=1).where(re_2==0,1)
re_3_ties = re_3.drop([58000,59000,57009],axis=1).where(re_3==0,1)

In [None]:
pre_ties = (pre_1_ties + pre_2_ties + pre_3_ties + pre_4_ties + pre_5_ties + pre_6_ties)
pre_ties = pre_ties.where(pre_ties==0,1)

re_ties = (re_1_ties + re_2_ties + re_3_ties)
re_ties = re_ties.where(re_ties==0,1)
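
A two-county toy example shows how the binarising and combining steps behave. The matrices and variable names below are made up for illustration.

```python
import pandas as pd

labels = [1001, 22071]
# Toy flow matrices for two years (rows: origin, columns: destination)
year_a = pd.DataFrame([[10, 0], [0, 7]], index=labels, columns=labels)
year_b = pd.DataFrame([[12, 4], [0, 9]], index=labels, columns=labels)

# where(cond, other) keeps entries where cond holds and writes `other` elsewhere,
# so zeros stay 0 and every non-zero flow becomes 1
ties_a = year_a.where(year_a == 0, 1)
ties_b = year_b.where(year_b == 0, 1)

# Summing counts the years with a flow; binarising again gives "tie in any year"
period_ties = ties_a + ties_b
period_ties = period_ties.where(period_ties == 0, 1)
```

The flow from 1001 to 22071 appears only in the second year, yet it still counts as a tie for the period.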

Next, we isolate the ties that exist in only one of the two periods. Subtracting the pre-disaster matrix from the recovery matrix gives -1 where a tie was present only before Katrina and 1 where it was present only afterwards; ```uties_pre``` and ```uties_re``` mark these two cases with a 1. We also set the main diagonal to 0, since it refers to a county's tie with itself.

In [None]:
ties = re_ties - pre_ties
uties_pre = ties.where(ties==-1,0)*-1
uties_re = ties.where(ties==1,0)
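
The difference step can be checked on a small example. The three-county matrices below are invented, and the names `only_pre` and `only_re` simply mirror the role of ```uties_pre``` and ```uties_re```.

```python
import pandas as pd

labels = [1001, 22071, 28045]
toy_pre = pd.DataFrame([[0, 1, 0], [1, 0, 1], [0, 0, 0]], index=labels, columns=labels)
toy_re = pd.DataFrame([[0, 1, 1], [0, 0, 1], [0, 0, 0]], index=labels, columns=labels)

# -1: tie only before, 0: no change, 1: tie only in recovery
diff = toy_re - toy_pre

only_pre = diff.where(diff == -1, 0) * -1   # keep the -1 entries, then flip them to 1
only_re = diff.where(diff == 1, 0)          # keep the +1 entries
```

Ties present in both periods cancel out in the difference and appear in neither result.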

In [None]:
for i in uties_pre.index:
    uties_pre.loc[i, i] = 0
for i in uties_re.index:
    uties_re.loc[i, i] = 0
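
When a matrix is square and its rows and columns are in the same order, the diagonal can also be cleared in one call with `np.fill_diagonal`. This is a sketch under that assumption, which has not been checked against the notebook's data; the toy matrix is invented.

```python
import numpy as np
import pandas as pd

labels = [1001, 22071, 28045]
toy = pd.DataFrame(np.ones((3, 3), dtype=int), index=labels, columns=labels)

# Work on an explicit copy of the values, zero the diagonal, then rebuild the frame
values = toy.to_numpy(copy=True)
np.fill_diagonal(values, 0)
toy_no_diag = pd.DataFrame(values, index=toy.index, columns=toy.columns)
```

The label-based loop used in the notebook does not rely on row and column order, which makes it the safer choice when the two may differ.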

We set up the table headers

In [None]:
counties_groups = ['All','Disaster Affected','Nearby','Distant',
                   'All (Urban)', 'Disaster Affected (Urban)', 'Nearby (Urban)','Distant (Urban)']
periods = ['Pre-Disaster','Recovery']

We call the ```fill_table``` function to fill and then print the tables summarising ties and flows from different county groups across the two periods. Run ```print(fill_table.__doc__)``` to learn more about the function. In particular, pass ```False``` to the ```change_col``` argument if you don't want the change column, and ```True``` to the ```print_table``` argument if you want to print the table directly from the function.

In [None]:
ties_df = fill_table([uties_pre,uties_re], group_list, 
           disaster_counties, counties_groups, periods, print_table = False)
ties_df
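
The implementation of ```fill_table``` is not shown in this notebook. Purely as an illustration of the kind of aggregation such a table might hold, the sketch below sums the entries going from a set of origin counties to each destination group in each period. The function `count_ties_by_group` is a hypothetical stand-in, not the repository's function, and its output layout is an assumption.

```python
import pandas as pd

def count_ties_by_group(matrices, groups, origins, group_names, period_names):
    """Hypothetical stand-in: sum entries from `origins` rows to each group's columns."""
    table = pd.DataFrame(index=group_names, columns=period_names, dtype=float)
    for period, matrix in zip(period_names, matrices):
        for name, group in zip(group_names, groups):
            # Select the origin rows and the destination columns of this group
            table.loc[name, period] = matrix.loc[origins, group].to_numpy().sum()
    return table

labels = [1, 2, 3]
m_pre = pd.DataFrame([[0, 1, 1], [0, 0, 0], [0, 0, 0]], index=labels, columns=labels)
m_re = pd.DataFrame([[0, 0, 1], [0, 0, 1], [0, 0, 0]], index=labels, columns=labels)

# Toy call: county 1 is the only origin; groups are "All" and a one-county "Nearby"
demo = count_ties_by_group([m_pre, m_re], [[1, 2, 3], [2]], [1],
                           ['All', 'Nearby'], ['Pre-Disaster', 'Recovery'])
```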

To export the table to a csv file, uncomment the following line

In [None]:
#ties_df.to_csv(results_path + 'outties_table_katrina.csv')

## Flows Analysis
We create two dataframes with the same structure as the ones with the ties, but containing the average flows for the two periods respectively

In [None]:
pre_avg = (pre_1 + pre_2 + pre_3 + pre_4 + pre_5 + pre_6)/6
re_avg = (re_1 + re_2 + re_3)/3

pre_avg = pre_avg.round(decimals=0)
re_avg = re_avg.round(decimals=0)

pre_flows = pre_avg
re_flows = re_avg
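
A small example illustrates the averaging and one detail of the last two assignments: plain assignment in Python binds a second name to the same dataframe rather than copying it. The toy matrices below are invented.

```python
import pandas as pd

labels = [1001, 22071]
flows_y1 = pd.DataFrame([[100, 3], [0, 80]], index=labels, columns=labels)
flows_y2 = pd.DataFrame([[110, 7], [2, 90]], index=labels, columns=labels)

# Element-wise mean across years, aligned on row and column labels
toy_avg = ((flows_y1 + flows_y2) / 2).round(decimals=0)

# Plain assignment creates a second name for the same object, not a copy
toy_flows = toy_avg
same_object = toy_flows is toy_avg
```

Because both names refer to one object, in-place changes made through ```pre_flows``` and ```re_flows``` are also visible through ```pre_avg``` and ```re_avg```.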

We remove flows on the main diagonal, as these represent households that remained in the same county and are thus not relevant to our migration analysis

In [None]:
for i in pre_flows.index:
    pre_flows.loc[i, i] = 0
for i in re_flows.index:
    re_flows.loc[i, i] = 0

We fill and print the flow table

In [None]:
flows_df = fill_table([pre_avg,re_avg], group_list, 
           disaster_counties, counties_groups, periods, print_table = False)
flows_df

To export the table to a csv file, uncomment the following line

In [None]:
#flows_df.to_csv(results_path + 'outflows_table_katrina.csv')