# Regression Analysis
## Set Up
We import the ```pandas```, ```numpy```, ```csv```, ```os```, and ```tqdm``` libraries.

In [None]:
import pandas as pd
import numpy as np
import csv
import os
from tqdm import tqdm, tnrange, tqdm_notebook

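We store the path to the ```IRS_migration_data``` repository folder in a string variable. The slice below drops the last seven characters of the current working directory, so it assumes the notebook is run from a subfolder of the repository whose name is seven characters long.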

In [None]:
repo_path = os.getcwd()[0:len(os.getcwd())-7]
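
The slice above depends on the length of the subfolder name. A minimal sketch of a length-independent alternative, assuming the notebook sits exactly one level below the repository root; ```parent_folder``` and ```repo_path_alt``` are hypothetical names, not part of the original script:

```python
import os

def parent_folder(path):
    """Return the parent directory of `path`, with a trailing separator."""
    # normpath drops any trailing separator so dirname really moves one level up
    parent = os.path.dirname(os.path.normpath(path))
    # Joining with '' appends the separator that later string concatenations rely on
    return os.path.join(parent, '')

repo_path_alt = parent_folder(os.getcwd())
```

The trailing separator is kept because the rest of the notebook builds paths with plain string concatenation such as ```repo_path + 'regression_data/'```.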

We also create a new ```regression_data``` folder to store the datasets produced by this script. It is a subfolder of your ```IRS_migration_data``` repository. If the folder already exists, no new one is created.

In [None]:
results_path = repo_path + 'regression_data/'
if not os.path.exists(results_path):
    os.makedirs(results_path)
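
The existence check and the folder creation can also be written as a single call. A small sketch using the ```exist_ok``` argument of ```os.makedirs```; the temporary directory and the names ```demo_path``` and ```folder_created``` are stand-ins for illustration only:

```python
import os
import tempfile

# A temporary directory stands in for repo_path so the sketch does not touch the repository
with tempfile.TemporaryDirectory() as tmp_root:
    demo_path = os.path.join(tmp_root, 'regression_data')
    os.makedirs(demo_path, exist_ok=True)   # creates the folder
    os.makedirs(demo_path, exist_ok=True)   # folder already exists: no error is raised
    folder_created = os.path.isdir(demo_path)
```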

We load the outflow and inflow data from csv files.
They cover the period 1998-2015.

In [None]:
outflow_df = pd.read_csv(repo_path + 'outflows/outflow.csv')
inflow_df = pd.read_csv(repo_path + 'inflows/inflow.csv')

We print the first five rows of the ```inflow_df``` dataframe we just created to see how it is structured.

In [None]:
inflow_df.head()

## Creating the Population Dataframe
As a first step, we create a dataframe with the county FIPS codes as the index and the years as columns. We then fill it with the population of each county in each year, extracted from the IRS data organised in the ```inflow_df``` dataframe.

First, we create an array with all the periods in the sample.

In [None]:
years = pd.unique(inflow_df['year'].values)

Then we create a set with all the FIPS codes for the counties in the data.

In [None]:
destinations = inflow_df[(inflow_df['state_code_dest']<=56) & ((inflow_df['state_code_origin']<=56))]
destination_codes = pd.unique(destinations['destination'].values)
destination_codes = set(destination_codes)

We drop the District of Columbia, whose particular status complicates the definition of migration flows into and out of it.

In [None]:
destination_codes = destination_codes - {11001}

We create a smaller dataframe starting from ```inflow_df``` that contains for each county and each year only the two rows necessary to compute the total population:

* the number of non-migrants;
* the total number of migrants.

In [None]:
pop_data = inflow_df[(inflow_df['destination'].isin(destination_codes))
                            & ((inflow_df['origin']==inflow_df['destination']) | (inflow_df['origin']==96000))]

We create two population dataframes, one for households, the other for individuals.

In [None]:
# A set has no defined order, so we sort the codes to get a reproducible row order
population_hh = pd.DataFrame(0, index=sorted(destination_codes), columns=years)
population_in = pd.DataFrame(0, index=sorted(destination_codes), columns=years)

And finally we fill them.

In [None]:
for year in tqdm_notebook(years, desc='year loop'):
    for county in tqdm_notebook(destination_codes, desc='county loop'):
        # Select the rows for this county and year once, then sum each column
        rows = pop_data[(pop_data['year']==year) & (pop_data['destination']==county)]
        # .loc[row, column] assigns in place; chained indexing may write to a copy
        population_hh.loc[county, year] = rows['return_num'].sum()
        population_in.loc[county, year] = rows['exmpt_num'].sum()
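
The nested loop above filters the dataframe once per county and year, which is why it is slow. A minimal sketch of a grouped alternative, run here on invented toy data rather than the IRS files; ```population_table``` and the ```toy_*``` names are hypothetical:

```python
import pandas as pd

# Toy stand-in for pop_data; the numbers are invented for illustration
toy_pop_data = pd.DataFrame({
    'year':        [1998, 1998, 1998, 1999, 1999],
    'destination': [1001, 1001, 1003, 1001, 1003],
    'return_num':  [10, 2, 7, 11, 8],
    'exmpt_num':   [25, 5, 15, 27, 16],
})

def population_table(data, value_col, counties):
    """Sum value_col by county and year; counties become rows, years become columns."""
    table = (data.groupby(['destination', 'year'])[value_col]
                 .sum()
                 .unstack('year', fill_value=0))
    # Counties without any matching row get a row of zeros, as in the loop version
    return table.reindex(sorted(counties), fill_value=0)

toy_hh = population_table(toy_pop_data, 'return_num', {1001, 1003, 1005})
toy_in = population_table(toy_pop_data, 'exmpt_num', {1001, 1003, 1005})
```

On the real data the call would presumably be ```population_table(pop_data, 'return_num', destination_codes)```, but that has not been checked against the IRS files here.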

Given how long it takes to fill them, we save them to csv files so that we can access them immediately afterwards.

In [None]:
population_hh.to_csv(results_path + 'population_hh.csv')
population_in.to_csv(results_path + 'population_in.csv')

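We now reload the same dataframes directly from the csv files so that, once the population dataframes have been created the first time, we can skip the slow filling step and start from this point. Because ```read_csv``` returns the year column labels as strings, we cast them back to the dtype of ```years``` so that later lookups by year still work.

In [None]:
population_hh = pd.read_csv(results_path + 'population_hh.csv', index_col = 0)
population_in = pd.read_csv(results_path + 'population_in.csv', index_col = 0)
population_hh.columns = population_hh.columns.astype(years.dtype)
population_in.columns = population_in.columns.astype(years.dtype)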

## Creating the Migration Dataframes

For descriptive purposes, we also construct three matrices for each direction of migration (inflows and outflows), containing respectively:

1. the number of migrants moving to a different county while remaining in the same state;
2. the number of migrants crossing a state boundary;
3. and the number of migrants moving from/to outside the US.

We start by creating smaller dataframes from ```inflow_df``` and ```outflow_df``` containing only the information we need.

In [None]:
immigration_data = inflow_df[(inflow_df['destination'].isin(set(destination_codes)))
                            & ((inflow_df['origin']==97001) | 
                               (inflow_df['origin']==97003) |
                               (inflow_df['origin']==98000))]

In [None]:
outmigration_data = outflow_df[(outflow_df['origin'].isin(set(destination_codes)))
                            & ((outflow_df['destination']==97001) | 
                               (outflow_df['destination']==97003) |
                               (outflow_df['destination']==98000))]

In [None]:
# A set has no defined order, so we sort the codes to get a reproducible row order
immigration_ic = pd.DataFrame(0, index=sorted(destination_codes), columns=years)
immigration_is = pd.DataFrame(0, index=sorted(destination_codes), columns=years)
immigration_ab = pd.DataFrame(0, index=sorted(destination_codes), columns=years)

In [None]:
# A set has no defined order, so we sort the codes to get a reproducible row order
outmigration_ic = pd.DataFrame(0, index=sorted(destination_codes), columns=years)
outmigration_is = pd.DataFrame(0, index=sorted(destination_codes), columns=years)
outmigration_ab = pd.DataFrame(0, index=sorted(destination_codes), columns=years)

In [None]:
for year in tqdm_notebook(years, desc='year loop'):
    for county in tqdm_notebook(destination_codes, desc='county loop'):
        # Inflows into this county in this year
        rows = immigration_data[(immigration_data['year']==year) &
                                (immigration_data['destination']==county)]
        # Origin codes: 97001 = same state, 97003 = different state, 98000 = abroad
        # .loc[row, column] assigns in place; chained indexing may write to a copy
        immigration_ic.loc[county, year] = rows[rows['origin']==97001]['exmpt_num'].sum()
        immigration_is.loc[county, year] = rows[rows['origin']==97003]['exmpt_num'].sum()
        immigration_ab.loc[county, year] = rows[rows['origin']==98000]['exmpt_num'].sum()

In [None]:
for year in tqdm_notebook(years, desc='year loop'):
    for county in tqdm_notebook(destination_codes, desc='county loop'):
        # Outflows from this county in this year
        rows = outmigration_data[(outmigration_data['year']==year) &
                                 (outmigration_data['origin']==county)]
        # Destination codes: 97001 = same state, 97003 = different state, 98000 = abroad
        # .loc[row, column] assigns in place; chained indexing may write to a copy
        outmigration_ic.loc[county, year] = rows[rows['destination']==97001]['exmpt_num'].sum()
        outmigration_is.loc[county, year] = rows[rows['destination']==97003]['exmpt_num'].sum()
        outmigration_ab.loc[county, year] = rows[rows['destination']==98000]['exmpt_num'].sum()
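
The two loops above differ only in which column identifies the county and which holds the summary code. A minimal sketch of how both could be expressed with one grouped helper, again on invented toy data; ```FLOW_TYPES```, ```flows_by_type``` and the ```toy_*``` names are hypothetical, while the three codes and the ```ic```/```is```/```ab``` suffixes are the ones used in the cells above:

```python
import pandas as pd

# Summary codes used in the cells above, mapped to the matrix suffixes
FLOW_TYPES = {97001: 'ic', 97003: 'is', 98000: 'ab'}

# Toy stand-in for immigration_data; the numbers are invented
toy_inflows = pd.DataFrame({
    'year':        [1998, 1998, 1998, 1999],
    'destination': [1001, 1001, 1001, 1001],
    'origin':      [97001, 97003, 98000, 97001],
    'exmpt_num':   [40, 30, 5, 42],
})

def flows_by_type(data, county_col, partner_col):
    """Return one county-by-year table of summed exmpt_num per flow type."""
    tables = {}
    for code, suffix in FLOW_TYPES.items():
        subset = data[data[partner_col] == code]
        tables[suffix] = (subset.groupby([county_col, 'year'])['exmpt_num']
                                .sum()
                                .unstack('year', fill_value=0))
    return tables

toy_immigration = flows_by_type(toy_inflows, 'destination', 'origin')
```

For outflows the roles swap, so the call would be ```flows_by_type(outmigration_data, 'origin', 'destination')```. Unlike the loop version, counties or years with no rows are simply absent from these tables, so a ```reindex``` with ```fill_value=0``` would be needed to recover the full grid.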

We export the six matrices ```immigration_ic```, ```immigration_is```, ```immigration_ab```, ```outmigration_ic```, ```outmigration_is```, and ```outmigration_ab``` to csv files.

In [None]:
immigration_ic.to_csv(results_path + 'immigration_ic.csv')
immigration_is.to_csv(results_path + 'immigration_is.csv')
immigration_ab.to_csv(results_path + 'immigration_ab.csv')

In [None]:
outmigration_ic.to_csv(results_path + 'outmigration_ic.csv')
outmigration_is.to_csv(results_path + 'outmigration_is.csv')
outmigration_ab.to_csv(results_path + 'outmigration_ab.csv')

## Creating the Dataset
We now set up the final dataset, which will contain the year, destination code, origin code, household and individual flows, population at the destination, and population at the origin. The population columns will be added in a second step.

In [None]:
reg_data = inflow_df[(inflow_df['destination'].isin(set(destination_codes))) &
                     (inflow_df['origin'].isin(set(destination_codes))) &
                     (inflow_df['destination']!=inflow_df['origin'])]
reg_data = reg_data[['year', 'destination', 'origin', 'return_num', 'exmpt_num']]
reg_data.reset_index(drop=True, inplace=True)

We set up the ```pop_cols_hh``` dataframe. It restructures the data in the ```population_hh``` dataframe so that it can be merged with the ```reg_data``` dataframe.

In [None]:
pop_cols_hh = pd.DataFrame(index=range(0,len(destination_codes)*years.size), columns=['year','county','pop_hh'])

Now we fill it.

In [None]:
i = 0

for county in destination_codes:
    for year in years:
        # .loc[row, column] writes into pop_cols_hh itself rather than a possible copy
        pop_cols_hh.loc[i, 'year'] = year
        pop_cols_hh.loc[i, 'county'] = county
        pop_cols_hh.loc[i, 'pop_hh'] = population_hh.loc[county, year]

        i = i+1

We repeat the operation with the ```population_in``` dataframe.

In [None]:
pop_cols_in = pd.DataFrame(index=range(0,len(destination_codes)*years.size), columns=['year','county','pop_in'])

In [None]:
i = 0

for county in destination_codes:
    for year in years:
        # .loc[row, column] writes into pop_cols_in itself rather than a possible copy
        pop_cols_in.loc[i, 'year'] = year
        pop_cols_in.loc[i, 'county'] = county
        pop_cols_in.loc[i, 'pop_in'] = population_in.loc[county, year]

        i = i+1
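
The restructuring done by the two loops above is a wide-to-long reshape. A minimal sketch with ```DataFrame.melt``` on invented toy data; ```wide_to_long``` and the ```toy_*``` names are hypothetical:

```python
import pandas as pd

# Toy stand-in for population_hh: counties as rows, years as columns
toy_population_hh = pd.DataFrame({1998: [12, 7], 1999: [11, 8]},
                                 index=[1001, 1003])

def wide_to_long(wide_df, value_name):
    """Turn a county-by-year table into one row per (year, county) pair."""
    long_df = (wide_df.melt(var_name='year', value_name=value_name,
                            ignore_index=False)   # keep the county codes as index
                      .rename_axis('county')
                      .reset_index())
    return long_df[['year', 'county', value_name]]

toy_pop_cols_hh = wide_to_long(toy_population_hh, 'pop_hh')
```

The row order differs from the loop version (rows are grouped by year instead of by county), which does not matter for the merges that follow.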

We now merge the two dataframes to create first the column with the population at the destination and then the one with the population at the origin.

In [None]:
reg_data.rename(columns={'destination':'county'}, inplace=True)

In [None]:
result = pd.merge(reg_data, pop_cols_hh, how='left', on=['year', 'county'])
result = pd.merge(result, pop_cols_in, how='left', on=['year', 'county'])
result.rename(columns={'county':'destination', 'origin':'county', 
                       'pop_hh':'pop_destination_hh', 'pop_in':'pop_destination_in'}, inplace=True)

In [None]:
result = pd.merge(result, pop_cols_hh, how='left', on=['year', 'county'])
result = pd.merge(result, pop_cols_in, how='left', on=['year', 'county'])
result.rename(columns={'county':'origin',
                       'pop_hh':'pop_origin_hh', 'pop_in':'pop_origin_in'}, inplace=True)
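
The rename, merge, rename pattern above can be wrapped in a small helper so that the flow table itself never has to be renamed. This is only a sketch on invented toy data; ```attach_population``` and the ```toy_*``` names are hypothetical, and the output column names follow the ```pop_destination_hh``` convention used above:

```python
import pandas as pd

# Toy stand-ins for reg_data and pop_cols_hh; the numbers are invented
toy_flows = pd.DataFrame({
    'year':        [1998, 1998],
    'destination': [1001, 1003],
    'origin':      [1003, 1001],
    'return_num':  [4, 6],
})
toy_pop = pd.DataFrame({
    'year':   [1998, 1998],
    'county': [1001, 1003],
    'pop_hh': [12, 7],
})

def attach_population(flows, pop_long, role, value_col):
    """Left-join pop_long onto flows for role ('destination' or 'origin')."""
    suffix = value_col.split('_')[-1]   # 'hh' or 'in'
    renamed = pop_long.rename(columns={'county': role,
                                       value_col: f'pop_{role}_{suffix}'})
    # validate raises if pop_long has more than one row per (year, county)
    return flows.merge(renamed, how='left', on=['year', role],
                       validate='many_to_one')

toy_merged = attach_population(toy_flows, toy_pop, 'destination', 'pop_hh')
toy_merged = attach_population(toy_merged, toy_pop, 'origin', 'pop_hh')
```

The ```validate``` argument is an optional safeguard: a duplicated county-year pair in the population table would otherwise silently duplicate flow rows.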

The last step is to add a group variable and a treatment variable.

In [None]:
disaster_sandy_counties = pd.read_csv(repo_path + 'county_groups/disaster_sandy_counties.csv', usecols = ['fip_code'])
nearby_sandy_counties = pd.read_csv(repo_path + 'county_groups/nearby_sandy_counties.csv', usecols = ['fip_code'])
distant_sandy_counties = pd.read_csv(repo_path + 'county_groups/distant_sandy_counties.csv', usecols = ['fip_code'])

In [None]:
disaster_kat_counties = pd.read_csv(repo_path + 'county_groups/disaster_kat_counties.csv', usecols = ['fip_code'])
nearby_kat_counties = pd.read_csv(repo_path + 'county_groups/nearby_kat_counties.csv', usecols = ['fip_code'])
distant_kat_counties = pd.read_csv(repo_path + 'county_groups/distant_kat_counties.csv', usecols = ['fip_code'])

In [None]:
all_nc_urban_counties = pd.read_csv(repo_path + 'county_groups/urban_nc_counties.csv', usecols = ['fip_code'])
coastal_counties = pd.read_csv(repo_path + 'county_groups/coastline_counties.csv', usecols = ['fip_code'])

In [None]:
county_groups = [disaster_sandy_counties, nearby_sandy_counties, distant_sandy_counties,
                 disaster_kat_counties, nearby_kat_counties, distant_kat_counties]
groups = ['disaster', 'nearby', 'distant']

In [None]:
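# county_groups lists the Sandy groups first and then the Katrina groups, in the
# same disaster/nearby/distant order, so the label cycles every three dataframes
for position, df in enumerate(county_groups):
    df['group'] = groups[position % 3]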


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
sandy_group = pd.concat(county_groups[0:3])

katrina_group = pd.concat(county_groups[3:6])
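
A minimal sketch of the labelling and stacking steps combined, using ```DataFrame.assign``` and ```pd.concat```. The FIPS codes below are placeholders, and ```label_and_stack``` and the ```toy_*``` names are hypothetical:

```python
import pandas as pd

# Placeholder county lists, one dataframe per group label
toy_groups = {
    'disaster': pd.DataFrame({'fip_code': [34001, 34003]}),
    'nearby':   pd.DataFrame({'fip_code': [42001]}),
    'distant':  pd.DataFrame({'fip_code': [6001]}),
}

def label_and_stack(frames_by_label):
    """Add a 'group' column holding each dataframe's label, then stack them."""
    # assign returns a new dataframe, so the inputs are left unchanged
    labelled = [df.assign(group=label) for label, df in frames_by_label.items()]
    return pd.concat(labelled, ignore_index=True)

toy_event_group = label_and_stack(toy_groups)
```

Keying the dataframes by label makes the pairing explicit instead of relying on their position in a list.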

In [None]:
all_nc_urban_counties['urban'] = 'urban'
coastal_counties['coastal'] = 'coastal'

In [None]:
result.rename(columns={'destination':'fip_code'}, inplace=True)

result = pd.merge(result, sandy_group, how='left', on=['fip_code'])
result.rename(columns={'group':'sandy_group_dest'}, inplace=True)

result = pd.merge(result, katrina_group, how='left', on=['fip_code'])
result.rename(columns={'group':'kat_group_dest'}, inplace=True)

result = pd.merge(result, all_nc_urban_counties, how='left', on=['fip_code'])
result.rename(columns={'urban':'urban_dest'}, inplace=True)
result['urban_dest'] = result['urban_dest'].where(result['urban_dest']=='urban', 'rural')

result = pd.merge(result, coastal_counties, how='left', on=['fip_code'])
result.rename(columns={'coastal':'coastal_dest'}, inplace=True)
result['coastal_dest'] = result['coastal_dest'].where(result['coastal_dest']=='coastal', 'continental')

In [None]:
result.rename(columns={'fip_code':'destination'}, inplace=True)
result.rename(columns={'origin':'fip_code'}, inplace=True)

result = pd.merge(result, sandy_group, how='left', on=['fip_code'])
result.rename(columns={'group':'sandy_group_origin'}, inplace=True)

result = pd.merge(result, katrina_group, how='left', on=['fip_code'])
result.rename(columns={'group':'kat_group_origin'}, inplace=True)

result = pd.merge(result, all_nc_urban_counties, how='left', on=['fip_code'])
result.rename(columns={'urban':'urban_origin'}, inplace=True)
result['urban_origin'] = result['urban_origin'].where(result['urban_origin']=='urban', 'rural')

result = pd.merge(result, coastal_counties, how='left', on=['fip_code'])
result.rename(columns={'coastal':'coastal_origin'}, inplace=True)
result['coastal_origin'] = result['coastal_origin'].where(result['coastal_origin']=='coastal', 'continental')

result.rename(columns={'fip_code':'origin'}, inplace=True)
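
For the two-valued indicators (urban/rural and coastal/continental), a membership test gives the same labels without a merge. A minimal sketch on invented toy data, assuming the county lists are single-column dataframes as loaded above; ```flag_membership``` and the ```toy_*``` names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy stand-ins; the duplicate in toy_urban is deliberate
toy_result = pd.DataFrame({'destination': [1001, 1003], 'origin': [1003, 1005]})
toy_urban = pd.DataFrame({'fip_code': [1003, 1003]})

def flag_membership(codes, members, yes_label, no_label):
    """Label each code by whether it appears in members."""
    return np.where(codes.isin(members), yes_label, no_label)

toy_result['urban_dest'] = flag_membership(
    toy_result['destination'], toy_urban['fip_code'], 'urban', 'rural')
toy_result['urban_origin'] = flag_membership(
    toy_result['origin'], toy_urban['fip_code'], 'urban', 'rural')
```

A left merge would duplicate flow rows if a county appeared twice in a group file, whereas ```isin``` leaves the number of rows unchanged.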

We store the results in a dataframe called ```gravity_data```, which can then be used to estimate the gravity model.

In [None]:
gravity_data = result

In [None]:
gravity_data.head()

Finally, we export the ```gravity_data``` dataframe to a csv file, on which we will perform the regression analysis using Stata.

In [None]:
gravity_data.to_csv(results_path + 'gravity_data.csv', index = False)