# Problem 3

This problem checks that you can perform some basic data cleaning and analysis. You'll work with what we think is a pretty interesting dataset, which can tell us something about how people move within the United States.

This problem has five (5) exercises (numbered 0-4) and is worth a total of ten (10) points.

## Setup: IRS Tax Migration Data

The data for this problem comes from the IRS, which can tell where many households move from or to in any given year based on their tax returns.

For your convenience, we've placed the data files you'll need at the links below. Download them now. There are four files, one for each pair of consecutive tax years between 2011 and 2015.

- 2011-2012 data: https://cse6040.gatech.edu/datasets/stateoutflow1112.csv
- 2012-2013 data: https://cse6040.gatech.edu/datasets/stateoutflow1213.csv
- 2013-2014 data: https://cse6040.gatech.edu/datasets/stateoutflow1314.csv
- 2014-2015 data: https://cse6040.gatech.edu/datasets/stateoutflow1415.csv

These data files reference states by their FIPS codes. So, we'll need some additional data to translate state FIPS numbers to "friendly" names.

- FIPS data: https://cse6040.gatech.edu/datasets/fips-state-2010-census.txt

> These are state-level data; county-level data also exist. If you ever need them, you'll find them at the IRS website: https://www.irs.gov/uac/soi-tax-stats-migration-data. And if you ever need the original FIPS codes data, see the Census Bureau website: https://www.census.gov/geo/reference/codes/cou.html.

Beyond the data, you'll also need the following Python modules.

In [23]:
from IPython.display import display
import pandas as pd

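def tbc (X):
    """Return a canonical copy of X: columns sorted by name, rows sorted by value, index reset."""
    var_names = sorted (X.columns)
    Y = X[var_names].copy ()
    Y.sort_values (by=var_names, inplace=True)
    Y.set_index ([list (range (0, len (Y)))], inplace=True)
    return Y

def tbeq(A, B):
    """Return True if A and B hold the same data, ignoring row and column order."""
    A_c = tbc(A)
    B_c = tbc(B)
    return A_c.eq(B_c).all().all()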

Here is a sneak peek of what one of the data files looks like. Note the encoding specification, which may be needed to get Pandas to parse it.

In [24]:
print ("First few rows...")
display (pd.read_csv ('stateoutflow1112.csv', encoding='latin-1').head (3))

print ("\n...and some from the middle somewhere...")
display (pd.read_csv ('stateoutflow1112.csv', encoding='latin-1').head (1000).tail (3))

First few rows...


Unnamed: 0,y1_statefips,y2_statefips,y2_state,y2_state_name,n1,n2,AGI
0,1,96,AL,AL Total Migration US and Foreign,51971,107304,2109108
1,1,97,AL,AL Total Migration US,50940,105006,2059642
2,1,98,AL,AL Total Migration Foreign,1031,2298,49465



...and some from the middle somewhere...


Unnamed: 0,y1_statefips,y2_statefips,y2_state,y2_state_name,n1,n2,AGI
997,22,13,GA,GEORGIA,2526,4984,83544
998,22,6,CA,CALIFORNIA,2267,3974,89566
999,22,5,AR,ARKANSAS,1355,2851,52356


The `y1_.*` fields describe the state in which the household originated (the "source" vertices) and the `y2_.*` fields describe the state into which the household moved (the "destination"). Column `n1` is the number of such households for the given (source, destination) locations. Notice that there are some special FIPS designators as well, e.g., in the first three rows. These show total outflows, which you can use to normalize counts.
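To make the normalization idea concrete, here is a minimal sketch on a toy frame. The counts are made up for illustration; the only detail taken from the data above is that `y2_statefips == 97` marks a source state's "Total Migration US" row.

```python
import pandas as pd

# Toy rows mimicking the file layout; the counts are invented for illustration.
toy = pd.DataFrame({
    'y1_statefips': [1, 1, 1],
    'y2_statefips': [97, 13, 12],
    'y2_state': ['AL', 'GA', 'FL'],
    'n1': [100, 30, 20],
})

# Code 97 marks the "Total Migration US" row for the source state.
is_total = toy['y2_statefips'] == 97
totals = toy[is_total].set_index('y1_statefips')['n1']

# Divide each ordinary (source, destination) count by the source's total.
flows = toy[~is_total].copy()
flows['share'] = flows['n1'] / flows['y1_statefips'].map(totals)
print(flows[['y2_state', 'n1', 'share']])
```

Here `share` is the fraction of the source state's domestic outflow that went to each destination.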

**Exercise 0** (2 points). The data files are separated by year. Write some code to merge all of the data into a single Pandas data frame called `StateOutFlows`. It should have the same columns as the original data (e.g., `y1_statefips`, `y2_statefips`), plus an additional `year` column to hold the year.

> Represent the year by a 4-digit value, e.g., `2011` rather than just `11`. Also, use the starting year for the file. That is, if the file is the `1314` file, use `2013` as the year.
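As a small illustration of this naming convention, the sketch below derives the 4-digit starting year from a file name. The helper `start_year` is hypothetical (it is not part of the assignment), and it assumes every file name ends in four digits before the extension.

```python
def start_year(filename):
    """Return the 4-digit starting year encoded in a name like 'stateoutflow1314.csv'."""
    stem = filename.rsplit('.', 1)[0]  # drop the extension
    yy = int(stem[-4:-2])              # first two of the trailing four digits
    return 2000 + yy

print(start_year('stateoutflow1314.csv'))
```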

In [25]:
### BEGIN SOLUTION
all_df = []
for yy in range (11, 15):
    # File suffixes are two-digit year pairs: 1112, 1213, 1314, 1415
    filename = "stateoutflow{}{}.csv".format (yy, yy+1)
    df = pd.read_csv (filename, encoding='latin-1')
    df['year'] = 2000 + yy  # starting year of the pair, as four digits
    all_df.append (df)

StateOutFlows = pd.concat (all_df)

StateOutFlows.to_csv ('StateOutFlows_soln.csv', index=False)
### END SOLUTION

In [26]:
assert 'StateOutFlows' in globals ()
assert type (StateOutFlows) is type (pd.DataFrame ())

print ("Found {} outflow records between 2011-2015.".format (len (StateOutFlows)))
print ("First few rows...")
display (StateOutFlows.head ())

StateOutFlows_soln = pd.read_csv ('StateOutFlows_soln.csv')
assert tbeq (StateOutFlows, StateOutFlows_soln)

print ("\n(Passed!)")

Found 11320 outflow records between 2011-2015.
First few rows...


Unnamed: 0,y1_statefips,y2_statefips,y2_state,y2_state_name,n1,n2,AGI,year
0,1,96,AL,AL Total Migration US and Foreign,51971,107304,2109108,2011
1,1,97,AL,AL Total Migration US,50940,105006,2059642,2011
2,1,98,AL,AL Total Migration Foreign,1031,2298,49465,2011
3,1,1,AL,AL Non-migrants,1584665,3603439,87222478,2011
4,1,13,GA,GEORGIA,9920,19470,329213,2011



(Passed!)


Observe that the `y2_state_name` column has some special values.

For instance, suppose you want to know the _total_ number of households that filed returns within the state of Alabama. Evidently, there is a row in each year with `AL Total Migration US and Foreign` as well as an `AL Non-migrants`, the sum of which is presumably the total number of returns.

**Exercise 1** (4 points). Create a new Pandas data frame named `Totals` with one row for each state and year, and the following five (5) columns:

- `st`: The two-letter state abbreviation
- `year`: The year of the observation
- `migrated`: The state's `Total Migration US and Foreign` value during that year
- `stayed`: The state's `Non-migrants` value that year
- `all`: The sum of `migrated` and `stayed` columns

> _Hint:_ Before proceeding, run the cell below and observe how the strings marking total migrations appear.
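When labels for the same quantity are spelled differently across files, one option is to normalize them before filtering. The sketch below is only an illustration on a handful of sample strings: it rewrites `Migration-` as `Migration ` and upper-cases everything so that variants collapse to one spelling. Whether that normalization suits the real data is an assumption you should check against the hint output.

```python
import pandas as pd

labels = pd.Series([
    'GEORGIA',
    'Georgia',
    'GA Total Migration US',
    'GA Total Migration-US',
    'GA Non-migrants',
])

# Replace only the hyphen that follows "Migration", so "Non-migrants" is untouched,
# then upper-case so that "Georgia" and "GEORGIA" compare equal.
normalized = (labels
              .str.replace('Migration-', 'Migration ', regex=False)
              .str.upper())
print(sorted(normalized.unique()))
```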

In [44]:
print ("=== HINT! Observe this hint before proceeding with your solution... ===\n")
print (list (StateOutFlows[StateOutFlows['y2_state'] == 'GA']['y2_state_name'].unique ()))

#
# YOUR CODE HERE
#
# The labels vary by year ("Total Migration US and Foreign" vs.
# "Total Migration-US and Foreign"), so match a space or a hyphen.
migrated=StateOutFlows[StateOutFlows['y2_state_name'].str.contains(r"Total Migration[ -]US and Foreign")]

stayed=StateOutFlows[StateOutFlows['y2_state_name'].str.contains("Non-migrants")]
initial=migrated[['y2_state','year','n1']]
std= stayed[['y2_state','year','n1']]

# Pair each state's migrated count with its stayed count for the same year
Totals=initial.merge(std,on=['y2_state','year'],how='left')
Totals['all']=Totals['n1_x']+Totals['n1_y']

Totals.rename(columns={'y2_state': 'st','n1_x':'migrated','n1_y':'stayed'}, inplace=True)

Totals.head()


=== HINT! Observe this hint before proceeding with your solution... ===

['GEORGIA', 'GA Total Migration US and Foreign', 'GA Total Migration US', 'GA Total Migration Foreign', 'GA Non-migrants', 'Georgia', 'GA Total Migration-US and Foreign', 'GA Total Migration-US', 'GA Total Migration-Foreign', 'GA Total Migration-Same State']


Unnamed: 0,st,year,migrated,stayed,all
0,AL,2011,51971,1584665,1636636
1,AK,2011,19446,258223,277669
2,AZ,2011,91135,2121852,2212987
3,AR,2011,33258,944195,977453
4,CA,2011,266673,13084530,13351203


In [45]:
Totals_soln = pd.read_csv ('Totals_soln.csv')

assert 'Totals' in globals ()
assert type (Totals) is type (Totals_soln)
assert set (Totals.columns) == set (['st', 'year', 'migrated', 'stayed', 'all'])

print ("Some rows of Totals:")
print (Totals.head ())
print ("...")
print (Totals.tail ())

print ("\n({} rows total.)".format (len (Totals)))

assert tbeq (Totals, Totals_soln)

FileNotFoundError: File b'Totals_soln.csv' does not exist

**Exercise 2** (1 point). Load the FIPS codes from `fips-state-2010-census.txt`. Store them in a Pandas data frame named `FIPS`. Use the original column names from the input file: `STATE`, `STUSAB`, `STATE_NAME`, `STATENS`.

> Hint: You can use Pandas's [`read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to read the file. However, be sure to take a look at the file before you try to load it, so you know how to parse by setting the arguments of `read_csv()` appropriately.
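For a sense of which `read_csv()` arguments matter, here is a sketch that parses a small in-memory sample. The pipe (`|`) delimiter and the `STATENS` values are assumptions made for illustration; inspect the actual file to confirm its delimiter before relying on this.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the STATENS values are placeholders.
sample = io.StringIO(
    "STATE|STUSAB|STATE_NAME|STATENS\n"
    "12|FL|Florida|00000012\n"
    "13|GA|Georgia|00000013\n"
)

# `sep` sets the field delimiter; reading STATENS as str keeps leading zeros.
demo = pd.read_csv(sample, sep='|', dtype={'STATENS': str})
print(demo)
```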

In [29]:
#
# YOUR CODE HERE
#


In [None]:
assert 'FIPS' in globals ()
assert type (FIPS) is type (pd.DataFrame ())
assert len (FIPS) == 57

print ("FIPS data frame, at location 10:\n")
print (FIPS.loc[10])
assert FIPS.loc[10, 'STATE_NAME'] == 'Georgia'

print ("\n(Passed!)")

Inspect the test code above. Notice that the FIPS code for Georgia is 13, which is located at index position 10 of the data frame (i.e., at `FIPS.loc[10]`).

It would help if the index of the data frame were the FIPS state code (`STATE`) itself. That way, you could use `FIPS.loc[13]` to get the record for Georgia, in effect turning the data frame into something similar to a Python dictionary keyed by FIPS code.

**Exercise 3** (1 point). Convert the `STATE` column into an index. To do so, use the Pandas method, [`FIPS.set_index()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html). Set the arguments to `set_index()` so that the change is made in-place.
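The dictionary-like behavior is easy to see on a toy frame. In this sketch, `toy` and `lookup` are illustrative names, not part of the assignment.

```python
import pandas as pd

toy = pd.DataFrame({
    'STATE': [12, 13],
    'STUSAB': ['FL', 'GA'],
    'STATE_NAME': ['Florida', 'Georgia'],
})

# Make the FIPS code the row label, modifying `toy` in place.
toy.set_index('STATE', inplace=True)

# Label-based lookup now works directly on the FIPS code.
print(toy.loc[13, 'STATE_NAME'])

# A single column can also be turned into a plain dict keyed by FIPS code.
lookup = toy['STUSAB'].to_dict()
print(lookup)
```

Note that after `set_index()`, `STATE` is no longer among `toy.columns`; it lives in `toy.index` instead.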

In [None]:
#
# YOUR CODE HERE
#


In [None]:
display (FIPS[10:15])

assert set (FIPS.columns) == set (['STUSAB', 'STATE_NAME', 'STATENS'])
assert FIPS.loc[13, 'STATE_NAME'] == 'Georgia'
assert FIPS.loc[15, 'STATE_NAME'] == 'Hawaii'
print ("\n(Passed!)")

## Migration edges

Using the code you've set up above, we can build a table of _migration edges_, that is, a succinct summary of the number of households that moved from one state to another, broken down by year. The following code cell does that, leaving the result in a Pandas data frame called `MigrationEdges`.

In [None]:
Edges = StateOutFlows[['y1_statefips', 'y2_state', 'year', 'n1']]
Edges = pd.merge (Edges, FIPS[['STUSAB']],
                  left_on='y1_statefips', right_index=True)
Edges.rename (columns={'STUSAB': 'from', 'y2_state': 'to', 'n1': 'moved'}, inplace=True)
del Edges['y1_statefips']

MigrationEdges = Edges[Edges['from'] != Edges['to']]
MigrationEdges.head ()

Using the `MigrationEdges` data frame, we can (relatively) easily determine the top 5 states whose households moved to the state of Georgia over all years. Here is one way to do so:

1. Filter rows keeping only those containing `'GA'` as the destination.
2. Group the results by originating state.
3. Sum the results over all years.
4. Sort these results in descending order.
5. Emit just the top 5 results.

In [None]:
# Steps 1 and 2
ToGA = (MigrationEdges['to'] == 'GA')
MovedToGA = MigrationEdges[ToGA].groupby ('from')

# Step 3
MovedToGA_counts_by_state = MovedToGA['moved'].sum ()
MovedToGA_counts_by_state[:10]

In [None]:
# Steps 4 and 5: Sort and report the top 5
MovedToGA_counts_by_state.sort_values (ascending=False)[:5]
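The same five steps can also be written as one chained expression. The sketch below uses a small made-up edge table rather than `MigrationEdges`, and it uses `Series.nlargest()` in place of sorting and slicing.

```python
import pandas as pd

# Made-up edges for illustration only.
edges = pd.DataFrame({
    'from': ['FL', 'FL', 'TX', 'AL', 'GA'],
    'to': ['GA', 'GA', 'GA', 'GA', 'FL'],
    'year': [2011, 2012, 2011, 2011, 2011],
    'moved': [10, 15, 7, 3, 99],
})

top = (edges[edges['to'] == 'GA']   # Step 1: keep rows with GA as destination
       .groupby('from')['moved']    # Step 2: group by originating state
       .sum()                       # Step 3: sum over all years
       .nlargest(2))                # Steps 4-5: largest first, keep the top 2
print(top)
```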

**Exercise 4** (2 points). Following a similar procedure, determine the top 5 states that Georgians moved to. Store the resulting names and counts in a variable named `GAExodus`.

In [None]:
#
# YOUR CODE HERE
#


In [None]:
assert 'GAExodus' in globals ()
assert type (GAExodus) is type (pd.Series ())
assert len (GAExodus) == 5

print ("=== The exodus from Georgia ===")
assert set (GAExodus.index) == set (['FL', 'TX', 'AL', 'NC', 'SC'])
assert (GAExodus.values == [86178, 50467, 32970, 30352, 30141]).all ()
print (GAExodus)

print ("\n(Passed!)")

**Fin!** If you've reached this point and all tests above pass, you are ready to submit your solution to this problem. Don't forget to save your work prior to submitting.