## 1 Bringing in our Data

The following cells should be starting to look familiar to you, so I'm not going to annotate them here.  They a) bring in our libraries, b) read in our .csv files, and c) rename our variables.

In [None]:
#Call our libraries; note, we are adding some libraries to our notebook

import numpy as np
import pandas as pd
import math
from scipy import stats
from scipy.stats import ttest_ind
from scipy.stats import t
#from datascience import *

import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns

pd.options.display.float_format = '{:.4f}'.format

In [None]:
person_df=pd.read_csv("sf_acs_pums_p.csv")
person_df.head()

In [None]:
person_df.rename(columns={"sporder":"p_number",
                          "wagp":"wages",
                          "occp":"occupation",
                          "rac1p":"race",
                          "fhisp":"hispanic"}, inplace=True)
person_df.head()

In [None]:
housingunit_df=pd.read_csv("sf_acs_pums_h.csv")
housingunit_df.head()

In [None]:
housingunit_df.rename(columns={"bld":"buildingsize",
                               "rntp":"rent",
                               "ybl":"yearbuilt"}, inplace=True)
housingunit_df.head()

## 2 Merging Data

### 2.1 Figuring Out What You're Merging

We generally merge datasets based on a ***key***, or common variable.  In this case, both datasets have a serialno, which is the identifier for the housing unit that the Census Bureau sampled for the American Community Survey (ACS).  When you look at the housing unit dataset, you see that each serialno is unique.  But in the person dataset, you have four rows with the serial number 2013000000253 - one for each person in the household, as indicated by the person number.

The options for merging are:

a ***one to one merge***: this is when both files have only one observation per key value. For example, both data files contain aggregate data at the census tract level, and you want to merge by the tract FIPS code.

a ***one to many merge***: this is when you want to merge a dataset with one observation per key value to one with several observations per key value.  A good example here is if I want to attach the data on the housing unit (**one** record per unit) to every person in the dataset (**many** people per unit).

a ***many to one merge***: this is when you want to merge a dataset with many observations per key value to one with just a single observation. A good example here is if I want to attach the data on the people (**many** records per unit) to every housing unit in the dataset (**one** per unit). 

Let's look at the code for all three and see what happens!
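
Before we get to the real data, here is a minimal sketch of the three merge types on toy data. The two tiny dataframes and their values are made up for illustration (they are not from the PUMS files). The `validate` argument of `pd.merge` asks pandas to check that the keys really follow the pattern you expect, and to raise an error if they don't.

```python
import pandas as pd

# Toy data (hypothetical values, not from the PUMS files)
units = pd.DataFrame({"serialno": [1, 2], "rent": [1500, 2200]})
people = pd.DataFrame({"serialno": [1, 1, 2], "p_number": [1, 2, 1]})

# One housing unit row matches many person rows: a one to many merge.
one_to_many = pd.merge(units, people, on="serialno", validate="one_to_many")

# Swapping the order of the dataframes makes it a many to one merge.
many_to_one = pd.merge(people, units, on="serialno", validate="many_to_one")

# Claiming one to one is wrong here, because serialno repeats in people,
# so pandas raises a MergeError instead of silently duplicating rows.
try:
    pd.merge(units, people, on="serialno", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(one_to_many), len(many_to_one), one_to_one_ok)
```

Adding `validate` is optional, but it is a cheap way to catch a key that isn't as unique as you thought.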

### 2.2 Merging One to One

First, let's look at the easiest scenario - we just want to merge one to one!  In our practice data, I'm going to drop everyone but the head of household, so I have the same number of observations of my key variable for my housing units and my person file.

In [None]:
# Here I drop the rows (index) where the person variable in the person_df is not equal to 1
# I'm going to rename the dataframe so I can use the "raw" person data later.
head_df=person_df.drop(person_df[person_df['p_number'] != 1].index)
head_df.head()

In [None]:
pums_person_unit_df=pd.merge(head_df, housingunit_df, on="serialno")
pums_person_unit_df.head()

### 2.3 One to Many Merge

This time, I want to add the "housing unit" information to every person in my dataset.  (For example, if I want to know if children are more likely to live in single-family versus larger apartment buildings, I might want to know what kind of building every child in the dataset lives in.)

Honestly, merging is the first task I've found that's easier in Python than in SAS or Stata!!

In [None]:
personwithhousing_df=pd.merge(person_df, housingunit_df, on="serialno")
personwithhousing_df.head(20)
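
One thing to keep in mind: by default, `pd.merge` only keeps keys that appear in both dataframes (an "inner" merge). The sketch below uses made-up toy data to show how `how="left"` keeps every person, and how `indicator=True` adds a `_merge` column that flags rows with no match. I'm assuming here that you might have people without a housing record; I haven't checked whether that happens in our files.

```python
import pandas as pd

# Toy data (hypothetical): person 1 in unit 3 has no housing record
people = pd.DataFrame({"serialno": [1, 1, 2, 3], "p_number": [1, 2, 1, 1]})
units = pd.DataFrame({"serialno": [1, 2], "buildingsize": [2, 5]})

# Default is how="inner": only serialno values found in both frames survive
inner = pd.merge(people, units, on="serialno")

# how="left" keeps every row of the left frame (people);
# indicator=True adds a _merge column saying where each row came from
left = pd.merge(people, units, on="serialno", how="left", indicator=True)

# Rows marked "left_only" are people with no matching housing unit
unmatched = left[left["_merge"] == "left_only"]

print(len(inner), len(left), unmatched["serialno"].tolist())
```

Comparing the row counts before and after a merge is a quick sanity check that you didn't lose (or duplicate) observations.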

### 2.4 OK - What about Many to One?

In general, we want to "group by" first - for example, I want to know how many people are in the housing unit.  But let's say I really just want to attach all the people to each housing unit.  I can just swap the order of the merge.  But, in reality, this just looks like my one to many merge above, it's just that the first columns contain the building information rather than the person information.

In [None]:
housing_allpeeps_df=pd.merge(housingunit_df, person_df, on="serialno")
housing_allpeeps_df.head(20)

## 3 Grouping "by" to Get New Data

More often, we want to group a number of observations by a certain characteristic.  For example, in this case, I might want to know the number of people in the household.  Others of you will want to count how many evictions or traffic accidents are happening in a census tract or zip code.  Once I merge the data, I can "aggregate" the information using the same "group by" concept we used when we were calculating descriptive statistics.

In [None]:
#In this code, I start by grouping my data by unit, or serialno
#Alternately, you could do this by fips tract
df_by_unit=housing_allpeeps_df.groupby("serialno")

In [None]:
# Now, I'm going to tell Python how I want to aggregate the various columns,
# as well as what I want the new variable to be named
housing_allpeeps_df["hh_size"] = df_by_unit["p_number"].transform("count")
housing_allpeeps_df["total_wages"] =df_by_unit["wages"].transform("sum")
housing_allpeeps_df.head(20)
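
To see what `transform` is doing, here is a small sketch on toy data (the values are made up). `transform` hands back one value for every original row, which is why the household totals repeat for each person. As a contrast, `agg` collapses the data to one row per group. The column names mirror ours, but the dataframe itself is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): unit 1 has two people, unit 2 has one
toy = pd.DataFrame({"serialno": [1, 1, 2],
                    "p_number": [1, 2, 1],
                    "wages": [100, 50, 70]})

# transform returns one value per original row, so it can be
# assigned straight back as new columns on the same dataframe
toy["hh_size"] = toy.groupby("serialno")["p_number"].transform("count")
toy["total_wages"] = toy.groupby("serialno")["wages"].transform("sum")

# agg (with named aggregation) returns one row per group instead
per_unit = (toy.groupby("serialno")
               .agg(hh_size=("p_number", "count"),
                    total_wages=("wages", "sum"))
               .reset_index())

print(toy)
print(per_unit)
```

Which one you want depends on whether you still need the person-level rows afterwards.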

### 3.1 Getting Rid of Duplicates

In [None]:
# Now, if I want to remove the duplicates, I can run my code to just select the head of household
# Notice the conditional statement (boolean) in this code. Take a moment to talk through this line of code with
# a neighbor, and describe what we're telling Python with each command.

pums_data=housing_allpeeps_df[housing_allpeeps_df['p_number'] == 1].copy()
pums_data.head()
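
Another way to get one row per housing unit is `drop_duplicates`. This is just a sketch on made-up toy data, not a replacement for the cell above. Note the difference: filtering on `p_number == 1` would drop a unit that has no person number 1, while `drop_duplicates` keeps whichever row comes first for each serialno.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"serialno": [1, 1, 2],
                    "p_number": [1, 2, 1],
                    "hh_size": [2, 2, 1]})

# Option 1: boolean filter, keeping only the head of household
by_filter = toy[toy["p_number"] == 1].copy()

# Option 2: sort so the lowest person number comes first,
# then keep the first row for each serialno
by_dedup = (toy.sort_values("p_number")
               .drop_duplicates(subset="serialno", keep="first"))

print(len(by_filter), len(by_dedup))
```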

## 4 Order of Operations

I generally clean and create dummies, then make new categories of each dataset before I merge them. A best practice is to first write out what steps you need to do before you do them, and then think about the logical approach to coding. And, I often make mistakes and have to go back, but luckily I have all my code so it goes faster as I make fixes to get to the analysis I want!

## 5 Dropping Values

The .drop() method lets us drop rows or columns from a dataframe by label. To drop rows based on their values, we can combine it with a conditional statement like we did up above.
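
Here is a quick sketch of the main ways to call `.drop()`, using a made-up toy dataframe. Without `inplace=True`, `.drop()` returns a new dataframe and leaves the original alone.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"serialno": [1, 2, 3],
                    "rent": [1500, 2200, 900],
                    "yearbuilt": [5, 7, 2]})

# Drop a column by its label
no_rent = toy.drop(columns=["rent"])

# Drop a row by its index label
no_first_row = toy.drop(index=[0])

# Drop rows that meet a condition: first find their index labels,
# then pass those labels to .drop()
cheap_removed = toy.drop(toy[toy["rent"] < 1000].index)

# The original dataframe is unchanged
print(toy.shape, no_rent.shape, no_first_row.shape, cheap_removed.shape)
```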

### 5.1 Dropping Null Values

In [None]:
# This lets us know if there are any null values
print( pums_data.isnull().values.any() ) 

# This shows us the total number of missing values for each variable
print( pums_data.isnull().sum() ) 

In [None]:
# The '.dropna' method drops every row that contains at least one missing value (NaN/None).
# Careful: with inplace=True this changes pums_data itself.
pums_data.dropna(inplace=True)

print( pums_data.isnull().sum() ) 
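
Dropping every row with any missing value can throw away a lot of data. If you only care about missing values in certain columns, `dropna` takes a `subset` argument. A minimal sketch on made-up toy data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with a missing rent and a missing wage
toy = pd.DataFrame({"serialno": [1, 2, 3],
                    "rent": [1500, np.nan, 900],
                    "total_wages": [np.nan, 40000, 60000]})

# Drops any row with a missing value in any column
all_complete = toy.dropna()

# Drops only rows where total_wages is missing; a missing rent is allowed
has_wages = toy.dropna(subset=["total_wages"])

print(len(toy), len(all_complete), len(has_wages))
```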

### 5.2 Dropping a Column

In [None]:
# This code drops the wages column, so we don't get confused about which wage variable to use in this dataframe.
# To drop more than one column, simply list the column labels in square brackets: ['wages', 'occupation', etc.]
pums_data.drop('wages', axis=1, inplace=True)

pums_data.head()

### 5.3 Dropping a Row (Using a Conditional Statement)

In [None]:
# Let's say we want to know more about household size. Let's take a quick look at that variable.
pums_data['hh_size'].describe()

In [None]:
# Whoa - 14 is a lot of people! Let's take a closer look to see if this is an outlier.

# Sneak peek of the code in the optional notebook we've put together with more resources. Check it out!

# Here we are defining a function (called hh_size_histogram) which runs all of the code we tell it to. 
# Defining our own functions just makes it easier to get the same output later. 
# We're also using a couple new libraries to plot a histogram.

def hh_size_histogram ():
    sns.set()
    plt.figure(figsize=(12,8))
    plt.title("Household Size")
    plt.xticks(rotation=35)
    # Pass the column as x= so this works across seaborn versions
    sns.countplot(x=pums_data['hh_size'])
    plt.ylabel("Number of Housing Units")
    plt.xlabel("Household Size")
    plt.show()
    
hh_size_histogram()

In [None]:
# Let's go ahead and drop household sizes over 10 people. 
# Keep in mind there are many ways to get the same result in Python. What would be another option here?

pums_data.drop( pums_data[pums_data['hh_size'] > 10].index, inplace=True )
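
Here is a sketch of two other options, on made-up toy data: keeping the rows you want with a boolean mask, or using `.query()`. All three give the same result here. One caveat: if the column had missing values, the `.drop()` version would keep those rows (NaN is not greater than 10), while the two "keep" versions would remove them.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"serialno": [1, 2, 3], "hh_size": [2, 14, 10]})

# Option 1: drop the rows we don't want (same idea as the cell above)
dropped = toy.drop(toy[toy["hh_size"] > 10].index)

# Option 2: keep the rows we do want with a boolean mask
kept = toy[toy["hh_size"] <= 10].copy()

# Option 3: the same condition written as a query string
queried = toy.query("hh_size <= 10")

print(dropped.equals(kept), kept.equals(queried))
```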

In [None]:
# Let's take a look at this variable in our dataframe to see what this did.

pums_data['hh_size'].describe()

In [None]:
hh_size_histogram()

## 6 Data Binning

Sometimes it can be useful to change a continuous variable into a categorical one. This is called 'binning.'
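
Here is a minimal sketch of how `pd.cut` assigns values to bins, using made-up wages and cutoffs. By default each bin includes its right edge but not its left edge, so a value exactly equal to the lowest cutoff falls outside every bin and comes back as missing unless you set `include_lowest=True`.

```python
import pandas as pd

# Toy data (hypothetical values and cutoffs)
wages = pd.Series([0, 25000, 25001, 120000])
bins = [0, 25000, 50000, 150000]
labels = ["Low", "Middle", "High"]

# Default: bins are (0, 25000], (25000, 50000], (50000, 150000]
# so a wage of exactly 0 is not in any bin and becomes NaN
default_cut = pd.cut(wages, bins, labels=labels)

# include_lowest=True makes the first bin include its left edge
inclusive_cut = pd.cut(wages, bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

Values above the highest cutoff also come back as missing, so it's worth checking `.isnull().sum()` on the new column after binning.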

In [None]:
# Let's take a look at the distribution of our total wages
pums_data["total_wages"].describe()

In [None]:
pums_data["total_wages"].quantile([0,.2,.4,.6,.8,.99])

In [None]:
# Defining bin cutoffs
wage_bins = ( 0, 25000, 50000, 100000, 150000, 1008000 )
             # Far left number is minimum, far right number is maximum value

# Defining bin labels
bin_labels = ( 'Very Low Wage', 'Low Wage', 'Moderate Wage', 'High Wage', 'Very High Wage')

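# Defining a new variable 'wage_brackets'
# include_lowest=True puts households with exactly 0 in total wages into the first bin instead of leaving them missing
pums_data['wage_brackets'] = pd.cut( pums_data['total_wages'], wage_bins, labels=bin_labels, include_lowest=True )
# Tip: Click on the command 'cut' and press Shift+Tab to see its documentation in Jupyter!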

In [None]:
pums_data.head()