Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Merging DataFrames Together

In this module, we're going to talk about two different types of merging: concatenation and masking.

In [4]:
import pandas as pd

## Concatenation

To "concatenate" means to combine things end-to-end.  That is, we're going to merge multiple data sets by appending the rows of each one after the rows of the one before it.
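
As a minimal sketch of what "end-to-end" means, the two tiny frames below (made-up values, hypothetical names `part1` and `part2`) are stacked with `pd.concat`. By default each frame keeps its own row labels, so the labels repeat; passing `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two small frames with the same columns (illustrative, made-up values)
part1 = pd.DataFrame({"Place": ["City A", "City A"], "Value": [1.0, 2.0]})
part2 = pd.DataFrame({"Place": ["City B", "City B"], "Value": [3.0, 4.0]})

# Stack the rows of part2 underneath the rows of part1
stacked = pd.concat([part1, part2])
print(stacked.index.tolist())     # each frame's original row labels are kept

# Ask pandas to assign fresh row labels instead
renumbered = pd.concat([part1, part2], ignore_index=True)
print(renumbered.index.tolist())
```

Whether to renumber is a design choice: repeated labels are harmless for counting rows, but they can surprise you later if you select rows by label with `.loc`.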

In `/data/drinking/` there is a whole list of files that we want to merge together into a single data frame.  They all have the same format, but they are from different cities.

In [1]:
# First, we can get a list of the files that are in a particular directory using the os package
import os
files = os.listdir('/data/drinking/')

In [2]:
# Then, let's read each of those files into their own df and store that in a list of dfs
dataframes = []

In [5]:
for f in files:
    # os.path.join builds the full path without relying on a trailing slash
    df = pd.read_csv(os.path.join('/data/drinking/', f))
    dataframes.append(df)

In [6]:
# Then we can concatenate them together with pd.concat
drinking = pd.concat(dataframes)

Let's check to make sure the counts match up...

Is the length of the combined dataframe equal to the sum of the lengths of the individual dataframes?

In [7]:
len(drinking)

599

In [8]:
sum([len(x) for x in dataframes])

599
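
The read, concatenate, and verify steps above can also be sketched on throwaway files, which makes them easy to try without access to `/data/drinking/`. Everything here is an assumption for illustration: the temporary directory, the file names, and the values are made up. The sketch also filters to `*.csv` and sorts the paths, since a directory listing can contain other files and comes back in no guaranteed order.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Create two small CSV files plus one file that should be ignored
    city_a = pd.DataFrame({"Place": ["City A"] * 3, "Value": [1.0, 2.0, 3.0]})
    city_b = pd.DataFrame({"Place": ["City B"] * 2, "Value": [4.0, 5.0]})
    city_a.to_csv(tmp_dir / "city_a.csv", index=False)
    city_b.to_csv(tmp_dir / "city_b.csv", index=False)
    (tmp_dir / "notes.txt").write_text("not a data file")

    # Keep only the CSV files, in a predictable order
    demo_names = sorted(p.name for p in tmp_dir.glob("*.csv"))
    demo_frames = [pd.read_csv(tmp_dir / name) for name in demo_names]

# Concatenate, then confirm that no rows were lost or duplicated
demo_combined = pd.concat(demo_frames)
assert len(demo_combined) == sum(len(f) for f in demo_frames)
```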

It's also possible to label the rows as they are concatenated by passing the `keys` argument.  That can be handy if you want to keep track of which input file each row came from.

In [9]:
drinking2 = pd.concat(dataframes, keys=files)

In [10]:
drinking2.head()

| | | Unnamed: 0 | Indicator Category | Indicator | Year | Sex | Race/Ethnicity | Value | Place | BCHC Requested Methodology | Source | Methods | Notes | 90% Confidence Level - Low | 90% Confidence Level - High | 95% Confidence Level - Low | 95% Confidence Level - High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baltimore_MD.csv | 0 | 21 | Behavioral Health/Substance Abuse | Percent of Adults Who Binge Drank | 2010 | Both | All | 14.5 | Baltimore, MD | BRFSS (or similar) How many times during the p... | CDC BRFSS | The three most recent years of available data ... | Due to changes in BRFSS sampling methodology, ... | | | | |
| Baltimore_MD.csv | 1 | 22 | Behavioral Health/Substance Abuse | Percent of Adults Who Binge Drank | 2010 | Both | Black | 9.5 | Baltimore, MD | BRFSS (or similar) How many times during the p... | CDC BRFSS | The three most recent years of available data ... | Due to changes in BRFSS sampling methodology, ... | | | | |
| Baltimore_MD.csv | 2 | 29 | Behavioral Health/Substance Abuse | Percent of Adults Who Binge Drank | 2010 | Both | White | 21.1 | Baltimore, MD | BRFSS (or similar) How many times during the p... | CDC BRFSS | The three most recent years of available data ... | Due to changes in BRFSS sampling methodology, ... | | | | |
| Baltimore_MD.csv | 3 | 30 | Behavioral Health/Substance Abuse | Percent of Adults Who Binge Drank | 2010 | Female | All | 9.7 | Baltimore, MD | BRFSS (or similar) How many times during the p... | CDC BRFSS | The three most recent years of available data ... | Due to changes in BRFSS sampling methodology, ... | | | | |
| Baltimore_MD.csv | 4 | 31 | Behavioral Health/Substance Abuse | Percent of Adults Who Binge Drank | 2010 | Male | All | 20.3 | Baltimore, MD | BRFSS (or similar) How many times during the p... | CDC BRFSS | The three most recent years of available data ... | Due to changes in BRFSS sampling methodology, ... | | | | |


In [None]:
drinking2.index.levels[0]
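
To show what the `keys` labels make possible, here is a small sketch on made-up frames. The names `a`, `b`, `labeled`, and `flat`, the file-name strings, and the level names `source_file` and `row` are all hypothetical. It shows that the labels form the outer level of the row index, that `.loc` can pull out the rows from one source, and that `reset_index` can turn the labels into an ordinary column.

```python
import pandas as pd

# Two small frames standing in for two input files (made-up values)
a = pd.DataFrame({"Value": [1.0, 2.0]})
b = pd.DataFrame({"Value": [3.0]})

# keys labels each input; names gives the two index levels readable names
labeled = pd.concat([a, b], keys=["a.csv", "b.csv"], names=["source_file", "row"])

# The outer index level holds the labels that were passed as keys
print(list(labeled.index.levels[0]))

# Select only the rows that came from one source
from_b = labeled.loc["b.csv"]
print(from_b)

# Move the source label out of the index and into a regular column
flat = labeled.reset_index(level="source_file")
print(flat)
```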

## Concatenating Side-by-Side

The stacking example above is more common, but you can also concatenate data side-by-side, adding columns instead of rows.  With `axis=1`, pandas lines the rows up by their index labels.

In [None]:
names1 = [['Paul', 'Boal'], ['Anny', 'Monroe'], ['Eric', 'Westhus']]
names2 = [['Paul Boal'], ['Anny Monroe'], ['Eric Westhus']]
n1 = pd.DataFrame(names1, columns=['First', 'Last'])
n2 = pd.DataFrame(names2, columns=['Full Name'])

In [None]:
pd.concat([n1,n2], axis=1)
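
The example above works cleanly because both frames share the row labels 0, 1, 2. As a hedged sketch of what happens when they do not, the frames below (hypothetical names `left` and `right`, with deliberately mismatched index labels) are concatenated side-by-side. Labels found in only one frame get `NaN` in the other frame's columns, and `join="inner"` keeps only the labels found in both.

```python
import pandas as pd

# Row labels overlap only partly: 1 and 2 appear in both frames
left = pd.DataFrame({"First": ["Paul", "Anny", "Eric"]}, index=[0, 1, 2])
right = pd.DataFrame({"Last": ["Monroe", "Westhus", "Boal"]}, index=[1, 2, 3])

# Default (outer) behavior: keep every label, fill the gaps with NaN
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# Inner join: keep only the labels present in both frames
both = pd.concat([left, right], axis=1, join="inner")
print(both)
```

Note that the alignment is by label, not by position, so it is worth checking the indexes before concatenating along `axis=1`.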

## Masking

With "masking", we take two data sets and overlay one on top of the other.  Wherever the first data set has a value, that value is kept.  Wherever the first has a blank (NaN), the value from the second data set shows through instead.
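
A minimal sketch of this overlay rule, using `DataFrame.combine_first` on two made-up frames (the names `primary` and `backup`, the index labels, and the state codes are all illustrative):

```python
import numpy as np
import pandas as pd

# primary is "on top"; backup supplies values only where primary is blank
primary = pd.DataFrame({"State": ["MO", np.nan, np.nan]}, index=[101, 102, 103])
backup = pd.DataFrame({"State": ["IL", "KS", "TX"]}, index=[101, 102, 104])

result = primary.combine_first(backup)
print(result)

# 101: primary has a value, so "MO" is kept and "IL" is ignored
# 102: primary is blank, so "KS" shows through from backup
# 103: only in primary and blank there, so it stays NaN
# 104: only in backup, so its row is added
```

Rows are matched by index label, which is why the frames should be indexed by the identifier they have in common before they are combined.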

In [None]:
# Load both files and index them by NPI so that rows are matched by provider
nppes1 = pd.read_csv('/data/nppes1.csv')
nppes2 = pd.read_csv('/data/nppes2.csv')
nppes1.set_index('NPI', inplace=True)
nppes2.set_index('NPI', inplace=True)

In [None]:
# count() skips NaN, so this is the number of rows that have a State
nppes1['State'].count()

In [None]:
len(nppes1)

In [None]:
len(nppes2)

In [None]:
# Show the rows in nppes1 where State is missing
nppes1[pd.isnull(nppes1['State'])]

In [None]:
# Keep nppes1's values and fill its blanks from nppes2
combined = nppes1.combine_first(nppes2)

In [None]:
# Count the non-null State values again, this time in the combined data
combined['State'].count()

In [None]:
# Look at one provider in the combined data
combined.loc[1225590060]

In [None]:
# Look at the same NPI in the original nppes1 for comparison
nppes1.loc[1225590060]
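
The before-and-after checks above can be collected into one small sketch that reports exactly which rows were filled in. It runs on made-up frames rather than the NPPES files; the names `first`, `second`, `merged`, and `filled_ids` and all of the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

npi = pd.Index([1, 2, 3], name="NPI")
first = pd.DataFrame({"State": ["MO", np.nan, np.nan]}, index=npi)
second = pd.DataFrame({"State": ["IL", "KS", np.nan]}, index=npi)

merged = first.combine_first(second)

# Non-null counts before and after masking
before = first["State"].count()
after = merged["State"].count()
print(before, after)

# A row was filled if State was missing in first but is present in merged
was_missing = first["State"].isna()
now_present = merged.loc[first.index, "State"].notna()
filled_mask = was_missing & now_present
filled_ids = filled_mask[filled_mask].index.tolist()
print(filled_ids)
```

A row that is blank in both inputs stays blank, so the count after masking can still be smaller than the number of rows.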