# Day 3

- Now that we know the basics of wrangling data, we will look at a few other necessary skills.

    1. Merging and grouping data
    2. Pivot
    3. Visualizing summary statistics
    
- Only 2 quizzes today!

# Lecture 3-1: Merging and Grouping Data
- This lecture is going to focus on some fundamental aspects of data analysis.

- We focus on pandas objects as they will be the most useful.
    
- Today's content was adapted from *McKinney, Wes. Python for Data Analysis. O'Reilly Media. Kindle Edition.* 
- **If you have any questions over the course of this lecture, please post them to the 'Day 3 Lecture Questions' assignment on the Canvas course page.**

## Merging Pandas

- Concat stands for concatenate, and `pd.concat` allows you to easily combine Series and DataFrame objects.

- We can load relevant data sets together and choose what aspects from each dataset we wish to retain.

- Data for [life expectancy](https://data.worldbank.org/indicator/SP.DYN.LE00.IN) and [population](https://data.worldbank.org/indicator/SP.POP.TOTL) from the World Bank
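
As a minimal sketch of the concatenation mentioned above, `pd.concat` can stack objects end to end or place them side by side. The objects `s1`, `s2`, `left`, and `right` are hypothetical and not part of the lecture data.

```python
import pandas as pd

# Hypothetical example data; the names s1, s2, left, and right are illustrative only.
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

# The default axis=0 stacks the objects end to end.
stacked = pd.concat([s1, s2])

left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['b', 'c'])

# axis=1 places the objects side by side, aligning on the index;
# labels missing from one object are filled with NaN.
side_by_side = pd.concat([left, right], axis=1)

print(stacked)
print(side_by_side)
```

Unlike `merge`, which matches rows on column values, `concat` aligns only on the index labels.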

In [1]:
# here we have separate slices of data
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'key': ['y', 'y', 'x', 'w', 'z', 'z', 'y','x','x'], 'data1': range( 9)})

df1


Unnamed: 0,key,data1
0,y,0
1,y,1
2,x,2
3,w,3
4,z,4
5,z,5
6,y,6
7,x,7
8,x,8


In [2]:
df2 = pd.DataFrame({'key': ['w', 'x', 'z'], 'data2': range( 3)})
df2


Unnamed: 0,key,data2
0,w,0
1,x,1
2,z,2


### Joins
- `df1` has multiple rows labeled `x`, `y`, and `z`, whereas `df2` has only one row for each value of `key`.
- When we do not specify which column to join on, `merge` uses the columns whose names appear in both tables.
    - We can always specify the join column explicitly for clarity.

In [3]:
# implicit join
pd.merge(df1, df2)

Unnamed: 0,key,data1,data2
0,x,2,1
1,x,7,1
2,x,8,1
3,w,3,0
4,z,4,2
5,z,5,2


In [4]:
# explicit join

pd.merge(df1,df2, on='key')

Unnamed: 0,key,data1,data2
0,x,2,1
1,x,7,1
2,x,8,1
3,w,3,0
4,z,4,2
5,z,5,2


In [5]:
# What if the key columns have different names? You could rename one of them, but merge can also handle differing names with left_on and right_on.

df3 = pd.DataFrame({'key1': ['y', 'y', 'w', 'z', 'x', 'x', 'z','z','z'],'data1': range(9)}) 
df4 = pd.DataFrame({'key2': ['w', 'x', 'z'], 'data2': range( 3)})

pd.merge(df3, df4, left_on ='key1', right_on ='key2')



Unnamed: 0,key1,data1,key2,data2
0,w,2,w,0
1,z,3,z,2
2,z,6,z,2
3,z,7,z,2
4,z,8,z,2
5,x,4,x,1
6,x,5,x,1


### Merging options

- An **inner** join uses only the keys observed in both tables.
- A **left** join uses all the keys from the left table.
- A **right** join uses all the keys from the right table.
- An **outer** join uses all keys from both tables.

In [6]:
df1 = pd.DataFrame({'key':['x','x','z','y','z'], 'data1': range(5)})

df2 = pd.DataFrame({'key':['y','a','y','a','z', 'x'], 'data2': range(6)})

In [7]:
pd.merge(df1, df2, how='inner')

Unnamed: 0,key,data1,data2
0,x,0,5
1,x,1,5
2,z,2,4
3,z,4,4
4,y,3,0
5,y,3,2


In [8]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,key,data1,data2
0,x,0.0,5
1,x,1.0,5
2,z,2.0,4
3,z,4.0,4
4,y,3.0,0
5,y,3.0,2
6,a,,1
7,a,,3


In [9]:
pd.merge(df1, df2, how='left')

Unnamed: 0,key,data1,data2
0,x,0,5
1,x,1,5
2,z,2,4
3,y,3,0
4,y,3,2
5,z,4,4


In [10]:
pd.merge(df1, df2, how='right')

Unnamed: 0,key,data1,data2
0,x,0.0,5
1,x,1.0,5
2,z,2.0,4
3,z,4.0,4
4,y,3.0,0
5,y,3.0,2
6,a,,1
7,a,,3
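
One way to see which table each row of a join came from is the `indicator` argument of `pd.merge`. The sketch below reuses the two frames from the cells above; the names `flagged` and `counts` are illustrative only.

```python
import pandas as pd

# Same frames as in the cells above.
df1 = pd.DataFrame({'key': ['x', 'x', 'z', 'y', 'z'], 'data1': range(5)})
df2 = pd.DataFrame({'key': ['y', 'a', 'y', 'a', 'z', 'x'], 'data2': range(6)})

# indicator=True adds a '_merge' column recording whether each row's key
# was found in the left table only, the right table only, or both.
flagged = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Count how many result rows fall into each category.
counts = flagged['_merge'].value_counts()

print(flagged)
print(counts)
```

Rows marked `both` are the ones an inner join would keep; a left join adds the `left_only` rows and a right join adds the `right_only` rows.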


In [11]:
# note: ' one' (with a leading space) is a different key from 'one', so those rows only match each other
my_data1 = pd.DataFrame({'key1': ['green', 'green', 'red'], 'key2': [' one', 'two', 'one'], 'data1': [1, 2, 3], 'data3': [9,10,11]})

my_data2 = pd.DataFrame({'key1': ['green', 'green', 'red', 'red'],  'key2': [' one', 'one', 'one', 'two'], 'data2': [4, 5, 6, 7]})



In [12]:
my_data1

Unnamed: 0,key1,key2,data1,data3
0,green,one,1,9
1,green,two,2,10
2,red,one,3,11


In [13]:
my_data2

Unnamed: 0,key1,key2,data2
0,green,one,4
1,green,one,5
2,red,one,6
3,red,two,7


In [14]:
pd.merge(my_data1, my_data2, on =['key1', 'key2'], how ='outer')



Unnamed: 0,key1,key2,data1,data3,data2
0,green,one,1.0,9.0,4.0
1,green,two,2.0,10.0,
2,red,one,3.0,11.0,6.0
3,green,one,,,5.0
4,red,two,,,7.0


In [15]:
pd.merge(my_data1, my_data2, on =['key1', 'key2'], how ='inner')


Unnamed: 0,key1,key2,data1,data3,data2
0,green,one,1,9,4
1,red,one,3,11,6


In [16]:
pd.merge(my_data1, my_data2, on =['key1', 'key2'], how ='left')


Unnamed: 0,key1,key2,data1,data3,data2
0,green,one,1,9,4.0
1,green,two,2,10,
2,red,one,3,11,6.0


In [17]:
pd.merge(my_data1, my_data2, on =['key1', 'key2'], how ='right')


Unnamed: 0,key1,key2,data1,data3,data2
0,green,one,1.0,9.0,4
1,red,one,3.0,11.0,6
2,green,one,,,5
3,red,two,,,7


In [18]:
# To keep a column out of the merge, subset the DataFrame first.
# Use double brackets to select columns: the outer pair indexes the DataFrame, the inner pair is the list of column names.

subset_data1 = my_data1[['key1','key2','data1']]
subset_data1

Unnamed: 0,key1,key2,data1
0,green,one,1
1,green,two,2
2,red,one,3


In [19]:
pd.merge(subset_data1, my_data2, on =['key1', 'key2'], how ='outer')


Unnamed: 0,key1,key2,data1,data2
0,green,one,1.0,4.0
1,green,two,2.0,
2,red,one,3.0,6.0
3,green,one,,5.0
4,red,two,,7.0


In [24]:
# make a new DataFrame in which data1 is renamed to data2
new_data1 = subset_data1.rename(columns ={"data1": "data2"})

In [27]:
my_data2

Unnamed: 0,key1,key2,data2
0,green,one,4
1,green,one,5
2,red,one,6
3,red,two,7


In [28]:
new_data1

Unnamed: 0,key1,key2,data2
0,green,one,1
1,green,two,2
2,red,one,3
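
After the rename, both frames have a non-key column called `data2`. The sketch below shows how `merge` handles such a clash. It uses hypothetical frames `left` and `right` rather than the lecture data.

```python
import pandas as pd

# Hypothetical frames that share the key columns and a non-key column, 'data2'.
left = pd.DataFrame({'key1': ['green', 'green', 'red'],
                     'key2': ['one', 'two', 'one'],
                     'data2': [1, 2, 3]})
right = pd.DataFrame({'key1': ['green', 'red', 'red'],
                      'key2': ['one', 'one', 'two'],
                      'data2': [4, 6, 7]})

# With default settings, merge disambiguates the clashing column
# by appending '_x' (left) and '_y' (right).
default_names = pd.merge(left, right, on=['key1', 'key2'])

# The suffixes argument replaces those defaults with more descriptive labels.
custom_names = pd.merge(left, right, on=['key1', 'key2'],
                        suffixes=('_left', '_right'))

print(default_names.columns.tolist())
print(custom_names.columns.tolist())
```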


## Hierarchical grouping

In [34]:
# create a Series with a two-level (hierarchical) index
my_data = pd.Series(np.random.randint(0, 9, 9),
                    index=[['w', 'w', 'x', 'x', 'x', 'y', 'y', 'z', 'z'],
                           [2, 1, 3, 2, 1, 2, 3, 3, 1]])
# the values are random; the nested list supplies the outer and inner index levels
my_data


w  2    6
   1    5
x  3    7
   2    2
   1    3
y  2    4
   3    5
z  3    6
   1    7
dtype: int32

In [3]:
# change the index of real data

data = pd.read_csv('https://data.medicare.gov/resource/ukfj-tt6v.csv')

new_index = data.set_index(["state", "city"])
new_index

state,city,provider_id,hospital_name,address,zip_code,county_name,phone_number,measure_id,measure_name,compared_to_national,denominator,score,lower_estimate,higher_estimate,footnote,measure_start_date,measure_end_date
AL,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,36301,HOUSTON,(334) 793-8701,COMP_HIP_KNEE,Rate of complications for hip/knee replacement...,No Different Than the National Rate,292,3.2,2.1,4.8,,04/01/2015,03/31/2018
AL,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,36301,HOUSTON,(334) 793-8701,MORT_30_AMI,Death rate for heart attack patients,No Different Than the National Rate,688,13,11.0,15.5,,07/01/2015,06/30/2018
AL,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,36301,HOUSTON,(334) 793-8701,MORT_30_CABG,Death rate for CABG surgery patients,No Different Than the National Rate,291,4.3,2.6,6.8,,07/01/2015,06/30/2018
AL,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,36301,HOUSTON,(334) 793-8701,MORT_30_COPD,Death rate for COPD patients,No Different Than the National Rate,411,8.8,6.7,11.4,,07/01/2015,06/30/2018
AL,DOTHAN,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,36301,HOUSTON,(334) 793-8701,MORT_30_HF,Death rate for heart failure patients,No Different Than the National Rate,869,12.7,10.7,15.0,,07/01/2015,06/30/2018
AL,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AL,CAMDEN,10102,J PAUL JONES HOSPITAL,317 MCWILLIAMS AVENUE,36726,WILCOX,(334) 682-4131,PSI_10_POST_KIDNEY,Postoperative Acute Kidney Injury Requiring Di...,Not Available,Not Available,Not Available,Not Available,Not Available,7.0,07/01/2016,06/30/2018
AL,CAMDEN,10102,J PAUL JONES HOSPITAL,317 MCWILLIAMS AVENUE,36726,WILCOX,(334) 682-4131,PSI_11_POST_RESP,Postoperative Respiratory Failure Rate,Not Available,Not Available,Not Available,Not Available,Not Available,7.0,07/01/2016,06/30/2018
AL,CAMDEN,10102,J PAUL JONES HOSPITAL,317 MCWILLIAMS AVENUE,36726,WILCOX,(334) 682-4131,PSI_12_POSTOP_PULMEMB_DVT,Serious blood clots after surgery,Not Available,Not Available,Not Available,Not Available,Not Available,7.0,07/01/2016,06/30/2018
AL,CAMDEN,10102,J PAUL JONES HOSPITAL,317 MCWILLIAMS AVENUE,36726,WILCOX,(334) 682-4131,PSI_13_POST_SEPSIS,Blood stream infection after surgery,Not Available,Not Available,Not Available,Not Available,Not Available,7.0,07/01/2016,06/30/2018
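
The same re-indexing step can be sketched on a small frame that needs no download. The `hospitals` frame and its values are made up for illustration and only mimic the `state` and `city` columns of the real data.

```python
import pandas as pd

# Hypothetical stand-in for the hospital data; the values are made up.
hospitals = pd.DataFrame({
    'state': ['AL', 'AL', 'MO'],
    'city': ['DOTHAN', 'CAMDEN', 'ST LOUIS'],
    'score': [3.2, 4.1, 2.8],
})

# Move two columns into a hierarchical (two-level) row index.
indexed = hospitals.set_index(['state', 'city'])

# reset_index reverses the operation and restores the columns.
restored = indexed.reset_index()

print(indexed)
print(restored.equals(hospitals))
```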


### Partial indexing

In [35]:
# index by a single row name
my_data['x']

3    7
2    2
1    3
dtype: int32

In [36]:
# index through multiple consecutive rows
my_data['x':'z']

x  3    7
   2    2
   1    3
y  2    4
   3    5
z  3    6
   1    7
dtype: int32

In [37]:
# index by non consecutive rows (.loc)
my_data.loc[['z', 'w']]


w  2    6
   1    5
z  3    6
   1    7
dtype: int32

In [38]:
# index by the inner group
my_data.loc[:,2]


w    6
x    2
y    4
dtype: int32
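
The selections above can also be written with `xs`, which takes a cross-section at a chosen index level. This is a sketch using fixed values in place of the random ones, so the Series `s` is illustrative rather than the lecture's `my_data`.

```python
import numpy as np
import pandas as pd

# Fixed values (rather than random ones) so the result is reproducible.
s = pd.Series(np.arange(9),
              index=[['w', 'w', 'x', 'x', 'x', 'y', 'y', 'z', 'z'],
                     [2, 1, 3, 2, 1, 2, 3, 3, 1]])

# Cross-section on the outer level: selects the same rows as s['x'].
outer = s.xs('x', level=0)

# Cross-section on the inner level: selects the same rows as s.loc[:, 2].
inner = s.xs(2, level=1)

print(outer)
print(inner)
```

Naming the level explicitly can make the intent easier to read than positional slicing when an index has several levels.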

In [39]:
#having multiple indexes can be useful for reshaping data
my_data.unstack()

Unnamed: 0,1,2,3
w,5.0,6.0,
x,3.0,2.0,7.0
y,,4.0,5.0
z,7.0,,6.0


In [30]:
# stack() reverses unstack(); the round trip sorts the index and leaves floats, because unstack introduced NaN

my_data.unstack().stack()

w  1    5.0
   2    6.0
x  1    3.0
   2    2.0
   3    7.0
y  2    4.0
   3    5.0
z  1    7.0
   3    6.0
dtype: float64
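
`unstack` also accepts a level to pivot and a fill value for missing combinations. The sketch below uses a small hypothetical Series `s` with fixed values.

```python
import pandas as pd

# Small fixed Series with a two-level index (hypothetical values).
s = pd.Series([5, 6, 3, 2],
              index=[['w', 'w', 'x', 'x'], [1, 2, 1, 3]])

# By default the innermost level becomes the columns.
wide = s.unstack()

# Passing level=0 pivots the outer level instead.
wide_outer = s.unstack(level=0)

# fill_value replaces the NaN that would otherwise mark missing combinations.
wide_filled = s.unstack(fill_value=0)

print(wide)
print(wide_outer)
print(wide_filled)
```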

In [49]:
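# we can also give a DataFrame hierarchical indexes on both the rows and the columns
# note: ' Missouri' and ' Yellow' carry a leading space, so pandas treats them as labels distinct from 'Missouri' and 'Yellow'
my_frame = pd.DataFrame( np.arange( 12). reshape(( 4, 3)), index =[['w', 'w', 'x', 'x'], [1, 2, 1, 2]], columns =[[' Missouri', 'Missouri', 'Illinois'],  [' Yellow', 'Blue', 'Yellow']])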

my_frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Missouri,Missouri,Illinois
Unnamed: 0_level_1,Unnamed: 1_level_1,Yellow,Blue,Yellow
w,1,0,1,2
w,2,3,4,5
x,1,6,7,8
x,2,9,10,11


In [52]:
my_frame.index.names = ['key1', 'key2']

my_frame.columns.names = ['state','color']

my_frame

Unnamed: 0_level_0,state,Missouri,Missouri,Illinois
Unnamed: 0_level_1,color,Yellow,Blue,Yellow
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
w,1,0,1,2
w,2,3,4,5
x,1,6,7,8
x,2,9,10,11


In [54]:
my_frame.swaplevel('key1', 'key2')



Unnamed: 0_level_0,state,Missouri,Missouri,Illinois
Unnamed: 0_level_1,color,Yellow,Blue,Yellow
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,w,0,1,2
2,w,3,4,5
1,x,6,7,8
2,x,9,10,11


In [55]:
my_frame.sort_index(level = 1)



Unnamed: 0_level_0,state,Missouri,Missouri,Illinois
Unnamed: 0_level_1,color,Yellow,Blue,Yellow
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
w,1,0,1,2
x,1,6,7,8
w,2,3,4,5
x,2,9,10,11
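
`swaplevel` and `sort_index` are often used together, since swapping levels does not reorder the rows. A minimal sketch, assuming a hypothetical `frame` with single-level columns `a` and `b`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-level row index.
frame = pd.DataFrame(np.arange(8).reshape((4, 2)),
                     index=[['w', 'w', 'x', 'x'], [1, 2, 1, 2]],
                     columns=['a', 'b'])
frame.index.names = ['key1', 'key2']

# swaplevel only exchanges the level order; the rows stay where they were.
swapped = frame.swaplevel('key1', 'key2')

# Sorting afterwards groups the rows by the new outer level (key2).
swapped_sorted = swapped.sort_index(level=0)

print(swapped)
print(swapped_sorted)
```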


### Using levels to summarize data

In [56]:
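# DataFrame.sum(level=...) was removed in pandas 2.0; groupby(level=...) gives the same result
my_frame.groupby(level='key2').sum()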



state,Missouri,Missouri,Illinois
color,Yellow,Blue,Yellow
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [58]:
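# column-wise equivalent of sum(level='color', axis=1), which was removed in pandas 2.0:
# transpose, group the rows by 'color', sum, and transpose back
my_frame.T.groupby(level='color').sum().T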


Unnamed: 0_level_0,color,Yellow,Blue,Yellow
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
w,1,0,1,2
w,2,3,4,5
x,1,6,7,8
x,2,9,10,11
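
In the output above the two Yellow columns are not combined, because the labels as typed in `my_frame` differ by a leading space. The sketch below shows the same summary under the assumption that the labels were meant to match. The name `frame` is illustrative, and `groupby(level=...)` is used as a form that should work across pandas versions.

```python
import numpy as np
import pandas as pd

# Same shape as my_frame, but with labels typed without leading spaces
# (an assumption about the intended data).
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['w', 'w', 'x', 'x'], [1, 2, 1, 2]],
                     columns=[['Missouri', 'Missouri', 'Illinois'],
                              ['Yellow', 'Blue', 'Yellow']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share the same color:
# transpose, group the rows, then transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

With matching labels, the two Yellow columns collapse into one, leaving a single Blue and a single Yellow column.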
