The ability to transform and combine your data is a crucial skill in data science, because your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. Or if you are dealing with time series data, for example, you may have a new dataset for each day. No matter the reason, it is important to be able to combine datasets so you can either clean a single dataset, or clean each dataset separately and then combine them later so you can run your analysis on a single dataset. In this chapter, you'll learn all about combining data.

# Combining rows of data
The dataset you'll be working with here relates to NYC Uber data. The original dataset has all the originating Uber pickup locations by time and latitude and longitude. For didactic purposes, you'll be working with a very small portion of the actual data.

Three DataFrames have been pre-loaded: uber1, which contains data for April 2014, uber2, which contains data for May 2014, and uber3, which contains data for June 2014. Your job in this exercise is to concatenate these DataFrames together such that the resulting DataFrame has the data for all three months.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [23]:
uber_df = pd.read_csv('nyc_uber_2014.csv')

In [24]:
uber_df.head(10)

Unnamed: 0.1,Unnamed: 0,Date/Time,Lat,Lon,Base
0,0,4/1/2014 0:11:00,40.7690,-73.9549,B02512
1,1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4,4/1/2014 0:33:00,40.7594,-73.9722,B02512
5,5,4/1/2014 0:33:00,40.7383,-74.0403,B02512
6,6,4/1/2014 0:39:00,40.7223,-73.9887,B02512
7,7,4/1/2014 0:45:00,40.7620,-73.9790,B02512
8,8,4/1/2014 0:55:00,40.7524,-73.9960,B02512
9,9,4/1/2014 1:01:00,40.7575,-73.9846,B02512


In [28]:
uber_df.drop('Unnamed: 0', axis = 1, inplace = True)

In [32]:
uber_df.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [34]:
uber_df[0:99].to_csv('uber1.csv', index = False)

In [35]:
uber1 = pd.read_csv('uber1.csv')
uber1.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [36]:
uber_df[99:198].to_csv('uber2.csv', index = False)

In [37]:
uber2 = pd.read_csv('uber2.csv')
uber2.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,5/1/2014 0:02:00,40.7521,-73.9914,B02512
1,5/1/2014 0:06:00,40.6965,-73.9715,B02512
2,5/1/2014 0:15:00,40.7464,-73.9838,B02512
3,5/1/2014 0:17:00,40.7463,-74.0011,B02512
4,5/1/2014 0:17:00,40.7594,-73.9734,B02512


In [41]:
uber_df[198:].to_csv('uber3.csv', index = False)

In [42]:
uber3 = pd.read_csv('uber3.csv')
uber3.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,6/1/2014 0:00:00,40.7293,-73.992,B02512
1,6/1/2014 0:01:00,40.7131,-74.0097,B02512
2,6/1/2014 0:04:00,40.3461,-74.661,B02512
3,6/1/2014 0:04:00,40.7555,-73.9833,B02512
4,6/1/2014 0:07:00,40.688,-74.1831,B02512


In [43]:
uber3.tail()

Unnamed: 0,Date/Time,Lat,Lon,Base
94,6/1/2014 6:27:00,40.7554,-73.9738,B02512
95,6/1/2014 6:35:00,40.7543,-73.9817,B02512
96,6/1/2014 6:37:00,40.7751,-73.9633,B02512
97,6/1/2014 6:46:00,40.6952,-74.1784,B02512
98,6/1/2014 6:51:00,40.7621,-73.9817,B02512


In [44]:
# concatenate uber1, uber2, uber3
row_concat = pd.concat([uber1, uber2, uber3])
row_concat.shape
row_concat.head()

(297, 4)

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512
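
Note that a plain row-wise pd.concat() keeps each frame's original index labels, so the same labels repeat in the result. The sketch below uses two tiny made-up frames (hypothetical values, not the Uber files) to show the difference that ignore_index=True makes; whether you want the renumbered index depends on your analysis.

```python
import pandas as pd

# Toy stand-ins for two monthly frames (hypothetical values)
april = pd.DataFrame({'Lat': [40.7690, 40.7267], 'Base': ['B02512', 'B02512']})
may = pd.DataFrame({'Lat': [40.7521, 40.6965], 'Base': ['B02512', 'B02512']})

# Default behaviour: the original labels 0 and 1 each appear twice
stacked = pd.concat([april, may])

# ignore_index=True discards the old labels and renumbers the rows 0..n-1
renumbered = pd.concat([april, may], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```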


## Combining columns of data
Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation.
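
One detail worth keeping in mind: with axis=1, pd.concat() lines rows up by index label, not by position. The following is a minimal sketch with made-up frames (the names left, right and shifted are illustrative only) showing both the aligned case and what happens when the labels do not match.

```python
import pandas as pd

left = pd.DataFrame({'status_country': ['Cases_Guinea', 'Deaths_Liberia']})
right = pd.DataFrame({'status': ['Cases', 'Deaths'],
                      'country': ['Guinea', 'Liberia']})

# Same index labels (0, 1) on both sides: columns are simply placed side by side
side_by_side = pd.concat([left, right], axis=1)

# Different labels (1, 2) on the right: the result uses the union of labels
# and fills the gaps with NaN
shifted = right.copy()
shifted.index = [1, 2]
misaligned = pd.concat([left, shifted], axis=1)

print(side_by_side.shape)
print(misaligned.shape)
```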

In [45]:
ebola = pd.read_csv('ebola.csv')

In [47]:
# Melt ebola
ebola_melt = pd.melt(frame = ebola, id_vars = ['Date', 'Day'], var_name = 'status_country', value_name = 'counts')
ebola_melt.head()

Unnamed: 0,Date,Day,status_country,counts
0,1/5/2015,289,Cases_Guinea,2776.0
1,1/4/2015,288,Cases_Guinea,2775.0
2,1/3/2015,287,Cases_Guinea,2769.0
3,1/2/2015,286,Cases_Guinea,
4,12/31/2014,284,Cases_Guinea,2730.0


In [54]:
status_country = pd.DataFrame()

In [57]:
status_country['status_country'] = ebola_melt.status_country.str.split('_')

In [58]:
status_country['status'] = status_country['status_country'].str.get(0)

In [59]:
status_country['country'] = status_country['status_country'].str.get(1)

In [61]:
status_country.drop('status_country', axis = 1, inplace = True)

In [62]:
status_country.head()

Unnamed: 0,status,country
0,Cases,Guinea
1,Cases,Guinea
2,Cases,Guinea
3,Cases,Guinea
4,Cases,Guinea
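
As an alternative to splitting and then calling .str.get() twice, .str.split() accepts expand=True, which returns the pieces as separate columns directly. This is a sketch on a small made-up Series, not the full ebola data.

```python
import pandas as pd

combined = pd.Series(['Cases_Guinea', 'Deaths_Liberia'], name='status_country')

# expand=True returns a DataFrame with one column per piece (labelled 0, 1, ...)
parts = combined.str.split('_', expand=True)
parts.columns = ['status', 'country']

print(parts)
```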


In [63]:
# Concatenate ebola_melt and status_country column-wise: 
ebola_tidy = pd.concat([ebola_melt, status_country], axis = 1)

In [64]:
ebola_tidy.shape
ebola_tidy.head()

(1952, 6)

Unnamed: 0,Date,Day,status_country,counts,status,country
0,1/5/2015,289,Cases_Guinea,2776.0,Cases,Guinea
1,1/4/2015,288,Cases_Guinea,2775.0,Cases,Guinea
2,1/3/2015,287,Cases_Guinea,2769.0,Cases,Guinea
3,1/2/2015,286,Cases_Guinea,,Cases,Guinea
4,12/31/2014,284,Cases_Guinea,2730.0,Cases,Guinea


## Finding files that match a pattern

You're now going to practice using the glob module to find all csv files in the workspace. In the next exercise, you'll programmatically load them into DataFrames.  
The glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.  
For example, if you know the file names consist of part_, followed by a single character such as a digit, followed by .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.)

Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'. The ? wildcard represents any 1 character, and the * wildcard represents any number of characters.
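
To see the two wildcards side by side, here is a small self-contained sketch. It creates a few empty files in a temporary directory (the file names are made up for illustration) and matches them with each pattern.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Create a few empty, hypothetical files to match against
    for name in ['part_1.csv', 'part_2.csv', 'part_10.csv', 'notes.txt']:
        open(os.path.join(workdir, name), 'w').close()

    def match(pattern):
        # Return just the file names, sorted, for a pattern inside workdir
        paths = glob.glob(os.path.join(workdir, pattern))
        return sorted(os.path.basename(p) for p in paths)

    single_char = match('part_?.csv')   # ? stands for exactly one character
    all_csv = match('*.csv')            # * stands for any number of characters
    all_parts = match('part_*')

print(single_char)
print(all_csv)
print(all_parts)
```

Here part_10.csv is not matched by 'part_?.csv' because two characters sit between part_ and .csv.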



### Instructions
- Import the glob module along with pandas (as its usual alias pd).
- Write a pattern to match all .csv files.
- Save all files that match the pattern using the glob() function within the glob module. That is, by using glob.glob().
- Print the list of file names.
- Read the second-to-last file in csv_files (i.e., index -2) into a DataFrame called csv2.


In [65]:
import glob
pattern = '*.csv'
#save all file matches
csv_files = glob.glob(pattern)
csv_files

['airquality.csv',
 'dob_job_application_filings_subset.csv',
 'ebola.csv',
 'nyc_uber_2014.csv',
 'tb.csv',
 'uber1.csv',
 'uber2.csv',
 'uber3.csv']

In [66]:
csv2 = pd.read_csv(csv_files[-2])
csv2.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,5/1/2014 0:02:00,40.7521,-73.9914,B02512
1,5/1/2014 0:06:00,40.6965,-73.9715,B02512
2,5/1/2014 0:15:00,40.7464,-73.9838,B02512
3,5/1/2014 0:17:00,40.7463,-74.0011,B02512
4,5/1/2014 0:17:00,40.7594,-73.9734,B02512


### Iterating and concatenating all matches
Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

You'll start with an empty list called frames. Your job is to use a for loop to:

- iterate through each of the filenames,
- read each filename into a DataFrame, and then
- append it to the frames list.

You can then concatenate this list of DataFrames using pd.concat(). Go for it!



In [67]:
frames = []
for csv in csv_files[-3:]:
    df = pd.read_csv(csv)
    frames.append(df)
    
uber = pd.concat(frames)

In [68]:
uber.shape
uber.head()

(297, 4)

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512
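
The slice csv_files[-3:] works here only because the three Uber files happen to come last in the list, and glob.glob() does not promise any particular order. A more defensive variant, sketched below under the assumption that the files share a common prefix, uses a narrower pattern and sorts the matches. The tiny files written here are hypothetical placeholders, not the real data.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as workdir:
    # Write three tiny hypothetical monthly files plus one unrelated file
    for month in (1, 2, 3):
        path = os.path.join(workdir, f'uber{month}.csv')
        pd.DataFrame({'month': [month, month]}).to_csv(path, index=False)
    pd.DataFrame({'x': [0]}).to_csv(os.path.join(workdir, 'other.csv'), index=False)

    # Match only the uber files and sort them so the order is predictable
    uber_files = sorted(glob.glob(os.path.join(workdir, 'uber?.csv')))

    # Read every match and stack the results row-wise
    frames = [pd.read_csv(path) for path in uber_files]
    uber_all = pd.concat(frames, ignore_index=True)

print(uber_all.shape)
```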


# Merging Data

### 1-to-1 data merge
Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Here, you'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and 1930s while they were on an expedition to Antarctica. The dataset was taken from an SQLite database from the Software Carpentry SQL lesson.
Your task is to perform a 1-to-1 merge of the site and visited DataFrames using the 'name' column of site and the 'site' column of visited.

In [69]:
site = pd.DataFrame({'name':['DR-1', 'DR-3', 'MSK-4'], 'lat':[-49.85, -47.15, -48.87], 'long':[-128.57, -126.72, -123.40]})

In [70]:
site

Unnamed: 0,name,lat,long
0,DR-1,-49.85,-128.57
1,DR-3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [74]:
visited = pd.DataFrame({'ident':[619, 734, 837], 'site':['DR-1', 'DR-3', 'MSK-4'], 'dated':['1927-02-08', '1939-01-07', '1932-01-14']})
visited

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,734,DR-3,1939-01-07
2,837,MSK-4,1932-01-14


In [75]:
# merge the dataframes
o2o = pd.merge(left = site, right = visited, left_on = 'name', right_on = 'site' )
o2o

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
2,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14
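
Because the key columns have different names, the merged result carries both 'name' and 'site' with identical values. If a single key column is preferred, one option (a sketch, not the only approach) is to rename the key on one side and merge with on=. The variable o2o_single_key is introduced here purely for illustration.

```python
import pandas as pd

site = pd.DataFrame({'name': ['DR-1', 'DR-3', 'MSK-4'],
                     'lat': [-49.85, -47.15, -48.87],
                     'long': [-128.57, -126.72, -123.40]})
visited = pd.DataFrame({'ident': [619, 734, 837],
                        'site': ['DR-1', 'DR-3', 'MSK-4'],
                        'dated': ['1927-02-08', '1939-01-07', '1932-01-14']})

# Rename the right-hand key so both frames share the column name 'name',
# then merge on that single column
o2o_single_key = pd.merge(site, visited.rename(columns={'site': 'name'}), on='name')

print(list(o2o_single_key.columns))
```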


### Many-to-1 data merge
In a many-to-one (or one-to-many) merge, one of the values will be duplicated and recycled in the output. That is, one of the keys in the merge is not unique.

In [80]:
# np.nan marks the missing date so it is stored as a real missing value, not the string 'nan'
visited_1 = pd.DataFrame({'ident':[619, 622, 734, 735, 751, 752, 837, 844],
                          'site':['DR-1', 'DR-1', 'DR-3', 'DR-3', 'DR-3', 'DR-3', 'MSK-4', 'DR-1'],
                          'dated':['1927-02-08', '1927-02-10', '1939-01-07', '1930-01-12',
                                   '1930-02-26', np.nan, '1932-01-14', '1932-03-22']})

In [78]:
visited_1

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,
6,837,MSK-4,1932-01-14
7,844,DR-1,1932-03-22


Note that this time, visited_1 has multiple entries for each value in the site column.  
The pd.merge() call is the same as in the 1-to-1 merge from the previous exercise, but the data and output will be different.

In [81]:
m2o = pd.merge(left = site, right = visited_1, left_on = 'name', right_on = 'site')
m2o

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-1,-49.85,-128.57,622,DR-1,1927-02-10
2,DR-1,-49.85,-128.57,844,DR-1,1932-03-22
3,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
4,DR-3,-47.15,-126.72,735,DR-3,1930-01-12
5,DR-3,-47.15,-126.72,751,DR-3,1930-02-26
6,DR-3,-47.15,-126.72,752,DR-3,
7,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14
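
If you want pandas to check which kind of merge you are actually performing, pd.merge() takes a validate argument (assuming a pandas version that supports it). The sketch below uses a shortened, made-up version of the two frames: the one-to-many check passes, while the one-to-one check raises a MergeError because 'DR-1' appears twice on the right.

```python
import pandas as pd

site = pd.DataFrame({'name': ['DR-1', 'DR-3'], 'lat': [-49.85, -47.15]})
visits = pd.DataFrame({'ident': [619, 622, 734],
                       'site': ['DR-1', 'DR-1', 'DR-3']})

# Keys are unique on the left and repeated on the right, so this check passes
merged = pd.merge(site, visits, left_on='name', right_on='site',
                  validate='one_to_many')

# Asking for a strict one-to-one merge on the same data fails
try:
    pd.merge(site, visits, left_on='name', right_on='site',
             validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(len(merged), one_to_one_ok)
```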


### Many-to-many data merge
The final merging scenario occurs when neither DataFrame has unique keys for the merge. In that case, every pairwise combination of matching rows is created for each duplicated key.

In [82]:
df1 = pd.DataFrame({'c1':['a', 'a', 'b', 'b'], 'c2':[1, 2, 3, 4]})
df1

Unnamed: 0,c1,c2
0,a,1
1,a,2
2,b,3
3,b,4


In [83]:
df2 = pd.DataFrame({'c1':['a', 'a', 'b', 'b'], 'c2':[10, 20, 30, 40]})
df2

Unnamed: 0,c1,c2
0,a,10
1,a,20
2,b,30
3,b,40


In [87]:
df3 = pd.merge(df1, df2, on ='c1')

In [88]:
df3

Unnamed: 0,c1,c2_x,c2_y
0,a,1,10
1,a,1,20
2,a,2,10
3,a,2,20
4,b,3,30
5,b,3,40
6,b,4,30
7,b,4,40
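
The size of a many-to-many result can be predicted before merging: for each key, the number of output rows is the count of that key on the left times its count on the right. The sketch below checks this on the same df1 and df2, and also uses the suffixes argument to replace the default _x and _y labels; the names rows_per_key and expected_rows are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': ['a', 'a', 'b', 'b'], 'c2': [1, 2, 3, 4]})
df2 = pd.DataFrame({'c1': ['a', 'a', 'b', 'b'], 'c2': [10, 20, 30, 40]})

# Occurrences of each key on the left times occurrences on the right
rows_per_key = df1['c1'].value_counts() * df2['c1'].value_counts()
expected_rows = int(rows_per_key.sum())

# suffixes controls how the clashing 'c2' columns are renamed
merged = pd.merge(df1, df2, on='c1', suffixes=('_left', '_right'))

print(expected_rows, len(merged))
```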


In [90]:
# np.nan marks readings with no recorded person (a real missing value, not the string 'nan')
survey = pd.DataFrame({
    'taken': [619, 619, 622, 622, 734, 734, 734, 735, 735, 735, 751, 751, 751,
              752, 752, 752, 752, 837, 837, 837, 844],
    'person': ['dyer', 'dyer', 'dyer', 'dyer', 'pb', 'lake', 'pb', 'pb', np.nan, np.nan,
               'pb', 'pb', 'lake', 'lake', 'lake', 'lake', 'roe', 'lake', 'lake',
               'roe', 'roe'],
    'quant': ['rad', 'sal', 'rad', 'sal', 'rad', 'sal', 'temp', 'rad', 'sal',
              'temp', 'rad', 'temp', 'sal', 'rad', 'sal', 'temp', 'sal', 'rad',
              'sal', 'sal', 'rad'],
    'reading': [9.82, 0.13, 7.8, 0.09, 8.41, 0.05, -21.5, 7.22,
                0.06, -26.0, 4.35, -18.5, 0.1, 2.19, 0.09, -16.0,
                41.6, 1.46, 0.21, 22.5, 11.25]})

In [91]:
survey

Unnamed: 0,taken,person,quant,reading
0,619,dyer,rad,9.82
1,619,dyer,sal,0.13
2,622,dyer,rad,7.8
3,622,dyer,sal,0.09
4,734,pb,rad,8.41
5,734,lake,sal,0.05
6,734,pb,temp,-21.5
7,735,pb,rad,7.22
8,735,,sal,0.06
9,735,,temp,-26.0


In [95]:
# merge site and visited_1
m2m = pd.merge(left=site, right=visited_1, left_on='name', right_on='site')

In [96]:
# merge m2m and survey
m2m = pd.merge(left=m2m, right=survey,  left_on='ident', right_on='taken')

In [97]:
m2m

Unnamed: 0,name,lat,long,ident,site,dated,taken,person,quant,reading
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08,619,dyer,rad,9.82
1,DR-1,-49.85,-128.57,619,DR-1,1927-02-08,619,dyer,sal,0.13
2,DR-1,-49.85,-128.57,622,DR-1,1927-02-10,622,dyer,rad,7.8
3,DR-1,-49.85,-128.57,622,DR-1,1927-02-10,622,dyer,sal,0.09
4,DR-1,-49.85,-128.57,844,DR-1,1932-03-22,844,roe,rad,11.25
5,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,pb,rad,8.41
6,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,lake,sal,0.05
7,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,pb,temp,-21.5
8,DR-3,-47.15,-126.72,735,DR-3,1930-01-12,735,pb,rad,7.22
9,DR-3,-47.15,-126.72,735,DR-3,1930-01-12,735,,sal,0.06
