# Combining data for analysis

**Table of Contents**

1. [Concatenating data](#Concatenating-data)
2. [Finding and concatenating data](#Finding-and-concatenating-data)
3. [Merge data](#Merge-data)

## Concatenating data

- Combining data
    - Data may not always come in 1 huge file
        - 5 million row dataset may be broken into 5 separate datasets
        - Easier to store and share
        - May have new data for each day
    - Important to be able to combine then clean, or vice versa
- pd.concat()
        In [1]: concatenated = pd.concat([weather_p1, weather_p2])
        In [2]: print(concatenated)
                 date element  value
        0  2010-01-30    tmax   27.8
        1  2010-01-30    tmin   14.5
        0  2010-02-02    tmax   27.3
        1  2010-02-02    tmin   14.4
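    - Stacks the DataFrames purely in the order given; rows are not matched on a key column, so datasets sorted differently will not line up
    - Original index labels are kept in the new dataset (note the repeated 0 and 1 above)
    - ignore_index=True discards the old labels and creates a new 0..n-1 index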
            In [4]: pd.concat([weather_p1, weather_p2], ignore_index=True)
            Out[4]:
                     date element  value
            0  2010-01-30    tmax   27.8
            1  2010-01-30    tmin   14.5
            2  2010-02-02    tmax   27.3
            3  2010-02-02    tmin   14.4
    - axis=1, concatenate data column-wise
        - default axis=0, row-wise
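
The options above can be tried on a small example. This is a minimal sketch: `weather_p1` and `weather_p2` are hypothetical stand-ins rebuilt from the values printed in the notes, not the original course files.

```python
import pandas as pd

# Hypothetical stand-ins for weather_p1 and weather_p2 from the notes
weather_p1 = pd.DataFrame({'date': ['2010-01-30', '2010-01-30'],
                           'element': ['tmax', 'tmin'],
                           'value': [27.8, 14.5]})
weather_p2 = pd.DataFrame({'date': ['2010-02-02', '2010-02-02'],
                           'element': ['tmax', 'tmin'],
                           'value': [27.3, 14.4]})

# Row-wise (default axis=0): original index labels are kept, so 0 and 1 repeat
stacked = pd.concat([weather_p1, weather_p2])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([weather_p1, weather_p2], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]

# Column-wise (axis=1): rows are paired by index label, columns sit side by side
side_by_side = pd.concat([weather_p1, weather_p2], axis=1)
print(side_by_side.shape)          # (2, 6)
```

Because `axis=1` pairs rows by index label, two pieces whose indexes differ would produce rows padded with NaN rather than an error.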
    

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import jupyterthemes.jtplot as jtplot
%matplotlib inline
jtplot.style(theme='onedork')

In [2]:
# Split the dataset into parts of 99 rows (disabled: wrapped in a string so it does not run)
'''uber = pd.read_csv('exercise/nyc_uber_2014.csv', chunksize=99, index_col=0)
part_x = 1
for part in uber:
    if part_x == 1:
        uber1 = part
        uber1.to_csv('exercise/nyc_uber_2014_04.csv')
    elif part_x == 2:
        uber2 = part
        uber2.to_csv('exercise/nyc_uber_2014_05.csv')
    elif part_x == 3:
        uber3 = part
        uber3.to_csv('exercise/nyc_uber_2014_06.csv')
    part_x += 1'''


In [3]:

uber1 = pd.read_csv('exercise/nyc_uber_2014_04.csv', index_col=0)
uber2 = pd.read_csv('exercise/nyc_uber_2014_05.csv', index_col=0)
uber3 = pd.read_csv('exercise/nyc_uber_2014_06.csv', index_col=0)


# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())

# Each file keeps its own index, so label 0 matches one row per file
print(row_concat.loc[0])


(297, 4)
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
0  5/1/2014 0:02:00  40.7521 -73.9914  B02512
0  6/1/2014 0:00:00  40.7293 -73.9920  B02512


In [4]:
ebola = pd.read_csv('exercise/ebola.csv')

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], 
                     var_name='type_country', value_name='counts')
# Split 'type_country' on '_' into its status and country parts
status_country = pd.DataFrame()
status_country['status'] = ebola_melt['type_country'].str.split('_').str.get(0)
status_country['country'] = ebola_melt['type_country'].str.split('_').str.get(1)

# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)

# Print the head of ebola_tidy
print(ebola_tidy.head())


(1952, 6)
         Date  Day  type_country  counts status country
0    1/5/2015  289  Cases_Guinea  2776.0  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  Cases  Guinea
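
The cell above builds `status_country` by splitting the same column twice. An alternative, shown here only as a sketch and not as the course's method, is `str.split(expand=True)`, which returns both pieces in one call. The miniature `ebola_melt` below is hypothetical and its second row is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the melted ebola data (values are illustrative only)
ebola_melt = pd.DataFrame({'type_country': ['Cases_Guinea', 'Deaths_Liberia'],
                           'counts': [2776.0, 10.0]})

# expand=True returns a DataFrame with one column per split piece
status_country = ebola_melt['type_country'].str.split('_', expand=True)
status_country.columns = ['status', 'country']

# Attach the new columns next to the melted data, aligned on the shared index
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)
print(ebola_tidy.columns.tolist())
# ['type_country', 'counts', 'status', 'country']
```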


## Finding and concatenating data

- Concatenating many files
    - Leverage Python’s features with data cleaning in pandas
    - In order to concatenate DataFrames: 
        - They must be in a list
        - Can individually load if there are a few datasets
    - But what if there are thousands?
        - Solution: glob function to find files based on a pattern
- Globbing
    - Pattern matching for file names 
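    - Wildcards: `*` and `?`
        - `*.csv`
            - `*` matches any run of characters, so this finds every CSV file
        - `file_?.csv`
            - `?` matches exactly one character, e.g. file_1.csv or file_a.csv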
    - Returns a list of file names
    - Can use this list to load into separate DataFrames
- The plan
    - Load files from globbing into pandas 
    - Add the DataFrames into a list 
    - Concatenate multiple datasets at once
            In [1]: import glob
            In [2]: csv_files = glob.glob('*.csv')
            In [3]: print(csv_files)
            ['file5.csv', 'file2.csv', 'file3.csv', 'file1.csv', 'file4.csv']
            In [4]: list_data = []
            In [5]: for filename in csv_files:
               ...:     data = pd.read_csv(filename)
               ...:     list_data.append(data)
            In [6]: pd.concat(list_data)
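
The plan above can be sketched end to end without any existing data files. The temporary directory, the `file_?.csv` names, and the `id`/`value` columns are all assumptions made for this illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small hypothetical CSV files into a temporary directory
tmpdir = tempfile.mkdtemp()
for i in (1, 2, 3):
    pd.DataFrame({'id': [i], 'value': [i * 10]}).to_csv(
        os.path.join(tmpdir, f'file_{i}.csv'), index=False)

# '?' matches exactly one character; '*' would match any run of characters
pattern = os.path.join(tmpdir, 'file_?.csv')

# glob.glob() returns matches in arbitrary order, so sort for reproducibility
csv_files = sorted(glob.glob(pattern))

# Read every file, collect the DataFrames in a list, then concatenate once
frames = [pd.read_csv(path) for path in csv_files]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)   # (3, 2)
```

Sorting the file list matters whenever row order should follow file order, because `pd.concat()` stacks the frames in exactly the order they appear in the list.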

In [5]:
import glob

# Write the pattern: pattern
pattern = 'exercise/nyc_uber_2014_*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)
csv_files.sort()
# Print the file names
print(csv_files)

# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1], index_col=0)

# Print the head of csv2
print(csv2.head())


['exercise/nyc_uber_2014_04.csv', 'exercise/nyc_uber_2014_05.csv', 'exercise/nyc_uber_2014_06.csv']
          Date/Time      Lat      Lon    Base
0  5/1/2014 0:02:00  40.7521 -73.9914  B02512
1  5/1/2014 0:06:00  40.6965 -73.9715  B02512
2  5/1/2014 0:15:00  40.7464 -73.9838  B02512
3  5/1/2014 0:17:00  40.7463 -74.0011  B02512
4  5/1/2014 0:17:00  40.7594 -73.9734  B02512


In [6]:
# Create an empty list: frames
frames = []

# Iterate over csv_files
for csv in csv_files:
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv, index_col=0)

    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())

(297, 4)
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512


## Merge data

- Combining data
    - Concatenation is not the only way data can be combined
        - pd.concat(): stacks purely in the order given; if the datasets are sorted differently, rows will not line up
- Merging data
    - Similar to joining tables in SQL
    - Combine disparate datasets based on common columns
            In [1]: pd.merge(left=state_populations, right=state_codes,
               ...:          on=None, left_on='state', right_on='name')
            Out[1]:
                    state  population_2016        name ANSI
            0  California         39250017  California   CA
            1       Texas         27862596       Texas   TX
            2     Florida         20612439     Florida   FL
            3    New York         19745289    New York   NY
    - left, right
        - The two DataFrames to merge
    - left_on, right_on
        - The key column to merge on in each DataFrame (needed when the column names differ)
- Different types of merges
    - One-to-one
    - Many-to-one
    - Many-to-many
    - All use the same function
    - Only difference is the DataFrames you are merging
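
The three merge types can be compared on tiny hypothetical tables, as in the sketch below. The `codes` table and its values are invented for illustration. The `validate` argument of `pd.merge()` is optional; it is used here only to make pandas check the expected key relationship.

```python
import pandas as pd

sites = pd.DataFrame({'name': ['DR-1', 'DR-3'], 'lat': [-49.85, -47.15]})
codes = pd.DataFrame({'name': ['DR-1', 'DR-3'], 'code': ['A', 'B']})  # hypothetical
visits = pd.DataFrame({'ident': [619, 622, 734],
                       'site': ['DR-1', 'DR-1', 'DR-3']})

# One-to-one: each key appears once on both sides, giving one row per key
o2o = pd.merge(sites, codes, on='name', validate='one_to_one')

# Many-to-one: several visits share a site, so the site row repeats per visit
m2o = pd.merge(visits, sites, left_on='site', right_on='name',
               validate='many_to_one')

# Many-to-many: keys repeat on both sides, giving every pairing of matching rows
m2m = pd.merge(visits, visits, on='site')

print(len(o2o), len(m2o), len(m2m))   # 2 3 5
```

In the many-to-many case, DR-1 appears twice on each side and contributes 2 x 2 = 4 rows, while DR-3 contributes 1, which is why row counts can grow quickly when keys are duplicated on both sides.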

In [7]:
site = {'name': ['DR-1', 'DR-3', 'MSK-4'],
        'lat': [-49.85, -47.15, -48.87],
        'long': [-128.57, -126.72, -123.40]}
site_ds = pd.DataFrame(site)
visited = {'ident': [619, 622, 734, 735, 751, 752, 837, 844],
           'site': ['DR-1', 'DR-1', 'DR-3', 'DR-3', 'DR-3', 'DR-3', 'MSK-4', 'DR-1'],
           'dated': ['1927-02-08', '1927-02-10', '1939-01-07', '1930-01-12',
                     '1930-02-26', 'NaN', '1932-01-14', '1932-03-22']}
visited_ds = pd.DataFrame(visited)

# Merge the DataFrames (many visits per site): m2o
m2o = pd.merge(left=site_ds, right=visited_ds, left_on='name', right_on='site')

# Print both inputs, then the merged result m2o
print(site_ds)
print(visited_ds)
print(m2o)

    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57    622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57    844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
4   DR-3 -47.15 -126.72    735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72    751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72    752   DR-3         NaN
7  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14
