The ability to transform and combine your data is a crucial skill in data science, because your data may not always come in one monolithic file or table. A large dataset may be broken into separate datasets to make storage and sharing easier. Or, if you are dealing with time series data, you may have a new dataset for each day. Whatever the reason, you need to be able to combine datasets, whether you combine them first and clean the result, or clean each one separately and combine them afterwards, so that you can run your analysis on a single dataset. In this chapter, you'll learn all about combining data.

### Combining data
- Data may not always come in 1 huge file
- 5 million row dataset may be broken into 5 separate datasets
- Easier to store and share
- May have new data for each day
- Important to be able to combine then clean, or vice versa

### Concatenation
![concat.PNG](concat.PNG)


### pandas concat

In [10]:
import pandas as pd
weather_p1 = pd.read_csv('w1.csv')
weather_p1['date'] = pd.to_datetime(weather_p1['date'])

weather_p2 = pd.read_csv('w2.csv')
weather_p2['date'] = pd.to_datetime(weather_p2['date'])

print(weather_p1)
print()
print(weather_p2)

        date element  value
0 2010-01-30    tmax   27.8
1 2010-01-30    tmin   14.5

        date element  value
0 2010-02-02    tmax   27.3
1 2010-02-02    tmin   14.4


In [11]:
concatenated = pd.concat([weather_p1, weather_p2])
concatenated

        date element  value
0 2010-01-30    tmax   27.8
1 2010-01-30    tmin   14.5
0 2010-02-02    tmax   27.3
1 2010-02-02    tmin   14.4


### pandas concat

In [12]:
concatenated = concatenated.loc[0, :]
concatenated

        date element  value
0 2010-01-30    tmax   27.8
0 2010-02-02    tmax   27.3


In [13]:
pd.concat([weather_p1, weather_p2], ignore_index=True)

        date element  value
0 2010-01-30    tmax   27.8
1 2010-01-30    tmin   14.5
2 2010-02-02    tmax   27.3
3 2010-02-02    tmin   14.4
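
The effect of `ignore_index` can be checked directly on the index. The following is a minimal sketch that rebuilds the two weather tables in memory, with values copied from the output above, instead of reading `w1.csv` and `w2.csv`; the names `stacked`, `renumbered`, and `has_duplicate_labels` are illustrative only.

```python
import pandas as pd

# In-memory stand-ins for w1.csv and w2.csv (values copied from the output above)
weather_p1 = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-30', '2010-01-30']),
    'element': ['tmax', 'tmin'],
    'value': [27.8, 14.5],
})
weather_p2 = pd.DataFrame({
    'date': pd.to_datetime(['2010-02-02', '2010-02-02']),
    'element': ['tmax', 'tmin'],
    'value': [27.3, 14.4],
})

# Default behaviour: each input keeps its own row labels, so labels repeat
stacked = pd.concat([weather_p1, weather_p2])
has_duplicate_labels = not stacked.index.is_unique

# ignore_index=True discards the original labels and numbers the rows 0..n-1
renumbered = pd.concat([weather_p1, weather_p2], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Because the default result carries repeated labels, a lookup such as `.loc[0]` returns one row per input DataFrame, which is why the earlier cell shows two rows.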


## Concatenating DataFrames
![concat2](concat2.PNG)

---
# Let’s practice!


# Finding and concatenating data

### Concatenating many files
- Leverage Python’s features with data cleaning in pandas
- In order to concatenate DataFrames:
    - They must be in a list
- Can individually load if there are a few datasets
- But what if there are thousands?
- Solution: glob function to find files based on a pattern

### Globbing
- Pattern matching for file names
- Wildcards: * ?
    - Any csv file: *.csv
    - Any single character: file_?.csv
- Returns a list of file names
- Can use this list to load into separate DataFrames
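
The two wildcards can be tried out on a throwaway directory. This is a small sketch, assuming hypothetical file names (`file_1.csv`, `file_10.csv`, and so on) created only for the demonstration; the results are sorted because `glob.glob` returns matches in arbitrary order.

```python
import glob
import os
import tempfile

# Create a temporary directory with hypothetical file names to match against
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['file_1.csv', 'file_2.csv', 'file_10.csv', 'notes.txt']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # * matches any run of characters, so this finds every CSV file
    all_csv = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmp_dir, '*.csv'))
    )

    # ? matches exactly one character, so file_10.csv is not matched
    single_char = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmp_dir, 'file_?.csv'))
    )

print(all_csv)
print(single_char)
```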

### The plan
- Load files from globbing into pandas
- Add the DataFrames into a list
- Concatenate multiple datasets at once

## Find and concatenate

In [15]:
import glob
csv_files = glob.glob('*.csv')


csv_files

['airquality.csv',
 'dob_job_application_filings_subset.csv',
 'ebola.csv',
 'gapminder.csv',
 'literacy_birth_rate.csv',
 'mp_data.csv',
 'nyc_uber_2014.csv',
 'tb.csv',
 'tiddy.csv',
 'tips.csv',
 'w1.csv',
 'w2.csv',
 'weather_tidy.csv']

### Using loops

```python
# Collect one DataFrame per file
list_data = []

for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)

# Stack all DataFrames row-wise into a single DataFrame
pd.concat(list_data)
```
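
Concatenation is most useful when the matched files share the same columns. As a hedged, end-to-end sketch of the plan, the example below writes two tiny hypothetical weather files to a temporary directory, globs only those files, and records each row's origin in a `source_file` column. The `source_file` column, the `w?.csv` pattern, and the sample values are assumptions made for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two tiny hypothetical files that share the same columns
    pd.DataFrame({'element': ['tmax', 'tmin'], 'value': [27.8, 14.5]}).to_csv(
        os.path.join(tmp_dir, 'w1.csv'), index=False)
    pd.DataFrame({'element': ['tmax', 'tmin'], 'value': [27.3, 14.4]}).to_csv(
        os.path.join(tmp_dir, 'w2.csv'), index=False)

    # Sort the matches because glob returns them in arbitrary order
    weather_files = sorted(glob.glob(os.path.join(tmp_dir, 'w?.csv')))

    frames = []
    for path in weather_files:
        frame = pd.read_csv(path)
        # Record where each row came from (an illustrative extra column)
        frame['source_file'] = os.path.basename(path)
        frames.append(frame)

# Stack the per-file DataFrames and renumber the rows
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Using a narrower pattern than `*.csv` keeps unrelated datasets out of the result; concatenating files with different columns still works, but the missing cells are filled with `NaN`.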


# Merge data

In [16]:
state_populations = pd.read_csv('state_pop.csv')
state_codes = pd.read_csv('state_cod.csv')

In [17]:
state_populations

        state  population_2016
0  California         39250017
1       Texas         27862596
2     Florida         20612439
3    New York         19745289


In [18]:
state_codes

         name ANSI
0  California   CA
1     Florida   FL
2    New York   NY
3       Texas   TX


## Merging data
- Similar to joining tables in SQL
- Combine disparate datasets based on common columns

In [19]:
pd.merge(left=state_populations, right=state_codes,
         on=None, left_on='state', right_on='name')

        state  population_2016        name ANSI
0  California         39250017  California   CA
1       Texas         27862596       Texas   TX
2     Florida         20612439     Florida   FL
3    New York         19745289    New York   NY
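
Because the key columns have different names, both `state` and `name` appear in the merged result. One possible follow-up, sketched here with the tables rebuilt in memory from the output above instead of read from the CSV files, is to drop the redundant key column after merging.

```python
import pandas as pd

# In-memory stand-ins for state_pop.csv and state_cod.csv
state_populations = pd.DataFrame({
    'state': ['California', 'Texas', 'Florida', 'New York'],
    'population_2016': [39250017, 27862596, 20612439, 19745289],
})
state_codes = pd.DataFrame({
    'name': ['California', 'Florida', 'New York', 'Texas'],
    'ANSI': ['CA', 'FL', 'NY', 'TX'],
})

# Match rows where state_populations['state'] equals state_codes['name']
merged = pd.merge(left=state_populations, right=state_codes,
                  left_on='state', right_on='name')

# Both key columns survive the merge; drop the redundant one
merged = merged.drop(columns='name')
print(merged)
```

The rows follow the order of the left DataFrame, even though `state_codes` lists the states alphabetically.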


### Types of merges
- One-to-one
- Many-to-one / one-to-many
- Many-to-many
- All use the same function
- Only difference is the DataFrames you are merging
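
The merge types can be illustrated with small tables. In this sketch the `cities` and `visits` tables are hypothetical and exist only for the demonstration; the optional `validate` argument of `pd.merge` is used to have pandas check the expected relationship, which the call does not otherwise require.

```python
import pandas as pd

state_codes = pd.DataFrame({'name': ['California', 'Texas'],
                            'ANSI': ['CA', 'TX']})

# Hypothetical city table: several rows per state
cities = pd.DataFrame({
    'city': ['Los Angeles', 'San Diego', 'Houston'],
    'state': ['California', 'California', 'Texas'],
})

# Many-to-one: many city rows each match a single code row
many_to_one = pd.merge(cities, state_codes, left_on='state', right_on='name',
                       validate='many_to_one')

# Declaring the same merge as one-to-one fails, because 'California' repeats on the left
try:
    pd.merge(cities, state_codes, left_on='state', right_on='name',
             validate='one_to_one')
    one_to_one_rejected = False
except pd.errors.MergeError:
    one_to_one_rejected = True

# Hypothetical visits table: also several rows per state
visits = pd.DataFrame({'state': ['California', 'California', 'Texas'],
                       'year': [2015, 2016, 2016]})

# Many-to-many: every matching pair of rows is produced
many_to_many = pd.merge(cities, visits, on='state')

print(len(many_to_one))
print(one_to_one_rejected)
print(len(many_to_many))
```

The function call is the same in every case; the number of result rows changes only because of how often each key repeats in the two inputs.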
