# Data Analysis with Python
## Combining DataFrames with pandas
Questions
* Can I work with data from multiple sources?
* How can I combine data from different data sets?

Objectives
* Combine data from multiple files into a single DataFrame using `concat` and `merge`.
* Combine two DataFrames using a unique ID found in both DataFrames.

## List data files

In [None]:
# Function for "globbing" (searching by file name pattern)
from glob import glob

# List a collection of CSV files
# (sorted, because glob() returns matches in arbitrary order)
csv_files = sorted(glob('../data/by_year/*.csv'))
csv_files[-5:]

## Concatenating DataFrames

In [None]:
import pandas as pd

year2001 = pd.read_csv('../data/by_year/surveys_2001.csv')
year2002 = pd.read_csv('../data/by_year/surveys_2002.csv')

print(year2001.shape, year2002.shape)

In [None]:
# Stack the DataFrames on top of each other
vertical = pd.concat([year2001, year2002], axis='index')
vertical

In [None]:
# Reset the index values of the DataFrame
# The drop=True option discards the old index instead of keeping it as a new column
vertical = vertical.reset_index(drop=True)
vertical
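
As a side note to the two cells above, `pd.concat()` can also renumber the rows in the same call through its `ignore_index` parameter. The sketch below uses two tiny made-up tables (`part_a` and `part_b` are hypothetical stand-ins, not the survey files) to show why the index needs resetting at all.

```python
import pandas as pd

# Two tiny stand-in tables (hypothetical values, not the survey data)
part_a = pd.DataFrame({'record_id': [1, 2], 'weight': [40.0, 48.0]})
part_b = pd.DataFrame({'record_id': [3, 4], 'weight': [29.0, 46.0]})

# Plain concatenation keeps each table's own index, so labels repeat
stacked = pd.concat([part_a, part_b], axis='index')
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True builds a fresh 0..n-1 index in the same call
renumbered = pd.concat([part_a, part_b], axis='index', ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Both routes should give the same table here; `reset_index(drop=True)` is simply the more explicit two-step version.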

In [None]:
# Accumulate data from all files in the collection
df_list = []

# Sort the file names so the row order of the result is reproducible
for filename in sorted(glob('../data/by_year/*.csv')):
    df_by_year = pd.read_csv(filename)
    df_list.append(df_by_year)

surveys_df = pd.concat(df_list, axis='index')
surveys_df = surveys_df.reset_index(drop=True)
surveys_df

## Exercise - Concatenating DataFrames
* Load the data from all CSV files in the directory
  `../data/by_species_id/` and accumulate them in a list.
* Concatenate the DataFrames of that list.
* Reset the index without preserving it.

(4 min.)

In [None]:
df_list = []

for filename in glob('../data/by_species_id/*.csv'):
    df_list.append(pd.read_csv(filename))

surveys_sp = pd.concat(df_list, axis='index').reset_index(drop=True)
surveys_sp

* Compute the average weight by sex for each species. (1 min.)

In [None]:
# Get the average weight by sex for each species
weight_species = surveys_sp.groupby(
    ['species_id', 'sex'])['weight'].mean().unstack()
weight_species

* Export your results as a CSV file and make sure
  it reads back into Python properly. (3 min.)

In [None]:
# Writing to file while keeping the index
csv_file = 'weight_by_species.csv'
weight_species.to_csv(csv_file, index=True)

# Reading it back in with a specified index column
pd.read_csv(csv_file, index_col='species_id')
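
One possible way to check the round trip programmatically, rather than by eye, is `pd.testing.assert_frame_equal()`. This is a minimal sketch on a made-up table (`toy` is a hypothetical stand-in for `weight_species`) written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for weight_species: species_id index, one column per sex
toy = pd.DataFrame(
    {'F': [9.5, 41.0], 'M': [10.0, 44.5]},
    index=pd.Index(['BA', 'DM'], name='species_id'),
)

# Write to a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_weights.csv')
    toy.to_csv(path, index=True)
    round_trip = pd.read_csv(path, index_col='species_id')

# Raises an AssertionError if values, labels, or dtypes differ
pd.testing.assert_frame_equal(toy, round_trip)
```

With the real `weight_species` table, the column axis carries the name `sex` after `unstack()`, and a CSV file does not store that name, so a strict comparison would likely need the name cleared first.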

## Joining Two DataFrames

In [None]:
# Import a small subset of the species data designed for this part of the lesson
species_sub = pd.read_csv('../data/speciesSubset.csv')
species_sub

In [None]:
# The first ten records
head10 = surveys_df.head(10)
head10

### Identifying join keys

In [None]:
head10.columns

In [None]:
species_sub.columns
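
Instead of comparing the two column listings by eye, the shared column names can be computed. The sketch below assumes two empty, made-up tables (`surveys_toy` and `species_toy` are hypothetical, and so are their column lists) and uses `Index.intersection()`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (columns only, no rows)
surveys_toy = pd.DataFrame(columns=['record_id', 'year', 'species_id', 'weight'])
species_toy = pd.DataFrame(columns=['species_id', 'genus', 'species', 'taxa'])

# Column labels present in both tables are candidate join keys
shared = surveys_toy.columns.intersection(species_toy.columns)
print(shared.tolist())   # ['species_id']
```

A shared name is only a candidate: whether it really identifies the same thing in both tables still has to be checked against the data.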

### Inner joins

![Inner join of tables A and B](https://datacarpentry.org/python-ecology-lesson/fig/inner-join.png)

In [None]:
# Computing the inner join of head10 and species_sub
key = 'species_id'
merged_inner = pd.merge(
    left=head10,
    right=species_sub,
    left_on=key,
    right_on=key
)
# What's the size of the output data?
merged_inner.shape

In [None]:
merged_inner

### Left joins

![Left join of tables A and B](https://datacarpentry.org/python-ecology-lesson/fig/left-join.png)

In [None]:
merged_left = pd.merge(
    left=head10,
    right=species_sub,
    how='left',
    on=key
)
# What's the size of the output data?
merged_left.shape

In [None]:
merged_left
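
In a left join, rows of the left table that have no partner on the right are kept, and the columns coming from the right table are filled with `NaN`. The sketch below shows one way to list those rows; the tables and the `PF` record without a match are made up for illustration.

```python
import pandas as pd

# Hypothetical toy tables: 'PF' has no entry in the species table
surveys_toy = pd.DataFrame({
    'record_id': [1, 2, 3],
    'species_id': ['NL', 'DM', 'PF'],
})
species_toy = pd.DataFrame({
    'species_id': ['NL', 'DM'],
    'genus': ['Neotoma', 'Dipodomys'],
})

left_toy = pd.merge(
    left=surveys_toy,
    right=species_toy,
    how='left',
    on='species_id'
)

# Rows without a match have NaN in the columns from the right table
unmatched = left_toy[left_toy['genus'].isna()]
print(unmatched['species_id'].tolist())   # ['PF']
```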

### Other join types
* `how='right'` : all rows from the right DataFrame are kept
* `how='outer'` : all rows from both DataFrames are kept; values missing on either side are filled with `NaN`
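
To compare the join types side by side, here is a small sketch on two made-up tables (`left_toy` and `right_toy` are hypothetical) that share the keys `b` and `c` only. The `indicator=True` option of `pd.merge()` adds a `_merge` column telling where each row came from.

```python
import pandas as pd

# Hypothetical toy tables sharing the keys 'b' and 'c'
left_toy = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right_toy = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Count the rows produced by each join type
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    joined = pd.merge(left=left_toy, right=right_toy, how=how, on='key')
    row_counts[how] = len(joined)
print(row_counts)   # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True labels each row as 'left_only', 'right_only' or 'both'
outer_toy = pd.merge(
    left=left_toy,
    right=right_toy,
    how='outer',
    on='key',
    indicator=True
)
print(outer_toy[['key', '_merge']])
```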

## Exercise - Joining all data
`1`. Create a new DataFrame by joining the contents of the
`surveys_df` and `species.csv` tables. Keep all survey records.
(3 min.)

In [None]:
species_df = pd.read_csv('../data/species.csv')

merged_left = pd.merge(
    left=surveys_df,
    right=species_df,
    how='left',
    on='species_id'
)
merged_left.shape
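
A left join keeps every survey record, but it can also add rows if a key appears more than once in the right table. As a hedged sketch on made-up tables, the `validate` parameter of `pd.merge()` can guard against that, and a row count comparison confirms that nothing was lost or duplicated.

```python
import pandas as pd

# Hypothetical toy tables: several surveys per species, one row per species
surveys_toy = pd.DataFrame({
    'record_id': [1, 2, 3, 4],
    'species_id': ['DM', 'DM', 'NL', 'PF'],
})
species_toy = pd.DataFrame({
    'species_id': ['DM', 'NL'],
    'genus': ['Dipodomys', 'Neotoma'],
})

# validate='many_to_one' raises MergeError if species_id repeats on the right
merged_toy = pd.merge(
    left=surveys_toy,
    right=species_toy,
    how='left',
    on='species_id',
    validate='many_to_one'
)

# Every survey record is kept exactly once
print(len(merged_toy) == len(surveys_toy))   # True
```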

`2`. Calculate how the average hindfoot length of each
genus changes from year to year. Reshape the
result so that each genus gets its own column. (4 min.)

In [None]:
average_lengths = merged_left.groupby(
    ['year', 'genus'])['hindfoot_length'].mean().unstack()
average_lengths.tail()

`3`. Calculate the average weight per sex for each genus. For this
exercise, we will use a pivot table instead of `unstack()`. (3 min.)

In [None]:
# Use pivot_table() instead of groupby() + unstack()
merged_left.pivot_table(
    values='weight',
    index='genus',
    columns='sex',
    aggfunc='mean'
)
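
To see that the two approaches agree, the sketch below runs both on a small made-up table (`toy` and its values are hypothetical) and compares the results with `pd.testing.assert_frame_equal()`.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey files
toy = pd.DataFrame({
    'genus': ['Dipodomys', 'Dipodomys', 'Dipodomys', 'Neotoma', 'Neotoma'],
    'sex': ['F', 'M', 'M', 'F', 'M'],
    'weight': [40.0, 44.0, 48.0, 150.0, 170.0],
})

# Route 1: pivot table
via_pivot = toy.pivot_table(
    values='weight',
    index='genus',
    columns='sex',
    aggfunc='mean'
)

# Route 2: groupby() + unstack()
via_groupby = toy.groupby(['genus', 'sex'])['weight'].mean().unstack()

# Raises an AssertionError if the two tables differ
pd.testing.assert_frame_equal(via_pivot, via_groupby)
print(via_pivot)
```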

## Technical Summary
* **Concatenate** DataFrames with `pandas.concat()`
  * Requires a list of DataFrames
  * Vertically if `axis='index'` (by default)
  * Horizontally if `axis='columns'`
  * Resetting the index: `reset_index(drop=True)`
* **Join** DataFrames with `pandas.merge()`
  * `left=`, `right=`: both DataFrames to join
  * `how=`: `'inner'` (default), `'left'`, `'right'`, `'outer'`
  * `left_on=`, `right_on=`: join key for each DataFrame
  * `on=`: join key for both DataFrames
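
The summary mentions horizontal concatenation with `axis='columns'`, which is not demonstrated elsewhere in this notebook. A minimal sketch on made-up tables (`ids`, `weights` and `shifted` are hypothetical) shows that rows are aligned on their index labels, not on their position.

```python
import pandas as pd

# Hypothetical toy tables with the same default index 0, 1, 2
ids = pd.DataFrame({'record_id': [1, 2, 3]})
weights = pd.DataFrame({'weight': [40.0, 48.0, 29.0]})

# Matching index labels: the columns are placed side by side
side_by_side = pd.concat([ids, weights], axis='columns')
print(side_by_side.shape)   # (3, 2)

# Different index labels: rows are aligned by label and gaps become NaN
shifted = weights.set_axis([1, 2, 3], axis='index')
misaligned = pd.concat([ids, shifted], axis='columns')
print(misaligned.shape)     # (4, 2)
```

When the tables should be matched on a column value instead of the index, `pandas.merge()` is usually the better fit.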