# Data Analysis and Visualization in Python
## Combining DataFrames with pandas
Questions
* Can I work with data from multiple sources?
* How can I combine data from different data sets?

Objectives
* Combine data from multiple files into a single DataFrame using `concat` and `merge`.
* Combine two DataFrames using a unique ID found in both DataFrames.

## Loading our data

In [None]:
# First make sure pandas is loaded
import pandas as pd

# Read in the survey csv
surveys_df = pd.read_csv("../data/surveys.csv")

## Concatenating DataFrames

In [None]:
# Grab the first 10 rows of the surveys table
surveys_head10 = surveys_df.head(10)
surveys_head10

In [None]:
# Grab the last 10 rows
surveys_tail10 = surveys_df.tail(10)
surveys_tail10

In [None]:
# Stack the DataFrames on top of each other
vertical_stack = pd.concat([surveys_head10, surveys_tail10], axis=0)
vertical_stack

In [None]:
# Reset index values of the dataframe
# The drop=True option avoids adding a new column that holds the old index values
vertical_stack = vertical_stack.reset_index(drop=True)
vertical_stack
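
The concatenate-then-reset pattern can also be written in one step. The sketch below assumes two toy DataFrames (`top` and `bottom`, hypothetical stand-ins for `surveys_head10` and `surveys_tail10` with made-up values) and uses the `ignore_index=True` option of `pd.concat`, which should give the same fresh 0..n-1 index as `reset_index(drop=True)`.

```python
import pandas as pd

# Toy stand-ins for surveys_head10 and surveys_tail10 (hypothetical values)
top = pd.DataFrame({"record_id": [1, 2], "weight": [40.0, 48.0]}, index=[0, 1])
bottom = pd.DataFrame({"record_id": [98, 99], "weight": [29.0, 51.0]}, index=[97, 98])

# ignore_index=True discards the original row labels and renumbers the rows
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)
print(stacked.index.tolist())
```

Which form to use is mostly a matter of taste; the two-step version keeps the original index visible for inspection before it is dropped.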

### Writing Out Data to CSV

In [None]:
# Write DataFrame to CSV without the index
vertical_stack.to_csv('surveys_sub.csv', index=False)

In [None]:
# Read our output back into Python and make sure all looks good
new_output = pd.read_csv('surveys_sub.csv')
new_output
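
Eyeballing the output works for a small table. For larger ones, a programmatic check may be more reliable. The sketch below is one possible approach, using a toy DataFrame (`toy`, hypothetical values) and an in-memory buffer instead of a file on disk; `pd.testing.assert_frame_equal` raises an `AssertionError` if the two DataFrames differ.

```python
import io

import pandas as pd

# Toy DataFrame standing in for vertical_stack (hypothetical values)
toy = pd.DataFrame({"record_id": [1, 2, 3],
                    "sex": ["M", "F", "M"],
                    "weight": [40.0, 48.0, 29.0]})

# Write to an in-memory text buffer rather than a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Raises AssertionError if values, column names, or dtypes differ
pd.testing.assert_frame_equal(toy, round_trip)
print("Round trip OK")
```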

## Exercise - Concatenating DataFrames
In `surveys_df`, select the rows where the year is 2001. Do the same for the year 2002. Concatenate the two DataFrames. Create a single bar plot that shows the average weight by sex for each year. Export your results as a CSV and make sure it reads back into Python properly.

In [None]:
# Get data for each year
survey2001 = surveys_df[surveys_df['year'] == 2001]
survey2002 = surveys_df[surveys_df['year'] == 2002]

# Concatenate vertically
survey_all = pd.concat([survey2001, survey2002], axis=0)

In [None]:
# Get the average weight by sex for each year
weight_year = survey_all.groupby(['year', 'sex'])['weight'].mean().unstack()
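weight_year

In [None]:
# Draw a single bar plot: one group of bars per year, one bar per sex
weight_year.plot(kind='bar')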

In [None]:
# Writing to file while keeping the index
weight_year.to_csv("weight_for_year.csv", index=True)

# Reading it back in with a specified index column
pd.read_csv("weight_for_year.csv", index_col='year')

## Joining Two DataFrames

In [None]:
# Import a small subset of the species data designed for this part of the lesson
species_sub = pd.read_csv('../data/speciesSubset.csv')
species_sub

### Identifying join keys

In [None]:
surveys_head10.columns

In [None]:
species_sub.columns
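
Comparing the two column listings by eye works here, but the shared column names can also be computed. A minimal sketch, assuming two toy DataFrames (`surveys_toy` and `species_toy`, hypothetical names and values) that mimic the layout of the survey and species tables:

```python
import pandas as pd

# Toy tables with made-up rows; only the column names matter here
surveys_toy = pd.DataFrame({"record_id": [1, 2],
                            "species_id": ["NL", "DM"],
                            "weight": [40.0, 48.0]})
species_toy = pd.DataFrame({"species_id": ["NL", "DM"],
                            "genus": ["Neotoma", "Dipodomys"],
                            "taxa": ["Rodent", "Rodent"]})

# Columns present in both tables are candidate join keys
shared = surveys_toy.columns.intersection(species_toy.columns)
print(list(shared))
```

A shared name is only a candidate: it is still worth checking that the column holds the same kind of identifier in both tables.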

### Inner joins

![Inner join of tables A and B](https://datacarpentry.org/python-ecology-lesson/fig/inner-join.png)

In [None]:
merged_inner = pd.merge(left=surveys_head10, right=species_sub,
                        left_on='species_id', right_on='species_id')
# What's the size of the output data?
merged_inner.shape

In [None]:
merged_inner

### Left joins

![Left join of tables A and B](https://datacarpentry.org/python-ecology-lesson/fig/left-join.png)

In [None]:
merged_left = pd.merge(left=surveys_head10, right=species_sub,
                       how='left', on='species_id')
# What's the size of the output data?
merged_left.shape

In [None]:
merged_left
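
After a left join, rows from the left table that found no partner have `NaN` in the columns that came from the right table. The sketch below shows one way to list such rows, using toy tables (`surveys_toy` and `species_toy`, hypothetical names and values) in which species `PF` is deliberately missing from the species table.

```python
import pandas as pd

# Toy tables: "PF" appears in the surveys but not in the species table
surveys_toy = pd.DataFrame({"record_id": [1, 2, 3],
                            "species_id": ["NL", "DM", "PF"]})
species_toy = pd.DataFrame({"species_id": ["NL", "DM"],
                            "genus": ["Neotoma", "Dipodomys"]})

merged = pd.merge(left=surveys_toy, right=species_toy,
                  how="left", on="species_id")

# Rows where a right-hand column is null had no match in species_toy
unmatched = merged[merged["genus"].isnull()]
print(unmatched)
```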

### Other join types
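* `how='right'` : all rows from the right DataFrame are kept; columns from the left DataFrame are `NaN` where there is no match
* `how='outer'` : all rows from both DataFrames are kept; values are `NaN` wherever one side has no match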
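
These two options can be tried on small tables where the result is easy to check by hand. The sketch below assumes two toy DataFrames (`left_toy` and `right_toy`, hypothetical names and values) that share two keys and each have one key the other lacks. The `indicator=True` argument of `pd.merge` adds a `_merge` column reporting where each row came from.

```python
import pandas as pd

# "XX" exists only on the left, "PF" only on the right
left_toy = pd.DataFrame({"species_id": ["NL", "DM", "XX"],
                         "weight": [40.0, 48.0, 12.0]})
right_toy = pd.DataFrame({"species_id": ["NL", "DM", "PF"],
                          "taxa": ["Rodent", "Rodent", "Rodent"]})

# Right join: every row of right_toy is kept, so "XX" is dropped
merged_right = pd.merge(left_toy, right_toy, how="right", on="species_id")

# Outer join: every key from either table is kept
merged_outer = pd.merge(left_toy, right_toy, how="outer",
                        on="species_id", indicator=True)

print(merged_right)
print(merged_outer)
```

In the outer result, the `_merge` column takes the values `both`, `left_only`, or `right_only`, which makes unmatched keys easy to spot.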

## Exercise - Joining all data
`1`. Create a new DataFrame by joining the contents of the `surveys.csv` and `species.csv` tables.

In [None]:
species_df = pd.read_csv("../data/species.csv")
merged_left = pd.merge(left=surveys_df, right=species_df, how='left', on='species_id')
merged_left.shape

`2`. Calculate and plot the distribution of surveys (i.e. the number of `record_id`) by `taxa` for each `plot_id`.

In [None]:
taxa_site = merged_left.groupby(['plot_id', 'taxa'])['record_id'].count().unstack()
taxa_site.head()

In [None]:
taxa_site.plot(kind='bar', stacked=True)

`3`. Calculate and plot the distribution of `taxa` by `sex` for each `plot_id`.

In [None]:
# Data cleanup: recode missing or unexpected sex values as "F|M"
merged_left.loc[merged_left['sex'].isnull(), 'sex'] = "F|M"
merged_left.loc[~merged_left['sex'].isin(["F", "F|M", "M"]), 'sex'] = "F|M"

In [None]:
ntaxa_sex_site = merged_left.groupby(['plot_id',
                                      'sex'])['taxa'].nunique().reset_index(level=1)
ntaxa_sex_site.head()

In [None]:
# Use pivot_table() instead of unstack()
ntaxa_sex_site = ntaxa_sex_site.pivot_table(values='taxa', columns='sex',
                                            index=ntaxa_sex_site.index)
ntaxa_sex_site.head()

In [None]:
ntaxa_sex_site.plot(kind="bar", stacked=True, legend=False)