# Data Analysis and Visualization in Python
## Combining DataFrames with pandas
Questions
* How can I combine DataFrames from multiple files?

Objectives
* Combine data from multiple files into a single DataFrame using `merge` and `concat`.
* Combine two DataFrames using a unique ID found in both DataFrames.
* Employ `to_csv` to export a DataFrame in CSV format.
* Join DataFrames using common fields (join keys).

### How to Use Jupyter
When a cell is in edit mode:

  Shortcut  | Description
----------- | -----------
Shift+Enter | Run the cell, and go to the next
Tab         | Indent code or auto-completion
Esc         | Go to command mode

When a cell is in command mode:

  Shortcut   | Description
------------ | -----------
Shift+Enter  | Run the cell, and go to the next
Double-click | Go to edit mode
Enter        | Go to edit mode

  Shortcut   | Description
------------ | -----------
A            | Insert a cell above
B            | Insert a cell below
C            | Copy the current cell
V            | Paste the cell below
D D          | Delete the current cell

To reset all cells:
* Go to the top menu, and select Kernel -> Restart & Clear Output

## Making Sure Our Data Are Loaded

In [None]:
# first make sure pandas is loaded
import pandas as pd

In [None]:
# read in the survey csv
surveys_df = pd.read_csv("../data/surveys.csv", keep_default_na=False, na_values=[""])
surveys_df = surveys_df.rename(columns={'species': 'species_id'})
surveys_df.head()

In [None]:
# read in the species csv
species_df = pd.read_csv("../data/species.csv", keep_default_na=False, na_values=[""])
species_df.head()

## Concatenating DataFrames

In [None]:
# take the first 10 rows of the surveys table
survey_sub = surveys_df.head(10)
survey_sub

In [None]:
# grab the 10 rows just before the final row (the slice -11:-1 excludes the last row)
survey_sub_last10 = surveys_df[-11:-1]
survey_sub_last10

In [None]:
# reset the index values so the second DataFrame appends properly
# drop=True avoids adding a new column that holds the old index values
survey_sub_last10 = survey_sub_last10.reset_index(drop=True)
survey_sub_last10

In [None]:
# stack the DataFrames on top of each other
vertical_stack = pd.concat([survey_sub, survey_sub_last10], axis=0)
vertical_stack
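
Stacking with `axis=0` keeps each frame's original row labels, so labels can repeat in the result. The sketch below uses two small made-up frames (`top`, `bottom`, and their values are illustrative, not taken from the survey data) to show how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two small, made-up frames standing in for survey_sub and survey_sub_last10
top = pd.DataFrame({"record_id": [1, 2], "wgt": [40.0, 48.0]})
bottom = pd.DataFrame({"record_id": [3, 4], "wgt": [29.0, 46.0]})

# Default behaviour: the original row labels are kept, so 0 and 1 repeat
stacked = pd.concat([top, bottom], axis=0)

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```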

In [None]:
# place the DataFrames side by side
horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)
horizontal_stack
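
With `axis=1`, `concat` lines rows up by index label rather than by position, which is why resetting the index matters before placing frames side by side. A minimal sketch with made-up frames (`left` and `right` are hypothetical names) illustrates the difference:

```python
import pandas as pd

# Made-up frames whose row labels do not overlap
left = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"b": [10, 20]}, index=[5, 6])

# Labels differ, so the result has one row per distinct label
# and NaN wherever a frame has no row for that label
misaligned = pd.concat([left, right], axis=1)

# After resetting the index, both frames share labels 0 and 1
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned.shape)  # (4, 2)
print(aligned.shape)     # (2, 2)
```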

### Writing Out Data to CSV

In [None]:
# Write DataFrame to CSV; index=False leaves the row index out of the file
vertical_stack.to_csv('out.csv', index=False)

In [None]:
# for kicks, read our output back into Python and make sure all looks good
new_output = pd.read_csv('out.csv', keep_default_na=False, na_values=[""])
new_output
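
By default, `to_csv` also writes the row index as an unnamed first column, which then comes back as an extra column on reading. The sketch below uses an in-memory buffer instead of a file and a made-up frame (`df` and its values are illustrative) to compare the default with `index=False`.

```python
import io
import pandas as pd

# Made-up frame; the values are for illustration only
df = pd.DataFrame({"record_id": [1, 2], "wgt": [40.0, 48.0]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
round_trip_default = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
round_trip_clean = pd.read_csv(without_index)

print(list(round_trip_default.columns))  # ['Unnamed: 0', 'record_id', 'wgt']
print(list(round_trip_clean.columns))    # ['record_id', 'wgt']
```

If the index carries meaningful labels, an alternative is to keep it in the file and read it back with `index_col=0`.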

## Exercise - Concatenating DataFrames
In the data folder, there are two survey data files: `survey2001.csv` and `survey2002.csv`. Read the data into Python and combine the files to make one new DataFrame. Create a plot of average plot weight by year grouped by sex. Export your results as a CSV and make sure it reads back into Python properly.

In [None]:
# read the files:
survey2001 = pd.read_csv("../data/survey2001.csv")
survey2002 = pd.read_csv("../data/survey2002.csv")
# concatenate
survey_all = pd.concat([survey2001, survey2002], axis=0)

In [None]:
# get the mean weight for each year, grouped by sex
# (select "wgt" before mean() so non-numeric columns are not averaged)
weight_year = survey_all.groupby(['year', 'sex'])["wgt"].mean().unstack()
weight_year
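
The grouping step can be checked by hand on a small made-up table (`toy` and its values are illustrative only). Selecting `wgt` before calling `mean()` restricts the average to that one numeric column, and `unstack()` moves the `sex` level from the row index into columns.

```python
import pandas as pd

# Made-up data: two years, two sexes, weights easy to average by hand
toy = pd.DataFrame({
    "year": [2001, 2001, 2001, 2002, 2002],
    "sex": ["F", "F", "M", "F", "M"],
    "wgt": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per year, one column per sex, values are mean weights
mean_wgt = toy.groupby(["year", "sex"])["wgt"].mean().unstack()

print(mean_wgt.loc[2001, "F"])  # 15.0
```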

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# plot:
weight_year.plot(kind="bar")
plt.tight_layout()  # tip: adjusts spacing so axis labels are not clipped

In [None]:
# writing to file:
weight_year.to_csv("weight_for_year.csv")

# reading it back in:
pd.read_csv("weight_for_year.csv", index_col=0)

## Joining DataFrames
### Joining Two DataFrames

In [None]:
# take the first 10 rows of the surveys table
survey_sub = surveys_df.head(10)
survey_sub

In [None]:
# import a small subset of the species data designed for this part of the lesson
species_sub = pd.read_csv('../data/speciesSubset.csv', keep_default_na=False, na_values=[""])
species_sub

### Identifying join keys

In [None]:
survey_sub.columns

In [None]:
species_sub.columns
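
Rather than comparing the two column listings by eye, the shared column names can be computed directly. This is a minimal sketch on made-up frames that carry only a few of the columns (the frame names and values are illustrative):

```python
import pandas as pd

# Made-up stand-ins holding a subset of the columns used in this lesson
surveys_toy = pd.DataFrame({"record_id": [1], "plot": [2], "species_id": ["NL"]})
species_toy = pd.DataFrame({"species_id": ["NL"], "taxa": ["Rodent"]})

# Columns present in both frames are candidate join keys
shared = surveys_toy.columns.intersection(species_toy.columns)

print(list(shared))  # ['species_id']
```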

### Inner joins

![Inner join of tables A and B](http://blog.codinghorror.com/content/images/uploads/2007/10/6a0120a85dcdae970b012877702708970c-pi.png)

In [None]:
merged_inner = pd.merge(left=survey_sub, right=species_sub,
                        left_on='species_id', right_on='species_id')
# what's the size of the output data?
merged_inner.shape

In [None]:
merged_inner
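
An inner join keeps only the rows whose key appears in both tables. The sketch below uses made-up frames (names and values are illustrative) in which one survey record has a `species_id` missing from the species table, so that record is dropped.

```python
import pandas as pd

# Made-up data: species_id "PF" has no match in the species table
surveys_toy = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "DM", "PF", "DM"],
})
species_toy = pd.DataFrame({
    "species_id": ["NL", "DM"],
    "taxa": ["Rodent", "Rodent"],
})

# how="inner" is the default for pd.merge
inner = pd.merge(left=surveys_toy, right=species_toy, on="species_id")

print(inner.shape)  # (3, 3)
```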

### Left joins

![Left join of tables A and B](http://blog.codinghorror.com/content/images/uploads/2007/10/6a0120a85dcdae970b01287770273e970c-pi.png)

In [None]:
merged_left = pd.merge(left=survey_sub, right=species_sub, how='left',
                       left_on='species_id', right_on='species_id')
# what's the size of the output data?
merged_left.shape

In [None]:
merged_left
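
A left join keeps every row of the left table and fills the right-hand columns with `NaN` where no match exists. As a sketch on made-up frames (names and values are illustrative), passing `indicator=True` adds a `_merge` column that makes the unmatched rows easy to find:

```python
import pandas as pd

# Made-up data: species_id "PF" has no match in the species table
surveys_toy = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "DM", "PF", "DM"],
})
species_toy = pd.DataFrame({
    "species_id": ["NL", "DM"],
    "taxa": ["Rodent", "Rodent"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
left = pd.merge(surveys_toy, species_toy, how="left",
                on="species_id", indicator=True)

# Rows from the left table that found no partner on the right
unmatched = left[left["_merge"] == "left_only"]

print(left.shape)                    # (4, 4)
print(list(unmatched["record_id"]))  # [3]
```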

## Exercise - Joining all data
Create a new DataFrame by joining the contents of the `surveys.csv` and `species.csv` tables. Then calculate and plot the distribution of:
1. taxa by plot (number of species of each taxon per plot)
2. taxa by sex by plot

In [None]:
merged_left = pd.merge(left=surveys_df, right=species_df, how='left', on="species_id")

In [None]:
merged_left.groupby(["plot"])["taxa"].nunique().plot(kind='bar')

In [None]:
plot_taxa_plot = merged_left.groupby(["plot", "taxa"]).count()["record_id"].unstack()
plot_taxa_plot

In [None]:
plot_taxa_plot.plot(kind='bar', stacked=True)
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05))

In [None]:
# Part 2
# label missing values, and any value other than 'F' or 'M', as 'M|F'
merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'
merged_left.loc[~merged_left["sex"].isin(['F', 'M', 'M|F']), "sex"] = 'M|F'

In [None]:
ntaxa_sex_plot = merged_left.groupby(["plot", "sex"])["taxa"].nunique().reset_index(level=1)
ntaxa_sex_plot

In [None]:
# Use pivot_table() instead of unstack()
ntaxa_sex_plot = ntaxa_sex_plot.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_plot.index)
ntaxa_sex_plot
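
`pivot_table` aggregates with the mean by default, so when each (plot, sex) pair holds a single value it yields the same table as `unstack()`, only with float values. A minimal sketch on made-up counts (the `counts` frame is illustrative):

```python
import pandas as pd

# Made-up counts: exactly one value per (plot, sex) pair
counts = pd.DataFrame({
    "plot": [1, 1, 2, 2],
    "sex": ["F", "M", "F", "M"],
    "taxa": [1, 2, 1, 1],
})

# Route 1: move both keys into the index, then unstack the "sex" level
via_unstack = counts.set_index(["plot", "sex"])["taxa"].unstack()

# Route 2: pivot_table (default aggfunc is the mean, so values become floats)
via_pivot = counts.pivot_table(values="taxa", index="plot", columns="sex")

print(list(via_pivot.columns))  # ['F', 'M']
```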

In [None]:
ntaxa_sex_plot.plot(kind="bar", stacked=True, legend=False)
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
           fontsize='small', frameon=False)