# Data Analysis and Visualization in Python
## Starting With Data
Questions
* How can I import data in Python?
* What is Pandas?
* Why should I use Pandas to work with data?

Objectives
* Navigate the workshop directory and download a dataset.
* Explain what a library is and what libraries are used for.
* Describe what the Python Data Analysis Library (Pandas) is.
* Load the Python Data Analysis Library (Pandas).
* Use `read_csv` to read tabular data into Python.
* Describe what a DataFrame is in Python.
* Access and summarize data stored in a DataFrame.
* Define indexing as it relates to data structures.
* Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame.
* Create simple plots.

## Working With Pandas DataFrames in Python
### Our Data
For this lesson, we will be using the Portal Teaching data, a subset of the data from Ernst et al., "Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA":
http://www.esapubs.org/archive/ecol/E090/118/default.htm

We will be using files from the Portal Project Teaching Database:
https://figshare.com/articles/Portal_Project_Teaching_Database/1314459

This section will use the **`../data/surveys.csv`** file, which is a simplified version of the original file that can be downloaded here: https://ndownloader.figshare.com/files/2292172

We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a `.csv` file: each row holds information for a single animal, and the columns represent:

 Column         | Description
--------------- | -----------
record_id       | unique ID for the observation
month           | month of observation
day             | day of observation
year            | year of observation
plot_id         | ID of a particular site (plot)
species_id      | 2-letter species code
sex             | sex of animal (“M”, “F”)
hindfoot_length | length of the hindfoot in mm
weight          | weight of the animal in grams

The first few rows of our first file look like this:
```
record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
5,7,16,1977,3,DM,M,35,
6,7,16,1977,1,PF,M,14,
7,7,16,1977,2,PE,F,,
8,7,16,1977,1,DM,M,37,
9,7,16,1977,1,DM,F,34,
```
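
Several fields in the sample above are empty: every `weight` value, and the `hindfoot_length` for record 7. The sketch below shows how pandas treats such gaps. It assumes pandas is installed and, so that it runs without the real file, reads an in-memory copy of the nine rows through `io.StringIO`:

```python
import io

import pandas as pd

# In-memory copy of the sample rows shown above (an assumption made so the
# example runs without the real surveys.csv file).
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
5,7,16,1977,3,DM,M,35,
6,7,16,1977,1,PF,M,14,
7,7,16,1977,2,PE,F,,
8,7,16,1977,1,DM,M,37,
9,7,16,1977,1,DM,F,34,
"""

sample_df = pd.read_csv(io.StringIO(sample_csv))

# Empty fields are read as NaN (missing values), not as zero or "".
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

Missing values matter later on: summary methods such as `count()` and `mean()` skip them by default.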

### Pandas in Python
One of the best options for working with tabular data in Python is the Python Data Analysis Library (a.k.a. Pandas). Pandas provides data structures for tabular data, produces high-quality plots through matplotlib, and integrates well with other libraries that use NumPy arrays (NumPy is another Python library).

In [None]:
import pandas as pd

## Reading CSV Data Using Pandas
### So What’s a DataFrame?

In [None]:
# Note that pd.read_csv is used because we imported pandas as pd
pd.read_csv("../data/surveys.csv")

In [None]:
surveys_df = pd.read_csv("../data/surveys.csv")

In [None]:
surveys_df

In [None]:
surveys_df.head()  # head() returns the first five rows of the DataFrame by default.

In [None]:
surveys_df.head(15)

### Exercise - DataFrame Object
Try out the methods below to see what they return.

In [None]:
surveys_df.columns

In [None]:
surveys_df.shape
# Take note of the output of shape. What format does it return the shape of the DataFrame in?

In [None]:
surveys_df.tail()
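
As a rough guide to what these attributes and methods return, here is a sketch on a tiny hand-made DataFrame. The name `demo_df` and its values are invented for illustration and are not taken from the survey data:

```python
import pandas as pd

# Tiny illustrative DataFrame; the values are made up for this sketch.
demo_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "species_id": ["NL", "NL", "DM", "DM", "DM", "PF"],
    "weight": [40.0, 48.0, 29.0, 46.0, 36.0, 52.0],
})

# .columns is an Index of column labels.
column_names = list(demo_df.columns)

# .shape is a tuple: (number of rows, number of columns).
n_rows, n_cols = demo_df.shape

# .tail() returns the last five rows by default; pass n for a different count.
last_two = demo_df.tail(2)

print(column_names)
print(n_rows, n_cols)
print(last_two)
```

Note that `columns` and `shape` are attributes (no parentheses), while `head()` and `tail()` are methods.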

### Calculating Statistics From Data In A Pandas DataFrame

In [None]:
# Look at the column names
surveys_df.columns

In [None]:
pd.unique(surveys_df['species_id'])

### Exercise - Calculating Statistics
`1`. Create a list of unique site IDs (`plot_id`) found in the surveys data. Call it `site_names`. How many unique sites are there in the data? How many unique species are in the data?

In [None]:
site_names = pd.unique(surveys_df["plot_id"])
site_names.shape[0] # Can also use: len(site_names)

`2`. What is the difference between `len(site_names)` and `surveys_df['plot_id'].nunique()`?

In [None]:
print(len(site_names))
print(surveys_df['plot_id'].nunique())
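
For `plot_id` the two numbers should agree as long as the column has no missing values. They can differ when a column contains `NaN`: `pd.unique` keeps `NaN` as one of the unique values, while `nunique()` ignores it by default. A small sketch with an invented column:

```python
import numpy as np
import pandas as pd

# Invented column with one missing value, for illustration only.
species = pd.Series(["DM", "NL", "DM", np.nan, "NL"])

unique_values = pd.unique(species)   # includes NaN as a distinct value
n_with_nan = len(unique_values)
n_without_nan = species.nunique()    # dropna=True by default
n_counting_nan = species.nunique(dropna=False)

print(n_with_nan, n_without_nan, n_counting_nan)
```

This is worth keeping in mind when counting species, since a column such as `species_id` may contain missing entries.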

## Groups in Pandas

In [None]:
surveys_df['weight'].describe()

In [None]:
print("Count:    ", surveys_df['weight'].count())
print("Mean:     ", surveys_df['weight'].mean())
print("Std Dev.: ", surveys_df['weight'].std())
print("Min:      ", surveys_df['weight'].min())
print("Max:      ", surveys_df['weight'].max())

In [None]:
# Group data by sex
grouped_data = surveys_df.groupby('sex')

In [None]:
# summary statistics for all numeric columns by sex
grouped_data.describe()

In [None]:
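# provide the mean for each numeric column by sex
# (numeric_only=True skips text columns such as species_id,
# which recent pandas versions refuse to average)
grouped_data.mean(numeric_only=True)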
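
`groupby` follows a split-apply-combine pattern: rows are split into groups by the key, a function is applied to each group, and the results are combined into one table. A minimal sketch on invented data, in which the row with a missing `sex` is left out of the groups by default:

```python
import numpy as np
import pandas as pd

# Invented data for illustration; one row has no recorded sex.
demo_df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", np.nan],
    "species_id": ["DM", "DM", "NL", "PF", "PE"],
    "weight": [40.0, 50.0, 60.0, 30.0, 20.0],
})

# numeric_only=True skips text columns such as species_id.
mean_by_sex = demo_df.groupby("sex").mean(numeric_only=True)
print(mean_by_sex)
```

The result has one row per group, indexed by the grouping key.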

### Exercise - Grouping
`1`. How many recorded individuals are female `F` and how many male `M`?

In [None]:
grouped_data.count()

`2`. What happens when you group by two columns using the following syntax and then grab mean values:

In [None]:
grouped_data2 = surveys_df.groupby(['plot_id','sex'])
# numeric_only=True skips the text column species_id
grouped_data2.mean(numeric_only=True)

`3`. Summarize weight values for each site in your data. HINT: you can use the following syntax to create summary statistics for only one column in your data: `by_site['weight'].describe()`, where `by_site` is the data grouped by `plot_id`.

In [None]:
surveys_df.groupby(['plot_id'])['weight'].describe()

### Quickly Creating Summary Counts in Pandas

In [None]:
# count the number of samples by species
species_counts = surveys_df.groupby('species_id')['record_id'].count()
print(species_counts)

In [None]:
surveys_df.groupby('species_id')['record_id'].count()['DO']
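
`Series.value_counts()` offers another way to get per-species counts. The sketch below uses invented data and assumes `record_id` has no missing values, so that both approaches count the same rows:

```python
import pandas as pd

# Invented data for illustration only.
demo_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "species_id": ["DM", "DO", "DM", "NL", "DO", "DM"],
})

# Count records per species with groupby, as in the cell above.
counts_groupby = demo_df.groupby("species_id")["record_id"].count()

# value_counts() tallies each distinct value, sorted from most to least common.
counts_value = demo_df["species_id"].value_counts()

print(counts_groupby["DO"])
print(counts_value["DO"])
```

The `groupby` result is ordered by species code, while `value_counts()` is ordered by frequency.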

### Basic Math Functions

In [None]:
# Convert all weights from grams to kilograms
surveys_df['weight'] / 1000
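
Arithmetic on a column is applied element by element, and missing values stay missing. An expression on its own only displays the result; to keep it, assign it to a new column. A sketch on invented weights, assuming they are recorded in grams as in the column table, where `weight_kg` is a hypothetical column name:

```python
import numpy as np
import pandas as pd

# Invented weights in grams, including one missing value.
demo_df = pd.DataFrame({"weight": [40.0, np.nan, 250.0]})

# Element-wise division; NaN propagates through the calculation.
demo_df["weight_kg"] = demo_df["weight"] / 1000

print(demo_df)
```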

## Quick & Easy Plotting Data Using Pandas

In [None]:
# make sure figures appear inline in the IPython/Jupyter notebook
%matplotlib inline

In [None]:
# create a quick bar chart
species_counts.plot(kind='bar');

In [None]:
total_count = surveys_df.groupby('plot_id')['record_id'].nunique()

# let's plot that too
total_count.plot(kind='bar');

### Exercise - Plotting Challenges
Create a plot of the average weight across all species per site.

In [None]:
surveys_df.groupby('plot_id')["weight"].mean().plot(kind='bar')

Create a plot of total males versus total females for the entire dataset.

In [None]:
surveys_df.groupby('sex')["record_id"].count().plot(kind='bar')

## Summary Example

In [None]:
by_site_sex = surveys_df.groupby(['plot_id','sex'])
site_sex_totalw = by_site_sex['weight'].sum()
site_sex_totalw

In [None]:
site_sex_totalw.unstack()
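
Grouping by two columns yields a Series with a two-level index. `unstack()` moves the inner level (`sex`) into columns, giving one row per site. A sketch on invented data:

```python
import pandas as pd

# Invented data for illustration only.
demo_df = pd.DataFrame({
    "plot_id": [1, 1, 1, 2, 2],
    "sex": ["F", "M", "M", "F", "F"],
    "weight": [40.0, 50.0, 10.0, 30.0, 20.0],
})

# Series indexed by (plot_id, sex).
total_weight = demo_df.groupby(["plot_id", "sex"])["weight"].sum()

# Wide table: rows are plot_id, columns are sex. Combinations that never
# occur (here plot 2, "M") become NaN.
wide = total_weight.unstack()
print(wide)
```

This wide layout is what makes a stacked bar chart straightforward: each column becomes one segment of the bar for a site.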

In [None]:
spc = site_sex_totalw.unstack()
s_plot = spc.plot(kind='bar', stacked=True, title="Total weight by site and sex")
s_plot.set_ylabel("Weight")
s_plot.set_xlabel("Site")