# Data Analysis and Visualization in Python
## Starting With Data
Questions
* How can I work with DataFrames in Python?

Objectives
* Navigate the workshop directory and download a dataset.
* Explain what a library is and what libraries are used for.
* Describe what the Python Data Analysis Library (Pandas) is.
* Load the Python Data Analysis Library (Pandas).
* Use read_csv to read tabular data into Python.
* Describe what a DataFrame is in Python.
* Access and summarize data stored in a DataFrame.
* Define indexing as it relates to data structures.
* Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame.
* Create simple plots.

### How to Use Jupyter
When a cell is in edit mode:

  Shortcut  | Description
----------- | -----------
Shift+Enter | Run the cell, and go to the next
Tab         | Indent code or auto-completion
Esc         | Go to command mode

When a cell is in command mode:

  Shortcut   | Description
------------ | -----------
Shift+Enter  | Run the cell, and go to the next
Double-click | Go to edit mode
Enter        | Go to edit mode

  Shortcut   | Description
------------ | -----------
A            | Insert a cell above
B            | Insert a cell below
C            | Copy the current cell
V            | Paste the cell below
D D          | Delete the current cell

To reset all cells:
* Go to the top menu, and select Kernel -> Restart & Clear Output

## Working With Pandas DataFrames in Python
### Our Data
For this lesson, we will be using the Portal Teaching data, a subset of the data from Ernst et al., "Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA":
http://www.esapubs.org/archive/ecol/E090/118/default.htm

We will be using files from the Portal Project Teaching Database:
https://figshare.com/articles/Portal_Project_Teaching_Database/1314459

This section will use the **`../data/surveys.csv`** file, which is a simplified version of the original file that can be downloaded here: https://ndownloader.figshare.com/files/2292172

We are studying the species and weight of animals caught in plots in our study area. The dataset is stored as a .csv file: each row holds information for a single animal, and the columns represent:

 Column         | Description
--------------- | -----------
record_id       | Unique id for the observation
month           | month of observation
day             | day of observation
year            | year of observation
plot            | ID of a particular plot
species         | 2-letter species code
sex             | sex of animal (“M”, “F”)
wgt             | weight of the animal in grams

The first few rows of our first file look like this:
```
record_id,month,day,year,plot,species,sex,wgt
"1","7","16","1977","2","NA","M",
"2","7","16","1977","3","NA","M",
"3","7","16","1977","2","DM","F",
"4","7","16","1977","7","DM","M",
"5","7","16","1977","3","DM","M",
"6","7","16","1977","1","PF","M",
"7","7","16","1977","2","PE","F",
"8","7","16","1977","1","DM","M",
"9","7","16","1977","1","DM","F",
```

### Pandas in Python
One of the best options for working with tabular data in Python is the Python Data Analysis Library (a.k.a. Pandas). Pandas provides data structures for tabular data, produces high-quality plots through matplotlib, and integrates well with other libraries that use NumPy arrays (NumPy is another Python library).

In [None]:
import ### as ###

## Reading CSV Data Using Pandas
### So What’s a DataFrame?

In [None]:
# note that pd.read_csv is used because we imported pandas as pd
###("###/###/surveys.csv")

In [None]:
### = pd.read_csv("../data/surveys.csv")

In [None]:
surveys_df
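
As a rough illustration of what `pd.read_csv` returns, the sketch below reads a few rows from an in-memory string instead of a file. The rows are made-up values in the same column layout as `surveys.csv`, and the names `csv_text` and `toy_df` are hypothetical, chosen only for this example.

```python
import io

import pandas as pd

# Hypothetical sample in the same column layout as surveys.csv (values are made up)
csv_text = """record_id,month,day,year,plot,species,sex,wgt
1,7,16,1977,2,DM,M,40
2,7,16,1977,3,DM,F,36
3,7,16,1977,2,PF,F,8
4,7,17,1977,7,DM,M,
"""

# read_csv accepts a file path or any file-like object, such as StringIO
toy_df = pd.read_csv(io.StringIO(csv_text))

# The empty wgt field in the last row is read as a missing value (NaN)
print(toy_df)
print(toy_df["wgt"].isna().sum())
```

Reading from a string keeps the example independent of the workshop directory; with the real file you would pass the path instead.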

### Manipulating Our Species Survey Data

In [None]:
###(surveys_df)

In [None]:
# this does the same thing as the above!
surveys_df.__###__

In [None]:
surveys_df.###

### Exercise - DataFrame Object
Try out the methods below to see what they return.

In [None]:
surveys_df.###

In [None]:
surveys_df.###
# What does surveys_df.head(15) do?

In [None]:
surveys_df.###

In [None]:
surveys_df.###
# Take note of the output of the shape attribute. What format does it return the shape of the DataFrame in?
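
The sketch below shows a few common ways to inspect a DataFrame, using a small made-up table rather than the survey data. The name `toy_df` and its values are assumptions for illustration; note that `shape`, `columns`, and `dtypes` are attributes (no parentheses), while `head` and `tail` are methods.

```python
import pandas as pd

# Small made-up table; None becomes NaN in the float column
toy_df = pd.DataFrame({
    "plot": [2, 3, 2, 7],
    "species": ["DM", "DM", "PF", "DM"],
    "wgt": [40.0, 36.0, 8.0, None],
})

print(type(toy_df))    # the class of the object
print(toy_df.dtypes)   # data type of each column
print(toy_df.columns)  # column labels
print(toy_df.head(2))  # first 2 rows
print(toy_df.tail(1))  # last row
print(toy_df.shape)    # tuple of (number of rows, number of columns)
```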

### Calculating Statistics From Data In A Pandas DataFrame

In [None]:
# Look at the column names
surveys_df.columns.###

In [None]:
pd.###(surveys_df['species'])

### Exercise - Calculating Statistics
`1`. Create a list of unique plot IDs found in the surveys data. Call it `plot_names`. How many unique plots are there in the data? How many unique species are in the data?

In [None]:
### = pd.###(surveys_df["###"])
###.### # Can also use: len()

In [None]:
###(pd.###(surveys_df["###"]))

`2`. What is the difference between `len(plot_names)` and `len(surveys_df["plot"])`?

In [None]:
# print(len(plot_names))
# print(len(surveys_df["plot"]))
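
To make the distinction between unique values and all values concrete, here is a minimal sketch on a made-up column. The names `toy_df` and `plot_ids` are hypothetical; `pd.unique` returns the distinct values in order of first appearance, while `len` of the column counts every row.

```python
import pandas as pd

# Made-up plot IDs with repeats
toy_df = pd.DataFrame({"plot": [2, 3, 2, 7, 3, 2]})

plot_ids = pd.unique(toy_df["plot"])  # distinct values only
print(plot_ids)
print(len(plot_ids))          # number of distinct plots
print(len(toy_df["plot"]))    # number of rows, repeats included
print(toy_df["plot"].nunique())  # shortcut for the number of distinct values
```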

## Groups in Pandas

In [None]:
surveys_df['wgt'].###

In [None]:
print("Count:    ", surveys_df['wgt'].###())
print("Mean:     ", surveys_df['wgt'].###())
print("Std Dev.: ", surveys_df['wgt'].###())
print("Min:      ", surveys_df['wgt'].###())
print("Max:      ", surveys_df['wgt'].###())
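
The sketch below illustrates how summary statistics treat missing values, using a made-up weight column (the name `weights` and its values are assumptions). By default, pandas skips `NaN` entries, so `count` reports only the non-missing values and `mean` averages over those.

```python
import pandas as pd

# Made-up weights in grams; None is stored as NaN
weights = pd.Series([40.0, 36.0, 8.0, None], name="wgt")

print(weights.describe())  # count, mean, std, min, quartiles, max in one call
print(weights.count())     # non-missing values only
print(weights.mean())      # mean of the three non-missing values
print(weights.min(), weights.max())
```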

In [None]:
# Group data by sex
sorted_data = surveys_df.###('###')

In [None]:
# summary statistics for all numeric columns by sex
sorted_data.###

In [None]:
# provide the mean for each numeric column by sex
sorted_data.###
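
As a small illustration of grouping, the sketch below splits a made-up table by one column and averages the numeric columns within each group. The names `toy_df`, `by_sex`, and `means` are hypothetical; `numeric_only=True` is passed so that text columns such as `species` are left out of the mean.

```python
import pandas as pd

# Made-up records with one text column and one numeric column
toy_df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "species": ["PF", "DM", "DM", "DM"],
    "wgt": [8.0, 40.0, 36.0, 44.0],
})

by_sex = toy_df.groupby("sex")           # one group per distinct value of sex
means = by_sex.mean(numeric_only=True)   # mean of each numeric column per group
print(means)
```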

### Exercise - Grouping
`1`. How many recorded individuals are female `F` and how many male `M`?

In [None]:
sorted_data.###

`2`. What happens when you group by two columns using the following syntax and then grab mean values:

In [None]:
sorted_data2 = surveys_df.groupby(['plot','sex'])
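sorted_data2.mean(numeric_only=True)  # numeric_only skips text columns such as species, which recent pandas versions refuse to average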

In [None]:
# The mean is not meaningful for every column, so you can choose an aggregation per column:
surveys_df.groupby(['plot','sex']).agg({"year": 'min', 
                                        "wgt": 'mean'})

`3`. Summarize weight values for each plot in your data. HINT: to create summary statistics for a single column, you can use syntax such as `by_plot['wgt'].describe()`, where `by_plot` is the data grouped by plot.

In [None]:
surveys_df.###(['###'])['###'].###
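
The per-column aggregation used in this exercise can be tried on a small made-up table, as sketched below. The names `toy_df` and `summary` and all values are assumptions for illustration; grouping by two columns produces a result indexed by each observed (plot, sex) pair.

```python
import pandas as pd

# Made-up records: two plots, with two females in plot 3
toy_df = pd.DataFrame({
    "plot": [2, 2, 3, 3],
    "sex": ["F", "M", "F", "F"],
    "year": [1977, 1978, 1977, 1979],
    "wgt": [8.0, 40.0, 36.0, 30.0],
})

# Earliest year and mean weight for each (plot, sex) combination
summary = toy_df.groupby(["plot", "sex"]).agg({"year": "min", "wgt": "mean"})
print(summary)

# Individual cells can be looked up with a (plot, sex) tuple
print(summary.loc[(3, "F"), "wgt"])
```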

### Quickly Creating Summary Counts in Pandas

In [None]:
# count the number of samples by species
species_counts = surveys_df.groupby('###')['record_id'].###
species_counts

In [None]:
surveys_df.groupby('species')['record_id'].count()###
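
A minimal sketch of summary counts on made-up data follows. The names `toy_df` and `counts` are hypothetical; the result of the grouped count is a Series indexed by species code, so a single count can be picked out by its label.

```python
import pandas as pd

# Made-up records: three DM, one PF, one PE
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "species": ["DM", "DM", "PF", "DM", "PE"],
})

counts = toy_df.groupby("species")["record_id"].count()
print(counts)        # one count per species
print(counts["DM"])  # count for a single species, selected by label
```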

### Basic Math Functions

In [None]:
# multiply all weight values by 2
surveys_df['###'] ###
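
Arithmetic on a column is applied element by element, as the sketch below shows on a made-up weight column. The names `weights`, `doubled`, and `in_kg` are assumptions for illustration; missing values stay missing after the operation, and the original column is not modified unless the result is assigned back.

```python
import pandas as pd

# Made-up weights in grams, with one missing value
weights = pd.Series([40.0, 36.0, None], name="wgt")

doubled = weights * 2    # every value multiplied by 2
in_kg = weights / 1000   # unit conversion from grams to kilograms

print(doubled)
print(in_kg)
```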

### Exercise - Groupby Count
What is another way to create a list of species and the associated count of the records in the data?

In [None]:
surveys_df.groupby('species')###['record_id']

## Quick & Easy Plotting Data Using Pandas

In [None]:
# make sure figures appear inline in the Jupyter (IPython) notebook
%matplotlib inline

In [None]:
# create a quick bar chart
species_counts.###(kind='###');

In [None]:
total_count = surveys_df['record_id'].groupby(surveys_df['plot']).###()

# let's plot that too
total_count.###(kind='bar');

### Exercise - Plotting Challenges
Create a plot of the average weight across all species per plot.

In [None]:
surveys_df.groupby('###').###[###].###(kind='bar')

Create a plot of total males versus total females for the entire dataset.

In [None]:
surveys_df.groupby('###').###[###].###(kind='bar')

## Summary Example

In [None]:
by_plot_sex = surveys_df.groupby(['plot','sex'])
plot_sex_count = by_plot_sex['wgt'].sum()
plot_sex_count

In [None]:
plot_sex_count.unstack()

In [None]:
spc = plot_sex_count.unstack()
s_plot = spc.plot(kind='bar', stacked=True, title="Total weight by plot and sex")
s_plot.set_ylabel("Weight")
s_plot.set_xlabel("Plot")
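
To see what `unstack` does to the grouped totals before they are plotted, here is a minimal sketch on made-up data. The names `toy_df`, `totals`, and `wide` are hypothetical; the inner index level (`sex`) is moved into the columns, which gives one row per plot and one column per sex.

```python
import pandas as pd

# Made-up records: two plots, both sexes present in each
toy_df = pd.DataFrame({
    "plot": [1, 1, 2, 2, 2],
    "sex": ["F", "M", "F", "M", "M"],
    "wgt": [10.0, 20.0, 30.0, 40.0, 5.0],
})

# Total weight per (plot, sex) pair: a Series with a two-level index
totals = toy_df.groupby(["plot", "sex"])["wgt"].sum()
print(totals)

# Pivot the sex level into columns: rows are plots, columns are F and M
wide = totals.unstack()
print(wide)
```

This wide layout is what allows a stacked bar chart to draw one bar per plot with one segment per sex.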