# Data Analysis and Visualization in Python
## Starting With Data
Questions
* How can I import data in Python?
* What is Pandas?
* Why should I use Pandas to work with data?

Objectives
* Navigate the workshop directory and download a dataset.
* Explain what a library is and what libraries are used for.
* Describe what the Python Data Analysis Library (Pandas) is.
* Load the Python Data Analysis Library (Pandas).
* Use `read_csv` to read tabular data into Python.
* Describe what a DataFrame is in Python.
* Access and summarize data stored in a DataFrame.
* Define indexing as it relates to data structures.
* Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame.
* Create simple plots.

## Working With Pandas DataFrames in Python
### Our Data
For this lesson, we will be using the Portal Teaching data, a subset of the data from Ernest *et al.*, "Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA":
http://www.esapubs.org/archive/ecol/E090/118/default.htm

We will be using files from the Portal Project Teaching Database:
https://figshare.com/articles/Portal_Project_Teaching_Database/1314459

This section will use the **`../data/surveys.csv`** file, which is a simplified version of the original file that can be downloaded here: https://ndownloader.figshare.com/files/2292172

We are studying the species and weight of animals caught in plots (or sites) in our study area. The dataset is stored as a `.csv` file: each row holds information for a single animal, and the columns represent:

 Column           | Description
----------------- | -----------
`record_id`       | Unique ID for the observation
`month`           | Month of observation
`day`             | Day of observation
`year`            | Year of observation
`plot_id`         | ID of a particular site
`species_id`      | 2-letter species code
`sex`             | Sex of the animal (“M”, “F”)
`hindfoot_length` | Length of the hindfoot in mm
`weight`          | Weight of the animal in grams

The first few rows of `surveys.csv` look like this:
```
record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
5,7,16,1977,3,DM,M,35,
6,7,16,1977,1,PF,M,14,
7,7,16,1977,2,PE,F,,
8,7,16,1977,1,DM,M,37,
9,7,16,1977,1,DM,F,34,
```
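
As a preview of what loading this file produces, the sketch below parses the nine sample rows shown above directly from a string. Here `io.StringIO` stands in for the file on disk, and the names `sample_csv` and `sample_df` are illustrative only; the lesson itself reads `../data/surveys.csv`.

```python
import io

import pandas as pd

# The sample rows shown above, pasted in as text instead of read from disk.
sample_csv = """record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
1,7,16,1977,2,NL,M,32,
2,7,16,1977,3,NL,M,33,
3,7,16,1977,2,DM,F,37,
4,7,16,1977,7,DM,M,36,
5,7,16,1977,3,DM,M,35,
6,7,16,1977,1,PF,M,14,
7,7,16,1977,2,PE,F,,
8,7,16,1977,1,DM,M,37,
9,7,16,1977,1,DM,F,34,
"""

# read_csv accepts any file-like object, so StringIO behaves like an open file.
sample_df = pd.read_csv(io.StringIO(sample_csv))

print(sample_df.shape)         # (number of rows, number of columns)
print(sample_df.isna().sum())  # empty fields are read as missing values (NaN)
```

In these rows every `weight` field is empty and one `hindfoot_length` is missing, so both columns contain NaN after loading.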

### Pandas in Python
One of the best options for working with tabular data in Python is the Python Data Analysis Library (a.k.a. Pandas). Pandas provides data structures for tabular data, produces high-quality plots through matplotlib, and integrates well with other libraries that use NumPy arrays (NumPy is another Python library).
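
To make the NumPy connection concrete, here is a minimal sketch using the first six `hindfoot_length` values from the sample rows. The variable names are illustrative; the point is only that a Pandas column can be handed to NumPy functions as an array.

```python
import numpy as np
import pandas as pd

# First six hindfoot_length values from the sample rows (in mm).
hindfoot = pd.Series([32, 33, 37, 36, 35, 14], name="hindfoot_length")

# to_numpy() exposes the column as a NumPy array.
values = hindfoot.to_numpy()
print(type(values))

# NumPy functions and the equivalent Pandas method agree on the result.
print(np.mean(values), hindfoot.mean())
```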

In [None]:
import ### as ###

## Reading CSV Data Using Pandas
### So What’s a DataFrame?

In [None]:
# Note that pd.read_csv is used because we imported pandas as pd
###("###/###/surveys.csv")

In [None]:
### = pd.read_csv("../data/surveys.csv")

In [None]:
surveys_df

In [None]:
surveys_df### # Displays the first several rows of the DataFrame

### Demo - DataFrame Object
Try out the methods below to see what they return.

In [None]:
surveys_df.###

In [None]:
surveys_df.###

In [None]:
surveys_df.###
# What format does it return the shape of the DataFrame in?
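
One way to check your answers is to try the same attributes on a table small enough to count by hand. The sketch below builds a toy DataFrame (named `toy` here for illustration) from three columns of the first four sample rows.

```python
import pandas as pd

# Three columns from the first four sample rows.
toy = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "species_id": ["NL", "NL", "DM", "DM"],
    "hindfoot_length": [32, 33, 37, 36],
})

print(toy.columns)  # the column labels
print(toy.dtypes)   # the data type stored in each column
print(toy.shape)    # a tuple: (number of rows, number of columns)
```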

In [None]:
# Convert all weights from grams to kilograms
surveys_df[###] ###
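
Arithmetic on a column applies to every row at once. Because the sample rows contain no weights, this sketch shows the same pattern on `hindfoot_length`, converting millimetres to centimetres; the `toy` DataFrame and the `hindfoot_length_cm` column name are assumptions made for illustration.

```python
import pandas as pd

# hindfoot_length values (in mm) from the first three sample rows.
toy = pd.DataFrame({"hindfoot_length": [32, 33, 37]})

# Dividing a column by a number divides each value in it.
# Storing the result under a new name keeps the original column intact.
toy["hindfoot_length_cm"] = toy["hindfoot_length"] / 10

print(toy)
```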

### Calculating Statistics From Data In A Pandas DataFrame

In [None]:
pd.###(surveys_df['year'])

In [None]:
surveys_df['species_id']###

### Exercise - Calculating Statistics
`1`. Create a list of unique site IDs (`plot_id`) found in the surveys data. Call it `site_names`. How many unique sites are there in the data?

In [None]:
### = pd.###(surveys_df[###])
###.###

`2`. What is the difference between `len(site_names)` and `surveys_df['plot_id'].nunique()`?

In [None]:
# print(len(site_names))
# print(surveys_df['plot_id'].nunique())
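
As a hint for comparing the two approaches, the sketch below uses the `plot_id` and `hindfoot_length` values from the nine sample rows. On a column with no missing values both approaches give the same number; they differ when NaN is present, because `pd.unique` keeps NaN as a value while `nunique()` drops it by default.

```python
import pandas as pd

# Values taken from the nine sample rows; one hindfoot_length is missing.
plot_id = pd.Series([2, 3, 2, 7, 3, 1, 2, 1, 1])
hindfoot = pd.Series([32, 33, 37, 36, 35, 14, float("nan"), 37, 34])

# No missing values: both approaches agree.
print(len(pd.unique(plot_id)), plot_id.nunique())

# With a missing value: pd.unique counts NaN, nunique() does not.
print(len(pd.unique(hindfoot)), hindfoot.nunique())

# nunique(dropna=False) includes NaN in the count as well.
print(hindfoot.nunique(dropna=False))
```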

## Groups in Pandas

In [None]:
surveys_df['weight'].###

In [None]:
print("Count:    ", surveys_df['weight'].###())
print("Mean:     ", surveys_df['weight'].###())
print("Std Dev.: ", surveys_df['weight'].###())
print("Min:      ", surveys_df['weight'].###())
print("Max:      ", surveys_df['weight'].###())
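
Summary statistics in Pandas skip missing values by default. The sketch below illustrates this on the `hindfoot_length` values from the sample rows (the sample rows have no weights to summarize); `summary` is an illustrative name.

```python
import pandas as pd

# hindfoot_length values from the nine sample rows; one is missing.
hindfoot = pd.Series([32, 33, 37, 36, 35, 14, float("nan"), 37, 34])

# describe() bundles count, mean, std, min, quartiles, and max in one result.
summary = hindfoot.describe()
print(summary)

# The count covers only the non-missing values, not every row.
print(len(hindfoot), summary["count"])
```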

In [None]:
# Group data by sex
### = surveys_df.###('###')

In [None]:
# Summary statistics for all numeric columns by sex
by_sex.###

In [None]:
# Provide the mean for each numeric column by sex
by_sex.###
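
A minimal sketch of grouping, built from three columns of the sample rows. One assumption worth flagging: in recent Pandas versions, averaging a grouped DataFrame that still contains text columns such as `species_id` can raise an error, so this example either selects a numeric column first or passes `numeric_only=True`.

```python
import pandas as pd

# sex, species_id, and hindfoot_length from the nine sample rows.
toy = pd.DataFrame({
    "species_id": ["NL", "NL", "DM", "DM", "DM", "PF", "PE", "DM", "DM"],
    "sex": ["M", "M", "F", "M", "M", "M", "F", "M", "F"],
    "hindfoot_length": [32, 33, 37, 36, 35, 14, float("nan"), 37, 34],
})

by_sex = toy.groupby("sex")

# Option 1: pick one numeric column, then aggregate it per group.
mean_hindfoot = by_sex["hindfoot_length"].mean()
print(mean_hindfoot)

# Option 2: aggregate the whole group object, skipping non-numeric columns.
numeric_means = by_sex.mean(numeric_only=True)
print(numeric_means)
```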

### Exercise - Grouping
`1`. How many recorded individuals are female `F`, and how many male `M`?

In [None]:
by_sex###

`2`. What happens when you group by two columns using the following syntax and then compute the mean values?

In [None]:
by_site_sex = surveys_df.groupby(['plot_id','sex'])
by_site_sex.###()

`3`. Summarize `weight` values for each site (`plot_id`) in your data. HINT: it is possible to select a column once the data has been grouped.

In [None]:
by_site = surveys_df.###(['###'])
by_site['###'].###

### Getting the Number of Records of One Species

In [None]:
surveys_df.groupby('species_id')['record_id']#.count()###
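
The result of a grouped count is a Series indexed by the grouping column, so a single group can be looked up by its label. A small sketch using the `species_id` values from the nine sample rows (the names `toy` and `counts` are illustrative):

```python
import pandas as pd

# record_id and species_id from the nine sample rows.
toy = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "species_id": ["NL", "NL", "DM", "DM", "DM", "PF", "PE", "DM", "DM"],
})

# Count the records in each species group.
counts = toy.groupby("species_id")["record_id"].count()
print(counts)

# Look up one species by its 2-letter code.
print(counts["DM"])
```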

## Quick & Easy Plotting Data Using Pandas

In [None]:
by_site['record_id'].###().###(kind='bar')

### Exercise - Plotting Challenge
Create a `line` plot of the median `weight` per month.

In [None]:
surveys_df.###('###')['###'].###().plot(kind='###')

## Summary Example

In [None]:
site_sex_totalw = by_site_sex['weight'].###()
site_sex_totalw###

In [None]:
# Change the right-most categorical values into columns
sst = site_sex_totalw.unstack()
sst
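
To see what `unstack()` does to the shape of the result, here is a sketch on the nine sample rows. The cell above sums `weight`, but the sample rows contain no weights, so this example sums `hindfoot_length` instead, purely to illustrate the reshaping. The names `toy`, `site_sex_total`, and `wide` are illustrative.

```python
import pandas as pd

# plot_id, sex, and hindfoot_length from the nine sample rows.
toy = pd.DataFrame({
    "plot_id": [2, 3, 2, 7, 3, 1, 2, 1, 1],
    "sex": ["M", "M", "F", "M", "M", "M", "F", "M", "F"],
    "hindfoot_length": [32, 33, 37, 36, 35, 14, float("nan"), 37, 34],
})

# Grouping by two columns gives a Series with a two-level index.
site_sex_total = toy.groupby(["plot_id", "sex"])["hindfoot_length"].sum()
print(site_sex_total)

# unstack() moves the inner index level (sex) into columns.
# Site/sex combinations with no rows become NaN.
wide = site_sex_total.unstack()
print(wide)
```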

In [None]:
s_plot = sst.plot(kind='bar', ###,
                  title="Total weight by site and sex")
s_plot.set_ylabel("Weight")
s_plot.set_xlabel("Site")