# Data Analysis with Python
## Starting With Data
Questions
* How can I import data in Python?
* What is Pandas?
* Why should I use Pandas to work with data?

Objectives
* Load the Python Data Analysis Library (Pandas).
* Use `read_csv` to read tabular data into Python.
* Describe what a DataFrame and a Series are in Python.
* Access and summarize data stored in a DataFrame.
* Perform basic mathematical operations and summary statistics on data
  in a Pandas DataFrame.

## Working With Pandas DataFrames

### Pandas in Python
One of the best options for working with tabular data in Python is the Python Data Analysis Library (Pandas). Pandas provides data structures for tabular data, produces high-quality plots through matplotlib, and integrates well with other libraries that use NumPy arrays (NumPy is another Python library).

In [None]:
# Import the "pandas" library
import pandas as pd

## Reading CSV Data Using Pandas
### So What’s a DataFrame?

In [None]:
# Note that pd.read_csv is used because we imported pandas as pd.
# Assign the result to a variable so the DataFrame can be reused below
surveys_df = pd.read_csv('../data/surveys.csv')
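
As a minimal, self-contained sketch of what `read_csv` returns, the cell below reads CSV text from memory instead of a file. The names `csv_text`, `demo_df` and `indexed_df`, and all the values, are made up for illustration; only the column names follow the surveys data used in this lesson.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; the values are made up for illustration.
csv_text = """record_id,species_id,sex,weight
1,NL,M,200.0
2,DM,F,40.0
3,DM,M,
"""

# read_csv accepts a path or a file-like object and returns a DataFrame.
# Without the assignment, the table would be displayed and then discarded.
demo_df = pd.read_csv(io.StringIO(csv_text))

print(type(demo_df).__name__)  # name of the returned class
print(demo_df.shape)           # (number of rows, number of columns)

# index_col tells read_csv which column to use as the row index.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="record_id")
print(indexed_df.index.name)
```

The empty `weight` field in the last row is read as a missing value (`NaN`), which is why the column is stored as floating-point numbers.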

In [None]:
# Display the first rows of the DataFrame
surveys_df.head()

In [None]:
# shape is an attribute (no parentheses); it holds a tuple: (rows, columns)
surveys_df.shape

In [None]:
# Accessing the list of column names
surveys_df.columns

In [None]:
# A single column is an object of type Series; describe() summarizes its values
surveys_df['weight'].describe()
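
To make the difference between a DataFrame and a Series concrete, here is a small sketch on a toy table. The name `demo_df` and its values are hypothetical; the point is only how the selection syntax determines the type you get back.

```python
import pandas as pd

# Toy table with made-up values, for illustration only.
demo_df = pd.DataFrame({
    "species_id": ["NL", "DM", "DM"],
    "weight": [200.0, 40.0, 48.0],
})

column = demo_df["weight"]       # a single name selects a Series
sub_table = demo_df[["weight"]]  # a list of names selects a DataFrame

print(type(column).__name__)
print(type(sub_table).__name__)
print(column.dtype)              # every value in a Series shares one dtype
```

A Series is one-dimensional (one column plus the row index), while a DataFrame is two-dimensional, even when it holds a single column.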

In [None]:
# Compute descriptive statistics per column
print("Count:    ", surveys_df['weight'].count())
print("Mean:     ", surveys_df['weight'].mean())
print("Std Dev.: ", surveys_df['weight'].std())
print("Min:      ", surveys_df['weight'].min())
print("Median:   ", surveys_df['weight'].median())
print("Max:      ", surveys_df['weight'].max())
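
Real survey data often contains missing values, and the summary methods above skip them by default. The sketch below illustrates this on a toy Series; the name `weights` and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy weights with one missing value (NaN); values are made up.
weights = pd.Series([200.0, 40.0, np.nan, 48.0], name="weight")

print(len(weights))                # total number of entries, including NaN
print(weights.count())             # number of non-missing entries
print(weights.mean())              # mean of the non-missing entries
print(weights.mean(skipna=False))  # NaN, because the missing value is included
```

This is why `count()` can be smaller than the number of rows reported by `shape`.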

In [None]:
# New column - Convert all weights from grams to kilograms
surveys_df['weight_kg'] = surveys_df['weight'] / 1000

# Display the last rows of the DataFrame
surveys_df.tail()

### Exercises - DataFrame and Series

1. Load the table of species from the `species.csv`
   file and assign the result to `species_df`. (3 min.)

2. Use the `.unique()` method to get
   the list of all different `taxa`. (2 min.)

In [None]:
species_df###

3. Use the `.nunique()` method to
   get the number of different taxa. (1 min.)

In [None]:
species_df###

## Groups in Pandas

In [None]:
# Group data by sex
surveys_df.groupby('sex')
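
Calling `groupby` on its own does not compute anything yet; it returns a grouping object that waits for an aggregation. The following sketch, on a hypothetical toy table named `demo_df` with made-up values, shows what that object contains.

```python
import pandas as pd

# Toy table with made-up values, for illustration only.
demo_df = pd.DataFrame({
    "species_id": ["NL", "DM", "DM", "NL"],
    "sex": ["M", "F", "M", "F"],
    "weight": [200.0, 40.0, 48.0, 180.0],
})

grouped = demo_df.groupby("sex")
print(type(grouped).__name__)  # a GroupBy object, not a table

# Iterating yields (group key, sub-DataFrame) pairs.
for key, group in grouped:
    print(key, len(group))

print(grouped.ngroups)         # number of distinct groups
```

Results only appear once a method such as `count()` or `mean()` is applied to the grouped object.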

In [None]:
# Provide the number of values for each column by sex
surveys_df.groupby('sex').count()

In [None]:
# Provide the mean for each numeric column by sex
surveys_df.groupby('sex').mean(numeric_only=True)

In [None]:
# Get the number of records per species ID
surveys_df.groupby('species_id')['record_id'].count()

In [None]:
# Mean values and standard deviations of weights per species ID
surveys_df.groupby('species_id')['weight'].aggregate(['mean', 'std'])
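
As a sketch of what `aggregate` returns when given a list of function names, the toy example below uses a hypothetical table `demo_df` with made-up weights that are small enough to check by hand.

```python
import pandas as pd

# Toy table with made-up values, for illustration only.
demo_df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "NL", "NL"],
    "weight": [40.0, 44.0, 48.0, 180.0, 200.0],
})

# One row per species, one column per requested statistic.
summary = demo_df.groupby("species_id")["weight"].aggregate(["mean", "std"])
print(summary)
```

Note that `std` is the sample standard deviation (it divides by n - 1), so for the three `DM` weights 40, 44 and 48 it is 4.0 rather than the population value.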

### Exercises - Grouping data
1. Summarize `weight` values for each
   species (`species_id`) in your data. (4 min.)

In [None]:
surveys_df###

2. What happens when you group by two columns using
   the following syntax and then grab mean values? (2 min.)

In [None]:
surveys_df.groupby(['species_id', 'sex'])###

## Pivot tables

In [None]:
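# Median hindfoot length per species and sex (a Series with a two-level index).
# Append .unstack() to turn the 'sex' level into columns, then .head() to preview
surveys_df.groupby(
    ['species_id', 'sex']
)['hindfoot_length'].median()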

In [None]:
surveys_df.pivot_table(
    values='hindfoot_length',
    index='species_id',
    columns='sex',
    aggfunc='median').head()
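
The two approaches in this section can be compared side by side. The sketch below builds the same table twice on a hypothetical toy DataFrame (`demo_df`, with made-up values and no missing data): once with `groupby` followed by `unstack()`, and once with `pivot_table()`.

```python
import pandas as pd

# Toy table with made-up values, for illustration only.
demo_df = pd.DataFrame({
    "species_id": ["DM", "DM", "DM", "NL", "NL", "NL"],
    "sex": ["F", "F", "M", "F", "M", "M"],
    "hindfoot_length": [35.0, 37.0, 36.0, 32.0, 31.0, 33.0],
})

# Route 1: group by both columns, then move the 'sex' level into columns.
via_groupby = (
    demo_df.groupby(["species_id", "sex"])["hindfoot_length"]
    .median()
    .unstack()
)

# Route 2: ask for the same layout directly.
via_pivot = demo_df.pivot_table(
    values="hindfoot_length",
    index="species_id",
    columns="sex",
    aggfunc="median",
)

print(via_groupby)
print(via_pivot)
```

On this toy data both routes give one row per species and one column per sex. With real data that contains missing values, the two can differ in which rows or columns they keep, so it is worth inspecting both outputs.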

## Technical Summary
* **Import Pandas**: `import pandas as pd`
* **Load data**: `pd.read_csv()`
  * Argument: the **filename**
  * `index_col`: the column to interpret as the DataFrame index
* **DataFrame** object:
  * **Attributes**: `shape`, `index`, `columns`
  * **Methods**: `head()`, `tail()`, `describe()`
* **Series** object (a single **column**):
  * **Selection**: `df['column_name']`
  * **Methods**:
    * Descriptive statistics:
      * `count()`, `mean()`, `std()`
      * `min()`, `median()`, `max()`
      * `nunique()`, `unique()`
    * Statistical summary: `describe()`
* **Grouping by values** of one or many columns:
  * `groupby(column_name)`
  * `groupby([column_name1, column_name2])`
  * Descriptive statistics: `aggregate([function1, ...])`
* **Pivot tables**
  * Reshaping by moving an index level into the columns: `unstack()`
  * Aggregation in a pivot table: `pivot_table()`
    * `values=colX`
    * `index=[col_ind]`
    * `columns=[category1, category2]`
    * `aggfunc=function` (default: mean)