# Data Analysis and Visualization in Python
## Starting With Data
Questions
* How can I import data in Python?
* What is Pandas?
* Why should I use Pandas to work with data?

Objectives
* Load the Python Data Analysis Library (Pandas).
* Use `read_csv` to read tabular data into Python.
* Describe what a DataFrame and a Series are in Python.
* Access and summarize data stored in a DataFrame.
* Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame.
* Create simple plots.

## Working With Pandas DataFrames in Python

### Pandas in Python
One of the best options for working with tabular data in Python is the Python Data Analysis Library (Pandas). Pandas provides data structures for tabular data, produces high-quality plots through matplotlib, and integrates well with other libraries that use NumPy arrays (NumPy is another Python library).

In [None]:
# Import the "pandas" library
import pandas as pd
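
As a rough illustration of the data structures and the NumPy integration mentioned above, the sketch below builds a tiny DataFrame by hand. The name `toy_df` and all of its values are made up for this example and are not taken from the survey data.

```python
import numpy as np
import pandas as pd

# Made-up toy values, not taken from surveys.csv
toy_df = pd.DataFrame({
    "species_id": ["NL", "DM", "DM"],
    "weight": [40.0, 48.0, 29.0],
})

# Selecting one column of a DataFrame gives a Series
weight = toy_df["weight"]
print(type(toy_df).__name__)   # DataFrame
print(type(weight).__name__)   # Series

# The values of a column are available as a NumPy array
weight_array = weight.to_numpy()
print(type(weight_array).__name__)  # ndarray

# NumPy functions accept a Series directly
print(np.mean(weight))  # 39.0
```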

## Reading CSV Data Using Pandas
### So What’s a DataFrame?

In [None]:
# Note that pd.read_csv is used because we imported pandas as pd
surveys_df = pd.read_csv("../data/surveys.csv")
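
If the data file is not at hand, `read_csv` can be tried on an in-memory string instead. The following is a minimal sketch under that assumption: the rows in `csv_text` are invented and only imitate the layout of a survey table, and `toy_df` and `indexed_df` are illustrative names.

```python
import io
import pandas as pd

# Made-up rows that only imitate the layout of a survey table
csv_text = """record_id,month,plot_id,species_id,sex,weight
1,7,2,NL,M,40
2,7,3,NL,M,
3,8,2,DM,F,48
"""

# read_csv accepts a file path or a file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df.shape)  # (3, 6)

# index_col uses a column as the row index instead of 0, 1, 2, ...
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="record_id")
print(indexed_df.shape)  # (3, 5)

# The empty weight field in the second row is read as a missing value (NaN)
print(indexed_df["weight"].isna().sum())  # 1
```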

In [None]:
# Display the first rows of the DataFrame
surveys_df.head()

In [None]:
# Shape of the DataFrame, returned as a tuple: (rows, columns)
surveys_df.shape

In [None]:
# Accessing the list of column names
surveys_df.columns

In [None]:
# A column is an object of type Series
surveys_df['weight'].describe()

In [None]:
# Compute each descriptive statistic for the 'weight' column
print("Count:    ", surveys_df['weight'].count())
print("Mean:     ", surveys_df['weight'].mean())
print("Std Dev.: ", surveys_df['weight'].std())
print("Min:      ", surveys_df['weight'].min())
print("Median:   ", surveys_df['weight'].median())
print("Max:      ", surveys_df['weight'].max())

In [None]:
# New column - Convert all weights from grams to kilograms
surveys_df['weight_kg'] = surveys_df['weight'] / 1000

# Display the last rows of the DataFrame
surveys_df.tail()

### Exercise - DataFrame and Series

1. Load the table of species from the `species.csv` file and assign the result to `species_df`.

In [None]:
species_df = pd.read_csv('../data/species.csv')

2. Use the `.unique()` method to get
   all the different `taxa`.

In [None]:
species_df['taxa'].unique()

3. Use the `.nunique()` method to
   get the number of different taxa.
   Note: missing values (`NaN`) are not counted.

In [None]:
species_df['taxa'].nunique()
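
The difference in how missing values are handled can be checked on a small hand-made Series. This is only a sketch: the values in `taxa` are invented and include one missing entry on purpose.

```python
import numpy as np
import pandas as pd

# Made-up taxa values with one missing entry
taxa = pd.Series(["Bird", "Rodent", np.nan, "Bird"])

# unique() keeps the missing value as one of the distinct entries
print(len(taxa.unique()))          # 3

# nunique() ignores missing values by default
print(taxa.nunique())              # 2

# dropna=False counts the missing value as well
print(taxa.nunique(dropna=False))  # 3
```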

## Groups in Pandas

In [None]:
# Group data by sex
by_sex = surveys_df.groupby('sex')

In [None]:
# Summary statistics for all numeric columns by sex
by_sex.describe()

In [None]:
# Provide the mean for each numeric column by sex
by_sex.mean(numeric_only=True)

In [None]:
# Get the number of records of the 'AB' species
surveys_df.groupby('species_id')['record_id'].count()['AB']

### Exercise - Grouping
1. How many recorded individuals are female `F`, and how many male `M`?

In [None]:
by_sex['record_id'].count()

2. What happens when you group by two columns using the following syntax and then compute the mean values?

In [None]:
surveys_df.groupby(['plot_id', 'sex']).mean(numeric_only=True)
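
To see the structure of such a result without the full dataset, the sketch below groups a small, made-up table by the same two columns. The name `toy_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up toy records
toy_df = pd.DataFrame({
    "plot_id": [1, 1, 1, 2],
    "sex": ["F", "M", "M", "F"],
    "weight": [40.0, 30.0, 50.0, 20.0],
})

means = toy_df.groupby(["plot_id", "sex"]).mean(numeric_only=True)
print(means)

# The row index has two levels, one per grouping column
print(list(means.index.names))  # ['plot_id', 'sex']

# Only combinations present in the data appear: (1, F), (1, M), (2, F)
print(len(means))  # 3

# A single group is selected with a tuple of keys
print(means.loc[(1, "M"), "weight"])  # 40.0
```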

3. Summarize `weight` values for each site (`plot_id`) in your data. Hint: it is possible to select a column once the data has been grouped.

In [None]:
surveys_df.groupby('plot_id')['weight'].describe()

## Quick and Easy Plotting Using Pandas

In [None]:
surveys_df.groupby('plot_id')['record_id'].count().plot(kind='bar')

### Exercise - Plotting Challenge
Create a `line` plot of the median `weight` per month.

In [None]:
surveys_df.groupby('month')['weight'].median().plot(kind='line')

## Summary Example

In [None]:
by_site_sex = surveys_df.groupby(['plot_id', 'sex'])
site_sex_total_weight = by_site_sex['weight'].sum()
site_sex_total_weight.head(15)

In [None]:
# Change the right-most categorical values into columns
sstw = site_sex_total_weight.unstack()
sstw
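
The effect of `unstack()` is easier to follow on a few rows. The sketch below repeats the same steps on an invented table; `toy_df`, `total_weight`, and `wide` are illustrative names, and the weights are made up.

```python
import pandas as pd

# Made-up toy records
toy_df = pd.DataFrame({
    "plot_id": [1, 1, 2, 2, 2],
    "sex": ["F", "M", "F", "M", "M"],
    "weight": [40.0, 30.0, 20.0, 10.0, 15.0],
})

# A Series indexed by the pair (plot_id, sex)
total_weight = toy_df.groupby(["plot_id", "sex"])["weight"].sum()
print(total_weight)

# unstack() moves the innermost index level (sex) into the columns
wide = total_weight.unstack()
print(wide)
print(list(wide.columns))  # ['F', 'M']
print(wide.loc[2, "M"])    # 25.0
```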

In [None]:
# Add a title and the axis labels
s_plot = sstw.plot(kind='bar', title="Total weight by site and sex")
s_plot.set_xlabel("Site")
s_plot.set_ylabel("Weight (g)")

## Technical Summary
* **Import Pandas**: `import pandas as pd`
* **Load data**: `pd.read_csv()`
  * Argument: the **filename**
  * `index_col`: the column to interpret as the DataFrame index
* **DataFrame** object:
  * **Attributes**: `shape`, `index`, `columns`
  * **Methods**: `head()`, `tail()`, `describe()`
* **Series** object, or column:
  * **Selection**: `df['column_name']`
  * **Methods**:
    * Descriptive statistics:
      `count()`, `mean()`, `std()`, `median()`, `min()`, `max()`
    * Others: `describe()`, `nunique()`, `unique()`
* **Grouping by values** of one or many columns:
  * `groupby(column_name)`
  * `groupby([column_name1, column_name2])`
* **Reshaping a DataFrame** from values in the index: `unstack()`
* **Making a plot from a DataFrame**:
  * `df.plot()`, where `df` has one or many columns
    * `kind='bar'` (line plot by default)
    * `stacked=True`
    * `title=""`
  * `plot_object.set_xlabel("")`
  * `plot_object.set_ylabel("")`
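
The `stacked=True` option listed above could be used roughly as follows. This is a sketch on a made-up table (`wide` and its values are invented), and it assumes matplotlib is installed. The `matplotlib.use("Agg")` line selects a non-interactive backend so the code also runs outside a notebook; inside a notebook it can be left out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be omitted in a notebook
import pandas as pd

# Made-up total weights, one row per site and one column per sex
wide = pd.DataFrame(
    {"F": [40.0, 20.0], "M": [30.0, 25.0]},
    index=pd.Index([1, 2], name="plot_id"),
)

# stacked=True draws the F and M bars on top of each other for each site
ax = wide.plot(kind="bar", stacked=True, title="Total weight by site and sex")
ax.set_xlabel("Site")
ax.set_ylabel("Weight (g)")
```

With `stacked=True`, the height of each stacked bar is the total weight for that site across both sexes.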