# Data Visualization in Python
## Visualization with Pandas and Matplotlib

Questions
* What are the different data types in Pandas?
* How do data types affect descriptive statistics?
* How can we visualize data from a Pandas DataFrame?
* How to customize a plot?

Objectives
* Manipulate the data types.
* Create simple plots from a DataFrame.
* Customize a plot with the methods of Matplotlib objects.

In [None]:
# First make sure pandas is loaded
import pandas as pd

# Read in the survey csv
surveys_df = pd.read_csv('../data/surveys.csv')

## Types of Data
Since the data types can impact the appearance of a chart,
it is necessary to first validate the data type of each
column in the DataFrame that we will use in our analysis.

Native Python Type | Pandas Type | Description
:-----------------:|:-----------:|:-----------
`str`              | `object`    | The most general dtype. Assigned to a column that holds strings or mixed types (numbers and strings).
`int`              | `int64`     | 64-bit integer
`float`            | `float64`   | Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64.
 N/A               | `datetime64`| Values meant to hold time data.
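As a minimal sketch of the table above, the cell below builds a small made-up DataFrame (the name `demo_df` and its column names are illustrative, not part of the survey data) and prints the dtype pandas infers for each column.

```python
import numpy as np
import pandas as pd

# Made-up values, only to show which dtype pandas infers
demo_df = pd.DataFrame({
    'mixed': [1, 'two', 3.0],      # numbers and strings -> object
    'whole': [1, 2, 3],            # integers only -> int64
    'with_nan': [1, np.nan, 3],    # integers and a NaN -> float64
})

# Dates are parsed explicitly here with pd.to_datetime()
demo_df['when'] = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'])

print(demo_df.dtypes)
```

The `with_nan` column shows why a column of whole numbers can end up as `float64`: the missing value forces the conversion.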

In [None]:
# Getting the data type of species identifiers
surveys_df['species_id'].dtype

In [None]:
# Getting the data type of month values
surveys_df['month'].dtype

In [None]:
# Getting the data types of all columns
surveys_df.dtypes

### Working With Our Survey Data

In [None]:
# Summary of descriptive statistics
surveys_df.describe()

In [None]:
# Convert month numbers to nominal values
surveys_df['month'] = surveys_df['month'].astype('str')
surveys_df['month'].dtype

In [None]:
# Descriptive statistics on a categorical variable
surveys_df['month'].describe()

In [None]:
# Listing all different months
surveys_df['month'].unique()

In [None]:
# Listing all different years
surveys_df['year'].unique()
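The effect of the data type on `describe()` can be reproduced on a few made-up values. This is only a sketch: the Series `months` below is hypothetical and is not read from the survey file.

```python
import pandas as pd

# Made-up month numbers, not the survey data
months = pd.Series([7, 7, 8, 12, 12, 12])

# Numeric dtype: count, mean, std, min, quartiles, max
numeric_summary = months.describe()

# String dtype: count, unique, top (most frequent value), freq
nominal_summary = months.astype('str').describe()

print(numeric_summary)
print(nominal_summary)
```

A mean month is rarely meaningful, which is one reason to treat months as nominal values when summarizing them.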

## Quick & Easy Data Plotting Using Pandas
By default:
* The `.plot()` method returns an
  [Axes object](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.html)
  from [Matplotlib](https://matplotlib.org/stable/gallery/index.html).
* The values in the index of the DataFrame will be along the X axis.

In [None]:
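# 'month' was converted to strings above; convert it back to integers
# so that the groups are ordered 1, 2, ..., 12 instead of '1', '10', '11', ...
surveys_df['month'] = surveys_df['month'].astype('int64')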
med = surveys_df.groupby('month')['weight'].median()
print(med.index)

med.plot(kind='line')
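Because the index values end up along the X axis, their order matters for a line plot. The sketch below uses a small made-up DataFrame (`demo` is a hypothetical name) to compare the group order obtained with integer and with string month values.

```python
import pandas as pd

# Made-up records, not the survey data
demo = pd.DataFrame({
    'month': [1, 2, 10, 1, 2, 10],
    'weight': [40.0, 42.0, 38.0, 44.0, 46.0, 36.0],
})

# Integer keys are sorted numerically
by_int = demo.groupby('month')['weight'].median()

# String keys are sorted alphabetically
by_str = (
    demo.assign(month=demo['month'].astype('str'))
    .groupby('month')['weight']
    .median()
)

print(list(by_int.index))  # [1, 2, 10]
print(list(by_str.index))  # ['1', '10', '2']
```

With string keys, month `'10'` would be drawn between `'1'` and `'2'`, so the line would no longer follow the calendar.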

In [None]:
sites = surveys_df.groupby('plot_id')['record_id'].count()
print(sites.index)

sites.plot(kind='bar')

Other features of `.plot()` and Matplotlib:
* Each column of the DataFrame is drawn in a different color.
* Various arguments and methods set the plot title
  and the axis labels.

In [None]:
species_sex_avg_weight = surveys_df.groupby(
    ['species_id', 'sex'])['weight'].mean().unstack()
species_sex_avg_weight.head(10)
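To see what `unstack()` does in the cell above, here is a minimal sketch on made-up data. The species identifiers `'AB'` and `'CD'` and the weights are invented for illustration.

```python
import pandas as pd

# Made-up records, not the survey data
demo = pd.DataFrame({
    'species_id': ['AB', 'AB', 'AB', 'CD', 'CD'],
    'sex': ['F', 'M', 'M', 'F', 'M'],
    'weight': [10.0, 12.0, 14.0, 30.0, 34.0],
})

# Grouping on two columns gives a Series with a two-level index
stacked = demo.groupby(['species_id', 'sex'])['weight'].mean()

# unstack() moves the innermost level ('sex') into the columns
wide = stacked.unstack()
print(wide)
```

The result has one row per species and one column per sex, which is the layout `.plot()` needs to draw one colored series per column.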

In [None]:
# Add a title and the axis labels
s_plot = species_sex_avg_weight.plot(
    kind='bar', title='Average weight by species and sex')
s_plot.set_xlabel('Species (Identifier)')
s_plot.set_ylabel('Average weight (g)')
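Since `.plot()` returns a Matplotlib Axes object, other Axes methods can be chained in the same way. The sketch below assumes a small made-up table (`demo`) and picks two additional methods, `set_ylim()` and `legend()`, as examples. The `Agg` backend is selected only so that the code also runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up average weights, not the survey data
demo = pd.DataFrame(
    {'F': [10.0, 30.0], 'M': [13.0, 34.0]},
    index=pd.Index(['AB', 'CD'], name='species_id'),
)

ax = demo.plot(kind='bar', title='Average weight by species and sex')
ax.set_xlabel('Species (Identifier)')
ax.set_ylabel('Average weight (g)')

# Further customization through the same Axes object
ax.set_ylim(0, 40)        # fix the range of the Y axis
ax.legend(title='Sex')    # give the legend a title
```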

### Exercises - Create a plot
`1`. Generate a line plot showing the average
`hindfoot_length` of each species from one year to the next.
Note: records without a `hindfoot_length` are discarded.
(4 min.)

In [None]:
selection = surveys_df[surveys_df['hindfoot_length'].notna()]

selection.groupby(
    ###
)['hindfoot_length'].mean()###

`2`. Generate a bar plot showing the average weight
of females and males for each genus. (3 min.)

In [None]:
# Inner join with the details about the species
species_df = pd.read_csv('../data/species.csv')
merged = pd.merge(
    left=surveys_df,
    right=species_df,
    on='species_id'
)
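As a reminder of what an inner join keeps, here is a minimal sketch on two made-up tables. The names `records` and `lookup` and all their values are hypothetical.

```python
import pandas as pd

# Made-up tables, not the survey data
records = pd.DataFrame({
    'record_id': [1, 2, 3],
    'species_id': ['AB', 'CD', 'ZZ'],   # 'ZZ' has no entry in the lookup table
})
lookup = pd.DataFrame({
    'species_id': ['AB', 'CD', 'EF'],   # 'EF' never appears in the records
    'genus': ['Alpha', 'Gamma', 'Epsilon'],
})

# The default join type of pd.merge() is 'inner':
# only keys present in both tables are kept
inner = pd.merge(left=records, right=lookup, on='species_id')
print(inner)
```

Only the rows for `'AB'` and `'CD'` survive, so records whose species is missing from the lookup table drop out of the merged result.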

In [None]:
# Compute mean values and visualize
merged.pivot_table(
    aggfunc='mean',
    values='weight',
    index=###,
    columns=###
)#.plot(kind=###)

## Technical Summary
* **Managing data types**
    * For a **DataFrame**:
        * Attribute: `dtypes`
    * For a **Series** (column):
        * Attribute: `dtype`
        * Method: `astype()`
* **Making a plot from a DataFrame**:
    * `df.plot()`, where `df` has one or many columns
        * `kind='bar'` (`'line'` plot by default)
        * `stacked=True`
        * `title=''`
    * `plot_object.set_xlabel('')`
    * `plot_object.set_ylabel('')`
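The `stacked=True` argument listed above can be tried on made-up counts. This is only a sketch: the table `demo` and its values are invented, and the `Agg` backend is selected so that the code also runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Made-up record counts per species and sex, not the survey data
demo = pd.DataFrame(
    {'F': [10, 30], 'M': [13, 34]},
    index=pd.Index(['AB', 'CD'], name='species_id'),
)

# With stacked=True, the bars of each row are piled on top of each other
ax = demo.plot(kind='bar', stacked=True, title='Records by species and sex')
ax.set_xlabel('Species (Identifier)')
ax.set_ylabel('Number of records')
```

Stacking suits quantities that add up to a meaningful total, such as counts. For averages, side-by-side bars are usually easier to read.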