# Data Analysis and Visualization in Python
## Making Plots With plotnine
Questions
* How can I visualize data in Python?
* What is the ‘grammar of graphics’?

Objectives
* Create a `plotnine` object.
* Set universal plot settings.
* Modify an existing plotnine object.
* Change the aesthetics of a plot such as color.
* Edit the axis labels.
* Build complex plots using a step-by-step approach.
* Create scatter plots, box plots, and time series plots.
* Use the `facet_wrap` and `facet_grid` commands to create a collection of plots splitting the data by a factor variable.
* Create customized plot styles to meet your needs.

In [None]:
# Disable some warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

## Disclaimer
Python has powerful plotting libraries such as `matplotlib`, but for this episode we will use the `plotnine` package, which facilitates the creation of highly informative plots of structured data. It is based on the R implementation of `ggplot2` and on [The Grammar of Graphics](https://link.springer.com/book/10.1007%2F0-387-28695-0) by Leland Wilkinson. The `plotnine` package is built on top of Matplotlib and interacts well with Pandas.

If `plotnine` is not installed in your conda setup, see:
https://plotnine.readthedocs.io/en/stable/installation.html

In [None]:
import plotnine as p9

In [None]:
import pandas as pd

# Load and clean the data
surveys_complete = pd.read_csv('../data/surveys.csv')
surveys_complete = surveys_complete.dropna()

## Plotting with plotnine

`plotnine` graphics are built step by step by adding new elements on top of each other using the `+` operator. Wrapping the individual steps in parentheses `()` lets the expression span several lines in valid Python syntax.

In [None]:
# Initial empty plot
(p9.ggplot(data=surveys_complete))

* Define aesthetics (`aes`) by **selecting the variables** used in the plot and mapping them to a visual property through the `mapping` argument. The most important aes mappings are: `x`, `y`, `alpha`, `color`, `colour`, `fill`, `linetype`, `shape`, `size` and `stroke`.

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length')))

* To add a `geom_*` to the plot, use the `+` operator:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length'))
    + p9.geom_point()
)

* You can easily set up plot templates and conveniently explore different types of plots

In [None]:
# Create the plot template (data and aesthetics only)
surveys_plot = p9.ggplot(data=surveys_complete,
                         mapping=p9.aes(x='weight', y='hindfoot_length'))

# Draw the plot
surveys_plot + p9.geom_point()

* After creating your plot, you can save it to a file in your favourite format

In [None]:
my_plot = surveys_plot + p9.geom_point()

my_plot.save("scatterplot.png", width=10, height=10, dpi=300)

### Exercises - bar chart
Working on the `surveys_complete` data set, use the `plot_id` column to create a `bar` plot that counts the number of records for each plot. Hint: the count will be done implicitly by the `geom_bar()` function.

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='plot_id'))
    + p9.geom_bar()
)
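The implicit count performed by `geom_bar()` corresponds to counting the rows per `plot_id` value. As a rough cross-check, the same numbers can be computed with pandas; the sketch below uses a small made-up data frame (`toy`) rather than the surveys data.

```python
import pandas as pd

# Toy data invented for this sketch: six records spread over three plots
toy = pd.DataFrame({"plot_id": [1, 2, 2, 3, 3, 3]})

# Number of records per plot_id, ordered by plot_id like the bars on the x-axis
counts = toy["plot_id"].value_counts().sort_index()
print(counts)
```

Running the same `value_counts()` call on `surveys_complete["plot_id"]` should give the bar heights shown in the plot.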

## Building your plots iteratively
* Usually, `data`, `aes` and `geom_*` are the basic elements of any graph:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length'))
    + p9.geom_point()
)

* Then we start modifying this plot to extract more information from it, for example by adding transparency (`alpha`) and a single color:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length'))
    + p9.geom_point(alpha=0.1, color='blue')
)

* We can also get a different color for each species (by mapping the `species_id` column to the `color` aesthetic):

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length', color='species_id'))
    + p9.geom_point(alpha=0.1)
)

* Changing labels:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length', color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.xlab("Weight (g)")
)

* Defining scales for colors, axes, and so on. For example, a log10 version of the x-axis:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length', color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.xlab("Weight (g)")
    + p9.scale_x_log10()
)

* Changing the theme (`theme_*`) or some specific theming (`theme`) elements. We can set the background to white using the function `theme_bw()` and enlarge all text with `theme(text=element_text(size=16))`:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight', y='hindfoot_length', color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.xlab("Weight (g)")
    + p9.scale_x_log10()
    + p9.theme_bw()
    + p9.theme(text=p9.element_text(size=16))
)

### Exercise - Bar plot adaptations
Adapt the bar plot of the previous exercise by mapping the `sex` variable to the aesthetic `fill`. Then, use the `scale_fill_manual()` function in order to specify both colors `blue` and `orange` respectively for "F" and "M" values (see [API reference](https://plotnine.readthedocs.io/en/stable/api.html#color-and-fill-scales) for other `scale*` functions and how to use them).

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='plot_id',
                          fill='sex'))
    + p9.geom_bar()
    + p9.scale_fill_manual(["blue", "orange"])
)

## Plotting distributions
* A boxplot can be used:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='species_id',
                          y='weight'))
    + p9.geom_boxplot()
)

* Adding points behind the boxplot:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='species_id',
                          y='weight'))
    + p9.geom_jitter(alpha=0.2)
    + p9.geom_boxplot(alpha=0)
)

### Exercise - Distributions
An alternative to the boxplot is the violin plot (sometimes known as a beanplot), where the shape of the density of points is drawn.

* Replace the box plot with a violin plot, see `geom_violin()`
* Represent weight on the log10 scale, see `scale_y_log10()`
* Add color to the datapoints on your plot according to the `plot_id` from which the sample was taken

Hint: By using `factor()` within the `aes` mapping of a variable, `plotnine` will handle the values as category values.

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='species_id',
                          y='weight',
                          color='factor(plot_id)'))
    + p9.geom_jitter(alpha=0.3)
    + p9.geom_violin(alpha=0, color="0.5")
    + p9.scale_y_log10()
)
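The `factor()` hint matters because `plot_id` is stored as integers, which would otherwise be treated as a continuous variable. One possible alternative, sketched below on made-up data (`toy`), is to convert the column to a pandas categorical beforehand; the column name `plot_id_cat` is hypothetical, and the assumption is that `plotnine` treats categorical columns as discrete in the same way as `factor()`.

```python
import pandas as pd

# Toy data invented for this sketch (not the surveys data set)
toy = pd.DataFrame({"plot_id": [1, 2, 2, 12],
                    "weight": [40.0, 48.0, 29.0, 46.0]})

# Integer dtype: a plotting library would read this as a continuous variable
print(toy["plot_id"].dtype)

# Explicit categorical copy of the column (hypothetical name `plot_id_cat`)
toy["plot_id_cat"] = toy["plot_id"].astype("category")
categories = list(toy["plot_id_cat"].cat.categories)
print(categories)
```

Mapping `color='plot_id_cat'` would then avoid the need for `factor()` inside `aes`.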

## Plotting time series data
* Let’s calculate the number of records per year for each species.
* Reset the index - `year` and `species_id` will become columns.

In [None]:
yearly_counts = surveys_complete.groupby(['year', 'species_id'])['species_id'].count()
yearly_counts = yearly_counts.reset_index(name='counts')
yearly_counts
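The two steps, counting per group and resetting the index, can be followed on a small example. The sketch below uses a made-up data frame (`toy`) with the same column names as the surveys data; the intermediate name `grouped` is only for illustration.

```python
import pandas as pd

# Toy data invented for this sketch: five records, two years, two species
toy = pd.DataFrame({"year": [1977, 1977, 1977, 1978, 1978],
                    "species_id": ["DM", "DM", "DO", "DM", "DO"]})

# Step 1: count records per (year, species_id); the result is a Series
# indexed by both grouping keys
grouped = toy.groupby(["year", "species_id"])["species_id"].count()
print(grouped.index.names)

# Step 2: turn the index levels back into columns and name the counts column
yearly = grouped.reset_index(name="counts")
print(yearly)
```

Passing `name='counts'` is what avoids a clash between the `species_id` index level and the counted `species_id` values.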

* Time series data can be visualised as a line plot (i.e. `geom_line`) with years on the `x` axis and counts on the `y` axis.
* We need to tell `plotnine` to draw a line for each species by modifying the aesthetic mapping so that `species_id` is mapped to the color:

In [None]:
(p9.ggplot(data=yearly_counts,
           mapping=p9.aes(x='year',
                          y='counts',
                          color='species_id'))
    + p9.geom_line()
)

## Faceting
* `plotnine` has a special technique called faceting that allows you to split one plot into multiple plots based on a factor variable included in the dataset.

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight',
                          y='hindfoot_length',
                          color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.facet_wrap("sex")
)

* We can apply the same concept on any of the available categorical variables:

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='weight',
                          y='hindfoot_length',
                          color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.facet_wrap("plot_id")
)

* The `facet_grid` function allows you to explicitly specify how you want your plots to be arranged via formula notation (`rows ~ columns`):

In [None]:
# Only select years of interest
survey_2000 = surveys_complete[surveys_complete["year"].isin([2000, 2001, 2002])]

(p9.ggplot(data=survey_2000,
           mapping=p9.aes(x='weight',
                          y='hindfoot_length',
                          color='species_id'))
    + p9.geom_point(alpha=0.1)
    + p9.facet_grid("sex ~ year")
)
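Before drawing a grid of facets it can help to check how many records fall into each `rows ~ columns` combination, since some panels may turn out empty. The following sketch does this with pandas on made-up data (`toy`); it is a side check under that assumption, not part of `plotnine` itself.

```python
import pandas as pd

# Toy data invented for this sketch (not the surveys data set)
toy = pd.DataFrame({"sex": ["F", "F", "M", "M", "M"],
                    "year": [2000, 2001, 2000, 2000, 2002]})

# Rows = sex, columns = year, mirroring the layout of "sex ~ year";
# combinations without any record are filled with 0
panel_sizes = toy.groupby(["sex", "year"]).size().unstack(fill_value=0)
print(panel_sizes)
```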

### Exercise - Faceting
`1`. Create a separate plot for each of the species that depicts how the average weight of the species changes through the years.

In [None]:
yearly_weight = surveys_complete.groupby(['year',
                                          'species_id'])['weight'].mean().reset_index()
(p9.ggplot(data=yearly_weight,
           mapping=p9.aes(x='year',
                          y='weight'))
    + p9.geom_line()
    + p9.facet_wrap("species_id")
)

`2`. Based on the previous exercise, visually compare how the weights of males and females have changed through time by creating a separate plot for each sex and assigning an individual color to each `species_id`.

In [None]:
yearly_weight = surveys_complete.groupby(['year',
                                          'species_id',
                                          'sex'])['weight'].mean().reset_index()
(p9.ggplot(data=yearly_weight,
           mapping=p9.aes(x='year',
                          y='weight',
                          color='species_id'))
    + p9.geom_line()
    + p9.facet_wrap("sex")
)

## Further customization
* Consider the following example of a bar plot with the counts per year.

In [None]:
(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='factor(year)'))
    + p9.geom_bar()
)

* The `theme` functionality provides a way to rotate the text of the x-axis labels:

In [None]:
my_custom_theme = p9.theme(axis_text_x = p9.element_text(angle=90))

(p9.ggplot(data=surveys_complete,
           mapping=p9.aes(x='factor(year)'))
    + p9.geom_bar()
    + p9.theme_bw()
    + my_custom_theme
)