# Visualizing Data

## Metadata

- Teaching: 60
- Exercises: 0

## Questions

- How can we look at individual rows and columns in a dataframe?
- How can we style plots?
- How can we modify the table and data?

## Objectives

- Review processes for reading, modifying, and combining dataframes
- Make and customize scatter, box, and bar plots

We'll begin by loading and preparing the data we'd like to plot. This will include operations introduced in previous lessons, including reading CSVs into dataframes, merging dataframes, sorting a dataframe, and removing records that include null values. First, we'll import pandas:

In [None]:
import pandas as pd

Next we'll load the surveys dataset using `pd.read_csv()`:

In [None]:
surveys = pd.read_csv("data/surveys.csv")

Now we want to take a quick look at the surveys dataset. Since we're going to be plotting data, we need to know which columns include null values:

In [None]:
surveys.info()

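There are 35,459 records in the table. Four columns (species_id, sex, hindfoot_length, and weight) include null values. Rather than discard records where sex was not recorded, we will fill the null values in that column with "U":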

In [None]:
surveys["sex"] = surveys["sex"].fillna("U")

The plots below require complete records, so we will drop any rows that still include null values:

In [None]:
surveys = surveys.dropna()
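
To see how `fillna()` and `dropna()` work together, here is a minimal sketch on a made-up three-row table. The values are invented for illustration and are not taken from the surveys dataset.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value
toy = pd.DataFrame(
    {
        "sex": ["F", None, "M"],
        "weight": [40.0, 35.0, None],
    }
)

# Replace missing values in one column with a placeholder
toy["sex"] = toy["sex"].fillna("U")

# Drop any row that still contains a null in any column
complete = toy.dropna()

print(toy["sex"].tolist())  # ['F', 'U', 'M']
print(len(complete))        # 2
```

The second row survives because its only null was filled, while the third row is dropped because its weight is still missing.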

Now we will merge the main surveys dataframe with two other datasets containing additional information:

- **species.csv** provides the genus and species corresponding to species_id
- **plots.csv** provides the plot type corresponding to plot_id

We will read each CSV and merge it into the main dataframe:

In [None]:
species = pd.read_csv("data/species.csv")
plots = pd.read_csv("data/plots.csv")

surveys = surveys.merge(species, how="left").merge(plots, how="left")

### Chaining

The previous cell performs two merges in the same line of code. Performing multiple operations on the same object in a single line of code is called chaining.
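
As a rough illustration, the sketch below runs the same two merges step by step and as a chain, then confirms that the results match. The three small tables are hypothetical stand-ins for the surveys, species, and plots dataframes.

```python
import pandas as pd

# Hypothetical toy tables that share key columns
obs = pd.DataFrame({"species_id": ["DM", "PF"], "plot_id": [1, 2]})
sp = pd.DataFrame({"species_id": ["DM", "PF"], "genus": ["Dipodomys", "Perognathus"]})
pl = pd.DataFrame({"plot_id": [1, 2], "plot_type": ["Control", "Spectab exclosure"]})

# Step by step, storing each intermediate result
step1 = obs.merge(sp, how="left")
step2 = step1.merge(pl, how="left")

# Chained: each method is called on the result of the previous one
chained = obs.merge(sp, how="left").merge(pl, how="left")

print(chained.equals(step2))  # True
```

Chaining saves us from naming intermediate results, but long chains can be harder to read and debug than separate steps.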

In [None]:
surveys["taxa"].unique()

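The cell above lists the taxa that remain in the dataframe. Since the remaining records are rodents, we will refer to the dataframe by a more descriptive name: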

In [None]:
rodents = surveys
rodents.sample(5)

Finally we will save the rodents dataframe to a file:

In [None]:
rodents.to_csv("data/rodents.csv", index=False)

We can load that dataframe directly in the future.
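
The sketch below shows the save-and-reload round trip on a small made-up table. It writes to a temporary directory rather than to `data/rodents.csv` so that it can be run anywhere. The table and the file name `toy.csv` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"genus": ["Dipodomys", "Neotoma"], "weight": [40.0, 180.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['genus', 'weight']
```

Without `index=False`, the row index would be written as an extra unnamed column and would appear as a column when the file is read back in.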

## Using plotly

We've already worked a little with plotly in previous lessons. As a plotting library, plotly is:

- **Customizable.** Allows the appearance of plots to be extensively modified. 
- **Interactive.** Pan and zoom across plots, or hover over elements to get additional information about them.
- **Flexible.** Allows creation of many different plot types, often with only a few lines of code.
- **Embeddable.** Interactive plots can be embedded on websites using plotly's JavaScript library.

Plotly has two main ways of making plots:

- plotly.express provides a simplified interface for quickly building and customizing plots
- plotly.graph_objects uses a more complex interface to provide more granular control over the exact appearance of a plot

We will use plotly.express in this lesson.

## Other plotting libraries

The R community has largely coalesced around ggplot2 for plotting. In contrast, the Python community uses a number of data visualization libraries. Some commonly used alternatives to plotly include:

- [Bokeh](https://bokeh.org/)
- [Matplotlib](https://matplotlib.org/)
- [seaborn](https://seaborn.pydata.org/)
- [Vega-Altair](https://altair-viz.github.io/)

We'll begin by reproducing the scatterplot from the end of lesson 4, which used weight on the x axis and hindfoot length on the y axis.

In [None]:
import plotly.express as px

px.scatter(surveys, x="weight", y="hindfoot_length")

Let's take a quick look at the interactive elements on this plot. When we hover over a plotly plot, a toolbar appears in the upper right corner. Each icon on the toolbar is a widget that allows us to interact with the plot. By default, the toolbar includes the following widgets:

- The camera allows us to save the current view as a PNG file
- The next four widgets are toggles that control how click-and-drag affects the plot. Only one can be active at a time.
    - The magnifying glass enables a zoom box
    - The crossing arrows enable panning
    - The dotted box enables drawing a box to select data
    - The dotted lasso enables drawing an arbitrary shape to select data
- The plus box allows us to zoom in
- The minus box allows us to zoom out
- The crossing arrows autoscale the plot to show all data
- The house resets the plot to the original view

### Challenge

What are some limitations to the plot above? Think about how the data itself is presented as well as the general appearance of the plot.

- All points are the same color
- Points overlap, making it difficult to understand how data is distributed
- Axis labels include underscores and lack units
- No plot title

Any others?

One issue with the plot is that many of the points in the dataframe overlap, making it difficult to get a feel for how the data is distributed. Does it cluster in places? Is it evenly distributed? We really can't tell. 

We can mitigate this issue in part by making the points semitransparent using the opacity keyword argument:

In [None]:
px.scatter(surveys, x="weight", y="hindfoot_length", opacity=0.2)

With the points now partially transparent, the places where they overlap are clearer, and we can see several areas where the observations cluster. 

In [None]:
px.scatter(surveys, x="weight", y="hindfoot_length", color="genus", opacity=0.2)

In [None]:
px.scatter(
    surveys,
    x="weight",
    y="hindfoot_length",
    color="genus",
    opacity=0.2,
    color_discrete_sequence=px.colors.qualitative.Safe,
)

The legend of the plot is not ordered. Because we want to use a consistent color scheme across plots, we can use the category_orders keyword argument to order the legend alphabetically.

This argument uses a `dict`, a data type that we have not discussed yet. A `dict` is built in, which means it can be used in any Python application without having to import anything. Like a `list`, a `dict` is a container that can include more than one object. Whereas a `list` is a sequence, a `dict` is a mapping consisting of *keys* that map to *values*.

Let's see what that looks like in practice. Here we define a `dict` mapping lowercase to uppercase letters. Within each pair, the key and value are separated by a colon, with the key on the left and the value on the right. The pairs themselves are separated by commas.

In [None]:
letters = {"a": "A", "b": "B", "c": "C"}

To retrieve the value for a given key, we use square brackets:

In [None]:
letters["a"]
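
A few other common `dict` operations are sketched below using the same `letters` mapping. These go slightly beyond what the lesson needs and are included only to show how a `dict` behaves.

```python
letters = {"a": "A", "b": "B", "c": "C"}

# Add a new key-value pair, or overwrite an existing one, by assignment
letters["d"] = "D"

# Check whether a key exists before looking it up
print("d" in letters)  # True
print("z" in letters)  # False

# .get() returns a fallback instead of raising a KeyError for a missing key
print(letters.get("z", "?"))  # ?

# Loop over the key-value pairs
for lower, upper in letters.items():
    print(lower, upper)
```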

The `dict` passed to category_orders maps a column name from the dataframe to a list of values in the preferred order. We stated above that we'd like the genera in the plot to appear in alphabetical order. We could construct that list manually, but instead we will use the built-in `sorted()` function to sort the values from the genus column of the dataframe. Because we will be using the same order in all following plots, we will store the sorted values in a variable.

In [None]:
order = sorted(surveys["genus"].unique())
order
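
The same idea can be tried without a dataframe. In this small sketch, a hypothetical list of genus names stands in for the genus column, and `set()` plays the role of `unique()`.

```python
# Hypothetical list of genus names with a repeat
genera = ["Perognathus", "Dipodomys", "Neotoma", "Dipodomys"]

# set() removes the duplicates; sorted() returns a new, alphabetized list
order_example = sorted(set(genera))
print(order_example)  # ['Dipodomys', 'Neotoma', 'Perognathus']

# The original list is left unchanged
print(genera[0])  # Perognathus
```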

We can then pass the full `dict`, that is, `{"genus": order}`, to `px.scatter()` using the category_orders keyword argument to put the legend in alphabetical order:

In [None]:
px.scatter(
    surveys,
    x="weight",
    y="hindfoot_length",
    color="genus",
    opacity=0.2,
    color_discrete_sequence=px.colors.qualitative.Safe,
    category_orders={"genus": order}
)

## Making a box plot

In [None]:
px.box(
    surveys,
    x="genus",
    y="hindfoot_length",
    color="genus",
    color_discrete_sequence=px.colors.qualitative.Safe,
    category_orders={"genus": order}
)

In [None]:
px.box(
    surveys,
    x="genus",
    y="hindfoot_length",
    color="genus",
    color_discrete_sequence=px.colors.qualitative.Safe,
    points="all",
)

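Many of the points now overlap. To make them semitransparent, we can store the plot in a variable and change the marker opacity using the update_traces() method.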

In [None]:
fig = px.box(
    surveys,
    x="genus",
    y="hindfoot_length",
    color="genus",
    color_discrete_sequence=px.colors.qualitative.Safe,
    category_orders={"genus": order},
    points="all",
)
fig.update_traces(marker={"opacity": 0.2})

A similar plot is a violin plot. Try changing `px.box` to `px.violin` in the code above.

## Changing labels

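By default, plotly uses the column names from the dataframe as axis labels. Column names may be legible but are rarely ideal.

Like category_orders, the labels keyword argument takes a `dict`, in this case one that maps column names to the labels we want to display. We can add a plot title at the same time using the title keyword argument.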

In [None]:
fig = px.box(
    surveys,
    x="genus",
    y="hindfoot_length",
    color="genus",
    color_discrete_sequence=px.colors.qualitative.Safe,
    category_orders={"genus": order},
    points="all",
    title="Rodent hindfoot length by genus",
    labels={
        "hindfoot_length": "Hindfoot length (mm)",
        "genus": "Genus",
    }
)
fig.update_traces(marker={"opacity": 0.2})

## Making a bar chart

To look at how the number of observations has changed over time, we first group the data by year and genus and count the records in each group, then plot the counts as a bar chart:

In [None]:
grouped = rodents.groupby(["year", "genus"]).count().reset_index()
px.bar(
    grouped,
    x="year",
    y="weight",
    color="genus",
    color_discrete_sequence=px.colors.qualitative.Safe,
)

The stacked bar chart allows us to compare the total number of observations per year but obscures how counts of individual genera have varied over time. We can break each category into its own subplot by adding the `facet_col` keyword argument:

In [None]:
px.bar(
    grouped,
    x="year",
    y="weight",
    color="genus",
    facet_col="genus",
    color_discrete_sequence=px.colors.qualitative.Safe,
)
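
The grouping step used for the bar charts can be easier to follow on a small table. The sketch below applies the same `groupby().count().reset_index()` chain to four made-up records.

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame(
    {
        "year": [1990, 1990, 1990, 1991],
        "genus": ["Dipodomys", "Dipodomys", "Neotoma", "Dipodomys"],
        "weight": [40.0, 42.0, 180.0, 41.0],
    }
)

# count() tallies the non-null values in each remaining column for every
# (year, genus) group; reset_index() turns the group keys back into columns
grouped_toy = toy.groupby(["year", "genus"]).count().reset_index()
print(grouped_toy)
```

After grouping, the weight column holds a count rather than a measurement. Because rows with null values were dropped earlier in the lesson, that count equals the number of records in each group, which is why `y="weight"` plots the number of observations.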

## Keypoints

- Use square brackets to access rows, columns, and specific cells
- Use operators like `+`, `-`, and `/` to perform arithmetic on rows and columns
- Store the results of calculations in a dataframe by adding a new column or overwriting an existing column
- Sort data, rename columns, and get unique values in a dataframe using methods provided by pandas
- By default, most dataframe operations return a copy of the original data