# Data Visualization in Python
## Plotting distributions

Question
* How can we visualize the distribution of the data grouped by some factor?

Objective
* Create different types of plots.

In [None]:
import pandas as pd

# Load the cleaned data
surveys_complete = pd.read_csv('../data/surveys_0_NA.csv')
surveys_complete

In [None]:
import altair as alt

# Lift Altair's default limit of 5000 rows per chart
alt.data_transformers.disable_max_rows()

## Plotting distributions
* A boxplot summarizes the distribution of weights for each species:

In [None]:
alt.Chart(surveys_complete).mark_boxplot().encode(
    x=alt.X('species_id').title('Species identifier'),
    y=alt.Y('weight').scale(type='log', base=2).title('Weight (g)'),
    color=alt.Color('species_id').legend(None),
)
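To make the boxplot easier to read, it can help to compute the same kind of summary by hand. The sketch below uses a small made-up table (the `toy` DataFrame and its values are illustrative, not taken from the survey data) and pandas to get the first quartile, median and third quartile of the weights for each species. These correspond to the bottom, middle line and top of each box. Altair may compute quantiles slightly differently, so the numbers should be close to, but not necessarily identical to, what the chart draws.

```python
import pandas as pd

# Hypothetical toy data standing in for surveys_complete
toy = pd.DataFrame({
    'species_id': ['DM', 'DM', 'DM', 'DM', 'PF', 'PF', 'PF', 'PF'],
    'weight': [40, 42, 44, 46, 6, 7, 8, 9],
})

# First quartile, median and third quartile of the weight, per species
quartiles = (
    toy.groupby('species_id')['weight']
    .quantile([0.25, 0.5, 0.75])
    .unstack()  # One row per species, one column per quantile
)
quartiles.columns = ['q1', 'median', 'q3']
print(quartiles)
```

By default, the whiskers of `mark_boxplot()` extend to the most extreme points within 1.5 times the interquartile range (`q3 - q1`) of the box, and points beyond that are drawn individually as outliers.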

* Narrow facets, one per species, can display jittered point clouds side by side:

In [None]:
alt.Chart(surveys_complete).transform_calculate(
    noise='random() - 0.5',  # Horizontal position in the facet
    noisy_w='datum.weight + random() - 0.5',  # Weight with vertical jitter
).mark_circle(size=4).encode(
    x=alt.X('noise').type('quantitative').axis(None).title(None),
    y=alt.Y('noisy_w:Q').scale(type='log', base=2).title('Weight (g)'),
    color=alt.Color('species_id').legend(None),
    column=alt.Column('species_id').title('Weights by species'),
).configure_mark(
    opacity=0.25,  # Opacity of the circles drawn by mark_circle()
).configure_facet(
    spacing=0,     # Remove the spacing between facets
).configure_view(
    stroke=None,   # Remove the box around each facet
).properties(
    width=18,      # Width of each facet, in pixels
)
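The two `transform_calculate()` expressions above add jitter: in Vega expressions, `random()` returns a uniform value in [0, 1), so `random() - 0.5` falls in [-0.5, 0.5). The following sketch reproduces the same idea with NumPy and pandas on a made-up table (the `toy` DataFrame and the seed are illustrative assumptions). It shows how identical weights get spread apart so that overlapping points become visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # Seeded so the sketch is reproducible

# Hypothetical toy data with repeated weights
toy = pd.DataFrame({'weight': [40, 40, 40, 41, 41]})

# Horizontal position: uniform noise in [-0.5, 0.5)
toy['noise'] = rng.random(len(toy)) - 0.5

# Vertical position: the weight shifted by at most half a unit
toy['noisy_w'] = toy['weight'] + rng.random(len(toy)) - 0.5

print(toy)
```

Because the shift is at most half a unit, a jittered point still rounds back to its original weight, and the overall shape of the distribution is preserved.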

### Exercise - Distributions
For this exercise, we want to display the
full species names on the X axis of a boxplot.

(Preparation) Compute the left join of `surveys_complete`
with the species details in `species.csv`. (3 min.)

In [None]:
species_df = pd.read_csv('../data/species.csv')

left_join = pd.merge(
    left=surveys_complete,
    right=species_df,
    on='species_id',
    how='left'
)

left_join.columns
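As a reminder of what `how='left'` does, here is a minimal sketch on two made-up tables (the `surveys` and `species` DataFrames and their values are hypothetical, chosen only for illustration). Every row of the left table is kept, and rows without a match in the right table get `NaN` in the columns coming from the right.

```python
import pandas as pd

# Hypothetical toy tables, not the real survey data
surveys = pd.DataFrame({
    'record_id': [1, 2, 3],
    'species_id': ['DM', 'PF', 'ZZ'],  # 'ZZ' has no match in species
})
species = pd.DataFrame({
    'species_id': ['DM', 'PF'],
    'species': ['merriami', 'flavus'],
})

joined = pd.merge(
    left=surveys,
    right=species,
    on='species_id',
    how='left',
)
print(joined)
```

The result has the same three rows as `surveys`, and the `species` value of the unmatched `'ZZ'` row is missing.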

Create the boxplot:
* The full species names on the X axis, with the label "Species"
* The weights on the Y axis, with a logarithmic
  scale in base 2 and with the label "Weight (g)"
* One color for each species identifier
* A title for the chart

(6 min.)

In [None]:
alt.Chart(left_join).mark_boxplot().encode(
    x=alt.X('species').title('Species'),
    y=alt.Y('weight').scale(type='log', base=2).title('Weight (g)'),
    color=alt.Color('species_id').legend(None),
).properties(
    title='Distribution of weights by species',
)

## Key points
* **Temporary columns**
  * `chart.transform_calculate(col2='datum.col1 + random()-0.5')`
* **Choosing a type of chart**
  * `chart.mark_boxplot()`
  * `chart.mark_circle(size=N)`
* **Assigning data fields to encoding channels**
  * `chart.encode(...)`
  * Encoding channels:
    * `color=alt.Color('field_name_for_colors')`
      * `.legend(None)`
    * `column=alt.Column('field_name_for_facet_columns')`
* **Other properties of the chart**
  * `chart.configure_facet(spacing=0)`
  * `chart.configure_view(stroke=None)`
  * `chart.properties(width=N)`