# Altair Express Tutorial

Altair Express is a high-level data visualization library that provides the ability to quickly create statistical charts. Modeled after the seaborn library, Altair Express (abbreviated alx) allows you to rapidly create charts and add on interactions in a single line of code.

Today, you'll be testing an alpha version of Altair Express. We'll walk you through the main concepts.

First we'll import our libraries of interest:

In [2]:
import altair_express as alx 
import altair as alt  # module names cannot contain hyphens; assumes the altair-alx-version package installs as "altair"
from vega_datasets import data 
import pandas as pd 


We'll be working with a pared-down gapminder dataset to explore the Altair Express API.

In [3]:
df = data.gapminder()
df.head()

Unnamed: 0,year,country,cluster,pop,life_expect,fertility
0,1955,Afghanistan,0,8891209,30.332,7.7
1,1960,Afghanistan,0,9829450,31.997,7.7
2,1965,Afghanistan,0,10997885,34.02,7.7
3,1970,Afghanistan,0,12430623,36.088,7.7
4,1975,Afghanistan,0,14132019,38.438,7.7


## A High-level Library for Data Visualization

Now that we've loaded our data it's time to begin exploring it. However, there are many different ways we can visualize this data. 

Altair Express provides functions to quickly create statistical graphics. 

The **profile** function creates univariate charts that help visualize each column in our dataframe.


In [None]:
columns = list(df.columns)  # list of column names: ["year", "country", ...]
alx.profile(df, vars=columns)

The profile view helps us see some general trends in our data:
- life expectancy seems to be trending up
- fertility seems to be trending down
- population seems to be *logarithmically distributed*
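
For a numeric cross-check of what the profile charts show, plain pandas summaries can be computed per column. The sketch below is not part of Altair Express. It rebuilds only the five Afghanistan rows from the `df.head()` preview above, so the numbers describe that small sample rather than the full dataset.

```python
import pandas as pd

# The five rows shown in the df.head() preview (not the full gapminder data).
preview = pd.DataFrame({
    "year": [1955, 1960, 1965, 1970, 1975],
    "country": ["Afghanistan"] * 5,
    "cluster": [0] * 5,
    "pop": [8891209, 9829450, 10997885, 12430623, 14132019],
    "life_expect": [30.332, 31.997, 34.02, 36.088, 38.438],
    "fertility": [7.7] * 5,
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = preview.describe()

# Distinct values per column, a rough hint at categorical vs. continuous.
distinct_counts = preview.nunique()

print(numeric_summary.loc[["mean", "min", "max"], ["pop", "life_expect"]])
print(distinct_counts)
```

On the full dataframe, the same two calls (`df.describe()` and `df.nunique()`) give a quick tabular companion to the charts.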

Let's derive a new column in our dataframe that might help us understand population better and visualize it using **hist**.

In [None]:
import numpy as np 

df['log_pop'] = np.log(df['pop'])
alx.hist(df, x='log_pop')
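
To see why the log transform helps, the sketch below compares skewness before and after taking logs. It uses a synthetic log-normal sample (an assumption for illustration, not the gapminder `pop` column) and pandas' `Series.skew()` rather than any Altair Express call.

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed values standing in for a
# population-like column. These are NOT gapminder values; the
# distribution parameters are arbitrary.
rng = np.random.default_rng(seed=0)
values = pd.Series(rng.lognormal(mean=16.0, sigma=1.5, size=1000))

log_values = np.log(values)  # same transform as df['log_pop'] above

raw_skew = values.skew()      # large and positive: long right tail
log_skew = log_values.skew()  # close to zero: roughly symmetric

print(f"skew before log: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A histogram of the raw values would pile most observations into the first few bins, while the log-transformed values spread across the bins more evenly.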

While visualizations of single columns help us understand the distributions of our variables, how do the variables relate to each other?

For example, how does 'life_expect' relate to 'fertility'?

To answer this, we can use a **scatterplot**:

In [None]:
alx.scatterplot(df,x='life_expect',y='fertility')

From the scatterplot alone, it can be difficult to get a sense of the distributions of life_expect and fertility.

Let's add marginal histograms to the sides via a **jointplot**:

In [None]:
alx.jointplot(df,x='life_expect',y='fertility')

It seems that countries with a higher life expectancy tend to have fewer children.

Let's dig into life expectancy by splitting the data by cluster and seeing how it has changed over time.

To do this, we'll use a **lineplot** showing the average life expectancy of the different country clusters (grouped by geographic region).

In [22]:
alx.lineplot(df, x='year', y='average(life_expect)', color='cluster')
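
The `average(life_expect)` shorthand asks for one mean per `year` within each `color` group. As a rough pandas equivalent (a sketch on made-up toy values, not the gapminder data and not Altair Express internals), the same aggregation can be written as a `groupby`:

```python
import pandas as pd

# Toy rows with invented values, used only to illustrate the aggregation.
toy = pd.DataFrame({
    "year":        [2000, 2000, 2000, 2000, 2005, 2005, 2005, 2005],
    "cluster":     [0,    0,    1,    1,    0,    0,    1,    1],
    "life_expect": [60.0, 64.0, 70.0, 74.0, 62.0, 66.0, 71.0, 77.0],
})

# One mean per (year, cluster) pair: one point per line per x position.
avg = (
    toy.groupby(["year", "cluster"], as_index=False)["life_expect"]
       .mean()
       .rename(columns={"life_expect": "average_life_expect"})
)
print(avg)
```

Each row of `avg` corresponds to one vertex of one colored line in the chart.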

The average life expectancy is generally increasing for most clusters of countries. 

However, in the 1980s, the trend reverses for African countries as HIV and AIDS cause a drop in life expectancy.

To see these countries split out, let's use a **stripplot** to show the life expectancy of each country cluster.

In [25]:
alx.stripplot(df[df['year']==2005],x='life_expect',row='cluster',color='cluster:N')
# NOTE: the ':N' suffix tells alx to treat cluster as a nominal (categorical) field

Next, let's look at the dataset more broadly.

How do the columns correlate with each other?

In [9]:
correlation_matrix = df.corr(numeric_only=True)  # skip string columns such as 'country' (required on pandas >= 2.0)
alx.heatmap(correlation_matrix) # heatmap takes a NxN matrix of numbers

It seems that life expectancy and fertility are strongly negatively correlated.
Conversely, life expectancy and time (year) seem to be positively correlated.

Let's dig into these relationships a bit more with a **pairplot** (scatterplot matrix).
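
As a sketch of the kind of matrix handed to `heatmap`, the example below builds a correlation matrix with pandas on invented toy values (not gapminder data). It passes `numeric_only=True`, which recent pandas versions need when the frame contains string columns such as `country`.

```python
import pandas as pd

# Invented toy values; the string column mimics gapminder's 'country'.
toy = pd.DataFrame({
    "country":     ["A", "A", "A", "B", "B", "B"],
    "year":        [1990, 2000, 2010, 1990, 2000, 2010],
    "life_expect": [55.0, 60.0, 66.0, 70.0, 74.0, 79.0],
    "fertility":   [6.0, 5.1, 4.0, 2.9, 2.2, 1.8],
})

# numeric_only=True skips 'country'; the result is a square, symmetric
# matrix of Pearson correlations with ones on the diagonal.
corr = toy.corr(numeric_only=True)
print(corr.round(2))
```

The row and column labels of `corr` are the numeric column names, which is what makes it an NxN input.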

In [7]:
alx.pairplot(df)

While the examples above use the gapminder dataset, Altair Express can also produce other chart types, such as barplots:

In [None]:
alx.barplot(data.barley(),x='year:N',y='sum(yield)',color='year',column='site')


## Composing Charts

Single charts are helpful for showing relationships between a couple of variables, but what if you want to see multiple visualizations together?

You can do this with compositional operators!

### Layering
[Layering](https://altair-viz.github.io/user_guide/compound_charts.html#layered-charts) overlays two charts on top of each other when they share the same axes. You can layer two charts with the '+' operator.

For these examples, we'll switch to the Seattle Weather dataset, which contains daily weather readings for Seattle.

In [38]:
df = data.seattle_weather()
df

We'll start by creating a filtered dataset that keeps only the summer months (May through August).

In [48]:
df['month'] = df['date'].dt.month
filtered_df=df.query('month > 4 and month < 9')

Then, we'll layer two **countplots** on top of each other. A countplot groups the rows by the values of the provided variable (weather) and visualizes the number of rows in each group.

In [51]:
alx.countplot(df,x='weather') + alx.countplot(filtered_df,x='weather').mark_bar(color='orange')
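
To make the two layers concrete, the sketch below reproduces the month filter and the per-category counts with plain pandas. The rows are invented for illustration (they are not the Seattle weather data), and `value_counts` stands in for whatever aggregation `countplot` performs internally.

```python
import pandas as pd

# Invented rows with the two columns the example relies on.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2012-01-15", "2012-02-10", "2012-05-03",
        "2012-06-21", "2012-07-04", "2012-08-30", "2012-11-11",
    ]),
    "weather": ["rain", "snow", "sun", "sun", "sun", "rain", "rain"],
})

toy["month"] = toy["date"].dt.month
summer = toy.query("month > 4 and month < 9")  # May through August

all_counts = toy["weather"].value_counts()        # bars of the first layer
summer_counts = summer["weather"].value_counts()  # bars of the orange layer

print(all_counts)
print(summer_counts)
```

Because the summer rows are a subset of all rows, each orange bar can never be taller than the bar it is drawn over.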

### Vertical Concatenation
[Vertical Concatenation](https://altair-viz.github.io/user_guide/compound_charts.html#vconcat-chart) stacks the charts one above the other. You can vertically concatenate two charts with the '&' operator.


In [10]:
alx.hist(df,x='precipitation') & alx.hist(df,x='temp_max')

### Horizontal Concatenation
[Horizontal Concatenation](https://altair-viz.github.io/user_guide/compound_charts.html#horizontal-concatenation) arranges the charts side by side. You can horizontally concatenate two charts with the '|' operator.

In [12]:
alx.scatterplot(df, x='precipitation', y='temp_max') | alx.hist(df, y='temp_max', height=200, width=50)