# Library pageviews data

This is data from Google Analytics on a subset of library item web page views from 2012.

The data set has been heavily reduced in size so that we can focus on learning how to explore data with Altair. A version that works with a roughly 20x larger subset is in `20_LibraryPageviews.ipynb` in this same repository.

In [None]:
import pandas as pd
import altair as alt
from altair import datum

## Read in library web site page views data


In [None]:
pageviews = pd.read_csv('data/pageviews_2012_small.csv')
pageviews.head()

In [None]:
len(pageviews)

In [None]:
pageviews.dtypes

## Change timestamp to a true date and time data type

In [None]:
pageviews['timestamp'] = pd.to_datetime(pageviews.timestamp)
pageviews.dtypes
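
As a minimal sketch of what the conversion does, the example below runs `pd.to_datetime` on a tiny made-up frame. The rows are hypothetical (the real CSV values are not shown here), and `errors="coerce"` is an optional extra that is not used in the cell above.

```python
import pandas as pd

# Hypothetical sample rows; the real CSV's values are not shown in this notebook.
sample = pd.DataFrame({
    "timestamp": ["2012-01-15 09:00:00", "2012-01-15 14:30:00", "not a date"],
    "visitors": [3, 5, 1],
})

# Before conversion the column holds plain strings (a generic text dtype).
print(sample.dtypes)

# errors="coerce" turns unparseable strings into NaT instead of raising an error.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], errors="coerce")
print(sample.dtypes)

# Count rows that failed to parse so they can be inspected.
n_bad = int(sample["timestamp"].isna().sum())
print(f"{n_bad} unparseable timestamp(s)")
```

Checking for `NaT` values after the conversion is a cheap way to catch malformed rows before they silently drop out of a time-based chart.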

## MaxRowsError

Altair raises a `MaxRowsError` when a chart's data set has more rows than its limit (5,000 by default). You can remove that limit, but it is not a great idea: every output pane in the notebook embeds a Vega-Lite JSON specification (text) that includes the data, so you end up with huge notebooks!

Instead, we can specify that each output should just refer to a JSON file on your local drive, so that every plot loads its data from that file.

See the Altair tutorial notebook `03-Binning-and-aggregation` for more details.

In [None]:
alt.data_transformers.enable('json')
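
To get a feel for why embedding the data inflates a notebook, here is a rough sketch using only the standard library. The two dictionaries are hand-built imitations of the general shape of a Vega-Lite specification (inline `values` versus a `url` reference); they are not Altair's actual output, and the rows and file name are made up.

```python
import json

# Hypothetical rows standing in for the pageviews table.
rows = [{"country": "United States", "visitors": i % 7} for i in range(10_000)]

# Inline data: the rows travel inside the chart specification itself.
inline_spec = {
    "data": {"values": rows},
    "mark": "bar",
    "encoding": {
        "x": {"aggregate": "sum", "field": "visitors", "type": "quantitative"},
        "y": {"field": "country", "type": "nominal"},
    },
}

# Data by reference: the specification only points at a file.
# The file name is invented for illustration.
url_spec = dict(inline_spec, data={"url": "altair-data-example.json"})

inline_size = len(json.dumps(inline_spec))
url_size = len(json.dumps(url_spec))
print(f"inline: {inline_size:,} characters, by reference: {url_size:,} characters")
```

The inline version grows with the number of rows and is repeated for every chart, while the by-reference version stays the same size no matter how large the data file is.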

## Simple summary charts

### Bars of visitors per country

In [None]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = 'country'
)

### Library of Congress Categories by country

In [None]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = 'lcc_description',
    color = 'country'
)

### Sort LCC category by number of visitors

Also add a grid line for each category (running along the numeric axis) and use a log scale for the visitor counts.

In [None]:
alt.Chart(pageviews).mark_point().encode(
    x = alt.X('sum(visitors)', scale=alt.Scale(type='log')),
    y = alt.Y('lcc_description',
            sort=alt.EncodingSortField(
                field="visitors",
                op="sum",
                order="descending")
    ),
    color = 'country',
    shape = 'country'
).configure_axisY(grid=True).configure_axisX(grid=False)
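
If you want to double-check the order the chart should show, the same ranking can be computed in pandas. This is only a sketch on invented rows; the category names and counts are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the category names and counts are invented.
sample = pd.DataFrame({
    "lcc_description": ["History", "Science", "History", "Music", "Science", "History"],
    "country": ["United States", "Canada", "Canada",
                "United States", "United States", "United Kingdom"],
    "visitors": [10, 4, 6, 3, 9, 2],
})

# Same ordering rule as the EncodingSortField: total visitors, descending.
category_order = (
    sample.groupby("lcc_description")["visitors"]
    .sum()
    .sort_values(ascending=False)
)
print(category_order)
```

An explicit list such as `list(category_order.index)` can also be passed as the `sort` argument of `alt.Y` if you would rather fix the order yourself than have it computed in the chart.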

## TimeUnit transform: Visitors continuous time line

#### A sum of visitors per day shows some weekly and seasonal detail

The [valid timeunit entries](https://altair-viz.github.io/user_guide/transform.html#timeunit-transform) are listed in the documentation for a type of data transform called a **TimeUnit Transform**.

In [None]:
alt.Chart(pageviews).mark_line().encode(
    x = 'yearmonthdate(timestamp):T',
    y = 'sum(visitors):Q'
).properties(
    width=600,
    height=150
)
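
For comparison, a roughly equivalent daily total can be computed in pandas by truncating each timestamp to its day. The rows below are hypothetical, and this sketch ignores any time zone handling that happens in the browser.

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2012-03-01 08:15", "2012-03-01 17:40", "2012-03-02 09:05", "2012-03-04 11:30",
    ]),
    "visitors": [2, 3, 4, 1],
})

# Truncate each timestamp to midnight of its day, then total the visitors.
daily = sample.groupby(sample["timestamp"].dt.floor("D"))["visitors"].sum()
print(daily)
```

Days without any rows (2012-03-03 in this sample) simply do not appear in the result, so a line drawn through these totals connects straight across the gap rather than dropping to zero.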

## Just Durham data using transform_filter()

**You can see the school holidays more clearly in the Durham pageviews** as subtle drops in the number of visitors.

*Note: `datum` is a way to reference the value of a field in each individual row, rather than a whole column.*

In [None]:
alt.Chart(pageviews).mark_line().encode(
    x = 'yearmonthdate(timestamp):T',
    y = 'sum(visitors):Q',
    tooltip = 'yearmonthdate(timestamp):T'
).transform_filter(
    datum.city == 'Durham'
).properties(
    width=600,
    height=150
)

## TimeUnit transform: Visitors by hour of day

You can aggregate to various levels of time using "timeunit transform" functions. See [timeunit valid entries](https://altair-viz.github.io/user_guide/transform.html#timeunit-transform) for all the choices and more details. Change from temporal `T` data type to ordinal `O` (and `mark_bar`) if you want to make a more standard-looking bar chart.

In [None]:
alt.Chart(pageviews).mark_line().encode(
    x = 'hours(timestamp):T',
    y = 'sum(visitors):Q'
)

### Canada, UK time shift

If we filter down to just Canadian and UK visitors, and color by country, we can see a shift in the peak viewing time of day corresponding to their respective time zones.

In [None]:
alt.Chart(pageviews).mark_line().encode(
    x = 'hours(timestamp):T',
    y = 'sum(visitors):Q',
    color = 'country'
).transform_filter(
    (datum.country == 'Canada') | (datum.country == 'United Kingdom')
)
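
To put numbers on the shift, you could compute the busiest hour per country in pandas. The sketch below uses hypothetical rows and mirrors the filter and the `hours(timestamp)` aggregation from the chart above.

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada",
                "United Kingdom", "United Kingdom", "United Kingdom", "France"],
    "timestamp": pd.to_datetime([
        "2012-05-01 14:10", "2012-05-01 15:20", "2012-05-02 15:45",
        "2012-05-01 09:30", "2012-05-01 10:05", "2012-05-02 10:50",
        "2012-05-01 12:00",
    ]),
    "visitors": [2, 5, 4, 1, 6, 3, 7],
})

# Keep only the two countries of interest (the pandas analogue of transform_filter).
subset = sample[sample["country"].isin(["Canada", "United Kingdom"])]

# Total visitors per country and hour of day (the analogue of hours(timestamp)).
hourly = (
    subset.assign(hour=subset["timestamp"].dt.hour)
    .groupby(["country", "hour"], as_index=False)["visitors"]
    .sum()
)

# For each country, keep the hour with the largest total.
peak_hour = (
    hourly.sort_values("visitors", ascending=False)
    .drop_duplicates("country")
    .set_index("country")["hour"]
)
print(peak_hour)
```

The difference between the two peak hours gives a quick estimate of the offset you see between the two lines in the chart.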

## Weekday vs hour of day heatmap

A heatmap is a compact way to view typical patterns throughout the day, and how that varies by weekday.

*Notice here that we change from a Temporal (`T`) data type to Ordinal (`O`) to get discrete marks.*

In [None]:
alt.Chart(pageviews).mark_rect().encode(
    x= 'hours(timestamp):O',
    y= 'day(timestamp):O',
    color='sum(visitors)'
)
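
The grid behind the heatmap is essentially a pivot table of summed visitors, with weekdays as rows and hours as columns. A minimal pandas sketch, using hypothetical rows, makes that structure explicit:

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2012-06-04 09:00",  # Monday
        "2012-06-04 09:30",  # Monday
        "2012-06-04 14:00",  # Monday
        "2012-06-05 09:15",  # Tuesday
        "2012-06-09 20:45",  # Saturday
    ]),
    "visitors": [1, 2, 3, 4, 5],
})

# Derive the two discrete keys used by the heatmap.
labelled = sample.assign(
    weekday=sample["timestamp"].dt.day_name(),
    hour=sample["timestamp"].dt.hour,
)

# One cell per (weekday, hour); empty combinations are filled with 0.
grid = labelled.pivot_table(
    index="weekday", columns="hour", values="visitors", aggfunc="sum", fill_value=0
)
print(grid)
```

Note that this table sorts the weekday names alphabetically, so it is useful for checking cell values rather than for reproducing the row order of the chart.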

# Maps

### Mapping functionality isn't as advanced in Vega-Lite

- For now it doesn't allow zooming and interaction beyond tooltips.
- Not as many map projections are supported in Vega-Lite as in Vega.
- The clipping doesn't seem to work properly.
- You can still do some world and US filled maps and symbol maps, though.

We would typically load the country boundary shapes (a TopoJSON file) through `vega_datasets`, but not all of your machines have that installed, so we'll just load the file directly from the URL that package refers to.

In [None]:
# from vega_datasets import data
# countries = alt.topo_feature(data.world_110m.url, feature='countries')

# Since we don't all have vega_datasets installed
countries = alt.topo_feature('https://vega.github.io/vega-datasets/data/world-110m.json', 
                             feature='countries')

## Visitors per country symbol map (version 1)

We'll use the mean latitude and longitude from the data as the symbol locations for now and see how that works out.

*Note again that with groupby you need to do all of your aggregation in the transform_aggregate() section! We also need to project both the data and the country boundary shapes.*


In [None]:
proj_type = 'mercator'
width = 600
height = 500
clip_extent = [[0,0.075*height],[width,0.8*height]]

background = alt.Chart(countries).mark_geoshape(
    fill = 'lightgray',
    stroke = 'white'
).project(
    type = proj_type,
    clipExtent = clip_extent
).properties(
    width = width,
    height = height
)

points = alt.Chart(pageviews).mark_circle().encode(
    longitude = 'mean_lon:Q',
    latitude = 'mean_lat:Q',
    size = 'sum_visitors:Q',
    tooltip = 'country'
).transform_aggregate(
    sum_visitors = 'sum(visitors)',
    mean_lat = 'mean(latitude)',
    mean_lon = 'mean(longitude)',
    groupby = ['country']
).project(
    type = proj_type,
    clipExtent = clip_extent
).properties(
    width = width,
    height = height
)

background + points
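
The `transform_aggregate()` step above corresponds closely to a pandas `groupby` with named aggregations. The sketch below uses hypothetical coordinates and counts to show what one aggregated row per country looks like.

```python
import pandas as pd

# Hypothetical rows; coordinates and counts are invented.
sample = pd.DataFrame({
    "country": ["Canada", "Canada", "France", "France", "France"],
    "latitude": [45.0, 49.0, 48.0, 44.0, 46.0],
    "longitude": [-75.0, -123.0, 2.0, 6.0, 4.0],
    "visitors": [3, 2, 1, 4, 5],
})

# One output row per country, with the same three aggregates as the chart.
per_country = sample.groupby("country", as_index=False).agg(
    sum_visitors=("visitors", "sum"),
    mean_lat=("latitude", "mean"),
    mean_lon=("longitude", "mean"),
)
print(per_country)
```

Because the mean is taken over the rows in the data, each symbol lands wherever that country's recorded page views cluster, which is not necessarily the geographic center of the country.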

## Looking up average Lat/Lon for countries

We'll read in a CSV file with the average lat/lon values per country.

In [None]:
latlon = pd.read_csv('data/average-latitude-longitude-countries.csv')
latlon.head()

## Visitors per country symbol map (version 2)

### Doing data join within Altair using transform_lookup()

For some data sources (e.g. data available at a URL, or data that is streaming), it is desirable to have a means of joining data without having to download it for pre-processing in Pandas.

In [None]:
proj_type = 'mercator'
width = 600
height = 500
clip_extent = [[0,0.075*height],[width,0.8*height]]

background = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white'
).project(
    type=proj_type,
    clipExtent=clip_extent
).properties(
    width=width,
    height=height
)

points = alt.Chart(pageviews).mark_circle().encode(
    longitude = 'Longitude:Q',
    latitude = 'Latitude:Q',
    size = 'sum_visitors:Q',
    tooltip = 'country'
).transform_aggregate(
    sum_visitors='sum(visitors)',
    groupby=['country']
).transform_lookup(
    lookup = "country",
    from_ = alt.LookupData(data=latlon, key='Country', fields=['Latitude','Longitude'])
).project(
    type=proj_type,
    clipExtent=clip_extent
).properties(
    width=width,
    height=height
)

background + points
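
When the data does fit in memory, a left join in pandas is a handy way to see which countries a lookup like the one above would fail to match. The frames below are hypothetical stand-ins: the coordinates are rough placeholder values rather than values from the CSV, and `Atlantis` is included only to force a miss.

```python
import pandas as pd

# Hypothetical per-country totals, as produced by the aggregate step.
totals = pd.DataFrame({
    "country": ["Canada", "France", "Atlantis"],
    "sum_visitors": [5, 10, 1],
})

# Hypothetical lookup table using the same column names as the lat/lon CSV.
centroids = pd.DataFrame({
    "Country": ["Canada", "France"],
    "Latitude": [60.0, 46.0],
    "Longitude": [-95.0, 2.0],
})

# Left join: keep every row of totals, filling in coordinates where the key matches.
joined = totals.merge(centroids, how="left", left_on="country", right_on="Country")

# Countries with no match end up with missing coordinates.
unmatched = joined.loc[joined["Latitude"].isna(), "country"].tolist()
print(joined)
print("no coordinates for:", unmatched)
```

Country names that are spelled differently in the two sources would show up in `unmatched`, and would most likely be missing from the symbol map as well.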