# Library pageviews data

This is data from Google Analytics on a subset of library item web page views from 2012.

The data has been heavily reduced in size so that we can focus on learning how to explore it with Altair. A version that works with a subset about 20 times larger is in `20_LibraryPageviews.ipynb` in this same repository.

In [8]:
import pandas as pd
import altair as alt
from altair import datum

# Work around Altair's MaxRowsError (by default, charts that embed more than 5,000 rows are rejected)
import vegafusion as vf
vf.enable()

vegafusion.enable(mimetype='html', row_limit=10000, embed_options=None)

## Read in library web site page views data

The data documents web views of items in the Duke Library catalogue. 

- Each row records the visitors to a particular item page within one hour during 2012.
- Identifying details such as the item URL have been stripped out, but a Library of Congress Category (LCC) has been retained for each item.
- The data also includes the rough location of the visitor and how many people from that location viewed the page during that hour.

In [4]:
pageviews = pd.read_csv('data/pageviews_2012_small.csv', parse_dates=['timestamp'])
pageviews.head()

| Unnamed: 0 | timestamp | visitors | city | region | country | longitude | latitude | lcc_description |
|---|---|---|---|---|---|---|---|---|
| 0 | 2012-01-01 16:00:00 | 1 | Montreal | Quebec | Canada | -73.5542 | 45.5089 | Military Science |
| 1 | 2012-01-01 13:00:00 | 1 | Durham | North Carolina | United States | -78.8986 | 35.994 | History Of The Americas |
| 2 | 2012-01-01 10:00:00 | 1 | Edinburgh | Scotland | United Kingdom | -3.1875 | 55.9502 | Social Sciences |
| 3 | 2012-01-01 18:00:00 | 1 | Plymouth | England | United Kingdom | -4.1427 | 50.3704 | |
| 4 | 2012-01-01 09:00:00 | 1 | Edinburgh | Scotland | United Kingdom | -3.1875 | 55.9502 | Social Sciences |
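
As a quick check on what `parse_dates` does, here is a minimal sketch that reads three of the rows shown above from an in-memory string instead of the real file. The `sample_csv` and `sample` names are illustrative only, and the `Unnamed: 0` column is left out for brevity.

```python
import io

import pandas as pd

# Three rows copied from the preview above; the last one has no lcc_description
sample_csv = io.StringIO(
    "timestamp,visitors,city,region,country,longitude,latitude,lcc_description\n"
    "2012-01-01 16:00:00,1,Montreal,Quebec,Canada,-73.5542,45.5089,Military Science\n"
    "2012-01-01 13:00:00,1,Durham,North Carolina,United States,-78.8986,35.994,History Of The Americas\n"
    "2012-01-01 18:00:00,1,Plymouth,England,United Kingdom,-4.1427,50.3704,\n"
)

# parse_dates converts the timestamp strings into datetime values
sample = pd.read_csv(sample_csv, parse_dates=["timestamp"])

# Datetime accessors such as .dt.hour are available on the parsed column
hours = sample["timestamp"].dt.hour.tolist()

# The empty trailing field is read as a missing value
missing_lcc = int(sample["lcc_description"].isna().sum())
```

The timestamp column parses to a datetime dtype, so accessors such as `.dt.hour` work, and the row without an `lcc_description` is read as a missing value.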


---

## Sum of visitors within Library of Congress Categories by country

If we want to see how many items were viewed per LCC category, a bar chart is a good starting place.

Since the category labels are long, they're easier to read when the bars run horizontally, with the categories on the Y axis.

We can also split these bars by country using color to give us a general sense of the split, *as long as we remember that it's not easy for people to compare bars that don't have the same baseline*.

In [12]:
# Shorthand strings: Altair infers each field's type from the DataFrame
alt.Chart(pageviews).mark_bar().encode(
    y = 'lcc_description',
    x = 'sum(visitors)',
    color = 'country'
)
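
To make the aggregation concrete, the following sketch computes the same kind of per-category, per-country totals with a pandas `groupby`. The `toy` frame is a hypothetical stand-in built from a few of the rows shown earlier, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for `pageviews`, using a few rows from the preview above
toy = pd.DataFrame({
    "lcc_description": ["Military Science", "History Of The Americas",
                        "Social Sciences", "Social Sciences"],
    "country": ["Canada", "United States", "United Kingdom", "United Kingdom"],
    "visitors": [1, 1, 1, 1],
})

# One total per (category, country) pair: each becomes one colored bar segment
segment_totals = toy.groupby(["lcc_description", "country"])["visitors"].sum()

# One total per category: the full length of each stacked bar
bar_totals = toy.groupby("lcc_description")["visitors"].sum()
```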

In [11]:
# Same chart with explicit type codes: N = nominal, Q = quantitative
alt.Chart(pageviews).mark_bar().encode(
    y = 'lcc_description:N',
    x = 'sum(visitors):Q',
    color = 'country:N'
)

In [18]:
# :O marks country as ordinal, so Altair picks an ordered (sequential) color scheme
alt.Chart(pageviews).mark_bar().encode(
    y = 'lcc_description:N',
    x = 'sum(visitors):Q',
    color = 'country:O'
)

In [14]:
# Sort categories by descending x value, using the sort= argument
alt.Chart(pageviews).mark_bar().encode(
    y = alt.Y('lcc_description:N', sort='-x'),
    x = 'sum(visitors):Q',
    color = 'country:N'
)

In [None]:
# Same sort, using the .sort() method on the channel object
alt.Chart(pageviews).mark_bar().encode(
    y = alt.Y('lcc_description:N').sort('-x'),
    x = 'sum(visitors):Q',
    color = 'country:N'
)

In [None]:
# A channel object can be passed positionally, as long as it comes before the keyword arguments
alt.Chart(pageviews).mark_bar().encode(
    alt.Y('lcc_description:N').sort('-x'),
    x = 'sum(visitors):Q',
    color = 'country:N'
)

In [17]:
# All three channels passed positionally as channel objects
alt.Chart(pageviews).mark_bar().encode(
    alt.Y('lcc_description:N').sort('-x'),
    alt.X('sum(visitors):Q'),
    alt.Color('country:N')
)

### Providing extra arguments for encoding channels

Up until now we have used simple expressions for `x=` and `y=` because all we were feeding Altair was a column or a simple aggregation expression on a column.

**Sometimes you need to give extra arguments to alter the way the axes are displayed. Altair has special objects for the encoding channels, to help you do that.**

- They all start with a capital letter, such as `alt.X`, `alt.Y` and `alt.Color`.
- You reference them through the Altair module, imported here as `alt`.

For example:

`alt.Y('lcc_description', sort='descending')`

In [10]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = alt.Y('lcc_description', sort='descending'),
    color = 'country'
)

## Sorting bars by sum of visitors

**Alphabetical ordering is rarely the best choice for a categorical axis!** 

It's handy for lookup in a long list, but **ordering by a quantity lets us see the patterns in the data more easily, and automatically gives us a ranking of the categories**.

There are two equivalent methods to sort the Y axis:
- `alt.Y('lcc_description', sort='-x')` *(Attribute syntax – traditional)*
- `alt.Y('lcc_description').sort('-x')` ***(Method syntax – introduced in Altair 5, and now preferred)***

In [28]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = alt.Y('lcc_description', sort='-x'),
    color = 'country'
)

## Log scale on number of visitors

Let's introduce a log scale on X, since the distribution of the visitor totals is skewed. That makes it easier to see the small and large values at the same time.

**With a log scale you shouldn't use bars: a log axis has no zero point, so a bar's length has no meaningful baseline. We'll switch to `mark_point()` instead.**

In [12]:
log_symbols_plot = alt.Chart(pageviews).mark_point().encode(
    x = alt.X('sum(visitors)').scale(type='log'),
    y = alt.Y('lcc_description').sort('-x'),
    color = 'country',
    shape = 'country'
)

log_symbols_plot
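
The reasoning about the missing zero point can be illustrated with plain Python. This is a sketch of the arithmetic behind a base-10 log axis, not of Altair's internals; the variable names are made up for the example.

```python
import math

tick_values = [1, 10, 100, 1000, 10000]

# On a base-10 log axis, a value's position is proportional to log10(value),
# so each tenfold increase moves the same distance along the axis.
positions = [math.log10(v) for v in tick_values]

# Zero has no position: log10(0) is undefined, so a bar has no baseline to start from.
try:
    math.log10(0)
    zero_has_position = True
except ValueError:
    zero_has_position = False
```

The positions are evenly spaced, which is why small and large totals can share one axis, while the failed `log10(0)` call is the reason a bar anchored at zero cannot be drawn.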

## Configuring grid lines

Altair's default is to put grid lines on a quantitative axis, but here let's use grids on the categorical Y-axis to help us associate the labels with the points.

- We could pass `axis=alt.Axis(grid=True)` (or `grid=False`) to the individual X and Y encoding channels
- If we wanted to control both X and Y together, we could add `.configure_axis(grid=True)` to the Chart
- Here we'll turn on Y axis grids with `.configure_axisY(grid=True)`

Grid lines in Altair are drawn only at axis tick values, so to control the number of grid lines on the x-axis we need to set the tick values manually.

In [13]:
log_symbols_plot.encode(
    x = alt.X('sum(visitors)')
                .scale(type='log')
                .axis(values=[1,10,100,1000,10000])
).configure_axisY(grid=True)

### Alternative log scales

Sometimes the tick values are easier to read off with **a log scale that's not base 10**, such as base 2, where each tick doubles the previous one.

In [14]:
log_symbols_plot.encode(
    x = alt.X('sum(visitors)')
                .scale(type='log', base=2)
).configure_axisY(grid=True)
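
If explicit tick values are wanted on a base-2 axis, they can be generated rather than typed out. The helper below is a hypothetical sketch (`power_of_two_ticks` is not part of Altair); its result could be passed to `.axis(values=...)` in the same way as the base-10 list used earlier.

```python
import math


def power_of_two_ticks(max_value):
    """Return the powers of two from 1 up to the largest one not exceeding max_value."""
    top_exponent = math.floor(math.log2(max_value))
    return [2 ** k for k in range(top_exponent + 1)]


# Upper bound borrowed from the largest base-10 tick used above
ticks = power_of_two_ticks(10000)
```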