# Library pageviews data

This is data from Google Analytics on a subset of library item web page views from 2012.

It has been severely reduced in size so that we can learn more about how to explore using Altair. You can see a version that deals with about a 20x larger subset in the [20_LibraryPageviews](20_LibraryPageviews.ipynb) notebook in this same repository.

---

*To preserve the mystery, select from the notebook menus*

`Edit -> Clear All Outputs`

---

In [40]:
import pandas as pd
import altair as alt
from altair import datum

---

## VegaFusion renderer to get around the MaxRowsError

If you haven't installed VegaFusion, you need to run in the notebook:

```
!pip install "vegafusion-jupyter[embed]"
```

Or, in your terminal, run the same command without the leading exclamation mark:

```
pip install "vegafusion-jupyter[embed]"
```

**and then quit and restart JupyterLab.** Then run all cells up to here.


In [41]:
import vegafusion as vf
vf.enable()

vegafusion.enable(mimetype='html', row_limit=10000, embed_options=None)

---

## Read in library web site page views data

The data documents web views of items in the Duke Library catalogue. 

- Each row documents visitors to a particular item page within one hour during 2012.
- Details such as the item URL have been stripped out, but a Library of Congress Category (LCC) has been retained for each item.
- The data also includes the rough location of the visitor, and how many people from that location viewed the page during that hour.

In [42]:
pageviews = pd.read_csv('data/pageviews_2012_small.csv',
                       parse_dates=['timestamp'])
pageviews.head()

Unnamed: 0,timestamp,visitors,city,region,country,longitude,latitude,lcc_description
0,2012-01-01 16:00:00,1,Montreal,Quebec,Canada,-73.5542,45.5089,Military Science
1,2012-01-01 13:00:00,1,Durham,North Carolina,United States,-78.8986,35.994,History Of The Americas
2,2012-01-01 10:00:00,1,Edinburgh,Scotland,United Kingdom,-3.1875,55.9502,Social Sciences
3,2012-01-01 18:00:00,1,Plymouth,England,United Kingdom,-4.1427,50.3704,
4,2012-01-01 09:00:00,1,Edinburgh,Scotland,United Kingdom,-3.1875,55.9502,Social Sciences


In [43]:
len(pageviews)

10278

---

## EXERCISE 1: Horizontal bars of visitors per country

Make a bar chart of 
- *sum of the number of visitors over all the data* – horizontal (bottom) axis
- *per country* – vertical axis

![goal bar chart](images/libvisitscountrybars.png)

```
alt.Chart(----).mark_----().encode(
    x = '----:-',
    y = '----:-'
)
```

- **Use the code chunk above as a guide, but try to type rather than copy/paste** 
- **Replace the dashes with correct code.**

---

## Sum of visitors within Library of Congress Categories by country

If we want to see how many items were viewed per LCC, a bar chart is a good starting place. 

Since the label lines are long, it's easier to read them if they're horizontal.

We can also split these bars by country using color to give us a general sense of the split, *as long as we remember that it's not easy for people to compare bars that don't have the same baseline*.

In [44]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = 'lcc_description',
    color = 'country'
)

## Sorting bars by sum of visitors

**Alphabetical ordering is rarely the best choice for a categorical axis!** 

It's handy for lookup in a long list, but **ordering by a quantity lets us see the patterns in the data more easily, and automatically gives us a ranking of the categories**.

There are two equivalent methods to sort the Y axis:
- `alt.Y('lcc_description', sort='-x')` *(Attribute syntax – traditional)*
- `alt.Y('lcc_description').sort('-x')` ***(Method syntax – introduced in Altair 5 – preferred)***

In [45]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = alt.Y('lcc_description').sort('-x'),
    color = 'country'
)
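
As a quick cross-check on the ranking that the sorted axis shows, the same ordering can be computed in Pandas. This is only a sketch on a few made-up rows (the `demo` frame is hypothetical, not the real pageviews data); it sums visitors per category and sorts from largest to smallest, which should match the top-to-bottom order produced by `.sort('-x')`.

```python
import pandas as pd

# Made-up rows standing in for the pageviews table (not the real data)
demo = pd.DataFrame({
    'lcc_description': ['Social Sciences', 'Music', 'Social Sciences',
                        'Military Science', 'Music', 'Social Sciences'],
    'visitors': [1, 1, 1, 1, 1, 1],
})

# Total visitors per category, largest first
ranking = (
    demo.groupby('lcc_description')['visitors']
        .sum()
        .sort_values(ascending=False)
)
print(ranking)
```

Running the same three lines on `pageviews` gives the numbers behind the chart, which is handy when you want exact totals rather than bar lengths.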

## Log scale on number of visitors

Let's introduce a log scale on X, since the distribution of the x values is a bit skewed. That way it'll be easier to see the small and large values at the same time. 

**With a log scale you shouldn't use bars, since there's no zero-point, so we'll switch to `mark_point()`** *(I don't tend to use shape variation, but I wanted to show you how it's done.)*

### Alternative log scales

Sometimes the numbers and lookup work out better if you do **a log scale that's not base 10.**

In [46]:
alt.Chart(pageviews).mark_point().encode(
    x = alt.X('sum(visitors)').scale(type='log', base=2),
    y = alt.Y('lcc_description').sort('-x'),
    color = 'country',
    shape = 'country'
).configure_axisY(grid=True)
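
To see why a base other than 10 can make lookup easier, it helps to count how many "round" reference values each base offers over the same range. The sketch below uses a hypothetical range of 1 to 5000 visitors and a made-up helper, `power_ticks`; the tick marks Vega-Lite actually draws may be a subset of these.

```python
def power_ticks(lo, hi, base):
    """Return the integer powers of `base` that fall within [lo, hi]."""
    ticks = []
    value = 1
    while value <= hi:
        if value >= lo:
            ticks.append(value)
        value *= base
    return ticks

base10 = power_ticks(1, 5000, 10)
base2 = power_ticks(1, 5000, 2)
print(len(base10), base10)
print(len(base2), base2)
```

Base 2 gives many more candidate gridlines across the same span, and each step means "twice as many visitors", which is often easier to reason about than a factor of ten.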

---

# TimeUnit transform: 

## Visitors continuous time points

**For time series, it's often useless to view our data in the original fine event detail!** 

If we look at timestamps and visitors directly, it looks like a bunch of very closely spaced ones with a few twos and threes, etc.

In [47]:
pageviews['visitors'].value_counts()

1    10189
2       76
3       10
4        1
7        1
5        1
Name: visitors, dtype: int64

### Let's plot a small section of visitors vs time

In [48]:
alt.Chart(pageviews.iloc[5250:5350,:]).mark_point().encode(
    x = 'timestamp:T',
    y = 'visitors:Q'
).properties(
    width=400,
    height=150
)

## Time aggregation

We'd like to aggregate it on different time scales to see what patterns pop out. We saw some aggregation already with the `sum(visitors)` and sorting, but **Altair has many built-in time-scale aggregation functions, too, called TimeUnit Transforms.**

**The documentation lists the [valid TimeUnit entries](https://altair-viz.github.io/user_guide/transform/timeunit.html).**

### A sum of visitors per month shows some seasonal detail

Here we'll try a couple of different time scales on which to aggregate. First, monthly with `yearmonth()` to see the very coarse-scale **monthly** trends over the year.

In [49]:
alt.Chart(pageviews).mark_line().encode(
    x = 'yearmonth(timestamp):T',
    y = 'sum(visitors):Q'
).properties(
    width=400,
    height=150
)

### Sum of visitors per day adds within-week detail

`yearmonthdate()` retains the year, month, and date, aggregating to the day level.

In [50]:
alt.Chart(pageviews).mark_line().encode(
    x = 'yearmonthdate(timestamp):T',
    y = 'sum(visitors):Q'
).properties(
    width=600,
    height=150
)
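
If it helps to see what these two time units are doing to the timestamps, here is a rough Pandas equivalent on a few made-up rows (the `demo` frame is hypothetical). It is only an approximation of what Altair computes: each timestamp is truncated to its month or its day, and visitors are summed within each group.

```python
import pandas as pd

# Made-up rows standing in for the pageviews table (not the real data)
demo = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2012-01-01 16:00', '2012-01-01 13:00',
        '2012-01-15 10:00', '2012-02-03 09:00',
    ]),
    'visitors': [1, 1, 2, 1],
})

# Roughly what yearmonth(timestamp) with sum(visitors) computes
monthly = demo.groupby(demo['timestamp'].dt.to_period('M'))['visitors'].sum()

# Roughly what yearmonthdate(timestamp) with sum(visitors) computes
daily = demo.groupby(demo['timestamp'].dt.floor('D'))['visitors'].sum()

print(monthly)
print(daily)
```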

---

### Weekly sums

**I wish Altair had a built-in time unit transform for *weekly* sum, average, etc.** If you want to see an example of how to do that transform using Pandas, and then feed it into Altair, check out the
[WeeklyTimeAggregation](WeeklyTimeAggregation.ipynb) notebook!
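
As a minimal sketch of the general idea (not necessarily the approach taken in that notebook), Pandas `resample('W')` can sum visitors into calendar weeks. The `demo` rows below are made up; by default each week ends on a Sunday and is labelled with that Sunday's date.

```python
import pandas as pd

# Made-up rows standing in for the pageviews table (not the real data)
demo = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2012-01-02 09:00',  # Monday
        '2012-01-04 14:00',  # Wednesday of the same week
        '2012-01-09 10:00',  # Monday of the following week
    ]),
    'visitors': [1, 2, 1],
})

weekly = (
    demo.set_index('timestamp')['visitors']
        .resample('W')   # weekly bins, each ending on a Sunday
        .sum()
        .reset_index()
)
print(weekly)
```

The resulting `weekly` frame has one row per week, so it could be passed to `alt.Chart(weekly)` with `x = 'timestamp:T'` and `y = 'visitors:Q'` without any further aggregation.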

---

## Just Durham data using transform_filter()

- **You can see the school holidays more clearly in the Durham pageviews** as subtle drops in the number of visitors.
- **Just like in Tableau, you're defining what gets *through* the filter**
- *Note: **datum is just a way to reference the data elements in each row** instead of a whole column*
- *Note also, that if we have lots of data, these types of filtering operations are faster in Pandas than in Altair, so you can pre-filter your data before feeding it to Altair. VegaFusion makes this time difference less noticeable, but Pandas is still faster in my trials.*

In [60]:
alt.Chart(pageviews).mark_line().encode(
    x = 'yearmonthdate(timestamp):T',
    y = 'sum(visitors):Q',
    tooltip = 'yearmonthdate(timestamp):T'
).transform_filter(
    datum.city == 'Durham'
).properties(
    width=600,
    height=150
)
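
The pre-filtering mentioned in the note above might look like the sketch below. The `demo` frame is made up for illustration; with the real data you would filter `pageviews` the same way.

```python
import pandas as pd

# Made-up rows standing in for the pageviews table (not the real data)
demo = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2012-01-01 13:00', '2012-01-01 16:00', '2012-01-02 09:00',
    ]),
    'visitors': [1, 1, 2],
    'city': ['Durham', 'Montreal', 'Durham'],
})

# Keep only the Durham rows before handing the data to Altair
durham = demo[demo['city'] == 'Durham']
print(durham)
```

A chart built from `alt.Chart(durham)` would then not need the `transform_filter()` step, since only the rows of interest are passed in.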

## TimeUnit transform: Visitors by hour of day

- Hours of the day are integers counting from 0 at midnight, and **these are categorical, aggregating over the months and years**
- If you wanted to make a bar chart out of this, you'd need to switch to `mark_bar()` and change the data type from temporal `T` to ordinal `O`.

In [52]:
alt.Chart(pageviews).mark_line().encode(
    x = 'hours(timestamp):T',
    y = 'sum(visitors):Q'
)

### Canada, UK time shift

If we filter down to just Canadian and UK visitors, and color by country, we can see a shift in the peak viewing time of day corresponding to their respective time zones.

In [53]:
alt.Chart(pageviews).mark_line().encode(
    x = 'hours(timestamp):T',
    y = 'sum(visitors):Q',
    color = 'country'
).transform_filter(
    (datum.country == 'Canada') | (datum.country == 'United Kingdom')
)

---

## EXERCISE 2: Weekday vs hour of day heatmap

A heatmap is a compact way to view typical patterns throughout the day, and how that varies by weekday.

Now, put together the earlier examples to create the visualization below. Days of the week are on the vertical axis, hours of the day are on the horizontal, and color is the number of visitors. *(I've removed some of the axis labels to hide hints from you.)*

*Hint: you need to change from a Time data type to Ordinal to get discrete marks*

[Timeunit valid entries](https://altair-viz.github.io/user_guide/transform/timeunit.html)

![goal heatmap](images/libweekhoursheatmap.png)

---

## Facet wrapping

Altair (or Vega-Lite) has the ability to "wrap" facets, so if you have facets over a lot of categories, you can control how wide or tall the grid of plots gets.

- *Note: When I tried to facet with `lcc_description` Altair didn't elide the category text, so the plots were spaced out too wide. See the Pandas documentation on [working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) for an explanation of the truncation I used.*
- *I'm going to plot without null LCC since it makes the truncation easier, and it results in a grid of 20 rather than 21.*


### Creating a new column in Pandas with shorter LCC descriptions

In [54]:
col_subset = ['timestamp','visitors','lcc_description']

# Get rid of rows with null LCC descriptions
pv2 = pageviews.loc[pageviews['lcc_description'].notna(), col_subset].copy()

# Split LCC descriptions on a space character, take first two elements and join with space
pv2['short_lcc'] = pv2['lcc_description'].str.split().apply(lambda x: x[:2]).str.join(' ')
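
To see what the truncation does, here it is applied to three sample descriptions, alongside an alternative that uses the `.str` slice accessor instead of `apply`. This is only a sketch on a small made-up Series (`lcc`); both forms should give the same result.

```python
import pandas as pd

lcc = pd.Series(['History Of The Americas', 'Social Sciences', 'Music'])

# Same approach as the cell above: split on whitespace, keep two words, rejoin
short_apply = lcc.str.split().apply(lambda words: words[:2]).str.join(' ')

# Alternative: slice each list of words with the .str accessor
short_slice = lcc.str.split().str[:2].str.join(' ')

print(short_apply.tolist())
print(short_slice.tolist())
```

Descriptions with one or two words pass through unchanged, so only the long labels are shortened.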

### Doing the actual facet wrapped lines plots

In [71]:
alt.Chart(pv2).mark_line().encode(
    x = alt.X('yearmonthdate(timestamp):T').title('date'),
    y = 'sum(visitors):Q'
).properties(
    width=100,
    height=80
).facet(
    facet = 'short_lcc',
    columns = 4
)

---

## *EXTRA*

### Saving as HTML – file sizes with and without VegaFusion

VegaFusion has some other nice functions besides getting us around the MaxRowsError:
- You can access the aggregated data after it's transformed by Altair.
- You can use VegaFusion to save as HTML, and it'll save only the aggregated data rather than the whole dataset.

In [55]:
temp_bar = alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = 'country'
)

#### Altair Save: File size = 2.2 MB

In [56]:
temp_bar.save('temp_bar.html')

#### View the transformed data really fed into the chart with VegaFusion

In [57]:
vf.transformed_data(temp_bar)

Unnamed: 0,country,sum_visitors
0,Canada,2718
1,United States,5241
2,United Kingdom,2428


#### VegaFusion Save: File size = 4 kB

In [58]:
vf.save_html(temp_bar, 'temp_bar_vf.html')
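
The size difference comes from how much data ends up embedded in the file. The sketch below illustrates the principle only, using Pandas and a made-up `demo` frame rather than VegaFusion itself: it compares the JSON size of every row against the JSON size of the per-country sums.

```python
import pandas as pd

# Made-up data: 3000 one-visitor rows spread over three countries
demo = pd.DataFrame({
    'country': ['Canada', 'United States', 'United Kingdom'] * 1000,
    'visitors': [1] * 3000,
})

# Every row serialized, as when the full dataset is embedded
full_json = demo.to_json(orient='records')

# Only the aggregated rows serialized, as when just the bar values are embedded
agg = demo.groupby('country', as_index=False)['visitors'].sum()
agg_json = agg.to_json(orient='records')

print(len(full_json), len(agg_json))
```

The aggregated version holds one record per bar, so its size depends on the number of categories rather than the number of rows in the original data.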