# Library pageviews data

This is data from Google Analytics on a subset of library item web page views from 2012.

It has been severely reduced in size so that we can learn more about how to explore using Altair. You can see a version that deals with about a 20x larger subset in the `20_LibraryPageviews.ipynb` in this same repository.

In [1]:
import pandas as pd
import altair as alt
from altair import datum

## Read in library web site page views data

The data documents web views of items in the Duke Library catalogue. 

- Each row documents visitors to a particular item page within an hour during 2012. 
- Things like the item URL have been stripped out,
- but a Library of Congress Category (LCC) has been retained for the item. 
- The data also includes the rough location of the visitor, and 
- how many people from that location viewed the page during that hour.

In [2]:
pageviews = pd.read_csv('data/pageviews_2012_small.csv')
pageviews.head()

Unnamed: 0,timestamp,visitors,city,region,country,longitude,latitude,lcc_description
0,2012-01-01 16:00:00,1,Montreal,Quebec,Canada,-73.5542,45.5089,Military Science
1,2012-01-01 13:00:00,1,Durham,North Carolina,United States,-78.8986,35.994,History Of The Americas
2,2012-01-01 10:00:00,1,Edinburgh,Scotland,United Kingdom,-3.1875,55.9502,Social Sciences
3,2012-01-01 18:00:00,1,Plymouth,England,United Kingdom,-4.1427,50.3704,
4,2012-01-01 09:00:00,1,Edinburgh,Scotland,United Kingdom,-3.1875,55.9502,Social Sciences


In [3]:
len(pageviews)

10278

### View the Pandas column data types

In [4]:
pageviews.dtypes

timestamp           object
visitors             int64
city                object
region              object
country             object
longitude          float64
latitude           float64
lcc_description     object
dtype: object

## Change timestamp to a true date and time data type

Note that the "timestamp" column has `dtype=object`. In Pandas that usually means the column holds text strings.

**To use the Altair date and time functionality, we need to convert these timestamp strings into a true datetime type (`datetime64[ns]` in Pandas).**

In [5]:
pageviews['timestamp'] = pd.to_datetime(pageviews['timestamp'])
pageviews.dtypes

timestamp          datetime64[ns]
visitors                    int64
city                       object
region                     object
country                    object
longitude                 float64
latitude                  float64
lcc_description            object
dtype: object

### Viewing datetime components

See the [time/date components documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components) for more details about pulling out pieces of a datetime in Pandas.

In [26]:
pageviews['timestamp'].dt.date

0        2012-01-01
1        2012-01-01
2        2012-01-01
3        2012-01-01
4        2012-01-01
            ...    
10273    2012-12-31
10274    2012-12-31
10275    2012-12-31
10276    2012-12-31
10277    2012-12-31
Name: timestamp, Length: 10278, dtype: object
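
Here is a minimal sketch of a few other `.dt` accessors beyond `.dt.date`. The three timestamps are made up for illustration, and `ts` is just a stand-in for `pageviews['timestamp']`.

```python
import pandas as pd

# Tiny stand-in for the pageviews timestamps (values are made up for illustration)
ts = pd.Series(pd.to_datetime([
    '2012-01-01 16:00:00',
    '2012-03-15 09:00:00',
    '2012-12-31 23:00:00',
]))

hours = ts.dt.hour            # hour of day, 0-23
months = ts.dt.month          # month number, 1-12
weekdays = ts.dt.day_name()   # weekday name as text

print(pd.DataFrame({'hour': hours, 'month': months, 'weekday': weekdays}))
```

These are the same kinds of pieces (hour, month, weekday) that the Altair TimeUnit transforms later in this notebook pull out for you.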

---

## MaxRowsError & options

**This is the piece I hate the most about using and teaching Altair!**

We often hit a `MaxRowsError` when we use too much data to create a visualization. The developers built this in to make people aware of how much data they're embedding into each visualization. We can take off the limit on the number of rows, but that's not a great idea because the notebook will have a Vega-Lite JSON specification (text) embedded for every output pane, which also includes the data, so you end up with huge notebooks!

*There are a few different solutions.* See the 
[Altair documentation on how to plot large datasets](https://altair-viz.github.io/user_guide/faq.html#maxrowserror-how-can-i-plot-large-datasets)
or the [Altair tutorial](https://altair-viz.github.io/altair-tutorial/README.html) 
notebook `03-Binning-and-aggregation` section on [how Altair encodes data](https://altair-viz.github.io/altair-tutorial/notebooks/03-Binning-and-aggregation.html#aside-how-altair-encodes-data) for more details.

### 1. Use a smaller data set

Use Pandas to aggregate first so you have less data that you're feeding to Altair. This makes it less handy to do exploration, but Pandas is much faster than Altair, so your visualizations will render much more quickly. See my 
[Pandas 103 workshop video](https://warpwire.duke.edu/w/cd4EAA/) 
and the 
[accompanying repository](https://github.com/emonson/pandas-jupyterlab) 
for lessons on how to do grouping and aggregation with Pandas. Specifically, those lessons in the video use the
[Groupby_basics](https://github.com/emonson/pandas-jupyterlab/blob/master/Groupby_Basics.ipynb)
and
[Groupby_NCexploration](https://github.com/emonson/pandas-jupyterlab/blob/master/Groupby_NCexploration.ipynb) notebooks. 
In that repository are also a couple notebooks on [using Altair to explore an NC NO<sub>2</sub> emissions dataset](https://github.com/emonson/pandas-jupyterlab/blob/master/Altair_NCexplore.ipynb) 
and a
[timing comparison for Altair vs Pandas](https://github.com/emonson/pandas-jupyterlab/blob/master/Altair_UStimings.ipynb) 
on a large dataset with aggregation (50 seconds when using Altair for aggregating vs almost immediate when using Pandas).
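
As a rough sketch of what "aggregate first in Pandas" can look like, here is a group-and-sum on a tiny made-up table. The name `pageviews_demo` and its values are invented for illustration; with the real data you would group the `pageviews` dataframe instead.

```python
import pandas as pd

# Made-up rows standing in for the real pageviews table
pageviews_demo = pd.DataFrame({
    'lcc_description': ['Social Sciences', 'Military Science', 'Social Sciences',
                        'Military Science', 'Social Sciences'],
    'visitors': [1, 2, 1, 3, 1],
})

# One row per category, with visitors already summed
visitors_by_lcc = (
    pageviews_demo
    .groupby('lcc_description', as_index=False)['visitors']
    .sum()
)

print(visitors_by_lcc)
```

The aggregated table has one row per category, so Altair only has to embed and draw those few rows rather than every original record.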

### 2. Altair data server data transformer

I think the 
[Altair data server](https://github.com/altair-viz/altair_data_server) 
is the best solution, but it needs Python 3.5, so won't work on all remote systems, and I hate that it requires an extra install:

```
pip install altair_data_server
```

Then, in the notebook you need to enable the `data_server` data transformer:

```
alt.data_transformers.enable('data_server')
```

When you want to go back to the default behavior for saving out HTML, you need to do:

```
alt.data_transformers.enable('default')
```

### 3. Refer to your data at a URL

If your JSON or CSV data is on a server with a URL, you can just put that URL in place of the dataframe:

```
temps = 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/seattle-temps.csv'
alt.Chart(temps).mark_bar().encode(
    x='month(date):O',
    y='mean(temp):Q'
)
```

### 4. JSON data transformer

The solution I used to recommend was the JSON data transformer. This is still a decent option because it doesn't require any other installs. 

What it does is save the data into a local JSON file, and the visualization in the notebook or saved HTML just refers to that JSON file, so it won't explode the size of your HTML or notebooks. This still works if you're using Jupyter Notebook, but newer versions of JupyterLab broke the behavior because they look for local files in a different directory path. **In JupyterLab you now need to specify an extra `urlpath` argument!** In 
[newer versions of Altair](https://github.com/altair-viz/altair/issues/1867#issuecomment-565824117)
(>4.1) you can specify the proper url path this way:

```
alt.data_transformers.enable('json', urlpath='files')
```

See the section of `01_NatureBarValues.ipynb` on
[Saving to HTML Files](01_NatureBarValues.ipynb#Saving-to-HTML-files) for issues involved in HTML when the JSON data transformer is enabled.

### 5. Turn off or raise the max rows limit

As stated before, this can lead to huge HTML files and notebooks, but it'll work.

```
alt.data_transformers.enable('default', max_rows=None)
```


---

## Saving to HTML files

As we saw in a previous lesson, saving an Altair visualization to an HTML file is very easy – you just have to chain a `.save('filename.html')` command on to the end of the specification.

**Unfortunately, there is one complication – if you've set the `alt.data_transformers.enable('json')` to avoid the MaxRowsError, you might want to turn that off before you save, so you can easily double-click to view your HTML file.** Or, you must at least understand that you'll need to run/have a web server to view your HTML visualizations.

### JSON Data Transformer – effect on saved files

If you have set `alt.data_transformers.enable('json')` to avoid the MaxRowsError, Altair will automatically, behind the scenes, save your data to a JSON file on your local filesystem (hard drive), and just reference that file name instead of embedding the data in the JSON specification of the Vega-Lite visualization.

When you save an HTML file from Altair, the JSON specification it saves along with the file is exactly like it was in your Jupyter Notebook. If you have enabled the 'json' data transformer, the HTML file will reference the same JSON file URL for your data instead of embedding all the data in the JSON (and thus the HTML file).

The problem comes when you try to double-click on the HTML file from your hard drive to view it. 

- If the data is embedded, there is no problem – you will be able to see your static *or interactive* Vega-Lite chart that Altair has generated.
- **If the Vega-Lite chart refers to the data through a local file (URL), the page won't display properly when you double-click on it! Grabbing a local file is considered a [CORS request](https://en.wikipedia.org/wiki/Cross-origin_resource_sharing) in this scenario, and so *for security reasons* it isn't executed.**

### HTML files referring to local JSON files need a server

What you have to do to view any HTML-embedded Vega-Lite visualization that refers to a local JSON file is to put it (and the JSON data file) on a web server and view it through your browser. If you don't have easy access to a web server, you can run a temporary one locally by going into the directory with the files in a terminal on the Mac, or in the Anaconda prompt on Windows, and typing:

`python -m http.server`

That should print out a message saying:

`Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) …`

which means you can go to http://0.0.0.0:8000/ or http://127.0.0.1:8000/ or http://localhost:8000 in your browser and see the current directory. Click on the HTML file you want to view and the visualization should work fine.


---

### For now using the JSON data transformer

In [7]:
alt.data_transformers.enable('json', urlpath='files')

DataTransformerRegistry.enable('json')

---

## EXERCISE 1: Horizontal bars of visitors per country

Make a bar chart of 
- *sum of the number of visitors over all the data* – horizontal (bottom) axis
- *per country* – vertical axis

```
alt.Chart(----).mark_----().encode(
    x = ----,
    y = ----
)
```

- **Copy the code chunk above,** 
- **paste it into the cell below, then** 
- **replace the dashes with correct code.**

---

## Sum of visitors within Library of Congress Categories by country

If we want to see how many items were viewed per LCC category, a bar chart is a good starting place. 

Since the label lines are long, it's easier to read them if they're horizontal.

We can also split these bars by country using color to give us a general sense of the split, *as long as we remember that it's not easy for people to compare bars that don't have the same baseline*.

In [8]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = 'lcc_description',
    color = 'country'
)

### Providing extra arguments for encoding channels

Up until now we have used simple expressions for `x=` and `y=` because all we were feeding Altair was a column or a simple aggregation expression on a column.

**Sometimes you need to give extra arguments to alter the way the axes are displayed. Altair has special objects for the encoding channels, to help you do that.**

- They all start with capital letters, and 
- you have to reference them starting with the altair module.

e.g. 

`alt.Y('lcc_description', sort='descending')`

In [9]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = alt.Y('lcc_description', sort='descending'),
    color = 'country'
)

## Sorting bars by sum of visitors

**Alphabetical ordering is rarely the best choice for a categorical axis!** 

It's handy for lookup in a long list, but **ordering by a quantity lets us see the patterns in the data more easily, and automatically gives us a ranking of the categories**.

The object we use for sorting an encoding field is `alt.EncodingSortField()`, which is unfortunately a long name. We give it 

- the field to sort by – *(e.g. `field='visitors'`)*
- an aggregation function for that field – *(e.g. `op='sum'` or `op='mean'`)*
- which order to sort – *(`order='ascending'` or `order='descending'`)*

In [10]:
alt.Chart(pageviews).mark_bar().encode(
    x = 'sum(visitors)',
    y = alt.Y('lcc_description',
            sort=alt.EncodingSortField(
                field='visitors',
                op='sum',
                order='descending'
            )
    ),
    color = 'country'
)

## Log scale on number of visitors

Let's introduce a log scale on X, since the distribution of x values is a bit skewed. That way it'll be easier to see the small and large values at the same time.

**With a log scale you shouldn't use bars, since there's no zero-point, so we'll switch to `mark_point()`**

In [11]:
log_symbols_plot = alt.Chart(pageviews).mark_point().encode(
    x = alt.X('sum(visitors)', scale=alt.Scale(type='log')),
    y = alt.Y('lcc_description',
            sort=alt.EncodingSortField(
                field="visitors",
                op="sum",
                order="descending"
            )
    ),
    color = 'country',
    shape = 'country'
)

log_symbols_plot

## Configuring grid lines

Altair's default is to put grid lines on a quantitative axis, but here let's use grids on the categorical Y-axis to help us associate the labels with the points.

- We could add an `axis=alt.Axis(grid=True)` or `grid=False` to the individual encoding X and Y fields
- If we wanted to control both X and Y together, we could add a `.configure_axis(grid=True)` to the Chart
- Here we'll turn on Y axis grids with  `.configure_axisY(grid=True)`

The grid lines in Altair can only go where there are axis values, so if we wanted to control the number of grid lines for the x-axis, we would need to manually set the values.

In [12]:
log_symbols_plot.encode(
    x = alt.X('sum(visitors)', 
              scale=alt.Scale(type='log'), 
              axis=alt.Axis(values=[1,10,100,1000,10000])
             )
).configure_axisY(grid=True)

### Alternative log scales

Sometimes the numbers and lookup work out better if you do **a log scale that's not base 10.**

In [13]:
log_symbols_plot.encode(
    x = alt.X('sum(visitors)',
              scale=alt.Scale(type='log', base=2)
             )
).configure_axisY(grid=True)

---

# TimeUnit transform: 

## Visitors continuous time line

**For time series, it's often useless to view our data in the original fine event detail!** 

If we look at timestamps and visitors directly, it looks like a bunch of very closely spaced ones with a few twos and threes, etc.

In [14]:
alt.Chart(pageviews.iloc[5000:6000,:]).mark_point().encode(
    x = 'timestamp:T',
    y = 'visitors:Q'
).properties(
    width=400,
    height=150
)

## Time aggregation

We'd like to aggregate it on different time scales to see what patterns pop out. We saw some aggregation already with the `sum(visitors)` and sorting, but **Altair has many built-in time-scale aggregation functions, too, called TimeUnit Transforms.**

The documentation lists the [Timeunit valid entries](https://altair-viz.github.io/user_guide/transform.html#timeunit-transform)

### A sum of visitors per month shows some seasonal detail

Here we'll try a couple of different time scales on which to aggregate. First, monthly with `yearmonth()` to see the very coarse-scale trends over the academic year.

In [15]:
alt.Chart(pageviews).mark_line().encode(
    x = 'yearmonth(timestamp):T',
    y = 'sum(visitors):Q'
).properties(
    width=400,
    height=150
)

### Sum of visitors per day adds within-week detail

`yearmonthdate()` retains all of these (the year, month, and date), aggregating to the day level.

In [16]:
alt.Chart(pageviews).mark_line().encode(
    x = 'yearmonthdate(timestamp):T',
    y = 'sum(visitors):Q'
).properties(
    width=600,
    height=150
)
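
For comparison, here is a rough Pandas counterpart of the `yearmonthdate()` plus `sum()` combination: floor each timestamp to its day, then sum the visitors per day. The dataframe `demo` is made up for illustration.

```python
import pandas as pd

# Made-up hourly records for illustration
demo = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2012-01-01 09:00', '2012-01-01 16:00', '2012-01-02 10:00',
    ]),
    'visitors': [1, 2, 1],
})

# Floor each timestamp to midnight of its day, then sum visitors per day
daily = (
    demo
    .groupby(demo['timestamp'].dt.floor('D'))['visitors']
    .sum()
)

print(daily)
```

This is the kind of pre-aggregation you might do in Pandas when the full dataset is too large to hand to Altair directly.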

## Just Durham data using transform_filter()

**You can see the school holidays more clearly in the Durham pageviews** as subtle drops in the number of visitors.

*Note: **datum is just a way to reference the data elements in each row** instead of a whole column*

*Note also, that if we have lots of data, these types of filtering operations are faster in Pandas than in Altair, so you can pre-filter your data before feeding it to Altair.*

In [17]:
alt.Chart(pageviews).mark_line().encode(
    x = 'yearmonthdate(timestamp):T',
    y = 'sum(visitors):Q',
    tooltip = 'yearmonthdate(timestamp):T'
).transform_filter(
    datum.city == 'Durham'
).properties(
    width=600,
    height=150
)
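
The note above about pre-filtering in Pandas can be sketched as a boolean mask on the `city` column. The dataframe `demo` is made up for illustration; with the real data you would filter `pageviews` and pass the result to `alt.Chart()` without the `transform_filter()` step.

```python
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({
    'timestamp': pd.to_datetime(['2012-01-01 13:00', '2012-01-01 16:00', '2012-01-02 09:00']),
    'city': ['Durham', 'Montreal', 'Durham'],
    'visitors': [1, 1, 2],
})

# Keep only the Durham rows before handing anything to Altair
durham = demo[demo['city'] == 'Durham']

print(len(demo), '->', len(durham))
```

Pre-filtering also means less data gets embedded in, or saved alongside, each chart.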

## TimeUnit transform: Visitors by hour of day

We looked at visitors per day earlier. Another interesting visualization is hours of the day.

If you wanted to make a bar chart out of this, you'd need to change to `mark_bar()`, as well as change the `T` data type to ordinal `O`.

In [18]:
alt.Chart(pageviews).mark_line().encode(
    x = 'hours(timestamp):T',
    y = 'sum(visitors):Q'
)

### Canada, UK time shift

If we filter down to just Canadian and UK visitors, and color by country, we can see a shift in the peak viewing time of day corresponding to their respective time zones.

In [19]:
alt.Chart(pageviews).mark_line().encode(
    x = 'hours(timestamp):T',
    y = 'sum(visitors):Q',
    color = 'country'
).transform_filter(
    (datum.country == 'Canada') | (datum.country == 'United Kingdom')
)

---

## EXERCISE 2: Weekday vs hour of day heatmap

A heatmap is a compact way to view typical patterns throughout the day, and how that varies by weekday.

Now, put together the earlier examples to create the visualization below. Days of the week are on the vertical axis, hours of the day are on the horizontal, and color is the number of visitors. *(I've removed some of the axis labels to hide hints from you.)*

*Hint: you need to change from a Time data type to Ordinal to get discrete marks*

![goal heatmap](images/LibWeekHoursHeatmap.png)

---

## Facet wrapping

Recently, Altair (or Vega-Lite) added the ability to "wrap" facets, so if you have facets over a lot of categories, you can control how wide or tall the grid of plots gets.

*Note: When I tried to facet with `lcc_description` Altair didn't elide the category text, so the plots were spaced out too wide. See the Pandas documentation on [working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) for an explanation of the truncation I used.*

*I'm going to plot without null LCC since it makes the truncation easier, and it results in a grid of 20 rather than 21.*


In [20]:
col_subset = ['timestamp','visitors','lcc_description']

# Get rid of rows with null LCC descriptions
# (.copy() so adding a column below doesn't trigger a SettingWithCopyWarning)
pv2 = pageviews.loc[pageviews['lcc_description'].notna(), col_subset].copy()

# Split LCC descriptions on a space character, take first two elements and join with space
pv2['short_lcc'] = pv2['lcc_description'].str.split().apply(lambda x: x[:2]).str.join(' ')

alt.Chart(pv2).mark_line().encode(
    x = alt.X('yearmonthdate(timestamp):T', title='date'),
    y = 'sum(visitors):Q'
).properties(
    width=120,
    height=120
).facet(
    facet = 'short_lcc',
    columns = 4
)
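
To see what the truncation step does on its own, here is a small sketch on a few category names. The series `lcc` is a stand-in, and it uses the `.str[:2]` slice as an alternative to the `.apply(lambda ...)` in the cell above; both keep the first two words.

```python
import pandas as pd

# A few category names for illustration
lcc = pd.Series(['History Of The Americas', 'Social Sciences', 'Medicine'])

# Split on whitespace, keep at most the first two words, and rejoin with a space
short = lcc.str.split().str[:2].str.join(' ')

print(short.tolist())
```

Names with one or two words pass through unchanged, so only the long labels get shortened.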