In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import arviz as az
import bambi as bmb

import statsmodels.formula.api as smf

import matplotlib.pyplot as plt

plt.rcParams.update(
    {"mathtext.default": "regular", "figure.dpi": 300, "figure.figsize": (8, 6)}
)

# Bayesian Inference: kudos over time in fan-fiction

In this homework you will investigate how a popularity measure for fan-fiction changes over time. A lot of research uses proxy measures of popularity, and it has often been observed that these measures are themselves biased, which needs to be taken into account.

Archive of Our Own (AO3) is a huge collection of "fan-fiction" (stories written by fans set in universes created by popular novels, comics, or TV series). These texts are shared and read by users of the site, who can award a 'kudos' to a work if they enjoyed it. It is natural to use this as a "proxy measure" of popularity. The problem is that works naturally accrue kudos over time (older works have had more opportunities for people to read them and give them kudos), and so recent works are disadvantaged by measuring popularity this way.

Here you will work through some modelling to find out more about this effect.

In [2]:
rng = np.random.default_rng(seed=42)

## Load and process the data

We will only use a subset of the columns here, but note that this is a much larger dataset than we have worked with so far!

In [3]:
df = pd.read_csv("../datasets/fanfiction/en.csv")[
    [
        "work_id",
        "words",
        "comments",
        "kudos",
        "bookmarks",
        "rating",
        "hits",
        "published",
        "author",
        "fandom",
    ]
].dropna()
df

Unnamed: 0,work_id,words,comments,kudos,bookmarks,rating,hits,published,author,fandom
5,12583840,2780,3.0,45.0,4.0,Explicit,991.0,2017-10-31,ceonsa yusan (casdeanchronicles),bangtansonyeondan | Bangtan Boys | BTS
10,9810938,2318,1.0,32.0,3.0,Teen And Up Audiences,432.0,2017-02-18,hustlehobi (brainstorming),bangtansonyeondan | Bangtan Boys | BTS
11,5388149,1781,1.0,42.0,4.0,Not Rated,347.0,2015-12-09,hustlehobi (brainstorming),The Transformers (IDW Generation One)
14,9320243,6103,3.0,24.0,2.0,General Audiences,429.0,2017-01-14,hustlehobi (brainstorming),bangtansonyeondan | Bangtan Boys | BTS
18,9789800,788,6.0,55.0,5.0,Teen And Up Audiences,608.0,2017-02-16,canhyiShan hwiyeong (Kookienism),SF9 (Band)
...,...,...,...,...,...,...,...,...,...,...
24133,1640759,3288,40.0,3.0,1.0,General Audiences,90.0,2005-12-20,yuletide_archivist,Austin & Murry-O'Keefe Families - Madeleine L'...
24134,1640432,1370,4.0,2.0,1.0,Explicit,94.0,2003-11-17,yuletide_archivist,Chicago (2002)
24136,1624760,1169,3.0,16.0,3.0,Mature,254.0,2007-12-18,yuletide_archivist,Grease (movie)
24139,1627550,15496,34.0,16.0,4.0,Teen And Up Audiences,538.0,2008-12-21,yuletide_archivist,The Devil Wears Prada (2006)


## Convert dates to timestamps

We give you some pre-processing code. The dates in the data are stored as strings, but we need them as numbers. Here we convert them to a "timestamp" based on the time elapsed since January 1, 1970 (it's a computer thing, don't ask), counted in steps of ten seconds. In other words, larger timestamps mean more recent works. Unfortunately, this makes the dates a little hard to read, but we'll live with it (we could fix it with complicated code, but that's not the point right now).
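
As a quick illustration of what such a timestamp looks like, the sketch below converts two of the `published` strings from the table above and then turns them back into readable dates. It assumes timestamps counted in ten-second steps, which is what the pre-processing code produces; the names `published`, `stamps` and `readable` are made up for this example.

```python
import pandas as pd

published = pd.Series(["2017-10-31", "2005-12-20"])

# Whole ten-second steps since 1970-01-01
epoch = pd.Timestamp("1970-01-01")
stamps = (pd.to_datetime(published) - epoch) // pd.Timedelta(seconds=10)

# Multiply back up to seconds to recover a readable date
readable = pd.to_datetime(stamps * 10, unit="s")

print(stamps.tolist())
print(readable.dt.date.tolist())
```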

In [None]:
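# steps of ten seconds since 1970-01-01, stored as plain integers
epoch = pd.Timestamp("1970-01-01")
df["date"] = (pd.to_datetime(df.published) - epoch) // pd.Timedelta(seconds=10)
# there are some works with mistaken dates, so we trim out the bottom 5%
df = df[df.date > df.date.quantile(0.05)]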
df

Unnamed: 0,work_id,words,comments,kudos,bookmarks,rating,hits,published,author,fandom,date
5,12583840,2780,3.0,45.0,4.0,Explicit,991.0,2017-10-31,ceonsa yusan (casdeanchronicles),bangtansonyeondan | Bangtan Boys | BTS,150940800
10,9810938,2318,1.0,32.0,3.0,Teen And Up Audiences,432.0,2017-02-18,hustlehobi (brainstorming),bangtansonyeondan | Bangtan Boys | BTS,148737600
11,5388149,1781,1.0,42.0,4.0,Not Rated,347.0,2015-12-09,hustlehobi (brainstorming),The Transformers (IDW Generation One),144961920
14,9320243,6103,3.0,24.0,2.0,General Audiences,429.0,2017-01-14,hustlehobi (brainstorming),bangtansonyeondan | Bangtan Boys | BTS,148435200
18,9789800,788,6.0,55.0,5.0,Teen And Up Audiences,608.0,2017-02-16,canhyiShan hwiyeong (Kookienism),SF9 (Band),148720320
...,...,...,...,...,...,...,...,...,...,...,...
24133,1640759,3288,40.0,3.0,1.0,General Audiences,90.0,2005-12-20,yuletide_archivist,Austin & Murry-O'Keefe Families - Madeleine L'...,113503680
24134,1640432,1370,4.0,2.0,1.0,Explicit,94.0,2003-11-17,yuletide_archivist,Chicago (2002),106902720
24136,1624760,1169,3.0,16.0,3.0,Mature,254.0,2007-12-18,yuletide_archivist,Grease (movie),119793600
24139,1627550,15496,34.0,16.0,4.0,Teen And Up Audiences,538.0,2008-12-21,yuletide_archivist,The Devil Wears Prada (2006),122981760


## Visual Inspection

Using a scatterplot (or a regression plot, if you like), plot `kudos` against `date`, and see if you can spot the age effect visually. What are your thoughts?

Because the `kudos` values vary so much, you might also like to use `plt.yscale("log")` after you create the plot to view the values on a log scale...
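
To show the mechanics of a log-scaled scatterplot without touching the real data, here is a minimal sketch on made-up values. The `toy` dataframe, its value ranges and the `+ 1` offset (so that works with zero kudos still appear on a log axis) are all assumptions of this example, not part of the dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Made-up timestamps and heavy-tailed counts, just to have something to plot
toy = pd.DataFrame({"date": rng.integers(110_000_000, 150_000_000, size=500)})
toy["kudos"] = rng.negative_binomial(n=1, p=0.02, size=500)

fig, ax = plt.subplots()
ax.scatter(toy["date"], toy["kudos"] + 1, s=8, alpha=0.4)
ax.set_yscale("log")  # same effect as plt.yscale("log") on the current axes
ax.set_xlabel("date (timestamp)")
ax.set_ylabel("kudos + 1")
```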

In [1]:
# code here

## Linear Regression

Using a "standard" linear regression with `statsmodels`, model `kudos` as the response variable with `date` as the predictor. What do you think? You should think about:
- how much variance is explained by `date`?
- is the model 'confident' about its estimates?
- are the estimates "significant"?
- how about the *effect size*? Large? Small?
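
The quantities in this checklist can all be read off a fitted `statsmodels` result. The sketch below shows where they live, using made-up data with a weak trend buried in noise; the `toy` dataframe and its numbers are assumptions for illustration only. The last line rescales the slope to "kudos per year", assuming timestamps counted in ten-second steps.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1_000

# Made-up data: a tiny negative slope hidden in skewed noise
toy = pd.DataFrame({"date": rng.uniform(110e6, 150e6, size=n)})
toy["kudos"] = 200 - 1e-6 * toy["date"] + rng.exponential(100, size=n)

fit = smf.ols("kudos ~ date", data=toy).fit()

print(fit.rsquared)                # share of variance explained
print(fit.conf_int().loc["date"])  # 95% interval for the slope
print(fit.pvalues["date"])         # "significance" of the slope

# Effect size in readable units: one year is 3,153,600 ten-second steps
per_year = fit.params["date"] * 3_153_600
print(per_year)
```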

In [2]:
# code here

## Bayesian Count Modelling

One of the reasons that the linear model was a problem is that "kudos" is really a count variable (it is always a non-negative integer). Since it accumulates over time, it is much more natural to model this with a more suitable distribution.

Using a Negative Binomial model, fit a Bayesian model with `bambi` that predicts `kudos` using `date`.
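
For reference, the model being asked for can be written as follows. This assumes the mean and overdispersion parameterisation and the log link that Bambi uses by default for its `negativebinomial` family; $x_i$ stands for the (suitably scaled) date of work $i$.

$$
\begin{aligned}
\text{kudos}_i &\sim \text{NegativeBinomial}(\mu_i, \alpha) \\
\log \mu_i &= \beta_0 + \beta_1 x_i \\
\operatorname{Var}(\text{kudos}_i) &= \mu_i + \mu_i^2 / \alpha
\end{aligned}
$$

The extra term $\mu_i^2 / \alpha$ is what lets the variance exceed the mean, which a Poisson model cannot do.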

Because the date values are very large, the model will be *numerically unstable* and almost certainly fail without help. Here, the standard trick is to scale the data. The formula language has this built in: just write `scale(predictor)` instead of `predictor` in your formula.
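
As a rough picture of what scaling does, the sketch below standardises three timestamps from the table above by hand. It assumes that `scale()` subtracts the mean and divides by the standard deviation; the exact standard-deviation convention used by the formula library may differ slightly.

```python
import numpy as np

# Three timestamps taken from the table above
date = np.array([150940800, 148737600, 113503680], dtype=float)

# Standardise: subtract the mean, divide by the standard deviation
z = (date - date.mean()) / date.std()

print(z)  # values of order 1 instead of order 10^8
```

After scaling, a slope of 1 means "the change per one standard deviation of `date`", which is worth remembering when you interpret the coefficient.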

Don't forget to use the magic incantation `idata_kwargs={"log_likelihood": True}` when fitting, so that we can do model comparisons later.

> NOTE
>
> Two things. First, some of these models will be slow - there is a *lot* of data. You can use `progressbar=True` in your call to `fit()`, or just wait it out. Second, it's possible that you might see some warnings about either divergences or r-hat issues. For this exercise, just ignore them.

In [3]:
# code here

## Interpret and view predictions

Based on the model output, what do you think about the effect of date? Is it real?

Next, plot the model prediction curve using `bmb.interpret.plot_predictions()`. Think about both the shape of the line and the highest-density intervals (HDIs).
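
When thinking about the shape of the line, it may help to remember that a model with a log link is linear on the log scale only. The sketch below uses hypothetical coefficient values (not taken from any real fit) to show that a straight line on the log scale becomes a constant *multiplicative* change in expected kudos per standard deviation of `date`.

```python
import numpy as np

# Hypothetical posterior means on the log scale (NOT from a real fit)
intercept, slope = 3.0, -0.2

z = np.linspace(-2, 2, 5)           # standardised date: -2, -1, 0, 1, 2
mu = np.exp(intercept + slope * z)  # expected kudos at each value

# Ratio between neighbouring points is the same everywhere: exp(slope)
ratio = mu[1:] / mu[:-1]
print(mu)
print(ratio)
```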

In [4]:
# code here

## Sanity Check the PPC

As before, we will check the predictions from the posterior to see if they are "roughly" consistent with the real observations. Huge inconsistencies here usually mean that we are using the wrong model. Because of the variation in `kudos` we will use a log scale.

We give you the function to do this. Call it like `plot_ppc_log(model_variable, idata_variable, name_of_response_variable)`

You should see:
- The posterior *mean* (orange dashed)
- Many distributions from the posterior predictive (kudos values simulated by the model exploring the predictor space)
- The empirical observed distribution (black line)

Ideally, the black line is entirely covered by the blue posterior predictive distributions, which would mean that all the real outputs could feasibly be predicted by the model.

What do you think? Is our modelling approach "broadly" OK? Is the model over-predicting or under-predicting?
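
The idea behind a posterior predictive check can be sketched with plain NumPy: simulate replicated datasets from a candidate model and see whether a summary of the observed data looks like the same summary of the replicates. Everything below is simulated, and the parameter values and the helper `draw_negbin` are assumptions of this example rather than anything estimated from the AO3 data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, alpha = 2_000, 50.0, 0.5


def draw_negbin(mu, alpha, size):
    # NumPy uses (n, p); convert from the mean / overdispersion form
    return rng.negative_binomial(alpha, alpha / (alpha + mu), size=size)


observed = draw_negbin(mu, alpha, n)            # stand-in for the real kudos
reps_negbin = draw_negbin(mu, alpha, (100, n))  # replicates from a suitable model
reps_poisson = rng.poisson(mu, size=(100, n))   # replicates from a model that is too narrow

# Compare one summary (the variance) across observed data and replicates
print(observed.var())
print(reps_negbin.var(axis=1).mean())
print(reps_poisson.var(axis=1).mean())
```

The Poisson replicates have far too little spread, which is the kind of "huge inconsistency" a PPC plot makes visible at a glance.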

In [None]:
def plot_ppc_log(model, idata, response: str, cut=False):
    """Plot a posterior predictive check of `response` on a log(x + 1) scale.

    If `cut` is a number, values at or above it (on the log scale) are masked out.
    """
    # predict the response variable (kudos) from many posterior samples
    pred = model.predict(idata, kind="response", inplace=False)
    # plot as log values to make the plot easier to read
    ppd = np.log(pred.posterior_predictive[response] + 1)
    obs = np.log(pred.observed_data[response] + 1)
    var = f"log({response} + 1)"
    if cut:
        pred.posterior_predictive[var] = ppd.where(ppd < cut, drop=True)
        pred.observed_data[var] = obs.where(obs < cut)
    else:
        pred.posterior_predictive[var] = ppd
        pred.observed_data[var] = obs

    return az.plot_ppc(pred, var_names=[var])

In [5]:
# code here

## Improve the model.

Now run three more models. Since there are many repeated measurements for author and fandom, it is natural to stratify by these variables by adding *mixed effects*. Recall that the purpose of a mixed effect is to allow each group of repeated observations to have its own baseline (intercept) and, if we ask for it, its own slope. What we want to do is to allow the model to learn that some authors and fandoms are inherently more popular and will naturally get more kudos, while still being affected by the kudos-over-time effect that we are modelling.

Make one model with a random effect for `author`, one with `fandom` and one with both. Do not add any interaction terms. For each model, look at the PPC plot. This will take a while.

Out of author and fandom, which seems to help the model most?

Plot the predictions and compare them to the previous ones. What do you notice?
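
One way to keep the four models organised is to build their formulas from a shared base. The sketch below assumes that "random effect" here means a group-specific intercept, written `(1|group)` in the formula language; the dictionary keys are made-up labels. Each string would then be passed to `bmb.Model(...)` together with the dataframe and the negative binomial family.

```python
# Shared fixed-effect part of every model
base = "kudos ~ scale(date)"

formulas = {
    "date_only": base,
    "author": f"{base} + (1|author)",
    "fandom": f"{base} + (1|fandom)",
    "author_fandom": f"{base} + (1|author) + (1|fandom)",
}

for name, formula in formulas.items():
    print(f"{name:>14}: {formula}")
```

Keeping the fitted results in a dictionary with the same keys makes the later comparison step straightforward.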

In [None]:
# LOTS OF CODE CELLS HERE

## Model Comparison

Using `az.compare()`, compare your four models. Which is the best?

> NOTE: you will see some warnings here. Just ignore them.

In [7]:
# code here

## Mixed effects are meaningful!

If you take the summary output from `az.summary()` you get a pandas dataframe. Sort the values by the mean effect, and find the author with the most positive mean effect. Think about what this means. Now go back to the original dataframe, and find the authors there with the highest median `kudos`. Compare.

(This is a good chance to use `groupby` and `agg` from our very first pandas session)
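
As a reminder of the pattern, here is a small sketch on a made-up table; the author names and kudos values are invented for illustration. It computes the median kudos per author together with the number of works, then sorts by the median.

```python
import pandas as pd

# Invented example data: three authors with different numbers of works
toy = pd.DataFrame(
    {
        "author": ["a", "a", "b", "b", "b", "c"],
        "kudos": [10, 30, 5, 7, 9, 100],
    }
)

per_author = (
    toy.groupby("author")
    .agg(median_kudos=("kudos", "median"), n_works=("kudos", "size"))
    .sort_values("median_kudos", ascending=False)
)
print(per_author)
```

Note that author `c` tops the list on the strength of a single work. Keeping the work count next to the median is useful when you compare this ranking with the one from the model.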

In [8]:
# code here

## One Last Thing...

Finally, fan-fiction can be pretty racy stuff, with content from General Audiences stories about Pokémon up to and including (a lot of) outright pornography. Most works on AO3 have a *rating* as well as more specific content advisory keywords (which we have not included in this notebook!).

Your last task is to add the `rating` as a categorical predictor (no interactions). Compare that model with the previous ones using `az.compare()` to see if it improves predictions.
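
To see what a categorical predictor amounts to inside the model, the sketch below dummy-codes a handful of rating labels with pandas. This is only an illustration of the idea: formula libraries typically do this coding for you, treating one level as the reference, and the choice of reference level shown here (the alphabetically first) is an assumption.

```python
import pandas as pd

rating = pd.Series(
    [
        "Explicit",
        "General Audiences",
        "Mature",
        "Not Rated",
        "Teen And Up Audiences",
        "Explicit",
    ],
    name="rating",
)

# One indicator column per level, except the reference level
dummies = pd.get_dummies(rating, drop_first=True)
print(dummies)
```

Each remaining column gets its own coefficient, interpreted as a shift relative to the reference rating.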


In [9]:
# code here

## Plot the predictions per rating

Now plot the predictions, conditional on both `date` and `rating`. To make the layout better, add the following arguments to your `plot_predictions`:
```python
    fig_kwargs={"figsize": (20, 8), "sharey": True},
    subplot_kwargs={
        "main": "date",
        "group": "rating",
        "panel": "rating",
    },
    legend=False
```
This will break the different ratings into subplots. Interpret.

In [10]:
# code here

# CONGRATULATIONS!

In this course you have progressed from simple descriptive statistics and basic uni- and bi-variate tests, to fully developed methods of Bayesian inference that are suitable for cutting-edge analysis on real problems in many fields.

Please, though, remember:
1. Statistics have no inherent value in the absence of *your* expertise and ideas
2. Always be honest! Sometimes the answers aren't in the data (there's too much uncertainty), and sometimes the answers aren't the ones we hoped for, and *that's OK*. Always do the best job you can, and describe your results (and problems) honestly and completely. 

Now go forth, and research!


```
Version History

Current: v1.0.0

20/11/24: 1.0.0: first draft, BN
```