# Session 4: homework

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams.update(
    {"mathtext.default": "regular", "figure.dpi": 300, "figure.figsize": (6, 6)}
)
import seaborn as sns
import scipy as sp
import statsmodels.api as sm
import statsmodels.formula.api as smf
import math
```

### 1. Arthurian manuscripts

Consider the following preprocessing steps (you get the code for free):
- Load the Arthurian manuscripts metadata and restrict it to a subset of the columns
- Drop any incomplete rows
- Only consider rows for which the script type is "cursive" or "textualis" (which are the most common script types)
- As in 4-2, drop entries with 'obvious' data entry issues
- Calculate the `text_area` (surface of the writing area) and `leaf_area` (surface of the full page) and assign these to new columns in the dataset (result in $cm^2$).

```python
# Load the metadata; replace hyphens so column names work as attributes
df = pd.read_csv("../../datasets/arthur/manuscripts.csv", index_col=0)
df.columns = df.columns.str.replace("-", "_")

# Keep the page dimensions plus three categorical columns; drop incomplete rows
page_cols = ["leaf_height", "leaf_width", "text_height", "text_width"]
mss = df[page_cols + ["script", "material", "physical_type"]].dropna()

# Keep the two most common script types (.copy() avoids SettingWithCopyWarning)
mss = mss[mss.script.isin(["textualis", "cursive"])].copy()

# Divide by 10 for mm -> cm
mss[page_cols] = mss[page_cols] / 10

# Surfaces in cm^2
mss["text_area"] = mss.text_height * mss.text_width
mss["leaf_area"] = mss.leaf_height * mss.leaf_width

# 'Obvious' data entry issues: text block larger than the leaf, or (near-)zero areas
bad = (
    (mss.leaf_height < mss.text_height)
    | (mss.leaf_width < mss.text_width)
    | (mss.leaf_area < 0.001)
    | (mss.text_area < 0.001)
)
mss = mss[~bad]
mss
```

1. Run a univariate linear model in which you use a manuscript's `leaf_area` to predict its `text_area`. Print the model summary.

> CHECK What is the *dependent variable*? What is the *predictor*?

2. **FROM THE DATAFRAME** plot `leaf_area` vs `text_area` as a scatterplot with `sns.scatterplot`. Convention says that you should have the **dependent** variable on the y-axis, but either way, make sure the regression line seems to fit!

3. **MANUALLY** (not using `regplot` or seaborn tricks) add the regression line from the model parameters using `axline()`.
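As a rough sketch of the `axline()` step, the cell below fits a model on synthetic toy data (not the manuscripts) and draws the line from the fitted parameters. The names `toy`, `toy_model`, `x` and `y` are placeholders invented for this illustration, and plain `ax.scatter` stands in for `sns.scatterplot` to keep the example short.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption for illustration): y is roughly 3x + 2
rng = np.random.default_rng(42)
toy = pd.DataFrame({"x": np.linspace(0, 10, 40)})
toy["y"] = 3 * toy.x + 2 + rng.normal(scale=1.0, size=len(toy))

toy_model = smf.ols("y ~ x", data=toy).fit()

# Fitted parameters are stored by name in a pandas Series
intercept = toy_model.params["Intercept"]
slope = toy_model.params["x"]

fig, ax = plt.subplots()
ax.scatter(toy.x, toy.y, s=10)
# axline draws an infinite straight line through one point with a given slope;
# (0, intercept) is the point where the regression line crosses the y-axis
line = ax.axline((0, intercept), slope=slope, color="red")
ax.set_xlabel("x (predictor)")
ax.set_ylabel("y (dependent variable)")
```

Passing a point and a slope means the line does not depend on the current axis limits, which is why it still looks right after zooming or rescaling.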

Inspect the model to retrieve the estimated coefficients for the regression formula ($y = mx + c$). **Manually** (i.e. without a function call) calculate which `text_area` the model predicts for a manuscript with a `leaf_area` of 500 $cm^2$, 700 $cm^2$ and 1500 $cm^2$ respectively, by copying the coefficients into the formula for the regression line. Validate your result by calling `predict()` on the model for all three cases simultaneously.

> TIP: This is a little fiddly. The easiest way for one variable is to pass a `dict` where the key is the predictor and the value is a `list` of values for which we want predictions. For multivariate models it will be easiest to pass a whole `DataFrame`
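The `dict` approach from the tip might look like the following on synthetic toy data (not the manuscripts). The names `toy`, `toy_model`, `x` and `y` are made up for this sketch; in the exercise, the dict key would be the predictor name from your own formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption for illustration): y is roughly 2x + 1
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": np.linspace(0, 10, 50)})
toy["y"] = 2 * toy.x + 1 + rng.normal(scale=0.5, size=len(toy))

toy_model = smf.ols("y ~ x", data=toy).fit()

# Manual calculation: plug the fitted coefficients into y = m*x + c
c = toy_model.params["Intercept"]
m = toy_model.params["x"]
new_x = [2.0, 5.0, 8.0]
manual = [m * x + c for x in new_x]

# Same calculation via predict(): the dict key must match the predictor
# name used in the formula, the value is the list of new inputs
predicted = toy_model.predict({"x": new_x})

print(manual)
print(list(predicted))
```

Both printed lists should agree up to floating-point rounding, which is the validation the exercise asks for.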

- Supplement the linear model which you just ran with the calculation of one of three correlation tests that we saw ("spearman", "kendall" or "pearson"). Justify your choice. Describe the correlation by looking up (*Gries, p. 147*) the suitable wording for the correlation coefficient which you obtain.

- Now, consider the following three categorical predictors (1) `physical_type`, (2) `script`, and (3) `material`. Run a separate `ols()` regression for each of these features as predictors for a manuscript's `leaf_area`. Provide an interpretation for each of the coefficients you obtain. Produce boxplots for each experiment.
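When interpreting the coefficients of a categorical predictor, it may help to see how `ols()` encodes one. The sketch below uses a tiny invented dataset (`toy`, `group`, `value` are hypothetical names): with the default treatment coding, the intercept is the mean of the reference level and each remaining coefficient is a difference from that mean.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Tiny invented dataset: two groups with means 2.0 ("a") and 6.0 ("b")
toy = pd.DataFrame(
    {
        "group": ["a"] * 4 + ["b"] * 4,
        "value": [1.0, 2.0, 3.0, 2.0, 5.0, 6.0, 7.0, 6.0],
    }
)

cat_model = smf.ols("value ~ group", data=toy).fit()

# Group means computed directly from the data, for comparison
group_means = toy.groupby("group")["value"].mean()

# Default treatment coding: the alphabetically first level ("a") is the reference
intercept = cat_model.params["Intercept"]  # mean of the reference level "a"
offset = cat_model.params["group[T.b]"]    # mean of "b" minus mean of "a"

print(group_means)
print(intercept, intercept + offset)
```

A boxplot of `value` per `group` shows the same contrast visually, although boxplots mark medians rather than means.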

For obvious reasons, `leaf_area` is a good predictor for `text_area` (since they correlate), and `physical_type` also seems useful. Combine both `leaf_area` and `physical_type` as predictors into a single model that predicts `text_area`. Inspect whether and how the predictors complement each other. Compare the two univariate models to the bivariate model. Which model would you prefer?
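One possible way to set such models side by side is to collect a few summary statistics in a small table. The sketch below does this on synthetic data (all names are invented for illustration); adjusted $R^2$ and AIC are only two of several criteria you could use, not necessarily the ones the course prefers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption): y depends on a numeric x AND on a group offset
rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({"x": rng.uniform(0, 10, n), "group": np.tile(["a", "b"], n // 2)})
toy["y"] = (
    1.5 * toy.x
    + np.where(toy.group == "b", 3.0, 0.0)
    + rng.normal(scale=1.0, size=n)
)

# Two single-predictor models and one model that combines both predictors
formulas = {"x only": "y ~ x", "group only": "y ~ group", "x + group": "y ~ x + group"}
fits = {name: smf.ols(f, data=toy).fit() for name, f in formulas.items()}

# Adjusted R^2 penalises extra predictors; a lower AIC indicates a better trade-off
comparison = pd.DataFrame(
    {
        "r2": {name: fit.rsquared for name, fit in fits.items()},
        "r2_adj": {name: fit.rsquared_adj for name, fit in fits.items()},
        "aic": {name: fit.aic for name, fit in fits.items()},
    }
)
print(comparison)
```

Plain $R^2$ can never decrease when a predictor is added to a nested model, so the adjusted value or AIC is the more informative column for this comparison.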

### 2. Harry Potter and the New Chapter

Reload the data on sentence lengths in the Harry Potter novels. Consider the UK sentence lengths and:
1. Produce a scatterplot in which you plot the sentence lengths in the UK chapters as a function of their (ascending) chapter index in this oeuvre.
   - Provide a suitable title and descriptive labels for the horizontal and vertical axes.
2. Run a linear model and plot the regression line (in red) on top of the scatterplot. Report on the result of the linear model on the basis of the model's summary. Answer the question:
   - What is the average increase/decrease in tokens for every consecutive chapter in this series, according to this model?

Finally, if a new chapter were produced at the end of the final book, use the statsmodels `get_prediction` API to predict its *mean length* with an associated 95% confidence interval.

HINT: If you get stuck, you might have forgotten the `summary_frame()` method...
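For orientation, here is a minimal sketch of the `get_prediction` and `summary_frame()` combination on synthetic data (not the Harry Potter data). The names `toy`, `toy_model`, `x` and `y` are placeholders; in the exercise, the new value would be the chapter index that follows the last chapter.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (an assumption): x runs from 1 to 10
rng = np.random.default_rng(7)
toy = pd.DataFrame({"x": np.arange(1, 11, dtype=float)})
toy["y"] = 4 * toy.x + 10 + rng.normal(scale=2.0, size=len(toy))

toy_model = smf.ols("y ~ x", data=toy).fit()

# Predict for one new observation just past the end of the observed range
new = pd.DataFrame({"x": [11.0]})
pred = toy_model.get_prediction(new)

# alpha=0.05 gives 95% intervals; mean_ci_* bound the predicted MEAN,
# obs_ci_* bound a single new OBSERVATION (and are therefore wider)
frame = pred.summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

The exercise asks for the mean length, so the `mean_ci_lower` and `mean_ci_upper` columns are the relevant ones here, rather than the wider `obs_ci_*` prediction interval.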

<img src=https://imgs.xkcd.com/comics/linear_regression.png>

<small>[XKCD](https://xkcd.com/1725/) CC-BY-NC 2.5</small>

```
Version History

Current: v1.0.1

8/10/24: 1.0.0: first draft, BN
09/10/24: 1.0.1: proofread, MK
```