# Humanities Data Analysis: Exam
### *Second exam period - September 2021*

This notebook contains the Humanities Data Analysis exam exercises. Write your code and text answers in the designated blocks, as in the weekly assignments. **Don't forget to save your progress!** You have three hours to complete the exam. You can earn a total of **17.5 points** (which will be combined with the permanent evaluation score): the distribution of points is indicated per exercise. When you have finished, save your notebook (include your name in the filename!) and **send it via email to mike.kestemont@uantwerpen.be and lisa.hilte@uantwerpen.be**.

Some final notes before you get started:

- Some questions have been marked as hard: we advise you to only solve these after having solved the rest of the exam.
- Provide an interpretation of your results whenever that is explicitly asked. Use full sentences and avoid telegram style. Allow for nuance in your interpretation and make use of the relevant terminology.
- Try to fill in at least something for every question: we cannot give any marks for an empty answer box. 
- Do not remove any outliers or rescale data, unless you are explicitly asked to do so.
- The required datasets are in the folder `datasets_exam`, which is saved in the same directory level (i.e. in the same folder) as this notebook.
- You are allowed to use all course materials (slides, notebooks, handbook by Gries) and the internet (e.g. reading discussion fora). However, **communication with third parties (including fellow students) is strictly forbidden!**
- No technical support can be provided: working with the notebooks (loading, saving, etc.) is assumed to be an acquired skill. 

Good luck!

## Student information

Make sure to **include your name in the filename when saving this notebook**. In order to avoid any possible confusion, enter your name here too:

In [None]:
# first and last name here

In [None]:
# student number here

## Exercise 1: Protagonists' gender and social class

**This exercise consists of two parts (1a - 1b) that range from an easy to an intermediate level, as indicated below. You can earn 5 points in total.**

### Exercise 1a: Loading and exploring data

**[level: easy]**

**[points: 2]**

Load and inspect the dataset `correlaciones/spanish-novels.tsv` from the `datasets_exam` folder and call it `novels`.

In [1]:
# code here
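
As a hedged illustration of loading and inspecting a tab-separated file in R, the sketch below writes a tiny stand-in file to a temporary location and reads it back. The rows, the `title` column, and the social-level values are invented for the example; for the real data the path would presumably be `datasets_exam/correlaciones/spanish-novels.tsv`.

```r
# Toy stand-in for the real file: three invented rows written to a temporary TSV.
toy_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "title\tprotagonist.gender\tprotagonist.social.level",
  "Novel A\tfemale\thigh",
  "Novel B\tmale\tlow",
  "Novel C\tmale\tmedium"
), toy_path)

# read.delim() expects tab-separated columns and a header row by default.
toy_novels <- read.delim(toy_path, stringsAsFactors = TRUE)

str(toy_novels)      # column types and factor levels
head(toy_novels)     # first rows of the table
summary(toy_novels)  # per-column overview
```

Setting `stringsAsFactors = TRUE` is a choice made here so that text columns arrive as factors; converting columns afterwards with `factor()` works just as well.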

Which levels does the `protagonist.gender` variable have? Write down the names of the levels, and give the absolute frequencies (number of data points) per level.

In [2]:
# code here
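
A minimal sketch of how factor levels and their absolute frequencies can be listed, using an invented vector (the level `unknown` is purely hypothetical and need not occur in the exam data):

```r
# Invented example values, not taken from the exam dataset.
gender <- factor(c("female", "male", "male", "unknown", "male", "female"))

levels(gender)  # names of the levels
table(gender)   # absolute frequency (number of data points) per level
```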

In [3]:
# text answer here

Create a subset of your dataset `novels` in which you only keep protagonists with 'female' or 'male' gender. You get this preprocessing step for free:

In [4]:
# given: preprocessing step

# select subset
novels <- novels[(novels$protagonist.gender == 'male')|(novels$protagonist.gender == 'female'),]
# delete empty levels
novels$protagonist.gender <- factor(novels$protagonist.gender)

Now inspect (with tables) the relation between the following two variables:
- `protagonist.gender`: gender of the protagonist (main character) in a novel
- `protagonist.social.level`: social class of the protagonist (main character) in a novel

Create informative tables to compare these two variables, and round all proportions to three digits. Create:

- a table with raw (absolute) frequencies
- a table with column proportions
- a table with row proportions

In [4]:
# code here
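
The three requested tables can be sketched as follows on invented toy data. The `margin` argument of `prop.table()` decides whether proportions are computed within rows (`1`) or within columns (`2`); which variable ends up in the rows depends on the order in which the variables are passed to `table()`.

```r
# Invented toy data with hypothetical social-level labels.
toy <- data.frame(
  gender = factor(c("female", "female", "male", "male", "male", "male")),
  social = factor(c("high", "low", "high", "high", "low", "medium"))
)

# Raw (absolute) frequencies: gender in the rows, social level in the columns.
tab <- table(toy$gender, toy$social)
tab

# Row proportions: every row sums to 1.
row_props <- round(prop.table(tab, margin = 1), 3)
row_props

# Column proportions: every column sums to 1.
col_props <- round(prop.table(tab, margin = 2), 3)
col_props
```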

Next, create the appropriate plot for `protagonist.gender` versus `protagonist.social.level`. Add coloring to increase clarity. 

In [5]:
# code here
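
For two categorical variables, a mosaic plot or a grouped bar plot are common choices; which one the course considers "the appropriate plot" is an assumption here. The sketch below uses invented counts and base R graphics only.

```r
# Invented toy data with hypothetical social-level labels.
tab <- table(
  gender = c("female", "female", "male", "male", "male", "male"),
  social = c("high", "low", "high", "high", "low", "medium")
)

# Mosaic plot: tile area is proportional to the cell frequency.
mosaicplot(tab,
           color = c("tomato", "steelblue", "goldenrod"),
           main = "Protagonist gender vs. social level (toy data)")

# Alternative: grouped bar plot, one bar per gender within each social level.
barplot(tab,
        beside = TRUE,
        col = c("tomato", "steelblue"),
        legend.text = TRUE,
        xlab = "Social level",
        ylab = "Number of novels")
```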

What do the tables and plot suggest about the social diversity in Spanish novels among female and male protagonists? And about the gender divide among Spanish novel protagonists?

In [6]:
# text answer here

### Exercise 1b: Chi-squared test

**[level: intermediate]**

**[points: 3]**

Before conducting a chi-squared test on the association between `protagonist.gender` and `protagonist.social.level`, formulate your null hypothesis $H_0$ and alternative hypothesis $H_1$.

In [7]:
# H0 here

In [8]:
# H1 here

Test all assumptions of a chi-squared test. What are your findings, and their (statistical) implications/consequences? If required, take the necessary actions.

In [9]:
# code here
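
One commonly cited assumption of the chi-squared test, besides independent observations, is that the expected frequencies are large enough (a frequent rule of thumb is at least 5 per cell; the course materials may phrase this differently). The sketch below checks this on an invented contingency table and shows two possible remedies. Which remedy is expected in this exam is not stated in the notebook, so treat both as assumptions.

```r
# Invented counts, not the exam data.
tab <- matrix(c(30, 10, 5,
                40, 25, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("female", "male"),
                              social = c("high", "medium", "low")))

# Expected frequencies under independence (warning about small cells suppressed).
expected <- suppressWarnings(chisq.test(tab))$expected
round(expected, 2)

# Rule-of-thumb check: are all expected frequencies at least 5?
all(expected >= 5)

# Possible remedies when the check fails:
fisher_p <- fisher.test(tab)$p.value              # exact test
set.seed(1)                                       # reproducible simulation
sim_p <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)$p.value
c(fisher = fisher_p, simulated = sim_p)
```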

In [10]:
# text answer here

Following your previous steps, now perform a chi-squared test and interpret the result. Make sure to say something about the obtained chi-squared value and p-value, and about your previously formulated hypotheses.

In [11]:
# code here

In [12]:
# text answer here

## Exercise 2: Faces in TIME Magazine

**This exercise consists of three parts (2a - 2c) that range from an easy to an intermediate level, as indicated below. You can earn 6 points in total.**

TIME Magazine is an influential periodical US publication that has covered America's domestic and global politics for many decades now. For a recent publication in the *Journal of Cultural Analytics*, [Jofre et al. (2020)](https://culturalanalytics.org/article/12266-what-s-in-a-face-gender-representation-of-faces-in-time-1940s-1990s) extracted all the human faces which were photorealistically depicted in TIME magazine's archive, covering 3,389 issues dating from 1923 to 2014. The research team (partly automatically) classified each of these instances as belonging to the 'male' or 'female' gender. For this exam question, we have preprocessed and reformatted their data (available from [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YAT5QK)) into two new, easy-to-use datasets (`time/time-year.csv` and `time/time-season.csv`) that you will need for answering the questions below.

### Exercise 2a: Loading and exploring data

**[level: easy]**

**[points: 1]**


Load the comma-separated file `time/time-year.csv` from the `datasets_exam` folder. Inspect the beginning and end of the table. For each year in the dataset (1923-2014), this table records the proportion of female faces (over all faces) that Jofre et al. detected. Verify that you have proportions for 90 years.

In [13]:
# code here

### Exercise 2b: Running a linear model

**[level: intermediate]**

**[points: 3]**

First, plot the proportion of female faces in TIME over time as a scatterplot. Provide meaningful axis labels and a figure title. Informally describe any diachronic trends you can glean from this graph.

In [14]:
# code here

In [15]:
# text answer here

Now run a linear model to predict the proportion of `female` faces as a function of `year`. (For the sake of simplicity, you do not have to scale or center your variables; also, you can treat the dependent variable `female` as a simple scalar.) Is the model you obtain satisfactory? Why (not)? Extract relevant arguments from the model summary.

In [16]:
# code here
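
A minimal sketch of fitting a simple linear regression and pulling the usual diagnostics out of its summary. The data are simulated; only the column names `year` and `female` follow the exercise text.

```r
# Simulated toy data: a noisy upward trend, not the TIME data.
set.seed(1)
toy <- data.frame(year = 1950:1999)
toy$female <- 0.10 + 0.004 * (toy$year - 1950) + rnorm(50, sd = 0.02)

# Linear model: proportion of female faces as a function of year.
model <- lm(female ~ year, data = toy)
s <- summary(model)

coef(s)          # estimates, standard errors, t-values, p-values
s$r.squared      # proportion of variance explained
s$adj.r.squared  # adjusted for the number of predictors
s$fstatistic     # overall F-test of the model
```

Residual diagnostics, for instance `plot(model, which = 1)`, are a further source of arguments about whether a straight line is an adequate description.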

In [17]:
# text answer here

Generate your scatterplot from before again, but now overlay it with the regression line (in red) yielded by your linear model.

In [18]:
# code here
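
Overlaying a fitted line on a base R scatterplot can be sketched with `abline()`, which accepts a fitted `lm` object directly. The data are again simulated, and the axis labels and title are only suggestions.

```r
# Simulated toy data, not the TIME data.
set.seed(1)
toy <- data.frame(year = 1950:1999)
toy$female <- 0.10 + 0.004 * (toy$year - 1950) + rnorm(50, sd = 0.02)
model <- lm(female ~ year, data = toy)

# Scatterplot with axis labels and a title.
plot(toy$year, toy$female,
     xlab = "Year",
     ylab = "Proportion of female faces",
     main = "Female faces over time (toy data)")

# Regression line from the fitted model, drawn in red.
abline(model, col = "red", lwd = 2)
```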

Thought exercise: suppose that TIME Magazine had already started publishing in 1922. Which proportion of female faces would your model have predicted for this fictional year?

In [19]:
# code here
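
Predicting for an unseen value of the predictor can be sketched with `predict()` and a one-row `newdata` frame whose column name matches the predictor. The numbers below are invented; note that such a prediction is an extrapolation beyond the observed range.

```r
# Invented toy data, not the TIME data.
toy <- data.frame(
  year = 2000:2004,
  female = c(0.10, 0.12, 0.11, 0.14, 0.15)
)
model <- lm(female ~ year, data = toy)

# Prediction for a year just before the observed range.
pred <- predict(model, newdata = data.frame(year = 1999))
pred

# The same value computed by hand: intercept + slope * year.
manual <- coef(model)[["(Intercept)"]] + coef(model)[["year"]] * 1999
manual
```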

### Exercise 2c: TIME series

**[level: intermediate]**

**[points: 2]**

Apart from a linear model, we can also approach this TIME series data with a bivariate test to verify the hypothesis that the proportion of female faces depicted in TIME has increased over time. Run Kendall's $\tau$ on this data to verify this hypothesis and discuss your findings. Make sure to apply the correct directionality (cf. `alternative`) when running the test. Report your findings, using the following terms: "null hypothesis", "alternative hypothesis", "p-value", "significant(ly)", and "directional(ity)".

In [20]:
# code here
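
A hedged sketch of a one-sided Kendall correlation test on invented data. With `alternative = "greater"`, the alternative hypothesis is that $\tau$ is positive, i.e. that the second variable tends to increase with the first.

```r
# Invented toy series without ties, not the TIME data.
year <- 2000:2009
female <- c(0.10, 0.12, 0.11, 0.15, 0.14, 0.18, 0.17, 0.20, 0.22, 0.21)

# Directional test: H1 states that tau > 0.
res <- cor.test(year, female, method = "kendall", alternative = "greater")

res$estimate  # Kendall's tau
res$p.value   # one-sided p-value
```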

In [21]:
# text answer here

## Exercise 3: Spellingdata revisited (full analysis)

**This exercise consists of three parts (3a - 3c) that range from an easy to a hard level, as indicated below. You can earn 6.5 points in total.**

### Exercise 3a: Loading and exploring data

**[level: easy]**

**[points: 1.5]**

Load and inspect `spellingdata3.tsv` from the `datasets_exam` folder. You will notice that this is an updated version of the spellingdata dataset that you are familiar with, with some added variables. Read the readme to learn about the new variable `frequency`, and to go over the other variables once more.

In [22]:
# code here

Turn `error` into a factor.

In [23]:
# code here

Check this dataset for repeated measurements. Are there any? Describe what the implications for statistical modeling will be: which kind of model will you need to use?

In [24]:
# code here
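
One way to check for repeated measurements is to count how many rows each unit contributes. The column names `participant` and `verb` below are hypothetical (the readme is not reproduced here), and the rows are invented.

```r
# Invented toy data; `participant` and `verb` are hypothetical column names.
toy <- data.frame(
  participant = c("p1", "p1", "p1", "p2", "p2", "p3"),
  verb = c("a", "b", "c", "a", "b", "a"),
  error = c(0, 1, 0, 0, 0, 1)
)

# Convert the response into a factor.
toy$error <- factor(toy$error)

# Number of observations per participant and per verb.
obs_per_participant <- table(toy$participant)
obs_per_verb <- table(toy$verb)
obs_per_participant
obs_per_verb

# TRUE means at least one unit occurs more than once.
any(obs_per_participant > 1)
any(obs_per_verb > 1)
```

When units recur like this, the observations are not independent, which is the usual motivation for adding random effects for those units.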

In [25]:
# text answer here

### Exercise 3b: Statistical modeling

**[level: intermediate]**

**[points: 2.5]**

Fit an appropriate statistical model to investigate the impact of `frequency`, `education` and their potential interaction (predictors) on the making of verb spelling errors (`error`) (response). You need to scale `frequency` when including it in your analyses. You can do so via the function `scale()`: simply use `scale(frequency)`.
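
To illustrate what `scale()` does with its default arguments, the sketch below compares it with a manual z-transformation (subtract the mean, divide by the standard deviation) on invented numbers.

```r
# Invented frequencies, not taken from the spelling data.
frequency <- c(2, 15, 40, 7, 120)

# scale() centers and standardizes; it returns a one-column matrix.
z <- scale(frequency)

# The same transformation written out by hand.
manual <- (frequency - mean(frequency)) / sd(frequency)

all.equal(as.numeric(z), manual)
```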


In [26]:
# code here: load libraries

In [27]:
# code here: model fitting
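
A rough sketch of a mixed-effects logistic regression with an interaction term, assuming the `lme4` package is installed and that a binary response with repeated measurements per participant calls for this kind of model. The data are simulated, and the grouping variable `participant`, the `item` column, and the `education` levels are hypothetical; the real dataset may require a different random-effects structure.

```r
library(lme4)  # assumption: lme4 is available

# Simulated toy data: 30 participants, 20 items each.
set.seed(42)
toy <- expand.grid(participant = factor(1:30), item = 1:20)
toy$education <- factor(ifelse(as.integer(toy$participant) <= 15, "low", "high"))

# One invented frequency value per item.
item_freq <- round(exp(rnorm(20, mean = 3, sd = 1)))
toy$frequency <- item_freq[toy$item]

# Simulate errors: rarer words and lower education raise the error probability.
part_eff <- rnorm(30, sd = 0.8)  # participant-specific deviation
eta <- -1 - 0.5 * scale(toy$frequency)[, 1] +
  0.5 * (toy$education == "low") +
  part_eff[as.integer(toy$participant)]
toy$error <- factor(rbinom(nrow(toy), size = 1, prob = plogis(eta)))

# Fixed effects: scaled frequency, education, and their interaction.
# Random effect: an intercept per participant.
model <- glmer(error ~ scale(frequency) * education + (1 | participant),
               data = toy, family = binomial)
summary(model)
```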

Test whether or not all the predictors in the model are significant. Describe your findings and their implications.

In [28]:
# code here

In [29]:
# text answer here

### Exercise 3c: Model visualization and interpretation

**[level: hard]**

**[points: 2.5]**

Visualize the results of your model in an effect plot. (You can ignore potential vector warnings when plotting.)

In [30]:
# code here

Finally, describe and interpret the (large trends of the) results. You can refer to both your model information and your plot.

In [31]:
# text answer here