<a href="https://colab.research.google.com/github/browndw/humanities_analytics/blob/main/mini_labs/Mini_Lab_04_Keywords.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Lab 4: Keywords

We're going to use the same libraries and introductory steps in our processing pipeline.

For details about the package and its functions, see: <https://docuscospacy.readthedocs.io/en/latest/docuscope.html>

If you'd like to explore what this library does in an interactive online interface, you can go to: <https://docuscope-ca.eberly.cmu.edu/>

We'll also be using [great_tables](https://posit-dev.github.io/great-tables/articles/intro.html) to design and output tabular data.

## Install the libraries

Note that the `%%capture` cell magic simply suppresses the installation output.

In [2]:
%%capture
!pip install "docuscospacy>=0.3"
!pip install great_tables

## Install the model

Next we'll install the model.

In [3]:
%%capture
!pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl"

## Load the libraries

We'll need these for our processing pipeline (docuscospacy, spacy), and to wrangle data frames (polars), generate and manipulate tables (great_tables), and create plots (matplotlib).

In [4]:
import docuscospacy as ds
import polars as pl
import spacy
from great_tables import GT, md, html
from matplotlib import pyplot as plt

## Import data

For this exercise, we'll import some toy data that was designed to replicate COCA on a small scale.

In [5]:
df = pl.read_parquet("https://github.com/browndw/humanities_analytics/raw/refs/heads/main/data/data_tables/sample_corpus.parquet")

Remember that the table has a `doc_id` column and a `text` column. This is conventional formatting for processing textual data.

In [6]:
df.head()

| doc_id | text |
| --- | --- |
| str | str |
| "acad_01" | "Teachers and other school pers…" |
| "acad_02" | "Abstract Does the conflict in …" |
| "acad_03" | "January 17, 1993, will mark th…" |
| "acad_04" | "Thirty years have passed since…" |
| "acad_05" | "ABSTRACT -- A common property …" |


For this mini lab, it is also useful to note which categories have been encoded into the `doc_id`. To retrieve that information, we can extract the initial character string and return the unique values. This information will be useful later.

In [11]:
df.get_column("doc_id").str.extract(r"^([a-z]+)", 1).unique().sort().to_list()

['acad', 'blog', 'fic', 'mag', 'news', 'spok', 'tvm', 'web']

## Load the model

As with any processing task of this kind, we will first need to load a model instance.

In [12]:
nlp = spacy.load("en_docusco_spacy")

## Process the corpus

Now, we pass our data frame `df` to the nlp model that we just loaded using the `docuscope_parse` function. This is the most computationally intensive part of the process (and it takes the longest). For this sample corpus, processing should take a couple of minutes.

If you're working with a medium to large corpus, you'd typically want to down-sample the full corpus and check to make sure everything is working before processing the full data set.

Down-sampling our `df` would be easy using the `sample` function in [`polars`](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sample.html).

In [13]:
ds_tokens = ds.docuscope_parse(df, nlp_model=nlp, n_process=4)

### Peek at the tokens

Now we can check what we've generated.

In [None]:
ds_tokens.head()

| doc_id | token | pos_tag | ds_tag | pos_id | ds_id |
| --- | --- | --- | --- | ---: | ---: |
| str | str | str | str | u32 | u32 |
| "acad_01" | "Teachers " | "NN2" | "Character" | 1 | 1 |
| "acad_01" | "and " | "CC" | "Untagged" | 2 | 2 |
| "acad_01" | "other " | "JJ" | "InformationTopics" | 3 | 3 |
| "acad_01" | "school " | "NN1" | "InformationTopics" | 4 | 3 |
| "acad_01" | "personnel " | "NN2" | "Character" | 5 | 4 |


## What is **keyness**?

Keyness is a generic term for various tests that compare observed vs. expected frequencies.

The most commonly used (though not the only option) is called log-likelihood in corpus linguistics, but you will see it elsewhere called a **G-test** of goodness-of-fit.

The calculation is based on a 2 x 2 contingency table. It is similar to a chi-square test, but performs better when corpora are unequally sized.

Expected frequencies are based on the relative size of each corpus (in total number of words $N_i$) and the total of the observed frequencies across both corpora:

$$
E_i = N_i \times \frac{\sum_j O_j}{\sum_j N_j}
$$

And log-likelihood is calculated according to the formula:

$$
LL = 2 \times \sum_i O_i \ln \frac{O_i}{E_i}
$$
A good explanation of its implementation in linguistics can be found here: <http://ucrel.lancs.ac.uk/llwizard.html>.

You don't need to worry about the math, but you should understand what is happening conceptually and what the results show.

> Keyness measures the frequency we observe in a target corpus against what we would expect if our target corpus and our reference corpus were part of the same distribution. In other words, take both our corpora and pool them together. What is the frequency we would expect to see?
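
To make the pooling idea concrete, here is a minimal sketch of the two formulas above in plain Python. The counts are hypothetical, and the function follows the two-term formula as written here; it is not taken from the `docuscospacy` source, so treat it as an illustration rather than the package's implementation.

```python
import math


def log_likelihood(obs_target, obs_ref, n_target, n_ref):
    """Log-likelihood for one token, following the two formulas above."""
    total_obs = obs_target + obs_ref
    total_n = n_target + n_ref
    # Expected counts if both corpora were drawn from one pooled distribution
    exp_target = n_target * total_obs / total_n
    exp_ref = n_ref * total_obs / total_n
    ll = 0.0
    for obs, exp in ((obs_target, exp_target), (obs_ref, exp_ref)):
        if obs > 0:  # a zero count contributes nothing to the sum
            ll += obs * math.log(obs / exp)
    return 2 * ll


# Hypothetical counts: 120 hits in a 100,000-word target corpus
# and 300 hits in a 900,000-word reference corpus
ll_value = log_likelihood(120, 300, 100_000, 900_000)

# Same rate in both corpora: observed equals expected, so no evidence of a difference
ll_null = log_likelihood(100, 900, 100_000, 900_000)

print(round(ll_value, 2), ll_null)
```

When the token occurs at the same rate in both corpora, the observed and expected counts coincide and the statistic is zero; the further the observed counts drift from the pooled expectation, the larger the value.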

Importantly, keyness measures **how much evidence we have for an effect**. It doesn't make much sense, for example, to claim that a token is "more key" than another.

## Choosing a target and reference corpus

The first step in carrying out a keyness calculation is to decide what you're comparing to what. This choice has clear implications for what you can claim based on your results. What would it show, for example, to compare television and movie scripts to blogs? What's the research question? What do you hope to learn?

Let's start by comparing the academic texts to the other text types. To do this, we will use the `filter` function. Note that the tilde (`~`) negates the condition, so the second line keeps every row whose `doc_id` does not contain the string.

In [17]:
target = ds_tokens.filter(pl.col("doc_id").str.contains("acad"))
reference = ds_tokens.filter(~pl.col("doc_id").str.contains("acad"))

And check the results.

In [20]:
target.head()

| doc_id | token | pos_tag | ds_tag | pos_id | ds_id |
| --- | --- | --- | --- | ---: | ---: |
| str | str | str | str | u32 | u32 |
| "acad_01" | "Teachers " | "NN2" | "Character" | 1 | 1 |
| "acad_01" | "and " | "CC" | "Untagged" | 2 | 2 |
| "acad_01" | "other " | "JJ" | "InformationTopics" | 3 | 3 |
| "acad_01" | "school " | "NN1" | "InformationTopics" | 4 | 3 |
| "acad_01" | "personnel " | "NN2" | "Character" | 5 | 4 |


In [21]:
reference.head()

| doc_id | token | pos_tag | ds_tag | pos_id | ds_id |
| --- | --- | --- | --- | ---: | ---: |
| str | str | str | str | u32 | u32 |
| "blog_01" | "unpleasant" | "JJ" | "Negative" | 144947 | 122978 |
| "blog_01" | ", " | "Y" | "Untagged" | 144948 | 122979 |
| "blog_01" | "and " | "CC" | "Narrative" | 144949 | 122980 |
| "blog_01" | "then " | "RT" | "Narrative" | 144950 | 122980 |
| "blog_01" | "you " | "PPY" | "Narrative" | 144951 | 122980 |


## Creating frequency tables

The docuscospacy package has a function called `keyness_table` that will calculate all of the statistical information that we'll need.

See here: <https://docuscospacy.readthedocs.io/en/latest/corpus_analysis.html#Keyword-tables>

The function requires us to first create frequency tables for our target and our reference.

In [22]:
wc_target = ds.frequency_table(target)
wc_ref = ds.frequency_table(reference)

## Creating a keyness table

Now we can generate our keyness table.

In [24]:
kw = ds.keyness_table(wc_target, wc_ref)

Check the table.

In [25]:
kw.head()

| Token | Tag | LL | LR | PV | RF | RF_Ref | AF | AF_Ref | Range | Range_Ref |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| str | str | f64 | f64 | f64 | f64 | f64 | u32 | u32 | f64 | f64 |
| "of" | "IO" | 1554.092129 | 0.985829 | 0.0 | 38898.437188 | 19641.207534 | 4866 | 17626 | 100.0 | 99.714286 |
| "the" | "AT" | 697.526424 | 0.47492 | 1.0318e-153 | 66205.68368 | 47635.444212 | 8282 | 42748 | 100.0 | 99.714286 |
| "social" | "JJ" | 459.036955 | 3.501689 | 7.7887e-102 | 1678.72417 | 148.206093 | 210 | 133 | 52.0 | 20.0 |
| "studies" | "NN2" | 394.0216 | 3.998845 | 1.1025e-87 | 1247.05224 | 78.003207 | 156 | 70 | 44.0 | 6.857143 |
| "in" | "II" | 343.013394 | 0.591128 | 1.408e-76 | 21639.553939 | 14364.847743 | 2707 | 12891 | 100.0 | 99.714286 |


The columns are as follows:

1. **Token**: the token
1. **Tag**: the tag
1. **LL**: the keyness value or [**log-likelihood**](http://ucrel.lancs.ac.uk/llwizard.html), also known as a G2 or goodness-of-fit test.
1. **LR**: the effect size, which here is the [**log ratio**](http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/)
1. **PV**: the *p*-value associated with the log-likelihood
1. **RF**: the relative frequency in the target corpus (per million tokens)
1. **RF_Ref**: the relative frequency in the reference corpus (per million tokens)
1. **AF**: the absolute frequency in the target corpus
1. **AF_Ref**: the absolute frequency in the reference corpus
1. **Range**: the percentage of texts in which the token appears in the target corpus
1. **Range_Ref**: the percentage of texts in which the token appears in the reference corpus
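
To see how the absolute frequency, relative frequency, and range columns relate to one another, here is a small sketch in plain Python. The toy corpus is hypothetical, and the code only illustrates the definitions listed above; it is not how `docuscospacy` computes its tables, and it ignores tags entirely.

```python
from collections import Counter

# Hypothetical toy corpus: (doc_id, token) pairs standing in for parsed tokens
toy_tokens = [
    ("acad_01", "the"), ("acad_01", "study"), ("acad_01", "the"),
    ("acad_02", "the"), ("acad_02", "results"),
    ("acad_03", "study"), ("acad_03", "shows"), ("acad_03", "results"),
]

total_tokens = len(toy_tokens)
n_docs = len({doc_id for doc_id, _ in toy_tokens})

# Absolute frequency: raw count of each token
abs_freq = Counter(token for _, token in toy_tokens)

# Document frequency: count each (doc_id, token) pair only once
doc_freq = Counter(token for _, token in set(toy_tokens))

rows = {
    token: {
        "AF": af,
        "RF": af / total_tokens * 1_000_000,      # per million tokens
        "Range": doc_freq[token] / n_docs * 100,  # percent of documents
    }
    for token, af in abs_freq.items()
}

print(rows["the"])
```

Here "the" occurs 3 times in 8 tokens and appears in 2 of the 3 documents, so a high relative frequency does not by itself guarantee a wide range.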

### Comparing tags instead of tokens

It can be useful sometimes to compare tags (parts-of-speech or DocuScope) instead of individual tokens. For that, the process is the same. But rather than first creating counts of token frequencies, we'll create tables of tag frequencies.

In [27]:
tag_tar = ds.tags_table(target, count_by='pos') # by part-of-speech
tag_ref = ds.tags_table(reference, count_by='pos') # by part-of-speech
ds_ref = ds.tags_table(reference, count_by='ds') # by DocuScope
ds_tar = ds.tags_table(target, count_by='ds') # by DocuScope

And generate a keyness table by setting `tags_only=True`.

In [33]:
kt_pos = ds.keyness_table(tag_tar, tag_ref, tags_only=True)
kt_ds = ds.keyness_table(ds_tar, ds_ref, tags_only=True)

Check the result of the part-of-speech table.

In [34]:
kt_pos.head()

| Tag | LL | LR | PV | RF | RF_Ref | AF | AF_Ref | Range | Range_Ref |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| str | f64 | f64 | f64 | f64 | f64 | u32 | u32 | f64 | f64 |
| "JJ" | 2312.748987 | 0.74229 | 0.0 | 9.610296 | 5.744936 | 12022 | 51555 | 100.0 | 100.0 |
| "NN2" | 2028.82342 | 0.789272 | 0.0 | 7.544666 | 4.365617 | 9438 | 39177 | 100.0 | 100.0 |
| "IO" | 1554.779099 | 0.985961 | 0.0 | 3.890643 | 1.964344 | 4867 | 17628 | 100.0 | 99.714286 |
| "NN1" | 1459.096639 | 0.408936 | 0.0 | 18.369239 | 13.835317 | 22979 | 124158 | 100.0 | 100.0 |
| "AT" | 662.178275 | 0.457909 | 5.019e-146 | 6.731684 | 4.900941 | 8421 | 43981 | 100.0 | 99.714286 |


And the DocuScope table.

In [35]:
kt_ds.head()

| Tag | LL | LR | PV | RF | RF_Ref | AF | AF_Ref | Range | Range_Ref |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| str | f64 | f64 | f64 | f64 | f64 | u32 | u32 | f64 | f64 |
| "AcademicTerms" | 3754.836644 | 1.200984 | 0.0 | 8.171945 | 3.55462 | 8435 | 24796 | 100.0 | 99.714286 |
| "Untagged" | 728.133631 | 0.230456 | 2.2805e-160 | 33.689534 | 28.715787 | 34774 | 200313 | 100.0 | 100.0 |
| "InformationTopics" | 605.072806 | 0.656914 | 1.3196e-133 | 3.841347 | 2.436311 | 3965 | 16995 | 100.0 | 100.0 |
| "Citation" | 521.574429 | 1.357769 | 1.9231e-115 | 0.92328 | 0.36025 | 953 | 2513 | 100.0 | 96.0 |
| "AcademicWritingMoves" | 502.414707 | 2.034075 | 2.8352e-111 | 0.467937 | 0.114254 | 483 | 797 | 96.0 | 74.571429 |


## Effect size

While the LL value produces one important piece of information (the amount of evidence we have for an effect), it neglects another (the magnitude of the effect). Whenever we report on significance it is **critical** to report **effect size**. Some common effect size measures include:

* %DIFF - see Gabrielatos and Marchi (2011)
    + Costas Gabrielatos has also provided an FAQ with more details <http://ucrel.lancs.ac.uk/ll/DIFF_FAQ.pdf>
* Bayes Factor (BIC) - see Wilson (2013)
    + You can interpret the approximate Bayes Factor as degrees of evidence against the null hypothesis as follows:
        - 0-2: not worth more than a bare mention
        - 2-6: positive evidence against $H_0$
        - 6-10: strong evidence against $H_0$
        - more than 10: very strong evidence against $H_0$
    + For negative scores, the scale is read as "in favor of" instead of "against".
* Effect Size for Log Likelihood (ELL) - see Johnston *et al.* (2006)
    + ELL varies between 0 and 1 (inclusive). Johnston *et al.* say "interpretation is straightforward as the proportion of the maximum departure between the observed and expected proportions".
* Relative Risk
* Odds Ratio
* Log Ratio - see Andrew Hardie's CASS blog for how to interpret this
    + Note that if either word has zero frequency then a small adjustment is automatically applied (0.5 observed frequency which is then normalized) to avoid division by zero errors.
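
Three of the measures in this list are simple enough to sketch directly. The functions below use the standard textbook definitions of %DIFF, relative risk, and odds ratio; the counts are hypothetical, and none of this is taken from `docuscospacy`, which reports Log Ratio instead.

```python
def percent_diff(rf_target, rf_ref):
    """%DIFF: percentage difference between two normalized frequencies."""
    return (rf_target - rf_ref) * 100 / rf_ref


def relative_risk(af_target, n_target, af_ref, n_ref):
    """Proportion of the token in the target divided by that in the reference."""
    return (af_target / n_target) / (af_ref / n_ref)


def odds_ratio(af_target, n_target, af_ref, n_ref):
    """Odds of the token (vs. any other token) in the target over the reference."""
    return (af_target / (n_target - af_target)) / (af_ref / (n_ref - af_ref))


# Hypothetical counts: 120 hits in 100,000 words vs. 300 hits in 900,000 words
af_t, n_t = 120, 100_000
af_r, n_r = 300, 900_000

# Normalize to frequencies per million tokens before computing %DIFF
rf_t = af_t / n_t * 1_000_000
rf_r = af_r / n_r * 1_000_000

pdiff = percent_diff(rf_t, rf_r)
rr = relative_risk(af_t, n_t, af_r, n_r)
odds = odds_ratio(af_t, n_t, af_r, n_r)

print(round(pdiff, 1), round(rr, 3), round(odds, 3))
```

All three divide by a reference quantity, so a token that never occurs in the reference corpus needs some adjustment before they can be computed. This sketch leaves that case unhandled.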

### Log Ratio (LR)

The `keyness_table` function returns [Hardie's Log Ratio](https://cass.lancs.ac.uk/log-ratio-an-informal-introduction/), which is easy and intuitive. It is simply the (base 2) logarithm of the ratio of the relative frequencies. So an LR of 1 would indicate that the frequency in the target is 2 times greater than in the reference, an LR of 2 that it is 4 times greater, and an LR of 3 that it is 8 times greater.
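
As a small check on that interpretation, the sketch below computes a Log Ratio from two relative frequencies, using the values reported for "social" in the `kw` table above. The helper name `log_ratio` is just for illustration, and the sketch does not handle zero frequencies.

```python
import math


def log_ratio(rf_target, rf_ref):
    """Base-2 logarithm of the ratio of two relative frequencies."""
    return math.log2(rf_target / rf_ref)


# RF and RF_Ref for "social", copied from the keyness table above
lr_social = log_ratio(1678.72417, 148.206093)

# Each whole unit of LR is a doubling of the ratio
doublings = {ratio: log_ratio(ratio, 1) for ratio in (2, 4, 8)}

print(round(lr_social, 3))
print(doublings)
```

The result for "social" lands on the value in the table's LR column (about 3.5), which means the token is roughly 11 times more frequent in the academic texts than in the reference.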

## P-values

It is also important to interpret and report *p*-values correctly. There is a nice [explanation here](https://www.scottbot.net/HIAL/index.html@p=24697.html).

A *p*-value is compared against a threshold (conventionally 0.05, 0.01, or 0.001): when the *p*-value falls below it, we can claim that a difference is significant or reaches significance.

The threshold we choose is largely dependent on the size of our corpora. With a corpus of many millions of words, at a *p*-value < 0.05, an enormous quantity of tokens would reach significance.

Using a phrase like ["marginally significant"](https://scientistseessquirrel.wordpress.com/2015/11/16/is-nearly-significant-ridiculous/) can be a red flag in some quarters.

> Note that the default threshold in the `keyness_table` function is `threshold=.01`. In other words, it returns values below that threshold.
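
To connect the LL column to these thresholds, the sketch below converts a log-likelihood value into a *p*-value. It assumes the statistic is referred to a chi-square distribution with one degree of freedom, which is the usual choice for a two-corpus comparison; whether `keyness_table` computes its PV column in exactly this way is an assumption here.

```python
import math


def p_value_from_ll(ll):
    """Upper-tail chi-square probability, assuming 1 degree of freedom."""
    return math.erfc(math.sqrt(ll / 2))


# Commonly cited critical values for 1 degree of freedom
for ll in (3.84, 6.63, 10.83):
    print(ll, round(p_value_from_ll(ll), 4))
```

These three LL values correspond to the conventional thresholds of roughly 0.05, 0.01, and 0.001, so any token with an LL above about 6.63 would pass the default `threshold=.01`.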

## Saving your results

To save your results, you need to connect the Colab you're running to your Google Drive.

```
from google.colab import drive
drive.mount('/content/drive')
```

Tables and figures can then be saved directly to your drive. You will need to specify a path. If you're unsure of what the path should be, you can navigate your drive by clicking on the folder icon in the sidebar (under the key icon). You can then hover over the folder where you'd like to save your file, click on the three dots to the right, and copy the path.

For example, to save `kt_pos` to my Drive, I would run:

```
kt_pos.write_csv("/content/drive/MyDrive/76-380-780 MiHA/Mini Labs/output/keyness_pos.csv")
```

## Question 🤔

Looking at the tables above, what do you see as some of the affordances and difficulties or limitations of working with key words?

## On your own

1. Using the keyness table we created (`kw`), filter the table for modal verbs. The [tagset is here](https://ucrel.lancs.ac.uk/claws7tags.html).

> Hint: you will need to use [`filter`](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html) and specify the column using `pl.col("Tag")`


In [None]:
# Your code here

2. Now try to filter for **ALL** common nouns.

> Hint: Noun tags start with "NN".

In [None]:
# Your code here

3. [Sort](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sort.html#polars.DataFrame.sort) the table by Log Ratio (from largest to smallest)

In [None]:
# Your code here