# Intro to Quantitative Textual Analysis - Week 8: Correlations and Classifications

## Brezina 2018 ch. 5: Correlations

### Pearson's correlation (pp. 142–146)

Pearson's correlation (r) is expressed as follows:

```math
r = \frac{\text{covariance}}{SD_1 \times SD_2}
```

Covariance, in turn, is expressed:

```math
\text{covariance} = \frac{\text{sum of multiplied distances from } mean_1 \text{ and } mean_2}{\text{total no. of cases} - 1}
```

For example, suppose that we have five documents with $N_a$ adjectives and $N_n$ nouns:

In [6]:
docs = [(5, 10), (12, 15), (14, 25), (15, 26), (20, 30)]

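def covariance(corpus: list[tuple[int, int]]) -> float:
    # Mean of each variable across all documents
    mean_1 = sum(a for a, _ in corpus) / len(corpus)
    mean_2 = sum(b for _, b in corpus) / len(corpus)

    # Sum of multiplied distances from the two means, divided by (cases - 1)
    return sum((mean_1 - a) * (mean_2 - b) for a, b in corpus) / (len(corpus) - 1)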

docs_covariance = covariance(docs)
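
To see what the function is doing, here is the same calculation written out by hand for the five made-up documents. The means are 13.2 adjectives and 21.2 nouns. The distances below are written as value minus mean, whereas the code subtracts the other way around; the products come out the same either way.

```math
\text{covariance} = \frac{(-8.2)(-11.2) + (-1.2)(-6.2) + (0.8)(3.8) + (1.8)(4.8) + (6.8)(8.8)}{5 - 1} = \frac{170.8}{4} = 42.7
```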

We can calculate the standard deviation for each variable (number of adjectives and number of nouns) using the **sample standard deviation** from Brezina 2018 p. 50:

```math
\text{standard deviation}_\text{sample} = \sqrt{\frac{\text{sum of squared distances from the mean}}{\text{total no. of cases} - 1}}
```

In [5]:
import math

def sd_sample(arr: list[int]):
    mean = sum(arr) / len(arr)

    return math.sqrt(sum([(mean - x)**2 for x in arr]) / (len(arr) - 1))

sd_1 = sd_sample([a for a, _ in docs])
sd_2 = sd_sample([b for _, b in docs])

docs_covariance / (sd_1 * sd_2)

0.9384978052288936
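
Written out by hand, the sums of squared distances from the mean are 118.8 for the adjective counts and 278.8 for the noun counts, and the covariance of the two is 42.7, so the arithmetic behind the result above is roughly:

```math
SD_1 = \sqrt{\frac{118.8}{5 - 1}} = \sqrt{29.7} \approx 5.45 \qquad SD_2 = \sqrt{\frac{278.8}{5 - 1}} = \sqrt{69.7} \approx 8.35
```

```math
r = \frac{42.7}{\sqrt{29.7} \times \sqrt{69.7}} \approx 0.938
```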

In this case, Pearson's correlation indicates a _very_ strong positive correlation between the number of adjectives and the number of nouns in these (made-up) documents.

Pearson's correlation will always range between -1 and +1: negative numbers indicate a negative correlation, and positive numbers indicate a positive correlation.
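
As a quick sanity check on those bounds, the sketch below folds the two formulas into a single helper (the name `pearson_r` and both data sets are made up for illustration) and feeds it data in which one variable rises or falls in exact proportion to the other. We would expect results at, or within floating-point error of, +1 and -1.

```python
import math


def pearson_r(pairs: list[tuple[float, float]]) -> float:
    """Pearson's r from (x, y) pairs, using sample covariance and sample SDs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / (n - 1))
    return cov / (sd_x * sd_y)


# Made-up data: y changes in exact proportion to x.
rising = [(x, 2 * x + 1) for x in range(1, 6)]
falling = [(x, -3 * x + 20) for x in range(1, 6)]

r_rising = pearson_r(rising)    # a perfect positive relationship
r_falling = pearson_r(falling)  # a perfect negative relationship
```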

### What to report with correlation measures

As you can probably guess, it's important to report a p-value or confidence interval with your correlation statistics in order to give your readers a sense of statistical significance (which, as it turns out, depends heavily on the number of observations: the same _r_ yields a smaller p-value as the number of cases grows).
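
To get a feel for how much the number of observations matters, here is a minimal sketch of one common way to approximate a confidence interval for _r_, the Fisher z-transformation. This is an approximation chosen for illustration, not necessarily the procedure Brezina uses, and the helper name `fisher_ci` and the 50-document scenario are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_ci(r: float, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for Pearson's r (Fisher z-transformation)."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)          # move r onto an approximately normal scale
    se = 1 / math.sqrt(n - 3)  # the standard error shrinks as n grows
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


r = 0.9384978052288936    # the r for our five made-up documents
ci_5 = fisher_ci(r, 5)    # the five documents we actually have
ci_50 = fisher_ci(r, 50)  # hypothetical: the same r from fifty documents

print(f"n = 5:  {ci_5[0]:.2f} to {ci_5[1]:.2f}")
print(f"n = 50: {ci_50[0]:.2f} to {ci_50[1]:.2f}")
```

With only five documents the interval is very wide, which is a good reason not to lean too hard on a single impressive-looking _r_.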

Note that the functions that we wrote work on input arrays of any length of at least two (with a single case, the `- 1` in the denominator leads to a division by zero), as long as their respective type signatures are obeyed (a list of 2-tuples of `int`s for `covariance` and a list of `int`s for `sd_sample`).

## Using scipy instead of calculating Pearson's _r_ by hand

Instead of calculating Pearson's _r_ and associated p-values by hand, we can use the `scipy` library.

In [None]:
from scipy import stats

# zip(*docs) is a python-ism for "split this list of tuples into separate tuples"
x, y = zip(*docs)

stats.pearsonr(x, y)

PearsonRResult(statistic=np.float64(0.9384978052288937), pvalue=np.float64(0.018139369943329754))
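
The result object can also be unpacked for reporting. The sketch below assumes a reasonably recent SciPy, in which `pearsonr` returns a `PearsonRResult` (as in the output above) that also offers a `confidence_interval` method; the variable names are our own.

```python
from scipy import stats

docs = [(5, 10), (12, 15), (14, 25), (15, 26), (20, 30)]
x, y = zip(*docs)

result = stats.pearsonr(x, y)

r = result.statistic  # Pearson's r
p = result.pvalue     # two-sided p-value by default
# Assumption: the installed SciPy provides this method on the result object
ci = result.confidence_interval(confidence_level=0.95)

summary = f"r = {r:.3f}, p = {p:.3f}, 95% CI [{ci.low:.3f}, {ci.high:.3f}]"
print(summary)
```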

# In-Class Lab 1

In the `treebanks/` directory, you'll find annotated versions of several of Sophocles' plays. (You don't need to know Greek to complete this exercise.)

Each file contains sentence- and word-level annotations. We'll focus on the `word` elements, which look something like this:

`<word id="12" form="ἐκμάθοις" lemma="ἐκμανθάνω" postag="v2saoa---" head="8" relation="ATR" cite="urn:cts:greekLit:tlg0011.tlg001:2"/>`

The `postag` ("part-of-speech tag") attribute will be our focus for this exercise. Its first letter tells us the part of speech for a given word --- `v` for "verb," `a` for "adjective," `n` for "noun," and so on.
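
As a small illustration of reading the tag (no XML parsing yet), the sketch below tallies parts of speech from a handful of tag strings. Only the first tag comes from the example above; the other three are made up for illustration.

```python
from collections import Counter

# The first tag is the one quoted above; the other three are made up.
postags = ["v2saoa---", "n-s---fn-", "a-s---fn-", "n-p---ma-"]

# The first character of each tag is the part of speech.
pos_counts = Counter(tag[0] for tag in postags if tag)

print(pos_counts)
```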

Using the [`lxml`](https://lxml.de) library, write an xpath expression for parsing the `word` tags and their respective `@postag` attributes.

Then determine if the correlation between the number of adjectives and number of nouns in a document that we see in the BNE corpus in Brezina 2018 also holds true for this subset of the Sophoclean corpus.

Finally, run similar analyses for correlations between nouns and verbs and between verbs and adjectives.

In [1]:
from lxml import etree
from scipy import stats

import os

directory = "treebanks"
files = [
    f"{directory}/{f}"
    for f in os.listdir(directory)
    if os.path.isfile(f"{directory}/{f}")
]


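def get_words_from_treebank(f):
    tree = etree.parse(f)

    # XPath: select every <word> element anywhere in the document
    return tree.xpath("//word")


def count_pos(words, pos: str):
    # The first character of @postag is the part of speech; words without a tag are skipped
    return len([w for w in words if w.get("postag", "").startswith(pos)])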


docs = []

for f in files:
    words = get_words_from_treebank(f)
    n_adj = count_pos(words, "a")
    n_noun = count_pos(words, "n")
    n_verb = count_pos(words, "v")

    docs.append(dict(filename=f, n_adj=n_adj, n_noun=n_noun, n_verb=n_verb))

correl_adj_noun = stats.pearsonr([d['n_adj'] for d in docs], [d['n_noun'] for d in docs])
correl_adj_verb = stats.pearsonr([d['n_adj'] for d in docs], [d['n_verb'] for d in docs])
correl_noun_verb = stats.pearsonr([d['n_noun'] for d in docs], [d['n_verb'] for d in docs])

f"correl_adj_noun: {correl_adj_noun}, correl_adj_verb: {correl_adj_verb}, correl_noun_verb: {correl_noun_verb}"

'correl_adj_noun: PearsonRResult(statistic=np.float64(0.47723925112032944), pvalue=np.float64(0.1630785642475298)), correl_adj_verb: PearsonRResult(statistic=np.float64(0.9304682054809317), pvalue=np.float64(9.397365857894094e-05)), correl_noun_verb: PearsonRResult(statistic=np.float64(0.23544158280947364), pvalue=np.float64(0.5125839641001063))'

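As you can see, only the adjective-verb correlation (r ≈ 0.93, p < 0.001) is statistically significant. For the adjective-noun and noun-verb pairs, the p-values (roughly 0.16 and 0.51) are far too high to reject the null hypothesis, so we need significantly more data before we can say anything about those correlations.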
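
One way to read the three results side by side is to compare each p-value against a threshold. The sketch below uses the conventional 0.05, which is an assumption on our part rather than something the exercise prescribes, and copies the (rounded) values from the output above.

```python
ALPHA = 0.05  # conventional threshold; an assumption, not prescribed by the exercise

# (r, p) pairs copied from the output above, rounded
results = {
    "adj_noun": (0.4772, 0.1631),
    "adj_verb": (0.9305, 9.397e-05),
    "noun_verb": (0.2354, 0.5126),
}

# True where the p-value falls below the threshold
significant = {name: p < ALPHA for name, (r, p) in results.items()}

for name, (r, p) in results.items():
    verdict = "reject H0" if significant[name] else "cannot reject H0"
    print(f"{name}: r = {r:.2f}, p = {p:.4f} -> {verdict}")
```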

Your task, then, is to find additional texts to supplement the texts that have been provided here. You might need to run POS-tagging on them using spaCy or a similar library.

Please be sure to include a justification of whatever texts you include as you build out this small "corpus."

## Further reading

### Factor Analysis

Brezina 2018 (164) describes **factor analysis** as "a complex mathematical procedure that reduces a large number of linguistic variables. This is done by considering correlations between variables...; those that correlate -- both positively and negatively -- are considered components of the same factor because they have a connection.... A **factor** is thus a group of related linguistic variables summarizing a more general tendency ... in the data."

If you are interested in following up on factor analysis and using it for your final project, be sure to read over the rest of chapter 5.

# Text Classification with BERT

