# Intro to Quantitative Textual Analysis - Week 10

## Diachronic corpora: Change over time (Brezina 2018, ch. 7)

### Key terms

- longitudinal study
- diachronic corpora
- lockwords
- bootstrapping test

### Visualization techniques

- Line graph
- Candlestick plot
- Sparkline

## Warm-up: N-grams

By now, we're all familiar with some charts created through Google's [Ngram Viewer](https://books.google.com/ngrams/). But what do these charts actually show us, and what are their limitations?

Search for a few terms of your choosing in the Ngram Viewer. Try changing the time scale or zooming in and out.

With a partner or in small groups, discuss the following questions:

1. What do these n-grams show us?
2. What corpus are they using?
3. What are the limitations of this kind of charting?
4. What data are we missing for more sophisticated analyses?
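
As a starting point for question 1, here is a minimal sketch of what a single point on an n-gram chart might represent: the share of all n-grams of a given length in one year's texts that match the search phrase. The toy corpus and the helper names `ngrams` and `relative_frequency` are hypothetical, and the Ngram Viewer's real pipeline (tokenization, smoothing, corpus selection) is more involved than this.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(tokens, phrase):
    """Share of same-length n-grams in `tokens` that equal `phrase`."""
    target = tuple(phrase.split())
    grams = ngrams(tokens, len(target))
    if not grams:
        return 0.0
    return Counter(grams)[target] / len(grams)


# Hypothetical toy corpus: one tokenized text per year.
toy_corpus = {
    1600: "the red rose and the white rose".split(),
    1601: "the red rose and the red coat and the red hat".split(),
}

for year, tokens in toy_corpus.items():
    print(year, round(relative_frequency(tokens, "red rose"), 3))
```

Because the count is divided by the number of n-grams in that year's text, a longer text does not automatically produce a higher value. That is worth keeping in mind when discussing what the chart can and cannot tell us.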

## Colors over time

In the `./data` directory, you'll find `colours-data.csv`, a CSV dataset provided by Brezina. Each row has the year, followed by the relative frequencies of several colors for that year. We'll use these data to practice visualizing change over time.

In [3]:
# install dependencies
%pip install altair pandas

Note: you may need to restart the kernel to use updated packages.


In [4]:
# import altair for visualization
import altair as alt

# import pandas for data-wrangling
import pandas as pd

In [5]:
# load the data
colors_df = pd.read_csv("data/colours-data.csv")
colors_df


Unnamed: 0,Year,red,blue,green,yellow,orange,grey
0,1600,38.13,0.15,33.34,6.28,6.28,1.05
1,1601,64.43,1.01,41.61,23.16,23.16,0.00
2,1602,45.30,1.20,21.79,7.38,7.38,0.34
3,1603,33.03,0.45,25.79,7.99,7.99,1.66
4,1604,24.60,0.96,21.33,3.46,3.46,2.69
...,...,...,...,...,...,...,...
95,1695,44.78,5.80,23.54,12.79,12.79,2.73
96,1696,37.30,5.27,31.43,15.35,15.35,3.20
97,1697,48.87,10.09,27.53,20.72,20.72,1.95
98,1698,57.86,5.20,28.54,13.52,13.52,2.47


In [6]:
# pivot the table for easier charting
value_vars = colors_df.columns.to_list()[1:]
by_color = pd.melt(
    colors_df, id_vars=["Year"], value_vars=value_vars, value_name="relative frequency"
).rename(columns={"variable": "color"})

by_color['Year'] = by_color['Year'].astype(str)

In [7]:
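def decade(year) -> int:
    """Round a year (int or numeric string) down to the start of its decade, e.g. 1647 -> 1640."""
    return int(year) // 10 * 10

by_color['decade'] = by_color['Year'].apply(decade)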

In [8]:
# chart the data: one line per color, with hoverable points
alt.Chart(by_color).mark_line(
    point=alt.OverlayMarkDef(tooltip=True, filled=False, fill="black")
).encode(
    x="Year:T",
    y="relative frequency:Q",
    # scale(None) uses each color name as the literal line color
    color=alt.Color("color").scale(None),
).interactive().properties(width=1000)


In [9]:
alt.Chart(by_color).mark_boxplot(extent='min-max').encode(
    # decade is an integer (1600, 1610, ...), so treat it as ordinal rather than temporal
    alt.X('decade:O'),
    alt.Y('relative frequency:Q'),
    alt.Color("color").scale(None)
).properties(width=1000)

### Your turn

> Discuss: What kinds of information can you glean from this chart? How can you make the chart more useful?

1. Using the [Altair](https://altair-viz.github.io/user_guide/data.html) docs, add a tooltip showing the year, color, and relative frequency when you hover over a point.
2. The visualization helps very little for grey and blue -- can you figure out how to "zoom in" and get a meaningful sense of their change over time?
3. Can you regroup the data by decade (e.g., `[1600, 1609], [1610, 1619], etc.`) and plot the results as a [box plot](https://altair-viz.github.io/user_guide/marks/boxplot.html)?

# Bootstrapping and percentage change

In this section, we'll use the same dataset to explore "bootstrapping," a "process of multiple resampling" that "gives an insight into the amount of variation in the data and gives us the confidence to generalize from this sample" [@Brezina2018 231].

But first, let's start with a more intuitive method.

## Percentage change

```math
\text{\% increase/decrease} = \frac{\text{relative frequency in corpus 2} - \text{relative frequency in corpus 1}}{\text{relative frequency in corpus 1}} \times 100
```
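
To make the formula concrete, here is a small sketch that applies it to two made-up relative frequencies. The function name `percentage_change` and the numbers are illustrative assumptions, not values from the colours dataset.

```python
def percentage_change(freq_corpus1, freq_corpus2):
    """Percentage increase (positive) or decrease (negative) from corpus 1 to corpus 2."""
    if freq_corpus1 == 0:
        raise ValueError("percentage change is undefined when the corpus 1 frequency is 0")
    return (freq_corpus2 - freq_corpus1) / freq_corpus1 * 100


# Made-up relative frequencies, not taken from the colours dataset.
print(percentage_change(40.0, 50.0))   # 25.0
print(percentage_change(40.0, 30.0))   # -25.0
```

Note that the result is always relative to corpus 1, so a word that is very rare in corpus 1 can show a very large percentage change from a small absolute difference.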

Let's divide our colors corpus into two halves by year, split at 1650, corresponding to the two halves of the 17th century.

In [10]:
colors_df_1600_to_1649 = colors_df[colors_df['Year'].astype(int) <= 1649]
colors_df_1650_to_1699 = colors_df[colors_df['Year'].astype(int) >= 1650]

We'll use the mean relative frequency per color to calculate the percentage change.

In [11]:
red_1600 = colors_df_1600_to_1649['red'].mean()
red_1650 = colors_df_1650_to_1699['red'].mean()
red_change = ((red_1650 - red_1600) / red_1600) * 100

red_change

np.float64(2.642137081067606)

In [12]:
def perChange(x: str):
    """Print the percentage change in the mean relative frequency of color `x` between the two halves of the century."""
    first_half = colors_df[colors_df['Year'].astype(int) <= 1649]
    second_half = colors_df[colors_df['Year'].astype(int) >= 1650]
    color_1600 = first_half[x].mean()
    color_1650 = second_half[x].mean()
    color_change = ((color_1650 - color_1600) / color_1600) * 100

    print(f"{x}: {color_change}")

In [13]:
perChange('green')
perChange('blue')

green: -16.310438417930705
blue: 104.65306122448983


### Your turn

Perform the same calculations for the other colors. 

Can you generalize the calculation of percentage change by writing a function?

## Bootstrap test

"The **bootstrap test** [proposed in [@Lijffijt.etal2016]] is a non-parametric test of statistical significance, ... which compares two corpora and computes the p-value associated with the comparison." [@Brezina2018 231–232]

We can define the bootstrapping test mathematically as follows:

```math
\text{p} = \frac{1 + 2 \times \text{number of bootstrapping cycles} \times \text{(p1 or 1 – p1, whichever is smaller)}}{1 + \text{number of bootstrapping cycles}}
```

where

```math
p1 = \frac{\text{sum of the values of H over all bootstrapping cycles}}{\text{number of bootstrapping cycles}}
```

and where H can be 1 (value of interest in corpus1 > corpus2), 0.5 (value of interest in corpus1 == corpus2), or 0 (value of interest in corpus1 < corpus2).
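
As a worked example of the two formulas, the sketch below takes a list of H values (one per bootstrapping cycle) and returns p. The function name `bootstrap_p_from_h` and the four H values are hypothetical; a real test would use far more cycles.

```python
def bootstrap_p_from_h(h_values):
    """Apply the two formulas above to a list of H values (each 0, 0.5, or 1)."""
    n_cycles = len(h_values)
    p1 = sum(h_values) / n_cycles          # share of cycles "won" by corpus 1
    smaller = min(p1, 1 - p1)              # p1 or 1 - p1, whichever is smaller
    return (1 + 2 * n_cycles * smaller) / (1 + n_cycles)


# Hypothetical outcomes of four bootstrapping cycles.
h_values = [1, 1, 1, 0]
print(bootstrap_p_from_h(h_values))  # 0.6
```

When every cycle points the same way (all H values 1, or all 0), the smaller of p1 and 1 - p1 is 0, so p reduces to 1 / (1 + number of cycles). This is the smallest p the test can report for a given number of cycles.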

**Note that resampling can include duplicates from the dataset.**
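
To see what "can include duplicates" means in practice, here is a minimal sketch of one bootstrap resample using only the standard library. The five values are made up, and `random.Random.choices` stands in for whatever resampling routine you prefer (NumPy's `np.random.choice(..., replace=True)` behaves analogously).

```python
import random

rng = random.Random(10)  # fixed seed so the sketch is repeatable

# Hypothetical relative frequencies for five texts.
values = [0.0, 120.5, 475.6, 903.1, 1520.3]

# One bootstrap resample: same size as the original, drawn with replacement.
resample = rng.choices(values, k=len(values))

print(resample)
print("distinct values drawn:", len(set(resample)), "of", len(values))
```

Because each draw is independent, some original values may appear more than once in the resample and others not at all. This is the source of the variation between bootstrapping cycles.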

Again, you'll find the data that we're working with in the data/ folder -- this time, we're using the "bootstrap" CSVs (again provided by Brezina).

In [14]:
df = pd.read_csv("./data/bootstrap_its.csv")
df

Unnamed: 0,ID_its,1650_59,1660_69
0,1,0.00,0.00
1,2,662.01,0.00
2,3,191.36,0.00
3,4,1625.28,0.00
4,5,475.62,0.00
...,...,...,...
4222,4223,67.85,1188.80
4223,4224,903.14,1483.68
4224,4225,1333.04,0.00
4225,4226,1520.25,1842.75


### Your turn

Calculate the p-value for the bootstrap_*.csv data sets using Brezina's method. (Note that SciPy includes a `bootstrap` function, `scipy.stats.bootstrap`, but it differs from Brezina's definition.)

Be sure to calculate Cohen's _d_ and the 95% Confidence Interval as well. Are your results the same as Brezina's?

In [15]:
import numpy as np

data1 = df['1650_59']
data2 = df['1660_69']

In [None]:
def H(x, y):
    if x < y:
        return 0.0
    elif x == y:
        return 0.5
    else:
        return 1.0

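def p1(data1, data2, N: int):
    """Mean of H over N bootstrapping cycles, comparing the resampled means of the two corpora."""
    hVal = []
    for _ in range(N):
        # resample each corpus with replacement, keeping its original size
        sample1 = np.random.choice(data1, size=len(data1), replace=True)
        sample2 = np.random.choice(data2, size=len(data2), replace=True)

        stat1 = np.mean(sample1)
        stat2 = np.mean(sample2)

        hVal.append(H(stat1, stat2))

    return np.mean(hVal)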

def bootstrap_p_value(data1, data2, N):
    p1_value = p1(data1, data2, N)
    p1_min = min(p1_value, 1 - p1_value)
    p = (1 + 2 * N * p1_min)/(1 + N)
    return p

p_val = bootstrap_p_value(data1, data2, 4227)
print(f"p-value: {p_val}")

p-value: 0.00023651844843897824


In [24]:
def cohen_d(x1, x2):
    n1 = len(x1)
    n2 = len(x2)

    mean1 = np.mean(x1)
    mean2 = np.mean(x2)

    s1 = np.std(x1, ddof = 1)
    s2 = np.std(x2, ddof = 1)

    sPool = np.sqrt(((n1-1)*s1**2 + (n2 - 1)*s2**2) / (n1+n2-2))

    d = (mean1 - mean2)/sPool
    return d

print(f"Cohen's d: {cohen_d(data1,data2)}")

Cohen's d: -0.10027441938146146


In [32]:
def bootstrap_d(x1, x2, N=4227):
    """Percentile bootstrap 95% confidence interval for Cohen's d over N resamples."""
    d_values = []
    n1 = len(x1)
    n2 = len(x2)

    # use the N argument rather than a hard-coded number of cycles
    for _ in range(N):
        # resample each group with replacement, keeping its original size
        resam1 = np.random.choice(x1, size=n1, replace=True)
        resam2 = np.random.choice(x2, size=n2, replace=True)

        d_values.append(cohen_d(resam1, resam2))

    lowerbound = np.percentile(d_values, 2.5)
    upperbound = np.percentile(d_values, 97.5)

    return (lowerbound, upperbound)

    
print(f"95% confidence interval for Cohen's d: {bootstrap_d(data1, data2, 4227)}")

95% confidence interval for Cohen's d: (np.float64(-0.1434693979728853), np.float64(-0.05792234223719922))
