# Preliminaries

## Import statements

These imports give us access to code that isn't loaded by default when Python starts.

In [None]:
import os
import pandas as pd
from matplotlib import pyplot as plt

## Load the data

Here we load the feature table calculated in Notebook 6.

In [None]:
# load the feature table
csv_file = os.path.join('data', 'features_by_line.csv')
feat_count_line = pd.read_csv(csv_file)

# convert the line_ids to Categorical values
feat_count_line['line_id'] = pd.Categorical(
    values = feat_count_line['line_id'], 
    categories = feat_count_line['line_id'].unique(), 
    ordered = True,
)

# make line_ids the table index
feat_count_line = feat_count_line.set_index('line_id')

# show the table
display(feat_count_line)

# Sampling 

Now that we have a line-based feature table, the next step is to create a series of larger samples using a **sliding window**. When we're done, we'll have samples of a fixed number of lines that overlap with each other at the edges.

## Fixed-size samples

To start with, we'll forget about the overlap and just practice cutting the table into samples of a fixed number of lines. Let's create a variable called `size` that stores the size of the window in lines. Let's say we start with something small, like `5`, so we can easily see the effects.

Now we want to break up the line-based table into groups of `size` consecutive rows at a time. We need to label each line with a **sample ID** that we can group by. For example, if `size` is 5, we want to generate something that looks like
```
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, ...]
```

i.e., something that repeats `size` times in a row and then increments. Then we could feed this into `groupby()` to chunk the line-based table.

## Sidebar: floor division

One easy way to generate exactly this kind of sequence is with Python's **floor division** operator, `//`. Floor division, also called "integer division," is like regular division except that it rounds the answer down to the nearest whole number. For non-negative numbers, that's the same as dropping the decimal part.

In [None]:
a = 10
b = 3

# regular division
print(f'{a}/{b} =', a/b)

# floor division
print(f'{a}//{b} =', a//b)
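Floor division rounds *down*, which is not quite the same as dropping the decimal part once negative numbers are involved. Here's a small sketch of the difference; the names `examples`, `floor_results` and `truncated` are just for illustration.

```python
# floor division rounds toward negative infinity, not toward zero
examples = [(7, 3), (-1, 3), (-4, 3)]
floor_results = {(a, b): a // b for a, b in examples}

for (a, b), quotient in floor_results.items():
    print(f'{a}//{b} =', quotient)

# int() drops the decimal part instead, so the two disagree for negatives
truncated = int(-1 / 3)
print('int(-1/3) =', truncated)
```

This matters later on, when we start subtracting an offset from the line numbers before dividing.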

Watch what happens if we run through a sequence of dividends with a constant divisor:

In [None]:
max_a = 10
b = 3

pd.DataFrame({
    'a': a,
    'b': b,
    'a//b': a//b,
} for a in range(max_a))
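To connect this back to the sequence we wanted earlier, here's a minimal sketch that floor-divides a run of serial numbers by a window size of 5. The names `sample_numbers` and `serial` are made up for this example.

```python
size = 5

# floor-divide the serial numbers 0..14 by the window size
sample_numbers = [serial // size for serial in range(15)]
print(sample_numbers)
```

Each value repeats `size` times in a row and then increments, which is the grouping label we were after.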

## Generating sample IDs from line IDs

This is just the behaviour we're looking for. Now all we need is a steadily increasing serial number for each line, and we can use floor division by `size` to produce our sample ID. Luckily, the Categorical data type that we used for our **line_id** column stores each category label as an integer code (that's how it remembers the sort order).

Here we create a set of line-based labels that convert the line URNs to serial integers and then to sample IDs based on window size. Multiplying by `size` again after the floor division turns the running count `0, 1, 2, ...` into `0, 5, 10, ...`, so that each sample ID equals the serial number of the first line in its sample.

In [None]:
size = 5

line_labels = pd.DataFrame(dict(
    line_urn = feat_count_line.index,
    line_serial = feat_count_line.index.codes,
    sample_id = feat_count_line.index.codes // size * size,
))
display(line_labels)
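If the `.codes` attribute feels a bit magical, here's a toy sketch with a handful of invented line ids (they aren't real URNs from the corpus). The ids are deliberately not in alphabetical order, to show that the codes follow the order of the categories we supplied.

```python
import pandas as pd

# hypothetical line ids, in the order they occur in the "text"
toy_ids = ['b.1', 'b.2', 'a.1', 'a.2', 'a.3', 'a.4', 'a.5']

toy_index = pd.CategoricalIndex(
    pd.Categorical(values=toy_ids, categories=toy_ids, ordered=True)
)

# integer codes follow category order, not alphabetical order
print(list(toy_index.codes))

# floor-divide, then multiply back to get the serial of each sample's first line
toy_size = 3
toy_sample_ids = toy_index.codes // toy_size * toy_size
print(list(toy_sample_ids))
```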

## Collect line-based feature counts by sample ID

Now that we can generate a set of sample IDs, we can use `groupby()` to gather our line-based feature counts into counts over `size`-line samples.

In order to avoid calculating a lot of feature counts I don't really need, I'm going to identify some features of interest and select just a subset of the columns in the feature table *before* I do the `groupby()`.

In [None]:
# features of interest
features = ['pos_ADJ', 'pos_ADV', 'person_2', 'person_3']

# group by sample id
feat_count_sample = feat_count_line[features].groupby(line_labels.sample_id.values).agg('sum')

# inspect results
display(feat_count_sample)
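Notice that we passed `groupby()` an array of labels rather than a column name. As a rough illustration of what that does, here's a toy table with invented counts; `toy_counts` and `toy_keys` are names I've made up for the example.

```python
import numpy as np
import pandas as pd

# invented feature counts for five lines
toy_counts = pd.DataFrame(
    {'pos_ADJ': [1, 0, 2, 1, 3], 'pos_ADV': [0, 1, 1, 0, 2]},
    index=['l1', 'l2', 'l3', 'l4', 'l5'],
)

# one group label per row, in row order
toy_keys = np.array([0, 0, 0, 3, 3])

# rows that share a label are summed together
toy_sums = toy_counts.groupby(toy_keys).agg('sum')
print(toy_sums)
```

The labels are matched to rows by position, so the array has to be exactly as long as the table.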

## Index samples by line ID

This is looking good. The table index (row labels) is the **sample_id** column from the line-based table. As we saw above, it corresponds to the serial number of the first line in the sample. I'm going to change this to the **line_id** of the first line in the sample, just to make it easier to read.

In [None]:
# re-index samples based on line id
feat_count_sample = feat_count_sample.set_index(
    line_labels['line_urn'].groupby(line_labels['sample_id']).agg('first')
)

# inspect results
display(feat_count_sample)

## Inspect the shape of the resulting samples

Depending on the size of the window and where it falls, some samples may not have the full complement of lines. At the same time, different lines have different numbers of tokens, and I'm not sure how that will affect the number of tokens per sample.

Just to keep my eye on variance in sample size, then, I'm going to calculate both the number of lines and the number of tokens in each sample. We can use this to cull outliers if we need to.

In [None]:
sample_info = feat_count_line.groupby(line_labels.sample_id.values).agg(
    lines = ('lemma_ALL', 'count'),
    tokens = ('lemma_ALL', 'sum'),
).set_index(
    line_labels['line_urn'].groupby(line_labels['sample_id']).agg('first')
)
display(sample_info)
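The cell above asks for two different summaries of the same column. Here's a small sketch of that pattern on invented numbers, assuming (as the cell above does) that **lemma_ALL** holds the number of tokens in each line.

```python
import numpy as np
import pandas as pd

# invented token counts for five lines
toy_lines = pd.DataFrame({'lemma_ALL': [6, 7, 5, 8, 6]})
toy_keys = np.array([0, 0, 0, 3, 3])

# 'count' tallies the rows in each group; 'sum' adds up their values
toy_info = toy_lines.groupby(toy_keys).agg(
    lines = ('lemma_ALL', 'count'),
    tokens = ('lemma_ALL', 'sum'),
)
print(toy_info)
```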

## Moving the sliding window

Using this method, we can theoretically chunk our lines into samples of any size we want. But the other piece of the sampling process is moving the window, so that we have samples that start on any given line. Alongside the window **size**, we're going to introduce a new variable, the window **offset**. This is a number less than `size` that adjusts where the first window is placed relative to the first line.

In [None]:
size = 5
offset = 1

line_labels = pd.DataFrame(dict(
    line_urn = feat_count_line.index,
    line_serial = feat_count_line.index.codes,
    sample_id = (feat_count_line.index.codes - offset) // size * size + offset,
))
display(line_labels)

### 🤔 What's different here?

- At least one of the samples has a negative id: does that make sense?
- The first non-negative sample id is `1` instead of `0`
- The first line is in a sample all by itself
- Sample IDs change on lines ending with `1` and `6` now, not `0` and `5`
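To see where the negative id comes from, we can work the formula through by hand on twelve imaginary lines. This is only a sketch; `sample_id()` is a helper I've written for illustration, not something defined elsewhere in the notebook.

```python
def sample_id(serial, size, offset):
    # same arithmetic as the sample_id column above
    return (serial - offset) // size * size + offset

size = 5
offset = 1
ids = [sample_id(serial, size, offset) for serial in range(12)]
print(ids)
```

Line 0 falls *before* the first full window, so subtracting the offset gives `-1`, which floor division rounds down to the previous window. Its id is the line on which that window would have started, if the text had extended that far back.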

## Calculate info on the new samples

Let's redo our line and token counts for the new sample set.

In [None]:
sample_info = feat_count_line.groupby(line_labels.sample_id.values).agg(
    lines = ('lemma_ALL', 'count'),
    tokens = ('lemma_ALL', 'sum'),
).set_index(
    line_labels['line_urn'].groupby(line_labels['sample_id']).agg('first')
)
display(sample_info)

Whereas with no offset the first sample had 5 lines and the last one had 4, now the first has 1 and the last has 3. All the other samples still have 5 lines, but now they begin on, and are labelled by, different lines in the text.

## Putting it all together

If we run this process many times, then, changing `offset` each time, we can create samples of `size` lines that begin on every possible line in the corpus. If we then concatenate all these samples into one giant table, we'll have our overlapping sample set.
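We can check that claim on a small scale without touching the real data. The sketch below collects the starting line of every full-size window for each offset; `window_starts()` and the twelve-line "text" are assumptions made for the example.

```python
def window_starts(n_lines, size, offset):
    # sample ids for every line, using the same formula as above
    ids = {(serial - offset) // size * size + offset for serial in range(n_lines)}
    # keep only windows that lie completely inside the text
    return sorted(s for s in ids if s >= 0 and s + size <= n_lines)

n_lines = 12
size = 5

all_starts = sorted(
    start
    for offset in range(size)
    for start in window_starts(n_lines, size, offset)
)
print(all_starts)
```

Every line that has room for a full window after it shows up exactly once as the start of a sample.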

In order to systematically perform all these steps many times while varying the parameters, it makes sense to rewrite them as a custom function. For each new offset, we can just call our function with a different argument. This is the **DRY** principle: "Don't repeat yourself." Writing a custom function makes our workflow better in a few important ways:

- we don't have to type the same stuff over and over again
- we don't have to name or keep track of a lot of very similar tables
- if (when) we make a mistake, we only have to correct it in one place

Here's the full process for a given window size, offset, and features of interest. A couple of small things:

- I'm joining the line and token counts directly to the sample table for simplicity.
- I'm adding an option to drop samples that don't have the full complement of lines.

In [None]:
def get_samples(size, offset, features, drop=False):
    '''Sum line-level feature counts over windows of `size` lines.

    Reads the global `feat_count_line` table. Windows are shifted by
    `offset` lines; with `drop=True`, short samples are discarded.
    '''
    
    # generate line labels
    line_labels = pd.DataFrame(dict(
        row = feat_count_line.index,
        sample_id = (feat_count_line.index.codes - offset) // size * size + offset,
    ))
    
    # generate samples
    samples = (feat_count_line
        .groupby(line_labels.sample_id.values)
        .agg(
            lines = ('lemma_ALL', 'count'),
            tokens = ('lemma_ALL', 'sum'),
        )
        .join(
            feat_count_line[features]
                .groupby(line_labels.sample_id.values)
                .agg('sum')
        )
        .set_index(
            line_labels.row.groupby(line_labels.sample_id.values).agg('first')
        )
    )
    
    # optionally drop small samples
    if drop:
        samples = samples.loc[samples.lines == size]
    
    return samples
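One thing to be aware of is that `get_samples()` reads `feat_count_line` from the notebook's global scope. If you want to experiment on a small table first, a variant that takes the table as an argument can be handy. The following is only a sketch under my own assumptions: `sample_table()`, its `count_col` parameter and the toy data are all hypothetical, and it numbers the rows by position instead of using Categorical codes.

```python
import numpy as np
import pandas as pd

def sample_table(table, size, offset, features, count_col='lemma_ALL', drop=False):
    # serial number of every row, in table order
    serial = np.arange(len(table))
    sample_id = (serial - offset) // size * size + offset
    grouped = table.groupby(sample_id)

    # line and token counts, joined to the summed features
    samples = grouped.agg(
        lines = (count_col, 'count'),
        tokens = (count_col, 'sum'),
    ).join(grouped[features].agg('sum'))

    # relabel each sample with the row label of its first line;
    # a negative id means the sample starts at row 0
    first_row = np.maximum(samples.index.to_numpy(), 0)
    samples.index = table.index[first_row]

    # optionally drop small samples
    if drop:
        samples = samples.loc[samples.lines == size]
    return samples

# invented data: seven lines with a token count and one feature
toy_table = pd.DataFrame(
    {'lemma_ALL': [5, 6, 7, 5, 6, 7, 5], 'pos_ADJ': [1, 0, 1, 0, 1, 0, 1]},
    index=[f'l{i}' for i in range(7)],
)
toy_all = sample_table(toy_table, size=3, offset=1, features=['pos_ADJ'])
toy_full = sample_table(toy_table, size=3, offset=1, features=['pos_ADJ'], drop=True)
print(toy_all)
```

With `drop=True`, the one-line sample at the start disappears and only the full three-line samples remain.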

## Try it out

Let's try generating some samples using different parameters.

In [None]:
size = 5
offset = 0
features = ['pos_ADJ', 'pos_ADV', 'person_2', 'person_3']

get_samples(size, offset, features)

### Combining samples with multiple offsets

In [None]:
size = 50
features = ['pos_CCONJ', 'pos_ADJ', 'pos_ADV', 'person_2', 'person_3']

samples = pd.concat(get_samples(size, offset, features) for offset in range(size))
samples = samples.sort_index()
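The `sort_index()` step puts the combined samples back into text order. That relies on the index being an ordered Categorical: sorting follows the category order rather than the alphabet. Here's a toy sketch with invented ids (`order` and `toy_part()` are names made up for the example).

```python
import pandas as pd

# hypothetical line ids in text order; note that 'sen:10' comes after 'sen:2'
order = ['sen:1', 'sen:2', 'sen:10', 'val:1']

def toy_part(labels, values):
    # a small table indexed by an ordered categorical, like our samples
    idx = pd.CategoricalIndex(labels, categories=order, ordered=True)
    return pd.DataFrame({'n': values}, index=idx)

combined = pd.concat([
    toy_part(['sen:1', 'sen:10'], [1, 3]),
    toy_part(['sen:2', 'val:1'], [2, 4]),
]).sort_index()
print(combined)
```

A plain string index would have sorted `sen:10` ahead of `sen:2`.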

## Visualizing the results

Pandas lets us create a simple plot from a data frame very easily. Here, we plot the number of coordinating conjunctions per sample (column **pos_CCONJ**) over the whole corpus.

In [None]:
samples.pos_CCONJ.plot()

There's a marked increase about two-thirds of the way through: this corresponds to the division between Seneca and Valerius Flaccus. As we saw earlier, Valerius uses conjunctions like *et* and *-que* quite a bit more frequently than Seneca.

### Customizing the plot

If we use Pyplot's interface, we can customize the look of the graph to make things easier to read. Here is a generic workflow for plotting one feature at a time. 

- the dimensions of the plot are set to elongate the x-axis
- sample size and feature to be plotted are variables, so we can change them easily
- they're noted on the graph itself, so we don't forget what we were plotting
- divisions between the works are marked on the graph
- urns are turned 90 degrees so they don't overlap each other

Try changing the `size` and `feat` parameters for different views on the data.

In [None]:
# set params
size = 100
feat = 'pos_CCONJ'

# generate sample table
samples = pd.concat(get_samples(size, offset, features=[feat], drop=True) for offset in range(size)).sort_index()

# get x-axis positions of book division
div_mask = samples.index.str.contains(r':(?:\d\.)?1$')
divs = samples.index[div_mask]

# create plot
fig, ax = plt.subplots(figsize=(12,5))

# draw the line graph
ax.plot(samples[feat])

# mark book divisions
ax.set_xticks(
    ticks = divs.codes,
    labels = [urn[:-2] for urn in divs.values],
    rotation = 90,
)

# label graph
ax.set_ylabel(f'{feat} (count per sample)')
ax.set_title(f'window size = {size}')

# display figure
plt.show()

## The effects of sample size

What are some of the considerations in choosing the size in lines of the sliding window?

- making the window larger should even out some of the noisy ups and downs
- on the other hand, we don't want to dilute the signal of coherent episodes in the narrative
- samples that begin near the end of one work will also include lines from the beginning of the next

The next code block draws multiple plots at different sample sizes to compare the effects. Again, try varying the parameters to see the results.

In [None]:
# set params
sizes = [50, 100, 200]
feat = 'pos_CCONJ'

# get x-axis positions of book division
div_mask = samples.index.str.contains(r':(?:\d\.)?1$')
divs = samples.index[div_mask]

# create plot
fig, ax = plt.subplots(figsize=(12,5))

# sample and graph for each size
for size in sizes:
    samples = pd.concat(get_samples(size, offset, features=[feat], drop=True) for offset in range(size)).sort_index()
    ax.plot(samples[feat], label=size)
    
# mark book divisions
ax.set_xticks(
    ticks = divs.codes,
    labels = [urn[:-2] for urn in divs.values],
    rotation = 90,
)

# label the figure
ax.set_ylabel(f'{feat} (per sample)')
ax.legend(title='sample size')

# display figure
plt.show()
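The third consideration listed above, samples that straddle the boundary between two works, is easy to see on invented numbers. In this sketch, `work_a` and `work_b` are hypothetical per-line counts for two ten-line "works" with very different rates.

```python
# invented per-line counts: a low-rate work followed by a high-rate one
work_a = [1] * 10
work_b = [3] * 10
counts = work_a + work_b

size = 4

# total count for every full window, indexed by starting line
sums = [sum(counts[i:i + size]) for i in range(len(counts) - size + 1)]

# windows that start in work_a but run over into work_b
mixed = [i for i in range(len(sums)) if i < len(work_a) < i + size]

print(sums)
print(mixed)
```

There are `size - 1` mixed windows at each boundary, so the blurred zone in the plot grows along with the window.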

In [None]:
# set params
feats = ['pos_CCONJ', 'pos_ADV', 'pos_ADJ']
size = 200

# get x-axis positions of book division
div_mask = samples.index.str.contains(r':(?:\d\.)?1$')
divs = samples.index[div_mask]

# create plot
fig, ax = plt.subplots(figsize=(12,5))

# sample and graph for each feature
for feat in feats:
    samples = pd.concat(get_samples(size, offset, features=[feat], drop=True) for offset in range(size)).sort_index()
    ax.plot(samples[feat].div(samples['tokens']) * 1000, label=feat)
    
# mark book divisions
ax.set_xticks(
    ticks = divs.codes,
    labels = [urn[:-2] for urn in divs.values],
    rotation = 90,
)

# label figure
ax.set_ylabel('count per 1000 words')
ax.set_title(f'window size = {size}')
ax.legend()

# display figure
plt.show()
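The last cell divides each feature count by the sample's token count and multiplies by 1000, instead of plotting raw counts. Here's a minimal sketch of why, using two invented samples with the same raw count but different lengths in tokens.

```python
import pandas as pd

# invented samples: same number of conjunctions, different numbers of tokens
toy_samples = pd.DataFrame(
    {'tokens': [400, 500], 'pos_CCONJ': [20, 20]},
    index=['sample_a', 'sample_b'],
)

# occurrences per 1000 tokens
per_thousand = toy_samples['pos_CCONJ'].div(toy_samples['tokens']) * 1000
print(per_thousand)
```

The raw counts are identical, but the rate is higher in the shorter sample, which is what we want to compare when lines vary in length.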