# Week 9: A Pandas Approach to TTRs in the Colonial South Asian Literature dataset


Topics:
- Renaming columns
- Merging data frames
- Defining functions
- Moving window average TTR
- Using `.groupby()` to get average TTRs for different categories

# Loading the CSAL Dataset

Let's begin by loading the CSAL dataset and having a look at what kinds of "metadata" it contains.

In [None]:
import pandas as pd

In [None]:
csal_meta_df = pd.read_csv('csal.csv')

In [None]:
csal_meta_df

Our task today is to investigate whether South Asian or "foreign" writers have higher TTRs in their works. So the most important column for us at this point is `Nationality of Author`. Let's have a closer look at what it contains.

In [None]:
csal_meta_df['Nationality of Author'].value_counts()

In [None]:
csal_meta_df['Nationality of Author'].value_counts().plot(kind="pie", figsize=(7, 7))

Let's also have a look at `Genre`, which might be interesting to us in a bit as well...

In [None]:
csal_meta_df['Genre'].value_counts()

In [None]:
csal_meta_df['Genre'].value_counts().plot(kind="pie", figsize=(7, 7))

# Approaching the TTR Task in Pandas

Let's use data frames to compute TTRs.  This will allow us to slice and dice the data in different ways.


## Generating TTR CSV files... and Loading Them Back as Pandas DataFrames

Let's start by using some code directly recycled from the Week 5 lecture.

### DON'T RUN THIS CODE IN CLASS. IT MIGHT CRASH THE SERVER.

In [None]:
import re
from pathlib import Path

folder_path = "csal/" # We're telling the code to look in the "csal/" subfolder, where the CSAL files all live.

# compute the sample size:
sample_size = 0

for file_path in sorted(Path(folder_path).glob('*.txt')):
    
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    num_tokens = len(text_words)
    
    if sample_size == 0 or num_tokens < sample_size:
        sample_size = num_tokens

# Open the output file and write the headers
file = open("ttr-standardized.csv", mode="w", encoding="utf-8")

# Column labels are precise, identifying these as "Standardized" values
file.write('"Text","Standardized Types","Standardized Tokens","Standardized TTR"\n') 

for file_path in sorted(Path(folder_path).glob('*.txt')):
    text = open(file_path, encoding='utf-8').read()
    text = re.sub("[^a-zA-Z0-9]", " ", text)
    
    text_words = text.split()
    text_words_standardized = text_words[:sample_size]
    tokens_standardized = len(text_words_standardized)

    unique_words_standardized = []
    
    for word in text_words_standardized:
        word = word.lower()
        if word not in unique_words_standardized:
            unique_words_standardized.append(word)
            
    types_standardized = len(unique_words_standardized)
    
    ttr_standardized = (types_standardized / tokens_standardized) * 100
    
    # file_path.name is used rather than file_path.stem so that the recorded filenames match the CSAL metadata
    file.write(f'"{file_path.name}",{types_standardized},{tokens_standardized},{ttr_standardized:.2f}\n') 

file.close()
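
The inner loop above checks every word against a growing list, which gets slow on long texts. As a minimal sketch of the same type and token count using a set instead (the function name `standardized_ttr` and its parameters are hypothetical, not part of the Week 5 code):

```python
import re

def standardized_ttr(text, sample_size):
    """Return (types, tokens, TTR as a percentage) for the first `sample_size` words."""
    # Same cleaning step as above: anything that is not a letter or digit becomes a space
    words = re.sub("[^a-zA-Z0-9]", " ", text).split()[:sample_size]
    if not words:
        raise ValueError("text contains no words")
    # A set keeps one copy of each lowercased word, so its length is the type count
    types = {word.lower() for word in words}
    return len(types), len(words), len(types) / len(words) * 100

demo_types, demo_tokens, demo_ttr = standardized_ttr("The cat and the dog", sample_size=4)
print(demo_types, demo_tokens, demo_ttr)
```

Set membership is checked in roughly constant time, so the count no longer slows down as the number of distinct words grows.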

In [None]:
standardized_ttr_df = pd.read_csv("ttr-standardized-prebaked.csv")
#standardized_ttr_df = pd.read_csv("ttr-standardized.csv")

In [None]:
standardized_ttr_df


Please check out the `W09_moving_window.ipynb` file to see how I computed the values for the moving windows. This will allow us to compare the standardized TTRs to the moving-window TTRs and do some simple statistics on them.
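
That notebook holds the actual computation. As a rough idea of what a moving-window average TTR involves, here is a minimal sketch. The name `moving_window_ttr`, its parameters, and the one-word step between windows are all assumptions for illustration, not taken from that notebook.

```python
import re

def moving_window_ttr(words, window_size):
    """Mean TTR (as a percentage) over every run of `window_size` consecutive words."""
    words = [word.lower() for word in words]
    if window_size <= 0 or len(words) < window_size:
        raise ValueError("window_size must be positive and no longer than the text")
    window_ttrs = []
    # Slide the window forward one word at a time (assumed step size)
    for start in range(len(words) - window_size + 1):
        window = words[start:start + window_size]
        window_ttrs.append(len(set(window)) / window_size * 100)
    return sum(window_ttrs) / len(window_ttrs)

sample_text = "the cat sat on the mat and the dog sat on the rug"
sample_words = re.sub("[^a-zA-Z0-9]", " ", sample_text).split()
sample_ttr = moving_window_ttr(sample_words, window_size=5)
print(round(sample_ttr, 2))
```

Because every window has the same length, texts of very different sizes can be compared without truncating them all to the shortest one. This naive version recounts each window from scratch, so it is slow on full-length books.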

In [None]:
moving_windows_ttr_df = pd.read_csv("ttr-windows.csv")
moving_windows_ttr_df

# Merging DataFrames

When two DataFrames share a column, we can merge them into one.
Below is the command we use to `.merge()` our two DataFrames, **"on"** the column they have in common.

In [None]:
pd.merge(moving_windows_ttr_df, standardized_ttr_df, on="Text")

But we probably don't want the types and tokens columns, so let's get rid of them.

In [None]:
# create a data frame with only the columns we want in our merged data frame
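
One possible way to fill in this cell, sketched on a toy stand-in rather than the real data. The column names match the header written by the script earlier; the frame `standardized_demo_df` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for standardized_ttr_df (values are invented)
standardized_demo_df = pd.DataFrame({
    "Text": ["a.txt", "b.txt"],
    "Standardized Types": [300, 250],
    "Standardized Tokens": [1000, 1000],
    "Standardized TTR": [30.0, 25.0],
})

# Option 1: keep only the columns we want, by passing a list of names
ttr_only_df = standardized_demo_df[["Text", "Standardized TTR"]]

# Option 2: drop the columns we don't want
ttr_only_alt_df = standardized_demo_df.drop(columns=["Standardized Types", "Standardized Tokens"])

print(ttr_only_df)
```

Passing a trimmed frame like this to `pd.merge()` in place of the full one gives a merged DataFrame without the types and tokens columns.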


Now let's go ahead and stick the merged DataFrame into a variable.

In [None]:
ttr_df = pd.merge(moving_windows_ttr_df, standardized_ttr_df, on="Text")

In [None]:
ttr_df

# Merging the TTR Data with the CSAL Metadata

Fortunately, the CSAL metadata has the same `Text` column.

In [None]:
csal_ttr_df = pd.merge(csal_meta_df, ttr_df, on="Text")

In [None]:
csal_ttr_df
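
By default `pd.merge()` keeps only the rows whose `Text` value appears in both DataFrames, so a filename that is spelled differently in one of them silently disappears. Here is a minimal sketch of how one might audit that, using toy frames (`meta_demo_df` and `ttr_demo_df` are hypothetical stand-ins with invented values):

```python
import pandas as pd

meta_demo_df = pd.DataFrame({
    "Text": ["a.txt", "b.txt", "c.txt"],
    "Genre": ["Fiction", "Poetry", "Drama"],
})
ttr_demo_df = pd.DataFrame({
    "Text": ["a.txt", "b.txt", "d.txt"],
    "Standardized TTR": [40.1, 38.5, 42.3],
})

# Default (inner) merge: only texts present in both frames survive
inner_df = pd.merge(meta_demo_df, ttr_demo_df, on="Text")

# Outer merge with indicator=True adds a "_merge" column saying where each row came from
audit_df = pd.merge(meta_demo_df, ttr_demo_df, on="Text", how="outer", indicator=True)
unmatched_df = audit_df[audit_df["_merge"] != "both"]

print(len(inner_df))
print(unmatched_df[["Text", "_merge"]])
```

If `unmatched_df` is empty on the real data, nothing was lost in the merge.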

Let's learn a little more about this new mega-DataFrame we've created...

In [None]:
csal_ttr_df.describe()

# Sorting by Column

Before we jump into our actual task for this week, let's see how you would sort the full dataset by Standardized TTR, from lowest to highest; then from highest to lowest.

In [None]:
csal_ttr_df.sort_values(by='Standardized TTR', ascending=True)

In [None]:
csal_ttr_df.sort_values(by='Standardized TTR', ascending=False)

# Using GroupBy and Mean to Get Our TTR-by-Nationality Data

Now that we have this mega-DataFrame — it contains all the CSAL metadata, and all our precious TTR data — we can pursue our original research question: do texts written by authors from the subcontinent have higher or lower TTRs than texts written by authors identified as foreign?

**What data do we actually need to see, in what format, to pursue that research question?**

Let's start by using our old friend `df.groupby()` and group this DataFrame by the `Nationality of Author` column.

In [None]:
csal_by_nationality_df = csal_ttr_df.groupby('Nationality of Author')
csal_by_nationality_df

The GroupBy object produced by `.groupby()` can't be displayed the way a normal DataFrame can. We need to call methods on it to see what's inside. Remember what we're looking for: the **mean standardized TTR for each category of author nationality**. If we just call on old reliable `.describe()`, we can see that this data is already in the `csal_by_nationality_df` object we just produced. Do you see where it is in the output below?

In [None]:
csal_by_nationality_df.describe()
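
Besides `.describe()`, there are a few other ways to peek inside a GroupBy object. A minimal sketch on a toy frame (`demo_df` and its values are invented for illustration):

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Nationality of Author": ["British", "South Asian", "British", "South Asian"],
    "Standardized TTR": [40.0, 42.0, 44.0, 38.0],
})
grouped = demo_df.groupby("Nationality of Author")

# How many rows fall into each group
group_sizes = grouped.size()

# Pull out the rows belonging to one group as an ordinary DataFrame
british_rows = grouped.get_group("British")

# Looping over a GroupBy yields (group name, sub-DataFrame) pairs
for nationality, rows in grouped:
    print(nationality, len(rows))
```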

Here's how we grab only the information we want from `csal_by_nationality_df`: we subset to the TTR columns (using a method we've been using for a few weeks now, passing a list of column names, even a `['list containing a single string']`, into the `dataframe[ ]` structure) and then call the Pandas `.mean()` method on those columns.

What we get from this is just a plain old Pandas DataFrame (not a GroupBy object).

In [None]:
csal_by_nationality_df[['TTR 4051', 'Standardized TTR']].mean()

In [None]:
type(csal_by_nationality_df[['Standardized TTR']].mean())
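
The brackets matter here. As a small sketch on a toy frame (invented values), a single string gives back a Series, while a list gives back a DataFrame, even when the list holds only one name:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Nationality of Author": ["British", "South Asian", "British", "South Asian"],
    "Standardized TTR": [40.0, 42.0, 44.0, 38.0],
})
grouped = demo_df.groupby("Nationality of Author")

# Single brackets with a string: the result is a Series
series_result = grouped["Standardized TTR"].mean()

# Double brackets (a list of names): the result is a DataFrame
frame_result = grouped[["Standardized TTR"]].mean()

print(type(series_result))
print(type(frame_result))
```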

Now let's stick that into a variable... and let's make a plot of the data we've uncovered... and then interpret the results together!

In [None]:
mean_ttr_by_nationality_df = csal_by_nationality_df[['Standardized TTR', 'TTR 4051']].mean()
mean_ttr_by_nationality_df = mean_ttr_by_nationality_df.rename(columns={'TTR 4051' : 'Moving Average TTR'})

In [None]:
mean_ttr_by_nationality_df.plot(kind='bar', figsize=(10,5), title='Standardized TTRs Averaged Across Nationality of Author')
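
When interpreting a plot like this, it helps to know how many texts stand behind each bar. A minimal sketch, on invented toy data, of getting the mean and the group size in one step with `.agg()`:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Nationality of Author": ["British", "British", "British", "South Asian"],
    "Standardized TTR": [40.0, 42.0, 44.0, 39.0],
})

# One row per nationality, with the mean TTR and the number of texts behind it
summary_df = demo_df.groupby("Nationality of Author")["Standardized TTR"].agg(["mean", "count"])
print(summary_df)
```

A mean based on one or two texts tells us much less than a mean based on dozens, so the `count` column is worth checking before drawing conclusions.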

Let's now look at similar plots for TTR data grouped by other metadata categories, using the same methods employed above. Does this give you any further insight into the results above?

In [None]:
csal_ttr_by_year_df = csal_ttr_df.groupby('Year')
mean_ttr_by_year_df = csal_ttr_by_year_df[['Standardized TTR', 'TTR 4051']].mean()
mean_ttr_by_year_df

In [None]:
mean_ttr_by_year_df.plot(figsize=(15,5), title='TTRs Averaged Across Year of Publication')

Wait! Something looks strange there.  Why are the TTRs so different for 1870?

In [None]:
csal_ttr_df[csal_ttr_df['Year'] == 1870]

In [None]:
mean_ttr_by_genre_df = csal_ttr_df.groupby('Genre')[['Standardized TTR','TTR 4051']].mean()
mean_ttr_by_genre_df

In [None]:
mean_ttr_by_genre_df.plot(kind='bar', figsize=(10,5), title='Standardized TTRs Averaged Across Genre of Text')

Let's use the techniques we learned last time, when we produced our gender signal-by-year plots, to see exactly how many works in each Genre appear for each author nationality.

In [None]:
csal_ttr_df.groupby(['Genre', 'Nationality of Author']).size()

In [None]:
csal_ttr_df.groupby(['Genre', 'Nationality of Author']).size().unstack()

In [None]:
csal_genre_by_nationality_df = csal_ttr_df.groupby(['Genre', 'Nationality of Author']).size().unstack(fill_value=0)
csal_genre_by_nationality_df
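
To see what `.size().unstack(fill_value=0)` is doing, here is a small sketch on invented toy data. It also shows `pd.crosstab()`, a Pandas function that builds the same kind of count table in one call; using it here is a suggestion, not something from earlier weeks.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Poetry", "Drama"],
    "Nationality of Author": ["British", "South Asian", "South Asian", "British"],
})

# .size() counts rows per (Genre, Nationality) pair; .unstack() moves the
# nationalities into columns; fill_value=0 marks pairs that never occur
counts_df = demo_df.groupby(["Genre", "Nationality of Author"]).size().unstack(fill_value=0)

# The same table via crosstab
crosstab_df = pd.crosstab(demo_df["Genre"], demo_df["Nationality of Author"])

print(counts_df)
```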

In [None]:
csal_genre_by_nationality_df = csal_ttr_df.groupby(['Genre', 'Nationality of Author'])[['TTR 4051']].mean().unstack(fill_value=0)
csal_genre_by_nationality_df
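
One thing to watch: with counts, filling the gaps with 0 is accurate, but with means a 0 looks like "a TTR of zero" rather than "no texts in this category". A small sketch of the difference on invented toy data (it uses single brackets around `'TTR 4051'` to keep the column labels simple):

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Poetry", "Drama"],
    "Nationality of Author": ["British", "South Asian", "South Asian", "British"],
    "TTR 4051": [41.0, 43.0, 45.0, 40.0],
})
grouped_means = demo_df.groupby(["Genre", "Nationality of Author"])["TTR 4051"].mean()

# Missing combinations become 0, which then counts as a real value in later statistics
means_zero_df = grouped_means.unstack(fill_value=0)

# Missing combinations stay as NaN, which Pandas skips in statistics and plots
means_nan_df = grouped_means.unstack()

print(means_zero_df["South Asian"].mean())
print(means_nan_df["South Asian"].mean())
```

In a bar chart, a NaN simply leaves a gap where no texts exist for that combination.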


In [None]:
csal_genre_by_nationality_df.plot(kind='bar', figsize=(10,5), title='Moving Window TTRs Averaged Across Genre of Text and Nationality of Author').legend(loc="lower right")

Note that the bars are grouped by row: drama, fiction, etc. What if we wanted them grouped by nationality instead? We can build a new DataFrame by exchanging the order of `Nationality of Author` and `Genre` in the `.groupby()` call:

In [None]:
csal_ttr_df.groupby(['Nationality of Author', 'Genre'])[['TTR 4051']].mean().unstack(fill_value=0)

And we can plot if we want!

In [None]:
ax = csal_genre_by_nationality_df.T.plot(kind='bar', figsize=(10,5), title='Moving Window TTRs Averaged Across Genre of Text and Nationality of Author', xlabel='Nationality of Author', ylabel='Moving Window TTR')
ax.set_xticklabels(['American', 'British', 'British, American', 'Canadian', 'Irish', 'South Asian', 'South Asian; British'])
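
Typing the tick labels by hand works, but the labels can drift out of step with the data if a category is added or reordered. As a hedged alternative, the labels can be read off the transposed DataFrame's index. A minimal sketch on invented toy data:

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Poetry", "Drama"],
    "Nationality of Author": ["British", "South Asian", "South Asian", "British"],
    "TTR 4051": [41.0, 43.0, 45.0, 40.0],
})
means_df = demo_df.groupby(["Genre", "Nationality of Author"])[["TTR 4051"]].mean().unstack()

# After transposing, each row is labelled by a ('TTR 4051', nationality) pair
by_nationality_df = means_df.T

# Take just the nationality part of each row label, in the order the bars will appear
tick_labels = list(by_nationality_df.index.get_level_values("Nationality of Author"))
print(tick_labels)

# With a plot in hand, these could then be passed along:
# ax.set_xticklabels(tick_labels)
```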


Let's close today's class by:

- imagining how we could improve our approach to our original research question
- thinking of what other research questions we could ask of the CSAL dataset — with the TTR data we've added, or perhaps with some other metadata category or textual metric?