<a href="https://colab.research.google.com/github/browndw/humanities_analytics/blob/main/mini_labs/Mini_Lab_02_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini Lab 1: The Basics

## A simple processing pipeline

To carry out any sort of computational analysis, we need to convert text into numbers. Although computers now make this fairly easy, it nonetheless constitutes a radical reorganization of text.

A processing pipeline typically looks something like this:

![A processing pipeline](https://raw.githubusercontent.com/browndw/humanities_analytics/refs/heads/main/data/_images/pipeline.svg)

To begin seeing what this looks like in practice, let's start with a toy example.

### A toy example

First, we'll create an object consisting of a character string. In this case, the first sentence from *A Tale of Two Cities*:

> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair.

In [2]:
totc_txt = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."

### Tokenization

🤔 How would you turn this text into something you can count? In other words, we need to convert text into numbers in order to carry out any kind of computational or statistical analysis. So how would you do that?

One obvious way would be to simply split the text at spaces.

In [3]:
split_totc = totc_txt.split(" ")
print(split_totc)

['It', 'was', 'the', 'best', 'of', 'times,', 'it', 'was', 'the', 'worst', 'of', 'times,', 'it', 'was', 'the', 'age', 'of', 'wisdom,', 'it', 'was', 'the', 'age', 'of', 'foolishness,', 'it', 'was', 'the', 'epoch', 'of', 'belief,', 'it', 'was', 'the', 'epoch', 'of', 'incredulity,', 'it', 'was', 'the', 'season', 'of', 'Light,', 'it', 'was', 'the', 'season', 'of', 'Darkness,', 'it', 'was', 'the', 'spring', 'of', 'hope,', 'it', 'was', 'the', 'winter', 'of', 'despair.']


### Counting tokens

To create a table of counts, we'll first install a library to help us create and manipulate tables (or data frames). For all of our labs, we'll use a Python library called [Polars](https://docs.pola.rs/). There's also a handy [introduction here](https://pbpython.com/polars-intro.html).

In [4]:
%%capture
!pip install "polars>=1.19"


Now we can [import](https://www.geeksforgeeks.org/import-module-python/) some useful things.



In [5]:
import polars as pl
from collections import Counter

Then count our tokens and put them into a table.

In [6]:
totc_counts = Counter(split_totc)
counts_df = pl.DataFrame(totc_counts).transpose(include_header=True, header_name="token").rename({"column_0": "count"})
counts_df.head()

token,count
str,i64
"""It""",1
"""was""",10
"""the""",10
"""best""",1
"""of""",10



---

📝 Coding note: For these labs we'll largely be using a library called [polars](https://docs.pola.rs/api/python/stable/reference/index.html) to construct and manipulate data frames, which are just tabular data structures (i.e., they have rows and columns). The first part of the code (`pl.DataFrame(totc_counts)`) [creates the polars data frame](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html), with one column per token. The second (`transpose(include_header=True, header_name="token")`) [transposes the data frame](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.transpose.html) so that the rows become the columns and the columns the rows. And the third (`rename({"column_0": "count"})`) assigns the name "count" to the column that was automatically labeled "column_0" when we transposed the data frame.

---
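Before the counts go into a data frame, it can help to look at the `Counter` object on its own. The sketch below uses only the standard library and a shortened version of the sentence (`sample_txt` is an assumption made to keep the example small, not part of the lab's data). It shows that a `Counter` is essentially a mapping from each distinct token to its count.

```python
from collections import Counter

# Shortened version of the sentence, used here only for illustration
sample_txt = "It was the best of times, it was the worst of times,"

# Split on spaces, exactly as in the lab
sample_tokens = sample_txt.split(" ")

# A Counter maps each distinct token to the number of times it occurs
sample_counts = Counter(sample_tokens)

# most_common(n) returns the n most frequent (token, count) pairs
top_tokens = sample_counts.most_common(4)
print(top_tokens)

# "It" and "it" are stored under separate keys
print(sample_counts["It"], sample_counts["it"])
```

Each (token, count) pair corresponds to one row of the table built above.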



The process of splitting a string into its constituent parts is called **tokenization** or [**word segmentation**](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation). Think of this as telling the computer how to define a word (or a "token", which is a more precise, technical term). In this case, we've done it in an extremely simple way, by defining a token as any string that is bounded by spaces. As a result, capitalized *It* and lowercase *it* are counted as different tokens, even though both are the same third-person pronoun.



In [10]:
counts_df.filter(pl.col("token").str.contains(r"(?i)^it$"))

token,count
str,i64
"""It""",1
"""it""",9


---

📝 Coding note: The polars library has powerful tools for [filtering/selecting data](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html). Here, we filter on the "token" column (`pl.col("token")`) and we want a [string](https://docs.pola.rs/api/python/stable/reference/expressions/string.html) that contains "it". The "(?i)" signals that the search should be [case insensitive](https://stackoverflow.com/questions/75911005/case-insensitive-search-in-polars-python) (i.e., should include both upper and lower case) and the symbols "^" and "$" are [regular expression symbols](https://www.sitepoint.com/learn-regex/) indicating the beginning and end of a string respectively. There is [a regex tutorial here](https://regexlearn.com/learn).

---
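To see what each part of the pattern contributes, here is a small sketch that applies the same expression with Python's built-in `re` module rather than polars. The list `candidates` is a hypothetical set of strings chosen to show what the pattern accepts and rejects; for this simple pattern, the flag and the two anchors should behave the same way in both tools.

```python
import re

# Hypothetical strings chosen to show what the pattern accepts and rejects
candidates = ["It", "it", "IT", "item", "wit", "it,"]

# (?i) = ignore case, ^ = start of the string, $ = end of the string
pattern = re.compile(r"(?i)^it$")

# Keep only the strings that are exactly "it", in any mix of cases
matches = [tok for tok in candidates if pattern.search(tok)]
print(matches)
```

Without the anchors, strings such as "item" and "wit" would also match, because they contain the letters "it".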

## Using models to tokenize at scale

In order to execute this process at scale, we have a couple of options:


1.   We could manipulate our text by, for example, converting everything to lower case, deleting any character sequences that don't contain a letter, deleting symbols, and deleting punctuation. Then, we could simply split on spaces as we did in our simple experiment above. This is called [pre-processing or text cleaning](https://medium.com/@maleeshadesilva21/preprocessing-steps-for-natural-language-processing-nlp-a-beginners-guide-d6d9bf7689c9).
2.   Alternatively, we could pass our data to an algorithm or model with a complex set of rules or probabilities encoded into it.
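The first option can be sketched in a few lines of standard-library Python. This is only a minimal illustration under simple assumptions (English text, letters a to z, a shortened sample sentence stored in the hypothetical variable `raw_txt`), not a complete cleaning routine.

```python
import re
from collections import Counter

# Shortened sample sentence, used here only for illustration
raw_txt = "It was the best of times, it was the worst of times,"

# 1. Convert everything to lower case
lowered = raw_txt.lower()

# 2. Replace anything that is not a letter or whitespace with a space
letters_only = re.sub(r"[^a-z\s]", " ", lowered)

# 3. Split on runs of whitespace (split() with no argument drops empty strings)
clean_tokens = letters_only.split()

clean_counts = Counter(clean_tokens)
print(clean_counts["it"], clean_counts["times"])
```

After cleaning, *It* and *it* are counted together and the trailing commas no longer attach to *times*. A rule this blunt also splits a contraction like *don't* into two pieces, which hints at why simple cleaning becomes harder to get right on real text.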

The second option tends to be more computationally intensive. However, model-based parsing allows us to extract additional information from texts. Depending on the model, we can retrieve part-of-speech tags, named entities, sentiment scores, or dependency (i.e., syntactic) relations.

For most of these labs, we will be using [spaCy](https://spacy.io/) models to tokenize and parse our data. These models are relatively efficient, well-documented, and widely used in industry.

### Install libraries

We'll begin by installing [docuscospacy](https://docuscospacy.readthedocs.io/en/latest/index.html), which we'll use for tokenizing and tagging, and great_tables, which (as the name suggests) is used for designing and writing tables.

In [11]:
%%capture
!pip install "docuscospacy>=0.3"

### Install the spaCy model

Next, we'll download the model ([en_docusco_spacy](https://huggingface.co/browndw/en_docusco_spacy)) that we'll be using.

In [12]:
%%capture
!pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl"

### Load the libraries

And finally load the libraries.

In [13]:
import docuscospacy as ds
import spacy

### Parsing text

Parsing text with spaCy requires:


1.   Initializing an "instance" of our model.
2.   Loading some text.
3.   Passing the text to the model.

So let's do that with our *Tale of Two Cities* example:



In [14]:
nlp = spacy.load("en_docusco_spacy")

In [21]:
totc_txt = "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."

In [22]:
doc = nlp(totc_txt)

Now we can see some of what the model generates as outputs:

In [25]:
for token in doc:
    print(token.text, token.pos_, token.tag_, token.ent_iob_, token.ent_type_)

It  PPH1 B Narrative
was  VBDZ I Narrative
the  AT I Narrative
best  JJT O 
of  IO B Narrative
times  NNT2 I Narrative
,  Y O 
it  PPH1 B Narrative
was  VBDZ I Narrative
the  AT I Narrative
worst  JJT B Negative
of  IO B Narrative
times  NNT2 I Narrative
,  Y O 
it  PPH1 B Narrative
was  VBDZ I Narrative
the  AT I Narrative
age  NN1 B Narrative
of  IO I Narrative
wisdom  NN1 B Positive
,  Y O 
it  PPH1 B Narrative
was  VBDZ I Narrative
the  AT I Narrative
age  NN1 B Narrative
of  IO I Narrative
foolishness  NN1 B Negative
,  Y O 
it  PPH1 B Narrative
was  VBDZ I Narrative
the  AT I Narrative
epoch  NN1 B Narrative
of  IO O 
belief  NN1 B Character
,  Y O 
it  PPH1 B Narrative
was  VBDZ I Narrative
the  AT I Narrative
epoch  NN1 B Narrative
of  IO O 
incredulity  NN1 O 
,  Y O 
it  PPH1 B Narrative
was  VBDZ I Narrative
the  AT I Narrative
season  NNT1 B Narrative
of  IO O 
Light  NN1 O 
,  Y O 
it  PPH1 B Narrative
was  VBDZ I Narrative
the  AT I Narrative
season  NNT1 B Narrative
of  
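Each printed row holds the token text, a coarse part-of-speech value (empty for this model, which is why two spaces appear), a fine-grained tag, and two span fields. In spaCy, `ent_iob_` marks whether a token begins a span (`B`), continues one (`I`), or sits outside any span (`O`), and `ent_type_` gives the span's category. The sketch below shows one way to merge those flags back into phrases. It works on a handful of rows copied from the output above, and the helper `group_spans` is a hypothetical function written for this illustration, not part of spaCy or docuscospacy.

```python
# (text, IOB flag, category) triples copied from the first rows of the output above
tagged = [
    ("It", "B", "Narrative"),
    ("was", "I", "Narrative"),
    ("the", "I", "Narrative"),
    ("best", "O", ""),
    ("of", "B", "Narrative"),
    ("times", "I", "Narrative"),
    (",", "O", ""),
]


def group_spans(rows):
    """Merge B/I-flagged tokens into (phrase, category) pairs; O tokens are skipped."""
    spans = []
    current_words, current_label = [], None
    for text, iob, label in rows:
        if iob == "B":
            # A new span starts: store any span still open, then begin again
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [text], label
        elif iob == "I" and current_words:
            # Continuation of the span that is currently open
            current_words.append(text)
        else:
            # "O" (or a stray "I" with no open span) closes whatever is open
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    # Store a span that runs to the end of the input
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans


spans = group_spans(tagged)
print(spans)
```

Read this way, the model treats *It was the* and *of times* as multi-word units rather than as isolated words.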

### Using docuscospacy to automate the process

To use the docuscospacy library we first need a data frame with one column with document ids and another with text.

In [27]:
totc_corpus = pl.DataFrame({"doc_id": ["totc"], "text": [totc_txt]})

Now we can pass that corpus to our spaCy instance.

In [28]:
totc_tokens = ds.docuscope_parse(totc_corpus, nlp_model=nlp, n_process=4)

After processing, we can create a number of useful structures, like a table of frequency counts.

In [31]:
ds.frequency_table(totc_tokens).head()

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""it""","""PPH1""",10,166666.666667,100.0
"""of""","""IO""",10,166666.666667,100.0
"""the""","""AT""",10,166666.666667,100.0
"""was""","""VBDZ""",10,166666.666667,100.0
"""age""","""NN1""",2,33333.333333,100.0

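The "AF" column holds absolute frequencies (raw counts). The "RF" values are consistent with a relative frequency expressed per one million tokens, and the following sketch checks that reading with plain Python. It assumes that punctuation is excluded from the total, and it uses a simple regular expression in place of the model's tokenizer, so it is a rough reconstruction rather than the library's own calculation.

```python
import re
from collections import Counter

totc_txt = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness, "
    "it was the epoch of belief, it was the epoch of incredulity, "
    "it was the season of Light, it was the season of Darkness, "
    "it was the spring of hope, it was the winter of despair."
)

# Lower-cased words only; punctuation is left out of the total (an assumption)
words = re.findall(r"[a-z]+", totc_txt.lower())
word_counts = Counter(words)

total = len(words)
af_it = word_counts["it"]
af_age = word_counts["age"]

# Relative frequency: occurrences per one million tokens
rf_it = af_it / total * 1_000_000
rf_age = af_age / total * 1_000_000
print(total, af_it, round(rf_it, 6), af_age, round(rf_age, 6))
```

Normalizing in this way makes counts comparable across texts of different lengths, which matters once we move beyond a single sentence.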

### Processing a larger data set

To load in a larger data set, we can read in data that is either on your Google Drive or that we can link to from the web.

Here we will read in a corpus from our course GitHub repository.

In [32]:
df = pl.read_parquet("https://github.com/browndw/humanities_analytics/raw/refs/heads/main/data/data_tables/sample_corpus.parquet")

---

📝 Coding note: The polars library has a variety of functions for [reading in data](https://docs.pola.rs/api/python/stable/reference/io.html). Data can also be written into your Google Drive, which we will do in another lab.

---

Now we can use the same `docuscope_parse` function to process the corpus. This will take about 2 minutes.

In [34]:
ds_tokens = ds.docuscope_parse(df, nlp_model=nlp, n_process=4)

And create a frequency table.

In [35]:
wc = ds.frequency_table(ds_tokens)

In [36]:
wc.head()

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""the""","""AT""",51030,49907.38332,99.75
"""and""","""CC""",25288,24731.685467,99.5
"""of""","""IO""",22492,21997.195094,99.75
"""a""","""AT1""",22033,21548.292704,99.25
"""to""","""TO""",16269,15911.095811,99.0


In [37]:
wc.tail()

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""zuni""","""NP1""",1,0.978001,0.25
"""zur""","""NN1""",1,0.978001,0.25
"""zvezda""","""NP1""",1,0.978001,0.25
"""zwit""","""NP1""",1,0.978001,0.25
"""zymo""","""NP1""",1,0.978001,0.25


From the table, it is relatively easy to extract important information like the total word count (or size) of a corpus. Here we simply [sum](https://docs.pola.rs/api/python/dev/reference/expressions/api/polars.sum.html) the "AF" (or absolute frequency) column.

In [40]:
wc.select("AF").sum()

AF
u32
1022494
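With the corpus size in hand, the other columns can be sanity-checked by hand. The sketch below uses only numbers copied from the outputs above. It assumes that "RF" is a frequency per one million tokens and that "Range" is the percentage of documents in which a token appears; both readings fit the values shown, but they are inferences from the output rather than statements taken from the library's documentation.

```python
corpus_size = 1_022_494  # sum of the AF column, from the output above
af_the = 51_030          # AF for "the"
af_rare = 1              # AF for a token such as "zuni"

# Relative frequency per one million tokens
rf_the = af_the / corpus_size * 1_000_000
rf_rare = af_rare / corpus_size * 1_000_000
print(round(rf_the, 2), round(rf_rare, 6))

# A token that occurs once can appear in only one document. If Range is a
# percentage of documents (an assumption), a Range of 0.25 for such a token
# implies this many documents in the corpus:
n_docs = round(100 / 0.25)
print(n_docs)
```

Under these assumptions, the Range of 99.75 reported for *the* would mean that it is missing from just one document.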
