# Tutorial on Text Pre-Processing for Education Language Data

Welcome to this tutorial on using [`edu-convokit`](https://github.com/rosewang2008/edu-convokit) for text pre-processing.
Text pre-processing is a critical step in handling education language data.
- It ensures the data is clean (education data is notoriously messy).
- It ensures the data is standardized, ready for annotation and analysis.
- It ensures that the students and educators are anonymized; this is important to protect the privacy of individuals involved and allow for safe secondary data analysis.

`edu-convokit` is designed to support these purposes.

## 📚 Learning Objectives

In this tutorial, you will learn how to use `TextPreprocessor` to:

- <a href="#📝-anonymizing-data-with-known-names">Section Link 🔗</a>: Anonymize your data when you know the names of your students and educators.
- <a href="#📝-anonymizing-data-with-unknown-names">Section Link 🔗</a>: Anonymize your data when you do _not_ know the names of your students and educators.
- <a href="#📝-standardizing-data-for-downstream-annotation-and-analysis">Section Link 🔗</a>: Standardize your data for downstream feature annotation.

Without further ado, let's get started!

## Installation

Let's first install `edu-convokit`.


In [None]:
!pip install git+https://github.com/rosewang2008/edu-convokit.git

Collecting git+https://github.com/rosewang2008/edu-convokit.git
  Cloning https://github.com/rosewang2008/edu-convokit.git to /tmp/pip-req-build-580kdce9
  Running command git clone --filter=blob:none --quiet https://github.com/rosewang2008/edu-convokit.git /tmp/pip-req-build-580kdce9
  Resolved https://github.com/rosewang2008/edu-convokit.git to commit 8eb087b51abfa36a7031bf1de4e3dc40d8848186
  Preparing metadata (setup.py) ... done


In [None]:
from edu_convokit.preprocessors import TextPreprocessor

# For helping us flexibly load data
from edu_convokit import utils

## 📑 Data

Let's load the data we'll be working with. We're going to be using a transcript from the [TalkMoves dataset](https://github.com/SumnerLab/TalkMoves).

In [None]:
!wget "https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats and Fish 2_Grade 4.xlsx"

data_fname = "Boats and Fish 2_Grade 4.xlsx"
df = utils.load_data(data_fname) # Handles loading data from different file types including: .csv, .xlsx, .json

# Show these lines because they contain names in the speaker and text columns.
df.iloc[25:35]

--2023-12-30 10:25:32--  https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats%20and%20Fish%202_Grade%204.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10528 (10K) [application/octet-stream]
Saving to: ‘Boats and Fish 2_Grade 4.xlsx.3’


2023-12-30 10:25:32 (61.8 MB/s) - ‘Boats and Fish 2_Grade 4.xlsx.3’ saved [10528/10528]



Unnamed: 0.1,Unnamed: 0,TimeStamp,Turn,Speaker,Sentence,Teacher Tag,Student Tag
25,25,,14.0,David,"Yeah, I know, and put ‘em up to there, and tha...",,4 - Making a Claim
26,26,,14.0,David,"Hey, wait a minute, hey wait, maybe that’s it,...",,4 - Making a Claim
27,27,,15.0,T,Now take six of the ones,1 - None,
28,28,,15.0,T,Which is bigger?,8 - Press for Accuracy,
29,29,,16.0,Beth,One half,,4 - Making a Claim
30,30,,17.0,David,I think one half is...,,2 - Relating to Another Student
31,31,,,T,"Yes, David and Meredith?",2 - Keeping Everyone Together,
32,32,,17.0,David,What do you have?,,2 - Relating to Another Student
33,33,,17.0,Meredith and David,Well,,1 - None
34,34,,18.0,David,we think,,1 - None


### Some things to observe about the data...

💡 Note: `edu-convokit` cares about two key columns: a column for the speaker and a column for the text.
- In the TalkMoves dataset, the speaker is in the `Speaker` column and the text is in the `Sentence` column. We can create two variables to store these column names as these will be used throughout the tutorial.

💡 Note: Names occur in both the speaker and text columns.
- e.g., names like David and Meredith appear in both the speaker and text columns.
- The teacher is always shortened to "T" in the speaker column.

💡 Note: The utterances from the same speaker are not always grouped together.
- We'll fix this in the section on standardizing the data for downstream annotation and analysis.

In [None]:
# Creating variables for the columns we want to use
TEXT_COLUMN = "Sentence"
SPEAKER_COLUMN = "Speaker"


## 📝 Anonymizing Data with Known Names

We will now anonymize the data when we know the names of the students and educators in the dataset.
In our experience, this is the most common scenario in education language data.
For example, these names come from a roster or a list of students in a class, or are officially recorded in a database.

To do this, we need to create a list of names that we want to anonymize, and a list of replacement names that we want to use to replace the names in the dataset.
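If your names come from a roster file rather than from the transcript itself, the two lists can be built along these lines. This is a minimal sketch using only the standard library; the roster contents and the `student_name` column are hypothetical, and `load_roster_names` is an illustrative helper, not part of `edu-convokit`.

```python
import csv
import io

# Hypothetical roster export; in practice this would be a file on disk.
ROSTER_CSV = """student_name
David
Meredith
Beth
 David
"""

def load_roster_names(handle, column="student_name"):
    """Return unique, whitespace-stripped names in first-seen order."""
    seen = []
    for row in csv.DictReader(handle):
        name = (row.get(column) or "").strip()
        if name and name not in seen:
            seen.append(name)
    return seen

known_names = load_roster_names(io.StringIO(ROSTER_CSV))
# One unique placeholder per name, in the same order as the names.
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]

print(known_names)
print(known_replacement_names)
```

Stripping whitespace and dropping duplicates keeps each student tied to exactly one placeholder.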


In [None]:
# Show the names of the speakers. In your use case, you might load this from a file or database.
print(df[SPEAKER_COLUMN].unique())

['T' 'David' 'Meredith' 'Beth' 'Meredith and David' 'T 2']


In [None]:
# Create list of names and replacement names. We will make the replacement names unique so that we can easily find them later.
known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]
print(known_replacement_names)

['[STUDENT_0]', '[STUDENT_1]', '[STUDENT_2]']


In [None]:
# Now let's anonymize the names in the text!
processor = TextPreprocessor()
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    # We will directly replace the names in the text column.
    # If you want to keep the original text, you can set `target_text_column` to a new column name.
    target_text_column=TEXT_COLUMN
)

In [None]:
# Let's see what the anonymized text looks like!
df.iloc[25:35]

Unnamed: 0.1,Unnamed: 0,TimeStamp,Turn,Speaker,Sentence,Teacher Tag,Student Tag
25,25,,14.0,David,"Yeah, I know, and put ‘em up to there, and tha...",,4 - Making a Claim
26,26,,14.0,David,"Hey, wait a minute, hey wait, maybe that’s it,...",,4 - Making a Claim
27,27,,15.0,T,Now take six of the ones,1 - None,
28,28,,15.0,T,Which is bigger?,8 - Press for Accuracy,
29,29,,16.0,Beth,One half,,4 - Making a Claim
30,30,,17.0,David,I think one half is...,,2 - Relating to Another Student
31,31,,,T,"Yes, [STUDENT_0] and [STUDENT_1]?",2 - Keeping Everyone Together,
32,32,,17.0,David,What do you have?,,2 - Relating to Another Student
33,33,,17.0,Meredith and David,Well,,1 - None
34,34,,18.0,David,we think,,1 - None


💡 Note: Nice, we can see that the text has been anonymized (e.g., row 31)!

However, the speaker names have not been anonymized. Let's fix that.

In [None]:
df = processor.anonymize_known_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    target_text_column=SPEAKER_COLUMN
)

df.iloc[25:35]

Unnamed: 0.1,Unnamed: 0,TimeStamp,Turn,Speaker,Sentence,Teacher Tag,Student Tag
25,25,,14.0,[STUDENT_0],"Yeah, I know, and put ‘em up to there, and tha...",,4 - Making a Claim
26,26,,14.0,[STUDENT_0],"Hey, wait a minute, hey wait, maybe that’s it,...",,4 - Making a Claim
27,27,,15.0,T,Now take six of the ones,1 - None,
28,28,,15.0,T,Which is bigger?,8 - Press for Accuracy,
29,29,,16.0,[STUDENT_2],One half,,4 - Making a Claim
30,30,,17.0,[STUDENT_0],I think one half is...,,2 - Relating to Another Student
31,31,,,T,"Yes, [STUDENT_0] and [STUDENT_1]?",2 - Keeping Everyone Together,
32,32,,17.0,[STUDENT_0],What do you have?,,2 - Relating to Another Student
33,33,,17.0,[STUDENT_1] and [STUDENT_0],Well,,1 - None
34,34,,18.0,[STUDENT_0],we think,,1 - None


🎉 Great, now we have anonymized the speaker names as well! Two other benefits of this approach:
- We have a record of the original names and their replacements (`known_names` and `known_replacement_names`), so we can map back to the original names if needed.
- The anonymized names are consistent: `[STUDENT_0]` in the speaker column refers to the same person as `[STUDENT_0]` in the text column.
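To illustrate the first point, here is a minimal sketch of how the paired lists can be turned into a lookup table and used to restore the original names. It uses plain Python and `re` rather than `edu-convokit`; `restore_names` is a hypothetical helper written for this example.

```python
import re

known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]

# Placeholder -> original name. Keep this record separate from any shared data.
placeholder_to_name = dict(zip(known_replacement_names, known_names))

def restore_names(text, mapping):
    """Replace each placeholder in `text` with its original name."""
    if not mapping:
        return text
    # Escape the placeholders because "[" and "]" are regex metacharacters.
    pattern = re.compile("|".join(re.escape(p) for p in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(restore_names("Yes, [STUDENT_0] and [STUDENT_1]?", placeholder_to_name))
```

Because the mapping re-identifies individuals, it should be treated as sensitive and not distributed alongside the anonymized transcript.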


This concludes the tutorial on anonymizing data with known names.
The next section will cover anonymizing data when you do *not* know the names of the students and educators in your dataset.

## 📝 Anonymizing Data with Unknown Names

We will now anonymize the data when we **do not know** the names of the students and educators in the dataset.
Note that this anonymization will be imperfect: without a list of names, we have to detect them automatically, and identifying names consistently is a hard task (cf. [named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)). Use this with caution!
We will show some of these failure modes in the tutorial.

In [None]:
# Let's start fresh with the original data
df = utils.load_data(data_fname)
df.iloc[25:35]

Unnamed: 0.1,Unnamed: 0,TimeStamp,Turn,Speaker,Sentence,Teacher Tag,Student Tag
25,25,,14.0,David,"Yeah, I know, and put ‘em up to there, and tha...",,4 - Making a Claim
26,26,,14.0,David,"Hey, wait a minute, hey wait, maybe that’s it,...",,4 - Making a Claim
27,27,,15.0,T,Now take six of the ones,1 - None,
28,28,,15.0,T,Which is bigger?,8 - Press for Accuracy,
29,29,,16.0,Beth,One half,,4 - Making a Claim
30,30,,17.0,David,I think one half is...,,2 - Relating to Another Student
31,31,,,T,"Yes, David and Meredith?",2 - Keeping Everyone Together,
32,32,,17.0,David,What do you have?,,2 - Relating to Another Student
33,33,,17.0,Meredith and David,Well,,1 - None
34,34,,18.0,David,we think,,1 - None


In [None]:
processor = TextPreprocessor()
df, (names, replacement_names) = processor.anonymize_unknown_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    target_text_column=SPEAKER_COLUMN,
    # Will return the names and replacement names that were used.
    return_names=True
)

print(f"Names: {names}")
print(f"Replacement names: {replacement_names}")
df.iloc[25:35]


Names: ['Beth', 'David']
Replacement names: ['[PERSON0]', '[PERSON1]']


Unnamed: 0.1,Unnamed: 0,TimeStamp,Turn,Speaker,Sentence,Teacher Tag,Student Tag
25,25,,14.0,[PERSON1],"Yeah, I know, and put ‘em up to there, and tha...",,4 - Making a Claim
26,26,,14.0,[PERSON1],"Hey, wait a minute, hey wait, maybe that’s it,...",,4 - Making a Claim
27,27,,15.0,T,Now take six of the ones,1 - None,
28,28,,15.0,T,Which is bigger?,8 - Press for Accuracy,
29,29,,16.0,[PERSON0],One half,,4 - Making a Claim
30,30,,17.0,[PERSON1],I think one half is...,,2 - Relating to Another Student
31,31,,,T,"Yes, David and Meredith?",2 - Keeping Everyone Together,
32,32,,17.0,[PERSON1],What do you have?,,2 - Relating to Another Student
33,33,,17.0,Meredith and [PERSON1],Well,,1 - None
34,34,,18.0,[PERSON1],we think,,1 - None


💡 Note: Observe that the name "Meredith" has not been anonymized.
`anonymize_unknown_names` currently uses spaCy's named entity recognition model to identify names. This is an imperfect model and will not identify all names, as we can see here.

There are ways we can improve this. For example:
- We can manually add "Meredith" to the list of names to anonymize and run `anonymize_known_names` again.
- We can cross-reference names from the [SSA database](https://www.ssa.gov/oact/babynames/limits.html) to identify names that are not identified by the model. However, this will lead to a high false positive rate, i.e., names that are not actually names will be identified as names.
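The false-positive risk of the second option can be seen in a small sketch. The name set below is a tiny hypothetical stand-in for a real first-name list, and `candidate_names` is an illustrative helper, not an `edu-convokit` function.

```python
import re

# Hypothetical stand-in for a large first-name list (e.g., one built from SSA data).
FIRST_NAMES = {"david", "meredith", "beth", "will", "mark"}

def candidate_names(text, name_list=FIRST_NAMES):
    """Return tokens in `text` that appear in the name list (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() in name_list]

# Catches "Meredith", which the NER model missed in this tutorial.
print(candidate_names("Yes, David and Meredith?"))

# But ordinary words that double as names are flagged too.
print(candidate_names("Will you mark where one half is?"))
```

In the second sentence, "Will" and "mark" are not names, yet a pure list lookup flags both, so candidates found this way are best reviewed by hand before anonymizing.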

To complete the anonymization process, we will use the `names` and `replacement_names` returned from `anonymize_unknown_names` to anonymize the text. This makes the anonymization consistent between the speaker and text columns.

In [None]:
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    target_text_column=TEXT_COLUMN,
    names=names,
    replacement_names=replacement_names
)

# David is anonymized but Meredith is not (see row 31).
df.iloc[25:35]

Unnamed: 0.1,Unnamed: 0,TimeStamp,Turn,Speaker,Sentence,Teacher Tag,Student Tag
25,25,,14.0,[PERSON1],"Yeah, I know, and put ‘em up to there, and tha...",,4 - Making a Claim
26,26,,14.0,[PERSON1],"Hey, wait a minute, hey wait, maybe that’s it,...",,4 - Making a Claim
27,27,,15.0,T,Now take six of the ones,1 - None,
28,28,,15.0,T,Which is bigger?,8 - Press for Accuracy,
29,29,,16.0,[PERSON0],One half,,4 - Making a Claim
30,30,,17.0,[PERSON1],I think one half is...,,2 - Relating to Another Student
31,31,,,T,"Yes, [PERSON1] and Meredith?",2 - Keeping Everyone Together,
32,32,,17.0,[PERSON1],What do you have?,,2 - Relating to Another Student
33,33,,17.0,Meredith and [PERSON1],Well,,1 - None
34,34,,18.0,[PERSON1],we think,,1 - None
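One way to close this gap is the first remedy mentioned earlier: add the missed name to the lists and run a known-name pass again. The sketch below shows the bookkeeping in plain Python. It assumes the `[PERSON<i>]` placeholder format seen in the output above and that existing placeholders are numbered from 0 without gaps; `replace_names` is a hypothetical whole-word replacement helper, not the library's implementation.

```python
import re

def replace_names(text, names, replacement_names):
    """Replace whole-word occurrences of each name with its placeholder."""
    for name, replacement in zip(names, replacement_names):
        text = re.sub(rf"\b{re.escape(name)}\b", replacement, text)
    return text

# Values returned by the NER-based pass in this tutorial.
names = ["Beth", "David"]
replacement_names = ["[PERSON0]", "[PERSON1]"]

# Add each missed name with the next free index so placeholders stay unique.
missed = ["Meredith"]
for name in missed:
    if name not in names:
        replacement_names.append(f"[PERSON{len(names)}]")
        names.append(name)

print(replace_names("Yes, David and Meredith?", names, replacement_names))
print(replace_names("Meredith and David", names, replacement_names))
```

With the real data, the extended `names` and `replacement_names` lists would be passed to `anonymize_known_names` for both the speaker and text columns, as in the previous section.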


## 📝 Standardizing Data for Downstream Annotation and Analysis

We will now standardize the data for downstream annotation and analysis.
One common standardization is to group the utterances from the same speaker together.
We will show how you can do this on the anonymized data.

For other standardizations, please refer to [`edu-convokit`'s documentation](TODO), or feel free to open a feature request or pull request on our [GitHub](https://github.com/rosewang2008/edu-convokit).

In [None]:
# First, let's start fresh with the original data and anonymize it again:
# known names for the text column, NER-detected names for the speaker column.
df = utils.load_data(data_fname)
processor = TextPreprocessor()

# Anonymize text
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    target_text_column=TEXT_COLUMN
)

# Anonymize speakers
df, (names, replacement_names) = processor.anonymize_unknown_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    target_text_column=SPEAKER_COLUMN,
    return_names=True
)

df.iloc[25:35]

Unnamed: 0.1,Unnamed: 0,TimeStamp,Turn,Speaker,Sentence,Teacher Tag,Student Tag
25,25,,14.0,[PERSON1],"Yeah, I know, and put ‘em up to there, and tha...",,4 - Making a Claim
26,26,,14.0,[PERSON1],"Hey, wait a minute, hey wait, maybe that’s it,...",,4 - Making a Claim
27,27,,15.0,T,Now take six of the ones,1 - None,
28,28,,15.0,T,Which is bigger?,8 - Press for Accuracy,
29,29,,16.0,[PERSON0],One half,,4 - Making a Claim
30,30,,17.0,[PERSON1],I think one half is...,,2 - Relating to Another Student
31,31,,,T,"Yes, [STUDENT_0] and [STUDENT_1]?",2 - Keeping Everyone Together,
32,32,,17.0,[PERSON1],What do you have?,,2 - Relating to Another Student
33,33,,17.0,Meredith and [PERSON1],Well,,1 - None
34,34,,18.0,[PERSON1],we think,,1 - None


Now we'll group utterances from the same speaker together.

In [None]:
df = processor.merge_utterances_from_same_speaker(
    df=df,
    text_column=TEXT_COLUMN,
    speaker_column=SPEAKER_COLUMN,
    # We're going to directly replace the text in the text column.
    target_text_column=TEXT_COLUMN
)

df.iloc[25:35]

Unnamed: 0,Sentence,Speaker
25,dark green,[PERSON1]
26,If you put it up to a whole,Meredith
27,"I’m sorry, what’s the number name for dark green",T
28,One,Meredith and [PERSON1]
29,Ok.,T
30,And you put six ones up to the dark green,Meredith
31,"Hold on, I’m a little confused. Tell me again....",T
32,One sixth,Meredith
33,One sixth.,T
34,"And then these, this would be",[PERSON1]


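We can see that consecutive utterances from the same speaker are now merged into a single row! Note that the output above contains only the text and speaker columns; the other columns (e.g., `Teacher Tag` and `Student Tag`) are no longer present.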
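Conceptually, this kind of merging joins each run of consecutive rows that share a speaker. The following is a minimal sketch of that idea with `itertools.groupby`; it is not necessarily how `merge_utterances_from_same_speaker` is implemented, and the utterances are abbreviated versions of rows shown earlier in this tutorial.

```python
from itertools import groupby

# (speaker, utterance) pairs in transcript order (utterances abbreviated).
rows = [
    ("[STUDENT_0]", "Yeah, I know"),
    ("[STUDENT_0]", "Hey, wait a minute"),
    ("T", "Now take six of the ones"),
    ("T", "Which is bigger?"),
    ("[STUDENT_2]", "One half"),
]

def merge_consecutive(rows, sep=" "):
    """Join the text of each run of consecutive rows with the same speaker."""
    merged = []
    # groupby only groups adjacent items, so a speaker who talks again later
    # starts a new row instead of being folded into an earlier one.
    for speaker, group in groupby(rows, key=lambda row: row[0]):
        merged.append((speaker, sep.join(text for _, text in group)))
    return merged

for speaker, text in merge_consecutive(rows):
    print(f"{speaker}: {text}")
```

Grouping only adjacent rows preserves the turn order of the conversation, which matters for downstream analyses that depend on who spoke after whom.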

## 📝 Conclusion and Where to Go From Here

In this tutorial, we learned how to use `TextPreprocessor` to:
1. Anonymize your data when you know the names of your students and educators.
2. Anonymize your data when you do _not_ know the names of your students and educators.
3. Standardize your data for downstream feature annotation.

The next natural step is to annotate your data with features of interest. Here are some resources to get you started:
- [`edu-convokit`'s documentation on `Annotator`](https://edu-convokit.readthedocs.io/en/latest/annotation.html)
- [`edu-convokit`'s tutorial on `Annotator`](https://colab.research.google.com/drive/1rBwEctFtmQowZHxralH2OGT5uV0zRIQw)


If you have any questions, please feel free to reach out to us on [`edu-convokit`'s GitHub](https://github.com/rosewang2008/edu-convokit).

👋 Happy exploring your data with `edu-convokit`!