# Tutorial on Annotating Education Language Data

Welcome to this tutorial on using [`edu-convokit`](https://github.com/rosewang2008/edu-convokit) to annotate your education language data!
Annotation is a critical step in understanding your data, and it is important to do it correctly and consistently across datasets.
`edu-convokit` is designed to help you do just that.

Annotation is useful because:
- It produces descriptive statistics about your data, such as how much each speaker talks.
- It quantifies the language used by your students and educators, such as how often students reason aloud.
- It measures the interaction between the student and the educator, such as whether the educator builds on what the student said.

`edu-convokit` is designed to support these purposes.

## 📚 Learning Objectives

In this tutorial, you will learn how to use `Annotator` to annotate your data. Some of the annotations we'll cover include:
- <a href="#📝-annotating-talk-time">Section Link 🔗</a> Talk Time: We will annotate the amount of time the student and educator talk.
- <a href="#📝-annotating-student-reasoning">Section Link 🔗</a> Student Reasoning: We will annotate use of reasoning in the student's speech.
- <a href="#📝-annotating-teacher-focusing-questions">Section Link 🔗</a> Teacher Focusing Questions: We will annotate the use of focusing questions by the educator.
- <a href="#📝-annotating-conversational-uptake">Section Link 🔗</a> Conversational Uptake: We will annotate instances of high conversational uptake by the educator.

For other annotations, please refer to the [documentation](https://edu-convokit.readthedocs.io/en/latest/).
If you want to add your own annotations, please make a pull request to the [repo](https://github.com/rosewang2008/edu-convokit/).

Without further ado, let's get started!

## Installation

First, install `edu-convokit`:

In [None]:
!pip install git+https://github.com/rosewang2008/edu-convokit.git


In [None]:
from edu_convokit.annotation import Annotator

# We're going to standardize the text with TextPreprocessor.
# In this tutorial, we're going to assume you're familiar with TextPreprocessor.
# For the tutorial on TextPreprocessor, see: https://colab.research.google.com/drive/1a-EwYwkNYHSNcNThNTXe6DNpsis0bpQK
from edu_convokit.preprocessors import TextPreprocessor

# For helping us flexibly load data
from edu_convokit import utils



## 📑 Data

Let's load the data we'll be working with. We're going to be using a transcript from the [TalkMoves dataset](https://github.com/SumnerLab/TalkMoves).

We're also going to use `TextPreprocessor` to anonymize and pre-process the data. This is optional, but recommended.

Here, we assume you're familiar with `TextPreprocessor`. If not, please refer to [this tutorial](https://colab.research.google.com/drive/1a-EwYwkNYHSNcNThNTXe6DNpsis0bpQK).
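
For intuition about what anonymizing known names involves, here is a minimal standard-library sketch. It is not `TextPreprocessor`'s implementation: the helper `replace_names` is hypothetical, and matching on whole words only is an assumption made for this illustration.

```python
import re

def replace_names(text, names, replacements):
    """Replace each known name with its placeholder, matching whole words only."""
    for name, replacement in zip(names, replacements):
        pattern = r"\b" + re.escape(name) + r"\b"
        # A function replacement avoids any special handling of the placeholder text.
        text = re.sub(pattern, lambda _match, r=replacement: r, text)
    return text

names = ["David", "Meredith", "Beth"]
replacements = [f"[STUDENT_{i}]" for i in range(len(names))]

print(replace_names("David, can you show Beth your model?", names, replacements))
```

Whole-word matching keeps a longer word that merely contains a name (for example "Bethany") from being partially rewritten.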

In [None]:
!wget "https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats and Fish 2_Grade 4.xlsx"

data_fname = "Boats and Fish 2_Grade 4.xlsx"
df = utils.load_data(data_fname) # Handles loading data from different file types including: .csv, .xlsx, .json

# First, standardize the text with TextPreprocessor as in the previous tutorial:
# anonymize names, then merge consecutive utterances from the same speaker.
processor = TextPreprocessor()
TEXT_COLUMN = "Sentence"
SPEAKER_COLUMN = "Speaker"
known_names = ["David", "Meredith", "Beth"]
known_replacement_names = [f"[STUDENT_{i}]" for i in range(len(known_names))]

# Anonymize text
df = processor.anonymize_known_names(
    df=df,
    text_column=TEXT_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names,
    target_text_column=TEXT_COLUMN
)

# Anonymize speakers
df = processor.anonymize_known_names(
    df=df,
    text_column=SPEAKER_COLUMN,
    target_text_column=SPEAKER_COLUMN,
    names=known_names,
    replacement_names=known_replacement_names
)

# Merge utterances
df = processor.merge_utterances_from_same_speaker(
    df=df,
    text_column=TEXT_COLUMN,
    speaker_column=SPEAKER_COLUMN,
    target_text_column=TEXT_COLUMN
)

# Show
df.head()

--2023-12-30 10:29:37--  https://raw.githubusercontent.com/rosewang2008/edu-convokit/master/data/talkmoves/Boats%20and%20Fish%202_Grade%204.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10528 (10K) [application/octet-stream]
Saving to: ‘Boats and Fish 2_Grade 4.xlsx’


2023-12-30 10:29:38 (14.2 MB/s) - ‘Boats and Fish 2_Grade 4.xlsx’ saved [10528/10528]



Unnamed: 0,Sentence,Speaker
0,"I'm wondering which is bigger, one half or two...",T
1,Try the purples. Get three purples. It doesn’t...,[STUDENT_0]
2,What was it? Two thirds?,[STUDENT_1]
3,It would be like brown or something like that.,[STUDENT_0]
4,Ok,[STUDENT_1]
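
To make the merge step more concrete, the following pandas sketch shows one way consecutive rows from the same speaker could be collapsed into a single utterance. It is an illustration on a toy transcript, not the code behind `merge_utterances_from_same_speaker`; the toy data and the `block` helper column are made up for this example.

```python
import pandas as pd

# Toy transcript: consecutive rows by the same speaker should become one row.
toy = pd.DataFrame({
    "Speaker": ["T", "T", "S1", "T", "T"],
    "Sentence": ["Hello.", "Ready?", "Yes.", "Good.", "Let's go."],
})

# A new block starts whenever the speaker differs from the previous row.
block_id = (toy["Speaker"] != toy["Speaker"].shift()).cumsum().rename("block")

# Within each block, keep the speaker and join the sentences with a space.
merged = (
    toy.groupby(block_id)
       .agg({"Speaker": "first", "Sentence": " ".join})
       .reset_index(drop=True)
)

print(merged)
```

Grouping on a running block counter, rather than on the speaker name itself, keeps separate turns by the same speaker apart when another speaker talks in between.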


## 📝 Annotating Talk Time

Let's start by annotating how much the student and the educator talk.
Here, we define talk time as the number of words in each utterance in `TEXT_COLUMN`.
If you have metadata about the length of the audio, you can use that to annotate talk time instead.
Please refer to the [documentation](https://edu-convokit.readthedocs.io/en/latest/) for more information.

In [None]:
annotator = Annotator()

# The talktime values will be populated in this column
TALK_TIME_COLUMN = "talktime"

df = annotator.get_talktime(
    df=df,
    text_column=TEXT_COLUMN,
    analysis_unit="words",
    output_column=TALK_TIME_COLUMN
)

df.head()

Unnamed: 0,Sentence,Speaker,talktime
0,"I'm wondering which is bigger, one half or two...",T,54
1,Try the purples. Get three purples. It doesn’t...,[STUDENT_0],12
2,What was it? Two thirds?,[STUDENT_1],5
3,It would be like brown or something like that.,[STUDENT_0],9
4,Ok,[STUDENT_1],1


🎉 With a single function call, we've added our first annotation -- `talktime` -- to our data!

All the other annotations work in a similar way. Let's continue!
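
For intuition about a word-based talk time, here is a small pandas sketch that counts whitespace-separated words and then sums them per speaker. Splitting on whitespace is an assumption made for this sketch, so `get_talktime` may count words differently; the toy data and the `talktime_sketch` column name are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Speaker": ["T", "[STUDENT_0]", "T"],
    "Sentence": [
        "Which is bigger, one half or two thirds?",
        "Two thirds.",
        "Why do you think so?",
    ],
})

# Word count per utterance, splitting on whitespace.
toy["talktime_sketch"] = toy["Sentence"].str.split().str.len()

# Total words per speaker, and each speaker's share of all words.
per_speaker = toy.groupby("Speaker")["talktime_sketch"].sum()
share = per_speaker / per_speaker.sum()

print(toy)
print(share)
```

Per-speaker shares like these are a common first descriptive statistic once the `talktime` column exists.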

## 📝 Annotating Student Reasoning

Next, let's annotate the student's reasoning.
Under the hood, we're using a model trained on students' math reasoning from [prior work](https://github.com/ddemszky/classroom-transcript-analysis).

💡 Note:
- This model is trained on math reasoning, so it may not work well on other subjects.
- This model runs slowly on CPU, so we recommend using a GPU. If you have a GPU, this library will automatically use it.
- This model is trained on _student_ utterances. `edu-convokit` has a simple way to only annotate student utterances, which we'll see below.


In [None]:
# The reasoning annotations will be populated in this column
STUDENT_REASONING_COLUMN = "student_reasoning"

df = annotator.get_student_reasoning(
    df=df,
    speaker_column=SPEAKER_COLUMN,
    text_column=TEXT_COLUMN,
    output_column=STUDENT_REASONING_COLUMN,
    # Since this model is only trained on _student_ utterances,
    # we can explicitly pass in the speaker names associated to students.
    # It will only annotate utterances from these speakers.
    speaker_value=known_replacement_names,
)

df.head()

    For more details on the model, see https://arxiv.org/pdf/2211.11772.pdf


Unnamed: 0,Sentence,Speaker,talktime,student_reasoning
0,"I'm wondering which is bigger, one half or two...",T,54,
1,Try the purples. Get three purples. It doesn’t...,[STUDENT_0],12,0.0
2,What was it? Two thirds?,[STUDENT_1],5,
3,It would be like brown or something like that.,[STUDENT_0],9,0.0
4,Ok,[STUDENT_1],1,


🎉 Great! We've added our second annotation -- `student_reasoning` -- to our data!

💡 Note:
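- `student_reasoning` is NaN for the educator's utterances, as desired.
- In the output above, it is also NaN for the shortest student utterances (such as "Ok" and "What was it? Two thirds?"), which suggests that very short utterances are not scored.
- Otherwise, `student_reasoning` is either 1.0 or 0.0: 1.0 means the model thinks the student is using reasoning, and 0.0 means the model thinks the student is not using reasoning.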
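
Because the column mixes NaN with 0.0 and 1.0, summaries should be computed over the scored rows only. The sketch below shows one way to do that in pandas on made-up values; the toy data is hypothetical and not taken from the transcript.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Speaker": ["T", "[STUDENT_0]", "[STUDENT_1]", "[STUDENT_0]", "[STUDENT_1]"],
    "student_reasoning": [np.nan, 0.0, np.nan, 1.0, 0.0],
})

# Keep only rows that actually received a score.
annotated = toy.dropna(subset=["student_reasoning"])

n_annotated = len(annotated)
overall_rate = annotated["student_reasoning"].mean()
per_student = annotated.groupby("Speaker")["student_reasoning"].mean()

print(n_annotated, overall_rate)
print(per_student)
```

Reporting the number of scored rows next to the rate makes it clear how many utterances the proportion is based on.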

💡 Are you wondering whether there's an easy way to **view examples** of the student's reasoning?

`edu-convokit` has a simple way to do this with our `Analyzer`s. This will be covered in the tutorial on `Analyzer`s: [link](https://colab.research.google.com/drive/1xfrq5Ka3FZH7t9l87u4sa_oMlmMvuTfe).
For now, let's continue annotating!


## 📝 Annotating Teacher Focusing Questions

Let's annotate the educator's use of focusing questions.
Under the hood, we're using a model trained on focusing questions in math classrooms from [prior work](https://github.com/sterlingalic/funneling-focusing).

💡 Note:
- This model is trained on math classroom data, so it may not work well on other subjects.
- This model runs slowly on CPU, so we recommend using a GPU. If you have a GPU, this library will automatically use it.
- This model is trained on _teacher_ utterances. `edu-convokit` has a simple way to only annotate teacher utterances, similar to the one we saw above for student utterances.

In [None]:
# The focusing questions annotation will be populated in this column
FOCUSING_QUESTIONS_COLUMN = "focusing_questions"

df = annotator.get_focusing_questions(
    df=df,
    speaker_column=SPEAKER_COLUMN,
    text_column=TEXT_COLUMN,
    output_column=FOCUSING_QUESTIONS_COLUMN,
    # Since this model is only trained on _teacher_ utterances,
    # we can explicitly pass in the speaker names associated to the teacher.
    speaker_value=['T']
)

df.head()

    For more details on the model, see https://aclanthology.org/2022.bea-1.27.pdf


Unnamed: 0,Sentence,Speaker,talktime,student_reasoning,focusing_questions
0,"I'm wondering which is bigger, one half or two...",T,54,,0.0
1,Try the purples. Get three purples. It doesn’t...,[STUDENT_0],12,0.0,
2,What was it? Two thirds?,[STUDENT_1],5,,
3,It would be like brown or something like that.,[STUDENT_0],9,0.0,
4,Ok,[STUDENT_1],1,,


🎉 Great! We've added our third annotation -- `focusing_questions` -- to our data!

💡 Note:
- `focusing_questions` is NaN for the student utterances.
- Similar to before, `focusing_questions` is either 1.0 or 0.0. 1.0 means the model thinks the educator is using a focusing question, and 0.0 means the model thinks the educator is not using a focusing question.
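
The NaN pattern comes from restricting annotation to certain speakers. Here is a rough sketch of that masking logic in pandas. The helper `annotate_subset` and the rule `toy_predict` are hypothetical: the rule is a keyword placeholder standing in for the real model, and none of this is `edu-convokit`'s internal code.

```python
import numpy as np
import pandas as pd

def annotate_subset(df, speaker_column, text_column, output_column, speaker_value, predict):
    """Apply `predict` only to rows whose speaker is in `speaker_value`; other rows stay NaN."""
    out = df.copy()
    out[output_column] = np.nan
    mask = out[speaker_column].isin(speaker_value)
    out.loc[mask, output_column] = out.loc[mask, text_column].map(predict).astype(float)
    return out

def toy_predict(text):
    """Placeholder rule, NOT the focusing-question model: flag 'why' or 'how'."""
    lowered = text.lower()
    return 1.0 if ("why" in lowered or "how" in lowered) else 0.0

toy = pd.DataFrame({
    "Speaker": ["T", "[STUDENT_0]", "T"],
    "Sentence": ["How did you get that?", "I counted.", "Ok, next one."],
})

toy = annotate_subset(toy, "Speaker", "Sentence", "focusing_sketch", ["T"], toy_predict)
print(toy)
```

Leaving unselected rows as NaN, rather than 0.0, keeps "not scored" distinct from "scored as negative".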

## 📝 Annotating Conversational Uptake

Let's annotate the educator's conversational uptake of the student's contributions.
Under the hood, we're using a model from [prior work](https://github.com/ddemszky/conversational-uptake).
It measures whether the educator builds on the student's preceding utterance.

💡 Note:
- This model runs slowly on CPU, so we recommend using a GPU. If you have a GPU, this library will automatically use it.
- This model is trained on teacher utterances that follow student utterances. `edu-convokit` has a simple way to only annotate these teacher utterances, similar to the function calls we saw before.


In [None]:
UPTAKE_COLUMN = "uptake"

df = annotator.get_uptake(
    df=df,
    speaker_column=SPEAKER_COLUMN,
    text_column=TEXT_COLUMN,
    output_column=UPTAKE_COLUMN,
    # We want to specify the first speaker to be the students.
    speaker1=known_replacement_names,
    # We want to specify the second speaker to be the teacher
    speaker2='T'
)

    For more details on the model, see https://arxiv.org/pdf/2106.03873.pdf


In [None]:
df.head(20)

Unnamed: 0,Sentence,Speaker,talktime,student_reasoning,focusing_questions,uptake
0,"I'm wondering which is bigger, one half or two...",T,54,,0.0,
1,Try the purples. Get three purples. It doesn’t...,[STUDENT_0],12,0.0,,
2,What was it? Two thirds?,[STUDENT_1],5,,,
3,It would be like brown or something like that.,[STUDENT_0],9,0.0,,
4,Ok,[STUDENT_1],1,,,
5,"We’re not doing the one third, we’re doing two...",[STUDENT_0],14,0.0,,
6,First we’ve got to find out what a third of it...,[STUDENT_1],18,0.0,,
7,One third?,[STUDENT_0],2,,,
8,What’s third of an orange? Let’s start a diffe...,[STUDENT_1],21,0.0,,
9,"Alright, yeah, I was thinking of that way before",[STUDENT_0],9,0.0,,
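
In the rows shown, the only teacher utterance is the very first row, which has no preceding student utterance, so `uptake` is empty throughout. To illustrate which rows form a (student utterance, teacher reply) pair, here is a small pandas sketch on a toy transcript. It is an illustration of the pairing idea only, not the logic inside `get_uptake`, and the toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Speaker": ["T", "[STUDENT_0]", "[STUDENT_1]", "T", "T"],
    "Sentence": [
        "Which is bigger?",
        "Two thirds.",
        "I think one half.",
        "Why one half?",
        "Let's check.",
    ],
})
students = ["[STUDENT_0]", "[STUDENT_1]"]
teacher = "T"

# A teacher row is eligible when the row right before it was spoken by a student.
prev_speaker = toy["Speaker"].shift(1)
eligible = (toy["Speaker"] == teacher) & prev_speaker.isin(students)

# Line up each eligible teacher reply with the student utterance it follows.
pairs = pd.DataFrame({
    "student_utterance": toy["Sentence"].shift(1)[eligible],
    "teacher_reply": toy["Sentence"][eligible],
})

print(pairs)
```

Only the fourth row qualifies here: the first teacher row has nothing before it, and the last one follows another teacher row.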


🎉 Great, we finished our last annotation of the tutorial!

With these annotations, we can now do some analysis on our data.

We can save our annotated data to a file, which we'll use in the next tutorial on `Analyzer`s: [link](https://colab.research.google.com/drive/1xfrq5Ka3FZH7t9l87u4sa_oMlmMvuTfe).


In [None]:
df.to_csv("annotated_data.csv", index=False)
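
If you want to confirm that a saved file round-trips cleanly, a check along these lines can help. This is a sketch on a toy DataFrame written to a temporary directory; the toy values are made up, and only standard pandas calls (`to_csv`, `read_csv`) are used.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sentence": ["Why one half?", "Ok"],
    "Speaker": ["T", "[STUDENT_1]"],
    "talktime": [3, 1],
    "focusing_questions": [1.0, np.nan],
})

# Write to a temporary directory so the check leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "annotated_data.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Columns should match, and unscored rows should still be NaN after reloading.
same_columns = list(reloaded.columns) == list(toy.columns)
nan_preserved = reloaded["focusing_questions"].isna().tolist() == [False, True]

print(same_columns, nan_preserved)
```

Passing `index=False` when saving avoids an extra unnamed index column appearing when the file is read back.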

## 📝 Conclusion and Where to Go From Here

In this tutorial, we learned how to use `Annotator` to annotate our data. With one function call per annotation, we were able to annotate:
- Talk Time
- Student Reasoning
- Teacher Focusing Questions
- Conversational Uptake

What are some natural next steps?
- You can annotate with other features. Please refer to the [documentation](https://edu-convokit.readthedocs.io/en/latest/) for an exhaustive list of features. Or, you can add your own features by making a pull request to the [repo](https://github.com/rosewang2008/edu-convokit).
- You can analyze your data with `edu-convokit`'s `Analyzer`:
  - For a tutorial on `Analyzer`, please refer to [this tutorial](https://colab.research.google.com/drive/1xfrq5Ka3FZH7t9l87u4sa_oMlmMvuTfe).
  - For the documentation on `Analyzer`, please refer to [this documentation](https://edu-convokit.readthedocs.io/en/latest/analyzer.html).

If you have any questions, please feel free to reach out to us on [`edu-convokit`'s GitHub](https://github.com/rosewang2008/edu-convokit).

👋 Happy exploring your data with `edu-convokit`!