# Extraction

Here we demonstrate excerpt extraction on student feedback comments. You will often have a focused question, such as "What did students say about the discussion forums?", that no separate survey question addressed explicitly. In that case, you want to extract the relevant passages from students' comments.

This enables downstream tasks like running sentiment analysis on the extracted excerpts to see how students felt about a particular aspect (known as aspect-based sentiment analysis). We won't cover that in this notebook, but you can see the end-to-end example notebook for more.

## Imports and setup

In [3]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

In [4]:
import pandas as pd
import json
import textwrap
from typing import Any
from IPython.display import HTML
from functools import partial
from pprint import pprint
from pathlib import Path
from feedback_analyzer.excerpt_extraction import (
    ExcerptExtraction,
    extract_excerpts, 
)
from feedback_analyzer.single_input_task import apply_task
from feedback_analyzer.models_common import CommentModel, LLMConfig
from feedback_analyzer.batch_runner import process_tasks

In [5]:
# this makes it more robust to run async tasks inside an already async environment (jupyter notebooks)
import nest_asyncio
nest_asyncio.apply()

Make sure to either set `ANTHROPIC_API_KEY` as an environment variable or put it in a `.env` file, which the `load_dotenv` cell above loads. The format in the `.env` file is:
```
ANTHROPIC_API_KEY=yourKeyGoesHere
```

In [6]:
%load_ext autoreload
%autoreload 2

This is a convenience function to make seeing Pandas dataframe values easier, especially when there are long strings like the student comments we will be using.

In [7]:
def full_show(df):
    with pd.option_context('display.max_columns', None, 'display.max_rows', None, 'display.max_colwidth', None):
        display(df)

In [8]:
MODEL_NAME_HAIKU = "claude-3-haiku-20240307"

This is a convenience function for pretty-printing long student comments.

In [9]:
def print_wrap(text: str, width: int = 72) -> None:
    print(textwrap.fill(text, width=width))

## Load the example data

In [10]:
data_path = Path('../data/example_data')

Let's load up some fake data. 

All of these comments are synthetic to avoid sharing any sensitive information or PII, but they work well for illustration purposes. There are 100 rows, with just a few null/NaN values here and there for realism. In most surveys I've seen, there are quite a few null, None, or blank values, and the functions are written to handle those.

In [11]:
example_survey = pd.read_csv(data_path / 'example_survey_data_synthetic.csv')
full_show(example_survey.head())

Unnamed: 0,best_parts,enhanced_learning,improve_course
0,I valued the practical clinical aspects related to immune-related disorders and their management.,The illustrative visuals and straightforward explanatory clips.,Consider reducing the duration of certain videos. A few appeared to be slightly prolonged.
1,The flexibility to learn at a self-determined speed,The opportunity to review the lecture content,"The pace of some lectures could be slowed down. At times, it's challenging to follow the lecturer's speech or decipher their handwriting."
2,The educational content was extremely enriching and stimulating! The section on oncology was the highlight.,the self-assessment activities.,Nothing specific comes to mind.
3,Professional growth within the medical sector,"The practical integration workshops were highly beneficial, they significantly contributed to a deeper comprehension of the theories and their implementation in a healthcare environment.",Incorporating a few advanced projects as optional tasks could benefit learners who wish to delve deeper into the subject matter. These projects wouldn't need to influence exam scores.
4,The highlights of the class included the practical demonstration clips that made the complex biological principles more understandable by connecting them to daily well-being and actions. This connection was incredibly beneficial as I navigated the course content.,"The aspect of the course that most facilitated my learning was the regular assessments provided at each segment, which helped confirm my grasp of the material presented. These checkpoints effectively guided me in the correct learning direction. It's evident that considerable effort was invested in designing these educational modules to enable students to gain a deep comprehension rather than just a superficial understanding of the subject matter.","Extend the duration of the concept videos for the more challenging topics, as they require a deeper dive to fully grasp the intricacies involved. Additionally, consider introducing an additional educator to the mix. The dynamic of having multiple voices in another subject area is quite engaging, and it would be beneficial to replicate that experience in this subject to prevent monotony from setting in with just one instructor."
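
As a rough illustration of what "handling blank values" can mean, here is a minimal sketch that normalizes missing or empty comments to `None` before any processing. The helper name `clean_comment` is hypothetical and this is not the package's own implementation; it only assumes that pandas represents empty CSV cells as `float('nan')`.

```python
import math
from typing import Any, Optional


def clean_comment(value: Any) -> Optional[str]:
    """Return a stripped comment string, or None if the value is blank.

    Hypothetical helper for illustration; not part of feedback_analyzer.
    """
    if value is None:
        return None
    # pandas reads empty CSV cells as float('nan')
    if isinstance(value, float) and math.isnan(value):
        return None
    text = str(value).strip()
    # whitespace-only strings count as missing too
    return text or None


raw = ["Nothing specific comes to mind.", float("nan"), None, "   "]
cleaned = [clean_comment(v) for v in raw]
print(cleaned)
```

Only the first value survives as text; the other three become `None` and can be skipped or passed through as empty results.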


## Extraction

### Test extraction on a single comment

Here we'll choose a comment that has multiple distinct topics in the answer and see how extracting distinct excerpts (about each of the topics touched upon) works. In this case, the goal focus (suggestions for improvement) is well-aligned with the survey question, but the objective of extraction is to chunk each comment into excerpts, each of which has a particular focus.

In [12]:
# the original survey question
improve_course_question = "What could be improved about the course?"
# the goal focus is what we're trying to get out of the question. This may be different than the focus of the question itself.
goal_focus = "suggestions for improvement"

# the pattern is that we make a survey task (in this case, for extraction), wrap the input
# (typically one comment or a batch of comments), and then apply the task to the input.
# What pops out is a result object, which is a pydantic model for easy use.
comment = example_survey.iloc[4]['improve_course']
survey_task = ExcerptExtraction(goal_focus=goal_focus, question=improve_course_question)
task_input = CommentModel(comment=comment)
sample_extraction = await apply_task(task_input=task_input,
                                     get_prompt=survey_task.prompt_messages,
                                     result_class=survey_task.result_class)

pprint(f'Comment: {comment}')
pprint(json.loads(sample_extraction.model_dump_json()))

('Comment: Extend the duration of the concept videos for the more challenging '
 'topics, as they require a deeper dive to fully grasp the intricacies '
 'involved. Additionally, consider introducing an additional educator to the '
 'mix. The dynamic of having multiple voices in another subject area is quite '
 'engaging, and it would be beneficial to replicate that experience in this '
 'subject to prevent monotony from setting in with just one instructor.')
{'excerpts': ['Extend the duration of the concept videos for the more '
              'challenging topics, as they require a deeper dive to fully '
              'grasp the intricacies involved.',
              'Consider introducing an additional educator to the mix. The '
              'dynamic of having multiple voices in another subject area is '
              'quite engaging, and it would be beneficial to replicate that '
              'experience in this subject to prevent monotony from setting in '
              'with just o

You can see that the extraction pulled out the two separate topics that the student mentioned in their comment.
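
One detail worth noticing in the output above: the second excerpt drops the leading "Additionally," and capitalizes "Consider", so excerpts are not guaranteed to be exact substrings of the comment. The sketch below shows one way to check this. The helper names are hypothetical, and the second excerpt is shortened to its first sentence for space.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor edits don't break matching."""
    return " ".join(text.lower().split())


def is_loose_match(excerpt: str, comment: str) -> bool:
    """True if the excerpt appears in the comment, ignoring case and spacing."""
    return normalize(excerpt) in normalize(comment)


comment = (
    "Extend the duration of the concept videos for the more challenging "
    "topics, as they require a deeper dive to fully grasp the intricacies "
    "involved. Additionally, consider introducing an additional educator to the "
    "mix. The dynamic of having multiple voices in another subject area is quite "
    "engaging, and it would be beneficial to replicate that experience in this "
    "subject to prevent monotony from setting in with just one instructor."
)
excerpts = [
    "Extend the duration of the concept videos for the more challenging "
    "topics, as they require a deeper dive to fully grasp the intricacies "
    "involved.",
    # shortened to the first sentence of the second excerpt
    "Consider introducing an additional educator to the mix.",
]

exact = [e in comment for e in excerpts]
loose = [is_loose_match(e, comment) for e in excerpts]
print(exact)
print(loose)
```

The exact check fails for the second excerpt because of the capitalization change, while the case-insensitive check accepts both. If you need to map excerpts back to character offsets in the original comment, a loose match like this is a safer starting point.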

### Alternate convenience method for extraction

The approach demonstrated above requires more knowledge of the inner workings than you need if you just want to use the package. Here's a convenience wrapper that does the same thing. It looks a little different because it accepts multiple comments. It also runs the comments in batches, asynchronously, to process them in parallel while staying within the context window and rate limits of the model. We'll also switch to the Claude 3 Haiku model here, since it's faster and cheaper, to see how it does. (If you don't specify a model, it defaults to Claude 3.5 Sonnet.)

In [13]:
# improve_course_question and goal_focus were defined in the cell above
comments = [example_survey.iloc[4]['improve_course']]
sample_extractions = await extract_excerpts(comments=comments, 
                                            question=improve_course_question, 
                                            goal_focus=goal_focus,
                                            llm_config=LLMConfig(model=MODEL_NAME_HAIKU))

for comment, extraction in zip(comments, sample_extractions):
    print_wrap(f'Student comment: "{comment}"')
    pprint(extraction.model_dump())

processing 1 inputs in batches of 25
sleeping for 20 seconds between batches
starting 0 to 25
completed 0 to 25
elapsed time: 1.3662559986114502
Student comment: "Extend the duration of the concept videos for the more
challenging topics, as they require a deeper dive to fully grasp the
intricacies involved. Additionally, consider introducing an additional
educator to the mix. The dynamic of having multiple voices in another
subject area is quite engaging, and it would be beneficial to replicate
that experience in this subject to prevent monotony from setting in with
just one instructor."
{'excerpts': ['Extend the duration of the concept videos for the more '
              'challenging topics, as they require a deeper dive to fully '
              'grasp the intricacies involved.',
              'Consider introducing an additional educator to the mix. The '
              'dynamic of having multiple voices in another subject area is '
              'quite engaging, and it would be benefi
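
The log lines above ("batches of 25", "sleeping for 20 seconds between batches") suggest a loop along the following lines. This is a minimal sketch of the general pattern, not the package's actual code: `run_in_batches` and `fake_task` are hypothetical names, and skipping the sleep after the final batch is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_in_batches(
    inputs: Sequence[T],
    task: Callable[[T], Awaitable[R]],
    batch_size: int = 25,
    sleep_seconds: float = 20.0,
) -> List[R]:
    """Apply an async task to all inputs, one concurrent batch at a time."""
    results: List[R] = []
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        # run every item in the batch concurrently; gather preserves input order
        results.extend(await asyncio.gather(*(task(item) for item in batch)))
        # pause between batches (assumed: not after the last one) for rate limits
        if start + batch_size < len(inputs):
            await asyncio.sleep(sleep_seconds)
    return results


async def fake_task(comment: str) -> int:
    """Stand-in for an LLM call: returns the word count of the comment."""
    await asyncio.sleep(0)
    return len(comment.split())


word_counts = asyncio.run(
    run_in_batches(["a b c", "d e", "f"], fake_task, batch_size=2, sleep_seconds=0)
)
print(word_counts)
```

Inside a notebook you would `await run_in_batches(...)` directly rather than calling `asyncio.run`. Because `asyncio.gather` returns results in input order, the output can be zipped back against the original comments, as the cell above does.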

### Test extraction over a batch of comments

Here we'll use 10 comments as an example to show batch running. We're using the `survey_task` we defined above, which encapsulates the extraction task. The survey question was "What could be improved about the course?" and the goal focus was "suggestions for improvement". We'll go back to the default Claude 3.5 Sonnet model here for the best results.

In [15]:
comments_to_test = [CommentModel(comment=comment) for comment in example_survey['improve_course'].tolist()[:10]]

# this requires the survey_task to be instantiated first, which is done a couple of cells above
# The batch running routine is generic and takes a list of comments and a task to apply to them.
# For that reason, we use a partial that packages the survey task with some of its parameters pre-filled.
ex_task = partial(apply_task, 
                  get_prompt=survey_task.prompt_messages, 
                  result_class=survey_task.result_class)
                #   llm_config=LLMConfig(model=MODEL_NAME_HAIKU))
extractions = await process_tasks(comments_to_test, ex_task)

for comment, excerpts in zip(comments_to_test, extractions):
    print_wrap(f'Student comment: "{comment.comment}"')
    pprint(excerpts.model_dump())
    print('\n')

processing 10 inputs in batches of 25
sleeping for 20 seconds between batches
starting 0 to 25
completed 0 to 25
elapsed time: 2.4936439990997314
Student comment: "Consider reducing the duration of certain videos. A
few appeared to be slightly prolonged."
{'excerpts': ['Consider reducing the duration of certain videos. A few '
              'appeared to be slightly prolonged.']}


Student comment: "The pace of some lectures could be slowed down. At
times, it's challenging to follow the lecturer's speech or decipher
their handwriting."
{'excerpts': ["The pace of some lectures could be slowed down. At times, it's "
              "challenging to follow the lecturer's speech or decipher their "
              'handwriting.']}


Student comment: "Nothing specific comes to mind."
{'excerpts': []}


Student comment: "Incorporating a few advanced projects as optional
tasks could benefit learners who wish to delve deeper into the subject
matter. These projects wouldn't need to influence exam sco

Notice that there are no excerpts for comments that did not contain anything pertaining to the goal focus (suggestions for improvement in this case). This is a nice way of focusing on just the comments that have useful feedback.
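
A minimal sketch of that filtering step, using plain dicts as stand-ins for the `model_dump()` output shown above (two of the comments from the batch are reused here):

```python
comments = [
    "Consider reducing the duration of certain videos. A few appeared to be slightly prolonged.",
    "Nothing specific comes to mind.",
]
# stand-ins for `result.model_dump()` on each extraction result
dumped = [
    {"excerpts": ["Consider reducing the duration of certain videos. "
                  "A few appeared to be slightly prolonged."]},
    {"excerpts": []},
]

# keep only the comments that produced at least one excerpt
with_feedback = [
    (comment, result["excerpts"])
    for comment, result in zip(comments, dumped)
    if result["excerpts"]
]
print(len(with_feedback), "of", len(comments), "comments contain suggestions")
```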

### Alternate convenience method for extraction of a batch of comments

Note that the results may not always be the same, given the inherent non-deterministic nature of LLMs.

In [16]:
# The question and goal_focus were defined above, but we're redefining here for clarity
# to show the example all in a single cell.
question = "What could be improved about the course?"
goal_focus = "suggestions for improvement"
comments = example_survey['improve_course'].tolist()[:10]
results = await extract_excerpts(comments=comments,
                                 question=question,
                                 goal_focus=goal_focus)
                                #  llm_config=LLMConfig(model=MODEL_NAME_HAIKU))

# we'll just show a few for brevity but feel free to change the slice
for comment, excerpts in list(zip(comments, results))[4:7]:
    print_wrap(f'Student comment: "{comment}"')
    pprint(excerpts.model_dump())
    print('\n')

processing 10 inputs in batches of 25
sleeping for 20 seconds between batches
starting 0 to 25
completed 0 to 25
elapsed time: 2.4829351902008057
Student comment: "Extend the duration of the concept videos for the more
challenging topics, as they require a deeper dive to fully grasp the
intricacies involved. Additionally, consider introducing an additional
educator to the mix. The dynamic of having multiple voices in another
subject area is quite engaging, and it would be beneficial to replicate
that experience in this subject to prevent monotony from setting in with
just one instructor."
{'excerpts': ['Extend the duration of the concept videos for the more '
              'challenging topics, as they require a deeper dive to fully '
              'grasp the intricacies involved.',
              'Consider introducing an additional educator to the mix. The '
              'dynamic of having multiple voices in another subject area is '
              'quite engaging, and it would be benef

Notice how it put separate suggestions into different excerpts, which helps us later when classifying them into categories or running sentiment analysis on them. It also acts as a filter: it didn't return any excerpts for the comments that contained no suggestions for improvement.
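
For downstream classification or sentiment analysis, it is often convenient to flatten the results into one row per excerpt while keeping a pointer back to the source comment. Here is a sketch of that reshaping, again with plain dicts standing in for `model_dump()` output. The row field names are illustrative choices, and the excerpt texts for the third comment are abbreviated with "...".

```python
# stand-ins for `result.model_dump()`; the last entry has two excerpts
dumped = [
    {"excerpts": ["Consider reducing the duration of certain videos. "
                  "A few appeared to be slightly prolonged."]},
    {"excerpts": []},
    {"excerpts": ["Extend the duration of the concept videos ...",
                  "Consider introducing an additional educator ..."]},
]

# one row per excerpt; comments with no excerpts contribute no rows
rows = [
    {"comment_index": i, "excerpt": excerpt}
    for i, result in enumerate(dumped)
    for excerpt in result["excerpts"]
]
for row in rows:
    print(row["comment_index"], "|", row["excerpt"])
```

A list of dicts like this can be passed straight to `pd.DataFrame(rows)` if you prefer to continue in pandas.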

### New goal focus

Now let's see what happens if we change the goal focus to something that is not exactly the same as the survey question. In this case, let's say we want to know what suggestions for improvement students had about the lectures and videos. We can define that as our goal focus and pass that along.

In [17]:
# The question is the same as before; only the goal focus changes.
question2 = "What could be improved about the course?"
goal_focus2 = "lectures and videos"
comments = example_survey['improve_course'].tolist()[:10]
results2 = await extract_excerpts(comments=comments,
                                  question=question2,
                                  goal_focus=goal_focus2)
                                  # llm_config=LLMConfig(model=MODEL_NAME_HAIKU))

for comment, excerpts in zip(comments, results2):
    print_wrap(f'Student comment: "{comment}"')
    pprint(excerpts.model_dump())
    print('\n')

processing 10 inputs in batches of 25
sleeping for 20 seconds between batches
starting 0 to 25
completed 0 to 25
elapsed time: 2.1860463619232178
Student comment: "Consider reducing the duration of certain videos. A
few appeared to be slightly prolonged."
{'excerpts': ['Consider reducing the duration of certain videos. A few '
              'appeared to be slightly prolonged.']}


Student comment: "The pace of some lectures could be slowed down. At
times, it's challenging to follow the lecturer's speech or decipher
their handwriting."
{'excerpts': ["The pace of some lectures could be slowed down. At times, it's "
              "challenging to follow the lecturer's speech or decipher their "
              'handwriting.']}


Student comment: "Nothing specific comes to mind."
{'excerpts': []}


Student comment: "Incorporating a few advanced projects as optional
tasks could benefit learners who wish to delve deeper into the subject
matter. These projects wouldn't need to influence exam sco

Nice! Now the resulting excerpts only have to do with lectures and videos. Notice that comments that clearly have suggestions for improvement ("Extend the duration! The course felt too brief...") but are not about lectures or videos no longer show up as excerpts.