# Theme Derivation

The goal here is to derive the main themes of the comments, where a theme is a piece of feedback expressed by multiple students. This is a good starting point when you want a high-level answer to the question "What did students say about the course?" before diving into specifics. The same approach applies equally well to user feedback surveys; little about the analysis is specific to education.

This process starts from scratch: there is no preconceived notion of what the themes are, and the model derives them from the comments themselves.

## Imports and setup

In [1]:
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

True

In [26]:
import pandas as pd
import json
from pprint import pprint
from pathlib import Path
from IPython.display import Markdown, display
from feedback_analyzer.summarization import summarize_comments
from feedback_analyzer.models_common import LLMConfig, CommentModel, CommentBatch
from feedback_analyzer.theme_derivation import RefineThemes, find_themes
from feedback_analyzer.single_input_task import apply_task

In [6]:
# this makes it more robust to run async tasks inside an already async environment (jupyter notebooks)
import nest_asyncio
nest_asyncio.apply()

Make sure to either set `ANTHROPIC_API_KEY` as an environment variable or put it in a `.env` file, which the `load_dotenv` cell at the top of this notebook loads. The format in the `.env` file is:
```
ANTHROPIC_API_KEY=yourKeyGoesHere
```
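
Before making any API calls, it can help to confirm that the key is actually visible to the notebook process. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper introduced here for illustration and is not part of `feedback_analyzer`:

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is missing or blank."""
    # Hypothetical helper (an assumption for illustration, not a library function).
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return value
```

Calling `require_env("ANTHROPIC_API_KEY")` after the `load_dotenv` cell fails fast with a readable message, rather than surfacing later as an authentication error.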

In [7]:
%load_ext autoreload
%autoreload 2

This convenience function makes pandas DataFrame values easier to read in full, especially when they contain long strings like the student comments we will be using.

In [8]:
def full_show(df):
    with pd.option_context('display.max_columns', None, 'display.max_rows', None, 'display.max_colwidth', None):
        display(df)

In [14]:
MODEL_NAME_HAIKU = "claude-3-haiku-20240307"

## Load the example data

In [9]:
data_path = Path('../data/example_data')

Let's load up some fake data.

All of these comments are synthetic to avoid sharing any sensitive information or PII, but they work well for illustration purposes. There are 100 rows, with a few null/NaN values here and there for realism. In most surveys I've seen, there are quite a few null, None, or blank values, and the functions are written to handle those.

In [10]:
example_survey = pd.read_csv(data_path / 'example_survey_data_synthetic.csv')
full_show(example_survey.head())

Unnamed: 0,best_parts,enhanced_learning,improve_course
0,I valued the practical clinical aspects related to immune-related disorders and their management.,The illustrative visuals and straightforward explanatory clips.,Consider reducing the duration of certain videos. A few appeared to be slightly prolonged.
1,The flexibility to learn at a self-determined speed,The opportunity to review the lecture content,"The pace of some lectures could be slowed down. At times, it's challenging to follow the lecturer's speech or decipher their handwriting."
2,The educational content was extremely enriching and stimulating! The section on oncology was the highlight.,the self-assessment activities.,Nothing specific comes to mind.
3,Professional growth within the medical sector,"The practical integration workshops were highly beneficial, they significantly contributed to a deeper comprehension of the theories and their implementation in a healthcare environment.",Incorporating a few advanced projects as optional tasks could benefit learners who wish to delve deeper into the subject matter. These projects wouldn't need to influence exam scores.
4,The highlights of the class included the practical demonstration clips that made the complex biological principles more understandable by connecting them to daily well-being and actions. This connection was incredibly beneficial as I navigated the course content.,"The aspect of the course that most facilitated my learning was the regular assessments provided at each segment, which helped confirm my grasp of the material presented. These checkpoints effectively guided me in the correct learning direction. It's evident that considerable effort was invested in designing these educational modules to enable students to gain a deep comprehension rather than just a superficial understanding of the subject matter.","Extend the duration of the concept videos for the more challenging topics, as they require a deeper dive to fully grasp the intricacies involved. Additionally, consider introducing an additional educator to the mix. The dynamic of having multiple voices in another subject area is quite engaging, and it would be beneficial to replicate that experience in this subject to prevent monotony from setting in with just one instructor."
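
The library functions handle missing values themselves, so no pre-filtering is needed. If you want to know how many usable comments a column contains, a sketch like the following works on the list returned by `.tolist()`. Note that `drop_blank_comments` is a hypothetical helper written for this illustration; it is not part of `feedback_analyzer` and is not necessarily how the library treats blanks internally:

```python
import math

def drop_blank_comments(comments):
    """Return only the comments that contain actual text.

    Hypothetical helper for illustration. Filters out None, float NaN
    (how pandas represents missing values in an object column), and
    strings that are empty or whitespace-only.
    """
    kept = []
    for comment in comments:
        if comment is None:
            continue
        if isinstance(comment, float) and math.isnan(comment):
            continue
        if not str(comment).strip():
            continue
        kept.append(comment)
    return kept
```

For example, `len(drop_blank_comments(example_survey['improve_course'].tolist()))` gives the number of non-blank responses to that question.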


### Create variables for the original survey questions

The column names are shorthand for easy use in the code below, but let's also define variables that hold the original survey question asked for each column. We'll pass these as metadata alongside the comments to the LLM routines. This is potentially important context for an LLM: after all, the survey comment "The flexibility to learn at a self-determined speed" may have a different significance if the question was "What were the best parts of the course?" versus "How could we improve the course?".
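
To make the point concrete, here is a minimal sketch of pairing a question with a comment so that the same words read differently depending on what was asked. `build_comment_context` and its output format are assumptions for illustration only; how `feedback_analyzer` actually formats its prompts is not shown in this notebook:

```python
def build_comment_context(question: str, comment: str) -> str:
    """Combine a survey question and one response into a single block of text."""
    # Hypothetical format, chosen only to illustrate the idea.
    return (
        f"Survey question: {question}\n"
        f"Student comment: {comment}"
    )

comment = "The flexibility to learn at a self-determined speed"

# The same comment, framed first as praise and then as a request for change.
as_praise = build_comment_context("What were the best parts of the course?", comment)
as_request = build_comment_context("How could we improve the course?", comment)
```

Under the first question the comment reads as something students valued; under the second it reads as something they want more of.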

In [11]:
best_parts_question = "What were the best parts of the course?"
enhanced_learning_question = "What parts of the course enhanced your learning the most?"
improve_course_question = "What could be improved about the course?"

We'll also load some Coursera reviews (sourced from [this Kaggle dataset](https://www.kaggle.com/datasets/imuhammad/course-reviews-on-coursera)), using just the first 100 as an example. The included example file contains only the first 200 rows of the full 1.45 million rows; I didn't include the full set so as not to bloat the size of this repo.

In [12]:
coursera_survey = pd.read_csv(data_path / 'coursera_survey_200rows.csv', nrows=100)
full_show(coursera_survey.head())

Unnamed: 0,reviews,reviewers,date_reviews,rating,course_id
0,"Pretty dry, but I was able to pass with just two complete watches so I'm happy about that. As usual there were some questions on the final exam that were NO WHERE in the course, which is annoying but far better than many microsoft tests I have taken. Never found the suplimental material that the course references... but who cares... i passed!",By Robert S,"Feb 12, 2020",4,google-cbrs-cpi-training
1,would be a better experience if the video and screen shots would sho on the side of the text that the instructor is going thru so that user does not have to go all the way to beginning of text to be able to view any slides instructor is showing.,By Gabriel E R,"Sep 28, 2020",4,google-cbrs-cpi-training
2,Information was perfect! The program itself was a little annoying. I had to wait 30 to 45 minutes after watching the videos to to take the quiz. Other than that the information was perfect and passed the test with no issues!,By Jacob D,"Apr 08, 2020",4,google-cbrs-cpi-training
3,A few grammatical mistakes on test made me do a double take but all in all not bad.,By Dale B,"Feb 24, 2020",4,google-cbrs-cpi-training
4,Excellent course and the training provided was very detailed and easy to follow.,By Sean G,"Jun 18, 2020",4,google-cbrs-cpi-training


In [13]:
coursera_review_question = "What is your review of the course?"

## Theme derivation (bottom up)
### Derive themes from a batch of comments

Behind the scenes, this runs a chained series of steps to come up with the final themes and the citations for each theme. Here's what the chain looks like:

<img alt="theme derivation" src="../images/theme_derivation_chain.png" title="Theme derivation chain" height="500">
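
For readers who cannot view the diagram, the sketch below shows one common shape for this kind of chain: split the comments into batches, derive candidate themes per batch concurrently, then merge the candidates and keep those raised by more than one comment. This is an assumption about the general pattern, not a description of what `find_themes` actually does. Every name in it is hypothetical, and the "LLM call" is a stub that treats the first word of each comment as its theme:

```python
import asyncio
from collections import Counter

def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

async def derive_batch_themes(batch):
    """Stub standing in for an LLM call: the 'theme' is each comment's first word."""
    await asyncio.sleep(0)  # placeholder for the latency of a real request
    return [comment.split()[0].lower() for comment in batch if comment.strip()]

async def derive_themes(comments, batch_size=2):
    """Map step: derive themes per batch concurrently. Reduce step: merge and rank."""
    batches = make_batches(comments, batch_size)
    per_batch = await asyncio.gather(*(derive_batch_themes(b) for b in batches))
    counts = Counter(theme for themes in per_batch for theme in themes)
    # Keep only themes raised by more than one comment, most frequent first.
    return [theme for theme, n in counts.most_common() if n > 1]
```

In a notebook you would call this with `await derive_themes(comments)`. A real chain would replace the stub with model calls and would typically add a refinement pass over the merged themes.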

In [59]:
comments = example_survey['improve_course'].tolist() # 100 comments
derivation_result = await find_themes(comments=comments, question=improve_course_question)

for theme in derivation_result.themes:
    print(theme.theme_title)

Video Content Improvements
Course Depth and Content
Interactive and Practical Elements
Assessment and Quiz Improvements
Course Materials and Resources
Course Structure and Duration
Language and Consistency
Positive Feedback


Looks pretty good. Let's look in more detail at the themes and supporting citations (quotes from the comments that back up the themes).

In [60]:
def format_themes(themes):
    output = []
    for theme in themes:
        output.append(f"Theme: {theme.theme_title}")
        output.append(f"Description:\n{theme.description}")
        output.append("Citations:")
        for citation in theme.citations:
            output.append(f"  • \"{citation}\"")
        output.append("")  # Empty line between themes
    return "\n".join(output)

print(format_themes(derivation_result.themes))

Theme: Video Content Improvements
Description:
• Suggestions for video duration adjustments
• Concerns about lecture pace and clarity
• Requests for more visual aids and multimedia
Citations:
  • "Consider reducing the duration of certain videos. A few appeared to be slightly prolonged."
  • "The pace of some lectures could be slowed down. At times, it's challenging to follow the lecturer's speech or decipher their handwriting."
  • "Incorporating additional visual aids could enhance and solidify the understanding of the material."

Theme: Course Depth and Content
Description:
• Desire for more in-depth and advanced content
• Requests for additional subjects and topics
• Suggestions for more practical examples and case studies
Citations:
  • "Delve deeper into the subject matter! It would be engaging to explore additional intricacies."
  • "Incorporating information about the latest treatment methods would be beneficial."
  • "It would be beneficial to include additional case studies, 

Let's try a different survey question.

In [55]:
comments_best_parts = example_survey['best_parts'].tolist()
derivation_result_best_parts = await find_themes(comments=comments_best_parts, question=best_parts_question)

for theme in derivation_result_best_parts.themes:
    print(theme.theme_title)

Visual Aids and Multimedia
Practical Applications
Interactive Learning
Clear Content Structure
Comprehensive Coverage
Flexible Learning
Effective Assessments
Clinical Relevance
Expert Instruction
Supplementary Resources
Engaging Content
Integration of Concepts
Diverse Learning Materials


In [56]:
print(format_themes(derivation_result_best_parts.themes))

Theme: Visual Aids and Multimedia
Description:
• Effective use of visual illustrations and animations
• Video content that clarifies complex concepts
• Engaging multimedia resources enhancing comprehension
Citations:
  • "The illustrative animations clarified the main ideas effectively. The foundational exercises were also quite beneficial."
  • "I appreciated the clarity and visual aids provided in the course, which simplified complex topics such as cellular processes, hereditary traits, oncology principles, and DNA analysis. These tools made the material more accessible, especially compared to the challenge of deciphering academic papers and other scientific texts on my own."
  • "The instructional animations and the variety of engaging multimedia resources were excellent, making the material straightforward and comprehensible."

Theme: Practical Applications
Description:
• Real-world clinical scenarios
• Case studies connecting theory to practice
• Practical exercises demonstrating 

Your turn: as an exercise, run `find_themes` on the `coursera_survey` comments that we loaded earlier. The `coursera_review_question` variable defined above is the question to pass along with them.