# Theme Derivation

The goal here is to derive the main themes of the comments, where a theme is common feedback expressed by multiple students. This is a good starting point when you want a high-level answer to the question "What did students say about the course?" before diving into specifics. The same approach applies equally well to user feedback surveys; little about the analysis is specific to education.

This process starts from scratch: there is no preconceived notion of what the themes are, and the model derives them organically from the comments themselves.

## Imports and setup

In [28]:
import pandas as pd
from pprint import pprint
from pathlib import Path
from dotenv import load_dotenv, find_dotenv
from survey_analysis.theme_derivation import DeriveThemes, derive_themes
from survey_analysis.models_common import CommentModel, CommentBatch
from survey_analysis.single_input_task import apply_task

In [11]:
# this makes it more robust to run async tasks inside an already async environment (jupyter notebooks)
import nest_asyncio
nest_asyncio.apply()

Make sure to either set `OPENAI_API_KEY` as an environment variable or put it in a `.env` file and use the cell below to load it. The format in the `.env` file is:
```
OPENAI_API_KEY=yourKeyGoesHere
```

In [2]:
load_dotenv(find_dotenv())

True

In [3]:
%load_ext autoreload
%autoreload 2

This is a convenience function to make seeing Pandas dataframe values easier, especially when there are long strings like the student comments we will be using.

In [4]:
def full_show(df):
    with pd.option_context('display.max_columns', None, 'display.max_rows', None, 'display.max_colwidth', None):
        display(df)

## Load the example data

In [5]:
data_path = Path('../data/example_data')

Let's load up some fake data. 

All of these comments are synthetic to avoid sharing any sensitive or personally identifiable information (PII), but they work well for illustration purposes. There are 100 rows, with a few null/NaN values here and there for realism. In most surveys I've seen, there are quite a few null, None, or blank values, and the functions are written to handle those.
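How the package filters these values internally is not shown in this notebook. As a rough illustration of the idea only, here is a minimal standard-library sketch that drops empty entries from a list of comments. The helper names `is_blank` and `drop_blank_comments` are hypothetical and are not part of `survey_analysis`.

```python
import math


def is_blank(value):
    """Return True for None, NaN, or whitespace-only strings."""
    if value is None:
        return True
    # pandas represents missing values in text columns as float NaN
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def drop_blank_comments(comments):
    """Keep only comments that carry text, preserving their order."""
    return [c for c in comments if not is_blank(c)]


# Invented sample values, not taken from the example dataset
raw = ["Great labs", None, float("nan"), "   ", "Clear videos"]
cleaned = drop_blank_comments(raw)
print(cleaned)
```

A list produced by `example_survey['best_parts'].tolist()` could be passed through a filter like this before further processing, although the library functions used below are described as handling missing values on their own.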

In [6]:
example_survey = pd.read_csv(data_path / 'example_survey_data_synthetic.csv')
full_show(example_survey.head())

Unnamed: 0,best_parts,enhanced_learning,improve_course
0,I valued the practical clinical aspects related to immune-related disorders and their management.,The illustrative visuals and straightforward explanatory clips.,Consider reducing the duration of certain videos. A few appeared to be slightly prolonged.
1,The flexibility to learn at a self-determined speed,The opportunity to review the lecture content,"The pace of some lectures could be slowed down. At times, it's challenging to follow the lecturer's speech or decipher their handwriting."
2,The educational content was extremely enriching and stimulating! The section on oncology was the highlight.,the self-assessment activities.,Nothing specific comes to mind.
3,Professional growth within the medical sector,"The practical integration workshops were highly beneficial, they significantly contributed to a deeper comprehension of the theories and their implementation in a healthcare environment.",Incorporating a few advanced projects as optional tasks could benefit learners who wish to delve deeper into the subject matter. These projects wouldn't need to influence exam scores.
4,The highlights of the class included the practical demonstration clips that made the complex biological principles more understandable by connecting them to daily well-being and actions. This connection was incredibly beneficial as I navigated the course content.,"The aspect of the course that most facilitated my learning was the regular assessments provided at each segment, which helped confirm my grasp of the material presented. These checkpoints effectively guided me in the correct learning direction. It's evident that considerable effort was invested in designing these educational modules to enable students to gain a deep comprehension rather than just a superficial understanding of the subject matter.","Extend the duration of the concept videos for the more challenging topics, as they require a deeper dive to fully grasp the intricacies involved. Additionally, consider introducing an additional educator to the mix. The dynamic of having multiple voices in another subject area is quite engaging, and it would be beneficial to replicate that experience in this subject to prevent monotony from setting in with just one instructor."


### Create variables for the original survey questions

The column names are shorthand for easy use in the code below, but let's also define some variables that hold the original survey question asked for each column. We'll use these as metadata when passing comments to the LLM routines. This is potentially important context for an LLM. After all, the survey comment "The flexibility to learn at a self-determined speed" may carry a different significance if the question was "What were the best parts of the course?" rather than "What could be improved about the course?".
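To make the point about context concrete, here is a small sketch that pairs the same comment with two different questions. The function `build_context` and its output format are assumptions made for illustration; the actual prompts used by `survey_analysis` are not shown here and may look quite different.

```python
def build_context(question, comment):
    """Pair a comment with the survey question that prompted it.

    Hypothetical helper: the text layout is an illustrative assumption.
    """
    return f"Survey question: {question}\nStudent comment: {comment}"


comment = "The flexibility to learn at a self-determined speed"
questions = (
    "What were the best parts of the course?",
    "What could be improved about the course?",
)

# The comment text is identical, but the framing around it changes
for question in questions:
    print(build_context(question, comment))
    print()
```

Under the first question the comment reads as praise; under the second it reads as a request for more flexibility.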

In [7]:
best_parts_question = "What were the best parts of the course?"
enhanced_learning_question = "What parts of the course enhanced your learning the most?"
improve_course_question = "What could be improved about the course?"

We'll also load some Coursera reviews (sourced from [this Kaggle dataset](https://www.kaggle.com/datasets/imuhammad/course-reviews-on-coursera)), using just the first 100 as an example. The included example file contains only the first 200 rows of the full 1.45 million; I didn't include the full set so as not to bloat the size of this repo.

In [8]:
coursera_survey = pd.read_csv(data_path / 'coursera_survey_200rows.csv', nrows=100)
full_show(coursera_survey.head())

Unnamed: 0,reviews,reviewers,date_reviews,rating,course_id
0,"Pretty dry, but I was able to pass with just two complete watches so I'm happy about that. As usual there were some questions on the final exam that were NO WHERE in the course, which is annoying but far better than many microsoft tests I have taken. Never found the suplimental material that the course references... but who cares... i passed!",By Robert S,"Feb 12, 2020",4,google-cbrs-cpi-training
1,would be a better experience if the video and screen shots would sho on the side of the text that the instructor is going thru so that user does not have to go all the way to beginning of text to be able to view any slides instructor is showing.,By Gabriel E R,"Sep 28, 2020",4,google-cbrs-cpi-training
2,Information was perfect! The program itself was a little annoying. I had to wait 30 to 45 minutes after watching the videos to to take the quiz. Other than that the information was perfect and passed the test with no issues!,By Jacob D,"Apr 08, 2020",4,google-cbrs-cpi-training
3,A few grammatical mistakes on test made me do a double take but all in all not bad.,By Dale B,"Feb 24, 2020",4,google-cbrs-cpi-training
4,Excellent course and the training provided was very detailed and easy to follow.,By Sean G,"Jun 18, 2020",4,google-cbrs-cpi-training


In [9]:
coursera_review_question = "What is your review of the course?"

## Theme derivation (bottom up)

### Derive themes from a batch of comments

Notice that we're passing the original survey question that prompted these responses. This is useful metadata for the LLM (see the data loading section above).

Single pass:

In [10]:
survey_task = DeriveThemes(question=best_parts_question)
comments = example_survey['best_parts'].tolist()
task_input = CommentBatch(comments=[CommentModel(comment=comment) for comment in comments])
sample_output = await apply_task(task_input=task_input,
                                 get_prompt=survey_task.prompt_messages,
                                 result_class=survey_task.result_class)

print(sample_output.model_dump_json(indent=2))

{
  "themes": [
    {
      "theme_title": "Practical Applications",
      "description": "Students appreciated the practical application segments of the lessons, which helped deepen their understanding of complex concepts through real-world scenarios, laboratory exercises, and case studies.",
      "citations": [
        "The practical application segments of the lessons were beneficial. They aided in deepening comprehension of the molecular processes.",
        "The practical exercises in the laboratory and their relevance to real-world health conditions or metabolic processes in organisms.",
        "The practical application sessions involving real-world patient scenarios (immune therapy techniques, cellular therapy) were outstanding."
      ]
    },
    {
      "theme_title": "Visual and Interactive Content",
      "description": "Visual aids such as instructional videos, animations, and interactive modules were highlighted as engaging and effective in simplifying complex topics a

The themes the LLM derives in a single pass over the comments (here defaulting to `gpt-4-0125-preview`) seem, in repeated experimentation, to depend heavily on the order of the comments, and they vary even with `temperature=0`. Sometimes the model comes up with just a single theme when the comments clearly contain more. Therefore, there is a convenience function that runs multiple shuffled passes to mitigate LLM positional bias and then combines the results behind the scenes. This seems to stabilize the resulting themes nicely.
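The shuffling idea can be sketched roughly as follows. This is not the code inside `derive_themes`; the function name `shuffled_passes` and its `seed` parameter are hypothetical, and the sketch covers only the reordering step. In the real workflow, each ordering would be sent to the model and the resulting themes combined afterwards.

```python
import random


def shuffled_passes(comments, n_passes=3, seed=0):
    """Return n_passes independently shuffled copies of comments.

    Hypothetical helper for illustration; seed is an assumed parameter
    that makes the orderings reproducible.
    """
    rng = random.Random(seed)
    passes = []
    for _ in range(n_passes):
        order = list(comments)  # copy so the caller's list is untouched
        rng.shuffle(order)
        passes.append(order)
    return passes


# Invented placeholder comments
comments = ["labs", "videos", "quizzes", "pacing", "instructor"]
for i, order in enumerate(shuffled_passes(comments), start=1):
    print(f"pass {i}: {order}")
```

Every pass contains the same comments, so any difference in the themes derived from each ordering can be attributed to position rather than content.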

Multi-pass (3 runs) theme derivation, shuffling comments each time:

In [15]:
comments = example_survey['best_parts'].tolist() # same 100 fake example comments as above
sample_output = await derive_themes(comments=comments, 
                                    question=best_parts_question, 
                                    shuffle_passes=3) # this defaults to 3 passes but just making it explicit here

pass 1
title: Practical Applications
description: Students appreciated the practical applications and exercises throughout the course, which helped them understand complex concepts and their relevance to real-world medical practice.

pass 2
title: Engaging Visual Content
description: Students appreciated the use of visual aids, including videos, animations, and illustrations, to make complex topics more understandable and engaging. This approach helped in simplifying the material and enhancing learning.

title: Practical Applications
description: The course was praised for its practical application sessions, hands-on activities, and real-world clinical scenarios. These elements helped students grasp fundamental ideas and understand their significance within the medical practice.

title: Comprehensive Resources
description: Students valued the comprehensive range of instructional materials provided, including study guides, vocabulary lists, and additional resources. These materials faci

What themes did we arrive at?

In [16]:
# print the titles of the themes from sample_output
for theme in sample_output.updated_themes:
    print(theme.theme_title)

Practical and Clinical Applications
Comprehensive Educational Resources and Visual Aids
Interactive and Engaging Learning
Flexible Learning
Instructional Quality
Assessment and Feedback
Cutting-Edge Content


Now let's take a look at the model's reasoning in combining themes along the way, across the different passes, and then get more detail on the final themes.

In [32]:
print("Reasoning:\n")
pprint(sample_output.reasoning)
print('\nFinal Themes:\n')
for theme in sample_output.updated_themes:
    print(theme.theme_title)
    pprint(theme.description)
    print('\n')

Reasoning:

("1. 'Practical and Clinical Applications' and 'Practical Applications' cover "
 'the same ground, focusing on the practical aspects and real-world '
 'applications in medical practice. \n'
 "2. 'Engaging and Comprehensive Educational Resources', 'Visual Aids', and "
 "'Comprehensive Resources' can be merged because they all emphasize the "
 'importance of educational materials, including visual aids and comprehensive '
 'study resources, in enhancing learning. \n'
 "3. 'Interactive Learning' and 'Engaging Content' are similar as both "
 'highlight the interactive and engaging nature of the course content, '
 'including hands-on activities and multimedia. \n'
 "4. The themes 'Flexible Learning', 'Instructional Quality', 'Assessment and "
 "Feedback', 'Cutting-Edge Content' are unique in their focus and do not "
 'overlap significantly with others.')

Final Themes:

Practical and Clinical Applications
('This theme emphasizes the importance of practical applications, hands-on

Pretty good! The model has made some logical choices about which themes to combine along the way to arrive at a smaller set of main themes. These can then serve as a starting point for multilabel classification of the comments if the next goal is to quantify how much feedback fell into each category.

If you want to see that process in action (and more), check out the end-to-end demo notebook that runs through a fuller workflow of how you might go about different steps from theme derivation to multilabel classification, extraction, and sentiment analysis.
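To give a sense of the quantification step mentioned above, here is a minimal sketch of counting how many comments fall under each theme once every comment has been assigned zero or more theme labels. The `assignments` data is invented for illustration and is not output from the model, and `theme_counts` is a hypothetical helper rather than part of `survey_analysis`.

```python
from collections import Counter


def theme_counts(assignments):
    """Count how many comments were labeled with each theme."""
    counts = Counter()
    for labels in assignments:
        counts.update(set(labels))  # count each theme once per comment
    return counts


# Invented labels, one list per comment; an empty list means no theme applied
assignments = [
    ["Flexible Learning"],
    ["Practical and Clinical Applications", "Instructional Quality"],
    ["Practical and Clinical Applications"],
    [],
]

for theme, n in theme_counts(assignments).most_common():
    print(f"{theme}: {n}")
```

Because a comment can carry several labels, the counts may sum to more than the number of comments.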

Your turn: as an exercise, use the `coursera_survey` comments that we loaded and run `derive_themes` on them. We already defined the `coursera_review_question` variable, which you will want to use.