  Case Study: Analyzing Student Feedback using LLMs

  Background:

  In the edtech space, it's common for platforms to receive vast amounts of feedback from students. This feedback can be about course content, platform usability, instructor quality, etc. Analyzing this feedback manually can be time-consuming. LLMs can assist in summarizing and categorizing this feedback for actionable insights.

  Problem Statement:

  MTT Consulting & Edtech wants to analyze student feedback to improve its online course offerings. The feedback is free text, and there are thousands of responses. The goal is to assign each response to a theme (e.g., "Course Content", "Instructor Quality", "Platform Usability") and to a sentiment (positive, negative, or neutral).

  Challenges:

  •	Large volume of feedback.
  •	Variability in language and expression among students.
  •	Need for accurate categorization and sentiment analysis.

  Dataset:

  For this case study, we'll use the "Online Course Reviews" dataset from Kaggle. This dataset contains textual feedback from various online courses.

  Data Pre-processing:

  •	Load the dataset and perform basic cleaning (remove duplicates, handle missing values).
  •	Tokenize the feedback using LLM-specific tokenizers.

  Categorization with LLM:

  •	Fine-tune an LLM (like BERT) on a subset of manually labeled feedback to classify it into categories.
  •	Use the fine-tuned model to predict categories for the entire dataset.

  Sentiment Analysis:

  •	Use a pre-trained sentiment analysis model (like the ones available in HuggingFace's transformers library) to determine the sentiment of each review.

  Results Visualization:

  •	Visualize the distribution of feedback across categories.
  •	Visualize sentiment distribution within each category.
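
The two visualization steps in the plan reduce to two tallies: reviews per category, and sentiment counts inside each category. The sketch below shows those tallies with the standard library only. The `predictions` list is made-up illustration data, not output from any model in this notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (category, sentiment) pairs standing in for model predictions.
predictions = [
    ("Course Content", "positive"),
    ("Course Content", "positive"),
    ("Course Content", "negative"),
    ("Instructor Quality", "positive"),
    ("Instructor Quality", "neutral"),
    ("Platform Usability", "negative"),
]

# Distribution of feedback across categories.
category_counts = Counter(category for category, _ in predictions)

# Sentiment distribution within each category.
sentiment_by_category = defaultdict(Counter)
for category, sentiment in predictions:
    sentiment_by_category[category][sentiment] += 1

for category, total in category_counts.most_common():
    shares = {s: round(n / total, 2) for s, n in sentiment_by_category[category].items()}
    print(f"{category}: {total} reviews, sentiment shares {shares}")
```

Counts in this shape can be passed directly to a bar chart, for example one bar per category or one stacked bar per category split by sentiment.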



In [1]:
!pip install transformers --quiet

In [2]:
!pip install transformers[torch] --quiet

In [None]:
!unzip reviews.csv.zip

## Data Exploration, Cleaning and Preparation

In [31]:
import pandas as pd
import matplotlib.pyplot as plt
import re
import numpy as np

In [53]:
df = pd.read_csv('review1100_categorized.csv')
df.shape

(1147, 4)

In [54]:
# check for duplicate entries
print(df.shape)
df.drop_duplicates(inplace=True)
print(df.shape)

(1147, 4)
(1147, 4)


In [55]:
df.isnull().sum()

Id          0
Review      0
Label       0
Category    0
dtype: int64

In [35]:
# print 10 randomly chosen reviews (positional indexing, row 0 included)
for i in np.random.randint(0, df.shape[0], 10):
  print(df['Review'].iloc[i])
  print("#"*50)

It's really well organized. I like the pace of the course and the voice of "Professor Andy"!
##################################################
I am in love for this course, a lot of good information, practice activities, etc...It is helping me a lot!
##################################################
Awesome course, excelent material, great classes.
##################################################
Some great material and I learned a lot. I did in fact complete the course but chose not to submit written material. Sometimes the lecturer was perhaps a little dry.
##################################################
It's a very nice course if you're doing some refreshing of college chemistry.
##################################################
Suberb introducition into 3d printing idea and usage
##################################################
Definitely recommend this short course which gives a good overview of Ableton Live's capabilities and how to use them.
###########################

In [36]:
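def clean_data(doc):
  # remove emoticon-like tokens such as :D or :)
  doc = re.sub(r":[a-zA-Z0-9().-]+", "", doc)
  # drop everything except letters, digits and whitespace
  doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
  # collapse runs of whitespace into a single space
  doc = re.sub(r"\s+", " ", doc)
  doc = doc.strip()
  return doc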

clean_data("my name is Anshu :D and :) what is 'your' name ???---...")

'my name is Anshu and what is your name'
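
One side effect of deleting punctuation outright is that hyphenated words, and words separated only by a comma, fuse into a single token. The sketch below contrasts that behaviour with a variant that turns punctuation into a space first. Both function names are hypothetical and introduced only for this comparison; `delete_punctuation` mirrors the deletion logic of `clean_data`. Whether the difference matters depends on the tokenizer used downstream.

```python
import re

def delete_punctuation(doc):
    """Same deletion logic as clean_data: punctuation vanishes, so joined words fuse."""
    doc = re.sub(r":[a-zA-Z0-9().-]+", "", doc)
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
    doc = re.sub(r"\s+", " ", doc)
    return doc.strip()

def space_punctuation(doc):
    """Hypothetical variant: punctuation becomes a space instead of vanishing."""
    doc = re.sub(r":[a-zA-Z0-9().-]+", " ", doc)   # drop emoticons such as :D or :)
    doc = re.sub(r"[^a-zA-Z0-9\s]", " ", doc)      # punctuation -> space
    doc = re.sub(r"\s+", " ", doc)                 # collapse whitespace runs
    return doc.strip()

sample = "Great hands-on course,really well-paced!"
print(delete_punctuation(sample))  # Great handson coursereally wellpaced
print(space_punctuation(sample))   # Great hands on course really well paced
```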

In [37]:
df['Review'] = df['Review'].apply(clean_data)

In [38]:
# print 10 randomly chosen cleaned reviews (positional indexing, row 0 included)
for i in np.random.randint(0, df.shape[0], 10):
  print(df['Review'].iloc[i])
  print("#"*50)

I greatly enjoyed the material presented in this course and while I do not doubt the proficiency of its creators the poor standard of spoken English of its two main presenters prevents me from recommending it to colleagues As another reviewer has remarked their English may be quite sufficient for day to day use with other experts but for inadequate for teaching students who are hearing many technical terms for the first time The transcripts are no better and also contain many mistakes I believe that this course should be stopped until these problems are corrected
##################################################
Great course A must for every individual who has some idea about it and even if you dont Aric and his team will guide you throughout this wonderful technique
##################################################
Worldclass
##################################################
Great intro to Ableton with good handson experience Nice starting point for those who seek to expand their m

## Tokenization

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
input_data = tokenizer(list(df['Review']),truncation=True,padding=True,max_length=512)
df.head()
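
`BertTokenizer` splits each word into WordPiece sub-words using greedy longest-match-first lookup against its vocabulary. The sketch below illustrates that matching rule on a toy vocabulary. The function `wordpiece` and the vocabulary `toy_vocab` are illustrative assumptions, and the sketch leaves out the lowercasing, punctuation splitting and special tokens that the real tokenizer also handles.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary for illustration only; the real bert-base-uncased vocabulary is far larger.
toy_vocab = {"course", "excel", "##ent", "great", "##ly", "material"}

print(wordpiece("excelent", toy_vocab))  # ['excel', '##ent']
print(wordpiece("greatly", toy_vocab))   # ['great', '##ly']
print(wordpiece("suberb", toy_vocab))    # ['[UNK]']
```

This is why misspelled reviews such as "excelent" can still be represented: the word is broken into smaller known pieces rather than discarded.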

In [41]:
y = df['Category']
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
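
`LabelEncoder` maps each distinct category string to an integer index, with the classes stored in sorted order. A minimal pure-Python sketch of the same mapping follows. The `categories` list is a made-up sample, not rows from the dataset.

```python
# Hypothetical sample of category labels.
categories = ["Platform Usability", "Course Content", "Instructor Quality", "Course Content"]

# Sorted unique labels play the role of le.classes_.
classes = sorted(set(categories))
class_to_id = {name: idx for idx, name in enumerate(classes)}

encoded = [class_to_id[name] for name in categories]  # like le.fit_transform
decoded = [classes[idx] for idx in encoded]           # like le.inverse_transform

print(classes)  # ['Course Content', 'Instructor Quality', 'Platform Usability']
print(encoded)  # [2, 0, 1, 0]
```

Because the order is alphabetical, index 0 always refers to "Course Content" here, which is what makes it possible to map model outputs back to names later.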

In [None]:
!pip install datasets --quiet

In [46]:
from datasets import Dataset
# only input_ids and labels are passed here; the tokenizer's attention_mask is left out
train_data = Dataset.from_dict({'input_ids':input_data['input_ids'],'labels':y})

## Text Classification Model

In [47]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# one output label per feedback category
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",num_labels=3)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [49]:
# define the training arguments
training_args = TrainingArguments(
    output_dir = "./results",
    per_device_train_batch_size=8,
    num_train_epochs=5,
    logging_dir='./logs'
)

# create a Trainer object to train the model
trainer = Trainer(model=model,
                  args = training_args,
                  train_dataset=train_data)

trainer.train()

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
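
This warning appears because the training dataset contains `input_ids` but no `attention_mask`, so the model cannot tell real tokens from padding. The tokenizer call already returns an `attention_mask` next to `input_ids`, and it can be passed to `Dataset.from_dict` as an extra column. The sketch below shows what that mask encodes. `pad_batch` is a hypothetical helper, the token ids are made up, and the truncation is simplified (the real tokenizer keeps the closing special token).

```python
PAD_ID = 0  # assumption: 0 is the [PAD] id, as in bert-base-uncased

def pad_batch(sequences, max_length):
    """Truncate to max_length, pad to the longest sequence, and build the mask."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, not real tokenizer output.
batch = pad_batch([[101, 2307, 2607, 102], [101, 102]], max_length=512)
print(batch["input_ids"])       # [[101, 2307, 2607, 102], [101, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```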


Step    Training Loss
500     0.316


TrainOutput(global_step=720, training_loss=0.31351425382826065, metrics={'train_runtime': 592.3106, 'train_samples_per_second': 9.682, 'train_steps_per_second': 1.216, 'total_flos': 1508955450670080.0, 'train_loss': 0.31351425382826065, 'epoch': 5.0})
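
The `global_step=720` figure and the single logged row at step 500 both follow from the training settings. The arithmetic below reproduces them, assuming a single device, no gradient accumulation, and the default `logging_steps` of 500 (the notebook does not set it).

```python
import math

num_examples = 1147   # rows in df, from df.shape
batch_size = 8        # per_device_train_batch_size
num_epochs = 5        # num_train_epochs
logging_steps = 500   # assumption: the TrainingArguments default

steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, logged_steps)  # 144 720 [500]
```

Lowering `logging_steps` (for example to 50) would produce a loss curve with more points than the single value shown above.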

In [65]:
class_names = le.classes_.tolist()
class_names

['Course Content', 'Instructor Quality', 'Platform Usability']

In [86]:
# note: these reviews are sampled from the same rows the model was trained on
testset = []
actuals = []
for i in np.random.randint(0,df.shape[0],5):
  testset.append(df['Review'].iloc[i])
  actuals.append(df['Category'].iloc[i])
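
Because these five reviews come from the rows used for training, correct predictions on them say little about how the model handles unseen feedback. A common remedy is to set aside a holdout portion before training. The sketch below shows one way to do that with the standard library. `holdout_split`, the 20% fraction and the seed are illustrative choices, not part of the original notebook.

```python
import random

def holdout_split(n_rows, test_fraction=0.2, seed=42):
    """Return disjoint (train_indices, test_indices) covering range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for a reproducible split
    n_test = int(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

train_idx, test_idx = holdout_split(1147)  # 1147 rows, as reported by df.shape
print(len(train_idx), len(test_idx))       # 918 229
```

The resulting positions could then be used as `df.iloc[train_idx]` for tokenization and training and `df.iloc[test_idx]` for prediction.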

In [None]:
testset

In [88]:
test_set = Dataset.from_dict({"input_ids":tokenizer(testset,truncation=True,padding=True,max_length=512)['input_ids']})
preds = trainer.predict(test_set)

In [None]:
preds

In [None]:
# compare the actual category with the predicted one for each sampled review
for i in range(len(testset)):
  print(testset[i])
  p = np.argmax(preds.predictions[i])
  print(actuals[i],class_names[p])
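
The model returns one row of raw scores (logits) per review. The predicted category is the index of the largest score, and a softmax turns the row into probabilities. The sketch below walks through that step and computes accuracy. The `logits` and `sample_actuals` values are made up for illustration and are not output from the trained model.

```python
import math

class_names = ["Course Content", "Instructor Quality", "Platform Usability"]

# Hypothetical logits, one row per review; not output from the trained model.
logits = [
    [2.1, -0.3, -1.2],
    [-0.5, 1.8, 0.1],
    [0.2, 0.1, 1.5],
]
sample_actuals = ["Course Content", "Instructor Quality", "Course Content"]

def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# argmax over each row, mapped back to the category name
predicted = [class_names[max(range(len(row)), key=row.__getitem__)] for row in logits]
accuracy = sum(p == a for p, a in zip(predicted, sample_actuals)) / len(sample_actuals)

for row, label in zip(logits, predicted):
    print(label, round(max(softmax(row)), 2))
print(f"accuracy = {accuracy:.2f}")
```

With only a handful of sampled reviews, accuracy is a rough indication at best; a larger held-out set gives a more reliable figure.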