# Project Part 3 - RateMyProfessor Deep Learning Model

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/eboyer221/CS39AA-project/blob/main/Project%20Part%203.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eboyer221/CS39AA-project/blob/main/Project%20Part%203.ipynb)

For Part 3 of this project, I use the 'BertForSequenceClassification' model for three-class sentiment classification (positive, neutral, negative). The code attempts to fine-tune BERT using the 'transformers' library by Hugging Face.

In [74]:
#!pip install transformers[torch]

In [85]:
!pip install --upgrade ipywidgets 

Defaulting to user installation because normal site-packages is not writeable


In [86]:
# Import packages
import pandas as pd
import nltk
import torch
import torch.nn.functional as F
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from torch.utils.data import DataLoader, TensorDataset, random_split
from torch.nn import CrossEntropyLoss
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import torch.cuda
import datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification,  TrainingArguments, Trainer
from datasets import Dataset, load_metric

In [87]:
# Load the ratemyprofessor ratings dataset
data_path = 'https://raw.githubusercontent.com/eboyer221/CS39AA-Project/main/merged_data.csv'
df_1 = pd.read_csv(data_path)

In [88]:
#Apply cleaning steps to the dataset
#remove rows that have null values in either of these columns
columns_to_check = ['student_star', 'comments']

# Remove rows with null values in either of the specified columns
df_1 = df_1.dropna(subset=columns_to_check)

# Reset the index after removing rows
df_1.reset_index(drop=True, inplace=True)
# Columns to remove 
columns_to_remove = ['school_name', 'local_name', 'state_name',
                    'year_since_first_review', 'take_again', 'diff_index',
                    'tag_professor', 'post_date', 'name_onlines', 'attence',
                    'for_credits', 'would_take_agains', 'grades', 'stu_tags',
                    'help_useful', 'help_not_useful', 'professor_name', 'department_name',
                    'num_student', 'star_rating', 'student_difficult']

# Drop the specified columns
df = df_1.drop(columns=columns_to_remove)

#Change the pandas default column width to view more of the comments field
pd.set_option("display.max_colwidth", 370)

df.head()

Unnamed: 0,student_star,comments
0,3.5,"Good guy, laid back and interested in his field. Class can get... a little..... slllllllloooooowwwwwwww during his junior workshop."
1,5.0,such a fun professor. really helpful and knows his stuff
2,5.0,Such a easy class. It\'s simple. Do your homework and pay attention and you will fly right by or be the person that blames him for not leaarning. He wont let you fail. just ask for help....
3,5.0,"A very hard class, and a massive amount of work. But, Soazig is also very good about explaining difficult concepts, gives excellent feedback, and is very accessible for extra assistance."
4,1.0,"Took 100 level class for Ethics offered online as an option to fill a core requirement She was terrible! Did not seem to have a grasp of the English language nor does she seem to have a grasp on reality as she insisted many times that failure in an ENTRY LEVEL, OPTIONAL class is very common due to the ""difficulty"" of material, very full of herself"


The variable I am primarily focused on predicting from the comments is the star rating of the professor's overall quality. This is a continuous numerical variable, but it can be conceptually divided into quality categories. According to RMP's official standard, a rating of 3.5-5.0 is good, 2.5-3.4 is average, and 1.0-2.4 is poor.

In [89]:
# Create a new rating column that reflects the sentiment, where:
# ratings greater than or equal to 3.5 are considered positive
# ratings of 2.5-3.4 are considered neutral
# ratings less than 2.5 are considered negative

# Function to categorize ratings
def categorize_sentiment(rating):
    if rating >= 3.5:
        return 'positive'
    elif 2.5 <= rating < 3.5:
        return 'neutral'
    else:
        return 'negative'

# Create a new column 'rating_sentiment' based on the 'student_star' column
df['rating_sentiment'] = df['student_star'].apply(categorize_sentiment)

rating_result_counts = df['rating_sentiment'].value_counts()

# Display the counts
print(rating_result_counts)

df.head(10)

rating_sentiment
positive    13549
negative     4095
neutral      1940
Name: count, dtype: int64


Unnamed: 0,student_star,comments,rating_sentiment
0,3.5,"Good guy, laid back and interested in his field. Class can get... a little..... slllllllloooooowwwwwwww during his junior workshop.",positive
1,5.0,such a fun professor. really helpful and knows his stuff,positive
2,5.0,Such a easy class. It\'s simple. Do your homework and pay attention and you will fly right by or be the person that blames him for not leaarning. He wont let you fail. just ask for help....,positive
3,5.0,"A very hard class, and a massive amount of work. But, Soazig is also very good about explaining difficult concepts, gives excellent feedback, and is very accessible for extra assistance.",positive
4,1.0,"Took 100 level class for Ethics offered online as an option to fill a core requirement She was terrible! Did not seem to have a grasp of the English language nor does she seem to have a grasp on reality as she insisted many times that failure in an ENTRY LEVEL, OPTIONAL class is very common due to the ""difficulty"" of material, very full of herself",negative
5,3.5,No Comments,positive
6,5.0,"She is an extremely demanding professor. The work load is a big one. But she is also extremely helpful and explains things very clearly. She\'s not a professor for slackers... not and easy A, but definitely not impossible. Show up. Pay attention. Ask for help.",positive
7,3.0,"Boo. When I took multicultural psych from Dr. Swaney it was more like Salish Psych. The first day (of a 300 level class!) we went over APA style guidelines and how to use blackboard. She\'s a nice person, but should not be teaching at a college level.",neutral
8,5.0,Steph was very helpful and cared about our experience. She wanted to be sure we learned and were comfortable with everything before our testing. Go to her office hours if you need help. I learned a lot.,positive
9,5.0,One of the best classes I took at UM.,positive
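
The class counts printed above are noticeably imbalanced, so it helps to know what accuracy a trivial classifier would reach by always predicting the most common class. The following is a minimal sketch of that calculation; the counts are copied by hand from the `value_counts()` output, and the variable names are illustrative rather than part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {"positive": 13549, "negative": 4095, "neutral": 1940}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class scores this accuracy
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label:>8}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any fine-tuned model should be judged against this baseline (roughly 69%) rather than against chance.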


In [90]:
MODEL_NAME = "bert-base-cased"
MAX_LENGTH=50

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3, max_length=MAX_LENGTH, output_attentions=False, output_hidden_states=False)

AttributeError: 'FloatProgress' object has no attribute 'style'

In [61]:
classes = df.rating_sentiment.unique().tolist()
class_tok2idx = dict((v, k) for k, v in enumerate(classes))
class_idx2tok = dict((k, v) for k, v in enumerate(classes))
print(class_tok2idx)
print(class_idx2tok)

{'positive': 0, 'negative': 1, 'neutral': 2}
{0: 'positive', 1: 'negative', 2: 'neutral'}


Create a new column that holds these integer labels. This column is the target (y) used for training.

In [62]:
df['label'] = df['rating_sentiment'].apply(lambda x: class_tok2idx[x])
df.head()

Unnamed: 0,student_star,comments,rating_sentiment,label
0,3.5,"Good guy, laid back and interested in his field. Class can get... a little..... slllllllloooooowwwwwwww during his junior workshop.",positive,0
1,5.0,such a fun professor. really helpful and knows his stuff,positive,0
2,5.0,Such a easy class. It\'s simple. Do your homework and pay attention and you will fly right by or be the person that blames him for not leaarning. He wont let you fail. just ask for help....,positive,0
3,5.0,"A very hard class, and a massive amount of work. But, Soazig is also very good about explaining difficult concepts, gives excellent feedback, and is very accessible for extra assistance.",positive,0
4,1.0,"Took 100 level class for Ethics offered online as an option to fill a core requirement She was terrible! Did not seem to have a grasp of the English language nor does she seem to have a grasp on reality as she insisted many times that failure in an ENTRY LEVEL, OPTIONAL class is very common due to the ""difficulty"" of material, very full of herself",negative,1


In [75]:
sequence_0 = "A very hard class, and a massive amount of work. But, Soazig is also very good about explaining difficult concepts, gives excellent feedback, and is very accessible for extra assistance."
seq0_tokens = tokenizer(sequence_0, return_tensors="pt")
print(f"number of tokens in seq0 is {len(seq0_tokens['input_ids'].flatten())}")
print(seq0_tokens)
F.softmax(model(**seq0_tokens).logits, dim=1)

NameError: name 'tokenizer' is not defined
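
The cell above passes the model's logits through `F.softmax` to obtain class probabilities. As a rough illustration of what that step does, here is a pure-Python sketch that needs neither the model nor the tokenizer. The logits are made-up values chosen for illustration, not real model output, and the `softmax` helper is a stand-in for `torch.nn.functional.softmax`.

```python
import math

# Mapping taken from the class_idx2tok printout earlier in the notebook
class_idx2tok = {0: "positive", 1: "negative", 2: "neutral"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one comment (illustrative values only)
example_logits = [2.0, 0.5, -1.0]
probs = softmax(example_logits)

# The predicted class is the index with the highest probability
predicted = class_idx2tok[probs.index(max(probs))]
print(probs, predicted)
```

Note that before fine-tuning, the classification head is randomly initialized, so the probabilities for any comment are not yet meaningful.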

In [71]:
ds_raw = Dataset.from_pandas(df[['label','comments']])
ds_raw[0]

{'label': 0,
 'comments': 'Good guy, laid back and interested in his field. Class can get... a little..... slllllllloooooowwwwwwww during his junior workshop.'}

In [72]:
def tokenize_function(examples):
    return tokenizer(examples["comments"], padding="max_length", truncation=True, max_length=MAX_LENGTH)

ds = ds_raw.map(tokenize_function, batched=True)

AttributeError: 'FloatProgress' object has no attribute 'style'
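
The `tokenize_function` above asks the tokenizer to pad and truncate every comment to `MAX_LENGTH` tokens. The sketch below imitates that behaviour on plain lists of integers so the effect can be seen without downloading a tokenizer. It is a simplification: the token ids are hypothetical, the helper `pad_or_truncate` is not part of any library, and a real BERT tokenizer also keeps its special tokens in place when it truncates.

```python
MAX_LENGTH = 50  # same value as in the notebook
PAD_ID = 0       # assumption: 0 is the padding id, as in BERT vocabularies

def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop ids past the limit
    n_pad = max_length - len(kept)         # padding needed to reach the limit
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask

short_ids = list(range(1000, 1012))  # hypothetical 12-token comment
long_ids = list(range(1000, 1080))   # hypothetical 80-token comment

short_input, short_mask = pad_or_truncate(short_ids)
long_input, long_mask = pad_or_truncate(long_ids)
print(len(short_input), sum(short_mask))
print(len(long_input), sum(long_mask))
```

With a limit of 50, longer comments lose everything past the cutoff, so it may be worth checking how many comments exceed that length before settling on the value.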

In [None]:
ds[0]

In [70]:
ds = ds.shuffle(seed=42)
ds[0]

NameError: name 'ds' is not defined

In [69]:
train_prop = 0.85
ds_train = ds.select(range(int(len(ds)*train_prop)))
ds_eval = ds.select(range(int(len(ds)*train_prop), len(ds)))

NameError: name 'ds' is not defined

In [None]:
print(f"len(ds_train) = {len(ds_train)}")
print(f"len(ds_eval) = {len(ds_eval)}")
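
The split above shuffles the dataset and takes the first 85% for training. Because the classes are imbalanced, the share of each class in the evaluation set can drift slightly from the overall share. One possible alternative is a stratified split, sketched below in plain Python. The function `stratified_split` and the toy labels are illustrative assumptions, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_prop=0.85, seed=42):
    """Return (train_indices, eval_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, eval_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        cut = int(len(indices) * train_prop)      # per-class cutoff
        train_idx.extend(indices[:cut])
        eval_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(eval_idx)

# Toy labels for illustration: 0 = positive, 1 = negative, 2 = neutral
toy_labels = [0] * 70 + [1] * 20 + [2] * 10
train_idx, eval_idx = stratified_split(toy_labels)
print(len(train_idx), len(eval_idx))
```

In the notebook, the resulting index lists could presumably be passed to `ds.select(...)` in place of the `range(...)` arguments, with the labels taken from `ds["label"]`.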

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
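import numpy as np  # needed for np.argmax below

# Note: load_metric is deprecated in recent releases of `datasets`;
# the separate `evaluate` package provides the same accuracy metric.
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer
    logits, labels = eval_pred
    # Pick the highest-scoring class for each example
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)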

training_args = TrainingArguments(num_train_epochs=10,
                                  do_train=True,
                                  report_to="none",  # the string "none" disables logging integrations
                                  output_dir="/kaggle/working",
                                  evaluation_strategy="steps",
                                  eval_steps=200,
                                  learning_rate=1e-5,
                                  per_device_train_batch_size=32,
                                  per_device_eval_batch_size=32)

trainer = Trainer(model = model, 
                  args = training_args,
                  train_dataset = ds_train, 
                  eval_dataset = ds_eval,
                  compute_metrics = compute_metrics,
)
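
The `compute_metrics` function above reports accuracy only. On imbalanced labels, accuracy can look reasonable even when the minority classes are never predicted, so a per-class measure such as macro-averaged F1 is a useful companion. The sketch below computes both in plain Python on a toy example; the `macro_f1` helper and the toy predictions are illustrative assumptions, not the notebook's metric.

```python
def macro_f1(predictions, references, num_classes=3):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(predictions, references))
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for p, r in pairs if p == c and r == c)
        fp = sum(1 for p, r in pairs if p == c and r != c)
        fn = sum(1 for p, r in pairs if p != c and r == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy example: 7 positive (0), 2 negative (1), 1 neutral (2),
# and a model that always predicts "positive"
references = [0] * 7 + [1] * 2 + [2]
predictions = [0] * 10

accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
macro = macro_f1(predictions, references)
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro:.2f}")
```

Here the always-positive model keeps a respectable accuracy while its macro F1 is much lower, which is the gap this metric is meant to expose.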

In [None]:
if torch.cuda.is_available():
    device = "cuda:0"
    print("Using GPU")
else: 
    device = "cpu"

In [None]:
model.to(device)

In [None]:
torch.set_grad_enabled(True)
trainer.train()
trainer.evaluate()
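
To get a feel for how long the run above takes, the number of optimizer steps can be estimated from the dataset size and the training arguments. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The row count is the sum of the class counts printed earlier.

```python
import math

n_rows = 13549 + 4095 + 1940   # total rows, from the class counts above
train_prop = 0.85              # same proportion as the split cell
batch_size = 32                # per_device_train_batch_size
epochs = 10                    # num_train_epochs
eval_steps = 200               # eval_steps

n_train = int(n_rows * train_prop)
n_eval = n_rows - n_train

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
n_evaluations = total_steps // eval_steps  # evaluations triggered during training

print(f"train rows: {n_train}, eval rows: {n_eval}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"evaluations during training: {n_evaluations}")
```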