# Project Part 2 - RateMyProfessor Analysis

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/eboyer221/CS39AA-project/blob/main/project_part2.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eboyer221/CS39AA-project/blob/main/project_part2.ipynb)


We will now revisit the RateMyProfessor dataset from [Part 1](https://github.com/eboyer221/CS39AA-Project/blob/main/project_part1.ipynb) and attempt to fit a baseline model to predict ratings using review comments.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline 
from scipy.stats import uniform, randint
import re
import ast
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download NLTK resources (if not already downloaded)
#nltk.download('stopwords')
#nltk.download('wordnet')

First, we will apply the data pre-processing steps to clean the text of the comments section. In addition, we will apply stemming and lemmatization.

In [2]:
data_path = 'https://raw.githubusercontent.com/eboyer221/CS39AA-Project/main/merged_data.csv'
df_1 = pd.read_csv(data_path)
# Columns that must be non-null for a row to be kept
columns_to_check = ['student_star', 'comments']

# Remove rows with null values in either of the specified columns
df_1 = df_1.dropna(subset=columns_to_check)

# Reset the index after removing rows
df_1.reset_index(drop=True, inplace=True)
# Columns to remove 
columns_to_remove = ['school_name', 'local_name', 'state_name',
                    'year_since_first_review', 'take_again', 'diff_index',
                    'tag_professor', 'post_date', 'name_onlines', 'attence',
                    'for_credits', 'would_take_agains', 'grades', 'stu_tags',
                    'help_useful', 'help_not_useful']

# Drop the specified columns
df = df_1.drop(columns=columns_to_remove)

#Change the pandas default column width to view more of the comments field
pd.set_option("display.max_colwidth", 370)

df.head()

Unnamed: 0,professor_name,department_name,star_rating,num_student,student_star,student_difficult,comments
0,Robert Olshansky,Urban & Regional Planning department,3.5,1,3.5,2.0,"Good guy, laid back and interested in his field. Class can get... a little..... slllllllloooooowwwwwwww during his junior workshop."
1,Marshall Levett,Counseling department,5.0,2,5.0,1.0,such a fun professor. really helpful and knows his stuff
2,Marshall Levett,Counseling department,5.0,2,5.0,1.0,Such a easy class. It\'s simple. Do your homework and pay attention and you will fly right by or be the person that blames him for not leaarning. He wont let you fail. just ask for help....
3,Soazig Le Bihan,Philosophy department,3.6,4,5.0,5.0,"A very hard class, and a massive amount of work. But, Soazig is also very good about explaining difficult concepts, gives excellent feedback, and is very accessible for extra assistance."
4,Soazig Le Bihan,Philosophy department,3.6,4,1.0,4.0,"Took 100 level class for Ethics offered online as an option to fill a core requirement She was terrible! Did not seem to have a grasp of the English language nor does she seem to have a grasp on reality as she insisted many times that failure in an ENTRY LEVEL, OPTIONAL class is very common due to the ""difficulty"" of material, very full of herself"


In [3]:
# Function to clean up comments text using stemming
def clean_comments_stemm(text):
    # Check if the value is a string and not NaN
    if isinstance(text, str) and text.lower() != 'nan':
        # Convert to lowercase
        text = text.lower()

        # Remove special characters, numbers, and extra whitespaces
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Remove stop words
        stop_words = set(stopwords.words('english'))
        words = text.split()
        words = [word for word in words if word not in stop_words]
        text = ' '.join(words)

        # Perform stemming
        stemmer = PorterStemmer()
        words = text.split()
        words = [stemmer.stem(word) for word in words]
        text = ' '.join(words)

    return text

# Apply the clean_comments_stemm function to the 'comments' column
df['tokens_stemm'] = df['comments'].apply(clean_comments_stemm)

df.head()

Unnamed: 0,professor_name,department_name,star_rating,num_student,student_star,student_difficult,comments,tokens_stemm
0,Robert Olshansky,Urban & Regional Planning department,3.5,1,3.5,2.0,"Good guy, laid back and interested in his field. Class can get... a little..... slllllllloooooowwwwwwww during his junior workshop.",good guy laid back interest field class get littl slllllllloooooowwwwwwww junior workshop
1,Marshall Levett,Counseling department,5.0,2,5.0,1.0,such a fun professor. really helpful and knows his stuff,fun professor realli help know stuff
2,Marshall Levett,Counseling department,5.0,2,5.0,1.0,Such a easy class. It\'s simple. Do your homework and pay attention and you will fly right by or be the person that blames him for not leaarning. He wont let you fail. just ask for help....,easi class simpl homework pay attent fli right person blame leaarn wont let fail ask help
3,Soazig Le Bihan,Philosophy department,3.6,4,5.0,5.0,"A very hard class, and a massive amount of work. But, Soazig is also very good about explaining difficult concepts, gives excellent feedback, and is very accessible for extra assistance.",hard class massiv amount work soazig also good explain difficult concept give excel feedback access extra assist
4,Soazig Le Bihan,Philosophy department,3.6,4,1.0,4.0,"Took 100 level class for Ethics offered online as an option to fill a core requirement She was terrible! Did not seem to have a grasp of the English language nor does she seem to have a grasp on reality as she insisted many times that failure in an ENTRY LEVEL, OPTIONAL class is very common due to the ""difficulty"" of material, very full of herself",took level class ethic offer onlin option fill core requir terribl seem grasp english languag seem grasp realiti insist mani time failur entri level option class common due difficulti materi full


In [4]:
# Function to clean up comments text using lemmatization
def clean_comments_lemm(text):
    # Check if the value is a string and not NaN
    if isinstance(text, str) and text.lower() != 'nan':
        # Convert to lowercase
        text = text.lower()

        # Remove special characters, numbers, and extra whitespaces
        text = re.sub(r'[^a-zA-Z\s]', '', text)

        # Remove stop words
        stop_words = set(stopwords.words('english'))
        words = text.split()
        words = [word for word in words if word not in stop_words]

        # Perform lemmatization
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(word) for word in words]

        return words
    
    return []


# Apply the clean_comments_lemm function to the 'comments' column
df['tokens_lemm'] = df['comments'].apply(clean_comments_lemm)

df.head()

Unnamed: 0,professor_name,department_name,star_rating,num_student,student_star,student_difficult,comments,tokens_stemm,tokens_lemm
0,Robert Olshansky,Urban & Regional Planning department,3.5,1,3.5,2.0,"Good guy, laid back and interested in his field. Class can get... a little..... slllllllloooooowwwwwwww during his junior workshop.",good guy laid back interest field class get littl slllllllloooooowwwwwwww junior workshop,"[good, guy, laid, back, interested, field, class, get, little, slllllllloooooowwwwwwww, junior, workshop]"
1,Marshall Levett,Counseling department,5.0,2,5.0,1.0,such a fun professor. really helpful and knows his stuff,fun professor realli help know stuff,"[fun, professor, really, helpful, know, stuff]"
2,Marshall Levett,Counseling department,5.0,2,5.0,1.0,Such a easy class. It\'s simple. Do your homework and pay attention and you will fly right by or be the person that blames him for not leaarning. He wont let you fail. just ask for help....,easi class simpl homework pay attent fli right person blame leaarn wont let fail ask help,"[easy, class, simple, homework, pay, attention, fly, right, person, blame, leaarning, wont, let, fail, ask, help]"
3,Soazig Le Bihan,Philosophy department,3.6,4,5.0,5.0,"A very hard class, and a massive amount of work. But, Soazig is also very good about explaining difficult concepts, gives excellent feedback, and is very accessible for extra assistance.",hard class massiv amount work soazig also good explain difficult concept give excel feedback access extra assist,"[hard, class, massive, amount, work, soazig, also, good, explaining, difficult, concept, give, excellent, feedback, accessible, extra, assistance]"
4,Soazig Le Bihan,Philosophy department,3.6,4,1.0,4.0,"Took 100 level class for Ethics offered online as an option to fill a core requirement She was terrible! Did not seem to have a grasp of the English language nor does she seem to have a grasp on reality as she insisted many times that failure in an ENTRY LEVEL, OPTIONAL class is very common due to the ""difficulty"" of material, very full of herself",took level class ethic offer onlin option fill core requir terribl seem grasp english languag seem grasp realiti insist mani time failur entri level option class common due difficulti materi full,"[took, level, class, ethic, offered, online, option, fill, core, requirement, terrible, seem, grasp, english, language, seem, grasp, reality, insisted, many, time, failure, entry, level, optional, class, common, due, difficulty, material, full]"


The variable I am primarily focused on predicting from the comments is the star rating of the professor's overall quality. This is a continuous numerical variable; however, it can be conceptually divided into quality categories. According to RMP's official standard, a rating of 3.5-5.0 is good, 2.5-3.4 is average, and 1.0-2.4 is poor. Because I would like to determine the terms that distinguish the highest-performing professors, we will structure this initial model as a binary classification problem: ratings greater than or equal to 3.5 (>= 3.5) are categorized as 'good' (1), and ratings lower than 3.5 (< 3.5) are categorized as 'bad' (0).
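To make the thresholds concrete, here is a minimal sketch that maps a handful of made-up ratings to both the three RMP tiers and the binary label. The sample values, the `tiers` and `binary` names, and the upper bin edge of 5.1 (chosen only so that 5.0 falls inside a left-closed bin) are illustrative assumptions, not part of the dataset or the analysis.

```python
import pandas as pd

# Hypothetical ratings chosen to sit on each side of the tier boundaries
ratings = pd.Series([1.0, 2.4, 2.5, 3.4, 3.5, 5.0])

# right=False makes the bins left-closed: [1.0, 2.5), [2.5, 3.5), [3.5, 5.1)
tiers = pd.cut(
    ratings,
    bins=[1.0, 2.5, 3.5, 5.1],
    right=False,
    labels=["poor", "average", "good"],
)

# Binary label: 1 for 'good' (>= 3.5), 0 otherwise
binary = (ratings >= 3.5).astype(int)

print(pd.DataFrame({"rating": ratings, "tier": tiers, "binary": binary}))
```

Note that the binary scheme folds the 'average' and 'poor' tiers together into the 'bad' class.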

In [5]:
#Create a new binary column 'rating_result' where:
#1 represents ratings that are greater than or equal to 3.5 (considered "good" or positive).
#0 represents ratings that are less than 3.5 (considered "bad" or negative).

df['rating_result'] = (df['star_rating'] >= 3.5).astype(int)
rating_result_counts = df['rating_result'].value_counts()

# Display the counts
print(rating_result_counts)


rating_result
1    13301
0     6283
Name: count, dtype: int64


There are 13301 'good' ratings and 6283 'bad' ratings. This means that 68% of the data set is composed of high ratings. We will use sklearn.model_selection.train_test_split() to split the dataset into training and test subsets.
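Because the classes are imbalanced, it helps to know what a trivial classifier would score. The short calculation below uses the two counts reported above to compute the accuracy of always predicting the majority class; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_good, n_bad = 13301, 6283
total = n_good + n_bad

# Accuracy of a classifier that always predicts the majority class
majority_share = max(n_good, n_bad) / total
print(f"Majority-class baseline accuracy: {majority_share:.3f}")
```

Any model fitted later should be judged against this figure of roughly 68% rather than against 50%.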

Now let's create a vocabulary sorted by frequency for the full dataset, and the subsets of tokens associated with good and bad ratings.

In [6]:
# Check the data types in the 'tokens_lemm' column
print(df['tokens_lemm'].apply(type).value_counts())

# Subset the data by good and bad ratings in the main DataFrame
df_good = df[df['rating_result'] == 1]
df_bad = df[df['rating_result'] == 0]

def create_vocab_list(tokens_column):
    vocab = dict()
    for lemm_tokens in tokens_column:
        for token in lemm_tokens:
            if token not in vocab:
                vocab[token] = 1
            else:
                vocab[token] += 1
    return vocab

# Create vocabularies for all, good, and bad ratings in the main DataFrame
vocab_all = dict(sorted(create_vocab_list(df['tokens_lemm']).items(), key=lambda item: item[1], reverse=True))
vocab_good = dict(sorted(create_vocab_list(df_good['tokens_lemm']).items(), key=lambda item: item[1], reverse=True))
vocab_bad = dict(sorted(create_vocab_list(df_bad['tokens_lemm']).items(), key=lambda item: item[1], reverse=True))

print(f"number of unique tokens overall: {len(vocab_all)}, pos tokens: {len(vocab_good)}, neg: {len(vocab_bad)}")

tokens_lemm
<class 'list'>    19584
Name: count, dtype: int64
number of unique tokens overall: 17350, pos tokens: 13252, neg: 9325
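As an aside, the same frequency-sorted vocabulary can be built with the standard library's `collections.Counter`. The sketch below uses a tiny made-up list of token lists in place of the `tokens_lemm` column; it is an alternative formulation, not the code used to produce the numbers above.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for a column of lemmatized token lists
token_lists = [
    ["easy", "class"],
    ["easy", "class", "test"],
    ["class"],
]

# Flatten the lists and count every token in one pass
vocab = Counter(chain.from_iterable(token_lists))

# most_common(n) returns the n most frequent tokens, highest count first
top = vocab.most_common(2)
print(top)
```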


Let us consider the list of the most frequent words in the overall data set.

In [7]:
list(vocab_all.items())[:20]
#list(vocab_good.items())[:20]
#list(vocab_bad.items())[:20]

[('class', 17332),
 ('professor', 5522),
 ('take', 5029),
 ('teacher', 4995),
 ('easy', 4079),
 ('test', 3807),
 ('great', 3806),
 ('student', 3753),
 ('really', 3739),
 ('good', 3222),
 ('make', 3200),
 ('get', 2922),
 ('lot', 2805),
 ('lecture', 2762),
 ('help', 2702),
 ('hard', 2617),
 ('time', 2525),
 ('work', 2486),
 ('helpful', 2325),
 ('one', 2322)]

In [10]:
X = df['tokens_lemm'].apply(lambda tokens: ' '.join(tokens))  # Convert lists to space-separated strings
y = df['rating_result']  # Binary classification labels (0 or 1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Collect each split into a DataFrame
train_df = pd.DataFrame({'tokens_lemm': X_train, 'rating_result': y_train})
test_df = pd.DataFrame({'tokens_lemm': X_test, 'rating_result': y_test})

# Display the shape of the training set
print("Training Set:")
print(f"train_df.shape: {train_df.shape}")

# Display the shape of the testing set
print("\nTesting Set:")
print(f"test_df.shape: {test_df.shape}")



Training Set:
train_df.shape: (15667, 2)

Testing Set:
test_df.shape: (3917, 2)


In [None]:
# Display the counts in the training set
training_counts = y_train.value_counts()
print("Training Set:")
print(training_counts)

# Display the counts in the testing set
test_counts = y_test.value_counts()
print("\nTesting Set:")
print(test_counts)

The distribution of 'good' and 'bad' ratings in the training and test sets appears to reflect the proportions in the initial data set.
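A purely random split only preserves class proportions approximately. If exact proportions matter, `train_test_split` accepts a `stratify` argument. The sketch below demonstrates it on made-up data with a 70/30 class ratio; the `texts` and `labels` values are hypothetical and chosen so the expected counts are whole numbers.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70 positive and 30 negative examples
labels = [1] * 70 + [0] * 30
texts = [f"comment {i}" for i in range(100)]

# stratify=labels keeps the 70/30 ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"train positives: {sum(y_tr)} of {len(y_tr)}")
print(f"test positives:  {sum(y_te)} of {len(y_te)}")
```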

Next, we will vectorize the lemmatized comments using the term-frequency inverse-document-frequency vectorizer from sklearn, TfidfVectorizer. 

In [None]:
# Vectorize text data using TF-IDF
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

print(X_train_vectorized)


In [None]:
# Convert a small slice (first 5 rows, first 10 columns) of the sparse matrix to a dense NumPy array
dense_array = X_train_vectorized[:5, :10].toarray()

print(dense_array)


Now we will fit a Bernoulli Naive Bayes model using the training data and evaluate the performance.

In [None]:
# Create and train the Bernoulli Naive Bayes classifier
model = BernoulliNB()
model.fit(X_train_vectorized, y_train)

# Make predictions on the training set
predictions_train = model.predict(X_train_vectorized)

# Evaluate the model on the training set
accuracy_train = accuracy_score(y_train, predictions_train)
report_train = classification_report(y_train, predictions_train)
conf_matrix_train = confusion_matrix(y_train, predictions_train)

# Print the results
print(f"Training Accuracy: {accuracy_train}")
print("Training Classification Report:\n", report_train)
print("Training Confusion Matrix:\n", conf_matrix_train)
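One detail worth keeping in mind about this classifier: scikit-learn's `BernoulliNB` has a `binarize` parameter that defaults to 0.0, so every feature value above zero is treated as 1. When it is fed TF-IDF weights, it therefore sees only whether a term is present, not how heavily it is weighted. The sketch below illustrates this on a small made-up matrix; the values and names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical weighted features (stand-in for TF-IDF values) and labels
X_weighted = np.array([
    [0.7, 0.0, 0.3],
    [0.0, 0.9, 0.1],
    [0.5, 0.5, 0.0],
    [0.0, 0.2, 0.8],
])
y = np.array([1, 0, 1, 0])

# Explicit presence/absence version of the same matrix
X_binary = (X_weighted > 0).astype(int)

nb_weighted = BernoulliNB().fit(X_weighted, y)
nb_binary = BernoulliNB().fit(X_binary, y)

# Both models learn the same parameters and give the same probabilities
same = np.allclose(
    nb_weighted.predict_proba(X_weighted),
    nb_binary.predict_proba(X_binary),
)
print(f"Identical predictions: {same}")
```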

The training accuracy of the model is approximately 75%. Let's see how it performs using the test data.

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test_vectorized)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

The model performed about the same on the test data, with an accuracy of 75%. Let's see if tuning the hyperparameters available for the Bernoulli NB model helps to improve the performance.

In [None]:
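# Define the hyperparameter distributions
# Note: scipy's uniform(loc, scale) samples from [loc, loc + scale]
param_dist = {
    'alpha': uniform(0.1, 1.0),     # smoothing strength, sampled from [0.1, 1.1]
    'binarize': uniform(0.0, 0.1),  # binarization threshold, sampled from [0.0, 0.1]
    'fit_prior': [True, False]
}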

# Perform randomized search with cross-validation
randomized_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)
randomized_search.fit(X_train_vectorized, y_train)

# Get the best hyperparameters
best_params = randomized_search.best_params_
print("Best Hyperparameters:", best_params)

# Retrieve the best model, which the search has already refit on the full training set
best_model = randomized_search.best_estimator_

# Use the best model to make predictions on the test set
y_pred = best_model.predict(X_test_vectorized)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy}")

Adjusting the hyperparameters resulted in only a slight increase in accuracy, of about 1%.
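As a note on the search space used above: `scipy.stats.uniform` takes `loc` and `scale` rather than a lower and upper bound, so `uniform(0.1, 1.0)` covers the interval from 0.1 to 1.1. The sketch below checks the bounds of that distribution and draws a few samples; the variable names and the seed are illustrative.

```python
from scipy.stats import uniform

# uniform(loc, scale) is uniform on [loc, loc + scale]
alpha_dist = uniform(0.1, 1.0)

# The 0th and 100th percentiles give the lower and upper bounds
low = float(alpha_dist.ppf(0.0))
high = float(alpha_dist.ppf(1.0))
print(f"alpha is sampled from [{low:.1f}, {high:.1f}]")

# Draw a few reproducible samples to confirm they stay inside the bounds
samples = alpha_dist.rvs(size=5, random_state=42)
print(samples)
```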

Next, we will try a Support Vector Machine (SVM).

In [None]:
# Create a pipeline with TF-IDF vectorizer and SVM classifier
model = make_pipeline(TfidfVectorizer(), SVC())

# Train the model
model.fit(X_train, y_train)

# Predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)
print("Confusion Matrix:\n", conf_matrix)

The SVM model performed slightly better than the Naive Bayes model with an accuracy of 76%.