# Summary

In this project, we performed sentiment analysis on reviews from https://www.ratemyprofessors.com/, predicting the quality and difficulty ratings for professors from the corresponding comment. We took a variety of approaches, summarized in the "Accuracy Comparison Table" at the end of this document. For difficulty, the highest-performing model used GloVe embeddings and deep bidirectional LSTM cells, achieving an accuracy of ~38.03% on a validation subset. For quality, the highest-performing model used GloVe embeddings and deep bidirectional GRU cells, achieving an accuracy of ~47.06% on a validation subset, with most models within 5% of this benchmark. We also approached the task as a regression problem: the lowest mean squared error test loss on difficulty was ~0.8127, from a model using GloVe and deep bidirectional GRU cells, and the lowest test loss on quality under the same terms was ~1.2845, from a model using a bespoke embedding and a bidirectional GRU cell. In both cases, difficulty was harder to predict than quality, possibly because difficulty spans only 5 values (1 through 5) while quality spans 9 (1 through 5 in 0.5 increments), though fewer models were dedicated to difficulty in any case. Another interesting feature is that many models overfit the training data very quickly, with validation accuracy dropping after only one or two epochs. Overall, the models spanned a variety of embeddings, dimensions, architectures, and data sources, with varying results.
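
Because validation accuracy tended to peak after one or two epochs, a common remedy is to stop training once it stops improving. The sketch below shows that patience logic in plain Python. It was not part of the original notebooks, and the function name and accuracy values are hypothetical. Keras offers an `EarlyStopping` callback for the same purpose.

```python
def best_epoch_with_patience(val_accuracies, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training "stops" once validation accuracy has failed to improve
    for `patience` consecutive epochs.
    """
    best_acc = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch


# Hypothetical validation accuracies (illustrative values, not results from this project)
history = [0.41, 0.44, 0.43, 0.42, 0.40]
stop_epoch, best_epoch = best_epoch_with_patience(history, patience=2)
print(stop_epoch, best_epoch)
```

With these illustrative values, training would stop at epoch 4 and the weights from epoch 2 would be the ones worth keeping.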

# **APPROACH - 1 :**
---
## **RNN Model for Sentiment Analysis of Student Comments using GloVe Embeddings and Keras**
---
### This code performs sentiment analysis on a dataset of student comments about professors, in order to predict the professor's rating and level of difficulty. It begins by importing the necessary libraries, including pandas, numpy, and the Natural Language Toolkit (NLTK), and loading the data into a pandas dataframe from a CSV file.

### The code then loads pre-trained GloVe embeddings to use in creating an embedding matrix for the tokenizer's word index. It cleans the text data by removing stopwords, numerical values, and symbols, and converts the text to sequences. It creates target variables for the professor's rating and level of difficulty, and splits the data into training and testing sets.

### The code builds a recurrent neural network (RNN) model using the Keras API with an embedding layer, an LSTM layer, and a dense softmax output layer. It trains the model on the training split and evaluates its performance on the held-out test split. Finally, it prints the test accuracy of the model.

### Overall, this code is a machine learning pipeline for predicting professor ratings and levels of difficulty based on student comments, using an RNN model and pre-trained embeddings.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Importing the required libraries.
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout


In [None]:
# Importing the data from .csv to data frame.
df = pd.read_csv(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/Big Data Set from RateMyProfessor.com for Professors Teaching Evaluation/RateMyProfessor_Sample data.csv')

In [None]:
# Load the pre-trained GloVe embeddings
embeddings_index = {}
with open(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
# Remove 'nan' values from 'comments' column
df = df.dropna(subset=['comments'])

# Remove stopwords, numerical values, and symbols from 'comments' column and convert to lowercase
nltk.download('stopwords')
stop_words = stopwords.words('english')
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

df['comments'] = df['comments'].apply(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
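
To make the effect of `clean_text` concrete, here is a small self-contained sketch of the same three steps. It assumes a tiny hand-written stopword set (`DEMO_STOP_WORDS`, hypothetical) in place of NLTK's full English list, so it runs without a download.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's full English list.
DEMO_STOP_WORDS = {"this", "is", "but", "its", "a", "in"}


def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation, including hyphens and apostrophes
    text = re.sub(r'\d+', '', text)      # drop digits
    return " ".join(word for word in text.split() if word not in stop_words)


sample = "This class is hard, but it's a 2-in-1 gen-ed knockout!"
cleaned = clean_text_demo(sample)
print(cleaned)  # class hard gened knockout
```

Note that removing punctuation fuses hyphenated and contracted words ("gen-ed" becomes "gened", "it's" becomes "its"), which can affect whether a word is later found in the GloVe vocabulary.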


In [None]:
# Set parameters for model
MAX_NB_WORDS = 19114
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 100

# Tokenize words in 'comments' column
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True)
tokenizer.fit_on_texts(df['comments'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Convert text to sequences
X = tokenizer.texts_to_sequences(df['comments'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)

# Create target variables
y_star = pd.get_dummies(df['student_star']).values
y_difficult = pd.get_dummies(df['student_difficult']).values

Found 19113 unique tokens.
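
The following is a simplified pure-Python approximation of what the tokenizer does here, not the Keras implementation itself. Words are ranked by frequency starting at index 1 (0 is reserved for padding), and any word whose index is not below `num_words` is dropped when converting to sequences. The function names are hypothetical.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words=None):
    """Map words to indices, skipping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


texts = ["great class great professor", "hard class"]
word_index = build_word_index(texts)
all_ids = texts_to_sequences(texts, word_index)
top_two = texts_to_sequences(texts, word_index, num_words=3)
print(word_index, all_ids, top_two)
```

Since `MAX_NB_WORDS` (19114) exceeds the 19113 unique tokens found, no word is dropped by the `num_words` limit in this notebook.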


In [None]:
# Create an embedding matrix for the tokenizer's word index
num_words = len(tokenizer.word_index) + 1
embedding_dim = 100
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
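
Words missing from GloVe keep an all-zero row in `embedding_matrix`, so it can be useful to check how much of the vocabulary is actually covered. The sketch below was not part of the original notebook. It uses hypothetical toy dictionaries in place of `tokenizer.word_index` and `embeddings_index`.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return (fraction of vocabulary with a pre-trained vector, list of missing words)."""
    missing = [word for word in word_index if word not in embeddings_index]
    covered = len(word_index) - len(missing)
    fraction = covered / len(word_index) if word_index else 0.0
    return fraction, missing


# Toy stand-ins (hypothetical values); in the notebook these come from the tokenizer and GloVe.
toy_word_index = {"class": 1, "hard": 2, "gened": 3, "professor": 4}
toy_embeddings = {"class": [0.1, 0.2], "hard": [0.3, 0.4], "professor": [0.5, 0.6]}

fraction, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(fraction, missing)
```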

In [None]:
# Split data into training and testing sets
VALIDATION_SPLIT = 0.2
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y_star = y_star[indices]
y_difficult = y_difficult[indices]
num_validation_samples = int(VALIDATION_SPLIT * X.shape[0])

X_train = X[:-num_validation_samples]
y_train_star = y_star[:-num_validation_samples]
y_train_difficult = y_difficult[:-num_validation_samples]
X_test = X[-num_validation_samples:]
y_test_star = y_star[-num_validation_samples:]
y_test_difficult = y_difficult[-num_validation_samples:]

In [None]:
y_train_star.shape

(15995, 9)
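
The shape above follows from the split arithmetic: with 19,993 usable rows and a 0.2 split, `int(0.2 * 19993)` gives 3,998 test rows and leaves 15,995 for training. The sketch below reproduces that arithmetic with a seeded shuffle so the split is repeatable. The seed and function name are assumptions, since the notebook shuffles without a fixed seed.

```python
import random


def shuffled_split(n_samples, validation_split=0.2, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(validation_split * n_samples)
    split_at = n_samples - n_val
    return indices[:split_at], indices[split_at:]


train_idx, test_idx = shuffled_split(19993, validation_split=0.2)
print(len(train_idx), len(test_idx))  # 15995 3998
```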

In [None]:
# Build RNN model
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
model.add(LSTM(64, dropout=0.2))
model.add(Dense(9, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 250, 100)          1911400   
                                                                 
 lstm (LSTM)                 (None, 64)                42240     
                                                                 
 dense (Dense)               (None, 9)                 585       
                                                                 
Total params: 1,954,225
Trainable params: 42,825
Non-trainable params: 1,911,400
_________________________________________________________________
None
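
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas: an LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units


embedding = embedding_params(19114, 100)  # 1,911,400 (frozen, non-trainable)
lstm = lstm_params(100, 64)               # 42,240
dense = dense_params(64, 9)               # 585
print(embedding, lstm, dense, embedding + lstm + dense)
```

The trainable total (42,240 + 585 = 42,825) matches the summary because the embedding layer is created with `trainable=False`.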


In [None]:
# Train model, using the held-out split as validation data
model.fit(X_train, y_train_star, validation_data=(X_test, y_test_star), epochs=20, batch_size=128)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f45ce390eb0>

In [None]:
# Evaluate model on the held-out test data
score, acc = model.evaluate(X_test, y_test_star, batch_size=32)
print('Test accuracy :', acc)

Test accuracy : 0.4452226161956787


# **--- END OF SECTION 1 ---**

# **Approach 2:**
---
## **Sentiment Analysis using GloVe Embeddings and GRU Model on RateMyProfessor Data**
---
### This code builds a machine learning model with Keras and TensorFlow to predict student ratings from their comments about a professor. It starts by importing the necessary libraries, loading the data from a CSV file, and loading pre-trained GloVe embeddings. The text is preprocessed by cleaning the comments column of the DataFrame and splitting each comment into words. The data is then split into train and test sets, tokenized, and padded to the same length, and an embedding matrix is created for the tokenizer's word index. A GRU (Gated Recurrent Unit) model is created with two GRU layers, a dropout layer, a dense hidden layer, and a softmax output layer. The model is compiled with a different optimizer (RMSprop) and trained on the train set for 5 epochs. Finally, the model is evaluated on the test set, and the accuracy is printed.

In [None]:
# Importing the required default libraries.
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.model_selection import train_test_split

In [None]:
# Importing the data from .csv to data frame.
data_frame = pd.read_csv(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/Big Data Set from RateMyProfessor.com for Professors Teaching Evaluation/RateMyProfessor_Sample data.csv')

In [None]:
# Load the pre-trained GloVe embeddings
embeddings_index = {}
with open(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
# Drop rows with null values in 'comments' column
data_frame.dropna(subset=['comments'], inplace=True)

# Select only the 'student_star' and 'comments' columns
data_frame = data_frame[['student_star', 'comments']]

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

# Define a regex pattern matching every non-alphanumeric character
pattern = r'[^a-zA-Z0-9]'

def clean_text(text):
    # Replace non-alphanumeric characters with spaces
    text = re.sub(pattern, ' ', text)
    # Convert text to lowercase
    text = text.lower()
    # Split text into individual words
    words = text.split()
    # Remove stopwords using the precomputed STOPWORDS set (fast membership test)
    words = [word for word in words if word not in STOPWORDS]
    return words

# Apply the clean_text function to the 'comments' column of the DataFrame
data_frame['comments'] = data_frame['comments'].apply(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
data_frame.shape

(19993, 2)

In [None]:
print(data_frame['comments'])

0        [class, hard, two, one, gen, ed, knockout, con...
1        [definitely, going, choose, prof, looney, clas...
2        [overall, enjoyed, class, assignments, straigh...
3        [yes, possible, get, definitely, work, content...
4        [professor, looney, great, knowledge, astronom...
                               ...                        
19995               [great, sense, humor, love, parasites]
19996    [really, nice, guy, really, funny, however, bi...
19997    [parasitology, class, lot, work, makes, extrem...
19998    [way, much, work, 1, credit, class, shegnoski,...
19999    [extremely, easy, lab, teacher, quizzes, littl...
Name: comments, Length: 19993, dtype: object


In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

X = data_frame['comments']
y = data_frame['student_star']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
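
The model in this approach is compiled with `sparse_categorical_crossentropy`, which expects integer class indices, while `student_star` holds ratings in 0.5 increments (Approach 1 one-hot encodes them into 9 columns). Passing the raw values directly may conflate half-star ratings with whole ones. One possible encoding, not used in the original notebook and with hypothetical helper names, maps each rating to a contiguous class index and back:

```python
def rating_to_class(rating, lowest=1.0, step=0.5):
    """Map a rating such as 1.0, 1.5, ..., 5.0 to a class index 0..8."""
    return int(round((rating - lowest) / step))


def class_to_rating(index, lowest=1.0, step=0.5):
    """Inverse mapping, from class index back to the rating value."""
    return lowest + index * step


ratings = [1.0, 1.5, 3.0, 4.5, 5.0]
classes = [rating_to_class(r) for r in ratings]
print(classes)  # [0, 1, 4, 7, 8]
```

With this encoding the output layer would need 9 units rather than 6, one per distinct rating.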

In [None]:
# Tokenize the comments column of the train and test sets
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad the sequences of the train and test sets to have the same length
max_length = max([len(seq) for seq in X_train_seq])
X_train_pad = pad_sequences(X_train_seq, maxlen=max_length)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_length)
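
By default, `pad_sequences` pads and truncates at the front of each sequence. Because `max_length` is taken from the training set, any longer test comment loses its earliest tokens. The sketch below is a pure-Python stand-in for that default behaviour, not the Keras function, and assumes `maxlen` is positive.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of each sequence."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


result = pad_pre([[5, 3], [7, 2, 9, 4, 1]], maxlen=4)
print(result)  # [[0, 0, 5, 3], [2, 9, 4, 1]]
```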

In [None]:
# Create an embedding matrix for the tokenizer's word index
num_words = len(tokenizer.word_index) + 1
embedding_dim = 100
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [None]:
# Create the GRU model
model = keras.models.Sequential()
model.add(keras.layers.Embedding(num_words, embedding_dim, weights=[embedding_matrix], input_length=max_length, trainable=False))
model.add(keras.layers.GRU(128, return_sequences=True))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.GRU(64))
model.add(keras.layers.Dense(32, activation='relu'))

# Output layer for 'student_star'
model.add(keras.layers.Dense(6, activation='softmax', name='student_star'))

# Compile the model with a different optimizer
optimizer = keras.optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()



Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 72, 100)           1370300   
                                                                 
 gru (GRU)                   (None, 72, 128)           88320     
                                                                 
 dropout (Dropout)           (None, 72, 128)           0         
                                                                 
 gru_1 (GRU)                 (None, 64)                37248     
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 student_star (Dense)        (None, 6)                 198       
                                                                 
Total params: 1,498,146
Trainable params: 127,846
Non-
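
The GRU parameter counts above can be reproduced by hand. The sketch assumes the Keras default GRU variant with two bias vectors per gate (`reset_after=True`), which is the formula that matches the printed summary. The helper names are hypothetical.

```python
def gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights, and two bias vectors
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units


gru_1 = gru_params(100, 128)   # 88,320
gru_2 = gru_params(128, 64)    # 37,248
hidden = dense_params(64, 32)  # 2,080
output = dense_params(32, 6)   # 198
trainable = gru_1 + gru_2 + hidden + output
print(trainable, trainable + 1370300)  # 127846 1498146
```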

In [None]:
# Train the model on the train set for more epochs
history = model.fit(X_train_pad, y_train, epochs=5, batch_size=128, validation_split=0.2)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test_pad, y_test)
print('Test accuracy:', accuracy)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test accuracy: 0.37095698714256287


# **--- END OF SECTION 2 ---**

# **Approach 3**

---

## **Sentiment Analysis using LSTM Model on RateMyProfessor Data**

---

### The code loads a dataset of professor reviews from RateMyProfessor.com and performs sentiment analysis using a deep learning LSTM model. The dataset is initially read from a CSV file and then augmented with additional reviews loaded from a JSON file. The reviews are preprocessed by removing stopwords, numerical values, and symbols from the 'comments' column and converting it to lowercase. The pre-trained GloVe embeddings are loaded and an embedding matrix is created for the tokenizer's word index. The data is then split into training and testing sets, and an LSTM model is built and trained on the training set using the categorical cross-entropy loss function and the Adam optimizer. The model's performance is evaluated on the testing set, and the accuracy is displayed.

### Overall, this code implements a basic sentiment analysis on RateMyProfessor data, providing insights into the quality and difficulty of various professors based on student reviews.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Importing the required libraries.
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout

In [None]:
#For more info on parsing json in pandas, see https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
# df_json = pd.read_json(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json')
# print(df_json[0][2])
df = pd.read_csv(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/Big Data Set from RateMyProfessor.com for Professors Teaching Evaluation/RateMyProfessor_Sample data.csv')

In [None]:
# import json
# myfile = open('/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json')
# mydict = json.load(myfile)
# print(mydict[0][0])
# myfile.close()
# #Total review count: 3374

In [None]:
import json

# Load the additional reviews: a list of lists of review dicts
with open('/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json') as my_file:
    json_dict = json.load(my_file)

# Collect every review as a row, then concatenate once instead of once per review
lastindex = 20000  # new rows are indexed from 20000 onward, after the CSV rows
rows = [
    {
        "comments": d["Comment"],
        "student_star": float(d["Quality"]),
        "student_difficult": float(d["Difficulty"]),
        "professor_name": d["professor"],
    }
    for review_group in json_dict
    for d in review_group
]
json_df = pd.DataFrame(rows, index=range(lastindex, lastindex + len(rows)))
df = pd.concat([df, json_df])
lastindex = lastindex + len(rows)

In [None]:
print(df['comments'])

0        This class is hard, but its a two-in-one gen-e...
1        Definitely going to choose Prof. Looney\'s cla...
2        I overall enjoyed this class because the assig...
3        Yes, it\'s possible to get an A but you\'ll de...
4        Professor Looney has great knowledge in Astron...
                               ...                        
23369    don't take this call unless you love memorizin...
23370    Really funny guy. He gives ridiculous (but app...
23371    Only teaches from powerpoints and gives tons o...
23372    Awesome prof. Gives out homework and short ess...
23373    Great teacher great course. Funny guy, gives o...
Name: comments, Length: 23374, dtype: object


In [None]:
# Display the first 10 professors and their corresponding stars, difficulty, and comment
prof_index = 0
prof_miniset = {}
for i in range(10):
  prof_name = df['professor_name'][prof_index]
  prof_miniset[prof_name] = ["Stars: " + str(df['student_star'][prof_index]), "Difficulty: " + str(df['student_difficult'][prof_index]), df['comments'][prof_index], ]
  while(prof_name == df['professor_name'][prof_index]):
    prof_index = prof_index + 1
for k,v in prof_miniset.items():
  print(k, v)

Leslie  Looney ['Stars: 5.0', 'Difficulty: 3.0', 'This class is hard, but its a two-in-one gen-ed knockout, and the content is very stimulating. Unlike most classes, you have to actually participate to pass. Sections are easy and offer extra credit every week. Very funny dude. Not much more I can say.']
Jans  Wager ['Stars: 5.0', 'Difficulty: 2.0', 'Dr. Wager is a great professor. Her expectations were clear and she really wants to help you succeed. She was always entertaining and knew what she was talking about. I took her hybrid course which i would definitely recommend.']
Robert  Warden ['Stars: 1.5', 'Difficulty: 4.0', 'This guy is a quack! You never understand what he is saying!']
Bryan  Eldredge ['Stars: 3.0', 'Difficulty: 5.0', 'I took his online class as an elective and regretted taking it by the end. The information is interesting but he is a tough grader. His tests were hard and very confusing.']
William  Hollinrake ['Stars: 1.0', 'Difficulty: 5.0', 'Took online course. Day a

In [None]:
# Load the pre-trained GloVe embeddings
embeddings_index = {}
with open(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
# Remove 'nan' values from 'comments' column
df = df.dropna(subset=['comments'])

# Remove stopwords, numerical values, and symbols from 'comments' column and convert to lowercase
nltk.download('stopwords')
stop_words = stopwords.words('english')
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

df['comments'] = df['comments'].apply(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Set parameters for model
MAX_NB_WORDS = 20979
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 100

# Tokenize words in 'comments' column
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True)
tokenizer.fit_on_texts(df['comments'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Convert text to sequences
X = tokenizer.texts_to_sequences(df['comments'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)

# Create target variables
y_star = pd.get_dummies(df['student_star']).values
y_difficult = pd.get_dummies(df['student_difficult']).values

Found 20978 unique tokens.


In [None]:
# Create an embedding matrix for the tokenizer's word index
num_words = len(tokenizer.word_index) + 1
embedding_dim = 100
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [None]:
# Split data into training and testing sets
VALIDATION_SPLIT = 0.2
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y_star = y_star[indices]
y_difficult = y_difficult[indices]
num_validation_samples = int(VALIDATION_SPLIT * X.shape[0])

X_train = X[:-num_validation_samples]
y_train_star = y_star[:-num_validation_samples]
y_train_difficult = y_difficult[:-num_validation_samples]
X_test = X[-num_validation_samples:]
y_test_star = y_star[-num_validation_samples:]
y_test_difficult = y_difficult[-num_validation_samples:]

In [None]:
print(y_train_star.shape)

(18694, 9)


In [None]:
# Build RNN model
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(64, dropout=0.2))
model.add(Dense(9, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 250, 100)          2097900   
                                                                 
 lstm_3 (LSTM)               (None, 64)                42240     
                                                                 
 dense_2 (Dense)             (None, 9)                 585       
                                                                 
Total params: 2,140,725
Trainable params: 2,140,725
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
# Train model
model.fit(X_train, y_train_star, batch_size=32, epochs=10, verbose=1, validation_data=(X_test, y_test_star))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f45c5ed9910>

In [None]:
# Evaluate model
scores = model.evaluate(X_test, y_test_star, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 38.13%


# **--- END OF SECTION 3 ---**

# **Approach 4**


---


## **Building an RNN Model to Predict Professor Rating from Student Reviews using Bidirectional LSTM layers in Keras.**


---


### The code provided loads a dataset of professor ratings and comments from a CSV file and a JSON file, preprocesses the text data by cleaning and tokenizing it, and trains a Bidirectional LSTM neural network model to predict the star rating and difficulty level of a professor based on the comments given by the students.

### Here's a summary of what the code does:

### 1. Mounts the Google Drive to Colab and imports the necessary libraries.
### 2. Reads the professor rating and comments data from a CSV file and a JSON file.
### 3. Cleans and tokenizes the text data by removing stopwords, numerical values, and symbols, and converts the text to lowercase.
### 4. Tokenizes the comments using the Keras tokenizer and converts the text to sequences, which are then padded to a maximum sequence length.
### 5. Creates target variables for the star rating and difficulty level using one-hot encoding.
### 6. Loads pre-trained GloVe embeddings and creates an embedding matrix for the tokenizer's word index.
### 7. Splits the data into training and testing sets.
### 8. Builds a Bidirectional LSTM neural network model with an embedding layer, two Bidirectional LSTM layers, a dense hidden layer, and a softmax output layer.
### 9. Trains and evaluates the model on the training and testing sets.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

KeyboardInterrupt: ignored

In [None]:
# Importing the required libraries.
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout

In [None]:
#For more info on parsing json in pandas, see https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
# df_json = pd.read_json(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json')
# print(df_json[0][2])
df = pd.read_csv(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/Big Data Set from RateMyProfessor.com for Professors Teaching Evaluation/RateMyProfessor_Sample data.csv')

In [None]:
# import json
# myfile = open('/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json')
# mydict = json.load(myfile)
# print(mydict[0][0])
# myfile.close()
# #Total review count: 3374

In [None]:
import json

# Load the additional reviews: a list of lists of review dicts
with open('/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json') as my_file:
    json_dict = json.load(my_file)

# Collect every review as a row, then concatenate once instead of once per review
lastindex = 20000  # new rows are indexed from 20000 onward, after the CSV rows
rows = [
    {
        "comments": d["Comment"],
        "student_star": float(d["Quality"]),
        "student_difficult": float(d["Difficulty"]),
        "professor_name": d["professor"],
    }
    for review_group in json_dict
    for d in review_group
]
json_df = pd.DataFrame(rows, index=range(lastindex, lastindex + len(rows)))
df = pd.concat([df, json_df])
lastindex = lastindex + len(rows)

In [None]:
print(df['comments'])

0        This class is hard, but its a two-in-one gen-e...
1        Definitely going to choose Prof. Looney\'s cla...
2        I overall enjoyed this class because the assig...
3        Yes, it\'s possible to get an A but you\'ll de...
4        Professor Looney has great knowledge in Astron...
                               ...                        
23369    don't take this call unless you love memorizin...
23370    Really funny guy. He gives ridiculous (but app...
23371    Only teaches from powerpoints and gives tons o...
23372    Awesome prof. Gives out homework and short ess...
23373    Great teacher great course. Funny guy, gives o...
Name: comments, Length: 23374, dtype: object


In [None]:
# Display the first 10 professors and their corresponding stars, difficulty, and comment
prof_index = 0
prof_miniset = {}
for i in range(10):
  prof_name = df['professor_name'][prof_index]
  prof_miniset[prof_name] = ["Stars: " + str(df['student_star'][prof_index]), "Difficulty: " + str(df['student_difficult'][prof_index]), df['comments'][prof_index], ]
  while(prof_name == df['professor_name'][prof_index]):
    prof_index = prof_index + 1
for k,v in prof_miniset.items():
  print(k, v)

Leslie  Looney ['Stars: 5.0', 'Difficulty: 3.0', 'This class is hard, but its a two-in-one gen-ed knockout, and the content is very stimulating. Unlike most classes, you have to actually participate to pass. Sections are easy and offer extra credit every week. Very funny dude. Not much more I can say.']
Jans  Wager ['Stars: 5.0', 'Difficulty: 2.0', 'Dr. Wager is a great professor. Her expectations were clear and she really wants to help you succeed. She was always entertaining and knew what she was talking about. I took her hybrid course which i would definitely recommend.']
Robert  Warden ['Stars: 1.5', 'Difficulty: 4.0', 'This guy is a quack! You never understand what he is saying!']
Bryan  Eldredge ['Stars: 3.0', 'Difficulty: 5.0', 'I took his online class as an elective and regretted taking it by the end. The information is interesting but he is a tough grader. His tests were hard and very confusing.']
William  Hollinrake ['Stars: 1.0', 'Difficulty: 5.0', 'Took online course. Day a

In [None]:
# Load the pre-trained GloVe embeddings
embeddings_index = {}
with open(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/glove.42B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
# Remove 'nan' values from 'comments' column
df = df.dropna(subset=['comments'])

# Remove stopwords, numerical values, and symbols from 'comments' column and convert to lowercase
nltk.download('stopwords')
stop_words = stopwords.words('english')
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

df['comments'] = df['comments'].apply(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Set parameters for model
MAX_NB_WORDS = 20979
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 300

# Tokenize words in 'comments' column
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True)
tokenizer.fit_on_texts(df['comments'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Convert text to sequences
X = tokenizer.texts_to_sequences(df['comments'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)

# Create target variables
y_star = pd.get_dummies(df['student_star']).values
y_difficult = pd.get_dummies(df['student_difficult']).values

Found 20978 unique tokens.


In [None]:
# Create an embedding matrix for the tokenizer's word index
num_words = len(tokenizer.word_index) + 1
embedding_dim = EMBEDDING_DIM
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [None]:
# Split data into training and testing sets
VALIDATION_SPLIT = 0.2
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y_star = y_star[indices]
y_difficult = y_difficult[indices]
num_validation_samples = int(VALIDATION_SPLIT * X.shape[0])

X_train = X[:-num_validation_samples]
y_train_star = y_star[:-num_validation_samples]
y_train_difficult = y_difficult[:-num_validation_samples]
X_test = X[-num_validation_samples:]
y_test_star = y_star[-num_validation_samples:]
y_test_difficult = y_difficult[-num_validation_samples:]

In [None]:
print(y_train_star.shape)

(18694, 9)


In [None]:
# Build RNN model with Bidirectional LSTM layers
# Train/test models on only either the quality or difficulty metrics
from tensorflow.keras.models import load_model
from tensorflow.keras.layers import Bidirectional

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH))
model.add(Bidirectional(LSTM(64,return_sequences=True, dropout=0.2)))
model.add(Bidirectional(LSTM(64,dropout=0.2)))
model.add(Dense(64, activation="tanh"))
model.add(Dense(9, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

# Save the untrained model so that both candidates start from the same initial weights
model.save("candidate.keras")

star_candidate = load_model("candidate.keras")
difficulty_candidate = load_model("candidate.keras")
# Swap the 9-way quality output for a 5-way difficulty output
difficulty_candidate.pop()
difficulty_candidate.add(Dense(5, activation='softmax'))
difficulty_candidate.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(difficulty_candidate.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 250, 300)          6293700   
                                                                 
 bidirectional (Bidirectiona  (None, 250, 128)         186880    
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 64)                8256      
                                                                 
 dense_1 (Dense)             (None, 9)                 585       
                                                                 
Total params: 6,588,237
Trainable params: 6,588,237
Non-

In [None]:
# Train the models
print("Model trained on star alone")
star_candidate.fit(X_train, y_train_star, validation_data=(X_test, y_test_star), epochs=10, batch_size=128)
print("Model trained on difficulty alone")
difficulty_candidate.fit(X_train, y_train_difficult, validation_data=(X_test, y_test_difficult), epochs=10, batch_size=128)

Model trained on star alone
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Model trained on difficulty alone
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc866e8e4a0>

### **Alternate Model - 2:**

### **The code defines a neural network model that classifies comments into one of the nine quality ratings. The model uses pre-trained GloVe embeddings and two layers of bidirectional LSTM cells. The first LSTM layer has twice as many units as before (128 instead of 64), and the embedding layer is no longer trainable. The model is trained for 10 epochs with a batch size of 128 using categorical cross-entropy loss and the Adam optimizer, and it is evaluated on a validation set.**
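
Doubling the units of the first LSTM layer from 64 to 128 more than doubles its parameter count, because the recurrent weights grow with the square of the unit count. The check below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias) and reproduces the counts printed in the model summaries; the helper name is illustrative.

```python
def bidirectional_lstm_params(input_dim, units):
    """Parameter count of Bidirectional(LSTM(units)) fed with input_dim features."""
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction  # forward and backward copies

print(bidirectional_lstm_params(300, 64))   # 186880, first layer of the original model
print(bidirectional_lstm_params(300, 128))  # 439296, first layer after doubling the units
print(bidirectional_lstm_params(256, 64))   # 164352, second layer fed by the 256-wide output
```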



In [None]:
#Same as before, only we no longer train the embedding layer and doubled the number of units in the first LSTM layer
alt_model = Sequential([
    Embedding(MAX_NB_WORDS, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False),
    Bidirectional(LSTM(128,return_sequences=True, dropout=0.2)),
    Bidirectional(LSTM(64,dropout=0.2)),
    Dense(64, activation="tanh"),
    Dense(9, activation='softmax')
])

alt_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(alt_model.summary())
alt_model.fit(X_train, y_train_star, validation_data=(X_test, y_test_star), epochs=10, batch_size=128)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 250, 300)          6293700   
                                                                 
 bidirectional_2 (Bidirectio  (None, 250, 256)         439296    
 nal)                                                            
                                                                 
 bidirectional_3 (Bidirectio  (None, 128)              164352    
 nal)                                                            
                                                                 
 dense_3 (Dense)             (None, 64)                8256      
                                                                 
 dense_4 (Dense)             (None, 9)                 585       
                                                                 
Total params: 6,906,189
Trainable params: 612,489
Non-

<keras.callbacks.History at 0x7fc864c8e0e0>

### **Alternate Model - 3:**

### **This code creates a neural network model with two bidirectional LSTM layers, two dense layers with dropout, and an output layer with softmax activation. The number of units in the second LSTM layer is doubled compared to the previous models (128 instead of 64). The model is trained on the nine quality categories and evaluated using accuracy.**

In [None]:
#Add an additional dense layer and dropout, and double the unit count in the second LSTM layer
alt2_model = Sequential([
    Embedding(MAX_NB_WORDS, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False),
    Bidirectional(LSTM(128,return_sequences=True, dropout=0.2)),
    Bidirectional(LSTM(128,dropout=0.2)),
    Dense(64, activation="tanh"),
    Dropout(0.4),
    Dense(64, activation="tanh"),
    Dropout(0.4),
    Dense(9, activation='softmax')
])

alt2_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(alt2_model.summary())
alt2_model.fit(X_train, y_train_star, validation_data=(X_test, y_test_star), epochs=10, batch_size=128)

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 250, 300)          6293700   
                                                                 
 bidirectional_4 (Bidirectio  (None, 250, 256)         439296    
 nal)                                                            
                                                                 
 bidirectional_5 (Bidirectio  (None, 256)              394240    
 nal)                                                            
                                                                 
 dense_5 (Dense)             (None, 64)                16448     
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 64)               

<keras.callbacks.History at 0x7fc860d3d7b0>

### **Alternate Model - 4:**
### **The code defines a deep learning model to classify comments by star rating. It uses an embedding layer followed by two bidirectional LSTM layers and two dense layers with dropout. Compared with the previous model, the second LSTM layer is cut back to 64 units, the first dense layer is doubled to 128 units, and both hidden dense layers use a sigmoid activation. The model is trained for 20 epochs using the categorical cross-entropy loss function and the Adam optimizer, and it is evaluated on a validation set.**

In [None]:
#Cut unit count in 2nd LSTM layer back, double the count in first dense layer, and try sigmoid activation
alt3_model = Sequential([
    Embedding(MAX_NB_WORDS, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False),
    Bidirectional(LSTM(128,return_sequences=True, dropout=0.2)),
    Bidirectional(LSTM(64,dropout=0.2)),
    Dense(128, activation="sigmoid"),
    Dropout(0.4),
    Dense(64, activation="sigmoid"),
    Dropout(0.4),
    Dense(9, activation='softmax')
])

alt3_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(alt3_model.summary())
alt3_model.fit(X_train, y_train_star, validation_data=(X_test, y_test_star), epochs=20, batch_size=128)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 250, 300)          6293700   
                                                                 
 bidirectional_6 (Bidirectio  (None, 250, 256)         439296    
 nal)                                                            
                                                                 
 bidirectional_7 (Bidirectio  (None, 128)              164352    
 nal)                                                            
                                                                 
 dense_8 (Dense)             (None, 128)               16512     
                                                                 
 dropout_2 (Dropout)         (None, 128)               0         
                                                                 
 dense_9 (Dense)             (None, 64)               

<keras.callbacks.History at 0x7fc867638220>

### **Alternate Model - 5:**
### **The code defines a neural network model that uses GRU layers instead of LSTM layers, with a higher dropout rate (0.4 instead of 0.2) to reduce overfitting. It also includes two dense layers with sigmoid activation, each followed by a dropout layer. The model is trained on the star ratings and evaluated using accuracy as a metric.**
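
For comparison with the LSTM models, a GRU has three gates instead of four, so it needs fewer parameters for the same number of units. The sketch below assumes the Keras default `reset_after=True`, which keeps two bias vectors per gate; under that assumption it reproduces the counts in this model's summary. The helper name is illustrative.

```python
def bidirectional_gru_params(input_dim, units):
    """Parameter count of Bidirectional(GRU(units)), assuming reset_after=True."""
    # 3 gates, each with input weights, recurrent weights, and two bias vectors
    one_direction = 3 * (units * (input_dim + units) + 2 * units)
    return 2 * one_direction  # forward and backward copies

print(bidirectional_gru_params(300, 64))  # 140544, first GRU layer
print(bidirectional_gru_params(128, 64))  # 74496, second GRU layer
```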

In [None]:
#GRU layers with higher dropout rate
from tensorflow.keras.layers import GRU

alt4_model = Sequential([
    Embedding(MAX_NB_WORDS, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False),
    Bidirectional(GRU(64,return_sequences=True, dropout=0.4)),
    Bidirectional(GRU(64,dropout=0.4)),
    Dense(128, activation="sigmoid"),
    Dropout(0.4),
    Dense(64, activation="sigmoid"),
    Dropout(0.4),
    Dense(9, activation='softmax')
])

alt4_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(alt4_model.summary())
alt4_model.fit(X_train, y_train_star, validation_data=(X_test, y_test_star), epochs=20, batch_size=128)

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 250, 300)          6293700   
                                                                 
 bidirectional_8 (Bidirectio  (None, 250, 128)         140544    
 nal)                                                            
                                                                 
 bidirectional_9 (Bidirectio  (None, 128)              74496     
 nal)                                                            
                                                                 
 dense_11 (Dense)            (None, 128)               16512     
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_12 (Dense)            (None, 64)               

<keras.callbacks.History at 0x7fc864947940>

# **--- END OF SECTION 4 ---**

# **Approach 5**
---
## **Using Regression Model**
---
### Here the task is treated as a regression problem: the model predicts the star rating of a professor as a continuous value from the comment left by a student.

### The model architecture uses pre-trained GloVe embeddings, followed by two layers of bidirectional GRUs with dropout regularization, and two dense layers with sigmoid activation and dropout regularization. The final dense layer has a single output node, which outputs the predicted rating for a given comment.

### The data is split into training and testing sets, and the mean squared error (MSE) loss is used both to train the model and to evaluate it on the held-out set.
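
A useful reference point for the MSE values is the loss of a constant predictor that always outputs the mean training rating; a model is only learning from the comments if it beats that baseline. The following is a minimal sketch on made-up ratings (the numbers are illustrative and not taken from the dataset):

```python
def mean_squared_error(targets, predictions):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Made-up ratings standing in for y_train_star and y_test_star
train_ratings = [5.0, 4.5, 1.0, 3.0, 5.0, 2.5]
test_ratings = [4.0, 5.0, 1.5, 3.5]

# Constant baseline: always predict the mean of the training ratings
baseline = sum(train_ratings) / len(train_ratings)
baseline_mse = mean_squared_error(test_ratings, [baseline] * len(test_ratings))
print(baseline, baseline_mse)
```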

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Importing the required libraries.
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout

In [None]:
#For more info on parsing json in pandas, see https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
# df_json = pd.read_json(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json')
# print(df_json[0][2])
df = pd.read_csv(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/Big Data Set from RateMyProfessor.com for Professors Teaching Evaluation/RateMyProfessor_Sample data.csv')

In [None]:
import json

# Load the additional scraped reviews (a list of per-professor lists of review dicts)
with open('/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json') as my_file:
    json_dict = json.load(my_file)

# Collect every review first and concatenate once;
# calling pd.concat inside the loop copies the whole DataFrame on each iteration
rows = [
    {
        "comments": d["Comment"],
        "student_star": float(d["Quality"]),
        "student_difficult": float(d["Difficulty"]),
        "professor_name": d["professor"],
    }
    for reviews in json_dict
    for d in reviews
]

# Continue the index where the CSV data ends, as before
new_index = range(len(df), len(df) + len(rows))
df = pd.concat([df, pd.DataFrame(rows, index=new_index)])

In [None]:
print(len(df))

23374


In [None]:
print(df['comments'])

0        This class is hard, but its a two-in-one gen-e...
1        Definitely going to choose Prof. Looney\'s cla...
2        I overall enjoyed this class because the assig...
3        Yes, it\'s possible to get an A but you\'ll de...
4        Professor Looney has great knowledge in Astron...
                               ...                        
23369    don't take this call unless you love memorizin...
23370    Really funny guy. He gives ridiculous (but app...
23371    Only teaches from powerpoints and gives tons o...
23372    Awesome prof. Gives out homework and short ess...
23373    Great teacher great course. Funny guy, gives o...
Name: comments, Length: 23374, dtype: object


In [None]:
# Display the first 10 professors and their corresponding stars, difficulty, and comment
prof_index = 0
prof_miniset = {}
for i in range(10):
  prof_name = df['professor_name'][prof_index]
  prof_miniset[prof_name] = ["Stars: " + str(df['student_star'][prof_index]), "Difficulty: " + str(df['student_difficult'][prof_index]), df['comments'][prof_index], ]
  while(prof_name == df['professor_name'][prof_index]):
    prof_index = prof_index + 1
for k,v in prof_miniset.items():
  print(k, v)

Leslie  Looney ['Stars: 5.0', 'Difficulty: 3.0', 'This class is hard, but its a two-in-one gen-ed knockout, and the content is very stimulating. Unlike most classes, you have to actually participate to pass. Sections are easy and offer extra credit every week. Very funny dude. Not much more I can say.']
Jans  Wager ['Stars: 5.0', 'Difficulty: 2.0', 'Dr. Wager is a great professor. Her expectations were clear and she really wants to help you succeed. She was always entertaining and knew what she was talking about. I took her hybrid course which i would definitely recommend.']
Robert  Warden ['Stars: 1.5', 'Difficulty: 4.0', 'This guy is a quack! You never understand what he is saying!']
Bryan  Eldredge ['Stars: 3.0', 'Difficulty: 5.0', 'I took his online class as an elective and regretted taking it by the end. The information is interesting but he is a tough grader. His tests were hard and very confusing.']
William  Hollinrake ['Stars: 1.0', 'Difficulty: 5.0', 'Took online course. Day a

In [None]:
# Load the pre-trained GloVe embeddings
embeddings_index = {}
with open(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/glove.42B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [None]:
# Remove 'nan' values from 'comments' column
df = df.dropna(subset=['comments'])

# Remove stopwords, numerical values, and symbols from 'comments' column and convert to lowercase
nltk.download('stopwords')
stop_words = stopwords.words('english')
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

df['comments'] = df['comments'].apply(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Set parameters for model
MAX_NB_WORDS = 20979
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 300

# Tokenize words in 'comments' column
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, lower=True)
tokenizer.fit_on_texts(df['comments'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Convert text to sequences
X = tokenizer.texts_to_sequences(df['comments'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)

# Create target variables
y_star = df['student_star'].values
y_difficult = df['student_difficult'].values

Found 20978 unique tokens.


In [None]:
#How exactly does the word_index work? Let's look at the first comment vectorized
mylist = df['comments'][0].split(" ")
for word in mylist:
  print(f"{word}:{word_index[word]}")

class:1
hard:10
twoinone:9644
gened:3844
knockout:7026
content:426
stimulating:1279
unlike:1258
classes:34
actually:129
participate:366
pass:143
sections:1280
easy:7
offer:770
extra:127
credit:137
every:57
week:155
funny:77
dude:973
much:31
say:151


In [None]:
#Next, let's see how the first comment is converted to a sequence
print(df['comments'][0])
Y = tokenizer.texts_to_sequences(df['comments'].values)
Y = pad_sequences(Y, maxlen=MAX_SEQUENCE_LENGTH)
print(Y[0])

class hard twoinone gened knockout content stimulating unlike classes actually participate pass sections easy offer extra credit every week funny dude much say
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    
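
The leading zeros come from `pad_sequences`, which by default pads (and truncates) at the front of each sequence. The following pure-Python sketch mimics that default behavior; `pad_pre` is a hypothetical stand-in for the Keras function and assumes a positive `maxlen`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences(..., padding='pre', truncating='pre') for one sequence."""
    kept = list(sequence)[-maxlen:]               # truncating='pre' keeps the last maxlen tokens
    return [value] * (maxlen - len(kept)) + kept  # padding='pre' adds zeros at the front

print(pad_pre([1, 10, 9644], 5))        # [0, 0, 1, 10, 9644]
print(pad_pre([1, 10, 9644, 3844], 2))  # [9644, 3844]
```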

In [None]:
# Create an embedding matrix for the tokenizer's word index
num_words = len(tokenizer.word_index) + 1
embedding_dim = EMBEDDING_DIM
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [None]:
# Split data into training and testing sets
VALIDATION_SPLIT = 0.2
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y_star = y_star[indices]
y_difficult = y_difficult[indices]
num_validation_samples = int(VALIDATION_SPLIT * X.shape[0])

X_train = X[:-num_validation_samples]
y_train_star = y_star[:-num_validation_samples]
y_train_difficult = y_difficult[:-num_validation_samples]
X_test = X[-num_validation_samples:]
y_test_star = y_star[-num_validation_samples:]
y_test_difficult = y_difficult[-num_validation_samples:]

In [None]:
print(y_train_star.shape)

(18694,)


In [None]:
#Treat output as a regression problem
from tensorflow.keras.layers import GRU, Bidirectional

alt5_model = Sequential([
    Embedding(MAX_NB_WORDS, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False),
    Bidirectional(GRU(64,return_sequences=True, dropout=0.4)),
    Bidirectional(GRU(64,dropout=0.4)),
    Dense(128, activation="sigmoid"),
    Dropout(0.4),
    Dense(64, activation="sigmoid"),
    Dropout(0.4),
    Dense(1)
])

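# Accuracy is not meaningful for a continuous target, so track mean absolute error alongside the MSE loss
alt5_model.compile(loss='mse', optimizer='adam', metrics=['mae'])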
print(alt5_model.summary())
alt5_model.fit(X_train, y_train_star, validation_data=(X_test, y_test_star), epochs=20, batch_size=128)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 250, 300)          6293700   
                                                                 
 bidirectional (Bidirectiona  (None, 250, 128)         140544    
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              74496     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 128)               16512     
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 64)                8

<keras.callbacks.History at 0x7ff4602eab80>

In [None]:
#Verify that inputs are actually sequences of indices
for i in range(20):
  print(X[i])

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0 62]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   

In [None]:
#Sanity check by running on a test case
import tensorflow as tf
print(X[0])
print(X[0][0])
print("Prediction:")
print(Y[0].shape)
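# Note: expanding on axis 1 gives shape (250, 1), i.e. a batch of 250 one-token sequences,
# so predict() returns one value per token (almost all of them the padding index 0).
# tf.expand_dims(Y[0], 0) would instead score the whole comment as a single sample.
y = tf.expand_dims(Y[0],1)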
print(y.shape)
alt5_model.predict(y)

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0 62]
0
Prediction:
(250,)
(250, 1)


array([[3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],
       [3.7170818],


In [None]:
#It was interesting that the network seemed to have a "default" of 3.7170818, so let's do some stats
print(df['student_star'][0])
print(df['student_star'].mean())
print(df['student_star'].mode())
print(df['professor_name'].mode())

5.0
3.6573800659049085
0    5.0
Name: student_star, dtype: float64
0    Robert Valenza 
Name: professor_name, dtype: object


The network's output for a padding-only input (3.7170818) differs from the mean rating (about 3.6574) by less than 0.06. This suggests that, in the absence of any actual words, the network falls back to predicting roughly the mean rating.
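
The identical predictions in the sanity check arise because `tf.expand_dims(Y[0], 1)` turns one 250-token comment into a batch of 250 one-token sequences, nearly all of which are the padding index 0. To score the comment as a single sample, the batch axis would need to be added at position 0 instead. The following pure-Python sketch shows the two shapes on a short toy sequence (the helper names are illustrative):

```python
def add_batch_axis(sequence):
    """Shape (n,) -> (1, n): one sample containing the whole sequence."""
    return [list(sequence)]

def add_trailing_axis(sequence):
    """Shape (n,) -> (n, 1): n samples of one token each."""
    return [[token] for token in sequence]

padded = [0, 0, 0, 1, 10]  # toy padded comment
as_one_sample = add_batch_axis(padded)
as_many_samples = add_trailing_axis(padded)

print(len(as_one_sample), len(as_one_sample[0]))      # rows, columns of the (1, n) layout
print(len(as_many_samples), len(as_many_samples[0]))  # rows, columns of the (n, 1) layout
```

In the notebook, the first layout would correspond to `tf.expand_dims(Y[0], 0)`, which yields shape `(1, 250)` and a single prediction.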

# **--- END OF SECTION 5 ---**

# **Approach 6**


---


## **One-Hot Encoding and Sequential Model on RateMyProfessor**


---


### This code builds a machine learning model with Keras and TensorFlow to predict the difficulty rating a student gives from their comment about a professor. The code starts by importing the necessary libraries and loading the data from a CSV file. The comments column is cleaned and then one-hot encoded, and the data is split into train and test sets. The comments are also tokenized and padded to the same length. A Sequential model is created with three dense layers, two dropout layers, and a softmax output layer. The model is then compiled with the RMSprop optimizer and trained on the train set for 10 epochs. Finally, the model is evaluated on the test set, and the accuracy is printed.

In [None]:
#Mount Google Drive in Google Colab

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#importing all the necessary libraries
import pandas as pd
import numpy as np
import re
import nltk
from keras.preprocessing.text import Tokenizer
from numpy import array
from tensorflow import keras
from numpy import argmax
from keras.utils import to_categorical
from keras.utils import pad_sequences
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

In [None]:
#reading the csv file from google drive and saving the data in df using pd
df = pd.read_csv(r'/content/drive/My Drive/ANN_Project_2/RateMyProfessor_Sample data.csv')

#check the columns
print(df['comments'])
print(df['student_difficult'])

0        This class is hard, but its a two-in-one gen-e...
1        Definitely going to choose Prof. Looney\'s cla...
2        I overall enjoyed this class because the assig...
3        Yes, it\'s possible to get an A but you\'ll de...
4        Professor Looney has great knowledge in Astron...
                               ...                        
19995     Great sense of humor!!!! Love parasites now!!!!!
19996    he is a really nice guy and is really funny..h...
19997    His parasitology class is a lot of work but he...
19998    He is WAY too much work for a 1 credit class. ...
19999    Extremely easy lab teacher, quizzes are a litt...
Name: comments, Length: 20000, dtype: object
0        3.0
1        2.0
2        3.0
3        3.0
4        1.0
        ... 
19995    5.0
19996    4.0
19997    3.0
19998    5.0
19999    2.0
Name: student_difficult, Length: 20000, dtype: float64


In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Remove 'nan' values from 'comments' column
df = df.dropna(subset=['comments'])

# Remove stopwords, numerical values, and symbols from 'comments' column and convert to lowercase
stop_words = stopwords.words('english')
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\d+', ' ', text)
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
#text cleaning/preprocessing
df['comments'] = df['comments'].apply(lambda x: clean_text(x))

#storing the column values in comment
comments = df['comments']

In [None]:
#implementing one-hot encoding
comments_one = pd.get_dummies(comments)

df['student_difficult'].shape
comments_one.head(10)

Unnamed: 0,Unnamed: 1,aaagh college dean husband sense part architecture part curriculum sometimes see coming run away bother pay lots money,aaron kozbelt knows subject knows explain concepts organized presentations clear would liked courses professors brooklyn college watching teach brush lecture skills,aarrggh fricking unbelievable decent guy bugger know teach makes physics sound like rocket science makes ants picnic end world like texan accent though,abbas,abbas hilarious,abert genuinely nice person shows enthusiasm invest sees invest learning material math people inevitably going practice constantly order fully grasp learning hw problems ever challenging hw tests eazy,ability walk room bring silence terrifies time think awesome doubt could mean students really high standards expects work hard live underneath exterior actually care students lot,able explain apply everything well also thought knew everything fair grading papers test,able explain things well relating classmates tell really knows loves information teaches wants understand everything could also teach class,...,z great teacher knowledgeable passionate subject lectures clear point often hilarious surprised much learned bec class seems casual found retaining material almost year since took class still identify rocks minerals fly,zahajko worst instructor ever chose favorite students students grade classmates tests absolute violation privacy raced lectures called student like front whole class plain rude absolutely recommend,zeal compassion better students anyone around unmatched honor professor coach approachable subject matter expert hence without doubt fully confident recommend genuine mentor leader mis business intelligence database,zeman man take tests make percent grade questions test directly quizzes plus gives essays ahead time bad quiz wk major pain avg tho paper nearly impossible write grades easily lowest grade youll prob get b,zeno redundant teacher avoid possible gives ton pointless work 
hard keep really teach anything related philosophy total waste time,zoology best course taken far across different colleges difficult pass extremely interesting classroom loaded specimens craig amazing professor pleasant person around general dress weather prepared outdoor excursions time,zorn intriguing intelligent guy man little crazy high standards essays basis entire grade dedication helping writing progress unmatched difficult grader really engaging discussion overall incredibly helpful character amazing,zorn one favorites wacky right ways makes coming class worth reading manageable papers challenging grades pretty hard name make better writer zorn always help learned lot classes money well spent scu,zzzzz huh oh credit hour class hour nap class absolutly pointless required majors minors like mr feeny may bad,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

X = comments_one
y = df['student_difficult']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Tokenize the comments column of the train and test sets
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad the sequences of the train and test sets to have the same length
max_length = max([len(seq) for seq in X_train_seq])
X_train_pad = pad_sequences(X_train_seq, maxlen=max_length)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_length)

In [None]:
#training model
model = keras.models.Sequential()
model.add(keras.layers.Dense(128, activation = 'relu'))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(64, activation = 'relu'))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(32, activation='sigmoid'))

# Output layer for 'student_difficult'
model.add(keras.layers.Dense(6, activation='softmax', name='student_difficuly'))

# Compile the model with a different optimizer
optimizer = keras.optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])



In [None]:
# Train the model on the train set for more epochs
history = model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.3)

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print('Test Accuracy', accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy 0.23424474895000458


# **--- END OF SECTION 6 ---**

# **Approach 7**


---


## **Sentiment Analysis on RateMyProfessor Data using GloVe Embeddings and Bidirectional LSTM Layers**

---


### This code implements sentiment analysis of student reviews of professors on RateMyProfessor.com. It predicts a professor's teaching efficacy and difficulty using a Bidirectional LSTM neural network architecture with pre-trained GloVe word embeddings.

### Here is a breakdown of the code:

### 1. Import necessary libraries.

### 2. Preprocess the data by converting the text to lowercase, removing punctuation, tokenizing the text into words, and removing stop words.

### 3. Tokenize the comments column and pad the sequences to have the same length.

### 4. Split the data into train and test sets.

### 5. Load pre-trained GloVe embeddings and create an embedding matrix.

### 6. Build the model, which consists of an Embedding layer, a Bidirectional LSTM layer, and a Dense layer with a linear activation function. The Embedding layer uses the pre-trained GloVe embeddings as weights and is set to non-trainable.

### 7. Compile the model with the RMSprop optimizer, the binary_crossentropy loss function, and mean squared error and accuracy as metrics.

### 8. Train the model on the training data for 5 epochs with a batch size of 128, validating on 20% of the training data.

### 9. Print the model summary and evaluate its performance on the test data.

In [None]:
import pandas as pd
import nltk
from gensim.models import Word2Vec
import string
from google.colab import drive
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
import re

drive.mount('/content/drive')
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/RateMyProfessor_Sample_data.csv')
nltk.download('stopwords')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Drop rows with missing values
df=df.dropna(subset=['comments','student_star','student_difficult'])

In [None]:
X = df['comments']

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

# Tokenize the comments column
tokenizer = Tokenizer()
df['comments'] = df['comments'].fillna('')

tokenizer.fit_on_texts(df['comments'])
sequences = tokenizer.texts_to_sequences(df['comments'])

# Pad the sequences to have the same length
max_length = max([len(seq) for seq in sequences])
X = pad_sequences(sequences, maxlen=max_length)

In [None]:

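y = df[['student_star', 'student_difficult']].values
# argmax gives 0 when student_star >= student_difficult, else 1: a binary label, not the ratings themselves
y = np.argmax(y, axis=1)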

In [None]:
glove_path='/content/drive/MyDrive/glove.42B.300d.txt'
embedding_dim=300

In [None]:

from tensorflow.keras.utils import to_categorical

# Split the data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)


In [None]:
def load_glove_embeddings(embedding_path, word_index, embedding_dim):
    num_words = len(word_index) + 1
    if isinstance(embedding_dim, np.ndarray):
        embedding_dim = embedding_dim.item()
    embeddings_matrix = np.zeros((num_words, embedding_dim))

    embeddings_index = {}
    with open(embedding_path) as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

    for word, i in word_index.items():
        if i >= num_words:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embeddings_matrix[i] = embedding_vector

    return embeddings_matrix


In [None]:

embeddings_matrix = load_glove_embeddings(glove_path, tokenizer.word_index, embedding_dim=300)


In [None]:
import keras
import tensorflow as tf


model = keras.models.Sequential()
model.add(keras.layers.Embedding(embeddings_matrix.shape[0],embeddings_matrix.shape[1],weights = [embeddings_matrix],trainable= False))

model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)))
model.add(keras.layers.Dense(2, activation='linear'))

optimizer = keras.optimizers.RMSprop(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['mse','accuracy'])

# Print the model summary
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 300)         4939200   
                                                                 
 bidirectional (Bidirectiona  (None, 64)               85248     
 l)                                                              
                                                                 
 dense (Dense)               (None, 2)                 130       
                                                                 
Total params: 5,024,578
Trainable params: 85,378
Non-trainable params: 4,939,200
_________________________________________________________________


In [None]:
history = model.fit(X_train, y_train, epochs=5, batch_size=128, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
loss,mse, accuracy = model.evaluate(X_test, y_test)
print('Test Accuracy', accuracy)

Test Accuracy 0.7974493503570557


# **--- END OF SECTION 7 ---**

# **APPROACH 8**
---
## **REGRESSION MODEL -2**
---
### In this model, we used the '***Gensim***' package to load and access '***GloVe***' word embeddings, used the '***spacy***' and '***nltk***' language models to preprocess the data in the provided corpus '***RateMyProfessor_Sample data***', then built a '***regression model***' with an ***LSTM layer***, followed by a ***Dense layer***, followed by a ***Dropout layer***, followed by ***two*** ***individual*** ***dense output layers*** that predict the scores/ratings for '***Quality***' and '***Difficulty***' of the professor based on the comments.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Installing required packages
!pip install gensim
!python -m spacy download en_core_web_lg


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
2023-04-29 05:55:11.465140: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-29 05:55:16.987204: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-04-29 05:55:16.987763: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at htt

We installed the **gensim** package to load and access the **GloVe** embeddings, and downloaded the **en_core_web_lg** language model from **spacy** to preprocess the provided corpus used to train the model.


In [None]:
# Importing the required default libraries.
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from keras.utils import pad_sequences
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

In [None]:
# converting glove to gensim word2vec format

# glove_file = '/content/drive/MyDrive/glove.42B.300d.txt'
# output_file = "glove_word2vec.txt"

# glove2word2vec(glove_file, output_file)

"""

The three commented lines above do not need to be run again, because GloVe has already been converted to the gensim word2vec format as "glove_word2vec.txt".
If you do not have that file, uncomment and run them once.
Avoid rerunning the conversion otherwise: it may consume most of your Colab resources and takes around 15 minutes.
Instead, use the file "glove_word2vec.txt".
"""

model_glove = KeyedVectors.load_word2vec_format("glove_word2vec.txt")

word_vectors = model_glove
#word_vectors.save('word_vectors.kv')
#word_vectors = KeyedVectors.load('/content/word_vectors.kv')

Here, we converted the ***GloVe*** embeddings into the ***word2vec*** format of ***gensim*** for easy access to the word embeddings. We then loaded the converted file as ***model_glove*** and assigned it to ***word_vectors***, which can be saved as "***word_vectors.kv***", the ***KeyedVectors*** format of ***gensim***.

***Note***: if you already have the "***word_vectors.kv***" file in your notebook environment, you can skip every line of the cell except the last commented line; uncomment that line to load the file. This saves considerable time and resources, as loading takes only a few seconds.

In [None]:
# get the word embeddings of a word
word_vectors['excellent']

array([ 9.0617e-02,  3.2209e-01, -2.8606e-01, -4.3576e-01,  6.5842e-01,
       -3.0986e-01, -3.2205e+00, -3.3805e-01, -1.1715e-01, -2.6288e-01,
       -5.0936e-01, -1.7889e-03,  1.1095e-01,  2.0856e-01, -1.8062e-01,
       -1.2951e-01,  3.0683e-02, -3.9619e-01, -2.4232e-02, -2.6083e-01,
        8.3488e-02,  2.5158e-01,  1.8634e-01, -2.2445e-01,  2.0258e-01,
       -1.7600e-01,  1.3958e-01,  1.5411e-01,  9.2125e-03,  2.2933e-01,
        4.8343e-01,  5.4896e-02, -1.7834e-01,  2.9386e-01,  7.3732e-02,
       -3.6977e-01,  3.3151e-02, -1.8546e-01, -1.4493e-02, -4.5637e-02,
        3.0009e-01,  8.9469e-02, -2.2406e-01, -2.1427e-01,  1.4199e-01,
        2.8529e-02,  1.5734e-01, -7.4364e-03,  1.9294e-01,  1.0349e-01,
        4.7936e-02,  5.7523e-01, -3.7558e-01,  2.3186e-02, -3.2913e-01,
        7.1907e-02,  2.9372e-02, -8.0111e-01, -1.3413e-01,  1.0208e-01,
        1.1508e-01, -8.9648e-02,  1.8926e-01,  4.1204e-01, -1.0060e-01,
       -4.0713e-01,  1.3602e-01,  1.6024e-01, -2.5463e-01, -6.68

We can access the word embedding of a word using the above line of code.

In [None]:
word_vectors.similarity('great','good')

0.848614

The above code checks the cosine similarity between two words.
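
For intuition, the score returned by `similarity` is the cosine of the angle between the two word vectors. The sketch below reproduces that calculation with NumPy only. The vectors are made-up 4-dimensional values, not real GloVe embeddings, and the helper name `cosine_similarity` is chosen here purely for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (illustrative helper)."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy vectors standing in for word embeddings
vec_a = [1.0, 2.0, 0.0, 1.0]
vec_b = [2.0, 4.0, 0.0, 2.0]   # points in the same direction as vec_a
vec_c = [0.0, 0.0, 3.0, 0.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.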

In [None]:
df = pd.read_csv('/content/RateMyProfessor_Sample data.csv')

We have loaded the given corpus i.e., "***RateMyProfessor_Sample data.csv***" into a pandas data frame.

In [None]:
df.dropna(subset = ['comments'], inplace = True)
df.reset_index(drop = True, inplace = True)
df = df [['student_star', 'comments','student_difficult']]
df

Unnamed: 0,student_star,comments,student_difficult
0,5.0,"This class is hard, but its a two-in-one gen-e...",3.0
1,5.0,Definitely going to choose Prof. Looney\'s cla...,2.0
2,4.0,I overall enjoyed this class because the assig...,3.0
3,5.0,"Yes, it\'s possible to get an A but you\'ll de...",3.0
4,5.0,Professor Looney has great knowledge in Astron...,1.0
...,...,...,...
19988,1.5,Great sense of humor!!!! Love parasites now!!!!!,5.0
19989,2.5,he is a really nice guy and is really funny..h...,4.0
19990,5.0,His parasitology class is a lot of work but he...,3.0
19991,4.0,He is WAY too much work for a 1 credit class. ...,5.0


The code snippet drops the records with null comments, resets the index so that it is continuous, and keeps only the 'student_star', 'comments', and 'student_difficult' columns.

In [None]:
# Import required language models

import spacy
import nltk

# Download 'words' corpus from nltk (Natural Language Toolkit) library
nltk.download('words')

# Load the previously downloaded "en_core_web_lg" language model as nlp
nlp = spacy.load("en_core_web_lg")

#create a set of vocabulary in 'words' corpus of nltk as 'words'
words = set(nltk.corpus.words.words())

# Defining a function that preprocesses and vectorizes the text data
def CleanText_and_Vectorize(text):

    # Remove non-english words from the text
    text = " ".join(w for w in nltk.wordpunct_tokenize(text) if w.lower() in words or not w.isalpha())

    # Run the spaCy pipeline to get lemmas and stop-word/punctuation flags
    doc = nlp(text)

    word_embeddings = []
    for token in doc:
        # Skip tokens whose lemma is not in word_vectors (the KeyedVectors form of GloVe)
        if token.lemma_ not in word_vectors.key_to_index:
            continue
        # Skip stop words and punctuation (checking token.is_punct is optional here)
        elif token.is_stop or token.is_punct:
            continue

        # Convert the lemma into its embedding
        vector = word_vectors[token.lemma_]
        word_embeddings.append(vector)

    return  word_embeddings

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


The code downloads the "***words***" corpus from the ***NLTK library*** and defines a function that preprocesses and vectorizes text data using the ***spaCy language model*** and pre-trained **GloVe word embeddings**.
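
The core filter-then-lookup pattern can be sketched without spaCy or GloVe. In the toy version below, the stop-word set and the two-dimensional `toy_vectors` table are invented for illustration, and lemmatization is skipped entirely; only the overall flow resembles the function above.

```python
# Invented stop words and 2-d "embeddings", for illustration only
toy_stop_words = {"the", "is", "a"}
toy_vectors = {
    "class": [0.1, 0.2],
    "hard": [0.3, -0.1],
    "fun": [0.5, 0.4],
}

def toy_vectorize(text):
    """Return the vectors of known, non-stop-word tokens in order of appearance."""
    vectors = []
    for token in text.lower().split():
        if token in toy_stop_words or token not in toy_vectors:
            continue  # skip stop words and out-of-vocabulary tokens
        vectors.append(toy_vectors[token])
    return vectors

print(toy_vectorize("The class is hard zzzz"))
```

Each comment therefore becomes a list whose length depends on how many of its tokens survive the filters, which is why the sequences need padding later on.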

In [None]:
# Try preprocessing and vectorizing a sample text using the 'CleanText_and_Vectorize' function
text = 'Definitely going to choose Prof. Looney\'s class again! Interesting class and easy A. You can bring notes to exams so you don\'t need to remember a lot. Lots of bonus points available and the observatory sessions are awesome!'
print(CleanText_and_Vectorize(text))

[array([ 1.8315e-01,  1.6785e-01,  4.2958e-01, -2.9332e-01,  3.1203e-01,
       -4.4731e-02, -3.5613e+00, -1.1850e-01,  2.0356e-02, -8.1756e-01,
        1.7654e-02, -1.4323e-01,  6.0164e-01,  9.8865e-02,  1.6221e-01,
        6.2312e-02,  1.1385e-01, -1.9449e-01, -1.1285e-01, -8.1337e-02,
       -5.0393e-02, -2.4186e-02,  1.9904e-01, -7.6335e-02,  2.1435e-01,
       -3.0881e-01, -2.2568e-01,  6.0657e-01,  2.4933e-01,  2.8433e-01,
       -2.9857e-02, -9.5095e-02,  1.2549e-01,  9.8321e-02, -2.1169e-01,
       -2.8529e-01, -1.2910e-01,  2.4108e-01,  2.5237e-01,  1.6759e-01,
        4.9352e-04,  1.2010e-01, -7.3734e-02, -5.4717e-02,  1.6821e-01,
       -1.4597e-01, -1.4255e-01, -1.0056e-01,  2.5957e-01,  2.7934e-02,
        1.9946e-01, -8.7880e-03,  2.8088e-01, -1.5341e-01, -1.5649e-01,
       -2.8636e-01,  8.6480e-03, -4.9797e-01,  4.1838e-01,  1.4312e-01,
        1.1074e-01,  9.4336e-02,  1.6607e-01,  2.1501e-01, -1.3163e-01,
        1.3850e-02,  8.6084e-02, -2.1840e-01,  1.2086e-02, -4.1

In [None]:
# preprocess and vectorize text data in the  comments column of the corpus 'RateMyProfessor_Sample data.csv'
df['word_embeddings'] = df['comments'].apply(lambda text: CleanText_and_Vectorize(text))

# save the modified dataframe as a pickle file.
# use the pickle file format so that the datatypes of the data are preserved when the dataframe is reloaded
df.to_pickle('RateMyProfessor_WordEmbeddings.pkl')

#load the previously saved pickle file as df
df = pd.read_pickle('/content/RateMyProfessor_WordEmbeddings.pkl')


These lines of code preprocess and vectorize the text data in the comments column of the "***RateMyProfessor_Sample data.csv***" corpus by applying the CleanText_and_Vectorize function to each comment and storing the result in a new column called "***word_embeddings***". The modified dataframe is then saved as a pickle file and later loaded back into the notebook using the read_pickle method.

***Note***: if you already have '***RateMyProfessor_WordEmbeddings.pkl***', you can skip the first two lines of code and run only the last line to load the file.
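
The reason for preferring pickle over a text format here is that each cell of the "***word_embeddings***" column holds a list of NumPy arrays. The sketch below uses a toy stand-in for one such cell and an in-memory round trip, so no file is written; the variable names are assumptions for the example.

```python
import pickle
import numpy as np

# Toy stand-in for one cell of the 'word_embeddings' column
toy_cell = [np.array([0.1, 0.2], dtype=np.float32),
            np.array([0.3, 0.4], dtype=np.float32)]

# A pickle round trip keeps the list of float32 arrays intact
restored = pickle.loads(pickle.dumps(toy_cell))

# A text format such as CSV would only store a string representation
as_text = str(toy_cell)

print(type(restored[0]), restored[0].dtype, type(as_text))
```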

In [None]:
df.head(1)

Unnamed: 0,student_star,comments,student_difficult,word_embeddings
0,5.0,"This class is hard, but its a two-in-one gen-e...",3.0,"[[0.80308, -0.016776, 0.025788, -0.18749, 0.39..."


In [None]:
df.tail(1)

Unnamed: 0,student_star,comments,student_difficult,word_embeddings
19992,5.0,"Extremely easy lab teacher, quizzes are a litt...",2.0,"[[0.078935, -0.11888, -0.29625, -0.46044, 0.02..."


In [None]:
# get the word_embeddings of first comment in the data frame 'df'
np.array(df['word_embeddings'][0])

array([[ 0.80308  , -0.016776 ,  0.025788 , ...,  0.042583 , -0.39625  ,
        -0.040408 ],
       [-0.38065  ,  0.33245  ,  0.053904 , ...,  0.086626 ,  0.33142  ,
         0.2014   ],
       [-0.01004  , -0.41808  , -0.061757 , ..., -0.0073133,  0.12578  ,
        -0.48272  ],
       ...,
       [ 0.11384  ,  0.42036  ,  0.23951  , ...,  0.055397 , -0.29187  ,
        -0.35491  ],
       [-0.40781  ,  0.1567   , -0.28539  , ...,  0.11763  , -0.0026115,
         0.32068  ],
       [-0.30338  ,  0.36641  ,  0.24148  , ..., -0.32577  , -0.011849 ,
         0.17795  ]], dtype=float32)

In [None]:
# get the shape of the word_embeddings of first comment in the data frame 'df'
np.array(df['word_embeddings'][0]).shape

(17, 300)

In [None]:
# Get the total number of comments in the dataframe
total_comments= len(df['comments'])

# Get the maximum length of word embeddings in the comments using list comprehension
max_length = max([len(df.word_embeddings[i]) for i in range(total_comments) if i in df.index])

# Pad the sequences of word embeddings to make them of equal length for all comments, using the Keras function 'pad_sequences'
padded_embeddings = pad_sequences(df['word_embeddings'], maxlen=max_length, dtype='float32', padding='post', truncating='post')

The code calculates the maximum length of word embeddings in the dataframe and pads sequences of embeddings with zeros. This ensures that all sequences have the same length, which is necessary for training a neural network.
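
As a rough illustration of what post-padding does, the sketch below pads toy sequences of 3-dimensional vectors using NumPy only. The helper `pad_post` and the toy shapes are assumptions for the example; the notebook itself relies on the Keras `pad_sequences` function.

```python
import numpy as np

def pad_post(sequences, maxlen, dim):
    """Zero-pad (or truncate) each sequence of vectors at the end to `maxlen` rows."""
    padded = np.zeros((len(sequences), maxlen, dim), dtype=np.float32)
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype=np.float32).reshape(-1, dim)[:maxlen]
        padded[i, :len(seq)] = seq
    return padded

# Two toy "comments" with 2 and 3 word vectors of dimension 3
toy = [
    [[1, 1, 1], [2, 2, 2]],
    [[3, 3, 3], [4, 4, 4], [5, 5, 5]],
]
toy_max_length = max(len(seq) for seq in toy)
toy_padded = pad_post(toy, toy_max_length, dim=3)
print(toy_padded.shape)  # (2, 3, 3)
```

The shorter comment gains one all-zero row at the end, so both comments can be stacked into a single array.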

In [None]:
# get max_length
max_length

46

In [None]:
# Convert the 3D numpy array of padded embeddings to a pandas series for easier handling
padded_embeddings_df = pd.Series([np.array(padded_embeddings[i]) for i in range(len(padded_embeddings))])

Here, we create a pandas series where each element is the array of padded embeddings for the corresponding comment.

In [None]:
# Replace the 'word_embeddings' column with 'padded_embeddings_df'
df['word_embeddings'] = padded_embeddings_df

This line of code assigns the padded word embeddings to the 'word_embeddings' column of the dataframe 'df', replacing the previously existing word embeddings.

In [None]:
# get the padded word_embeddings of first comment in the data frame 'df'
df['word_embeddings'][0]

array([[ 0.80308  , -0.016776 ,  0.025788 , ...,  0.042583 , -0.39625  ,
        -0.040408 ],
       [-0.38065  ,  0.33245  ,  0.053904 , ...,  0.086626 ,  0.33142  ,
         0.2014   ],
       [-0.01004  , -0.41808  , -0.061757 , ..., -0.0073133,  0.12578  ,
        -0.48272  ],
       ...,
       [ 0.       ,  0.       ,  0.       , ...,  0.       ,  0.       ,
         0.       ],
       [ 0.       ,  0.       ,  0.       , ...,  0.       ,  0.       ,
         0.       ],
       [ 0.       ,  0.       ,  0.       , ...,  0.       ,  0.       ,
         0.       ]], dtype=float32)

In [None]:
# get the shape of the padded word_embeddings of first comment in the data frame 'df'
df['word_embeddings'][0].shape

(46, 300)

In [None]:
from keras.layers import Input, LSTM, Dense, Dropout
from keras.models import Model

# Define input shape
input_shape =df['word_embeddings'][0].shape

# Define model architecture
input_layer   =  Input(shape=input_shape)
lstm_layer    =  LSTM(128)(input_layer)
dense_layer   =  Dense(32, activation = 'relu')(lstm_layer)
dropout_layer =  Dropout(0.2)(dense_layer)
output_layer1 =  Dense(1, activation = 'relu', name = 'Quality')(dropout_layer)
output_layer2 =  Dense(1, activation = 'relu', name = 'Difficulty')(dropout_layer)

model = Model(inputs = input_layer, outputs = [output_layer1, output_layer2])



This code imports the necessary layers and model from the Keras library, defines the input shape for the model based on the shape of the word embeddings, sets up the architecture for a neural network model with an LSTM layer, a dense layer, a dropout layer, and two output layers, and then creates the model object using the input and output layers. The model has two output layers to predict the ***Quality*** and ***Difficulty*** of a professor based on the input word embeddings.

In [None]:
# Compile model
model.compile(loss={'Quality': 'mse', 'Difficulty': 'mse'},
              optimizer=keras.optimizers.Adam(learning_rate=0.001),
              metrics={'Quality': keras.metrics.RootMeanSquaredError(),
                       'Difficulty': keras.metrics.RootMeanSquaredError()})


The above code snippet compiles the previously defined model using the mean squared error (mse) loss function for both Quality and Difficulty outputs. It uses the Adam optimizer with a learning rate of 0.001 and sets the root mean squared error (RMSE) as the evaluation metric for both outputs.
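
To make the relationship between the loss and the metric concrete, here is a small NumPy sketch with made-up ratings (the arrays are hypothetical, not taken from the dataset). RMSE is the square root of MSE, so it is expressed in the same units as the ratings themselves.

```python
import numpy as np

# Hypothetical true and predicted ratings
y_true = np.array([5.0, 3.0, 4.0, 1.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error, used as the loss
rmse = np.sqrt(mse)                     # root mean squared error, used as the metric

print(mse, rmse)
```

Read this way, an RMSE of about 1 means predictions are typically off by roughly one rating point.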

In [None]:
# Print model summary
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 46, 300)]    0           []                               
                                                                                                  
 lstm (LSTM)                    (None, 128)          219648      ['input_1[0][0]']                
                                                                                                  
 dense (Dense)                  (None, 32)           4128        ['lstm[0][0]']                   
                                                                                                  
 dropout (Dropout)              (None, 32)           0           ['dense[0][0]']                  
                                                                                              

In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

# stack the embeddings to create a 3D array
X = np.stack(df['word_embeddings'])

# convert student rating for Quality to numpy array
y_quality = np.array(df['student_star'])

# convert student rating for Difficulty to numpy array
y_difficulty = np.array(df['student_difficult'])

# split the data into train and test sets with a test size of 20% and a random state of 42
X_train, X_test, y_quality_train, y_quality_test, y_difficulty_train, y_difficulty_test = train_test_split(X, y_quality, y_difficulty, test_size=0.2, random_state=42)


Here, we are splitting the data into training and testing sets using the ***train_test_split*** function from ***scikit-learn***. The features are stored in a NumPy array '***X***', while the target variables for ***Quality*** and ***Difficulty*** ratings are stored in separate NumPy arrays ***y_quality*** and ***y_difficulty***, respectively. The resulting variables ***X_train***, ***X_test***, ***y_quality_train***, ***y_quality_test***, ***y_difficulty_train***, and ***y_difficulty_test*** store the training and testing data
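
Passing several arrays to ***train_test_split*** in one call applies the same shuffled row order to all of them, so each comment stays aligned with its two ratings. The NumPy-only sketch below illustrates that idea with toy arrays and a hypothetical helper `split_together`; it is not scikit-learn's implementation.

```python
import numpy as np

def split_together(arrays, test_size=0.2, seed=42):
    """Shuffle row indices once and split every array with the same indices."""
    n = len(arrays[0])
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n)
    n_test = int(np.ceil(test_size * n))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [(a[train_idx], a[test_idx]) for a in arrays]

# Toy data: row i of every array belongs to the same "comment"
X_toy = np.arange(10).reshape(10, 1)
q_toy = np.arange(10) * 10
d_toy = np.arange(10) * 100

(X_tr, X_te), (q_tr, q_te), (d_tr, d_te) = split_together([X_toy, q_toy, d_toy])
print(len(X_tr), len(X_te))
```

Splitting the arrays in separate calls with different shuffles would break this row alignment, which is why they are split together.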

In [None]:
# Train the model
history = model.fit(X_train, {'Quality': y_quality_train, 'Difficulty': y_difficulty_train},epochs=20, batch_size=128,validation_split=0.2)

# Evaluate the model on test data
loss_error_matrix = model.evaluate(X_test, {'Quality': y_quality_test, 'Difficulty': y_difficulty_test})

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


The above piece of code fits the compiled model to the training data with 20 epochs and batch size of 128, using 20% of the training data as validation set. The losses are evaluated on the test set.

In [None]:
#save the model
model.save('student_rating_regression_model.h5')

In [None]:
# Get Quality Root mean squared error
print('Root Mean Squared error Of Quality:', loss_error_matrix[3])

# Get Difficulty Root mean squared error
print('Root Mean Squared error Of Difficulty:', loss_error_matrix[4])

Root Mean Squared error Of Quality: 1.0203790664672852
Root Mean Squared error Of Difficulty: 1.154398798942566


In [None]:
# Ask user for input
user_input = input("Enter your comment: ")
# Clean and vectorize user input
user_input = CleanText_and_Vectorize(user_input)
# Add a batch dimension: (num_words, 300) -> (1, num_words, 300)
user_input = np.expand_dims(user_input, axis=0)
# Pad user input sequence to have the same length as the training data
user_input = pad_sequences(user_input, maxlen=max_length,dtype='float32', padding='post', truncating='post')

# Predict the ratings; the model returns one (1, 1) array per output
rating = model.predict(user_input)
# Print the predicted ratings, capped at the maximum rating of 5
print('Quality: %.1f' % min(float(rating[0][0][0]), 5))
print('Difficulty: %.1f' % min(float(rating[1][0][0]), 5))

Enter your comment: A brilliant proffessor, makes the topic interesting through class discussions. Although powerpoints would help in preparation for exams. His trivia questions are not hard if you pay attention in class. He is a very nice and understaning professor. The study guides he provides for exams are very helpful so long as you study well! Take him.
Quality: 4.6
Difficulty: 3.1


# **--- END OF SECTION 8 ---**

# **Approach 9**
---
## **REGRESSION MODEL - 3**
---
### This model treats the assignment as a regression problem and trains a brand new word embedding from scratch. It uses tensorflow.keras.layers.TextVectorization both to clean the data and to generate the sequences, and trains on the .csv and .json RateMyProfessor data together.

In [None]:
# Importing the required libraries.
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout, TextVectorization

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Importing the data from both .json and .csv into one data frame.
import json
df = pd.read_csv(r'/content/drive/My Drive/GROUP_PROJECT_2_ANN/Big Data Set from RateMyProfessor.com for Professors Teaching Evaluation/RateMyProfessor_Sample data.csv')
with open('/content/drive/My Drive/GROUP_PROJECT_2_ANN/all_reviews.json') as my_file:
    json_dict = json.load(my_file)

# Collect the JSON reviews as rows first, then concatenate once;
# calling pd.concat inside the loop copies the whole frame on every iteration.
rows = []
for professor_reviews in json_dict:
    for d in professor_reviews:
        rows.append({
            "comments": d["Comment"],
            "student_star": float(d["Quality"]),
            "student_difficult": float(d["Difficulty"]),
            "professor_name": d["professor"]
        })

# Continue the index from 20000, after the rows of the .csv sample
lastindex = 20000
json_df = pd.DataFrame(rows, index=range(lastindex, lastindex + len(rows)))
df = pd.concat([df, json_df])

In [None]:
print(len(np.array(df['comments'])))

23374


In [None]:
# Remove 'nan' values from 'comments' column
df = df.dropna(subset=['comments'])

# Remove stopwords and convert to lowercase
nltk.download('stopwords')
stop_words = stopwords.words('english')
def clean_text(text):
    text = text.lower()
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

df['comments'] = df['comments'].apply(clean_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
#Some of this was tinkering around with how exactly TextVectorization works
comments_nparr = np.array(df['comments'])
unique_comments = np.unique(comments_nparr)
unique_comments_hack = unique_comments[1:]
print(len(unique_comments))
print(len(unique_comments_hack))

22102
22101


In [None]:
# Set parameters for model
MAX_NB_WORDS = 100000
MAX_SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 300

# Tokenize words in 'comments' column
encoder = TextVectorization(
    vocabulary = unique_comments_hack,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_sequence_length = MAX_SEQUENCE_LENGTH
)
altcoder = TextVectorization(
    standardize="lower",
    split="whitespace",
    output_sequence_length = MAX_SEQUENCE_LENGTH,
    max_tokens = MAX_NB_WORDS
)
altcoder.adapt(unique_comments_hack)
print('Found %s unique tokens in Altcoder sample' % len(altcoder.get_vocabulary()))

Found 22103 unique tokens. In Encoder sample
Found 37543 unique tokens. In Altcoder sample


Note that there are far more tokens in the "altcoder" than in the "encoder". This is because the "encoder" was given entire comment strings as its vocabulary, whereas the "altcoder" learned individual words from the comments.
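
The difference between the two vocabularies can be illustrated without TensorFlow. In the cell above, `encoder` receives whole comments through its `vocabulary` argument, while `altcoder` is adapted on the comments and splits them on whitespace. The sketch below mimics both with plain Python sets and three made-up comments.

```python
# Made-up comments, for illustration only
toy_comments = [
    "great teacher clear lectures",
    "hard tests unclear lectures",
    "great class easy tests",
]

# One vocabulary entry per whole comment string
string_vocab = set(toy_comments)

# One vocabulary entry per whitespace-separated word
word_vocab = {word for comment in toy_comments for word in comment.split()}

print(len(string_vocab), len(word_vocab))
```

A vocabulary of whole strings can never exceed the number of distinct comments, whereas a word-level vocabulary grows with the variety of words used.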

In [None]:
# Split data into training and testing sets
VALIDATION_SPLIT = 0.2

X = df['comments'].values
y_star = df['student_star'].values
y_difficult = df['student_difficult'].values
y_star_category = pd.get_dummies(df['student_star']).values
y_difficult_category = pd.get_dummies(df['student_difficult']).values

indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y_star = y_star[indices]
y_difficult = y_difficult[indices]
y_star_category = y_star_category[indices]
y_difficult_category = y_difficult_category[indices]
num_validation_samples = int(VALIDATION_SPLIT * X.shape[0])

X_train = X[:-num_validation_samples]
y_train_star = y_star[:-num_validation_samples]
y_train_difficult = y_difficult[:-num_validation_samples]
y_train_star_category = y_star_category[:-num_validation_samples]
y_train_difficult_category = y_difficult_category[:-num_validation_samples]
X_test = X[-num_validation_samples:]
y_test_star = y_star[-num_validation_samples:]
y_test_difficult = y_difficult[-num_validation_samples:]
y_test_star_category = y_star_category[-num_validation_samples:]
y_test_difficult_category = y_difficult_category[-num_validation_samples:]
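
The shuffle-and-split logic above can be condensed into one helper. The sketch below is an illustration rather than the code used in this project: `shuffled_split` and its `seed` parameter are hypothetical names, it uses plain lists so it runs without dependencies, and it adds a fixed seed (the cell above shuffles without one, so its split changes between runs).

```python
import random

def shuffled_split(n_samples, validation_split=0.2, seed=0):
    """Return (train_indices, test_indices) drawn from one seeded permutation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(validation_split * n_samples)
    # Slice from the front so that n_val == 0 keeps everything in training;
    # indices[:-0] would instead return an empty list.
    split_at = n_samples - n_val
    return indices[:split_at], indices[split_at:]

train_idx, test_idx = shuffled_split(10, validation_split=0.2, seed=42)

# Index every array with the same permutation so inputs and targets stay aligned.
features = [f"comment {i}" for i in range(10)]
labels = [i % 5 + 1 for i in range(10)]
X_train_demo = [features[i] for i in train_idx]
y_train_demo = [labels[i] for i in train_idx]
```

Applying the same index list to the comments and to every target array is what keeps each comment paired with its own ratings after shuffling.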

In [None]:
#Treat outputs as a regression problem
from tensorflow.keras.layers import GRU, Bidirectional

star_model_regression = Sequential([
    altcoder,
    Embedding(input_dim=len(altcoder.get_vocabulary()), output_dim=100, input_length=MAX_SEQUENCE_LENGTH),
    Bidirectional(GRU(100, dropout=0.4)),
    Dense(128, activation="sigmoid"),
    Dropout(0.4),
    Dense(64, activation="sigmoid"),
    Dropout(0.4),
    Dense(1)
])

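# Note: 'accuracy' is not a meaningful metric for a continuous target; the MSE loss is the figure to watch.
star_model_regression.compile(loss='mse', optimizer='adam', metrics=['accuracy'])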
print(star_model_regression.summary())
star_model_regression.fit(X_train, y_train_star, validation_data=(X_test, y_test_star), epochs=20, batch_size=128)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, 250)              0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 250, 100)          3754300   
                                                                 
 bidirectional (Bidirectiona  (None, 200)              121200    
 l)                                                              
                                                                 
 dense (Dense)               (None, 128)               25728     
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 64)                8

<keras.callbacks.History at 0x7fa09a471b10>
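
The parameter counts in the visible part of the summary can be checked by hand. The sketch below assumes the GRU uses Keras's default `reset_after=True` bias layout; the helper names are hypothetical and the vocabulary size is the altcoder count printed earlier.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def gru_params(input_dim, units):
    # Three gates, each with an input kernel, a recurrent kernel,
    # and two bias vectors (the reset_after=True layout).
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size = 37543  # altcoder vocabulary size reported above
counts = {
    "embedding": embedding_params(vocab_size, 100),
    "bidirectional_gru": 2 * gru_params(100, 100),  # forward + backward copies
    "dense_128": dense_params(200, 128),            # input is the 200-wide concatenation
}
for name, value in counts.items():
    print(f"{name}: {value}")
```

Almost all of the trainable parameters sit in the embedding layer, which may contribute to the rapid overfitting noted in the summary. The same `dense_params` helper applies to the remaining Dense layers, whose rows are cut off in the output above.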

In [None]:
difficulty_model_regression = Sequential([
    altcoder,
    Embedding(input_dim=len(altcoder.get_vocabulary()), output_dim=100, input_length=MAX_SEQUENCE_LENGTH),
    Bidirectional(GRU(100, dropout=0.4)),
    Dense(128, activation="sigmoid"),
    Dropout(0.4),
    Dense(64, activation="sigmoid"),
    Dropout(0.4),
    Dense(1)
])

# Note: as with the star model, 'accuracy' is not meaningful here; compare runs by their MSE loss.
difficulty_model_regression.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(difficulty_model_regression.summary())
difficulty_model_regression.fit(X_train, y_train_difficult, validation_data=(X_test, y_test_difficult), epochs=20, batch_size=128)

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, 250)              0         
 ectorization)                                                   
                                                                 
 embedding_8 (Embedding)     (None, 250, 100)          3754300   
                                                                 
 bidirectional_8 (Bidirectio  (None, 200)              121200    
 nal)                                                            
                                                                 
 dense_24 (Dense)            (None, 128)               25728     
                                                                 
 dropout_16 (Dropout)        (None, 128)               0         
                                                                 
 dense_25 (Dense)            (None, 64)               

<keras.callbacks.History at 0x7fa092f743a0>

## Alternate Model - 2

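### **This model treats the task as classification rather than regression: the output layer is a softmax over 9 classes for quality (5 for difficulty) and the loss is categorical cross-entropy. Otherwise, it is identical to the preceding model.**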
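
Moving from regression to classification changes how the targets are encoded: each rating becomes a one-hot vector, and a prediction is read back as the class with the highest softmax probability. The following is a minimal sketch of that round trip, assuming all nine quality values (1 through 5 in 0.5 steps) occur in the data and are ordered by value. The helper names and the softmax output are made up for illustration.

```python
# The 9 quality classes: 1.0, 1.5, ..., 5.0
quality_classes = [1 + 0.5 * i for i in range(9)]

def one_hot(rating, classes):
    """One-hot encode a rating against a sorted list of class values."""
    return [1 if rating == c else 0 for c in classes]

def decode(probabilities, classes):
    """Return the class value with the highest predicted probability."""
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best]

target = one_hot(3.5, quality_classes)

# A made-up softmax output that puts most of its mass on index 5 (rating 3.5).
fake_softmax = [0.02, 0.03, 0.05, 0.10, 0.15, 0.40, 0.15, 0.07, 0.03]
predicted = decode(fake_softmax, quality_classes)
print(target, predicted)
```

One consequence of this encoding is that categorical cross-entropy treats all wrong classes alike: predicting 3.0 for a true 3.5 is penalized on the same basis as predicting 1.0, whereas the MSE loss of the regression models rewards near misses.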



In [None]:
from tensorflow.keras.layers import GRU, Bidirectional

star_model_classification = Sequential([
    altcoder,
    Embedding(input_dim=len(altcoder.get_vocabulary()), output_dim=100, input_length=MAX_SEQUENCE_LENGTH),
    Bidirectional(GRU(100, dropout=0.4)),
    Dense(128, activation="sigmoid"),
    Dropout(0.4),
    Dense(64, activation="sigmoid"),
    Dropout(0.4),
    Dense(9, activation="softmax")
])

star_model_classification.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(star_model_classification.summary())
star_model_classification.fit(X_train, y_train_star_category, validation_data=(X_test, y_test_star_category), epochs=20, batch_size=128)

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_5 (TextV  (None, 250)              0         
 ectorization)                                                   
                                                                 
 embedding_6 (Embedding)     (None, 250, 100)          3754300   
                                                                 
 bidirectional_6 (Bidirectio  (None, 200)              121200    
 nal)                                                            
                                                                 
 dense_18 (Dense)            (None, 128)               25728     
                                                                 
 dropout_12 (Dropout)        (None, 128)               0         
                                                                 
 dense_19 (Dense)            (None, 64)               

<keras.callbacks.History at 0x7fa09fa1a1d0>

In [None]:
difficulty_model_classification = Sequential([
    altcoder,
    Embedding(input_dim=len(altcoder.get_vocabulary()), output_dim=100, input_length=MAX_SEQUENCE_LENGTH),
    Bidirectional(GRU(100, dropout=0.4)),
    Dense(128, activation="sigmoid"),
    Dropout(0.4),
    Dense(64, activation="sigmoid"),
    Dropout(0.4),
    Dense(5, activation="softmax")
])

difficulty_model_classification.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(difficulty_model_classification.summary())
difficulty_model_classification.fit(X_train, y_train_difficult_category, validation_data=(X_test, y_test_difficult_category), epochs=20, batch_size=128)

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, 250)              0         
 ectorization)                                                   
                                                                 
 embedding_10 (Embedding)    (None, 250, 100)          3754300   
                                                                 
 bidirectional_10 (Bidirecti  (None, 200)              121200    
 onal)                                                           
                                                                 
 dense_30 (Dense)            (None, 128)               25728     
                                                                 
 dropout_20 (Dropout)        (None, 128)               0         
                                                                 
 dense_31 (Dense)            (None, 64)              

<keras.callbacks.History at 0x7fa07e0b4b80>

## Alternate Model - 3

### **This model is identical to the preceding classification model, except that its two hidden Dense layers use tanh activation instead of sigmoid.**
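
For reference, tanh is a rescaled and zero-centred version of the sigmoid, so the change mainly affects the output range of the hidden layers: (-1, 1) instead of (0, 1). The short check below illustrates the relationship; `tanh_via_sigmoid` is a hypothetical helper written for this note.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  tanh={math.tanh(x):.4f}")
```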


In [None]:
star_model_final = Sequential([
    altcoder,
    Embedding(input_dim=len(altcoder.get_vocabulary()), output_dim=100, input_length=MAX_SEQUENCE_LENGTH),
    Bidirectional(GRU(100, dropout=0.4)),
    Dense(128, activation="tanh"),
    Dropout(0.4),
    Dense(64, activation="tanh"),
    Dropout(0.4),
    Dense(9, activation="softmax")
])

star_model_final.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(star_model_final.summary())
star_model_final.fit(X_train, y_train_star_category, validation_data=(X_test, y_test_star_category), epochs=20, batch_size=128)

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (TextV  (None, 250)              0         
 ectorization)                                                   
                                                                 
 embedding_11 (Embedding)    (None, 250, 100)          3754300   
                                                                 
 bidirectional_11 (Bidirecti  (None, 200)              121200    
 onal)                                                           
                                                                 
 dense_33 (Dense)            (None, 128)               25728     
                                                                 
 dropout_22 (Dropout)        (None, 128)               0         
                                                                 
 dense_34 (Dense)            (None, 64)              

<keras.callbacks.History at 0x7fa06063f610>

# **--- END OF SECTION 9 ---**

# **ACCURACY COMPARISON TABLE**

**Classification Models**

All values are percentages and are the highest observed during training or in a subsequent evaluate() call. Because dropout is active while training metrics are computed, the reported training accuracies may differ from what the trained model would score on the same data at inference time; they are included for reference.
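
As an illustration of how a "highest seen" figure can be read from a training run, the sketch below picks the best epoch from a dictionary shaped like Keras's `History.history`. The numbers are invented for the example and are not results from this project; `best_epoch` is a hypothetical helper.

```python
# Made-up per-epoch values in the shape of a Keras History.history dict;
# these are NOT results from this project.
history = {
    "accuracy":     [0.40, 0.52, 0.63, 0.74, 0.82],
    "val_accuracy": [0.42, 0.46, 0.45, 0.43, 0.41],
}

def best_epoch(history, metric="val_accuracy"):
    """Return (1-based epoch, value as a percentage) of the metric's maximum."""
    values = history[metric]
    index = max(range(len(values)), key=values.__getitem__)
    return index + 1, round(100 * values[index], 2)

val_epoch, best_val = best_epoch(history)
train_epoch, best_train = best_epoch(history, metric="accuracy")
print(val_epoch, best_val, train_epoch, best_train)
```

In a run shaped like this one, the best training and best validation figures come from different epochs, so a train/test pair in the table does not necessarily describe a single set of weights.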

| Approaches | Quality Train Accuracy | Quality Test Accuracy | Difficulty Train Accuracy | Difficulty Test Accuracy | JSON data | Embedding Dimension | Embedding Trained |
|:----------:|:---------:|:--------:|:----------:|:---------:|:--------:|:--------:|:--------:|
| 1 - GloVe, LSTM | 49.35 | 44.52 | N/A | N/A | | 100 | |
| 2 - GloVe, Deep GRU | 42.19 | 43.34 | N/A | N/A | | 100 | |
| 3 - GloVe, LSTM | 81.95 | 46.01 | N/A | N/A | X | 100 | X |
| 4.1 - GloVe, Deep Bidirectional LSTM | 76.25 | 46.97 | 82.56 | 38.03 | X | 300 | X |
| 4.2 - GloVe, Deep Bidirectional LSTM | 53.98 | 46.44 | N/A | N/A | X | 300 | |
| 4.3 - GloVe, Deep Bidirectional LSTM | 50.05 | 46.57 | N/A | N/A | X | 300 | |
| 4.4 - GloVe, Deep Bidirectional LSTM | 55.57 | 46.82 | N/A | N/A | X | 300 | |
| 4.5 - GloVe, Deep Bidirectional GRU | 49.49 | 47.06 | N/A | N/A | X | 300 | |
| 6 - One-Hot Encoding, Dense Network \*(See note) | N/A | N/A | 55.58 | 27.58 | | N/A | N/A |
| 7 - GloVe, Bidirectional LSTM, Relative \*\*(See note) | 80.97 | 80.09 | 80.97 | 80.09 | | 300 | |
| 9.2 - Custom Embedding, Bidirectional GRU | 74.90 | 42.78 | 87.99 | 35.48 | X | 100 | X |
| 9.3 - Custom Embedding, Bidirectional GRU | 91.10 | 44.43 | N/A | N/A | X | 100 | X |

\*Note: Although approach 6 is not an RNN, it is included as a baseline for comparing the accuracy of dense and recurrent networks.

\*\*Note: Approach 7 predicts which of the two ratings, difficulty or quality, is higher for a given comment, rather than predicting either rating itself. It is included for reference and to show how a different question can lead to a very different accuracy.
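
Part of the gap between these tasks comes from the number of classes alone. A rough point of comparison is the accuracy of guessing uniformly at random, sketched below. This assumes the relative task is binary, and it is a weak baseline: a majority-class guess could score higher on imbalanced data, but the class distribution is not given here, so it is not computed.

```python
def uniform_chance_accuracy(num_classes):
    """Expected accuracy (%) of guessing uniformly at random among num_classes."""
    return 100.0 / num_classes

baselines = {
    "quality (9 classes)": uniform_chance_accuracy(9),
    "difficulty (5 classes)": uniform_chance_accuracy(5),
    "relative (2 classes, assumed)": uniform_chance_accuracy(2),
}
for task, value in baselines.items():
    print(f"{task}: {value:.2f}%")
```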

**Regression Models**

| Approaches | Quality Train Loss | Quality Test Loss | Difficulty Train Loss | Difficulty Test Loss | JSON data | Embedding Dimension | Embedding Trained |
|:----------:|:---------:|:--------:|:----------:|:---------:|:--------:|:--------:|:--------:|
| 5 - GloVe, Deep Bidirectional GRU | 0.8161 | 0.8127 | N/A | N/A | X | 300 | |
| 8 - GloVe, LSTM, User Input Support | 1.2941 | 1.0412 | 1.5355 | 1.3326 | | 300 | |
| 9.1 - Custom Embedding, Bidirectional GRU | 0.4571 | 1.0418 | 0.3644 | 1.2845 | X | 100 | X |
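
To put these losses on the same scale as the ratings, one can take the square root. The sketch below does so for the lowest test loss in each test column of the table, assuming the reported losses are mean squared errors, as the summary states.

```python
import math

# Lowest value in each test-loss column of the table above,
# assumed to be MSE (squared rating points).
test_mse = {
    "Quality Test Loss (approach 5)": 0.8127,
    "Difficulty Test Loss (approach 9.1)": 1.2845,
}

# RMSE is in rating points, so it can be read as a typical error size.
test_rmse = {name: math.sqrt(mse) for name, mse in test_mse.items()}
for name, rmse in test_rmse.items():
    print(f"{name}: RMSE of about {rmse:.2f} rating points")
```

On this reading, even the best regression models are typically off by roughly one rating point.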
