# Assignment 2: SoC Module Recommender System

## Instructions to submit the assignment

- Name your jupyter notebook as `Assignment2_[StudentID].ipynb`. For instance: `Assignment2_A0123873A.ipynb`
- Your solution notebook must contain the python code that we can run to verify the answers.
  - late within 1 hour: 10% reduction in grade
  - late within 6 hours: 30% reduction in grade
  - late within 12 hours: 50% reduction in grade
  - late within 1 days: 70% reduction in grade
  - after 1 days: zero mark
- **This is an individual assessment. Refrain from working in groups.**

In this assignment we design a recomendation engine (*Don't worry about the effectiveness of the system. It maybe very bad. The idea is just to offer you a proof of concept!*). The recommendation engine suggests the students a module that closely matches the modules already taken by the student. The dataset comprices of two files:
- List of modules in the School of Computing 
- List of graduated students and the modules they had taken during their studies

Schematic diagram of the entire process is shown as below:

![Schema]()

# Loading the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion

'''
    YOU MUST USE THE RANDOM SEED WHEREVER NEEDED OR RANDOM_STATE as 42.
'''
rng = np.random.default_rng(seed=42)

courses = pd.read_csv("courses.tsv", sep='\t')
students = pd.read_csv("students.tsv", sep='\t')

# Question 1: Creating the preprocessing pipeline (10 marks)

We want to create a sklearn pipeline to efficiently preprocess the data and prepare it for training a model. We use three different features in the `courses` data: `specialisation`, `info` and `workload`. We want to represent every feature in a numeric form and merge them to form a feature vector for every course. We do so in the following way:
- `specialisation` represents one of the six levels of the module. For instance: CS2103 is a Software Engineering (SE) specialisation module. Encode this categorical feature into a vector. The decision of handling missing values is left to you! *(Hint: You can use `MultiLabelBinerizer` to do so.)*
- `info` provides a short discription of the module. We want to convert it into a vector using CountVectorizer. *Don't forget to remove the stopwords* while doing so.
-  `workload` states the intended distribution of workload over lectures, tutorials, labs and self study. We want to find the workload as the sum of individual workloads. For instnce: 3-1-1-3-2 workload transforms to 10 hours.

Provide implementation for three classes that help us build the pipeline. `transformed_courses` should be a numpy array of shape `[n_courses X n_features]`.

                                                                                                   (6 marks)

In [2]:
#Replacing "Netoworking" typo with Networking
courses['specialisation'] = courses['specialisation'].replace("Netoworking","Networking")

#Filling the NAN values with relative proportions of the existing values in that column
proportions = courses['specialisation'].value_counts(normalize=True)
#print(proportions)
unique_specialisations = proportions.index
courses['specialisation'] = courses['specialisation'].fillna(pd.Series(np.random.choice(unique_specialisations,p=proportions,size=len(courses))))

In [3]:
class WorkloadTransformer:        
    def fit(self, X, y = None, **fit_params):
        
        return X
    
    def transform(self, X, y = None, **fit_params):
        def compute_wkld(x):
            x = x.strip()
            str_list = x.split("-")
            workloads = [float(element) for element in str_list]
            total_workload = sum(workloads)
            return total_workload
        
        result = X['workload'].apply(compute_wkld)
        result = result.to_numpy()
        #print(result.shape)
        result = result.reshape(-1, 1)
        
        return result

In [4]:
class InfoTransformer:  
    def fit(self, X, y = None, **fit_params):
        
        return X
    
    def transform(self, X, y = None, **fit_params):
        cv = CountVectorizer(stop_words='english')
        cv.fit(X['info'])
        count_vector = cv.transform(X['info'])
        
        return count_vector.toarray()

In [5]:
class SpecTransformer:        
    def fit(self, X, y = None, **fit_params):
        
        return X
    
    def transform(self, X, y = None, **fit_params):
        each_course_specialisations = {}
        specialisation_list = []

        for index, row in X.iterrows():
            each_course_specialisations = {x.strip() for x in row['specialisation'].split(",")}
            specialisation_list.append(each_course_specialisations)

        #print(specialisation_list)
        mlb = MultiLabelBinarizer()
        mlb.fit(specialisation_list)
        
        
        OHE_Specialisation = mlb.transform(specialisation_list)
        
        unique_labels = mlb.classes_
        label_indices = {label: index for index, label in enumerate(unique_labels)}
        #print(label_indices)
        
        return OHE_Specialisation

In [6]:
featureTransformer = FeatureUnion([
    ('workload_processing', Pipeline([('wrkld', WorkloadTransformer())])),
    ('info_processing', Pipeline([('info', InfoTransformer())])),
    ('spec_processing', Pipeline([('spec', SpecTransformer())])),
])

featureTransformer.fit(courses)
transformed_courses = featureTransformer.transform(courses)

Now we prepare our testing data in the same way we preprocessed the course. `students` data comprises of 1000 students and a list of modules they have taken. 

Create `Xtest` and `Ytest` as two matrices. `Xtest`, of size `1000*5`, comprises of first five modules for every student in the list. `Ytest`, of size `1000*[remaining_modules]`, comprises of rest of the modules for every student in the list. 
We do so in order to assess the performance of the recommender. We assess the recommender based on its effectiveness to predict the modules given a list of five modules as the input.

For instance: 
- `Xtest[0] = ['CS2105', 'CS4222', 'CS6270', 'CS6205', 'CS4226']`
- `Ytest[0] = ['CS3282', 'CS6204', 'CS5223', 'CS3281', 'CS4344', 'CS5422', 'CS3237', 'CS5233']`.

<div align="right">(2 marks)</align>

In [7]:
students_courses = students['courses']
Xtest = []
Ytest = []

for element in students_courses:
    student_course_list = element.split(",")
    Xtest.append(student_course_list[:5])
    Ytest.append(student_course_list[5:])

For every student in `Xtest`, we need to transform the list of 5 modules to the feature space using the `featureTransformer` fit on the training data. For every module we will get a feature vector of size `n_features`. We *add* these feature vectors to get an aggregate feature vector for very student.

Write a function `getFeatureVector` that takes in the list of modules and `featureTransformer`. It returns the feature vector for the specified list of courses. For instance, `getFeatureVector(Xtest[0], featureTransformer)` will return a vector of size `n_features`.

<div align="right">(2 marks)</div>

In [8]:
def getFeatureVector(modules, featureTransformer):
    modules_in_courses = courses[courses['code'].isin(modules)]
    modules_index_courses = modules_in_courses.index
    transformed_courses = featureTransformer.transform(courses)
    modules_featurevector = transformed_courses[modules_index_courses]
    aggregated_featurevector = np.sum(modules_featurevector, axis=0)
    
    return aggregated_featurevector

In [9]:
print(Xtest[0])
result = getFeatureVector(Xtest[0],featureTransformer)
print(result)

['CS5422', 'CS5223', 'CS4237', 'CS3281', 'CS6213']
[50.  0.  0. ...  3.  0.  0.]


# Question 2: Content based recommender (4 marks)

We can use a model as simple as K-nearest neighbour (KNN) to perform a content based recommendation. If we provide a list of 5 modules to the recommender, it provide us a list of modules that are similar to the specified modules.

`sklearn` provides `NearestNeighbors` as well as `KNeighborsClassifier`, both of which have a similar functionality. `NearestNeighbors` provides as an easy functionality to predict a list of K nearest neighbours. Therefore, we prefer it over `KNeighborsClassifier`. If we want to find K nearest points to a datapoint`d`, we need to use `n_neighbors` as K + 1 because the list includes `d` itself.

You can now train the model using the training data, which comprises of `transformed_courses` and with their codes as the labels. 
<div align="right">(1 mark)</div>

In [10]:
K = 5
model = NearestNeighbors(algorithm="brute", n_neighbors = K + 1)
model.fit(transformed_courses)

It is time to see our model in action. Let's see what modules our model reommends based on the modules taken by a student.

Write a function that takes in a *pre-trained* model of your choice as input and the list of modules. It returns the top-K recommendations of the model. Print the top 6 recommendations for the first student. 
<div align="right">(3 marks)</div>

In [11]:
def recommend(model, modulesTaken, k = 6):
    
    aggregated_featurevector = getFeatureVector(modulesTaken, featureTransformer)
    aggregated_featurevector = aggregated_featurevector.reshape(-1,1)
    aggregated_featurevector = aggregated_featurevector.T
    result = model.kneighbors(aggregated_featurevector, return_distance=False)
    result_indices = result[0]
    recommended_courses = courses.iloc[result_indices]['code']
    #print(len(recommended_courses))
    
    return recommended_courses[:k]

print("Top 6 recommendations for the first student")
print(recommend(model, Xtest[0], 6))

Top 6 recommendations for the first student
33     CS3203
34     CS3205
112    CS5223
12     CS2020
38     CS3216
39     CS3217
Name: code, dtype: object


# Question 3: Recommender evaluation (6 marks)

Is this the model any good? To assess the performance of the model, we use **precision** and **recall** as our metrics. `Ytest` consists of true labels for every students. Using those labels as the ground truth, compute the precision and recall for every student. Write a code that prints values of average precision and recall for a specific value of `K` over the `students` dataset. Print the value of average precision and average recall for `K= 10`.

                                                                                                     (2 marks)

In [12]:
precision_list = []
recall_list = []
for i in range(0,len(Xtest)):
    predicted_courses = recommend(model, Xtest[i], 10)
    #print("Length of the predicted courses: ", len(predicted_courses))
    predicted_courses = set(predicted_courses)
    actual_courses = set(Ytest[i])
    intersection = predicted_courses.intersection(actual_courses)
    
    precision_i = len(intersection)/len(predicted_courses)
    recall_i = len(intersection)/len(actual_courses)
    
    precision_list.append(precision_i)
    recall_list.append(recall_i)
    
print("Average Precision: ",np.mean(precision_list))
print("Average Recall: ",np.mean(recall_list))

Average Precision:  0.050166666666666665
Average Recall:  0.02938390091160989


We observe that both precision and recall is not really great. The reason might be igh feature dimension, which may even be noisy. Append the exisiting `featureTransformer` with a PCA to reduce the dimension. 

Print the value of average precision and recall for `K= 10` after the introduction of PCA.

                                                                                                     (2 marks)

In [13]:
pca = PCA(n_components=30, random_state=42)
pca_transformer = Pipeline([
    ("features", featureTransformer),
    ("pca", pca) 
])
transformed_PCA_data = pca.fit_transform(transformed_courses)
model_pca = NearestNeighbors(algorithm = "brute", n_neighbors = 10)
model_pca.fit(transformed_PCA_data)

In [14]:
def recommend_pca(model_pca, modulesTaken, k = 10):
    aggregated_FV_PCA = getFeatureVector(modulesTaken, pca_transformer)
    aggregated_FV_PCA = aggregated_FV_PCA.reshape(-1,1)
    aggregated_FV_PCA = aggregated_FV_PCA.T
    result = model_pca.kneighbors(aggregated_FV_PCA, return_distance=False)
    result_indices = result[0]
    recommended_courses = courses.iloc[result_indices]['code']
    return recommended_courses[:k]

In [15]:
precision_list = []
recall_list = []
for i in range(0,len(Xtest)):
    predicted_courses = recommend_pca(model_pca, Xtest[i], 10)
    predicted_courses = set(predicted_courses)
    actual_courses = set(Ytest[i])
    intersection = predicted_courses.intersection(actual_courses)
    
    precision_i = len(intersection)/len(predicted_courses)
    recall_i = len(intersection)/len(actual_courses)
    
    precision_list.append(precision_i)
    recall_list.append(recall_i)
    
print("Average Precision after PCA: ",np.mean(precision_list))
print("Average Recall after PCA: ",np.mean(recall_list))

Average Precision after PCA:  0.14110000000000003
Average Recall after PCA:  0.149061360573551


Can you provide some **concrete** (something that you can implement) suggestions to improve the performance of the system? The improvement does not have to be very significant.

                                                                                                    (2 marks)

Following two suggestions have been implemented below,

1.) Setting min_df=2 helps filter out very rare terms that may be typos, noise, or specific to a small subset of documents. Can reduce the dimensionality of the feature space, which can lead to faster training times and potentially better model generalization.

Setting max_df=0.95 so as to filter out common terms that has very less discriminatory power thereby improving the informativeness of the features by focusing on more specific terms.

2.) The specialisation to which a specific course belongs, will be a better determinant for predicting which course will potentially be chosen by the student in the future. So, more weightage can be given to Specialisation feature than the other two. I have assigned a weigtage of 0.8 to Specialisation and 0.1 to Workload and Info each. 

Upon implementing the above two improvements, we get an increase of around 5% in both average precision and recall.

In [16]:
# Write your code here
class InfoTransformer_improved:  
    def fit(self, X, y = None, **fit_params):
        
        return X
    
    def transform(self, X, y = None, **fit_params):
        cv = CountVectorizer(min_df=2, max_df=0.95, stop_words='english')
        cv.fit(X['info'])
        count_vector = cv.transform(X['info'])
        
        return count_vector.toarray()

In [17]:
featureTransformer = FeatureUnion([
    ('workload_processing', Pipeline([('wrkld', WorkloadTransformer())])),
    ('info_processing_improved', Pipeline([('info_improved', InfoTransformer_improved())])),
    ('spec_processing', Pipeline([('spec', SpecTransformer())])),
],transformer_weights={'workload_processing': 0.1, 'info_processing_improved': 0.1, 'spec_processing': 0.8})


featureTransformer.fit(courses)
transformed_courses = featureTransformer.transform(courses)

In [18]:
pca = PCA(n_components=30, random_state=42)
pca_transformer = Pipeline([
    ("features", featureTransformer),
    ("pca", pca) 
])
transformed_PCA_data = pca.fit_transform(transformed_courses)
model_pca = NearestNeighbors(algorithm = "brute", n_neighbors = 10)
model_pca.fit(transformed_PCA_data)

In [19]:
def recommend_pca(model_pca, modulesTaken, k = 10):
    aggregated_FV_PCA = getFeatureVector(modulesTaken, pca_transformer)
    aggregated_FV_PCA = aggregated_FV_PCA.reshape(-1,1)
    aggregated_FV_PCA = aggregated_FV_PCA.T
    result = model_pca.kneighbors(aggregated_FV_PCA, return_distance=False)
    result_indices = result[0]
    recommended_courses = courses.iloc[result_indices]['code']
    return recommended_courses[:k]

In [20]:
precision_list = []
recall_list = []
for i in range(0,len(Xtest)):
    predicted_courses = recommend_pca(model_pca, Xtest[i], 10)
    predicted_courses = set(predicted_courses)
    actual_courses = set(Ytest[i])
    intersection = predicted_courses.intersection(actual_courses)
    
    precision_i = len(intersection)/len(predicted_courses)
    recall_i = len(intersection)/len(actual_courses)
    
    precision_list.append(precision_i)
    recall_list.append(recall_i)
    
print("Average Precision after PCA and improvement: ",np.mean(precision_list))
print("Average Recall after PCA and improvement : ",np.mean(recall_list))

Average Precision after PCA and improvement:  0.19390000000000004
Average Recall after PCA and improvement :  0.20554380000374584
