# AUC Course Recommender
## Description
This notebook contains the source code for the Amsterdam University College (AUC) Course Recommender, which was built as part of a project for the Text Mining course.

## Code
### Imports:

In [20]:
import pandas as pd
import numpy as np
import nltk
import re
import string
from sklearn.metrics.pairwise import cosine_similarity
from gensim.parsing.preprocessing import remove_stopwords
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from sklearn.metrics import ndcg_score

### Loading and Preprocessing the Data

Loading the data in as a pandas dataframe:

In [24]:
data = pd.read_csv("datasets/recommender_dataset.csv")

Next we drop the rows that have no value in the course_catalogue_number, is_part_of, language_of_instruction, or course_description columns. These rows are empty because the course scraper did not scrape information from courses whose websites were not written in English, so after dropping them all courses in the dataset are taught in English.

In [26]:
courses = data.dropna(subset=['course_catalogue_number', 'is_part_of', 'language_of_instruction', 'course_description'])
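
As a small illustration of the step above, the sketch below applies `dropna` with a `subset` to a toy dataframe. The toy rows and values are hypothetical and only stand in for the real dataset.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has more columns and rows.
toy = pd.DataFrame({
    "course_catalogue_number": ["A1", None, "C3"],
    "course_description": ["intro to logic", "inleiding", None],
})

# Keep only the rows where every column listed in `subset` has a value.
kept = toy.dropna(subset=["course_catalogue_number", "course_description"])
print(kept["course_catalogue_number"].tolist())  # ['A1']
```

A row is dropped as soon as any one of the listed columns is missing, which is why only the first toy row survives.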

Now we lowercase the text columns and remove the stop words from the is_part_of and course_description columns, as stop words act as noise that adds little discriminative value when measuring similarity.

In [30]:
courses = courses.reset_index(drop=True)
#iterating over every row, starting at index 0 so that the first course is not skipped
for i in range(len(courses)):
    courses.loc[i, 'course_name'] = courses.loc[i, 'course_name'].lower()
    courses.loc[i, 'college_graduate'] = courses.loc[i, 'college_graduate'].lower()
    courses.loc[i, 'language_of_instruction'] = courses.loc[i, 'language_of_instruction'].lower()
    courses.loc[i, 'is_part_of'] = courses.loc[i, 'is_part_of'].lower()
    courses.loc[i, 'is_part_of'] = remove_stopwords(courses.loc[i, 'is_part_of'])
    #selecting the description by column name rather than by column position
    courses.loc[i, 'course_description'] = courses.loc[i, 'course_description'].lower()
    courses.loc[i, 'course_description'] = remove_stopwords(courses.loc[i, 'course_description'])

We also remove punctuation from the text, since it acts as noise too:

In [32]:
for i in range(len(courses)):
    courses.loc[i, 'course_name'] = re.sub(r'[^\w\s]+', '', courses['course_name'][i])
    courses.loc[i, 'is_part_of'] = re.sub(r'[^\w\s]+', '', courses['is_part_of'][i])
    courses.loc[i, 'college_graduate'] = re.sub(r'[^\w\s]+', '', courses['college_graduate'][i])
    courses.loc[i, 'course_description'] = re.sub(r'[^\w\s]+', '', courses['course_description'][i])
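
To make the effect of the pattern `[^\w\s]+` concrete, here is a minimal sketch on a single string. The helper name `strip_punctuation` and the sample text are hypothetical and not part of the notebook.

```python
import re

def strip_punctuation(text):
    # Delete every run of characters that is neither a word character
    # (letter, digit, underscore) nor whitespace.
    return re.sub(r'[^\w\s]+', '', text)

print(strip_punctuation("bio/environment: an intro (part 1)."))
# bioenvironment an intro part 1
```

Because the punctuation is replaced by an empty string, words that were separated only by punctuation (such as `bio/environment`) are fused into one token.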

Sources for this part of the code:
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* https://www.w3schools.com/python/pandas/ref_df_reset_index.asp
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://towardsdatascience.com/remove-punctuation-pandas-3e461efe9584/
* https://www.geeksforgeeks.org/python-remove-punctuation-from-string/

### Tokenization

Before vectorising the data, we tokenise it. We do this for the 'course_name', 'is_part_of', 'college_graduate', and 'course_description' columns.

In [36]:
#creating the tokenizer once and applying it to each of the four text columns
tokenizer = nltk.tokenize.WordPunctTokenizer()
for col in ['course_name', 'is_part_of', 'college_graduate', 'course_description']:
    courses[col] = courses[col].apply(tokenizer.tokenize)
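
The sketch below illustrates what this tokenisation does, using only the standard library. It assumes that `WordPunctTokenizer` splits text with the regular expression `\w+|[^\w\s]+`, as described in the NLTK documentation. The helper `word_punct_tokenize` is a hypothetical stand-in, not the NLTK class itself.

```python
import re

def word_punct_tokenize(text):
    # Assumed equivalent of WordPunctTokenizer: runs of word characters,
    # or runs of non-word, non-space characters, each become one token.
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_tokenize("advanced quantum algorithms"))
# ['advanced', 'quantum', 'algorithms']
```

Since punctuation has already been removed at this point in the notebook, the tokeniser effectively splits the text on whitespace.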

We now create a new column that joins the tokens from all four columns, which we will then vectorise and use in the recommender system.

In [38]:
courses['combined_text'] = courses['course_name'] + courses['is_part_of'] + courses['college_graduate'] + courses['course_description']
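
Adding columns that hold Python lists concatenates the lists row by row, which is what the cell above relies on. A minimal sketch with hypothetical toy rows:

```python
import pandas as pd

# Hypothetical tokenised columns; each cell holds a list of tokens.
toy = pd.DataFrame({
    "course_name": [["linear", "algebra"], ["film", "history"]],
    "course_description": [["vectors", "matrices"], ["cinema"]],
})

# `+` on two object columns applies list concatenation element-wise.
toy["combined_text"] = toy["course_name"] + toy["course_description"]
print(toy["combined_text"].tolist())
# [['linear', 'algebra', 'vectors', 'matrices'], ['film', 'history', 'cinema']]
```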

Sources for this part of the code:
* https://www.nltk.org/api/nltk.tokenize.regexp.html
* https://www.kaggle.com/code/kanikanarang94/tokenization-using-nltk
* https://saturncloud.io/blog/how-to-combine-two-columns-in-a-pandas-dataframe/

### Vectorization

First we tag the "combined_text" column in order to be able to vectorize it with doc2vec.

In [42]:
tagged_data = [TaggedDocument(words=doc, tags=[cid]) for doc, cid in zip(courses['combined_text'], courses['course_catalogue_number'])]

Next we vectorise the "combined_text" column.

In [44]:
params = {
    'vector_size': 100, # dimension of the embeddings
    'window': 5, # maximum distance between the focus word and its context words
    'epochs': 5, # default number of passes over the corpus (overridden by epochs=1 in the loop below)
    'min_count': 5, # ignore words whose total frequency is below this count
    'workers': 4, # number of worker threads
    'alpha': 0.05 # initial learning rate for SGD. This is lambda in the class notes
}

model = Doc2Vec(**params)

model.build_vocab(tagged_data)

max_epochs = 100

# training one epoch at a time so that progress can be printed
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=1)
    # from the second iteration onwards the learning rate stays fixed at 0.175
    model.alpha = 0.175
    model.min_alpha = model.alpha

iteration 0
iteration 1
iteration 2
...
iteration 76
[remaining output truncated]

Sources for this part of the code (these sources are also relevant for the following part of this notebook):
* https://radimrehurek.com/gensim/models/doc2vec.html
* https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html
* https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5
* https://spotintelligence.com/2023/09/06/doc2vec/#What_is_Doc2Vec
* The parameters in the code were taken from the notebook where Word2Vec was introduced.

### Recommender

In [47]:
def recommend_courses(li_course_ids, model, courses_df, top_n=5):
    #getting the vectors from the doc2vec model and averaging the courses in li_course_ids into one vector
    vectors = [model.dv[tag] for tag in li_course_ids]
    av_vector = np.mean(vectors, axis=0).reshape(1, -1)
    #vectors of all courses; this assumes the catalogue numbers are unique,
    #so that position i in model.dv corresponds to row i of courses_df
    all_vectors = np.array([model.dv[i] for i in range(len(model.dv))])

    #calculating the cosine similarities and flattening them to one dimension
    cosine_sim = cosine_similarity(av_vector, all_vectors).flatten()

    #pairing each similarity with its row index and sorting from most to least similar
    sim_scores = list(enumerate(cosine_sim))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    #removing the courses in li_course_ids from sim_scores so they are not recommended back
    index_to_id = courses_df['course_catalogue_number'].tolist()
    selected_indices = [index_to_id.index(cid) for cid in li_course_ids if cid in index_to_id]
    sim_scores = [s for s in sim_scores if s[0] not in selected_indices]

    #getting the row indices of the top_n courses with the highest similarity scores
    course_indices = [i[0] for i in sim_scores[:top_n]]

    #returning the names and catalogue numbers of the recommended courses
    return courses_df.iloc[course_indices][['course_name', 'course_catalogue_number']]
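
The core idea of the function above (average the vectors of the courses taken, then rank all other courses by cosine similarity to that average) can be sketched with plain NumPy. The two-dimensional vectors, the course IDs, and the helper `rank_by_profile` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical 2-D course vectors; the doc2vec vectors have 100 dimensions.
course_ids = ["MATH1", "MATH2", "FILM1", "PHYS1"]
vectors = np.array([
    [1.0, 0.0],   # MATH1
    [0.9, 0.1],   # MATH2
    [0.0, 1.0],   # FILM1
    [0.8, 0.3],   # PHYS1
])

def rank_by_profile(taken, top_n=2):
    taken_idx = [course_ids.index(c) for c in taken]
    # Average the vectors of the courses already taken into one profile.
    profile = vectors[taken_idx].mean(axis=0)
    # Cosine similarity = dot product divided by the product of the norms.
    sims = vectors @ profile / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(profile))
    # Sort from most to least similar and skip the courses already taken.
    order = [i for i in np.argsort(-sims) if i not in taken_idx]
    return [course_ids[i] for i in order[:top_n]]

print(rank_by_profile(["MATH1"]))  # ['MATH2', 'PHYS1']
```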

Sources for this part of the code:
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
* https://numpy.org/doc/stable/reference/generated/numpy.matrix.flatten.html
* https://ioflood.com/blog/dataframe-to-list-pandas/
* https://www.programiz.com/python-programming/methods/list/index

### Testing

Opening the test dataset:

In [51]:
import ast

majors = []
tracks = []
li_coourse_ids = []
with open("datasets/test_set.txt", 'r') as f:
    for l in f:
        if l[0] == '[':
            #parsing the list literal safely instead of calling eval on file contents
            li_coourse_ids.append(ast.literal_eval(l))
        elif l.isupper():
            majors.append(l)
        elif l.islower():
            tracks.append(l)
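
The layout of the test file is not shown in the notebook. Judging from the parsing logic, it presumably holds upper-case major names, lower-case track names, and one Python-style list of course IDs per test case. The sketch below parses a hypothetical sample in that assumed layout. The IDs `ID_A` and `ID_B` are placeholders.

```python
import ast
import io

# Hypothetical file contents, inferred from the parsing logic above.
sample = io.StringIO(
    "SCIENCE\n"
    "math\n"
    "['ID_A', 'ID_B']\n"
)

majors, tracks, id_lists = [], [], []
for line in sample:
    line = line.strip()  # drop the trailing newline before classifying
    if line.startswith('['):
        # literal_eval only accepts Python literals, unlike eval.
        id_lists.append(ast.literal_eval(line))
    elif line.isupper():
        majors.append(line)
    elif line.islower():
        tracks.append(line)

print(majors, tracks, id_lists)
# ['SCIENCE'] ['math'] [['ID_A', 'ID_B']]
```

Stripping each line first avoids carrying the newline character into the stored major and track names.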

In [52]:
print(majors[0])
print(tracks[0])
recommend_courses(li_coourse_ids[0], model, courses)

SCIENCE

math



| | course_name | course_catalogue_number |
|---|---|---|
| 2208 | [operator, algebras] | 53348OPA8Y |
| 72 | [advanced, computational, condensed, matter] | 5354ACCM3Y |
| 3279 | [varieties, of, peacebuilding] | 7323C150FY |
| 377 | [big, data, and, automated, content, analysis] | 776500115Y |
| 1792 | [managerial, economics] | 6314M0261Y |


In [53]:
print(tracks[1])
recommend_courses(li_coourse_ids[1], model, courses)

biomed



| | course_name | course_catalogue_number |
|---|---|---|
| 162 | [algebraic, topology, ii] | 53342ALT8Y |
| 2491 | [public, presentation, of, the, past, an, intr... | 110221096Y |
| 2373 | [postgrowth, entrepreneurship] | 6013B0553Y |
| 1349 | [historical, and, comparative, sociology] | 7312E0020Y |
| 609 | [conservation, and, restoration, for, audiovis... | 158621046Y |


In [54]:
print(tracks[2])
recommend_courses(li_coourse_ids[2], model, courses)

physics



| | course_name | course_catalogue_number |
|---|---|---|
| 1389 | [hydrodynamics] | 5354HYDR6Y |
| 1350 | [historical, sources] | 138121016Y |
| 3170 | [topology, in, physics] | 5354TOIP6Y |
| 286 | [atmospheric, sciences] | 900384SCIY |
| 1115 | [extreme, astrophysics] | 5354EXAS6Y |


In [55]:
print(tracks[3])
recommend_courses(li_coourse_ids[3], model, courses)

bio/environment



| | course_name | course_catalogue_number |
|---|---|---|
| 1184 | [food, governance, and, systemic, transformati... | 5132GOST6Y |
| 210 | [app, internships, period, 1] | 19042E132Y |
| 2942 | [violence, security, paradigms, and, debates] | 73220041FY |
| 2985 | [texts, in, the, 21st, century, forms, of, wri... | 178421216Y |
| 3028 | [the, history, of, ideas] | 9002600HUY |


In [56]:
print(tracks[4])
recommend_courses(li_coourse_ids[4], model, courses)

information/neuro



| | course_name | course_catalogue_number |
|---|---|---|
| 525 | [clinical, psychology, neuropsychology] | 7201701PXY |
| 162 | [algebraic, topology, ii] | 53342ALT8Y |
| 791 | [data, futures, lab] | 900208SSCY |
| 2208 | [operator, algebras] | 53348OPA8Y |
| 2491 | [public, presentation, of, the, past, an, intr... | 110221096Y |


In [57]:
print(majors[1])
print(tracks[5])
recommend_courses(li_coourse_ids[5], model, courses)

SOCIAL SCIENCE

economics



| | course_name | course_catalogue_number |
|---|---|---|
| 162 | [algebraic, topology, ii] | 53342ALT8Y |
| 2322 | [poetics, of, protest] | 900209HUMY |
| 2434 | [professional, skills, educational, skills] | 5224PSES2Y |
| 2942 | [violence, security, paradigms, and, debates] | 73220041FY |
| 113 | [advanced, quantum, algorithms] | 5334ADQA6Y |


In [58]:
print(tracks[6])
recommend_courses(li_coourse_ids[6], model, courses)

psychology/economics



| | course_name | course_catalogue_number |
|---|---|---|
| 284 | [asymptotic, statistics] | 5374ASST8Y |
| 3173 | [toric, varieties] | 5334TOVA6Y |
| 794 | [data, literacy] | 5512DALI6Y |
| 3220 | [tutoring, and, study, guidance, masters, in, ... | 14202A025Y |
| 1457 | [international, criminal, tribunals, procedura... | 3854I1Q8GY |


In [59]:
print(tracks[7])
recommend_courses(li_coourse_ids[7], model, courses)

political science/law



| | course_name | course_catalogue_number |
|---|---|---|
| 770 | [current, topics, clinical, developmental, and] | 7203BO48XY |
| 2208 | [operator, algebras] | 53348OPA8Y |
| 162 | [algebraic, topology, ii] | 53342ALT8Y |
| 64 | [advanced, algebraic, geometry] | 5334AAGR8Y |
| 2491 | [public, presentation, of, the, past, an, intr... | 110221096Y |


In [60]:
print(majors[2])
print(tracks[8])
recommend_courses(li_coourse_ids[8], model, courses)

HUMANITIES

history/philosophy



| | course_name | course_catalogue_number |
|---|---|---|
| 1350 | [historical, sources] | 138121016Y |
| 162 | [algebraic, topology, ii] | 53342ALT8Y |
| 284 | [asymptotic, statistics] | 5374ASST8Y |
| 441 | [capstone, interdisciplinary, literature, review] | 7303S2000Y |
| 2202 | [ontwikkelen, van, trainingen] | 7204MS40XY |


In [61]:
print(tracks[9])
recommend_courses(li_coourse_ids[9], model, courses)

cultural analysis



| | course_name | course_catalogue_number |
|---|---|---|
| 201 | [anthropology, of, contemporary, south, asia] | 7312R0022Y |
| 3173 | [toric, varieties] | 5334TOVA6Y |
| 2643 | [research, project, 2the, value, of, gold, and... | 172421196Y |
| 2211 | [oral, history, and, artwork, life, stories] | 146421726Y |
| 2202 | [ontwikkelen, van, trainingen] | 7204MS40XY |


In [62]:
print(tracks[10])
recommend_courses(li_coourse_ids[10], model, courses)

media/film



| | course_name | course_catalogue_number |
|---|---|---|
| 2202 | [ontwikkelen, van, trainingen] | 7204MS40XY |
| 451 | [case, study, project, preservation, and, pres... | 158621036Y |
| 1864 | [masters, thesis, cultural, data, and, ai] | 159429000Y |
| 3173 | [toric, varieties] | 5334TOVA6Y |
| 2373 | [postgrowth, entrepreneurship] | 6013B0553Y |


In [63]:
print(tracks[11])
recommend_courses(li_coourse_ids[11], model, courses)

art history/history



| | course_name | course_catalogue_number |
|---|---|---|
| 284 | [asymptotic, statistics] | 5374ASST8Y |
| 537 | [cognitive, musicology] | 115215496Y |
| 2202 | [ontwikkelen, van, trainingen] | 7204MS40XY |
| 403 | [bodies, and, the, posthuman] | 178421036Y |
| 2430 | [product, management] | 6614ZM041Y |


Sources for this part of the code:
* https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/

## Evaluation

**Normalised Discounted Cumulative Gain Metric**

Takes the manually obtained relevance scores as input and computes the NDCG score.

In [None]:
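#scores encoding the order in which the program recommends the courses (first recommendation = highest score)
y_score = np.array([[5,4,3,2,1]])

def ndcg_metric(ground_truth, k=5):
    #ground_truth: the manually assigned relevance of each recommended course, in recommendation order
    y_true = np.array([ground_truth])
    return ndcg_score(y_true, y_score, k=k)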
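
To show what the metric measures, here is a minimal NumPy sketch of NDCG with a logarithmic discount and linear gain. This is assumed to match what `ndcg_score` computes when the predicted scores contain no ties. The relevance labels are hypothetical, and the helpers `dcg` and `ndcg` are illustrative, not part of scikit-learn.

```python
import numpy as np

def dcg(relevances):
    # Discount the relevance at rank i (1-based) by log2(i + 1).
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(ranks + 1)))

def ndcg(relevances_in_ranked_order):
    # Normalise by the DCG of the best possible ordering of the same labels.
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal > 0 else 0.0

# Hypothetical manual labels for five recommendations, in the order shown.
print(round(ndcg([3, 2, 3, 0, 1]), 3))  # 0.972
```

A score of 1.0 means the recommender already lists the courses from most to least relevant. Lower values mean relevant courses appear further down the list.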

In [None]:
#manually labeled accuracy

**Diversity Metric**

In [89]:
def get_cos_distances(li_course_ids, model):
    #getting the vectors from the doc2vec model and averaging the courses in li_course_ids into one vector
    vectors = [model.dv[tag] for tag in li_course_ids]
    av_vector = np.mean(vectors, axis=0).reshape(1, -1)
    all_vectors = np.array([model.dv[i] for i in range(len(model.dv))])

    #calculating the cosine similarities (not distances, despite the function name) and flattening them to one dimension
    cosine_sim = cosine_similarity(av_vector, all_vectors).flatten()
    print(cosine_sim)

In [95]:
get_cos_distances(li_coourse_ids[0], model)

[0.16390654 0.33204544 0.10238466 ... 0.03284438 0.02828562 0.30534863]


In [153]:
def compute_ild(recommended_ids, model):

    test_ids = [tag for tag in recommended_ids if tag in model.dv]
    n = len(test_ids)

    #with fewer than two courses there are no pairs to compare
    if n < 2:
        return 0.0

    #raw vectors of all the recommended courses
    vectors = np.array([model.dv[tag] for tag in test_ids])

    #cosine similarities
    cos_sim_matrix = cosine_similarity(vectors)

    #cosine distance (1 - cos_similarity)
    cos_dist_matrix = 1 - cos_sim_matrix

    #remove the diagonal (self-distance)
    total_distance = np.sum(cos_dist_matrix) - np.trace(cos_dist_matrix)

    #average over the n * (n - 1) ordered pairs that were actually compared
    ild_score = total_distance / (n * (n - 1))

    return ild_score
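
Written out, the quantity computed above is the intra-list diversity (ILD) of a recommendation list. Here $v_1, \dots, v_k$ denote the doc2vec vectors of the $k$ recommended courses. The notation is introduced for illustration only.

$$
\mathrm{ILD} = \frac{1}{k(k-1)} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} \left(1 - \cos(v_i, v_j)\right)
$$

The double sum runs over ordered pairs, so each pair of courses is counted twice. This matches summing the full distance matrix without its diagonal and dividing by $k(k-1)$. An ILD close to 0 means the recommended courses are nearly identical in vector space, and larger values mean a more varied list.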

In [155]:
def evaluate_ild(test_sets, model, courses):
    ild_results = []
    
    for i, test_input in enumerate(test_sets):
        df_rec = recommend_courses(test_input, model, courses)

        ids = [tag for tag in df_rec["course_catalogue_number"] if tag in model.dv]
        ild = compute_ild(ids, model)
        ild_results.append(ild)
        
    return ild_results

ild_scores = evaluate_ild(li_coourse_ids, model, courses)

KeyError: "Key '900211HUMY' not present"
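
The error above indicates that a vector was requested for an ID that has no entry in the model (here `'900211HUMY'`). One possible guard is to split the requested IDs into known and missing ones before looking anything up. The sketch below uses a plain set as a hypothetical stand-in for the model's tags. In the notebook the membership check would presumably be `tag in model.dv`, as in `compute_ild`. The helper `split_known_ids` and the placeholder IDs are assumptions.

```python
def split_known_ids(course_ids, known_ids):
    # Separate the IDs that have a vector from those that do not.
    known = [cid for cid in course_ids if cid in known_ids]
    missing = [cid for cid in course_ids if cid not in known_ids]
    return known, missing

# Hypothetical stand-in for the set of tags stored in the model.
known_ids = {"ID_A", "ID_B"}
known, missing = split_known_ids(["ID_A", "ID_X", "ID_B"], known_ids)
print(known, missing)  # ['ID_A', 'ID_B'] ['ID_X']
```

Reporting the missing IDs, rather than dropping them silently, makes it easier to see which test courses were removed during preprocessing.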

Sources for this part of the code:
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html
* https://dl.acm.org/doi/full/10.1145/3664928