# AUC Course Recommender
## Description
This notebook contains the source code for the Amsterdam University College (AUC) Course Recommender, which is part of a project for the Text Mining course.

## Code
### Imports:

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import string
from sklearn.metrics.pairwise import cosine_similarity
from gensim.parsing.preprocessing import remove_stopwords
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

### Loading and Preprocessing the Data

Loading the data in as a pandas dataframe:

In [4]:
data = pd.read_csv("datasets/recommender_dataset.csv")

Next we drop the rows that are missing a value in the course_catalogue_number, is_part_of, language_of_instruction, or course_description column. These rows are empty because the course scraper did not scrape information from courses whose websites were not written in English, meaning that after dropping these rows all courses in the dataset are taught in English.

In [6]:
courses = data.dropna(subset=['course_catalogue_number', 'is_part_of', 'language_of_instruction', 'course_description'])
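As a small illustration of how `dropna` with `subset` behaves, the sketch below uses a made-up three-row frame (the values are hypothetical and not taken from the dataset; only the column names follow the notebook). A row is dropped as soon as any of the listed columns is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names follow the notebook.
toy = pd.DataFrame({
    "course_catalogue_number": ["A1", np.nan, "C3"],
    "course_description": ["intro to logic", "statistics", np.nan],
    "course_name": ["Logic", "Statistics", "Ethics"],
})

# Keep only the rows where every column listed in `subset` has a value.
kept = toy.dropna(subset=["course_catalogue_number", "course_description"])
print(kept["course_name"].tolist())
```

Only the first row survives here, because the second row lacks a catalogue number and the third lacks a description.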

Now we lowercase the text columns and remove the stop words from the is_part_of and course_description columns, as stop words act as noise that adds little discriminative value when measuring similarity.

In [8]:
courses = courses.reset_index(drop=True)
# start at 0 so that the first row is preprocessed as well
for i in range(len(courses)):
    courses.loc[i, 'course_name'] = courses.loc[i, 'course_name'].lower()
    courses.loc[i, 'college_graduate'] = courses.loc[i, 'college_graduate'].lower()
    courses.loc[i, 'language_of_instruction'] = courses.loc[i, 'language_of_instruction'].lower()
    courses.loc[i, 'is_part_of'] = courses.loc[i, 'is_part_of'].lower()
    courses.loc[i, 'is_part_of'] = remove_stopwords(courses.loc[i, 'is_part_of'])
    # select the column by name rather than by position
    courses.loc[i, 'course_description'] = courses.loc[i, 'course_description'].lower()
    courses.loc[i, 'course_description'] = remove_stopwords(courses.loc[i, 'course_description'])

We also remove punctuation from the text, as this also acts as noise:

In [11]:
for i in range(len(courses)):
    courses.loc[i, 'course_name'] = re.sub(r'[^\w\s]+', '', courses['course_name'][i])
    courses.loc[i, 'is_part_of'] = re.sub(r'[^\w\s]+', '', courses['is_part_of'][i])
    courses.loc[i, 'college_graduate'] = re.sub(r'[^\w\s]+', '', courses['college_graduate'][i])
    courses.loc[i, 'course_description'] = re.sub(r'[^\w\s]+', '', courses['course_description'][i])
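To see what this regular expression does to a title, here is a minimal sketch on two made-up strings (the inputs are hypothetical). It also shows a possible variant, not used in this notebook, that replaces punctuation with a space so that words joined by a slash or hyphen stay separate.

```python
import re

# Hypothetical example titles, already lowercased.
samples = [
    "environmental chemistry/ecotoxicology",
    "non-equilibrium statistical physics",
]

# Pattern used in the notebook: delete every run of non-word, non-space characters.
deleted = [re.sub(r'[^\w\s]+', '', s) for s in samples]

# Assumed alternative: substitute a space instead, so joined words are kept apart.
spaced = [re.sub(r'[^\w\s]+', ' ', s) for s in samples]

print(deleted)
print(spaced)
```

Deleting the punctuation fuses the neighbouring words, which is why tokens such as "chemistryecotoxicology" and "nonequilibrium" appear in the recommender output further down.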

Sources for this part of the code:
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* https://www.w3schools.com/python/pandas/ref_df_reset_index.asp#:~:text=Definition%20and%20Usage,this%2C%20use%20the%20drop%20parameter.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://towardsdatascience.com/remove-punctuation-pandas-3e461efe9584/
* https://www.geeksforgeeks.org/python-remove-punctuation-from-string/

### Tokenization

Before vectorising the data, we tokenise it. We do this for the 'course_name', 'is_part_of', 'college_graduate', and 'course_description' columns.

In [19]:
for i in range(len(courses)):
    courses.loc[i, 'course_name'] = nltk.tokenize.WordPunctTokenizer().tokenize(courses['course_name'][i])
    courses.loc[i, 'is_part_of'] = nltk.tokenize.WordPunctTokenizer().tokenize(courses['is_part_of'][i])
    courses.loc[i, 'college_graduate'] = nltk.tokenize.WordPunctTokenizer().tokenize(courses['college_graduate'][i])
    courses.loc[i, 'course_description'] = nltk.tokenize.WordPunctTokenizer().tokenize(courses['course_description'][i])
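NLTK's WordPunctTokenizer is a regular-expression tokenizer that splits text into runs of word characters and runs of punctuation. The sketch below mimics that behaviour with the standard library only, as an illustration; `word_punct_tokenize` is a hypothetical helper, not part of NLTK.

```python
import re

def word_punct_tokenize(text):
    # \w+ matches runs of word characters, [^\w\s]+ matches runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_tokenize("marine and freshwater biology"))
print(word_punct_tokenize("real-world behaviour"))
```

Because punctuation has already been removed at this point in the notebook, the tokenizer effectively just splits each string on whitespace.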

We now create a new column with the tokens from all four columns joined together, which we will soon vectorise and use in the recommender system.

In [22]:
courses['combined_text'] = courses['course_name'] + courses['is_part_of'] + courses['college_graduate'] + courses['course_description']
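Since each cell now holds a list of tokens, adding the columns concatenates the lists row by row. A minimal sketch with made-up tokens (the values are hypothetical):

```python
import pandas as pd

# Hypothetical tokenised columns with two rows.
toy = pd.DataFrame({
    "course_name": [["system", "earth"], ["cognitive", "psychology"]],
    "course_description": [["climate", "geology"], ["memory", "attention"]],
})

# `+` on object columns is applied element by element, so the lists are concatenated per row.
toy["combined_text"] = toy["course_name"] + toy["course_description"]
print(toy["combined_text"].tolist())
```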

Sources for this part of the code:
* https://www.nltk.org/api/nltk.tokenize.regexp.html
* https://www.kaggle.com/code/kanikanarang94/tokenization-using-nltk
* https://saturncloud.io/blog/how-to-combine-two-columns-in-a-pandas-dataframe/

### Vectorization

First we tag the "combined_text" column in order to be able to vectorize it with doc2vec.

In [27]:
tagged_data = [TaggedDocument(words=doc, tags=[cid]) for doc, cid in zip(courses['combined_text'], courses['course_catalogue_number'])]

Next we vectorise the "combined_text" column.

In [30]:
params = {
    'vector_size': 100, # dimension of embeddings
    'window': 5, # number of words before and after the focus word
    'epochs': 5, # default number of iterations over the corpus (train() below passes epochs=1 instead)
    'min_count': 5, # ignore words whose frequency is below this count
    'workers': 4, # how many cores to use
    'alpha': 0.05 # initial learning rate for SGD. This is lambda in the class notes
}

model = Doc2Vec(**params)

model.build_vocab(tagged_data)

max_epochs = 100

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=1)
    # after the first pass the learning rate is held at 0.175, with no decay
    model.alpha = 0.175
    model.min_alpha = model.alpha

iteration 0
iteration 1
iteration 2
...

Sources for this part of the code (these sources are also relevant for the following part of this notebook):
* https://radimrehurek.com/gensim/models/doc2vec.html
* https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html
* https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5
* https://spotintelligence.com/2023/09/06/doc2vec/#What_is_Doc2Vec
* The parameters in the code were taken from the notebook where Word2Vec was introduced.

### Recommender

In [33]:
def recommend_courses(li_course_ids, model, courses_df, top_n=5):
    # getting the vectors from the doc2vec model and averaging the courses in li_course_ids into one vector
    vectors = [model.dv[tag] for tag in li_course_ids]
    av_vector = np.mean(vectors, axis=0).reshape(1, -1)
    # stacking every document vector in model order; this assumes each row of courses_df
    # has a unique tag, so that position i in the model matches row i of courses_df
    all_vectors = np.array([model.dv[i] for i in range(len(model.dv))])
    
    # calculating the cosine similarities and flattening them to one dimension
    cosine_sim = cosine_similarity(av_vector, all_vectors).flatten()
    
    # making the cosine similarities a list of (row position, score) pairs and sorting it
    sim_scores = list(enumerate(cosine_sim))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # removing the courses in li_course_ids from sim_scores by their row position
    index_to_id = courses_df['course_catalogue_number'].tolist()
    selected_indices = [index_to_id.index(cid) for cid in li_course_ids if cid in index_to_id]
    sim_scores = [s for s in sim_scores if s[0] not in selected_indices]
    
    # getting the top_n courses from the similarity scores
    course_indices = [i[0] for i in sim_scores[:top_n]]
    
    # returning the names and catalogue numbers of the recommended courses
    return courses_df.iloc[course_indices][['course_name', 'course_catalogue_number']]
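The ranking step can be checked by hand on a few small vectors. The sketch below is a simplified stand-in for the function above, using NumPy only; the course IDs, the two-dimensional vectors, and the `recommend` helper are all hypothetical.

```python
import numpy as np

# Hypothetical 2-D "course vectors"; the doc2vec vectors in this notebook have 100 dimensions.
course_ids = ["MATH1", "MATH2", "PHYS1", "ART1"]
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.7, 0.3],
    [0.0, 1.0],
])

def recommend(selected_ids, top_n=2):
    # Average the vectors of the selected courses into one query vector.
    selected = [course_ids.index(cid) for cid in selected_ids]
    query = vectors[selected].mean(axis=0)
    # Cosine similarity: dot product divided by the product of the vector norms.
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    # Sort from most to least similar, skip the selected courses, keep the first top_n.
    ranked = np.argsort(-sims)
    return [course_ids[i] for i in ranked if i not in selected][:top_n]

print(recommend(["MATH1"]))
```

With "MATH1" as input, the two vectors pointing in the most similar direction are returned first and the orthogonal "ART1" vector ranks last.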

Sources for this part of the code:
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
* https://numpy.org/doc/stable/reference/generated/numpy.matrix.flatten.html
* https://ioflood.com/blog/dataframe-to-list-pandas/#:~:text=You%20can%20use%20the%20toList,tolist()%20.&text=In%20the%20example%20above%2C%20we,1%2C%202%2C%203%5D.
* https://www.programiz.com/python-programming/methods/list/index

### Testing

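Opening the test dataset. Lines that start with '[' hold a list of course IDs, fully uppercase lines are majors, and fully lowercase lines are tracks: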

In [38]:
import ast

majors = []
tracks = []
li_coourse_ids = []
with open("datasets/test_set.txt", 'r') as f:
    for l in f:
        if l[0] == '[':
            # literal_eval parses the list literal without executing arbitrary code
            li_coourse_ids.append(ast.literal_eval(l))
        elif l.isupper():
            majors.append(l)
        elif l.islower():
            tracks.append(l)
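The parsing logic can be tried without the file. The sketch below runs the same three-way split on in-memory lines; the file layout and the placeholder course IDs are assumptions inferred from the code, and the lines are stripped so that the stored names carry no trailing newline.

```python
import ast

# Hypothetical file contents; the real test_set.txt is not shown in this notebook.
lines = [
    "SCIENCE\n",
    "math\n",
    "['COURSE_A', 'COURSE_B']\n",
]

majors, tracks, course_id_lists = [], [], []
for line in lines:
    text = line.strip()
    if text.startswith("["):
        # literal_eval accepts only Python literals, so it cannot run arbitrary code.
        course_id_lists.append(ast.literal_eval(text))
    elif text.isupper():
        majors.append(text)
    elif text.islower():
        tracks.append(text)

print(majors, tracks, course_id_lists)
```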

In [54]:
print(majors[0])
print(tracks[0])
recommend_courses(li_coourse_ids[0], model, courses)

SCIENCE

math



| | course_name | course_catalogue_number |
|---|---|---|
| 1191 | [foundation, appreciating, the, complexity, of] | 738100001Y |
| 1126 | [field, course, in, environmental, earth, scie... | 900289SCIY |
| 354 | [behavioural, business, ethics] | 6414M0381Y |
| 541 | [cognitive, psychology] | 900292SSCY |
| 1638 | [journalistic, product] | 77645TH02Y |


In [56]:
print(tracks[1])
recommend_courses(li_coourse_ids[1], model, courses)

biomed



| | course_name | course_catalogue_number |
|---|---|---|
| 2637 | [research, project, 1] | 53841REP6Y |
| 97 | [advanced, mass, spectrometry] | 5254ADMS6Y |
| 354 | [behavioural, business, ethics] | 6414M0381Y |
| 607 | [consciousness, free, will, and, realworld, be... | 5103BEWU6Y |
| 193 | [analyzing, talk, and, language, discourse, and] | 75250070FY |


In [58]:
print(tracks[2])
recommend_courses(li_coourse_ids[2], model, courses)

physics



| | course_name | course_catalogue_number |
|---|---|---|
| 2170 | [nonequilibrium, statistical, physics] | 5354NESP3Y |
| 1600 | [introduction, to, security] | 5063INTS6Y |
| 1983 | [mathematics, 1, for, economics] | 6011P0236Y |
| 2527 | [quantum, programming, project] | 5394QUPP3Y |
| 3249 | [understanding, molecular, simulation] | 5254UNMS6Y |


In [60]:
print(tracks[3])
recommend_courses(li_coourse_ids[3], model, courses)

bio/environment



| | course_name | course_catalogue_number |
|---|---|---|
| 1799 | [marine, and, freshwater, biology] | 5043MAFB6Y |
| 138 | [advances, in, aquatic, sciences] | 5224AIAS6Y |
| 2960 | [system, earth] | 900283SCIY |
| 1006 | [environmental, chemistryecotoxicology] | 900243SCIY |
| 2745 | [seaweeds, on, shore, and, at, sea] | 5224SOSA6Y |


In [62]:
print(tracks[4])
recommend_courses(li_coourse_ids[4], model, courses)

information/neuro



| | course_name | course_catalogue_number |
|---|---|---|
| 1413 | [infectious, diseases] | 900361SCIY |
| 285 | [atmospheres, and, radiative, transfer] | 5214STAT6Y |
| 2326 | [policy, ethics, and, media] | 5274POEM6Y |
| 2706 | [rethinking, sustainable, societies, new, pers... | 5512SSPB6Y |
| 2103 | [mood, anxiety, psychotic, disorders] | 7202BK02XY |


In [48]:
print(majors[1])
print(tracks[5])
recommend_courses(li_coourse_ids[5], model, courses)

SOCIAL SCIENCE

economics



| | course_name | course_catalogue_number |
|---|---|---|
| 3311 | [which, lives, matter, narratives, and, experi... | 142421226Y |
| 268 | [artistic, research, lab, winter, school, tuto... | 150528146Y |
| 1551 | [introduction, to, development, sociology] | 7311E0020Y |
| 3130 | [thesis, seminar, visual, anthropology] | 7314AV008Y |
| 285 | [atmospheres, and, radiative, transfer] | 5214STAT6Y |


In [50]:
print(tracks[6])
recommend_courses(li_coourse_ids[6], model, courses)

psychology/economics



| | course_name | course_catalogue_number |
|---|---|---|
| 1190 | [foundation, of, international, tax, law] | 3094FINTVY |
| 2324 | [policy, making, and, rule, governance] | 5294PMRG6Y |
| 1038 | [ethics, and, the, future, of, business] | 6314M0507Y |
| 110 | [advanced, private, law, in, context] | 3254AVLCVY |
| 559 | [comparative, criminal, law, and, procedure] | 3064CCLPVY |


In [52]:
print(tracks[7])
recommend_courses(li_coourse_ids[7], model, courses)

political science/law



| | course_name | course_catalogue_number |
|---|---|---|
| 2637 | [research, project, 1] | 53841REP6Y |
| 354 | [behavioural, business, ethics] | 6414M0381Y |
| 2768 | [separation, sciences] | 52548SES6Y |
| 268 | [artistic, research, lab, winter, school, tuto... | 150528146Y |
| 203 | [anthropology, of, disasters] | 7313T0102Y |


In [40]:
print(majors[2])
print(tracks[8])
recommend_courses(li_coourse_ids[8], model, courses)

HUMANITIES

history/philosophy



| | course_name | course_catalogue_number |
|---|---|---|
| 2106 | [morality, of, markets] | 3803MOQPKY |
| 3148 | [toga, topics] | 3803TOTOKY |
| 721 | [critical, perpectives, on, ai, governance] | 3104CPAIVY |
| 2637 | [research, project, 1] | 53841REP6Y |
| 3326 | [workshop, nonacademic, career, preparation] | 7525W010AY |


In [42]:
print(tracks[9])
recommend_courses(li_coourse_ids[9], model, courses)

cultural analysis



| | course_name | course_catalogue_number |
|---|---|---|
| 370 | [beyond, the, borders, of, europe, diaspora, a... | 111212236Y |
| 607 | [consciousness, free, will, and, realworld, be... | 5103BEWU6Y |
| 97 | [advanced, mass, spectrometry] | 5254ADMS6Y |
| 1510 | [internship, earth, sciences] | 5264INT24Y |
| 1089 | [evil, in, thought, and, literature] | 136221246Y |


In [44]:
print(tracks[10])
recommend_courses(li_coourse_ids[10], model, courses)

media/film



| | course_name | course_catalogue_number |
|---|---|---|
| 403 | [bodies, and, the, posthuman] | 178421036Y |
| 2375 | [poststructuralism] | 187421316Y |
| 2015 | [meme, studies, understanding, the, power, of,... | 118221176Y |
| 97 | [advanced, mass, spectrometry] | 5254ADMS6Y |
| 203 | [anthropology, of, disasters] | 7313T0102Y |


In [46]:
print(tracks[11])
recommend_courses(li_coourse_ids[11], model, courses)

art history/history



| | course_name | course_catalogue_number |
|---|---|---|
| 203 | [anthropology, of, disasters] | 7313T0102Y |
| 2637 | [research, project, 1] | 53841REP6Y |
| 3311 | [which, lives, matter, narratives, and, experi... | 142421226Y |
| 882 | [digital, practices] | 118221032Y |
| 1244 | [future, societies, lab] | 736410066Y |


Sources for this part of the code:
* https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/