# AUC Course Recommender
## Description
This notebook contains the source code for the Amsterdam University College (AUC) Course Recommender, built as part of a project for the Text Mining course at AUC.

## Code
### Imports:

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity
from gensim.parsing.preprocessing import remove_stopwords
from sentence_transformers import SentenceTransformer

### Loading and Preprocessing the Data

Loading the data in as a pandas dataframe:

In [4]:
data = pd.read_csv("datasets/recommender_dataset.csv")
print(data.shape)

(3812, 8)


Next we drop the rows with missing values in the course_catalogue_number, is_part_of, language_of_instruction, or course_description columns. These fields are empty because the course scraper did not scrape information from courses whose websites were not written in English, so after dropping these rows all courses in the dataset are taught in English.

In [6]:
courses = data.dropna(subset=['course_catalogue_number', 'is_part_of', 'language_of_instruction', 'course_description'])
print(courses.shape)

(3345, 8)
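
As a small illustration of how `dropna(subset=...)` behaves, the sketch below uses a made-up three-row frame (the values are hypothetical, only the column names mirror the dataset). A row is dropped as soon as any of the listed columns is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "course_name": ["Course A", "Course B", "Course C"],
    "course_catalogue_number": ["A1", np.nan, "C3"],
    "course_description": ["some text", "some text", np.nan],
})

# Keep only rows where every column listed in `subset` is present.
kept = toy.dropna(subset=["course_catalogue_number", "course_description"])
print(kept.shape)
```

Only "Course A" has values in both listed columns, so it is the only row that survives.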


Example of course description before preprocessing:

In [8]:
print(courses.iloc[0]["course_description"])
print("LENGTH:", len(courses.iloc[0]["course_description"]))

Upon completing this course, you should be able to: identify and use different schools of thought in strategic management based on a solid understanding of their assumptions, strengths and weaknesses; critically reflect on different theories and perspectives in relation to competitive and cooperative strategy and compare them with alternative views; select, apply and combine analytical tools in diagnosing or addressing strategic issues at the business and network level in real-life cases; analyse the competitive, cooperative and coopetitive strategies of organizations, assess the impact of changes in these strategies, and formulate recommendations for improvement; adequately analyse the external and internal environment of an organization to derive relevant insights that can inform strategic decision-making; identify when and how to change or modify a business model over time and recognize relevant enabling and inhibiting factors; map the ecosystem(s) and alliance networks in which org

Now we lowercase the text columns and remove stop words from the is_part_of and course_description columns, as stop words act as noise that adds little discriminative value when measuring similarity.

In [10]:
courses = courses.reset_index(drop=True)

# Lowercase the text columns
for col in ['course_name', 'college_graduate', 'language_of_instruction', 'is_part_of', 'course_description']:
    courses[col] = courses[col].str.lower()

# Remove stop words from the two longer text columns
for col in ['is_part_of', 'course_description']:
    courses[col] = courses[col].apply(remove_stopwords)

We also remove punctuation from the text, as this also acts as noise:

In [12]:
# Strip every run of characters that is neither a word character nor whitespace
for col in ['course_name', 'is_part_of', 'college_graduate', 'course_description']:
    courses[col] = courses[col].str.replace(r'[^\w\s]+', '', regex=True)
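
To see what the pattern `[^\w\s]+` does to a single string, here is a minimal sketch using only the standard library. The helper name `strip_punctuation` and the sample sentence are made up for illustration.

```python
import re

def strip_punctuation(text):
    # [^\w\s]+ matches runs of characters that are neither word characters
    # (letters, digits, underscore) nor whitespace; they are deleted, not
    # replaced by a space.
    return re.sub(r"[^\w\s]+", "", text)

print(strip_punctuation("real-life cases; decision-making (ecosystem(s))."))
```

Because matches are deleted rather than replaced with a space, hyphenated words are fused ("real-life" becomes "reallife"), which is also visible in the processed description in this notebook.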

Example of course description after preprocessing:

In [14]:
print(courses.iloc[0]["course_description"])
print("LENGTH: ", len(courses.iloc[0]["course_description"]))

completing course able to identify use different schools thought strategic management based solid understanding assumptions strengths weaknesses critically reflect different theories perspectives relation competitive cooperative strategy compare alternative views select apply combine analytical tools diagnosing addressing strategic issues business network level reallife cases analyse competitive cooperative coopetitive strategies organizations assess impact changes strategies formulate recommendations improvement adequately analyse external internal environment organization derive relevant insights inform strategic decisionmaking identify change modify business model time recognize relevant enabling inhibiting factors map ecosystems alliance networks organizations operate identify potential strategy blindspots apply parallel thinking strategy look like sure effectuated course competitive cooperative strategy competitive strategy concerned making choices create maintain competitive adva
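
In the processed description above, "able to identify" still contains the stop word "to". One plausible explanation, assuming stop words are matched against whitespace-separated tokens, is that the original token "to:" still carried its colon when stop words were removed and only lost it in the later punctuation step. The sketch below illustrates the effect with a small, made-up stop word set (not gensim's list) and hypothetical helper names.

```python
import re

# Hypothetical, tiny stop word set for illustration only.
STOPWORDS = {"you", "should", "be", "to", "and", "the"}

def drop_stopwords(text):
    # Whitespace tokenisation: the token "to:" is not equal to "to".
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)

def drop_punctuation(text):
    return re.sub(r"[^\w\s]+", "", text)

text = "you should be able to: identify and use the tools"

# Order used in this notebook: stop words first, punctuation second.
stop_then_punct = drop_punctuation(drop_stopwords(text))
# Alternative order: punctuation first, stop words second.
punct_then_stop = drop_stopwords(drop_punctuation(text))

print(stop_then_punct)
print(punct_then_stop)
```

With this toy input, the first order keeps "to" while the second removes it. Whether swapping the two steps improves the recommendations was not tested here.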

Sources for this part of the code:
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* https://www.w3schools.com/python/pandas/ref_df_reset_index.asp#:~:text=Definition%20and%20Usage,this%2C%20use%20the%20drop%20parameter.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://towardsdatascience.com/remove-punctuation-pandas-3e461efe9584/
* https://www.geeksforgeeks.org/python-remove-punctuation-from-string/

### Vectorization

Loading the SentenceTransformer model:

In [18]:
model = SentenceTransformer("all-MiniLM-L6-v2")

Combining the columns that hold the main information about each course into a single text column:

In [20]:
courses['combined_text'] = courses['course_name'] + courses['is_part_of'] + courses['college_graduate'] + courses['course_description']
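
Note that `+` on string columns concatenates without a separator, so the last word of one field is fused with the first word of the next. A minimal sketch of an alternative, using made-up values and pandas' `Series.str.cat` with a space separator, is shown below. The outputs in this notebook were produced with the plain `+` concatenation.

```python
import pandas as pd

# Hypothetical two-row frame; only the column names mirror the notebook.
toy = pd.DataFrame({
    "course_name": ["film theories", "numerical mathematics"],
    "is_part_of": ["media studies", "science"],
})

# Plain `+` glues the last word of one column to the first word of the next.
glued = toy["course_name"] + toy["is_part_of"]

# str.cat with sep=" " keeps a word boundary between the columns.
spaced = toy["course_name"].str.cat(toy["is_part_of"], sep=" ")

print(glued[0])
print(spaced[0])
```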

Encoding the combined_text column with the model and checking the shape of the resulting embeddings:

In [22]:
embeddings = model.encode(courses['combined_text'])
print(embeddings.shape)

(3345, 384)


Sources for this part of the code:
* https://sbert.net/docs/quickstart.html#sentence-transformer
* https://peaceful0907.medium.com/sentence-embedding-by-bert-and-sentence-similarity-759f7beccbf1

### Recommender

In [25]:
def recommend_courses(li_course_ids, embeddings, courses_df, top_n=5):
    # Map each course catalogue number to the row index of its embedding
    ids_idx = {}
    for idx, cid in enumerate(courses_df['course_catalogue_number']):
        ids_idx[cid] = idx

    # Look up the row indices of the courses in li_course_ids
    course_idx = []
    for cid in li_course_ids:
        course_idx.append(ids_idx[cid])

    # Select the embeddings of the courses in li_course_ids
    course_emb = embeddings[course_idx]

    # Average them into a single profile vector of shape (1, embedding_dim)
    av_emb = np.mean(course_emb, axis=0).reshape(1, -1)

    # Cosine similarity between the profile and every course, flattened to 1-D
    cosine_sim = cosine_similarity(av_emb, embeddings).flatten()

    # Pair each similarity with its row index and sort from most to least similar
    sim_scores = list(enumerate(cosine_sim))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Exclude the input courses so they are not recommended back
    index_to_id = courses_df['course_catalogue_number'].tolist()
    selected_indices = {index_to_id.index(cid) for cid in li_course_ids if cid in index_to_id}
    sim_scores = [s for s in sim_scores if s[0] not in selected_indices]

    # Take the row indices of the top_n most similar remaining courses
    course_indices = [i[0] for i in sim_scores[:top_n]]

    # Return the name and course catalogue number of each recommended course
    return courses_df.iloc[course_indices][['course_name', 'course_catalogue_number']]
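
The core idea of the function (average the selected embeddings, rank all courses by cosine similarity to that average, skip the selected ones) can be checked by hand on a toy example. The sketch below uses made-up 2-D vectors and plain NumPy instead of scikit-learn; the real embeddings have 384 dimensions.

```python
import numpy as np

# Hypothetical 2-D "embeddings" for four courses.
emb = np.array([
    [1.0, 0.0],   # index 0: selected
    [0.0, 1.0],   # index 1: selected
    [1.0, 1.0],   # index 2
    [-1.0, 0.0],  # index 3
])
selected = [0, 1]

# Profile vector: mean of the selected courses' embeddings.
profile = emb[selected].mean(axis=0)

# Cosine similarity between the profile and every course.
sims = emb @ profile / (np.linalg.norm(emb, axis=1) * np.linalg.norm(profile))

# Rank all courses by similarity, then drop the ones already selected.
ranked = [int(i) for i in np.argsort(-sims) if i not in selected]
print(ranked)
```

Course 2 points in the same direction as the averaged profile, so it ranks first; course 3 points away from it and ranks last.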

Sources for this part of the code:
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
* https://numpy.org/doc/stable/reference/generated/numpy.matrix.flatten.html
* https://ioflood.com/blog/dataframe-to-list-pandas/#:~:text=You%20can%20use%20the%20toList,tolist()%20.&text=In%20the%20example%20above%2C%20we,1%2C%202%2C%203%5D.
* https://www.programiz.com/python-programming/methods/list/index

### Testing

Opening the test dataset:

In [29]:
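import ast

majors = []
tracks = []
li_course_ids = []
with open("datasets/test_set.txt", 'r') as f:
    for line in f:
        if line[0] == '[':
            # literal_eval parses the list literal without executing arbitrary code
            li_course_ids.append(ast.literal_eval(line))
        elif line.isupper():
            majors.append(line)
        elif line.islower():
            tracks.append(line)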
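
The parsing logic can be tried without the file. The sketch below assumes the layout implied by the loop above (an upper-case major, a lower-case track, then a Python-style list of course IDs) and uses an in-memory sample; the sample content is illustrative and is not the actual test set. It also strips each line, which the cell above does not do.

```python
import ast

# Illustrative sample in the layout the loop above expects.
sample = """SCIENCE
physics
['900334SCIY', '900228SCIY']
"""

majors, tracks, course_lists = [], [], []
for line in sample.splitlines():
    line = line.strip()
    if not line:
        continue
    if line.startswith("["):
        # literal_eval only accepts Python literals, so it cannot run arbitrary code.
        course_lists.append(ast.literal_eval(line))
    elif line.isupper():
        majors.append(line)
    elif line.islower():
        tracks.append(line)

print(majors, tracks, course_lists)
```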

Using the recommender function on the test dataset to test its performance:

In [31]:
print(majors[0])
print(tracks[0])
recommend_courses(li_course_ids[0], embeddings, courses)

SCIENCE

math/information



| | course_name | course_catalogue_number |
|---|---|---|
| 2452 | programming in psychological science | 7205RM39XY |
| 775 | current topics psychology and ai | 7203BM45XY |
| 118 | advanced research methods and statistics | 900323ACCY |
| 2542 | rationality cognition and reasoning | 187413086Y |
| 1553 | introduction to digital methods programming | 113221416Y |


In [32]:
print(tracks[1])
recommend_courses(li_course_ids[1], embeddings, courses)

biomed



| | course_name | course_catalogue_number |
|---|---|---|
| 394 | biomedical systems biology | 5234BISB6Y |
| 2161 | neurosystems | 5052NEU12Y |
| 1561 | introduction to health and wellbeing | 900106SCIY |
| 2160 | neuroscience from cell to behaviour | 5244NCTB5Y |
| 1369 | hormones and homeostasis | 900262SCIY |


In [33]:
print(tracks[2])
recommend_courses(li_course_ids[2], embeddings, courses)

physics



| | course_name | course_catalogue_number |
|---|---|---|
| 594 | condensed matter theory 1 | 53541CMT3Y |
| 1991 | mathematics of physics | 900334SCIY |
| 2178 | numerical mathematics | 900228SCIY |
| 893 | discrete mathematics and algebra | 900397SCIY |
| 72 | advanced computational condensed matter | 5354ACCM3Y |


In [34]:
print(tracks[3])
recommend_courses(li_course_ids[3], embeddings, courses)

bio/environment



| | course_name | course_catalogue_number |
|---|---|---|
| 1555 | introduction to environmental sciences | 900181SCIY |
| 2960 | system earth | 900283SCIY |
| 766 | current topics in biology | 5224CTIB6Y |
| 1554 | introduction to environmental humanities | 129221066Y |
| 1378 | human environment interactions | 5264HUEI6Y |


In [35]:
print(tracks[4])
recommend_courses(li_course_ids[4], embeddings, courses)

information/neuro



| | course_name | course_catalogue_number |
|---|---|---|
| 540 | cognitive neurobiology | 5052CONE6Y |
| 2161 | neurosystems | 5052NEU12Y |
| 1240 | fundamentals of neuroscience | 5053FUN12Y |
| 261 | artificial cognition pattern recognition | 900102SSCY |
| 1107 | experimental neurobiology | 5244EXNE5Y |


In [36]:
print(majors[1])
print(tracks[5])
recommend_courses(li_course_ids[5], embeddings, courses)

SOCIAL SCIENCE

economics



| | course_name | course_catalogue_number |
|---|---|---|
| 512 | climate and environmental conflicts in the | 73230268LY |
| 1378 | human environment interactions | 5264HUEI6Y |
| 2959 | system change how to navigate complex societal | 3803SYNSKY |
| 516 | climate change economics | 6414M0504Y |
| 1000 | environment international sustainable develop... | 73433E509Y |


In [37]:
print(tracks[6])
recommend_courses(li_course_ids[6], embeddings, courses)

law, ir



| | course_name | course_catalogue_number |
|---|---|---|
| 2489 | public international law | 3802PUQPVY |
| 2410 | principles and foundations of international law | 3554PRFIVY |
| 1472 | international law and contemporary challenges | 3554ILCCVY |
| 2357 | politics and practices of international law | 7324F101IY |
| 3057 | the politics of international law | 7324P261ZY |


In [38]:
print(tracks[7])
recommend_courses(li_course_ids[7], embeddings, courses)

psychology/economics



| | course_name | course_catalogue_number |
|---|---|---|
| 2110 | motivation and cognition | 3802MOQPVY |
| 775 | current topics psychology and ai | 7203BM45XY |
| 1241 | fundamentals of psychology | 3802FUQPVY |
| 21 | a critical look on psychologys past and future | 3803CLPFVY |
| 2477 | psychological toolkit understanding social | 7201721PXY |


In [39]:
print(tracks[8])
recommend_courses(li_course_ids[8], embeddings, courses)

political science/law



| | course_name | course_catalogue_number |
|---|---|---|
| 1685 | legal and social philosophy | 900356SSCY |
| 1201 | freedom dreams social justice and struggles for | 113221576Y |
| 1361 | history of political thought | 73210026FY |
| 2489 | public international law | 3802PUQPVY |
| 1584 | introduction to political science | 7321E020FY |


In [40]:
print(majors[2])
print(tracks[9])
recommend_courses(li_course_ids[9], embeddings, courses)

HUMANITIES

history/philosophy



| | course_name | course_catalogue_number |
|---|---|---|
| 706 | creating objects defining methods philosophy | 189421036Y |
| 1355 | history and philosophy of the humanities | 187421516Y |
| 2285 | philosophy of the humanities lca and english | 109221056Y |
| 2272 | philosophy of science | 900274HUMY |
| 259 | art science and technology | 129216826Y |


In [41]:
print(tracks[10])
recommend_courses(li_course_ids[10], embeddings, courses)

cultural analysis



| | course_name | course_catalogue_number |
|---|---|---|
| 3239 | twentiethcentury theory and its afterlives | 129221116Y |
| 1569 | introduction to literary and cultural analysis | 129111072Y |
| 591 | concepts for reading contemporary cultures | 129121042Y |
| 259 | art science and technology | 129216826Y |
| 2285 | philosophy of the humanities lca and english | 109221056Y |


In [42]:
print(tracks[11])
recommend_courses(li_course_ids[11], embeddings, courses)

media/film



| | course_name | course_catalogue_number |
|---|---|---|
| 1146 | film theories | 159421036Y |
| 1138 | film analysis | 119221012Y |
| 491 | cinema histories and cultures | 159410226Y |
| 132 | advanced topics in media and culture film | 119221062Y |
| 259 | art science and technology | 129216826Y |


In [43]:
print(tracks[12])
recommend_courses(li_course_ids[12], embeddings, courses)

art history/history



| | course_name | course_catalogue_number |
|---|---|---|
| 2990 | the art market and culture industry | 900341HUMY |
| 2072 | modern art globally oriented ii | 109221276Y |
| 259 | art science and technology | 129216826Y |
| 681 | core module 3 contemporary concepts and | 150511062Y |
| 1351 | historicism anachronism memory how not to | 129121036Y |


Sources for this part of the code:
* https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/