# AUC Course Recommender
## Description
This notebook contains the source code for the Amsterdam University College (AUC) Course Recommender, which is part of a project for the Text Mining course at AUC.

## Code
### Imports:

In [1]:
#pip install sentence_transformers

In [2]:
import pandas as pd
import numpy as np
import re
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.parsing.preprocessing import remove_stopwords
from sentence_transformers import SentenceTransformer


### Loading and Preprocessing the Data

Loading the data in as a pandas dataframe:

In [3]:
data = pd.read_csv("datasets/recommender_dataset.csv")
print(data.shape)

(3812, 8)


Next we drop the rows that are missing a value in any of the course_catalogue_number, is_part_of, language_of_instruction, or course_description columns. These fields are empty because the course scraper did not scrape information from courses whose websites were not written in English, so after dropping these rows all courses in the dataset are taught in English.

In [4]:
courses = data.dropna(subset=['course_catalogue_number', 'is_part_of', 'language_of_instruction', 'course_description'])
print(courses.shape)

(3345, 8)


Although we use SentenceBERT and TF-IDF as our vectorisers, both of which apply their own tokenisation to the text, we found better results when we removed stopwords and punctuation ourselves.

Example of course description before preprocessing:

In [5]:
print(courses.iloc[0]["course_description"])
print("LENGTH:" ,len(courses.iloc[0]["course_description"]))

Upon completing this course, you should be able to: identify and use different schools of thought in strategic management based on a solid understanding of their assumptions, strengths and weaknesses; critically reflect on different theories and perspectives in relation to competitive and cooperative strategy and compare them with alternative views; select, apply and combine analytical tools in diagnosing or addressing strategic issues at the business and network level in real-life cases; analyse the competitive, cooperative and coopetitive strategies of organizations, assess the impact of changes in these strategies, and formulate recommendations for improvement; adequately analyse the external and internal environment of an organization to derive relevant insights that can inform strategic decision-making; identify when and how to change or modify a business model over time and recognize relevant enabling and inhibiting factors; map the ecosystem(s) and alliance networks in which org

Now we lowercase the text columns and remove stop words from the is_part_of and course_description columns, as stop words act as noise that adds little discriminative value when computing similarity.

In [6]:
courses = courses.reset_index(drop=True)

# Lowercase every text column used for matching
for col in ['course_name', 'college_graduate', 'language_of_instruction', 'is_part_of', 'course_description']:
    courses[col] = courses[col].str.lower()

# Remove stopwords from the two longer text columns
for col in ['is_part_of', 'course_description']:
    courses[col] = courses[col].apply(remove_stopwords)

We also remove punctuation from the text, as this also acts as noise:

In [7]:
# Strip everything that is not a word character or whitespace
for col in ['course_name', 'is_part_of', 'college_graduate', 'course_description']:
    courses[col] = courses[col].str.replace(r'[^\w\s]+', '', regex=True)
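
As a rough illustration of what the two preprocessing cells do to a single string, here is a minimal standard-library sketch. The tiny `STOPWORDS` set and the `preprocess` helper are hypothetical stand-ins; the notebook itself relies on gensim's much larger stopword list, so real output will differ.

```python
import re

# Hypothetical, deliberately tiny stopword set (gensim's list is far larger).
STOPWORDS = {"the", "of", "and", "to", "in", "a", "this", "you", "be"}

def preprocess(text):
    """Lowercase, drop stopwords, then strip punctuation (same order as the cells above)."""
    lowered = text.lower()
    # Stopwords are matched on whitespace-separated tokens, punctuation still attached.
    kept = " ".join(w for w in lowered.split() if w not in STOPWORDS)
    # Same pattern as the notebook: remove anything that is not a word character or whitespace.
    return re.sub(r"[^\w\s]+", "", kept)

sample = "Upon completing this course, you should be able to: analyse real-life cases."
print(preprocess(sample))
# upon completing course should able to analyse reallife cases
```

Note that `to:` survives the stopword step because the colon is still attached when tokens are compared, which is consistent with the "able to identify" that remains in the preprocessed description shown in the next cell.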

Example of course description after preprocessing:

In [8]:
print(courses.iloc[0]["course_description"])
print("LENGTH: ", len(courses.iloc[0]["course_description"]))

completing course able to identify use different schools thought strategic management based solid understanding assumptions strengths weaknesses critically reflect different theories perspectives relation competitive cooperative strategy compare alternative views select apply combine analytical tools diagnosing addressing strategic issues business network level reallife cases analyse competitive cooperative coopetitive strategies organizations assess impact changes strategies formulate recommendations improvement adequately analyse external internal environment organization derive relevant insights inform strategic decisionmaking identify change modify business model time recognize relevant enabling inhibiting factors map ecosystems alliance networks organizations operate identify potential strategy blindspots apply parallel thinking strategy look like sure effectuated course competitive cooperative strategy competitive strategy concerned making choices create maintain competitive adva

Combining the columns that hold the main information about each course into a single text column:

In [9]:
courses['combined_text'] = courses['course_name'] + courses['is_part_of'] + courses['college_graduate'] + courses['course_description']
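
One thing worth checking here: plain `+` joins the columns with no separator, so the last word of one column and the first word of the next fuse into a single token. A minimal sketch with made-up values (not taken from the dataset) shows the effect next to a space-separated alternative. Whether the alternative would change the recommendations has not been tested here.

```python
# Made-up example values, for illustration only.
course_name = "strategic management"
is_part_of = "business administration"

fused = course_name + is_part_of              # what plain '+' produces
spaced = " ".join([course_name, is_part_of])  # keeps word boundaries

print(fused.split())   # ['strategic', 'managementbusiness', 'administration']
print(spaced.split())  # ['strategic', 'management', 'business', 'administration']
```

In pandas the same idea would be to add `' '` between the columns, for example `courses['course_name'] + ' ' + courses['is_part_of']`.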

Sources for this part of the code:
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* https://www.w3schools.com/python/pandas/ref_df_reset_index.asp#:~:text=Definition%20and%20Usage,this%2C%20use%20the%20drop%20parameter.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://towardsdatascience.com/remove-punctuation-pandas-3e461efe9584/
* https://www.geeksforgeeks.org/python-remove-punctuation-from-string/

## Base Model: TF-IDF Vectorizer

In [10]:
vectorizer = TfidfVectorizer()

In [11]:
tfidf_embeddings = vectorizer.fit_transform(courses["combined_text"])
tfidf_embeddings = tfidf_embeddings.toarray()
print(tfidf_embeddings.shape)

(3345, 30273)


Next we create a function that recommends courses based on the input course IDs, the embeddings, and the course dataframe itself. The top_n parameter sets how many courses to recommend.

In [12]:
def recommend_courses(li_course_ids, embeddings, courses_df, top_n=5):
    # Map each course catalogue number to the row index of its embedding
    ids_idx = {}
    for idx, cid in enumerate(courses_df['course_catalogue_number']):
        ids_idx[cid] = idx

    # Row indices of the courses in li_course_ids
    course_idx = []
    for cid in li_course_ids:
        course_idx.append(ids_idx[cid])

    # Embeddings of the courses in li_course_ids
    course_emb = embeddings[course_idx]

    # Average them into a single profile vector of shape (1, n_features)
    av_emb = np.mean(course_emb, axis=0).reshape(1, -1)

    # Cosine similarity between the profile and every course, flattened to one dimension
    cosine_sim = cosine_similarity(av_emb, embeddings).flatten()

    # Pair each similarity with its row index and sort from most to least similar
    sim_scores = list(enumerate(cosine_sim))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Remove the input courses themselves from the ranking
    index_to_id = courses_df['course_catalogue_number'].tolist()

    selected_indices = [index_to_id.index(cid) for cid in li_course_ids if cid in index_to_id]

    sim_scores = [s for s in sim_scores if s[0] not in selected_indices]

    # Row indices of the top_n most similar remaining courses
    course_indices = [i[0] for i in sim_scores[:top_n]]

    # Return the name and catalogue number of each recommended course
    return courses_df.iloc[course_indices][['course_name', 'course_catalogue_number']]
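
The core of the function is: average the embeddings of the selected courses, score every course by cosine similarity to that average, and skip the courses that were given as input. A minimal NumPy sketch on made-up 2-D vectors may make this easier to follow. The `toy_embeddings` values and the `rank_by_mean_similarity` helper are hypothetical and not part of the notebook.

```python
import numpy as np

def rank_by_mean_similarity(selected_rows, embeddings, top_n=2):
    """Return row indices of the top_n most similar courses, excluding selected_rows."""
    mean_vec = embeddings[selected_rows].mean(axis=0)
    # Cosine similarity between the mean vector and every row.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(mean_vec)
    sims = embeddings @ mean_vec / norms
    order = np.argsort(-sims)  # highest similarity first
    selected = set(selected_rows)
    return [int(i) for i in order if int(i) not in selected][:top_n]

# Made-up embeddings: rows 0-2 point roughly along x, rows 3-4 roughly along y.
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.1, 0.9],
])

print(rank_by_mean_similarity([0, 1], toy_embeddings))  # [2, 4]
```

Rows 0 and 1 are the "taken" courses, so the closest remaining row (2) comes first, followed by the less similar row 4.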

Sources for this part of the code:
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
* https://numpy.org/doc/stable/reference/generated/numpy.matrix.flatten.html
* https://ioflood.com/blog/dataframe-to-list-pandas/#:~:text=You%20can%20use%20the%20toList,tolist()%20.&text=In%20the%20example%20above%2C%20we,1%2C%202%2C%203%5D.
* https://www.programiz.com/python-programming/methods/list/index

### Testing

Opening the test dataset:

In [13]:
import ast

majors = []
tracks = []
li_course_ids = []
with open("datasets/test_set.txt", 'r') as f:
    for l in f:
        if l[0] == '[':
            # Parse the list literal without executing arbitrary code
            li_course_ids.append(ast.literal_eval(l))
        elif l.isupper():
            majors.append(l)
        elif l.islower():
            tracks.append(l)
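
The loop above implies a file layout in which upper-case lines name a major, lower-case lines name a track, and lines starting with `[` hold a Python-style list of course catalogue numbers. The sketch below parses an in-memory sample in that assumed layout, reading each list line with `ast.literal_eval`, which only accepts Python literals. The sample contents are illustrative: the catalogue numbers are borrowed from elsewhere in this notebook purely as realistic-looking IDs, and `course_lists` is a hypothetical name.

```python
import ast
import io

# Illustrative sample in the assumed layout of datasets/test_set.txt.
sample = io.StringIO(
    "SCIENCE\n"
    "physics\n"
    "['900334SCIY', '900228SCIY']\n"
)

majors, tracks, course_lists = [], [], []
for line in sample:
    line = line.strip()  # drop the trailing newline
    if line.startswith("["):
        course_lists.append(ast.literal_eval(line))  # literals only, no code execution
    elif line.isupper():
        majors.append(line)
    elif line.islower():
        tracks.append(line)

print(majors, tracks, course_lists)
```

Stripping each line also avoids the extra blank line that follows every printed major and track in the outputs below.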

Using the recommender function on the test dataset to evaluate performance by eye:

In [14]:
print(majors[0])
print(tracks[0])
recommend_courses(li_course_ids[0], tfidf_embeddings, courses)

SCIENCE

math/information



index,course_name,course_catalogue_number
1241,fundamentals of psychology,3802FUQPVY
1613,introductory psychology and brain cognition,7201702PXY
2272,philosophy of science,900274HUMY
2739,scientific programming 1,50621SCP3Y
2029,methods for social sciences research,900102ACCY


In [15]:
print(tracks[1])
recommend_courses(li_course_ids[1], tfidf_embeddings, courses)

biomed



index,course_name,course_catalogue_number
1561,introduction to health and wellbeing,900106SCIY
1298,global mental health,739400140Y
1613,introductory psychology and brain cognition,7201702PXY
2160,neuroscience from cell to behaviour,5244NCTB5Y
1338,health econometrics empirical research,6414M0405Y


In [16]:
print(tracks[2])
recommend_courses(li_course_ids[2], tfidf_embeddings, courses)

physics



index,course_name,course_catalogue_number
2308,physics of sustainable energy,5092PHSE6Y
2882,statistical theory of complex molecular systems,5254STTC6Y
1984,mathematics 1 calculus,6011P0166Y
448,case studies in energy climate and sustainability,900319SCIY
1982,mathematics,3802M1QPVY


In [17]:
print(tracks[3])
recommend_courses(li_course_ids[3], tfidf_embeddings, courses)

bio/environment



index,course_name,course_catalogue_number
448,case studies in energy climate and sustainability,900319SCIY
2960,system earth,900283SCIY
513,climate change,5264CLCH6Y
859,developmental biology,5224DEBI6Y
3255,urban anthropology lab,900381SSCY


In [18]:
print(tracks[4])
recommend_courses(li_course_ids[4], tfidf_embeddings, courses)

information/neuro



index,course_name,course_catalogue_number
2160,neuroscience from cell to behaviour,5244NCTB5Y
1376,human body anatomy and physiology ii,900261SCIY
1613,introductory psychology and brain cognition,7201702PXY
1240,fundamentals of neuroscience,5053FUN12Y
3032,the integrated brain,5102THIB6Y


In [19]:
print(majors[1])
print(tracks[5])
recommend_courses(li_course_ids[5], tfidf_embeddings, courses)

SOCIAL SCIENCE

economics



index,course_name,course_catalogue_number
448,case studies in energy climate and sustainability,900319SCIY
1010,environmental economics and policies,6414M0503Y
1007,environmental econometrics empirical research,6414M0404Y
118,advanced research methods and statistics,900323ACCY
2890,statistics for sciences,900128ACCY


In [20]:
print(tracks[6])
recommend_courses(li_course_ids[6], tfidf_embeddings, courses)

law, ir



index,course_name,course_catalogue_number
1475,international law of military operations,3854INQ8VY
2357,politics and practices of international law,7324F101IY
2489,public international law,3802PUQPVY
3057,the politics of international law,7324P261ZY
2410,principles and foundations of international law,3554PRFIVY


In [21]:
print(tracks[7])
recommend_courses(li_course_ids[7], tfidf_embeddings, courses)

psychology/economics



index,course_name,course_catalogue_number
1241,fundamentals of psychology,3802FUQPVY
1613,introductory psychology and brain cognition,7201702PXY
1589,introduction to psychology minor programme,7201718PXY
359,behavioural economics,900325SSCY
2152,neuroeconomics,6414M0167Y


In [22]:
print(tracks[8])
recommend_courses(li_course_ids[8], tfidf_embeddings, courses)

political science/law



index,course_name,course_catalogue_number
1384,human rights in private law,3254HRP6KY
1360,history of legal theory,3802HIQPVY
1466,international human rights law,3854HUQ4KY
1435,integrative seminar i human rights,3801I1QPVY
1070,european human rights law,3554H1Q4KY


In [23]:
print(majors[2])
print(tracks[9])
recommend_courses(li_course_ids[9], tfidf_embeddings, courses)

HUMANITIES

history/philosophy



index,course_name,course_catalogue_number
1582,introduction to philosophy ii,900108HUMY
2029,methods for social sciences research,900102ACCY
706,creating objects defining methods philosophy,189421036Y
2272,philosophy of science,900274HUMY
3043,the netherlands in the seventeenth century,112221346Y


In [24]:
print(tracks[10])
recommend_courses(li_course_ids[10], tfidf_embeddings, courses)

cultural analysis



index,course_name,course_catalogue_number
1582,introduction to philosophy ii,900108HUMY
1643,key debates in gender sexuality,7334E003FY
1259,introduction to gender and sexuality studies,7302A4001Y
1256,gender in modern europe realities and,111211226Y
2537,race class and gender intersectionality,900374SSCY


In [25]:
print(tracks[11])
recommend_courses(li_course_ids[11], tfidf_embeddings, courses)

media/film



index,course_name,course_catalogue_number
1146,film theories,159421036Y
1138,film analysis,119221012Y
3144,this is film film heritage in practice,158421012Y
1144,film research seminar i,15941A006Y
1145,film research seminar ii,15941A016Y


In [26]:
print(tracks[12])
recommend_courses(li_course_ids[12], tfidf_embeddings, courses)

art history/history



index,course_name,course_catalogue_number
2990,the art market and culture industry,900341HUMY
27,a material history of art,146410726Y
256,art and thought in the dutch republic,151621086Y
2072,modern art globally oriented ii,109221276Y
145,aesthetics of decolonization modern art and,114221496Y


## Alternate Model: SentenceBERT Transformer

Loading in the SentenceTransformer model.

In [27]:
pip install hf_xet



In [28]:
model = SentenceTransformer("all-MiniLM-L6-v2")

Encoding the combined_text column with the model and checking the shape of the result:

In [29]:
sentbert_embeddings = model.encode(courses['combined_text'])
print(sentbert_embeddings.shape)

(3345, 384)


Sources for this part of the code:
* https://sbert.net/docs/quickstart.html#sentence-transformer
* https://peaceful0907.medium.com/sentence-embedding-by-bert-and-sentence-similarity-759f7beccbf1

### Testing

Using the recommender function on the test dataset to test its performance:

In [30]:
print(majors[0])
print(tracks[0])
recommend_courses(li_course_ids[0], sentbert_embeddings, courses)

SCIENCE

math/information



index,course_name,course_catalogue_number
2452,programming in psychological science,7205RM39XY
775,current topics psychology and ai,7203BM45XY
118,advanced research methods and statistics,900323ACCY
2542,rationality cognition and reasoning,187413086Y
1553,introduction to digital methods programming,113221416Y


In [31]:
print(tracks[1])
recommend_courses(li_course_ids[1], sentbert_embeddings, courses)

biomed



index,course_name,course_catalogue_number
394,biomedical systems biology,5234BISB6Y
2161,neurosystems,5052NEU12Y
1561,introduction to health and wellbeing,900106SCIY
2160,neuroscience from cell to behaviour,5244NCTB5Y
1369,hormones and homeostasis,900262SCIY


In [32]:
print(tracks[2])
recommend_courses(li_course_ids[2], sentbert_embeddings, courses)

physics



index,course_name,course_catalogue_number
594,condensed matter theory 1,53541CMT3Y
1991,mathematics of physics,900334SCIY
2178,numerical mathematics,900228SCIY
893,discrete mathematics and algebra,900397SCIY
72,advanced computational condensed matter,5354ACCM3Y


In [33]:
print(tracks[3])
recommend_courses(li_course_ids[3], sentbert_embeddings, courses)

bio/environment



index,course_name,course_catalogue_number
1555,introduction to environmental sciences,900181SCIY
2960,system earth,900283SCIY
766,current topics in biology,5224CTIB6Y
1554,introduction to environmental humanities,129221066Y
1378,human environment interactions,5264HUEI6Y


In [34]:
print(tracks[4])
recommend_courses(li_course_ids[4], sentbert_embeddings, courses)

information/neuro



index,course_name,course_catalogue_number
540,cognitive neurobiology,5052CONE6Y
2161,neurosystems,5052NEU12Y
1240,fundamentals of neuroscience,5053FUN12Y
261,artificial cognition pattern recognition,900102SSCY
1107,experimental neurobiology,5244EXNE5Y


In [35]:
print(majors[1])
print(tracks[5])
recommend_courses(li_course_ids[5], sentbert_embeddings, courses)

SOCIAL SCIENCE

economics



index,course_name,course_catalogue_number
512,climate and environmental conflicts in the,73230268LY
1378,human environment interactions,5264HUEI6Y
2959,system change how to navigate complex societal,3803SYNSKY
516,climate change economics,6414M0504Y
1000,environment international sustainable develop...,73433E509Y


In [36]:
print(tracks[6])
recommend_courses(li_course_ids[6], sentbert_embeddings, courses)

law, ir



index,course_name,course_catalogue_number
2489,public international law,3802PUQPVY
2410,principles and foundations of international law,3554PRFIVY
1472,international law and contemporary challenges,3554ILCCVY
2357,politics and practices of international law,7324F101IY
3057,the politics of international law,7324P261ZY


In [37]:
print(tracks[7])
recommend_courses(li_course_ids[7], sentbert_embeddings, courses)

psychology/economics



index,course_name,course_catalogue_number
2110,motivation and cognition,3802MOQPVY
775,current topics psychology and ai,7203BM45XY
1241,fundamentals of psychology,3802FUQPVY
21,a critical look on psychologys past and future,3803CLPFVY
2477,psychological toolkit understanding social,7201721PXY


In [38]:
print(tracks[8])
recommend_courses(li_course_ids[8], sentbert_embeddings, courses)

political science/law



index,course_name,course_catalogue_number
1685,legal and social philosophy,900356SSCY
1201,freedom dreams social justice and struggles for,113221576Y
1361,history of political thought,73210026FY
2489,public international law,3802PUQPVY
1584,introduction to political science,7321E020FY


In [39]:
print(majors[2])
print(tracks[9])
recommend_courses(li_course_ids[9], sentbert_embeddings, courses)

HUMANITIES

history/philosophy



index,course_name,course_catalogue_number
706,creating objects defining methods philosophy,189421036Y
1355,history and philosophy of the humanities,187421516Y
2285,philosophy of the humanities lca and english,109221056Y
2272,philosophy of science,900274HUMY
259,art science and technology,129216826Y


In [40]:
print(tracks[10])
recommend_courses(li_course_ids[10], sentbert_embeddings, courses)

cultural analysis



index,course_name,course_catalogue_number
3239,twentiethcentury theory and its afterlives,129221116Y
1569,introduction to literary and cultural analysis,129111072Y
591,concepts for reading contemporary cultures,129121042Y
259,art science and technology,129216826Y
2285,philosophy of the humanities lca and english,109221056Y


In [41]:
print(tracks[11])
recommend_courses(li_course_ids[11], sentbert_embeddings, courses)

media/film



index,course_name,course_catalogue_number
1146,film theories,159421036Y
1138,film analysis,119221012Y
491,cinema histories and cultures,159410226Y
132,advanced topics in media and culture film,119221062Y
259,art science and technology,129216826Y


In [42]:
print(tracks[12])
recommend_courses(li_course_ids[12], sentbert_embeddings, courses)

art history/history



index,course_name,course_catalogue_number
2990,the art market and culture industry,900341HUMY
2072,modern art globally oriented ii,109221276Y
259,art science and technology,129216826Y
681,core module 3 contemporary concepts and,150511062Y
1351,historicism anachronism memory how not to,129121036Y


Sources for this part of the code:
* https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/

## Evaluation

**Normalised Discounted Cumulative Gain Metric**

Takes as input the manually assigned relevance scores of the recommended courses, in recommendation order, and computes the NDCG score.

In [43]:
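from sklearn.metrics import ndcg_score

# Predicted scores that encode the recommendation order: the first recommended
# course gets the highest score (5), the last the lowest (1)
y_score = np.array([[5,4,3,2,1]])

def compute_ndcg(ground_truth):
    # ground_truth: manually assigned relevance of each recommended course, in recommendation order
    y_true = np.array([ground_truth])
    return ndcg_score(y_true, y_score)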
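
To make the metric concrete, the sketch below recomputes NDCG by hand for relevance scores listed in recommendation order, using linear gains and a log2 position discount. This appears to match the defaults of `ndcg_score`, since the printed values agree with two of the outputs further down in this notebook. The `manual_ndcg` helper is hypothetical and only meant as a cross-check; it assumes at least one non-zero relevance score.

```python
import math

def manual_ndcg(relevance_in_rank_order):
    """NDCG for relevance scores listed in the order the courses were recommended."""
    def dcg(scores):
        # Position 1 is divided by log2(2) = 1, position 2 by log2(3), and so on.
        return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(scores, start=1))

    ideal = sorted(relevance_in_rank_order, reverse=True)  # best possible ordering
    return dcg(relevance_in_rank_order) / dcg(ideal)

print(f"{manual_ndcg([5, 4, 3, 4, 0]):.3f}")  # 0.994
print(f"{manual_ndcg([5, 5, 5, 5, 0]):.3f}")  # 1.000
```

The second example shows a property worth keeping in mind when reading the scores: an irrelevant course in the last position does not lower NDCG, because the list is still in its ideal order.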

Comparing the NDCG scores of the two models:

In [44]:
print(majors[0])
print(tracks[0])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([0,0,1,4,1]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([1,0,0,0,3]):.3f}")

SCIENCE

math/information

Base Model: TF-IDF Vectorizer: 0.509 

Alternate Model: SentenceBERT Transformer: 0.595


In [45]:
print(tracks[1])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([5,4,3,4,0]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([5,4,4,4,5]):.3f}")

biomed

Base Model: TF-IDF Vectorizer: 0.994 

Alternate Model: SentenceBERT Transformer: 0.982


In [46]:
print(tracks[2])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([5,4,3,0,2]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([4,5,3,3,3]):.3f}")

physics

Base Model: TF-IDF Vectorizer: 0.991 

Alternate Model: SentenceBERT Transformer: 0.968


In [47]:
print(tracks[3])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([2,5,1,0,0]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([3,5,4,1,2]):.3f}")

bio/environment

Base Model: TF-IDF Vectorizer: 0.836 

Alternate Model: SentenceBERT Transformer: 0.911


In [48]:
print(tracks[4])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([2,3,1,3,3]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([3,4,4,5,4]):.3f}")

information/neuro

Base Model: TF-IDF Vectorizer: 0.896 

Alternate Model: SentenceBERT Transformer: 0.905


In [49]:
print(majors[1])
print(tracks[5])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([1,5,1,3,0]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([2,2,4,5,3]):.3f}")

SOCIAL SCIENCE

economics

Base Model: TF-IDF Vectorizer: 0.760 

Alternate Model: SentenceBERT Transformer: 0.805


In [50]:
print(tracks[6])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([3,5,5,5,5]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([5,5,5,5,5,]):.3f}")

law, ir

Base Model: TF-IDF Vectorizer: 0.912 

Alternate Model: SentenceBERT Transformer: 1.000


In [51]:
print(tracks[7])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([2,2,1,3,3]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([2,1,3,2,4]):.3f}")

psychology/economics

Base Model: TF-IDF Vectorizer: 0.870 

Alternate Model: SentenceBERT Transformer: 0.803


In [52]:
print(tracks[8])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([4,2,4,2,4]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([3,5,3,3,4]):.3f}")

political science/law

Base Model: TF-IDF Vectorizer: 0.952 

Alternate Model: SentenceBERT Transformer: 0.914


In [53]:
print(majors[2])
print(tracks[9])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([2,0,3,1,3]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([3,5,5,2,1]):.3f}")

HUMANITIES

history/philosophy

Base Model: TF-IDF Vectorizer: 0.805 

Alternate Model: SentenceBERT Transformer: 0.908


In [54]:
print(tracks[10])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([0,3,3,4,4]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([2,5,5,2,2]):.3f}")

cultural analysis

Base Model: TF-IDF Vectorizer: 0.715 

Alternate Model: SentenceBERT Transformer: 0.861


In [55]:
print(tracks[11])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([5,5,5,5,5]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([5,5,5,5,0]):.3f}")

media/film

Base Model: TF-IDF Vectorizer: 1.000 

Alternate Model: SentenceBERT Transformer: 1.000


In [56]:
print(tracks[12])
print(f"Base Model: TF-IDF Vectorizer: {compute_ndcg([2,4,4,2,4]):.3f} \n\nAlternate Model: SentenceBERT Transformer: {compute_ndcg([2,2,0,4,0]):.3f}")

art history/history

Base Model: TF-IDF Vectorizer: 0.879 

Alternate Model: SentenceBERT Transformer: 0.796


**Diversity Metric**

Takes as input the embeddings of the recommended courses and computes the intra-list diversity (ILD) of these courses.

In [71]:
def compute_ild(recommended_course_embeddings):
    # Compute cosine similarity between each pair of embeddings
    similarity_matrix = cosine_similarity(recommended_course_embeddings)
    n = len(recommended_course_embeddings)

    # Indices of the upper triangle of the matrix, excluding the diagonal, so that
    # no course is compared with itself and no pair of courses is counted twice
    upper_tri_indices = np.triu_indices(n, k=1)
    similarities = similarity_matrix[upper_tri_indices]

    # ILD is the mean pairwise dissimilarity (1 - cosine similarity)
    dissimilarities = 1 - similarities
    ild = dissimilarities.mean()
    return ild
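
As a sanity check on the definition (one minus the mean pairwise cosine similarity), the sketch below reimplements the same calculation with NumPy only and runs it on made-up vectors: mutually orthogonal vectors should give an ILD of 1, and identical vectors an ILD of 0. The `toy_ild` name and the input arrays are hypothetical.

```python
import numpy as np

def toy_ild(vectors):
    """Mean of (1 - cosine similarity) over all unordered pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = unit @ unit.T
    pairs = np.triu_indices(len(vectors), k=1)  # each pair once, no diagonal
    return float((1 - similarity[pairs]).mean())

orthogonal = np.eye(3)       # every pair has cosine similarity 0
identical = np.ones((3, 4))  # every pair has cosine similarity 1

print(round(toy_ild(orthogonal), 3))  # 1.0
print(round(toy_ild(identical), 3))   # 0.0
```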

Next we evaluate the recommended courses based on intra-list diversity.

In [85]:
print(majors[0])
print(tracks[0])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[1241, 1613, 2272, 2739, 2029]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[2452, 775, 118, 2542, 1553]]):.3f}")

SCIENCE

math/information

Base Model: TF-IDF Vectorizer: 0.882

Alternate Model: SentenceBERT Transformer: 0.334


In [86]:
print(tracks[1])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[1561, 1298, 1613, 2160, 1338]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[394, 2161, 1561, 2160, 1369]]):.3f}")

biomed

Base Model: TF-IDF Vectorizer: 0.863

Alternate Model: SentenceBERT Transformer: 0.353


In [87]:
print(tracks[2])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[2308, 2882, 1984, 448, 1982]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[594, 1991, 2178, 893, 72]]):.3f}")

physics

Base Model: TF-IDF Vectorizer: 0.894

Alternate Model: SentenceBERT Transformer: 0.336


In [88]:
print(tracks[3])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[448, 2960, 513, 859, 3255]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[1555, 2960, 766, 1554, 1378]]):.3f}")

bio/environment

Base Model: TF-IDF Vectorizer: 0.901

Alternate Model: SentenceBERT Transformer: 0.232


In [89]:
print(tracks[4])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[2160, 1376, 1613, 1240, 3032]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[540, 2161, 1240, 261, 1107]]):.3f}")

information/neuro

Base Model: TF-IDF Vectorizer: 0.850

Alternate Model: SentenceBERT Transformer: 0.208


In [90]:
print(majors[1])
print(tracks[5])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[448, 1010, 1007, 118, 2890]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[512, 1378, 2959, 516, 1000]]):.3f}")

SOCIAL SCIENCE

economics

Base Model: TF-IDF Vectorizer: 0.897

Alternate Model: SentenceBERT Transformer: 0.225


In [91]:
print(tracks[6])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[1475, 2357, 2489, 3057, 2410]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[2489, 2410, 1472, 2357, 3057]]):.3f}")

law, ir

Base Model: TF-IDF Vectorizer: 0.344

Alternate Model: SentenceBERT Transformer: 0.164


In [92]:
print(tracks[7])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[1241, 1613, 1589, 359, 2152]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[2110, 775, 1241, 21, 2477]]):.3f}")

psychology/economics

Base Model: TF-IDF Vectorizer: 0.717

Alternate Model: SentenceBERT Transformer: 0.251


In [93]:
print(tracks[8])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[1384, 1360, 1466, 1435, 1070]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[1685, 1201, 1361, 2489, 1584]]):.3f}")

political science/law

Base Model: TF-IDF Vectorizer: 0.466

Alternate Model: SentenceBERT Transformer: 0.338


In [94]:
print(majors[2])
print(tracks[9])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[1582, 2029, 706, 2272, 3043]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[706, 1355, 2285, 2272, 259]]):.3f}")

HUMANITIES

history/philosophy

Base Model: TF-IDF Vectorizer: 0.851

Alternate Model: SentenceBERT Transformer: 0.224


In [95]:
print(tracks[10])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[1582, 1643, 1259, 1256, 2537]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[3239, 1569, 591, 259, 2285]]):.3f}")

cultural analysis

Base Model: TF-IDF Vectorizer: 0.786

Alternate Model: SentenceBERT Transformer: 0.184


In [96]:
print(tracks[11])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[1146, 1138, 3144, 1144, 1145]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[1146, 1138, 491, 132, 259]]):.3f}")

media/film

Base Model: TF-IDF Vectorizer: 0.686

Alternate Model: SentenceBERT Transformer: 0.247


In [97]:
print(tracks[12])
print(f"Base Model: TF-IDF Vectorizer: {compute_ild(tfidf_embeddings[[2990, 27, 256, 2072, 145]]):.3f}")
print(f"\nAlternate Model: SentenceBERT Transformer: {compute_ild(sentbert_embeddings[[2990, 2072, 259, 681, 1351]]):.3f}")

art history/history

Base Model: TF-IDF Vectorizer: 0.658

Alternate Model: SentenceBERT Transformer: 0.242


Sources for this part of the code:
* ChatGPT: https://chatgpt.com/share/68373ff5-3e78-800d-8d41-b1cd95c2cd62
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html
* https://arxiv.org/abs/2307.04644