# Text Mining Project: AUC Courses Recommender
## Desrciption
In this notebook you will find the source code for the Amsterdam University College (AUC) Course Recommender which was made for a Project for the course called Text Mining. 

## Code
### Imports:

In [184]:
import pandas as pd
import numpy as np
import nltk
import re
import string
from sklearn.metrics.pairwise import cosine_similarity
from gensim.parsing.preprocessing import remove_stopwords
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

### Loading and Preprocessing the Data

Loading the data in as a DataFrame

In [5]:
data = pd.read_csv("datasets/recommender_dataset.csv")

Droping the rows which had noting in the row in the course_catalogue_number column.

In [8]:
courses = data.dropna(subset=['course_catalogue_number', 'is_part_of', 'language_of_instruction', 'course_description'])

Removing the stop words from columns

In [17]:
courses = courses.reset_index(drop=True)
for i in range(1, int(len(courses))):
    courses.loc[i, 'course_name'] = courses.loc[i, 'course_name'].lower()
    courses.loc[i, 'college_graduate'] = courses.loc[i, 'college_graduate'].lower()
    courses.loc[i, 'language_of_instruction'] = courses.loc[i, 'language_of_instruction'].lower()
    courses.loc[i, 'is_part_of'] = courses['is_part_of'][i].lower()
    courses.loc[i, 'is_part_of'] = remove_stopwords(courses.loc[i, 'is_part_of'])
    courses.loc[i, 'course_description'] = courses.iloc[i, 7].lower()
    courses.loc[i, 'course_description'] = remove_stopwords(courses.loc[i, 'course_description'])

Removing punctuation from the text

In [20]:
for i in range(len(courses)):
    courses.loc[i, 'course_name'] = re.sub(r'[^\w\s]+', '', courses['course_name'][i])
    courses.loc[i, 'is_part_of'] = re.sub(r'[^\w\s]+', '', courses['is_part_of'][i])
    courses.loc[i, 'college_graduate'] = re.sub(r'[^\w\s]+', '', courses['college_graduate'][i])
    courses.loc[i, 'course_description'] = re.sub(r'[^\w\s]+', '', courses['course_description'][i])

Soueces for this part of the code:
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
* https://www.w3schools.com/python/pandas/ref_df_reset_index.asp#:~:text=Definition%20and%20Usage,this%2C%20use%20the%20drop%20parameter.
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
* https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
* https://towardsdatascience.com/remove-punctuation-pandas-3e461efe9584/
* https://www.geeksforgeeks.org/python-remove-punctuation-from-string/

### Tokenization

Tokenization of columns: 'course_name', 'is_part_of', 'college_graduate', and 'course_description'

In [31]:
for i in range(len(courses)):
    courses.loc[i, 'course_name'] = nltk.tokenize.WordPunctTokenizer().tokenize(courses['course_name'][i])
    courses.loc[i, 'is_part_of'] = nltk.tokenize.WordPunctTokenizer().tokenize(courses['is_part_of'][i])
    courses.loc[i, 'college_graduate'] = nltk.tokenize.WordPunctTokenizer().tokenize(courses['college_graduate'][i])
    courses.loc[i, 'course_description'] = nltk.tokenize.WordPunctTokenizer().tokenize(courses['course_description'][i])

Making a new column with all of the text from the row.

In [29]:
courses['combined_text'] = courses['course_name'] + courses['is_part_of'] + courses['college_graduate'] + courses['course_description']

Soueces for this part of the code:
* https://www.nltk.org/api/nltk.tokenize.regexp.html
* https://www.kaggle.com/code/kanikanarang94/tokenization-using-nltk
* https://saturncloud.io/blog/how-to-combine-two-columns-in-a-pandas-dataframe/

### Vectorization

Tagging the column courses['combined_text'] in order to be able to vectorize it with doc2vec.

In [34]:
tagged_data = [TaggedDocument(words=word_tokenize(str(_d)), tags=[str(i)]) for i, _d in enumerate(courses['combined_text'])]

Below I am vectorizing the column courses['combined_text'].

In [37]:
params = {
    'vector_size': 100, # dimension of embeddings
    'window': 5, # window -/+ before and after focus word
    'epochs': 5, # number of iterations over the corpus
    'min_count': 5, # filter on words whose frequency is below this count
    'workers': 4, # how many cores to use
    'alpha': 0.05 # initial learning rate for SGD. This is lambda in the class notes
}

model = Doc2Vec(**params)
  
model.build_vocab(tagged_data)

max_epochs = 100

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=1)
    model.alpha = 0.2
    model.min_alpha = model.alpha

iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
iteration 19
iteration 20
iteration 21
iteration 22
iteration 23
iteration 24
iteration 25
iteration 26
iteration 27
iteration 28
iteration 29
iteration 30
iteration 31
iteration 32
iteration 33
iteration 34
iteration 35
iteration 36
iteration 37
iteration 38
iteration 39
iteration 40
iteration 41
iteration 42
iteration 43
iteration 44
iteration 45
iteration 46
iteration 47
iteration 48
iteration 49
iteration 50
iteration 51
iteration 52
iteration 53
iteration 54
iteration 55
iteration 56
iteration 57
iteration 58
iteration 59
iteration 60
iteration 61
iteration 62
iteration 63
iteration 64
iteration 65
iteration 66
iteration 67
iteration 68
iteration 69
iteration 70
iteration 71
iteration 72
iteration 73
iteration 74
iteration 75
iteration 76
iteration

Soueces for this part of the code (those sources are also relevent in the following part of this notebook):
* https://radimrehurek.com/gensim/models/doc2vec.html
* https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html
* https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5
* https://spotintelligence.com/2023/09/06/doc2vec/#What_is_Doc2Vec

### Recommender

In [150]:
def recommend_courses(li_course_ids, model, courses_df, top_n=5):
    #selecting the courses from the data - might be wrong
    selected_texts = []
    for cid in li_course_ids:
        row = courses_df[courses_df['course_catalogue_number'] == cid]
        if not (len(row) == 0):
            selected_texts.append(nltk.tokenize.WordPunctTokenizer().tokenize(str(row.iloc[0]['combined_text'])))

    #getting the vectors in form the doc2vec model and making one vector out of all the courses in the li_course_ids
    vectors = [model.infer_vector(tokens) for tokens in selected_texts]
    av_vector = np.mean(vectors, axis=0).reshape(1, -1)
    all_vectors = np.array([model.dv[i] for i in range(len(model.dv))])
    
    #calculating the cosine similarities and flattening them to one dimension
    cosine_sim = cosine_similarity(av_vector, all_vectors).flatten()
    
    #making the cosine similarities a list and sorting it
    sim_scores = list(enumerate(cosine_sim))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    #removing the courses from the sim_scores which were in the li_course_ids - might be worng
    idex_to_id = courses_df['course_catalogue_number'].tolist()
    selected_indicies = [idex_to_id.index(cid) for cid in li_course_ids if cid in idex_to_id]
    sim_scores = [s for s in sim_scores if s[0] not in selected_indicies]
    
    #getting the top 5 courses from the similaritie scores - also might be wrong
    course_indices = [i[0] for i in sim_scores[:top_n]]
    
    #returning the 
    return courses_df.iloc[course_indices][['course_name', 'course_catalogue_number']]

Soueces for this part of the code:
* https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
* https://numpy.org/doc/stable/reference/generated/numpy.matrix.flatten.html
* https://ioflood.com/blog/dataframe-to-list-pandas/#:~:text=You%20can%20use%20the%20toList,tolist()%20.&text=In%20the%20example%20above%2C%20we,1%2C%202%2C%203%5D.
* https://www.programiz.com/python-programming/methods/list/index

### Testing

Opening of the test dataset.

In [155]:
majors = []
tracks = []
li_coourse_ids = []
with open("datasets/test_set.txt", 'r') as f:
    for l in f:
        if l[0] == '[':
            li_coourse_ids.append(eval(l))
        elif l.isupper():
            majors.append(l)
        elif l.islower():
            tracks.append(l)

In [157]:
print(majors[0])
print(tracks[0])
recommend_courses(li_coourse_ids[0], model, courses)

SCIENCE

math



Unnamed: 0,course_name,course_catalogue_number
3269,urban research project,75150004ZY
2393,presentation applied anthropology inside,7313TP103Y
2358,politics and protest the latin american state,137221016Y
1994,matter materiality and material culture,141412172Y
1464,international financial management,6614ZF005Y


In [159]:
print(tracks[1])
recommend_courses(li_coourse_ids[1], model, courses)

biomed



Unnamed: 0,course_name,course_catalogue_number
2356,politics and international relations in antiquity,13822E116Y
1342,heart development function and disease,5234HDFD6Y
616,conservation principles and practice 2 book and,146421386Y
1227,from the margins to the mainstream gender,7324G200ZY
1166,financial economics and quantitative methods,3784FEM6VY


In [161]:
print(tracks[2])
recommend_courses(li_coourse_ids[2], model, courses)

physics



Unnamed: 0,course_name,course_catalogue_number
1660,language proficiency sign language of the,135110226Y
1631,italian language proficiency and culture 4,127121046Y
54,acquisition change and variation in slavic lan...,184415116Y
3115,thesis preparation theoretical physics,5354TPTP3Y
1096,evolution of language and music,5102EVTM6Y


In [163]:
print(tracks[3])
recommend_courses(li_coourse_ids[3], model, courses)

bio/environment



Unnamed: 0,course_name,course_catalogue_number
1460,international economic cooperation,6414M0159Y
2016,memory conflict in eastern and southeastern e...,142414096Y
1275,geopolitical economy of energy in eurasia,7323C155LY
1146,film theories,159421036Y
1675,law justice and morality,3801LJMOVY


In [165]:
print(tracks[4])
recommend_courses(li_coourse_ids[4], model, courses)

information/neuro



Unnamed: 0,course_name,course_catalogue_number
1680,leadership and organisational culture,7202BA04XY
1127,field school 1,110121146Y
2628,research methods 1,73415B006Y
2375,poststructuralism,187421316Y
44,academic skills tutoring,77411001AY


In [167]:
print(majors[1])
print(tracks[5])
recommend_courses(li_coourse_ids[5], model, courses)

SOCIAL SCIENCE

economics



Unnamed: 0,course_name,course_catalogue_number
8,12a machine learning and reasoning for health,4604MM117Y
1566,introduction to international relations,7321M127LY
2827,sociology concepts issues and research,73310201AY
2978,term paper argumentative discourse across domains,165418516Y
316,bachelors internship econometrics,6013B0362Y


In [169]:
print(tracks[6])
recommend_courses(li_coourse_ids[6], model, courses)

psychology/economics



Unnamed: 0,course_name,course_catalogue_number
77,advanced creative writing,900324HUMY
3322,working with quantitative data,75050005FY
3124,thesis seminar english literature and culture,178418636Y
2766,seminar mathematical logic,5314SEML3Y
2791,skills and research methods for political science,3802SRPPVY


In [171]:
print(tracks[7])
recommend_courses(li_coourse_ids[7], model, courses)

political science/law



Unnamed: 0,course_name,course_catalogue_number
2762,semantics and philosophy,187421126Y
1950,mastering your masters,77613905AY
2389,prejudice stereotyping,7205RS25XY
2106,morality of markets,3803MOQPKY
3249,understanding molecular simulation,5254UNMS6Y


In [173]:
print(tracks[8])
recommend_courses(li_coourse_ids[8], model, courses)

history/philosophy



Unnamed: 0,course_name,course_catalogue_number
1519,internship seminar preservation and presentation,158621106Y
1433,integrated coastal dune management,5264ICDM6Y
673,core course holocaust and genocide studies,143414000Y
550,coming together coming apart theories and,111121116Y
2176,nuclear magnetic resonance,5254NUMR6Y


In [175]:
print(majors[2])
print(tracks[8])
recommend_courses(li_coourse_ids[8], model, courses)

HUMANITIES

history/philosophy



Unnamed: 0,course_name,course_catalogue_number
2275,philosophy of science,7012B2005Y
2964,tax treaties i,3094TAXTVY
1725,logic and the human factor in forensic reasoning,5274LTHF6Y
717,critical cultural theory and the politics of,187421526Y
2384,practical project,148621026Y


In [177]:
print(tracks[9])
recommend_courses(li_coourse_ids[9], model, courses)

cultural analysis



Unnamed: 0,course_name,course_catalogue_number
560,comparative european tort law,3802TOQPVY
752,cultures of european governance,142424042Y
1519,internship seminar preservation and presentation,158621106Y
1202,freedom alienation and the crisis of modernity,7323A076LY
2700,research workshop survey,774111003Y


In [179]:
print(tracks[10])
recommend_courses(li_coourse_ids[10], model, courses)

media/film



Unnamed: 0,course_name,course_catalogue_number
169,reading the city amsterdam literary classics,120217656Y
2525,quantum optics,5354QUOP6Y
3142,thinking postcolonial europe,111221416Y
1008,environmental economics,900122SSCY
2722,russian east european literatures the classics,133221096Y


In [181]:
print(tracks[11])
recommend_courses(li_coourse_ids[11], model, courses)

art history/history



Unnamed: 0,course_name,course_catalogue_number
1281,german 2 intermediate,11112L206Y
1669,law and economics ii empirical legal studies,3013L4Q0KY
725,critical theories,187421256Y
1311,good research practices,776500206Y
2826,sociology of the other,900274SSCY


Soueces for this part of the code:
* https://www.geeksforgeeks.org/python-convert-a-string-representation-of-list-into-list/