# Homework 3
In this notebook there are the solutions for the Homework 3  
Gianluca Procopio - 1942103  
Andrea Polidori - 2143429  
Lorenzo Incoronato - 1916863  
Riccardo Violano - 2148833


## Useful libraries

In [2]:
import csv
from bs4 import BeautifulSoup
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import re
import numpy as np
import nltk
import requests
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

## Some useful constants

Working on my local machine I created a file called _constants.py_ which contains every constant we used in our work.

In [3]:
CURRENCY_MAP = {
    '$': 'USD',
    '£': 'GBP',
    '€': 'EUR',
    '¥': 'JPY'
}


# 1. Data Collection

## 1.1 Get the list of master's degree courses

In order to get the list of links for each master course, I used Selenium. Selenium, with its WebDriver, is used to simulate and to automate the use of a browser. So I used it to navigate over the pages to collect all the links.  
To navigate around the pages I took advantage of the _route_ used for each page button (at the bottom of the webpage). In fact, by inspecting the next page button, I saw its href="/masters-degrees/msc-degrees/?PG=2".  
Given this route and by looking at the website link (www.findamasters.com/masters-degrees/msc-degrees/), I understood that for each page the only changing part in the link was '?PG=x' at the end, so by changing the link as following 'edit_URL = URL + '?PG=' + str(i)' I could easly navigate over the pages.


For each page, I analyzed the HTML source code and selected all the links thanks to their classname ('driver.find_elements(By.CLASS_NAME,"courseLink")') using 'By', a Selenium module that allows me locate elements within a document. Then, I iterated over all the elements found and added them in a 'link.txt' file.

## 1.2 Crawl master's degree pages

To perform the second answer I needed Selenium and OS. OS is a python module that provides functions for interacting with the operating system, in this case it allows me to create new folders. In fact, as requested in the homework, I created a folder called 'pages', inside which I created 400  folders named 'page_i' (with i between 1 and 400), each one containing the 15 masters' source html code (in the end we will have 400 folders with 15 html files per folder, so in total 6000 html files).  
Retrieving the source code was possible thanks to 'driver.page_source', a method provided by Selenium. In the end I wrote each source code in a html file.

## 1.3 Parse downloaded pages

In this last part of the exercise I had to create a TSV file for each master. To parse the HTML code I took advantage of 'BeautifulSoup', a python package useful for parsing HTML code.  
In order to get or the needed information first of all I analyzed the html source code and for each information I looked for the container inside which it's stored. Once identified the container, I looked for useful information like class or id that can help me identifying the correct text, and then I extracted the text by using some methods provided by Beautiful Soup, such as soup.find(...) or soup.select(..). In the end, I create a TSV file 'course_i.tsv' containing all the retrieved informations about a master course.
All these TSV files are stored in a 'tsv' folder.

In the last part of the code I merged all the 'course_i.tsv' file in 'master.tsv' which is the full dataset.

# 2. Search Engine

First of all, I need the dataset.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
tsv_path = "/content/drive/MyDrive/hw3_adm/master.tsv"

masters = pd.read_csv(tsv_path, delimiter='\t')

## 2.0 Preprocessing

### 2.0.0 Preprocessing the text

First, you must pre-process all the information collected for each MSc by:

1. Removing stopwords
2. Removing punctuation
3. Stemming
4. Anything else you think it's needed


First of all I need to download all the resources needed, such as the 'stopwords' dataset to filter all the stopwords in the dataset, and the Punkt tokenizer models, it's a useful model provided by NLTK for tokenization. By downloading these resources, we ensure that NLTk can perform optimally all the  natural language processing tasks on the dataset.

In [None]:
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
pip install wordsegment

In [8]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
import wordsegment

# Solve problem of missclick about writing words
wordsegment.load()

lst_stopwords = set(stopwords.words('english'))


ps = PorterStemmer()

# function of preprocessing
def process_text(text):
    # Avoid Nan Value
    if pd.isna(text):
        return []
    #punctuation
    text = ''.join([char for char in str(text) if char not in string.punctuation])
    # word segmentation
    segmented_tokens = [segment for token in text.split() for segment in wordsegment.segment(token)]

    # stemming
    tokens = [ps.stem(word) for word in segmented_tokens if word.lower() not in lst_stopwords]

    return tokens

# We work only on description (more usefull)
masters['descr_stem'] = masters['description'].apply(process_text)


Working on the 'duration' column, I'm going to fix some data.

In [9]:
masters['duration'] = masters['duration'].str.lower()

masters['duration'] = masters['duration'].str.replace('full-time', 'full time').str.replace('part-time', 'part time')

masters.loc[(masters['duration'] == 'one year ft') | (masters['duration'] == '1 year ft') | (masters['duration'] == 'full time : 1 year')| (masters['duration'] == 'full time : one year'), 'duration'] = '1 year full-time'

I have to fix the 'url' column aswell.

In [10]:
masters['url'] = masters['url'].str.replace(' ','')

### 2.0.1 Preprocessing the fees column

In order to preprocess the fees column first of all I removed all the punctuation and useless digits (such as academic years), so that in the fees column the only digits there were the fees' amounts.   

Then through a regex I found all the combinations (currency, symbol or code, and amount in both orders). I casted all the amounts to float, so that I could easily get the higher fee for each master, and formatted all the values in the same way and converted all the currency symbol in their currency code through a map.

In [11]:
# my regex had problems with academic years, so I removed them.
# strings like 202223, 20222023 will be removed
def remove_academic_years(text):
    # first remove useless punctuations
    text=str(text).translate(str.maketrans("", "", string.punctuation))
    # delete all academic years
    academic_year_pattern = r'\b20\d{4,6}\b'
    return re.sub(academic_year_pattern, '', str(text))

#removed 'TBC' strings that gave me problem when I was searching the fees with the regex
def remove_TBC(text):
    short_word_pattern = r'\b(?:TBC)\b'
    return re.sub(short_word_pattern, '', str(text))

# in this function througha. regex I find all te combinations for the currency and amounts.
def extract_currency_and_amount(fee):
    currency_pattern = r'([A-Z]{3}|[£€¥$])\s*([\d,]+\.?\d*)|\s*([\d,]+\.?\d*)\s*([A-Z]{3}|[£€¥$])'
    matches = re.findall(currency_pattern, str(fee))

    if matches:
        extracted_values = [] # contains the fixed list of all the matches, because in the tuple of the matches there are some empty strings
        for match in matches:
            values = [item for item in match if item]
            if len(values) == 2: # each match has only 2 elements: currency and amount
                extracted_values.append(tuple(values))
        if len(extracted_values) == 0: # case base
            return []
        res = [] # in the res variable will be stored the return value
        for el in extracted_values:
            if el[0].isdigit(): # if the first element in the tuple is digit, then the second is currency and will be converted if it's a symbol
                cur = el[1]
                if cur in '£€¥$': # is the currency is expressed as sybol, convert it through a map
                    cur = CURRENCY_MAP[cur]
                res.append((float(el[0]),cur))
            else: # if the second element is digit, the first is currency and willl be converted
                cur = el[0]
                if cur in '£€¥$':
                    cur = CURRENCY_MAP[cur]
                res.append((float(el[1]),cur))
        return res
    else:
        return []

masters['fees'] = masters['fees'].apply(remove_TBC)
masters['fees'] = masters['fees'].apply(remove_academic_years)
masters['fees'] = masters['fees'].apply(extract_currency_and_amount)

In the end I'm going to convert all the fees from the different currencies in USD, using the Open Exchange Rates API, as suggested from ChatGPT. To take advantage of this API I had to register and get my API key.

FIrst I need the exchange rates in real time, the free trial of this API allows me only to use as base currency the USD, so every amount will be converted to USD.

In [12]:
def get_exchange_rates(api_key, base_currency='USD'):
    endpoint = f'https://open.er-api.com/v6/latest/{base_currency}'
    params = {'apikey': api_key}
    response = requests.get(endpoint, params=params)
    data = response.json()
    return data['rates']

api_key = '76c3b4f3197a4390af5438f5cfb4b121'

exchange_rates = get_exchange_rates(api_key)

The *exchange_rates* variable contains a dictionary with the current change between a currency and the USD.

In the end of this code, I wrote a function called 'convert_to_usd(..)' which took for each master the list of fees and its currencies, convert the amount in USD and take the highest value.

so in the end the 'fees_usd' columns contains the highest fee expressed in USD. To keep the notebook clean and tidy i will show only the first 10 rows to give an idea of how the column looks like.

In [13]:
def convert_to_usd(row, exchange_rates, target_currency='USD'):
    if len(row)==0:
      return None
    usd_fees = [] # contains all the fees converted
    for amount, currency in row:
        if currency in exchange_rates and amount is not None and amount != '':
            # append to the list of fees the converted values
            usd_fees.append(round(float(amount) / exchange_rates[currency] * exchange_rates[target_currency], 2))
    highest = max(usd_fees) if len(usd_fees) > 0 else 0 # choose the highest fee among the converted ones.
    return highest

masters['fees_usd'] = masters['fees'].apply(lambda row: convert_to_usd(row, exchange_rates))
print(masters['fees_usd'].iloc[:10])

0         NaN
1    43733.91
2         NaN
3         NaN
4         NaN
5    39014.42
6    19668.82
7    19668.82
8         NaN
9    16390.68
Name: fees_usd, dtype: float64


## 2.1 Conjunctive query

### 2.1.1 Create your index!

To Use the two code after this cell, you have to create "vocabulary.txt" and "inverted_index.txt", in this way we can avoid to calculate them each time

In [14]:
import pandas as pd
from collections import defaultdict
#first create the dict where we will put all the words
vocabulary = defaultdict(lambda: len(vocabulary))

# Concat of all words
all_words = masters['descr_stem'].apply(lambda x: x if isinstance(x, list) else [])


for document in all_words:
    for word in document:
        _ = vocabulary[word]

# Save
with open("vocabulary.txt", "w", encoding="utf-8") as vocab_file:
    for word, term_id in vocabulary.items():
        vocab_file.write(f"{word}\t{term_id}\n")

In [15]:
import pandas as pd
from collections import defaultdict
#first create the dict where we will put all the words, we create the inverted like a set to avoid duplicate
vocabulary = defaultdict(lambda: len(vocabulary))
inverted_index = defaultdict(set)

#construction of the inverted index
for doc_id, document in enumerate(masters['descr_stem'].dropna().astype(object)):
    for word in document:
        term_id = vocabulary[word]
        inverted_index[term_id].add(doc_id)

# Save
with open("inverted_index.txt", "w", encoding="utf-8") as index_file:
    for term_id, doc_ids in inverted_index.items():
        index_file.write(f"{term_id} : {sorted(doc_ids)}\n")

### 2.1.2 Execute the query

In [16]:
import pandas as pd

# List to memorize
data_list = []

# Read inverted index file
with open('inverted_index.txt', 'r') as file:
    for line in file:
        # Extraction
        numbers_str = line.split(':')[1].strip()
        numbers_str_cleaned = numbers_str.strip('[]').replace(',', '')
        numbers_set = set(map(int, numbers_str_cleaned.split()))

        data_list.append({'Number': int(line.split(':')[0].strip()), 'links': numbers_set})

# Dataframe for inverted
df_inverted = pd.DataFrame(data_list)
df_inverted.set_index('Number', inplace=True)

# Read vocabulary file
df_vocabulary = pd.read_csv('vocabulary.txt', sep='\t', names=['Word', 'Number'])
df_vocabulary['Number'] = pd.to_numeric(df_vocabulary['Number'])

# Merge dataframes
df_merged = pd.merge(df_vocabulary, df_inverted, left_on='Number', right_index=True)


Query Research V1

In [41]:
from collections import defaultdict
def Research1(Query):

#we process the query to avoid the problem of punkt, stemming and all the other stuff
  Query=process_text(Query)
  Query=list(map(str.lower,Query))
  linkss = set(range(6000))
#we extract the doc_id of the words that we want, and we do an interction with the total of doc_id
#in this way we are sure to have only the documents that have all the word of the Query
  for j in Query:
    temp_link=[]
    search = df_merged[df_merged['Word'] == j]
    for index, row in search.iterrows():
        links_number = row['links']
        temp = list(links_number)

        temp_link.extend(temp)

    temp_link=set(temp_link)
    linkss=linkss.intersection(temp_link)


#we filter the dataset for the output that we want
  result_df = pd.DataFrame(columns=['courseName', 'universityName', 'description', 'url'])

  for i in linkss:
    goal = masters.loc[i, ['courseName', 'universityName', 'description', 'url']]
    result_df = pd.concat([result_df, goal.to_frame().transpose()])


  return(result_df)

  and should_run_async(code)


In [42]:
Query=['Advanced knowledge']
Research1(Query)


Unnamed: 0,courseName,universityName,description,url
1,Accounting and Finance - MSc,University of Leeds,"Businesses and governments rely on sound financial knowledge to underpin their strategies for success.This course will kick-start your professional career, providing you with advanced knowledge of modern financial theories, accounting practices and controls, as well as an understanding of the organisational context.You'll cover fundamental topics such as corporate finance and corporate reporting. Optional modules will allow you specialise your knowledge, ranging from international business finance to forensic accounting. You’ll also be able to apply your learning to practical case studies and simulations to test your knowledge in genuine business scenarios.",https://www.findamasters.com/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891
4,Addictions MSc,King’s College London,"Join us for an online session for prospective students to find out about our MSc Addictions course. The online session will take placeWednesday 17th May 12 - 12.45pm.Book onlineOur Addiction Studies course provides postgraduate education in addictions, with a particular emphasis on recent biological, psychological and socio-environmental advances in the conceptualisation of drug use and addiction and its effective prevention. Your theoretical understanding of the subject will be positioned within the context of policy development, focusing on facilitating the transfer of knowledge and skills to the workplace.",https://www.findamasters.com/masters-degrees/course/addictions-msc/?i132d4318c27100
12,Analytical Toxicology MSc,King’s College London,The Analytical Toxicology MSc is a unique study course that integrates theoretical and practical aspects of analytical science with clinical and forensic toxicology. This course will provide you with a detailed knowledge and comprehensive understanding of advanced analytical toxicology and its applications.,https://www.findamasters.com/masters-degrees/course/analytical-toxicology-msc/?i132d4846c14767
4108,Global Meetings and Events Management MSc / PGDip,University College Birmingham,"Become part of an events industry worth an estimated £70 billion to the UK economy, offering hundreds of thousands of jobs across a huge range of sectors. Whether you dream of managing festivals, music and sporting events or high-profile conferences, meetings and exhibitions, our Global Meetings and Events Management postgraduate course will give you advanced knowledge and skills for higher level careers in this dynamic",https://www.findamasters.com/masters-degrees/course/global-meetings-and-events-management-msc-pgdip/?i1038d6176c43380
2069,Clinical Research - MSc,Cardiff University,"This part-time, distance learning course offers extensive knowledge and expertise relevant to those working in clinical trials. Its is also suitable for those in a senior role in clinical research in a pharmaceutical, biotechnology, device, contract research or site management company or the NHS.The MSc in Clinical Research is designed to advance and enhance your career in Clinical Research in the pharmaceutical industry, health service, contract research organisations, regulatory authorities, data management organisations and in medical writing.",https://www.findamasters.com/masters-degrees/course/clinical-research-msc/?i33d4094c49009
...,...,...,...,...
2002,Clinical Geriatrics - MSc,Cardiff University,"The MSc Clinical Geriatrics aims to equip healthcare professionals, who have a special interest in managing and leading care in older adults, with an advanced clinical knowledge of ageing in health and illness as they present in",https://www.findamasters.com/masters-degrees/course/clinical-geriatrics-msc/?i33d4542c69034
2023,Clinical Neuropsychology - MSc,University of Bristol,"Professional programmes in Clinical Neuropsychology have been offered within University of Bristol since 2010 and are delivered in partnership with North Bristol NHS Trust, which includes a large regional neurosciences and major trauma centre. Our Clinical courses aim to engage and inspire clinicians to enhance their skills in Clinical Neuropsychology and support entry on to the British Psychological Society's (BPS) Specialist Register of Clinical Neuropsychologists (SRCN). This MSc is accredited by the BPS and delivers the adult knowledge and practice component of the competency requirements for advanced training in clinical neuropsychology.",https://www.findamasters.com/masters-degrees/course/clinical-neuropsychology-msc/?i287d6202c36582
2038,Clinical Ophthalmic Practice MSc,University College London,"This programme will allow you to develop advanced knowledge of the eye, eye diseases and treatment and the research underpinning clinical practice. It is delivered by ophthalmologists, researchers and nurses who are at the forefront of ophthalmic research clinical practice at the Institute of Ophthalmology and Moorfields Eye Hospital Trust both institutions will be involved in the teaching.",https://www.findamasters.com/masters-degrees/course/clinical-ophthalmic-practice-msc/?i274d1871c40131
2040,Clinical Optometry - MSc,Cardiff University,"The aim of this programme is to provide optometrists with the opportunity to undertake a challenging and rewarding Masters in Clinical Optometry that focuses on their personal development as professionals leading, managing and developing ophthalmic practice roles and models of care delivery.The MSc Clinical Optometry is designed for optometrists in primary, secondary and tertiary settings who wish to advance their knowledge base, and specialist clinical and leadership skills. You will become actively involved in the advancement of optometry.",https://www.findamasters.com/masters-degrees/course/clinical-optometry-msc/?i33d4093c52009


## 2.2 Conjunctive query & Ranking score

### 2.2.1 Inverted index

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(input='content', lowercase=False, tokenizer=lambda text: text,vocabulary=vocabulary, max_df=0.2)
# with max_df i am excluding every element that appear more than 20%, because they could be too common words
masters_tfidf = tfidf.fit_transform(masters.descr_stem)
inverted_index_tfidf = {}

with open('inverted_index.txt', 'r') as file:
    for line in file:
        parts = line.strip().split(':')
        term = int(parts[0])
        doc_list = eval(parts[1])  # Use eval to convert the string representation of the tuple to an actual tuple

        inverted_index_tfidf[term] = []
        for doc in doc_list:
            doc_id = doc
            tfidf_value = masters_tfidf[doc_id,term]
            inverted_index_tfidf[term].append((doc_id, tfidf_value))

# Save the new inverted index to a text file
with open('inverted_index_tfidf.txt', 'w') as file:
    for term, doc_tfidf_list in inverted_index_tfidf.items():
        file.write(f"{term}: {doc_tfidf_list}\n")



### 2.2.2 Execute the query

In [40]:
import heapq
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def Research2(query, df, k):
  #i need to preprocces the description and the query
    df['descr_stem'] = df['description'].apply(process_text)
    query = process_text(query)

    # Calculate TF-IDF for the query
    tfidf_vectorizer = TfidfVectorizer(input='content', lowercase=False, tokenizer=lambda text: text, vocabulary=vocabulary,max_df=0.2)
    tfidf_query = tfidf_vectorizer.fit_transform([query])


    heap = []
   #we find the tfidf of the smaller dataset with only the words that have all the words of the Query
    for index, document in df['descr_stem'].items():
         tfidf_document = tfidf_vectorizer.transform([document])
         similarity = cosine_similarity(tfidf_query, tfidf_document)[0, 0]
#we use a heap structure to have the top_k_documents
         if similarity > 0:
            if len(heap) < k:
                heapq.heappush(heap, (similarity,index))

            else:
              #i add temp just because heap is a tuple and i need only the similarity
                min_similarity,temp = heap[0]


                if similarity > min_similarity:
                    heapq.heappop(heap)
                    heapq.heappush(heap, (similarity,index))


    # i add this list to have the dataframe organized in a better way
    top= [(index, similarity) for (similarity, index) in sorted(heap)]
    #i added the index to not lose the information of the original index

    result_df = pd.DataFrame(top, columns=['Index', 'Similarity'])


    #we have the index, we want the columns of the original dataframe
    result_df = result_df.set_index('Index')
    result_df = result_df.join(masters.loc[result_df.index, ['courseName', 'universityName', 'description', 'url']])
    result_df = result_df.sort_values(by='Similarity', ascending=False)
    return result_df


query2 = ['Advanced knowledge']
k = 5
documentsss=Research1(query2)
top_k_results = Research2(query2, documentsss, k)
top_k_results



Unnamed: 0_level_0,Similarity,courseName,universityName,description,url
Index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
650,0.321699,Advanced Clinical Practice - MSc,Canterbury Christ Church University,"Gain the knowledge and skills needed to become a qualified health care professional working in a clinically senior post.The MSc Advanced Clinical Practice course has been developed to provide you with knowledge and skills to allow you to develop and progress your advanced clinical practice role. You will develop an in-depth and advanced knowledge of your role (ACP). The knowledge is informed by current practice and research. You will develop a critical awareness of the subject matter and be able to demonstrate critical skills, knowledge of your profession demonstrating strategic leadership and education in practice, you will also reflect on your progress as a learner.",https://www.findamasters.com/masters-degrees/course/advanced-clinical-practice-msc/?i32d2712c57002
926,0.296649,Advancing Practice - MSc,University of Northampton,"Our MSc Advancing Practice awards support the advancement of healthcare professionals practice by developing, knowledge, skills and understanding to challenge and innovate future practice. This award also offers and alternative route to individuals with an interest in Advanced Clinical Practice.By challenging legal, professional and ethical dilemmas you will develop advanced knowledge and skills to underpin safe and effective practice, supported by the application of complex decision making. You will benefit from a flexible route which builds and develops opportunity for advancing healthcare provision.",https://www.findamasters.com/masters-degrees/course/advancing-practice-msc/?i337d1774c57759
782,0.293139,Advanced Healthcare Practice - MSc,Cardiff University,"Our MSc Advanced Healthcare Practice programme aims to develop your knowledge, understanding and critical appreciation of the four pillars of advanced level practice.It offers you the opportunity to apply learning to advance your leadership and management, facilitation of learning and teaching, and clinical practice skills, all of which are underpinned by evidence, research and service improvement.Our programme is suitable for those progressing towards a level of autonomous advanced practice as well as experienced registered health care professionals, already working as advanced practitioners.",https://www.findamasters.com/masters-degrees/course/advanced-healthcare-practice-msc/?i33d4736c67062
752,0.288671,Advanced Computing MSc,King’s College London,"Our Advanced Computing MSc provides knowledge and experience of computing at an advanced level. The programme allows students to select modules on a wide range of advanced computer science subjects, so that they can build a programme that suits their interests and career aspirations.",https://www.findamasters.com/masters-degrees/course/advanced-computing-msc/?i132d3905c23524
697,0.27655,Advanced Clinical Practice MSc,University of Greenwich,Develop your skills and deepen your knowledge of advanced health clinical practice with this tailored Master's course. Our MSc in Advanced Clinical Practice is designed for current practitioners who are registered with a professional body and who would like to become advanced clinical practitioners.,https://www.findamasters.com/masters-degrees/course/advanced-clinical-practice-msc/?i309d6650c56313


# 3. Define a new score!

In this section of the homework, we were required to create a new function to evaluate the similarity score. We chose to use a type of similarity called Jaccard Similarity, which is quite similar to Cosine Similarity and can assess the resemblance between two texts. The focus of the investigation was on analyzing a query that assumed a student's search for a full-time master's program in Management starting in October. To achieve this, we initially created a function to calculate the score for each analyzed course, placing greater emphasis on the similarity in course names, which adds an additional 0.2 to the score. This is because, when searching for courses of study, the name (what is studied) carries greater importance for achieving an optimal result.

An important implementation arises from the fact that we condensed the columns by combining "courseName," "isItFullTime," and "startDate," and then analyzed the Jaccard similarity for each row.

As a final step, a tuple was created containing the dataframe of all obtained results with the values of the columns, as required by the text, along with the newly calculated similarity variable.

In [25]:
import heapq

def calcola_punteggio(documento_combined, query):
    combined_text = str(documento_combined)
    parole_documento = set(combined_text.split())
    parole_query = query.split()  # Change set() to split()

    # Give more importance to the first word of the query
    punteggio_aggiuntivo = 0.2 if parole_query and parole_query[0] in parole_documento else 0

    numeratore = 0
    denominatore = 0

    for parola_documento in parole_documento:
        lunghezza_parola_documento = len(parola_documento)
        numeratore += abs(lunghezza_parola_documento - len(parole_query))
        denominatore += lunghezza_parola_documento + len(parole_query)

    punteggio_jaccard = (numeratore / denominatore + punteggio_aggiuntivo) if denominatore != 0 else 0

    return punteggio_jaccard

masters_df=pd.DataFrame(masters)
masters_df['combined'] = masters_df['courseName'] + ' ' + masters_df['isItFullTime'] + ' ' + masters_df['startDate']

Query = "Management Full Time Ocober"
heap = []
k = 450
linkss = set(range(6000))

for j in Query:
    temp_link=[]
    search = df_merged[df_merged['Word'] == j]
    for index, row in search.iterrows():
        links_number = row['links']
        temp = list(links_number)

        temp_link.extend(temp)

temp_link=set(temp_link)
linkss=linkss.intersection(temp_link)



for i in linkss:
    documento = masters_df.loc[i]
    combined_text = masters_df.loc[i, 'combined']
    punteggio = calcola_punteggio(combined_text, Query)

    documento_rilevante = documento[['courseName', 'universityName', 'description', 'url']]

    documento_tuple = tuple(documento_rilevante.values)

    heapq.heappush(heap, (punteggio, i, documento_tuple))

documenti_principali = heapq.nlargest(k, heap)

result_tuples = [(indice, masters_df.loc[indice, 'courseName'], masters_df.loc[indice, 'universityName'],
                  masters_df.loc[indice, 'description'], masters_df.loc[indice, 'url'], punteggio)
                 for punteggio, indice, _ in documenti_principali]

result_df = pd.DataFrame(result_tuples, columns=['Index', 'courseName', 'universityName', 'description', 'url', 'similarity'])

result_df = result_df.set_index('Index')

pd.set_option("display.max_colwidth", None)

result_df

Unnamed: 0_level_0,courseName,universityName,description,url,similarity
Index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2364,Construction Management MS (Online),Rochester Institute of Technology USA,Online ExclusivelyCo-op/Internship Encouraged100% Outcome Rate of RIT Graduates from this degree88K+ Annual Salary for these Jobs11% Employment Growth in this,https://www.findamasters.com/masters-degrees/course/construction-management-ms-online/?i693d7481c63870,0.502326
3497,"Environmental, Health and Safety Management MS",Rochester Institute of Technology USA,Online Option AvailableCo-op/Internship EncouragedSTEM-OPT Visa Eligible100% Outcome Rate of RIT Graduates from this degree$80K Median First-Year Salary of RIT Graduates from,https://www.findamasters.com/masters-degrees/course/environmental-health-and-safety-management-ms/?i693d7481c44225,0.465306
4191,Health and Well-being Management MS,Rochester Institute of Technology USA,"100% Outcome Rate of RIT Graduates from this degreeChoose from two options in: (1) content development, implementation, and evaluation or (2) health and well-being program management.",https://www.findamasters.com/masters-degrees/course/health-and-well-being-management-ms/?i693d7481c63872,0.438095
4279,Health Systems Management (Online) MS,Rochester Institute of Technology USA,Online Exclusively100% Outcome Rate of RIT Graduates from this degree$87K Median First-Year Salary of RIT Graduates from this degree75k+ Average Annual Salary20% Employment Growth by,https://www.findamasters.com/masters-degrees/course/health-systems-management-online-ms/?i693d7481c44232,0.432558
4142,Global Supply Chain Management MS,Rochester Institute of Technology USA,"Co-op/Internship EncouragedSTEM-OPT Visa Eligible100% Outcome Rate of RIT Graduates from this degreeThis interdisciplinary program integrates supply chain management and logistics, operations management, analytics, industrial engineering, global business, and management.",https://www.findamasters.com/masters-degrees/course/global-supply-chain-management-ms/?i693d7481c63871,0.395122
1997,Clinical Exercise Physiology - MSc/PgDip/PgCert,University of Salford,"This programme focuses on using 'exercise as medicine', to treat and manage conditions as well as to rehabilitate patients recovering from illness.We are currently applying for programme accreditation with the Registration Council for Clinical Physiologists (RCCP). Once accreditation has been approved, students who pass all modules and the clinical assessment of competencies will be eligible to register with RCCP as a Clinical Exercise",https://www.findamasters.com/masters-degrees/course/clinical-exercise-physiology-msc-pgdip-pgcert/?i347d663c54797,0.369565
4435,Human-Computer Interaction MS,Rochester Institute of Technology USA,"In the human-computer interaction master's degree, you'll study how people interact with websites, computer systems, and software, enabling you to create intuitive interfaces that improve how we interact with and use emerging technologies.Online Option AvailableCo-op/Internship EncouragedSTEM-OPT Visa Eligible100%Outcome Rate of RIT Graduates from",https://www.findamasters.com/masters-degrees/course/human-computer-interaction-ms/?i693d7481c44243,0.305556
5336,Manufacturing and Mechanical Systems Integration MS,Rochester Institute of Technology USA,Co-op/Internship EncouragedSTEM-OPT Visa Eligible100% Outcome Rate of RIT Graduates from this degree$83.9K Average First-Year Salary of RIT Graduates from,https://www.findamasters.com/masters-degrees/course/manufacturing-and-mechanical-systems-integration-ms/?i693d7481c44237,0.300971
4475,Industrial Design Engineering MSc,University of Twente,"BECOME A ‘DESIGNEER’ AND ENRICH PEOPLE’S LIVES WITH NEW AND/OR BETTER PRODUCTS WHILE CONTINUOUSLY IMPROVING THE DEVELOPMENT PROCESSES INVOLVED.In your daily life, without realising, you are continuously surrounded by products. But who designed these products? Why are they shaped the way they are? What user demands do they meet, what processes lie behind the development and what technologies are involved in that process? That’s what the Master’s in Industrial Design Engineering (IDE) at the University of Twente (UT) is all about.",https://www.findamasters.com/masters-degrees/course/industrial-design-engineering-msc/?i754d8913c70328,0.295455
1341,Audiology (with Clinical Placement) MSc/PgDip,University of Southampton,Study our full-time 2 year audiology master’s with clinical placement and combine scientific learning with practical training. Our 3 year course offers a part-time clinical placement over 2 years. Gain the skills to work in clinical audiology. UK applicants who complete the course are eligible to register withThe Registration Council for Clinical Physiologists (RCCP)and to work in the NHS as an audiologist. Our graduates from this degree have become leaders in the field both in the UK and abroad.,https://www.findamasters.com/masters-degrees/course/audiology-with-clinical-placement-msc-pgdip/?i349d6709c4214,0.288889


In this second case we note that using cosine similarity we will have by not considering the additional score a higher score. This is because cosine similarity takes into account the frequency of the query words in the text with which it is compared, while jaccard similarity bases its score only on the presence/absence of the query words in the text. Not surprisingly, we get two different dataframes as output.

Please Note: I used the hide warning because the score function used before was printing me the desired result but with a sklearn warning since empty or null values could happen in the function (specifically in columns like "isItFullTime").


In [26]:
import warnings

warnings.filterwarnings("ignore")

k=450
top_k_results = Research2(Query, masters_df, k)
pd.set_option("display.max_colwidth", None)

warnings.filterwarnings("default")

top_k_results

Unnamed: 0_level_0,Similarity,courseName,universityName,description,url
Index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3117,0.407913,Educational Leadership and Management MSc,University of Portsmouth,"If you're currently working, or aspire to work in an education-connected role and want to gain a qualification that demonstrates that you have developed your leadership and management skills, our Educational Leadership and Management course is for you.Offered as a MSc, PgDip and PgCert, you can study this course full-time, part-time or part-time by distance learning. The MSc and PgDip take one year full time and two years part time. The PgCert takes six months full time and one year part time.",https://www.findamasters.com/masters-degrees/course/educational-leadership-and-management-msc/?i343d8998c43995
3115,0.407913,Educational Leadership and Management (Distance Learning) MSc,University of Portsmouth,"If you're currently working, or aspire to work in an education-connected role and want to gain a qualification that demonstrates that you have developed your leadership and management skills, our Educational Leadership and Management course is for you.Offered as a MSc, PgDip and PgCert, you can study this course full-time, part-time or part-time by distance learning. The MSc and PgDip take one year full time and two years part time. The PgCert takes six months full time and one year part time.",https://www.findamasters.com/masters-degrees/course/educational-leadership-and-management-distance-learning-msc/?i343d8998c12738
1697,0.341545,Business and Management MSc,University of Stirling,"Our MSc Business and Management course gives you the knowledge and skills you need to be an inspiring, effective and resilient manager. We combine management theory with development of practical skills, and our course is taught by internationally recognised subject experts. It can be taken on a 12-month full-time basis or part time over 24",https://www.findamasters.com/masters-degrees/course/business-and-management-msc/?i352d8075c37803
2381,0.329910,Construction Project Management MSc,Liverpool John Moores University,Learn how to lead and integrate a team of multi-disciplinary professionals on LJMU's Construction Project Management Masters.This course is available to studyfull-timeorpart-time.,https://www.findamasters.com/masters-degrees/course/construction-project-management-msc/?i147d3863c9219
5816,0.327757,Master of Science in Engineering Enterprise Management,The Hong Kong University of Science and Technology,"In today's knowledge-based economy, companies and enterprises must compete relentlessly in terms of costs, quality and time to market themselves in the global context. Knowledge of cutting-edge management techniques, such as Logistics Management, Supply Chain Management, Six Sigma Processes, and knowledge in the effective deployment of information technology, are necessary to help companies compete successfully in the global arena. In addition, a good foundation in basic managerial training, such as operations management, project management, people management and basic business management, is essential in launching a management career.",https://www.findamasters.com/masters-degrees/course/master-of-science-in-engineering-enterprise-management/?i1175d8756c57657
...,...,...,...,...,...
5957,0.124847,Master's in Business Analysis and Performance Management,NHH Norwegian School of Economics,"The Business Analysis and Performance Management (BUS) major provides students with insights into how value creation in firms and organisations can be measured, analysed and managed in the best possible way.A good understanding of how different governance mechanisms work is a prerequisite for the entire organisation to work towards the same goals and visions.Business analysis is the foundation for decisions concerning investments, product- and customer portfolio, pricing, production management and logistics. Financial accounting is a central communication tool towards external stakeholders such as investors and borrowers.",https://www.findamasters.com/masters-degrees/course/master-s-in-business-analysis-and-performance-management/?i1488d6851c58871
367,0.123927,MSc Management with optional specialization,The London Institute of Banking and Finance,"Enhance your career with our Master's in Management. Our programme equips you with the skills to make impactful strategic business decisions and drive success on a global scale.Our flexible online learning lets you complete the degree at your pace, while our cutting-edge platform and expert support ensure your success. You’ll also have the option to select between two alternatives study paths – our standard management programme or a programme with a specialisation. Join us to connect with a community of leading academics, experts, and students worldwide, and achieve your career aspirations in management.",https://www.findamasters.com/masters-degrees/course/msc-management-with-optional-specialization/?i394d8872c71606
5295,0.123927,Management of Complex Projects MSc,University College London,"In today's fast-moving, global marketplace, traditional project management techniques based around strategic planning and control are not enough. Many modern projects are not only complicated (tough technical problems needing co-ordination across many suppliers) but also complex (with unclear requirements). The new Management of Complex Projects MSc has been designed to produce project managers equipped to lead the most challenging of projects.",https://www.findamasters.com/masters-degrees/course/management-of-complex-projects-msc/?i274d1844c40245
564,0.123927,Accounting MSc,University of Chester,"This MSc Accounting course supports the development of your professional career in business, accounting and financial management.This course is aligned to the Association of Chartered Certified Accountants (ACCA) professional examinations.Successful completion of the course will enhance your understanding of financial management and reporting, performance management, risk and corporate governance.Students will develop an understanding of a range of research methods and analytical techniques through an introduction to research methods.",https://www.findamasters.com/masters-degrees/course/accounting-msc/?i292d4965c68264


# 4. Visualizing the most relevant MSc degrees

Firstly we install the packages we will need for our task

In [None]:
pip install geopy

In [None]:
from geopy.geocoders import Nominatim
import folium
from folium.plugins import MarkerCluster

We proceed to extract from the originl dataframe the lines corrisponding to the results of exercise 3, and we do a quick visualization of our data info

In [31]:
df_tomap=masters.loc[result_df.index]

df_tomap.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 2364 to 2681
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   courseName      40 non-null     object 
 1   universityName  40 non-null     object 
 2   facultyName     40 non-null     object 
 3   isItFullTime    40 non-null     object 
 4   description     40 non-null     object 
 5   startDate       40 non-null     object 
 6   fees            40 non-null     object 
 7   modality        40 non-null     object 
 8   duration        40 non-null     object 
 9   city            40 non-null     object 
 10  country         40 non-null     object 
 11  administration  35 non-null     object 
 12  url             40 non-null     object 
 13  descr_stem      40 non-null     object 
 14  fees_usd        1 non-null      float64
 15  combined        40 non-null     object 
dtypes: float64(1), object(15)
memory usage: 5.3+ KB


To extract the longitude and latitude of each city we use the geopy library and its tool Nominatim

In [32]:
geo_lat_long = Nominatim(user_agent="HW3_geocoder")  # tool for geolocalization

We add to our dataframe the columns latitude and longitude, filled with the values extracted with our tool applied to all the entries in the column 'city'.

In [33]:
df_tomap.loc[:, 'latitude'] = df_tomap['city'].apply(lambda localita: geo_lat_long.geocode(localita).latitude if geo_lat_long.geocode(localita) is not None else None)

In [34]:
df_tomap.loc[:, 'longitude'] = df_tomap['city'].apply(lambda localita: geo_lat_long.geocode(localita).longitude if geo_lat_long.geocode(localita) is not None else None)

  and should_run_async(code)


We check if there are missing values in any of the columns we are going to use for the map

In [35]:
df_tomap.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 2364 to 2681
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   courseName      40 non-null     object 
 1   universityName  40 non-null     object 
 2   facultyName     40 non-null     object 
 3   isItFullTime    40 non-null     object 
 4   description     40 non-null     object 
 5   startDate       40 non-null     object 
 6   fees            40 non-null     object 
 7   modality        40 non-null     object 
 8   duration        40 non-null     object 
 9   city            40 non-null     object 
 10  country         40 non-null     object 
 11  administration  35 non-null     object 
 12  url             40 non-null     object 
 13  descr_stem      40 non-null     object 
 14  fees_usd        1 non-null      float64
 15  combined        40 non-null     object 
 16  latitude        40 non-null     float64
 17  longitude       40 non-null     

  and should_run_async(code)


We use the library Folium to create a map where we will locate all our courses. For the courses in the same city we utilize a clustering tool that, if clicked, will return all the courses in that specific location:

In [36]:
# We create a map and we center it on the mean of the coordinates
mappa = folium.Map(location=[df_tomap['latitude'].mean(), df_tomap['longitude'].mean()], zoom_start=5)
cluster = MarkerCluster().add_to(mappa)

  and should_run_async(code)


In [37]:
# We customize the popup and the markers that will appear for each course
for indice, riga in df_tomap.iterrows():
    popup_html = f"<strong>Course:</strong> {riga['courseName']}<br><strong>University:</strong> {riga['universityName']}<br><strong>Fees:</strong> {riga['fees']}<br><strong>City:</strong></font> {riga['city']}<br><strong>Country:</strong> {riga['country']}"
    folium.Marker(location=[riga['latitude'], riga['longitude']], popup=folium.Popup(popup_html, max_width=300)).add_to(cluster)

We finally save the map in an html format

In [38]:
mappa.save('map.html')

# 5. BONUS: More complex search engine

# 6. Command Line Question

First we are going to merge to create the .tsv file we will be working with.
The file was created with ">" just the names of each category we have and then we added all the course rows via ">>"

In [None]:
#!/bin/bash
cat tsv/column_names.tsv > merged_courses.tsv
cat tsv/course_*.tsv >> merged_courses.tsv

From now on we will use the command "awk -F '/t'" to make sure that the .tsv created is tabulated well and that you can navigate within the column.


## Country with the highest number of Master's degrees

For the first question via "print 11" prints all the countries in the dataset, they are then put in order with "sort," duplicates removed, and a count done for each tamite "uniq -c.".
They were then put in descending order based on the count just done with "sort -nr" and finally only the first row with country name and college number is extracted through

"awk '{print 2 " " 3 " " 1}' | head -n 1".


In [None]:
most_courses_country=$(awk -F'\t' '{print $11}' merged_courses.tsv | sort | uniq -c | sort -nr | awk '{print $2 " " $3 " " $1}' | head -n 1)
echo "Country that offers the most Master's Degrees: $most_courses_country"

## City with the highest number of Master's degrees

The same commands listed and explained before are used here only difference takes into account column 10 which corresponds to cities.


In [None]:
most_courses_city=$(awk -F'\t' '{print $10}' merged_courses.tsv | sort | uniq -c | sort -nr | awk '{print $2 " " $1}' | head -n 1)
echo "City offers the most Master's Degrees: $most_courses_city"

## Number of colleges with part-time education

In this other question, the command "$4 ~ /Part time/" is used to blind and select the rows that have the words "Part time" in the fourth column, the command "wc -l" is used to count the rows obtained.


In [None]:
num_part_time_colleges=$(awk -F'\t' '$4 ~ /Part time/ {print $2}' merged_courses.tsv | sort | uniq | wc -l)
echo "Number of colleges offer Part-Time education: $num_part_time_colleges"

## Percentage of courses in engineering

As in the last question we look in the first column for rows that have this time "Engineering|Engineer" and that is, that have in their name the word "Engineering" or ("|") "Engineer"

In this last question, I specifically used the command "printf \"%.2f\"" which goes to format the output to 2 decimal places, and I used "BEGIN{...}" which if I had not used it the calculation would have been performed for each line of the input file, resulting in incorrect behavior for the desired objective.

This is because this command causes everything in the curly brackets to be executed before scrolling through all the rows.


In [None]:
engineering_courses=$(awk -F '\t' '$1 ~ /Engineering|Engineer/ {print}' merged_courses.tsv | wc -l)
percentage=$(awk "BEGIN {printf \"%.2f\", ($engineering_courses /6000) * 100}")
echo "Percentage of courses in Engineering: $percentage%"

# 7. Algorithmic Question

To solve the problem we create the function fakereport(), which firstly checks if the report can be created with the constraints given, and if the answer is yes, it iterates over all the combinations of hours for each day until it finds the one that is equal to sumhours.

In [43]:
from itertools import product

def fakereport():
    d,sumhours=map(int,input().split())
    mintime=[]
    maxtime=[]
    for i in range(0,d):                         # O(d)
        mini,maxi=map(int,input().split())
        mintime.append(mini)
        maxtime.append(maxi)
    if sum(mintime)>sumhours or sum(maxtime)<sumhours:              # O(1)
        print('NO')
    else:
        liste= [list(range(mintime[i],maxtime[i]+1)) for i in range(d)]   # O(d)
        for i in list(product(*[lista for lista in liste])):              # O(n^d)  where n is the max length a list of daily working hours can be
            if sum(i) == sumhours:                                        # O(n^d)
                print('YES')                                              # O(1)
                print(' '.join(map(str, i)))                              # O(1)
                break


### Tests

In [46]:
fakereport()

2 5
0 1
3 5
YES
0 5


In [47]:
fakereport()

1 1
5 6
NO




### What is the time complexity (the Big O notation) of your solution? Please provide a detailed explanation of how you calculated the time complexity.



To  check the time complexity of the function each operation of the algorithm was analyzed. It was noted that the most time consuming operation was the iteration over the cartesian product of the lists containing the number of hours worked everyday.

The conclusion is that the maximun time that this funtion can run is equal to the longest possible list of hours worked every day to the power of d, the days worked, so the Big O notation would be O(n^d).

### Ask ChatGPT or any other LLM chatbot tool to check your code's time complexity (the Big O notation). Compare your answer to theirs. Do you believe this is correct? If the two differ, which one is right? (why?)

The following is the analysis of ChatGPT:

1. **Reading `d` and `sumhours`:** This involves constant time operations, making it O(1) complexity.

2. **For loop to read `mini` and `maxi`:** The for loop runs for `d` iterations. Within each iteration, there are constant time operations. Thus, the complexity of this block is O(d).

3. **Creating the lists `mintime` and `maxtime`:** `d` operations are performed to add elements to the lists. So, the complexity of this part is O(d).

4. **Creating the list `liste`:** This part involves creating a nested list that can have up to `maxtime[i] - mintime[i] + 1` elements. Therefore, the complexity of this part is O(n), where `n` is the sum of the lengths of the lists created.

5. **For loop using the `product` function:** The `product` function generates the Cartesian product of all the lists in `liste`. The complexity of this part is O(m^n), where `m` is the maximum length of a sublist in `liste` (fixed at `sumhours`) and `n` is the total number of sublists (fixed at `d`).

In terms of the overall asymptotic time complexity, the dominant part is the for loop using the `product` function. Therefore, the total time complexity of your algorithm seems to be O(sumhours^d).


We can see that the two solution are pretty similar wih the exception that ChatGPT considers sumhours as the max length a list can be. Actually, this is not specified so the lists could be longer, but thinking about it it's correct, because, if the contraint are respected, the function will never iterate over a longer list, because the solution would surely be found before.

### What do you think of the optimality of your code? Do you believe it is optimal? Can you improve? Please elaborate on your response.

Looking at my code, the thing that could improve the optimality of the funtion is probably the implementation of the creation of the lists containing all the hours worked in a single day in the for loop containing the product function, so that everything is executed at once.

But since the operation of creating the lists is way shorter than the *product function that makes the cartesian product, and merging the two wouldn't improve the asimptotical behavior of the function, I decided to put the two operation in two different lines to make the code more readable.