## 2. Search Engine

Now, we want to create two different Search Engines that, given as input a query, return the courses that match the query.

### 2.0 Preprocessing 

### 2.0.0)  Preprocessing the text

First, you must pre-process all the information collected for each MSc by:

1. Removing stopwords
2. Removing punctuation
3. Stemming
4. Anything else you think it's needed
   
For this purpose, you can use the [`nltk library](https://www.nltk.org/).

### 2.0.1) Preprocessing the fees column

Moreover, we want the field ```fees``` to collect numeric information. As you will see, you scraped textual information for this attribute in the dataset: sketch whatever method you need (using regex, for example, to find currency symbol) to collect information and, in case of multiple information, retrieve only the highest fees. Finally, once you have collected numerical information, you likely will have different currencies: this can be chaotic, so let chatGPT guide you in the choice and deployment of an API to convert this column to a common currency of your choice (it can be USD, EUR or whatever you want). Ultimately, you will have a ```float``` column renamed ```fees (CHOSEN COMMON CURRENCY)```.

### 2.1. Conjunctive query
For the first version of the search engine, we narrowed our interest to the __description__ of each course. It means that you will evaluate queries only concerning the course's description.

### 2.1.1) Create your index!

Before building the index, 
* Create a file named `vocabulary`, in the format you prefer, that maps each word to an integer (`term_id`).

Then, the first brick of your homework is to create the Inverted Index. It will be a dictionary in this format:

```
{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}
```
where _document\_i_ is the *id* of a document that contains that specific word.

__Hint:__ Since you do not want to compute the inverted index every time you use the Search Engine, it is worth thinking about storing it in a separate file and loading it in memory when needed.

#### 2.1.2) Execute the query
Given a query input by the user, for example:

```
advanced knowledge
```

The Search Engine is supposed to return a list of documents.

##### What documents do we want?
Since we are dealing with conjunctive queries (AND), each returned document should contain all the words in the query.
The final output of the query must return, if present, the following information for each of the selected documents:

* `courseName`
* `universityName`
* `description`
* `URL`

__Example Output__ for ```advanced knowledge```: (please note that our examples are made on a small batch of the full dataset)

<p align="center">
<img src="img/output1.png" width = 1000>
</p>

If everything works well in this step, you can go to the next point and make your Search Engine more complex and better at answering queries.


### 2.2) Conjunctive query & Ranking score

For the second search engine, given a query, we want to get the *top-k* (the choice of *k* it's up to you!) documents related to the query. In particular:

* Find all the documents that contain all the words in the query.
* Sort them by their similarity with the query.
* Return in output *k* documents, or all the documents with non-zero similarity with the query when the results are less than _k_. You __must__ use a heap data structure (you can use Python libraries) for maintaining the *top-k* documents.

To solve this task, you must use the *tfIdf* score and the _Cosine similarity_. The field to consider is still the `description`. Let's see how.


#### 2.2.1) Inverted index
Your second Inverted Index must be of this format:

```
{
term_id_1:[(document1, tfIdf_{term,document1}), (document2, tfIdf_{term,document2}), (document4, tfIdf_{term,document4}), ...],
term_id_2:[(document1, tfIdf_{term,document1}), (document3, tfIdf_{term,document3}), (document5, tfIdf_{term,document5}), (document6, tfIdf_{term,document6}), ...],
...}
```

Practically, for each word, you want the list of documents in which it is contained and the relative *tfIdf* score.

__Tip__: *TfIdf* values are invariant for the query. Due to this reason, you can precalculate and store them accordingly.

#### 2.2.2) Execute the query

In this new setting, given a query, you get the proper documents (i.e., those containing all the query's words) and sort them according to their similarity to the query. For this purpose, as the scoring function, we will use the Cosine Similarity concerning the *tfIdf* representations of the documents.

Given a query input by the user, for example:
```
advanced knowledge
```
The search engine is supposed to return a list of documents, __ranked__ by their Cosine Similarity to the query entered in the input.

More precisely, the output must contain:
* `courseName`
* `universityName`
* `description`
* `URL`
* The similarity score of the documents with respect to the query (float value between 0 and 1)
  
__Example Output__ for ```advanced knowledge```:

<p align="center">
<img src="img/output2.png" width = 1000>
</p>

### **Preprocessing**

In [2]:
# Libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
import re
from collections import Counter
from functools import reduce
import json

# NLTK Download
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/petraudovicic/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/petraudovicic/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
# Read the TSV data
df = pd.read_csv("TSV/course_1.tsv", sep="\t", index_col=False)

for i in range(2, 6001):
    try:
        df1 = pd.read_csv(
            "TSV/course_" + str(i) + ".tsv",
            sep="\t",
            index_col=False,
        )
        df1.index += i - 1
        df = pd.concat([df, df1])
    except Exception as e:
        print(i)
        print("Error: ", e)

df.head()

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url,Unnamed: 13
0,Computer Science - MSc,University of Hertfordshire,"School of Physics, Engineering and Computer Sc...",Full time,Why choose Herts?Industry Accreditation: Accre...,See Course,UK Students Full time: £9450 for the 2022/202...,MSc,"1 year full-time, 15 months full-time, 3 years...",Hatfield,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,
1,Computer Science (Cyber Security) - MSc,Staffordshire University,"School of Digital, Technologies and Arts",Full time,Join the fight against malicious programs and ...,September,Find the specific fees for your chosen program...,MSc,13 months - 25 months,Stoke on Trent,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,
2,Computer Science (Data Science) - MSc,Trinity College Dublin,School of Computer Science & Statistics,Full time,The MSc in Computer Science is an exciting one...,September,Please see the university website for further ...,MSc,1 year full-time,Dublin,Ireland,On Campus,https://www.findamasters.com/masters-degrees/c...,
3,Computer Science (by Research) - MSc,Lancaster University,School of Computing and Communications,Full time,The MSc by Research programme can be tailored ...,See Course,Please see the university website for further ...,MSc,"12 months full-time, 24 months part time",Lancaster,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,
4,Computer Science (Computer Networks and Securi...,Staffordshire University,"School of Digital, Technologies and Arts",Full time,Secure your future career with our Computer Sc...,September,Find the specific fees for your chosen program...,MSc,13 months - 25 months,Stoke on Trent,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,


Removing stopwords:

In [4]:
def stopless(text):
    if isinstance(text, str):
        words = word_tokenize(text)
        stop_words = set(stopwords.words("english"))
        filtered_words = [word for word in words if word.lower() not in stop_words]
        return " ".join(filtered_words)
    else:
        return text

In [5]:
df_preprocessed = df.applymap(stopless)
df_preprocessed.head()

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url,Unnamed: 13
0,Computer Science - MSc,University Hertfordshire,"School Physics , Engineering Computer Science",Full time,choose Herts ? Industry Accreditation : Accred...,See Course,UK Students Full time : £9450 2022/2023 academ...,MSc,"1 year full-time , 15 months full-time , 3 yea...",Hatfield,United Kingdom,Campus,https : //www.findamasters.com/masters-degrees...,
1,Computer Science ( Cyber Security ) - MSc,Staffordshire University,"School Digital , Technologies Arts",Full time,Join fight malicious programs cybercrime Compu...,September,Find specific fees chosen programme website,MSc,13 months - 25 months,Stoke Trent,United Kingdom,Campus,https : //www.findamasters.com/masters-degrees...,
2,Computer Science ( Data Science ) - MSc,Trinity College Dublin,School Computer Science & Statistics,Full time,MSc Computer Science exciting one-calendar-yea...,September,Please see university website information fees...,MSc,1 year full-time,Dublin,Ireland,Campus,https : //www.findamasters.com/masters-degrees...,
3,Computer Science ( Research ) - MSc,Lancaster University,School Computing Communications,Full time,MSc Research programme tailored individual res...,See Course,Please see university website information fees...,MSc,"12 months full-time , 24 months part time",Lancaster,United Kingdom,Campus,https : //www.findamasters.com/masters-degrees...,
4,Computer Science ( Computer Networks Security ...,Staffordshire University,"School Digital , Technologies Arts",Full time,Secure future career Computer Science ( Comput...,September,Find specific fees chosen programme website,MSc,13 months - 25 months,Stoke Trent,United Kingdom,Campus,https : //www.findamasters.com/masters-degrees...,


Removing punctuation:

In [6]:
def punct(text):
    if isinstance(text, str):
        words = word_tokenize(text)
        filtered_words = [
            word for word in words if word.lower() not in string.punctuation
        ]
        return " ".join(filtered_words)
    else:
        return text

In [7]:
df_preprocessed = df_preprocessed.applymap(punct)
df_preprocessed.head()

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url,Unnamed: 13
0,Computer Science MSc,University Hertfordshire,School Physics Engineering Computer Science,Full time,choose Herts Industry Accreditation Accredited...,See Course,UK Students Full time £9450 2022/2023 academic...,MSc,1 year full-time 15 months full-time 3 years p...,Hatfield,United Kingdom,Campus,https //www.findamasters.com/masters-degrees/c...,
1,Computer Science Cyber Security MSc,Staffordshire University,School Digital Technologies Arts,Full time,Join fight malicious programs cybercrime Compu...,September,Find specific fees chosen programme website,MSc,13 months 25 months,Stoke Trent,United Kingdom,Campus,https //www.findamasters.com/masters-degrees/c...,
2,Computer Science Data Science MSc,Trinity College Dublin,School Computer Science Statistics,Full time,MSc Computer Science exciting one-calendar-yea...,September,Please see university website information fees...,MSc,1 year full-time,Dublin,Ireland,Campus,https //www.findamasters.com/masters-degrees/c...,
3,Computer Science Research MSc,Lancaster University,School Computing Communications,Full time,MSc Research programme tailored individual res...,See Course,Please see university website information fees...,MSc,12 months full-time 24 months part time,Lancaster,United Kingdom,Campus,https //www.findamasters.com/masters-degrees/c...,
4,Computer Science Computer Networks Security MSc,Staffordshire University,School Digital Technologies Arts,Full time,Secure future career Computer Science Computer...,September,Find specific fees chosen programme website,MSc,13 months 25 months,Stoke Trent,United Kingdom,Campus,https //www.findamasters.com/masters-degrees/c...,


Stemming:

In [8]:
def stem(text):
    if isinstance(text, str):
        ps = PorterStemmer()
        words = word_tokenize(text)
        stemmed_words = [ps.stem(word) for word in words]
        return " ".join(stemmed_words)
    else:
        return text

In [9]:
df_preprocessed = df_preprocessed.applymap(stem)
df_preprocessed.head()

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url,Unnamed: 13
0,comput scienc msc,univers hertfordshir,school physic engin comput scienc,full time,choos hert industri accredit accredit british ...,see cours,uk student full time £9450 2022/2023 academ ye...,msc,1 year full-tim 15 month full-tim 3 year part-tim,hatfield,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...,
1,comput scienc cyber secur msc,staffordshir univers,school digit technolog art,full time,join fight malici program cybercrim comput sci...,septemb,find specif fee chosen programm websit,msc,13 month 25 month,stoke trent,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...,
2,comput scienc data scienc msc,triniti colleg dublin,school comput scienc statist,full time,msc comput scienc excit one-calendar-year prog...,septemb,pleas see univers websit inform fee cours,msc,1 year full-tim,dublin,ireland,campu,http //www.findamasters.com/masters-degrees/co...,
3,comput scienc research msc,lancast univers,school comput commun,full time,msc research programm tailor individu research...,see cours,pleas see univers websit inform fee cours,msc,12 month full-tim 24 month part time,lancast,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...,
4,comput scienc comput network secur msc,staffordshir univers,school digit technolog art,full time,secur futur career comput scienc comput networ...,septemb,find specif fee chosen programm websit,msc,13 month 25 month,stoke trent,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...,


removing special characters:

### 2.0.1) Preprocessing the fees column

Moreover, we want the field ```fees``` to collect numeric information. As you will see, you scraped textual information for this attribute in the dataset: sketch whatever method you need (using regex, for example, to find currency symbol) to collect information and, in case of multiple information, retrieve only the highest fees. Finally, once you have collected numerical information, you likely will have different currencies: this can be chaotic, so let chatGPT guide you in the choice and deployment of an API to convert this column to a common currency of your choice (it can be USD, EUR or whatever you want). Ultimately, you will have a ```float``` column renamed ```fees (CHOSEN COMMON CURRENCY)```.

In [10]:
#a function that will take in a string fee and return just the numeric part of it as a float
def convert_to_numeric(value):
    # Removing currency symbols, commas, and spaces
    value = re.sub(r"eur|sek|chf|gbp|rmb|jpy|qr|[£€]|,|\s", "", value)
    return float(value)


def find_fees(text):
    if isinstance(text, str):
        # Removing patterns that contain years from the text ex. 2022/2023 so that our regex doesn't recognize it as a part of the fee
        text = re.sub(r"\b\d{4}/\d{4}\b|\b\d{4}/\d{2}\b", "", text)

        # Regular expression pattern for currency values
        currency_pattern = r"((eur|sek|chf|gbp|rmb|jpy|qr|[£€])\s?\d+(?:[.,\s]\d{3})*(?:[.,]\d{2})?|\d+(?:[.,\s]\d{3})*(?:[.,]\d{2})?\s?(eur|sek|chf|gbp|rmb|jpy|qr|[£€]))"
        matches = re.findall(currency_pattern, text)

        # Exchange rates
        exchange_rates = {
            "SEK": 0.08588,
            "GBP": 1.1443,
            "CHF": 1.03708,
            "JPY": 0.00618,
            "QR": 0.25672,
            "RMB": 0.12892,
        }

        # Converting to euros and calculating values
        numeric_values = []
        for value in matches:
            value_numeric = convert_to_numeric(value[0])
            currency = value[1].upper() 
            numeric_values.append(value_numeric * exchange_rates.get(currency, 1)) #converting all the fees to euros

        # Returning the maximum value or None if no values
        return max(numeric_values) if numeric_values else None
    else:
        return text

In [11]:
# Applying the function to the dataframe
df_preprocessed["fees"] = df_preprocessed["fees"].apply(find_fees)
df_preprocessed.rename(columns={"fees": "fees (euro)"}, inplace=True)
df_preprocessed[df_preprocessed["fees (euro)"].notna()].head()

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees (euro),modality,duration,city,country,administration,url,Unnamed: 13
0,comput scienc msc,univers hertfordshir,school physic engin comput scienc,full time,choos hert industri accredit accredit british ...,see cours,16500.0,msc,1 year full-tim 15 month full-tim 3 year part-tim,hatfield,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...,
29,clinic cognit neurosci msc,sheffield hallam univers,postgradu cours,full time,develop broad rang practic skill essenti work ...,septemb,10310.0,msc,1 year full-tim 2 year part-tim,sheffield,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...,
49,fashion forecast data analysi ma/msc,univers creativ art,busi school creativ industri,full time,uca 's new msc degre fashion forecast data ana...,septemb,10500.0,msc,1 year full time,farnham,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...,
50,facad engin msc,univers west england bristol,depart architectur built environ,full time,façad engin disciplin right large-scal commerc...,septemb,11500.0,msc,1 year full time 2 year part time,bristol,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...,
51,fashion tech special master,poli.design società consortil responsabilità l...,postgradu cours,full time,fashion tech design decis role fashion lifesty...,april,11000.0,msc,13 month,milan,itali,campu,http //www.findamasters.com/masters-degrees/co...,


### 2.1. Conjunctive query
For the first version of the search engine, we narrowed our interest to the __description__ of each course. It means that you will evaluate queries only concerning the course's description.

#### **Vocabulary**
Create a file named `vocabulary`, in the format you prefer, that maps each word to an integer (`term_id`).

Extracting all the words and giving them an unique id:

In [12]:
def vocabulary_df(df):
    all_words = [
        word
        for description in df["description"]
        if isinstance(description, str)
        for word in description.split()
    ]

    word_counts = Counter(all_words)

    # Assign a unique ID to each word
    vocabulary = {
        word: idx for idx, (word, count) in enumerate(word_counts.items(), start=1)
    }
    return vocabulary

In [13]:
vocabulary = vocabulary_df(df_preprocessed)

the first brick of your homework is to create the Inverted Index. It will be a dictionary in this format:

```
{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}
```
where _document\_i_ is the *id* of a document that contains that specific word.

In [26]:
def inverted_index_vocabulary(df, vocabulary):
    inverted_index = {vocabulary[word]: [] for word in vocabulary}

    # Populating the inverted index, processing only string descriptions
    for doc_id, description in enumerate(df["description"], start=1):
        if isinstance(description, str):
            words = set(description.split())  # to avoid duplicate entries
            for word in words:
                if word in vocabulary:
                    inverted_index[vocabulary[word]].append(doc_id)

    return inverted_index

In [27]:
inverted_index = inverted_index_vocabulary(df_preprocessed, vocabulary)

In [16]:
# Writing vocabulary and inverted_index for easier loading later on
with open("vocabulary.json", "w") as vocab_file:
    json.dump(vocabulary, vocab_file)

with open("inverted_index.json", "w") as index_file:
    json.dump(inverted_index, index_file)

#### 2.1.2) Execute the query

In [17]:
# Loading files
with open("vocabulary.json", "r") as vocab_file:
    vocabulary = json.load(vocab_file)

with open("inverted_index.json", "r") as index_file:
    inverted_index = json.load(index_file)

Extracting ids of the words in the query:

In [18]:
def process_query(query):
    query_terms = query.split()
    query_term_ids = [
        vocabulary.get(term) for term in query_terms if term in vocabulary
    ]
    return query_term_ids

Using these ids to find all the documents containing all the words in the query:

In [19]:
def search_documents(query_term_ids):
    # Retrieve document lists for each term in the query
    document_lists = [
        inverted_index.get(str(term_id), []) for term_id in query_term_ids
    ]

    # Find the intersection of these lists
    if document_lists:
        common_documents = set(document_lists[0]).intersection(*document_lists[1:])
        return sorted(common_documents)
    else:
        return []

In [20]:
def query_execution(query):
    query_preprocess = stem(punct(stopless(query)))

    # Processing the query
    query_term_ids = process_query(query_preprocess)
    # Searching for documents
    matching_doc_ids = search_documents(query_term_ids)
    # Retrieving and displaying information
    if matching_doc_ids:
        return df.loc[list(matching_doc_ids), ['courseName', 'universityName', 'description']]
    else:
        print("No matching documents found.")
        return None

In [21]:
# Example query
query = "cyber security"
matching_doc_df = query_execution(query)
matching_doc_df.head(10)

Unnamed: 0,courseName,universityName,description
2,Computer Science (Data Science) - MSc,Trinity College Dublin,The MSc in Computer Science is an exciting one...
633,MSc in Healthcare Leadership,University of Hull,Start date: January 2024Study healthcare leade...
714,Marketing - MSc,Cardiff University,Why study this courseBring your interests and ...
723,"Cybercrime, Terrorism and Security",University of Portsmouth,This course is still being set up. For more in...
730,Cybercrime and Digital Investigation MSc,Middlesex University,As our lives become increasingly digitised the...
732,Cyberphysical Systems 2 year MSc,University of Nottingham,Cyber physical systems integrate computation w...
1056,Advanced Computer Science with Data Science,University of Strathclyde,Our MSc Advanced Computer Science with Data Sc...
1058,Advanced Computing - MSc,Imperial College London,This course is aimed at students who have a su...
1099,MSc Criminology and Criminal Psychology,University of Essex Online,"Start Date: September, OctoberDevelop your ski..."
1104,MSc Data Science,University of Essex Online,Start Date: OctoberUse the power of data to ma...


### 2.2) Conjunctive query & Ranking score

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tfidf = TfidfVectorizer()
df_preprocessed.description = ["" if pd.isna(doc) else doc for doc in df_preprocessed.description]
results = tfidf.fit_transform(df_preprocessed.description)
i=0
inverted_index2=dict({})
while(i<len(vocabulary)): #jel ovo dobra granica???
    j=0
    inverted_index2[i]=list([])
    while(j<6000):
        inverted_index2[i].append((j,results[j, i]))
        j=j+1
    i=i+1
print(inverted_index2)

# results je ovog oblika: {(course_id, word_id): tfidf} -> nama treba ovaj oblik: {word_id1: (course_id1, tfidf1), (course_id2, tfidf2), (course_id3, tfidf3)}

KeyboardInterrupt: 