## 2. Search Engine

Now, we want to create two different Search Engines that, given a query as input, return the courses that match it.

### 2.0) Preprocessing

#### 2.0.0) Preprocessing the text

First, you must pre-process all the information collected for each MSc by:

1. Removing stopwords
2. Removing punctuation
3. Stemming
4. Anything else you think is needed

For this purpose, you can use the [`nltk` library](https://www.nltk.org/).

#### 2.0.1) Preprocessing the fees column

Moreover, we want the ```fees``` field to hold numeric information. As you will see, the value scraped for this attribute is free text: use whatever method you need (for example, a regex that looks for currency symbols) to extract the amounts and, when a course lists several fees, keep only the highest one.

Once you have numeric values, you will likely find several different currencies. This can be chaotic, so let ChatGPT guide you in choosing and deploying an API to convert the column to a common currency of your choice (USD, EUR, or whatever you want). In the end, you will have a ```float``` column renamed ```fees (CHOSEN COMMON CURRENCY)```.
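A minimal sketch of the extraction step, assuming each fee is written as a currency symbol followed by a number (for example `£9450` or `€12,500`). The function name `highest_fee` and the symbol-to-code table are illustrative assumptions, not part of the assignment, and currency conversion is left out because it depends on the API you choose.

```python
import re

# Hypothetical mapping from currency symbols to ISO codes; extend as needed.
SYMBOL_TO_CODE = {"£": "GBP", "€": "EUR", "$": "USD"}

# A currency symbol, optional spaces, then digits with optional thousands
# separators and an optional decimal part (e.g. "£9,450.50").
FEE_PATTERN = re.compile(r"([£€$])\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?")


def highest_fee(text):
    """Return (amount, currency_code) for the largest fee found, or None."""
    if not isinstance(text, str):
        return None
    fees = []
    for symbol, whole, fraction in FEE_PATTERN.findall(text):
        amount = float(whole.replace(",", "") + fraction)
        fees.append((amount, SYMBOL_TO_CODE[symbol]))
    # Compare by amount only; mixed currencies in one cell are not converted here.
    return max(fees, key=lambda fee: fee[0]) if fees else None


example = "UK: £9450 for the 2022/2023 academic year; International: £14,750"
print(highest_fee(example))
```

Rows with no recognisable amount (such as "Please see the university website") return `None`, which pandas can store as a missing value in the final `float` column.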

### 2.1) Conjunctive query
For the first version of the search engine, we narrow our interest to the __description__ of each course. This means that queries are evaluated only against the course description.

#### 2.1.1) Create your index!

Before building the index, create a file named `vocabulary`, in the format you prefer, that maps each word to an integer (`term_id`).

Then, the first brick of your homework is to create the Inverted Index. It will be a dictionary in this format:

```
{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}
```
where _document\_i_ is the *id* of a document that contains that specific word.

__Hint:__ Since you do not want to compute the inverted index every time you use the Search Engine, it is worth thinking about storing it in a separate file and loading it in memory when needed.
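The vocabulary and the inverted index can be sketched as follows, assuming the descriptions are already preprocessed and held in a dictionary `{document_id: text}`. The helper names (`build_index`, `save_json`, `load_index`), the toy documents, and the choice of JSON as the storage format are assumptions made for illustration.

```python
import json


def build_index(docs):
    """Build (vocabulary, inverted_index) from {doc_id: preprocessed text}."""
    vocabulary = {}      # word -> term_id
    inverted_index = {}  # term_id -> list of doc ids, in increasing order
    for doc_id in sorted(docs):
        # sorted(set(...)) lists each word once per document, in a stable order.
        for word in sorted(set(docs[doc_id].split())):
            term_id = vocabulary.setdefault(word, len(vocabulary))
            inverted_index.setdefault(term_id, []).append(doc_id)
    return vocabulary, inverted_index


def save_json(obj, path):
    """Write the vocabulary or the index to disk."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f)


def load_index(path):
    """Load an index; JSON stores keys as strings, so turn them back into ints."""
    with open(path, encoding="utf-8") as f:
        return {int(term_id): postings for term_id, postings in json.load(f).items()}


# Toy, already-stemmed documents used only for illustration.
docs = {
    0: "advanc knowledg comput",
    1: "comput scienc",
    2: "advanc comput knowledg",
}
vocabulary, inverted_index = build_index(docs)
print(vocabulary)
print(inverted_index)
```

Building the index once and reloading it with `load_index` avoids recomputing it on every query, as the hint suggests.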

#### 2.1.2) Execute the query
Given a query input by the user, for example:

```
advanced knowledge
```

The Search Engine is supposed to return a list of documents.

##### What documents do we want?
Since we are dealing with conjunctive queries (AND), each returned document should contain all the words in the query.
The final output of the query must return, if present, the following information for each of the selected documents:

* `courseName`
* `universityName`
* `description`
* `URL`
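A conjunctive (AND) query reduces to intersecting the posting lists of the query terms. The sketch below assumes the query has gone through the same preprocessing as the documents (so `advanced knowledge` becomes `advanc knowledg`); the function name `conjunctive_search` and the toy vocabulary and index are hypothetical.

```python
def conjunctive_search(query_terms, vocabulary, inverted_index):
    """Return the sorted ids of the documents that contain every query term."""
    postings = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            return []  # an unknown term means no document can match
        postings.append(set(inverted_index[term_id]))
    if not postings:
        return []
    # Start from the shortest posting list to keep the intersection cheap.
    postings.sort(key=len)
    return sorted(set.intersection(*postings))


# Toy vocabulary and index used only for illustration.
vocabulary = {"advanc": 0, "comput": 1, "knowledg": 2, "scienc": 3}
inverted_index = {0: [0, 2], 1: [0, 1, 2], 2: [0, 2], 3: [1]}

matches = conjunctive_search(["advanc", "knowledg"], vocabulary, inverted_index)
print(matches)
```

The returned ids can then be used to select the requested columns from the original (not preprocessed) dataset, so that the output shows readable text and working links.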

__Example Output__ for ```advanced knowledge```: (please note that our examples are made on a small batch of the full dataset)

<p align="center">
<img src="img/output1.png" width = 1000>
</p>

If everything works well in this step, you can go to the next point and make your Search Engine more complex and better at answering queries.


### 2.2) Conjunctive query & Ranking score

For the second search engine, given a query, we want to get the *top-k* (the choice of *k* is up to you!) documents related to the query. In particular:

* Find all the documents that contain all the words in the query.
* Sort them by their similarity with the query.
* Return *k* documents, or all the documents with non-zero similarity to the query when there are fewer than _k_ results. You __must__ use a heap data structure (you can use Python libraries) for maintaining the *top-k* documents.

To solve this task, you must use the *tfIdf* score and the _Cosine similarity_. The field to consider is still the `description`. Let's see how.


#### 2.2.1) Inverted index
Your second Inverted Index must be of this format:

```
{
term_id_1:[(document1, tfIdf_{term,document1}), (document2, tfIdf_{term,document2}), (document4, tfIdf_{term,document4}), ...],
term_id_2:[(document1, tfIdf_{term,document1}), (document3, tfIdf_{term,document3}), (document5, tfIdf_{term,document5}), (document6, tfIdf_{term,document6}), ...],
...}
```

In practice, for each word, you want the list of documents that contain it, together with the corresponding *tfIdf* score.

__Tip__: *TfIdf* values do not depend on the query, so you can precompute them once and store them.
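A minimal sketch of the second index. It assumes one common tf-idf variant, which the assignment does not prescribe: `tf` is the term count divided by the document length, and `idf` is `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. The function name `build_tfidf_index` and the toy data are illustrative.

```python
import math
from collections import Counter


def build_tfidf_index(docs, vocabulary):
    """Return {term_id: [(doc_id, tfidf), ...]} from {doc_id: preprocessed text}."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for counts in term_counts.values() for word in counts)

    index = {}
    for doc_id in sorted(term_counts):
        counts = term_counts[doc_id]
        total = sum(counts.values())
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[word])
            index.setdefault(vocabulary[word], []).append((doc_id, tf * idf))
    return index


# Toy, already-stemmed documents used only for illustration.
docs = {
    0: "advanc knowledg comput",
    1: "comput scienc",
    2: "advanc comput knowledg",
}
vocabulary = {"advanc": 0, "comput": 1, "knowledg": 2, "scienc": 3}
tfidf_index = build_tfidf_index(docs, vocabulary)
print(tfidf_index[vocabulary["scienc"]])
```

With this particular `idf`, a word that appears in every document (here `comput`) gets a score of zero; smoothed variants avoid that if it is not the behaviour you want.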

#### 2.2.2) Execute the query

In this new setting, given a query, you retrieve the matching documents (i.e., those containing all the query's words) and sort them by their similarity to the query. As the scoring function, we will use the Cosine Similarity between the *tfIdf* representations of the query and of each document.

Given a query input by the user, for example:
```
advanced knowledge
```
The search engine is supposed to return a list of documents, __ranked__ by their Cosine Similarity to the query entered in the input.

More precisely, the output must contain:
* `courseName`
* `universityName`
* `description`
* `URL`
* The similarity score of the documents with respect to the query (float value between 0 and 1)
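The ranking step can be sketched with `heapq`, assuming every document and the query are represented as sparse vectors `{term_id: weight}`. How the query itself is weighted (for example with the same *tfIdf* scheme) is a design choice left open here, and the names `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import heapq
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term_id: weight}."""
    dot = sum(weight * vec_b.get(term_id, 0.0) for term_id, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, candidates, k):
    """Return up to k (score, doc_id) pairs, best first, skipping zero scores."""
    heap = []  # min-heap holding the best pairs seen so far (at most k)
    for doc_id in candidates:
        score = cosine_similarity(query_vector, doc_vectors[doc_id])
        if score <= 0.0:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        else:
            # Add the new pair and discard the smallest one in a single step.
            heapq.heappushpop(heap, (score, doc_id))
    return sorted(heap, reverse=True)


# Toy vectors used only for illustration.
query_vector = {0: 1.0}
doc_vectors = {
    10: {0: 3.0, 1: 4.0},
    11: {0: 2.0},
    12: {0: 4.0, 1: 3.0},
}
ranking = top_k(query_vector, doc_vectors, candidates=[10, 11, 12], k=2)
print(ranking)
```

Keeping a min-heap of at most *k* elements means the full candidate list never has to be sorted; only the *k* survivors are ordered at the end.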
  
__Example Output__ for ```advanced knowledge```:

<p align="center">
<img src="img/output2.png" width = 1000>
</p>

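In [4]:
import pandas as pd

# Read the TSV data: one file per course, numbered from 1
frames = []
for i in range(1, 6001):
    try:
        df1 = pd.read_csv(
            "TSV/course_" + str(i) + ".tsv",
            sep="\t",
            index_col=False,
        )
        # Shift the index so that each row is labelled with its file number minus one
        df1.index += i - 1
        frames.append(df1)
    except (OSError, ValueError) as e:
        # Missing file or malformed content: report it and keep going
        print(i)
        print("Error: ", e)

# Concatenate once at the end; calling pd.concat inside the loop
# copies the growing DataFrame on every iteration
df = pd.concat(frames)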


Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url,Unnamed: 13
0,Computer Science - MSc,University of Hertfordshire,"School of Physics, Engineering and Computer Sc...",Full time,Why choose Herts?Industry Accreditation: Accre...,See Course,UK Students Full time: £9450 for the 2022/202...,MSc,"1 year full-time, 15 months full-time, 3 years...",Hatfield,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,
1,Computer Science (Cyber Security) - MSc,Staffordshire University,"School of Digital, Technologies and Arts",Full time,Join the fight against malicious programs and ...,September,Find the specific fees for your chosen program...,MSc,13 months - 25 months,Stoke on Trent,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,
2,Computer Science (Data Science) - MSc,Trinity College Dublin,School of Computer Science & Statistics,Full time,The MSc in Computer Science is an exciting one...,September,Please see the university website for further ...,MSc,1 year full-time,Dublin,Ireland,On Campus,https://www.findamasters.com/masters-degrees/c...,
3,Computer Science (by Research) - MSc,Lancaster University,School of Computing and Communications,Full time,The MSc by Research programme can be tailored ...,See Course,Please see the university website for further ...,MSc,"12 months full-time, 24 months part time",Lancaster,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,
4,Computer Science (Computer Networks and Securi...,Staffordshire University,"School of Digital, Technologies and Arts",Full time,Secure your future career with our Computer Sc...,September,Find the specific fees for your chosen program...,MSc,13 months - 25 months,Stoke on Trent,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...,


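In [17]:
# errors='ignore' keeps this cell safe to re-run: without it, a second run fails with
# KeyError: "['Unnamed: 13'] not found in axis" because the column is already gone
df = df.drop(columns='Unnamed: 13', errors='ignore')
df.info()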

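In [23]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Removing stopwords
# Note: word_tokenize also relies on the 'punkt' tokenizer models being available
nltk.download('stopwords')

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

def stopless(text):
    """Drop English stopwords from a string; leave non-string cells untouched."""
    if isinstance(text, str):
        words = word_tokenize(text)
        filtered_words = [word for word in words if word.lower() not in stop_words]
        return " ".join(filtered_words)
    return text

# applymap runs the function on every cell (it is named DataFrame.map from pandas 2.1 on)
df = df.applymap(stopless)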



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/petraudovicic/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                          courseName  \
0                             Computer Science - MSc   
1          Computer Science ( Cyber Security ) - MSc   
2            Computer Science ( Data Science ) - MSc   
3                Computer Science ( Research ) - MSc   
4  Computer Science ( Computer Networks Security ...   

             universityName                                    facultyName  \
0  University Hertfordshire  School Physics , Engineering Computer Science   
1  Staffordshire University             School Digital , Technologies Arts   
2    Trinity College Dublin           School Computer Science & Statistics   
3      Lancaster University                School Computing Communications   
4  Staffordshire University             School Digital , Technologies Arts   

  isItFullTime                                        description   startDate  \
0    Full time  choose Herts ? Industry Accreditation : Accred...  See Course   
1    Full time  Join fight malic

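In [28]:
import string

# Removing punctuation
nltk.download('punkt')

punctuation_chars = set(string.punctuation)

def punct(text):
    """Drop tokens made up only of punctuation; leave non-string cells untouched."""
    if isinstance(text, str):
        words = word_tokenize(text)
        # Checking every character also catches multi-character tokens such as "..."
        filtered_words = [
            word for word in words
            if not all(char in punctuation_chars for char in word)
        ]
        return " ".join(filtered_words)
    return text

df = df.applymap(punct)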

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/petraudovicic/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [35]:

# Stemming
# The 'punkt' models were already downloaded in the previous cell
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem(text):
    """Replace every token with its Porter stem; leave non-string cells untouched."""
    if isinstance(text, str):
        words = word_tokenize(text)
        stemmed_words = [ps.stem(word) for word in words]
        return " ".join(stemmed_words)
    return text

# Stemming an already-stemmed token can shorten it further, so run this cell only once
df = df.applymap(stem)



Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
0,comput scienc msc,univ hertfordshir,school physic engin comput scienc,full time,choo hert industri accredit accredit british c...,see cour,uk student full time £9450 2022/2023 academ ye...,msc,1 year full-tim 15 month full-tim 3 year part-tim,hatfield,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...
1,comput scienc cyber secur msc,staffordshir univ,school digit technolog art,full time,join fight malici program cybercrim comput sci...,septemb,find specif fee chosen programm websit,msc,13 month 25 month,stoke trent,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...
2,comput scienc data scienc msc,triniti colleg dublin,school comput scienc statist,full time,msc comput scienc excit one-calendar-year prog...,septemb,plea see univ websit inform fee cour,msc,1 year full-tim,dublin,ireland,campu,http //www.findamasters.com/masters-degrees/co...
3,comput scienc research msc,lancast univ,school comput commun,full time,msc research programm tailor individu research...,see cour,plea see univ websit inform fee cour,msc,12 month full-tim 24 month part time,lancast,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...
4,comput scienc comput network secur msc,staffordshir univ,school digit technolog art,full time,secur futur career comput scienc comput networ...,septemb,find specif fee chosen programm websit,msc,13 month 25 month,stoke trent,unit kingdom,campu,http //www.findamasters.com/masters-degrees/co...
