## 1. Data collection

For this homework, there is no provided dataset. Instead, you have to build your own. Your search engine will run on text documents, so here we detail the procedure to follow for the data collection. We strongly suggest you split the required functions across separate modules. For example, you may have a ```crawler.py``` module, a ```parser.py``` module, and an ```engine.py``` module: this is good practice that improves readability in reporting and efficiency in deploying the code. Be careful: you will likely have to deal with exceptions and other possible issues!

### 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the [MSc Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) listing. We want you to **collect the URL** associated with each course in that list.
The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the courses listed in **the first 400 pages** (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a `.txt` file in which each line contains the URL of one master's degree.

### 1.2. Crawl master's degree pages

Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its `HTML` in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded `HTML` pages into folders. Each folder will contain the `HTML` of the courses on page 1, page 2, ... of the list of master's programs.
   
__Tip__: Due to the large number of pages you should download, you can use some methods that can help you shorten the time. If you employed a particular process or approach, kindly describe it.
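
The save-as-you-go requirement in step 2 also makes it possible to restart an interrupted crawl without downloading everything again. The following is a minimal sketch under stated assumptions: `save_if_missing` is a hypothetical helper, and `fetch` is a hypothetical callable standing in for the real HTTP request.

```python
import os


def save_if_missing(url, folder, filename, fetch):
    """Download `url` into folder/filename unless that file already exists.

    `fetch` is a hypothetical callable (url -> str) standing in for the real
    HTTP request. Returns True if a download happened, False if skipped.
    """
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    if os.path.exists(path):
        return False  # already saved by an earlier run
    html = fetch(url)
    # Write to a temporary file first, then rename, so an interrupted run
    # never leaves a half-written page behind.
    tmp_path = path + ".part"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(html)
    os.replace(tmp_path, path)
    return True
```

Running the crawl a second time with such a helper would only fetch the pages that are still missing.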
 
### 1.3. Parse downloaded pages

At this point, you should have all the HTML documents for the master's degrees of interest, and you can start to extract specific information. The information we want for each course, and its format, is as follows:

1. Course Name (to save as ```courseName```): string;
2. University (to save as ```universityName```): string;
3. Faculty (to save as ```facultyName```): string;
4. Full or Part Time (to save as ```isItFullTime```): string;
5. Short Description (to save as ```description```): string;
6. Start Date (to save as ```startDate```): string;
7. Fees (to save as ```fees```): string;
8. Modality (to save as ```modality```): string;
9. Duration (to save as ```duration```): string;
10. City (to save as ```city```): string;
11. Country (to save as ```country```): string;
12. Presence or online modality (to save as ```administration```): string;
13. Link to the page (to save as ```url```): string.
    
<p align="center">
<img src="img/example.jpeg" width = 1000>
</p>
These are the first rows of the scraped dataset:

<div style="overflow-x:auto;">
<table>
<thead>
  <tr>
    <th>index</th>
    <th>courseName</th>
    <th>universityName</th>
    <th>facultyName</th>
    <th>isItFullTime</th>
    <th>description</th>
    <th>startDate</th>
    <th>fees</th>
    <th>modality</th>
    <th>duration</th>
    <th>city</th>
    <th>country</th>
    <th>administration</th>
    <th>url</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>0</td>
    <td> Accounting and Finance - MSc</td>
    <td>University of Leeds</td>
    <td>Leeds University Business School</td>
    <td>Full time</td>
    <td>Businesses and governments rely on [...].</td>
    <td>September</td>
    <td>UK: £18,000 (Total) International: £34,750 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891">Link</a></td>
  </tr>
  <tr>
    <td>1</td>
    <td> Accounting, Accountability & Financial Management MSc</td>
    <td>King’s College London</td>
    <td>King’s Business School</td>
    <td>Full time</td>
    <td>Our Accounting, Accountability & Financial Management MSc course will provide [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-accountability-and-financial-management-msc/?i132d7816c25522">Link</a></td>
  </tr>
  <tr>
    <td>2</td>
    <td> Accounting, Financial Management and Digital Business - MSc</td>
    <td>University of Reading</td>
    <td>Henley Business School</td>
    <td>Full time</td>
    <td>Embark on a professional accounting career [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Reading</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-financial-management-and-digital-business-msc/?i345d4286c351">Link</a></td>
  </tr>
  <tr>
    <td>3</td>
    <td> Addictions MSc</td>
    <td>King’s College London</td>
    <td>Institute of Psychiatry, Psychology and Neuroscience</td>
    <td>Full time</td>
    <td>Join us for an online session for prospective [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>One year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/addictions-msc/?i132d4318c27100">Link</a></td>
  </tr>
  <tr>
    <td>4</td>
    <td> Advanced Chemical Engineering - MSc</td>
    <td>University of Leeds</td>
    <td>School of Chemical and Process Engineering</td>
    <td>Full time</td>
    <td>The Advanced Chemical Engineering MSc at Leeds [...].</td>
    <td>September</td>
    <td>UK: £13,750 (Total) International: £31,000 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
  </tr>
  <!-- Add more rows here as needed -->
</tbody>
</table>
</div>


For each master's degree, you create a `course_i.tsv` file of this structure:

```
courseName \t universityName \t  ... \t url
```

If a piece of information is missing, you just leave it as an empty string.
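
As an illustration of the row layout and the empty-string rule, here is a minimal sketch. The `FIELDS` list and the `to_tsv_line` helper are hypothetical names introduced for this example. Collapsing tabs and newlines inside a value into single spaces is an assumption of the sketch, not a requirement of the assignment.

```python
# Column order taken from the field list in this section.
FIELDS = [
    "courseName", "universityName", "facultyName", "isItFullTime",
    "description", "startDate", "fees", "modality", "duration",
    "city", "country", "administration", "url",
]


def to_tsv_line(course_info):
    """Join the 13 fields with tabs; missing or None fields become ""."""
    values = []
    for field in FIELDS:
        value = course_info.get(field) or ""
        # Tabs and newlines inside a value would break the one-row layout,
        # so runs of whitespace are collapsed to a single space.
        values.append(" ".join(str(value).split()))
    return "\t".join(values)
```

A dictionary that only contains `courseName` and `city` still produces a line with 13 tab-separated fields, most of them empty.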

### 1.1. Get the list of master's degree courses

First, we create the `list_of_masters.py` module.

In [2]:
# Libraries
import requests
from bs4 import BeautifulSoup
import time
import os
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

In [3]:
BASE_URL = "https://www.findamasters.com/masters-degrees/msc-degrees/"
OUTPUT_FILE = "masters_urls.txt"

Inspecting the page at this URL, we find the following piece of HTML: `<a class="courseLink text-dark" href="/masters-degrees/course/applied-economics-banking-and-financial-markets-online-msc/?i280d8352c56675" title="Applied Economics (Banking and Financial Markets), online MSc at University of Bath Online, University of Bath"><u>Applied Economics (Banking and Financial Markets), online MSc</u></a>`. The class `courseLink text-dark` lets us select all the course links on a page.
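
The `href` in that snippet is relative to the site root, so it has to be turned into an absolute URL before it is saved. One way to do this, sketched here with the standard library's `urljoin`, is shown below. `SITE_ROOT` and `absolute_url` are hypothetical names used only for this example.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.findamasters.com"  # site root, as in the course URLs


def absolute_url(href):
    """Resolve a relative href against the site root.

    An href that is already absolute is returned unchanged.
    """
    return urljoin(SITE_ROOT, href)
```

Compared with plain string concatenation, `urljoin` also copes with an `href` that is already a full URL.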

In [42]:
def get_course_urls(page_url):
    while True:
        try:
            response = requests.get(page_url, timeout=5)
            response.raise_for_status()
            break
        except requests.RequestException as e:
            # In case of failing request
            print(f"Request failed: {e}")
            time.sleep(1)
    soup = BeautifulSoup(response.content, "html.parser")
    course_links = soup.find_all("a", class_="courseLink text-dark")
    urls = [
        "https://www.findamasters.com" + link["href"]
        for link in course_links
        if "href" in link.attrs
    ]
    return urls
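
The `while True` loop above retries forever if a page never loads. A bounded alternative with a growing delay could look like the sketch below. `retry` is a hypothetical helper, not part of `requests`, and it catches every `Exception` only to stay self-contained; real code would narrow this to `requests.RequestException`.

```python
import time


def retry(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds or `attempts` calls have failed.

    The wait doubles after every failure (base_delay, 2 * base_delay, ...).
    The last exception is re-raised once the attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:  # narrow this to the expected error type in real code
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The request and the `raise_for_status()` call would then go inside the function passed as `action`.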

In [None]:
number_of_pages = 400  # first 400 listing pages, 15 courses each

for page in range(number_of_pages):
    # Progress bar
    if (page + 1) % 25 == 0 or page == 0:
        print(f"Scraping page {page + 1}")

    # String of page_url, in the first page only get the BASE_URL
    if page == 0:
        page_url = f"{BASE_URL}"
    else:
        page_url = f"{BASE_URL}?PG={page + 1}"
    urls = get_course_urls(page_url)

    if urls:
        # Append to the file so it remembers the previous entries
        with open(OUTPUT_FILE, "a") as file:
            for url in urls:
                file.write(f"{url}\n")

    # Sleeping to not have too many requests
    time.sleep(1)

After running the script, we check the output file: it should contain 6000 URLs.

In [None]:
with open(OUTPUT_FILE, "r") as file:
    for count, line in enumerate(file):
        pass
# This needs to be 6000
print("Total Lines", count + 1)

Total Lines 6000
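
A line count of 6000 does not by itself show that the URLs are unique, and because the file is opened in append mode, a second run of the scraping loop would add every URL again. The sketch below shows one way to check for and remove repeats. Both function names are hypothetical.

```python
def count_duplicates(lines):
    """Return (total, unique) counts for an iterable of URL lines."""
    urls = [line.strip() for line in lines if line.strip()]
    return len(urls), len(set(urls))


def dedupe_keep_order(lines):
    """Drop repeated URLs while preserving first-seen order."""
    return list(dict.fromkeys(line.strip() for line in lines if line.strip()))
```

If the two counts returned by `count_duplicates` differ, the file contains repeated URLs.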


### 1.2. Crawl master's degree pages

For the second step, we need to download the HTML files and store them in separate folders. This is done in the `crawler_master_html.py` module; the implementation is below.

In [18]:
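def download_html(url_and_folder):
    url, folder = url_and_folder
    while True:
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException as e:
            # A timeout or connection error should not kill the worker thread
            print(f"Request failed for {url}: {e}")
            time.sleep(1)
            continue
        # "Just a moment..." started showing up in some cases.
        # Until we get the content we want, keep retrying the same URL
        if "Just a moment..." in response.text:
            time.sleep(1)
        else:
            filename = url.split("/")[-2] + url.split("/")[-1] + ".html"
            path = os.path.join(folder, filename)
            # Write as UTF-8 so the parser can read the file back as UTF-8
            with open(path, "w", encoding="utf-8") as file:
                file.write(response.text)
            time.sleep(1)
            break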
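
The filename built above keeps the `?` from the URL's query string, a character that Windows does not allow in file names. A possible variant, sketched here with the standard library, separates the slug and the query with an underscore instead. `html_filename` is a hypothetical helper, and keeping the query string in the name to tell apart courses that share a slug is an assumption.

```python
from urllib.parse import urlsplit


def html_filename(url):
    """Build '<slug>_<query>.html' from a course URL, without the '?' character."""
    parts = urlsplit(url)
    slug = parts.path.rstrip("/").split("/")[-1]
    # The query string is kept because it may distinguish listings that
    # share the same slug (an assumption of this sketch).
    suffix = f"_{parts.query}" if parts.query else ""
    return f"{slug}{suffix}.html"
```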

After writing the `download_html` function, we apply it to all the URLs in the `masters_urls.txt` file.

In [19]:
with open("masters_urls.txt", "r") as file:
    urls = [url.strip() for url in file.readlines()]

url_and_folder_pairs = []
for i, url in enumerate(urls):
    page_number = i // 15 + 1
    folder = f"HTML/Page_{page_number}"
    os.makedirs(folder, exist_ok=True)
    url_and_folder_pairs.append((url, folder))

# Run download_html concurrently on at most 10 threads to save time
with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(download_html, url_and_folder_pairs)

Each HTML file is saved in the folder of the listing page it came from. The `download_html` calls run concurrently in a thread pool to speed up the crawl.
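
One caveat of `executor.map` is that an exception raised inside a worker only surfaces when the returned results are iterated, so a failed download can go unnoticed if the results are never read. The sketch below uses `submit` and `as_completed` to collect failures per item. `run_all` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(task, items, max_workers=10):
    """Run task(item) in a thread pool; return (results, failures).

    Exceptions raised inside a worker are captured per item instead of
    being lost or aborting the whole batch.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(task, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                failures[item] = exc
    return results, failures
```

Calling `run_all(download_html, url_and_folder_pairs)` would then return the pairs that failed, which can be retried or logged.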

### 1.3. Parse downloaded pages

For the final part, we need to parse the downloaded HTML files. This is done in the `parser_html.py` module. For each value of interest, we used Inspect Element in the Firefox browser to find the HTML tag and class that hold the data we need.

In [6]:
def parse_html(html_content):
    soup = BeautifulSoup(html_content, "html.parser")

    course_name = (
        soup.find("h1", class_="course-header__course-title").text.strip()
        if soup.find("h1", class_="course-header__course-title")
        else ""
    )
    university_name = (
        soup.find("a", class_="course-header__institution").text.strip()
        if soup.find("a", class_="course-header__institution")
        else ""
    )
    faculty_name = (
        soup.find("a", class_="course-header__department").text.strip()
        if soup.find("a", class_="course-header__department")
        else ""
    )
    is_full_time = (
        soup.find("a", title="View all Full time Masters courses").text.strip()
        if soup.find("a", title="View all Full time Masters courses")
        else ""
    )
    description = (
        soup.find("div", id="Snippet").text.strip()
        if soup.find("div", id="Snippet")
        else ""
    )
    start_date = (
        soup.find("span", class_="key-info__start-date").text.strip()
        if soup.find("span", class_="key-info__start-date")
        else ""
    )
    fees_section = soup.find("div", class_="course-sections__fees")
    fees_paragraph = fees_section.find("p") if fees_section else None
    # Guard against a fees section that has no <p> child
    fees = fees_paragraph.get_text(separator=" ").strip() if fees_paragraph else ""
    modality = (
        soup.find("span", class_="key-info__qualification").text.strip()
        if soup.find("span", class_="key-info__qualification")
        else ""
    )
    duration = (
        soup.find("span", class_="key-info__duration").text.strip()
        if soup.find("span", class_="key-info__duration")
        else ""
    )
    city = (
        soup.find("a", class_="course-data__city").text.strip()
        if soup.find("a", class_="course-data__city")
        else ""
    )
    country = (
        soup.find("a", class_="course-data__country").text.strip()
        if soup.find("a", class_="course-data__country")
        else ""
    )
    administration = (
        soup.find("a", class_="course-data__on-campus").text.strip()
        if soup.find("a", class_="course-data__on-campus")
        else ""
    )
    url = (
        soup.select_one('link[rel="canonical"]')["href"]
        if soup.select_one('link[rel="canonical"]')
        else ""
    )

    course_info = {
        "courseName": course_name,
        "universityName": university_name,
        "facultyName": faculty_name,
        "isItFullTime": is_full_time,
        "description": description,
        "startDate": start_date,
        "fees": fees,
        "modality": modality,
        "duration": duration,
        "city": city,
        "country": country,
        "administration": administration,
        "url": url,
    }

    return course_info
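
Most fields in `parse_html` follow the same pattern: look the tag up, then look it up again to read its text. A small helper could perform the lookup once per field. `first_text` is a hypothetical name, and the sketch assumes `soup` is a `BeautifulSoup` object or any tag with a compatible `find` method.

```python
def first_text(soup, name, **attrs):
    """Return the stripped text of the first matching tag, or "" if none matches.

    `soup` is expected to be a BeautifulSoup object (or any tag); the lookup
    runs once instead of twice per field.
    """
    tag = soup.find(name, **attrs)
    return tag.get_text().strip() if tag is not None else ""
```

With this helper, the course name would be read as `first_text(soup, "h1", class_="course-header__course-title")`.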

In [8]:
def write_to_tsv(course_info, filename):
    # Make sure the output folder exists before writing
    os.makedirs("TSV", exist_ok=True)
    with open(os.path.join("TSV", filename), "w", encoding="utf-8") as f:
        # First row we write the column names
        column_names = "\t".join(course_info.keys())
        f.write(column_names + "\n")    
        # Second row we write the values of columns
        column_values = "\t".join(str(value) for value in course_info.values())
        f.write(column_values + "\n")
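
Joining values with `"\t"` by hand breaks the row as soon as a value contains a tab or a newline. As an alternative sketch, the standard `csv` module can write the same two-row layout and quote such values automatically. `write_tsv` is a hypothetical variant of the function above and takes an open file object instead of a filename.

```python
import csv
import io


def write_tsv(course_info, fileobj):
    """Write a header row and one value row, tab-separated.

    csv.writer quotes any value containing a tab, newline, or double quote,
    so such values survive a round trip through csv.reader.
    """
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    writer.writerow(course_info.keys())
    writer.writerow(course_info.values())


# Usage with an in-memory buffer; a real file should be opened with newline="".
buffer = io.StringIO()
write_tsv({"courseName": 'The "Best" MSc', "description": "line 1\nline 2"}, buffer)
buffer.seek(0)
header, row = list(csv.reader(buffer, delimiter="\t"))
```

Since `pd.read_csv` applies the same quoting rules by default, files written this way should not need double quotes stripped from the values beforehand.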


After writing `parse_html` and `write_to_tsv`, we need a script that uses them. We take all the files inside the `HTML` folder, parse them, and store the results in separate `.tsv` files named `course_{i}.tsv`, where `i` is the course number from 1 to 6000.

In [9]:
root_path = "HTML/"

course_counter = 1
for root, dirs, files in os.walk(root_path):
    for file in files:
        if file.endswith(".html"):
            file_path = os.path.join(root, file)
            with open(file_path, "r", encoding="utf-8") as f:
                # Reading HTML
                html_content = f.read()

                # Parsing HTML
                course_info = parse_html(html_content)

                # " was giving us troubles so we removed it
                for key, value in course_info.items():
                    course_info[key] = value.replace('"', "")

                # Writing to the .tsv file
                tsv_filename = f"course_{course_counter}.tsv"
                write_to_tsv(course_info, tsv_filename)

                course_counter += 1
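
`os.walk` does not guarantee any particular folder order, and alphabetical sorting would place `Page_10` before `Page_2`, so the `course_{i}` numbering above need not follow the order of the listing pages. If that order matters, the folders can be sorted numerically first. This is a sketch with hypothetical helper names.

```python
import re


def page_number(folder_name):
    """Extract N from a folder named 'Page_N'; unknown names sort last."""
    match = re.fullmatch(r"Page_(\d+)", folder_name)
    return int(match.group(1)) if match else float("inf")


def sorted_page_folders(folder_names):
    """Order 'Page_1', 'Page_2', ..., 'Page_10' numerically, not alphabetically."""
    return sorted(folder_names, key=page_number)
```

The loop could then iterate over `sorted_page_folders(os.listdir(root_path))` instead of relying on the order returned by `os.walk`.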

Now we need to check the dataset to see whether we completed the task successfully.

In [10]:
# Read the TSV data
df = pd.read_csv("TSV/course_1.tsv", sep="\t", index_col=False)

for i in range(2, 6001):
    try:
        df1 = pd.read_csv("TSV/course_" + str(i) + ".tsv", sep="\t", index_col=False)
        df1.index += i - 1
        df = pd.concat([df, df1])
    except Exception as e:
        print(i)
        print("Error: ", e)

df.head()

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
0,Computer Science - MSc,University of Hertfordshire,"School of Physics, Engineering and Computer Sc...",Full time,Why choose Herts?Industry Accreditation: Accre...,See Course,UK Students Full time: £9450 for the 2022/202...,MSc,"1 year full-time, 15 months full-time, 3 years...",Hatfield,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
1,Computer Science (Cyber Security) - MSc,Staffordshire University,"School of Digital, Technologies and Arts",Full time,Join the fight against malicious programs and ...,September,Find the specific fees for your chosen program...,MSc,13 months - 25 months,Stoke on Trent,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
2,Computer Science (Data Science) - MSc,Trinity College Dublin,School of Computer Science & Statistics,Full time,The MSc in Computer Science is an exciting one...,September,Please see the university website for further ...,MSc,1 year full-time,Dublin,Ireland,On Campus,https://www.findamasters.com/masters-degrees/c...
3,Computer Science (by Research) - MSc,Lancaster University,School of Computing and Communications,Full time,The MSc by Research programme can be tailored ...,See Course,Please see the university website for further ...,MSc,"12 months full-time, 24 months part time",Lancaster,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
4,Computer Science (Computer Networks and Securi...,Staffordshire University,"School of Digital, Technologies and Arts",Full time,Secure your future career with our Computer Sc...,September,Find the specific fees for your chosen program...,MSc,13 months - 25 months,Stoke on Trent,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   courseName      5975 non-null   object
 1   universityName  5975 non-null   object
 2   facultyName     5975 non-null   object
 3   isItFullTime    5250 non-null   object
 4   description     5975 non-null   object
 5   startDate       5975 non-null   object
 6   fees            5854 non-null   object
 7   modality        5975 non-null   object
 8   duration        5975 non-null   object
 9   city            5975 non-null   object
 10  country         5975 non-null   object
 11  administration  5199 non-null   object
 12  url             5979 non-null   object
dtypes: object(13)
memory usage: 609.5+ KB


In [12]:
df[df['courseName'].isnull()]

Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
1083,,,,,,,,,,,,,
1087,,,,,,,,,,,,,
1205,,,,,,,,,,,,,
1291,,,,,,,,,,,,,
1423,,,,,,,,,,,,,
1856,,,,,,,,,,,,,
2198,,,,,,,,,,,,,
2459,,,,,,,,,,,,,
2491,,,,,,,,,,,,,
2501,,,,,,,,,,,,,


Examining the entries with a NaN `courseName`, we can see that the dataset contains 25 bad entries. The `.tsv` files at these indices appear to be mostly empty. Furthermore, looking into the corresponding HTML files, we see that all 25 URLs are broken.
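
The same inspection can be done file by file without building the full DataFrame. The sketch below lists the columns that are empty in a single course file. `missing_fields` is a hypothetical helper, and it assumes the header-plus-one-row layout produced by `write_to_tsv`.

```python
import csv


def missing_fields(tsv_path):
    """Return the column names whose value is empty in a one-row course TSV."""
    with open(tsv_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        row = next(reader, None)
        if row is None:
            # Header only (or empty file): every known column is missing.
            return list(reader.fieldnames or [])
    return [
        name
        for name, value in row.items()
        if name is not None and not (value or "").strip()
    ]
```

A file whose result includes `courseName` corresponds to one of the bad entries discussed above.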