## 1. Data collection

For this homework, there is no provided dataset; instead, you have to build your own. Your search engine will run on text documents, so here we detail the procedure to follow for the data collection. We strongly suggest you split the required functions across different modules. For example, you may have a ```crawler.py``` module, a ```parser.py``` module, and an ```engine.py``` module: this is good practice that improves readability when reporting and efficiency when deploying the code. Be careful: you will likely have to deal with exceptions and other possible issues!

### 1.1. Get the list of master's degree courses

We start with the list of courses to include in your corpus of documents. In particular, we focus on web scraping the [MSc Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) listing. We want you to **collect the URL** associated with each course in that list.
The list is long and split into many pages. Therefore, we ask you to retrieve only the URLs of the courses listed in **the first 400 pages** (each page has 15 courses, so you will end up with 6000 unique master's degree URLs).

The output of this step is a `.txt` file in which each line contains the URL of one master's degree course.

### 1.2. Crawl master's degree pages

Once you get all the URLs in the first 400 pages of the list, you:

1. Download the HTML corresponding to each of the collected URLs.
2. After you collect a single page, immediately save its `HTML` in a file. In this way, if your program stops for any reason, you will not lose the data collected up to the stopping point.
3. Organize the downloaded `HTML` pages into folders. Each folder will contain the `HTML` of the courses on page 1, page 2, ... of the list of master's programs.
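
The folder layout of step 3 and the restart safety of step 2 can be sketched as follows. The names `page_folder` and `needs_download` and the `HTML/Page_<n>` layout are assumptions made for illustration; 15 is the number of courses per list page stated earlier.

```python
from pathlib import Path

COURSES_PER_PAGE = 15  # each list page shows 15 courses


def page_folder(index, root="HTML"):
    """Map the 0-based position of a URL in the list to its page folder."""
    page_number = index // COURSES_PER_PAGE + 1
    return Path(root) / f"Page_{page_number}"


def needs_download(path):
    """Return True when the file is missing or empty, so a restarted run can skip finished pages."""
    path = Path(path)
    return not path.exists() or path.stat().st_size == 0


print(page_folder(0), page_folder(14), page_folder(15), page_folder(5999))
```

Checking `needs_download` before each request lets an interrupted crawl resume where it stopped instead of starting over.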
   
__Tip__: Due to the large number of pages you should download, you can use some methods that can help you shorten the time. If you employed a particular process or approach, kindly describe it.
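
One possible way to shorten the download time is to fetch several pages concurrently with a thread pool. The sketch below is an assumption about how this could be done, not a required approach: `fetch` is a hypothetical callable that takes a URL and returns its HTML (replaced here by an offline stand-in so the example runs without network access), and the number of workers should stay small to avoid overloading the site.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL in a small thread pool.

    Returns (results, errors): results maps url -> HTML, errors maps url -> exception.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if a single page fails
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    """Offline stand-in for a real HTTP request (hypothetical)."""
    if url.endswith("bad"):
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"


results, errors = download_all(["u1", "u2", "bad"], fake_fetch)
print(sorted(results), sorted(errors))
```

Collecting failures in `errors` instead of raising makes it easy to retry only the pages that did not download.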
 
### 1.3. Parse downloaded pages

At this point, you should have all the HTML documents about the master's degree of interest, and you can start to extract specific information. The list of the information we desire for each course and their format is as follows:

1. Course Name (to save as ```courseName```): string;
2. University (to save as ```universityName```): string;
3. Faculty (to save as ```facultyName```): string;
4. Full or Part Time (to save as ```isItFullTime```): string;
5. Short Description (to save as ```description```): string;
6. Start Date (to save as ```startDate```): string;
7. Fees (to save as ```fees```): string;
8. Modality (to save as ```modality```): string;
9. Duration (to save as ```duration```): string;
10. City (to save as ```city```): string;
11. Country (to save as ```country```): string;
12. Presence or online modality (to save as ```administration```): string;
13. Link to the page (to save as ```url```): string.
    
<p align="center">
<img src="img/example.jpeg" width = 1000>
</p>
These are the first rows of the scraped dataset:

<div style="overflow-x:auto;">
<table>
<thead>
  <tr>
    <th>index</th>
    <th>courseName</th>
    <th>universityName</th>
    <th>facultyName</th>
    <th>isItFullTime</th>
    <th>description</th>
    <th>startDate</th>
    <th>fees</th>
    <th>modality</th>
    <th>duration</th>
    <th>city</th>
    <th>country</th>
    <th>administration</th>
    <th>url</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>0</td>
    <td> Accounting and Finance - MSc</td>
    <td>University of Leeds</td>
    <td>Leeds University Business School</td>
    <td>Full time</td>
    <td>Businesses and governments rely on [...].</td>
    <td>September</td>
    <td>UK: £18,000 (Total) International: £34,750 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891">Link</a></td>
  </tr>
  <tr>
    <td>1</td>
    <td> Accounting, Accountability & Financial Management MSc</td>
    <td>King’s College London</td>
    <td>King’s Business School</td>
    <td>Full time</td>
    <td>Our Accounting, Accountability & Financial Management MSc course will provide [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-accountability-and-financial-management-msc/?i132d7816c25522">Link</a></td>
  </tr>
  <tr>
    <td>2</td>
    <td> Accounting, Financial Management and Digital Business - MSc</td>
    <td>University of Reading</td>
    <td>Henley Business School</td>
    <td>Full time</td>
    <td>Embark on a professional accounting career [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Reading</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/accounting-financial-management-and-digital-business-msc/?i345d4286c351">Link</a></td>
  </tr>
  <tr>
    <td>3</td>
    <td> Addictions MSc</td>
    <td>King’s College London</td>
    <td>Institute of Psychiatry, Psychology and Neuroscience</td>
    <td>Full time</td>
    <td>Join us for an online session for prospective [...].</td>
    <td>September</td>
    <td>Please see the university website for further information on fees for this course.</td>
    <td>MSc</td>
    <td>One year FT</td>
    <td>London</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td><a href="https://www.findamasters.com/masters-degrees/course/addictions-msc/?i132d4318c27100">Link</a></td>
  </tr>
  <tr>
    <td>4</td>
    <td> Advanced Chemical Engineering - MSc</td>
    <td>University of Leeds</td>
    <td>School of Chemical and Process Engineering</td>
    <td>Full time</td>
    <td>The Advanced Chemical Engineering MSc at Leeds [...].</td>
    <td>September</td>
    <td>UK: £13,750 (Total) International: £31,000 (Total)</td>
    <td>MSc</td>
    <td>1 year full time</td>
    <td>Leeds</td>
    <td>United Kingdom</td>
    <td>On Campus</td>
    <td></td>
  </tr>
  <!-- Add more rows here as needed -->
</tbody>
</table>
</div>


For each master's degree, you create a `course_i.tsv` file of this structure:

```
courseName \t universityName \t  ... \t url
```

If a piece of information is missing, leave it as an empty string.
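
A minimal sketch of writing one record in this layout with the standard `csv` module, assuming the column order of the list above. `FIELDS` and `write_course_tsv` are illustrative names, and writing values only (no header row) is a choice made for this sketch. Missing fields become empty strings.

```python
import csv
import os
import tempfile

FIELDS = ["courseName", "universityName", "facultyName", "isItFullTime",
          "description", "startDate", "fees", "modality", "duration",
          "city", "country", "administration", "url"]


def write_course_tsv(course_info, path):
    """Write a single tab-separated line; absent or None fields become ''."""
    row = [course_info.get(field) or "" for field in FIELDS]
    # Collapse internal whitespace so tabs/newlines in a value cannot break the row
    row = [" ".join(str(value).split()) for value in row]
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerow(row)
    return row


example_path = os.path.join(tempfile.gettempdir(), "course_example.tsv")
row = write_course_tsv({"courseName": "Addictions MSc", "city": "London"}, example_path)
print(len(row), row.count(""))  # 13 11
```

Joining the values with a single tab between them (and none after the last one) keeps the number of columns equal to the number of fields.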

### 1.1. Get the list of master's degree courses

Firstly, we need to create the `crawler.py` module. 

In [29]:
# Libraries
import requests
from bs4 import BeautifulSoup
import time
import os


In [22]:
BASE_URL = "https://www.findamasters.com/masters-degrees/msc-degrees/"
OUTPUT_FILE = "masters_urls.txt"

Inspecting the listing page, we find that each course is linked through an anchor like the following: `<a class="courseLink text-dark" href="/masters-degrees/course/applied-economics-banking-and-financial-markets-online-msc/?i280d8352c56675" title="Applied Economics (Banking and Financial Markets), online MSc at University of Bath Online, University of Bath"><u>Applied Economics (Banking and Financial Markets), online MSc</u></a>`. The class `courseLink text-dark` lets us select all the course links on a page. Note that the `href` is relative to the site root.
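
To illustrate what the `courseLink` class gives us, here is a dependency-free sketch that uses the standard library's `html.parser` instead of BeautifulSoup (the notebook itself uses BeautifulSoup; `CourseLinkParser` and `SITE_ROOT` are illustrative names). It matches anchors whose class list contains `courseLink` and resolves the relative `href` against the site root with `urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SITE_ROOT = "https://www.findamasters.com/"


class CourseLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags whose class list contains 'courseLink'."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "courseLink" in classes and attributes.get("href"):
            # urljoin avoids the double slash that concatenating root and href would give
            self.urls.append(urljoin(SITE_ROOT, attributes["href"]))


snippet = ('<a class="courseLink text-dark" '
           'href="/masters-degrees/course/applied-economics-banking-and-financial-markets-online-msc/?i280d8352c56675">'
           '<u>Applied Economics (Banking and Financial Markets), online MSc</u></a>')
parser = CourseLinkParser()
parser.feed(snippet)
print(parser.urls)
```

Matching on the single class `courseLink` is slightly more robust than matching the exact attribute string, since it still works if the site adds or reorders styling classes.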

In [23]:
def get_course_urls(page_url):
    """Return the absolute course URLs found on one listing page ([] on failure)."""
    try:
        response = requests.get(page_url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return []
    soup = BeautifulSoup(response.content, 'html.parser')
    course_links = soup.find_all('a', class_='courseLink text-dark')
    # href values start with '/', so the site root is used without a trailing slash
    urls = ['https://www.findamasters.com' + link['href'] for link in course_links if 'href' in link.attrs]
    return urls

In [50]:
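NUM_PAGES = 1  # set to 400 to collect the full list (400 pages x 15 courses = 6000 URLs)

for page in range(NUM_PAGES):
    if (page + 1) % 25 == 0 or page == 0:
        print(f"Scraping page {page + 1}")
    # The first page has no query string; later pages are addressed with ?PG=<n>
    page_url = BASE_URL if page == 0 else f"{BASE_URL}?PG={page + 1}"
    urls = get_course_urls(page_url)

    if urls:
        # Append mode: delete OUTPUT_FILE before a fresh run to avoid duplicate lines
        with open(OUTPUT_FILE, 'a') as file:
            for url in urls:
                file.write(f"{url}\n")

    # Sleep between requests to avoid sending too many in a short time
    time.sleep(1)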

Scraping page 1


In [28]:
with open(OUTPUT_FILE, 'r') as file:
    # Count lines without failing on an empty file
    total_lines = sum(1 for _ in file)
# This needs to be 6000
print('Total Lines', total_lines)

Total Lines 6000
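
Since the goal is 6000 unique URLs, counting lines alone does not rule out duplicates (for instance after re-running a cell that appends to the file). A small sketch of a uniqueness check follows; `count_urls` is an illustrative helper, shown here on an in-memory sample with made-up `example.org` URLs rather than on the real file.

```python
from collections import Counter


def count_urls(lines):
    """Return (total, unique, duplicates) for an iterable of URL lines."""
    urls = [line.strip() for line in lines if line.strip()]
    counts = Counter(urls)
    duplicates = sorted(url for url, n in counts.items() if n > 1)
    return len(urls), len(counts), duplicates


sample = ["https://example.org/a\n", "https://example.org/b\n", "https://example.org/a\n"]
total, unique, duplicates = count_urls(sample)
print(total, unique, duplicates)  # 3 2 ['https://example.org/a']
```

On the real data, the same function can be called on the open file object, since iterating over a file yields its lines.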


### 1.2. Crawl master's degree pages

In [58]:
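def download_html(url, folder, max_retries=5):
    """Download one course page and save it in `folder`; return True on success."""
    for _ in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException as e:
            print(f"Request failed for {url}: {e}")
            time.sleep(3)
            continue
        if "Just a moment..." in response.text:
            # Sleeping before trying again (net issues, too many requests etc.)
            time.sleep(3)
            continue
        # File name: course slug plus four characters of the query string
        filename = url.split('/')[-2] + url.split('/')[-1][2:6] + '.html'
        path = os.path.join(folder, filename)
        with open(path, 'w', encoding='utf-8') as file:
            file.write(response.text)
        return True
    # Give up after max_retries attempts instead of looping forever
    return False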


In [59]:
with open('masters_urls.txt', 'r') as file:
    urls = file.readlines()

for i, url in enumerate(urls):
    page_number = i // 15 + 1
    folder = f"HTML/Page_{page_number}"
    os.makedirs(folder, exist_ok=True)
    download_html(url.strip(), folder)

    # Sleeping to not have too many requests         
    time.sleep(1)

KeyboardInterrupt: 
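
The file name used above combines the course slug with only four characters of the query string, so two courses with the same slug and similar IDs could in principle overwrite each other. A sketch of an alternative naming scheme that keeps the whole ID; `html_filename` is a hypothetical helper, not part of the original crawler.

```python
import re
from urllib.parse import urlsplit


def html_filename(url):
    """Build '<slug>_<query id>.html' from a course URL, keeping only safe characters."""
    parts = urlsplit(url)
    slug = parts.path.rstrip("/").split("/")[-1]
    name = f"{slug}_{parts.query}" if parts.query else slug
    return re.sub(r"[^A-Za-z0-9_-]", "_", name) + ".html"


print(html_filename("https://www.findamasters.com/masters-degrees/course/addictions-msc/?i132d4318c27100"))
# addictions-msc_i132d4318c27100.html
```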

### 1.3. Parse downloaded pages

In [64]:
TEXT_FIELDS = ['courseName', 'universityName', 'facultyName', 'isItFullTime',
               'description', 'startDate', 'fees', 'modality', 'duration',
               'city', 'country', 'administration']

def parse_html(html_content):
    """Extract the required fields from one course page; missing fields become ''."""
    soup = BeautifulSoup(html_content, 'html.parser')
    course_info = {}
    for field in TEXT_FIELDS:
        # NOTE: looking elements up by id=<field name> is a placeholder;
        # replace it with the selectors found by inspecting the course page.
        element = soup.find(id=field)
        course_info[field] = element.get_text(strip=True) if element else ""
    # The canonical link in the page head holds the course URL
    canonical = soup.select_one('link[rel="canonical"]')
    course_info['url'] = canonical.get('href', '') if canonical else ''

    return course_info

In [66]:
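def write_to_tsv(course_info, filename):
    os.makedirs('TSV', exist_ok=True)  # make sure the output folder exists
    # Tabs and newlines inside a value would break the one-line TSV layout
    values = [' '.join(str(value).split()) for value in course_info.values()]
    with open(os.path.join('TSV', filename), 'w', encoding='utf-8') as f:
        # Join with tabs so that no trailing tab is written after the last field
        f.write('\t'.join(values) + '\n')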

In [68]:
root_path = 'HTML/'

course_counter = 1
for root, dirs, files in os.walk(root_path):
    for file in files:
        if file.endswith('.html'):
            file_path = os.path.join(root, file)
            with open(file_path, 'r', encoding='utf-8') as f:
                html_content = f.read()
                course_info = parse_html(html_content)
                
                tsv_filename = f"course_{course_counter}.tsv"
                write_to_tsv(course_info, tsv_filename)
                print(f"Generated {tsv_filename}")  # Optional, for progress tracking
                
                course_counter += 1

Generated course_1.tsv
Generated course_2.tsv
Generated course_3.tsv
Generated course_4.tsv
Generated course_5.tsv
Generated course_6.tsv
Generated course_7.tsv
Generated course_8.tsv
Generated course_9.tsv
Generated course_10.tsv
Generated course_11.tsv
Generated course_12.tsv
Generated course_13.tsv
Generated course_14.tsv
Generated course_15.tsv
Generated course_16.tsv
Generated course_17.tsv
Generated course_18.tsv
Generated course_19.tsv
Generated course_20.tsv
Generated course_21.tsv
Generated course_22.tsv
Generated course_23.tsv
Generated course_24.tsv
Generated course_25.tsv
Generated course_26.tsv
Generated course_27.tsv
Generated course_28.tsv
Generated course_29.tsv
Generated course_30.tsv
Generated course_31.tsv
Generated course_32.tsv
Generated course_33.tsv
Generated course_34.tsv
Generated course_35.tsv
Generated course_36.tsv
Generated course_37.tsv
Generated course_38.tsv
Generated course_39.tsv
Generated course_40.tsv
Generated course_41.tsv


KeyboardInterrupt: 