The goal of this project is to develop a first version (v1) of a web scraper, built with Beautiful Soup, that collects information about the degrees and courses offered at McGill University. The scraper is built from four key functions:

1. `soup_maker`: This function allows for the rapid creation of Beautiful Soup objects, facilitating efficient data extraction from web pages.

2. `url_page_collector`: This function gathers all the program URLs listed on a single search-results page, so that each program page can then be scraped in turn.

3. `extract_courses`: This function extracts detailed information for each course in a course list, such as credits, course descriptions, instructors, prerequisites, and restrictions.

4. `extract_all`: Given a Beautiful Soup object for a program page, this function walks the page's section headings and course lists and stores the extracted course details in a dictionary keyed by section heading.

Overall, the project aims to build a functional web scraper that can systematically gather pertinent data about McGill University's degrees and courses, serving as the foundation for future iterations and enhancements.

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
# Take a URL and return a BeautifulSoup object built from the page's HTML.
def soup_maker(url):
  response = requests.get(url)
  soup = BeautifulSoup(response.content, 'html.parser')
  return soup

In [3]:
# Take a search-results URL and return the list of program-page URLs found on it.

def url_page_collector(url):
  urls = []
  soup = soup_maker(url)
  # This element holds the HTML for all the programs listed on the page.
  all_programs = soup.find('div', class_="view-content")

  # Keep only the anchors that carry an href; these are the links to loop over.
  links = all_programs.find_all('a', href=True)
  for link in links:
    partial_link = link['href']
    # The hrefs are site-relative, so prefix the domain.
    urls.append("https://www.mcgill.ca" + partial_link)

  return urls
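
Prefixing the domain by hand works as long as every href on the page is site-relative. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves relative and already-absolute hrefs alike. The sketch below is not part of the original scraper: `BASE_URL`, `to_absolute`, and the sample hrefs are illustrative assumptions.

```python
from urllib.parse import urljoin

# Assumed base for this sketch: the search page used later in the notebook.
BASE_URL = "https://www.mcgill.ca/study/2022-2023/programs/search?page=1"

def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against the base URL (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs]

# Made-up hrefs for illustration only.
sample_hrefs = [
    "/study/2022-2023/faculties/arts/undergraduate/programs/example-program",
    "https://www.mcgill.ca/study/2022-2023/faculties/arts",
]
absolute_urls = to_absolute(sample_hrefs)
print(absolute_urls)
```

With this approach an href that is already absolute is returned unchanged instead of receiving a second domain prefix.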


In [4]:
def extract_courses(ul):
  # Parse every course <li> in a program's course list into a row of fields.
  error_counter = 0
  course_list = []
  courses_in_topic = ul.find_all('li', class_="program-course")

  # Loop through each course, extract the fields we need, and save them in a list.
  for course in courses_in_topic:
    try:
      course_info = course.find('a').text  # .text returns the link's text content
      course_strip = course_info.strip()
      course_code = course_info[15:23].strip()  # .strip() removes stray whitespace before and after
      course_name = course_info[24:-26].strip()
      content = course.find('div', class_="content")
      course_description = content.find("p").text.strip()

      # Asterisks after the credits shift the credit digit's position, so test the
      # longest marker first; otherwise "*" would also match "**" and "***".
      if "***" in course_strip:
        credits = course_strip[-27:-26]
      elif "**" in course_strip:
        credits = course_strip[-26:-25]
      elif "*" in course_strip:
        credits = course_strip[-25:-24]
        course_name = course_strip[9:-26]
      else:
        credits = course_info[-25:-24]

      # Reset the per-course fields so values never leak from the previous course.
      terms = ""
      faculty = ""
      instructor_names = ""
      prereq = []
      coreq = []  # collected, but not part of the output row
      pre_co_req = []
      restriction = []
      course_note = []

      # Some courses have more than one prerequisite or restriction paragraph, hence the lists.
      for p in course.find_all('p'):
        text = p.text
        if 'Prerequisite' in text:
          prereq.append(text)
        elif 'Corequisite' in text:
          coreq.append(text)
        elif 'Pre-/co-requisite' in text:
          pre_co_req.append(text)
        elif 'Restriction' in text:
          restriction.append(text)
        elif "Faculty" in text:
          faculty = text
        elif course_description in text:
          pass  # the description was already captured above
        elif "Terms:" in text:
          terms = text[19:].strip()
        elif "Instructors:" in text:
          instructor_names = text[24:].strip()
        elif "Fall" in text or 'Winter' in text:
          pass  # term-specific lines are not kept as notes
        else:
          course_note.append(text)

      course_list.append([course_code, course_name, credits, terms, faculty, course_description, instructor_names, prereq, pre_co_req, restriction, course_note])
    except (AttributeError, ValueError, TypeError):
      # A malformed entry is counted and skipped; the rest of the list is still parsed.
      error_counter += 1
      print(error_counter)

  return course_list
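
Fixed character offsets such as `course_info[15:23]` depend on the exact whitespace and length of each heading. A regular expression is one possible way to make that step less brittle. The sketch below assumes a heading of the form `COMP 202 Foundations of Programming (3 credits)`, optionally followed by up to three asterisks; that shape is inferred from the output shown in this notebook and has not been checked against the live pages. `HEADING_RE` and `parse_course_heading` are hypothetical names.

```python
import re

# Assumed heading shape: "<SUBJ> <NUMBER> <title> (<N> credit[s])" plus optional asterisks.
HEADING_RE = re.compile(
    r"^\s*(?P<code>[A-Z]{3,4}\s+\d{3}[A-Z0-9]*)\s+"
    r"(?P<name>.+?)\s*"
    r"\((?P<credits>\d+(?:\.\d+)?)\s+credits?\)\s*"
    r"(?P<stars>\*{0,3})\s*$",
    re.DOTALL,
)

def parse_course_heading(text):
    """Split a course heading into code, name, credits, and asterisk count.

    Returns None when the text does not match the assumed shape.
    """
    match = HEADING_RE.match(text)
    if match is None:
        return None
    return {
        # Collapse internal runs of whitespace to single spaces.
        "code": " ".join(match.group("code").split()),
        "name": " ".join(match.group("name").split()),
        "credits": match.group("credits"),
        "stars": len(match.group("stars")),
    }

parsed = parse_course_heading("   COMP 202 Foundations of Programming (3 credits) *\n")
print(parsed)
```

Returning `None` for a non-matching heading lets the caller decide whether to skip the course or fall back to the slicing logic.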

In [5]:
url_list = url_page_collector('https://www.mcgill.ca/study/2022-2023/programs/search?page=1')
print(url_list)

['https://www.mcgill.ca/study/2022-2023/faculties/arts/undergraduate/programs/bachelor-arts-ba-major-concentration-african-studies', 'https://www.mcgill.ca/study/2022-2023/faculties/basc/undergraduate/programs/bachelor-arts-ba-major-concentration-african-studies', 'https://www.mcgill.ca/study/2022-2023/faculties/arts/undergraduate/programs/bachelor-arts-ba-minor-concentration-african-studies', 'https://www.mcgill.ca/study/2022-2023/faculties/basc/undergraduate/programs/bachelor-arts-ba-minor-concentration-african-studies', 'https://www.mcgill.ca/study/2022-2023/faculties/arts/undergraduate/programs/bachelor-arts-ba-minor-concentration-arabic-language', 'https://www.mcgill.ca/study/2022-2023/faculties/basc/undergraduate/programs/bachelor-arts-ba-minor-concentration-arabic-language', 'https://www.mcgill.ca/study/2022-2023/faculties/arts/undergraduate/programs/bachelor-arts-ba-major-concentration-anthropology', 'https://www.mcgill.ca/study/2022-2023/faculties/basc/undergraduate/programs/b

In [6]:
test_soup = soup_maker('https://www.mcgill.ca/study/2022-2023/faculties/basc/undergraduate/programs/bachelor-science-bsc-minor-atmospheric-science')

In [7]:
def extract_all(soup):
  # Build {section heading: [credit note, courses]} for one program page.
  degree_name = soup.find("h1").text.strip()
  total_number_of_credits = degree_name[-13:-9]  # sliced from the page title; not used yet

  # Headings, paragraphs, and course lists, in document order.
  h4_ps = soup.find_all(['h4', 'p', 'ul'])
  major_struc = {}
  course_type = ""
  required_credits = ""

  for i, h in enumerate(h4_ps):
    # Each <h4> opens a section (required courses, complementary courses, ...).
    if h.name != 'h4':
      continue

    course_type = h.text.strip()  # str.strip() returns a new string, so reassign it

    if i + 1 < len(h4_ps) and h4_ps[i+1].name == 'p':
      # The paragraph right after the heading carries the credit note.
      required_credits = [h4_ps[i+1].text]
    elif "credits" in h.text:
      # Fixed-offset slice: this picks up a single digit only.
      required_credits = course_type[-10:-9]
    elif course_type and i + 2 < len(h4_ps):
      course_type = h4_ps[i+2].text

    # Scan forward to the first <ul> before the next <h4>; that list holds the courses.
    counter = i
    while counter + 1 < len(h4_ps) and h4_ps[counter+1].name != 'h4':
      if h4_ps[counter+1].name == 'ul':
        courses_required = extract_courses(h4_ps[counter+1])

        if course_type in major_struc:
          major_struc[course_type].append([required_credits, courses_required])
        else:
          major_struc[course_type] = [required_credits, courses_required]

        break
      counter += 1

  print(major_struc)
  return major_struc
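
Two details of `extract_all` are worth illustrating. The slice `course_type[-10:-9]` returns one character, so a heading such as `Required Courses (33 credits)` yields `'3'` rather than `'33'`. A repeated heading is also appended in a different shape from the first entry. The sketch below shows one possible way to handle both, using a regular expression and `dict.setdefault`; it is not the notebook's method. `credits_from_heading`, `add_section`, and the sample course rows are hypothetical.

```python
import re

# Matches "(33 credits)" or "(1 credit)" anywhere in a heading.
CREDITS_RE = re.compile(r"\((\d+)\s+credits?\)")

def credits_from_heading(heading):
    """Return the credit count found in a heading as a string, or None."""
    match = CREDITS_RE.search(heading)
    return match.group(1) if match else None

def add_section(structure, heading, credits, courses):
    """Append one entry under a heading so every value is a list of entries."""
    structure.setdefault(heading, []).append({"credits": credits, "courses": courses})
    return structure

structure = {}
heading = "Required Courses (33 credits)"
# Made-up course rows, only to show the resulting shape.
add_section(structure, heading, credits_from_heading(heading), [["COMP 202"]])
add_section(structure, heading, credits_from_heading(heading), [["COMP 206"]])
print(structure)
```

Because each heading always maps to a list of entries, code that reads the dictionary does not need to distinguish a first occurrence from a repeat.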

In [8]:
test_soup = soup_maker('https://www.mcgill.ca/study/2022-2023/faculties/science/undergraduate/programs/bachelor-science-bsc-major-computer-science')

In [9]:
example=extract_all(test_soup)

{'Required Courses (33 credits)': [['* Students who have sufficient knowledge in a programming language do not need to take COMP 202.'], [['COMP 202', 'Foundations of Programming ', '3', 'Fall 2022, Winter 2023, Summer 2023', 'Offered by: Computer Science (Faculty of Science)', 'Computer Science (Sci) : Introduction to computer programming in a high level language: variables, expressions, primitive types, methods, conditionals, loops. Introduction to algorithms, data structures (arrays, strings), modular software design, libraries, file input/output, debugging, exception handling. Selected topics.', "Campbell, Jonathan (Fall) M'hiri, Faten (Winter) M'hiri, Faten (Summer)", ['Prerequisite: a CEGEP level mathematics course'], [], ['Restrictions: Not open to students who have taken or are taking COMP 204, COMP 208, or GEOG 333; not open to students who have taken or are taking COMP 206 or COMP 250.'], ['3 hours', 'COMP 202 is intended as a general introductory course, while COMP 204 is in

In [10]:
import json

# Export data to a JSON file
def export_to_json(data, filename="courses.json"):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
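
As a quick sanity check on the export step, the sketch below writes a small nested structure to a temporary file and reads it back. The sample data is made up and only mirrors the shape of the scraper's output; `export_to_json` is repeated so the block runs on its own.

```python
import json
import os
import tempfile

def export_to_json(data, filename="courses.json"):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

# Made-up sample shaped like the scraper's output (heading -> [note, course rows]).
sample = {
    "Required Courses (3 credits)": [
        ["Sample note with an accent: Montréal"],
        [["COMP 202", "Foundations of Programming", "3"]],
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_courses.json")
    export_to_json(sample, path)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

restored = json.loads(raw_text)
print(restored == sample)
```

With `ensure_ascii=False` the accented character is written as-is rather than as a `\u` escape, which keeps the file readable when names contain non-ASCII characters.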



In [11]:
# Run your scraper
test_url = 'https://www.mcgill.ca/study/2022-2023/faculties/science/undergraduate/programs/bachelor-science-bsc-major-computer-science'
test_soup = soup_maker(test_url)
scraped_data = extract_all(test_soup)

# Export to files
export_to_json(scraped_data, "computer_science_courses.json")

print("Export completed successfully.")


{'Required Courses (33 credits)': [['* Students who have sufficient knowledge in a programming language do not need to take COMP 202.'], [['COMP 202', 'Foundations of Programming ', '3', 'Fall 2022, Winter 2023, Summer 2023', 'Offered by: Computer Science (Faculty of Science)', 'Computer Science (Sci) : Introduction to computer programming in a high level language: variables, expressions, primitive types, methods, conditionals, loops. Introduction to algorithms, data structures (arrays, strings), modular software design, libraries, file input/output, debugging, exception handling. Selected topics.', "Campbell, Jonathan (Fall) M'hiri, Faten (Winter) M'hiri, Faten (Summer)", ['Prerequisite: a CEGEP level mathematics course'], [], ['Restrictions: Not open to students who have taken or are taking COMP 204, COMP 208, or GEOG 333; not open to students who have taken or are taking COMP 206 or COMP 250.'], ['3 hours', 'COMP 202 is intended as a general introductory course, while COMP 204 is in
