# Pre-requisites of Compulsory and Optional modules

Report: 

- Datasets: We used web scraping to retrieve the BSc Data Science course structure from the LSE website, including the pre-requisites (if any) of each module across all 3 years. We did not count A-level Mathematics as a pre-requisite.
https://www.lse.ac.uk/resources/calendar2021-2022/programmeRegulations/undergraduate/2021/BScDataScience.htm

- The data we collected is stored in the 'pre-requisites' dictionary, which has course codes as keys and the list of pre-requisites for each course as values. We scraped the data from the LSE website on the 20th of April at 3pm.
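
As a rough illustration of that layout, the sketch below builds a small dictionary of the same shape from three entries that appear in the output later in this notebook. The names `example_pre_requisites` and `pre_requisites_of` are ours and are used for illustration only.

```python
# Illustrative subset only: course codes as keys, lists of pre-requisite codes as values
example_pre_requisites = {
    'ST102': [],
    'ST101': ['ST102', 'ST107'],
    'MA102': ['MA100', 'MA107'],
}

def pre_requisites_of(course, table):
    """Return the pre-requisites recorded for a course, or an empty list if the course is absent."""
    return table.get(course, [])

print(pre_requisites_of('MA102', example_pre_requisites))

# a course that is missing from the table simply yields an empty list
print(pre_requisites_of('ST304', example_pre_requisites))
```

Using `.get` with a default avoids a `KeyError` for modules whose pages could not be scraped.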



## Importing the data

In [1]:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.lse.ac.uk/resources/calendar2021-2022/programmeRegulations/undergraduate/2021/BScDataScience.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')

# Looking at the contents of the webpage
soup.contents

[<html>
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <title>BSc in Data Science</title>
 <meta content="en-gb" http-equiv="Content-Language"/>
 <meta content="Calendar web editor" name="dc.creator"/>
 <meta content="calendar@lse.ac.uk" name="dc.creator.email"/>
 <meta content="Academic Registrar's Division" name="lse.pagesource"/>
 <meta content="Fri Sep 24 16:54:48 BST 2021" name="lse.publishDate"/>
 <link href="/css/capis.css" rel="stylesheet" type="text/css"/>
 <link href="../../../Default.htm" rel="start" title="Calendar"/>
 <link href="../../../undergraduate.htm" rel="prev" title="Undergraduate"/>
 <link href="../Default.htm" rel="prev" title="Programme regulations"/>
 <link href="Default.htm" rel="prev" title="2021"/>
 <script type="text/javascript">
 function getParams ()
 {
     var result = {};
     var tmp = [];
 
     location.search
         .substr (1)
         .split ("&")
         .forEach (function (item)
         {
             tmp = i

## Extracting Module codes and their Links

In [2]:
modules = {}

for tag in soup.find_all('td', colspan="2"):
    
    a_tag = tag.find('a')
    
    #checking if tags like <td colspan="2"><p><a... exist
    if a_tag:
        
        module_code = a_tag.get_text().strip()
        
        if module_code not in ['Options list','Paper 10 options list','Papers 10 & 11 options list','LSE100']:
            
            #finding the link for each course code
            module_link = a_tag['href'].replace('../../..', 'https://www.lse.ac.uk/resources/calendar')
            modules[module_code] = module_link
    
modules

{'ST102': 'https://www.lse.ac.uk/resources/calendar/courseGuides/ST/2021_ST102.htm',
 'MA100': 'https://www.lse.ac.uk/resources/calendar/courseGuides/MA/2021_MA100.htm',
 'ST101': 'https://www.lse.ac.uk/resources/calendar/courseGuides/ST/2021_ST101.htm',
 'EC1A3': 'https://www.lse.ac.uk/resources/calendar/courseGuides/EC/2021_EC1A3.htm',
 'EC1B3': 'https://www.lse.ac.uk/resources/calendar/courseGuides/EC/2021_EC1B3.htm',
 'FM101': 'https://www.lse.ac.uk/resources/calendar/courseGuides/FM/2021_FM101.htm',
 'MA102': 'https://www.lse.ac.uk/resources/calendar/courseGuides/MA/2021_MA102.htm',
 'MA222': 'https://www.lse.ac.uk/resources/calendar/courseGuides/MA/2021_MA222.htm',
 'ST206': 'https://www.lse.ac.uk/resources/calendar/courseGuides/ST/2021_ST206.htm',
 'ST202': 'https://www.lse.ac.uk/resources/calendar/courseGuides/ST/2021_ST202.htm',
 'MA214': 'https://www.lse.ac.uk/resources/calendar/courseGuides/MA/2021_MA214.htm',
 'ST310': 'https://www.lse.ac.uk/resources/calendar/courseGuides/
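
The relative links on the programme page can also be resolved with the standard library's `urllib.parse.urljoin`. This is an alternative sketch, not the approach used in the cell above. The example `href` is the relative form implied by the replacement rule and the first entry of the output. Note that `urljoin` keeps the `calendar2021-2022` folder of the programme page, whereas the string replacement above points at `/resources/calendar`, so the two approaches do not produce the same URL.

```python
from urllib.parse import urljoin

programme_url = 'https://www.lse.ac.uk/resources/calendar2021-2022/programmeRegulations/undergraduate/2021/BScDataScience.htm'

# relative link as it would appear on the page (assumed from the replacement rule above)
href = '../../../courseGuides/ST/2021_ST102.htm'

# each '..' climbs one folder up from the folder that contains the programme page
resolved = urljoin(programme_url, href)
print(resolved)
```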

## Looking at the Contents of a Single Module Webpage

In [3]:
url = 'https://www.lse.ac.uk/resources/calendar/courseGuides/MA/2021_MA102.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')

# Looking at the contents of the webpage
soup.contents

[<html>
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <title>MA102 Mathematical Proof and Analysis</title>
 <meta content="Calendar web editor" name="dc.creator"/>
 <meta content="calendar@lse.ac.uk" name="dc.creator.email"/>
 <meta content="Academic Registrar's Division" name="lse.pagesource"/>
 <link href="/css/capis.css" rel="stylesheet" type="text/css"/>
 <link href="../../Default.htm" rel="start" title="Calendar"/>
 <link href="../../undergraduate.htm" rel="prev" title="Undergraduate"/>
 <link href="../../courseGuides/undergraduate.htm" rel="prev" title="Course guides"/>
 <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js" type="text/javascript"></script><script src="https://www.gstatic.com/charts/loader.js" type="text/javascript"></script><script type="text/javascript">
 
         function buildChart() { // made sure function is loaded before so that google chart's callback needs it
 
             var examsContentId ;
  

## Extracting Module codes and their corresponding Pre-requisites information

In [4]:
#dictionary that has course codes as keys and pre-requisites information as values
pre_req_info = {}

for module, module_link in modules.items():
    
    #skip ST304 because its course guide page doesn't load
    if module == 'ST304':
        continue
        
    r = requests.get(module_link)
    each_module = BeautifulSoup(r.content,'lxml')
    pr = each_module.find('p', {'id': 'preRequisites-Label'})
    
    #only store modules whose course guide has a pre-requisites section
    if pr:
        
        pr_content = pr.find_next('div', {'id': 'preRequisites-Content'})
        pre_req_info[module] = pr_content.get_text(strip=True)
            
pre_req_info

{'ST102': 'A-level Mathematics.No previous knowledge of statistics is assumed.',
 'MA100': 'This course assumes knowledge of the elementary techniques of mathematics including calculus, as evidenced for example by a good grade in A Level Mathematics.',
 'ST101': 'Although not a formal requirement, it is preferable that students have some familiarity with the basic concepts of probability and statistics, to the level of ST102/ST107 first 2 chapters (Data visualisation and descriptive statistics and probability theory).',
 'EC1A3': 'A-level mathematics, or equivalent.',
 'EC1B3': 'Students must have completed Microeconomics I (EC1A3).A-level mathematics, or equivalent.',
 'MA102': 'Students should have taken, or be taking concurrently, the course Mathematical Methods (MA100),orthe course Quantitative Methods (Mathematics) (MA107).',
 'MA222': 'Students should ideally have taken the course Mathematical Methods (MA100) or equivalent, entailing intermediate-level knowledge of linear algebra

## Cleaning Pre-requisites data

In [5]:
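pre_requisites = {}

for course, pre_req in pre_req_info.items():
    
    words = []
    
    #CLEAN THE PRE-REQUISITES DATA
    for word in pre_req.split(' '):

        #keep only the text before the first full stop or comma
        word = word.split('.')[0]
        word = word.split(',')[0]
        
        #keep up to 5 characters after an opening bracket, e.g. '(MA100)' -> 'MA100'
        word = word.split('(')[1][:5] if '(' in word else word
        
        #split combined codes such as 'ST102/ST107' into a list of codes
        word = word.split('/') if '/' in word and re.search(r"\d", word) else word
        
        #split codes joined by 'and', unless the word has already become a list
        word = word.split('and') if isinstance(word, str) and 'and' in word and re.search(r"\d", word) else word
        
        words.append(word)
        
    #store the cleaned words once all words of this course are processed
    pre_requisites[course] = words
        
print(pre_requisites)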

{'ST102': ['A-level', 'Mathematics', 'previous', 'knowledge', 'of', 'statistics', 'is', 'assumed'], 'MA100': ['This', 'course', 'assumes', 'knowledge', 'of', 'the', 'elementary', 'techniques', 'of', 'mathematics', 'including', 'calculus', 'as', 'evidenced', 'for', 'example', 'by', 'a', 'good', 'grade', 'in', 'A', 'Level', 'Mathematics'], 'ST101': ['Although', 'not', 'a', 'formal', 'requirement', 'it', 'is', 'preferable', 'that', 'students', 'have', 'some', 'familiarity', 'with', 'the', 'basic', 'concepts', 'of', 'probability', 'and', 'statistics', 'to', 'the', 'level', 'of', ['ST102', 'ST107'], 'first', '2', 'chapters', 'Data', 'visualisation', 'and', 'descriptive', 'statistics', 'and', 'probability', 'theory)'], 'EC1A3': ['A-level', 'mathematics', 'or', 'equivalent'], 'EC1B3': ['Students', 'must', 'have', 'completed', 'Microeconomics', 'I', 'EC1A3', 'mathematics', 'or', 'equivalent'], 'MA102': ['Students', 'should', 'have', 'taken', 'or', 'be', 'taking', 'concurrently', 'the', 'course
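
A shorter route to the same course codes is to search the raw pre-requisites text directly with `re.findall`. This is sketched here as an alternative, not as the method used in this notebook. The two sample strings are copied from the `pre_req_info` output above; the names `CODE_PATTERN` and `find_course_codes` are our own.

```python
import re

# LLNNN (e.g. MA100) or LLNLN (e.g. EC1A3), delimited by word boundaries
CODE_PATTERN = re.compile(r"\b[A-Z]{2}(?:\d{3}|\d[A-Z]\d)\b")

def find_course_codes(text):
    """Return the distinct course codes found in text, in order of first appearance."""
    codes = []
    for code in CODE_PATTERN.findall(text):
        if code not in codes:
            codes.append(code)
    return codes

sample = {
    'EC1B3': 'Students must have completed Microeconomics I (EC1A3).A-level mathematics, or equivalent.',
    'MA102': 'Students should have taken, or be taking concurrently, the course Mathematical Methods (MA100),orthe course Quantitative Methods (Mathematics) (MA107).',
}

for course, text in sample.items():
    print(course, find_course_codes(text))
```

Because the pattern works on the unsplit text, punctuation glued to a code (brackets, commas, slashes) needs no separate cleaning step.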

## Extracting the Pre-requisite modules for each module

In [6]:
#course codes come in two formats: LLNNN (e.g. ST102) and LLNLN (e.g. EC1A3)
course_code = re.compile(r"^[A-Z]{2}(\d{3}|\d[A-Z]\d)$")

final_pre_requisites = {}

for course, pre_req in pre_requisites.items():
    course_codes = []
    
    for word in pre_req:
        
        #a word is either a string or a list of strings (from splitting on '/' or 'and')
        candidates = [word] if isinstance(word, str) else word
        
        for candidate in candidates:
            
            #keep each course code once, in order of first appearance
            if course_code.match(candidate) and candidate not in course_codes:
                
                course_codes.append(candidate)
                
    final_pre_requisites[course] = course_codes

final_pre_requisites

{'ST102': [],
 'MA100': [],
 'ST101': ['ST102', 'ST107'],
 'EC1A3': [],
 'EC1B3': ['EC1A3'],
 'MA102': ['MA100', 'MA107'],
 'MA222': ['MA100'],
 'ST206': ['ST102', 'MA100', 'MA107', 'ST109'],
 'ST202': ['ST102', 'MA100', 'MA107', 'ST109'],
 'MA214': ['MA102', 'MA103', 'ST101'],
 'ST310': ['ST102', 'MA100', 'MA107', 'ST109'],
 'ST300': ['ST202', 'ST206', 'MA100'],
 'ST308': ['ST102', 'MA100', 'MA107', 'ST109', 'ST202'],
 'ST326': ['ST202', 'ST206', 'ST211'],
 'ST312': ['ST102'],
 'MA301': ['MA100', 'MA107'],
 'MA316': ['MA103'],
 'MA320': ['MA100', 'MA103'],
 'MA333': ['MA208'],
 'ST301': ['ST202', 'ST206', 'ST227'],
 'ST302': ['ST202', 'ST206'],
 'ST303': ['ST202', 'ST206', 'ST302', 'ST306'],
 'ST307': ['ST107'],
 'FM213': ['MA100', 'ST102', 'MA107', 'ST109'],
 'FM300': ['FM212', 'FM213'],
 'ST327': ['ST102', 'ST107', 'ST203', 'MG205', 'MG202', 'ST109'],
 'ST330': ['ST202', 'ST206', 'ST302'],
 'EC2A3': ['MA107', 'MA100'],
 'EC2B3': ['MA107', 'MA100'],
 'EC2C3': ['ST102', 'ST107'],
 'EC2C4': [
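
One way to use a dictionary of this shape is to invert it, so that each pre-requisite maps to the modules that depend on it. The sketch below is a minimal example under our own naming (`subset`, `required_by`), run on five entries copied from the output above rather than on the full dictionary.

```python
from collections import defaultdict

# five entries copied from the output above (illustrative subset only)
subset = {
    'ST101': ['ST102', 'ST107'],
    'MA102': ['MA100', 'MA107'],
    'MA222': ['MA100'],
    'ST206': ['ST102', 'MA100', 'MA107', 'ST109'],
    'ST202': ['ST102', 'MA100', 'MA107', 'ST109'],
}

# map each pre-requisite to the list of modules that name it
required_by = defaultdict(list)
for module, pre_reqs in subset.items():
    for pre_req in pre_reqs:
        required_by[pre_req].append(module)

# the pre-requisite named by the most modules in this subset
most_required = max(required_by, key=lambda code: len(required_by[code]))
print(most_required, required_by[most_required])
```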