# CAO Points

## Introduction

The CAO points data, available from the [CAO website](http://www.cao.ie), is published in a different format for each of the years 2019, 2020, and 2021. Each year's data therefore requires a different approach to acquisition, conversion to a pandas DataFrame, and cleaning. The 2019 data is published in two PDF files: one for level 8 courses, and one for levels 6 and 7. The 2020 data is published as an Excel spreadsheet, and the 2021 data as preformatted text in an HTML web page.

The attributes of interest for comparison between the various years' datasets are `Course Code`, `Course Name`, `Institution Name`, `EOS`, and `Mid`. `EOS` is the number of points achieved by the last applicant to be offered a place on the course, and `Mid` is the midpoint between the highest and the lowest point scores of the applicants offered a place on the course [1]. The 2021 data does not explicitly contain either an `EOS` or a `Mid` column. It does provide the *Round 1* and *Round 2* points required for entry into each course as `RND1` and `RND2`. Examination of the 2020 data, which contains an `EOS` field *and* `RND1` and `RND2` fields, demonstrates that the `EOS` field is equal to the `RND2` value if it exists, otherwise the `RND1` value (```EOS = RND2 if RND2 else RND1```). The `Mid` information does not appear to be available yet for the 2021 data.
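The `EOS` rule described above can be illustrated on a few made-up rows. This is a minimal sketch: the point values are invented for illustration, and a missing Round 2 value is assumed to be represented as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the values are illustrative, not taken from the CAO data
sample = pd.DataFrame({
    'RND1': [300.0, 451.0, 356.0],
    'RND2': [np.nan, 444.0, np.nan],
})

# EOS is the Round 2 value where one exists, otherwise the Round 1 value
sample['EOS'] = sample['RND2'].fillna(sample['RND1'])

print(sample)
```

Only the second row has a Round 2 value, so it is the only row where `EOS` differs from `RND1`.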

In [88]:
# Imports
# Data analysis library
import pandas as pd
# Plotting library
import matplotlib.pyplot as plt
# PDF table parsing
from tabula import read_pdf
# Retrieval of resources from WWW
import requests
# URL construction
from requests.compat import urljoin
# Various utilities, mainly path construction
import os
# Creation of datetime strings for filenames
from datetime import datetime
# Regular expressions
import re
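# Numerical arrays and conditional selection (np.where)
import numpy as np
# Hashing of file contents to detect changes between downloads
import hashlib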


## Acquiring the data


In [89]:
# Location of CAO points data
base_url = 'http://www2.cao.ie/points/'
# Local data directory
data_dir = 'data/cao'
backup_dir = 'data/cao/backup'

# Dictionary of source file names mapped to the file names that will be used locally
points_urls  = ({'l8.php'                  : 'cao_2021_lvl8.html',
                 'l76.php'                 : 'cao_2021_lvl76.html',
                 'CAOPointsCharts2020.xlsx': 'cao_2020_lvl876.xlsx'
                 })

# The rest of points_urls can be assembled programmatically
# as filenames follow a pattern

# List of years as 2-digit strings from 2019 to 2005
years = [str(i).zfill(2) for i in range(19, 4, -1)]
# For each year (2019 to 2005)
for year in years:
    levels = ('lvl8', 'lvl76')

    # 2011 and 2012 data is missing second 'l' from filenames
    if year in ('12', '11'):
        levels = ('lv8', 'lv76')
        
    # For each level 
    for level in levels:
        # construct remote filename
        remote_name = level + '_' + year + '.pdf'
        # construct local filename
        local_name = 'cao_20' + year + '_' + level + '.pdf'
        # Add remote and local filenames as keys and values in points_urls dict
        points_urls[remote_name] = local_name
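As a quick check of the naming pattern, the mapping can be rebuilt in a small function and a few entries inspected. This is a sketch only: `build_points_urls` is a hypothetical helper, and the standard-library `urljoin` is used here in place of `requests.compat.urljoin` so that the example has no third-party dependencies.

```python
from urllib.parse import urljoin

base_url = 'http://www2.cao.ie/points/'


def build_points_urls():
    """Map remote file names to local file names (hypothetical helper)."""
    mapping = {
        'l8.php': 'cao_2021_lvl8.html',
        'l76.php': 'cao_2021_lvl76.html',
        'CAOPointsCharts2020.xlsx': 'cao_2020_lvl876.xlsx',
    }
    # Years 2019 down to 2005 as 2-digit strings
    for year in (str(i).zfill(2) for i in range(19, 4, -1)):
        # 2011 and 2012 file names are missing the second 'l'
        levels = ('lv8', 'lv76') if year in ('11', '12') else ('lvl8', 'lvl76')
        for level in levels:
            mapping[f'{level}_{year}.pdf'] = f'cao_20{year}_{level}.pdf'
    return mapping


points_urls = build_points_urls()
print(len(points_urls), 'files')
print(urljoin(base_url, 'lvl8_19.pdf'), '->', points_urls['lvl8_19.pdf'])
print(urljoin(base_url, 'lv76_11.pdf'), '->', points_urls['lv76_11.pdf'])
```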



In [90]:

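# make sure the local data and backup directories exist before writing to them
os.makedirs(backup_dir, exist_ok=True)

# for each of the source files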
for url in (points_urls):
    # construct url and fetch content
    response = requests.get(urljoin(base_url, url))
    
    try:
        # attempt to open any previously downloaded local file
        with open(os.path.join(data_dir, points_urls[url]), "rb") as f:
            # Calculate md5 hashes for the local file and the remote file
            md5_local = hashlib.md5(f.read()).hexdigest()
            md5_response = hashlib.md5(response.content).hexdigest()

            # Set write_flag to False if the hashes are equal and True if they are not
            write_flag = (md5_local != md5_response)
    except IOError:
        # if the local file does not exist set the write_flag to True and move on
        write_flag = True

    # If the write_flag is True
    if write_flag:
        print(f"File: {points_urls[url]} is new or has changed since last download. Updating...")
        # split the filename into name and extension
        fname, extension = os.path.splitext(points_urls[url])
        # construct unique filename by inserting datetime string between filename and extension
        filename = fname + datetime.now().strftime("_%Y%m%d_%H%M%S") + extension

        # write the timestamped remote file to the backup directory
        with open(os.path.join(backup_dir, filename), 'wb') as f:
            f.write(response.content)

        # also write the remote file to the data directory, overwriting any previous file
        with open(os.path.join(data_dir, points_urls[url]), 'wb') as f:
            f.write(response.content)
    else:
        print(f"File: {points_urls[url]} has not changed since last download. Skipping...")


File: cao_2021_lvl8.html has not changed since last download. Skipping...
File: cao_2021_lvl76.html has not changed since last download. Skipping...
File: cao_2020_lvl876.xlsx has not changed since last download. Skipping...
File: cao_2019_lvl8.pdf has not changed since last download. Skipping...
File: cao_2019_lvl76.pdf has not changed since last download. Skipping...
File: cao_2018_lvl8.pdf has not changed since last download. Skipping...
File: cao_2018_lvl76.pdf has not changed since last download. Skipping...
File: cao_2017_lvl8.pdf has not changed since last download. Skipping...
File: cao_2017_lvl76.pdf has not changed since last download. Skipping...
File: cao_2016_lvl8.pdf has not changed since last download. Skipping...
File: cao_2016_lvl76.pdf has not changed since last download. Skipping...
File: cao_2015_lvl8.pdf has not changed since last download. Skipping...
File: cao_2015_lvl76.pdf has not changed since last download. Skipping...
File: cao_2014_lvl8.pdf has not changed 
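The compare-then-backup logic of the download cell can be tried without network access. The sketch below wraps it in a hypothetical helper, `save_if_changed`, and runs it against a temporary directory with made-up byte strings standing in for downloaded content. As in the cell above, MD5 is used only to detect changes, not for security.

```python
import hashlib
import os
import tempfile
from datetime import datetime


def save_if_changed(content, local_name, data_dir, backup_dir):
    """Write content only if it differs from the stored copy.

    Returns True if the file was written. Hypothetical helper for illustration.
    """
    os.makedirs(data_dir, exist_ok=True)
    os.makedirs(backup_dir, exist_ok=True)
    target = os.path.join(data_dir, local_name)

    try:
        with open(target, 'rb') as f:
            unchanged = hashlib.md5(f.read()).digest() == hashlib.md5(content).digest()
    except FileNotFoundError:
        # No previous download, so the file must be written
        unchanged = False

    if unchanged:
        return False

    # Timestamped copy in the backup directory, plain copy in the data directory
    fname, extension = os.path.splitext(local_name)
    stamp = datetime.now().strftime('_%Y%m%d_%H%M%S')
    for path in (os.path.join(backup_dir, fname + stamp + extension), target):
        with open(path, 'wb') as f:
            f.write(content)
    return True


with tempfile.TemporaryDirectory() as tmp:
    demo_data_dir = os.path.join(tmp, 'cao')
    demo_backup_dir = os.path.join(demo_data_dir, 'backup')
    first = save_if_changed(b'points v1', 'demo.html', demo_data_dir, demo_backup_dir)
    second = save_if_changed(b'points v1', 'demo.html', demo_data_dir, demo_backup_dir)
    third = save_if_changed(b'points v2', 'demo.html', demo_data_dir, demo_backup_dir)
    backups = os.listdir(demo_backup_dir)

print(first, second, third)
```

Because the backup name only has one-second resolution, two changes saved within the same second share a backup file name, and the later one overwrites the earlier.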

### 2021 Points Data

In [91]:
l8 = os.path.join(data_dir, 'cao_2021_lvl8.html')
l76 = os.path.join(data_dir, 'cao_2021_lvl76.html')

# Regular expression to capture fields from lines
# Lines consist of 2 letters and 3 numbers, comprising the course code; some whitespace; 
# 50 characters which start with a non-whitespace character; some more whitespace;
# some optional non whitespace characters comprising round 1 points; some more whitespace;
# and, optionally some more non-whitespace characters comprising round 2 points if present
re_fields = re.compile(r'^([A-Z]{2}[0-9]{3})\s+(\S.{49})\s+(\S+)?\s+(\S+)?')

# array to hold matched groups
data = []

for datafile, level in zip((l8, l76), (8, 76)):
    # encoding=cp1252 necessary to decode some characters on page
    with open(datafile, 'r', encoding='cp1252') as f:
        for line in f:
            match = re.match(re_fields, line)
            if match:
                fields = list(match.groups())
                fields.append(level)
                data.append(fields)

# column names
columns = ['Course Code', 'Course Name', 'Rnd1', 'Rnd2', 'Level']
df = pd.DataFrame.from_records(data, columns=columns)
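To see what the regular expression captures, it can be applied to a few hand-built lines. This is a sketch under assumptions: `make_line` is a hypothetical helper, and the layout it produces (two-space gaps, a course name padded to 50 characters, a trailing newline) only approximates the real page.

```python
import re

re_fields = re.compile(r'^([A-Z]{2}[0-9]{3})\s+(\S.{49})\s+(\S+)?\s+(\S+)?')


def make_line(code, name, rnd1, rnd2=''):
    """Build a hypothetical fixed-width line: code, 50-character name, points."""
    return f"{code}  {name.ljust(50)}  {rnd1}  {rnd2}".rstrip() + "\n"


both_rounds = re_fields.match(make_line('AL830', 'General Nursing', '451*', '444')).groups()
only_rnd1 = re_fields.match(make_line('AL801', 'Software Design', '300')).groups()
# Lines that do not start with a course code, such as institution headings, are skipped
heading = re_fields.match('Athlone Institute of Technology\n')

print(both_rounds)
print(only_rnd1)
print(heading)
```

The pattern expects whitespace after the Round 1 value, which the newline at the end of each line read from the file provides.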



The `Rnd1` and `Rnd2` columns mix point values with marker characters (`#`, `*`, `AQA`, `v`). The next cell moves that information into new boolean columns and derives a numeric `EOS` column:

1. `Test`: the course requires a test, interview or portfolio. This is indicated by a `#` in the `Rnd1` or `Rnd2` column.
2. `Not All`: not all applicants at the stated point score were offered a place. This is indicated by a `*` in the `Rnd1` or `Rnd2` column.
3. `AQA`: All Qualified Applicants were offered a place. This is indicated by `AQA` in the `Rnd1` or `Rnd2` column.
4. `New Comp`: 'New competition for available places', which seems to mark courses for which the points have increased in round 2. It only occurs in level 76 and is indicated by a `v` in the `Rnd2` column.
5. `EOS`: equal to `Rnd2` if it exists, otherwise `Rnd1`. Non-digits are then removed and the column is converted to numeric values, with NaNs where values are missing (`errors='coerce'`). Because NaN is a float, the whole column must be floats.

In [92]:
newcols = {'Test': '#', 'Not All': r'\*', 'AQA': 'AQA', 'New Comp': 'v'}

for k, v in newcols.items():
    df[k] = df['Rnd1'].str.contains(v, na=False) | df['Rnd2'].str.contains(v, na=False)

# Generate 'EOS' column. == Rnd2 if it exists, otherwise Rnd1
df['EOS'] = np.where(df['Rnd2'].isnull(), df['Rnd1'], df['Rnd2'])

# Remove non-digits from the EOS column and convert it to numeric values,
# with NaNs where values are missing (errors = 'coerce')
# (Because NaN is a float, the whole column must be floats)
df['EOS'] = pd.to_numeric(df['EOS'].str.replace('[^0-9]+', '', regex=True), errors='coerce')

df.head(20)

Unnamed: 0,Course Code,Course Name,Rnd1,Rnd2,Level,Test,Not All,AQA,New Comp,EOS
0,AL801,Software Design for Virtual Reality and Gaming...,300,,8,False,False,False,False,300.0
1,AL802,Software Design in Artificial Intelligence for...,313,,8,False,False,False,False,313.0
2,AL803,Software Design for Mobile Apps and Connected ...,350,,8,False,False,False,False,350.0
3,AL805,Computer Engineering for Network Infrastructur...,321,,8,False,False,False,False,321.0
4,AL810,Quantity Surveying ...,328,,8,False,False,False,False,328.0
5,AL811,Civil Engineering ...,,,8,False,False,False,False,
6,AL820,Mechanical and Polymer Engineering ...,327,,8,False,False,False,False,327.0
7,AL830,General Nursing ...,451*,444,8,False,True,False,False,444.0
8,AL832,Mental Health Nursing ...,440*,431,8,False,True,False,False,431.0
9,AL835,Pharmacology ...,356,,8,False,False,False,False,356.0
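The marker handling can be checked on a handful of hand-made rows that cover each case. This is a minimal sketch: the rows are invented for illustration, and `regex=False` is used instead of an escaped pattern so that `*` is matched literally.

```python
import pandas as pd

# Hypothetical rows covering each marker; values are illustrative only
demo = pd.DataFrame({
    'Rnd1': ['300', '451*', '#700', 'AQA', '260'],
    'Rnd2': [None, '444', None, None, '270v'],
})

markers = {'Test': '#', 'Not All': '*', 'AQA': 'AQA', 'New Comp': 'v'}
for column, marker in markers.items():
    # A flag is set if the marker appears in either round
    demo[column] = (demo['Rnd1'].str.contains(marker, na=False, regex=False)
                    | demo['Rnd2'].str.contains(marker, na=False, regex=False))

# EOS is Rnd2 where present, otherwise Rnd1; then keep digits only
eos = demo['Rnd2'].fillna(demo['Rnd1'])
demo['EOS'] = pd.to_numeric(eos.str.replace(r'[^0-9]+', '', regex=True),
                            errors='coerce')

print(demo)
```

The `AQA` row has no digits left after cleaning, so its `EOS` becomes NaN while its `AQA` flag records why.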


### 2020 Points Data

1. Read Excel file using pandas.read_excel, specifying header row, desired columns, and column names
2. Create and populate 'Test', 'Not All', 'Matric', and 'AQA' columns
3. Remove all non-numeric characters from 'EOS' and 'Mid' and convert to numeric type

In [121]:
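def tidy_cols(df):
    """Create marker columns from 'EOS' and convert 'EOS' and 'Mid' to numeric."""
    # Work on a copy so that a filtered slice passed in is not modified in place
    df = df.copy()

    cols = ['Test', 'Not All', 'Matric', 'AQA']
    # 'mat' rather than 'matric': the pdf table parsing can split '+matric'
    # across cells, leaving 'mat' in EOS and 'ic' in Mid
    markers = ['#', '*', 'mat', 'AQA']

    # Convert to string first in case the column was parsed as numbers
    eos = df['EOS'].astype(str)
    for col, marker in zip(cols, markers):
        # regex=False so that '*' is matched literally
        df[col] = eos.str.contains(marker, na=False, regex=False)

    for col in ('EOS', 'Mid'):
        df[col] = df[col].astype(str)
        # Some files (e.g. 2020, level 8) have second point values in parentheses
        # indicating new competition for additional places which must be removed
        # or the two point values will be concatenated in the next step
        df[col] = df[col].str.replace(r'\(.+\)', '', regex=True)
        df[col] = df[col].str.replace(r'[^0-9.]+', '', regex=True)
        df[col] = pd.to_numeric(df[col], errors='coerce', downcast='float')

    return df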
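The order of the two replacements matters. The sketch below uses a hypothetical helper, `strip_to_number`, that mirrors the cleaning steps on invented values and can skip the parenthesis step, to show what happens when it is left out.

```python
import pandas as pd


def strip_to_number(values, drop_parenthesised=True):
    """Reduce point strings to numbers (hypothetical helper for illustration)."""
    cleaned = values.astype(str)
    if drop_parenthesised:
        # Remove a second point value given in parentheses, e.g. '300 (310)'
        cleaned = cleaned.str.replace(r'\(.+\)', '', regex=True)
    # Keep digits and decimal points only
    cleaned = cleaned.str.replace(r'[^0-9.]+', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')


raw = pd.Series(['352*', '#700', '300 (310)', 'AQA'])
with_step = strip_to_number(raw)
without_step = strip_to_number(raw, drop_parenthesised=False)

print(pd.DataFrame({'raw': raw, 'with_step': with_step, 'without_step': without_step}))
```

Without the parenthesis step, the two values in `'300 (310)'` run together into a single six-digit number, which is not a valid points score.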

In [94]:
df2020 = pd.read_excel(os.path.join(data_dir, 'cao_2020_lvl876.xlsx'), 
                       header=10, 
                       usecols="B,C,H,I,J,L", 
                       names=['Course Name', 'Course Code', 'EOS', 'EOS *', 'Mid', 'Institution Name'],
                       converters={'EOS':str,'Mid':str})

# Asterisks usually found in EOS are in a separate col in this dataset
# Move asterisks to EOS so generic parser can be used
df2020['EOS'] = np.where(df2020['EOS *'].str.contains('*', na=False, regex=False), 
    df2020['EOS'] + '*', df2020['EOS']) 
df2020 = df2020.drop('EOS *', axis=1)

# tidy_cols creates the 'Test', 'Not All', 'Matric' and 'AQA' columns from the markers
# found in EOS ('#', '*', 'mat' and 'AQA'), then strips non-numeric characters from
# EOS and Mid and converts them to numeric values (NaN where values are missing).
# The 'Matric' marker is 'mat' rather than 'matric' because, in the pdf datasets,
# tabula can split '+matric' across cells, leaving 'mat' in EOS and 'ic' in Mid.
df2020 = tidy_cols(df2020)

df2020.head(100)

Unnamed: 0,Course Name,Course Code,EOS,Mid,Institution Name,Test,Not All,Matric,AQA
0,International Business,AC120,209.0,280.0,American College,False,False,False,False
1,Liberal Arts,AC137,252.0,270.0,American College,False,False,False,False
2,"First Year Art & Design (Common Entry,portfolio)",AD101,,,National College of Art and Design,True,False,True,False
3,Graphic Design and Moving Image Design (portfo...,AD102,,,National College of Art and Design,True,False,True,False
4,Textile & Surface Design and Jewellery & Objec...,AD103,,,National College of Art and Design,True,False,True,False
...,...,...,...,...,...,...,...,...,...
95,Theatre and Performative Practices - 3 or 4 ye...,CK112,330.0,434.0,University College Cork (NUI),False,False,False,False
96,Criminology - 3 years or 4 years (Internationa...,CK113,423.0,463.0,University College Cork (NUI),False,False,False,False
97,Social Science (Youth and Community Work) - 3 ...,CK114,777.0,,University College Cork (NUI),False,False,False,False
98,Social Work - Mature Applicants only,CK115,999.0,,University College Cork (NUI),False,False,False,False


### 2019 Points Data

The 2019 points data is held in two PDF files, one for level 8 courses and one for levels 6 and 7.

1. Read using tabula.read_pdf()
2. If necessary remove unwanted rows and assign header row
3. Fix and rename headers
4. Fill in institution column
5. Remove rows without course codes
6. Create and populate 'Test', 'Not All', 'Matric', and 'AQA' columns
7. Remove all non-numeric characters from 'EOS' and 'Mid' and convert to numeric type
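Steps 4 and 5 rely on the layout of the tables, in which each institution name appears on its own row with no course code. A minimal sketch on a hand-made table (the rows are illustrative, and `where` is used here as one possible way to pick out the institution rows):

```python
import pandas as pd

raw = pd.DataFrame({
    'Course Code': [None, 'AL600', 'AL601', None, 'CW077'],
    'Course Name': ['Athlone Institute of Technology',
                    'Software Design',
                    'Computer Engineering',
                    'Institute of Technology, Carlow',
                    'Tourism and Event Management (Wexford)'],
})

# Rows with no course code hold an institution name in the Course Name column
raw['Institution'] = raw['Course Name'].where(raw['Course Code'].isna())

# Carry each institution name down to the course rows that follow it
raw['Institution'] = raw['Institution'].ffill()

# Keep only the rows that describe a course
courses = raw[raw['Course Code'].notna()].reset_index(drop=True)

print(courses)
```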


In [95]:
def read_cao_pdf(pdf_path, header_row=None, splitfirst=False, table_num=0, drop_col=None, merge_drop=None):
    """Read a CAO points pdf into a tidy DataFrame.

    header_row: index of the last row to discard at the top of the table
    splitfirst: True if course code and course name were merged into one column
    table_num:  index of the table to use from those returned by tabula
    drop_col:   list of positions of columns to drop
    merge_drop: position of the column that receives the values of the first
                dropped column wherever its own values are missing
    """
    df = read_pdf(pdf_path, pages='all', multiple_tables=True)[table_num]

    # 2016 data has a ghost column
    if drop_col is not None:
        if merge_drop is not None:
            col1 = df.columns[drop_col[0]]
            col2 = df.columns[merge_drop]
            df.loc[df[col2].isnull(), col2] = df[col1]

        df = df.drop(df.columns[drop_col], axis=1)

    df.columns = ['Course Code', 'Course Name', 'EOS', 'Mid']

    if header_row is not None:
        df = df.rename_axis(None, axis=1)

        # Delete rows up to and including header_row
        df = df.drop(df.index[range(0, header_row + 1)], axis=0)

        # Reset the index
        df = df.reset_index(drop=True)

    # Create a new column in the dataframe for institution name
    # identify institution name rows as those containing null course codes
    # and add those institution names to the new institution column
    df['Institution'] = df[df['Course Code'].isnull()]['Course Name']

    # Fill empty fields in the institution column with the most recent non-na field
    df['Institution'] = df['Institution'].ffill()

    # Remove rows containing only institution names
    # (copy so that later assignments do not act on a view of the original)
    df = df[df['Course Code'].notna()].copy()

    # A missing vertical line causes the pdf parser to merge the course code
    # and course name columns in certain tables (e.g. 2014 levels 6 & 7)
    # If that is the case we need to shift column contents to the right
    # then split the first column into course code and course name
    if splitfirst:
        # Shift the values in EOS to Mid
        df['Mid'] = df['EOS']
        # Shift the values in Course Name to EOS
        df['EOS'] = df['Course Name']
        # Extract the course name from the Course Code column and place in Course Name column
        df['Course Name'] = df['Course Code'].str.extract(r'^\D\D\d{3}(.+)$')
        # Extract the course code from the Course Code column and place in Course Code column
        df['Course Code'] = df['Course Code'].str.extract(r'^(\D\D\d{3})')

    # Remove page header rows
    df = df[df['Course Code'] != 'Course Code']

    df = tidy_cols(df)

    return df
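The `splitfirst` branch can be illustrated on its own. In this sketch the merged cell contents are an assumption (a course code followed by a space and the course name), and the `.str.strip()` call is an addition for tidiness that is not part of `read_cao_pdf`.

```python
import pandas as pd

# Hypothetical merged first-column values
merged = pd.Series(['AL601 Electronics and Computer Engineering',
                    'AL630 Pharmacy Technician'])

# Two non-digits followed by three digits form the course code
codes = merged.str.extract(r'^(\D\D\d{3})', expand=False)

# Everything after the course code is the course name
names = merged.str.extract(r'^\D\D\d{3}(.+)$', expand=False).str.strip()

print(pd.DataFrame({'Course Code': codes, 'Course Name': names}))
```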


In [122]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2019_lvl8.pdf")
df8 = read_cao_pdf(pdf_path)

# Repair LM124 Course Name
df8.loc[df8['Course Code']=='LM124', 'Course Name'] += 'ce)'

df8.head()


In [97]:
# read the level 6 & 7 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2019_lvl76.pdf")
df76 = read_cao_pdf(pdf_path, header_row=7)

# Repair WD177 Course Name
df76.loc[df76['Course Code']=='WD177', 'Course Name'] += 'macy.)'

df76.head()


Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL600,Software Design,205.0,306,Athlone Institute of Technology,False,False,False,False
2,AL601,Computer Engineering,196.0,272,Athlone Institute of Technology,False,False,False,False
3,AL602,Mechanical Engineering,258.0,424,Athlone Institute of Technology,False,False,False,False
4,AL604,Civil Engineering,252.0,360,Athlone Institute of Technology,False,False,False,False
5,AL630,Pharmacy Technician,306.0,366,Athlone Institute of Technology,False,False,False,False


#### Merge dataframes

In [98]:
# add level 8 column to both dataframes
df8['Level 8'] = True
df76['Level 8'] = False

In [99]:
# concatenate level 8 with levels 6 & 7
df = pd.concat([df8, df76], ignore_index=True)

# Rename column names to include year   
df = df.rename({'Course Name': 'Course Name 2019', 
                'EOS': 'EOS 2019', 
                'Mid':'Mid 2019', 
                'Test':'Test 2019', 
                'Not All': 'NotAll 2019',
                'Matric': 'Matric 2019',
                'AQA': 'AQA 2019',
                'Level 8': 'Level8 2019'}, axis=1)

In [100]:
df.tail()

Unnamed: 0,Course Code,Course Name 2019,EOS 2019,Mid 2019,Institution,Test 2019,NotAll 2019,Matric 2019,AQA 2019,Level8 2019
77,CW077,Tourism and Event Management (Wexford),204.0,259.0,"Institute of Technology, Carlow",False,False,False,False,False
78,CW106,Physiology and Health Science,454.0,499.0,"Institute of Technology, Carlow",False,False,False,False,False
79,CW107,Analytical Science,215.0,316.0,"Institute of Technology, Carlow",False,False,False,False,False
80,CW116,Pharmacy Technician Studies,241.0,351.0,"Institute of Technology, Carlow",False,False,False,False,False
81,CW117,Biosciences,205.0,351.0,"Institute of Technology, Carlow",False,False,False,False,False


In [101]:
# export dataframe to csv
df.to_csv('data/cao/cao_points_2019.csv')

### 2018

In [102]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2018_lvl8.pdf")
df8_18 = read_cao_pdf(pdf_path, header_row=7)

df8_18.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL801,Software Design (Game Development or Cloud Com...,295,326,Athlone Institute of Technology,False,False,False,False
2,AL810,Quantity Surveying,300,340,Athlone Institute of Technology,False,False,False,False
3,AL820,Mechanical and Polymer Engineering,299,371,Athlone Institute of Technology,False,False,False,False
4,AL830,General Nursing,418,440,Athlone Institute of Technology,False,False,False,False
5,AL832,Psychiatric Nursing,377,388,Athlone Institute of Technology,False,False,False,False


In [103]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2018_lvl76.pdf")
df76_18 = read_cao_pdf(pdf_path, header_row=7)

df76_18.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL601,Electronics and Computer Engineering,240.0,321.0,Athlone Institute of Technology,False,False,False,False
2,AL602,Mechanical Engineering,201.0,299.0,Athlone Institute of Technology,False,False,False,False
3,AL604,Civil Engineering,243.0,320.0,Athlone Institute of Technology,False,False,False,False
4,AL630,Pharmacy Technician,306.0,388.0,Athlone Institute of Technology,False,False,False,False
5,AL631,Dental Nursing,307.0,348.0,Athlone Institute of Technology,False,False,False,False


### 2017

In [104]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2017_lvl8.pdf")
df8_17 = read_cao_pdf(pdf_path)

df8_17.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL801,Software Design (Game Development or Cloud Com...,290,329.0,Athlone Institute of Technology,False,False,False,False
2,AL810,Quantity Surveying,311,357.0,Athlone Institute of Technology,False,False,False,False
3,AL820,Mechanical and Polymer Engineering,300,336.0,Athlone Institute of Technology,False,False,False,False
4,AL830,General Nursing,398,418.0,Athlone Institute of Technology,False,True,False,False
5,AL832,Psychiatric Nursing,378,389.0,Athlone Institute of Technology,False,False,False,False


In [105]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2017_lvl76.pdf")
df76_17 = read_cao_pdf(pdf_path)

df76_17.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL601,Electronics and Computer Engineering,228.0,420.0,Athlone Institute of Technology,False,False,False,False
2,AL602,Mechanical Engineering,212.0,303.0,Athlone Institute of Technology,False,False,False,False
3,AL604,Civil Engineering,,281.0,Athlone Institute of Technology,False,False,False,True
4,AL630,Pharmacy Technician,290.0,356.0,Athlone Institute of Technology,False,False,False,False
5,AL631,Dental Nursing,273.0,336.0,Athlone Institute of Technology,False,False,False,False


### 2016

In [106]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2016_lvl8.pdf")
df8_16 = read_cao_pdf(pdf_path, header_row=6, drop_col=[4])

df8_16.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL801,Software Design (Game Development or Cloud Com...,300.0,340.0,Athlone Institute of Technology,False,False,False,False
2,AL810,Quantity Surveying,315.0,355.0,Athlone Institute of Technology,False,False,False,False
3,AL820,Mechanical and Polymer Engineering,295.0,340.0,Athlone Institute of Technology,False,False,False,False
4,AL830,General Nursing,425.0,440.0,Athlone Institute of Technology,False,True,False,False
5,AL831,Mature Applicants General Nursing,181.0,185.0,Athlone Institute of Technology,True,False,False,False


In [107]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2016_lvl76.pdf")
df76_16 = read_cao_pdf(pdf_path, header_row=6, drop_col=[4])

df76_16.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL601,Electronics and Computer Engineering,205.0,295.0,Athlone Institute of Technology,False,False,False,False
2,AL602,Mechanical Engineering,205.0,305.0,Athlone Institute of Technology,False,False,False,False
3,AL604,Civil Engineering,280.0,370.0,Athlone Institute of Technology,False,False,False,False
4,AL630,Pharmacy Technician,270.0,383.0,Athlone Institute of Technology,False,False,False,False
5,AL631,Dental Nursing,275.0,365.0,Athlone Institute of Technology,False,False,False,False


### 2015

In [108]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2015_lvl8.pdf")
df8_15 = read_cao_pdf(pdf_path, header_row=14)

df8_15.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL801,Software Design (Game Development or Cloud Com...,280,345,Athlone Institute of Technology,False,False,False,False
2,AL820,Mechanical and Polymer Engineering,315,355,Athlone Institute of Technology,False,False,False,False
3,AL830,General Nursing,420,435,Athlone Institute of Technology,False,False,False,False
4,AL831,Mature Applicants General Nursing,176,182,Athlone Institute of Technology,True,True,False,False
5,AL832,Psychiatric Nursing,390,400,Athlone Institute of Technology,False,False,False,False


In [None]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2015_lvl76.pdf")
df76_15 = read_cao_pdf(pdf_path, header_row=13)

df76_15.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL601,Electronics and Computer Engineering,210.0,315,Athlone Institute of Technology,False,False,False,False
2,AL602,Mechanical Engineering,175.0,260,Athlone Institute of Technology,False,False,False,False
3,AL604,Civil Engineering,175.0,305,Athlone Institute of Technology,False,False,False,False
4,AL630,Pharmacy Technician,270.0,390,Athlone Institute of Technology,False,False,False,False
5,AL631,Dental Nursing,265.0,330,Athlone Institute of Technology,False,False,False,False


### 2014

In [110]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2014_lvl8.pdf")
df8_14 = read_cao_pdf(pdf_path, header_row=13)

df8_14.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL801,Software Design (Common Entry,280,335,ATHLONE IT,False,False,False,False
2,AL820,Mechanical and Polymer Engineering,315,365,ATHLONE IT,False,False,False,False
3,AL830,General Nursing,410,420,ATHLONE IT,False,False,False,False
4,AL831,Mature Applicants General Nursing,169,173,ATHLONE IT,True,False,False,False
5,AL832,Psychiatric Nursing,390,395,ATHLONE IT,False,False,False,False


In [111]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2014_lvl76.pdf")
df76_14 = read_cao_pdf(pdf_path, header_row=12, splitfirst=True)

df76_14.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL601,Electronics and Computer Engineering,185.0,290.0,ATHLONE IT,False,False,False,False
2,AL602,Mechanical Engineering,180.0,255.0,ATHLONE IT,False,False,False,False
3,AL604,Civil Engineering,95.0,250.0,ATHLONE IT,False,False,False,False
4,AL630,Pharmacy Technician,320.0,390.0,ATHLONE IT,False,False,False,False
5,AL631,Dental Nursing,265.0,335.0,ATHLONE IT,False,False,False,False


### 2013

In [112]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2013_lvl8.pdf")
df8_13 = read_cao_pdf(pdf_path, header_row=10)

df8_13.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
2,AL802,Software Design (Games Development),275,325,ATHLONE IT,False,False,False,False
3,AL803,Software Design (Cloud Computing),280,345,ATHLONE IT,False,False,False,False
4,AL830,General Nursing,410,415,ATHLONE IT,False,True,False,False
5,AL831,Mature Applicants General Nursing,566,581,ATHLONE IT,True,False,False,False
6,AL832,Psychiatric Nursing,395,400,ATHLONE IT,False,False,False,False


In [114]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2013_lvl76.pdf")
df76_13 = read_cao_pdf(pdf_path, header_row=10)

df76_13.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
3,AL601,Electronics and Computer Engineering,205.0,285,ATHLONE IT,False,False,False,False
4,AL604,Civil Engineering,165.0,260,ATHLONE IT,False,False,False,False
5,AL630,Pharmacy Technician,305.0,400,ATHLONE IT,False,False,False,False
6,AL631,Dental Nursing,300.0,350,ATHLONE IT,False,False,False,False
7,AL632,Science (Bioscience/Chemistry),160.0,300,ATHLONE IT,False,False,False,False


### 2012

In [115]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2012_lv8.pdf")
df8_12 = read_cao_pdf(pdf_path, header_row=11)

df8_12.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
3,AL802,Software Design (Games Development),300.0,340.0,ATHLONE IT,False,False,False,False
4,AL803,Software Design (Web Development),310.0,335.0,ATHLONE IT,False,False,False,False
5,AL805,Construction Technology and Management,,,ATHLONE IT,False,False,False,False
6,AL830,General Nursing,415.0,430.0,ATHLONE IT,False,True,False,False
7,AL831,Mature Applicants General Nursing,233.0,235.0,ATHLONE IT,True,False,False,False


In [116]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2012_lv76.pdf")
df76_12 = read_cao_pdf(pdf_path, header_row=10)

df76_12.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
3,AL601,Electronics and Computer Engineering,200.0,325.0,ATHLONE IT,False,False,False,False
4,AL602,Mechanical Engineering,200.0,285.0,ATHLONE IT,False,False,False,False
5,AL603,Construction Studies,195.0,280.0,ATHLONE IT,False,False,False,False
6,AL604,Civil Engineering,240.0,280.0,ATHLONE IT,False,False,False,False
7,AL630,Pharmacy Technician,275.0,365.0,ATHLONE IT,False,False,False,False


### 2011

In [117]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2011_lv8.pdf")
df8_11 = read_cao_pdf(pdf_path, header_row=24)

df8_11.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
0,AL032,Software Design (Games Development),285.0,330.0,,False,False,False,False
1,AL033,Toxicology,240.0,330.0,,False,False,False,False
2,AL034,Software Design (Web Development),285.0,340.0,,False,False,False,False
3,AL035,Construction Technology and Management,265.0,315.0,,False,False,False,False
4,AL050,Business,270.0,325.0,,False,False,False,False


In [118]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2011_lv76.pdf")
df76_11 = read_cao_pdf(pdf_path, header_row=19)

df76_11.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL001,Business,160.0,280,ATHLONE IT,False,False,False,False
2,AL002,Culinary Arts,155.0,215,ATHLONE IT,False,False,False,False
3,AL003,Office Management,,190,ATHLONE IT,False,False,False,True
4,AL004,Bar Supervision,135.0,185,ATHLONE IT,False,False,False,False
5,AL006,Applied Social Studies in Social Care,315.0,345,ATHLONE IT,False,False,False,False


### 2010

In [119]:
# read the level 8 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2010_lvl8.pdf")
df8_10 = read_cao_pdf(pdf_path, header_row=17)

df8_10.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL032,Software Design (Games Development),265.0,315.0,ATHLONE IT,False,False,False,False
2,AL033,Toxicology,280.0,345.0,ATHLONE IT,False,False,False,False
3,AL034,Software Design (Web Development),270.0,300.0,ATHLONE IT,False,False,False,False
4,AL035,Construction Technology and Management,265.0,310.0,ATHLONE IT,False,False,False,False
5,AL050,Business,275.0,320.0,ATHLONE IT,False,False,False,False


In [120]:
# read the level 76 pdf, extracting tables into a single dataframe
pdf_path = os.path.join(data_dir, "cao_2010_lvl76.pdf")
df76_10 = read_cao_pdf(pdf_path, header_row=None, table_num=1, drop_col=[1], merge_drop=2)

df76_10.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution,Test,Not All,Matric,AQA
1,AL001,Business,140.0,270.0,ATHLONE IT,False,False,False,False
2,AL002,Culinary Arts,115.0,205.0,ATHLONE IT,False,False,False,False
3,AL003,Office Management,120.0,225.0,ATHLONE IT,False,False,False,False
4,AL004,Bar Supervision,,175.0,ATHLONE IT,False,False,False,True
5,AL006,Applied Social Studies in Social Care,315.0,350.0,ATHLONE IT,False,False,False,False


## Analysing the data

## Conclusion

## References

[1] https://www.independent.ie/life/family/learning/understanding-your-cao-course-guide-26505318.html