# CAO Points

## Introduction

The CAO points data, available from the [CAO website](http://www.cao.ie), is published in a different format for each of the years 2019, 2020, and 2021. Each year's data therefore requires a different approach to acquisition, conversion to a pandas DataFrame, and cleaning. The 2019 data is published in two PDF files: one for level 8 courses, and one for levels 6 and 7. The 2020 data is published as an Excel spreadsheet, and the 2021 data as preformatted text in an HTML web page.

The attributes of interest for comparison between the various years' datasets are `Course Code`, `Course Name`, `Institution Name`, `EOS` (the number of points achieved by the last applicant to be offered a place on the course), and `Mid` (the midpoint between the highest and lowest points scores of the applicants offered a place on the course [1]). The 2021 data does not explicitly contain either an `EOS` or a `Mid` column. It does provide the *Round 1* and *Round 2* points required for entry into each course as `RND1` and `RND2`. Examination of the 2020 data, which contains both an `EOS` field *and* `RND1` and `RND2` fields, demonstrates that the `EOS` field is equal to the `RND2` value if it exists, otherwise the `RND1` value (`EOS = RND2 if RND2 else RND1`). The `Mid` value does not appear to be available yet for the 2021 data.
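
The rule inferred from the 2020 data can be sketched in plain Python. The function name `derive_eos` and the sample values are illustrative assumptions, not part of the CAO data or of the code in this notebook.

```python
def derive_eos(rnd1, rnd2):
    """Return the Round 2 points if present, otherwise the Round 1 points."""
    # None stands in for a missing value (NaN in a DataFrame)
    return rnd2 if rnd2 is not None else rnd1


# Hypothetical (RND1, RND2) pairs, not real CAO figures
samples = [("300", None), ("451", "440"), (None, None)]
eos_values = [derive_eos(rnd1, rnd2) for rnd1, rnd2 in samples]
print(eos_values)
```

The same rule is applied to whole columns later in the notebook with `np.where`.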

In [1]:
# Imports
# Data analysis library
import pandas as pd
# Plotting library
import matplotlib.pyplot as plt
# PDF table parsing
from tabula import read_pdf
# Retrieval of resources from WWW
import requests
# URL construction
from requests.compat import urljoin
# Various utilities, mainly path construction
import os
# Creation of datetime strings for filenames
from datetime import datetime
# Regular expressions
import re
import numpy as np
import hashlib
from itertools import zip_longest


## Acquiring the data


### Downloading the raw data

In [2]:
# Location of CAO points data
base_url = 'http://www2.cao.ie/points/'
# Local data directory
data_dir = 'data/cao'
backup_dir = 'data/cao/backup'

# Dictionary of source file names mapped to the file names that will be used locally
points_urls  = ({'l8.php'                  : 'cao_2021_lvl8.html',
                 'l76.php'                 : 'cao_2021_lvl76.html',
                 'CAOPointsCharts2020.xlsx': 'cao_2020_lvl876.xlsx'
                 })

# The rest of points_urls can be assembled programmatically
# as filenames follow a pattern

# List of years as 2-digit strings from 2019 to 2005
years = [str(i).zfill(2) for i in range(19, 4, -1)]
# For each year (2019 to 2005)
for year in years:
    levels = ('lvl8', 'lvl76')
    # Using a separate local_levels variable allows consistent local 
    # file naming in cases where the remote files are inconsistently named
    local_levels = levels
    
    # 2011 and 2012 data is missing second 'l' from filenames
    if year in ('12', '11'):
        levels = ('lv8', 'lv76')
        
    # For each level 
    for level, local_level in zip(levels, local_levels):
        # construct remote filename
        remote_name = level + '_' + year + '.pdf'

        # construct local filename
        local_name = 'cao_20' + year + '_' + local_level + '.pdf'
        # Add remote and local filenames as keys and values in points_urls dict
        points_urls[remote_name] = local_name

# List of years as 2-digit strings from 2004 to 2001
years = [str(i).zfill(2) for i in range(4, 0, -1)]
for year in years:
    levels = ('deg', 'dip')
    local_levels = ('lvl8', 'lvl76')

    for level, local_level in zip(levels, local_levels):
        remote_name = level + year + '.htm'
        local_name = 'cao_20' + year + '_' + local_level + '.html'
        points_urls[remote_name] = local_name
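
As a quick check of the naming pattern used for the PDF years, the mapping for a single year can be sketched as a small function. The helper name `pdf_name_pairs` is hypothetical and is not used elsewhere in this notebook.

```python
def pdf_name_pairs(year):
    """Return (remote, local) file name pairs for a 2-digit year string."""
    # Remote names for 2011 and 2012 lack the second 'l' in 'lvl'
    if year in ('11', '12'):
        remote_levels = ('lv8', 'lv76')
    else:
        remote_levels = ('lvl8', 'lvl76')
    # Local names always use the consistent 'lvl' spelling
    local_levels = ('lvl8', 'lvl76')
    return [(f'{remote}_{year}.pdf', f'cao_20{year}_{local}.pdf')
            for remote, local in zip(remote_levels, local_levels)]


print(pdf_name_pairs('12'))
print(pdf_name_pairs('05'))
```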


In [3]:
def get_cao_source_data(base_url, points_urls, data_dir, backup_dir, verbose=False):
    # for each of the source files 
    for url in (points_urls):
        # construct url and fetch content
        response = requests.get(urljoin(base_url, url))

        try:
            # attempt to open any previously downloaded local file
            with open(os.path.join(data_dir, points_urls[url]), "rb") as f:
                # Calculate md5 hashes for the local file and the remote file
                md5_local = hashlib.md5(f.read()).hexdigest()
                md5_response = hashlib.md5(response.content).hexdigest()

                # Set write_flag to False if the hashes are equal and True if they are not
                write_flag = (md5_local != md5_response)
        except FileNotFoundError:
            # if the local file does not exist set the write_flag to True and move on
            write_flag = True

        # If the write_flag is True
        if write_flag:
            if verbose:
                print(f"File: {points_urls[url]} has changed since last download. Updating...")
            # split the filename into name and extension
            fname, extension = os.path.splitext(points_urls[url])
            # construct unique filename by inserting datetime string between filename and extension
            filename = fname + datetime.now().strftime("_%Y%m%d_%H%M%S") + extension

            # write the timestamped remote file to the backup directory
            with open(os.path.join(backup_dir, filename), 'wb') as f:
                f.write(response.content)

            # also write the remote file to the data directory, overwriting any previous file
            with open(os.path.join(data_dir, points_urls[url]), 'wb') as f:
                f.write(response.content)
        else:
            if verbose:
                print(f"File: {points_urls[url]} has not changed since last download. Skipping...")
                
get_cao_source_data(base_url=base_url, 
                    points_urls=points_urls, 
                    data_dir=data_dir, 
                    backup_dir=backup_dir, 
                    verbose=False)
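
The change detection in `get_cao_source_data()` rests on comparing MD5 digests of the local file and the downloaded bytes. A minimal, offline sketch of that comparison follows; the helper name `has_changed`, the file name, and the sample bytes are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def has_changed(local_path, remote_bytes):
    """Return True if local_path is missing or differs from remote_bytes."""
    try:
        with open(local_path, 'rb') as f:
            local_digest = hashlib.md5(f.read()).hexdigest()
    except FileNotFoundError:
        # No previous download, so the file must be written
        return True
    return local_digest != hashlib.md5(remote_bytes).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.html')
    content = b'<pre>sample page</pre>'
    first = has_changed(path, content)          # no local copy yet
    with open(path, 'wb') as f:
        f.write(content)
    second = has_changed(path, content)         # identical bytes
    third = has_changed(path, content + b'!')   # modified bytes

print(first, second, third)
```

MD5 is used here only to detect changes between downloads, not for security.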

### Data supplied as preformatted text embedded in a HTML page
#### 2021, 2004, 2003, 2002, and 2001

In [4]:
def read_cao_html(l8, l76, 
                  columns=['Course Code', 'Course Name', 'EOS', 'Mid', 'Level', 'Institution'], 
                  name_len=50,
                  special=['EOS', 'Mid']):
    # Regular expression to capture fields from lines
    # Lines consist of 2 letters and 3 numbers, comprising the course code; some whitespace; 
    # 50 characters which start with a non-whitespace character; some more whitespace;
    # some optional non whitespace characters comprising round 1 points; some more whitespace;
    # and, optionally some more non-whitespace characters comprising round 2 points if present
    re_fields = re.compile(rf'^([A-Z]{{2}}[0-9]{{3}})\s+(\S.{{{name_len-1}}})\s+(\S+)?\s+(\S+)?')
    re_institution = re.compile(r'^\s{7}(\S.+\S)\s+$')

    # array to hold matched groups
    data = []
    institution = ''
    for datafile, level in zip((l8, l76), (8, 76)):
        # encoding=cp1252 necessary to decode some characters on page
        with open(datafile, 'r', encoding='cp1252') as f:
            for line in f:
                match_course = re.match(re_fields, line)
                match_institution = re.match(re_institution, line)
                if match_institution:
                    institution = match_institution.group(0).strip()
                if match_course:
                    fields = list(match_course.groups())
                    fields.append(level)
                    fields.append(institution)
                    data.append(fields)

                    

    # column names
    #columns = ['Course Code', 'Course Name', 'Rnd1', 'Rnd2', 'Level']
    df = pd.DataFrame.from_records(data, columns=columns)

    newcols = {'Test': '#', 'Not All': r'\*', 'AQA': 'AQA', 'New Comp': 'v'}

    for k, v in newcols.items():
        df[k] = df[special[0]].str.contains(v, na=False) | df[special[1]].str.contains(v, na=False)

    # Generate 'EOS' column. == Rnd2 if it exists, otherwise Rnd1
    # Only here for 2021 data
    if special[0] == 'Rnd1':
        df['EOS'] = np.where(df['Rnd2'].isnull(), df['Rnd1'], df['Rnd2'])

    # Remove non-digits from the EOS and Mid columns and convert them to numeric values, 
    # with NaNs where values are missing (errors = 'coerce')
    # (Because NaN is a float, the whole columns must be floats)
    df['EOS'] = pd.to_numeric(df['EOS'].str.replace('[^0-9]+', '', regex=True), errors='coerce')
    if 'Mid' in df.columns:
        df['Mid'] = pd.to_numeric(df['Mid'].str.replace('[^0-9]+', '', regex=True), errors='coerce')
    else:
        df['Mid'] = np.nan

    # Use boolean 'Level8' to store level
    df['Level8'] = df['Level'] == 8

    # Remove unwanted columns and make column orders consistent across years
    df = df[['Course Code', 'Course Name', 'Institution', 'EOS', 'Mid', 'Level8', 'Test', 'Not All', 'AQA', 'New Comp']]
   
    return df
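
To see what the field regular expression in `read_cao_html()` captures, it can be run against a synthetic fixed-width line. The line layout produced by `make_line` below is an assumption modelled on the comments in the function, not a copy of a real CAO page.

```python
import re

name_len = 50
# Same pattern as in read_cao_html(), written as a raw f-string
re_fields = re.compile(
    rf'^([A-Z]{{2}}[0-9]{{3}})\s+(\S.{{{name_len - 1}}})\s+(\S+)?\s+(\S+)?')


def make_line(code, name, rnd1, rnd2=''):
    """Build a hypothetical fixed-width line: code, padded name, points."""
    return f'{code}  {name:<{name_len}}  {rnd1:<5}  {rnd2}\n'


both_rounds = re_fields.match(make_line('AL801', 'Software Design', '300', '298*'))
one_round = re_fields.match(make_line('AL810', 'Quantity Surveying', '#328'))

print(both_rounds.groups())
print(one_round.groups())
```

The course name group keeps its trailing padding, and the second points group is `None` when a course has only one round of offers.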


The points columns in the HTML data mix numeric values with markers such as `*` and `#`. `read_cao_html()` records each marker in a new boolean column, checking both points columns (`Rnd1` and `Rnd2` for 2021):

1. `Test`: the course requires a test, interview or portfolio. Indicated by a `#` in the `Rnd1` or `Rnd2` column.
2. `Not All`: not all applicants at the Rnd1 points score were offered a place. Indicated by a `*` in the `Rnd1` or `Rnd2` column.
3. `AQA`: All Qualified Applicants were offered a place. Indicated by `AQA`.
4. `New Comp`: new competition for available places, which seems to mark courses for which the points have increased in round 2. Only occurs in level 76 and is indicated by a `v` in the `Rnd2` column.

Each flag is computed in the same way, for example `df['Test'] = df['Rnd1'].str.contains('#', na=False) | df['Rnd2'].str.contains('#', na=False)`.

The `EOS` column is then generated as `Rnd2` if it exists, otherwise `Rnd1`: `df['EOS'] = np.where(df['Rnd2'].isnull(), df['Rnd1'], df['Rnd2'])`.

Finally, non-digits are removed from the points columns, which are converted to numeric values with NaN where values are missing (`errors='coerce'`), for example `pd.to_numeric(df['Rnd1'].str.replace('[^0-9]+', '', regex=True), errors='coerce')`. Because NaN is a float, the whole column must be floats.
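
The marker handling can be illustrated on single strings without pandas. This is a simplified sketch: the helper `split_points` and the sample values are hypothetical, and the notebook itself performs the same steps column-wise on a DataFrame.

```python
import re

# Column name -> marker that sets it, as used in read_cao_html()
MARKERS = {'Test': '#', 'Not All': '*', 'AQA': 'AQA', 'New Comp': 'v'}


def split_points(raw):
    """Split a raw points string into a numeric value and marker flags."""
    raw = raw or ''
    # A flag is True when its marker appears anywhere in the raw string
    flags = {name: marker in raw for name, marker in MARKERS.items()}
    # Strip everything that is not a digit, then convert if anything is left
    digits = re.sub(r'[^0-9]+', '', raw)
    points = float(digits) if digits else None
    return points, flags


print(split_points('#700*'))
print(split_points('AQA'))
```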

In [5]:
# Non-default parameters for read_cao_html() for each year
html_files = {2021:{'columns': ['Course Code', 'Course Name', 'Rnd1', 'Rnd2', 'Level','Institution'], 
                    'special': ['Rnd1', 'Rnd2']},
              2004:{},
              2003:{},
              2002:{},
              2001:{'name_len': 35}
}

In [6]:
def read_cao_htmls(html_files):
    """
    Reads in all the html files in the html_files dictionary and returns a dictionary
    of dataframes with the year as the key
    """
    dfs = {}
    for year, params in html_files.items():
        l8 = os.path.join(data_dir, f'cao_{year}_lvl8.html')
        l76 = os.path.join(data_dir, f'cao_{year}_lvl76.html') 
        dfs[year] = read_cao_html(l8, l76, **params)
    
    return dfs

***

### Data supplied as an Excel spreadsheet
#### 2020

1. Read Excel file using `pandas.read_excel`, specifying header row, desired columns, and column names
2. Create and populate 'Test', 'Not All', 'Matric', and 'AQA' columns
3. Remove all non-numeric characters from 'EOS' and 'Mid' and convert to numeric type

In [7]:
def tidy_cols(df):
    
    cols = ['Test', 'Not All', 'Matric', 'AQA', 'New Comp']
    markers = ['#', '*', 'mat', 'AQA', 'v']

    for col, marker in zip(cols, markers):
        df[col] = df['EOS'].str.replace(r'\s', '', regex=True).str.contains(marker, na=False, regex=False)

    for col in ('EOS', 'Mid'):
        # Cast each point col to string so they can be cleaned up using string methods
        df[col] = df[col].astype(str)

        # Some pdfs have second point values in parentheses 
        # indicating new competition for additional places which must be removed
        # or the two point values will be concatenated in the next step
        df[col] = df[col].str.replace(r'\(.+\)', '', regex=True)
        
        # Remove non digits and decimal points outside numbers
        df[col] = df[col].str.replace('[^0-9.]', '', regex=True).str.strip(".")

        # Cast points columns to float
        df[col] = pd.to_numeric(df[col], errors='coerce', downcast='float')  
            
    # Reset the index
    df.reset_index(inplace=True, drop=True)

    # Make column orders consistent across years
    df = df[['Course Code', 'Course Name', 'Institution', 'EOS', 'Mid', 'Level8', 'Test', 'Not All', 'AQA', 'New Comp']]
           
    return df 
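
The order of the cleaning steps in `tidy_cols()` matters: a parenthesised second value has to be removed before non-digits are stripped. A string-level sketch shows why; `clean_points` and the sample strings are illustrative assumptions, not code from the notebook.

```python
import re


def clean_points(raw):
    """Mimic the per-value cleaning in tidy_cols() for a single value."""
    text = str(raw)
    # Drop a parenthesised second value first
    text = re.sub(r'\(.+\)', '', text)
    # Keep digits and decimal points, then trim stray leading/trailing dots
    text = re.sub(r'[^0-9.]', '', text).strip('.')
    try:
        return float(text)
    except ValueError:
        # Nothing numeric left (for example 'AQA' or a missing value)
        return None


# Without the parenthesis step the two values would run together
naive = re.sub(r'[^0-9.]', '', '300 (290)')

print(naive)
print(clean_points('300 (290)'), clean_points('#451*'), clean_points('AQA'))
```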

In [8]:
def read_cao_excel():
    """
    Reads in the CAO excel file and returns a dataframe.
    As this is a one-use function all the parameters are hardcoded.
    """

    # Read in the excel file
    df = pd.read_excel(os.path.join(data_dir, 'cao_2020_lvl876.xlsx'), 
                       header=10, 
                       usecols="B,C,H,I,J,K,L", 
                       names=['Course Name', 'Course Code', 'EOS', 'EOS *', 'Mid', 'Level8', 'Institution'],
                       converters={'EOS':str,'Mid':str})

    # Asterisks usually found in EOS are in a separate col in this dataset
    # Move asterisks to EOS so generic parser can be used
    df['EOS'] = np.where(df['EOS *'].str.contains('*', na=False, regex=False), 
        df['EOS'] + '*', df['EOS']) 
    df = df.drop('EOS *', axis=1)

    # Change 'Level8' to boolean
    df['Level8'] = df['Level8'] == 8

    df = tidy_cols(df)

    return df

### Data held as a one-column-per-page table in a PDF file
#### 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, and 2005

For each of these years the points data is held in two PDF files, one for level 8 courses and one for levels 6 and 7.

1. Read using tabula.read_pdf()
2. If necessary remove unwanted rows and assign header row
3. Fix and rename headers
4. Fill in institution column
5. Remove rows without course codes
6. Create and populate 'Test', 'Not All', 'Matric', and 'AQA' columns
7. Remove all non-numeric characters from 'EOS' and 'Mid' and convert to numeric type


In [9]:
def read_cao_pdf(pdf_path, header_row=None, splitfirst=False, table_num=0, drop_col=None, merge_drop=None, multiple_tables=False):
    
    # Extract level from path
    level = re.search(r'lvl(.+)\.', pdf_path).group(1)
    df = read_pdf(pdf_path, pages='all', multiple_tables=multiple_tables)[table_num]

    # 2016 data has a ghost column
    if drop_col is not None:
        if merge_drop is not None:
            col1 = df.columns[drop_col[0]]
            col2 = df.columns[merge_drop]
            df.loc[df[col2].isnull(), col2] = df[col1]
            
        df.drop(df.columns[drop_col], axis=1, inplace=True)
    
    df.columns = ['Course Code', 'Course Name', 'EOS', 'Mid']

    if header_row is not None:
        # df.columns = df.iloc[header_row]
        df.rename_axis(None, axis=1, inplace=True)
        
        # Delete rows up to header_row
        df.drop(df.index[range(0, header_row + 1)], axis=0, inplace=True)
        
    # A missing vertical line causes the PDF parser to merge columns 
    # in certain tables (e.g. 2014 levels 6 & 7)
    # If that is the case we need to shift column contents to the right 
    # then split the first column into course code and course name
    if splitfirst:
        # Create institution column and add contents of Course Name column where Course Code is empty
        df['Institution'] = df[df['Course Code'].isnull()]['Course Name']
        # Shift the values in EOS to Mid
        df['Mid'] = df['EOS']
        # Shift the values in Course Name to EOS
        df['EOS'] = df['Course Name']
        # Locate rows with institution names (in Course Code col) and move them to Institution col
        # Skip the first row because it is a unique situation dealt with in the first line of this if block
        df.loc[df.index[1:], 'Institution'] = df[~df['Course Code'].str.contains(r'[A-Z]{2}\d{3}', na=False)]['Course Code'] 
        # Extract the course name from the Course Code column and place in Course Name column
        df['Course Name'] = df['Course Code'].str.extract(r'^\D\D\d{3}(.+)$')
        # Extract the course code from the Course Code column and place in Course Code column
        df['Course Code'] = df['Course Code'].str.extract(r'^(\D\D\d{3})')       
    else:
        # Create a new column in the dataframe for institution name 
        # identify institution name rows as those containing null course codes
        # and add those institution names to the new institution column
        df['Institution'] = df[df['Course Code'].isnull()]['Course Name']
        #df.rename(columns={'INSTITUTION and COURSE':'Course Name'}, inplace=True)
    
    # Fill empty fields in the institution column with the most recent non-na field
    df['Institution'] = df['Institution'].ffill()
    
    # Remove rows containing only institution names
    df = df[df['Course Code'].notna()]
        
    # Remove page header rows
    df = df[df['Course Code'] != 'Course Code']
    
    # Remove oddball rows like two subject moderatorships with point ranges rather than single values
    df = df[df['Course Code'].str.contains(r'^[A-Z]{2}\d{3}$')]
          
    # Add level column      
    df['Level8'] = level == '8'

    # tidy_cols, defined above creates new columns for extra info and cleans numerical columns
    df = tidy_cols(df)
        
    return df
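
The institution handling in `read_cao_pdf()` amounts to a forward fill: an institution name appears on a row with no course code and applies to every course row below it until the next such row. A plain-Python sketch of that idea follows; the helper `assign_institutions` and the sample rows are illustrative.

```python
def assign_institutions(rows):
    """Attach the most recent institution name to each course row.

    rows is a list of (course_code, text) pairs, where a course_code of
    None marks a row that holds only an institution name.
    """
    institution = None
    courses = []
    for code, text in rows:
        if code is None:
            # Institution header row: remember it, do not emit it
            institution = text
        else:
            courses.append((code, text, institution))
    return courses


# Illustrative rows in the order they might appear in a parsed table
rows = [
    (None, 'Athlone Institute of Technology'),
    ('AL801', 'Software Design'),
    ('AL810', 'Quantity Surveying'),
    (None, 'Technological University Dublin'),
    ('TU971', 'Contemporary Visual Culture'),
]
courses = assign_institutions(rows)
for course in courses:
    print(course)
```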


In [10]:
# Non-default parameters for read_cao_pdf for each year
pdf_files = {2019: {'8': {},
                    '76': {'header_row': 7}},
             2018: {'8': {'header_row': 7},
                    '76': {'header_row': 7}},
             2017: {'8': {},
                    '76': {}},
             2016: {'8': {'header_row': 6, 'drop_col': [4]},
                    '76': {'header_row': 6, 'drop_col': [4]}},
             2015: {'8': {'header_row': 14},
                    '76': {'header_row': 13}},
             2014: {'8': {'header_row': 13},
                    '76': {'header_row': 12, 'splitfirst': True}},
             2013: {'8': {'header_row': 10},
                    '76': {'header_row': 10}},
             2012: {'8': {'header_row': 11},
                    '76': {'header_row': 10}},
             2011: {'8': {'header_row': 23},
                    '76': {'header_row': 19}},
             2010: {'8': {'header_row': 17},
                    '76': {'table_num': 1, 'drop_col': [1], 'merge_drop': 2, 'multiple_tables': True}},
             2009: {'8': {'header_row': 17},
                    '76': {'header_row': 11}},
             2008: {'8': {'header_row': 26},
                    '76': {'header_row': 24}},
             2005: {'8': {'header_row': 10},
                    '76': {'header_row': 9}}
             }


In [11]:
def read_cao_pdfs(pdf_files, type='single'):
    """
    Reads in all the pdf files in the pdf_files dictionary and returns a dictionary
    of dataframes with the year as the key
    """
    cao_dfs = {}
    for year, levels in pdf_files.items():
        
        cao_dfs[year] = {}
        for level, params in levels.items():
            file_path = os.path.join(data_dir, 'cao_' + str(year) + '_lvl' + level + '.pdf')
            if type == 'single':
                cao_dfs[year][level] = read_cao_pdf(file_path, **params)
            elif type == 'multiple':
                cao_dfs[year][level] = read_cao_multicol(file_path, **params)
            else:
                raise ValueError('type must be either single or multiple')

        cao_dfs[year] = pd.concat([cao_dfs[year]['8'], cao_dfs[year]['76']], axis=0, ignore_index=True)
        
    return cao_dfs

### Data held in a multiple-column-per-page and multiple-page-per-column table in a PDF file
#### 2007 and 2006

In [12]:
def read_cao_multicol(pdf_path, top, height, width, col_locs, runover, header_row=0):
#     # distance in points of top of table from top of page, 
#     # height of table, and width of table
#     top, height, width  = (18.875, 568, 246)

#     # distance in points of left edge of page column from left edge of page
#     col_locs = (18.375, 260.625, 509.625)

#     # Table columns run over to next page in most cases
#     # The 'runover' variable holds the number of rows in each page
#     # that need to be push back up to the previous page
#     runover = [0, 2, 4, 0, 0]

    # Extract level from path
    level = re.search(r'lvl(.+)\.', pdf_path).group(1)

    # List to hold dataframes
    tables = []
    for i, col_loc in enumerate(col_locs):
        # table area in this page column
        area = [top, col_loc, top + height, col_loc + width]
        # tables will be a list containing three lists, one holding all of the left page column tables, 
        # one all the centre column tables, and one all of the right column tables
        tables.append(read_pdf(pdf_path, pages="all", multiple_tables=True, area=area, pandas_options={'header': None}))

    # Some tables are parsed with an extra (fifth) column; any values in the
    # affected rows can be shifted one cell to the left
    # Iterate through lists of lists of dataframes
    for df_list in tables:
        # Iterate through all dataframes in list
        for df in df_list:
            # If the dataframe has more than four columns
            if len(df.columns) > 4:
                # the last column is not wanted
                extra_col = df.iloc[:,-1]
                # if the number of rows in the dataframe 
                # is greater than the number of NaN values in 
                # the extra column then there must be some data 
                # in the extra column that needs to be moved 
                # before the column is dropped
                if df.shape[0] > extra_col.isna().sum():
                    # Find the rows which hold data in the extra column 
                    # and shift all values one cell to the left
                    df[extra_col.notna()] = df[extra_col.notna()].shift(periods=-1, axis=1)
                
                # drop the extra column
                df.drop(df.columns[4], axis=1, inplace=True)

    # Transpose table list so that each sublist represents a page
    # and each dataframe represents a column in that page
    pages = [list(table) for table in zip_longest(*tables)]

    #Iterate over lists representing pages, starting with page 2 as 
    # page one has no previous page to push rows up to
    for page in range(1, len(pages)):
        # Get the number of rows which have run on from the previous page
        num_rows = runover[page]
        # iterate through dataframes representing page columns
        for i, col in enumerate(pages[page]):
            if col is not None:
                # copy the runover rows
                rows = col.head(num_rows)
                # append the runover rows to the dataframes representing the previous page's columns
                # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
                pages[page - 1][i] = pd.concat([pages[page - 1][i], rows], ignore_index=True)
                # drop the runover rows from the dataframes they had run over into
                col.drop(rows.index, inplace=True)

    # Flatten the list so that all data frames are in the 
    # correct order for concatenation
    table_cols = [col for page in pages for col in page]

    # The last two elements are None so remove them
    del(table_cols[-2:])

    # concatenate all of the column tables into a single dataframe
    df = pd.concat(table_cols)
    # Set column names
    df.columns = ['Course Code', 'Course Name', 'EOS', 'Mid']
    # reset the index
    df.reset_index(drop=True, inplace=True)

    # Remove all rows up to the header row which contain no data
    df.drop(df.index[0:header_row], inplace=True)

    # Remove rows where all values are NaN
    df.drop(df.index[pd.isnull(df).all(1)], inplace=True)

    # Rows in which Course Code is NaN and Course Name is not made up solely of 
    # capital letters, non-alphanumeric characters, and the words of, the, and and 
    # do not contain usable data, so drop them
    bad_rows = ~(df['Course Name'].str.contains(
        r'^[A-Z\s\Woftheand]+$', na=False)) & df['Course Code'].isna()
    df.drop(df[bad_rows].index, inplace=True)

    # Create a new column in the dataframe for institution name 
    # identify institution name rows as those containing null course codes
    # (some institution rows in the 2006 pdfs have the word 'Code' in the 
    # Course Code column instead) and add those institution names to the 
    # new institution column
    df['Institution'] = df[df['Course Code'].isnull() | df['Course Code'].str.contains('Code')]['Course Name']
    

    # Fill empty fields in the institution column with the most recent non-na field
    df['Institution'] = df['Institution'].ffill()

    # Remove rows containing only institution names
    df = df[df['Course Code'].notna()]

    # Add level column
    df['Level8'] = level == '8'

    # reset the index
    df.reset_index(drop=True, inplace=True)

    # Search in course code column reveals some bad rows
    # Drop any rows where Course Code does not follow the \D\D\d\d\d pattern
    # (two non-digits followed by three digits)
    df.drop(df[~df['Course Code'].str.contains(r'^\D\D\d\d\d$', regex=True)].index, inplace=True)

    # add remaining columns and clean up
    df = tidy_cols(df)
        
    return df
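
The page reassembly in `read_cao_multicol()` combines two ideas: transposing per-column lists into per-page lists with `zip_longest`, and pushing rows that ran over onto the next page back up to the previous one. The sketch below uses lists of strings in place of DataFrames; `merge_columns` and the sample rows are hypothetical.

```python
from itertools import zip_longest


def merge_columns(tables, runover):
    """Reassemble rows in reading order.

    tables[c][p] holds the rows read from page column c on page p.
    runover[p] is the number of rows at the top of each column on page p
    that belong to the same column on page p - 1.
    """
    # Transpose: one list per page, holding that page's columns
    # (None fills in where a page has fewer columns)
    pages = [list(page) for page in zip_longest(*tables)]
    for p in range(1, len(pages)):
        n = runover[p]
        for i, col in enumerate(pages[p]):
            if col is not None:
                # Move the first n rows up to the previous page's column
                pages[p - 1][i] = pages[p - 1][i] + col[:n]
                pages[p][i] = col[n:]
    # Flatten page by page, column by column, skipping empty slots
    return [row for page in pages for col in page if col is not None
            for row in col]


# Two page columns; the right-hand column only exists on the first page.
# 'r3' was printed at the top of page 2 but belongs under 'r2'.
tables = [[['r1', 'r2'], ['r3', 'r6']],
          [['r4', 'r5']]]
merged = merge_columns(tables, runover=[0, 1])
print(merged)
```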



In [13]:
# Non-default parameters for read_cao_multicol() for each year
pdf_multicol_files = {
    2007: {'8': {'top': 18.875, 'height': 568, 'width': 246,
                 'col_locs': [18.375, 260.625, 509.625],
                 'runover': [0, 2, 4, 6, 7, 9, 0],
                 'header_row': 18},
           '76': {'top': 18.875, 'height': 568, 'width': 246,
                  'col_locs': [18.375, 260.625, 509.625],
                  'runover': [0, 2, 4, 0, 0],
                  'header_row': 15}},
    2006: {'8': {'top': 18.125, 'height': 556.5, 'width': 324.75,
                 'col_locs': [19.125, 353.625],
                 'runover': [0, 0, 4, 7, 7, 12, 15, 11, 20],
                 'header_row': 20},
           '76': {'top': 18.125, 'height': 556.5, 'width': 324.75,
                  'col_locs': [19.125, 353.625],
                  'runover': [0, 0, 5, 7, 5, 14],
                  'header_row': 13}}
}
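
The `top`, `height`, `width`, and `col_locs` parameters are turned into one extraction area per page column, in the (top, left, bottom, right) order, measured in points, that tabula-py's `area` option expects. The helper name `column_areas` is an assumption; the numbers are the 2007 level 8 values from the dictionary above.

```python
def column_areas(top, height, width, col_locs):
    """Return one [top, left, bottom, right] area per page column."""
    return [[top, left, top + height, left + width] for left in col_locs]


areas = column_areas(top=18.875, height=568, width=246,
                     col_locs=[18.375, 260.625, 509.625])
for area in areas:
    print(area)
```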


At this point we have three functions, `read_cao_htmls()`, `read_cao_excel()`, and `read_cao_pdfs()`, which between them will read all of the CAO points data from 2001 to 2021, assuming it is accessible on disk. `read_cao_htmls()` and `read_cao_pdfs()` each return a dict of pandas DataFrames keyed by year, while `read_cao_excel()` returns a single DataFrame for 2020.

In [14]:
def read_all_cao(write_csv=False, csv_loc=None, ret='df'):
        
    # Read all data files and construct dictionaries of dataframes
    dfs1 = read_cao_htmls(html_files)
    # read_cao_excel() returns a single dataframe rather than a dict
    df2  = read_cao_excel()
    dfs3 = read_cao_pdfs(pdf_files)
    dfs4 = read_cao_pdfs(pdf_multicol_files, type='multiple')

    # Merge dicts and dataframes to a single dict keyed by year
    dfs = dfs1 | dfs3 | dfs4 | {2020:df2}

    # Change int dict keys to strings
    dfs = {str(key) : dfs[key] for key in dfs}

    # Set index in dataframes to course code so that 
    # they are joined on course code when concatenated
    for df in dfs.values():
        df.set_index('Course Code', inplace=True)
    
    # Construct single multiindex dataframe holding CAO data from all years
    df = pd.concat(dfs, axis=1)

    # If write_csv flag is True..
    if write_csv:
        # Write a csv for each year's data
        for year in dfs:
            filename = os.path.join(csv_loc, f'cao_{year}.csv')
            dfs[year].to_csv(filename)
        # Write a single csv for all data
        df.to_csv(os.path.join(csv_loc, 'cao_2001-2021.csv'))

    # Return either a dict of dataframes or a single multiindex dataframe
    if ret == 'dict':
        return dfs
    elif ret == 'df':
        return df
    else:
        raise ValueError('ret must be either "dict" or "df"')
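
The dict union operator `|` used in `read_all_cao()` requires Python 3.9 or later. On older versions the same merge and key conversion could be written as below; `merge_by_year` is a hypothetical helper, and the strings stand in for DataFrames.

```python
def merge_by_year(*year_dicts):
    """Merge dicts keyed by year and convert the keys to strings."""
    merged = {}
    for year_dict in year_dicts:
        # Later dicts win if a year appears more than once
        merged.update(year_dict)
    return {str(year): value for year, value in merged.items()}


# Placeholder values standing in for per-year DataFrames
combined = merge_by_year({2021: 'html'}, {2019: 'pdf'}, {2020: 'excel'})
print(combined)
```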


In [16]:
# Read all data from original sources and write resulting csv's
dfs = read_all_cao(write_csv=True, csv_loc='./data/cao/csv', ret='dict')

## Analysing the data

In [20]:
# Load multiindex dataframe with 2001-2021 data from csv
df = pd.read_csv('data/cao/csv/cao_2001-2021.csv', header=[0,1], index_col=0)


In [22]:
df

| Course Code | 2021 Course Name | 2021 Institution | 2021 EOS | 2021 Mid | 2021 Level8 | 2021 Test | 2021 Not All | 2021 AQA | 2021 New Comp | 2004 Course Name | ... | 2006 New Comp | 2020 Course Name | 2020 Institution | 2020 EOS | 2020 Mid | 2020 Level8 | 2020 Test | 2020 Not All | 2020 AQA | 2020 New Comp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AL801 | Software Design for Virtual Reality and Gaming... | Athlone Institute of Technology | 300.0 | | True | False | False | False | False | | ... | | Software Design with Virtual Reality and Gaming | Athlone Institute of Technology | 303.0 | 367.0 | True | False | False | False | False |
| AL802 | Software Design in Artificial Intelligence for... | Athlone Institute of Technology | 313.0 | | True | False | False | False | False | | ... | | Software Design with Artificial Intelligence f... | Athlone Institute of Technology | 332.0 | 382.0 | True | False | False | False | False |
| AL803 | Software Design for Mobile Apps and Connected ... | Athlone Institute of Technology | 350.0 | | True | False | False | False | False | | ... | | Software Design with Mobile Apps and Connected... | Athlone Institute of Technology | 337.0 | 360.0 | True | False | False | False | False |
| AL805 | Computer Engineering for Network Infrastructur... | Athlone Institute of Technology | 321.0 | | True | False | False | False | False | | ... | | Computer Engineering with Network Infrastructure | Athlone Institute of Technology | 333.0 | 360.0 | True | False | False | False | False |
| AL810 | Quantity Surveying ... | Athlone Institute of Technology | 328.0 | | True | False | False | False | False | | ... | | Quantity Surveying | Athlone Institute of Technology | 326.0 | 352.0 | True | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TU971 | | | | | | | | | | | ... | | Contemporary Visual Culture | Technological University Dublin | 290.0 | 320.0 | True | False | False | False | False |
| TU972 | | | | | | | | | | | ... | | Creative and Cultural Industries | Technological University Dublin | 281.0 | 369.0 | True | False | False | False | False |
| TU986 | | | | | | | | | | | ... | | Print Media Technology and Management | Technological University Dublin | 289.0 | 300.0 | True | False | False | False | False |
| TU993 | | | | | | | | | | | ... | | Early Childhood Care and Education | Technological University Dublin | 270.0 | 311.0 | True | False | False | False | False |


## Conclusion

## References

[1] https://www.independent.ie/life/family/learning/understanding-your-cao-course-guide-26505318.html


https://tabula-py.readthedocs.io/en/latest/faq.html#how-to-use-area-option