# CAO Points

## Introduction

The CAO points data, available from the [CAO website](http://www.cao.ie), is published in a different format for each of the years 2019, 2020, and 2021. Each year's data therefore requires a different approach to acquisition, conversion to a pandas DataFrame, and cleaning. The 2019 data is published in two PDF files: one for level 8 courses, and one for levels 6 and 7. The 2020 data is published as an Excel spreadsheet, and the 2021 data as preformatted text in an HTML web page.

The attributes of interest for comparison between the years' datasets are `Course Code`, `Course Name`, `Institution Name`, `EOS`, and `Mid`. `EOS` is the number of points achieved by the last applicant to be offered a place on the course, and `Mid` is the midpoint between the highest and the lowest point scores of the applicants offered a place on the course [1]. The 2021 data does not explicitly contain either an `EOS` or a `Mid` column. It does provide the *Round 1* and *Round 2* points required for entry into each course as `RND1` and `RND2`. Examination of the 2020 data, which contains both an `EOS` field *and* `RND1` and `RND2` fields, demonstrates that the `EOS` field is equal to the `RND2` value if it exists, otherwise the `RND1` value (`EOS = RND2 if RND2 else RND1`). As for the `Mid` field, this information does not appear to be available yet for the 2021 data.
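
The `EOS` rule described above can be sketched on a toy frame. The values below are invented for illustration, and the sketch assumes that a missing Round 2 figure is represented as NaN.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): RND2 is NaN where no Round 2 figure exists
toy = pd.DataFrame({
    "RND1": [300.0, 451.0, 440.0],
    "RND2": [np.nan, 444.0, np.nan],
})

# EOS takes the Round 2 value where present, otherwise falls back to Round 1
toy["EOS"] = toy["RND2"].fillna(toy["RND1"])
print(toy)
```

`fillna` with a second column is one way to express "use `RND2` if it exists, otherwise `RND1`"; `np.where` on `isnull()` gives the same result.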

In [2]:
# Imports
# Data analysis library
import pandas as pd
# Plotting library
import matplotlib.pyplot as plt
# PDF table parsing
from tabula import read_pdf
# Retrieval of resources from WWW
import requests
# URL construction
from requests.compat import urljoin
# Various utilities, mainly path construction
import os
# Creation of datetime strings for filenames
from datetime import datetime
# Regular expressions
import re
# Numerical routines (np.where)
import numpy as np
# MD5 checksums for detecting changed downloads
import hashlib


## Acquiring the data


In [13]:
# Location of CAO points data
base_url = 'http://www2.cao.ie/points/'
# Local data directory
data_dir = 'data/cao'
backup_dir = 'data/cao/backup'
# Make sure both directories exist before writing to them
os.makedirs(backup_dir, exist_ok=True)

# Dictionary of source file names mapped to the file names that will be used locally
points_urls  = ({'lvl8_19.pdf': 'cao_2019_lvl8.pdf', 
                 'lvl76_19.pdf': 'cao_2019_lvl76.pdf', 
                 'CAOPointsCharts2020.xlsx': 'cao_2020_lvl876.xlsx', 
                 'l8.php': 'cao_2021_lvl8.html',
                 'l76.php': 'cao_2021_lvl76.html'})

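# For each of the source files: construct the URL, fetch the file,
# and write it to the data directory if it differs from the local copy
for url in points_urls:
    response = requests.get(urljoin(base_url, url), timeout=30)
    # Stop rather than saving an error page in place of the data
    response.raise_for_status()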
    
    try:
        with open(os.path.join(data_dir, points_urls[url]), "rb") as f:
            md5_local = hashlib.md5(f.read()).hexdigest()
            md5_response = hashlib.md5(response.content).hexdigest()
            writefile = (md5_local != md5_response)
    except IOError:
        writefile = True


    if writefile:
        print(f"File: {points_urls[url]} has changed since last download. Updating...")
        # split the filename into name and extension
        fname, extension = os.path.splitext(points_urls[url])
        # construct unique filename by inserting datetime string between filename and extension
        filename = fname + datetime.now().strftime("_%Y%m%d_%H%M%S") + extension
        with open(os.path.join(backup_dir, filename), 'wb') as f:
            # write source data to disk
            f.write(response.content)

        with open(os.path.join(data_dir, points_urls[url]), 'wb') as f:
            # write source data to disk
            f.write(response.content)
    else:
        print(f"File: {points_urls[url]} has not changed since last download.")


File: cao_2019_lvl8.pdf has changed since last download. Updating...
File: cao_2019_lvl76.pdf has changed since last download. Updating...
File: cao_2020_lvl876.xlsx has changed since last download. Updating...
File: cao_2021_lvl8.html has changed since last download. Updating...
File: cao_2021_lvl76.html has changed since last download. Updating...
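
The change detection used in the download loop can be isolated into a small helper. This is only a sketch: `has_changed` is a hypothetical name that does not appear in the notebook, and a temporary file stands in for a real download.

```python
import hashlib
import os
import tempfile

def has_changed(path, content):
    """Return True if `path` is missing or its bytes differ from `content`."""
    try:
        with open(path, "rb") as f:
            local_digest = hashlib.md5(f.read()).hexdigest()
    except OSError:
        # No readable local copy yet, so the file must be written
        return True
    return local_digest != hashlib.md5(content).hexdigest()

# Demonstration with a temporary file instead of a real download
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.bin")
    first_check = has_changed(target, b"points data")     # no local file yet
    with open(target, "wb") as f:
        f.write(b"points data")
    second_check = has_changed(target, b"points data")    # identical content
    third_check = has_changed(target, b"points data v2")  # content differs

print(first_check, second_check, third_check)
```

MD5 is used here only to detect changes between downloads, not for any security purpose.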


### 2021 Points Data

In [93]:
l8 = os.path.join(data_dir, 'cao_2021_lvl8.html')
l76 = os.path.join(data_dir, 'cao_2021_lvl76.html')

# Regular expression to capture fields from lines
# Lines consist of 2 letters and 3 numbers, comprising the course code; some whitespace; 
# 50 characters which start with a non-whitespace character; some more whitespace;
# some optional non whitespace characters comprising round 1 points; some more whitespace;
# and, optionally some more non-whitespace characters comprising round 2 points if present
re_fields = re.compile(r'^([A-Z]{2}[0-9]{3})\s+(\S.{49})\s+(\S+)?\s+(\S+)?')

# array to hold matched groups
data = []

for datafile, level in zip((l8, l76), (8, 76)):
    # encoding=cp1252 necessary to decode some characters on page
    with open(datafile, 'r', encoding='cp1252') as f:
        for line in f:
            match = re.match(re_fields, line)
            if match:
                fields = list(match.groups())
                fields.append(level)
                data.append(fields)

# column names
columns = ['Course Code', 'Course Name', 'Rnd1', 'Rnd2', 'Level']
df = pd.DataFrame.from_records(data, columns=columns)
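
To see how the pattern splits a line into fields, it can be applied to a couple of synthetic lines. The helper `make_line` and the sample values are hypothetical; the sketch assumes the course name is padded to 50 characters and that each line ends with a newline, as it does when iterating over a file.

```python
import re

# Same pattern as in the cell above, written as a raw string
re_fields = re.compile(r'^([A-Z]{2}[0-9]{3})\s+(\S.{49})\s+(\S+)?\s+(\S+)?')

def make_line(code, name, rnd1, rnd2=""):
    # Hypothetical helper: pads the course name to 50 characters, as the pattern expects
    return f"{code}  {name.ljust(50)}  {rnd1}  {rnd2}".rstrip() + "\n"

one_round = re_fields.match(make_line("AL801", "Software Design", "300"))
two_rounds = re_fields.match(make_line("AL830", "General Nursing", "451*", "444"))

print(one_round.groups())
print(two_rounds.groups())
```

When only one points value is present, the fourth group is `None`, which becomes a missing value in the `Rnd2` column of the DataFrame. The trailing newline matters: without it, the final `\s+` has nothing to match and the regex backtracks, placing a lone value in the fourth group instead of the third.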



Create new columns to hold information currently designated by *'s and #'s in numeric columns

Create new column indicating whether the course requires a test, interview or portfolio
This is indicated by a '#' in the Rnd1 or Rnd2 column
df['Test'] = df['Rnd1'].str.contains('#', na=False) | df['Rnd2'].str.contains('#', na=False)

Create a column indicating courses where not all applicants at Rnd1 point score were offered a place
This is indicated by a '*' in the Rnd1 or Rnd2 column
df['Not All'] = df['Rnd1'].str.contains(r'\*', na=False) | df['Rnd2'].str.contains(r'\*', na=False)

Create a new column for AQA meaning All Qualified Applicants were offered a place
df['AQA'] = df['Rnd1'].str.contains('AQA', na=False) | df['Rnd2'].str.contains('AQA', na=False)

Create a new column for 'New competition for available places' which seems to be courses 
for which the points have increased in round 2. Only occurs in level 76 and is indicated 
by a 'v' in 'Rnd2' column
df['New Comp'] = df['Rnd1'].str.contains('v', na=False) | df['Rnd2'].str.contains('v', na=False)

Generate 'EOS' column. == Rnd2 if it exists, otherwise Rnd1
df['EOS'] = np.where(df['Rnd2'].isnull(), df['Rnd1'], df['Rnd2'])

Remove Non-digits from Rnd1 and Rnd2 columns and convert columns to numeric values, 
with NaNs where values are missing (errors = 'coerce')
(Because NaN is a float, the whole columns must be floats)
df['Rnd1'] = pd.to_numeric(df['Rnd1'].str.replace('[^0-9]+', '', regex=True), errors='coerce')
df['Rnd2'] = pd.to_numeric(df['Rnd2'].str.replace('[^0-9]+', '', regex=True), errors='coerce')

In [94]:
newcols = {'Test': '#', 'Not All': r'\*', 'AQA': 'AQA', 'New Comp': 'v'}

for k, v in newcols.items():
    df[k] = df['Rnd1'].str.contains(v, na=False) | df['Rnd2'].str.contains(v, na=False)

# Generate 'EOS' column. == Rnd2 if it exists, otherwise Rnd1
df['EOS'] = np.where(df['Rnd2'].isnull(), df['Rnd1'], df['Rnd2'])

# Remove non-digits from the EOS column and convert it to numeric values,
# with NaNs where values are missing (errors = 'coerce')
# (Because NaN is a float, the whole column must be floats)
df['EOS'] = pd.to_numeric(df['EOS'].str.replace('[^0-9]+', '', regex=True), errors='coerce')

df.head(20)

Unnamed: 0,Course Code,Course Name,Rnd1,Rnd2,Level,Test,Not All,AQA,New Comp,EOS
0,AL801,Software Design for Virtual Reality and Gaming...,300,,8,False,False,False,False,300.0
1,AL802,Software Design in Artificial Intelligence for...,313,,8,False,False,False,False,313.0
2,AL803,Software Design for Mobile Apps and Connected ...,350,,8,False,False,False,False,350.0
3,AL805,Computer Engineering for Network Infrastructur...,321,,8,False,False,False,False,321.0
4,AL810,Quantity Surveying ...,328,,8,False,False,False,False,328.0
5,AL811,Civil Engineering ...,,,8,False,False,False,False,
6,AL820,Mechanical and Polymer Engineering ...,327,,8,False,False,False,False,327.0
7,AL830,General Nursing ...,451*,444,8,False,True,False,False,444.0
8,AL832,Mental Health Nursing ...,440*,431,8,False,True,False,False,431.0
9,AL835,Pharmacology ...,356,,8,False,False,False,False,356.0
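
The two cleaning steps, extracting flags from the marker characters and then reducing the field to a number, can be tried on a small series. The raw values below are hypothetical examples of the kinds of strings described above.

```python
import pandas as pd

# Hypothetical raw values of the kind found in the Rnd1/Rnd2 columns
raw = pd.Series(["451*", "#700", "AQA", "300", None])

# Flag columns: regex=False treats '*' and '#' as literal characters
not_all = raw.str.contains("*", regex=False, na=False)
needs_test = raw.str.contains("#", regex=False, na=False)

# Strip everything except digits, then convert; empty strings and missing values become NaN
points = pd.to_numeric(raw.str.replace(r"[^0-9]+", "", regex=True), errors="coerce")

print(pd.DataFrame({"raw": raw, "not_all": not_all, "needs_test": needs_test, "points": points}))
```

Passing `regex=False` is an alternative to escaping the asterisk as `\*`. The flags have to be extracted before the numeric conversion, because the conversion discards the marker characters.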


### 2020 Points Data

In [None]:
# The spreadsheet was saved locally as cao_2020_lvl876.xlsx by the download step
df2020 = pd.read_excel(os.path.join(data_dir, 'cao_2020_lvl876.xlsx'), header=10)
df2020.columns
df2020.head()

### 2019 Points Data

The 2019 points data is held in two PDF files, one for level 8 courses and one for levels 6 and 7.

#### Level 8

In [None]:
# read the entire pdf, extracting tables into a single dataframe
df8 = read_pdf(os.path.join(data_dir, "cao_2019_lvl8.pdf"), pages="all", multiple_tables=False)[0]
df8.head(10)

In [None]:
# Create a new column in the dataframe for institution name 
# identify institution name rows as those containing null course codes
# and add those institution names to the new institution column
df8['Institution'] = df8[df8['Course Code'].isnull()]['INSTITUTION and COURSE']
df8.rename(columns={'INSTITUTION and COURSE':'Course Name'}, inplace=True)
df8.head()

In [None]:
# Fill empty fields in the institution column with the most recent non-na field
df8['Institution'] = df8['Institution'].ffill()
df8.head()
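
The heading-row technique can be seen end to end on a toy table. The course codes and institution names below are invented; only the column names follow the PDF table.

```python
import pandas as pd

# Toy table (hypothetical rows): institution headings have no course code
toy = pd.DataFrame({
    "Course Code": [None, "AA101", "AA102", None, "BB201"],
    "INSTITUTION and COURSE": [
        "Example Institute A",
        "Software Design",
        "Computer Engineering",
        "Example Institute B",
        "Business",
    ],
})

# Keep the name only on heading rows, then carry it down to the courses beneath
is_heading = toy["Course Code"].isna()
toy["Institution"] = toy["INSTITUTION and COURSE"].where(is_heading).ffill()

# Drop the heading rows now that every course carries its institution
courses = toy[~is_heading].reset_index(drop=True)
print(courses)
```

This relies on the rows staying in their original order, since the forward fill assigns each course to the nearest heading above it.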

In [None]:
# Finally, remove rows containing only institution names
# (copy so that later column assignments act on an independent frame)
df8 = df8[df8['Course Code'].notna()].copy()

# Set some display options
# pd.set_option("display.max_rows", None)
# pd.set_option("display.max_colwidth", None)

df8

In [None]:
# Extract dictionary mapping two-letter college codes to institution names
# This will be useful for the 2021 data
college_code = dict(zip(df8['Course Code'].str[:2].unique(), df8['Institution'].unique()))
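
Zipping two `unique()` results pairs entries by position, which is correct only if every two-letter prefix belongs to exactly one institution and both appear in the same order. A possible alternative, sketched here on invented data, pairs each prefix with the institution found on the same row.

```python
import pandas as pd

# Toy data (hypothetical codes and names)
toy = pd.DataFrame({
    "Course Code": ["AA101", "AA102", "BB201", "BB202"],
    "Institution": ["Example Institute A", "Example Institute A",
                    "Example Institute B", "Example Institute B"],
})

# Pair each two-letter prefix with the institution on the same row,
# keeping the first institution seen for each prefix
pairs = toy.assign(Prefix=toy["Course Code"].str[:2]).drop_duplicates("Prefix")
college_code = dict(zip(pairs["Prefix"], pairs["Institution"]))
print(college_code)
```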

In [None]:
# Examine EOS values which contain characters other than digits, '#' and '*'
df8[df8['EOS'].str.contains(r'[^0-9#*]') == True]

In [None]:
# Examine Mid values which contain non-numeric characters
df8[df8['Mid'].str.contains(r'[^0-9]') == True]

In [None]:
# Create new column indicating whether the course requires a test, interview or portfolio
# This is indicated by a '#' in the EOS column
df8['Test'] = df8['EOS'].str.contains('#', na=False)

# Create a column indicating courses where not all applicants at EOS point score were offered a place
# This is indicated by a '*' in the EOS column
df8['Not All'] = df8['EOS'].str.contains(r'\*', na=False)

# Create a column indicating courses where a matric is required
# This is indicated by the string '+matric' in the EOS column.
# However, the tabula table parsing has interpreted the r in matric as a cell boundary so only 'mat' 
# remains in the EOS column and 'ic' appears in the Mid column. The 'ic' will be dealt with next 
df8['Matric'] = df8['EOS'].str.contains('mat', na=False)


In [None]:
# Remove Non-digits from EOS and Mid columns and convert columns to numeric values, with NaNs where values are missing (errors = 'coerce')
# (Because NaN is a float, the whole columns must be floats)
df8['EOS'] = pd.to_numeric(df8['EOS'].str.replace('[^0-9]+', '', regex=True), errors='coerce')
df8['Mid'] = pd.to_numeric(df8['Mid'].str.replace('[^0-9]+', '', regex=True), errors='coerce')

In [None]:
# Repair LM124 Course Name
df8.loc[df8['Course Code']=='LM124', 'Course Name'] += 'ce)'

#### Level 6 and 7

In [None]:
# read the entire pdf, extracting tables into a single dataframe
df67 = read_pdf(os.path.join(data_dir, "cao_2019_lvl76.pdf"), pages="all", multiple_tables=False)[0]
df67.head(10)

In [None]:
# Some text has been included in the dataframe. The actual table starts at row 7 with the column names
# Rename columns using row 7
df67.columns = df67.iloc[7]
df67.rename_axis(None, axis=1, inplace=True)


In [None]:
# Delete rows up to row 7
df67.drop(df67.index[range(0, 8)], axis=0, inplace=True)
df67.head(10)
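
Promoting a data row to the header and discarding the rows above it can be shown on a small frame. The content below is invented, and the header sits at position 2 rather than 7 to keep the example short.

```python
import pandas as pd

# Toy frame (hypothetical content): two rows of stray text, then the real header row
raw = pd.DataFrame([
    ["Some preamble text", None, None],
    ["More preamble", None, None],
    ["Course Code", "EOS", "Mid"],
    ["AA101", "300", "350"],
    ["AA102", "280*", "320"],
])

header_row = 2  # position of the row holding the column names
table = raw.iloc[header_row + 1:].copy()
table.columns = raw.iloc[header_row].tolist()
table = table.reset_index(drop=True)
print(table)
```

Converting the header row with `tolist()` avoids carrying the row's index name over to the columns, which is what the `rename_axis(None, axis=1)` call handles in the cell above.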

In [None]:
df67[df67['Mid'].str.contains('[^0-9]') == True]
df67[df67['EOS'].str.contains('[^0-9]') == True]

In [None]:
# Now this dataframe can be cleaned up in the same manner as level 8
def cleanup(df):
    
    df = df.copy(deep=True)
    
    # Create a new column in the dataframe for institution name 
    # identify institution name rows as those containing null course codes
    # and add those institution names to the new institution column
    df['Institution'] = df[df['Course Code'].isnull()]['INSTITUTION and COURSE']
    df.rename(columns={'INSTITUTION and COURSE':'Course Name'}, inplace=True)

    # Fill empty fields in the institution column with the most recent non-na field
    df['Institution'] = df['Institution'].ffill()

    # Finally, remove rows containing only institution names
    # (copy so that the column assignments below act on an independent frame)
    df = df[df['Course Code'].notna()].copy()

    # Create new column indicating whether the course requires a test, interview or portfolio
    # This is indicated by a '#' in the EOS column
    df['Test'] = df['EOS'].str.contains('#', na=False)

    # Create a column indicating courses where not all applicants at EOS point score were offered a place
    # This is indicated by a '*' in the EOS column
    df['Not All'] = df['EOS'].str.contains(r'\*', na=False)

    # Create a column indicating courses where a matric is required
    # This is indicated by the string '+matric' in the EOS column.
    # However, the tabula table parsing has interpreted the r in matric as a cell boundary so only 'mat' 
    # remains in the EOS column and 'ic' appears in the Mid column. The 'ic' will be dealt with next 
    df['Matric'] = df['EOS'].str.contains('mat', na=False)

    # Level 6 & 7 has a new code in EOS -- 'AQA' meaning All Qualified Applicants were offered a place
    # Create a new column for AQA
    df['AQA'] = df['EOS'].str.contains('AQA', na=False)

    # Remove Non-digits from EOS and Mid columns and convert columns to numeric values, with NaNs where values are missing (errors = 'coerce')
    # (Because NaN is a float, the whole columns must be floats)
    df['EOS'] = pd.to_numeric(df['EOS'].str.replace('[^0-9]+', '', regex=True), errors='coerce')
    df['Mid'] = pd.to_numeric(df['Mid'].str.replace('[^0-9]+', '', regex=True), errors='coerce')

    return df

df67 = cleanup(df67)        

In [None]:
df67.head(10)

In [None]:
# Repair WD177 Course Name
df67.loc[df67['Course Code']=='WD177', 'Course Name'] += 'macy.)'

#### Merge dataframes

In [None]:
# add AQA column to level 8 dataframe
df8['AQA'] = False

# add level 8 column to both dataframes
df8['Level 8'] = True
df67['Level 8'] = False



In [None]:
# concatenate level 8 with levels 6 & 7
df = pd.concat([df8, df67], ignore_index=True)

# Rename column names to include year   
df = df.rename({'Course Name': 'Course Name 2019', 
                'EOS': 'EOS 2019', 
                'Mid':'Mid 2019', 
                'Test':'Test 2019', 
                'Not All': 'NotAll 2019',
                'Matric': 'Matric 2019',
                'AQA': 'AQA 2019',
                'Level 8': 'Level8 2019'}, axis=1)

In [None]:
df.head()

In [None]:
# export dataframe to csv
df.to_csv('data/cao/cao_points_2019.csv')

## Analysing the data

In [None]:
# The columns were renamed above to include the year
df[['EOS 2019', 'Mid 2019']].hist(bins=50, figsize=(20, 5))

## Conclusion

## References

[1] https://www.independent.ie/life/family/learning/understanding-your-cao-course-guide-26505318.html