# CAO Points

## Introduction

The CAO points data, available from the [CAO website](http://www.cao.ie), is published in a different format for each of the years 2019, 2020, and 2021. Each year's data therefore requires a different approach to acquisition, conversion to a pandas dataframe, and cleaning. The 2019 data is published in two PDF files: one for level 8 courses, and one for levels 6 and 7. The 2020 data is published as an Excel spreadsheet, and the 2021 data as preformatted text in an HTML webpage.

The relevant columns are `Course Code`, `Course Name`, `Institution Name`, `EOS`, and `Mid`. `EOS` is the number of points achieved by the last applicant to be offered a place on the course. `Mid` is the midpoint between the highest and the lowest points scores among the applicants offered a place on the course [1].

## Acquiring the data

In [2]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
from tabula import read_pdf
import requests
from requests.compat import urljoin


### 2019 Points Data

In [None]:
base_url = 'http://www2.cao.ie/points/'

# 2019 data is provided in PDF format
cao_2019_urls  = ('lvl8_19.pdf', 'lvl76_19.pdf')

# for each of the two points pdfs construct url, 
# fetch pdf, and write to working directory
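for url in cao_2019_urls:
    response = requests.get(urljoin(base_url, url))
    # stop here if the download failed, rather than saving an error page as a pdf
    response.raise_for_status()
    with open(url, 'wb') as f:
        f.write(response.content)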
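
The download loop relies on `urljoin` to build each PDF's address from `base_url`. As a small illustration (using the standard library's `urllib.parse.urljoin`, which is what `requests.compat.urljoin` resolves to on Python 3), the sketch below shows why the trailing slash on `base_url` matters. The variable names are chosen for this example only.

```python
from urllib.parse import urljoin

# Same base address, with and without the trailing slash
base_with_slash = "http://www2.cao.ie/points/"
base_without_slash = "http://www2.cao.ie/points"

# With the slash, the file name is appended to the 'points/' directory
joined_ok = urljoin(base_with_slash, "lvl8_19.pdf")

# Without it, 'points' is treated as a file name and is replaced
joined_bad = urljoin(base_without_slash, "lvl8_19.pdf")

print(joined_ok)
print(joined_bad)
```

Without the trailing slash the last path segment is dropped, so the request would go to the wrong address.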

#### Level 8

In [None]:
# read the entire pdf, extracting tables into a single dataframe
df8 = read_pdf("lvl8_19.pdf", pages="all", multiple_tables=False)[0]
df8.head(10)

In [None]:
# Create a new column in the dataframe for institution name 
# identify institution name rows as those containing null course codes
# and add those institution names to the new institution column
df8['Institution'] = df8[df8['Course Code'].isnull()]['INSTITUTION and COURSE']
df8.rename(columns={'INSTITUTION and COURSE':'Course Name'}, inplace=True)
df8.head()

In [None]:
# Fill empty fields in the institution column with the most recent non-na field
df8['Institution'] = df8['Institution'].ffill()
df8.head()

In [None]:
# Finally, remove rows containing only institution names
# (.copy() so that columns added later do not trigger a SettingWithCopyWarning)
df8 = df8[df8['Course Code'].notna()].copy()

# Set some display options
# pd.set_option("display.max_rows", None)
# pd.set_option("display.max_colwidth", None)

df8
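
The three cells above turn institution heading rows into an `Institution` column. The following is a minimal sketch of the same pattern on invented toy data (the college and course names are made up, and `Series.where` is used here as one possible way to pick out the heading rows), which may make the steps easier to follow in isolation.

```python
import pandas as pd

# Toy data: rows without a course code are institution headings
toy = pd.DataFrame({
    "Course Code": [None, "AA101", "AA102", None, "BB201"],
    "INSTITUTION and COURSE": ["College A", "Art", "Biology",
                               "College B", "Chemistry"],
})

# 1. Identify heading rows and copy their text into a new column
is_institution = toy["Course Code"].isna()
toy["Institution"] = toy["INSTITUTION and COURSE"].where(is_institution)

# 2. Carry each institution name down to the course rows beneath it
toy["Institution"] = toy["Institution"].ffill()

# 3. Drop the heading rows and rename the course name column
toy = toy[~is_institution].rename(
    columns={"INSTITUTION and COURSE": "Course Name"})

print(toy)
```

Each remaining row then carries the name of the institution whose heading preceded it.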

In [None]:
# Examine EOS values which contain non-numeric characters
df8[df8['EOS'].str.contains(r'[^0-9#*]') == True]

In [None]:
# Examine Mid values which contain non-numeric characters
df8[df8['Mid'].str.contains(r'[^0-9]') == True]
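
The `== True` comparison in the two cells above is there because `str.contains` does not return a plain `False` for missing values by default. The sketch below, on an invented Series, compares that idiom with passing `na=False`, which is the form used later in this notebook. Both should select the same rows.

```python
import pandas as pd

# Invented sample of raw EOS-like values, including a missing one
sample = pd.Series(["300", "#402*", None, "AQA"])

# Idiom used above: compare the result with True so missing values drop out
mask_compare = sample.str.contains(r"[^0-9]") == True

# Alternative: tell str.contains to treat missing values as False
mask_na_false = sample.str.contains(r"[^0-9]", na=False)

print(sample[mask_na_false])
```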

In [None]:
# Create new column indicating whether the course requires a test, interview or portfolio
# This is indicated by a '#' in the EOS column
df8['Test'] = df8['EOS'].str.contains('#', na=False)

# Create a column indicating courses where not all applicants at EOS point score were offered a place
# This is indicated by a '*' in the EOS column
df8['Not All'] = df8['EOS'].str.contains(r'\*', na=False)

# Create a column indicating courses where a matric is required
# This is indicated by the string '+matric' in the EOS column.
# However, the tabula table parsing has interpreted the r in matric as a cell boundary so only 'mat' 
# remains in the EOS column and 'ic' appears in the Mid column. The 'ic' will be dealt with next 
df8['Matric'] = df8['EOS'].str.contains('mat', na=False)


In [None]:
# Remove Non-digits from EOS and Mid columns and convert columns to numeric values, with NaNs where values are missing (errors = 'coerce')
# (Because NaN is a float, the whole columns must be floats)
df8['EOS'] = pd.to_numeric(df8['EOS'].str.replace(r'[^0-9]+', '', regex=True), errors='coerce')
df8['Mid'] = pd.to_numeric(df8['Mid'].str.replace(r'[^0-9]+', '', regex=True), errors='coerce')
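
To make the flag extraction and digit stripping above concrete, here is a plain-Python sketch of the same logic applied to a single raw `EOS` string. The function name `parse_eos` and the sample strings are hypothetical and only serve the illustration. The notebook itself does this column-wise with pandas.

```python
import re

def parse_eos(raw):
    """Split one raw EOS string into numeric points and marker flags."""
    # Keep only the digits; everything else is a marker or stray text
    digits = re.sub(r"[^0-9]+", "", raw)
    return {
        # No digits at all means no points value, so use NaN
        "points": float(digits) if digits else float("nan"),
        "test": "#" in raw,      # test / interview / portfolio required
        "not_all": "*" in raw,   # not all on this score were offered places
    }

print(parse_eos("#402*"))
print(parse_eos("#+mat"))
```

Note that stripping non-digits joins any separate digit runs into one number, so this approach assumes each cell holds at most one points value.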

In [None]:
# Repair LM124 Course Name
df8.loc[df8['Course Code']=='LM124', 'Course Name'] += 'ce)'
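
One thing to be aware of with a `+=` repair like the one above is that re-running the cell appends the suffix a second time. A possible guard, sketched here on invented data with a hypothetical helper `append_once`, is to append only when the name does not already end with the suffix.

```python
import pandas as pd

def append_once(df, code, suffix):
    """Append suffix to the course name for `code` unless already present."""
    mask = (df["Course Code"] == code) & ~df["Course Name"].str.endswith(suffix)
    if mask.any():
        df.loc[mask, "Course Name"] += suffix

# Invented example: the first course name has been cut short
courses = pd.DataFrame({
    "Course Code": ["ZZ001", "ZZ002"],
    "Course Name": ["Data Scien", "History"],
})

append_once(courses, "ZZ001", "ce")
append_once(courses, "ZZ001", "ce")  # second call changes nothing

print(courses)
```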

#### Level 6 and 7

In [17]:
# read the entire pdf, extracting tables into a single dataframe
df67 = read_pdf("lvl76_19.pdf", pages="all", multiple_tables=False)[0]
df67.head(10)

Unnamed: 0.1,Unnamed: 0,ADMISSION DATA 2019,Unnamed: 2,Unnamed: 3
0,,End of Season,,
1,,"Level 6, 7",,
2,,The details given are for general information...,,
3,*,Not all on this points score were offered places,,
4,#,Test / Interview / Portfolio / Audition,,
5,AQA,All qualified applicants,,
6,,,,
7,Course Code,INSTITUTION and COURSE,EOS,Mid
8,,Athlone Institute of Technology,,
9,AL600,Software Design,205,306


In [18]:
# Some text has been included in the dataframe. The actual table starts at row 7 with the column names
# Rename columns using row 7
df67.columns = df67.iloc[7]
df67.rename_axis(None, axis=1, inplace=True)


In [19]:
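# Delete the preamble rows above the header row (rows 0 to 6)
# Note that row 7, the header row itself, is still present afterwards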
df67.drop(df67.index[range(0, 7)], axis=0, inplace=True)
df67.head(10)

Unnamed: 0,Course Code,INSTITUTION and COURSE,EOS,Mid
7,Course Code,INSTITUTION and COURSE,EOS,Mid
8,,Athlone Institute of Technology,,
9,AL600,Software Design,205,306
10,AL601,Computer Engineering,196,272
11,AL602,Mechanical Engineering,258,424
12,AL604,Civil Engineering,252,360
13,AL630,Pharmacy Technician,306,366
14,AL631,Dental Nursing,326,379
15,AL632,Applied Science,243,372
16,AL650,Business,210,317
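
As the output above shows, the header row (row 7) is still present as a data row after the preamble is dropped. A minimal sketch of one way to promote a row to column names and discard it in the same step is given below. The toy data and the names `raw`, `header_row` and `table` are invented for the example.

```python
import pandas as pd

# Toy frame: one preamble row, then the real header, then the data
raw = pd.DataFrame([
    ["notes", None, None],
    ["Course Code", "EOS", "Mid"],
    ["XX001", "300", "350"],
    ["XX002", "AQA", "310"],
])

header_row = 1  # position of the row holding the real column names

# Keep only the rows below the header, then apply the header as column names
table = raw.iloc[header_row + 1:].copy()
table.columns = raw.iloc[header_row].tolist()
table = table.reset_index(drop=True)

print(table)
```

Slicing from `header_row + 1` removes the preamble and the header row together, so no stray header text is left in the data.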


In [None]:
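# Examine Mid and EOS values which contain non-numeric characters
# (only the last expression in a cell is displayed, so only the EOS result is shown)
df67[df67['Mid'].str.contains(r'[^0-9]') == True]
df67[df67['EOS'].str.contains(r'[^0-9]') == True]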

In [20]:
# Now this dataframe can be cleaned up in the same manner as level 8
def cleanup(df):
    
    df = df.copy(deep=True)
    
    # Create a new column in the dataframe for institution name 
    # identify institution name rows as those containing null course codes
    # and add those institution names to the new institution column
    df['Institution'] = df[df['Course Code'].isnull()]['INSTITUTION and COURSE']
    df.rename(columns={'INSTITUTION and COURSE':'Course Name'}, inplace=True)

    # Fill empty fields in the institution column with the most recent non-na field
    df['Institution'] = df['Institution'].ffill()

    # Finally, remove rows containing only institution names
    # (.copy() so that the columns added below do not trigger a SettingWithCopyWarning)
    df = df[df['Course Code'].notna()].copy()

    # Create new column indicating whether the course requires a test, interview or portfolio
    # This is indicated by a '#' in the EOS column
    df['Test'] = df['EOS'].str.contains('#', na=False)

    # Create a column indicating courses where not all applicants at EOS point score were offered a place
    # This is indicated by a '*' in the EOS column
    df['Not All'] = df['EOS'].str.contains(r'\*', na=False)

    # Create a column indicating courses where a matric is required
    # This is indicated by the string '+matric' in the EOS column.
    # However, the tabula table parsing has interpreted the r in matric as a cell boundary so only 'mat' 
    # remains in the EOS column and 'ic' appears in the Mid column. The 'ic' will be dealt with next 
    df['Matric'] = df['EOS'].str.contains('mat', na=False)

    # Level 6 & 7 has a new code in EOS -- 'AQA' meaning All Qualified Applicants were offered a place
    # Create a new column for AQA
    df['AQA'] = df['EOS'].str.contains('AQA', na=False)

    # Remove Non-digits from EOS and Mid columns and convert columns to numeric values, with NaNs where values are missing (errors = 'coerce')
    # (Because NaN is a float, the whole columns must be floats)
    df['EOS'] = pd.to_numeric(df['EOS'].str.replace(r'[^0-9]+', '', regex=True), errors='coerce')
    df['Mid'] = pd.to_numeric(df['Mid'].str.replace(r'[^0-9]+', '', regex=True), errors='coerce')

    return df

df67 = cleanup(df67)        



In [None]:
# Repair WD177 Course Name
df67.loc[df67['Course Code']=='WD177', 'Course Name'] += 'macy.)'

In [None]:
df67.head(10)

## Analysing the data

In [None]:
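# df is not defined earlier in this notebook; as an assumption, combine the
# two cleaned 2019 dataframes (level 8 and levels 6 and 7) for analysis
df = pd.concat([df8, df67], ignore_index=True)
df.describe()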

In [None]:
df[['EOS', 'Mid']].hist(bins=50, figsize=(20, 5))

## Conclusion

## References

[1] https://www.independent.ie/life/family/learning/understanding-your-cao-course-guide-26505318.html