# CAO Points

## Introduction

The CAO points data, available from the [CAO website](http://www.cao.ie), is published in a different format for each of the years 2019, 2020, and 2021. Each year's data therefore requires a different approach to acquisition, conversion to a pandas DataFrame, and cleaning. The 2019 data is published as two PDF files: one for level 8 courses and one for levels 6 and 7. The 2020 data is published as an Excel spreadsheet, and the 2021 data as preformatted text in an HTML webpage.

## Acquiring the data

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
from tabula import read_pdf
import requests
from requests.compat import urljoin


In [2]:
base_url = 'http://www2.cao.ie/points/'

# the 2019 data is provided as two PDF files
cao_2019_files = ('lvl8_19.pdf', 'lvl76_19.pdf')

# for each of the two points PDFs: construct the url,
# fetch the file, and write it to the working directory
for filename in cao_2019_files:
    response = requests.get(urljoin(base_url, filename))
    # fail on an HTTP error instead of saving an error page as a PDF
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)
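
The download step above re-fetches both files every time the notebook is run. A minimal sketch of one way to skip files that are already on disk follows. The helper names `plan_downloads` and `download_missing`, and the `fetch` callable, are assumptions introduced for illustration and are not part of the original notebook.

```python
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = 'http://www2.cao.ie/points/'


def plan_downloads(filenames, base_url=BASE_URL, target_dir='.'):
    """Pair each file name with its source URL and local destination path."""
    return [(urljoin(base_url, name), Path(target_dir) / name) for name in filenames]


def download_missing(filenames, fetch, base_url=BASE_URL, target_dir='.'):
    """Fetch only the files that are not already present in target_dir.

    `fetch` is a hypothetical callable that takes a URL and returns bytes.
    """
    saved = []
    for url, path in plan_downloads(filenames, base_url, target_dir):
        if path.exists():
            continue  # already downloaded on an earlier run
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved
```

In the notebook, `fetch` could be a small function that calls `requests.get`, checks the response with `raise_for_status()`, and returns `response.content`. Passing the fetcher in as an argument keeps the file-handling logic testable without network access.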

In [7]:
# read the entire pdf, extracting tables into a single dataframe
df = read_pdf("lvl8_19.pdf", pages="all", multiple_tables=False)[0]
df.head()

Unnamed: 0,Course Code,INSTITUTION and COURSE,EOS,Mid
0,,Athlone Institute of Technology,,
1,AL801,Software Design with Virtual Reality and Gaming,304.0,328.0
2,AL802,Software Design with Cloud Computing,301.0,306.0
3,AL803,Software Design with Mobile Apps and Connected...,309.0,337.0
4,AL805,Network Management and Cloud Infrastructure,329.0,442.0


In [8]:
# Create a new column in the dataframe for the institution name.
# Institution name rows are those with a null course code;
# copy their names into the new Institution column
is_institution = df['Course Code'].isnull()
df['Institution'] = df.loc[is_institution, 'INSTITUTION and COURSE']
df.rename(columns={'INSTITUTION and COURSE':'Course Name'}, inplace=True)
df.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution
0,,Athlone Institute of Technology,,,Athlone Institute of Technology
1,AL801,Software Design with Virtual Reality and Gaming,304.0,328.0,
2,AL802,Software Design with Cloud Computing,301.0,306.0,
3,AL803,Software Design with Mobile Apps and Connected...,309.0,337.0,
4,AL805,Network Management and Cloud Infrastructure,329.0,442.0,


In [9]:
# Fill empty fields in the Institution column with the most recent non-null value
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent)
df['Institution'] = df['Institution'].ffill()
df.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution
0,,Athlone Institute of Technology,,,Athlone Institute of Technology
1,AL801,Software Design with Virtual Reality and Gaming,304.0,328.0,Athlone Institute of Technology
2,AL802,Software Design with Cloud Computing,301.0,306.0,Athlone Institute of Technology
3,AL803,Software Design with Mobile Apps and Connected...,309.0,337.0,Athlone Institute of Technology
4,AL805,Network Management and Cloud Infrastructure,329.0,442.0,Athlone Institute of Technology


In [10]:
# Finally, remove rows containing only institution names
df = df[df['Course Code'].notna()]
df.head()

Unnamed: 0,Course Code,Course Name,EOS,Mid,Institution
1,AL801,Software Design with Virtual Reality and Gaming,304,328,Athlone Institute of Technology
2,AL802,Software Design with Cloud Computing,301,306,Athlone Institute of Technology
3,AL803,Software Design with Mobile Apps and Connected...,309,337,Athlone Institute of Technology
4,AL805,Network Management and Cloud Infrastructure,329,442,Athlone Institute of Technology
5,AL810,Quantity Surveying,307,349,Athlone Institute of Technology
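
The cleaning steps in cells 8 to 10 can be gathered into a single function, which is convenient if the same treatment is later applied to the second PDF. The following is a minimal sketch under that assumption. The function name `tidy_points_table` is hypothetical, and the toy input uses two rows from the output above plus one made-up institution and course, so the values are illustrative only. Unlike the notebook cells, the sketch also resets the index.

```python
import pandas as pd


def tidy_points_table(raw, name_col='INSTITUTION and COURSE', code_col='Course Code'):
    """Return one row per course, with the institution name in its own column."""
    df = raw.copy()
    # rows without a course code hold an institution name, not a course
    is_institution = df[code_col].isna()
    # keep only those names, then carry each one down to the courses below it
    df['Institution'] = df[name_col].where(is_institution).ffill()
    df = df.rename(columns={name_col: 'Course Name'})
    # drop the institution-only rows and renumber the index
    return df[~is_institution].reset_index(drop=True)


# toy input shaped like the raw table above (the last two rows are made up)
raw = pd.DataFrame({
    'Course Code': [None, 'AL801', 'AL802', None, 'XX101'],
    'INSTITUTION and COURSE': [
        'Athlone Institute of Technology',
        'Software Design with Virtual Reality and Gaming',
        'Software Design with Cloud Computing',
        'Example College',
        'Example Course',
    ],
    'EOS': [None, 304.0, 301.0, None, 300.0],
})
tidy = tidy_points_table(raw)
print(tidy)
```

The same function could be called on the table read from `lvl76_19.pdf`, provided that file uses the same column headers, which has not been checked here.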


## Analysing the data

## Conclusion

## References