# An Examination of CAO Points Data 2019 - 2021
## Submission Assignment for Fundamentals of Data Analysis
## GMIT Higher Diploma in Data Analytics
#### Gerry Donnelly November 2021

### Introduction
This assignment is about retrieving data from websites and analysing that data entirely in Python. The use case here is the points data published on the CAO website. The objective is to automatically extract the points data for 2019, 2020 and 2021, analyse it, compare the three years and identify the key trends that emerge.

The CAO, or Central Applications Office, is the body that processes all applications to the third-level higher education institutions (HEIs) in the Irish higher education system. Students sit the Leaving Certificate exam, the final assessment of second-level education, and are awarded points on the basis of their results. The HEIs use each student's points total, together with the student's previously stated course preferences, to make course offers. All offers to students are then made through the CAO.

The CAO was instituted in 1977. From a starting point of just 5 HEIs and just under 15 thousand applicants in 1977, it has grown steadily to reach 37 HEIs and in excess of 80 thousand applicants in 2021. The number of courses offered has gone from 69 in 1977 to over 1000 in 2021. [CAO Media Stats](http://www.cao.ie/index.php?page=mediapack&bb=mediastats)

While not considered in this assignment, the growth and the changes in the scale and profile of the courses offered through the CAO since its inception in 1977 provide an interesting insight into wider societal changes. Over this period Ireland has transformed itself from a largely rural and small-scale industrial profile to one where some of the juggernauts of 21st century global industry are happy to operate here. One can only marvel at the scale of the site currently under construction at Intel in Leixlip, a scene unrecognisable to the Leaving Cert cohort of 1977 but entirely in keeping with the skillsets emerging from our 21st century education system.

On a smaller but no less interesting scale, 2019 to 2021, it will be worth seeing how the impact of a global pandemic and the complete upheaval of the education system over that period is reflected in the CAO points trends across the three years. Apart from the mechanics of getting at the data, this will be the primary outcome of this assignment.

In [None]:
# CAO points page for 2021: http://www.cao.ie/index.php?page=points&p=2021

In [1]:
# Import the regex library
import re

import pandas as pd

# Import the web requests library
import requests as rq

# Import the datetime library, will use for timestamping files
import datetime as dt

# Import os library, will use this to generate folder paths for saving files. 
import os



In [2]:
# Set up the url for the CAO points for 2021
cao2021url = 'http://www2.cao.ie/points/l8.php'

In [3]:
# Get the points data from the CAO website.
resp = rq.get(cao2021url)

# Check the response status; 200 means the request succeeded.
resp

<Response [200]>

In [4]:
# Set the encoding type for the data, needed to decode the non standard characters in the data set.
resp.encoding = 'cp1252'
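
The status check and the encoding step above could be wrapped into one helper. This is a minimal sketch rather than the notebook's own code: `fetch_text` is a hypothetical name, and the `get` argument is assumed to be `rq.get` from the import cell. It is passed in so that the helper can be tried without a network connection.

```python
def fetch_text(url, get, encoding="cp1252", timeout=30):
    """Fetch url with the supplied get callable and return the decoded body."""
    resp = get(url, timeout=timeout)
    # With requests, raise_for_status() raises an HTTPError for 4xx/5xx
    # responses, so a failed download stops here instead of being parsed.
    resp.raise_for_status()
    # Override the guessed encoding before reading resp.text.
    resp.encoding = encoding
    return resp.text
```

With the notebook's names this would be called as `fetch_text(cao2021url, rq.get)`.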

In [5]:
# Use os.getcwd() to get the current working directory for saving the output files from the analysis.
cwd = os.getcwd()
print(cwd)

C:\Users\donne\OneDrive\Documents\GMIT Data\Fundamentals of Data Analysis\Assessment\data_analysis_assessment


In [6]:
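# Timestamp for the optional timestamped file name (commented out below).
now = dt.datetime.now()
#path = cwd + '/data'+'/cao2021_' + now.strftime('%Y%m%d_%H%M%S')+'.csv'
# Save to a fixed file name in the CAO_Data folder, which must already exist.
path = cwd + '/CAO_Data'+'/cao2021.csv'
print(path)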

C:\Users\donne\OneDrive\Documents\GMIT Data\Fundamentals of Data Analysis\Assessment\data_analysis_assessment/CAO_Data/cao2021.csv
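
The printed path mixes `\` and `/` separators because it is built by string concatenation. A possible alternative, sketched here with `pathlib`, builds the path from parts and creates the folder if it is missing. `build_output_path` and its parameters are hypothetical names; the folder name, file stem and timestamp format are taken from the cell above.

```python
import datetime as dt
from pathlib import Path


def build_output_path(base_dir, folder="CAO_Data", stem="cao2021", timestamp=None):
    """Return the CSV path under base_dir/folder, creating the folder if needed."""
    out_dir = Path(base_dir) / folder
    # exist_ok=True makes this safe to run repeatedly.
    out_dir.mkdir(parents=True, exist_ok=True)
    # Optional suffix in the same format as the commented-out line above.
    suffix = "" if timestamp is None else "_" + timestamp.strftime("%Y%m%d_%H%M%S")
    return out_dir / f"{stem}{suffix}.csv"
```

For example, `build_output_path(Path.cwd())` gives the fixed file name, and passing `timestamp=dt.datetime.now()` gives the timestamped variant.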


In [7]:
# Compile the regular expression that will be used to extract the course and points data.
# A number of different options were tried; they are kept here for reference.
#re_course = re.compile('([\w]{5})  (.*)([\d]{3})(\*?) *')
#re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')
#re_course = re.compile(r'([A-Z]{2}[0-9]{3})')
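
To illustrate what the second pattern above would capture, here is a minimal sketch run against a single made-up line. The sample line is hypothetical and only imitates a code, a padded title, a points value and an optional `*`; the real layout of the 2021 page may differ.

```python
import re

# Pattern from the commented-out experiment: code, title, points, optional '*'.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')

# Hypothetical line for illustration only; not copied from the CAO page.
sample = 'AL801  Example Course Title          300*'

match = re_course.fullmatch(sample)
if match:
    code, title, points, flag = match.groups()
    # The greedy (.*) keeps the padding before the points, so strip it.
    print(code, title.strip(), points, flag)
```

Because `(.*)` is greedy, the last three digits on the line are the ones treated as the points value, and a heading line with no course code does not match at all.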

In [8]:
# Extract the course codes and titles from the response and write them to the CSV file.
linecount = 0
with open(path,'w') as f:
    f.write(','.join(['ccode','ctitle']) +'\n')
    for line in resp.iter_lines():
        # Decode each line once; the page is encoded as cp1252.
        dline = line.decode('cp1252')
        #if re_course.fullmatch(line.decode('cp1252')):  
        # Course lines start with a code of two capital letters and three digits.
        if re.match('[A-Z]{2}[0-9]{3}', dline):
            # The code is the first five characters; the title is sliced from columns 7 to 57.
            ccode = dline[:5]
            ctitle = dline[7:58]
            course = ccode +', '+ctitle
            #print(course)
            linecount = linecount+1
            #csv_ver = re_course.sub(r'\1,\2,\3,\4', line.decode('cp1252'))
            #csv_ver = re.split('  +',line.decode('cp1252'))
            #f.write(','.join(csv_ver) + '\n')
            f.write(course + '\n')
            #print(','.join(csv_ver))

In [9]:
linecount

949
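
The extraction step can be tried on a few sample lines without fetching the page. The sketch below uses the same match and slice positions as the cell above, but strips the padding from the title and writes through `csv.writer`, so that a title containing a comma is quoted rather than split into extra columns. The sample lines and the `parse_course` helper are hypothetical and do not come from the CAO page.

```python
import csv
import io
import re

# Hypothetical lines imitating the fixed-width layout; not real CAO data.
sample_lines = [
    "Athlone Institute of Technology",
    "AL801  " + "Software Design for Virtual Reality and Gaming".ljust(51) + "300",
    "AL810  " + "Quantity Surveying, Level 8".ljust(51) + "307",
]

course_line = re.compile(r"[A-Z]{2}[0-9]{3}")


def parse_course(line):
    """Return (code, title) for a course line, or None for any other line."""
    if not course_line.match(line):
        return None
    # Same slice positions as the notebook cell, with the padding removed.
    return line[:5], line[7:58].strip()


buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ccode", "ctitle"])
for line in sample_lines:
    parsed = parse_course(line)
    if parsed is not None:
        writer.writerow(parsed)

print(buffer.getvalue())
```

Writing to an `io.StringIO` buffer keeps the sketch self-contained; in the notebook the writer would wrap the open file instead.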

In [10]:
#Read in the 2021 points CSV file.
courses2021 = pd.read_csv(path)
courses2021

Unnamed: 0,ccode,ctitle
0,AL801,Software Design for Virtual Reality and Gamin...
1,AL802,Software Design in Artificial Intelligence fo...
2,AL803,Software Design for Mobile Apps and Connected...
3,AL805,Computer Engineering for Network Infrastructu...
4,AL810,Quantity Surveying ...
...,...,...
944,WD211,Creative Computing ...
945,WD212,Recreation and Sport Management ...
946,WD230,Mechanical and Manufacturing Engineering ...
947,WD231,Early Childhood Care and Education ...


### CAO Points 2020

http://www2.cao.ie/points/CAOPointsCharts2020.xlsx

In [11]:
# Set up the url for the 2020 points data. It is already in Excel format.
cao2020url = 'http://www2.cao.ie/points/CAOPointsCharts2020.xlsx'

In [12]:
# Read in the excel file for the 2020 points data.
df2020 = pd.read_excel(cao2020url, skiprows=10, usecols='A:O')
df2020

Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,LEVEL,HEI,Test/Interview #,avp,v
0,Business and administration,International Business,AC120,209,,,,209,,280,8,American College,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,8,American College,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,8,National College of Art and Design,#,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,8,National College of Art and Design,#,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,8,National College of Art and Design,#,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,7,Waterford Institute of Technology,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,8,Waterford Institute of Technology,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,8,Waterford Institute of Technology,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,8,Waterford Institute of Technology,,,


In [13]:
# Set up the dataframe to select 2 columns from the data.
courses2020 = df2020[['COURSE CODE2', 'COURSE TITLE']]
# Give the dataframe standard column names, will use the same ones as the 2021 set.
courses2020.columns = ['ccode', 'ctitle']
courses2020

Unnamed: 0,ccode,ctitle
0,AC120,International Business
1,AC137,Liberal Arts
2,AD101,"First Year Art & Design (Common Entry,portfolio)"
3,AD102,Graphic Design and Moving Image Design (portfo...
4,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
1459,WD208,Manufacturing Engineering
1460,WD210,Software Systems Development
1461,WD211,Creative Computing
1462,WD212,Recreation and Sport Management


In [14]:
#df.to_csv("caopoints_2020.csv", index=False)

### CAO Points 2019

In [15]:
# Set up the url for the 2019 points data, note it is a .pdf file. 
cao2019url = 'http://www2.cao.ie/points/lvl8_19.pdf'

#### Steps to extract the pdf points data from the 2019 CAO file. 
There are a number of ways to convert the PDF file to Excel. To save time, the following was done:
- Open the file from the link above in Chrome.
- Use the built-in Adobe Acrobat Chrome extension, which has a selection of file conversion tools, including PDF to Excel conversion.
- The resulting Excel file opens in the browser and can be downloaded to a system folder.
- Delete the rows that are not needed directly in the resulting file.

In [16]:
# Read in the 2019 points excel file.
df2019 = pd.read_excel('lvl8_2019.xlsx')
df2019

Unnamed: 0,Course Code,Course,EOS,Mid
0,AL801,Software Design with Virtual Reality and Gaming,304,328.0
1,AL802,Software Design with Cloud Computing,301,306.0
2,AL803,Software Design with Mobile Apps and Connected...,309,337.0
3,AL805,Network Management and Cloud Infrastructure,329,442.0
4,AL810,Quantity Surveying,307,349.0
...,...,...,...,...
925,WD200,Arts (options),221,296.0
926,WD210,Software Systems Development,271,329.0
927,WD211,Creative Computing,275,322.0
928,WD212,Recreation and Sport Management,274,311.0


In [17]:
# Select the first 2 columns and standardise the column names.
courses2019 = df2019[['Course Code', 'Course']]
courses2019.columns = ['ccode', 'ctitle']
courses2019

Unnamed: 0,ccode,ctitle
0,AL801,Software Design with Virtual Reality and Gaming
1,AL802,Software Design with Cloud Computing
2,AL803,Software Design with Mobile Apps and Connected...
3,AL805,Network Management and Cloud Infrastructure
4,AL810,Quantity Surveying
...,...,...
925,WD200,Arts (options)
926,WD210,Software Systems Development
927,WD211,Creative Computing
928,WD212,Recreation and Sport Management


In [18]:
# Join the 2019, 2020 and 2021 files.
allcourses = pd.concat([courses2021,courses2020, courses2019])
allcourses

Unnamed: 0,ccode,ctitle
0,AL801,Software Design for Virtual Reality and Gamin...
1,AL802,Software Design in Artificial Intelligence fo...
2,AL803,Software Design for Mobile Apps and Connected...
3,AL805,Computer Engineering for Network Infrastructu...
4,AL810,Quantity Surveying ...
...,...,...
925,WD200,Arts (options)
926,WD210,Software Systems Development
927,WD211,Creative Computing
928,WD212,Recreation and Sport Management


In [19]:
# Show the rows whose course code has already appeared earlier in the combined data.
allcourses[allcourses.duplicated(subset=['ccode'])]

Unnamed: 0,ccode,ctitle
0,AC120,International Business
1,AC137,Liberal Arts
2,AD101,"First Year Art & Design (Common Entry,portfolio)"
3,AD102,Graphic Design and Moving Image Design (portfo...
4,AD103,Textile & Surface Design and Jewellery & Objec...
...,...,...
925,WD200,Arts (options)
926,WD210,Software Systems Development
927,WD211,Creative Computing
928,WD212,Recreation and Sport Management


In [20]:
# Drop the duplicate course codes in place, keeping the first occurrence of each.
allcourses.drop_duplicates(subset=['ccode'], inplace=True, ignore_index=True)
allcourses

Unnamed: 0,ccode,ctitle
0,AL801,Software Design for Virtual Reality and Gamin...
1,AL802,Software Design in Artificial Intelligence fo...
2,AL803,Software Design for Mobile Apps and Connected...
3,AL805,Computer Engineering for Network Infrastructu...
4,AL810,Quantity Surveying ...
...,...,...
1646,SG441,Environmental Science
1647,SG446,Applied Archaeology
1648,TL803,Music Technology
1649,TL812,Computing with Digital Media
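
Because `drop_duplicates` keeps the first occurrence by default, the order of the frames passed to `pd.concat` decides which year's title survives for a repeated code. A minimal sketch with made-up miniature tables (the titles are placeholders, not real course titles):

```python
import pandas as pd

# Hypothetical miniature versions of the yearly course tables.
c2021 = pd.DataFrame({"ccode": ["AL801", "AL802"],
                      "ctitle": ["Title 2021 A", "Title 2021 B"]})
c2020 = pd.DataFrame({"ccode": ["AL801", "AC120"],
                      "ctitle": ["Title 2020 A", "Title 2020 C"]})

combined = pd.concat([c2021, c2020], ignore_index=True)

# keep="first" (the default) retains the earliest row for each code,
# so the 2021 title wins because 2021 comes first in the concat list.
unique = combined.drop_duplicates(subset=["ccode"], keep="first", ignore_index=True)
print(unique)
```

This is why the combined table above shows the 2021 wording for codes that exist in all three years.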


In [36]:
# Join the full 2020 points data onto the combined course list, matching on course code.
allcourses.set_index('ccode').join(df2020.set_index('COURSE CODE2'))

Unnamed: 0_level_0,ctitle,CATEGORY (i.e.ISCED description),COURSE TITLE,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,LEVEL,HEI,Test/Interview #,avp,v
ccode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
AL801,Software Design for Virtual Reality and Gamin...,Information and Communication Technologies (ICTs),Software Design with Virtual Reality and Gaming,303,,,,303,,367,8.0,Athlone Institute of Technology,,,
AL802,Software Design in Artificial Intelligence fo...,Information and Communication Technologies (ICTs),Software Design with Artificial Intelligence f...,332,,,,332,,382,8.0,Athlone Institute of Technology,,,
AL803,Software Design for Mobile Apps and Connected...,Information and Communication Technologies (ICTs),Software Design with Mobile Apps and Connected...,337,,,,337,,360,8.0,Athlone Institute of Technology,,,
AL805,Computer Engineering for Network Infrastructu...,Information and Communication Technologies (ICTs),Computer Engineering with Network Infrastructure,333,,,,333,,360,8.0,Athlone Institute of Technology,,,
AL810,Quantity Surveying ...,Architecture and construction,Quantity Surveying,319,,,,326,,352,8.0,Athlone Institute of Technology,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SG441,Environmental Science,,,,,,,,,,,,,,
SG446,Applied Archaeology,,,,,,,,,,,,,,
TL803,Music Technology,,,,,,,,,,,,,,
TL812,Computing with Digital Media,,,,,,,,,,,,,,


In [41]:
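# Left-merge only the 2020 Round 1 points onto the course list; courses with no 2020 entry get NaN.
testjoin=allcourses.merge(df2020[['COURSE CODE2','R1 POINTS']],how='left',left_on='ccode', right_on='COURSE CODE2')
testjoin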

Unnamed: 0,ccode,ctitle,COURSE CODE2,R1 POINTS
0,AL801,Software Design for Virtual Reality and Gamin...,AL801,303
1,AL802,Software Design in Artificial Intelligence fo...,AL802,332
2,AL803,Software Design for Mobile Apps and Connected...,AL803,337
3,AL805,Computer Engineering for Network Infrastructu...,AL805,333
4,AL810,Quantity Surveying ...,AL810,319
...,...,...,...,...
1646,SG441,Environmental Science,,
1647,SG446,Applied Archaeology,,
1648,TL803,Music Technology,,
1649,TL812,Computing with Digital Media,,
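
One possible refinement of the merge above is to rename the 2020 columns first, so that the key is not duplicated, and to use the `indicator` option of `merge` to list the codes that had no 2020 match. The sketch below uses small made-up frames; `points_r1_2020` is a hypothetical column name chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-ins for allcourses and df2020.
all_codes = pd.DataFrame({"ccode": ["AL801", "SG441"],
                          "ctitle": ["Course A", "Course B"]})
points2020 = pd.DataFrame({"COURSE CODE2": ["AL801"], "R1 POINTS": ["303"]})

# Rename so both frames share the key name and the year is explicit.
renamed = points2020.rename(
    columns={"COURSE CODE2": "ccode", "R1 POINTS": "points_r1_2020"})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left merge.
joined = all_codes.merge(renamed, how="left", on="ccode", indicator=True)

# Courses that were not offered, or not listed, in 2020.
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["ccode", "ctitle"]])
```

The same pattern could be repeated for 2019 and 2021 with year-specific column names, giving one row per course code and one points column per year.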
