# CAO Points Analysis
### Conor McCaffrey

***



Give info about task

***

Give info about packages you are importing and why, with references

In [1]:
# Convenient HTTP requests.
import requests as rq

# Regular expressions.
import re

# Dates and times.
import datetime as dt

# Data frames (tabular, spreadsheet-like data).
import pandas as pd

# For downloading.
import urllib.request as urlrq

<br>

## 2021 Points

#### Please click below for 2021 courses
http://www.cao.ie/index.php?page=points&p=2021 [1]



Give overview here of response function (concise)

##### Let's gather the Level 8 courses

In [2]:
# Fetch the CAO Level 8 points URL.  (based on Lecture Notes)
resp8 = rq.get('http://www2.cao.ie/points/l8.php')
# Have a quick peek.
resp8       # Response 200 means all is okay. 404 means not found

<Response [200]>

##### Let's gather the Level 7/6 courses

In [3]:
# Fetch the CAO Level 7/6 points URL.
resp7 = rq.get('http://www2.cao.ie/points/l76.php')
# Have a quick peek.
resp7      # Response 200 means all is okay. 404 means not found

<Response [200]>
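
The status codes mentioned in the comments above can also be described in code rather than checked by eye. This is a minimal sketch using only the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical and not part of the notebook. With a real `requests` response, the number would come from `resp8.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK' for an HTTP status code."""
    status = HTTPStatus(code)
    return f'{status.value} {status.phrase}'


# Hypothetical helper applied to the two codes mentioned in the comments.
ok_text = describe_status(200)
missing_text = describe_status(404)
print(ok_text)
print(missing_text)
```

`requests` also offers `raise_for_status()`, which raises an exception for 4xx and 5xx responses, so a failed download does not go unnoticed.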

<br>

## Save original dataset

In [4]:
# Get current date and time
now = dt.datetime.now()

# Format as a string.
nowstr = now.strftime('%Y%m%d_%H%M%S')
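
To see what the format string produces, here is a small sketch with a fixed date instead of `now()`. The fixed date is chosen only for illustration; the variable names `fixed`, `stamp` and `example_path` are assumptions, not part of the notebook.

```python
import datetime as dt

# A fixed moment, used here only so the result is repeatable.
fixed = dt.datetime(2021, 11, 1, 20, 11, 21)

# Year, month, day, underscore, then hour, minute, second.
stamp = fixed.strftime('%Y%m%d_%H%M%S')

# The same string concatenation the notebook uses for its file paths.
example_path = 'data/cao2021_Lvl8_' + stamp + '.html'
print(example_path)
```

Because the timestamp is zero-padded and ordered from year down to second, the saved files sort chronologically by name.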

In [5]:
# Create a filepath for original Level 8 data
path8 = 'data/cao2021_Lvl8_' +  nowstr + '.html'
# Create a filepath for original Level 7/6 data
path7 = 'data/cao2021_Lvl76_' +  nowstr + '.html'

<br>

## Fixing the encoding declared by the server

***

Technically, the server says we should decode the response as iso-8859-1:

```
Content-Type: text/html; charset=iso-8859-1

```

However, one line uses the byte \x96, which is not a printable character in iso-8859-1. <br>
Therefore, we decode with cp1252 instead, which is otherwise similar but maps \x96 to an en dash.
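
The difference between the two encodings can be seen on a single byte string. This is a minimal sketch; the sample bytes in `raw` are hypothetical and only stand in for a line of the real page.

```python
# Hypothetical bytes containing the problematic \x96 byte.
raw = b'Level 8 \x96 Round 1'

# iso-8859-1 maps the byte to an invisible control character.
as_latin1 = raw.decode('iso-8859-1')

# cp1252 maps the same byte to an en dash.
as_cp1252 = raw.decode('cp1252')

print(repr(as_latin1))
print(repr(as_cp1252))
```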

In [6]:
## Level 8
# The server uses the wrong encoding, fix it
original_encoding = resp8.encoding
# Change it to cp1252
resp8.encoding = 'cp1252'

In [7]:
# Save the original html file
with open(path8, 'w') as f:
    f.write(resp8.text)

In [8]:
## Level 7/6
# The server uses the wrong encoding, fix it
original_encoding = resp7.encoding
# Change it to cp1252
resp7.encoding = 'cp1252'

In [9]:
# Save the original html file
with open(path7, 'w') as f:
    f.write(resp7.text)
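
Two things can go wrong when saving like this: the `data` folder may not exist, and `open(..., 'w')` uses a platform-dependent default encoding. The sketch below shows one way to handle both, writing to a temporary folder so it can be run safely. The text in `html_text` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical page content containing an en dash.
html_text = '<p>Course \u2013 300</p>'

with tempfile.TemporaryDirectory() as tmp:
    # Create the data folder if it is missing.
    data_dir = Path(tmp) / 'data'
    data_dir.mkdir(parents=True, exist_ok=True)

    # Write and read back with an explicit encoding.
    out_path = data_dir / 'cao2021_Lvl8_example.html'
    out_path.write_text(html_text, encoding='utf-8')
    round_trip = out_path.read_text(encoding='utf-8')

print(round_trip == html_text)
```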

## Use regular expressions to select the lines we want

***

In [10]:
# Compile the regular expression for matching course lines.
# The r prefix makes this a raw string, so Python does not process the backslashes itself.
re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')
# ([A-Z]{2}[0-9]{3}) matches the course code: two capital letters followed by three digits.
# (.*) is filler that can match almost anything; here it captures the course name and the spaces after it.
# ([0-9]{3}) matches a three-digit points value.
# (\*?) matches an optional literal asterisk: the backslash stops * acting as a quantifier, and ? means zero or one of.
# As a quantifier, * means zero or more of the preceding item and + means one or more, so "88*" is the same as "8+".
# The final ' *' (a space followed by an asterisk) matches any number of trailing spaces.
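
To check how the four groups behave, the pattern can be tried on a few sample lines. This is a sketch only: the three lines below are hypothetical and the real page layout may differ.

```python
import re

re_course = re.compile(r'([A-Z]{2}[0-9]{3})  (.*)([0-9]{3})(\*?) *')

# Hypothetical lines: a plain course, a course with an asterisk, and a heading.
plain = 'AL801  Software Design for Virtual Reality and Gaming     300'
starred = 'XX123  Some Course  456*  '
heading = 'Athlone Institute of Technology'

for text in (plain, starred, heading):
    match = re_course.fullmatch(text)
    # Print the four captured groups, or None when the line is not a course.
    print(match.groups() if match else None)
```

Note that the second group keeps the spaces after the course name, because `.*` is greedy and only gives back what the points group needs.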


### Loop through the lines of the response

***

In [11]:
# The filepath for the Level 8 CSV file.
# The data is almost in the correct format for analysis, apart from the decoding issue noted above.
path8 = 'data/cao2021_Lvl8_' + nowstr + '.csv'
# Keep track of how many courses we process.
no_lines = 0

# Open the CSV file for writing.
try:
    with open(path8, 'w') as f:
        # Loop through the lines of the response.
        for line in resp8.iter_lines():  # Approach taken from Stack Overflow.
            # Decode the raw bytes using cp1252 rather than the encoding the server declared.
            dline = line.decode('cp1252')
            # Match only the lines we want: the ones representing courses.
            if re_course.fullmatch(dline):
                # Add one to the lines counter.
                no_lines = no_lines + 1
                # Uncomment the next line to see the original.
                #print(line)
                # Pick out the four groups of the matched line: \1 is the first group, \2 the second, and so on.
                csv_version = re_course.sub(r'\1,\2,\3,\4', dline)
                # Uncomment the next line to print the CSV-style line.
                #print(csv_version)
                # Split the line on two or more spaces (two spaces followed by + means two or more).
                linesplit = re.split('  +', dline)
                #print(','.join(linesplit))
                # Rejoin the substrings with commas in between and write them out.
                f.write(','.join(linesplit) + '\n')
except Exception as err:
    print('This has caused an error:', err)
else:
    print('Process has been successful.',f'Total number of lines is {no_lines}.')




Process has been successful. Total number of lines is 922.
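
One thing to watch with joining on commas is that a course title may itself contain a comma. The sketch below shows the split step and compares the plain join with the standard library's `csv.writer`, which quotes such fields. The sample line is hypothetical (the spacing and points value are made up), and `csv.writer` is a suggested alternative, not what the notebook does.

```python
import csv
import io
import re

# Hypothetical line with a comma inside the course title.
line = 'AD101  First Year Art & Design (Common Entry,portfolio)  300'

# Split on two or more spaces, as in the loop above.
fields = re.split('  +', line)

# Plain join: the comma in the title creates an extra column.
naive = ','.join(fields)

# csv.writer quotes the field that contains a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
quoted = buffer.getvalue().strip()

print(naive)
print(quoted)
```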


In [12]:
# TODO: consider making a table mapping each institution to its course-code letters.
df2021 = pd.read_csv('data/cao2021_Lvl8_20211101_201121.csv', sep=',')

In [13]:
df2021

Unnamed: 0,Course Code,Course,RND 1,RND 2
0,AL801,Software Design for Virtual Reality and Gaming,300,
1,AL802,Software Design in Artificial Intelligence for...,313,
2,AL803,Software Design for Mobile Apps and Connected ...,350,
3,AL805,Computer Engineering for Network Infrastructure,321,
4,AL810,Quantity Surveying,328,
...,...,...,...,...
917,WD211,Creative Computing,270,
918,WD212,Recreation and Sport Management,262,
919,WD230,Mechanical and Manufacturing Engineering,230,230
920,WD231,Early Childhood Care and Education,266,


In [14]:
# The filepath for the Level 7/6 CSV file.
# The data is almost in the correct format for analysis, apart from the decoding issue noted above.
path7 = 'data/cao2021_Lvl76_' + nowstr + '.csv'
# Keep track of how many courses we process.
no_lines = 0

# Open the CSV file for writing.
try:
    with open(path7, 'w') as f:
        # Loop through the lines of the response.
        for line in resp7.iter_lines():  # Approach taken from Stack Overflow.
            # Decode the raw bytes using cp1252 rather than the encoding the server declared.
            dline = line.decode('cp1252')
            # Match only the lines we want: the ones representing courses.
            if re_course.fullmatch(dline):
                # Add one to the lines counter.
                no_lines = no_lines + 1
                # Uncomment the next line to see the original.
                #print(line)
                # Pick out the four groups of the matched line: \1 is the first group, \2 the second, and so on.
                csv_version = re_course.sub(r'\1,\2,\3,\4', dline)
                # Uncomment the next line to print the CSV-style line.
                #print(csv_version)
                # Split the line on two or more spaces (two spaces followed by + means two or more).
                linesplit = re.split('  +', dline)
                #print(','.join(linesplit))
                # Rejoin the substrings with commas in between and write them out.
                f.write(','.join(linesplit) + '\n')
except Exception as err:
    print('This has caused an error:', err)
else:
    print('Process has been successful.',f'Total number of lines is {no_lines}.')

Process has been successful. Total number of lines is 390.


In [15]:
# TODO: consider making a table mapping each institution to its course-code letters.
df2021_7 = pd.read_csv('data/cao2021_Lvl76_20211101_205701.csv', sep=',')

In [16]:
df2021_7

Unnamed: 0,Course Code,Course,RND 1,RND 2
0,AL605,Music and Instrument Technology,211,
1,AL630,Pharmacy Technician,308,
2,AL631,Dental Nursing,311,
3,AL632,Applied Science,297,
4,AL701,Computer Engineering for Network Infrastructure,207,
...,...,...,...,...
385,WD184,Retail Management,190,
386,WD188,Applied Health Care,220,
387,WD206,Electronic Engineering,180,
388,WD207,Mechanical Engineering,172,


<br>

## 2020 Points

http://www.cao.ie/index.php?page=points&p=2020

***


<br>

### Save Original Dataset

***

#### Level 8 Data

In [17]:
# Create a filepath for original data. 
path = 'data/cao2020_' +  nowstr + '.xlsx'

In [18]:
# Save original file to disk.
urlrq.urlretrieve('http://www2.cao.ie/points/CAOPointsCharts2020.xlsx', path) 

('data/cao2020_20211101_223346.xlsx',
 <http.client.HTTPMessage at 0x203d8433580>)

<br>

#### Load Spreadsheet using Pandas

***

In [19]:
# Download and parse the Excel spreadsheet.
# read_excel turns it into a DataFrame in memory, so we are dealing with a pandas DataFrame from here on.
df = pd.read_excel('http://www2.cao.ie/points/CAOPointsCharts2020.xlsx', skiprows = 10)

In [20]:
df

Unnamed: 0,CATEGORY (i.e.ISCED description),COURSE TITLE,COURSE CODE2,R1 POINTS,R1 Random *,R2 POINTS,R2 Random*,EOS,EOS Random *,EOS Mid-point,...,avp,v,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Business and administration,International Business,AC120,209,,,,209,,280,...,,,,,,,,,,
1,Humanities (except languages),Liberal Arts,AC137,252,,,,252,,270,...,,,,,,,,,,
2,Arts,"First Year Art & Design (Common Entry,portfolio)",AD101,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
3,Arts,Graphic Design and Moving Image Design (portfo...,AD102,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
4,Arts,Textile & Surface Design and Jewellery & Objec...,AD103,#+matric,,,,#+matric,,#+matric,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1459,Manufacturing and processing,Manufacturing Engineering,WD208,188,,,,188,,339,...,,,,,,,,,,
1460,Information and Communication Technologies (ICTs),Software Systems Development,WD210,279,,,,279,,337,...,,,,,,,,,,
1461,Information and Communication Technologies (ICTs),Creative Computing,WD211,271,,,,271,,318,...,,,,,,,,,,
1462,Personal services,Recreation and Sport Management,WD212,270,,,,270,,349,...,,,,,,,,,,


In [21]:
df.iloc[753] # Spot check a random row. In pandas, iloc selects by integer position, while loc selects by label.

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Road Transport Technology and Management
COURSE CODE2                                                           LC286
R1 POINTS                                                                264
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      264
EOS Random *                                                             NaN
EOS Mid-point                                                            360
LEVEL                                                                      7
HEI                                         Limerick Institute of Technology
Test/Interview #                                                         NaN

In [22]:
df.iloc[1463]  # Spot check the final row; df.iloc[-1] would work as well.

CATEGORY (i.e.ISCED description)          Engineering and engineering trades
COURSE TITLE                        Mechanical and Manufacturing Engineering
COURSE CODE2                                                           WD230
R1 POINTS                                                                253
R1 Random *                                                              NaN
R2 POINTS                                                                NaN
R2 Random*                                                               NaN
EOS                                                                      253
EOS Random *                                                             NaN
EOS Mid-point                                                            369
LEVEL                                                                      8
HEI                                        Waterford Institute of Technology
Test/Interview #                                                         NaN
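
The difference between `iloc` and `loc` is easier to see on a tiny DataFrame. This is a sketch under assumptions: the three rows are copied from the top of the 2020 table, and using the course code as the index is done here only for illustration (the notebook keeps the default integer index).

```python
import pandas as pd

# Small illustrative frame, indexed by course code (an assumption for this sketch).
sample = pd.DataFrame(
    {'COURSE TITLE': ['International Business',
                      'Liberal Arts',
                      'First Year Art & Design (Common Entry,portfolio)']},
    index=['AC120', 'AC137', 'AD101'],
)

# iloc selects by integer position; -1 means the last row.
first_by_position = sample.iloc[0]
last_by_position = sample.iloc[-1]

# loc selects by index label.
by_label = sample.loc['AC137']

print(first_by_position['COURSE TITLE'])
print(last_by_position['COURSE TITLE'])
print(by_label['COURSE TITLE'])
```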

In [23]:
# Create a filepath for the Pandas data
path = 'data/cao2020_' +  nowstr + '.csv'

In [24]:
# Save the pandas DataFrame to disk. Passing index=False to to_csv would leave out the row index.
df.to_csv(path)
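
The row index written by `to_csv` is what later shows up as an `Unnamed: 0` column when the file is read back. A minimal sketch of the effect, using a two-row frame built from values in the table above and an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Two rows taken from the top of the 2020 table, for illustration only.
sample = pd.DataFrame({'COURSE CODE2': ['AC120', 'AC137'],
                       'R1 POINTS': [209, 252]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = sample.to_csv()

# index=False leaves the row index out.
without_index = sample.to_csv(index=False)

# Reading the first version back gives the extra 'Unnamed: 0' column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
print(without_index.splitlines()[0])
```

If a file has already been saved with the index, `pd.read_csv(..., index_col=0)` is another option for reading it back without the extra column.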

<br>

Don't forget the Level 7/6 points. <br>
Possible PDF table extraction tools: Camelot or tabula-py (which depends on Java).

## 2019 Points

http://www.cao.ie/index.php?page=points&p=2019

***

##### Steps to reproduce

1. Download original PDF file.
2. Open original PDF file in Microsoft Word.
3. Save Microsoft Word's converted PDF in docx format.
4. Re-save Word Document for editing.
5. Delete Headers and Footers.
6. Delete preamble on Page 1.
7. Select all and copy.
8. Paste into Notepad++.
9. Remove HEI name headings and paste onto each course line.
10. Delete blank lines.
11. Replace double tab characters with a single tab character (6 occurrences).
12. Delete tab characters at the end of lines.
13. Change backticks to apostrophes.
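
Steps 10 to 13 above were done by hand, but they could also be sketched in Python. The function name `clean_lines` and the sample text are hypothetical, and this is not how the saved file was actually produced.

```python
def clean_lines(raw_text):
    """Apply steps 10 to 13 to text copied from the converted document."""
    cleaned = []
    for line in raw_text.splitlines():
        # Step 10: delete blank lines.
        if not line.strip():
            continue
        # Step 11: replace double tab characters with a single tab.
        line = line.replace('\t\t', '\t')
        # Step 12: delete tab characters at the end of the line.
        line = line.rstrip('\t')
        # Step 13: change backticks to apostrophes.
        line = line.replace('`', "'")
        cleaned.append(line)
    return '\n'.join(cleaned)


# Hypothetical sample with a double tab, a trailing tab, a blank line and a backtick.
sample = 'AL801\tSoftware Design\t\t304\t328\t\n\nXX001\tMaster`s Example\t200\t250\n'
result = clean_lines(sample)
print(result)
```

Doing the clean-up in code would make the 2019 data easier to reproduce than the manual steps.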

In [25]:
# TODO: consider making a table mapping each HEI to its course-code letters,
# then merging it with these dataframes to link course codes to HEIs.
df2019 = pd.read_csv('data/cao2019_20211031_184506.csv', sep = '\t')

In [26]:
df2019

Unnamed: 0,Course Code,Course,EOS,Mid
0,AL801,Software Design with Virtual Reality and Gaming,304,328.0
1,AL802,Software Design with Cloud Computing,301,306.0
2,AL803,Software Design with Mobile Apps and Connected...,309,337.0
3,AL805,Network Management and Cloud Infrastructure,329,442.0
4,AL810,Quantity Surveying,307,349.0
...,...,...,...,...
925,WD200,Arts (options),221,296.0
926,WD210,Software Systems Development,271,329.0
927,WD211,Creative Computing,275,322.0
928,WD212,Recreation and Sport Management,274,311.0


In [27]:
# TODO: consider making a table mapping each HEI to its course-code letters,
# then merging it with these dataframes to link course codes to HEIs.
df2019L7 = pd.read_csv('data/cao2019_20211101_221930_edited.CSV', sep = '\t')

In [29]:
df2019L7.head()

Unnamed: 0,Course Code,Course,EOS,Mid
0,AL600,Software Design,205,306.0
1,AL601,Computer Engineering,196,272.0
2,AL602,Mechanical Engineering,258,424.0
3,AL604,Civil Engineering,252,360.0
4,AL630,Pharmacy Technician,306,366.0
