# Exam paper scraper
This notebook is an exam paper scraper for NUIG.

Dear god, NUIG. SSL certificates are broken on *www.mis.nuigalway.ie*, and the de facto Python HTTP library, Requests, fails with an `SSLError: [Errno 8] _ssl.c:504: EOF occurred in violation of protocol`. So, right off the bat, we have to resort to another option. I think NUIG does security by annoying the hell out of anyone who tries to interact with their systems.

Let's use the cURL Python bindings.

In [50]:
import pycurl
from bs4 import BeautifulSoup
from urllib import urlencode
from StringIO import StringIO
from pandas import DataFrame

def request_search_page(module):
    """Perform a request to mis.nuigalway.ie and return the HTML of the search page."""
    
    buffer = StringIO()
    c = pycurl.Curl()
    c.setopt(c.URL, "https://www.mis.nuigalway.ie/regexam/paper_index_search_results.asp")
    c.setopt(c.WRITEDATA, buffer)
    c.setopt(c.POSTFIELDS, urlencode({ "module": module }))
    c.perform()
    c.close()
    
    return buffer.getvalue()

def get_papers(module):
    """Perform a request to the examination center and return array of papers"""
    
    html = request_search_page(module)
    
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.table.find_all("tr")
    papers = []
    
    # Skip the table header (the first row)
    for row in rows[1:]:
        
        cells = row.find_all("td")
    
        # Extract the data
        paper = {
            "year": cells[0].text.strip(),
            "module": cells[1].text.strip(),
            "name": cells[2].text.strip(),
            "paper": cells[3].text.strip(),
            "sitting": cells[4].text.strip(),
            "period": cells[5].text.strip()
        }
        
        # Sometimes a paper is unavailable. In that case cells[6] has no "a" element
        # (or the element has no href), so skip the row.
        link_tag = cells[6].a
        if link_tag is None or link_tag.get("href") is None:
            continue
        
        paper["link"] = generate_paper_link(paper)
        
        papers.append(paper)
    
    return papers

def generate_paper_link(paper):
    """Dirty implementation of the NUIG paper storage naming scheme. This saves us following that 301 to get the link to the actual PDF."""
    period = {
        "Semester 1": 5,
        "Autumn": 3,
        "Summer": 2
    }
    
    sitting = {
        "First Sitting": 1,
        "Second Sitting": 2
    }
    
    return "https://www.mis.nuigalway.ie/papers_public/{year}/{module_alpha}/{year}_{module}_{sitting}_{period}.PDF".format(
        year = paper["year"].replace("/", "_"),
        module_alpha = paper["module"][:2],
        module = paper["module"].replace("-", "_"),
        period = period[paper["period"]],
        sitting = sitting[paper["sitting"]]
    )
    
# Let's get an example module, say CT422 ;-)
ct422 = get_papers("CT422")

DataFrame.from_dict(ct422)

| | link | module | name | paper | period | sitting | year |
|---|---|---|---|---|---|---|---|
| 0 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Semester 1 | First Sitting | 2014/2015 |
| 1 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Semester 1 | First Sitting | 2014/2015 |
| 2 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Autumn | Second Sitting | 2014/2015 |
| 3 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Semester 1 | First Sitting | 2013/2014 |
| 4 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Autumn | Second Sitting | 2013/2014 |
| 5 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Summer | First Sitting | 2012/2013 |
| 6 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Autumn | Second Sitting | 2012/2013 |
| 7 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Summer | First Sitting | 2011/2012 |
| 8 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Summer | First Sitting | 2010/2011 |
| 9 | https://www.mis.nuigalway.ie/papers_public/201... | CT422-1 | Modern Information Management | Paper 1 - Written | Autumn | Second Sitting | 2010/2011 |

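The `link` column is cut off in the table, so here is a minimal standalone sketch of how one row maps to a full URL. The helper name `build_paper_link` and the `sample` row are illustrative only (the row mirrors row 0 of the table); the period and sitting codes are copied from `generate_paper_link` and are not guaranteed to cover every label the site uses.

```python
# Codes copied from generate_paper_link; other labels may exist on the site.
PERIOD_CODES = {"Semester 1": 5, "Autumn": 3, "Summer": 2}
SITTING_CODES = {"First Sitting": 1, "Second Sitting": 2}


def build_paper_link(paper):
    """Hypothetical standalone copy of generate_paper_link, for illustration."""
    return (
        "https://www.mis.nuigalway.ie/papers_public/"
        "{year}/{module_alpha}/{year}_{module}_{sitting}_{period}.PDF"
    ).format(
        year=paper["year"].replace("/", "_"),      # "2014/2015" -> "2014_2015"
        module_alpha=paper["module"][:2],          # "CT422-1"   -> "CT"
        module=paper["module"].replace("-", "_"),  # "CT422-1"   -> "CT422_1"
        sitting=SITTING_CODES[paper["sitting"]],
        period=PERIOD_CODES[paper["period"]],
    )


# Sample row, mirroring row 0 of the table.
sample = {
    "year": "2014/2015",
    "module": "CT422-1",
    "sitting": "First Sitting",
    "period": "Semester 1",
}

link = build_paper_link(sample)
# link == "https://www.mis.nuigalway.ie/papers_public/2014_2015/CT/2014_2015_CT422_1_1_5.PDF"
```

A period or sitting label missing from the two dictionaries raises a `KeyError`, so any new label has to be added by hand.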

Right, now we have a readily accessible list of exam paper PDFs to download. How about we download these to a good location?
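
Two small preparations may help first. The listing above repeats the 2014/2015 Semester 1 paper (rows 0 and 1), so the same file would be fetched twice, and the target directory has to exist before a file can be opened inside it. The sketch below handles both. The helper names `unique_papers` and `ensure_directory` and the shortened sample links are hypothetical, and a temporary directory stands in for the real download location.

```python
import os
import tempfile


def unique_papers(papers):
    """Return papers with repeated links removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for paper in papers:
        if paper["link"] in seen:
            continue
        seen.add(paper["link"])
        unique.append(paper)
    return unique


def ensure_directory(path):
    """Create path (and any missing parents) if it does not exist yet."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


# Sample data with one repeated entry; the links are placeholders.
sample_papers = [
    {"module": "CT422-1", "link": "2014_2015_CT422_1_1_5.PDF"},
    {"module": "CT422-1", "link": "2014_2015_CT422_1_1_5.PDF"},
    {"module": "CT422-1", "link": "2014_2015_CT422_1_2_3.PDF"},
]

deduplicated = unique_papers(sample_papers)  # two entries remain
target = ensure_directory(os.path.join(tempfile.mkdtemp(), "ct422"))
```

With the real data, this would mean passing `unique_papers(ct422)` to the download loop and calling `ensure_directory` on the download path beforehand.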

In [53]:
from os.path import basename, join

def download_paper(paper, path):
    """Download the PDF paper and put in directory `path`."""
    filename = join(path, basename(paper["link"]).lower())
    
    print "Downloading \"%s %s\" to %s" % (paper["module"], paper["paper"], filename)
    with open(filename, "wb") as output:
        c = pycurl.Curl()
        c.setopt(c.URL, paper["link"])
        c.setopt(c.WRITEDATA, output)
        c.perform()
        c.close()

for paper in ct422:
    download_paper(paper, "/Users/adrian/Downloads/ct422")

Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2014_2015_ct422_1_1_5.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2014_2015_ct422_1_1_5.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2014_2015_ct422_1_2_3.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2013_2014_ct422_1_1_5.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2013_2014_ct422_1_2_3.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2012_2013_ct422_1_1_2.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2012_2013_ct422_1_2_3.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2011_2012_ct422_1_1_2.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2010_2011_ct422_1_1_2.pdf
Downloading "CT422-1 Paper 1 - Written" to /Users/adrian/Downloads/ct422/2010_2011_ct422_1_2_3.pdf
Downloadin