# Webscraping

Webscraping is the process of extracting information from a website programmatically. I'm getting my data from ftp://ftp.zois.co.uk/pub/jcp, which holds a list of links to CSV files, one per day. Currently I only have the data from the 30th of July 2015, but to get better statistics and some insight into how my data evolves I need data from other dates too. Rather than clicking each link to download it manually, I can write a script to do it for me. I can also schedule the script to run every day, so that the most up-to-date information gets populated into my database without any work from me.

In [11]:
import requests

In [14]:
response = requests.get("http://ftp.zois.co.uk/pub/jcp")

ConnectionError: ('Connection aborted.', error(60, 'Operation timed out'))

Their server is down, which could be a problem... That's another reason to consider whether I'd be better off developing my own web scraper to scrape the Universal Jobmatch website directly. It's a bit more work on my end, but I'll have a cup of tea, and if their site still isn't working by then I might have a go at doing that myself.
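The wait-and-try-again step could itself be scripted. Below is a minimal sketch, not something I ran against the real server: `fetch_with_retry` and its parameters are names made up for illustration, and `fetch` stands for any zero-argument callable such as `lambda: requests.get(url, timeout=10)`.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay_seconds=60.0, retry_on=(Exception,)):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable (hypothetical helper, e.g.
    lambda: requests.get(url, timeout=10)). Returns None if every
    attempt raised one of the exception types listed in retry_on.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt < attempts:
                time.sleep(delay_seconds)  # give the server time to come back
    return None
```

With `requests`, `retry_on` would probably be `(requests.exceptions.ConnectionError, requests.exceptions.Timeout)`, so that programming errors are not silently retried.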

Okay, the server is still down so I'm going to push forward with my own web scraper. As a disclaimer, this will be a first for me, so bear that in mind if I do something stupid.

To start with, I logged on to the Jobmatch website and ran a simple search with "Data Science" as the keywords. I then took the URL of the results page that loaded, and now I'm trying to access that information programmatically.

In [17]:
response = requests.get("https://jobsearch.direct.gov.uk/JobSearch/PowerSearch.aspx?redirect=http%3A%2F%2Fjobsearch.direct.gov.uk%2Fhome.aspx&pp=25&pg=1&q=Data%20Science&sort=rv.dt.di&re=134")



<img src="jobmatch.png">

In [18]:
print response
print dir(response)
print type(response)

<Response [200]>
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
<class 'requests.models.Response'>


In [27]:
print response.content




<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html id="MasterPage1_htmlEl" xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head id="MasterPage1_htmlHead"><title>
	Universal Jobmatch jobs and skills search - Search results for 'Data Science'
</title><link rel="Stylesheet" type="text/css" href="/Channels/UKMGSCORE/Styles/ukdwp.css" media="screen" /><link rel="alternate stylesheet" type="text/css" href="/Channels/UKMGSCORE/Styles/normal.css" title="Normal" /><link rel="alternate stylesheet" type="text/css" href="/Channels/UKMGSCORE/Styles/larger.css" title="Larger" /><link rel="alternate stylesheet" type="text/css" href="/Channels/UKMGSCORE/Styles/largest.css" title="Largest" /><link rel="shortcut icon" href="../Channels/UKMGSCORE/JobSearch/favicon.ico?v=2" type="image/x-icon" />

    <script type="text/javascript" src="/JavaScripts/jquery-1.7.1.min.js"></scrip

Well, that's a lot of text; how am I going to find what I want? After a bit of googling I found http://docs.python-guide.org/en/latest/scenarios/scrape/, which explains that each element in a web page has an **xpath** that you can use to access that element. If you're using Chrome you can right-click an element and go to Inspect Element; then, in the sidebar that opens, right-click the corresponding HTML element and there should be an option to copy its xpath.

<img src="jobmatch_xpath.png">
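To get a feel for what such a path means, here is a small sketch on a made-up page (the markup is invented for illustration and is not the real Jobmatch HTML). It uses the standard library's `xml.etree.ElementTree`, which understands a limited subset of XPath, purely so the snippet runs with nothing installed. Note that ElementTree wants the path to start with `.//` rather than the `//` that Chrome copies.

```python
import xml.etree.ElementTree as ET

# Made-up page for illustration; the real Jobmatch markup is far larger.
PAGE = """
<html>
  <body>
    <form id="aspnetForm">
      <div>
        <div>search box</div>
        <div>
          <table>
            <tr><th>Date</th><th>Job title</th></tr>
            <tr><td>18/08/2015</td><td>Data Scientist</td></tr>
          </table>
        </div>
      </div>
    </form>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Read right to left: a <table> inside the 2nd <div> inside a <div>
# inside whichever element has id="aspnetForm".
table = root.find(".//*[@id='aspnetForm']/div/div[2]/table")

first_data_row = table.findall("tr")[1]  # index 0 is the header row
cells = [td.text for td in first_data_row.findall("td")]
print(cells)
```

The positional steps such as `div[2]` are what make copied paths brittle: if the site adds one more `<div>` above the table, the path silently points somewhere else.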

To access the table shown above I'm using a package called `lxml`. To install it I used the Python package manager `pip`, via the command `pip install lxml`. I don't think I've talked about `pip` before, but in case I haven't, there's plenty of information out there on the internet.

In [28]:
from lxml import html

In [29]:
tree = html.fromstring(response.content)

In [30]:
tree

<Element html at 0x10c6a7d60>

In [34]:
table = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[7]/table')[0]

In [35]:
table

<Element table at 0x10c6a7e10>

In [39]:
rows = table.findall("tr")
print rows

[<Element tr at 0x10f497158>, <Element tr at 0x10f4971b0>, <Element tr at 0x10f497208>, <Element tr at 0x10f497260>, <Element tr at 0x10f4972b8>, <Element tr at 0x10f497310>, <Element tr at 0x10f497368>, <Element tr at 0x10f4973c0>, <Element tr at 0x10f497418>, <Element tr at 0x10f497470>, <Element tr at 0x10f4974c8>, <Element tr at 0x10f497520>, <Element tr at 0x10f497578>, <Element tr at 0x10f4975d0>, <Element tr at 0x10f497628>, <Element tr at 0x10f497680>, <Element tr at 0x10f4976d8>, <Element tr at 0x10f497730>, <Element tr at 0x10f497788>, <Element tr at 0x10f4977e0>, <Element tr at 0x10f497838>, <Element tr at 0x10f497890>, <Element tr at 0x10f4978e8>, <Element tr at 0x10f497940>, <Element tr at 0x10f497998>, <Element tr at 0x10f4979f0>]


So far I've drilled down and found each row in the table. Note that the first row is the table header, so I'll skip it when searching for the jobs. The first column is the date, the second is a blank column (for spacing, I think) and the third is where the link to the job is stored.

In [58]:
for element in rows[1:]:
    cell = element.findall("td")[2]
    link = cell.find("a")
    print link.attrib['href']
    
        

http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18403744&JobTitle=Senior%20Delivery%20Manager%20%28Data%20Science%20Hub%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=134&AVSDM=2015-08-18T04%3a18%3a00-05%3a00
http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18395352&JobTitle=Senior%20Data%20Science%20Consultant%20with%20Leading%20Global%20Consultancy&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=134&AVSDM=2015-08-18T12%3a52%3a00-05%3a00
http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18402745&JobTitle=Head%20of%20Data%20Science%20-%20Investment%20banking&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=134&AVSDM=2015-08-18T04%3a16%3a00-05%3a00
http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18107641&JobTitle=Graduate%20Software%20Engineer%20%28Data%20Science%2c%20Python%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fh

Great, looks like I have the links to the job details pages; I'll go to each of those pages and extract the data from them. There is a small issue at the moment, though: I'm only getting results from the first page, so I'll have to find a way to iterate through all of them too.
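The row-walking loop can also be wrapped in a function that tolerates odd rows. This is a sketch under assumptions: `extract_job_links` is a made-up name, and the sample table is invented. The function only calls `findall`, `find` and `get`, which `lxml` elements share with the standard library's ElementTree, so ElementTree is used here just to build test data.

```python
import xml.etree.ElementTree as ET


def extract_job_links(rows, link_column=2):
    """Return the href of the link in the given column of each data row.

    rows is a list of <tr> elements with the header row first. Rows that
    are too short, or have no <a> in that column, are skipped.
    """
    links = []
    for row in rows[1:]:  # rows[0] is the table header
        cells = row.findall("td")
        if len(cells) <= link_column:
            continue
        anchor = cells[link_column].find("a")
        if anchor is not None and anchor.get("href"):
            links.append(anchor.get("href"))
    return links


# Hypothetical table for illustration only.
SAMPLE_TABLE = ET.fromstring("""
<table>
  <tr><th>Date</th><th></th><th>Job title</th></tr>
  <tr><td>18/08/2015</td><td></td><td><a href="http://example.com/job/1">Data Scientist</a></td></tr>
  <tr><td>17/08/2015</td><td></td><td>No link here</td></tr>
  <tr><td>16/08/2015</td><td></td><td><a href="http://example.com/job/2">Analyst</a></td></tr>
</table>
""".strip())

sample_links = extract_job_links(SAMPLE_TABLE.findall("tr"))
```

Skipping a row quietly is a judgment call; while exploring it may be more useful to log the rows that were dropped.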

How many pages are there? I noticed that at the top of the table there's a summary that says "Page 1 of 40", so I'll try to use that. I also noticed that the `Element` objects that make up the parsed HTML document have a `cssselect` method, so I might try that to select the element. To use it I'll also have to do a `pip install cssselect`.

In [77]:
page_summary = tree.cssselect("div.pagesSummary span")

In [78]:
page_summary


[<Element span at 0x10ea00ba8>, <Element span at 0x10ea00d08>]

There are actually two page summaries, one at the top of the table and one at the bottom. I'll just use the first one.

In [79]:
page_summary = page_summary[0].text

In [80]:
n_pages = int(page_summary.split(' ')[-1])

In [81]:
n_pages

40
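Splitting on spaces works as long as the summary ends with the page count. A slightly more defensive sketch uses a regular expression, assuming the text has the "Page 1 of 40" shape quoted above (`parse_page_summary` is a name made up for this example):

```python
import re

# Assumes the summary text looks like "Page 1 of 40".
PAGE_SUMMARY = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)")


def parse_page_summary(text):
    """Return (current_page, total_pages) parsed from the summary text."""
    match = PAGE_SUMMARY.search(text or "")
    if match is None:
        raise ValueError("no page summary found in %r" % (text,))
    return int(match.group(1)), int(match.group(2))
```

Calling `parse_page_summary("Page 1 of 40")` gives `(1, 40)`, and unexpected text raises a `ValueError` rather than whatever `int()` happens to complain about.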

Whoop whoop! Next, I'll use this to iterate through the pages. I noticed earlier that as I browsed from page to page the URL changed in a systematic way: there's a `pg` parameter in the query string that controls which page you're looking at.

https://jobsearch.direct.gov.uk/JobSearch/PowerSearch.aspx?redirect=http%3A%2F%2Fjobsearch.direct.gov.uk%2Fhome.aspx&pp=25&pg=1&q=Data%20Science&sort=rv.dt.di&re=3

What I'll do is create a URL template then substitute page numbers into it.

In [94]:
URL_TEMPLATE = "https://jobsearch.direct.gov.uk/JobSearch/PowerSearch.aspx?redirect=http%3A%2F%2Fjobsearch.direct.gov.uk%2Fhome.aspx&pp=25&pg={page_number}&q=Data%20Science&sort=rv.dt.di&re=3"
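An alternative to a hand-written template is to build the query string from its parts, which also makes it easy to change the keywords later. This is only a sketch: `build_search_url` is a made-up helper, `pp` is presumably the number of results per page (an assumption on my part), and the `sort` and `re` values are copied verbatim from the URL above without knowing what they mean.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

SEARCH_PAGE = "https://jobsearch.direct.gov.uk/JobSearch/PowerSearch.aspx"


def build_search_url(keywords, page_number):
    """Build a search results URL for the given keywords and page."""
    params = [
        ("redirect", "http://jobsearch.direct.gov.uk/home.aspx"),
        ("pp", 25),            # assumed: results per page
        ("pg", page_number),   # page to fetch
        ("q", keywords),       # search keywords
        ("sort", "rv.dt.di"),  # copied as-is from the browser URL
        ("re", 3),             # copied as-is from the browser URL
    ]
    return SEARCH_PAGE + "?" + urlencode(params)
```

`urlencode` writes the space in the keywords as `+` rather than `%20`; both decode to the same query value.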

In [117]:
previous_table = None
first_table = None
for i in range(1, n_pages + 1):
    response = requests.get(URL_TEMPLATE.format(page_number=i))
    table = html.fromstring(response.content).xpath('//*[@id="aspnetForm"]/div/div[2]/div[7]/table')[0]
    if previous_table is not None:
        # Each page should differ from the one before it...
        assert previous_table.text_content() != table.text_content()
        # ...and from the first page, which is what an out-of-range page number returns.
        assert table.text_content() != first_table.text_content()
    else:
        first_table = table
    previous_table = table



The loop above also includes a little check to make sure I'm not getting the same page back twice in a row, and that the content I'm getting back isn't the same as the first page. After a bit of playing around I found that if you ask for a page number that doesn't exist, you get the first page back.

In [118]:
response1 = requests.get(URL_TEMPLATE.format(page_number=1))
response2 = requests.get(URL_TEMPLATE.format(page_number=3242341341324))
tree1 = html.fromstring(response1.content)
tree2 = html.fromstring(response2.content)
table1 = tree1.xpath('//*[@id="aspnetForm"]/div/div[2]/div[7]/table')[0]
table2 = tree2.xpath('//*[@id="aspnetForm"]/div/div[2]/div[7]/table')[0]
print table1.text_content() == table2.text_content() # They're the same table!




True
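That behaviour could double as a stopping rule when the page count isn't known in advance. A rough sketch, where `iter_pages` and `fetch_page_text` are made-up names and `fetch_page_text(n)` is assumed to return the text of the results table on page `n`:

```python
def iter_pages(fetch_page_text, max_pages=1000):
    """Yield (page_number, text) until the site starts repeating page 1."""
    first_text = None
    for page_number in range(1, max_pages + 1):
        text = fetch_page_text(page_number)
        if first_text is None:
            first_text = text
        elif text == first_text:
            break  # past the last page: the site served page 1 again
        yield page_number, text
```

The `max_pages` cap is there because the comparison could miss if the listings change between requests.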


Well, on the surface that appears to be working. Again, I'm still in the exploratory phase, so I'll want to come back to this at some point and set up some real tests. Next, let's have a look at a job description page.

In [120]:
response = requests.get('https://jobsearch.direct.gov.uk/GetJob.aspx?JobID=18403744&JobTitle=Senior+Delivery+Manager+(Data+Science+Hub)&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=134&AVSDM=2015-08-18T04%3a18%3a00-05%3a00')



In [142]:
import dateutil.parser

tree = html.fromstring(response.content)
title = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[2]')[0].text
description = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[1]')[0].text
company = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[1]')[0].text
apply_ = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[2]/a')[0].attrib['href']
jobid = int(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[1]')[0].text)
added = dateutil.parser.parse(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[2]')[0].text)
location = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[3]')[0].text
industry = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[4]')[0].text
job_type = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[5]')[0].text
salary = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[6]')[0].text
hours_of_work = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[7]')[0].text # not in the original data model
job_reference_code = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[8]')[0].text # not in original data model
# In the current data model but not shown here
# career
# contact
# education
# noted
# reference - could this be the job reference code? Not always a code though
print 'title:', title
print 'description:', description
print 'company:', company
print 'apply_:', apply_
print 'jobid:', jobid
print 'added:', added
print 'location:', location
print 'industry:', industry
print 'job_type:', job_type
print 'salary:', salary
print 'hours_of_work:', hours_of_work
print 'job_reference_code:', job_reference_code

title: Senior Delivery Manager (Data Science Hub)
description: A Senior Delivery Manager (Data Science) is required for our client, a high profile government department based in Sheffield. For an initial 6 month contract.Job title - Data Science Hub Delivery Lead Our client is developing a data science capability. Initial requirement for 6 months Data Science Hub Delivery Lead will be responsible for setting up, leadership and management of a Data Science Hub. The Hub will comprise of data scientists, business analysts and will deliver a series of projects to be agreed with the Head of Data Science, working with senior stakeholders. The Data Science Hub Lead will manage demand, quality and consistency. and will deliver services that meet the Government's Digital by Default Service Standard. The key purpose of this role is to support the Head of Data Science in the development of the Data Science capability.Knowledge, Skills & Capability Requirements: * Experience of working in a highly

Looks good, but this is very specific code. It probably won't work with all job description pages, but I guess the only way to find out is to try. Let's try it on all the jobs and see where it fails.

In [145]:
URL_TEMPLATE = "https://jobsearch.direct.gov.uk/JobSearch/PowerSearch.aspx?redirect=http%3A%2F%2Fjobsearch.direct.gov.uk%2Fhome.aspx&pp=25&pg={page_number}&q=Data%20Science&sort=rv.dt.di&re=3"
response = requests.get(URL_TEMPLATE.format(page_number=1))
tree = html.fromstring(response.content)
table = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[7]/table')[0]
job_links = table.findall("tr")[1:] # remove headers
job_links = [row.findall("td")[2].find("a").attrib['href'] for row in job_links]
    



In [157]:
failures = []
for job_link in job_links:
    response = requests.get(job_link)
    tree = html.fromstring(response.content)
    try:
        title = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[2]')[0].text
    except Exception as e:
        print 'failure:', job_link
        failures.append((job_link, e),)
        




failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18347186&JobTitle=Head%20of%20Data%20Science&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T10%3a35%3a00-05%3a00




In [158]:
print failures

[('http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18347186&JobTitle=Head%20of%20Data%20Science&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T10%3a35%3a00-05%3a00', IndexError('list index out of range',))]


Looking at that particular link, it seems it failed because the listing was removed. I need to keep in mind that stuff like that will happen. Let's try it with a few more details.

In [159]:
failures = []
for job_link in job_links:
    response = requests.get(job_link)
    tree = html.fromstring(response.content)
    try:
        title = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[2]')[0].text
        description = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[1]')[0].text
        company = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[1]')[0].text
        apply_ = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[2]/a')[0].attrib['href']
        jobid = int(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[1]')[0].text)
        added = dateutil.parser.parse(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[2]')[0].text)
        location = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[3]')[0].text
        industry = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[4]')[0].text
        job_type = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[5]')[0].text
        salary = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[6]')[0].text
        hours_of_work = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[7]')[0].text # not in the original data model
        job_reference_code = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[8]')[0].text # not in original data model
    except Exception as e:
        print 'failure:', job_link
        failures.append((job_link, e),)
        





failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18395352&JobTitle=Senior%20Data%20Science%20Consultant%20with%20Leading%20Global%20Consultancy&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-18T12%3a52%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18107641&JobTitle=Graduate%20Software%20Engineer%20%28Data%20Science%2c%20Python%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-17T07%3a11%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18333271&JobTitle=Lead%20Front%20End%20Developer%20-%20Top%20Data%20Science%20Company&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T03%3a35%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18331959&JobTitle=Machine%20Learning%20Data%20Scientist%20-%20Data%20Science%2c%20Machine%20Learning%2c%20Java%2c%20Python&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T02%3a33%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18347186&JobTitle=Head%20of%20Data%20Science&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T10%3a35%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18367534&JobTitle=Graduate%20Software%20Engineer%20%28Data%20Science%2c%20Python%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-17T07%3a32%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=17661653&JobTitle=Graduate%20Software%20Engineer%20%28Data%20Science%2c%20Python%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-17T07%3a08%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18329345&JobTitle=Senior%20Analyst%20%28Java%20and%20Data%20Science%29%20%3a%20Warrington%20%3a%20Govt%20Body&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-17T09%3a30%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=17855287&JobTitle=Software%20Developer%20%28C%23%2c%20.Net%2c%20Data%20Science%2c%20Algorithms%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-18T01%3a09%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18323980&JobTitle=Senior%20Analyst%20%28Java%20%26%20Data%20Science%29%20-%20Warrington%20-%20Govt%20Body&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-15T07%3a10%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18230335&JobTitle=Data%20Science%20Manager&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-13T01%3a04%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18224747&JobTitle=Head%20of%20Data%20Science&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-13T12%3a05%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18214006&JobTitle=Senior%20Modelling%20Analyst%2f%20Manager%20%28Data%20Science%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-12T07%3a05%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=17977804&JobTitle=Analytics%20Manager%20%e2%80%93%20Data%20Science%2fInsight&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-12T06%3a21%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18138400&JobTitle=Data%20Science%20Manager&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-11T12%3a52%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18160416&JobTitle=Lecturer%20%28Data%20Science%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-11T02%3a47%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=17744367&JobTitle=Data%20Science%20Manager&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-12T04%3a52%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18027373&JobTitle=https%3a%2f%2fcareers.kew.org%2fvacancy%2fresearch-leader-data-science-curation-227167.html&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-07T08%3a30%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18007776&JobTitle=Data%20Science%20Analyst&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-06T11%3a56%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18007777&JobTitle=Data%20Science%20Analyst&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-06T11%3a56%3a00-05%3a00




In [160]:
len(failures)

20

In [161]:
for link, error in failures:
    print link
    print error
    print

http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18395352&JobTitle=Senior%20Data%20Science%20Consultant%20with%20Leading%20Global%20Consultancy&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-18T12%3a52%3a00-05%3a00
list index out of range

http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18107641&JobTitle=Graduate%20Software%20Engineer%20%28Data%20Science%2c%20Python%29&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-17T07%3a11%3a00-05%3a00
list index out of range

http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18333271&JobTitle=Lead%20Front%20End%20Developer%20-%20Top%20Data%20Science%20Company&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T03%3a35%3a00-05%3a00
list index out of range

http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18331959&JobTitle=Machine%20Learning%20Da
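Rather than reading through every failure, the error types can be tallied. A small sketch with invented sample data (`summarise_failures` is a made-up name; `failures` has the same `(link, error)` shape as in the cells above):

```python
from collections import Counter


def summarise_failures(failures):
    """Count failures by exception type name."""
    return Counter(type(error).__name__ for _link, error in failures)


# Invented sample data in the same (link, error) shape as above.
sample_failures = [
    ("http://example.com/job/1", IndexError("list index out of range")),
    ("http://example.com/job/2", IndexError("list index out of range")),
    ("http://example.com/job/3", ValueError("invalid literal for int()")),
]
summary = summarise_failures(sample_failures)
```

If nearly everything lands in one bucket, as seems to be the case here with `IndexError`, that points at one missing element rather than many unrelated problems.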

I'm going to treat the items that I think are non-essential, and which some jobs might not display, as optional, so that a missing one does not result in an error.

In [165]:
failures = []
for job_link in job_links:
    response = requests.get(job_link)
    tree = html.fromstring(response.content)
    try:
        jobid = int(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[1]')[0].text)
        title = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[2]')[0].text
        description = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[1]')[0].text
        try:
            company = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[1]')[0].text
            apply_ = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[2]/a')[0].attrib['href']
            added = dateutil.parser.parse(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[2]')[0].text)
            location = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[3]')[0].text
            industry = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[4]')[0].text
            job_type = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[5]')[0].text
            salary = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[6]')[0].text
            hours_of_work = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[7]')[0].text # not in the original data model
            job_reference_code = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[8]')[0].text # not in original data model
        except IndexError:
            pass
    except Exception as e:
        print 'failure:', job_link
        failures.append((job_link, e),)
        



failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18347186&JobTitle=Head%20of%20Data%20Science&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T10%3a35%3a00-05%3a00




In [164]:
len(failures)

1

Now to try for all the pages.

In [168]:
failures = []
jobs = []

for page_number in range(1, n_pages + 1):
    response = requests.get(URL_TEMPLATE.format(page_number=page_number))
    tree = html.fromstring(response.content)
    table = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[7]/table')[0]
    job_links = table.findall("tr")[1:] # remove headers
    job_links = [row.findall("td")[2].find("a").attrib['href'] for row in job_links]
    for job_link in job_links:
        response = requests.get(job_link)
        tree = html.fromstring(response.content)
        try:
            job = {}
            job['jobid'] = int(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[1]')[0].text)
            job['title'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[2]')[0].text
            job['description'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[1]')[0].text
            try:
                job['company'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[1]')[0].text
            except IndexError:
                pass
            try:
                job['apply_'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[2]/a')[0].attrib['href']
            except IndexError:
                pass
            try:
                job['added'] = dateutil.parser.parse(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[2]')[0].text)
            except IndexError:
                pass
            try:
                job['location'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[3]')[0].text
            except IndexError:
                pass
            try:
                job['industry'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[4]')[0].text
            except IndexError:
                pass
            try:
                job['job_type'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[5]')[0].text
            except IndexError:
                pass
            try:
                job['salary'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[6]')[0].text
            except IndexError:
                pass
            try:
                job['hours_of_work'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[7]')[0].text # not in the original data model
            except IndexError:
                pass
            try:
                job['job_reference_code'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[8]')[0].text # not in original data model
            except IndexError:
                pass
            jobs.append(job)
        except Exception as e:
            print 'failure:', job_link
            failures.append((job_link, e),)
    



failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18347186&JobTitle=Head%20of%20Data%20Science&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T10%3a35%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18416412&JobTitle=Lecturer%2fSenior%20Lecturer%20in%20Psychology&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=7&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-18T09%3a32%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18347314&JobTitle=Network%20Engineer&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=22&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T11%3a13%3a00-05%3a00
failure: http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=17637521&JobTitle=Trainee%20Project%20Set%20Up%20Coordinator&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=40&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-13T02%3a58%3a00-05%3a00


In [169]:
len(jobs)


996

In [170]:
len(failures)

4

In [171]:
for link, error in failures:
    print link, error

http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18347186&JobTitle=Head%20of%20Data%20Science&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=1&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T10%3a35%3a00-05%3a00 list index out of range
http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18416412&JobTitle=Lecturer%2fSenior%20Lecturer%20in%20Psychology&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=7&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-18T09%3a32%3a00-05%3a00 list index out of range
http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=18347314&JobTitle=Network%20Engineer&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=22&q=Data+Science&sort=rv.dt.di&re=3&AVSDM=2015-08-16T11%3a13%3a00-05%3a00 list index out of range
http://jobsearch.direct.gov.uk/GetJob.ashx?JobID=17637521&JobTitle=Trainee%20Project%20Set%20Up%20Coordinator&redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&pg=40&q=Data+Science&sort=rv.dt.di&r
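The ladder of `try`/`except` blocks could probably be collapsed into a table of required and optional fields. The sketch below is one possible shape, not tested against the real site: the paths are shortened, hypothetical stand-ins for the long XPaths above, the sample page is invented, and `first_text` and `parse_job` are made-up names. It uses `findall`, which both `lxml` and the standard library's ElementTree provide; with `lxml`, `tree.xpath(path)` could be used instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened paths; the real ones are the long XPaths above.
REQUIRED_FIELDS = {
    "jobid": ".//dl/dd[1]",
    "title": ".//h2[2]",
}
OPTIONAL_FIELDS = {
    "company": ".//h2[1]",
    "location": ".//dl/dd[3]",
    "salary": ".//dl/dd[6]",
}


def first_text(tree, path):
    """Return the text of the first element matching path, or None."""
    matches = tree.findall(path)
    return matches[0].text if matches else None


def parse_job(tree):
    """Build a job dict; raise LookupError if a required field is missing."""
    job = {}
    for name, path in REQUIRED_FIELDS.items():
        value = first_text(tree, path)
        if value is None:
            raise LookupError("required field missing: " + name)
        job[name] = value
    for name, path in OPTIONAL_FIELDS.items():
        job[name] = first_text(tree, path)  # None when the page omits it
    job["jobid"] = int(job["jobid"])
    return job


# Invented sample page for illustration only.
SAMPLE_JOB = ET.fromstring("""
<html><body>
  <div><h2>Acme Ltd</h2><h2>Data Scientist</h2></div>
  <dl><dd>18403744</dd><dd>18/08/2015</dd><dd>Sheffield</dd></dl>
</body></html>
""".strip())

sample_job = parse_job(SAMPLE_JOB)
```

One difference from the loop above is that a missing optional field is stored as `None` instead of being left out of the dict, which keeps every job record the same shape.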

I think I'll call it a day here. Just a few things to note:

- The website states there are 4120 search results but I can only see 1000 of them (40 pages of 25). I'm guessing this is a limit they place on the number of returned results.
- The results are a bit mixed. I think the keyword "science" has brought in jobs that aren't particularly relevant, e.g. a lecturing position in Psychology.
- I need to think about how broad I want the search to be. I need to strike a balance between having data that *might* be useful and having too much irrelevant data.
- If I am to set up an automated scraper that scrapes daily then there is room for some optimisation: after the initial scrape that collects all the backlog data, I will only need to scrape data posted in the last day.
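For the last point, one simple way to avoid storing the same listing twice is to filter each scrape against the job IDs already in the database. A minimal sketch, assuming each job is a dict with a `jobid` key as in the loop above (`new_jobs_only` and the sample data are made up):

```python
def new_jobs_only(scraped_jobs, known_jobids):
    """Return the jobs whose jobid has not been stored yet."""
    known = set(known_jobids)
    fresh = []
    for job in scraped_jobs:
        if job["jobid"] not in known:
            fresh.append(job)
            known.add(job["jobid"])  # also drops duplicates within one scrape
    return fresh


# Invented sample data for illustration.
already_stored = [18403744, 18395352]
todays_scrape = [
    {"jobid": 18403744, "title": "Senior Delivery Manager (Data Science Hub)"},
    {"jobid": 18500001, "title": "Data Analyst"},
    {"jobid": 18500001, "title": "Data Analyst"},
]
fresh_jobs = new_jobs_only(todays_scrape, already_stored)
```

This still fetches every page. A daily run could possibly stop paging once a whole page contains only known IDs, but that depends on the results being ordered newest first, which I haven't verified.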