# Webscraping 2

It's been about a month since I started this project, and I've only really had three days to work on it. Because work on it is so intermittent, I decided that the next task should be to automate the webscraping, so that when I do come back to the project I will have some more data. The website ([ftp://ftp.zois.co.uk/pub/jcp](ftp://ftp.zois.co.uk/pub/jcp)) where I originally got my data seems to be down for good, so I'll have to rely on my own webscraper.

First, a prototype of what the script will look like.

In [10]:
'''
    Check when the last scraping was done
    Check the current date and time
    Depending on when the last scrape was, 
    scrape the dates needed to update to now
    The job match website allows you to search all, today, yesterday, 3, 7, 14, 30 days ago.
    The url string has a tm parameter, tm=-1 for all, tm=0 for today, tm=1 for yesterday,
    tm=3 for 3 days ago and so on
    On a side note: weirdly enough if you reload the page continually the number of results change continually
    e.g. if you reload the URL: https://jobsearch.direct.gov.uk/JobSearch/PowerSearch.aspx?redirect=http%3a%2f%2fjobsearch.direct.gov.uk%2fhome.aspx&pp=25&q=Data%20Science&sort=rv.dt.di&re=134&tm=3
    the number of results change
    If the last scraping was done on a different day from today
    Then scrape everything between that date and now, e.g. if the last scrape was two days ago,
    then scrape everything using the 3 days ago filter.
    If the last scrape was yesterday, scrape everything using the "yesterday" filter
    
    How to keep track of when the last scrape was?
    Need some form of permanent storage,
    candidates are
    - using a file, e.g. a text file which stores a date
    - use a database table
    
    I think I'll go with a table, that way I can store some metadata like how many jobs were added per scrape.
    Each row in the table will correspond to a scraping session,
    Columns in the table might include stuff like, date and time of scrape, number of items imported,
    that's all I can actually think of right now
    
    Once we know when the last scrape was scrape all the data from then until now
    Have the script sleep for a while depending on how often you want to scrape
    Then scrape repeatedly
'''
pass
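
The "one row per scraping session" idea from the notes above can be sketched with nothing more than the standard library. This is only an illustration under assumptions: it uses `sqlite3` rather than the ORM-style `ScrapingSession` model used later in this post, and the table name, column names and the helper functions `open_session_store`, `record_session` and `last_session` are all hypothetical.

```python
import datetime
import sqlite3

# Timestamps are stored as text in this fixed format (an arbitrary choice for the sketch).
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def open_session_store(path=":memory:"):
    """Open the database and make sure the (hypothetical) session table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraping_session ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " scraped_at TEXT NOT NULL,"
        " n_imported INTEGER NOT NULL)"
    )
    return conn


def record_session(conn, scraped_at, n_imported):
    """Add one row describing a finished scraping session."""
    conn.execute(
        "INSERT INTO scraping_session (scraped_at, n_imported) VALUES (?, ?)",
        (scraped_at.strftime(TIMESTAMP_FORMAT), n_imported),
    )
    conn.commit()


def last_session(conn):
    """Return (scraped_at, n_imported) for the newest session, or None if there is none."""
    row = conn.execute(
        "SELECT scraped_at, n_imported FROM scraping_session"
        " ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    return datetime.datetime.strptime(row[0], TIMESTAMP_FORMAT), row[1]
```

Calling `last_session` on an empty table returns `None`, which is the "never scraped before" case the script has to handle by scraping everything.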

Now attempt to write it out in code.

In [None]:
import datetime

from data_science_jobs.models import ScrapingSession # This doesn't actually exist yet

last_session = ScrapingSession.objects.last()
now = datetime.datetime.now()

# a scraping session will need to have a date attribute
# subtracting two dates gives a timedelta, so take .days to get an integer
days_since_last_scrape = None if last_session is None else (now.date() - last_session.date).days

if days_since_last_scrape is None:
    date_filter = -1 # no previous session, so scrape everything
elif days_since_last_scrape == 0:
    date_filter = 0
elif days_since_last_scrape == 1:
    date_filter = 1
elif days_since_last_scrape <= 3:
    date_filter = 3
elif days_since_last_scrape <= 7:
    date_filter = 7
elif days_since_last_scrape <= 14:
    date_filter = 14
elif days_since_last_scrape <= 30:
    date_filter = 30
else:
    date_filter = -1
    
SCRAPING_URL = 'https://jobsearch.direct.gov.uk/JobSearch/PowerSearch.aspx?redirect=http%3A%2F%2Fjobsearch.direct.gov.uk%2Fhome.aspx&pp=25&sort=rv.dt.di&re=3&tm={date_filter}&pg={page_number}&q=%22Data%20Science%22'    

class JobScraper:
    
    FREQUENCY = datetime.timedelta(hours=1)
    def scrape(self):
        # TODO
        pass
    
scraper = JobScraper()

import time

while True:
    scraper.scrape()
    time.sleep(scraper.FREQUENCY.total_seconds())
    
    # Need to update the date filter here so the next time it runs it only updates today's jobs


It started to get a bit messy here, so I decided to rewrite it. Hopefully with each iteration I'll get closer to something that is readable, maintainable, correct and efficient.

In [11]:
import requests
import time
import datetime
from data_science_jobs.models import ScrapingSession

SCRAPING_URL = 'https://jobsearch.direct.gov.uk/JobSearch/PowerSearch.aspx?redirect=http%3A%2F%2Fjobsearch.direct.gov.uk%2Fhome.aspx&pp=25&sort=rv.dt.di&re=3&tm={date_filter}&pg={page_number}&q=%22Data%20Science%22'    
FREQUENCY = datetime.timedelta(hours=1)

def get_date_filter(last_session):
    if last_session is None:
        return -1 # no previous session, so scrape everything
    today = datetime.datetime.now().date()
    # subtracting two dates gives a timedelta, so take .days to get an integer
    days_since_last_scrape = (today - last_session.date).days
    if days_since_last_scrape == 0:
        date_filter = 0
    elif days_since_last_scrape == 1:
        date_filter = 1
    elif days_since_last_scrape <= 3:
        date_filter = 3
    elif days_since_last_scrape <= 7:
        date_filter = 7
    elif days_since_last_scrape <= 14:
        date_filter = 14
    elif days_since_last_scrape <= 30:
        date_filter = 30
    else:
        date_filter = -1
    return date_filter

def get_n_pages(response):
    # TODO
    pass

def populate_db(response):
    # TODO
    pass

while True:
    last_session = ScrapingSession.objects.last()
    date_filter = get_date_filter(last_session)
    response = requests.get(SCRAPING_URL.format(page_number=1, date_filter=date_filter))
    n_pages = get_n_pages(response)
    populate_db(response)
    for page in range(2, n_pages + 1): # range excludes its end, so add 1 to include the last page
        response = requests.get(SCRAPING_URL.format(page_number=page, date_filter=date_filter))
        populate_db(response)
    time.sleep(FREQUENCY.total_seconds())
    

ImportError: No module named data_science_jobs.models
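
The ladder of `elif` branches in `get_date_filter` could also be written as a lookup over the windows the search page offers. A minimal sketch, assuming the last scrape is available as a plain `datetime.date`; the names `pick_date_filter`, `DATE_FILTERS` and `ALL_JOBS` are hypothetical and not part of the project.

```python
import bisect
import datetime

# Day windows the search form offers through its tm parameter (from the notes above).
DATE_FILTERS = [0, 1, 3, 7, 14, 30]
ALL_JOBS = -1  # tm=-1 searches everything


def pick_date_filter(last_scrape_date, today):
    """Return the smallest tm window that covers every day since the last scrape."""
    if last_scrape_date is None:
        # Nothing has been scraped yet, so ask for everything.
        return ALL_JOBS
    days = (today - last_scrape_date).days
    # Index of the first window that is at least `days` long.
    index = bisect.bisect_left(DATE_FILTERS, days)
    if index == len(DATE_FILTERS):
        # More than 30 days ago: no window is wide enough.
        return ALL_JOBS
    return DATE_FILTERS[index]
```

Passing `today` in as an argument, rather than calling `datetime.datetime.now()` inside the function, makes the boundaries (2 days, 30 days, 31 days) easy to check without waiting for the calendar.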

I decided to get rid of the `JobScraper` class; I don't think I really need objects in a script this simple.

This version looks pretty good to me. Next I'll finish the `get_n_pages` function. I already wrote some of this in the previous notebook, so I'll just copy and paste.

In [12]:
from lxml import html

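def get_n_pages(response):
    tree = html.fromstring(response.content)
    # cssselect returns a list of elements, so take the text of the first match
    # (assumes that text ends with the total number of pages)
    page_summary = tree.cssselect("div.pagesSummary span")[0].text_content()
    n_pages = int(page_summary.split()[-1])
    return n_pages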
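
The text handling in `get_n_pages` can be tried out without hitting the site. A minimal sketch, assuming the summary text ends with the total page count; the sample string "Page 1 of 12" is made up for illustration, and `parse_page_count` and `remaining_pages` are hypothetical helper names.

```python
import re


def parse_page_count(summary_text):
    """Return the last integer found in the summary text, or 1 if there is none."""
    numbers = re.findall(r"\d+", summary_text)
    return int(numbers[-1]) if numbers else 1


def remaining_pages(n_pages):
    """Page numbers still to fetch once page 1 has been downloaded."""
    # range() excludes its end point, hence the + 1 to include the final page.
    return list(range(2, n_pages + 1))
```

For example, `parse_page_count("Page 1 of 12")` gives 12, and `remaining_pages(3)` gives `[2, 3]`. Falling back to 1 when no number is found means a single page of results still gets processed.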

And now for populating the database.

In [None]:
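import dateutil.parser
import requests
from lxml import html
from data_science_jobs.models import JobListing # This doesn't actually exist yet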

def populate_db(response):
    failures = []
    jobs = []
    tree = html.fromstring(response.content)
    table = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[7]/table')[0]
    job_links = table.findall("tr")[1:] # remove headers
    job_links = [row.findall("td")[2].find("a").attrib['href'] for row in job_links]
    for job_link in job_links:
        response = requests.get(job_link)
        tree = html.fromstring(response.content)
        try:
            job = {}
            job['jobid'] = int(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[1]')[0].text)
            job['title'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[2]')[0].text
            job['description'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[1]')[0].text
            try:
                job['company'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/h2[1]')[0].text
            except IndexError:
                pass
            try:
                job['apply_'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[5]/div[2]/div[2]/a')[0].attrib['href']
            except IndexError:
                pass
            try:
                job['added'] = dateutil.parser.parse(tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[2]')[0].text)
            except IndexError:
                pass
            try:
                job['location'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[3]')[0].text
            except IndexError:
                pass
            try:
                job['industry'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[4]')[0].text
            except IndexError:
                pass
            try:
                job['job_type'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[5]')[0].text
            except IndexError:
                pass
            try:
                job['salary'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[6]')[0].text
            except IndexError:
                pass
            try:
                job['hours_of_work'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[7]')[0].text # not in the original data model
            except IndexError:
                pass
            try:
                job['job_reference_code'] = tree.xpath('//*[@id="aspnetForm"]/div/div[2]/div[4]/div/div[4]/dl/dd[8]')[0].text # not in original data model
            except IndexError:
                pass
            jobs.append(job)
        except Exception as e:
            print('failure: %s' % job_link)
            failures.append((job_link, e))

    # save and return only after every job link has been visited
    for job in jobs:
        listing = JobListing(**job)
        listing.save()

    return {
        'jobs': jobs,
        'failures': failures
    }
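
The repeated `try`/`except IndexError` blocks in `populate_db` all follow the same pattern, so they could be driven by a table of field names and XPath expressions. A minimal sketch, assuming `tree` is an lxml tree (or anything with a compatible `.xpath()` method) and covering only plain text fields; `extract_fields` is a hypothetical helper, and fields that need conversion (`jobid`, `added`, `apply_`) would still need their own handling.

```python
def extract_fields(tree, required, optional):
    """Build a job dict from {field name: xpath expression} mappings.

    A required field with no match raises IndexError, as in the original code.
    An optional field with no match is simply left out of the result.
    """
    job = {}
    for name, path in required.items():
        job[name] = tree.xpath(path)[0].text
    for name, path in optional.items():
        matches = tree.xpath(path)
        if matches:
            job[name] = matches[0].text
    return job
```

With this shape, adding a field such as `hours_of_work` becomes one more entry in the `optional` mapping rather than another four-line block.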


Bear in mind that none of this code has actually been tested to see if it works! In fact I'm certain that it won't. For example, I'm pretty sure my `JobListing` class doesn't know how to deal with the `hours_of_work` attribute, and the `data_science_jobs` module doesn't even have a `JobListing` class in it, nor a `ScrapingSession`. The previous `JobListing` class lives in the `data_exploration` module which, as its name suggests, was for exploration only. Now that I have a better idea of what I want and how to structure it, I'll start coding it up properly, with tests and all. Until next time.