<h1>ICAEW Webscraping Test</h1>

This Jupyter notebook contains our experiments in webscraping all of the Professional Body data that we need.

<h2>Importing webscraping libraries</h2>

The first time that you do webscraping you may have to download some tools. I chose to experiment with scrapy:

https://scrapy.org/

I'm running this on a machine with Anaconda and Jupyter, so all I need to do to install scrapy is to type:

conda install scrapy

And let Anaconda do the work.

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

If the installation worked then the import statement above runs with no output. We're going to try to follow the scrapy tutorial on these pages:

https://docs.scrapy.org/en/latest/intro/tutorial.html

https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

So, let's get this going.

<h2>Defining the Webscraping Spider</h2>

This code block is defining what our web spider will do.

We tested this code using these pages, because they have several complex features that are not available on every page:

http://www.accaglobal.com/uk/en/member/find-an-accountant/find-firm/results/details.html?isocountry=GB&location=London&country=UK&advisorid=2841947

https://find.icaew.com/listings/view/listing_id/35931

<h3>start_requests</h3>

The first function has two purposes:

The first is to construct the series of urls that we are going to scrape. Note that because of the design of the website we just have to increase the id number by one to try a new page. Not every single ID number is currently in use, but the code will just skip a missing page by itself.

The second is to iterate over the list of urls using scrapy's Request class, which takes at least a url and a callback function as arguments. The callback is just the set of instructions we've built to get the information we want from the page - we'll discuss this in a moment.
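
The url construction in the first step can be tried on its own, without scrapy. This is a minimal sketch: the helper name build_listing_urls is hypothetical (it is not part of scrapy or of the spider), while the base url and the ID range are the ones used in the spider.

```python
BASE_URL = 'https://find.icaew.com/listings/view/listing_id/'


def build_listing_urls(first_id, last_id):
    """Return one listing url per ID from first_id to last_id inclusive."""
    return [BASE_URL + str(i) for i in range(first_id, last_id + 1)]


urls = build_listing_urls(70, 80)
print(len(urls))   # 11
print(urls[0])     # https://find.icaew.com/listings/view/listing_id/70
```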

<p style="color:red">To do, implement the better extraction method<br>
To do, mute the log</p>

<h3>parse</h3>

The second function has one real purpose, which is to read the web page we access, scrape the useful information and package it up as a neat data structure. We don't really need to neaten it up here, but the processing is trivial and it will reduce the size of the output. The three steps work like this:

<h4>1. scrape the important data</h4>

"page = response.url.split("/")[-1]" extracts everything after the final slash in the URL, which in this case is the membership number!

"response" is an object which allows us to manipulate the data in the webpage we call. In this case we're going to use a "selector" method which lets us pick individual elements of the webpage, and then the extract method to pull out the selected elements as strings. We have used the xpath selector because the page is html, but you can use other ones (e.g. css).

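The url-splitting step described at the start of this section can be checked in isolation. A small sketch using the test page url from this notebook; the trailing-slash case is an assumption about what could go wrong, not something we have seen on this site.

```python
url = 'https://find.icaew.com/listings/view/listing_id/35931'

# Everything after the final slash is the listing ID
page = url.split("/")[-1]
print(page)  # 35931

# If a url ever ended with a slash, the last piece would be empty,
# so stripping trailing slashes first is a cheap safeguard
page_safe = (url + '/').rstrip('/').split('/')[-1]
print(page_safe)  # 35931
```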
It's too much to explain xpath here, but I will explain one expression as a guide. Obviously the exact coding here depends completely on the web page being scraped and took a little while to develop. We're lucky that the webpages we're scraping are very regular in format, so we can get what we need without too much bother.

'//div[@class="row-fluid listing-details"]//div[@class="span3 title"]//text() | //ul[@class="prof"]//text()'

The first thing to note is that this has a pipe character in it. We're actually extracting two types of data - the first is the heading of the section (which says "specialisms"), and the second is a list of the specialisms. This is convenient because not all data is available for every firm, so it makes it simpler to identify what we're meant to be looking at.

So, from the left:

"//div[@class="row-fluid listing-details"]" means "find all div tags with the class "row-fluid listing-details"". The "//" means "at any depth below the current point", whereas a single "/" would match only direct children. The "=" means find an identical match, but you can use other tests (e.g. contains()).

"//div[@class="span3 title"]" means (having narrowed the search above) "find all div tags with the class "span3 title"". On our website that matches only one place in the document.

"//text()" means select all text elements inside the tags we have selected.

In this case the output is a messy looking string that contains the text we want, some new line characters, and some white space - we'll deal with that in the next block.
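
The difference between "//" and "/" can be seen on a tiny document. This sketch deliberately uses the standard library's xml.etree.ElementTree instead of scrapy's selectors, so it runs without scrapy. ElementTree supports only a subset of xpath (no text() and no pipe), so the text is collected with .text and itertext() and the two results are joined by hand. The markup, including the extra "wrapper" div, is invented for illustration and is not the real ICAEW page.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the "wrapper" div puts the title two levels down
SAMPLE = """
<html><body>
  <div class="row-fluid listing-details">
    <div class="wrapper">
      <div class="span3 title">Specialisms</div>
    </div>
    <ul class="prof"><li>Audit</li><li>Tax</li></ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())
outer = ".//div[@class='row-fluid listing-details']"

# "//" searches every level below the outer div, so the nested title is found
deep = root.findall(outer + "//div[@class='span3 title']")
# "/" looks at direct children only, so the nested title is missed
shallow = root.findall(outer + "/div[@class='span3 title']")

heading = [el.text for el in deep]
items = [t for ul in root.findall(".//ul[@class='prof']") for t in ul.itertext()]

# Stand-in for the pipe: the heading first, then the list entries
combined = heading + items
print(len(deep), len(shallow))
print(combined)
```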

<h4>2. clean the data</h4>

The purpose of the next section of code is to remove the newline characters and whitespace:

firm_name = [' '.join(a.replace('\n', ' ').split()) for a in f_name]

For each element in the f_name list, we do two things:

Firstly we replace all newline characters with spaces using the .replace method.

Secondly, we remove the whitespace - this is done using a combination of the join and split functions. The split function breaks each sentence into individual words, and the join function joins them back together, in this case with a space in between. 
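
The cleaning step can be tried on a made-up input. A minimal sketch that wraps the same expression in a hypothetical helper called clean; the raw strings are invented to resemble the messy output described above.

```python
def clean(strings):
    """Replace newlines with spaces, then collapse runs of whitespace."""
    return [' '.join(s.replace('\n', ' ').split()) for s in strings]


raw = ['\n   Potter   and\n Pollard  ', '\n  ']
cleaned = clean(raw)
print(cleaned)  # ['Potter and Pollard', '']
```

Note that a string containing only whitespace becomes an empty string rather than disappearing, which is why the later cells test each entry before using it.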

<h4>3. package the data up</h4>

This step just puts all the data into a list of lists - in fact there is more processing to do but we can do that offline.

In [2]:
class MySpider(scrapy.Spider):
    name = "ICAEW"

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
    }
    
    
    def start_requests(self):
        
        urls = []
                
        for i in range(70,81):
            url = 'https://find.icaew.com/listings/view/listing_id/'+str(i)
            urls.append(url)
        
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        page = response.url.split("/")[-1]      
        f_name = response.xpath('//div[@class="listing-name"]/h1/text()').extract()
        f_contact_details = response.xpath('//div[@class="row-fluid listing-details listing"]//div[@class="span3 title"]//text() | //div[@class="row-fluid listing-details listing"]//div[@class="span9"]//text()').extract()
        f_specialisations = response.xpath('//div[@class="row-fluid listing-details"]//div[@class="span3 title"]//text() | //ul[@class="prof"]//text()').extract()
        f_profile = response.xpath('//h2[@class="pgttl prof promid"]//text() | //div[@class="span12 profile"]//text()').extract()
        f_qualifications = response.xpath('//div[@class="striped listing-details"]//div[@class="span3 title"]//text() | //div[@class="striped listing-details"]//div[@class="span9"]//text()').extract()
        f_partners_and_staff = response.xpath('//h3[@class="pgttl prof promid"]//text() | //h4[@class="pgttl prof promid"]//text() | //div[@class="striped listing-details"]//div[@class="span12 title"]//text()').extract()
        
        firm_name = [' '.join(a.replace('\n', ' ').split()) for a in f_name]
        firm_contact_details = [' '.join(a.replace('\n', ' ').split()) for a in f_contact_details]
        firm_specialisations = [' '.join(a.replace('\n', ' ').split()) for a in f_specialisations]
        firm_profile = [' '.join(a.replace('\n', ' ').split()) for a in f_profile]
        firm_qualifications = [' '.join(a.replace('\n', ' ').split()) for a in f_qualifications]
        firm_partners_and_staff = [' '.join(a.replace('\n', ' ').split()) for a in f_partners_and_staff]
        
        firm = [page,firm_name,firm_contact_details,firm_specialisations,firm_profile,firm_qualifications,firm_partners_and_staff]
        firms.append(firm)

<h2>Running the Webscraping</h2>

Note that you can't re-run the code below within a single session: scrapy runs on the Twisted reactor, which cannot be restarted once it has stopped, so you need to restart the kernel between runs.

This code creates a lightweight container for our webspider and then runs it. To be honest, understanding this is probably optional unless it breaks.

In [3]:
firms = []
process = CrawlerProcess()
process.crawl(MySpider)
process.start()

2017-09-22 11:22:20 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-09-22 11:22:20 [scrapy.utils.log] INFO: Overridden settings: {}
2017-09-22 11:22:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-09-22 11:22:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddle

We've now downloaded all the pages that we want to scrape. The first thing to do is to examine what we got back.

Now we need to parse the data. Each entry in firms is a list, and its fields need the following treatment:

[4] - removing blanks, and concatenating
[3] - only one example in the top 100 - need to check, removing blanks
[2] - a set of pairs, but not all mandatory so we need to work out what exists and then populate it or not.
[1] - name, convert from unicode
[0] - page ID, as is
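
The treatment of field [2] amounts to turning a flat list of label/value pairs into fixed columns. This is a minimal sketch of that idea, assuming the blanks have already been removed so that labels and values strictly alternate. The helper names pairs_to_dict and align, and the labels 'Address', 'Postcode' and 'Telephone', are hypothetical and chosen only for illustration.

```python
def pairs_to_dict(flat):
    """Turn [label, value, label, value, ...] into {label: value}."""
    return dict(zip(flat[0::2], flat[1::2]))


def align(records, labels):
    """Give every record one column per label, using '' where it is missing."""
    return [[rec.get(label, '') for label in labels] for rec in records]


# Hypothetical labels; the second firm has no telephone entry
raw = [
    ['Address', '1 Embankment Place, London',
     'Postcode', 'WC2N 6RH',
     'Telephone', '0207 583 5000'],
    ['Address', '2 The Estate Yard, Ixworth, Bury St. Edmunds',
     'Postcode', 'IP31 2HE'],
]

records = [pairs_to_dict(r) for r in raw]
labels = sorted(set(label for rec in records for label in rec))
table = align(records, labels)

print(labels)
print(table[1])
```

Sorting the labels gives a predictable column order, whereas the iteration order of a set is arbitrary.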

In [92]:
des_data = []

for i in range(len(firms)):   
    if firms[i][2]:
        dd = []
        for j in range(0,len(firms[i][2])):
            if firms[i][2][j]: 
                dd.append(firms[i][2][j].encode('ascii','replace'))
        des_data.append(dd)    
    else:
        des_data.append("")   

#print(des_data)

des_type = []

for i in range (len(des_data)):
    for j in range (0,len(des_data[i]),2):
        des_type.append(des_data[i][j].encode('ascii','replace'))    

des_set = list(set(des_type))

#print(des_set)

designatory_data = []

for i in range (len(des_data)):
    f_des = []
    for t in range(len(des_set)):
        if des_set[t] in des_data[i]:
            f_des.append(des_data[i][des_data[i].index(str(des_set[t])) + 1].encode('ascii','replace'))
        else:
            f_des.append("")
    
    designatory_data.append(f_des)

#print(designatory_data)

We now need to extract the specialisations data, skipping the section heading at index 0:

In [93]:
specialisations = []

for i in range(len(firms)):   
    if firms[i][3]:
        spec = []
        for j in range(1,len(firms[i][3])):
            if firms[i][3][j]: 
                spec.append(firms[i][3][j].encode('ascii','replace'))
        specialisations.append(spec)    
    else:
        specialisations.append("")  

#print(specialisations) 

We now need to extract the company profile data, joining the lines after the heading into a single string:

In [94]:
company_profile = []

for i in range(len(firms)):
    # Only firms with text after the section heading have a profile
    if len(firms[i][4]) > 1:
        prof = []
        for j in range(1,len(firms[i][4])):
            if firms[i][4][j]: 
                prof.append(firms[i][4][j].encode('ascii','replace'))
        company_profile.append(' '.join(prof))  
    else:
        company_profile.append("")  

#print(company_profile)

We now need to make the qualifications data into a regular format:

In [95]:
qualifications = []

qual_type = []

for i in range (len(firms)):
    for j in range (0,len(firms[i][5]),2):
        qual_type.append(firms[i][5][j].encode('ascii','replace'))    

qual_set = list(set(qual_type))
    
#print(qual_set)

for i in range (len(firms)):
    f_qual = []
    for t in range(len(qual_set)):
        if qual_set[t] in firms[i][5]:
            f_qual.append(firms[i][5][firms[i][5].index(str(qual_set[t])) + 1].encode('ascii','replace'))
        else:
            f_qual.append("")
    qualifications.append(f_qual)
    
#print(qualifications)

We now need to extract details of the chartered accountants and other staff:

In [96]:
staff_types = ['ICAEW Chartered Accountants','Partners & additional staff']

type_index = []

for i in range(len(firms)):
    type_ind = []
    
    for j in staff_types:
        if [firms[i][6].index(str(j)) for entry in firms[i][6] if j in entry]:
            type_ind.append([firms[i][6].index(str(j)) for entry in firms[i][6] if j in entry][0])
        else:
            type_ind.append(None)
    
    type_index.append(type_ind)

#print(type_index)

staff_ICAEW = []
staff_other = []

for i in range(len(firms)):
    
    if not type_index[i][0] is None:
        staff_I = []
        staff_o = []
        if not type_index[i][1] is None:
            for line in firms[i][6][type_index[i][0]+1:type_index[i][1]]:
                staff_I.append(line)
            for line in firms[i][6][type_index[i][1]+1:len(firms[i][6])]:
                staff_o.append(line)        
        else:
            for line in firms[i][6][type_index[i][0]+1:len(firms[i][6])]:
                staff_I.append(line)
            
        staff_ICAEW.append(staff_I)
        staff_other.append(staff_o)

    elif not type_index[i][1] is None:
        staff_I = []
        staff_o = []

        for line in firms[i][6][type_index[i][1]+1:len(firms[i][6])]:
            staff_o.append(line)        
            
        staff_ICAEW.append(staff_I)
        staff_other.append(staff_o)    

    else:
        staff_ICAEW.append([])
        staff_other.append([])  

#print(staff_ICAEW)
#print(staff_other)
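
The splitting logic in the cell above can also be expressed as a single pass over the lines. This is a sketch under the assumption that each heading appears at most once and exactly as written in staff_types. split_by_headings is a hypothetical helper name, and 'Mr A N Other' is a made-up entry.

```python
def split_by_headings(lines, headings):
    """Group each line under the most recent heading seen before it."""
    sections = {h: [] for h in headings}
    current = None
    for line in lines:
        if line in headings:
            current = line          # start collecting under this heading
        elif current is not None:
            sections[current].append(line)
    return sections


staff_types = ['ICAEW Chartered Accountants', 'Partners & additional staff']
lines = ['ICAEW Chartered Accountants', 'Mrs Janet Gee',
         'Partners & additional staff', 'Mr A N Other']

sections = split_by_headings(lines, staff_types)
print(sections['ICAEW Chartered Accountants'])   # ['Mrs Janet Gee']
print(sections['Partners & additional staff'])   # ['Mr A N Other']
```

A firm with neither heading simply gets two empty lists, which matches the final else branch of the cell.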

We have now constructed all the variables we need, so the final step is to put them into a matrix, sort them for convenience and inspect the output.

In [109]:
firms_matrix = []

# the code above is somewhat responsive to the data in the website, but the code below is manual
# the commented out code is currently inoperative because it relies on designatory data existing
# that doesn't in the test data.

for i in range(len(firms)):
    firms_matrix.append([int(firms[i][0]),firms[i][1][0].encode('ascii','replace'),designatory_data[i][1],designatory_data[i][2],designatory_data[i][0],specialisations[i],company_profile[i],qualifications[i],staff_ICAEW[i],staff_other[i]])

#    firms_matrix.append([int(firms[i][0]),firms[i][1][0].encode('ascii','replace'),designatory_data[i][1],designatory_data[i][2],designatory_data[i][0],designatory_data[i][3],designatory_data[i][4],specialisations[i],company_profile[i],qualifications[i],staff_ICAEW[i],staff_other[i]])

firms_matrix.sort()
print(firms_matrix)

[[70, 'Plushcourt Estates Limited', '2 The Estate Yard, Ixworth, Bury St. Edmunds', 'IP31 2HE', '', '', '', ['', ''], [], []], [71, 'Davidsons', '23 Comfrey Close, Farnborough', 'GU14 9XX', '01252 514292', '', '', ['', ''], [], []], [72, 'Stephen Mayled & Associates Ltd', 'Cottage Farm, Michaelston-le-Pit, Dinas Powys', 'CF64 4HE', '02920 515777', '', '', ['', ''], [], []], [76, 'Barrowby Accountants Limited', 'Kobia, Low Road, Barrowby, Grantham', 'NG32 1DJ', '01476 569832', '', '', ['', ''], [], []], [77, 'Potter and Pollard', 'Suite 7, Wessex House, St. Leonards Road, Bournemouth', 'BH8 8QS', '01202 526677', '', '', ['Registered by the ICAEW to carry out audit work', 'Regulated by ICAEW for a range of investment business activities'], [u'Mrs Janet Gee', u'Mrs Janet Gee', u'Mrs Janet Gee'], []], [78, 'PricewaterhouseCoopers LLP', '1 Embankment Place, London', 'WC2N 6RH', '0207 583 5000', '', '', ['Registered by the ICAEW to carry out audit work', ''], [u'Mr Philip Bloomfield', u'Mr T