<h1>Webscraping ACCA Firms</h1>

This Jupyter notebook contains the code used to web-scrape the details of ACCA firms.

<h2>Importing webscraping libraries</h2>

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

<h2>Defining the Webscraping Spider</h2>

This code block defines what our web spider will do. Note that it doesn't run the spider, it only defines it. We extend the existing scrapy.Spider class rather than writing everything from scratch, so only minimal coding is needed.

We tested this code using one page:

http://www.accaglobal.com/uk/en/member/find-an-accountant/find-firm/results.html?isocountry=GB&location=&country=UK&firmname=&organisationid=ACCA&pagenumber=2&resultsperpage=100&requestcount=2&hid=

ACCA helpfully returns the full details of each firm in the search results, so parsing is the only real issue. At 100 results per page, it takes only 84 requests to get the details of all firms.
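
As a sketch of how the page URLs are formed, the query string can be built with `urlencode` instead of string concatenation. This assumes Python 3's `urllib.parse`; the names `BASE_URL` and `page_url` are illustrative and are not part of the spider defined in this notebook.

```python
from urllib.parse import urlencode

# Search endpoint taken from the test URL above.
BASE_URL = ('http://www.accaglobal.com/uk/en/member/find-an-accountant/'
            'find-firm/results.html')


def page_url(page, results_per_page=100):
    """Build the search URL for one results page (hypothetical helper)."""
    params = [
        ('isocountry', 'GB'),
        ('location', ''),
        ('country', 'UK'),
        ('firmname', ''),
        ('organisationid', 'ACCA'),
        ('pagenumber', page),
        ('resultsperpage', results_per_page),
        ('requestcount', page),
        ('hid', ''),
    ]
    return BASE_URL + '?' + urlencode(params)


# Pages are numbered from 1, so 84 pages means range(1, 85).
urls = [page_url(i) for i in range(1, 85)]
print(len(urls))
print(urls[1])
```

Keeping the parameters in a list of pairs preserves their order, so the second URL generated matches the test page quoted above.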

In [2]:
class MySpider(scrapy.Spider):
    name = "ACCA"

    custom_settings = {
        'DOWNLOAD_DELAY': 60,
    }
    
    def start_requests(self):
        
        urls = []
                
        for i in range(1,85):
            url = 'http://www.accaglobal.com/uk/en/member/find-an-accountant/find-firm/results.html?isocountry=GB&location=&country=UK&firmname=&organisationid=ACCA&pagenumber='+str(i)+'&resultsperpage=100&requestcount='+str(i)+'&hid='
            urls.append(url)
        
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
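    def parse(self, response):
        # Every XPath below is a union of the id attributes in the table rows
        # ('//tr//@id') and one field. Results come back in document order, so
        # each list interleaves row ids with the values for that row; later
        # cells look for "rowId" in these ids to split the lists back into firms.
        page_link = response.xpath('//tr//@id | //tr//h5//a//@href').extract()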
        name = response.xpath('//tr//@id | //tr//h5//a/text()').extract()
        address = response.xpath('//tr//@id | //tr//td/text()').extract()
        telephone = response.xpath('//tr//@id | //tr//ul//li[last()]//text()').extract()
        email_website = response.xpath('//tr//@id | //tr//ul//li//@href').extract()
        expertise = response.xpath('//tr//@id | //tr[@class="expandable"]//text()').extract()

        # str.split() with no arguments splits on any run of whitespace (including
        # '\n', '\t' and '\r'), so split-and-join collapses each run to one space.
        firm_page_link = [' '.join(a.split()) for a in page_link]
        firm_name = [' '.join(a.split()) for a in name]
        firm_address = [' '.join(a.split()) for a in address]
        firm_telephone = [' '.join(a.split()) for a in telephone]
        firm_email_website = [' '.join(a.split()) for a in email_website]
        firm_expertise = [' '.join(a.split()) for a in expertise]
        
        full_pl_output.append(firm_page_link)
        full_nm_output.append(firm_name)
        full_ad_output.append(firm_address)
        full_tp_output.append(firm_telephone)
        full_ew_output.append(firm_email_website)
        full_ex_output.append(firm_expertise)   

<h2>Running the Webscraping</h2>

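Note that you can't re-run the cell below within a single session: Scrapy runs on Twisted, and the Twisted reactor cannot be restarted once it has stopped, so you need to restart the kernel between runs.

This code creates the global lists that the spider appends its results to, then creates a CrawlerProcess (a lightweight container for our spider) and runs it. Understanding the details is optional unless something breaks. With DOWNLOAD_DELAY set to 60 seconds, the 84 requests take well over an hour.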

In [3]:
full_pl_output = []
full_nm_output = []
full_ad_output = []
full_tp_output = []
full_ew_output = []
full_ex_output = []

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

2017-09-23 08:17:14 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-09-23 08:17:14 [scrapy.utils.log] INFO: Overridden settings: {}
2017-09-23 08:17:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-09-23 08:17:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddle

We've now downloaded all the pages that we want to scrape. Each page added one list per field, so the first step is to flatten these into a single list per field and examine what we got back.

In [4]:
def flatten(pages):
    # Merge the per-page lists into one flat list. Non-ASCII characters are
    # replaced with '?'; decoding again keeps the items as text so that the
    # "rowId" membership tests in the next cell work on Python 2 and Python 3.
    return [item.encode('ascii', 'replace').decode('ascii')
            for page in pages for item in page]

final_pl_output = flatten(full_pl_output)
final_nm_output = flatten(full_nm_output)
final_ad_output = flatten(full_ad_output)
final_tp_output = flatten(full_tp_output)
final_ew_output = flatten(full_ew_output)
final_ex_output = flatten(full_ex_output)


#print(final_pl_output[0:4])
#print(final_ad_output[0:20])
#print(final_nm_output[0:4])
#print(final_tp_output)
#print(final_ew_output[0:8])
#print(final_ex_output[0:22])

In [5]:
def marker_positions(items):
    # Positions of the row-id markers that separate one firm's values from the next.
    return [i for i, item in enumerate(items) if "rowId" in item]

row_index_pl = marker_positions(final_pl_output)
row_index_nm = marker_positions(final_nm_output)
row_index_ad = marker_positions(final_ad_output)
row_index_tp = marker_positions(final_tp_output)
row_index_ew = marker_positions(final_ew_output)
row_index_ex = marker_positions(final_ex_output)

In [6]:
def split_on_markers(items, positions):
    # Cut the flat list into one sub-list per firm, dropping the markers.
    # As in the original loops, the first marker is assumed to sit at index 0.
    groups = []
    start = 0
    for index in positions[1:]:
        groups.append(items[start+1:index])
        start = index
    groups.append(items[start+1:])
    return groups

split_output_pl = split_on_markers(final_pl_output, row_index_pl)
split_output_ad = split_on_markers(final_ad_output, row_index_ad)
split_output_nm = split_on_markers(final_nm_output, row_index_nm)
split_output_tp = split_on_markers(final_tp_output, row_index_tp)
split_output_ew = split_on_markers(final_ew_output, row_index_ew)
split_output_ex = split_on_markers(final_ex_output, row_index_ex)

#print(split_output_pl[0:3])
#print(split_output_ad[0:3])
#print(split_output_nm[0:3])
#print(split_output_tp[0:3])
#print(split_output_ew[0:4])
#print(split_output_ex[0:3])
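
The marker-based splitting in the two cells above can be illustrated on a small made-up list. This is a minimal sketch, not the notebook's own code: `group_by_marker` is a hypothetical helper that does the indexing and slicing in a single pass, and the sample values are invented.

```python
def group_by_marker(items, marker="rowId"):
    """Group a flat list into sub-lists, starting a new group at each marker."""
    groups = []
    for item in items:
        if marker in item:
            groups.append([])          # a marker opens a new (possibly empty) group
        elif groups:
            groups[-1].append(item)    # values belong to the most recent marker
    return groups


# Made-up sample in the shape the spider returns: row ids interleaved with values.
flat = ["rowId1", "Firm A", "rowId2", "rowId3", "Firm C", "extra"]
print(group_by_marker(flat))
# [['Firm A'], [], ['Firm C', 'extra']]
```

A row with no value for a field (here `rowId2`) still produces an empty group, which is what keeps the per-field lists aligned firm by firm. Unlike the cells above, this sketch ignores any items that appear before the first marker.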

In [7]:
final_pl = [a[0].split("=")[-1] for a in split_output_pl]

final_nm = []
for a in split_output_nm:
    if a:
        final_nm.append(a[0])
    else:
        final_nm.append("")


# Drop empty strings from each firm's address lines.
final_ad = [[line for line in lines if line] for lines in split_output_ad]

# The last address line is treated as the postcode; the rest is the address.
# Firms with no address lines get an empty postcode instead of raising an error.
final_pc = []
final_ad2 = []
for lines in final_ad:
    if lines:
        final_pc.append(lines[-1])
    else:
        final_pc.append("")
    final_ad2.append(lines[:-1])

        
final_tp = []
for a in split_output_tp:
    if a:
        final_tp.append(a[0])
    else:
        final_tp.append("")

final_em = []
final_ws = []
for a in split_output_ew:
    if a:
        if "mailto" in a[0]:
            if len(a) == 1:
                final_em.append(a[0][7:])
                final_ws.append("")
            else:
                final_em.append(a[0][7:])
                final_ws.append(a[1])
        else:
            final_ws.append(a[0])
            final_em.append("")
    else:
        final_em.append("")
        final_ws.append("")

final_ex = []
for items in split_output_ex:
    # One entry per firm: its non-empty expertise items (an empty list if none).
    final_ex.append([item for item in items if item])
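
The email/website logic in this cell relies on the position of the `mailto` link. As an alternative sketch (not the code used above), the links can be classified by prefix instead; `split_email_website` is a hypothetical helper and the sample addresses are invented.

```python
def split_email_website(links):
    """Return (email, website) from a firm's list of hrefs; '' where missing."""
    email = ""
    website = ""
    for link in links:
        if link.startswith("mailto:"):
            email = link[len("mailto:"):]   # strip the 'mailto:' scheme
        else:
            website = link
    return email, website


print(split_email_website(["mailto:info@example.com", "http://www.example.com"]))
print(split_email_website(["http://www.example.com"]))
print(split_email_website([]))
```

Classifying by prefix gives the same result as the cell above when the email link comes first, and also copes with the website link appearing before the email.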

In [14]:
firms_matrix_ACCA = []

for i in range(len(final_pl)):
    firms_matrix_ACCA.append([final_pl[i],final_nm[i],final_ad2[i],final_pc[i],final_tp[i],final_em[i],final_ws[i],final_ex[i]])

firms_matrix_ACCA.sort()

In [15]:
print(len(firms_matrix_ACCA))

8355


In [None]:
import pickle

# Use a context manager so the file is flushed and closed after writing.
with open('/home/de-admin/Documents/Webscraping/ACCA_Firms.p', 'wb') as f:
    pickle.dump(firms_matrix_ACCA, f)

To load the pickled data back:

import pickle

with open('/home/de-admin/Documents/Webscraping/ACCA_Firms.p', 'rb') as f:
    output = pickle.load(f)
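
As a quick check of the save-and-load round trip, the sketch below pickles a single made-up record to a temporary file and reads it back. The record follows the column order used for `firms_matrix_ACCA`; the values and the variable names in the final loop are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Made-up record: link id, name, address lines, postcode, telephone,
# email, website, expertise (the column order of firms_matrix_ACCA).
sample = [["123", "Example Firm", ["1 High Street", "Anytown"], "AB1 2CD",
           "01234 567890", "info@example.com", "http://www.example.com", []]]

path = os.path.join(tempfile.mkdtemp(), "sample_firms.p")
with open(path, "wb") as f:
    pickle.dump(sample, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# Each row unpacks into its eight columns.
for link_id, name, address, postcode, telephone, email, website, expertise in restored:
    print(name, postcode)
```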
