<h1>IFA Webscraping Firms</h1>

This Jupyter notebook contains the code used to webscrape IFA firms.

<h2>Importing webscraping libraries</h2>

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

<h2>Defining the Webscraping Spider</h2>

This code block defines what our web spider will do. Note that it doesn't run the spider, it only defines it. We extend the existing scrapy.Spider class rather than writing everything from scratch, so there is only minimal coding to do.

We tested this code using one page:

https://www.ifa.org.uk/find-an-accountant?p=AL&r=0&i=True&pg=0

IFA helpfully returns all of its search results as part of the map element on the page, so parsing is the only real issue. With one search per postcode area, it takes only 121 requests to get the details of all firms, but we will then have to remove duplicates.
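
As a rough sketch of the one-request-per-postcode idea, the search URL can be assembled from its query parameters rather than by string concatenation. The parameter names `p`, `r`, `i` and `pg` are taken from the test URL above; the helper name `build_search_url` and the constant `BASE_URL` are illustrative only, and what `r`, `i` and `pg` mean to the site is not confirmed here.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Taken from the test URL; everything after '?' is rebuilt below.
BASE_URL = 'https://www.ifa.org.uk/find-an-accountant'


def build_search_url(postcode_area):
    """Return the search URL for one postcode area (hypothetical helper)."""
    # A list of pairs keeps the parameters in the same order as the test URL.
    params = [('p', postcode_area), ('r', 0), ('i', 'True'), ('pg', 0)]
    return BASE_URL + '?' + urlencode(params)


print(build_search_url('AL'))
```

Building the query with `urlencode` also escapes any unusual characters, which plain concatenation would pass through unchanged.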

In [2]:
ALL_UK_POSTCODES = ['AB','AL','B','BA','BB','BD','BH','BL','BN','BR','BS','BT','CA','CB','CF','CH','CM','CO','CR','CT','CV','CW','DA','DD','DE','DG','DH','DL','DN','DT','DY','E','EC','EH','EN','EX','FK','FY','G','GL','GU','HA','HD','HG','HP','HR','HS','HU','HX','IG','IP','IV','KA','KT','KW','KY','L','LA','LD','LE','LL','LN','LS','LU','M','ME','MK','ML','N','NE','NG','NN','NP','NR','NW','OL','OX','PA','PE','PH','PL','PO','PR','RG','RH','RM','S','SA','SE','SG','SK','SL','SM','SN','SO','SP','SR','SS','ST','SW','SY','TA','TD','TF','TN','TQ','TR','TS','TW','UB','W','WA','WC','WD','WF','WN','WR','WS','WV','YO','ZE']

In [2]:
class MySpider(scrapy.Spider):
    name = "IFA"

    custom_settings = {
        'DOWNLOAD_DELAY': 1,
    }
    
    def start_requests(self):
        
        urls = []
        POSTCODES = ALL_UK_POSTCODES
        
        for i in POSTCODES:
            url = 'https://www.ifa.org.uk/find-an-accountant?p='+str(i)+'&r=0&i=True&pg=0'
            urls.append(url)
        
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
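    def parse(self, response):
        # Each XPath after the first unions the name nodes with one other field.
        # Results come back in document order, so every name marks the start of
        # a firm's entry and is later used to split the flat list per firm.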
        f_name = response.xpath('//div[@class="map-location"]//div[@class="fa-name"]//text()').extract()
        f_address = response.xpath('//div[@class="map-location"]//div[@class="fa-name"]//text() | //div[@class="map-location"]//div[@class="fa-address"]//text()').extract()
        f_company = response.xpath('//div[@class="map-location"]//div[@class="fa-name"]//text() | //div[@class="map-location"]//div[@class="fa-company"]//text()').extract()
        f_web = response.xpath('//div[@class="map-location"]//div[@class="fa-name"]//text() | //div[@class="map-location"]//div[@class="fa-web"]//text()').extract()
        f_email = response.xpath('//div[@class="map-location"]//div[@class="fa-name"]//text() | //div[@class="map-location"]//div[@class="fa-email"]//text()').extract()
            
        firm_name = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in f_name]
        firm_address = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in f_address]
        firm_company = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in f_company]
        firm_web = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in f_web]
        firm_email = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in f_email]
                        
        full_nm_output.append(firm_name)
        full_ad_output.append(firm_address)     
        full_co_output.append(firm_company)
        full_wb_output.append(firm_web)     
        full_em_output.append(firm_email)
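
The five list comprehensions in `parse` all apply the same clean-up. A minimal sketch of that step as a helper follows; `clean_text` and `clean_all` are names made up here, not part of Scrapy. Because `str.split()` with no arguments already splits on any run of whitespace, including newlines, tabs and carriage returns, the explicit `replace` calls are not needed to get the same result.

```python
def clean_text(raw):
    """Collapse every run of whitespace to one space and trim both ends."""
    return ' '.join(raw.split())


def clean_all(values):
    """Clean each extracted text node; whitespace-only nodes become ''."""
    return [clean_text(v) for v in values]


print(clean_all(['\n\t Mr Michael  Walker,\r\n FFA FFTA ', '\n  ']))
```

Whitespace-only nodes are kept as empty strings rather than dropped, which matches what the original comprehensions produce.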

<h2>Running the Webscraping</h2>

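Note that the code below can only be run once per kernel session: Scrapy runs on the Twisted reactor, which cannot be restarted once it has stopped, so you need to restart the kernel between runs.

This code creates a CrawlerProcess, a lightweight container for our web spider, and then runs it. The five empty lists are the globals that the spider's parse method appends to, so they must exist before the crawl starts. Understanding the rest is probably optional unless it breaks.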

In [3]:
full_nm_output = []
full_ad_output = []
full_co_output = []
full_wb_output = []
full_em_output = []

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

2017-09-24 21:33:58 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-09-24 21:33:58 [scrapy.utils.log] INFO: Overridden settings: {}
2017-09-24 21:33:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-09-24 21:33:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddle

We've now downloaded all the pages that we want to scrape. The first thing to do is to examine what we got back.

In [47]:
print(full_nm_output[0][0:2])
#print(full_ad_output[0][0:8])
#print(full_co_output[0][0:8])
#print(full_wb_output[0][0:8])
#print(full_em_output[0][0:8])

[u'Mr Michael Walker, FFA FFTA', u'Mr Ronald Simpson, FFA FIPA']


In [41]:
row_index_ad = []
row_index_co = []
row_index_wb = []
row_index_em = []

for i in range(len(full_nm_output)):
    ad = []
    for j in full_nm_output[i]:
        if j.encode('ascii','replace') in full_ad_output[i]:
            ad.append(full_ad_output[i].index(j.encode('ascii','replace')))
    row_index_ad.append(ad)


for i in range(len(full_nm_output)):
    co = []
    for j in full_nm_output[i]:
        if j.encode('ascii','replace') in full_co_output[i]:
            co.append(full_co_output[i].index(j.encode('ascii','replace')))
    row_index_co.append(co)
    

for i in range(len(full_nm_output)):
    wb = []
    for j in full_nm_output[i]:
        if j.encode('ascii','replace') in full_wb_output[i]:
            wb.append(full_wb_output[i].index(j.encode('ascii','replace')))
    row_index_wb.append(wb)


for i in range(len(full_nm_output)):
    em = []
    for j in full_nm_output[i]:
        if j.encode('ascii','replace') in full_em_output[i]:
            em.append(full_em_output[i].index(j.encode('ascii','replace')))
    row_index_em.append(em)

In [108]:
split_output_nm = []
split_output_ad = []
split_output_co = []
split_output_wb = []
split_output_em = []

for i in range(len(full_nm_output)):
    for j in range(len(full_nm_output[i])):
        split_output_nm.append(full_nm_output[i][j].encode('ascii','replace'))

for i in range(len(row_index_ad)):
    start = 0
    for index in row_index_ad[i][1:]:
        split_output_ad.append(full_ad_output[i][start+1:index])
        start = index
    
    # Last firm on the page: take everything after its name marker.
    split_output_ad.append(full_ad_output[i][start+1:])

for i in range(len(row_index_co)):
    start = 0
    for index in row_index_co[i][1:]:
        split_output_co.append(full_co_output[i][start+1:index])
        start = index
    
    # Last firm on the page: take everything after its name marker.
    split_output_co.append(full_co_output[i][start+1:])

for i in range(len(row_index_wb)):
    start = 0
    for index in row_index_wb[i][1:]:
        split_output_wb.append(full_wb_output[i][start+1:index])
        start = index
    
    # Last firm on the page: take everything after its name marker.
    split_output_wb.append(full_wb_output[i][start+1:])

for i in range(len(row_index_em)):
    start = 0
    for index in row_index_em[i][1:]:
        split_output_em.append(full_em_output[i][start+1:index])
        start = index
    
    # Last firm on the page: take everything after its name marker.
    split_output_em.append(full_em_output[i][start+1:])

print(len(split_output_nm),split_output_nm[0:3])
print(len(split_output_ad),split_output_ad[0:3])
print(len(split_output_co),split_output_co[0:3])
print(len(split_output_wb),split_output_wb[0:3])
print(len(split_output_em),split_output_em[0:3])

(205, ['Mr Michael Walker, FFA FFTA', 'Mr Ronald Simpson, FFA FIPA', 'Mr Robert Gordon, FFA FFTA'])
(205, [[u'75 Bon Accord Street', u'Aberdeen, Aberdeenshire AB11 6ED', u'UNITED KINGDOM'], [u'19 Stonefield Drive', u'Netherblackhall Inverurie', u'Inverurie, Aberdeenshire AB51 4DZ', u'UNITED KINGDOM'], [u'9 Victoria Street', u'Aberdeen, Scotland AB10 1XB', u'UNITED KINGDOM']])
(205, [[u'Michael J Walker & Co'], [u'Simron Consulting'], [u'A G Accounting Limited']])
(205, [[''], [''], ['']])
(205, [[u'mike@michaeljwalker.co.uk'], [u'simronltd@aol.com'], [u'fixedfeeag@aol.com']])
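
The index bookkeeping above amounts to treating each name as a delimiter in the flat list returned by the union XPath. A compact sketch of the same idea follows, assuming the names occur in the flat list in the same order as in the name-only list, which should hold when both come from the same page in document order. `split_by_names` is a hypothetical helper and the sample data is made up.

```python
def split_by_names(names, flat):
    """Group `flat` into one list per name, using the names as delimiters."""
    groups = []
    pos = 0  # index of the next name we expect to meet in `flat`
    for item in flat:
        if pos < len(names) and item == names[pos]:
            groups.append([])  # a name starts a new (possibly empty) group
            pos += 1
        elif groups:
            groups[-1].append(item)  # a field belonging to the current firm
    return groups


names = ['Mr A', 'Ms B', 'Dr C']
flat = ['Mr A', '1 High St', 'Town', 'Ms B', 'Dr C', '9 Low Rd']
print(split_by_names(names, flat))
```

Matching names in sequence, rather than looking each one up with `list.index`, also copes with two firms on a page sharing the same name, since `index` always returns the first match.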


In [158]:
split_output_nm2 = []
split_output_ql = []
split_output_ql2 = []

for i in split_output_nm: 
    split_output_nm2.append(i.split(",")[0])
    split_output_ql.append(i.split(",")[1].split(" "))

for i in split_output_ql:
    sq = []
    for j in range(len(i)):
        if i[j]:          
            sq.append(i[j])
    split_output_ql2.append(sq)

split_output_ad2 = []

for i in range(len(split_output_ad)):
    ad = []
    for j in range(len(split_output_ad[i])):
        ad.append(split_output_ad[i][j].encode('ascii','replace'))
    split_output_ad2.append(ad)

split_output_co2 = []

for i in range(len(split_output_co)):
    if split_output_co[i]:
        for j in range(len(split_output_co[i])):
            split_output_co2.append(split_output_co[i][j].encode('ascii','replace'))  
    else:
        split_output_co2.append("")

split_output_em2 = []

for i in range(len(split_output_em)):
    if split_output_em[i]:
        for j in range(len(split_output_em[i])):
            split_output_em2.append(split_output_em[i][j].encode('ascii','replace'))
    else:
        split_output_em2.append("")

split_output_wb2 = []

for i in range(len(split_output_wb)):
    if split_output_wb[i]:
        wb = ""
        for j in range(len(split_output_wb[i])):
            wb += split_output_wb[i][j].encode('ascii','replace')
        split_output_wb2.append(wb)                
    else:
        split_output_wb2.append("")

firms_matrix_IFA = []

for i in range(len(split_output_nm)):
    firms_matrix_IFA.append([split_output_nm2[i],split_output_ql2[i],split_output_ad2[i],split_output_co2[i],split_output_wb2[i],split_output_em2[i]])

In [159]:
print(firms_matrix_IFA[0])
print(len(firms_matrix_IFA))

['Mr Michael Walker', ['FFA', 'FFTA'], ['75 Bon Accord Street', 'Aberdeen, Aberdeenshire AB11 6ED', 'UNITED KINGDOM'], 'Michael J Walker & Co', '', 'mike@michaeljwalker.co.uk']
205
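
Splitting on the comma with `i.split(",")[1]` raises an `IndexError` for any name that has no qualifications after it. A hedged sketch of a more forgiving split is shown here; `split_name_quals` is a name invented for illustration, and it assumes the qualifications follow the first comma, as in the sample output above.

```python
def split_name_quals(entry):
    """Split 'Name, Q1 Q2' into ('Name', ['Q1', 'Q2']); no comma gives []."""
    # partition never fails: without a comma the tail is an empty string.
    name, _, quals = entry.partition(',')
    # split() with no arguments drops the empty pieces left by extra spaces.
    return name.strip(), quals.split()


print(split_name_quals('Mr Michael Walker, FFA FFTA'))
print(split_name_quals('Ms No Letters'))
```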


In [168]:
final_matrix_IFA = []
for i in firms_matrix_IFA:
    if i not in final_matrix_IFA:
        final_matrix_IFA.append(i)

In [170]:
print(final_matrix_IFA[0])
print(len(final_matrix_IFA))

['Mr Michael Walker', ['FFA', 'FFTA'], ['75 Bon Accord Street', 'Aberdeen, Aberdeenshire AB11 6ED', 'UNITED KINGDOM'], 'Michael J Walker & Co', '', 'mike@michaeljwalker.co.uk']
205
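
The membership test against a growing list compares each row with every row kept so far, which is fine for a few hundred firms. For larger results, a sketch using a set of hashable keys does the same job in one pass. `dedupe_rows` and `row_key` are names invented here, and the key assumes each row contains only strings and flat lists of strings, as in the output above.

```python
def row_key(row):
    """Turn a row into a hashable key by converting inner lists to tuples."""
    return tuple(tuple(f) if isinstance(f, list) else f for f in row)


def dedupe_rows(rows):
    """Return rows with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


sample = [
    ['Mr A', ['FFA'], ['1 High St', 'Town'], 'A & Co', '', 'a@example.com'],
    ['Ms B', [], ['2 Low Rd'], '', '', ''],
    ['Mr A', ['FFA'], ['1 High St', 'Town'], 'A & Co', '', 'a@example.com'],
]
print(len(dedupe_rows(sample)))
```

Only rows that match in every field count as duplicates here, the same rule as the list-based loop.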
