<h1>ATT Webscraping Members</h1>

This Jupyter notebook contains the code used to scrape the ATT members directory.

<h2>Importing webscraping libraries</h2>

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

<h2>Defining the Webscraping Spider</h2>

This code block defines what our web spider will do. Note that it doesn't run the spider, it only defines it. We extend the existing scrapy.Spider class rather than writing everything from scratch, so there is only minimal coding to do.

We tested this code using one page:

http://www.accaglobal.com/gb/en/member/find-an-accountant/directory-of-member/results.html?isocountry=GB&FirstName=&Surname=&Location=&inputcountrysuspended=&orgid=ACCA&orby=FNA&ipp=100&pn=1&hid=&requestcount=1

ACCA are helpful enough to return the full details of each member in the search results, so parsing is the only real issue. Annoyingly, there are around 200k members; at 100 results per page (ipp=100 in the URL above) that means about 2k requests.
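
The request count above is simple arithmetic. Here is a quick sketch, assuming a round 200,000 members and exactly 100 results per page (the ipp=100 in the test URL). The function name requests_needed is made up for illustration and is not part of the scraper.

```python
def requests_needed(total_members, per_page):
    """Number of result pages needed to cover every member (ceiling division)."""
    return (total_members + per_page - 1) // per_page


# Round figures taken from the text above; the real member count will differ.
print(requests_needed(200000, 100))
```

Ceiling division matters only when the total isn't an exact multiple of the page size: one extra member would need one extra request.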

In [2]:
class MySpider(scrapy.Spider):
    name = "ATT"

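    custom_settings = {
        # Retry failed requests up to 10 times, including on these status codes
        "RETRY_TIMES": 10,
        "RETRY_HTTP_CODES": [500, 503, 504, 400, 403, 404, 408],
        # Route requests through proxies picked by scrapy_proxies
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
            'scrapy_proxies.RandomProxy': 100,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        },
        "PROXY_LIST": '/home/de-admin/Documents/Webscraping/proxy_list.txt',
        # Mode 0: use a different proxy for every request
        "PROXY_MODE": 0,
    }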

    def start_requests(self):
        # Request 501 member pages: pID runs from 6085205 upwards in steps of 37
        for i in range(6085205, 6085205 + 500*37 + 1, 37):
            url = 'https://core.att.org.uk/Member/memberDetail?pID=' + str(i)
            yield scrapy.Request(url=url, callback=self.parse)
            
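    def parse(self, response):
        # Last URL segment identifies the page, e.g. 'memberDetail?pID=6085390'
        page = [response.url.split("/")[-1]]
        # Text of every node inside a ul element, with runs of whitespace collapsed
        details = response.xpath('//ul//text()').extract()
        firm_details = [' '.join(a.split()) for a in details]
        page.extend(firm_details)

        # full_dt_output is a module-level list created in the run cell below.
        # Each page is wrapped in its own one-item list, hence the extra nesting.
        full_dt_output.append([page])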
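
The whitespace clean-up inside parse can be tried on its own, without running a crawl. This is a minimal sketch: the helper name normalise_whitespace and the sample strings are invented for illustration, not scraped data.

```python
def normalise_whitespace(text):
    """Collapse every run of spaces, tabs and line breaks into a single space."""
    # str.split() with no arguments splits on any whitespace and drops
    # leading/trailing whitespace, so no explicit replace() calls are needed.
    return ' '.join(text.split())


# Made-up text nodes of the kind an XPath text() query tends to return
raw_nodes = ['\r\n\t  Mr Example Person ATT \n', '\n\t', 'LONDON\r\n']
cleaned = [normalise_whitespace(node) for node in raw_nodes]
print(cleaned)
```

Nodes that contain only whitespace come out as empty strings, which is why the scraped pages contain so many '' entries.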

<h2>Running the Webscraping</h2>

Note that you can't re-run the code below within a single session: process.start() starts the Twisted reactor, which cannot be restarted once it has stopped, so you need to restart the kernel between runs.

This code creates a CrawlerProcess, a lightweight container for our spider, and then runs it. Understanding the details is optional unless something breaks.

In [3]:
full_dt_output = []

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

2017-09-23 18:08:08 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-09-23 18:08:08 [scrapy.utils.log] INFO: Overridden settings: {}
2017-09-23 18:08:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-09-23 18:08:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy_proxies.RandomProxy',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermi

We've now downloaded all the pages that we want to scrape. The first thing to do is to examine what we got back.

In [6]:
print(full_dt_output)

[[['memberDetail?pID=6085390', '', u'Mr Matthew Benjamin Davies ATT', '', '', u'LONDON', u'SW12 8BJ', '']], [['memberDetail?pID=6085501', '', u'Mr Christopher Smith CTA ATT', '', '', u'Leathers the Accountants', u'Cale Cross House', u'156 Pilgrim Street', u'NEWCASTLE UPON TYNE', u'NE1 6SU', u'Email: c.smith@leathersllp.co.uk', u'Tel: 0181 224 6760', u'Specialisms: Corporation Tax Returns', '']], [['memberDetail?pID=6085353', '', u'Mr Philip Edward Grange ATT', '', '', u'NEWCASTLE UPON TYNE', u'NE13 9AA', '']], [['memberDetail?pID=6085316', '', u'Mrs Nicola McReynolds CTA ATT', '', '', u'Henry Brown & Co.', u'26 Portland Road', u'KILMARNOCK', u'East Ayrshire', u'KA1 2EB', u'Email: nmcreynolds@henrybrown.co.uk', u'Tel: 01563522308', u'Specialisms: Capital Allowances, Capital Gains Tax, Child Benefit, Corporation Tax Returns, Employer Compliance , Income Tax, International Tax, Investigations, Compliance Checks, National Insurance, Non-Resident Landlord Registration and Tax Returns, Partn

In [5]:
final_dt_output = []

# full_dt_output is nested three deep: [[page], [page], ...], where each page
# is a list of strings, so unwrap both outer levels before encoding.
for wrapped in full_dt_output:
    for page in wrapped:
        final_dt_output.append([s.encode('ascii', 'replace') for s in page])

print(final_dt_output[0:20])
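
Because parse appends [page] rather than page, every element of full_dt_output is a one-item list wrapping the actual list of strings, so calling a string method on those elements directly fails. The sketch below shows one way to unwrap that nesting and drop the empty strings. The helper names flatten_pages and drop_empty and the sample records are made up for illustration.

```python
def flatten_pages(nested):
    """Turn [[page], [page], ...] into [page, page, ...]."""
    return [page for wrapped in nested for page in wrapped]


def drop_empty(page):
    """Remove the empty strings left behind by whitespace-only text nodes."""
    return [field for field in page if field]


# Invented records with the same shape as the scraped output
sample = [
    [['memberDetail?pID=1', '', 'Example Name ATT', '', 'TOWN', '']],
    [['memberDetail?pID=2', '', 'Another Name ATT', '']],
]
pages = [drop_empty(page) for page in flatten_pages(sample)]
print(pages)
```

Whether to drop the empty strings is a judgment call: they carry no data, but their positions may help when telling apart fields that are missing.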

row_index_fm = []
row_index_tn = []
row_index_co = []
row_index_yr = []
row_index_ct = []

for i in range(len(final_fm_output)):
    if "rowId" in final_fm_output[i]:
        row_index_fm.append(i)

for i in range(len(final_tn_output)):
    if "rowId" in final_tn_output[i]:
        row_index_tn.append(i)

for i in range(len(final_co_output)):
    if "rowId" in final_co_output[i]:
        row_index_co.append(i)

for i in range(len(final_yr_output)):
    if "rowId" in final_yr_output[i]:
        row_index_yr.append(i)

for i in range(len(final_ct_output)):
    if "rowId" in final_ct_output[i]:
        row_index_ct.append(i)

split_output_fm = []
split_output_tn = []
split_output_co = []
split_output_yr = []
split_output_ct = []

start = 0
for index in row_index_fm[1:]:
    split_output_fm.append(final_fm_output[start+1:index])
    start = index
    
split_output_fm.append(final_fm_output[start+1:])

start = 0
for index in row_index_tn[1:]:
    split_output_tn.append(final_tn_output[start+1:index])
    start = index
    
split_output_tn.append(final_tn_output[start+1:])

start = 0
for index in row_index_co[1:]:
    split_output_co.append(final_co_output[start+1:index])
    start = index
    
split_output_co.append(final_co_output[start+1:])

start = 0
for index in row_index_yr[1:]:
    split_output_yr.append(final_yr_output[start+1:index])
    start = index
    
split_output_yr.append(final_yr_output[start+1:])

start = 0
for index in row_index_ct[1:]:
    split_output_ct.append(final_ct_output[start+1:index])
    start = index
    
split_output_ct.append(final_ct_output[start+1:])

print(split_output_fm[0:5])
print(split_output_tn[0:5])
print(split_output_co[0:5])
print(split_output_yr[0:5])
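
The five index-and-slice loops above all follow the same pattern: find the lines containing "rowId", then take whatever sits between consecutive markers. A single helper could do this in one pass. This is a sketch only, under the assumption that the first marker is the first line of the list (which the slicing above also relies on). The name split_on_marker and the sample data are hypothetical.

```python
def split_on_marker(lines, marker="rowId"):
    """Group the lines that follow each marker line into one record per marker."""
    records = []
    current = None
    for line in lines:
        if marker in line:
            # A marker starts a new (possibly empty) record
            current = []
            records.append(current)
        elif current is not None:
            current.append(line)
        # Lines before the first marker are ignored
    return records


# Invented input: three markers, the second with nothing after it
lines = ['rowId 1', 'A', 'B', 'rowId 2', 'rowId 3', 'C']
print(split_on_marker(lines))
```

With a helper like this, each of the five outputs would need one call instead of two loops.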

final_fm = []   
for i in range(len(split_output_fm)):
    fm = []
    
    for j in range(len(split_output_fm[i])):
        if split_output_fm[i][j]:
            fm.append(split_output_fm[i][j].encode('ascii','replace'))
            
    final_fm.append(fm)

final_tn = []
for a in split_output_tn:
    if a:
        final_tn.append(a[0])
    else:
        final_tn.append("")

final_co = []
for a in split_output_co:
    if a:
        final_co.append(a[0])
    else:
        final_co.append("")

final_yr = []
for a in split_output_yr:
    if a:
        final_yr.append(int(a[0]))
    else:
        final_yr.append(None)

final_ct = []   
for i in range(len(split_output_ct)):
    ct = []
    
    for j in range(len(split_output_ct[i])):
        if split_output_ct[i][j]:
            ct.append(split_output_ct[i][j].encode('ascii','replace'))
            
    final_ct.append(ct)


final_nm = []
final_ty = []
for i in final_fm:
    final_nm.append(i[0])
    final_ty.append(i[1])

members_matrix_ACCA = []

for i in range(len(final_nm)):
    members_matrix_ACCA.append([final_nm[i],final_ty[i],final_tn[i],final_co[i],final_yr[i],final_ct[i]])


print(members_matrix_ACCA)

# Write one record per line; the with block closes the file even on error
with open('/home/de-admin/Documents/Webscraping/ACCA_Firms.txt', 'w') as f:
    for item in members_matrix_ACCA:
        f.write(str(item) + '\n')
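
The file-writing step can be checked in isolation before pointing it at the real output path. This is a minimal sketch that writes to a temporary directory. The function name write_rows and the sample rows are invented for illustration.

```python
import os
import tempfile


def write_rows(path, rows):
    """Write the string form of each row on its own line."""
    with open(path, 'w') as f:
        for row in rows:
            f.write(str(row) + '\n')


# Made-up rows written to a throwaway location
out_path = os.path.join(tempfile.mkdtemp(), 'rows.txt')
write_rows(out_path, [['a', 1], ['b', 2]])

with open(out_path) as f:
    print(f.read())
```

Writing str(row) keeps Python's list syntax in the file, as the original code does. If the data needs to be read back by another tool, the csv module may be a better fit.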
