<h1>Webscraping ACCA Members</h1>

This Jupyter notebook contains the code used to scrape the ACCA members directory.

<h2>Importing webscraping libraries</h2>

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

<h2>Defining the Webscraping Spider</h2>

This code block defines what our web spider will do. Note that it only defines the spider; it does not run it. We extend the existing scrapy.Spider class rather than writing everything from scratch, so only minimal coding is needed.

We tested this code using one page:

https://core.att.org.uk/Member/memberDetail?pID=6085205

ACCA are helpful enough to return the full details of each member in the search results, so parsing is the only real issue. Annoyingly, there are around 200k members, so at 100 results per page we will need roughly 2,000 requests.
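
The request count follows from the page size: the search URL in the spider below asks for 100 results per page (`ipp=100`). The sketch here works through that arithmetic and adds a rough lower bound on crawl time at the spider's 60-second `DOWNLOAD_DELAY`. The member total is the approximate figure quoted above, and `pages_needed` is a hypothetical helper, not part of the notebook.

```python
import math

TOTAL_MEMBERS = 200000   # approximate figure quoted in the text, not an exact count
RESULTS_PER_PAGE = 100   # matches ipp=100 in the search URL
DOWNLOAD_DELAY = 60      # seconds, the value in the spider's custom_settings


def pages_needed(total, per_page):
    """Return how many result pages are needed to cover `total` members."""
    return int(math.ceil(total / float(per_page)))


n_pages = pages_needed(TOTAL_MEMBERS, RESULTS_PER_PAGE)

# Lower bound only: ignores response time and any retries.
est_hours = n_pages * DOWNLOAD_DELAY / 3600.0

print(n_pages)
print(round(est_hours, 1))
```

At these settings a full crawl would take well over a day, which is worth knowing before widening the page range.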

In [2]:
class MySpider(scrapy.Spider):
    name = "ACCA"

    custom_settings = {
        'DOWNLOAD_DELAY': 60,
    }
    
    def start_requests(self):
        
        urls = []
                
        # range(1, 2) requests only page 1 (a test run); widen the range to crawl more pages
        for i in range(1,2):
            
            url = 'http://www.accaglobal.com/gb/en/member/find-an-accountant/directory-of-member/results.html?isocountry=GB&FirstName=&Surname=&Location=&inputcountrysuspended=&orgid=ACCA&orby=FNA&ipp=100&pn='+str(i)+'&hid=&requestcount='+str(i)
            urls.append(url)
        
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        
        full_name = response.xpath('//tr//@id | //tr//td[@data-column="Full name"]//text()').extract()
        town = response.xpath('//tr//@id | //tr//td[@data-column="Town"]//text()').extract()
        country = response.xpath('//tr//@id | //tr//td[@data-column="Country"]//text()').extract()
        admission_year = response.xpath('//tr//@id | //tr//td[@data-column="Admission year"]//text()').extract()
        certificates = response.xpath('//tr//@id | //tr//td[@data-column="Practicing certificates held"]//text()').extract()


        firm_full_name = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in full_name]
        firm_town = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in town]
        firm_country = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in country]
        firm_admission_year = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in admission_year]
        firm_certificates = [' '.join(a.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ').split()) for a in certificates]
        
        # Append this page's results to the module-level lists created in the run cell below
        full_fm_output.append(firm_full_name)
        full_tn_output.append(firm_town)
        full_co_output.append(firm_country)
        full_yr_output.append(firm_admission_year)
        full_ct_output.append(firm_certificates)
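
Each XPath query in `parse` returns one flat list that mixes `rowId` values with cell text, which has to be split apart again later. One alternative, sketched here under assumptions, is to walk the table row by row so that each member's fields stay together. The sketch uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet. The real page markup may differ, and `SAMPLE`, `cell_fragments`, `first` and `parse_rows` are hypothetical names, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the results table. The structure is an assumption
# inferred from the XPath queries; the names are placeholders.
SAMPLE = """
<table>
  <tr id="rowId-1">
    <td data-column="Full name">Example, Alice <span>ACCA</span></td>
    <td data-column="Town"></td>
    <td data-column="Country">Singapore</td>
    <td data-column="Admission year">2014</td>
  </tr>
  <tr id="rowId-2">
    <td data-column="Full name">Sample, Bob <span>FCCA</span></td>
    <td data-column="Town">dublin</td>
    <td data-column="Country">Ireland</td>
    <td data-column="Admission year">2016</td>
  </tr>
</table>
"""


def cell_fragments(row, column):
    """Return the non-empty, whitespace-normalised text fragments of one cell."""
    cell = row.find("td[@data-column='%s']" % column)
    if cell is None:
        return []
    cleaned = (" ".join(text.split()) for text in cell.itertext())
    return [text for text in cleaned if text]


def first(fragments):
    """Return the first fragment, or an empty string for an empty cell."""
    return fragments[0] if fragments else ""


def parse_rows(html):
    """Yield one dict per table row that carries an id attribute."""
    root = ET.fromstring(html)
    for row in root.iter("tr"):
        if row.get("id") is None:
            continue
        yield {
            "row_id": row.get("id"),
            "full_name": cell_fragments(row, "Full name"),
            "town": first(cell_fragments(row, "Town")),
            "country": first(cell_fragments(row, "Country")),
            "admission_year": first(cell_fragments(row, "Admission year")),
        }


records = list(parse_rows(SAMPLE))
print(records[0]["full_name"])
```

In the spider itself the same idea would more likely use Scrapy's own selectors, for example iterating over `response.xpath('//tr[@id]')` and running a relative XPath such as `row.xpath('.//td[@data-column="Town"]//text()')` on each row. That variant has not been tested against the live page.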

<h2>Running the Webscraping</h2>

Note that you can't re-run the code below within a single session: Scrapy runs on Twisted, whose reactor cannot be restarted once it has stopped, so you need to restart the kernel between runs.

This code creates a lightweight container for our web spider and then runs it. To be honest, understanding this is probably optional unless it breaks.

In [3]:
full_fm_output = []
full_tn_output = []
full_co_output = []
full_yr_output = []
full_ct_output = []

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

2017-09-23 11:36:21 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-09-23 11:36:21 [scrapy.utils.log] INFO: Overridden settings: {}
2017-09-23 11:36:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-09-23 11:36:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddle

[[u'rowId-1', u'(Zheng Qinglin), Teh Keng Lin', '', u'ACCA', '', u'rowId-2', u', Ajomale, Oluwatoromo, Yemi, Isaac', '', u'ACCA', '', u'rowId-3', u'@ Lee Shiang Yee, Lee Chong Chu', '', u'FCCA', '', u'rowId-4', u'A Abdullah, Shafiullah', '', u'ACCA', '', u'rowId-5', u'A Haleem Mohamed Ashqar,', '', u'ACCA', '', u'rowId-6', u'A Halim, Mohammad H', '', u'FCCA', '', u'rowId-7', u'A K, Afzal', '', u'ACCA', '', u'rowId-8', u'A K M Koushik Ahmed,', '', u'ACCA', '', u'rowId-9', u'A K M Saidur Rahman,', '', u'ACCA', '', u'rowId-10', u'A Kanapathy, Kamaleswaran', '', u'FCCA', '', u'rowId-11', u'A L Ashif Aboobacker,', '', u'ACCA', '', u'rowId-12', u'A Madduma Patabendige, T S', '', u'ACCA', '', u'rowId-13', u'A R B Thiruchchenthooran,', '', u'ACCA', '', u'rowId-14', u'A Rahim, Abdul Azim', '', u'ACCA', '', u'rowId-15', u'A Redha Abdulla Faraj,', '', u'FCCA', '', u'rowId-16', u'A Saiful Azlin B P Salin,', '', u'FCCA', '', u'rowId-17', u'A Singh, Gurwachan Kaur', '', u'ACCA', '', u'rowId-18', u'A

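We've now downloaded all the pages that we want to scrape. The first thing to do is to examine what we got back: the cell below flattens the per-page lists into one list per column and prints the first four entries of each.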

In [4]:
final_fm_output = []
final_tn_output = []
final_co_output = []
final_yr_output = []
final_ct_output = []

for i in range(len(full_fm_output)):
    for j in range(len(full_fm_output[i])):
        final_fm_output.append(full_fm_output[i][j].encode('ascii','replace'))
        
for i in range(len(full_tn_output)):
    for j in range(len(full_tn_output[i])):
        final_tn_output.append(full_tn_output[i][j].encode('ascii','replace'))

for i in range(len(full_co_output)):
    for j in range(len(full_co_output[i])):
        final_co_output.append(full_co_output[i][j].encode('ascii','replace'))
        
for i in range(len(full_yr_output)):
    for j in range(len(full_yr_output[i])):
        final_yr_output.append(full_yr_output[i][j].encode('ascii','replace'))
        
for i in range(len(full_ct_output)):
    for j in range(len(full_ct_output[i])):
        final_ct_output.append(full_ct_output[i][j].encode('ascii','replace'))

print(final_fm_output[0:4])
print(final_tn_output[0:4])
print(final_co_output[0:4])
print(final_yr_output[0:4])
print(final_ct_output[0:4])

['rowId-1', '(Zheng Qinglin), Teh Keng Lin', '', 'ACCA']
['rowId-1', '', 'rowId-2', 'dublin']
['rowId-1', 'Singapore', 'rowId-2', 'Ireland']
['rowId-1', '2014', 'rowId-2', '2016']
['rowId-1', '', 'rowId-2', '']
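
The printed lists show that every column interleaves `rowId-N` markers with the cell text belonging to that row. The next two cells recover the rows by recording the marker positions and slicing between them. The same grouping can be sketched in a single pass; `split_on_markers` is a hypothetical helper, not part of the notebook, and like the cells below it assumes no real value contains the text `rowId`.

```python
def split_on_markers(flat, marker="rowId"):
    """Group a flat list into one sub-list per marker, dropping the markers."""
    groups = []
    for item in flat:
        if marker in item:
            groups.append([])          # a marker starts a new group
        elif groups:
            groups[-1].append(item)    # everything else joins the current group
    return groups


# Sample values taken from the printed output above
towns = ['rowId-1', '', 'rowId-2', 'dublin']
countries = ['rowId-1', 'Singapore', 'rowId-2', 'Ireland']

print(split_on_markers(towns))      # [[''], ['dublin']]
print(split_on_markers(countries))  # [['Singapore'], ['Ireland']]
```

Applying one helper to all five columns would avoid repeating the same loop five times.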


In [6]:
row_index_fm = []
row_index_tn = []
row_index_co = []
row_index_yr = []
row_index_ct = []

for i in range(len(final_fm_output)):
    if "rowId" in final_fm_output[i]:
        row_index_fm.append(i)

for i in range(len(final_tn_output)):
    if "rowId" in final_tn_output[i]:
        row_index_tn.append(i)

for i in range(len(final_co_output)):
    if "rowId" in final_co_output[i]:
        row_index_co.append(i)

for i in range(len(final_yr_output)):
    if "rowId" in final_yr_output[i]:
        row_index_yr.append(i)

for i in range(len(final_ct_output)):
    if "rowId" in final_ct_output[i]:
        row_index_ct.append(i)

In [21]:
split_output_fm = []
split_output_tn = []
split_output_co = []
split_output_yr = []
split_output_ct = []

start = 0
for index in row_index_fm[1:]:
    split_output_fm.append(final_fm_output[start+1:index])
    start = index
    
split_output_fm.append(final_fm_output[start+1:])

start = 0
for index in row_index_tn[1:]:
    split_output_tn.append(final_tn_output[start+1:index])
    start = index
    
split_output_tn.append(final_tn_output[start+1:])

start = 0
for index in row_index_co[1:]:
    split_output_co.append(final_co_output[start+1:index])
    start = index
    
split_output_co.append(final_co_output[start+1:])

start = 0
for index in row_index_yr[1:]:
    split_output_yr.append(final_yr_output[start+1:index])
    start = index
    
split_output_yr.append(final_yr_output[start+1:])

start = 0
for index in row_index_ct[1:]:
    split_output_ct.append(final_ct_output[start+1:index])
    start = index
    
split_output_ct.append(final_ct_output[start+1:])

print(split_output_fm[0:5])
print(split_output_tn[0:5])
print(split_output_co[0:5])
print(split_output_yr[0:5])
print(split_output_ct[0:30])

[['(Zheng Qinglin), Teh Keng Lin', '', 'ACCA', ''], [', Ajomale, Oluwatoromo, Yemi, Isaac', '', 'ACCA', ''], ['@ Lee Shiang Yee, Lee Chong Chu', '', 'FCCA', ''], ['A Abdullah, Shafiullah', '', 'ACCA', ''], ['A Haleem Mohamed Ashqar,', '', 'ACCA', '']]
[[''], ['dublin'], ['petaling jaya'], ['abu dhabi'], ['dharga town']]
[['Singapore'], ['Ireland'], ['Malaysia'], ['United Arab Emirates'], ['Sri Lanka']]
[['2014'], ['2016'], ['2002'], ['2015'], ['2017']]
[[''], [''], [''], [''], [''], [''], [''], [''], [''], ['', '', 'Holds ACCA Practising Certificate', '', ''], [''], [''], [''], [''], [''], [''], [''], [''], [''], [''], ['', '', 'Holds ACCA Practising Certificate', '', ''], [''], [''], [''], ['', '', 'Holds ACCA Practising Certificate', '', ''], [''], [''], [''], [''], ['']]


In [32]:
final_fm = []   
for i in range(len(split_output_fm)):
    fm = []
    
    for j in range(len(split_output_fm[i])):
        if split_output_fm[i][j]:
            fm.append(split_output_fm[i][j].encode('ascii','replace'))
            
    final_fm.append(fm)

final_tn = []
for a in split_output_tn:
    if a:
        final_tn.append(a[0])
    else:
        final_tn.append("")

final_co = []
for a in split_output_co:
    if a:
        final_co.append(a[0])
    else:
        final_co.append("")

final_yr = []
for a in split_output_yr:
    if a:
        final_yr.append(int(a[0]))
    else:
        final_yr.append(None)

final_ct = []   
for i in range(len(split_output_ct)):
    ct = []
    
    for j in range(len(split_output_ct[i])):
        if split_output_ct[i][j]:
            ct.append(split_output_ct[i][j].encode('ascii','replace'))
            
    final_ct.append(ct)


final_nm = []
final_ty = []
for i in final_fm:
    final_nm.append(i[0])
    final_ty.append(i[1])

print(final_ty)

['ACCA', 'ACCA', 'FCCA', 'ACCA', 'ACCA', 'FCCA', 'ACCA', 'ACCA', 'ACCA', 'FCCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'FCCA', 'FCCA', 'ACCA', 'FCCA', 'ACCA', 'FCCA', 'FCCA', 'ACCA', 'FCCA', 'ACCA', 'FCCA', 'FCCA', 'ACCA', 'ACCA', 'FCCA', 'FCCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'FCCA', 'FCCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'FCCA', 'ACCA', 'ACCA', 'FCCA', 'ACCA', 'ACCA', 'ACCA', 'FCCA', 'ACCA', 'FCCA', 'FCCA', 'ACCA', 'ACCA', 'FCCA', 'ACCA', 'FCCA', 'FCCA', 'FCCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'FCCA', 'ACCA', 'FCCA', 'FCCA', 'ACCA', 'FCCA', 'ACCA', 'FCCA', 'ACCA', 'FCCA', 'FCCA', 'ACCA', 'ACCA', 'ACCA', 'ACCA', 'FCCA', 'FCCA', 'FCCA', 'ACCA', 'FCCA', 'FCCA', 'FCCA', 'ACCA', 'ACCA', 'FCCA', 'ACCA']


In [33]:
members_matrix_ACCA = []

for i in range(len(final_nm)):
    members_matrix_ACCA.append([final_nm[i],final_ty[i],final_tn[i],final_co[i],final_yr[i],final_ct[i]])


In [34]:
print(members_matrix_ACCA)

[['(Zheng Qinglin), Teh Keng Lin', 'ACCA', '', 'Singapore', 2014, []], [', Ajomale, Oluwatoromo, Yemi, Isaac', 'ACCA', 'dublin', 'Ireland', 2016, []], ['@ Lee Shiang Yee, Lee Chong Chu', 'FCCA', 'petaling jaya', 'Malaysia', 2002, []], ['A Abdullah, Shafiullah', 'ACCA', 'abu dhabi', 'United Arab Emirates', 2015, []], ['A Haleem Mohamed Ashqar,', 'ACCA', 'dharga town', 'Sri Lanka', 2017, []], ['A Halim, Mohammad H', 'FCCA', 'kuala lumpur', 'Malaysia', 2006, []], ['A K, Afzal', 'ACCA', 'malappuram district', 'India', 2016, []], ['A K M Koushik Ahmed,', 'ACCA', 'mithapukur', 'Bangladesh', 2014, []], ['A K M Saidur Rahman,', 'ACCA', 'dhaka', 'Bangladesh', 2016, []], ['A Kanapathy, Kamaleswaran', 'FCCA', 'bromley', 'United Kingdom', 2002, ['Holds ACCA Practising Certificate']], ['A L Ashif Aboobacker,', 'ACCA', 'kerala', 'India', 2017, []], ['A Madduma Patabendige, T S', 'ACCA', 'baulkham hills', 'Australia', 2016, []], ['A R B Thiruchchenthooran,', 'ACCA', 'colombo', 'Sri Lanka', 2014, []],

# Write one member record per line
with open('/home/de-admin/Documents/Webscraping/ACCA_Firms.txt', 'w') as f:
    for item in members_matrix_ACCA:
        f.write(str(item) + '\n')
