<h1>CIMA Webscraping Firms</h1>

This Jupyter notebook contains the code used to web-scrape CIMA firms.

<h2>Importing webscraping libraries</h2>

import scrapy
from scrapy.crawler import CrawlerProcess

<h2>Defining the Webscraping Spider</h2>

This code block defines what our web spider will do. Note that it doesn't run the spider, it only defines it. We extend the existing scrapy.Spider class rather than writing everything from scratch, so there is only a minimal amount of code to write.

We tested this code using one page:

https://www.cimaglobal.com/About-us/Find-a-CIMA-Accountant/?p=3&surname=&company=&city=&county=&country=United+Kingdom&funcspecialism=&sectorspecialism=&sortby=country&results=500#Results

CIMA is helpful enough to return as many search results per page as you ask for, but the real details are kept on each accountant's own page. We can therefore scrape the links easily, but we then have to request each page individually to get the member details.
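As an aside, the paginated search URLs can also be assembled from their query parameters rather than by string concatenation. The following is a minimal sketch, assuming Python 3 and a hypothetical helper named `search_url` (it is not part of the notebook); the parameter names are taken from the test URL above.

```python
from urllib.parse import urlencode  # Python 3; on Python 2, urlencode lives in urllib

BASE_URL = "https://www.cimaglobal.com/About-us/Find-a-CIMA-Accountant/"


def search_url(page, country="United Kingdom", results=500):
    """Build one search-results URL (hypothetical helper, not in the notebook)."""
    # A list of pairs keeps the parameters in the same order as the test URL.
    params = [
        ("p", page),
        ("surname", ""),
        ("company", ""),
        ("city", ""),
        ("county", ""),
        ("country", country),
        ("funcspecialism", ""),
        ("sectorspecialism", ""),
        ("sortby", "country"),
        ("results", results),
    ]
    # urlencode escapes the space in "United Kingdom" as "+".
    return BASE_URL + "?" + urlencode(params) + "#Results"


# Pages 1 to 3, matching range(1, 4) in the spider.
urls = [search_url(page) for page in range(1, 4)]
```

Keeping the parameters in one place makes it easier to change the country or the number of results per page without editing a long string.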

class MySpider(scrapy.Spider):
    name = "CIMA"

    custom_settings = {
        'DOWNLOAD_DELAY': 1,
    }
    
    def start_requests(self):
        
        urls = []
        
        for i in range(1,4):
            url = 'https://www.cimaglobal.com/About-us/Find-a-CIMA-Accountant/?p='+str(i)+'&surname=&company=&city=&county=&country=United+Kingdom&funcspecialism=&sectorspecialism=&sortby=country&results=500#Results'
            urls.append(url)
        
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
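    def parse(self, response):
        # Each search result has a button that links to the accountant's own page.
        f_links = response.xpath('//ul[@class="accountantListing"]//p[@class="accountantListing-button"]//a//@href').extract()

        # Collapse whitespace in each link; split() with no arguments already handles newlines, tabs and carriage returns.
        firm_links = [' '.join(a.split()) for a in f_links]

        # full_lk_output is a module-level list created in the "Running the Webscraping" cell.
        full_lk_output.append(firm_links)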

<h2>Running the Webscraping</h2>

Note that you can't re-run the code below within a single session: Scrapy runs on the Twisted reactor, which cannot be restarted once it has stopped, so you need to restart the kernel between runs.

This code creates a lightweight container for our web spider and then runs it. Understanding the details is probably optional unless it breaks.

full_lk_output = []

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

We've now downloaded all the pages that we want to scrape. The first thing to do is to examine what we got back.

print(full_lk_output[0][1])

final_lk_output = []

for i in full_lk_output:
    for j in i:
        final_lk_output.append(j.encode('ascii','replace'))

import pickle

pickle.dump(final_lk_output, open('/home/de-admin/Documents/Webscraping/CIMA_links.p','wb'))
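The flatten-and-save step above can also be written with `itertools.chain` and `with` blocks, which close the file handles automatically. This is only a sketch under stated assumptions: it targets Python 3, the example links and the temporary file path are made up for illustration, and it skips the ASCII encoding step used in the notebook.

```python
import itertools
import os
import pickle
import tempfile

# Made-up stand-in for full_lk_output: one inner list of links per results page.
pages = [
    ["http://example.com/a/", "http://example.com/b/"],
    ["http://example.com/c/"],
]

# Flatten the list of lists into one list of links.
links = list(itertools.chain.from_iterable(pages))

# Write to a temporary location (an assumption; the notebook uses a fixed path).
path = os.path.join(tempfile.mkdtemp(), "links.p")
with open(path, "wb") as fh:
    pickle.dump(links, fh)

# Read the links back, as the second half of the notebook does.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```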

Now that we've got all the links, let's get the data.

In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

In [7]:
import pickle

CIMA_Links = pickle.load(open('/home/de-admin/Documents/Webscraping/CIMA_links.p','rb'))

print(str(CIMA_Links[0:2]))
print(len(CIMA_Links))

['http://www.cimaglobal.com/About-us/Find-a-CIMA-Accountant/Blue-Osprey-Limited-10507/', 'http://www.cimaglobal.com/About-us/Find-a-CIMA-Accountant/Tina-Murray-9869/']
1473


In [3]:
class MySpider(scrapy.Spider):
    name = "CIMA"

    #custom_settings = {
    #    "RETRY_TIMES" : 10,
    #    "RETRY_HTTP_CODES":[500, 503, 504, 400, 403, 404, 408],
    #    "DOWNLOADER_MIDDLEWARES":{'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,'scrapy_proxies.RandomProxy': 100,'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110},
    #    "PROXY_LIST":'/home/de-admin/Documents/Webscraping/proxy_list.txt',
    #    "PROXY_MODE": 0
    #}
    
    custom_settings = {
        'DOWNLOAD_DELAY': 0.5,
    }
    
    def start_requests(self):

        for url in CIMA_Links[30:32]:
            yield scrapy.Request(url=str(url), callback=self.parse)
            
    def parse(self, response):
        f_details = response.xpath('//div[@class="searchResultDetails-details wysiwyg"]//dt//text() | //div[@class="searchResultDetails-details wysiwyg"]//dd//text()').extract()
        f_specs = response.xpath('//div[@class="searchResultDetails-details wysiwyg"]//h3//text() | //div[@class="searchResultDetails-details wysiwyg"]//li//text()').extract()
                      
        full_sp_output.append(f_specs)
        full_dt_output.append(f_details)

In [4]:
full_sp_output = []
full_dt_output = []

process = CrawlerProcess()
process.crawl(MySpider)
process.start()

2017-09-29 17:21:36 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-09-29 17:21:36 [scrapy.utils.log] INFO: Overridden settings: {}
2017-09-29 17:21:36 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-09-29 17:21:36 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddle

In [5]:
print(full_sp_output)
print(full_dt_output)

[[u'Functional specialisms', u'Book-keeping/accounting', u'Business planning/budgeting', u'Business process improvement', u'Business tax planning/advice', u'Cash flow management/treasury', u'Company secretarial matters', u'Contribution/profit analysis', u'Cost reduction', u'Costing/accounting systems', u'Interim management', u'Internal audit/risk analysis', u'Management performance reports', u'Payroll/NI/PAYE administration', u'Personal tax planning/advice', u'Sector specialisms', u'Transportation/automotive', u'Retail services', u'Manufacturing', u'Hotel/catering/travel', u'Food/drink', u'Distribution'], [u'Functional specialisms', u'Book-keeping/accounting', u'Business planning/budgeting', u'Business process improvement', u'Business tax planning/advice', u'Contribution/profit analysis', u'Cost reduction', u'Costing/accounting systems', u'Management performance reports', u'Payroll/NI/PAYE administration', u'Personal tax planning/advice']]
[[u'Company Name:', u'Wahid Ahmed & Co Ltd', u
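The raw output above is a flat list of text nodes, so a next step is to give it some structure. The sketch below is one possible approach, not the notebook's own method: `group_by_heading` and `pair_labels` are hypothetical helpers, the two headings are taken from the printed output, and treating any entry that ends in a colon as a label is an assumption based on the single `Company Name:` entry visible above.

```python
HEADINGS = ("Functional specialisms", "Sector specialisms")


def group_by_heading(items, headings=HEADINGS):
    """Group specialism entries under the heading that precedes them."""
    grouped = {}
    current = None
    for item in items:
        text = item.strip()
        if text in headings:
            current = text
            grouped[current] = []
        elif current is not None and text:
            grouped[current].append(text)
    return grouped


def pair_labels(items):
    """Pair each 'Label:' entry with the text that follows it (assumed layout)."""
    paired = {}
    label = None
    for item in items:
        text = item.strip()
        if not text:
            continue
        if text.endswith(":"):
            label = text[:-1]  # drop the trailing colon
            paired[label] = []
        elif label is not None:
            paired[label].append(text)
    # Join multi-part values (for example, several lines of one field) with spaces.
    return {key: " ".join(values) for key, values in paired.items()}


# Sample values copied from the output above.
specs = ["Functional specialisms", "Book-keeping/accounting", "Cost reduction",
         "Sector specialisms", "Manufacturing"]
grouped = group_by_heading(specs)
details = pair_labels(["Company Name:", "Wahid Ahmed & Co Ltd"])
```

Applied to `full_sp_output` and `full_dt_output`, each inner list would yield one dictionary per firm, which is easier to inspect than the flat lists.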

In [None]:
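# For each page i, record where each name in full_nm_output[i] appears in the
# address, company, website and email lists for that page.
# Note: the full_*_output lists used here are not created earlier in this notebook.
row_index_ad = []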
row_index_co = []
row_index_wb = []
row_index_em = []

for i in range(len(full_nm_output)):
    ad = []
    for j in full_nm_output[i]:
        if j.encode('ascii','replace') in full_ad_output[i]:
            ad.append(full_ad_output[i].index(j.encode('ascii','replace')))
    row_index_ad.append(ad)


for i in range(len(full_nm_output)):
    co = []
    for j in full_nm_output[i]:
        if j.encode('ascii','replace') in full_co_output[i]:
            co.append(full_co_output[i].index(j.encode('ascii','replace')))
    row_index_co.append(co)
    

for i in range(len(full_nm_output)):
    wb = []
    for j in full_nm_output[i]:
        if j.encode('ascii','replace') in full_wb_output[i]:
            wb.append(full_wb_output[i].index(j.encode('ascii','replace')))
    row_index_wb.append(wb)


for i in range(len(full_nm_output)):
    em = []
    for j in full_nm_output[i]:
        if j.encode('ascii','replace') in full_em_output[i]:
            em.append(full_em_output[i].index(j.encode('ascii','replace')))
    row_index_em.append(em)

In [None]:
split_output_nm = []
split_output_ad = []
split_output_co = []
split_output_wb = []
split_output_em = []

for i in range(len(full_nm_output)):
    for j in range(len(full_nm_output[i])):
        split_output_nm.append(full_nm_output[i][j].encode('ascii','replace'))

for i in range(len(row_index_ad)):
    start = 0
    for index in row_index_ad[i][1:]:
        split_output_ad.append(full_ad_output[i][start+1:index])
        start = index
    
    split_output_ad.append(full_ad_output[i][start+1:])

for i in range(len(row_index_co)):
    start = 0
    for index in row_index_co[i][1:]:
        split_output_co.append(full_co_output[i][start+1:index])
        start = index
    
    split_output_co.append(full_co_output[i][start+1:])

for i in range(len(row_index_wb)):
    start = 0
    for index in row_index_wb[i][1:]:
        split_output_wb.append(full_wb_output[i][start+1:index])
        start = index
    
    split_output_wb.append(full_wb_output[i][start+1:])

for i in range(len(row_index_em)):
    start = 0
    for index in row_index_em[i][1:]:
        split_output_em.append(full_em_output[i][start+1:index])
        start = index
    
    split_output_em.append(full_em_output[i][start+1:])

print(len(split_output_nm),split_output_nm[0:3])
print(len(split_output_ad),split_output_ad[0:3])
print(len(split_output_co),split_output_co[0:3])
print(len(split_output_wb),split_output_wb[0:3])
print(len(split_output_em),split_output_em[0:3])
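The four loops in the cell above follow one pattern: a flat list is cut into chunks at the positions where a firm name appears, and the name itself is dropped from each chunk. A minimal sketch of that pattern as a single function follows. The name `split_at_markers` and the sample data are made up for illustration, and unlike the loops above it finds marker positions with `enumerate` rather than `list.index`, so it does not assume that the first marker sits at index 0.

```python
def split_at_markers(flat, markers):
    """Split `flat` into one chunk per marker, excluding the marker itself."""
    marker_set = set(markers)
    # Positions in `flat` where a marker (for example, a firm name) occurs.
    positions = [k for k, item in enumerate(flat) if item in marker_set]
    chunks = []
    # Each chunk runs from just after one marker up to the next marker,
    # and the last chunk runs to the end of the list.
    for start, end in zip(positions, positions[1:] + [len(flat)]):
        chunks.append(flat[start + 1:end])
    return chunks


# Made-up example: three firm names, the second with no address lines.
flat = ["Firm A", "1 High St", "Leeds", "Firm B", "Firm C", "2 Low Rd"]
records = split_at_markers(flat, ["Firm A", "Firm B", "Firm C"])
```

A firm with no entries produces an empty chunk, so the result stays aligned with the list of names.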

In [None]:
split_output_nm2 = []
split_output_ql = []
split_output_ql2 = []

for i in split_output_nm: 
    split_output_nm2.append(i.split(",")[0])
    split_output_ql.append(i.split(",")[1].split(" "))

for i in split_output_ql:
    sq = []
    for j in range(len(i)):
        if i[j]:          
            sq.append(i[j])
    split_output_ql2.append(sq)

split_output_ad2 = []

for i in range(len(split_output_ad)):
    ad = []
    for j in range(len(split_output_ad[i])):
        ad.append(split_output_ad[i][j].encode('ascii','replace'))
    split_output_ad2.append(ad)

split_output_co2 = []

for i in range(len(split_output_co)):
    if split_output_co[i]:
        for j in range(len(split_output_co[i])):
            split_output_co2.append(split_output_co[i][j].encode('ascii','replace'))  
    else:
        split_output_co2.append("")

split_output_em2 = []

for i in range(len(split_output_em)):
    if split_output_em[i]:
        for j in range(len(split_output_em[i])):
            split_output_em2.append(split_output_em[i][j].encode('ascii','replace'))
    else:
        split_output_em2.append("")

split_output_wb2 = []

for i in range(len(split_output_wb)):
    if split_output_wb[i]:
        wb = ""
        for j in range(len(split_output_wb[i])):
            wb += split_output_wb[i][j].encode('ascii','replace')
        split_output_wb2.append(wb)                
    else:
        split_output_wb2.append("")

firms_matrix_IFA = []

for i in range(len(split_output_nm)):
    firms_matrix_IFA.append([split_output_nm2[i],split_output_ql2[i],split_output_ad2[i],split_output_co2[i],split_output_wb2[i],split_output_em2[i]])

In [None]:
print(firms_matrix_IFA[0])
print(len(firms_matrix_IFA))

In [None]:
final_matrix_IFA = []
for i in firms_matrix_IFA:
    if i not in final_matrix_IFA:
        final_matrix_IFA.append(i)

In [None]:
print(final_matrix_IFA[0])
print(len(final_matrix_IFA))
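The de-duplication loop above compares each row against every row already kept, which gets slow as the list grows. One alternative, sketched here under assumptions, is to track rows already seen in a set. Because each row contains nested lists, which cannot be hashed, the sketch first converts them to tuples. `freeze` and `dedupe` are hypothetical helper names and the sample rows are made up.

```python
def freeze(value):
    """Recursively turn lists into tuples so the value can go in a set."""
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value


def dedupe(rows):
    """Return rows with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = freeze(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows in the same shape as firms_matrix_IFA: strings and nested lists.
rows = [
    ["A Firm", ["ACMA"], ["1 High St"]],
    ["B Firm", ["FCMA"], []],
    ["A Firm", ["ACMA"], ["1 High St"]],
]
unique_rows = dedupe(rows)
```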