### Scrape responsibly! 
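
One way to act on this is to check a site's robots.txt rules before fetching anything. The sketch below is a minimal illustration using the standard library's `urllib.robotparser`. The rules, the `example-bot` user agent and the example.com URLs are all made up for the example and are not taken from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real run would load them from the site.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]


def build_parser(lines):
    """Build a robots.txt parser from an iterable of rule lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


parser = build_parser(EXAMPLE_RULES)

# Ask whether a (hypothetical) user agent may fetch each URL.
allowed = parser.can_fetch("example-bot", "https://example.com/news/enforcement-actions")
blocked = parser.can_fetch("example-bot", "https://example.com/private/data")

# Crawl-delay, if the site declares one, is the pause to leave between requests.
delay = parser.crawl_delay("example-bot")
```

In a live run the parser would typically be pointed at the site's own robots.txt with `set_url()` and `read()`, and the scraper would sleep for the declared delay between requests.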



#### Function to scrape enforcement actions by the Central Bank of Ireland.
The actual target URL is not listed below. The target webpage currently contains 13 pages of enforcement actions. Apart from the last page, each page has 10 embedded links. Each link contains information on an enforcement action or settlement. 

The main function below focuses on the script text behind these links and extracts only the entities that were subject to an enforcement action. The page relies heavily on JavaScript, which is usually the domain of Selenium, but the target information is plain text inside a script tag, so Requests and Beautiful Soup were enough to extract it.

#### The Python function of interest below is get_enforcement_list(). It uses the following Python libraries:
1. Requests fetches the HTML and CSS from the target URL. (https://pypi.org/project/requests/)
2. Beautiful Soup parses the HTML, the CSS and the text inside the JavaScript. (https://pypi.org/project/beautifulsoup4/)
3. spaCy uses its NER models to find named entities in the extracted text that are potentially of interest. (https://spacy.io/)

Once called, the function produces a list of organisations or persons that were subject to an enforcement or settlement action. All names were publicly available at the time of the scrape. The function took just under four seconds to scrape the entire contents of the webpage, parse the scraped code, look through the thirteen pages of interest and produce the list. Overall the NER models worked well. There were a few misses that could be refined and cleaned up, and some entities were merged into one, e.g. the regulator and the target entity. The list is at the bottom of the notebook.
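
The clean-up mentioned above could look something like the following sketch. It decodes leftover HTML entities, strips a leading regulator name from merged entities and drops exact duplicates. The helper names `clean_entity` and `clean_entities` and the prefix list are assumptions for illustration; the sample names are taken from the output at the bottom of the notebook.

```python
import html

# Assumption: these prefixes mark entities that were merged with the regulator's name.
REGULATOR_PREFIXES = (
    "the Central Bank of Ireland and ",
    "Central Bank of Ireland and ",
)


def clean_entity(name):
    """Decode HTML entities and drop a leading regulator prefix."""
    # Unescape until the text stops changing, since some names are double-encoded.
    previous = None
    while previous != name:
        previous, name = name, html.unescape(name)
    for prefix in REGULATOR_PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    return name.strip()


def clean_entities(names):
    """Clean each name and drop exact duplicates, keeping first-seen order."""
    return list(dict.fromkeys(clean_entity(n) for n in names))


raw = [
    "St. Canice&#39;s Kilkenny Credit Union Limited",
    "Central Bank of Ireland and Merrion Stockbrokers Limited",
    "Ulster Bank",
    "Ulster Bank",
]
cleaned = clean_entities(raw)
```

This only handles the patterns visible in the printed list. Split entities such as "Ulster Bank" followed by "Ireland Limited" would still need manual review.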



##### May 2019

 

In [1]:
#Libraries
import pandas as pd
import requests
import bs4
from bs4 import BeautifulSoup
import numpy as np
import re 
import spacy
import time 

In [2]:
#Functions

#Scrape with requests and Soup
def scrape_url():
    global scripts
    url = 'https://www....legal-notices/enforcement-actions'
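    r = requests.get(url, timeout=30)
    r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
    # Name the parser explicitly so results do not depend on which parsers are installed.
    soup = BeautifulSoup(r.text, "html.parser")
    scripts = soup.find_all('script')  # find_all is the current name for findAll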
 
  
#Extract strings of interest    
def parse_script(scripts):
    global entities  # read later by get_enforcement_list()
    entities = []
    i = 0  # search offset into the script text
    for script in scripts:
        if "var documentListTableContent" in script.text:
            s = script.text.strip()
            for _ in range(130):
                # Take the text from the next 'Public' up to the following 'url'.
                start = s.find('Public', i, 56500)
                end = s.find("url", start)
                entities.append(s[start:end])
                i += 500  # step the search offset forward by a fixed 500 characters

#Main function: sends get request, extracts all code/data and uses spaCy to target any named entities of interest.                
def get_enforcement_list():
    start = time.time()
    scrape_url()
    parse_script(scripts)
    offenders = entities[0:113]
    str1 = ''.join(offenders)
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(str1)
    # Boilerplate phrases and regulator names that the NER model also picks up.
    exclude = {'Public Statement relating to Settlement Agreement', '1', '2',
               'Central Bank', 'Settlement Agreement', 'the Central Bank of Ireland',
               'the Central Bank', 'the Financial Regulator', 'Ireland'}
    print('LIST OF ENTITIES SUBJECT TO ENFORCEMENT ACTIONS/SETTLEMENTS')
    print('')
    for ent in doc.ents:
        if ent.text not in exclude:
            print(ent.text)
    end = time.time()  # taken after the loop so it is set even if nothing is printed
    print('')
    print("Run time in seconds :", (end - start))
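
The fixed 500-character stride in parse_script() may revisit or skip entries when titles vary in length. A possible alternative, sketched here on a made-up sample string, is to let a regular expression find every span between 'Public' and the next 'url'. The helper name `extract_titles` and the sample text are assumptions for illustration only, since the real script content is not shown in this notebook.

```python
import re


def extract_titles(script_text):
    """Return each span that starts at 'Public' and stops before the next 'url'."""
    # Non-greedy match, so each span ends at the first 'url' after its 'Public'.
    pattern = re.compile(r"Public.*?(?=url)", re.DOTALL)
    # Trim the quote and comma characters left over from the surrounding script.
    return [m.group(0).strip(' ",') for m in pattern.finditer(script_text)]


# Hypothetical sample, loosely modelled on the delimiters used in parse_script().
sample = (
    'var documentListTableContent = ['
    '{"title":"Public Statement: Example Bank Limited","url":"/a.pdf"},'
    '{"title":"Public Statement: Sample Credit Union","url":"/b.pdf"}];'
)
titles = extract_titles(sample)
```

Because each match starts where the previous one ended, every entry is visited once regardless of its length.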
                

In [3]:
#Call main function  

get_enforcement_list()

LIST OF ENTITIES SUBJECT TO ENFORCEMENT ACTIONS/SETTLEMENTS

Bank of Montreal
RSA Insurance
Tom McMenamin
E-Services and Communications Credit Union Limited
Citibank Europe
St. Canice&#39;s Kilkenny Credit Union Limited
Appian Asset Management Limited
BCWM
Michael P. Walsh
B.C.P. Asset Management Designated Activity Company
Central Bank of Ireland and Merrion Stockbrokers Limited
Robert Moynihan
Lorna Heffernan Finance Limited
Lupton Financial Services Limited
Lupton &amp;amp
Co. Financial Services
Albert Reilly
a Albert Reilly Insurance and Financial Consultants
Intesa Sanpaolo Life
Bank of Ireland
Allied Irish Bank
Drimnagh Credit Union
the Central Bank of Ireland and Bray Credit Union Limited
Springboard Mortgages Limited
Springboard
Ulster Bank
Ulster Bank
Ireland Limited
Capita Life and Pensions Services
KBC Bank
New Ireland Assurance Company
The Mortgage Centre
Arch Reinsurance Europe
Octagon Online Services Limited
the Central Bank of Ireland and Lambay Capital Limited
Michael H