# Scraping UK asylum data (27th October 2021)

This notebook scraps the tribunal decision's data related to asylum applications from https://tribunalsdecisions.service.gov.uk/

In particular, the notebook:

1. Defines the needed web scraping functions.

2. Scraps the contents of tribunalsdecisions.service.gov.uk/utiac . The scraping strategy consists of three steps:
    - First, launching a search session and scraping the general information obtained in the 1778 pages of results (using selenium to navigate through the results). A list of 35299 urls is obtained.
    - Second, accessing each of the 35299 urls ans scraping all the detailed information available.
    - Third, downloading the word (doc/docx) document with the judicial decision.
    
3. Stores the scraped material in a list of dictionaries where each dictionary contains all the data scraped for a given judicial decision. The resulting data set is serialised in json and pickle. Json (jsonData.json) and a picle (pickleData.pkl) objects are created.

This notebook should run in the tfm environment, which can be created with the environment.yml file.

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import re
import json
import pickle
import pandas as pd
import whois
import sys
import datetime
from tqdm import tqdm
from datetime import timedelta
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import *
from selenium.common.exceptions import TimeoutException
import urllib.request
import wget
import concurrent.futures
import tqdm

import sys
IN_COLAB = 'google.colab' in sys.modules


# What environment am I using?
print(f'Current environment: {sys.executable}')

# Change the current working directory
os.chdir('/Users/albertamurgopacheco/Documents/GitHub/TFM')
# What's my working directory?
print(f'Current working directory: {os.getcwd()}')


Current environment: /Users/albertamurgopacheco/anaconda3/envs/tfm/bin/python
Current working directory: /Users/albertamurgopacheco/Documents/GitHub/TFM


In [2]:
# Define working directories in colab and local execution

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/gdrive')
    docs_path = '/content/gdrive/MyDrive/TFM/data/raw'
    input_path = '/content/gdrive/MyDrive/TFM'
    output_path = '/content/gdrive/MyDrive/TFM/output'

else:
    docs_path = './data/raw'
    input_path = '.'
    output_path = './output'

# Scraping functions

Define a function to scrap the general data in the UIATC landing page using beautifulSoup.

More specifically, the general data refers to: 1) the url pointing to a page with detailed information for each sentence, and 2) the date of the judicial sentence.

In [3]:
def getData(htmlSource):
    """
    getData gets the mouse over links to the tribunal decisions & the dates

    :param htmlSource: Source HTML for the page 
    :return: data as a list of tuples
    """
    
    # Scraping tribunal decision names with BeautifulSoup
    soup = BeautifulSoup(htmlSource, 'html')
    
    # Scraping mouse-over urls
    linksList = re.findall(r'(<a href="/utiac/(.*?)">)', htmlSource)
    linksList = [i[1] for i in linksList]
    linksList = list(set(linksList))
    #print("Number of links:",len(linksList))
    #print(linksList)
    
    # Scraping dates

    # Find by class the dates and store in list
    datesList = list(soup.find_all("td", class_="date"))
    # Convert the list elements to string  
    datesList = [str(i) for i in datesList]
    # Slice the part of string including data (date format yyyy-mm-dd)
    datesList = [i[33:43] for i in datesList]
    #print(datesList)
    #print("Number of dates %s",len(datesList))

    # Assign data to tuples: # get the list of tuples from two lists and merge them by using zip()
    tuplesList = list(zip(linksList, datesList))
    
    return tuplesList

Define a function to scrap the detailed data in each of the decision's pages. The functions uses the library requests to make a direct call from each url.



In [5]:
def getDetailedData(url):
    """
    getDetailedData gets the detailed data linked to a tribunal decision
    and saves a doc file in /Users/albertamurgopacheco/Documents/GitHub/TFM/data/raw

    :param url: url (link) to the page containing the detailed info 
    :return: dictionary 
    """
    
    # START WITH URL
    try:
        response = requests.get(url = url)
        
        # if response status code is 200 OK, then
        if response.status_code == 200:
        # load the data
            data = response.text
            soup = BeautifulSoup(data, 'html')
    
            # Scrape the reference number
            refList = list(soup.find_all("h1"))
            # Convert the list elements to string  
            refList = [str(i) for i in refList]
            # Remove leading <h1> and trailing </h1>
            refList = [i.replace('</h1>', '') for i in refList]
            refList = [i.replace('<h1>', '') for i in refList]
            #print(refList)
    
            # Find the link (docLink) to the document
            lnk = re.findall(r'(<a class="doc-file" href="https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/(.*?)")', data)
            # Build link using the second element in regex result (list of tuples)
            docLink = "https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/" + lnk[0][1]
            # Download files to raw folder
            try:
                filename = wget.download(url = docLink, out = docs_path)
                downloaded = "Yes"
            # Handle download exceptions 
            except Exception as err:
                print("Could not download file {}".format(docLink))
                print(err)
                downloaded = "No"
                pass
     
            # Find detailed information
            res = [item.get_text() for item in soup.select("span")]
            # Remove \xa0 from strings
            res = [elem if '\xa0' not in elem else elem.replace('\xa0', '') for elem in res]
            # Remove trailing and leading spaces and \n
            res = [elem.strip() for elem in res]
            #print(res)

            # Split list of results into two lists (keys & values)
            keysList = res[::2] # Keys: Elements from res starting from 0 iterating by 2
            valuesList = res[1::2] # Values: Elements from res starting from 1 iterating by 2
            #print(keysList)
            #print(valuesList)
    
            # Create dictionary with results (resDict)
            zip_iterator = zip(keysList, valuesList)
            resDict = dict(zip_iterator)
    
            # Add reference number and link to document to the dictionary
            resDict["Document"] = docLink
            resDict["Reference"] = refList
            resDict["Download"] = downloaded
            resDict["File"] = lnk[0][1]
            #print(resDict)
            
        else:
            resDict = {"URL not working:": str(url)}
            print(f"URL not working: {url}")
            
    except requests.exceptions.RequestException as e:  # Capture exceptions
        print (e.response.text)
        raise SystemExit(e)   
    
    return resDict  

# Web scraping

Once the necessary functions have been defined, open a firefox browser.

In [6]:
# Using selenium, open the tribunal decision's website in firefox
driver = webdriver.Firefox()
driver.get("https://tribunalsdecisions.service.gov.uk/")

# Getting current URL source code 
get_title = driver.title 
  
# Printing the title of the URL 
print(get_title) 
assert "Tribunal decisions" in driver.title

# Getting current URL source code 
get_source = driver.page_source
time.sleep(2)
#print(get_source)

Tribunal decisions


Start scraping some general information (url and date) for each sentence. There are 1778 pages to go through. The urls will be used to scrap the detailed information. The date is scraped for discrimination purposes in case different sentences shared the same url.

In [7]:
# Scrap current page data and browse to next page

# List of tuples to store the results from getData()
a = []

#while True:
i=1
while i<1178:

    # Getting current URL source code 
    get_source = driver.page_source
    # Scrape the data
    b = getData(get_source)
    # Append list data b to list data a
    a += b
    i+=1
    
    # Click on next page
    try:
        delay = 15 # seconds
        #element_present = EC.presence_of_element_located((By.CLASS_NAME, 'next_page'))
        #WebDriverWait(driver, delay).until(element_present)
        
        wait = WebDriverWait(driver, delay)
        element = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'next_page')))
        element.click()
            # wait       
    except TimeoutException:
        print("Loading took too much time!")
        break


Prepare the list of urls to iterate with the scraping function.

In [13]:
# List of links with the decision files
decisionLinks = [tple[0] for tple in a]

# Number of urls to scrap detailed data
print(f'Number of urls to scrap detailed data from: {len(decisionLinks)}')

# Create a list of urls from links
urls = [ "https://tribunalsdecisions.service.gov.uk/utiac/"+decision for decision in decisionLinks]

# Attempted download item urls[35072] crashes the loop
# https://tribunalsdecisions.service.gov.uk/utiac/2003-ukiat-7478 (no doc available)
# print(urls.index('https://tribunalsdecisions.service.gov.uk/utiac/2003-ukiat-7478'))
urls.pop(35072)
# print(urls.index('https://tribunalsdecisions.service.gov.uk/utiac/hu-02724-2015'))
urls.pop(16561)
print(f'New number of urls to scrap detailed data from: {len(urls)}')
print(decisionLinks[-1:])


Number of urls to scrap detailed data from: 35310
New number of urls to scrap detailed data from: 35310
['2002-ukiat-702']


Scraping the detailed data for each of the tribunal decission while also downloading the word document of each sentence.

In [9]:
# List of dict where each dict contains scraped detailed data
scrapedList = []

# Scrap detailed data from all urls
for url in urls:
    scrapedItem = getDetailedData(url)
    #print(scrapedItem)
    scrapedList.append(scrapedItem)

AttributeError: 'NoneType' object has no attribute 'text'

In [28]:
# Number of scraped court decisions
print(f'A total of {len(scrapedList)} elements have been scraped')

# Number of documents scraped

print(scrapedList[0])

A total of 35298 elements have been scraped
{'Case title:': '', 'Appellant name:': '', 'Status of case:': 'Unreported', 'Hearing date:': '16 Aug 2021', 'Promulgation date:': '11 Oct 2021', 'Publication date:': '26 Oct 2021', 'Last updated on:': '26 Oct 2021', 'Country:': '', 'Judges:': '', 'Document': 'https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/73734/HU049642018.doc', 'Reference': ['HU/04964/2018'], 'Download': 'Yes', 'File': '73734/HU049642018.doc'}


In [42]:
# Function to fix the name of the File
def fix_File(string):
    """
    Given a string of dictionaries search_dictionariesgets obtains the dictionary matching a key/value pair
    :string: a string incorporating path and file name with extension
    :return: clean string without path/extension
    """
    head, tail = os.path.split(string)
    file_name, file_ext = os.path.splitext(tail)
    return file_name

# Function to fix each value of the key File 
def update_File(dicc):
    """
    Given a dictionary apply function fix_File() to key: File
    :dicc: a pyhon dict with key File
    :return: updated python dict
    """
    val = dicc.get('File')
    new_val = fix_File(val)
    dicc.update({'File': new_val})
    return dicc

# Fix the values of {File} key in each dictionary
[update_File(x) for x in scrapedList]

[{'Case title:': '',
  'Appellant name:': '',
  'Status of case:': 'Unreported',
  'Hearing date:': '16 Aug 2021',
  'Promulgation date:': '11 Oct 2021',
  'Publication date:': '26 Oct 2021',
  'Last updated on:': '26 Oct 2021',
  'Country:': '',
  'Judges:': '',
  'Document': 'https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/73734/HU049642018.doc',
  'Reference': ['HU/04964/2018'],
  'Download': 'Yes',
  'File': 'HU049642018'},
 {'Case title:': '',
  'Appellant name:': '',
  'Status of case:': 'Unreported',
  'Hearing date:': '24 Aug 2021',
  'Promulgation date:': '6 Oct 2021',
  'Publication date:': '22 Oct 2021',
  'Last updated on:': '22 Oct 2021',
  'Country:': '',
  'Judges:': '',
  'Document': 'https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/73683/PA011022019.doc',
  'Reference': ['PA/01102/2019'],
  'Download': 'Yes',
  'File': 'PA011022019'},
 {'Case title:': '',
  'Appellant name:': '',
  'Status of case:': 'Unreported',
  'Heari

Could not download file https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/51943/AA055522014.doc
HTTP Error 500: Internal Server Error
Could not download file https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/40081/IA083642010___IA083692010___IA083752010.DOC
HTTP Error 403: Forbidden

https://tribunalsdecisions.service.gov.uk/utiac/2003-ukiat-7478

urls[35014]

In [43]:
# Save as a json file jsonData in data directory
with open('./data/jsonData.json', 'w') as fout:
    json.dump(scrapedList, fout)


# Open json
#parsed = json.loads(jsonData)
# Test
# print(json.dumps(parsed[30000], indent = 4, sort_keys = True))


In [1]:
# Save as a pickle
with open('./data/pickleData.pkl', 'wb') as f:
    pickle.dump(scrapedList, f, protocol = pickle.HIGHEST_PROTOCOL)

# Open pickle file
with open('./data/pickleData.pkl', 'rb') as f:
    d = pickle.load(f)

FileNotFoundError: [Errno 2] No such file or directory: './data/pickleData.pkl'

In [10]:
# Open jsonData file
jsonData_path = os.path.join(os.getcwd(), 'jsonData.json')
with open(jsonData_path) as json_file:
    data = json.load(json_file)
print(data[0])

{'Case title:': '', 'Appellant name:': '', 'Status of case:': 'Unreported', 'Hearing date:': '10 Aug 2021', 'Promulgation date:': '17 Sep 2021', 'Publication date:': '4 Oct 2021', 'Last updated on:': '4 Oct 2021', 'Country:': '', 'Judges:': '', 'Document': 'https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/73573/PA027032020.doc', 'Reference': ['PA/02703/2020'], 'Download': 'Yes'}
