# Information extraction (28th October 2021)

This notebook extracts additional information from the text of the tribunal decisions and stores it in the relevant dictionary.

In particular, the notebook performs information extraction on:

1. The label included in the name of the file.

2. The court where the case was heard ("Heard at").

3. The judges.

4. The legal representation for the appellant and the respondent.

5. The decision/ruling by the judge.

Each of these fields is added to the dictionary of each judicial decision.

The resulting data set - a list of updated dictionaries -  is serialised as a json object (jsonDataFinal.json).

This notebook should run in the tfm environment, which can be created with the environment.yml file.

In [38]:
from os import listdir
from os.path import isfile, join, getsize
import numpy as np
import re
import json
import pickle
import pandas as pd
import whois
import sys
import datetime
from tqdm import tqdm
import textract
import re
from pprint import pprint

from nltk.tokenize import sent_tokenize, word_tokenize
import stanza
import spacy

import sys
IN_COLAB = 'google.colab' in sys.modules


# What environment am I using?
print(f'Current environment: {sys.executable}')

# Change the current working directory
os.chdir('/Users/albertamurgopacheco/Documents/GitHub/TFM')
# What's my working directory?
print(f'Current working directory: {os.getcwd()}')


Current environment: /Users/albertamurgopacheco/anaconda3/envs/tfm/bin/python
Current working directory: /Users/albertamurgopacheco/Documents/GitHub/TFM


In [39]:
# Define working directories in colab and local execution

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/gdrive')
    docs_path = '/content/gdrive/MyDrive/TFM/data/raw'
    input_path = '/content/gdrive/MyDrive/TFM'
    output_path = '/content/gdrive/MyDrive/TFM/output'

else:
    docs_path = './data/raw'
    input_path = '.'
    output_path = './output'

# INFORMATION EXTRACTION

# 1. The label included in the name of the file

There are two categories of cases: the reported and the unreported ones. The reported cases include richer data while the unreported ones (the vast majority of cases) miss several data fields due to a request for annonimity from any of the parties involved in the legal dispute.

The first two letters in the file name seem to follow some logic. Inspecting the documents reveals the following meanings:

In [40]:
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each decision and extract first two characters of the file's name
for decision in tqdm(data):
    # Only 'unteported' decisions include this 2-letter code
    if decision.get('Status of case:') == 'Unreported':
        string_code = decision.get('File')[:2]
    else:
        string_code = 'NA'
    
    # Add dictionary key 'Code label' with value string to the dictionary
    decision.update({'Code label:': string_code})

# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)

100%|██████████| 35305/35305 [00:00<00:00, 1588874.25it/s]


# 2. The court where the case was heard

An inspection of a sample of judicial decisions reveals that the name of the court is located in the first part of the document and it usually follows the expression "Heard at".

The strategy to capture this field will consist of a search using regular expressions. 

In [41]:
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each text file and extract Court information
for decision in tqdm(data):
    # Obtain the text of the court decision
    decision_string = decision.get('String')
    # Deal with empty/corrupt files that didn't upload a sentence string
    if decision_string:
        # Regex expression: What comes after "Heard at" until hitting 3 balnks or new line
        #regex = '(?<=Heard at).*[^\S\r\n]{3,}'
        regex = 'Heard at(.*)[\S\r\n]| (?<=Heard at).*[^\S\r\n]{3,}'
        catch = re.search(regex, decision_string)

        # If the catch is successful
        if catch :
            string = catch.group(0)
            # Remove ':' if included in the catch
            string = string.replace(':','')
            # Remove leading and trailing spaces
            string = string.strip()
            # Avoids picking up parts of tables and '|'
            string = string.split('   ')
            string = string[0]
            # Remove 'Heard at' if included in the catch
            string = string.replace('Heard at ','')
            # Remove 'manually' some strings often included in the catch
            string = string.replace('|Decision & Reasons Promulgated','')
            string = string.replace('|Decision and Reasons Promulgated','')
            string = string.replace('| Decision & Reasons Promulgated','')
            string = string.replace('Decision Promulgated','')
            string = string.replace('|Decision & Reasons promulgated','')
            string = string.replace('|Determination Promulgated','')
            string = string.replace('Decision and Reasons Promulgated','')
            string = string.replace('|Decision & Reasons  Promulgated','')
            string = string.replace(' on 4 July 2003','')
            string = string.replace('Determination Promulgated','')
            string = string.replace('Decision & Reasons Promulgated','')
            string = string.replace('|Decisions and Reasons Promulgated','')
            string = string.replace('|Decision and Reasons','')
            string = string.replace('UT(IAC)','')
            string = string.replace('UT (IAC) ','')
            string = string.replace('Date of Hearing  9 December 2005','')
            string = string.replace(' | |SS (Risk-Manastry) Iran CG [2003] UKIAT 00035 |','')
            # Strip of often found trailing characters
            string = string.rstrip(',')
            string = string.rstrip('|')
            # Remove leading and trailing spaces (again)
            string = string.strip()
            
        else:
            string = 'NA'
        
        #print(string)
        # Add dictionary key 'Heard at' with value string to the dictionary
        decision.update({'Heard at:': string})
    else:
        continue
# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)

100%|██████████| 35305/35305 [00:01<00:00, 27089.94it/s]


# 3. The judges



In [42]:
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each text file and extract Court information
for decision in tqdm(data):
    # Obtain the text of the court decision
    decision_string = decision.get('String')
    # Deal with empty/corrupt files that didn't upload a sentence string
    if decision_string:
        # Regex expression: What comes in between 'Before' and 'Between'
        regex = '(?<=Before)([\s\S]*?)(?=Between)'
        catch = re.search(regex, decision_string)
        #If the catch is successful
        if catch :
            string = catch.group(0)

            # Get rid of some table delimiters
            string = string.replace('|','')
            string = string.replace('?','')
            string = string.replace(',','')

            # Remove leading and trailing spaces
            string = string.strip()
            
            # Split strings (spaces > 3 usually indicates two "joint" names)
            # Alternative approach tried and discarded base on sentence tokenization 
            # from nltk.tokenize import sent_tokenize
            listNames = string.split("   ")
            # Make list of names with strings containijng names
            # Capitalize the first letter of each word & delete 
            listNames = [name.strip().title() for name in listNames if name.strip()]

            # Discard content in brackets as it's mostly titles and clutter
            listNames = [re.sub('[\(\[].*?[\)\]]', '', x).strip() for x in listNames]


            # Finally, delete titles, positions held and other clutter around the name
            clutter = ['Judge', 'Tribunal', 'Court', 'Upper', 'Deputy', 'Senior', 'Of', 'The', 'Mr', 'Dr', 'Vice', 'President',
            ':', 'Honourable', 'Hon.', '', '- - - - - - - - - - - - - - - - - - - -', 'Ut', 'Trinbunal', '-And-', 'Mrs', 'President,',
            'Tribnunal', '-', 'Hon', 'And', 'Chairman', 'Vice-President', 'Immigration', 'Asylum Chamber', '-Vice', '(Senior',
            '...............', 'Designated', 'His Honour', 'Respondent Representation: For Appellant', 'Secretary State For Home Department',
            'Appellant', 'Lord', 'Sir', 'In Matter An Application For Judicial Review', 'I) Eu Regulation Number 604/2013 Human',
            'Miss', 'Ms.', ':-']

            # 
            listNames = [' '.join(filter(lambda x: x not in clutter,  name.split())) for name in listNames]
            # Remove remaining 'issues' with empty strings ''
            listNames = list(filter(None, listNames))
            # Add a . following individual letters

            #print(listNames)
            
        else:
            listNames = ['NA']
        
        #print(decision.get('File'))
        #print(listNames)
        # Add dictionary key 'Judges:' with value list of strings to the dictionary
        decision.update({'Judges:': listNames})
    else:
        continue

# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)


100%|██████████| 35305/35305 [00:01<00:00, 24965.24it/s]


In [43]:
# 'Manually' fix some mistakes
        
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each text file and extract Court information
for decision in tqdm(data):

    if decision.get('File') == '00046_ukut_iac_2020_ps_iran_cg':
        listNames = ['J Barnes', 'A R Mackey', 'S L Batiste']
        decision.update({'Judges:': listNames})
    else:
        continue
    if decision.get('File') == '00393_ukut_iac_2019__jw_ors_ijr':
        listNames = ['Rimington Jackson']
        decision.update({'Judges:': listNames})
    else:
        continue
    if decision.get('File') == '2004_ukiat_00248_gh_iraq_cg':
        listNames = ['Rintoul', 'Bruce']
        decision.update({'Judges:': listNames})
    else:
        continue
    if decision.get('File') == '00270_ukut_iac_2015_mmw_ijr':
        listNames = ['Justice Mccloskey']
        decision.update({'Judges:': listNames})
    else:
        continue
    if decision.get('File') == '00271_ukut_iac_2015_bh_ijr':
        listNames = ['Justice Mccloskey', "O'Connor"]
        decision.update({'Judges:': listNames})
    else:
        continue
    if decision.get('File') == 'AA082212015':
        listNames = ['Alis', 'I K']
        decision.update({'Judges:': listNames})
    else:
        continue

# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)


100%|██████████| 35305/35305 [00:00<00:00, 2771630.50it/s]


# 4. The legal representation for the appellant and the respondent

The legal team consists of the representation for the appellant and the respondent.

In [242]:
representation = []
files = []

# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each text file and extract Court information
for decision in tqdm(data):
    # Obtain the text of the court decision
    decision_string = decision.get('String')
    file_name = decision.get('File')
    files.append(file_name)
    #print(file_name)
    # Use only first third of text
    string = decision_string[:len(decision_string)//3]
    # All text in lower
    string = string.lower()
    # Apply stanford nlp to string
    doc = nlp(string)

    # List to store the ruling sentences
    catch = []

    # Make sentences
    for i, sentence in enumerate(doc.sentences):
        sente = [token.text for token in sentence.tokens]
        # Keep only the alpha tokens
        sente = [e for e in sente if e.isalpha()]
        catch.append(sente)
        #print(catch)
    
    # Look for partial hits (representation_leads_part) in string 
    representation_leads_part = [['representation', 'for', 'the', 'appellant'], ['representation', 'for', 'the', 'claimant'],
    ['for', 'the', 'appellant'], ['representation', 'for', 'the', 'appellants'], ['for', 'the', 'first', 'appellant']]
    
    # Representation has not been found yet (flag = 0)
    flag = 0

    for element in catch:
        for part in representation_leads_part:
            # find index of part hit
            idx_part = representation_leads_part.index(part)
            # Condition flag == 0 to avoid greedy behaviour (several matches) Only matters 1st hit
            if sublist(representation_leads_part[idx_part], element) and flag == 0:
                index = catch.index(element)
                # representaion lead found in catch
                flag = 1
                # Keep only sentence with the hit (it includes all needed info)
                new_catch = catch[index]
                representation.append(new_catch)
                decision.update({'Representation:': new_catch})
                #print(new_catch)
                break
            else:
                continue
                
    # If information on the representation has not been found (flag = 0)
    if flag == 0:
        #print(f'Did not find a nationality {file_name} in catch: {catch}')
        representation.append(np.nan)
        decision.update({'Representation:': np.nan})
        #print('Did not find a representation')
        #print(catch)
    else:
        continue

# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)


100%|██████████| 35305/35305 [5:05:36<00:00,  1.93it/s]


The information on the legal representatives has been captured for a large number of decisions. 

In [244]:
dict_representation = {'File':files,'Representation':representation}

df = pd.DataFrame(dict_representation, columns=['File','Representation'])
df.isna().sum()


File                 0
Representation    3372
dtype: int64

The text dic includes a relatively long string with the information on the legal representatives. The following code breaks it down into two pieces:
- The legal representation of the appellant (legalAppellant).
- The legal representation of the defendant (legalDefendant).

In [None]:
# Deal with empty/corrupt files that didn't upload a sentence string
    # Regex expression: What comes in between 'Before' and 'Between'
    # regex = '(?<=Before)([\s\S]*?)(?=Between)'
    regex ='representation:([\S\s]*)for the respondent'
    catch = re.search(regex, decision_string.lower())
    #If the catch is successful
    if catch :
        string = catch.group(0)
        delimiters = ['|', '?', ':']
        # Get rid of some table delimiters
        for i in delimiters:
            string = string.replace(i,'')

        # Remove leading and trailing spaces
        string = string.strip()

        print(string)









# Path to the txt documents
txt_path = './data/processed/txt_files_test/'
print(os.listdir(txt_path))
# Loop over each text file and extract Court information
for text in os.listdir(txt_path):
    print(text)

    with open(txt_path + text, 'r') as file:
        decision_string = file.read()
        # Regex expression: What comes after "Heard at" until hitting 3 balnks or new line
        #regex = '(?<=Heard at).*[^\S\r\n]{3,}'
        #regex = 'Before([\S\s]*)Between'
        regex = '(?<=Before)([\s\S]*?)(?=Between)'

        catch = re.search(regex, decision_string)
        #If the catch is successful
        if catch :
            string = catch.group(0)

            # Keep only alpha numeric
            string = string.replace('|','')
            #string = re.sub(r'[^A-Za-z0-9 ]+', '', string)
            # Remove leading and trailing spaces
            string = string.strip()
            print(string)
        else:
            continue




# Use regex on sample list
l =['00010_ukait_2009_gs_afghanistan_cg.txt', '00003_ukait_2008_aa_others_pakistan.txt', 
'IA411142014.txt', 'IA417362014___Others.txt', 'PA047742016.txt', 'PA053522017.txt',
'IA124652014.txt', 'IA125982015.txt', 'PA085102018.txt']

# Use regex on entire list
ll = os.listdir(txt_path)
print(len(ll))




    # Loading string with court decision to data
for txt_file in  tqdm(os.listdir(txt_path)):
    
    # Open file and obtain string and file_name
    with open(txt_path + txt_file, 'r') as file:
        string = file.read()
        f_name, f_ext = os.path.splitext(file.name)
        head, file_name = os.path.split(f_name)
    # Search data list of dictionaries for dict where {"File":} = file_name
    for d in data:
        if d.get('File') == file_name:
            # Add dictionary key 'String' with value string
            d.update({'String': string})

# 5. The decision of the judge

The decision of the judge is the most challenging piece of information to extract from the documents. # First isolate the part of the document most likely to include the decission the second half of the document. Second, get rid of annexes and appendixes. third, # classifying judgments is not the same as classifying cases.


In [60]:
# Function to capture if all elements exist in a list
def all_exist(avalue, bvalue):
    """
    Given a list of sentences/lists all_exist checks whether avalue list exists in bvalue

    :avalue: list to search
    :bvalue: list to be searched
    :return: the list with the match
    """
    return all(any(x in y for y in bvalue) for x in avalue)

# nlp sentence tokenizer with Stanford
nlp = stanza.Pipeline(lang = 'en', processors = 'tokenize', tokenize_no_ssplit = True)

# Store decisions in a list to make a df
decisions = []
files = []

# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each text file and extract Court information
for decision in tqdm(data):
    # Obtain the full text of the court decision
    string = decision.get('String')
    file_name = decision.get('File')
    files.append(file_name)

    # Use only second half of text (skip references to annxes and appendixes)
    string = string[len(string)//2:]

    # Discard text following appendix and annexes
    string = string.rsplit("appendix", 1)
    string = string[0]
    string = string.rsplit("annex", 1)
    string = string[0]

    # Narrow down the search from the end
    # Split on last occurrence of "Signed"
    string = string.rsplit("Signed", 1)
    string = string[0].lower()


    # Keep a max of 2000 characters
    string = string[ min(-2000, len(string)):]

    # Get rid of text after the last occurrence of 'anonymity'
    string = string.rsplit("anonymity", 1)
    string = string[0]

    # Apply stanford nlp
    doc = nlp(string)

    # List to store the ruling sentences
    catch = []
    # Flag = 1 when decision found
    flag = 0
        
    # Make sentences
    for i, sentence in enumerate(doc.sentences):
        sente = [token.text for token in sentence.tokens]
        # Keep only the alpha tokens
        sente = [e for e in sente if e.isalpha()]
        #print(type(sente))
        catch.append(sente)
        
    # Identify decision leads in sentences
    decision_leads = [['notice', 'of', 'decision'], ['decision'], ['decisions'], ['conclusions'], ['conclusion']]
        
    # When decision lead found, trim catch and update flag value 
    for lead in decision_leads:
        try:
            # Find index of decision lead in ruling
            index = catch.index(lead)
            # Remove sentences before the decision lead sentence
            del catch[0:index]
            # Flatten the list of lists/sentences
            flat_catch = [item for sublist in catch for item in sublist]
            # Decision found
            flag = 1
            # Store decision in decisions list
            decisions.append(flat_catch)
            decision.update({'Decision:': flat_catch})
            #print('Found decision 1')
            #print(flat_catch)
            break
        except ValueError:
            continue
    
    # If a decision has not been found yet (flag = 0)
    if flag == 0:
    # Look for partial hits in text 
        decision_leads_part = [['for', 'the', 'above', 'reasons'], ['for', 'the', 'reasons', 'i', 'have', 'given'], ['general', 'conclusions'],
        ['for', 'the', 'reasons', 'set', 'out', 'above'], ['for', 'all', 'of', 'these', 'reasons'], ['decision', 'and', 'directions'], ['conclusions'],
        ['notice', 'of', 'decision'], ['decision','the', 'application', 'for', 'judicial', 'review', 'is'], ['there', 'is', 'no', 'material', 'error', 'of', 'law', 'in'],
        ['decision', 'the', 'decision', 'of', 'tribunal', 'judge', 'dean', 'promulgated'], ['the', 'decision', 'of', 'the', 'ftt', 'is', 'set', 'aside'],
        ['i', 'grant', 'permission', 'to', 'appeal', 'i', 'set', 'aside', 'the', 'decision', 'of', 'the', 'tribunal'], ['i', 'set', 'aside', 'that', 'decision'],
        ['the', 'appellant', 'appeal', 'as', 'originally', 'brought', 'to', 'the', 'ftt', 'is', 'dismissed'], ['i', 'do', 'not', 'set', 'aside', 'the', 'decision']]
            
        for element in catch:
            for part in decision_leads_part:
                idx_part = decision_leads_part.index(part)
                if all_exist(decision_leads_part[idx_part], element):
                    index = catch.index(element)
                    # Decision found in catch
                    flag = 1
                    # Remove sentences before the decision lead sentence
                    del catch[0:index]
                    # Flatten the list of lists/sentences
                    flat_catch = [item for sublist in catch for item in sublist]
                    #print('Found decision 2')
                    #print(flat_catch)
                    break
                
                else:
                    continue
                
        # If a decision has still not been found (flag = 0)
        if flag == 0:
            decisions.append(np.nan)
            decision.update({'Decision:': np.nan})
            #print('Did not find a decision')
            #print(catch)
        else:
            # Store decision in decisions list
            decisions.append(flat_catch)
            decision.update({'Decision:': flat_catch})
            continue


# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)

2021-11-13 20:28:12 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2021-11-13 20:28:12 INFO: Use device: cpu
2021-11-13 20:28:12 INFO: Loading: tokenize
2021-11-13 20:28:12 INFO: Done loading processors!
100%|██████████| 35305/35305 [2:00:38<00:00,  4.88it/s]


In [245]:
dict_decisions = {'File':files,'Decision':decisions}

df = pd.DataFrame(dict_decisions, columns=['File','Decision'])
df.isna().sum()
#print(data[49])
print(len(files))
print(len(decisions))
#print(decisions[32488])
#print(files[5000])
#rint(decisions[5000])
#print(files[6000])
print(decisions[6002])
#print(df[df['Decision'].isnull()])
print(df.isnull().sum(axis = 0))
print(len(files))
print(len(decisions))
#print(df[df['Decision'].isnull()])
#print(json.dumps(data[32554], indent = 4, sort_keys = True))

35305
35305
['decision', 'the', 'determination', 'of', 'the', 'tribunal', 'having', 'been', 'found', 'to', 'contain', 'a', 'material', 'error', 'of', 'law', 'i', 'substitute', 'the', 'following', 'decision', 'the', 'appellant', 'appeal', 'is', 'allowed', 'under', 'the', 'immigration', 'rules']
HU166042017
['the', 'judge', 'also', 'went', 'on', 'to', 'consider', 'whether', 'or', 'not', 'any', 'exceptional', 'circumstances', 'existed', 'in', 'this', 'particular', 'case', 'the', 'findings', 'and', 'conclusions', 'are', 'comprehensive', 'and', 'when', 'the', 'decision', 'is', 'viewed', 'holistically', 'the', 'judge', 'consideration', 'is', 'entirely', 'sound', 'in', 'light', 'of', 'the', 'above', 'the', 'appellant', 'appeal', 'to', 'the', 'upper', 'tribunal', 'is', 'dismissed', 'and', 'the', 'decision', 'of', 'the', 'tribunal', 'stands', 'anonymity', 'i', 'make', 'no']
PA098412018
['notice', 'of', 'decision', 'for', 'the', 'above', 'reasons', 'the', 'decision', 'i', 'on', 'the', 'appellant

# 7. Sense of the decision.

The decision has been isolated. However, no information on whether the sentence accepts/rejects or is neutral.

In [None]:
# Sense of decision depends on the appellent. If appellent is home office, then... The decision of the First-tier Tribunal did not involve the  making  of an error of law and I uphold it
# is accepted, otherwise is rejected.

# Rejected
The appeal is dismissed
The decision of the First-tier Tribunal stands
not involve an error on 
not satisfied that  the  judge  erred 
decision stands
did  not  involve  the making of a material error on a point of law
I do not set aside the decision but order that it shall stand
appeal is dismissed

# Accepted
The First-tier Tribunal erred in law
I have remade the decision
is set aside
The appeal, as brought by the appellant to the First-tier  Tribunal,  is allowed.
the appeal is remade and I allow the appeal
Appeals allowed
It  is  set  aside
I allow the claimant's appeal
The decision of the First-tier Tribunal has already  been  set  aside
The original decision shall stand
set aside the decision
is set aside

decision allowing the appeal on humanitarian protection grounds, as well as on human rights grounds

# Neutral
else



# 8. Nationality of the appellant. 
The field country is empty to a large extent.

In [182]:
# Function to capture if all elements exist in a specific order in a list
def sublist(sublist, lst):
    """
    Given a list of sentences/lists all_exist checks whether avalue list exists in bvalue

    :sublist: list to search
    :lst: list to be searched
    :return: the list with the match
    """
    if not isinstance(sublist, list):
        raise ValueError("sublist must be a list")
    if not isinstance(lst, list):
        raise ValueError("lst must be a list")

    sublist_len = len(sublist)
    k=0
    s=None

    if (sublist_len > len(lst)):
        return False
    elif (sublist_len == 0):
        return True

    for x in lst:
        if x == sublist[k]:
            if (k == 0): s = x
            elif (x != s): s = None
            k += 1
            if k == sublist_len:
                return True
        elif k > 0 and sublist[k-1] != s:
            k = 0

    return False

countries = ['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bolivia, Plurinational State of', 'Bonaire', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo, The Democratic Republic of the', 'Congo', 'Cook Islands', 'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See (Vatican City State)', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran, Islamic Republic of', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', "Korea, Democratic People's Republic of", 'Korea, Republic of', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia, Republic of', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia, Federated States of', 'Micronesia', 'Moldova, Republic of', 'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestinian Territory, Occupied', 'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Réunion', 'Romania', 'Russian Federation', 'Russia', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'South Sudan', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Syrian Arab Republic', 'Taiwan', 'Taiwan, Province of China', 'Tajikistan', 'Tanzania', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United States', 'United States Minor Outlying Islands', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Venezuela, Bolivarian Republic of', 'Vietnam', 'Viet Nam', 'Virgin Islands, British', 'Virgin Islands, U.S.', 'Wallis and Futuna', 'Yemen', 'Zambia', 'Zimbabwe', 'Pakistani', 'Iranian', 'Bangladeshi', 'Indian', 'Egyptian', 'Afghan', 'Albanian', 'Algerian', 'American', 'Andorran', 'Angolan', 'Antiguans', 'Argentinean', 'Armenian', 'Australian', 'Austrian', 'Azerbaijani', 'Bahamian', 'Bahraini', 'Bangladeshi', 'Barbadian', 'Barbudans', 'Batswana', 'Belarusian', 'Belgian', 'Belizean', 'Beninese', 'Bhutanese', 'Bolivian', 'Bosnian', 'Brazilian', 'Bruneian', 'Bulgarian', 'Burkinabe', 'Burmese', 'Burundian', 'Cambodian', 'Cameroonian', 'Canadian', 'Cape Verdean', 'Central African', 'Chadian', 'Chilean', 'Chinese', 'Colombian', 'Comoran', 'Congolese', 'Costa Rican', 'Croatian', 'Cuban', 'Cypriot', 'Czech', 'Danish', 'Djibouti', 'Dominican', 'Dutch', 'Dutchman', 'Dutchwoman', 'East Timorese', 'Ecuadorean', 'Egyptian', 'Emirian', 'Equatorial Guinean', 'Eritrean', 'Estonian', 'Ethiopian', 'Fijian', 'Filipino', 'Finnish', 'French', 'Gabonese', 'Gambian', 'Georgian', 'German', 'Ghanaian', 'Greek', 'Grenadian', 'Guatemalan', 'Guinea-Bissauan', 'Guinean', 'Guyanese', 'Haitian', 'Herzegovinian', 'Honduran', 'Hungarian', 'I-Kiribati', 'Icelander', 'Indian', 'Indonesian', 'Iranian', 'Iraqi', 'Irish', 'Israeli', 'Italian', 'Ivorian', 'Jamaican', 'Japanese', 'Jordanian', 'Kazakhstani', 'Kenyan', 'Kittian and Nevisian', 'Kuwaiti', 'Kyrgyz', 'Laotian', 'Latvian', 'Lebanese', 'Liberian', 'Libyan', 'Liechtensteiner', 'Lithuanian', 'Luxembourger', 'Macedonian', 'Malagasy', 'Malawian', 'Malaysian', 'Maldivan', 'Malian', 'Maltese', 'Marshallese', 'Mauritanian', 'Mauritian', 'Mexican', 'Micronesian', 'Moldovan', 'Monacan', 'Mongolian', 'Moroccan', 'Mosotho', 'Motswana', 'Mozambican', 'Namibian', 'Nauruan', 'Nepalese', 'Netherlander', 'New Zealander', 'Ni-Vanuatu', 'Nicaraguan', 'Nigerian', 'Nigerien', 'North Korean', 'Northern Irish', 'Norwegian', 'Omani', 'Pakistani', 'Palauan', 'Panamanian', 'Papua New Guinean', 'Paraguayan', 'Peruvian', 'Polish', 'Portuguese', 'Qatari', 'Romanian', 'Russian', 'Rwandan', 'Saint Lucian', 'Salvadoran', 'Samoan', 'San Marinese', 'Sao Tomean', 'Saudi', 'Scottish', 'Senegalese', 'Serbian', 'Seychellois', 'Sierra Leonean', 'Singaporean', 'Slovakian', 'Slovenian', 'Solomon Islander', 'Somali', 'South African', 'South Korean', 'Spanish', 'Sri Lankan', 'Sudanese', 'Surinamer', 'Swazi', 'Swedish', 'Swiss', 'Syrian', 'Taiwanese', 'Tajik', 'Tanzanian', 'Thai', 'Togolese', 'Tongan', 'Trinidadian or Tobagonian', 'Tunisian', 'Turkish', 'Tuvaluan', 'Ugandan', 'Ukrainian', 'Uruguayan', 'Uzbekistani', 'Venezuelan', 'Vietnamese', 'Welsh', 'Yemenite', 'Zambian', 'Zimbabwean']
countriesLower = [x.lower() for x in countries]
#print(countriesLower)

In [183]:
# Store decisions in a list to make a df
nationalities = []
files = []

# nlp sentence tokenizer with Stanford
nlp = stanza.Pipeline(lang = 'en', processors = 'tokenize', tokenize_no_ssplit = True)

# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each text file and extract Court information
for decision in tqdm(data):
    # Obtain the full text of the court decision
    string = decision.get('String')
    file_name = decision.get('File')
    files.append(file_name)

    # Use only first third of text
    string = string[:len(string)//3]
    # All text in lower
    string = string.lower()

    # Apply stanford nlp
    doc = nlp(string)

    # List to store the ruling sentences
    catch = []

    # Make sentences
    for i, sentence in enumerate(doc.sentences):
        sente = [token.text for token in sentence.tokens]
        # Keep only the alpha tokens
        sente = [e for e in sente if e.isalpha()]
        #print(type(sente))
        catch.append(sente)
            
    # If a nationality has not been found yet (flag = 0)
    # Look for partial hits in text 
    nationality_leads_part = [['the', 'appellant', 'is', 'a', 'national', 'of'], ['the', 'appellant', 'is', 'a', 'citizen', 'of'],
    ['the', 'respondent', 'is', 'a', 'citizen', 'of'], ['the', 'appellants', 'are', 'all', 'citizens', 'of'], ['citizen', 'of'],
    ['national', 'of'], ['citizens', 'of']]
    
    # Flag = 1 when nationality is found in catch
    flag = 0

    for element in catch:
        for part in nationality_leads_part:
            idx_part = nationality_leads_part.index(part)
            if sublist(nationality_leads_part[idx_part], element):
                #print(nationality_leads_part[idx_part])
                index = catch.index(element)
                # Nationality lead found in catch
                #flag = 1
                # Remove sentences before the decision lead sentence
                new_catch = catch[index]
                # flag2 = 1 when nationality is found in 1
                for token in new_catch:
                    #indx_country = countriesLower.index(country)
                    if sublist([token], countriesLower):
                        #print(f'FOUND A NATIONALITY {token} in {file_name}')
                        flag = 1
                        nationalities.append(token)
                        decision.update({'Nationality:': token})
                        break
                    else:
                        
                        continue
                    break

            if flag == 1:
                break
                    #print(f'Did not find a nationality in {catch}')

                        #indx = catch.index(country)
                        #print(countriesLower[indx])
                # Flatten the list of lists/sentences
                #catch = [item for sublist in catch for item in sublist]
                #catch = ' '.join(catch)
                #print('Found nationality 2')
                #print(flat_catch)
                #print(catch)
                # Load the model
                #nlp = spacy.load("en_core_web_sm")
                #catch_spacy = nlp(catch)
                #for ent in catch_spacy.ents:    
                    # check if entity is equal 'LOC' or 'GPE'
                #    if ent.label_ in ['GPE']:
                #        print(ent.text, ent.label_)  
                #break
            #else:
            #    continue
        if flag == 1:
            break   
    # If a decision has still not been found (flag = 0)
    if flag == 0:
        #print(f'Did not find a nationality {file_name} in catch: {catch}')
        nationalities.append(np.nan)
        decision.update({'Nationality:': np.nan})
        #print('Did not find a decision')
        #print(catch)
    else:
            # Store decision in decisions list
            #nationalities.append(flat_catch)
            #decision.update({'Decision:': flat_catch})
        continue

# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)



2021-11-14 18:00:58 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2021-11-14 18:00:58 INFO: Use device: cpu
2021-11-14 18:00:58 INFO: Loading: tokenize
2021-11-14 18:00:58 INFO: Done loading processors!
100%|██████████| 35305/35305 [5:04:39<00:00,  1.93it/s]


In [184]:
print(len(files))
print(len(nationalities))
#print(nationalities)
dict_nationalities = {'File':files,'Nationality':nationalities}

df = pd.DataFrame(dict_nationalities, columns=['File','Nationality'])
df.isna().sum()




35305
35305


File               0
Nationality    12537
dtype: int64

In [175]:
# look for it in first half of string
# GPE Countries, cities, states.
# LOC Non-GPE locations, mountain ranges, bodies of water.
#
sp = spacy.load("en_core_web_sm")
# loop over every row in the 'Bio' column
for text in df['Bio'].tolist():
    # use spacy to extract the entities
    doc = sp(text)
    for ent in doc.ents:    
        # check if entity is equal 'LOC' or 'GPE'
        if ent.label_ in ['GPE']:
            print(ent.text, ent.label_)  

# 9. Key word extraction

In [186]:
print(len(files))
print(len(nationalities))
dict_nationalities = {'File':files,'Nationality':nationalities}

df = pd.DataFrame(dict_nationalities, columns=['File','Nationality'])
df.isna().sum()




35305
35305


File               0
Nationality    12537
dtype: int64

Decisions 
https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.match?view=net-5.0

In [None]:
# Extraction of decisions

https://research.iclr.co.uk/blog/blackstone-goes-live


# keyword extraction for issues in strategies
https://www.airpair.com/nlp/keyword-extraction-tutorial


In [None]:
# Decision extraction

f = open('./data/processed/txt_files/00003_ukait_2008_aa_others_pakistan.txt', "r")
for number, paragraph in enumerate(f.read().split("\n\n"), 1):
    print(number)
    print(paragraph)
    pattern = "Decisions"
    if paragraph.find(pattern) != -1:
        print("save to file") 
    else:
        print("don't save to file")

        

In [None]:
import spacy

# Load the model
nlp = spacy.load("en_blackstone_proto")
