# Information extraction (28th October 2021)

This notebook extracts additional information from the text of the tribunal decisions and stores it in the relevant dictionary.

In particular, the notebook performs information extraction on:

1. The label included in the name of the file.

2. The court where the case was heard ("Heard at").

3. The judges.

4. The legal representation for the appellant and the respondent.

5. The decision/ruling by the judge.

Each of these fields is added to the dictionary of each judicial decision.

The resulting data set, a list of updated dictionaries, is serialised as a JSON file (jsonDataFinal.json).

This notebook should run in the tfm environment, which can be created with the environment.yml file.

In [38]:
import os
import sys
import re
import json
import pickle
import time
import datetime
from os import listdir
from os.path import isfile, join, getsize

import numpy as np
import pandas as pd
import whois
import textract
from tqdm import tqdm

# Detect whether the notebook runs in Google Colab
IN_COLAB = 'google.colab' in sys.modules


# What environment am I using?
print(f'Current environment: {sys.executable}')

# Change the current working directory
os.chdir('/Users/albertamurgopacheco/Documents/GitHub/TFM')
# What's my working directory?
print(f'Current working directory: {os.getcwd()}')


Current environment: /Users/albertamurgopacheco/anaconda3/envs/tfm/bin/python
Current working directory: /Users/albertamurgopacheco/Documents/GitHub/TFM


In [39]:
# Define working directories in colab and local execution

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/gdrive')
    docs_path = '/content/gdrive/MyDrive/TFM/data/raw'
    input_path = '/content/gdrive/MyDrive/TFM'
    output_path = '/content/gdrive/MyDrive/TFM/output'

else:
    docs_path = './data/raw'
    input_path = '.'
    output_path = './output'

# INFORMATION EXTRACTION

# 1. The label included in the name of the file

There are two categories of cases: reported and unreported. The reported cases include richer data, while the unreported ones (the vast majority of cases) miss several data fields due to a request for anonymity from one of the parties involved in the legal dispute.

The first two letters in the file name seem to follow some logic. Inspecting the documents reveals the following meanings:

In [40]:
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each decision and extract first two characters of the file's name
for decision in tqdm(data):
    # Only 'unreported' decisions include this 2-letter code
    if decision.get('Status of case:') == 'Unreported':
        string_code = decision.get('File')[:2]
    else:
        string_code = 'NA'
    
    # Add dictionary key 'Code label' with value string to the dictionary
    decision.update({'Code label:': string_code})

# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)

100%|██████████| 35305/35305 [00:00<00:00, 1588874.25it/s]
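A quick way to inspect which two-letter codes occur, and how often, is to tally them. The sketch below is only illustrative: the sample dictionaries, the `'Reported'` status value and the helper name `code_label` are assumptions that mimic the structure used above, not actual rows of the data set.

```python
from collections import Counter

# Hypothetical sample mimicking the structure of the decision dictionaries
sample = [
    {'File': 'HU122212017', 'Status of case:': 'Unreported'},
    {'File': 'PA047742016', 'Status of case:': 'Unreported'},
    {'File': 'HU200422018', 'Status of case:': 'Unreported'},
    {'File': '00010_ukait_2009_gs_afghanistan_cg', 'Status of case:': 'Reported'},
]

def code_label(decision):
    """Return the two-letter prefix for unreported cases, 'NA' otherwise."""
    if decision.get('Status of case:') == 'Unreported':
        return (decision.get('File') or 'NA')[:2]
    return 'NA'

# Frequency of each code label in the sample
label_counts = Counter(code_label(d) for d in sample)
print(label_counts.most_common())
```

Running the same tally over the full `data` list would show which prefixes dominate before trying to assign a meaning to each one.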


# 2. The court where the case was heard

An inspection of a sample of judicial decisions reveals that the name of the court is located in the first part of the document and it usually follows the expression "Heard at".

The strategy to capture this field will consist of a search using regular expressions. 
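As a minimal sketch of this strategy, the snippet below captures whatever follows "Heard at" up to a run of three or more blanks, a table delimiter or the end of the line. The header lines and the names `HEARD_AT` and `heard_at` are made up for illustration, and the pattern is a simplified variant rather than the exact expression used in the cell below.

```python
import re

# Hypothetical header lines; real documents vary considerably
samples = [
    'Heard at Field House        |Decision & Reasons Promulgated',
    'Heard at: Manchester\nOn 4 July 2003',
    'No court line here',
]

# Capture lazily after "Heard at" until 3+ blanks, a pipe or the end of the line
HEARD_AT = re.compile(r'Heard at:?\s*(.+?)(?=[^\S\r\n]{3,}|\||$)', re.MULTILINE)

def heard_at(text):
    """Return the court name following 'Heard at', or 'NA' if absent."""
    match = HEARD_AT.search(text)
    return match.group(1).strip() if match else 'NA'

courts = [heard_at(s) for s in samples]
print(courts)
```

Stopping at the delimiter inside the pattern reduces the number of manual `replace` calls needed afterwards, although unusual layouts would still need case-by-case clean-up.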

In [41]:
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each text file and extract Court information
for decision in tqdm(data):
    # Obtain the text of the court decision
    decision_string = decision.get('String')
    # Deal with empty/corrupt files that didn't upload a sentence string
    if decision_string:
        # Regex expression: what comes after "Heard at" until hitting 3 blanks or a new line
        #regex = '(?<=Heard at).*[^\S\r\n]{3,}'
        regex = r'Heard at(.*)[\S\r\n]| (?<=Heard at).*[^\S\r\n]{3,}'
        catch = re.search(regex, decision_string)

        # If the catch is successful
        if catch :
            string = catch.group(0)
            # Remove ':' if included in the catch
            string = string.replace(':','')
            # Remove leading and trailing spaces
            string = string.strip()
            # Avoids picking up parts of tables and '|'
            string = string.split('   ')
            string = string[0]
            # Remove 'Heard at' if included in the catch
            string = string.replace('Heard at ','')
            # Remove 'manually' some strings often included in the catch
            string = string.replace('|Decision & Reasons Promulgated','')
            string = string.replace('|Decision and Reasons Promulgated','')
            string = string.replace('| Decision & Reasons Promulgated','')
            string = string.replace('Decision Promulgated','')
            string = string.replace('|Decision & Reasons promulgated','')
            string = string.replace('|Determination Promulgated','')
            string = string.replace('Decision and Reasons Promulgated','')
            string = string.replace('|Decision & Reasons  Promulgated','')
            string = string.replace(' on 4 July 2003','')
            string = string.replace('Determination Promulgated','')
            string = string.replace('Decision & Reasons Promulgated','')
            string = string.replace('|Decisions and Reasons Promulgated','')
            string = string.replace('|Decision and Reasons','')
            string = string.replace('UT(IAC)','')
            string = string.replace('UT (IAC) ','')
            string = string.replace('Date of Hearing  9 December 2005','')
            string = string.replace(' | |SS (Risk-Manastry) Iran CG [2003] UKIAT 00035 |','')
            # Strip off frequently found trailing characters
            string = string.rstrip(',')
            string = string.rstrip('|')
            # Remove leading and trailing spaces (again)
            string = string.strip()
            
        else:
            string = 'NA'
        
        #print(string)
        # Add dictionary key 'Heard at' with value string to the dictionary
        decision.update({'Heard at:': string})
    else:
        continue
# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)

100%|██████████| 35305/35305 [00:01<00:00, 27089.94it/s]


# 3. The judges



In [42]:
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each decision and extract the judges
for decision in tqdm(data):
    # Obtain the text of the court decision
    decision_string = decision.get('String')
    # Deal with empty/corrupt files that didn't upload a sentence string
    if decision_string:
        # Regex expression: what comes in between 'Before' and 'Between'
        regex = r'(?<=Before)([\s\S]*?)(?=Between)'
        catch = re.search(regex, decision_string)
        #If the catch is successful
        if catch :
            string = catch.group(0)

            # Get rid of some table delimiters
            string = string.replace('|','')
            string = string.replace('?','')
            string = string.replace(',','')

            # Remove leading and trailing spaces
            string = string.strip()
            
            # Split strings (three or more spaces usually separate two "joint" names)
            # An alternative approach based on sentence tokenization was tried and discarded
            # from nltk.tokenize import sent_tokenize
            listNames = string.split("   ")
            # Build the list of names from the non-empty chunks
            # Capitalize the first letter of each word & delete surrounding blanks
            listNames = [name.strip().title() for name in listNames if name.strip()]

            # Discard content in brackets as it's mostly titles and clutter
            listNames = [re.sub(r'[\(\[].*?[\)\]]', '', x).strip() for x in listNames]


            # Finally, delete titles, positions held and other clutter around the name
            clutter = ['Judge', 'Tribunal', 'Court', 'Upper', 'Deputy', 'Senior', 'Of', 'The', 'Mr', 'Dr', 'Vice', 'President',
            ':', 'Honourable', 'Hon.', '', '- - - - - - - - - - - - - - - - - - - -', 'Ut', 'Trinbunal', '-And-', 'Mrs', 'President,',
            'Tribnunal', '-', 'Hon', 'And', 'Chairman', 'Vice-President', 'Immigration', 'Asylum Chamber', '-Vice', '(Senior',
            '...............', 'Designated', 'His Honour', 'Respondent Representation: For Appellant', 'Secretary State For Home Department',
            'Appellant', 'Lord', 'Sir', 'In Matter An Application For Judicial Review', 'I) Eu Regulation Number 604/2013 Human',
            'Miss', 'Ms.', ':-']

            # Drop every clutter token from each name
            listNames = [' '.join(filter(lambda x: x not in clutter,  name.split())) for name in listNames]
            # Remove remaining 'issues' with empty strings ''
            listNames = list(filter(None, listNames))
            # Add a . following individual letters

            #print(listNames)
            
        else:
            listNames = ['NA']
        
        #print(decision.get('File'))
        #print(listNames)
        # Add dictionary key 'Judges:' with value list of strings to the dictionary
        decision.update({'Judges:': listNames})
    else:
        continue

# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)


100%|██████████| 35305/35305 [00:01<00:00, 24965.24it/s]
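The cleaning steps in the cell above can be wrapped in a small helper so they are easier to test on individual strings. This is a sketch under assumptions: `clean_judges` and `CLUTTER` are hypothetical names, the clutter set is only a subset of the list above, and the split uses a regular expression on three or more whitespace characters instead of `split("   ")`.

```python
import re

# Small illustrative subset of the clutter list used above
CLUTTER = {'Judge', 'Tribunal', 'Upper', 'Deputy', 'Senior', 'Mr', 'Dr',
           'Vice', 'President', 'Immigration'}

def clean_judges(raw, clutter=CLUTTER):
    """Split a 'Before ... Between' capture into cleaned judge names."""
    raw = raw.replace('|', '').replace(',', '')
    # Three or more whitespace characters usually separate two names
    chunks = [c.strip().title() for c in re.split(r'\s{3,}', raw) if c.strip()]
    # Drop bracketed titles such as '(Vice President)'
    chunks = [re.sub(r'[\(\[].*?[\)\]]', '', c).strip() for c in chunks]
    # Remove clutter words and discard names that end up empty
    names = [' '.join(w for w in c.split() if w not in clutter) for c in chunks]
    return [n for n in names if n]

example = '      Dr H H Storey (Vice President)          Mr P Bompas        Mr S S Percy'
print(clean_judges(example))
```

Keeping the logic in one function makes it straightforward to try it on problematic files before re-running the loop over all decisions.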


In [43]:

# 'Manually' fix some mistakes
        
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Manual corrections keyed by file name
manual_judges = {
    '00046_ukut_iac_2020_ps_iran_cg': ['J Barnes', 'A R Mackey', 'S L Batiste'],
    '00393_ukut_iac_2019__jw_ors_ijr': ['Rimington Jackson'],
    '2004_ukiat_00248_gh_iraq_cg': ['Rintoul', 'Bruce'],
    '00270_ukut_iac_2015_mmw_ijr': ['Justice Mccloskey'],
    '00271_ukut_iac_2015_bh_ijr': ['Justice Mccloskey', "O'Connor"],
    'AA082212015': ['Alis', 'I K'],
}

# Loop over each decision and overwrite the judges where a manual fix exists.
# (The previous chain of if/else-continue blocks skipped every check after the
# first one, so only the first file could ever be corrected.)
for decision in tqdm(data):
    file_name = decision.get('File')
    if file_name in manual_judges:
        decision.update({'Judges:': manual_judges[file_name]})

# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)


100%|██████████| 35305/35305 [00:00<00:00, 2771630.50it/s]


# 4. The legal representation for the appellant and the respondent

The legal team consists of the representation for the appellant and the respondent.

In [None]:
# Path to the txt documents
txt_path = './data/processed/txt_files_test/'
print(os.listdir(txt_path))
# Loop over each text file and print the text found between 'Before' and 'Between'
for text in os.listdir(txt_path):
    print(text)

    with open(txt_path + text, 'r') as file:
        decision_string = file.read()
        # Regex expression: what comes in between 'Before' and 'Between'
        #regex = 'Before([\S\s]*)Between'
        regex = r'(?<=Before)([\s\S]*?)(?=Between)'

        catch = re.search(regex, decision_string)
        #If the catch is successful
        if catch :
            string = catch.group(0)

            # Remove table delimiters
            string = string.replace('|','')
            #string = re.sub(r'[^A-Za-z0-9 ]+', '', string)
            # Remove leading and trailing spaces
            string = string.strip()
            print(string)
        else:
            continue




# Use regex on sample list
l =['00010_ukait_2009_gs_afghanistan_cg.txt', '00003_ukait_2008_aa_others_pakistan.txt', 
'IA411142014.txt', 'IA417362014___Others.txt', 'PA047742016.txt', 'PA053522017.txt',
'IA124652014.txt', 'IA125982015.txt', 'PA085102018.txt']

# Use regex on entire list
ll = os.listdir(txt_path)
print(len(ll))




# Load the string with the court decision into data
for txt_file in  tqdm(os.listdir(txt_path)):
    
    # Open file and obtain string and file_name
    with open(txt_path + txt_file, 'r') as file:
        string = file.read()
        f_name, f_ext = os.path.splitext(file.name)
        head, file_name = os.path.split(f_name)
    # Search data list of dictionaries for dict where {"File":} = file_name
    for d in data:
        if d.get('File') == file_name:
            # Add dictionary key 'String' with value string
            d.update({'String': string})

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')                                                                                                                  
sents = nlp('Dr H H Storey  (Senior Immigration Judge)') 
type(sents)

for ent in sents.ents:
    print(ent.text, ent.label_)

# Install package
#python3.8 -m pip install stanza 
import stanza

#stanza.download('en') # download English model
nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("DEPUTY UPPER TRIBUNAL JUDGE M A. HALL") # run annotation over a sentence

print(doc)
print(doc.entities)

import blackstone



#string= 'Dr H H Storey  (Senior Immigration Judge)                         Mr I F Macdonald (Immigration Judge) '
#string = 'UPPER TRIBUNAL JUDGE GRUBB (bullshit)'

string = '                       Dr H H Storey (Vice President)                                 Mr P Bompas                                Mr S S Percy'


# Split strings (three or more spaces usually separate two "joint" names)
# An alternative approach based on sentence tokenization was tried and discarded
# from nltk.tokenize import sent_tokenize
listNames = string.split("   ")
# Build the list of names from the non-empty chunks
# Capitalize the first letter of each word & delete surrounding blanks
listNames = [name.strip().title() for name in listNames if name.strip()]

# Discard content in brackets as it's mostly titles and clutter
listNames = [re.sub(r'[\(\[].*?[\)\]]', '', x).strip() for x in listNames]


# Finally, delete titles, positions held and other clutter around the name
clutter = ['Judge', 'Tribunal', 'Court', 'Upper', 'Deputy', 'Senior', 'Of', 'The', 'Mr', 'Dr', 'Vice', 'President']

# Drop every clutter token from each name
listNames = [' '.join(filter(lambda x: x not in clutter,  name.split())) for name in listNames]

# Add a . following individual letters

print(listNames)

# 5. The decision of the judge

The decision of the judge is the most challenging piece of information to extract from the documents.

In [49]:
# First isolate the part of the document most likely to include the decision

# classifying judgments is not the same as classifying cases.
from pprint import pprint

from nltk.tokenize import sent_tokenize, word_tokenize
import stanza

def all_exist(avalue, bvalue):
    return all(any(x in y for y in bvalue) for x in avalue)


nlp = stanza.Pipeline(lang = 'en', processors = 'tokenize', tokenize_no_ssplit = True)

# Path to the txt documents
#txt_path = './data/processed/txt_files/'
#print(os.listdir(txt_path))

decisions = []
files = []

# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each decision and extract the ruling
for decision in tqdm(data[49:]):
    # Obtain the full text of the court decision
    string = decision.get('String')
    file_name = decision.get('File')
    files.append(file_name)

    # Use only second half of text (skip references to annexes and appendixes)
    string = string[len(string)//2:]

    # Discard text following appendix and annexes
    # Note: these two matches are case-sensitive; the text is only lowercased further below
    string = string.rsplit("appendix", 1)
    string = string[0]
    string = string.rsplit("annex", 1)
    string = string[0]

    # Narrow down from the end
    # Split on last occurrence of "Signed"
    string = string.rsplit("Signed", 1)
    string = string[0].lower()


    # Keep a max of 2000 characters (the last 2000)
    string = string[-2000:]

    # Get rid of text after the last occurrence of 'anonymity'
    string = string.rsplit("anonymity", 1)
    string = string[0]

    # Apply the stanza (Stanford NLP) tokenizer
    doc = nlp(string)

    # List to store the ruling sentences
    catch = []
    # Flag = 1 when decision found
    flag = 0
        
    # Make sentences
    for i, sentence in enumerate(doc.sentences):
        sente = [token.text for token in sentence.tokens]
        # Keep only the alpha tokens
        sente = [e for e in sente if e.isalpha()]
        #print(type(sente))
        catch.append(sente)
        
    # Identify decision leads in sentences
    decision_leads = [['notice', 'of', 'decision'], ['decision'], ['decisions'], ['conclusions']]
        
    # When decision lead found, trim catch and update flag value 
    for lead in decision_leads:
        try:
            # Find index of decision lead in ruling
            index = catch.index(lead)
            # Remove sentences before the decision lead sentence
            del catch[0:index]
            # Flatten the list of lists/sentences
            flat_catch = [item for sublist in catch for item in sublist]
            # Decision found
            flag = 1
            # Store decision in decisions list
            decisions.append(flat_catch)
            decision.update({'Decision:': flat_catch})

            #print('Found decision 1')
            #print(flat_catch)
            break
        except ValueError:
            continue
    
    # If a decision has not been found yet (flag = 0)
    if flag == 0:
    # Look for partial hits in text 
        decision_leads_part = [['for', 'the', 'above', 'reasons'], ['for', 'the', 'reasons', 'i', 'have', 'given'], ['general', 'conclusions'],
        ['for', 'the', 'reasons', 'set', 'out', 'above'], ['for', 'all', 'of', 'these', 'reasons'], ['decision', 'and', 'directions'], ['conclusions'],
        ['notice', 'of', 'decision'], ['decision','the', 'application', 'for', 'judicial', 'review', 'is'], ['there', 'is', 'no', 'material', 'error', 'of', 'law', 'in'],
        ['decision', 'the', 'decision', 'of', 'tribunal', 'judge', 'dean', 'promulgated'], ['the', 'decision', 'of', 'the', 'ftt', 'is', 'set', 'aside'],
        ['i', 'grant', 'permission', 'to', 'appeal', 'i', 'set', 'aside', 'the', 'decision', 'of', 'the', 'tribunal'], ['i', 'set', 'aside', 'that', 'decision'],
        ['the', 'appellant', 'appeal', 'as', 'originally', 'brought', 'to', 'the', 'ftt', 'is', 'dismissed']]
            
        for element in catch:
            for part in decision_leads_part:
                if all_exist(part, element):
                    index = catch.index(element)
                    # Decision found in catch
                    flag = 1
                    # Remove sentences before the decision lead sentence
                    del catch[0:index]
                    # Flatten the list of lists/sentences
                    flat_catch = [item for sublist in catch for item in sublist]
                    #print('Found decision 2')
                    #print(flat_catch)
                    break
            # Stop scanning once a decision lead has been found; otherwise the
            # loop keeps iterating over (and trimming) the already modified list
            if flag == 1:
                break
                
        # If a decision has still not been found (flag = 0)
        if flag == 0:
            decisions.append(np.nan)
            decision.update({'Decision:': np.nan})
            #print('Did not find a decision')
            #print(catch)
        else:
            # Store decision in decisions list
            decisions.append(flat_catch)
            decision.update({'Decision:': flat_catch})
            continue


# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)

2021-11-13 13:41:48 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2021-11-13 13:41:48 INFO: Use device: cpu
2021-11-13 13:41:48 INFO: Loading: tokenize
2021-11-13 13:41:48 INFO: Done loading processors!
100%|██████████| 35256/35256 [6:12:49<00:00,  1.58it/s]


In [53]:
dict_decisions = {'File':files,'Decision':decisions}

df = pd.DataFrame(dict_decisions, columns=['File','Decision'])
df.isna().sum()

File           0
Decision    5217
dtype: int64

In [57]:
print(data[48])
print(len(files))
print(len(decisions))
print(decisions[32488])
print(files[5000])
print(decisions[5000])
print(files[6000])
print(decisions[6002])



{'Case title:': '', 'Appellant name:': '', 'Status of case:': 'Unreported', 'Hearing date:': '14 Jul 2021', 'Promulgation date:': '11 Oct 2021', 'Publication date:': '26 Oct 2021', 'Last updated on:': '26 Oct 2021', 'Country:': '', 'Judges:': '', 'Document': 'https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/73729/DC000912019___DC001272019.doc', 'Reference': ['DC/00091/2019 &amp; DC/00127/2019'], 'Download': 'Yes', 'File': 'DC000912019___DC001272019', 'ID': '91fff7ff-1af7-435f-a3fd-00889afc2456', 'Code label:': 'DC'}
35256
35256
['decision', 'the', 'secretary', 'of', 'state', 'appeal', 'to', 'the', 'upper', 'tribunal', 'is', 'dismissed', 'the', 'tribunal', 'decision', 'to', 'allow', 'the', 'appellant', 'appeal', 'under', 'article', 'of', 'the', 'echr', 'and', 'on', 'the', 'basis', 'that', 'his', 'deportation', 'would', 'breach', 'the', 'refugee', 'convention', 'stand']
HU122212017
nan
HU129662018___HU129712018
['notice', 'of', 'decision', 'the', 'decision', 'did',

In [35]:
#print(df[df['Decision'].isnull()])
df.isnull().sum(axis = 0)


File          0
Decision    100
dtype: int64

In [135]:
print(len(files))
print(len(decisions))
print(files[98])
print(decisions[98])
print(df[df['Decision'].isnull()])



100
100
JR004822021
nan
                         File Decision
28                PA116272019      NaN
29                HU200422018      NaN
31                LP001692020      NaN
38  HU154862019___HU154872019      NaN
41                HU110842019      NaN
46                PA029442020      NaN
47                HU115652019      NaN
55                HU153562019      NaN
56                PA066102019      NaN
58  HU116682018___HU206172018      NaN
59                IA089812015      NaN
67                HU116682018      NaN
74                JR050772019      NaN
76                HU023242020      NaN
81                HU183682019      NaN
85                JR040732019      NaN
87                JR019472020      NaN
93       HU033052019___Others      NaN
98                JR004822021      NaN


In [None]:
# Loop over each text file and read its content
for text in os.listdir(txt_path):
    with open(txt_path + text, 'r') as file:
        files.append(text)
        #print(text)
        decision_string = file.read()
        # The strategy is to trim from both ends of the string


In [121]:
# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)


In [97]:
print(f'Current working directory: {os.getcwd()}')

# Open jsonData file
jsonData_path = os.path.join(os.getcwd(), 'data/jsonData.json')
with open(jsonData_path) as json_file:
    data = json.load(json_file)
    print(json.dumps(data[32554], indent = 4, sort_keys = True))

#parsed = json.loads(jsonData)
#print(json.dumps(parsed[16366], indent = 4, sort_keys = True))

Current working directory: /Users/albertamurgopacheco/Documents/GitHub/TFM
{
    "Appellant name:": "",
    "Case title:": "",
    "Country:": "",
    "Document": "https://moj-tribunals-documents-prod.s3.amazonaws.com/decision/doc_file/39898/DA000192013.doc",
    "Download": "Yes",
    "Hearing date:": "",
    "Judges:": "",
    "Last updated on:": "4 Dec 2013",
    "Promulgation date:": "23 Oct 2013",
    "Publication date:": "4 Dec 2013",
    "Reference": [
        "DA/00019/2013"
    ],
    "Status of case:": "Unreported"
}


# 7. Sense of the decision.

The text of the decision has been isolated. However, it does not yet tell us whether the ruling accepts the appeal, rejects it or is neutral.

In [None]:
# Sense of decision depends on the appellant. If the appellant is the Home Office, then... The decision of the First-tier Tribunal did not involve the  making  of an error of law and I uphold it
# is accepted, otherwise it is rejected.

# Rejected
rejected_phrases = [
    'The appeal is dismissed',
    'The decision of the First-tier Tribunal stands',
    'not involve an error on ',
    'not satisfied that  the  judge  erred ',
    'decision stands',
    'did  not  involve  the making of a material error on a point of law',
    'I do not set aside the decision but order that it shall stand',
    'appeal is dismissed',
]

# Accepted
accepted_phrases = [
    'The First-tier Tribunal erred in law',
    'I have remade the decision',
    'is set aside',
    'The appeal, as brought by the appellant to the First-tier  Tribunal,  is allowed.',
    'the appeal is remade and I allow the appeal',
    'Appeals allowed',
    'It  is  set  aside',
    "I allow the claimant's appeal",
    'The decision of the First-tier Tribunal has already  been  set  aside',
    'The original decision shall stand',
    'set aside the decision',
    'is set aside',
    'decision allowing the appeal on humanitarian protection grounds, as well as on human rights grounds',
]

# Neutral: anything that matches neither list
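One possible way to turn the phrases listed above into a label is a simple substring lookup after normalising case and whitespace. The sketch below is an assumption-laden illustration: `REJECTED`, `ACCEPTED`, `normalise` and `sense` are hypothetical names, only a subset of the phrases is used, and it ignores who the appellant is, which the note above says matters.

```python
import re

# Subset of the phrases listed above, lowercased (assumed mapping to each class)
REJECTED = ['appeal is dismissed', 'decision stands',
            'decision of the first-tier tribunal stands']
ACCEPTED = ['is set aside', 'erred in law', 'allow the appeal',
            'appeals allowed', 'appeal is allowed']

def normalise(text):
    """Lowercase and collapse whitespace so 'is  set  aside' matches 'is set aside'."""
    return re.sub(r'\s+', ' ', text).strip().lower()

def sense(decision_text):
    """Label a decision text as 'rejected', 'accepted' or 'neutral'."""
    text = normalise(decision_text)
    if any(phrase in text for phrase in REJECTED):
        return 'rejected'
    if any(phrase in text for phrase in ACCEPTED):
        return 'accepted'
    return 'neutral'

print(sense('It  is  set  aside'))
```

Because the 'Decision:' field is stored as a list of tokens, it would need to be joined with `' '.join(tokens)` first. A text containing phrases from both lists is labelled 'rejected' here simply because that list is checked first, which is a limitation of the sketch.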



# 8. Nationality of the appellant. 
The 'Country:' field is empty for a large share of the decisions.

In [None]:
# Look for it in the first half of the string
# GPE Countries, cities, states.
# LOC Non-GPE locations, mountain ranges, bodies of water.
#
sp = spacy.load("en_core_web_sm")
# Loop over every row in the 'Bio' column
# NOTE: 'Bio' looks like a placeholder; the df built above only has 'File' and 'Decision'
for text in df['Bio'].tolist():
    # use spacy to extract the entities
    doc = sp(text)
    for ent in doc.ents:    
        # keep only the entities labelled 'GPE'
        if ent.label_ in ['GPE']:
            print(ent.text, ent.label_)  


# Phrases that usually introduce the nationality
nationality_leads = [
    'the appellant is a national of',
    'the appellant is a citizen of',
    'the respondent is a citizen of',
    'is a citizen of',
    'citizen of',
    'is a national of',
    'The appellants are all citizens of ',
]
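The lead phrases above suggest a regex-based alternative to named entity recognition. The following sketch assumes the country name is written with capitalised words right after the lead phrase; `NATIONALITY` and `nationality` are hypothetical names and the sentences are invented examples.

```python
import re

# 'citizen(s) of' / 'national(s) of' followed by one or more capitalised words
NATIONALITY = re.compile(
    r'(?:citizens?|nationals?)\s+of\s+(?:the\s+)?([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)'
)

def nationality(text):
    """Return the country following a nationality lead phrase, or None."""
    match = NATIONALITY.search(text)
    return match.group(1) if match else None

print(nationality('The appellant is a citizen of Sri Lanka born in 1980.'))
```

Names containing lowercase connectors (for example those with "of" in the middle) would be truncated by this pattern, so combining it with the GPE entities from spaCy may be more robust.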

# Information extraction

In [23]:



# Regex for the Appellant
For the Appellant: ([\S\s]*)For the Respondent

# Regex for the Respondent
For the Respondent: (.*)\n\n 
# OR
For the Respondent: (.*)

Decisions 
https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.match?view=net-5.0
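A possible Python version of the two expressions for the legal representation is sketched below. The excerpt, the representative names and the helper `representation` are invented for illustration, and the appellant pattern is made lazy so it stops at the first "For the Respondent".

```python
import re

# Hypothetical excerpt; the layout in the real decisions varies
excerpt = (
    'Representation:\n'
    'For the Appellant: Mr A Smith, Counsel instructed by Example Solicitors\n'
    'For the Respondent: Ms B Jones, Home Office Presenting Officer\n'
    '\n'
    'DECISION AND REASONS\n'
)

APPELLANT = re.compile(r'For the Appellant:\s*([\s\S]*?)\s*For the Respondent')
RESPONDENT = re.compile(r'For the Respondent:\s*(.*)')

def representation(text):
    """Return (appellant_rep, respondent_rep), using 'NA' when not found."""
    app = APPELLANT.search(text)
    res = RESPONDENT.search(text)
    return (app.group(1).strip() if app else 'NA',
            res.group(1).strip() if res else 'NA')

print(representation(excerpt))
```

Returning 'NA' for missing matches keeps the output consistent with the other fields extracted in this notebook.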

In [None]:
# Extraction of decisions

https://research.iclr.co.uk/blog/blackstone-goes-live


# keyword extraction for issues in strategies
https://www.airpair.com/nlp/keyword-extraction-tutorial


WAYS IN WHICH DECISIONS ARE INTRODUCED:

1.- Throw away everything in ANNEX  or Appendix
2.- Keep only last 150 words of document

Apply rules with lemmas



For all of these reasons,

Notice of Decision      Signed
Notice of Decision  BETWEEN THESE TWO WILL CATCH A FEW      Direction Regarding Anonymity

DECISSION     Dated

Decision     Signed

For the above reasons:      Signed:

For  the  above  reasons  we  conclude 

We have concluded that,

For the above reasons 
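The lead/terminator pairs noted above can be sketched as a small helper that keeps the tail of the document, starts at a decision lead and cuts at the first terminator that follows it. The names `LEADS`, `ENDS` and `decision_span`, the priority order of the leads and the example text are all assumptions; the matching is case-sensitive and not exhaustive.

```python
# Leads in assumed priority order, and markers that usually end the decision
LEADS = ['Notice of Decision', 'For all of these reasons',
         'For the above reasons', 'Decision']
ENDS = ['Direction Regarding Anonymity', 'Signed', 'Dated']

def decision_span(text, tail_chars=2000):
    """Return the text from a decision lead up to the next terminator, or None."""
    tail = text[-tail_chars:]
    for lead in LEADS:
        # Last occurrence of the highest-priority lead present in the tail
        start = tail.rfind(lead)
        if start == -1:
            continue
        body = tail[start:]
        # Cut at the earliest terminator that follows the lead
        cuts = [body.find(end) for end in ENDS if body.find(end) != -1]
        if cuts:
            body = body[:min(cuts)]
        # Collapse whitespace
        return ' '.join(body.split())
    return None

example = ('The judge considered the evidence. Notice of Decision\n\n'
           'The appeal is dismissed.\n\nSigned   Date 1 January 2020')
print(decision_span(example))
```

Unlike the token-based approach used in section 5, this works directly on the raw string, so it may be a cheap first pass before running the tokenizer.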


In [None]:
# Decision extraction

# Read the file and close it again before processing its paragraphs
with open('./data/processed/txt_files/00003_ukait_2008_aa_others_pakistan.txt', "r") as f:
    paragraphs = f.read().split("\n\n")

for number, paragraph in enumerate(paragraphs, 1):
    print(number)
    print(paragraph)
    pattern = "Decisions"
    if paragraph.find(pattern) != -1:
        print("save to file") 
    else:
        print("don't save to file")

        

In [None]:
import spacy

# Load the model
nlp = spacy.load("en_blackstone_proto")
