# DESCRIPTIVE STATISTICS OF THE TRIBUNAL DECISIONS (20th November 2021)

This notebook extracts additional information from the text of the tribunal decisions and stores it in the relevant dictionary.

In particular, the notebook performs information extraction on:

1. The label included in the name of the file.

2. The court where the case was heard ("Heard at").

3. The judges.

4. The legal representation for the appellant and the respondent.

5. The decision/ruling by the judge.

Each of these fields is added to the dictionary of the corresponding judicial decision.

The resulting data set, a list of updated dictionaries, is serialised as a JSON file (jsonDataFinal.json).
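
As a rough illustration of this final step, the sketch below adds one extracted field to a decision dictionary and writes the list back to JSON. The sample record, the `Heard at:` key, the helper names `save_decisions` and `load_decisions`, and the temporary output path are all assumptions made for the example; only the `File` and `Decision:` keys and the jsonDataFinal.json file name come from this notebook.

```python
import json
import os
import tempfile


def save_decisions(decisions, path):
    """Write the list of decision dictionaries to `path` as JSON."""
    with open(path, 'w', encoding='utf-8') as json_file:
        json.dump(decisions, json_file, ensure_ascii=False)


def load_decisions(path):
    """Read the list of decision dictionaries back from `path`."""
    with open(path, encoding='utf-8') as json_file:
        return json.load(json_file)


# Hypothetical record: the real dictionaries hold many more fields
decisions = [
    {'File': 'AA000012015', 'Decision:': ['Heard', 'at', 'Field', 'House']},
]

# Add an extracted field to the dictionary (key name is an assumption)
decisions[0]['Heard at:'] = 'Field House'

# Write to a temporary folder so the example never overwrites real data
path = os.path.join(tempfile.mkdtemp(), 'jsonDataFinal.json')
save_decisions(decisions, path)
restored = load_decisions(path)
```

Reading the file back immediately is a cheap check that every added field survives the round trip through JSON.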

This notebook should run in the tfm environment, which can be created with the environment.yml file.

In [5]:
import ipykernel
import os
from os import listdir
from os.path import isfile, join, getsize
import numpy as np
import time
import re
import json
import pickle
import pandas as pd
#import whois
import sys
import datetime
from tqdm import tqdm
#import textract

# Detect whether the notebook is running in Google Colab
IN_COLAB = 'google.colab' in sys.modules


# What environment am I using?
print(f'Current environment: {sys.executable}')

# Change the current working directory
os.chdir('/Users/albertamurgopacheco/Documents/GitHub/TFM')
# What's my working directory?
print(f'Current working directory: {os.getcwd()}')


Current environment: /usr/local/bin/python3.7
Current working directory: /Users/albertamurgopacheco/Documents/GitHub/TFM


In [5]:
# Define working directories in colab and local execution

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/gdrive')
    docs_path = '/content/gdrive/MyDrive/TFM/data/raw'
    input_path = '/content/gdrive/MyDrive/TFM'
    output_path = '/content/gdrive/MyDrive/TFM/output'

else:
    docs_path = './data/raw'
    input_path = '.'
    output_path = './output'

In [None]:
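# Note: `!` commands run in a subshell, so this does not change the
# environment of the kernel that is already running
!conda activate blackstoneEnv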


# 1. Build a dataframe with the needed fields

There are two categories of cases: reported and unreported. Reported cases include richer data, while unreported cases (the vast majority) lack several data fields because one of the parties to the legal dispute requested anonymity.

The first two letters in the file name seem to follow some logic. Inspecting the documents reveals the following meanings:

In [13]:
# Open jsonData file as data
with open('./data/jsonDataFinal.json') as json_file:
    data = json.load(json_file)

# Loop over each text file and extract Court information
for decision in tqdm(data):
    # Obtain the full text of the court decision
    string = decision.get('Decision:')
    #file_name = decision.get('File')
    # Skip decisions whose text is stored as a float (a missing value)
    if isinstance(string, float):
        continue
    else:
        # If a label has not been found (flag = 0)
        flag = 0
        # The decisions are stored as a list of tokens. Use plain text.
        string = ' '.join(string)

100%|██████████| 35308/35308 [00:00<00:00, 595181.58it/s]
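
The cell above only rebuilds the plain text of each decision. As a minimal sketch of the dataframe step named in this section's title, the list of dictionaries can be loaded into pandas and the first two letters of the file name kept as a label column. The sample records, the `Label` column name and the helper `file_label` are hypothetical; the meaning of each two-letter prefix is not assumed here.

```python
import pandas as pd


def file_label(file_name):
    """Return the first two characters of a file name, upper-cased.

    Returns None when the value is not a usable string.
    """
    if not isinstance(file_name, str) or len(file_name) < 2:
        return None
    return file_name[:2].upper()


# Hypothetical records standing in for the contents of jsonDataFinal.json
sample = [
    {'File': 'AA000012015', 'Decision:': ['Heard', 'at', 'Field', 'House']},
    {'File': 'HU123452018', 'Decision:': float('nan')},
    {'File': 'aa000022015', 'Decision:': ['Heard', 'at', 'Bradford']},
]

# One row per decision, one column per dictionary key
df = pd.DataFrame(sample)

# Derive the label from the file name
df['Label'] = df['File'].map(file_label)

# Number of decisions per label
label_counts = df['Label'].value_counts()
```

Keeping the label in its own column means the counts per prefix can be produced with a single `value_counts()` call once the prefixes have been interpreted.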


# 2. Generate statistics on the fields

Inspecting a sample of judicial decisions shows that the name of the court appears near the start of the document, usually right after the expression "Heard at".

This field is captured with a search based on regular expressions.
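
One possible shape for such a search is sketched below. It is not the pattern used in the project: the stop words that end the court name (`On`, `Date`, `Before`, `Decision`, `Determination`), the 1000-character search window and the function name `extract_court` are all assumptions that would need checking against the real documents.

```python
import re

# Capture whatever follows "Heard at" up to the first assumed stop word.
# Stop words are matched case-sensitively so that a lower-case "on" inside
# a place name (e.g. "Stoke on Trent") does not end the match early.
HEARD_AT = re.compile(
    r'Heard\s+at\s*:?\s*'                                    # anchor expression
    r'(?P<court>.+?)'                                        # court name (lazy)
    r'(?=\s+(?:On|Date|Before|Decision|Determination)\b|$)'  # assumed stop words
)


def extract_court(tokens, search_window=1000):
    """Return the court name found after "Heard at", or None.

    `tokens` is the list of tokens stored under 'Decision:'. Anything that
    is not a list (for example a float for a missing text) yields None.
    """
    if not isinstance(tokens, list):
        return None
    # The court is expected near the start, so only search the first part
    text = ' '.join(tokens)[:search_window]
    match = HEARD_AT.search(text)
    return match.group('court').strip() if match else None


court_a = extract_court(['Heard', 'at', 'Field', 'House', 'On', '12', 'May', '2015'])
court_b = extract_court(['Heard', 'at', 'Stoke', 'on', 'Trent', 'Before', 'Judge', 'X'])
court_c = extract_court(float('nan'))
```

Restricting the search to the opening characters reduces the chance of matching a later, unrelated "Heard at" inside the body of the decision.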

print(data[10481])

# 3. Charts

