# Managing scraped data (27th October 2021)

This notebook manages the tribunal decisions data scraped in 0_dataScraping.ipynb.

In particular, the notebook:

1. Converts the 35258 downloaded Word (.doc/.docx) documents to .txt.

2. Stores the text of each judicial decision in the corresponding dictionary.
    
3. Provides some descriptive statistics of the downloaded files.

The resulting data set (a list of dictionaries) is serialised as a json object (jsonDataFinal.json).

This notebook should run in the tfm environment, which can be created with the environment.yml file.
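As a minimal sketch of the resulting structure, the snippet below builds a two-record list of dictionaries and round-trips it through JSON. The `File` and `String` keys are the ones used later in this notebook; the record values and the temporary file name are invented for illustration only.

```python
import json
import os
import tempfile

# Two illustrative records; the values are invented for this example
records = [
    {'File': 'example_decision_1', 'String': 'Text of the first decision.'},
    {'File': 'example_decision_2', 'String': 'Text of the second decision.'},
]

# Serialise the list of dictionaries to a temporary JSON file
tmp_dir = tempfile.mkdtemp()
json_path = os.path.join(tmp_dir, 'jsonDataExample.json')
with open(json_path, 'w') as fout:
    json.dump(records, fout)

# Load it back and check that nothing was lost
with open(json_path) as fin:
    loaded = json.load(fin)

print(loaded == records)
```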

In [6]:
import os
from os import listdir
from os.path import isfile, join, getsize
import numpy as np
import time
import re
import json
import pickle
import pandas as pd
import whois
import sys
import datetime
from tqdm import tqdm
import textract

# True when the notebook runs in Google Colab
IN_COLAB = 'google.colab' in sys.modules


# What environment am I using?
print(f'Current environment: {sys.executable}')

# Change the current working directory
os.chdir('/Users/albertamurgopacheco/Documents/GitHub/TFM')
# What's my working directory?
print(f'Current working directory: {os.getcwd()}')


Current environment: /Users/albertamurgopacheco/anaconda3/envs/tfm/bin/python
Current working directory: /Users/albertamurgopacheco/Documents/GitHub/TFM


In [7]:
# Define working directories in colab and local execution

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/gdrive')
    docs_path = '/content/gdrive/MyDrive/TFM/data/raw'
    input_path = '/content/gdrive/MyDrive/TFM'
    output_path = '/content/gdrive/MyDrive/TFM/output'

else:
    docs_path = './data/raw'
    input_path = '.'
    output_path = './output'

# Converting word documents (.doc/.docx) to .txt

In [8]:
# List all files in docs_path (name and size)
files_name_list_raw = [f for f in listdir(docs_path) if isfile(join(docs_path, f))]
files_size_list = [getsize(join(docs_path, f)) for f in files_name_list_raw]

In [9]:
# Obtain/check number of files
print('Number of files:', len(files_name_list_raw))
# Unique files based on file name
print('Number of unique file names:', len(set(files_name_list_raw)))
# Unique files based on file size
print('Number of unique file sizes:', len(set(files_size_list)))

Number of files: 35258
Number of unique file names: 35258
Number of unique file sizes: 35258


There are no duplicated file names. The size count recorded above was produced by passing `files_name_list_raw` to both `set` calls, so it repeats the name count and does not yet establish that the file sizes are unique.
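File names are always unique within a single directory, and two files of equal size can still differ, so hashing the content is a more direct duplicate check. The sketch below assumes each file fits in memory; `find_duplicate_files` is a hypothetical helper, not part of the original notebook.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(folder):
    """Group the files in folder by the SHA-256 hash of their content.

    Hypothetical helper: returns only the groups holding more than one file.
    """
    by_hash = defaultdict(list)
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        # Hash the raw bytes so the check is independent of name and format
        with open(path, 'rb') as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        by_hash[digest].append(name)
    return [names for names in by_hash.values() if len(names) > 1]
```

Calling `find_duplicate_files(docs_path)` would return an empty list when no two documents share the same content.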

Create a function to obtain the data from the court's decision detail page using BeautifulSoup.

TO DO:

2. Try with different files to make sure it works.

3. Capture exceptions and 204 responses.

4. Create function.

5. How am I storing the dicts? In a list?

6. Try function with just a few observations.

Reference: https://stackoverflow.com/questions/20638006/convert-list-of-dictionaries-to-a-pandas-dataframe



In [11]:
# Delete DS_Store files in raw data folder
!find . -name '.DS_Store' -type f -delete

# Files HU077022015.doc & HU029682017.docx are corrupt. Manually deleted from data/raw
# (textract does not handle the resulting Shell Error exceptions)

# Destination directory of txt files (created if it does not exist)
dest_files_path = os.path.join(os.getcwd(), 'data/processed/txt_files')
os.makedirs(dest_files_path, exist_ok=True)

# Loop to extract txt from Word files (tqdm shows a progress bar)
for word_file in tqdm(os.listdir(docs_path)):

    file, extension = os.path.splitext(word_file)

    # Build the txt file name by appending the .txt extension to the file name
    dest_file_name = file + '.txt'

    # Extract text from the file (textract returns bytes)
    content = textract.process(os.path.join(docs_path, word_file))

    # Write the bytes in binary mode ('wb'); the with block closes the file
    with open(os.path.join(dest_files_path, dest_file_name), 'wb') as write_text_file:
        write_text_file.write(content)

  2%|▏         | 871/35255 [01:34<1:02:24,  9.18it/s]


KeyboardInterrupt: 
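The run above was interrupted, and two corrupt files had to be removed by hand. A conversion loop that skips files already converted and records failures instead of stopping could look like the sketch below. `convert_all` and its `extract` parameter are hypothetical: `extract` stands for any callable that takes a path and returns bytes, such as `textract.process`, and catching `Exception` broadly is an assumption made because the exact exception class depends on the extractor.

```python
import os


def convert_all(src_dir, dest_dir, extract):
    """Convert every file in src_dir to a .txt file in dest_dir.

    extract: callable taking a path and returning bytes (hypothetical
    parameter; textract.process is one candidate).
    Returns a list of (file name, error) pairs for files that failed.
    """
    os.makedirs(dest_dir, exist_ok=True)
    failed = []
    for name in sorted(os.listdir(src_dir)):
        # Ignore hidden files such as .DS_Store
        if name.startswith('.'):
            continue
        stem, _ = os.path.splitext(name)
        dest = os.path.join(dest_dir, stem + '.txt')
        # Skip files already converted in a previous, interrupted run
        if os.path.exists(dest):
            continue
        try:
            content = extract(os.path.join(src_dir, name))
        except Exception as err:
            # Record the failure and carry on with the next file
            failed.append((name, repr(err)))
            continue
        # Write to a temporary name first so an interruption mid-write
        # never leaves a truncated .txt file behind
        tmp = dest + '.part'
        with open(tmp, 'wb') as out:
            out.write(content)
        os.replace(tmp, dest)
    return failed
```

With this shape, `failed = convert_all(docs_path, dest_files_path, textract.process)` can be re-run after an interruption and only processes the files that are still missing.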

Once the Word documents have been converted to .txt, the text of each decision can be added to the scraped data.

# Adding the text of each decision to jsonData
A string with the text of the decision is added to each dictionary in the list.

In [117]:
# Paths to jsonData & txt files
jsonData_path = os.path.join(os.getcwd(), 'data/jsonData.json')
txt_path = './data/processed/txt_files/'

# Open jsonData file as data
with open(jsonData_path) as json_file:
    data = json.load(json_file)

# Add the text of each court decision to data
for txt_file in tqdm(os.listdir(txt_path)):

    # File name without extension, used to match the 'File' key
    file_name, f_ext = os.path.splitext(txt_file)

    # Read the decision text
    with open(os.path.join(txt_path, txt_file), 'r') as file:
        string = file.read()

    # Search data (list of dictionaries) for dicts where 'File' == file_name
    for d in data:
        if d.get('File') == file_name:
            # Add dictionary key 'String' with value string
            d['String'] = string

100%|██████████| 35087/35087 [05:32<00:00, 105.43it/s]
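The loop above scans the whole list of dictionaries once per text file. Building an index from the `File` value to its records once avoids the repeated scan. The following is a sketch under that assumption; `attach_texts` is a hypothetical helper and the sample records are invented.

```python
from collections import defaultdict


def attach_texts(data, texts):
    """Add a 'String' key to every record whose 'File' value appears in texts.

    data: list of dictionaries with a 'File' key.
    texts: mapping from file name (without extension) to decision text.
    Returns the number of records updated.
    """
    # Index the records by 'File'; a list keeps every record sharing a name
    index = defaultdict(list)
    for record in data:
        index[record.get('File')].append(record)

    updated = 0
    for file_name, string in texts.items():
        for record in index.get(file_name, []):
            record['String'] = string
            updated += 1
    return updated


# Invented sample: two records share the name 'A', none is named 'C'
sample = [{'File': 'A'}, {'File': 'B'}, {'File': 'A'}]
n_updated = attach_texts(sample, {'A': 'text A', 'C': 'text C'})
print(n_updated)
```

Records without a matching text file are left untouched, as in the original loop.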


In [None]:
# Save data as a json file jsonDataFinal in data directory
with open('./data/jsonDataFinal.json', 'w') as fout:
    json.dump(data, fout)

# Descriptive statistics on the files


In [None]:
# Number of files

# Longest decision

# Shortest decision

# Number of reported vs unreported cases (use the name of the file to discriminate them)
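
The first three items could be computed along the following lines. This is only a sketch: it measures length in characters, which is one possible choice, and `text_length_stats` is a hypothetical helper. The reported/unreported split is not covered because it depends on the file naming rule.

```python
def text_length_stats(data):
    """Summarise decision lengths (in characters) for records with a 'String' key."""
    # (file name, text length) for every record that carries a decision text
    pairs = [(d.get('File'), len(d['String'])) for d in data if 'String' in d]
    stats = {'n_records': len(data), 'n_with_text': len(pairs)}
    if pairs:
        stats['longest'] = max(pairs, key=lambda p: p[1])
        stats['shortest'] = min(pairs, key=lambda p: p[1])
    return stats


# Invented sample records
sample_stats = text_length_stats([
    {'File': 'A', 'String': 'abcd'},
    {'File': 'B', 'String': 'ab'},
    {'File': 'C'},
])
print(sample_stats)
```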
