## Quantifying losses at each step in the code

This notebook counts how many X-17A-5 filings pass through each step of the process, from step 3 (taking the PDF subset) to step 8 (creating the structured asset and liability database).

Note: steps 7 and 8 are in the notebook 'Structured Asset and Liability'.

In [142]:
from FocusReportSlicing import selectPages, extractSubset, brokerFilter
from ExtractBrokerDealers import dealerData

import time
import boto3
from sagemaker.session import Session
import sys
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import json

company_email = 'mathias.andler@ny.frb.org'
s3_pointer = boto3.client('s3')
s3_session = Session()

s3_bucket = "x17-a5-mathias-version-nit"
temp_folder = 'temp/'

In [143]:
from GLOBAL import GlobVars
export_pdf = GlobVars.temp_folder_pdf_slice
export_png = GlobVars.temp_folder_png_slice
input_raw = GlobVars.input_folder_raw
input_pdf = temp_folder + 'X-17A-5-PDF-SUBSETS/'
out_folder_raw_pdf = temp_folder + 'X-17A-5-PDF-RAW/'

## How to read these results

For each step, `len(..)` of the listing gives the number of objects under that step's S3 prefix, i.e. how many files remain in the bucket once the step has run.
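A listing can include the folder placeholder key itself (the step 2 output begins with `'input/X-17A-5/'`), which would inflate `len(..)` by one. The sketch below shows one way to count only real files. `count_files` is a hypothetical helper, not part of the pipeline modules, and it assumes that folder placeholders are exactly the keys ending in `/`.

```python
def count_files(keys):
    """Count S3 keys that are real objects, skipping folder placeholders.

    Assumption: a folder placeholder is any key that ends in '/'.
    """
    return sum(1 for key in keys if not key.endswith('/'))


# Sample keys copied from the step 2 listing
sample = [
    'input/X-17A-5/',
    'input/X-17A-5/1000146-2002-03-13.pdf',
    'input/X-17A-5/1000146-2003-03-03.pdf',
]

# Raw length versus number of actual files
print(len(sample), count_files(sample))
```

In this notebook the same helper could be applied to any of the listings, for example `count_files(x17)`.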

## After step 2: total PDFs downloaded

In [144]:
x17 = s3_session.list_s3_files(s3_bucket,'input/X-17A-5/')

In [145]:
x17

['input/X-17A-5/',
 'input/X-17A-5/1000146-2002-03-13.pdf',
 'input/X-17A-5/1000146-2003-03-03.pdf',
 'input/X-17A-5/1000146-2004-03-08.pdf',
 'input/X-17A-5/1000147-2002-02-28.pdf',
 'input/X-17A-5/1000147-2003-02-28.pdf',
 'input/X-17A-5/1000147-2004-02-25.pdf',
 'input/X-17A-5/1000147-2004-03-29.pdf',
 'input/X-17A-5/1000147-2005-02-28.pdf',
 'input/X-17A-5/1000147-2006-03-02.pdf',
 'input/X-17A-5/1000147-2007-03-01.pdf',
 'input/X-17A-5/1000147-2008-02-29.pdf',
 'input/X-17A-5/1000147-2009-03-02.pdf',
 'input/X-17A-5/1000147-2010-03-01.pdf',
 'input/X-17A-5/1000147-2011-03-02.pdf',
 'input/X-17A-5/1000148-2002-03-01.pdf',
 'input/X-17A-5/1000148-2003-02-28.pdf',
 'input/X-17A-5/1000148-2004-03-01.pdf',
 'input/X-17A-5/1000148-2005-03-01.pdf',
 'input/X-17A-5/1000148-2006-02-24.pdf',
 'input/X-17A-5/1000148-2007-03-01.pdf',
 'input/X-17A-5/1000148-2008-02-29.pdf',
 'input/X-17A-5/1000148-2009-02-27.pdf',
 'input/X-17A-5/1000148-2010-03-02.pdf',
 'input/X-17A-5/1000148-2011-02-28.pdf

In [146]:
len(x17)

93599

## After step 3: first 20 pages extracted from each PDF

In [147]:
pdf_sub = s3_session.list_s3_files(s3_bucket,'temp/X-17A-5-PDF-SUBSETS/')

In [148]:
pdf_sub 

['temp/X-17A-5-PDF-SUBSETS/1000146-2002-03-13-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000146-2003-03-03-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000146-2004-03-08-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2002-02-28-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2003-02-28-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2004-02-25-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2004-03-29-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2005-02-28-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2006-03-02-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2007-03-01-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2008-02-29-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2009-03-02-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2010-03-01-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000147-2011-03-02-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000148-2002-03-01-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000148-2003-02-28-subset.pdf',
 'temp/X-17A-5-PDF-SUBSETS/1000148-2004-03-01-subset.pdf

In [149]:
len(pdf_sub)

93598

## After step 4: Textract

In [150]:
pdf_raw = s3_session.list_s3_files(s3_bucket,'temp/X-17A-5-PDF-RAW/')

In [151]:
pdf_raw

['temp/X-17A-5-PDF-RAW/1000146-2002-03-13.csv',
 'temp/X-17A-5-PDF-RAW/1000146-2003-03-03.csv',
 'temp/X-17A-5-PDF-RAW/1000146-2004-03-08.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2002-02-28.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2003-02-28.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2004-02-25.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2004-03-29.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2005-02-28.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2006-03-02.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2007-03-01.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2008-02-29.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2009-03-02.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2010-03-01.csv',
 'temp/X-17A-5-PDF-RAW/1000147-2011-03-02.csv',
 'temp/X-17A-5-PDF-RAW/1000148-2002-03-01.csv',
 'temp/X-17A-5-PDF-RAW/1000148-2003-02-28.csv',
 'temp/X-17A-5-PDF-RAW/1000148-2004-03-01.csv',
 'temp/X-17A-5-PDF-RAW/1000148-2005-03-01.csv',
 'temp/X-17A-5-PDF-RAW/1000148-2006-02-24.csv',
 'temp/X-17A-5-PDF-RAW/1000148-2007-03-01.csv',
 'temp/X-17A-5-PDF-RAW/1000148-2008-02-2

In [152]:
len(pdf_raw)

86409

## After step 5: PDF Clean

In [153]:
pdf_clean = s3_session.list_s3_files(s3_bucket,'temp/X-17A-5-CLEAN-PDFS/')

In [155]:
pdf_clean

['temp/X-17A-5-CLEAN-PDFS/1000146-2002-03-13.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000146-2003-03-03.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000146-2004-03-08.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2002-02-28.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2003-02-28.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2004-02-25.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2004-03-29.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2005-02-28.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2006-03-02.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2007-03-01.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2008-02-29.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2009-03-02.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2010-03-01.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000147-2011-03-02.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000148-2002-03-01.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000148-2003-02-28.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000148-2004-03-01.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000148-2005-03-01.csv',
 'temp/X-17A-5-CLEAN-PDFS/1000148-2006-02-24.csv',
 'temp/X-17A-5-CLEAN-PDFS/10001

In [156]:
len(pdf_clean)

86408

## After step 6: PDF Split

In [157]:
pdf_split_asset = s3_session.list_s3_files(s3_bucket,'temp/X-17A-5-SPLIT-PDFS/Assets/')

In [158]:
pdf_split_asset

['temp/X-17A-5-SPLIT-PDFS/Assets/1000146-2002-03-13.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000146-2003-03-03.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000146-2004-03-08.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2003-02-28.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2005-02-28.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2006-03-02.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2007-03-01.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2008-02-29.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2009-03-02.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2010-03-01.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2011-03-02.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000148-2002-03-01.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000148-2003-02-28.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000148-2005-03-01.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000148-2006-02-24.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000148-2007-03-01.csv',
 'temp/X-17A-5-SPLIT-PDFS/Assets/1000148-2008-02-29.csv',
 'temp/X-17A-5

In [159]:
len(pdf_split_asset)

77846
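The counts say how many files were lost but not which ones. Comparing the visible part of the step 5 and step 6 listings already shows gaps (for example, `1000147-2002-02-28` appears under `X-17A-5-CLEAN-PDFS/` but not under `Assets/`). A minimal sketch of that comparison follows. `filing_id` and `lost_between` are hypothetical helpers, and the sketch assumes every key has the form `<prefix>/<id>[-subset].<ext>`.

```python
def filing_id(key):
    """Reduce an S3 key to its filing id, e.g. '1000146-2002-03-13'.

    Assumption: keys look like '<prefix>/<id>[-subset].<ext>'.
    """
    name = key.rsplit('/', 1)[-1]      # drop the folder prefix
    stem = name.rsplit('.', 1)[0]      # drop the file extension
    if stem.endswith('-subset'):       # step 3 files carry this suffix
        stem = stem[:-len('-subset')]
    return stem


def lost_between(before, after):
    """Return sorted filing ids present in `before` but absent from `after`."""
    before_ids = {filing_id(k) for k in before if not k.endswith('/')}
    after_ids = {filing_id(k) for k in after if not k.endswith('/')}
    return sorted(before_ids - after_ids)


# Sample keys copied from the step 5 and step 6 listings
clean_sample = [
    'temp/X-17A-5-CLEAN-PDFS/1000147-2002-02-28.csv',
    'temp/X-17A-5-CLEAN-PDFS/1000147-2003-02-28.csv',
    'temp/X-17A-5-CLEAN-PDFS/1000147-2004-02-25.csv',
    'temp/X-17A-5-CLEAN-PDFS/1000147-2004-03-29.csv',
    'temp/X-17A-5-CLEAN-PDFS/1000147-2005-02-28.csv',
]
asset_sample = [
    'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2003-02-28.csv',
    'temp/X-17A-5-SPLIT-PDFS/Assets/1000147-2005-02-28.csv',
]

lost = lost_between(clean_sample, asset_sample)
print(lost)
```

On the full listings, `lost_between(pdf_clean, pdf_split_asset)` would return every filing dropped at step 6. Because the `-subset` suffix is stripped, the same call should also work between steps 3 and 4.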

In [160]:
pdf_split_liable = s3_session.list_s3_files(s3_bucket,'temp/X-17A-5-SPLIT-PDFS/Liability & Equity/')

In [161]:
len(pdf_split_liable)

77846
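To read the losses at a glance, the counts reported above can be put side by side. The sketch below hard-codes the `len(..)` outputs from this notebook and computes, for each step, the number of files lost relative to the previous step and the share of the step 2 total that remains. `step_losses` is a hypothetical helper. Note that the step 2 listing includes the folder key `'input/X-17A-5/'`, so the one-file difference between steps 2 and 3 may not be a real loss.

```python
# Counts copied from the len(..) outputs in this notebook
counts = [
    ('step 2: downloaded PDFs', 93599),
    ('step 3: PDF subsets', 93598),
    ('step 4: Textract', 86409),
    ('step 5: PDF clean', 86408),
    ('step 6: PDF split', 77846),
]


def step_losses(counts):
    """Return (label, count, lost_vs_previous, share_of_first) per step."""
    first = counts[0][1]
    previous = first
    rows = []
    for label, n in counts:
        rows.append((label, n, previous - n, n / first))
        previous = n
    return rows


rows = step_losses(counts)
for label, n, lost, share in rows:
    # Columns: step, files present, files lost at this step, share of step 2
    print(f'{label:<26}{n:>8}{lost:>8}{share:>8.1%}')
```

The two large drops are at step 4 (7,189 files) and step 6 (8,562 files). The asset and liability listings both contain 77,846 files, so a single step 6 row is used here.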