## Find key data points from multiple documents

Download <a href="https://drive.google.com/file/d/1V6hmJhCqMyR65e4tal1Q70Lc_jvtZm0F/view?usp=sharing">these documents</a>.

They all share the same structure.

Using regex, capture the following data points from every document and export them as a CSV:

- The case number.
- Whether the decision was to accept or reject the appeal.
- The request date.
- The decision date.
- The source file name.
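As a warm-up for the list above, here is a minimal sketch of pulling two of the fields out of a document header with `re.search`. The `sample` string is copied from the opening lines of one of the documents; the pattern names and the `extract_header` helper are illustrative assumptions, not part of the assignment.

```python
import re

# Header lines copied from the start of one of the documents.
sample = (
    "STATE OF NEW YORK REQUEST: February 5, 2015 \n"
    "DEPARTMENT OF HEALTH AGENCY\n"
    "CASE #: 6952578N\n"
)

# Hypothetical patterns: a date such as "February 5, 2015" and a
# seven-digit case number followed by one capital letter.
REQUEST_PAT = re.compile(r"REQUEST:\s+(\w+\s\d{1,2},\s\d{4})")
CASE_PAT = re.compile(r"^CASE\s#:\s(\d{7}[A-Z])", re.M)


def extract_header(text):
    """Return (request_date, case_number); each is None when not found."""
    request = REQUEST_PAT.search(text)
    case = CASE_PAT.search(text)
    return (
        request.group(1) if request else None,
        case.group(1) if case else None,
    )


request_date, case_number = extract_header(sample)
print(request_date, case_number)
```

Using `search` and checking for `None` keeps a document with a missing field from stopping the whole run.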




In [1]:
pip install icecream

Note: you may need to restart the kernel to use updated packages.


In [95]:
## Let's import all the libraries we are likely to need
import pandas as pd ## to easily export our data to dataframes/CSVs
from icecream import ic ## to debug easily
import itertools ## to flatten lists
import re ## regular expressions
import glob ## to find files by name pattern
from zipfile import ZipFile ## to deal with zipped files

In [29]:
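## Unzip the files needed

folder_name = "regex-docs.zip"  # despite the name, this is the zip archive

with ZipFile(folder_name, "r") as zipObj:
    zipObj.extractall()
    file_names = zipObj.namelist()  # archive members, kept for reference

## Collect the extracted text files in a stable, sorted order

myfiles = sorted(glob.glob('docs/*.txt'))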
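If extracting to disk is not needed, the texts can also be read straight from the archive. The sketch below is one possible alternative, not the approach used in this notebook: `read_texts_from_zip` is a hypothetical helper, it assumes the members are UTF-8 text, and it builds a tiny in-memory archive with stand-in content so it runs without the real download.

```python
import io
from zipfile import ZipFile


def read_texts_from_zip(archive):
    """Map each .txt member name to its decoded text, in sorted name order."""
    with ZipFile(archive, "r") as zip_obj:
        names = sorted(n for n in zip_obj.namelist() if n.endswith(".txt"))
        # Assumption: the files are UTF-8 encoded.
        return {name: zip_obj.read(name).decode("utf-8") for name in names}


# Build a small in-memory archive with stand-in content for the demo.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zip_obj:
    zip_obj.writestr("docs/decision02.txt", "CASE #: 7924923N\n")
    zip_obj.writestr("docs/decision01.txt", "CASE #: 6952578N\n")
buffer.seek(0)

demo_texts = read_texts_from_zip(buffer)
print(list(demo_texts))
```

Keeping the file name as the dictionary key also avoids having to append it to the text to track the source.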


In [73]:
## Open the files and compile all text into a list
## Append the file name to each text so it can later be matched to its file

texts = []

for file in myfiles:
    # encoding is an assumption: the curly apostrophes suggest UTF-8 files
    with open(file, 'r', encoding='utf-8') as text_obj:
        text = text_obj.read() + ' ' + file
        texts.append(text)
texts

["STATE OF NEW YORK REQUEST: February 5, 2015 \nDEPARTMENT OF HEALTH AGENCY\nCASE #: 6952578N\n______________________________________________________\n In the Matter of the Appeal of\n:\n:\nHEARING from a determination by the New York City :\n\n2. On December 22, 2014, a nursing assessor completed a Uniform Assessment System evaluation of the Appellant’s personal care needs.\n3. On December 22, 2014, a nursing assessor completed a client task sheet as to the Appellant’s personal care needs.\n4. By notice dated January 23, 2015, the Managed Long Term Care Plan determined to reduce the Appellant’s Personal Care Services authorization from 16 hours daily, 7 days weekly to 8 hours daily, 7 days weekly.\n5. On January 23, 2015, the Appellant requested an internal appeal.\n6. On February 5, 2015, this fair hearing was requested.\n7. By notice dated February 27, 2015, the Managed Long Term Care Plan determined to\nuphold its determination to reduce the Appellant’s Personal Care Services autho

In [109]:
## Initialize future column headers as empty lists

case_numbers = []
decisions = []
req_dates = []
dec_dates = []

## Compile each pattern once, outside the loop

case_pat = re.compile(r'^CASE\s#:\s(\d{7}[A-Z])', re.M | re.I)
dec_pat = re.compile(r"^The Managed Long Term Care Plan’s decision dated \w+\s\d{1,2},\s\d{4},\sis\s(\w+)", re.M | re.I)
req_pat = re.compile(r"REQUEST:\s(\w+\s\d{1,2},\s\d{4})", re.M | re.I)
dec_date_pat = re.compile(r"^The Managed Long Term Care Plan’s decision dated (\w+\s\d{1,2},\s\d{4})", re.M | re.I)

## Iterate through the texts with the regex to gather the info
## findall() returns a list of captures, so take the first item [0];
## this raises IndexError if a document has no match

for text in texts:
    case_numbers.append(case_pat.findall(text)[0])
    decisions.append(dec_pat.findall(text)[0])
    req_dates.append(req_pat.findall(text)[0])
    dec_dates.append(dec_date_pat.findall(text)[0])

## Zip together the lists to create a dataframe of the scraped info with the file names

df = pd.DataFrame(list(zip(case_numbers, decisions, req_dates, dec_dates, myfiles)),
            columns = ["Case-Number","Decision","Request-Date","Decision-Date","File-Name"])
df

,Case-Number,Decision,Request-Date,Decision-Date,File-Name
0,6952578N,rejected,"February 5, 2015","February 27, 2015",docs/decision01.txt
1,7924923N,accepted,"March 14, 2019","March 14, 2019",docs/decision02.txt
2,4964154N,rejected,"October 28, 2019","March 14, 2019",docs/decision03.txt
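The decision and the decision date come from the same sentence, so both can be captured in one pass. The following is a sketch under assumptions: `DECISION_PAT` and `parse_decision` are hypothetical names, and the `sample` text is constructed to fit the pattern rather than quoted from a document.

```python
import re

# One pattern with two named groups covers both the decision date and
# the outcome. The curly apostrophe matches the documents' text.
DECISION_PAT = re.compile(
    r"^The Managed Long Term Care Plan’s decision dated "
    r"(?P<date>\w+\s\d{1,2},\s\d{4}),\sis\s(?P<outcome>\w+)",
    re.M | re.I,
)


def parse_decision(text):
    """Return (decision_date, outcome), or (None, None) when nothing matches."""
    match = DECISION_PAT.search(text)
    if match is None:
        return (None, None)
    return (match.group("date"), match.group("outcome"))


# Constructed sample (an assumption, shaped to fit the pattern above).
sample = (
    "Some preceding line\n"
    "The Managed Long Term Care Plan’s decision dated February 27, 2015, is rejected.\n"
)
print(parse_decision(sample))
```

Named groups make it harder to mix up which capture is the date and which is the outcome.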


In [110]:
## Send the df to a CSV file
df.to_csv('appeal_decisions.csv', index = False, encoding = "UTF-8")
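The dates are exported as free text such as "February 5, 2015", which does not sort chronologically. As an optional extra step, here is a sketch that converts them to ISO 8601 with the standard library. `to_iso` is a hypothetical helper, and it assumes English month names (the default C locale); `pd.to_datetime` with the same format string would be the per-column equivalent in pandas.

```python
from datetime import datetime


def to_iso(date_text):
    """Convert a date like 'February 5, 2015' to '2015-02-05'."""
    return datetime.strptime(date_text, "%B %d, %Y").date().isoformat()


# Date strings taken from the output table above.
dates = ["February 5, 2015", "March 14, 2019", "October 28, 2019"]
iso_dates = [to_iso(d) for d in dates]
print(iso_dates)
```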