![alt text](https://briankowallaw.com/wp-content/uploads/2019/12/How-Long-does-a-Foreclosure-take-in-Florida-scaled.jpeg)


# Foreclosure Document Reader

## Overview
| Detail Tag            | Information                                                                                        |
|-----------------------|----------------------------------------------------------------------------------------------------|
| Originally Created By | Ariel Herrera arielherrera@analyticsariel.com                                                      |
| External References   | Demo: Hillsborough County Foreclosure Records |
| Input Datasets        | Foreclosure PDF doc key                                                                                    |
| Output Datasets       | POS (Parts of Speech) Tags on PDF |
| Input Data Source     | PDF |
| Output Data Source    | Visual |

## History
| Date         | Developed By  | Reason                                                |
|--------------|---------------|-------------------------------------------------------|
| 24th Oct 2020 | Ariel Herrera | Set up PDF scraper. |
| 30th Jan 2021 | Ariel Herrera | Transferred to Google Colab. |

## Getting Started
1. Copy this notebook: File -> Save a copy in Drive
2. Visit your local county's foreclosure site
  - <font color="green"><b>Demo</b></font>: [Hillsborough County Foreclosures](https://www.hillsborough.realforeclose.com/index.cfm)

## Useful Resources
- [Google Colab Cheat Sheet](https://towardsdatascience.com/cheat-sheet-for-google-colab-63853778c093)
- [NLP Resource](https://stackabuse.com/python-for-nlp-parts-of-speech-tagging-and-named-entity-recognition/)

# <font color="blue">Install Packages</font>

In [None]:
# install packages
# pip vs. apt-get:
# pip downloads and installs Python packages from PyPI (the Python Package Index).
# apt-get downloads and installs system packages from the Ubuntu repositories.

!pip install PyPDF2
!pip install textract
!apt-get install poppler-utils
!apt-get install tesseract-ocr

Collecting PyPDF2
  Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
     |████████████████████████████████| 81kB 3.8MB/s 
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... done
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-cp36-none-any.whl size=61087 sha256=5e500f1dd7ffe919c07e4f5f54950ab949447637437dbb68658972184a88e7a3
  Stored in directory: /ro

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 15 not upgraded.
Need to get 154 kB of archives.
After this operation, 613 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 poppler-utils amd64 0.62.0-2ubuntu2.12 [154 kB]
Fetched 154 kB in 2s (100 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 146442 files and directories currently installed.)
Preparing to unpack .../poppler-utils_0.62.0-2ubuntu2.12_amd64.deb ...
Unpacking poppler-utils (0.62.0-2ubuntu2.12) ...
Setting up poppler-utils (0.62.0-2ubuntu2.12) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-oc
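
The `apt-get` installs above provide command-line tools rather than Python packages, so an import will not reveal whether they are present. A minimal sketch of a check follows; the `REQUIRED_TOOLS` mapping and the `check_tools` helper are hypothetical names introduced here, not part of the original notebook.

```python
import shutil

# Hypothetical mapping of apt package -> an executable it provides.
# pdftotext ships with poppler-utils; tesseract ships with tesseract-ocr.
REQUIRED_TOOLS = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
}

def check_tools(tools=REQUIRED_TOOLS):
    """Return {package: path to executable, or None if it is not on PATH}."""
    return {pkg: shutil.which(exe) for pkg, exe in tools.items()}

for pkg, path in check_tools().items():
    status = path if path else f"NOT FOUND (try: apt-get install {pkg})"
    print(f"{pkg}: {status}")
```

Running this right after the install cell gives a quick signal before the OCR step is attempted later in the notebook.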

# <font color="blue">Import Libraries</font>

In [None]:
# Google Colab libraries
from google.colab import files
# pdf libraries
import PyPDF2 
import textract
# nlp libraries
# NLTK vs. spaCy:
# NLTK works on strings and lists of strings; spaCy is object-oriented (Doc and Token objects)
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.chunk import tree2conlltags
from nltk.tag import pos_tag
import spacy
from spacy import displacy
# pre-trained statistical model, assigns context-specific token vectors, POS tags, dependency parse and named entities.
import en_core_web_sm

# <font color="blue">Download Packages</font>

In [None]:
nltk.download('punkt') # tokenizer divides a text into a list of sentences
nltk.download('stopwords') #  common words in a language
nltk.download('averaged_perceptron_tagger') # tagging words with their parts of speech (POS)
nltk.download('maxent_ne_chunker') # pre-trained English named entity chunkers
nltk.download('words') # corpus of english words
nlp = en_core_web_sm.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


# <font color="blue">Upload PDF file</font>

In [None]:
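# open Colab's file picker so the user can upload a PDF file;
# returns a dict of {filename: file contents as bytes}
uploaded = files.upload()

# report the name and size of each uploaded file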
for filename in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=filename, length=len(uploaded[filename])))

Saving Judgment - ORI Record Date -   1_6_2021 11_22_38 AM  Name -  NEW RESIDENTIAL MORTGAGE LLC - NEWREZ LLC,  Inst. #_  2021006751   Case # - 292019CA012595A001HC Recpt # -   .pdf to Judgment - ORI Record Date -   1_6_2021 11_22_38 AM  Name -  NEW RESIDENTIAL MORTGAGE LLC - NEWREZ LLC,  Inst. #_  2021006751   Case # - 292019CA012595A001HC Recpt # -   .pdf
User uploaded file "Judgment - ORI Record Date -   1_6_2021 11_22_38 AM  Name -  NEW RESIDENTIAL MORTGAGE LLC - NEWREZ LLC,  Inst. #_  2021006751   Case # - 292019CA012595A001HC Recpt # -   .pdf" with length 382442 bytes


In [None]:
print("Name of file uploaded:", filename)

Name of file uploaded: Judgment - ORI Record Date -   1_6_2021 11_22_38 AM  Name -  NEW RESIDENTIAL MORTGAGE LLC - NEWREZ LLC,  Inst. #_  2021006751   Case # - 292019CA012595A001HC Recpt # -   .pdf


# <font color="blue">Read PDF file</font>

In [None]:
# Open the PDF in binary mode; the with-block closes the file when done.
with open(filename, 'rb') as pdfFileObj:
    # pdfReader parses the PDF (PyPDF2 1.26 API, as installed above).
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # Knowing the number of pages lets us loop over all of them.
    num_pages = pdfReader.numPages
    text = ""

    # Extract the text of each page and append it to one string.
    for page_number in range(num_pages):
        text += pdfReader.getPage(page_number).extractText()

# PyPDF2 cannot read scanned (image-based) PDFs, so it may return no text.
# In that case, fall back to the OCR library textract, which returns bytes.
if not text.strip():
    text = textract.process(filename, method='tesseract', language='eng')
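
The cell above tries a direct text extraction first and falls back to OCR only when nothing comes back. A minimal sketch of that pattern as a reusable helper is shown here. The function name `extract_text` and the idea of passing the two extractors in as callables are assumptions made for illustration, so the sketch runs without PyPDF2 or textract installed.

```python
def extract_text(filename, primary, fallback):
    """Try `primary(filename)`; if it yields no usable text, use `fallback`.

    Both extractors are plain callables (hypothetical stand-ins for the
    PyPDF2 and textract calls). Bytes results are decoded to str so the
    caller always receives the same type.
    """
    def to_str(value):
        return value.decode("utf-8") if isinstance(value, bytes) else value

    text = to_str(primary(filename))
    if text.strip():
        return text
    return to_str(fallback(filename))

# Stand-in extractors: the first simulates a scanned PDF with no text layer,
# the second simulates OCR output returned as bytes.
no_text_layer = lambda name: ""
fake_ocr = lambda name: b"UNIFORM FINAL JUDGMENT OF FORECLOSURE"

print(extract_text("judgment.pdf", no_text_layer, fake_ocr))
```

Normalizing to `str` inside the helper avoids a type mismatch later, since one path produces a string and the other produces bytes.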

In [None]:
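# textract returns bytes while PyPDF2 returns str, so decode only when needed
str_text = text.decode("utf-8") if isinstance(text, bytes) else text
str_text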

" \n\nInstrument #: 2021006751, Pg 1 of 7, 1/6/2021 11:22:38 AM Deputy Clerk: O Cindy Stuart, Clerk of the Circuit Court\nHillsborough County\n\nIN THE CIRCUIT COURT OF THE THIRTEENTH JUDICIAL CIRCUIT\nIN AND FOR HILLSBOROUGH COUNTY, FLORIDA\n\nCIRCUIT CIVIL DIVISION\nNEWREZ LLC D/B/A SHELLPOINT MORTGAGE\nSERVICING,\n\nPlaintiff,\n\nCASE NO.: 2019CA012595\nvs. DIVISION:\n\nMARY TESFAY; UNKNOWN SPOUSE OF MARY\nTESFAY; SOUTH\n\n \n\n3\nS ge\nQO m Bm\nPOINTE OF TAMPA o @ 22\nHOMEOWNERS ASSOCIATION, INC., c 8 He\nDefendants. wD ~o Ss —\n/ aA = Bm\no mv ~\nnN\nBog\n\nUNIFORM FINAL JUDGMENT OF FORECLOSURE\n(Effective July 22, 2019)\n\nTHIS ACTION was heard before the court on plaintiff's Motion for Summary\n\nFinal Judgment on December 22, 2020. Based on the evidence presented and being\notherwise fully informed in the premises,\n\nIT IS ADJUDGED that:\n\nPlaintiff's Motion for Summary Judgment is GRANTED. Service of process has\n\nbeen duly and regularly obtained over MARY TESFAY; UNKNOWN S

# <font color="blue">Explore Text</font>

In [None]:
# The word_tokenize() function will break our text phrases into individual words.
tokens = word_tokenize(str_text)

# We'll create a new list that contains punctuation we wish to clean.
punctuations = ['(',')',';',':','[',']',',']

# We load NLTK's English stop words: common words such as "the", "I", and "and"
# that don't hold much value as keywords.
stop_words = stopwords.words('english')

# Keep only the tokens that are NOT in stop_words and NOT in punctuations.
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]

In [None]:
# view the first 50 keywords
keywords[:50]

['Instrument',
 '#',
 '2021006751',
 'Pg',
 '1',
 '7',
 '1/6/2021',
 '11:22:38',
 'AM',
 'Deputy',
 'Clerk',
 'O',
 'Cindy',
 'Stuart',
 'Clerk',
 'Circuit',
 'Court',
 'Hillsborough',
 'County',
 'IN',
 'THE',
 'CIRCUIT',
 'COURT',
 'OF',
 'THE',
 'THIRTEENTH',
 'JUDICIAL',
 'CIRCUIT',
 'IN',
 'AND',
 'FOR',
 'HILLSBOROUGH',
 'COUNTY',
 'FLORIDA',
 'CIRCUIT',
 'CIVIL',
 'DIVISION',
 'NEWREZ',
 'LLC',
 'D/B/A',
 'SHELLPOINT',
 'MORTGAGE',
 'SERVICING',
 'Plaintiff',
 'CASE',
 'NO',
 '.',
 '2019CA012595',
 'vs.',
 'DIVISION']
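
The output above still contains words such as `IN`, `THE`, `OF`, `AND`, and `FOR`. NLTK's English stop-word list is lowercase, so an exact membership test does not match uppercase tokens. A minimal sketch of a case-insensitive filter follows. The small `STOP_WORDS` set and the `filter_keywords` helper are illustrative stand-ins, not the notebook's own code, chosen so the example runs without NLTK data.

```python
from collections import Counter

# Tiny illustrative stop-word set (assumption); the notebook itself uses
# stopwords.words('english') from NLTK, whose entries are lowercase.
STOP_WORDS = {"in", "the", "of", "and", "for"}
PUNCTUATION = {"(", ")", ";", ":", "[", "]", ",", "."}

def filter_keywords(tokens, stop_words=STOP_WORDS, punctuation=PUNCTUATION):
    """Drop punctuation and stop words, matching stop words case-insensitively."""
    return [t for t in tokens
            if t not in punctuation and t.lower() not in stop_words]

tokens = ["IN", "THE", "CIRCUIT", "COURT", "OF", "THE", "THIRTEENTH",
          "JUDICIAL", "CIRCUIT", ",", "Hillsborough", "County"]
keywords = filter_keywords(tokens)
print(keywords)

# Count keywords regardless of case to see which terms dominate.
print(Counter(k.lower() for k in keywords).most_common(1))
```

Lowercasing only for the comparison keeps the original casing in the result, which matters later because capitalization is a cue for named-entity recognition.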

# <font color="blue">Write and Read Text</font>

In [None]:
# write the extracted text to a file; the with-block closes it automatically
with open("output.txt", "w") as text_file:
    text_file.write(str_text)

In [None]:
# read the file back; the with-block closes it automatically
with open("output.txt", "r") as f:
    doc = f.read()
print(doc)

 

Instrument #: 2021006751, Pg 1 of 7, 1/6/2021 11:22:38 AM Deputy Clerk: O Cindy Stuart, Clerk of the Circuit Court
Hillsborough County

IN THE CIRCUIT COURT OF THE THIRTEENTH JUDICIAL CIRCUIT
IN AND FOR HILLSBOROUGH COUNTY, FLORIDA

CIRCUIT CIVIL DIVISION
NEWREZ LLC D/B/A SHELLPOINT MORTGAGE
SERVICING,

Plaintiff,

CASE NO.: 2019CA012595
vs. DIVISION:

MARY TESFAY; UNKNOWN SPOUSE OF MARY
TESFAY; SOUTH

 

3
S ge
QO m Bm
POINTE OF TAMPA o @ 22
HOMEOWNERS ASSOCIATION, INC., c 8 He
Defendants. wD ~o Ss —
/ aA = Bm
o mv ~
nN
Bog

UNIFORM FINAL JUDGMENT OF FORECLOSURE
(Effective July 22, 2019)

THIS ACTION was heard before the court on plaintiff's Motion for Summary

Final Judgment on December 22, 2020. Based on the evidence presented and being
otherwise fully informed in the premises,

IT IS ADJUDGED that:

Plaintiff's Motion for Summary Judgment is GRANTED. Service of process has

been duly and regularly obtained over MARY TESFAY; UNKNOWN SPOUSE
OF MARY TESFAY; SOUTH

POINTE OF TAMPA H

# <font color="blue">Tokenize and tag POS</font>

![alt text](https://nlpforhackers.io/wp-content/uploads/2016/08/Intro-POS-Tagging.png)

In [None]:
# create a spaCy document that we will be using to perform parts of speech tagging
sen = nlp(u"I like to do boxing. I hated it in my childhood though")
sen

I like to do boxing. I hated it in my childhood though

In [None]:
# print the text, coarse-grained POS tag, fine-grained POS tag,
# and an explanation of the fine-grained tag for each word in the sentence
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

I            PRON       PRP      pronoun, personal
like         VERB       VBP      verb, non-3rd person singular present
to           PART       TO       infinitival "to"
do           AUX        VB       verb, base form
boxing       NOUN       NN       noun, singular or mass
.            PUNCT      .        punctuation mark, sentence closer
I            PRON       PRP      pronoun, personal
hated        VERB       VBD      verb, past tense
it           PRON       PRP      pronoun, personal
in           ADP        IN       conjunction, subordinating or preposition
my           DET        PRP$     pronoun, possessive
childhood    NOUN       NN       noun, singular or mass
though       SCONJ      IN       conjunction, subordinating or preposition
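
The aligned columns above come from the f-string width specifiers, and the tagged rows can be summarized once they are collected. The sketch below copies the (text, coarse tag, fine tag) triples from the output so it runs without spaCy. The helper names `format_row`, `pos_counts`, and `words_with_pos` are hypothetical.

```python
from collections import Counter

# (text, coarse POS, fine-grained tag) triples copied from the output above.
tagged = [
    ("I", "PRON", "PRP"), ("like", "VERB", "VBP"), ("to", "PART", "TO"),
    ("do", "AUX", "VB"), ("boxing", "NOUN", "NN"), (".", "PUNCT", "."),
    ("I", "PRON", "PRP"), ("hated", "VERB", "VBD"), ("it", "PRON", "PRP"),
    ("in", "ADP", "IN"), ("my", "DET", "PRP$"), ("childhood", "NOUN", "NN"),
    ("though", "SCONJ", "IN"),
]

def format_row(text, pos, tag):
    """Pad each field to a fixed width (12, 10, 8), left-aligned."""
    return f"{text:12} {pos:10} {tag:8}"

def pos_counts(rows):
    """Count how often each coarse-grained POS tag occurs."""
    return Counter(pos for _, pos, _ in rows)

def words_with_pos(rows, pos):
    """Return the words carrying a given coarse-grained POS tag."""
    return [text for text, p, _ in rows if p == pos]

for row in tagged[:3]:
    print(format_row(*row))
print(pos_counts(tagged))
print(words_with_pos(tagged, "NOUN"))
```

Filtering by tag in this way is one simple route to pulling candidate keywords (for example, nouns) out of a longer document.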


In [None]:
# spaCy's pre-trained model uses the context (the surrounding words)
# to assign each word its most likely POS tag.

# view dependency of each token on one another
displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})

In [None]:
# named entities
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(doc)))
tree2conlltags(ne_tree)

[('Instrument', 'JJ', 'O'),
 ('#', '#', 'O'),
 (':', ':', 'O'),
 ('2021006751', 'CD', 'O'),
 (',', ',', 'O'),
 ('Pg', 'NNP', 'O'),
 ('1', 'CD', 'O'),
 ('of', 'IN', 'O'),
 ('7', 'CD', 'O'),
 (',', ',', 'O'),
 ('1/6/2021', 'CD', 'O'),
 ('11:22:38', 'CD', 'O'),
 ('AM', 'NNP', 'O'),
 ('Deputy', 'NNP', 'O'),
 ('Clerk', 'NNP', 'O'),
 (':', ':', 'O'),
 ('O', 'NNP', 'O'),
 ('Cindy', 'NNP', 'B-PERSON'),
 ('Stuart', 'NNP', 'I-PERSON'),
 (',', ',', 'O'),
 ('Clerk', 'NNP', 'B-PERSON'),
 ('of', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('Circuit', 'NNP', 'B-ORGANIZATION'),
 ('Court', 'NNP', 'I-ORGANIZATION'),
 ('Hillsborough', 'NNP', 'O'),
 ('County', 'NNP', 'O'),
 ('IN', 'NNP', 'O'),
 ('THE', 'NNP', 'O'),
 ('CIRCUIT', 'NNP', 'B-ORGANIZATION'),
 ('COURT', 'NNP', 'O'),
 ('OF', 'IN', 'O'),
 ('THE', 'NNP', 'B-ORGANIZATION'),
 ('THIRTEENTH', 'NNP', 'B-ORGANIZATION'),
 ('JUDICIAL', 'NNP', 'O'),
 ('CIRCUIT', 'NNP', 'O'),
 ('IN', 'NNP', 'O'),
 ('AND', 'NNP', 'O'),
 ('FOR', 'NNP', 'B-ORGANIZATION'),
 ('HILLSBOROUG
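
In the output above, the third element of each triple is an IOB tag: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. A minimal sketch of merging these tags into whole entities follows. The `group_entities` helper is a hypothetical name, and the sample rows are copied from the output so the example runs without NLTK.

```python
def group_entities(conll_tags):
    """Merge (token, pos, iob) triples into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, _pos, iob in conll_tags:
        label = iob[2:] if iob != "O" else None
        if iob.startswith("I-") and label == current_label:
            current_tokens.append(token)      # continue the open entity
            continue
        flush()                               # close whatever was open
        if label is None:
            current_tokens, current_label = [], None
        else:
            current_tokens, current_label = [token], label
    flush()
    return entities

sample = [
    ("O", "NNP", "O"),
    ("Cindy", "NNP", "B-PERSON"),
    ("Stuart", "NNP", "I-PERSON"),
    (",", ",", "O"),
    ("Clerk", "NNP", "B-PERSON"),
    ("of", "IN", "O"),
    ("the", "DT", "O"),
    ("Circuit", "NNP", "B-ORGANIZATION"),
    ("Court", "NNP", "I-ORGANIZATION"),
]
print(group_entities(sample))
```

Grouped this way, the tagger's mistakes are easier to spot: `Clerk` is labeled as a person here, which suggests the results need review before they are used downstream.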

In [None]:
# highlight the named entities found in the extracted PDF text
displacy.render(nlp(doc), jupyter=True, style='ent')

# End Notebook