## Borrower Data Quality Assurance
This file contains code snippets that demonstrate my approach to parsing a borrower contract in PDF format for desired fields and their respective values. Using the PDFMiner library for PDF parsing and regular expressions (Regex) for pattern matching, the desired key-value pairs can be identified.


### Extracting all text
I began by extracting all text from the PDF. Luckily, PDFMiner makes this very simple!

In [42]:
from pdfminer.high_level import extract_text

pdf_filename = "Adam Smith - Private Loan Borrower Form.pdf"
pdf_path = "../PDFs/" + pdf_filename
# Replace the typographic ("smart") apostrophe with a plain one so later patterns match
pdf_text = extract_text(pdf_path).replace("’", "'")
print(pdf_text)

PRIVATE LOAN BORROWER FORM
To be submitted by private loan borrowers after credit approval by lender

Borrower Last Name: Smith

Borrower First Name: Adam

Date of Birth: 11/19/1982

Application ID: 423982870190

Your lender's name: XYZ Capital

Your lender's website home page: xyzcapital.com

Loan amount approved by your lender (US dollars): $4200

Signature: I understand that this form… I further understand that I must complete the following steps before my loan will be ready
for disbursement.

Borrower Signature Date: 03/27/2023

NEXT STEPS:

1. Review your Application Disclosure to find general information about your loan. Online applicants are able to click on a link to read the

disclosure. Students who apply by phone receive the disclosure via U.S. mail.

2. Once your loan is credit-approved, follow your lender's instructions for the next steps in your loan application process. 3.
Self-Certification-Complete the Applicant Self-Certification Form. You will receive it from your le
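
The `.replace` call above handles only the apostrophe. If other typographic characters turn up in a form, a translation table is one way to normalize several at once. This is a minimal sketch: `TYPOGRAPHIC_MAP` and `normalize_quotes` are hypothetical names, and the set of characters is an assumption about what a form might contain.

```python
# Hypothetical helper: map common typographic quotes to plain ASCII ones.
TYPOGRAPHIC_MAP = str.maketrans({
    "\u2019": "'",   # right single quotation mark (the "smart apostrophe")
    "\u2018": "'",   # left single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(TYPOGRAPHIC_MAP)

sample = "Your lender\u2019s name: XYZ Capital"
print(normalize_quotes(sample))
```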

### Using Regex to find desired fields and their respective values
With the extracted text in string format, we can now use Regex to find the desired fields and their respective values. This is done by writing "regular expressions," which are text patterns that specify three things:

1) the type of characters we're looking for,
2) the quantity of each character,
3) and the desired order of these characters.
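
As a small illustration of those three parts, here is a sketch of a pattern for the MM/DD/YYYY dates that appear in the form. The name `date_pattern` is only for this example and is not used elsewhere in the notebook.

```python
import re

# \d       -> the type of character (a digit)
# {2}, {4} -> the quantity of that character
# /        -> literal separators that fix the order: MM/DD/YYYY
date_pattern = re.compile(r"\d{2}/\d{2}/\d{4}")

print(date_pattern.findall("Date of Birth: 11/19/1982"))   # one date found
print(date_pattern.findall("Date of Birth: November 19"))  # no date in this format
```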

Once we have a regular expression, we can search a string for all of its matches using the *findall* function. Although there should be only one match (assuming the PDF is in the proper format), I chose *findall* over *search* so the program can catch improperly formatted PDFs: *search* returns only the first match, whereas *findall* returns every match, so duplicates can be counted.
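
To make the difference between the two functions concrete, here is a short sketch on a string with a duplicated field. The second date is made up for the example.

```python
import re

pattern = re.compile(r"Date of Birth: .+")
text = "Date of Birth: 11/19/1982\nDate of Birth: 01/01/1990"

first_only = pattern.search(text)     # a single Match object (or None)
all_matches = pattern.findall(text)   # a list of every non-overlapping match

print(first_only.group(0))  # search stops at the first hit
print(len(all_matches))     # findall exposes the duplicate
```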

In [43]:
import re

# Functionally the same as using pdf_text, but easier to alter for testing
test_text = "Date of Birth: 11/19/1982"
test_text_dupe = "Date of Birth: 11/19/1982\n Date of Birth: 11/19/1982"
test_text_none = "Dtae of Birth: 11/19/1982"

# Uses Regex (regular expressions) to create a pattern for the Date of Birth line in the form
dob_pattern = re.compile(r"Date of Birth: .+")
dob_match = dob_pattern.findall(test_text)[0] if len(dob_pattern.findall(test_text)) == 1 else "Multiple/no matches found."
dob_match_dupe = dob_pattern.findall(test_text_dupe)[0] if len(dob_pattern.findall(test_text_dupe)) == 1 else "Multiple/no matches found."
dob_match_none = dob_pattern.findall(test_text_none)[0] if len(dob_pattern.findall(test_text_none)) == 1 else "Multiple/no matches found."
print(dob_match)
print(dob_match_dupe)
print(dob_match_none)

Date of Birth: 11/19/1982
Multiple/no matches found.
Multiple/no matches found.


### Separating field names and their values with Regex's ( ... ) capturing group operator
Notice that in the *dob_match* example above, the key-value pair comes back as a single string, so separating the field name from its value would require additional string operations. To avoid that extra step, we use Regex's ( ... ) capturing group operator, which lets us specify which parts of the match we want returned. With two groups in the pattern, *findall* returns a (field name, value) tuple for each match, as in the following example:

In [44]:
# Uses capturing groups to separate field names and their respective values
dob_kv_pattern = re.compile(r"(Date of Birth): (.+)")
dob_kv_match = dob_kv_pattern.findall(test_text)[0] if len(dob_kv_pattern.findall(test_text)) == 1 else "Multiple/no matches found."
dob_kv_match_dupe = dob_kv_pattern.findall(test_text_dupe)[0] if len(dob_kv_pattern.findall(test_text_dupe)) == 1 else "Multiple/no matches found."
dob_kv_match_none = dob_kv_pattern.findall(test_text_none)[0] if len(dob_kv_pattern.findall(test_text_none)) == 1 else "Multiple/no matches found."
print(dob_kv_match)
print(dob_kv_match_dupe)
print(dob_kv_match_none)

('Date of Birth', '11/19/1982')
Multiple/no matches found.
Multiple/no matches found.
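
Capturing groups can also be named, which some readers may find easier to follow than positional tuples. This is an optional variation and not what the rest of this notebook uses; the group names `field` and `value` are my own choice for the sketch.

```python
import re

# (?P<name>...) gives a capturing group a name
kv_pattern = re.compile(r"(?P<field>Date of Birth): (?P<value>.+)")
match = kv_pattern.search("Date of Birth: 11/19/1982")

print(match.group("field"))
print(match.groupdict())
```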


### Customized Exception handling
In the code above, we handle errors by substituting a simple string message for the match when multiple or no matches are found. This has two problems: first, the message doesn't tell us whether there were multiple matches or none, and second, no error is actually raised, which can let bad data sneak through.

Let's improve this error-handling system. The two custom exceptions below, MultipleFieldMatchesFoundException and NoFieldMatchFoundException, are raised when multiple matches or no matches are found for a given field. This helps ensure that PDFs are in the proper format and that we aren't recording junk data.

In [45]:
class MultipleFieldMatchesFoundException(Exception):
    """
    Exception thrown when there are multiple matches for a given field name
    in a particular PDF, indicating improper PDF contract format.

    Inputs:
        field - field name with multiple matches
        num_matches - number of matches found
        pdf - filename of the PDF

    Attributes:
        message - string explanation of the error
    """
    def __init__(self, field, num_matches, pdf):
        self.field = field
        self.num_matches = num_matches
        self.pdf = pdf
        self.message = "Multiple matches ({1}) found for \'{0}\' in \'{2}\'. Check PDF \'{2}\' for proper format.".format(self.field, self.num_matches, self.pdf)
        super().__init__(self.message)
# # Test case
# raise MultipleFieldMatchesFoundException("Borrower Last Name", 3, "Adam Smith - Private Loan Borrower Form.pdf")

In [46]:
class NoFieldMatchFoundException(Exception):
    """
    Exception thrown when no match is found for a given field name
    in a particular PDF, indicating improper PDF contract format.

    Inputs:
        field - field name with no match
        pdf - filename of the PDF

    Attributes:
        message - string explanation of the error
    """
    def __init__(self, field, pdf):
        self.field = field
        self.pdf = pdf
        self.message = "No match found for \'{0}\' in \'{1}\'. Check PDF \'{1}\' for proper format.".format(self.field, self.pdf)
        super().__init__(self.message)
# # Test case
# raise NoFieldMatchFoundException("Borrower First Name", "Adam Smith - Private Loan Borrower Form.pdf")
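
Because the exceptions store the field and PDF name as attributes, calling code can catch them and record which file failed instead of halting. The sketch below uses a simplified stand-in for `NoFieldMatchFoundException` so that it runs on its own; `check_pdf` and the file names are hypothetical.

```python
class NoFieldMatchFoundException(Exception):
    """Simplified stand-in for the class defined above."""
    def __init__(self, field, pdf):
        self.field = field
        self.pdf = pdf
        super().__init__(f"No match found for '{field}' in '{pdf}'.")

def check_pdf(pdf):
    """Hypothetical check that always fails, to exercise the except branch."""
    raise NoFieldMatchFoundException("Borrower First Name", pdf)

failed_pdfs = []
for pdf in ["a.pdf", "b.pdf"]:
    try:
        check_pdf(pdf)
    except NoFieldMatchFoundException as err:
        # The attributes tell us exactly what to follow up on
        failed_pdfs.append((err.pdf, err.field))

print(failed_pdfs)
```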

### Dynamic Regex for variable field names
Up until this point, all of the Regex patterns we've created have been hard-coded. As a reminder, our Date of Birth Regex pattern separating key-value pairs looks like the following:

```python
dob_kv_pattern = re.compile(r"(Date of Birth): (.+)")
```

To find matches for any specified field, we need to make the pattern dynamic by building it from a variable, here with a raw f-string (the `rf` prefix).

In [59]:
field_name = "Date of Birth"
field_pattern = re.compile(rf"({field_name}): (.+)")
print(field_pattern.findall(pdf_text)[0])

# This is the same as...
print(dob_kv_pattern.findall(pdf_text)[0])

# And this dynamic pattern can be used to find other fields, too
field_name = "Your lender's name"
field_pattern = re.compile(rf"({field_name}): (.+)")
print(field_pattern.findall(pdf_text)[0])


('Date of Birth', '11/19/1982')
('Date of Birth', '11/19/1982')
("Your lender's name", 'XYZ Capital')
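
One caveat with interpolating a field name directly: if the name contains Regex metacharacters, they are read as pattern syntax. The form's loan amount label contains parentheses, for example. Passing the name through `re.escape` is one way to guard against this; the sketch below compares the two on that line of the form.

```python
import re

field_name = "Loan amount approved by your lender (US dollars)"
line = "Loan amount approved by your lender (US dollars): $4200"

# Without escaping, "(US dollars)" is parsed as a capturing group,
# so the literal parentheses in the text are never matched.
unescaped = re.compile(rf"({field_name}): (.+)")

# re.escape turns the metacharacters into literals.
escaped = re.compile(rf"({re.escape(field_name)}): (.+)")

print(unescaped.findall(line))
print(escaped.findall(line))
```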


### Creating a field searching function
With these components in place, we can now wrap this logic in a reusable function.

In [76]:
def find_field_match(field_name, pdf_text, pdf):
    """
    Searches PDF text for field name and its respective value. Throws an
    error if multiple matches or no matches are found.

    Inputs:
        field_name - the field being searched for
        pdf_text - the PDF text being searched
        pdf - filename of the PDF

    Returns a tuple containing the field name and its value.
    """
    # re.escape keeps characters such as "(" in a field name from being read as Regex syntax
    field_pattern = re.compile(rf"({re.escape(field_name)}): (.+)")
    matches = field_pattern.findall(pdf_text)  # search once and reuse the result
    num_matches = len(matches)
    if num_matches == 1:
        return matches[0]
    elif num_matches > 1:
        raise MultipleFieldMatchesFoundException(field_name, num_matches, pdf)
    else:
        raise NoFieldMatchFoundException(field_name, pdf)
# Test cases
print(find_field_match("Borrower First Name", pdf_text, pdf_filename))
print(find_field_match("Date of Birth", pdf_text, pdf_filename))
print(find_field_match("Application ID", pdf_text, pdf_filename))
print(find_field_match("Your lender's name", pdf_text, pdf_filename))
print(find_field_match("Your lender's website home page", pdf_text, pdf_filename))
#print(find_field_match("Borrower Fiirst Name", pdf_text, pdf_filename)) # Properly catches the no match found
#print(find_field_match("Field", "Field: asd\nField: asd", "Test PDF.pdf"))

('Borrower First Name', 'Adam')
('Date of Birth', '11/19/1982')
('Application ID', '423982870190')
("Your lender's name", 'XYZ Capital')
("Your lender's website home page", 'xyzcapital.com')
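
A possible refinement, not applied above, is to anchor the pattern to the start of a line so that a short name cannot match the tail of a longer label. This sketch assumes each label begins at the start of its line in the extracted text, which may not hold for every PDF. "Last Name" is a hypothetical short label used only to show the effect.

```python
import re

text = "Borrower Last Name: Smith\nBorrower First Name: Adam"
field_name = "Last Name"  # hypothetical short label, not a real form field

loose = re.compile(rf"({re.escape(field_name)}): (.+)")
# With re.MULTILINE, ^ and $ match at the start and end of every line
anchored = re.compile(rf"^({re.escape(field_name)}): (.+)$", re.MULTILINE)

print(loose.findall(text))     # matches inside "Borrower Last Name"
print(anchored.findall(text))  # no line starts with "Last Name"
```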


In [49]:
desired_fields = ["Borrower Last Name", "Borrower First Name", "Date of Birth"]


In [50]:
def get_data():

IndentationError: expected an indented block after function definition on line 1 (3310128034.py, line 1)
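
The `get_data` cell above has no body yet, which is what triggers the IndentationError. One way it might be completed is sketched below, assuming the intent is to look up every entry in `desired_fields` and collect the results. The parameters, the dictionary return type, and the compact `find_field` stand-in (which raises `ValueError` rather than the custom exceptions) are all assumptions made so the example runs on its own.

```python
import re

def find_field(field_name, text):
    """Compact stand-in for find_field_match, used only in this sketch."""
    matches = re.findall(rf"({re.escape(field_name)}): (.+)", text)
    if len(matches) != 1:
        raise ValueError(f"Expected 1 match for '{field_name}', found {len(matches)}.")
    return matches[0]

def get_data(fields, text):
    """Hypothetical completion: return {field name: value} for each requested field."""
    return dict(find_field(field, text) for field in fields)

sample_text = (
    "Borrower Last Name: Smith\n\n"
    "Borrower First Name: Adam\n\n"
    "Date of Birth: 11/19/1982"
)
desired_fields = ["Borrower Last Name", "Borrower First Name", "Date of Birth"]
print(get_data(desired_fields, sample_text))
```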