
As part of the PDF-to-struct transform, there will be cases where entities are close but not exact matches, and we will want to account for them.

Here is some reference code that reuses our existing simple batch. Because the fuzzy match will likely be limited to evaluation over a mini-batch, the method is written to work with pandas DataFrames.

**NOTE**: In practice, it is better to fuzzy match each record to an individual's UID than to match only against other records within the same mini-batch.
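
A minimal sketch of what matching against UIDs could look like, assuming a reference registry that maps each UID to a known full name. The `registry` contents, the `match_to_uid` helper, and the `simple_ratio` scorer are all hypothetical and not part of this notebook. The scorer uses the standard library's `difflib` as a stand-in so the sketch runs without extra installs; a rapidfuzz scorer such as `rapidfuzz.fuzz.WRatio` could be passed in instead, though its scores are not directly comparable, so the threshold would need retuning.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional


def simple_ratio(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale (stdlib difflib, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_to_uid(
    name: str,
    registry: Dict[str, str],
    scorer: Callable[[str, str], float] = simple_ratio,
    threshold: float = 85.0,
) -> Optional[str]:
    """Return the UID whose reference name scores highest, or None if nothing reaches the threshold."""
    best_uid, best_score = None, threshold
    for uid, ref_name in registry.items():
        score = scorer(name, ref_name)
        if score >= best_score:
            best_uid, best_score = uid, score
    return best_uid


# Hypothetical registry: UID -> canonical full name
registry = {"u-001": "Jonathan Smith", "u-002": "Maria Garcia"}

print(match_to_uid("Jonathon Smith", registry))  # close spelling of a known person
print(match_to_uid("Wei Chen", registry))        # no candidate reaches the threshold
```

Because every record is compared against the same fixed registry, the result does not depend on which other records happen to share a mini-batch.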

---

The function below isn't a UDF. It takes a Spark DataFrame, builds the fuzzy-match mapping locally in pandas, and returns the result as a broadcast Spark DataFrame. Writing it this way isolates the fuzzy-match logic so that it can be improved in future iterations.
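
One possible future improvement, sketched under assumptions: because each name is compared only against names seen earlier in the batch, and is mapped to its match rather than to that match's own final target, a chain such as C → B → A can leave related names pointing at different values. The hypothetical helper `resolve_chains` below follows each target until it reaches a self-mapped root. It is keyed by plain name strings for illustration, whereas the notebook maps `extract` values.

```python
from typing import Dict


def resolve_chains(mapping: Dict[str, str]) -> Dict[str, str]:
    """Follow each key's target until it reaches a value that maps to itself."""
    resolved = {}
    for key in mapping:
        seen = {key}  # guards against accidental cycles
        target = mapping[key]
        while (
            target in mapping
            and mapping[target] != target
            and target not in seen
        ):
            seen.add(target)
            target = mapping[target]
        resolved[key] = target
    return resolved


# Hypothetical raw output of a first-seen fuzzy match: name -> matched name
raw = {
    "Jon Smith": "Jon Smith",    # first seen, maps to itself
    "John Smith": "Jon Smith",   # matched the first name
    "John Smyth": "John Smith",  # matched the second name, not the first
}

print(resolve_chains(raw))
```

After resolution, all three names in this example point at `"Jon Smith"`. Whether the first-seen name is the right root is a separate question that this sketch does not answer.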

In [0]:
%pip install rapidfuzz
dbutils.library.restartPython() 

In [0]:
from pyspark.sql import DataFrame
from rapidfuzz import process
from pyspark.sql.functions import broadcast

def broadcast_fuzzy_map(sdf: DataFrame) -> DataFrame:
    # Requires a column named `extract` with `firstname` and `lastname` fields.
    # Relies on the notebook-provided `spark` session.
    pdf = sdf.select("extract").toPandas().drop_duplicates()
    pdf['fullname'] = pdf.extract.apply(lambda x: f"{x['firstname']} {x['lastname']}")

    fullname_map = {r.fullname: r.extract for r in pdf.itertuples()}
    fuzzy_keys = set()
    fuzzy_vals = []

    for row in pdf.itertuples(index=True):
        # extractOne returns (choice, score, index), or None when there are no candidates yet
        best_match = process.extractOne(row.fullname, fuzzy_keys)
        if best_match is None or best_match[1] < 85:
            # No quality match found, will map to self:
            fuzzy_vals.append(row.extract)
        else:
            # Quality match found, will map to best matching extract
            fuzzy_vals.append(fullname_map.get(best_match[0], row.extract))
        fuzzy_keys.add(row.fullname)

    pdf['fuzzy_match'] = fuzzy_vals
    pdf = pdf.drop(columns=['fullname'])

    return broadcast(spark.createDataFrame(pdf))

In [0]:
src = spark.table("main.default.pdf_content")
dat = broadcast_fuzzy_map(src)
# `dat` is already marked for broadcast by broadcast_fuzzy_map
dat_matched = src.join(dat, on="extract", how="left")
display(dat_matched)

In [0]:
# Left for potential troubleshooting.
# Note: this process assumes that the first name encountered will be the preferred name.
# This needs to be verified, as it will create issues if the assumption is not valid.
# This method also risks a three-way name assignment, where related names do not all map to one value.

pdf = spark.table("main.default.pdf_content").select("extract").toPandas().drop_duplicates()
pdf['fullname'] = pdf.extract.apply(lambda x: f"{x['firstname']} {x['lastname']}")

fullname_map = {r.fullname: r.extract for r in pdf.itertuples()}
fuzzy_keys = set()
fuzzy_vals = []

for row in pdf.itertuples(index=True):
    best_match = process.extractOne(row.fullname, fuzzy_keys)
    #print("fullname: " + str(row.fullname))
    #print("fuzzy_keys: " + str(fuzzy_keys))
    #print("best_match: " + str(best_match))
    if best_match is None or best_match[1] < 85:
        # No quality match found, will map to self:
        fuzzy_vals.append(row.extract)
    else:
        # Quality match found, will map to best matching extract
        fuzzy_vals.append(fullname_map.get(best_match[0], row.extract))
    fuzzy_keys.add(row.fullname)
    #print("\nXXXXXXXXXXXXX\n")

pdf['fuzzy_match'] = fuzzy_vals

display(pdf)