The master_file is an example of how the address_compare library can be used to load training and test data, tag the addresses, standardize the addresses, and compare the different address lists.  It can serve as a reusable program by updating the input parameters.  If ground truth files are available, it will also show how well the tagger and compare functions perform.

In [1]:
from address_compare import aggregate_functions as aggf
from address_compare import address_randomizer as add_rndm
import pandas as pd

### Editable Parameters

Although this notebook is an example of how to use the address_compare library (and especially the aggregate functions in the aggregate_functions.py file), the parameters below can be changed to control the inputs and outputs. In other words, they let this file act as a reusable program sitting on top of the address_compare library. Each variable is described below:


The **run_mode** variable controls which portions of this notebook are run.  Options are:
- **'tagger'** = run the address tagger against a single file that also contains the ground truths.  The output shows how well the tagger did against the ground truths.  If using your own files, the fields with the ground truth values should be in the same spreadsheet tab and the field names should be "Tagged Street Number", "Tagged Pre Street Direction", "Tagged Street Name", "Tagged Street Type", "Tagged Post Street Direction", "Tagged Unit Type", "Tagged  Unit Number".  Alternatively, this mode can be run using randomly created addresses.
- **'comparer'** = tag 2 separate lists of addresses and find matches between the lists.  No ground truths are used to verify the accuracy of the tagger or the matcher.  The program runs against both **file_location_raw_addresses_1** and **file_location_raw_addresses_2**.  Alternatively, this mode can be run using randomly created addresses.
- **'comparer_truths'** = run the comparer and validate the matcher performance against the ground truths.  The program runs against both **file_location_raw_addresses_1** and **file_location_raw_addresses_2**.  In addition, the matched ground truths are read from **file_name_ground_truth_matches**.
- **'all'** = run all 3 modes, i.e., the tagger results are compared against the ground truths and the matcher results are compared against the ground truths.  The program runs against both **file_location_raw_addresses_1** and **file_location_raw_addresses_2**.  In addition, the matched ground truths are read from **file_name_ground_truth_matches**.
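
Before running the 'tagger' mode on your own file, it can help to confirm that the ground truth fields listed above are present. The following is a minimal sketch, not part of the address_compare library: `REQUIRED_TRUTH_FIELDS`, `normalize_name`, and `missing_truth_fields` are hypothetical names. The sketch collapses repeated spaces when comparing names, which is an assumption made only for this check; the library itself may require the names exactly as written.

```python
import pandas as pd

# Ground truth field names as listed for the 'tagger' mode (single-spaced here).
REQUIRED_TRUTH_FIELDS = [
    "Tagged Street Number",
    "Tagged Pre Street Direction",
    "Tagged Street Name",
    "Tagged Street Type",
    "Tagged Post Street Direction",
    "Tagged Unit Type",
    "Tagged Unit Number",
]

def normalize_name(name):
    """Collapse runs of whitespace so that spacing differences do not matter."""
    return " ".join(str(name).split())

def missing_truth_fields(frame):
    """Return the required ground truth fields that are absent from the frame."""
    present = {normalize_name(col) for col in frame.columns}
    return [field for field in REQUIRED_TRUTH_FIELDS if field not in present]

# Hypothetical file layout with only two of the ground truth fields.
example = pd.DataFrame(
    columns=["Single String Address", "Tagged Street Number", "Tagged  Unit Number"]
)
missing = missing_truth_fields(example)
print(missing)
```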

**standardize_addresses** - this variable can be set to True or False and determines whether the tagged addresses will be standardized (changed to upper case, with street types, directionals, unit types, etc. spelled out).  True = the addresses will be standardized
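
To illustrate what standardization means here, the sketch below upper-cases a tagged component and spells out a few abbreviations. It is only an illustration: the lookup tables and `standardize_component` are hypothetical and much smaller than anything a real standardizer would use, and the library's actual rules are not shown in this notebook.

```python
# Tiny, hypothetical lookup tables; a real standardizer would cover far more values.
STREET_TYPES = {"ST": "STREET", "AVE": "AVENUE", "HWY": "HIGHWAY"}
DIRECTIONALS = {"N": "NORTH", "S": "SOUTH", "E": "EAST", "W": "WEST"}
UNIT_TYPES = {"APT": "APARTMENT", "STE": "SUITE"}

def standardize_component(value, lookup):
    """Upper-case a component, drop a trailing period, and spell out known abbreviations."""
    cleaned = value.strip().upper().rstrip(".")
    return lookup.get(cleaned, cleaned)

street_type = standardize_component("Ave.", STREET_TYPES)
direction = standardize_component("n", DIRECTIONALS)
street_name = standardize_component("Main", {})  # no lookup: only upper-cased
print(direction, street_name, street_type)
```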

**use_raw_address_files** - if True, the files in file_location_raw_addresses_1 and 2 will be used.  If False, random addresses will be created via the model instead; the number created is set by the **num_rndm_addresses_to_create** variable.  The randomly created addresses can only be used if **run_mode** in ['tagger','comparer']
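
The constraints between **run_mode** and **use_raw_address_files** can be checked up front. The following is a rough sketch of such a check; `check_parameters` and the two constant lists are hypothetical helpers, not functions provided by address_compare.

```python
VALID_RUN_MODES = ['tagger', 'comparer', 'comparer_truths', 'all']
RANDOM_ADDRESS_MODES = ['tagger', 'comparer']  # modes that can use random addresses

def check_parameters(run_mode, use_raw_address_files):
    """Return a list of problems with the parameter combination (empty if none)."""
    problems = []
    if run_mode not in VALID_RUN_MODES:
        problems.append("run_mode must be one of %s" % VALID_RUN_MODES)
    elif not use_raw_address_files and run_mode not in RANDOM_ADDRESS_MODES:
        problems.append(
            "random addresses can only be used with run_mode in %s" % RANDOM_ADDRESS_MODES
        )
    return problems

print(check_parameters('comparer', False))
print(check_parameters('all', False))
```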

**field_name_raw_addresses** - the field name in the files or random addresses containing the single address string to be parsed/standardized

**field_name_record_id** - the field name in the files containing the record IDs.  If not present in the files, populate with None

**file_location_raw_addresses_1** and **file_location_raw_addresses_2** - the names and locations of the files to be tagged, standardized, and/or compared.  The default values point to files in the data folder that can be used to see how the address_compare model works

**file_name_ground_truth_matches** - this file will be used if **run_mode** == 'comparer_truths' or 'all'.  It contains the ground truth matched record IDs along with a field to denote if the matched records are exact, standardized exact, or inexact matches.
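
One way to think about validating the matcher against such a file is to treat both the modeled matches and the ground truth matches as sets of record ID pairs. The sketch below assumes the ground truths can be reduced to (record ID from list 1, record ID from list 2) pairs and reports precision and recall; the metrics the library actually calculates are not shown here, and `match_metrics` is a hypothetical name.

```python
def match_metrics(modeled_pairs, truth_pairs):
    """Compare modeled matches to ground truth matches, both given as ID pairs."""
    modeled = set(modeled_pairs)
    truth = set(truth_pairs)
    true_pos = len(modeled & truth)    # matched by the model and in the ground truths
    false_pos = len(modeled - truth)   # matched by the model only
    false_neg = len(truth - modeled)   # in the ground truths but missed by the model
    precision = true_pos / (true_pos + false_pos) if modeled else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }

# Hypothetical record ID pairs.
modeled = [(1, 101), (2, 102), (3, 999)]
truths = [(1, 101), (2, 102), (4, 104)]
metrics = match_metrics(modeled, truths)
print(metrics)
```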

**write_output_to_excel** - if True, the output of the specified **run_mode** will be written to Excel files in the output folder.  If False, the output will be printed in this notebook

In [2]:
run_mode = 'comparer' #choose from ['tagger','comparer','comparer_truths','all']


standardize_addresses = True #if True, the tagged address components will be standardized (changed to upper case, unit types, street types, etc. changed to long form names)

use_raw_address_files = False #if False, only the specified number of randomly created addresses below will be used; False only works with the 'tagger' and 'comparer' run modes
num_rndm_addresses_to_create = 1000 #if use_raw_address_files = False, the number of addresses that will be randomly created for use in the tagger and compare functions

field_name_raw_addresses = 'Single String Address' #represents the name of the field in the raw address files containing the raw address (street information)
field_name_record_id = None #represents the name of the field containing the Record ID in the raw files; if not present in the raw files, populate with None

file_location_raw_addresses_1 = 'data\\stnd tagged WA addresses - hwy as st type.xlsx'
#file_location_raw_addresses_1 = 'data\\tagged stnd CO Stores - hwy as street type.xlsx'
#file_location_raw_addresses_1 = 'data\\MarijuanaApplicants - test data list 1.xlsx'
file_location_raw_addresses_2 = 'data\\MarijuanaApplicants - test data list 2.xlsx'

file_name_ground_truth_matches = 'data\\marijuana applicants test data - correct matches.xlsx'

write_output_to_excel = True #if True, the output from the applicable modes will be written to Excel; otherwise, results will be printed in the notebook

### Tagger Run Mode

The following cell depicts an example using the tagger_vs_ground_truths aggregate function (the tagger_vs_ground_truths function is a single function using various components from the address_compare folder).  
- This function starts with a single file containing the single string address (the unparsed address), city, state, zip_code, as well as the Tagged versions of each record (i.e., the ground truths).  This mode also allows the user to start with randomly created addresses (by setting **use_raw_address_files** = False).
- The function parses and tags each component of the single string address, standardizes the components if standardize_addresses == True, compares the tagged components to the ground truths, and calculates applicable metrics (true positives = correct tag from the model [non-blanks] with the same non-blank tag in the ground truths).  
- Depending on whether the write_output_to_excel is set to true or false, the function results will either be written to excel or printed within the cell below.
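
The true positive definition above (a non-blank model tag that equals the ground truth tag) can be sketched with pandas as follows. This is an illustration only: the "Model " column prefix and `true_positive_counts` are hypothetical, and the library's own output layout may differ.

```python
import pandas as pd

def true_positive_counts(frame, components, truth_prefix="Tagged ", model_prefix="Model "):
    """Count rows where the model tag is non-blank and equals the ground truth tag."""
    counts = {}
    for component in components:
        model = frame[model_prefix + component].fillna("").astype(str).str.strip()
        truth = frame[truth_prefix + component].fillna("").astype(str).str.strip()
        counts[component] = int(((model != "") & (model == truth)).sum())
    return counts

# Hypothetical tagged output next to its ground truths.
example = pd.DataFrame({
    "Model Street Number": ["123", "45", ""],
    "Tagged Street Number": ["123", "54", ""],
    "Model Street Name": ["MAIN", "OAK", "ELM"],
    "Tagged Street Name": ["MAIN", "OAK", None],
})
counts = true_positive_counts(example, ["Street Number", "Street Name"])
print(counts)
```

Rows where both the model tag and the ground truth are blank are deliberately not counted, matching the "non-blanks" condition in the description.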

In [3]:
if run_mode == 'tagger':
    if not use_raw_address_files:
        randomized_addresses1 = add_rndm.random_addresses(num_rndm_addresses_to_create, field_name_raw_addresses)
        file_location_raw_addresses_1 = 'data\\randomized_addresses_list_1.xlsx'
        randomized_addresses1.to_excel(file_location_raw_addresses_1)
        
    df_dict = aggf.tagger_vs_ground_truths(file_location_raw_addresses_1, field_name_record_id, field_name_raw_addresses, standardize_addresses)
    if write_output_to_excel:
        output_name = 'output\\file_1_tagger_vs_truths.xlsx'
        tagger_writer = pd.ExcelWriter(output_name, engine='xlsxwriter')
        for sheet, frame in df_dict.items():
            frame.to_excel(tagger_writer, sheet_name=sheet)
        tagger_writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
    else:
        for sheet, frame in df_dict.items():
            print ("sheet name = ", sheet)
            print (frame)

### Comparer and Comparer_Truths Run Modes
The following cell depicts an example using the tag_and_compare_addresses aggregate function (the tag_and_compare_addresses is a single function using various components from the address_compare folder).

- This function starts with 2 files containing the unparsed/untagged addresses, along with the cities, states, and zip_codes.  For the 'comparer' mode only, randomly created addresses can be used in place of source files.
- The raw addresses are parsed into their components and tagged.  If standardize_addresses == True, the address components will be standardized.
- The cities will also be standardized by using the 'primary_city' for the corresponding zip_code from the USPS.  If the provided zip_code is not valid for the listed state, it will be logged as an error via the standardization.
- Once parsed, tagged, and standardized, an exact match will be run against the 2 lists to find matches.  Exact Matches will be split out from the remaining addresses that were unable to be matched in the output.
- If run_mode == 'comparer_truths', the exact matches found above will be compared against the ground truth matches.  Applicable metrics will be calculated to show how well the model did against the ground truths (true positives = an exact match from the model and an exact match in the ground truths).
- Depending on whether the write_output_to_excel is set to be true or false, the function results will either be written to excel or printed within the cell below.
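
The exact-match step described above can be approximated with a pandas merge on the standardized components. The following is a minimal sketch under stated assumptions: the column names, the `Record_ID` field, and `split_exact_matches` are illustrative, and the library's own matcher may work differently.

```python
import pandas as pd

def split_exact_matches(list_1, list_2, key_fields):
    """Merge two tagged address lists on key_fields and split matched from unmatched rows."""
    merged = list_1.merge(
        list_2, on=key_fields, how="outer", indicator=True, suffixes=("_1", "_2")
    )
    matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
    unmatched_1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    unmatched_2 = merged[merged["_merge"] == "right_only"].drop(columns="_merge")
    return matched, unmatched_1, unmatched_2

# Hypothetical tagged and standardized address lists.
list_1 = pd.DataFrame({
    "Record_ID": [1, 2],
    "Street Number": ["123", "9"],
    "Street Name": ["MAIN", "OAK"],
})
list_2 = pd.DataFrame({
    "Record_ID": [101, 102],
    "Street Number": ["123", "77"],
    "Street Name": ["MAIN", "PINE"],
})
matched, unmatched_1, unmatched_2 = split_exact_matches(
    list_1, list_2, ["Street Number", "Street Name"]
)
print(matched)
```

Because the merge compares the key fields for strict equality, standardizing both lists first is what lets "123 Main St" and "123 MAIN STREET" line up as the same address.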

In [4]:
if run_mode in ['comparer','comparer_truths']:
    if not use_raw_address_files:
        randomized_addresses1 = add_rndm.random_addresses(num_rndm_addresses_to_create, field_name_raw_addresses)
        file_location_raw_addresses_1 = 'data\\randomized_addresses_list_1.xlsx'
        randomized_addresses1.to_excel(file_location_raw_addresses_1)
        
        randomized_addresses2 = add_rndm.random_addresses(num_rndm_addresses_to_create, field_name_raw_addresses)
        file_location_raw_addresses_2 = 'data\\randomized_addresses_list_2.xlsx'
        randomized_addresses2.to_excel(file_location_raw_addresses_2)
        
    compared_dict, matcher_truths_dict = aggf.tag_and_compare_addresses(file_location_raw_addresses_1, file_location_raw_addresses_2, file_name_ground_truth_matches, field_name_record_id, field_name_raw_addresses, standardize_addresses, run_mode)
    if write_output_to_excel:
        output_name = 'output\\raw_to_matched_addresses.xlsx'
        tagger_writer = pd.ExcelWriter(output_name, engine='xlsxwriter')
        for sheet, frame in compared_dict.items():
            frame.to_excel(tagger_writer, sheet_name=sheet)
        tagger_writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
    else:
        for sheet, frame in compared_dict.items():
            print ("sheet name = ", sheet)
            print (frame)

    if run_mode == 'comparer_truths':
        if write_output_to_excel:
            output_name = 'output\\modeled_matches_vs_ground_truths.xlsx'
            tagger_writer = pd.ExcelWriter(output_name, engine='xlsxwriter')
            for sheet, frame in matcher_truths_dict.items():
                frame.to_excel(tagger_writer, sheet_name=sheet)
            tagger_writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
        else:
            for sheet, frame in matcher_truths_dict.items():
                print ("sheet name = ", sheet)
                print (frame)


### 'All' Run Mode
The following cell depicts an example using the tag_vs_truths_and_compare_addresses aggregate function (the tag_vs_truths_and_compare_addresses is a single function using various components from the address_compare folder).  This function is equivalent to running the 'tagger' run_mode against both input files and then the 'comparer_truths' run_mode against the input files.  I.e., it shows how well the tagger performed against both input files, then matches the addresses and shows the matcher performance against the ground truths.

In [5]:
if run_mode == 'all':
    tag_truths1_dict, tag_truths2_dict, compared_dict, matcher_truths_dict = aggf.tag_vs_truths_and_compare_addresses(file_location_raw_addresses_1, file_location_raw_addresses_2, file_name_ground_truth_matches, use_raw_address_files, num_rndm_addresses_to_create, field_name_record_id, field_name_raw_addresses, standardize_addresses, run_mode)
    if write_output_to_excel:
        output_name = 'output\\file_1_tagger_vs_truths.xlsx'
        tagger_writer = pd.ExcelWriter(output_name, engine='xlsxwriter')
        for sheet, frame in tag_truths1_dict.items():
            frame.to_excel(tagger_writer, sheet_name=sheet)
        tagger_writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
        
        output_name = 'output\\file_2_tagger_vs_truths.xlsx'
        tagger_writer = pd.ExcelWriter(output_name, engine='xlsxwriter')
        for sheet, frame in tag_truths2_dict.items():
            frame.to_excel(tagger_writer, sheet_name=sheet)
        tagger_writer.close()
        
        output_name = 'output\\raw_to_matched_addresses.xlsx'
        tagger_writer = pd.ExcelWriter(output_name, engine='xlsxwriter')
        for sheet, frame in compared_dict.items():
            frame.to_excel(tagger_writer, sheet_name=sheet)
        tagger_writer.close()
        
        output_name = 'output\\modeled_matches_vs_ground_truths.xlsx'
        tagger_writer = pd.ExcelWriter(output_name, engine='xlsxwriter')
        for sheet, frame in matcher_truths_dict.items():
            frame.to_excel(tagger_writer, sheet_name=sheet)
        tagger_writer.close()
    else:
        for sheet, frame in tag_truths1_dict.items():
            print ("sheet name = ", sheet)
            print (frame)

        for sheet, frame in tag_truths2_dict.items():
            print ("sheet name = ", sheet)
            print (frame)
            
        for sheet, frame in compared_dict.items():
            print ("sheet name = ", sheet)
            print (frame)
            
        for sheet, frame in matcher_truths_dict.items():
            print ("sheet name = ", sheet)
            print (frame)
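
The write-or-print pattern repeated in the cells above could be factored into a single helper. The sketch below is one possible version, not part of the address_compare library: `write_or_print` is a hypothetical name, and it lets pandas pick its default Excel engine rather than naming xlsxwriter. Writing requires an Excel engine such as openpyxl or xlsxwriter to be installed.

```python
import pandas as pd

def write_or_print(frames, output_name, write_output_to_excel):
    """Write each DataFrame in frames to its own sheet, or print them all."""
    if write_output_to_excel:
        # The context manager saves and closes the workbook on exit.
        with pd.ExcelWriter(output_name) as writer:
            for sheet, frame in frames.items():
                frame.to_excel(writer, sheet_name=sheet)
    else:
        for sheet, frame in frames.items():
            print("sheet name = ", sheet)
            print(frame)

# Hypothetical result dictionary, printed rather than written.
example = {"summary": pd.DataFrame({"matches": [3]})}
write_or_print(example, "output\\example.xlsx", write_output_to_excel=False)
```

With such a helper, each run mode would need only one call per result dictionary, for example `write_or_print(compared_dict, 'output\\raw_to_matched_addresses.xlsx', write_output_to_excel)`.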