The master_file notebook is an example of how the address_compare library can be used to load training and test data, tag the addresses, standardize them, and compare address lists.  By updating the input parameters, it can also serve as a reusable program.  If ground truth files are available, it also reports how well the tagger and compare functions perform.

In [1]:
from address_compare import aggregate_functions as aggf
import pandas as pd

                address_1                 address_2  match
0      #A, 59767 62 AVE S     PH A - 59767 62 AVE S   True
1      #A, 59767 62 AVE S   PH A - 59767 62nd AVE S   True
2      #A, 59767 62 AVE S       PH A-59767 62 AVE S   True
3      #A, 59767 62 AVE S     PH A-59767 62nd AVE S   True
4      #A, 59767 62 AVE S      PH A, 59767 62 AVE S   True
5      #A, 59767 62 AVE S    PH A, 59767 62nd AVE S   True
6      #A, 59767 62 AVE S      59767 62 AVE S, PH A   True
7      #A, 59767 62 AVE S    59767 62nd AVE S, PH A   True
8      #A, 59767 62 AVE S        A - 59767 62 AVE S   True
9      #A, 59767 62 AVE S      A - 59767 62nd AVE S   True
10  PH A - 59767 62 AVE S       PH A-59767 62 AVE S   True
11  PH A - 59767 62 AVE S     PH A-59767 62nd AVE S   True
12  PH A - 59767 62 AVE S      PH A, 59767 62 AVE S   True
13  PH A - 59767 62 AVE S    PH A, 59767 62nd AVE S   True
14  PH A - 59767 62 AVE S      59767 62 AVE S, PH A   True
15  PH A - 59767 62 AVE S    59767 62nd AVE S, PH A   Tr

Although this notebook is primarily an example of how to use the address_compare library, the parameters below control its inputs and outputs, so the notebook can also be run as a reusable program on top of the library.


The **run_mode** variable controls which portions of this notebook are run.  Options are:
- **'tagger'** = run the address tagger against a single file that also contains the tagging ground truths.  The output shows how well the tagger did against those ground truths.  The tagger runs only against the file in @file_location_raw_addresses_1.
- **'comparer'** = tag 2 separate lists of addresses and find matches between the lists.  No tagger ground truths or match ground truths are used.  The program runs against both @file_location_raw_addresses_1 and @file_location_raw_addresses_2.
- **'comparer_truths'** = run the comparer and validate the matcher's performance against the ground truths.  The program runs against both @file_location_raw_addresses_1 and @file_location_raw_addresses_2, and reads the ground truth matches from @file_name_ground_truth_matches.
- **'all'** = run everything: the tagger results are compared against the tagging ground truths, and the matcher results are compared against the match ground truths.  The program runs against both @file_location_raw_addresses_1 and @file_location_raw_addresses_2, and reads the ground truth matches from @file_name_ground_truth_matches.
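
As a quick guard against typos in run_mode, the options above can be written down as a lookup table. This is a minimal sketch: `REQUIRED_INPUTS` and `required_inputs` are hypothetical names that are not part of address_compare, and the table only restates which file parameters each mode reads when raw address files are used.

```python
# Hypothetical lookup table (not part of address_compare): the file parameters
# each run_mode reads, following the option descriptions above.
REQUIRED_INPUTS = {
    "tagger": ("file_location_raw_addresses_1",),
    "comparer": (
        "file_location_raw_addresses_1",
        "file_location_raw_addresses_2",
    ),
    "comparer_truths": (
        "file_location_raw_addresses_1",
        "file_location_raw_addresses_2",
        "file_name_ground_truth_matches",
    ),
    "all": (
        "file_location_raw_addresses_1",
        "file_location_raw_addresses_2",
        "file_name_ground_truth_matches",
    ),
}


def required_inputs(run_mode):
    """Return the parameter names that run_mode reads, or raise on an unknown mode."""
    if run_mode not in REQUIRED_INPUTS:
        raise ValueError(
            f"run_mode must be one of {sorted(REQUIRED_INPUTS)}, got {run_mode!r}"
        )
    return REQUIRED_INPUTS[run_mode]


print(required_inputs("comparer"))
```

Failing early on an unknown value matters here because each later cell is guarded by an `if run_mode ...` test, so a misspelled mode would otherwise run nothing and report nothing.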

In [2]:
run_mode = 'comparer'  # choose from ['tagger','comparer','comparer_truths','all']


standardize_addresses = True  # if True, the tagged address components are standardized (converted to upper case; unit types, street types, etc. expanded to their long-form names)

use_raw_address_files = True  # if False, only randomly created addresses are used (see num_rndm_addresses_to_create below); False only works with the 'comparer' run_mode
num_rndm_addresses_to_create = 100  # if use_raw_address_files = False, the number of addresses that will be randomly created for use in the tagger and compare functions

field_name_raw_addresses = 'Single String Address'  # name of the field in the raw address files that contains the raw address (street information)
field_name_record_id = 'Record_ID'  # name of the field that contains the Record ID in the raw files; set to None if the raw files have no such field

#file_location_raw_addresses_1 = 'data\\standardized tagged washington state addresses.xlsx'
#file_location_raw_addresses_1 = 'data\\tagged standardized colorado Stores.xlsx'
file_location_raw_addresses_1 = 'data\\MarijuanaApplicants - test data list 1 - copy.xlsx'
file_location_raw_addresses_2 = 'data\\MarijuanaApplicants - test data list 2 - copy.xlsx'

file_name_ground_truth_matches = 'data\\marijuana applicants test data - correct matches.xlsx'

write_output_to_excel = True #if True, the output from the applicable modes will be written to Excel; otherwise, results will be printed in the notebook
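
The file locations above use `\\` separators, which only resolve as directory separators on Windows. The following is a minimal sketch of building the same locations portably and checking that they exist before running the later cells. `missing_inputs` is a hypothetical helper, and whether address_compare accepts `Path` objects is not shown in this notebook, so the sketch assumes the paths are converted with `str()` before being passed to the library.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the paths (as strings) that do not point at an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Joining Path objects with "/" produces the correct separator on every platform.
data_dir = Path("data")
file_1 = data_dir / "MarijuanaApplicants - test data list 1 - copy.xlsx"
file_2 = data_dir / "MarijuanaApplicants - test data list 2 - copy.xlsx"

missing = missing_inputs([file_1, file_2])
if missing:
    print("Missing input files:", missing)

# Assumption: pass plain strings to the library, e.g. str(file_1).
file_location_raw_addresses_1 = str(file_1)
file_location_raw_addresses_2 = str(file_2)
```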

In [3]:
if run_mode == 'tagger':
    # Tag the addresses in file 1 and score the tagger against that file's ground truths.
    df_dict = aggf.tagger_vs_ground_truths(file_location_raw_addresses_1, field_name_record_id, field_name_raw_addresses, standardize_addresses)
    if write_output_to_excel:
        output_name = 'output\\file_1_tagger_vs_truths.xlsx'
        # The context manager saves and closes the workbook; ExcelWriter.save() was removed in pandas 2.0.
        with pd.ExcelWriter(output_name, engine='xlsxwriter') as tagger_writer:
            for sheet, frame in df_dict.items():
                frame.to_excel(tagger_writer, sheet_name=sheet)
    else:
        for sheet, frame in df_dict.items():
            print("sheet name = ", sheet)
            print(frame)

In [4]:
if run_mode in ['comparer','comparer_truths']:
    # Tag both address lists and find matches between them; match ground truths are only used in 'comparer_truths' mode.
    compared_dict, matcher_truths_dict = aggf.tag_and_compare_addresses(file_location_raw_addresses_1, file_location_raw_addresses_2, file_name_ground_truth_matches, use_raw_address_files, num_rndm_addresses_to_create, field_name_record_id, field_name_raw_addresses, standardize_addresses, run_mode)
    if write_output_to_excel:
        output_name = 'output\\raw_to_matched_addresses.xlsx'
        # The context manager saves and closes the workbook; ExcelWriter.save() was removed in pandas 2.0.
        with pd.ExcelWriter(output_name, engine='xlsxwriter') as tagger_writer:
            for sheet, frame in compared_dict.items():
                frame.to_excel(tagger_writer, sheet_name=sheet)
    else:
        for sheet, frame in compared_dict.items():
            print("sheet name = ", sheet)
            print(frame)

    if run_mode == 'comparer_truths':
        if write_output_to_excel:
            output_name = 'output\\modeled_matches_vs_ground_truths.xlsx'
            with pd.ExcelWriter(output_name, engine='xlsxwriter') as tagger_writer:
                for sheet, frame in matcher_truths_dict.items():
                    frame.to_excel(tagger_writer, sheet_name=sheet)
        else:
            for sheet, frame in matcher_truths_dict.items():
                print("sheet name = ", sheet)
                print(frame)
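
This notebook does not show how the library scores the matcher against the ground truths. As an illustration only of what such a comparison can involve, the sketch below computes precision and recall over sets of matched record-ID pairs. `precision_recall` and the example pairs are hypothetical and are not taken from address_compare or from the test data.

```python
def precision_recall(predicted_pairs, truth_pairs):
    """Score predicted matches against ground-truth matches.

    Both arguments are iterables of (record_id_1, record_id_2) pairs.
    """
    predicted = set(predicted_pairs)
    truth = set(truth_pairs)
    true_positives = len(predicted & truth)
    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical record-ID pairs, for illustration only.
predicted = [(1, 10), (2, 20), (3, 31)]
truth = [(1, 10), (2, 20), (3, 30), (4, 40)]
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision answers how many of the reported matches are correct, and recall answers how many of the true matches were found, so the two together describe both kinds of matcher error.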

In [5]:
if run_mode == 'all':
    tag_truths1_dict, tag_truths2_dict, compared_dict, matcher_truths_dict = aggf.tag_vs_truths_and_compare_addresses(file_location_raw_addresses_1, file_location_raw_addresses_2, file_name_ground_truth_matches, use_raw_address_files, num_rndm_addresses_to_create, field_name_record_id, field_name_raw_addresses, standardize_addresses, run_mode)
    # Pair each result dictionary with the workbook it is written to.
    outputs = [
        ('output\\file_1_tagger_vs_truths.xlsx', tag_truths1_dict),
        ('output\\file_2_tagger_vs_truths.xlsx', tag_truths2_dict),
        ('output\\raw_to_matched_addresses.xlsx', compared_dict),
        ('output\\modeled_matches_vs_ground_truths.xlsx', matcher_truths_dict),
    ]
    for output_name, result_dict in outputs:
        if write_output_to_excel:
            # The context manager saves and closes the workbook; ExcelWriter.save() was removed in pandas 2.0.
            with pd.ExcelWriter(output_name, engine='xlsxwriter') as tagger_writer:
                for sheet, frame in result_dict.items():
                    frame.to_excel(tagger_writer, sheet_name=sheet)
        else:
            for sheet, frame in result_dict.items():
                print("sheet name = ", sheet)
                print(frame)
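
The cells above all end with the same step: write a dictionary of DataFrames to a workbook, one sheet per entry, or print it. That step could be factored into one helper, sketched below. `write_or_print` is a hypothetical name that is not part of address_compare, and the example dictionary is a stand-in for the library's real results.

```python
import pandas as pd


def write_or_print(frames, output_name=None):
    """Write each DataFrame to its own sheet of output_name, or print them all.

    frames maps sheet names to DataFrames. If output_name is None, nothing is
    written and the frames are printed instead.
    """
    if output_name is not None:
        # No engine is passed, so pandas picks an installed Excel writer; the
        # context manager saves and closes the workbook on exit.
        with pd.ExcelWriter(output_name) as writer:
            for sheet, frame in frames.items():
                frame.to_excel(writer, sheet_name=sheet)
    else:
        for sheet, frame in frames.items():
            print("sheet name = ", sheet)
            print(frame)


# Hypothetical stand-in for one of the result dictionaries.
example = {
    "matches": pd.DataFrame(
        {"address_1": ["#A, 59767 62 AVE S"], "address_2": ["PH A - 59767 62 AVE S"]}
    )
}
write_or_print(example)
```

With such a helper, each cell would reduce to one library call followed by one `write_or_print(result_dict, output_name if write_output_to_excel else None)` call per result dictionary.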