## Street Address Matching

March 2024

The goal of the demo below is to match two lists of street addresses using selected similarity algorithms such as Levenshtein, then train a binary classifier and evaluate its accuracy on the test dataset.

The demo uses `pandas`, the `libpostal` library (through its Python binding, `postal`) to normalize street addresses, and the `distance` library to measure how closely two addresses match.

The `distance` package provides helpers for computing similarities between arbitrary sequences.

The Levenshtein, Jaccard and Sorensen distances are among those used below. In every case, the lower the distance (i.e. the smaller the number returned), the more similar the street addresses are. As called in this notebook, Levenshtein returns a whole number of edits, while Jaccard and Sorensen return a value between 0 and 1.
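To make the three measures concrete, here is a minimal pure-Python sketch of what each one computes. These are illustrative re-implementations, not the code of the `distance` package; the assumption (consistent with the Jaccard output later in this notebook) is that Jaccard and Sorensen compare the sets of characters in each string, and that both inputs are non-empty.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over the character sets (assumes non-empty input)."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)


def sorensen(a: str, b: str) -> float:
    """1 - 2|A & B| / (|A| + |B|) over the character sets (assumes non-empty input)."""
    sa, sb = set(a), set(b)
    return 1 - 2 * len(sa & sb) / (len(sa) + len(sb))


print(levenshtein("149 grove street", "149 grove st"))  # 4
print(jaccard("149 grove street", "149 grove st"))      # 0.0
print(sorensen("abc", "bcd"))
```

Note that the set-based measures ignore character order and repetition: "149 grove street" and "149 grove st" use exactly the same characters, so their Jaccard distance is 0 even though the strings differ by four edits.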



In [1]:
## change to working folder
%cd /content
%pwd

## read the two uploaded address lists (street_address_listing.csv and testerAddress.csv) into File_1 and File_2
import os
import logging
import pandas as pd
File_1 = pd.read_csv('street_address_listing.csv')
File_2 = pd.read_csv('testerAddress.csv')

/content


In [2]:
## install pypostal
!sudo apt-get install curl autoconf automake libtool python-dev-is-python3 pkg-config
!git clone https://github.com/openvenues/libpostal
%cd libpostal
!./bootstrap.sh
!./configure
!make -j4
!sudo make install
!sudo ldconfig
!pip install postal

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
autoconf is already the newest version (2.71-2).
autoconf set to manually installed.
automake is already the newest version (1:1.16.5-1.3).
automake set to manually installed.
pkg-config is already the newest version (0.29.2-1ubuntu3).
curl is already the newest version (7.81.0-1ubuntu1.15).
The following additional packages will be installed:
  python-is-python3
Suggested packages:
  libtool-doc gcj-jdk
The following NEW packages will be installed:
  libtool python-dev-is-python3 python-is-python3
0 upgraded, 3 newly installed, 0 to remove and 39 not upgraded.
Need to get 168 kB of archives.
After this operation, 1,255 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libtool all 2.4.6-15build2 [164 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 python-is-python3 all 3.9.2-2 [2,788 B]
Get:3 http://archive.ubuntu.com/ubuntu jammy/

In [3]:
## install distance and import the libpostal helpers
!pip install distance
from postal.parser import parse_address
from postal.expand import expand_address

Collecting distance
  Downloading Distance-0.1.3.tar.gz (180 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 180.3/180.3 kB 1.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: distance
  Building wheel for distance (setup.py) ... done
  Created wheel for distance: filename=Distance-0.1.3-py3-none-any.whl size=16258 sha256=54e066017433b411d0ff4c6cffd27ce4fbac4aad99e4449b73fc82e5b5bce6f3
  Stored in directory: /root/.cache/pip/wheels/e8/bb/de/f71bf63559ea9a921059a5405806f7ff6ed612a9231c4a9309
Successfully built distance
Installing collected packages: distance
Successfully installed distance-0.1.3


In [4]:
## import the distance lib
from distance import levenshtein as lev, hamming, sorensen, jaccard

In [5]:
## examine data
File_1.head(1)


               business_name    address_street  address_city  address_state  address_zip            UID  Flag_Viz
0  AVANTI BATTERY COMPANY-FA  149 GROVE STREET     WATERTOWN             MA         2472  30110130000.0       NaN


In [6]:
File_2.head(1)

         business_name  address_street  address_city  address_state  address_zip
0  Oak Valley Hospital   350 S Oak Ave       Oakdale             CA        95361


In [7]:
print('-->File_1')
File_1.isnull().sum()

-->File_1


business_name        0
address_street       0
address_city         0
address_state        0
address_zip          0
UID                  0
Flag_Viz          4032
dtype: int64

In [8]:
File_1.dtypes

business_name      object
address_street     object
address_city       object
address_state      object
address_zip         int64
UID               float64
Flag_Viz          float64
dtype: object

In [9]:
# convert any int type columns to string type
File_1['address_zip'] = File_1['address_zip'].astype(str)

In [10]:
File_1.dtypes

business_name      object
address_street     object
address_city       object
address_state      object
address_zip        object
UID               float64
Flag_Viz          float64
dtype: object
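One side effect of reading `address_zip` as an integer is visible in the sample row: the Watertown, MA ZIP code shows up as `2472`, so the leading zero of `02472` appears to have been lost before the column was cast to a string. A minimal sketch of a fix is below; `pad_zip` is a hypothetical helper, and the five-digit width is an assumption that only holds for plain US ZIP codes. In pandas the same idea could be applied to the whole column with `.str.zfill(5)`.

```python
def pad_zip(zip_code, width=5):
    """Return a ZIP code as a zero-padded string (hypothetical helper)."""
    return str(zip_code).strip().zfill(width)


print(pad_zip(2472))     # 02472
print(pad_zip("95361"))  # 95361
```

Padding before matching keeps the two files comparable, since File_2 already stores its ZIP codes as strings.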

In [11]:
print('-->File_2')
File_2.isnull().sum()

-->File_2


business_name     0
address_street    0
address_city      0
address_state     0
address_zip       0
dtype: int64

In [12]:
File_2.dtypes

business_name     object
address_street    object
address_city      object
address_state     object
address_zip       object
dtype: object

In [13]:
# rename cols as necessary
# File_2.rename(columns={"Zip Code": "Zip_Code"},inplace=True)

In [14]:
# Isolate the address columns into lists, one list of fields per row
file_1_list = []
for index, rows in File_1.iterrows():
  my_list = [rows.business_name, rows.address_street, rows.address_city, rows.address_state, rows.address_zip]
  file_1_list.append(my_list)

file_2_list = []
for index, rows in File_2.iterrows():
  my_list = [rows.business_name, rows.address_street, rows.address_city, rows.address_state, rows.address_zip]
  file_2_list.append(my_list)

In [15]:
# Try smaller lists first (the first 101 rows of each file)
sub1_list = file_1_list[0:101]
sub2_list = file_2_list[0:101]

In [16]:
## build a list of address pairs and their distances; for Levenshtein, 0 means identical and non-zero means not identical
separator = ' '
levenshtein_compare = []
hamming_compare = []
sorensen_compare = []
jaccard_compare = []
for addy1 in sub1_list:
  # normalize the first address once per outer iteration and keep the first expansion
  norm1 = expand_address(separator.join(addy1), languages='en')[0]
  for addy2 in sub2_list:
    norm2 = expand_address(separator.join(addy2), languages='en')[0]
    levenshtein_compare.append([[norm1, norm2], lev(norm1, norm2)])
    # hamming requires strings of the same length, so it is left disabled
    # hamming_compare.append([[norm1, norm2], hamming(norm1, norm2, normalized=True)])
    sorensen_compare.append([[norm1, norm2], sorensen(norm1, norm2)])
    jaccard_compare.append([[norm1, norm2], jaccard(norm1, norm2)])
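Each comparison list holds one `[[address_1, address_2], distance]` entry per pair, so a natural next step is to keep only the closest candidate for each address in the first list. The sketch below shows one way to do that. `best_matches` is a hypothetical helper, and the sample distances are illustrative values, not results from this notebook.

```python
def best_matches(compare):
    """For each first address, keep the paired address with the smallest distance."""
    best = {}
    for (addr1, addr2), dist in compare:
        if addr1 not in best or dist < best[addr1][1]:
            best[addr1] = (addr2, dist)
    return best


# Illustrative entries in the same shape as levenshtein_compare
sample = [
    [["149 grove street", "350 s oak ave"], 13],
    [["149 grove street", "149 grove st"], 4],
    [["1 e main st", "350 s oak ave"], 9],
    [["1 e main st", "1 east main street"], 7],
]

for addr1, (addr2, dist) in best_matches(sample).items():
    print(addr1, "->", addr2, dist)
```

The smallest distance is not necessarily a true match, so in practice a cut-off on the distance (or the classifier mentioned in the introduction) would still be needed to reject poor candidates.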


In [17]:
print(jaccard_compare) #test

[[['avanti battery company-fa 149 grove street watertown ma 2472', 'oak valley hospital 350 s oak ave oakdale ca 95361'], 0.59375], [['avanti battery company-fa 149 grove street watertown ma 2472', 'david kelley md 11501 regency ln carmel in 46033'], 0.5806451612903225], [['avanti battery company-fa 149 grove street watertown ma 2472', 'house call doctors med group 103 woodshadow ln encinitas ca 92024'], 0.4], [['avanti battery company-fa 149 grove street watertown ma 2472', 'riverchase dermatology 413 del prado blvd s ste 100 cape coral fl 33990'], 0.3448275862068966], [['avanti battery company-fa 149 grove street watertown ma 2472', 'scott e theilbar dds 1332 e apple ave muskegon mi 49442'], 0.3666666666666667], [['avanti battery company-fa 149 grove street watertown ma 2472', 'john overton m d 420 w acacia st ste 17 stockton ca 95203'], 0.4193548387096774], [['avanti battery company-fa 149 grove street watertown ma 2472', 'ems austin county 1 e main st bellville tx 77418'], 0.428571

In [18]:
# Build dataframes (hammingcompare_df stays empty because the Hamming step above is commented out)
levcompare_df = pd.DataFrame(levenshtein_compare, columns = ['Address pair', 'Distance'])
hammingcompare_df = pd.DataFrame(hamming_compare, columns = ['Address pair', 'Distance'])
sorensencompare_df = pd.DataFrame(sorensen_compare, columns = ['Address pair', 'Distance'])
jaccardcompare_df = pd.DataFrame(jaccard_compare, columns = ['Address pair', 'Distance'])

In [19]:
%cd /content
levcompare_df.to_csv('levResult_sub.csv', sep='\t')
hammingcompare_df.to_csv('hammingResult_sub.csv', sep='\t')
sorensencompare_df.to_csv('sorensenResult_sub.csv', sep='\t')
jaccardcompare_df.to_csv('jaccardResult_sub.csv', sep='\t')

/content
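The `Address pair` column holds Python lists, and a text file can only store their string form, so the pairs need to be parsed when the results are read back. The sketch below illustrates the round trip with the standard `csv` module standing in for `to_csv`; it assumes pandas writes the same `str()` representation of each list, which is worth checking against the generated files.

```python
import ast
import csv
import io

rows = [[["149 grove street", "149 grove st"], 4]]

# Write a tab-separated file in memory; the list is stored as its str() form
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["Address pair", "Distance"])
for pair, dist in rows:
    writer.writerow([pair, dist])

# Read it back and turn the stored text into a list again
buf.seek(0)
reader = csv.DictReader(buf, delimiter="\t")
restored = [(ast.literal_eval(r["Address pair"]), int(r["Distance"])) for r in reader]
print(restored)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.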


In [None]:
## Try the bigger lists (every address in File_1 against every address in File_2)
separator = ' '
levenshtein_compare = []
hamming_compare = []
sorensen_compare = []
jaccard_compare = []
for addy1 in file_1_list:
  # normalize the first address once per outer iteration and keep the first expansion
  norm1 = expand_address(separator.join(addy1), languages='en')[0]
  for addy2 in file_2_list:
    norm2 = expand_address(separator.join(addy2), languages='en')[0]
    levenshtein_compare.append([[norm1, norm2], lev(norm1, norm2)])
    # hamming requires strings of the same length, so it is left disabled
    # hamming_compare.append([[norm1, norm2], hamming(norm1, norm2, normalized=True)])
    sorensen_compare.append([[norm1, norm2], sorensen(norm1, norm2)])
    jaccard_compare.append([[norm1, norm2], jaccard(norm1, norm2)])


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
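On the full lists the number of pairs grows as the product of the two list lengths, so it helps to normalize each address once up front instead of inside the inner loop. The sketch below shows the idea under stated assumptions: `normalize` is a simple stand-in for the libpostal call (lowercasing and collapsing whitespace only), and `compare_all` and `char_jaccard` are hypothetical helpers, not part of `postal` or `distance`.

```python
from itertools import product


def normalize(address: str) -> str:
    """Stand-in for libpostal normalization: lowercase and collapse whitespace."""
    return " ".join(address.lower().split())


def char_jaccard(a: str, b: str) -> float:
    """Jaccard distance over character sets (assumes non-empty input)."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)


def compare_all(list1, list2, metric, normalizer=normalize):
    """Normalize each address once, then score every pair in nested-loop order."""
    norm1 = [normalizer(" ".join(fields)) for fields in list1]
    norm2 = [normalizer(" ".join(fields)) for fields in list2]
    return [[[n1, n2], metric(n1, n2)] for n1, n2 in product(norm1, norm2)]


pairs = compare_all(
    [["Acme", "1 Main St"], ["Bolt", "2 Oak Ave"]],
    [["Cedar", "3 Elm Rd"]],
    char_jaccard,
)
print(len(pairs))  # 2
```

With n addresses in the first list and m in the second, this makes n + m normalization calls, and the output keeps the same `[[address_1, address_2], distance]` shape used by the comparison lists.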



In [36]:
levcompare_df = pd.DataFrame(levenshtein_compare, columns = ['Address pair', 'Distance'])
hammingcompare_df = pd.DataFrame(hamming_compare, columns = ['Address pair', 'Distance'])
sorensencompare_df = pd.DataFrame(sorensen_compare, columns = ['Address pair', 'Distance'])
jaccardcompare_df = pd.DataFrame(jaccard_compare, columns = ['Address pair', 'Distance'])