## Street Address Matching

March 2024

The goal of the demo below is to match two lists of street addresses using selected similarity algorithms such as Levenshtein, then train a binary classifier and evaluate its accuracy on a test dataset.

The demo uses `pandas`, the `libpostal` library to normalize street addresses, and the `distance` library to measure how closely two addresses match.

The `distance` package provides helpers for computing similarities between arbitrary sequences.

The Levenshtein, Jaccard and Sorensen distances are among the measures used below. The lower the distance (i.e. the smaller the number returned), the more similar the two street addresses are.
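To make the "lower is more similar" convention concrete, here is a minimal pure-Python sketch of the three measures. It does not call the `distance` package; the function names are illustrative, and the Jaccard and Sorensen versions assume the inputs are compared as sets of their elements (for strings, sets of characters), which is consistent with the Jaccard values printed later in this notebook.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-element insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of elements in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # assumption: two empty inputs count as identical
    return 1 - len(set_a & set_b) / len(set_a | set_b)


def sorensen_distance(a, b):
    """1 - 2|A ∩ B| / (|A| + |B|) over the sets of elements in a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # assumption: two empty inputs count as identical
    return 1 - 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


print(levenshtein_distance("highland rd", "highland road"))  # 2: insert 'o' and 'a'
print(jaccard_distance("abc", "abd"))                        # 0.5
print(sorensen_distance("abc", "abc"))                       # 0.0
```

Levenshtein returns an edit count that grows with string length, while the two set-based measures stay between 0 and 1, so the three scores are not directly comparable with each other.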



In [1]:
## change to working folder
%cd /content
%pwd

## read uploaded street_address_listing.csv and testerAddress.csv list of addresses
import os
import logging
import pandas as pd
File_1 = pd.read_csv('street_address_listing.csv')
File_2 = pd.read_csv('testerAddress.csv')

/content


In [2]:
## install pypostal
!sudo apt-get install curl autoconf automake libtool python-dev-is-python3 pkg-config
!git clone https://github.com/openvenues/libpostal
%cd libpostal
!./bootstrap.sh
!./configure
!make -j4
!sudo make install
!sudo ldconfig
!pip install postal

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
autoconf is already the newest version (2.71-2).
autoconf set to manually installed.
automake is already the newest version (1:1.16.5-1.3).
automake set to manually installed.
pkg-config is already the newest version (0.29.2-1ubuntu3).
curl is already the newest version (7.81.0-1ubuntu1.16).
The following additional packages will be installed:
  python-is-python3
Suggested packages:
  libtool-doc gcj-jdk
The following NEW packages will be installed:
  libtool python-dev-is-python3 python-is-python3
0 upgraded, 3 newly installed, 0 to remove and 45 not upgraded.
Need to get 168 kB of archives.
After this operation, 1,255 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libtool all 2.4.6-15build2 [164 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 python-is-python3 all 3.9.2-2 [2,788 B]
Get:3 http://archive.ubuntu.com/ubuntu jammy/

In [3]:
## install distance
!pip install distance
from postal.parser import parse_address
from postal.expand import expand_address

Collecting distance
  Downloading Distance-0.1.3.tar.gz (180 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: distance
  Building wheel for distance (setup.py) ... done
  Created wheel for distance: filename=Distance-0.1.3-py3-none-any.whl size=16258 sha256=be158c87bbeb542e6c05cd67bbfbe9f7d35983c39fa2d3d935a31bbbaa8572a9
  Stored in directory: /root/.cache/pip/wheels/e8/bb/de/f71bf63559ea9a921059a5405806f7ff6ed612a9231c4a9309
Successfully built distance
Installing collected packages: distance
Successfully installed distance-0.1.3


In [4]:
## import the distance lib
from distance import levenshtein as lev, hamming, sorensen, jaccard
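Hamming distance counts the positions at which two equal-length sequences differ. That is why the Hamming comparison is commented out later in this notebook: two normalized addresses rarely have the same length. The sketch below is a pure-Python illustration with an illustrative function name, not the package's implementation.

```python
def hamming_distance(a, b):
    """Count positions where a and b differ; both must have the same length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


# Two zip codes from the sample rows: they differ in the last two positions.
print(hamming_distance("70808", "70820"))  # 2
```

Fixed-width fields such as zip codes are the natural fit for this measure; full address strings are better served by Levenshtein or the set-based distances.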

In [5]:
## examine data
File_1.head(1)


Unnamed: 0,address_no,street_prefix_direction,street_prefix_type,street_name,street_suffix_type,street_suffix_direction,street_extension,full_address,city,zip,council_district_no,councilperson_name,jurisdiction
0,7353 STE B 282,,,HIGHLAND,RD,,,"7353 HIGHLAND RD, STE B 282",BATON ROUGE,70808,12,Jennifer Racca,BATON ROUGE


In [6]:
File_2.head(1)

Unnamed: 0,council_person_name,address_street,address_city,address_state,address_zip,district
0,Gaudet Rowdy,1103 CHERRY BIRCH AVE,BATON ROUGE,LA,70820,3


In [7]:
print('-->File_1')
File_1.isnull().sum()

-->File_1


address_no                      0
street_prefix_direction    177077
street_prefix_type         193593
street_name                     0
street_suffix_type           2253
street_suffix_direction    194294
street_extension           195408
full_address                    0
city                            0
zip                             0
council_district_no             0
councilperson_name              0
jurisdiction                    0
dtype: int64

In [8]:
File_1.dtypes

address_no                 object
street_prefix_direction    object
street_prefix_type         object
street_name                object
street_suffix_type         object
street_suffix_direction    object
street_extension           object
full_address               object
city                       object
zip                         int64
council_district_no         int64
councilperson_name         object
jurisdiction               object
dtype: object

In [9]:
# convert any int type columns to string type
File_1['zip'] = File_1['zip'].astype(str)
File_1['council_district_no'] = File_1['council_district_no'].astype(str)

In [15]:
File_1.dtypes

address_no                 object
street_prefix_direction    object
street_prefix_type         object
street_name                object
street_suffix_type         object
street_suffix_direction    object
street_extension           object
full_address               object
city                       object
zip                        object
council_district_no        object
councilperson_name         object
jurisdiction               object
dtype: object

In [11]:
print('-->File_2')
File_2.isnull().sum()

-->File_2


council_person_name    0
address_street         0
address_city           0
address_state          0
address_zip            0
district               0
dtype: int64

In [12]:
File_2.dtypes

council_person_name    object
address_street         object
address_city           object
address_state          object
address_zip             int64
district                int64
dtype: object

In [14]:
# convert any int type columns to string type
File_2['address_zip'] = File_2['address_zip'].astype(str)
File_2['district'] = File_2['district'].astype(str)

In [None]:
# rename cols as necessary
# File_2.rename(columns={"Zip Code": "Zip_Code"},inplace=True)

In [17]:
# Isolate address columns into lists (one [name, street, city, zip] list per row)
file_1_list = File_1[['councilperson_name', 'full_address', 'city', 'zip']].values.tolist()
file_2_list = File_2[['council_person_name', 'address_street', 'address_city', 'address_zip']].values.tolist()

In [18]:
# Try smaller lists first (the first 101 rows of each file)
sub1_list = file_1_list[0:101]
sub2_list = file_2_list[0:101]

In [19]:
## build a list of the address pairs and their distances; for Levenshtein, 0 means identical and any non-zero value means the strings differ
separator = ' '
levenshtein_compare = []
hamming_compare = []
sorensen_compare = []
jaccard_compare = []
for index1,addy1 in enumerate(sub1_list):
  for index2, addy2 in enumerate(sub2_list):
    #print(addy1, addy2)
    result1 = separator.join(addy1)
    result2 = separator.join(addy2)
    sub1_norm_list = expand_address(result1, languages='en')
    sub2_norm_list = expand_address(result2, languages='en')
    levenshtein_compare.append([[sub1_norm_list[0],sub2_norm_list[0]],lev(sub1_norm_list[0],sub2_norm_list[0])])
    # hamming_compare.append([[sub1_norm_list[0],sub2_norm_list[0]],hamming(sub1_norm_list[0],sub2_norm_list[0],normalized=True)]) # requires same length
    sorensen_compare.append([[sub1_norm_list[0],sub2_norm_list[0]],sorensen(sub1_norm_list[0],sub2_norm_list[0])])
    jaccard_compare.append([[sub1_norm_list[0],sub2_norm_list[0]],jaccard(sub1_norm_list[0],sub2_norm_list[0])])
    #print(sub1_norm_list[0],sub2_norm_list[0], lev(sub1_norm_list[0],sub2_norm_list[0]))
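The loop above re-normalizes each first-list address once for every second-list address. A possible refinement is to normalize each list once and then score the pairs. The sketch below shows the idea with the standard library only: `normalize` is a hypothetical stand-in for `expand_address(...)[0]` (it only lowercases and collapses whitespace), and `pair_up` and `char_jaccard` are illustrative names.

```python
from itertools import product


def normalize(parts):
    """Hypothetical stand-in for expand_address(...)[0]: join, lowercase, collapse spaces."""
    return ' '.join(' '.join(parts).lower().split())


def char_jaccard(a, b):
    """Jaccard distance over sets of characters; assumes at least one non-empty string."""
    set_a, set_b = set(a), set(b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)


def pair_up(list_a, list_b, metric):
    """Normalize every address once, then score each (a, b) pair with `metric`."""
    norm_a = [normalize(parts) for parts in list_a]
    norm_b = [normalize(parts) for parts in list_b]
    # product() visits pairs in the same order as the nested for-loops
    return [[[a, b], metric(a, b)] for a, b in product(norm_a, norm_b)]


demo_a = [['Jennifer Racca', '7353 HIGHLAND RD, STE B 282', 'BATON ROUGE', '70808']]
demo_b = [['Gaudet Rowdy', '1103 CHERRY BIRCH AVE', 'BATON ROUGE', '70820']]
pairs = pair_up(demo_a, demo_b, char_jaccard)
print(pairs[0][0])
```

With this shape, normalization runs `len(list_a) + len(list_b)` times instead of twice per pair, and the result keeps the same `[[address_1, address_2], distance]` layout used to build the dataframes.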


In [20]:
print(jaccard_compare) #test

[[['jennifer racca 7353 highland rd ste b 282 baton rouge 70808', 'gaudet rowdy 1103 cherry birch ave baton rouge 70820'], 0.3214285714285714], [['jennifer racca 9007 highland rd ste 9 baton rouge 70810', 'gaudet rowdy 1103 cherry birch ave baton rouge 70820'], 0.3571428571428571], [['denise amoroso 5830 s sherwood forest blvd ste a6 baton rouge 70816', 'gaudet rowdy 1103 cherry birch ave baton rouge 70820'], 0.31034482758620685], [['laurie white adams 4520 s sherwood forest blvd ste 103 baton rouge 70816', 'gaudet rowdy 1103 cherry birch ave baton rouge 70820'], 0.30000000000000004], [['laurie white adams 8334 ohara ct ste d baton rouge 70806', 'gaudet rowdy 1103 cherry birch ave baton rouge 70820'], 0.3214285714285714], [['chauna banks 4250 blount rd unit 14 baton rouge 70807', 'gaudet rowdy 1103 cherry birch ave baton rouge 70820'], 0.3214285714285714], [['laurie white adams 8316 picardy ave baton rouge 70809', 'gaudet rowdy 1103 cherry birch ave baton rouge 70820'], 0.2413793103448

In [21]:
# Build dataframes
levcompare_df = pd.DataFrame(levenshtein_compare, columns = ['Address pair', 'Distance'])
hammingcompare_df = pd.DataFrame(hamming_compare, columns = ['Address pair', 'Distance'])
sorensencompare_df = pd.DataFrame(sorensen_compare, columns = ['Address pair', 'Distance'])
jaccardcompare_df = pd.DataFrame(jaccard_compare, columns = ['Address pair', 'Distance'])

In [22]:
%cd /content
levcompare_df.to_csv('levResult_sub.csv', sep='\t')
hammingcompare_df.to_csv('hammingResult_sub.csv', sep='\t')
sorensencompare_df.to_csv('sorensenResult_sub.csv', sep='\t')
jaccardcompare_df.to_csv('jaccardResult_sub.csv', sep='\t')

/content
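Once the pairwise distances exist, one simple way to use them is to keep, for each address in the second list, the first-list candidate with the lowest distance. The sketch below assumes rows in the `[[address_1, address_2], distance]` layout built above; `best_matches` is an illustrative helper, and the sample rows are the first three Jaccard results printed earlier.

```python
def best_matches(compare_rows):
    """Map each second-list address to its lowest-distance first-list candidate."""
    best = {}
    for (addr_1, addr_2), dist in compare_rows:
        if addr_2 not in best or dist < best[addr_2][1]:
            best[addr_2] = (addr_1, dist)
    return best


target = 'gaudet rowdy 1103 cherry birch ave baton rouge 70820'
sample_rows = [
    [['jennifer racca 7353 highland rd ste b 282 baton rouge 70808', target], 0.3214285714285714],
    [['jennifer racca 9007 highland rd ste 9 baton rouge 70810', target], 0.3571428571428571],
    [['denise amoroso 5830 s sherwood forest blvd ste a6 baton rouge 70816', target], 0.31034482758620685],
]
candidate, distance_value = best_matches(sample_rows)[target]
print(candidate, distance_value)
```

None of the three sample candidates is the same address as the target, so the lowest distance only nominates a candidate; whether it is a real match still has to be decided separately.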


In [None]:
## Try the bigger lists
separator = ' '
levenshtein_compare = []
hamming_compare = []
sorensen_compare = []
jaccard_compare = []
for index1,addy1 in enumerate(file_1_list):
  for index2, addy2 in enumerate(file_2_list):
    #print(addy1, addy2)
    result1 = separator.join(addy1)
    result2 = separator.join(addy2)
    sub1_norm_list = expand_address(result1, languages='en')
    sub2_norm_list = expand_address(result2, languages='en')
    levenshtein_compare.append([[sub1_norm_list[0],sub2_norm_list[0]],lev(sub1_norm_list[0],sub2_norm_list[0])])
    # hamming_compare.append([[sub1_norm_list[0],sub2_norm_list[0]],hamming(sub1_norm_list[0],sub2_norm_list[0],normalized=True)]) # requires same length
    sorensen_compare.append([[sub1_norm_list[0],sub2_norm_list[0]],sorensen(sub1_norm_list[0],sub2_norm_list[0])])
    jaccard_compare.append([[sub1_norm_list[0],sub2_norm_list[0]],jaccard(sub1_norm_list[0],sub2_norm_list[0])])
    #print(sub1_norm_list,sub2_norm_list, lev(sub1_norm_list,sub2_norm_list))
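The full run compares every row of `file_1_list` with every row of `file_2_list`, and the null counts earlier show that `File_1` has at least 195,408 rows, so the number of pairs grows quickly. One common way to cut the work, which this notebook does not do, is to compare only rows that share a zip code. The sketch below assumes the `[name, street, city, zip]` row layout used above; `block_by_zip` is an illustrative name and the third demo row is made up.

```python
from collections import defaultdict


def block_by_zip(list_a, list_b, zip_index=3):
    """Yield (row_a, row_b) only when both rows carry the same zip field."""
    by_zip = defaultdict(list)
    for row in list_b:
        by_zip[row[zip_index]].append(row)
    for row_a in list_a:
        # .get() avoids creating empty buckets for zips missing from list_b
        for row_b in by_zip.get(row_a[zip_index], []):
            yield row_a, row_b


demo_a = [['Jennifer Racca', '7353 HIGHLAND RD, STE B 282', 'BATON ROUGE', '70808']]
demo_b = [
    ['Gaudet Rowdy', '1103 CHERRY BIRCH AVE', 'BATON ROUGE', '70820'],
    ['Example Person', '1 EXAMPLE ST', 'BATON ROUGE', '70808'],  # made-up row for illustration
]
candidate_pairs = list(block_by_zip(demo_a, demo_b))
print(len(candidate_pairs))  # 1 instead of 2 pairs
```

The trade-off is that an address recorded with a wrong or missing zip code is never compared with its true match, so the approach only suits data where the zip field is reliable.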


In [None]:
levcompare_df = pd.DataFrame(levenshtein_compare, columns = ['Address pair', 'Distance'])
hammingcompare_df = pd.DataFrame(hamming_compare, columns = ['Address pair', 'Distance'])
sorensencompare_df = pd.DataFrame(sorensen_compare, columns = ['Address pair', 'Distance'])
jaccardcompare_df = pd.DataFrame(jaccard_compare, columns = ['Address pair', 'Distance'])