# Customer Matching Example
## Matching Names
This notebook demonstrates how we use the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to match customer names. The Levenshtein distance measures how different two strings are: it is the minimum number of single-character deletions, insertions, or substitutions required to transform one string into the other. The distance is 0 if and only if the two strings are identical.
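
To make the definition concrete, here is a minimal pure-Python sketch of the distance using the standard dynamic-programming recurrence. The function names `levenshtein` and `similarity` are hypothetical helpers for illustration and are not part of fuzzywuzzy. The 0-100 normalization in `similarity` is only an assumed way to turn a distance into a score and is not claimed to match what `fuzz.ratio` computes. The misspelled variants are invented examples based on last names from the customer table.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative 0-100 score derived from the distance
    (an assumption for this sketch, not fuzzywuzzy's formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("Verstappen", "Verstapen"))  # one deletion -> 1
print(levenshtein("Wynants", "Wijnants"))      # one substitution + one insertion -> 2
print(levenshtein("Tack", "Tack"))             # identical strings -> 0
```

A small distance relative to the string length suggests that two names may refer to the same customer, which is the idea the rest of the notebook builds on.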

In [2]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
import numpy as np
import asyncio

In [23]:
customers = pd.read_parquet('customers.parquet')
customers.head()

Unnamed: 0,firstname,lastname,street,housenumber,box,postalcode,city,country,birthdate,email
0,Luc,Tack,Westendelaan,30,202.0,8430,Middelkerke,Belgium,1989-05-19,lauwersgeorges@example.net
1,Hilde,Verstappen,Boudewijnlaan,6,,2390,Malle,Belgium,1969-09-21,nlammens@example.com
2,Sebastiaan,Dubois,Bosstraat,107B,,3620,Lanaken,Belgium,2021-12-28,wvandenberghe@example.net
3,Hilde,Wynants,Molenstraat,94,,2490,Balen,Belgium,2019-03-28,lowie16@example.net
4,Jacqueline,Verstappen,Limburgstraat,39,,2020,Antwerpen,Belgium,1934-07-16,ferdinand73@example.net


In [24]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   firstname    10000 non-null  object
 1   lastname     10000 non-null  object
 2   street       10000 non-null  object
 3   housenumber  10000 non-null  object
 4   box          10000 non-null  object
 5   postalcode   10000 non-null  int64 
 6   city         10000 non-null  object
 7   country      10000 non-null  object
 8   birthdate    10000 non-null  object
 9   email        10000 non-null  object
dtypes: int64(1), object(9)
memory usage: 781.4+ KB


In [4]:
streetnames = pd.read_csv('openaddress-bevlg.csv', low_memory=False)

In [5]:
streetnames.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4057010 entries, 0 to 4057009
Data columns (total 20 columns):
 #   Column                Dtype  
---  ------                -----  
 0   EPSG:31370_x          float64
 1   EPSG:31370_y          float64
 2   EPSG:4326_lat         float64
 3   EPSG:4326_lon         float64
 4   address_id            int64  
 5   box_number            object 
 6   house_number          object 
 7   municipality_id       int64  
 8   municipality_name_de  object 
 9   municipality_name_fr  object 
 10  municipality_name_nl  object 
 11  postcode              int64  
 12  postname_fr           object 
 13  postname_nl           object 
 14  street_id             int64  
 15  streetname_de         float64
 16  streetname_fr         object 
 17  streetname_nl         object 
 18  region_code           object 
 19  status                object 
dtypes: float64(5), int64(4), object(11)
memory usage: 619.1+ MB


In [6]:
streetnames[(streetnames['streetname_nl'] == 'Trapstraat') & (streetnames['postcode'] == 2060)]

Unnamed: 0,EPSG:31370_x,EPSG:31370_y,EPSG:4326_lat,EPSG:4326_lon,address_id,box_number,house_number,municipality_id,municipality_name_de,municipality_name_fr,municipality_name_nl,postcode,postname_fr,postname_nl,street_id,streetname_de,streetname_fr,streetname_nl,region_code,status
3305,153843.40,213240.26,51.22908,4.42378,937923,,35,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current
22677,153858.63,213269.79,51.22935,4.42400,989955,,47,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current
122967,153748.25,213070.79,51.22756,4.42242,3048072,,3,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current
150117,153839.89,213236.17,51.22905,4.42373,479600,,33,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current
156578,153810.31,213175.58,51.22850,4.42330,459593,,15,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3898932,153851.84,213257.28,51.22924,4.42390,490490,,43,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current
3913164,153810.36,213173.25,51.22848,4.42331,410999,,17,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current
3963518,153858.76,213277.32,51.22942,4.42400,332613,,53,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current
3973325,153752.47,213026.57,51.22716,4.42248,334795,,10,11002,Antwerpen,Anvers,Antwerpen,2060,,Antwerpen,2852,,,Trapstraat,BE-VLG,current


In [7]:
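# Replace missing box numbers with '' by assignment; a chained in-place replace may not update the frame in recent pandas.
streetnames['box_number'] = streetnames['box_number'].fillna('')
# Export only the columns needed for address matching.
streetnames[['postcode','municipality_name_nl','streetname_nl', 'house_number', 'box_number']].to_csv('streetnames.csv')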
