### Source article: https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536

### **Deduplication.** Aligning similar categories or entities in a data set (for example, we may need to combine ‘D J Trump’, ‘D. Trump’ and ‘Donald Trump’ into the same entity).


### **Record Linkage.** Joining data sets on a particular entity (for example, joining records of ‘D J Trump’ to a URL of his Wikipedia page).

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None = no truncation; -1 is no longer accepted by recent pandas
from tqdm import tqdm
from google.colab import drive
import os
from matplotlib import style
style.use('fivethirtyeight')

  


In [2]:
names = pd.read_csv('/content/messy org names.csv')
names.head()

,Unnamed: 0,buyer,count
0,0,Crescent Purchasing Consortium (CPC),4404
1,1,Crown Commercial Service,3683
2,2,UK SHARED BUSINESS SERVICES LIMITED,2467
3,3,Leeds City Council,2320
4,4,FCO SERVICES,2310


In [5]:
print('The shape: %d x %d' % names.shape)
print('There are %d unique values' % names.buyer.nunique())

The shape: 3651 x 3
There are 3651 unique values


## Deduplication

In [6]:
import re
!pip install ftfy # amazing text cleaning for decode issues..
from ftfy import fix_text



# **Smart Deduping**
We will first explore how to dedupe close matches. Python’s Scikit-Learn library makes the process painless:

1. Create a function to split our strings into character n-grams.
2. Create a tf-idf matrix from these n-grams using Scikit-Learn.
3. Use cosine similarity to show close matches across the population.
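
The three steps can be sketched in plain Python to show what happens under the hood. This is an illustrative toy, not the notebook's implementation: `char_ngrams`, `tfidf_vectors` and `cosine` are hypothetical helpers, the sample names are taken from the data set, and the idf formula is assumed to be the smoothed variant that Scikit-Learn applies by default.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Split a lower-cased, space-padded string into overlapping character n-grams."""
    padded = " " + text.lower().strip() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def tfidf_vectors(docs):
    """Return one L2-normalised {ngram: weight} dict per document."""
    counts = [Counter(char_ngrams(d)) for d in docs]
    n_docs = len(docs)
    # Iterating a Counter yields each key once, so this counts documents per n-gram.
    doc_freq = Counter(g for c in counts for g in c)
    vectors = []
    for c in counts:
        # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
        vec = {g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors

def cosine(u, v):
    """For unit-length vectors the dot product equals the cosine similarity."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())

toy_names = ["University of Leeds", "The University of Leeds", "Crown Commercial Service"]
vecs = tfidf_vectors(toy_names)
print(round(cosine(vecs[0], vecs[1]), 3))  # near-duplicate pair: high score
print(round(cosine(vecs[0], vecs[2]), 3))  # unrelated pair: low score
```

The rest of the notebook does the same thing at scale, with sparse matrices replacing the dictionaries.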

### The ngram function

In [7]:
def ngrams(string, n=3):
    string = str(string)
    string = fix_text(string) # fix text
    string = string.encode("ascii", errors="ignore").decode() #remove non ascii chars
    string = string.lower()
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title() # normalise case - capital at start of each word
    string = re.sub(' +',' ',string).strip() # get rid of multiple spaces and replace with a single
    string = ' '+ string +' ' # pad names for ngrams...
    string = re.sub(r'[,\-./]|\sBD', r'', string)  # hyphen escaped so it is a literal, not a character range
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

In [9]:
print('All 3-grams in "Department":')
print(ngrams('Department'))

All 3-grams in "Department":
[' De', 'Dep', 'epa', 'par', 'art', 'rtm', 'tme', 'men', 'ent', 'nt ']
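
To see why character trigrams tolerate spelling variants, it helps to compare the trigram sets of two spellings of the same organisation. The sketch below uses `simple_ngrams`, a hypothetical stripped-down stand-in for `ngrams` (no `ftfy`, no punctuation handling), and plain set overlap rather than tf-idf weighting.

```python
def simple_ngrams(text, n=3):
    """Return the set of character n-grams of a title-cased, space-padded string."""
    padded = " " + text.title() + " "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

a = simple_ngrams("Kingstown Works Ltd")
b = simple_ngrams("Kingstown Works Limited")

shared = a & b
jaccard = len(shared) / len(a | b)  # share of all trigrams that both names contain

print("only in first: ", sorted(a - b))
print("only in second:", sorted(b - a))
print("shared:", len(shared), "jaccard:", round(jaccard, 2))
```

Only the trigrams that touch the differing suffix change; everything covering "Kingstown Works" is shared, which is what drives the similarity score up.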


In [10]:
import numpy as np
from scipy.sparse import csr_matrix
!pip install sparse_dot_topn==0.3.3 # pinned to the version this notebook ran with; the low-level import below may not exist in later releases
import sparse_dot_topn.sparse_dot_topn as ct



### Applying the function and creating a tf-idf matrix

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

org_names = names['buyer'].unique()
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(org_names)

### Finding close matches through cosine similarity

In [11]:
def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
 
    idx_dtype = np.int32
 
    nnz_max = M*ntop
 
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data,indices,indptr),shape=(M,N))
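
What `awesome_cossim_top` returns can be stated in plain Python: for every row, take the dot product with every other vector, discard scores that do not exceed `lower_bound`, and keep the `ntop` largest. The sketch below uses dictionaries as sparse rows. `top_n_matches` is a hypothetical reference helper that is far slower than `sparse_dot_topn`, and the strict `>` comparison against the threshold is an assumption about the library's behaviour.

```python
import heapq

def top_n_matches(rows, ntop, lower_bound=0.0):
    """For each sparse row, return up to ntop (index, score) pairs scoring above lower_bound."""
    results = []
    for u in rows:
        scores = []
        for j, v in enumerate(rows):
            score = sum(w * v.get(k, 0.0) for k, w in u.items())
            if score > lower_bound:
                scores.append((j, score))
        # Keep only the ntop best candidates, best first.
        results.append(heapq.nlargest(ntop, scores, key=lambda pair: pair[1]))
    return results

toy_rows = [
    {"a": 0.6, "b": 0.8},
    {"a": 0.8, "b": 0.6},
    {"c": 1.0},
]
result = top_n_matches(toy_rows, ntop=2, lower_bound=0.85)
for i, row_matches in enumerate(result):
    print(i, row_matches)
```

Each row matches itself with a score of about 1, which is why the notebook later filters out similarities above 0.99999.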

### Putting all of this together we get the following result:

In [17]:
def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    
    # Cap the row count at the number of matches actually available
    if top:
        nr_matches = min(top, sparsecols.size)
    else:
        nr_matches = sparsecols.size
    
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)
    
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]
    
    return pd.DataFrame({'left_side': left_side,
                          'right_side': right_side,
                           'similairity': similairity})

In [13]:
import time
t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.85)
t = time.time()-t1
print("SELFTIMED:", t)

SELFTIMED: 0.23862051963806152


In [18]:
matches_df = get_matches_df(matches, org_names, top=1000)
matches_df = matches_df[matches_df['similairity'] < 0.99999] # Remove all exact matches
matches_df.sample(20)

,left_side,right_side,similairity
617,Kingstown Works Ltd,Kingstown Works Limited,0.878768
635,KINGS LYNN & WEST NORFOLK BOROUGH COUNCIL,Borough Council of King&#039;s Lynn &amp; West Norfolk,0.922915
107,East Riding of Yorkshire,East Riding of Yorkshire Council,0.951229
476,The University Of Leeds,University of Leeds,0.912836
470,NORTH WESTERN UNIVERSITIES PURCHASING CONSORTIUM LIMITED,North eastern Universities Purchasing consortium,0.874805
302,University College of London Hospitals NHS Foundation Trust,University College London Hospitals NHS Foundation Trust,0.941518
124,South Tyneside Council,North Tyneside Council,0.850709
225,The University of Warwick,UNIVERSITY OF WARWICK,0.924797
396,University of Leeds,The University of Leeds,0.912836
384,Dorset HealthCare University NHS Foundation Trust,Dorset Health Care University NHS Foundation Trust,0.897224
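
The table lists matching pairs, but merging duplicates into one entity requires grouping pairs that share a name. One way to do that, sketched here under the assumption that every listed pair is a true duplicate, is a small union-find. `group_pairs` is a hypothetical helper and not part of the original notebook.

```python
def group_pairs(pairs):
    """Group names into clusters so that every matched pair ends up in the same cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for left, right in pairs:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            parent[root_r] = root_l

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

# A few pairs copied from the match output above.
pairs = [
    ("The University Of Leeds", "University of Leeds"),
    ("University of Leeds", "The University of Leeds"),
    ("Kingstown Works Ltd", "Kingstown Works Limited"),
]
groups = group_pairs(pairs)
for group in groups:
    print(group)
```

Pairs such as South Tyneside Council and North Tyneside Council also clear the 0.85 threshold without being duplicates, so groups built this way still need a manual review or a stricter threshold.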


#### Comparison to traditional matching
This code prints the time it takes to compare <b>only one</b> item against the population. As you can see, the TF-IDF approach matches all 3,651 items against each other in less time than the fuzzywuzzy library needs to match a single item.

In [15]:
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process



In [16]:
t1 = time.time()
print(process.extractOne('Ministry of Justice', org_names))
t = time.time()-t1
print("SELFTIMED:", t)
print("Estimated hours to complete for full dataset:", (t*len(org_names))/60/60)

('MINISTRY OF JUSTICE', 100)
SELFTIMED: 4.050333738327026
Estimated hours to complete for full dataset: 4.1077134662866595


# Record linkage and a different approach

To match against another data source with this technique, we can reuse most of our code. The section below shows how, and uses the K Nearest Neighbours algorithm as an alternative closeness measure.

The dataset we would like to join on is a set of ‘clean’ organisation names created by the Office for National Statistics (ONS). Using a similar technique to the above, we can join our messy data to this clean set of master data.
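
The idea behind nearest-neighbour matching can be sketched in plain Python: turn every name into a unit-length vector, then pick the clean name whose vector lies closest to the messy one. This toy simplifies on purpose. It uses raw trigram counts instead of tf-idf weights and a brute-force search instead of Scikit-Learn, and `unit_trigram_vector`, `euclidean` and `nearest_clean_name` are hypothetical helpers.

```python
import math
from collections import Counter

def unit_trigram_vector(text):
    """Map a string to a unit-length {trigram: weight} vector."""
    padded = " " + text.lower() + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {g: c / norm for g, c in counts.items()}

def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def nearest_clean_name(messy, clean_names):
    """Return (distance, clean_name) for the clean name closest to the messy one."""
    query = unit_trigram_vector(messy)
    return min((euclidean(query, unit_trigram_vector(c)), c) for c in clean_names)

clean = ["City of York Council", "NHS England", "Derby Homes Ltd"]
for messy in ["CITY OF YORK COUNCIL", "NHS England - North West Hub"]:
    distance, match = nearest_clean_name(messy, clean)
    print(round(distance, 2), match, "<-", messy)
```

A distance of 0 means the two names are identical after normalisation, and larger distances mean weaker matches.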

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

clean_org_names = pd.read_excel('Gov Orgs ONS.xlsx')
clean_org_names = clean_org_names.iloc[:, 0:6]

org_name_clean = clean_org_names['Institutions'].unique()

print('Vectorizing the data - this could take a few minutes for large datasets...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(org_name_clean)
print('Vectorizing completed...')

nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)

org_column = 'buyer' #column to match against in the messy data
unique_org = set(names[org_column].values) # set used for increased performance

Vectorizing the data - this could take a few minutes for large datasets...
Vectorizing completed...


In [20]:
###matching query:
def getNearestN(query):
  queryTFIDF_ = vectorizer.transform(query)
  distances, indices = nbrs.kneighbors(queryTFIDF_)
  return distances, indices

import time
t1 = time.time()
print('getting nearest n...')
distances, indices = getNearestN(unique_org)
t = time.time()-t1
print("COMPLETED IN:", t)

unique_org = list(unique_org) #need to convert back to a list
print('finding matches...')
matches = []
for i,j in enumerate(indices):
  temp = [round(distances[i][0],2), org_name_clean[j[0]], unique_org[i]]  # look up the name in the array the model was fitted on
  matches.append(temp)

print('Building data frame...')  
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)','Matched name','Original name'])
print('Done') 

getting nearest n...
COMPLETED IN: 1.2724599838256836
finding matches...
Building data frame...
Done


In [21]:
matches.head(10)

,Match confidence (lower is better),Matched name,Original name
0,0.9,NHS England,NHS England - North West Hub
1,1.0,Bristol Port,University of Bristol
2,0.83,Advantage - West Midlands,Advantage South West
3,0.55,Derby Homes Ltd,Derby Homes Ltd and Derby City Council
4,0.97,Homes for Northumberland Ltd,"Northumberland, Tyne and Wear NHS Foundation Trust"
5,0.94,Welsh Ambulance Services NHS Trust,West Midlands Ambulance Service University NHS Foundation Trust
6,1.05,Home Service Group (HS) - (s BBC),HOME GROUP LIMITED
7,1.06,North West Cultural Consortium,eMBED Health Consortium
8,1.18,Crossrail Limited,Paragon asra Housing Limited
9,0.0,City of York Council,CITY OF YORK COUNCIL
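
The distance column can be related back to the cosine similarity used for deduplication. Assuming the defaults are in effect (`TfidfVectorizer` L2-normalises each row and `NearestNeighbors` measures Euclidean distance), two unit vectors satisfy d² = 2 − 2·cos, so distances run from 0 (identical) to about 1.41 (no shared n-grams). `distance_to_cosine` is a hypothetical helper for that conversion.

```python
import math

def distance_to_cosine(distance):
    """Convert the Euclidean distance between two unit vectors to their cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0

# Distances taken from the match table above.
for d in [0.0, 0.55, 1.0, 1.18]:
    print(d, round(distance_to_cosine(d), 3))
```

Read this way, the 0.85 cosine threshold from the deduplication step corresponds to a distance of roughly 0.55, which gives a starting point for filtering out weak matches such as the rows with distances near or above 1.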
