# Fuzzy String Matching at Scale

### Entity Resolution is Everywhere, and Sometimes All We Got Are Strings
![StringUrl](https://media.giphy.com/media/l3JDHiU3rdY4oBK3S/giphy.gif "string")  
More and more often, companies are combining data from different sources to get more value out of it. Central to this effort is the concept of entity resolution (or record linkage), which ensures that we are looking at the same record across multiple sources. In some cases the records contain enough different types of information to build a probabilistic estimate of whether they refer to the same entity. In other cases, we may only be looking at one field, such as a name, and we need to decide whether it is enough of a match or not.

#### More Data = Need for Speed
Fuzzy string matching is not a new problem, and several algorithms are commonly employed (Levenshtein distance, for example). However, given the growth in the amount of data being matched, it is increasingly important to be able to perform this matching at scale. Instead of comparing every record to every possible match, we can employ a vectorized approach. Natural Language Processing (NLP) libraries make this not only possible, but relatively painless to implement.
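
To make the cost of the pairwise approach concrete, here is a minimal pure-Python sketch of Levenshtein distance with a brute-force matcher on top of it. The function names `levenshtein` and `best_match_bruteforce` are illustrative and do not come from any library. With m source strings and n targets, the matcher performs m × n distance computations, which is what becomes painful at scale.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]


def best_match_bruteforce(sources, targets):
    """Compare every source to every target: len(sources) * len(targets) calls."""
    return {s: min(targets, key=lambda t: levenshtein(s.lower(), t.lower()))
            for s in sources}


matches = best_match_bruteforce(
    ["The Batmann", "Playmobil the Movie"],
    ["Batman", "The Batman", "Playmobil: The Movie"],
)
print(matches)
```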

#### TF-IDF Approach
I got excited about using the NLP toolkit for short string matching after reading about its success on [Chris van den Berg's Blog Post](https://bergvca.github.io/2017/10/14/super-fast-string-matching.html). TF-IDF stands for "term frequency-inverse document frequency" and is a common approach to measuring similarity/dissimilarity among documents in a corpus (a collection of documents). The TF-IDF calculations typically consist of the following steps:
1. Pre-processing & Tokenization: Perform any cleaning on the data (case conversion, removal of stopwords & punctuation) and convert each document into tokens. Although this is typically done at the word level, we have the flexibility to define a token at a lower level, such as an n-gram, which is more useful for short string matching since we might only have a few words in each string.
2. Calculate the Term Frequency: The purpose of this step is to determine which words define the document. For each document (a string in our case), calculate the frequency of each term (token) in the document and divide by the total number of terms in the document. If we define a token as an n-gram, we will calculate the frequency of each n-gram in the string.
3. Calculate the Inverse Document Frequency: The purpose of this step is to calculate the appropriate *weight* for each term, depending on how often it appears across all documents. A term that appears in all the different documents will have a lower weight compared to a term that only appears in one of the documents. The idea is that a token that appears in all documents is less descriptive of any particular document than a token that appears in only one of them.
4. Calculate the Cosine Similarity: The final step scores each pair of documents by the cosine of the angle between their TF-IDF vectors. As described by Chris van den Berg, data scientists at ING developed a custom library to make the cosine similarity calculations faster than the built-in scikit-learn implementation. We will use this library for matching.
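
The four steps above can be sketched in plain Python on a handful of strings. This is a minimal illustration, not the scikit-learn or ING implementation: the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are hypothetical, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is one common variant that I am assuming here.

```python
import math
from collections import Counter


def tokenize(text, n=3):
    """Step 1: lowercase, then split into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs):
    """Steps 2 and 3: term frequency weighted by inverse document frequency."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + cnt)) + 1 for tok, cnt in df.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({tok: (c / total) * idf[tok] for tok, c in counts.items()})
    return vectors


def cosine(u, v):
    """Step 4: cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


titles = ["The Batman", "Batman", "Batman Returns", "Deadwood: The Movie"]
vecs = tfidf_vectors(titles)

# Score every other title against "The Batman"
for title, vec in zip(titles[1:], vecs[1:]):
    print(f"{title!r}: {cosine(vecs[0], vec):.3f}")
```

Storing each vector as a dict keeps only the non-zero entries, which is the same idea as the sparse matrices used later, just without the speed.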



## Example: Matching Film Titles
Matching titles is a perfect use case, since in many cases there may not be much more than a name to use for matching and we need to find the best match against a medium-large data set. For this example, I will demonstrate the TF-IDF string matching approach using the IMDB title dataset, joining title aliases to the main titles. The IMDB set is ideal for two reasons: 
1. Often title aliases are very similar to the original title name
2. We can join the tables to get the true positive match for every record
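
The ground-truth join can be sketched on toy data. The column names below (`tconst` and `primaryTitle` in the main table, `titleId` and `title` in the alias table) are assumed to follow the IMDb dataset layout, and the alias rows are invented for illustration. Dropping aliases that are identical to the primary title is an optional extra step, not something the dataset requires.

```python
import pandas as pd

# Toy stand-ins for the two IMDb tables; the alias rows are invented
basics = pd.DataFrame({
    "tconst": ["tt1877830", "tt0096895"],
    "primaryTitle": ["The Batman", "Batman"],
})
akas = pd.DataFrame({
    "titleId": ["tt1877830", "tt0096895", "tt0096895"],
    "title": ["Batman, The", "Batman (1989)", "Batman"],
})

# After the join, each alias row carries the primary title it belongs to
truth = akas.merge(basics, left_on="titleId", right_on="tconst", how="inner")

# Optionally drop aliases identical to the primary title (trivial matches)
truth = truth[truth["title"] != truth["primaryTitle"]].reset_index(drop=True)
print(truth[["title", "primaryTitle"]])
```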

The TF-IDF approach is also well suited to movie titles, since some words in a title matter more for matching than others. Removing stopwords is a common early step in the NLP process, but articles such as "The" and "A" are worth keeping around here, since titles may differ by only an article (e.g. [The Batman](https://www.imdb.com/title/tt1877830) vs. [Batman](https://www.imdb.com/title/tt0096895)). However, we still want to place less weight on articles and other common words. For example, [Playmobil: The Movie](https://www.imdb.com/title/tt4199898) should be more similar to "Playmobil" than to [Deadwood: The Movie](https://www.imdb.com/title/tt4943998). As mentioned above, because TF-IDF takes into account how distinctive each token is across documents (here each title is a document), we get this benefit of variable weights.
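
Here is a small sketch of that weighting effect using scikit-learn's `TfidfVectorizer` with word tokens. The corpus is a short hand-picked list rather than the IMDb data, chosen so that "the" and "movie" appear in many titles. On a corpus this small the exact scores mean little, but the ordering illustrates the idea.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hand-picked corpus: "the" and "movie" are common, "playmobil" is rare
titles = [
    "Playmobil: The Movie",
    "Playmobil",
    "Deadwood: The Movie",
    "The Lego Movie",
    "The Simpsons Movie",
    "Bee Movie",
    "The Batman",
    "Scary Movie",
]

vectorizer = TfidfVectorizer(analyzer="word")
tfidf = vectorizer.fit_transform(titles)  # rows are L2-normalized by default

# Learned IDF weight per word: rarer words get larger weights
idf = {word: vectorizer.idf_[i] for word, i in vectorizer.vocabulary_.items()}
print({w: round(float(idf[w]), 2) for w in ("the", "movie", "playmobil", "deadwood")})

# With unit-length rows, the dot product equals the cosine similarity
sims = (tfidf @ tfidf.T).toarray()
print(f"'Playmobil: The Movie' vs 'Playmobil':           {sims[0, 1]:.2f}")
print(f"'Playmobil: The Movie' vs 'Deadwood: The Movie': {sims[0, 2]:.2f}")
```

If the corpus is cut down to just the first three titles, every shared word has the same document frequency and the ordering can flip, which is why the weights only become useful once the vectorizer has seen a reasonably large set of titles.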

For this specific exercise, my goal is to find the top candidate target (main title) for each of the alias titles, and then compare that result with the true positive match. Building on the work from Chris van den Berg's blog post, I created a Python `TitleMatch` class to handle the matching, with the following additional changes: 
- The class methods below are designed to take in two different lists (instead of matching within one list).
- I added an option to return as either a dictionary or dataframe.
- Instead of using a custom ngram function, I leverage scikit-learn's built-in n-gram tokenizer (`char_wb`), which pads each word with a space on both sides, so words shorter than three characters still produce tokens.
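
The difference between the two tokenizers is easiest to see side by side. In this sketch, `ngrams` is a plain sliding-window function written for the comparison (equivalent in output to a zip-based version), and `build_analyzer()` is used to get the callable that scikit-learn applies to each document.

```python
from sklearn.feature_extraction.text import CountVectorizer


def ngrams(string, n=3):
    """Plain sliding-window character n-grams (no padding, no lowercasing)."""
    return [string[i:i + n] for i in range(len(string) - n + 1)]


# build_analyzer() returns the preprocessing + tokenizing callable
char_wb = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).build_analyzer()

for title in ["Up", "The Batman"]:
    print(repr(title))
    print("  custom :", ngrams(title))
    print("  char_wb:", char_wb(title))
```

A two-character title such as "Up" yields no tokens at all from the custom function, so it could never be matched, while `char_wb` still produces padded trigrams. Note also that `char_wb` lowercases by default and never builds an n-gram that spans two words.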

In [1]:
# Load libraries
import re
import time

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from scipy.sparse import csr_matrix
import pandas as pd

import sparse_dot_topn.sparse_dot_topn as ct

In [None]:
# Core Functions

def ngrams(string, n=3):
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]


def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
 
    idx_dtype = np.int32
 
    nnz_max = M*ntop
 
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data,indices,indptr),shape=(M,N))


def make_matchdf(sparse_matrix, A_vec, B_vec):
    # CSR matrix -> COO matrix
    cx = sparse_matrix.tocoo()
    
    # COO matrix to list of tuples
    match_list = []
    for row,col,val in zip(cx.row, cx.col, cx.data):
        match_list.append((row, A_vec[row], col, B_vec[col], val))
    
    # List of tuples to dataframe
    colnames = ['Row Idx', 'Title', 'Candidate Idx', 'Candidate Title', 'Score']
    match_df = pd.DataFrame(match_list, columns=colnames)
    
    return match_df


def make_matchdict(sparse_matrix, A_vec, B_vec):
    # CSR matrix -> COO matrix
    cx = sparse_matrix.tocoo()
    
    # dict value should be tuple of values
    match_dict = {}
    for row,col,val in zip(cx.row, cx.col, cx.data):
        if match_dict.get(row):
            match_dict[row].append((col,val))
        else:
            match_dict[row] = [(col, val)]
    
    return match_dict


def get_top_matches(match_dict):
    # Keep only the highest-scoring candidate(s) for each source row
    max_dict = {}
    for target_key, scores in match_dict.items():

        # Each entry in scores is a (candidate index, score) tuple
        max_score = max(score for _, score in scores)

        # In case of ties, keep all
        matches = [score for score in scores if (score[1] == max_score)]
        max_dict[target_key] = matches

    return max_dict
    

In [None]:
# Aggregate Function
def match_names(source_names, target_names, ntop=10, lower_bound=0.7, analyzer='word', returntype='dict'):
    
    # Set the timer
    t1 = time.time()

    # Create initial count vectorizer & fit it on both lists
    if analyzer == '3gram':
        ct_vect = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))
    elif analyzer == 'word':
        ct_vect = CountVectorizer(analyzer='word')
    elif analyzer == 'ngram':
        ct_vect = CountVectorizer(analyzer=ngrams)
    else:
        raise ValueError(f"Unknown analyzer: {analyzer!r}")
    vocab = ct_vect.fit(source_names + target_names).vocabulary_
    
    # Create tf-idf vectorizer
    if analyzer == '3gram':
        tfidf_vect = TfidfVectorizer(vocabulary=vocab, analyzer='char_wb', ngram_range=(3, 3))
    elif analyzer == 'word':
        tfidf_vect = TfidfVectorizer(vocabulary=vocab, analyzer='word')
    elif analyzer == 'ngram':
        tfidf_vect = TfidfVectorizer(vocabulary=vocab, analyzer=ngrams)
    
    source_names_tfidf_mat = tfidf_vect.fit_transform(source_names)
    target_names_tfidf_mat = tfidf_vect.transform(target_names)
    t2 = time.time()
    
    # Get matches, convert to df
    matches = awesome_cossim_top(source_names_tfidf_mat,
                                 target_names_tfidf_mat.transpose(),
                                 ntop,
                                 lower_bound)
    if returntype == 'df':
        match_output = make_matchdf(matches, source_names, target_names)
    elif returntype == 'dict':
        match_output = make_matchdict(matches, source_names, target_names)
    else:
        raise ValueError(f"Unknown returntype: {returntype!r}")
    t3 = time.time()
    
    # Print out run summary
    print(f"time to create vectorizer: {t2-t1}")
    print(f"time to calculate cosine similarity: {t3-t2}")
    print(f"runtime: {time.time() - t1}")
    
    return match_output

In [None]:
# Import IMDB Main Titles, filtering for movies
imdb_data = pd.read_csv('data/title_basics.tsv', sep='\t')
imdb_movies = imdb_data[imdb_data['titleType'] == 'movie']
imdb_movie_titles = imdb_movies['primaryTitle'].reset_index(drop=True).tolist()
print(f"Target Count: {len(imdb_movie_titles)}")
      
# Import AKAs
imdb_akas = pd.read_csv('data/title_akas.tsv', sep='\t')

TODO:
- Finish the new class in the TF-IDF notebook
- Compare it to running fuzzy on everything, using this code: https://galaxydatatech.com/2017/12/31/fuzzy-string-matching-pandas-fuzzywuzzy/
- Time the code, using our vectorized approach vs. fuzzy-wuzzy, and using different token patterns
- Calculate the match rate (after the join), using different approaches
- Follow-Up: Do the same version, but using Spark instead
- Make sure I follow my steps from TF-IDF (Magic)
- Maybe test different versions of the ngram length?

IMDB Data Source: https://www.imdb.com/interfaces/  
Information courtesy of IMDb (http://www.imdb.com).  
Used with permission.