<center> <h1>"Companies" Fuzzy-Matching with Levenshtein Algorithm</h1> </center>
<center> <h3>Developed during Sprint#8</h3> </center>
<center> <h4>ML Devs Agustina Dinamarca & Karla Aviles</h4> </center>

## Description

This notebook implements fuzzy-matching of company names based on Levenshtein's algorithm: each input name is looked up in a dictionary of known companies to find its main name (`realName`).
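
For reference, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The following is a minimal pure-Python sketch of the classic dynamic-programming computation. The function name `levenshtein_distance` is illustrative and is not part of the `Levenshtein` package used in this notebook.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute; each costs 1)."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))   # 3
print(levenshtein_distance("apple inc", "aple inc"))  # 1
```

A small distance relative to the string lengths indicates that two names are likely variants of each other.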

Two versions of the fuzzy-matching function were implemented for this use case, which differ in the type of input received by the function:

* Version "v1" receives a list of company names (strings) as input.
* Version "v2" receives a single company name (string) as input, so it must be called inside a for-loop to process a set of companies.

In [None]:
# UNCOMMENT THE FOLLOWING LINES IF REQUIRED
# !pip install Levenshtein
# import nltk; nltk.download('stopwords')  # corpus needed by stopwords.words()

### Libraries

In [1]:
import Levenshtein
import nltk
import os
import pandas as pd
import re
import time

from nltk.corpus import stopwords

### User-defined functions

In [2]:
def clean_query_v1(query:list):
    """
    Cleaning of company names from a list of companies
    
    Parameters
    ----------
    
    query : list
        List of company names
    
    Returns
    -------
    
    query_cleaned : list
        List of cleaned company names
        
    """
    patterns = r"""[!+?:"\,<>\\(){}@%$#=*/[/-]"""
    stpwrds = set(stopwords.words('spanish') + stopwords.words('english'))  # set for fast membership tests
    query_lst = [re.sub(patterns, " ", q.lower()) for q in query]
    query_lst = [[x for x in q.split() if x not in stpwrds] for q in query_lst]
    query_cleaned = [' '.join(q) for q in query_lst]
    return query_cleaned


def apply_fuzzy_matching_levenshtein_v1(df, query:list, topk=5):
    """
    Applies fuzzy-matching (Levenshtein) to a query of company names to find the main name of the companies
    
    Parameters
    ----------
    
    df : Pandas DataFrame
        Company names dictionary dataframe
        
    query : list
        List of company names
    
    topk : int (default=5)
        Number of the most probable matches to return per company name
        
    Returns
    -------
    
    results: list
        List of topk results per company name
    
    """
    query_cleaned = clean_query_v1(query)
    results = []
    for i in range(len(query)):
        df_filtered = df[df["alternate_title"].isin([query_cleaned[i]] + query_cleaned[i].split())].copy()
        if not df_filtered.empty:
            df_filtered['accuracy'] = 100 * df_filtered['alternate_title'].apply(lambda x: Levenshtein.ratio(query_cleaned[i], x))
            df_best_match = df_filtered.nlargest(topk, 'accuracy').round(2)
            best_results = df_best_match[['realName', 'accuracy']].to_dict('records')
        else:
            best_results = [{'realName':'NOT FOUND', 'accuracy': 0}]
        
        results.append(
            {
                'name': query[i],
                'realName': best_results[0]['realName'],
                'accuracy': best_results[0]['accuracy'],
                'topk': best_results
            }
        )
    return results


def clean_query_v2(query:str):
    """
    Cleaning of a company name
    
    Parameters
    ----------
    
    query : str
        A company name
    
    Returns
    -------
    
    query_cleaned : str
        Cleaned company name
        
    """
    patterns = r"""[!+?:"\,<>\\(){}@%$#=*/[/-]"""
    stpwrds = set(stopwords.words('spanish') + stopwords.words('english'))  # set for fast membership tests
    query_lst = re.sub(patterns, " ", query.lower())
    query_lst = [x for x in query_lst.split() if x not in stpwrds]
    query_cleaned = ' '.join(query_lst)
    return query_cleaned


def apply_fuzzy_matching_levenshtein_v2(df, query:str, topk=5):
    """
    Applies fuzzy-matching (Levenshtein) to a company name to find the main name of the company
    
    Parameters
    ----------
    
    df : Pandas DataFrame
        Company names dictionary dataframe
        
    query : str
        A company name
    
    topk : int (default=5)
        Number of the most probable matches to return per company name
        
    Returns
    -------
    
    results: dict
        Dictionary with the topk best results for the company name
    
    """
    query_cleaned = clean_query_v2(query)    
    df_filtered = df[df["alternate_title"].isin([query_cleaned] + query_cleaned.split())].copy()
    if not df_filtered.empty:
        df_filtered['accuracy'] = 100 * df_filtered['alternate_title'].apply(lambda x: Levenshtein.ratio(query_cleaned, x))
        df_best_match = df_filtered.nlargest(topk, 'accuracy').round(2)
        best_results = df_best_match[['realName', 'accuracy']].to_dict('records')
    else:
        best_results = [{'realName':'NOT FOUND', 'accuracy': 0}]
        
    results = {
                'name': query,
                'realName': best_results[0]['realName'],
                'accuracy': best_results[0]['accuracy'],
                'topk': best_results
              }
    return results
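
Both cleaning functions rebuild the stop-word list on every call, which matters for "v2" because it is called once per company. A minimal sketch of one way to avoid this is to compile the pattern and build the stop-word set once at module level. The names `PUNCTUATION`, `STOP_WORDS`, and `clean_name` are illustrative, and the small stop-word set below is a hypothetical stand-in for the NLTK Spanish and English lists so that the example runs without downloading a corpus.

```python
import re

# Same character class as in clean_query_v1 / clean_query_v2, compiled once.
PUNCTUATION = re.compile(r"""[!+?:"\,<>\\(){}@%$#=*/[/-]""")

# Hypothetical stand-in for the NLTK stop-word lists used in the notebook.
STOP_WORDS = frozenset({"the", "of", "and", "de", "la", "el"})


def clean_name(name: str, stop_words=STOP_WORDS) -> str:
    """Lower-case a name, replace punctuation with spaces, drop stop words."""
    tokens = PUNCTUATION.sub(" ", name.lower()).split()
    return " ".join(t for t in tokens if t not in stop_words)


print(clean_name("Connecticut Counseling Centers, Inc"))
# connecticut counseling centers inc
print(clean_name("The Bank of America"))
# bank america
```

With the real NLTK lists, the set would be built once from `stopwords.words('spanish') + stopwords.words('english')` and passed in as `stop_words`.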

### Load the company dictionary dataset

In [3]:
main_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))

path = os.path.join(main_dir, 'data', 'cleaned', 'companies_fuzzy_v3.csv')

t1 = time.time()
df = pd.read_csv(path)
df.rename(columns={'title':'realName'}, inplace=True)
t2 = time.time()

print('Dataframe loaded and processed in {:.2f} s'.format(t2 - t1))
print('Number of samples: {}'.format(df.shape[0]))
print('This is a sample:')
df.head(5)

Dataframe loaded and processed in 0.36 s
Number of samples: 399056
This is a sample:


Unnamed: 0,id,realName,alternate_title
0,385015,Zaker,zaker
1,101222,Yellow Free Economic Zone,yellow free economic zone
2,302411,Kakaopay,kakaopay
3,287632,Kamry,kamry
4,352916,Rainist,rainist


### Load a sample of company names from Indeed for testing

In [4]:
path = os.path.join(main_dir, 'data', 'indeed_sample', 'indeed_sample_companies.csv')

t1 = time.time()
indeed_df = pd.read_csv(path)
t2 = time.time()

query = indeed_df.company.values.tolist()

print('Dataframe loaded and processed in {:.2f} s'.format(t2 - t1))
print('Number of samples: {}'.format(indeed_df.shape[0]))
print('This is a sample:')
indeed_df.head(5)

Dataframe loaded and processed in 0.00 s
Number of samples: 997
This is a sample:


Unnamed: 0,company
0,General Dynamics Information Technology
1,Acadia Healthcare
2,"Connecticut Counseling Centers, Inc"
3,Sprint
4,Sutherland


### Fuzzy-Matching with Levenshtein ("v1" version: list of strings as input)

In [5]:
# topk indicates the number of results per company
topk = 5

t1 = time.time()
results_v1 = apply_fuzzy_matching_levenshtein_v1(df, query, topk)
t2 = time.time()

num = len(query)
t = t2 - t1
print('Number of samples in dictionary: {}\n'.format(df.shape[0]))
print('Returning up to top_k={} results per company name...\n'.format(topk))
print('- Processed {} samples in {:.5f} s\n- Processed 1 sample in {:.5f} s\n'.format(num, t, float(t / num)))

Number of samples in dictionary: 399056

Returning up to top_k=5 results per company name...

- Processed 997 samples in 13.25671 s
- Processed 1 sample in 0.01330 s



**This is what a result looks like for a company:**

In [6]:
# Example
k = 1
company_ex = results_v1[:k]
company_ex 

[{'name': 'General Dynamics Information Technology',
  'realName': 'Gdit',
  'accuracy': 100.0,
  'topk': [{'realName': 'Gdit', 'accuracy': 100.0},
   {'realName': 'Information', 'accuracy': 44.0},
   {'realName': 'Technology', 'accuracy': 40.82},
   {'realName': 'Between', 'accuracy': 40.82},
   {'realName': 'Dynamics', 'accuracy': 34.04}]}]
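
The scores of the partial matches above are consistent with a similarity normalized by the combined length of both strings, `2 * LCS(a, b) / (len(a) + len(b))`, where LCS is the length of the longest common subsequence. The sketch below is a pure-Python illustration under that assumption. It is not the implementation of `Levenshtein.ratio`, and the names `lcs_length` and `indel_ratio` are illustrative.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] based on insertions and deletions only."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return 2 * lcs_length(a, b) / total


query_cleaned = "general dynamics information technology"
for title in ("information", "technology", "dynamics"):
    print(title, round(100 * indel_ratio(query_cleaned, title), 2))
# information 44.0
# technology 40.82
# dynamics 34.04
```

Because a single token is much shorter than the full name, its score stays low even though it matches the dictionary entry exactly.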

### Fuzzy-Matching with Levenshtein ("v2" version: a single string as input)

In [7]:
# topk indicates the number of results per company
topk = 5
results_v2 = []

t1 = time.time()
for company in query:
    res_company = apply_fuzzy_matching_levenshtein_v2(df, company, topk)
    results_v2.append(res_company)
t2 = time.time()

num = len(query)
t = t2 - t1
print('Number of samples in corpus: {}\n'.format(df.shape[0]))
print('Returning up to top_k={} results per company name...\n'.format(topk))
print('- Processed {} samples in {:.5f} s\n- Processed 1 sample in {:.5f} s\n'.format(num, t, float(t / num)))

Number of samples in corpus: 399056

Returning up to top_k=5 results per company name...

- Processed 997 samples in 13.51770 s
- Processed 1 sample in 0.01356 s



**This is what a result looks like for a company:**

In [8]:
# Example
k = 1
company_ex = results_v2[:k]
company_ex 

[{'name': 'General Dynamics Information Technology',
  'realName': 'Gdit',
  'accuracy': 100.0,
  'topk': [{'realName': 'Gdit', 'accuracy': 100.0},
   {'realName': 'Information', 'accuracy': 44.0},
   {'realName': 'Technology', 'accuracy': 40.82},
   {'realName': 'Between', 'accuracy': 40.82},
   {'realName': 'Dynamics', 'accuracy': 34.04}]}]

#### CHECK HERE WITH A RANDOM COMPANY NAME

In [17]:
# Write a company name
name = "apple inc"

query = [name]
q_res = apply_fuzzy_matching_levenshtein_v1(df, query, topk=5)
q_res

[{'name': 'apple inc',
  'realName': 'Apple Inc',
  'accuracy': 100.0,
  'topk': [{'realName': 'Apple Inc', 'accuracy': 100.0},
   {'realName': 'Dorado', 'accuracy': 71.43},
   {'realName': 'Apple', 'accuracy': 71.43},
   {'realName': 'My Inc', 'accuracy': 50.0},
   {'realName': 'Inc', 'accuracy': 50.0}]}]
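
Note that in both versions the candidate set is built with an exact-match filter (`isin`) on the cleaned name and on each of its tokens, and the Levenshtein ratio is only used to rank those candidates. This is why entries such as 'Apple' and 'Inc' appear in the results. A pure-Python sketch of that filter follows; the function name `select_candidates` and the list of alternate titles are hypothetical.

```python
def select_candidates(cleaned_query: str, alternate_titles: list) -> list:
    """Keep titles equal to the full cleaned query or to one of its tokens."""
    keys = {cleaned_query, *cleaned_query.split()}
    return [title for title in alternate_titles if title in keys]


# Hypothetical alternate titles; "aple inc" is a misspelled variant.
titles = ["apple inc", "apple", "inc", "apple computer", "aple inc"]

print(select_candidates("apple inc", titles))
# ['apple inc', 'apple', 'inc']
print(select_candidates("aple inc", ["apple inc", "apple", "inc"]))
# ['inc']
```

Under this filter, a query whose tokens are all misspelled has no candidates and is reported as 'NOT FOUND', regardless of how close it is to a dictionary entry.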

### Observations:

* Version "v1" (list of strings as input) was approximately 0.26 s faster than version "v2" (single string as input) when processing a set of 997 company names extracted from a sample of Indeed data (13.26 s vs. 13.52 s, about 13 ms per name in both cases). Both runs returned the topk=5 best matches per company against a fuzzy-matching dictionary of ~400k samples.

* Execution time appears to grow linearly with the number of company names processed, which is consistent with both versions handling one name at a time.
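
The linear behaviour can be checked by timing the matching function on subsets of increasing size. The following is a minimal sketch: the helper name `time_batches` is illustrative, and `match_fn` stands for any single-name matcher, for example a wrapper around `apply_fuzzy_matching_levenshtein_v2` with `df` and `topk` fixed.

```python
import time


def time_batches(match_fn, names, sizes):
    """Time match_fn over the first n names for each n in sizes.

    Returns a list of (n, elapsed_seconds) tuples.
    """
    timings = []
    for n in sizes:
        subset = names[:n]
        start = time.perf_counter()
        for name in subset:
            match_fn(name)
        timings.append((n, time.perf_counter() - start))
    return timings


# Stand-in matcher so the sketch runs on its own; replace it with the real one.
names = ["company {}".format(i) for i in range(1000)]
for n, seconds in time_batches(str.lower, names, sizes=[250, 500, 1000]):
    print("{} names: {:.5f} s ({:.7f} s per name)".format(n, seconds, seconds / n))
```

If the time per name stays roughly constant across sizes, the total time is growing linearly with the number of names.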