### Company Name Matching Algorithm

The algorithm matches company names across two datasets by combining two similarity measures: the cosine similarity of TF-IDF vectors and a fuzzy string matching ratio. It runs in six stages:

1. **Preprocessing**: Company names are lowercased and tokenized. English stop words are removed so that the remaining tokens are the most distinguishing parts of each name.

2. **Feature Extraction**: TF-IDF (Term Frequency-Inverse Document Frequency) weights are calculated for the preprocessed names. TF-IDF measures how important a word is to one document relative to a whole collection, so a word that appears in many names counts for less than a rare one. The vectorizer is fitted on the names from both datasets together, which places all names in one shared vector space.

3. **Similarity Calculation**: Cosine similarity is computed between the TF-IDF vectors of every name in the first list and every name in the second list. This measures how similar two names are in terms of the words they contain. Separately, a fuzzy string matching ratio (0 to 100) is calculated on the original, unprocessed names, which measures similarity in terms of character sequences.

4. **Rank Calculation**: For each company in the first list, the cosine similarities and the fuzzy match ratios against all companies in the second list are ranked separately. Rank 1 goes to the highest score, and tied scores share the average of the ranks they span.

5. **Average Rank Calculation**: For each pair of companies, the TF-IDF cosine similarity rank and the fuzzy match ratio rank are averaged. The average rank balances word-level and character-level similarity.

6. **Best Match Identification**: For each company in the first list, the company from the second list with the lowest average rank is flagged as the best match. The result is a DataFrame that lists, for each company in the first list, every company in the second list along with its similarity measures, ranks, and a flag marking the best match.
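
To make the feature extraction and cosine similarity stages concrete, here is a minimal standard-library sketch of the calculation. It is not the notebook's implementation, which uses scikit-learn. The helper names `tfidf_vectors` and `cosine` are hypothetical, and the weighting assumes scikit-learn's default settings (smoothed IDF, L2 normalization, tokens of two or more word characters).

```python
import math
import re
from collections import Counter

# Assumed token rule: runs of two or more word characters,
# which drops punctuation such as the trailing "." in "Inc."
TOKEN = re.compile(r"\b\w\w+\b")

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

# Both lists are vectorized together so they share one vocabulary
names = ["Apple Inc.", "Microsoft Corporation", "Amazon.com Inc.", "Facebook Inc.",
         "Apple", "Microsoft Corp.", "Amazon Inc.", "FB"]
vecs = tfidf_vectors(names)

print(round(cosine(vecs[0], vecs[4]), 4))  # "Apple Inc." vs "Apple"
print(round(cosine(vecs[0], vecs[6]), 4))  # "Apple Inc." vs "Amazon Inc." (shares only "inc")
print(round(cosine(vecs[0], vecs[7]), 4))  # "Apple Inc." vs "FB" (no shared tokens)
```

Because "inc" appears in four of the eight names, its IDF weight is lower than that of "apple", so a shared "inc" alone produces only a modest similarity.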

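Combining the two measures tolerates common variations in how company names are written, including the inclusion or omission of 'Inc.' or 'Corp.', abbreviated suffixes such as 'Corp.' for 'Corporation', and minor variations in spelling. Acronyms that share no words and few characters with the full name, such as 'FB' for 'Facebook Inc.', are much harder for either measure to match.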
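
The character-level side of the comparison can be illustrated with Python's built-in `difflib`. This is a stand-in for illustration only: the notebook calls `fuzz.ratio` from `fuzzywuzzy`, and the helper `char_ratio` below is a hypothetical name. Scores from the two may differ depending on the backend the library uses.

```python
from difflib import SequenceMatcher

def char_ratio(a, b):
    """Character-level similarity on a 0-100 scale: 2 * matches / total length."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

pairs = [
    ("Apple Inc.", "Apple"),                       # suffix dropped
    ("Microsoft Corporation", "Microsoft Corp."),  # suffix abbreviated
    ("Facebook Inc.", "FB"),                       # acronym
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {char_ratio(left, right)}")
```

The comparison is case-sensitive and runs on the raw strings, so the acronym pair matches on a single character and scores far below the other two.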






In [None]:
# Import the necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
from fuzzywuzzy import fuzz
import nltk

nltk.download('stopwords')
nltk.download('punkt')

In [3]:
# Create two dataframes with company names and addresses
df1 = pd.DataFrame({
    'Company': ["Apple Inc.", "Microsoft Corporation", "Amazon.com Inc.", "Facebook Inc."],
    'Address': ["1 Infinite Loop, Cupertino, CA 95014", "1 Microsoft Way, Redmond, WA 98052", "410 Terry Ave N, Seattle, WA 98109", "1 Hacker Way, Menlo Park, CA 94025"]
})

df2 = pd.DataFrame({
    'Company': ["Apple", "Microsoft Corp.", "Amazon Inc.", "FB"],
    'Address': ["1 Apple Park Way, Cupertino, CA 95014", "15010 NE 36th Street, Redmond, WA 98052", "300 Pine St, Seattle, WA 98101", "770 Broadway, New York, NY 10003"]
})

# Preprocessing function
stop_words = stopwords.words('english')

def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(tokens)

# Preprocess the company names
df1['Preprocessed'] = df1['Company'].apply(preprocess)
df2['Preprocessed'] = df2['Company'].apply(preprocess)

# Calculate the TF-IDF cosine similarity
# Fit on both lists together so all vectors share one vocabulary
tfidf_matrix = TfidfVectorizer().fit_transform(pd.concat([df1['Preprocessed'], df2['Preprocessed']]))
vectors = tfidf_matrix.toarray()
n, m = len(df1), len(df2)  # the two lists may differ in length
# Rows correspond to df1 companies, columns to df2 companies
similarity_matrix = cosine_similarity(vectors[:n], vectors[n:])

# Calculate the fuzzy match ratio on the original (unprocessed) names
fuzz_ratio = np.zeros((n, m))
for i, name1 in enumerate(df1['Company']):
    for j, name2 in enumerate(df2['Company']):
        fuzz_ratio[i, j] = fuzz.ratio(name1, name2)

# Generate all combinations of company names and addresses
# (row-major order, matching the flatten() calls below)
combos = [(df1.iloc[i]['Company'], df2.iloc[j]['Company'], df1.iloc[i]['Address'], df2.iloc[j]['Address']) for i in range(n) for j in range(m)]

# Create the final dataframe
similarity_data = pd.DataFrame({
    'Company1': [combo[0] for combo in combos],
    'Company2': [combo[1] for combo in combos],
    'Address1': [combo[2] for combo in combos],
    'Address2': [combo[3] for combo in combos],
    'TF-IDF_Cosine_Similarity': similarity_matrix.flatten(),
    'Fuzzy_Match_Ratio': fuzz_ratio.flatten()
})

# Rank each similarity score (1 is best) separately for each company in df1;
# tied scores share the average of the ranks they span
grouped = similarity_data.groupby('Company1')
similarity_data['TF-IDF_Cosine_Similarity_Rank'] = grouped['TF-IDF_Cosine_Similarity'].rank(ascending=False)
similarity_data['Fuzzy_Match_Ratio_Rank'] = grouped['Fuzzy_Match_Ratio'].rank(ascending=False)

# Calculate average rank
similarity_data['Average_Rank'] = similarity_data[['TF-IDF_Cosine_Similarity_Rank', 'Fuzzy_Match_Ratio_Rank']].mean(axis=1)

# Flag the best match (lowest Average_Rank) for each Company1
similarity_data['Best_Match'] = similarity_data.groupby('Company1')['Average_Rank'].transform(lambda x: x == x.min())

similarity_data


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


| | Company1 | Company2 | Address1 | Address2 | TF-IDF_Cosine_Similarity | Fuzzy_Match_Ratio | TF-IDF_Cosine_Similarity_Rank | Fuzzy_Match_Ratio_Rank | Average_Rank | Best_Match |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple Inc. | Apple | 1 Infinite Loop, Cupertino, CA 95014 | 1 Apple Park Way, Cupertino, CA 95014 | 0.797471 | 67.0 | 1.0 | 1.0 | 1.0 | True |
| 1 | Apple Inc. | Microsoft Corp. | 1 Infinite Loop, Cupertino, CA 95014 | 15010 NE 36th Street, Redmond, WA 98052 | 0.0 | 16.0 | 3.5 | 3.0 | 3.25 | False |
| 2 | Apple Inc. | Amazon Inc. | 1 Infinite Loop, Cupertino, CA 95014 | 300 Pine St, Seattle, WA 98101 | 0.36404 | 57.0 | 2.0 | 2.0 | 2.0 | False |
| 3 | Apple Inc. | FB | 1 Infinite Loop, Cupertino, CA 95014 | 770 Broadway, New York, NY 10003 | 0.0 | 0.0 | 3.5 | 4.0 | 3.75 | False |
| 4 | Microsoft Corporation | Apple | 1 Microsoft Way, Redmond, WA 98052 | 1 Apple Park Way, Cupertino, CA 95014 | 0.0 | 8.0 | 3.0 | 3.0 | 3.0 | False |
| 5 | Microsoft Corporation | Microsoft Corp. | 1 Microsoft Way, Redmond, WA 98052 | 15010 NE 36th Street, Redmond, WA 98052 | 0.412585 | 78.0 | 1.0 | 1.0 | 1.0 | True |
| 6 | Microsoft Corporation | Amazon Inc. | 1 Microsoft Way, Redmond, WA 98052 | 300 Pine St, Seattle, WA 98101 | 0.0 | 19.0 | 3.0 | 2.0 | 2.5 | False |
| 7 | Microsoft Corporation | FB | 1 Microsoft Way, Redmond, WA 98052 | 770 Broadway, New York, NY 10003 | 0.0 | 0.0 | 3.0 | 4.0 | 3.5 | False |
| 8 | Amazon.com Inc. | Apple | 410 Terry Ave N, Seattle, WA 98109 | 1 Apple Park Way, Cupertino, CA 95014 | 0.0 | 10.0 | 3.0 | 3.0 | 3.0 | False |
| 9 | Amazon.com Inc. | Microsoft Corp. | 410 Terry Ave N, Seattle, WA 98109 | 15010 NE 36th Street, Redmond, WA 98052 | 0.0 | 13.0 | 3.0 | 2.0 | 2.5 | False |
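
The fractional ranks in the output, such as 3.5 for the two zero cosine similarities in the "Apple Inc." rows, come from tie handling: pandas' `rank` defaults to giving tied values the average of the ranks they span. The sketch below reproduces that behavior for the "Apple Inc." rows using only the standard library. The function name `rank_descending` is hypothetical, and the scores are copied from the table above.

```python
def rank_descending(scores):
    """Rank scores so the highest gets 1; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of scores equal to the one at `pos`
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Scores for "Apple Inc." against Apple, Microsoft Corp., Amazon Inc., FB
cosine_scores = [0.797471, 0.0, 0.36404, 0.0]
fuzzy_scores = [67, 16, 57, 0]

cosine_rank = rank_descending(cosine_scores)
fuzzy_rank = rank_descending(fuzzy_scores)
average_rank = [(c + f) / 2 for c, f in zip(cosine_rank, fuzzy_rank)]
best = average_rank.index(min(average_rank))

print(cosine_rank)   # [1.0, 3.5, 2.0, 3.5]
print(average_rank)  # [1.0, 3.25, 2.0, 3.75]
print(best)          # 0, i.e. "Apple"
```

Averaging ranks rather than raw scores sidesteps the different scales of the two measures (0 to 1 versus 0 to 100), at the cost of discarding how large the gap between candidates is.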
