# Deep Learning-Based Name Matching

What I will do:

1. **Input 2 Names**: Suppose we have two names, such as "John" and "Mary". These names are the input data we want to compare.

2. **Tokenization**: Tokenization means breaking down each name into smaller parts, typically words or characters. For our names, tokenization simply splits each name into individual characters, because a single name is short and contains no spaces.

   - Example:
     - "John" → ["J", "o", "h", "n"]
     - "Mary" → ["M", "a", "r", "y"]

3. **Word Embedding**: Word embedding is a way to convert these tokens (characters in this case) into numerical representations (vectors) that capture the relationships between tokens.

   - Each character (token) is assigned a unique numerical vector. These vectors have specific values that encode information about the character's meaning or usage.

     Example (hypothetical):
     - "J" → [0.3, -0.1, 0.8]
     - "o" → [-0.2, 0.5, -0.6]
     - "h" → [0.7, 0.4, -0.2]
     - "n" → [0.1, -0.3, 0.6]

     - "M" → [-0.5, 0.2, -0.7]
     - "a" → [0.4, -0.6, 0.3]
     - "r" → [0.2, 0.1, -0.4]
     - "y" → [-0.3, 0.7, -0.1]

4. **Numerical Representation**: Now, each name ("John" and "Mary") is represented as a single vector by adding the vectors of its constituent tokens (characters) component by component.

   - For "John":
     - Vector representation of "John" = Vector("J") + Vector("o") + Vector("h") + Vector("n")
     - Example (hypothetical): [0.3, -0.1, 0.8] + [-0.2, 0.5, -0.6] + [0.7, 0.4, -0.2] + [0.1, -0.3, 0.6] = [0.9, 0.5, 0.6]

   - For "Mary":
     - Vector representation of "Mary" = Vector("M") + Vector("a") + Vector("r") + Vector("y")
     - Example (hypothetical): [-0.5, 0.2, -0.7] + [0.4, -0.6, 0.3] + [0.2, 0.1, -0.4] + [-0.3, 0.7, -0.1] = [-0.2, 0.4, -0.9]

5. **Dot Product for Similarity**: The dot product is a mathematical operation that can be used to measure the similarity between two vectors. In this case, we calculate the dot product between the vector representations of "John" and "Mary" to determine how similar they are.

   - Dot product formula between two vectors (a and b): a · b = a1 * b1 + a2 * b2 + ... + an * bn (where ai and bi are components of vectors a and b)

   - Example (hypothetical):
     - Dot product of "John" and "Mary" = [0.9, 0.5, 0.6] · [-0.2, 0.4, -0.9]
     - = (0.9 * -0.2) + (0.5 * 0.4) + (0.6 * -0.9)
     - = -0.18 + 0.2 - 0.54
     - = -0.52

   - The resulting value (-0.52 in this example) from the dot product indicates the similarity between "John" and "Mary". A higher value suggests more similarity in their numerical representations, while a lower (or negative) value suggests less similarity. Because the dot product also depends on the lengths of the vectors, it is often normalized into cosine similarity, which is the measure used in the code later in this notebook.

Therefore, by using tokenization, word embedding, and the dot product, we can mathematically quantify the similarity between two names ("John" and "Mary") based on their underlying meanings and contexts as represented by numerical vectors.
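The steps above can be sketched in a few lines of plain Python. This is only an illustration of the toy example: the `CHAR_VECTORS` table reuses the hypothetical character vectors listed earlier, and the helper names (`tokenize`, `embed`, `sum_vectors`, `dot`) are made up for this sketch rather than taken from any library.

```python
from typing import Dict, List

# Hypothetical 3-dimensional character vectors, copied from the example above.
CHAR_VECTORS: Dict[str, List[float]] = {
    "J": [0.3, -0.1, 0.8],
    "o": [-0.2, 0.5, -0.6],
    "h": [0.7, 0.4, -0.2],
    "n": [0.1, -0.3, 0.6],
    "M": [-0.5, 0.2, -0.7],
    "a": [0.4, -0.6, 0.3],
    "r": [0.2, 0.1, -0.4],
    "y": [-0.3, 0.7, -0.1],
}


def tokenize(name: str) -> List[str]:
    """Step 2: split a name into single-character tokens."""
    return list(name)


def embed(tokens: List[str]) -> List[List[float]]:
    """Step 3: look up the vector for each token."""
    return [CHAR_VECTORS[token] for token in tokens]


def sum_vectors(vectors: List[List[float]]) -> List[float]:
    """Step 4: add the token vectors component by component."""
    return [sum(components) for components in zip(*vectors)]


def dot(a: List[float], b: List[float]) -> float:
    """Step 5: dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


john = sum_vectors(embed(tokenize("John")))
mary = sum_vectors(embed(tokenize("Mary")))

# Round for display only; floating-point sums are not exact.
print("John:", [round(x, 2) for x in john])
print("Mary:", [round(x, 2) for x in mary])
print("Dot product:", round(dot(john, mary), 2))
```

Summing the character vectors discards character order, so two anagrams would get the same vector. That is one reason a contextual model such as BERT is used in the rest of this notebook.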

I will use the BERT transformer.

BertTokenizer:

This class is responsible for tokenizing (breaking down) text into individual tokens that can be understood by BERT and other similar models.
It handles tasks like splitting words into subwords (sub-tokenization) using the WordPiece algorithm, converting tokens to IDs (numerical representations), and adding special tokens for tasks like classification or question answering.
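To make the subword splitting concrete, here is a simplified sketch of greedy longest-match-first splitting, the general idea behind WordPiece. It is not the library's implementation: the function name `wordpiece`, the toy vocabulary, and the toy IDs are all assumptions for illustration, and the real tokenizer also normalizes text and adds special tokens.

```python
from typing import List, Optional, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary
# has roughly 30,000 entries.
TOY_VOCAB = {"john", "##son", "##s", "ma", "##ry", "[UNK]"}
# Hypothetical token-to-ID table, just to show the "tokens to IDs" step.
TOY_IDS = {token: index for index, token in enumerate(sorted(TOY_VOCAB))}

for word in ("johnson", "mary", "xyz"):
    tokens = wordpiece(word, TOY_VOCAB)
    ids = [TOY_IDS[token] for token in tokens]
    print(word, "->", tokens, ids)
```

With this toy vocabulary, a word that is not stored whole can still be represented by known pieces, and only a word with no matching pieces falls back to `[UNK]`.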

BertModel:

This class represents the BERT model itself, which is a deep neural network architecture pre-trained on large text corpora.
The BertModel is capable of transforming input text (tokenized sequences) into rich context representations (embeddings) that capture the meaning and context of words within sentences or documents.

Choosing a BERT model:


**WordPiece Tokenization**: BERT uses a subword tokenization approach (WordPiece), which means it can handle out-of-vocabulary words by breaking them down into smaller subunits that are in its vocabulary. This is particularly useful when dealing with abbreviated or truncated names commonly found in transaction descriptions.

**Pre-trained Language Model**: BERT is pre-trained on a large corpus of text data, which includes a wide range of language patterns and nuances. This pre-training helps BERT to generalize well across different domains and tasks, including matching names with varying forms.

**Transfer Learning Benefits**: Leveraging BERT for name matching involves transfer learning, where the model's pre-trained knowledge is transferred to a specific task (name matching). This often leads to improved performance with less labeled data required for training. In our case, we have no labeled data at all.

**State-of-the-art Performance**: BERT demonstrated state-of-the-art performance on various NLP benchmarks and tasks when it was introduced. It is a widely adopted and well-tested model that can provide strong performance for the name matching task.

### Model Architecture:

**Token Embedding Layer**:
   - Convert input names and transaction descriptions into token sequences.
   - Use a tokenization method that captures subword units and handles variations/abbreviations effectively.

**Pre-trained Language Model (BERT)**:
   - Fine-tune a pre-trained transformer model (BERT) for the name matching task:
     - **Input Representation**: Convert tokenized inputs into contextualized embeddings using BERT's token embedding layer.

**Sequence Matching Layer**:

   - **BERT-based Approach**:
     - Use BERT's output embeddings (token representation) as input to downstream sequence matching layers (fully connected layers, softmax for classification).
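As a rough illustration of what a fully connected layer followed by softmax does with an embedding, here is a minimal sketch in plain Python. The 4-dimensional `pooled` vector, the weights `W`, and the bias `b` are hypothetical numbers chosen for the example. In a real classifier the embedding would come from BERT (768 dimensions for `bert-base-uncased`) and the weights would be learned from labeled data.

```python
import math
from typing import List


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: one output (logit) per row of `weights`."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b_i
        for row, b_i in zip(weights, bias)
    ]


def softmax(logits: List[float]) -> List[float]:
    """Convert logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pooled embedding of a (name, description) pair.
pooled = [0.2, -0.5, 0.1, 0.7]
# Hypothetical weights for two classes: row 0 = "no match", row 1 = "match".
W = [
    [0.1, 0.3, -0.2, 0.0],
    [-0.4, 0.2, 0.5, 0.6],
]
b = [0.0, 0.1]

probs = softmax(linear(pooled, W, b))
print({"no_match": probs[0], "match": probs[1]})
```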

**Output Layer**:
   - Output layer to predict the similarity or matching score between customer names and transaction descriptions:
     - **Ranking/Scoring**: Assign a similarity score (cosine similarity) to quantify the match strength between names and transaction descriptions.
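The ranking idea can be sketched as follows: score every candidate description against the customer name with cosine similarity, then sort by score. The vectors and labels below are hypothetical stand-ins for real embeddings, and `rank_candidates` is a helper invented for this sketch.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: List[float], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs sorted from best to worst match."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embedding of a customer name.
customer = [1.0, 0.0, 1.0]
# Hypothetical embeddings of three transaction descriptions.
descriptions = {
    "description_a": [2.0, 0.0, 2.0],    # same direction as the customer vector
    "description_b": [0.0, 1.0, 0.0],    # orthogonal to it
    "description_c": [-1.0, 0.0, -1.0],  # opposite direction
}

for label, score in rank_candidates(customer, descriptions):
    print(f"{label}: {score:.4f}")
```

Cosine similarity ignores vector length, which is why `description_a` scores as a near-perfect match even though its vector is twice as long as the customer vector.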

In [8]:
from transformers import BertModel, BertTokenizer
import torch
from scipy.spatial.distance import cosine

In [2]:
# Load model tokenizer and model

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exact

These lines of code initialize a BERT model and tokenizer for tokenizing and encoding text into numerical representations.
- `tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')`: Loads a pre-trained BERT tokenizer capable of converting text into tokens understood by BERT.
- `model = BertModel.from_pretrained('bert-base-uncased')`: Loads a pre-trained BERT model for processing tokenized text and generating contextual embeddings.

In [3]:
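def encode(text):
    """Return the [CLS] embedding(s) for a string or a list of strings."""
    # Tokenize the text into input IDs and an attention mask (PyTorch tensors)
    encoded_input = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=32)
    # Compute token embeddings without tracking gradients (inference only)
    with torch.no_grad():
        output = model(**encoded_input)
    # Take the first token ([CLS]) embedding for each sample
    embeddings = output.last_hidden_state[:, 0, :]
    return embeddings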

This `encode` function takes a text input, encodes it using the `tokenizer` and `model` loaded above, then extracts an embedding for the text.

- It first tokenizes the `text` input using the provided `tokenizer`, ensuring padding, truncation, and a maximum sequence length of 32 tokens.
  
- The encoded tokens are passed through the `model` (e.g., BERT) to compute embeddings, where the `[CLS]` token representation is extracted.

- The function returns the `[CLS]` token embeddings, which capture contextual information of the entire input text in a fixed-size vector format.
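What the tokenizer options do to the token IDs can be sketched in plain Python. This is a simplified stand-in, not the library's code: `prepare_batch` is a hypothetical helper and the word-piece IDs in the example are made up. The special-token IDs are the ones used by `bert-base-uncased` (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102).

```python
from typing import List, Tuple

CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def prepare_batch(
    batch_token_ids: List[List[int]], max_length: int = 32
) -> Tuple[List[List[int]], List[List[int]]]:
    """Add [CLS]/[SEP], truncate to max_length, then pad to the longest sequence."""
    sequences = []
    for ids in batch_token_ids:
        body = ids[: max_length - 2]  # leave room for [CLS] and [SEP]
        sequences.append([CLS_ID] + body + [SEP_ID])
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask


# Made-up word-piece IDs for two names of different lengths.
input_ids, attention_mask = prepare_batch([[7001], [7002, 7003, 7004]])
print(input_ids)
print(attention_mask)
```

Every row starts with the `[CLS]` ID, so position 0 of the model output always corresponds to `[CLS]`. That is the position `output.last_hidden_state[:, 0, :]` selects.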

In [4]:
# Function to calculate cosine similarity between two names
def name_similarity(name1, name2):
    # Encode each name into a [CLS] embedding of shape (1, hidden_size)
    embedding1 = encode(name1)
    embedding2 = encode(name2)
    # Flatten the embeddings to 1D NumPy arrays
    embedding1 = embedding1.squeeze().numpy()
    embedding2 = embedding2.squeeze().numpy()
    # scipy's cosine() returns the cosine distance, so similarity = 1 - distance
    similarity = 1 - cosine(embedding1, embedding2)
    return similarity

This function `name_similarity` calculates the cosine similarity between two names represented as embeddings. It encodes each name into a numerical embedding, flattens the embeddings to 1D arrays, then computes the cosine similarity between these arrays. Because `scipy.spatial.distance.cosine` returns the cosine distance, the function subtracts it from 1. The resulting similarity score ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction), where higher values indicate greater similarity between the names.

In [6]:
# Example names
name1 = "Adel hany"
name2 = "Adel Hany"

# Compute similarity
similarity = name_similarity(name1, name2)
print(f"The similarity between '{name1}' and '{name2}' is: {similarity:.4f}")

The similarity between 'Adel hany' and 'Adel Hany' is: 1.0000

The score is exactly 1 because `bert-base-uncased` lowercases its input, so both spellings are tokenized identically and produce the same embedding.


In [7]:
# Example names
name1 = "Adel hany"
name2 = "Ahmed Ali"

# Compute similarity
similarity = name_similarity(name1, name2)
print(f"The similarity between '{name1}' and '{name2}' is: {similarity:.4f}")

The similarity between 'Adel hany' and 'Ahmed Ali' is: 0.9177
