# `nltk_distance`

## Overview
The `nltk_distance` Python function uses the [NLTK library](https://www.nltk.org/) to calculate similarity scores between strings. It returns an array containing the index of the closest match and its similarity score.

| Task | Description | Boardflare RUNPY() | Excel PY() | Source in Jupyter | Demo Workbook |
|:----:|:------------|:-------:|:----------:|:-------:|:-------:|
| [Fuzzy Matching](https://www.boardflare.com/tasks/nlp/fuzzy-match) | Uses [`nltk`](https://github.com/nltk/nltk) library for similarity scoring with `jaccard`, `jaro`, and `levenshtein`. | ✅ | ✅ | [Open](https://addins.boardflare.com/functions/prod/jupyterlite/lab/index.html?path=text/fuzzy-match/nltk_distance.ipynb) | [Open](https://whistlernetworks.sharepoint.com/:x:/s/Boardflare/Eb_nCI4mR6tImGx_S1hPVs8B4UYmrJRrkk0_Grai6A4adg?e=xfUuNQ) |

## Usage

Compares `lookup_value` with each value in `lookup_array` and returns the 1-based index of the closest match together with a similarity score normalized to the range 0 to 1 (higher is more similar). Scores are comparable only within a single algorithm, not across algorithms.

```python
nltk_distance(lookup_value, lookup_array, algorithm)
```

Arguments:

| Argument        | Positional | Type           | Description                                                    |
|-----------------|------------|----------------|----------------------------------------------------------------|
| `lookup_value`  | arg1       | string or list | The string(s) to look up matches for                           |
| `lookup_array`  | arg2       | list           | Array of strings to search through for matches                 |
| `algorithm`     | arg3       | string         | The similarity algorithm to use: "jaccard", "jaro", or "levenshtein" |

Returns a list of lists, with one inner list per lookup value. Each inner list contains:

| Return Value | Type  | Description                                                         |
|--------------|-------|---------------------------------------------------------------------|
| Index        | int   | Index of the closest matching string (1-based)                      |
| Similarity   | float | Similarity score between 0-1 (higher = more similar)                |
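
Because the returned index is 1-based, looking up the matched string in a 0-based Python list means subtracting one. The following is a minimal sketch of that step, using the sample result that appears in the notebook output on this page. The helper name `resolve_matches` is hypothetical and is not part of `nltk_distance`.

```python
def resolve_matches(results, lookup_array):
    """Map [[index, score], ...] pairs to (matched_string, score) tuples.

    Hypothetical helper: the index in each pair is 1-based, so subtract 1
    before indexing into the Python list.
    """
    return [(lookup_array[index - 1], score) for index, score in results]


# Sample data taken from the notebook example on this page.
lookup_array = ["samples", "exemplar", "sample", "examples"]
results = [[3, 1.0], [2, 0.88], [3, 0.83], [4, 0.86]]

matches = resolve_matches(results, lookup_array)
print(matches)
# [('sample', 1.0), ('exemplar', 0.88), ('sample', 0.83), ('examples', 0.86)]
```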

### BOARDFLARE.RUNPY

```excel
=BOARDFLARE.RUNPY("text/fuzzy-match/nltk_distance.ipynb", lookup_value, lookup_array, algorithm)
```

Example: find the closest match for a single string:

```excel
=BOARDFLARE.RUNPY("text/fuzzy-match/nltk_distance.ipynb", "example", {"sample","test","example"}, "jaccard")
```

```python
%pip install nltk
import pandas as pd
import json

# Setup globals similar to RUNPY function.
# Arrays must be in pandas DataFrame.
arg1 = pd.DataFrame(["sample", "exemplary", "sampler", "example"], columns=['needles'])
arg2 = pd.DataFrame(["samples", "exemplar", "sample", "examples"], columns=['haystack'])
arg3 = 'jaccard'
```





```python
# Convert arg1 DataFrame 'needles' column to nested list and output as JSON
nested_list = [[word] for word in arg1['needles'].values.tolist()]
json.dumps(nested_list)
```

```text
'[["sample"], ["exemplary"], ["sampler"], ["example"]]'
```

```python
nested_list = [[word] for word in arg2['haystack'].values.tolist()]
json.dumps(nested_list)
```

```text
'[["samples"], ["exemplar"], ["sample"], ["examples"]]'
```

```python
arg3
```

```text
'jaccard'
```

```python
import pandas as pd
import nltk
from nltk.metrics.distance import edit_distance, jaccard_distance, jaro_similarity
from nltk.util import ngrams

def nltk_distance(needle, haystack_df, algorithm='jaccard'):
    """
    Calculate the similarity between a needle and a haystack using various distance algorithms.

    Parameters:
    needle (str or pd.DataFrame): The string or DataFrame to search for.
    haystack_df (pd.DataFrame): The DataFrame to search within.
    algorithm (str): The algorithm to use for calculating similarity. Options are 'levenshtein', 'jaccard', and 'jaro'. Default is 'jaccard'.

    Returns:
    list: A list of lists where each sublist contains the index (1-based) and the similarity score of the most similar item in the haystack.
    """
    # Define a dictionary to map algorithm names to functions
    algo_funcs = {
        'levenshtein': lambda x, y: 1 - edit_distance(x, y) / max(len(x), len(y)),
        'jaccard': lambda x, y: 1 - jaccard_distance(set(ngrams(x, 2)), set(ngrams(y, 2))),
        'jaro': jaro_similarity
    }

    # Get the algorithm function from the dictionary
    algo_func = algo_funcs.get(algorithm)
    if algo_func is None:
        raise ValueError(f"Unsupported algorithm: {algorithm}")

    # Flatten the DataFrame to a list
    haystack = haystack_df.values.flatten().tolist()

    # Check if needle is a DataFrame
    if isinstance(needle, pd.DataFrame):
        needle_list = needle.values.flatten().tolist()
    else:
        needle_list = [needle]

    results = []
    for needle_item in needle_list:
        # Calculate similarity scores and round to 2 decimal places
        scores = [(index + 1, round(algo_func(needle_item, item), 2)) for index, item in enumerate(haystack)]

        # Sort based on scores in descending order
        scores.sort(key=lambda x: x[1], reverse=True)
        # Append the top index and score to results as a list
        results.append(list(scores[0]))

    # results is 2D list, e.g. [[1, 0.75], [2, 0.85]]
    return results

nltk_distance(arg1, arg2, arg3)
```

```text
[[3, 1.0], [2, 0.88], [3, 0.83], [4, 0.86]]
```
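
To make the `jaccard` option concrete: it compares the sets of character bigrams of the two strings, and the similarity is the size of their intersection divided by the size of their union. The sketch below reproduces that calculation in plain Python, without NLTK, for one pair from the example above. The names `bigrams` and `jaccard_similarity` are illustrative only. Returning 0.0 when neither string has any bigrams is an assumption of this sketch, not documented behaviour of `nltk_distance`.

```python
def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard_similarity(a, b):
    """Bigram Jaccard similarity: |intersection| / |union|."""
    set_a, set_b = bigrams(a), bigrams(b)
    union = set_a | set_b
    if not union:
        # Assumption of this sketch: strings shorter than two characters
        # have no bigrams, so treat the pair as dissimilar.
        return 0.0
    return len(set_a & set_b) / len(union)


# "sampler" has 6 bigrams, "sample" has 5, and all 5 are shared: 5 / 6.
score = round(jaccard_similarity("sampler", "sample"), 2)
print(score)  # 0.83
```

The result matches the third entry in the output above, where "sampler" is matched to "sample" at index 3.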

```python
# Column headers to use in demo workbook.
headers = ["Lookup_Value(arg1)", "Lookup_Array(arg2)", "Algorithm(arg3)", "Index", "Similarity_Score"]
json.dumps(headers)
```

```text
'["Lookup_Value(arg1)", "Lookup_Array(arg2)", "Algorithm(arg3)", "Index", "Similarity_Score"]'
```

```python
# List of algorithms to test
algorithms = ['jaccard', 'levenshtein', 'jaro']

# Example needle and haystack DataFrame
needle = "sampler"
haystack_df = pd.DataFrame(["sample", "example", "sampling", "test"])

# Calculate results for each algorithm
results = [['Algorithm', 'Closest Match', 'Score']]
for algo in algorithms:
    match, score = nltk_distance(needle, haystack_df, algo)[0]
    results.append([algo, match, float(score)])

# Return results as a nested list with headers
results
```

```text
[['Algorithm', 'Closest Match', 'Score'],
 ['jaccard', 1, 0.83],
 ['levenshtein', 1, 0.86],
 ['jaro', 1, 0.95]]
```
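
All three algorithms pick the same match here but report different scores, because each one normalizes in its own way. As one illustration, the `levenshtein` option converts an edit distance into a similarity by dividing by the length of the longer string. The sketch below is a plain-Python stand-in for that calculation, not NLTK's implementation. The function names are illustrative, and returning 1.0 for two empty strings is an assumption of this sketch.

```python
def levenshtein_distance(a, b):
    """Edit distance (insert, delete, substitute) using two DP rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Normalize the distance to a 0-1 similarity: 1 - distance / longest."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption of this sketch: two empty strings count as identical.
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


# One deletion turns "sampler" into "sample": 1 - 1/7.
score = round(levenshtein_similarity("sampler", "sample"), 2)
print(score)  # 0.86
```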