# TEXT_DISTANCE

## Overview
`TEXT_DISTANCE` performs fuzzy string matching with the Python [textdistance](https://github.com/life4/textdistance) library, which implements edit-distance, token-based, sequence-based, and phonetic algorithms. The function scores each lookup value against every string in the lookup array with the chosen algorithm and returns the positions and similarity scores of the closest matches.

## Usage
To use the `TEXT_DISTANCE` function in Excel, enter it as a formula in a cell. `lookup_value` and `lookup_array` are required; `algorithm` and `top_n` are optional:

```excel
=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])
```

## Arguments
| Argument       | Type              | Required | Description                                                                 | Example |
|----------------|-------------------|----------|-----------------------------------------------------------------------------|---------|
| lookup_value   | string or 2D list | Yes      | String(s) to compare with the strings in the lookup_array.                  | `{"apple"}` |
| lookup_array   | 2D list           | Yes      | A list of strings to compare with the lookup_value.                         | `{"appl"; "banana"; "orange"; "grape"}` |
| algorithm      | string            | No       | Specifies the similarity algorithm to use. Default: 'jaccard'.              | `"jaccard"` |
| top_n          | int               | No       | The number of top matches to return for each lookup_value. Default: 1.      | `1` |

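The 2D-list arguments above can be pictured as a list of rows, one inner list per spreadsheet row. The following is a minimal sketch of flattening such an argument into a plain list of strings. It assumes empty cells arrive as `None`, and the helper name `flatten_range` is hypothetical, not part of the function's interface.

```python
def flatten_range(value):
    """Flatten a scalar string or a 2D list of cells into a flat list.

    Hypothetical helper for illustration: empty cells are assumed to
    arrive as None and are dropped.
    """
    if isinstance(value, str):
        return [value]
    return [cell for row in value for cell in row if cell is not None]


# A 4x1 column range (with one empty cell) and a 1x2 row range as 2D lists.
column = [["appl"], ["banana"], [None], ["grape"]]
row = [["aple", "banaa"]]

print(flatten_range(column))   # ['appl', 'banana', 'grape']
print(flatten_range(row))      # ['aple', 'banaa']
print(flatten_range("apple"))  # ['apple']
```

Because empty cells are dropped before matching, positions in this sketch count only the non-empty entries.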
## Returns
| Returns    | Type               | Description                                                      | Example |
|------------|--------------------|------------------------------------------------------------------|---------|
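| Matches    | list or 2D list    | For each lookup_value, a flat row of [position, score, ...] for the top N matches, where position is the 1-based index in lookup_array and score is the normalized similarity (0 to 1, rounded to two decimals). A single lookup_value returns one row; several return one row per value. | `{1, 0.8}` |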

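Since each result row alternates positions and scores, it can help to regroup it into pairs before further processing. A small sketch follows; the helper name `to_pairs` is hypothetical and the sample row holds illustrative values only.

```python
def to_pairs(flat_row):
    """Group a flat [position, score, position, score, ...] row into tuples."""
    # Even indices hold positions, odd indices hold the matching scores.
    return list(zip(flat_row[0::2], flat_row[1::2]))


# Illustrative row for top_n=2 (values chosen for the example, not real output).
row = [1, 0.8, 4, 0.43]

for position, score in to_pairs(row):
    print(f"match at position {position} with score {score}")
```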
## Examples

### Example 1
**Find products with names similar to 'apple' in your product catalog.**

```excel
=TEXT_DISTANCE({"apple"}, {"appl"; "banana"; "orange"; "grape"})
```

**Output:** `{1, 0.8}`

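To see where a score such as 0.8 can come from, here is a pure-Python sketch of a character-level Jaccard similarity. It assumes the strings are compared as multisets (bags) of single characters. It is an illustration of the idea, not the library's implementation, and `jaccard_similarity` is a hypothetical name.

```python
from collections import Counter


def jaccard_similarity(a, b):
    """Jaccard similarity on multisets of characters: |A & B| / |A | B|."""
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())  # per-character minimum counts
    union = sum((count_a | count_b).values())         # per-character maximum counts
    # Assumption of this sketch: two empty strings count as identical.
    return intersection / union if union else 1.0


candidates = ["appl", "banana", "orange", "grape"]
scores = [(index + 1, round(jaccard_similarity("apple", text), 2))
          for index, text in enumerate(candidates)]
best = max(scores, key=lambda pair: pair[1])
print(best)  # (1, 0.8)
```

"apple" and "appl" share four characters (a, p, p, l) out of five in their union, which gives 4 / 5 = 0.8.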
### Example 2
**Find customers with names similar to 'Johnson' in your customer database.**

```excel
=TEXT_DISTANCE("Johnson", {"Johnsen"; "Jonson"; "Johanson"; "Smith"; "Jonsen"}, "jaro_winkler", 3)
```

### Example 3
**Find matches for multiple product names using Levenshtein distance.**

```excel
=TEXT_DISTANCE({"aple", "banaa"}, {"apple"; "banana"; "orange"; "grape"}, "levenshtein", 2)
```
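For intuition about the edit-distance approach used in this example, the sketch below computes a Levenshtein distance with dynamic programming and scales it to a 0 to 1 similarity. Dividing by the length of the longer string is an assumption of this sketch, and both function names are hypothetical. The library's own scaling may differ.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Scale the distance to a similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("aple", "apple"))                # 1
print(round(normalized_similarity("aple", "apple"), 2))     # 0.8
print(round(normalized_similarity("banaa", "banana"), 2))   # 0.83
```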

### Example 4
**Match addresses in your CRM with addresses in your billing system.**

```excel
=TEXT_DISTANCE({"123 Main St"; "456 Oak Ave"}, {"123 Main Street"; "456 Oak Avenue"; "789 Pine Blvd"; "321 Elm Street"}, "ratcliff_obershelp", 1)
```
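The same kind of address matching can be approximated with the standard library alone. The sketch below swaps in `difflib.SequenceMatcher`, whose ratio follows a Ratcliff/Obershelp-style "gestalt" approach; it is a stand-in for illustration, so its scores are not guaranteed to equal those of `TEXT_DISTANCE`. The helper names are hypothetical.

```python
from difflib import SequenceMatcher


def gestalt_ratio(a, b):
    """Similarity as 2 * matched characters / total characters in both strings."""
    # autojunk=False turns off difflib's heuristic for long sequences.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_match(query, candidates):
    """Return (1-based position, rounded score) of the closest candidate."""
    scored = [(index + 1, round(gestalt_ratio(query, text), 2))
              for index, text in enumerate(candidates)]
    return max(scored, key=lambda pair: pair[1])


billing = ["123 Main Street", "456 Oak Avenue", "789 Pine Blvd", "321 Elm Street"]

print(best_match("123 Main St", billing))  # (1, 0.85)
print(best_match("456 Oak Ave", billing))  # (2, 0.88)
```

Each abbreviated address is a prefix of its full form, so all of its characters match and the score depends only on how much longer the full address is.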

## Similarity Algorithms
- jaccard (default)
- jaro_winkler
- levenshtein
- ratcliff_obershelp
- Other algorithm names exposed by the library can also be passed; see the [textdistance documentation](https://github.com/life4/textdistance) for the full list.

In [None]:
import micropip
await micropip.install('textdistance')
import textdistance

def text_distance(needle, haystack, algorithm='jaccard', top_n=1):
    """Calculate text similarity scores between needle(s) and haystack items.

    Args:
        needle: String or 2D list of strings to search for.
        haystack: 2D list of strings to search within.
        algorithm (str): Algorithm name from the textdistance library (default: 'jaccard').
        top_n (int): Number of top matches to return (default: 1).

    Returns:
        list: For each needle, a flat list of [position, score, ...] for the top N
            matches (row format). Positions are 1-based; a single needle returns
            one flat list instead of a list of rows.
    """
    # Spreadsheet and UI inputs may pass numbers as floats (e.g. 1.0); slicing needs an int.
    top_n = int(top_n)
    algo_func = getattr(textdistance, algorithm)
    if isinstance(needle, str):
        needle_flat = [needle] if needle.strip() else []
    else:
        needle_flat = [item for sublist in needle for item in sublist if item is not None]
    haystack_flat = [item for sublist in haystack for item in sublist if item is not None]
    if not haystack_flat:
        return [[] for _ in needle_flat] if needle_flat else []
    results = []
    for needle_item in needle_flat:
        if not str(needle_item).strip():
            results.append([])
            continue
        scores = [(index + 1, round(algo_func.normalized_similarity(str(needle_item), str(item)), 2))
                  for index, item in enumerate(haystack_flat)]
        scores.sort(key=lambda x: x[1], reverse=True)
        row = []
        for score in scores[:top_n]:
            row.extend(list(score))
        results.append(row)
    if len(results) == 1:
        return results[0]
    return results

In [None]:
# Unit Tests (ipytest)
import ipytest
ipytest.autoconfig()

def test_demo_exact_match():
    result = text_distance(
        [["apple"]],
        [["appl"], ["banana"], ["orange"], ["grape"]],
        "jaccard", 1
    )
    assert isinstance(result, list)
    assert len(result) == 2
    assert result[0] == 1
    assert 0 <= result[1] <= 1

def test_demo_customer_names():
    result = text_distance(
        "Johnson",
        [["Johnsen"], ["Jonson"], ["Johanson"], ["Smith"], ["Jonsen"]],
        "jaro_winkler", 3
    )
    assert isinstance(result, list)
    assert len(result) == 6
    assert all(isinstance(x, (int, float)) for x in result)

def test_demo_multiple_products():
    result = text_distance(
        [["aple", "banaa"]],
        [["apple"], ["banana"], ["orange"], ["grape"]],
        "levenshtein", 2
    )
    print(result)
    assert isinstance(result, list)
    assert len(result) == 2
    assert all(isinstance(row, list) for row in result)
    assert all(len(row) == 4 for row in result)

def test_demo_address_fuzzy_matching():
    result = text_distance(
        [["123 Main St"], ["456 Oak Ave"]],
        [["123 Main Street"], ["456 Oak Avenue"], ["789 Pine Blvd"], ["321 Elm Street"]],
        "ratcliff_obershelp", 1
    )
    assert isinstance(result, list)
    assert len(result) == 2
    assert all(isinstance(row, list) for row in result)
    assert all(len(row) == 2 for row in result)

ipytest.run('-s')

In [None]:
# Interactive Demo
import gradio as gr

examples = [
    [
        [["apple"]],
        [["appl"], ["banana"], ["orange"], ["grape"]],
        "jaccard",
        1
    ],
    [
        [["Johnson"]],
        [["Johnsen"], ["Jonson"], ["Johanson"], ["Smith"], ["Jonsen"]],
        "jaro_winkler",
        3
    ],
    [
        [["aple"], ["banaa"]],
        [["apple"], ["banana"], ["orange"], ["grape"]],
        "levenshtein",
        2
    ],
    [
        [["123 Main St"], ["456 Oak Ave"]],
        [["123 Main Street"], ["456 Oak Avenue"], ["789 Pine Blvd"], ["321 Elm Street"]],
        "ratcliff_obershelp",
        1
    ]
]

demo = gr.Interface(
    fn=text_distance,
    inputs=[
        gr.Dataframe(headers=["Needle(s)"], label="Needle(s)", row_count=2, col_count=1, type="array", value=[["apple"]]),
        gr.Dataframe(headers=["Haystack"], label="Haystack", row_count=4, col_count=1, type="array", value=[["appl"], ["banana"], ["orange"], ["grape"]]),
        gr.Textbox(label="Algorithm", value="jaccard"),
        gr.Number(label="Top N", value=1, precision=0),
    ],
    outputs=gr.Dataframe(headers=["Position", "Score"], label="Matches", type="array"),
    examples=examples,
    description="Find the closest matches for a string or list of strings using various text similarity algorithms.",
    flagging_mode="never",
)
demo.launch()