# White Distance

The design of the algorithm was driven by the following three requirements:

**A true reflection of lexical similarity** - Strings with small differences should be recognised as being similar. In particular, a significant substring overlap should point to a high level of similarity between the strings.

**A robustness to changes of word order** - Two strings which contain the same words, but in a different order, should be recognised as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognised as dissimilar.
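A minimal sketch of what this requirement means in practice, using the pair-counting approach this notebook describes. The helper names `letter_pairs` and `pair_similarity` and the sample strings are made up for illustration; they are not part of the original notebook.

```python
from collections import Counter


def letter_pairs(text):
    """Adjacent character pairs of each whitespace-separated word, upper-cased."""
    return [word[i:i + 2]
            for word in text.upper().split()
            for i in range(len(word) - 1)]


def pair_similarity(a, b):
    """Twice the number of shared pairs divided by the total number of pairs."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0  # assumption: strings without any pairs count as dissimilar
    shared = sum((Counter(pairs_a) & Counter(pairs_b)).values())
    return 2 * shared / total


# Same words in a different order: every pair is still present.
reordered = pair_similarity("web database", "database web")
# Same letters shuffled into a hypothetical anagram: most pairs are destroyed.
scrambled = pair_similarity("web database", "bwt sdeabeaa")

print(f"reordered: {reordered:.2f}, scrambled: {scrambled:.2f}")
```

Because pairs never span a word boundary, swapping whole words leaves the set of pairs unchanged, while shuffling individual characters breaks most of them.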

**Language Independence** - The algorithm should work not only in English, but in many different languages.
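As a rough illustration, the pair-counting approach needs no dictionary or stemming rules, so it can be applied to non-English words unchanged. The sketch below uses hypothetical helper names (`letter_pairs`, `pair_similarity`) and a Spanish word pair chosen only as an example. It relies on Python's `str.upper()` for case folding and does not address scripts that do not separate words with whitespace.

```python
from collections import Counter


def letter_pairs(text):
    """Adjacent character pairs of each whitespace-separated word, upper-cased."""
    return [word[i:i + 2]
            for word in text.upper().split()
            for i in range(len(word) - 1)]


def pair_similarity(a, b):
    """Twice the number of shared pairs divided by the total number of pairs."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0  # assumption: strings without any pairs count as dissimilar
    shared = sum((Counter(pairs_a) & Counter(pairs_b)).values())
    return 2 * shared / total


# "señor" has 4 pairs, "señora" has 5, and 4 of them are shared.
score = pair_similarity("señor", "señora")
print(f"{score:.2f}")
```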

> The core idea: count how many adjacent character pairs the two strings have in common.

## Steps
1. Convert the words to upper case, making the comparison insensitive to case differences
2. Split each word into adjacent character pairs
3. Apply the formula -
![formula](http://www.catalysoft.com/images/howtostrikeamatch001.gif)
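In case the image does not load, the formula can be written out as follows. This is a transcription based on what the `get_similarity` function in this notebook computes, where $\operatorname{pairs}(s)$ denotes the collection of adjacent character pairs of string $s$ (the notation is an assumption made here for readability):

```latex
\[
\operatorname{similarity}(s_1, s_2) =
\frac{2 \times \lvert \operatorname{pairs}(s_1) \cap \operatorname{pairs}(s_2) \rvert}
     {\lvert \operatorname{pairs}(s_1) \rvert + \lvert \operatorname{pairs}(s_2) \rvert}
\]
```

The result lies between 0 (no pairs in common) and 1 (identical pairs).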

**Example** -

Input words are *France* and *French*

1. Convert them to upper case: *FRANCE* and *FRENCH*
2. Create pairs - *FRANCE*: {FR, RA, AN, NC, CE} & *FRENCH*: {FR, RE, EN, NC, CH}
3. ![example](http://www.catalysoft.com/images/howtostrikeamatch002.gif)
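Written out, the calculation for this example looks like the following. The two words share the pairs FR and NC, and each word has five pairs:

```latex
\[
\operatorname{similarity}(\text{FRANCE}, \text{FRENCH})
= \frac{2 \times \lvert \{\text{FR}, \text{NC}\} \rvert}{5 + 5}
= \frac{4}{10}
= 0.4
\]
```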




In [1]:
from collections import Counter

In [2]:
def upper_case(s):
    return s.upper()

In [3]:
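def get_pairs(s):
    """Return the adjacent character pairs of every word in s."""
    pairs = []
    # split() with no argument also handles tabs and repeated spaces
    words = s.split()
    for word in words:
        # a word of n characters yields n-1 overlapping pairs
        for idx in range(len(word)-1):
            pairs.append(word[idx:idx+2])
    return pairs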

In [4]:
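def get_similarity(s1, s2):
    """Return 2 * (shared pairs) / (total pairs), a value between 0 and 1."""
    s1 = upper_case(s1)
    s2 = upper_case(s2)
    p1 = get_pairs(s1)
    p2 = get_pairs(s2)
    # Multiset intersection: a pair that occurs k times in both strings counts k times.
    # A plain set intersection (set(p1) & set(p2)) would count it only once.
    nr = 2*sum((Counter(p1) & Counter(p2)).values())
    dr = len(p1)+len(p2)
    # Empty or single-character inputs have no pairs; treat them as dissimilar
    # rather than dividing by zero
    if dr == 0:
        return 0.0
    return nr/dr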

In [5]:
get_similarity("france", "french")

0.4
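The set-based line that is commented out in `get_similarity` gives a different answer whenever a pair repeats within a string. The sketch below compares the two ways of counting on a word with repeated pairs. The helper name `letter_pairs` and the test word are illustrative choices, not part of the original notebook.

```python
from collections import Counter


def letter_pairs(text):
    """Adjacent character pairs of each whitespace-separated word, upper-cased."""
    return [word[i:i + 2]
            for word in text.upper().split()
            for i in range(len(word) - 1)]


p1 = letter_pairs("banana")  # BA, AN, NA, AN, NA
p2 = letter_pairs("banana")
total = len(p1) + len(p2)

# Set intersection counts each distinct pair once, so duplicates are lost.
set_based = 2 * len(set(p1) & set(p2)) / total
# Counter intersection keeps the minimum count of each pair.
multiset_based = 2 * sum((Counter(p1) & Counter(p2)).values()) / total

print(f"set: {set_based:.2f}, multiset: {multiset_based:.2f}")
```

With the set-based count, a string compared with itself can score below 1, which is presumably why the notebook uses `Counter` instead.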

**Single Words**

In [6]:
target_string = "Healed"
match_strings = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']

In [7]:
for match in match_strings:
    print("="*80)
    print(f"{target_string}\n{match}")
    print(f"Similarity: {get_similarity(target_string, match)*100:.2f}%")

Healed
Heard
Similarity: 44.44%
Healed
Healthy
Similarity: 54.55%
Healed
Help
Similarity: 25.00%
Healed
Herded
Similarity: 40.00%
Healed
Sealed
Similarity: 80.00%
Healed
Sold
Similarity: 0.00%
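One possible use of these scores is to rank candidates from most to least similar. The sketch below restates the similarity calculation in a self-contained form and sorts the same candidate words. The names `letter_pairs`, `pair_similarity` and `ranked` are illustrative and not part of the original notebook.

```python
from collections import Counter


def letter_pairs(text):
    """Adjacent character pairs of each whitespace-separated word, upper-cased."""
    return [word[i:i + 2]
            for word in text.upper().split()
            for i in range(len(word) - 1)]


def pair_similarity(a, b):
    """Twice the number of shared pairs divided by the total number of pairs."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0  # assumption: strings without any pairs count as dissimilar
    shared = sum((Counter(pairs_a) & Counter(pairs_b)).values())
    return 2 * shared / total


target = "Healed"
candidates = ["Heard", "Healthy", "Help", "Herded", "Sealed", "Sold"]

# Pair each candidate with its score, then sort from best to worst match.
ranked = sorted(((c, pair_similarity(target, c)) for c in candidates),
                key=lambda item: item[1], reverse=True)

for word, score in ranked:
    print(f"{word:<8} {score * 100:6.2f}%")
```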


**Sentences**

In [8]:
target_string = "Web Database Applications"
match_strings = ["Web Database Applications with PHP & MySQL",
                 "Creating Database Web Applications with PHP and ASP", 
                 "Building Database Applications on the Web Using PHP3"]

In [9]:
for match in match_strings:
    print("="*80)
    print(f"{target_string}\n{match}")
    print(f"Similarity: {get_similarity(target_string, match)*100:.2f}%")

Web Database Applications
Web Database Applications with PHP & MySQL
Similarity: 81.63%
Web Database Applications
Creating Database Web Applications with PHP and ASP
Similarity: 71.43%
Web Database Applications
Building Database Applications on the Web Using PHP3
Similarity: 70.18%
