The aim of this exercise is to find all dog names in the 20210103_hundenamen.csv dataset that have a Levenshtein distance of 1 from the name "Luca".

The Levenshtein distance between two words is the minimum number of single-character edits needed to transform one word into the other. A single-character edit adds, removes, or changes exactly one character.
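
To make the definition concrete, here is a minimal sketch of the standard dynamic-programming way to compute the full distance. It is not the function used later in this notebook; the name `levenshtein` and its signature are chosen here for illustration only. Note that this textbook version is case-sensitive.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein distance between strings a and b."""
    # Row 0: turning the empty string into b[:j] takes j additions.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Here previous[j] is the distance between a[:i-1] and b[:j].
        current = [i]  # turning a[:i] into the empty string takes i removals
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # remove char_a
                current[j - 1] + 1,      # add char_b
                previous[j - 1] + cost,  # change char_a into char_b, or keep it
            ))
        previous = current
    return previous[-1]


print(levenshtein("Luca", "Luna"))   # one changed character
print(levenshtein("Luca", "Lucas"))  # one added character
print(levenshtein("Luca", "Lua"))    # one removed character
```

Only two rows of the table are kept at a time, so memory use grows with the length of `b` rather than with the product of both lengths.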

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('20210103_hundenamen.csv')

In [3]:
df.head()

Unnamed: 0,HUNDENAME,GEBURTSJAHR_HUND,GESCHLECHT_HUND
0,Ituma,2011,w
1,"""Bo"" Bendy of Treegarden",2020,m
2,"""Bobby"" Lord Sinclair",2009,m
3,"""Buddy"" Fortheringhay's J.",2011,m
4,"""Fly"" Showring i fly for you",2015,w


In [4]:
len(df)

8574

I write my own function to check whether the Levenshtein distance between two input strings is exactly 1. Alternatively, I could have used the 'levenshtein' function from the 'enchant.utils' module of the third-party PyEnchant package, which computes the full distance.

In [5]:
def levenshtein_dist_1(w1, w2='Luca'):
    """A function that takes two input words w1 and w2 and checks if the Levenshtein distance between them is 1."""
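    # Ignore capitalisation by comparing lower-cased copies. This is a choice made for this exercise;
    # the Levenshtein distance itself treats 'L' and 'l' as different characters.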
    word1 = w1.lower()
    word2 = w2.lower()
    # If one of the words is longer by more than 1 character, then the Levenshtein distance is greater than 1.
    if len(word1) > len(word2) + 1 or len(word1) < len(word2) - 1:
        return False
    # If the two words have the same length, then the Levenshtein distance is 1 only if they differ in exactly one position
    if len(word1) == len(word2):
        return sum(a != b for a, b in zip(word1, word2)) == 1
    # If word1 is longer than word2, then Levenshtein distance is 1 if removing a single character from word1 makes it
    # the same as word2
    if len(word1) == len(word2) + 1:
        for i in range(len(word1)):
            if word1[:i]+word1[i+1:]==word2:
                return True
        return False
    # Same as before but with word1 and word2 reversed
    if len(word2) == len(word1) + 1:
        for i in range(len(word2)):
            if word2[:i]+word2[i+1:]==word1:
                return True
        return False
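
A different way to approach the same check, sketched here only for comparison, is to generate every string that is one edit away from the target and then test membership. The helper name `one_edit_neighbours` and the restriction to the lower-case ASCII alphabet are assumptions of this sketch; with that alphabet, names that need an umlaut or other non-ASCII character to be added or substituted would be missed, which the length-based function above does not suffer from.

```python
import string


def one_edit_neighbours(word, alphabet=string.ascii_lowercase):
    """Return every string exactly one single-character edit away from word."""
    # All ways of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    removals = {left + right[1:] for left, right in splits if right}
    changes = {left + c + right[1:]
               for left, right in splits if right
               for c in alphabet}
    additions = {left + c + right for left, right in splits for c in alphabet}
    # Changing a character into itself reproduces the word, so drop it.
    return (removals | changes | additions) - {word}


neighbours = one_edit_neighbours("luca")
print("luna" in neighbours)   # one changed character
print("lucas" in neighbours)  # one added character
print("lua" in neighbours)    # one removed character
print("luca" in neighbours)   # the word itself is at distance 0
print("lucky" in neighbours)  # needs two edits
```

Building the set once and reusing it is convenient when many names are compared against a single fixed target, at the cost of fixing the alphabet in advance.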

Use the 'apply' and 'unique' methods in pandas together with the function defined above to find all distinct dog names that have a Levenshtein distance of 1 from "Luca".

In [6]:
df[df['HUNDENAME'].apply(levenshtein_dist_1)]['HUNDENAME'].unique()

array(['Cuca', 'Lua', 'Luba', 'Lucas', 'Luce', 'Lucia', 'Lucy', 'Lula',
       'Luma', 'Luna', 'Lupa', 'Yuca'], dtype=object)

The answer is: 'Cuca', 'Lua', 'Luba', 'Lucas', 'Luce', 'Lucia', 'Lucy', 'Lula', 'Luma', 'Luna', 'Lupa', 'Yuca'.
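
As a small illustration of the three kinds of edit from the definition, the names in the answer can be grouped by how they differ from "Luca". This sketch assumes every name in the list is already known to be at distance 1, so comparing lengths is enough to tell the edit type; the helper name `edit_type` is made up for this example.

```python
target = "Luca"
matches = ['Cuca', 'Lua', 'Luba', 'Lucas', 'Luce', 'Lucia', 'Lucy', 'Lula',
           'Luma', 'Luna', 'Lupa', 'Yuca']


def edit_type(name, reference=target):
    """Classify a name that is already known to be one edit away from reference."""
    if len(name) == len(reference):
        return "changed"  # same length: one character was changed
    # Otherwise the lengths differ by one: a character was added or removed.
    return "added" if len(name) > len(reference) else "removed"


groups = {}
for name in matches:
    groups.setdefault(edit_type(name), []).append(name)

for kind, names in groups.items():
    print(kind, names)
```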