Requirements 
<ul>
    <li>Your function should account for keyboard distance for all characters (even special characters)</li>
    <li>Your function should account for Levenshtein distance as well. The more typos there are, the more the score should lean towards 1 (since it is unlikely for a user to make so many typos)</li>
    <li>Think about whether a horizontal typo is more likely than a vertical typo; you may want to assign weights to differentiate them</li>
    <li>Think about the case when the strings have different lengths and how you should handle it</li>
    <li>Think about whether it is necessary to distinguish characters that are already very far away (e.g. wikip9dia.org vs wikip0dia.org). Both are most likely typosquats, so is there a need for a different score? How many keys away should a character be before it is considered a typosquat rather than a typo?</li>
    <li>Try to think of any other conditions / requirements that I may have missed out, and feel free to suggest any</li>
</ul>
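
One way to make a score lean towards 1 as typos accumulate is a saturating function of the total edit cost. The sketch below is an assumption and not part of the prototypes in this notebook: `typosquat_score`, `edit_costs` and `scale` are hypothetical names, and the exponential form is only one convenient choice. Each entry in `edit_costs` could be, for example, the keyboard distance between the typed key and the intended key.

```python
from math import exp

def typosquat_score(edit_costs, scale=2.0):
    """Map a list of per-edit costs to a score in [0, 1).

    Hypothetical scoring rule: more edits, or edits between keys that are
    further apart, push the score towards 1. `scale` is an arbitrary tuning
    value, not something measured.
    """
    total = sum(edit_costs)
    return 1.0 - exp(-total / scale)

print(typosquat_score([]))          # no edits: 0.0
print(typosquat_score([1.0]))       # one adjacent-key slip
print(typosquat_score([1.0, 1.0]))  # two slips score higher than one
print(typosquat_score([7.0]))       # one far-away key scores higher still
```

A continuous score like this could replace the hard 0/1 result, with a threshold chosen later from real data.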

what about swapped letters, one-too-many letters

numbers above qwertyuiop are possible typos, but some may be intended typosquats (i.e. o -> 0; E -> 3 ?)

what about when a user presses 2 keys on accident? e.g. wikoipedia -> presses "o" and "k" when trying to press "k"

are special chars/homoglyphs legal in the url box?

what if they miss a letter?
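
For the swapped-letters question, plain Levenshtein distance counts a swap of two adjacent letters as two edits. A minimal sketch of the optimal string alignment variant (a restricted form of Damerau-Levenshtein) is shown below, assuming it is acceptable to count an adjacent swap as a single edit. The function name `osa_distance` is hypothetical and is not used elsewhere in this notebook.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance.

    Same as Levenshtein (insert, delete, substitute), plus swapping two
    adjacent characters as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ai" typed instead of "ia"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("wikipedai.org", "wikipedia.org"))   # swapped letters: 1
print(osa_distance("wikoipedia.org", "wikipedia.org"))  # one extra letter: 1
print(osa_distance("wikpedia.org", "wikipedia.org"))    # one missing letter: 1
```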


[python-Levenshtein PyPI](https://pypi.org/project/python-Levenshtein/)

[euclidean distance using numpy (stack overflow)](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) (may help make calculating ED more efficient?)

[top 10 most common TLDs](https://www.statista.com/statistics/265677/number-of-internet-top-level-domains-worldwide/)

[domain name regex](https://medium.com/@vaghasiyaharryk/how-to-validate-a-domain-name-using-regular-expression-9ab484a1b430)


[Prototype 3 string check](https://stackoverflow.com/questions/774316/python-difflib-highlighting-differences-inline)

In [1]:
from math import sqrt
import dnstwist
from tldextract import extract
import pylev as ls
# import Levenshtein as ls
import numpy as np
import difflib as dl
import re
import pandas as pd

## Euclidean Distance

In [2]:
keyboard_cartesian = {
                        "1": {"y": -1, "x": 0},
                        "2": {"y": -1, "x": 1},
                        "3": {"y": -1, "x": 2},
                        "4": {"y": -1, "x": 3},
                        "5": {"y": -1, "x": 4},
                        "6": {"y": -1, "x": 5},
                        "7": {"y": -1, "x": 6},
                        "8": {"y": -1, "x": 7},
                        "9": {"y": -1, "x": 8},
                        "0": {"y": -1, "x": 9},
                        "-": {"y": -1, "x": 10},
                        "q": {"y": 0, "x": 0},
                        "w": {"y": 0, "x": 1},
                        "e": {"y": 0, "x": 2},
                        "r": {"y": 0, "x": 3},
                        "t": {"y": 0, "x": 4},
                        "y": {"y": 0, "x": 5},
                        "u": {"y": 0, "x": 6},
                        "i": {"y": 0, "x": 7},
                        "o": {"y": 0, "x": 8},
                        "p": {"y": 0, "x": 9},
                        "a": {"y": 1, "x": 0},
                        "s": {"y": 1, "x": 1},
                        "d": {"y": 1, "x": 2},
                        "f": {"y": 1, "x": 3},
                        "g": {"y": 1, "x": 4},
                        "h": {"y": 1, "x": 5},
                        "j": {"y": 1, "x": 6},
                        "k": {"y": 1, "x": 7},
                        "l": {"y": 1, "x": 8},
                        ";": {"y": 1, "x": 9},
                        "'": {"y": 1, "x": 10},
                        "z": {"y": 2, "x": 0},
                        "x": {"y": 2, "x": 1},
                        "c": {"y": 2, "x": 2},
                        "v": {"y": 2, "x": 3},
                        "b": {"y": 2, "x": 4},
                        "n": {"y": 2, "x": 5},
                        "m": {"y": 2, "x": 6},
                        ",": {"y": 2, "x": 7},
                        ".": {"y": 2, "x": 8},
                        "/": {"y": 2, "x": 9}                   
                     }

def euclidean_distance(a,b):
    X = (keyboard_cartesian[a]['x']-keyboard_cartesian[b]['x'])**2
    Y = (keyboard_cartesian[a]['y']-keyboard_cartesian[b]['y'])**2
    return sqrt(X+Y)

print(euclidean_distance('q', 'w'))
print(euclidean_distance('q', 's'))

1.0
1.4142135623730951
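
The requirement about horizontal versus vertical typos can be explored by weighting the two axes separately. The sketch below is an assumption-laden variant of `euclidean_distance`: it builds the same kind of grid from row strings (an unshifted US QWERTY layout, with `;` and `'` on the home row) and adds hypothetical `wx` and `wy` weights. Which direction is actually more error-prone is not established here, so the weights are placeholders.

```python
from math import sqrt

# Rows of an unshifted US QWERTY keyboard, top to bottom (assumed layout)
ROWS = ("1234567890-", "qwertyuiop", "asdfghjkl;'", "zxcvbnm,./")

# Map each key to (x, y); y starts at -1 for the number row, as in the grid above
KEY_POS = {
    ch: (x, y)
    for y, row in enumerate(ROWS, start=-1)
    for x, ch in enumerate(row)
}

def weighted_distance(a, b, wx=1.0, wy=1.0):
    """Euclidean distance between two keys with per-axis weights.

    wx scales horizontal offsets and wy scales vertical offsets; both are
    hypothetical tuning parameters. With wx = wy = 1 this is the plain
    Euclidean distance on the grid.
    """
    ax, ay = KEY_POS[a]
    bx, by = KEY_POS[b]
    return sqrt((wx * (ax - bx)) ** 2 + (wy * (ay - by)) ** 2)

print(weighted_distance("q", "w"))          # 1.0
print(weighted_distance("q", "s"))          # 1.4142135623730951
print(weighted_distance("q", "a", wy=2.0))  # 2.0, vertical slips penalised more
```

Setting `wy` above 1 makes a vertical slip look further away than a horizontal one, so it crosses a fixed threshold sooner.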


### First Try
Simply checking each mismatched character against a Euclidean distance threshold.

In [18]:
# https://stackoverflow.com/questions/44113335/extract-domain-from-url-in-python

def replace_special_char(char):
    # flagged characters (the single quote is not among them)
    flag = '"!@#$%^&*()+?_=,<>:\\'
    # swap any flagged character for the placeholder 'Z'
    if not char.isalnum() and char in flag:
        return 'Z'
    return char
    
# Clean the string first. The extract python library cannot properly extract the domain from URLs with special characters

# 1. make string lower case
# 2. replace all flagged special characters with 'Z'
# 3. extract the domain or TLD
# 4. replace all 'Z' with '!'
# 5. calculate edit distance

def clean_string(url):
    # First make the string lowercase
    url = url.lower()

    # ':' is a flagged character, but if it appears with http or https it is fine
    url = url.replace('https://','')
    url = url.replace('http://','')
    return ''.join([replace_special_char(char) for char in url])
    
def extract_domain_and_tld(url):
    url = clean_string(url)
    tsd, td, tsu = extract(url)
    final = td + '.' + tsu
    return final.replace('Z','!')
    
def extract_tld(url):
    url = clean_string(url)
    tsd, td, tsu = extract(url)
    return tsu.replace('Z','!')  

def extract_sld(url):
    url = clean_string(url)
    tsd, td, tsu = extract(url)
    return td.replace('Z','!')

# There's no need for this container to be mutable, and tuples are faster
# Index 0 - 9 = most popular to least popular
tlds = ("com", "ru", "org", "net", "in", "ir", "au", "uk", "de", "br")

whitelist = ['https://www.wikipedia.org/']


whitelist_domains = []
whitelist_slds = []
for url in whitelist:
    whitelist_domains.append(extract_domain_and_tld(url))
    whitelist_slds.append(extract_sld(url))
typosquat = []
for url in whitelist_domains:
    fuzz = dnstwist.DomainFuzz(url)
    fuzz.generate()
    typosquat.extend([x['domain-name'] for x in fuzz.domains])
    
#Lookups in sets are much more efficient
typosquat = set(typosquat)

#Delete the original whitelisted domains from the blacklist set
typosquat.difference_update(whitelist_domains)

print('Number of typosquatted urls generated: ', len(typosquat))

Number of typosquatted urls generated:  7900


In [4]:
legit = "wikipedia.org"
typo = "wiiipedia.org" 
typosqt = "wikiped1a.org"

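def typo_check(url):
    # compare position by position; zip stops at the shorter string
    for legit_char, url_char in zip(legit, url):
        if legit_char != url_char:
            # classify on the first mismatch only
            if euclidean_distance(legit_char, url_char) >= 1.5:
                # typosquat
                return 1
            # typo
            return 0
    # no mismatch found in the compared positions
    return 0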

print(typo_check(typosqt))

1


### Notes

T : URL being tested <br>
L : Legit URL <br>
LD : levenshtein distance <br>
ED : euclidean distance <br>

Assumptions: <br>
<ul>
    <li>Given the URL being tested (T), its legit URL (L) is known</li>
    <li>T has an edit distance of at least 1 from L</li>
    <li>There is no reason to press shift: URLs are not case sensitive, and the only legal special character is the hyphen, which does not require shift</li>
    <li>Hyphens are only allowed in between characters in domain names (i.e. domain names cannot start/end with hyphens, and the top level domain cannot contain a hyphen)</li>
    <li>T is assumed to be a typosquat when it meets one of the following:
        <ol>
            <li>Contains illegal special characters</li>
            <li>Has a Levenshtein distance of more than 1</li>
            <li>Is shorter than len(L) - 1</li>
            <li>Is longer than len(L) + 1</li>
        </ol>
    </li>
</ul>
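
The character and hyphen assumptions above can also be checked label by label instead of with a single regular expression, which makes it easier to report why a domain was rejected. This is a minimal sketch under those same assumptions; `illegal_label_reasons` and `ALLOWED` are hypothetical names, and the rule that a top level domain contains no hyphen is taken from the assumption list rather than from any standard.

```python
import string

# Characters assumed legal in a domain label: letters, digits and hyphen
ALLOWED = set(string.ascii_lowercase + string.digits + "-")

def illegal_label_reasons(domain):
    """Return a list of reasons the domain breaks the assumptions above.

    An empty list means no rule was violated.
    """
    reasons = []
    labels = domain.lower().split(".")
    if len(labels) < 2:
        reasons.append("no top level domain")
    for label in labels:
        if not label:
            reasons.append("empty label")
            continue
        bad = sorted(set(label) - ALLOWED)
        if bad:
            reasons.append("illegal characters: " + "".join(bad))
        if label.startswith("-") or label.endswith("-"):
            reasons.append("label starts or ends with a hyphen")
    # assumption from the list above: the top level domain has no hyphen
    if "-" in labels[-1]:
        reasons.append("hyphen in top level domain")
    return reasons

print(illegal_label_reasons("wikipedia.org"))   # []
print(illegal_label_reasons("wikipedi@.org"))   # ['illegal characters: @']
print(illegal_label_reasons("-wikipedia.org"))  # ['label starts or ends with a hyphen']
```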

Then, checks:
- if there are special characters in T, typosquat (in domain names: hyphens are allowed, underscores are not)
- if length of T = length of L + 1, check how far away the extra letter is from the previous letter. was it fat-fingered?
- if length of T = length of L, check LD. if > 2, definitely typosquat
- if length of T = length of L AND LD <= 2, check ED. if any error has ED > 1.5, typosquat

definitely typosquat:
- special char present anywhere
- len(T) > len(L) + 1
- LD > 2
- ED > 1.5


## is_typo(url) Function
Code Flow:
- checks if url has any special characters
- checks LD of url against legit URL
- checks the TLD of url against legit URL
- checks length of url against length of legit URL
- checks the ED of any wrong char in url against corresponding char in legit URL

In [20]:
def is_typo(url):
    l = whitelist_domains[0]
    
    result = {"suspicious url": url, "original url": l, "result": 0, "reasons_typosquat": [], "reasons_typo": []}
    
    url_sld = extract_sld(url)
    l_sld = extract_sld(l)
    
    url_tld = extract_tld(url)
    l_tld = extract_tld(l)
    
    url_len = len(url)
    l_len = len(l)
    
    # checks if illegal special characters are present
    if not re.match("^[^-][a-zA-Z0-9-]{1,}[^-][.]{1}[a-zA-Z0-9]{1,}$", url):
        result["result"] = 1
        result["reasons_typosquat"].append("Illegal characters found in url")
    
    # checks LD
    if ls.levenshtein(url, l) > 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Edit distance more than 1")
    elif ls.levenshtein(url, l) == 1:
        result["reasons_typo"].append("Edit distance is only one")
    
    # checks TLD
    if url_tld != l_tld:
        url_index = 10
        l_index = 10
        if url_tld in tlds:
            url_index = tlds.index(url_tld)
        if l_tld in tlds:
            l_index = tlds.index(l_tld)
        if l_index > url_index:
            result["reasons_typo"].append("TLD is more common")
            
     
    # compares lengths
    if url_len > l_len + 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Too long")
    
    elif url_len < l_len - 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Too short")
    
    elif url_len == l_len + 1:
        for i in range(len(l)):
            if url[i] != l[i]:
                url_left = url[i - 1] if i != 0 else None # char on the left of the wrong/extra char
                url_middle = url[i] # wrong/extra char
                url_right = url[i + 1] if i + 1 < len(url) else None # char on the right of wrong/extra char          
                
                if url_left == None:
                    if euclidean_distance(url_right, url_middle) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")
                        
                elif url_right == None:
                    if euclidean_distance(url_left, url_middle) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")
                else:
                    if euclidean_distance(url_left, url_middle) > 1.5 and euclidean_distance(url_right, url_middle) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")
                
                # prevent the function from running on the rest of the string
                break
                
    elif url_len == l_len:
        for i in range(len(l)):
            if url[i] != l[i]:
                if euclidean_distance(url[i], l[i]) > 1.5:
                    result["result"] = 1
                    result["reasons_typosquat"].append("'{}' key and '{}' key are too far apart".format(url[i], l[i]))
                else:
                    result["reasons_typo"].append("Wrong character is within ED boundary")
    
    return result

## Testing

<u>Potential Results</u>: <br>
0: Typo <br>
1: Typosquat

<u>Test Cases</u>:
1. Legit URL (Expected Result: 0)
2. Substitute 1 char with a special character (Expected Result: 1)
3. Substitute 1 char with a hyphen (Expected Result: 1)
4. Substitute 1 char with a wrong char, within ED boundary (Expected Result: 0)
5. Substitute 2 chars with wrong chars, within ED boundary (Expected Result: 0)
6. Substitute 3 chars with wrong chars, within ED boundary (Expected Result: 1)
7. Append 1 char, within ED boundary of last char (Expected Result: 0)
8. Append 1 char, exceeding ED boundary (Expected Result: 1)
9. Append 2 chars, within ED boundary (Expected Result: 1)
10. Remove 1 char (Expected Result: 1)
    
<u>Conclusion/Notes</u>:
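- need to recheck for special characters, as they are not accounted for in the checks that follow
- is_typo(url) compares every test case against whitelist_domains[0] (wikipedia.org), so all ten bankofsingapore.com cases are flagged as typosquats regardless of the edit; the expected results above assume the legit URL is bankofsingapore.com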
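
To compare expected and actual results directly, the test cases can be paired with their expected values and run through a small harness. The sketch below is illustrative only: `run_cases` and `length_only_classifier` are hypothetical names, and the stand-in classifier returns a bare 0/1 rather than the result dictionary used in this notebook. A wrapper such as `lambda s, l: is_typo(s, l)["result"]` could adapt a two-argument classifier that returns a dictionary.

```python
def run_cases(classifier, legit_url, cases):
    """Run (url, expected) pairs through classifier and record each outcome."""
    rows = []
    for url, expected in cases:
        actual = classifier(url, legit_url)
        rows.append({
            "url": url,
            "expected": expected,
            "actual": actual,
            "passed": actual == expected,
        })
    return rows

# Stand-in classifier (hypothetical): flags only a length difference above 1
def length_only_classifier(sus_url, legit_url):
    return 1 if abs(len(sus_url) - len(legit_url)) > 1 else 0

cases = [
    ("bankofsingapore.com", 0),    # legit URL
    ("bankofsingaporeee.com", 1),  # append 2 chars
    ("bankofsingapor.com", 1),     # remove 1 char
]

for row in run_cases(length_only_classifier, "bankofsingapore.com", cases):
    print(row)
```

The third case fails with this stand-in because a one-character length difference is not enough to flag it, which is the kind of mismatch the harness is meant to surface.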

In [21]:
test_cases = ["bankofsingapore.com", 
              "bankofs!ngapore.com", 
              "bankofsing-pore.com", 
              "bankofsingaporr.com", 
              "bankofsingapoer.com", 
              "bankofsingapier.com", 
              "bankofsingaporee.com", 
              "bankofsingaporep.com", 
              "bankofsingaporeee.com", 
              "bankofsingapor.com"]

test_results = []

for url in test_cases:
    test_results.append(is_typo(url))
    
df = pd.DataFrame(test_results)
df

| | suspicious url | original url | result | reasons_typosquat | reasons_typo |
| --- | --- | --- | --- | --- | --- |
| 0 | bankofsingapore.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |
| 1 | bankofs!ngapore.com | wikipedia.org | 1 | [Illegal characters found in url, Edit distanc... | [TLD is more common] |
| 2 | bankofsing-pore.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |
| 3 | bankofsingaporr.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |
| 4 | bankofsingapoer.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |
| 5 | bankofsingapier.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |
| 6 | bankofsingaporee.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |
| 7 | bankofsingaporep.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |
| 8 | bankofsingaporeee.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |
| 9 | bankofsingapor.com | wikipedia.org | 1 | [Edit distance more than 1, Too long] | [TLD is more common] |


### Prototype 2

In [22]:
def is_typo(sus_url, legit_url):
    result = {"suspicious url": sus_url, "original url": legit_url, "result": 0, "reasons_typosquat": [], "reasons_typo": []}
    
    sus_sld = extract_sld(sus_url)
    legit_sld = extract_sld(legit_url)
    
    sus_tld = extract_tld(sus_url)
    legit_tld = extract_tld(legit_url)
    
    sus_len = len(sus_url)
    legit_len = len(legit_url)
    
    # checks length
    if legit_len + 1 < sus_len:
        result["result"] = 1
        result["reasons_typosquat"].append("Too long")
        return result
    elif sus_len < legit_len - 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Too short")
        return result
    
    # checks if illegal special characters are present
    if not re.match(r"^((?!-)[A-Za-z0-9-]{1,}(?<!-)\.)+[A-Za-z0-9]{1,}$", sus_url):
        result["result"] = 1
        result["reasons_typosquat"].append("Illegal characters found in url")
        return result
    
    # checks TLD; WIP
    if sus_tld != legit_tld:
        sus_index = 10
        legit_index = 10
        if sus_tld in tlds:
            sus_index = tlds.index(sus_tld)
        if legit_tld in tlds:
            legit_index = tlds.index(legit_tld)
        if legit_index > sus_index:
            result["reasons_typo"].append("TLD is more common")
            
            # reassign values so next checks will only be on sld, as tld mistakes have already been ruled out
            sus_url = sus_sld
            legit_url = legit_sld
            sus_len = len(sus_url)
            legit_len = len(legit_url)
    
    
    # levenshtein distance check
    if ls.levenshtein(sus_url, legit_url) > 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Edit distance more than 1")
        return result
    else:
        # length checks
        if sus_len == legit_len:
            for i in range(len(legit_url)):
                if sus_url[i] != legit_url[i]:
                    if euclidean_distance(sus_url[i], legit_url[i]) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("'{}' key and '{}' key are too far apart".format(sus_url[i], legit_url[i]))
                    else:
                        result["reasons_typo"].append("Wrong character is within ED boundary")
        
        elif sus_len == legit_len + 1:
            for i in range(len(legit_url)):
                if sus_url[i] != legit_url[i]:
                    sus_left = sus_url[i - 1] if i != 0 else None # char on the left of the wrong/extra char
                    sus_middle = sus_url[i] # wrong/extra char
                    sus_right = sus_url[i + 1] if i + 1 < len(sus_url) else None # char on the right of wrong/extra char

                    if sus_left == None:
                        if euclidean_distance(sus_right, sus_middle) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")

                    elif sus_right == None:
                        if euclidean_distance(sus_left, sus_middle) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")
                    else:
                        if euclidean_distance(sus_left, sus_middle) > 1.5 and euclidean_distance(sus_right, sus_middle) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")

                    # prevent the function from running on the rest of the string
                    break
        
        elif sus_len == legit_len - 1:
            result["reasons_typo"].append("Only missing 1 character")
            
        if len(result["reasons_typo"]) > 1:
            result["result"] = 1
            result["reasons_typosquat"].append("Too many typos")
    
    return result

In [23]:
legit_url = whitelist_domains[0]
test_urls = ["wikipedia.org", "wikipediaaaaa.org", "wikipe.org", "wikipedi@.org", "wikipedia.com", "wikipedis.com", "wikipedai.org", "wikipedih.org", "wikipediap.org"]
test_results = []

for url in test_urls:
    test_results.append(is_typo(url, legit_url))

df = pd.DataFrame(test_results)
df

| | suspicious url | original url | result | reasons_typosquat | reasons_typo |
| --- | --- | --- | --- | --- | --- |
| 0 | wikipedia.org | wikipedia.org | 0 | [] | [] |
| 1 | wikipediaaaaa.org | wikipedia.org | 1 | [Too long] | [] |
| 2 | wikipe.org | wikipedia.org | 1 | [Too short] | [] |
| 3 | wikipedi@.org | wikipedia.org | 1 | [Illegal characters found in url] | [] |
| 4 | wikipedia.com | wikipedia.org | 0 | [] | [TLD is more common] |
| 5 | wikipedis.com | wikipedia.org | 1 | [Too many typos] | [TLD is more common, Wrong character is within... |
| 6 | wikipedai.org | wikipedia.org | 1 | [Edit distance more than 1] | [] |
| 7 | wikipedih.org | wikipedia.org | 1 | ['h' key and 'a' key are too far apart] | [] |
| 8 | wikipediap.org | wikipedia.org | 1 | [Extra character too far from characters next ... | [] |


### Prototype 3
Helper functions

In [32]:

# Helper function to check whether a URL contains illegal special characters

# Input - 1 URL
# Output - True / False

# e.g. contains_special_characters("wikipedia.org")
# expected output False

# e.g. contains_special_characters("wikipedi@.org")
# expected output True

def contains_special_characters(url):
    # raw string so that \. is passed to the regex engine unchanged
    return not re.match(r"^((?!-)[A-Za-z0-9-]{1,}(?<!-)\.)+[A-Za-z0-9]{1,}$", url)


# Helper function to find out if the first TLD is more common than the second TLD

# Input - 2 TLDs
# Output - True / False

# e.g. tld_more_common("com", "org")
# expected output True

# e.g. tld_more_common("org", "com")
# expected output False

def tld_more_common(tld1, tld2):
    # TLDs outside the top-10 tuple share the lowest rank (index 10)
    tld1_index = tlds.index(tld1) if tld1 in tlds else 10
    tld2_index = tlds.index(tld2) if tld2 in tlds else 10
    # a smaller index means a more common TLD; equal ranks return False
    return tld1_index < tld2_index



# Helper function to perform various euclidean distance checks based off of the lengths of both URLs

# Input - 2 URLs
# Output - dictionary of results

# e.g. edit_check("w1k1pedia.org", "wikipedia.org")
# expected output: {'result': 1, 'reasons_typosquat': ["'1' key and 'i' key are too far apart", "'1' key and 'i' key are too far apart"], 'reasons_typo': []}

def edit_check(sus_url, legit_url):
    result = {"result": 0, "reasons_typosquat": [], "reasons_typo": []}
    
    # retrieving length of both URLs
    sus_len = len(sus_url)
    legit_len = len(legit_url)
    
    # if lengths are equal, assume characters were substituted and compare position by position
    if sus_len == legit_len:

        # running through each letter in both sides
        for i in range(len(legit_url)): 

            # if letters dont match, check ED
            if sus_url[i] != legit_url[i]: 
                if euclidean_distance(sus_url[i], legit_url[i]) > 1.5:
                    result["result"] = 1
                    result["reasons_typosquat"].append("'{}' key and '{}' key are too far apart".format(sus_url[i], legit_url[i]))
                else:
                    result["reasons_typo"].append("Wrong character is within ED boundary")

    # if sus_len == legit_len - 1, missing 1 char, swapped 1
    # if sus_len == legit_len + 1, extra 1 char, swapped 1
    elif sus_len == legit_len - 1 or sus_len == legit_len + 1:

        # creating an object that compares both urls
        seqm = dl.SequenceMatcher(None, sus_url, legit_url)

        # get_opcodes() gets the "differences" between each url, or the steps required for first url to match second url
        for opcode, a0, a1, b0, b1 in seqm.get_opcodes():
            # a : first url
            # b : second url
            # a0, a1 | b0, b1 : index range holding the characters being compared
            # opcode : "equal", "insert", "delete", "replace" -- indicates the action required to turn a into b

            # if a character has to be deleted, means it's an extra character
            if opcode == 'delete':

                # retrieving the extra char, and the chars on its left and right
                extra_char = seqm.a[a0: a1]
                left_char = seqm.a[a0-1: a1-1]
                right_char = seqm.a[a0+1: a1+1]

                # if the left char is the end of the url, check only the right char
                if left_char == "":
                    if euclidean_distance(right_char, extra_char) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")

                # if the right char is the end of the url, check only the left char
                elif right_char == "":
                    if euclidean_distance(left_char, extra_char) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")

                # check both left and right
                else:
                    if euclidean_distance(left_char, extra_char) > 1.5 and euclidean_distance(right_char, extra_char) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")

            # if a character has to be replaced, means its a swapped character
            elif opcode == 'replace':
                wrong_char = seqm.a[a0: a1]
                correct_char = seqm.b[b0:b1]

                # if the wrong char + extra/missing char are side by side, correct_char will hold both correct characters in one string
                if len(correct_char) == 2:

                    # retrieving each of the correct chars, and the chars on their left and right
                    left_char = seqm.b[b0-1:b1-2]
                    correct1 = correct_char[0]
                    correct2 = correct_char[1]
                    right_char = seqm.b[b0+2:b1+1]

                    if euclidean_distance(left_char, wrong_char) > 1.5 and euclidean_distance(correct1, wrong_char) > 1.5 and euclidean_distance(correct2, wrong_char) > 1.5 and euclidean_distance(right_char, wrong_char) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")                            
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")

                else:
                    if euclidean_distance(wrong_char, correct_char) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("'{}' key and '{}' key are too far apart".format(wrong_char, correct_char))
                    else:
                        result["reasons_typo"].append("Wrong character is within ED boundary")

    elif sus_len == legit_len + 2:
        # creating an object that compares both urls
        seqm = dl.SequenceMatcher(None, sus_url, legit_url)

        # get_opcodes() gets the "differences" between each url, or the steps required for first url to match second url
        for opcode, a0, a1, b0, b1 in seqm.get_opcodes():
            # a : first url
            # b : second url
            # a0, a1 | b0, b1 : index range holding the characters being compared
            # opcode : "equal", "insert", "delete", "replace" -- indicates the action required to turn a into b

            # if a character has to be deleted, means it's an extra character
            if opcode == 'delete':
                extra_char = seqm.a[a0: a1]

                # if the extra chars are next to each other, extra_char will hold both extra chars
                if len(extra_char) == 2:

                    # retrieving the extra characters individually, and the chars on their left and right
                    left_char = seqm.a[a0-1:a1-2]
                    extra1 = extra_char[0]
                    extra2 = extra_char[1]
                    right_char = seqm.a[a0+2:a1+1]

                    # if the char on the left is empty, check only against the right
                    if left_char == "":
                        if euclidean_distance(right_char, extra1) > 1.5 and euclidean_distance(extra1, extra2) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")

                    # if the char on the right is empty, check only against the left
                    elif right_char == "":
                        if euclidean_distance(left_char, extra1) > 1.5 and euclidean_distance(extra1, extra2) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")

                    # check both left and right
                    else:                        
                        # checking the ED of extra chars against each char next to it on the url
                        if euclidean_distance(left_char, extra1) > 1.5 and euclidean_distance(extra1, extra2) > 1.5 and euclidean_distance(extra2, right_char) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")                          
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")
                else:

                    # retrieving the chars on the left and right of the extra char
                    left_char = seqm.a[a0-1: a1-1]
                    right_char = seqm.a[a0+1: a1+1]

                    # if the char on the left is empty, check only against the right
                    if left_char == "":
                        if euclidean_distance(right_char, extra_char) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")

                    # if the char on the right is empty, check only against the left
                    elif right_char == "":
                        if euclidean_distance(left_char, extra_char) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")

                    # check both left and right
                    else:
                        if euclidean_distance(left_char, extra_char) > 1.5 and euclidean_distance(right_char, extra_char) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")


    # if sus_len == legit_len -2, missing 2 characters, TYPO
    elif sus_len == legit_len - 2:
        result["reasons_typo"].append("Only 2 characters are missing")
    
    return result
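
To see what `edit_check` receives from `difflib`, it helps to print the non-equal opcodes for a pair of URLs. The sketch below uses only the standard `difflib.SequenceMatcher.get_opcodes()` call; the helper name `describe_edits` is hypothetical and exists only for illustration.

```python
import difflib

def describe_edits(sus_url, legit_url):
    """List the (opcode, text in sus_url, text in legit_url) for each difference."""
    matcher = difflib.SequenceMatcher(None, sus_url, legit_url)
    edits = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        # "equal" blocks carry no edit, so skip them
        if tag != "equal":
            edits.append((tag, sus_url[a0:a1], legit_url[b0:b1]))
    return edits

print(describe_edits("wikipediap.org", "wikipedia.org"))  # [('delete', 'p', '')]
print(describe_edits("wikipedih.org", "wikipedia.org"))   # [('replace', 'h', 'a')]
```

A `delete` means the suspicious URL has an extra character, and a `replace` means a character (or a short run of characters) differs, which is how the branches above interpret them.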

Main program code

In [37]:
def is_typo(sus_url, legit_url):
    # preparing the result dictionary
    result = {"suspicious url": sus_url, "original url": legit_url, "result": 0, "reasons_typosquat": [], "reasons_typo": []}
    
    # extracting SLD from both URLs
    sus_sld = extract_sld(sus_url)
    legit_sld = extract_sld(legit_url)
    
    # extracting TLD from both URLs
    sus_tld = extract_tld(sus_url)
    legit_tld = extract_tld(legit_url)
    
    # retrieving length of both URLs
    sus_len = len(sus_url)
    legit_len = len(legit_url)
    
    # check for illegal special characters
    if contains_special_characters(sus_url):
        result["result"] = 1
        result["reasons_typosquat"].append("Illegal characters found in url")
        return result
    
    # check for exact TLD swap
    if sus_sld == legit_sld and sus_tld != legit_tld:
        if tld_more_common(sus_tld, legit_tld):
            result["reasons_typo"].append("TLD is more common")
            return result
        else:
            result["result"] = 1
            result["reasons_typosquat"].append("TLD is less common")
            return result
    
    # check edit distance
    ld = ls.levenshtein(sus_url, legit_url)
    
    # if only 1 edit
    if ld == 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Edit distance is only 1")
        return result
    
    # if exactly 2 edits
    elif ld == 2:
        res = edit_check(sus_url, legit_url)
        
        result["reasons_typosquat"].extend(res["reasons_typosquat"])
        result["reasons_typo"].extend(res["reasons_typo"])
        
        if res["result"] == 1:
            result["result"] = 1
            return result
    
    # if 3 or more edits
    else:
        result["result"] = "Inconclusive"
        return result
    
    return result

Test Cases

In [39]:
legit_url = whitelist_domains[0]
test_urls = ["wikipedi@.org", "wikipedia.com", "wkpedia.org", "w1k1ped1a.org", "w1k1pedia.org", "wkipedis.org", "wiikipedis.org", "wikipedg.org", "wikipedv.org", "wikipedia.orggg", "aikipedia.orf"]
test_results = []

for url in test_urls:
    test_results.append(is_typo(url, legit_url))
    
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(test_results)
df

| | suspicious url | original url | result | reasons_typosquat | reasons_typo |
| --- | --- | --- | --- | --- | --- |
| 0 | wikipedi@.org | wikipedia.org | 1 | [Illegal characters found in url] | [] |
| 1 | wikipedia.com | wikipedia.org | 0 | [] | [TLD is more common] |
| 2 | wkpedia.org | wikipedia.org | 0 | [] | [Only 2 characters are missing] |
| 3 | w1k1ped1a.org | wikipedia.org | Inconclusive | [] | [] |
| 4 | w1k1pedia.org | wikipedia.org | 1 | ['1' key and 'i' key are too far apart, '1' key and 'i' key are too far apart] | [] |
| 5 | wkipedis.org | wikipedia.org | 0 | [] | [Wrong character is within ED boundary] |
| 6 | wiikipedis.org | wikipedia.org | 0 | [] | [Extra character is within ED boundary, Wrong character is within ED boundary] |
| 7 | wikipedg.org | wikipedia.org | 1 | [Extra character too far from characters next to it] | [] |
| 8 | wikipedv.org | wikipedia.org | 0 | [] | [Extra character is within ED boundary] |
| 9 | wikipedia.orggg | wikipedia.org | 0 | [] | [Extra character is within ED boundary] |
