Requirements 
<ul>
    <li>Your function should account for keyboard distance for all characters (even special characters)</li>
    <li>Your function should account for Levenshtein distance as well. The more typos there are, the more the score should lean towards 1 (since a user is unlikely to make that many typos)</li>
    <li>Think about whether a horizontal typo is more likely than a vertical typo; you may want to assign a weight to differentiate the two</li>
    <li>Think about the case when the strings have different lengths and how you should handle it</li>
    <li>Think about whether it is necessary to distinguish between characters that are already very far away (e.g. wikip9dia.org vs wikip0dia.org). Both are most likely typosquats, so is there a need for different scores? How many keys away must a character be before I treat it as a typosquat rather than a typo?</li>
    <li>Try to think of any other conditions / requirements that I may have missed out, and feel free to suggest any</li>
</ul>
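
A minimal sketch of how the weighting and scoring requirements above could fit together. Everything here is an assumption for illustration: the US QWERTY rows, the `w_horizontal` / `w_vertical` weights, the `cap` of 3 keys, and the function names are hypothetical and not part of the notebook's code.

```python
from math import sqrt

# Assumed US QWERTY rows; a key's column is its index within the row string.
ROWS = ["1234567890-", "qwertyuiop", "asdfghjkl;'", "zxcvbnm,./"]
KEY_POS = {ch: (x, y) for y, row in enumerate(ROWS) for x, ch in enumerate(row)}

def weighted_key_distance(a, b, w_horizontal=1.0, w_vertical=1.5):
    """Euclidean key distance with a separate (hypothetical) weight per axis."""
    (ax, ay), (bx, by) = KEY_POS[a], KEY_POS[b]
    return sqrt((w_horizontal * (ax - bx)) ** 2 + (w_vertical * (ay - by)) ** 2)

def typosquat_score(distances, cap=3.0):
    """Fold the key distance of every wrong character into one score in [0, 1]."""
    score = 0.0
    for d in distances:
        per_error = min(d, cap) / cap       # keys further apart than `cap` all count as 1
        score += (1.0 - score) * per_error  # an extra error never lowers the score
    return score

print(typosquat_score([weighted_key_distance("k", "j")]))  # one near miss
print(typosquat_score([weighted_key_distance("e", "9")]))  # wikip9dia.org
print(typosquat_score([weighted_key_distance("e", "0")]))  # wikip0dia.org
```

Capping the distance is one way to answer the wikip9dia.org vs wikip0dia.org question: once a key is further away than the cap, both cases get the same score.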

what about swapped letters, one-too-many letters

numbers above qwertyuiop are possible typos, but some may be intended typosquats (i.e. o -> 0; E -> 3 ?)
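
One possible way to separate a digit that was fat-fingered from a digit chosen because it looks like the letter. The `LOOKALIKES` table and the function name are hypothetical, and the table is only a small illustrative sample.

```python
# Hypothetical sample of digits that visually imitate letters.
LOOKALIKES = {"0": {"o"}, "1": {"l", "i"}, "3": {"e"}, "5": {"s"}}

def is_lookalike_swap(legit_char, url_char):
    """True when url_char is a digit that visually imitates legit_char."""
    return legit_char.lower() in LOOKALIKES.get(url_char, set())

print(is_lookalike_swap("i", "1"))  # wikiped1a.org: '1' imitates 'i'
print(is_lookalike_swap("e", "9"))  # wikip9dia.org: '9' does not imitate 'e'
```

A lookalike swap that is also adjacent on the keyboard (o -> 0) stays ambiguous, so this check would probably work best as an extra reason rather than as a verdict on its own.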

what about when a user presses 2 keys by accident? e.g. wikoipedia -> presses "o" and "k" when trying to press "k"

are special chars/homoglyphs legal in the url box?

what if they miss a letter?
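
Swapped, extra, and missing letters could all be counted as a single edit with the optimal string alignment variant of Damerau-Levenshtein distance. Plain Levenshtein distance counts a swap of two adjacent letters as two edits. The sketch below is a plain-Python illustration with a hypothetical function name, not a call into any of the libraries linked here.

```python
def osa_distance(a, b):
    """Optimal string alignment distance.

    Allowed edits: insert, delete, substitute, or swap two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

print(osa_distance("wikipedia", "wikipedai"))   # swapped letters
print(osa_distance("wikipedia", "wikoipedia"))  # one letter too many
print(osa_distance("wikipedia", "wikpedia"))    # missed a letter
```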


[python-Levenshtein PyPI](https://pypi.org/project/python-Levenshtein/)

[euclidean distance using numpy (stack overflow)](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) (may help make calculating ED more efficient?)

[top 10 most common TLDs](https://www.statista.com/statistics/265677/number-of-internet-top-level-domains-worldwide/)

[domain name regex](https://medium.com/@vaghasiyaharryk/how-to-validate-a-domain-name-using-regular-expression-9ab484a1b430)

In [4]:
from math import sqrt
import dnstwist
from tldextract import extract
import pylev as ls
# import Levenshtein as ls
import numpy as np
import difflib as dl
import re

## Euclidean Distance

In [5]:
# x is the column and y is the row of each key (number row is y = -1)
keyboard_cartesian = {
    "1": {"y": -1, "x": 0},
    "2": {"y": -1, "x": 1},
    "3": {"y": -1, "x": 2},
    "4": {"y": -1, "x": 3},
    "5": {"y": -1, "x": 4},
    "6": {"y": -1, "x": 5},
    "7": {"y": -1, "x": 6},
    "8": {"y": -1, "x": 7},
    "9": {"y": -1, "x": 8},
    "0": {"y": -1, "x": 9},
    "-": {"y": -1, "x": 10},
    "q": {"y": 0, "x": 0},
    "w": {"y": 0, "x": 1},
    "e": {"y": 0, "x": 2},
    "r": {"y": 0, "x": 3},
    "t": {"y": 0, "x": 4},
    "y": {"y": 0, "x": 5},
    "u": {"y": 0, "x": 6},
    "i": {"y": 0, "x": 7},
    "o": {"y": 0, "x": 8},
    "p": {"y": 0, "x": 9},
    "a": {"y": 1, "x": 0},
    "s": {"y": 1, "x": 1},
    "d": {"y": 1, "x": 2},
    "f": {"y": 1, "x": 3},
    "g": {"y": 1, "x": 4},
    "h": {"y": 1, "x": 5},
    "j": {"y": 1, "x": 6},
    "k": {"y": 1, "x": 7},
    "l": {"y": 1, "x": 8},
    # ; and ' sit on the home row, to the right of l
    ";": {"y": 1, "x": 9},
    "'": {"y": 1, "x": 10},
    "z": {"y": 2, "x": 0},
    "x": {"y": 2, "x": 1},
    "c": {"y": 2, "x": 2},
    "v": {"y": 2, "x": 3},
    "b": {"y": 2, "x": 4},
    "n": {"y": 2, "x": 5},
    "m": {"y": 2, "x": 6},
    ",": {"y": 2, "x": 7},
    ".": {"y": 2, "x": 8},
    "/": {"y": 2, "x": 9}
}

def euclidean_distance(a,b):
    X = (keyboard_cartesian[a]['x']-keyboard_cartesian[b]['x'])**2
    Y = (keyboard_cartesian[a]['y']-keyboard_cartesian[b]['y'])**2
    return sqrt(X+Y)

print(euclidean_distance('q', 'r'))
print(euclidean_distance('q', 'c'))

3.0
2.8284271247461903


In [10]:
print(whitelist_domains[:10])
print(whitelist_slds[:10])

['wikipedia.org']
['wikipedia']


### First Try
simply checking against Euclidean distance.

In [20]:
legit = "wikipedia.org"
typo = "wiiipedia.org" 
typosqt = "wikiped1a.org"

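def typo_check(url):
    # compare position by position; zip stops at the shorter string
    for legit_char, url_char in zip(legit, url):
        if legit_char != url_char:
            # only the first mismatch is classified
            if euclidean_distance(legit_char, url_char) >= 1.5:
                # typosquat
                return 1
            # typo
            return 0
    # no mismatching character found
    return 0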

print(typo_check(typosqt))

1


### Notes

T : URL being tested <br>
L : Legit URL <br>
LD : levenshtein distance <br>
ED : euclidean distance <br>

Assumptions: <br>
<ul>
    <li>Given the URL being tested (T), its legit URL (L) is known</li>
    <li>T has an edit distance of at least 1 from L</li>
    <li>There is no reason to press shift; urls are not case sensitive, and the only legal special character is hyphen, which does not require shift</li>
    <li>Hyphens are only allowed in between characters in domain names (i.e. domain names cannot start/end with hyphens, and the top level domain cannot contain hyphen)</li>
    <li>T is assumed to be a typosquat when it meets one of the following:
        <ol>
            <li>Contains illegal special characters</li>
            <li>Has a Levenshtein distance of more than 1</li>
            <li>Is shorter than len(L) - 1</li>
            <li>Is longer than len(L) + 1</li>
        </ol>
    </li>
</ul>
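
The character and hyphen assumptions above could be checked with a single pattern. This is a sketch under those assumptions only (letters, digits, and hyphens; no label starting or ending with a hyphen; no hyphen in the TLD). `LABEL`, `DOMAIN_RE`, and `has_legal_characters` are hypothetical names.

```python
import re

# One label: letters, digits, hyphens; must not start or end with a hyphen.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
# One or more labels, each followed by a dot, then a TLD without hyphens.
DOMAIN_RE = re.compile(rf"(?:{LABEL}\.)+[a-z0-9]+", re.IGNORECASE)

def has_legal_characters(domain):
    """True when the whole domain matches the assumptions listed above."""
    return DOMAIN_RE.fullmatch(domain) is not None

for candidate in ["wikipedia.org", "wiki-pedia.org", "-wikipedia.org", "wikipedi@.org"]:
    print(candidate, has_legal_characters(candidate))
```

`fullmatch` is used so that the pattern needs no `^` and `$` anchors and cannot match only a prefix of the input.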

Then, checks:
- if there are special characters in T, typosquat (in domain names: hyphens are allowed, underscores are not)
- if length of T = length of L + 1, check how far away the extra letter is from the previous letter. was it fat-fingered?
- if length of T = length of L, check LD. if > 2, definitely typosquat
- if length of T = length of L AND LD <= 2, check ED. if any error has ED > 1.5, typosquat

definitely typosquat:
- special char present anywhere
- len(T) > len(L) + 1
- LD > 2
- ED > 1.5


### Functions & Codes For Testing
functions to extract TLD & SLD

codes to generate various suspicious URLs to test based off of legit URLs

In [6]:
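# attribute access is used instead of tuple unpacking, so the helpers
# do not depend on how many fields the tldextract result carries
def extract_domain_and_tld(url):
    parts = extract(url)
    return parts.domain + '.' + parts.suffix

def extract_tld(url):
    return extract(url).suffix

def extract_sld(url):
    return extract(url).domain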

# does not include: , . / ; '
special_characters = ['~', ':', '+', '[', '\\', '@', '^', '{', '%', '(', '"', '*', '|', '&', '<', '`', '}', '_', '=', ']', '!', '>', '?', '#', '$', ')']

# used a tuple because there's no need for this container to be mutable
tlds = ("com", "ru", "org", "net", "in", "ir", "au", "uk", "de", "br")

# whitelist = ['https://www.bankofsingapore.com/']
whitelist = ['https://www.wikipedia.org/']


whitelist_domains = []
whitelist_slds = []
for url in whitelist:
    whitelist_domains.append(extract_domain_and_tld(url))
    whitelist_slds.append(extract_sld(url))
typosquat = []
for url in whitelist_domains:
    fuzz = dnstwist.DomainFuzz(url)
    fuzz.generate()
    typosquat.extend([x['domain-name'] for x in fuzz.domains])
    
#Lookups in sets are much more efficient
typosquat = set(typosquat)

#Delete the original whitelisted domains from the blacklist set
typosquat.difference_update(whitelist_domains)

print('Number of typosquatted urls generated: ', len(typosquat))

Number of typosquatted urls generated:  7900


## is_typo(url) Function
Code Flow:
- checks if url has any special characters
- checks LD of url against legit URL
- checks the TLD of url against legit URL
- checks length of url against length of legit URL
- checks the ED of any wrong char in url against corresponding char in legit URL

todo:
- also check tld
- add extra chars to any of the letters

In [18]:
def is_typo(url):
    l = whitelist_domains[0]
    
    result = {"suspicious url": url, "original url": l, "result": 0, "reasons_typosquat": [], "reasons_typo": []}
    
    url_sld = extract_sld(url)
    l_sld = extract_sld(l)
    
    url_tld = extract_tld(url)
    l_tld = extract_tld(l)
    
    url_len = len(url)
    l_len = len(l)
    
    # checks if illegal special characters are present
    if not re.match(r"^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?[.][a-zA-Z0-9]+$", url):
        result["result"] = 1
        result["reasons_typosquat"].append("Illegal characters found in url")
    
    # checks LD
    if ls.levenshtein(url, l) > 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Edit distance more than 1")
    elif ls.levenshtein(url, l) == 1:
        result["reasons_typo"].append("Edit distance is only one")
    
    # checks TLD
    if url_tld != l_tld:
        url_index = 10
        l_index = 10
        if url_tld in tlds:
            url_index = tlds.index(url_tld)
        if l_tld in tlds:
            l_index = tlds.index(l_tld)
        if l_index > url_index:
            result["reasons_typo"].append("TLD is more common")
            
     
    # compares lengths
    if url_len > l_len + 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Too long")
    
    elif url_len < l_len - 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Too short")
    
    elif url_len == l_len + 1:
        for i in range(len(l)):
            if url[i] != l[i]:
                url_left = url[i - 1] if i != 0 else None # char on the left of the wrong/extra char
                url_middle = url[i] # wrong/extra char
                url_right = url[i + 1] if i + 1 < len(url) else None # char on the right of wrong/extra char          
                
                if url_left == None:
                    if euclidean_distance(url_right, url_middle) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")
                        
                elif url_right == None:
                    if euclidean_distance(url_left, url_middle) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")
                else:
                    if euclidean_distance(url_left, url_middle) > 1.5 and euclidean_distance(url_right, url_middle) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Extra character too far from characters next to it")
                    else:
                        result["reasons_typo"].append("Extra character is within ED boundary")
                
                # prevent the function from running on the rest of the string
                break
                
    elif url_len == l_len:
        for i in range(len(l)):
            if url[i] != l[i]:
                if euclidean_distance(url[i], l[i]) > 1.5:
                    result["result"] = 1
                    result["reasons_typosquat"].append("'{}' key and '{}' key are too far apart".format(url[i], l[i]))
                else:
                    result["reasons_typo"].append("Wrong character is within ED boundary")
    
    return result

print(is_typo("wikipedia.org"))
print(is_typo("wikipedip.org")) # equal parts typo and typosquat, according to my code logic
print(is_typo("wikipedia.com")) 

# for url in list(typosquat)[1010:1020]:
#     # if typo
#     if is_typo(url)["result"] == 1:
#         print(is_typo(url))
#         print()

{'suspicious url': 'wikipedia.org', 'original url': 'wikipedia.org', 'result': 0, 'reasons_typosquat': [], 'reasons_typo': []}
{'suspicious url': 'wikipedip.org', 'original url': 'wikipedia.org', 'result': 1, 'reasons_typosquat': ["'p' key and 'a' key are too far apart"], 'reasons_typo': ['Edit distance is only one']}
{'suspicious url': 'wikipedia.com', 'original url': 'wikipedia.org', 'result': 1, 'reasons_typosquat': ['Edit distance more than 1', "'c' key and 'o' key are too far apart", "'o' key and 'r' key are too far apart", "'m' key and 'g' key are too far apart"], 'reasons_typo': ['TLD is more common']}


### Prototype 2

In [17]:
def is_typo(url):
    l = whitelist_domains[0]
    
    result = {"suspicious url": url, "original url": l, "result": 0, "reasons_typosquat": [], "reasons_typo": []}
    
    url_sld = extract_sld(url)
    l_sld = extract_sld(l)
    
    url_tld = extract_tld(url)
    l_tld = extract_tld(l)
    
    url_len = len(url)
    l_len = len(l)
    
    # checks length
    if l_len + 1 < url_len:
        result["result"] = 1
        result["reasons_typosquat"].append("Too long")
        return result
    elif url_len < l_len - 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Too short")
        return result
    
    # checks if illegal special characters are present
    if not re.match(r"^((?!-)[A-Za-z0-9-]{1,}(?<!-)\.)+[A-Za-z0-9]{1,}$", url):
        result["result"] = 1
        result["reasons_typosquat"].append("Illegal characters found in url")
        return result
    
    # checks TLD; WIP
    if url_tld != l_tld:
        url_index = 10
        l_index = 10
        if url_tld in tlds:
            url_index = tlds.index(url_tld)
        if l_tld in tlds:
            l_index = tlds.index(l_tld)
        if l_index > url_index:
            result["reasons_typo"].append("TLD is more common")
            
            # reassign values so next checks will only be on sld, as tld mistakes have already been ruled out
            url = url_sld
            l = l_sld
            url_len = len(url)
            l_len = len(l)
    
    
    # levenshtein distance check
    if ls.levenshtein(url, l) > 1:
        result["result"] = 1
        result["reasons_typosquat"].append("Edit distance more than 1")
        return result
    else:
        # length checks
        if url_len == l_len:
            for i in range(len(l)):
                if url[i] != l[i]:
                    if euclidean_distance(url[i], l[i]) > 1.5:
                        result["result"] = 1
                        result["reasons_typosquat"].append("'{}' key and '{}' key are too far apart".format(url[i], l[i]))
                    else:
                        result["reasons_typo"].append("Wrong character is within ED boundary")
        
        elif url_len == l_len + 1:
            for i in range(len(l)):
                if url[i] != l[i]:
                    url_left = url[i - 1] if i != 0 else None # char on the left of the wrong/extra char
                    url_middle = url[i] # wrong/extra char
                    url_right = url[i + 1] if i + 1 < len(url) else None # char on the right of wrong/extra char          

                    if url_left == None:
                        if euclidean_distance(url_right, url_middle) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")

                    elif url_right == None:
                        if euclidean_distance(url_left, url_middle) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")
                    else:
                        if euclidean_distance(url_left, url_middle) > 1.5 and euclidean_distance(url_right, url_middle) > 1.5:
                            result["result"] = 1
                            result["reasons_typosquat"].append("Extra character too far from characters next to it")
                        else:
                            result["reasons_typo"].append("Extra character is within ED boundary")

                    # prevent the function from running on the rest of the string
                    break
        
        elif url_len == l_len - 1:
            result["reasons_typo"].append("Only missing 1 character")
            
        if len(result["reasons_typo"]) > 1:
            result["result"] = 1
            result["reasons_typosquat"].append("Too many typos")
    
    return result

# original url
print(is_typo("wikipedia.org"))

# too long
print(is_typo("wikipediaaaaa.org"))

# too short
print(is_typo("wikipe.org"))

# with special character
print(is_typo("wikipedi@.org"))

# exact TLD swap
print(is_typo("wikipedia.com"))

# 2 typos
print(is_typo("wikipedis.com"))

# LD = 2
print(is_typo("wikipedai.org"))

# lengths are equal
print(is_typo("wikipedih.org"))

# one character too long
print(is_typo("wikipediap.org"))

# count = 0
# for url in list(typosquat):
#     result = is_typo(url)
#     if result["result"] == 0:
#         count += 1
#         print(result)
    
# print(count)

{'suspicious url': 'wikipedia.org', 'original url': 'wikipedia.org', 'result': 0, 'reasons_typosquat': [], 'reasons_typo': []}
{'suspicious url': 'wikipediaaaaa.org', 'original url': 'wikipedia.org', 'result': 1, 'reasons_typosquat': ['Too long'], 'reasons_typo': []}
{'suspicious url': 'wikipe.org', 'original url': 'wikipedia.org', 'result': 1, 'reasons_typosquat': ['Too short'], 'reasons_typo': []}
{'suspicious url': 'wikipedi@.org', 'original url': 'wikipedia.org', 'result': 1, 'reasons_typosquat': ['Illegal characters found in url'], 'reasons_typo': []}
{'suspicious url': 'wikipedia.com', 'original url': 'wikipedia.org', 'result': 0, 'reasons_typosquat': [], 'reasons_typo': ['TLD is more common']}
{'suspicious url': 'wikipedis.com', 'original url': 'wikipedia.org', 'result': 1, 'reasons_typosquat': ['Too many typos'], 'reasons_typo': ['TLD is more common', 'Wrong character is within ED boundary']}
{'suspicious url': 'wikipedai.org', 'original url': 'wikipedia.org', 'result': 1, 'rea

## Testing

<u>Potential Results</u>: <br>
0: Typo <br>
1: Typosquat

<u>Test Cases</u>:
1. Legit URL (Expected Result: 0)
2. Substitute 1 char with a special character (Expected Result: 1)
3. Substitute 1 char with hyphen (Expected Result: 1)
4. Substitute 1 char with wrong char, within ED boundary (Expected Result: 0)
5. Substitute 2 chars with wrong chars, within ED boundary (Expected Result: 0)
6. Substitute 3 chars with wrong chars, within ED boundary (Expected Result: 1)
7. Append 1 char, within ED boundary of last char (Expected Result: 0)
8. Append 1 char, exceeding ED boundary (Expected Result: 1)
9. Append 2 chars, within ED boundary (Expected Result: 1)
10. Remove 1 char (Expected Result: 1)
    
<u>Conclusion/Notes</u>:
- need to recheck for special characters, as the checks that follow do not account for them (the run below ends in a KeyError for '!')
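
One way to keep the distance lookup from raising on characters that are missing from the keyboard map is sketched below. It assumes a US QWERTY layout where each shifted symbol is folded onto its physical key. `SHIFTED` and `safe_key_distance` are hypothetical names, and returning `None` for unknown characters is a design choice, not the notebook's behaviour.

```python
from math import sqrt

# Assumed US QWERTY rows; a key's column is its index within the row string.
ROWS = ["1234567890-", "qwertyuiop", "asdfghjkl;'", "zxcvbnm,./"]
KEY_POS = {ch: (x, y) for y, row in enumerate(ROWS) for x, ch in enumerate(row)}

# Shifted symbols share a physical key with an unshifted character (US layout assumed).
SHIFTED = dict(zip('!@#$%^&*()_:"<>?', "1234567890-;',./"))

def safe_key_distance(a, b):
    """Return the key distance, or None when either character has no known key."""
    a, b = (SHIFTED.get(ch, ch.lower()) for ch in (a, b))
    if a not in KEY_POS or b not in KEY_POS:
        return None
    (ax, ay), (bx, by) = KEY_POS[a], KEY_POS[b]
    return sqrt((ax - bx) ** 2 + (ay - by) ** 2)

print(safe_key_distance("!", "i"))  # bankofs!ngapore.com: '!' is treated as the '1' key
print(safe_key_distance("é", "e"))  # no key known for 'é'
```

A caller can then treat `None` as "not a keyboard typo" and mark the URL as a typosquat without any further distance checks.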

In [156]:
test_cases = ["bankofsingapore.com", 
              "bankofs!ngapore.com", 
              "bankofsing-pore.com", 
              "bankofsingaporr.com", 
              "bankofsingapoer.com", 
              "bankofsingapier.com", 
              "bankofsingaporee.com", 
              "bankofsingaporep.com", 
              "bankofsingaporeee.com", 
              "bankofsingapor.com"]

for i in range(len(test_cases)):
    print("URL", i, ":", is_typo(test_cases[i]))

URL 0 : {'suspicious url': 'bankofsingapore.com', 'original url': 'bankofsingapore.com', 'result': 0, 'reasons': []}


KeyError: '!'