Requirements 
<ul>
    <li>Your function should account for keyboard distance for all characters (even special characters)</li>
    <li>Your function should account for Levenshtein distance as well. The more typos there are, the more the score should lean towards 1 (since a user is unlikely to make that many typos)</li>
    <li>Think about whether a horizontal typo is more likely than a vertical typo; you may want to assign weights to differentiate them</li>
    <li>Think about the case where the strings have different lengths and how you should handle it</li>
    <li>Think about whether it is necessary to distinguish between characters that are already very far away (e.g. wikip9dia.org vs wikip0dia.org): both are most likely typosquats, so is there a need for different scores? How many keys away should a character be before I stop treating it as a typo?</li>
    <li>Try to think of any other conditions / requirements that I may have missed, and feel free to suggest any</li>
</ul>
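
A minimal sketch of how the weighting and far-away requirements could fit together, assuming an unstaggered QWERTY grid. The weights (`w_horizontal`, `w_vertical`) and the `cap` value are placeholder assumptions, not measured values, and `weighted_key_distance` is a hypothetical helper name.

```python
from math import sqrt

# Assumed layout: plain QWERTY rows on an integer grid, no row stagger.
ROWS = {-1: "1234567890-", 0: "qwertyuiop", 1: "asdfghjkl", 2: "zxcvbnm"}
KEY_POS = {ch: (x, y) for y, row in ROWS.items() for x, ch in enumerate(row)}

def weighted_key_distance(a, b, w_horizontal=1.0, w_vertical=1.5, cap=3.0):
    """Weighted Euclidean distance between two keys, capped at `cap`.

    Characters that are not on the grid are treated as maximally far away.
    """
    if a not in KEY_POS or b not in KEY_POS:
        return cap
    (ax, ay), (bx, by) = KEY_POS[a], KEY_POS[b]
    dx = w_horizontal * (ax - bx)
    dy = w_vertical * (ay - by)
    # Beyond the cap, every distance is reported as the same value.
    return min(sqrt(dx * dx + dy * dy), cap)

print(weighted_key_distance("k", "l"))  # horizontal neighbour
print(weighted_key_distance("k", "i"))  # vertical neighbour
print(weighted_key_distance("e", "9"), weighted_key_distance("e", "0"))
```

With the cap in place, `e -> 9` and `e -> 0` get the same value, so wikip9dia.org and wikip0dia.org would not need different scores.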

what about swapped letters, one-too-many letters

numbers above qwertyuiop are possible typos, but some may be intended typosquats (e.g. o -> 0; e -> 3 ?)
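
A rough sketch of how lookalike digit swaps could be flagged separately from ordinary slips. The `LOOKALIKES` table and the function name are assumptions made for illustration; which pairs count as lookalikes is a judgement call.

```python
# Hypothetical lookalike table (letter -> digit that resembles it).
LOOKALIKES = {"o": "0", "e": "3", "i": "1", "l": "1", "s": "5"}

def lookalike_substitutions(legit, candidate):
    """Return (position, legit_char, candidate_char) for each lookalike swap.

    Only same-length strings are compared, position by position.
    """
    if len(legit) != len(candidate):
        return []
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(legit, candidate))
        if a != b and LOOKALIKES.get(a) == b
    ]

print(lookalike_substitutions("wikipedia", "wikiped1a"))  # [(7, 'i', '1')]
print(lookalike_substitutions("wikipedia", "wikopedia"))  # []
```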

what about when a user presses 2 keys by accident? e.g. wikoipedia -> presses "o" and "k" when trying to press "k"

are special chars/homoglyphs legal in the url box?

what if they miss a letter?
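
The questions above (swapped letters, one extra letter, a missed letter) all describe a single edit. A minimal sketch of how one such edit could be named, assuming at most one edit separates the two strings; `classify_single_edit` is a hypothetical helper, not part of any library.

```python
def classify_single_edit(legit, candidate):
    """Name the single edit turning `legit` into `candidate`, else None."""
    if legit == candidate:
        return "identical"
    n, m = len(legit), len(candidate)
    # Length of the common prefix.
    p = 0
    while p < min(n, m) and legit[p] == candidate[p]:
        p += 1
    if n == m:
        if legit[p + 1:] == candidate[p + 1:]:
            return "substitution"
        if (p + 1 < n
                and legit[p] == candidate[p + 1]
                and legit[p + 1] == candidate[p]
                and legit[p + 2:] == candidate[p + 2:]):
            return "transposition"
    elif m == n + 1 and legit[p:] == candidate[p + 1:]:
        return "insertion"
    elif n == m + 1 and legit[p + 1:] == candidate[p:]:
        return "deletion"
    return None

for candidate in ["wiiipedia", "wikpiedia", "wikoipedia", "wikpedia", "wxkxpedia"]:
    print(candidate, classify_single_edit("wikipedia", candidate))
```

Anything that returns `None` needs more than one edit and would fall through to the Levenshtein check.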


[python-Levenshtein PyPI](https://pypi.org/project/python-Levenshtein/)

[Euclidean distance using NumPy (Stack Overflow)](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) (may help make calculating ED more efficient?)

In [45]:
from math import sqrt
import dnstwist
from tldextract import extract
import Levenshtein as ls
import numpy as np

## Euclidean Distance

In [75]:
keyboard_cartesian = {
                        "1": {"y": -1, "x": 0},
                        "2": {"y": -1, "x": 1},
                        "3": {"y": -1, "x": 2},
                        "4": {"y": -1, "x": 3},
                        "5": {"y": -1, "x": 4},
                        "6": {"y": -1, "x": 5},
                        "7": {"y": -1, "x": 6},
                        "8": {"y": -1, "x": 7},
                        "9": {"y": -1, "x": 8},
                        "0": {"y": -1, "x": 9},
                        "-": {"y": -1, "x": 10},
                        "q": {"y": 0, "x": 0},
                        "w": {"y": 0, "x": 1},
                        "e": {"y": 0, "x": 2},
                        "r": {"y": 0, "x": 3},
                        "t": {"y": 0, "x": 4},
                        "y": {"y": 0, "x": 5},
                        "u": {"y": 0, "x": 6},
                        "i": {"y": 0, "x": 7},
                        "o": {"y": 0, "x": 8},
                        "p": {"y": 0, "x": 9},
                        "a": {"y": 1, "x": 0},
                        "s": {"y": 1, "x": 1},
                        "d": {"y": 1, "x": 2},
                        "f": {"y": 1, "x": 3},
                        "g": {"y": 1, "x": 4},
                        "h": {"y": 1, "x": 5},
                        "j": {"y": 1, "x": 6},
                        "k": {"y": 1, "x": 7},
                        "l": {"y": 1, "x": 8},
                        "z": {"y": 2, "x": 0},
                        "x": {"y": 2, "x": 1},
                        "c": {"y": 2, "x": 2},
                        "v": {"y": 2, "x": 3},
                        "b": {"y": 2, "x": 4},
                        "n": {"y": 2, "x": 5},
                        "m": {"y": 2, "x": 6}                    
                     }

def euclidean_distance(a, b):
    """Straight-line distance between keys a and b on the keyboard grid."""
    dx = keyboard_cartesian[a]['x'] - keyboard_cartesian[b]['x']
    dy = keyboard_cartesian[a]['y'] - keyboard_cartesian[b]['y']
    return sqrt(dx**2 + dy**2)

print(euclidean_distance('q', 'r'))
print(euclidean_distance('q', 'c'))

3.0
2.8284271247461903


### Functions & Code for Testing
Functions to extract the TLD and SLD.

Code to generate suspicious URLs from legit URLs for testing.

In [20]:
def extract_domain_and_tld(url):
    # attribute access does not depend on how many fields the result has
    parts = extract(url)
    return parts.domain + '.' + parts.suffix

def extract_tld(url):
    return extract(url).suffix

def extract_sld(url):
    return extract(url).domain

# whitelist = ['https://www.bankofsingapore.com/','http://www.ocbc.com','http://www.dbs.com','http://www.uobgroup.com','http://www.bnpparibas.com.sg','http://www.icbc.com.cn/new-branch/xjp/index.htm','https://www.citibank.com.sg/','https://www.maybank2u.com.sg/','http://www.sbising.com','https://www.sc.com/sg/','http://www.icicibank.com','https://www.hsbc.com.sg/','http://www.ccb.com/','https://www.bankofchina.com/sg/','https://www.boi.com.sg/','http://www.jpmorgan.com','http://iob.com','http://www.indian-bank.com','https://www.hlbank.com.sg','http://www.ca-cib.com','http://www.cimb.com/','http://www.bankofamerica.com/','http://www.bbl.co.th/','http://www.bgcpartners.com/','http://www.ebsgroup.si/','http://www.hlf.com.sg/','http://www.sif.com.sg/','http://www.singapurafinance.com.sg/','https://www.mas.gov.sg/']
# print(len(whitelist))

# print('Number of whitelisted domains input: ', len(whitelist))

# whitelist_domains = []
# whitelist_slds = []
# for url in whitelist:
#     whitelist_domains.append(extract_domain_and_tld(url))
#     whitelist_slds.append(extract_sld(url))
# typosquat = []
# for url in whitelist_domains:
#     fuzz = dnstwist.DomainFuzz(url)
#     fuzz.generate()
#     typosquat.extend([x['domain-name'] for x in fuzz.domains])
    
# print(typosquat[200:260])
# #Lookups in sets are much more efficient
# typosquat = set(typosquat)

# #Delete the original whitelisted domains from the blacklist set
# typosquat.difference_update(whitelist_domains)

# print('Number of typosquatted urls generated: ', len(typosquat))


In [17]:
print(whitelist_domains[:10])
print(whitelist_slds[:10])


['bankofsingapore.com', 'ocbc.com', 'dbs.com', 'uobgroup.com', 'bnpparibas.com.sg', 'icbc.com.cn', 'citibank.com.sg', 'maybank2u.com.sg', 'sbising.com', 'sc.com']
['bankofsingapore', 'ocbc', 'dbs', 'uobgroup', 'bnpparibas', 'icbc', 'citibank', 'maybank2u', 'sbising', 'sc']


### First Try
Simply check the Euclidean distance of each mismatched character.

In [20]:
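legit = "wikipedia.org"
typo = "wiiipedia.org"
typosqt = "wikiped1a.org"

def typo_check(url):
    # compare position by position; assumes url has the same length as legit
    for legit_char, url_char in zip(legit, url):
        if legit_char != url_char:
            if euclidean_distance(legit_char, url_char) >= 1.5:
                # typosquat
                return 1
    # typo: every mismatch (if any) is within the distance boundary
    return 0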

print(typo_check(typosqt))

1


### Notes

T : URL being tested <br>
L : Legit URL <br>
LD : Levenshtein distance <br>
ED : Euclidean distance <br>

Assuming: <br>
-> given T, we know what L is. <br>
-> no reason to press shift; domain names are not case sensitive, and the only legal special character is the hyphen, which does not require shift <br>
-> if len(T) = len(L) + 1, assume that the extra letter is at the end of the domain name


Then, checks:
- if there are special characters in T, typosquat (in domain names: hyphens are allowed, underscores are not)
- if length of T = length of L + 1, check how far away the extra letter is from the previous letter. was it fat-fingered?
- if length of T = length of L, check LD. if > 2, definitely typosquat
- if length of T = length of L AND LD <= 2, check ED. if any error has ED > 1.5, typosquat

definitely typosquat:
- special char present anywhere
- len(T) > len(L) + 1
- LD > 2
- ED > 1.5
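
On an integer keyboard grid, the possible distances between two distinct keys are 1, about 1.41, 2, and so on, so an ED boundary of 1.5 accepts only the keys directly beside, above, below, or diagonal to the intended key. A small sketch that lists those keys, assuming the same unstaggered layout; `neighbours` is a hypothetical helper.

```python
# Assumed layout: plain QWERTY rows on an integer grid, no row stagger.
ROWS = {-1: "1234567890-", 0: "qwertyuiop", 1: "asdfghjkl", 2: "zxcvbnm"}
COORDS = {ch: (x, y) for y, row in ROWS.items() for x, ch in enumerate(row)}

def neighbours(key, threshold=1.5):
    """Keys whose grid distance from `key` is below `threshold`."""
    kx, ky = COORDS[key]
    return sorted(
        other
        for other, (x, y) in COORDS.items()
        if other != key and (x - kx) ** 2 + (y - ky) ** 2 < threshold ** 2
    )

print(neighbours("k"))  # ['i', 'j', 'l', 'm', 'o', 'u']
print(neighbours("q"))  # ['1', '2', 'a', 's', 'w']
```

Precomputing these sets once would turn each ED check into a set lookup.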


Refined code to aid in testing

In [34]:
special_characters = ['~', ':', "'", '+', '[', '\\', '@', '^', '{', '%', '(', '"', '*', '|', ',', '&', '<', '`', '}', '_', '=', ']', '!', '>', ';', '?', '#', '$', ')', '/']

whitelist = ['https://www.bankofsingapore.com/']

whitelist_domains = []
whitelist_slds = []
for url in whitelist:
    whitelist_domains.append(extract_domain_and_tld(url))
    whitelist_slds.append(extract_sld(url))
typosquat = []
for url in whitelist_domains:
    fuzz = dnstwist.DomainFuzz(url)
    fuzz.generate()
    typosquat.extend([x['domain-name'] for x in fuzz.domains])
    
# Lookups in sets are much more efficient
typosquat = set(typosquat)

# Remove the original whitelisted domains from the typosquat set
typosquat.difference_update(whitelist_domains)

print('Number of typosquatted urls generated: ', len(typosquat))

Number of typosquatted urls generated:  13515


## is_typo(url) Function
Code flow:
- checks if the URL has any special characters
- checks the LD of the URL against the legit URL
- checks the length of the URL against the length of the legit URL
- checks the ED of any wrong char in the URL against the corresponding char in the legit URL

TODO:
- also check the TLD
- handle extra chars inserted next to any letter, not only at the end
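
A possible sketch for the extra-character item above: accept one inserted character at any position, provided it sits next to a key typed immediately before or after it (the fat-finger case). The layout, the 1.5 boundary, and the helper names are assumptions carried over from the notes, not a finished rule.

```python
from math import dist

# Assumed layout: plain QWERTY rows on an integer grid, no row stagger.
ROWS = {-1: "1234567890-", 0: "qwertyuiop", 1: "asdfghjkl", 2: "zxcvbnm"}
COORDS = {ch: (x, y) for y, row in ROWS.items() for x, ch in enumerate(row)}

def is_close(a, b, threshold=1.5):
    """True if both keys are on the grid and closer than `threshold`."""
    if a not in COORDS or b not in COORDS:
        return False
    return dist(COORDS[a], COORDS[b]) < threshold

def fat_finger_insertion(legit, candidate, threshold=1.5):
    """True if `candidate` is `legit` plus one extra char that is close
    to the character typed just before or just after it."""
    if len(candidate) != len(legit) + 1:
        return False
    for i in range(len(candidate)):
        # Skip positions whose removal does not give back the legit string.
        if candidate[:i] + candidate[i + 1:] != legit:
            continue
        extra = candidate[i]
        around = candidate[max(i - 1, 0):i] + candidate[i + 1:i + 2]
        if any(is_close(extra, ch, threshold) for ch in around):
            return True
    return False

print(fat_finger_insertion("wikipedia", "wikoipedia"))  # True
print(fat_finger_insertion("wikipedia", "wikqipedia"))  # False
```

A doubled letter (distance 0 to its neighbour) also counts as a fat-finger insertion under this rule.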

In [114]:
def is_typo(url):
    l = whitelist_domains[0]
    
    url_sld = extract_sld(url)
    l_sld = extract_sld(l)
    
    url_len = len(url_sld)
    l_len = len(l_sld)
    
    # checks if illegal special characters are present
    if any(char in url_sld for char in special_characters) or url_sld.startswith("-") or url_sld.endswith("-"):
        return 1
    
    # checks LD
    if ls.distance(url_sld, l_sld) > 2:
        return 1
     
    # compares lengths
    if url_len > l_len + 1:
        return 1
    elif url_len < l_len:
        # TODO: add more checks here
        return 1
    else:
        # assuming no errors in tld
        for i in range(l_len):
            if l_sld[i] != url_sld[i]:
                result = euclidean_distance(l_sld[i], url_sld[i])
                if result >= 1.5:
                    return 1
        if url_len == l_len + 1:
            url_last = url_sld[-1]
            l_last = l_sld[-1]
            result = euclidean_distance(l_last, url_last)
            if result >= 1.5:
                return 1    
    return 0

for url in typosquat:
    # if typo
    if is_typo(url) == 0:
        print(url)

fankofsingapore.com
bankofsimgap0re.com
bankofsingapoer.com
bwnkofsingapore.com
bankofsignapore.com
bankofslmgapore.com
bankkfsingapore.com
banklfsingapore.com
banlofsingapore.com
bankofskngapore.com
bamk0fsingapore.com
bamkofsingapore.com
nankofsingapore.com
bajkofsingapore.com
bankobsingapore.com
bznkofsingapore.com
bankpfsingapore.com
banoofsingapore.com
bankofaingapore.com
bank9fsingapore.com
bankofsingapofe.com
bankofsingapord.com
bankofsingaporde.com
bank0fsingapore.com
bankofsingapored.com
bankofsingapkre.com
bankofsingaporre.com
bankofsibgapore.com
bankofsingaplre.com
bankofsongapore.com
bankofsingapo5e.com
bankofqingapore.com
bankofslngapore.com
bankofzingapore.com
bankofcingapore.com
bank0fsimgapore.com
bankofsingapores.com
hankofsingapore.com
bankofsingalore.com
bankofxingapore.com
bankocsingapore.com
bankifsingapore.com
bankofsimgapore.com
bankofsingapo5re.com
bankofsingapoee.com
bankofsinbapore.com
vankofsingapore.com
bankofsingwpore.com
bankofsingspore.com
bank0fsingap0re

## Testing

<u>Potential Results</u>: <br>
0: Typo (or exact match) <br>
1: Typosquat

<u>Test Cases</u> (the output below numbers them from 0):
1. Legit URL (Expected Result: 0)
2. Substitute 1 char with a special character (Expected Result: 1)
3. Substitute 1 char with a hyphen (Expected Result: 1)
4. Substitute 1 char with a wrong char, within ED boundary (Expected Result: 0)
5. Substitute 2 chars with wrong chars, within ED boundary (Expected Result: 0)
6. Substitute 3 chars with wrong chars, within ED boundary (Expected Result: 1)
7. Append 1 char, within ED boundary of last char (Expected Result: 0)
8. Append 1 char, exceeding ED boundary (Expected Result: 1)
9. Append 2 chars, within ED boundary (Expected Result: 1)
10. Remove 1 char (Expected Result: 1)
    
<u>Conclusion/Notes</u>:
- need to check the entire URL (SLD + TLD): a period inserted into the domain name changes which part is extracted as the SLD, so the rest of the name is left out of the comparison
- include more checks for suspicious URLs that are shorter than the original URL
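
Two small sketches for these notes, working on plain strings rather than on the extracted SLD. Both function names are hypothetical, and neither is wired into `is_typo`; they only show what the extra checks might look like.

```python
def is_single_deletion(legit, candidate):
    """True if `candidate` is `legit` with exactly one character removed."""
    if len(candidate) != len(legit) - 1:
        return False
    return any(
        legit[:i] + legit[i + 1:] == candidate for i in range(len(legit))
    )

def differs_only_by_periods(legit_host, candidate_host):
    """True if the two hosts differ only by extra or missing periods."""
    return (
        legit_host != candidate_host
        and legit_host.replace(".", "") == candidate_host.replace(".", "")
    )

print(is_single_deletion("bankofsingapore", "bankofsingapor"))  # True
print(differs_only_by_periods("bankofsingapore.com", "bankofsinga.pore.com"))  # True
```

Whether a single missing letter should count as a typo or a typosquat is still open; the first function only detects the case.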

In [115]:
test_cases = ["bankofsingapore.com", 
              "bankofs!ngapore.com", 
              "bankofsing-pore.com", 
              "bankofsingaporr.com", 
              "bankofsingapoer.com", 
              "bankofsingapier.com", 
              "bankofsingaporee.com", 
              "bankofsingaporep.com", 
              "bankofsingaporeee.com", 
              "bankofsingapor.com"]

for i in range(len(test_cases)):
    print("URL", i, ":", is_typo(test_cases[i]))

URL 0 : 0
URL 1 : 1
URL 2 : 1
URL 3 : 0
URL 4 : 0
URL 5 : 1
URL 6 : 0
URL 7 : 1
URL 8 : 1
URL 9 : 1
