Requirements 
<ul>
    <li>Your function should account for keyboard distance for all characters (even special characters)</li>
    <li>Your function should account for levenshtein distance as well. If there are more typos, then the score should lean more towards 1 (since it is unlikely for a user to make so many typos)</li>
    <li>Think about if it is more likely to make a horizontal typo vs a vertical typo, you may want to assign a weight to differentiate the typos</li>
    <li>Think about the case when the strings have different lengths and how you should handle it</li>
    <li>Think about if it is necessary to distinguish if the character is already very far away (e.g wikip9dia.org vs wikip0dia.org), both are most likely typosquats, is there a need for a different score? How many keyboard characters away then should I consider it to be not a typo vs not typo?</li>
    <li>Try to think of any other conditions / requirements that I may have missed out, and feel free to suggest any</li>
</ul>

what about swapped letters, one-too-many letters

numbers above qwertyuiop are possible typos, but some may be intended typosquats (i.e. o -> 0; E -> 3 ?)

what about when a user presses 2 keys on accident? e.g. wikoipedia -> presses "o" and "k" when trying to press "k"

are special chars/homoglyphs legal in the url box?

what if they miss a letter?


[python-Levenshtein PyPI](https://pypi.org/project/python-Levenshtein/)

[euclidean distance using numpy (stack overflow)](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) (may help make calculating ED more efficient?)

[top 10 most common TLDs](https://www.statista.com/statistics/265677/number-of-internet-top-level-domains-worldwide/)

[domain name regex](https://medium.com/@vaghasiyaharryk/how-to-validate-a-domain-name-using-regular-expression-9ab484a1b430)


[Prototype 3 string check](https://stackoverflow.com/questions/774316/python-difflib-highlighting-differences-inline)

[regex to extract subdomain and domain](https://stackoverflow.com/questions/56157896/regex-for-extracting-domains-and-subdomains)

In [1]:
from math import *
import dnstwist
import pylev as ls
# import Levenshtein as ls
import numpy as np
import difflib as dl
import re
import pandas as pd
from multiprocessing import Pool
import cProfile
import pstats

ked_boundary = 1.5
allow_transposition = False

In [15]:
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org python-Levenshtein

Collecting python-Levenshtein
  Using cached python-Levenshtein-0.12.2.tar.gz (50 kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py): started
  Building wheel for python-Levenshtein (setup.py): finished with status 'error'
  Running setup.py clean for python-Levenshtein
Failed to build python-Levenshtein
Installing collected packages: python-Levenshtein
    Running setup.py install for python-Levenshtein: started
    Running setup.py install for python-Levenshtein: finished with status 'error'
Note: you may need to restart the kernel to use updated packages.


  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\User\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\User\\AppData\\Local\\Temp\\pip-install-msyx9cej\\python-levenshtein\\setup.py'"'"'; __file__='"'"'C:\\Users\\User\\AppData\\Local\\Temp\\pip-install-msyx9cej\\python-levenshtein\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\User\AppData\Local\Temp\pip-wheel-v97pnypi'
       cwd: C:\Users\User\AppData\Local\Temp\pip-install-msyx9cej\python-levenshtein\
  Complete output (27 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.8
  creating build\lib.win-amd64-3.8\Levenshtein
  copying Levenshtein\StringMatcher.py -> build\lib.win-amd64-3.8\Levenshtein
  copying Levenshtein\__init__.py -> build\lib.win-amd64-

## Keyboard Euclidean Distance

In [2]:
keyboard_cartesian = {
                        "1": {"y": -1, "x": 0},
                        "2": {"y": -1, "x": 1},
                        "3": {"y": -1, "x": 2},
                        "4": {"y": -1, "x": 3},
                        "5": {"y": -1, "x": 4},
                        "6": {"y": -1, "x": 5},
                        "7": {"y": -1, "x": 6},
                        "8": {"y": -1, "x": 7},
                        "9": {"y": -1, "x": 8},
                        "0": {"y": -1, "x": 9},
                        "-": {"y": -1, "x": 10},
                        "q": {"y": 0, "x": 0},
                        "w": {"y": 0, "x": 1},
                        "e": {"y": 0, "x": 2},
                        "r": {"y": 0, "x": 3},
                        "t": {"y": 0, "x": 4},
                        "y": {"y": 0, "x": 5},
                        "u": {"y": 0, "x": 6},
                        "i": {"y": 0, "x": 7},
                        "o": {"y": 0, "x": 8},
                        "p": {"y": 0, "x": 9},
                        "a": {"y": 1, "x": 0},
                        "s": {"y": 1, "x": 1},
                        "d": {"y": 1, "x": 2},
                        "f": {"y": 1, "x": 3},
                        "g": {"y": 1, "x": 4},
                        "h": {"y": 1, "x": 5},
                        "j": {"y": 1, "x": 6},
                        "k": {"y": 1, "x": 7},
                        "l": {"y": 1, "x": 8},
                        ";": {"y": 2, "x": 9},
                        "'": {"y": 2, "x": 10},
                        "z": {"y": 2, "x": 0},
                        "x": {"y": 2, "x": 1},
                        "c": {"y": 2, "x": 2},
                        "v": {"y": 2, "x": 3},
                        "b": {"y": 2, "x": 4},
                        "n": {"y": 2, "x": 5},
                        "m": {"y": 2, "x": 6},
                        ",": {"y": 2, "x": 7},
                        ".": {"y": 2, "x": 8},
                        "/": {"y": 2, "x": 9}                   
                     }

def euclidean_distance(a,b):
    X = (keyboard_cartesian[a]['x']-keyboard_cartesian[b]['x'])**2
    Y = (keyboard_cartesian[a]['y']-keyboard_cartesian[b]['y'])**2
    return sqrt(X+Y)

# print(euclidean_distance('q', 'w'))
# print(euclidean_distance('q', 's'))

In [3]:
# https://stackoverflow.com/questions/44113335/extract-domain-from-url-in-python

def replace_special_char(char):
    flag = '"!@#$%^&*()+?_=,<>'':\\'
    flag_list = [char for char in flag]
    if char.isalnum()==False and char in flag:
        return 'Z'
    return char
    
# Clean the string first. The extract python library cannot properly extract the domain from URLs with special characters

# 1. make string lower case
# 2. replace all flagged special characters with 'Z'
# 3. extract the domain or TLD
# 4. replace all 'Z' with '!'
# 5. calculate edit distance

def clean_string(url):
    # First make the string lowercase
    url = url.lower()

    # ':' is a flagged character, but if it appears with http or https it is fine
    url = url.replace('https://','')
    url = url.replace('http://','')
    return ''.join([replace_special_char(char) for char in url])
    
def extract_subdomain_and_domain(url):
    url = clean_string(url)
    rgx = r"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)"
    result = re.search(rgx, url).group(1)
    return result.replace('Z','!')

def extract_sld(url):
    url = clean_string(url)
    url_split = url.split(".")
    return url_split[-2].replace('Z','!')
    
def extract_tld(url):
    url = clean_string(url)
    url_split = url.split(".")
    return url_split[-1].replace('Z','!')

# Theres no need for this container to be mutable and tuples are faster
# Index 0 - 9 = most popular to least popular
tlds = ("com", "ru", "org", "net", "in", "ir", "au", "uk", "de", "br")

# whitelist = ['https://www.wikipedia.org/']


# whitelist_domains = []
# whitelist_slds = []
# for url in whitelist:
#     whitelist_domains.append(extract_subdomain_and_domain(url))
#     whitelist_slds.append(extract_sld(url))
# typosquat = []
# for url in whitelist_domains:
#     fuzz = dnstwist.DomainFuzz(url)
#     fuzz.generate()
#     typosquat.extend([x['domain-name'] for x in fuzz.domains])
    
# #Lookups in sets are much more efficient
# typosquat = set(typosquat)

# #Delete the original whitelisted domains from the blacklist set
# typosquat.difference_update(whitelist_domains)

# print('Number of typosquatted urls generated: ', len(typosquat))

### Prototype 4
- included LD == 1 for further checks
- added new helper function: exceeds_ked_check(), extract_chars()
- improved efficiency, removed redundant/repetitive codes

Helper Functions

In [4]:
# Input - 1 URL
# Output - True / False

# e.g contains_special_characters(wikipedia.org) 
# expected output False

# e.g contains_special_characters(wikipedi@.org) 
# expected output True

def contains_special_characters(url):
    # regex checks if the url is in the correct format
    #(e.g. two periods (".") side-by-side is an invalid format)
    return not re.match("^((?!-)[A-Za-z0-9-]{1,}(?<!-)\.)+[A-Za-z0-9]{1,}$", url)


# Helper function to find out if the first TLD is more common than the second TLD

# Input - 2 TLDs
# Output - True / False

# e.g exact_tld_swap(com, org) 
# expected output True

# e.g exact_tld_swap(org, com) 
# expected output False

def tld_more_common(tld1, tld2):    
    tld1_index = 10
    tld2_index = 10
    if tld1 in tlds:
        tld1_index = tlds.index(tld1)
    if tld2 in tlds:
        tld2_index = tlds.index(tld2)
    if tld2_index > tld1_index:
        return True
    elif tld2_index < tld1_index:
        return False


# Helper function to check if "wrong" character is within KED of any of the other supplied characters

# Input - 4 characters
# Output - False

# e.g. exceeds_ked_check("i", "i", "k", "i")
# expected output True

# e.g. exceeds_ked_check("p", "d", "e", "m")
# expected output False
    
def exceeds_ked_check(left, right, correct, wrong):
    result = True
    
    # if the wrong/extra char is at the start of the url, the left will most likely be empty
    if left != "":
        if euclidean_distance(left, wrong) < ked_boundary:
            result = False
    
    # if the wrong/extra char is at the end of the url, the right will most likely be empty
    if right != "":
        if euclidean_distance(right, wrong) < ked_boundary:
            result = False
    
    # if the error is an extra char, there wont be a "correct" char
    if correct != "":
        if euclidean_distance(correct, wrong) < ked_boundary:
            result = False
        
    return result
    
# Helper function to extract wrong chars, correct chars, and chars on the left and right into a tuple of lists for easier comparison

# Input - SequenceMatcher object, indexes of chars to be compared
# Output - tuple of list    
def extract_chars(seqm, a0, a1, b0, b1):
    # list holding the chars on the left and right; index 0: left char, index 1: right char
    left_right = []
    
    if len(seqm.a[a0:a1]) == 2:
        left_right.append(seqm.a[a0-1:a1-2])
        left_right.append(seqm.a[a0+2:a1+1])
    else:
        left_right.append(seqm.a[a0-1:a1-1])
        left_right.append(seqm.a[a0+1:a1+1])
    
    # tuple consisting of 3 lists
    # first list: wrong/extra char(s)
    # second list: correct char(s) (may be empty if opcode == delete)
    # third list: chars on the left and right
    result = ([char for char in seqm.a[a0:a1]], [char for char in seqm.b[b0:b1]], left_right)
    return result   

    
# Helper function to perform various euclidean distance checks based off of the lengths of both URLs

# Input - 2 URLs
# Output - dictionary of results

# e.g. edit_check("w1k1pedia.org", "wikipedia.org")
# expected output: {'result': 1, 'reasons_typosquat': ["'1' key and 'i' key are too far apart", "'1' key and 'i' key are too far apart"], 'reasons_typo': []}
def edit_check(sus_url, legit_url):
    result = {"result": 0, "reasons_typosquat": [], "reasons_typo": []}
    
    # retrieving length of both URLs
    sus_len = len(sus_url)
    legit_len = len(legit_url)
    
    # retrieving levenshtein distance of both urls
    ld = ls.levenshtein(sus_url, legit_url)
    
    # if the suspicious is only missing characters, then it's highly likely a typo than a typosquat
    if sus_len == legit_len - 1 and ld == 1 or sus_len == legit_len - 2 and ld == 2:
        result["reasons_typo"].append("Only missing 1-2 characters")
        return result
    
    # checking if side-by-side swap
    if sus_len == legit_len and ld == 2:
        for index in range(sus_len):
            if sus_url[index] != legit_url[index] and sus_url[index+1] != None:
                if sus_url[index+1] == legit_url[index] and sus_url[index] == legit_url[index+1]:
                    if allow_transposition:
                        result["reasons_typo"].append("Characters are just swapped in-place")
                        return result
                    else:
                        result["result"] = 1
                        result["reasons_typosquat"].append("Characters were swapped")
                        return result
    
    
    # creating an object that compares both urls
    seqm = dl.SequenceMatcher(None, sus_url, legit_url)

    # get_opcodes() gets the "differences" between each url, or the steps required for first url to match second url
    for opcode, a0, a1, b0, b1 in seqm.get_opcodes():
        # a : first url
        # b : second url
        # a0, a1 | b0, b1 : index range holding the characters being compared
        # opcode : "equal", "insert", "delete", "replace" -- indicates the action require to turn a to b

        # extracting chars to be compared
        # chars_tuple[0] : list of wrong/extra char(s)
        # chars_tuple[1] : list of correct char(s)
        # chars_tuple[2] : list of chars on the left and right
        chars_tuple = extract_chars(seqm, a0, a1, b0, b1)

        # if a character has to be deleted, means it's an extra character
        if opcode == 'delete':
            for extra_char in chars_tuple[0]:
                if exceeds_ked_check(chars_tuple[2][0], chars_tuple[2][1], "", extra_char):
                    result["result"] = 1
                    result["reasons_typosquat"].append("Extra character ('{}') exceeds keyboard euclidean distance boundary".format(extra_char))
                    return result

        elif opcode == 'replace':
            for wrong_char in chars_tuple[0]:
                exceeds = True
                if not exceeds_ked_check(chars_tuple[2][0], chars_tuple[2][1], chars_tuple[1][0], wrong_char):
                    exceeds = False
                if len(chars_tuple[1]) == 2:
                    if not exceeds_ked_check(chars_tuple[2][0], chars_tuple[2][1], chars_tuple[1][1], wrong_char):
                        exceeds = False

                if exceeds:
                    result["result"] = 1
                    result["reasons_typosquat"].append("Wrong/Extra character ('{}') exceeds keyboard euclidean distance boundary".format(wrong_char))
                    return result
                        
    result["reasons_typo"].append("Wrong/Extra characters within keyboard euclidean distance boundary")
        
    return result


Main Program

In [5]:
def is_typo(sus_original_url, legit_original_url):
    # cleaning original_url
    legit_url = extract_subdomain_and_domain(legit_original_url)
    sus_url = extract_subdomain_and_domain(sus_original_url)
    
    # preparing 
    result = {"legitimate_url": legit_original_url, "legitimate_cleaned_url": legit_url, "suspicious_url": sus_original_url, "suspicious_cleaned_url": sus_url, "result": 0, "reasons_typosquat": [], "reasons_typo": []}
    
    # extracting SLD from both URLs
    sus_sld = extract_sld(sus_url)
    legit_sld = extract_sld(legit_url)
    
    # extracting TLD from both URLs
    sus_tld = extract_tld(sus_url)
    legit_tld = extract_tld(legit_url)
    
    # retrieving length of both URLs
    sus_len = len(sus_url)
    legit_len = len(legit_url)
    
    # check for illegal special characters
    if contains_special_characters(sus_url):
        result["result"] = 1
        result["reasons_typosquat"].append("Illegal characters found in url")
        return result
    
    # check for exact TLD swap
    if sus_sld == legit_sld and sus_tld != legit_tld:
        if tld_more_common(sus_tld, legit_tld):
            result["reasons_typo"].append("TLD is more common")
            return result
        else:
            result["result"] = 1
            result["reasons_typosquat"].append("TLD is less common")
            return result
    
    # check edit distance
    ld = ls.levenshtein(sus_url, legit_url)
    
    # if only 1 or 2 edits
    if ld == 1 or ld == 2:
        res = edit_check(sus_url, legit_url)
        
        result["reasons_typosquat"].extend(res["reasons_typosquat"])
        result["reasons_typo"].extend(res["reasons_typo"])
        
        if res["result"] == 1:
            result["result"] = 1
            return result
    
    # if 3 or more edits
    else:
        result["result"] = "Inconclusive"
        return result
    
    
    return result

In [6]:
# def parallel_process(l1, l2):
#     pool = Pool(4)
#     df = pd.concat(pool.map(is_typo, zip(l1, l2)))
#     pool.close()
#     pool.join()
#     return df

In [7]:
# legit_urls = ["https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org"]
# sus_urls = ["https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org"] 

# df = parallel_process(legit_urls, sus_urls)
    
# pd.set_option('display.max_colwidth', None)
# df

In [19]:
profile = cProfile.Profile()

legit_urls = ["https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org", "https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org"]
sus_urls = ["https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org", "https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org"] 
test_results = []

legit_urls = legit_urls * 1000
sus_urls = sus_urls * 1000

print(len(legit_urls))
print(len(sus_urls))

for index in range(len(legit_urls)):
    profile.runctx('is_typo(s, l)', {"is_typo": is_typo, "s": sus_urls[index], "l": legit_urls[index]}, {})

ps = pstats.Stats(profile)
ps.sort_stats(pstats.SortKey.TIME).print_stats()

pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(test_results)
df

336000
336000
         142248001 function calls in 319.191 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   432000   94.969    0.000   96.729    0.000 C:\Users\User\anaconda3\lib\site-packages\pylev.py:156(wfi_levenshtein)
 31320000   65.232    0.000   65.232    0.000 <ipython-input-3-68fdd09f7116>:5(<listcomp>)
 31320000   45.313    0.000  117.360    0.000 <ipython-input-3-68fdd09f7116>:3(replace_special_char)
  2016000   20.206    0.000  137.565    0.000 <ipython-input-3-68fdd09f7116>:25(<listcomp>)
   336000   16.108    0.000  320.185    0.001 {built-in method builtins.exec}
   384000    9.459    0.000   11.860    0.000 C:\Users\User\anaconda3\lib\difflib.py:336(find_longest_match)
  2016000    7.320    0.000  148.755    0.000 <ipython-input-3-68fdd09f7116>:18(clean_string)
 31320000    6.814    0.000    6.814    0.000 {method 'isalnum' of 'str' objects}
   336000    5.801    0.000  302.956    0.001 <ipython-input-5-34

Test Cases

In [9]:
# legit_urls = ["https://www.wikipedia.org/", "http://www.wikipedia.org/", "https://wikipedia.org/", "http://wikipedia.org/", "http://www.abc.wikipedia.org/", "https://www.wikipedia.org/home.html", "m.wikipedia.org", "www.wikipedia.org/home.aspx", "https://m.wikipedia.org", "www.wikipedia.org", "https://abc.wikipedia.org/", "www.wikipedia.org/about.php", "https://www.wikipedia.org/", "wikipedia.org"]
# sus_urls = ["https://www.wikipedi@.org/", "http://www.wikipedia.com/", "https://wikipedia.br/", "http://wikipedi.org/", "http://www.abc.kipedia.org/", "https://www.wikipediabb.org/home.html", "m.wwikipediac.org", "www.wikipedia.bvg/home.aspx", "https://m.wikipemnia.org", "www.wbipedia.org", "https://abc.wiklped1o.0rg/", "www.wikiped1a.org/about.php", "https://www.wikipediaa.org/", "wlklpedla.org"] 
# test_results = []

# for index in range(len(legit_urls)):
#     test_results.append(is_typo(sus_urls[index], legit_urls[index]))
    
# pd.set_option('display.max_colwidth', None)
# df = pd.DataFrame(test_results)
# df