## **URL Data Processing**
This notebook transforms raw URL data into a format suitable for model training and analysis. The dataset consists of **100,000 URLs**, each labeled as either **phishing (1)** or **legitimate (0)**. Data cleaning removed duplicate and null entries and standardized the 'result' column for binary classification.
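
The cleaning steps described above are not shown in this notebook, so the following is only a minimal sketch of what they might look like. The toy data, the `LABEL_MAP` dictionary, and the `clean_urls` helper are assumptions made for illustration; only the `url` and `result` column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the raw URL dataset.
raw = pd.DataFrame({
    "url": [
        "https://example.com/login",
        "https://example.com/login",   # exact duplicate
        None,                          # missing URL
        "http://example.org/a",
        "example.net",
    ],
    "result": ["1", "1", "0", 0, "phishing"],
})

# Assumed mapping from raw labels to the binary target.
LABEL_MAP = {"phishing": 1, "legitimate": 0, "1": 1, "0": 0}


def clean_urls(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop null and duplicate URLs and coerce 'result' to integer 0/1."""
    out = frame.dropna(subset=["url", "result"]).copy()
    out = out.drop_duplicates(subset="url")
    # Normalise labels to lowercase strings before mapping them to 0/1.
    labels = out["result"].astype(str).str.strip().str.lower()
    out["result"] = labels.map(LABEL_MAP)
    # Rows whose label is not in LABEL_MAP become NaN and are dropped.
    out = out.dropna(subset=["result"])
    out["result"] = out["result"].astype(int)
    return out.reset_index(drop=True)


cleaned = clean_urls(raw)
print(cleaned)
```

Deduplicating on the `url` column alone, rather than on whole rows, is a design choice in this sketch: it also removes URLs that appear twice with conflicting labels, keeping the first occurrence.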

## **Data Collection**
The URLs were collected from the following public datasets:
1. https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
2. https://www.kaggle.com/datasets/siddharthkumar25/malicious-and-benign-urls
3. https://data.mendeley.com/datasets/vfszbj9b36/1
4. https://archive.ics.uci.edu/dataset/967/phiusiil+phishing+url+dataset

In [1]:
# Import required libraries

import pandas as pd
import re
import math
import tldextract
import idna
from urllib.parse import urlparse, unquote
import validators
from collections import Counter

## **Loading Raw Dataset**
Load the **raw dataset** (`url_dataset.csv`), not a previously processed copy.
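
The `Unnamed: 0` column in the preview output suggests that the CSV was saved together with its index. As a hedged sketch, the file could be loaded with `index_col=0` and checked for the expected columns before feature extraction. The in-memory CSV below is a hypothetical stand-in for `url_dataset.csv`.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for 'url_dataset.csv'; the first,
# unnamed column mimics an index that was written out by to_csv().
csv_text = (
    ',url,result\n'
    '0,"https://example.com/a,b",0\n'
    '1,example.net/login,1\n'
)

# index_col=0 uses the unnamed first column as the index instead of
# loading it as an extra 'Unnamed: 0' column.
df = pd.read_csv(io.StringIO(csv_text), quotechar='"', index_col=0)

# Basic sanity check before feature extraction.
expected_columns = {"url", "result"}
missing = expected_columns - set(df.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")

print(df.shape)
print(df["result"].value_counts())
```

The `quotechar='"'` argument matters for URLs that contain commas: the quoted URL in the first row is read as a single field.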

In [2]:
# Load the dataset and randomly preview 20 rows from it.

df = pd.read_csv('url_dataset.csv', quotechar='"')
df.sample(20)

Unnamed: 0,url,result
44737,washingtonresorthotel.com,1
29877,https://en.goldenmap.com/Marya_Carter,0
15831,http://www.ekl-net.conn-aocscmeoas.iidxia.top/...,1
40680,https://www.lawyers.com/Missouri/St.-Louis/Hus...,0
12887,https://www.hotels.com/ho256598/grand-midwest-...,0
47095,http://satyamultiplex.com/imgg/listmenu/BBHMM%...,1
60775,https://cnsnews.com/news/article/obama-challen...,0
16440,https://www.films.gayeroticarchives.com/?p=7389,0
15071,https://themarknews.com/authors/848-jeremy-kin...,0
48419,foodmoiet514.com/wp-content/newpage/ii.php?rand=,1


## **Feature Extraction**
Each feature function returns a binary flag: `0` suggests a legitimate URL and `1` suggests phishing. If a function raises an error while processing a URL, it also returns `1`.
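
One possible way to express this convention once, rather than repeating the `try`/`except` in every function, is a small decorator. This is only a sketch: `binary_feature` and `has_no_https` are hypothetical names and are not used elsewhere in this notebook.

```python
from functools import wraps
from urllib.parse import urlparse


def binary_feature(func):
    """Hypothetical helper: coerce a check to 0/1 and return 1 on any error."""
    @wraps(func)
    def wrapper(url):
        try:
            return 1 if func(url) else 0
        except Exception:
            # By convention, a URL that cannot be processed is flagged.
            return 1
    return wrapper


@binary_feature
def has_no_https(url):
    # True (suspicious) when the scheme is anything other than https.
    return urlparse(url).scheme != "https"


print(has_no_https("https://example.com"))
print(has_no_https("http://example.com"))
print(has_no_https(12345))  # non-string input is also flagged
```

Catching `Exception` instead of using a bare `except` keeps the fallback behaviour while still letting `KeyboardInterrupt` stop a long-running `apply`.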

In [3]:
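def hasHttps(url):
    try:
        # Compare the scheme exactly; a substring test would also accept
        # any scheme that merely contains 'https'.
        scheme = urlparse(url).scheme
        return 0 if scheme == 'https' else 1
    except Exception:
        return 1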


def validateUrl(url):
    try:
        if validators.url(url):
            return 0
        else:
            return 1
    except:
        return 1


def shannonEntropy(domain):
    char_counts = Counter(domain)
    total_chars = len(domain)
    
    char_probabilities = [count / total_chars for count in char_counts.values()]
    
    entropy = -sum(p * math.log2(p) for p in char_probabilities)
    
    return entropy


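def domainEntropy(url):
    try:
        # Add a scheme when missing; otherwise urlparse leaves netloc empty
        # and the entropy of the domain is never measured.
        if not urlparse(url).scheme:
            url = "https://" + url
        domain = urlparse(url).netloc
        return 1 if shannonEntropy(domain) >= 3.3 else 0
    except Exception:
        return 1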
    
    
# shannonEntropy (defined above) works on any string, so urlEntropy reuses it for the full URL.


def urlEntropy(url):
    try:
        entropy = shannonEntropy(url)
        return 1 if entropy >= 4.2 else 0
    except:
        return 1


def longUrl(url):
    try:
        return 1 if len(url) >= 48 else 0
    except:
        return 1


def longDomain(url):
    try:
        if not urlparse(url).scheme:
            url = "https://" + url
        domain = urlparse(url).netloc
        return 1 if len(domain) >= 17 else 0
    except:
        return 1    


def countDepth(url):
    try:
        matches = re.findall(r'/', url)
        return 1 if len(matches) >= 4 else 0
    except:
        return 1
    

def countDot(url):
    try:
        matches = len(re.findall(r"\.", url))
        return 1 if matches >= 3 else 0
    except:
        return 1    


def uppercaseUrl(url):
    try:
        if any(char.isupper() for char in url):
            return 1
        return 0
    except:
        return 1


def countDigitUrl(url):
    try:
        matches = re.findall(r"\d", url)
        return 1 if matches else 0
    except:
        return 1 
    

def countDigitDomain(url):
    try:
        if not urlparse(url).scheme:
            url = "https://" + url
        domain = urlparse(url).netloc
        matches = re.findall(r"\d", domain)
        return 1 if matches else 0
    except:
        return 1 


def hypenDomain(url):
    try:
        if not urlparse(url).scheme:
            url = "https://" + url
        domain = urlparse(url).netloc
        matches = re.findall(r'-', domain)
        return 1 if matches else 0
    except:
        return 1


def openRedirect(url):
    try:
        decoded = unquote(url)

        suspicious_domains = [
            r'[\x00\r\n<>\|\"#{}\[\]^~@` ]',                                   # Suspicious ASCII/control/symbols
            r'%09',                                                            # Horizontal tab
            r'%00|%0[dD]|%0[aA]',                                              # Null byte, CR, LF
            r'%2e%2e',                                                         # Encoded '..'
            r'\/\/{2,}',                                                       # Multiple slashes
            r'https?:[^/]{1}',                                                 # http: or https:
            r'(\\\\/)+',                                                       # Escaped slashes like \/ or \\//
            r'%2f',                                                            # Encoded forward slash '/'
            r'\\',                                                             # Backslash (escaped)
            r'%5c',                                                            # Encoded backslash
            r'javascript:',                                                    # JavaScript URI
            r'data:text/html;base64',                                          # Base64 data URI
            r'alert\s*\(',                                                     # alert( with optional space
            r'confirm\s*\(',                                                   # confirm( with optional space
            r'%E3%80%82',                                                      # Unicode full-width period (U+3002)
            r'\?.*http',                                                       # HTTP Parameter Pollution
            r'/http',                                                          # Folder as domain
            r'\?http',                                                         # disguised redirect
            r'data:text/html;base64,[A-Za-z0-9+/=]+',                          # Full base64 suspicious_domain
            r'[^\x00-\x7F]',                                                   # Non-ASCII characters
            r'%68%74%74%70',                                                   # Encoded 'http'
            r'\b\d{1,3}(?:\.\d{1,3}){3}\b',                                    # IPv4 address
            r'\b\d{8,10}\b',                                                   # Decimal-encoded IP
            r'([a-fA-F0-9]{1,4}:){1,7}[a-fA-F0-9]{1,4}',                       # IPv6 address
            r'\b0[0-7]+\.[0-7]+\.[0-7]+\.[0-7]+\b',                            # Octal
            r'\b0x[0-9a-fA-F]{8}\b',                                           # Full hex
            r'\b0x[0-9a-fA-F]+\.(?:0x[0-9a-fA-F]+\.){2}0x[0-9a-fA-F]+\b',      # Dot-separated hex
            r'javascript\s*:',                                                 # simple suspicious_domain
            r'java[\s%0a%0d]*script[\s%0a%0d]*:',                              # separated with CRLF/tab/space
            r'javascript\s*//',                                                # line comment based
            r'[\\\/%5c%2f]+javascript\s*:',                                    # /%5cjavascript: and similar
            r'j\s*a\s*v\s*a\s*s\s*c\s*r\s*i\s*p\s*t\s*:',                      # spaced letters
            r'(?i)<>\s*javascript:',                                           # <>javascript:
            r'javascrip[tT]\s*:',                                              # case variation
            r'%09.*javascript\s*:',                                            # tab then javascript
            r'javascript\s*:\s*(alert|prompt|confirm)\s*\(',                   # common XSS functions
            r'[^\w]javascript\s*:',                                            # preceding character like `/`, `;`, etc.
            r'x:1:///+%01*javascript\s*:',                                     # exotic pseudo-scheme
        ]

        redirect_keywords = [
            r"/redirect/", r"/cgi-bin/redirect\.cgi\?", r"/out/", r"/out\?",r"\?next=", 
            r"\?url=", r"\?target=", r"\?rurl=", r"\?dest=", r"\?destination=", r"\?redir=",
            r"\?redirect_uri=", r"\?redirect_url=", r"\?redirect=", r"\?view=", r"\?image_url=", 
            r"\?go=",r"\?return=", r"\?returnTo=", r"\?return_to=", r"\?checkout_url=", r"\?continue=", 
            r"\?return_path=",r"/login\?to=",r"success=", r"data=", r"login=", r"logout=", r"clickurl=",
            r"goto=", r"rit_url=", r"forward_url=", r"callback_url=",r"jump=", r"jump_url=", r"click\?u=", 
            r"originUrl=", r"origin=", r"Url=", r"desturl=", r"u=", r"u1=",r"page=", r"action=",
            r"action_url=", r"Redirect=", r"sp_url=", r"service=", r"recurl=", r"j\?url=", r"uri=",
            r"allinurl=", r"q=", r"link=", r"src=", r"tc\?src=", r"linkAddress=", r"location=", r"burl=",
            r"request=", r"backurl=", r"RedirectUrl=", r"ReturnUrl="
        ]

        for suspicious_domain in redirect_keywords + suspicious_domains:
            if re.search(suspicious_domain, decoded, re.IGNORECASE) or re.search(suspicious_domain, url, re.IGNORECASE):
                return 1
        return 0
    except:
        return 1


def suspiciousExtension(url):
    try:
        if not urlparse(url).scheme:
            url = "https://" + url
        parsed = urlparse(url)
        stripped = parsed.path + "?" + parsed.query if parsed.query else parsed.path
        malicious_extensions = (
            '.pdf', '.exe', '.dll', '.bat', '.cmd', '.scr', '.js', '.vb', '.vbs', '.msp', '.ps2', '.psc1', '.zip', 
            '.ps1', '.jar', '.py', '.rb', '.pif', '.rtf', '.vbe', '.docx', '.ps1xml', '.lnk', '.reg', '.sh','.bin',
            '.apk', '.msi', '.iso', '.doc', '.xlsx', '.inf', '.ws', '.xls', '.jpeg', '.xlsm', '.ppt', '.html', '.htm',
            '.application', '.gadget', '.docm', '.jse', '.psc2', '.php', '.aspx', '.jsp', '.asp', '.cgi', '.mips',
            '.pl', '.wsf', '.class', '.sldm', '.war', '.ear', '.sys', '.cpl', '.drv', '.dmg', '.pkg', '.gif','.xhtml',
            '.mde', '.msc', '.xlam', '.ppam', '.mst', '.paf', '.scf', '.sct', '.shb', '.vxd', '.wsc', '.wsh', '.mpsl',
            '.txt', '.pptm', '.potm', '.msh', '.msh1', '.msh2', '.mshxml', '.msh1xml', '.msh2xml', '.pol', '.hlp',
            '.chm', '.rar', '.z', '.bz2', '.cab', '.gz', '.tar', '.ace', '.msu', '.ocx', '.feed','.ppc', '.arm', 
            '.phtml', '.stm', '.ppkg', '.bak', '.tmp', '.ost', '.pst', '.arm7', '.avi','.hta', '.shtml', '.sh4',
            '.img', '.vhd', '.vhdx', '.lock', '.lck', '.sln', '.cs', '.csproj', '.resx', '.config', '.snoopy',
            '.resources', '.pdb', '.manifest', '.mp3', '.wma', '.dot', '.wbk', '.xlt', '.xlm', '.arm6','.com',
            '.xla', '.pot', '.pps', '.ade', '.adp', '.mdb', '.cdb', '.mda', '.mdn', '.mdt', '.mdf', '.xml', 
            '.ldb', '.wps', '.xlsb', '.xll', '.xlw', '.m', '.jpg', '.css', '.-1', '.png', '.x86', '.spc'
        )
        suspicious_domain = re.compile(r"(" + "|".join(re.escape(ext) for ext in malicious_extensions) + r")(\?|$)", re.IGNORECASE)

        match = re.search(suspicious_domain, stripped)
        return 1 if match else 0
    except:
        return 1   


def suspiciousTld(url):
    try:
        suspicious_tlds = {
        "icu", "ml", "py", "tk", "xyz", "am", "bd", "best", "bid", "cd", "cfd", "cf", "click", "cyou", "date",
        "download", "faith", "ga", "gq", "help", "info", "ke", "loan", "men", "porn", "pw", "quest", "rest",
        "review", "sbs", "sex", "su", "support", "win", "ws", "xxx", "zip", "zw", "asia", "autos", "bar", "bio",
        "blue", "buzz", "casa", "cc", "charity", "club", "country", "dad", "degree", "earth", "email", "fit",
        "fund", "futbol", "fyi", "gdn", "gives", "gold", "guru", "haus", "homes", "id", "in", "ink", "jetzt",
        "kim", "lat", "life", "live", "lol", "ltd", "makeup", "mom", "monster", "mov", "ninja", "online", "pics",
        "plus", "pro", "pub", "racing", "realtor", "ren", "rip", "rocks", "rodeo", "run", "shop", "skin", "space",
        "tokyo", "uno", "vip", "wang", "wiki", "work", "world", "xin", "zone", "accountant", "accountants", "adult",
        "bet", "cam", "casino", "cm", "cn", "cricket", "ge", "il", "link", "lk", "me", "ng", "party", "pk", "poker",
        "ru", "sa", "science", "sexy", "site", "stream", "th", "tn", "top", "trade", "tube", "webcam", "wtf"
        }
        
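        # Add a scheme when missing so that urlparse populates netloc.
        if not urlparse(url).scheme:
            url = "https://" + url
        parsed = urlparse(url)
        domain = parsed.netloc.split(':')[0]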
        domain_ascii = idna.encode(domain).decode('ascii')

        ext = tldextract.extract(domain_ascii)
        tld = ext.suffix.lower()
        if tld.startswith("xn--"):
            return 1
        return 1 if tld in suspicious_tlds else 0

    except:
        return 1  
   

def suspiciousWord(url):
    try:
        suspicious_words = [
            "index", "login", "wp-content", "images", "wp-includes", "js", "wp-admin", "component","wais",
            "home", "css", "plugins", "uploads", "dropbox", "html", "mozi", "themes", "view", "en","telnet",
            "admin", "ipfs", "secure", "site", "includes", "signin", "doc", "update", "alibaba","nntp",
            "products", "data", "file", "auth", "news", "modules", "document", "ii", "bins", "gopher",
            "components", "files", "content", "blog", "mailto", "myaccount", "gate", "img", "media",
            "dhl", "new", "app", "public", "user", "de", "d", "article", "a", "assets", "templates",
            "cp", "libraries", "bookmark", "default", "system", "mail", "web", "sejeal", "upload",
            "account", "detail", "index2", "openme", "info", "projects", "e", "category", "verify",
            "verification", "raw", "es", "db", "administrator", "log", "b", "personal", "prospero"
        ]

        suspicious_domain = r'\b(' + '|'.join(re.escape(word) for word in suspicious_words) + r')\b'

        match = re.search(suspicious_domain, url, re.IGNORECASE)
        return 1 if match else 0
    except:
        return 1


def suspiciousDomain(url):
    try:
        suspicious_domains = [
            "at.ua", "usa.cc", "baltazarpresentes.com.br", "pe.hu", "esy.es", "hol.es", "sweddy.com", "myjino.ru", "96.lt",
            "ow.ly", "clikar.com", "tinyurl.com", "bc.vc", "ity.im", "q.gs", "zytpirwai.net", "buff.ly", "bitly.is", "rb.gy",
            "chilp.it", "000webhostapp.com", "altervista.org", "awardspace.com", "biz.tc", "bravenet.com", "byethost.com",
            "freehosting.com", "freeservers.com", "heliohost.org", "hostinger.com", "infinityfree.net", "nfshost.com",
            "pages.jaiku.com", "scam.org", "uw.hu", "x10hosting.com", "zohosites.com", "s3.amazonaws.com", "site.90.cf",
            "webs.com", "tripod.com", "ipfs.io", "workers.dev", "profreehost.com", "livehost.fr", "hostfree.es", "claro.am",
            "freedynamicdns.org", "dottk.com", "zankyou.com", "freewebspace.com", "freeuk.com", "weebly.com", "geocities.com",
            "sitemix.jp", "ucoz.com", "8m.com", "00server.com", "000space.com", "t35.com", "pantheonsite.io", "wefreeweb.com",
            "brinkster.com", "50webs.com", "8k.com", "7li.ink", "fast2host.com", "000a.biz", "0fees.net", "abysales.com",
            "ietf.org", "weeblysite.com", "mixh.jp", "dweb.link", "1337x.to", "katcr.co", "kickass.to", "thepiratebay.org",
            "rarbg.to", "yify-torrents.com", "lemonparty.org", "goatse.cx", "meatspin.com", "tubgirl.com", "2girls1cup.info",
            "2girls1cup.tv", "mydeals.com", "graboid.com", "lifescams.com", "angelfire.com", "pastebin.com", "xsph.ru",
            "phishing.com", "malware.com", "scamalert.com", "square.site", "apbfiber.com", "sharepoint.com", "mxsimulator.com",
            "sogou.com", "clickbank.com", "myfavoritesites.com", "mysearch123.com", "herokuapp.com", "github.io", "freenom.com",
            "repl.co", "glitch.me", "netlify.app", "pastehtml.com", "surge.sh", "pages.dev", "fly.dev", "firebaseapp.com",
            "awsstatic.com", "azurewebsites.net", "vercel.app", "web.app", "appspot.com", "appchkr.com", "blogspot.com",
            "hostingerapp.com", "infomaniak.com", "myfreesites.net", "square7.ch", "wixsite.com","temp.domains/~",
            "zohosites.in", "squarespace.com", "blogger.com", "tumblr.com", "ghost.io", "strikingly.com", "jimdo.com",
            "webflow.io", "shopify.com", "bigcartel.com", "storenvy.com", "ecwid.com", "tictail.com", "gumroad.com",
            "sellfy.com", "fastspring.com", "sendowl.com", "paddle.com", "gumtree.com", "mozello.com", "ucraft.com",
            "carrd.co", "launchrock.com", "tilda.cc", "bubble.io", "instapage.com", "unbounce.com", "leadpages.com",
            "getresponse.com", "wordpress.com", "now.sh", "render.com", "glitch.com", "codepen.io", "sandboxd.io","/~",
            "jsfiddle.net", "codesandbox.io", "plunker.co", "scratch.mit.edu", "expo.io", "hyper.dev", "plnkr.co",
            "bitballoon.com", "itch.io", "scrimba.com", "stackblitz.com", "observablehq.com", "replit.com", "codeanywhere.com",
            "stacksity.com", "runkit.com", "xip.io", "nip.io", "vapor.cloud", "simmer.io", "glitchet.com", "felony.io",
            "deckdeckgo.com", "shynet.io", "fly.io", "updog.co", "nanoapp.io", "epizy.com", "trovalds.github.io", "netlify.com"
        ]
                
        shortening_domains = [
            "bit.ly", "goo.gl", "shorte.st", "go2l.ink", "x.co", "ow.ly", "t.co", "tinyurl", "tr.im", "is.gd", "cli.gs",
            "yfrog.com", "migre.me", "ff.im", "tiny.cc", "url4.eu", "twit.ac", "su.pr", "twurl.nl", "snipurl.com", "short.to",
            "BudURL.com", "ping.fm", "post.ly", "Just.as", "bkite.com", "snipr.com", "fic.kr", "loopt.us", "doiop.com",
            "short.ie", "kl.am", "wp.me", "rubyurl.com", "om.ly", "to.ly", "bit.do", "lnkd.in", "db.tt", "qr.ae", "adf.ly",
            "bitly.com", "cur.lv", "ity.im", "q.gs", "po.st", "bc.vc", "twitthis.com", "u.to", "j.mp", "buzurl.com", "cutt.us",
            "u.bb", "yourls.org", "prettylinkpro.com", "scrnch.me", "filoops.info", "vzturl.com", "qr.net", "1url.com",
            "tweez.me", "v.gd", "link.zip.net", "shorturl.at", "rebrand.ly", "shorten.at", "shortenurl.at", "tiny.one",
            "tinyurl.one", "t2mio.com", "yep.it", "youtu.be", "zpr.io", "zurl.ws", "clck.ru", "cutt.ly", "shorturl.cm", "soo.gd",
            "tiny.vc", "tr.tt", "u.ii", "ur1.ca", "bit.li", "t2m.io", "clicky.me", "cr.yp.to", "owly.ai", "chilp.it", "snip.ly",
            "snurl.com", "poprl.com", "memurl.com", "trimurl.com", "zurl.co", "zzb.vc", "v.tc", "qr.cc", "t.it", "x.ee",
            "short.cm", "u.mavrev.com", "u.mytu.tu", "u.nu", "u.ddy.pr", "go.usa.gov", "miniurl.com", "corta.at", "sh.rt",
            "adcrun.ch", "surl.li", "rb.gy"
        ]

        shortening_services = r"\b(" + "|".join(re.escape(domain) for domain in shortening_domains) + r")\b"

        suspicious_domain = r'\b(' + '|'.join(re.escape(word) for word in suspicious_domains) + r')\b'

        if re.search(suspicious_domain, url, re.IGNORECASE) or re.search(shortening_services, url, re.IGNORECASE):
            return 1
        return 0
    except:
        return 1
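
To make the entropy thresholds above more concrete, here is a small worked sketch of Shannon entropy, H = -sum(p * log2(p)) over the character frequencies p of a string. The function name `shannon_entropy` and the two sample domains are illustrative assumptions; the 3.3 cut-off is the one used by `domainEntropy`.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())


# Two equally likely characters need 1 bit each, four need 2 bits.
print(shannon_entropy("aabb"))  # 1.0
print(shannon_entropy("abcd"))  # 2.0

DOMAIN_THRESHOLD = 3.3  # same cut-off as domainEntropy

# Hypothetical domains: one ordinary, one random-looking.
flags = {}
for domain in ["example.com", "x7f9q2kz81lmv0.top"]:
    entropy = shannon_entropy(domain)
    flags[domain] = 1 if entropy >= DOMAIN_THRESHOLD else 0
    print(domain, round(entropy, 3), flags[domain])
```

Entropy grows with the number of distinct characters and with how evenly they are used, which is why long random-looking hostnames tend to cross the threshold while short dictionary-like names do not.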

## **Features into Dataframe**
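Apply each feature function to the `url` column and store the result in its own column. Ensure that every feature function defined above has a matching column in the dataframe.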
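
One way to keep functions and columns in sync is to drive the assignment from a single mapping and check the result afterwards. The sketch below uses two simplified, hypothetical stand-ins (`long_url`, `has_digit`) instead of the full feature set, and a two-row toy dataframe.

```python
import pandas as pd


# Hypothetical, simplified stand-ins for the feature functions.
def long_url(url):
    return 1 if len(url) >= 48 else 0


def has_digit(url):
    return 1 if any(ch.isdigit() for ch in url) else 0


# One entry per feature: column name -> function.
FEATURES = {
    "longUrl": long_url,
    "countDigitURL": has_digit,
}

df = pd.DataFrame({
    "url": ["https://example.com", "http://198.51.100.7/login"],
    "result": [0, 1],
})

for column, func in FEATURES.items():
    df[column] = df["url"].apply(func)

# Verify that every feature produced a column and that each is binary.
missing = [name for name in FEATURES if name not in df.columns]
assert not missing, f"features without a column: {missing}"
assert df[list(FEATURES)].isin([0, 1]).all().all()

print(df)
```

With this layout, adding a feature means adding one dictionary entry, and the final checks fail early if a column is missing or contains anything other than 0 and 1.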

In [4]:
# Apply each feature function to the 'url' column, one new column per feature
df['entropyDomain'] = df['url'].apply(domainEntropy)
df['entropyurl'] = df['url'].apply(urlEntropy)
df['longUrl'] = df['url'].apply(longUrl)
df['suspiciousExtension'] = df['url'].apply(suspiciousExtension)
df['countDepth'] = df['url'].apply(countDepth)
df['countDot'] = df['url'].apply(countDot)
df['hasHttps'] = df['url'].apply(hasHttps)
df['suspiciousTld'] = df['url'].apply(suspiciousTld)
df['suspiciousDomain'] = df['url'].apply(suspiciousDomain)
df['validateUrl'] = df['url'].apply(validateUrl)
df['suspiciousWord'] = df['url'].apply(suspiciousWord)
df['longDomain'] = df['url'].apply(longDomain)
df['hypenDomain'] = df['url'].apply(hypenDomain)
df['countDigitURL'] = df['url'].apply(countDigitUrl)
df['countDigitDomain'] = df['url'].apply(countDigitDomain)
df['openRedirect'] = df['url'].apply(openRedirect)
df['uppercaseUrl'] = df['url'].apply(uppercaseUrl)

# Save the processed dataframe to a new CSV file
df.to_csv("url_dataset_processed.csv", index=False)


## **Move On to Testing & Training The Models**