# HoxHunt Summer Hunters 2020 - Data - Home assignment

Assignment

In this assignment you as a HoxHunt Data Science Hunter are given the task to extract interesting features from a possible malicious indicator of compromise, more specifically in this case from a given potentially malicious URL.

This assignment assumes that you are comfortable (or quick to learn) on using Jupyter Notebooks and suitable programming enviroment such as Python, R or Julia. The example below uses Python and has some external dependencies such as Requests library.

Happy hunting!

Interesting research papers & resources

Below is a list of interesting research papers on the topic. They might give you good tips what features you could extract from a given URL:

Know Your Phish: Novel Techniques for Detecting Phishing Sites and their Targets

DeltaPhish: Detecting Phishing Webpages in Compromised Websites

PhishAri: Automatic Realtime Phishing Detection on Twitter

More or Less? Predict the Social Influence of Malicious URLs on Social Media

awesome-threat-intelligence

What we expect

Investigate potential features you could extract from the given URL and implement extractors for the ones that interest you the most. Below example code extracts one feature but does not store it very efficiently (just console logs it). Implement sensible data structure using some known data structure library to store the features per URL. Also consider how would you approach error handling if one feature extractor fails?

Be prepared to discuss questions such as: what features could indicate the malicousness of a given URL? What goes in to the thinking of the attacker when they are choosing a site for an attack? What would you develop next?

What we don't expect

Implement a humangous set of features.

Implement any kind of actual predicition models that uses the features to give predictions on malicousness at this stage :)



In [215]:
#added pandas in order to save the information into dataframe
import pandas as pd
import requests
import json
from urllib.parse import urlparse

In [216]:
def get_domain_age_in_days(domain):
    show = "https://input.payapi.io/v1/api/fraud/domain/age/" + domain
    data = requests.get(show).json()
    return data['result'] if 'result' in data else None

In [217]:
def parse_domain_from_url(url):
    t = urlparse(url).netloc
    return '.'.join(t.split('.')[-2:])


In [218]:
def analyze_url(url):
    # First feature, if domain is new it could indicate that the bad guy has bought it recently...
    age_in_days_feature = get_domain_age_in_days(parse_domain_from_url(url));
    # Hmm...maybe I could do something more sensible with the data than just printing out
    print(url, age_in_days_feature)

In [219]:
# Note some of these urls are live phishing sites (as of 2019-03-21) use with caution! More can be found at https://www.phishtank.com/
example_urls = ["https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus",
                "http://cartaobndes.gov.br.cv31792.tmweb.ru/",
                "https://paypal.co.uk.yatn.eu/m/",
                "http://college-eisk.ru/cli/",
                "https://dotpay-platnosc3.eu/dotpay/"
               ]
for url in example_urls: 
    analyze_url(url)

https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus 4733
http://cartaobndes.gov.br.cv31792.tmweb.ru/ 4647
https://paypal.co.uk.yatn.eu/m/ None
http://college-eisk.ru/cli/ 2715
https://dotpay-platnosc3.eu/dotpay/ None

SyntaxError: invalid syntax (<ipython-input-219-01ac7357dd7b>, line 11)

The below was done manually as I did not managed to find suitable easy way of using above code to insert the information via code. Might find some API, etc. in future to fix this, but for now it served the purpose.

In [220]:
#storing data in to dataframe from example_urls above. Making dict for this purpose.
analyzed_urls = {"url":["https://www.slideshare.net/weaveworks/client-side-monitoring-with-prometheus",
                              'http://cartaobndes.gov.br.cv31792.tmweb.ru/', 'https://paypal.co.uk.yatn.eu/m/',
                              'http://college-eisk.ru/cli/', 'https://dotpay-platnosc3.eu/dotpay/'],
                 "age":[ "4733", '4647', 'None', '2715', 'None']}

In [221]:
df=pd.DataFrame(analyzed_urls)

In [222]:
df

Unnamed: 0,url,age
0,https://www.slideshare.net/weaveworks/client-s...,4733.0
1,http://cartaobndes.gov.br.cv31792.tmweb.ru/,4647.0
2,https://paypal.co.uk.yatn.eu/m/,
3,http://college-eisk.ru/cli/,2715.0
4,https://dotpay-platnosc3.eu/dotpay/,


In [223]:
#domain lenght in url. Checking the urls
df['url']

0    https://www.slideshare.net/weaveworks/client-s...
1          http://cartaobndes.gov.br.cv31792.tmweb.ru/
2                      https://paypal.co.uk.yatn.eu/m/
3                          http://college-eisk.ru/cli/
4                  https://dotpay-platnosc3.eu/dotpay/
Name: url, dtype: object

In [224]:
# Adding the urllib.parse results into the dataframe.
urls=df['url']
df['protocol'],df['domain'],df['path'],df['query'],df['fragment'] =  zip(*[urllib.parse.urlsplit(x) for x in urls])

In [225]:
df

Unnamed: 0,url,age,protocol,domain,path,query,fragment
0,https://www.slideshare.net/weaveworks/client-s...,4733.0,https,www.slideshare.net,/weaveworks/client-side-monitoring-with-promet...,,
1,http://cartaobndes.gov.br.cv31792.tmweb.ru/,4647.0,http,cartaobndes.gov.br.cv31792.tmweb.ru,/,,
2,https://paypal.co.uk.yatn.eu/m/,,https,paypal.co.uk.yatn.eu,/m/,,
3,http://college-eisk.ru/cli/,2715.0,http,college-eisk.ru,/cli/,,
4,https://dotpay-platnosc3.eu/dotpay/,,https,dotpay-platnosc3.eu,/dotpay/,,


In [226]:
# I created a new column for len 
# passing values through str.len() by using Pandas str.len() method 
# My thought was to apply https://hcis-journal.springeropen.com/articles/10.1186/s13673-016-0064-3 
# results on approx url length to identify possible malicious url
df["url_lenght"]= df["domain"].str.len() 

In [227]:
df

Unnamed: 0,url,age,protocol,domain,path,query,fragment,url_lenght
0,https://www.slideshare.net/weaveworks/client-s...,4733.0,https,www.slideshare.net,/weaveworks/client-side-monitoring-with-promet...,,,18
1,http://cartaobndes.gov.br.cv31792.tmweb.ru/,4647.0,http,cartaobndes.gov.br.cv31792.tmweb.ru,/,,,35
2,https://paypal.co.uk.yatn.eu/m/,,https,paypal.co.uk.yatn.eu,/m/,,,20
3,http://college-eisk.ru/cli/,2715.0,http,college-eisk.ru,/cli/,,,15
4,https://dotpay-platnosc3.eu/dotpay/,,https,dotpay-platnosc3.eu,/dotpay/,,,19


In [228]:
df['url_lenght']

0    18
1    35
2    20
3    15
4    19
Name: url_lenght, dtype: int64

In [229]:
#check the once that are over 25 (check the link for more information), can be done other way as well.
df['equal_or_higher_than_25?'] = df['url_lenght'].apply(lambda x: 'True' if x >= 25 else 'False')
#The result indicates some having dupious lenght of domain name. Clearly not a strong indicator here.
print (df)

                                                 url   age protocol  \
0  https://www.slideshare.net/weaveworks/client-s...  4733    https   
1        http://cartaobndes.gov.br.cv31792.tmweb.ru/  4647     http   
2                    https://paypal.co.uk.yatn.eu/m/  None    https   
3                        http://college-eisk.ru/cli/  2715     http   
4                https://dotpay-platnosc3.eu/dotpay/  None    https   

                                domain  \
0                   www.slideshare.net   
1  cartaobndes.gov.br.cv31792.tmweb.ru   
2                 paypal.co.uk.yatn.eu   
3                      college-eisk.ru   
4                  dotpay-platnosc3.eu   

                                                path query fragment  \
0  /weaveworks/client-side-monitoring-with-promet...                  
1                                                  /                  
2                                                /m/                  
3                                   

At this point, I would extend the above analyze to other indicators, such as number of dots, slashes, host names, etc. 

One of the things I would take into consideration is redirecting URLs, which is also used by attackers. For this I would take a look how this could be verified, that the directed page is not suspicious. 

For Error handling: Try-Except-else Clause. For Urllib there used to be urllib.error, but nowadays requests is recommended way of handling errors.