# Exploring the StalkPhish.io REST API
StalkPhish.io is a hosted service based on the open-source StalkPhish software. StalkPhish is handy for gathering information about phishing sites and capturing copies of the phish kits they use. If you are researching phish kits, performing incident response for phishing incidents, or gathering threat intelligence on phishing campaigns or infrastructure, StalkPhish is really nice.

The hosted version has an API that you can use to search the information it gathers. If you have your own instance of StalkPhish, you can have it gather info on targets of your choice. With StalkPhish.io, though, you can search a large existing base of information.

## API Documentation
I used the following three sources to understand the API:
- The FAQ: https://www.stalkphish.io/faq/
- The API documentation: https://www.stalkphish.io/documentation/api/ 
- This Blog Post: https://stalkphish.com/2021/06/30/howto-stalkphish-io/

## Purpose of this Notebook
This notebook contains my experiments exploring the StalkPhish.io API and tests of how the API works. I have some general tests to ensure that my Python works correctly with the API, but I also have experiments to see what kind of data is available.

# Requirements
We require only a few Python libraries for our experiments.

- "os" is used to get environment variables
- "getpass" is used to allow the user to securely enter their API token
- "json" is used to format JSON output nicely
- "requests" is used to handle HTTP GET requests and responses

In [7]:
import os
import getpass
import json
import requests

# Define the StalkPhish.io API v1 Endpoints
Not all of the endpoints are currently usable. This is based on the API documentation found here:
https://www.stalkphish.io/documentation/api/

According to this blog we have access to the following endpoints on the free plan.
    https://stalkphish.com/2021/06/30/howto-stalkphish-io/

    /api/v1/me : Return informations about account linked to API key.
    /api/v1/last : Return n last results, with n depending on your subscription.
    /api/v1/search/url : Return results of string search appearing in a URL
    /api/v1/search/title : Return results of string search appearing in a website title.
    /api/v1/search/ipv4 : Return results of IPv4 search.
  

In [8]:
# Stalkphish API Endpoints for API v1
ep_base_url = 'https://www.stalkphish.io/api/v1'

# me: Return informations about account linked to API key. Limited to 100 request/day, out of profile quota.
ep_me    = f'{ep_base_url}/me'

# last: Return n last results, with n depending on your subscription.
ep_last  = f'{ep_base_url}/last'

# email: Return results of e-mail found in phishing kits.
ep_email = f'{ep_base_url}/search/email' # requires a search string appended to URL

# ipv4 : Return results of IPv4 search.
ep_ipv4  = f'{ep_base_url}/search/ipv4'  # requires a search string appended to URL

# title : Return results of string search appearing in a website title.
ep_title = f'{ep_base_url}/search/title' # requires a search string appended to URL

# url: Return results of string search appearing in a URL.
ep_url   = f'{ep_base_url}/search/url'   # requires a search string appended to URL

# brand: Return results of string search appearing in a brand name.
ep_brand = f'{ep_base_url}/search/brand' # requires a search string appended to URL
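
The search endpoints all take the search string as the last path segment, so a small helper can keep that construction in one place. This is only a sketch: `build_search_url` is a hypothetical name of my own, and percent-encoding the term with `urllib.parse.quote` is an assumption. It keeps spaces and slashes from breaking the path, but the API documentation does not say how it expects such characters to be encoded.

```python
from urllib.parse import quote

# Same base URL as in the cell above
ep_base_url = 'https://www.stalkphish.io/api/v1'

# Search types listed in the API documentation
SEARCH_KINDS = {'email', 'ipv4', 'title', 'url', 'brand'}

def build_search_url(kind, term):
    """Build a /search/<kind>/<term> URL with the term percent-encoded."""
    if kind not in SEARCH_KINDS:
        raise ValueError(f'Unknown search type: {kind}')
    # safe='' also encodes '/' so the term stays a single path segment
    # (assumption: the API accepts percent-encoded search strings)
    return f'{ep_base_url}/search/{kind}/{quote(term, safe="")}'

print(build_search_url('title', 'Mail Center'))
print(build_search_url('ipv4', '46.101.222.88'))
```

A term such as `Mail Center` becomes `Mail%20Center`, while a plain IPv4 address passes through unchanged.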

# API Authentication
The documentation at https://www.stalkphish.io/documentation/api/ seems to contradict the documentation at https://www.stalkphish.io/faq/. Perhaps I misinterpreted or misunderstood the API docs.

I found the instructions in the FAQ work. We need to set the "Authorization" header to include a string that starts with "Token " and ends with our API key.

I have also added a user-agent header to help the API operators identify this script should troubleshooting be required. While not required, it is good practice and polite when using a free API.

In [9]:
# Get the StalkPhish.io API Token from ENV or user input
if 'SP_TOKEN' in os.environ.keys():
    token = os.getenv('SP_TOKEN')
else:
    token = getpass.getpass('Enter your stalkphish.io Token:')

# When we send the token in the authorization header, we need to add the string "Token" in front.
authorization = f'Token {token}'

# It is polite to set a meaningful user-agent. If our script causes problems for the API this
# helps the operators troubleshoot so they can contact us about problems.
# In general RFC7231 says user-agent strings should follow this format:
#   User-Agent: <product> / <product-version> (<comment>)
user_agent = 'cyberlibrarian-stalkphish-test/1.0 (michael@cyberlibrarian.ca)'

# We need these headers at minimum, Authorization is most important
headers = {
    'Content-Type': "application/json",
    'Authorization': authorization,
    'Accept': '*/*',
    'User-Agent': user_agent
    }

Enter your stalkphish.io Token:········


# Tests of the API

## /api/v1/me test
This endpoint will fetch information about our account. It makes a good test to see if we authenticated correctly to the API. It is limited to 100 requests per day, so limit the number of times you test with it.

In [76]:
# Search for my account info
# We have already set the authentication information in our headers at the beginning
try:
    response = requests.request("GET", ep_me, headers=headers)
except Exception as e:
    print(f'Err: {e}')

### Information about our Account

In [77]:
print(f'Response Code: {response.status_code}')
for header, value in response.headers.items():
    print(f'{header}: {value}')

# Careful! The response JSON contains our API key which we DON'T want saved in the printed output.
try:
    response_json = response.json() # raises an error if response is not JSON
except Exception as e:
    print(f'Err: {e}')
else:
    # Mask the key before printing; this only runs when the JSON parsed successfully
    response_json[0]['api_key'] = 'XXXXXXXXXXXXXXXXXXXXXXXXXX'
    print(json.dumps(response_json, indent=2))


Response Code: 200
Server: nginx
Date: Tue, 01 Mar 2022 01:35:10 GMT
Content-Type: application/json
Content-Length: 145
Connection: keep-alive
Allow: GET, HEAD, OPTIONS
X-Frame-Options: DENY, SAMEORIGIN
X-Content-Type-Options: nosniff, nosniff
Referrer-Policy: same-origin, strict-origin-when-cross-origin
Cross-Origin-Opener-Policy: same-origin
Strict-Transport-Security: max-age=15768000
X-Xss-Protection: 1; mode=block
Feature-policy: accelerometer 'none'; camera 'none'; geolocation 'none'; gyroscope 'none'; magnetometer 'none'; microphone 'none'; payment 'none'; usb 'none'
Content-Security-Policy: default-src 'self' http: https: data: blob: 'unsafe-inline'
[
  {
    "username": "cyberlibrarian",
    "email": "michael@cyberlibrarian.ca",
    "api_key": "XXXXXXXXXXXXXXXXXXXXXXXXXX",
    "subscribed_plan": "Free"
  }
]
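
Since the account record carries the API key, it can be safer to mask it on a copy rather than editing the parsed response in place. A minimal sketch, assuming the response is a list of dicts as shown above; `redact` and its parameters are hypothetical names of my own, and the sample record is made up.

```python
import copy

def redact(records, secret_keys=('api_key',), mask='XXXXXXXX'):
    """Return a deep copy of records with the values of secret_keys masked."""
    cleaned = copy.deepcopy(records)
    for record in cleaned:
        for key in secret_keys:
            if key in record:
                record[key] = mask
    return cleaned

# Made-up record with the same shape as the /me response
sample = [{'username': 'example-user', 'api_key': 'not-a-real-key', 'subscribed_plan': 'Free'}]
print(redact(sample))
```

The original list is left untouched, so the real key is still available to later cells if they need it.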


## /api/v1/last test
The documentation says this returns the last "n" results with "n" depending on your subscription. 

It is unclear from the documentation what this means. What results? Whose results? Let's test it and find out.

Conclusion: This appears to be system-wide: the last "n" records added to stalkphish.io. Interesting.

In [31]:
try:
    response = requests.request("GET", ep_last, headers=headers)
except Exception as e:
    print(f'Err: {e}')

### The last records added to stalkphish.io

In [34]:
print(f'Number of results fetched: {len(response.json())}')
print(json.dumps(response.json(), indent=2))

Number of results fetched: 30
[
  {
    "siteurl": "https://www.craftservicench.ru/",
    "sitedomain": "www.craftservicench.ru",
    "pagetitle": "The Sandbox - A Decentralized Gaming Platform Made By Players",
    "firstseentime": "2022-02-28T23:21:00Z",
    "firstseencode": "200",
    "ipaddress": "178.208.83.42",
    "asn": "210079",
    "asndesc": "EUROBYTE Eurobyte LLC, RU",
    "asnreg": "ripencc",
    "extracted_emails": null
  },
  {
    "siteurl": "https://biswap.pmsuvidha.com/",
    "sitedomain": "biswap.pmsuvidha.com",
    "pagetitle": "School Bus Simulator Driving",
    "firstseentime": "2022-02-28T23:20:53Z",
    "firstseencode": "aborted",
    "ipaddress": "45.156.22.26",
    "asn": "56971",
    "asndesc": "CLOUDBACKBONE, RU",
    "asnreg": "ripencc",
    "extracted_emails": null
  },
  {
    "siteurl": "https://connecti-auonei-jp.obs0eq5.cn/",
    "sitedomain": "connecti-auonei-jp.obs0eq5.cn",
    "pagetitle": null,
    "firstseentime": "2022-02-28T23:20:45Z",
    "firs

### Observations on /last test
Generally, the API results seem to be in this format. Will all endpoints produce this same type of record or will more context be available?

`
  {
    "siteurl": "http://openseca.com",
    "sitedomain": "openseca.com",
    "pagetitle": null,
    "firstseentime": "2022-02-28T23:14:18Z",
    "firstseencode": "timeout",
    "ipaddress": "80.66.64.192",
    "asn": "57416",
    "asndesc": "INSTARS, RU",
    "asnreg": "ripencc",
    "extracted_emails": null
  },
`
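
To make records like this easier to work with, they can be loaded into a small typed structure. This is a sketch based only on the fields observed above: `PhishRecord` and `parse_record` are hypothetical names of my own, and the field types are guesses from the sample values (for example, `extracted_emails` has only ever been null so far).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional

@dataclass
class PhishRecord:
    siteurl: str
    sitedomain: str
    pagetitle: Optional[str]
    firstseentime: datetime
    firstseencode: str          # an HTTP code as text, or e.g. "timeout" / "aborted"
    ipaddress: Optional[str]
    asn: Optional[str]
    asndesc: Optional[str]
    asnreg: Optional[str]
    extracted_emails: Any       # only null observed so far, real type unknown

def parse_record(raw):
    """Convert one API record (a dict) into a PhishRecord.

    Raises TypeError if the record has missing or unexpected fields,
    which is a useful signal that an endpoint uses a different format.
    """
    fields = dict(raw)
    # Timestamps look like 2022-02-28T23:14:18Z, i.e. UTC
    fields['firstseentime'] = datetime.strptime(
        raw['firstseentime'], '%Y-%m-%dT%H:%M:%SZ'
    ).replace(tzinfo=timezone.utc)
    return PhishRecord(**fields)

sample = {
    "siteurl": "http://openseca.com",
    "sitedomain": "openseca.com",
    "pagetitle": None,
    "firstseentime": "2022-02-28T23:14:18Z",
    "firstseencode": "timeout",
    "ipaddress": "80.66.64.192",
    "asn": "57416",
    "asndesc": "INSTARS, RU",
    "asnreg": "ripencc",
    "extracted_emails": None,
}
record = parse_record(sample)
print(record.sitedomain, record.firstseentime.isoformat())
```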

## /api/v1/search/email/ test
This endpoint searches for e-mail addresses found in phishing kits. It is unclear whether this means addresses referenced in the phish kit code (e.g. addresses used to collect phished credentials), addresses found in victim dumps inside captured phish kits, or addresses referenced in phishing URLs (some phishing URLs contain the victim's email address).

In our test we are going to search for a list of email addresses. These fall into several categories:
- Addresses that have sent phishing links in the past (compromised or attacker emails)
- Addresses that have been observed in phishing links (known victim emails)
- Addresses that have received phishing emails (but are not mentioned in actual phishing URLs)
- Addresses that have sent maldocs (compromised or attacker emails)
- Addresses that have sent test emails (during preparation of new gmail.com accounts)

### Test 1: with 2 emails: one known to send phish, one known safe

In [68]:
# Test emails
emails = [
    'michael@cyberlibrarian.ca',
    'thamaraiselvan.m@ionexchange.co.in'
]

# Make a separate query for each email. 
responses = []
for email in emails:
    url = f'{ep_email}/{email}'
    try:
        response = requests.request("GET", url, headers=headers)
        responses.append(response)
    except Exception as e:
        print(f'Err: {e}')

In [75]:
print(f'Number of responses: {len(responses)}')
for response in responses:
    print(f'Response Status: {response.status_code}, {response.json()["detail"]}')

Number of responses: 2
Response Status: 429, Request was throttled. Expected available in 81789 seconds.
Response Status: 429, Request was throttled. Expected available in 80164 seconds.


### Observations on /email test
I guess this does not work on the free plan. The API responded with: "You don't have access to this search option with you profile"

Initially we sent 14 requests and ended up throttled after a few attempts to troubleshoot.

response.status_code=429 means we were throttled. What other status_codes are there? 
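
The throttle message includes how long we have to wait, so it is worth pulling that number out before retrying. A minimal sketch, assuming the message always has the wording seen in the output above; `throttle_wait` is a hypothetical helper of my own.

```python
import re
from datetime import timedelta

# Matches the wording observed in the 429 responses above
THROTTLE_PATTERN = re.compile(r'Expected available in (\d+) seconds')

def throttle_wait(detail):
    """Return the wait as a timedelta, or None if the message does not match."""
    match = THROTTLE_PATTERN.search(detail)
    if match is None:
        return None
    return timedelta(seconds=int(match.group(1)))

wait = throttle_wait('Request was throttled. Expected available in 81789 seconds.')
print(f'Throttled for roughly {wait.total_seconds() / 3600:.1f} hours')
```

With waits this long (close to a full day), sleeping and retrying inside the notebook is not practical; it is better to stop and come back later.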

## /api/v1/search/ipv4/ test
This endpoint allows us to search by IPv4 address. But are these the IPv4 addresses of servers hosting phishing sites? Or IPv4 addresses associated with phishing email senders?

In [48]:
# Search for IPv4 (46.101.222.88)
ips = ['46.101.222.88']
for ip in ips:
    url = f'{ep_ipv4}/{ip}'
    try:
        response = requests.request("GET", url, headers=headers)
    except Exception as e:
        print(f'Err: {e}')

In [49]:
print(json.dumps(response.json(), indent=2))

{
  "detail": "Request was throttled. Expected available in 85165 seconds."
}
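
Because every request counts against the daily quota, it helps to check that a value really is an IPv4 address before sending it. A small sketch using the standard library `ipaddress` module; `is_valid_ipv4` is a hypothetical helper of my own, and the extra candidate values are made up for illustration.

```python
import ipaddress

def is_valid_ipv4(value):
    """Return True if value is a well-formed dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
    except ipaddress.AddressValueError:
        return False
    return True

# Only the first entry comes from the test above; the others are made-up bad inputs
candidates = ['46.101.222.88', '999.1.1.1', 'not-an-ip']
ips = [ip for ip in candidates if is_valid_ipv4(ip)]
print(ips)
```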


## /api/v1/search/title test
This searches page titles, presumably those of phishing sites. This would be great for cases where we have found a phishing site and want to see where else it has been used.

For our test we will use several strings for titles we believe are unique: representing things we expect and DO NOT expect to find in phishing sites.

In [81]:
# Search for Page Titles
titles = [
    'cyberlibrarian',
    'wawanesa',
    'O365',
    'Mail Center',
    'Dropbox'
]
responses = []
for title in titles:
    url = f'{ep_title}/{title}'
    try:
        response = requests.request("GET", url, headers=headers)
        responses.append(response)
    except Exception as e:
        print(f'Err: {e}')

In [83]:
print(f'Number of responses: {len(responses)}')
for response in responses:
    print(f'Response Status: {response.status_code}, {response.json()}')

Number of responses: 5
Response Status: 401, {'error': "You don't have access to this search option with you profile"}
Response Status: 401, {'error': "You don't have access to this search option with you profile"}
Response Status: 429, {'detail': 'Request was throttled. Expected available in 80423 seconds.'}
Response Status: 429, {'detail': 'Request was throttled. Expected available in 80422 seconds.'}
Response Status: 401, {'error': "You don't have access to this search option with you profile"}


### Observations for /title test

response.status_code=401 occurs when you don't have access to this search option.
The response will contain a JSON/dict item called "error" with value "You don't have access to this search option with you profile"
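
So far we have seen three kinds of response: 200 with a list of records, 401 with an "error" key, and 429 with a "detail" key. Because the two error shapes use different keys, indexing `["detail"]` directly would fail on a 401. A minimal sketch of handling all three; `describe_response` is a hypothetical helper of my own that takes the status code and the already-parsed JSON body, and other status codes may well exist.

```python
def describe_response(status_code, body):
    """Summarise a StalkPhish.io response in one line.

    status_code: the HTTP status code (int)
    body: the parsed JSON body (a list on success, a dict on the errors seen so far)
    """
    if status_code == 200:
        return f'ok: {len(body)} records'
    if status_code == 401:
        return f'no access: {body.get("error", "no error message")}'
    if status_code == 429:
        return f'throttled: {body.get("detail", "no detail given")}'
    return f'unexpected status {status_code}: {body}'

print(describe_response(401, {'error': "You don't have access to this search option with you profile"}))
print(describe_response(429, {'detail': 'Request was throttled. Expected available in 80423 seconds.'}))
```

In the notebook this would be called as `describe_response(response.status_code, response.json())`.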


## /api/v1/search/url test
This endpoint allows us to search URLs. It is not clear if we have to supply a full URL or a string to search for in a URL. I think "string to search for".

So for our first experiment we will search for a single string we expect to find results for.

In [84]:
# Search for URLs containing the string "o365"

url_string = 'o365'

url = f'{ep_url}/{url_string}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [89]:
print(f'Response Status: {response.status_code}')
print(f'Headers: {response.headers}')
print(f'Number of results: {len(response.json())}')
print(json.dumps(response.json(), indent=2))

Response Status: 200
Headers: {'Server': 'nginx', 'Date': 'Tue, 01 Mar 2022 02:05:29 GMT', 'Content-Type': 'application/json', 'Content-Length': '33993', 'Connection': 'keep-alive', 'Allow': 'GET, HEAD, OPTIONS', 'X-Frame-Options': 'DENY, SAMEORIGIN', 'X-Content-Type-Options': 'nosniff, nosniff', 'Referrer-Policy': 'same-origin, strict-origin-when-cross-origin', 'Cross-Origin-Opener-Policy': 'same-origin', 'Strict-Transport-Security': 'max-age=15768000', 'X-Xss-Protection': '1; mode=block', 'Feature-policy': "accelerometer 'none'; camera 'none'; geolocation 'none'; gyroscope 'none'; magnetometer 'none'; microphone 'none'; payment 'none'; usb 'none'", 'Content-Security-Policy': "default-src 'self' http: https: data: blob: 'unsafe-inline'"}
Number of results: 50
[
  {
    "siteurl": "https://login.microsoftonline.com/common/oauth2/authorize%3Fclient_id%3D00000002-0000-0ff1-ce00-000000000000%26redirect_uri%3Dhttps%253a%252f%252femail.o365.autodesk.com%252fowa%252f%26resource%3D00000002-00

### Observations for /url test
We got 50 results for our simple test. Was that because we are limited to 50 results? Or because that was how many there were? We will need to test some more to find out.

The format of results is shown below and is the same format as the /last results we saw earlier. So far, every endpoint that returned data has produced the same record type, though we have only seen actual results from /last and /search/url.

`
{
    "siteurl": "https://login.microsoftonline.com/common/oauth2/authorize%3Fclient_id%3D00000002-0000-0ff1-ce00-000000000000%26redirect_uri%3Dhttps%253a%252f%252femail.o365.autodesk.com%252fowa%252f%26resource%3D00000002-0000-0ff1-ce00-000000000000%26response_mode%3Dform_post%26response_type%3Dcode%2Bid_token%26scope%3Dopenid%26msafed%3D0%26msaredir%3D0%26client-request-id%3D14e1a2bb-f887-0e4f-7951-b9009dea7b6f%26protectedtoken%3Dtrue%26claims%3D%257b%2522id_token%2522%253a%257b%2522xms_cc%2522%253a%257b%2522values%2522%253a%255b%2522CP1%2522%255d%257d%257d%257d%26nonce%3D637816299138232509.6d434649-a21f-4718-97dc-0f1c74fe5bdf%26state%3DfU9dT4MwAAT9LdsbjkIL9IGYYotBVxdI4xxvfBQzBhIHUrL_uf9j5w_wkrvkLncPZxqGca95p2naWgzfc_0AeA7GwA0c10E2fvBq6EIPYqtwQGNBHwQW9uvKshtQ-bCRqKwbU2-v5mZQxeYxmWSf0JAQfiI0veRt3PP-pd0K5nLBpjdaXQ5HG-1EprOozWne7QSDeZuqQxo9kz88DeSV8q8SKYFY7rMyWz7eVaSilRPvoSoXtD-LYjuzPtFJLCemR-yb3Bwh_7X4CZUR1u2VS9dymcefcxeC9XyUqh9q2YWZLGoux7H4lLcrvw%26sso_reload%3Dtrue",
    "sitedomain": "login.microsoftonline.com",
    "pagetitle": "Bad Request",
    "firstseentime": "2022-02-28T07:45:06Z",
    "firstseencode": "400",
    "ipaddress": "20.190.160.132",
    "asn": "8075",
    "asndesc": "MICROSOFT-CORP-MSN-AS-BLOCK, US",
    "asnreg": "arin",
    "extracted_emails": null
  },
`
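
Rather than comparing record formats by eye, we can compare the field names directly. A small sketch; `schema_differences` is a hypothetical helper of my own, and the two records below are abbreviated to their keys (the values are placeholders, not real data).

```python
def schema_differences(reference, other):
    """Compare the field names of two records.

    Returns a dict listing fields missing from `other` and extra fields in `other`.
    """
    ref_keys, other_keys = set(reference), set(other)
    return {
        'missing': sorted(ref_keys - other_keys),
        'extra': sorted(other_keys - ref_keys),
    }

# Field names as observed in the /last and /search/url results
FIELDS = ['siteurl', 'sitedomain', 'pagetitle', 'firstseentime', 'firstseencode',
          'ipaddress', 'asn', 'asndesc', 'asnreg', 'extracted_emails']
last_record = dict.fromkeys(FIELDS)   # placeholder values
url_record = dict.fromkeys(FIELDS)    # placeholder values

print(schema_differences(last_record, url_record))
```

Two empty lists mean the records share the same fields. Running this against real responses from the other endpoints, once they are accessible, would confirm or refute the "one format" guess.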

An important question is, "What do these records represent?" What is the taxonomy?

Take a look at the record above. Is the siteurl the URL for a phishing site? Or for a URL that redirects to a phishing site? It looks like it redirects to an autodesk.com URL.
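
To check where a record like this points, we can decode the siteurl and read its `redirect_uri` parameter. A minimal sketch using `urllib.parse`; `extract_redirect_uri` is a hypothetical helper of my own, and the sample string is a shortened prefix of the siteurl above (cut after the `redirect_uri` value). It assumes the query string was percent-encoded once as a whole, as it appears to be in this record.

```python
from urllib.parse import parse_qs, unquote, urlsplit

def extract_redirect_uri(siteurl):
    """Return the decoded redirect_uri parameter of a siteurl, or None."""
    # The record stores '?', '=' and '&' as %3F, %3D and %26, so undo that layer first
    decoded = unquote(siteurl)
    params = parse_qs(urlsplit(decoded).query)  # parse_qs decodes the inner layer
    values = params.get('redirect_uri')
    return values[0] if values else None

# Shortened prefix of the siteurl from the record above
sample_siteurl = (
    'https://login.microsoftonline.com/common/oauth2/authorize'
    '%3Fclient_id%3D00000002-0000-0ff1-ce00-000000000000'
    '%26redirect_uri%3Dhttps%253a%252f%252femail.o365.autodesk.com%252fowa%252f'
)
redirect = extract_redirect_uri(sample_siteurl)
print(redirect)
print(urlsplit(redirect).hostname)
```

This only reads what is written in the URL. It does not tell us whether the site actually redirects there or what the victim ends up seeing.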

When I am investigating a phish, I often want to distinguish the URL delivered to the victim (via email for example), all of the domains that redirect, and the final "landing page" URL the victim is redirected to. In addition to that, some of these pages are just MITM proxies, and whenever possible it is nice to know the site they are proxying for.