# Exploring the StalkPhish.io REST API
StalkPhish.io is a hosted service based on the open-source StalkPhish software. StalkPhish is handy for gathering information about phishing sites and capturing copies of the phish kits they use. If you are researching phish kits, performing incident response for phishing incidents, or gathering threat intelligence on phishing campaigns or infrastructure, StalkPhish is really nice.

The hosted version has an API that you can use to search the information it gathers. If you have your own instance of StalkPhish, you can have it gather info on targets of your choice. With StalkPhish.io, you can instead search a large existing base of information.

## API Documentation
I used the following three sources to understand the API:
- The FAQ: https://www.stalkphish.io/faq/
- The API documentation: https://www.stalkphish.io/documentation/api/ 
- This Blog Post: https://stalkphish.com/2021/06/30/howto-stalkphish-io/

## Purpose of this Notebook
This notebook contains my experiments exploring the StalkPhish.io API and tests of how the API works. I have some general tests to ensure that my Python works correctly with the API, but I also have experiments to see what kind of data is available.

# Requirements
We require only a few Python libraries for our experiments.

- "os" is used to get environment variables
- "time" is used for time.sleep() so we can wait between multiple API calls
- "getpass" is used to allow the user to securely enter their API token
- "json" to format JSON output nicely.
- "requests" to handle HTTP GET requests and responses

In [1]:
import os
import time
import getpass
import json
import requests

# Define the StalkPhish.io API v1 Endpoints
Not all of the endpoints are currently usable. This is based on the API documentation found here:
https://www.stalkphish.io/documentation/api/

According to this blog we have access to the following endpoints on the free plan.
    https://stalkphish.com/2021/06/30/howto-stalkphish-io/

    /api/v1/me : Return informations about account linked to API key.
    /api/v1/last : Return n last results, with n depending on your subscription.
    /api/v1/search/url : Return results of string search appearing in a URL
    /api/v1/search/title : Return results of string search appearing in a website title.
    /api/v1/search/ipv4 : Return results of IPv4 search.
  

In [2]:
# Stalkphish API Endpoints for API v1
ep_base_url = 'https://www.stalkphish.io/api/v1'

# me: Return informations about account linked to API key. Limited to 100 request/day, out of profile quota.
ep_me    = f'{ep_base_url}/me'

# last: Return n last results, with n depending on your subscription.
ep_last  = f'{ep_base_url}/last'

# email: Return results of e-mail found in phishing kits.
ep_email = f'{ep_base_url}/search/email' # requires a search string appended to URL

# ipv4 : Return results of IPv4 search.
ep_ipv4  = f'{ep_base_url}/search/ipv4'  # requires a search string appended to URL

# title : Return results of string search appearing in a website title.
ep_title = f'{ep_base_url}/search/title' # requires a search string appended to URL

# url: Return results of string search appearing in a URL.
ep_url   = f'{ep_base_url}/search/url'   # requires a search string appended to URL

# brand: Return results of string search appearing in a brand name.
ep_brand = f'{ep_base_url}/search/brand' # requires a search string appended to URL

# API Authentication
The documentation at https://www.stalkphish.io/documentation/api/ seems to contradict the documentation at https://www.stalkphish.io/faq/. Perhaps I misinterpreted or misunderstood the API docs.

I found the instructions in the FAQ work. We need to set the "Authorization" header to include a string that starts with "Token " and ends with our API key.

I have also added a user-agent header to help the API operators identify this script should troubleshooting be required. While not required, it is good practice and polite when using a free API.

In [3]:
# Get the StalkPhish.io API Token from ENV or user input
if 'SP_TOKEN' in os.environ.keys():
    token = os.getenv('SP_TOKEN')
else:
    token = getpass.getpass('Enter your stalkphish.io Token:')

# When we send the token in the authorization header, we need to add the string "Token" in front.
authorization = f'Token {token}'

# It is polite to set a meaningful user-agent. If our script causes problems for the API this
# helps the operators troubleshoot so they can contact us about problems.
# In general, RFC 7231 says user-agent strings should follow this format:
#   User-Agent: <product>/<product-version> (<comment>)
user_agent = 'cyberlibrarian-stalkphish-test/1.0 (michael@cyberlibrarian.ca)'

# We need these headers at minimum, Authorization is most important
headers = {
    'Content-Type': "application/json",
    'Authorization': authorization,
    'Accept': '*/*',
    'User-Agent': user_agent
    }

Enter your stalkphish.io Token:········


# Tests of the API

## /api/v1/me test
This endpoint will fetch information about our account. It makes a good test to see if we authenticated correctly to the API. It is limited to 100 requests per day, so limit the number of times you test with it.

In [51]:
# Search for my account info
# We have already set the authentication information in our headers at the beginning
try:
    response = requests.request("GET", ep_me, headers=headers)
except Exception as e:
    print(f'Err: {e}')

### Information about our Account

In [5]:
try:
    response_json = response.json() # raises an error if response is not JSON
except Exception as e:
    print(f'Err: {e}')

# Careful! The response JSON contains our API key which we DON'T want saved in the printed output.
response_json[0]['api_key'] = 'XXXXXXXXXXXXXXXXXXXXXXXXXX'
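
The masking step above assumes the response is a list with at least one record. As a minimal sketch, a slightly more defensive version could copy the payload and mask the key wherever it appears, so that an error response shaped like `{'error': ...}` does not raise. The helper name `redact_api_key` and the sample data are hypothetical and not part of any StalkPhish client.

```python
import copy


def redact_api_key(payload, key="api_key", mask="X" * 26):
    """Return a deep copy of payload with every `key` value replaced by mask.

    Hypothetical helper: accepts a list of records (success case) or a
    single dict such as {'error': '...'} (failure case) without raising.
    """
    redacted = copy.deepcopy(payload)
    records = redacted if isinstance(redacted, list) else [redacted]
    for record in records:
        if isinstance(record, dict) and key in record:
            record[key] = mask
    return redacted


# Made-up data shaped like the /me response
sample = [{"username": "someone", "api_key": "not-a-real-key"}]
safe = redact_api_key(sample)
print(safe[0]["api_key"])
```

Because the helper works on a copy, the original payload is left untouched, which may or may not be what you want in a notebook where the key could still be printed by accident.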

In [6]:
# print the response code we received
print(f'Response Code: {response.status_code}')

Response Code: 200


In [7]:
# print the response we got from the /me endpoint
print(json.dumps(response_json, indent=2))

[
  {
    "username": "cyberlibrarian",
    "email": "michael@cyberlibrarian.ca",
    "api_key": "XXXXXXXXXXXXXXXXXXXXXXXXXX",
    "subscribed_plan": "Standard"
  }
]


In [8]:
# print the headers from the response
print("\n".join([f'{header}: {response.headers[header]}' for header in response.headers]))

Server: nginx
Date: Sat, 29 Oct 2022 21:33:36 GMT
Content-Type: application/json
Content-Length: 149
Connection: keep-alive
Allow: GET, HEAD, OPTIONS
X-Frame-Options: DENY, SAMEORIGIN
X-Content-Type-Options: nosniff, nosniff
Referrer-Policy: same-origin, strict-origin-when-cross-origin
Cross-Origin-Opener-Policy: same-origin
Strict-Transport-Security: max-age=15768000
X-Xss-Protection: 1; mode=block
Feature-policy: accelerometer 'none'; camera 'none'; geolocation 'none'; gyroscope 'none'; magnetometer 'none'; microphone 'none'; payment 'none'; usb 'none'
Content-Security-Policy: default-src 'self' http: https: data: blob: 'unsafe-inline'


### 2022-04-01 Observations for /me test 
It would be nice if the JSON returned some information about which tier of service the account is in, or how many API lookups are left for the day. More status information would be very helpful.

I expect the "Content-Type" header to always be "application/json", but when there are errors does it sometimes return plain text or HTML?

The Response code should be explored. The API documentation says there are three values: 200, 401, 429.
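
As a starting point for exploring those codes, the sketch below labels a status code with its standard HTTP reason phrase and notes whether it is one of the three values the documentation lists. The function name `describe_status` and the constant `DOCUMENTED_CODES` are my own, and the phrases are generic HTTP meanings rather than anything specific to StalkPhish.io.

```python
from http import HTTPStatus

# The three values the API documentation mentions
DOCUMENTED_CODES = (200, 401, 429)


def describe_status(status_code):
    """Return a short, human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Not a status code the standard library knows about
        phrase = "Unknown"
    note = "documented" if status_code in DOCUMENTED_CODES else "not documented"
    return f"{status_code} {phrase} ({note})"


for code in DOCUMENTED_CODES:
    print(describe_status(code))
```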

### 2022-10-26 Observations for /me test
The "subscribed_plan" returned in /me is a really nice touch.
Hmmm, maybe there should be a returned value for subscription start and end dates? Or just an expiry date? 
I don't think the API supports expiry dates, but if API keys ever have expiry dates, the /me endpoint might be a nice place to put that information too.

I still have not explored the different response codes, but so far I think this works just fine.

## /api/v1/last test
The documentation says this returns the last "n" results with "n" depending on your subscription. 

It is unclear from the documentation what this means. What results? Whose results? Let's test it and find out.

Conclusion: This appears to be system-wide: the latest records added to stalkphish.io. Interesting.

In [9]:
try:
    response = requests.request("GET", ep_last, headers=headers)
except Exception as e:
    print(f'Err: {e}')

### The last records added to stalkphish.io

In [10]:
print(f'Number of results fetched: {len(response.json())}')

Number of results fetched: 50


In [11]:
print(json.dumps(response.json(), indent=2))

[
  {
    "siteurl": "https://boredapeyachtclub.lives-premints.com/",
    "sitedomain": "boredapeyachtclub.lives-premints.com",
    "pagetitle": "",
    "firstseentime": "2022-10-29T21:30:13Z",
    "firstseencode": "aborted",
    "ipaddress": "",
    "asn": "",
    "asndesc": "",
    "asnreg": "",
    "extracted_emails": "",
    "GoogleSafebrowsing": "",
    "phishing_score": "",
    "certificate": [
      {
        "SSLcert_countryName": "Non",
        "SSLcert_Issuer": "None",
        "SSLcert_commonName": "None",
        "SSLcert_notBefore": "None",
        "SSLcert_notAfter": "None",
        "SSLcert_subjectAltName": "None",
        "SSLcert_serialNumber_hex": "None",
        "SSLcert_md5": "None",
        "SSLcert_sha1": "None",
        "SSLcert_sha256": "None"
      }
    ]
  },
  {
    "siteurl": "https://underground.mint-nftfree.com",
    "sitedomain": "underground.mint-nftfree.com",
    "pagetitle": "",
    "firstseentime": "2022-10-29T21:30:10Z",
    "firstseencode": "200",
 

### 2022-04-01 Observations on /last test
The documentation says this endpoint returns the last "n" results for your account's access level. Apparently "30" is the limit for the free account. n=30

Generally, the API results seem to be in this format. Will all endpoints produce this same type of record or will more context be available?

`
  {
    "siteurl": "http://openseca.com",
    "sitedomain": "openseca.com",
    "pagetitle": null,
    "firstseentime": "2022-02-28T23:14:18Z",
    "firstseencode": "timeout",
    "ipaddress": "80.66.64.192",
    "asn": "57416",
    "asndesc": "INSTARS, RU",
    "asnreg": "ripencc",
    "extracted_emails": null
  },
`
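
To work with records like this one, it may help to turn the string fields into typed values. The sketch below parses "firstseentime" into a timezone-aware datetime and converts "firstseencode" to an integer only when it is numeric, since the samples in this notebook also contain values such as "timeout" and "aborted". The helper names are hypothetical, and the timestamp format is inferred from the sample records rather than from the API documentation.

```python
from datetime import datetime, timezone


def parse_first_seen(record):
    """Parse 'firstseentime' (e.g. '2022-02-28T23:14:18Z') as a UTC datetime."""
    parsed = datetime.strptime(record["firstseentime"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def http_status(record):
    """Return 'firstseencode' as an int, or None for values like 'timeout'."""
    code = record.get("firstseencode") or ""
    return int(code) if code.isdigit() else None


record = {
    "siteurl": "http://openseca.com",
    "firstseentime": "2022-02-28T23:14:18Z",
    "firstseencode": "timeout",
}
print(parse_first_seen(record).isoformat(), http_status(record))
```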

### 2022-10-29 Observations on /last test
I'm testing with a trial of the commercial account. Indeed, the limit is higher. I retrieved 50 records this time with /last.

I don't remember seeing the certificate data the last time I tested. That's REALLY nice to have. Example:
`
"certificate": [
      {
        "SSLcert_countryName": "US",
        "SSLcert_Issuer": "C=US, O=Let's Encrypt, CN=R3",
        "SSLcert_commonName": "www.gitlab.git.git.git.git.service.sunnytube.net",
        "SSLcert_notBefore": "2022-10-29T18:02:09",
        "SSLcert_notAfter": "2023-01-27T18:02:08",
        "SSLcert_subjectAltName": "www.gitlab.git.git.git.git.service.sunnytube.net",
        "SSLcert_serialNumber_hex": "0x39daf067d0d1f5b35d4e48e5e885168760b",
        "SSLcert_md5": "2A646E4585EB4D9C7A7DF6E529BBD9CB",
        "SSLcert_sha1": "BF9B61D0A59596B511B3A8E4100FCF09A869E0FC",
        "SSLcert_sha256": "699D0A92981C2C1B22EAA21E272F3ED7B241EF601A0A4D73E81EFADE92A44E23"
      }
`
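
One thing the certificate block makes possible is checking how long a certificate is valid for. The sketch below computes the lifetime in whole days from "SSLcert_notBefore" and "SSLcert_notAfter". The function name is hypothetical, the timestamp format is inferred from the sample above, and missing values are handled for both forms seen in this notebook (JSON null and the string "None").

```python
from datetime import datetime

CERT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"
MISSING = (None, "", "None")  # both null and the string "None" appear in results


def certificate_lifetime_days(cert):
    """Return the certificate validity period in whole days, or None if unknown."""
    not_before = cert.get("SSLcert_notBefore")
    not_after = cert.get("SSLcert_notAfter")
    if not_before in MISSING or not_after in MISSING:
        return None
    start = datetime.strptime(not_before, CERT_TIME_FORMAT)
    end = datetime.strptime(not_after, CERT_TIME_FORMAT)
    return (end - start).days


cert = {
    "SSLcert_notBefore": "2022-10-29T18:02:09",
    "SSLcert_notAfter": "2023-01-27T18:02:08",
}
print(certificate_lifetime_days(cert))
```

The "firstseentime" of a record could be compared with "SSLcert_notBefore" in the same way to see how soon after issuance a site was observed.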

In my original test last year, there were examples of records with extracted emails. I did not see many of those. What are they extracted from? Was the phish kit captured, and did it include those emails? Were these the emails used to send this phishing URL? Were these victims?

I probably need to read the StalkPhish source code to figure this out.

`
{
    "siteurl": "https://skart.co.in/admin/webmail.cpanel.net/user/cp.user.sign_in/auth/cpanel_mailbox/index.htm",
    "sitedomain": "skart.co.in",
    "pagetitle": "Webmail Login",
    "firstseentime": "2021-12-16T18:33:24Z",
    "firstseencode": "200",
    "ipaddress": "43.225.53.210",
    "asn": "394695",
    "asndesc": "PUBLIC-DOMAIN-REGISTRY, US",
    "asnreg": "apnic",
    "extracted_emails": "pentestmonkey@pentestmonkey.net, openfoxxthemes@gmail.com, check@isnotspam.com, leafot@gmail.com, anthon.pang@gmail.com, traveltino@gmail.com, oyejorge@gmail.com, nicolas.francois@frog-labs.com, Contact@company.com, email@storeaddress.com, info@OpenCartArab.com, florinpatan@gmail.com, fabien@symfony.com, andrew@noop.lv, arnaud.lb@gmail.com, martin.hason@gmail.com, drak@zikula.org, chabotc@google.com, slangley@google.com, beaton@google.com, jon.wayne.parrott@gmail.com, someuser@example.com, chirags@google.com, elsigh@google.com, openfoxin@gmail.com, user@example.com, iam.asm89@gmail.com, stloyd@gmail.com, fran6co@gmail.com, zoujingli@qq.com, BackEndTea@gmail.com, p@tchwork.com, a.aitboudad@gmail.com, kontakt@beberlei.de, michelsalib@hotmail.com, jeanfrancois.simon@sensiolabs.com, contact@jfsimon.fr, michael.lee@zerustech.com, marcosdsanchez@gmail.com, bschussek@gmail.com, clemens@build2be.nl, gtelegin@gmail.com, umpirsky@gmail.com, benjamin.dulau@gmail.com, daniel@danielholmes.org, manu@sprain.ch, michael.vhirsch@gmail.com, miha.vrhovnik@pagein.si, colinodell@gmail.com, thewholelifetolearn@gmail.com, t.nagel@infinite.net.au, florian@eckerstorfer.org, aj@garcialagar.es, bschussek@symfony.com, example@example.co.uk, fabien_potencier@example.fr, foo@example.com, foo@bar.fr, password@symfony.com, pass.word@symfony.com, user-name@symfony.com, error@example.com, florian@voutzinos.com, mallluhuct@gmail.com, naderman@naderman.de, j.boggiano@seld.be, hallsten@me.com, support@divido.com, smith@example.com, mike.jones@example.com, old.email@example.com, new.email@example.com, joe.martin@example.com, timmy@example.com, jane.doe@example.com, joe@bloggs.com, billgates@outlook.com, john.doe@example.com, check@this.com, dan@example.com, name@email.com, payer@example.com, payee@example.com, sandworm@example.com, john@smith.com"
  },
`
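
Many of the addresses in that record use reserved documentation domains such as example.com, which hints that at least some of them come from sample data or source code comments rather than from people. As a rough sketch, the field can be split into a list and the obvious placeholders dropped before any analysis. The helper names and the choice of domains to drop are my own assumptions; the comma-separated format is inferred from the sample above.

```python
# Second-level domains reserved for documentation (RFC 2606)
RESERVED_DOMAINS = {"example.com", "example.net", "example.org"}


def split_extracted_emails(value):
    """Split the comma-separated 'extracted_emails' field; null or '' gives []."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def drop_placeholder_emails(emails):
    """Remove addresses whose domain is a reserved documentation domain."""
    kept = []
    for email in emails:
        domain = email.rsplit("@", 1)[-1].lower()
        if domain not in RESERVED_DOMAINS:
            kept.append(email)
    return kept


field = "pentestmonkey@pentestmonkey.net, someuser@example.com, user@example.com, leafot@gmail.com"
print(drop_placeholder_emails(split_extracted_emails(field)))
```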

## /api/v1/search/email/ test
This endpoint will search for e-mails found in phishing kits. It is unclear if this means emails referenced in the phishkit code (e.g. emails used for collection of phished credentials) or if it means emails found in victim dumps in captured phishkits. Or perhaps it refers to emails referenced in phishing URLs (some phishing URLs contain the victim's email address).

In our test we are going to search for a list of email addresses. These fall into several categories:
- Addresses that have sent phishing links in the past (compromised or attacker emails)
- Addresses that have been observed in phishing links (known victim emails)
- Addresses that have received phishing emails (but not mentioned in actual phishing URLs)
- Addresses that have sent maldocs (compromised or attacker emails)
- Addresses that have sent test emails (during preparation of new gmail.com accounts)

### Test 1: with 2 emails: one known to send phish, one known safe
Let's test this endpoint with one email that should not be found at all ("michael@cyberlibrarian.ca") and one that definitely should be found (from previous threat intel).

In [52]:
# Test emails
emails = [
    'michael@cyberlibrarian.ca',
    'thamaraiselvan.m@ionexchange.co.in'
]

# Make a separate query for each email. 
responses = []
for email in emails:
    url = f'{ep_email}/{email}'
    try:
        response = requests.request("GET", url, headers=headers)
        responses.append(response)
        time.sleep(1) # be polite and wait between each API call. This is a free API.
    except Exception as e:
        print(f'Err: {e}')

In [53]:
print(f'Number of responses: {len(responses)}')

Number of responses: 2


In [54]:
for response in responses:
    print(f'Response Status: {response.status_code}, {response.json()}')

Response Status: 401, {'error': "You don't have access to this search option with your profile"}
Response Status: 401, {'error': "You don't have access to this search option with your profile"}


### 2022-04-01 Observations on /email test
I guess this does not work for the free API: "You don't have access to this search option with your profile".

Initially we sent 14 requests, and ended up throttled after a few attempts to troubleshoot. The throttling was inconsistent. If we made repeated attempts it would report the number of seconds until throttle was lifted but it was different each time. Perhaps there are multiple backends each with a different count?

    response.status_code=429 means we were throttled.
    response.status_code=401 means we do not have access with this profile.
    
Response "401" can also mean our authentication token is invalid. So "401" gets returned under at least two conditions but there is a good human-readable error returned.

Errors are also returned in JSON format. Nice!
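
Given the throttling described above, a small retry wrapper can keep a notebook run from failing on the first 429. This is only a sketch: `fetch_with_retry` is a hypothetical helper, `fetch` stands for any zero-argument callable that returns a status code and a decoded payload, and the fixed `wait_seconds` is an assumption, because the format of the wait time reported by the API is not recorded in these notes.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, wait_seconds=5, sleep=time.sleep):
    """Call fetch() until it returns a status other than 429 or attempts run out.

    fetch: zero-argument callable returning (status_code, payload).
    wait_seconds: assumed fixed delay between attempts.
    sleep: injectable so the wait can be skipped in tests.
    """
    status_code, payload = None, None
    for attempt in range(1, max_attempts + 1):
        status_code, payload = fetch()
        if status_code != 429:
            break
        if attempt < max_attempts:
            sleep(wait_seconds)
    return status_code, payload


# Simulated replies: throttled once, then an empty but successful result
fake_replies = iter([(429, {}), (200, [])])
status, data = fetch_with_retry(lambda: next(fake_replies), sleep=lambda seconds: None)
print(status, data)
```

With the real API, `fetch` could wrap one of the `requests` calls used in this notebook and return `(response.status_code, response.json())`.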

### 2022-10-29 Observations on /email test
I am testing with a trial of the commercial account, but this still returns status 401. 

## /api/v1/search/ipv4/ test
This endpoint allows us to search for an IPv4 address. But are these the IPv4 addresses of hosts serving phishing sites? Or IPv4 addresses associated with phishing email senders?

In [55]:
# Search for an IP that will NOT be found (we tested in advance from known threat intel)
ip = '46.101.222.88'
url = f'{ep_ipv4}/{ip}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [56]:
print(f'Response Status: {response.status_code}')

Response Status: 200


In [57]:
print(json.dumps(response.json(), indent=2))

[]


### 2022-04-01 Observations
When the API call is successful, but nothing is found, the status code is "200", but the JSON is empty.
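
One consequence worth guarding against: a successful search returns a JSON list (possibly empty), while the errors seen earlier in this notebook come back as a JSON object with an "error" key. Calling `len()` directly on the decoded body would report 1 for such an error. A minimal sketch of a safer count, using a hypothetical helper name:

```python
def count_results(payload):
    """Count records in a decoded response body.

    A list is treated as search results; anything else (for example an
    error object like {'error': '...'}) counts as zero results.
    """
    if isinstance(payload, list):
        return len(payload)
    return 0


error_body = {"error": "You don't have access to this search option with your profile"}
print(len(error_body), count_results(error_body), count_results([]))
```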

### What about for a well-known IP?
What happens when we put in a well-known IP that might be associated with a large number of legitimate and evil sites?

In [58]:
# Search for a well-known IP that may be shared by many sites
ip = '1.1.1.1'
url = f'{ep_ipv4}/{ip}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [59]:
print(f'Response Status: {response.status_code}')

Response Status: 200


In [60]:
print(f'Number of results: {len(response.json())}')

Number of results: 100


In [61]:
print(json.dumps(response.json(), indent=2))

[
  {
    "siteurl": "https://alert88.tv/",
    "sitedomain": "1.1.1.1",
    "pagetitle": "",
    "firstseentime": "2022-10-22T18:26:42Z",
    "firstseencode": "403",
    "ipaddress": "1.1.1.1",
    "asn": "13335",
    "asndesc": "CLOUDFLARENET, US",
    "asnreg": "apnic",
    "extracted_emails": "",
    "GoogleSafebrowsing": "",
    "phishing_score": "",
    "certificate": [
      {
        "SSLcert_countryName": "US",
        "SSLcert_Issuer": "C=US, O=Amazon, CN=Amazon RSA 2048 M02",
        "SSLcert_commonName": "alert88.tv",
        "SSLcert_notBefore": "2022-10-22T00:00:00",
        "SSLcert_notAfter": "2023-11-20T23:59:59",
        "SSLcert_subjectAltName": "alert88.tv",
        "SSLcert_serialNumber_hex": "0xd7112dafff00e4a378c7a40ad831a0a",
        "SSLcert_md5": "7C03EDD599F066AB9B4C651DB7E30CE7",
        "SSLcert_sha1": "EB8A0196535A7A3B97D48289ACBFE025E20B5A2F",
        "SSLcert_sha256": "23B6EC5CA1F61F8016AEB98C036AC4A5E541A4FE5D5326226EB814BCAC26219C"
      }
    ]
  },
 

### 2022-04-01 Observations
This shows a lot of sites protected by Cloudflare (1.1.1.1)

It is interesting how many of these URLs end in "/%27", which is a single quote "'" in URL (percent) encoding. I wonder if this is a parsing error or if it really was part of the phishing URL sent to a victim. Or perhaps it is some oddity related to Cloudflare.

It is also interesting that most of the titles are "DNS Resolution Error". Is the title fetched by stalkphish? Does this occur because of a failure to fetch the actual webpage at the URL? Is it because Cloudflare blocked fetching the page? Or because the site was down by the time it was investigated?

### 2022-10-29 Observations
I still see a lot of URLs ending in /%27. Assuming this is copied from the original source, and is not a parsing bug, I'm not sure how I feel about it. On the one hand, I think copying source data, complete with errors or anomalies, supports good research. On the other hand, it makes searching for siteurls a bit trickier. It's probably better as-is than altering anything.
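
Since `%27` is the percent-encoded form of a single quote, one way to make comparisons easier without altering the stored data is to normalise a copy of the URL on the client side. The sketch below decodes `%27` to confirm what it is and strips a trailing occurrence for matching purposes only. The helper name and the example URL are hypothetical.

```python
from urllib.parse import unquote

# Confirm what %27 decodes to
print(repr(unquote("%27")))


def strip_trailing_quote(siteurl):
    """Return siteurl without a trailing percent-encoded single quote.

    Intended for comparison only; the original value should be kept as-is.
    """
    if siteurl.lower().endswith("%27"):
        return siteurl[:-3]
    return siteurl


print(strip_trailing_quote("https://phish.example/%27"))
```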

PS Hey Cloudflare thanks for hosting the world's criminals. What would we do without you? Oh yeah. We would identify them and take them down.

In [22]:
# Search for IP that WILL be found
ip = '80.66.64.192'
url = f'{ep_ipv4}/{ip}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [23]:
print(f'Number of results: {len(response.json())}')

Number of results: 7


In [24]:
print(json.dumps(response.json(), indent=2))

[
  {
    "siteurl": "http://wallet-polygone.com",
    "sitedomain": "wallet-polygone.com",
    "pagetitle": "<p><b> <font color=",
    "firstseentime": "2022-03-01T22:02:39Z",
    "firstseencode": "aborted",
    "ipaddress": "80.66.64.192",
    "asn": "57416",
    "asndesc": "INSTARS, RU",
    "asnreg": "ripencc",
    "extracted_emails": "",
    "GoogleSafebrowsing": "",
    "phishing_score": "",
    "certificate": [
      {
        "SSLcert_countryName": null,
        "SSLcert_Issuer": null,
        "SSLcert_commonName": null,
        "SSLcert_notBefore": null,
        "SSLcert_notAfter": null,
        "SSLcert_subjectAltName": null,
        "SSLcert_serialNumber_hex": null,
        "SSLcert_md5": null,
        "SSLcert_sha1": null,
        "SSLcert_sha256": null
      }
    ]
  },
  {
    "siteurl": "https://wallet-polygone.com/",
    "sitedomain": "wallet-polygone.com",
    "pagetitle": "<p><b> <font color=",
    "firstseentime": "2022-03-01T08:02:01Z",
    "firstseencode": "200",


### 2022-04-01 Observations
It is unclear if these are actually phishing sites. I got the IP address (80.66.64.192) from an existing entry in the StalkPhish API. I knew there would be at least one entry. Are all of these sites phishing sites? Or are they just hosted on the same IP? Was this discovered through Passive DNS?

It would be helpful if the source of the database entry was known. For example, if the siteurl was added because it was suspected to be a phishing page, that is different than it being added via Passive DNS, reverse lookups, or via a redirection.

If the URL was discovered because it was the target of a redirection, it would be good to know that association.

## /api/v1/search/title test
This searches for titles in web pages. Presumably in the pages for phishing sites. This would be great for cases where we have found a phishing site and want to see where else it has been used.

For our test we will use several strings for titles we believe are unique: representing things we expect and DO NOT expect to find in phishing sites.

In [65]:
# Search for Page Titles
title = 'broker'
url = f'{ep_title}/{title}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [66]:
print(f'Number of responses: {len(response.json())}')

Number of responses: 61


In [67]:
print(f'Response Status: {response.status_code}')
print(json.dumps(response.json(), indent=2))

Response Status: 200
[
  {
    "siteurl": "https://czaialaw.com/wp-content/plugins/vzlnjrx/trxm/login.html",
    "sitedomain": "czaialaw.com",
    "pagetitle": "Onlinebanking und Brokerage der Deutschen Bank",
    "firstseentime": "2022-10-29T03:05:01Z",
    "firstseencode": "200",
    "ipaddress": "173.236.136.213",
    "asn": "26347",
    "asndesc": "DREAMHOST-AS, US",
    "asnreg": "arin",
    "extracted_emails": "",
    "GoogleSafebrowsing": "",
    "phishing_score": "",
    "certificate": [
      {
        "SSLcert_countryName": "US",
        "SSLcert_Issuer": "C=US, O=Let's Encrypt, CN=R3",
        "SSLcert_commonName": "www.czaialaw.com",
        "SSLcert_notBefore": "2022-09-11T05:57:36",
        "SSLcert_notAfter": "2022-12-10T05:57:35",
        "SSLcert_subjectAltName": "czaialaw.com, www.czaialaw.com",
        "SSLcert_serialNumber_hex": "0x3bad77428d198ebc1f128ff7b8f6213dfb5",
        "SSLcert_md5": "6C56A1B52D83B7A5BCA0B4F61EE436AC",
        "SSLcert_sha1": "FEAAE9DC8951702

### 2022-04-01 Observations for /title test
response.status_code=401 occurs when you don't have access to this search option.
The response will contain a JSON/dict item called "error" with value "You don't have access to this search option with your profile".

If this did work, would the search be case-sensitive or case-insensitive?

### 2022-10-29 Observations for /title test
When testing with the new API and commercial trial account, I get a lot of data this time.

The search matches the term anywhere inside the pagetitle field, and apparently without regard to case: searching for "broker" returned "Onlinebanking und Brokerage der Deutschen Bank". These are really interesting results!

In fact, I am most excited by this API endpoint. I see results that confirm trends I have observed in my own day-to-day work: many law firms are being compromised and used to phish others. Some of these domains might be registered by criminals, but many were clearly compromised and then re-used.

One could do a lot of interesting reporting, trending, and intel analysis with this endpoint!

One could also set up searches for brand protection purposes. Is someone phishing your customers on a site you don't know about that is branded with your company name? This title search could find them.
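
As a small illustration of the kind of trending this enables, the sketch below counts search results by a chosen field, such as the hosting network in "asndesc". The function name is hypothetical and the three sample records are trimmed-down stand-ins using values seen elsewhere in this notebook, not a real result set.

```python
from collections import Counter


def count_by_field(records, field):
    """Count records by the value of one field; empty or missing becomes '(empty)'."""
    return Counter(record.get(field) or "(empty)" for record in records)


sample_records = [
    {"sitedomain": "czaialaw.com", "asndesc": "DREAMHOST-AS, US"},
    {"sitedomain": "wallet-polygone.com", "asndesc": "INSTARS, RU"},
    {"sitedomain": "openseca.com", "asndesc": "INSTARS, RU"},
]
for value, count in count_by_field(sample_records, "asndesc").most_common():
    print(f"{count:3d}  {value}")
```

Run daily against a saved brand or title search, the same counts could show which networks or domains are newly appearing.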

## /api/v1/search/url test
This endpoint allows us to search URLs. It is not clear if we have to supply a full URL or a string to search for in a URL. I think "string to search for".

So for our first experiment we will search for a single string we expect to find results for.

I am going to use "broker" related search terms. I work in the insurance industry and this term used in the URL of a phishing page would be of great interest to us.

In [68]:
# Search for URLs containing "broker"
url_string = 'broker'

url = f'{ep_url}/{url_string}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [69]:
print(f'Number of results: {len(response.json())}')

Number of results: 100


In [70]:
print(f'Response Status: {response.status_code}')

Response Status: 200


In [71]:
print(json.dumps(response.json(), indent=2))

[
  {
    "siteurl": "https://account-identity.mesbroker.ir/",
    "sitedomain": "sts-identity.mesbroker.ir",
    "pagetitle": "Skoruba IdentityServer4",
    "firstseentime": "2022-10-29T10:38:27Z",
    "firstseencode": "aborted",
    "ipaddress": "87.236.209.21",
    "asn": "208555",
    "asndesc": "MOBINHOST MobinInfrastructure, IR",
    "asnreg": "ripencc",
    "extracted_emails": "",
    "GoogleSafebrowsing": "",
    "phishing_score": "",
    "certificate": [
      {
        "SSLcert_countryName": "US",
        "SSLcert_Issuer": "C=US, O=Let's Encrypt, CN=R3",
        "SSLcert_commonName": "account-identity.mesbroker.ir",
        "SSLcert_notBefore": "2022-10-29T08:35:48",
        "SSLcert_notAfter": "2023-01-27T08:35:47",
        "SSLcert_subjectAltName": "account-identity.mesbroker.ir",
        "SSLcert_serialNumber_hex": "0x454ed76255e987f937e8b82fec78a9edc53",
        "SSLcert_md5": "86F5B1971B3842BADC306533B61C5363",
        "SSLcert_sha1": "FE46E2EE94851595EE33389701DC76DB090

### 2022-04-01 Observations
These do look like evil sites for the most part.

This one appears to be a parsing error of some kind. Maybe it was autodiscovered when stalkphish followed a link on another site being investigated? The "siteurl" might be missing part of the string.

`
{
    "siteurl": "https://%2A.service-broker-cluster-",
    "sitedomain": "*.service-broker-cluster-e40750a6150dc7e1cdb465eaf4188cdf-0000.us-south.containers.appdomain.cloud",
    "pagetitle": null,
    "firstseentime": "2022-02-28T07:23:36Z",
    "firstseencode": "aborted",
    "ipaddress": "52.117.197.3",
    "asn": "36351",
    "asndesc": "SOFTLAYER, US",
    "asnreg": "arin",
    "extracted_emails": null
  },
`

### 2022-10-29 Observations
This still works as well as it did before, and is made better by the additional fields now provided. I note the GoogleSafebrowsing and phishing_score fields (and the certificate field I commented on in earlier tests). The scores are interesting not just because they are there, but because you can determine which URLs DO NOT have a score, meaning they are not covered by other protection systems like Google's. Very handy for assessing the risk posed by a discovered URL, or for verifying whether the URL is already well known.
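
Finding the URLs that have no score at all can be sketched as a simple filter over the result list. The function name is hypothetical, and treating an empty string or a missing key as "no score" is an assumption based on the sample records in this notebook.

```python
def without_score(records, fields=("GoogleSafebrowsing", "phishing_score")):
    """Return the records in which every listed field is empty or missing."""
    return [
        record for record in records
        if not any(record.get(field) for field in fields)
    ]


sample_records = [
    {"siteurl": "https://account-identity.mesbroker.ir/",
     "GoogleSafebrowsing": "", "phishing_score": ""},
    {"siteurl": "https://www.govts.indiana-claimant.site",
     "GoogleSafebrowsing": "", "phishing_score": "100"},
]
for record in without_score(sample_records):
    print(record["siteurl"])
```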

### Search for "claimant"
Another common string in URLs for legitimate insurance sites is "claimant". This is also used on legal sites.

In [72]:
# Search for URLs containing "claimant"
url_string = 'claimant'

url = f'{ep_url}/{url_string}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [73]:
print(f'Number of results: {len(response.json())}')

Number of results: 100


In [74]:
print(f'Response Status: {response.status_code}')

Response Status: 200


In [75]:
print(json.dumps(response.json(), indent=2))

[
  {
    "siteurl": "https://www.govts.indiana-claimant.site",
    "sitedomain": "www.govts.indiana-claimant.site",
    "pagetitle": "",
    "firstseentime": "2022-10-17T12:59:34Z",
    "firstseencode": "403",
    "ipaddress": "192.3.190.242",
    "asn": "36352",
    "asndesc": "AS-COLOCROSSING, US",
    "asnreg": "arin",
    "extracted_emails": "",
    "GoogleSafebrowsing": "",
    "phishing_score": "100",
    "certificate": [
      {
        "SSLcert_countryName": "US",
        "SSLcert_Issuer": "C=US, O=Let's Encrypt, CN=R3",
        "SSLcert_commonName": "govts.indiana-claimant.site",
        "SSLcert_notBefore": "2022-10-17T11:06:03",
        "SSLcert_notAfter": "2023-01-15T11:06:02",
        "SSLcert_subjectAltName": "govts.indiana-claimant.site, www.govts.indiana-claimant.site",
        "SSLcert_serialNumber_hex": "0x34f5cb0c678e6050eb68ddf5eb6d0aca102",
        "SSLcert_md5": "539001331BAFD2B633CB6593042586DE",
        "SSLcert_sha1": "EA8AB5146278311EC59BAD40FC79EA66F1D491B9"

### 2022-04-01 Observations
Many of these appear to be legitimate insurance sites and are verifiably not phishing sites.

For example, this appears to be a legit StateFarm claims form:

`
{
    "siteurl": "https://report.claims.statefarm.com/claimant/%3FlossType%3Dauto%26reporterType%3DotherInsuranceCompany",
    "sitedomain": "report.claims.statefarm.com",
    "pagetitle": null,
    "firstseentime": "2021-07-06T22:33:44Z",
    "firstseencode": "200",
    "ipaddress": "13.225.87.99",
    "asn": "16509",
    "asndesc": "AMAZON-02, US",
    "asnreg": "arin",
    "extracted_emails": null
  },
`

Another example, this site is from a company called "Claims Space" that recently rebranded to "Zemble" (zmbl.io). This isn't a phishing site. But 99.86.3.12 is probably hosting many different sites. It is unclear what the relationship of this record is to phishing sites.

`
{
    "siteurl": "https://wawanesa-dev.stg.zmbl.io/sign_in/claimant",
    "sitedomain": "wawanesa-dev.stg.zmbl.io",
    "pagetitle": "Claims Portal",
    "firstseentime": "2022-01-24T16:07:37Z",
    "firstseencode": "200",
    "ipaddress": "99.86.3.12",
    "asn": "16509",
    "asndesc": "AMAZON-02, US",
    "asnreg": "arin",
    "extracted_emails": null
  },
`
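
Until the context of an entry is known, one stopgap is to flag results whose domain belongs to an organisation you already trust. The sketch below does a suffix match on "sitedomain" against a hand-maintained allowlist. The function name and the allowlist contents are purely illustrative, and a match only means "review this as a likely false positive", not "safe".

```python
def is_allowlisted(sitedomain, allowlist):
    """True if sitedomain equals, or is a subdomain of, an allowlisted domain."""
    host = sitedomain.lower().rstrip(".")
    return any(host == domain or host.endswith("." + domain) for domain in allowlist)


# Illustrative allowlist; maintain your own from domains you have verified
allowlist = {"statefarm.com"}

print(is_allowlisted("report.claims.statefarm.com", allowlist))
print(is_allowlisted("statefarm.com.login.example", allowlist))
```

Matching on a leading dot plus the domain avoids treating look-alike hosts that merely contain the trusted name as legitimate.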

### 2022-10-29 Observations
This time I do not see the "Claims Space" false positive detection. I still wonder how one would determine the context of an entry in the StalkPhish database. Why was the URL added? What added it and in what context? What was the original detection that led to the info being harvested and added?

When determining if something is a false positive, context is helpful.

That said, I did not see anything that was obviously a false positive this time. Sadly, I see a big increase in real phishing sites that match my insurance-related query. It's a good use case, and shows the API provides great info.

### What about URLs with paths in them?
Can we search for a full URL? Let's continue our last experiment but with a known full URL and see what happens. We will use a URL with a lot of URL encoded characters.

In [76]:
# Search for a specific URL
url_string = 'https://id.moneyforward.com/sign_in/email%3Fclient_id%3DOdII7gHa4v8Oouz6IbXSdRrVkTdbdzyISbOdEpEv070%26nonce%3D'

url = f'{ep_url}/{url_string}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [77]:
print(f'Number of results: {len(response.json())}')

Number of results: 0


In [78]:
print(f'Response Status: {response.status_code}')

Response Status: 200


In [79]:
print(json.dumps(response.json(), indent=2))

[]
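
One caveat with the cells above: if the request raises, the `except` branch only prints the error, and the following cells then read whatever `response` was left over from an earlier search. A small helper can make that failure loud instead. This is a sketch under my own assumptions: the `search` name and its `http` parameter are hypothetical, and `http` is expected to be the `requests` module or a `requests.Session()`.

```python
def search(http, ep_url, headers, term, timeout=30):
    """Run one search and return the parsed JSON list.

    `http` is anything with a requests-style get(), such as the
    `requests` module itself or a requests.Session().
    """
    response = http.get(f"{ep_url}/{term}", headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of continuing
    return response.json()


# Usage in this notebook (ep_url and headers come from earlier cells):
# results = search(requests, ep_url, headers, url_string)
# print(f'Number of results: {len(results)}')
```

Because the helper returns the parsed list directly, there is no shared `response` variable that can go stale between cells.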


### Observations
That did not work, even though that URL is known to be in the database. Perhaps the encoded characters are the problem? Let's try with a simpler URL.

In [80]:
# Search for a specific URL (simpler this time)
url_string = 'http://corona-19claimants.com/'

url = f'{ep_url}/{url_string}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [81]:
print(f'Number of results: {len(response.json())}')

Number of results: 1


In [82]:
print(f'Response Status: {response.status_code}')

Response Status: 200


In [83]:
print(json.dumps(response.json(), indent=2))

[
  {
    "siteurl": "http://corona-19claimants.com/",
    "sitedomain": "corona-19claimants.com",
    "pagetitle": "<p><b> <font color=",
    "firstseentime": "2020-05-24T08:05:24Z",
    "firstseencode": "200",
    "ipaddress": "184.168.221.44",
    "asn": "26496",
    "asndesc": "",
    "asnreg": "",
    "extracted_emails": "",
    "GoogleSafebrowsing": "",
    "phishing_score": "",
    "certificate": [
      {
        "SSLcert_countryName": null,
        "SSLcert_Issuer": null,
        "SSLcert_commonName": null,
        "SSLcert_notBefore": null,
        "SSLcert_notAfter": null,
        "SSLcert_subjectAltName": null,
        "SSLcert_serialNumber_hex": null,
        "SSLcert_md5": null,
        "SSLcert_sha1": null,
        "SSLcert_sha256": null
      }
    ]
  }
]


### Observations
Works as expected.
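
Since the simple URL works, the encoded characters in the earlier URL remain the main suspect. One follow-up I have not verified against the API is to percent-encode the whole search term before placing it in the request path, so that the `%3F` and `%3D` sequences reach the server as literal text rather than being decoded along the way. The `encode_search_term` name is hypothetical, and whether the endpoint decodes the path exactly once is an assumption.

```python
from urllib.parse import quote


def encode_search_term(term):
    """Percent-encode every reserved character, including '/', ':' and '%'."""
    return quote(term, safe="")


url_string = 'https://id.moneyforward.com/sign_in/email%3Fclient_id%3D'
encoded = encode_search_term(url_string)
print(encoded)

# In the notebook, the encoded term would replace url_string in the path:
# url = f'{ep_url}/{encoded}'
```

Note that an existing `%3F` becomes `%253F`. If the server decodes the path once, it would get back the string exactly as it is stored in `siteurl`.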

### What about a "/" in the URL? Can we search for more than a simple string?
The string "sign_in" appears in many phishing URLs. It should have many matches. So, we would need to narrow our search. Let's see if we can include path elements in the string. Some REST APIs cannot handle that or require special encoding to handle it.

In [84]:
# Search for URLs containing "sign_in" with a slash
url_string = 'sign_in/'

url = f'{ep_url}/{url_string}'
try:
    response = requests.request("GET", url, headers=headers)
except Exception as e:
    print(f'Err: {e}')

In [85]:
print(f'Number of results: {len(response.json())}')

Number of results: 100


In [86]:
print(f'Response Status: {response.status_code}')

Response Status: 200


In [87]:
print(json.dumps(response.json(), indent=2))

[
  {
    "siteurl": "https://wvvw-bitforex.com/sign_in/",
    "sitedomain": "wvvw-bitforex.com",
    "pagetitle": "Just a moment...",
    "firstseentime": "2022-10-18T05:00:01Z",
    "firstseencode": "403",
    "ipaddress": "172.67.143.23",
    "asn": "13335",
    "asndesc": "CLOUDFLARENET, US",
    "asnreg": "arin",
    "extracted_emails": "",
    "GoogleSafebrowsing": "",
    "phishing_score": "",
    "certificate": [
      {
        "SSLcert_countryName": "US",
        "SSLcert_Issuer": "C=US, O=Let's Encrypt, CN=E1",
        "SSLcert_commonName": "*.wvvw-bitforex.com",
        "SSLcert_notBefore": "2022-09-23T01:53:13",
        "SSLcert_notAfter": "2022-12-22T01:53:12",
        "SSLcert_subjectAltName": "*.wvvw-bitforex.com, wvvw-bitforex.com",
        "SSLcert_serialNumber_hex": "0x36b92ebb28321f7155869edaa467260c29d",
        "SSLcert_md5": "B9A223A260582C1940943296E0BB413C",
        "SSLcert_sha1": "99ADD6FA7F2C7EAD1D714C73AAC854B2B23CC08F",
        "SSLcert_sha256": "0543EF4A4

### Observation for "/" test
That worked! We can search for URLs with path elements (slashes) in them.

### 2022-10-29 Observations for /url test
We got 50 results for our simple test. Was that because we are limited to 50 results? Or because that was how many there were? We will need to test some more to find out.
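
One way to tell a cap from a true count is to run several unrelated broad searches and see whether more than one of them stops at exactly the same number. The sketch below only does the comparison; the `suspected_cap` name is mine and the counts in the example are made up for illustration, not taken from the API.

```python
def suspected_cap(result_counts):
    """Return the shared maximum if two or more queries hit it, else None."""
    if not result_counts:
        return None
    top = max(result_counts.values())
    queries_at_top = [q for q, n in result_counts.items() if n == top]
    return top if len(queries_at_top) >= 2 else None


# Hypothetical counts, for illustration only.
counts = {"term_a": 50, "term_b": 50, "term_c": 7}
print(suspected_cap(counts))
```

In the notebook, `counts` would be filled with `len(response.json())` for each search term, with a `time.sleep()` between calls. Two unrelated queries landing on the same maximum is a hint of a limit, not proof.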

The format of results is shown below and is the same format as other results we have seen. So it seems there is one format for all search results. All API endpoints produce the same record type.

`
{
    "siteurl": "https://login.microsoftonline.com/common/oauth2/authorize%3Fclient_id%3D00000002-0000-0ff1-ce00-000000000000%26redirect_uri%3Dhttps%253a%252f%252femail.o365.autodesk.com%252fowa%252f%26resource%3D00000002-0000-0ff1-ce00-000000000000%26response_mode%3Dform_post%26response_type%3Dcode%2Bid_token%26scope%3Dopenid%26msafed%3D0%26msaredir%3D0%26client-request-id%3D14e1a2bb-f887-0e4f-7951-b9009dea7b6f%26protectedtoken%3Dtrue%26claims%3D%257b%2522id_token%2522%253a%257b%2522xms_cc%2522%253a%257b%2522values%2522%253a%255b%2522CP1%2522%255d%257d%257d%257d%26nonce%3D637816299138232509.6d434649-a21f-4718-97dc-0f1c74fe5bdf%26state%3DfU9dT4MwAAT9LdsbjkIL9IGYYotBVxdI4xxvfBQzBhIHUrL_uf9j5w_wkrvkLncPZxqGca95p2naWgzfc_0AeA7GwA0c10E2fvBq6EIPYqtwQGNBHwQW9uvKshtQ-bCRqKwbU2-v5mZQxeYxmWSf0JAQfiI0veRt3PP-pd0K5nLBpjdaXQ5HG-1EprOozWne7QSDeZuqQxo9kz88DeSV8q8SKYFY7rMyWz7eVaSilRPvoSoXtD-LYjuzPtFJLCemR-yb3Bwh_7X4CZUR1u2VS9dymcefcxeC9XyUqh9q2YWZLGoux7H4lLcrvw%26sso_reload%3Dtrue",
    "sitedomain": "login.microsoftonline.com",
    "pagetitle": "Bad Request",
    "firstseentime": "2022-02-28T07:45:06Z",
    "firstseencode": "400",
    "ipaddress": "20.190.160.132",
    "asn": "8075",
    "asndesc": "MICROSOFT-CORP-MSN-AS-BLOCK, US",
    "asnreg": "arin",
    "extracted_emails": null
  },
`
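
The "one format" claim can be checked rather than eyeballed. Some records printed in this notebook carry `GoogleSafebrowsing`, `phishing_score` and `certificate` fields, while the one above does not, so a quick comparison of key sets is worthwhile. This is a minimal sketch; `key_differences` is a hypothetical helper and the two sample records are trimmed down to a few keys.

```python
def key_differences(records):
    """Map each record index to the keys it lacks relative to all records."""
    all_keys = set()
    for record in records:
        all_keys.update(record)
    differences = {}
    for index, record in enumerate(records):
        missing = sorted(all_keys - set(record))
        if missing:
            differences[index] = missing
    return differences


# Trimmed-down records: one in the shorter shape, one in the longer shape.
shorter = {"siteurl": "a", "sitedomain": "b", "extracted_emails": None}
longer = {"siteurl": "c", "sitedomain": "d", "extracted_emails": "",
          "GoogleSafebrowsing": "", "phishing_score": "", "certificate": []}
print(key_differences([shorter, longer]))
```

An empty result from `key_differences(response.json())` would support the idea that every record in a response shares one shape.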

An important question is, "What do these records represent?" What is the taxonomy?

Take a look at the record above. Is the siteurl the URL for a phishing site? Or for a URL that redirects to a phishing site? It looks like it redirects to an autodesk.com URL.
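
The redirect target can be pulled out of a stored `siteurl` without visiting the site. The sketch below assumes the `siteurl` is stored with its query string percent-encoded, as in the record above, and that the target sits in a `redirect_uri` parameter. The `embedded_redirect` name is hypothetical, and the sample is a shortened copy of the record's URL.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def embedded_redirect(siteurl, param="redirect_uri"):
    """Return the decoded value of `param` from a stored siteurl, or None."""
    decoded = unquote(siteurl)          # undo the stored %3F / %26 / %3D layer
    query = urlsplit(decoded).query
    values = parse_qs(query).get(param)  # parse_qs decodes the value once more
    return values[0] if values else None


# Shortened copy of the siteurl from the record above.
sample_siteurl = (
    "https://login.microsoftonline.com/common/oauth2/authorize"
    "%3Fclient_id%3D00000002-0000-0ff1-ce00-000000000000"
    "%26redirect_uri%3Dhttps%253a%252f%252femail.o365.autodesk.com%252fowa%252f"
    "%26resource%3D00000002-0000-0ff1-ce00-000000000000"
)
print(embedded_redirect(sample_siteurl))
```

This only shows where the URL says it will send the user. It says nothing about whether the record is the phishing page itself or a step in a redirect chain.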

When I am investigating a phish, I often want to distinguish the URL delivered to the victim (via email, for example), all of the domains that redirect, and the final "landing page" URL the victim is redirected to. In addition, some of these pages are just MITM proxies, and whenever possible it is nice to know the site they are proxying for.