**InBECstigation-J**

An approach to easily analyze BEC/EAC evidence.

- Author: **Eduardo Chavarro Ovalle**
    - @echavarro
- eduardo.ovalle@kaspersky.com

# InBECstigation - Approach to analyze BEC cases

Business email compromise and email account compromise (BEC/EAC) is one of the threats that causes the largest losses for corporations and individuals. Its impact amounts to billions of dollars based on IC3 statistics alone, and the real figure is considered to be about twice that value.
While analyzing BEC/EAC cases, we need to review mailboxes and messages and determine whether there are traces of spoofing, vulnerability exploitation, unauthorized access to mailboxes, or the use of appearance (look-alike) domains to intrude into the conversation loop. This means many details must be verified, so it is important to provide tools that speed up the analysis and produce results.
This paper focuses on the opportunities that Python offers for analyzing email messages, and on how to configure these functions to review the full set of messages in order to determine the key elements of the attack, the elements affected, and the controls or additional tasks needed.


# BEC/EAC Threat
This threat involves intrusion into financial and procurement communications, where adversaries identify key aspects of purchase approvals. Adversaries identify the key participants on both sides of a negotiation and choose the moment to join the communication, impersonating the original participants. 
If the intrusion is effective, the real participants will not easily notice the changes and will continue exchanging information to complete all the data and procedures needed for the acquisition.
This intrusion is usually the product of an initial phishing attack, but it can also result from threat actors performing brute force or password guessing, from public credential dumps of other services combined with credential reuse across services, or sometimes from vulnerabilities in mail services exposed to the internet and, in some cases, hosted on the company’s premises.


## Appearance domains
One of the most common trends is infrastructure acquisition **[ID: T1583]**, where the attackers buy domains whose spelling is similar to the original ones. Using this technique, intruders insert themselves into the communication loop by spoofing both sides of the conversation, in order to request more information and then modify the destinations of financial resources, assets or services. 


## Spoofing techniques

By exploiting software vulnerabilities or misconfigurations, threat actors can use unauthorized infrastructure to spoof legitimate domains and tamper with the source of emails, making the communications seem legitimate. 


# Implementing the algorithm using Jupyter notebook

By collecting domains and links from the message body and headers, and using command-line tools like whois, it is possible to identify fake domains involved in the communications and determine the date when the infrastructure was created, so this information can be added to the analysis timeline.

The first step is to load the evidence in a way that can be parsed and sorted according to the needs of the investigation. Dealing with big OST files or multiple email messages stored in a container is a difficult task, and the best approach is to load all this information into lists that can be filtered by metadata, headers and message content. Once the information is parsed this way, it is easy to search for threats or keywords in a malleable format.

For this purpose, the **Pypff** and **Extract_Msg** Python libraries allow loading a file (PST or msg) and getting all the metadata for analysis. Pypff allows iterating over all items in the root folder, analyzing message by message, and extracting details without having to load the PST file in a mail client.
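
Independently of the container format, the parsed headers end up as a list of dictionaries that can be filtered. The following is a minimal sketch of that idea using only the standard library; the helper names `headers_to_lists` and `filter_messages` and the sample header block are hypothetical and are not part of the notebook. Unlike a plain dictionary, it keeps every value of repeated headers such as `Received`.

```python
from email.parser import HeaderParser

def headers_to_lists(raw_headers):
    """Parse a raw header block into {lowercased name: [values, ...]}."""
    parsed = HeaderParser().parsestr(raw_headers)
    result = {}
    for name, value in parsed.items():
        # Repeated headers (for example Received) are kept as separate entries.
        result.setdefault(name.lower(), []).append(value)
    return result

def filter_messages(messages, field, needle):
    """Return the messages whose `field` contains `needle`, ignoring case."""
    needle = needle.lower()
    return [m for m in messages
            if any(needle in value.lower() for value in m.get(field, []))]

# Hypothetical header block used only for illustration.
raw = (
    "Received: from mail.example.org by mx.example.com\n"
    "Received: from client.example.org by mail.example.org\n"
    "From: Alice <alice@example.org>\n"
    "To: bob@example.com\n"
    "Subject: Invoice 42\n"
    "\n"
)
messages = [headers_to_lists(raw)]
matches = filter_messages(messages, "from", "example.org")
print(len(messages[0]["received"]), matches[0]["subject"])
```

Keeping lists of values is a design choice: the hop-by-hop `Received` chain is often relevant when looking for the real origin of a message.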


In [None]:
import pypff
import whois
from email.parser import HeaderParser
import pandas as pd
import numpy as np
from difflib import SequenceMatcher
from datetime import datetime
import extract_msg
import glob
import time
import base64, re

## Steps:  
0. Prepare data
1. Appearance domains
2. Whois Analysis
3. Suspicious Mailboxes
4. Messages headers' alerts
5. Statistical analysis and timeline
6. Subject analysis
7. IP addresses analysis

## Prepare data
**Extract_msg** automates the extraction of key email data (from, to, cc, date, subject, body) and the email’s attachments, so they can be managed as lists and the information is stored in text format for analysis.
Once PST files and messages are parsed, the relevant information must be extracted from each message. The following fields can be extracted from messages to perform a fast analysis and identify attack patterns:
- x_originating_ip
- Message Subject
- mailboxes: From, to, CC and return-path
- Domains: Extracted from headers or mailboxes.
- Suspicious values in the Authentication-Results header:
    - spf=softfail
    - dmarc=fail
    - compauth=fail


In [None]:
def headers_to_df(headers):
    return dict([(title.lower(), value) for title, value in headers.items()])

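def parse_folder(base):
    # Recursively collect the transport headers of every message below `base`
    messages = []
    for folder in base.sub_folders:
        try:
            if folder.number_of_sub_folders:
                messages += parse_folder(folder)
        except Exception:
            pass    # folder could not be read; continue with the next one
        for message in folder.sub_messages:
            if not message.transport_headers:
                continue    # items without transport headers cannot be parsed
            parser = HeaderParser()
            headers = parser.parsestr(message.transport_headers)
            messages.append(headers_to_df(headers))
    return messages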

def unique(list1):
    x = np.array(list1)
    return np.unique(x)

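def mails(dfmails,m_type):
    # "from" holds one mailbox per value; the other fields may hold comma-separated lists
    if m_type != "from":
        tmp=[]
        for line in dfmails:
            for m in line.split(','):
                tmp.append(m)
    else:
        tmp=dfmails
    tmp2=[]    
    for l in unique(tmp):
        try:
            tmp2.append((l.split("<")[1]).split(">")[0].strip())    # "Name <user@domain>" -> "user@domain"
        except Exception:
            tmp2.append(str(l).strip())                             # bare address without angle brackets
    return unique(tmp2)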

def mail_domains(mails_list):
    tmp=[]
    for l in mails_list:
        try:
            tmp.append(l.split("@")[1])
        except:
            print(l, " not included")
    return unique(tmp)

def x_org_ip():
    print("","* Extracting x_originating_ip")
    x_originating_ip=[]
    try:
        x_originating_ip=[ip.replace("[","").replace("]","") for ip in df[df["x-originating-ip"].notna()]["x-originating-ip"].unique()]
        print("\t",x_originating_ip)
    except:
        print("\t No x-originating-ip details")     
    return x_originating_ip

def mboxes(mbtype,title):
    print("","- "+title)
    mailboxes=[]
    try:
        mailboxes=mails(df[df[mbtype].notna()][mbtype].unique(),mbtype)
        print("\t",mailboxes)
    except:
        print("\t No ",mbtype," details")        
    return mailboxes

legit_domains=['domain1.com', 'domain2.com']    # YOUR LEGIT DOMAINS HERE
tlfile='TL_InBECstigation.log'

pstfile=['Exchange.pst']                                      # YOUR PST FILE HERE
messages=[]                                                   # Messages are also collected from folder .\msg; you can change this path in the glob pattern below
print("","Appending Exchange files")
for file in pstfile:
    try:
        pst = pypff.open(file)
    except Exception:
        print(""," -- Error, file "+file+" does not exist or cannot be opened.")
        continue                                              # skip this container instead of failing on an undefined pst
    root = pst.get_root_folder()
    #print("\t","Including container for analysis: "+file)
    messages=messages+parse_folder(root)
    pst.close()

f = glob.glob('msg/*.msg')

for filename in f:
    #print("\t","Including msg file for analysis: "+filename)
    msg = extract_msg.Message(filename)
    messages.append(headers_to_df(msg.headerDict))

df=pd.DataFrame(messages)
now = datetime.now()

current_time = now.strftime("%H:%M:%S")
print("\t","Number of messages to analyze: "+str(len(df.index)))
print("\t","Legitimate domains to be evaluated: ")
print("\t",legit_domains)
print("\t","Analysis time: "+current_time)
print("","********************************************************")

x_originating_ip=x_org_ip()

print("\r\n","* Extracting mailboxes")
mail_from=mboxes("from","From Mail")
mail_to=mboxes("to","Mail in TO")
mail_cc=mboxes("cc","Mail in CC")
return_path=mboxes("return-path", "Mailboxes in return-path")


## Appearance domains

Having a list of mailboxes, it is easy to extract each domain (@domain.com) and create a new source for validation.
Once all domains are extracted from headers and body, an array can be used to verify each domain against the legitimate ones.

For this, the analyst needs to include the legitimate domains involved in the affected communication, from all participating sides, and pass them to the analysis script. The evidence can then be checked for *misspelled or similar domains*.

The first check is whether the domain extracted from the evidence is listed as a legitimate domain. If it is not, it is time to check whether it is similar to a legitimate one. An easy way to measure character-based similarity is the **SequenceMatcher** class from `difflib`: its `ratio()` method indicates how close a pair of strings are, returning a value between 0 and 1, where 1 means exactly the same.
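
The check described above can be sketched in isolation as follows. This is only an illustration: the function names `closest_legit` and `flag_lookalikes` and the sample domains are assumptions, and the 0.7 threshold is the one used later in this notebook.

```python
from difflib import SequenceMatcher

def closest_legit(candidate, legit_domains):
    """Return (score, legit_domain) for the legitimate domain most similar to candidate."""
    scored = [(SequenceMatcher(None, legit, candidate).ratio(), legit)
              for legit in legit_domains]
    return max(scored)

def flag_lookalikes(candidates, legit_domains, threshold=0.7):
    """Map each non-legitimate domain above the threshold to its nearest legitimate domain."""
    legit = {d.lower() for d in legit_domains}
    flagged = {}
    for candidate in candidates:
        name = candidate.lower()
        if name in legit:
            continue  # exact matches are legitimate, not look-alikes
        score, nearest = closest_legit(name, sorted(legit))
        if score > threshold:
            flagged[name] = (nearest, round(score, 3))
    return flagged

# Hypothetical domains: one character swap, one TLD change, one unrelated name.
flagged = flag_lookalikes(
    ["example.com", "examp1e.com", "example.net", "unrelated.org"],
    ["example.com"],
)
print(flagged)
```

Recording the nearest legitimate domain next to the score makes it easier for the analyst to see which participant is being impersonated.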


In [None]:
appearance_threshold=0.7
susp_strings=["spf=softfail","dmarc=fail","compauth=fail"]

def similarities(a, b):
    return SequenceMatcher(None, a, b).ratio()

def dmn_appearance(dmnfull,real):
    dmn=[i for i in dmnfull if i not in real]
    dmnappearance=[]
    for b in dmn:
        for a in real:
            if similarities(a, b) > appearance_threshold:
                dmnappearance.append(b)
                break    # one match is enough; avoids listing the same domain twice
    return dmnappearance

mbfull=unique(np.concatenate([mail_from,mail_to,mail_cc,return_path]))

print("\r\n","* Extracting domains from pst collection")
domains=mail_domains(mbfull)
print("\t",domains)

print("\r\n","* Verifying possible appearance domains")  
dmn_app=dmn_appearance(domains,legit_domains)
print("",dmn_app)


Because attackers tend to modify just a couple of characters or the TLD (top-level domain), a similarity above the 70% (0.7) threshold is a relevant value for flagging the identified domain.
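
As a rough worked illustration of why 0.7 is a reasonable cut-off, recall the documented definition of `SequenceMatcher.ratio()`, where $M$ is the number of matched characters and $T$ is the total length of both strings. The domains below are hypothetical.

$$
\text{ratio} = \frac{2M}{T}
$$

$$
\text{example.com vs. examp1e.com: } \frac{2 \cdot 10}{11 + 11} \approx 0.91
\qquad
\text{example.com vs. example.net: } \frac{2 \cdot 8}{11 + 11} \approx 0.73
$$

A single swapped character scores well above the threshold, while a changed TLD on a short name lands just above it, so lowering the threshold much further would mostly add unrelated domains.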

## Whois analysis

If a domain is flagged as a possible appearance domain, it is important to identify its creation date and how it fits into the investigation timeline. A good option is to analyze the **whois** response and verify the relevant dates:


In [None]:
def dmn_whois(dmn):
    w = whois.whois(dmn)
    return w

print("\r\n","* Verifying appearance domains metadata")

susp_rows=[]

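for dmn in dmn_app:
    try:
        w=dmn_whois(dmn)
    except Exception as err:                  # lookup can fail for unregistered or unsupported domains
        print("\t[suspicious domain] ",dmn," whois lookup failed: "+str(err))
        continue
    created=w.creation_date[0] if isinstance(w.creation_date,list) else w.creation_date    # some registries return several dates
    print("\t[suspicious domain] ",dmn)
    print("\t","Creation date: "+str(created))
    susp_rows.append([str(created)+'+00:00',dmn,'Suspicious domain creation time','','','','',''])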


Placing the creation date in the timeline of suspicious events can provide a clue about when the attackers learned of the legitimate domains and set up the new spoofing infrastructure.
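
Whois responses are not uniform: depending on the registry, the creation date may arrive as a single `datetime`, as a list of them, or be missing. The sketch below shows one way to normalize such values before they go into a timeline. The function names are hypothetical, and treating naive dates as UTC is an assumption rather than something the whois data guarantees.

```python
from datetime import datetime, timezone

def _to_utc(value):
    """Convert a datetime to UTC; naive values are assumed to already be UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)

def normalize_creation_date(value):
    """Return the earliest creation date as an ISO 8601 string, or None if unavailable."""
    candidates = value if isinstance(value, (list, tuple)) else [value]
    dates = [_to_utc(item) for item in candidates if isinstance(item, datetime)]
    return min(dates).isoformat() if dates else None

# Hypothetical values in the shapes a whois lookup may return.
single = normalize_creation_date(datetime(2021, 3, 1, 12, 0, 0))
several = normalize_creation_date([datetime(2021, 3, 2), datetime(2021, 3, 1)])
missing = normalize_creation_date(None)
print(single, several, missing)
```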

## Suspicious mailboxes

At this point we have a set of possibly suspicious mailboxes, based on the domains they belong to. We can create a new list that includes only the suspicious mailboxes, to analyze them one by one; it is the match between mailboxes and suspicious domains.

Also, we can filter the main data frame for fields that contain a specific appearance domain and extract the details into a new list.

With this information, we can extract specific subjects or the mailboxes involved in order to analyze them. It is useful to identify in which field the mailbox was detected (To, From or CC). This can be used later to try to identify the initial access point to the message loop.
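
To illustrate the last point, the sketch below records in which header fields each mailbox of a given domain appears. It works on a plain list of header dictionaries rather than on the data frame; the function name `roles_for_domain` and the sample messages are hypothetical.

```python
from collections import defaultdict
from email.utils import getaddresses

def roles_for_domain(messages, domain, fields=("from", "to", "cc")):
    """Map each mailbox of `domain` to the set of header fields where it was seen."""
    roles = defaultdict(set)
    suffix = "@" + domain.lower()
    for headers in messages:
        for field in fields:
            value = headers.get(field)
            if not value:
                continue
            # getaddresses handles "Name <user@domain>" and comma-separated lists.
            for _name, address in getaddresses([value]):
                if address.lower().endswith(suffix):
                    roles[address.lower()].add(field)
    return dict(roles)

# Hypothetical messages: the look-alike mailbox first sends, then is copied.
sample = [
    {"from": "Alice <alice@examp1e.com>", "to": "bob@example.com"},
    {"from": "bob@example.com", "to": "carol@example.com", "cc": "ALICE@examp1e.com"},
]
print(roles_for_domain(sample, "examp1e.com"))
```

Matching on the `@domain` suffix avoids false hits where the suspicious domain is only a substring of a longer, unrelated domain.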

In [None]:
def check_messages(msg_field,title,value):
    print("","Verifying messages ",title)
    tmp=pd.DataFrame()
    try:
        # regex=False matches the value literally (a "." in a domain is not a wildcard); na=False skips empty fields
        tmp=df[df[msg_field].str.lower().str.contains(value.lower(),regex=False,na=False)]
        print("Subjects: ",unique(tmp["subject"]))
    except Exception:
        print("No ",msg_field, " details")
    return tmp

susp_mb=[]
print("\r\n","* Suspicious mailboxes:")
for m in mbfull:
    for d in dmn_app:
        if d in m and m not in susp_mb:    # substring test; avoids relying on an exception for "not found"
            susp_mb.append(m)
print("",susp_mb)

print("\r\n","* Verifying messages including appearance domains")

susp_msg=pd.DataFrame()
for dmn in dmn_app:
    tmp=check_messages("from","from domain "+dmn,dmn)
    if tmp.size > 0: 
        susp_msg=pd.concat([susp_msg,tmp]) 
    tmp=check_messages("to","to domain "+dmn,dmn)
    if tmp.size > 0: 
        susp_msg=pd.concat([susp_msg,tmp]) 
    tmp=check_messages("cc","in CC to domain "+dmn,dmn)
    if tmp.size > 0: 
        susp_msg=pd.concat([susp_msg,tmp]) 


## Messages Headers’ alerts

Now it’s time to verify the headers and focus on alerts and details. From the messages it is possible to verify which of them are being flagged by mail security controls. One easy option is to look for records in which the following strings appear: **["spf=softfail","dmarc=fail","compauth=fail"]** (the list can be extended based on analyst experience).
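
As a complement to plain substring matching, the verdicts can be pulled out of the header with a regular expression. This is a minimal sketch under stated assumptions: the function name `failed_checks`, the set of results treated as failures, and the sample header value are illustrative and should be adapted to the mail platform under analysis.

```python
import re

# method=result pairs for the checks of interest; other key=value pairs are ignored.
AUTH_RESULT_RE = re.compile(r"\b(spf|dkim|dmarc|compauth)=([a-z]+)", re.IGNORECASE)

# Assumption: only these results are treated as alerts.
FAILING_RESULTS = {"fail", "softfail"}

def failed_checks(header_value):
    """Return 'method=result' strings for every failing check, in order of appearance."""
    pairs = AUTH_RESULT_RE.findall(header_value or "")
    return [f"{method.lower()}={result.lower()}"
            for method, result in pairs
            if result.lower() in FAILING_RESULTS]

# Hypothetical Authentication-Results value.
sample_header = (
    "spf=softfail (sender IP is 203.0.113.5) smtp.mailfrom=example.com; "
    "dkim=none (message not signed) header.d=none; "
    "dmarc=fail action=none header.from=example.com; compauth=fail reason=000"
)
print(failed_checks(sample_header))
```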


In [None]:
print("\r\n","* Verifying messages where authentication-results parameter is suspicious")

for val in susp_strings:
    tmp=check_messages("authentication-results","with value "+val,val)
    if tmp.size > 0: 
        susp_msg=pd.concat([susp_msg,tmp]) 


## Statistical analysis and Timeline

Although attackers spoof critical mailboxes in the conversation, this does not mean that one of them was the initially compromised account. In most cases, the intrusion point could be an employee who does not decide or execute the financial or acquisition orders but is informed about every step in the process. 

It is important to work out which legitimate users were involved in the initial conversation and record this as part of the general analysis. An analysis of the TO and CC fields can determine which mailboxes should be included in the analysis.
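
One simple way to approach this is to count how often each mailbox appears as a recipient. The sketch below assumes a list of header dictionaries with lowercase keys; the function name `recipient_counts` and the sample data are hypothetical.

```python
from collections import Counter
from email.utils import getaddresses

def recipient_counts(messages, fields=("to", "cc")):
    """Count how many times each mailbox appears in the given recipient fields."""
    counts = Counter()
    for headers in messages:
        values = [headers[field] for field in fields if headers.get(field)]
        for _name, address in getaddresses(values):
            if address:
                counts[address.lower()] += 1
    return counts

# Hypothetical conversation: bob is always addressed, carol is only copied.
thread = [
    {"to": "Bob <bob@example.org>", "cc": "carol@example.org"},
    {"to": "bob@example.org, dave@example.org", "cc": None},
    {"to": "bob@example.org", "cc": "Carol <CAROL@example.org>"},
]
counts = recipient_counts(thread)
print(counts.most_common())
```

Mailboxes that are copied on most messages but rarely send any are candidates for the silent observer role described above.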


In [None]:
if susp_msg.size > 0: 
    susp_msg["date"]=pd.to_datetime(susp_msg["date"],format='%a, %d %b %Y %X %z')
    print("\r\n","* Suspicious messages: "+str(len(susp_msg.index)))

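    for index, item in susp_msg.iterrows():
        # item.get() tolerates missing header columns; "cc" is included so each row matches the 8-column header below
        susp_rows.append([str(item["date"]),item.get("from",""),item.get("return-path",""),item.get("subject",""),item.get("to",""),item.get("cc",""),item.get("thread-index",""),item.get("received-spf","")])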

    if len(susp_rows)>1:
        print("\r\n","* Creating timeline for suspicious events to file: "+tlfile)
        sorted_list=sorted(susp_rows, key=lambda x: x[0])
        sorted_list.insert(0,["date","from","return-path","subject","to","cc","thread-index","received-spf"])
        with open(tlfile, 'w') as file:
            for l in sorted_list:
                out=str(l)[1:-1]
                file.write(out+'\r\n')


In [None]:
print(unique([str(i[3]).replace('\r\n','') for i in susp_rows]))    # str() guards against empty (NaN) subjects
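
Sorting timeline rows by their text representation only gives a chronological order when all dates share the same offset. As a hedged alternative, the sketch below parses the `Date` header with the standard library and sorts in UTC. The function name `sort_by_date` and the sample rows are hypothetical, and naive dates are assumed to be UTC.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def sort_by_date(rows, date_key="date"):
    """Return (utc_iso_date, row) pairs in chronological order, skipping unparseable dates."""
    dated = []
    for row in rows:
        try:
            when = parsedate_to_datetime(row[date_key])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed Date header
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        dated.append((when.astimezone(timezone.utc), row))
    dated.sort(key=lambda pair: pair[0])
    return [(when.isoformat(), row) for when, row in dated]

# Hypothetical rows whose text order differs from their real order.
rows = [
    {"date": "Mon, 01 Mar 2021 08:00:00 -0500", "subject": "d"},
    {"date": "Mon, 01 Mar 2021 10:00:00 +0100", "subject": "b"},
    {"date": "Mon, 01 Mar 2021 09:30:00 +0000", "subject": "c"},
    {"date": "not a date", "subject": "x"},
]
timeline = sort_by_date(rows)
print([row["subject"] for _when, row in timeline])
```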

## Decode strings

We can use a few statements to recover the content of encoded subjects, in order to understand and verify the real objective of the communication.
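
The standard library can also decode these subjects, including the quoted-printable (`Q`) form. The following is a small sketch using `email.header`; the function name `decode_mime_subject` and the second sample subject are hypothetical.

```python
from email.header import decode_header, make_header

def decode_mime_subject(subject):
    """Decode RFC 2047 encoded words (B or Q encoding); plain text is returned unchanged."""
    return str(make_header(decode_header(subject)))

base64_subject = "=?UTF-8?B?RG8gWW91IERvIEFueSBvZiBUaGVzZSBFbWJhcnJhc3NpbmcgVGhpbmdzPw==?="
quoted_subject = "=?utf-8?Q?Pago_pendiente?="

print(decode_mime_subject(base64_subject))
print(decode_mime_subject(quoted_subject))
print(decode_mime_subject("Invoice 123"))
```

Relying on the library avoids maintaining a regular expression for each encoding and charset combination.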

In [None]:
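def decodeSubject(subject):
    # RFC 2047 encoded word: =?charset?encoding?encoded-text?=
    encoded_word_regex = r'=\?([^?]+)\?([BbQq])\?([^?]*)\?='
    match = re.match(encoded_word_regex, subject)
    if match is None:
        return subject                      # not an encoded word, return it unchanged
    charset, encoding, encoded_text = match.groups()
    if encoding.upper() == 'B':
        return base64.b64decode(encoded_text).decode(charset, errors='replace')
    else:
        return "unsupported encoding (Q, quoted-printable): " + subject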

decodeSubject('=?UTF-8?B?RG8gWW91IERvIEFueSBvZiBUaGVzZSBFbWJhcnJhc3NpbmcgVGhpbmdzPw==?=')    #Your encoded UTF strings here

## Subject analysis
Identifying the subjects and topics selected by threat actors to jump into the conversation can reveal other details, such as attachments and communication attempts aimed at different areas, that may not have been detected during the primary analysis. Using the message lists, we can extract all related messages for additional analysis and verification.
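
Replies and forwards of the same conversation carry prefixes such as `RE:` or `FW:`, so it can help to normalize subjects before grouping them. The sketch below is one possible approach; the function names, the prefix list and the sample messages are assumptions.

```python
import re
from collections import defaultdict

# Leading reply/forward markers, possibly repeated ("RE: FW: ...").
REPLY_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Strip reply/forward prefixes and surrounding whitespace, then lowercase."""
    return REPLY_PREFIX.sub("", subject or "").strip().lower()

def group_by_subject(messages):
    """Group header dictionaries by their normalized subject."""
    groups = defaultdict(list)
    for headers in messages:
        groups[normalize_subject(headers.get("subject"))].append(headers)
    return dict(groups)

# Hypothetical messages from one conversation plus an unrelated one.
inbox = [
    {"subject": "Your partnership is required urgently"},
    {"subject": "RE: FW: Your partnership is required urgently "},
    {"subject": "Regarding: lunch"},
]
groups = group_by_subject(inbox)
print({subject: len(items) for subject, items in groups.items()})
```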

In [None]:
#print(unique(susp_rows["subject"]))
pattern='.*Your partnership is required urgently'               #Your interesting subject here you may use regular expressions
mask=df['subject'].str.contains(pattern,case=False,na=False)
print('Messages related to pattern '+pattern+': '+str(len(df[mask])))
print('Exporting to file mysubject.csv')
tmp=df.loc[mask, ['date','from','to','subject']]
tmp["date"]=pd.to_datetime(tmp["date"],format='%a, %d %b %Y %X %z')
tmp.to_csv('mysubject.csv',index=False)


## IP addresses analysis
We can use intelligence platforms to take a quick look at the reputation of domains and IP addresses. 
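
Before submitting addresses to an external platform, it can be worth discarding private, loopback and malformed values, since they carry no reputation and consume API quota. A minimal sketch with the standard `ipaddress` module follows; the function name `public_ips` and the sample values are hypothetical.

```python
import ipaddress

def public_ips(values):
    """Return unique, globally routable IP addresses, preserving first-seen order."""
    result = []
    for raw in values:
        text = str(raw).strip().strip("[]")  # x-originating-ip is often wrapped in brackets
        try:
            address = ipaddress.ip_address(text)
        except ValueError:
            continue  # not an IP address
        if address.is_global and str(address) not in result:
            result.append(str(address))
    return result

# Hypothetical header values.
candidates = ["[10.0.0.5]", "8.8.8.8", "[8.8.8.8]", "192.168.1.1", "not-an-ip", "127.0.0.1"]
print(public_ips(candidates))
```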

In [None]:
import requests

apikey = 'YOUR API KEY'  #### ENTER API KEY HERE ####

requests.urllib3.disable_warnings()
client = requests.session()
client.verify = False
domainErrors = []
delay = {}

def DomainScanner(domain): #'From Matthew Clairmont VT_Domain_Scanner
    url = 'https://www.virustotal.com/vtapi/v2/url/scan'
    params = {'apikey': apikey, 'url': domain}

    try:
        r = client.post(url, params=params)
    except requests.ConnectTimeout as timeout:
        print('-','Connection timed out. Error is as follows-'+str(timeout))
        return delay                      # no response to inspect; keep the current delay map

    print('',domain)

    if r.status_code == 200:
        try:
            jsonResponse = r.json()
            if jsonResponse['response_code'] == -2:      # checked first, otherwise the "!= 1" branch hides it
                print('\t',str(domain)+' is queued for scanning.')
                delay[domain] = 'queued'
            elif jsonResponse['response_code'] != 1:
                print('\t','There was an error submitting the domain for scanning.')
            else:
                print('\t',str(domain)+' was scanned successfully.')

        except ValueError:
            print('\t','There was an error when scanning '+str(domain))

        time.sleep(15)  ############### PUBLIC API KEYS ARE RATE LIMITED; IF YOU HAVE A PRIVATE ACCESS YOU CAN CHANGE THIS TO 1 ###################

    elif r.status_code == 204:
        print('','Received HTTP 204 response. You may have exceeded your API request quota or rate limit.')

    return delay                          # always return the map so the caller never receives None

def DomainReportReader(domain, delay): #'From Matthew Clairmont VT_Domain_Scanner
    if delay:
        if domain in delay:
            time.sleep(10)

    url = 'https://www.virustotal.com/vtapi/v2/url/report'
    params = {'apikey': apikey, 'resource': domain}

    try:
        r = client.post(url, params=params)
    except requests.ConnectTimeout as timeout:
        print('\t','Connection timed out. Error '+str(timeout))
        return None                       # skip this resource instead of stopping the notebook

    if r.status_code == 200:
        try:
            jsonResponse = r.json()
            if jsonResponse['response_code'] == 0:
                print('\t','No report is available for '+str(domain))
                return None

            scandate = jsonResponse['scan_date']
            positives = jsonResponse['positives']
            total = jsonResponse['total']

            data = [scandate, domain, positives, total]
            return data

        except ValueError:
            print('\t','There was an error when scanning '+str(domain))

        except KeyError:
            print('\t','There was an error when scanning '+str(domain))

    elif r.status_code == 204:
        print('','Received HTTP 204 response. You may have exceeded your API request quota or rate limit.')
        time.sleep(10)
        return DomainReportReader(domain, delay)    # retry and pass the result back to the caller


print('','* Verifying IP addresses against VirusTotal')
for ip in x_originating_ip:
    try:
        delay = DomainScanner(ip)
        data = DomainReportReader(ip, delay)
        if data:
            print('\t',data)
            time.sleep(1)  
    except Exception as err:  
        print('','- Encountered an error but scanning will continue.'+str(err))
