# Research Malware from MITRE Att@ck

## Initialisation

We will start by loading the necessary libraries and setting up the environment.
To help with the research we will use the AI magic to get some help from the AI
which is enabled by the `jupyter-ai-magics` library.
The AI will help us to create a web scraper using Selenium 4 to extract information
from the MITRE Att@ck website.

Using the `dotenv` library we will load the environment variables from a `.env` file.

In [5]:
%load_ext jupyter_ai_magics
%load_ext dotenv
%dotenv

The jupyter_ai_magics extension is already loaded. To reload it, use:
  %reload_ext jupyter_ai_magics
The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


## Use AI to create a web scraper

The following prompt was used to generate the code for the web scraper.
To re-generate the code, you can remove the first line and run the cell.

In [6]:
%%script false --no-raise-error
%%ai -f code chatgpt
Create
a
simple
web
scraper
using
Selenium
4
that
opens
the
link
https: // attack.mitre.org / software /,
and collects
each
link in the
first
column
of
the
only
table
on
that
page in a
list.
It
then
iterates
over
each
link in the
list, opens
it, waits
for the page to fully load.
Then
it
searches
for the term "steganography" in the page corresponding to the link.
If
the
terms
are
found, add
the
link
to
a
new
list.
The
scraper
should
catch
all
exception and always
quit
the
driver.
The
code
should
be
compact and efficient.
The
path
to
the
chromedriver
should
be
loaded
from the environment

variable
CHROMEDRIVER_PATH.

## Extract further inforation of the MITRE Att@ck links

The extracted data includes:
- Name of the malware
- Description
- MITRE ID
- Creation date
- Modification date
- Platforms
- Techniques used

The code was created by the AI and slightly edited by the author.
For example, the search was moved to a separate function and is only called if there is no file with the found links in the `data` directory yet.

In [7]:
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Get the path to chromedriver from environment variable
chrome_driver_path = os.environ.get('CHROMEDRIVER_PATH')

# Set up Chrome driver
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Run Chrome in headless mode
options.add_argument('--disable-extensions')
options.add_argument('--disable-dev-shm-usage')
service = Service(chrome_driver_path)
driver = webdriver.Chrome(service=service, options=options)

FOUND_LINKS_FILE = 'data/mitre-attack-stego-malware.txt'


def search_mitre_attack(mitre_attack_url='https://attack.mitre.org/software/'):
    global driver
    # Open the link
    driver.get(mitre_attack_url)

    # Collect links in the first column of the table
    links = driver.find_elements(By.CSS_SELECTOR, 'table tr td:nth-child(1) a')
    link_urls = [link.get_attribute('href') for link in links]

    driver.quit()

    # Iterate over each link
    found_links = []
    for link_url in link_urls:
        driver = webdriver.Chrome(service=service, options=options)
        try:
            # Open the link
            driver.get(link_url)

            # Wait for the page to fully load
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, 'body'))
            )

            # Search for the terms "steganography" in the page
            page_content = driver.page_source.lower()
            if 'steganography' in page_content:
                found_links.append(link_url)
        except Exception as e:
            print(f'Error occurred while processing link: {link_url}')
            print(e)
        finally:
            # Quit the driver
            driver.quit()
    return found_links


# Check if the file with the found links already exists
if os.path.exists(FOUND_LINKS_FILE):
    # Load the found links from the file
    with open(FOUND_LINKS_FILE, 'r') as f:
        found_links = f.read().splitlines()
else:
    # Search for the links
    found_links = search_mitre_attack()

# Print the found links
for link in found_links:
    print(link)

https://attack.mitre.org/software/S0469
https://attack.mitre.org/software/S0440
https://attack.mitre.org/software/S0473
https://attack.mitre.org/software/S0234
https://attack.mitre.org/software/S0470
https://attack.mitre.org/software/S0471
https://attack.mitre.org/software/S0187
https://attack.mitre.org/software/S0659
https://attack.mitre.org/software/S0038
https://attack.mitre.org/software/S0037
https://attack.mitre.org/software/S0483
https://attack.mitre.org/software/S0231
https://attack.mitre.org/software/S0395
https://attack.mitre.org/software/S0513
https://attack.mitre.org/software/S0644
https://attack.mitre.org/software/S0439
https://attack.mitre.org/software/S0518
https://attack.mitre.org/software/S0139
https://attack.mitre.org/software/S0654
https://attack.mitre.org/software/S0565
https://attack.mitre.org/software/S0458
https://attack.mitre.org/software/S0495
https://attack.mitre.org/software/S0511
https://attack.mitre.org/software/S0633
https://attack.mitre.org/software/S0559


## Save the found links to a file

The found links will be saved to a file called `found-links.txt` in the `data` directory.

In [8]:
import json

with open(FOUND_LINKS_FILE, 'w') as f:
    f.write("\n".join(found_links))

## Extract further inforation of the MITRE Att@ck links

The extracted data includes:
- Name of the malware
- MITRE ID
- Creation date
- Modification date
- Platforms
- Techniques used

For that we will look for the small information box on the right hand side and the table of "Techniques Used". 

The code was created by the AI and slightly edited by the author.
To re-generate the code, you can remove the first line and run the cell.

In [9]:
%%script false --no-raise-error
%%ai -f code chatgpt
Create
a
simple
web
scraper
function
using
Selenium
4
that
has
a
link
to
a
specific
MITRE
Attack
software as input.
The
scraper
should
search
for a div with the class "card-body" that contains multiple divs with class "card-data".
The
card
data
contains
divs
with the class "col-md-11" which contain the relevant information that should be stored in a dictionary.
Furthermore
there is a
table
contained in the
page
with the class "techniques-used" which should be converted into another dictionary.
Both
dictionaries
shall
be
returned as the
output
of
the
function.
For
selection
use
the
function
`find_elements()`
with  `By.CSS_SELECTOR``.
The
path
to
the
chromedriver
should
be
loaded
from the environment

variable
CHROMEDRIVER_PATH.

In [10]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import os
import re
import json


MITRE_ATTACK_DATA_FILE = 'data/mitre-attack-stego-malware.json'


def web_scraper(link):
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(link)

    name = driver.find_element(By.TAG_NAME, 'h1').text
    description = driver.find_element(By.CLASS_NAME, 'description-body').text

    card_body = driver.find_element(By.CLASS_NAME, 'card-body')
    card_data_divs = card_body.find_elements(By.CLASS_NAME, 'card-data')

    card_data_dict = {
        'Name': name,
        'Description': description,
    }
    for card_data in card_data_divs:
        col_md_11_div = card_data.find_element(By.CLASS_NAME, 'col-md-11')
        key, value = col_md_11_div.text.split(':')
        card_data_dict[key] = value

    techniques_table = driver.find_element(By.CLASS_NAME, 'techniques-used')
    techniques_rows = techniques_table.find_elements(By.TAG_NAME, 'tr')

    techniques_headers = [
        header.text
        for header in techniques_rows[0].find_elements(By.TAG_NAME, 'th')
    ]
    techniques = []
    for row in techniques_rows[1:]:
        row_classes = row.get_attribute('class')
        technique_dict = {}
        columns = row.find_elements(By.TAG_NAME, 'td')
        if len(columns) == 4:
            for index, header in enumerate(techniques_headers):
                technique_dict[header] = columns[index].text
        elif 'noparent' in row_classes:
            technique_dict[techniques_headers[0]] = columns[0].text
            technique_dict[techniques_headers[1]] = ''.join([columns[1].text, columns[2].text])
            technique_dict[techniques_headers[2]] = columns[3].text
            technique_dict[techniques_headers[3]] = columns[4].text
        else:
            prev_technique = techniques[-1]
            technique_dict[techniques_headers[0]] = prev_technique[techniques_headers[0]]
            technique_dict[techniques_headers[1]] = prev_technique[techniques_headers[1]]
            technique_dict[techniques_headers[2]] = columns[3].text
            technique_dict[techniques_headers[3]] = columns[4].text
        techniques.append(technique_dict)

    driver.quit()

    card_data_dict['MITRE ID'] = card_data_dict['ID']
    del card_data_dict['ID']

    return {
        **card_data_dict,
        'Techniques Used': techniques,
    }


if os.path.exists(MITRE_ATTACK_DATA_FILE):
    with open(MITRE_ATTACK_DATA_FILE, 'r') as f:
        mitre_attack_data = json.load(f)
else:
    mitre_attack_data = [web_scraper(link) for link in found_links]
    with open('data/mitre-attack-stego-malware.json', 'w') as f:
        json.dump(mitre_attack_data, f, indent=4)
mitre_attack_data

[{'Name': 'ABK',
  'Description': 'ABK is a downloader that has been used by BRONZE BUTLER since at least 2019.[1]',
  'Type': ' MALWARE',
  'Platforms': ' Windows',
  'Version': ' 1.0',
  'Created': ' 10 June 2020',
  'Last Modified': ' 24 June 2020',
  'MITRE ID': ' S0469',
  'Techniques Used': [{'Domain': 'Enterprise',
    'ID': 'T1071.001',
    'Name': 'Application Layer Protocol: Web Protocols',
    'Use': 'ABK has the ability to use HTTP in communications with C2.[1]'},
   {'Domain': 'Enterprise',
    'ID': 'T1059.003',
    'Name': 'Command and Scripting Interpreter: Windows Command Shell',
    'Use': 'ABK has the ability to use cmd to run a Portable Executable (PE) on the compromised host.[1]'},
   {'Domain': 'Enterprise',
    'ID': 'T1140',
    'Name': 'Deobfuscate/Decode Files or Information',
    'Use': 'ABK has the ability to decrypt AES encrypted payloads.[1]'},
   {'Domain': 'Enterprise',
    'ID': 'T1105',
    'Name': 'Ingress Tool Transfer',
    'Use': 'ABK has the abili

# Research Malware from Malpedia

## Download the Malpedia bibliography

The Malpedia bibliography is a list of references to malware families and their aliases.
We will download the bibliography from the Malpedia website and save it to a file.

In [11]:
import requests

MALPEDIA_BIBLIOGRAPHY_URL = 'https://malpedia.caad.fkie.fraunhofer.de/library/download'
MALPEDIA_BIBLIOGRAPHY_FILE = 'data/malpedia.bib'

try:
    response = requests.get(MALPEDIA_BIBLIOGRAPHY_URL)
    response.raise_for_status()

    with open(MALPEDIA_BIBLIOGRAPHY_FILE, 'wb') as f:
        f.write(response.content)
except requests.exceptions.RequestException as e:
    print(f'Error occurred while downloading the Malpedia bibliography: {e}')
    
    if os.path.exists(MALPEDIA_BIBLIOGRAPHY_FILE):
        print('Using the existing file.')


## Filter the Malpedia bibliography for steganography related malware

We will filter the Malpedia bibliography for malware families that are related to steganography.
For that we first parse the bibliography file and then search for the term "steganography" in the description of each malware family.

In [12]:
import bibtexparser

bibliography = bibtexparser.parse_file(MALPEDIA_BIBLIOGRAPHY_FILE)

stego_malware_entries = []
for entry in bibliography.entries:
    if 'steganography' in entry['title'].lower():
        stego_malware_entries.append(entry)

stego_malware_entries

[Entry(entry_type=`online`, key=`chen:20171107:redbaldknightbronze:63a08fe`, fields=`[Field(key=`author`, value=`Joey Chen and MingYen Hsieh`, start_line=23060), Field(key=`title`, value=`{REDBALDKNIGHT/BRONZE BUTLER’s Daserf Backdoor Now Using Steganography}`, start_line=23061), Field(key=`date`, value=`2017-11-07`, start_line=23062), Field(key=`organization`, value=`Trend Micro`, start_line=23063), Field(key=`url`, value=`https://blog.trendmicro.com/trendlabs-security-intelligence/redbaldknight-bronze-butler-daserf-backdoor-now-using-steganography/`, start_line=23064), Field(key=`language`, value=`English`, start_line=23065), Field(key=`urldate`, value=`2020-01-09`, start_line=23066)]`, start_line=23059),
 Entry(entry_type=`online`, key=`eschweiler:20181025:cutwail:494e458`, fields=`[Field(key=`author`, value=`Sebastian Eschweiler and Brett Stone-Gross and Bex Hartley`, start_line=37758), Field(key=`title`, value=`{Cutwail Spam Campaign Uses Steganography to Distribute URLZone}`, sta

## Scrape data from the Malpedia links

We will scrape data from the Malpedia links to get further information about the steganography related malware families. The goal is to extract information closely related to the MITRE Att@ck data.

In [13]:
from selenium.common import NoSuchElementException
from selenium import webdriver

PLATFORMS = ['Windows', 'macOS', 'Linux', 'Android', 'iOS']

def scrape_malware_data(url, name=None, description=None, created_at=None):
    data = None

    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)

        # Wait for the page to fully load
        driver.implicitly_wait(10)

        page_content = driver.page_source.lower()
        if 'steganography' in page_content:
            return data

        try:
            # Look for the title in the meta tags
            name = driver.find_element(By.XPATH, '//meta[@name="title" or @property="og:title"]').get_attribute(
                'content')
        except NoSuchElementException:
            pass

        # The description is not always available
        try:
            # Look for a meta tag that contains "description" in the name or property attribute
            description = driver.find_element(By.XPATH,
                                              '//meta[@name="description" or contains(@property, "description")]').get_attribute(
                'content')
        except NoSuchElementException:
            pass

        # The creation date is not always available
        try:
            # Look for a meta tag that contains "created" or "published" in the name or property attribute
            created_at = driver.find_element(By.XPATH,
                                             '//meta[contains(@name, "created") or contains(@name, "published") or contains(@property, "created") or contains(@property, "published")]').get_attribute(
                'content')
        except NoSuchElementException:
            pass

        platforms = [platform for platform in PLATFORMS if platform.lower() in page_content]
        data = {
            'Name': name,
            'Description': description,
            'Type': 'MALWARE',
            'Created': created_at,
            'Platforms': platforms,
            'References': [url],
        }
    except Exception as e:
        print(f'Error occurred while processing link: {url}')
        print(e)
    finally:
        driver.quit()

    return data


if os.path.exists('data/malpedia-stego-malware.json'):
    with open('data/malpedia-stego-malware.json', 'r') as f:
        malpedia_malware_data = json.load(f)
else:
    malpedia_malware_data = [scrape_malware_data(entry['url'], entry['title'], created_at=entry['date']) for entry in
                             stego_malware_entries]
    malpedia_malware_data = [data for data in malpedia_malware_data if data and data['Name']]
    with open('data/malpedia-stego-malware.json', 'w') as f:
        json.dump(malpedia_malware_data, f, indent=4)
malpedia_malware_data


[{'Name': 'REDBALDKNIGHT’s Daserf Backdoor Now Uses Steganography',
  'Description': 'REDBALDKNIGHT a.k.a BRONZE BUTLER cyberespionage group employ the Daserf backdoor in campaigns. We found that  Daserf was not only used on Japanese targets, but also against other countries. We also found versions of Daserf that use steganography.',
  'Type': 'MALWARE',
  'Created': '2017-11-07',
  'Platforms': ['Windows'],
  'References': ['https://blog.trendmicro.com/trendlabs-security-intelligence/redbaldknight-bronze-butler-daserf-backdoor-now-using-steganography/']},
 {'Name': 'Cutwail Spam Campaign Uses Steganography to Distribute URLZone',
  'Description': 'CrowdStrike analyzed a new Cutwail spam campaign from NARWHAL SPIDER that uses digital steganography to distribute URLZone.',
  'Type': 'MALWARE',
  'Created': '2018-10-25T22:17:25+00:00',
  'Platforms': ['Windows', 'macOS'],
  'References': ['https://www.crowdstrike.com/blog/cutwail-spam-campaign-uses-steganography-to-distribute-urlzone/']}

# Research Malware from https://github.com/lucacav/steg-in-the-wild

## Download the steg-in-the-wild dataset

The steg-in-the-wild dataset is a collection of links to articles, papers, and other resources related to steganography malware.
It is available on GitHub at https://raw.githubusercontent.com/lucacav/steg-in-the-wild/master/README.md as a Markdown file.

We will download the dataset and extract the links from the Markdown file which are related to image steganography.
These are contained in the first bullet list in the file. 
To gather further information we will use the web scraper used for the Malpedia data.

In [14]:
import itertools

SITW_DATASET_URL = 'https://raw.githubusercontent.com/lucacav/steg-in-the-wild/master/README.md'

response = requests.get(SITW_DATASET_URL)
response.raise_for_status()


def extract_links_from_list(text):
    lines = text.splitlines()

    # We only want the first bullet list
    list_entries = itertools.dropwhile(lambda line: not line.startswith('*'), lines)
    list_entries = itertools.takewhile(lambda line: line.startswith('*'), list_entries)

    # Extract links and their descriptions
    links = []
    for entry in list_entries:
        link, description = entry.split('):', 1)
        name, url = link.split('](', 1)
        name = name[3:]
        url = url[:-1]
        links.append((name, url, description.strip()))
    return links


sitw_links = extract_links_from_list(response.text)
sitw_links


[('Lumma',
  'https://twitter.com/1ZRR4H/status/1706747262993350752/photo/',
  'similarly to Lurk and Stegoloader, the new Lumma stealer now uses steganography to hide payloads in images to be retrieved from a web repository'),
 ('Formbook exploits steganography',
  'https://malwr0nwind0z.com/post_5-15-23_formbook_sample',
  'a malicious .NET executable (called MajorRevision.exe) is hidden in a compressed bitmap image'),
 ('Worok Group hides malware in PNG',
  'https://www.bleepingcomputer.com/news/security/worok-hackers-hide-new-malware-in-pngs-using-steganography',
  'LSB steganography is used to cloak data in PNG images. Worok hides two payloads: a PowerShell script and a custom .NET C# stealer able to abuse Dropbox for cloaking exfiltration and C&C communications'),
 ('Malicious PyPI Package',
  'https://research.checkpoint.com/2022/check-point-cloudguard-spectral-exposes-new-obfuscation-techniques-for-malicious-packages-on-pypi',
  'a malicious package published on [PyPI](https://

## Scrape data from the steg-in-the-wild links

We will scrape data from the steg-in-the-wild links to get further information about the steganography related malware families. 
The goal is to extract information closely related to the MITRE Att@ck data.

In [15]:
if os.path.exists('data/sitw-stego-malware.json'):
    with open('data/sitw-stego-malware.json', 'r') as f:
        sitw_data = json.load(f)
else:
    sitw_data = [scrape_malware_data(url, name, description) for name, url, description in sitw_links]
    sitw_data = [data for data in sitw_data if data and data['Name']]
    with open('data/sitw-stego-malware.json', 'w') as f:
        json.dump(sitw_data, f, indent=4)
sitw_data

[{'Name': 'Stealthy Formbook leverages steganography - malwr0nwind0z',
  'Description': 'Formbook: A Infostealer Formbook is a type of malware that is primarily used for stealing sensitive information from infected computers, was first discovered in the wild back in 2016. It is commonly distributed via malspam, or malicious spam, which is a type of spam email that contains malware or links to malware-infected websites. In this',
  'Type': 'MALWARE',
  'Created': '2023-05-15T13:07:17+00:00',
  'Platforms': ['Android', 'iOS'],
  'References': ['https://malwr0nwind0z.com/post_5-15-23_formbook_sample']},
 {'Name': 'Worok Group hides malware in PNG',
  'Description': 'LSB steganography is used to cloak data in PNG images. Worok hides two payloads: a PowerShell script and a custom .NET C# stealer able to abuse Dropbox for cloaking exfiltration and C&C communications',
  'Type': 'MALWARE',
  'Created': None,
  'Platforms': [],
  'References': ['https://www.bleepingcomputer.com/news/security/w

## Manually clean the data

Some of the data is not correctly extracted and needs to be cleaned manually:
- '{"®eve®se"' is the name of a blog post and not a malware family

In [16]:
for entry in malpedia_malware_data:
    if entry['Name'] == '{"®eve®se": "Enginee®ing"} ':
        entry['Name'] = 'Extracting Shellcode in ICEID .PNG Steganography'
        entry['Description'] = 'In this past few days I stumble to some new and old variant of ICEID malware that uses .png steganography to hide and execute its encrypted shellcode. In this article I will share how the structure of the Iceid png payload look like and how to extract its encrypted shellcode.'
        break

## Merge datasets

We will merge the MITRE Att@ck, Malpedia and steg-in-the-wild datasets into a single dataset.
The merging is done by iterating over the MITRE Att@ck data and adding the Malpedia data for the malware family which has a matching name as a reference.

Then we add the unmatched Malpedia data to the list.

In [17]:
import pandas as pd
import numpy as np

malpedia_malware_data = [malware for malware in malpedia_malware_data if malware is not None]


#def filter_mitre_entries(mitre_attack_entries, image_terms=None):
#    if image_terms is None:
#        image_terms = ['image', 'jpg', 'png', 'bmp', 'pixel', 'lsb']
#
#    regex = re.compile(r'\b(?:' + '|'.join(image_terms) + r')\b', re.IGNORECASE)
#    return [
#        e for e in mitre_attack_entries
#        if any(
#            regex.search(t['Use'])
#            for t in e['Techniques Used']
#            if 'Steganography' in t['Name']
#        )
#    ]
            
processed_malware_data = mitre_attack_data

# Try to extract any malware names from the other datasets than MITRE Att@ck and add them to the list
def try_add_malware_data(data):
    new_malware_data = []
    for entry in data:
        name = entry['Name']
        split_name = name.split(':')
        if len(split_name) > 1:
            name = split_name[0].strip()
            entry['Name'] = name
            new_malware_data.append(entry)
            data.remove(entry)
    return new_malware_data

processed_malware_data += try_add_malware_data(malpedia_malware_data)
processed_malware_data += try_add_malware_data(sitw_data)

# Try to match the processed malware data with the left over Malpedia and steg-in-the-wild data
def compare_names(name, other_name):
    return name in other_name or name.replace(' ', '') in other_name or name.replace('-', '') in other_name

for entry in processed_malware_data:
    name = entry['Name'].lower()    
    for malpedia_entry in malpedia_malware_data:
        malpedia_name = malpedia_entry['Name'].lower()
        if compare_names(name, malpedia_name):
            entry['References'] = entry.get('References', []) + malpedia_entry.get(
                'References', [])
            malpedia_malware_data.remove(malpedia_entry)
    
    for sitw_data_entry in sitw_data:
        sitw_name = sitw_data_entry['Name'].lower()
        if compare_names(name, sitw_name):
            entry['References'] = entry.get('References', []) + sitw_data_entry.get(
                'References', [])
            sitw_data.remove(sitw_data_entry)
            
# Convert the list of dictionaries to a DataFrame and clean the data
malware_data = pd.DataFrame(processed_malware_data + malpedia_malware_data + sitw_data)

malware_data = malware_data[malware_data['Name'].str.contains('404') == False]
malware_data = malware_data[malware_data['Description'].str.contains('404') == False]
malware_data = malware_data[malware_data['Name'].str.contains('Not Found', case=False) == False]
malware_data = malware_data[malware_data['Description'].str.contains('Not Found', case=False) == False]

malware_data['Created'] = pd.to_datetime(malware_data['Created'], errors='coerce', format='mixed', utc=True).dt.date

malware_data['Last Modified'] = pd.to_datetime(
    malware_data['Last Modified'],
    errors='coerce',
    format='mixed',
    utc=True).dt.date

malware_data['Platforms'] = malware_data['Platforms'].apply(
    lambda platforms: ', '.join(platforms)
    if isinstance(platforms, list) else platforms
)

malware_data['Techniques Used'] = malware_data['Techniques Used'].apply(
    lambda techniques: ', '.join(
        technique['Use']
        for technique in techniques
        if 'Steganography' in technique['Name']
    ) if techniques is not np.nan else None
)

malware_data['References'] = malware_data['References'].apply(
    lambda refs: ', '.join(refs)
    if refs is not np.nan else None
)

malware_data = malware_data.drop_duplicates(subset=['Name'], keep='first')
malware_data

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


Unnamed: 0,Name,Description,Type,Platforms,Version,Created,Last Modified,MITRE ID,Techniques Used,Contributors,Associated Software,References
0,ABK,ABK is a downloader that has been used by BRON...,MALWARE,Windows,1.0,2020-06-10,2020-06-24,S0469,ABK can extract a malicious Portable Executabl...,,,
1,Agent Smith,Agent Smith is mobile malware that generates f...,MALWARE,Android,1.0,2020-05-07,2020-06-17,S0440,Agent Smith’s core malware is disguised as a J...,"Aviran Hazum, Check Point; Sergey Persikov, C...",,
2,Avenger,Avenger is a downloader that has been used by ...,MALWARE,Windows,1.0,2020-06-11,2020-06-24,S0473,Avenger can extract backdoor malware from down...,,,
3,Bandook,"Bandook is a commercially available RAT, writt...",MALWARE,Windows,2.0,2018-10-17,2021-10-11,S0234,Bandook has used .PNG images within a zip file...,,,
4,BBK,BBK is a downloader that has been used by BRON...,MALWARE,Windows,1.0,2020-06-10,2020-06-24,S0470,BBK can extract a malicious Portable Executabl...,,,
5,build_downer,build_downer is a downloader that has been use...,MALWARE,Windows,1.0,2020-06-10,2020-06-24,S0471,build_downer can extract malware from a downlo...,,,
6,Daserf,Daserf is a backdoor that has been used to spy...,MALWARE,Windows,1.1,2018-01-16,2020-03-30,S0187,Daserf can use steganography to hide malicious...,,"Muirim, Nioupale",https://blog.trendmicro.com/trendlabs-security...
7,Diavol,Diavol is a ransomware variant first observed ...,MALWARE,Windows,1.0,2021-11-12,2022-04-15,S0659,Diavol has obfuscated its main code routines w...,"Massimiliano Romano, BT Security",,
8,Duqu,Duqu is a malware platform that uses a modular...,MALWARE,Windows,1.2,2017-05-31,2023-03-08,S0038,When the Duqu command and control is operating...,,,
9,HAMMERTOSS,HAMMERTOSS is a backdoor that was used by APT2...,MALWARE,Windows,1.2,2017-05-31,2021-02-09,S0037,HAMMERTOSS is controlled via commands that are...,,"HammerDuke, NetDuke",


## Save the merged data to a file

The merged data will be saved to a file called `malware-data.csv` in the `data` directory.

In [18]:
malware_data.to_csv('data/malware-full-data.csv', sep=";", index=False)

# Statistical analysis of the data

We will perform a statistical analysis of the data to get an overview of the malware families and their properties.

## Types of steganography used

In [38]:
num_lsb_malware = len(malware_data[
    (malware_data['Description'].str.contains('LSB') == True) |
    (malware_data['Techniques Used'].str.contains('LSB') == True) | 
    (malware_data['Description'].str.contains('Least Significant Bit') == True) | 
    (malware_data['Techniques Used'].str.contains('Least Significant Bit') == True)
])
num_xor_malware = len(malware_data[
    (malware_data['Description'].str.contains('XOR') == True) |
    (malware_data['Techniques Used'].str.contains('XOR') == True)
])
num_eof_malware = len(malware_data[
    (malware_data['Description'].str.contains('append') == True) |
    (malware_data['Techniques Used'].str.contains('append') == True)
])
pd.DataFrame({
    'Steganography Type': ['LSB', 'XOR', 'End of File', 'Unknown'],
    'Number of Malware Families': [num_lsb_malware, num_xor_malware, num_eof_malware, len(malware_data) - num_lsb_malware - num_xor_malware - num_eof_malware]
})

Unnamed: 0,Steganography Type,Number of Malware Families
0,LSB,3
1,XOR,2
2,End of File,2
3,Unknown,53


## Attacks per platform

In [39]:
platforms = malware_data['Platforms'].str.split(', ', expand=True).stack().map(lambda x: x.strip()).value_counts()
platforms

Windows    48
iOS        15
Android    13
macOS      10
Linux       7
            5
Name: count, dtype: int64

## Types of carriers used

In [41]:
num_jpg_malware = len(malware_data[
    (malware_data['Description'].str.contains('jpg', case=False) == True) |
    (malware_data['Techniques Used'].str.contains('jpg', case=False) == True)
])

num_png_malware = len(malware_data[
    (malware_data['Description'].str.contains('png', case=False) == True) |
    (malware_data['Techniques Used'].str.contains('png', case=False) == True)
])

num_bmp_malware = len(malware_data[
    (malware_data['Description'].str.contains('bmp', case=False) == True) |
    (malware_data['Techniques Used'].str.contains('bmp', case=False) == True)
])

num_gif_malware = len(malware_data[
    (malware_data['Description'].str.contains('gif', case=False) == True) |
    (malware_data['Techniques Used'].str.contains('gif', case=False) == True)
])

num_tiff_malware = len(malware_data[
    (malware_data['Description'].str.contains('tiff', case=False) == True) |
    (malware_data['Techniques Used'].str.contains('tiff', case=False) == True)
])

num_pdf_malware = len(malware_data[
    (malware_data['Description'].str.contains('pdf', case=False) == True) |
    (malware_data['Techniques Used'].str.contains('pdf', case=False) == True)
])

pd.DataFrame({
    'Carrier Type': ['JPG', 'PNG', 'BMP', 'GIF', 'TIFF', 'PDF', 'Unknown'],
    'Number of Malware Families': [num_jpg_malware, num_png_malware, num_bmp_malware, num_gif_malware, num_tiff_malware, num_pdf_malware, len(malware_data) - num_jpg_malware - num_png_malware - num_bmp_malware - num_gif_malware - num_tiff_malware - num_pdf_malware]
})

Unnamed: 0,Carrier Type,Number of Malware Families
0,JPG,4
1,PNG,11
2,BMP,4
3,GIF,0
4,TIFF,0
5,PDF,2
6,Unknown,39


## Create a sample of the merged data

We will create a sample of the merged data to get an overview of the malware families and their properties.

In [42]:
malware_data_samples = malware_data.sort_values(by='Created', ascending=False)
malware_data_samples = malware_data_samples[malware_data_samples['Name'].str.match("[\"{}@]") == False]
malware_data_samples = malware_data_samples.head(20)
malware_data_samples.to_csv('data/malware-samples.csv', sep=";", index=False)

malware_data_samples

Unnamed: 0,Name,Description,Type,Platforms,Version,Created,Last Modified,MITRE ID,Techniques Used,Contributors,Associated Software,References
45,Stealthy Formbook leverages steganography - ma...,Formbook: A Infostealer Formbook is a type of ...,MALWARE,"Android, iOS",,2023-05-15,NaT,,,,,https://malwr0nwind0z.com/post_5-15-23_formboo...
47,Check Point CloudGuard Spectral exposes new ob...,Latest Research by our Team,MALWARE,Android,,2022-11-09,NaT,,,,,https://research.checkpoint.com/2022/check-poi...
39,Alibaba OSS Buckets Compromised to Distribute ...,,MALWARE,"Windows, Linux",,2022-07-21,NaT,,,,,https://www.trendmicro.com/en_us/research/22/g...
26,Zox,Zox is a remote access tool that has been used...,MALWARE,Windows,1.0,2022-01-09,2023-03-20,S0672,Zox has used the .PNG file format for C2 commu...,,"Gresim, ZoxRPC, ZoxPNG",
30,"Exploit, steganography and Delphi",We will unroll a maldoc spam exploiting CVE-20...,MALWARE,,,2021-12-07,NaT,,,,,https://malcat.fr/blog/exploit-steganography-a...
7,Diavol,Diavol is a ransomware variant first observed ...,MALWARE,Windows,1.0,2021-11-12,2022-04-15,S0659,Diavol has obfuscated its main code routines w...,"Massimiliano Romano, BT Security",,
18,ProLock,ProLock is a ransomware strain that has been u...,MALWARE,Windows,1.0,2021-09-30,2021-10-15,S0654,ProLock can use .jpg and .bmp files to store i...,,,
14,ObliqueRAT,"ObliqueRAT is a remote access trojan, similar ...",MALWARE,Windows,1.0,2021-09-08,2021-10-15,S0644,ObliqueRAT can hide its payload in BMP images ...,,,
23,Sliver,"Sliver is an open source, cross-platform, red ...",TOOL,"Windows, Linux, macOS",1.1,2021-07-30,2023-01-17,S0633,Sliver can encode binary data into a .PNG file...,"Achute Sharma, Keysight; Ayan Saha, Keysight",,
40,FormBook Malware Returns: New Variant Uses Ste...,Quick Heal Security Lab has seen a sudden incr...,MALWARE,Android,,2021-07-21,NaT,,,,,https://blogs.quickheal.com/formbook-malware-r...
