# CRISP-DM


## Problem Understanding
A central step in the ORCA pipeline is the automated mapping of natural language threat descriptions to structured attack patterns from the Common Attack Pattern Enumeration and Classification (CAPEC) framework. This process, referred to as Threat-to-CAPEC Mapping, aims to translate unstructured textual threat inputs into standardized, machine-readable representations that describe how an attacker might exploit a given vulnerability. This mapping is critical for enabling further steps in the security analysis process, such as correlating threats with known vulnerabilities (CWEs, CVEs), assessing risk using scoring systems like CVSS, and informing mitigation strategies. However, the task poses several inherent challenges:

- Ambiguity: Threat descriptions are often informal, incomplete, or context-dependent.
- Terminology mismatch: Natural language inputs may not directly align with the technical vocabulary used in CAPEC definitions.
- Granularity: A single threat may correspond to multiple CAPECs at varying levels of abstraction, requiring semantic reasoning to determine relevance.

Solving this problem involves designing a system capable of understanding the semantics of threat descriptions and reliably identifying the most appropriate CAPEC entries. This step must balance accuracy, scalability, and interpretability to support reliable, automated security assessments within the broader ORCA framework.

## Data Understanding

In [173]:
# imports
import pandas as pd
import json
import re

In [174]:
# setting display options for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

Import the threat data from a JSON file

In [175]:
# Load threat data from JSON file into a DataFrame
df_threats = pd.read_json('threat_data/all_threats.json')
df_threats.head(5)

Unnamed: 0,Threat ID,Threat title,Threat Description,Threat type,Impact type,Threat agent,Vulnerability,Threatened Asset,Affected Components
0,T-O-RAN-01,An attacker exploits insecure designs or lack of adaption in O-RAN components,"Unauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the design of the hardware-software O-RAN system and how different functions are segregated within the O-RAN system. \nO-RAN components might be vulnerable if: \n• Outdated component from the lack of update or patch management,\n• Poorly design architecture,\n• Missing appropriate security hardening,\n• Unnecessary or insecure function/protocol/component.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.\nIn addition, O-RAN components could be software providing network functions, so they are likely to be vulnerable to software flaws: it could be possible to bypass firewall restrictions or to take advantage of a buffer overflow to execute arbitrary commands, etc.",,,All,"[Outdated component from the lack of update or patch management, Poorly design architecture, Missing appropriate security hardening, Unnecessary or insecure function/protocol/component]",All,All
1,T-O-RAN-02,An attacker exploits misconfigured or poorly configured O-RAN components,"Unauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the configuration of the hardware-software O-RAN system. \nO-RAN components might be vulnerable if: \n• Errors from the lack of configuration change management,\n• Misconfigured or poorly configured O-RAN components,\n• Improperly configured permissions,\n• Unnecessary features are enabled (e.g. unnecessary ports, services, accounts, or privileges),\n• Default accounts and their passwords still enabled and unchanged,\n• Security features are disabled or not configured securely.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.",,,All,"[Errors from the lack of configuration change management, Misconfigured or poorly configured O-RAN components, Improperly configured permissions, Unnecessary features are enabled (e.g. unnecessary ports, services, accounts, or privileges), Default accounts and their passwords still enabled and unchanged, Security features are disabled or not configured securely]",All,All
2,T-O-RAN-03,Attacks from the internet to penetrate O-RAN network boundary,"Web servers serving O-RAN functional and management services should provide adequate protection. \nAn attacker that have access to the uncontrolled O-RAN network could:\n• Bypass the information flow control policy implemented by the firewall,\n• And/or attack O-RAN components in the trusted networks by taking advantage of particularities and errors in the design and implementation of the network protocols (IP, TCP, UDP, application protocols),\n• Use of incorrect or exceeded TCP sequence numbers,\n• Perform brute force attacks on FTP passwords,\n• Use of improper HTTP user sessions,\n• Etc.\nThe effects of such attacks may include:\n• An intrusion, meaning unauthorized access to O-RAN components,\n• Blocking, flooding or restarting an O-RAN component causing a denial of service,\n• Flooding of network equipment, causing a denial of service,\n• Etc.",,,All,"[Errors in the design and implementation of the network protocols (HTTP, P, TCP, UDP, application protocols)]",All,All
3,T-O-RAN-04,An attacker attempts to jam the airlink signal through IoT devices,"DDoS attacks on O-RAN systems: The 5G evolution means billions of things, collectively referred to as IoT, will be using the 5G O-RAN. Thus, IoT could increase the risk of O-RAN resource overload by way of DDoS attacks. Attackers create a botnet army by infecting many (millions/billions) IoT devices with a “remote-reboot” malware. Attackers instruct the malware to reboot all devices in a specific or targeted 5G coverage area at the same time.",,,All,[Failure to address overload situations],"ASSET-D-06, ASSET-D-18","O-RU, airlink with UE, O-DU"
4,T-O-RAN-05,"An attacker penetrates and compromises the O-RAN system through the open O-RAN’s Fronthaul, O1, O2, A1, and E2","O-RAN’s Fronthaul, O1, O2, A1, and E2 management interfaces are the new open interfaces that allow software programmability of RAN. These interfaces may not be secured to industry best practices.\nO-RAN components might be vulnerable if: \n• Improper or missing authentication and authorization processes,\n• Improper or missing ciphering and integrity checks of sensitive data exchanged over O-RAN interfaces,\n• Improper or missing replay protection of sensitive data exchanged over O-RAN interfaces,\n• Improper prevention of key reuse,\n• Improper implementation,\n• Improperly validate inputs, respond to error conditions in both the submitted data as well as out of sequence protocol steps.\nAn attacker could, in such case, cause denial-of-service, data tampering or information disclosure, etc.\nNOTE: O-RAN interfaces allow use of TLS or SSH. Industry best practices mandate the use of TLS (v1.2 or higher) or SSH certificate-based authentication. An implementation that implements TLS version lower than 1.2 or a SSH password authentication, may become the key source of vulnerability that a malicious code will exploit to compromise the O-RAN system.",,,All,"[Improper or missing authentication and authorization processes, Improper prevention of key reuse, Improper or missing replay protection of sensitive data exchanged over O-RAN interfaces, Improper or missing ciphering and integrity checks of sensitive data exchanged over O-RAN interfaces]",All,"rApps, xApps, O-RU, O-DU, O-CU, Near-RT RIC, Non-RT RIC"


In [176]:
# Display DataFrame information
df_threats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Threat ID            182 non-null    object
 1   Threat title         182 non-null    object
 2   Threat Description   182 non-null    object
 3   Threat type          82 non-null     object
 4   Impact type          82 non-null     object
 5   Threat agent         182 non-null    object
 6   Vulnerability        182 non-null    object
 7   Threatened Asset     182 non-null    object
 8   Affected Components  182 non-null    object
dtypes: object(9)
memory usage: 12.9+ KB


There are 182 threats described by 9 columns. Each threat has a unique ID, a title, a description, a threat type, an impact type, a threat agent, one or more vulnerabilities (saved in a list), a threatened asset, and an affected component. Threat type and impact type are filled in for only 82 of the 182 threats.

In [177]:
df_threats.describe()

Unnamed: 0,Threat ID,Threat title,Threat Description,Threat type,Impact type,Threat agent,Vulnerability,Threatened Asset,Affected Components
count,182,182,182,82,82,182,182,182,182
unique,182,181,182,16,17,2,116,58,55
top,T-O-RAN-01,External attacker exploits authentication weakness on SMO,"Unauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the design of the hardware-software O-RAN system and how different functions are segregated within the O-RAN system. \nO-RAN components might be vulnerable if: \n• Outdated component from the lack of update or patch management,\n• Poorly design architecture,\n• Missing appropriate security hardening,\n• Unnecessary or insecure function/protocol/component.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.\nIn addition, O-RAN components could be software providing network functions, so they are likely to be vulnerable to software flaws: it could be possible to bypass firewall restrictions or to take advantage of a buffer overflow to execute arbitrary commands, etc.",Spoofing,Authenticity,All,[weak mutual authentication],"ASSET-D-12, ASSET-D-13, ASSET-D-14, ASSET-D-15, ASSET-D-16, ASSET-D-17, ASSET-D-18, ASSET-D-19, ASSET-D-20, ASSET-D-29, ASSET-D-31, ASSET-D-32",All
freq,1,2,1,28,26,178,12,24,19


In [178]:
from IPython.display import HTML, display

display(HTML('''
<style>
/* White background for output area */
.output_area {
    background: white !important;
    color: black !important;
}

/* Optional: White background for DataFrame cells */
.dataframe {
    background-color: white !important;
    color: black !important;
}
</style>
'''))



In [179]:
df_threats.isnull().sum()

Threat ID                0
Threat title             0
Threat Description       0
Threat type            100
Impact type            100
Threat agent             0
Vulnerability            0
Threatened Asset         0
Affected Components      0
dtype: int64

In [180]:
df_threats.columns

Index(['Threat ID', 'Threat title', 'Threat Description', 'Threat type',
       'Impact type', 'Threat agent', 'Vulnerability', 'Threatened Asset',
       'Affected Components'],
      dtype='object')

In [181]:
df_threats.groupby('Threat type').size().reset_index(name='count').sort_values(by='count', ascending=False).reset_index(drop=True)

Unnamed: 0,Threat type,count
0,Spoofing,28
1,Denial of Service,11
2,Elevation of Privilege,10
3,Information Disclosure,9
4,Tampering,8
5,Information disclosure,4
6,Tampering; Denial of Service,2
7,"Elevation of Privilege, Information Disclosure",2
8,Elevation of Privilege; Denial of Service,1
9,"Denial of Service, Escalation of Privilege",1


In [182]:
(df_threats['Threat type'].isnull() == df_threats['Impact type'].isnull()).all()


np.True_

In [183]:
df_threats['Threat type'].value_counts(dropna=False)

Threat type
None                                                          52
NaN                                                           48
Spoofing                                                      28
Denial of Service                                             11
Elevation of Privilege                                        10
Information Disclosure                                         9
Tampering                                                      8
Information disclosure                                         4
Elevation of Privilege, Information Disclosure                 2
Tampering; Denial of Service                                   2
Information disclosure, Tampering                              1
Tampering, Information Disclosure, Escalation of Privilege     1
Denial of Service, Escalation of Privilege                     1
Denial of Service; Tampering                                   1
Elevation of Privilege; Denial of Service                      1
Tampering; El

In [184]:
df_threats['Impact type'].value_counts(dropna=False)

Impact type
None                                         52
NaN                                          48
Authenticity                                 26
Confidentiality                              14
Availability                                 12
Authorization                                 9
Integrity                                     7
Authentication                                2
Integrity; Availability                       2
Authorization. Confidentiality                1
Integrity, Confidentiality, Authorization     1
Availability; Integrity                       1
Confidentiality, Integrity                    1
Authorization; Availability                   1
Integrity; Authorization; Availability        1
Integrity, Availability                       1
Availability, Confidentiality                 1
Confidentiality, Availability                 1
Authorization, Availability                   1
Name: count, dtype: int64
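
The two counts above list `None` and `NaN` separately, although both mean that no label was recorded (52 + 48 = 100, which matches `isnull().sum()`). Outside of pandas, a small helper can treat both markers alike. This is a minimal sketch; the name `is_missing` is hypothetical and not part of the notebook.

```python
import math

def is_missing(value):
    """Return True for None and for float NaN, the two missing markers seen above."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)

# Illustrative values only, not rows from the dataset
labels = ["Spoofing", None, float("nan"), "Tampering"]
missing_count = sum(is_missing(v) for v in labels)
print(missing_count)
```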

In [185]:
df_threats['Vulnerability'].value_counts(dropna=False)

Vulnerability
[weak mutual authentication]                                                                                                                                                                                                                                                     12
[Weak authentication can be exploited by a tenant to move laterally across the deployment.]                                                                                                                                                                                       9
[Lack of integrity verification]                                                                                                                                                                                                                                                  5
[Lack of overload protection and rate-limiting]                                                                                                               

### Distribution Analysis


In [186]:
# Some threat types and impact types are separated by a comma, others by a semicolon or a period
# Replace all semicolons and periods with commas
df_threats['Threat type'] = df_threats['Threat type'].str.replace(';', ',', regex=False)
df_threats['Threat type'] = df_threats['Threat type'].str.replace('.', ',', regex=False)
df_threats['Impact type'] = df_threats['Impact type'].str.replace(';', ',', regex=False)
df_threats['Impact type'] = df_threats['Impact type'].str.replace('.', ',', regex=False)
# In the 'Threatened Asset' column, some values use an underscore instead of a dash
df_threats['Threatened Asset'] = df_threats['Threatened Asset'].str.replace('_', '-', regex=False)

In [187]:
# Split the 'Threat type', 'Impact type', 'Threatened Asset' and 'Affected Components' columns on commas
# and strip leading and trailing whitespace from each element
df_threats['Threat type'] = df_threats['Threat type'].apply(
    lambda x: [t.strip() for t in x.split(',')] if isinstance(x, str) else x
)
df_threats['Impact type'] = df_threats['Impact type'].apply(
    lambda x: [t.strip() for t in x.split(',')] if isinstance(x, str) else x
)
df_threats['Threatened Asset'] = df_threats['Threatened Asset'].apply(
    lambda x: [t.strip() for t in x.split(',')] if isinstance(x, str) else x
)
df_threats['Affected Components'] = df_threats['Affected Components'].apply(
    lambda x: [t.strip() for t in x.split(',')] if isinstance(x, str) else x
)
# The 'Vulnerability' column contains a list with only one string element.
# This string contains multiple vulnerabilities separated by commas.
# We take this single string out of the list and then split it by commas.
df_threats['Vulnerability'] = df_threats['Vulnerability'].apply(lambda x: x[0] if isinstance(x, list) else x)
df_threats['Vulnerability'] = df_threats['Vulnerability'].apply(
    lambda x: [t.strip() for t in x.split(',')] if isinstance(x, str) else x
)
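
The separator clean-up and the comma split in the two cells above can also be expressed as one pass with a regular expression. The following is a sketch under the assumption that a label never contains `;`, `,` or `.` itself; the helper name `split_labels` is hypothetical.

```python
import re

def split_labels(value):
    """Split a label string on ';', ',' or '.' and strip whitespace.

    Non-string values (None, NaN) are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    return [part.strip() for part in re.split(r"[;,.]", value) if part.strip()]

print(split_labels("Authorization. Confidentiality"))
print(split_labels("Tampering; Denial of Service"))
```

Applied to the raw string columns, this would look like `df_threats['Impact type'].apply(split_labels)`. It would not suit the 'Vulnerability' column, whose entries contain commas and periods inside phrases such as "e.g. unnecessary ports, services, accounts, or privileges".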

In [188]:
# The 'Threatened Asset' column contains values like 'All' or 'ASSET-D-33 to ASSET-D-38'.
# We first collect the full set of asset IDs so that these values can be expanded into explicit lists.

# Step 1: Drop NaNs
asset_entries = df_threats['Threatened Asset'].dropna()

# Step 2: Function to expand ranges like 'ASSET-D-33 to ASSET-D-38'
def expand_range(entry):
    match = re.match(r"(ASSET-[A-Z]-)(\d+)\s+to\s+ASSET-[A-Z]-(\d+)", entry.strip())
    if match:
        prefix, start, end = match.groups()
        return [f"{prefix}{i}" for i in range(int(start), int(end) + 1)]
    return [entry.strip()]

# Step 3: Build the set of unique assets
all_assets = set()

for entry in asset_entries:
    # If entry is a list, turn it into a string
    if isinstance(entry, list):
        entry = ', '.join(entry)
    
    # Ensure it's a string before splitting
    if isinstance(entry, str):
        parts = [p.strip() for p in entry.split(',')]
        for part in parts:
            if part.lower() == 'all':
                continue  # or expand to full known asset list if desired
            elif 'to' in part:
                expanded = expand_range(part)
                all_assets.update(expanded)
            else:
                all_assets.add(part.strip())

# Step 4: Show all unique asset names sorted
all_assets = sorted(all_assets)
print(all_assets)




['ASSET D-11', 'ASSET D-20', 'ASSET D-21', 'ASSET-C-02', 'ASSET-C-03', 'ASSET-C-07', 'ASSET-C-08', 'ASSET-C-09', 'ASSET-C-1', 'ASSET-C-10', 'ASSET-C-11', 'ASSET-C-12', 'ASSET-C-14', 'ASSET-C-16', 'ASSET-C-17', 'ASSET-C-18', 'ASSET-C-19', 'ASSET-C-2', 'ASSET-C-20', 'ASSET-C-21', 'ASSET-C-22', 'ASSET-C-23', 'ASSET-C-24', 'ASSET-C-25', 'ASSET-C-26', 'ASSET-C-27', 'ASSET-C-28', 'ASSET-C-3', 'ASSET-C-31', 'ASSET-C-32', 'ASSET-C-33', 'ASSET-C-34', 'ASSET-C-35', 'ASSET-C-36', 'ASSET-C-37', 'ASSET-C-38', 'ASSET-C-39', 'ASSET-C-4', 'ASSET-C-40', 'ASSET-C-42', 'ASSET-C-5', 'ASSET-C-6', 'ASSET-C-7', 'ASSET-C-8', 'ASSET-C-9', 'ASSET-D-01', 'ASSET-D-02', 'ASSET-D-03', 'ASSET-D-04', 'ASSET-D-05', 'ASSET-D-06', 'ASSET-D-07', 'ASSET-D-08', 'ASSET-D-09', 'ASSET-D-1', 'ASSET-D-10', 'ASSET-D-11', 'ASSET-D-12', 'ASSET-D-13', 'ASSET-D-14', 'ASSET-D-15', 'ASSET-D-16', 'ASSET-D-17', 'ASSET-D-18', 'ASSET-D-19', 'ASSET-D-2', 'ASSET-D-20', 'ASSET-D-21', 'ASSET-D-22', 'ASSET-D-23', 'ASSET-D-24', 'ASSET-D-25', 'A
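
The printed list contains spelling variants such as 'ASSET D-11' next to 'ASSET-D-11', and 'ASSET-C-1' next to 'ASSET-C-01'. If these refer to the same assets (an assumption that should be checked against the source document), the IDs could be brought into one canonical form before counting. The sketch below uses a hypothetical helper `canonical_asset_id` and assumes two-digit zero padding is the intended form.

```python
import re

# Accept a space, underscore or dash (or nothing) between the parts of an ID
ASSET_PATTERN = re.compile(r"ASSET[\s_-]*([A-Z])[\s_-]*(\d+)", re.IGNORECASE)

def canonical_asset_id(raw):
    """Return 'ASSET-<letter>-<zero-padded number>', or the stripped input if it is not an asset ID."""
    text = raw.strip()
    match = ASSET_PATTERN.fullmatch(text)
    if not match:
        return text
    letter, number = match.groups()
    # Assumption: 'ASSET-C-1' and 'ASSET-C-01' denote the same asset
    return f"ASSET-{letter.upper()}-{int(number):02d}"

for raw in ["ASSET D-11", "ASSET-C-1", "ASSET-C-01", "All"]:
    print(raw, "->", canonical_asset_id(raw))
```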

In [189]:
def normalize_assets(value):
    # Convert list to string if needed
    if isinstance(value, list):
        value = ', '.join(value)

    if isinstance(value, str):
        assets = []
        # Handle each comma-separated part on its own, so that a range such as
        # 'ASSET-D-33 to ASSET-D-38' is also expanded when it appears next to other IDs
        for part in (p.strip() for p in value.split(',')):
            if part.lower() == 'all':
                assets.extend(all_assets)  # 'All' stands for every known asset
            elif 'to' in part:
                assets.extend(expand_range(part))
            elif part:
                assets.append(part)
        return assets

    return value  # leave as-is if unexpected

df_threats['Threatened Asset'] = df_threats['Threatened Asset'].apply(normalize_assets)



In [190]:
df_threats['Threatened Asset'].head(20)

0     [ASSET D-11, ASSET D-20, ASSET D-21, ASSET-C-02, ASSET-C-03, ASSET-C-07, ASSET-C-08, ASSET-C-09, ASSET-C-1, ASSET-C-10, ASSET-C-11, ASSET-C-12, ASSET-C-14, ASSET-C-16, ASSET-C-17, ASSET-C-18, ASSET-C-19, ASSET-C-2, ASSET-C-20, ASSET-C-21, ASSET-C-22, ASSET-C-23, ASSET-C-24, ASSET-C-25, ASSET-C-26, ASSET-C-27, ASSET-C-28, ASSET-C-3, ASSET-C-31, ASSET-C-32, ASSET-C-33, ASSET-C-34, ASSET-C-35, ASSET-C-36, ASSET-C-37, ASSET-C-38, ASSET-C-39, ASSET-C-4, ASSET-C-40, ASSET-C-42, ASSET-C-5, ASSET-C-6, ASSET-C-7, ASSET-C-8, ASSET-C-9, ASSET-D-01, ASSET-D-02, ASSET-D-03, ASSET-D-04, ASSET-D-05, ASSET-D-06, ASSET-D-07, ASSET-D-08, ASSET-D-09, ASSET-D-1, ASSET-D-10, ASSET-D-11, ASSET-D-12, ASSET-D-13, ASSET-D-14, ASSET-D-15, ASSET-D-16, ASSET-D-17, ASSET-D-18, ASSET-D-19, ASSET-D-2, ASSET-D-20, ASSET-D-21, ASSET-D-22, ASSET-D-23, ASSET-D-24, ASSET-D-25, ASSET-D-26, ASSET-D-27, ASSET-D-28, ASSET-D-29, ASSET-D-3, ASSET-D-30, ASSET-D-31, ASSET-D-32, ASSET-D-33, ASSET-D-34, ASSET-D-35, ASSET-D-3

In [191]:
df_threats['Threat type'].explode().value_counts()
# The most common threat type is Spoofing, which occurs 28 times

Threat type
Spoofing                   28
Denial of Service          19
Elevation of Privilege     14
Information Disclosure     14
Tampering                  14
Information disclosure      5
Escalation of Privilege     2
Name: count, dtype: int64
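
The counts above still treat 'Information Disclosure' and 'Information disclosure' as different labels, and 'Escalation of Privilege' appears alongside 'Elevation of Privilege'. A sketch of merging them follows; the mapping table and the helper name `canonical_threat_type` are assumptions, in particular that 'Escalation of Privilege' is meant as the STRIDE category 'Elevation of Privilege'.

```python
from collections import Counter

# Assumption: these spellings denote the same category
CANONICAL_THREAT_TYPES = {
    "information disclosure": "Information Disclosure",
    "escalation of privilege": "Elevation of Privilege",
}

def canonical_threat_type(label):
    """Map known spelling variants to one label; leave other labels untouched."""
    cleaned = label.strip()
    return CANONICAL_THREAT_TYPES.get(cleaned.lower(), cleaned)

# Counts copied from the output above
observed = {
    "Spoofing": 28,
    "Denial of Service": 19,
    "Elevation of Privilege": 14,
    "Information Disclosure": 14,
    "Tampering": 14,
    "Information disclosure": 5,
    "Escalation of Privilege": 2,
}

merged = Counter()
for label, count in observed.items():
    merged[canonical_threat_type(label)] += count

print(merged)
```

Under this mapping, Information Disclosure rises to 19 and Elevation of Privilege to 16, while Spoofing remains the most frequent type with 28.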

In [192]:
df_threats['Vulnerability'].explode().str.lower().value_counts()
# The most common vulnerability is 'lack of authentication', which occurs 17 times,
# followed by 'weak mutual authentication' with 12 occurrences.

Vulnerability
lack of authentication                                                                       17
weak mutual authentication                                                                   12
weak authentication can be exploited by a tenant to move laterally across the deployment.     9
missing or improperly configured authorization                                                6
lack of integrity verification                                                                6
                                                                                             ..
pnfs                                                                                          1
etc.                                                                                          1
operation areas                                                                               1
physical access to the open fronthaul cable network                                           1
use of pretrained public m

In [193]:
# Explode the list so each asset appears in its own row
asset_flat = df_threats.explode('Threatened Asset')

# Count how often each asset appears
asset_counts = asset_flat['Threatened Asset'].value_counts()
asset_counts.head(50)

Threatened Asset
ASSET-D-15    43
ASSET-C-17    41
ASSET-D-29    40
ASSET-D-18    38
ASSET-D-14    37
ASSET-C-31    35
ASSET-D-17    35
ASSET-D-16    35
ASSET-D-12    34
ASSET-D-13    34
ASSET-D-19    34
ASSET-D-20    34
ASSET-D-32    34
ASSET-D-31    33
ASSET-C-11    33
ASSET-C-18    22
ASSET-C-10    21
ASSET-C-08    19
ASSET-C-35    18
ASSET-D-01    18
ASSET-C-34    17
ASSET-C-26    17
ASSET-C-19    16
ASSET-C-16    16
ASSET-C-20    16
ASSET-C-21    16
ASSET-C-23    16
ASSET-C-09    16
ASSET-D-04    15
ASSET-D-26    15
ASSET-D-25    15
ASSET-C-37    15
ASSET-D-37    15
ASSET-D-34    15
ASSET-D-33    15
ASSET-D-36    15
ASSET-D-35    15
ASSET-D-38    15
ASSET-C-25    14
ASSET-C-36    14
ASSET-D-21    14
ASSET-D-02    14
ASSET-D-05    14
ASSET-D-27    13
ASSET-D-28    13
ASSET-D-06    13
ASSET-D-10    13
ASSET-D-22    13
ASSET-C-02    12
ASSET-D-43    12
Name: count, dtype: int64

In [194]:
df_threats['Impact type'].explode().value_counts()

Impact type
Authenticity       26
Availability       21
Confidentiality    19
Authorization      14
Integrity          14
Authentication      2
Name: count, dtype: int64

In [195]:
df_threats['Threat type'].explode().value_counts()

Threat type
Spoofing                   28
Denial of Service          19
Elevation of Privilege     14
Information Disclosure     14
Tampering                  14
Information disclosure      5
Escalation of Privilege     2
Name: count, dtype: int64

In [196]:
print(set(df_threats['Threat agent']))

{'All', 'All except Script kiddies'}


In [197]:
df_threats['Threat agent'].value_counts()

Threat agent
All                          178
All except Script kiddies      4
Name: count, dtype: int64

In [198]:
df_threats['Affected Components'].explode().value_counts() 

Affected Components
O-Cloud                  30
Non-RT RIC               29
Shared O-RU              26
Apps/VNFs/CNFs           22
All                      19
O-DU                     17
O-RU                     16
rApps                    16
Near-RT RIC              16
SMO                      14
xApps                    12
O-DU Tenant              10
SMO Framework            10
O-DU Host                 9
O-CU                      8
O2 interface              7
External interfaces       7
R1 interface              7
UE                        7
SMO Functions             6
ASSET-C-29                6
ASSET-C-30                6
O-CU Tenant               6
CUS-Plane                 5
Apps/VNFs/CNFs images     5
O-CU Host                 5
ASSET-C-08                3
A1 interface              3
E2 interface              3
M-Plane                   3
SMO Host                  3
SMO Tenant                3
xAPPs                     3
Y1 interface              2
O2                        2
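
The component counts contain labels that differ only in letter case, for example 'xApps' (12) and 'xAPPs' (3). A sketch for detecting such variants before deciding how to merge them is shown below; `find_case_variants` is a hypothetical helper. It does not catch variants like 'O2' and 'O2 interface', which would need a manual mapping.

```python
from collections import defaultdict

def find_case_variants(labels):
    """Group labels that differ only in letter case; return groups with more than one spelling."""
    groups = defaultdict(set)
    for label in labels:
        cleaned = label.strip()
        groups[cleaned.casefold()].add(cleaned)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}

# A few labels taken from the output above
components = ["xApps", "xAPPs", "O-DU", "O2 interface", "O2", "SMO"]
variants = find_case_variants(components)
print(variants)
```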


In [199]:
df_threats.columns

Index(['Threat ID', 'Threat title', 'Threat Description', 'Threat type',
       'Impact type', 'Threat agent', 'Vulnerability', 'Threatened Asset',
       'Affected Components'],
      dtype='object')

### Text analysis

In [200]:
length_of_threat_descriptions = df_threats['Threat Description'].str.len()
print(length_of_threat_descriptions.describe())
print("The average character length of the threat descriptions is", length_of_threat_descriptions.mean())
# the average length of the threat descriptions is 528.8 characters

standard_deviation = length_of_threat_descriptions.std()
print("The standard deviation of the character length of the threat descriptions is", standard_deviation)

count     182.000000
mean      528.818681
std       491.740145
min        76.000000
25%       187.250000
50%       319.000000
75%       756.250000
max      2668.000000
Name: Threat Description, dtype: float64
The average character length of the threat descriptions is 528.8186813186813
The standard deviation of the character length of the threat descriptions is 491.74014505307065


In [201]:
length_of_threat_titles = df_threats['Threat title'].str.len()
print(length_of_threat_titles.describe())
print("The average character length of the threat titles is", length_of_threat_titles.mean())
print("The standard deviation of the character length of the threat titles is", length_of_threat_titles.std())

count    182.00000
mean      57.10989
std       27.82580
min       14.00000
25%       34.00000
50%       54.00000
75%       70.00000
max      148.00000
Name: Threat title, dtype: float64
The average character length of the threat titles is 57.10989010989011
The standard deviation of the character length of the threat titles is 27.825800083090165


In [202]:

print(df_threats['Threat Description'].value_counts().head(2))
print(df_threats['Threat title'].value_counts().head(2))

Threat Description
Unauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the design of the hardware-software O-RAN system and how different functions are segregated within the O-RAN system. \nO-RAN components might be vulnerable if: \n• Outdated component from the lack of update or patch management,\n• Poorly design architecture,\n• Missing appropriate security hardening,\n• Unnecessary or insecure function/protocol/component.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches

In [203]:
df_threats[df_threats['Threat title'] == 'External attacker exploits authentication weakness on SMO']


Unnamed: 0,Threat ID,Threat title,Threat Description,Threat type,Impact type,Threat agent,Vulnerability,Threatened Asset,Affected Components
42,T-SMO-01,External attacker exploits authentication weakness on SMO,"An external attacker can exploit the improper/missing authentication weakness on SMO functions. If the authentication of O-RAN subjects on A1, O1, O2, and External interfaces on SMO is not supported or not properly implemented, those interfaces without proper credentials could be exploited to gain access to the SMO.",[Spoofing],[Authenticity],All,[Missing or improperly configured authentication],"[ASSET-C-11, ASSET-C-17]","[Non-RT RIC, SMO Framework]"
43,T-SMO-02,External attacker exploits authentication weakness on SMO,"An external attacker can exploit the improper/missing authorization weakness on SMO functions. A malicious external entity on A1, O1, O2, and External interfaces without authorization or with an incorrect access token may invoke the SMO functions. The data at rest related to that function will be leaked to the attacker. In addition, an attacker can be able to perform certain actions, e.g. disclose O-RAN sensitive information or alter O-RAN components.","[Elevation of Privilege, Information Disclosure]","[Authorization, Confidentiality]",All,[Missing or improperly configured authorization],"[ASSET-C-11, ASSET-C-17]","[Non-RT RIC, SMO Framework]"
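
T-SMO-01 and T-SMO-02 share a title although the second description is about authorization, so the duplicate may be a copy error in the source data. The check that surfaces such cases can be written without pandas; the sketch below uses a hypothetical helper `ids_by_duplicate_title` on (threat ID, title) pairs taken from the outputs above.

```python
from collections import defaultdict

def ids_by_duplicate_title(records):
    """Return {title: [threat IDs]} for every title used by more than one threat."""
    by_title = defaultdict(list)
    for threat_id, title in records:
        by_title[title].append(threat_id)
    return {title: ids for title, ids in by_title.items() if len(ids) > 1}

records = [
    ("T-O-RAN-01", "An attacker exploits insecure designs or lack of adaption in O-RAN components"),
    ("T-SMO-01", "External attacker exploits authentication weakness on SMO"),
    ("T-SMO-02", "External attacker exploits authentication weakness on SMO"),
]
duplicates = ids_by_duplicate_title(records)
print(duplicates)
```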


### Data understanding of CAPEC data

In [204]:
from git import Repo, InvalidGitRepositoryError
import os
import shutil

from stix2 import FileSystemSource
from stix2 import Filter

In [240]:
def pull_clone_gitrepo(directory, repo_url):
    # Clone the repository if the data directory does not exist yet
    if not os.path.isdir(directory):
        Repo.clone_from(repo_url, directory)
    else:
        try:
            # If the data directory is already a Git repository, pull the latest changes
            local_repo = Repo(directory)
            local_repo.remotes.origin.pull()
        except InvalidGitRepositoryError:
            # Otherwise remove the folder and clone from scratch
            shutil.rmtree(directory)
            Repo.clone_from(repo_url, directory)

def prepare_capecs_df():
    pull_clone_gitrepo('./data', 'https://github.com/mitre/cti')
    fs = FileSystemSource('./data/capec/2.1')
    filt = Filter('type', '=', 'attack-pattern')

    attack_patterns = fs.query([filt])

    data_array = []
    for pattern in attack_patterns:
        if not pattern.x_capec_status == "Deprecated":
            info = []
            result = [obj for obj in pattern.external_references if obj['source_name'] == "capec"]
            info.append(result[0].external_id)
            info.append(pattern.name)
            info.append(pattern.description)
            if "x_capec_domains" in pattern:
                info.append(pattern.x_capec_domains)
            else:
                info.append("not given")
            data_array.append(info)

    columns = ['CAPEC ID', 'capec_name', 'capec_description', 'capec_domain']
    df = pd.DataFrame(data_array, columns=columns)

    df['summary'] = ''
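    for index, row in df.iterrows():
        # Caution: the 'not given' fallback above is a plain string, not a list, so
        # len() > 1 is true and ', '.join() joins its individual characters.
        if len(row['capec_domain']) > 1: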
            domains_string = ". The Domains of this are: "+', '.join(row['capec_domain'])
        else:
            domains_string = ". The Domain of this is: "+row['capec_domain'][0]

        summary = 'A CAPEC with the title '+row['capec_name']+\
                    '. The description of this CAPEC is:'+row['capec_description']+\
                    domains_string
        
        df.at[index, 'summary'] = summary
    return df


In [242]:
# generate the capecs_df
capecs_df = prepare_capecs_df()
# save capecs data to csv
capecs_df.to_csv('capec_data/capecs.csv', index=False)

In [243]:
#load capecs data from csv
capecs_df = pd.read_csv('capec_data/capecs.csv')

In [244]:
# Display general information about capecs_df
print("Shape of capecs_df:", capecs_df.shape)
print("\nColumn names:", capecs_df.columns.tolist())
print("\nMissing values per column:\n", capecs_df.isnull().sum())
print("\nData types:\n", capecs_df.dtypes)
print("\nFirst 5 rows:")
display(capecs_df.head())
print("\nSummary statistics for text length in 'capec_description':")
desc_lengths = capecs_df['capec_description'].dropna().str.len()
print(desc_lengths.describe())

Shape of capecs_df: (559, 5)

Column names: ['CAPEC ID', 'capec_name', 'capec_description', 'capec_domain', 'summary']

Missing values per column:
 CAPEC ID             0
capec_name           0
capec_description    2
capec_domain         0
summary              0
dtype: int64

Data types:
 CAPEC ID             object
capec_name           object
capec_description    object
capec_domain         object
summary              object
dtype: object

First 5 rows:


Unnamed: 0,CAPEC ID,capec_name,capec_description,capec_domain,summary
0,CAPEC-87,Forceful Browsing,"An attacker employs forceful browsing (direct URL entry) to access portions of a website that are otherwise unreachable. Usually, a front controller or similar design pattern is employed to protect access to portions of a web application. Forceful browsing enables an attacker to access information, perform privileged operations and otherwise reach sections of the web application that have been improperly protected.",['Software'],"A CAPEC with the title Forceful Browsing. The description of this CAPEC is:An attacker employs forceful browsing (direct URL entry) to access portions of a website that are otherwise unreachable. Usually, a front controller or similar design pattern is employed to protect access to portions of a web application. Forceful browsing enables an attacker to access information, perform privileged operations and otherwise reach sections of the web application that have been improperly protected.. The Domain of this is: Software"
1,CAPEC-391,Bypassing Physical Locks,"An attacker uses techniques and methods to bypass physical security measures of a building or facility. Physical locks may range from traditional lock and key mechanisms, cable locks used to secure laptops or servers, locks on server cases, or other such devices. Techniques such as lock bumping, lock forcing via snap guns, or lock picking can be employed to bypass those locks and gain access to the facilities or devices they protect, although stealth, evidence of tampering, and the integrity of the lock following an attack, are considerations that may determine the method employed. Physical locks are limited by the complexity of the locking mechanism. While some locks may offer protections such as shock resistant foam to prevent bumping or lock forcing methods, many commonly employed locks offer no such countermeasures.",['Physical Security'],"A CAPEC with the title Bypassing Physical Locks. The description of this CAPEC is:An attacker uses techniques and methods to bypass physical security measures of a building or facility. Physical locks may range from traditional lock and key mechanisms, cable locks used to secure laptops or servers, locks on server cases, or other such devices. Techniques such as lock bumping, lock forcing via snap guns, or lock picking can be employed to bypass those locks and gain access to the facilities or devices they protect, although stealth, evidence of tampering, and the integrity of the lock following an attack, are considerations that may determine the method employed. Physical locks are limited by the complexity of the locking mechanism. While some locks may offer protections such as shock resistant foam to prevent bumping or lock forcing methods, many commonly employed locks offer no such countermeasures.. The Domain of this is: Physical Security"
2,CAPEC-4,Using Alternative IP Address Encodings,"This attack relies on the adversary using unexpected formats for representing IP addresses. Networked applications may expect network location information in a specific format, such as fully qualified domains names (FQDNs), URL, IP address, or IP Address ranges. If the location information is not validated against a variety of different possible encodings and formats, the adversary can use an alternate format to bypass application access control.",['Software'],"A CAPEC with the title Using Alternative IP Address Encodings. The description of this CAPEC is:This attack relies on the adversary using unexpected formats for representing IP addresses. Networked applications may expect network location information in a specific format, such as fully qualified domains names (FQDNs), URL, IP address, or IP Address ranges. If the location information is not validated against a variety of different possible encodings and formats, the adversary can use an alternate format to bypass application access control.. The Domain of this is: Software"
3,CAPEC-185,Malicious Software Download,An attacker uses deceptive methods to cause a user or an automated process to download and install dangerous code that originates from an attacker controlled source. There are several variations to this strategy of attack.,['Software'],A CAPEC with the title Malicious Software Download. The description of this CAPEC is:An attacker uses deceptive methods to cause a user or an automated process to download and install dangerous code that originates from an attacker controlled source. There are several variations to this strategy of attack.. The Domain of this is: Software
4,CAPEC-226,Session Credential Falsification through Manipulation,An attacker manipulates an existing credential in order to gain access to a target application. Session credentials allow users to identify themselves to a service after an initial authentication without needing to resend the authentication information (usually a username and password) with every message. An attacker may be able to manipulate a credential sniffed from an existing connection in order to gain access to a target server.,['Software'],A CAPEC with the title Session Credential Falsification through Manipulation. The description of this CAPEC is:An attacker manipulates an existing credential in order to gain access to a target application. Session credentials allow users to identify themselves to a service after an initial authentication without needing to resend the authentication information (usually a username and password) with every message. An attacker may be able to manipulate a credential sniffed from an existing connection in order to gain access to a target server.. The Domain of this is: Software



Summary statistics for text length in 'capec_description':
count     557.000000
mean      475.750449
std       222.768722
min        65.000000
25%       303.000000
50%       428.000000
75%       627.000000
max      1069.000000
Name: capec_description, dtype: float64


In [246]:
# show rows with missing values in 'capec_description'
capecs_df[capecs_df['capec_description'].isnull()]

Unnamed: 0,CAPEC ID,capec_name,capec_description,capec_domain,summary
12,CAPEC-435,Target Influence via Instant Rapport,,['Social Engineering'],A CAPEC with the title Target Influence via Instant Rapport. The description of this CAPEC is:. The Domain of this is: Social Engineering
422,CAPEC-434,Target Influence via Interview and Interrogation,,['Social Engineering'],A CAPEC with the title Target Influence via Interview and Interrogation. The description of this CAPEC is:. The Domain of this is: Social Engineering


In [247]:
capecs_df[capecs_df['capec_domain'] == 'not given']

Unnamed: 0,CAPEC ID,capec_name,capec_description,capec_domain,summary
81,CAPEC-699,Eavesdropping on a Monitor,"An Adversary can eavesdrop on the content of an external monitor through the air without modifying any cable or installing software, just capturing this signal emitted by the cable or video port, with this the attacker will be able to impact the confidentiality of the data without being detected by traditional security tools",not given,"A CAPEC with the title Eavesdropping on a Monitor. The description of this CAPEC is:An Adversary can eavesdrop on the content of an external monitor through the air without modifying any cable or installing software, just capturing this signal emitted by the cable or video port, with this the attacker will be able to impact the confidentiality of the data without being detected by traditional security tools. The Domains of this are: n, o, t, , g, i, v, e, n"


In [248]:
def to_list(x):
    if x == 'not given':
        return ['not given']
    x = x.strip("[]").replace("'", "")
    return x.split(", ") if x else []

capecs_df['capec_domain'] = capecs_df['capec_domain'].apply(to_list)
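
The CSV round trip stores each domain list as its string representation, which is why `to_list` is needed. As an alternative sketch, the standard library's `ast.literal_eval` can parse that representation directly instead of stripping characters by hand. The helper name `parse_domains` is hypothetical and not part of the notebook.

```python
import ast

def parse_domains(value):
    """Turn a CSV cell such as "['Software', 'Hardware']" back into a list."""
    # The fallback written by prepare_capecs_df is a bare string, not a list literal
    if value == 'not given':
        return ['not given']
    # literal_eval only evaluates Python literals, so it is safe for untrusted text
    return list(ast.literal_eval(value))

print(parse_domains("['Communications', 'Software']"))
print(parse_domains('not given'))
```

Unlike the manual stripping, this would also keep a domain name intact if it ever contained a comma or an apostrophe.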



In [249]:
capecs_df['capec_domain'].value_counts()

capec_domain
[Software]                                                                     287
[Communications, Software]                                                      67
[Social Engineering]                                                            30
[Software, Hardware]                                                            28
[Physical Security]                                                             15
[Communications]                                                                13
[Software, Physical Security, Hardware]                                          9
[Software, Software, Software]                                                   9
[Supply Chain, Hardware]                                                         9
[Social Engineering, Supply Chain, Software]                                     8
[Supply Chain, Physical Security, Hardware]                                      8
[Supply Chain, Software]                                                  

In [250]:
# value counts for 'capec_domain'
capecs_df['capec_domain'].explode().value_counts(dropna=False)

capec_domain
Software              490
Communications        101
Hardware               91
Social Engineering     66
Supply Chain           52
Physical Security      39
not given               1
Name: count, dtype: int64

In [251]:
duplicates = capecs_df[capecs_df['capec_domain'].apply(lambda x: len(x) != len(set(x)))]['capec_domain']

print(f"Count of rows with duplicates: {len(duplicates)}")
print(duplicates)


Count of rows with duplicates: 29
25                         [Software, Software, Software]
29                                   [Software, Software]
54                                   [Hardware, Hardware]
80                         [Software, Software, Software]
86     [Social Engineering, Social Engineering, Software]
91               [Social Engineering, Social Engineering]
99                                   [Software, Software]
119                        [Software, Software, Software]
123                                  [Hardware, Hardware]
156              [Social Engineering, Social Engineering]
165                                  [Software, Software]
177                        [Software, Software, Software]
201              [Social Engineering, Social Engineering]
225                        [Software, Software, Software]
243                        [Software, Hardware, Software]
262              [Social Engineering, Social Engineering]
286                        [Software, 
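
The output above shows domain lists that repeat the same entry. One possible way to collapse such repeats while keeping the original order is sketched below. The helper name `dedupe_preserving_order` is an assumption for illustration, and the notebook itself does not apply this step.

```python
def dedupe_preserving_order(domains):
    # dict keys keep insertion order (Python 3.7+), so the first occurrence wins
    return list(dict.fromkeys(domains))

# Lists taken from the duplicate rows printed above
examples = [
    ['Software', 'Software', 'Software'],
    ['Software', 'Hardware', 'Software'],
    ['Social Engineering', 'Social Engineering', 'Software'],
]
for domains in examples:
    print(domains, '->', dedupe_preserving_order(domains))
```

On the notebook's frame this could be applied with `capecs_df['capec_domain'].apply(dedupe_preserving_order)`.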

## Data Preparation

Merging title and description to match the form of the original ORCA pipeline.

In [252]:
threats_summary_context = {'Threat title': 'A Threat with the title ', 'Threat Description': ' and the description'}


In [253]:
def add_summary_to_df(input_df, coll_context):
    # Build 'summary' by concatenating each context prefix with its column, in dict order
    input_df['summary'] = ''
    for key, value in coll_context.items():
        input_df['summary'] += value + input_df[key]

In [254]:
add_summary_to_df(df_threats, threats_summary_context)
df_threats.loc[0, 'summary']  

'A Threat with the title An attacker exploits insecure designs or lack of adaption in O-RAN components and the descriptionUnauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the design of the hardware-software O-RAN system and how different functions are segregated within the O-RAN system. \nO-RAN components might be vulnerable if: \n• Outdated component from the lack of update or patch management,\n• Poorly design architecture,\n• Missing appropriate security hardening,\n• Unnecessary or insecure function/protocol/component.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause 

Data preparation for establishing a baseline, implementing the ORCA mapping.

1. Add the embeddings to the threat data and to the CAPEC data

In [255]:
from sentence_transformers import SentenceTransformer, util
import torch

In [256]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

def preprocess_description(description):
    # remove patterns not semantically relevant
    pattern = r"\((http|https)\S+\)|['\[\]]+"
    description = re.sub(pattern, "", description)
    return description

def gen_embedding(description):
    # Accept either a single string or a list of strings (joined with newlines)
    if not isinstance(description, str):
        description = '\n'.join(description)
    description = preprocess_description(description)
    return model.encode(description, convert_to_numpy=True)

def add_embedd_to_df(input_df, description_name):
    # Embed the given text column row by row and store the vectors in 'embedding'
    input_df['embedding'] = input_df[description_name].apply(gen_embedding)
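
To make the effect of the regular expression in `preprocess_description` concrete, the following small sketch applies the same pattern to a made-up sentence. The sample text and URL are invented for illustration only.

```python
import re

# Same pattern as in preprocess_description
pattern = r"\((http|https)\S+\)|['\[\]]+"

# Hypothetical input containing a parenthesised URL and a stringified list
sample = "See the advisory (https://example.com/advisory) for ['Software'] issues"
cleaned = re.sub(pattern, "", sample)
print(cleaned)

# Apostrophes inside words are removed as well
print(re.sub(pattern, "", "don't"))
```

The pattern removes the parenthesised URL, the square brackets, and every straight single quote, but it leaves the surrounding whitespace untouched, so a double space can remain where the URL used to be.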

In [257]:
# add embedding to the threats dataframe
add_embedd_to_df(df_threats, 'summary')
# add embedding to the capecs dataframe
add_embedd_to_df(capecs_df, 'summary')

2. Generate the mapping of threats to CAPECs

In [258]:
DOMAIN = "enterprise-attack"
def gen_mapping(th_df, te_df):
    mapping = pd.DataFrame()
    # Rename the summary columns so both survive the row concatenation below
    if 'summary' in th_df.columns:
        th_df.rename(columns={'summary': 'summary_th_df'}, inplace=True)
    if 'summary' in te_df.columns:
        te_df.rename(columns={'summary': 'summary_ca_df'}, inplace=True)

    # Compare every CAPEC row with every threat row
    for tec_index, row_techniques in te_df.iterrows():
        for thr_index, row_threats in th_df.iterrows():
            cosine_similarity = util.pytorch_cos_sim(torch.from_numpy(row_techniques['embedding']),
                                                     torch.from_numpy(row_threats['embedding']))
            # Combine both rows into a single-row frame and attach the score as a plain float
            comb = pd.concat([row_techniques, row_threats], axis=0)
            comb = comb.to_frame().T
            comb['Similarity'] = cosine_similarity.item()
            mapping = pd.concat([mapping, comb], ignore_index=True)

    mapping['Domain'] = DOMAIN
    return mapping
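
`gen_mapping` computes one cosine similarity per CAPEC/threat pair inside two nested loops. As a sketch of the underlying arithmetic, the same scores can be obtained for all pairs at once by normalising the vectors and taking a matrix product. The function name `cosine_similarity_matrix` and the toy vectors are assumptions for illustration, and the sketch assumes no embedding is the zero vector.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Return an array of shape (len(a), len(b)) with pairwise cosine similarities."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Scale every row to unit length; the dot product of unit vectors is the cosine
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-dimensional vectors standing in for real embeddings
capec_vecs = np.array([[1.0, 0.0], [1.0, 1.0]])
threat_vecs = np.array([[2.0, 0.0], [0.0, 3.0]])
sims = cosine_similarity_matrix(capec_vecs, threat_vecs)
print(sims.round(3))
```

With the notebook's frames, the inputs could be built with `np.vstack(capecs_df['embedding'])` and `np.vstack(df_threats['embedding'])`, so that `sims[i, j]` is the score between CAPEC `i` and threat `j`.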

In [259]:
capec_mapping = gen_mapping(df_threats, capecs_df)

In [260]:
capec_mapping.columns

Index(['CAPEC ID', 'capec_name', 'capec_description', 'capec_domain',
       'summary_ca_df', 'embedding', 'Threat ID', 'Threat title',
       'Threat Description', 'Threat type', 'Impact type', 'Threat agent',
       'Vulnerability', 'Threatened Asset', 'Affected Components',
       'summary_th_df', 'embedding', 'summary_th_df', 'Similarity', 'Domain'],
      dtype='object')

3. Format the mapping

In [261]:
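# Only 'Threat ID' and 'Threat title' occur in capec_mapping; rename() ignores the other keys
COLOM_MAP = {'ID': 'Technique', 'name': 'yeet', 'Threat ID': 'Name', 'tactics': 'tactic', 'Threat title':'Description'}
MAP_ORDER_capec = ['Name', 'Domain', 'Description', 'CAPEC ID', 'Similarity', 'summary_th_df', 'summary_ca_df']
def format_map(ma, MAP_ORDER):
    # Rename in place, then keep only the requested columns in the requested order
    ma.rename(columns=COLOM_MAP, inplace=True)
    ma = ma.loc[:, MAP_ORDER]
    return ma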

In [262]:
formatted_capec_mapping = format_map(capec_mapping, MAP_ORDER_capec)

In [263]:
formatted_capec_mapping.loc[0:5]

Unnamed: 0,Name,Domain,Description,CAPEC ID,Similarity,summary_th_df,summary_th_df.1,summary_ca_df
0,T-O-RAN-01,enterprise-attack,An attacker exploits insecure designs or lack of adaption in O-RAN components,CAPEC-87,0.246462,"A Threat with the title An attacker exploits insecure designs or lack of adaption in O-RAN components and the descriptionUnauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the design of the hardware-software O-RAN system and how different functions are segregated within the O-RAN system. \nO-RAN components might be vulnerable if: \n• Outdated component from the lack of update or patch management,\n• Poorly design architecture,\n• Missing appropriate security hardening,\n• Unnecessary or insecure function/protocol/component.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.\nIn addition, O-RAN components could be software providing network functions, so they are likely to be vulnerable to software flaws: it could be possible to bypass firewall restrictions or to take advantage of a buffer overflow to execute arbitrary commands, etc.","A Threat with the title An attacker exploits insecure designs or lack of adaption in O-RAN components and the descriptionUnauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the design of the hardware-software O-RAN system and how different functions are segregated within the O-RAN system. 
\nO-RAN components might be vulnerable if: \n• Outdated component from the lack of update or patch management,\n• Poorly design architecture,\n• Missing appropriate security hardening,\n• Unnecessary or insecure function/protocol/component.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.\nIn addition, O-RAN components could be software providing network functions, so they are likely to be vulnerable to software flaws: it could be possible to bypass firewall restrictions or to take advantage of a buffer overflow to execute arbitrary commands, etc.","A CAPEC with the title Forceful Browsing. The description of this CAPEC is:An attacker employs forceful browsing (direct URL entry) to access portions of a website that are otherwise unreachable. Usually, a front controller or similar design pattern is employed to protect access to portions of a web application. Forceful browsing enables an attacker to access information, perform privileged operations and otherwise reach sections of the web application that have been improperly protected.. The Domain of this is: Software"
1,T-O-RAN-02,enterprise-attack,An attacker exploits misconfigured or poorly configured O-RAN components,CAPEC-87,0.270846,"A Threat with the title An attacker exploits misconfigured or poorly configured O-RAN components and the descriptionUnauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the configuration of the hardware-software O-RAN system. \nO-RAN components might be vulnerable if: \n• Errors from the lack of configuration change management,\n• Misconfigured or poorly configured O-RAN components,\n• Improperly configured permissions,\n• Unnecessary features are enabled (e.g. unnecessary ports, services, accounts, or privileges),\n• Default accounts and their passwords still enabled and unchanged,\n• Security features are disabled or not configured securely.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.","A Threat with the title An attacker exploits misconfigured or poorly configured O-RAN components and the descriptionUnauthenticated/unauthorized access to O-RAN components could possibly be achieved via the different O-RAN interfaces, depending upon the configuration of the hardware-software O-RAN system. \nO-RAN components might be vulnerable if: \n• Errors from the lack of configuration change management,\n• Misconfigured or poorly configured O-RAN components,\n• Improperly configured permissions,\n• Unnecessary features are enabled (e.g. 
unnecessary ports, services, accounts, or privileges),\n• Default accounts and their passwords still enabled and unchanged,\n• Security features are disabled or not configured securely.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.","A CAPEC with the title Forceful Browsing. The description of this CAPEC is:An attacker employs forceful browsing (direct URL entry) to access portions of a website that are otherwise unreachable. Usually, a front controller or similar design pattern is employed to protect access to portions of a web application. Forceful browsing enables an attacker to access information, perform privileged operations and otherwise reach sections of the web application that have been improperly protected.. The Domain of this is: Software"
2,T-O-RAN-03,enterprise-attack,Attacks from the internet to penetrate O-RAN network boundary,CAPEC-87,0.391938,"A Threat with the title Attacks from the internet to penetrate O-RAN network boundary and the descriptionWeb servers serving O-RAN functional and management services should provide adequate protection. \nAn attacker that have access to the uncontrolled O-RAN network could:\n• Bypass the information flow control policy implemented by the firewall,\n• And/or attack O-RAN components in the trusted networks by taking advantage of particularities and errors in the design and implementation of the network protocols (IP, TCP, UDP, application protocols),\n• Use of incorrect or exceeded TCP sequence numbers,\n• Perform brute force attacks on FTP passwords,\n• Use of improper HTTP user sessions,\n• Etc.\nThe effects of such attacks may include:\n• An intrusion, meaning unauthorized access to O-RAN components,\n• Blocking, flooding or restarting an O-RAN component causing a denial of service,\n• Flooding of network equipment, causing a denial of service,\n• Etc.","A Threat with the title Attacks from the internet to penetrate O-RAN network boundary and the descriptionWeb servers serving O-RAN functional and management services should provide adequate protection. 
\nAn attacker that have access to the uncontrolled O-RAN network could:\n• Bypass the information flow control policy implemented by the firewall,\n• And/or attack O-RAN components in the trusted networks by taking advantage of particularities and errors in the design and implementation of the network protocols (IP, TCP, UDP, application protocols),\n• Use of incorrect or exceeded TCP sequence numbers,\n• Perform brute force attacks on FTP passwords,\n• Use of improper HTTP user sessions,\n• Etc.\nThe effects of such attacks may include:\n• An intrusion, meaning unauthorized access to O-RAN components,\n• Blocking, flooding or restarting an O-RAN component causing a denial of service,\n• Flooding of network equipment, causing a denial of service,\n• Etc.","A CAPEC with the title Forceful Browsing. The description of this CAPEC is:An attacker employs forceful browsing (direct URL entry) to access portions of a website that are otherwise unreachable. Usually, a front controller or similar design pattern is employed to protect access to portions of a web application. Forceful browsing enables an attacker to access information, perform privileged operations and otherwise reach sections of the web application that have been improperly protected.. The Domain of this is: Software"
3,T-O-RAN-04,enterprise-attack,An attacker attempts to jam the airlink signal through IoT devices,CAPEC-87,0.267341,"A Threat with the title An attacker attempts to jam the airlink signal through IoT devices and the descriptionDDoS attacks on O-RAN systems: The 5G evolution means billions of things, collectively referred to as IoT, will be using the 5G O-RAN. Thus, IoT could increase the risk of O-RAN resource overload by way of DDoS attacks. Attackers create a botnet army by infecting many (millions/billions) IoT devices with a “remote-reboot” malware. Attackers instruct the malware to reboot all devices in a specific or targeted 5G coverage area at the same time.","A Threat with the title An attacker attempts to jam the airlink signal through IoT devices and the descriptionDDoS attacks on O-RAN systems: The 5G evolution means billions of things, collectively referred to as IoT, will be using the 5G O-RAN. Thus, IoT could increase the risk of O-RAN resource overload by way of DDoS attacks. Attackers create a botnet army by infecting many (millions/billions) IoT devices with a “remote-reboot” malware. Attackers instruct the malware to reboot all devices in a specific or targeted 5G coverage area at the same time.","A CAPEC with the title Forceful Browsing. The description of this CAPEC is:An attacker employs forceful browsing (direct URL entry) to access portions of a website that are otherwise unreachable. Usually, a front controller or similar design pattern is employed to protect access to portions of a web application. Forceful browsing enables an attacker to access information, perform privileged operations and otherwise reach sections of the web application that have been improperly protected.. The Domain of this is: Software"
4,T-O-RAN-05,enterprise-attack,"An attacker penetrates and compromises the O-RAN system through the open O-RAN’s Fronthaul, O1, O2, A1, and E2",CAPEC-87,0.373829,"A Threat with the title An attacker penetrates and compromises the O-RAN system through the open O-RAN’s Fronthaul, O1, O2, A1, and E2 and the descriptionO-RAN’s Fronthaul, O1, O2, A1, and E2 management interfaces are the new open interfaces that allow software programmability of RAN. These interfaces may not be secured to industry best practices.\nO-RAN components might be vulnerable if: \n• Improper or missing authentication and authorization processes,\n• Improper or missing ciphering and integrity checks of sensitive data exchanged over O-RAN interfaces,\n• Improper or missing replay protection of sensitive data exchanged over O-RAN interfaces,\n• Improper prevention of key reuse,\n• Improper implementation,\n• Improperly validate inputs, respond to error conditions in both the submitted data as well as out of sequence protocol steps.\nAn attacker could, in such case, cause denial-of-service, data tampering or information disclosure, etc.\nNOTE: O-RAN interfaces allow use of TLS or SSH. Industry best practices mandate the use of TLS (v1.2 or higher) or SSH certificate-based authentication. An implementation that implements TLS version lower than 1.2 or a SSH password authentication, may become the key source of vulnerability that a malicious code will exploit to compromise the O-RAN system.","A Threat with the title An attacker penetrates and compromises the O-RAN system through the open O-RAN’s Fronthaul, O1, O2, A1, and E2 and the descriptionO-RAN’s Fronthaul, O1, O2, A1, and E2 management interfaces are the new open interfaces that allow software programmability of RAN. 
These interfaces may not be secured to industry best practices.\nO-RAN components might be vulnerable if: \n• Improper or missing authentication and authorization processes,\n• Improper or missing ciphering and integrity checks of sensitive data exchanged over O-RAN interfaces,\n• Improper or missing replay protection of sensitive data exchanged over O-RAN interfaces,\n• Improper prevention of key reuse,\n• Improper implementation,\n• Improperly validate inputs, respond to error conditions in both the submitted data as well as out of sequence protocol steps.\nAn attacker could, in such case, cause denial-of-service, data tampering or information disclosure, etc.\nNOTE: O-RAN interfaces allow use of TLS or SSH. Industry best practices mandate the use of TLS (v1.2 or higher) or SSH certificate-based authentication. An implementation that implements TLS version lower than 1.2 or a SSH password authentication, may become the key source of vulnerability that a malicious code will exploit to compromise the O-RAN system.","A CAPEC with the title Forceful Browsing. The description of this CAPEC is:An attacker employs forceful browsing (direct URL entry) to access portions of a website that are otherwise unreachable. Usually, a front controller or similar design pattern is employed to protect access to portions of a web application. Forceful browsing enables an attacker to access information, perform privileged operations and otherwise reach sections of the web application that have been improperly protected.. The Domain of this is: Software"
5,T-O-RAN-06,enterprise-attack,An attacker exploits insufficient/improper mechanisms for authentication and authorization to compromise O-RAN components,CAPEC-87,0.327463,"A Threat with the title An attacker exploits insufficient/improper mechanisms for authentication and authorization to compromise O-RAN components and the descriptionO-RAN management and orchestration should not be used without appropriate authentication and authorization and authorization checks. \nO-RAN components might be vulnerable if: \n• Unauthenticated access to O-RAN functions,\n• Improper authentication mechanisms,\n• Use of Predefined/ default accounts,\n• Weak or missing password policy,\n• Lack of mutual authentication to O-RAN components and interfaces,\n• Failure to block consecutive failed login attempts,\n• Improper authorization and access control policy.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.","A Threat with the title An attacker exploits insufficient/improper mechanisms for authentication and authorization to compromise O-RAN components and the descriptionO-RAN management and orchestration should not be used without appropriate authentication and authorization and authorization checks. 
\nO-RAN components might be vulnerable if: \n• Unauthenticated access to O-RAN functions,\n• Improper authentication mechanisms,\n• Use of Predefined/ default accounts,\n• Weak or missing password policy,\n• Lack of mutual authentication to O-RAN components and interfaces,\n• Failure to block consecutive failed login attempts,\n• Improper authorization and access control policy.\nAn attacker could, in such case, either inject malwares and/or manipulate existing software, harm the O-RAN components, create a performance issue by manipulation of parameters, or reconfigure the O-RAN components and disable the security features with the purpose of eavesdropping or wiretapping on various CUS & M planes, reaching northbound systems, attack broader network to cause denial-of-service, steal unprotected private keys, certificates, hash values, or other type of breaches.","A CAPEC with the title Forceful Browsing. The description of this CAPEC is:An attacker employs forceful browsing (direct URL entry) to access portions of a website that are otherwise unreachable. Usually, a front controller or similar design pattern is employed to protect access to portions of a web application. Forceful browsing enables an attacker to access information, perform privileged operations and otherwise reach sections of the web application that have been improperly protected.. The Domain of this is: Software"


In [264]:
formatted_capec_mapping.columns

Index(['Name', 'Domain', 'Description', 'CAPEC ID', 'Similarity',
       'summary_th_df', 'summary_th_df', 'summary_ca_df'],
      dtype='object')
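
The column index above lists `summary_th_df` twice, and the repeated label reappears as `summary_th_df.1` once the frame is written to CSV and read back. A minimal sketch of one way to drop repeated labels before saving, assuming the second column is a true copy of the first. The frame `toy` and its text values are illustrative stand-ins, not the notebook's data:

```python
import pandas as pd

# Toy frame with a repeated column label, mimicking the index shown above.
# The text values are placeholders (assumption), not real summaries.
toy = pd.DataFrame(
    [["T-A1-02", "CAPEC-387", 0.609855, "threat text", "threat text", "capec text"]],
    columns=["Name", "CAPEC ID", "Similarity",
             "summary_th_df", "summary_th_df", "summary_ca_df"],
)

# Index.duplicated() flags the second and later occurrences of a label,
# so negating the mask keeps only the first occurrence of each column.
deduplicated = toy.loc[:, ~toy.columns.duplicated()]
print(list(deduplicated.columns))
```

Keeping only the first occurrence is safe only if the duplicated columns hold identical content; if they differ, renaming one of them is the more cautious choice.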

In [265]:
# For every threat, keep the 5 CAPECs with the highest similarity.
# Sorting and then taking head(5) per group avoids groupby.apply on the
# grouping column and keeps 'Name' in the result.
top_5_capecs_per_threat = (
    formatted_capec_mapping
    .sort_values(['Name', 'Similarity'], ascending=[True, False])
    .groupby('Name')
    .head(5)
    .reset_index(drop=True)
)

# --- Display the results ---
print("Top 5 CAPECs for Each Threat:")
for threat_name in top_5_capecs_per_threat['Name'].unique():
    print(f"\nThreat: {threat_name}")
    threat_top_capecs = top_5_capecs_per_threat[top_5_capecs_per_threat['Name'] == threat_name]
    # Select relevant columns for display
    display_cols = ['CAPEC ID', 'Similarity', 'Description']  # 'Description' holds the threat title here, not the CAPEC description
    for index, row in threat_top_capecs[display_cols].iterrows():
        # Truncate descriptions longer than 100 characters for readability
        desc = row['Description']
        if len(desc) > 100:
            desc = desc[:97] + "..."
        print(f"  - CAPEC ID: {row['CAPEC ID']}, Similarity: {row['Similarity']:.4f}, Description: {desc}")

Top 5 CAPECs for Each Threat:

Threat: T-A1-01
  - CAPEC ID: CAPEC-94, Similarity: 0.5281, Description: Untrusted peering between Non-RT-RIC and Near-RT-RIC
  - CAPEC ID: CAPEC-132, Similarity: 0.5262, Description: Untrusted peering between Non-RT-RIC and Near-RT-RIC
  - CAPEC ID: CAPEC-662, Similarity: 0.5161, Description: Untrusted peering between Non-RT-RIC and Near-RT-RIC
  - CAPEC ID: CAPEC-272, Similarity: 0.5132, Description: Untrusted peering between Non-RT-RIC and Near-RT-RIC
  - CAPEC ID: CAPEC-700, Similarity: 0.5119, Description: Untrusted peering between Non-RT-RIC and Near-RT-RIC

Threat: T-A1-02
  - CAPEC ID: CAPEC-387, Similarity: 0.6099, Description: Malicious function or application monitors messaging across A1 interface
  - CAPEC ID: CAPEC-12, Similarity: 0.5907, Description: Malicious function or application monitors messaging across A1 interface
  - CAPEC ID: CAPEC-388, Similarity: 0.5831, Description: Malicious function or application monitors messaging across A1 



In [266]:
# threat with the highest similarity to a capec
highest_similarity = formatted_capec_mapping['Similarity'].max()
highest_similarity_row = formatted_capec_mapping[formatted_capec_mapping['Similarity'] == highest_similarity]
print("Threat with the highest similarity to a CAPEC:")
print(highest_similarity_row[['Name', 'CAPEC ID', 'Similarity', 'Description']])    

Threat with the highest similarity to a CAPEC:
             Name   CAPEC ID  Similarity  \
70440  T-O-RAN-07  CAPEC-268    0.667489   

                                                                                        Description  
70440  An attacker compromises O-RAN monitoring mechanisms and log files integrity and availability  


In [267]:
# average similarity overall
average_similarity = formatted_capec_mapping['Similarity'].mean()
print("Average similarity:", average_similarity)

Average similarity: 0.3198404


4. Save the mappings above 0.55 similarity to a CSV file

In [268]:
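CAPEC_CSV_FILENAME = 'general_threats_capec_mapping'
similarity = 0.55  # a mapping is kept only if its similarity exceeds this value

def filter_threshold_hfc(df, filter_criteria, min_similarity):
    """Return the rows of df whose 'Similarity' is strictly greater than min_similarity.

    filter_criteria is currently unused; it is kept so existing calls keep working.
    """
    filtered_df = df[df['Similarity'] > min_similarity]
    return filtered_df.reset_index(drop=True)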

In [269]:

capec_mapping_hfc = filter_threshold_hfc(formatted_capec_mapping, "Name",similarity)
capec_mapping_hfc.to_csv("./mapped_data/hfc_"+CAPEC_CSV_FILENAME+"_min"+str(similarity)+".csv", sep=';', index=False)
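
How many mappings survive depends directly on the cutoff, so it can help to look at a few candidate thresholds before fixing one. The sketch below counts the rows that pass each cutoff using the same strict `>` comparison as `filter_threshold_hfc`. The helper `count_above` and the frame `toy_mapping` are hypothetical, and the similarity values are a handful copied from the outputs above rather than the full mapping:

```python
import pandas as pd

# A few similarity scores taken from the outputs shown earlier (illustrative subset).
toy_mapping = pd.DataFrame({
    "Similarity": [0.609855, 0.590688, 0.583051, 0.556925,
                   0.552193, 0.5281, 0.327463],
})

def count_above(df, thresholds):
    """Hypothetical helper: rows with Similarity strictly above each threshold."""
    return {t: int((df["Similarity"] > t).sum()) for t in thresholds}

rows_kept = count_above(toy_mapping, [0.50, 0.55, 0.60])
for threshold, n_rows in rows_kept.items():
    print(f"min similarity {threshold:.2f}: {n_rows} rows kept")
```

Running the same helper on `formatted_capec_mapping` would show how sensitive the exported file is to the choice between 0.5 and 0.55.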

5. Compare ORCA mappings to our mappings

In [270]:
orca_mappings = pd.read_csv('mapped_data/orca_mappings_min0.55.csv', sep=';')
our_mappings = pd.read_csv('mapped_data/hfc_general_threats_capec_mapping_min0.55.csv', sep=';')
print(orca_mappings.columns)
print(our_mappings.columns)

Index(['Name', 'Domain', 'Description', 'CAPEC ID', 'Similarity'], dtype='object')
Index(['Name', 'Domain', 'Description', 'CAPEC ID', 'Similarity',
       'summary_th_df', 'summary_th_df.1', 'summary_ca_df'],
      dtype='object')


In [271]:
orca_mappings.shape, our_mappings.shape

((190, 5), (163, 8))

In [272]:
average_similarity_orca = orca_mappings['Similarity'].mean()
average_similarity_our = our_mappings['Similarity'].mean()
average_similarity_orca, average_similarity_our

(np.float64(0.577509985631579), np.float64(0.5774983157055215))

In [273]:
# Merge ORCA and our mappings on 'Name' and 'CAPEC ID'
merged_similarities = pd.merge(
    orca_mappings[['Name', 'CAPEC ID', 'Similarity']],
    our_mappings[['Name', 'CAPEC ID', 'Similarity']],
    on=['Name', 'CAPEC ID'],
    how='outer',
    suffixes=('_orca', '_our')
)

# Save to CSV
merged_similarities.to_csv('./mapped_data/merged_orca_our_similarities.csv', sep=';', index=False)

# Display the first few rows for verification
merged_similarities.head()

Unnamed: 0,Name,CAPEC ID,Similarity_orca,Similarity_our
0,T-A1-02,CAPEC-12,0.584865,0.590688
1,T-A1-02,CAPEC-272,,0.552193
2,T-A1-02,CAPEC-387,0.596484,0.609855
3,T-A1-02,CAPEC-388,0.564509,0.583051
4,T-A1-02,CAPEC-502,,0.556925
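
In the merged table a missing value in `Similarity_orca` or `Similarity_our` means the pair exists in only one of the two files. A small sketch of how `pd.merge(..., indicator=True)` makes that explicit and how a per-pair score difference could be added. The frames `orca_toy` and `our_toy` are hypothetical and only reproduce the five T-A1-02 rows shown above; the ORCA file is assumed to contain no other T-A1-02 rows in this toy setting:

```python
import pandas as pd

# Rows reconstructed from the merged preview above (T-A1-02 only).
orca_toy = pd.DataFrame({
    "Name": ["T-A1-02"] * 3,
    "CAPEC ID": ["CAPEC-12", "CAPEC-387", "CAPEC-388"],
    "Similarity": [0.584865, 0.596484, 0.564509],
})
our_toy = pd.DataFrame({
    "Name": ["T-A1-02"] * 5,
    "CAPEC ID": ["CAPEC-12", "CAPEC-272", "CAPEC-387", "CAPEC-388", "CAPEC-502"],
    "Similarity": [0.590688, 0.552193, 0.609855, 0.583051, 0.556925],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' (ORCA only)
# or 'right_only' (our mapping only).
merged_toy = pd.merge(
    orca_toy, our_toy,
    on=["Name", "CAPEC ID"], how="outer",
    suffixes=("_orca", "_our"), indicator=True,
)

# Score difference; NaN wherever a pair is missing on one side.
merged_toy["delta"] = merged_toy["Similarity_our"] - merged_toy["Similarity_orca"]

n_both = int((merged_toy["_merge"] == "both").sum())
n_ours_only = int((merged_toy["_merge"] == "right_only").sum())
print(merged_toy[["CAPEC ID", "_merge", "delta"]])
```

Filtering on `_merge` avoids having to infer the origin of a row from NaN patterns in the similarity columns.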


In [274]:
# CAPEC IDs in ORCA but not in our mappings
capec_in_orca_not_in_our = set(orca_mappings['CAPEC ID']) - set(our_mappings['CAPEC ID'])
print("CAPEC IDs in ORCA mappings but not in our mappings:", capec_in_orca_not_in_our)

# CAPEC IDs in our mappings but not in ORCA
capec_in_our_not_in_orca = set(our_mappings['CAPEC ID']) - set(orca_mappings['CAPEC ID'])
print("CAPEC IDs in our mappings but not in ORCA mappings:", capec_in_our_not_in_orca)

CAPEC IDs in ORCA mappings but not in our mappings: {'CAPEC-446', 'CAPEC-101', 'CAPEC-263', 'CAPEC-615', 'CAPEC-633', 'CAPEC-452', 'CAPEC-102', 'CAPEC-698', 'CAPEC-59', 'CAPEC-22', 'CAPEC-212', 'CAPEC-401', 'CAPEC-440', 'CAPEC-500', 'CAPEC-594', 'CAPEC-662'}
CAPEC IDs in our mappings but not in ORCA mappings: {'CAPEC-657', 'CAPEC-476', 'CAPEC-681', 'CAPEC-320', 'CAPEC-458', 'CAPEC-444'}
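
The set differences above compare CAPEC IDs only, so a CAPEC that both approaches assign to different threats still counts as shared. A hedged sketch of a stricter comparison on (threat, CAPEC) pairs, summarised with a Jaccard index. The helpers `pair_set` and `jaccard` are hypothetical, and the toy frames again reproduce only the T-A1-02 rows from the merged preview:

```python
import pandas as pd

def pair_set(df):
    """Hypothetical helper: the set of (threat ID, CAPEC ID) pairs in a mapping."""
    return set(zip(df["Name"], df["CAPEC ID"]))

def jaccard(a, b):
    """Size of the intersection divided by size of the union (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy stand-ins for orca_mappings and our_mappings (T-A1-02 rows only).
orca_toy = pd.DataFrame({
    "Name": ["T-A1-02"] * 3,
    "CAPEC ID": ["CAPEC-12", "CAPEC-387", "CAPEC-388"],
})
our_toy = pd.DataFrame({
    "Name": ["T-A1-02"] * 5,
    "CAPEC ID": ["CAPEC-12", "CAPEC-272", "CAPEC-387", "CAPEC-388", "CAPEC-502"],
})

orca_pairs, our_pairs = pair_set(orca_toy), pair_set(our_toy)
only_ours = sorted(our_pairs - orca_pairs)
overlap = jaccard(orca_pairs, our_pairs)
print("Pairs only in our mapping:", only_ours)
print(f"Jaccard overlap: {overlap:.2f}")
```

Applied to the real frames, the same two helpers would give a single agreement score between the ORCA mapping and ours at the chosen threshold.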


In [275]:
threats_in_orca_not_in_our = orca_mappings[~orca_mappings['Name'].isin(our_mappings['Name'])]
threats_in_orca_not_in_our

Unnamed: 0,Name,Domain,Description,CAPEC ID,Similarity
45,T-R1-03,enterprise-attack,Malicious actor bypasses authentication to Request Data,CAPEC-114,0.558573
69,T-SMO-28,enterprise-attack,External attacker uses External interface to exploit API vulnerability to gain access to SMO,CAPEC-113,0.565503
80,T-O-RAN-09,enterprise-attack,An attacker compromises O-RAN components integrity and availability,CAPEC-440,0.565102
83,T-AppLCM-05,enterprise-attack,Malicious actor modifies application’s SecurityDescriptor,CAPEC-698,0.551221
94,T-GEN-01,enterprise-attack,Software flaw attack,CAPEC-480,0.562108
97,T-VL-01,enterprise-attack,VM/Container hyperjacking attack,CAPEC-480,0.558357
101,T-PHYS-01,enterprise-attack,An intruder into a site gains physical access to O-RAN components to cause damage or access sensitive data,CAPEC-401,0.551274
141,T-AppLCM-05,enterprise-attack,Malicious actor modifies application’s SecurityDescriptor,CAPEC-502,0.579609
181,T-AppLCM-05,enterprise-attack,Malicious actor modifies application’s SecurityDescriptor,CAPEC-445,0.558988


In [276]:
threats_in_our_not_in_orca = our_mappings[~our_mappings['Name'].isin(orca_mappings['Name'])]
threats_in_our_not_in_orca

Unnamed: 0,Name,Domain,Description,CAPEC ID,Similarity,summary_th_df,summary_th_df.1,summary_ca_df
25,T-NEAR-RT-05,enterprise-attack,Attackers exploit non uniquely identified xApps using a trusted xAppID to access to resources and services which they are not entitled to use.,CAPEC-21,0.55103,"A Threat with the title Attackers exploit non uniquely identified xApps using a trusted xAppID to access to resources and services which they are not entitled to use. and the descriptionNot uniquely identifying xApps using a trusted xAppID potentially entails certain threats and potential attacks:\n- A non-unique xAppID might cause misidentification of an xApp, possibly allowing a potentially malicious xApp to request certain services (theft of services), information (data leakage), or alter existing information\n- A malicious xApp might use the xAppID assigned to a legitimate xApp to request services or information from Near-RT RIC platform\n- A non-unique xApp ID could make it impossible to accurately assign actions to the correct xApp\n- A non-unique xApp ID could make it difficult to recognize that a malicious xApp is in the environment","A Threat with the title Attackers exploit non uniquely identified xApps using a trusted xAppID to access to resources and services which they are not entitled to use. and the descriptionNot uniquely identifying xApps using a trusted xAppID potentially entails certain threats and potential attacks:\n- A non-unique xAppID might cause misidentification of an xApp, possibly allowing a potentially malicious xApp to request certain services (theft of services), information (data leakage), or alter existing information\n- A malicious xApp might use the xAppID assigned to a legitimate xApp to request services or information from Near-RT RIC platform\n- A non-unique xApp ID could make it impossible to accurately assign actions to the correct xApp\n- A non-unique xApp ID could make it difficult to recognize that a malicious xApp is in the environment","A CAPEC with the title Exploitation of Trusted Identifiers. 
The description of this CAPEC is:\n <xhtml:p>An adversary guesses, obtains, or ""rides"" a trusted identifier (e.g. session ID, resource ID, cookie, etc.) to perform authorized actions under the guise of an authenticated user or service.</xhtml:p>\n . The Domain of this is: Software"
52,T-VM-C-02,enterprise-attack,VM/Container escape attack,CAPEC-480,0.594237,"A Threat with the title VM/Container escape attack and the descriptionVNF/CNF deployed on the same physical machine as tenants share the same host kernel and host OS resources. Lack of strong isolation between the VMs/Containers and the host allows for a potential risk of a rogue VM/Container escaping the VM/Container confinement and impacting other co-hosted VMs/Containers. In others, an attacker may deploy a new malicious VM/Container configured without network rules, user limitations, etc. to bypass existing defenses within O-Cloud infrastructure.\nAttacker deploys malicious VM/Container to escapes the host (Hypervisor/Container Engine/Host OS) and reaches the server’s hardware, then the malicious VM/Container can gain root access to the whole server where it resides. This gives the malicious VM/Container full control on all the VMs/Containers hosted on the same hacked server. This could allow an attacker to undermine the confidentiality, integrity and/or availability of VNFs/CNFs resources.\nContainers can be deployed by various means, such as via Docker's create and start APIs or via a web application such as the Kubernetes dashboard or Kubeflow. 
Adversaries may deploy containers based on retrieved or built malicious images or from benign images that download and execute malicious payloads at runtime.\nWhen a malicious VM/Container escapes isolation, it can gain full control over the underlying host and cause any of the below serious threats:\n• Attacker would gain the ability to mount attacks on the host or compromise the host functionalities\n• Compromise the confidentiality & integrity of co-hosted VMs/Containers and tenants\n• Launch DDOS attacks on co-hosted VMs/Containers and host services thereby degrading their performance \n• Introduce new vulnerabilities in host to be used for future attacks\n• Lack of network segmentation could potentially expose other VMs/Containers in the environment to attack. An example of this could be reconnaissance, exploitation and subsequent lateral movement to another host within the cluster.","A Threat with the title VM/Container escape attack and the descriptionVNF/CNF deployed on the same physical machine as tenants share the same host kernel and host OS resources. Lack of strong isolation between the VMs/Containers and the host allows for a potential risk of a rogue VM/Container escaping the VM/Container confinement and impacting other co-hosted VMs/Containers. In others, an attacker may deploy a new malicious VM/Container configured without network rules, user limitations, etc. to bypass existing defenses within O-Cloud infrastructure.\nAttacker deploys malicious VM/Container to escapes the host (Hypervisor/Container Engine/Host OS) and reaches the server’s hardware, then the malicious VM/Container can gain root access to the whole server where it resides. This gives the malicious VM/Container full control on all the VMs/Containers hosted on the same hacked server. 
This could allow an attacker to undermine the confidentiality, integrity and/or availability of VNFs/CNFs resources.\nContainers can be deployed by various means, such as via Docker's create and start APIs or via a web application such as the Kubernetes dashboard or Kubeflow. Adversaries may deploy containers based on retrieved or built malicious images or from benign images that download and execute malicious payloads at runtime.\nWhen a malicious VM/Container escapes isolation, it can gain full control over the underlying host and cause any of the below serious threats:\n• Attacker would gain the ability to mount attacks on the host or compromise the host functionalities\n• Compromise the confidentiality & integrity of co-hosted VMs/Containers and tenants\n• Launch DDOS attacks on co-hosted VMs/Containers and host services thereby degrading their performance \n• Introduce new vulnerabilities in host to be used for future attacks\n• Lack of network segmentation could potentially expose other VMs/Containers in the environment to attack. An example of this could be reconnaissance, exploitation and subsequent lateral movement to another host within the cluster.","A CAPEC with the title Escaping Virtualization. The description of this CAPEC is:An adversary gains access to an application, service, or device with the privileges of an authorized or privileged user by escaping the confines of a virtualized environment. The adversary is then able to access resources or execute unauthorized code within the host environment, generally with the privileges of the user running the virtualized process. Successfully executing an attack of this type is often the first step in executing more complex attacks.. The Domain of this is: Software"


## Modeling

## Evaluation