**Predicting Cyber Scores with Random Forest**


In this analysis, we processed a dataset describing the cybersecurity posture of different organizations. The key steps were converting asset values from strings (millions or billions, such as 850M or 1.3B) to numbers, encoding categorical features as integers, and normalizing selected numerical features. A custom cyber score was then built by combining threat level, vulnerability, control strength, time to remediate, security maturity, and security spending, and was discretized into integer categories from 1 to 10. Finally, a Random Forest Regressor was trained to predict the cyber score, and recommendations were generated for each organization from its cyber score.

**Importing Libraries**

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

**Reading the Data File**

In [4]:
# Load the data

data = pd.read_excel("/content/cyber_data.xlsx")

**EDA of the Data**

In [5]:
data.head()

| | Organization | Incident | Asset Value | Threat Level | Vulnerability | Control Strength | Details | Attack Type | Impact | Root Cause | Time to Remediate in Months | Industry | Security Maturity | Security Spending |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adobe | Adobe Data Breach 2024 | 1B | Medium | Medium | High | Customer accounts compromised, data stolen | Phishing | Financial | Social Engineering | 1.0 | Technology | Medium | 0.2 |
| 1 | Airbnb | Airbnb API Exposure 2024 | 850M | Medium | Medium | Medium | Host and guest information exposed through API | API Exploitation | Reputational | Configuration Error | 0.5 | Hospitality | Medium | 0.15 |
| 2 | Amazon | Amazon Internal Leak 2024 | 1.3B | Medium | Medium | High | Internal documents accessed, data leaked | Insider Threat | Operational | Insider Access | 0.75 | E-commerce | High | 0.25 |
| 3 | American Airlines | AA Payment Info Breach 2024 | 850M | High | Medium | Medium | Customer payment information exposed | Malware | Financial | Malware Infection | 1.0 | Transportation | Medium | 0.2 |
| 4 | Apple | Apple API Vulnerability 2024 | 1.6B | Medium | Medium | High | Customer information accessed through API vuln. | API Exploitation | Reputational | Poor API Security | 2.0 | Technology | High | 0.25 |


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107 entries, 0 to 106
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Organization                 107 non-null    object 
 1   Incident                     107 non-null    object 
 2   Asset Value                  107 non-null    object 
 3   Threat Level                 107 non-null    object 
 4   Vulnerability                107 non-null    object 
 5   Control Strength             107 non-null    object 
 6   Details                      107 non-null    object 
 7   Attack Type                  107 non-null    object 
 8   Impact                       107 non-null    object 
 9   Root Cause                   107 non-null    object 
 10  Time to Remediate in Months  107 non-null    float64
 11  Industry                     107 non-null    object 
 12  Security Maturity            107 non-null    object 
 13  Security Spending   

In [7]:
data.describe()

| | Time to Remediate in Months | Security Spending |
|---|---|---|
| count | 107.0 | 107.0 |
| mean | 1.100467 | 0.19757 |
| std | 0.556755 | 0.051759 |
| min | 0.5 | 0.08 |
| 25% | 0.75 | 0.16 |
| 50% | 1.0 | 0.19 |
| 75% | 1.0 | 0.235 |
| max | 3.0 | 0.32 |


**Converting the Asset Value Column from String to Numeric**

In [9]:
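# Convert Asset Value strings such as "850M" or "1.3B" to plain numbers
def convert_asset_value(value):
    """Return the value as a float, expanding an M (millions) or B (billions) suffix."""
    if isinstance(value, str):
        value = value.strip().upper()  # tolerate stray whitespace and lowercase suffixes
        if value.endswith('M'):
            return float(value[:-1]) * 1e6
        elif value.endswith('B'):
            return float(value[:-1]) * 1e9
    # Values without a suffix (or already numeric) are converted directly
    return float(value)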

In [10]:
data['Asset Value'] = data['Asset Value'].apply(convert_asset_value)
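
As a quick sanity check, the conversion can be tried on a few of the strings shown in the data.head() output above. This is only a sketch: the function is repeated here so the snippet runs on its own, and the name sample_values is introduced purely for illustration.

```python
import math

def convert_asset_value(value):
    # Same logic as the notebook cell: expand M (millions) and B (billions)
    if isinstance(value, str):
        if 'M' in value:
            return float(value.replace('M', '')) * 1e6
        elif 'B' in value:
            return float(value.replace('B', '')) * 1e9
    return float(value)

# A few strings taken from the first rows of the dataset
sample_values = ["1B", "850M", "1.3B", "1.6B"]
converted = [convert_asset_value(v) for v in sample_values]

for raw, number in zip(sample_values, converted):
    print(f"{raw:>5} -> {number:,.0f}")
```

After this step every asset value is on the same numeric scale, so 850M and 1.3B can be compared and normalized directly.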


**Encoding Categorical Columns as Numerical Values**

In [11]:
# Define a function to convert categorical features to numerical values
def encode_labels(df, column):
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])
    return df

In [12]:

# Encode categorical features
categorical_columns = ['Threat Level', 'Vulnerability', 'Control Strength', 'Industry', 'Security Maturity']
for col in categorical_columns:
    data = encode_labels(data, col)
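
One detail worth keeping in mind is that LabelEncoder assigns codes by sorting the class names, so the codes do not follow the Low < Medium < High order. The sketch below shows this on a made-up series, and contrasts it with an explicit mapping. The mapping ordinal_map is an assumption added for illustration and is not what the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up example values; the real columns may contain other levels
levels = pd.Series(['Low', 'Medium', 'High', 'Medium'])

# LabelEncoder sorts the classes, so 'High' gets 0, 'Low' 1, 'Medium' 2
le = LabelEncoder()
alphabetical_codes = le.fit_transform(levels)
print(le.classes_.tolist())         # ['High', 'Low', 'Medium']
print(alphabetical_codes.tolist())  # [1, 2, 0, 2]

# Hypothetical alternative: an explicit mapping that preserves the order
ordinal_map = {'Low': 0, 'Medium': 1, 'High': 2}
ordinal_codes = levels.map(ordinal_map)
print(ordinal_codes.tolist())       # [0, 1, 2, 1]
```

Because each column is fitted separately, the same label can also receive different codes in different columns, depending on which levels occur there. This matters whenever the codes are later used in arithmetic, as they are in the cyber score.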


**Normalizing Features: Asset Value and Time to Remediate in Months**

Asset Value and Time to Remediate in Months are min-max scaled to the range [0, 1] so that they are directly comparable with the other features of the dataset. Security Spending is left unscaled.

In [15]:
# Normalize numerical features (excluding Security Spending)
scaler = MinMaxScaler()
numerical_columns = ['Asset Value', 'Time to Remediate in Months']
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])

In [14]:
data.head()

| | Organization | Incident | Asset Value | Threat Level | Vulnerability | Control Strength | Details | Attack Type | Impact | Root Cause | Time to Remediate in Months | Industry | Security Maturity | Security Spending |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adobe | Adobe Data Breach 2024 | 0.571429 | 1 | 2 | 0 | Customer accounts compromised, data stolen | Phishing | Financial | Social Engineering | 0.2 | 21 | 2 | 0.2 |
| 1 | Airbnb | Airbnb API Exposure 2024 | 0.464286 | 1 | 2 | 2 | Host and guest information exposed through API | API Exploitation | Reputational | Configuration Error | 0.0 | 16 | 2 | 0.15 |
| 2 | Amazon | Amazon Internal Leak 2024 | 0.785714 | 1 | 2 | 0 | Internal documents accessed, data leaked | Insider Threat | Operational | Insider Access | 0.1 | 7 | 0 | 0.25 |
| 3 | American Airlines | AA Payment Info Breach 2024 | 0.464286 | 0 | 2 | 2 | Customer payment information exposed | Malware | Financial | Malware Infection | 0.2 | 23 | 2 | 0.2 |
| 4 | Apple | Apple API Vulnerability 2024 | 1.0 | 1 | 2 | 0 | Customer information accessed through API vuln. | API Exploitation | Reputational | Poor API Security | 0.6 | 21 | 0 | 0.25 |
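
The scaled values above follow the min-max formula x' = (x - min) / (max - min). As a small worked check, the sketch below applies MinMaxScaler to a handful of remediation times spanning the minimum (0.5) and maximum (3.0) reported by data.describe(). The array months is a made-up sample, not the full column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample remediation times in months, including the dataset's min and max
months = np.array([[0.5], [0.75], [1.0], [2.0], [3.0]])

scaled = MinMaxScaler().fit_transform(months).ravel()

# The same result computed by hand from the formula
manual = (months.ravel() - 0.5) / (3.0 - 0.5)

print(scaled.round(3).tolist())  # [0.0, 0.1, 0.2, 0.6, 1.0]
```

This matches the table: a remediation time of 1.0 month becomes 0.2 and 2.0 months becomes 0.6.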


**Creating a Cyber Score Formula and Adding the Column to the Data**

In [16]:
# Create the cyber score
data['cyber_score'] = (
    data['Threat Level'] +
    data['Vulnerability'] +
    (1 - data['Control Strength']) +
    data['Time to Remediate in Months'] +
    (1 - data['Security Maturity']) +
    (1 - data['Security Spending'])
)

In [17]:
# Discretize the cyber score into categories 1-10
data['cyber_score'] = np.ceil(MinMaxScaler((1, 10)).fit_transform(data[['cyber_score']])).astype(int)
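
To make the discretization step concrete, here is a minimal sketch on made-up raw scores. It rescales them to the range [1, 10] and rounds up, which is the same operation as in the cell above. The array raw_scores is hypothetical, and the .ravel() call is added only to flatten the result to one dimension.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw scores before discretization
raw_scores = np.array([[0.0], [1.0], [2.5], [4.0]])

# Rescale linearly so the smallest score maps to 1 and the largest to 10
scaled = MinMaxScaler(feature_range=(1, 10)).fit_transform(raw_scores)

# Round up to the next integer to obtain categories 1..10
categories = np.ceil(scaled).astype(int).ravel()

print(scaled.ravel().tolist())  # [1.0, 3.25, 6.625, 10.0]
print(categories.tolist())      # [1, 4, 7, 10]
```

Note that the categories are relative to the dataset: the lowest raw score always lands in category 1 and the highest in category 10, whatever their absolute values are.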


In [18]:
# Prepare the data for training
X = data.drop(columns=['Organization', 'Incident', 'Details', 'Attack Type', 'Impact', 'Root Cause', 'cyber_score'])
y = data['cyber_score']

**Splitting the Dataset into Training and Testing**

In [22]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Training the Model**

In [23]:
# Train a Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [24]:
# Predict on the test set
y_pred = model.predict(X_test)

**Model Evaluation**

In [25]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

In [26]:
print(f'MSE: {mse}, R2: {r2}')

MSE: 0.34937727272727265, R2: 0.9181120581113802
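
For readers less familiar with these two metrics, the sketch below recomputes them by hand on a tiny made-up example and compares the result with scikit-learn. The arrays y_true and y_hat are illustrative values, not the notebook's test data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted scores
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.0])

# MSE: average squared difference between truth and prediction
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: one minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, r2_manual)  # 0.3125 0.9375
```

Taking the square root of the MSE gives the error in the same units as the score, which is often easier to interpret on a 1-10 scale.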


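The model fits the held-out test set well: an R2 of about 0.92 means it explains roughly 92% of the variance in the cyber score, and an MSE of about 0.35 corresponds to a root-mean-square error of about 0.59 on the 1-10 scale. Since the cyber score is computed directly from features that are also part of X, a strong fit is expected, and with only 107 rows the exact figures depend on the particular train/test split.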

**Generating Recommendations Based on the Following Categories**

**(1) 1-2 = Poor Score (2) 3-5 = Moderate (3) 6-8 = Good (4) 9-10 = Best Score**

In [27]:
# Generate recommendations based on cyber score
def generate_recommendations(cyber_score):
    if cyber_score <= 2:
        return [
            "Critical security overhaul needed.",
            "Implement advanced security protocols immediately.",
            "Increase budget for cybersecurity improvements.",
            "Engage with top-tier cybersecurity firms for an in-depth assessment."
        ]
    elif cyber_score <= 5:
        return [
            "Immediate action required to address vulnerabilities.",
            "Implement strict access controls and monitoring.",
            "Engage with cybersecurity consultants for comprehensive risk assessment."
        ]
    elif cyber_score <= 8:
        return [
            "Conduct a thorough security audit.",
            "Increase security training for employees.",
            "Invest in advanced threat detection systems."
        ]
    else:
        return [
            "Maintain current security measures.",
            "Regularly update and review security policies."
        ]


In [28]:
data['recommendations'] = data['cyber_score'].apply(generate_recommendations)
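
The cell above applies the recommendations to the computed cyber_score column. If the same tiers were applied to the model's output instead, the continuous predictions would first need to be turned into integers in the 1-10 range. The sketch below shows one way this might be done. The helper score_tier and the array y_pred_example are hypothetical and stand in for generate_recommendations and model.predict(X_test).

```python
import numpy as np

def score_tier(cyber_score):
    # Hypothetical helper using the same thresholds as generate_recommendations
    if cyber_score <= 2:
        return "Poor"
    elif cyber_score <= 5:
        return "Moderate"
    elif cyber_score <= 8:
        return "Good"
    return "Best"

# Made-up continuous predictions, as a regressor would return them
y_pred_example = np.array([1.4, 4.6, 8.49, 9.7])

# Round to the nearest integer and keep the result inside 1..10
rounded = np.clip(np.rint(y_pred_example), 1, 10).astype(int)
tiers = [score_tier(s) for s in rounded]

print(rounded.tolist())  # [1, 5, 8, 10]
print(tiers)             # ['Poor', 'Moderate', 'Good', 'Best']
```

Rounding to the nearest integer is one choice among several. Rounding up with np.ceil would mirror the discretization used to build the score, at the cost of pushing borderline predictions into the higher tier.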


In [29]:
# Display the updated dataframe with recommendations
data[['Organization', 'cyber_score', 'recommendations']]

| | Organization | cyber_score | recommendations |
|---|---|---|---|
| 0 | Adobe | 7 | [Conduct a thorough security audit., Increase ... |
| 1 | Airbnb | 5 | [Immediate action required to address vulnerab... |
| 2 | Amazon | 9 | [Maintain current security measures., Regularl... |
| 3 | American Airlines | 4 | [Immediate action required to address vulnerab... |
| 4 | Apple | 10 | [Maintain current security measures., Regularl... |
| 5 | Atlassian | 5 | [Immediate action required to address vulnerab... |
| 6 | British Airways | 2 | [Critical security overhaul needed., Implement... |
| 7 | Capcom | 5 | [Immediate action required to address vulnerab... |
| 8 | Cisco | 10 | [Maintain current security measures., Regularl... |
| 9 | Discord | 4 | [Immediate action required to address vulnerab... |
