# Secure by Design Data Science Workshop

## Hands-on Exercise with JavaScript Vulnerability Data Set

## Background
**Problem**: Static code analysis tools like OpenStaticAnalyzer and escomplex provide various metrics on code, but these metrics do not necessarily indicate whether the code is vulnerable. Some of them (like Halstead metrics) output numerical values whose interpretation for code security is ambiguous.

**Goal**: Explore methods to see if metrics generated by static code complexity analysis can be used to predict a function's vulnerability.


## Dataset
The dataset was derived from vulnerability information in the public databases of the Node Security Platform and the Snyk Vulnerability Database, and from code-fixing patches on GitHub applied to JavaScript code.

It contains 12,125 labelled functions, of which 1,496 are vulnerable.

Code metrics are provided by OpenStaticAnalyzer and escomplex tools.

Reference: Ferenc, Rudolf & Hegedüs, Péter & Gyimesi, Péter & Antal, Gabor & Bán, Dénes & Gyimothy, Tibor. (2019). Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions. 8-14. 10.1109/RAISE.2019.00010.

## Download Dataset

In [None]:

import gdown

file_id = "1GfGsJ-rSdbyGwAZ7CxG8QFMw3H_mBJxL"

output_file = "JSVulnerabilityDataSet-1.0.csv"

gdown.download(f"https://drive.google.com/uc?id={file_id}", output_file)

## Import Libraries
- 'import _____ as ___' allows us to reference the package using an alias. For example, 'import pandas as pd' allows us to use 'pd' every time we reference pandas.
- 'from _____ import _____' allows us to import specific methods or objects from a package. For example, 'from sklearn.neighbors import KNeighborsClassifier' imports KNeighborsClassifier from the sklearn.neighbors module.
### Data Processing
- 'numpy' is a low-overhead data management library using arrays.
- 'pandas' is a high-overhead data management library that uses dataframes, which include many convenient methods for exploring and processing the data.
- 'sklearn' is a data science library that contains many convenient methods for predictive modeling.

### Data Visualization
- 'matplotlib' is for plotting.
- 'seaborn' is for plotting; here it is used specifically for the correlation heatmap.
- 'plotly' is also used for plotting, with a more intuitive and easy-to-use API at the expense of the fine-grained customizability available in matplotlib.
- 'ipywidgets' is used to generate interactive widgets for the user.
- 'IPython.display' provides the display and clear_output functions, which help control displayed artifacts within the notebook.

In [None]:
# Install the packages if they are not in the system.
# If there are errors related to missing packages, delete '--quiet' at the end to view output of this step.
%pip install numpy pandas scikit-learn seaborn matplotlib plotly ipywidgets --quiet

In [None]:
# Data Processing
import numpy as np
import pandas as pd
import sklearn

# Model Training
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Model Evaluation
from sklearn.metrics import classification_report
### Double line import for easier reading
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import ipywidgets as widgets
from IPython.display import display, clear_output, Markdown
###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Import seaborn into the Colab environment and use the alias sns

import seaborn as sns

###########################
###########################


# Load Data

In [None]:
df = pd.read_csv(output_file, index_col=0)
# Lowercase the column names
df.columns = df.columns.str.lower()

# Data Exploration

Field descriptions are as follows

METRIC | DESCRIPTION | TOOL
-------|------------|-------
CC | Clone Coverage | OSA
CCL | Clone Classes | OSA
CCO | Clone Complexity | OSA
CI  | Clone Instances | OSA
CLC | Clone Line Coverage | OSA
LDC | Lines of Duplicated Code | OSA
McCC, CYCL | Cyclomatic Complexity | OSA, escomplex
NL | Nesting Level | OSA
NLE |Nesting Level without else-if |OSA
CD, TCD | (Total) Comment Density | OSA
CLOC, TCLOC | (Total) Comment Lines of Code | OSA
DLOC | Documentation Lines of Code | OSA
LLOC, TLLOC | (Total) Logical Lines of Code | OSA
LOC, TLOC | (Total) Lines of Code | OSA
NOS, TNOS | (Total) Number of Statements | OSA
NUMPAR, PARAMS | Number of Parameters | OSA, escomplex
HOR D | Nr. of Distinct Halstead Operators | escomplex
HOR T | Nr. of Total Halstead Operators | escomplex
HON D | Nr. of Distinct Halstead Operands | escomplex
HON T | Nr. of Total Halstead Operands | escomplex
HLEN | Halstead Length | escomplex
HVOC | Halstead Vocabulary Size |escomplex
HDIFF | Halstead Difficulty | escomplex
HVOL |Halstead Volume| escomplex
HEFF |Halstead Effort |escomplex
HBUGS |Halstead Bugs| escomplex
HTIME |Halstead Time| escomplex
CYCL DENS| Cyclomatic Density |escomplex

# Problem
- Static code metrics do not necessarily reflect vulnerabilities. Consider a few examples:

## Cyclomatic Complexity
- A measure of the number of linearly independent paths through a program's source code. Also called McCabe's complexity.
- Computed on the control-flow graph, so it effectively counts the logical forks in the code.
- $CC = E - N + 2P$
    - E: Number of edges (transfers of control between code blocks) in the control-flow graph
    - N: Number of nodes (code blocks) in the control-flow graph
    - P: Number of connected components (1 for a single function)
- Can be simplified to $CC = \text{Num Decision Points} + 1$ for a single function

#### Example
- CC = 3 Decision points + 1 = 4
```javascript
function validateInput(input) {
  if (!input) {
    return false;
  }
  
  if (input.length < 2) {
    return false;
  }
  
  if (input.length > 200) {
    return false;
  }

  return true;
}
```
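
Both formulas above can be checked against each other on this example. The sketch below writes out the control-flow graph of `validateInput` by hand; the node names and the single shared `exit` node are assumptions made for illustration, not output from OpenStaticAnalyzer or escomplex.

```python
# Hand-written control-flow graph of validateInput (illustrative assumption).
# Each `if` node has two outgoing edges; every `return` flows to one exit node.
edges = [
    ("if_empty", "return_false_1"),
    ("if_empty", "if_too_short"),
    ("if_too_short", "return_false_2"),
    ("if_too_short", "if_too_long"),
    ("if_too_long", "return_false_3"),
    ("if_too_long", "return_true"),
    ("return_false_1", "exit"),
    ("return_false_2", "exit"),
    ("return_false_3", "exit"),
    ("return_true", "exit"),
]

nodes = {node for edge in edges for node in edge}
num_edges = len(edges)       # E
num_nodes = len(nodes)       # N
num_components = 1           # P: a single function

# Graph formula: CC = E - N + 2P
cc_graph = num_edges - num_nodes + 2 * num_components

# Shortcut: number of decision points + 1
decision_points = 3          # the three `if` statements
cc_shortcut = decision_points + 1

print(f"E={num_edges}, N={num_nodes}, P={num_components}")
print(f"CC from graph: {cc_graph}, CC from decision points: {cc_shortcut}")
```

With 10 edges and 8 nodes, both routes give a cyclomatic complexity of 4.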

## Halstead Metrics
- A family of metrics derived from counts of operators and operands.
    - $\eta_1 = \text{The number of distinct operators}$: Includes functions, operators (like &, +, (), etc.)
    - $\eta_2 = \text{The number of distinct operands}$: Includes variables, constants, etc.
    - $N_1 = \text{The total number of operators}$
    - $N_2 = \text{The total number of operands}$

- Halstead Length = $N = N_1 + N_2$
- Halstead Vocabulary = $\eta = \eta_1 + \eta_2$
- Halstead Difficulty = $D = \frac{\eta_1}{2}\times \frac{N_2}{\eta_2}$
- Halstead Volume = $V = N \times \log_2\eta$
- Halstead Effort = $E = D \times V$
- Halstead Bugs = $B = \frac{V}{3000}$
- Halstead Time = $T = \frac{E}{18}\text{ seconds}$

#### Example
- Distinct operators = `main, (), {}, prompt, [], +, =, split, map, Math.floor, /, console.log` (12 operators)
- Distinct operands = `input, a, b, c, Number, avg, 3` (7 operands)
- Total Operators = 27
- Total Operands = 15

- Halstead Length = 27 + 15 = 42
- Halstead Vocabulary = 12 + 7 = 19
- Halstead Difficulty = 12.86
- Halstead Volume = 178.41
- Halstead Effort = 2293.88
- Halstead Bugs = 0.06
- Halstead Time = 127.44 seconds

```javascript
function main() {
  // Simulate user input
  let input = prompt("Enter three integers separated by spaces:");
  let [a, b, c] = input.split(" ").map(Number);

  let avg = Math.floor((a + b + c) / 3);

  console.log("avg =", avg);
}
```
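
The Halstead formulas can be applied directly to the four counts from the worked example above. The following is a minimal sketch that takes those counts as given (12 distinct operators, 7 distinct operands, 27 total operators, 15 total operands); it does not parse the JavaScript itself, and the variable names are illustrative.

```python
import math

# Counts taken from the worked example (not computed from the source code).
n1, n2 = 12, 7     # distinct operators, distinct operands
N1, N2 = 27, 15    # total operators, total operands

length = N1 + N2                          # N = N1 + N2
vocabulary = n1 + n2                      # eta = eta1 + eta2
difficulty = (n1 / 2) * (N2 / n2)         # D
volume = length * math.log2(vocabulary)   # V = N * log2(eta)
effort = difficulty * volume              # E = D * V
bugs = volume / 3000                      # B
time_seconds = effort / 18                # T

print(f"Length={length}, Vocabulary={vocabulary}")
print(f"Difficulty={difficulty:.2f}, Volume={volume:.2f}, Effort={effort:.2f}")
print(f"Bugs={bugs:.2f}, Time={time_seconds:.2f} seconds")
```

Because the sketch keeps full precision throughout, Effort and Time can differ in the last digits from hand calculations that round Difficulty and Volume first.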

# Question
- Do these numbers give you sufficient insight into code vulnerability?


# View first 10 Records
'df.head(10)' displays the first 10 records ('df.head()' without an argument shows only the first 5), which provides an initial understanding of the fields, data types, and data scales. Observe that:

- Some fields are strings
- Other fields are numerical, either an integer or a float

In [None]:
###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Print and inspect the first 10 records
df.head(10)
###########################
###########################

# Statistics and Histogram of the raw data

Observe that:

- There are many different scales, and the data are not Gaussian (i.e., not normal)
- Many fields are categorical and not numeric (despite their dtypes being int64)
- Some fields are binary (despite their dtypes being int64)
- Some fields contain imbalanced data
- 'cllc' may be a mislabelled column (or could stand for LDC)

In [None]:
###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Inspect the data types of your dataframe
# 'dtypes' is an attribute, not a method, so it is accessed without parentheses
df.dtypes
###########################
###########################

In [None]:
# Select only the numeric columns
numeric_df = df.select_dtypes(include=np.number)

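# Drop positional features (reassign instead of using inplace=True to avoid
# modifying a possible view of df and the resulting SettingWithCopyWarning)
numeric_df = numeric_df.drop(columns=['line', 'column', 'endline', 'endcolumn'])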


In [None]:
###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Generate histograms of the numeric data.
numeric_df.hist(figsize=(12, 10))
###########################
###########################
plt.tight_layout()


# Data Distributions

We can visualize the distribution of vulnerable versus non-vulnerable functions in the dataset.
Observe that:
- There are far more non-vulnerable functions than vulnerable ones, so this is an 'imbalanced' dataset.

In [None]:
###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Get the value counts for the 'vuln' column from the numeric data
value_counts = numeric_df['vuln'].value_counts()
###########################
###########################


In [None]:
# Plot a pie chart
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(value_counts, autopct='%1.1f%%', startangle=90, textprops=dict(color="w"))

# Add count labels to each slice
for i, wedge in enumerate(wedges):
    # Wedges follow the positional order of value_counts, so index by position
    count_label = f"{value_counts.iloc[i]}"
    plt.setp(autotexts[i], text=f"{autotexts[i].get_text()}\n({count_label})")

# Legend
custom_labels = ['Vuln not detected', 'Vuln detected']
ax.legend(wedges, custom_labels, loc='upper right', bbox_to_anchor=(1.1, 1))
ax.set_title('Distribution of vuln functions')
ax.axis('equal')
plt.show()

# Correlation Plot
We'll compute the correlation between each pair of variables in the pre-processed data. Correlation is a measure of how linearly related two variables are to each other. We'll then visualize the correlations in a heatmap.

Since there are a lot of features, we can improve the visualization by keeping only the correlations above a threshold (0.7) or below its negative (-0.7) before plotting.

Notice that our feature of interest ('vuln') does not have a strong positive or negative correlation with any of the other features.
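
To make "linearly related" concrete, the sketch below computes the Pearson correlation coefficient (the default used by the pandas `corr()` method) by hand on made-up values. The helper name `pearson` and the data are illustrative assumptions; the point is that a variable can be fully determined by another and still show a correlation of zero when the relationship is not linear.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]    # perfectly linear in x
quadratic = [v ** 2 for v in x]    # determined by x, but not linearly

r_linear = pearson(x, linear)
r_quadratic = pearson(x, quadratic)

print(f"linear: r = {r_linear:.2f}")
print(f"quadratic: r = {r_quadratic:.2f}")
```

A weak correlation between 'vuln' and a metric therefore rules out a strong linear relationship, not every possible relationship.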

In [None]:
###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Use the pandas method corr() to compute pairwise correlations
# between all numeric features
correlations = numeric_df.corr()
###########################
###########################
correlations


In [None]:



###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Find values with strong positive or negative correlation and omit the values on the diagonal (a feature correlated with itself)
# Set a threshold of your choosing so that only correlations stronger than it (positive or negative) are shown
threshold = 0.7
###########################
###########################

high_corr = correlations[((correlations < -threshold) | (correlations > threshold) ) & (correlations != 1)]

# Plot the filtered correlation matrix
plt.figure(figsize=(18, 12))
sns.heatmap(high_corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Filtered Correlation Matrix')
plt.show()

# Train and Classify

We will now train a classification model called 'K Nearest Neighbors' (k-NN). The goal is for the model to learn which features (and which values of them) are indicative of a function's vulnerability.


k-NN classifies a point by finding its k nearest neighbors in the training data using a distance measure and assigning the class chosen by a majority vote among those neighbors. The model stores the training points and uses them to predict the class of new points. In this notebook the votes are weighted by inverse distance ('weights="distance"'), so closer neighbors count more.
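
The voting idea can be sketched in a few lines of plain Python. This is a simplified illustration with an unweighted majority vote and Euclidean distance, not the scikit-learn implementation; the function name `knn_predict` and the toy data are made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy, made-up data: two features per function, label 1 = vulnerable
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
train_labels = [0, 0, 0, 1, 1]

pred_a = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
pred_b = knn_predict(train_points, train_labels, (4.0, 4.0), k=3)
print(pred_a, pred_b)
```

The first query sits among the points labelled 0 and the second among those labelled 1, so the votes go 3-0 and 2-1 respectively.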

### Key Parameters:

* **`k`**: The number of nearest neighbors to consider for each data point.

### Steps:

1. **Standardize Features**: The data is standardized using `StandardScaler` so that every feature has zero mean and unit variance.
2. **Split the Data** into stratified training and test sets.
3. **Fit k-NN** on the training data.
4. **Predict Labels** for each test sample from its `k` nearest neighbors in the training set.
5. **Evaluate the Results** using a confusion matrix and a classification report.
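
The standardization step matters for k-NN because distances are dominated by whichever feature has the largest scale. The sketch below shows the z-score transformation on made-up values for a single feature. Note one deliberate difference from this notebook, which scales the full dataset before splitting: here the mean and standard deviation are taken from the training values only and then reused for the test values, a common variation that keeps test-set statistics out of training. The helper names are illustrative.

```python
import statistics

def fit_scaler(values):
    """Return the mean and population standard deviation of one feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply the z-score transformation (value - mean) / std."""
    return [(v - mean) / std for v in values]

# Made-up values for a single feature, e.g. lines of code
train_loc = [10, 20, 30, 40]
test_loc = [25, 50]

# Learn the statistics from the training values only
mean, std = fit_scaler(train_loc)

train_scaled = transform(train_loc, mean, std)
test_scaled = transform(test_loc, mean, std)   # reuse training statistics

print(f"mean={mean}, std={std:.3f}")
print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

After scaling, the training values have zero mean and unit variance, while the test values are expressed on the same scale without having influenced it.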


The model's performance is assessed using a confusion matrix, which shows the counts of true positive, true negative, false positive, and false negative classifications.

Additionally, the following metrics give insight into the model's strengths and weaknesses:
- Accuracy: The ratio of correct predictions (true positives + true negatives) to all records. For imbalanced datasets, accuracy can be misleadingly high, because a model scores well by simply guessing the majority class every time.
- Precision: Of the records predicted to be vulnerable, how many are truly vulnerable.
- Recall: Of the total vulnerable records, how many did the model predict as vulnerable.
- F1 score: The harmonic mean of precision and recall. For imbalanced datasets, it provides a better measure of the model's performance than accuracy.
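
The effect of class imbalance on these metrics can be worked through with the dataset's own class counts (12,125 functions, 1,496 of them vulnerable). The sketch below evaluates a hypothetical classifier that labels every function as not vulnerable; it is a baseline for illustration, not a model trained in this notebook.

```python
total = 12125
vulnerable = 1496

# Hypothetical baseline: predict "not vulnerable" for every function
tp = 0                      # vulnerable functions correctly flagged
fp = 0                      # non-vulnerable functions wrongly flagged
fn = vulnerable             # vulnerable functions missed
tn = total - vulnerable     # non-vulnerable functions correctly passed

accuracy = (tp + tn) / total
# Guard against division by zero when nothing is predicted positive
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (
    2 * precision * recall / (precision + recall)
    if (precision + recall)
    else 0.0
)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")
```

The baseline reaches roughly 87.7% accuracy while finding no vulnerable functions at all, which is why recall and F1 for the vulnerable class deserve more attention than accuracy here.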

## Prepare Data

In [None]:
# Features to train (remove labels)
X = numeric_df.drop('vuln', axis=1)
# Class labels
y = numeric_df.vuln.values

# Model Training and Prediction

# BREAKDOWN STEPS

In [None]:
# Step 1: Data Preparation and Scaling


X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

X_selected = X_scaled.copy(deep=True)
print(f"Shape of scaled features: {X_selected.shape}")

In [None]:
# Step 2: Train-Test Split

# Set the test size
test_size = 0.2

# Split the data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=test_size, stratify=y, random_state=123)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")


In [None]:
# Step 3: Create and Train KNN Model

# Set the number of neighbors
k = 5


knn = KNeighborsClassifier(n_neighbors=k, weights='distance')


###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Train the KNN model on the training data
knn.fit(X_train, y_train)
###########################
###########################

print(f"KNN model trained with k={k}")

In [None]:
# Step 4: Make Predictions

###########################
###########################
# STUDENTS ADD CODE HERE TO:
# Use the trained model to predict labels for the test set
y_pred = knn.predict(X_test)
###########################
###########################

print(f"Predictions made on {len(y_pred)} test samples")

In [None]:
# Step 5: Evaluate Model Performance

# Define class labels
labels = ['vuln not detected', 'vuln detected']


(ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred), display_labels=labels)).plot()

plt.title(f'KNN Confusion Matrix (k={k})')
plt.show()

print(classification_report(y_test, y_pred, target_names=labels))


# PLAY WITH K

In [None]:
# Widget elements
col_list = list(X.columns) + (["all"])
k_slider = widgets.IntSlider(value=1, min=1, max=20, step=1, description='K')
traintestsplit_slider = widgets.FloatSlider(value=0.3,min=0.1, max=0.9, step=0.1, description='Test Split')
feature1_dropdown = widgets.Dropdown(options=col_list, value='cc', description='Feature 1')
feature2_dropdown = widgets.Dropdown(options=col_list, value='ccl', description='Feature 2')

update_button = widgets.Button(description='Update')

# Display the widgets
#display(k_slider, traintestsplit_slider, feature1_dropdown, feature2_dropdown, update_button)

# Function to update the plot
def update_plot(k, train_test_split_ratio, feature1, feature2):
    # Scale data
    X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=numeric_df.drop('vuln', axis=1).columns)
    if feature1 == 'all' or feature2 == 'all':
      X_selected = X_scaled.values
    else:
      X_selected = X_scaled[[feature1, feature2]].values
    X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=train_test_split_ratio, stratify=y, random_state=123)

    knn = KNeighborsClassifier(n_neighbors=k, weights='distance')
    knn.fit(X_train, y_train)

    # Predict class labels for the test data
    y_pred = knn.predict(X_test)
        
    # Clear the previous plot
    clear_output(wait=True)
    # Display the widgets again
    display(k_slider, traintestsplit_slider, feature1_dropdown, feature2_dropdown, update_button)

    if feature1 == 'all' or feature2 == 'all':
      labels = ['vuln not detected','vuln detected']
      (ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test,y_pred), display_labels=labels)).plot()
      print((classification_report(y_test, y_pred, target_names = labels)))
    else:
      # Plotting
      plt.figure(figsize=(10, 6))
      scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor='k', s=20, label='Data points')

      # Create a legend showing, per class, the predicted count / true count in the test set
      handles, labels = scatter.legend_elements()
      class_labels = ['vuln not detected', 'vuln detected']
      # Count over every class index so that a class that is never predicted yields 0 instead of an IndexError
      counts = [np.sum(y_pred == i) for i in range(len(class_labels))]
      true_counts = [np.sum(y_test == i) for i in range(len(class_labels))]
      legend_labels = [f'{class_labels[i]} ({counts[i]}/{true_counts[i]})' for i in range(len(class_labels))]

      plt.legend(handles=handles, labels=legend_labels)

      plt.title(f'KNN Classification (k={k})')
      plt.xlabel(feature1)
      plt.ylabel(feature2)
      plt.show()   


# Function to handle update button click
def on_update_button_clicked(b):
    update_plot(k_slider.value, traintestsplit_slider.value, feature1_dropdown.value, feature2_dropdown.value)

# Connect the button click event to the handler
update_button.on_click(on_update_button_clicked)

# Initial plot
update_plot(k_slider.value,traintestsplit_slider.value, feature1_dropdown.value, feature2_dropdown.value)