<a href="https://colab.research.google.com/github/WongCSheng/GovTech-Challenge-2024/blob/Sonotype/sonotype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

The TrAItor has attempted to break into the model and steal our sensitive information. However, you've intercepted an audio recording of the TrAItor typing their password. Can you decipher the password using only the keystroke sounds?

## Objective

Analyze the audio recordings to determine the exact password. Use audio analysis and pattern recognition to extract the password from the key press sounds. Submit the correct password to complete the challenge.

**NOTE** Rate Limit requirements.

Please limit endpoint requests to 1 request per second per user. Any excessive requests may result in disqualification from the competition.
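
One way to stay within this limit is to route every endpoint call through a small throttle. The following is a minimal sketch, not part of the challenge material: `RateLimiter` and `throttled` are hypothetical helper names, and the one-second interval is taken from the note above.

```python
import time


class RateLimiter:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


# Hypothetical module-level limiter shared by all requests in the notebook
limiter = RateLimiter(min_interval=1.0)


def throttled(func, *args, **kwargs):
    """Call `func` only once the limiter allows another request."""
    limiter.wait()
    return func(*args, **kwargs)
```

With this in place, a call such as `throttled(query, "candidate")` would wait as needed before invoking the `query` function defined later in this notebook.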

## Setup

In order to interact with the challenge, you will need your API Key.

You can find this key in the Crucible platform in this challenge.

[https://crucible.dreadnode.io/challenges/sonotype](https://crucible.dreadnode.io/challenges/sonotype)

In [4]:
CRUCIBLE_API_KEY = "YOUR_API_KEY"  # Replace with your actual API key; never commit a real key

### Install Required Packages

Install the required packages.


In [5]:
%pip install requests
%pip install numpy
%pip install matplotlib
%pip install librosa
%pip install scikit-learn
%pip install joblib



### Set Notebook Variables

Set the variables used throughout the notebook.


In [6]:
CHALLENGE = "sonotype"
CRUCIBLE_URL = "https://crucible.dreadnode.io"
CHALLENGE_URL = "https://sonotype.crucible.dreadnode.io"
ARTIFACT_FILES = ['recordings.tar.gz']

### Download Data

Before diving into data analysis, you'll need to download the necessary dataset for this challenge.

In [7]:
import requests


for artifact in ARTIFACT_FILES:
    url = f"{CRUCIBLE_URL}/api/artifacts/{CHALLENGE}/{artifact}"
    headers = {"X-API-Key": CRUCIBLE_API_KEY}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        with open(artifact, "wb") as file:
            file.write(response.content)
        print(f"{artifact} was successfully downloaded")
    else:
        print(f"Failed to download {artifact}")

recordings.tar.gz was successfully downloaded


Extract the recordings from the downloaded archive.

In [8]:
import tarfile

# Extract the recordings into the current directory; the context manager
# closes the archive even if extraction fails
with tarfile.open('recordings.tar.gz') as archive:
    archive.extractall('./')
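
The extracted folder appears to contain `._keystroke_*.wav` entries alongside the real recordings (see the `Error loading ...` messages further down). These are most likely AppleDouble metadata files written when the archive was created on macOS, and they are not valid WAV files. The sketch below lists only the real recordings together with their labels. `list_keystroke_files` is a hypothetical helper, and the `keystroke_<label>.wav` naming is assumed from the comment in `prepare_data`.

```python
import os


def list_keystroke_files(folder_path):
    """Return sorted (path, label) pairs for the real .wav recordings in folder_path."""
    pairs = []
    for file_name in sorted(os.listdir(folder_path)):
        # Skip AppleDouble metadata files such as "._keystroke_a.wav" and non-WAV files
        if file_name.startswith("._") or not file_name.endswith(".wav"):
            continue
        stem, _ = os.path.splitext(file_name)  # e.g. "keystroke_a"
        _, _, label = stem.partition("_")      # e.g. "a" or "Enter"
        pairs.append((os.path.join(folder_path, file_name), label))
    return pairs
```

Iterating over `list_keystroke_files("recordings/keystrokes")` instead of `os.listdir` would avoid the load errors without hiding genuine `wave.Error` failures.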

## Submit Your Findings

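Send a candidate password to the challenge's `/score` endpoint with the `query` function below. Run the cells under Setup first: `query` needs `CHALLENGE_URL` and `CRUCIBLE_API_KEY`, and fails with a `NameError` if they are undefined.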

In [3]:
import requests


def query(input_data):
    response = requests.post(
        f"{CHALLENGE_URL}/score",
        headers={"X-API-Key": CRUCIBLE_API_KEY},
        json={"data": input_data},
    )
    return response.json()


response = query("c1111c1c11'1e1111cc11111f1c111111111c111c111111111c111111c15'1e1c11e1111111")
print(response)

NameError: name 'CHALLENGE_URL' is not defined

## Submit Your Flag

When you find the answer, you will get a flag back. It will look something like `gAAAAA ...`.
You can submit it using the code below, or, on the challenge page.

In [None]:
import requests


def submit_flag(flag):
    url = f"{CRUCIBLE_URL}/api/challenges/{CHALLENGE}/submit-flag"
    headers = {"X-API-Key": CRUCIBLE_API_KEY}
    payload = {"challenge": CHALLENGE, "flag": flag}
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        if response.json().get("correct") is True:
            print("The flag was correct. Congrats!")
        else:
            print("The flag was incorrect. Keep trying!")
    else:
        print("There was an error submitting your flag")
        print(response.text)


FLAG = "gAAAAA..."  # Replace with the flag once you find it
submit_flag(FLAG)

## Supplemental Materials

Use these materials for help solving the challenge.

### Helper Functions

In [9]:
import matplotlib.pyplot as plt
import numpy as np
import wave


def load_audio(file_path):
    """Load an audio file and return the audio data and frame rate."""
    with wave.open(file_path, "rb") as wf:
        frame_rate = wf.getframerate()
        n_frames = wf.getnframes()
        audio_data = wf.readframes(n_frames)
        audio_data = np.frombuffer(audio_data, dtype=np.int16)
    return audio_data, frame_rate


def plot_waveform(audio_data, frame_rate):
    """Plot the waveform of the audio data."""
    time_axis = np.linspace(0, len(audio_data) / frame_rate, num=len(audio_data))
    plt.figure(figsize=(15, 5))
    plt.plot(time_axis, audio_data)
    plt.title("Audio Waveform")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()


# Example usage
#audio_file = ARTIFACT_FILES[0]
#audio_data, frame_rate = load_audio(audio_file)
#plot_waveform(audio_data, frame_rate)

## Analyse

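Train a k-nearest-neighbours classifier on the mean MFCC features of the labelled keystroke recordings, then classify fixed-length segments of a recording to recover the typed keys. This cell relies on `load_audio`, `np`, and `wave` from the Helper Functions cell, so run that cell first.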

In [16]:
import librosa
import os
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

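def remove_silence(audio_data, frame_rate, top_db=20):
    """Trim leading and trailing silence (more than top_db below the peak) from audio data."""
    # frame_rate is not needed by librosa.effects.trim; it is kept for a uniform helper signature
    trimmed_audio, _ = librosa.effects.trim(audio_data.astype(float), top_db=top_db)
    return trimmed_audio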

def extract_mfcc(audio_data, frame_rate, n_mfcc=13):
    """Extract MFCC features from audio data."""
    mfcc_features = librosa.feature.mfcc(y=audio_data.astype(float), sr=frame_rate, n_mfcc=n_mfcc)
    mfcc_features = np.mean(mfcc_features.T, axis=0)  # Average across time for simplicity
    return mfcc_features

def prepare_data(folder_path):
    """Extract MFCC features and prepare data for model training with silence removed."""
    features = []
    labels = []

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".wav"):
            try:
                file_path = os.path.join(folder_path, file_name)
                audio_data, frame_rate = load_audio(file_path)

                # Remove silence
                audio_data = remove_silence(audio_data, frame_rate)

                # Extract MFCC features from the trimmed audio data
                mfcc_features = extract_mfcc(audio_data, frame_rate)
                features.append(mfcc_features)

                # Extract label, assuming format "keystroke_a.wav"
                label = file_name.split('_')[1].split('.')[0]
                labels.append(label)
            except wave.Error as e:
                # Report the underlying reason so unreadable files can be diagnosed
                print(f"Error loading {file_path}: {e}")

    return np.array(features), np.array(labels)

# Load and prepare the training data
features, labels = prepare_data("recordings/keystrokes")
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Train a k-NN classifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Evaluate the model
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

def detect_keystrokes(file_path, classifier, segment_duration=0.5):
    """Detect keystrokes in the audio file after removing silence."""
    detected_keys = []

    try:
        audio_data, frame_rate = load_audio(file_path)
        audio_data = remove_silence(audio_data, frame_rate)
        segment_samples = int(segment_duration * frame_rate)
    except wave.Error:
        print(f"Error loading {file_path}")
        return []

    for i in range(0, len(audio_data), segment_samples):
        segment = audio_data[i:i + segment_samples]
        if len(segment) < segment_samples:
            break  # Ignore incomplete segments at the end

        # Reuse the training-time feature extractor so both paths stay consistent
        mfcc_features = extract_mfcc(segment, frame_rate).reshape(1, -1)  # Shape for model
        predicted_key = classifier.predict(mfcc_features)[0]
        detected_keys.append(predicted_key)

    return detected_keys

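# Run detection on a recording.
# NOTE: this path is a single labelled training keystroke, so it only serves as a sanity check;
# point it at the password recording to recover the password.
detected_keys = detect_keystrokes("recordings/keystrokes/keystroke_9.wav", classifier)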
answer = ''.join(detected_keys)
print("Detected Keys:", answer)

Error loading recordings/keystrokes/._keystroke_'.wav
Error loading recordings/keystrokes/._keystroke_l.wav
Error loading recordings/keystrokes/._keystroke_j.wav
Error loading recordings/keystrokes/._keystroke_8.wav
Error loading recordings/keystrokes/._keystroke_6.wav
Error loading recordings/keystrokes/._keystroke_y.wav
Error loading recordings/keystrokes/._keystroke_s.wav
Error loading recordings/keystrokes/._keystroke_Enter.wav
Error loading recordings/keystrokes/._keystroke_k.wav
Error loading recordings/keystrokes/._keystroke_e.wav
Error loading recordings/keystrokes/._keystroke_m.wav
Error loading recordings/keystrokes/._keystroke_i.wav
Error loading recordings/keystrokes/._keystroke_g.wav
Error loading recordings/keystrokes/._keystroke_c.wav
Error loading recordings/keystrokes/._keystroke_5.wav
Error loading recordings/keystrokes/._keystroke_h.wav
Error loading recordings/keystrokes/._keystroke_u.wav
Error loading recordings/keystrokes/._keystroke_v.wav
Error loading recordings
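
`detect_keystrokes` cuts the recording into fixed 0.5-second windows, which only works if the keystrokes are evenly spaced. An alternative is to locate keystrokes by their short-time energy. The following is a rough sketch under that assumption, not the method used above: `find_keystroke_segments` and its parameters (`frame_ms`, `threshold_ratio`, `min_gap_ms`) are hypothetical and would need tuning on the real recordings.

```python
import numpy as np


def find_keystroke_segments(audio_data, frame_rate, frame_ms=10,
                            threshold_ratio=0.1, min_gap_ms=50):
    """Return (start, end) sample indices of regions with high short-time energy."""
    samples = np.asarray(audio_data, dtype=np.float64)
    frame_len = max(1, int(frame_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return []

    # RMS energy of each non-overlapping frame
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt(np.mean(frames ** 2, axis=1))
    if energy.max() == 0:
        return []
    active = energy > threshold_ratio * energy.max()

    # A segment ends once `min_gap` consecutive quiet frames have been seen
    min_gap = max(1, int(min_gap_ms / frame_ms))
    segments = []
    start = None
    silent_run = 0
    for i, is_active in enumerate(active):
        if is_active:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_gap:
                segments.append((start * frame_len, (i - silent_run + 1) * frame_len))
                start = None
                silent_run = 0
    if start is not None:
        # Close a segment that is still open at the end of the recording
        segments.append((start * frame_len, (n_frames - silent_run) * frame_len))
    return segments
```

Each `(start, end)` pair can then be sliced out of `audio_data` and passed through `extract_mfcc` and `classifier.predict`, one keystroke at a time.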

Training data

In [1]:
import librosa
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

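# Requires load_audio, plot_waveform, remove_silence, and wave from the earlier cells
def train(folder_path):
    """Load every recording in folder_path and trim its leading and trailing silence."""
    for file_name in os.listdir(folder_path):
        if file_name.endswith(".wav"):
            try:
                audio_data, frame_rate = load_audio(os.path.join(folder_path, file_name))
                audio_data = remove_silence(audio_data, frame_rate)
            except wave.Error as e:
                print(f"Error loading {file_name}: {e}")

# Plot one labelled recording as a sanity check
audio_data, frame_rate = load_audio("recordings/keystrokes/keystroke_9.wav")
plot_waveform(audio_data, frame_rate)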

NameError: name 'load_audio' is not defined