<a href="https://colab.research.google.com/github/WongCSheng/GovTech-Challenge-2024/blob/Sonotype/sonotype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

The TrAItor has attemped to break into the model and steal our sensitive information. But, you've intercepted an audio recording of the TrAItor typing their password. Can you decipher the password using only the keystroke sounds?

## Objective

Analyze the audio recordings to determine the exact password. Use audio analysis and pattern recognition to extract the password from the key press sounds. Submit the correct password to complete the challenge.

**NOTE** Rate Limit requirements.

Please limit endpoint requests to 1 request per second per user. Any excessive requests may result in disqualification from the competition.

## Setup

In order to interact with the challenge, you will need your API Key.

You can find this key in the Crucible platform in this challenge.

[https://crucible.dreadnode.io/challenges/sonotype](https://crucible.dreadnode.io/challenges/sonotype)

In [36]:
CRUCIBLE_API_KEY = "AGtHuUxX9MNAXTq3SvCPKKfCWNZRxatm"  # Replace with your actual API key

### Install Required Packages

Install the required packages.


In [45]:
%pip install requests
%pip install numpy
%pip install matplotlib
%pip install librosa
%pip install scikit-learn
%pip install joblib



### Set Notebook Variables

Set the variables used throughout the notebook


In [38]:
CHALLENGE = "sonotype"
CRUCIBLE_URL = "https://crucible.dreadnode.io"
CHALLENGE_URL = "https://sonotype.crucible.dreadnode.io"
ARTIFACT_FILES = ['recordings.tar.gz']

### Download Data

Before diving into data analysis, you'll need to download the necessary dataset for this challenge.

In [62]:
import requests


for artifact in ARTIFACT_FILES:
    url = f"{CRUCIBLE_URL}/api/artifacts/{CHALLENGE}/{artifact}"
    headers = {"X-API-Key": CRUCIBLE_API_KEY}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        with open(artifact, "wb") as file:
            file.write(response.content)
        print(f"{artifact} was successfully downloaded")
    else:
        print(f"Failed to download {artifact}")

recordings.tar.gz was successfully downloaded


Extract

In [63]:
import tarfile

# Extract out the recordings
file = tarfile.open('recordings.tar.gz')
file.extractall('./')
file.close()

## Submit Your Findings

Submit findings placeholder

In [91]:
import requests


def query(input_data):
    response = requests.post(
        f"{CHALLENGE_URL}/score",
        headers={"X-API-Key": CRUCIBLE_API_KEY},
        json={"data": input_data},
    )
    return response.json()


response = query("<U9")
print(response)

{'error': 'Incorrect. Try again.'}


## Submit Your Flag

When you find the answer, you will get a flag back. It will look something like `gAAAAA ...`.
You can submit it using the code below, or, on the challenge page.

In [None]:
import requests


def submit_flag(flag):
    url = f"{CRUCIBLE_URL}/api/challenges/{CHALLENGE}/submit-flag"
    headers = {"X-API-Key": CRUCIBLE_API_KEY}
    payload = {"challenge": CHALLENGE, "flag": flag}
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        if response.json().get("correct") is True:
            print("The flag was correct. Congrats!")
        else:
            print("The flag was incorrect. Keep trying!")
    else:
        print("There was an error submitting your flag")
        print(response.text)


FLAG = "gAAAAA..."  # Replace with the flag once you find it
submit_flag(FLAG)

## Supplemental Materials

Use these materials for help solving the challenge.

### Helper Functions

In [64]:
import matplotlib.pyplot as plt
import numpy as np
import wave


def load_audio(file_path):
    """Load an audio file and return the audio data and frame rate."""
    with wave.open(file_path, "rb") as wf:
        frame_rate = wf.getframerate()
        n_frames = wf.getnframes()
        audio_data = wf.readframes(n_frames)
        audio_data = np.frombuffer(audio_data, dtype=np.int16)
    return audio_data, frame_rate


def plot_waveform(audio_data, frame_rate):
    """Plot the waveform of the audio data."""
    time_axis = np.linspace(0, len(audio_data) / frame_rate, num=len(audio_data))
    plt.figure(figsize=(15, 5))
    plt.plot(time_axis, audio_data)
    plt.title("Audio Waveform")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()


# Example usage
#audio_file = ARTIFACT_FILES[0]
#audio_data, frame_rate = load_audio(audio_file)
#plot_waveform(audio_data, frame_rate)

# Analyse

Training data

In [119]:
import librosa
import os
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib

def remove_silence(signal):
  intervals = librosa.effects.split(signal)
  return signal[intervals[0][0] : intervals[-1][1]]

def preprocess_audio(folder_path):
  keystroke_data = []
  labels = []

  for filename in os.listdir(folder_path):
    if filename.endswith(".wav") and os.path.isfile(folder_path):
      try:
        file_path = os.path.join(folder_path, filename)
        audio_data, frame_rate = load_audio(file_path)

        audio_data = remove_silence(audio_data)

        keystroke_data.append(audio_data)
        labels.append(filename.split("_")[0])

        print(f"File location: {file_path}")
      except wave.Error as e:
        print(f"Error processing file {filename}")
  return keystroke_data, labels, frame_rate

def extract_features(audio_data, frame_rate):
  mfccs = librosa.feature.mfcc(y = audio_data.astype(float), sr = frame_rate, n_mfcc = 13)
  return np.mean(mfccs.T, axis = 0)

def training(folder_path):
  keystroke_data, labels, frame_rate = preprocess_audio(folder_path)
  features = [extract_features(audio_data, frame_rate) for audio_data in keystroke_data]

  X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.2, random_state = 42)

  model = RandomForestClassifier()
  model.fit(X_train, y_train)

  y_pred = model.predict(X_test)
  print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

  return model

def predict_keystrokes(audio_file, model):
  audio_data, frame_rate = load_audio(audio_file)

  segments = librosa.effects.split(audio_data, top_db = 30)
  predictions = []

  for start, end in segments:
    segment = audio_data[start:end]
    features = extract_features(segment, frame_rate)
    prediction = model.predict([features])
    predictions.append(prediction)[0]

  return "".join(predictions)

In [120]:
#audio_data, frame_rate = load_audio('recordings/keystrokes/keystroke_0.wav')
#plot_waveform(audio_data, frame_rate)

model = training('recordings/keystrokes/')

predicted_text = predict_keystrokes('recordings/passwords/', model)
print("Predicted Text:", predicted_text)

UnboundLocalError: local variable 'frame_rate' referenced before assignment