# Assignment 3: Real-time Acoustic Activity Sensing (20 points)

**Overview**

In this assignment, you will build a real-time acoustic activity recognition system that continuously listens through your microphone and classifies activities on the fly. Your system will run two models, the classical ML model from A1 and the deep learning model from A2, so you can compare their predictions and inference speed in real time.

**Learning Objectives**

1. Build a real-time acoustic activity recognition system

2. Compare predictions, confidence, and latency between your ML and DL models




## Instructions:

1. Capture audio continuously from your computer’s microphone (e.g., using a library like PyAudio or sounddevice).
2. Process the audio in sliding windows. For example, use a window length of about 1 second with a hop of ~0.5 seconds between consecutive windows (i.e., 50% overlap). A 0.5-second hop yields about two predictions per second; to reach the target of at least ~3 updates per second, use a hop of about 0.33 seconds or less.
3. Apply two models to each window of audio:
  1. Your best ML classifier from A1
  2. Your best deep learning model from A2, or another model from the literature (examples below)
    1. [Wav2Vec 2.0](https://huggingface.co/facebook/wav2vec2-base)
    2. [AST (Audio Spectrogram Transformer)](https://github.com/YuanGongND/ast)
    3. [AudioMAE](https://github.com/facebookresearch/AudioMAE)
4. Display the results in real-time, including:
  1. A visualization of the audio waveform for the current window (updating as new audio comes in).
  2. The predicted activity label from each model for that window, along with a confidence score or probability for each prediction.
  3. The inference time (latency in milliseconds) it took for each model to produce the prediction for that window.
5. Record a 2–3 minute demonstration of the system running both models from step 3, covering the five main activities from A1/A2.
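The per-window label, confidence, and latency from step 4 could be gathered with a small wrapper like the one below. This is a minimal sketch, not a required design: `Prediction`, `timed_predict`, `format_line`, and the `predict_proba` callable are hypothetical names, and the activity labels are placeholders. It assumes each model can be wrapped in a function that takes one window and returns one probability per label.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Prediction:
    model: str
    label: str
    confidence: float  # probability of the top class, in [0, 1]
    latency_ms: float  # wall-clock inference time for one window


def timed_predict(model_name: str,
                  predict_proba: Callable[[Sequence[float]], Sequence[float]],
                  window: Sequence[float],
                  labels: Sequence[str]) -> Prediction:
    """Run one model on one window and record how long inference took."""
    start = time.perf_counter()
    probs = predict_proba(window)  # assumption: one probability per label
    latency_ms = (time.perf_counter() - start) * 1000.0
    best = max(range(len(labels)), key=lambda i: probs[i])
    return Prediction(model_name, labels[best], float(probs[best]), latency_ms)


def format_line(p: Prediction) -> str:
    """One console/GUI line per model: label, confidence, latency."""
    return (f"{p.model:<4} {p.label:<12} "
            f"conf={p.confidence:.2f}  latency={p.latency_ms:.1f} ms")


if __name__ == "__main__":
    labels = ["activity_a", "activity_b", "activity_c"]  # placeholder labels

    def fake_model(window):
        # Stand-in for a real classifier; always returns the same probabilities.
        return [0.1, 0.7, 0.2]

    print(format_line(timed_predict("ML", fake_model, [0.0] * 16000, labels)))
```

Timing only the model call (and not audio capture or plotting) keeps the reported latency comparable between the two models; feature extraction can be included inside the wrapped callable if it is part of that model's pipeline.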

Make sure to handle real-time audio carefully. Using a buffer to collect audio samples and process overlapping windows is one way to implement sliding windows. Aim to update the predictions at least ~3 times per second so the system feels responsive. You can start by printing outputs to the console, but ideally build a simple GUI to display the waveform and predictions clearly (this could be as simple as a Matplotlib plot for the waveform and text labels for predictions, or a small custom interface).
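The buffering idea above can be sketched as a small class that accumulates incoming chunks and emits overlapping windows. This is one possible approach under stated assumptions: the class name `SlidingWindowBuffer` is hypothetical, audio is mono, and the 16 kHz sample rate and 0.25-second hop (about 4 updates per second) are illustrative defaults, not requirements.

```python
import numpy as np


class SlidingWindowBuffer:
    """Collects incoming audio chunks and returns overlapping fixed-length windows."""

    def __init__(self, sample_rate: int = 16000,
                 window_s: float = 1.0, hop_s: float = 0.25):
        self.window = int(round(sample_rate * window_s))  # samples per window
        self.hop = int(round(sample_rate * hop_s))        # samples between window starts
        self._buf = np.zeros(0, dtype=np.float32)

    def push(self, chunk) -> list:
        """Append mono samples; return every window that is now complete."""
        samples = np.asarray(chunk, dtype=np.float32).ravel()
        self._buf = np.concatenate([self._buf, samples])
        windows = []
        while len(self._buf) >= self.window:
            windows.append(self._buf[:self.window].copy())
            # Drop only `hop` samples so the next window overlaps this one.
            self._buf = self._buf[self.hop:]
        return windows


if __name__ == "__main__":
    buf = SlidingWindowBuffer(sample_rate=16000, window_s=1.0, hop_s=0.25)
    ready = buf.push(np.zeros(20000, dtype=np.float32))
    print(len(ready), "windows ready")
```

In a live setup, a `sounddevice.InputStream` callback could place `indata[:, 0].copy()` on a `queue.Queue`, with the main loop pulling chunks off the queue and calling `push`. Keeping inference out of the audio callback helps avoid dropped samples.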

The Ubicoustics GitHub repo contains some example code for your reference.

## Scoring:

GUI Visualization of signal **(5 points)**

Real-time nature of the end-to-end pipeline **(5 points)** [Aim for at least 3 FPS]

Prediction from ML classifier from A1 **(3 points)**

Prediction from DL classifier from A2 **(7 points)**



# TODO: Real time inferencing (you would ideally want to migrate this notebook to a .py local file to use your microphone seamlessly)


**Discussion:** In your report, address the following items:

*   Describe the end-to-end pipeline: audio capture → processing → inference → display.
*   How did you handle buffering and overlapping windows?
*   What were the typical inference times for each model?
*   Was one model noticeably faster or more stable?
*   How did you visualize predictions and confidence?
*   How well did the system respond to different environments (quiet, noisy, echo, etc.)?
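For the inference-time questions above, one way to report "typical" latency is the median together with a high percentile, since occasional slow windows can distort a plain average. The helper below is a hedged sketch: `summarize_latencies` is a hypothetical name, and the nearest-rank 95th percentile is just one reasonable choice of statistic.

```python
import statistics


def summarize_latencies(latencies_ms):
    """Summarize per-window inference times (in milliseconds) for one model."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), using integer math.
    rank = (95 * n + 99) // 100
    return {
        "n": n,
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Synthetic values for illustration only, not measured results.
    print(summarize_latencies([4.1, 3.9, 4.3, 12.0, 4.0]))
```

Collecting one list of latencies per model during a recording session makes it straightforward to compare the two models side by side in the report.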

# Submission

For this assignment, please prepare the following deliverables:

1.   Code Submission: Submit a zip file containing your complete code for A3.1. Include a requirements.txt file listing any Python dependencies needed to run your code (e.g., PyAudio, Transformers, Torch, etc.). Ensure that your code is well-organized and commented where appropriate.
2.   Demonstration Video: Provide one short video (approximately 2–3 minutes) demonstrating your real-time system in action while it performs inference with both classifiers (predictions shown on separate lines in the GUI). In the video, perform each of the five target activities multiple times to showcase how the system responds. The output (a GUI with the visualization and the two predictions below it) should be clearly visible, displaying the predicted labels, confidence scores, and inference times as you perform the activities. Show one prediction line (predicted activity, confidence, and model latency) for the best ML classifier from A1 and one for the best DL classifier from A2.

The video should clearly show you performing each of the 5 activities and the system’s live output (waveform/spectrogram display with predictions, confidence, and latency). Make sure the text in your interface is readable in the video. Aim to make the demonstration convincing that your system works correctly for each activity in real time.