# ECS7020P mini-project submission


## What is the problem?

This year's mini-project considers the problem of predicting whether a narrated story is true or not. Specifically, you will build a machine learning model that takes a **30-second** audio recording as input and predicts whether the story being narrated is **true or not**.


## Which dataset will I use?

A total of 100 samples consisting of a complete audio recording, a *Language* attribute and a *Story Type* attribute have been made available for you to build your machine learning model. The audio recordings can be downloaded from:

https://github.com/MLEndDatasets/Deception/tree/main/MLEndDD_stories_small

A CSV file recording the *Language* attribute and *Story Type* of each audio file can be downloaded from:

https://github.com/MLEndDatasets/Deception/blob/main/MLEndDD_story_attributes_small.csv




## What will I submit?

Your submission will consist of **one single Jupyter notebook** that should include:

*   **Text cells**, describing in your own words, rigorously and concisely, your approach, each implemented step and the results that you obtain.
*   **Code cells**, implementing each step.
*   **Output cells**, i.e. the output from each code cell.

Your notebook **should have the structure** outlined below. Please make sure that you **run all the cells** and that the **output cells are saved** before submission. 

Please save your notebook as:

* ECS7020P_miniproject_2425.ipynb


## How will my submission be evaluated?

This submission is worth 16 marks. We will value:

*   Conciseness in your writing.
*   Correctness in your methodology.
*   Correctness in your analysis and conclusions.
*   Completeness.
*   Originality and efforts to try something new.

**The final performance of your solutions will not influence your grade**. We will grade your understanding. If you have a good understanding, you will be using the right methodology, selecting the right approaches, assessing correctly the quality of your solutions, sometimes acknowledging that despite your attempts your solutions are not good enough, and critically reflecting on your work to suggest what you could have done differently.

Note that **the problem that we are intending to solve is very difficult**. Do not despair if you do not get good results: **difficulty is precisely what makes it interesting** and **worth trying**.

## Show the world what you can do 

Why don't you use **GitHub** to manage your project? GitHub can be used as a presentation card that showcases what you have done and gives evidence of your data science skills, knowledge and experience. **Potential employers are always looking for this kind of evidence**. 





-------------------------------------- PLEASE USE THE STRUCTURE BELOW THIS LINE --------------------------------------------

# [Your title goes here]

# 1. Author

**Student Name**:  Alen Abdrakhmanov

**Student ID**:  



# 2. Problem formulation

### **Problem Statement**

Deception plays a crucial role in various domains such as criminal investigations and fraud prevention. Deception detection traditionally relies on invasive methods like polygraphs or behavioral analysis, which require direct interaction with the subject. This project adopts a **non-invasive approach** by analyzing audio recordings and leveraging machine learning to classify the recorded stories as truthful or deceptive using **acoustic features**, which will be carefully formulated and presented later.

---

### **Research Question**

Can a machine learning model, trained on extracted audio features, accurately distinguish between truthful and deceptive speech in audio recordings, in various natural languages?

---

### **Objective**

To design and implement a machine learning-based system that can classify audio recordings into **truthful** or **deceptive** categories using **non-invasive speech & signal processing techniques**.

---

### **Scope**

#### **Data Scope**
- **100 speech audio recordings** in multiple languages (e.g., Hindi, English, Bengali).
- Labels indicating whether the speech is truthful or deceptive.

#### **Feature Scope**
- Extracted features such as:
  - **Mel-Frequency Cepstral Coefficients (MFCCs)**
  - **Pitch**
  - **Jitter**
  - **Intensity**
- Focus on **acoustic and prosodic characteristics** to detect psycho-emotional cues related to deception.

#### **Modeling Scope**
- Use a machine learning classifier (**Support Vector Machine**) for binary classification.
- Explore performance across various feature combinations and preprocessing methods.

---

### **Constraints**

#### **Data Availability**
- Limited labeled datasets for truthful and deceptive speech (**only 100 data points**).
- Variability in audio quality and speaker diversity.

#### **Model Performance**
- Achieve a balance between **accuracy** and **generalization** across different speakers and languages.

#### **Real-Time Applicability**
- Potential need for efficient processing pipelines to classify speech in real-time.


---


# 3. Methodology

To mirror the machine learning pipeline, the report will be split into **training** and **validation** sections:


## **Training Task**

The training task focuses on building a **binary classification model** to distinguish between truthful and deceptive audio recordings. It involves the following steps:

---

### **Step 1: Data Preparation**
1. Load the dataset of audio recordings with associated labels (`true_story` or `deceptive_story`).
2. Perform an **initial split** of the dataset into:
   - **Training set**: 80% of the data.
   - **Test set**: 20% of the data (reserved for final evaluation after training and validation are complete).
3. Stratify the split so that both sets preserve the proportions of:
   - **Languages** (e.g., Hindi, English, Bengali).
   - **Story Types** (`true_story`, `deceptive_story`).
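
The stratified 80/20 split described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's final implementation: the helper `stratified_split` is hypothetical, and the toy metadata (two languages, evenly balanced labels) is invented to stand in for the real attributes CSV.

```python
import random
from collections import defaultdict

def stratified_split(rows, keys, test_fraction=0.2, seed=0):
    """Split a list of dict rows into train/test, sampling each stratum separately."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[k] for k in keys)].append(row)

    train, test = [], []
    for _, members in sorted(strata.items()):
        rng.shuffle(members)
        # Round to the nearest whole sample; very small strata may contribute no test rows.
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy metadata standing in for the attributes CSV (values are illustrative only).
rows = [
    {
        "filename": f"{i:05d}.wav",
        "Language": "English" if i % 2 else "Hindi",
        "Story_type": "true_story" if i % 4 < 2 else "deceptive_story",
    }
    for i in range(1, 101)
]

train_rows, test_rows = stratified_split(rows, keys=("Language", "Story_type"))
print(len(train_rows), len(test_rows))  # 80 20
```

With only 100 recordings, some (language, story type) combinations in the real data may contain a single sample, in which case that stratum cannot appear in both sets; this is worth checking before relying on the split.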

---

### **Step 2: Data Preprocessing**
1. **Noise Reduction:**
   - Apply Short-Time Fourier Transform (STFT) thresholding to remove background noise.
   
2. **Segmentation and Windowing:**
   - Divide each audio recording into short frames (e.g., 25 ms with 10 ms overlap).
   
3. **Feature Extraction:**
   - Compute relevant features:
     - **MFCCs**: Capture spectral properties of the speech.
     - **Prosodic Features**: Include pitch, jitter, shimmer, and intensity.
     
4. **Normalization:**
   - Scale feature values to ensure consistency across all samples.
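
The preprocessing steps above can be sketched in NumPy as follows. Everything here is an assumption made for illustration: the synthetic signal, the 16 kHz sample rate, the median-based noise floor with `factor=1.5`, and the two stand-in features (frame intensity in dB and spectral centroid). MFCCs, pitch, jitter and shimmer would normally come from a dedicated audio library and are not computed here.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, overlap_ms=10.0):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    hop = frame_len - int(round(sample_rate * overlap_ms / 1000))
    n_frames = 1 + (len(signal) - frame_len) // hop
    starts = hop * np.arange(n_frames)
    index = starts[:, None] + np.arange(frame_len)[None, :]
    return signal[index]

def spectral_threshold(frames, factor=1.5):
    """Zero STFT bins whose magnitude falls below factor * the per-bin median magnitude."""
    window = np.hanning(frames.shape[1])
    spec = np.fft.rfft(frames * window, axis=1)
    magnitude = np.abs(spec)
    noise_floor = np.median(magnitude, axis=0, keepdims=True)
    return spec * (magnitude >= factor * noise_floor)

def zscore(features, eps=1e-8):
    """Standardise each feature column to zero mean and (approximately) unit variance."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

# Synthetic one-second recording: low-level noise, plus a 220 Hz tone in the second half.
sample_rate = 16000
rng = np.random.default_rng(0)
t = np.arange(sample_rate) / sample_rate
signal = 0.01 * rng.normal(size=sample_rate) + 0.5 * np.sin(2 * np.pi * 220 * t) * (t > 0.5)

frames = frame_signal(signal, sample_rate)   # 25 ms frames, 10 ms overlap
clean_spec = spectral_threshold(frames)      # thresholded STFT
clean_mag = np.abs(clean_spec)

# Two simple per-frame features standing in for the full feature set.
rms = np.sqrt(np.mean(frames ** 2, axis=1))
intensity_db = 20 * np.log10(rms + 1e-12)
freqs = np.fft.rfftfreq(frames.shape[1], d=1 / sample_rate)
centroid = (clean_mag * freqs).sum(axis=1) / (clean_mag.sum(axis=1) + 1e-12)

features = np.column_stack([intensity_db, centroid])
normalized = zscore(features)
print(frames.shape, features.shape)
```

In a real pipeline the normalisation statistics should be estimated on the training set only and then reused for validation and test data, so that no information leaks across the split.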

---

### **Step 3: Model Selection**
1. Use **Support Vector Machines (SVM)** as the primary classifier for binary classification.
2. Choose kernels based on task requirements:
   - **Gaussian (RBF) Kernel**: Handles non-linear relationships.
   - **Polynomial Kernel**: Captures complex interactions in features.
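
To make the two kernel choices concrete, here is a small NumPy sketch of the functions they compute. The parameter names `gamma`, `degree` and `coef0` follow common SVM conventions, and the default values and toy points are chosen arbitrarily for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Gaussian kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def polynomial_kernel(X, Y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: k(x, y) = (gamma * <x, y> + coef0) ** degree."""
    return (gamma * X @ Y.T + coef0) ** degree

# Three toy feature vectors.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_rbf = rbf_kernel(X, X)
K_poly = polynomial_kernel(X, X)

print(np.round(K_rbf, 4))
print(K_poly)
```

The RBF kernel depends only on the distance between two samples, so every diagonal entry is 1, whereas the polynomial kernel grows with the inner product and is therefore more sensitive to feature scale.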

---

### **Step 4: Model Training**
1. Train the model on the **training set** (80% of the dataset).
2. Use initial default hyperparameters for the SVM (e.g., `C`, kernel parameters) to establish a baseline.

---

### **Step 5: Model Saving**
1. Save the trained model after initial training for further use in validation and hyperparameter optimization.
2. Prepare for validation to assess performance and fine-tune the model.
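
Steps 4 and 5 could look roughly like the sketch below, which fits a scikit-learn `SVC` with its default hyperparameters and round-trips it through `pickle`. The feature matrix is random data standing in for the real extracted features, and the 13 feature columns and the file name `svm_baseline.pkl` are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for 80 training feature vectors (13 features each).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 13))
y_train = np.array([0, 1] * 40)
X_train[y_train == 1] += 1.0  # shift one class so there is something to learn

# Baseline with scikit-learn defaults: kernel="rbf", C=1.0, gamma="scale".
baseline = SVC()
baseline.fit(X_train, y_train)

# Save the fitted model, then reload it to confirm the round trip.
model_path = Path(tempfile.gettempdir()) / "svm_baseline.pkl"
with open(model_path, "wb") as f:
    pickle.dump(baseline, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = np.array_equal(baseline.predict(X_train), restored.predict(X_train))
print(same_predictions)
```

A pickled model should be reloaded with the same scikit-learn version that produced it, since the on-disk format is not guaranteed to be stable across releases.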

---

## **Validation Task**

The validation task ensures the trained model generalises well to unseen data and optimizes its performance. It involves the following steps:

---

### **Step 1: Validation Data Preparation**
1. Use a **subset of the training set** (e.g., 10% of the total dataset) as the **validation set**.
2. Ensure the validation set is balanced with stratified sampling for `Language` and `Story_type`.

---

### **Step 2: Hyperparameter Tuning**
1. Optimize the following SVM hyperparameters:
   - **Regularization (`C`)**: Controls the trade-off between margin size and classification errors.
   - **Kernel-specific parameters** (e.g., `gamma` for RBF kernel or `degree` for polynomial kernel).
2. Use **Grid Search** or **Random Search**:
   - Evaluate combinations of hyperparameters on the validation set.
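
A grid search scored on a held-out validation set might be sketched as below. The candidate values for `C`, `gamma` and `degree` are illustrative guesses rather than tuned choices, and the data is synthetic; the 70/10 division mirrors the idea of reserving 10% of the full dataset for validation.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

# Synthetic stand-in for the 80 training samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 6))
y = np.array([0, 1] * 40)
X[y == 1] += 0.8

# Hold out 10 samples for validation; fit on the remaining 70.
X_fit, y_fit = X[:70], y[:70]
X_val, y_val = X[70:], y[70:]

# One sub-grid per kernel so only the relevant parameters are combined.
grid = ParameterGrid([
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3]},
])

results = []
for params in grid:
    model = SVC(**params).fit(X_fit, y_fit)
    results.append((model.score(X_val, y_val), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(len(results), best_score, best_params)
```

With only 10 validation samples each misclassification moves accuracy by 0.1, so several configurations will usually tie and the "best" one should be treated with caution.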

---

### **Step 3: Cross-Validation**
1. Apply **k-fold cross-validation** (e.g., 5-fold) within the training data to:
   - Detect overfitting or underfitting.
   - Assess model consistency across different data splits.
2. Evaluate the model using:
   - Accuracy.
   - F1-Score.
   - AUC-ROC.
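
A 5-fold stratified cross-validation reporting these three metrics can be sketched with scikit-learn's `cross_validate`. The data is again synthetic, and wrapping the scaler and classifier in one pipeline is a design assumption made so that scaling is refit inside each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training features and binary labels.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = np.array([0, 1] * 40)
X[y == 1] += 0.8

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1", "roc_auc"])

for name in ("accuracy", "f1", "roc_auc"):
    values = scores[f"test_{name}"]
    print(f"{name}: mean={values.mean():.3f}, std={values.std():.3f}")
```

The spread across folds is as informative as the mean here: a large standard deviation suggests that performance depends heavily on which recordings end up in each fold.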

---

### **Step 4: Evaluation**
1. Evaluate model performance based on a group of metrics:
   - **Accuracy**: Overall correctness of predictions.
   - **F1-Score**: Balances precision and recall, especially useful for imbalanced data.
   - **Confusion Matrix**: Examines true/false positives and negatives.
   - **AUC-ROC**: Analyzes the model’s ability to distinguish between classes.

2. Use validation metrics to guide improvements:
   - Adjust features if necessary (e.g., add delta MFCCs).
   - Fine-tune hyperparameters for better generalization.
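
For reference, the four metrics can be computed from first principles as in the sketch below. The labels and scores are made up, and treating `1` as the positive class (for instance `deceptive_story`) is an assumption about the label encoding.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall, written in terms of counts.
    return 2 * tp / (2 * tp + fp + fn)

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Made-up labels and classifier scores for eight recordings.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2])
y_pred = (scores >= 0.5).astype(int)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)              # 3 1 1 3
print(accuracy(tp, fp, fn, tn))    # 0.75
print(f1_score(tp, fp, fn))        # 0.75
print(roc_auc(y_true, scores))     # 0.9375
```

Accuracy and F1 depend on the 0.5 decision threshold, while AUC-ROC summarises the ranking quality across all thresholds, which is why the three can disagree.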


# 4 Implemented ML prediction pipelines

Describe the ML prediction pipelines that you will explore. Clearly identify their input and output, stages and format of the intermediate data structures moving from one stage to the next. It's up to you to decide which stages to include in your pipeline. After providing an overview, describe in more detail each one of the stages that you have included in their corresponding subsections (i.e. 4.1 Transformation stage, 4.2 Model stage, 4.3 Ensemble stage).



---

## **Visualization of the Pipeline**
1. **Input:**
   - Raw audio files + labels.
2. **Preprocessing:**
   - Noise reduction → Segmentation → Windowing → Normalization.
3. **Feature Extraction:**
   - MFCCs + Prosodic features.
4. **Feature Selection:**
   - PCA or other reduction techniques.
5. **Model Training:**
   - Train and tune classifiers (SVM).
6. **Evaluation:**
   - Test with metrics like accuracy and F1-score.
7. **Prediction:**
   - Deploy for real-time classification.
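
The normalisation, feature reduction and classification stages of this overview can be chained in a single scikit-learn `Pipeline`, as in the sketch below. It assumes the features have already been extracted into a fixed-length vector per recording; the 20 input features, the 10 PCA components and the step names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 100 recordings, 20 extracted features each.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
y = np.array([0, 1] * 50)
X[y == 1, :5] += 1.0

X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

pipeline = Pipeline([
    ("scale", StandardScaler()),       # normalisation
    ("reduce", PCA(n_components=10)),  # feature reduction
    ("classify", SVC(kernel="rbf")),   # binary classifier
])
pipeline.fit(X_train, y_train)

# Intermediate representation after scaling and PCA: one 10-dimensional row per recording.
reduced = pipeline[:-1].transform(X_test)
predictions = pipeline.predict(X_test)
print(reduced.shape, predictions.shape)
```

Keeping every stage inside one object means the scaler and PCA are fitted on training data only and then applied unchanged at prediction time.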

The cell below downloads the 100 .wav files from the GitHub repository and saves them under their original names.

In [1]:
import requests
from bs4 import BeautifulSoup

# GitHub folder URL
url = "https://github.com/MLEndDatasets/Deception/tree/main/MLEndDD_stories_small"

# Base URL for raw file downloads
raw_base = "https://raw.githubusercontent.com/MLEndDatasets/Deception/main/MLEndDD_stories_small/"

# Get the list of .wav files from the folder page
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# The page links to each file more than once, so de-duplicate the names
wav_files = sorted({a['href'].split('/')[-1] for a in soup.find_all('a', href=True) if a['href'].endswith('.wav')})

# Download each .wav file once and save it under its original name
for file in wav_files:
    file_url = raw_base + file
    print(f"Downloading {file}...")
    audio = requests.get(file_url)
    audio.raise_for_status()
    with open(file, 'wb') as f:
        f.write(audio.content)


Downloading 00001.wav...
Downloading 00002.wav...
Downloading 00003.wav...
Downloading 00004.wav...
Downloading 00005.wav...
Downloading 00006.wav...
Downloading 00007.wav...
Downloading 00008.wav...
Downloading 00009.wav...
Downloading 00010.wav...
Downloading 00011.wav...
Downloading 00012.wav...
Downloading 00013.wav...
Downloading 00014.wav...
Downloading 00015.wav...
Downloading 00016.wav...
Downloading 00017.wav...
Downloading 00018.wav...
Downloading 00019.wav...
Downloading 00020.wav...


Loading the attributes CSV of the QMUL MLEnd Deception Dataset into the DataFrame `df`.

In [6]:
import pandas as pd

df = pd.read_csv('MLEndDD_story_attributes_small.csv')
df.head()

|   | filename | Language | Story_type |
|---|---|---|---|
| 0 | 00001.wav | Hindi | deceptive_story |
| 1 | 00002.wav | English | true_story |
| 2 | 00003.wav | English | deceptive_story |
| 3 | 00004.wav | Bengali | deceptive_story |
| 4 | 00005.wav | English | deceptive_story |


## 4.1 Transformation stage

Describe any transformations, such as feature extraction. Identify input and output. Explain why you have chosen this transformation stage.

## 4.2 Model stage

Describe the ML model(s) that you will build. Explain why you have chosen them.

## 4.3 Ensemble stage

Describe any ensemble approach you might have included. Explain why you have chosen them.

# 5 Dataset

Describe the datasets that you will create to build and evaluate your models. Your datasets need to be based on our MLEnd Deception Dataset. After describing the datasets, build them here. You can explore and visualise the datasets here as well. 

If you are building separate training and validation datasets, do it here. Explain clearly how you are building such datasets, how you are ensuring that they serve their purpose (i.e. they are independent and consist of IID samples) and any limitations you might think of. It is always important to identify any limitations as early as possible. The scope and validity of your conclusions will depend on your ability to understand the limitations of your approach.

If you are exploring different datasets, create different subsections for each dataset and give them a name (e.g. 5.1 Dataset A, 5.2 Dataset B, 5.3 Dataset C).



# 6 Experiments and results

Carry out your experiments here. Analyse and explain your results. Unexplained results are worthless.

# 7 Conclusions

Your conclusions, suggestions for improvements, etc. should go here.

# 8 References

Acknowledge others here (books, papers, repositories, libraries, tools) 