<a href="https://colab.research.google.com/github/fjadidi2001/AD_Prediction/blob/main/Speech_AD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Set Up Google Colab Environment

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Install required libraries
!pip install opensmile pyAudioAnalysis

# Import libraries
import pandas as pd
import numpy as np
import librosa
import opensmile
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
import os

Mounted at /content/drive
Collecting opensmile
  Downloading opensmile-2.5.1-py3-none-manylinux_2_17_x86_64.whl.metadata (15 kB)
Collecting pyAudioAnalysis
  Downloading pyAudioAnalysis-0.3.14.tar.gz (41.3 MB)
  Preparing metadata (setup.py) ... done
Collecting audobject>=0.6.1 (from opensmile)
  Downloading audobject-0.7.11-py3-none-any.whl.metadata (2.6 kB)
Collecting audinterface>=0.7.0 (from opensmile)
  Downloading audinterface-1.2.3-py3-none-any.whl.metadata (4.2 kB)
Collecting audeer>=2.1.1 (from audinterface>=0.7.0->opensmile)
  Downloading audeer-2.2.1-py3-none-any.whl.metadata (4.1 kB)
Collecting audformat<2.0.0,>=1.0.1 (from audinterface>=0.7.0->opensmile)
  Downloading audformat-1.3.1-py3-none-any.whl.metadata (4.6 kB)
Collecting audiofile>=1.3.0 (from audinterface>=0.7.0->opensmile)
  Downloading audiofile-1.5.1-py3-none-any.whl.metadat

- Mount Google Drive to access the .tgz files and CSV files.
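- Install opensmile for eGeMAPS acoustic feature extraction, along with pyAudioAnalysis. librosa (audio processing) and scikit-learn (machine learning models) are imported without being installed, so the cell relies on the versions already present in the Colab runtime.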
- Import libraries for data handling, feature extraction, and visualization.
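
Since the setup cell installs only some of the packages it imports, it can help to record which versions are actually present in the runtime. The following is a minimal sketch using only the standard library; the helper name `report_versions` is hypothetical, and the package list is an assumption taken from the imports above.

```python
from importlib import metadata

def report_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this runtime
    return versions

# Distribution (pip) names, which can differ from import names (e.g. scikit-learn vs sklearn)
packages = ["opensmile", "pyAudioAnalysis", "librosa", "scikit-learn", "pandas", "numpy"]
for name, version in report_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```

Printing the versions once at the top of the notebook makes later runs easier to compare if the Colab image changes.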

# Step 2: Load and Organize Datasets

In [5]:
import pandas as pd
import os

# Define paths to datasets in Google Drive
data_path = '/content/drive/MyDrive/Voice/'
diagnosis_train = data_path + 'ADReSSo21-diagnosis-train.tgz'
progression_train = data_path + 'ADReSSo21-progression-train.tgz'
progression_test = data_path + 'ADReSSo21-progression-test.tgz'

# Create directories for extraction
os.makedirs('/content/diagnosis_train', exist_ok=True)
os.makedirs('/content/progression_train', exist_ok=True)
os.makedirs('/content/progression_test', exist_ok=True)

# Unzip datasets
!tar -xvzf "{diagnosis_train}" -C "/content/diagnosis_train"
!tar -xvzf "{progression_train}" -C "/content/progression_train"
!tar -xvzf "{progression_test}" -C "/content/progression_test"

# Verify extracted files
print("Diagnosis Train Files:", os.listdir('/content/diagnosis_train'))
print("Progression Train Files:", os.listdir('/content/progression_train'))
print("Progression Test Files:", os.listdir('/content/progression_test'))

# Load CSV files
task1 = pd.read_csv(data_path + 'task1.csv')  # AD vs Control labels
task2 = pd.read_csv(data_path + 'task2.csv')  # MMSE scores
task3 = pd.read_csv(data_path + 'task3.csv')  # Cognitive decline labels

# Display dataset info
print("\nTask 1 (AD Classification):")
print(task1.head())
print("\nTask 2 (MMSE Regression):")
print(task2.head())
print("\nTask 3 (Cognitive Decline):")
print(task3.head())

ADReSSo21/diagnosis/
ADReSSo21/diagnosis/README.md
ADReSSo21/diagnosis/train/
ADReSSo21/diagnosis/train/segmentation/
ADReSSo21/diagnosis/train/segmentation/cn/
ADReSSo21/diagnosis/train/segmentation/cn/adrso281.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso308.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso270.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso022.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso298.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso300.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso265.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso186.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso148.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso152.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso182.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso268.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso259.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso276.csv
ADReSSo21/diagnosis/train/segmentation/cn/adrso261.csv
ADReSSo21/diag
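
The `!tar -xvzf` calls above work in Colab but print every archive member. As an alternative, here is a minimal sketch with the standard-library `tarfile` module that extracts quietly and returns the member count. The helper name `extract_tgz` is hypothetical; the paths in the usage comment are the ones defined in the cell above.

```python
import os
import tarfile

def extract_tgz(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir and return the member count."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        tar.extractall(dest_dir)
    return len(members)

# Usage with the notebook's variables (requires the Drive mount, so left commented out):
# n = extract_tgz(diagnosis_train, '/content/diagnosis_train')
# print(f"Extracted {n} members")
```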

# Step 3: Acoustic Feature Extraction (eGeMAPS)

In [12]:
import opensmile
import librosa
import pandas as pd
import os
import numpy as np

# Initialize opensmile for eGeMAPS feature extraction
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals
)

# Function to extract eGeMAPS features from an audio file
def extract_egemaps(audio_path):
    try:
        y, sr = librosa.load(audio_path, sr=16000)  # Load audio
        features = smile.process_signal(y, sr)  # Extract eGeMAPS
        return features.values.flatten()
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None

# Paths to audio files (diagnosis train: cn and ad subdirectories)
diagnosis_audio_base = '/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/'
cn_audio_path = os.path.join(diagnosis_audio_base, 'cn/')
ad_audio_path = os.path.join(diagnosis_audio_base, 'ad/')

# Collect all .wav files from both cn/ and ad/ directories
audio_files = []
for path in [cn_audio_path, ad_audio_path]:
    if os.path.exists(path):
        audio_files.extend([os.path.join(path, f) for f in os.listdir(path) if f.endswith('.wav')])
    else:
        print(f"Directory not found: {path}")

# Extract features for all audio files
audio_features = []  # Initialize as empty list
audio_ids = []
for audio_path in audio_files:
    audio_file = os.path.basename(audio_path)
    audio_id = os.path.splitext(audio_file)[0]  # Extract ID from filename (e.g., adrso123)
    features = extract_egemaps(audio_path)
    if features is not None:
        audio_features.append(features)
        audio_ids.append(audio_id)
    else:
        print(f"Skipping {audio_id} due to feature extraction failure")

# Check if any features were extracted
if not audio_features:
    raise ValueError("No audio features extracted. Check audio files or extraction process.")

# Convert to DataFrame
audio_features_df = pd.DataFrame(audio_features)
audio_features_df['ID'] = audio_ids

# Load task1.csv for labels
data_path = '/content/drive/MyDrive/Voice/'
task1 = pd.read_csv(data_path + 'task1.csv')

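# Normalize IDs in task1.csv to the audio naming scheme: strip 'adrsdt', zero-pad to 3 digits, prefix 'adrso'.
# Assumption: an 'adrsdt' ID and the 'adrso' ID with the same number refer to the same recording.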
task1['ID'] = task1['ID'].apply(lambda x: 'adrso' + x.replace('adrsdt', '').zfill(3))

# Merge with task1 labels
task1_data = pd.merge(audio_features_df, task1, on='ID', how='inner')
print("Merged Acoustic Features with Labels:")
print(task1_data.head())
print(f"Number of matched records: {len(task1_data)}")

# Save the merged DataFrame for debugging
task1_data.to_csv('/content/drive/MyDrive/Voice/task1_acoustic_features.csv', index=False)
print("Saved acoustic features to /content/drive/MyDrive/Voice/task1_acoustic_features.csv")

Merged Acoustic Features with Labels:
           0         1          2          3          4          5  \
0  34.314342  0.172523  31.954039  34.558792  38.469227   6.515188   
1  34.439098  0.178912  29.944578  33.201965  39.123035   9.178457   
2  34.765678  0.144698  31.995552  35.011833  37.706375   5.710823   
3  30.145615  0.129570  28.376390  29.582561  32.609303   4.232912   
4  31.052141  0.345289  22.244028  25.008968  40.717941  18.473913   

            6           7           8           9  ...        80        81  \
0  332.870453  461.649567  120.299301   81.307747  ...  0.544044  1.976285   
1  169.268906  333.093689  160.969574  267.036377  ...  0.434866  2.533154   
2  231.058731  375.145050  145.108765  262.729279  ...  0.130108  1.557071   
3  292.438629  535.989441   76.417542   96.036621  ...  0.086000  1.878543   
4  546.450195  783.379028  370.438660  510.268463  ...  0.291039  1.648093   

         82        83        84        85        86         87        ID

## Low Number of Matches

In [14]:
print("Audio IDs:", audio_ids[:5])
print("Task1 IDs:", task1['ID'].head())

Audio IDs: ['adrso010', 'adrso014', 'adrso015', 'adrso005', 'adrso312']
Task1 IDs: 0    adrso015
1    adrso040
2    adrso026
3    adrso067
4    adrso058
Name: ID, dtype: object
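
To make the ID normalization concrete, here is a small worked sketch of the same expression used in the Step 3 cell. The raw values such as `adrsdt15` are assumed for illustration (the raw contents of `task1.csv` are not shown above), and `normalize_id` is a hypothetical name.

```python
def normalize_id(raw_id):
    """Strip the 'adrsdt' prefix, zero-pad the rest to 3 characters, and prefix 'adrso'."""
    return 'adrso' + raw_id.replace('adrsdt', '').zfill(3)

for raw in ['adrsdt15', 'adrsdt7', 'adrsdt123']:
    print(raw, '->', normalize_id(raw))
# adrsdt15 -> adrso015
# adrsdt7 -> adrso007
# adrsdt123 -> adrso123
```

The function only rewrites the prefix. It cannot tell whether `adrsdt15` and `adrso015` refer to the same recording, so a match after normalization is not by itself evidence that the label belongs to that audio file.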


In [15]:
test_features = extract_egemaps('/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/cn/adrso173.wav')
print("Test features shape:", test_features.shape if test_features is not None else "Failed")

Test features shape: (88,)


In [13]:
print("Number of audio files:", len(audio_files))
print("Sample audio files:", audio_files[:5])

Number of audio files: 166
Sample audio files: ['/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/cn/adrso010.wav', '/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/cn/adrso014.wav', '/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/cn/adrso015.wav', '/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/cn/adrso005.wav', '/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/cn/adrso312.wav']


In [16]:
import opensmile
import librosa
import pandas as pd
import os
import numpy as np

# Initialize opensmile for eGeMAPS feature extraction
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals
)

# Function to extract eGeMAPS features from an audio file
def extract_egemaps(audio_path):
    try:
        y, sr = librosa.load(audio_path, sr=16000)  # Load audio
        features = smile.process_signal(y, sr)  # Extract eGeMAPS
        return features.values.flatten()
    except Exception as e:
        print(f"Error processing {audio_path}: {e}")
        return None

# Paths to audio files (diagnosis train: cn and ad subdirectories)
diagnosis_audio_base = '/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/'
cn_audio_path = os.path.join(diagnosis_audio_base, 'cn/')
ad_audio_path = os.path.join(diagnosis_audio_base, 'ad/')

# Collect all .wav files from both cn/ and ad/ directories
audio_files = []
for path in [cn_audio_path, ad_audio_path]:
    if os.path.exists(path):
        files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith('.wav')]
        audio_files.extend(files)
        print(f"Found {len(files)} audio files in {path}")
    else:
        print(f"Directory not found: {path}")

print(f"Total audio files found: {len(audio_files)}")

# Extract features for all audio files
audio_features = []
audio_ids = []
skipped_files = []
for audio_path in audio_files:
    audio_file = os.path.basename(audio_path)
    audio_id = os.path.splitext(audio_file)[0]  # Extract ID (e.g., adrso123)
    features = extract_egemaps(audio_path)
    if features is not None:
        audio_features.append(features)
        audio_ids.append(audio_id)
    else:
        print(f"Skipping {audio_id} due to feature extraction failure")
        skipped_files.append(audio_id)

print(f"Extracted features for {len(audio_features)} audio files")
print(f"Skipped {len(skipped_files)} audio files: {skipped_files}")

# Check if any features were extracted
if not audio_features:
    raise ValueError("No audio features extracted. Check audio files or extraction process.")

# Convert to DataFrame
audio_features_df = pd.DataFrame(audio_features)
audio_features_df['ID'] = audio_ids

# Load task1.csv for labels
data_path = '/content/drive/MyDrive/Voice/'
task1 = pd.read_csv(data_path + 'task1.csv')

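# Normalize IDs in task1.csv to the audio naming scheme: strip 'adrsdt', zero-pad to 3 digits, prefix 'adrso'.
# Assumption: an 'adrsdt' ID and the 'adrso' ID with the same number refer to the same recording.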
task1['ID'] = task1['ID'].apply(lambda x: 'adrso' + x.replace('adrsdt', '').zfill(3))

# Check ID overlap
audio_id_set = set(audio_ids)
task1_id_set = set(task1['ID'])
print(f"Audio IDs in audio_files: {len(audio_id_set)}")
print(f"Task1 IDs: {len(task1_id_set)}")
print(f"Common IDs: {len(audio_id_set & task1_id_set)}")
print(f"Audio IDs not in task1: {audio_id_set - task1_id_set}")
print(f"Task1 IDs not in audio: {task1_id_set - audio_id_set}")

# Merge with task1 labels
task1_data = pd.merge(audio_features_df, task1, on='ID', how='inner')
print("Merged Acoustic Features with Labels:")
print(task1_data.head())
print(f"Number of matched records: {len(task1_data)}")

# Save the merged DataFrame
task1_data.to_csv('/content/drive/MyDrive/Voice/task1_acoustic_features.csv', index=False)
print("Saved acoustic features to /content/drive/MyDrive/Voice/task1_acoustic_features.csv")

# Save unmatched IDs for debugging
unmatched_audio_ids = list(audio_id_set - task1_id_set)
unmatched_task1_ids = list(task1_id_set - audio_id_set)
pd.DataFrame({'unmatched_audio_ids': unmatched_audio_ids}).to_csv(
    '/content/drive/MyDrive/Voice/unmatched_audio_ids.csv', index=False
)
pd.DataFrame({'unmatched_task1_ids': unmatched_task1_ids}).to_csv(
    '/content/drive/MyDrive/Voice/unmatched_task1_ids.csv', index=False
)
print("Saved unmatched IDs to /content/drive/MyDrive/Voice/unmatched_{audio,task1}_ids.csv")

Found 79 audio files in /content/diagnosis_train/ADReSSo21/diagnosis/train/audio/cn/
Found 87 audio files in /content/diagnosis_train/ADReSSo21/diagnosis/train/audio/ad/
Total audio files found: 166
Extracted features for 166 audio files
Skipped 0 audio files: []
Audio IDs in audio_files: 166
Task1 IDs: 71
Common IDs: 41
Audio IDs not in task1: {'adrso228', 'adrso211', 'adrso291', 'adrso189', 'adrso128', 'adrso177', 'adrso172', 'adrso167', 'adrso157', 'adrso188', 'adrso206', 'adrso156', 'adrso308', 'adrso154', 'adrso178', 'adrso216', 'adrso169', 'adrso202', 'adrso276', 'adrso274', 'adrso285', 'adrso200', 'adrso186', 'adrso197', 'adrso266', 'adrso265', 'adrso160', 'adrso090', 'adrso223', 'adrso161', 'adrso259', 'adrso198', 'adrso170', 'adrso141', 'adrso165', 'adrso122', 'adrso093', 'adrso299', 'adrso074', 'adrso257', 'adrso234', 'adrso283', 'adrso281', 'adrso077', 'adrso209', 'adrso247', 'adrso277', 'adrso309', 'adrso151', 'adrso153', 'adrso307', 'adrso280', 'adrso268', 'adrso289', 'adr
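
The overlap counts printed above can also be read off the merge itself. Here is a minimal sketch on toy data, using an outer join with the `indicator=True` option of `pandas.merge`. The frames, the `f0` and `label` columns, and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for audio_features_df and task1 (values are illustrative only)
features = pd.DataFrame({'ID': ['adrso001', 'adrso002', 'adrso003'], 'f0': [0.1, 0.2, 0.3]})
labels = pd.DataFrame({'ID': ['adrso002', 'adrso003', 'adrso004'], 'label': [0, 1, 1]})

# The '_merge' column records whether each row came from the left frame, the right frame, or both
merged = pd.merge(features, labels, on='ID', how='outer', indicator=True)
counts = merged['_merge'].value_counts()
print(counts)
```

Rows marked `left_only` are audio files without a label and rows marked `right_only` are labels without audio, which corresponds to the two set differences printed by the cell above.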

### Explanation of Changes
1. **Debugging Audio Files**:
   - Print the number of `.wav` files in `cn/` and `ad/` directories and the total count.
   - The run above found 166 files (79 in `cn/`, 87 in `ad/`). A lower count would indicate that some audio files are missing or were not extracted.

2. **Tracking Skipped Files**:
   - Maintain a `skipped_files` list to log audio IDs where feature extraction failed.
   - Print the number of skipped files and their IDs.

3. **ID Overlap Analysis**:
   - Compare `audio_ids` (from audio files) with `task1['ID']` (after normalization) using set operations.
   - Print:
     - Number of unique audio IDs.
     - Number of unique `task1` IDs.
     - Number of common IDs (at most the size of the smaller set; the run above reports 41 common IDs out of 71 `task1` IDs).
     - Audio IDs not in `task1.csv`.
     - `task1.csv` IDs not in audio files.
   - Save unmatched IDs to CSV files for inspection.

4. **Preserved Core Logic**:
   - Kept eGeMAPS extraction, ID normalization (`adrso` + zero-padded ID), and merging logic unchanged.
   - Each audio file yields 88 features, as confirmed by the single-file check above (shape `(88,)`).
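
Because the training audio is already split into `cn/` and `ad/` subdirectories, one alternative to matching IDs against a CSV is to read the class from the parent directory name. This is a sketch under the assumption that the directory name reflects the diagnosis label. The helper name `label_from_path` and the 0/1 encoding are hypothetical choices, not part of the original notebook.

```python
import os

LABELS = {'cn': 0, 'ad': 1}  # assumed encoding: control = 0, AD = 1

def label_from_path(audio_path):
    """Return (audio_id, label), taking the class from the parent directory name."""
    parent = os.path.basename(os.path.dirname(audio_path))
    audio_id = os.path.splitext(os.path.basename(audio_path))[0]
    return audio_id, LABELS[parent]

print(label_from_path('/content/diagnosis_train/ADReSSo21/diagnosis/train/audio/cn/adrso010.wav'))
# ('adrso010', 0)
```

Applied to every entry in `audio_files`, this would give one label per extracted feature vector without depending on ID overlap with `task1.csv`.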

---

