## Try 1: this went nowhere, but it's kept here for documentation purposes

This attempt prepares EEG data for input into a ResNet-style model: load the .edf files, preprocess the data, and shape it for the network. Here's an outline of the process:

1. **Load .edf files**: `pyEDFlib` is installed and imported, but the cells below read the files with `mne.io.read_raw_edf`.
2. **Preprocess the EEG data**: Pad or truncate to a fixed number of channels and resample to a fixed number of time points. Normalization or other EEG-specific preprocessing could be added here.
3. **Label encoding**: Map the labels (normal or abnormal) to numerical values.
4. **Reshape data for ResNet**: ResNet expects image-like input. In PyTorch that is `(batch_size, channels, height, width)`, so the EEG data is shaped as `(batch_size, 1, num_channels, signal_length)`.
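
As a rough sketch of steps 2 and 4, assuming the PyTorch layout used later in this notebook: a stack of recordings shaped `(batch_size, num_eeg_channels, num_time_points)` is normalized per channel and then given a singleton image-channel axis. The array sizes are small placeholder values chosen for illustration, and z-scoring is just one possible normalization, not something the cells below actually do.

```python
import numpy as np

# Placeholder sizes for illustration only (the notebook targets 40 channels x 75000 points)
batch_size, num_eeg_channels, num_time_points = 4, 8, 100

# Synthetic stand-in for a stack of preprocessed recordings
rng = np.random.default_rng(0)
eeg_batch = rng.standard_normal((batch_size, num_eeg_channels, num_time_points))

# Per-recording, per-channel z-score normalization (an assumed preprocessing choice)
mean = eeg_batch.mean(axis=-1, keepdims=True)
std = eeg_batch.std(axis=-1, keepdims=True)
normalized = (eeg_batch - mean) / (std + 1e-8)

# Add a singleton "image channel" axis: (batch, 1, eeg_channels, time)
cnn_input = normalized[:, np.newaxis, :, :].astype(np.float32)
print(cnn_input.shape)  # (4, 1, 8, 100)
```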

In [1]:
# Install the packages used in this notebook
%pip install pyedflib
%pip install mne
%pip install tensorflow
%pip install boto3
# Pinned CUDA 11.3 builds of PyTorch (this replaces the earlier unpinned torch installs)
%pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113

Collecting pyedflib
  Downloading pyEDFlib-0.1.38-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB)
Downloading pyEDFlib-0.1.38-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.7/2.7 MB 48.8 MB/s eta 0:00:00
Installing collected packages: pyedflib
Successfully installed pyedflib-0.1.38
Note: you may need to restart the kernel to use updated packages.
Collecting mne
  Downloading mne-1.8.0-py3-none-any.whl.metadata (21 kB)
Collecting lazy-loader>=0.3 (from mne)
  Downloading lazy_loader-0.4-py3-none-any.whl.metadata (7.6 kB)
Collecting pooch>=1.5 (from mne)
  Downloading pooch-1.8.2-py3-none-any.whl.metadata (10 kB)
Downloading mne-1.8.0-py3-none-any.whl (7.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.4/7.4 MB 105.9 MB/s eta 0:00:00
Downloading lazy_loader-0.4-py3-none-any.whl (12 kB)
Downloading po

In [2]:
import pyedflib
import numpy as np
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
import mne
import boto3

# Import SageMaker PyTorch framework for training jobs
from sagemaker.pytorch import PyTorch

# Import PyTorch for model development
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

2024-11-14 23:50:33.345291: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1731628233.582830   18775 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1731628233.650690   18775 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-14 23:50:34.284828: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
from botocore.exceptions import NoCredentialsError

# S3 client shared by the two helpers below
s3 = boto3.client('s3')

def list_edf_files_from_s3(bucket, prefix):
    # Use a paginator because list_objects_v2 returns at most 1000 keys per call
    paginator = s3.get_paginator('list_objects_v2')
    edf_files = []

    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent when a page holds no objects
        for obj in page.get('Contents', []):
            key = obj['Key']
            # Keep only object keys that end with .edf
            if key.endswith('.edf'):
                edf_files.append(key)
    return edf_files

def download_file_from_s3(bucket, s3_key, local_path):
    try:
        s3.download_file(bucket, s3_key, local_path)
        print(f"Downloaded {s3_key} to {local_path}")
    except NoCredentialsError:
        print("Credentials not available")

In [4]:
import boto3
import pandas as pd

# Initialize boto3 S3 client
s3_client = boto3.client('s3')

# Bucket information
bucket_name = 'seniordesignt6'
base_prefix = 'edf/'

# Read the CSV file from S3
csv_key = 'edfFiles.csv'
csv_obj = s3_client.get_object(Bucket=bucket_name, Key=csv_key)
edf_names_df = pd.read_csv(csv_obj['Body'])

# Extract the names and labels from the CSV
edf_names = edf_names_df['name'].tolist()
edf_labels = edf_names_df['label'].tolist()

# Normalize slashes in edf_names to match S3 path format (forward slashes)
edf_names = [name.replace('\\', '/') for name in edf_names]

# List to store matching S3 file paths and their labels
matching_file_paths = []
matching_labels = []

# Map each CSV path to its label once, keeping the first label if a path repeats
label_by_name = {}
for name, label in zip(edf_names, edf_labels):
    label_by_name.setdefault(name, label)

# list_objects_v2 returns at most 1000 keys per call, so page through each prefix
paginator = s3_client.get_paginator('list_objects_v2')

# Loop through each "000" to "150" folder
for i in range(151):  # 000 to 150 inclusive
    prefix = f"{base_prefix}{i:03d}/"  # Creates 'edf/000/', 'edf/001/', ..., 'edf/150/'

    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent when the prefix holds no objects
        for obj in page.get('Contents', []):
            s3_key = obj['Key']

            # Drop the leading "edf/000/" part to get "aaaaaaaa/s001_2015/01_tcp_ar/..." format
            relative_path = '/'.join(s3_key.split('/')[2:])

            # Keep the file only if the CSV lists it, and record its label
            if relative_path in label_by_name:
                matching_file_paths.append(f"s3://{bucket_name}/{s3_key}")
                matching_labels.append(label_by_name[relative_path])

# Create a DataFrame with matching file paths and their labels
data = {
    's3_path': matching_file_paths,
    'label': matching_labels
}
matching_df = pd.DataFrame(data)

# Print or save the resulting DataFrame
print(matching_df)

                                                s3_path  label
0     s3://seniordesignt6/edf/000/aaaaaaac/s001_2002...      3
1     s3://seniordesignt6/edf/000/aaaaaaac/s001_2002...      3
2     s3://seniordesignt6/edf/000/aaaaaaac/s002_2002...      3
3     s3://seniordesignt6/edf/000/aaaaaaac/s004_2002...      3
4     s3://seniordesignt6/edf/000/aaaaaaac/s004_2002...      3
...                                                 ...    ...
4615  s3://seniordesignt6/edf/134/aaaaatvr/s005_2015...      0
4616  s3://seniordesignt6/edf/134/aaaaatvr/s005_2015...      0
4617  s3://seniordesignt6/edf/134/aaaaatvr/s005_2015...      0
4618  s3://seniordesignt6/edf/134/aaaaatvr/s005_2015...      0
4619  s3://seniordesignt6/edf/134/aaaaatvr/s005_2015...      0

[4620 rows x 2 columns]


In [5]:
# Set the desired number of channels and fixed length of time points
TARGET_CHANNELS = 40  # Number of channels you want to have in the final data
TARGET_POINTS = 75000  # Fixed number of time points for each sample

# Function to load and preprocess each .edf file
def load_and_preprocess_edf(filePath, target_channels=TARGET_CHANNELS, target_points=TARGET_POINTS):
    # Load the raw EEG data
    RawEEGDataFile = mne.io.read_raw_edf(filePath, preload=True, verbose=False)
    RawEEGDataFile.interpolate_bads()

    # Get the raw data (channels × time)
    data = RawEEGDataFile.get_data()

    # Determine current number of channels
    current_channels, current_points = data.shape

    # Pad or truncate channels to make them equal to target_channels (e.g., 40)
    if current_channels < target_channels:
        # Pad with zeros if there are fewer channels than target_channels
        padding = target_channels - current_channels
        data = np.pad(data, ((0, padding), (0, 0)), mode='constant')
    else:
        # Truncate channels if there are more than target_channels
        data = data[:target_channels, :]

    # Linearly resample each channel onto target_points evenly spaced positions
    if current_points != target_points:
        old_positions = np.arange(current_points)
        new_positions = np.linspace(0, current_points - 1, target_points)
        data = np.array([np.interp(new_positions, old_positions, data[ch, :]) for ch in range(target_channels)])

    return data
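
To see what the padding and resampling steps do to array shapes, here is a small self-contained sketch on synthetic data. The helper name `fit_to_shape` and the tiny sizes are assumptions made for illustration. The logic mirrors `load_and_preprocess_edf` but skips the EDF loading. One caveat: resampling every recording to the same number of points gives recordings of different durations different effective sampling rates.

```python
import numpy as np

def fit_to_shape(data, target_channels, target_points):
    """Zero-pad or truncate channels, then linearly resample time to target_points."""
    current_channels, current_points = data.shape
    if current_channels < target_channels:
        data = np.pad(data, ((0, target_channels - current_channels), (0, 0)), mode="constant")
    else:
        data = data[:target_channels, :]
    if current_points != target_points:
        old_grid = np.arange(current_points)
        new_grid = np.linspace(0, current_points - 1, target_points)
        data = np.stack([np.interp(new_grid, old_grid, row) for row in data])
    return data

# Synthetic recording: 3 channels, 10 samples, each channel a simple ramp 0..9
recording = np.tile(np.arange(10, dtype=float), (3, 1))
fitted = fit_to_shape(recording, target_channels=5, target_points=19)

print(fitted.shape)     # (5, 19)
print(fitted[0, :3])    # ramp resampled at half-sample steps: 0.0, 0.5, 1.0
print(fitted[4].sum())  # 0.0, because channel 4 is zero padding
```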

In [6]:
# Prepare label encoding
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(matching_labels)
categorical_labels = to_categorical(encoded_labels)
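
For reference, a sketch of what the two label representations look like, using NumPy only. `np.unique(..., return_inverse=True)` stands in for `LabelEncoder.fit_transform`, and indexing an identity matrix stands in for `to_categorical`. The raw label values are made up for illustration (the DataFrame above shows at least the values 0 and 3). Worth keeping in mind for the PyTorch training cell further down: `nn.CrossEntropyLoss` accepts integer class indices directly, so the one-hot matrix is mainly useful for Keras-style training.

```python
import numpy as np

raw_labels = np.array([3, 3, 0, 3, 0])  # made-up example values

# classes: sorted unique labels; encoded: index of each label within classes
classes, encoded = np.unique(raw_labels, return_inverse=True)
print(classes)  # [0 3]
print(encoded)  # [1 1 0 1 0]

# One-hot rows, analogous to to_categorical(encoded)
one_hot = np.eye(len(classes), dtype=np.float32)[encoded]
print(one_hot.shape)  # (5, 2)

# The encoding is reversible: indexing classes by the codes recovers the raw labels
recovered = classes[encoded]
```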

In [7]:
# # Prepare the data: Shape should be (batch_size, 1, num_channels, signal_length)
# data = torch.tensor(eeg_data, dtype=torch.float32)  # Data is (batch_size, num_channels, signal_length, 1)
# data = data.permute(0, 3, 1, 2)  # Reshape to (batch_size, 1, num_channels, signal_length)

# labels = torch.tensor(matching_df['label'][:count], dtype=torch.long)  # Labels as long tensor

# # Create a dataset and dataloader
# dataset = TensorDataset(data, labels)
# dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [None]:
# Cell 1: Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import numpy as np
import mne  # assuming mne is used for EEG data loading
import torch.nn.functional as F
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import gc

# Cell 2: Define a dataset class to load EEG data in segments
class EEGDataset(Dataset):
    def __init__(self, eeg_file_paths, target_channels=40, target_points=75000, segment_length=25000):
        self.eeg_file_paths = eeg_file_paths
        self.target_channels = target_channels
        self.target_points = target_points
        self.segment_length = segment_length
        self.s3 = boto3.client('s3')

    def __len__(self):
        return len(self.eeg_file_paths)

    def __getitem__(self, idx):
        s3_file_path = self.eeg_file_paths[idx]

        # Parse bucket name and key from the S3 path
        s3_path_parts = s3_file_path.replace("s3://", "").split("/", 1)
        bucket_name = s3_path_parts[0]
        key = s3_path_parts[1]

        # Define local path to save the file
        local_file_path = os.path.join("/tmp", os.path.basename(key))

        # Download file from S3 if it does not exist locally
        if not os.path.exists(local_file_path):
            self.s3.download_file(bucket_name, key, local_file_path)

        # Load the EDF file using mne
        raw = mne.io.read_raw_edf(local_file_path, preload=True)

        # Apply preprocessing steps such as ICA or filtering here
        data = raw.get_data()  # Shape: (channels, time_points)

        # Pad or trim channels to match target_channels
        if data.shape[0] < self.target_channels:
            padding = np.zeros((self.target_channels - data.shape[0], data.shape[1]))
            data = np.vstack((data, padding))
        elif data.shape[0] > self.target_channels:
            data = data[:self.target_channels, :]

        # Split the data into segments to handle large files
        segments = []
        for start in range(0, data.shape[1], self.segment_length):
            end = min(start + self.segment_length, data.shape[1])
            segment = data[:, start:end]

            # Resample each segment to target_points samples per channel.
            # Note: when segment_length < target_points this upsamples every segment.
            if segment.shape[1] != self.target_points:
                new_grid = np.linspace(0, 1, self.target_points)
                old_grid = np.linspace(0, 1, segment.shape[1])
                segment = np.array([np.interp(new_grid, old_grid, channel) for channel in segment])

            # Add a leading axis so each segment is (1, channels, target_points),
            # i.e. a single image channel for Conv2d (the batch axis is added by the collate function)
            segment = np.expand_dims(segment, axis=0)
            segments.append(torch.tensor(segment, dtype=torch.float32))

        return segments

# Cell 3: List of all your EEG file paths (S3 URIs); encoded_labels is aligned with this list
eeg_files = matching_file_paths

# Cell 4: Create a dataset and a dataloader for batch processing
batch_size = 2  # Files per batch; set based on your system's memory
segment_length = 5000  # Set segment length based on your system's memory

class LabeledEEGDataset(Dataset):
    # Pairs each file's segments with its label so shuffling cannot misalign them
    def __init__(self, base_dataset, labels):
        self.base_dataset = base_dataset
        self.labels = labels

    def __len__(self):
        return len(self.base_dataset)

    def __getitem__(self, idx):
        return self.base_dataset[idx], int(self.labels[idx])

dataset = LabeledEEGDataset(EEGDataset(eeg_files, segment_length=segment_length), encoded_labels)

def collate_fn(batch):
    # Flatten the per-file segment lists and repeat each file's label once per segment
    segments = [segment for file_segments, _ in batch for segment in file_segments]
    labels = [label for file_segments, label in batch for _ in file_segments]
    return torch.stack(segments), torch.tensor(labels, dtype=torch.long)

dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

# Cell 5: Define the model (a simple CNN stand-in, not an actual ResNet)
class CNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        # Pool to a fixed 8x8 grid so fc1's input size holds for any segment size
        self.pool = nn.AdaptiveAvgPool2d((8, 8))
        self.fc1 = nn.Linear(32 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.pool(x)
        x = x.view(x.size(0), -1)  # Flatten the tensor
        x = F.relu(self.fc1(x))
        return self.fc2(x)

# One output per distinct label found by the LabelEncoder
model = CNN(num_classes=len(label_encoder.classes_))

# Cell 6: Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()  # Expects raw logits and integer class indices

# Cell 7: Training loop with batch processing
num_epochs = 2  # Set the number of epochs
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in dataloader:
        # Zero the gradient
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)

        # Calculate loss
        loss = criterion(outputs, labels)

        # Backward pass
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        # Clear unused variables to free memory
        del inputs, labels, outputs, loss
        gc.collect()
        torch.cuda.empty_cache()

    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(dataloader)}")

print("Finished Training")

# Explanation of Computational Efficiency
# Increasing the batch size can make the training more computationally efficient, as it allows more data to be processed in parallel.
# However, it also requires more memory. If your system has limited memory, a smaller batch size may be more practical to avoid crashes.
# Increasing the segment length means more data is processed per segment, which can also improve efficiency.
# However, larger segments require more memory, which may not be feasible on systems with limited resources.
# Therefore, both batch size and segment length should be chosen carefully based on the available memory to balance efficiency and stability.


Extracting EDF parameters from /tmp/aaaaampz_s005_t007.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 153855  =      0.000 ...   600.996 secs...
Extracting EDF parameters from /tmp/aaaaaijs_s001_t001.edf...
EDF file detected
Setting channel info structure...
Creating raw.info structure...
Reading 0 ... 125999  =      0.000 ...   314.998 secs...
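
The memory comments at the end of the training cell can be made concrete with some arithmetic. The sketch below estimates the size of one input batch and of the first convolution's output, using values that appear in this notebook: 40 channels, 75000 points per segment, `segment_length = 5000`, `batch_size = 2`, 16 filters in `conv1`, and the 153856-sample recording from the log above. It assumes float32 tensors and that both files in a batch are as long as that recording. The function name is hypothetical.

```python
import math

def batch_memory_bytes(samples_per_file, segment_length, files_per_batch,
                       channels=40, target_points=75000, bytes_per_value=4):
    """Estimate bytes for one input batch when every segment is resampled to target_points."""
    segments_per_file = math.ceil(samples_per_file / segment_length)
    total_segments = segments_per_file * files_per_batch
    return total_segments, total_segments * channels * target_points * bytes_per_value

segments, input_bytes = batch_memory_bytes(153856, 5000, 2)

# conv1 uses stride 1 and padding 1, so it keeps the spatial size and emits 16 feature maps
conv1_bytes = input_bytes * 16

print(segments)            # 62
print(input_bytes / 1e9)   # 0.744
print(conv1_bytes / 1e9)   # 11.904
```

Under these assumptions the first layer's activations alone need roughly 12 GB per batch, which may be part of why this attempt stalled. Reducing `target_points` per segment, or the number of segments per batch, shrinks that figure proportionally.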
