# Goal of this notebook: 
### Create a DataFrame that maps each file_name to its rhythm_label and EKG_lead_1_values.
**I work with the Diagnostics.xlsx and ECGDataDenoised.zip files provided by the authors**

**Process:**

**1. Process the Diagnostics.xlsx file (this file maps each ECG file name to a labeled heart rhythm)**

**2. Unzip ECGDataDenoised.zip and load the 10,646 .csv files with 12-lead ECG readings.** 
**(the names of these files correspond to the file names in Diagnostics.xlsx from step 1)**

**3. Extract Lead 1 data from each 12-lead ECG CSV and create the ECG Dictionary/DataFrame.**
 - The resulting dictionary maps each file name to its rhythm label and its ECG microvolt readings (a list of floats, which is stored as a string once saved to CSV)
 - Lead 1 is the lead an Apple Watch would use for an EKG
 
**4. Save the new ECG Dictionary/DataFrame as CSV**

## About the EKG dataset 
### **Datasets**: **A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients**

"Description:
This newly inaugurated research database for 12-lead electrocardiogram signals was created under the auspices of Chapman University and Shaoxing People's Hospital (Shaoxing Hospital Zhejiang University School of Medicine) and aims at enabling the scientific community in conducting new studies on arrhythmia and other cardiovascular conditions. Certain types of arrhythmias, such as atrial fibrillation, have a pronounced negative impact on public health, quality of life, and medical expenditures. As a non-invasive test, the long term ECG monitoring is a major and vital diagnostic tool for detecting these conditions. However, such a practice generates a considerable amount of data that analysis of which require considerable time and effort by human experts. Advancement of modern machine learning and statistical tools can be trained on high quality, large data to achieve high levels of automated diagnostic accuracy. Thus, we collected and disseminated this novel database that contains 12-lead ECGs of 10,646 patients with 500 Hz sampling rate that features 11 common rhythms and 67 additional cardiovascular conditions, all labeled by professional experts. For each subject, a sample size of 10 seconds (12-dimension 5000 samples) was available. The dataset can be used to design, compare, and fine tune new and classical statistical and machine learning techniques in studies focused on arrhythmia and other cardiovascular conditions."


**Dataset Citation:**

Zheng, Jianwei; Rakovski, Cyril; Danioko, Sidy; Zhang, Jianming; Yao, Hai; Hangyuan, Guo (2019). A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients. figshare. Collection. https://doi.org/10.6084/m9.figshare.c.4560497

Zheng, Jianwei (2019). ECGDataDenoised.zip. figshare. Dataset. https://doi.org/10.6084/m9.figshare.8378291.v1

Zheng, Jianwei (2019). Diagnostics.xlsx. figshare. Dataset. https://doi.org/10.6084/m9.figshare.8360408.v2

### Import Packages

In [1]:
import pandas as pd
import zipfile
import os
import io

### Create file paths to ECGDataDenoised.zip and Diagnostics.xlsx
**Define a file path for each variable in the cell below!**

In [None]:
zip_file_path = r'downloads/ecg_source_downloads/ECGDataDenoised.zip'         # Path to the raw EKG readings in microvolts, sampled 500 times per second
diagnostic_file_path = r'downloads/ecg_source_downloads/Diagnostics.xlsx'     # Path to the labels the dataset creators gave each EKG; maps each file name to a label

### 1. Process the Diagnostics.xlsx file
**This file maps each 'FileName' to a heart 'Rhythm' labeled by an expert.**

In [None]:
# Read only columns A and B of the spreadsheet: the file name and the
# expert-assigned heart rhythm label for that file
diagnostics_df = pd.read_excel(diagnostic_file_path, usecols="A,B", header=0) 

In [4]:
diagnostics_df.head() # View head of DF

| | FileName | Rhythm |
|---|---|---|
| 0 | MUSE_20180113_171327_27000 | AFIB |
| 1 | MUSE_20180112_073319_29000 | SB |
| 2 | MUSE_20180111_165520_97000 | SA |
| 3 | MUSE_20180113_121940_44000 | SB |
| 4 | MUSE_20180112_122850_57000 | AF |
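
As a rough sketch of how this table can be used downstream, the FileName/Rhythm pairs can be turned into a lookup and the label distribution inspected. The small DataFrame below is a stand-in built from the five rows shown above; `sample_df`, `rhythm_by_file`, and `label_counts` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Stand-in for diagnostics_df, using only the five rows displayed above
sample_df = pd.DataFrame({
    "FileName": [
        "MUSE_20180113_171327_27000",
        "MUSE_20180112_073319_29000",
        "MUSE_20180111_165520_97000",
        "MUSE_20180113_121940_44000",
        "MUSE_20180112_122850_57000",
    ],
    "Rhythm": ["AFIB", "SB", "SA", "SB", "AF"],
})

# A lookup needs unique keys, so fail early if a file name is repeated
assert sample_df["FileName"].is_unique

# Series indexed by file name: rhythm_by_file[name] returns the label
rhythm_by_file = sample_df.set_index("FileName")["Rhythm"]

# Number of recordings per rhythm label
label_counts = sample_df["Rhythm"].value_counts()

print(rhythm_by_file["MUSE_20180112_073319_29000"])
print(label_counts)
```

On the full spreadsheet, the same two calls would show how unevenly the rhythm labels are distributed before any modeling is attempted.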


### 2. Unzip ECGDataDenoised.zip
**Unzipping will create over 10,000 .csv files, each holding one ECG's microvolt values (500 readings/second for 10 seconds on 12 leads)**



In [None]:
# ----- Unzip ECGDataDenoised.zip -----
# Create a directory to extract the CSVs into

output_dir = r'datasets/created_ecg_csv/ECGDataDenoised'  # ---!! Define your own path to an output directory for the 10,000+ 12 lead EKG csv!!--

os.makedirs(output_dir, exist_ok=True)              # Create directory or verify it is there

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref: # extract all files from zip to output directory
    zip_ref.extractall(output_dir)

print(f"Extracted ECG CSVs to: {output_dir}")

Extracted ECG CSVs to: datasets/created_ecg_csv/ECGDataDenoised
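
If disk space is a concern, the first column can also be read straight from the archive without extracting it. The sketch below is an alternative to the cell above, not what this notebook does. It builds a tiny in-memory zip with made-up values so that it runs on its own; the archive layout and the names `buffer` and `first_columns` are assumptions for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory zip that mimics one headerless, multi-column ECG CSV.
# The member name and the values are made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ECGDataDenoised/demo_record.csv", "1.0,2.0,3.0\n4.0,5.0,6.0\n")

buffer.seek(0)
first_columns = {}
with zipfile.ZipFile(buffer, "r") as zf:
    for member in zf.namelist():
        if not member.endswith(".csv"):
            continue  # skip directory entries and non-CSV files
        with zf.open(member) as fh:
            # header=None because the CSVs carry no column names
            column = pd.read_csv(fh, header=None, usecols=[0])[0]
        # Drop the folder prefix and the ".csv" extension to get the record name
        record_name = member.rsplit("/", 1)[-1][: -len(".csv")]
        first_columns[record_name] = column.tolist()

print(first_columns)
```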


### 3. Extract Lead 1 from the 12-lead ECG CSVs and create the ECG Dictionary/DataFrame.

#### **!!! --- The 12 leads are not labeled in the CSV files. I am assuming that the first column (column A in a spreadsheet view) is Lead 1 --- !!!**

**This is an assumption and has not been verified with the authors of the paper.**
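
One way to gather indirect evidence for this assumption is Einthoven's law: for the standard limb leads, Lead II equals Lead I plus Lead III at every sample. If the columns follow the conventional order I, II, III, ... (itself an assumption), then column 1 minus the sum of columns 0 and 2 should be close to zero. The sketch below uses synthetic data only, and `einthoven_residual` is a hypothetical helper. On the real files the denoising step may keep the identity from holding exactly, and a small residual is consistent with the assumed order without proving it.

```python
import numpy as np
import pandas as pd


def einthoven_residual(ecg: pd.DataFrame) -> float:
    """Largest |II - (I + III)|, assuming columns 0, 1, 2 hold leads I, II, III."""
    lead_i, lead_ii, lead_iii = ecg[0], ecg[1], ecg[2]
    return float((lead_ii - (lead_i + lead_iii)).abs().max())


# Synthetic stand-ins for two independent limb leads (not real ECG data)
rng = np.random.default_rng(0)
lead_i = rng.normal(size=5000)
lead_iii = rng.normal(size=5000)

# Columns in the assumed order I, II, III
consistent = pd.DataFrame({0: lead_i, 1: lead_i + lead_iii, 2: lead_iii})

# Same signals with leads I and II swapped, so the identity no longer holds
swapped = pd.DataFrame({0: lead_i + lead_iii, 1: lead_i, 2: lead_iii})

print(einthoven_residual(consistent))  # essentially zero
print(einthoven_residual(swapped))     # clearly non-zero
```

To apply this to a real recording, the file would be read with `pd.read_csv(csv_path, header=None)` so that the columns are numbered 0 to 11.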

In [6]:

# Initialize an empty list to store our processed EKG data and labels
ekg_dataset = []

# Get a list of all CSV files in the extracted directory
ecg_files = [f for f in os.listdir(output_dir) if f.endswith('.csv')]  # Names of the 12-lead CSV files extracted above

# Iterate through each ECG CSV file
for ecg_file in ecg_files:
    file_name_without_ext = os.path.splitext(ecg_file)[0]

    # Find the corresponding rhythm from the diagnostics DataFrame to map a rhythm
    rhythm_entry = diagnostics_df[diagnostics_df['FileName'] == file_name_without_ext]

    if not rhythm_entry.empty:
        rhythm = rhythm_entry['Rhythm'].iloc[0] # This will associate a rhythm name to file 

        # Construct the full path to the CSV file
        csv_path = os.path.join(output_dir, ecg_file)

        try:
            # Read only the first column of the CSV (assuming it's Lead 1)
            # Since there are no headers, we'll read it without header and select the first column by index
            ecg_data = pd.read_csv(csv_path, header=None, usecols=[0])

            # Ensure 5000 values are present
            if len(ecg_data) == 5000:
                # Store the filename, rhythm, and Lead 1 data
                ekg_dataset.append({
                    'file_name': file_name_without_ext,
                    'rhythm': rhythm,
                    'lead_1_data': ecg_data[0].tolist() # Convert to list for easier handling
                })
            else:
                print(f"Skipping {ecg_file}: Expected 5000 values, but found {len(ecg_data)}")
        except Exception as e:
            print(f"Error processing {ecg_file}: {e}")
    else:
        print(f"No rhythm found for {file_name_without_ext} in Diagnostics.xlsx")



# Convert the list of dictionaries to a DataFrame (optional, but good for further analysis)
final_ekg_df = pd.DataFrame(ekg_dataset)

print(f"\nCreated dataset with {len(final_ekg_df)} entries.")
print("Example of a processed entry:")
if not final_ekg_df.empty:
    print(final_ekg_df.iloc[0])

Skipping MUSE_20180113_124215_52000.csv: Expected 5000 values, but found 1926

Created dataset with 10645 entries.
Example of a processed entry:
file_name                             MUSE_20180118_132508_86000
rhythm                                                        SB
lead_1_data    [28.56, 6.1529, -11.824, -22.725, -26.701, -25...
Name: 0, dtype: object


In [7]:
final_ekg_df.head() # View dataframe

| | file_name | rhythm | lead_1_data |
|---|---|---|---|
| 0 | MUSE_20180118_132508_86000 | SB | [28.56, 6.1529, -11.824, -22.725, -26.701, -25... |
| 1 | MUSE_20180116_124640_27000 | ST | [7.7133, 3.4957, -1.6378, -7.7151, -13.76, -18... |
| 2 | MUSE_20180113_171837_63000 | SB | [-10.816, -11.143, -11.407, -11.518, -11.336, ... |
| 3 | MUSE_20180113_134112_95000 | AFIB | [-86.666, -82.155, -76.289, -68.804, -61.092, ... |
| 4 | MUSE_20180118_135058_59000 | SA | [-68.682, -67.147, -64.614, -60.604, -55.135, ... |
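
For later analysis it can be convenient to hold the signals as a 2-D array with one row per recording, plus a matching time axis. This is a minimal sketch, assuming every `lead_1_data` list has the same length (the loop above enforces 5000 values). The tiny `demo_df` with 4-sample signals is made up for illustration, and `signals` and `time_s` are hypothetical names.

```python
import numpy as np
import pandas as pd

SAMPLING_RATE_HZ = 500  # sampling rate stated in the dataset description

# Made-up stand-in for final_ekg_df with very short signals
demo_df = pd.DataFrame({
    "file_name": ["rec_a", "rec_b"],
    "rhythm": ["SB", "AFIB"],
    "lead_1_data": [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
})

# Stack the per-row lists into an array of shape (n_recordings, n_samples)
signals = np.array(demo_df["lead_1_data"].tolist(), dtype=float)

# Time stamp in seconds of each sample within a recording
time_s = np.arange(signals.shape[1]) / SAMPLING_RATE_HZ

print(signals.shape)
print(time_s)
```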


### 4. Save the new ECG DataFrame as CSV

In [None]:
# output_csv_path = r'datasets/processed_ecg_data.csv'   # Define the path where you want the newly created CSV to be stored.
# final_ekg_df.to_csv(output_csv_path, index=False)       # Writes the DataFrame to a CSV file at the path defined above.

# print(f"DataFrame saved to: {output_csv_path}")
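
Because `to_csv` writes each list in `lead_1_data` as its string representation, the column comes back as text when the CSV is reloaded. The sketch below shows one way to round-trip it, using an in-memory buffer instead of a file and `ast.literal_eval` to parse the strings back into lists. `demo_df` holds made-up data, and the other variable names are hypothetical.

```python
import ast
import io

import pandas as pd

# Made-up stand-in for final_ekg_df with a single short recording
demo_df = pd.DataFrame({
    "file_name": ["rec_a"],
    "rhythm": ["SB"],
    "lead_1_data": [[28.56, 6.1529, -11.824]],
})

# Write to an in-memory buffer rather than to disk
csv_buffer = io.StringIO()
demo_df.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)

reloaded = pd.read_csv(csv_buffer)

# At this point each lead_1_data value is a string such as "[28.56, 6.1529, -11.824]"
value_type_before = type(reloaded.loc[0, "lead_1_data"])

# Parse each string back into a Python list of floats
reloaded["lead_1_data"] = reloaded["lead_1_data"].apply(ast.literal_eval)

print(value_type_before)
print(reloaded.loc[0, "lead_1_data"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.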