# Load different file formats with Pandas
You'll learn how to load data files in Python.

This demo is a Jupyter notebook, i.e. it is intended to be run step by step.

Author: Eric Einspänner
<br>
Contributor: Nastaran Takmilhomayouni

First version: 6th of July 2023


Copyright 2023 Clinic of Neuroradiology, Magdeburg, Germany

License: Apache-2.0

## Table of contents
0. [Initial Set-Up for Google Colab](#initial-set-up-for-google-colab)
1. [Initial Set-Up (offline)](#initial-set-up-offline)
2. [File Formats](#Medical-data-file-formats)
3. [CSV](#CSV)
4. [XLSX](#XLSX)
5. [XML](#XML)
    - [XML to Dataframe](#Converting-xml-to-pandas-dataframe)
6. [RD](#RD)
7. [EDF](#EDF)
8. [BDF](#BDF)

## Initial Set-Up for Google Colab
<u> Execute these code blocks only in Google Colab! </u>

In [None]:
!git clone https://github.com/University-Clinic-of-Neuroradiology/python-bootcamp.git

In [None]:
import os
import sys
from google.colab import output
output.enable_custom_widget_manager()

sys.path.insert(0,'/content/python-bootcamp/notebooks/DataManagement')
os.chdir(sys.path[0])

In [None]:
%pip install -q ipympl numpy matplotlib pandas seaborn mne pybdf

In [None]:
import os
import shutil
import gzip
import tarfile
import pandas as pd
import numpy as np
import mne
from biosemipy import bdf
import seaborn as sn
import matplotlib.pyplot as plt
from xml.etree import ElementTree as ET                     # Parse XML file in a tree structure

from Utilities.EEG_load_function import import_eeg_file

## Initial Set-Up (offline)

In [None]:
# Initial imports etc.
import os
import shutil
import gzip
import tarfile
import pandas as pd
import numpy as np
import mne
from biosemipy import bdf
import seaborn as sn
import matplotlib.pyplot as plt
from xml.etree import ElementTree as ET                     # Parse XML file in a tree structure

from Utilities.EEG_load_function import import_eeg_file

## --- Start notebook ---

Medical data is often shared as .gz or .tar.gz archives, i.e. compressed files that contain data files in the formats listed in the table under [Medical data file formats](#Medical-data-file-formats).

To read a .gz file, use the `open` function from Python's `gzip` module.

To unpack a .tar.gz file, use the `open` function from Python's `tarfile` module or the `unpack_archive` function from `shutil`.

In [None]:
# tarfile
# Note: extractall() without a path argument extracts into the current working directory
with tarfile.open('Data/smni_eeg_data/a_1_co2a0000364.tar.gz', "r:gz") as tar:
    tar.extractall()
    print(tar)

In [None]:
# shutil
shutil.unpack_archive('Data/smni_eeg_data/a_1_co2a0000364.tar.gz', 'Data/smni_eeg_data')
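
Before unpacking an archive it can help to check what it contains. The following is a minimal sketch that builds a small throwaway archive in a temporary folder (the file names `sample.txt` and `demo.tar.gz` are made up for illustration) and lists its members with `getnames()` without extracting anything.

```python
import os
import tarfile
import tempfile

# Build a tiny throwaway archive so the example runs without the course data
workdir = tempfile.mkdtemp()
sample_path = os.path.join(workdir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("hello\n")

archive_path = os.path.join(workdir, "demo.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(sample_path, arcname="demo/sample.txt")

# List the members of the archive without extracting them
with tarfile.open(archive_path, "r:gz") as tar:
    member_names = tar.getnames()

print(member_names)
```

The same `getnames()` call should work on the course archives once the path is replaced.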

In [None]:
# gzip
fc = gzip.open('Data/SMNI_CMI_TRAIN/co2c0000347/co2c0000347.rd.000.gz', 'rb')
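
`gzip.open` only returns a file handle; nothing is decompressed until you read from it. As a small sketch on a made-up file (`demo.txt.gz` is hypothetical, written to a temporary folder), the mode decides whether you get raw bytes (`'rb'`) or decoded text (`'rt'`):

```python
import gzip
import os
import tempfile

workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "demo.txt.gz")

# Write a small compressed text file (stand-in for a real data file)
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write("line one\nline two\n")

# 'rb' yields bytes, 'rt' yields decoded strings
with gzip.open(gz_path, "rb") as f:
    raw_bytes = f.read()
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    text_lines = f.read().splitlines()

print(type(raw_bytes))
print(text_lines)
```

Using `with` also makes sure the handle is closed again after reading.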

In [None]:
# gzip - loop: unpack every tar.gz archive, then decompress each .gz file inside it
sleepdata_path = 'Data/smni_eeg_data/'
for tar_filename in os.listdir(sleepdata_path):
    if tar_filename.endswith('.tar.gz'):
        shutil.unpack_archive(sleepdata_path + tar_filename, sleepdata_path)

        # The archive unpacks into a folder named after the archive (without extension)
        extracted_dir = sleepdata_path + tar_filename.split('.')[0]
        for gz_filename in os.listdir(extracted_dir):
            if not gz_filename.endswith('.gz'):
                continue  # skip files that are already decompressed
            gz_path = os.path.join(extracted_dir, gz_filename)
            with gzip.open(gz_path, 'rb') as f_in, open(gz_path[:-len('.gz')], 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)

## Medical data file formats
| Format Name | File Extension | Full Name / Description                        |
|-------------|----------------|------------------------------------------------|
| XML         | .xml           | Extensible Markup Language                     |
| RD          | .rd            | Plain-text EEG trial recording (SMNI dataset)  |
| EDF         | .edf           | European Data Format                           |
| BDF         | .bdf           | BioSemi Data Format                            |
| CSV         | .csv           | Comma Separated Values                         |
| XLSX        | .xlsx          | Microsoft Excel Spreadsheet                    |

## CSV
To read a CSV file, use the `read_csv()` function from the `pandas` library.

In [None]:
dat2 = pd.read_csv("Data/Data_Entry_2017.csv")

print(type(dat2))
dat2.head()

## XLSX
To read an XLSX file, use the `read_excel()` function from the `pandas` library.

In [None]:
dat3 = pd.read_excel("Data/ESAC-Net_report_2021_downloadable_tables.xlsx",sheet_name='D1_J01A_AC',skiprows=1)

print(type(dat3))
dat3.head()

## XML
To read an XML file, you can use the `parse` function from Python's `xml.etree.ElementTree` module, which returns an `ElementTree` object containing the data from the XML file.

Parsing means reading information from a file and splitting it into pieces.

In [None]:
# parse an XML file by name
file = ET.parse('Data/naaccr-xml-sample-v210-abstract-10.xml')

# What is the main data in our file?
root = file.getroot()  # fetch the root element of the file
print('Root element of your file is:', root)
print('Tag of the root element is:', root.tag)  # the data we have --> NaaccrData

print()  # print an empty line
# What are the subelements of the root?
print('First subelement of the root is:', root[0].tag)  # Patient
print('Number of patients is', len(root))  # how many patients
for x in root:  # for all root subelements
    print(x.tag, x.attrib)  # print each subelement's tag and attributes
print()

# What are the subelements of each patient?
print('First subelement of each patient is:', root[0][0].tag)  # Item
print('Number of subelements for the first patient is', len(root[0]), ': 21 Items and 1 Tumor')
for x in root[0]:  # for all subelements of the first patient
    print(x.tag, x.attrib)  # print each subelement's tag and attributes

# Subelements that have children of their own (here: Tumor)
for x in root[0]:
    if len(x) != 0:
        print(len(x), 'Items for each Tumor subelement:')
        for y in x:
            print(y.tag, y.attrib)
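
Tags printed by ElementTree carry the XML namespace in curly braces, which makes them long and awkward to search for. The sketch below uses a made-up miniature document (the namespace URI `http://example.org/demo` and the `id` attributes are hypothetical, not taken from the NAACCR file) to show two common ways to deal with this: stripping the namespace from a tag, and passing a prefix mapping to `findall`.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature document; the namespace URI and attributes are made up
xml_text = """
<Data xmlns="http://example.org/demo">
  <Patient>
    <Item id="sex">2</Item>
    <Tumor><Item id="site">C509</Item></Tumor>
  </Patient>
</Data>
"""
root = ET.fromstring(xml_text)


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit('}', 1)[-1]


print(root.tag)              # tag including the namespace
print(local_name(root.tag))  # tag without the namespace

# Map a prefix of our choice to the namespace and use it in search paths
ns = {"d": "http://example.org/demo"}
patients = root.findall("d:Patient", ns)
tumor_items = root.findall("./d:Patient/d:Tumor/d:Item", ns)
print(len(patients), [item.text for item in tumor_items])
```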

### Converting xml to pandas dataframe

In [None]:
# parse an XML file by name
file = ET.parse('Data/naaccr-xml-sample-v210-abstract-10.xml')
root = file.getroot()  # fetch the root element of the file


def xml2df(root):
    """Count the subelement tags of every child of root and return the counts as a DataFrame."""
    all_records = []

    for child in root:  # one child per patient
        # Strip the XML namespace from each tag, e.g. '{namespace}Item' -> 'Item'
        tags = [subchild.tag.split('}')[1] for subchild in child]
        # How often does each tag occur for this patient?
        counts = {tag: tags.count(tag) for tag in sorted(set(tags))}
        # Children without subelements produce an empty record
        record = {child.tag.split('}')[1]: counts} if tags else {}
        all_records.append(record)
    return pd.DataFrame(all_records)

df=xml2df(root)
print(type(df))
df.head()
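
A table with one row per patient and one column per item is often easier to work with than tag counts. The following is only a sketch on made-up data: it assumes each `Item` carries its field name in a `naaccrId` attribute and its value as text, and the namespace URI and values are hypothetical. It collects the items of each patient into a dictionary using the standard library only.

```python
from xml.etree import ElementTree as ET

# Made-up sample; attribute name 'naaccrId' and all values are assumptions
xml_text = """<NaaccrData xmlns="http://example.org/demo">
  <Patient>
    <Item naaccrId="patientIdNumber">001</Item>
    <Item naaccrId="sex">2</Item>
  </Patient>
  <Patient>
    <Item naaccrId="patientIdNumber">002</Item>
    <Item naaccrId="sex">1</Item>
  </Patient>
</NaaccrData>"""
root = ET.fromstring(xml_text)

rows = []
for patient in root:
    row = {}
    for item in patient:
        # Only direct Item children; nested elements such as Tumor are skipped
        if item.tag.endswith('}Item'):
            row[item.get('naaccrId')] = item.text
    rows.append(row)

print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would then give one row per patient with the item names as columns.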

## RD
To read an RD file, you can wrap the decompressed byte stream in the `TextIOWrapper` class from Python's `io` module, read it as text, and store the result in a pandas DataFrame.

Below, we call `import_eeg_file` from `EEG_load_function.py`, which reads the .rd file.
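
To illustrate the idea, here is a minimal sketch of wrapping a binary gzip handle in `TextIOWrapper` and parsing the lines. The sample data and its layout (comment lines starting with `#`, then whitespace-separated trial, channel, sample index and value) are assumptions made for this example; the actual `import_eeg_file` helper may work differently.

```python
import gzip
import io

# Made-up sample in an assumed layout: '#' comment lines, then
# whitespace-separated fields (trial, channel, sample index, value)
sample = b"# demo header\n0 FP1 0 -8.9\n0 FP1 1 -8.4\n0 FP2 0 0.8\n"
compressed = gzip.compress(sample)

# gzip.open accepts a file-like object as well as a path
binary_handle = gzip.open(io.BytesIO(compressed), "rb")
# TextIOWrapper decodes the byte stream so we can iterate over text lines
text_handle = io.TextIOWrapper(binary_handle, encoding="utf-8")

rows = []
for line in text_handle:
    if line.startswith("#"):
        continue  # skip header/comment lines
    trial, channel, sample_idx, value = line.split()
    rows.append({
        "trial": int(trial),
        "channel": channel,
        "sample": int(sample_idx),
        "value": float(value),
    })
text_handle.close()  # also closes the wrapped gzip handle

print(rows)
```

A list of dictionaries like `rows` can be turned into a DataFrame with `pd.DataFrame(rows)`.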

In [None]:
# Import data from one trial of participant co2c0000347 in the control group
fc = gzip.open('Data/SMNI_CMI_TRAIN/co2c0000347/co2c0000347.rd.000.gz', 'rb')
dfc = import_eeg_file(fc)
print(type(dfc)) # Pandas Data Frame

dfc.head()

In [None]:
# correlation matrix for the participant
corrMatrix_c = dfc.corr()
ax = plt.axes()
sn.heatmap(corrMatrix_c,
           cbar=False,
           square=True,
           xticklabels=False,
           yticklabels=False,
           ax = ax
           )
ax.set_title('Correlation Matrix - Participant in Control Group')
ax.set_xlabel('Electrodes') # x-axis label
ax.set_ylabel('Electrodes') # y-axis label

plt.show()

## EDF
To read an EDF file, use the `read_raw_edf` function from the `mne.io` module.

In [None]:
raw = mne.io.read_raw_edf('Data/test_generator.edf').load_data()
print()
print('Information of the data is:',raw.info)

## BDF
To read a BDF file, use the `BDF` class from the `biosemipy` package. To install it, run `git clone https://github.com/igmmgi/biosemipy.git` followed by `pip install -e biosemipy`.

In [None]:
dat1 = bdf.BDF("Data/BDFtestfiles/Newtest17-256.bdf")
print(type(dat1))
