In [1]:
import pandas as pd

# Show every row and every column when a DataFrame is displayed
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

Info:

Abstract: The dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians.



2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians, and a consensus classification label was assigned to each of them. Classification was both with respect to a morphologic pattern (A, B, C, ...) and to a fetal state (N, S, P). Therefore the dataset can be used for either 10-class or 3-class experiments.



Attribute Information:
 - LB - FHR baseline (beats per minute)
 - AC - # of accelerations per second
 - FM - # of fetal movements per second
 - UC - # of uterine contractions per second
 - DL - # of light decelerations per second
 - DS - # of severe decelerations per second
 - DP - # of prolonged decelerations per second
 - ASTV - percentage of time with abnormal short term variability
 - MSTV - mean value of short term variability
 - ALTV - percentage of time with abnormal long term variability
 - MLTV - mean value of long term variability
 - Width - width of FHR histogram
 - Min - minimum of FHR histogram
 - Max - maximum of FHR histogram
 - Nmax - # of histogram peaks
 - Nzeros - # of histogram zeros
 - Mode - histogram mode
 - Mean - histogram mean
 - Median - histogram median
 - Variance - histogram variance
 - Tendency - histogram tendency
 - CLASS - FHR pattern class code (1 to 10)
 - NSP - fetal state class code (N=normal; S=suspect; P=pathologic)
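
As a small illustration of how the `NSP` code could be made readable, the sketch below maps numeric codes to the letters used in the list above. The coding 1 = N, 2 = S, 3 = P is an assumption based on the order in which the states are listed, and the values in the toy Series are made up.

```python
import pandas as pd

# Assumed coding (not stated explicitly above): 1 = normal, 2 = suspect, 3 = pathologic
NSP_LABELS = {1: "N", 2: "S", 3: "P"}

# Toy values standing in for the real NSP column
nsp = pd.Series([2.0, 1.0, 1.0, 3.0], name="NSP")

# Cast to int first, in case the column has been read as float
nsp_labels = nsp.astype(int).map(NSP_LABELS)
print(nsp_labels.tolist())
```

Keeping the mapping in one dictionary makes it easy to correct if the real coding turns out to differ.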

Dataset page: [UCI Machine Learning Repository, Cardiotocography](https://archive.ics.uci.edu/ml/datasets/Cardiotocography#)

# 1.0 Data retrieval

In [2]:
# skiprows=[1] skips the line at 0-based position 1, i.e. the line right after the header
df = pd.read_excel('../../data_lake/input/CTG.xls', sheet_name='Raw Data', skiprows=[1])

In [3]:
# test = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls',sheet_name='Raw Data',skiprows=[1])

Drop the last 3 rows (error in the original database)

In [4]:
df = df[:-3]
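
Slicing off a fixed number of rows works as long as the file never changes. A more defensive variant, sketched here on made-up data, drops whichever rows lack a `SegFile` value. It assumes the three bad rows are missing that identifier, which has not been checked against the real spreadsheet.

```python
import numpy as np
import pandas as pd

# Toy frame: two valid records followed by three broken trailing rows
toy = pd.DataFrame({
    "SegFile": ["CTG0001.txt", "CTG0002.txt", np.nan, np.nan, np.nan],
    "LB": [120.0, 132.0, np.nan, 106.0, 160.0],
})

# Positional version: drop the last three rows regardless of content
by_position = toy.iloc[:-3]

# Content-based version: keep only rows that carry an identifier
by_content = toy.dropna(subset=["SegFile"])

# Both approaches should agree if the assumption holds
print(by_position.equals(by_content))
```

If the two results ever disagree on the real data, that is a sign the trailing rows are not what they were assumed to be.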

In [5]:
df.head(1)

,FileName,Date,SegFile,b,e,LBE,LB,AC,FM,UC,ASTV,MSTV,ALTV,MLTV,DL,DS,DP,DR,Width,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,Tendency,A,B,C,D,E,AD,DE,LD,FS,SUSP,CLASS,NSP
0,Variab10.txt,1996-12-01,CTG0001.txt,240.0,357.0,120.0,120.0,0.0,0.0,0.0,73.0,0.5,43.0,2.4,0.0,0.0,0.0,0.0,64.0,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,9.0,2.0


# 2.0 EDA

Quick overview of the DataFrame

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 40 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   FileName  2126 non-null   object        
 1   Date      2126 non-null   datetime64[ns]
 2   SegFile   2126 non-null   object        
 3   b         2126 non-null   float64       
 4   e         2126 non-null   float64       
 5   LBE       2126 non-null   float64       
 6   LB        2126 non-null   float64       
 7   AC        2126 non-null   float64       
 8   FM        2126 non-null   float64       
 9   UC        2126 non-null   float64       
 10  ASTV      2126 non-null   float64       
 11  MSTV      2126 non-null   float64       
 12  ALTV      2126 non-null   float64       
 13  MLTV      2126 non-null   float64       
 14  DL        2126 non-null   float64       
 15  DS        2126 non-null   float64       
 16  DP        2126 non-null   float64       
 17  DR        2126

How many unique values does each column have?

In [7]:
df.nunique()

FileName     352
Date          48
SegFile     2126
b            979
e           1064
LBE           48
LB            48
AC            22
FM            96
UC            19
ASTV          75
MSTV          57
ALTV          87
MLTV         249
DL            15
DS             2
DP             5
DR             1
Width        154
Min          109
Max           86
Nmax          18
Nzeros         9
Mode          88
Mean         103
Median        95
Variance     133
Tendency       3
A              2
B              2
C              2
D              2
E              2
AD             2
DE             2
LD             2
FS             2
SUSP           2
CLASS         10
NSP            3
dtype: int64
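
The counts above show that `DR` takes a single value, so it carries no information for a model. A minimal sketch of how such constant columns could be listed automatically, using a made-up frame with a few of the same column names:

```python
import pandas as pd

# Toy frame: DR is constant, the other columns vary
toy = pd.DataFrame({
    "DS": [0.0, 1.0, 0.0],
    "DR": [0.0, 0.0, 0.0],
    "NSP": [1.0, 2.0, 3.0],
})

# Columns with at most one distinct value are candidates for removal
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```

Whether to actually drop such columns is a modelling choice; the notebook keeps `DR` at this stage.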

In [8]:
df.head(1)

,FileName,Date,SegFile,b,e,LBE,LB,AC,FM,UC,ASTV,MSTV,ALTV,MLTV,DL,DS,DP,DR,Width,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,Tendency,A,B,C,D,E,AD,DE,LD,FS,SUSP,CLASS,NSP
0,Variab10.txt,1996-12-01,CTG0001.txt,240.0,357.0,120.0,120.0,0.0,0.0,0.0,73.0,0.5,43.0,2.4,0.0,0.0,0.0,0.0,64.0,62.0,126.0,2.0,0.0,120.0,137.0,121.0,73.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,9.0,2.0


It is not clear what the unit of observation is in this dataset: one `FileName` can correspond to several `SegFile` values. The website reports 2126 CTGs, which matches the number of rows, so each row (one `SegFile`) is treated as one observation.
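
The relationship between `FileName` and `SegFile` could be checked with a group-by, as in this sketch. The frame is a toy stand-in and `Other.txt` is a made-up file name.

```python
import pandas as pd

# Toy frame: one recording file split into two segments, another with one
toy = pd.DataFrame({
    "FileName": ["Variab10.txt", "Variab10.txt", "Other.txt"],
    "SegFile": ["CTG0001.txt", "CTG0002.txt", "CTG0003.txt"],
})

# Number of distinct segments belonging to each recording file
segments_per_file = toy.groupby("FileName")["SegFile"].nunique()
print(segments_per_file.sort_values(ascending=False))

# True when every row has its own segment identifier
print("SegFile unique per row:", toy["SegFile"].is_unique)
```

On the real data, a `SegFile` column that is unique per row supports treating each segment as one observation.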

Drop the columns that are not needed, i.e. keep only the columns listed as features on the website

In [9]:
df = df.drop(columns=['FileName','Date','A', 'B', 'C', 'D', 'E', 'AD', 'DE', 'LD', 'FS', 'SUSP','b','e','CLASS'])
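
A quick way to see how the result of this drop compares with the attribute list is a set difference on column names. The sketch below uses only names that appear in this notebook; no data is needed.

```python
# Attributes documented in the dataset description
documented = [
    "LB", "AC", "FM", "UC", "DL", "DS", "DP", "ASTV", "MSTV", "ALTV", "MLTV",
    "Width", "Min", "Max", "Nmax", "Nzeros", "Mode", "Mean", "Median",
    "Variance", "Tendency", "CLASS", "NSP",
]

# Columns left in the DataFrame after the drop above
remaining = [
    "SegFile", "LBE", "LB", "AC", "FM", "UC", "ASTV", "MSTV", "ALTV", "MLTV",
    "DL", "DS", "DP", "DR", "Width", "Min", "Max", "Nmax", "Nzeros", "Mode",
    "Mean", "Median", "Variance", "Tendency", "NSP",
]

# Columns that survive the drop but are not documented (SegFile is the identifier)
extra = sorted(set(remaining) - set(documented) - {"SegFile"})
print(extra)
```

Any name printed here is kept by the drop even though the description does not list it, which may or may not be intended.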

In [10]:
# df = df.drop(columns=['FileName','Date','CLASS','b','e',])

In [11]:
df.shape

(2126, 25)

In [12]:
df = df.set_index('SegFile')
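
Using `SegFile` as the index only makes sense if it is unique. One way to enforce that, sketched on a toy frame with a deliberate duplicate, is the `verify_integrity` flag of `set_index`, which raises a `ValueError` when the new index has repeated keys.

```python
import pandas as pd

# Toy frame with a repeated identifier to trigger the check
toy = pd.DataFrame({
    "SegFile": ["CTG0001.txt", "CTG0002.txt", "CTG0002.txt"],
    "LB": [120.0, 132.0, 133.0],
})

try:
    toy.set_index("SegFile", verify_integrity=True)
    duplicate_found = False
except ValueError as err:
    # pandas reports the duplicated keys in the error message
    duplicate_found = True
    print("Duplicate identifiers:", err)
```

On the real data the `nunique` output (2126 distinct `SegFile` values for 2126 rows) suggests the check would pass.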

In [13]:
df.shape

(2126, 24)

In [14]:
df.sample(3)

SegFile,LBE,LB,AC,FM,UC,ASTV,MSTV,ALTV,MLTV,DL,DS,DP,DR,Width,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,Tendency,NSP
CTG0830.txt,152.0,152.0,0.0,0.0,4.0,58.0,0.5,59.0,7.5,1.0,0.0,0.0,0.0,54.0,110.0,164.0,3.0,2.0,159.0,155.0,158.0,4.0,1.0,1.0
CTG2106.txt,133.0,133.0,0.0,1.0,6.0,70.0,2.0,6.0,2.5,0.0,0.0,0.0,0.0,68.0,91.0,159.0,7.0,1.0,133.0,132.0,135.0,3.0,0.0,1.0
CTG1103.txt,122.0,122.0,1.0,0.0,0.0,23.0,1.6,0.0,16.2,0.0,0.0,0.0,0.0,33.0,109.0,142.0,2.0,1.0,126.0,127.0,128.0,3.0,0.0,1.0


# 3.0 Output

In [15]:
df.to_pickle('../../data_lake/output/1_du.pkl')
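
A pickle preserves dtypes and the index, which is presumably why it is used here instead of CSV. The round trip can be verified with a sketch like the following, which writes a toy frame to a temporary directory; the file name `toy.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame indexed by SegFile, mimicking the shape of the real output
toy = pd.DataFrame(
    {"LB": [120.0, 132.0], "NSP": [2.0, 1.0]},
    index=pd.Index(["CTG0001.txt", "CTG0002.txt"], name="SegFile"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, or index differ
pd.testing.assert_frame_equal(toy, restored)
```

Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.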