# Preprocessing of the Hypothyroid dataset

In this notebook we describe the preprocessing steps we apply to the Hypothyroid dataset to prepare it for the clustering algorithms.

In [6]:
# Imports
import auxiliary
import pandas as pd

# Useful methods

## Introduction

We first load the dataset and take a quick look at the data.

In [7]:
# Load dataset
data, metadata = auxiliary.load_arff('hypothyroid')
data = pd.DataFrame(data)
data.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,Class
0,41.0,b'F',b'f',b'f',b'f',b'f',b'f',b'f',b'f',b'f',...,b't',125.0,b't',1.14,b't',109.0,b'f',,b'SVHC',b'negative'
1,23.0,b'F',b'f',b'f',b'f',b'f',b'f',b'f',b'f',b'f',...,b't',102.0,b'f',,b'f',,b'f',,b'other',b'negative'
2,46.0,b'M',b'f',b'f',b'f',b'f',b'f',b'f',b'f',b'f',...,b't',109.0,b't',0.91,b't',120.0,b'f',,b'other',b'negative'
3,70.0,b'F',b't',b'f',b'f',b'f',b'f',b'f',b'f',b'f',...,b't',175.0,b'f',,b'f',,b'f',,b'other',b'negative'
4,70.0,b'F',b'f',b'f',b'f',b'f',b'f',b'f',b'f',b'f',...,b't',61.0,b't',0.87,b't',70.0,b'f',,b'SVI',b'negative'
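
The nominal values in the output above are byte strings (`b'F'`, `b'f'`, `b'negative'`). A minimal sketch of how they could be decoded to regular strings is shown below. The `decode_bytes` helper and the toy frame are assumptions made for illustration; they are not part of `auxiliary`.

```python
import pandas as pd

# Toy frame mimicking a few columns of the loaded dataset (illustrative only).
toy = pd.DataFrame({
    'age': [41.0, 23.0],
    'sex': [b'F', b'M'],
    'Class': [b'negative', b'negative'],
})

def decode_bytes(df):
    """Return a copy of df with byte-string cells decoded to str (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include='object').columns:
        # Decode only bytes values; leave NaN and other objects untouched.
        out[col] = out[col].map(
            lambda v: v.decode('utf-8') if isinstance(v, bytes) else v
        )
    return out

decoded = decode_bytes(toy)
print(decoded)
```

Decoding once after loading keeps later comparisons such as `data['sex'] == 'F'` from failing silently against byte values.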


In [10]:
data.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Columns: 30 entries, age to Class
dtypes: float64(7), object(23)
memory usage: 884.2+ KB


In [8]:
print(metadata)

Dataset: hypothyroid
	age's type is numeric
	sex's type is nominal, range is ('F', 'M')
	on_thyroxine's type is nominal, range is ('f', 't')
	query_on_thyroxine's type is nominal, range is ('f', 't')
	on_antithyroid_medication's type is nominal, range is ('f', 't')
	sick's type is nominal, range is ('f', 't')
	pregnant's type is nominal, range is ('f', 't')
	thyroid_surgery's type is nominal, range is ('f', 't')
	I131_treatment's type is nominal, range is ('f', 't')
	query_hypothyroid's type is nominal, range is ('f', 't')
	query_hyperthyroid's type is nominal, range is ('f', 't')
	lithium's type is nominal, range is ('f', 't')
	goitre's type is nominal, range is ('f', 't')
	tumor's type is nominal, range is ('f', 't')
	hypopituitary's type is nominal, range is ('f', 't')
	psych's type is nominal, range is ('f', 't')
	TSH_measured's type is nominal, range is ('t', 'f')
	TSH's type is numeric
	T3_measured's type is nominal, range is ('t', 'f')
	T3's type is numeric
	TT4_measured's type 

In [None]:
# Column groups for later operations
numeric_cols = ['age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI', 'TBG']
data_numeric = data[numeric_cols]
data_nominal = data.drop(columns=numeric_cols + ['Class'])
data_class = data['Class']

In summary, the dataset has the following characteristics:
- It has a total of 3772 samples and 29 features (we don't count the class column).
- It has 7 numerical features and 22 nominal features (we don't count the class column).
- 20 of the nominal features have only 2 different values.
- The feature to predict is nominal and has 4 different values.
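
The characteristics listed above could be derived programmatically with a sketch like the following. It runs on a small toy frame whose values are made up for illustration; applied to the real `data` frame, the same steps would be expected to reproduce the counts in the list.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the dataset (values are illustrative).
toy = pd.DataFrame({
    'age': [41.0, 23.0, 46.0, 70.0],
    'TSH': [1.3, 4.1, 0.98, 0.16],
    'sex': [b'F', b'F', b'M', b'F'],
    'referral_source': [b'SVHC', b'other', b'other', b'SVI'],
    'Class': [b'negative', b'negative', b'negative', b'primary_hypothyroid'],
})

# Exclude the class column, then split the features by dtype.
features = toy.drop(columns='Class')
numeric = features.select_dtypes(include='number')
nominal = features.select_dtypes(exclude='number')

summary = {
    'samples': len(toy),
    'features': features.shape[1],
    'numeric': numeric.shape[1],
    'nominal': nominal.shape[1],
    # Nominal features that take exactly two distinct values.
    'binary_nominal': int((nominal.nunique() == 2).sum()),
    'classes': int(toy['Class'].nunique()),
}
print(summary)
```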

## Preprocessing

We divide the preprocessing into the following steps:
- Check the class feature
- Check the numerical features one by one
- Check the nominal features one by one

### Check the class feature

In this subsection we analyse the feature to predict. 

We start by checking the number of samples for each possible class (in absolute and relative terms).

In [17]:
print(data['Class'].value_counts(), '\n')
print(data['Class'].value_counts(normalize=True))

b'negative'                   3481
b'compensated_hypothyroid'     194
b'primary_hypothyroid'          95
b'secondary_hypothyroid'         2
Name: Class, dtype: int64 

b'negative'                   0.922853
b'compensated_hypothyroid'    0.051432
b'primary_hypothyroid'        0.025186
b'secondary_hypothyroid'      0.000530
Name: Class, dtype: float64


We can clearly see that the dataset is highly imbalanced: 92.29% of the samples belong to a single class.
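
To put a number on the imbalance, the sketch below recomputes the class shares from the counts printed above and adds the ratio between the largest and the smallest class. The imbalance ratio is an extra summary introduced here for illustration; it is not computed in the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({
    'negative': 3481,
    'compensated_hypothyroid': 194,
    'primary_hypothyroid': 95,
    'secondary_hypothyroid': 2,
})

# Relative frequency of each class.
shares = counts / counts.sum()

# Share of the majority class and ratio of largest to smallest class.
majority_share = shares.max()
imbalance_ratio = counts.max() / counts.min()

print(shares.round(6))
print(majority_share, imbalance_ratio)
```

With only 2 samples in the smallest class, any per-class statistic for `secondary_hypothyroid` should be read with caution.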

Next, we check the correlation with the numeric features.

In [None]:
corr = data_numeric.corr()
corr.style.background_gradient(cmap='coolwarm')
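
The cell above gives the pairwise correlation among the numeric features. If the goal is to relate the class itself to those features, one possible approach (an assumption, not something the notebook does) is to correlate a 0/1 class indicator with each numeric column, which amounts to a point-biserial correlation. The toy values below are made up for illustration.

```python
import pandas as pd

# Toy data: two numeric features and a class label (illustrative values only).
toy = pd.DataFrame({
    'TSH': [1.3, 0.9, 2.1, 45.0, 60.0],
    'TT4': [125.0, 102.0, 109.0, 30.0, 25.0],
    'Class': ['negative', 'negative', 'negative',
              'primary_hypothyroid', 'primary_hypothyroid'],
})

# 1.0 for the majority class, 0.0 for everything else.
is_negative = (toy['Class'] == 'negative').astype(float)

# Pearson correlation of each numeric column with the class indicator.
class_corr = toy[['TSH', 'TT4']].corrwith(is_negative)
print(class_corr)
```

With four classes, the same idea can be repeated with one indicator per class, although the two `secondary_hypothyroid` samples are too few for a stable estimate.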