# Thyroid Disease Classification

Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan, New South Wales Institute, Sydney, Australia, 1987.

http://archive.ics.uci.edu/ml/datasets/thyroid+disease

**Attribute Information:**

Classes: replacement therapy, underreplacement, overreplacement, negative

- age: continuous.
- sex: M, F.
- on thyroxine: f, t.
- query on thyroxine: f, t.
- on antithyroid medication: f, t.
- sick: f, t.
- pregnant: f, t.
- thyroid surgery: f, t.
- I131 treatment: f, t.
- query hypothyroid: f, t.
- query hyperthyroid: f, t.
- lithium: f, t.
- goitre: f, t.
- tumor: f, t.
- hypopituitary: f, t.
- psych: f, t.
- TSH measured: f, t.
- TSH: continuous.
- T3 measured: f, t.
- T3: continuous.
- TT4 measured: f, t.
- TT4: continuous.
- T4U measured: f, t.
- T4U: continuous.
- FTI measured: f, t.
- FTI: continuous.
- TBG measured: f, t.
- TBG: continuous.
- referral source: WEST, STMW, SVHC, SVI, SVHD, other.

## Imports

In [1]:
import numpy as np
import pandas as pd
import pandas_profiling

## EDA & preparation

In [2]:
fpath = "data/dataset_57_hypothyroid.csv"
df = pd.read_csv(fpath)

df.replace("?", np.nan, inplace=True)
numerical_cols = ["age", "TSH", "T3", "TT4", "T4U", "FTI", "TBG"]
df[numerical_cols] = df[numerical_cols].apply(pd.to_numeric)
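
To illustrate what the cleaning step above does, here is a minimal sketch on a toy frame. The values and the `toy_numeric_cols` name are made up for illustration and are not taken from the dataset. It assumes, as in the cell above, that `"?"` is the only missing-value marker in the file.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; "?" marks a missing entry, as in the raw CSV.
toy = pd.DataFrame(
    {
        "age": ["41", "?", "70"],
        "TSH": ["1.3", "4.1", "?"],
        "sex": ["F", "?", "M"],
    }
)

# Same two steps as the cell above: "?" -> NaN, then parse the numeric columns.
toy = toy.replace("?", np.nan)
toy_numeric_cols = ["age", "TSH"]  # hypothetical stand-in for numerical_cols
toy[toy_numeric_cols] = toy[toy_numeric_cols].apply(pd.to_numeric)

# Count missing values per column after the conversion.
missing_per_column = toy.isna().sum()
print(toy.dtypes)
print(missing_per_column)
```

An alternative with the same intent would be to pass `na_values="?"` to `pd.read_csv`, which lets pandas infer numeric dtypes while reading.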

In [3]:
profile = df.profile_report(title="Thyroid Disease Dataset", sort="None")
profile.to_file(output_file="thyroid-disease-eda.html")
profile



We'll perform the following data preparation steps:

In [4]:
# drop TBG_measured and TBG because they're constant
df.drop(columns=["TBG_measured", "TBG"], inplace=True)
numerical_cols.remove("TBG")
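
The constant columns above were identified from the profiling report. As a rough sketch, the same check could be done programmatically. The `constant_columns` helper and the toy values below are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def constant_columns(frame):
    """Return names of columns with at most one distinct non-missing value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=True) <= 1]


# Toy frame with made-up values: one constant flag, one all-missing column,
# and one column that actually varies.
toy = pd.DataFrame(
    {
        "TBG_measured": ["f", "f", "f"],
        "TBG": [np.nan, np.nan, np.nan],
        "TSH": [1.3, np.nan, 4.1],
    }
)

to_drop = constant_columns(toy)
print(to_drop)
```

Note that a column consisting only of missing values has zero distinct values, so this check flags it as well.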

In [5]:
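# standardize numerical columns to zero mean and unit standard deviation
def scale(x):
    # pandas skips NaN in mean() and std(); std() is the sample std (ddof=1)
    centered = x - x.mean()
    return centered / x.std()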


df[numerical_cols] = df[numerical_cols].apply(scale)
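
As a quick sanity check of the standardization above, the sketch below applies the same `scale` function to a toy series with made-up values. It relies on pandas defaults: `mean()` and `std()` skip `NaN`, and `std()` uses the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd


def scale(x):
    # Same definition as in the cell above.
    centered = x - x.mean()
    return centered / x.std()


# Toy series with made-up values: mean of the non-missing entries is 2.0,
# and their sample standard deviation is 1.0.
toy = pd.Series([1.0, 2.0, 3.0, np.nan])
scaled = scale(toy)

print(scaled.tolist())  # [-1.0, 0.0, 1.0, nan]
print(scaled.mean(), scaled.std())
```

Missing values stay missing after scaling, which matches the `NaN` entries that remain in the standardized columns of `df`.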

In [6]:
df.sample(3)

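| | age | sex | on_thyroxine | query_on_thyroxine | on_antithyroid_medication | sick | pregnant | thyroid_surgery | I131_treatment | query_hypothyroid | ... | T3_measured | T3 | TT4_measured | TT4 | T4U_measured | T4U | FTI_measured | FTI | referral_source | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1910 | -0.6341 | F | f | f | f | f | f | f | f | t | ... | t | -0.137171 | t | -0.289835 | t | -0.588362 | t | -0.014193 | other | negative |
| 2051 | 0.112727 | M | f | f | f | f | f | f | f | f | ... | t | -0.016315 | t | -0.205575 | t | 0.486041 | t | -0.558169 | SVI | negative |
| 2335 | 0.511035 | F | f | f | f | f | f | f | f | f | ... | f | NaN | f | NaN | f | NaN | f | NaN | other | negative |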
