# Autumn 2024 ADSP 31009 ON1 Machine Learning & Predictive Analytics Final Project - Thyroid Disease Prediction

##  Background: _____
Reading:
- https://link.springer.com/article/10.1007/s44196-023-00388-2
- https://iopscience.iop.org/article/10.1088/1742-6596/1963/1/012140/pdf

## Dataset Background: https://archive.ics.uci.edu/dataset/102/thyroid+disease

## Dataset Download: https://archive.ics.uci.edu/static/public/102/thyroid+disease.zip

To download the zip folder from the source

In [27]:
import requests 

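def download_url(url, save_path, chunk_size=128):
    """Stream the file at `url` to `save_path` in chunks."""
    r = requests.get(url, stream=True, timeout=60)
    # Fail loudly on an HTTP error instead of saving an error page as the zip file
    r.raise_for_status()
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)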
            
source_zip_file_url = 'https://archive.ics.uci.edu/static/public/102/thyroid+disease.zip'
local_zip_file_location = "./thyroid+disease.zip"
download_url(source_zip_file_url, local_zip_file_location)

To extract the zip archive file into a folder

In [29]:
import zipfile

local_folder_location = "./thyroid+disease"
# The context manager closes the archive once extraction finishes
with zipfile.ZipFile(file=local_zip_file_location, mode='r') as zip_file:
    zip_file.extractall(local_folder_location)

## Similar projects done on Thyroid Disease Prediction:
- https://www.kaggle.com/code/adiii1652/thyroid-disease-analysis
- https://www.kaggle.com/code/anubhavmaverick/thyroid-missing-value-algorithm

## Start of Project Analysis

### Import necessary packages

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.linear_model import BayesianRidge
# from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

### For clarity, an additional step reorganizes the dataset. The following steps are performed after the .zip file is unzipped
1. New folders `ann`, `archive_thyroid_disease`, `domain_knowledge`, `new_thyroid`, and `questionable_data` are created under the `thyroid+disease` folder (the code to do so programmatically is given below).
2. The related files are moved into those folders. The resulting file structure is shown below.

Code to programmatically create the folders and move the files

In [36]:
base_dir = "./PotentialDatasets/HealthRelated/thyroid+disease/"
!mkdir {base_dir}/ann
!mkdir {base_dir}/archive_thyroid_disease
!mkdir {base_dir}/domain_knowledge
!mkdir {base_dir}/new_thyroid
!mkdir {base_dir}/questionable_data

!find {base_dir} -maxdepth 1 -name "ann*" ! -path {base_dir}/ann -exec mv {{}} {base_dir}/ann \;
!mv {base_dir}/thyroid0387* {base_dir}/archive_thyroid_disease/
!mv {base_dir}/thyroid.theory {base_dir}/domain_knowledge/
!mv {base_dir}/new-thyroid* {base_dir}/new_thyroid
!mv {base_dir}/hypothyroid* {base_dir}/questionable_data/
!mv {base_dir}/sick-euthyroid* {base_dir}/questionable_data/

mkdir: ./PotentialDatasets/HealthRelated/thyroid+disease//ann: File exists
mkdir: ./PotentialDatasets/HealthRelated/thyroid+disease//archive_thyroid_disease: File exists
mkdir: ./PotentialDatasets/HealthRelated/thyroid+disease//domain_knowledge: File exists
mkdir: ./PotentialDatasets/HealthRelated/thyroid+disease//new_thyroid: File exists
mkdir: ./PotentialDatasets/HealthRelated/thyroid+disease//questionable_data: File exists
zsh:1: no matches found: ./PotentialDatasets/HealthRelated/thyroid+disease//thyroid0387*
mv: ./PotentialDatasets/HealthRelated/thyroid+disease//thyroid.theory: No such file or directory
zsh:1: no matches found: ./PotentialDatasets/HealthRelated/thyroid+disease//new-thyroid*
zsh:1: no matches found: ./PotentialDatasets/HealthRelated/thyroid+disease//hypothyroid*
zsh:1: no matches found: ./PotentialDatasets/HealthRelated/thyroid+disease//sick-euthyroid*
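
The errors above appear when the cell is re-run after the files have already been moved. As a minimal sketch of a re-runnable alternative, the same reorganization could be written in Python with `pathlib` and `shutil`. The `FOLDER_PATTERNS` mapping and the `reorganize` function are hypothetical names introduced here for illustration. The patterns mirror the shell commands above.

```python
from pathlib import Path
import shutil

# Hypothetical mapping from target sub-folder to the glob patterns of the files it collects
FOLDER_PATTERNS = {
    "ann": ["ann*"],
    "archive_thyroid_disease": ["thyroid0387*"],
    "domain_knowledge": ["thyroid.theory"],
    "new_thyroid": ["new-thyroid*"],
    "questionable_data": ["hypothyroid*", "sick-euthyroid*"],
}


def reorganize(base_dir, folder_patterns=FOLDER_PATTERNS):
    """Create each sub-folder if needed and move matching top-level files into it."""
    base = Path(base_dir)
    moved = []
    for folder, patterns in folder_patterns.items():
        target = base / folder
        # exist_ok=True means no error when the folder already exists
        target.mkdir(parents=True, exist_ok=True)
        for pattern in patterns:
            for path in sorted(base.glob(pattern)):
                # Skip directories, e.g. the `ann` folder itself matches `ann*`
                if path.is_file():
                    shutil.move(str(path), str(target / path.name))
                    moved.append(path.name)
    return moved
```

Calling `reorganize(base_dir)` a second time moves nothing and raises no error, because the folders already exist and no top-level files match the patterns any more.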


File structure of the folder

In [38]:
# From https://stackoverflow.com/questions/9727673/list-directory-tree-structure-in-python
from pathlib import Path

# prefix components:
space =  '    '
branch = '│   '
# pointers:
tee =    '├── '
last =   '└── '


def tree(dir_path: Path, prefix: str=''):
    """A recursive generator, given a directory Path object
    will yield a visual tree structure line by line
    with each line prefixed by the same characters
    """    
    contents = list(dir_path.iterdir())
    # contents each get pointers that are ├── with a final └── :
    pointers = [tee] * (len(contents) - 1) + [last]
    for pointer, path in zip(pointers, contents):
        yield prefix + pointer + path.name
        if path.is_dir(): # extend the prefix and recurse:
            extension = branch if pointer == tee else space 
            # i.e. space because last, └── , above so no more |
            yield from tree(path, prefix=prefix+extension)

In [39]:
for line in tree(Path("./PotentialDatasets/HealthRelated/thyroid+disease/")):
    print(line)

├── ann
│   ├── ann-train.data
│   ├── ann-Readme
│   ├── ann-thyroid.names
│   └── ann-test.data
├── costs
│   ├── ann-thyroid.README
│   ├── ann-thyroid.delay
│   ├── ann-thyroid.expense
│   ├── ann-thyroid.group
│   ├── Index
│   └── ann-thyroid.cost
├── allrep.names
├── archive_thyroid_disease
│   ├── thyroid0387.names
│   └── thyroid0387.data
├── .DS_Store
├── allbp.test
├── allhypo.names
├── allhypo.test
├── dis.data
├── allrep.data
├── allhyper.data
├── sick.names
├── sick.data
├── dis.names
├── allhyper.test
├── allrep.test
├── dis.test
├── sick.test
├── new_thyroid
│   ├── new-thyroid.names
│   └── new-thyroid.data
├── questionable_data
│   ├── sick-euthyroid.names
│   ├── hypothyroid.data
│   ├── sick-euthyroid.data
│   ├── hypothyroid.names
│   └── .ipynb_checkpoints
│       ├── hypothyroid-checkpoint.names
│       └── hypothyroid-checkpoint.data
├── allhypo.data
├── domain_knowledge
│   ├── thyroid.theory
│   └── .ipynb_checkpoints
│       └── thyroid-checkpoint.theory
├── 

### Dataset description

In [41]:
from IPython.display import Markdown, display

display(Markdown("./PotentialDatasets/HealthRelated/thyroid+disease/HELLO"))

          General Description of Thyroid Disease Databases 
                        and Related Files

This directory contains 6 databases, corresponding test set, and 
corresponding documentation.  They were left at the University of
California at Irvine by Ross Quinlan during his visit in 1987 for
the 1987 Machine Learning Workshop.  

The documentation files (with file extension "names") are formatted to
be read by Quinlan's C4 decision tree program.  Though briefer than
the other documentation files found in this database repository, they
should suffice to describe the database, specifically:

    1. Source
    2. Number and names of attributes (including class names)
    3. Types of values that each attribute takes

In general, these databases are quite similar and can be characterized
somewhat as follows:

    1. Many attributes (29 or so, mostly the same set over all the databases)
    2. mostly numeric or Boolean valued attributes
    3. thyroid disease domains (records provided by the Garavan Institute
       of Sydney, Australia)
    4. several missing attribute values (signified by "?")
    5. small number of classes (under 10, changes with each database)
    7. 2800 instances in each data set
    8. 972 instances in each test set (It seems that the test sets' instances
       are disjoint with respect to the corresponding data sets, but this has 
       not been verified)

See the following for a discussion of relevant experiments and related work:
   Quinlan,J.R., Compton,P.J., Horn,K.A., & Lazurus,L. (1986).
   Inductive knowledge acquisition: A case study.
   In Proceedings of the Second Australian Conference on Applications
   of Expert Systems.  Sydney, Australia.

   Quinlan,J.R. (1986). Induction of decision trees. Machine Learning,
   1, 81--106.

Note that the instances in these databases are followed by a vertical
bar and a number.  These appear to be a patient id number.  The vertical
bar is interepreted by Quinlan's algorithms as "ignore the remainder of
this line". 

======================================================================
This database now also contains an additional two data files, named 
hypothyroid.data and sick-euthyroid.data.  They have approximately the
same data format and set of attributes as the other 6 databases, but
their integrity is questionable.  Ross Quinlan is concerned that they
may have been corrupted since they first arrived at UCI, but we have not
yet established the validity of this possibility.  These 2 databases differ
in terms of their number of instances (3163) and lack of corresponding 
test files.  They each have 2 concepts (negative/hypothyroid and 
sick-euthyroid/negative respectively).  Their source also appears to
be the Garavan institute.  Each contains several missing values.

Another relatively recent file thyroid0387.data has been added that 
contains the latest version of an archive of thyroid diagnoses obtained 
from the Garvan Institute, consisting of 9172 records from 1984 to early 1987.

A domain theory related to thyroid desease has also been added recently 
(thyroid.theory).

The files new-thyroid.[names,data] were donated by Stefan Aberhard.





### Load in the dataset

Examine the 6 data files

In [44]:
!find {base_dir} -maxdepth 1 -name "*.data" 

./PotentialDatasets/HealthRelated/thyroid+disease//dis.data
./PotentialDatasets/HealthRelated/thyroid+disease//allrep.data
./PotentialDatasets/HealthRelated/thyroid+disease//allhyper.data
./PotentialDatasets/HealthRelated/thyroid+disease//sick.data
./PotentialDatasets/HealthRelated/thyroid+disease//allhypo.data
./PotentialDatasets/HealthRelated/thyroid+disease//allbp.data


In [45]:
df = pd.read_csv("./PotentialDatasets/HealthRelated/thyroid+disease/allbp.data", header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,41,F,f,f,f,f,f,f,f,f,...,t,125,t,1.14,t,109,f,?,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,...,t,102,f,?,f,?,f,?,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,...,t,109,t,0.91,t,120,f,?,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,...,t,175,f,?,f,?,f,?,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,...,t,61,t,0.87,t,70,f,?,SVI,negative.|2807


Parse the .names file to create column names and data dictionary

In [47]:
column_name_file = "./PotentialDatasets/HealthRelated/thyroid+disease/allbp.names"

col_names = []
data_dictionary = {}

with open(column_name_file, mode = 'r', encoding = 'utf-8-sig') as f:
    # First 6 lines are descriptions so ignore. 7th line is empty line so doesn't matter either
    for times in range(7):
        _ = f.readline()
    
    # Read in remaining lines into a variable
    list_of_string_in_line = f.readlines()
    
    # Build up a list of column name, and a dictionary for data
    for string_in_line in list_of_string_in_line:
        # The column name are identified by ':'
        if ':' in string_in_line:
            # The first part of the line tells me what the column name is
            col_name = string_in_line.split(":")[0]
            # Replace spaces with the character '_' to normalize the column name
            if " " in col_name:
                col_name = col_name.replace(' ', '_')
            
            # Populate the column name for the dataset
            col_names.append(col_name)
            
            # The 2nd part of the line tells me what are the values that are in the column
            values_for_col = string_in_line.split(":")[1]
            # Cleanup the splitting characters
            values_for_col = values_for_col.replace("\t", "").replace('\n', '').replace('.', '').replace(' ', '')
            # Split up the string to identify individual values allowed for column
            values_for_col = values_for_col.split(",")
            
            # Create a dictionary of the data
            data_dictionary[col_name] = values_for_col
            
    # Class is the last column in the dataset
    col_names.append('class')
    
    # Find out where the word `class` is to identify the class variable
    index_of_class = ['class' in string_in_line for string_in_line in list_of_string_in_line].index(True)
    # print(index_of_class.index(True))
    
    # Concat the strings from start to identify the class options 
    class_options = ''.join(list_of_string_in_line[:index_of_class+1])
    # Cleanup the splitting characters
    class_options = class_options.replace("\t", "").replace('\n', '').replace('.', '')
    
    # Find out location of the | character to find out the classes that are represented in dataset
    index_of_pipe = class_options.index('|')
    # Split up the string to identify individual values allowed for class
    class_options = class_options[:index_of_pipe].split(",")
    
    # Populate data dictionary class key values
    data_dictionary['class'] = [class_option.strip().replace(' ', '_') for class_option in class_options]

    
print('The list of column names include the following')
print(col_names)

print('Below is the dictionary of form (key, value) \
where key are the column names, \
value are the allowable values for each column')
print(data_dictionary)

The list of column names include the following
['age', 'sex', 'on_thyroxine', 'query_on_thyroxine', 'on_antithyroid_medication', 'sick', 'pregnant', 'thyroid_surgery', 'I131_treatment', 'query_hypothyroid', 'query_hyperthyroid', 'lithium', 'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH_measured', 'TSH', 'T3_measured', 'T3', 'TT4_measured', 'TT4', 'T4U_measured', 'T4U', 'FTI_measured', 'FTI', 'TBG_measured', 'TBG', 'referral_source', 'class']
Below is the dictionary of form (key, value) where key are the column names, value are the allowable values for each column
{'age': ['continuous'], 'sex': ['M', 'F'], 'on_thyroxine': ['f', 't'], 'query_on_thyroxine': ['f', 't'], 'on_antithyroid_medication': ['f', 't'], 'sick': ['f', 't'], 'pregnant': ['f', 't'], 'thyroid_surgery': ['f', 't'], 'I131_treatment': ['f', 't'], 'query_hypothyroid': ['f', 't'], 'query_hyperthyroid': ['f', 't'], 'lithium': ['f', 't'], 'goitre': ['f', 't'], 'tumor': ['f', 't'], 'hypopituitary': ['f', 't'], 'psych': ['f',

Sanity check that the number of extracted column names matches the number of columns in the dataframe

In [49]:
assert len(col_names) == len(df.columns)

Set the column name of the dataframe

In [51]:
df.columns = col_names
df.head()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,class
0,41,F,f,f,f,f,f,f,f,f,...,t,125,t,1.14,t,109,f,?,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,...,t,102,f,?,f,?,f,?,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,...,t,109,t,0.91,t,120,f,?,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,...,t,175,f,?,f,?,f,?,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,...,t,61,t,0.87,t,70,f,?,SVI,negative.|2807


### Data Exploration

In [53]:
df.describe()

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG,referral_source,class
count,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800,...,2800,2800,2800,2800,2800,2800,2800,2800,2800,2800
unique,94,3,2,2,2,2,2,2,2,2,...,2,218,2,139,2,210,1,1,5,2800
top,59,F,f,f,f,f,f,f,f,f,...,t,?,t,?,t,?,f,?,other,negative.|3733
freq,75,1830,2470,2760,2766,2690,2759,2761,2752,2637,...,2616,184,2503,297,2505,295,2800,2800,1632,1


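Notice that many columns need preprocessing and cleaning: `describe()` reports only `count`, `unique`, `top`, and `freq` because every column was read in as text, `sex` has 3 unique values instead of 2, and `class` has 2800 unique values because an id is appended to each label.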

In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   age                        2800 non-null   object
 1   sex                        2800 non-null   object
 2   on_thyroxine               2800 non-null   object
 3   query_on_thyroxine         2800 non-null   object
 4   on_antithyroid_medication  2800 non-null   object
 5   sick                       2800 non-null   object
 6   pregnant                   2800 non-null   object
 7   thyroid_surgery            2800 non-null   object
 8   I131_treatment             2800 non-null   object
 9   query_hypothyroid          2800 non-null   object
 10  query_hyperthyroid         2800 non-null   object
 11  lithium                    2800 non-null   object
 12  goitre                     2800 non-null   object
 13  tumor                      2800 non-null   object
 14  hypopitu

Notice that every column has dtype `object`, so the columns need to be converted to appropriate data types

Find the number of unique values in each column

In [58]:
df.nunique()

age                            94
sex                             3
on_thyroxine                    2
query_on_thyroxine              2
on_antithyroid_medication       2
sick                            2
pregnant                        2
thyroid_surgery                 2
I131_treatment                  2
query_hypothyroid               2
query_hyperthyroid              2
lithium                         2
goitre                          2
tumor                           2
hypopituitary                   2
psych                           2
TSH_measured                    2
TSH                           264
T3_measured                     2
T3                             65
TT4_measured                    2
TT4                           218
T4U_measured                    2
T4U                           139
FTI_measured                    2
FTI                           210
TBG_measured                    1
TBG                             1
referral_source                 5
class         

Examine the predictive/independent variables

In [60]:
for col_name in df.columns:
    # Separate the class column individually to examine
    if col_name == 'class':
        continue
    print(f'Column {col_name} has the following values:')
    print(list(df[col_name].unique()))

Column age has the following values:
['41', '23', '46', '70', '18', '59', '80', '66', '68', '84', '67', '71', '28', '65', '42', '63', '51', '81', '54', '55', '60', '25', '73', '34', '78', '37', '85', '26', '58', '64', '44', '48', '61', '35', '83', '21', '87', '53', '77', '27', '69', '74', '38', '76', '45', '36', '22', '43', '72', '82', '31', '39', '49', '62', '57', '1', '50', '30', '29', '75', '19', '7', '79', '17', '24', '15', '32', '47', '16', '52', '33', '13', '10', '89', '56', '20', '90', '40', '88', '14', '86', '94', '12', '4', '11', '8', '5', '455', '2', '91', '6', '?', '93', '92']
Column sex has the following values:
['F', 'M', '?']
Column on_thyroxine has the following values:
['f', 't']
Column query_on_thyroxine has the following values:
['f', 't']
Column on_antithyroid_medication has the following values:
['f', 't']
Column sick has the following values:
['f', 't']
Column pregnant has the following values:
['f', 't']
Column thyroid_surgery has the following values:
['f', 't']


Separately examine the outcome/dependent variable, the `class` column

In [62]:
print(*df['class'].unique().tolist(), sep='\n')

negative.|3733
negative.|1442
negative.|2965
negative.|806
negative.|2807
negative.|3434
negative.|1595
negative.|1367
negative.|1787
negative.|2534
negative.|1485
negative.|3448
negative.|1027
negative.|3331
negative.|2043
negative.|3169
negative.|2755
negative.|1010
negative.|803
negative.|2297
negative.|3564
negative.|152
negative.|936
negative.|716
negative.|1933
negative.|3445
negative.|3724
increased binding protein.|185
negative.|1966
negative.|466
negative.|1091
negative.|583
negative.|2137
negative.|1400
negative.|1815
negative.|3659
negative.|2797
negative.|3317
negative.|1566
negative.|1821
negative.|2427
negative.|1433
negative.|1632
negative.|563
negative.|1382
negative.|303
negative.|1428
negative.|3491
negative.|1873
negative.|2601
negative.|2235
negative.|1172
negative.|525
negative.|1690
negative.|3469
increased binding protein.|942
negative.|3266
negative.|251
negative.|2691
negative.|822
negative.|3007
negative.|2810
negative.|3038
negative.|694
negative.|1959
negati

Examining missing values in data

In [64]:
df.isna().any()

age                          False
sex                          False
on_thyroxine                 False
query_on_thyroxine           False
on_antithyroid_medication    False
sick                         False
pregnant                     False
thyroid_surgery              False
I131_treatment               False
query_hypothyroid            False
query_hyperthyroid           False
lithium                      False
goitre                       False
tumor                        False
hypopituitary                False
psych                        False
TSH_measured                 False
TSH                          False
T3_measured                  False
T3                           False
TT4_measured                 False
TT4                          False
T4U_measured                 False
T4U                          False
FTI_measured                 False
FTI                          False
TBG_measured                 False
TBG                          False
referral_source     

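`isna()` reports no NA values in any column, but this does not mean the data is complete. The dataset description states that missing attribute values are signified by `?`, and pandas reads `?` as an ordinary string. The `?` entries are identified and handled in the Data Cleaning section below.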
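
As a sketch of how pandas could be told about the `?` marker at load time, `read_csv` accepts an `na_values` argument. The three rows below are made up for illustration and trimmed to four columns. They are not taken from the dataset files.

```python
import io
import pandas as pd

# Three made-up rows in the same comma-separated layout, trimmed to four columns
text = "41,F,1.14,negative.|3733\n23,?,?,negative.|1442\n?,M,0.91,negative.|2965\n"

# Read as-is: '?' stays an ordinary string, so isna() finds nothing
raw = pd.read_csv(io.StringIO(text), header=None)
# Read with na_values: '?' is converted to NaN on load
parsed = pd.read_csv(io.StringIO(text), header=None, na_values="?")

print(raw.isna().sum().tolist())     # → [0, 0, 0, 0]
print(parsed.isna().sum().tolist())  # → [1, 1, 1, 0]
```

With `na_values="?"`, columns that hold only numbers and `?` are parsed as numeric right away, so `isna()` and `describe()` reflect the real amount of missing data.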

### Data Cleaning

The columns `TBG_measured` and `TBG` each hold a single constant value (`f` and `?` respectively) across all 6 datasets, so they provide no information for predicting the outcome variable and are dropped

In [68]:
df.drop(columns=['TBG_measured', 'TBG'], inplace=True)
df

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,referral_source,class
0,41,F,f,f,f,f,f,f,f,f,...,t,2.5,t,125,t,1.14,t,109,SVHC,negative.|3733
1,23,F,f,f,f,f,f,f,f,f,...,t,2,t,102,f,?,f,?,other,negative.|1442
2,46,M,f,f,f,f,f,f,f,f,...,f,?,t,109,t,0.91,t,120,other,negative.|2965
3,70,F,t,f,f,f,f,f,f,f,...,t,1.9,t,175,f,?,f,?,other,negative.|806
4,70,F,f,f,f,f,f,f,f,f,...,t,1.2,t,61,t,0.87,t,70,SVI,negative.|2807
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2795,70,M,f,f,f,f,f,f,f,f,...,f,?,t,155,t,1.05,t,148,SVI,negative.|3689
2796,73,M,f,t,f,f,f,f,f,f,...,t,0.7,t,63,t,0.88,t,72,other,negative.|3652
2797,75,M,f,f,f,f,f,f,f,f,...,f,?,t,147,t,0.8,t,183,other,negative.|1287
2798,60,F,f,f,f,f,f,f,f,f,...,f,?,t,100,t,0.83,t,121,other,negative.|3496


Sanity check that the columns were removed

In [70]:
assert all([dropped_column not in df.columns for dropped_column in ['TBG_measured', 'TBG']])

Clean up the `class` column so that it holds only the class label, with spaces replaced by underscores

In [72]:
df['class'] = df['class'].apply(lambda x: x[:x.index('.')])
df['class'] = df['class'].apply(lambda x: x.replace(' ', '_'))

In [73]:
print(df['class'].head())
print(df['class'].unique())

0    negative
1    negative
2    negative
3    negative
4    negative
Name: class, dtype: object
['negative' 'increased_binding_protein' 'decreased_binding_protein']
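
The cleanup above discards the number after the vertical bar, which the dataset description says appears to be a patient id. If that id were needed later, one possible approach (a sketch, not what this notebook does) is to split the raw value into two columns with `str.extract`. The column names `label` and `patient_id` are assumptions made for this example. The two input values are copied from the raw output shown earlier.

```python
import pandas as pd

# Two raw values in the `label.|id` format shown in the earlier output
raw_class = pd.Series(["negative.|3733", "increased binding protein.|185"])

# Capture the label before '.|' and the digits after it as named groups
parts = raw_class.str.extract(r"^(?P<label>.+?)\.\|(?P<patient_id>\d+)$")
parts["label"] = parts["label"].str.replace(" ", "_")
parts["patient_id"] = parts["patient_id"].astype(int)

print(parts)
```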


Clean up the columns with `t` and `f` values so that they hold `True` and `False`

In [75]:
# A set comparison matches the two values regardless of the order in which they appear
columns_with_t_f = [col_name for col_name in df.columns
                    if set(df[col_name].unique()) == {'f', 't'}]
for column_with_t_f in columns_with_t_f:
    df[column_with_t_f] = df[column_with_t_f].map({'f': False, 't': True})
    # df[column_with_t_f] = df[column_with_t_f].apply(lambda x: 0 if x == 'f' else 1)

In [76]:
print(df.head())
print(df['on_thyroxine'].info())

  age sex  on_thyroxine  query_on_thyroxine  on_antithyroid_medication   sick  \
0  41   F         False               False                      False  False   
1  23   F         False               False                      False  False   
2  46   M         False               False                      False  False   
3  70   F          True               False                      False  False   
4  70   F         False               False                      False  False   

   pregnant  thyroid_surgery  I131_treatment  query_hypothyroid  ...  \
0     False            False           False              False  ...   
1     False            False           False              False  ...   
2     False            False           False              False  ...   
3     False            False           False              False  ...   
4     False            False           False              False  ...   

   T3_measured   T3  TT4_measured  TT4  T4U_measured   T4U  FTI_measured  FTI  \

Identify the columns that contain the `?` value

In [78]:
columns_with_missing_val = df.columns[df.isin(['?']).any()]
columns_with_missing_val

Index(['age', 'sex', 'TSH', 'T3', 'TT4', 'T4U', 'FTI'], dtype='object')

Convert each `?` value to the string value `-1`

In [80]:
for column_with_missing_val in columns_with_missing_val:
    df[column_with_missing_val] = df[column_with_missing_val].apply(lambda x: '-1' if x == '?' else x)

Sanity check that every `?` was converted to the string value `-1`

In [82]:
# `'?' in df[column]` tests the index labels, not the values, so check the values explicitly
assert not df.isin(['?']).any().any()

Identify the columns where values are supposed to be continuous

In [84]:
columns_with_continuous = [key for key in data_dictionary.keys() if data_dictionary[key] == ['continuous']]
columns_with_continuous.remove('TBG')
columns_with_continuous

['age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI']

Convert the columns that are expected to hold continuous values to numeric types

In [86]:
for column_with_continuous in columns_with_continuous:
    # pd.to_numeric yields int64 for whole-number columns and float64 otherwise
    df[column_with_continuous] = pd.to_numeric(df[column_with_continuous])

Sanity check that the cleaning produced numeric and boolean columns

In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        2800 non-null   int64  
 1   sex                        2800 non-null   object 
 2   on_thyroxine               2800 non-null   bool   
 3   query_on_thyroxine         2800 non-null   bool   
 4   on_antithyroid_medication  2800 non-null   bool   
 5   sick                       2800 non-null   bool   
 6   pregnant                   2800 non-null   bool   
 7   thyroid_surgery            2800 non-null   bool   
 8   I131_treatment             2800 non-null   bool   
 9   query_hypothyroid          2800 non-null   bool   
 10  query_hyperthyroid         2800 non-null   bool   
 11  lithium                    2800 non-null   bool   
 12  goitre                     2800 non-null   bool   
 13  tumor                      2800 non-null   bool 

Examine the dataset after cleaning

In [89]:
df.describe(include='all')

Unnamed: 0,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,...,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,referral_source,class
count,2800.0,2800,2800,2800,2800,2800,2800,2800,2800,2800,...,2800,2800.0,2800,2800.0,2800,2800.0,2800,2800.0,2800,2800
unique,,3,2,2,2,2,2,2,2,2,...,2,,2,,2,,2,,5,3
top,,F,False,False,False,False,False,False,False,False,...,True,,True,,True,,True,,other,negative
freq,,1830,2470,2760,2766,2690,2759,2761,2752,2637,...,2215,,2616,,2503,,2505,,1632,2667
mean,51.825357,,,,,,,,,,...,,1.392964,,101.839071,,0.785991,,99.010321,,
std,20.481866,,,,,,,,,,...,,1.432044,,43.754027,,0.642186,,46.321656,,
min,-1.0,,,,,,,,,,...,,-1.0,,-1.0,,-1.0,,-1.0,,
25%,36.0,,,,,,,,,,...,,0.8,,84.0,,0.83,,86.75,,
50%,54.0,,,,,,,,,,...,,1.8,,102.0,,0.955,,104.0,,
75%,67.0,,,,,,,,,,...,,2.3,,123.0,,1.07,,122.0,,


In [90]:
print(df.columns[(df == -1).any()])

Index(['age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI'], dtype='object')


In [91]:
print(df.columns[(df == '-1').any()])

Index(['sex'], dtype='object')
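
The `min` of -1.0 in the summary above comes from the -1 placeholder, which also pulls the means down. As a small illustration with made-up ages (not rows from the dataset), masking the placeholder with `NaN` before calling `describe()` keeps it out of the statistics:

```python
import numpy as np
import pandas as pd

# Made-up ages; -1 marks a value that was '?' in the raw file
ages = pd.Series([41, 23, -1, 70, -1, 46], name="age")

# Statistics computed with the placeholder still in the column
with_sentinel = ages.describe()
# Replace the placeholder with NaN, which describe() skips
masked = ages.replace(-1, np.nan).describe()

print(with_sentinel[["count", "mean", "min"]])
print(masked[["count", "mean", "min"]])
```

In the masked version the count drops to the number of observed values, and the mean and minimum are computed from real ages only.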


Procedure to determine whether the `?` values are missing at random (that is, independently of the other columns):
- Split the dataset into two sets: rows with a missing value in a particular column and rows without one
- Create a new response variable indicating which of the two sets each row belongs to
- Combine the two sets
- Fit a logistic regression model that predicts this indicator from the remaining columns
- If the data is missing at random, the logistic regression model should perform poorly: its accuracy should be no better than always predicting the majority class

In [None]:
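from sklearn.linear_model import LogisticRegression

df_adding_missing_column = df.copy()
# Indicator for rows where `age` was originally '?' (encoded as -1 above)
df_adding_missing_column['missing'] = (df_adding_missing_column['age'] == -1).astype(int)

# Drop `age` as well: it equals -1 exactly when `missing` is 1, so keeping it leaks the answer.
# One-hot encode the remaining string columns, since LogisticRegression needs numeric input.
X = pd.get_dummies(df_adding_missing_column.drop(columns=['missing', 'age']), drop_first=True)
y = df_adding_missing_column['missing']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
# Reference point: accuracy obtained by always predicting "not missing"
print('Accuracy of majority-class baseline on test set: {:.2f}'.format(1 - y_test.mean()))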
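
To show what the procedure is expected to report in each case, here is a sketch on synthetic data (not the thyroid data). All names and parameters below are assumptions for illustration. It uses ROC AUC instead of accuracy, since accuracy can look high simply because few rows are missing. A value near 0.5 would be consistent with missingness that is unrelated to the features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
# Synthetic stand-in features
X = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})

# Case 1: missingness is independent of the features
missing_independent = rng.random(n) < 0.2
# Case 2: missingness becomes more likely as x1 grows
missing_dependent = rng.random(n) < 1 / (1 + np.exp(-2 * X["x1"]))


def missingness_auc(features, missing):
    """Fit a logistic regression on the missingness indicator and return its test ROC AUC."""
    y = np.asarray(missing).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.3, random_state=0, stratify=y)
    model = LogisticRegression().fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


auc_independent = missingness_auc(X, missing_independent)
auc_dependent = missingness_auc(X, missing_dependent)
print(f"independent: {auc_independent:.2f}, dependent: {auc_dependent:.2f}")
```

The same helper could be applied to each column of the cleaned dataframe that contains the placeholder, provided that column is excluded from the features and the string columns are encoded first.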

### Visualization

### One Hot Encoding

## Modeling

## Evaluation

## Fine Tuning

## Cleanup code

In [95]:
!rm {local_zip_file_location}
!rm -rf {local_folder_location}