# Engenharia do Conhecimento 2023/2024

## Project: *Thyroid disease Data Set*

#### Group 6:

- Eduardo Proença 57551
- Tiago Oliveira 54979
- Bernardo Lopes 54386

### Summary

1. Data Processing
    1. Creating a Data Frame
    2. Data investigation
    3. Encoding Data
    4. Splitting into training and testing set
    5. Scaling Data
    6. Imputing missing values
    
2. Classification Models
    1. Feature Selection
    2. KFold Cross validation
    3. Decision Tree
    4. Logistic Regression
    5. Naive Bayes
    6. KNN
    7. SVM
       
3. Model Selection
    1. Hyperparameter tuning
    2. Model Testing

## 1. Data Processing

When constructing a machine learning model, data processing is an important step. We need to ensure that our data set is properly processed so that the different classification models can make the best possible use of it.

### 1.1 Creating a Data Frame

The first step is to load our data set. For that, we can use the [Pandas](https://pandas.pydata.org) Python library to read the file "proj-data.csv", which contains the data set used in this project, and build a DataFrame.

In [456]:
import pandas as pd

# Load data set
df_thyroid = pd.read_csv('proj-data.csv')
df_thyroid.shape

(7338, 31)

In [457]:
df_thyroid.head()

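| | age: | sex: | on thyroxine: | query on thyroxine: | on antithyroid medication: | sick: | pregnant: | thyroid surgery: | I131 treatment: | query hypothyroid: | ... | TT4: | T4U measured: | T4U: | FTI measured: | FTI: | TBG measured: | TBG: | referral source: | diagnoses | [record identification] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | F | f | f | f | f | f | f | f | t | ... | ? | f | ? | f | ? | f | ? | other | - | [861106018] |
| 1 | 29 | F | f | f | f | f | f | f | f | f | ... | 128 | f | ? | f | ? | f | ? | other | - | [860916073] |
| 2 | 36 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 26 | other | - | [850726049] |
| 3 | 60 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 26 | other | - | [861010020] |
| 4 | 77 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 21 | other | - | [860324074] |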


### 1.2 Data investigation

After building our DataFrame, it is important to investigate it so that we gain a better understanding of our data.

Let's have a deeper look at our DataFrame:

In [458]:
# TODO fix prints
for col in df_thyroid.columns:
    print("Values of ", end='')
    print(df_thyroid[col].value_counts(), end="\n\n")

Values of age:
60       175
62       169
72       164
59       162
61       161
        ... 
7          1
65511      1
65512      1
3          1
9          1
Name: count, Length: 98, dtype: int64

Values of sex:
F    4848
M    2250
?     240
Name: count, dtype: int64

Values of on thyroxine:
f    6347
t     991
Name: count, dtype: int64

Values of query on thyroxine:
f    7222
t     116
Name: count, dtype: int64

Values of on antithyroid medication:
f    7237
t     101
Name: count, dtype: int64

Values of sick:
f    7063
t     275
Name: count, dtype: int64

Values of pregnant:
f    7254
t      84
Name: count, dtype: int64

Values of thyroid surgery:
f    7233
t     105
Name: count, dtype: int64

Values of I131 treatment:
f    7211
t     127
Name: count, dtype: int64

Values of query hypothyroid:
f    6835
t     503
Name: count, dtype: int64

Values of query hyperthyroid:
f    6830
t     508
Name: count, dtype: int64

Values of lithium:
f    7264
t      74
Name: count, dtype: int64

Val

The column "[record identification]" is removed from the DataFrame:

In [459]:
df = df_thyroid.drop("[record identification]", axis=1)

Now we are going to take a look at the missing values in our DataFrame:

In [460]:
import numpy as np

df.replace('?', np.nan, inplace=True)
df.isna().sum()

age:                             0
sex:                           240
on thyroxine:                    0
query on thyroxine:              0
on antithyroid medication:       0
sick:                            0
pregnant:                        0
thyroid surgery:                 0
I131 treatment:                  0
query hypothyroid:               0
query hyperthyroid:              0
lithium:                         0
goitre:                          0
tumor:                           0
hypopituitary:                   0
psych:                           0
TSH measured:                    0
TSH:                           671
T3 measured:                     0
T3:                           2068
TT4 measured:                    0
TT4:                           362
T4U measured:                    0
T4U:                           664
FTI measured:                    0
FTI:                           658
TBG measured:                    0
TBG:                          7054
referral source:    
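
To put these counts in perspective, they can be expressed as a fraction of the 7338 rows. The sketch below is plain Python and only reuses the non-zero counts visible in the output above; the 50% cut-off (`threshold`) is an illustrative assumption, not a value taken from this project.

```python
# Non-zero missing-value counts copied from the visible output above
n_rows = 7338
missing_counts = {
    "sex:": 240,
    "TSH:": 671,
    "T3:": 2068,
    "TT4:": 362,
    "T4U:": 664,
    "FTI:": 658,
    "TBG:": 7054,
}

# Fraction of rows in which each column is missing
missing_fraction = {col: count / n_rows for col, count in missing_counts.items()}

# Hypothetical cut-off: flag columns missing in more than half of the rows
threshold = 0.5
mostly_missing = [col for col, frac in missing_fraction.items() if frac > threshold]

# Print the columns from most to least affected
for col, frac in sorted(missing_fraction.items(), key=lambda item: item[1], reverse=True):
    print(f"{col:<6} {frac:6.1%}")
print("Mostly missing:", mostly_missing)
```

With these numbers only "TBG:" exceeds the cut-off, which is consistent with dropping the TBG columns from the DataFrame.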

In [461]:
# TODO this needs tuning
#df_cleaned = (df.drop("TBG:", axis = 1)).dropna(subset=["T3:"])
#df_cleaned = df.dropna(subset=["T3:"])
df_cleaned = df.drop(["TBG measured:", "TBG:"], axis = 1)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7338 entries, 0 to 7337
Data columns (total 28 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   age:                        7338 non-null   int64 
 1   sex:                        7098 non-null   object
 2   on thyroxine:               7338 non-null   object
 3   query on thyroxine:         7338 non-null   object
 4   on antithyroid medication:  7338 non-null   object
 5   sick:                       7338 non-null   object
 6   pregnant:                   7338 non-null   object
 7   thyroid surgery:            7338 non-null   object
 8   I131 treatment:             7338 non-null   object
 9   query hypothyroid:          7338 non-null   object
 10  query hyperthyroid:         7338 non-null   object
 11  lithium:                    7338 non-null   object
 12  goitre:                     7338 non-null   object
 13  tumor:                      7338 non-null   obje

### 1.3 Encoding Data

In machine learning, some classification models cannot handle object-type values. For this reason, we need to transform our DataFrame so that it is composed only of numeric types. This process is referred to as data encoding.

Encoding variables

In [462]:
import numpy as np

encoded_values = {
    'M': '0', 'F': '1',
    'f': '0', 't': '1'
}
target = "diagnoses"
df_target = pd.DataFrame(df_cleaned["diagnoses"], columns=["diagnoses"])

encoded = df_cleaned.drop("diagnoses", axis=1).replace(encoded_values)
df_encoded = pd.get_dummies(encoded, columns=["referral source:"], dtype='int')
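
The `pd.get_dummies` call turns the single "referral source:" column into one 0/1 column per category. As a rough illustration of that idea, here is a plain-Python sketch; the six categories are the ones visible in the `referral source:_*` column names of the encoded DataFrame, while the helper `one_hot` and the sample values are made up for this example.

```python
# Referral-source categories as they appear in the encoded column names
categories = ["STMW", "SVHC", "SVHD", "SVI", "WEST", "other"]

def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

# Toy input values, chosen only for illustration
samples = ["other", "SVI", "STMW"]
encoded_rows = [one_hot(value, categories) for value in samples]

for value, row in zip(samples, encoded_rows):
    print(f"{value:<6} -> {row}")
```

Each row contains exactly one 1, so no artificial ordering between referral sources is introduced, which would happen if the categories were simply numbered 0 to 5.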

Encoding target variable

In [463]:
# Map each diagnosis code to its class; any other value falls into class 8
diagnosis_classes = {
    '-': 0,
    'A': 1, 'B': 1, 'C': 1, 'D': 1,
    'E': 2, 'F': 2, 'G': 2, 'H': 2,
    'I': 3, 'J': 3,
    'K': 4,
    'L': 5, 'M': 5, 'N': 5,
    'O': 6, 'P': 6, 'Q': 6,
    'R': 7, 'S': 7, 'T': 7,
}
df_target[target] = df_target[target].map(lambda value: diagnosis_classes.get(value, 8))

df_target[target] = pd.to_numeric(df_target[target])
df_target[target].unique()

array([0, 7, 5, 2, 4, 3, 1, 8, 6], dtype=int64)

In [464]:
df = pd.concat([df_encoded, df_target], axis=1)
df.head()

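| | age: | sex: | on thyroxine: | query on thyroxine: | on antithyroid medication: | sick: | pregnant: | thyroid surgery: | I131 treatment: | query hypothyroid: | ... | T4U: | FTI measured: | FTI: | referral source:_STMW | referral source:_SVHC | referral source:_SVHD | referral source:_SVI | referral source:_WEST | referral source:_other | diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 77 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |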


### 1.4 Splitting into training and testing set

In [465]:
from sklearn.model_selection import train_test_split

X = df.drop("diagnoses", axis='columns')
y = df["diagnoses"]

X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(X, y, test_size=0.2)

# Print the shapes of the training and testing sets
print("Training set shape:", X_TRAIN.shape, y_TRAIN.shape)
print("Testing set shape:", X_IVS.shape, y_IVS.shape)

Training set shape: (5870, 32) (5870,)
Testing set shape: (1468, 32) (1468,)
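
The sizes printed above follow from simple arithmetic: with a fractional `test_size`, scikit-learn rounds the test-set size up and gives the remaining rows to the training set. A minimal sketch of that arithmetic in plain Python:

```python
import math

n_samples = 7338   # rows in the DataFrame
test_size = 0.2    # fraction passed to train_test_split

# The test-set size is rounded up; the training set receives the remaining rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("Training rows:", n_train)
print("Testing rows:", n_test)
```

Because no `random_state` is passed to `train_test_split`, the rows that end up in each set change between runs, although the sizes stay the same.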


### 1.5 Scaling Data

In [466]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_TRAIN)

X_train_scl = scaler.transform(X_TRAIN)
X_ivs_scl = scaler.transform(X_IVS)
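
`StandardScaler` rescales each column to zero mean and unit variance, z = (x - mean) / std, with the statistics estimated in `fit` and reused in `transform`. The plain-Python sketch below mirrors that for a single column. The helper names and the age values are illustrative assumptions, not taken from the data set, and the sketch assumes a non-zero standard deviation.

```python
import statistics

def fit_standardizer(values):
    """Compute mean and population standard deviation from training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously fitted statistics."""
    return [(x - mean) / std for x in values]

# Toy ages, chosen only for illustration
train_ages = [29.0, 36.0, 60.0, 77.0]
test_ages = [45.0, 80.0]

mean, std = fit_standardizer(train_ages)          # statistics come from training data only
train_scaled = standardize(train_ages, mean, std)
test_scaled = standardize(test_ages, mean, std)   # test data reuses the training statistics

print("Scaled training ages:", [round(z, 3) for z in train_scaled])
print("Scaled testing ages:", [round(z, 3) for z in test_scaled])
```

Fitting on the training set only, as the cell above does, keeps information from the test set out of the preprocessing step.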

### 1.6 Imputing missing values

In [467]:
# TODO Needs more tuning
from sklearn.impute import KNNImputer, SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#imputer = KNNImputer(n_neighbors=5)

X_train_imp = imputer.fit_transform(X_train_scl)
X_ivs_imp = imputer.transform(X_ivs_scl)

X_TRAIN = pd.DataFrame(X_train_imp, columns=X_TRAIN.columns)
X_IVS = pd.DataFrame(X_ivs_imp, columns=X_IVS.columns)

missing_values = X_TRAIN.isna().sum().sum() + X_IVS.isna().sum().sum()
print(f"Missing values count: {missing_values}")

Missing values count: 0
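
As a rough picture of what `SimpleImputer(strategy='mean')` does per column, the sketch below learns the mean of the observed training values and uses it to fill the gaps in both sets. It is plain Python with hypothetical helper names and toy numbers.

```python
import math

def fit_mean(values):
    """Mean of the observed (non-NaN) training values."""
    observed = [x for x in values if not math.isnan(x)]
    return sum(observed) / len(observed)

def impute(values, fill_value):
    """Replace every NaN with the fitted fill value."""
    return [fill_value if math.isnan(x) else x for x in values]

nan = float("nan")
train_column = [1.0, nan, 3.0, 5.0]   # toy values
test_column = [nan, 2.0]

fill_value = fit_mean(train_column)                 # learned from training data only
train_imputed = impute(train_column, fill_value)
test_imputed = impute(test_column, fill_value)      # test data reuses the training mean

print("Fill value:", fill_value)
print("Imputed training column:", train_imputed)
print("Imputed testing column:", test_imputed)
```

Since the columns were standardized before imputation, their training means are approximately zero, so in this notebook mean imputation fills the missing entries with values close to 0.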


In [468]:
X_TRAIN.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5870 entries, 0 to 5869
Data columns (total 32 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   age:                        5870 non-null   float64
 1   sex:                        5870 non-null   float64
 2   on thyroxine:               5870 non-null   float64
 3   query on thyroxine:         5870 non-null   float64
 4   on antithyroid medication:  5870 non-null   float64
 5   sick:                       5870 non-null   float64
 6   pregnant:                   5870 non-null   float64
 7   thyroid surgery:            5870 non-null   float64
 8   I131 treatment:             5870 non-null   float64
 9   query hypothyroid:          5870 non-null   float64
 10  query hyperthyroid:         5870 non-null   float64
 11  lithium:                    5870 non-null   float64
 12  goitre:                     5870 non-null   float64
 13  tumor:                      5870 

## 2. Classification Models

### 2.1 Feature Selection

In [469]:
# TODO feature selection needs tuning
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

n_features = int(X_TRAIN.shape[1] * .4)
sfs = SequentialFeatureSelector(LinearRegression(), 
                                n_features_to_select=5, 
                                direction='forward', 
                                n_jobs=-1)
sfs.fit(X_TRAIN, y_TRAIN)

N, M = X_TRAIN.shape
features=sfs.get_support()
features_selected = np.arange(M)[features]
print("The features selected are columns: ", features_selected)

X_TRAIN = sfs.transform(X_TRAIN)
X_IVS = sfs.transform(X_IVS)

The features selected are columns:  [ 6 17 19 21 25]
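
With `direction='forward'`, `SequentialFeatureSelector` starts from an empty set and greedily adds, one at a time, the feature that gives the best cross-validated score. The sketch below shows only that greedy loop. The function `toy_score`, the feature names and the gain values are hypothetical stand-ins for a real cross-validated model score.

```python
def forward_select(features, score, n_to_select):
    """Greedily add the feature whose inclusion gives the highest score."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n_to_select:
        best = max(remaining, key=lambda feature: score(selected + [feature]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical per-feature gains standing in for a cross-validated model score
GAINS = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.1}

def toy_score(subset):
    total = sum(GAINS[feature] for feature in subset)
    # "a" and "b" are treated as redundant: together they add less than their sum
    if "a" in subset and "b" in subset:
        total -= 0.3
    return total

selected = forward_select(["a", "b", "c", "d"], toy_score, n_to_select=2)
print(selected)
```

In the cell above the score comes from a `LinearRegression` model, so the class labels are treated as plain numbers; passing a classifier as the estimator is an alternative worth comparing.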


### 2.2 KFold Cross validation

In [470]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef

def evaluate(model, X, y):
    TRUTH = None
    PREDS = None
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        X_train, y_train = X[train_index], y.to_numpy()[train_index]
        X_test, y_test = X[test_index], y.to_numpy()[test_index]
        
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        if TRUTH is None:
            PREDS = preds
            TRUTH = y_test
        else:
            PREDS = np.hstack((PREDS, preds))
            TRUTH = np.hstack((TRUTH, y_test))
    return TRUTH, PREDS
           
def print_statistics(truth, preds):
    print("Accuracy: %7.4f" % accuracy_score(truth, preds))
    print("Precision: %7.4f" % precision_score(truth, preds, average='weighted', zero_division=1))
    print("Recall is: %7.4f" % recall_score(truth, preds, average='weighted'))
    print("F1 score: %7.4f" % f1_score(truth, preds, average='weighted'))
    print("Matthews correlation coefficient: %7.4f" % matthews_corrcoef(truth, preds))
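
The `evaluate` function relies on `KFold` to split the training data into five parts, so every sample is predicted exactly once by a model that did not see it during fitting. The following plain-Python sketch illustrates how such index splits can be produced. The helper `k_fold_indices` and its `seed` parameter are assumptions made for this example, and it does not reproduce scikit-learn's shuffling order.

```python
import random

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold splitting."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The first n_samples % n_splits folds get one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=12, n_splits=5))
for train, test in folds:
    print("test fold:", sorted(test))
```

Concatenating the test folds gives back every index exactly once, which is why `evaluate` can stack the per-fold predictions and compute the statistics over the whole training set.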

### 2.3 Decision Tree

In [471]:
from sklearn.tree import DecisionTreeClassifier

TRUTH, PREDS = evaluate(DecisionTreeClassifier(), X_TRAIN, y_TRAIN)
print_statistics(TRUTH, PREDS)

Accuracy:  0.8675
Precision:  0.8647
Recall is:  0.8675
F1 score:  0.8654
Matthews correlation coefficient:  0.7005
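
Accuracy and recall are identical in the output above (0.8675). This is expected with `average='weighted'`: each class's recall is weighted by its share of the samples, so the class sizes cancel and the sum reduces to the overall fraction of correct predictions. A small plain-Python check with toy labels (not taken from the data set):

```python
from collections import Counter

def accuracy(truth, preds):
    """Fraction of samples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)

def weighted_recall(truth, preds):
    """Per-class recall averaged with class frequencies as weights."""
    support = Counter(truth)
    correct = Counter(t for t, p in zip(truth, preds) if t == p)
    n = len(truth)
    return sum((support[c] / n) * (correct[c] / support[c]) for c in support)

# Toy labels, chosen only for illustration
truth = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
preds = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print(accuracy(truth, preds), weighted_recall(truth, preds))
```

The weighted recall therefore adds no information beyond accuracy; a macro average would weight all classes equally and could differ.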


### Random Forest Classifier

In [472]:
from sklearn.ensemble import RandomForestClassifier

TRUTH, PREDS = evaluate(RandomForestClassifier(), X_TRAIN, y_TRAIN)
print_statistics(TRUTH, PREDS)

Accuracy:  0.8957
Precision:  0.8859
Recall is:  0.8957
F1 score:  0.8878
Matthews correlation coefficient:  0.7634


### 2.4 Logistic Regression

In [473]:
from sklearn.linear_model import LogisticRegression

TRUTH, PREDS = evaluate(LogisticRegression(), X_TRAIN, y_TRAIN)
print_statistics(TRUTH, PREDS)

Accuracy:  0.8155
Precision:  0.7860
Recall is:  0.8155
F1 score:  0.7814
Matthews correlation coefficient:  0.5250


### 2.5 Naive Bayes

In [474]:
from sklearn.naive_bayes import GaussianNB

TRUTH, PREDS = evaluate(GaussianNB(), X_TRAIN, y_TRAIN)
print_statistics(TRUTH, PREDS)

Accuracy:  0.1434
Precision:  0.8358
Recall is:  0.1434
F1 score:  0.0677
Matthews correlation coefficient:  0.1833
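
The combination of very low accuracy and high precision looks odd at first. One possible contributor, not verified on this data, is the `zero_division=1` setting in `print_statistics`: a class that is never predicted gets precision 1, and with weighted averaging a large class can then dominate the result. The toy example below reproduces that effect in plain Python; the function `weighted_precision` and the labels are made up for illustration.

```python
from collections import Counter

def weighted_precision(truth, preds, zero_division=1.0):
    """Support-weighted precision; classes never predicted get `zero_division`."""
    support = Counter(truth)
    predicted = Counter(preds)
    correct = Counter(t for t, p in zip(truth, preds) if t == p)
    n = len(truth)
    total = 0.0
    for label, count in support.items():
        if predicted[label] == 0:
            precision = zero_division      # no prediction was made for this class
        else:
            precision = correct[label] / predicted[label]
        total += (count / n) * precision
    return total

# Toy case: a classifier that predicts the rare class for every sample
truth = [0] * 8 + [1] * 2
preds = [1] * 10

accuracy = sum(t == p for t, p in zip(truth, preds)) / len(truth)
print("Accuracy:", accuracy)
print("Weighted precision (zero_division=1):", weighted_precision(truth, preds, 1.0))
print("Weighted precision (zero_division=0):", weighted_precision(truth, preds, 0.0))
```

In this toy case the weighted precision is about 0.84 with `zero_division=1` and about 0.04 with `zero_division=0`, while accuracy is 0.2 in both cases, so comparing the two settings can show how much of a reported precision comes from classes that were never predicted.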


### 2.6 KNN

In [475]:
from sklearn.neighbors import KNeighborsClassifier

TRUTH, PREDS = evaluate(KNeighborsClassifier(), X_TRAIN, y_TRAIN)
print_statistics(TRUTH, PREDS)

Accuracy:  0.8475
Precision:  0.8258
Recall is:  0.8475
F1 score:  0.8290
Matthews correlation coefficient:  0.6318


### 2.7 SVM

In [476]:
from sklearn.svm import SVC

TRUTH, PREDS = evaluate(SVC(), X_TRAIN, y_TRAIN)
print_statistics(TRUTH, PREDS)

Accuracy:  0.8378
Precision:  0.8050
Recall is:  0.8378
F1 score:  0.8071
Matthews correlation coefficient:  0.5968


## 3. Model Selection

### 3.1 Hyperparameter tuning

In [477]:
model_params = {
    'Decision Tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ['gini', 'entropy', 'log_loss'],
            'splitter': ['best', 'random'],
            'max_depth': np.arange(1, 15, 1),
            'min_samples_split': np.arange(2, 20, 2),
            'min_samples_leaf': np.arange(1, 10, 1)
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': np.arange(1, 15, 1)
        }
    },
    'KNN': {
        'model': KNeighborsClassifier(),
        'params': {
            "n_neighbors": [3, 4, 5, 6, 7, 9],
            "weights": ["uniform", "distance"]
        }
    }
}
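
An exhaustive grid search tries every combination of the listed values, so the cost grows multiplicatively with each hyperparameter. The sketch below counts the combinations for the grids above, assuming five-fold cross-validation as used in this notebook. It is plain Python, and `range(...)` is used as a stand-in for `np.arange(...)` because both yield the same integer values here.

```python
from math import prod

# Number of values per hyperparameter, mirroring the grids above
grid_sizes = {
    "Decision Tree": {
        "criterion": 3,
        "splitter": 2,
        "max_depth": len(range(1, 15, 1)),
        "min_samples_split": len(range(2, 20, 2)),
        "min_samples_leaf": len(range(1, 10, 1)),
    },
    "Random Forest": {"n_estimators": len(range(1, 15, 1))},
    "KNN": {"n_neighbors": 6, "weights": 2},
}

n_folds = 5  # assumed number of cross-validation folds

# Candidates per model = product of the number of values of each hyperparameter
candidates = {name: prod(sizes.values()) for name, sizes in grid_sizes.items()}
for name, n_candidates in candidates.items():
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_folds} fits")
```

The decision tree grid dominates the cost with 6804 candidates, whereas the random forest grid has only 14 and the KNN grid 12.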

In [478]:
from sklearn.model_selection import GridSearchCV

scores = []
best_models = []

for name, model in model_params.items():
    grid_search = GridSearchCV(model['model'], model['params'], cv=5, n_jobs=-1)
    grid_search.fit(X_TRAIN, y_TRAIN)
    scores.append({
        'Model': name,
        'Best Score': grid_search.best_score_,
        'Best Parameters': grid_search.best_params_
    })
    best_models.append(grid_search.best_estimator_)
    
df = pd.DataFrame(scores,columns=['Model', 'Best Score', 'Best Parameters'])
df

| | Model | Best Score | Best Parameters |
|---|---|---|---|
| 0 | Decision Tree | 0.887734 | {'criterion': 'gini', 'max_depth': 6, 'min_sam... |
| 1 | Random Forest | 0.896252 | {'n_estimators': 13} |
| 2 | KNN | 0.853833 | {'n_neighbors': 7, 'weights': 'distance'} |


In [479]:
# TRUTH, PREDS = evaluate(grid_search.best_estimator_, X_TRAIN, y_TRAIN)
# print("Best Estimator:", grid_search.best_estimator_)
# print_statistics(TRUTH, PREDS)

### 3.2 Model Testing

In [480]:
for model in best_models:
    print("\nModel:", model)
    model.fit(X_TRAIN, y_TRAIN)
    preds = model.predict(X_IVS)
    print_statistics(y_IVS, preds)


Model: DecisionTreeClassifier(max_depth=6, min_samples_split=4)
Accuracy:  0.8951
Precision:  0.8911
Recall is:  0.8951
F1 score:  0.8888
Matthews correlation coefficient:  0.7409

Model: RandomForestClassifier(n_estimators=13)
Accuracy:  0.8896
Precision:  0.8903
Recall is:  0.8896
F1 score:  0.8859
Matthews correlation coefficient:  0.7301

Model: KNeighborsClassifier(n_neighbors=7, weights='distance')
Accuracy:  0.8604
Precision:  0.8492
Recall is:  0.8604
F1 score:  0.8507
Matthews correlation coefficient:  0.6415
