# Engenharia do Conhecimento 2023/2024

## Project: Classification and Regression models with *Thyroid disease Data Set*

#### Group 6:

- Eduardo Proença 57551
- Tiago Oliveira 54979
- Bernardo Lopes 54386

### 1. Data Processing

Data processing is an important step when building a machine learning model. The data set must be cleaned, encoded, and scaled so that different classification models can use it effectively.

#### 1.1 Creating a Data Frame
The first step is to load the data set. We use the [Pandas](https://pandas.pydata.org) Python library to read the file "proj-data.csv", which contains the data set used in this project, and build a DataFrame from it.

In [58]:
import pandas as pd

# Load data set
df_thyroid = pd.read_csv('proj-data.csv')
df_thyroid.shape

(7338, 31)

In [59]:
df_thyroid.head()

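|   | age: | sex: | on thyroxine: | query on thyroxine: | on antithyroid medication: | sick: | pregnant: | thyroid surgery: | I131 treatment: | query hypothyroid: | ... | TT4: | T4U measured: | T4U: | FTI measured: | FTI: | TBG measured: | TBG: | referral source: | diagnoses | [record identification] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | F | f | f | f | f | f | f | f | t | ... | ? | f | ? | f | ? | f | ? | other | - | [861106018] |
| 1 | 29 | F | f | f | f | f | f | f | f | f | ... | 128 | f | ? | f | ? | f | ? | other | - | [860916073] |
| 2 | 36 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 26 | other | - | [850726049] |
| 3 | 60 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 26 | other | - | [861010020] |
| 4 | 77 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 21 | other | - | [860324074] |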


#### 1.2 Data investigation
After building the DataFrame, we investigate it to gain a better understanding of the data.

The first thing we notice is that missing values are represented by '?'. They are handled later in this notebook; for now we just replace them with NaN so that they can be identified.

In [60]:
import numpy as np

# Replace missing values with NaN
df = df_thyroid.copy()
df.replace('?', np.nan, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7338 entries, 0 to 7337
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   age:                        7338 non-null   int64 
 1   sex:                        7098 non-null   object
 2   on thyroxine:               7338 non-null   object
 3   query on thyroxine:         7338 non-null   object
 4   on antithyroid medication:  7338 non-null   object
 5   sick:                       7338 non-null   object
 6   pregnant:                   7338 non-null   object
 7   thyroid surgery:            7338 non-null   object
 8   I131 treatment:             7338 non-null   object
 9   query hypothyroid:          7338 non-null   object
 10  query hyperthyroid:         7338 non-null   object
 11  lithium:                    7338 non-null   object
 12  goitre:                     7338 non-null   object
 13  tumor:                      7338 non-null   obje
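As an alternative to replacing '?' after loading, the parser itself can be told to treat '?' as missing. The sketch below uses a small made-up CSV sample (the rows are hypothetical, only the column names follow the data set) to show that, with this option, numeric columns are parsed as floats instead of staying as text.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the style of the data set (not real records)
sample_csv = io.StringIO(
    "age:,sex:,TSH measured:,TSH:\n"
    "29,F,t,0.3\n"
    "36,?,f,?\n"
    "60,M,t,1.6\n"
)

# na_values lists the strings that the parser should read as missing values
sample = pd.read_csv(sample_csv, na_values="?")

print(sample.dtypes)        # 'TSH:' is parsed as a float column
print(sample.isna().sum())  # one missing value in 'sex:' and one in 'TSH:'
```

With the real file this would be `pd.read_csv('proj-data.csv', na_values='?')`; whether that is preferable to the explicit replace step is a design choice.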

Next, we look at the number of distinct values in each column.

In [61]:
print("Unique values:")
for col in df_thyroid.columns:
    unique_vals = df_thyroid[col].nunique()
    print(f'{col} = ', unique_vals)

Unique values:
age: =  98
sex: =  3
on thyroxine: =  2
query on thyroxine: =  2
on antithyroid medication: =  2
sick: =  2
pregnant: =  2
thyroid surgery: =  2
I131 treatment: =  2
query hypothyroid: =  2
query hyperthyroid: =  2
lithium: =  2
goitre: =  2
tumor: =  2
hypopituitary: =  2
psych: =  2
TSH measured: =  2
TSH: =  351
T3 measured: =  2
T3: =  82
TT4 measured: =  2
TT4: =  278
T4U measured: =  2
T4U: =  167
FTI measured: =  2
FTI: =  303
TBG measured: =  2
TBG: =  61
referral source: =  6
diagnoses =  32
[record identification] =  7338


Let's look at the number of missing values of each column.

In [62]:
df.isna().sum()

age:                             0
sex:                           240
on thyroxine:                    0
query on thyroxine:              0
on antithyroid medication:       0
sick:                            0
pregnant:                        0
thyroid surgery:                 0
I131 treatment:                  0
query hypothyroid:               0
query hyperthyroid:              0
lithium:                         0
goitre:                          0
tumor:                           0
hypopituitary:                   0
psych:                           0
TSH measured:                    0
TSH:                           671
T3 measured:                     0
T3:                           2068
TT4 measured:                    0
TT4:                           362
T4U measured:                    0
T4U:                           664
FTI measured:                    0
FTI:                           658
TBG measured:                    0
TBG:                          7054
referral source:    

After some investigation, we know that the data set has two main types of columns: binary columns, which take only two possible non-numeric values, and numeric columns. There are three further columns: "referral source", which can take six different values; "diagnoses", our target variable; and "[record identification]", which is only a unique identifier and can therefore be dropped.

As for the missing values, we impute them later instead of deleting the affected rows, because they correspond to measurements that were not taken rather than to values that were lost.
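The claim that a missing value means "not measured" can be checked against the "... measured:" flag columns. The sketch below does this on a small made-up frame; the helper name `flags_match_missing` and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame using the data set's naming pattern (values are made up)
toy = pd.DataFrame({
    "TSH measured:": ["t", "f", "t", "f"],
    "TSH:": [0.3, np.nan, 1.6, np.nan],
    "T3 measured:": ["f", "t", "t", "f"],
    "T3:": [np.nan, 1.9, 2.2, np.nan],
})

def flags_match_missing(df, test_name):
    """Return True if '<test> measured:' is 'f' exactly where '<test>:' is NaN."""
    not_measured = df[f"{test_name} measured:"] == "f"
    return bool((not_measured == df[f"{test_name}:"].isna()).all())

for name in ["TSH", "T3"]:
    print(name, flags_match_missing(toy, name))
```

Running the same helper on `df` for TSH, T3, TT4, T4U, FTI and TBG would show whether the flags and the NaN values agree in the real data.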

In [63]:
# Dropping the [record identification] column
df_cleaned = df.drop('[record identification]', axis = 1)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7338 entries, 0 to 7337
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   age:                        7338 non-null   int64 
 1   sex:                        7098 non-null   object
 2   on thyroxine:               7338 non-null   object
 3   query on thyroxine:         7338 non-null   object
 4   on antithyroid medication:  7338 non-null   object
 5   sick:                       7338 non-null   object
 6   pregnant:                   7338 non-null   object
 7   thyroid surgery:            7338 non-null   object
 8   I131 treatment:             7338 non-null   object
 9   query hypothyroid:          7338 non-null   object
 10  query hyperthyroid:         7338 non-null   object
 11  lithium:                    7338 non-null   object
 12  goitre:                     7338 non-null   object
 13  tumor:                      7338 non-null   obje

#### 1.3 Encoding Data
The data set is now ready to be encoded. All binary columns are mapped to 0 and 1, and the "referral source" column is one-hot encoded with the get_dummies function from [Pandas](https://pandas.pydata.org).

The target variable is encoded into the 8 classes given in the file "data.names".

In [64]:
def encode_data(X):
    encoded_values = {
        'M': '0', 'F': '1',
        'f': '0', 't': '1'
    }
    encoded = X.replace(encoded_values)
    df_encoded = pd.get_dummies(encoded, columns=['referral source:'], dtype='int')
    return df_encoded

def encode_target(y):
    target = 'diagnoses'
    value_mapping = {
        '-': 0,                          # healthy
        'A': 1, 'B': 1, 'C': 1, 'D': 1,  # hyperthyroid conditions
        'E': 2, 'F': 2, 'G': 2, 'H': 2,  # hypothyroid conditions
        'I': 3, 'J': 3,                  # binding protein
        'K': 4,                          # general health
        'L': 5, 'M': 5, 'N': 5,          # replacement therapy
        'R': 6,                          # discordant results
    }
    df_target = pd.DataFrame(y, columns=[target])
    df_target[target] = df_target[target].map(value_mapping).fillna(7).astype(int)
    return df_target

X = df_cleaned.drop('diagnoses', axis='columns')
y = df_cleaned['diagnoses']
df_encoded = pd.concat([encode_data(X), encode_target(y)], axis=1)
df_encoded.head()

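|   | age: | sex: | on thyroxine: | query on thyroxine: | on antithyroid medication: | sick: | pregnant: | thyroid surgery: | I131 treatment: | query hypothyroid: | ... | FTI: | TBG measured: | TBG: | referral source:_STMW | referral source:_SVHC | referral source:_SVHD | referral source:_SVI | referral source:_WEST | referral source:_other | diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 1 | 26.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 1 | 26.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 77 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 1 | 21.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |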
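To make the two encoding steps concrete, here is a minimal sketch on made-up values. It uses only a subset of the notebook's mapping, and the label 'S' is a hypothetical stand-in for any diagnosis code that is not in the mapping and therefore falls into class 7.

```python
import pandas as pd

# Subset of the notebook's mapping, for illustration only
value_mapping = {"-": 0, "A": 1, "F": 2, "K": 4}

# 'S' is a hypothetical label that is absent from the mapping
labels = pd.Series(["-", "A", "F", "K", "S"], name="diagnoses")

# Unmapped labels become NaN after map() and are then assigned class 7
encoded = labels.map(value_mapping).fillna(7).astype(int)
print(encoded.tolist())  # [0, 1, 2, 4, 7]

# One-hot encoding: one 0/1 column per distinct referral source
sources = pd.DataFrame({"referral source:": ["other", "SVI", "other"]})
dummies = pd.get_dummies(sources, columns=["referral source:"], dtype="int")
print(dummies.columns.tolist())
```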


#### 1.4 Splitting into training and testing sets
With the data encoded, we split it into a training set (80%) and a testing set (20%).
The training set is used to train the classification models and the testing set is used to evaluate them.

In [65]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('diagnoses', axis='columns')
y = df_encoded['diagnoses']

X_TRAIN, X_TEST, y_TRAIN, y_TEST = train_test_split(X, y, test_size=0.2)

# Print the shapes of the training and testing sets
print('Training set shape:', X_TRAIN.shape, y_TRAIN.shape)
print('Testing set shape:', X_TEST.shape, y_TEST.shape)

Training set shape: (5870, 34) (5870,)
Testing set shape: (1468, 34) (1468,)
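The split above is random and not seeded, so the exact rows in each set change between runs. The sketch below, on synthetic data, shows one possible variant that fixes `random_state` for reproducibility and uses `stratify` to keep class proportions equal in both sets. It is an assumption about what might be useful here, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 80 samples of class 0 and 20 of class 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80/20 class ratio in both subsets;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```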


#### 1.5 Scaling Data
Some classification models, such as KNN, are based on the distance between data points, so it is important to standardize the features of the training and testing sets. The scaler is fitted on the training set only and then applied to both sets.

In [66]:
from sklearn.preprocessing import StandardScaler

def scale_data(X_train, X_test):
    scaler = StandardScaler()
    scaler.fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test)

X_train_scl, X_test_scl = scale_data(X_TRAIN, X_TEST)
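At this point the data still contains NaN values, since imputation happens in the next step. The small sketch below, on a made-up single-column array, illustrates that StandardScaler disregards NaN when fitting and keeps it in the transformed output, which is why scaling before imputing works.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature with a missing value in the middle (made-up numbers)
X_toy = np.array([[1.0], [np.nan], [3.0]])

scaler = StandardScaler().fit(X_toy)
X_scaled = scaler.transform(X_toy)

print(scaler.mean_)  # mean computed over the non-missing values only
print(X_scaled)      # the NaN entry is preserved after scaling
```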

#### 1.6 Imputing missing values
Similarly, some classification models, such as KNN, cannot handle missing values, so we need to impute the NaN values. Here each one is replaced by the mean of its column, computed on the training set.

In [67]:
from sklearn.impute import SimpleImputer

def impute_data(X_train, X_test):
    imputer = SimpleImputer(strategy='mean')
    imputer.fit(X_train)
    return imputer.transform(X_train), imputer.transform(X_test)

X_train_imp, X_test_imp = impute_data(X_train_scl, X_test_scl)

In [68]:
X_TRAIN = pd.DataFrame(X_train_imp, columns=X_TRAIN.columns)
X_TEST = pd.DataFrame(X_test_imp, columns=X_TEST.columns)
X_TRAIN.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5870 entries, 0 to 5869
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   age:                        5870 non-null   float64
 1   sex:                        5870 non-null   float64
 2   on thyroxine:               5870 non-null   float64
 3   query on thyroxine:         5870 non-null   float64
 4   on antithyroid medication:  5870 non-null   float64
 5   sick:                       5870 non-null   float64
 6   pregnant:                   5870 non-null   float64
 7   thyroid surgery:            5870 non-null   float64
 8   I131 treatment:             5870 non-null   float64
 9   query hypothyroid:          5870 non-null   float64
 10  query hyperthyroid:         5870 non-null   float64
 11  lithium:                    5870 non-null   float64
 12  goitre:                     5870 non-null   float64
 13  tumor:                      5870 
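The scaling and imputation steps can also be chained in a single object. The following is a minimal sketch using `make_pipeline` on made-up numbers, in the same order as the notebook (scale, then impute). It also shows a side effect of that order: after standardization the training mean of each column is 0, so mean imputation fills missing entries with 0.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training and testing arrays with missing entries
X_tr = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_te = np.array([[np.nan, 15.0]])

# Same order as in the notebook: standardize first, then impute the mean
prep = make_pipeline(StandardScaler(), SimpleImputer(strategy="mean"))

X_tr_p = prep.fit_transform(X_tr)  # statistics learned from the training set
X_te_p = prep.transform(X_te)      # the same statistics reused on the test set

print(X_tr_p)
print(X_te_p)
```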

### 2. Classification Models

#### 2.1 Feature Selection

In [69]:
# TODO feature selection needs tuning
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

def feature_selection(X_train, y_train, X_test, verbose=False): 
    selector = SFS(LinearRegression(), 
                   n_features_to_select=13, 
                   direction='forward',
                   n_jobs=-1)
    selector.fit(X_train, y_train)
    if verbose:
        N, M = X_train.shape
        features=selector.get_support()
        features_selected = np.arange(M)[features]
        print("The features selected are columns: ", features_selected)
    return selector.transform(X_train), selector.transform(X_test)

X_TRAIN, X_TEST = feature_selection(X_TRAIN, y_TRAIN, X_TEST, True)

The features selected are columns:  [ 2  6 16 17 18 19 21 25 26 27 28 29 31]
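The cell above is marked as needing tuning and scores candidate features with LinearRegression, whose default score is R² rather than a classification metric. As one possible direction (an assumption, not the authors' method), the sketch below runs the same selector with a classifier on synthetic data and reports the selected features by name instead of by index.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set; column names f0..f5 are made up
X_arr, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# With a classifier, each candidate subset is scored by cross-validated accuracy
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",
    cv=3,
)
selector.fit(X_toy, y_toy)

# get_support() returns a boolean mask that can index the column names
selected_names = X_toy.columns[selector.get_support()].tolist()
print("Selected features:", selected_names)
```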


#### 2.2 KFold Cross Validation

In [70]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef

def cross_validation(model, X, y):
    TRUTH = None
    PREDS = None
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        X_train, y_train = X[train_index], y.to_numpy()[train_index]
        X_test, y_test = X[test_index], y.to_numpy()[test_index]
        
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        if TRUTH is None:
            PREDS = preds
            TRUTH = y_test
        else:
            PREDS = np.hstack((PREDS, preds))
            TRUTH = np.hstack((TRUTH, y_test))
    return TRUTH, PREDS

In [71]:
def evaluate(model, X, y, n_iter=10):
    accuracy = []
    precision = []
    recall = []
    f1 = []
    mcc = []
    
    for _ in range(n_iter):
        truth, preds = cross_validation(model, X, y)
        accuracy.append(accuracy_score(truth, preds))
        precision.append(precision_score(truth, preds, average='weighted', zero_division=1))
        recall.append(recall_score(truth, preds, average='weighted'))
        f1.append(f1_score(truth, preds, average='weighted'))
        mcc.append(matthews_corrcoef(truth, preds))
        
    return {
        'Model': model,
        'Accuracy': np.mean(accuracy),
        'Precision': np.mean(precision),
        'Recall': np.mean(recall),
        'F1-Score': np.mean(f1),
        'MCC': np.mean(mcc)
    }
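The helper `cross_validation` collects out-of-fold predictions for every sample. scikit-learn offers `cross_val_predict` for the same purpose. The sketch below shows it on synthetic data with a StratifiedKFold splitter, which keeps class proportions similar across folds. Using stratification here is an assumption on my part, since the notebook uses plain KFold.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced three-class problem standing in for the real data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted by a model that did not see it during training
oof_preds = cross_val_predict(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy, cv=cv
)

print("Accuracy:", accuracy_score(y_toy, oof_preds))
print("MCC:     ", matthews_corrcoef(y_toy, oof_preds))
```

Unlike the hand-written loop, the predictions come back in the original sample order, so they can be compared directly with `y_toy`.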

#### 2.3 Model Evaluation

In [72]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

eval_metrics = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'MCC']

tree = evaluate(DecisionTreeClassifier(), X_TRAIN, y_TRAIN)
lgr = evaluate(LogisticRegression(), X_TRAIN, y_TRAIN)
naive_bayes = evaluate(GaussianNB(), X_TRAIN, y_TRAIN)
knn = evaluate(KNeighborsClassifier(), X_TRAIN, y_TRAIN)
svm = evaluate(SVC(), X_TRAIN, y_TRAIN)

pd.DataFrame([tree, lgr, naive_bayes, knn, svm], columns=eval_metrics)

|   | Model | Accuracy | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|---|---|
| 0 | DecisionTreeClassifier() | 0.924889 | 0.924415 | 0.924889 | 0.924541 | 0.829129 |
| 1 | LogisticRegression() | 0.853049 | 0.837921 | 0.853049 | 0.834673 | 0.629239 |
| 2 | GaussianNB() | 0.164991 | 0.798435 | 0.164991 | 0.125834 | 0.194124 |
| 3 | KNeighborsClassifier() | 0.870579 | 0.863071 | 0.870579 | 0.859773 | 0.680761 |
| 4 | SVC() | 0.848876 | 0.839192 | 0.848876 | 0.823992 | 0.613862 |
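The GaussianNB row combines a weighted precision of about 0.80 with an accuracy of about 0.16. One possible contributor, not verified against the actual predictions, is the `zero_division=1` argument: a class that is never predicted has undefined precision, and this setting counts it as 1. The toy example below shows the size of that effect.

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy case: class 1 exists in the truth but is never predicted
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)

# Precision of class 1 is undefined (0 predicted samples);
# zero_division decides which value replaces it before averaging
prec_as_one = precision_score(y_true, y_pred, average="weighted", zero_division=1)
prec_as_zero = precision_score(y_true, y_pred, average="weighted", zero_division=0)

print(acc)           # 0.75
print(prec_as_one)   # 0.8125
print(prec_as_zero)  # 0.5625
```

Reporting precision with `zero_division=0` alongside the current value would indicate how much of the score comes from classes that a model never predicts.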


#### 2.4 Hyperparameter Tuning

In [73]:
model_params = {
    'Decision Tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'max_depth': [2, 3, 5, 10, 20],
            'min_samples_split': [2, 3, 5, 10, 15],
            'min_samples_leaf': [2, 3, 5, 10, 15],
            'criterion': ['gini', 'entropy', 'log_loss']
        }
    },
    'KNN': {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors' : [3, 5, 7, 9, 11],
            'weights' : ['uniform', 'distance'],
            'metric' : ['minkowski', 'euclidean', 'manhattan']
        }
    },
    'SVC': {
        'model': SVC(),
        'params': {
            'C': [0.1, 1, 10, 100],  
            'gamma': [1, 0.1, 0.01, 0.001], 
            'kernel': ['rbf', 'linear'] 
        }
    }
}

In [74]:
from sklearn.model_selection import GridSearchCV

scores = []
for name, model in model_params.items():
    grid_search = GridSearchCV(model['model'], model['params'], cv=5, n_jobs=-1)
    grid_search.fit(X_TRAIN, y_TRAIN)
    scores.append({
        'Best Estimator': grid_search.best_estimator_,
        'Best Score': grid_search.best_score_,
        'Best Params': grid_search.best_params_
    })
    
df_tuning = pd.DataFrame(scores, columns=['Best Estimator', 'Best Score', 'Best Params'])
df_tuning

|   | Best Estimator | Best Score | Best Params |
|---|---|---|---|
| 0 | DecisionTreeClassifier(criterion='entropy', ma... | 0.935264 | {'criterion': 'entropy', 'max_depth': 20, 'min... |
| 1 | KNeighborsClassifier(metric='manhattan', n_nei... | 0.880239 | {'metric': 'manhattan', 'n_neighbors': 3, 'wei... |
| 2 | SVC(C=100, gamma=0.1) | 0.899489 | {'C': 100, 'gamma': 0.1, 'kernel': 'rbf'} |
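To get a feeling for the cost of the grid search, the number of parameter combinations can be counted with `ParameterGrid`. The sketch below copies the three grids from the notebook and multiplies each count by the 5 cross-validation folds. The final refit of the best estimator is not included in the count.

```python
from sklearn.model_selection import ParameterGrid

# Parameter grids copied from the model_params dictionary above
grids = {
    "Decision Tree": {
        "max_depth": [2, 3, 5, 10, 20],
        "min_samples_split": [2, 3, 5, 10, 15],
        "min_samples_leaf": [2, 3, 5, 10, 15],
        "criterion": ["gini", "entropy", "log_loss"],
    },
    "KNN": {
        "n_neighbors": [3, 5, 7, 9, 11],
        "weights": ["uniform", "distance"],
        "metric": ["minkowski", "euclidean", "manhattan"],
    },
    "SVC": {
        "C": [0.1, 1, 10, 100],
        "gamma": [1, 0.1, 0.01, 0.001],
        "kernel": ["rbf", "linear"],
    },
}

n_folds = 5  # cv=5 in the GridSearchCV call
counts = {name: len(ParameterGrid(params)) for name, params in grids.items()}

for name, n_candidates in counts.items():
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_folds} fits")
```

Some of these candidates are redundant: 'minkowski' with the default p=2 is the Euclidean distance, `gamma` has no effect on the linear kernel, and 'entropy' and 'log_loss' denote the same criterion for DecisionTreeClassifier. Removing the duplicates would shrink the search without changing its result.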


#### 2.5 Model Selection

In [75]:
best_classification_model = df_tuning.at[df_tuning['Best Score'].idxmax(), 'Best Estimator']
print("-------------- Best Classification Model: -------------- \n")
print(best_classification_model)

-------------- Best Classification Model: -------------- 

DecisionTreeClassifier(criterion='entropy', max_depth=20, min_samples_leaf=2,
                       min_samples_split=5)


In [76]:
evaluation = pd.DataFrame(
    [evaluate(best_classification_model, X_TRAIN, y_TRAIN)],
    columns=eval_metrics
)
evaluation

|   | Model | Accuracy | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|---|---|
| 0 | DecisionTreeClassifier(criterion='entropy', ma... | 0.931022 | 0.929046 | 0.931022 | 0.929771 | 0.841411 |


In [77]:
best_classification_model.fit(X_TRAIN, y_TRAIN)
preds = best_classification_model.predict(X_TEST)
test_report = {
    'Model': best_classification_model,
    'Accuracy': accuracy_score(y_TEST, preds),
    'Precision': precision_score(y_TEST, preds, average='weighted', zero_division=1),
    'Recall': recall_score(y_TEST, preds, average='weighted'),
    'F1-Score': f1_score(y_TEST, preds, average='weighted'),
    'MCC': matthews_corrcoef(y_TEST, preds)
}

best_model_test = pd.DataFrame([test_report], columns=eval_metrics)
best_model_test

|   | Model | Accuracy | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|---|---|
| 0 | DecisionTreeClassifier(criterion='entropy', ma... | 0.940736 | 0.940215 | 0.940736 | 0.940347 | 0.867134 |
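Weighted averages summarize all 8 classes in one number, which can hide weak performance on rare classes. A per-class view is available through `confusion_matrix` and `classification_report`. The sketch below uses made-up labels. With the notebook's variables, the same calls would take `y_TEST` and `preds`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for three classes; class 2 is rare and always misclassified
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall and F1 for each class separately
report = classification_report(y_true, y_pred, labels=[0, 1, 2], zero_division=0)
print(report)
```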


### 3. Regression Models

### 4. IVS Model Testing

In [78]:
# import pandas as pd
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.impute import SimpleImputer
# from sklearn.feature_selection import SequentialFeatureSelector as SFS
# from sklearn.linear_model import LinearRegression
# from sklearn.model_selection import KFold
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef
# from sklearn.tree import DecisionTreeClassifier
# 
# def encode_data(X):
#     encoded_values = {
#         'M': '0', 'F': '1',
#         'f': '0', 't': '1'
#     }
#     encoded = X.replace(encoded_values)
#     df_encoded = pd.get_dummies(encoded, columns=['referral source:'], dtype='int')
#     return df_encoded
# 
# def encode_target(y):
#     target = 'diagnoses'
#     value_mapping = {
#         '-': 0,                          # healthy
#         'A': 1, 'B': 1, 'C': 1, 'D': 1,  # hyperthyroid conditions
#         'E': 2, 'F': 2, 'G': 2, 'H': 2,  # hypothyroid conditions
#         'I': 3, 'J': 3,                  # binding protein
#         'K': 4,                          # general health
#         'L': 5, 'M': 5, 'N': 5,          # replacement therapy
#         'R': 6,                          # discordant results
#     }
#     df_target = pd.DataFrame(y, columns=[target])
#     df_target[target] = df_target[target].map(value_mapping).fillna(7).astype(int)
#     return df_target
# 
# def scale_data(X_train, X_test):
#     scaler = StandardScaler()
#     scaler.fit(X_train)
#     return scaler.transform(X_train), scaler.transform(X_test)
# 
# def impute_data(X_train, X_test):
#     imputer = SimpleImputer(strategy='mean')
#     imputer.fit(X_train)
#     return imputer.transform(X_train), imputer.transform(X_test)
# 
# def feature_selection(X_train, y_train, X_test, verbose=False): 
#     selector = SFS(LinearRegression(), 
#                    n_features_to_select=13, 
#                    direction='forward',
#                    n_jobs=-1)
#     selector.fit(X_train, y_train)
#     if verbose:
#         N, M = X_train.shape
#         features=selector.get_support()
#         features_selected = np.arange(M)[features]
#         print("The features selected are columns: ", features_selected)
#     return selector.transform(X_train), selector.transform(X_test)
# 
# 
# X_IVS = pd.read_csv('proj-test-data.csv') # Replace with the complete IVS set !
# y_IVS = pd.read_csv('proj-test-class.csv') # Replace with the complete IVS set !
# X_IVS.replace('?', np.nan, inplace=True)
# X_IVS_test = X_IVS.drop('[record identification]', axis = 1)
# 
# X = encode_data(X_IVS_test)
# y = encode_target(y_IVS['diagnoses'])
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# X_train_scl, X_test_scl = scale_data(X_train, X_test)
# X_train_imp, X_test_imp = impute_data(X_train_scl, X_test_scl)
# 
# X_TRAIN = pd.DataFrame(X_train_imp, columns=X_train.columns)
# X_TEST = pd.DataFrame(X_test_imp, columns=X_test.columns)
# X_TRAIN, X_TEST = feature_selection(X_TRAIN, y_train, X_TEST)
# 
# ##### Define the best model here ######
# classification_model = DecisionTreeClassifier(max_depth=10, min_samples_leaf=2, min_samples_split=5)
# classification_model.fit(X_TRAIN, y_train)
# preds = classification_model.predict(X_TEST)
# results = {
#     'Model': classification_model,
#     'Accuracy': accuracy_score(y_test, preds),
#     'Precision': precision_score(y_test, preds, average='weighted', zero_division=1),
#     'Recall': recall_score(y_test, preds, average='weighted'),
#     'F1-Score': f1_score(y_test, preds, average='weighted'),
#     'MCC': matthews_corrcoef(y_test, preds)
# }
# ##### Remove comment to show results ######
# # ivs_testing = pd.DataFrame([results], columns=eval_metrics)
# # ivs_testing
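The commented template above splits the IVS again and fits a new model on part of it. If the intention is instead to score the already trained model on unseen data (an assumption about the goal), the preprocessing and the classifier can be wrapped in one Pipeline that is fitted on the development data and only applied to the held-out set. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with about 10% of the values removed to mimic missing tests
X_all, y_all = make_classification(n_samples=400, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X_all[rng.random(X_all.shape) < 0.1] = np.nan

# The first 300 rows play the role of proj-data.csv, the last 100 the IVS
X_dev, y_dev = X_all[:300], y_all[:300]
X_ivs, y_ivs = X_all[300:], y_all[300:]

model = Pipeline([
    ("scale", StandardScaler()),
    ("impute", SimpleImputer(strategy="mean")),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

model.fit(X_dev, y_dev)            # all statistics come from the development data
ivs_preds = model.predict(X_ivs)   # the IVS is only transformed and predicted

print("IVS accuracy:", accuracy_score(y_ivs, ivs_preds))
```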