# Engenharia do Conhecimento 2023/2024

## Project: *Thyroid disease Data Set*

#### Group 6:

- Eduardo Proença 57551
- Tiago Oliveira 54979
- Bernardo Lopes 54386

### Summary

1. Data Processing

    1. Creating a Data Frame
    2. Data investigation
    3. Encoding Data
    4. Splitting into training and testing set
    5. Scaling Data
    6. Imputing missing values
    
2. Classification Models

    1. Feature Selection
    2. KFold Cross validation
    3. Classification models
       
3. Model Selection

    1. Hyperparameter tuning
    2. Model Testing

4. Conclusion

## 1. Data Processing

When constructing a machine learning model, data processing is an important step. We need to ensure that our data set is properly processed so that the different classification models can make the best possible use of it.

### 1.1 Creating a Data Frame

The first step is to load our data set. For that, we use the [Pandas](https://pandas.pydata.org) Python library to read the file "proj-data.csv", which contains the data set used in this project, and build a DataFrame.

In [1]:
import pandas as pd

# Load data set
df_thyroid = pd.read_csv('proj-data.csv')
df_thyroid.shape

(7338, 31)

In [2]:
df_thyroid.head()

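|   | age: | sex: | on thyroxine: | query on thyroxine: | on antithyroid medication: | sick: | pregnant: | thyroid surgery: | I131 treatment: | query hypothyroid: | ... | TT4: | T4U measured: | T4U: | FTI measured: | FTI: | TBG measured: | TBG: | referral source: | diagnoses | [record identification] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | F | f | f | f | f | f | f | f | t | ... | ? | f | ? | f | ? | f | ? | other | - | [861106018] |
| 1 | 29 | F | f | f | f | f | f | f | f | f | ... | 128 | f | ? | f | ? | f | ? | other | - | [860916073] |
| 2 | 36 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 26 | other | - | [850726049] |
| 3 | 60 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 26 | other | - | [861010020] |
| 4 | 77 | F | f | f | f | f | f | f | f | f | ... | ? | f | ? | f | ? | t | 21 | other | - | [860324074] |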


### 1.2 Data investigation

After building our DataFrame, it's important to do some investigation, so we can gain a better understanding of our data.

In [3]:
import numpy as np

df = df_thyroid.copy()
df.replace('?', np.nan, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7338 entries, 0 to 7337
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   age:                        7338 non-null   int64 
 1   sex:                        7098 non-null   object
 2   on thyroxine:               7338 non-null   object
 3   query on thyroxine:         7338 non-null   object
 4   on antithyroid medication:  7338 non-null   object
 5   sick:                       7338 non-null   object
 6   pregnant:                   7338 non-null   object
 7   thyroid surgery:            7338 non-null   object
 8   I131 treatment:             7338 non-null   object
 9   query hypothyroid:          7338 non-null   object
 10  query hyperthyroid:         7338 non-null   object
 11  lithium:                    7338 non-null   object
 12  goitre:                     7338 non-null   object
 13  tumor:                      7338 non-null   obje

In [4]:
print("Unique values:")
for col in df_thyroid.columns:
    unique_vals = df_thyroid[col].nunique()
    print(f'{col} = ', unique_vals)

Unique values:
age: =  98
sex: =  3
on thyroxine: =  2
query on thyroxine: =  2
on antithyroid medication: =  2
sick: =  2
pregnant: =  2
thyroid surgery: =  2
I131 treatment: =  2
query hypothyroid: =  2
query hyperthyroid: =  2
lithium: =  2
goitre: =  2
tumor: =  2
hypopituitary: =  2
psych: =  2
TSH measured: =  2
TSH: =  351
T3 measured: =  2
T3: =  82
TT4 measured: =  2
TT4: =  278
T4U measured: =  2
T4U: =  167
FTI measured: =  2
FTI: =  303
TBG measured: =  2
TBG: =  61
referral source: =  6
diagnoses =  32
[record identification] =  7338


In [5]:
df.isna().sum()

age:                             0
sex:                           240
on thyroxine:                    0
query on thyroxine:              0
on antithyroid medication:       0
sick:                            0
pregnant:                        0
thyroid surgery:                 0
I131 treatment:                  0
query hypothyroid:               0
query hyperthyroid:              0
lithium:                         0
goitre:                          0
tumor:                           0
hypopituitary:                   0
psych:                           0
TSH measured:                    0
TSH:                           671
T3 measured:                     0
T3:                           2068
TT4 measured:                    0
TT4:                           362
T4U measured:                    0
T4U:                           664
FTI measured:                    0
FTI:                           658
TBG measured:                    0
TBG:                          7054
referral source:    

In [6]:
#df_cleaned = df.drop(['TBG measured:', 'TBG:', '[record identification]'], axis = 1)

# temp = df.drop('[record identification]', axis = 1)
# df_cleaned = temp.dropna(subset=['T3:'])
# TODO have another look at this issue
df_cleaned = df.drop('[record identification]', axis = 1)
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7338 entries, 0 to 7337
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   age:                        7338 non-null   int64 
 1   sex:                        7098 non-null   object
 2   on thyroxine:               7338 non-null   object
 3   query on thyroxine:         7338 non-null   object
 4   on antithyroid medication:  7338 non-null   object
 5   sick:                       7338 non-null   object
 6   pregnant:                   7338 non-null   object
 7   thyroid surgery:            7338 non-null   object
 8   I131 treatment:             7338 non-null   object
 9   query hypothyroid:          7338 non-null   object
 10  query hyperthyroid:         7338 non-null   object
 11  lithium:                    7338 non-null   object
 12  goitre:                     7338 non-null   object
 13  tumor:                      7338 non-null   obje

### 1.3 Encoding Data

Some classification models cannot handle object-type values. For this reason, we need to transform our DataFrame so that it is composed only of numeric types. This process is referred to as data encoding.

In [7]:
target = 'diagnoses'
encoded_values = {
    'M': '0', 'F': '1',
    'f': '0', 't': '1'
}

df_target = pd.DataFrame(df_cleaned[target], columns=[target])
encoded = df_cleaned.drop(target, axis=1).replace(encoded_values)
df_encoded = pd.get_dummies(encoded, columns=['referral source:'], dtype='int')
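
The encoding idea can be illustrated on a tiny, made-up frame. This is only a sketch: the three rows below are invented to mimic the column layout, and the binary flags are mapped to integers with `Series.map` (an assumption for the illustration; the cell above maps them to the strings `'0'` and `'1'` with `replace`).

```python
import pandas as pd

# Made-up rows that only mimic the layout of the thyroid data
toy = pd.DataFrame({
    'sex:': ['F', 'M', 'F'],
    'sick:': ['f', 't', 'f'],
    'referral source:': ['other', 'SVI', 'other'],
})

# Binary columns: map each of the two categories to 0 or 1
toy['sex:'] = toy['sex:'].map({'M': 0, 'F': 1})
toy['sick:'] = toy['sick:'].map({'f': 0, 't': 1})

# Nominal column with more than two categories: one-hot encode it,
# producing one 0/1 indicator column per category
toy_encoded = pd.get_dummies(toy, columns=['referral source:'], dtype='int')
print(toy_encoded)
```

One-hot encoding avoids imposing an artificial order on the referral sources, which a single integer code would do.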

In [8]:
value_mapping = {
    '-': 0,                          # healthy
    'A': 1, 'B': 1, 'C': 1, 'D': 1,  # hyperthyroid conditions
    'E': 2, 'F': 2, 'G': 2, 'H': 2,  # hypothyroid conditions
    'I': 3, 'J': 3,                  # binding protein
    'K': 4,                          # general health
    'L': 5, 'M': 5, 'N': 5,          # replacement therapy
    'R': 6,                          # discordant results
}
# Any diagnosis code not listed in value_mapping falls into the catch-all class 7
df_target[target] = df_target[target].map(value_mapping).fillna(7).astype(int)
df_target[target].unique()

array([0, 6, 5, 2, 4, 3, 1, 7])

In [9]:
df = pd.concat([df_encoded, df_target], axis=1)
df.head()

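|   | age: | sex: | on thyroxine: | query on thyroxine: | on antithyroid medication: | sick: | pregnant: | thyroid surgery: | I131 treatment: | query hypothyroid: | ... | FTI: | TBG measured: | TBG: | referral source:_STMW | referral source:_SVHC | referral source:_SVHD | referral source:_SVI | referral source:_WEST | referral source:_other | diagnoses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 1 | 26.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 1 | 26.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 77 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 1 | 21.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |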


### 1.4 Splitting into training and testing set

In [10]:
from sklearn.model_selection import train_test_split

X = df.drop(target, axis='columns')
y = df[target]

X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(X, y, test_size=0.2)

# Print the shapes of the training and testing sets
print('Training set shape:', X_TRAIN.shape, y_TRAIN.shape)
print('Testing set shape:', X_IVS.shape, y_IVS.shape)

Training set shape: (5870, 34) (5870,)
Testing set shape: (1468, 34) (1468,)
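
The split above is drawn at random on every run and does not take the class distribution into account. If a reproducible split that preserves class proportions were wanted, a sketch could look like the following. The synthetic, imbalanced data and the choice of `random_state=42` are assumptions made only for this illustration; stratification also requires every class to have at least two samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 80 samples of class 0, 20 of class 1
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffle; stratify keeps the 80/20 class ratio
# in both the training and the testing part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print('Train class counts:', np.bincount(y_tr))
print('Test class counts: ', np.bincount(y_te))
```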


### 1.5 Scaling Data

In [11]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_TRAIN)

X_train_scl = scaler.transform(X_TRAIN)
X_ivs_scl = scaler.transform(X_IVS)

### 1.6 Imputing missing values

In [12]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

X_train_imp = imputer.fit_transform(X_train_scl)
X_ivs_imp = imputer.transform(X_ivs_scl)

X_TRAIN = pd.DataFrame(X_train_imp, columns=X_TRAIN.columns)
X_IVS = pd.DataFrame(X_ivs_imp, columns=X_IVS.columns)

missing_values = X_TRAIN.isna().sum().sum() + X_IVS.isna().sum().sum()
print(f"Missing values count: {missing_values}")

Missing values count: 0
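
The scaling and imputation steps can also be chained in a single scikit-learn `Pipeline`, so that both are fitted on the training data only and then reused on the testing data. The sketch below uses a small invented array, not the thyroid data, and keeps the same order as above (scale first, then impute).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented toy arrays with missing entries
X_train_toy = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test_toy = np.array([[np.nan, 20.0]])

# StandardScaler ignores NaN when fitting and leaves them in place when
# transforming; the imputer then fills them with the (scaled) column mean
prep = Pipeline([
    ('scale', StandardScaler()),
    ('impute', SimpleImputer(missing_values=np.nan, strategy='mean')),
])

X_train_prep = prep.fit_transform(X_train_toy)  # statistics learned here
X_test_prep = prep.transform(X_test_toy)        # and only applied here

print(X_train_prep)
print(X_test_prep)
```

Because a standardized column has mean 0 on the training data, mean imputation after scaling amounts to filling the missing entries with 0.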


In [13]:
X_TRAIN.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5870 entries, 0 to 5869
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   age:                        5870 non-null   float64
 1   sex:                        5870 non-null   float64
 2   on thyroxine:               5870 non-null   float64
 3   query on thyroxine:         5870 non-null   float64
 4   on antithyroid medication:  5870 non-null   float64
 5   sick:                       5870 non-null   float64
 6   pregnant:                   5870 non-null   float64
 7   thyroid surgery:            5870 non-null   float64
 8   I131 treatment:             5870 non-null   float64
 9   query hypothyroid:          5870 non-null   float64
 10  query hyperthyroid:         5870 non-null   float64
 11  lithium:                    5870 non-null   float64
 12  goitre:                     5870 non-null   float64
 13  tumor:                      5870 

## 2. Classification Models

### 2.1 Feature Selection

In [14]:
# TODO feature selection needs tuning
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

selector = SFS(LinearRegression(), 
               n_features_to_select=13, 
               direction='forward',
               n_jobs=-1)
selector.fit(X_TRAIN, y_TRAIN)

N, M = X_TRAIN.shape
features=selector.get_support()
features_selected = np.arange(M)[features]
print("The features selected are columns: ", features_selected)

X_TRAIN = selector.transform(X_TRAIN)
X_IVS = selector.transform(X_IVS)

The features selected are columns:  [ 2  6 16 17 18 19 21 25 26 27 28 29 31]
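
The selector above scores candidate features with `LinearRegression`, whose default score is R². Since the target is a class label, one possible alternative is to drive the selection with a classifier and an explicit classification metric. The sketch below is only an illustration on synthetic data; the choice of `DecisionTreeClassifier`, `scoring='accuracy'`, `cv=3` and three features are assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem: 8 features, 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    n_redundant=0, random_state=0)

# Forward selection: start from no features and greedily add the one
# that improves cross-validated accuracy the most
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=3,
    direction='forward',
    scoring='accuracy',
    cv=3,
)
selector.fit(X_toy, y_toy)

# get_support(indices=True) returns the selected column indices directly
selected = selector.get_support(indices=True)
print('Selected columns:', selected)

X_toy_selected = selector.transform(X_toy)
```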


### 2.2 KFold Cross validation

In [15]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef

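def cross_validation(model, X, y):
    """Run one shuffled 5-fold pass and return the pooled (truth, predictions)."""
    TRUTH = None
    PREDS = None
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        # X is a NumPy array (output of the selector); y is a pandas Series
        X_train, y_train = X[train_index], y.to_numpy()[train_index]
        X_test, y_test = X[test_index], y.to_numpy()[test_index]
        
        # Fit on four folds, predict the held-out fold
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        # Accumulate the held-out labels and predictions across folds
        if TRUTH is None:
            PREDS = preds
            TRUTH = y_test
        else:
            PREDS = np.hstack((PREDS, preds))
            TRUTH = np.hstack((TRUTH, y_test))
    return TRUTH, PREDS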

In [16]:
def evaluate(model, X, y, n_iter=10):
    """Repeat the 5-fold cross-validation n_iter times and average each metric."""
    accuracy = []
    precision = []
    recall = []
    f1 = []
    mcc = []
    
    for _ in range(n_iter):
        # Each repetition reshuffles the folds, so the scores vary slightly
        truth, preds = cross_validation(model, X, y)
        accuracy.append(accuracy_score(truth, preds))
        # 'weighted' averages the per-class scores by class support
        precision.append(precision_score(truth, preds, average='weighted', zero_division=1))
        recall.append(recall_score(truth, preds, average='weighted'))
        f1.append(f1_score(truth, preds, average='weighted'))
        mcc.append(matthews_corrcoef(truth, preds))
        
    return {
        'Model': model,
        'Accuracy': np.mean(accuracy),
        'Precision': np.mean(precision),
        'Recall': np.mean(recall),
        'F1-Score': np.mean(f1),
        'MCC': np.mean(mcc)
    }
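
The pooling of out-of-fold predictions done by hand above is comparable to what scikit-learn's `cross_val_predict` provides. The following is a sketch on synthetic data, assuming stratified folds are wanted (the helper above uses plain `KFold`); the data set and the fixed `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the real data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted exactly once, by the model that did not see it;
# the result is aligned with the original order of y_toy
preds = cross_val_predict(DecisionTreeClassifier(random_state=0),
                          X_toy, y_toy, cv=cv)

print('Accuracy:', accuracy_score(y_toy, preds))
print('MCC:     ', matthews_corrcoef(y_toy, preds))
```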

### 2.3 Classification models

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

cols = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'MCC']

tree = evaluate(DecisionTreeClassifier(), X_TRAIN, y_TRAIN)
lgr = evaluate(LogisticRegression(), X_TRAIN, y_TRAIN)
naive_bayes = evaluate(GaussianNB(), X_TRAIN, y_TRAIN)
knn = evaluate(KNeighborsClassifier(), X_TRAIN, y_TRAIN)
svm = evaluate(SVC(), X_TRAIN, y_TRAIN)

pd.DataFrame([tree, lgr, naive_bayes, knn, svm], columns=cols)

|   | Model | Accuracy | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|---|---|
| 0 | DecisionTreeClassifier() | 0.925111 | 0.925136 | 0.925111 | 0.925046 | 0.831904 |
| 1 | LogisticRegression() | 0.855826 | 0.842755 | 0.855826 | 0.838627 | 0.641889 |
| 2 | GaussianNB() | 0.153833 | 0.804914 | 0.153833 | 0.133193 | 0.186626 |
| 3 | KNeighborsClassifier() | 0.864787 | 0.857604 | 0.864787 | 0.852912 | 0.668688 |
| 4 | SVC() | 0.844157 | 0.837282 | 0.844157 | 0.81921 | 0.604965 |


## 3. Model Selection

### 3.1 Hyperparameter tuning

In [18]:
model_params = {
    'Decision Tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'max_depth': [2, 3, 5, 10, 20],
            'min_samples_split': [2, 3, 5, 10, 15],
            'min_samples_leaf': [2, 3, 5, 10, 15],
            'criterion': ['gini', 'entropy', 'log_loss']
        }
    },
    'KNN': {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors' : [3, 5, 7, 9, 11],
            'weights' : ['uniform', 'distance'],
            'metric' : ['minkowski', 'euclidean', 'manhattan']
        }
    },
    'SVC': {
        'model': SVC(),
        'params': {
            'C': [0.1, 1, 10, 100],  
            'gamma': [1, 0.1, 0.01, 0.001], 
            'kernel': ['rbf', 'linear'] 
        }
    }
}

In [19]:
import time
from sklearn.model_selection import GridSearchCV

scores = []
start_time = time.time()
for name, model in model_params.items():
    grid_search = GridSearchCV(model['model'], model['params'], cv=5, n_jobs=-1)
    grid_search.fit(X_TRAIN, y_TRAIN)
    scores.append({
        'Model': name,
        'Best Score': grid_search.best_score_,
        'Best Estimator': grid_search.best_estimator_
    })
    
print('Computation time: %.2f' % (time.time() - start_time))

Computation time: 36.34


In [20]:
df_tuning = pd.DataFrame(scores, columns=['Model', 'Best Score', 'Best Estimator'])
df_tuning

|   | Model | Best Score | Best Estimator |
|---|---|---|---|
| 0 | Decision Tree | 0.935775 | DecisionTreeClassifier(criterion='entropy', ma... |
| 1 | KNN | 0.881261 | KNeighborsClassifier(metric='manhattan', n_nei... |
| 2 | SVC | 0.897445 | SVC(C=100, gamma=0.1) |
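
The estimator representations in the table are cut off, so the winning hyperparameters are not fully visible. A fitted `GridSearchCV` exposes them through `best_params_`, and the score of every combination through `cv_results_`. The sketch below shows this on synthetic data with a deliberately small, made-up grid; it is not the grid used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem used only to demonstrate the API
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Small illustrative grid: 3 x 2 = 6 combinations
param_grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_toy, y_toy)

# Full dictionary of the winning hyperparameters
print('Best parameters:', grid.best_params_)

# One row per combination, best-ranked first
results = (pd.DataFrame(grid.cv_results_)
           [['params', 'mean_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(results)
```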


In [21]:
tuned_reports = []
for index, row in df_tuning.iterrows():
    tuned_reports.append(evaluate(row['Best Estimator'], X_TRAIN, y_TRAIN))

best_models = pd.DataFrame(tuned_reports, columns=cols)
best_models

|   | Model | Accuracy | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|---|---|
| 0 | DecisionTreeClassifier(criterion='entropy', ma... | 0.931601 | 0.930256 | 0.931601 | 0.930778 | 0.845168 |
| 1 | KNeighborsClassifier(metric='manhattan', n_nei... | 0.875894 | 0.870023 | 0.875894 | 0.869261 | 0.703624 |
| 2 | SVC(C=100, gamma=0.1) | 0.89569 | 0.889577 | 0.89569 | 0.89061 | 0.754115 |


### 3.2 Model Testing

In [22]:
test_report = []
for index, row in best_models.iterrows():
    model = row['Model']
    model.fit(X_TRAIN, y_TRAIN)
    preds = model.predict(X_IVS)
    test_report.append({
        'Model': model,
        'Accuracy': accuracy_score(y_IVS, preds),
        'Precision': precision_score(y_IVS, preds, average='weighted', zero_division=1),
        'Recall': recall_score(y_IVS, preds, average='weighted'),
        'F1-Score': f1_score(y_IVS, preds, average='weighted'),
        'MCC': matthews_corrcoef(y_IVS, preds)
    })

models_test = pd.DataFrame(test_report, columns=cols)
models_test

|   | Model | Accuracy | Precision | Recall | F1-Score | MCC |
|---|---|---|---|---|---|---|
| 0 | DecisionTreeClassifier(criterion='entropy', ma... | 0.933243 | 0.931569 | 0.933243 | 0.931986 | 0.841757 |
| 1 | KNeighborsClassifier(metric='manhattan', n_nei... | 0.895777 | 0.892282 | 0.895777 | 0.891922 | 0.744387 |
| 2 | SVC(C=100, gamma=0.1) | 0.912807 | 0.907943 | 0.912807 | 0.907769 | 0.785662 |
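
Weighted averages summarize all classes in one number, which can hide poor performance on rare classes. A per-class view is available through `confusion_matrix` and `classification_report`. The sketch below uses hand-written labels, not the project's predictions, purely to show how such a breakdown is obtained.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written toy labels for three classes (not the project's results)
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall and F1 for each class separately
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
for label in ['0', '1', '2']:
    print(label, report[label])
```

In this toy case the overall accuracy is 0.75, while the recall of class 2 is only 0.5, a gap that a single weighted score would not show.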
