# Engenharia do Conhecimento 2023/2024

## Project: *Thyroid disease Data Set*

#### Group 6:

- Eduardo Proença 57551
- Tiago Oliveira 54979
- Bernardo Lopes 54386


### Summary

To be done...

## 1. Data processing


### 1.1 Creating a Data Frame

First, we load the data set into a DataFrame. Using the [Pandas](https://pandas.pydata.org) Python library, we read the file `proj-data.csv`, which contains the data set used throughout this project.

In [None]:
import pandas as pd

# Load data set
df_thyroid = pd.read_csv('proj-data.csv')
df_thyroid.shape

In [None]:
df_thyroid.head()
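
The raw file marks missing entries with `?` (they are replaced with NaN in section 1.4). As a minimal sketch, `read_csv` can perform that conversion while parsing through its `na_values` parameter. The inline CSV below is made up for illustration and only borrows three column names from the data set; it is not the project's data.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for proj-data.csv
sample_csv = io.StringIO(
    "age:,sex:,TSH:\n"
    "41,F,1.3\n"
    "23,?,?\n"
    "46,M,0.98\n"
)

# na_values tells pandas to treat '?' as missing while parsing
df_sample = pd.read_csv(sample_csv, na_values="?")

# Count the missing entries in each column
missing_per_column = df_sample.isna().sum()
print(df_sample.dtypes)
print(missing_per_column)
```

Parsing `?` as missing at load time also lets pandas infer a numeric dtype for columns such as `TSH:`, which would otherwise be read as strings.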

### 1.2 Data investigation

In [None]:
df_thyroid.info()

In [None]:
for col in df_thyroid.columns:
    print("Values of ", end='')
    print(df_thyroid[col].value_counts(), end="\n\n")

### 1.3 Defining the train and target sets

In [None]:
df = df_thyroid.drop("[record identification]", axis = 1)

In [None]:
X = df.drop("diagnoses", axis='columns')
y = df["diagnoses"]

### 1.4 Encoding our data

In [None]:
import numpy as np

object_cols = [
    "sex:", "on thyroxine:", "query on thyroxine:", 
    "on antithyroid medication:", "sick:", "pregnant:",
    "thyroid surgery:", "I131 treatment:", "query hypothyroid:",
    "query hyperthyroid:", "lithium:", "goitre:", "tumor:", 
    "hypopituitary:", "psych:", "TSH measured:", "T3 measured:",
    "TT4 measured:", "T4U measured:", "FTI measured:", "TBG measured:",
    "referral source:"
]

numeric_cols = [
    "age:", "TSH:", "T3:", "TT4:", "T4U:", "FTI:", "TBG:"
]

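# '?' marks a missing value in the raw data
X.replace('?', np.nan, inplace=True)
# Coerce the numeric columns to numbers in case '?' caused them to be read as strings
numeric = X.drop(object_cols, axis=1).apply(pd.to_numeric, errors='coerce')
# One-hot encode the categorical columns (named to avoid shadowing the built-in `object`)
categorical = pd.get_dummies(X.drop(numeric_cols, axis=1), dtype='int')
X = pd.concat([categorical, numeric], axis=1)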
X.head()
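
To illustrate what the one-hot encoding step produces, here is a small sketch on a made-up two-column frame (the values are hypothetical; only the column names are borrowed from the data set). It shows that `get_dummies` expands the categorical column, passes the numeric one through, and leaves a row with a missing category as all zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one categorical column, one numeric column
demo = pd.DataFrame({"sex:": ["F", "M", np.nan], "age:": [41, 23, 46]})

# Only the categorical column is expanded; the numeric column passes through
encoded = pd.get_dummies(demo, dtype="int")
print(encoded)
```

Because a missing category becomes a row of zeros rather than NaN, the later imputation step only has to deal with the numeric columns.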

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# fit_transform learns the classes and encodes y in a single call
y = label_encoder.fit_transform(y)
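
Once the target is encoded as integers, the original labels can still be recovered from the fitted encoder. A minimal sketch, using hypothetical label strings rather than the actual values of the `diagnoses` column:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical diagnosis labels, used only for illustration
labels = np.array(["neg", "hypo", "neg", "hyper", "hypo"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels; position i is the label encoded as i
print(encoder.classes_)

# inverse_transform maps integer codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

This is useful when reporting predictions, since the models only ever see the integer codes.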

### 1.5 Splitting

In [None]:
from sklearn.model_selection import train_test_split

X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shapes of the training and testing sets
print("Training set shape:", X_TRAIN.shape, y_TRAIN.shape)
print("Testing set shape:", X_IVS.shape, y_IVS.shape)
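
The split above is purely random. If the classes turn out to be imbalanced, one option is to pass `stratify=` so that both parts keep the same class proportions. The sketch below uses synthetic data (`X_demo`, `y_demo` are illustrative names) and is not what the notebook does; note that `train_test_split` raises an error when stratifying if any class has a single sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 90 samples of class 0, 10 of class 1
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 90 + [1] * 10)

# stratify=y_demo keeps the 90/10 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts:", np.bincount(y_te))
```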

### 1.6 Handling missing values

In [None]:
X_TRAIN = X_TRAIN.drop("TBG:", axis='columns')
X_IVS = X_IVS.drop("TBG:", axis='columns')

In [None]:
from sklearn.impute import KNNImputer

# Initialize KNNImputer with k=3 neighbours (adjust k as needed)
imputer = KNNImputer(n_neighbors=3)

# Fit the imputer on the training set only, then apply it to both sets
X_train_imputed = imputer.fit_transform(X_TRAIN)
X_ivs_imputed = imputer.transform(X_IVS)

# Convert the imputed arrays back to DataFrames
X_TRAIN = pd.DataFrame(X_train_imputed, columns=X_TRAIN.columns)
X_IVS = pd.DataFrame(X_ivs_imputed, columns=X_IVS.columns)

### 1.7 Scaling Data

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_TRAIN)

X_TRAIN_scl = scaler.transform(X_TRAIN)
X_IVS_scl = scaler.transform(X_IVS)

pd.DataFrame(X_TRAIN_scl, columns = X_TRAIN.columns).head()
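
Both the imputer and the scaler are fitted on the training split only and then applied to the other split. One way to keep that fit-on-train discipline in a single object is a scikit-learn `Pipeline`. The following is a sketch on synthetic data (`X_demo`, `X_fit` and `X_new` are illustrative names), not a step the notebook performs.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data with roughly 10% missing entries (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 4))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
X_fit, X_new = X_demo[:40], X_demo[40:]

preprocess = Pipeline([
    ("impute", KNNImputer(n_neighbors=3)),
    ("scale", StandardScaler()),
])

# Statistics are learned from X_fit only, then reused for X_new
X_fit_scl = preprocess.fit_transform(X_fit)
X_new_scl = preprocess.transform(X_new)

print(X_fit_scl.mean(axis=0).round(6))
print(np.isnan(X_new_scl).sum())
```

A pipeline like this could also be placed in front of a classifier, so that cross-validation refits the preprocessing inside each fold.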

## 2. Classification Models

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef

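def evaluate(model):
    """Print 5-fold cross-validation metrics for `model` on the scaled training set."""
    TRUTH = None
    PREDS = None
    # Fixed seed so that every model is scored on the same folds
    kf = KFold(n_splits=5, shuffle=True, random_state=42)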
    for train_index, test_index in kf.split(X_TRAIN):
        X_train, X_test = X_TRAIN_scl[train_index], X_TRAIN_scl[test_index]
        y_train, y_test = y_TRAIN[train_index], y_TRAIN[test_index]
        
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        if TRUTH is None:
            PREDS = preds
            TRUTH = y_test
        else:
            PREDS = np.hstack((PREDS, preds))
            TRUTH = np.hstack((TRUTH, y_test))
            
    print("Cross validation statistics:")
    print("The Accuracy is: %7.4f" % accuracy_score(TRUTH, PREDS))
    print("The Precision is: %7.4f" % precision_score(TRUTH, PREDS, average='weighted', zero_division=1))
    print("The Recall is: %7.4f" % recall_score(TRUTH, PREDS, average='weighted'))
    print("The F1 score is: %7.4f" % f1_score(TRUTH, PREDS, average='weighted'))
    print("The Matthews correlation coefficient is: %7.4f" % matthews_corrcoef(TRUTH, PREDS))
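
The `evaluate` function pools the out-of-fold predictions from all folds before scoring. scikit-learn's `cross_val_predict` returns the same kind of pooled out-of-fold predictions, ordered like the input rows. A minimal sketch on synthetic data, where `X_demo` and `y_demo` are illustrative stand-ins for `X_TRAIN_scl` and `y_TRAIN`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Each sample is predicted by the model trained on the folds that exclude it
preds = cross_val_predict(model, X_demo, y_demo, cv=kf)

print("Accuracy: %7.4f" % accuracy_score(y_demo, preds))
print("MCC: %7.4f" % matthews_corrcoef(y_demo, preds))
```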

In [None]:
from sklearn.tree import DecisionTreeClassifier

evaluate(DecisionTreeClassifier(max_depth = 3))

In [None]:
from sklearn.linear_model import LogisticRegression

evaluate(LogisticRegression(max_iter=1000))

In [None]:
from sklearn.naive_bayes import GaussianNB

evaluate(GaussianNB())

In [None]:
from sklearn.neighbors import KNeighborsClassifier

evaluate(KNeighborsClassifier(n_neighbors = 5, weights = "distance"))

In [None]:
from sklearn.svm import SVC

evaluate(SVC(kernel = "rbf", C = 1, gamma = 0.1))