# Engenharia do Conhecimento 2023/2024

## Project: *Thyroid disease Data Set*

#### Group 6:

- Eduardo Proença 57551
- Tiago Oliveira 54979
- Bernardo Lopes 54386


### Summary

To be done...

## 1. Data processing


### 1.1 Creating a Data Frame

First, we create a DataFrame. Using the [Pandas](https://pandas.pydata.org) Python library, we read the data from the file `proj-data.csv`, which contains the data set used in this project.

In [None]:
import pandas as pd

# Load data set
df_thyroid = pd.read_csv('proj-data.csv')
df_thyroid.shape

In [None]:
df_thyroid.head()
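The encoding step later in this notebook treats `'?'` as a missing-value marker. As a minimal sketch, assuming that convention holds across the whole file, the marker can also be parsed as NaN at load time through the `na_values` parameter of `pd.read_csv`. The three-row sample and the column names `age:`, `sex:` and `TSH:` below are hypothetical stand-ins for `proj-data.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for proj-data.csv
sample_csv = io.StringIO(
    "age:,sex:,TSH:\n"
    "41,F,1.3\n"
    "23,?,?\n"
    "46,M,0.98\n"
)

# na_values tells read_csv to parse '?' as NaN while loading
df_sample = pd.read_csv(sample_csv, na_values="?")

print(df_sample.dtypes)        # column types after parsing
print(df_sample.isna().sum())  # missing values per column
```

Parsing the marker early lets numeric columns load as floats instead of strings, which may simplify the later imputation step.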

### 1.2 Data investigation

In [None]:
df_thyroid.info()

In [None]:
# Print the frequency of each distinct value, column by column
for col in df_thyroid.columns:
    print(f"Values of {col}:")
    print(df_thyroid[col].value_counts(), end="\n\n")
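The per-column value counts above can be condensed into a single summary of how much of each column is missing. The sketch below assumes that `'?'` marks a missing entry and uses a hypothetical miniature frame, not the real data.

```python
import pandas as pd

# Hypothetical miniature frame; the real columns come from proj-data.csv
df_demo = pd.DataFrame({
    "sex:": ["F", "?", "M", "F"],
    "TSH:": ["1.3", "?", "0.98", "?"],
})

# Fraction of '?' entries per column, highest first
missing_share = (df_demo == "?").mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high share of missing entries are candidates for removal before imputation.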

### 1.3 Defining the feature and target sets

In [None]:
df = df_thyroid.drop("[record identification]", axis = 1)

In [None]:
X = df.drop("diagnoses", axis='columns')
y = df["diagnoses"]

### 1.4 Encoding our data

In [None]:
import numpy as np

encoded_values = {
    'M': 0, 'F': 1,  # sex
    'f': 0, 't': 1,  # boolean flags
    '?': np.nan  # missing-value marker (np.NaN was removed in NumPy 2.0)
}

X_encoded = pd.get_dummies(X.replace(encoded_values), 
                           columns=["referral source:"], 
                           dtype='int')
X_encoded.head()

In [None]:
y_encoded = pd.get_dummies(y, dtype='int') # TODO needs a different encoding strategy 
y_encoded.head()
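The TODO in the cell above asks for a different encoding strategy. One possible option, sketched here as an assumption and not as the encoding this project settles on, is scikit-learn's `LabelEncoder`, which maps each class to one integer and yields a 1-D target. The label values below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical diagnosis labels; the real values come from the "diagnoses" column
y_demo = pd.Series(["-", "K", "-", "G", "K"], name="diagnoses")

label_encoder = LabelEncoder()
y_labels = label_encoder.fit_transform(y_demo)  # 1-D array of integer class ids

print(y_labels)
print(label_encoder.classes_)                         # class name for each id
print(label_encoder.inverse_transform(y_labels[:2]))  # ids back to class names
```

A 1-D integer target is accepted by every scikit-learn classifier and by metrics such as `matthews_corrcoef`, whereas a one-hot target is interpreted as a multilabel problem.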

### 1.5 Splitting

In [None]:
from sklearn.model_selection import train_test_split

X = X_encoded
y = y_encoded

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shapes of the training and testing sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)
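If the diagnosis classes turn out to be imbalanced, a stratified split may be worth considering. The sketch below uses hypothetical synthetic labels to show how the `stratify` parameter of `train_test_split` keeps the class ratio the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train share of class 1:", y_tr.mean())
print("Test share of class 1:", y_te.mean())
```

Stratification needs at least two samples per class; otherwise `train_test_split` raises a `ValueError`, so very rare classes would have to be handled first.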

### 1.6 Imputation of missing values

In [None]:
X_train = X_train.drop("TBG:", axis='columns')
X_test = X_test.drop("TBG:", axis='columns')

In [None]:
from sklearn.impute import KNNImputer

# Initialize KNNImputer with k=3 neighbors (adjust n_neighbors as needed)
imputer = KNNImputer(n_neighbors=3)

# Fit the imputer on the training set only, then apply it to both sets
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Convert the imputed arrays back to DataFrames, keeping the original row index
X_train = pd.DataFrame(X_train_imputed, columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(X_test_imputed, columns=X_test.columns, index=X_test.index)

### 1.7 Normalization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scl = scaler.transform(X_train)
X_test_scl = scaler.transform(X_test)

pd.DataFrame(X_train_scl, columns = X_train.columns).head()
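The imputation and scaling steps above can also be chained into one scikit-learn `Pipeline`, so that both are fitted on the training data and reused unchanged on the test data. This is a minimal sketch on hypothetical numeric data; the step names `impute` and `scale` are arbitrary.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with missing entries in the first two columns
X_tr_demo = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 200.0],
    [3.0, 30.0, 300.0],
    [np.nan, 40.0, 400.0],
    [5.0, 50.0, 500.0],
])
X_te_demo = np.array([[2.5, np.nan, 250.0]])

preprocess = Pipeline([
    ("impute", KNNImputer(n_neighbors=3)),
    ("scale", StandardScaler()),
])

# Fit on the training data only, then reuse the fitted steps on the test data
X_tr_out = preprocess.fit_transform(X_tr_demo)
X_te_out = preprocess.transform(X_te_demo)

print(X_tr_out.mean(axis=0))  # close to 0 for each column after scaling
print(X_te_out)
```

A classifier can be appended as a final step, which makes it harder to leak test-set statistics into the preprocessing by accident.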

## 2. Classification Models

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef

def evaluate_model(model):
    """
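    Fit a classification model on the scaled training data and evaluate it on the scaled test data.

    Relies on the notebook-level variables X_train_scl, X_test_scl, y_train and y_test.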
    
    Args:
    - model: A scikit-learn classification model object.
    
    Returns:
    - metrics (dict): A dictionary containing evaluation metrics.
    """
    # Fit the model on the training data
    model.fit(X_train_scl, y_train)
    
    # Make predictions on the testing data
    y_pred = model.predict(X_test_scl)
    
    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    #mcc = matthews_corrcoef(y_test, y_pred)
    
    # Print the evaluation metrics
    print("Model Evaluation Metrics:")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1 Score: {f1:.4f}")
    #print(f"  Matthews Correlation Coefficient: {mcc:.4f}")
    
    # Store the evaluation metrics in a dictionary
    metrics = {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
        #"Matthews Correlation Coefficient": mcc
    }
    
    return metrics


In [None]:
from sklearn.tree import DecisionTreeClassifier

metrics = evaluate_model(DecisionTreeClassifier(max_depth = 5))

In [None]:
# from sklearn.linear_model import LogisticRegression
# 
# evaluate_model(LogisticRegression(penalty = "l2"))
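`LogisticRegression` expects a 1-D label vector, so it cannot be fitted directly on a one-hot target such as the output of `pd.get_dummies`. As a sketch under that assumption, the one-hot frame can be collapsed back to a single label column with `idxmax` before fitting. The data and the column names below are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical one-hot target, shaped like the output of pd.get_dummies
y_onehot = pd.DataFrame({"-": [1, 0, 1, 0], "K": [0, 1, 0, 1]})
X_demo = pd.DataFrame({"feature": [0.1, 0.9, 0.2, 0.8]})

# idxmax(axis=1) returns the name of the column holding the 1 in each row
y_1d = y_onehot.idxmax(axis=1)

# The default LogisticRegression settings already apply an L2 penalty
model = LogisticRegression()
model.fit(X_demo, y_1d)
print(model.predict(X_demo))
```

With a 1-D target like `y_1d`, the commented-out `matthews_corrcoef` metric in `evaluate_model` could also be computed, since it does not support multilabel input.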