# Engenharia do Conhecimento 2023/2024

## Project: Classification and Regression models with *Thyroid disease Data Set*

#### Group 6:

- Eduardo Proença 57551
- Tiago Oliveira 54979
- Bernardo Lopes 54386

### Summary

1. Data Processing

    1. Creating a Data Frame
    2. Data investigation
    3. Encoding Data
    4. Splitting into training and testing set
    5. Scaling Data
    6. Imputing missing values
    
2. Classification Models

    1. Feature Selection
    2. KFold Cross validation
    3. Classification models
       
3. Model Selection

    1. Hyperparameter tuning
    2. Model Testing

4. Conclusion

## 1. Data Processing

Data processing is an important step when constructing a machine learning model. We need to ensure that our data set is properly processed so that the different classification models can make the best possible use of it.

### 1.1 Creating a Data Frame

The first step is to load our data set. For that, we use the [Pandas](https://pandas.pydata.org) Python library to read the file "proj-data.csv", which contains the data set used in this project, and build a DataFrame.

In [None]:
import pandas as pd

# Load data set
df_thyroid = pd.read_csv('proj-data.csv')
df_thyroid.shape

In [None]:
df_thyroid.head()

### 1.2 Data investigation

After building our DataFrame, it's important to do some investigation, so we can gain a better understanding of our data.

The first detail we notice is that missing values are represented by '?'. They will be handled later in this notebook. For now,
we just replace them with NaN so that they can be identified.

In [None]:
import numpy as np

# Replace missing values with NaN
df = df_thyroid.copy()
df.replace('?', np.nan, inplace=True)

df.info()

Next, we look at the number of distinct values in each column.

In [None]:
print("Unique values:")
# Use df (where '?' is already NaN) so the placeholder is not counted as a value
for col in df.columns:
    print(f'{col} = {df[col].nunique()}')

Let's look at the number of missing values of each column.

In [None]:
df.isna().sum()

After some investigation, we know that our data set has two main types of columns: binary columns, which take only two possible non-numeric values,
and numeric columns. There are three further columns: the referral source, which can take six different values;
the diagnoses, which is our target variable; and the record identification, which is only a unique identifier and can therefore be dropped.

We will impute the missing values later instead of deleting the affected rows, because they represent measurements that were not taken
rather than values that are truly missing.

In [None]:
# Dropping the [record identification] column
df_cleaned = df.drop('[record identification]', axis = 1)
df_cleaned.info()

### 1.3 Encoding Data

Now our data set is ready to be encoded. In this process, the two values of every binary column are mapped
to 0 and 1. The "referral source" column is one-hot encoded using the get_dummies function from [Pandas](https://pandas.pydata.org).

As for the target variable, it is encoded into the 8 classes given to us in the file "data.names". Any diagnosis code outside the mapping below is assigned to class 7.

In [None]:
target = 'diagnoses'
encoded_values = {
    'M': '0', 'F': '1',
    'f': '0', 't': '1'
}

df_target = pd.DataFrame(df_cleaned[target], columns=[target])
encoded = df_cleaned.drop(target, axis=1).replace(encoded_values)
df_encoded = pd.get_dummies(encoded, columns=['referral source:'], dtype='int')

In [None]:
# Target variable encoding
value_mapping = {
    '-': 0,                          # healthy
    'A': 1, 'B': 1, 'C': 1, 'D': 1,  # hyperthyroid conditions
    'E': 2, 'F': 2, 'G': 2, 'H': 2,  # hypothyroid conditions
    'I': 3, 'J': 3,                  # binding protein
    'K': 4,                          # general health
    'L': 5, 'M': 5, 'N': 5,          # replacement therapy
    'R': 6,                          # discordant results
}
df_target[target] = df_target[target].map(value_mapping).fillna(7).astype(int)
df_target[target].unique()

In [None]:
df = pd.concat([df_encoded, df_target], axis=1)
df.head()
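
To make the three encoding steps easier to follow, here is a minimal sketch on a toy frame. The column names and values are hypothetical and only stand in for the real data set. Unlike the cells above, the sketch maps the binary labels to integers rather than to the strings '0' and '1', which is an assumption made for readability.

```python
import pandas as pd

# Toy frame with hypothetical column names and values (not the real data set)
toy = pd.DataFrame({
    'sex': ['M', 'F', 'F'],
    'on thyroxine': ['f', 't', 'f'],
    'source': ['SVI', 'other', 'SVI'],
    'diagnoses': ['-', 'A', 'Z'],
})

# Binary columns: map each label to 0 or 1
binary_map = {'M': 0, 'F': 1, 'f': 0, 't': 1}
toy_encoded = toy.copy()
for col in ['sex', 'on thyroxine']:
    toy_encoded[col] = toy_encoded[col].map(binary_map)

# Multi-valued column: one indicator column per category
toy_encoded = pd.get_dummies(toy_encoded, columns=['source'], dtype='int')

# Target: known codes get their class, anything else falls into class 7
class_map = {'-': 0, 'A': 1}
toy_encoded['diagnoses'] = (
    toy_encoded['diagnoses'].map(class_map).fillna(7).astype(int)
)
print(toy_encoded)
```

The unknown code 'Z' ends up in class 7, which mirrors how the fillna(7) call treats diagnosis codes that are not listed in the mapping.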

### 1.4 Splitting into training and testing set

With our data encoded, we are ready to split it into a training set and a testing set.
The training set will be used to train our classification models and the testing set will be used to test them.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(target, axis='columns')
y = df[target]

X_TRAIN, X_TEST, y_TRAIN, y_TEST = train_test_split(X, y, test_size=0.2)

# Print the shapes of the training and testing sets
print('Training set shape:', X_TRAIN.shape, y_TRAIN.shape)
print('Testing set shape:', X_TEST.shape, y_TEST.shape)
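
The split above is random and unseeded, so the sets change on every run. As a possible variation, and not what this notebook does, the sketch below passes random_state for reproducibility and stratify to keep the class proportions equal in both sets. The arrays X_demo and y_demo are hypothetical toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify keeps the 80/20 class ratio in both sets; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print('Training set shape:', X_tr.shape)
print('Testing set shape:', X_te.shape)
print('Class counts in the testing set:', np.bincount(y_te))
```

Stratification can only work if every class has at least two samples, so it may not apply directly to a target with very rare classes.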

### 1.5 Scaling Data

Because some classification models, such as KNN, are based on the distance between data points,
it is important to standardize the features of our training and testing sets. The scaler is fitted on the training set only and then applied to both sets.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_TRAIN)

X_train_scl = scaler.transform(X_TRAIN)
X_test_scl = scaler.transform(X_TEST)
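
At this point the data still contains NaN values. Scaling before imputing works because StandardScaler disregards NaN when fitting and keeps it when transforming. The sketch below illustrates this on hypothetical toy arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with one missing value
X_fit = np.array([[1.0], [3.0], [np.nan]])
X_new = np.array([[2.0], [np.nan]])

demo_scaler = StandardScaler().fit(X_fit)

# The mean is computed over the non-missing values only
print('Fitted mean:', demo_scaler.mean_)

# Missing values pass through the transformation unchanged
scaled = demo_scaler.transform(X_new)
print(scaled)
```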

### 1.6 Imputing missing values

Some classification models, such as KNN, also cannot handle missing values,
so we need to impute our NaN values. Each missing value is replaced with the mean of its column, computed on the training set.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X_train_scl)

X_train_imp = imputer.transform(X_train_scl)
X_test_imp = imputer.transform(X_test_scl)

X_TRAIN = pd.DataFrame(X_train_imp, columns=X_TRAIN.columns)
X_TEST = pd.DataFrame(X_test_imp, columns=X_TEST.columns)
X_TRAIN.info()
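
A small sketch of what the imputer does, using hypothetical toy arrays: the column mean is learned from the training data and reused for the test data. Because the features were standardized beforehand, the imputed value is close to 0 in each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical standardized column with one missing value in each set
train_demo = np.array([[-1.0], [1.0], [np.nan]])
test_demo = np.array([[np.nan], [0.5]])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
demo_imputer.fit(train_demo)

# The statistic learned on the training data fills the gaps in both sets
filled_train = demo_imputer.transform(train_demo)
filled_test = demo_imputer.transform(test_demo)
print('Learned mean:', demo_imputer.statistics_)
print(filled_test)
```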

## 2. Classification Models

### 2.1 Feature Selection

In [None]:
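# TODO feature selection needs tuning
from sklearn.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# Forward selection of 13 features. LinearRegression is the scoring estimator here,
# so the class labels are treated as numeric values during selection.
selector = SFS(LinearRegression(), 
               n_features_to_select=13, 
               direction='forward',
               n_jobs=-1)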
selector.fit(X_TRAIN, y_TRAIN)

N, M = X_TRAIN.shape
features=selector.get_support()
features_selected = np.arange(M)[features]
print("The features selected are columns: ", features_selected)

X_TRAIN = selector.transform(X_TRAIN)
X_TEST = selector.transform(X_TEST)

### 2.2 KFold Cross validation

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef

def cross_validation(model, X, y):
    """Run one shuffled 5-fold CV and return the pooled true labels and predictions."""
    y = np.asarray(y)
    truth, preds = [], []
    kf = KFold(n_splits=5, shuffle=True)
    for train_index, test_index in kf.split(X):
        # Fit on four folds, predict on the held-out fold
        model.fit(X[train_index], y[train_index])
        preds.append(model.predict(X[test_index]))
        truth.append(y[test_index])
    return np.concatenate(truth), np.concatenate(preds)

In [None]:
def evaluate(model, X, y, n_iter=10):
    """Repeat the 5-fold CV n_iter times and return the mean of each metric."""
    accuracy = []
    precision = []
    recall = []
    f1 = []
    mcc = []
    
    for _ in range(n_iter):
        truth, preds = cross_validation(model, X, y)
        accuracy.append(accuracy_score(truth, preds))
        precision.append(precision_score(truth, preds, average='weighted', zero_division=1))
        recall.append(recall_score(truth, preds, average='weighted'))
        f1.append(f1_score(truth, preds, average='weighted'))
        mcc.append(matthews_corrcoef(truth, preds))
        
    return {
        'Model': model,
        'Accuracy': np.mean(accuracy),
        'Precision': np.mean(precision),
        'Recall': np.mean(recall),
        'F1-Score': np.mean(f1),
        'MCC': np.mean(mcc)
    }
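
The helper above collects the out-of-fold predictions by hand. For comparison, scikit-learn offers cross_val_predict, which returns one out-of-fold prediction per sample. The sketch below is not used in this notebook. It runs on the Iris data bundled with scikit-learn, and the names X_demo and y_demo are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: the Iris data set shipped with scikit-learn
X_demo, y_demo = load_iris(return_X_y=True)

# Each sample is predicted by the model trained on the other four folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)
preds = cross_val_predict(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=kf
)

acc = accuracy_score(y_demo, preds)
f1 = f1_score(y_demo, preds, average='weighted')
print(f'Accuracy: {acc:.3f}, weighted F1: {f1:.3f}')
```

The predictions come back in the original sample order, so they can be compared with y_demo directly.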

### 2.3 Classification models

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

cols = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'MCC']

tree = evaluate(DecisionTreeClassifier(), X_TRAIN, y_TRAIN)
lgr = evaluate(LogisticRegression(), X_TRAIN, y_TRAIN)
naive_bayes = evaluate(GaussianNB(), X_TRAIN, y_TRAIN)
knn = evaluate(KNeighborsClassifier(), X_TRAIN, y_TRAIN)
svm = evaluate(SVC(), X_TRAIN, y_TRAIN)

pd.DataFrame([tree, lgr, naive_bayes, knn, svm], columns=cols)

## 3. Model Selection

### 3.1 Hyperparameter tuning

In [None]:
model_params = {
    'Decision Tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'max_depth': [2, 3, 5, 10, 20],
            'min_samples_split': [2, 3, 5, 10, 15],
            'min_samples_leaf': [2, 3, 5, 10, 15],
            'criterion': ['gini', 'entropy', 'log_loss']
        }
    },
    'KNN': {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors' : [3, 5, 7, 9, 11],
            'weights' : ['uniform', 'distance'],
            'metric' : ['minkowski', 'euclidean', 'manhattan']
        }
    },
    'SVC': {
        'model': SVC(),
        'params': {
            'C': [0.1, 1, 10, 100],  
            'gamma': [1, 0.1, 0.01, 0.001], 
            'kernel': ['rbf', 'linear'] 
        }
    }
}

In [None]:
import time
from sklearn.model_selection import GridSearchCV

scores = []
start_time = time.time()
for name, model in model_params.items():
    grid_search = GridSearchCV(model['model'], model['params'], cv=5, n_jobs=-1)
    grid_search.fit(X_TRAIN, y_TRAIN)
    scores.append({
        'Best Estimator': grid_search.best_estimator_,
        'Best Score': grid_search.best_score_,
        'Best Params': grid_search.best_params_
    })
    
print('Computation time: %.2f' % (time.time() - start_time))
df_tuning = pd.DataFrame(scores, columns=['Best Estimator', 'Best Score', 'Best Params'])
df_tuning
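
To give a sense of the cost of the search above, the number of fits is the product of the grid sizes times the number of folds. For the Decision Tree grid this is 5 × 5 × 5 × 3 = 375 combinations, or 1875 fits with cv=5. The sketch below runs a much smaller, hypothetical grid on the Iris data bundled with scikit-learn to show which attributes GridSearchCV exposes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data and a deliberately small, hypothetical grid
X_demo, y_demo = load_iris(return_X_y=True)
param_grid = {'n_neighbors': [3, 5], 'weights': ['uniform', 'distance']}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_demo, y_demo)

# 2 x 2 = 4 candidate combinations, each evaluated on 5 folds
n_candidates = len(grid.cv_results_['params'])
print('Candidates:', n_candidates)
print('Best params:', grid.best_params_)
print('Best mean CV accuracy: %.3f' % grid.best_score_)
```

With the default refit=True, best_estimator_ has already been refitted on all the data passed to fit.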

In [None]:
tuned_reports = []
for index, row in df_tuning.iterrows():
    tuned_reports.append(evaluate(row['Best Estimator'], X_TRAIN, y_TRAIN))

best_models = pd.DataFrame(tuned_reports, columns=cols)
best_models

### 3.2 Model Testing

In [None]:
test_report = []
for index, row in best_models.iterrows():
    model = row['Model']
    model.fit(X_TRAIN, y_TRAIN)
    preds = model.predict(X_TEST)
    test_report.append({
        'Model': model,
        'Accuracy': accuracy_score(y_TEST, preds),
        'Precision': precision_score(y_TEST, preds, average='weighted', zero_division=1),
        'Recall': recall_score(y_TEST, preds, average='weighted'),
        'F1-Score': f1_score(y_TEST, preds, average='weighted'),
        'MCC': matthews_corrcoef(y_TEST, preds)
    })

models_test = pd.DataFrame(test_report, columns=cols)
models_test