## Purpose
This project aims to predict the recurrence of well-differentiated thyroid cancer. The dataset was collected over a period of 15 years, and each patient was followed for at least 10 years.

## Dataset Source
The data was procured from thyroid disease datasets provided by the UCI Machine Learning Repository.

## Data Dictionary
The columns in this dataset and their descriptions are listed below:

1. Age: The age of the patient at the time of diagnosis or treatment.
2. Gender: The gender of the patient (male or female).
3. Smoking: Whether the patient is a smoker or not.
4. Hx Smoking: Smoking history of the patient (e.g., whether they have ever smoked).
5. Hx Radiotherapy: History of radiotherapy treatment for any condition (the column header in the dataset is spelled `Hx Radiothreapy`).
6. Thyroid Function: The status of thyroid function, possibly indicating if there are any abnormalities.
7. Physical Examination: Findings from a physical examination of the patient, which may include palpation of the thyroid gland and surrounding structures.
8. Adenopathy: Presence or absence of enlarged lymph nodes (adenopathy) in the neck region.
9. Pathology: Specific types of thyroid cancer as determined by pathology examination of biopsy samples.
10. Focality: Whether the cancer is unifocal (limited to one location) or multifocal (present in multiple locations).
11. Risk: The risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type.
12. T: Tumor classification based on its size and extent of invasion into nearby structures.
13. N: Nodal classification indicating the involvement of lymph nodes.
14. M: Metastasis classification indicating the presence or absence of distant metastases.
15. Stage: The overall stage of the cancer, typically determined by combining T, N, and M classifications.
16. Response: Response to treatment, indicating whether the cancer responded positively, negatively, or remained stable after treatment.
17. Recurred: Indicates whether the cancer has recurred after initial treatment.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let's first load the dataset into Colab.

In [None]:
df = pd.read_csv("drive/MyDrive/AIML/thyroid-disease/Thyroid_Diff.csv")

In [None]:
df.head()

Unnamed: 0,Age,Gender,Smoking,Hx Smoking,Hx Radiothreapy,Thyroid Function,Physical Examination,Adenopathy,Pathology,Focality,Risk,T,N,M,Stage,Response,Recurred
0,27,F,No,No,No,Euthyroid,Single nodular goiter-left,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Indeterminate,No
1,34,F,No,Yes,No,Euthyroid,Multinodular goiter,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
2,30,F,No,No,No,Euthyroid,Single nodular goiter-right,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
3,62,F,No,No,No,Euthyroid,Single nodular goiter-right,No,Micropapillary,Uni-Focal,Low,T1a,N0,M0,I,Excellent,No
4,62,F,No,No,No,Euthyroid,Multinodular goiter,No,Micropapillary,Multi-Focal,Low,T1a,N0,M0,I,Excellent,No


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 383 entries, 0 to 382
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Age                   383 non-null    int64 
 1   Gender                383 non-null    object
 2   Smoking               383 non-null    object
 3   Hx Smoking            383 non-null    object
 4   Hx Radiothreapy       383 non-null    object
 5   Thyroid Function      383 non-null    object
 6   Physical Examination  383 non-null    object
 7   Adenopathy            383 non-null    object
 8   Pathology             383 non-null    object
 9   Focality              383 non-null    object
 10  Risk                  383 non-null    object
 11  T                     383 non-null    object
 12  N                     383 non-null    object
 13  M                     383 non-null    object
 14  Stage                 383 non-null    object
 15  Response              383 non-null    ob

In [None]:
df.describe()

Unnamed: 0,Age
count,383.0
mean,40.866841
std,15.134494
min,15.0
25%,29.0
50%,37.0
75%,51.0
max,82.0


As the `df.info()` output shows, most of the columns are categorical (dtype `object`). We need to convert them to a numeric format before modelling, so we will use a label-encoding function that maps each category to an integer code.

In [None]:
from sklearn.preprocessing import LabelEncoder
def convert_cat_to_bin(df):
    # Initialize the LabelEncoder
    label_encoder = LabelEncoder()

    # Loop through each of the columns in the DataFrame
    for cols in df.columns:
        if df[cols].dtype == 'object':
            # Replace each category with an integer code (0, 1, 2, ...);
            # LabelEncoder assigns codes in sorted label order, so they are not strictly binary
            df[cols] = label_encoder.fit_transform(df[cols])

    return df

df = convert_cat_to_bin(df)
df.head()

Unnamed: 0,Age,Gender,Smoking,Hx Smoking,Hx Radiothreapy,Thyroid Function,Physical Examination,Adenopathy,Pathology,Focality,Risk,T,N,M,Stage,Response,Recurred
0,27,0,0,0,0,2,3,3,2,1,2,0,0,0,0,2,0
1,34,0,0,1,0,2,1,3,2,1,2,0,0,0,0,1,0
2,30,0,0,0,0,2,4,3,2,1,2,0,0,0,0,1,0
3,62,0,0,0,0,2,4,3,2,1,2,0,0,0,0,1,0
4,62,0,0,0,0,2,1,3,2,0,2,0,0,0,0,1,0
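The integer codes above are easier to read once the label-to-code mapping is known. The following is a minimal sketch of how that mapping can be recovered from a fitted `LabelEncoder`. It uses a small hypothetical sample and assumes the `Risk` column takes the values `Low`, `Intermediate` and `High`; it does not use the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Risk labels (assumed values, not the real data)
risk = pd.Series(["Low", "High", "Intermediate", "Low"])

encoder = LabelEncoder()
codes = encoder.fit_transform(risk)

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'High': 0, 'Intermediate': 1, 'Low': 2}
print(codes.tolist())  # [2, 0, 1, 2]
```

Because the codes follow alphabetical order, `Low` ends up with the largest code, which is consistent with `Risk` being 2 for the low-risk rows shown above. The codes therefore do not reflect the clinical ordering of the categories.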


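We'll now apply min-max normalisation so that the columns listed in `col_norm` (`Age` and several of the multi-class encoded features) take values between 0 and 1. Columns that are not in the list, such as `T`, `N` and `Stage`, keep their integer codes.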

In [None]:

col_norm = ['Age', 'Thyroid Function', 'Pathology', 'Risk', 'Physical Examination', 'Adenopathy', 'Response']

# Duplicate the data frame so we work on the copy instead of the original
df_minmax = df.copy()

# Import MinMaxScaler and initialise it
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# We can now normalise the columns we have highlighted
df_minmax[col_norm] = scaler.fit_transform(df_minmax[col_norm])



In [None]:

df_minmax.head()

Unnamed: 0,Age,Gender,Smoking,Hx Smoking,Hx Radiothreapy,Thyroid Function,Physical Examination,Adenopathy,Pathology,Focality,Risk,T,N,M,Stage,Response,Recurred
0,0.179104,0,0,0,0,0.5,0.75,0.6,0.666667,1,1.0,0,0,0,0,0.666667,0
1,0.283582,0,0,1,0,0.5,0.25,0.6,0.666667,1,1.0,0,0,0,0,0.333333,0
2,0.223881,0,0,0,0,0.5,1.0,0.6,0.666667,1,1.0,0,0,0,0,0.333333,0
3,0.701493,0,0,0,0,0.5,1.0,0.6,0.666667,1,1.0,0,0,0,0,0.333333,0
4,0.701493,0,0,0,0,0.5,0.25,0.6,0.666667,0,1.0,0,0,0,0,0.333333,0
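Min-max scaling maps each value with x' = (x - min) / (max - min). As a quick check, the sketch below applies that formula by hand to the ages of the first five rows, together with the minimum (15) and maximum (82) reported by `df.describe()`, and compares the result with `MinMaxScaler`. The array is a small illustrative sample, not the full column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Ages of the first five rows, plus the column min (15) and max (82)
ages = np.array([[27.0], [34.0], [30.0], [62.0], [62.0], [15.0], [82.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation using scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(np.allclose(manual, scaled))        # True
print(np.round(scaled[:3].ravel(), 6))    # [0.179104 0.283582 0.223881]
```

The first three values match the `Age` column of `df_minmax.head()` above, for example (27 - 15) / (82 - 15) ≈ 0.179104.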


In [None]:
df_minmax.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 383 entries, 0 to 382
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Age                   383 non-null    float64
 1   Gender                383 non-null    int64  
 2   Smoking               383 non-null    int64  
 3   Hx Smoking            383 non-null    int64  
 4   Hx Radiothreapy       383 non-null    int64  
 5   Thyroid Function      383 non-null    float64
 6   Physical Examination  383 non-null    float64
 7   Adenopathy            383 non-null    float64
 8   Pathology             383 non-null    float64
 9   Focality              383 non-null    int64  
 10  Risk                  383 non-null    float64
 11  T                     383 non-null    int64  
 12  N                     383 non-null    int64  
 13  M                     383 non-null    int64  
 14  Stage                 383 non-null    int64  
 15  Response              3

In [None]:
df_minmax.describe()

Unnamed: 0,Age,Gender,Smoking,Hx Smoking,Hx Radiothreapy,Thyroid Function,Physical Examination,Adenopathy,Pathology,Focality,Risk,T,N,M,Stage,Response,Recurred
count,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0,383.0
mean,0.386072,0.185379,0.127937,0.073107,0.018277,0.487598,0.640339,0.584856,0.850305,0.644909,0.78329,2.206266,0.543081,0.046997,0.24282,0.524804,0.281984
std,0.225888,0.389113,0.334457,0.260653,0.134126,0.157729,0.337527,0.234421,0.296752,0.479167,0.321617,1.344667,0.857732,0.21191,0.773274,0.305862,0.450554
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.208955,0.0,0.0,0.0,0.0,0.5,0.25,0.6,0.833333,0.0,0.5,2.0,0.0,0.0,0.0,0.333333,0.0
50%,0.328358,0.0,0.0,0.0,0.0,0.5,0.75,0.6,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.333333,0.0
75%,0.537313,0.0,0.0,0.0,0.0,0.5,1.0,0.6,1.0,1.0,1.0,3.0,1.0,0.0,0.0,0.666667,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,6.0,2.0,1.0,4.0,1.0,1.0


In [None]:
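# We'll now split the data into its respective X and y components.
# Note: this uses the label-encoded df, not the scaled df_minmax,
# so the models below are trained on the unscaled features.
from sklearn.model_selection import train_test_split
X = df.drop('Recurred', axis=1)
y = df['Recurred']

# Split into training and testing sets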
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


X_train.shape, X_test.shape, y_train.shape, y_test.shape

((306, 16), (77, 16), (306,), (77,))
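The `describe()` output shows that about 28% of patients have `Recurred` equal to 1, so the classes are imbalanced. One option, which the notebook does not use, is to pass `stratify=y` so that both splits keep roughly the same class ratio. The sketch below illustrates this on synthetic stand-in data; `X_demo` and `y_demo` are hypothetical names, with 108 positives out of 383 rows to mirror the reported mean.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 383 rows, 16 features, 108 positive labels (~28%)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(383, 16))
y_demo = np.array([1] * 108 + [0] * 275)

# stratify=y_demo keeps the class ratio roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With only 77 test rows, a few extra or missing positives can noticeably change precision and recall, so stratifying makes the reported metrics less dependent on the particular random split.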

It's now time to create our models. We will utilise the RandomForestClassifier, LogisticRegression and DecisionTreeClassifier models.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {"Logistic Regression": LogisticRegression(random_state=42),
         "Decision Tree": DecisionTreeClassifier(random_state=42),
         "Random Forest": RandomForestClassifier(random_state=42)}

# Create a function to fit and score the models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)

    # Loop through the models
    for name, model in models.items():
        # Fit the model to the training data
        model.fit(X_train, y_train)
        # Make predictions on the testing set
        y_preds = model.predict(X_test)

        # Calculate evaluation metrics
        accuracy = accuracy_score(y_test, y_preds)
        precision = precision_score(y_test, y_preds)
        recall = recall_score(y_test, y_preds)
        f1 = f1_score(y_test, y_preds)

        # Print evaluation metrics
        print(name, ":")
        print("Accuracy:", accuracy)
        print("Precision:", precision)
        print("Recall:", recall)
        print("F1 Score:", f1)
        print("")
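The four metrics computed in `fit_and_score` all derive from the confusion matrix counts. The sketch below works through them on ten hypothetical labels (1 = recurred, 0 = did not recur) and checks the hand-computed values against scikit-learn. The labels are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true and predicted labels for ten patients
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted recurrences that are real
recall = tp / (tp + fn)                      # share of real recurrences that are caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)       # 0.8 0.75 0.75 0.75
```

In a recurrence setting, recall is often the metric to watch, since a false negative corresponds to a missed recurrence.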


In [None]:
# fit_and_score prints the metrics for each model and returns nothing,
# so there is no result to assign
fit_and_score(models=models,
              X_train=X_train,
              X_test=X_test,
              y_train=y_train,
              y_test=y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression :
Accuracy: 0.935064935064935
Precision: 0.9375
Recall: 0.7894736842105263
F1 Score: 0.8571428571428572

Decision Tree :
Accuracy: 0.922077922077922
Precision: 0.8095238095238095
Recall: 0.8947368421052632
F1 Score: 0.8500000000000001

Random Forest :
Accuracy: 0.987012987012987
Precision: 1.0
Recall: 0.9473684210526315
F1 Score: 0.972972972972973
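The convergence warning in the output above suggests either scaling the data or raising `max_iter`. One way to do both is to wrap the scaler and the classifier in a pipeline, so that the scaler is fitted on the training split only. This is a sketch on synthetic data generated with `make_classification`, and the choice of `max_iter=1000` is an assumption rather than a tuned value. It is not the notebook's own method, and its score says nothing about the thyroid dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with the same shape as the notebook's feature matrix
X_demo, y_demo = make_classification(n_samples=383, n_features=16, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data
clf = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
clf.fit(X_tr, y_tr)

print("Test accuracy:", clf.score(X_te, y_te))
print("Iterations used:", clf.named_steps["logisticregression"].n_iter_[0])
```

Fitting the scaler inside the pipeline also avoids leaking test-set minima and maxima into training, which happens when the whole data frame is scaled before the split.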

