  ![](banner.png)

# Preface: Welcome to the disease predictor project!

Welcome to my machine learning model for disease prediction! In this project, we dive into medical diagnostics, aiming to build a tool that predicts a disease from a patient's symptoms. The dataset covers 41 different disease diagnoses, each described by a set of symptoms. By applying machine learning algorithms to it, we can see how accurately diseases can be identified from the symptom patterns in the data.  
Let's explore one of the many sides of the intersection of healthcare and technology, striving to make meaningful advancements in the world of healthcare!

## Importing dependencies  

In [1]:
# Libraries & tools
import numpy as np
import pandas as pd
# Preprocessing & classifiers
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import CategoricalNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Part 1: Data Transformation

The dataset is structured such that each case has at most 17 symptoms, and cases with fewer symptoms are padded with NA values. To make the dataset suitable for a model to learn from, we'll transform it so that every possible symptom becomes a feature, and every case either has that symptom or not, which makes the new features binary.
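
As a rough illustration of the target representation, the sketch below encodes a made-up three-case table with scikit-learn's `MultiLabelBinarizer`. It is a stand-in chosen for brevity, not the transformer used in this project, and the toy rows and the names `toy`, `symptom_lists`, and `encoded` are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up cases in the same wide layout: one symptom per column, padded with None
toy = pd.DataFrame({
  'Symptom_1': ['itching', 'skin_rash', 'itching'],
  'Symptom_2': ['skin_rash', None, 'nodal_skin_eruptions'],
  'Symptom_3': [None, None, 'skin_rash'],
})

# Collect the symptoms present in each case, ignoring the padding
symptom_lists = [row.dropna().tolist() for _, row in toy.iterrows()]

# One binary column per distinct symptom (the classes come out sorted alphabetically)
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(symptom_lists), columns = mlb.classes_, index = toy.index)
print(encoded)
```

Each row of `encoded` now has a 1 under every symptom the case reports and a 0 elsewhere, regardless of which `Symptom_N` column the symptom originally appeared in.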

In [2]:
# Read the dataset
data = pd.read_csv('dataset.csv')
# Remove leading spaces from all elements
data = data.apply(lambda x: x.str.strip())

data.head()

| | Disease | Symptom_1 | Symptom_2 | Symptom_3 | Symptom_4 | Symptom_5 | Symptom_6 | Symptom_7 | Symptom_8 | Symptom_9 | Symptom_10 | Symptom_11 | Symptom_12 | Symptom_13 | Symptom_14 | Symptom_15 | Symptom_16 | Symptom_17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fungal infection | itching | skin_rash | nodal_skin_eruptions | dischromic _patches | | | | | | | | | | | | | |
| 1 | Fungal infection | skin_rash | nodal_skin_eruptions | dischromic _patches | | | | | | | | | | | | | | |
| 2 | Fungal infection | itching | nodal_skin_eruptions | dischromic _patches | | | | | | | | | | | | | | |
| 3 | Fungal infection | itching | skin_rash | dischromic _patches | | | | | | | | | | | | | | |
| 4 | Fungal infection | itching | skin_rash | nodal_skin_eruptions | | | | | | | | | | | | | | |


We'll define a transformer for the data which will encode each case with the unique symptoms provided in the training data.

In [3]:
class SymptomEncoder(BaseEstimator, TransformerMixin):

  def __init__(self):
    # Initialize the symptoms as an empty list (filled in by fit)
    self.symptoms = []

  def encode_row(self, row: pd.Series):
    # Create a series of integer zeros with index = symptoms
    encoded = pd.Series(data = np.zeros(len(self.symptoms), dtype = int), index = self.symptoms)
    # Set the value of the symptoms in the row to 1
    for symptom in row.dropna():
      # Skip symptoms that were not seen during fit, so the columns stay fixed
      if symptom in encoded.index: encoded[symptom] = 1
    return encoded

  def fit(self, X, y = None):
    # Get unique symptoms across the whole dataset, starting fresh on every fit
    symptoms = set()
    for col in X.columns:
      symptoms = symptoms.union(set(X[col].dropna().unique()))
    # Sort the symptoms so the feature order is the same on every run
    self.symptoms = sorted(symptoms)
    return self

  def transform(self, X, y = None):
    # Apply the encoding to all rows of the data
    return X.apply(self.encode_row, axis = 1)
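
One pandas detail matters for an encoder of this kind: assigning to a label that is missing from a `Series` index appends a new entry instead of raising an error, so a symptom that was absent when the encoder was fitted could change the number of features. The snippet below is a small sketch of that behavior with made-up labels; `encoded` and `guarded` are illustrative names.

```python
import numpy as np
import pandas as pd

# Known symptoms act as the index of the encoded row
encoded = pd.Series(data = np.zeros(2, dtype = int), index = ['itching', 'skin_rash'])

# Assigning to a label that is not in the index silently adds a new entry
encoded['unseen_symptom'] = 1
print(len(encoded))

# A membership check keeps the set of features fixed
guarded = pd.Series(data = np.zeros(2, dtype = int), index = ['itching', 'skin_rash'])
for symptom in ['itching', 'unseen_symptom']:
  if symptom in guarded.index:
    guarded[symptom] = 1
print(guarded)
```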

# Part 2: Data Splitting & Hyperparameter Tuning

Splitting the data using a 50/50 train-test split.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns = ['Disease']), data['Disease'], train_size = 0.5, random_state = 40)

> We can do hyperparameter tuning with `GridSearchCV`, which does cross-validation within the training set using combinations from a space of different parameters passed to it.  

I decided to use 3 models that work well with categorical data: a **decision tree**, a **random forest**, and a **categorical naive Bayes classifier**.  
Now, we'll choose the hyperparameters of each of the 3 models using grid-search cross-validation.

In [5]:
# Define the parameter space for each classifier
param_grids = {
  'Decision Tree': {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 15, 20, 30],
    'random_state': [500],
    'min_impurity_decrease': [0.0, 0.1, 0.2],
    'min_samples_split': [2, 10, 50]
  },
  'Random Forest': {
    'n_estimators': [50, 100, 150],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 50],
    'max_features': [0.2, 'sqrt'],
    'n_jobs': [-1],
    'random_state': [500],
    'max_samples': [0.25, 0.5]
  },
  'Naive Bayes': {
    'alpha': [0.0001, 0.01, 0.1, 1.0]
  }
}

# Define classifiers
classifiers = {'Decision Tree': DecisionTreeClassifier(), 'Random Forest': RandomForestClassifier(), 'Naive Bayes': CategoricalNB()}

# Transform the training set
X_train_transformed = SymptomEncoder().fit_transform(X_train)

# Find the best parameters for each classifier
for c in classifiers.keys():
  # Do grid search on the classifier's parameters
  grid_search_cv = GridSearchCV(estimator = classifiers[c], param_grid = param_grids[c], cv = 5, n_jobs = -1)
  grid_search_cv.fit(X_train_transformed, y_train)
  # View the best parameter choice for the classifier and keep it
  print(f'Best {c} parameters are {grid_search_cv.best_params_}.')
  print(f'Best {c} accuracy cross-validation score = {grid_search_cv.best_score_}\n')
  classifiers[c] = grid_search_cv.best_estimator_

Best Decision Tree parameters are {'criterion': 'gini', 'max_depth': None, 'min_impurity_decrease': 0.0, 'min_samples_split': 2, 'random_state': 500}.
Best Decision Tree accuracy cross-validation score = 0.9898373983739835

Best Random Forest parameters are {'criterion': 'gini', 'max_depth': 10, 'max_features': 0.2, 'max_samples': 0.25, 'min_samples_split': 2, 'n_estimators': 150, 'n_jobs': -1, 'random_state': 500}.
Best Random Forest accuracy cross-validation score = 1.0

Best Naive Bayes parameters are {'alpha': 0.01}.
Best Naive Bayes accuracy cross-validation score = 1.0
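
To get a sense of how much work the search above involves, the number of parameter combinations in each grid can be counted with scikit-learn's `ParameterGrid`. The sketch below repeats the grids so that it runs on its own; the names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same parameter spaces as in the grid search, repeated so this cell is self-contained
param_grids = {
  'Decision Tree': {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 15, 20, 30],
    'random_state': [500],
    'min_impurity_decrease': [0.0, 0.1, 0.2],
    'min_samples_split': [2, 10, 50]
  },
  'Random Forest': {
    'n_estimators': [50, 100, 150],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 50],
    'max_features': [0.2, 'sqrt'],
    'n_jobs': [-1],
    'random_state': [500],
    'max_samples': [0.25, 0.5]
  },
  'Naive Bayes': {
    'alpha': [0.0001, 0.01, 0.1, 1.0]
  }
}

cv = 5  # number of folds used in the grid search

# Number of parameter combinations per classifier
n_candidates = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}
# Each combination is fitted once per fold (the final refit on the full training set is not counted)
n_fits = {name: n * cv for name, n in n_candidates.items()}

for name in param_grids:
  print(f'{name}: {n_candidates[name]} combinations, {n_fits[name]} cross-validation fits')
```

The random forest grid is the most expensive of the three, both because it has the most combinations and because each fit trains up to 150 trees.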



# Part 3: Pipeline Construction & Model Training

We will construct 3 pipelines for the data, where each pipeline consists of a SymptomEncoder transformer and the corresponding classifier.

In [6]:
# Construct 3 pipelines
dt_pipe = Pipeline(steps = [('encoder', SymptomEncoder()), ('classifier', classifiers['Decision Tree'])])
rf_pipe = Pipeline(steps = [('encoder', SymptomEncoder()), ('classifier', classifiers['Random Forest'])])
nb_pipe = Pipeline(steps = [('encoder', SymptomEncoder()), ('classifier', classifiers['Naive Bayes'])])

Feeding the training set into the 3 pipelines to train the models.

In [7]:
dt_pipe.fit(X_train, y_train)
rf_pipe.fit(X_train, y_train)
nb_pipe.fit(X_train, y_train)
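
The benefit of a pipeline is that raw symptom columns go in and a diagnosis comes out, with the encoding fitted on the training data only and reused at prediction time. The sketch below shows this on made-up data; it swaps in scikit-learn's `OneHotEncoder` for the project's encoder to stay short, and the toy cases, labels, and variable names are all assumptions.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up training cases and diagnoses
train = pd.DataFrame({
  'Symptom_1': ['itching', 'cough', 'itching', 'cough'],
  'Symptom_2': ['skin_rash', 'fever', 'skin_rash', 'fever'],
})
labels = ['Fungal infection', 'Common Cold', 'Fungal infection', 'Common Cold']

# Encoder and classifier chained into one estimator
# handle_unknown = 'ignore' encodes values unseen during fit as all zeros instead of raising
pipe = Pipeline(steps = [
  ('encoder', OneHotEncoder(handle_unknown = 'ignore')),
  ('classifier', DecisionTreeClassifier(random_state = 0))
])

# fit() learns the encoding and trains the classifier on the encoded data
pipe.fit(train, labels)

# predict() applies the already-fitted encoding to a new raw case
new_case = pd.DataFrame({'Symptom_1': ['itching'], 'Symptom_2': ['skin_rash']})
prediction = pipe.predict(new_case)
print(prediction)
```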

# Part 4: Testing & Evaluation

Running the 3 models on the testing set. 

In [8]:
dt_pred = dt_pipe.predict(X_test)
rf_pred = rf_pipe.predict(X_test)
nb_pred = nb_pipe.predict(X_test)

Evaluating the predictions of each model using the accuracy, precision, recall, and F1 scores.

In [9]:
# Lists which will construct the data frame
models = ['Decision Tree','Random Forest','Naive Bayes']
label_sets = [dt_pred, rf_pred, nb_pred]
accuracies, precisions, recalls, f1s = [],[],[],[]

# Loop to calculate the different scores for disease predictions
for labels in label_sets:
  accuracies.append(accuracy_score(y_test, labels))
  precisions.append(precision_score(y_test, labels, average = 'weighted'))
  recalls.append(recall_score(y_test, labels, average = 'weighted'))
  f1s.append(f1_score(y_test, labels, average = 'weighted'))

# Create & view the diagnosis score data frame for each model
pd.DataFrame(data = {'Model': models, 'Accuracy': accuracies, 'Precision': precisions, 'Recall': recalls, 'F1': f1s})

| | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | Decision Tree | 0.997561 | 0.997761 | 0.997561 | 0.997551 |
| 1 | Random Forest | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Naive Bayes | 1.0 | 1.0 | 1.0 | 1.0 |
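
For readers unfamiliar with `average = 'weighted'`: each class gets its own score, and the scores are averaged with weights equal to the number of true cases per class. The sketch below checks this by hand on made-up labels (the classes `A`, `B`, `C` are placeholders), and also shows why the weighted recall always matches the accuracy, as it does in the results above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true and predicted labels
y_true = ['A', 'A', 'A', 'B', 'B', 'C']
y_pred = ['A', 'A', 'B', 'B', 'B', 'C']

# Per-class precision, returned in sorted label order: A, B, C
per_class = precision_score(y_true, y_pred, average = None)
# Number of true cases per class (np.unique also returns the labels sorted)
classes, support = np.unique(y_true, return_counts = True)

# Support-weighted mean of the per-class precisions
manual = np.average(per_class, weights = support)
weighted = precision_score(y_true, y_pred, average = 'weighted')
print(manual, weighted)

# Weighted recall reduces to (correct predictions) / (all cases), which is the accuracy
weighted_recall = recall_score(y_true, y_pred, average = 'weighted')
accuracy = accuracy_score(y_true, y_pred)
print(weighted_recall, accuracy)
```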


# Part 5: Final Thoughts & Takeaways

I've always wanted to do an AI project which directly helps the medical field like this one, so this is a really exciting start!  

Here are the main points I learned and utilized in this project:  
* Trying to think about data from different perspectives to find the best representation that can be processed by a learning model, then finding how to achieve that representation by transforming the current one.
* How much hyperparameter tuning and cross-validation can strengthen the final model. I also learned a bit about model-selection practices.
* Using pipelines to combine the data processing steps, from the data's entry to the classification / regression output.  

This project could be easily extended to an application with a user-friendly interface, especially with the additional CSV files that accompany the data, describing each symptom and precautionary measures for each disease.