# Disease Prediction Using Structured Symptoms

This project builds a disease prediction system where users select symptoms from predefined options (like checkboxes), and a machine learning model predicts the most likely disease.

## Required Libraries

In [24]:
import pandas as pd
import numpy as np


from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Load Dataset

The dataset contains:

- `Disease`: the target column (what we want to predict)
- `Symptom_1` to `Symptom_17`: symptoms reported for each patient

In [2]:
df = pd.read_csv("../data/dataset.csv")

df.head()

,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,


## Understand Dataset Structure

We inspect the basic structure to see the column names, the dataset size, and how many values are missing in each symptom column.

In [3]:
print("Shape:", df.shape)
print(df.columns)


Shape: (4920, 18)
Index(['Disease', 'Symptom_1', 'Symptom_2', 'Symptom_3', 'Symptom_4',
       'Symptom_5', 'Symptom_6', 'Symptom_7', 'Symptom_8', 'Symptom_9',
       'Symptom_10', 'Symptom_11', 'Symptom_12', 'Symptom_13', 'Symptom_14',
       'Symptom_15', 'Symptom_16', 'Symptom_17'],
      dtype='object')


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Disease     4920 non-null   object
 1   Symptom_1   4920 non-null   object
 2   Symptom_2   4920 non-null   object
 3   Symptom_3   4920 non-null   object
 4   Symptom_4   4572 non-null   object
 5   Symptom_5   3714 non-null   object
 6   Symptom_6   2934 non-null   object
 7   Symptom_7   2268 non-null   object
 8   Symptom_8   1944 non-null   object
 9   Symptom_9   1692 non-null   object
 10  Symptom_10  1512 non-null   object
 11  Symptom_11  1194 non-null   object
 12  Symptom_12  744 non-null    object
 13  Symptom_13  504 non-null    object
 14  Symptom_14  306 non-null    object
 15  Symptom_15  240 non-null    object
 16  Symptom_16  192 non-null    object
 17  Symptom_17  72 non-null     object
dtypes: object(18)
memory usage: 692.0+ KB


## Feature Engineering

We convert the dataset into a binary symptom matrix: one column per symptom, with 1 if the patient reported that symptom and 0 otherwise.

### Collect All Unique Symptoms

In [5]:
symptom_cols = [col for col in df.columns if col.startswith("Symptom")]


all_symptoms = pd.unique(df[symptom_cols].values.ravel())
all_symptoms = [s for s in all_symptoms if pd.notna(s)]


print("Total unique symptoms:", len(all_symptoms))

Total unique symptoms: 131


### Create Binary Feature Matrix

In [6]:
symptom_df = pd.DataFrame(0, index=df.index, columns=all_symptoms)


for col in symptom_cols:
    for i, symptom in df[col].items():
        if pd.notna(symptom):
            symptom_df.at[i, symptom] = 1


X = symptom_df
y = df["Disease"]


print(X.shape, y.shape)
X.head()

(4920, 131) (4920,)


,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,continuous_sneezing,shivering,chills,watering_from_eyes,stomach_pain,acidity,...,bladder_discomfort,foul_smell_of urine,continuous_feel_of_urine,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze
0,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
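
The nested loop above sets one cell at a time, which is fine at this size but can be slow on larger data. As a hedged alternative, the sketch below builds the same kind of 0/1 matrix with `melt` and `pd.crosstab`. The helper name `build_symptom_matrix` and the toy frame are illustrative and not part of the original notebook. Note that the columns come out in alphabetical order rather than in order of first appearance.

```python
import pandas as pd


def build_symptom_matrix(frame, symptom_columns):
    """Return a 0/1 matrix with one row per record and one column per symptom."""
    # Reshape to long format: one (row, symptom) pair per line, dropping empty slots
    long = (
        frame[symptom_columns]
        .rename_axis("row")
        .reset_index()
        .melt(id_vars="row", value_name="symptom")
        .dropna(subset=["symptom"])
    )
    # Count occurrences per (row, symptom), then cap at 1 to get binary flags
    counts = pd.crosstab(long["row"], long["symptom"]).clip(upper=1)
    # Restore any rows that had no symptoms at all
    return counts.reindex(frame.index, fill_value=0)


# Toy data (hypothetical), shaped like the symptom columns of the dataset
toy = pd.DataFrame(
    {
        "Symptom_1": ["itching", "skin_rash", "itching"],
        "Symptom_2": ["skin_rash", None, "chills"],
    }
)
toy_matrix = build_symptom_matrix(toy, ["Symptom_1", "Symptom_2"])
print(toy_matrix)
```

In the notebook, the equivalent call would be `build_symptom_matrix(df, symptom_cols)`.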


### Clean and Normalize Symptom Names

Raw symptom names often contain spaces, inconsistent formatting, or extra characters. We normalize them so both model and user input behave consistently.

In [7]:
def clean_symptom(s):
    # Trim whitespace, lowercase, and use single underscores as separators
    return (
        s.strip()
        .lower()
        .replace(" ", "_")
        .replace("__", "_")
    )


X.columns = [clean_symptom(c) for c in X.columns]
all_symptoms = [clean_symptom(s) for s in all_symptoms]
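
Because the names are cleaned after the matrix is built, two raw names that differ only in spacing or case would end up as two columns with the same name. Whether that happens in this dataset is not shown here, so the sketch below is only a check one could run on the raw names before renaming. The helper `find_name_collisions` and the example names are hypothetical.

```python
from collections import defaultdict


def clean_symptom(s):
    # Same normalization as in the notebook cell above
    return s.strip().lower().replace(" ", "_").replace("__", "_")


def find_name_collisions(raw_names):
    """Map each cleaned name to the raw names that produce it, keeping only clashes."""
    groups = defaultdict(list)
    for name in raw_names:
        groups[clean_symptom(name)].append(name)
    return {cleaned: raws for cleaned, raws in groups.items() if len(raws) > 1}


# Hypothetical raw names: two of them differ only by a leading space
example = ["itching", " itching", "dischromic _patches"]
collisions = find_name_collisions(example)
print(collisions)  # {'itching': ['itching', ' itching']}
```

An empty result means every cleaned column name is unique. A non-empty one suggests cleaning the symptom values before building the matrix instead.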

### Encode Target Labels

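We encode the disease names as integer labels. scikit-learn classifiers also accept string labels, but integer codes together with a saved encoder make it easy to map predictions back to disease names later.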

In [8]:
le = LabelEncoder()
y_encoded = le.fit_transform(y)


print("Number of disease classes:", len(le.classes_))

Number of disease classes: 41
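
To make the encoding concrete, here is a small round trip on made-up labels (the disease names are placeholders, not taken from the dataset). `LabelEncoder` stores the classes in sorted order, and each integer code is an index into `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only to illustrate the round trip
toy_labels = ["Malaria", "Acne", "Malaria", "Dengue"]

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

# Classes are stored in sorted order; each code is an index into classes_
print(toy_le.classes_.tolist())  # ['Acne', 'Dengue', 'Malaria']
print(toy_encoded.tolist())      # [2, 0, 2, 1]

# inverse_transform maps codes back to the original names
toy_decoded = toy_le.inverse_transform(toy_encoded)
print(toy_decoded.tolist())      # ['Malaria', 'Acne', 'Malaria', 'Dengue']
```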


## Train–Test Split

We hold out 20% of the rows as a test set. The split is stratified by disease, so every class appears in both the training and test sets in the same proportion.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_encoded
)

print(X_train.shape, X_test.shape)


(3936, 131) (984, 131)
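
The effect of `stratify` is easiest to see on toy data. The sketch below uses three made-up classes of 50 rows each and checks the per-class counts after an 80/20 split. All variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 classes with 50 rows each
toy_y = np.repeat([0, 1, 2], 50)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, each class keeps the same share in both splits
train_counts = np.bincount(toy_y_train)
test_counts = np.bincount(toy_y_test)
print("train:", train_counts.tolist())  # train: [40, 40, 40]
print("test: ", test_counts.tolist())   # test:  [10, 10, 10]
```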


## Model - Random Forest

Random Forest is a reasonable choice for structured, high-dimensional binary data like this: tree splits work directly on 0/1 features, and no feature scaling is needed.

In [10]:
rf = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)


In [None]:
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1 Score :", f1_score(y_test, y_pred, average="weighted"))


Accuracy : 1.0
Precision: 1.0
Recall   : 1.0
F1 Score : 1.0
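
With `average="weighted"`, each metric is computed per class and then averaged with weights equal to each class's number of true samples (its support). The sketch below reproduces that by hand for precision on a small made-up example; the arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical true and predicted labels for three classes
toy_true = np.array([0, 0, 0, 1, 1, 2])
toy_pred = np.array([0, 0, 1, 1, 1, 2])

# Per-class precision: class 0 -> 2/2, class 1 -> 2/3, class 2 -> 1/1
per_class = precision_score(toy_true, toy_pred, average=None)

# Support = number of true samples in each class
support = np.bincount(toy_true)

# The weighted average computed by hand matches scikit-learn's value
manual_weighted = np.average(per_class, weights=support)
sk_weighted = precision_score(toy_true, toy_pred, average="weighted")
print(manual_weighted, sk_weighted)
```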


In [18]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf, X, y_encoded, cv=5, scoring="accuracy")

print("Cross-validation accuracies:", scores)
print("Mean CV accuracy:", scores.mean())


Cross-validation accuracies: [1. 1. 1. 1. 1.]
Mean CV accuracy: 1.0


## Overfitting & Data Leakage Validation

In [23]:

y_shuffled = np.random.permutation(y_encoded)

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y_shuffled, test_size=0.2, random_state=42
)

rf2 = RandomForestClassifier(n_estimators=200, random_state=42)
rf2.fit(X_train_s, y_train_s)

y_pred_s = rf2.predict(X_test_s)

print("Accuracy after label shuffling:", accuracy_score(y_test_s, y_pred_s))


Accuracy after label shuffling: 0.02032520325203252


The model achieves perfect accuracy on the test set and in cross-validation. This is plausible for this dataset because symptom–disease mappings are nearly deterministic. As a sanity check, we ran a label-shuffling experiment: after randomly permuting the target labels, accuracy dropped to about 2%, close to the 1/41 ≈ 2.4% expected from uniform guessing over 41 classes. This shows that the pipeline cannot score well without a real relationship between symptoms and labels, so the target is not leaking into the features. The test does not, by itself, rule out identical rows appearing in both the training and test sets, which would also inflate accuracy.
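
Row overlap between the training and test sets can be measured directly. The sketch below computes the fraction of test rows whose exact feature vector also occurs in the training set. The helper `overlap_fraction` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd


def overlap_fraction(train_features, test_features):
    """Fraction of test rows whose exact feature vector also occurs in the training set."""
    # Turn each training row into a hashable tuple for fast membership checks
    train_keys = set(map(tuple, train_features.to_numpy().tolist()))
    test_rows = test_features.to_numpy().tolist()
    hits = sum(tuple(row) in train_keys for row in test_rows)
    return hits / len(test_rows)


# Hypothetical toy frames with the same columns
toy_train = pd.DataFrame({"itching": [1, 1, 0], "chills": [0, 1, 1]})
toy_test = pd.DataFrame({"itching": [1, 0], "chills": [0, 0]})

toy_overlap = overlap_fraction(toy_train, toy_test)
print(toy_overlap)  # 0.5
```

In the notebook this would be called as `overlap_fraction(X_train, X_test)`. A value near 1 would suggest that the reported accuracy mostly reflects symptom combinations the model has already seen; in that case, evaluating on deduplicated rows may give a more conservative estimate.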

## Manual Prediction Function

In [12]:
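def predict_disease(selected_symptoms):
    # Start from an all-zero row with the same columns the model was trained on
    input_data = pd.DataFrame(0, index=[0], columns=X.columns)

    # Switch on each selected symptom after normalizing its name
    for symptom in selected_symptoms:
        s = clean_symptom(symptom)
        if s in input_data.columns:
            input_data.at[0, s] = 1
        else:
            print(f"Warning: '{symptom}' not recognized")

    # Predict the encoded class and map it back to the disease name
    pred = rf.predict(input_data)[0]
    disease = le.inverse_transform([pred])[0]
    return disease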

In [13]:
predict_disease(["vomiting", "abdominal pain", "nausea"])

'Chronic cholestasis'
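
A single predicted label hides how confident the model is, which matters when only a few symptoms are selected. As a possible extension, the sketch below ranks diseases by `predict_proba` and returns the top candidates. The helper `top_k_diseases` and the toy model are hypothetical stand-ins for `rf`, `le`, and `X`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder


def top_k_diseases(model, encoder, input_row, k=3):
    """Return the k most probable diseases as (name, probability) pairs."""
    proba = model.predict_proba(input_row)[0]
    # Indices of the k largest probabilities, highest first
    order = np.argsort(proba)[::-1][:k]
    # model.classes_ holds the encoded labels in the same order as the proba columns
    names = encoder.inverse_transform(model.classes_[order])
    return [(str(n), float(p)) for n, p in zip(names, proba[order])]


# Hypothetical toy setup standing in for X, y, rf and le
toy_X = pd.DataFrame({"itching": [1, 0] * 20, "chills": [0, 1] * 20})
toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(["Fungal infection", "Malaria"] * 20)
toy_rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(toy_X, toy_y)

query = pd.DataFrame({"itching": [1], "chills": [0]})
ranking = top_k_diseases(toy_rf, toy_le, query, k=2)
print(ranking)
```

Inside `predict_disease`, the same helper could be applied to `input_data` in place of the single `rf.predict` call.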

## Saving the Model

In [28]:
import os
import pickle

os.makedirs("../model", exist_ok=True)

with open("../model/disease_model.pkl", "wb") as f:
    pickle.dump(rf, f)

with open("../model/label_encoder.pkl", "wb") as f:
    pickle.dump(le, f)

with open("../model/symptom_list.pkl", "wb") as f:
    pickle.dump(all_symptoms, f)
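
To use the saved files from an application, the three objects need to be loaded back in the same order. The sketch below shows one way to do that; `load_artifacts` is a hypothetical helper, and the round trip uses placeholder objects in a temporary directory rather than the real model. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.

```python
import os
import pickle
import tempfile

# File names as written by the notebook cell above
ARTIFACT_FILES = ("disease_model.pkl", "label_encoder.pkl", "symptom_list.pkl")


def load_artifacts(model_dir):
    """Load the model, label encoder, and symptom list from model_dir."""
    loaded = []
    for filename in ARTIFACT_FILES:
        with open(os.path.join(model_dir, filename), "rb") as f:
            loaded.append(pickle.load(f))
    return tuple(loaded)


# Round trip with placeholder objects (stand-ins for rf, le, all_symptoms)
placeholders = ({"kind": "model"}, {"kind": "encoder"}, ["itching", "skin_rash"])

with tempfile.TemporaryDirectory() as tmp_dir:
    for filename, obj in zip(ARTIFACT_FILES, placeholders):
        with open(os.path.join(tmp_dir, filename), "wb") as f:
            pickle.dump(obj, f)
    model, encoder, symptoms = load_artifacts(tmp_dir)

print(symptoms)  # ['itching', 'skin_rash']
```

With the real files, the call would be `load_artifacts("../model")`.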
