# Intelligent Medical Diagnosis and Drug Recommendation System
## Big Data Capstone Project
### Prepared By : Group A
- Mahmood Hossain
- Chanpreet Kaur
- Rajia Bano
- Robert Thapa
- Saaz Neupane


**Objective**: The primary objective of this project is to develop an integrated machine learning
application that assists in the diagnosis and treatment of patients. It combines natural
language processing (NLP) and classification models with an LLM that holds a natural conversation with the user.
The system automates the extraction of symptoms from patient descriptions, predicts possible diseases,
and recommends appropriate medications through a chat conversation with the user.

The project encompasses the development and deployment of a pipeline:

1. Symptom Extraction Model
- Input: Patient's natural language description of their symptoms
- Output: A list of identified symptoms
- Technology: An algorithm that captures symptoms from the user's input through keyword matching and a Named Entity Recognition (NER) model.

2. Disease Prediction Model
- Input: Symptoms identified by the Symptom Extraction Model
- Output: Predicted disease(s)
- Dataset: We combined and preprocessed the following datasets:
-- https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning?select=Training.csv
-- https://www.kaggle.com/datasets/uom190346a/disease-symptoms-and-patient-profile-dataset

3. Drug Recommendation Data
- Input: The output from the Disease Prediction Model
- Output: Suggested drugs and medications
- Dataset: We combined and preprocessed the following datasets:
-- https://www.kaggle.com/datasets/jithinanievarghese/drugs-side-effects-and-medical-condition
-- https://huggingface.co/datasets/MattBastar/medicine
-- https://www.kaggle.com/datasets/jessicali9530/kuc-hackathon-winter-2018

4. Large Language Model
- Input: The user's natural-language messages
- Output: Natural conversation in the style of a healthcare professional
- Model: Mistral Large Latest

## Importing the libraries

In [1]:
import pandas as pd
import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding
import random
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

import torch
import torch.optim as optim
import torch.nn as nn
import pickle

In [14]:
import sklearn
print(sklearn.__version__)

1.2.2


In [13]:
print(spacy.__version__)

3.7.4


In [15]:
# reading disease prediction dataset 
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

dis_df = pd.read_csv("/kaggle/input/disease-prediction-dataset/disease_prediction_datasets/Training.csv")
dis_df = dis_df.rename(columns={
    'fluid_overload.1': 'fluid_overload',
    'toxic_look_(typhos)': 'toxic_look_typhos',
    'spotting_ urination': 'spotting_urination',
    'dischromic _patches': 'dischromic_patches',
    'foul_smell_of urine': 'foul_smell_of_urine',
})
dis_df = dis_df.drop('Unnamed: 133', axis=1)
dis_df = dis_df.loc[:,~dis_df.columns.duplicated()]

dis_df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,muscle_wasting,vomiting,burning_micturition,spotting_urination,fatigue,weight_gain,anxiety,cold_hands_and_feets,mood_swings,weight_loss,restlessness,lethargy,patches_in_throat,irregular_sugar_level,cough,high_fever,sunken_eyes,breathlessness,sweating,dehydration,indigestion,headache,yellowish_skin,dark_urine,nausea,loss_of_appetite,pain_behind_the_eyes,back_pain,constipation,abdominal_pain,diarrhoea,mild_fever,yellow_urine,yellowing_of_eyes,acute_liver_failure,fluid_overload,swelling_of_stomach,swelled_lymph_nodes,malaise,blurred_and_distorted_vision,phlegm,throat_irritation,redness_of_eyes,sinus_pressure,runny_nose,congestion,chest_pain,weakness_in_limbs,fast_heart_rate,pain_during_bowel_movements,pain_in_anal_region,bloody_stool,irritation_in_anus,neck_pain,dizziness,cramps,bruising,obesity,swollen_legs,swollen_blood_vessels,puffy_face_and_eyes,enlarged_thyroid,brittle_nails,swollen_extremeties,excessive_hunger,extra_marital_contacts,drying_and_tingling_lips,slurred_speech,knee_pain,hip_joint_pain,muscle_weakness,stiff_neck,swelling_joints,movement_stiffness,spinning_movements,loss_of_balance,unsteadiness,weakness_of_one_body_side,loss_of_smell,bladder_discomfort,foul_smell_of_urine,continuous_feel_of_urine,passage_of_gases,internal_itching,toxic_look_typhos,depression,irritability,muscle_pain,altered_sensorium,red_spots_over_body,belly_pain,abnormal_menstruation,dischromic_patches,watering_from_eyes,increased_appetite,polyuria,family_history,mucoid_sputum,rusty_sputum,lack_of_concentration,visual_disturbances,receiving_blood_transfusion,receiving_unsterile_injections,coma,stomach_bleeding,distention_of_abdomen,history_of_alcohol_consumption,blood_in_sputum,prominent_veins_on_calf,palpitations,painful_walking,pus_filled_pimples,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
2,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
4,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection


Checking the shape of the dataset

In [16]:
dis_df.shape

(4920, 132)
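The rename-then-deduplicate step in the loading cell can be checked on a toy frame. This is a minimal sketch, not the notebook's data: the toy values are made up, and it assumes the `.1` suffix comes from pandas renaming a repeated CSV header on read. It shows that after the rename creates two columns with the same name, `columns.duplicated()` keeps the first and discards the values of the second.

```python
import pandas as pd

# Toy frame mimicking the raw CSV (values are illustrative only)
raw = pd.DataFrame({
    "fluid_overload": [0, 1],
    "fluid_overload.1": [1, 0],
    "spotting_ urination": [0, 0],
    "prognosis": ["A", "B"],
})

renamed = raw.rename(columns={
    "fluid_overload.1": "fluid_overload",
    "spotting_ urination": "spotting_urination",
})

# duplicated() marks every repeat after the first occurrence,
# so the mask keeps the first "fluid_overload" and drops the second
deduped = renamed.loc[:, ~renamed.columns.duplicated()]

print(list(deduped.columns))
print(deduped["fluid_overload"].tolist())
```

If the two original columns carry different information, combining them before dropping one would be an alternative worth considering.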

## Adding more data to this dataset

We use this additional dataset: https://www.kaggle.com/datasets/uom190346a/disease-symptoms-and-patient-profile-dataset


In [17]:
dis_2 = pd.read_csv("/kaggle/input/more-diseases/Disease_symptom_and_patient_profile_dataset.csv")
dis_2.shape

(349, 10)

In [18]:
dis_2.dtypes

Disease                 object
Fever                   object
Cough                   object
Fatigue                 object
Difficulty Breathing    object
Age                      int64
Gender                  object
Blood Pressure          object
Cholesterol Level       object
Outcome Variable        object
dtype: object

In [19]:
dis_2.head()

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,Influenza,Yes,No,Yes,Yes,19,Female,Low,Normal,Positive
1,Common Cold,No,Yes,Yes,No,25,Female,Normal,Normal,Negative
2,Eczema,No,Yes,Yes,No,25,Female,Normal,Normal,Negative
3,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive
4,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive


In [20]:
dis_2['Cholesterol Level'].value_counts()

Cholesterol Level
High      166
Normal    149
Low        34
Name: count, dtype: int64

In [21]:
dis_df_cp = dis_df.copy()

In [22]:
dis_df_cp.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,muscle_wasting,vomiting,burning_micturition,spotting_urination,fatigue,weight_gain,anxiety,cold_hands_and_feets,mood_swings,weight_loss,restlessness,lethargy,patches_in_throat,irregular_sugar_level,cough,high_fever,sunken_eyes,breathlessness,sweating,dehydration,indigestion,headache,yellowish_skin,dark_urine,nausea,loss_of_appetite,pain_behind_the_eyes,back_pain,constipation,abdominal_pain,diarrhoea,mild_fever,yellow_urine,yellowing_of_eyes,acute_liver_failure,fluid_overload,swelling_of_stomach,swelled_lymph_nodes,malaise,blurred_and_distorted_vision,phlegm,throat_irritation,redness_of_eyes,sinus_pressure,runny_nose,congestion,chest_pain,weakness_in_limbs,fast_heart_rate,pain_during_bowel_movements,pain_in_anal_region,bloody_stool,irritation_in_anus,neck_pain,dizziness,cramps,bruising,obesity,swollen_legs,swollen_blood_vessels,puffy_face_and_eyes,enlarged_thyroid,brittle_nails,swollen_extremeties,excessive_hunger,extra_marital_contacts,drying_and_tingling_lips,slurred_speech,knee_pain,hip_joint_pain,muscle_weakness,stiff_neck,swelling_joints,movement_stiffness,spinning_movements,loss_of_balance,unsteadiness,weakness_of_one_body_side,loss_of_smell,bladder_discomfort,foul_smell_of_urine,continuous_feel_of_urine,passage_of_gases,internal_itching,toxic_look_typhos,depression,irritability,muscle_pain,altered_sensorium,red_spots_over_body,belly_pain,abnormal_menstruation,dischromic_patches,watering_from_eyes,increased_appetite,polyuria,family_history,mucoid_sputum,rusty_sputum,lack_of_concentration,visual_disturbances,receiving_blood_transfusion,receiving_unsterile_injections,coma,stomach_bleeding,distention_of_abdomen,history_of_alcohol_consumption,blood_in_sputum,prominent_veins_on_calf,palpitations,painful_walking,pus_filled_pimples,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
2,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
4,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection


Preparing the additional disease data: we lowercase the column names and keep only the rows with a positive outcome.

In [23]:
dis_2.columns = dis_2.columns.str.lower()

# Filter rows where Outcome Variable is Positive
dis_2 = dis_2[dis_2['outcome variable'] == 'Positive']
dis_2 = dis_2.drop('outcome variable', axis=1)

Renaming the disease column to prognosis so that the target column name matches the first dataset

In [24]:
dis_2 = dis_2.rename(columns={'disease': 'prognosis'})

We do not need the age and gender columns, so we drop them.

In [25]:
dis_2 = dis_2.drop(['age', 'gender'], axis=1)

In [26]:
dis_2.head()

Unnamed: 0,prognosis,fever,cough,fatigue,difficulty breathing,blood pressure,cholesterol level
0,Influenza,Yes,No,Yes,Yes,Low,Normal
3,Asthma,Yes,Yes,No,Yes,Normal,Normal
4,Asthma,Yes,Yes,No,Yes,Normal,Normal
5,Eczema,Yes,No,No,No,Normal,Normal
6,Influenza,Yes,Yes,Yes,Yes,Normal,Normal


In [27]:
# copy the fever column into high_fever, then rename the original fever column to mild_fever
dis_2['high_fever'] = dis_2['fever'].copy()
dis_2 = dis_2.rename(columns={'fever': 'mild_fever'})


In [28]:
# replace spaces with underscores in the remaining column names (difficulty_breathing, blood_pressure, cholesterol_level)
dis_2 = dis_2.rename(columns={
    'difficulty breathing': 'difficulty_breathing',
    'blood pressure': 'blood_pressure',
    'cholesterol level': 'cholesterol_level'
})

In [29]:
dis_2.head()

Unnamed: 0,prognosis,mild_fever,cough,fatigue,difficulty_breathing,blood_pressure,cholesterol_level,high_fever
0,Influenza,Yes,No,Yes,Yes,Low,Normal,Yes
3,Asthma,Yes,Yes,No,Yes,Normal,Normal,Yes
4,Asthma,Yes,Yes,No,Yes,Normal,Normal,Yes
5,Eczema,Yes,No,No,No,Normal,Normal,Yes
6,Influenza,Yes,Yes,Yes,Yes,Normal,Normal,Yes


In [30]:
dis_2['prognosis'].nunique()

77

We need to convert the symptom columns to binary values

In [31]:
# Convert yes/high = 1, no/low/normal = 0
columns_to_convert = ['mild_fever', 'cough', 'fatigue', 'difficulty_breathing', 'high_fever']
dis_2[columns_to_convert] = dis_2[columns_to_convert].replace({'Yes': 1, 'No': 0})

columns_to_convert = ['blood_pressure', 'cholesterol_level']
dis_2[columns_to_convert] = dis_2[columns_to_convert].replace({'High': 1, 'Low': 0, 'Normal': 0})

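As an alternative to `replace`, the same conversion can be written with an explicit per-column `map` followed by `astype(int)`. This is a sketch on a made-up toy frame, not the notebook's code. The mapping dictionaries mirror the ones in the cell above; the names `toy`, `yes_no` and `level` are illustrative.

```python
import pandas as pd

# Toy frame with the same kinds of values as dis_2 (illustrative data)
toy = pd.DataFrame({
    "cough": ["Yes", "No"],
    "blood_pressure": ["High", "Normal"],
    "cholesterol_level": ["Low", "High"],
})

yes_no = {"Yes": 1, "No": 0}
level = {"High": 1, "Low": 0, "Normal": 0}

# map() returns NaN for any value missing from the dictionary,
# so astype(int) fails loudly on unexpected categories
for col in ["cough"]:
    toy[col] = toy[col].map(yes_no).astype(int)
for col in ["blood_pressure", "cholesterol_level"]:
    toy[col] = toy[col].map(level).astype(int)

print(toy)
```

Note that mapping both Low and Normal to 0 discards the distinction between them, which matches the comment in the cell above.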


In [32]:
# add this data to dis_df_cp; the remaining columns are filled with 0
dis_2.head()

Unnamed: 0,prognosis,mild_fever,cough,fatigue,difficulty_breathing,blood_pressure,cholesterol_level,high_fever
0,Influenza,1,0,1,1,0,0,1
3,Asthma,1,1,0,1,0,0,1
4,Asthma,1,1,0,1,0,0,1
5,Eczema,1,0,0,0,0,0,1
6,Influenza,1,1,1,1,0,0,1


In [33]:
dis_df_cp.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,muscle_wasting,vomiting,burning_micturition,spotting_urination,fatigue,weight_gain,anxiety,cold_hands_and_feets,mood_swings,weight_loss,restlessness,lethargy,patches_in_throat,irregular_sugar_level,cough,high_fever,sunken_eyes,breathlessness,sweating,dehydration,indigestion,headache,yellowish_skin,dark_urine,nausea,loss_of_appetite,pain_behind_the_eyes,back_pain,constipation,abdominal_pain,diarrhoea,mild_fever,yellow_urine,yellowing_of_eyes,acute_liver_failure,fluid_overload,swelling_of_stomach,swelled_lymph_nodes,malaise,blurred_and_distorted_vision,phlegm,throat_irritation,redness_of_eyes,sinus_pressure,runny_nose,congestion,chest_pain,weakness_in_limbs,fast_heart_rate,pain_during_bowel_movements,pain_in_anal_region,bloody_stool,irritation_in_anus,neck_pain,dizziness,cramps,bruising,obesity,swollen_legs,swollen_blood_vessels,puffy_face_and_eyes,enlarged_thyroid,brittle_nails,swollen_extremeties,excessive_hunger,extra_marital_contacts,drying_and_tingling_lips,slurred_speech,knee_pain,hip_joint_pain,muscle_weakness,stiff_neck,swelling_joints,movement_stiffness,spinning_movements,loss_of_balance,unsteadiness,weakness_of_one_body_side,loss_of_smell,bladder_discomfort,foul_smell_of_urine,continuous_feel_of_urine,passage_of_gases,internal_itching,toxic_look_typhos,depression,irritability,muscle_pain,altered_sensorium,red_spots_over_body,belly_pain,abnormal_menstruation,dischromic_patches,watering_from_eyes,increased_appetite,polyuria,family_history,mucoid_sputum,rusty_sputum,lack_of_concentration,visual_disturbances,receiving_blood_transfusion,receiving_unsterile_injections,coma,stomach_bleeding,distention_of_abdomen,history_of_alcohol_consumption,blood_in_sputum,prominent_veins_on_calf,palpitations,painful_walking,pus_filled_pimples,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
2,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection
4,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Fungal infection


In [34]:
# Find columns in dis_2 that are not in dis_df_cp
extra_columns_in_dis_2 = [col for col in dis_2.columns if col not in dis_df_cp.columns]

# Display the extra columns
print("Columns in dis_2 that are not in dis_df_cp:")
print(extra_columns_in_dis_2)

Columns in dis_2 that are not in dis_df_cp:
['difficulty_breathing', 'blood_pressure', 'cholesterol_level']


In [35]:
# Create a DataFrame with the extra columns initialized to missing values (pd.NA) for existing rows in dis_df_cp
extra_columns_df = pd.DataFrame(pd.NA, index=dis_df_cp.index, columns=extra_columns_in_dis_2)

# Concatenate this new DataFrame with dis_df_cp to include the extra columns
dis_df_cp_with_extras = pd.concat([dis_df_cp, extra_columns_df], axis=1)


In [36]:
# Append rows from dis_2 to dis_df_cp_with_extras
final_dis_df = pd.concat([dis_df_cp_with_extras, dis_2], ignore_index=True)

Finally, after concatenation, we have the final disease dataframe
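The column alignment that `pd.concat` performs can be seen on a toy example. This is a minimal sketch with made-up frames (`base` and `extra` are illustrative names): when rows are concatenated, pandas takes the union of the column names and fills the cells that one frame lacks with NaN, which can then be set to 0.

```python
import pandas as pd

# Two toy frames with partly different symptom columns (illustrative data)
base = pd.DataFrame({"cough": [1, 0], "itching": [0, 1], "prognosis": ["Allergy", "GERD"]})
extra = pd.DataFrame({"cough": [1], "blood_pressure": [1], "prognosis": ["Stroke"]})

# Row-wise concat aligns on column names; missing cells become NaN
combined = pd.concat([base, extra], ignore_index=True)

# Fill the gaps with 0 in the symptom columns only and restore integer values
symptom_cols = combined.columns.drop("prognosis")
combined[symptom_cols] = combined[symptom_cols].fillna(0).astype(int)

print(combined)
```

Filling with 0 treats a symptom that a dataset never recorded the same as a symptom that was recorded as absent, which is an assumption worth keeping in mind when interpreting the model.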

In [37]:
final_dis_df.tail()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,muscle_wasting,vomiting,burning_micturition,spotting_urination,fatigue,weight_gain,anxiety,cold_hands_and_feets,mood_swings,weight_loss,restlessness,lethargy,patches_in_throat,irregular_sugar_level,cough,high_fever,sunken_eyes,breathlessness,sweating,dehydration,indigestion,headache,yellowish_skin,dark_urine,nausea,loss_of_appetite,pain_behind_the_eyes,back_pain,constipation,abdominal_pain,diarrhoea,mild_fever,yellow_urine,yellowing_of_eyes,acute_liver_failure,fluid_overload,swelling_of_stomach,swelled_lymph_nodes,malaise,blurred_and_distorted_vision,phlegm,throat_irritation,redness_of_eyes,sinus_pressure,runny_nose,congestion,chest_pain,weakness_in_limbs,fast_heart_rate,pain_during_bowel_movements,pain_in_anal_region,bloody_stool,irritation_in_anus,neck_pain,dizziness,cramps,bruising,obesity,swollen_legs,swollen_blood_vessels,puffy_face_and_eyes,enlarged_thyroid,brittle_nails,swollen_extremeties,excessive_hunger,extra_marital_contacts,drying_and_tingling_lips,slurred_speech,knee_pain,hip_joint_pain,muscle_weakness,stiff_neck,swelling_joints,movement_stiffness,spinning_movements,loss_of_balance,unsteadiness,weakness_of_one_body_side,loss_of_smell,bladder_discomfort,foul_smell_of_urine,continuous_feel_of_urine,passage_of_gases,internal_itching,toxic_look_typhos,depression,irritability,muscle_pain,altered_sensorium,red_spots_over_body,belly_pain,abnormal_menstruation,dischromic_patches,watering_from_eyes,increased_appetite,polyuria,family_history,mucoid_sputum,rusty_sputum,lack_of_concentration,visual_disturbances,receiving_blood_transfusion,receiving_unsterile_injections,coma,stomach_bleeding,distention_of_abdomen,history_of_alcohol_consumption,blood_in_sputum,prominent_veins_on_calf,palpitations,painful_walking,pus_filled_pimples,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis,difficulty_breathing,blood_pressure,cholesterol_level
5101,,,,,,,,,,,,,,,1,,,,,,,,,,0,1,,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Stroke,0,1,1
5102,,,,,,,,,,,,,,,1,,,,,,,,,,0,1,,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Stroke,0,1,1
5103,,,,,,,,,,,,,,,1,,,,,,,,,,0,1,,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Stroke,0,1,1
5104,,,,,,,,,,,,,,,1,,,,,,,,,,0,1,,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Stroke,0,1,1
5105,,,,,,,,,,,,,,,1,,,,,,,,,,0,1,,,,,,,,,,,,,,,,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Stroke,0,1,1


In [38]:
# Find columns with non-numeric types and their unique values in final_dis_df
non_numeric_columns = []
for col in final_dis_df.columns:
    if not pd.api.types.is_numeric_dtype(final_dis_df[col]):
        non_numeric_columns.append(col)
        unique_values = final_dis_df[col].unique()
        print(f"Column '{col}' has non-numeric type with unique values:")
        print(unique_values)
        print()

# Display non-numeric columns and their unique values
print("Non-numeric columns and their unique values:")
print(non_numeric_columns)

Column 'prognosis' has non-numeric type with unique values:
['Fungal infection' 'Allergy' 'GERD' 'Chronic cholestasis' 'Drug Reaction'
 'Peptic ulcer diseae' 'AIDS' 'Diabetes ' 'Gastroenteritis'
 'Bronchial Asthma' 'Hypertension ' 'Migraine' 'Cervical spondylosis'
 'Paralysis (brain hemorrhage)' 'Jaundice' 'Malaria' 'Chicken pox'
 'Dengue' 'Typhoid' 'hepatitis A' 'Hepatitis B' 'Hepatitis C'
 'Hepatitis D' 'Hepatitis E' 'Alcoholic hepatitis' 'Tuberculosis'
 'Common Cold' 'Pneumonia' 'Dimorphic hemmorhoids(piles)' 'Heart attack'
 'Varicose veins' 'Hypothyroidism' 'Hyperthyroidism' 'Hypoglycemia'
 'Osteoarthristis' 'Arthritis' '(vertigo) Paroymsal  Positional Vertigo'
 'Acne' 'Urinary tract infection' 'Psoriasis' 'Impetigo' 'Influenza'
 'Asthma' 'Eczema' 'Depression' 'Liver Cancer' 'Stroke'
 'Urinary Tract Infection' 'Bipolar Disorder' 'Bronchitis'
 'Cerebral Palsy' 'Colorectal Cancer' 'Hypertensive Heart Disease'
 'Multiple Sclerosis' 'Myocardial Infarction (Heart...'
 'Urinary Tract Inf

In [39]:
# Fill NaN or <NA> values with 0
final_dis_df = final_dis_df.fillna(0)



In [22]:
final_dis_df.dtypes

itching                 int64
skin_rash               int64
nodal_skin_eruptions    int64
continuous_sneezing     int64
shivering               int64
                        ...  
yellow_crust_ooze       int64
prognosis               int64
difficulty_breathing    int64
blood_pressure          int64
cholesterol_level       int64
Length: 135, dtype: object
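The listing above reports prognosis as int64 although the column holds disease names, which suggests the labels were encoded to integers in a cell that is not shown. One common way to do this is scikit-learn's `LabelEncoder`, which is imported at the top of the notebook. The sketch below is an assumption about that step, not the notebook's confirmed method, and uses a few toy labels.

```python
from sklearn.preprocessing import LabelEncoder

# Toy disease labels (illustrative; the real column has many more classes)
labels = ["Fungal infection", "Allergy", "GERD", "Allergy"]

encoder = LabelEncoder()
# Classes are sorted alphabetically and numbered from 0
codes = encoder.fit_transform(labels)
print(codes.tolist())

# inverse_transform recovers the disease names from the integer codes,
# which is needed to show a readable prediction to the user
print(encoder.inverse_transform(codes).tolist())
```

Keeping the fitted encoder (for example by pickling it) would be needed to decode the model's predictions later.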

In [23]:
# Convert float64 columns to integers
for col in final_dis_df.columns:
    if final_dis_df[col].dtype == 'float64':
        final_dis_df[col] = final_dis_df[col].astype(int)

In [24]:
final_dis_df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis,difficulty_breathing,blood_pressure,cholesterol_level
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,40,0,0,0
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,40,0,0,0
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,40,0,0,0
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,40,0,0,0
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,40,0,0,0


In [43]:
final_dis_df.shape

(5106, 135)

In [44]:
# saving the disease df 
final_dis_df.to_csv('final_dis_df.csv', index=False)

In [3]:
# reading the final disease dataframe
final_dis_df = pd.read_csv('/kaggle/input/final-disease-dataframe-and-drugs-data/final_dis_df.csv')

## Generating Keywords and Data for NER

To capture symptoms from user input, we combine a keyword-based approach with an NER model.

In [32]:
#### add the keywords for the newly added columns

In [4]:
# defining keywords: a dictionary that maps each symptom name to its possible keywords.
# Our symptom extraction pipeline takes the user's input, searches it for these keywords, and maps the matches to symptom names
keywords_new = {
    "itching": ["itching", "itch"],
    "skin_rash": ["skin rash", "rash"],
    "nodal_skin_eruptions": ["nodal skin eruptions"],
    "continuous_sneezing": ["continuous sneezing"],
    "shivering": ["shivering", "chills"],
    "chills": ["chills", "fever"],
    "joint_pain": ["joint pain", "arthritis"],
    "stomach_pain": ["stomach pain", "abdominal pain"],
    "acidity": ["acidity", "heartburn"],
    "ulcers_on_tongue": ["ulcers on tongue", "mouth sores"],
    "muscle_wasting": ["muscle wasting", "muscle weakness"],
    "vomiting": ["vomiting", "nausea"],
    "burning_micturition": ["burning micturition", "painful urination"],
    "spotting_urination": ["spotting urination", "frequent urination"],
    "fatigue": ["fatigue", "tiredness"],
    "weight_gain": ["weight gain", "obesity"],
    "anxiety": ["anxiety", "stress"],
    "cold_hands_and_feets": ["cold hands and feet", "poor circulation"],
    "mood_swings": ["mood swings", "irritability"],
    "weight_loss": ["weight loss", "slimming"],
    "restlessness": ["restlessness", "insomnia"],
    "lethargy": ["lethargy", "apathy"],
    "patches_in_throat": ["patches in throat", "sore throat"],
    "irregular_sugar_level": ["irregular sugar level", "diabetes"],
    "cough": ["cough", "respiratory issues"],
    "high_fever": ["high fever", "temperature elevation"],
    "sunken_eyes": ["sunken eyes", "fatigue"],
    "breathlessness": ["breathlessness", "shortness of breath"],
    "sweating": ["sweating", "perspiration"],
    "dehydration": ["dehydration", "water loss"],
    "indigestion": ["indigestion", "heartburn"],
    "headache": ["headache", "migraine"],
    "yellowish_skin": ["yellowish skin", "jaundice"],
    "dark_urine": ["dark urine", "kidney issues"],
    "nausea": ["nausea", "vomiting"],
    "loss_of_appetite": ["loss of appetite", "decreased hunger"],
    "pain_behind_the_eyes": ["pain behind the eyes", "eye strain"],
    "back_pain": ["back pain", "muscle strain"],
    "constipation": ["constipation", "bowel issues"],
    "abdominal_pain": ["abdominal pain", "stomach cramps"],
    "diarrhoea": ["diarrhoea", "frequent bowel movements"],
    "mild_fever": ["mild fever", "low-grade temperature"],
    "yellow_urine": ["yellow urine", "urine color change"],
    "yellowing_of_eyes": ["yellowing of eyes", "jaundice"],
    "acute_liver_failure": ["acute liver failure", "liver dysfunction"],
    "fluid_overload": ["fluid overload", "edema"],
    "swelling_of_stomach": ["swelling of stomach", "abdominal distension"],
    "swelled_lymph_nodes": ["swelled lymph nodes", "enlarged lymph nodes"],
    "malaise": ["malaise", "general discomfort"],
    "blurred_and_distorted_vision": ["blurred and distorted vision", "eye problems"],
    "phlegm": ["phlegm", "mucus"],
    "throat_irritation": ["throat irritation", "sore throat"],
    "redness_of_eyes": ["redness of eyes", "eye irritation"],
    "sinus_pressure": ["sinus pressure", "congestion"],
    "runny_nose": ["runny nose", "rhinorrhea"],
    "congestion": ["congestion", "stuffy nose"],
    "chest_pain": ["chest pain", "cardiac issues"],
    "weakness_in_limbs": ["weakness in limbs", "muscle weakness"],
    "fast_heart_rate": ["fast heart rate", "tachycardia"],
    "pain_during_bowel_movements": ["pain during bowel movements", "rectal pain"],
    "pain_in_anal_region": ["pain in anal region", "anal pain"],
    "bloody_stool": ["bloody stool", "hematochezia"],
    "irritation_in_anus": ["irritation in anus", "anal itching"],
    "neck_pain": ["neck pain", "cervical pain"],
    "dizziness": ["dizziness", "lightheadedness"],
    "cramps": ["cramps", "muscle spasms"],
    "bruising": ["bruising", "ecchymosis"],
    "obesity": ["obesity", "overweight"],
    "swollen_legs": ["swollen legs", "edema"],
    "swollen_blood_vessels": ["swollen blood vessels", "vasculitis"],
    "puffy_face_and_eyes": ["puffy face and eyes", "edema"],
    "enlarged_thyroid": ["enlarged thyroid", "goiter"],
    "brittle_nails": ["brittle nails", "nail fragility"],
    "swollen_extremeties": ["swollen extremeties", "edema"],
    "excessive_hunger": ["excessive hunger", "polyphagia"],
    "extra_marital_contacts": ["extra marital contacts", "infidelity"],
    "drying_and_tingling_lips": ["drying and tingling lips", "cheilitis"],
    "slurred_speech": ["slurred speech", "dysarthria"],
    "knee_pain": ["knee pain", "knee joint pain"],
    "hip_joint_pain": ["hip joint pain", "hip arthritis"],
    "muscle_weakness": ["muscle weakness", "muscle atrophy"],
    "stiff_neck": ["stiff neck", "cervical stiffness"],
    "swelling_joints": ["swelling joints", "joint swelling"],
    "movement_stiffness": ["movement stiffness", "muscle rigidity"],
    "spinning_movements": ["spinning movements", "vertigo"],
    "loss_of_balance": ["loss of balance", "ataxia"],
    "unsteadiness": ["unsteadiness", "lightheadedness"],
    "weakness_of_one_body_side": ["weakness of one body side", "hemiparesis"],
    "loss_of_smell": ["loss of smell", "anosmia"],
    "bladder_discomfort": ["bladder discomfort", "urinary issues"],
    "foul_smell_of_urine": ["foul smell of urine", "urinary tract infection"],
    "continuous_feel_of_urine": ["continuous feel of urine", "urinary frequency"],
    "passage_of_gases": ["passage of gases", "flatulence"],
    "internal_itching": ["internal itching", "pruritus"],
    "toxic_look_typhos": ["toxic look typhos", "typhoid fever"],
    "depression": ["depression", "mental health disorders"],
    "irritability": ["irritability", "mood swings"],
    "muscle_pain": ["muscle pain", "myalgia"],
    "altered_sensorium": ["altered sensorium", "confusion"],
    "red_spots_over_body": ["red spots over body", "petechiae"],
    "belly_pain": ["belly pain", "abdominal pain"],
    "abnormal_menstruation": ["abnormal menstruation", "menstrual irregularities"],
    "dischromic_patches": ["dischromic patches", "skin discoloration"],
    "watering_from_eyes": ["watering from eyes", "lacrimation"],
    "increased_appetite": ["increased appetite", "polyphagia"],
    "polyuria": ["polyuria", "frequent urination"],
    "family_history": ["family history", "genetic disorders"],
    "mucoid_sputum": ["mucoid sputum", "respiratory issues"],
    "rusty_sputum": ["rusty sputum", "hemoptysis"],
    "lack_of_concentration": ["lack of concentration", "attention deficit"],
    "visual_disturbances": ["visual disturbances", "blurred vision"],
    "receiving_blood_transfusion": ["receiving blood transfusion", "blood transfusion"],
    "receiving_unsterile_injections": ["receiving unsterile injections", "unsterile injections"],
    "coma": ["coma", "unconsciousness"],
    "stomach_bleeding": ["stomach bleeding", "gastrointestinal bleeding"],
    "distention_of_abdomen": ["distention of abdomen", "abdominal swelling"],
    "history_of_alcohol_consumption": ["history of alcohol consumption", "alcoholism"],
    "fluid_overload": ["fluid overload", "edema"],
    "blood_in_sputum": ["blood in sputum", "hemoptysis"],
    "prominent_veins_on_calf": ["prominent veins on calf", "varicose veins"],
    "palpitations": ["palpitations", "heart palpitations"],
    "painful_walking": ["painful walking", "foot pain"],
    "pus_filled_pimples": ["pus filled pimples", "acne"],
    "blackheads": ["blackheads", "comedones"],
    "scurring": ["scurring", "scarring"],
    "skin_peeling": ["skin peeling", "exfoliation"],
    "silver_like_dusting": ["silver like dusting", "skin discoloration"],
    "small_dents_in_nails": ["small dents in nails", "nail abnormalities"],
    "inflammatory_nails": ["inflammatory nails", "nail inflammation"],
    "blister": ["blister", "fluid-filled bumps"],
    "red_sore_around_nose": ["red sore around nose", "nasal irritation"],
    "yellow_crust_ooze": ["yellow crust ooze", "pus-filled crust"],
    "difficulty_breathing": ["difficulty breathing", "shortness of breath", "breathlessness", "labored breathing", "dyspnea", "breath"],
    "blood_pressure": ["blood pressure", "hypertension", "high BP", "high blood pressure", "high pressure"],
    "cholesterol_level": ["cholesterol", "high cholesterol" , "high lipid levels" , "LDL"]
}

The dictionary keys are the column names of the disease prediction dataset. This lets us turn a symptom mapping built from these keys directly into a dataframe row and pass it to the disease prediction model.

In [5]:
# our disease prediction model has 134 columns, so we want to check whether our keywords_new dictionary has 134 keys or not
len(keywords_new)

134
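A length check alone cannot tell whether the keys actually match the dataset's column names, or whether a key was written twice in the dictionary literal (Python silently keeps only the last value for a repeated key). The sketch below compares keys and columns as sets. The names `toy_keywords`, `toy_columns`, and `compare_keys_to_columns` are hypothetical stand-ins for `keywords_new` and the real feature columns.

```python
# Hypothetical stand-ins; in the notebook these would be keywords_new
# and the feature columns of the disease dataset.
toy_keywords = {
    "itching": ["itching"],
    "knee_pain": ["knee pain"],
    "fluid_overlod": ["fluid overload"],  # deliberate typo in the key
}
toy_columns = ["itching", "knee_pain", "fluid_overload"]


def compare_keys_to_columns(keywords, columns):
    """Return (keys with no matching column, columns with no matching key)."""
    key_set, col_set = set(keywords), set(columns)
    return sorted(key_set - col_set), sorted(col_set - key_set)


extra_keys, missing_keys = compare_keys_to_columns(toy_keywords, toy_columns)
print("Keys with no matching column:", extra_keys)
print("Columns with no matching key:", missing_keys)
```

Both lists being empty would be a stronger guarantee than equal lengths.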

## Generating Training Data for NER model
To train the NER model, we need annotated data. We generate it from the keywords_new dictionary, which holds the phrases for our symptom extraction model.

In [19]:
import random

# Function to generate training data based on keywords
def generate_training_data(keywords, num_samples=1000):
    training_data = []
    for _ in range(num_samples):
        # Randomly select a symptom and generate a sentence containing it
        symptom = random.choice(list(keywords.keys()))
        keyword = random.choice(keywords[symptom])
        sentence = f"I have {keyword}."
        start = sentence.find(keyword)
        end = start + len(keyword)
        # Annotate the sentence with the symptom entity
        annotations = {"entities": [(start, end, symptom.upper())]}
        training_data.append((sentence, annotations))
    return training_data

# Generate training data
training_data = generate_training_data(keywords_new, num_samples=1000)

# Print sample training data
for text, annotations in training_data[:5]:
    print("Text:", text)
    print("Annotations:", annotations)
    print()


Text: I have knee pain.
Annotations: {'entities': [(7, 16, 'KNEE_PAIN')]}

Text: I have sinus pressure.
Annotations: {'entities': [(7, 21, 'SINUS_PRESSURE')]}

Text: I have gastrointestinal bleeding.
Annotations: {'entities': [(7, 32, 'STOMACH_BLEEDING')]}

Text: I have indigestion.
Annotations: {'entities': [(7, 18, 'INDIGESTION')]}

Text: I have edema.
Annotations: {'entities': [(7, 12, 'SWOLLEN_LEGS')]}
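Some phrases in the dictionary are listed under more than one symptom (for instance "hemoptysis" appears under both `rusty_sputum` and `blood_in_sputum`), so the generator can label the same sentence with different entities on different draws. The following is a minimal sketch for spotting such phrases and for confirming that the character offsets slice out the keyword. The toy dictionary and both helper names are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict

# Toy excerpt of the keyword dictionary (hypothetical subset).
toy_keywords = {
    "rusty_sputum": ["rusty sputum", "hemoptysis"],
    "blood_in_sputum": ["blood in sputum", "hemoptysis"],
    "blackheads": ["blackheads", "comedones"],
}


def find_ambiguous_phrases(keywords):
    """Map each phrase listed under more than one symptom to those symptoms."""
    owners = defaultdict(list)
    for symptom, phrases in keywords.items():
        for phrase in phrases:
            owners[phrase.lower()].append(symptom)
    return {p: s for p, s in owners.items() if len(s) > 1}


def offsets_are_valid(sentence, annotations, phrase):
    """Check that every annotated span slices out exactly the phrase."""
    return all(sentence[start:end] == phrase
               for start, end, _ in annotations["entities"])


ambiguous = find_ambiguous_phrases(toy_keywords)
print(ambiguous)

sample_ok = offsets_are_valid(
    "I have knee pain.", {"entities": [(7, 16, "KNEE_PAIN")]}, "knee pain"
)
print(sample_ok)
```

Phrases reported as ambiguous could be removed from all but one symptom before generating the training sentences.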



In [20]:
# creating a dataframe from the keywords data
ner_data = pd.DataFrame(training_data, columns=["text", "annotations"])
ner_data.head()

Unnamed: 0,text,annotations
0,I have knee pain.,"{'entities': [(7, 16, 'KNEE_PAIN')]}"
1,I have sinus pressure.,"{'entities': [(7, 21, 'SINUS_PRESSURE')]}"
2,I have gastrointestinal bleeding.,"{'entities': [(7, 32, 'STOMACH_BLEEDING')]}"
3,I have indigestion.,"{'entities': [(7, 18, 'INDIGESTION')]}"
4,I have edema.,"{'entities': [(7, 12, 'SWOLLEN_LEGS')]}"


## Symptom Tracker NER Modelling
Now that we have training data for the NER model, we can train it.

In [40]:
# !python -m spacy download en_core_web_sm

In [None]:
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch, compounding

# Load a blank spaCy model
nlp = spacy.blank("en")

# Create the NER component and add it to the pipeline
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner")
else:
    ner = nlp.get_pipe("ner")

# Add labels to the NER component
for _, annotations in training_data:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# Convert the data to spaCy's format
train_data = []
for text, annotations in training_data:
    train_data.append(Example.from_dict(nlp.make_doc(text), annotations))

# Train the NER model
optimizer = nlp.begin_training()
for itn in range(200):
    # Shuffle each iteration so the batches differ between passes
    random.shuffle(train_data)
    losses = {}
    batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        nlp.update(batch, drop=0.5, sgd=optimizer, losses=losses)
    print(f"Losses at iteration {itn}: {losses}")

# Save the model
nlp.to_disk("disease_ner_model")
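The `compounding(4.0, 32.0, 1.001)` argument in the training loop describes a batch size that starts at 4 and grows by a factor of 1.001 per batch until it is capped at 32. The pure-Python generator below only illustrates that schedule and is not spaCy's implementation; `compounding_schedule` is a hypothetical name.

```python
from itertools import islice


def compounding_schedule(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# First few batch sizes produced by the schedule
first_sizes = list(islice(compounding_schedule(4.0, 32.0, 1.001), 3))
print(first_sizes)

# Number of batches needed before the cap of 32 is reached
steps_to_cap = next(
    i for i, size in enumerate(compounding_schedule(4.0, 32.0, 1.001))
    if size >= 32.0
)
print(steps_to_cap)
```

Because the training loop creates a new schedule in every iteration, the batch size restarts at 4 on each pass over the data, so with 1000 training sentences it stays close to the starting value.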

In [6]:
# Load the trained model
nlp = spacy.load("/kaggle/input/disease-ner-model/disease_ner_model")

# Function to map entities to prediction model columns
def map_entities_to_columns(text, symptom_keywords):
    # Initialize mapping with zeros for all symptoms
    mapping = {symptom: 0 for symptom in symptom_keywords.keys()}
    
    # Process text with the NER model
    doc = nlp(text)
    
    # Update mapping based on detected entities
    for ent in doc.ents:
        symptom = ent.label_.lower()
        if symptom in mapping:
            mapping[symptom] = 1
            
    return mapping


Saving the model to disk so that we can use the model for our application

In [6]:
# Save the model
nlp.to_disk("disease_ner_model")

In [7]:
import re

# To ensure that no symptom in the user's input is missed, we add a manual rule-based
# matcher that detects symptoms with regular expressions.
# Function to map keywords to prediction model columns


def map_keywords_to_columns(text, symptom_keywords):
    # Initialize mapping with zeros for all symptoms
    mapping = {symptom: 0 for symptom in symptom_keywords.keys()}
    
    # Check for keyword matches in the text
    for symptom, keys in symptom_keywords.items():
        # Escape each keyword so that regex metacharacters in a phrase are matched literally
        pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b', re.IGNORECASE)
        if pattern.search(text):
            mapping[symptom] = 1
            
    return mapping
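The keyword matcher relies on two regex details: `\b` word boundaries, so that a phrase only matches as a whole word, and literal matching of each phrase. A small self-contained sketch of that behaviour follows. The keyword lists, the `e_coli_infection` key, and `match_keywords` are hypothetical examples chosen to show the effect.

```python
import re

# Toy keyword lists (hypothetical); "e.coli" contains a regex metacharacter
toy_keywords = {
    "cough": ["cough", "dry cough"],
    "e_coli_infection": ["e.coli"],
}


def match_keywords(text, symptom_keywords):
    """Flag a symptom when any of its phrases occurs as a whole word or phrase."""
    mapping = {}
    for symptom, phrases in symptom_keywords.items():
        alternatives = "|".join(re.escape(p) for p in phrases)
        pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
        mapping[symptom] = int(bool(pattern.search(text)))
    return mapping


first = match_keywords("I have a Dry Cough.", toy_keywords)
second = match_keywords("Tested positive for E.coli and coughing a lot.", toy_keywords)
third = match_keywords("My e-coli test came back.", toy_keywords)
print(first)
print(second)
print(third)
```

Note that the word boundary also means inflected forms such as "coughing" are not matched by "cough"; such variants would have to be listed as separate keywords.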

**Testing whether the symptom extraction model can detect the symptoms or not**

In [8]:
text = "I have itching and knee pain."

result = map_keywords_to_columns(text, keywords_new)

print(result)

{'itching': 1, 'skin_rash': 0, 'nodal_skin_eruptions': 0, 'continuous_sneezing': 0, 'shivering': 0, 'chills': 0, 'joint_pain': 0, 'stomach_pain': 0, 'acidity': 0, 'ulcers_on_tongue': 0, 'muscle_wasting': 0, 'vomiting': 0, 'burning_micturition': 0, 'spotting_urination': 0, 'fatigue': 0, 'weight_gain': 0, 'anxiety': 0, 'cold_hands_and_feets': 0, 'mood_swings': 0, 'weight_loss': 0, 'restlessness': 0, 'lethargy': 0, 'patches_in_throat': 0, 'irregular_sugar_level': 0, 'cough': 0, 'high_fever': 0, 'sunken_eyes': 0, 'breathlessness': 0, 'sweating': 0, 'dehydration': 0, 'indigestion': 0, 'headache': 0, 'yellowish_skin': 0, 'dark_urine': 0, 'nausea': 0, 'loss_of_appetite': 0, 'pain_behind_the_eyes': 0, 'back_pain': 0, 'constipation': 0, 'abdominal_pain': 0, 'diarrhoea': 0, 'mild_fever': 0, 'yellow_urine': 0, 'yellowing_of_eyes': 0, 'acute_liver_failure': 0, 'fluid_overload': 0, 'swelling_of_stomach': 0, 'swelled_lymph_nodes': 0, 'malaise': 0, 'blurred_and_distorted_vision': 0, 'phlegm': 0, 'thr

In [40]:
len(result)

134

In [41]:
# Example usage
text = "I have itching and knee pain."

result = map_entities_to_columns(text, keywords_new)
print(result)

{'itching': 1, 'skin_rash': 0, 'nodal_skin_eruptions': 0, 'continuous_sneezing': 0, 'shivering': 1, 'chills': 0, 'joint_pain': 0, 'stomach_pain': 0, 'acidity': 0, 'ulcers_on_tongue': 0, 'muscle_wasting': 0, 'vomiting': 0, 'burning_micturition': 0, 'spotting_urination': 0, 'fatigue': 0, 'weight_gain': 0, 'anxiety': 0, 'cold_hands_and_feets': 0, 'mood_swings': 0, 'weight_loss': 0, 'restlessness': 0, 'lethargy': 0, 'patches_in_throat': 0, 'irregular_sugar_level': 0, 'cough': 0, 'high_fever': 0, 'sunken_eyes': 0, 'breathlessness': 0, 'sweating': 0, 'dehydration': 0, 'indigestion': 0, 'headache': 0, 'yellowish_skin': 0, 'dark_urine': 0, 'nausea': 0, 'loss_of_appetite': 0, 'pain_behind_the_eyes': 0, 'back_pain': 0, 'constipation': 0, 'abdominal_pain': 0, 'diarrhoea': 0, 'mild_fever': 0, 'yellow_urine': 0, 'yellowing_of_eyes': 0, 'acute_liver_failure': 0, 'fluid_overload': 0, 'swelling_of_stomach': 0, 'swelled_lymph_nodes': 0, 'malaise': 0, 'blurred_and_distorted_vision': 0, 'phlegm': 0, 'thr

In [9]:
#### Combining keyword and NER model
#### We are creating a function so that it can combine both NER and manual symptom extraction


def map_symptoms(text):
    # First, use NER to map entities
    mapping = map_entities_to_columns(text, keywords_new)
    
    # Then, use refined keyword-based mapping as a fallback
    keyword_mapping = map_keywords_to_columns(text, keywords_new)
    
    # Update the mapping only if the NER did not find the symptom
    for symptom in mapping.keys():
        if mapping[symptom] == 0 and keyword_mapping[symptom] == 1:
            mapping[symptom] = 1
    return mapping
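The combination rule in `map_symptoms` amounts to a logical OR over the two binary mappings: a symptom is flagged if either extractor flags it. A minimal sketch with made-up extractor outputs (the dictionaries and `combine_mappings` are hypothetical):

```python
def combine_mappings(ner_mapping, keyword_mapping):
    """Element-wise OR of two binary symptom mappings that share the same keys."""
    return {
        symptom: int(ner_mapping[symptom] or keyword_mapping[symptom])
        for symptom in ner_mapping
    }


# Hypothetical outputs of the two extractors for one sentence
ner_out = {"itching": 1, "knee_pain": 0, "vomiting": 0}
keyword_out = {"itching": 1, "knee_pain": 1, "vomiting": 0}

combined = combine_mappings(ner_out, keyword_out)
print(combined)
```

One consequence of an OR rule is that a false positive from either extractor is carried into the combined result.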

In [37]:
# usage with combined mapping
test_text = "I am 22 years old. I am facing nose itching and vomiting with i have a severe chest pain and high fever."
mapped_columns = map_symptoms(test_text)
print(f"Combined mapping: {mapped_columns}")

Combined mapping: {'itching': 1, 'skin_rash': 0, 'nodal_skin_eruptions': 0, 'continuous_sneezing': 0, 'shivering': 1, 'chills': 1, 'joint_pain': 0, 'stomach_pain': 0, 'acidity': 0, 'ulcers_on_tongue': 0, 'muscle_wasting': 0, 'vomiting': 1, 'burning_micturition': 0, 'spotting_urination': 0, 'fatigue': 0, 'weight_gain': 0, 'anxiety': 0, 'cold_hands_and_feets': 0, 'mood_swings': 0, 'weight_loss': 0, 'restlessness': 0, 'lethargy': 0, 'patches_in_throat': 0, 'irregular_sugar_level': 0, 'cough': 0, 'high_fever': 1, 'sunken_eyes': 0, 'breathlessness': 0, 'sweating': 0, 'dehydration': 0, 'indigestion': 0, 'headache': 0, 'yellowish_skin': 1, 'dark_urine': 0, 'nausea': 1, 'loss_of_appetite': 0, 'pain_behind_the_eyes': 0, 'back_pain': 0, 'constipation': 0, 'abdominal_pain': 0, 'diarrhoea': 0, 'mild_fever': 0, 'yellow_urine': 0, 'yellowing_of_eyes': 0, 'acute_liver_failure': 1, 'fluid_overload': 0, 'swelling_of_stomach': 0, 'swelled_lymph_nodes': 0, 'malaise': 0, 'blurred_and_distorted_vision': 1,

In [38]:
len(mapped_columns)

134

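As the output shows, the combined mapping returns all 134 symptom keys and picks up symptoms mentioned in the text, such as itching, vomiting, and high fever. It also flags some symptoms that do not appear in the text (for example shivering and chills); at least some of these come from the NER model, since the NER-only test above already flags shivering. The extracted symptoms should therefore be treated as approximate. The next step is to feed the output of this extraction to the disease prediction model.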
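Since the printed mapping has 134 entries, it is easier to inspect only the symptoms that were flagged. The helper below is a small illustrative sketch; `active_symptoms` and the example mapping are hypothetical.

```python
def active_symptoms(mapping):
    """Return the names of symptoms flagged with 1, in mapping order."""
    return [symptom for symptom, flag in mapping.items() if flag == 1]


# Hypothetical excerpt of a mapping
example_mapping = {"itching": 1, "skin_rash": 0, "vomiting": 1, "high_fever": 1}
flagged = active_symptoms(example_mapping)
print(flagged)
```

Comparing this short list against the symptoms actually written in the test sentence makes unexpected detections easy to spot.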

## Disease Prediction Model
Now we will train the disease prediction model and push the symptoms to the model for prediction. 

In [2]:
# reading the combined disease dataset
final_dis_df = pd.read_csv('/kaggle/input/final-disease-dataframe-and-drugs-data/final_dis_df.csv')

In [55]:
#### this block of code was used to save the disease mappings for use in the final application

# # Get unique disease names
# disease_names = final_dis_df['prognosis'].unique().tolist()

# # Create the correct mapping
# disease_mapping = dict(enumerate(disease_names))

# # Save the correct mapping
# import joblib
# joblib.dump(disease_mapping, 'disease_mapping.pkl')

['disease_mapping.pkl']

In [56]:
disease_names

['Fungal infection',
 'Allergy',
 'GERD',
 'Chronic cholestasis',
 'Drug Reaction',
 'Peptic ulcer diseae',
 'AIDS',
 'Diabetes ',
 'Gastroenteritis',
 'Bronchial Asthma',
 'Hypertension ',
 'Migraine',
 'Cervical spondylosis',
 'Paralysis (brain hemorrhage)',
 'Jaundice',
 'Malaria',
 'Chicken pox',
 'Dengue',
 'Typhoid',
 'hepatitis A',
 'Hepatitis B',
 'Hepatitis C',
 'Hepatitis D',
 'Hepatitis E',
 'Alcoholic hepatitis',
 'Tuberculosis',
 'Common Cold',
 'Pneumonia',
 'Dimorphic hemmorhoids(piles)',
 'Heart attack',
 'Varicose veins',
 'Hypothyroidism',
 'Hyperthyroidism',
 'Hypoglycemia',
 'Osteoarthristis',
 'Arthritis',
 '(vertigo) Paroymsal  Positional Vertigo',
 'Acne',
 'Urinary tract infection',
 'Psoriasis',
 'Impetigo',
 'Influenza',
 'Asthma',
 'Eczema',
 'Depression',
 'Liver Cancer',
 'Stroke',
 'Urinary Tract Infection',
 'Bipolar Disorder',
 'Bronchitis',
 'Cerebral Palsy',
 'Colorectal Cancer',
 'Hypertensive Heart Disease',
 'Multiple Sclerosis',
 'Myocardial Infarc

In [27]:
# Encoding the target column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
final_dis_df['prognosis'] = le.fit_transform(final_dis_df['prognosis'])

In [36]:
final_dis_df.shape

(5106, 135)

In [28]:
# looking at the disease dataframe
pd.set_option('display.max_columns', None)
final_dis_df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,muscle_wasting,vomiting,burning_micturition,spotting_urination,fatigue,weight_gain,anxiety,cold_hands_and_feets,mood_swings,weight_loss,restlessness,lethargy,patches_in_throat,irregular_sugar_level,cough,high_fever,sunken_eyes,breathlessness,sweating,dehydration,indigestion,headache,yellowish_skin,dark_urine,nausea,loss_of_appetite,pain_behind_the_eyes,back_pain,constipation,abdominal_pain,diarrhoea,mild_fever,yellow_urine,yellowing_of_eyes,acute_liver_failure,fluid_overload,swelling_of_stomach,swelled_lymph_nodes,malaise,blurred_and_distorted_vision,phlegm,throat_irritation,redness_of_eyes,sinus_pressure,runny_nose,congestion,chest_pain,weakness_in_limbs,fast_heart_rate,pain_during_bowel_movements,pain_in_anal_region,bloody_stool,irritation_in_anus,neck_pain,dizziness,cramps,bruising,obesity,swollen_legs,swollen_blood_vessels,puffy_face_and_eyes,enlarged_thyroid,brittle_nails,swollen_extremeties,excessive_hunger,extra_marital_contacts,drying_and_tingling_lips,slurred_speech,knee_pain,hip_joint_pain,muscle_weakness,stiff_neck,swelling_joints,movement_stiffness,spinning_movements,loss_of_balance,unsteadiness,weakness_of_one_body_side,loss_of_smell,bladder_discomfort,foul_smell_of_urine,continuous_feel_of_urine,passage_of_gases,internal_itching,toxic_look_typhos,depression,irritability,muscle_pain,altered_sensorium,red_spots_over_body,belly_pain,abnormal_menstruation,dischromic_patches,watering_from_eyes,increased_appetite,polyuria,family_history,mucoid_sputum,rusty_sputum,lack_of_concentration,visual_disturbances,receiving_blood_transfusion,receiving_unsterile_injections,coma,stomach_bleeding,distention_of_abdomen,history_of_alcohol_consumption,blood_in_sputum,prominent_veins_on_calf,palpitations,painful_walking,pus_filled_pimples,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis,difficulty_breathing,blood_pressure,cholesterol_level
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,40,0,0,0
1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,40,0,0,0
2,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,40,0,0,0
3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,40,0,0,0
4,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,40,0,0,0


In [4]:
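########## This block was used to save the label encoder file for the final application ############

# Fit the encoder on the original disease names.
# Note: if 'prognosis' has already been replaced by integer codes (as in the encoding cell above),
# refitting here stores those integers as classes instead of the disease names.
le.fit(final_dis_df['prognosis'])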

# # Save the encoder
# with open('label_encoder.pkl', 'wb') as file:
#     pickle.dump(le, file)

In [5]:
import joblib
joblib.dump(le, 'label_encoder.pkl')

['label_encoder.pkl']

**The following code saves the disease names as a mapping**

In [6]:
disease_mapping = dict(zip(le.transform(le.classes_), le.classes_))

In [7]:
joblib.dump(disease_mapping, 'disease_mapping.pkl')

['disease_mapping.pkl']

In [9]:
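# Alternative mapping numbered by order of first appearance in the dataframe.
# Note: LabelEncoder numbers classes in sorted order, so these indices generally
# do not match the encoded labels the model is trained on.
disease_names = final_dis_df['prognosis'].unique().tolist()
disease_mapping = dict(enumerate(disease_names))
joblib.dump(disease_mapping, 'disease_mapping2.pkl')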

['disease_mapping2.pkl']
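Two different code-to-name mappings are saved above, and they are not interchangeable: scikit-learn's LabelEncoder numbers classes in sorted order, while `enumerate` over the unique values numbers them by first appearance. The plain-Python sketch below reproduces both numbering schemes on a hypothetical toy label column to show how they diverge.

```python
# Toy label column (hypothetical), in the order the rows appear
labels = ["Fungal infection", "Allergy", "GERD", "Allergy", "Acne"]

# Codes by sorted class order, which is how LabelEncoder numbers classes
sorted_classes = sorted(set(labels))
code_to_name_sorted = dict(enumerate(sorted_classes))

# Codes by order of first appearance, as produced by enumerate(unique values)
seen = list(dict.fromkeys(labels))
code_to_name_appearance = dict(enumerate(seen))

print(code_to_name_sorted)
print(code_to_name_appearance)
```

When decoding model predictions, the mapping has to use the same numbering as the labels the model was trained on.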

In [13]:
final_dis_df['prognosis']

0       40
1       40
2       40
3       40
4       40
        ..
5101    91
5102    91
5103    91
5104    91
5105    91
Name: prognosis, Length: 5106, dtype: int64

## Model Development for Disease Prediction

In [29]:

from sklearn.model_selection import train_test_split

# splitting train and test sets
X = final_dis_df.drop('prognosis', axis=1)
y = final_dis_df['prognosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [40]:
pd.set_option('display.max_columns', None)
X.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,muscle_wasting,vomiting,burning_micturition,spotting_urination,fatigue,weight_gain,anxiety,cold_hands_and_feets,mood_swings,weight_loss,restlessness,lethargy,patches_in_throat,irregular_sugar_level,cough,high_fever,sunken_eyes,breathlessness,sweating,dehydration,indigestion,headache,yellowish_skin,dark_urine,nausea,loss_of_appetite,pain_behind_the_eyes,back_pain,constipation,abdominal_pain,diarrhoea,mild_fever,yellow_urine,yellowing_of_eyes,acute_liver_failure,fluid_overload,swelling_of_stomach,swelled_lymph_nodes,malaise,blurred_and_distorted_vision,phlegm,throat_irritation,redness_of_eyes,sinus_pressure,runny_nose,congestion,chest_pain,weakness_in_limbs,fast_heart_rate,pain_during_bowel_movements,pain_in_anal_region,bloody_stool,irritation_in_anus,neck_pain,dizziness,cramps,bruising,obesity,swollen_legs,swollen_blood_vessels,puffy_face_and_eyes,enlarged_thyroid,brittle_nails,swollen_extremeties,excessive_hunger,extra_marital_contacts,drying_and_tingling_lips,slurred_speech,knee_pain,hip_joint_pain,muscle_weakness,stiff_neck,swelling_joints,movement_stiffness,spinning_movements,loss_of_balance,unsteadiness,weakness_of_one_body_side,loss_of_smell,bladder_discomfort,foul_smell_of_urine,continuous_feel_of_urine,passage_of_gases,internal_itching,toxic_look_typhos,depression,irritability,muscle_pain,altered_sensorium,red_spots_over_body,belly_pain,abnormal_menstruation,dischromic_patches,watering_from_eyes,increased_appetite,polyuria,family_history,mucoid_sputum,rusty_sputum,lack_of_concentration,visual_disturbances,receiving_blood_transfusion,receiving_unsterile_injections,coma,stomach_bleeding,distention_of_abdomen,history_of_alcohol_consumption,blood_in_sputum,prominent_veins_on_calf,palpitations,painful_walking,pus_filled_pimples,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,difficulty_breathing,blood_pressure,cholesterol_level
0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [31]:
# shape of the feature matrix X
X.shape

(5106, 134)

In [43]:
pd.set_option('display.max_rows', None)
len(X.columns)

134

Checking the number of keys in the mapped columns and in keywords_new. Both must equal the number of feature columns (134) so that the mapping lines up with the model input.

In [33]:
len(mapped_columns.keys())

134

In [34]:
len(keywords_new.keys())

134

## Building the model with hyperparameter tuning
We will build our model with hyperparameter tuning. We use a Random Forest Classifier for this use case because:

- Handling High-Dimensional Data: Random Forest can work with all 134 symptom features without prior feature selection or dimensionality reduction.
- Non-Linear Relationships: Random Forest can capture non-linear relationships and interactions between symptoms and diseases.
- Robust to Noise: Averaging over many trees makes the model less sensitive to noisy or mislabelled rows.
- Interpretable Results: Random Forest provides feature importance scores, which can help identify the most relevant symptoms for disease prediction.
- Handling Imbalanced Data: Random Forest tolerates moderate class imbalance, and class weights can be added if some diseases have far more cases than others.
- High Accuracy: Random Forest often achieves high accuracy on tabular classification tasks such as this one.
- Reducing Overfitting: Random Forest's ensemble approach reduces the overfitting that is common in single decision trees.
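Whether class imbalance matters here depends on how many rows each disease has. A quick way to check is to count the labels and, if needed, derive per-class weights. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn applies for `class_weight="balanced"`. The label list and `balanced_class_weights` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, deliberately imbalanced label list
toy_labels = ["Common Cold"] * 6 + ["Malaria"] * 3 + ["Dengue"] * 1
class_counts = Counter(toy_labels)
weights = balanced_class_weights(toy_labels)

print(class_counts.most_common())
print(weights)
```

Rare classes receive larger weights, so their errors count more during training.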

In [13]:
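from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values for tree depth and number of trees
param_grid = {
    'max_depth': [5, 10, 15],
    'n_estimators': [50, 100, 200]
}

model = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)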



In [16]:
from sklearn.metrics import accuracy_score

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

# Evaluate the best estimator on the held-out test set
model = grid_search.best_estimator_
y_pred = model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))

Best Parameters: {'max_depth': 10, 'n_estimators': 100}
Best Accuracy: 0.9640065879472965
Test Accuracy: 0.961839530332681
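To make the cost of the search concrete: a grid search evaluates every combination of the listed values, and each combination is fitted once per cross-validation fold. The sketch below enumerates the same grid in plain Python as an illustration of the bookkeeping, not of scikit-learn's internals.

```python
from itertools import product

param_grid = {
    "max_depth": [5, 10, 15],
    "n_estimators": [50, 100, 200],
}
cv_folds = 5

# Every combination of the listed values is one candidate configuration
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates")
print(total_fits, "cross-validation fits")
print(candidates[0])
```

The reported best parameters (`max_depth` 10, `n_estimators` 100) are one of these candidates, selected by mean cross-validation accuracy.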


In [49]:
# save the model 
import pickle
pickle_file = "dis_pred_model.pkl"
with open(pickle_file, 'wb') as file:
    pickle.dump(model, file)

## Pushing the symptom data to disease prediction model to get the output
Now we will create a pipeline that extracts symptoms from text with the symptom extraction functions and passes them to the disease classification model. The output it returns is the disease name.

In [50]:
# The pipeline is a plain chain of steps: text -> symptom mapping -> one-row dataframe -> classifier.
# scikit-learn's Pipeline expects estimator objects with fit/transform methods,
# so the plain functions are called directly here instead.

# Define a function to take text input and return the predicted disease
def predict_disease(text):
    symptoms = map_symptoms(text)
    # One-row dataframe with columns in the same order as the training features
    df = pd.DataFrame([symptoms])[list(X.columns)]
    disease = model.predict(df)
    # disease_name = le.inverse_transform(pd.Series(disease).to_numpy())[0]
    # return disease_name
    return disease

In [51]:
# Test the pipeline
text = "Doc, I've been feeling really crummy lately. I've had a nasty cough for weeks, and it's been getting worse. My chest hurts, and I feel like I can't catch my breath. I've also had a fever on and off, and I'm just so tired all the time. Sometimes I get these sharp pains in my head, and my nose is always stuffy. I've been throwing up a lot too...it's just been a real struggle. I don't know what's going on, but I hope you can help me figure it out!"
disease = predict_disease(text)
print(disease)

['Common Cold']
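The prediction step only works if the symptom mapping is presented to the model in the same feature order it was trained on. The sketch below makes that alignment explicit in plain Python; `to_feature_row`, the feature list, and the mapping are hypothetical.

```python
def to_feature_row(mapping, feature_order):
    """Build a feature list in the order the model was trained on.

    Raises KeyError if the mapping lacks one of the expected features.
    """
    return [mapping[name] for name in feature_order]


# Hypothetical training column order and a mapping whose keys are in a different order
feature_order = ["itching", "cough", "high_fever"]
mapping = {"cough": 1, "high_fever": 0, "itching": 1}

row = to_feature_row(mapping, feature_order)
print(row)
```

Indexing by feature name rather than relying on dictionary order also surfaces a missing symptom key as an error instead of a silently shifted column.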


## Creating Drug Data

We need a large drug dataset, so we combined three datasets. Each one had to be cleaned and processed before it could be combined with the others.

In [3]:
# first drug dataset
drugs_df = pd.read_csv("/kaggle/input/drug-dataset/drugs_side_effects_drugs_com.csv")
drugs_df.head()

Unnamed: 0,drug_name,medical_condition,side_effects,generic_name,drug_classes,brand_names,activity,rx_otc,pregnancy_category,csa,alcohol,related_drugs,medical_condition_description,rating,no_of_reviews,drug_link,medical_condition_url
0,doxycycline,Acne,"(hives, difficult breathing, swelling in your ...",doxycycline,"Miscellaneous antimalarials, Tetracyclines","Acticlate, Adoxa CK, Adoxa Pak, Adoxa TT, Alod...",87%,Rx,D,N,X,amoxicillin: https://www.drugs.com/amoxicillin...,Acne Other names: Acne Vulgaris; Blackheads; B...,6.8,760.0,https://www.drugs.com/doxycycline.html,https://www.drugs.com/condition/acne.html
1,spironolactone,Acne,hives ; difficulty breathing; swelling of your...,spironolactone,"Aldosterone receptor antagonists, Potassium-sp...","Aldactone, CaroSpir",82%,Rx,C,N,X,amlodipine: https://www.drugs.com/amlodipine.h...,Acne Other names: Acne Vulgaris; Blackheads; B...,7.2,449.0,https://www.drugs.com/spironolactone.html,https://www.drugs.com/condition/acne.html
2,minocycline,Acne,"skin rash, fever, swollen glands, flu-like sym...",minocycline,Tetracyclines,"Dynacin, Minocin, Minolira, Solodyn, Ximino, V...",48%,Rx,D,N,,amoxicillin: https://www.drugs.com/amoxicillin...,Acne Other names: Acne Vulgaris; Blackheads; B...,5.7,482.0,https://www.drugs.com/minocycline.html,https://www.drugs.com/condition/acne.html
3,Accutane,Acne,problems with your vision or hearing; muscle o...,isotretinoin (oral),"Miscellaneous antineoplastics, Miscellaneous u...",,41%,Rx,X,N,X,doxycycline: https://www.drugs.com/doxycycline...,Acne Other names: Acne Vulgaris; Blackheads; B...,7.9,623.0,https://www.drugs.com/accutane.html,https://www.drugs.com/condition/acne.html
4,clindamycin,Acne,hives ; difficult breathing; swelling of your ...,clindamycin topical,"Topical acne agents, Vaginal anti-infectives","Cleocin T, Clindacin ETZ, Clindacin P, Clindag...",39%,Rx,B,N,,doxycycline: https://www.drugs.com/doxycycline...,Acne Other names: Acne Vulgaris; Blackheads; B...,7.4,146.0,https://www.drugs.com/mtm/clindamycin-topical....,https://www.drugs.com/condition/acne.html


In [4]:

drugs_df.medical_condition.value_counts()

medical_condition
Pain                    264
Colds & Flu             245
Acne                    238
Hypertension            177
Osteoarthritis          129
Hayfever                124
Eczema                  122
AIDS/HIV                109
Diabetes (Type 2)       104
Psoriasis                93
GERD (Heartburn)         77
Pneumonia                72
Angina                   71
Bronchitis               71
Migraine                 61
Insomnia                 60
Constipation             60
Diabetes (Type 1)        57
Osteoporosis             56
ADHD                     55
Depression               51
Seizures                 50
Bipolar Disorder         47
UTI                      46
Asthma                   45
Anxiety                  45
Cholesterol              45
Diarrhea                 38
Covid 19                 34
Rheumatoid Arthritis     33
Alzheimer's              27
Weight Loss              23
COPD                     23
IBD (Bowel)              22
Cancer                   20
Sc

In [5]:
# Additional dataset downloaded (new_medicine-dataset)
# https://huggingface.co/datasets/MattBastar/medicine

In [6]:
# second drug dataset
drugs_df_2 = pd.read_csv("/kaggle/input/new-medicine-dataset/Medicine_Details.csv")
drugs_df_2.head()

Unnamed: 0,Medicine Name,Composition,Uses,Side_effects,Image URL,Manufacturer,Excellent Review %,Average Review %,Poor Review %
0,Avastin 400mg Injection,Bevacizumab (400mg),Cancer of colon and rectum Non-small cell lun...,Rectal bleeding Taste change Headache Noseblee...,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Roche Products India Pvt Ltd,22,56,22
1,Augmentin 625 Duo Tablet,Amoxycillin (500mg) + Clavulanic Acid (125mg),Treatment of Bacterial infections,Vomiting Nausea Diarrhea Mucocutaneous candidi...,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Glaxo SmithKline Pharmaceuticals Ltd,47,35,18
2,Azithral 500 Tablet,Azithromycin (500mg),Treatment of Bacterial infections,Nausea Abdominal pain Diarrhea,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Alembic Pharmaceuticals Ltd,39,40,21
3,Ascoril LS Syrup,Ambroxol (30mg/5ml) + Levosalbutamol (1mg/5ml)...,Treatment of Cough with mucus,Nausea Vomiting Diarrhea Upset stomach Stomach...,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Glenmark Pharmaceuticals Ltd,24,41,35
4,Aciloc 150 Tablet,Ranitidine (150mg),Treatment of Gastroesophageal reflux disease (...,Headache Diarrhea Gastrointestinal disturbance,"https://onemg.gumlet.io/l_watermark_346,w_480,...",Cadila Pharmaceuticals Ltd,34,37,29


In [None]:
# renaming the columns 

drugs_df_2 = drugs_df_2.rename(columns={
    'Medicine Name': 'drug_name',
    'Uses': 'medical_condition',
})

In [8]:
drugs_df_2.shape

(11825, 9)

In [9]:
len(drugs_df_2['medical_condition'].unique())

712

In [10]:
drugs_df_2['medical_condition'][10]

'Treatment of Bacterial infections'

Reading the 3rd dataset

In [11]:
# third drug dataset
drugs_df_3 = pd.read_csv('/kaggle/input/drug-dataset-2/drugsComTrain_raw.csv')
drugs_df_3.head()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,20-May-12,27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,27-Apr-10,192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,14-Dec-09,17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,3-Nov-15,10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,27-Nov-16,37


In [12]:
# renaming the columns
drugs_df_3 = drugs_df_3.rename(columns={
    'drugName': 'drug_name',
    'condition': 'medical_condition',
})

In [13]:
# fourth candidate dataset (DrugBank)
drugs_df_4 = pd.read_csv('/kaggle/input/drugbank/drugbank.tsv', sep='\t')
drugs_df_4.head()

Unnamed: 0,drugbank_id,name,type,groups,atc_codes,categories,inchikey,inchi,description
0,DB00001,Lepirudin,biotech,approved,B01AE02,Antithrombins|Fibrinolytic Agents,,,Lepirudin is identical to natural hirudin exce...
1,DB00002,Cetuximab,biotech,approved,L01XC06,Antineoplastic Agents,,,Epidermal growth factor receptor binding FAB. ...
2,DB00003,Dornase alfa,biotech,approved,R05CB13,Enzymes,,,Dornase alfa is a biosynthetic form of human d...
3,DB00004,Denileukin diftitox,biotech,approved|investigational,L01XX29,Antineoplastic Agents,,,A recombinant DNA-derived cytotoxic protein co...
4,DB00005,Etanercept,biotech,approved|investigational,L04AB01,Immunosuppressive Agents,,,Dimeric fusion protein consisting of the extra...


In [14]:
drugs_df_4['description'][20]

'This drug is the synthetic form of natural secretin.  It is prepared using solid phase peptide synthesis.  Secretin is a peptide hormone produced in the S cells of the duodenum. Its main effect is to regulate the pH of the small  intestine’s contents through the control of gastric acid secretion and buffering with bicarbonate. It was the first hormone to be discovered.'

The DrugBank dataset does not appear useful here because it does not state the medical condition directly. Its description column describes the drug itself rather than what it treats.

In [15]:
# Select only the drug_name and medical_condition columns from each dataframe
drugs_df_selected = drugs_df[['drug_name', 'medical_condition']]
drugs_df_2_selected = drugs_df_2[['drug_name', 'medical_condition']]
drugs_df_3_selected = drugs_df_3[['drug_name', 'medical_condition']]

# Concatenate the selected dataframes; ignore_index avoids repeated index values in the result
drug_df_combined = pd.concat([drugs_df_selected, drugs_df_2_selected, drugs_df_3_selected], ignore_index=True)

In [16]:
drugs_df_selected.shape

(2931, 2)

In [17]:
drug_df_combined.shape

(176053, 2)

In [18]:
drug_df_combined['drug_name'].nunique()

17045

In [19]:
# saving drug_df_combined
drug_df_combined.to_csv("drugs_df_combined.csv")

In [24]:
drug_df_combined.head()

Unnamed: 0,drug_name,medical_condition
0,doxycycline,Acne
1,spironolactone,Acne
2,minocycline,Acne
3,Accutane,Acne
4,clindamycin,Acne


Now we have a drug dataset that we can use to look up drug names. The code, models, datasets, and components from this notebook will be reused in the modularized application code.
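One way the combined `drug_name` / `medical_condition` table could be queried with a predicted disease is a case-insensitive substring match on the condition text. This is only a sketch under assumptions: the rows are made-up examples in the combined layout, `drugs_for_condition` is a hypothetical helper, and the application may use a different matching rule.

```python
# Hypothetical rows in the (drug_name, medical_condition) layout of the combined dataset
toy_rows = [
    {"drug_name": "doxycycline", "medical_condition": "Acne"},
    {"drug_name": "clindamycin", "medical_condition": "Acne"},
    {"drug_name": "Azithral 500 Tablet", "medical_condition": "Treatment of Bacterial infections"},
    {"drug_name": "doxycycline", "medical_condition": "acne"},
]


def drugs_for_condition(rows, disease, limit=5):
    """Return up to `limit` distinct drug names whose condition mentions the disease."""
    needle = disease.strip().lower()
    found = []
    for row in rows:
        condition = str(row["medical_condition"]).lower()
        if needle in condition and row["drug_name"] not in found:
            found.append(row["drug_name"])
    return found[:limit]


acne_drugs = drugs_for_condition(toy_rows, "Acne ")
infection_drugs = drugs_for_condition(toy_rows, "bacterial infections")
print(acne_drugs)
print(infection_drugs)
```

Stripping and lower-casing the disease name matters because some predicted labels carry trailing spaces (such as 'Diabetes ') and the three source datasets capitalize conditions differently.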