## Capstone project

Group 1 members
* William Itotia
* Firdosa Mohamed
* Esther Munene
* Frank Oyugi

## Overview

The healthcare industry is rapidly evolving with the integration of technology, aiming to provide better accessibility, efficiency, and personalized care. This project focuses on developing a comprehensive recommendation system that predicts diseases, offers detailed descriptions, suggests precautions, and recommends medications based on the symptoms input by users. By leveraging machine learning and extensive medical data, the system aims to empower individuals with timely and accurate medical advice, reducing the need for immediate hospital visits and improving overall health outcomes.

## Problem Statement

Many individuals face challenges in accessing timely and accurate medical advice due to various factors such as geographical barriers, busy schedules, and overcrowded healthcare facilities. These challenges often lead to delayed diagnosis and treatment, potentially worsening health conditions. There is a need for a solution that can provide immediate, reliable, and personalized medical recommendations based on symptoms, thereby improving accessibility to healthcare and reducing the strain on medical facilities.

## Objectives

* To gather a repository of detailed descriptions for a wide range of diseases, including causes, symptoms, and treatment options.
* To develop and train a machine learning model on extensive medical data to predict possible diseases based on the input symptoms.
* To provide suggestions for precautions and preventive measures tailored to the predicted diseases.
* To recommend appropriate medications based on the predicted disease.
* To integrate a user-friendly interface for individuals to input their symptoms.

## Data Understanding

Importing necessary libraries

In [1]:
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
import seaborn as sns
from sklearn.svm import SVC
import pandas as pd
import numpy as np
import warnings
import pickle
import os
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

Load the data set

In [3]:
# Folder holding the CSV files; adjust this path to your local copy of the repository
DATA_DIR = r"C:\Users\willi\OneDrive\Documents\GitHub\Medicine-Recommendation-system\Dataset"

sym_des = pd.read_csv(os.path.join(DATA_DIR, "symtoms_df.csv"))
precautions = pd.read_csv(os.path.join(DATA_DIR, "precautions_df.csv"))
workout = pd.read_csv(os.path.join(DATA_DIR, "workout_df.csv"))
description = pd.read_csv(os.path.join(DATA_DIR, "description.csv"))
medications = pd.read_csv(os.path.join(DATA_DIR, "medications.csv"))
diets = pd.read_csv(os.path.join(DATA_DIR, "diets.csv"))
dataset = pd.read_csv(os.path.join(DATA_DIR, "dataset.csv"))

In [8]:
sym_des.head()

Unnamed: 0.1,Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4
0,0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches
1,1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,
2,2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,
3,3,Fungal infection,itching,skin_rash,dischromic _patches,
4,4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,


In [7]:
precautions.head()

Unnamed: 0.1,Unnamed: 0,Disease,Precaution_1,Precaution_2,Precaution_3,Precaution_4
0,0,Drug Reaction,stop irritation,consult nearest hospital,stop taking drug,follow up
1,1,Malaria,Consult nearest hospital,avoid oily food,avoid non veg food,keep mosquitos out
2,2,Allergy,apply calamine,cover area with bandage,,use ice to compress itching
3,3,Hypothyroidism,reduce stress,exercise,eat healthy,get proper sleep
4,4,Psoriasis,wash hands with warm soapy water,stop bleeding using pressure,consult doctor,salt baths


In [9]:
workout.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,disease,workout
0,0,0,Fungal infection,Avoid sugary foods
1,1,1,Fungal infection,Consume probiotics
2,2,2,Fungal infection,Increase intake of garlic
3,3,3,Fungal infection,Include yogurt in diet
4,4,4,Fungal infection,Limit processed foods


In [10]:
description.head()

Unnamed: 0,Disease,Description
0,Fungal infection,Fungal infection is a common skin condition ca...
1,Allergy,Allergy is an immune system reaction to a subs...
2,GERD,GERD (Gastroesophageal Reflux Disease) is a di...
3,Chronic cholestasis,Chronic cholestasis is a condition where bile ...
4,Drug Reaction,Drug Reaction occurs when the body reacts adve...


In [11]:
medications.head()

Unnamed: 0,Disease,Medication
0,Fungal infection,"['Antifungal Cream', 'Fluconazole', 'Terbinafi..."
1,Allergy,"['Antihistamines', 'Decongestants', 'Epinephri..."
2,GERD,"['Proton Pump Inhibitors (PPIs)', 'H2 Blockers..."
3,Chronic cholestasis,"['Ursodeoxycholic acid', 'Cholestyramine', 'Me..."
4,Drug Reaction,"['Antihistamines', 'Epinephrine', 'Corticoster..."


In [12]:
diets.head()

Unnamed: 0,Disease,Diet
0,Fungal infection,"['Antifungal Diet', 'Probiotics', 'Garlic', 'C..."
1,Allergy,"['Elimination Diet', 'Omega-3-rich foods', 'Vi..."
2,GERD,"['Low-Acid Diet', 'Fiber-rich foods', 'Ginger'..."
3,Chronic cholestasis,"['Low-Fat Diet', 'High-Fiber Diet', 'Lean prot..."
4,Drug Reaction,"['Antihistamine Diet', 'Omega-3-rich foods', '..."


In [13]:
dataset.head()

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,


After inspecting all the datasets, the two most important ones are the symptom dataset (dataset) and the medication dataset (medications). These are the two datasets we will use to build the medicine recommendation system.

The next step is to clean these two datasets.


## Data Cleaning

In [15]:
# Count the distinct diseases (prognoses) in this dataset
len(dataset['Disease'].unique())

41

The dataset contains 41 distinct diseases.

In [17]:
#Check for missing values
dataset.isna().sum()

Disease          0
Symptom_1        0
Symptom_2        0
Symptom_3        0
Symptom_4      348
Symptom_5     1206
Symptom_6     1986
Symptom_7     2652
Symptom_8     2976
Symptom_9     3228
Symptom_10    3408
Symptom_11    3726
Symptom_12    4176
Symptom_13    4416
Symptom_14    4614
Symptom_15    4680
Symptom_16    4728
Symptom_17    4848
dtype: int64

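We will not drop rows with missing values. Diseases have different numbers of symptoms, so the unused symptom columns are simply empty. Instead, we fill the empty slots with the placeholder string 'None'.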

In [38]:
dataset.fillna('None', inplace=True)
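
Most classifiers cannot use symptom names stored as strings across Symptom_1 to Symptom_17, so a common next step is to turn them into one 0/1 column per symptom. The snippet below is a minimal sketch of that idea on a toy frame modeled on the rows shown earlier. The toy values, the leading whitespace in some cells, and the names `toy`, `multi_hot`, and `features` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy rows modeled on dataset.head(); 'None' is the placeholder used by fillna above.
# The leading spaces are an assumption, added to show why stripping is a safe habit.
toy = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection"],
    "Symptom_1": ["itching", " skin_rash"],
    "Symptom_2": [" skin_rash", " nodal_skin_eruptions"],
    "Symptom_3": [" nodal_skin_eruptions", "None"],
})

symptom_cols = [c for c in toy.columns if c.startswith("Symptom_")]

# Stack all symptom slots into one long Series, tidy the strings,
# and discard the 'None' placeholder so it does not become a feature.
long = toy[symptom_cols].stack().str.strip()
long = long[long != "None"]

# One 0/1 column per symptom, collapsed back to one row per original record.
multi_hot = pd.get_dummies(long).groupby(level=0).max().astype(int)
features = toy[["Disease"]].join(multi_hot)
print(features)
```

With this layout the order in which symptoms were recorded no longer matters, and an empty slot simply means the symptom is absent.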

In [40]:
dataset.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
402    False
403    False
405    False
406    False
407    False
Length: 304, dtype: bool

In [41]:
dataset = dataset.drop_duplicates()


In [42]:
dataset.shape

(304, 18)
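
Printing the full boolean Series from `duplicated()` makes it hard to see how many rows are affected. As a hedged sketch, the count can be read directly with `.sum()`, and the index can be renumbered after dropping rows; without that, `drop_duplicates` keeps the original labels, which is why the output above jumps from 403 to 405. The toy frame and the names `n_duplicates` and `deduped` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows with one exact duplicate (rows 0 and 1 are identical).
toy = pd.DataFrame({
    "Disease": ["Impetigo", "Impetigo", "Impetigo"],
    "Symptom_1": ["skin_rash", "skin_rash", "high_fever"],
    "Symptom_2": ["blister", "blister", "blister"],
})

# Each True counts as 1, so the sum is the number of duplicate rows.
n_duplicates = int(toy.duplicated().sum())

# Drop duplicates and renumber the index so it runs 0..n-1 without gaps.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped.shape)
```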

In [43]:
dataset

Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
402,Impetigo,high_fever,blister,red_sore_around_nose,yellow_crust_ooze,,,,,,,,,,,,,
403,Impetigo,skin_rash,blister,red_sore_around_nose,yellow_crust_ooze,,,,,,,,,,,,,
405,Impetigo,skin_rash,high_fever,red_sore_around_nose,yellow_crust_ooze,,,,,,,,,,,,,
406,Impetigo,skin_rash,high_fever,blister,yellow_crust_ooze,,,,,,,,,,,,,


In [134]:
medications.head()

Unnamed: 0,Disease,Medication
0,Fungal infection,"['Antifungal Cream', 'Fluconazole', 'Terbinafi..."
1,Allergy,"['Antihistamines', 'Decongestants', 'Epinephri..."
2,GERD,"['Proton Pump Inhibitors (PPIs)', 'H2 Blockers..."
3,Chronic cholestasis,"['Ursodeoxycholic acid', 'Cholestyramine', 'Me..."
4,Drug Reaction,"['Antihistamines', 'Epinephrine', 'Corticoster..."


In [44]:
#Check for missing values
medications.isna().sum()

Disease       0
Medication    0
dtype: int64

There are no missing values in this small dataset.

In [46]:
#Check for duplicates
medications.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
dtype: bool

This small dataset also has no duplicate rows.

In [47]:
medications.shape


(41, 2)

The medication dataset has two columns: one holds the disease (prognosis) and the other holds the medications for that disease. It contains 41 rows, one for each of the 41 diseases and its respective medications.
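
The Medication column appears in the output as quoted text such as "['Antifungal Cream', 'Fluconazole', ...", which suggests each cell is a string that looks like a Python list rather than a real list. Assuming that is the case, the sketch below parses the strings with `ast.literal_eval` and builds a disease-to-medication lookup. The shortened toy lists and the names `meds`, `med_lookup`, and `recommend_medications` are hypothetical.

```python
import ast
import pandas as pd

# Toy rows with shortened medication lists (illustrative, not the full data).
meds = pd.DataFrame({
    "Disease": ["Fungal infection", "Allergy"],
    "Medication": [
        "['Antifungal Cream', 'Fluconazole']",
        "['Antihistamines', 'Decongestants']",
    ],
})

# Safely turn each list-like string into an actual Python list.
meds["Medication"] = meds["Medication"].apply(ast.literal_eval)

# Dictionary keyed by disease name for constant-time lookups.
med_lookup = meds.set_index("Disease")["Medication"].to_dict()


def recommend_medications(disease):
    """Return the medication list for a disease, or an empty list if unknown."""
    return med_lookup.get(disease, [])


print(recommend_medications("Allergy"))
print(recommend_medications("Unknown disease"))
```

Because there is one row per disease, a plain dictionary is enough here; a merge would only be needed if the lookup had to be attached to many prediction rows at once.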

## EDA

In [48]:
# Print the first few rows of the DataFrame
print(dataset.head())

# Check the unique values in the prognosis column
unique_prognoses = dataset['Disease'].unique()
print(unique_prognoses)

# Check if there is any data imbalance or anomaly
prognosis_counts = dataset['Disease'].value_counts()
print(prognosis_counts)


            Disease   Symptom_1              Symptom_2              Symptom_3  \
0  Fungal infection     itching              skin_rash   nodal_skin_eruptions   
1  Fungal infection   skin_rash   nodal_skin_eruptions    dischromic _patches   
2  Fungal infection     itching   nodal_skin_eruptions    dischromic _patches   
3  Fungal infection     itching              skin_rash    dischromic _patches   
4  Fungal infection     itching              skin_rash   nodal_skin_eruptions   

              Symptom_4 Symptom_5 Symptom_6 Symptom_7 Symptom_8 Symptom_9  \
0   dischromic _patches      None      None      None      None      None   
1                  None      None      None      None      None      None   
2                  None      None      None      None      None      None   
3                  None      None      None      None      None      None   
4                  None      None      None      None      None      None   

  Symptom_10 Symptom_11 Symptom_12 Symptom_13 Symp
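
The raw `value_counts()` output can be long to scan for imbalance. One possible way to summarize it, sketched here on hypothetical counts, is to put counts and shares side by side and compute the ratio between the largest and smallest class. The names `diseases`, `summary`, and `imbalance_ratio` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical label column; the real one would be dataset['Disease'].
diseases = pd.Series(
    ["Fungal infection"] * 5 + ["Allergy"] * 4 + ["GERD"] * 1,
    name="Disease",
)

counts = diseases.value_counts()                # rows per disease, largest first
share = diseases.value_counts(normalize=True)   # fraction of all rows

summary = pd.DataFrame({"count": counts, "share": share.round(2)})

# A ratio close to 1 means balanced classes; a large ratio signals imbalance.
imbalance_ratio = counts.max() / counts.min()

print(summary)
print(imbalance_ratio)
```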

In [None]:
dataset.describe()

In [None]:
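from sklearn.preprocessing import LabelEncoder

# Work on a copy of the cleaned dataset; 'Disease' is the categorical target
df = dataset.copy()
label_encoder = LabelEncoder()
df['prognosis_encoded'] = label_encoder.fit_transform(df['Disease'])

# Symptoms are stored as strings in Symptom_1..Symptom_17, so build 0/1 indicator columns first
symptoms = ['itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing']
symptom_values = df.filter(like='Symptom_').apply(lambda col: col.str.strip())
for symptom in symptoms:
    df[symptom] = symptom_values.eq(symptom).any(axis=1).astype(int)

# Calculate correlation between symptoms and prognosis
# (label codes are arbitrary integers, so treat these values as a rough check only)
correlation_with_prognosis = df[symptoms + ['prognosis_encoded']].corr()

# Display correlation with prognosis
print(correlation_with_prognosis['prognosis_encoded'].sort_values(ascending=False))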
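
A label-encoded disease column assigns integers in alphabetical order, so a Pearson correlation against it depends on that arbitrary ordering. As an alternative view, offered only as a sketch, the share of records per disease that report each symptom can be computed with a group-wise mean over 0/1 symptom columns. The toy values and the name `prevalence` are hypothetical.

```python
import pandas as pd

# Hypothetical 0/1 symptom indicators, one row per record.
toy = pd.DataFrame({
    "Disease": ["Fungal infection", "Fungal infection", "Allergy", "Allergy"],
    "itching": [1, 0, 0, 0],
    "skin_rash": [1, 1, 0, 0],
    "continuous_sneezing": [0, 0, 1, 1],
})

# The mean of a 0/1 column within a group is the fraction of that
# disease's records in which the symptom is present.
prevalence = toy.groupby("Disease").mean()

print(prevalence)
```

Values near 1 mark symptoms that are typical for a disease, and values near 0 mark symptoms that rarely or never occur with it.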
