# Deep learning based medical diagnoser

The dataset contains patients' symptom descriptions together with the diagnosis and medication they received.
The program has to predict both for new patients.


based on: https://www.geeksforgeeks.org/build-a-deep-learning-based-medical-diagnoser/


Deep learning has already shown remarkable success in many industries by helping to automate processes. Here we apply it to medicine: we build a deep learning model trained on patients' problem descriptions (free text) that outputs a predicted disease and a recommended medicine for the patient's problem.

This task suits a Recurrent Neural Network (RNN), because the model has to carry information from earlier words in the text forward and use it when predicting the output. We therefore train a Long Short-Term Memory (LSTM) network with TensorFlow.

### Long Short-Term Memory (LSTM) Networks
When dealing with textual data, such as patient symptoms, a specific type of deep learning architecture called a Long Short-Term Memory (LSTM) network is often used. LSTM networks are well-suited for tasks involving sequences of data, as they can learn long-term dependencies between elements in the sequence.

For example, consider a patient describing their symptoms as "I've experienced a loss of appetite and don't enjoy food anymore, followed by fatigue and muscle weakness." An LSTM network can take the order of these symptoms into account ("loss of appetite" followed by "fatigue and muscle weakness") when making a diagnosis.
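
To make the point about order concrete, here is a small sketch in plain Python (no TensorFlow). It shows that a view which only counts words cannot tell two symptom descriptions apart, while the word sequence that an LSTM consumes can. Both sentences are made up for illustration.

```python
from collections import Counter

# Two hypothetical descriptions with the same words in a different order.
first = "loss of appetite followed by fatigue"
second = "fatigue followed by loss of appetite"

tokens_first = first.split()
tokens_second = second.split()

# A bag-of-words view only counts words, so both descriptions look identical.
same_bag = Counter(tokens_first) == Counter(tokens_second)

# A sequence view keeps positions, so the two descriptions differ.
same_sequence = tokens_first == tokens_second

print(same_bag)       # True
print(same_sequence)  # False
```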

#### Importing Libraries
First, we import all the necessary libraries. 'Tokenizer' from TensorFlow handles text tokenization, 'pad_sequences' handles sequence padding, 'to_categorical' converts integer labels to binary class matrices, and 'LabelEncoder' from scikit-learn encodes text labels as integers. The same cell also imports 'Model' and the layers used later to build the network.

In [1]:
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

In [2]:
# Step 1: Load the CSV file

file_path = 'medical_data.csv'
df = pd.read_csv(file_path)

# Step 2: Display the first few rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

# Step 3: Display basic information about the dataset
print("\nDataset Info:")
df.info()  # info() prints its report itself and returns None

# Step 4: Display basic statistics about numeric columns
print("\nDescriptive Statistics:")
print(df.describe())

# Step 5: Check for missing values
print("\nMissing Values in Each Column:")
print(df.isnull().sum())

# Step 6: Display column names
print("\nColumn Names:")
print(df.columns)

# Step 7: Get a quick overview of the dataset shape (rows, columns)
print("\nDataset Shape:")
print(df.shape)

data = df

First 5 rows of the dataset:
                                     Patient_Problem  \
0  Constant fatigue and muscle weakness, struggli...   
1  Frequent severe migraines, sensitivity to ligh...   
2  Sudden weight gain and feeling cold, especiall...   
3  High fever, sore throat, and swollen lymph nod...   
4  Excessive thirst and frequent urination, dry m...   

                    Disease                                       Prescription  
0  Chronic Fatigue Syndrome  Cognitive behavioral therapy, graded exercise ...  
1        Migraine with Aura  Prescription triptans, avoid triggers like bri...  
2            Hypothyroidism  Levothyroxine to regulate thyroid hormone levels.  
3             Mononucleosis            Rest and hydration, ibuprofen for pain.  
4         Diabetes Mellitus             Insulin therapy and lifestyle changes.  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407 entries, 0 to 406
Data columns (total 3 columns):
 #   Column           Non-Nul

#### Data Preprocessing and Preparation
Before using medical data in a deep learning model, it needs to be preprocessed to ensure the model can understand it. Preprocessing steps often include:

- Text Tokenization: Converting textual data into sequences of numbers that the model can process.
- Padding Sequences: Making all sequences the same length by adding padding values (zeros) at the beginning or end of shorter sequences.
- Label Encoding: Converting categorical variables, such as disease names and medication names, into numerical labels.

In [3]:
tokenizer = Tokenizer(num_words=5000, oov_token="<OOV>")
tokenizer.fit_on_texts(data['Patient_Problem'])

sequences = tokenizer.texts_to_sequences(data['Patient_Problem'])

The 'tokenizer' converts the textual data into sequences of integers. Setting num_words=5000 restricts it to the most frequent words in the dataset, which keeps the vocabulary small. Any word outside that vocabulary, such as a word in a new patient's description that never appeared in the training data, is replaced with the '<OOV>' token.
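
The idea behind word-index tokenization with an out-of-vocabulary token can be sketched in plain Python. This is not the Keras implementation: `build_word_index` and `texts_to_ids` are hypothetical helper names, the sketch only lowercases and splits on whitespace, and it leaves out the num_words limit. It does follow the Keras convention of keeping index 0 free for padding and giving index 1 to the OOV token.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer, most frequent first.

    Index 0 is left free for padding and index 1 is given to the OOV token.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_ids(texts, word_index, oov_token="<OOV>"):
    """Replace every word by its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.lower().split()]
            for text in texts]

# Made-up training texts, used only to build the vocabulary.
training_texts = ["high fever and sore throat", "sore throat and cough"]
word_index = build_word_index(training_texts)

# "headache" was never seen while building the index, so it maps to 1.
print(texts_to_ids(["sore throat and headache"], word_index))
```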

#### Padding Sequences
To give all input sequences the same length, the code finds the longest sequence and pads every shorter one with zeros at the end ('post' padding) until it matches that length.

In [4]:
max_length = max(len(x) for x in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
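
As a rough illustration of what 'post' padding does, here is a plain-Python sketch. `pad_post` is a hypothetical helper, not part of Keras. Unlike `pad_sequences`, it does not truncate sequences that are longer than the target length.

```python
def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_length = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_length - len(seq)) for seq in sequences]

example_sequences = [[5, 8, 2], [7], [4, 9]]
print(pad_post(example_sequences))
# [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```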

#### Encoding the Labels and Converting them to Categorical
We encode the 'Disease' and 'Prescription' columns as integers, then convert the integer labels into one-hot (binary class) matrices.

In [5]:
# Encoding the labels
label_encoder_disease = LabelEncoder()
label_encoder_prescription = LabelEncoder()

disease_labels = label_encoder_disease.fit_transform(data['Disease'])
prescription_labels = label_encoder_prescription.fit_transform(data['Prescription'])

# Converting labels to categorical
disease_labels_categorical = to_categorical(disease_labels)
prescription_labels_categorical = to_categorical(prescription_labels)
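
The two encoding steps can be sketched in plain Python to show what the resulting arrays look like. `encode_labels` and `one_hot` are hypothetical helpers that only mimic the behavior of `LabelEncoder` (sorted unique classes) and `to_categorical`. The four disease labels are a made-up sample.

```python
def encode_labels(labels):
    """Return (classes, integer codes); classes are the sorted unique labels."""
    classes = sorted(set(labels))
    class_to_index = {name: i for i, name in enumerate(classes)}
    return classes, [class_to_index[name] for name in labels]

def one_hot(codes, num_classes):
    """Turn each integer code into a row with a single 1."""
    return [[1 if i == code else 0 for i in range(num_classes)]
            for code in codes]

diseases = ["Migraine", "Hypothyroidism", "Migraine", "Mononucleosis"]
classes, codes = encode_labels(diseases)

print(classes)  # ['Hypothyroidism', 'Migraine', 'Mononucleosis']
print(codes)    # [1, 0, 1, 2]
print(one_hot(codes, len(classes)))
# [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```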

#### Combining Labels into a Multi-label Target Variable
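Finally, we stack the two one-hot matrices side by side into a single target matrix 'Y', so that each row holds the disease label followed by the prescription label. Note that the training call later in this notebook passes the two label matrices separately, one per output, so 'Y' itself is not used for fitting.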

e.g.  <br>
disease_labels_categorical:    
[[1, 0, 0], <br>
 [0, 1, 0], <br>
 [0, 0, 1]]

prescription_labels_categorical: <br>
[[0, 1], <br>
 [1, 0], <br>
 [0, 1]]

Combo: <br>
[[1, 0, 0, 0, 1], <br>
 [0, 1, 0, 1, 0], <br>
 [0, 0, 1, 0, 1]]

In [6]:
Y = np.hstack((disease_labels_categorical, prescription_labels_categorical))
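
Because the disease columns come first, a stacked row can be split back into its two parts at the number of disease classes. The sketch below reproduces the 3 + 2 column example with plain lists. `stack_rows` and `split_row` are hypothetical helpers, not NumPy functions.

```python
def stack_rows(left, right):
    """Concatenate matching rows of two matrices, like a horizontal stack."""
    return [l_row + r_row for l_row, r_row in zip(left, right)]

def split_row(row, num_left):
    """Split a stacked row back into its two one-hot parts."""
    return row[:num_left], row[num_left:]

disease_rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
prescription_rows = [[0, 1], [1, 0], [0, 1]]

combined = stack_rows(disease_rows, prescription_rows)
print(combined)
# [[1, 0, 0, 0, 1], [0, 1, 0, 1, 0], [0, 0, 1, 0, 1]]

disease_part, prescription_part = split_row(combined[1], num_left=3)
print(disease_part, prescription_part)  # [0, 1, 0] [1, 0]
```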

## Model building

#### Defining Model Architecture
We use 'Input' and 'Model' to define the architecture, 'Embedding' to convert the integer sequences into dense vectors of fixed size, 'LSTM' to read each sequence in order, and 'Dense' for the two output layers that make the predictions.

In [7]:
input_layer = Input(shape=(max_length,))

embedding = Embedding(input_dim=5000, output_dim=64)(input_layer)
lstm_layer = LSTM(64)(embedding)

disease_output = Dense(len(label_encoder_disease.classes_), activation='softmax',
                       name='disease_output')(lstm_layer)

prescription_output = Dense(len(label_encoder_prescription.classes_),
                            activation='softmax', name='prescription_output')(lstm_layer)

The model starts with an input layer that accepts sequences of length max_length. An embedding layer then turns each integer token into a 64-dimensional vector, an LSTM layer reads those vectors in order, and two dense layers with softmax activation classify the disease and the prescription.
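
The parameter counts that `model.summary()` reports for this architecture can be checked by hand. The arithmetic below uses the layer sizes from the code and the 178 disease classes shown in the printed summary. The variable names are only for illustration.

```python
vocab_size = 5000      # input_dim of the Embedding layer
embedding_dim = 64     # output_dim of the Embedding layer
lstm_units = 64        # units of the LSTM layer
disease_classes = 178  # width of disease_output in the printed summary

# One vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Weight matrix plus one bias per output class.
disease_dense_params = lstm_units * disease_classes + disease_classes

print(embedding_params, lstm_params, disease_dense_params)
# 320000 33024 11570
```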

#### Compiling the model

In [8]:
model = Model(inputs=input_layer, outputs=[disease_output, prescription_output])

model.compile(
    loss={'disease_output': 'categorical_crossentropy', 
    'prescription_output': 'categorical_crossentropy'},
    optimizer='adam',
    metrics={'disease_output': ['accuracy'], 'prescription_output': ['accuracy']}
)

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 17)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 17, 64)       320000      ['input_1[0][0]']                
                                                                                                  
 lstm (LSTM)                    (None, 64)           33024       ['embedding[0][0]']              
                                                                                                  
 disease_output (Dense)         (None, 178)          11570       ['lstm[0][0]']                   
                                                                                              

#### Training the model

In [9]:
model.fit(padded_sequences,
          {'disease_output': disease_labels_categorical,
           'prescription_output': prescription_labels_categorical},
          epochs=100, batch_size=32)

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x1adb6626620>

### Making Predictions
The model is used to make predictions for new patients:

1. Preprocess the patient's symptoms by tokenizing and padding them.
2. Feed the preprocessed data into the trained model.
3. Let the model predict the disease and medication from the symptoms.
4. Present the predicted disease and medication.



In [16]:
def make_prediction(patient_problem):
    # Preprocessing the input
    sequence = tokenizer.texts_to_sequences([patient_problem])
    padded_sequence = pad_sequences(sequence, maxlen=max_length, padding='post')
    
    # Making prediction
    prediction = model.predict(padded_sequence)
    
    # Decoding the prediction
    disease_index = np.argmax(prediction[0], axis=1)[0]
    prescription_index = np.argmax(prediction[1], axis=1)[0]
    
    disease_predicted = label_encoder_disease.inverse_transform([disease_index])[0]
    prescription_predicted = label_encoder_prescription.inverse_transform([prescription_index])[0]
    
    print(f"Predicted Disease: {disease_predicted}")
    print(f"Suggested Prescription: {prescription_predicted}")


patient_input = "I've experienced a loss of appetite and don't enjoy food anymore."
make_prediction(patient_input)

Predicted Disease: Hyperthyroidism
Suggested Prescription: Thyroid hormone replacement therapy.
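
The decoding step in `make_prediction` keeps only the single most probable class of each output. A small variation, sketched here in plain Python, also reports runner-up candidates together with their probabilities. `decode_top_k` is a hypothetical helper, and the class names and probabilities are made up rather than taken from the trained model.

```python
def decode_top_k(probabilities, class_names, k=2):
    """Return the k most probable (class name, probability) pairs."""
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up softmax output for three hypothetical disease classes.
class_names = ["Hyperthyroidism", "Hypothyroidism", "Migraine"]
probabilities = [0.55, 0.40, 0.05]

for name, probability in decode_top_k(probabilities, class_names):
    print(f"{name}: {probability:.2f}")
# Hyperthyroidism: 0.55
# Hypothyroidism: 0.40
```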


## Visualizing training loss and accuracy

In [17]:
import matplotlib.pyplot as plt

# fit() ran without validation data, so only training curves are available.
# Keras stores the History of the last fit() call on model.history.
history_df = pd.DataFrame(model.history.history)
history_df.loc[:, ['loss']].plot()
history_df.filter(like='accuracy').plot()
plt.show()