# Homework 2 Part 2 (due 7/07/2024)

# Health-care assistance via probabilistic graphical modeling

### Objective
In this project, you will create a health-care assistance bot that can suggest diagnoses for a set of symptoms based on a probabilistic graphical model.

### Step 1: Review 
Review the code from the Bayesian networks exercise.

### Step 2: Acquire data
View this [research article](https://www.nature.com/articles/ncomms5212) and download its supplementary data sets 1, 2, and 3. These data sets list the occurrences of diseases and symptoms, and their co-occurrences, in the scientific literature. (For this exercise, we assume that the frequency with which diseases and symptoms co-occur in scientific papers is proportional to the frequency with which they co-occur in actual disease cases.)

### Step 3: Create a Bayesian network
Using commands from the `pgmpy` library, create a Bayesian network in which the probability of exhibiting a symptom is conditional on the probability of having an associated disease. 
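As a rough illustration of the structure this step asks for, the sketch below turns (symptom, disease) pairs into directed edges from disease nodes to symptom nodes. The pairs are copied from the relations preview in this notebook; the helper names `build_edges` and `parents_by_symptom` are hypothetical, and the sketch deliberately avoids `pgmpy` so it runs on its own.

```python
from collections import defaultdict

# (symptom, disease) pairs copied from the relations preview in this notebook.
pairs = [
    ("Fever", "Neutropenia"),
    ("Body Weight", "Hypertension"),
    ("Body Weight", "Diabetes Mellitus"),
    ("Body Weight", "Diabetes Mellitus, Experimental"),
    ("Body Weight", "Diabetes Mellitus, Type 2"),
]


def build_edges(pairs):
    """Return (parent, child) edges pointing from each disease to its symptom.

    The 'Disease: ' and 'Symptom: ' prefixes keep node names distinct even if
    the same term appears in both tables.
    """
    return [(f"Disease: {disease}", f"Symptom: {symptom}") for symptom, disease in pairs]


def parents_by_symptom(edges):
    """Group disease parents under each symptom node."""
    parents = defaultdict(list)
    for parent, child in edges:
        parents[child].append(parent)
    return dict(parents)


edges = build_edges(pairs)
parents = parents_by_symptom(edges)

for child, disease_nodes in parents.items():
    print(child, "<-", disease_nodes)
```

An edge list in this shape is what the solution code passes to `BayesianNetwork(edges)`.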

### Step 4: Initialize priors
Use the disease occurrence data to assign prior probabilities for diseases.
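One simple way to read this step is to divide each disease's occurrence count by the total count over all diseases kept in the model. The sketch below does that for the five diseases shown in the disease table preview of this notebook; the function name `normalize` is hypothetical, and restricting the total to these five rows is an assumption made only to keep the example small.

```python
# Occurrence counts copied from the disease table preview in this notebook.
occurrences = {
    "Hypertension": 107294,
    "Coronary Artery Disease": 82819,
    "Myocardial Infarction": 75945,
    "Coronary Disease": 64339,
    "Asthma": 63669,
}


def normalize(counts):
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize counts that sum to zero")
    return {name: count / total for name, count in counts.items()}


priors = normalize(occurrences)

for name, p in priors.items():
    # Each disease becomes a binary node with P(present) = p, P(absent) = 1 - p.
    print(f"{name}: P(present) = {p:.4f}, P(absent) = {1 - p:.4f}")
```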

### Step 5: Calculate conditional probability tables
Use the co-occurrence data to define CPTs for each connected disease-symptom pair. (Hint: You may need to assign some occurrences of symptoms to an "idiopathic disease" to create valid CPTs.)
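The hint can be interpreted in more than one way. The sketch below shows one possible reading, not a prescribed method: symptom occurrences that are not covered by any listed disease are assigned to a catch-all "Idiopathic disease", so the shares over causes sum to 1. It also estimates P(symptom | disease) for one pair as co-occurrence count divided by disease count. The counts come from the table previews in this notebook (only the previewed rows are used); the function name `add_idiopathic` is hypothetical.

```python
# Counts copied from the table previews in this notebook.
symptom_total = 147857  # PubMed occurrences of the symptom "Body Weight"
co_occurrence = {
    "Hypertension": 3054,
    "Diabetes Mellitus": 1100,
    "Diabetes Mellitus, Experimental": 2268,
    "Diabetes Mellitus, Type 2": 1612,
}
disease_total = {"Hypertension": 107294}


def add_idiopathic(symptom_total, co_occurrence, label="Idiopathic disease"):
    """Assign symptom occurrences not explained by any listed disease to a catch-all cause."""
    leftover = symptom_total - sum(co_occurrence.values())
    if leftover < 0:
        raise ValueError("co-occurrences exceed the symptom total")
    counts = dict(co_occurrence)
    counts[label] = leftover
    return counts


counts = add_idiopathic(symptom_total, co_occurrence)

# Share of the symptom's occurrences attributed to each cause; sums to 1 by construction.
shares = {cause: n / symptom_total for cause, n in counts.items()}

# Assumed estimate of P(symptom present | disease present) for a single pair.
p = co_occurrence["Hypertension"] / disease_total["Hypertension"]
cpt_column = [1 - p, p]  # [P(symptom absent | disease), P(symptom present | disease)]

print(counts["Idiopathic disease"], round(p, 4))
```

Under this reading, every symptom has at least one parent, and each CPT column is a valid distribution because its two entries sum to 1.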

### Step 6: Build a minimal interface
Create a minimal interface in which your bot asks a user for a list of observed symptoms and then returns the name of the disease that is the most likely match to the symptoms. (Hint: Review the input/output commands that you used in last week's homework.)
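One practical detail for the interface is that several MeSH terms contain commas (for example "Pain, Postoperative"), so a comma-separated list is ambiguous. The sketch below assumes the user separates symptoms with semicolons and matches them case-insensitively against the known terms. The function name `parse_symptoms` and the semicolon convention are illustrative choices, not part of the assignment.

```python
def parse_symptoms(raw, known_terms):
    """Split a semicolon-separated string into recognized and unrecognized symptoms.

    Matching ignores case and surrounding whitespace; recognized symptoms are
    returned with their canonical spelling and without duplicates.
    """
    lookup = {term.casefold(): term for term in known_terms}
    valid, unknown = [], []

    for piece in raw.split(";"):
        name = piece.strip()
        if not name:
            continue  # ignore empty entries such as a trailing semicolon
        match = lookup.get(name.casefold())
        if match is None:
            unknown.append(name)
        elif match not in valid:
            valid.append(match)

    return valid, unknown


# A few symptom terms taken from the symptom list in this notebook.
known = ["Pain", "Pain, Postoperative", "Fever", "Cough"]

valid, unknown = parse_symptoms("fever; Pain, Postoperative; sneezing", known)
print("Recognized:", valid)
print("Not recognized:", unknown)
```

Reading all symptoms in one line also removes the need to ask the user how many symptoms they have.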

In [1]:
from functools import reduce
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination
import pandas as pd
import numpy as np

In [2]:
# Import all data

diseases = pd.read_csv('data/diseases.txt', sep = '\t')
symptoms = pd.read_csv('data/symptoms.txt', sep = '\t')
relations = pd.read_csv('data/relations.txt', sep = '\t')

In [3]:
# Filter relations
relations = relations[relations['PubMed occurrence'] > 1000]

# Filter diseases
idxs_diseases = diseases['MeSH Disease Term'].isin(relations['MeSH Disease Term'])
diseases = diseases[idxs_diseases]

# Filter symptoms
idxs_symptoms = symptoms['MeSH Symptom Term'].isin(relations['MeSH Symptom Term'])
symptoms = symptoms[idxs_symptoms]

In [4]:
diseases.head()

Unnamed: 0,MeSH Disease Term,PubMed occurrence
1,Hypertension,107294
2,Coronary Artery Disease,82819
4,Myocardial Infarction,75945
6,Coronary Disease,64339
7,Asthma,63669


In [5]:
symptoms.head()

Unnamed: 0,MeSH Symptom Term,PubMed occurrence
0,Body Weight,147857
1,Pain,103168
2,Obesity,100301
3,Anoxia,47351
4,Mental Retardation,43883


In [6]:
relations.head()

Unnamed: 0,MeSH Symptom Term,MeSH Disease Term,PubMed occurrence,TFIDF score
2420,Fever,Neutropenia,1236,765.006914
6173,Body Weight,Hypertension,3054,1047.484885
6404,Body Weight,Diabetes Mellitus,1100,377.286632
6405,Body Weight,"Diabetes Mellitus, Experimental",2268,777.896437
6407,Body Weight,"Diabetes Mellitus, Type 2",1612,552.89641


In [7]:
# Create a Bayesian Network
edges = []

for symptom in symptoms['MeSH Symptom Term']:
    idxs = relations['MeSH Symptom Term'] == symptom

    for disease in relations[idxs]['MeSH Disease Term']:
        edges.append((f'Disease: {disease}', f'Symptom: {symptom}'))

model = BayesianNetwork(edges)

# Create CPDs
for disease in diseases['MeSH Disease Term']:
    p = 0

    if diseases['PubMed occurrence'].sum() != 0:
        p = diseases[diseases['MeSH Disease Term'] == disease]['PubMed occurrence'].values[0] / diseases['PubMed occurrence'].sum()

    cpd = TabularCPD(
        f'Disease: {disease}',
        2,
        [[1 - p], [p]]
    )

    model.add_cpds(cpd)

for symptom in symptoms['MeSH Symptom Term']:
    idxs = relations['MeSH Symptom Term'] == symptom

    diseases_with_symptom = relations[idxs]['MeSH Disease Term'].values
    probs = []

    for i in range(2 ** diseases_with_symptom.size):
        p = 0
        binary_str = np.binary_repr(i, diseases_with_symptom.size)
        binary = [bool(int(b)) for b in binary_str]

        conds = []

        for j, disease in enumerate(diseases_with_symptom):
            if binary[j]:
                conds.append(relations['MeSH Disease Term'] == disease)

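        if len(conds) != 0:
            # Combine the masks with OR: each relations row names a single disease,
            # so AND-ing masks for two different diseases would never match a row.
            cond = reduce(lambda x, y: x | y, conds)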
            
            with_symptom = relations[cond & (relations['MeSH Symptom Term'] == symptom)]['PubMed occurrence'].sum()
            total = relations[cond]['PubMed occurrence'].sum()
            p = with_symptom / total if total != 0 else 0
            
        probs.append(p)

    cpd = TabularCPD(
        f'Symptom: {symptom}',
        2,
        [1 - np.array(probs), probs],
        evidence=[f'Disease: {disease}' for disease in diseases_with_symptom],
        evidence_card=[2 for _ in diseases_with_symptom],
    )

    model.add_cpds(cpd)

# model.get_cpds()
assert model.check_model()
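The symptom CPTs above have one column per combination of parent states, so a symptom with k disease parents needs 2^k columns. The sketch below reproduces the enumeration used in the loop with the standard library (`format(i, '0kb')` gives the same string as `np.binary_repr(i, k)` for non-negative `i`) and shows which parent states each column index stands for. It assumes, as the loop above does, that `TabularCPD` orders columns with the last evidence variable varying fastest. The node names and the helper `column_states` are hypothetical.

```python
from itertools import product


def column_states(evidence):
    """Return, for each CPT column index, the presence flag of every parent.

    Column i is decoded from the k-bit binary form of i, so the first parent
    is the most significant bit and the last parent varies fastest.
    """
    k = len(evidence)
    columns = []
    for i in range(2 ** k):
        bits = format(i, f"0{k}b")
        columns.append({name: bit == "1" for name, bit in zip(evidence, bits)})
    return columns


evidence = ["Disease: A", "Disease: B"]  # hypothetical parent nodes
columns = column_states(evidence)

for index, states in enumerate(columns):
    print(index, states)

# The same ordering falls out of a Cartesian product over (absent, present).
as_tuples = [tuple(states.values()) for states in columns]
print(as_tuples == list(product([False, True], repeat=len(evidence))))
```

Because the table doubles with every added parent, filtering the relations to frequent pairs (as done earlier in the notebook) also keeps the CPTs at a manageable size.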

In [21]:
print('Here are all the symptoms you can have:')

for symptom in symptoms['MeSH Symptom Term']:
  print(symptom)

print('\nHow many symptoms do you have?')
n_symptoms = int(input())
symptoms_of_patient = []

while n_symptoms > 0:
  print('Enter the symptom:')
  symptom = input()

  if symptom in symptoms['MeSH Symptom Term'].values:
    symptoms_of_patient.append(symptom)
    n_symptoms -= 1
  else:
    print('Invalid symptom, try again:')

# Create an inference object
inference = VariableElimination(model)

# Track the best candidate seen so far
max_disease = None
max_disease_prob = 0
evidence = {f'Symptom: {symptom}': 1 for symptom in symptoms_of_patient}

for disease in diseases['MeSH Disease Term'].values:
  prob = inference.query(variables = [f'Disease: {disease}'], evidence = evidence)

  # Skip diseases whose name matches one of the reported symptoms
  if prob.values[1] > max_disease_prob and disease not in symptoms_of_patient:
    max_disease = disease
    max_disease_prob = prob.values[1]

if max_disease is None:
  print('No disease matched the given symptoms.')
else:
  print(f'The most probable disease is {max_disease} with a probability of {max_disease_prob * 100:.2f}%')

Here are all the symptoms you can have:
Body Weight
Pain
Obesity
Anoxia
Mental Retardation
Seizures
Angina Pectoris
Fever
Pain, Postoperative
Deafness
Headache
Vision Disorders
Weight Loss
Weight Gain
Proteinuria
Urinary Incontinence
Paralysis
Blindness
Back Pain
Dyspnea
Hearing Disorders
Sleep Disorders
Memory Disorders
Learning Disorders
Low Back Pain
Hearing Loss, Sensorineural
Cough
Paraplegia
Dizziness
Albuminuria
Coma
Hemiplegia
Facial Paralysis
Speech Disorders
Jaundice
Obesity, Morbid
Muscle Weakness
Syncope
Hallucinations
Urinary Incontinence, Stress
Aphasia
Muscular Atrophy
Chest Pain
Vertigo
Angina, Unstable
Pruritus
Tremor
Ophthalmoplegia
Overweight
Dyspepsia
Hearing Loss
Quadriplegia
Intermittent Claudication
Neuralgia
Muscle Spasticity
Urinary Bladder, Neurogenic
Sleep Deprivation
Respiratory Sounds
Apnea
Ataxia
Spasm
Amnesia
Purpura, Thrombocytopenic
Dyslexia
Hyperalgesia
Hearing Loss, Noise-Induced
Dystonia
Amblyopia
Pain, Intractable
Language Disorders
Cardiac Output, 