In [21]:
# Homework 2 Part 2 (due 7/07/2024)

# Health-care assistance via probabilistic graphical modeling

### Objective
In this project, you will create a health-care assistance bot that can suggest diagnoses for a set of symptoms based on a probabilistic graphical model.

### Step 1: Review 
Review the code from the Bayesian networks exercise.

### Step 2: Acquire data
View this [research article](https://www.nature.com/articles/ncomms5212) and download its supplementary data sets 1, 2 and 3. These data sets include the occurrences of diseases, symptoms, and their co-occurrences in the scientific literature. (For the purpose of this exercise, we are going to assume that the frequency of co-occurrences of diseases and symptoms in scientific papers is proportional to the co-occurence frequencies of actual disease cases and symptoms.)

### Step 3: Create a Bayesian network
Using commands from the `pgmpy` library, create a Bayesian network in which the probability of exhibiting a symptom is conditional on the probability of having an associated disease. 

### Step 4: Initialize priors
Use the disease occurrence data to assign prior probabilities for diseases.

### Step 5: Calculate conditional probability tables
Use the co-occurrence data to define CPTs for each connected pair of disease and symptoms. (Hint: You may need to assign some occurrences of symptoms to an "idiopathic disease" to create valid CPTs.)

### Step 6:
Create a minimal interface in which your bot asks a users for a list of observed symptoms and then returns the name of the disease that is the most likely match to the symptoms. (Hint: Review the input/output commands that you have used in last week's homework.)

### Step 2&3&5: Downloading the datasets, and Creating Bayesian Network, and everthing else

In [10]:
import pgmpy
print(pgmpy.__version__)
#checking for pgmpy

0.1.25


### Here is the Whole code for the Healthcare assitance

In [23]:
import pandas as pd
import numpy as np
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination
import itertools

# Function to clean data for data3 (co-occurrence data)
def clean_data3(df):
    if {'MeSH Symptom Term', 'MeSH Disease Term', 'PubMed occurrence'}.issubset(df.columns):
        df = df.dropna(subset=['MeSH Symptom Term', 'MeSH Disease Term', 'PubMed occurrence'])
        df['PubMed occurrence'] = pd.to_numeric(df['PubMed occurrence'], errors='coerce')
        df = df[df['PubMed occurrence'] >= 500]  # Filtering threshold so that it actually runs
    return df

# Function to load data from CSV files
def load_data(file_path, separator='\t'):
    return pd.read_csv(file_path, sep=separator)

# Function to calculate 2x2 conditional probability tables
def calculate_CPT2x2(disease_occ, symptom_occ, interaction_occ, 
                     total_disease_occ, total_symptom_occ, total_interaction_occ):
    p_disease = disease_occ / total_disease_occ
    p_joint = interaction_occ / total_interaction_occ

    pTT = (p_joint / p_disease) if p_disease > 0 else 0.0
    pFT = 1 - pTT

    pTF = (symptom_occ - interaction_occ) / total_symptom_occ
    pFF = 1 - pTF

    return [pFF, pTF, pFT, pTT]

# Function to create Bayesian Network structure and CPDs
def create_bayesian_network(data1, data2, data3):
    structure = []
    symptoms_list = []
    disease_nodes = []

    for index, row in data3.iterrows():
        symptoms_list.append(row['MeSH Symptom Term'] + "(symptom)")
        disease_nodes.append(row['MeSH Disease Term'] + "(disease)")
        structure.append((row['MeSH Disease Term'] + "(disease)", row['MeSH Symptom Term'] + "(symptom)"))

    model = BayesianNetwork(structure)

    total_disease_occurrences = data1['PubMed occurrence'].sum()
    total_symptom_occurrences = data2['PubMed occurrence'].sum()
    total_interaction_occurrences = data3['PubMed occurrence'].sum()

    for disease in disease_nodes:
        prior_prob = data1.loc[data1["MeSH Disease Term"] == disease[:-9], "PubMed occurrence"].values / total_disease_occurrences
        prior_prob = float(prior_prob)
        value = [[1 - prior_prob], [prior_prob]]
        cpd_disease = TabularCPD(variable=disease, variable_card=2, values=value)
        model.add_cpds(cpd_disease)

    CPTs_symptoms = []

    for symptom in symptoms_list:
        parents = list(model.predecessors(symptom))
        little_cpts = []

        for disease in parents:
            disease_occurrence = grab_from_df(data1, disease)
            symptom_occurrence = grab_from_df(data2, symptom)
            interaction_occurrence = grab_from_df2(data3, symptom, disease)

            little_cpt = calculate_CPT2x2(disease_occurrence, symptom_occurrence,
                                          interaction_occurrence, total_disease_occurrences,
                                          total_symptom_occurrences, total_interaction_occurrences)
            little_cpts.append(little_cpt)

        rowT = []
        for bool_combo in itertools.product([0, 1], repeat=len(parents)):
            cond_probs = [little_cpts[i][2 + b] for i, b in enumerate(bool_combo)]
            rowT.append(np.prod(cond_probs))

        rowF = [1 - val for val in rowT]

        cpt = TabularCPD(variable=symptom, variable_card=2, values=[rowF, rowT],
                         evidence=parents, evidence_card=[2 for _ in parents])
        CPTs_symptoms.append(cpt)

    for cpd in CPTs_symptoms:
        model.add_cpds(cpd)

    return model, disease_nodes

# Function for user interaction and inference
def health_assistance_bot(model, disease_nodes):
    print("Welcome to the Disease Prediction Bot!")
    while True:
        symptoms_input = input("Please enter the symptoms you are experiencing, separated by commas (or 'exit' to quit): ").strip()
        if symptoms_input.lower() == 'exit':
            print("Exiting Disease Prediction Bot.")
            break

        observed_symptoms = [symptom.capitalize() + "(symptom)" for symptom in symptoms_input.split(",")]
        observed_evidence = {symptom: 1 for symptom in observed_symptoms}

        inference = VariableElimination(model)
        maximum_list = {}

        for disease in disease_nodes:
            prob_dis = inference.query(variables=[disease], evidence=observed_evidence)
            maximum_list[disease] = prob_dis.values[1]

        most_likely = sorted(maximum_list.items(), key=lambda x: x[1], reverse=True)[:7]

        print("\nThe 7 most likely diseases based on your symptoms of " +symptoms_input+ " are:\n")
        for disease, probability in most_likely:
            print(f"{disease} with probability: {round(probability * 100, 2)}%")
        print()

# Helper functions to fetch values from dataframes
def grab_from_df(df, value, iv_index=0, dv_index=1):
    sub_df = df[df[df.columns[iv_index]] == value][df.columns[dv_index]]
    return sub_df.iloc[0] if len(sub_df) else 0

def grab_from_df2(df, value1, value2, iv1_index=0, iv2_index=1, dv_index=2):
    sub_df = df[(df[df.columns[iv1_index]] == value1) & (df[df.columns[iv2_index]] == value2)][df.columns[dv_index]]
    return sub_df.iloc[0] if len(sub_df) else 0

if __name__ == "__main__":
    # Load and clean data
    data1 = clean_data3(load_data('suppdata1.txt'))
    data2 = clean_data3(load_data('suppdata2.txt'))
    data3 = clean_data3(load_data('suppdata3.txt'))

    # Create Bayesian Network
    model, disease_nodes = create_bayesian_network(data1, data2, data3)

    # Run the Health Assistance Bot
    health_assistance_bot(model, disease_nodes)




Welcome to the Disease Prediction Bot!

The 7 most likely diseases based on your symptoms of anoxia are:

Hypertension(disease) with probability: 1.05%
Coronary Artery Disease(disease) with probability: 0.81%
Myocardial Infarction(disease) with probability: 0.74%
Coronary Disease(disease) with probability: 0.63%
Asthma(disease) with probability: 0.62%
Dementia(disease) with probability: 0.54%
Obesity(disease) with probability: 0.49%

The 7 most likely diseases based on your symptoms of fever are:

Hypertension(disease) with probability: 1.05%
Coronary Artery Disease(disease) with probability: 0.81%
Myocardial Infarction(disease) with probability: 0.74%
Coronary Disease(disease) with probability: 0.63%
Asthma(disease) with probability: 0.62%
Dementia(disease) with probability: 0.54%
Obesity(disease) with probability: 0.49%

The 7 most likely diseases based on your symptoms of pain are:

Hypertension(disease) with probability: 1.05%
Coronary Artery Disease(disease) with probability: 0.81