In [None]:
# Homework 2 Part 2 (due 7/07/2024)

# Health-care assistance via probabilistic graphical modeling

### Objective
In this project, you will create a health-care assistance bot that can suggest diagnoses for a set of symptoms based on a probabilistic graphical model.

### Step 1: Review 
Review the code from the Bayesian networks exercise.

### Step 2: Acquire data
View this [research article](https://www.nature.com/articles/ncomms5212) and download its supplementary data sets 1, 2 and 3. These data sets include the occurrences of diseases, symptoms, and their co-occurrences in the scientific literature. (For the purpose of this exercise, we are going to assume that the frequency of co-occurrences of diseases and symptoms in scientific papers is proportional to the co-occurence frequencies of actual disease cases and symptoms.)

### Step 3: Create a Bayesian network
Using commands from the `pgmpy` library, create a Bayesian network in which the probability of exhibiting a symptom is conditional on the probability of having an associated disease. 

### Step 4: Initialize priors
Use the disease occurrence data to assign prior probabilities for diseases.

### Step 5: Calculate conditional probability tables
Use the co-occurrence data to define CPTs for each connected pair of disease and symptoms. (Hint: You may need to assign some occurrences of symptoms to an "idiopathic disease" to create valid CPTs.)

### Step 6:
Create a minimal interface in which your bot asks a users for a list of observed symptoms and then returns the name of the disease that is the most likely match to the symptoms. (Hint: Review the input/output commands that you have used in last week's homework.)

In [77]:
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination
import itertools
import numpy as np
Disease_symptom_interactions = BayesianNetwork()



# Open disease file and add to list
file = open('disease_occurrence', 'r')
lines_disease = file.readlines()
lines_disease.pop(0)
file.close()

total_disease_occurrences = 0

# add a 'd' to differentiate from symptoms
for line in lines_disease:
    line = "d_" + line
    
    # also add the occurrences in each line to the total occurrences
    total_disease_occurrences = total_disease_occurrences + int(line.split('\t')[1])




# Open symptom file and add to a list of lines
file = open('symptom_occurrence', 'r')
lines_symptom = file.readlines()
lines_symptom.pop(0)
file.close()

total_symptom_occurrences = 0

# add an 's' to differentiate from diseases
for line in lines_symptom:
    line = "s_" + line

    # also add the occurrences in each line to the total occurrences
    total_symptom_occurrences = total_symptom_occurrences + int(line.split('\t')[1])




# Open coooccurrences file and add to a list of lines
file = open('disease_symptom_co', 'r')
lines_coocc = file.readlines()
lines_coocc.pop(0)
file.close()

total_interaction_occurrences = 0

# add the occurrences in each line to the total occurrences
for line in lines_coocc:
    total_interaction_occurrences = total_interaction_occurrences + int(line.split('\t')[2])




# start list of all symptom nodes
symptoms2 = []

# make a reduced list of cooccurrence data, involving only the cooccurrences >= 500
lines_coocc_relevant = []

threshold = 500

for line in lines_coocc:
    # split each line and the values in each line are symptom, disease, cooccurrence, and TFIDF score (not used)
    row_values = line.split("\t")
    symptom = row_values[0]
    disease = row_values[1]
    cooccurrence = row_values[2]

    # if above the threshold, add to reduced list of cooccurrences
    if int(cooccurrence) >= 500:
        lines_coocc_relevant.append(line)

        # also add "d_" in front of diseases and "s_" in front of symptoms to distinguish them from each other
        disease = "d_" + disease
        symptom = "s_" + symptom

        # add relevant symptoms to list
        symptoms2.append(symptom)

        # add an edge pointing from a disease to its associated symptom
        Disease_symptom_interactions.add_edge(disease, symptom)





def grabFromDF(list, value, iv_index = 0, dv_index = 1):
    # remove the _s or _d respectively
    value = value[2:]

    # search the list for the value--the name of the disease/symptom
    for line in list:
        line_values = line.split('\t')
        item = line_values[iv_index]
        if item == value:
            # if we have the # of occurrences for our given disease/symptom, return that vlaue
            if len(line_values) > 1:
                item_occurrences = line_values[dv_index]
                return int(item_occurrences)
            else:
                return 0




def grabFromDF2(list, value1, value2, iv1_index =0, iv2_index=1, dv_index=2):
    i = 0
    # remove the _s and _d from the beginning of the inputted disease + symptom
    value1 = value1[2:]
    value2 = value2[2:]

    # search the list for the line with the matching disease + symptom
    while i < len(list):
        line_values = list[i].split('\t')

        # make sure we have enough items in the line to conduct search (ie, length >=3)
        if len(line_values) > 2:
            var1 = line_values[iv1_index]
            var2 = line_values[iv2_index]
            var3 = line_values[dv_index]
            if var1 == value1:
                if var2 == value2:
                    # if we find a match, return the number of cooccurrences
                    return int(var3)
        i = i + 1




def CPT2x2(disease_occ, symptom_occ, interaction_occ, total_disease_occ, total_symptom_occ, total_interaction_occ):
    # prob. of disease
    p_disease = disease_occ/total_disease_occ
    # joint probability of symptom and disease occurring
    p_joint = interaction_occ/total_interaction_occ

    # conditional prob. of symptom occ. given disease
    pTT = (p_joint/p_disease if p_joint>0 else 0.0)
    # conditional prob. of symptom non-occ. given disease
    pFT = 1 - pTT
    # conditional prob. of symptom occ. given disease absence
    pTF = (symptom_occ-interaction_occ)/total_symptom_occ
    # conditional prob. of symptom non-occ. given disease absence
    pFF = 1-pTF

    return [pFF, pTF, pFT, pTT]




# create list of symptoms that does not repeat
symptoms_set = list(set(symptoms2))




# Define CPTs for symptom nodes
CPTs_symptoms = []

for symptom in symptoms_set:
    # get all parent nodes
    parents = list(Disease_symptom_interactions.predecessors(symptom))
    # print(len(parents), end = ' ')

    # collect 2x2 CPTs for each parent
    little_cpts = []

    for disease in parents: 
        # occurrence of selected disease
        disease_occurrence = grabFromDF(lines_disease, disease)
    
        # occurrence of selected symptom
        symptom_occurrence = grabFromDF(lines_symptom, symptom)

        # occurrence of interaction
        interaction_occurrence = grabFromDF2(lines_coocc_relevant, symptom, disease)        

        little_cpt = CPT2x2(disease_occurrence, symptom_occurrence, interaction_occurrence, total_disease_occurrences, total_symptom_occurrences, total_interaction_occurrences)
        # add 2x2-CPT to list of 2x2-CPTs
        little_cpts += [little_cpt]
        
    # assume naive Bayes
    rowT = [] # row of prob.s where symptom == True
    for bool_combo in itertools.product([0,1], repeat=len(parents)):
        cond_probs = [little_cpts[i][2+b] for i,b in enumerate(bool_combo)]
        rowT += [np.prod(cond_probs)]
    
    rowF = [1-val for val in rowT] # row of probs where symptom == False

    cpt = TabularCPD(variable=symptom, variable_card=2, values = [rowF, rowT], evidence = parents, evidence_card = [2 for _ in parents])
    CPTs_symptoms += [cpt]




def get_symptoms():
    print("Here are the available symptoms:")

    print("Symptoms: ")
    for i, symptom in enumerate(symptoms_set,1):
        symptom = symptom[2:]
        print(f"{i}. {symptom}")

    selected_symptoms = []
    while True:
        choice = input(f"Enter 'STOP' to finish list. Otherwise, select symptom{len(selected_symptoms) + 1} (1-170): ")
        if choice == "STOP":
            return selected_symptoms
        try:
            if 1 <= int(choice) <= 20 and symptoms_set[int(choice)-1] not in selected_symptoms:
                selected_symptoms.append(symptoms_set[int(choice)-1])
            else:
                print("Invalid choice or symptom already selected")
        except ValueError:
            print("Please enter a number between 1 and 170.")




def generate_answer(selected_symptoms):
    # we will find the highest_probability by going through diseases, claculating prob. of having a particular diseaese based on symptoms, and comparing probabilities
    most_probable_disease = ""
    highest_probability = 0
    
    parents_list = [] # this will be all the parent diseases we will consider (diseases w/ non negligible interactions with any of the symptoms)

    for symptom in selected_symptoms:
        symptom = symptom[2:]
        # get paretns for each symptom
        parents = list(Disease_symptom_interactions.predecessors(symptom))
        for parent in parents:
            parent = parent[2:]
            # if not already in the big list, add parent disease
            if parent not in parents_list:
                parents_list.append(parent)
    
    # now we calculate prob. to ahve each parent disease basedd on symptoms.
    for parent in parents_list:
        probability = 1

        # for each selected symptom, calculate prob. of having disease given symptom and multiply them all together to get prob. of having disease given all symptoms
        for symptom in selected_symptoms:            
            symptom = symptom[2:]

            symptom_occurrence = grabFromDF(lines_symptom, symptom)
            probability_of_symptom = symptom_occurrence/total_symptom_occurrences

            # if there is a relationship, we get the cooccurrences, otherwise round down to 0 cooccurrences
            if grabFromDF2(lines_coocc_relevant, symptom, parent):
                cooccurrences = grabFromDF2(lines_coocc_relevant, symptom, parent)
            else:
                cooccurrences = 0
            probability_of_both = cooccurrences/total_interaction_occurrences

            # probability of diseaes given symptom is prob. of both/prob of symptom
            conditional_probability = probability_of_both/probability_of_symptom

            # multiply probability to overall probability
            probability = probability * conditional_probability

        # compare to current highest rpobability, if higher then replace prob. and parent
        if probability > highest_probability:
            highest_probability = probability
            most_probable_disease = parent
        

    return most_probable_disease




def medbot():
    print("Welcome to medbot")

    # run getting sympotms
    selected_symptoms = get_symptoms()

    # calculate most probable disease based on symptoms
    generated_answer = generate_answer(selected_symptoms)
    print(generated_answer)




if __name__ == "__main__":
    medbot()

    

Welcome to medbot
Here are the available symptoms:
Symptoms: 
1. Language Development Disorders
2. Syncope
3. Dystonia
4. Overweight
5. Urinary Bladder, Neurogenic
6. Respiratory Paralysis
7. Chest Pain
8. Catalepsy
9. Purpura, Thrombocytopenic
10. Athetosis
11. Hirsutism
12. Communication Disorders
13. Catatonia
14. Respiratory Sounds
15. Tinnitus
16. Sleep Disorders
17. Auditory Perceptual Disorders
18. Neck Pain
19. Consciousness Disorders
20. Colic
21. Quadriplegia
22. Jaundice, Obstructive
23. Perceptual Disorders
24. Weight Loss
25. Proteinuria
26. Hyperphagia
27. Fetal Macrosomia
28. Snoring
29. Angina Pectoris, Variant
30. Gastroparesis
31. Dizziness
32. Jaundice
33. Apraxias
34. Ophthalmoplegia
35. Hydrops Fetalis
36. Hearing Loss
37. Presbycusis
38. Thinness
39. Hemianopsia
40. Hypoventilation
41. Cardiac Output, Low
42. Unconsciousness
43. Back Pain
44. Taste Disorders
45. Psychomotor Agitation
46. Amblyopia
47. Shoulder Pain
48. Confusion
49. Myotonia
50. Paraplegia
51. Hem

NetworkXError: The node Dystonia is not in the digraph.