# Rules of survival

### Mini-project

In this small project you will use the PRISM Rule Learner algorithm to learn some rules about COVID-19 comorbidity factors. Write as much about your findings as possible. You may add external information or additional datasets for extra credit.

## 1. Algorithm

Copy your implementation of the correct and tested algorithm in the cell below. You do not need to supply any comments or explanations. 

In [26]:
import numpy as np
import pandas as pd
from classes import *



def learn_one_rule(columns, data, class_label, rule = None, min_coverage=30, min_accuracy=0.6):
    covered = data.copy()
    accuracy = 0.00
    coverage = 0
    if rule is not None:
        current_rule = rule
        accuracy = rule.accuracy
        coverage = rule.coverage
    else:
        current_rule = Rule(class_label)

    best_accuracy = 0.00
    best_coverage = 0
    attr = None
    val = None
    true_false = None

    for col in columns[:-1]:
        options = covered[col].unique().tolist()
        for option in options:
            if (isinstance(option, int) or isinstance(option, float)):
                split1 = covered[covered[col] >= option]
                split2 = covered[covered[col] < option]
                if(len(split1) == 0):
                    true_acc = 0.0
                else:
                    true_acc = len(split1[(split1[columns[-1]] == class_label)])/len(split1)
                if(len(split2) == 0):
                    false_acc = 0.0
                else:
                    false_acc = len(split2[(split2[columns[-1]] == class_label)])/len(split2)
                true_false = (true_acc > false_acc)
                curr_acc = max(true_acc, false_acc)
                curr_cov = 0
                if(true_false == True):
                    curr_cov = len(split1)
                else:
                    curr_cov = len(split2)

                if(curr_acc > best_accuracy or (curr_acc == best_accuracy and curr_cov > best_coverage)):
                    best_accuracy = curr_acc
                    best_coverage = curr_cov
                    attr = col
                    val = option

            else:
                split = covered[covered[col] == option]
                #print(col, option, class_label)
                #print(split)
                #print(len(split[(split[columns[-1]] == class_label)]))
                #print(len(split))

                curr_acc = len(split[(split[columns[-1]] == class_label)])/len(split)
                #print(curr_acc)

                curr_cov = len(split)



                if(curr_acc > best_accuracy or (curr_acc == best_accuracy and curr_cov > best_coverage)):
                    true_false = None
                    best_accuracy = curr_acc
                    best_coverage = curr_cov
                    attr = col
                    val = option


    if(best_accuracy >= 1 and best_coverage >= min_coverage):
        cond = Condition(attr, val, true_false)
        current_rule.addCondition(cond)
        current_rule.setParams(best_accuracy, best_coverage)
        if(true_false is not None):
            if(true_false == True):
                covered = covered[(covered[attr] >= val)]
            else:
                covered = covered[(covered[attr] < val)]
        else:
            covered = covered[(covered[attr] == val)]


        return (current_rule, covered)

    else:
        if(len(columns) <= 2):
            if(best_accuracy >= min_accuracy and best_coverage >= min_coverage):
                cond = Condition(attr, val, true_false)
                current_rule.addCondition(cond)
                current_rule.setParams(best_accuracy, best_coverage)
                if(true_false is not None):
                    if(true_false == True):
                        covered = covered[(covered[attr] >= val)]
                    else:
                        covered = covered[(covered[attr] < val)]
                else:
                    covered = covered[(covered[attr] == val)]

                return (current_rule, covered)
            else:
                return (None, covered)
        else:

            cond = Condition(attr, val, true_false)
            current_rule.addCondition(cond)
            current_rule.setParams(best_accuracy, best_coverage)
            if(true_false is not None):
                if(true_false == True):
                    covered = covered[(covered[attr] >= val)]
                else:
                    covered = covered[(covered[attr] < val)]
            else:
                covered = covered[(covered[attr] == val)]
            #print(covered)
            # recurse on the remaining attributes, excluding the one just used
            copy = columns.copy()
            copy.remove(attr)
            return learn_one_rule(copy, covered, class_label, current_rule, min_coverage, min_accuracy)


    return None



def learn_rules (columns, data, classes=None,
                 min_coverage = 30, min_accuracy = 0.6):
    # List of final rules
    rules = []

    # If list of classes of interest is not provided - it is extracted from the last column of data
    if classes is not None:
        class_labels = classes
    else:
        class_labels = data[columns[-1]].unique().tolist()

    current_data = data.copy()

    # This follows the logic of the original PRISM algorithm
    # It processes each class in turn.
    for class_label in class_labels:
        done = False

        while len(current_data) > min_coverage and not done:
            # Learn one rule
            rule, subset = learn_one_rule(columns, current_data, class_label, min_coverage=min_coverage, min_accuracy=min_accuracy)

            # If the best rule does not pass the coverage threshold - we are done with this class
            if rule is None:
                done = True
                continue

            # If we get the rule with accuracy and coverage above threshold

            if rule.accuracy >= min_accuracy:
                rules.append(rule)

                # remove rows covered by this rule
                # you have to remove the rows where all of the conditions hold
                #print(current_data)
                #current_data = current_data[(current_data['astigmatism'] != 'yes')]
                #current_data = current_data[(current_data['tear production rate'] != 'reduced')]

                current_data = current_data[~current_data.isin(subset).all(axis=1)]



            else:
                done = True

    return rules
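
The covering step in `learn_rules` removes the rows matched by a newly learned rule. `DataFrame.isin` with a DataFrame argument aligns on both index and column labels, so when `subset` is a filtered view of `current_data` the same rows can be dropped by index. The following is a minimal sketch of that equivalence, not a replacement for the code above; the helper name `remove_covered` and the toy data are made up for illustration, and it assumes `subset` keeps the original index labels and contains no missing values.

```python
import pandas as pd


def remove_covered(data, subset):
    """Drop the rows of `data` whose index labels appear in `subset`.

    Hypothetical helper: assumes `subset` was produced by filtering `data`,
    so its index labels are a subset of `data.index`.
    """
    return data.drop(index=subset.index)


# Toy data, invented for this example only
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Survived": [1, 0, 1, 1],
})

# Rows covered by a one-condition rule: Sex=female
covered = toy[toy["Sex"] == "female"]

remaining = remove_covered(toy, covered)
print(remaining)
```

On this toy frame the index-based version and the `isin(...).all(axis=1)` expression leave the same two rows.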


## 2. Titanic dataset: the rules of survival

Our very familiar Titanic [dataset](https://docs.google.com/spreadsheets/d/1QGNxqRU02eAvTGih1t0cErB5R05mdOdUBgJZACGcuvs/edit?usp=sharing).

In [27]:
data_file = "../../Datasets/titanic.csv"

In [28]:
data = pd.read_csv(data_file)

# take a subset of attributes
data = data[['Pclass', 'Sex', 'Age', 'Survived']]

# drop all rows with missing values
data = data.dropna(how="any")
print("Total rows", len(data))

column_list = data.columns.to_numpy().tolist()
print("Columns:", column_list)

Total rows 714
Columns: ['Pclass', 'Sex', 'Age', 'Survived']


In [29]:
# we can set different accuracy thresholds
# here we can reorder class labels - to first learn the rules with class label "survived".

rules = learn_rules(column_list, data, [1,0], 30, 0.7)
for rule in rules[:10]:
    print(rule)

If [Age>=1.0:True, Sex=female, Pclass>=2:False] then 1. Coverage:85, accuracy: 0.9647058823529412
If [Age>=1.0:True, Sex=female, Pclass>=3:True] then 1. Coverage:74, accuracy: 0.918918918918919
If [Age>=1.0:True, Sex=female, Pclass>=2:True] then 1. Coverage:74, accuracy: 0.918918918918919
If [Age>=64.0:False, Pclass>=2:True, Sex=male] then 0. Coverage:347, accuracy: 0.8472622478386167
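
In the printed rules, a numeric condition such as `Pclass>=2:False` appears to mean that the test `Pclass >= 2` is false, i.e. `Pclass < 2`; this matches how `learn_one_rule` filters with `<` when `true_false` is `False`. Under that reading, a rule's coverage and accuracy can be checked by hand with a boolean mask. The sketch below uses a small invented table, not the Titanic data, and the helper `rule_stats` is a hypothetical name.

```python
import pandas as pd


def rule_stats(df, mask, target, class_label):
    """Return (coverage, accuracy) of a rule given as a boolean row mask.

    coverage = number of rows matched by the rule
    accuracy = fraction of matched rows whose target equals class_label
    """
    covered = df[mask]
    coverage = len(covered)
    if coverage == 0:
        return 0, 0.0
    accuracy = float((covered[target] == class_label).mean())
    return coverage, accuracy


# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 1, 3],
    "Sex": ["female", "female", "female", "male", "male", "female"],
    "Age": [29.0, 2.0, 40.0, 22.0, 54.0, 0.75],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# If [Age>=1.0:True, Sex=female, Pclass>=2:False] then 1
mask = (toy["Age"] >= 1.0) & (toy["Sex"] == "female") & ~(toy["Pclass"] >= 2)

coverage, accuracy = rule_stats(toy, mask, "Survived", 1)
print(coverage, accuracy)
```

The same mask-based check could be applied to the real `data` frame to confirm the coverage and accuracy figures printed above.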


## 3. Coronavirus: symptoms and outcome

Coronavirus [dataset](https://drive.google.com/file/d/1uVd09ekR1ArLrA8qN-Xtu4l-FFbmetVy/view?usp=sharing) (preprocessed as outlined [here](rules_motivation.ipynb)).

In [30]:
data_file = "../../Datasets/covid_categorical_good.csv"

In [31]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")
data.columns

Index(['sex', 'age', 'diabetes', 'copd', 'asthma', 'imm_supr', 'hypertension',
       'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'outcome'],
      dtype='object')

The most accurate rules will have the class label "alive". There could be too many rules, and we might never get to the class label "dead" if we rank them by accuracy.

If we want to see which combination of attributes leads to "dead", we might want to run the algorithm with only this class label and set a lower accuracy threshold.

Remove the _age_ attribute and run your algorithm with parameters shown below.

In [32]:
# We really want to learn first what makes covid deadly
data_categorical = data.copy()
del data_categorical['age']
data_rows = data_categorical.to_numpy().tolist()
columns_list = data_categorical.columns.to_numpy().tolist()

class_labels = ["dead"]
rules = learn_rules (columns_list, data_categorical, class_labels, 2, 0.5)
for rule in rules[:20]:
    print(rule)

If [renal_chronic=yes, diabetes=yes, cardiovascular=yes, obesity=no, sex=male, copd=yes, imm_supr=no] then dead. Coverage:8, accuracy: 1.0


Now try on both classes and for the entire dataset including _age_. Collect top 20 most accurate rules.

In [37]:
# This may take some time to run (took 12 min on my computer - what about your implementation?)
columns_list = data.columns.to_numpy().tolist()
rules = learn_rules (columns_list, data, None, 1, 0.9)
for rule in rules[:20]:
    print(rule)

If [age>=106:True] then alive. Coverage:6, accuracy: 1.0
If [age>=26:False, tobacco=yes, asthma=yes] then alive. Coverage:47, accuracy: 1.0
If [age>=26:False, tobacco=yes, cardiovascular=yes] then alive. Coverage:15, accuracy: 1.0
If [age>=26:False, tobacco=yes, copd=yes] then alive. Coverage:2, accuracy: 1.0
If [age>=26:False, tobacco=yes, sex=female, obesity=yes] then alive. Coverage:82, accuracy: 1.0
If [age>=26:False, tobacco=yes, obesity=no, diabetes=yes] then alive. Coverage:6, accuracy: 1.0
If [age>=26:False, tobacco=yes, obesity=no, hypertension=no, sex=female, renal_chronic=yes] then alive. Coverage:1, accuracy: 1.0
If [age>=26:False, tobacco=yes, obesity=no, hypertension=no, sex=female, diabetes=no, copd=no, asthma=no, imm_supr=no, cardiovascular=no, renal_chronic=no] then alive. Coverage:264, accuracy: 0.9962121212121212
If [age>=26:False, hypertension=no, copd=yes] then alive. Coverage:22, accuracy: 1.0
If [age>=26:False, hypertension=no, tobacco=yes, obesity=no, renal_chro

## 4. Discussion

Write here a discussion about the rules that you have learned from both datasets. 

1. Did any of these rules surprise you?

Simply because so many people in the dataset are marked as "alive", many of the learned rules seem inconsistent with what we know about Covid-19. For example, we know that tobacco use and asthma each increase the risk of death from Covid-19, but one of the rules we found states that people under the age of 26 with _both_ of these conditions will be alive with an accuracy of 100 percent (and a relatively high coverage of 47). Several rules are similar to this one, notably the following (note that I reduced the coverage limit to learn as much as possible from this data):

- If age>=26:False, tobacco=yes, cardiovascular=yes then alive. Coverage:15, accuracy: 1.0
- If age>=26:False, asthma=yes, diabetes=yes then alive. Coverage:13, accuracy: 1.0

Another interesting observation is that tobacco usage is present in many of these rules, and the rules predict "alive" in all cases. Even when a rule consists mostly of "no" indicators, a "yes" tobacco indicator is sometimes present. The following rule seems especially strange:

- If age>=26:False, tobacco=yes, obesity=no, hypertension=no, sex=female, diabetes=no, copd=no, asthma=no, imm_supr=no, cardiovascular=no, renal_chronic=no then alive. Coverage:264, accuracy: 0.9962121212121212

2. Do you have a meaningful logical explanation for these rules?

The most meaningful explanation I can give for these results is that the death rate of Covid-19 is so low that it is difficult to discern any meaningful results from rule generation with this dataset. While each of the conditions in the dataset is known to be a risk factor for Covid-19 related deaths, the odds of dying with these conditions are still so low that the majority of rules will be associated with a living patient, and it is nearly impossible to find a combination of conditions under which a high number of patients died.
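
One way to make this argument concrete is to compare a rule's accuracy with the base rate of its class: if almost everyone is "alive", a rule with 100 percent accuracy for "alive" is only slightly better than predicting "alive" for every row. The sketch below computes that ratio (often called lift) on a small invented table; the numbers are not taken from the Covid-19 dataset, and `rule_lift` is a hypothetical helper name.

```python
import pandas as pd


def rule_lift(df, mask, target, class_label):
    """Ratio of a rule's accuracy to the base rate of class_label.

    A value close to 1 means the rule predicts the class barely better
    than ignoring all attributes. Returns NaN if the rule covers no rows
    or the class never occurs.
    """
    base_rate = float((df[target] == class_label).mean())
    covered = df[mask]
    if len(covered) == 0 or base_rate == 0:
        return float("nan")
    rule_accuracy = float((covered[target] == class_label).mean())
    return rule_accuracy / base_rate


# Toy data, invented for illustration: 9 of 10 patients are alive
toy = pd.DataFrame({
    "tobacco": ["yes", "yes", "no", "no", "yes", "no", "no", "no", "yes", "no"],
    "outcome": ["alive"] * 9 + ["dead"],
})

# Rule: If [tobacco=yes] then alive (100 percent accurate on the toy data)
lift = rule_lift(toy, toy["tobacco"] == "yes", "outcome", "alive")
print(round(lift, 3))
```

In this toy example the rule is perfectly accurate, yet its lift is only about 1.11, which illustrates why highly accurate "alive" rules need not say anything about tobacco being protective.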

3. What additional research is needed to understand the meaning of your findings?

We would certainly need to conduct more research into how risk factors interact with one another in the face of Covid-19. Someone with one condition is undoubtedly at a higher risk of dying from Covid-19, but if they have multiple conditions, is their risk compounded? Do some conditions make the symptoms of another condition worse? Also, for some of these conditions, an indication of the severity of the condition might be helpful. For example, with obesity, is this person just barely over the obesity threshold, or is this someone who is well beyond the threshold and has very unhealthy habits? How much tobacco does the person use? These are just a few of the questions that might help us narrow down the true causes of death from Covid-19.

Copyright &copy; 2022 Marina Barsky. All rights reserved.