# Rules of survival

### Mini-project

In this small project you will use the PRISM Rule Learner algorithm to learn rules about COVID-19 comorbidity factors. Write as much about your findings as possible. You may add external information or additional datasets for extra credit.

## 1. Algorithm

Copy your correct and tested implementation of the algorithm into the cell below. You do not need to supply any comments or explanations.

In [None]:
import pandas as pd
import numpy as np

class Rule:
    def __init__(self, class_label):
        self.conditions = []  # list of conditions
        self.class_label = class_label  # rule class
        self.accuracy = 0
        self.coverage = 0
        
    def add_condition(self, condition):
        self.conditions.append(condition)

    def set_params(self, accuracy, coverage):
        self.accuracy = accuracy
        self.coverage = coverage
        
    def to_filter(self):
        result = ""
        for cond in self.conditions:
            result += cond.to_filter() + " & "
        result += "(current_data[columns[-1]] == class_label)"
        return result
    
    def to_filter_no_class(self):
        result = ""
        for cond in self.conditions:
            result += cond.to_filter() + " & "
        result += "True"
        return result
    
    # Human-readable printing of this Rule
    def __repr__(self):
        return "If {} then {}. Coverage:{}, accuracy: {}".format(self.conditions, self.class_label,
                                                                 self.coverage, self.accuracy)

    
    
class Condition:
    def __init__(self, attribute, value, true_false = None):
        self.attribute = attribute
        self.value = value
        self.true_false = true_false
        
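    def to_filter(self):
        # Build a pandas filter expression as a string, to be evaluated with eval()
        result = ""
        if self.true_false is None:
            # categorical attribute: equality test against a quoted value
            result += '(current_data["' + self.attribute + '"]' + "==" + '"' + self.value + '")'
        elif self.true_false:
            # numeric attribute: value at or above the threshold
            result += '(current_data["' + self.attribute + '"]' + ">=" + str(self.value) + ")"
        else:
            # numeric attribute: value below the threshold
            result += '(current_data["' + self.attribute + '"]' + "<" + str(self.value) + ")"
        return result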

    def __repr__(self):
        if self.true_false is None:
            return "{}={}".format(self.attribute, self.value)
        else:
            return "{}>={}:{}".format(self.attribute, self.value, self.true_false)

        
        
def learn_one_rule(columns, data, class_label, prev_rule=None, min_coverage = 30, min_accuracy = 0.6):
    current_data = data.copy()

    current_rule = prev_rule
    current_accuracy = 0
    current_coverage = 0
    covered_subset = None
    
    if current_rule is not None:
        current_accuracy = current_rule.accuracy
        current_coverage = current_rule.coverage
    
    best_col = None
    best_val = None
    true_false = None

    # The filter for the existing rule does not depend on the candidate
    # condition, so compute it once instead of once per candidate value.
    rule_filter = np.ones(len(current_data), dtype=bool)
    if current_rule is not None:
        rule_filter = eval(current_rule.to_filter_no_class())

    for col in columns[:-1]:

        unique_vals = current_data[col].unique().tolist()

        for val in unique_vals:

            if isinstance(val, (int, float)):
                
                bigger_subset = current_data[(current_data[col] >= val) & rule_filter]
                smaller_subset = current_data[(current_data[col] < val) & rule_filter]
                    
                bigger_tot = len(bigger_subset[columns[-1]])
                smaller_tot = len(smaller_subset[columns[-1]])
                
                bigger_cov = len(bigger_subset[bigger_subset[columns[-1]] == class_label])
                smaller_cov = len(smaller_subset[smaller_subset[columns[-1]] == class_label])
                
                if bigger_tot == 0 or smaller_tot == 0:
                    continue
                
                bigger_acc = bigger_cov/bigger_tot
                smaller_acc = smaller_cov/smaller_tot
                    
                # prefer the side with higher accuracy; on a tie, prefer the larger subset
                choose_bigger = bigger_acc > smaller_acc or (
                    bigger_acc == smaller_acc and bigger_tot > smaller_tot)
                    
                if choose_bigger:
                    if (bigger_acc >= current_accuracy and bigger_acc >= min_accuracy and bigger_tot >= min_coverage):
                        best_col = col
                        best_val = val
                        current_accuracy = bigger_acc
                        current_coverage = bigger_tot
                        true_false = True
                else:
                    if (smaller_acc >= current_accuracy and smaller_acc >= min_accuracy and smaller_tot >= min_coverage):
                        best_col = col
                        best_val = val
                        current_accuracy = smaller_acc
                        current_coverage = smaller_tot
                        true_false = False
                        
                
              
            else:
                curr_subset = current_data[(current_data[col] == val) & rule_filter]
                total = len(curr_subset[columns[-1]])
                if total == 0:
                    continue

                curr_cov = len(curr_subset[curr_subset[columns[-1]] == class_label])
                curr_acc = curr_cov/total

                if curr_acc >= current_accuracy and curr_acc >= min_accuracy and total >= min_coverage:
                    best_col = col
                    best_val = val
                    current_accuracy = curr_acc
                    current_coverage = total
                    true_false = None
                else:
                    continue

    if best_col is not None:

        if current_rule is None:
            current_rule = Rule(class_label)

        condition = Condition(best_col, best_val, true_false)
        current_rule.add_condition(condition)
        current_rule.set_params(current_accuracy, current_coverage)

        rule_filter = eval(current_rule.to_filter_no_class())
        covered_subset = current_data[rule_filter]
    
    return (current_rule, covered_subset)




def learn_rules (columns, data, classes=None, 
                 min_coverage = 30, min_accuracy = 0.6):
    rules = []
    if classes is not None:
        class_labels = classes
    else:
        class_labels = data[columns[-1]].unique().tolist()

    current_data = data.copy()
    
    for class_label in class_labels:
        while len(current_data) > min_coverage:
            # Learn a rule with a single condition
            (rule, current_subset) = learn_one_rule(columns, current_data, class_label, None, min_coverage, min_accuracy)
            # If no condition passes the coverage and accuracy thresholds - we are done with this class
            if rule is None:
                break

            # Refine the rule by adding conditions, using the same learn_one_rule
            # and passing the existing rule as a parameter; stop when the rule is
            # perfect or its accuracy is no longer improving
            prev_acc = 0
            new_acc = rule.accuracy
            while new_acc < 1 and new_acc > prev_acc:
                (rule, current_subset) = learn_one_rule(columns, current_subset, class_label, rule, min_coverage, min_accuracy)
                prev_acc = new_acc
                new_acc = rule.accuracy

            # Done with this rule: record it and remove the rows it covers
            if rule.accuracy >= min_accuracy:
                rules.append(rule)
                current_data = current_data.drop(current_data[eval(rule.to_filter_no_class())].index)
            else:
                break
                
    return rules
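
The core step in `learn_one_rule` is scoring every candidate condition `attribute = value` by its coverage (the number of rows it matches) and its accuracy (the fraction of those rows that carry the target class). The sketch below reproduces that scoring for categorical attributes on a made-up toy table. The helper name `score_candidates` and the data are illustrative assumptions, not part of the implementation above.

```python
import pandas as pd

def score_candidates(df, target, class_label):
    """Score each candidate condition `attribute == value`.

    Returns a DataFrame with one row per (attribute, value) pair, holding
    coverage (rows matched) and accuracy (share of matched rows whose
    target equals class_label).
    """
    rows = []
    for col in df.columns.drop(target):
        for val, group in df.groupby(col):
            coverage = len(group)
            accuracy = (group[target] == class_label).mean()
            rows.append({"attribute": col, "value": val,
                         "coverage": coverage, "accuracy": accuracy})
    return pd.DataFrame(rows)

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male"],
    "Pclass":   ["1", "1", "3", "1", "3", "3"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

scores = score_candidates(toy, "Survived", 1)
print(scores.sort_values(["accuracy", "coverage"], ascending=False))
```

PRISM then keeps the best-scoring candidate that also passes the minimum coverage and accuracy thresholds, and repeats the scoring on the covered subset to refine the rule.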

## 2. Titanic dataset: the rules of survival

Our very familiar Titanic [dataset](https://docs.google.com/spreadsheets/d/1QGNxqRU02eAvTGih1t0cErB5R05mdOdUBgJZACGcuvs/edit?usp=sharing).

In [49]:
data_file = "../../Datasets/titanic.csv"

In [50]:
data = pd.read_csv(data_file)

# take a subset of attributes
data = data[['Pclass', 'Sex', 'Age', 'Survived']]

# drop all rows with missing values
data = data.dropna(how="any")
print("Total rows", len(data))

column_list = data.columns.to_numpy().tolist()
print("Columns:", column_list)

Total rows 714
Columns: ['Pclass', 'Sex', 'Age', 'Survived']


In [52]:
# we can set different accuracy thresholds
# here we can reorder class labels - to first learn the rules with class label "survived".
rules = learn_rules(column_list, data, [1,0], 30, 0.7)
for rule in rules[:10]:
    print(rule)

If [Sex=female, Pclass>=2:False, Age>=26.0:True, Age>=47.0:False] then 1. Coverage:37, accuracy: 1.0
If [Sex=female, Pclass>=2:False, Age>=14.0:True, Age>=50.0:False, Sex=female] then 1. Coverage:32, accuracy: 0.96875
If [Age>=6.0:False, Pclass>=2:True] then 1. Coverage:41, accuracy: 0.7073170731707317
If [Sex=male, Age>=54.0:True, Age>=80.0:False, Sex=male] then 0. Coverage:36, accuracy: 0.9166666666666666
If [Sex=male, Pclass>=2:True, Age>=32.5:True, Age>=39.0:False, Sex=male] then 0. Coverage:42, accuracy: 0.9761904761904762
If [Sex=male, Pclass>=2:True, Age>=40.0:True, Sex=male] then 0. Coverage:41, accuracy: 0.926829268292683
If [Sex=male, Age>=25.0:False, Age>=20.5:True, Pclass>=2:True, Age>=24.0:False, Sex=male] then 0. Coverage:41, accuracy: 0.9512195121951219
If [Sex=male, Pclass>=2:True, Pclass>=3:False, Age>=31.0:False, Age>=16.0:True, Sex=male] then 0. Coverage:34, accuracy: 0.9705882352941176
If [Pclass>=3:True, Sex=male, Age>=20.0:False, Age>=14.0:True, Sex=male] then 0. 
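
In the printed rules, a numeric condition `attr>=v:True` means `attr >= v`, while `attr>=v:False` means `attr < v` (see `Condition.__repr__` and `Condition.to_filter`). The first rule therefore reads: female, `Pclass < 2`, and `26 <= Age < 47`. The sketch below applies that reading to a small made-up table and recomputes coverage and accuracy. The helper `rule_mask` and the condition tuples are hypothetical stand-ins for the `Rule` and `Condition` classes; they build the boolean mask directly instead of going through `eval`.

```python
import pandas as pd

def rule_mask(df, conditions):
    """Boolean mask of the rows that match all conditions.

    Each condition is (attribute, value, true_false), mirroring Condition:
    None -> attribute == value, True -> attribute >= value,
    False -> attribute < value.
    """
    mask = pd.Series(True, index=df.index)
    for attribute, value, true_false in conditions:
        if true_false is None:
            mask &= df[attribute] == value
        elif true_false:
            mask &= df[attribute] >= value
        else:
            mask &= df[attribute] < value
    return mask

# Toy passengers, invented for illustration only
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 3, 1],
    "Sex":      ["female", "female", "female", "female", "male", "male"],
    "Age":      [30.0, 46.0, 50.0, 28.0, 30.0, 35.0],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# If [Sex=female, Pclass>=2:False, Age>=26.0:True, Age>=47.0:False] then 1
conditions = [("Sex", "female", None), ("Pclass", 2, False),
              ("Age", 26.0, True), ("Age", 47.0, False)]

covered = toy[rule_mask(toy, conditions)]
coverage = len(covered)                         # rows matched by the rule
accuracy = (covered["Survived"] == 1).mean()    # share of matched rows in class 1
print(coverage, accuracy)
```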

## 3. Coronavirus: symptoms and outcome

Coronavirus [dataset](https://drive.google.com/file/d/1uVd09ekR1ArLrA8qN-Xtu4l-FFbmetVy/view?usp=sharing) (preprocessed as outlined [here](rules_motivation.ipynb)).

In [53]:
data_file = "../../Datasets/covid_categorical_good.csv"

In [54]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")
data.columns

Index(['sex', 'age', 'diabetes', 'copd', 'asthma', 'imm_supr', 'hypertension',
       'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'outcome'],
      dtype='object')

The most accurate rules will have the class label "alive". There could be so many of them that, if we rank the rules by accuracy, we might never reach a rule with the class label "dead".

If we want to see which combinations of attributes lead to "dead", we can run the algorithm with only this class label and set a lower accuracy threshold.
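
A threshold as low as 0.3 can still be informative: when one class is rare, a rule's accuracy is better judged against the base rate of that class than against 0.5. The sketch below uses made-up counts (not the real dataset) to compare a hypothetical rule's accuracy with the base rate; the ratio of the two is commonly called lift.

```python
import pandas as pd

# Toy outcome column, invented for illustration: 90 "alive", 10 "dead"
outcome = pd.Series(["alive"] * 90 + ["dead"] * 10, name="outcome")

# Base rate of each class: the accuracy of a rule with no conditions at all
base_rate = outcome.value_counts(normalize=True)
print(base_rate)

# A hypothetical rule for "dead" with accuracy 0.3
rule_accuracy = 0.3
lift = rule_accuracy / base_rate["dead"]
print("lift over base rate:", lift)
```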

Remove the _age_ attribute and run your algorithm with the parameters shown below.

In [55]:
# We really want to learn first what makes covid deadly
class_labels = ["dead"]
data_categorical = data.drop(columns='age')
column_list = data_categorical.columns.to_numpy().tolist()
rules = learn_rules(column_list, data_categorical, class_labels, 30, 0.3)
for rule in rules[:20]:
    print(rule)


If [renal_chronic=yes, diabetes=yes, cardiovascular=yes, obesity=no, sex=male, imm_supr=no, hypertension=yes, asthma=no, renal_chronic=yes] then dead. Coverage:70, accuracy: 0.6571428571428571
If [renal_chronic=yes, diabetes=yes, obesity=no, copd=yes, tobacco=no, hypertension=yes, imm_supr=no, asthma=no, sex=female, tobacco=no] then dead. Coverage:31, accuracy: 0.6129032258064516
If [renal_chronic=yes, diabetes=yes, obesity=no, hypertension=yes, imm_supr=no, copd=yes, asthma=no, renal_chronic=yes] then dead. Coverage:30, accuracy: 0.5666666666666667
If [renal_chronic=yes, diabetes=yes, tobacco=no, copd=yes, sex=male, tobacco=no] then dead. Coverage:31, accuracy: 0.5806451612903226
If [renal_chronic=yes, diabetes=yes, obesity=no, hypertension=yes, imm_supr=no, sex=male, asthma=no, tobacco=no, tobacco=no] then dead. Coverage:658, accuracy: 0.48024316109422494
If [renal_chronic=yes, diabetes=yes, tobacco=no, cardiovascular=yes, sex=male, obesity=yes, hypertension=yes, tobacco=no] then dea

Now try on both classes and on the entire dataset, including _age_. Collect the top 20 most accurate rules.

In [59]:
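# This may take some time to run (took 12 min on my computer - what about your implementation?)
# Note: column_list was built from data_categorical above, so it does not contain 'age'
# and age is never tried as a condition here, even though data still has that column.
rules = learn_rules(column_list, data, ["dead","alive"], 30, 0.4)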
for rule in rules[:20]:
    print(rule)

If [hypertension=no, sex=female, diabetes=no, tobacco=yes, obesity=no, asthma=yes, copd=no, tobacco=yes] then alive. Coverage:88, accuracy: 0.9886363636363636
If [hypertension=no, sex=female, diabetes=no, tobacco=yes, obesity=no, copd=no, cardiovascular=yes] then alive. Coverage:35, accuracy: 1.0
If [hypertension=no, sex=female, diabetes=no, tobacco=yes, obesity=no, copd=no, imm_supr=no, renal_chronic=no, tobacco=yes] then alive. Coverage:2317, accuracy: 0.9762624082865775
If [hypertension=no, sex=female, diabetes=no, asthma=yes, obesity=no, imm_supr=no, copd=no, cardiovascular=no, tobacco=no] then alive. Coverage:1686, accuracy: 0.9673784104389087
If [hypertension=no, sex=female, diabetes=no, obesity=no, copd=no, imm_supr=no, renal_chronic=no, cardiovascular=no, tobacco=no] then alive. Coverage:54563, accuracy: 0.9620255484485823
If [hypertension=no, asthma=yes, diabetes=no, copd=no, imm_supr=no, sex=female, tobacco=no, obesity=yes, tobacco=no] then alive. Coverage:531, accuracy: 0.95
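
Note that `rules[:20]` returns the first 20 rules in the order they were learned, which is not necessarily the order of decreasing accuracy. One way to collect the most accurate rules is to sort the list first. The sketch below uses a hypothetical `LearnedRule` stand-in with the same `accuracy` and `coverage` attributes as `Rule`, filled with made-up values; ties in accuracy are broken by larger coverage, which is an assumption about what "top" should mean.

```python
from dataclasses import dataclass

@dataclass
class LearnedRule:
    # Hypothetical stand-in for Rule: only the fields needed for ranking
    description: str
    accuracy: float
    coverage: int

def top_rules(rules, n=20):
    """Return the n most accurate rules; break ties by larger coverage."""
    return sorted(rules, key=lambda r: (r.accuracy, r.coverage), reverse=True)[:n]

# Made-up rules for illustration only
learned = [
    LearnedRule("rule A", 0.95, 500),
    LearnedRule("rule B", 1.00, 35),
    LearnedRule("rule C", 0.95, 1700),
]

for rule in top_rules(learned, n=2):
    print(rule)
```

Because `Rule` exposes the same two attributes, `top_rules(rules)` should work on the learned list directly.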

## 4. Discussion

Write here a discussion about the rules that you have learned from both datasets. 

Did any of these rules surprise you?

Do you have a meaningful logical explanation for these rules?

What additional research is needed to understand the meaning of your findings?


   Overall, the rules were not surprising. What was surprising was the lack of a "dead" rule for smokers: none of the "dead" rules shown above includes `tobacco=yes`. It appears that smoking by itself does not greatly increase the chance of dying with COVID-19, at least in this population. Several of the rules for the "alive" class include `tobacco=yes`.
  
   
   For the Titanic dataset, it makes sense that women would be saved first, as that is what is said to have happened: the two most accurate survival rules both start with `Sex=female`, and every rule shown for not surviving includes `Sex=male`. For COVID-19, the comorbidities that appear in our rules are health conditions that plausibly affect survival. For example, every "alive" rule shown above starts with `hypertension=no`, so not having hypertension is strongly associated with survival, although it does not by itself guarantee it.
   
   The additional research that would settle this is, unfortunately, randomized controlled trials in which people with various conditions are given COVID-19 in order to find the risk ratios for those conditions. This would be unethical and should never happen. Barring that, more publicly available data is needed.
   
In any case, the algorithm works *very* fast. It took a mere minute to find the rules it did.

Copyright &copy; 2022 Marina Barsky. All rights reserved.