# Rules of survival

### Mini-project

In this small project you will use the PRISM Rule Learner algorithm to learn some rules about COVID-19 comorbidity factors. Write as much about your findings as possible. You may add external information or additional datasets for extra credit.

## 1. Algorithm

Copy your implementation of the correct and tested algorithm in the cell below. You do not need to supply any comments or explanations. 

In [219]:
import pandas as pd
import numpy as np

class Rule:
    def __init__(self, class_label):
        self.conditions = []  # list of conditions
        self.class_label = class_label  # rule class
        self.accuracy = 0
        self.coverage = 0
        
    def add_condition(self, condition):
        self.conditions.append(condition)

    def set_params(self, accuracy, coverage):
        self.accuracy = accuracy
        self.coverage = coverage
        
    def to_filter(self):
        result = ""
        for cond in self.conditions:
            result += cond.to_filter() + " & "
        result += "(current_data[columns[-1]] == class_label)"
        return result
    
    def to_filter_no_class(self):
        result = ""
        for cond in self.conditions:
            result += cond.to_filter() + " & "
        result += "True"
        return result
    
    # Human-readable printing of this Rule
    def __repr__(self):
        return "If {} then {}. Coverage:{}, accuracy: {}".format(self.conditions, self.class_label,
                                                                 self.coverage, self.accuracy)

    
    
class Condition:
    def __init__(self, attribute, value, true_false = None):
        self.attribute = attribute
        self.value = value
        self.true_false = true_false
        
    def to_filter(self):
        # Build a boolean-mask expression (as a string) over a DataFrame named current_data
        if self.true_false is None:
            # categorical attribute: equality test against a string value
            return '(current_data["' + self.attribute + '"]' + "==" + '"' + self.value + '")'
        elif self.true_false:
            # numeric attribute, upper branch: attribute >= value
            return '(current_data["' + self.attribute + '"]' + ">=" + str(self.value) + ")"
        else:
            # numeric attribute, lower branch: attribute < value
            return '(current_data["' + self.attribute + '"]' + "<" + str(self.value) + ")"

    def __repr__(self):
        if self.true_false is None:
            return "{}={}".format(self.attribute, self.value)
        else:
            return "{}>={}:{}".format(self.attribute, self.value, self.true_false)

        
        
def learn_one_rule(columns, data, class_label, prev_rule=None, min_coverage = 30, min_accuracy = 0):
    current_data = data.copy()

    current_rule = prev_rule
    current_accuracy = 0
    current_coverage = 0
    covered_subset = None
    
    if current_rule is not None:
        current_accuracy = current_rule.accuracy
        current_coverage = current_rule.coverage
    
    best_col = None
    best_val = None
    true_false = None

    # The mask for the existing rule does not depend on col or val, so compute it once
    rule_filter = np.ones(len(current_data), dtype=bool)
    if current_rule is not None:
        rule_filter = eval(current_rule.to_filter_no_class())

    for col in columns[:-1]:

        unique_vals = current_data[col].unique().tolist()

        for val in unique_vals:
            
            if isinstance(val, (int, float)):
                
                bigger_subset = current_data[(current_data[col] >= val) & rule_filter]
                smaller_subset = current_data[(current_data[col] < val) & rule_filter]
                    
                bigger_tot = len(bigger_subset[columns[-1]])
                smaller_tot = len(smaller_subset[columns[-1]])
                
                bigger_cov = len(bigger_subset[bigger_subset[columns[-1]] == class_label])
                smaller_cov = len(smaller_subset[smaller_subset[columns[-1]] == class_label])
                
                if bigger_tot == 0 or smaller_tot == 0:
                    continue
                
                bigger_acc = bigger_cov/bigger_tot
                smaller_acc = smaller_cov/smaller_tot
                    
                # prefer the more accurate side; on a tie, prefer the side covering more class instances
                choose_bigger = bigger_acc > smaller_acc or (
                    bigger_acc == smaller_acc and bigger_cov > smaller_cov)
                    
                if choose_bigger:
                    if (bigger_acc >= current_accuracy and bigger_cov >= min_coverage):
                        best_col = col
                        best_val = val
                        current_accuracy = bigger_acc
                        current_coverage = bigger_cov
                        true_false = True
                else:
                    if (smaller_acc >= current_accuracy and smaller_cov >= min_coverage):
                        best_col = col
                        best_val = val
                        current_accuracy = smaller_acc
                        current_coverage = smaller_cov
                        true_false = False
                        
                
              
            else:
                curr_subset = current_data[(current_data[col] == val) & rule_filter]
                total = len(curr_subset[columns[-1]])
                if total == 0:
                    continue

                curr_cov = len(curr_subset[curr_subset[columns[-1]] == class_label])
                curr_acc = curr_cov/total

                if curr_acc >= current_accuracy and curr_cov >= min_coverage:
                    best_col = col
                    best_val = val
                    current_accuracy = curr_acc
                    current_coverage = curr_cov
                    true_false = None
                else:
                    continue

    if best_col is not None:

        if current_rule is None:
            current_rule = Rule(class_label)

        condition = Condition(best_col, best_val, true_false)
        current_rule.add_condition(condition)
        current_rule.set_params(current_accuracy, current_coverage)

        rule_filter = eval(current_rule.to_filter_no_class())
        covered_subset = current_data[rule_filter]
    
    return (current_rule, covered_subset)




def learn_rules (columns, data, classes=None, 
                 min_coverage = 30, min_accuracy = 0.6):
    rules = []
    if classes is not None:
        class_labels = classes
    else:
        class_labels = data[columns[-1]].unique().tolist()

    current_data = data.copy()
    
    for class_label in class_labels:
        while len(current_data) > min_coverage:
            # Learn a rule with a single condition
            
            (rule, current_subset) = learn_one_rule(columns, current_data, class_label, None, min_coverage)
            # If the best rule does not pass the coverage threshold - we are done with this class
            if rule is None:
                break

            # If we get the rule with coverage above threshold
            # We try to refine this rule
            if rule is not None:
                # try to improve the rule using the same learn_one_rule and passing existing rule as parameter
                # here need another loop which stops when accuracy is not improving
                prev_acc=0
                new_acc=rule.accuracy
                while(new_acc < min_accuracy and new_acc > prev_acc):
                    (rule, current_subset) = learn_one_rule(columns, current_subset, class_label, rule, min_coverage)
                    prev_acc = new_acc
                    new_acc = rule.accuracy
                # done with this rule
                if rule.accuracy >= min_accuracy:
                    rules.append(rule)
                    current_data = current_data.drop(current_data[eval(rule.to_filter_no_class())].index)
                else:
                    break
                
    return rules
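
The `to_filter` methods above build a string that is later passed to `eval`, which only works where a DataFrame named `current_data` is in scope. As a minimal sketch of an alternative, assuming each condition is stored as a plain `(attribute, value, true_false)` tuple, the same mask can be built directly. The helpers `condition_mask` and `rule_stats` and the toy DataFrame are hypothetical and not part of the implementation above.

```python
import pandas as pd

def condition_mask(df, attribute, value, true_false=None):
    """Boolean mask for one condition, mirroring Condition.to_filter()."""
    if true_false is None:          # categorical: attribute == value
        return df[attribute] == value
    if true_false:                  # numeric: attribute >= value
        return df[attribute] >= value
    return df[attribute] < value    # numeric: attribute < value

def rule_stats(df, conditions, class_col, class_label):
    """Return (coverage, accuracy) with the same meaning as in learn_one_rule:
    coverage counts the covered rows that carry the rule's class label."""
    mask = pd.Series(True, index=df.index)
    for attribute, value, true_false in conditions:
        mask &= condition_mask(df, attribute, value, true_false)
    covered = df[mask]
    if len(covered) == 0:
        return 0, 0.0
    correct = int((covered[class_col] == class_label).sum())
    return correct, correct / len(covered)

# Made-up toy data, for illustration only (not the Titanic file)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female"],
    "Age": [30.0, 60.0, 60.0, 20.0, 10.0],
    "Survived": [1, 0, 0, 1, 1],
})

coverage, accuracy = rule_stats(toy, [("Sex", "female", None)], "Survived", 1)
print(coverage, accuracy)
```

Keeping the mask as data rather than as a string avoids the dependency on a specific variable name and makes each condition easy to test on its own.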

## 2. Titanic dataset: the rules of survival

Our very familiar Titanic [dataset](https://docs.google.com/spreadsheets/d/1QGNxqRU02eAvTGih1t0cErB5R05mdOdUBgJZACGcuvs/edit?usp=sharing).

In [206]:
data_file = "../../Datasets/titanic.csv"

In [207]:
data = pd.read_csv(data_file)

# take a subset of attributes
data = data[['Pclass', 'Sex', 'Age', 'Survived']]

# drop all rows with missing values
data = data.dropna(how="any")
print("Total rows", len(data))

column_list = data.columns.to_numpy().tolist()
print("Columns:", column_list)

Total rows 714
Columns: ['Pclass', 'Sex', 'Age', 'Survived']


In [210]:
# we can set different accuracy thresholds
# here we can reorder class labels - to first learn the rules with class label "survived".
rules = learn_rules(column_list, data, [1,0], 30, 0.7)
for rule in rules[:10]:
    print(rule)

If [Sex=female] then 1. Coverage:197, accuracy: 0.7547892720306514
If [Age>=54.0:True] then 0. Coverage:33, accuracy: 0.8918918918918919
If [Pclass>=3:True] then 0. Coverage:209, accuracy: 0.8461538461538461
If [Pclass>=2:True] then 0. Coverage:76, accuracy: 0.8444444444444444
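
Because `learn_rules` drops the rows covered by each accepted rule before learning the next one, these rules are best read in order, as a decision list: the `Pclass>=2` rule was learned only on passengers not already covered by the three rules before it. Below is a minimal sketch of applying the four rules in that order. The tuple encoding and the `classify` helper are hypothetical and not part of the assignment code.

```python
# Rules copied from the output above, in the order they were learned.
# Each rule: (attribute, operator, value, predicted_class)
TITANIC_RULES = [
    ("Sex", "==", "female", 1),
    ("Age", ">=", 54.0, 0),
    ("Pclass", ">=", 3, 0),
    ("Pclass", ">=", 2, 0),
]

def classify(passenger, rules=TITANIC_RULES, default=None):
    """Return the class of the first rule that matches, or default if none does."""
    for attribute, op, value, label in rules:
        actual = passenger[attribute]
        matched = (actual == value) if op == "==" else (actual >= value)
        if matched:
            return label
    return default

# A first-class man aged 30 matches none of the four rules
print(classify({"Pclass": 1, "Sex": "male", "Age": 30.0}))
```

Passengers that fall through every rule get `default`, which reflects that the learner found no rule above the thresholds for the remaining data.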


## 3. Coronavirus: symptoms and outcome

Coronavirus [dataset](https://drive.google.com/file/d/1uVd09ekR1ArLrA8qN-Xtu4l-FFbmetVy/view?usp=sharing) (preprocessed as outlined [here](rules_motivation.ipynb)).

In [211]:
data_file = "../../Datasets/covid_categorical_good.csv"

In [212]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")
data.columns

Index(['sex', 'age', 'diabetes', 'copd', 'asthma', 'imm_supr', 'hypertension',
       'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'outcome'],
      dtype='object')

The most accurate rules will have the class label "alive". There could be too many of them, and we might never get to the class label "dead" if we rank rules by accuracy.

If we want to see which combinations of attributes lead to "dead", we can run the algorithm with only this class label and set a lower accuracy threshold.

Remove the _age_ attribute and run your algorithm with parameters shown below.

In [213]:
# We really want to learn first what makes covid deadly
class_labels = ["dead"]
data_categorical = data.drop(columns='age')
column_list = data_categorical.columns.to_numpy().tolist()
rules = learn_rules(column_list, data_categorical, class_labels, 30, 0.6)
for rule in rules[:20]:
    print(rule)


If [renal_chronic=yes, diabetes=yes, cardiovascular=yes, obesity=no, sex=male, imm_supr=no] then dead. Coverage:54, accuracy: 0.627906976744186


Now try both classes on the entire dataset, including _age_. Collect the top 20 most accurate rules.

In [228]:
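# This may take some time to run (took 12 min on my computer - what about your implementation?)
# Note: column_list was built from data_categorical, so 'age' is not among the candidate attributes in this run
rules = learn_rules(column_list, data, None, 30, 0.6)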
for rule in rules[:20]:
    print(rule)

If [hypertension=no] then alive. Coverage:159673, accuracy: 0.9118543984283984
If [asthma=yes] then alive. Coverage:1176, accuracy: 0.8127159640635798
If [diabetes=no] then alive. Coverage:18979, accuracy: 0.788655724080615
If [sex=female] then alive. Coverage:6114, accuracy: 0.7080486392588303
If [obesity=yes] then alive. Coverage:1937, accuracy: 0.6834862385321101
If [renal_chronic=no] then alive. Coverage:4096, accuracy: 0.6595813204508857
If [cardiovascular=yes] then dead. Coverage:49, accuracy: 0.6282051282051282
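
The loop above prints the first 20 rules in the order they were learned, which is not necessarily the order of accuracy (the Titanic output earlier is not sorted). Below is a minimal sketch of ranking rules explicitly, assuming each rule exposes `accuracy` and `coverage` attributes as the `Rule` class does. `LearnedRule` and `top_rules` are hypothetical names used only for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the Rule class; any object with .accuracy and .coverage works
LearnedRule = namedtuple("LearnedRule", ["conditions", "class_label", "coverage", "accuracy"])

def top_rules(rules, k=20):
    """Return the k most accurate rules; ties go to the rule with larger coverage."""
    return sorted(rules, key=lambda r: (r.accuracy, r.coverage), reverse=True)[:k]

# Three rules from the output above (accuracy rounded to 4 places), deliberately out of order
learned = [
    LearnedRule("cardiovascular=yes", "dead", 49, 0.6282),
    LearnedRule("hypertension=no", "alive", 159673, 0.9119),
    LearnedRule("asthma=yes", "alive", 1176, 0.8127),
]

for rule in top_rules(learned, k=2):
    print(rule.conditions, rule.class_label, rule.accuracy)
```

With the real list, `top_rules(rules)` could replace `rules[:20]` in the loop above.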


## 4. Discussion

Write here a discussion about the rules that you have learned from both datasets. 

Did any of these rules surprise you?

Do you have a meaningful logical explanation for these rules?

What additional research is needed to understand the meaning of your findings?


Overall, the rules were not surprising. What did surprise me was the absence of any rule involving `tobacco`. In this dataset, smoking alone does not appear to raise the chance of dying with COVID-19 enough to produce a rule above the coverage and accuracy thresholds.
   
For the Titanic dataset, it makes sense that women would be saved first, as that is what is reported to have happened. For COVID-19, the rules involve comorbidities that could plausibly affect survival. For example, the rule `hypertension=no` predicts "alive" with about 91% accuracy, so most patients without hypertension in this dataset survived.
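
To put a number on the `hypertension=no` rule: the printed `Coverage` is the count of covered rows that carry the rule's class (see `curr_cov` in `learn_one_rule`), not the total number of rows matching the conditions. Below is a minimal sketch of recovering the total and the number of exceptions from the printed values. `covered_total` is a hypothetical helper.

```python
def covered_total(coverage, accuracy):
    """accuracy = coverage / total, so total = coverage / accuracy."""
    return round(coverage / accuracy)

# Values printed for "If [hypertension=no] then alive"
coverage = 159673
accuracy = 0.9118543984283984

total = covered_total(coverage, accuracy)
exceptions = total - coverage   # covered rows whose outcome is not "alive"
print(total, exceptions, exceptions / total)
```

This gives 175108 matching rows, of which 15435 (roughly 9%) did not survive, so the rule describes a strong tendency rather than a guarantee.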
   
Ideally, additional research would use randomized controlled trials in which people with various conditions are exposed to COVID-19, in order to measure the risk ratio for each condition. That would be unethical and should never happen. Barring that, more publicly available data is needed.
   
Regarding the rules above, I could not get the algorithm to generate more than 7 rules for the COVID-19 dataset. There may be a bug in my code, but it works on the test case from the notes, and the rules it does produce are correct. In any case, it runs *very* fast: it took mere seconds to find these rules.

Copyright &copy; 2022 Marina Barsky. All rights reserved.