# Rules of survival

### Mini-project

In this small project you will use the PRISM rule learner algorithm to learn rules about COVID-19 comorbidity factors. Write as much about your findings as possible. You may add external information or additional datasets for extra credit.

## 1. Algorithm

Copy your implementation of the correct and tested algorithm in the cell below. You do not need to supply any comments or explanations. 

In [13]:
import pandas as pd
import numpy as np
class Rule:
    def __init__(self, class_label):
        self.conditions = []  # list of conditions
        self.class_label = class_label  # rule class
        self.accuracy = 0
        self.coverage = 0

    def addCondition(self, condition):
        self.conditions.append(condition)

    def setParams(self, accuracy, coverage):
        self.accuracy = accuracy
        self.coverage = coverage
    
    # Human-readable printing of this Rule
    def __repr__(self):
        return "If {} then {}. Coverage:{}, accuracy: {}".format(self.conditions, self.class_label,
                                                                 self.coverage, self.accuracy)
class Condition:
    def __init__(self, attribute, value, true_false = None):
        self.attribute = attribute
        self.value = value
        self.true_false = true_false

    def __repr__(self):
        if self.true_false is None:
            return "{}={}".format(self.attribute, self.value)
        else:
            return "{}>={}:{}".format(self.attribute, self.value, self.true_false)
        
        

def learn_one_rule(columns, data, class_label, min_coverage=0, min_accuracy=0.6):
    if len(data) == 0:
        # the caller unpacks two values, so return a pair even when there is nothing to learn
        return (None, None)
    covered_subset = data.copy()
    data2 = data.copy()
    current_rule = Rule(class_label)
    done = False
    # filter out data with right labels
    classheader = data2.columns[-1]
    current_data = data2[data2[classheader] == class_label]
    current_accuracy = 0
    current_coverage = 0
    current_condition = Condition(None,None)
    while not done:
        for i in range(len(current_data.columns) - 1):
            # make sure I am not repeatedly testing on the same criteria
            flag = False
            current_attribute = current_data.columns[i]
            for condition in current_rule.conditions:
                myattribute = condition.attribute 
                if current_attribute == myattribute:
                    flag = True
            if not flag:
            
                # test every possible value of the current attribute
                # note: a numeric attribute would need two candidate conditions per value
                # (greater than or equal to, and less than); only equality tests are generated here
                possible_values = current_data[current_attribute].unique().tolist()
                for value in possible_values:
                    #compute accuracy
                    correct = current_data[current_attribute].value_counts()[value]
                    coverage = data2[current_attribute].value_counts()[value]
                    accuracy = correct/coverage
                    #choose the best option based on accuracy, and update accordingly
                    if coverage >= min_coverage:
                        if accuracy > current_accuracy:
                            current_coverage = coverage
                            current_accuracy = accuracy
                            current_condition= Condition(current_attribute, value) 
                            covered_subset = data2[data2[current_attribute] == value] 
                        # if accuracy is the same, compare coverage
                        elif accuracy == current_accuracy and coverage > current_coverage:
                            current_coverage = coverage
                            current_accuracy = accuracy
                            current_condition= Condition(current_attribute, value) 
                            covered_subset = data2[data2[current_attribute] == value] 
        # perfect accuracy with enough coverage: add the condition to the rule and terminate
        if current_accuracy == 1.0 and current_coverage>min_coverage:
            done = True    
            current_rule.addCondition(current_condition)
            current_rule.accuracy = current_accuracy
            current_rule.coverage = current_coverage
        # coverage is down to the minimum but accuracy is acceptable: add the condition and terminate
        elif current_coverage <= min_coverage and current_accuracy > min_accuracy: 
            done = True
            current_rule.addCondition(current_condition)
            current_rule.accuracy = current_accuracy
            current_rule.coverage = current_coverage
        #if no rule possible, return none
        elif current_coverage <= min_coverage and current_accuracy <= min_accuracy:
            done = True
            return (None,None)
        #default: not done yet, continue
        else: 
            current_rule.addCondition(current_condition)
            data2 = covered_subset    
            current_data = data2[data2[classheader] == class_label]
            current_rule.accuracy = current_accuracy
            current_rule.coverage = current_coverage
            #reset
            current_accuracy = 0
            current_coverage = 0  
    return (current_rule, covered_subset)

def learn_rules (columns, data, classes=None, 
                 min_coverage = 30, min_accuracy = 0.6):
    # List of final rules
    rules = []
    
    # If list of classes of interest is not provided - it is extracted from the last column of data
    if classes is not None:
        class_labels = classes
    else:
        class_labels = data[columns[-1]].unique().tolist()

    current_data = data.copy()
    
    # This follows the logic of the original PRISM algorithm
    # It processes each class in turn. 
    for class_label in class_labels:
        done = False
        while len(current_data) > min_coverage and not done:
            # Learn one rule 
            rule, subset = learn_one_rule(columns, current_data, class_label, min_coverage, min_accuracy)
            
            # If the best rule does not pass the coverage threshold - we are done with this class
            if rule is None:
                break

            # If we get the rule with accuracy and coverage above threshold
            
            if rule.accuracy >= min_accuracy:
                rules.append(rule)
                # remove every instance covered by the new rule in a single call
                current_data = current_data.drop(index=subset.index).dropna()
                   
            else:
                done = True 

                    
    return rules
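
The inner loop of `learn_one_rule` scores every candidate `attribute=value` condition by its coverage (how many instances it matches) and its accuracy (the fraction of those matches that carry the target class label). A minimal sketch of that scoring step on a made-up table is shown here; the `toy` frame and the `score_condition` helper are hypothetical and are not part of the assignment code.

```python
import pandas as pd

# Hypothetical toy data; the last column is the class label.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": ["first", "third", "first", "third", "third"],
    "Survived": [1, 1, 1, 0, 0],
})


def score_condition(data, attribute, value, class_label):
    """Return (accuracy, coverage) of the condition attribute == value."""
    class_column = data.columns[-1]
    covered = data[data[attribute] == value]
    coverage = len(covered)
    if coverage == 0:
        return 0.0, 0
    # instances that match the condition AND have the target class
    correct = int((covered[class_column] == class_label).sum())
    return correct / coverage, coverage


female_accuracy, female_coverage = score_condition(toy, "Sex", "female", 1)
third_accuracy, third_coverage = score_condition(toy, "Pclass", "third", 1)
print(female_accuracy, female_coverage)
print(third_accuracy, third_coverage)
```

On this toy table `Sex=female` covers two instances, both of class 1, while `Pclass=third` covers three instances of which only one is of class 1, so the first condition would be preferred.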


## 2. Titanic dataset: the rules of survival

Our very familiar Titanic [dataset](https://docs.google.com/spreadsheets/d/1QGNxqRU02eAvTGih1t0cErB5R05mdOdUBgJZACGcuvs/edit?usp=sharing).

In [2]:
data_file = "titanic.csv"

In [3]:
data = pd.read_csv(data_file)

# take a subset of attributes
data = data[['Pclass', 'Sex', 'Age', 'Survived']]

# drop all rows with missing values
data = data.dropna(how="any")
print("Total rows", len(data))

column_list = data.columns.to_numpy().tolist()
print("Columns:", column_list)
conditions = [ data['Pclass'].eq(1), data['Pclass'].eq(2), data['Pclass'].eq(3)]
choices = ["first","second","third"]
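# a string default keeps the result a pure string array (the implicit default is the integer 0)
data['Pclass'] = np.select(conditions, choices, default="unknown")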
bins= [0,1,18,30,60,110]
labels = ['error','kid','younster','adult','old']
data['Age'] = pd.cut(data['Age'], bins=bins, labels=labels, right=False)


Total rows 714
Columns: ['Pclass', 'Sex', 'Age', 'Survived']
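
Because `pd.cut` is called with `right=False`, every bin is closed on the left and open on the right. The sketch below uses a handful of made-up ages (the `ages` series is an assumption, not taken from the dataset) to show where boundary values land.

```python
import pandas as pd

bins = [0, 1, 18, 30, 60, 110]
labels = ['error', 'kid', 'younster', 'adult', 'old']

# Made-up ages chosen to sit on or near the bin edges.
ages = pd.Series([0.5, 1, 17.9, 18, 30, 59, 60])

# right=False gives the intervals [0, 1), [1, 18), [18, 30), [30, 60), [60, 110)
binned = pd.cut(ages, bins=bins, labels=labels, right=False)
print(binned.tolist())
```

Note that any age below 1 is labelled `error`. If the data records infants with fractional ages, they will end up in that bin, so it is worth checking how many rows carry this label before interpreting the rules.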


In [14]:
# We can set different accuracy thresholds.
# Passing [1, 0] reorders the class labels so that rules for class 1 (survived) are learned first.
print(data)
rules = learn_rules(column_list, data, [1,0], 30, 0.7)
for rule in rules[:10]:
    print(rule)

     Pclass     Sex       Age  Survived
0     third    male  younster         0
1     first  female     adult         1
2     third  female  younster         1
3     first  female     adult         1
4     third    male     adult         0
..      ...     ...       ...       ...
885   third  female     adult         0
886  second    male  younster         0
887   first  female  younster         1
889   first    male  younster         1
890   third    male     adult         0

[714 rows x 4 columns]


## 3. Coronavirus: symptoms and outcome

Coronavirus [dataset](https://drive.google.com/file/d/1uVd09ekR1ArLrA8qN-Xtu4l-FFbmetVy/view?usp=sharing) (preprocessed as outlined [here](rules_motivation.ipynb)).

In [None]:
data_file = "../data_sets/covid_categorical_good.csv"

In [None]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")
data.columns

FileNotFoundError: [Errno 2] No such file or directory: '../data_sets/covid_categorical_good.csv'

The most accurate rules will have the class label "alive". There could be too many of them, and we might never get to the class label "dead" if we rank the rules by accuracy.

If we want to see which combinations of attributes lead to "dead", we can run the algorithm with only this class label and set a lower accuracy threshold.

Remove the _age_ attribute and run your algorithm with the parameters shown below.
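
A minimal sketch of that preparation step, assuming the column is literally named `age` and the class label is the last column. The tiny frame is made up to keep the example self-contained, and its column names are hypothetical; the result is bound to `data_categorical` and `column_list`, the names used in the next cell.

```python
import pandas as pd

# Hypothetical stand-in for the COVID table; the real column names may differ.
data = pd.DataFrame({
    "age": ["60-69", "30-39", "70-79"],
    "diabetes": ["yes", "no", "yes"],
    "outcome": ["dead", "alive", "dead"],
})

# Drop the age attribute; the class label stays in the last column.
data_categorical = data.drop(columns=["age"])
column_list = data_categorical.columns.tolist()
print(column_list)
```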

In [None]:
# We really want to learn first what makes covid deadly
class_labels = ["dead"]
rules = learn_rules (column_list, data_categorical, class_labels, 30, 0.6)
for rule in rules[:20]:
    print(rule)

Now try both classes on the entire dataset, including _age_. Collect the top 20 most accurate rules.

In [None]:
# This may take some time to run (took 12 min on my computer - what about your implementation?)
column_list = data.columns.tolist()  # rebuild the column list from the full dataset, including age
rules = learn_rules(column_list, data, None, 30, 0.9)
for rule in rules[:20]:
    print(rule)
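
`learn_rules` appends rules in the order in which they are learned, so `rules[:20]` is not necessarily the 20 most accurate ones. One possible way to rank them is sketched below. `SimpleRule` is a hypothetical stand-in that only mimics the `accuracy` and `coverage` attributes of `Rule`, and breaking ties by coverage is an assumption, not a requirement of the assignment.

```python
from dataclasses import dataclass


@dataclass
class SimpleRule:
    # Hypothetical stand-in for Rule, with the two attributes used for ranking.
    name: str
    accuracy: float
    coverage: int


learned = [
    SimpleRule("a", 0.92, 40),
    SimpleRule("b", 1.0, 31),
    SimpleRule("c", 0.92, 120),
]

# Highest accuracy first; among equally accurate rules, larger coverage first.
top = sorted(learned, key=lambda r: (r.accuracy, r.coverage), reverse=True)[:20]
print([r.name for r in top])
```

The same `sorted` call should work on the list returned by `learn_rules`, since `Rule` objects expose `accuracy` and `coverage` as well.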

## 4. Discussion

Write here a discussion about the rules that you have learned from both datasets. 

Did any of these rules surprise you?

Do you have a meaningful logical explanation for these rules?

What additional research is needed to understand the meaning of your findings?

Copyright &copy; 2022 Marina Barsky. All rights reserved.