# Rules of survival

### Mini-project

In this small project you will use the PRISM Rule Learner algorithm to learn some rules about COVID-19 comorbidity factors. Write as much about your findings as possible. You may add external information or additional datasets for extra credit.

## 1. Algorithm

Copy your implementation of the correct and tested algorithm in the cell below. You do not need to supply any comments or explanations. 

In [3]:
import pandas as pd
import numpy as np
class Rule:
    def __init__(self, class_label):
        self.conditions = []  # list of conditions
        self.class_label = class_label  # rule class
        self.accuracy = 0
        self.coverage = 0

    def addCondition(self, condition):
        self.conditions.append(condition)

    def setParams(self, accuracy, coverage):
        self.accuracy = accuracy
        self.coverage = coverage
    
    # Human-readable printing of this Rule
    def __repr__(self):
        return "If {} then {}. Coverage:{}, accuracy: {}".format(self.conditions, self.class_label,
                                                                 self.coverage, self.accuracy)
class Condition:
    def __init__(self, attribute, value, true_false = None):
        self.attribute = attribute
        self.value = value
        self.true_false = true_false

    def __repr__(self):
        if self.true_false is None:
            return "{}={}".format(self.attribute, self.value)
        else:
            return "{}>={}:{}".format(self.attribute, self.value, self.true_false)
        
        
def learn_one_rule(column, data, class_label, min_coverage=0, min_accuracy=0.6):
    if len(data)==0:
        # callers unpack the result as (rule, subset), so return a pair
        return (None, None)
    covered_subset = data.copy()
    data2 = data.copy()
    current_rule = Rule(class_label)
    done = False
    # filter out data with right labels
    classheader = data2.columns[-1]
    current_data = data2[data2[classheader] == class_label]

    columns = column.copy()
    columns.pop()
    flag = False
    while not done:
        current_accuracy = 0
        current_coverage = 0
        current_condition = Condition(None,None)
        index = 0
        for i in range(len(columns)):
            # make sure I am not repeatedly testing on the same criteria
            current_attribute = columns[i]        
        
            #test every possible value for the current attribute
            #check if it is numerical, get unique values, for each value, there is two rules: greater or equal to, or less than
            possible_values = current_data[current_attribute].unique().tolist()
            if isinstance(possible_values[0], int) or isinstance(possible_values[0],float):
                possible_values.sort()
            for value in possible_values:
                if isinstance(value, int) or isinstance(value,float):
                    mycolumn1 = current_data[current_attribute]
                    correct1 = mycolumn1[mycolumn1 >= value].count()
                    mycolumn2 = data2[current_attribute]
                    coverage1 = mycolumn2[mycolumn2 >= value].count()
                    if coverage1 == 0:
                        accuracy1 = 0
                    else:
                        accuracy1 = correct1/coverage1
                    mycolumn3 = current_data[current_attribute]
                    correct = mycolumn3[mycolumn3 <value].count()
                    column4 = data2[current_attribute]
                    coverage = column4[column4 < value].count()
                    if coverage == 0:
                        accuracy = 0
                    else:
                        accuracy = correct/coverage
                    GreaterThan = False
                    if accuracy1 > accuracy:
                        GreaterThan = True
                        accuracy = accuracy1
                        coverage = coverage1
                    elif accuracy1 == accuracy:
                        if coverage1 > coverage: 
                            GreaterThan = True
                            coverage = coverage1
                    if coverage >= min_coverage:
                        if accuracy > current_accuracy:
                            index = i
                            current_coverage = coverage
                            current_accuracy = accuracy
                            current_condition= Condition(current_attribute, value, GreaterThan) 
                            if GreaterThan: 
                                covered_subset = data2[data2[current_attribute] >= value] 
                            else:
                                covered_subset = data2[data2[current_attribute] < value] 
                        # if accuracy is the same, compare coverage
                        elif accuracy == current_accuracy and coverage > current_coverage:
                            index = i
                            current_coverage = coverage
                            current_accuracy = accuracy
                            current_condition= Condition(current_attribute, value,GreaterThan) 
                            if GreaterThan: 
                                covered_subset = data2[data2[current_attribute] >= value] 
                            else:
                                covered_subset = data2[data2[current_attribute] < value]
                else:
                #compute accuracy
                    correct = current_data[current_attribute].value_counts()[value]
                    coverage = data2[current_attribute].value_counts()[value]
                    accuracy = correct/coverage
                    #choose the best option based on accuracy
                    if coverage >= min_coverage:
                        if accuracy > current_accuracy:
                            current_coverage = coverage
                            current_accuracy = accuracy
                            current_condition= Condition(current_attribute, value) 
                            covered_subset = data2[data2[current_attribute] == value] 
                            index = i
                        # if accuracy is the same, compare coverage
                        elif accuracy == current_accuracy and coverage > current_coverage:
                            current_coverage = coverage
                            current_accuracy = accuracy
                            current_condition= Condition(current_attribute, value) 
                            covered_subset = data2[data2[current_attribute] == value] 
                            index = i
                    #if reached to the end accuracy = 1.0, add to rule and terminate
        if current_accuracy == 1.0 and current_coverage > min_coverage:
            done = True
            current_rule.addCondition(current_condition)
            current_rule.accuracy = current_accuracy
            current_rule.coverage = current_coverage
        #if reached to the end that is acceptable, add to rule and terminate
        elif len(columns) == 0: 
            done = True  
            if current_rule.accuracy <= min_accuracy or current_rule.coverage < min_coverage:
                return (None, None)

        #if no rule possible, return none
        elif current_coverage < min_coverage:

            if current_rule.accuracy <= min_accuracy or current_rule.coverage < min_coverage:
                return (None, None)
            done = True                
        #default: not done yet, continue
        else: 
            columns.pop(index)
            current_rule.addCondition(current_condition)
            data2 = covered_subset    
            current_data = data2[data2[classheader] == class_label]
            current_rule.accuracy = current_accuracy
            current_rule.coverage = current_coverage
            #reset
    return (current_rule, covered_subset)

def learn_rules (columns, data, classes=None, 
                 min_coverage = 30, min_accuracy = 0.6, length = 20):
    # List of final rules
    rules = []
    
    # If list of classes of interest is not provided - it is extracted from the last column of data
    if classes is not None:
        class_labels = classes
    else:
        class_labels = data[columns[-1]].unique().tolist()
    i = 0
    current_data = data.copy()
    
    # This follows the logic of the original PRISM algorithm
    # It processes each class in turn. 
    for class_label in class_labels:
        done = False
        # stop early once `length` rules have been learned, to save time
        if i >= length:
            break
        while len(current_data) > min_coverage and not done:
            if i >= length:
                break
            # Learn one rule 
            rule, subset = learn_one_rule(columns, current_data, class_label, min_coverage, min_accuracy)
            
            # If the best rule does not pass the coverage threshold - we are done with this class
            if rule is None:
                break
            copylabel = class_labels.copy()
            copylabel.remove(class_label)
            for mylabel in copylabel:
                myrule, mysubset = learn_one_rule(columns, current_data, mylabel, min_coverage, min_accuracy)
                if myrule is not None and (myrule.accuracy > rule.accuracy or myrule.coverage > rule.coverage):
                    rule, subset = myrule, mysubset

            # If we get the rule with accuracy and coverage above threshold
            
            if rule.accuracy >= min_accuracy:
                rules.append(rule)
                print(rule)
                i = i+1
                # remove all rows covered by the rule in one call instead of row by row
                current_data = current_data.drop(index=subset.index).dropna()
                   
            else:
                done = True 

                    
    return rules


## 2. Titanic dataset: the rules of survival

Our very familiar Titanic [dataset](https://docs.google.com/spreadsheets/d/1QGNxqRU02eAvTGih1t0cErB5R05mdOdUBgJZACGcuvs/edit?usp=sharing).

In [4]:
data_file = "titanic.csv"

In [5]:
data = pd.read_csv(data_file)

# take a subset of attributes
data = data[['Pclass', 'Sex', 'Age', 'Survived']]

# drop all rows with missing values
data = data.dropna(how="any")
print("Total rows", len(data))

column_list = data.columns.to_numpy().tolist()
print("Columns:", column_list)
conditions = [ data['Pclass'].eq(1), data['Pclass'].eq(2), data['Pclass'].eq(3)]
choices = ["first","second","third"]
data['Pclass'] = np.select(conditions, choices)
bins= [0,1,18,30,110]
labels = ['error','kid','youngster','old']
#data['Age'] = pd.cut(data['Age'], bins=bins, labels=labels, right=False)


Total rows 714
Columns: ['Pclass', 'Sex', 'Age', 'Survived']


In [6]:
# we can set different accuracy thresholds
# here we can reorder class labels - to first learn the rules with class label "survived".
# classes
rules = learn_rules(column_list, data, [1,0], 30, 0.7,10)
for rule in rules[:10]:
    print(rule)

If [Sex=female, Pclass=first, Age>=26.0:True] then 1. Coverage:57, accuracy: 0.9824561403508771
If [Age>=55.5:True, Sex=male] then 0. Coverage:31, accuracy: 0.8709677419354839
If [Sex=male, Pclass=second, Age>=21.0:True] then 0. Coverage:74, accuracy: 0.9459459459459459
If [Pclass=second, Age>=28.0:True, Sex=female] then 1. Coverage:41, accuracy: 0.926829268292683
If [Age>=39.0:True, Pclass=third, Sex=male] then 0. Coverage:33, accuracy: 0.9090909090909091
If [Pclass=third, Age>=34.0:True] then 0. Coverage:35, accuracy: 0.8857142857142857
If [Sex=male, Pclass=third, Age>=25.0:False] then 0. Coverage:118, accuracy: 0.847457627118644
If [Age>=28.0:True, Pclass=third, Sex=male] then 0. Coverage:50, accuracy: 0.8
If [Age>=6.0:False] then 1. Coverage:31, accuracy: 0.8387096774193549


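From the results, it seems that it is great to be a wealthy woman: the rules for women in first and second class predict survival with accuracy above 0.9. It is almost always bad to be a man, since nearly all of the rules that predict death include Sex=male. Also, really young children (under 6) seem to have a better chance of survival, which is a good thing.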
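
One way to double-check rules like these is to look at the raw survival rate per group. The cell below is only a sketch: the `toy` rows are made up for illustration (they are not taken from `titanic.csv`), and `survival_table` is a hypothetical helper name, not part of the assignment code.

```python
import pandas as pd

# Made-up toy rows for illustration only; NOT taken from titanic.csv
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": ["first", "third", "first", "third", "third", "first"],
    "Survived": [1, 0, 0, 0, 1, 1],
})

def survival_table(df, by, target="Survived"):
    """Survival rate and group size for each combination of the `by` columns."""
    grouped = df.groupby(by)[target]
    return pd.DataFrame({"rate": grouped.mean(), "count": grouped.size()})

table = survival_table(toy, ["Sex", "Pclass"])
print(table)
```

Running the same helper on the real `data` frame should show whether the groups named in the rules really stand out, and how many passengers each group contains.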

## 3. Coronavirus: symptoms and outcome

Coronavirus [dataset](https://drive.google.com/file/d/1uVd09ekR1ArLrA8qN-Xtu4l-FFbmetVy/view?usp=sharing) (preprocessed as outlined [here](rules_motivation.ipynb)).

In [7]:
data_file = "covid_categorical_good.csv"

In [8]:
data = pd.read_csv(data_file)
data = data.dropna(how="any")
data.columns

Index(['sex', 'age', 'diabetes', 'copd', 'asthma', 'imm_supr', 'hypertension',
       'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'outcome'],
      dtype='object')

Most accurate rules will have class label "alive". There could be too many rules, and we might never get to the class label "dead" if we rank them by accuracy. 

If we want to see which combination of attributes leads to "dead", we might want to run the algorithm with only this class label and set a lower accuracy threshold.
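
To see why the threshold matters, it can help to compare a rule's accuracy with the base rate of its class. The sketch below assumes that coverage means the number of rows matching all conditions and accuracy means the share of those rows with the given class label; `rule_stats` is a hypothetical helper and the `toy` rows are made up.

```python
import pandas as pd

def rule_stats(df, conditions, class_label, target="outcome"):
    """Return (coverage, accuracy) of a rule given as {attribute: value}."""
    mask = pd.Series(True, index=df.index)
    for attribute, value in conditions.items():
        mask &= df[attribute] == value
    covered = df[mask]
    coverage = len(covered)
    accuracy = (covered[target] == class_label).mean() if coverage else 0.0
    return coverage, accuracy

# Made-up toy rows for illustration only
toy = pd.DataFrame({
    "diabetes": ["yes", "yes", "no", "no", "no", "yes"],
    "copd": ["yes", "no", "no", "no", "no", "yes"],
    "outcome": ["dead", "alive", "alive", "alive", "alive", "dead"],
})

# Share of "dead" rows overall: the accuracy of a rule with no conditions
base_rate = (toy["outcome"] == "dead").mean()
coverage, accuracy = rule_stats(toy, {"diabetes": "yes", "copd": "yes"}, "dead")
print(base_rate, coverage, accuracy)
```

When one class is rare, its base rate is low, so even a useful rule for that class may never reach the accuracy that rules for the common class get almost for free.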

Remove the _age_ attribute and run your algorithm with parameters shown below.

In [9]:
# We really want to learn first what makes covid deadly
data_categorical = data.copy()
data_categorical.drop('age', inplace=True, axis=1)
class_labels = ["dead"]
column_list = data_categorical.columns.to_numpy().tolist()
rules = learn_rules (column_list, data_categorical, class_labels, 30, 0.6)
for rule in rules[:20]:
    print(rule)

If [renal_chronic=yes, diabetes=yes, cardiovascular=yes, obesity=no, sex=male, imm_supr=no, hypertension=yes, asthma=no, tobacco=no, copd=no] then dead. Coverage:58, accuracy: 0.6206896551724138
If [renal_chronic=yes, diabetes=yes, obesity=no, copd=yes, asthma=no, hypertension=yes, imm_supr=no, sex=male] then dead. Coverage:30, accuracy: 0.6666666666666666


Now try on both classes and for the entire dataset including _age_. Collect top 20 most accurate rules.

In [10]:
#bins= [0,17,29,39,49,64,74,85,125]

#labels = ['kid','young adult','29-39','39-49','49-64', '64-74','74-85','85 and above']

#data['age'] = pd.cut(data['age'], bins=bins, labels=labels, right=False)
print(data)

           sex  age diabetes copd asthma imm_supr hypertension cardiovascular  \
0         male   27       no   no     no       no           no             no   
1         male   24       no   no     no       no           no             no   
2       female   54       no   no     no       no           no             no   
3         male   30       no   no     no       no           no             no   
4       female   60      yes   no     no       no          yes            yes   
...        ...  ...      ...  ...    ...      ...          ...            ...   
219174  female   88      yes   no     no       no          yes             no   
219175  female   30       no   no     no       no           no             no   
219176  female   27       no   no     no       no           no             no   
219177  female   36       no   no    yes       no           no             no   
219178    male   70       no   no     no       no           no             no   

       obesity renal_chroni

In [15]:
# This may take some time to run (took 12 min on my computer - what about your implementation?)

#Todo: either separate to more bins, or implement numerically
column_list = data.columns.to_numpy().tolist()
rules = learn_rules (column_list, data, None, 30, 0.9,20)
for rule in rules[:20]:
    print(rule)
    #took 3hours

In [None]:
storage = rules.copy()
mycoverage = 0
myrule = None
for rule in storage:
    if mycoverage < rule.coverage:
        mycoverage = rule.coverage
        myrule = rule
print(myrule)    

If [hypertension=no, diabetes=no, obesity=no, copd=no, renal_chronic=no, imm_supr=no, cardiovascular=no, sex=male, asthma=no, tobacco=no] then alive. Coverage:63252, accuracy: 0.9117814456459875
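
The loop above can probably be written more compactly with `max` and a key function. This is only a sketch: `ToyRule` is a hypothetical stand-in for the `Rule` class, and the numbers are illustrative values, not results.

```python
from collections import namedtuple

# Hypothetical stand-in for the notebook's Rule class (only the fields used here)
ToyRule = namedtuple("ToyRule", ["class_label", "coverage", "accuracy"])

# Illustrative values only
toy_rules = [
    ToyRule("alive", 87, 0.99),
    ToyRule("alive", 2317, 0.98),
    ToyRule("alive", 36, 1.0),
]

# Rule with the largest coverage; None if the list is empty
best_by_coverage = max(toy_rules, key=lambda r: r.coverage, default=None)

# Rules ranked by accuracy first, with coverage as the tie-breaker
ranked = sorted(toy_rules, key=lambda r: (r.accuracy, r.coverage), reverse=True)
print(best_by_coverage)
print(ranked[0])
```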


## 4. Discussion

Write here a discussion about the rules that you have learned from both datasets. 

Did any of these rules surprise you?

Do you have a meaningful logical explanation for these rules?

What additional research is needed to understand the meaning of your findings?

## Raw Results 
If [hypertension=no, sex=female, diabetes=no, tobacco=yes, obesity=no, asthma=yes, copd=no, imm_supr=no, renal_chronic=no, cardiovascular=no] then alive. Coverage:87, accuracy: 0.9885057471264368 
If [hypertension=no, sex=female, diabetes=no, tobacco=yes, obesity=no, copd=no, cardiovascular=yes] then alive. Coverage:36, accuracy: 1.0  
If [hypertension=no, sex=female, diabetes=no, tobacco=yes, obesity=no, copd=no, imm_supr=no, renal_chronic=no, asthma=no, cardiovascular=no] then alive. Coverage:2317, accuracy: 0.9762624082865775    
If [hypertension=no, sex=female, diabetes=no, asthma=yes, obesity=no, imm_supr=no, copd=no, cardiovascular=no, tobacco=no, renal_chronic=no] then alive. Coverage:1676, accuracy: 0.9671837708830548    
If [hypertension=no, sex=female, diabetes=no, obesity=no, copd=no, imm_supr=no, asthma=yes, tobacco=no] then alive. Coverage:32, accuracy: 0.96875  
If [hypertension=no, sex=female, diabetes=no, obesity=no, copd=no, imm_supr=no, renal_chronic=no,  cardiovascular=no, asthma=no, tobacco=no] then alive. Coverage:54563, accuracy: 0.9620255484485823   
If [hypertension=no, asthma=yes, diabetes=no, copd=no, imm_supr=no, sex=female, tobacco=no, obesity=yes, renal_chronic=no, cardiovascular=no] then alive. Coverage:525, accuracy: 0.9561904761904761    
If [hypertension=no, asthma=yes, diabetes=no, obesity=no, copd=no, imm_supr=no, renal_chronic=no, tobacco=no, cardiovascular=no, sex=male] then alive. Coverage:1161, accuracy: 0.9509043927648578  
If [hypertension=no, diabetes=no, sex=female, tobacco=yes, obesity=yes, cardiovascular=no, asthma=yes, renal_chronic=no, copd=no, imm_supr=no] then alive. Coverage:41, accuracy: 0.975609756097561 
If [hypertension=no, diabetes=no, sex=female, obesity=yes, tobacco=yes, cardiovascular=no, copd=no, imm_supr=no, asthma=no, renal_chronic=no] then alive. Coverage:841, accuracy: 0.9453032104637337    
If [hypertension=no, diabetes=no, sex=female, obesity=yes, copd=no, cardiovascular=no, imm_supr=no, renal_chronic=no, asthma=no, tobacco=no] then alive. Coverage:9803, accuracy: 0.9365500357033562    
If [hypertension=no, diabetes=no, obesity=no, tobacco=yes, copd=no, asthma=yes, imm_supr=no, sex=male, renal_chronic=no, cardiovascular=no] then alive. Coverage:100, accuracy: 0.93    
If [hypertension=no, diabetes=no, obesity=no, tobacco=yes, copd=no, renal_chronic=no, imm_supr=no, cardiovascular=no, sex=male, asthma=no] then alive. Coverage:5921, accuracy: 0.9211281878061138  
If [hypertension=no, diabetes=no, obesity=no, copd=no, renal_chronic=no, imm_supr=no, cardiovascular=no, sex=male, asthma=no, tobacco=no] then alive. Coverage:63252, accuracy: 0.9117814456459875  
If [asthma=yes, hypertension=no, cardiovascular=yes, sex=male, renal_chronic=no] then alive. Coverage:31, accuracy: 0.9354838709677419  
If [asthma=yes, hypertension=no, obesity=yes, sex=male, tobacco=yes, renal_chronic=no, copd=no, cardiovascular=no, imm_supr=no, diabetes=no] then alive. Coverage:47, accuracy: 0.9574468085106383  
If [asthma=yes, hypertension=no, obesity=yes, copd=no, sex=male, diabetes=no, renal_chronic=no, imm_supr=no, cardiovascular=no, tobacco=no] then alive. Coverage:312, accuracy: 0.9166666666666666  

## Discussion

To begin with, age did not really work for the COVID dataset, even though it worked for Titanic. I also tried putting age into bins based on the CDC age groups (https://www.cdc.gov/nchs/nvss/vsrr/covid_weekly/index.htm#SexAndAge). Since my first implementation did not include age, I will first discuss the rules learned without age.
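
For reference, the binning I tried looks roughly like the sketch below. The bin edges and labels are the ones from the commented-out cell above, and the ages are a few values copied from the printed rows. Converting the result to strings is my own assumption, so that the rule learner treats the column as categorical.

```python
import pandas as pd

bins = [0, 17, 29, 39, 49, 64, 74, 85, 125]
labels = ['kid', 'young adult', '29-39', '39-49', '49-64',
          '64-74', '74-85', '85 and above']

# A few ages copied from the rows printed above
ages = pd.Series([27, 24, 54, 30, 60, 88])

# right=False makes each bin closed on the left, e.g. [17, 29)
age_group = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str)
print(age_group.tolist())
```

Any age outside the range of `bins` would become missing under `pd.cut` and then the string `'nan'` after the conversion, so the edges should cover the whole column.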

The first thing that seems to be universal is that it is better to have no hypertension in order to survive. This makes sense, as hypertension does have negative effects on health. No diabetes is also universal, but this is likely caused by the low occurrence of diabetes in the data, so it may not be as interesting for the survival rate. When looking at the rules for death, however, both of them include diabetes=yes, which suggests that diabetes may raise the risk of death. No COPD is also important, as COPD puts pressure on the respiratory system and makes people more vulnerable to respiratory diseases. Immune suppression is another universal factor: if imm_supr is no, there is a higher chance of survival. This is likely because immunocompromised people are more likely to get other diseases, and the combination is dangerous. Having no chronic renal disease also matters, but because renal_chronic appears later in the rules, it seems less important than the previous factors.

Generally, for people with no obesity the survival rate is higher and the rules have more coverage. This makes sense, as obesity puts more pressure on the body and makes the person more vulnerable to COVID.
    
It is also interesting to see that females tend to have a better survival rate: for patients with none of the listed conditions, the rule with sex=female has an accuracy of about 0.96, compared with about 0.91 for sex=male. This is interesting, as I was not expecting a gender difference. Another interesting aspect is that in some rules cardiovascular=yes appears together with survival. However, this is likely to be coincidental, since the coverage of those rules is relatively low (36 and 31). The last interesting aspect is that tobacco=yes seems to go with a higher survival rate, and more research should be done on smoking to see whether there is anything special about it that may help survival in COVID.
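
A rough way to check whether a difference like the smoking one is larger than chance is a two-proportion z-test. The sketch below uses the coverage and accuracy of the two rules for females with no listed conditions (tobacco=yes versus tobacco=no) from the raw results. It is only a rough check: the rules were learned one after another on shrinking data, so the covered groups are not random samples, and the test says nothing about cause. `two_proportion_z` is a hypothetical helper.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference of two proportions, using a pooled estimate.

    Assumes the pooled proportion is strictly between 0 and 1.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_err

# accuracy and coverage of the tobacco=yes and tobacco=no rules (raw results)
z = two_proportion_z(0.9762624082865775, 2317, 0.9620255484485823, 54563)
print(round(z, 2))
```

An absolute value of z above about 2 is usually read as a difference that is unlikely to be pure chance, but even then it could be explained by something the rules do not include, such as age.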

Copyright &copy; 2022 Marina Barsky. All rights reserved.