# Raja Jain
## Assignment 2 - Rules Based Classification
### **Goal:** The goal of this assignment is to give you the opportunity to develop an intuition for classification using rule-based classification. 

The data are used to predict whether individuals will cheat when filing their taxes. The attributes are refund, indicating whether an individual received a tax refund; marital_status, indicating whether the individual is married, single, or divorced; and income_above_80k, indicating whether the individual’s taxable income is above $80,000. The output variable, cheat, is a binary variable indicating whether the individual cheated when filing taxes.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn.preprocessing import LabelEncoder
import pprint

In [2]:
tax_data = """refund	marital_status	income_above_80k	cheat
1	yes	single	yes	no
2	no	married	yes	no
3	no	single	no	no
4	yes	married	yes	no
5	no	divorced	yes	yes
6	no	married	no	no
7	yes	divorced	yes	no
8	no	single	yes	yes
9	no	married	no	no
10	no	single	yes	yes"""

# convert tax_data to a table
def str_to_data(data_str: str):
    dat = data_str.split("\n")
    dat = [row.split("\t") for row in dat]
    dat[0].insert(0, "index")
    return pd.DataFrame(dat[1:], columns=dat[0]).set_index(["index"])


tax_data = str_to_data(tax_data)

tax_data

index,refund,marital_status,income_above_80k,cheat
1,yes,single,yes,no
2,no,married,yes,no
3,no,single,no,no
4,yes,married,yes,no
5,no,divorced,yes,yes
6,no,married,no,no
7,yes,divorced,yes,no
8,no,single,yes,yes
9,no,married,no,no
10,no,single,yes,yes


a)	Write an IF-THEN rule, derived from this decision tree, that can be used to classify instances in the data. Suppose you want the rule to predict “yes” for the output variable.

In [3]:
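# IF-THEN rule for classification:
# IF refund == "no" AND income_above_80k == "yes"
#    AND marital_status is "single" or "divorced" THEN cheat = "yes"
instance = tax_data.iloc[4]  # example instance (row 5 of the training data)

cheat = "no"
if instance["refund"] == "no":
    if instance["income_above_80k"] == "yes":
        if instance["marital_status"] in ("single", "divorced"):
            cheat = "yes"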

b)	In code, create a function with an if-else statement that implements the rule you wrote in question 1a. You can name the function predict. The function takes the input data and returns a vector of predicted output values. You can decide whether the input data argument is a numpy array or a data frame, then structure the function body to process the input accordingly. Provide brief documentation for your function. You can loop through each instance in the input data and check whether it satisfies the rule for predicting “yes”. If it does, predict 1; otherwise, predict 0.

In [4]:
# rule-based classifier for "cheat", applied to one instance (row) of tax_data
def predict(X):
    """Take one instance X (a row with refund, marital_status, income_above_80k) and apply the IF-THEN rule: return 1 if the person is predicted to cheat, otherwise 0."""
    cheat = 0
    if X["refund"] == "no":
        if X["income_above_80k"] == "yes":
            if X["marital_status"] == "single" or X["marital_status"] == "divorced":
                cheat = 1
    return cheat


predictions = []
for _, row in tax_data.iterrows():
    print(row.to_dict())
    prediction = predict(row)
    predictions.append(prediction)
    print("Prediction:", prediction)

print("All predictions:", predictions)

{'refund': 'yes', 'marital_status': 'single', 'income_above_80k': 'yes', 'cheat': 'no'}
Prediction: 0
{'refund': 'no', 'marital_status': 'married', 'income_above_80k': 'yes', 'cheat': 'no'}
Prediction: 0
{'refund': 'no', 'marital_status': 'single', 'income_above_80k': 'no', 'cheat': 'no'}
Prediction: 0
{'refund': 'yes', 'marital_status': 'married', 'income_above_80k': 'yes', 'cheat': 'no'}
Prediction: 0
{'refund': 'no', 'marital_status': 'divorced', 'income_above_80k': 'yes', 'cheat': 'yes'}
Prediction: 1
{'refund': 'no', 'marital_status': 'married', 'income_above_80k': 'no', 'cheat': 'no'}
Prediction: 0
{'refund': 'yes', 'marital_status': 'divorced', 'income_above_80k': 'yes', 'cheat': 'no'}
Prediction: 0
{'refund': 'no', 'marital_status': 'single', 'income_above_80k': 'yes', 'cheat': 'yes'}
Prediction: 1
{'refund': 'no', 'marital_status': 'married', 'income_above_80k': 'no', 'cheat': 'no'}
Prediction: 0
{'refund': 'no', 'marital_status': 'single', 'income_above_80k': 'yes', 'cheat': 'yes'}
Prediction: 1
All predictions: [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

c)	Create another function that implements the rule for predicting “yes” for the output, but this time without an if-else statement or a loop. Instead, use numpy arrays, matrices, or vectors to vectorize your code for a faster and more efficient implementation. If the instance satisfies the rule for predicting “yes”, predict 1; otherwise, predict 0. You can use logical operators to compare the values of test instances with the attribute values in the rule. This function still takes the input data and returns a vector of predicted values. You can call this function vectorized_predict.

In [5]:
le_tax_data = tax_data[["refund", "marital_status", "income_above_80k"]].apply(
    LabelEncoder().fit_transform
)


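def vectorized_predict(X):
    """Apply the rule to label-encoded inputs X = (refund, marital_status, income_above_80k).

    Codes follow LabelEncoder's alphabetical order: refund no=0/yes=1,
    marital_status divorced=0/married=1/single=2, income_above_80k no=0/yes=1.
    """
    return np.where((X[0] == 0) & (X[2] == 1) & (X[1] != 1), 1, 0)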


display(le_tax_data)

vectorized_predictions = list(map(vectorized_predict, le_tax_data.values))

print("original predictions:", predictions)

print("vectorized_predictions:", vectorized_predictions)

index,refund,marital_status,income_above_80k
1,1,2,1
2,0,1,1
3,0,2,0
4,1,1,1
5,0,0,1
6,0,1,0
7,1,0,1
8,0,2,1
9,0,1,0
10,0,2,1


original predictions: [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]
vectorized_predictions: [array(0), array(0), array(0), array(0), array(1), array(0), array(0), array(1), array(0), array(1)]
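
The function above is still invoked once per row through `map`. A minimal sketch of a variant that evaluates the rule on whole columns at once is shown below. It assumes the original string-valued columns rather than the label-encoded ones, and the name `vectorized_predict_columns` is hypothetical, not part of the assignment.

```python
import numpy as np
import pandas as pd


def vectorized_predict_columns(X: pd.DataFrame) -> np.ndarray:
    """Predict 1 where refund is "no", income is above 80k, and the person is single or divorced."""
    mask = (
        (X["refund"] == "no")
        & (X["income_above_80k"] == "yes")
        & X["marital_status"].isin(["single", "divorced"])
    )
    # Boolean mask -> vector of 0/1 predictions, one per row
    return mask.to_numpy().astype(int)


# Training inputs from In [2], re-entered so the sketch runs on its own
train_inputs = pd.DataFrame(
    {
        "refund": ["yes", "no", "no", "yes", "no", "no", "yes", "no", "no", "no"],
        "marital_status": ["single", "married", "single", "married", "divorced",
                           "married", "divorced", "single", "married", "single"],
        "income_above_80k": ["yes", "yes", "no", "yes", "yes",
                             "no", "yes", "yes", "no", "yes"],
    }
)

print(vectorized_predict_columns(train_inputs))
```

Because the comparisons run on entire columns, no loop, `map`, or `apply` is needed, and the result is a plain integer array rather than a list of zero-dimensional arrays.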


d)	Select the input data (refund, marital_status, income_above_80k) in the training data and apply your predict function to the input data to predict the outcome values of the input data. Your function should return a vector of predicted values. 
Paste your code and output here

In [6]:
tax_data[["refund", "marital_status", "income_above_80k"]].apply(predict, axis=1)

index
1     0
2     0
3     0
4     0
5     1
6     0
7     0
8     1
9     0
10    1
dtype: int64



e)	Select the input data (refund, marital_status, income_above_80k) in the training data and apply your vectorized_predict function to the input data to predict the outcome values of the input data. Your function should return a vector of predicted values. Are the predicted values with the vectorized_predict function the same as the predicted values obtained through the predict function? (you should have the same results). 


In [7]:
print("Apply vectorized_predict:")
print(le_tax_data.apply(vectorized_predict, axis=1))

print("\ncompare outputs between predict and vectorized_predict:")
tax_data.apply(predict, axis=1) == le_tax_data.apply(vectorized_predict, axis=1)

Apply vectorized_predict:
index
1     0
2     0
3     0
4     0
5     1
6     0
7     0
8     1
9     0
10    1
dtype: object

compare outputs between predict and vectorized_predict:


index
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
dtype: bool

f)	Add the predicted values as a column of the training data and name that column predicted_cheat.

In [8]:
tax_data["predicted_cheat"] = tax_data.apply(predict, axis=1)

tax_data

index,refund,marital_status,income_above_80k,cheat,predicted_cheat
1,yes,single,yes,no,0
2,no,married,yes,no,0
3,no,single,no,no,0
4,yes,married,yes,no,0
5,no,divorced,yes,yes,1
6,no,married,no,no,0
7,yes,divorced,yes,no,0
8,no,single,yes,yes,1
9,no,married,no,no,0
10,no,single,yes,yes,1



g)	Create a function that computes the overall accuracy of the classification. You can call this function overall_accuracy. The function should take two arguments, a vector of actual output values and a vector of predicted output values, and return the classification accuracy. Inside the function, you can use boolean logic to compare the actual and predicted values, then compute the proportion of predicted values that equal the actual values to get the overall accuracy.


In [9]:
def create_confusion_matrix(actual: np.ndarray, predicted: np.ndarray):
    tn, fp, fn, tp = 0, 0, 0, 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 0:
            tn += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 1:
            tp += 1

    return [[tn, fp], [fn, tp]]


def overall_accuracy(confusion_matrix):
    tn, fp, fn, tp = (
        confusion_matrix[0][0],
        confusion_matrix[0][1],
        confusion_matrix[1][0],
        confusion_matrix[1][1],
    )
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    print(f"Overall accuracy: {accuracy:.2%}")
    return accuracy


tax_data_confusion_matrix = create_confusion_matrix(
    actual=tax_data["cheat"].replace({"yes": 1, "no": 0}).values,
    predicted=tax_data["predicted_cheat"].values,
)

overall_accuracy_result = overall_accuracy(tax_data_confusion_matrix)

Overall accuracy: 100.00%
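
The cell above derives accuracy from a confusion matrix, whereas the question describes a function that takes the actual and predicted vectors directly. A minimal sketch of that two-argument form follows; the name `accuracy_from_vectors` is hypothetical, and the vectors are the training labels and predictions shown earlier, coded yes=1 and no=0.

```python
import numpy as np


def accuracy_from_vectors(actual, predicted) -> float:
    """Proportion of positions where the predicted value equals the actual value."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    # Element-wise comparison gives booleans; their mean is the accuracy
    return float(np.mean(actual == predicted))


actual_cheat = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]     # training "cheat" column, yes=1, no=0
predicted_cheat = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]  # predicted_cheat column from In [8]

print(f"Overall accuracy: {accuracy_from_vectors(actual_cheat, predicted_cheat):.2%}")
```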


h)	Create a confusion matrix using the predicted and actual output values. You can use pandas crosstab function. 

In [10]:
tax_data_confusion_matrix

[[7, 0], [0, 3]]
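
The question suggests pandas `crosstab`. A minimal sketch of that approach is given below, using the training labels and predictions re-entered by hand so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Actual labels and rule predictions for the ten training rows
actual = pd.Series(
    ["no", "no", "no", "no", "yes", "no", "no", "yes", "no", "yes"], name="cheat"
)
predicted = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 1], name="predicted_cheat")

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(actual.map({"no": 0, "yes": 1}), predicted)
print(confusion)
```

The resulting table has the same counts as the nested list above, with labelled rows (actual) and columns (predicted).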

i)	Suppose the test set data is as follows. 
```
refund	marital_status	income_above_80k	cheat
1	no	single	yes	no
2	no	single	yes	no
3	no	married	yes	no
4	no	divorced	no	no
5	no	married	yes	no
6	no	single	yes	no
7	yes	single	yes	no
9	yes	married	yes	yes
8	no	single	yes	yes
10	yes	single	no	yes
```
Apply the vectorized_predict function to the input of the test dataset to predict the output values for the test dataset. Add these predicted values as a column of the test dataset and call this column predicted_cheat. 


In [11]:
test_data = """refund	marital_status	income_above_80k	cheat
1	no	single	yes	no
2	no	single	yes	no
3	no	married	yes	no
4	no	divorced	no	no
5	no	married	yes	no
6	no	single	yes	no
7	yes	single	yes	no
8	no	single	yes	yes
9	yes	married	yes	yes
10	yes	single	no	yes"""


test_data = str_to_data(test_data)

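# Note: a LabelEncoder fitted on the test set reproduces the training codes only
# because every category also appears here; reusing the encoders fitted on the
# training data would be safer.
le_test_data = test_data[["refund", "marital_status", "income_above_80k"]].apply(
    LabelEncoder().fit_transform
)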

test_data["predicted_cheat"] = le_test_data.apply(vectorized_predict, axis=1)

test_data

index,refund,marital_status,income_above_80k,cheat,predicted_cheat
1,no,single,yes,no,1
2,no,single,yes,no,1
3,no,married,yes,no,0
4,no,divorced,no,no,0
5,no,married,yes,no,0
6,no,single,yes,no,1
7,yes,single,yes,no,0
8,no,single,yes,yes,1
9,yes,married,yes,yes,0
10,yes,single,no,yes,0
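
Fitting a separate `LabelEncoder` on the test set works here because the test set contains the same categories as the training set. On a batch with no "divorced" rows, a freshly fitted encoder would assign married=0 and single=1, and the rule's `X[1] != 1` check would no longer mean "not married". One way to avoid this, sketched below under the assumption that the codes shown in `le_tax_data` are kept fixed, is an explicit mapping; `ENCODING` and `encode` are hypothetical names.

```python
import pandas as pd

# Fixed codes matching the label-encoded training data shown earlier
ENCODING = {
    "refund": {"no": 0, "yes": 1},
    "marital_status": {"divorced": 0, "married": 1, "single": 2},
    "income_above_80k": {"no": 0, "yes": 1},
}


def encode(df: pd.DataFrame) -> pd.DataFrame:
    """Encode each input column with the fixed training codes."""
    return pd.DataFrame({col: df[col].map(codes) for col, codes in ENCODING.items()})


# Hypothetical batch that contains no "divorced" rows
batch = pd.DataFrame(
    {
        "refund": ["no", "yes"],
        "marital_status": ["single", "married"],
        "income_above_80k": ["yes", "no"],
    }
)

encoded = encode(batch)
print(encoded)
```

With the fixed mapping, "single" stays 2 and "married" stays 1 regardless of which categories appear in the batch.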


j)	Apply the overall_accuracy function to the predicted output values and actual output values of the test set to compute the overall accuracy. Compare the overall accuracy of the test set and that of the training set. Is there overfitting? Why or why not? If there is overfitting, what would you do to avoid overfitting the rule to the training set? 

In [12]:
test_data_confusion_matrix = create_confusion_matrix(
    actual=test_data["cheat"].replace({"yes": 1, "no": 0}).values,
    predicted=test_data["predicted_cheat"].values,
)

overall_accuracy(test_data_confusion_matrix)

Overall accuracy: 50.00%


0.5
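
The rule scores 100% on the training set and 50% on the test set. A gap of this size is consistent with a rule that was fitted too closely to ten training rows, although with so few instances the estimate is noisy. The sketch below recomputes both accuracies and the gap from the label and prediction vectors shown in the outputs above; the variable names are illustrative.

```python
import numpy as np

# Training set: actual labels (yes=1, no=0) and rule predictions
train_actual = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
train_predicted = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])

# Test set: actual labels and the predicted_cheat column from In [11]
test_actual = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
test_predicted = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0])

train_accuracy = float(np.mean(train_actual == train_predicted))
test_accuracy = float(np.mean(test_actual == test_predicted))
gap = train_accuracy - test_accuracy

print(f"train: {train_accuracy:.2%}, test: {test_accuracy:.2%}, gap: {gap:.2%}")
```

Common ways to reduce such a gap include using a simpler rule with fewer conditions and choosing the rule with held-out or cross-validated data rather than training accuracy alone.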

# Question 2 
You will use the same training dataset provided in question 1. Suppose we want to create one rule, using only one attribute, that best classifies the input data. You will need to write a function or an algorithm to find that best attribute. There are different approaches for evaluating which attribute classifies the data with the highest accuracy. One measure we can use to find the best attribute is information gain, which is based on entropy. 


In [13]:
# a)	Given the training dataset in question 1, write a function that computes the entropy of the output variable. The function should take the entire training dataset and the name of the output variable as arguments and return the entropy value. You can call the function entropy.

# Apply the entropy function to the training dataset to compute the entropy. Do you think the data is more pure, less pure or more/less pure?


def entropy(dataset, output_variable: str):
    """Return the Shannon entropy (in bits) of the output variable."""
    output_variable_data = dataset[output_variable].values
    values = dataset[output_variable].unique()

    total_entropy = 0

    for v in values:
        # proportion of instances whose output equals v
        prop = np.mean(output_variable_data == v)
        # H = -sum(p * log2(p)), so subtract each p * log2(p) term
        total_entropy -= prop * np.log2(prop)

    return total_entropy

In [14]:
entropy(dataset=tax_data, output_variable="cheat")

0.8812908992306927
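
As a check on the value above, the entropy can be worked out by hand from the class counts in the training data (7 "no" and 3 "yes" out of 10 rows):

$$
H(\text{cheat}) = -\tfrac{7}{10}\log_2\tfrac{7}{10} - \tfrac{3}{10}\log_2\tfrac{3}{10} \approx 0.3602 + 0.5211 = 0.8813
$$

The maximum entropy for a binary variable is 1 bit (a 50/50 split), so a value of about 0.88 indicates that the output is fairly impure.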

In [15]:
# b)	Create another function that takes the dataset, the name of the output variable, and a specific input variable as arguments and returns the information gain for a split of the data on that specific input variable. You can call the function, information_gain.

#  Apply the information_gain function to each input variable to compute its information gain. Which input variable is the best (has the highest information gain) for creating a one-rule?


def information_gain(dataset, output_variable: str, input_variable: str):
    def binary_entropy(row):
        if row["no"] == 0 or row["yes"] == 0:
            return 0
        else:
            return -(row["no"] / row["All"]) * np.log2(row["no"] / row["All"]) - (
                row["yes"] / row["All"]
            ) * np.log2(row["yes"] / row["All"])

    def compute_proportion(row, num_instances):
        return row["All"] / num_instances

    # use the dataset argument rather than the global tax_data
    cross_tab = pd.crosstab(
        dataset[input_variable], dataset[output_variable], margins=True
    )
    cross_tab["subset_entropy"] = cross_tab.apply(binary_entropy, axis=1)
    cross_tab["proportion"] = cross_tab.apply(
        compute_proportion, num_instances=cross_tab["All"]["All"], axis=1
    )
    cross_tab["weighted_entropy"] = (
        cross_tab["subset_entropy"] * cross_tab["proportion"]
    )

    entropy_before_split = cross_tab["weighted_entropy"]["All"]
    weighted_entropy_after_split = cross_tab[cross_tab.index != "All"][
        "weighted_entropy"
    ].sum()

    return entropy_before_split - weighted_entropy_after_split

In [16]:
information_gain_for_each_attribute = {}
for col in ["refund", "marital_status", "income_above_80k"]:
    information_gain_for_each_attribute[col] = information_gain(
        dataset=tax_data, output_variable="cheat", input_variable=col
    )

pprint.pprint(information_gain_for_each_attribute)

{'income_above_80k': 0.19163120400671663,
 'marital_status': 0.2812908992306926,
 'refund': 0.19163120400671663}
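
Information gain is the entropy of the output before the split minus the weighted average entropy of the subsets after the split. For marital_status the subsets are divorced (1 no, 1 yes), married (4 no, 0 yes), and single (2 no, 2 yes), giving a weighted entropy of 0.2·1 + 0.4·0 + 0.4·1 = 0.6 and a gain of about 0.8813 − 0.6 = 0.2813. A minimal sketch of the same computation with `groupby` is shown below; `entropy_bits` and `information_gain_groupby` are hypothetical names, and the data are re-entered so the block runs on its own.

```python
import numpy as np
import pandas as pd


def entropy_bits(labels: pd.Series) -> float:
    """Shannon entropy in bits; value_counts omits empty classes, so log2(0) never occurs."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())


def information_gain_groupby(df: pd.DataFrame, output_variable: str, input_variable: str) -> float:
    """Entropy before the split minus the size-weighted entropy of each subset."""
    weighted = sum(
        len(group) / len(df) * entropy_bits(group[output_variable])
        for _, group in df.groupby(input_variable)
    )
    return entropy_bits(df[output_variable]) - weighted


train = pd.DataFrame(
    {
        "refund": ["yes", "no", "no", "yes", "no", "no", "yes", "no", "no", "no"],
        "marital_status": ["single", "married", "single", "married", "divorced",
                           "married", "divorced", "single", "married", "single"],
        "income_above_80k": ["yes", "yes", "no", "yes", "yes",
                             "no", "yes", "yes", "no", "yes"],
        "cheat": ["no", "no", "no", "no", "yes", "no", "no", "yes", "no", "yes"],
    }
)

gains = {
    col: information_gain_groupby(train, "cheat", col)
    for col in ["refund", "marital_status", "income_above_80k"]
}
print(gains)
```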


In [17]:
# c)	Do a cross tabulation using the best attribute obtained in 2b above and the output variable in the training dataset.
cross_tab_result = pd.crosstab(tax_data["marital_status"], tax_data["cheat"])
cross_tab_result

marital_status,cheat=no,cheat=yes
divorced,1,1
married,4,0
single,2,2


In [18]:
# e)	Create a one-rule from the decision tree in 2d. The rule should contain an antecedent and a consequent. The antecedent should use only the best attribute and its value or values.


def marital_status_1R(data):
    "returns 0 (coded for no) or 1 (coded for yes) for marital_status"
    result = data["marital_status"] != "married"
    return result.astype(int)


display(tax_data)
marital_status_1R_predictions = marital_status_1R(tax_data)
print("marital_status predictions:\n", marital_status_1R_predictions)

index,refund,marital_status,income_above_80k,cheat,predicted_cheat
1,yes,single,yes,no,0
2,no,married,yes,no,0
3,no,single,no,no,0
4,yes,married,yes,no,0
5,no,divorced,yes,yes,1
6,no,married,no,no,0
7,yes,divorced,yes,no,0
8,no,single,yes,yes,1
9,no,married,no,no,0
10,no,single,yes,yes,1


marital_status predictions:
 index
1     1
2     0
3     1
4     0
5     1
6     0
7     1
8     1
9     0
10    1
Name: marital_status, dtype: int64


In [19]:
tax_data['marital_status_1R_predictions'] = marital_status_1R_predictions
tax_data

index,refund,marital_status,income_above_80k,cheat,predicted_cheat,marital_status_1R_predictions
1,yes,single,yes,no,0,1
2,no,married,yes,no,0,0
3,no,single,no,no,0,1
4,yes,married,yes,no,0,0
5,no,divorced,yes,yes,1,1
6,no,married,no,no,0,0
7,yes,divorced,yes,no,0,1
8,no,single,yes,yes,1,1
9,no,married,no,no,0,0
10,no,single,yes,yes,1,1


In [20]:
# f)	Using the antecedent of your rule, extract the data covered by the rule and compute the coverage of the rule.
marital_status_values = tax_data["marital_status"].values
coverage = sum(
    np.where(
        (marital_status_values == "single")
        | (marital_status_values == "married")
        | (marital_status_values == "divorced"),
        1,
        0,
    )
) / len(marital_status_values)

print(f"Coverage: {coverage:.2%}")

Coverage: 100.00%
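
The cell above counts every marital status, so it treats the rule together with its default "no" branch as covering all instances. Coverage is often defined more narrowly as the fraction of instances that satisfy the antecedent itself. Under that reading, which is an assumption about what the question intends, the antecedent `marital_status != "married"` can be evaluated as sketched below; the variable names are illustrative and the data are re-entered so the block runs on its own.

```python
import pandas as pd

train = pd.DataFrame(
    {
        "marital_status": ["single", "married", "single", "married", "divorced",
                           "married", "divorced", "single", "married", "single"],
        "cheat": ["no", "no", "no", "no", "yes", "no", "no", "yes", "no", "yes"],
    }
)

# Instances that satisfy the antecedent: marital_status is single or divorced
covered = train[train["marital_status"] != "married"]

# Coverage: share of all instances that satisfy the antecedent
coverage = len(covered) / len(train)

# Rule accuracy: share of covered instances whose actual class matches the consequent ("yes")
rule_accuracy = float((covered["cheat"] == "yes").mean())

print(f"Coverage: {coverage:.2%}")
print(f"Accuracy on covered instances: {rule_accuracy:.2%}")
```

With this definition the antecedent covers 6 of the 10 training rows, and 3 of those 6 are actual "yes" cases.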


In [21]:
# g)	Using the antecedent of your rule, extract the data covered by the rule and compute the accuracy of the rule.
accuracy_confusion_matrix = pd.crosstab(
    tax_data["cheat"].replace({"yes": 1, "no": 0}).values, marital_status_1R_predictions
)

display(accuracy_confusion_matrix)
accuracy = overall_accuracy(accuracy_confusion_matrix)

row_0,marital_status=0,marital_status=1
0,4,3
1,0,3


Overall accuracy: 70.00%


In [35]:
# h)	Create a function that implements the rule as an if-else statement to predict the outcomes of any instance (a vector of values associated with the input variables for a specific individual). The function should be able to take one or more instances as an argument in the form of a dataframe or numpy array.


def if_else_prediction(data):
    if isinstance(data, pd.DataFrame):
        data = data.values

    if data.ndim == 1:
        data = data.reshape(1, -1)
        
    predictions = []
    
    for X in data:
        cheat = "no"
        if X[1] != "married":
            cheat = "yes"
        predictions.append(cheat)
    return predictions


display(tax_data)

index,refund,marital_status,income_above_80k,cheat,predicted_cheat,marital_status_1R_predictions
1,yes,single,yes,no,0,1
2,no,married,yes,no,0,0
3,no,single,no,no,0,1
4,yes,married,yes,no,0,0
5,no,divorced,yes,yes,1,1
6,no,married,no,no,0,0
7,yes,divorced,yes,no,0,1
8,no,single,yes,yes,1,1
9,no,married,no,no,0,0
10,no,single,yes,yes,1,1


In [25]:
# i)	Implement the function on the training set and test set in question 1 to get the predicted outputs for the training set and test set.
training_predictions = if_else_prediction(data=tax_data)
test_predictions = if_else_prediction(data=test_data)

print("training_predictions:\n", training_predictions)
print("test_predictions:\n", test_predictions)

training_predictions:
 ['yes', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes']
test_predictions:
 ['yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes']


In [24]:
# j)	What are the overall prediction accuracies for the training set and test set? You can use the overall_accuracy function you initially defined.
from sklearn.metrics import confusion_matrix

# sklearn expects the actual labels first (y_true), then the predictions (y_pred)
training_data_confusion_matrix = confusion_matrix(
    tax_data["cheat"], training_predictions
)
training_accuracy = overall_accuracy(training_data_confusion_matrix)
print("training accuracy: ", training_accuracy)

# test_accuracy
test_data_confusion_matrix = confusion_matrix(test_data["cheat"], test_predictions)
test_accuracy = overall_accuracy(test_data_confusion_matrix)
print("test accuracy: ", test_accuracy)

Overall accuracy: 70.00%
training accuracy:  0.7
Overall accuracy: 40.00%
test accuracy:  0.4
