## Homework 2 -  Classification
***
**Name**: $<$insert name here$>$ 
***

Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.

The rules to be followed for the assignment are:

- Do **NOT** load additional packages beyond what we've shared in the cells below.
- Some problems with code may be autograded.  If we provide a function or class API **do not** change it.
- Do not change the location of the data or data directory.  Use only relative paths to access the data.
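
As a minimal sketch of the relative-path rule above, `pathlib` can build the dataset location without hard-coding an absolute directory. The names `DATA_DIR` and `dataset_path` are hypothetical and used only for illustration; the resolved location depends on the directory the notebook is run from.

```python
from pathlib import Path

# Hypothetical names for illustration; the path is relative to the notebook's working directory.
DATA_DIR = Path("data")
dataset_path = DATA_DIR / "dataset.csv"

print(dataset_path.as_posix())     # data/dataset.csv
print(dataset_path.is_absolute())  # False
```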

In [1]:
import argparse
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
from collections import defaultdict

### [10 points] Problem 1 - Building a Decision Tree
***

A sample dataset has been provided to you in the './data/dataset.csv' path. Here are the attributes for the dataset. Use this dataset to test your functions.

- Age - ["<=30", "31-40", ">40"]
- Income - ["low", "medium", "high"]
- Student - ["no", "yes"]
- Credit Rating - ["fair", "excellent"]
- Loan - ["no", "yes"]

Note:
- A sample dataset to test your code is provided at "data/dataset.csv". Keep the file at this location; it is needed for grading.
- Do not change the variable names of the returned values.
- After calculating each of those values, assign them to the corresponding value that is being returned.
- The "Loan" attribute should be used as the target variable while making calculations for your decision tree.
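
The expected value used in the visible test for `information_gain_target` is consistent with the Shannon entropy of the `Loan` column, $H = -\sum_i p_i \log_2 p_i$. The sketch below only illustrates that formula and is not the graded solution. `label_entropy` is a hypothetical helper name, and the split of 7 "yes" and 5 "no" rows is an assumption inferred from the class probabilities reported in Problem 2 of this notebook.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Assumed class split for illustration: 7 "yes" rows and 5 "no" rows.
labels = ["yes"] * 7 + ["no"] * 5
print(round(label_entropy(labels), 4))  # 0.9799
```

A column with a single class has entropy 0, and an even two-class split has entropy 1, so this value says the target is close to evenly mixed.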

In [2]:
# df = pd.read_csv("data/dataset.csv")
# df

In [3]:
import math
import pandas as pd


def information_gain_target(dataset_file): 
    
#        Input: dataset_file - A string variable which references the path to the dataset file.
#        Output: ig_loan - A floating point variable which holds the entropy of the target variable.
#        
#        NOTE: 
#        1. Return the entropy of the target variable in the dataset (the baseline that attribute gains are measured against).
#        2. The Loan attribute is the target variable
#        3. The pandas dataframe has the following attributes: Age, Income, Student, Credit Rating, Loan
#        4. Perform your calculations and assign the result to the variable ig_loan

    df = pd.read_csv(dataset_file)
    target_column = 'Loan'
    
    # Class counts and relative frequencies of the target variable
    target_counts = df[target_column].value_counts().tolist()
    total = sum(target_counts)
    probabilities = [v / total for v in target_counts]
    
    # Entropy of the target: H = -sum(p * log2(p)).
    # No per-attribute entropies are subtracted here; attribute gains are
    # computed separately in information_gain_attributes.
    ig_loan = -sum(p * math.log2(p) for p in probabilities if p > 0)
    
    return ig_loan

In [4]:
# This cell has visible test cases that you can run to see if you are on the right track!
# Note: hidden tests will also be applied on other datasets for final grading.

ig_loan = information_gain_target('./data/dataset.csv')
ig_loan_expected = 0.9798687566511528

print(f'The expected ig_loan value for the given dataset is: {ig_loan_expected}')
print(f'Your ig_loan value is: {ig_loan}')

try:
    np.testing.assert_allclose(ig_loan, ig_loan_expected, rtol=0.001, atol=0.001)
    print("Visible tests passed!")
except:
    print("Visible tests failed!")

The expected ig_loan value for the given dataset is: 0.9798687566511528
Your ig_loan value is: 0.9798687566511528
Visible tests passed!


In [5]:

# This cell has hidden test cases that will run after you submit your assignment. 


def information_gain(p_count_yes, p_count_no, c_count_yes, c_count_no):
    
#   A helper function that returns the information gain when given counts of number of yes and no values. 
#   Please complete this function before you proceed to the information_gain_attributes function below.
    
    def calc_entropy(count_yes, count_no):
        total = count_yes + count_no
        if total == 0:
            return 0
        p_yes = count_yes / total
        p_no = count_no / total
        entropy_value = 0
        if p_yes > 0:
            entropy_value -= p_yes * math.log2(p_yes)
        if p_no > 0:
            entropy_value -= p_no * math.log2(p_no)
        return entropy_value
    
    parent_entropy = calc_entropy(p_count_yes, p_count_no)
    total_instances = p_count_yes + p_count_no
    weighted_child_entropy = 0
    for count_yes, count_no in zip(c_count_yes, c_count_no):
        total_child_instances = count_yes + count_no
        child_entropy = calc_entropy(count_yes, count_no)
        weighted_child_entropy += (total_child_instances / total_instances) * child_entropy
    
    ig = parent_entropy - weighted_child_entropy
    return ig

import operator

attribute_values = {
    "Age": ["<=30", "31-40", ">40"],
    "Income": ["low", "medium", "high"],
    "Student": ["yes", "no"],
    "Credit Rating": ["fair", "excellent"]
}

attributes = ["Age", "Income", "Student", "Credit Rating"]

def information_gain_attributes(dataset_file, ig_loan, attributes, attribute_values):
    
#        Input: 
#            1. dataset_file - A string variable which references the path to the dataset file.
#            2. ig_loan - A floating point variable representing the information gain of the target variable "Loan".
#            3. attributes - A python list which has all the attributes of the dataset
#            4. attribute_values - A python dictionary representing the values each attribute can hold.
#            
#        Output: results - A python dictionary representing the information gain associated with each variable.
#            1. ig_attributes - A sub dictionary representing the information gain for each attribute.
#            2. best_attribute - Returns the attribute which has the highest information gain.
#        
#        NOTE: 
#        1. The Loan attribute is the target variable
#        2. The pandas dataframe has the following attributes: Age, Income, Student, Credit Rating, Loan

    
    
    results = {
        "ig_attributes": {
            "Age": 0,
            "Income": 0,
            "Student": 0,
            "Credit Rating": 0
        },
        "best_attribute": ""
    }
    
    df = pd.read_csv(dataset_file)
    
    for attribute in attributes:
        c_count_yes = []
        c_count_no = []
        
        for value in attribute_values[attribute]:
            subset = df[df[attribute] == value]
            count_yes = subset[subset['Loan'] == 'yes'].shape[0]
            count_no = subset[subset['Loan'] == 'no'].shape[0]
            c_count_yes.append(count_yes)
            c_count_no.append(count_no)
        
        p_count_yes = df[df['Loan'] == 'yes'].shape[0]
        p_count_no = df[df['Loan'] == 'no'].shape[0]
        
        ig_attribute = information_gain(p_count_yes, p_count_no, c_count_yes, c_count_no)
        results["ig_attributes"][attribute] = ig_attribute
    
    results["best_attribute"] = max(results["ig_attributes"].items(), key=operator.itemgetter(1))[0]
    return results

In [6]:
import pandas as pd
import math
import operator

attribute_values = {
    "Age": ["<=30", "31-40", ">40"],
    "Income": ["low", "medium", "high"],
    "Student": ["yes", "no"],
    "Credit Rating": ["fair", "excellent"]
}

attributes = ["Age", "Income", "Student", "Credit Rating"]

def information_gain(p_count_yes, p_count_no, c_count_yes, c_count_no):
    def calc_entropy(count_yes, count_no):
        total = count_yes + count_no
        if total == 0:
            return 0
        p_yes = count_yes / total
        p_no = count_no / total
        entropy_value = 0
        if p_yes > 0:
            entropy_value -= p_yes * math.log2(p_yes)
        if p_no > 0:
            entropy_value -= p_no * math.log2(p_no)
        return entropy_value
    
    parent_entropy = calc_entropy(p_count_yes, p_count_no)
    total_instances = p_count_yes + p_count_no
    weighted_child_entropy = 0
    for count_yes, count_no in zip(c_count_yes, c_count_no):
        total_child_instances = count_yes + count_no
        child_entropy = calc_entropy(count_yes, count_no)
        weighted_child_entropy += (total_child_instances / total_instances) * child_entropy
    
    ig = parent_entropy - weighted_child_entropy
    return ig

def information_gain_attributes(dataset_file, ig_loan, attributes, attribute_values):
    results = {
        "ig_attributes": {
            "Age": 0,
            "Income": 0,
            "Student": 0,
            "Credit Rating": 0
        },
        "best_attribute": ""
    }
    
    df = pd.read_csv(dataset_file)
    
    for attribute in attributes:
        c_count_yes = []
        c_count_no = []
        
        for value in attribute_values[attribute]:
            subset = df[df[attribute] == value]
            count_yes = subset[subset['Loan'] == 'yes'].shape[0]
            count_no = subset[subset['Loan'] == 'no'].shape[0]
            c_count_yes.append(count_yes)
            c_count_no.append(count_no)
        
        p_count_yes = df[df['Loan'] == 'yes'].shape[0]
        p_count_no = df[df['Loan'] == 'no'].shape[0]
        
        ig_attribute = information_gain(p_count_yes, p_count_no, c_count_yes, c_count_no)
        results["ig_attributes"][attribute] = ig_attribute
    
    results["best_attribute"] = max(results["ig_attributes"].items(), key=operator.itemgetter(1))[0]
    return results



In [7]:
# This cell has visible test cases that you can run to see if you are on the right track!
# Note: hidden tests will also be applied on other datasets for final grading.

import pprint
pp = pprint.PrettyPrinter(depth=4)
ig_loan_expected = 0.9798687566511528

attribute_values = {
    "Age": ["<=30", "31-40", ">40"],
    "Income": ["low", "medium", "high"],
    "Student": ["yes", "no"],
    "Credit Rating": ["fair", "excellent"]
}

attributes = ["Age", "Income", "Student", "Credit Rating"]

results = information_gain_attributes("./data/dataset.csv", ig_loan_expected, attributes, attribute_values)

results_expected = {'ig_attributes': {'Age': 0.2419726756283742, 'Income': 0.012398717114751934, 'Student': 0.19570962879973097, 'Credit Rating': 0.07181901063117269}, 'best_attribute': 'Age'}

print(f'The expected results value for the given dataset is:')
pp.pprint(results_expected)
print(f'Your results value is:')
pp.pprint(results)

try:
    x = pd.Series(results["ig_attributes"])
    y = pd.Series(results_expected["ig_attributes"])
    pd.testing.assert_series_equal(x, y, check_less_precise=3)
    assert results["best_attribute"] == results_expected["best_attribute"]
    print("Visible tests passed!")
except:
    print("Visible tests failed!")

The expected results value for the given dataset is:
{'best_attribute': 'Age',
 'ig_attributes': {'Age': 0.2419726756283742,
                   'Credit Rating': 0.07181901063117269,
                   'Income': 0.012398717114751934,
                   'Student': 0.19570962879973097}}
Your results value is:
{'best_attribute': 'Age',
 'ig_attributes': {'Age': 0.2419726756283742,
                   'Credit Rating': 0.07181901063117269,
                   'Income': 0.012398717114751934,
                   'Student': 0.19570962879973097}}
Visible tests passed!
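
The gain reported for `Age` can be reproduced by hand from class counts: parent entropy minus the size-weighted entropy of each child subset. This is a minimal sketch, not the graded implementation. `binary_entropy` is a hypothetical helper, and the per-value counts are assumptions inferred from the conditional probabilities shown in Problem 2 (7 "yes" and 5 "no" rows overall).

```python
import math


def binary_entropy(n_yes, n_no):
    """Entropy (base 2) of a two-class split given raw counts."""
    total = n_yes + n_no
    h = 0.0
    for n in (n_yes, n_no):
        if n > 0:  # skip empty classes, treating 0 * log2(0) as 0
            p = n / total
            h -= p * math.log2(p)
    return h


# Assumed (yes, no) counts, inferred from the probabilities in this notebook.
parent = (7, 5)
children = {"<=30": (2, 3), "31-40": (3, 0), ">40": (2, 2)}

n_total = sum(parent)
weighted = sum((y + n) / n_total * binary_entropy(y, n) for y, n in children.values())
gain_age = binary_entropy(*parent) - weighted
print(round(gain_age, 4))  # 0.242
```

The "31-40" subset is pure (all "yes"), so it contributes zero entropy, which is why `Age` yields the largest gain of the four attributes.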


In [8]:

# This cell has hidden test cases that will run after you submit your assignment. 


### [10 points] Problem 2 - Building a Naive Bayes Classifier
***

A sample dataset has been provided to you in the './data/dataset.csv' path. Here are the attributes for the dataset. Use this dataset to test your functions.

- Age - ["<=30", "31-40", ">40"]
- Income - ["low", "medium", "high"]
- Student - ["no", "yes"]
- Credit Rating - ["fair", "excellent"]
- Loan - ["no", "yes"]

Note:
- A sample dataset to test your code is provided at "data/dataset.csv". Keep the file at this location; it is needed for grading.
- Do not change the variable names of the returned values.
- After calculating each of those values, assign them to the corresponding value that is being returned.
- The "Loan" attribute should be used as the target variable while making calculations for your naive bayes classifier.

In [9]:
from collections import defaultdict

def naive_bayes(dataset_file, attributes, attribute_values):

#   Input:
#       1. dataset_file - A string variable which references the path to the dataset file.
#       2. attributes - A python list which has all the attributes of the dataset
#       3. attribute_values - A python dictionary representing the values each attribute can hold.
#        
#   Output: A probabilities dictionary holding, for each attribute value, the conditional probabilities
#       P(attribute = value | Loan = yes) and P(attribute = value | Loan = no), plus the priors P(Loan).
#                
#   Hint: Starter code has been provided to you to calculate the probabilities.

    probabilities = {
        "Age": { "<=30": {"yes": 0, "no": 0}, "31-40": {"yes": 0, "no": 0}, ">40": {"yes": 0, "no": 0} },
        "Income": { "low": {"yes": 0, "no": 0}, "medium": {"yes": 0, "no": 0}, "high": {"yes": 0, "no": 0}},
        "Student": { "yes": {"yes": 0, "no": 0}, "no": {"yes": 0, "no": 0} },
        "Credit Rating": { "fair": {"yes": 0, "no": 0}, "excellent": {"yes": 0, "no": 0} },
        "Loan": {"yes": 0, "no": 0}
    }
    
    df = pd.read_csv(dataset_file)
    d_range = len(df)
    
    vcount = df["Loan"].value_counts()
    vcount_loan_yes = vcount["yes"]
    vcount_loan_no = vcount["no"]
    
    probabilities["Loan"]["yes"] = vcount_loan_yes/d_range
    probabilities["Loan"]["no"] = vcount_loan_no/d_range
    
    for attribute in attributes:
        for att_value in attribute_values[attribute]:
            subset_yes = df[(df[attribute] == att_value) & (df["Loan"] == "yes")]
            subset_no = df[(df[attribute] == att_value) & (df["Loan"] == "no")]
            probabilities[attribute][att_value]["yes"] = len(subset_yes) / vcount_loan_yes
            probabilities[attribute][att_value]["no"] = len(subset_no) / vcount_loan_no
    
    return probabilities
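
A dictionary shaped like the one returned above can be used to classify a new record: for each class, multiply the prior by the conditional probability of every observed attribute value, then pick the class with the larger score. The sketch below is illustrative only. `predict_loan` and the sample `record` are hypothetical, and the table is a hand-copied subset of the expected probabilities for the sample dataset.

```python
def predict_loan(probabilities, record):
    """Return (predicted label, unnormalized scores) under the Naive Bayes assumption."""
    scores = {}
    for label in ("yes", "no"):
        score = probabilities["Loan"][label]  # prior P(Loan = label)
        for attribute, value in record.items():
            # likelihood P(attribute = value | Loan = label)
            score *= probabilities[attribute][value][label]
        scores[label] = score
    return max(scores, key=scores.get), scores


# Subset of the expected probabilities for the sample dataset, written as fractions.
probabilities = {
    "Age": {"<=30": {"yes": 2 / 7, "no": 0.6}},
    "Income": {"medium": {"yes": 3 / 7, "no": 0.4}},
    "Student": {"yes": {"yes": 5 / 7, "no": 0.2}},
    "Credit Rating": {"fair": {"yes": 5 / 7, "no": 0.4}},
    "Loan": {"yes": 7 / 12, "no": 5 / 12},
}

# Hypothetical record to classify.
record = {"Age": "<=30", "Income": "medium", "Student": "yes", "Credit Rating": "fair"}
label, scores = predict_loan(probabilities, record)
print(label)  # yes
```

The scores are not normalized probabilities; only their relative size matters for choosing the class.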

In [10]:
# This cell has visible test cases that you can run to see if you are on the right track!
# Note: hidden tests will also be applied on other datasets for final grading.

import pprint
pp = pprint.PrettyPrinter(depth=6)

attribute_values = {
    "Age": ["<=30", "31-40", ">40"],
    "Income": ["low", "medium", "high"],
    "Student": ["yes", "no"],
    "Credit Rating": ["fair", "excellent"]
}

attributes = ["Age", "Income", "Student", "Credit Rating"]

probabilities = naive_bayes("./data/dataset.csv", attributes, attribute_values)

probabilities_expected = {'Age': {'<=30': {'yes': 0.2857142857142857, 'no': 0.6},
  '31-40': {'yes': 0.42857142857142855, 'no': 0.0},
  '>40': {'yes': 0.2857142857142857, 'no': 0.4}},
 'Income': {'low': {'yes': 0.2857142857142857, 'no': 0.2},
  'medium': {'yes': 0.42857142857142855, 'no': 0.4},
  'high': {'yes': 0.2857142857142857, 'no': 0.4}},
 'Student': {'yes': {'yes': 0.7142857142857143, 'no': 0.2},
  'no': {'yes': 0.2857142857142857, 'no': 0.8}},
 'Credit Rating': {'fair': {'yes': 0.7142857142857143, 'no': 0.4},
  'excellent': {'yes': 0.2857142857142857, 'no': 0.6}},
 'Loan': {'yes': 0.5833333333333334, 'no': 0.4166666666666667}}

print(f'Your probabilities value is:')
pp.pprint(probabilities)
print(f'\nThe expected probabilities value for the given dataset is:')
pp.pprint(probabilities_expected)

try:
    for i in attributes:
        for j in attribute_values[i]:
            for k in ["yes", "no"]:
                np.testing.assert_allclose(probabilities[i][j][k], probabilities_expected[i][j][k], rtol=0.001, atol=0.001)
    print("Visible tests passed!")
except:
    print("Visible tests failed!")

Your probabilities value is:
{'Age': {'31-40': {'no': 0.0, 'yes': 0.42857142857142855},
         '<=30': {'no': 0.6, 'yes': 0.2857142857142857},
         '>40': {'no': 0.4, 'yes': 0.2857142857142857}},
 'Credit Rating': {'excellent': {'no': 0.6, 'yes': 0.2857142857142857},
                   'fair': {'no': 0.4, 'yes': 0.7142857142857143}},
 'Income': {'high': {'no': 0.4, 'yes': 0.2857142857142857},
            'low': {'no': 0.2, 'yes': 0.2857142857142857},
            'medium': {'no': 0.4, 'yes': 0.42857142857142855}},
 'Loan': {'no': 0.4166666666666667, 'yes': 0.5833333333333334},
 'Student': {'no': {'no': 0.8, 'yes': 0.2857142857142857},
             'yes': {'no': 0.2, 'yes': 0.7142857142857143}}}

The expected probabilities value for the given dataset is:
{'Age': {'31-40': {'no': 0.0, 'yes': 0.42857142857142855},
         '<=30': {'no': 0.6, 'yes': 0.2857142857142857},
         '>40': {'no': 0.4, 'yes': 0.2857142857142857}},
 'Credit Rating': {'excellent': {'no': 0.6, 'yes': 0.28571

In [11]:

# This cell has hidden test cases that will run after you submit your assignment. 
