1)
A Classification Decision Tree is a supervised learning algorithm used to solve classification problems. These are problems where the goal is to predict a discrete category (or class) for an input based on its features. The decision tree models the problem as a series of decisions, represented by a tree structure, where each node splits the data based on a specific feature condition, leading to leaf nodes that represent the predicted class.

Examples of real-world applications:

Medical Diagnosis: Predicting whether a patient has a specific disease based on symptoms and test results.
Spam Filtering: Classifying emails as "Spam" or "Not Spam."
Credit Risk Assessment: Determining whether a loan applicant is "High Risk" or "Low Risk" based on financial and demographic features.
Customer Segmentation: Categorizing customers into groups (e.g., "Frequent Shoppers" or "Occasional Shoppers") for targeted marketing.
Fraud Detection: Identifying fraudulent transactions based on patterns in transaction data.
(b) Comparison: Classification Decision Tree vs. Multiple Linear Regression
How a Classification Decision Tree Makes Predictions:
A decision tree works by splitting the dataset into smaller subsets based on decision rules that optimize a criterion like the Gini impurity or entropy (in classification tasks).
Each split corresponds to a condition on a feature, and the process continues until a stopping condition is met (e.g., all samples in a node belong to the same class).
For a new input, the decision tree follows the rules down the tree until it reaches a leaf node, which provides the class prediction.
Example:

Feature: "Annual Income"
Split: "Is Annual Income > $50,000?" → Follow one branch if "Yes," another if "No."
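As a minimal sketch of how such a split could be scored, the snippet below computes the Gini impurity before and after the "Annual Income > $50,000" split. The income values, the "High"/"Low" labels, and the helper names gini and weighted_gini_after_split are made up for illustration; a real tree would compare many candidate thresholds and features rather than a single one.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def weighted_gini_after_split(feature, labels, threshold):
    """Size-weighted average impurity of the two child nodes."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    right = feature > threshold          # "Yes" branch
    left = ~right                        # "No" branch
    n = labels.size
    return (right.sum() / n) * gini(labels[right]) + (left.sum() / n) * gini(labels[left])

# Hypothetical toy data (not from any real dataset)
income = np.array([30_000, 42_000, 48_000, 55_000, 61_000, 75_000])
risk = np.array(["High", "High", "High", "Low", "Low", "Low"])

parent_impurity = gini(risk)                                      # 0.5: classes evenly mixed
split_impurity = weighted_gini_after_split(income, risk, 50_000)  # 0.0: both children are pure
print(parent_impurity, split_impurity)
```

The drop from the parent impurity to the weighted child impurity is the quantity a tree tries to maximize when choosing a split.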
How Multiple Linear Regression Makes Predictions:
Multiple linear regression models a continuous target variable by finding the best-fit linear relationship between the input features and the target.
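The model predicts the target as a weighted sum of the input features plus an intercept (bias) term, so its output is a continuous number rather than a class label reached by following a sequence of threshold rules.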
Equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$, where $y$ is the predicted output, $x_1, x_2, \dots, x_n$ are the features, and $\beta_0, \beta_1, \dots, \beta_n$ are the learned coefficients.
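To make the weighted sum concrete, here is a small sketch that evaluates the equation for one observation. The intercept, coefficients, and feature values are arbitrary numbers chosen for illustration, not fitted values.

```python
import numpy as np

# Hypothetical intercept and coefficients (not estimated from data)
beta0 = 2.0
betas = np.array([0.5, -1.0, 3.0])

# One observation with three features x1, x2, x3
x = np.array([4.0, 1.0, 2.0])

# Prediction = intercept + weighted sum of the features
y_hat = beta0 + betas @ x   # 2.0 + (0.5*4.0) + (-1.0*1.0) + (3.0*2.0) = 9.0
print(y_hat)
```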

2)
1. Accuracy

Definition:
Measures the proportion of all correct predictions (true positives and true negatives) out of the total predictions.
Best Scenario:
Accuracy is most appropriate in scenarios where the classes are balanced and all errors are equally important. For example:

Image Classification in Autonomous Vehicles: Classifying road signs like "Stop," "Yield," and "Speed Limit," assuming each sign type is about equally represented in the data and each kind of misclassification is treated as similarly costly.
Rationale:
When each class has roughly the same number of observations, and false positives and false negatives carry similar weight, accuracy gives a clear view of overall model performance.

2. Sensitivity (Recall)

Definition:
Measures the proportion of actual positives that are correctly predicted.
Best Scenario:
Sensitivity is critical in situations where missing positives (false negatives) has severe consequences. For example:

Medical Diagnostics for Rare Diseases: Detecting cancer or other life-threatening conditions. Missing a case (false negative) can be life-threatening.
Rationale:
High sensitivity ensures that most positive cases are identified, minimizing false negatives even if it means tolerating more false positives.

3. Specificity

Definition:
Measures the proportion of actual negatives that are correctly predicted.
Best Scenario:
Specificity is key when avoiding false positives is more critical than catching every positive. For example:

Legal Applications in Criminal Justice: Screening candidates for parole eligibility. Incorrectly labeling a safe candidate as dangerous (false positive) could have societal consequences.
Rationale:
High specificity ensures that most negatives are correctly identified, minimizing false positives, which can lead to unnecessary actions or mistrust in the system.

4. Precision

Definition:
Measures the proportion of predicted positives that are actually positive.
Best Scenario:
Precision is crucial in scenarios where false positives are particularly costly or harmful. For example:

Spam Email Detection: Identifying spam emails. Flagging legitimate emails as spam (false positives) can lead to important communications being missed.
Rationale:
High precision ensures that when the model predicts a positive (e.g., spam), it’s almost always correct, reducing the inconvenience of false positives.
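The four definitions above can be written directly in terms of confusion-matrix counts. The sketch below assumes a binary problem; the function name classification_metrics and the counts passed to it are hypothetical and serve only to show the arithmetic.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four metrics from raw confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # all correct predictions / all predictions
        "sensitivity": tp / (tp + fn),   # correctly predicted positives / actual positives
        "specificity": tn / (tn + fp),   # correctly predicted negatives / actual negatives
        "precision": tp / (tp + fp),     # correctly predicted positives / predicted positives
    }

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.85, sensitivity 0.80, specificity 0.90, and precision about 0.889, which shows how the same predictions can score differently depending on which errors a metric penalizes.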

3)

import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, make_scorer
import graphviz as gv

url = "https://raw.githubusercontent.com/pointOfive/STA130_F23/main/Data/amazonbooks.csv"
ab = pd.read_csv(url, encoding="ISO-8859-1")

In [None]:
# Import necessary libraries
import pandas as pd

# Load the dataset from the provided URL
url = "https://raw.githubusercontent.com/pointOfive/STA130_F23/main/Data/amazonbooks.csv"
ab = pd.read_csv(url, encoding="ISO-8859-1")

# Subset the data to exclude the specified columns
columns_to_keep = [col for col in ab.columns if col not in ["Weight_oz", "Width", "Height"]]
ab_reduced = ab[columns_to_keep]

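# Drop rows with NaN entries in the remaining columns
# (.copy() makes an independent DataFrame so the column assignments below
# do not trigger pandas' SettingWithCopyWarning)
ab_reduced_noNaN = ab_reduced.dropna().copy()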

# Set data types as specified
ab_reduced_noNaN["Pub year"] = ab_reduced_noNaN["Pub year"].astype(int)
ab_reduced_noNaN["NumPages"] = ab_reduced_noNaN["NumPages"].astype(int)
ab_reduced_noNaN["Hard_or_Paper"] = ab_reduced_noNaN["Hard_or_Paper"].astype("category")

# Display basic summary of the processed dataset
ab_reduced_noNaN.info()
print(ab_reduced_noNaN.describe())
print(ab_reduced_noNaN.head())

4)

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Split the dataset into training and testing sets (80/20 split)
ab_reduced_noNaN_train, ab_reduced_noNaN_test = train_test_split(
    ab_reduced_noNaN, test_size=0.2, random_state=42
)

# Report the number of observations in the training and testing sets
train_size = ab_reduced_noNaN_train.shape[0]
test_size = ab_reduced_noNaN_test.shape[0]

# Define the target variable (y) and feature (X)
y = pd.get_dummies(ab_reduced_noNaN_train["Hard_or_Paper"])["H"]
X = ab_reduced_noNaN_train[["List Price"]]

# Train a DecisionTreeClassifier with max_depth=2
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(X, y)

# Plot the fitted decision tree
tree_plot = tree.plot_tree(clf, feature_names=["List Price"], class_names=["Paperback", "Hardcover"], filled=True)

train_size, test_size, clf



5)

In [2]:
# Define the new feature set (X) for the second model
X2 = ab_reduced_noNaN_train[["NumPages", "Thick", "List Price"]]

# Train a DecisionTreeClassifier with max_depth=4
clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)
clf2.fit(X2, y)

# Plot the fitted decision tree for clf2
tree_plot2 = tree.plot_tree(
    clf2,
    feature_names=["NumPages", "Thick", "List Price"],
    class_names=["Paperback", "Hardcover"],
    filled=True
)

clf2


In [3]:
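import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reload the dataset and preprocess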
url = "https://raw.githubusercontent.com/pointOfive/STA130_F23/main/Data/amazonbooks.csv"
ab = pd.read_csv(url, encoding="ISO-8859-1")

# Subset to exclude specified columns
columns_to_keep = [col for col in ab.columns if col not in ["Weight_oz", "Width", "Height"]]
ab_reduced = ab[columns_to_keep]

# Drop rows with NaN entries (.copy() avoids pandas' SettingWithCopyWarning below)
ab_reduced_noNaN = ab_reduced.dropna().copy()

# Convert data types as specified
ab_reduced_noNaN["Pub year"] = ab_reduced_noNaN["Pub year"].astype(int)
ab_reduced_noNaN["NumPages"] = ab_reduced_noNaN["NumPages"].astype(int)
ab_reduced_noNaN["Hard_or_Paper"] = ab_reduced_noNaN["Hard_or_Paper"].astype("category")

# Perform the 80/20 train-test split
ab_reduced_noNaN_train, ab_reduced_noNaN_test = train_test_split(
    ab_reduced_noNaN, test_size=0.2, random_state=42
)

# Define the target and features for the second model
y2 = pd.get_dummies(ab_reduced_noNaN_train["Hard_or_Paper"])["H"]
X2 = ab_reduced_noNaN_train[["NumPages", "Thick", "List Price"]]

# Train a DecisionTreeClassifier with max_depth=4
clf2 = DecisionTreeClassifier(max_depth=4, random_state=42)
clf2.fit(X2, y2)

# Plot the fitted decision tree for clf2
tree_plot2 = tree.plot_tree(
    clf2,
    feature_names=["NumPages", "Thick", "List Price"],
    class_names=["Paperback", "Hardcover"],
    filled=True
)

ab_reduced_noNaN_train.shape[0], ab_reduced_noNaN_test.shape[0], clf2



6)
1. Key Concepts
Positive (P): In our case, a "Hardcover" book is treated as the positive class.
Negative (N): A "Paperback" book is the negative class.
True Positive (TP): A hardcover book correctly predicted as hardcover.
True Negative (TN): A paperback book correctly predicted as paperback.
False Positive (FP): A paperback book incorrectly predicted as hardcover.
False Negative (FN): A hardcover book incorrectly predicted as paperback.
Confusion Matrix in sklearn:
Rows represent the true labels (y_true).
Columns represent the predicted labels (y_pred).
Order of Arguments in confusion_matrix:
confusion_matrix(y_true, y_pred) → True labels first, predicted labels second.
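The row/column convention and the argument order can be checked on a small example. The labels below are invented for illustration (1 stands for "Hardcover", 0 for "Paperback"); they are not predictions from clf or clf2.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = Hardcover (positive), 0 = Paperback (negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)

# Swapping the arguments transposes the matrix, which swaps FP and FN
cm_swapped = confusion_matrix(y_pred, y_true, labels=[0, 1])
print(cm_swapped)
```

Because the false positive and false negative counts differ here, the swapped call gives a visibly different matrix, which is why the argument order matters.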


7)
The differences between the two confusion matrices are primarily caused by the number and type of features used to make predictions. The first confusion matrix uses only the "List Price" feature to classify books, which may not provide enough information for accurate classification. In contrast, the second matrix incorporates two additional features, "NumPages" and "Thick", allowing the model to better capture the relationships between predictors and the target variable.

The confusion matrices for the test set (clf and clf2) are better because they evaluate the models on unseen data, providing a more reliable assessment of how well the models generalize to new observations. Evaluating on training data often leads to overly optimistic results, as the model has already been optimized for this data. This distinction highlights the importance of using a separate test set for validation.
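The gap between training and test performance can be illustrated with a quick sketch. It uses a synthetic dataset from make_classification rather than the Amazon books data, and an unrestricted tree (the name deep_tree is hypothetical), so the numbers say nothing about clf or clf2; the point is only that training accuracy tends to be optimistic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (an assumption for illustration)
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A tree with no depth limit can memorize the training set
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, deep_tree.predict(X_train))
test_acc = accuracy_score(y_test, deep_tree.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The training accuracy is perfect because the tree was grown on exactly those rows, while the test accuracy reflects how the tree handles rows it has not seen.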

8)
To visualize and interpret feature importances in a scikit-learn Decision Tree, you can use the .feature_importances_ attribute of the trained model. This attribute provides the relative importance of each predictor variable in determining the predictions, calculated based on the reduction in the chosen criterion (e.g., Gini impurity or Shannon entropy) contributed by splits involving that feature.

Here’s how you can visualize feature importances and identify the most important predictor for clf2:

Steps:
Access clf2.feature_importances_: This provides an array of importance scores, one for each feature.
Access clf2.feature_names_in_: This lists the names of the features used in the model, matching the order of the importance scores.
Visualize Importances: Use a bar plot to display the importance values for each feature, making it easier to interpret.
Determine the Most Important Feature: Find the feature with the highest importance score.


In [4]:
import matplotlib.pyplot as plt
import numpy as np

# Extract feature importances and feature names
feature_importances = clf2.feature_importances_
feature_names = clf2.feature_names_in_

# Identify the most important feature
most_important_feature = feature_names[np.argmax(feature_importances)]

# Visualize feature importances as a bar chart
plt.figure(figsize=(8, 6))
plt.barh(feature_names, feature_importances, color='skyblue')
plt.xlabel("Feature Importance")
plt.title("Feature Importances in clf2")
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

most_important_feature


In linear regression, the coefficients represent the direct impact of each predictor variable on the outcome, assuming all other variables are held constant. Each coefficient indicates how much the predicted value of the target variable changes with a one-unit change in the predictor, making interpretation straightforward and additive.

In contrast, feature importances in decision trees indicate how much each predictor variable contributes to reducing uncertainty (e.g., Gini impurity or entropy) across all splits in the tree. This interpretation is less direct, as it aggregates the contributions of a feature across potentially complex interactions and non-linear relationships throughout the tree.
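The contrast can be seen side by side in a small sketch. It uses synthetic data generated from a known linear rule (an assumption for illustration, unrelated to the books dataset), and the names lin and tree_clf are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

# Continuous target built from a known rule: y = 1 + 2*x0 - 3*x1 + small noise
y_cont = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
# Binary target for the tree, derived from the same signal
y_class = (y_cont > 1.0).astype(int)

lin = LinearRegression().fit(X, y_cont)
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_class)

# Coefficients are signed: direction and size of the change per one-unit increase
print("coefficients:", lin.coef_)
# Importances are non-negative and sum to 1: each feature's share of impurity reduction
print("importances: ", tree_clf.feature_importances_)
```

The coefficients recover the sign and approximate size of each effect, whereas the importances only rank how much each feature was used to reduce impurity and carry no direction.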