## Lecture 4, Part 2

1. Start by normalizing the data and separating a validation set with 20\% of the data randomly selected. The remaining 80\% will be called the sub-dataset.

In [96]:
import numpy as np
import importlib

import ForwardSelection
import ROCAnalysis
import MachineLearningModel

importlib.reload(ForwardSelection)
importlib.reload(ROCAnalysis)
importlib.reload(MachineLearningModel)

from ForwardSelection import ForwardSelection
from ROCAnalysis import ROCAnalysis
from MachineLearningModel import LogisticRegression

# Load the data
data = np.genfromtxt('./resources/datasets/heart_disease_cleveland.csv', delimiter=',', skip_header=1) 
X = data[:, :-1]
y = data[:, -1]

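# Normalize every feature with the model's own helper.
# Note: the normalization statistics are computed on the full dataset, before the split.
model = LogisticRegression()
X_normalized = model.normalize(X)

# Shuffle the row indices reproducibly, then hold out the last 20% for validation.
seed = 14159
np.random.seed(seed)
indices = np.random.permutation(len(X_normalized))
split_point = int(0.8 * len(X_normalized))

X_subdataset = X_normalized[indices[:split_point]]
y_subdataset = y[indices[:split_point]]
X_validation = X_normalized[indices[split_point:]]
y_validation = y[indices[split_point:]]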
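
The cell above normalizes before splitting, so the validation rows influence the normalization statistics. A minimal sketch of the alternative, splitting first and normalizing with statistics from the training rows only, is shown here. It assumes z-score standardization (what `model.normalize` does internally is not shown), and the helper name `standardize_with_train_stats` and the synthetic data are hypothetical stand-ins, not part of the assignment code.

```python
import numpy as np

def standardize_with_train_stats(X_train, X_holdout):
    """Z-score both arrays using the mean and std of X_train only (hypothetical helper)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_holdout - mean) / std

# Synthetic stand-in data (assumption: not the heart-disease file).
rng = np.random.default_rng(14159)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first: 80% training rows, 20% held-out rows.
indices = rng.permutation(len(X_raw))
split_point = int(0.8 * len(X_raw))
X_train_raw = X_raw[indices[:split_point]]
X_holdout_raw = X_raw[indices[split_point:]]

# Then normalize, so the held-out rows never touch the statistics.
X_train, X_holdout = standardize_with_train_stats(X_train_raw, X_holdout_raw)
```

With this ordering the training columns have zero mean and unit variance exactly, while the held-out columns are only approximately standardized, which is the expected behavior for unseen data.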

2. Use your implementation of forward selection to estimate a reasonable classification model. You must use your implementation of Logistic Regression in this assignment. The decision to make a reasonable number of iterations and learning rate is up to you but must be justified. Optimize the model selection to produce the best f-score. You must use the sub-dataset in your forward selection process. Report the features selected by this process and discuss your results. 

--- Your answer here --- 

In [97]:
model = LogisticRegression(learning_rate=0.01, num_iterations=1000)

forward_selection = ForwardSelection(X_subdataset, y_subdataset, model, seed)
forward_selection.forward_selection()
selected_features = forward_selection.selected_features
best_fscore = forward_selection.best_cost

# Report the selected features and the best F-score
print("\nSelected features:", selected_features)
print(f"Best F-score: {best_fscore:.3f}")

Current features: []. Best feature to add: 2 with F-score: 0.795
Current features: [2]. Best feature to add: 6 with F-score: 0.797
Current features: [2, 6]. Best feature to add: 3 with F-score: 0.811
Current features: [2, 6, 3]. Best feature to add: 0 with F-score: 0.811
Current features: [2, 6, 3, 0]. Best feature to add: 1 with F-score: 0.820
Current features: [2, 6, 3, 0, 1]. Best feature to add: 10 with F-score: 0.827
Current features: [2, 6, 3, 0, 1, 10]. Best feature to add: 11 with F-score: 0.860
Current features: [2, 6, 3, 0, 1, 10, 11]. Best feature to add: 4 with F-score: 0.860
Current features: [2, 6, 3, 0, 1, 10, 11, 4]. Best feature to add: 5 with F-score: 0.860
Current features: [2, 6, 3, 0, 1, 10, 11, 4, 5]. Best feature to add: 12 with F-score: 0.860
Current features: [2, 6, 3, 0, 1, 10, 11, 4, 5, 12]. Best feature to add: 7 with F-score: 0.853
Current features: [2, 6, 3, 0, 1, 10, 11, 4, 5, 12, 7]. Best feature to add: 8 with F-score: 0.785
Current features: [2, 6, 3, 
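
The log above shows the greedy pattern of forward selection: at each step, the feature that gives the best F-score when added to the current set is chosen. The following is a minimal sketch of that idea, not the `ForwardSelection` class used in this notebook. The functions `f_score`, `fit_predict_logreg`, and `forward_selection_sketch` are hypothetical, the data is synthetic, and the rule of running through all features and keeping the smallest prefix with the best score is an assumption about how the final subset is picked.

```python
import numpy as np

def f_score(y_true, y_pred):
    """F1 score computed from binary labels and binary predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def fit_predict_logreg(X_fit, y_fit, X_eval, learning_rate=0.1, num_iterations=500):
    """Batch gradient descent logistic regression; returns 0/1 predictions for X_eval."""
    Xb_fit = np.hstack([np.ones((len(X_fit), 1)), X_fit])    # prepend bias column
    Xb_eval = np.hstack([np.ones((len(X_eval), 1)), X_eval])
    w = np.zeros(Xb_fit.shape[1])
    for _ in range(num_iterations):
        p = 1.0 / (1.0 + np.exp(-Xb_fit @ w))
        w -= learning_rate * Xb_fit.T @ (p - y_fit) / len(y_fit)
    return (1.0 / (1.0 + np.exp(-Xb_eval @ w)) >= 0.5).astype(int)

def forward_selection_sketch(X_fit, y_fit, X_eval, y_eval):
    """Greedily add the feature with the best F-score; return the best prefix and its score."""
    selected, history = [], []
    remaining = list(range(X_fit.shape[1]))
    while remaining:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            y_pred = fit_predict_logreg(X_fit[:, cols], y_fit, X_eval[:, cols])
            scores[j] = f_score(y_eval, y_pred)
        j_best = max(scores, key=scores.get)
        selected.append(j_best)
        remaining.remove(j_best)
        history.append(scores[j_best])
    best_size = int(np.argmax(history)) + 1  # smallest prefix that reaches the best score
    return selected[:best_size], history[best_size - 1]

# Synthetic data (assumption): feature 0 drives the label, feature 3 helps a little.
rng = np.random.default_rng(14159)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.3 * X[:, 3] > 0).astype(int)
selected, best = forward_selection_sketch(X[:200], y[:200], X[200:], y[200:])
print("Selected features:", selected)
print(f"Best F-score: {best:.3f}")
```

Scoring on rows that were not used for fitting, as done here, is one way to keep the selection score from being overly optimistic. Stopping at the first step that fails to improve is a common alternative to keeping the best prefix, but it would halt at a tie such as the 0.811 to 0.811 step in the log above.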

3. Report the performance of the best model in the validation set regarding all statistics available in your ROCAnalysis class. 
Was the process successful when compared to using all features?  
Discuss your results regarding these metrics and what you can conclude from this experiment.

Discussion:  
On the validation set, the model built from the forward-selected features reached a precision of 0.719, a recall of 0.731, a false positive rate of 0.286, and an F1 score of 0.725. Precision and recall are nearly equal, so the model is not trading one heavily for the other. The F1 score is, however, noticeably lower than the best F-score of 0.860 reported during selection on the sub-dataset, which suggests the selection score was optimistic for unseen data.

The model using all features scored better on every metric: precision 0.809, recall 0.846, false positive rate 0.200, and F1 score 0.827. Taken at face value, this suggests that the features left out by forward selection carry information that helps classification. The comparison is not fully like-for-like, though: the all-features model was fit on the full normalized dataset, which includes the validation rows, so its validation metrics are likely optimistic.

In conclusion, forward selection was not successful here when compared with using all features: the selected subset produced a usable model, but it did not match the all-features model on the validation set. A fairer comparison would fit the all-features model on the sub-dataset only; it may also be worth trying other feature selection methods.
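
As a sanity check on the numbers in this discussion, the F1 score can be recomputed from the reported precision and recall, since it is their harmonic mean. The sketch below does that, and also shows how the four statistics follow from confusion-matrix counts. The function names are hypothetical and independent of the `ROCAnalysis` class, and the counts passed to `metrics_from_counts` are made up for illustration.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def metrics_from_counts(tp, fp, tn, fn):
    """Precision, recall (TP rate), FP rate, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "fp_rate": fp_rate,
        "f_score": f1_from_precision_recall(precision, recall),
    }

# Recompute F1 from the precision and recall reported for each model.
print(f"{f1_from_precision_recall(0.719, 0.731):.3f}")  # forward selection -> 0.725
print(f"{f1_from_precision_recall(0.809, 0.846):.3f}")  # all features -> 0.827

# Hypothetical counts, only to show the formulas in use.
example = metrics_from_counts(tp=8, fp=2, tn=7, fn=3)
```

Both recomputed values match the F1 scores printed by the notebook, so the reported precision, recall, and F1 are mutually consistent.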

In [105]:
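# Evaluate the forward-selection model on the held-out validation set,
# restricted to the selected feature columns.
X_validation_selected = X_validation[:, selected_features]

y_pred = model.predict(X_validation_selected)
# Binarize labels and predictions at 0.5 before computing the metrics.
y_validation_normalized = (y_validation >= 0.5).astype(int)
y_pred_normalized = (y_pred >= 0.5).astype(int)
roc_analysis = ROCAnalysis(y_validation_normalized, y_pred_normalized)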

print("Performance using forward selection:")
print("Precision:", f"{roc_analysis.precision():.3f}")
print("Recall (TP Rate):", f"{roc_analysis.tp_rate():.3f}")
print("False Positive Rate:", f"{roc_analysis.fp_rate():.3f}")
print("F1 Score:", f"{roc_analysis.f_score():.3f}")

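# Baseline: same hyperparameters, all features.
# Caveat: this fit uses the full normalized dataset, which includes the validation rows.
# Fitting on X_subdataset, y_subdataset instead would keep the validation set unseen.
model_all = LogisticRegression(learning_rate=0.01, num_iterations=1000)
model_all.fit(X_normalized, y)
y_pred = model_all.predict(X_validation)
y_pred = (y_pred >= 0.5).astype(int)
# Note: y_validation is passed as loaded here, whereas the call above thresholds it at 0.5.
roc_analysis_all_features = ROCAnalysis(y_validation, y_pred)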

print("\nPerformance using all features:")
print("Precision:", f"{roc_analysis_all_features.precision():.3f}")
print("Recall (TP Rate):", f"{roc_analysis_all_features.tp_rate():.3f}")
print("False Positive Rate:", f"{roc_analysis_all_features.fp_rate():.3f}")
print("F1 Score:", f"{roc_analysis_all_features.f_score():.3f}")

Performance using forward selection:
Precision: 0.719
Recall (TP Rate): 0.731
False Positive Rate: 0.286
F1 Score: 0.725

Performance using all features:
Precision: 0.809
Recall (TP Rate): 0.846
False Positive Rate: 0.200
F1 Score: 0.827
