In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## The Heart Dataset

File name: 'Heart_Dataset_2.csv'

This dataset has been obtained from Kaggle.

The dataset contains 303 observations with 13 features and 1 class label with 0 and 1 values.
These features are discussed below:
1. age: in years
2. gender: (1 = male; 0 = female)
3. cp: chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
4. trestbps: resting blood pressure, in mm Hg on admission to the hospital
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results (values: 0,1,2)
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: (3 = normal; 6 = fixed defect; 7 = reversible defect)
14. target: the predicted attribute, diagnosis of heart disease (0 = fit; 1 = diseased)

This is a binary classification problem.
The dataset is clean and contains no non-numeric columns: every feature, including the categorical ones, is already encoded as a number.

In [None]:
# Loading and exploring dataset
import pandas as pd
#Reading the file into a dataframe
PATH = '/content/drive/MyDrive/AILab2025/Lab03ANN'
data=pd.read_csv(f'{PATH}/D6_Heart_Dataset_2.csv')

#Displaying the read contents
data

Unnamed: 0,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [None]:
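# Separating predictors and target
# Note: 'Unnamed: 0' (the saved row index) stays in X and is used as a predictor here;
# add it to drop() if it should be excluded (the outputs below would then change).
X = data.drop("target", axis=1)  # predictors
Y = data["target"]  # target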

In [None]:
# Splitting into train and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X, Y,test_size=0.40,random_state=0)
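
As a quick sanity check on the split above, the sketch below reproduces the resulting set sizes on a placeholder array with the same number of rows (303). `X_demo` and `y_demo` are hypothetical stand-ins, not the heart data; only the row count and the `test_size`/`random_state` arguments are taken from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same number of rows as the heart dataset
n_rows = 303
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows) % 2

# Same split arguments as in the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.40, random_state=0
)

n_train, n_test = len(X_tr), len(X_te)
print(n_train, n_test)
```

With a 40% test fraction the test set size is rounded up, so 303 rows should split into 181 training rows and 122 test rows.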

## Logistic Regression

### Hyperparameters

1. tol
- Default: 0.0001
- Type: float
- Tolerance for stopping criteria. The optimization stops when the change in the cost function is below this threshold.

2. max_iter
- Default: 100
- Type: int
- Maximum number of iterations taken for the solver to converge.

3. random_state
- Default: None
- Type: int, RandomState instance, or None
- Controls the randomness used to shuffle the data; it only has an effect with the 'sag', 'saga', and 'liblinear' solvers.
- Setting a fixed integer ensures reproducible results.

4. multi_class
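- Default: 'auto' (in scikit-learn 1.5 and later the parameter is deprecated and its default is 'deprecated')
- Options: 'auto', 'ovr', 'multinomial'
- Controls how multi-class classification is handled: 'ovr' fits one binary problem per class, while 'multinomial' minimizes a single multinomial loss over all classes.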

5. penalty
- Default: 'l2'
- Options: 'l1', 'l2', 'elasticnet', None (older versions used the string 'none')
- This defines the type of regularization applied to the model’s coefficients to prevent overfitting.

6. l1_ratio
- Default: None
- Type: float between 0 and 1
- Used only when penalty='elasticnet'.
- Controls the mix of L1 and L2 regularization (0 = pure L2, 1 = pure L1).

7. class_weight
- Default: None
- Options: {dict, 'balanced', None}
- Adjusts the penalty associated with misclassifying each class. Helps handle imbalanced datasets.
- None: All classes are treated equally.
- 'balanced': Automatically sets weights inversely proportional to class frequencies (useful for imbalanced data). For example, in a spam filter where 95% of emails are non-spam and 5% are spam, the spam class is 95/5 = 19 times rarer, so a mistake on a spam email is weighted 19 times more heavily. This balances the cost of mistakes and forces the model to pay serious attention to the rare class you actually care about finding.

8. solver
- The specific optimization algorithm used to minimize the error. It is the "how-to" for finding that minimum.
- Default: 'lbfgs'
- Options: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'
- lbfgs (supports l2): A good general-purpose solver. It's fast and effective for most datasets.
- liblinear (supports l1, l2): Good for small datasets. It's very efficient for smaller amounts of data and is one of the few solvers that supports the l1 penalty.
- saga (supports l1, l2, elasticnet): Good for large datasets. This is often the best choice for very large datasets and is the only solver that supports all penalty types.
- sag (supports l2): A predecessor to saga. It's also very fast for large datasets.
- newton-cg (supports l2): A solid choice that can be effective, especially for multiclass problems.

9. dual
- Default: False
- Type: bool
- Controls whether the optimization problem is solved in its dual (True) or primal (False) form.
- Primal form: directly optimizes the weight vector.
- Dual form: optimizes over Lagrange multipliers (useful for certain solvers and data shapes).
- The dual form is useful when the number of features >> number of samples (i.e., very high-dimensional data), where it usually trains faster.
- dual=True is only supported for the 'liblinear' solver and only when penalty='l2'.

10. warm_start
- Default: False
- Type: bool
- If True, reuse the solution from the previous fit() call as initialization.

11. n_jobs
- Default: None
- Type: int
- Number of CPU cores used when parallelizing over classes (one-vs-rest multi-class fitting). Ignored when solver='liblinear'.
- -1 means use all processors.

12. fit_intercept
- Default: True
- Type: bool
- Whether to calculate the intercept (bias) term.
- If False, no intercept is added to the decision function (the model assumes the data is already centered).

13. intercept_scaling
- Default: 1
- Type: float
- Used only when solver='liblinear' and fit_intercept=True.

14. C
- Default: 1.0
- Type: float, must be > 0
- Inverse of regularization strength (C = 1 / λ). Smaller values mean stronger regularization.

15. verbose
- Default: 0
- Type: int
- Controls the amount of output printed during training.
- 0: Silent
- greater than 0: More detailed logs (useful for debugging).
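
The 'balanced' option described under class_weight can be checked numerically. The sketch below uses scikit-learn's `compute_class_weight` helper on hypothetical labels that mirror the 95%/5% spam example (`y_demo` is made up for illustration, not taken from the heart data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 95 non-spam (class 0) and 5 spam (class 1)
y_demo = np.array([0] * 95 + [1] * 5)

# 'balanced' computes n_samples / (n_classes * count_per_class) for each class
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# How much more a mistake on the rare class costs than one on the common class
weight_ratio = weights[1] / weights[0]
print(weights, weight_ratio)
```

The absolute weights depend on the sample count, but their ratio is what matters: here the rare class is weighted about 19 times more heavily than the common one.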


In [None]:
%%time
# Create logistic regression object
from sklearn.linear_model import LogisticRegression
logistic_regression1 = LogisticRegression(solver="liblinear", random_state=10)
# logistic_regression1 = LogisticRegression(solver="liblinear", random_state=10, penalty="l1")
# logistic_regression1 = LogisticRegression(solver="liblinear", random_state=10, penalty="l1",class_weight={0: 1, 1:50} )
# treat any mistake on Class 1 as 50 times worse than a mistake on Class 0.

# Train model
model1 = logistic_regression1.fit(X_train, Y_train)

CPU times: user 152 ms, sys: 13.4 ms, total: 165 ms
Wall time: 219 ms


In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix

## Performance Evaluation on Train Set

**Note:**
- The .predict() method of scikit-learn's LogisticRegression returns the final class label; for a binary problem this corresponds to a fixed probability threshold of **0.5**.
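
The note on the 0.5 threshold can be illustrated with a small sketch. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the heart dataset) and compares `.predict()` with a manual threshold on the class 1 probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (a stand-in, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(solver="liblinear", random_state=10).fit(X_demo, y_demo)

# Probability of class 1, thresholded manually at 0.5
proba_pos = clf.predict_proba(X_demo)[:, 1]
manual_pred = (proba_pos > 0.5).astype(int)

# Compare the manual labels with the labels returned by predict()
same = np.array_equal(manual_pred, clf.predict(X_demo))
print(same)
```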

In [None]:
Y_pred_train = model1.predict(X_train)

# Printing results
print(confusion_matrix(Y_train, Y_pred_train))
print("Accuracy: ",metrics.accuracy_score(Y_train,Y_pred_train))
print('Precision: ',metrics.precision_score(Y_train,Y_pred_train))
print('Recall score: ',metrics.recall_score(Y_train,Y_pred_train))
print('F1 score: ',metrics.f1_score(Y_train,Y_pred_train))

[[66 13]
 [ 8 94]]
Accuracy:  0.8839779005524862
Precision:  0.8785046728971962
Recall score:  0.9215686274509803
F1 score:  0.8995215311004785
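
To see where these numbers come from, the sketch below recomputes the four metrics by hand from the confusion matrix printed above. scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], with rows for the actual class and columns for the predicted class.

```python
# Counts taken from the training confusion matrix printed above
tn, fp = 66, 13
fn, tp = 8, 94

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # predicted positives that are correct
recall = tp / (tp + fn)                     # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```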


In [None]:
# This gives probabilities for both class 0 and class 1
probabilities = model1.predict_proba(X_train)
# The first column is the probability of class 0, the second of class 1

# Use this if we only need the probabilities for the positive class (class 1)
positive_probabilities = probabilities[:, 1]
print(probabilities)
print(positive_probabilities)

[[1.46921818e-01 8.53078182e-01]
 [5.95708227e-01 4.04291773e-01]
 [1.54419749e-01 8.45580251e-01]
 [9.01768182e-02 9.09823182e-01]
 [3.84756771e-01 6.15243229e-01]
 [3.56905143e-01 6.43094857e-01]
 [9.13847920e-03 9.90861521e-01]
 [6.47409854e-02 9.35259015e-01]
 [3.61707218e-02 9.63829278e-01]
 [8.88083799e-01 1.11916201e-01]
 [2.91776912e-01 7.08223088e-01]
 [1.73937023e-01 8.26062977e-01]
 [2.40795292e-01 7.59204708e-01]
 [8.09697858e-01 1.90302142e-01]
 [2.84081406e-01 7.15918594e-01]
 [2.99341411e-02 9.70065859e-01]
 [9.90730932e-01 9.26906793e-03]
 [9.23733609e-01 7.62663907e-02]
 [5.42690913e-01 4.57309087e-01]
 [3.80935298e-01 6.19064702e-01]
 [1.34157553e-01 8.65842447e-01]
 [2.92410302e-01 7.07589698e-01]
 [1.95199066e-01 8.04800934e-01]
 [2.01495149e-01 7.98504851e-01]
 [2.31062343e-01 7.68937657e-01]
 [8.25422940e-02 9.17457706e-01]
 [5.51548005e-02 9.44845199e-01]
 [1.29916017e-02 9.87008398e-01]
 [1.64958187e-01 8.35041813e-01]
 [3.56112011e-02 9.64388799e-01]
 [1.523253

## Model Evaluation using a different threshold

In [None]:
# Applying a custom threshold (e.g., 0.3)
custom_threshold = 0.3

# Classify based on the custom threshold
# If the probability >= custom_threshold, the prediction is 1, otherwise 0
Y_pred_train_2 = (positive_probabilities >= custom_threshold).astype(int)


# Printing results
print(confusion_matrix(Y_train, Y_pred_train_2))
print("Accuracy: ",metrics.accuracy_score(Y_train,Y_pred_train_2))
print('Precision: ',metrics.precision_score(Y_train,Y_pred_train_2))
print('Recall score: ',metrics.recall_score(Y_train,Y_pred_train_2))
print('F1 score: ',metrics.f1_score(Y_train,Y_pred_train_2))

[[53 26]
 [ 3 99]]
Accuracy:  0.8397790055248618
Precision:  0.792
Recall score:  0.9705882352941176
F1 score:  0.8722466960352423
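
Lowering the threshold from 0.5 to 0.3 raised recall (0.92 to 0.97) at the cost of precision (0.88 to 0.79) in the results above. The sketch below shows the same trade-off across several thresholds; it is a rough illustration on synthetic data from `make_classification`, not on the heart dataset, and the threshold values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic binary data (a stand-in, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
clf = LogisticRegression(solver="liblinear", random_state=10).fit(X_demo, y_demo)
proba_pos = clf.predict_proba(X_demo)[:, 1]

# Sweep from a strict threshold to a lenient one
thresholds = [0.7, 0.5, 0.3]
recalls = []
for th in thresholds:
    pred = (proba_pos >= th).astype(int)
    rec = recall_score(y_demo, pred)
    prec = precision_score(y_demo, pred, zero_division=0)
    recalls.append(rec)
    print(f"threshold={th:.1f}  precision={prec:.3f}  recall={rec:.3f}")
```

Recall can only stay the same or rise as the threshold drops, because every sample predicted positive at a higher threshold is still predicted positive at a lower one. Precision has no such guarantee and typically falls.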


## Visualizing all results side by side

In [None]:
# create a dictionary first
pred_results_dict = {'Actual Class': Y_train,
                     'Predicted Class (th:0.5)': Y_pred_train,
                     'Predicted Class (th:0.3)': Y_pred_train_2 }
# convert to a dataframe

train_comparison_df = pd.DataFrame(pred_results_dict)
print(train_comparison_df.head(20))

     Actual Class  Predicted Class (th:0.5)  Predicted Class (th:0.3)
159             1                         1                         1
282             0                         0                         1
110             1                         1                         1
21              1                         1                         1
29              1                         1                         1
150             1                         1                         1
16              1                         1                         1
75              1                         1                         1
109             1                         1                         1
179             0                         0                         0
283             0                         1                         1
4               1                         1                         1
96              1                         1                         1
229             0   
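
A side-by-side table like this is easiest to read when filtered down to the rows where the two thresholds disagree. The sketch below does that on a small frame built from five rows of the output above (`demo_df` is a hypothetical name; in the notebook the same filter could be applied to `train_comparison_df`).

```python
import pandas as pd

# A few rows copied from the comparison table above
demo_df = pd.DataFrame(
    {
        "Actual Class": [1, 0, 1, 0, 0],
        "Predicted Class (th:0.5)": [1, 0, 1, 0, 1],
        "Predicted Class (th:0.3)": [1, 1, 1, 0, 1],
    },
    index=[159, 282, 110, 179, 283],
)

# Keep only the rows where the two thresholds give different predictions
changed = demo_df[
    demo_df["Predicted Class (th:0.5)"] != demo_df["Predicted Class (th:0.3)"]
]
print(changed)
```

In this small sample only row 282 changes: it is an actual 0 that the 0.3 threshold turns into a false positive.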

## Performance Evaluation on Test Set

In [None]:
#Predictions
Y_pred_test = model1.predict(X_test)

# Printing results
print(confusion_matrix(Y_test, Y_pred_test))
print("Accuracy: ",metrics.accuracy_score(Y_test,Y_pred_test))
print('Precision: ',metrics.precision_score(Y_test,Y_pred_test))
print('Recall score: ',metrics.recall_score(Y_test,Y_pred_test))
print('F1 score: ',metrics.f1_score(Y_test,Y_pred_test))

[[45 14]
 [ 8 55]]
Accuracy:  0.819672131147541
Precision:  0.7971014492753623
Recall score:  0.873015873015873
F1 score:  0.8333333333333334


In [None]:
Y_pred_test

array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=int64)

In [None]:
model1.predict_proba(X_test)

array([[0.96706758, 0.03293242],
       [0.11449161, 0.88550839],
       [0.12822927, 0.87177073],
       [0.98195089, 0.01804911],
       [0.94763601, 0.05236399],
       [0.57778252, 0.42221748],
       [0.95714543, 0.04285457],
       [0.88538822, 0.11461178],
       [0.99807187, 0.00192813],
       [0.99817469, 0.00182531],
       [0.21677127, 0.78322873],
       [0.04540871, 0.95459129],
       [0.97067196, 0.02932804],
       [0.12837825, 0.87162175],
       [0.03779956, 0.96220044],
       [0.37278851, 0.62721149],
       [0.97468625, 0.02531375],
       [0.3987807 , 0.6012193 ],
       [0.99703381, 0.00296619],
       [0.3776972 , 0.6223028 ],
       [0.14681785, 0.85318215],
       [0.8204537 , 0.1795463 ],
       [0.93981535, 0.06018465],
       [0.89336343, 0.10663657],
       [0.13633489, 0.86366511],
       [0.58001986, 0.41998014],
       [0.84958516, 0.15041484],
       [0.6103472 , 0.3896528 ],
       [0.02438567, 0.97561433],
       [0.30896665, 0.69103335],
       [0.