In [3]:
import numpy as np
import pandas as pd
import os


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


df = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv


**The following block of code performs One-Hot Encoding on a set of categorical features in the dataset using scikit-learn's OneHotEncoder.**

## OneHotEncoder:
- It is a tool from the ```sklearn.preprocessing``` library.
- It converts categorical arguements/values into 0s or 1s
- At a time, only a particular value is 1 or 'hot'
- This encodes a set of values (eg. strings) into numerical data which helps models learn, train and test

### How it all Encoding works:
- First, we define the columns we want to encode into 0s and 1s
- We then pass them to OneHotEncoder's ```fit_transform``` function. Each value is converted into an array of 0s and a single 1, which represents what that is.
- For example, if there are 3 unique values in a column such as ```[Male, Female, Others]```. Then the encoding for ```Male``` would be ```[1, 0, 0]```. So only the ```Male``` value is hot, and the others aren't.
- Similarly for ```Female``` the encoding would be ```[0, 1, 0]``` and so on
- The ```fit_transform``` function results in a sparse matrix, which is then converted to a dense one using ```toarray()``` function.
- The ```get_feature_names_out``` adds new columns to the dataset, with values corresponding to the encoded one
- Finally, all these are added back to the dataset

In [4]:
from sklearn.preprocessing import OneHotEncoder

kat = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]
ohe = OneHotEncoder()
feature = ohe.fit_transform(df[kat]).toarray()
sut = ohe.get_feature_names_out(kat)
df[sut] = feature


### Now we drop all null values across the ```bmi```, the old columns which are now OneHotEncoded, and the ```id``` column since it's irrelevant to our analysis. 

In [5]:
df.dropna(subset=['bmi'], inplace=True)
df = df.drop(columns=kat)
df = df.drop(columns=["id"])

# Decision Tree Classifier:

**How It Works:**

The algorithm splits the dataset into subsets based on the value of input features.

It continues splitting recursively until it reaches a stopping condition (e.g., max depth, pure leaf, or minimum samples).

At each node, it selects the best feature and threshold that results in the highest information gain or lowest impurity.

1. **Impurity Measures**: *Used to determine the quality of a split*

    - Gini Impurity (default): *A measure of how pure a node is, i.e, how many instances of different classes does it possess*

    $$ G = 1 - \sum_{i=1}^C p_i^2$$
     
    - Entropy (used in information gain): *The higher the entropy, the more mixed the node*
    
    $$
    H = - \sum_{i=1}^C p_i \log_2 p_i
    $$

2. **Information Gain**: *The reduction in impurity before and after a split*

    $$ Gain = H_{\text {parent}} - \sum{}\frac{n_i}{n} H_i $$ 

3. **Sample Splitting**: *This is done to increase the purity of a node. Nodes are split based on a certain critera (eg: ```age <= 50```) so as to produce the purest child node. This process grows the tree recursively. This is done based on certain strategies. At each step, the best possbile (optimal) or a random strategy is chosen. Choosing the best possible at each step can cause overfitting.*
 
**Some Common Hyperparameters**
| *Parameter*        | *Description*                                                   |
|----------------------|-------------------------------------------------------------------|
| `criterion`          | Function to measure split quality: `'gini'` or `'entropy'`       |
| `max_depth`          | Maximum depth of the tree                                         |
| `min_samples_split`  | Minimum number of samples required to split an internal node     |
| `min_samples_leaf`   | Minimum number of samples required to be at a leaf node          |
| `max_features`       | Number of features to consider when looking for the best split   |
| `splitter`           | Strategy used to choose the split: `'best'` or `'random'`        |


### Now, we split the data into inputs and output. The inputs are all other columns apart from the stroke, while the output is the stroke column itself.

### After this, the inputs and outputs are split into training and testing sets.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

y = df["stroke"]
x = df.drop(columns=["stroke"])
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, train_size=0.8)



### The model is now trained

In [7]:
tree = DecisionTreeClassifier()
model = tree.fit(x_train, y_train)

### F1 Score:

*It is the harmonic mean of precision and recall, combining both into a single number that balances the two*

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

*where,*

$$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$

$$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$


$$
\text{F1}_{\text{weighted}} = \sum_{i=1}^{C} \frac{n_i}{n} \cdot \text{F1}_i
$$

$$
\begin{aligned}
C &= \text{Number of classes} \\
n_i &= \text{Number of true instances for class } i \\
n &= \text{Total number of samples} \\
\text{F1}_i &= \text{F1 score for class } i
\end{aligned}
$$

### Accuracy Score

$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{TP + TN}{TP + TN + FP + FN}
$$

$$
\begin{aligned}
TP &= \text{True Positives} \\
TN &= \text{True Negatives} \\
FP &= \text{False Positives} \\
FN &= \text{False Negatives}
\end{aligned}
$$




### `classification_report` (scikit-learn)

The `classification_report` function prints key metrics for evaluating a classification model:

* **Precision**: Accuracy of positive predictions
* **Recall**: Ability to find all positive samples
* **F1 Score**: Harmonic mean of precision & recall
* **Support**: Number of true samples in each class


In [8]:
y_pred_tree = model.predict(x_test)

from sklearn.metrics import f1_score, accuracy_score, classification_report

print("Weighted F1 score: ", f1_score(y_test, y_pred_tree, average = 'weighted'))
print("Accuracy score: ", accuracy_score(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))

Weighted F1 score:  0.9132676560974813
Accuracy score:  0.9164969450101833
              precision    recall  f1-score   support

           0       0.95      0.96      0.96       929
           1       0.18      0.15      0.16        53

    accuracy                           0.92       982
   macro avg       0.56      0.56      0.56       982
weighted avg       0.91      0.92      0.91       982



## GridSearchCV

`GridSearchCV` is a tool in `scikit-learn` used for **exhaustive hyperparameter tuning**. It  searches through a **grid of specified hyperparameter values** and evaluates model performance using **cross-validation**.


### What It Does

`GridSearchCV` performs:

1. **Model training** on multiple combinations of hyperparameters.
2. **Cross-validation** for each combination.
3. **Selection** of the best parameters based on a scoring metric (e.g., accuracy, F1 score).


### Accuracy of Cross-Validation

If using `k`-fold cross-validation:


$$
\text{CV Accuracy} = \frac{1}{k} \sum_{i=1}^{k} \text{Accuracy}_i
$$


Where $\text{Accuracy}_i$ is the accuracy on the $i$-th validation fold.



### Common Parameters

| **Parameter**        | **Description**                                                             |
| -------------------- | --------------------------------------------------------------------------- |
| `estimator`          | The model (e.g., `SVC()`, `RandomForestClassifier()`)                       |
| `param_grid`         | Dictionary or list of dictionaries with hyperparameters to try              |
| `scoring`            | Metric to optimize (`'accuracy'`, `'f1'`, `'neg_mean_squared_error'`, etc.) |
| `cv`                 | Number of cross-validation folds (e.g., `cv=5`)                             |
| `verbose`            | Controls the amount of output during training                               |
| `n_jobs`             | Number of parallel jobs (`-1` uses all cores)                               |
| `refit`              | If `True`, refits the best estimator on the whole dataset                   |
| `return_train_score` | If `True`, also returns training scores                                     |



## K-Fold Cross-Validation

**K-Fold Cross-Validation** is a technique to evaluate the performance of a machine learning model by splitting the dataset into **K equal-sized folds**.

### How It Works:

1. The dataset is divided into **K folds** (subsets).
2. For each fold:

   * One fold is used as the **validation set**.
   * The remaining $K - 1$ folds are used for **training**.
3. This process repeats **K times**, each time with a different fold used for validation.
4. The performance metric is **averaged** over the K iterations.


### Average Performance


$$
\text{CV Score} = \frac{1}{K} \sum_{i=1}^{K} \text{Score}_i
$$


###  Common Choice:

* $K = 5$ or $K = 10$ are most commonly used in practice.

In [9]:
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'splitter': ['best', 'random']
}

from sklearn.model_selection import GridSearchCV

grid_tree = GridSearchCV(estimator = model, param_grid = param_grid, cv = 10, scoring = 'f1_weighted', n_jobs = -1)

grid_tree.fit(x_train, y_train)

In [10]:
print("Best parameters:", grid_tree.best_params_)
print("Best F1 Score (CV):", grid_tree.best_score_)

Best parameters: {'criterion': 'gini', 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 5, 'splitter': 'random'}
Best F1 Score (CV): 0.9440877452974199


In [11]:
y_pred_grid = grid_tree.predict(x_test)

print("Weighted F1 score: ", f1_score(y_test, y_pred_grid, average = 'weighted'))
print("Accuracy score: ", accuracy_score(y_test, y_pred_grid))
print(classification_report(y_test, y_pred_grid))

Weighted F1 score:  0.9211816744528192
Accuracy score:  0.945010183299389
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       929
           1       0.33      0.02      0.04        53

    accuracy                           0.95       982
   macro avg       0.64      0.51      0.50       982
weighted avg       0.91      0.95      0.92       982





## `RandomForestClassifier`

`RandomForestClassifier` is an **ensemble learning method** in scikit-learn that builds a **"forest" of decision trees** and aggregates their predictions to improve classification performance and reduce overfitting.

It’s based on the **bagging (Bootstrap Aggregation)** technique and introduces randomness in both:

* **Data selection** (via bootstrapping)
* **Feature selection** (random subset at each split)



### How It Works

1. **Bootstrapping**:

   * Random samples (with replacement) are drawn from the dataset to train each tree.

2. **Random Feature Selection**:

   * At each node, only a random subset of features is considered for splitting.

3. **Ensemble Prediction**:

   * Final prediction is made by **majority voting** across all trees (for classification).



### Mathematical Representation

Let each tree be $T_i$, trained on a bootstrap sample:

$$
\hat{y} = \text{majority vote}(T_1(x), T_2(x), ..., T_n(x))
$$

### Key Parameters

| Parameter           | Description                                                       |
| ------------------- | ----------------------------------------------------------------- |
| `n_estimators`      | Number of trees in the forest (default: 100)                      |
| `criterion`         | Splitting metric: `'gini'` (default) or `'entropy'`               |
| `max_depth`         | Maximum depth of each tree                                        |
| `min_samples_split` | Minimum number of samples to split an internal node               |
| `min_samples_leaf`  | Minimum number of samples required at a leaf node                 |
| `max_features`      | Number of features to consider when looking for the best split    |
| `bootstrap`         | Whether to use bootstrap sampling (default: `True`)               |
| `oob_score`         | Use out-of-bag samples to estimate performance (default: `False`) |
| `random_state`      | Seed for reproducibility                                          |



### Advantages

* High accuracy
* Robust to overfitting
* Works well with missing and categorical data
* Handles large feature spaces and datasets efficiently


### Disadvantages

* Less interpretable than a single decision tree
* Can be slow for very large datasets with many trees
* Larger memory footprint

In [12]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators = 500, random_state = 42)
forest.fit(x_train, y_train)

In [13]:
y_pred_forest = model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred_forest))
print("Weighted F1 score: ", f1_score(y_test, y_pred_forest, average = 'weighted'))
print(classification_report(y_test, y_pred_forest))

Accuracy: 0.9164969450101833
Weighted F1 score:  0.9132676560974813
              precision    recall  f1-score   support

           0       0.95      0.96      0.96       929
           1       0.18      0.15      0.16        53

    accuracy                           0.92       982
   macro avg       0.56      0.56      0.56       982
weighted avg       0.91      0.92      0.91       982



In [14]:
param_grid = {
    'n_estimators': [100,200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2,5],
    'min_samples_leaf': [1,2],
    'max_features': ['sqrt', 'log2']
}

grid_random_forest = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_random_forest.fit(x_train, y_train)

In [15]:
print("Best params:", grid_random_forest.best_params_)
print("Best score:", grid_random_forest.best_score_)

Best params: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best score: 0.960529975202995


In [16]:
y_pred_forest_grid = grid_random_forest.predict(x_test)

print("Weighted F1 score: ", f1_score(y_test, y_pred_forest_grid, average = 'weighted'))
print("Accuracy score: ", accuracy_score(y_test, y_pred_forest_grid))
print(classification_report(y_test, y_pred_forest_grid))

Weighted F1 score:  0.9197911970678917
Accuracy score:  0.9460285132382892
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       929
           1       0.00      0.00      0.00        53

    accuracy                           0.95       982
   macro avg       0.47      0.50      0.49       982
weighted avg       0.89      0.95      0.92       982



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## `SVC` – Support Vector Classifier (scikit-learn)

`SVC` stands for **Support Vector Classifier**, an implementation of the **Support Vector Machine (SVM)** algorithm used for **binary and multi-class classification**.

It works by finding the **optimal hyperplane** that maximally separates classes in the feature space.


### Core Idea

SVM aims to:

* Find a decision boundary (hyperplane) with the **maximum margin** between different classes.
* Use **support vectors**, the closest points to the hyperplane, to define that boundary.

#### Margin Maximization Formula:

The goal is to minimize:

$$
\frac{1}{2} \lVert \mathbf{w} \rVert^2
$$

Subject to:

$$
y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1
$$

Where:

* $\mathbf{w}$ = weights (normal vector to the hyperplane)
* $b$ = bias
* $y_i$ = class label (+1 or -1)
* $\mathbf{x}_i$ = feature vector


### Kernel Trick

When data is **not linearly separable**, SVM uses **kernels** to project it into a higher-dimensional space where it is separable.

Common kernels:

* `'linear'` — for linear problems
* `'rbf'` (default) — Radial Basis Function (nonlinear)
* `'poly'` — Polynomial
* `'sigmoid'` — Neural network-like

### Important Parameters

| Parameter      | Description                                                           |
| -------------- | --------------------------------------------------------------------- |
| `C`            | Regularization parameter (controls margin width vs misclassification) |
| `kernel`       | Type of kernel function: `'linear'`, `'poly'`, `'rbf'`, `'sigmoid'`   |
| `gamma`        | Kernel coefficient for `'rbf'`, `'poly'`, `'sigmoid'` kernels         |
| `degree`       | Degree of the polynomial kernel                                       |
| `probability`  | Whether to enable probability estimates (`True` adds overhead)        |
| `class_weight` | Adjust weights to handle imbalanced data (e.g., `'balanced'`)         |


### Advantages

* Works well for high-dimensional spaces
* Effective when number of features > number of samples
* Flexible with kernel functions
* Robust to overfitting in many cases

### Disadvantages

* Not efficient on large datasets
* Requires proper scaling of features
* Sensitive to choice of kernel and hyperparameters

In [17]:
from sklearn.svm import SVC

model_vec = SVC(kernel = 'rbf', C=1.0, gamma = 'scale', random_state = 42)
model_vec.fit(x_train, y_train)

In [18]:
y_pred_vec = model_vec.predict(x_test)

print("Accuracy score: ", accuracy_score(y_test, y_pred_vec))
print("Weighted F1 score: ", f1_score(y_test, y_pred_vec, average = 'weighted'))
print(classification_report(y_test, y_pred_vec))

Accuracy score:  0.9460285132382892
Weighted F1 score:  0.9197911970678917
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       929
           1       0.00      0.00      0.00        53

    accuracy                           0.95       982
   macro avg       0.47      0.50      0.49       982
weighted avg       0.89      0.95      0.92       982



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [28]:
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf'],
    'gamma': ['auto']
}

grid_vec = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid_vec.fit(x_train, y_train)

print("Best Parameters:", grid_vec.best_params_)

Best Parameters: {'C': 0.1, 'gamma': 'auto', 'kernel': 'rbf'}


In [29]:
grid_vec_pred = grid_vec.predict(x_test)

print("Accuracy score: ", accuracy_score(y_test, grid_vec_pred))
print("Weighted F1 score: ", f1_score(y_test, grid_vec_pred, average = 'weighted'))
print(classification_report(y_test, grid_vec_pred))

Accuracy score:  0.9460285132382892
Weighted F1 score:  0.9197911970678917
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       929
           1       0.00      0.00      0.00        53

    accuracy                           0.95       982
   macro avg       0.47      0.50      0.49       982
weighted avg       0.89      0.95      0.92       982



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## `KNeighborsClassifier` – K-Nearest Neighbors (KNN)

`KNeighborsClassifier` is a **simple, intuitive**, and widely used classification algorithm based on **similarity**. It classifies a new data point by looking at the **'k' closest training examples** and assigning the most common class among them.


### How It Works

1. Choose a value of **$k$** (number of neighbors).
2. Calculate the **distance** (e.g., Euclidean) from the new point to all training points.
3. Select the **k nearest neighbors**.
4. Assign the class that is **most common among those neighbors**.


### Distance Metric

By default, KNN uses **Euclidean distance**:

$$
\text{distance} = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }
$$

You can also use:

* **Manhattan** distance
* **Minkowski** distance
* Custom distance metrics

### Important Parameters

| Parameter     | Description                                                                           |
| ------------- | ------------------------------------------------------------------------------------- |
| `n_neighbors` | Number of neighbors to use (the "k" in KNN)                                           |
| `weights`     | `'uniform'` (all neighbors equal) or `'distance'` (closer neighbors have more weight) |
| `metric`      | Distance function to use (`'euclidean'`, `'manhattan'`, `'minkowski'`, etc.)          |
| `p`           | Power parameter for Minkowski metric (p=2 = Euclidean, p=1 = Manhattan)               |
| `algorithm`   | Search algorithm: `'auto'`, `'ball_tree'`, `'kd_tree'`, `'brute'`                     |


### Advantages

* Simple to implement and understand
* No training phase (instance-based)
* Naturally handles multi-class classification

### Disadvantages

* Slow with large datasets (prediction requires distance computation for all points)
* Sensitive to irrelevant or scaled features
* Doesn’t work well in high-dimensional spaces (**curse of dimensionality**)

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(x_train)
X_test_scaled = scaler.transform(x_test)

In [24]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train_scaled, y_train)

In [25]:
y_pred = knn.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.945010183299389
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       929
           1       0.40      0.04      0.07        53

    accuracy                           0.95       982
   macro avg       0.67      0.52      0.52       982
weighted avg       0.92      0.95      0.92       982



In [27]:
# Define the parameter grid
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'p': [1, 2] 
}

# Initialize the model
knn = KNeighborsClassifier()

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=knn,
    param_grid=param_grid,
    cv=5,
    scoring='f1_weighted',
    verbose=1,
    n_jobs=-1
)

# Fit to the training data
grid_search.fit(X_train_scaled, y_train)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Use the best model
best_knn = grid_search.best_estimator_
y_pred = best_knn.predict(X_test_scaled)

# Evaluate
print(classification_report(y_test, y_pred))
print("Accuracy score: ", accuracy_score(y_test, y_pred))
print("Weighted F1 score: ", f1_score(y_test, y_pred, average='weighted'))

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best Parameters: {'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
Best Score: 0.942258112845894
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       929
           1       0.40      0.04      0.07        53

    accuracy                           0.95       982
   macro avg       0.67      0.52      0.52       982
weighted avg       0.92      0.95      0.92       982

Accuracy score:  0.945010183299389
Weighted F1 score:  0.9229481980051685
