In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

#### Binary Classification problem

The goal is to predict the presence or absence of heart disease (the `target` variable) from the other features (age, sex, chest pain type, blood pressure, cholesterol level, etc.). This is a binary classification problem: the model should classify individuals into two categories, those with heart disease and those without.

Here are the full forms of each column in the provided dataset:

1. **age**: Age of the individual.
2. **sex**: Gender of the individual (1 for male, 0 for female).
3. **cp**: Chest pain type.
4. **trestbps**: Resting blood pressure (in mm Hg).
5. **chol**: Serum cholesterol level (in mg/dl).
6. **fbs**: Fasting blood sugar > 120 mg/dl (1 for true, 0 for false).
7. **restecg**: Resting electrocardiographic results.
8. **thalach**: Maximum heart rate achieved during exercise.
9. **exang**: Exercise-induced angina (1 for yes, 0 for no).
10. **oldpeak**: ST depression induced by exercise relative to rest.
11. **slope**: Slope of the peak exercise ST segment.
12. **ca**: Number of major vessels colored by fluoroscopy.
13. **thal**: Thallium stress test result.

In [2]:
df = pd.read_csv('heart.csv')

In [3]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [4]:
df.shape

(303, 14)

In [5]:
df['target'].unique()

array([1, 0], dtype=int64)

In [6]:
X = df.iloc[:, : -1]
y = df.iloc[:, -1]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
X_train.shape

(242, 13)

In [9]:
X_test.shape

(61, 13)

In [10]:
rf = RandomForestClassifier()

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

accuracy_score(y_test, y_pred)

0.8524590163934426

In [11]:
from sklearn.model_selection import cross_val_score

cross_val_score(RandomForestClassifier(), X,y,scoring='accuracy', cv=10)

array([0.90322581, 0.83870968, 0.87096774, 0.93333333, 0.9       ,
       0.8       , 0.7       , 0.83333333, 0.73333333, 0.76666667])

In [12]:
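# Fold scores copied by hand from a separate, unseeded run, so they differ from the array printed above.
np.mean(np.array([0.87096774, 0.83870968, 0.83870968, 0.9, 0.86666667,0.8, 0.76666667, 0.83333333, 0.73333333, 0.83333333]))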

0.828172043
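
Retyping fold scores by hand is fragile, because an unseeded forest returns different scores on every run. Here is a minimal sketch of the same averaging done directly on the array returned by `cross_val_score`. The synthetic data from `make_classification` (standing in for `heart.csv`), the fixed `random_state`, and `n_estimators=50` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart data (assumption: 303 rows, 13 features).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Fixing random_state makes the fold scores reproducible between runs.
model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=10)

# Summarise the returned array directly instead of retyping its values.
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how much the accuracy varies from fold to fold.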

In [13]:
rf = RandomForestClassifier(max_samples=0.75,random_state=42)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.9016393442622951

In [14]:
np.mean(cross_val_score(RandomForestClassifier(max_samples=0.75, random_state=42), X,y,scoring='accuracy', cv=10))

0.8349462365591398

### `GridSearchCV`

In [15]:
from sklearn.model_selection import GridSearchCV

In [16]:
param_grid = {
    'n_estimators' : [20,60,100,120],
    'max_features' : [0.2,0.6,1.0],
    'max_depth' : [2,8,None],
    'max_samples' : [0.5,0.75,1.0]
}

print(param_grid)

{'n_estimators': [20, 60, 100, 120], 'max_features': [0.2, 0.6, 1.0], 'max_depth': [2, 8, None], 'max_samples': [0.5, 0.75, 1.0]}


In [17]:
rf = RandomForestClassifier()

In [18]:
rf_grid = GridSearchCV(
    estimator= rf,
    param_grid=param_grid,
    cv = 5,
    verbose= 2,
    n_jobs= -1
)

In [19]:
rf_grid.fit(X_train, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [20]:
rf_grid.best_params_

{'max_depth': 2, 'max_features': 0.2, 'max_samples': 1.0, 'n_estimators': 60}

In [21]:
rf_grid.best_score_

0.8511054421768707
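
The candidate count in the log follows from the grid: 4 × 3 × 3 × 3 = 108 combinations, each fitted on 5 folds, gives 540 fits. Note that `best_score_` is a mean cross-validated accuracy computed on the training data, so it is not directly comparable with the test-set accuracies reported earlier. The sketch below checks the count with `ParameterGrid` and then scores the refitted best model on a held-out split. To keep it fast, it uses synthetic data and a reduced grid, both of which are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

# The grid from the notebook: 4 * 3 * 3 * 3 = 108 candidates.
notebook_grid = {
    "n_estimators": [20, 60, 100, 120],
    "max_features": [0.2, 0.6, 1.0],
    "max_depth": [2, 8, None],
    "max_samples": [0.5, 0.75, 1.0],
}
n_candidates = len(ParameterGrid(notebook_grid))
print(n_candidates, "candidates,", n_candidates * 5, "fits with cv=5")

# Synthetic stand-in data and a reduced grid (assumptions for a fast demo).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

small_grid = {"n_estimators": [20, 60], "max_depth": [2, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)
search.fit(X_tr, y_tr)

# best_score_ is the mean CV accuracy; score() applies the refitted best
# model to data the search never saw.
print("CV accuracy:  ", round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_te, y_te), 3))
```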

### `RandomizedSearchCV`

`RandomizedSearchCV` is a hyperparameter tuning method from the `scikit-learn` library. Rather than trying every combination, it evaluates a fixed number of hyperparameter settings sampled at random from user-specified lists or distributions.

Here's an explanation of `RandomizedSearchCV`:

1. **Hyperparameter Tuning**:
   - In machine learning models, hyperparameters are parameters that are not learned from the data but are set prior to model training. Examples of hyperparameters include the learning rate in gradient descent or the depth of a decision tree. Tuning these hyperparameters is essential for optimizing the performance of a model.

2. **Grid Search vs. Randomized Search**:
   - Traditionally, hyperparameter tuning is performed using grid search, where a grid of hyperparameter values is defined, and the model is trained and evaluated for each combination of hyperparameters. While grid search exhaustively searches through all possible combinations, it can be computationally expensive, especially for models with a large number of hyperparameters or a large search space.
   - Randomized search, on the other hand, samples a fixed number of hyperparameter settings from specified distributions. It does not try every combination, so its cost is set by the number of samples rather than by the size of the search space.

3. **RandomizedSearchCV**:
   - `RandomizedSearchCV` is a class in scikit-learn that combines random search with cross-validation. Each sampled candidate is scored on several folds, which gives a more reliable estimate than a single validation split and reduces the risk of choosing hyperparameters that only suit one particular split.
   - The user specifies a list of values or a distribution for each hyperparameter to be tuned, along with the number of iterations (`n_iter`), which is the number of random combinations to try (10 by default).
   - During the search process, `RandomizedSearchCV` randomly selects hyperparameter combinations from the specified distributions, trains and evaluates the model using cross-validation, and keeps track of the best-performing set of hyperparameters found so far.

4. **Benefits**:
   - **Efficiency**: The cost of a randomized search is fixed by `n_iter`, whereas the cost of a grid search grows multiplicatively with every added hyperparameter or value.
   - **Exploration of Hyperparameter Space**: Randomized search can sample from continuous distributions, so it is not limited to a few hand-picked values per hyperparameter. This allows a more diverse exploration of the space, potentially leading to better-performing models.
   - **Parallelization**: Like `GridSearchCV`, `RandomizedSearchCV` can evaluate candidates in parallel across CPU cores via the `n_jobs` parameter.

In summary, `RandomizedSearchCV` is a practical tool for hyperparameter tuning. It explores the hyperparameter space at a fixed, user-chosen cost and often finds good settings, although it is not guaranteed to find the best ones.
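
As a rough sketch of the workflow described above, `param_distributions` can mix plain lists with `scipy.stats` distributions, while `n_iter` fixes how many candidates are sampled. The synthetic data and the chosen ranges are illustrative assumptions, not values tuned for the heart dataset.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the heart data (assumption).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

param_distributions = {
    "n_estimators": randint(20, 121),    # integers from 20 to 120 inclusive
    "max_features": uniform(0.2, 0.8),   # floats between 0.2 and 1.0
    "max_depth": [2, 8, None],           # list entries are sampled uniformly
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,          # number of sampled candidates (the default is 10)
    cv=5,
    random_state=42,   # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Setting `random_state` on the search itself, not only on the estimator, is what makes the sampled candidates repeatable.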

In [22]:
from sklearn.model_selection import RandomizedSearchCV

In [23]:
param_grid = {
    'n_estimators' : [20,60,100,120],
    'max_features' : [0.2,0.6,1.0],
    'max_depth' : [2,8,None],
    'max_samples' : [0.5,0.75,1.0],
    'bootstrap' : [True, False],
    'min_samples_split' : [2, 5],
    'min_samples_leaf' : [1,2]
}
print(param_grid)

{'n_estimators': [20, 60, 100, 120], 'max_features': [0.2, 0.6, 1.0], 'max_depth': [2, 8, None], 'max_samples': [0.5, 0.75, 1.0], 'bootstrap': [True, False], 'min_samples_split': [2, 5], 'min_samples_leaf': [1, 2]}


In [24]:
rf = RandomForestClassifier()

In [25]:
rf_rand = RandomizedSearchCV(
    estimator= rf,
    param_distributions= param_grid,
    cv=5,
    verbose=2,
    n_jobs= -1
)

In [26]:
rf_rand.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\anaconda3\Lib\site-packages\sklearn\ensemble\_forest.py", line 383, in fit
    raise ValueError(
ValueError: `max_sample` cannot be set if `bootstrap=False`. Either switch to `bootstrap=True` or set `max_sample=None`.

 0.80561224 0.77687075        nan 0.81394558]


In [27]:
rf_rand.best_params_

{'n_estimators': 100,
 'min_samples_split': 5,
 'min_samples_leaf': 2,
 'max_samples': 0.75,
 'max_features': 0.2,
 'max_depth': 2,
 'bootstrap': True}

In [28]:
rf_rand.best_score_

0.8263605442176871

`RandomizedSearchCV` is particularly useful in scenarios where exhaustive search with `GridSearchCV` becomes computationally expensive or impractical. Here's when and why to use `RandomizedSearchCV`:

1. **Large Hyperparameter Space**:
   - When dealing with a large number of hyperparameters or a wide range of possible values for hyperparameters, grid search becomes inefficient due to the combinatorial explosion of possible parameter combinations. In such cases, `RandomizedSearchCV` is a better choice as it randomly samples a subset of the parameter space, allowing for a more efficient search.

2. **Limited Computational Resources**:
   - Grid search can require a significant amount of computational resources, especially when dealing with large datasets or complex models. `RandomizedSearchCV` is computationally more efficient as it explores only a subset of hyperparameter combinations. This makes it suitable for situations where computational resources are limited.

3. **Early Exploration**:
   - During the initial stages of model development, when you are exploring various hyperparameter configurations and their impact on model performance, `RandomizedSearchCV` can provide a quick way to sample different hyperparameter combinations and assess their effects on model performance.

4. **Trade-off between Exploration and Exploitation**:
   - `RandomizedSearchCV` is exploratory by design: every candidate is drawn independently from the specified distributions, and earlier results do not steer later samples. Any emphasis on promising regions of the hyperparameter space therefore comes from the distributions you specify, for example by giving more probability mass to ranges you expect to work well.

5. **High-dimensional Data**:
   - When working with high-dimensional data, each model fit tends to be more expensive, so evaluating every combination in a grid quickly becomes costly. `RandomizedSearchCV` caps the number of fits without explicitly evaluating all possible combinations, making it suitable for high-dimensional data settings.

6. **Model Selection and Comparison**:
   - `RandomizedSearchCV` can be used to compare the performance of different models with different hyperparameter settings efficiently. By running randomized search with multiple models, you can quickly identify the most promising model configurations for further evaluation.
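
The efficiency argument in the list above can be made concrete with a short calculation. Assuming candidates are drawn independently and uniformly from the search space (a simplification), the probability that at least one of `n` draws lands in the best 5% of configurations is `1 - 0.95**n`. The function name below is hypothetical, and the grid size is that of the seven-parameter grid used in this notebook.

```python
def prob_hit_top_fraction(n_iter: int, top_fraction: float = 0.05) -> float:
    """Chance that at least one of n_iter independent uniform draws
    falls in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter


# Size of the seven-parameter grid from the notebook.
grid_size = 4 * 3 * 3 * 3 * 2 * 2 * 2  # 864 combinations

for n in (10, 30, 60):
    print(
        f"n_iter={n:>2}: P(hit top 5%) = {prob_hit_top_fraction(n):.3f} "
        f"({n / grid_size:.1%} of the grid)"
    )
```

Under this assumption, 60 draws reach the top 5% with a probability of about 0.95 while covering roughly 7% of the 864 combinations, whereas the default of 10 draws does so with a probability of about 0.40.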

In summary, `RandomizedSearchCV` is beneficial when the hyperparameter space is large, when computational resources are limited, during early experimentation, and when comparing models. It offers an efficient alternative to grid search by tuning hyperparameters at a fixed, user-chosen budget.
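
For the model comparison use case, one possible pattern is to run a small randomized search per estimator and compare the best cross-validated scores. The sketch below does this for two of the classifiers imported at the top of the notebook. The search spaces, `n_iter`, and the synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the heart data (assumption).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Each entry pairs an estimator with its own (hypothetical) search space.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [20, 60, 100], "max_depth": [2, 8, None]},
    ),
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1.0, 10.0]},
    ),
}

results = {}
for name, (estimator, space) in candidates.items():
    search = RandomizedSearchCV(
        estimator, param_distributions=space, n_iter=4, cv=5, random_state=42
    )
    search.fit(X_demo, y_demo)
    results[name] = (search.best_score_, search.best_params_)

# Rank the models by their best mean cross-validated accuracy.
for name, (score, params) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: {score:.3f} with {params}")
```

Using the same `cv` setting and the same data for every model keeps the scores comparable.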