#Q1
Purpose: Grid Search Cross-Validation (Grid Search CV) is a technique for finding the best hyperparameters for a machine learning model. It evaluates every combination of values in a predefined grid using cross-validation and keeps the combination with the best score.

How it works:

Define a hyperparameter grid, specifying the values to be tested for each hyperparameter.

For each combination of hyperparameter values, the grid search algorithm trains and validates the model with k-fold cross-validation (cv=3 in the code below means three folds).

It averages the model's score across the folds using a specified metric, such as accuracy.

The combination of hyperparameter values that gives the best performance is selected.
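
The steps above can be sketched by hand with ParameterGrid and cross_val_score. This is only an illustration of the idea, not how GridSearchCV is implemented internally; the small grid and the names demo_grid, X_demo, and y_demo are assumptions made to keep the loop fast.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

# Hypothetical, deliberately small grid so the loop stays fast
demo_grid = {"n_estimators": [10, 50], "max_depth": [None, 3]}

X_demo, y_demo = load_iris(return_X_y=True)

best_score, best_params = -1.0, None
for params in ParameterGrid(demo_grid):  # 2 x 2 = 4 combinations
    model = RandomForestClassifier(random_state=0, **params)
    # Mean accuracy over 3 cross-validation folds for this combination
    score = cross_val_score(model, X_demo, y_demo, cv=3, scoring="accuracy").mean()
    if score > best_score:
        best_score, best_params = score, params

print("Combinations tried:", len(ParameterGrid(demo_grid)))
print("Best parameters:", best_params)
```

The cost grows multiplicatively: a grid with 3 values for each of 3 hyperparameters has 27 combinations, and with 3 folds that is 81 model fits.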

In [2]:
#1
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}

# Create a random forest classifier
# (no random_state is set, so the selected hyperparameters can vary between runs)
rf_classifier = RandomForestClassifier()

# Perform grid search with cross-validation
grid_search = GridSearchCV(rf_classifier, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best hyperparameter values
print("Best Hyperparameters:", grid_search.best_params_)

Best Hyperparameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50}


#Q2
Difference:

Grid Search CV: It exhaustively evaluates every combination of hyperparameter values defined in a grid. It is computationally expensive because the number of fits grows multiplicatively with each added hyperparameter, but it guarantees that the best combination within the grid is found.

Randomized Search CV: It randomly samples a fixed number of hyperparameter combinations (n_iter) from the specified hyperparameter space. It is cheaper because the number of fits is set by n_iter rather than by the size of the space, but it does not guarantee that the best combination is ever sampled.

When to choose:

Use Grid Search CV when:

The hyperparameter search space is not too large.
You have sufficient computational resources.

Use Randomized Search CV when:

The hyperparameter search space is large.
You want to perform a quick exploration of the hyperparameter space.

In [3]:
#2
from sklearn.model_selection import RandomizedSearchCV

# Randomized Search
random_search = RandomizedSearchCV(rf_classifier, param_distributions=param_grid, n_iter=10, cv=3, scoring='accuracy')
random_search.fit(X_train, y_train)

# Print the best hyperparameter values
print("Best Hyperparameters (Randomized Search):", random_search.best_params_)

Best Hyperparameters (Randomized Search): {'n_estimators': 200, 'min_samples_split': 10, 'max_depth': 10}
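
Randomized search can also sample from distributions rather than fixed lists, which is where it differs most from a grid. The sketch below assumes scipy is installed; the ranges and the names X_rs, y_rs, and param_distributions are hypothetical choices for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_rs, y_rs = load_iris(return_X_y=True)

# Hypothetical ranges; randint(a, b) samples integers from a to b - 1
param_distributions = {
    "n_estimators": randint(10, 100),
    "min_samples_split": randint(2, 11),
    "max_depth": [None, 5, 10],  # plain lists are sampled uniformly
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # budget: exactly 5 sampled combinations
    cv=3,
    scoring="accuracy",
    random_state=0,    # makes the sampled combinations reproducible
)
search.fit(X_rs, y_rs)

print("Combinations evaluated:", len(search.cv_results_["params"]))
print("Best parameters:", search.best_params_)
```

The number of fits here is n_iter times cv, regardless of how wide the ranges are.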


#Q3
Data Leakage: Data leakage occurs when information that would not be available at prediction time, such as test-set data or the target itself, is used to train or tune a model. This leads to overly optimistic performance estimates: the model appears to perform well during evaluation but fails to generalize to new, unseen data.

Example: Let's say you're building a credit risk model, and you accidentally include future information about a customer's credit history in the training set (e.g., including information about whether a customer defaulted on a loan after the current loan was approved). The model might learn patterns related to the future outcome, making it seem more accurate during training but less effective on new applications.
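
A small synthetic experiment can make the effect visible. The sketch below uses pure noise features (so there is nothing real to learn) and compares feature selection done on all the data against feature selection done inside each training fold. The data and all names here are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # pure noise: no real signal
y_noise = np.array([0, 1] * 50)         # balanced labels unrelated to X_noise

# Leaky: features are selected using ALL labels, including those of
# samples that later serve as validation folds
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# Leak-free: the selector is refit inside each training fold
noise_pipeline = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(noise_pipeline, X_noise, y_noise, cv=5).mean()

print("Leaky CV accuracy:", leaky_score)
print("Leak-free CV accuracy:", honest_score)
```

On data like this, the leaky estimate typically comes out well above chance even though the labels are unrelated to the features, while the leak-free estimate stays close to 0.5.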

#Q4
Preventing Data Leakage:

Split Data Properly: Separate the training and test sets before any preprocessing, and fit scalers, imputers, and feature selectors on the training data only.

Feature Engineering: Create features only using information available at the time of prediction.

In [10]:
#4
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()

# Create a DataFrame
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = breast_cancer.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

# Train a RandomForestClassifier on the training set
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model's performance on the test set
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on the test set:", accuracy)

Accuracy on the test set: 0.9649122807017544
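
When preprocessing is involved, one common way to keep it leak-free is to wrap it in a Pipeline so it is refit on the training part of every fold. The sketch below swaps in StandardScaler and LogisticRegression as an illustration; these choices and the variable names are assumptions, not part of the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42
)

# The scaler is fit only on the training portion of each fold
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(scaled_model, X_bc_train, y_bc_train, cv=5)

# Final fit on the training set; the test set is used once, for evaluation only
scaled_model.fit(X_bc_train, y_bc_train)
print("Mean CV accuracy:", cv_scores.mean())
print("Test accuracy:", scaled_model.score(X_bc_test, y_bc_test))
```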


#Q5
Confusion Matrix: A confusion matrix is a table that summarizes the performance of a classification model. It shows the counts of true positive, true negative, false positive, and false negative predictions.

In [11]:
#5
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Binary classification (class 2 vs. not class 2)
y_binary = (y == 2).astype(int)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Train a logistic regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logistic_model.predict(X_test)

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[19  0]
 [ 0 11]]


#Q6
Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positives. It focuses on the accuracy of the positive predictions.

Precision = TP / (TP + FP)

Recall (Sensitivity or True Positive Rate): Recall is the ratio of correctly predicted positive observations to the total actual positives. It focuses on how well the model captures all the positive instances.

Recall = TP / (TP + FN)
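
The two ratios can be checked by hand against scikit-learn. The label lists below are hypothetical and exist only to give concrete counts.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = positive, 0 = negative
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]

pairs = list(zip(y_true_demo, y_pred_demo))
tp_demo = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged positives
fp_demo = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives flagged as positive
fn_demo = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives that were missed

precision_manual = tp_demo / (tp_demo + fp_demo)
recall_manual = tp_demo / (tp_demo + fn_demo)

print("Precision:", precision_manual, precision_score(y_true_demo, y_pred_demo))
print("Recall:", recall_manual, recall_score(y_true_demo, y_pred_demo))
```

Here TP = 4, FP = 2, and FN = 1, so precision is 4/6 and recall is 4/5.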

#Q7
Interpretation:

True Positive (TP): Instances correctly predicted as positive.

True Negative (TN): Instances correctly predicted as negative.

False Positive (FP): Instances incorrectly predicted as positive (Type I error).

False Negative (FN): Instances incorrectly predicted as negative (Type II error).

In [12]:
#7
print("Confusion Matrix:")
print(cm)

# Interpretation (scikit-learn layout: rows are actual classes, columns are predicted classes)
print("True Positive:", cm[1, 1])
print("True Negative:", cm[0, 0])
print("False Positive:", cm[0, 1])
print("False Negative:", cm[1, 0])


Confusion Matrix:
[[19  0]
 [ 0 11]]
True Positive: 11
True Negative: 19
False Positive: 0
False Negative: 0
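
For a binary problem, the four cells can also be unpacked in one step with ravel(), which returns them in the order TN, FP, FN, TP. The labels below are hypothetical, and the pandas table is only an optional way to label the rows and columns.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels chosen so that every cell is non-zero
y_true_cells = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_cells = [0, 1, 0, 1, 0, 1, 1, 0]

cm_cells = confusion_matrix(y_true_cells, y_pred_cells)

# Row-major flattening of [[TN, FP], [FN, TP]]
tn_c, fp_c, fn_c, tp_c = cm_cells.ravel()
print("TN, FP, FN, TP:", tn_c, fp_c, fn_c, tp_c)

# Labeled view: rows are actual classes, columns are predicted classes
cm_table = pd.DataFrame(
    cm_cells,
    index=["actual 0", "actual 1"],
    columns=["predicted 0", "predicted 1"],
)
print(cm_table)
```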


#Q8
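Common metrics derived from a confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall (Sensitivity) = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1 Score = 2 * Precision * Recall / (Precision + Recall)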

In [14]:
#8
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
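
Specificity has no dedicated function in sklearn.metrics, but it can be computed from the confusion matrix, or equivalently as the recall of the negative class. A minimal sketch with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 5 actual negatives and 3 actual positives
y_true_spec = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_spec = [0, 1, 0, 0, 1, 0, 1, 0]

tn_s, fp_s, fn_s, tp_s = confusion_matrix(y_true_spec, y_pred_spec).ravel()

# Specificity: share of actual negatives that were predicted negative
specificity = tn_s / (tn_s + fp_s)

# Same quantity, expressed as recall with class 0 treated as the positive label
specificity_via_recall = recall_score(y_true_spec, y_pred_spec, pos_label=0)

print("Specificity:", specificity)
print("Recall of class 0:", specificity_via_recall)
```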


#Q9
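Accuracy: The proportion of all predictions that are correct, (TP + TN) / (TP + TN + FP + FN). It equals the sum of the confusion matrix diagonal divided by the total number of samples.

Confusion Matrix: Provides the breakdown behind that single number, showing which classes the correct and incorrect predictions fall into.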

In [15]:
#9
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print confusion matrix
print("Confusion Matrix:")
print(cm)

Accuracy: 1.0
Confusion Matrix:
[[19  0]
 [ 0 11]]
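
The link between the two can be verified directly: the diagonal of the confusion matrix holds the correct predictions, so its sum divided by the total count is the accuracy. The labels below are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels with one false positive and one false negative
y_true_acc = np.array([0, 0, 0, 0, 1, 1])
y_pred_acc = np.array([0, 0, 0, 1, 1, 0])

cm_acc = confusion_matrix(y_true_acc, y_pred_acc)

# Diagonal entries (TN and TP) are the correct predictions
accuracy_from_cm = np.trace(cm_acc) / cm_acc.sum()

print(cm_acc)
print("Accuracy from matrix:", accuracy_from_cm)
print("accuracy_score:", accuracy_score(y_true_acc, y_pred_acc))
```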


#Q10
Identification of Biases or Limitations:

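Class Imbalance: Compare the row totals of the confusion matrix. Each row sums to the number of actual samples in that class, so very unequal totals indicate an imbalanced dataset.

Bias Toward Majority Class: A model might perform well on the majority class but poorly on the minority class, in which case high accuracy coexists with low recall on the minority class.

False Positive/Negative Rates: Compare the off-diagonal cells to see which kind of error dominates and which class bears its cost.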

In [17]:
#10
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true labels and predictions, defined here for illustration
y_test = np.array([1, 0, 1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0, 0, 1])

# Create a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Count each outcome directly from the label arrays
tp = np.sum((y_test == 1) & (y_pred == 1))
tn = np.sum((y_test == 0) & (y_pred == 0))
fp = np.sum((y_test == 0) & (y_pred == 1))
fn = np.sum((y_test == 1) & (y_pred == 0))

print("Confusion Matrix:")
print(cm)

# Interpretation
print("True Positive:", tp)
print("True Negative:", tn)
print("False Positive:", fp)
print("False Negative:", fn)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Confusion Matrix:
[[3 2]
 [1 4]]
True Positive: 4
True Negative: 3
False Positive: 2
False Negative: 1
Accuracy: 0.7
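
The majority-class bias is easiest to see with an extreme case. The sketch below uses made-up labels with 95 negatives and 5 positives and a hypothetical model that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
    recall_score,
)

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true_imb = np.array([0] * 95 + [1] * 5)
# A degenerate model that always predicts the majority class
y_pred_imb = np.zeros(100, dtype=int)

cm_imb = confusion_matrix(y_true_imb, y_pred_imb)
print(cm_imb)

# Row totals reveal the imbalance: actual samples per class
print("Samples per actual class:", cm_imb.sum(axis=1))

print("Accuracy:", accuracy_score(y_true_imb, y_pred_imb))
print("Recall (positive class):", recall_score(y_true_imb, y_pred_imb))
print("Balanced accuracy:", balanced_accuracy_score(y_true_imb, y_pred_imb))
```

Accuracy is 0.95 even though no positive sample is ever detected. Recall on the positive class is 0, and balanced accuracy, which averages the recall of both classes, is 0.5.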
