# 4.2 Decision Tree
A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. The tree is constructed by recursively splitting the data into subsets based on feature
values, forming a tree-like structure. Each internal node represents a decision based on a feature,
and each leaf node represents a prediction: a class label for classification tasks or a numeric value for regression tasks.
In a Decision Tree Classifier, the goal is to predict the value of a target variable by learning
simple decision rules inferred from data features.
## 4.2.1 How It Works
• Root Node Creation: The tree starts at the root node, which holds the full training set. At each
node, the split (a feature and a split value) that most reduces impurity, i.e., maximizes the
"information gain", is chosen. Impurity can be measured with Gini impurity or Entropy.

• Recursive Partitioning: The data is split at each node until one of the following conditions
is met:

– All records belong to the same class.

– The maximum depth of the tree is reached.

– Other specified criteria (such as min_samples_split or min_impurity_decrease) are
satisfied.

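• Tree Pruning: To reduce overfitting, tree growth can be restricted in advance (pre-pruning) by limiting its depth (max_depth)
or setting other parameters like min_samples_split. The tree can also be pruned after training (post-pruning), for example with cost-complexity pruning (ccp_alpha).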

• Classification: At the leaf nodes, the predicted class is determined based on the majority
class of the training examples at that node.
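
The steps above can be made concrete by printing the rules of a small fitted tree. The following is a minimal sketch, assuming scikit-learn is installed; the choice of `max_depth=2` and the use of the full Iris dataset are illustrative assumptions, not part of the lab's own workflow.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# A shallow tree (max_depth=2 is an illustrative choice) keeps the rules readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(data.data, data.target)

# Each "<=" / ">" line is an internal-node test; each "class: k" line is a leaf
# that predicts the majority class of the training samples reaching it.
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Reading the printed rules from top to bottom follows one root-to-leaf path, which is exactly how a prediction is made for a new sample.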
## 4.2.2 Advantages and Disadvantages
Advantages:

• Interpretability: Easy to visualize and interpret, especially for small trees.

• Non-linearity: Can handle non-linear relationships between features.

• No Data Normalization: Does not require scaling of the data.

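• Handling Missing Data: Some tree implementations can handle missing values natively (for example via surrogate splits); whether this is supported depends on the library and its version.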

Disadvantages:

• Overfitting: Decision Trees can overfit on the training data, especially when the tree is deep.

• Unstable: Small changes in the data can result in a completely different tree.

• Bias: Trees can be biased towards the dominant class in an imbalanced dataset.

## 4.2.3 Splitting Criteria

• Gini Impurity:
Gini impurity is a measure of how often a randomly chosen element would be incorrectly
classified if randomly labeled according to the class distribution of a subset. Lower Gini
impurity means better splits.
$$\text{Gini}(S) = 1 - \sum_{i=1}^{n} p_i^2$$
where $p_i$ is the probability of an element being classified as class $i$.

• Entropy (Information Gain)
Entropy measures the amount of uncertainty in a dataset. The goal is to maximize information
gain, which reduces the uncertainty (or entropy) after the split.
$$\text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
where $p_i$ is the probability of an element being classified as class $i$ (which can be estimated as
the proportion of elements in class $i$).
Information gain is calculated as:
$$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \times \text{Entropy}(S_v)$$
where $S$ is the dataset, $A$ is the attribute used to split $S$, and $S_v$ represents the subset of
$S$ for which attribute $A$ has value $v$.
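
The three formulas above can be checked on a tiny example. This is a minimal pure-Python sketch; the function names and the toy labels are hypothetical, and class probabilities are estimated as class proportions, as suggested in the text.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Estimate p_i as the proportion of each class in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))


def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i)."""
    return sum(-p * log2(p) for p in class_probabilities(labels))


def information_gain(parent, subsets):
    """IG = Entropy(parent) - size-weighted average entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]  # toy labels, evenly mixed
print(gini(parent))     # 0.5
print(entropy(parent))  # 1.0
# A split that separates the classes perfectly removes all uncertainty.
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

An evenly mixed two-class node has the maximum impurity under both criteria, and a split into pure subsets recovers the full entropy of the parent as information gain.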
## 4.2.4 Parameters of Decision Tree in Scikit-learn
Scikit-learn’s DecisionTreeClassifier offers several parameters to control tree behavior and prevent overfitting. Below is a list of key parameters:

| Parameter | Description | Default Value |
| :--- | :--- | :--- |
| criterion | Function to measure the quality of a split. Options are "gini" (Gini impurity) and "entropy" (Information gain). | "gini" |
| splitter | Strategy to split at each node. Options are "best" (chooses the best split) and "random" (chooses a random split). | "best" |
| max_depth | Maximum depth of the tree. Controls tree depth to prevent overfitting. If None, nodes expand until all leaves are pure. | None |
| min_samples_split | Minimum number of samples required to split an internal node. Increasing this helps prevent small splits and overfitting. | 2 |
| min_samples_leaf | Minimum number of samples required to be at a leaf node. Increasing this prevents trees from having overly small leaves. | 1 |
| min_weight_fraction_leaf | Minimum weighted fraction of the total weights required to be at a leaf node. | 0.0 |
| max_features | Number of features to consider when looking for the best split. Can be an integer, float, "sqrt", "log2", or None (older releases also accepted "auto"). | None |
| random_state | Controls the randomness of the estimator. Setting a value ensures reproducibility of results. | None |
| max_leaf_nodes | Grow the tree with a maximum number of leaf nodes. If None, an unlimited number of leaf nodes will be allowed. | None |
| min_impurity_decrease | Node will be split if the split induces a decrease in impurity greater than or equal to this value. | 0.0 |
| class_weight | Weights associated with classes. If None, all classes are assumed to have weight one. | None |
| ccp_alpha | Complexity parameter used for Minimal Cost-Complexity Pruning. The larger the value, the more the tree will be pruned. | 0.0 |

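To see how a few of these parameters change the fitted tree, the sketch below compares a tree grown with default settings against a constrained one. It assumes scikit-learn is installed; the values `max_depth=3` and `min_samples_leaf=5` are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Default settings: the tree keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Constrained tree; the specific values are illustrative only.
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X, y)

for name, model in [("default", full_tree), ("constrained", small_tree)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

A smaller tree is usually easier to interpret and less prone to memorizing the training data, at the cost of some training accuracy.
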
## 4.2.5 Code Example of Decision Tree
Below is a code example of a Decision Tree classifier in Scikit-learn

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = dt_model.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```

Decision Tree Accuracy: 1.00


## 4.2.6 Conclusion
The Decision Tree algorithm is a powerful classification tool, but it can easily overfit the training
data. To mitigate overfitting, tuning hyperparameters like max_depth, min_samples_split, and
min_samples_leaf is crucial. Methods like Grid Search or Random Search help find a good
combination of parameters systematically. These methods are described and studied later in
the lab (see Sec. 6).
# 4.3 Random Forest
Random Forest is an ensemble learning algorithm, used for both classification and regression
tasks. It is based on the idea of building multiple decision trees and combining their predictions to
improve model accuracy and prevent overfitting. It works by creating multiple decision trees (hence
a "forest") during training time and outputting the mode of the classes (classification) or the mean
prediction (regression) of the individual trees.
Random Forest is built on two main ideas:

• Bagging (Bootstrap Aggregation): The algorithm uses bootstrapping, i.e., creating different subsets of the training dataset (with replacement), to train individual decision trees.

• Random Feature Selection: At each split in the decision tree, a random subset of the
features is considered, instead of using all features. This helps to ensure that individual trees
are less correlated.

## 4.3.1 How It Works:
1. Training Phase:

• The algorithm creates multiple decision trees. Each tree is trained on a random subset
of the data (sampled with replacement).

• Each tree is built using a random subset of the features (selected randomly at each node
split).

• Every tree grows to the maximum possible depth (unless constrained by hyperparameters
like max_depth or min_samples_split).
2. Prediction Phase:

• For classification tasks, each tree in the forest outputs a class label, and the final prediction
is made based on a majority vote.

• For regression tasks, the predictions of all trees are averaged to obtain the final prediction.
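
The two sources of randomness and the two aggregation rules described above can be sketched without any library. This is a minimal illustration; the helper names and the toy inputs are hypothetical, and no actual trees are trained here.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (one tree's training set)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Classification: the most common label among the trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: the forest output is the mean of the tree outputs."""
    return sum(predictions) / len(predictions)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
rows = list(range(10))   # stand-in for 10 training records
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=4, k=2, rng=rng)

print(sample)    # same size as rows; usually has duplicates and omissions
print(features)  # two of the four feature indices
print(majority_vote(["setosa", "virginica", "setosa"]))  # setosa
print(average([2.0, 3.0, 4.0]))                          # 3.0
```

Records that a bootstrap sample leaves out are the "out-of-bag" samples for that tree, which is what makes the out-of-bag accuracy estimate possible.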
## 4.3.2 Advantages of Random Forest:
1. Reduction in Overfitting: Random Forest reduces the overfitting that is often
encountered in individual Decision Trees. This is achieved by averaging the predictions of multiple trees (in regression)
or taking the majority vote (in classification), which lowers the variance.
2. Robustness to Noise: Random Forest is relatively robust to noisy data, because the errors of
individual trees tend to average out. Some implementations also include strategies for handling missing values.
3. Feature Importance: Random Forest provides a measure of feature importance, which can
help in feature selection.
4. Versatility: It can handle both classification and regression problems, and can work with
high-dimensional data.
5. Scalability: It is highly scalable because it builds each tree independently, and therefore can
be easily parallelized.
## 4.3.3 Disadvantages of Random Forest:
1. Complexity and Interpretability: Random Forest models are more complex and harder
to interpret compared to individual decision trees.
2. Computationally Intensive: Since multiple trees are grown, the algorithm can be computationally expensive and slow, especially when the number of trees is large or the dataset is
huge.
3. Bias-Variance Tradeoff: While Random Forest reduces variance, it can slightly increase
bias compared to individual decision trees.

## 4.3.4 Parameters of Random Forest in Scikit-learn
The RandomForestClassifier in Scikit-learn offers several hyperparameters to control the behavior
of the model and tune its performance:

| Parameter | Description | Default Value |
| :--- | :--- | :--- |
| n_estimators | The number of trees in the forest. Increasing this generally improves performance, with diminishing returns, but also increases computation time. | 100 |
| criterion | The function to measure the quality of a split. Supported criteria are 'gini' for Gini impurity and 'entropy' for Information Gain. | 'gini' |
| max_depth | The maximum depth of the tree. Limiting depth controls overfitting. If None, nodes are expanded until all leaves are pure. | None |
| min_samples_split | The minimum number of samples required to split an internal node. Larger values prevent overfitting. | 2 |
| min_samples_leaf | The minimum number of samples required to be at a leaf node. Larger values prevent deep trees with fewer samples per leaf. | 1 |
| min_weight_fraction_leaf | The minimum weighted fraction of the input samples required to be at a leaf node. | 0.0 |
| max_features | The number of features to consider when looking for the best split. Can be 'sqrt', 'log2', None, an integer, or a float. | 'sqrt' ('auto' in scikit-learn versions before 1.1) |
| max_leaf_nodes | Grow the tree with a maximum number of leaf nodes. If None, unlimited leaf nodes are allowed. | None |
| min_impurity_decrease | A node will be split if the split causes a decrease in impurity greater than or equal to this value. | 0.0 |
| bootstrap | Whether bootstrap samples are used when building trees. If False, the entire dataset is used to build each tree. | True |
| oob_score | Whether to use out-of-bag samples to estimate the generalization accuracy. | False |
| n_jobs | The number of jobs to run in parallel. -1 means using all processors. | None |
| random_state | Controls the randomness of the estimator. If set, ensures reproducibility of results. | None |
| verbose | Controls the verbosity when fitting and predicting. | 0 |
| warm_start | If True, reuse the solution of the previous call to add more estimators. | False |
| class_weight | Weights associated with classes. If None, all classes are supposed to have weight one. | None |
| ccp_alpha | Complexity parameter used for Minimal Cost-Complexity Pruning. The higher the value, the more the tree is pruned. | 0.0 |
| max_samples | If bootstrap=True, the number or fraction of samples to draw from the original data to train each base estimator. | None |

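Two of the parameters above have a direct practical use: `oob_score` gives an accuracy estimate without a separate validation set, and the fitted model exposes feature importances. The sketch below assumes scikit-learn is installed and fits on the full Iris dataset purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()

# oob_score=True relies on bootstrap sampling, which is enabled by default.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(data.data, data.target)

# Each tree is scored on the training records it did not see.
print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")

# Impurity-based importances; they are normalized to sum to 1.
for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```
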
## 4.3.5 Code Example of Random Forest:
Here is an example of a Random Forest Classifier using Scikit-learn.


```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf_model.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```

Random Forest Accuracy: 1.00


## 4.3.6 Conclusion
The Random Forest algorithm is a powerful and versatile machine learning model that can handle
both classification and regression tasks. It reduces overfitting by averaging the results of multiple
trees and introduces randomness into the feature selection process, which makes it more robust.
Effective tuning of hyperparameters like n_estimators, max_depth, min_samples_split, and
bootstrap can significantly improve performance.
# 4.4 Support Vector Machine (SVM)
Support Vector Machines (SVM) are a set of supervised learning algorithms primarily used for
classification, but they can also be applied to regression and outlier detection tasks. SVMs are
effective for both linear and non-linear data classification problems. The key idea behind SVM is to
find a hyperplane that best separates the data points of different classes in the feature space.
SVM aims to maximize the margin between the data points of different classes. The points that
are closest to the hyperplane are known as support vectors, and they are the most critical elements
of the dataset for defining the classification boundary.
## 4.4.1 How SVM Works
1. Linear SVM:
• For a linearly separable dataset, SVM finds the optimal hyperplane that separates the
data into two classes with the maximum margin. The margin is the distance between the
hyperplane and the nearest data points from both classes (called support vectors).

• Mathematically, the objective of SVM is to find a hyperplane $w^T x + b = 0$ where:
– $w$ is the normal vector to the hyperplane.
– $x$ is the input feature vector.
– $b$ is the bias term.

• SVM maximizes the margin (i.e., the distance between the hyperplane and support vectors) by solving the following (constrained) optimization problem:
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \tag{1}$$
subject to:
$$y_i(w^T x_i + b) \ge 1 \quad \forall i \tag{2}$$
where $y_i$ are the class labels ($y_i \in \{-1, 1\}$).
2. Non-Linear SVM:

• When data is not linearly separable, SVM uses the kernel trick (described in the next
subsection) to map the data into a higher-dimensional space where it becomes linearly
separable. This transformation is done implicitly without needing to compute the coordinates in the new space.
3. Soft Margin SVM:

• In practice, real-world data is often noisy or not perfectly separable. Soft margin SVM
allows some misclassification by introducing a penalty for misclassified data points. This
is controlled by the regularization parameter C, which balances the trade-off between
maximizing the margin and minimizing the classification error.
– If C is large, the model will prioritize minimizing classification errors (small margin).
– If C is small, the model will allow more misclassifications and aim to maximize the
margin.
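
The quantities in this subsection can be evaluated by hand for a given hyperplane. The sketch below is pure Python and uses a hand-picked hyperplane and toy points (both hypothetical, not a fitted model). It relies on two standard facts that the text does not spell out and that should be read as assumptions here: the margin width equals $2/\|w\|$, which is why minimizing $\|w\|$ maximizes the margin, and the soft-margin penalty is commonly written with the hinge loss.

```python
from math import sqrt


def decision_value(w, b, x):
    """Signed score w^T x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2.0 / sqrt(sum(wi ** 2 for wi in w))


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w^T x_i + b))."""
    hinge = sum(
        max(0.0, 1.0 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)
    )
    return 0.5 * sum(wi ** 2 for wi in w) + C * hinge


# Hand-picked hyperplane x_1 = 0 and toy points (illustrative only).
w, b = [1.0, 0.0], 0.0
X = [[2.0, 1.0], [1.0, -1.0], [-1.0, 0.5], [-3.0, 2.0]]
y = [1, 1, -1, -1]

print(margin_width(w))  # 2.0
# Hard-margin constraint (2): every point satisfies y_i (w^T x_i + b) >= 1.
print(all(yi * decision_value(w, b, xi) >= 1 for xi, yi in zip(X, y)))  # True
# No point violates the margin, so only the 0.5 * ||w||^2 term remains.
print(soft_margin_objective(w, b, X, y, C=1.0))  # 0.5
```

Points with $y_i(w^T x_i + b) = 1$, here the second and third, lie exactly on the margin boundaries and play the role of support vectors. A larger `C` makes every margin violation more expensive relative to the margin term.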
## 4.4.2 Kernel Functions in SVM:
The kernel trick allows SVM to work well with non-linearly separable data by implicitly transforming
the feature space. Some common kernels include:
1. Linear Kernel:

• No transformation is applied; the data is assumed to be linearly separable.

• Kernel function: $K(x, x') = x^T x'$.
2. Polynomial Kernel:

• Adds polynomial terms of features to allow non-linear classification.

• Kernel function: $K(x, x') = (x^T x' + 1)^d$, where $d$ is the degree of the polynomial.
3. Radial Basis Function (RBF) Kernel:

• Maps data into an infinite-dimensional space using a Gaussian function.

• Kernel function: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, where $\gamma$ defines the influence of a single
training example.
4. Sigmoid Kernel:

• Mimics the behavior of neural networks.

• Kernel function: $K(x, x') = \tanh(\alpha x^T x' + c)$, where $\alpha$ and $c$ are kernel parameters.
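
The four kernel formulas above translate directly into code. This is a minimal pure-Python sketch; the function names and default parameter values are assumptions chosen for illustration and are not tied to any library's defaults.

```python
from math import exp, tanh


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, d=3):
    return (dot(x, z) + 1) ** d


def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, alpha=1.0, c=0.0):
    return tanh(alpha * dot(x, z) + c)


x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))           # 11.0
print(polynomial_kernel(x, z, d=2))  # 144.0
print(rbf_kernel(x, x))              # 1.0 (a point is maximally similar to itself)
print(rbf_kernel(x, z, gamma=0.5))   # close to 0 for distant points
```

Each function returns a similarity score for a pair of points; the SVM only ever needs these pairwise values, which is why the higher-dimensional coordinates never have to be computed explicitly.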

## 4.4.3 Advantages of SVM:
1. Effective in High-Dimensional Spaces: SVM works well even when the number of features
is greater than the number of samples.
2. Robustness to Overfitting: The regularization parameter C allows control over the trade-off
between misclassification and margin maximization, which helps limit overfitting.
3. Flexibility through Kernels: Different kernel functions enable SVM to handle linear and
non-linear classification problems.
4. Support Vectors: Only the support vectors are used to define the decision boundary, making
the algorithm memory efficient.
## 4.4.4 Disadvantages of SVM:
1. High Computational Cost: SVM can be computationally intensive, especially for large datasets.
2. Choice of Kernel and Parameters: SVM’s performance depends heavily on the choice of kernel
function and hyperparameters (like C and γ). Selecting the right kernel can be challenging.
3. Sensitivity to Noise: SVM is sensitive to noisy data and outliers, especially when the data
points are close to the decision boundary.
## 4.4.5 Parameters of SVM in Scikit-learn:
Scikit-learn’s SVC (Support Vector Classifier) provides several parameters that can be tuned to
control the behavior of the model and optimize its performance.

| Parameter | Description | Default Value |
| :--- | :--- | :--- |
| C | Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. Smaller values specify stronger regularization. | 1.0 |
| kernel | Specifies the kernel type to be used in the algorithm. Options are: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'. | 'rbf' |
| degree | Degree of the polynomial kernel function (kernel='poly'). Ignored by other kernels. | 3 |
| gamma | Kernel coefficient for 'rbf', 'poly', and 'sigmoid'. Options are: 'scale' (uses $1 / (n_{\text{features}} \times X.\text{var}())$) or 'auto' (uses $1 / n_{\text{features}}$). | 'scale' |
| coef0 | Independent term in kernel function. It is used in 'poly' and 'sigmoid' kernels. | 0.0 |
| shrinking | Whether to use the shrinking heuristic. | True |
| probability | Whether to enable probability estimates. This must be set to True before calling fit. | False |
| tol | Tolerance for stopping criterion. | 1e-3 |
| cache_size | Size of the kernel cache (in MB). | 200 |
| class_weight | Weights associated with classes. If not given, all classes are supposed to have weight one. 'balanced' mode uses the values inversely proportional to class frequencies. | None |
| verbose | Controls the verbosity when fitting and predicting. | False |
| max_iter | Hard limit on iterations within solver, or -1 for no limit. | -1 |
| decision_function_shape | Whether to return a one-vs-rest ('ovr') decision function of shape (n_samples, n_classes) as output, or the original one-vs-one ('ovo') shape. | 'ovr' |
| break_ties | If True, decision function will be used to break ties in multi-class classification when decision_function_shape='ovr'. | False |
| random_state | Controls the pseudo-random number generation for shuffling the data for probability estimates. Only used when probability=True. | None |

## 4.4.6 Code Example of Support Vector Machine
Here’s an example of using the Support Vector Classifier (SVC) with Scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train SVM model
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = svm_model.predict(X_test)
print(f"SVM Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```

SVM Accuracy: 1.00


# 4.5 K-Nearest Neighbors (KNN)
The K-Nearest Neighbors (KNN) algorithm is a simple, yet powerful, non-parametric machine
learning method used for both classification and regression tasks. It is based on the principle that
data points that are close to each other in feature space are likely to belong to the same class (in
classification) or have similar values (in regression).

KNN is an example of a lazy learner algorithm because it does not explicitly build a model
during the training phase. Instead, it stores the training dataset and defers all computation to
prediction time, when it compares the test instance with the stored training instances.
Accordingly, it is also an example of instance-based learning.

## 4.5.1 How KNN Works
1. Choosing k: The parameter k determines how many neighbors will be considered for the
prediction. Typically, an odd k is chosen to avoid ties in binary classification problems; with more than two classes, ties remain possible.
2. Distance Calculation: For each test instance, KNN calculates the distance to all the training
points. The most common distance metric used is Euclidean distance, defined as:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where $x$ and $y$ are two points in an $n$-dimensional feature space.
3. Identifying Neighbors: After computing the distances, the algorithm selects the k-nearest
neighbors, which are the training points that have the smallest distance from the test instance.
4. Making Predictions:

• Classification: The class of the test instance is predicted based on the majority class
among the k-nearest neighbors.

• Regression: The value of the test instance is predicted as the average of the values of
the k-nearest neighbors.
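
The four steps above fit in a few lines of pure Python. The following is a minimal brute-force sketch; the function names and the toy data are hypothetical, and ties in the vote are resolved arbitrarily by whichever class is counted first.

```python
from collections import Counter
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def k_nearest(train_X, train_y, query, k):
    """Return the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    return [train_y[i] for i in order[:k]]


def knn_classify(train_X, train_y, query, k=3):
    """Majority class among the k nearest neighbors."""
    return Counter(k_nearest(train_X, train_y, query, k)).most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k=3):
    """Average target value of the k nearest neighbors."""
    neighbours = k_nearest(train_X, train_y, query, k)
    return sum(neighbours) / len(neighbours)


# Toy data invented for illustration: two well-separated groups.
X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.5], [8.0, 6.0]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(X, labels, [1.2, 1.4], k=3))  # a
print(knn_regress(X, values, [7.0, 6.5], k=3))   # 11.0
```

Note that every prediction sorts all training points by distance, which is the source of the prediction-time cost discussed in the disadvantages below.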
## 4.5.2 Advantages of KNN

• Simplicity: KNN is easy to implement and understand.

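• No Explicit Training Phase: Since KNN is a lazy learner, fitting essentially amounts to storing the training data (and, depending on the implementation, building a search index); the computational cost is paid at prediction time.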

• Non-parametric: KNN does not make any assumptions about the underlying data distribution, making it versatile and applicable to many types of data.

• Flexible: KNN can be used for both classification and regression tasks.

## 4.5.3 Disadvantages of KNN

• Computationally Expensive: KNN needs to compute the distance between the test instance
and all training points, which can be slow, especially for large datasets.

• Storage Intensive: The algorithm must store all the training data, requiring considerable
memory.

• Sensitive to Noise and Irrelevant Features: KNN can be affected by noisy data or
irrelevant features, as all features contribute equally to the distance calculation.

• Choice of k: Choosing an optimal value for k is crucial. A small k may lead to overfitting,
while a large k may lead to underfitting.

## 4.5.4 Parameters of KNN in Scikit-learn
The KNeighborsClassifier in Scikit-learn offers several parameters to control the behavior of the
KNN algorithm:


| Parameter | Description | Default Value |
| :--- | :--- | :--- |
| n_neighbors | The number of neighbors to use for k-nearest neighbors voting. | 5 |
| weights | Weight function used in prediction. Can be: 'uniform' (all points weighted equally), 'distance' (closer points have higher influence), or a custom function. | 'uniform' |
| algorithm | Algorithm used to compute the nearest neighbors. Options are: 'auto' (chooses best algorithm), 'ball_tree' (BallTree), 'kd_tree' (KDTree), 'brute' (Brute-force). | 'auto' |
| leaf_size | Leaf size passed to BallTree or KDTree algorithms. It affects the speed of construction/query and memory usage. | 30 |
| p | The power parameter for the Minkowski distance: p = 1 (Manhattan), p = 2 (Euclidean). | 2 |
| metric | The distance metric to use. Default is Minkowski: $d(x, y) = \left(\sum|x_i - y_i|^p\right)^{1/p}$ | 'minkowski' |
| metric_params | Additional keyword arguments for the metric function. | None |
| n_jobs | The number of parallel jobs to run for neighbors search. If -1, all processors are used. | None |

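The `p` and `metric` rows above refer to the Minkowski distance, which can be checked on a small example. This is a minimal pure-Python sketch of the formula in the table; the function name is hypothetical.

```python
def minkowski(x, y, p=2):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, p=1))  # 7.0
print(minkowski(x, y, p=2))  # 5.0
```

The same pair of points is farther apart under the Manhattan distance than under the Euclidean distance, so the choice of `p` can change which neighbors count as nearest.
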
## 4.5.5 Code Example
Here’s an example of using the K-Nearest Neighbors algorithm in Scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn_model.predict(X_test)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred)}")
```

KNN Accuracy: 1.0


## 4.5.6 Conclusion
The K-Nearest Neighbors (KNN) algorithm is a simple and effective tool for both classification and
regression tasks. It relies on the intuition that similar points are likely to belong to the same class
or have similar values. While it is easy to understand and implement, KNN is computationally
expensive and sensitive to the choice of hyperparameters like k and the distance metric. Tuning
these parameters can significantly impact the performance of the model.

# 4.6 Logistic Regression (LR)
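Logistic Regression is a supervised learning algorithm used primarily for binary classification problems,
where the output variable is categorical and consists of two classes (0 or 1, true or false, etc.). It can be extended to more than two classes with one-vs-rest or multinomial formulations.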
Despite its name, logistic regression is not a regression algorithm but a classification one. The
algorithm estimates the probability that a given instance belongs to a certain class by fitting data
to a logistic (sigmoid) function.
## 4.6.1 Sigmoid Function
Logistic regression models the probability of a binary outcome using a sigmoid function, which is
defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = w^T x + b$ is the linear combination of the input features $x$, weights $w$, and bias $b$. The
output of the sigmoid function is a value between 0 and 1, which can be interpreted as the probability
of the instance belonging to the positive class:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^T x + b)}}$$
## 4.6.2 How Logistic Regression Works
1. Linear Model: Logistic regression begins by fitting a linear model $w^T x + b$, where $w$ represents
the coefficients (weights) for the features and $b$ represents the bias.
2. Sigmoid Transformation: The linear model’s output is passed through the sigmoid function
to ensure the output lies between 0 and 1, representing the probability of the sample belonging
to the positive class.
3. Thresholding: Once the probability is estimated, a threshold (usually 0.5) is applied to
classify the instance into one of two classes:
$$\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \ge 0.5 \\ 0 & \text{if } P(y = 1 \mid x) < 0.5 \end{cases}$$
4. Cost Function: The algorithm uses the logistic loss (also known as log-loss or binary cross-entropy) as the cost function:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(P(y = 1 \mid x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - P(y = 1 \mid x^{(i)})\right) \right]$$
5. Optimization: The weights w and bias b are optimized by minimizing the cost function using
techniques like gradient descent.
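
The five steps above can be traced in a small from-scratch implementation. This is a minimal pure-Python sketch using plain batch gradient descent; the learning rate, number of epochs, function names, and toy data are all assumptions made for illustration, and library solvers such as those in scikit-learn work differently.

```python
from math import exp, log


def sigmoid(z):
    # Numerically stable form: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Steps 1-2: linear model followed by the sigmoid."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def cost(w, b, X, y):
    """Step 4: average log-loss J(w, b)."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(w, b, xi), eps), 1 - eps)
        total += yi * log(p) + (1 - yi) * log(1 - p)
    return -total / len(X)


def fit(X, y, lr=0.1, epochs=1000):
    """Step 5: batch gradient descent on J(w, b)."""
    w, b = [0.0] * len(X[0]), 0.0
    m = len(X)
    for _ in range(epochs):
        errors = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(len(w))]
        grad_b = sum(errors) / m
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy one-feature data invented for illustration: class 1 for positive x.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit(X, y)
# Step 3: threshold the probabilities at 0.5.
preds = [1 if predict_proba(w, b, xi) >= 0.5 else 0 for xi in X]
print(preds)  # [0, 0, 0, 1, 1, 1]
print(f"cost before: {cost([0.0], 0.0, X, y):.3f}, after: {cost(w, b, X, y):.3f}")
```

With all weights at zero every probability is 0.5, so the starting cost is $\log 2 \approx 0.693$; gradient descent then lowers it as the weight grows.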
## 4.6.3 Advantages of Logistic Regression

• Interpretability: The model’s coefficients can be interpreted as the impact of each feature
on the predicted probability.

• Probability Output: Logistic regression provides probability estimates, which can be useful
for ranking or prioritizing instances.

• Computationally Efficient: It is efficient for small to medium-sized datasets and converges
quickly.

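• Limited Need for Feature Scaling: Unregularized logistic regression does not strictly require
feature scaling, although scaling often helps gradient-based solvers converge and matters when regularization is used.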
## 4.6.4 Disadvantages of Logistic Regression

• Linear Decision Boundary: Logistic regression assumes a linear relationship between features and the log-odds of the outcome, which may not be appropriate for non-linear data.

• Imbalanced Data: Logistic regression may struggle with imbalanced datasets, as it tends to
predict the majority class.

• Overfitting with High-Dimensional Data: When there are too many features, logistic
regression can easily overfit the training data. Regularization techniques like L1 or L2 are
often needed to address this.
## 4.6.5 Parameters of Logistic Regression in Scikit-learn
The LogisticRegression class in Scikit-learn provides several parameters that allow the user to
customize the behavior of the logistic regression model. Below is a list of these parameters:

| Parameter | Description | Default Value |
| :--- | :--- | :--- |
| penalty | Specifies the regularization type. Options: 'l1' (Lasso), 'l2' (Ridge), 'elasticnet', or no penalty (None; spelled 'none' in older releases). | 'l2' |
| dual | Whether to solve the dual optimization problem (only for penalty='l2' and solver='liblinear'). | False |
| tol | Tolerance for stopping criteria during optimization. | 1e-4 |
| C | Inverse of the regularization strength. Smaller values specify stronger regularization. | 1.0 |
| fit_intercept | Whether to include an intercept (bias term) in the model. | True |
| intercept_scaling | Scaling factor for the intercept when using solver='liblinear'. | 1 |
| class_weight | Weights associated with each class. Can be None (equal weights) or 'balanced' (inversely proportional to class frequencies). | None |
| solver | Algorithm for optimization. Options: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'. | 'lbfgs' |
| max_iter | Maximum number of iterations for the solver to converge. | 100 |
| multi_class | Strategy for handling multiple classes: 'auto' (chooses 'ovr' or 'multinomial'), 'ovr' (One-vs-rest), 'multinomial'. | 'auto' |
| verbose | Enables verbose output for the solver. | 0 |
| warm_start | Whether to reuse the previous solution for the optimization. | False |
| n_jobs | Number of CPU cores used when parallelizing over classes (one-vs-rest). Ignored when solver='liblinear'. | None |

## 4.6.6 Code Example of Logistic Regression
Here’s an example of using the Logistic Regression algorithm in Scikit-learn:


```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Initialize Logistic Regression
log_reg = LogisticRegression()

# Train the model
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.2f}")
```

Test Accuracy: 1.00


## 4.6.7 Conclusion
Logistic Regression is a simple and widely used classification algorithm that models the probability
of a binary outcome. It works by applying a logistic transformation to a linear model, producing
outputs between 0 and 1. Logistic regression is effective for many problems, but its performance
depends on the underlying data and the choice of regularization techniques.
# 5 Performance Evaluation Metrics
In machine learning, evaluating model performance is essential to understand how well the model
generalizes to unseen data. Below are some common performance evaluation metrics used for classification and regression tasks.
# 5.1 Accuracy
Accuracy is the ratio of correctly predicted observations to the total observations. It is the most
intuitive performance measure, but it is most informative when the classes are balanced.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where:

• TP: True Positives

• TN: True Negatives

• FP: False Positives

• FN: False Negatives
# 5.2 Precision
Precision (also called positive predictive value) is the ratio of correctly predicted positive observations to the total predicted positive observations. Precision is used when the cost of false positives
is high.
$$\text{Precision} = \frac{TP}{TP + FP}$$
# 5.3 Recall (Sensitivity or True Positive Rate)
Recall (also called sensitivity or true positive rate) is the ratio of correctly predicted positive
observations to all observations in the actual positive class. Recall is useful when the cost of false negatives
is high.
$$\text{Recall} = \frac{TP}{TP + FN}$$
# 5.4 F1-Score
The F1-Score is the harmonic mean of precision and recall. It is a useful metric when you need a
balance between precision and recall.
$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
# 5.5 Specificity (True Negative Rate)
Specificity is the ratio of correctly predicted negative observations to all actual negatives. It is
useful for evaluating the performance of a model on negative observations.
$$\text{Specificity} = \frac{TN}{TN + FP}$$
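
The metrics in Sections 5.1 to 5.5 all derive from the same four counts, so they can be computed together. This is a minimal pure-Python sketch; the function names and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the formulas prescribe.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    # Assumed convention: report 0.0 when the metric is undefined.
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }


# Toy labels invented for illustration: TP=3, FN=1, FP=1, TN=5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On this example accuracy is 0.800, precision, recall, and F1 are all 0.750, and specificity is 0.833, which shows how a single accuracy figure can hide the different error rates on the two classes.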
# 5.6 ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
The ROC curve is a graphical representation of the performance of a classification model at different
threshold values. The AUC (Area Under Curve) summarizes the ROC curve into a single value,
with 1 representing a perfect classifier and 0.5 representing random guessing.
The ROC curve is plotted with:
• True Positive Rate (TPR) on the y-axis, given by: $\frac{TP}{TP + FN}$
• False Positive Rate (FPR) on the x-axis, given by: $\frac{FP}{FP + TN}$
The AUC is calculated as the area under this curve.
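
A minimal sketch of this construction is shown below: it sweeps the decision threshold over the predicted scores, records one (FPR, TPR) point per threshold, and integrates with the trapezoidal rule. The function names `roc_points` and `auc` and the four toy scores are assumptions for illustration, and the sketch assumes both classes are present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from high to low."""
    pos = sum(y_true)             # number of actual positives
    neg = len(y_true) - pos       # number of actual negatives
    points = [(0.0, 0.0)]         # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of class 1

points = roc_points(y_true, scores)
print(points)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(points))  # 0.75
```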
# 5.7 Logarithmic Loss (Log Loss)
Log Loss measures the performance of a classification model where the output is a probability value
between 0 and 1. Lower log loss values indicate better model performance.
For binary classification, log loss is defined as:
$$
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
$$
where:

• N is the number of samples.

• $y_i$ is the actual label (0 or 1).

• $p_i$ is the predicted probability for class 1.
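
The binary log loss can be sketched in a few lines of plain Python. The function name `binary_log_loss`, the clipping constant `eps`, and the toy labels are assumptions for illustration. Clipping keeps `log(0)` from occurring when a predicted probability is exactly 0 or 1.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.1, 0.8, 0.3]  # predicted probability of class 1

loss = binary_log_loss(y_true, p_pred)
print(round(loss, 4))  # 0.4095
```

Most of the loss comes from the last sample, where the model assigns only 0.3 to the true class. Confident wrong predictions are penalized heavily.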
# 5.8 Mean Absolute Error (MAE)
For regression tasks, Mean Absolute Error (MAE) is the average of the absolute differences
between predicted and actual values.
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$
where:

• $y_i$ is the actual value.

• $\hat{y}_i$ is the predicted value.

• n is the number of samples.
# 5.9 Mean Squared Error (MSE)
The Mean Squared Error (MSE) is the average of the squared differences between predicted and
actual values.
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
# 5.10 Root Mean Squared Error (RMSE)
The Root Mean Squared Error (RMSE) is the square root of the mean squared error. It is
commonly used because it has the same units as the output variable.
$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$
# 5.11 R-squared (Coefficient of Determination)
The R-squared metric measures how well the regression predictions fit the actual data. It is
the proportion of the variance in the dependent variable that is predictable from the independent
variables.
$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$
where $\bar{y}$ is the mean of the actual values.
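
The four regression metrics above can be computed side by side on a small example. The sketch below is a plain-Python illustration. The function name `regression_metrics` and the four toy values are assumptions, and the sketch assumes the actual values are not all identical (otherwise the R-squared denominator is zero).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, RMSE, R-squared) for two equal-length sequences."""
    n = len(y_true)
    errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    y_mean = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)              # residual sum of squares
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, mse, rmse, r2


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae, mse, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, mse, round(rmse, 4), round(r2, 4))  # 0.5 0.375 0.6124 0.9486
```

Note that the single error of size 1 contributes two thirds of the MSE but only half of the MAE, which illustrates how squaring emphasizes larger errors.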

# 6 Hyperparameter Tuning and Optimization Techniques
Before implementing the optimization techniques, it is important to understand their practical operating properties and performance characteristics.
# 6.1 Overview and General Guidelines
Hyperparameter Tuning is the process of selecting the optimal hyperparameters that govern the
training process of a machine learning model. Unlike model parameters learned during training,
hyperparameters are set prior to the training process.
Various optimization techniques can be considered for hyperparameter tuning, including the
following popular approaches.
1. Grid Search:
• Description: An exhaustive search method where all possible combinations of hyperparameters are tried.
• Pros: Simple and guarantees finding the best combination within the specified grid.
• Cons: Computationally intensive, especially with a large number of hyperparameters or
extensive ranges.
2. Random Search:
• Description: Randomly selects combinations of hyperparameters to try.
• Pros: More efficient than grid search for high-dimensional spaces; can find good hyperparameters with fewer iterations.
• Cons: May miss the optimal combination due to randomness.
3. Bayesian Optimization:
• Description: Builds a probabilistic model of the objective function and uses it to select
the most promising hyperparameters to evaluate next.
• Pros: Efficient and can find better hyperparameters with fewer evaluations.
• Cons: More complex to set up, and fitting the probabilistic model adds computational overhead.
4. Genetic Algorithms:
• Description: Mimics the process of natural selection by creating a population of solutions
and evolving them over generations.
• Pros: Good at exploring large, complex search spaces and avoiding local minima.
• Cons: Computationally expensive and requires careful parameter settings.
For this lab, we will investigate the first three approaches described above, with code templates
provided in the subsequent subsections.
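
To make the difference between the first two approaches concrete, the following minimal sketch enumerates a grid and draws a random sample from it using scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The grid values and the sample size of 5 are illustrative assumptions, not values prescribed by this lab.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Illustrative search space (assumed values, chosen only for this sketch)
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

# Grid search evaluates every combination: 3 * 3 * 3 = 27 candidates
grid_candidates = list(ParameterGrid(param_grid))
print(len(grid_candidates))  # 27

# Random search evaluates only a fixed budget of sampled combinations
random_candidates = list(ParameterSampler(param_grid, n_iter=5, random_state=0))
print(len(random_candidates))  # 5
```

The cost of grid search grows multiplicatively with every added hyperparameter, whereas the cost of random search is set directly by `n_iter`.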
# 6.2 GridSearchCV
The following is a code template for implementing Grid Search to find the best hyperparameters for
the Random Forest model.



```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Load the dataset and hold out 20% of it as a test set
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Hyperparameter grid: 3 x 3 x 3 = 27 combinations
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}

# Exhaustive search with 3-fold cross-validation on the training set
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

# Best parameters and cross-validation accuracy
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.2f}")
```

Best parameters: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 50}
Best cross-validation accuracy: 0.96
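
The cross-validation accuracy is computed on the training split only. As a possible follow-up, the sketch below scores the tuned model on the held-out test set. The reduced grid is an assumption made to keep the example fast, so its best parameters need not match the output shown above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Reduced grid (an assumption, chosen only to keep this sketch fast)
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on all of X_train
best_model = grid_search.best_estimator_
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
```

Keeping the test set out of the search means the reported test accuracy is not biased by the hyperparameter selection itself.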


# 6.3 RandomizedSearchCV
The following is a code template to use Random Search for hyperparameter tuning on the Random
Forest model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Load the dataset and hold out 20% of it as a test set
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Candidate values to sample from for each hyperparameter
param_dist = {
    'n_estimators': np.linspace(50, 200, num=10, dtype=int),
    'max_depth': [None] + list(np.arange(5, 25, 5)),
    'min_samples_split': np.arange(2, 11),
}

# Random search: evaluate 20 sampled combinations with 3-fold cross-validation
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)

# Best parameters and cross-validation accuracy
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation accuracy: {random_search.best_score_:.2f}")
```

Best parameters: {'n_estimators': np.int64(150), 'min_samples_split': np.int64(6), 'max_depth': np.int64(15)}
Best cross-validation accuracy: 0.96
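
`RandomizedSearchCV` also accepts distribution objects in place of explicit lists, so each trial can draw from a full range of values. The following is a hedged variant that uses `scipy.stats.randint`. The ranges mirror the template above, while the budget of `n_iter=5` is an assumption chosen only to keep the sketch fast.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# randint(low, high) draws integers from low up to high - 1
param_dist = {
    "n_estimators": randint(50, 201),     # 50..200
    "max_depth": [None, 5, 10, 15, 20],   # lists and distributions can be mixed
    "min_samples_split": randint(2, 11),  # 2..10
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5,  # small budget (assumed) to keep the sketch fast
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_train, y_train)

best = random_search.best_params_
print(f"Best parameters: {best}")
print(f"Best cross-validation accuracy: {random_search.best_score_:.2f}")
```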


# 6.4 Bayesian Optimization with Optuna
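The following is a code template that implements Bayesian Optimization for the Random Forest model using the Optuna library. Optuna's default sampler is the Tree-structured Parzen Estimator (TPE), a sequential model-based method. Because the study is created without a fixed sampler seed, the trial values can differ from run to run.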

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Load the dataset and hold out 20% of it as a test set
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)


# Objective function: mean 3-fold cross-validation accuracy for one trial
def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 200)
    max_depth = trial.suggest_int('max_depth', 5, 30)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42,
    )
    score = cross_val_score(
        rf, X_train, y_train, cv=3, scoring='accuracy', n_jobs=-1
    )
    return score.mean()


# Study: maximize the objective over 20 trials
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)

# Best parameters and cross-validation accuracy
print(f"Best parameters: {study.best_params}")
print(f"Best cross-validation accuracy: {study.best_value:.2f}")
```

[I 2025-10-30 23:31:30,737] A new study created in memory with name: no-name-e8a63096-31c2-4f61-ac13-1dc3aac946c3
[I 2025-10-30 23:31:30,927] Trial 0 finished with value: 0.9500000000000001 and parameters: {'n_estimators': 136, 'max_depth': 27, 'min_samples_split': 6}. Best is trial 0 with value: 0.9500000000000001.
[I 2025-10-30 23:31:31,151] Trial 1 finished with value: 0.9500000000000001 and parameters: {'n_estimators': 172, 'max_depth': 6, 'min_samples_split': 9}. Best is trial 0 with value: 0.9500000000000001.
[I 2025-10-30 23:31:31,449] Trial 2 finished with value: 0.9500000000000001 and parameters: {'n_estimators': 193, 'max_depth': 16, 'min_samples_split': 3}. Best is trial 0 with value: 0.9500000000000001.
[I 2025-10-30 23:31:31,738] Trial 3 finished with value: 0.9500000000000001 and parameters: {'n_estimators': 196, 'max_depth': 29, 'min_samples_split': 6}. Best is trial 0 with value: 0.9500000000000001.
[I 2025-10-30 23:31:31,876] Trial 4 finished with value: 0.958333333333

Best parameters: {'n_estimators': 86, 'max_depth': 10, 'min_samples_split': 2}
Best cross-validation accuracy: 0.96
