## Purpose:

The aim is to learn Linear, Lasso, and Ridge Regression thoroughly.
We will look at the nitty-gritty of each of the algorithms and how to implement them using sklearn.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
from sklearn.datasets import fetch_california_housing

In [4]:
dataset=fetch_california_housing()


In [5]:
dataset_df=pd.DataFrame(data=dataset.data,columns=dataset.feature_names)

In [6]:
dataset_df.head(5)


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


Let's divide the dataset into train and test sets using `train_test_split`. This lets us train the model on a certain portion of the data and test it on the remaining unseen data.

Note: This works best on large datasets, where the held-out portion is big enough to give a good estimate of the model’s generalization performance.

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
# Use all eight columns as features; the target comes from dataset.target
X=dataset_df
y=pd.DataFrame(data=dataset.target,columns=dataset.target_names)

In [1]:
y.head()



In [10]:
X_train,X_test,y_train,y_test=train_test_split(X,y,shuffle=True,random_state=42,test_size=0.3)

In [11]:
X_train.shape,y_train.shape,X_test.shape,y_test.shape

((14448, 8), (14448, 1), (6192, 8), (6192, 1))

## Linear Regression :

The basis of Linear Regression is to find the best-fit line (or hyperplane, in higher dimensions) that can be used for prediction.
We can represent the relationship as

$$
y = w_1x_1 + w_2x_2 + \dots + w_nx_n + b + \epsilon
$$

Here, **y** represents our **target variable**.

**x1, x2, x3, ..., xn** represent the **independent features**.

**w1, w2, w3, ..., wn** represent the **coefficients or weights** of the features.

**b** represents the **y-intercept (bias)**.

**$\epsilon$** is the **residual error**.



The standard form of MSE (Mean Squared Error) is


$\text{MSE} = \frac{1}{m} \sum_{i=1}^m \left(y_i - \hat{y}_i\right)^2$

where $\hat{y}_i = X_iw + b$ is the prediction for sample $i$ and $m$ is the number of samples.<br><br>


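Gradient descent minimizes a cost function by repeatedly stepping against its gradient.

For **gradient-based optimisation**, we use a slightly modified form of the MSE, scaled by $\frac{1}{2}$ so that the factor of 2 from differentiating the square cancels:<br><br>

$
J(w, b) = \frac{1}{2m} \sum_{i=1}^m \left(y_i - (X_iw + b)\right)^2
$

The coefficients are then updated iteratively: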

$w_j = w_j - \alpha \frac{\partial J}{\partial w_j}$


$b = b - \alpha \frac{\partial J}{\partial b}$

where:

$\alpha$ is the learning rate.

$\frac{\partial J}{\partial w_j}$ is the gradient of the cost function with respect to $w_j$.
<br><br>
The goal is to find the values of $w_1, w_2, \dots, w_n$ and $b$ such that the error is minimized.
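
The update rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn fits `LinearRegression`; the function name `gradient_descent`, the learning rate of 0.1, the iteration count, and the synthetic data are all assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent for J(w, b) = (1/2m) * sum((y - (Xw + b))^2)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        residual = (X @ w + b) - y        # prediction minus target, shape (m,)
        grad_w = (X.T @ residual) / m     # dJ/dw_j for every j
        grad_b = residual.mean()          # dJ/db
        w -= alpha * grad_w               # w_j = w_j - alpha * dJ/dw_j
        b -= alpha * grad_b               # b = b - alpha * dJ/db
    return w, b

# Synthetic, noise-free data (illustrative only): y = 2*x1 - 3*x2 + 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 1.0

w_hat, b_hat = gradient_descent(X_demo, y_demo)
print("weights:", w_hat, "bias:", b_hat)
```

On this toy problem the recovered weights and bias should land very close to the values used to generate the data. On real data with features on very different scales, a fixed learning rate like this usually needs feature scaling to converge.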

Now that we have discussed MSE, let's look at RMSE and why it is needed.
<br><br>
**Root Mean Squared Error (RMSE):**

$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{m} \sum_{i=1}^m \left(y_i - \hat{y}_i\right)^2}$

	•	It is the square root of the MSE.

Why use RMSE?

	•	MSE gives a squared error, which means its unit is the square of the target variable. For instance, if your target variable is height (in meters), MSE will be in meters². This can make interpretation less intuitive.
	•	RMSE, by taking the square root, converts the metric back to the same unit as the target variable, making it easier to interpret and compare directly with the values of y.
	•	MSE is smooth and convenient to differentiate, which is why it is used as the training objective, but its squared unit makes it less interpretable as a reported metric.
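
As a quick check of the two formulas, here is a small sketch that computes MSE and RMSE by hand and compares the MSE with `sklearn.metrics.mean_squared_error`. The arrays `y_true` and `y_hat` are made-up values for illustration, not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only (think of them as heights in meters).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 8.0])

# MSE = (1/m) * sum((y_i - y_hat_i)^2), in squared units of the target
mse_manual = np.mean((y_true - y_hat) ** 2)
# RMSE = sqrt(MSE), back in the units of the target
rmse_manual = np.sqrt(mse_manual)

mse_sklearn = mean_squared_error(y_true, y_hat)

print("MSE (manual): ", mse_manual)    # 0.5625
print("MSE (sklearn):", mse_sklearn)
print("RMSE:         ", rmse_manual)   # 0.75
```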



**No hyperparameters to tune.**

Advantages:

	•	Simplicity: Easy to interpret and implement.
	•	Computationally Efficient: Works well on small to medium-sized datasets.
	•	Baseline Model: Serves as a reference point for more complex models.

Disadvantages:

	•	Overfitting: Prone to overfitting when there is high multicollinearity or a large number of features relative to the number of observations.
	•	No Regularization: Has no built-in mechanism to down-weight irrelevant or highly correlated features.

Let's look at how Linear Regression can overfit:

1.	Excessive Model Complexity:

	•	In the case of linear regression, overfitting usually occurs when you have too many features (input variables) relative to the number of training samples. The model may end up fitting to the noise in the data by giving too much importance to each feature.

	•	For example, if you add too many features to your model (e.g., polynomial features or highly correlated features), it might try to “fit” every small variation in the data, rather than finding a generalizable relationship.

2.	Trying to Make All Predictions Perfect:

	•	In a typical ordinary least squares (OLS) linear regression model, the goal is to minimize the residual sum of squares (RSS), which is the sum of the squared differences between the actual and predicted values. Nothing in this objective discourages fitting noise, so the model can end up fitting the training data too closely.

	•	If you have a small or noisy dataset, this results in overfitting because the model has learned too many details of the training set, including the random fluctuations, which do not generalize to new data.

3.	Large Coefficients:

	•	Overfitting often leads to large coefficients in the model. When the model is trying to perfectly fit the training data, it will make the coefficients large in order to reduce the residuals (the difference between predicted and actual values) as much as possible. Large coefficients make the model more sensitive to small changes in the input features, which is undesirable and can cause poor generalization.
	•	When the coefficients become large, the model becomes more complex and overfits, since it can adjust rapidly to changes in any of the input features.

✅ Keep in mind the concept of large coefficients, as a lot of the upcoming content builds on it!


In [12]:

from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()

In [13]:
from sklearn.model_selection import cross_val_score

#### Why Use cross_val_score for Linear Regression?

✅ Evaluate Generalization:

* Cross-validation helps to estimate how well the linear regression model will perform on unseen data by splitting the data into multiple training and testing subsets.
* It ensures that the model is not overly tuned to a single train-test split, thus providing a robust measure of its performance.
   
✅ Detect Overfitting/Underfitting:

* By examining the scores (e.g., R², MSE) across folds, you can determine if the model is consistently performing well or if there is significant variance between folds.
* Large variations in fold scores indicate potential overfitting or data inconsistencies.
* Consistently low scores suggest underfitting.

✅ Baseline Comparison:

* Linear Regression can serve as a baseline model to compare with more complex models (e.g., decision trees, SVM, neural networks).
* Cross-validation ensures this baseline evaluation is reliable and fair.

In [14]:
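# 5-fold cross-validation on the training set; each score is a negated MSE
mse=cross_val_score(lin_reg,X_train,y_train,scoring='neg_mean_squared_error',cv=5)
mean_mse=np.mean(mse)
# cross_val_score fits clones, so fit lin_reg itself on the full training set
lin_reg.fit(X_train,y_train)
print(mean_mse)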

-0.5268253746355749


#### Why negative MSE?

scikit-learn scorers follow the convention that a higher score is better. MSE is a “minimization” metric (lower values are better), so the `neg_mean_squared_error` scorer returns the negated MSE: a smaller MSE corresponds to a larger (less negative) score.

This negative error is also helpful for finding the best algorithm when you are comparing multiple algorithms through GridSearchCV(). GridSearchCV ranks models by maximizing the score, so with the sign flipped the same ranking rule works for error metrics as well.

In [15]:
print(f"mse:{mse}")
print(f"mean_mse:{mean_mse}")


mse:[-0.54787556 -0.500835   -0.52045639 -0.51612252 -0.54883741]
mean_mse:-0.5268253746355749
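
To read these scores as errors, flip the sign and, if you want the original units, take the square root. The sketch below reuses the fold scores printed above; the variable names are illustrative. As far as I know, the California housing target is expressed in hundreds of thousands of dollars, so an RMSE near 0.73 would correspond to a typical error of roughly 73,000 dollars.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above (negated MSE).
neg_mse_scores = np.array([-0.54787556, -0.500835, -0.52045639,
                           -0.51612252, -0.54883741])

mse_per_fold = -neg_mse_scores           # flip the sign to recover MSE
rmse_per_fold = np.sqrt(mse_per_fold)    # back to the units of the target

print("MSE per fold: ", np.round(mse_per_fold, 4))
print("RMSE per fold:", np.round(rmse_per_fold, 4))
print("Mean MSE:     ", mse_per_fold.mean())
print("RMSE of mean: ", np.sqrt(mse_per_fold.mean()))
```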


## Ridge Regression :

Ridge Regression is a linear regression model with L2 regularization, which penalizes large coefficients to reduce overfitting.

Let's dive into how large coefficients influence the model:

**The Role of Coefficients in Linear Models**

In a linear regression model:

$\hat{y} = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$

Each feature $x_i$ is multiplied by its corresponding weight $w_i$ (coefficient).

The magnitude of $w_i$ determines how much $x_i$ contributes to the predicted value $\hat{y}$:

If $w_i$ is small, changes in $x_i$ have a minimal impact on $\hat{y}$.
If $w_i$ is large, even small changes in $x_i$ cause large changes in $\hat{y}$.


**Amplification in Polynomial or Nonlinear Models**

In polynomial regression or when using basis functions, the model becomes:

$\hat{y} = w_0 + w_1x + w_2x^2 + w_3x^3 + \dots$

Higher-order terms $(x^2, x^3, \dots)$:
For $|x| > 1$, these terms change much faster than $x$ itself, so a small change in $x$ can produce a large change in the term.
Large coefficients $(w_2, w_3, \dots)$ make the model highly sensitive to small variations in $x$, resulting in dramatic oscillations or “wiggliness” in the prediction curve.
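
To make the link between wiggly polynomial fits and large coefficients concrete, here is a small sketch on synthetic data. Everything in it is an assumption for illustration: the noisy sine target, the 15 sample points, the polynomial degree of 12, and `alpha=1.0` for Ridge.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Few noisy points from a smooth curve: easy to overfit with a high degree.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=15)).reshape(-1, 1)
y_noisy = np.sin(3 * x).ravel() + rng.normal(scale=0.2, size=15)

ols_poly = make_pipeline(PolynomialFeatures(degree=12, include_bias=False),
                         LinearRegression())
ridge_poly = make_pipeline(PolynomialFeatures(degree=12, include_bias=False),
                           Ridge(alpha=1.0))
ols_poly.fit(x, y_noisy)
ridge_poly.fit(x, y_noisy)

# Compare the overall size (L2 norm) of the fitted coefficient vectors.
ols_norm = np.linalg.norm(ols_poly[-1].coef_)
ridge_norm = np.linalg.norm(ridge_poly[-1].coef_)
print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)
```

The unregularized fit typically needs far larger coefficients to chase the noise, while the Ridge fit keeps them small.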


#### How does Ridge fix the overfitting?

Ridge Regression

	•	Regularization: L2 regularization, which penalizes the sum of squared coefficients.
	•	Effect: Shrinks coefficients towards zero but does not set any to exactly zero.

Best for:

	•	When all features are relevant, but you want to reduce their influence (shrink coefficients).
	•	When multicollinearity (high correlation between features) is present, as Ridge can stabilize the solution by reducing variance.

Advantages:

	•	Keeps all features in the model, which can be helpful if each feature contributes to the prediction.
	•	Computationally efficient, especially for high-dimensional data.

Disadvantages:

	•	Does not perform feature selection, so all features remain in the model even if some are irrelevant.

Let's look at the expression in detail:

Expression:


$J(w, b) = \frac{1}{m} \sum_{i=1}^m \left( y_i - \left( w^\top X_i + b \right) \right)^2 + \lambda \sum_{j=1}^n w_j^2$<br><br>


Components:

$\frac{1}{m} \sum_{i=1}^m \left( y_i - \left( w^\top X_i + b \right) \right)^2$: The same MSE loss as in Linear Regression.

$\lambda \sum_{j=1}^n w_j^2$: The additional L2 regularization term (sum of squared coefficients). It penalises large coefficients and pushes the model towards smaller ones, bringing them close to 0.

$\lambda$: Regularization strength (hyperparameter).

$w_j$: Coefficients of the features.


*Note: If a coefficient were brought down to exactly 0, it would nullify the effect of that feature on the prediction, which amounts to feature selection. We will talk more about this in Lasso.*
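
For intuition, the Ridge objective has a closed-form solution. The sketch below assumes no intercept and uses scikit-learn's scaling of the objective, which (as I understand it) is $\|y - Xw\|^2 + \alpha \|w\|^2$ without the $\frac{1}{m}$ factor, so $\alpha$ plays the role of $m\lambda$ in the expression above. The synthetic data and `alpha = 10.0` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

alpha = 10.0
n_features = X_demo.shape[1]

# Closed form: w = (X^T X + alpha * I)^(-1) X^T y
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(n_features),
                           X_demo.T @ y_demo)

# Same problem solved by scikit-learn (intercept disabled to match the formula).
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_

# Unregularized least squares for comparison.
w_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print("closed form:", w_closed)
print("sklearn:    ", w_sklearn)
print("OLS:        ", w_ols)
```

The added $\alpha I$ term is what shrinks the solution: the Ridge coefficient vector is never longer than the least-squares one.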


Let's implement it!




In [16]:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

In [17]:
ridge=Ridge()

In [18]:
## Note that the lambda in the mathematical equation corresponds to the alpha parameter here.
params=[{'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20]}]


In [19]:
ridge_regressor=GridSearchCV(ridge,params,scoring='neg_mean_squared_error',cv=10)
ridge_regressor.fit(X_train,y_train)

In [20]:
print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

# Comparing Linear and Ridge, both perform similarly
# (note that Linear was scored with cv=5 above and Ridge with cv=10 here)

{'alpha': 10}
-0.5256985094322817


## Lasso Regression :

Lasso adds an L1 regularisation term to the loss function to combat overfitting. Let us look at the expression.

Expression:


$J(w, b) = \frac{1}{m} \sum_{i=1}^m \left( y_i - \left( w^\top X_i + b \right) \right)^2 + \lambda \sum_{j=1}^n |w_j|$


Components:

$\frac{1}{m} \sum_{i=1}^m \left( y_i - \left( w^\top X_i + b \right) \right)^2$: The regular loss function

$\lambda \sum_{j=1}^n |w_j|$: L1 regularization term (sum of absolute coefficients).

Where:

$\lambda$: Regularization strength (hyperparameter).

$|w_j|$: Absolute value of coefficients.<br><br>

Effect:

	•	Encourages sparsity by driving some coefficients ($w_j$) exactly to zero.
	•	Performs automatic feature selection by excluding irrelevant features.
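
The sparsity effect is easy to see on synthetic data where only some features matter. In this sketch the data, the true coefficients, and `alpha=0.5` are all assumptions for illustration; only the first two of six features influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
# Only the first two features influence the target; the other four are noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

print("Lasso coefficients:", np.round(lasso_demo.coef_, 3))
print("Ridge coefficients:", np.round(ridge_demo.coef_, 3))
print("Exact zeros (Lasso):", int(np.sum(lasso_demo.coef_ == 0)))
print("Exact zeros (Ridge):", int(np.sum(ridge_demo.coef_ == 0)))
```

Lasso should zero out the four noise features while keeping the two informative ones (shrunk somewhat), whereas Ridge leaves every coefficient small but nonzero.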




It is still not obvious why feature selection takes place in Lasso and not in Ridge. Why does L2 not set coefficients to 0, whereas L1 does?

Let's investigate 🔍 !!

The cost function for **Ridge regression with L2 regularization** is:


$J(w, b) = \frac{1}{m} \sum_{i=1}^m \left( y_i - \left( w^\top X_i + b \right) \right)^2 + \lambda \sum_{j=1}^n w_j^2$

The second term is the L2 regularization term $(\lambda \sum_{j=1}^n w_j^2)$.

This term is differentiable everywhere, including at zero.

When performing optimization (e.g., gradient descent), the derivative of the L2 regularization term with respect to a coefficient $w_j$ is:


$\frac{\partial}{\partial w_j} \left( \lambda \sum_{k=1}^n w_k^2 \right) = 2\lambda w_j$

	•	This derivative is continuous and proportional to $w_j$, so the pull towards zero weakens as the coefficient gets smaller. As a result, the coefficients are gradually shrunk towards zero and, in general, are not set to exactly zero.
	•	Even if a coefficient becomes very small, it typically stays nonzero; it only approaches zero as the regularization parameter $\lambda$ grows without bound (which is typically not the case).

So, for L2 regularization (Ridge), the coefficients are shrunk but not set to zero. The gradient of the L2 penalty term ensures smooth changes in the coefficients.<br><br>

The cost function for **Lasso regression with L1 regularization** is:


$J(w, b) = \frac{1}{m} \sum_{i=1}^m \left( y_i - \left( w^\top X_i + b \right) \right)^2 + \lambda \sum_{j=1}^n |w_j|$

	•	The second term is the L1 regularization term $(\lambda \sum_{j=1}^n |w_j|)$.
	•	This term is not differentiable at zero because of the absolute value function (which creates a corner at $w_j = 0$).

Effect of L1 penalty:

	•	The L1 penalty has a non-differentiable point at zero. This creates a situation where coefficients are “encouraged” to become exactly zero, instead of being gradually shrunk towards zero.
	•	The optimization process works to minimize both the loss function and the penalty term. The pull of the L1 penalty has constant magnitude $\lambda$ no matter how small the coefficient is, so unlike L2 it does not fade near zero. When increasing a coefficient reduces the loss by less than the penalty it adds, the optimum for that coefficient is exactly zero.

Gradient of L1 Regularization:

Unlike L2 regularization, the gradient of the L1 penalty is not continuous at zero. The derivative of the L1 term is:


$\frac{\partial}{\partial w_j} \left( \lambda |w_j| \right) =
\begin{cases}
\lambda & \text{if } w_j > 0 \\
-\lambda & \text{if } w_j < 0 \\
\text{undefined} & \text{if } w_j = 0
\end{cases}$

	•	The gradient is undefined at zero because of the sharp corner at $w_j = 0$.
	•	As a result, a coefficient whose contribution to reducing the loss is weaker than the constant pull $\lambda$ ends up at exactly zero rather than at a small nonzero value.
	•	This is what leads to sparse solutions, where some coefficients are exactly zero, resulting in automatic feature selection.
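
A one-dimensional version of the problem makes the contrast explicit. Suppose the unregularized optimum for a single coefficient is $z$, and the loss near it is $\frac{1}{2}(w - z)^2$ (a simplifying assumption for this sketch). Adding $\lambda |w|$ gives the soft-thresholding solution, while adding $\lambda w^2$ gives proportional shrinkage. The function names and sample values are illustrative.

```python
import numpy as np

def lasso_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2 (proportional shrinkage)."""
    return z / (1.0 + 2.0 * lam)

z = np.array([-2.0, -0.3, 0.1, 0.4, 1.5])   # unregularized solutions (illustrative)
lam = 0.5

w_l1 = lasso_1d(z, lam)
w_l2 = ridge_1d(z, lam)
print("L1:", w_l1)   # entries with |z| <= lam become exactly zero
print("L2:", w_l2)   # every entry is halved here, none becomes zero
```

With L1, any coefficient whose unregularized value is within $\lambda$ of zero is set to exactly zero and the rest are shifted by $\lambda$; with L2, every coefficient is scaled by the same factor and stays nonzero.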


  Here is an image for those who grasp things better visually like me!
 ![](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*bM2txQ6caL4AKiN19oH5bQ.gif)

 Image from:[Source](https://towardsdatascience.com/visualizing-regularization-and-the-l1-and-l2-norms-d962aa769932)

In the plot above, we can observe the differences between the L1 (Lasso) and L2 (Ridge) regularization penalties:

	•	L2 Regularization (Ridge): The penalty is smooth and continuous. As the coefficient $w$ increases or decreases, the penalty increases quadratically ($w^2$). The graph shows a smooth curve with no abrupt changes. The penalty increases slowly for small values of $w$, and more sharply for larger values of $w$.
	•	L1 Regularization (Lasso): The penalty increases linearly with the absolute value of $w$ ($|w|$). Notice the sharp corner at $w = 0$, which is the non-differentiable point. This sharp point is what leads to sparsity: coefficients near zero are more likely to be pushed to exactly zero during optimization.
  



In [21]:

from sklearn.linear_model import Lasso
lasso=Lasso()
params=[{'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10]}]

In [22]:
lasso_regressor=GridSearchCV(lasso,params,scoring='neg_mean_squared_error',cv=10)

In [23]:
lasso_regressor.fit(X_train,y_train)



In [24]:
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)

{'alpha': 1e-08}
-0.5257104322228356


### Comparing the Linear, Ridge and Lasso models using GridSearchCV :


In [25]:
from sklearn.metrics import mean_squared_error

In [26]:
models={'LinearRegression':{'model':LinearRegression(),
                             'params':{}

                            },
        'LassoRegression':{'model':Lasso(),
                           'params':{'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10]}},

        'RidgeRegression':{'model':Ridge(),
                           'params':{'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20]}}
        }
results=[]

In [27]:
for key,value in models.items():
    # Tune each model with 5-fold cross-validation on the training set
    grid=GridSearchCV(estimator=value['model'],param_grid=value['params'],scoring='neg_mean_squared_error',cv=5)
    grid.fit(X_train,y_train)
    # best_estimator_ is refit on the whole training set; evaluate it on the held-out test set
    best_model=grid.best_estimator_
    y_pred=best_model.predict(X_test)
    test_mse=mean_squared_error(y_test,y_pred)
    results.append({
        'Model': key,
        'Best Params': grid.best_params_,
        'CV MSE': -grid.best_score_,  # Convert back to positive MSE
        'Test MSE': test_mse
    })





In [28]:
results=pd.DataFrame(results)
results

| | Model | Best Params | CV MSE | Test MSE |
|---|---|---|---|---|
| 0 | LinearRegression | {} | 0.526825 | 0.530568 |
| 1 | LassoRegression | {'alpha': 1e-08} | 0.526825 | 0.530568 |
| 2 | RidgeRegression | {'alpha': 1e-15} | 0.526825 | 0.530568 |


## Logistic Regression :

Key Concepts in Logistic Regression:

1.	**Sigmoid Function:**

The core idea behind logistic regression is that it uses the sigmoid function (also called the logistic function) to map the linear regression output to a probability value between 0 and 1.

The sigmoid function is defined as:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

where:

•	 $z = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n$  (the linear combination of the input features).

•	 $\sigma(z)$ gives the probability.

The output of the sigmoid function can be interpreted as the probability of the positive class (class 1), and for binary classification, the decision threshold is typically 0.5:<br>
If $\sigma(z) \geq 0.5$, predict class 1 (positive).<br>
If $\sigma(z) < 0.5$, predict class 0 (negative).<br>
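
Here is a minimal sketch of the sigmoid and the 0.5 threshold in NumPy. The values of `z` are made-up linear-model outputs, used only to illustrate the mapping.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative linear-model outputs z = w_0 + w_1*x_1 + ... + w_n*x_n
z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

probs = sigmoid(z)                     # estimated P(class 1)
preds = (probs >= 0.5).astype(int)     # apply the 0.5 decision threshold

print("probabilities:", np.round(probs, 3))
print("predictions:  ", preds)         # [0 0 1 1 1]
```

Note that $\sigma(0) = 0.5$, so thresholding the probability at 0.5 is the same as checking whether $z \geq 0$.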

2. **Log-Loss (Binary Cross-Entropy):**

Logistic regression uses log-loss (or binary cross-entropy) as the loss function to evaluate the model’s performance. The loss function measures how well the model’s predicted probabilities match the actual class labels.
The log-loss function for binary classification is:

$L(y, \hat{y}) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$

where:

$y_i$  is the true label (0 or 1).

$\hat{y}_i$ is the predicted probability for class 1.

$m$ is the number of samples.
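
The log-loss formula can be checked by hand against `sklearn.metrics.log_loss`. The labels and predicted probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.3, 0.1])   # illustrative predicted P(class 1)

# L = -(1/m) * sum(y*log(p) + (1 - y)*log(1 - p))
manual_loss = -np.mean(y_true * np.log(y_prob)
                       + (1 - y_true) * np.log(1 - y_prob))
sklearn_loss = log_loss(y_true, y_prob)

print("manual: ", manual_loss)
print("sklearn:", sklearn_loss)
```

The fourth sample (true label 1, predicted probability 0.3) contributes the most to the loss, which is the intended behaviour: confident mistakes are penalised heavily.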

In [29]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
log=LogisticRegression()


In [30]:
df=load_breast_cancer()

In [31]:
X=pd.DataFrame(df.data,columns=df.feature_names)

In [32]:
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [33]:
y=pd.DataFrame(df.target,columns=['Target'])

In [34]:
#To check if balanced dataset
y['Target'].value_counts()

| Target | count |
|---|---|
| 1 | 357 |
| 0 | 212 |


In [35]:
# Train/test split (default test_size=0.25; no random_state is set, so the split changes between runs)
X_train,X_test,y_train,y_test=train_test_split(X,y,shuffle=True)

In [36]:
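# A list of dicts defines separate grids: C and max_iter are each searched on their own,
# with the other parameter kept at the value set on the estimator.
params=[{'C':[1,5,10]},{'max_iter':[100,150]}]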

In [37]:
log=LogisticRegression(C=100,max_iter=100)
model=GridSearchCV(log,param_grid=params,scoring='f1',cv=5)

In [38]:
model.fit(X_train,y_train)
y_train.shape

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

(426, 1)

In [39]:
model.best_params_

{'max_iter': 150}

In [40]:
model.best_score_

0.9695682185401813
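
The convergence warning printed during fitting suggests scaling the data or raising `max_iter`. One way to do that is sketched below with a `StandardScaler` inside a pipeline. This is an alternative setup, not the run that produced the scores in this notebook: the variable names, `random_state=0`, the stratified split, and `max_iter=1000` are all assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# return_X_y=True gives a 1-D target, which also avoids the column_or_1d warning.
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=0, stratify=y_bc)

# Scaling inside the pipeline means the scaler is fit on the training data only.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

test_accuracy = scaled_log_reg.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

Putting the scaler in the pipeline also keeps cross-validation honest if the pipeline is later passed to `GridSearchCV`, because each fold refits the scaler on its own training portion.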

In [41]:
y_pred=model.predict(X_test)

In [42]:
y_pred

array([1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1])

In [43]:
from sklearn.metrics import confusion_matrix,classification_report

In [44]:
confusion_matrix(y_test,y_pred)

array([[44,  6],
       [ 5, 88]])

In [45]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.9230769230769231
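
The accuracy can be recovered directly from the confusion matrix, along with precision, recall and F1 for class 1. The sketch below copies the matrix from the output above; in scikit-learn's layout, rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = np.array([[44, 6],
               [5, 88]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # share of all predictions that are correct
precision = tp / (tp + fp)          # of the predicted class 1, how many are truly class 1
recall = tp / (tp + fn)             # of the true class 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy: ", accuracy)
print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
```

The accuracy computed this way, 132 correct out of 143, matches the `accuracy_score` output above. The already imported `classification_report(y_test, y_pred)` prints the same per-class metrics in one call.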