## Before continuing, we need to cover an important concept:
***
# Regularization

Regularization is a technique for preventing overfitting. Overfitting happens when a model is too complex and memorizes the noise in the training data instead of learning patterns that generalize.
Regularization is covered in more depth in the [Maths](/tree/Data_Science_Full_Basic/Maths) section.
For now, we will focus on two types of regularization.
### - **L1 (Lasso):**
Adds a penalty proportional to the sum of the absolute values of the coefficients. This often forces some coefficients to be exactly zero, resulting in simpler models that use fewer features.
### - **L2 (Ridge):**
Adds a penalty proportional to the sum of the squared coefficients. This keeps the weights from growing too large, which helps reduce overfitting, but unlike L1 it does not eliminate features completely.
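
To make the two penalties concrete, here is a minimal sketch that computes both for a small coefficient vector. The values of `coefficients` and `alpha` are made up for illustration; `alpha` is the regularization strength that multiplies the penalty.

```python
import numpy as np

# Hypothetical coefficient vector and regularization strength (illustrative values)
coefficients = np.array([0.5, -2.0, 0.0, 3.0])
alpha = 1.0

# L1 penalty: alpha times the sum of the absolute coefficient values
l1_penalty = alpha * np.sum(np.abs(coefficients))

# L2 penalty: alpha times the sum of the squared coefficient values
l2_penalty = alpha * np.sum(coefficients ** 2)

print(f"L1 penalty: {l1_penalty}")
print(f"L2 penalty: {l2_penalty}")
```

Because of the squared term, L2 punishes the large coefficient (3.0) far more heavily than the small one (0.5), while L1 charges the same price for every unit of magnitude.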


# Regression part 2
### There are other types of regression models:
- Ridge Regression **(uses L2 regularization)**
- Lasso Regression **(uses L1 regularization)**

## Ridge Regression

Ridge regression is a type of regularized linear regression that adds a penalty term to the loss function to prevent overfitting.
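
As a rough illustration of what the penalty term does, the sketch below compares scikit-learn's `Ridge` with the closed-form solution $w = (X^T X + \alpha I)^{-1} X^T y$ on synthetic data. The data, the seed, and `fit_intercept=False` (used so that the closed form stays simple) are assumptions made for this example and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption for illustration): 100 samples, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=100)

alpha = 1.0

# Ridge without an intercept, so it matches the simple closed form below
ridge_demo = Ridge(alpha=alpha, fit_intercept=False)
ridge_demo.fit(X_demo, y_demo)

# Closed-form ridge solution: solve (X^T X + alpha * I) w = X^T y
n_features = X_demo.shape[1]
w_closed_form = np.linalg.solve(
    X_demo.T @ X_demo + alpha * np.eye(n_features),
    X_demo.T @ y_demo,
)

print("sklearn Ridge coefficients:", ridge_demo.coef_)
print("closed-form coefficients:  ", w_closed_form)
```

The `alpha * I` term added to `X^T X` is what shrinks the coefficients: the larger `alpha` is, the closer they are pulled toward zero.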

### Let's work through an example:

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
df = pd.read_csv('WineQT.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1143 non-null   float64
 1   volatile acidity      1143 non-null   float64
 2   citric acid           1143 non-null   float64
 3   residual sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free sulfur dioxide   1143 non-null   float64
 6   total sulfur dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   int64  
 12  Id                    1143 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 116.2 KB


#### Cleaning data

In [4]:
df.isna().any()

fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
Id                      False
dtype: bool

In [5]:
# The Id column is only a row identifier, so drop it before modeling
df.drop(columns=['Id'], inplace=True)

In [6]:
df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


#### Preparing model

In [7]:
# Separate the features from the target column
X = df.drop(columns=['quality'])
y = df['quality']

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [8]:
# alpha sets the strength of the L2 penalty
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)

In [9]:
y_pred = ridge.predict(X_test)

results = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})
results

|      | y_test | y_pred |
|---|---|---|
| 158 | 5 | 5.319608 |
| 1081 | 6 | 4.774222 |
| 291 | 5 | 5.256569 |
| 538 | 6 | 5.266676 |
| 367 | 6 | 6.181609 |
| ... | ... | ... |
| 248 | 5 | 6.433799 |
| 307 | 6 | 5.350355 |
| 334 | 6 | 6.052577 |
| 423 | 5 | 5.146512 |


In [10]:
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}")

MSE: 0.38


In [11]:
r2_score(y_test, y_pred)

0.34158347358544094
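
To show what these two metrics measure, here is a small sketch that computes MSE and R² by hand and checks them against scikit-learn. The arrays `y_true_demo` and `y_pred_demo` are hypothetical values chosen for illustration, not rows from the wine dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions (illustrative only)
y_true_demo = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_pred_demo = np.array([5.3, 5.6, 5.2, 6.4, 6.1])

# MSE: the average of the squared prediction errors
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# R^2: 1 minus the ratio of residual variation to total variation
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE (manual):  {mse_manual:.4f}")
print(f"MSE (sklearn): {mean_squared_error(y_true_demo, y_pred_demo):.4f}")
print(f"R2 (manual):   {r2_manual:.4f}")
print(f"R2 (sklearn):  {r2_score(y_true_demo, y_pred_demo):.4f}")
```

Read this way, an R² of about 0.34 means the ridge model explains roughly a third of the variance in `quality` on the test set.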

**Identify the most important features in the dataset**

In [12]:
coefficients = ridge.coef_
# Keep the features whose coefficient magnitude exceeds a threshold of 0.05
important_features = [feature for feature, coeff in zip(X.columns, coefficients) if abs(coeff) > 0.05]
print(f"Important features: {important_features}")

Important features: ['volatile acidity', 'citric acid', 'chlorides', 'pH', 'sulphates', 'alcohol']
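
One caveat with thresholding raw coefficients: their size depends on the units of each feature, so a feature measured on a large scale can have a small coefficient and still matter. A common way to make coefficients comparable is to standardize the features first. The sketch below uses synthetic data rather than the wine dataset; the feature scales, the seed, and the true weights are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500

# Two synthetic features on very different scales (assumed for illustration)
small_scale = rng.normal(loc=0.0, scale=1.0, size=n_samples)
large_scale = rng.normal(loc=0.0, scale=100.0, size=n_samples)
X_demo = np.column_stack([small_scale, large_scale])

# Both features move the target by a similar amount once scale is accounted for
y_demo = 1.0 * small_scale + 0.01 * large_scale + rng.normal(scale=0.1, size=n_samples)

# Ridge on the raw features
raw_model = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Ridge on standardized features (zero mean, unit variance)
scaled_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_demo, y_demo)

raw_coefficients = raw_model.coef_
scaled_coefficients = scaled_model.named_steps["ridge"].coef_

print("Raw coefficients:         ", raw_coefficients)
print("Standardized coefficients:", scaled_coefficients)
```

On the raw data, the second feature's coefficient falls below a 0.05 threshold even though it matters as much as the first; after standardizing, the two coefficients are of similar size.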


## Lasso Regression

Lasso regression is similar to ridge regression, but it uses L1 regularization (an absolute-value penalty) instead of L2. This penalty can set some coefficients to exactly zero, effectively removing irrelevant features from the model (feature selection).
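
The difference from ridge can be seen in a small sketch on synthetic data where only two of five features influence the target. The data, the seed, and the choice of `alpha=0.1` for both models are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 5 features, only the first two affect the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)

# Count how many coefficients each model set to exactly zero
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("Lasso coefficients:", lasso_demo.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
print(f"Exact zeros -> Lasso: {n_zero_lasso}, Ridge: {n_zero_ridge}")
```

In a setup like this, Lasso typically drops the irrelevant features entirely, while Ridge keeps small but nonzero weights on them.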

### Let's see an example

In [13]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
data = pd.read_csv('housing.csv', header=None, delimiter=r"\s+", names=column_names)

In [14]:
data.head()

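|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |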


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


In [ ]:
data.describe()

In [16]:
data.isna().any()

CRIM       False
ZN         False
INDUS      False
CHAS       False
NOX        False
RM         False
AGE        False
DIS        False
RAD        False
TAX        False
PTRATIO    False
B          False
LSTAT      False
MEDV       False
dtype: bool

### Prepare the data

In [17]:
# MEDV is the target; all other columns are features
X = data.drop(columns=['MEDV'])
y = data['MEDV']

# Hold out 33% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [18]:
# alpha sets the strength of the L1 penalty
lasso = Lasso(alpha=0.1, random_state=42)

In [19]:
lasso.fit(X_train, y_train)

In [20]:
y_pred = lasso.predict(X_test)

results = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})
results

|     | y_test | y_pred |
|---|---|---|
| 173 | 23.6 | 28.279697 |
| 274 | 32.4 | 32.598919 |
| 491 | 13.6 | 14.704904 |
| 72 | 22.8 | 26.200056 |
| 452 | 16.1 | 18.942931 |
| ... | ... | ... |
| 110 | 21.7 | 20.734639 |
| 321 | 23.1 | 26.468968 |
| 265 | 22.8 | 29.908786 |
| 29 | 21.0 | 21.635710 |


In [21]:
r2_score(y_test, y_pred)

0.6647359801936796