# Linear regression model using Iris dataset

In [33]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso

In [14]:
iris = datasets.load_iris()  # load the bundled Iris dataset before using it
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [15]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [16]:
y=iris.target

In [17]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [18]:
x=iris.data

In [19]:
# Hold out 40% of the samples for testing; fix the seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=101)

In [22]:
model=LinearRegression()

In [24]:
model.fit(x_train,y_train)

In [32]:
# Predict on the training set
pred_train_lr = model.predict(x_train)
print("Training Set RMSE:", np.sqrt(mean_squared_error(y_train, pred_train_lr)))
print("Training Set R²:", r2_score(y_train, pred_train_lr))

# Predict on the test set
pred_test_lr = model.predict(x_test)
print("Test Set RMSE:", np.sqrt(mean_squared_error(y_test, pred_test_lr)))
print("Test Set R²:", r2_score(y_test, pred_test_lr))

Training Set RMSE: 0.21990902197573362
Training Set R²: 0.9338764649957424
Test Set RMSE: 0.21234647583507144
Test Set R²: 0.9197985707122192


## Training Set Metrics
1) Training Set RMSE: 0.2199

* Root Mean Squared Error (RMSE) is the square root of the mean squared prediction error, expressed in the same units as the target variable. Here the target is the class label encoded as 0, 1, or 2, so an RMSE of 0.2199 means a typical training prediction is off by roughly 0.22 of one class step.

2) Training Set R²: 0.9339

* R² (coefficient of determination) measures the proportion of the variance in the target variable that the model explains. A value of 0.9339 means the model explains about 93.39% of the variance in the training targets.
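To make the two definitions concrete, the sketch below recomputes RMSE and R² directly with NumPy and prints them next to the scikit-learn values. It assumes the same split as above (`test_size=0.4`, `random_state=101`); the names `residuals`, `rmse_manual`, and `r2_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=101
)
model = LinearRegression().fit(x_train, y_train)
pred_train = model.predict(x_train)

# RMSE: square root of the mean squared residual.
residuals = y_train - pred_train
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_train - y_train.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("manual RMSE:", rmse_manual)
print("sklearn RMSE:", np.sqrt(mean_squared_error(y_train, pred_train)))
print("manual R²:", r2_manual)
print("sklearn R²:", r2_score(y_train, pred_train))
```

The manual and library values should agree up to floating-point rounding, since both follow the same formulas.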
## Test Set Metrics
1) Test Set RMSE: 0.2123

* An RMSE of 0.2123 on the test set means a typical prediction on unseen data is off by about 0.21 of one class step, which is slightly less than on the training set.

2) Test Set R²: 0.9198

* An R² of 0.9198 on the test set means the model explains about 91.98% of the variance in the target variable for unseen (test) data.
## Key Observations
1) Good Model Fit:

Both the training and test R² values are above 0.9, so a linear combination of the four measurements captures most of the variation in the class label.

2) No Significant Overfitting:

The RMSE and R² values for the training and test sets are very close (RMSE 0.2199 vs. 0.2123; R² 0.9339 vs. 0.9198), which indicates the model generalizes to unseen data. The gap is too small to suggest overfitting, and the high scores rule out obvious underfitting.

3) Error Magnitude:

The RMSE values are small relative to the target's range of 0 to 2, so predictions stay close to the true values in both the training and test sets.
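A single train/test split gives only one estimate of generalization. As a rough cross-check, the sketch below scores the same linear model with shuffled 5-fold cross-validation. The fold count and the seed are assumptions chosen for illustration, not values from the original notebook.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()

# Shuffle before splitting: the Iris rows are ordered by class, so
# unshuffled folds would each be dominated by a single class.
cv = KFold(n_splits=5, shuffle=True, random_state=101)
scores = cross_val_score(
    LinearRegression(), iris.data, iris.target, cv=cv, scoring="r2"
)

print("R² per fold:", scores)
print("mean R²:", scores.mean(), "std:", scores.std())
```

If the per-fold R² values stay close to the single-split test score, that supports the reading that the model is not overfitting.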
## Summary
The model performs well on both the training and test sets, with high R² values and low RMSE values, indicating that it has learned the underlying pattern and generalizes well to new data.

# Applying Lasso Regression

Lasso regression (Least Absolute Shrinkage and Selection Operator) is a modification of linear regression. Its loss function limits model complexity by penalizing the sum of the absolute values of the model coefficients (the l1 norm).

The loss function for Lasso regression can be expressed as:

Loss function = OLS loss + alpha * (sum of the absolute values of the coefficients)

Here alpha is the penalty parameter we need to select: the larger it is, the more strongly the coefficients are shrunk. Because the penalty is an l1 norm, it can drive some coefficients to exactly zero while the others stay non-zero, so Lasso also acts as a form of feature selection.
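Written out in symbols, the objective that scikit-learn's `Lasso` class documents is shown below. The symbols are notation introduced here rather than taken from the notebook: $X$ is the feature matrix, $y$ the target vector, $w$ the coefficient vector, and $n$ the number of training samples. Note that scikit-learn scales the squared-error term by $1/(2n)$, so the same alpha is not directly comparable with formulations that omit this factor.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1,
\qquad
\lVert w \rVert_1 = \sum_{j} \lvert w_j \rvert
$$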

In scikit-learn, a lasso regression model is constructed by using the Lasso class. The first line of code below instantiates the Lasso Regression model with an alpha value of 0.01. The second line fits the model to the training data.

The third line of code generates predictions, while the fourth and fifth lines print the evaluation metrics (RMSE and R-squared) on the training set. The same steps are repeated on the test set in the sixth through eighth lines of code.

In [37]:
model_lasso = Lasso(alpha=0.01)
model_lasso.fit(x_train, y_train) 
pred_train_lasso= model_lasso.predict(x_train)
print(np.sqrt(mean_squared_error(y_train,pred_train_lasso)))
print(r2_score(y_train, pred_train_lasso))

pred_test_lasso= model_lasso.predict(x_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_lasso))) 
print(r2_score(y_test, pred_test_lasso))

0.2241303779331927
0.9313134954201462
0.20867665308133937
0.9225467371796311


With Lasso regularization (alpha = 0.01), the results are:

## Training Metrics
* RMSE: 0.2241
    * The training RMSE is slightly higher than the linear regression model's (0.2199). The l1 penalty constrains the coefficients, which can raise training error slightly in exchange for better generalization.
* R²: 0.9313
    * The training R² is slightly lower than the linear regression model's (0.9339). This is expected: ordinary least squares minimizes the training squared error by construction, so a penalized fit cannot score higher on the training set. Lasso accepts this cost in order to shrink or eliminate less important coefficients.
## Test Metrics
* RMSE: 0.2087
    * The test RMSE is slightly lower than that of the linear regression model (0.2123). This suggests that Lasso may have improved the model's ability to generalize to unseen data by regularizing the coefficients.
* R²: 0.9225
    * The test R² is slightly higher than the linear regression model's (0.9198), so with Lasso regularization the model explains a marginally greater proportion of the variance in the test set.
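One way to see what the regularization did is to compare the fitted coefficients of the two models side by side. The sketch below refits both models on the same split and counts how many coefficients Lasso set to exactly zero. The variable names `ols` and `lasso` are illustrative; whether any coefficient is actually zeroed at alpha = 0.01 depends on the data and is not claimed here.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=101
)

ols = LinearRegression().fit(x_train, y_train)
lasso = Lasso(alpha=0.01).fit(x_train, y_train)

# Print the coefficient of each feature under both models.
for name, w_ols, w_lasso in zip(iris.feature_names, ols.coef_, lasso.coef_):
    print(f"{name:<20} OLS={w_ols: .4f}  Lasso={w_lasso: .4f}")

# The l1 penalty shrinks the total absolute coefficient size.
l1_ols = np.abs(ols.coef_).sum()
l1_lasso = np.abs(lasso.coef_).sum()
print("coefficients set to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))
print("l1 norm, OLS:", l1_ols, " Lasso:", l1_lasso)
```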


## Key Observations

1) Improved Generalization:

    * Lasso has slightly improved the test set performance (lower RMSE and higher R²), which is often the goal of regularization.

2) Minimal Trade-off:

    * While the training metrics deteriorated slightly (higher RMSE and lower R²), this trade-off is small and acceptable given the improvement in generalization.
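The size of this trade-off depends on alpha. The sketch below refits Lasso for a few alpha values and prints the training and test RMSE for each. The alpha grid is an arbitrary illustration, not a tuned choice, and in practice alpha would normally be selected on a validation set or by cross-validation rather than on the test set.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=101
)

alphas = [0.001, 0.01, 0.1, 1.0]  # illustrative grid, smallest to largest
results = []
for alpha in alphas:
    m = Lasso(alpha=alpha).fit(x_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, m.predict(x_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, m.predict(x_test)))
    results.append((alpha, train_rmse, test_rmse))
    print(f"alpha={alpha:<6} train RMSE={train_rmse:.4f}  test RMSE={test_rmse:.4f}")
```

Training RMSE can only stay flat or rise as alpha grows, while test RMSE may fall at first and then rise once the penalty becomes too strong.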

