In [35]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

### You need to select a dataset that is relevant to all questions below (at least 2,500 observations)

## 1. Load and Explore Data

### Read the dataset.

In [36]:
life_expectancy_df = pd.read_csv("C:/Users/CHARITHA/Downloads/archive (7)/Life Expectancy Data.csv")

In [37]:
life_expectancy_df.shape

(2938, 22)

### Display the first few rows of the dataset

In [38]:
life_expectancy_df.head()

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


The dataset suits Linear, Lasso, Ridge, and Elastic Net regression because the target, "Life expectancy", is a continuous variable and the remaining columns are plausible predictors of it. It contains both numerical and categorical columns, so it also exercises encoding and feature selection. With 2,938 observations, above the 2,500 required, there is enough data to hold out a test set and to evaluate the models through cross-validation. The same size and variety make a grid search over the regularization hyperparameters worthwhile, which can improve the models' accuracy and ability to generalize.

In [39]:
life_expectancy_df.rename(columns=lambda x: x.replace(' ', ''), inplace=True)

### Extract input (X) and output (Y) data from the dataset

Since the focus is strictly on regression analysis, I am dropping the Country column. It only holds country names, which I assume do not contribute meaningfully to predicting the output.

In [40]:
life_expectancy_df = life_expectancy_df.drop(columns = ['Country'])

In [41]:
X = life_expectancy_df.drop(columns=['Lifeexpectancy'])
Y = life_expectancy_df['Lifeexpectancy']

## 2. Handle Missing Values

In [42]:
print(X.isnull().sum())

Year                              0
Status                            0
AdultMortality                   10
infantdeaths                      0
Alcohol                         194
percentageexpenditure             0
HepatitisB                      553
Measles                           0
BMI                              34
under-fivedeaths                  0
Polio                            19
Totalexpenditure                226
Diphtheria                       19
HIV/AIDS                          0
GDP                             448
Population                      652
thinness1-19years                34
thinness5-9years                 34
Incomecompositionofresources    167
Schooling                       163
dtype: int64


In [43]:
print(Y.isnull().sum())

10


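I am using KNN to impute the missing values because it:
- Preserves the data distribution: each gap is filled from similar rows, which keeps the original patterns better than a single global mean or median would.
- Works on the numerical columns: `KNNImputer` only accepts numeric input, so the categorical `Status` column (which has no missing values) is left out of the imputation.
- Uses observed values: every imputed value is an average of values that actually occur in the nearest rows, which limits imputation bias.
- Utilizes the full dataset: all rows are candidates when searching for the nearest neighbors, leading to more reliable estimates.
- Maintains multivariate relationships: distances are computed across all numerical features, so interrelated factors such as adult mortality, schooling, and GDP inform each other.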
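
To make the neighbor-based filling concrete, here is a minimal sketch on toy arrays (hypothetical values, not the life-expectancy data). It also illustrates a point worth checking: when the imputer sees a single column, there are no other features to measure distance on, and scikit-learn falls back to the column mean.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix (hypothetical values). Row 2 is missing its second feature.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [2.5, np.nan],
    [9.0, 90.0],
])

X_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)

# Judged by the observed first feature, the two nearest rows are rows 1 and 0,
# so the gap is filled with the mean of 20.0 and 10.0.
print(X_filled[2, 1])  # 15.0

# Single-column case: no other feature is available for a distance,
# so the missing entry is filled with the mean of the observed values.
y_toy = np.array([[1.0], [2.0], [np.nan], [6.0]])
y_filled = KNNImputer(n_neighbors=2).fit_transform(y_toy)
print(y_filled[2, 0])  # 3.0
```

In practice this means KNN imputation only adds information over mean imputation when the row has other observed features to compare on.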

In [44]:
# Initialize the KNN Imputer
imputer = KNNImputer(n_neighbors=5)

# Select columns to impute (all numerical columns)
columns_for_imputation = X.select_dtypes(include=['float64', 'int64']).columns.tolist()

# Perform KNN imputation on all columns with missing values in X
X[columns_for_imputation] = imputer.fit_transform(X[columns_for_imputation])

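# Reshape Y into a single-column 2-D array, as KNNImputer expects
Y_reshaped = Y.values.reshape(-1, 1)

# Impute Y. With only one column there are no other features to measure
# distance on, so the missing targets are filled with the column mean.
Y_imputed = imputer.fit_transform(Y_reshaped)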

# Convert the imputed Y back to a Series
Y = pd.Series(Y_imputed.flatten(), index=Y.index)

# Check for missing values after imputation
print("Missing values after imputation in X:")
print(X.isnull().sum())
print("Missing values after imputation in Y:")
print(Y.isnull().sum())

Missing values after imputation in X:
Year                            0
Status                          0
AdultMortality                  0
infantdeaths                    0
Alcohol                         0
percentageexpenditure           0
HepatitisB                      0
Measles                         0
BMI                             0
under-fivedeaths                0
Polio                           0
Totalexpenditure                0
Diphtheria                      0
HIV/AIDS                        0
GDP                             0
Population                      0
thinness1-19years               0
thinness5-9years                0
Incomecompositionofresources    0
Schooling                       0
dtype: int64
Missing values after imputation in Y:
0


## 3. Encode Categorical Data

There is only one categorical column, which is Status.

In [45]:
X = pd.get_dummies(X, columns=['Status'], drop_first=True)
X['Status_Developing'] = X['Status_Developing'].astype(int)
# Check the result
print(X.head())

     Year  AdultMortality  infantdeaths  Alcohol  percentageexpenditure  \
0  2015.0           263.0          62.0     0.01              71.279624   
1  2014.0           271.0          64.0     0.01              73.523582   
2  2013.0           268.0          66.0     0.01              73.219243   
3  2012.0           272.0          69.0     0.01              78.184215   
4  2011.0           275.0          71.0     0.01               7.097109   

   HepatitisB  Measles   BMI  under-fivedeaths  Polio  Totalexpenditure  \
0        65.0   1154.0  19.1              83.0    6.0              8.16   
1        62.0    492.0  18.6              86.0   58.0              8.18   
2        64.0    430.0  18.1              89.0   62.0              8.13   
3        67.0   2787.0  17.6              93.0   67.0              8.52   
4        68.0   3013.0  17.2              97.0   68.0              7.87   

   Diphtheria  HIV/AIDS         GDP  Population  thinness1-19years  \
0        65.0       0.1  584

## 4. Split the Dataset:

In [46]:
from sklearn.model_selection import train_test_split
# Assuming X and Y are defined and have the same length
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

## 5. Standardize Data

In [47]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
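
The scaler above is fitted on the training split only, which keeps test data out of the scaling statistics. The same idea can be extended to imputation and cross-validation by wrapping the steps in a `Pipeline`. The following is a sketch on synthetic stand-in data (the arrays, coefficients, and the choice of `Ridge` are illustrative assumptions, not the notebook's actual data or final model).

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the life-expectancy dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out roughly 10% of entries

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])

# Each CV fold refits the imputer and scaler on that fold's training part only.
neg_mse = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse).mean()

# Final fit on the full training split, then a single evaluation on the test split.
pipe.fit(X_tr, y_tr)
test_rmse = np.sqrt(np.mean((y_te - pipe.predict(X_te)) ** 2))
print(f"CV RMSE: {cv_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

With this layout the test rows never influence the imputed values or the scaling, so the test RMSE is a cleaner estimate of generalization.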

## 6. Linear Regression

### Perform Linear Regression on the training data

In [48]:
linear_model = LinearRegression()
linear_model.fit(X_train, Y_train)

### Predict the output for the test dataset using the fitted model

In [49]:
Y_pred_linear = linear_model.predict(X_test)

### Print the root mean squared error (RMSE) from Linear Regression.

In [50]:
rmse_linear = np.sqrt(mean_squared_error(Y_test, Y_pred_linear))
print(f'Linear Regression RMSE: {rmse_linear}')

Linear Regression RMSE: 3.7000649646249326
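
RMSE is the square root of the mean squared residual, so it is expressed in the same unit as the target (here, years of life expectancy). A small hand computation on hypothetical values shows what `np.sqrt(mean_squared_error(...))` evaluates:

```python
import numpy as np

# Hypothetical actual and predicted life expectancies (years).
y_true = np.array([60.0, 72.5, 81.0, 55.0])
y_hat = np.array([62.0, 70.5, 80.0, 59.0])

residuals = y_true - y_hat               # [-2, 2, 1, -4]
rmse = np.sqrt(np.mean(residuals ** 2))  # sqrt((4 + 4 + 1 + 16) / 4)
print(rmse)  # 2.5
```

Because the residuals are squared before averaging, a few large misses raise the RMSE more than many small ones.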


## 7. Lasso Regression

### Implement Lasso Regression on the training data.

In [51]:
# Initialize Lasso model with a higher max_iter
lasso_model = Lasso(max_iter=10000)  # Example: increase to 10,000 iterations

lasso_model.fit(X_train, Y_train)

### Predict the output for the test dataset using the fitted Lasso model.

In [52]:
Y_pred_lasso = lasso_model.predict(X_test)  # X_test already holds the scaled test features

### Evaluate and print the RMSE for Lasso Regression.

In [53]:
rmse_lasso = np.sqrt(mean_squared_error(Y_test, Y_pred_lasso))
print(f'Lasso Regression RMSE: {rmse_lasso}')

Lasso Regression RMSE: 3.7000723647255462


## 8. Ridge Regression

### Implement Ridge Regression on the training data.

In [89]:
ridge_model = Ridge(max_iter=10000)
ridge_model.fit(X_train, Y_train)

### Predict the output for the test dataset using the fitted Ridge model.

In [55]:
Y_pred_ridge = ridge_model.predict(X_test)

### Evaluate and print the RMSE for Ridge Regression

In [56]:
rmse_ridge = np.sqrt(mean_squared_error(Y_test, Y_pred_ridge))
print(f'Ridge Regression RMSE: {rmse_ridge}')

Ridge Regression RMSE: 3.7020341485247807


## 9. Elastic Net Regression

### Implement Elastic Net Regression on the training data.

In [81]:
elastic_net_model = ElasticNet(max_iter=10000)  # Example
elastic_net_model.fit(X_train, Y_train)

### Predict the output for the test dataset using the fitted Elastic Net model

In [82]:
Y_pred_elastic_net = elastic_net_model.predict(X_test)

### Evaluate and print the RMSE for Elastic Net Regression

In [83]:
rmse_elastic_net = np.sqrt(mean_squared_error(Y_test, Y_pred_elastic_net))
print(f'Elastic Net Regression RMSE: {rmse_elastic_net}')

Elastic Net Regression RMSE: 4.188373771624211


### Performance Comparison

- **Linear and Lasso Regression**: Both models reached a test RMSE of about 3.7001, the lowest of the four. On this split the L1 penalty made no practical difference to accuracy.

- **Ridge Regression**: The test RMSE of 3.7020 is marginally higher, a gap of about 0.002 that is too small to rank Ridge clearly below the other two. It suggests that coefficient shrinkage was not needed for this dataset.

- **Elastic Net Regression**: This model had the highest test RMSE, 4.1884. It was run with scikit-learn's default settings (`alpha=1.0`, `l1_ratio=0.5`) and not tuned, so the higher error may reflect regularization that is too strong for this data, causing the model to underfit.
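
For reference, these are the objectives the three regularized estimators minimize as documented by scikit-learn, with $n$ training samples, coefficient vector $w$, $\alpha$ standing for `alpha`, and $\rho$ for `l1_ratio` (the symbols are my notation):

```latex
\begin{aligned}
\text{Ridge:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
    + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
```

Note that the Ridge loss is not divided by $n$, so the same numeric `alpha` is a much weaker penalty for Ridge than for Lasso or Elastic Net when there are a few thousand training rows. Alpha values are therefore not directly comparable across the three models.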

## 10. Cross-Validation and Grid Search

### Apply cross-validation on the dataset to assess the models' generalization performance

In [84]:
# Cross-validation for Linear Regression
cv_scores_linear = cross_val_score(linear_model, X_train, Y_train, cv=5, scoring='neg_mean_squared_error')
cv_rmse_linear = np.sqrt(-cv_scores_linear)
print(f'Cross-validated RMSE for Linear Regression: {cv_rmse_linear.mean()}')

Cross-validated RMSE for Linear Regression: 4.033996886644355


In [85]:
# Lasso Regression Cross-Validation
lasso_cv_scores = cross_val_score(lasso_model, X_train, Y_train, cv=5, scoring='neg_mean_squared_error')
lasso_rmse_cv = np.sqrt(-lasso_cv_scores)
print(f'Cross-validated RMSE for Lasso: {lasso_rmse_cv.mean()}')

Cross-validated RMSE for Lasso: 4.034074727495609


In [86]:
# Ridge Regression Cross-Validation
ridge_cv_scores = cross_val_score(ridge_model, X_train, Y_train, cv=5, scoring='neg_mean_squared_error')
ridge_rmse_cv = np.sqrt(-ridge_cv_scores)
print(f'Cross-validated RMSE for Ridge: {ridge_rmse_cv.mean()}')

Cross-validated RMSE for Ridge: 4.0367025178358205


In [87]:
# Elastic Net Regression Cross-Validation
elastic_net_cv_scores = cross_val_score(elastic_net_model, X_train, Y_train, cv=5, scoring='neg_mean_squared_error')
elastic_net_rmse_cv = np.sqrt(-elastic_net_cv_scores)
print(f'Cross-validated RMSE for Elastic Net: {elastic_net_rmse_cv.mean()}')

Cross-validated RMSE for Elastic Net: 4.496411731265011


### Perform grid search to fine-tune hyperparameters for Ridge and Lasso Regression models

In [88]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
import numpy as np

# Function to calculate RMSE from cross-validated scores
def calculate_rmse(cv_scores):
    return np.sqrt(-cv_scores)

# Grid search for Lasso Regression
lasso_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
lasso_grid = GridSearchCV(Lasso(), lasso_params, cv=5, scoring='r2')
lasso_grid.fit(X_train, Y_train)

# Best parameters for Lasso Regression
best_lasso_params = lasso_grid.best_params_
best_lasso_r2 = lasso_grid.best_score_  # Get the best R² score

# Calculate RMSE for Lasso
lasso_rmse = calculate_rmse(cross_val_score(Lasso(alpha=best_lasso_params['alpha']), X_train, Y_train, cv=5, scoring='neg_mean_squared_error')).mean()

print(f'Best parameters for Lasso Regression: {best_lasso_params}')
print(f'Cross-validated R² for Lasso Regression: {best_lasso_r2}')
print(f'Cross-validated RMSE for Lasso Regression: {lasso_rmse}')

# Grid search for Ridge Regression
ridge_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='r2')
ridge_grid.fit(X_train, Y_train)

# Best parameters for Ridge Regression
best_ridge_params = ridge_grid.best_params_
best_ridge_r2 = ridge_grid.best_score_  # Get the best R² score

# Calculate RMSE for Ridge
ridge_rmse = calculate_rmse(cross_val_score(Ridge(alpha=best_ridge_params['alpha']), X_train, Y_train, cv=5, scoring='neg_mean_squared_error')).mean()
print(f'Best parameters for Ridge Regression: {best_ridge_params}')
print(f'Cross-validated R² for Ridge Regression: {best_ridge_r2}')
print(f'Cross-validated RMSE for Ridge Regression: {ridge_rmse}')


Best parameters for Lasso Regression: {'alpha': 0.01}
Cross-validated R² for Lasso Regression: 0.8191640695931482
Cross-validated RMSE for Lasso Regression: 4.0436291105594035
Best parameters for Ridge Regression: {'alpha': 0.01}
Cross-validated R² for Ridge Regression: 0.8200244650222952
Cross-validated RMSE for Ridge Regression: 4.034002431166777


### Discuss the results of cross-validation and grid search, providing insights into the optimal hyperparameters for the models.

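| Model                  | Cross-Validated RMSE (initial model) | Cross-Validated RMSE (after grid search) | Test RMSE |
|------------------------|--------------------------------------|------------------------------------------|-----------|
| Linear Regression      | 4.034                                | n/a                                      | 3.700     |
| Lasso Regression       | 4.034                                | 4.044 (alpha = 0.01)                     | 3.700     |
| Ridge Regression       | 4.037                                | 4.034 (alpha = 0.01)                     | 3.702     |
| Elastic Net Regression | 4.496                                | not tuned                                | 4.188     |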

- **Linear Regression:** The cross-validated Root Mean Squared Error (RMSE) was approximately **4.034**. RMSE is the square root of the mean squared difference between predicted and actual values, so a typical prediction error is about 4 years of life expectancy. This is higher than the 3.700 measured on the test split, which shows how much the estimate depends on the particular split.
- **Lasso Regression:** With the grid-search optimum (alpha = 0.01), Lasso produced a cross-validated RMSE of about **4.044**, comparable to Linear Regression and slightly above the 4.034 of the initial Lasso model. Regularization therefore did not improve accuracy here. Lasso can also perform feature selection by zeroing coefficients, but the coefficients were not inspected in this analysis.
- **Ridge Regression:** With alpha = 0.01, Ridge had a cross-validated RMSE of **4.034**, practically identical to Linear Regression, which is expected with such a small penalty. It keeps the same predictive capability while stabilizing coefficients under multicollinearity.
- **Elastic Net Regression:** The initial Elastic Net model had a cross-validated RMSE of about **4.496**, the highest of the four. It was not included in the grid search, so this figure reflects its initial settings only.
- **R² Scores:** The best cross-validated R² scores for Lasso and Ridge were approximately **0.819** and **0.820**, respectively. The models thus explain around 82% of the variance in the target variable, reflecting strong predictive performance.

#### Optimal Hyperparameters

1. **Lasso Regression:**
   - The best regularization parameter (alpha) for Lasso Regression was **0.01**, the smallest value in the grid [0.01, 0.1, 1, 10, 100]. Because the optimum sits on the edge of the grid, the true optimum may be lower still; what the search shows is that light regularization works best among the values tried. Lasso can shrink some coefficients exactly to zero, which makes it useful for feature selection and can simplify the model without a significant loss in performance.

2. **Ridge Regression:**
   - The best alpha for Ridge Regression was also **0.01**, again the lower edge of the grid. The model needs only minimal regularization to address multicollinearity. Ridge shrinks coefficients toward zero without eliminating any, so all predictors stay in the model.

3. **Elastic Net Regression:**
   - Elastic Net was not part of the grid search, so no tuned hyperparameters are available for it. Since it combines the L1 penalty of Lasso with the L2 penalty of Ridge, a grid search over both `alpha` and `l1_ratio` would show how much regularization this model needs.
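
A grid search for Elastic Net could look like the sketch below. It runs on synthetic data from `make_regression` (a hypothetical stand-in for the standardized training set), and the grid values are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / Y_train (hypothetical data).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=5,
                                 noise=5.0, random_state=0)

# Search over penalty strength and the L1/L2 mix at the same time.
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0, 10.0],
    "l1_ratio": [0.1, 0.5, 0.9],
}
enet_grid = GridSearchCV(
    ElasticNet(max_iter=10000),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
enet_grid.fit(X_demo, y_demo)

# RMSE derived from the mean cross-validated MSE of the best setting.
best_rmse = np.sqrt(-enet_grid.best_score_)
print(enet_grid.best_params_, best_rmse)
```

Scoring with negative MSE keeps the tuned result directly comparable to the RMSE values reported for the other models.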

#### Insights and Recommendations

- **Model Selection:** The close RMSE values for Lasso and Ridge suggest that either model can be chosen based on the specific context of the problem. If feature selection is a priority, Lasso might be preferred; if dealing with multicollinearity is more critical, Ridge could be more suitable.
- **Hyperparameter Tuning:** A low level of regularization (alpha = 0.01) was the best of the values tried for both Lasso and Ridge. Because 0.01 is the smallest value in the grid, future work could test alphas below 0.01 and use a finer grid to refine the models further.
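
One way to set up such a finer search is a logarithmically spaced grid plus a check for whether the winner lies on the grid boundary. The sketch below uses synthetic data and an arbitrary range of 1e-4 to 10; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the standardized training data (hypothetical).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, n_informative=5,
                                 noise=5.0, random_state=0)

# 20 alphas spaced evenly on a log scale between 1e-4 and 10.
alphas = np.logspace(-4, 1, 20)
search = GridSearchCV(Lasso(max_iter=10000), {"alpha": alphas},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
# If the best value is the first or last grid point, the grid should be widened.
at_edge = bool(np.isclose(best_alpha, alphas[0]) or np.isclose(best_alpha, alphas[-1]))
print(f"best alpha: {best_alpha:.4g}, at grid edge: {at_edge}")
```

If `at_edge` is true, the search range should be extended in that direction before the chosen alpha is trusted.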
