<a href="https://colab.research.google.com/github/budd-lab/ML-Model-Selection-and-Cross-Validation/blob/main/AIMLBerkeleyAssignments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import set_config
set_config(display="diagram")

import warnings
warnings.filterwarnings('ignore')



In [None]:
#!mkdir -p ~/.kaggle
#!cp kaggle.json ~/.kaggle/
#!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets list

ref                                                            title                                            size  lastUpdated          downloadCount  voteCount  usabilityRating  
-------------------------------------------------------------  ----------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
muhammadehsan000/healthcare-dataset-2019-2024                  Healthcare Dataset (2019-2024)                    3MB  2024-08-09 17:52:25           3935         87  1.0              
berkayalan/paris-2024-olympics-medals                          Paris 2024 Olympics Medals                        1KB  2024-08-14 11:02:45           1827         39  1.0              
muhammadehsan000/diabetes-healthcare-dataset                   Diabetes Healthcare Dataset                      27KB  2024-08-17 19:30:34            471         26  1.0              
muhammadehsan000/olympic-games-medal-dataset-1994-2024         Olympic Games Medal Da

In [3]:
# Download the dataset using the Kaggle API (assuming you have kaggle.json set up)
!kaggle datasets download -d kumarajarshi/life-expectancy-who

# Unzip the downloaded file
!unzip life-expectancy-who.zip

# Read the CSV file into a pandas DataFrame
df = pd.read_csv('Life Expectancy Data.csv')

# Display the first few rows to verify the import
print(df.head())

Dataset URL: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
License(s): other
Downloading life-expectancy-who.zip to /content
  0% 0.00/119k [00:00<?, ?B/s]
100% 119k/119k [00:00<00:00, 36.9MB/s]
Archive:  life-expectancy-who.zip
  inflating: Life Expectancy Data.csv  
       Country  Year      Status  Life expectancy   Adult Mortality  \
0  Afghanistan  2015  Developing              65.0            263.0   
1  Afghanistan  2014  Developing              59.9            271.0   
2  Afghanistan  2013  Developing              59.9            268.0   
3  Afghanistan  2012  Developing              59.5            272.0   
4  Afghanistan  2011  Developing              59.2            275.0   

   infant deaths  Alcohol  percentage expenditure  Hepatitis B  Measles   ...  \
0             62     0.01               71.279624         65.0      1154  ...   
1             64     0.01               73.523582         62.0       492  ...   
2             66     0.01               73

In [4]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


In [10]:
# Drop rows with NA values (keeps 1649 of the 2938 rows), then summarize
df = df.dropna()
df.describe()

Unnamed: 0,Year,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,under-five deaths,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
count,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0,1649.0
mean,2007.840509,69.302304,168.215282,32.553062,4.533196,698.973558,79.217708,2224.494239,38.128623,44.220133,83.564585,5.955925,84.155246,1.983869,5566.031887,14653630.0,4.850637,4.907762,0.631551,12.119891
std,4.087711,8.796834,125.310417,120.84719,4.029189,1759.229336,25.604664,10085.802019,19.754249,162.897999,22.450557,2.299385,21.579193,6.03236,11475.900117,70460390.0,4.599228,4.653757,0.183089,2.795388
min,2000.0,44.0,1.0,0.0,0.01,0.0,2.0,0.0,2.0,0.0,3.0,0.74,2.0,0.1,1.68135,34.0,0.1,0.1,0.0,4.2
25%,2005.0,64.4,77.0,1.0,0.81,37.438577,74.0,0.0,19.5,1.0,81.0,4.41,82.0,0.1,462.14965,191897.0,1.6,1.7,0.509,10.3
50%,2008.0,71.7,148.0,3.0,3.79,145.102253,89.0,15.0,43.7,4.0,93.0,5.84,92.0,0.1,1592.572182,1419631.0,3.0,3.2,0.673,12.3
75%,2011.0,75.0,227.0,22.0,7.34,509.389994,96.0,373.0,55.8,29.0,97.0,7.47,97.0,0.7,4718.51291,7658972.0,7.1,7.1,0.751,14.0
max,2015.0,89.0,723.0,1600.0,17.87,18961.3486,99.0,131441.0,77.1,2100.0,99.0,14.39,99.0,50.6,119172.7418,1293859000.0,27.2,28.2,0.936,20.7
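
Since `dropna()` removes every row with at least one missing value, it can help to check how the missing values are spread across columns before dropping. The following is a minimal sketch on a tiny made-up frame; `df_demo` and its values are hypothetical stand-ins, not rows from the WHO dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame; the values are invented for illustration only.
df_demo = pd.DataFrame({
    "Life expectancy ": [65.0, np.nan, 59.9, 59.5],
    "Alcohol": [0.01, 0.01, np.nan, np.nan],
    "GDP": [584.3, 612.7, 631.7, np.nan],
})

# Count missing values per column, largest first.
missing = df_demo.isna().sum().sort_values(ascending=False)

# Compare the row count before and after dropping incomplete rows.
rows_before = len(df_demo)
rows_after = len(df_demo.dropna())

print(missing)
print(f"dropna keeps {rows_after} of {rows_before} rows")
```

Running the same two checks on `df` before the `dropna()` call would show which columns account for most of the discarded rows.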


In [11]:
# Assign X and y variables for model fitting

X=df.drop(columns=['Life expectancy ']).select_dtypes(include='number')
y=df['Life expectancy ']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)



In [12]:
# Assign Pipeline for model fitting

pipe_seq = Pipeline([
    ('polyfeatures',PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler',StandardScaler()),
    ('seq',SequentialFeatureSelector(estimator=LinearRegression(),n_features_to_select=5)),
    ('linreg',LinearRegression())
])

pipe_lasso = Pipeline([
    ('polyfeatures',PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler',StandardScaler()),
    ('lasso',Lasso(random_state=42))
])

pipe_ridge = Pipeline([
    ('polyfeatures',PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler',StandardScaler()),
    ('ridge',Ridge(random_state=42))
])


pipe_seq.fit(X_train,y_train)
pipe_lasso.fit(X_train,y_train)
pipe_ridge.fit(X_train,y_train)


seq_coef = pipe_seq.named_steps['linreg'].coef_
lasso_coef = pipe_lasso.named_steps['lasso'].coef_
ridge_coef = pipe_ridge.named_steps['ridge'].coef_

print("Seq Coef:", seq_coef)
print("Lasso Coef:", lasso_coef)
print("Ridge Coef:", ridge_coef)

print("Seq Score:", pipe_seq.score(X_test,y_test))
print("Lasso Score:", pipe_lasso.score(X_test,y_test))
print("Ridge Score:", pipe_ridge.score(X_test,y_test))

print("Seq MSE:", mean_squared_error(y_test,pipe_seq.predict(X_test)))
print("Lasso MSE:", mean_squared_error(y_test,pipe_lasso.predict(X_test)))
print("Ridge MSE:", mean_squared_error(y_test,pipe_ridge.predict(X_test)))

Seq Coef: [-4.23681279 -4.64585137 -5.08622869  4.82388459  9.15712711]
Lasso Coef: [-0.         -0.         -0.          0.          0.          0.
 -0.          0.         -0.          0.          0.          0.
 -0.          0.         -0.         -0.         -0.          0.
  1.30533624 -0.         -1.10497713 -0.          0.          0.
  0.         -0.          0.         -0.          0.          0.
  0.         -1.54621818  0.         -0.         -0.         -0.
  0.          0.         -0.95716838 -0.         -0.          0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.          0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.         -0.         -0.         -0.         -0.         -0.
 -0.27566985 -0.         -0.         -0.         -0.         -0.
 -0.          0.          0.          0.         -0.          0.
 -0.          0.          0.          0.         -0.          0.
 -0.  

In [13]:
model=pipe_ridge.named_steps['ridge']
model
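
Note that `model` is only the final `Ridge` step, so any search run on it works on the raw columns without the polynomial expansion or scaling. A possible alternative is to search over the whole pipeline and address the penalty as `ridge__alpha`, which refits the preprocessing inside each fold. The sketch below uses synthetic data; `X_demo`, `y_demo`, `pipe_demo`, and `param_grid_demo` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with one quadratic and one linear effect.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

pipe_demo = Pipeline([
    ("polyfeatures", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# "<step name>__<parameter>" routes the value to that step of the pipeline.
param_grid_demo = {"ridge__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipe_demo,
    param_grid_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is a negated MSE, so flip the sign to report an error.
print(search.best_params_, -search.best_score_)
```

Because the scaler is fit only on each fold's training portion, the validation rows do not leak into the preprocessing statistics.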

In [19]:
from sklearn.model_selection import GridSearchCV, train_test_split, KFold, LeaveOneOut, PredefinedSplit

param_grid = {'alpha': [0.1, 1.0, 10.0]}

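# 1. Simulating Holdout Cross-Validation with GridSearchCV
# PredefinedSplit keeps rows labelled -1 in the training set and uses rows
# labelled 0 as the single validation fold. The labels are applied to the rows
# of X in their stored order (X is passed to fit below), so the first
# len(X_test) rows train the model and the remaining len(X_train) rows validate
# it; this is not the same partition that train_test_split produced.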

test_indices = [-1] * len(X_test)
train_indices = [0] * len(X_train)
test_fold = test_indices + train_indices # Combine indices for PredefinedSplit
ps = PredefinedSplit(test_fold=test_fold)

grid_search_holdout = GridSearchCV(model, param_grid, cv=ps, scoring='neg_mean_squared_error')
grid_search_holdout.fit(X, y)
holdout_score = grid_search_holdout.best_score_
holdout_best_params = grid_search_holdout.best_params_

# 2. K-Fold Cross-Validation with GridSearchCV
#kf = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search_kfold = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search_kfold.fit(X, y)
kfold_score = grid_search_kfold.best_score_
kfold_best_params = grid_search_kfold.best_params_

# 3. Leave-One-Out Cross-Validation with GridSearchCV
grid_search_loo = GridSearchCV(model, param_grid, cv=LeaveOneOut(), scoring='neg_mean_squared_error')
grid_search_loo.fit(X, y)
loo_score = grid_search_loo.best_score_
loo_best_params = grid_search_loo.best_params_

# Display the results
print(f"Holdout CV - Best Score: {holdout_score:.4f}, Best Params: {holdout_best_params}")
print(f"K-Fold CV - Best Score: {kfold_score:.4f}, Best Params: {kfold_best_params}")
print(f"LOO CV - Best Score: {loo_score:.4f}, Best Params: {loo_best_params}")

Holdout CV - Best Score: -54.1318, Best Params: {'alpha': 10.0}
K-Fold CV - Best Score: -15.1707, Best Params: {'alpha': 0.1}
LOO CV - Best Score: -12.8850, Best Params: {'alpha': 0.1}
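
In `PredefinedSplit`, a label of -1 keeps a row in the training set and a label of 0 puts it in validation fold 0, with labels matched to rows by position. A holdout search that validates on the `train_test_split` test portion therefore needs the data stacked in the same order as the labels. The following is a sketch on synthetic data; `X_demo`, `y_demo`, `X_tr`, `X_val`, `X_all`, and `ps_demo` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in data.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Stack training rows first, validation rows second, so that the row order
# matches the fold labels built below.
X_all = np.concatenate([X_tr, X_val])
y_all = np.concatenate([y_tr, y_val])

# -1: always train on this row; 0: this row belongs to validation fold 0.
test_fold = np.concatenate([
    np.full(len(X_tr), -1),
    np.zeros(len(X_val), dtype=int),
])
ps_demo = PredefinedSplit(test_fold=test_fold)

search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},
    cv=ps_demo,
    scoring="neg_mean_squared_error",
)
search.fit(X_all, y_all)

# Confirm the single split has the intended sizes.
train_idx, val_idx = next(ps_demo.split())
print(len(train_idx), len(val_idx), search.best_params_)
```

With this layout the holdout score is computed on exactly the rows that `train_test_split` set aside, which makes it comparable to the test-set scores reported earlier.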


In [26]:
holdout_mse=-holdout_score
kfold_mse=-kfold_score
loo_mse=-loo_score

print("Holdout MSE:", holdout_mse)
print("K-Fold MSE:", kfold_mse)
print("LOO MSE:", loo_mse)



Holdout MSE: 54.131811724697904
K-Fold MSE: 15.170684507741672
LOO MSE: 12.885032072095992


In [28]:
# Create a DataFrame for the plot
mse_data = {
    'Method': ['Holdout', 'K-Fold', 'LOO'],
    'MSE': [holdout_mse, kfold_mse, loo_mse]
}
mse_df = pd.DataFrame(mse_data)

# Create the bar plot
fig = px.bar(mse_df, x='Method', y='MSE',
             title='Comparison of Cross-Validation Methods using Mean Squared Error',
             labels={'MSE': 'Mean Squared Error'},
             text_auto=True)

# Customize the plot appearance
fig.update_traces(marker_color='blue', textposition='outside')

# Show the plot
fig.show()

In [29]:
!apt-get install -y git

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git is already the newest version (1:2.34.1-1ubuntu1.11).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [30]:
!git init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/


/bin/bash: line 1: CV.ipynb: command not found
cp: target 'https://github.com/budd-lab/ML-Model-Selection-and-Cross-Validation/tree/main/Model' is not a directory


In [32]:
!git clone https://github.com/budd-lab/ML-Model-Selection-and-Cross-Validation.git


Cloning into 'ML-Model-Selection-and-Cross-Validation'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (3/3), done.


In [34]:
%cd ML-Model-Selection-and-Cross-Validation

/content/ML-Model-Selection-and-Cross-Validation


In [37]:
!git add .
!git commit -m "Adding notebook"
!git push -u origin master

On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
error: src refspec master does not match any
error: failed to push some refs to 'https://github.com/budd-lab/ML-Model-Selection-and-Cross-Validation.git'

In [36]:
!git config --global user.email "buddhendra@google.com"
!git config --global user.name "Buddhendra"

In [38]:
!git init

Reinitialized existing Git repository in /content/ML-Model-Selection-and-Cross-Validation/.git/


In [39]:
!git add .
!git commit -m "Initial commit"

On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean


In [40]:
!git remote add origin https://github.com/budd-lab/ML-Model-Selection-and-Cross-Validation.git

error: remote origin already exists.


In [41]:
!git push -u origin master

error: src refspec master does not match any
error: failed to push some refs to 'https://github.com/budd-lab/ML-Model-Selection-and-Cross-Validation.git'

In [45]:
!cp AIMLBerkeleyAssignments.ipynb /content/ML-Model-Selection-and-Cross-Validation

cp: cannot stat 'AIMLBerkeleyAssignments.ipynb': No such file or directory


In [46]:
pwd

'/content/ML-Model-Selection-and-Cross-Validation'