## Ridge regression with cross-validation

1. Cross-validation is a statistical method for estimating how well a model will perform on unseen data.
2. It splits the data into multiple folds (subsets), holds out one fold as the validation set, and trains the model on the remaining folds. The process is repeated until every fold has served as the validation set once.
3. The results from the validation sets are averaged to give a more robust estimate of the model's performance.
4. Cross-validation helps detect overfitting, because a model that only memorizes its training folds scores poorly on the held-out fold.
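
The steps above can be sketched as an explicit loop. This is a minimal illustration on synthetic data (the arrays `X_demo` and `y_demo` and the coefficients are assumptions made up for the example, not the Boston data used later):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration): y depends linearly on 3 features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_mse = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    # Train on four folds, validate on the single held-out fold.
    reg = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_mse.append(mean_squared_error(y_demo[val_idx], reg.predict(X_demo[val_idx])))

# The cross-validation estimate is the average of the per-fold scores.
cv_mse = float(np.mean(fold_mse))
print(fold_mse, cv_mse)
```

Helpers such as `cross_val_score` and `GridSearchCV` run the same loop internally, so the explicit version is mainly useful for seeing what they do.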



There are several types of cross-validation techniques, including:

1. k-fold cross-validation
2. leave-one-out cross-validation
3. stratified cross-validation
4. time-based folds for time-series datasets
5. nested cross-validation for hyperparameter tuning
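
Most of these techniques correspond to a splitter class in `sklearn.model_selection`. The sketch below compares four of them on toy data (the arrays and the `n_splits` values are assumptions chosen for illustration). Nested cross-validation is left out because it combines two splitters rather than being one.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit

X_demo = np.arange(24).reshape(12, 2)   # 12 toy samples
labels = np.array([0] * 8 + [1] * 4)    # imbalanced class labels for the stratified case

splitters = {
    "k-fold": KFold(n_splits=4),
    "leave-one-out": LeaveOneOut(),
    "stratified k-fold": StratifiedKFold(n_splits=4),
    "time-series": TimeSeriesSplit(n_splits=4),
}

n_splits = {}
for name, splitter in splitters.items():
    # Only the stratified splitter actually uses the labels; the others ignore them.
    folds = list(splitter.split(X_demo, labels))
    n_splits[name] = len(folds)
    print(name, "->", len(folds), "splits; first validation fold:", folds[0][1])
```

Leave-one-out produces one split per sample, the stratified splitter keeps the class ratio in every fold, and the time-series splitter only validates on rows that come after the training rows.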

Advantages of cross-validation:

1. It provides a more robust evaluation of model performance than a single train/validation split.
2. It can be used to compare different models and select the one that performs best on average.
3. It can be used to tune hyperparameters.
4. It allows all the available data to be used for both training and validation, which makes it more data-efficient than a single hold-out split.

Disadvantage of cross-validation:

1. It is computationally expensive, because the model is refit once per fold.
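
To make the cost concrete, the number of model fits in a grid search can be counted directly. The grid below is hypothetical and exists only to show the arithmetic:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space, chosen only to illustrate the arithmetic.
demo_grid = {"alpha": [0.01, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}
n_folds = 5

n_candidates = len(ParameterGrid(demo_grid))   # 4 * 2 = 8 parameter combinations
n_fits = n_candidates * n_folds + 1            # +1 for the final refit on all the data (refit=True)
print(n_candidates, n_fits)                    # 8 41
```

The count grows multiplicatively with every hyperparameter added to the grid, which is why large searches are often run in parallel or replaced by randomized search.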


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

In [4]:
boston_df=pd.read_csv('Boston.csv')

In [5]:
boston_df

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |


Separating the target variable (medv) from the features, the experience (E) the model learns from

In [8]:
X=boston_df.drop('medv',axis=1)
y=boston_df['medv']

In [15]:
from sklearn.model_selection import train_test_split

In [17]:
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.7,random_state=23)

In [12]:
kfold=KFold(n_splits=5)

In [42]:
from sklearn.linear_model import LinearRegression,LogisticRegression

In [41]:
model=LinearRegression()
model2=LogisticRegression()

In [54]:
from sklearn.model_selection import GridSearchCV

In [58]:
param_grid={'fit_intercept':[True]}

In [59]:
gcv=GridSearchCV(model,param_grid,cv=kfold,scoring='neg_mean_squared_error')

In [60]:
# Caution: X and y include the X_test rows, so the test-set metrics below are optimistic
gcv.fit(X,y)

In [62]:
model_y_pred=gcv.predict(X_test)

In [71]:
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score

In [64]:
print("Mean squared error",np.round(mse(y_test,model_y_pred)))

Mean squared error 23.0


In [65]:
gcv.best_params_

{'fit_intercept': True}

In [66]:
gcv.best_score_

-37.131807467699126
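
The score is negative because `scoring='neg_mean_squared_error'` reports the MSE with its sign flipped, so that a larger score is always better. A small sketch of converting the value above back into an error on the scale of medv (the variable names here are illustrative):

```python
import numpy as np

best_score = -37.131807467699126    # value of gcv.best_score_ reported above
cv_mse = -best_score                # undo the sign flip to get the mean cross-validated MSE
cv_rmse = float(np.sqrt(cv_mse))    # root of the MSE, in the same units as medv
print(round(cv_mse, 2), round(cv_rmse, 2))
```

The root mean squared error comes out at roughly 6.09, which is easier to compare against typical medv values than the squared quantity.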

In [72]:
pd.DataFrame(gcv.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_fit_intercept | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005413 | 0.00077 | 0.00413 | 0.000737 | True | {'fit_intercept': True} | -12.460301 | -26.048621 | -33.074138 | -80.762371 | -33.313607 | -37.131807 | 23.091945 | 1 |
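
The per-fold scores range from about -12.5 to -80.8. One possible contributor is that `KFold(n_splits=5)` does not shuffle, so each fold is a contiguous block of rows in file order. Whether the rows of Boston.csv are ordered in a way that matters is not established here. The sketch below only demonstrates the effect on synthetic data that is deliberately sorted (all names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, deliberately ordered data (not the Boston data): rows are sorted by x
# and y is curved, so each contiguous block of rows covers a different regime.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
X_demo = x.reshape(-1, 1)
y_demo = x ** 2 + rng.normal(scale=1.0, size=x.size)

def fold_scores(cv):
    # Negative MSE per fold for a plain linear model under the given splitter.
    return cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv,
                           scoring="neg_mean_squared_error")

contiguous = fold_scores(KFold(n_splits=5))                               # folds follow row order
shuffled = fold_scores(KFold(n_splits=5, shuffle=True, random_state=0))   # rows assigned at random

print("contiguous:", np.round(contiguous, 1), "std:", round(float(contiguous.std()), 1))
print("shuffled:  ", np.round(shuffled, 1), "std:", round(float(shuffled.std()), 1))
```

If the same pattern shows up on real data, passing `shuffle=True` with a fixed `random_state` to `KFold` is a common remedy, except for time-ordered data, where the order must be preserved.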


In [74]:
r2_score(y_test,model_y_pred)

0.7213933361702176
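
For reference, both metrics used here can be computed by hand: MSE is the mean of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on four toy values chosen only for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen only for illustration.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)               # average squared residual
ss_res = np.sum(residuals ** 2)                    # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

# Compare the manual values with scikit-learn's implementations.
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

An R^2 of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets.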

In [75]:
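# No scoring argument, so GridSearchCV uses the estimator's default score (R^2 for LinearRegression)
gcv=GridSearchCV(model,param_grid,cv=kfold)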

In [76]:
gcv.fit(X_train,y_train)

In [78]:
model_y_pred_new=gcv.predict(X_test)

In [79]:
gcv.best_score_

0.6882130799800862
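
The cells above fit an ordinary `LinearRegression` with a single-candidate grid, so the search has nothing to choose between. A minimal sketch of the same workflow with `Ridge`, where the regularization strength `alpha` is tuned, could look like the following. It uses synthetic data from `make_regression` so that it runs without Boston.csv, and the `alpha` values are illustrative assumptions rather than recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in for the housing data (an assumption, so the sketch is self-contained).
X_demo, y_demo = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=23)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=23)

# Candidate regularization strengths; illustrative values only.
ridge_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

ridge_search = GridSearchCV(
    Ridge(),
    ridge_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=23),
    scoring="neg_mean_squared_error",
)
ridge_search.fit(X_tr, y_tr)   # only the training split takes part in the search

print("best alpha:", ridge_search.best_params_["alpha"])
print("CV MSE:", -ridge_search.best_score_)
# best_estimator_.score returns R^2; ridge_search.score would return negative MSE instead.
print("test R^2:", ridge_search.best_estimator_.score(X_te, y_te))
```

Because the search only sees the training split, the test R^2 is an estimate on data that played no part in choosing `alpha`.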