Introduction to AI Programming II (2023-09-22)


# **Lab 3: Ridge & Lasso Regression with California housing prices data**

# Task1: Preprocessing


### Import data

In [1]:
import numpy as np
import pandas as pd

housing = pd.read_csv('https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv')
housing.head(10)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
5,-122.25,37.85,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
6,-122.25,37.84,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
7,-122.25,37.84,52.0,3104.0,687.0,1157.0,647.0,3.12,241400.0,NEAR BAY
8,-122.26,37.84,42.0,2555.0,665.0,1206.0,595.0,2.0804,226700.0,NEAR BAY
9,-122.25,37.84,52.0,3549.0,707.0,1551.0,714.0,3.6912,261100.0,NEAR BAY


### NaN (Not a Number: missing data) will be filled with the median value.

In [2]:
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
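As a small illustration of why the median can be a safer fill value than the mean, the sketch below uses a made-up column (not the housing data) with one missing entry and one large outlier. The numbers are hypothetical and only meant to show how the two statistics differ.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one NaN and one large outlier.
bedrooms = pd.Series([100.0, 120.0, np.nan, 110.0, 5000.0])

median_value = bedrooms.median()  # NaN is skipped by default
mean_value = bedrooms.mean()      # pulled upward by the outlier

filled_median = bedrooms.fillna(median_value)
filled_mean = bedrooms.fillna(mean_value)

print("median:", median_value, "mean:", mean_value)
print("remaining NaN after fill:", filled_median.isna().sum())
```

The median fill stays close to the typical values, while the mean fill is dominated by the single outlier.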

### Feature substitution: total_bedrooms -> 'bedrooms_per_room'

In [3]:
housing['bedrooms_per_room'] =  housing['total_bedrooms'] / housing['total_rooms']
del housing["total_bedrooms"], housing['total_rooms']

### Standard Scaling

In [4]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

col_list = list(housing)
col_list.remove("ocean_proximity") # text type
col_list.remove("median_house_value") # the target variable does not need to be scaled

# generate a new dataframe that consists of numeric columns only
housing_numeric = housing[col_list]
housing_scaled = scaler.fit_transform(housing_numeric)
# Data type conversion from NumPy array (the output of fit_transform) to 'DataFrame'
housing_scaled_df = pd.DataFrame(housing_scaled, index=housing_numeric.index, columns=housing_numeric.columns)

# Concatenate
housing = pd.concat([housing_scaled_df, housing["median_house_value"], housing["ocean_proximity"]], axis=1)
housing.head()


Unnamed: 0,longitude,latitude,housing_median_age,population,households,median_income,bedrooms_per_room,median_house_value,ocean_proximity
0,-1.327835,1.052548,0.982143,-0.974429,-0.977033,2.344766,-1.029988,452600.0,NEAR BAY
1,-1.322844,1.043185,-0.607019,0.861439,1.669961,2.332238,-0.888897,358500.0,NEAR BAY
2,-1.332827,1.038503,1.856182,-0.820777,-0.843637,1.782699,-1.291686,352100.0,NEAR BAY
3,-1.337818,1.038503,1.856182,-0.766028,-0.733781,0.932968,-0.449613,341300.0,NEAR BAY
4,-1.337818,1.038503,1.856182,-0.759847,-0.629157,-0.012881,-0.639087,342200.0,NEAR BAY
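To make the scaling step concrete, here is a minimal NumPy sketch of the computation `StandardScaler` performs: subtract each column's mean and divide by its population standard deviation (`ddof=0`). The array `X` is a made-up example, not the housing data.

```python
import numpy as np

# Two hypothetical features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)          # population std (ddof=0)
X_scaled = (X - mean) / std  # each column now has mean 0 and std 1

print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, both columns hold identical values, which is why regularization penalties treat the features evenly.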


### One-hot encoding

In [5]:
housing = pd.get_dummies(housing)
housing.head(10)

Unnamed: 0,longitude,latitude,housing_median_age,population,households,median_income,bedrooms_per_room,median_house_value,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-1.327835,1.052548,0.982143,-0.974429,-0.977033,2.344766,-1.029988,452600.0,0,0,0,1,0
1,-1.322844,1.043185,-0.607019,0.861439,1.669961,2.332238,-0.888897,358500.0,0,0,0,1,0
2,-1.332827,1.038503,1.856182,-0.820777,-0.843637,1.782699,-1.291686,352100.0,0,0,0,1,0
3,-1.337818,1.038503,1.856182,-0.766028,-0.733781,0.932968,-0.449613,341300.0,0,0,0,1,0
4,-1.337818,1.038503,1.856182,-0.759847,-0.629157,-0.012881,-0.639087,342200.0,0,0,0,1,0
5,-1.337818,1.038503,1.856182,-0.894071,-0.801787,0.087447,0.275563,269700.0,0,0,0,1,0
6,-1.337818,1.033821,1.856182,-0.292712,0.037823,-0.111366,-0.320242,299200.0,0,0,0,1,0
7,-1.337818,1.033821,1.856182,-0.237079,0.385698,-0.395137,0.115458,241400.0,0,0,0,1,0
8,-1.342809,1.033821,1.061601,-0.19381,0.249687,-0.942359,0.712372,226700.0,0,0,0,1,0
9,-1.337818,1.033821,1.856182,0.110844,0.560944,-0.09447,-0.223507,261100.0,0,0,0,1,0
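The following sketch shows what `pd.get_dummies` does on a tiny made-up frame: numeric columns pass through unchanged, and each distinct string value becomes its own indicator column. Passing `dtype=int` is a choice made here to get the 0/1 values shown in the output above; recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and one text column.
toy = pd.DataFrame({
    "median_income": [8.3, 3.1, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# dtype=int forces 0/1 indicators instead of True/False.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```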


### Train / Test split

Split the data into training and test sets using `train_test_split` from **sklearn**.

In [6]:
# training / test separation
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size= 0.2, random_state=45)

print('# of train_set : %.0f, # of test_set : %.0f' %(train_set.shape[0], test_set.shape[0]))

# of train_set : 16512, # of test_set : 4128
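Conceptually, a random 80/20 split shuffles the row indices and cuts them into two disjoint groups. The NumPy sketch below illustrates that idea and the resulting set sizes; it is not sklearn's implementation and will not reproduce the exact rows chosen by `train_test_split`.

```python
import numpy as np

n_samples = 20640   # total rows, i.e. 16512 + 4128 from the output above
test_size = 0.2

# The seed is arbitrary; this does not match sklearn's internal shuffle.
rng = np.random.default_rng(45)
shuffled = rng.permutation(n_samples)

n_test = int(round(test_size * n_samples))
test_idx = shuffled[:n_test]    # first 20% of the shuffled indices
train_idx = shuffled[n_test:]   # remaining 80%

print(len(train_idx), len(test_idx))
```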


Separate each set into features and target using the `drop` and `copy` functions.

In [7]:
# feature and target separation of the training set
train_set_features = train_set.drop('median_house_value',axis=1) # X
train_set_target = train_set["median_house_value"].copy() # y

In [8]:
# feature and target separation of the test set
test_set_features = test_set.drop("median_house_value",axis=1)                  # drop 'median_house_value' from test_set
test_set_target = test_set["median_house_value"].copy()                    # only 'median_house_value' from test_set

# Task2: Linear Regression model
## Use scikit learn packages

## Train

Analysis result of the linear regression model.
1. train score: **0.6506911525314936**
2. test score: **0.6279901608184868**
3. rmse: **70258.86302412274**
4. coefficients: **[-50759.39036387 -51574.01642357  12545.81188557 -53652.59414589
  56812.99611044  75937.26906977   9146.7447002  -26966.35524515
 -65564.17761279 146019.09338452 -31175.40269955 -22313.15782703]**
5. bias: **245849.07939194865**

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

lin_reg = LinearRegression()
lin_reg.fit(train_set_features, train_set_target)

print("Learned Parameters : ")
print("Coefficients: ", lin_reg.coef_)
print("bias: ", lin_reg.intercept_)

print("Train_r2_score : ", lin_reg.score(train_set_features, train_set_target))

Learned Parameters : 
Coefficients:  [-50759.39036387 -51574.01642357  12545.81188557 -53652.59414589
  56812.99611044  75937.26906977   9146.7447002  -26966.35524515
 -65564.17761279 146019.09338452 -31175.40269955 -22313.15782703]
bias:  245849.07939194865
Train_r2_score :  0.6506911525314936


## Test

In [12]:
final_model = lin_reg
final_predictions = final_model.predict(test_set_features)

print("Test_score : ", r2_score(test_set_target, final_predictions))

delta = test_set_target - final_predictions
print("Errors (%) in housing value prediction : ", np.mean(np.abs(delta)/test_set_target))

# RMSE
final_mse = mean_squared_error(test_set_target, final_predictions)
final_rmse = np.sqrt(final_mse)
print("RMSE is : ", final_rmse)

Test_score :  0.6279901608184868
Errors (%) in housing value prediction :  0.2860170873659307
RMSE is :  70258.86302412274
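For reference, the three quantities printed above can be computed directly with NumPy. The sketch below uses made-up targets and predictions. Note that the "Errors (%)" value is a fraction of the true value, so 0.286 corresponds to an average error of about 28.6%.

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

residuals = y_true - y_pred

# R^2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# Mean absolute relative error, as a fraction of the true value
mean_rel_error = np.mean(np.abs(residuals) / y_true)

print(r2, rmse, mean_rel_error)
```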


# Task3: Ridge Regression model

Find the best alpha for the ridge model (hint: use a for loop).

 Best score: **0.6343267973400977**

 Best alpha: **513**

In [14]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings(action = 'ignore')

def ridge(alpha):
  model = Ridge(alpha = alpha)
  model.fit(train_set_features, train_set_target)

  ridge_predicted = model.predict(test_set_features)

  test_score = r2_score(test_set_target, ridge_predicted)
  return test_score

In [15]:
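max_alpha = 100000
best = 0
best_alpha = None
stop_or_not = 0  # counts every alpha that fails to improve the best score (never reset)
for alpha in range(0, max_alpha):
  test_score = ridge(alpha)
  if best < test_score:
    # new best test R^2 so far
    best = test_score
    best_alpha = alpha
  else:
    stop_or_not += 1
  # stop once more than 100 alphas in total have failed to improve
  if stop_or_not > 100:
    break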

print("Best score: ", best)
print("Best alpha: ", best_alpha)

Best score:  0.6343267973400977
Best alpha:  513
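The loop above picks alpha by its score on the test set, so the reported best score is likely somewhat optimistic. A common alternative, sketched below, is to choose alpha on a separate validation split and touch the test set only once at the end. This is not what the lab does: the data is synthetic, and `fit_ridge` and `r2` are hypothetical helper names that use the closed-form ridge solution in NumPy rather than sklearn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, not the housing set).
n, d = 300, 5
X = rng.normal(size=(n, d))
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y = X @ true_w + rng.normal(scale=2.0, size=n)

# Three-way split: train / validation / test.
X_train, y_train = X[:180], y[:180]
X_val, y_val = X[180:240], y[180:240]
X_test, y_test = X[240:], y[240:]

def fit_ridge(X, y, alpha):
    """Closed-form ridge with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Score each candidate alpha on the validation split only.
alphas = np.logspace(-3, 3, 13)
val_scores = []
for alpha in alphas:
    w, b = fit_ridge(X_train, y_train, alpha)
    val_scores.append(r2(y_val, X_val @ w + b))

best_alpha = alphas[int(np.argmax(val_scores))]

# Evaluate the chosen alpha once on the held-out test split.
w, b = fit_ridge(X_train, y_train, best_alpha)
test_r2 = r2(y_test, X_test @ w + b)
print(best_alpha, test_r2)
```

Searching on a logarithmic grid also covers a wide range of alpha values with far fewer model fits than stepping through consecutive integers.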


Analysis result of the best ridge model (with the best alpha), compared with the results of the plain linear regression model.

1. train score: **0.6450076210300433 (Lower than linear regression)**
2. test score: **0.6343267973400977**
3. rmse: **69657.91514292923**
4. coefficients: **[-31171.69526231 -30963.00835497  12805.51499391 -37932.0444856
  41691.79078894  74203.44217969   8305.31793683  10349.29139127
 -36999.98793457   1119.81026561   9150.08888287  16380.79739482]**
5. bias: **210831.35699161858**

In [16]:
model_ridge = Ridge(alpha = best_alpha)
model_ridge.fit(train_set_features, train_set_target)

print("Train_score : ", model_ridge.score(train_set_features,train_set_target))

ridge_predicted = model_ridge.predict(test_set_features)
print("Test_score : ", r2_score(test_set_target,ridge_predicted))

delta = test_set_target - ridge_predicted
print("Errors in housing value prediction : ", np.mean(np.abs(delta)/test_set_target))

final_ridge_mse = mean_squared_error(test_set_target, ridge_predicted)
final_ridge_rmse = np.sqrt(final_ridge_mse)
print("final_ridge_RMSE : ", final_ridge_rmse)

print("Learned parameters : ")
print("Coefficients: ", model_ridge.coef_)
print("bias: ", model_ridge.intercept_)

Train_score :  0.6450076210300433
Test_score :  0.6343267973400977
Errors in housing value prediction :  0.2847023699431231
final_ridge_RMSE :  69657.91514292923
Learned parameters : 
Coefficients:  [-31171.69526231 -30963.00835497  12805.51499391 -37932.0444856
  41691.79078894  74203.44217969   8305.31793683  10349.29139127
 -36999.98793457   1119.81026561   9150.08888287  16380.79739482]
bias:  210831.35699161858


# Task4: Lasso Regression model

Find the best alpha for the lasso model (hint: use a for loop).

best score? **0.6332078888251764**

best alpha? **969**

In [18]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings(action = 'ignore')

def lasso(alpha):
  model = Lasso(alpha = alpha)
  model.fit(train_set_features, train_set_target)

  lasso_predicted = model.predict(test_set_features)

  test_score = r2_score(test_set_target, lasso_predicted)
  return test_score

In [19]:
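max_alpha = 100000
best = 0
best_alpha = None
stop_or_not = 0  # counts every alpha that fails to improve the best score (never reset)
for alpha in range(0, max_alpha):
  test_score = lasso(alpha)
  if best < test_score:
    # new best test R^2 so far
    best = test_score
    best_alpha = alpha
  else:
    stop_or_not += 1
  # stop once more than 100 alphas in total have failed to improve
  if stop_or_not > 100:
    break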

print("Best score: ", best)
print("Best alpha: ", best_alpha)

Best score:  0.6332078888251764
Best alpha:  969


Analysis result of the best lasso model (with the best alpha), compared with the results of the plain linear regression model.

1. train score: **0.6458098971982584**
2. test score: **0.6332078888251764**
3. rmse: **69764.4054502133**
4. coefficients: **[-29623.81469158 -29665.16625357  12223.21260863 -41493.43384512
  45021.29021038  75300.89806367   8165.56078667     -0.
 -49288.97422645      0.              0.           1440.34887846]**
5. bias: **222246.03290625935**

In [21]:
model_lasso = Lasso(alpha = best_alpha)
model_lasso.fit(train_set_features,train_set_target)

print("Train_score : ", model_lasso.score(train_set_features,train_set_target))

lasso_predicted = model_lasso.predict(test_set_features)
print("Test_score : ", r2_score(test_set_target,lasso_predicted))

delta = test_set_target - lasso_predicted
print("Errors in housing value prediction : ", np.mean(np.abs(delta)/test_set_target))

final_lasso_mse = mean_squared_error(test_set_target, lasso_predicted)
final_lasso_rmse = np.sqrt(final_lasso_mse)
print("RMSE is : ", final_lasso_rmse)

print("Learned parameters : ")
print("Coefficients: ", model_lasso.coef_)
print("bias: ", model_lasso.intercept_)

Train_score :  0.6458098971982584
Test_score :  0.6332078888251764
Errors in housing value prediction :  0.28276775219839223
RMSE is :  69764.4054502133
Learned parameters : 
Coefficients:  [-29623.81469158 -29665.16625357  12223.21260863 -41493.43384512
  45021.29021038  75300.89806367   8165.56078667     -0.
 -49288.97422645      0.              0.           1440.34887846]
bias:  222246.03290625935
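Several lasso coefficients above are exactly zero (including a `-0.`), while every ridge coefficient stays nonzero. One way to see why is the soft-thresholding operator, sketched below. It gives the lasso solution only under the idealized assumption of orthonormal features, and up to scaling conventions for alpha, so treat it as an intuition aid rather than what sklearn's coordinate-descent solver computes on the housing data. The coefficient values are made up.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha; anything within [-alpha, alpha] becomes exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# Hypothetical unregularized (least-squares) coefficients.
ols_coefs = np.array([-5.0, -0.5, 0.2, 3.0])
alpha = 1.0

# L1 penalty (lasso-like): subtract alpha from the magnitude, clip at zero.
lasso_like = soft_threshold(ols_coefs, alpha)

# L2 penalty (ridge-like, orthonormal case): rescale, never exactly zero.
ridge_like = ols_coefs / (1.0 + alpha)

print(lasso_like)
print(ridge_like)
```

The small coefficients are removed entirely by the L1 penalty, which is why lasso can act as a feature selector, whereas the L2 penalty only scales them down.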


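Compare the test scores of ridge and lasso at alpha = 0 with the linear regression test score. Which is best, or are they the same?

**They are the same: with alpha = 0 the penalty term vanishes, so both models reduce to ordinary linear regression.**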

ridge test score: **0.6279901608184868**

lasso test score: **0.6279901608184868**

In [22]:
print(ridge(0))
print(lasso(0))

 0.6279901608184868
 0.6279901608184868


**For alpha = 10000**

**Both models underfit: the test scores drop well below the alpha = 0 score.**

In [23]:
print(ridge(10000))
print(lasso(10000))

0.5119258910819141
0.5582812639292549
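Both observations (alpha = 0 matches linear regression, and a very large alpha underfits) can be checked on the closed-form ridge solution. The sketch below uses synthetic data and a hypothetical helper `ridge_coefs`. It assumes centered features and targets so that the intercept can be left out.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical, not the housing set).
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# Center so that no intercept term is needed.
Xc, yc = X - X.mean(axis=0), y - y.mean()

def ridge_coefs(alpha):
    """Closed-form ridge: (X^T X + alpha * I)^-1 X^T y."""
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)

# Ordinary least squares for comparison.
ols_coefs, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

w_zero = ridge_coefs(0.0)    # no penalty
w_large = ridge_coefs(1e6)   # very strong penalty

print(np.allclose(w_zero, ols_coefs))
print(np.linalg.norm(w_zero), np.linalg.norm(w_large))
```

With a very large alpha the coefficients shrink toward zero, so predictions collapse toward the mean of the target, which is the underfitting seen in the scores above.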
