# Linear Regression with scikit-learn – Predicting Housing Prices

### Project Description:
This project applies linear regression to predict the prices of California homes using features such as year built and square footage. Both ordinary least squares (`LinearRegression`) and L2-regularized linear regression (`Ridge`) were implemented using scikit-learn to compare model performance.
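
As a brief reference for how the two models differ, the objectives can be written as follows. The notation is an assumption introduced here and does not appear in the notebook: $X$ is the feature matrix, $y$ the vector of sale prices, $w$ the coefficient vector, and $\alpha \ge 0$ the regularization strength.

$$
\hat{w}_{\text{OLS}} = \arg\min_{w} \lVert y - Xw \rVert_2^2,
\qquad
\hat{w}_{\text{Ridge}} = \arg\min_{w} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

Setting $\alpha = 0$ recovers ordinary least squares, and larger values of $\alpha$ shrink the coefficients toward zero. In scikit-learn the intercept is fitted separately and is not penalized.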

### Objectives:
* Predict housing prices from multiple features using linear regression.
* Compare performance of unregularized and L2-regularized models (Ridge).
* Evaluate model accuracy using mean squared error (MSE) and R² score.
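
To make the two evaluation metrics concrete, here is a minimal sketch that computes MSE and R² directly with NumPy. The helper names (`mse_manual`, `r2_manual`) and the four toy prices are made up for illustration and are not taken from the dataset; the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np

def mse_manual(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical prices: every prediction is off by exactly 10,000
y_true_toy = [200000.0, 150000.0, 250000.0, 300000.0]
y_pred_toy = [210000.0, 140000.0, 240000.0, 310000.0]

mse_value = mse_manual(y_true_toy, y_pred_toy)
rmse_value = mse_value ** 0.5          # same units as the target
r2_value = r2_manual(y_true_toy, y_pred_toy)

print(f"MSE:  {mse_value:.1f}")        # MSE:  100000000.0
print(f"RMSE: {rmse_value:.1f}")       # RMSE: 10000.0
print(f"R2:   {r2_value:.4f}")         # R2:   0.9680
```

Because MSE is expressed in squared price units, its square root (RMSE) is often easier to interpret: it is on the same scale as the sale price.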

### Public dataset source:
[Kaggle California Housing Prices](https://www.kaggle.com/datasets/camnugent/california-housing-prices)
The data contains information from the 1990 California census.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score


In [2]:
# Establish file path and import data
path = 'CA_housing.csv'
df = pd.read_csv(path)
df.head()

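| Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |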


In [3]:
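# Select the numeric feature columns and cast them to float
cols = ['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'TotalBsmtSF',
        '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'PoolArea']
df_float = df[cols].astype(float)

# Build the feature matrix and target vector
X = df_float.to_numpy()
y = df['SalePrice'].to_numpy()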

In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [5]:
# Create and fit the ordinary least squares model
lr = LinearRegression()
lr.fit(X_train,y_train)

# Make predictions
y_pred = lr.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print("MSE is:", mse)
r2 = r2_score(y_test, y_pred)
print("R2 is:", r2)

MSE is: 1439025268.6763942
R2 is: 0.8123906037623247


In [13]:
# Create and fit the Ridge model (L2 penalty strength alpha=10)
model = Ridge(alpha=10)
model.fit(X_train,y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
print("MSE is:", mse)
r2 = r2_score(y_test, y_pred)
print("R2 is:", r2)

MSE is: 1439625183.8600435
R2 is: 0.812312391289029


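Ridge regression with `alpha=10` produces almost the same test metrics as ordinary least squares: an R² of 0.8123 versus 0.8124, and an MSE of about 1.44 × 10⁹ in both cases, which corresponds to a root mean squared error of roughly 38,000 dollars. With only 11 features, the unregularized model has little room to overfit, so shrinking the coefficients changes its predictions very little. Two caveats apply. Only test-set metrics are reported, so the absence of overfitting is suggested rather than demonstrated. The features were also not standardized, so the L2 penalty acts unevenly on features measured on different scales.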
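
One way to probe this conclusion further is to compare training and test R² for the unregularized model, then refit Ridge on standardized features over a range of `alpha` values. The sketch below does that on synthetic data from `make_regression`, which is an assumption standing in for the housing features so that the example runs on its own. The sample size, noise level, and `alpha` grid are illustrative choices, and the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the housing dataset from the notebook)
X_syn, y_syn = make_regression(n_samples=1000, n_features=11, noise=20.0,
                               random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, train_size=0.8, random_state=42)

# 1) Overfitting check: a large gap between train and test R2 would be a warning sign
ols = LinearRegression().fit(Xs_train, ys_train)
train_r2 = ols.score(Xs_train, ys_train)
test_r2 = ols.score(Xs_test, ys_test)
print(f"OLS: train R2 = {train_r2:.4f}, test R2 = {test_r2:.4f}")

# 2) Ridge on standardized features, so the penalty treats all features alike
ridge_scores = {}
for alpha in [0.1, 1, 10, 100, 1000]:
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    ridge.fit(Xs_train, ys_train)
    ridge_scores[alpha] = ridge.score(Xs_test, ys_test)
    print(f"Ridge(alpha={alpha}): test R2 = {ridge_scores[alpha]:.4f}")
```

To run the same check on the notebook's data, replace the synthetic arrays with `X_train`, `X_test`, `y_train`, and `y_test` from the split above. If the training and test scores stay close and the test score only drops as `alpha` grows, that would support the reading that regularization has little to offer for this feature set.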
