# Linear Regression Tutorial
This notebook trains and evaluates a linear regression model on the California Housing Prices dataset, downloaded from Kaggle via `kagglehub`. It covers the end-to-end workflow: data acquisition, preprocessing, training, evaluation, and coefficient inspection.

## 1. Load Libraries
We import numerical and data-handling utilities, the linear regression model, and regression metrics. The Kaggle downloader and the train/test splitter are imported in the next section, where they are first used.

In [1]:
# Import essential libraries
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## 2. Load Dataset
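We download the California Housing Prices dataset from Kaggle with `kagglehub`, load it into a pandas DataFrame, drop the rows with a missing `total_bedrooms` value, and split the data into training (80%) and test (20%) sets. The feature matrix still contains the text column `ocean_proximity` at this point; it is encoded in the next section.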

In [2]:
# Download Kaggle dataset and prepare train/test splits
import kagglehub
from sklearn.model_selection import train_test_split

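# Download the dataset and locate the first CSV file in the download directory
path = kagglehub.dataset_download("camnugent/california-housing-prices")
csv_path = next(Path(path).glob("**/*.csv"))
housing = pd.read_csv(csv_path)

# Drop rows with a missing total_bedrooms value
housing = housing.dropna(subset=["total_bedrooms"])

# Separate the features from the target column
feature_cols = housing.columns.drop("median_house_value")
X = housing[feature_cols].values  # object array: ocean_proximity is a text column
y = housing["median_house_value"].values

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape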

((16346, 9), (4087, 9))
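
The two shapes follow from the row count and `test_size=0.2`. The sketch below reproduces that arithmetic in plain Python. It assumes that the row count after `dropna` equals the sum of the two shapes above, and that a fractional `test_size` is rounded up when `train_size` is not given, which is how scikit-learn resolves it.

```python
import math

# Rows left after dropna, taken from the two shapes printed above
n_rows = 16346 + 4087
test_size = 0.2

# The fractional test share is rounded up; the remainder goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_rows, n_train, n_test)
```

Each split keeps all 9 feature columns, so only the row counts differ.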

## 3. Train Model
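We one-hot encode the categorical `ocean_proximity` column, append the encoded columns to the numeric features, and fit the linear regression model on the processed training data.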

In [4]:
# One-hot encode the categorical column, then fit the linear regression model
from sklearn.preprocessing import OneHotEncoder

lr = LinearRegression()

# The categorical column 'ocean_proximity' is expected to be the last feature column
assert feature_cols[-1] == "ocean_proximity"
cat_idx = X_train.shape[1] - 1

# Split X_train and X_test into numerical and categorical parts
X_train_num = X_train[:, :cat_idx].astype(float)
X_train_cat = X_train[:, [cat_idx]]

X_test_num = X_test[:, :cat_idx].astype(float)
X_test_cat = X_test[:, [cat_idx]]

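# One-hot encode the categorical feature; the encoder is fitted on the training split only
# (the sparse_output argument requires scikit-learn >= 1.2)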
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_train_cat_encoded = onehot_encoder.fit_transform(X_train_cat)
X_test_cat_encoded = onehot_encoder.transform(X_test_cat)

# Concatenate numerical and one-hot encoded categorical features
X_train_processed = np.concatenate((X_train_num, X_train_cat_encoded), axis=1)
X_test_processed = np.concatenate((X_test_num, X_test_cat_encoded), axis=1)

# Fit the model with the processed data
lr.fit(X_train_processed, y_train)
lr
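
The manual split, encode, and concatenate steps above can also be expressed with scikit-learn's `ColumnTransformer` and `Pipeline`. The following is a minimal sketch of that alternative, not the approach this notebook uses. The `toy` DataFrame and its values are made up for illustration, and the step names `preprocess` and `regressor` are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the housing DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 2.1],
    "households": [126, 1138, 177, 219, 259, 193],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN", "INLAND"],
    "median_house_value": [452600, 358500, 352100, 341300, 342200, 269700],
})
X_toy = toy.drop(columns="median_house_value")
y_toy = toy["median_house_value"]

# Encode the text column and pass the numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(X_toy, y_toy)

# Three one-hot columns plus two numeric columns give five coefficients
n_coefficients = model.named_steps["regressor"].coef_.shape[0]
print(n_coefficients)
```

With this layout the pipeline works on the DataFrame directly, so no positional column index is needed. Note that the one-hot columns come first in the transformed matrix, so the coefficient order differs from the manual approach above.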

## 4. Evaluate Performance
We predict on the test split and compute the mean squared error (MSE) and the R-squared score.

In [6]:
# Evaluate predictions using mean squared error and R-squared
y_pred = lr.predict(X_test_processed)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mse, r2

(4802173538.604141, 0.6488402154432007)
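
The MSE is expressed in squared dollars, which is hard to read. As a small worked example, the sketch below takes the square root of the reported MSE to get the root mean squared error (RMSE) in the units of `median_house_value`, and shows how R-squared is defined using made-up numbers. The arrays `y_true` and `y_hat` are illustrative and are not taken from the dataset.

```python
import numpy as np

# MSE reported above; the RMSE puts the error back in the target's units
mse = 4802173538.604141
rmse = np.sqrt(mse)
print(rmse)

# R-squared = 1 - SS_res / SS_tot, shown on made-up numbers
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual)
```

The RMSE comes out at roughly 69,300, so a typical prediction error is on the order of 69,000 dollars. The R-squared of about 0.65 means the model explains roughly 65% of the variance in the test targets.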

## 5. Inspect Coefficients
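We review the feature weights to understand the learned relationships. The coefficients are keyed by position in `X_train_processed`: positions 1 to 8 are the numeric features in their original column order, and positions 9 to 13 are the one-hot columns for `ocean_proximity`. Because the features are not scaled, coefficient magnitudes are not directly comparable across features.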

In [7]:
# Display learned coefficients and intercept
coefficients = dict(enumerate(lr.coef_, start=1))
intercept = lr.intercept_
coefficients, intercept

({1: np.float64(-27108.74632117819),
  2: np.float64(-25657.807542675506),
  3: np.float64(1081.364206273332),
  4: np.float64(-6.322145519679907),
  5: np.float64(103.00404177118799),
  6: np.float64(-36.409751368799334),
  7: np.float64(43.14272491812608),
  8: np.float64(39277.08301971278),
  9: np.float64(-34269.44338726975),
  10: np.float64(-73509.66116574455),
  11: np.float64(179383.9310755514),
  12: np.float64(-40501.86026350638),
  13: np.float64(-31102.966259030472)},
 np.float64(-2265004.318716105))