# Linear Regression on Boston Housing Dataset using PyTorch

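This notebook walks through linear regression in PyTorch on the Boston Housing dataset, step by step: loading and cleaning the data, scaling the features, training a single-layer model, and evaluating it on a held-out test set.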

### Import Libraries and Load Dataset

In [102]:
import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import torch.nn as nn
import torch.optim as optim

# Load dataset
df = pd.read_csv("BostonHousing.csv")

# Check for missing values and drop if any
if df.isna().sum().sum() > 0:
    df = df.dropna()

df.head()

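|   | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |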


### Description:
- Import necessary libraries.
- Load Boston Housing dataset into a pandas DataFrame.
- Check and remove missing values if present.
- Display the first few rows of the dataset.
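
Before dropping rows, it can help to see which columns actually contain missing values and how many rows `dropna()` would remove. The sketch below uses a small made-up DataFrame; the values are illustrative assumptions, not rows from `BostonHousing.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "rm":   [6.5, np.nan, 7.1, 6.0],
    "age":  [65.2, 78.9, 61.1, np.nan],
    "medv": [24.0, 21.6, 34.7, 33.4],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that contains at least one NaN
cleaned = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Running the same two checks on `df` shows how much data the cleaning step discards.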

### Separate Features and Target Variable

In [103]:
# Separate features and targets
X = df.drop(columns=['medv']).values
y = df['medv'].values.reshape(-1, 1)


### Description:
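- Split the dataset into the feature matrix `X` and the target vector `y`.
- `medv` (median house value) is the target variable.
- `y` is reshaped to a column of shape `(n, 1)` so that it matches the shape of the model's output.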
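
The `reshape(-1, 1)` on the target matters because a linear layer with one output returns predictions of shape `(n, 1)`. The following is a minimal sketch, with made-up numbers, of what goes wrong if the target stays one-dimensional: subtracting a `(n,)` tensor from a `(n, 1)` tensor broadcasts to an `(n, n)` matrix, so the squared error is averaged over the wrong pairs.

```python
import torch

pred = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like a model output
target_flat = torch.tensor([1.0, 2.0, 3.0])  # shape (3,), not reshaped
target_col = target_flat.reshape(-1, 1)      # shape (3, 1)

# Broadcasting (3, 1) against (3,) yields a (3, 3) difference matrix
diff_wrong = pred - target_flat
diff_right = pred - target_col

print(diff_wrong.shape)
print(diff_right.shape)

# The predictions are perfect, yet the broadcast version reports a nonzero error
mse_wrong = (diff_wrong ** 2).mean().item()
mse_right = (diff_right ** 2).mean().item()
print(mse_wrong, mse_right)
```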

### Split Data into Training and Testing Sets


In [104]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

### Description:
- Split data into training (80%) and testing (20%) subsets.
- The split is randomized but reproducible via `random_state=42`.
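
As a quick check of how the 80/20 split rounds and of what `random_state` guarantees, the sketch below splits 501 placeholder rows. That count is an assumption taken from the train and test sample sizes this notebook prints (400 + 101).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder row indices; 501 is assumed from the sample counts in this notebook
rows = np.arange(501)

train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

# The test size is rounded up, so 20% of 501 becomes 101 rows
print(len(train_a), len(test_a))

# The same seed produces the same split on every call
print(np.array_equal(test_a, test_b))
```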


### Scale Features

In [105]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


### Description:
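- Scale input features to zero mean and unit variance using `StandardScaler`.
- Fit the scaler on the training set only, then apply the same transform to the test set, so that no test-set statistics leak into training.
- Putting all features on a comparable scale helps gradient-based training converge faster.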
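
`StandardScaler` standardizes each feature as `z = (x - mean) / std`, using the population standard deviation of the data it was fitted on. The sketch below reproduces that arithmetic with plain NumPy on made-up matrices, to show why the training split ends up with exactly zero mean and unit variance while the test split generally does not.

```python
import numpy as np

# Made-up feature matrices: 4 training rows, 2 test rows, 2 features
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_te = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics come from the training split only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population standard deviation (ddof=0)

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma

print(X_tr_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_tr_scaled.std(axis=0))   # approximately 1 for each feature
print(X_te_scaled.mean(axis=0))  # not 0: the test set reuses the training statistics
```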


### Convert Data to Tensors

In [106]:
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
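y_test_tensor = torch.tensor(y_test, dtype=torch.float32)


### Description:
- Convert the scaled features and the targets from NumPy arrays to `float32` tensors, the default dtype of PyTorch layer parameters.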
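
The explicit `dtype=torch.float32` is worth a note: NumPy arrays default to `float64`, and `torch.tensor` preserves that dtype unless told otherwise, while `nn.Linear` creates `float32` weights by default. A minimal sketch with a made-up array, assuming PyTorch's default dtype has not been changed:

```python
import numpy as np
import torch

arr = np.array([[1.0, 2.0], [3.0, 4.0]])  # NumPy defaults to float64

as_default = torch.tensor(arr)                       # keeps float64
as_float32 = torch.tensor(arr, dtype=torch.float32)  # matches the layer's weights

layer = torch.nn.Linear(2, 1)
print(as_default.dtype, as_float32.dtype, layer.weight.dtype)

# The float32 input passes through the layer without a dtype mismatch
out = layer(as_float32)
print(out.shape)
```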


### Define Custom Dataset Class

In [107]:
class BostonHousingDataset(Dataset):
    def __init__(self, features, targets):
        self.X = features
        self.y = targets
        
    def __len__(self):
        return self.X.shape[0]
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


### Description:
- Create a PyTorch `Dataset` class to manage feature-target pairs.
- Provides methods for data length and fetching samples.
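
For data that already lives in tensors, PyTorch's built-in `TensorDataset` offers the same length and indexing behavior as the custom class. This is an alternative to the notebook's approach, not a replacement for it; the sketch uses made-up random tensors.

```python
import torch
from torch.utils.data import TensorDataset

# Made-up tensors: 5 samples with 3 features each
features = torch.randn(5, 3)
targets = torch.randn(5, 1)

dataset = TensorDataset(features, targets)

# len() returns the number of samples; indexing returns a (features, target) pair
print(len(dataset))
x0, y0 = dataset[0]
print(x0.shape, y0.shape)
```

A custom `Dataset` becomes the better choice once samples need per-item work such as lazy loading or augmentation.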


### Create Dataset and DataLoader Instances

In [108]:
train_dataset = BostonHousingDataset(X_train_tensor, y_train_tensor)
test_dataset = BostonHousingDataset(X_test_tensor, y_test_tensor)

BATCH_SIZE = 16

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f"Train samples: {len(train_dataset)}, Test samples: {len(test_dataset)}")
print(f"Train batches: {len(train_loader)}, Test batches: {len(test_loader)}")


Train samples: 400, Test samples: 101
Train batches: 25, Test batches: 7


### Description:
- Wrap both datasets in `DataLoader`s with a batch size of 16; only the training loader shuffles its samples.
- Print dataset and batch sizes.
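
The batch counts follow directly from the sample counts: with the default `drop_last=False`, a `DataLoader` yields `ceil(n / batch_size)` batches, and the last one may be smaller. The arithmetic for the numbers printed above:

```python
import math

batch_size = 16
n_train, n_test = 400, 101  # sample counts printed above

train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)

# Size of the final, partial test batch
last_test_batch = n_test - (test_batches - 1) * batch_size

print(train_batches, test_batches, last_test_batch)  # 25 7 5
```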


### Define Linear Regression Model


In [109]:
class LinearRegressionModel(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        # One linear layer: maps in_features inputs to a single predicted value
        self.linear = nn.Linear(in_features, 1)

    def forward(self, x):
        return self.linear(x)

model = LinearRegressionModel(X_train_tensor.shape[1])
print(model)


LinearRegressionModel(
  (linear): Linear(in_features=13, out_features=1, bias=True)
)


### Description:
- Define a simple linear regression model using a single linear layer.
- The number of inputs equals the number of features (13); the output is a single scalar.
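
A layer like `Linear(in_features=13, out_features=1)` holds one weight per feature plus one bias, and its forward pass computes `x @ W.T + b`. The sketch below checks both facts on a standalone layer with made-up random input.

```python
import torch
import torch.nn as nn

layer = nn.Linear(13, 1)

print(layer.weight.shape)  # torch.Size([1, 13])
print(layer.bias.shape)    # torch.Size([1])

# 13 weights + 1 bias
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)

# The forward pass is an affine map: x @ W.T + b
x = torch.randn(4, 13)
manual = x @ layer.weight.T + layer.bias
matches = torch.allclose(layer(x), manual, atol=1e-6)
print(matches)
```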


### Define Loss Function and Optimizer

In [110]:
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)


### Description:
- Use mean squared error (MSE) loss, the standard choice for regression.
- Use the Adam optimizer with a learning rate of 0.01.
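
With its default `reduction='mean'`, `nn.MSELoss` returns the average of the squared differences between predictions and targets. A short sketch with made-up values, comparing the built-in loss against the same quantity computed by hand:

```python
import torch
import torch.nn as nn

pred = torch.tensor([[2.0], [4.0], [6.0]])
target = torch.tensor([[1.0], [4.0], [8.0]])

builtin = nn.MSELoss()(pred, target)

# Squared errors are 1, 0, and 4, so the mean is 5/3
manual = ((pred - target) ** 2).mean()

print(builtin.item(), manual.item())
```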


### Training Loop

In [111]:
EPOCHS = 100

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0.0

    for features, targets in train_loader:
        # Forward pass and loss for this mini-batch
        predictions = model(features)
        loss = loss_fn(predictions, targets)

        # Clear old gradients, backpropagate, and update the parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Mean of the per-batch losses for this epoch
    avg_loss = total_loss / len(train_loader)
    if (epoch + 1) % 10 == 0 or epoch == 0:
        print(f"Epoch [{epoch+1}/{EPOCHS}], Loss: {avg_loss:.4f}")

print("Training completed!")


Epoch [1/100], Loss: 588.0019
Epoch [10/100], Loss: 458.8901
Epoch [20/100], Loss: 359.4353
Epoch [30/100], Loss: 280.8370
Epoch [40/100], Loss: 217.9511
Epoch [50/100], Loss: 167.6094
Epoch [60/100], Loss: 128.0451
Epoch [70/100], Loss: 96.8705
Epoch [80/100], Loss: 73.3380
Epoch [90/100], Loss: 56.0348
Epoch [100/100], Loss: 43.6767
Training completed!


### Description:
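- Train the model for 100 epochs.
- Print the average per-batch training loss after the first epoch and every 10 epochs thereafter.
- The loss is still falling steadily at epoch 100, so the model has not yet converged.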
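
Because the loss in the log above has not plateaued, a useful reference point is the closed-form least-squares fit, which a linear model trained to convergence on MSE would approach. The sketch below is not part of the notebook's method: it uses `np.linalg.lstsq` on synthetic data, and the true weights, intercept, noise level, and sizes are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with known parameters (illustrative assumptions)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 22.0
y = X @ true_w + true_b + rng.normal(scale=0.1, size=n)

# Append a column of ones so the intercept is estimated along with the weights
X_aug = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

w_hat, b_hat = coef[:-1], coef[-1]

# Training MSE at the least-squares optimum
mse_opt = np.mean((X_aug @ coef - y) ** 2)

print(w_hat, b_hat, mse_opt)
```

Applying the same fit to `X_train_scaled` and `y_train` would give the training MSE that gradient descent is heading toward, which makes it easier to tell whether more epochs are worthwhile.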


### Model Evaluation

In [112]:
model.eval()
y_true, y_pred = [], []

with torch.no_grad():
    for features, targets in test_loader:
        predictions = model(features)
        y_true.extend(targets.cpu().numpy())
        y_pred.extend(predictions.cpu().numpy())

y_true = np.array(y_true).flatten()
y_pred = np.array(y_pred).flatten()

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"Test Mean Squared Error: {mse:.4f}")
print(f"Test R² Score: {r2:.4f}")


Test Mean Squared Error: 39.9232
Test R² Score: 0.4597


### Description:
- Evaluate model predictions on the test set.
- Calculate and print Mean Squared Error and R² Score.
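
Both metrics are simple to compute by hand, which helps when interpreting them. R² is `1 - SS_res / SS_tot`, the fraction of target variance the model explains, and the square root of the MSE (RMSE) is in the same units as the target. For the test MSE reported above, the RMSE is √39.9232 ≈ 6.32, which in the standard Boston Housing data corresponds to thousands of dollars. A sketch with made-up values:

```python
import numpy as np

# Made-up values for illustration
y_true_demo = np.array([20.0, 25.0, 30.0, 35.0])
y_pred_demo = np.array([22.0, 24.0, 29.0, 31.0])

# Residual and total sums of squares
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

mse_demo = ss_res / len(y_true_demo)
rmse_demo = np.sqrt(mse_demo)
r2_demo = 1.0 - ss_res / ss_tot

print(mse_demo, rmse_demo, r2_demo)
```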


### Conclusion

This notebook shows how linear regression can be implemented in PyTorch to predict median house values from 13 housing features. The model is a single linear layer trained on standardized inputs with mean squared error loss. Training loss falls steadily from 588.0 to 43.7 over 100 epochs but is still decreasing at the end, so the model has not fully converged. On the test set it reaches an MSE of 39.92 and an R² of 0.46, meaning it explains a little under half of the variance in `medv`.

These results show that a plain linear model, trained this way, captures some but far from all of the signal in the data. Training for more epochs or with a larger learning rate would likely close part of the gap, since the loss had not plateaued. The approach remains a practical, interpretable baseline for more complex models, and the preprocessing steps (dropping rows with missing values and fitting the scaler on the training split only) keep training stable and the evaluation free of leakage.

Overall, the project covers an end-to-end PyTorch regression workflow, from a raw CSV file to test-set metrics, that can be adapted to similar predictive problems.