**Programmer:** python_scripts (Abhijith Warrier)

**PYTHON SCRIPT TO _PERFORM LINEAR REGRESSION ON REAL-WORLD DATA (California Housing, with a Diabetes fallback)._** 📊🐍⚡️

This script demonstrates how to apply **Linear Regression** to a **real dataset**, walking through the workflow of loading the data, training the model, evaluating it, and visualizing its performance.
It's a good starting point for understanding how regression models make continuous predictions.

### 📦 Import Required Libraries

We'll use LinearRegression from scikit-learn, along with pandas and matplotlib for data handling and visualization.

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing, load_diabetes

In [2]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

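### 🧩 Load and Prepare the Dataset

For robustness, we first try the California Housing dataset, which is downloaded on first use. If the download fails, we fall back to the Diabetes dataset that ships with scikit-learn.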

In [3]:
# Try California Housing (download). If SSL/network fails, use local Diabetes dataset.
USE_DIABETES = False
try:
    data = fetch_california_housing(as_frame=True)  # may trigger SSL/download
    df = data.frame
    target_col = "MedHouseVal"
    print("✅ Using California Housing dataset.")
except Exception as e:
    print("⚠️ Could not fetch California Housing (likely SSL/network). Falling back to Diabetes.")
    data = load_diabetes(as_frame=True)
    df = pd.concat([data.data, data.target.rename("target")], axis=1)
    target_col = "target"
    USE_DIABETES = True

# Quick peek
df.head()

⚠️ Could not fetch California Housing (likely SSL/network). Falling back to Diabetes.


| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


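### ⚙️ Define Features and Target Variable

With California Housing, the goal is to predict median house value (MedHouseVal) from features like income, house age, and population. With the Diabetes fallback used in this run, the target is a disease progression score predicted from ten baseline features (age, sex, bmi, bp, and six serum measurements, s1 to s6).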

In [4]:
# Define features and target
X = df.drop(columns=[target_col])
y = df[target_col]

### 🧪 Split Data into Training and Testing Sets

We'll reserve 20% of the data for testing to evaluate model performance.

In [5]:
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
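
As a quick sanity check on the split, the sketch below applies the same call to the Diabetes dataset bundled with scikit-learn (the fallback used in this run; the numbers differ if California Housing loads instead). It confirms the roughly 80/20 row counts and that a fixed random_state makes the split reproducible. The names X_tr, X_te, and same_split are illustrative, not part of the notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Diabetes ships with scikit-learn, so no download is needed.
data = load_diabetes(as_frame=True)
X, y = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X), len(X_tr), len(X_te))  # total, train, and test row counts

# Same random_state -> the same rows end up in the test set.
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = X_te.index.equals(X_te2.index)
print(same_split)
```

Changing random_state (or omitting it) produces a different partition, so metrics can shift slightly from run to run.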

### 🤖 Train the Linear Regression Model

Fit the model using scikit-learn's LinearRegression.

In [6]:
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |
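
To see what fitting actually produces, this minimal sketch fits LinearRegression on synthetic, noise-free data with known coefficients (an assumption made purely for illustration, not the notebook's dataset) and reads back the learned coef_ and intercept_ attributes. The names X_demo, y_demo, and demo_model are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x1 - 2*x2 + 5, with no noise (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# coef_ holds one weight per feature; intercept_ is the constant term.
print(np.round(demo_model.coef_, 3), round(float(demo_model.intercept_), 3))
```

On the notebook's own model, `pd.Series(model.coef_, index=X_train.columns)` pairs each learned weight with its feature name.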


### 📈 Evaluate Model Performance

We'll check the Mean Squared Error (MSE) and the R² score to see how well the model performs.

In [7]:
# Predict on test data
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

Mean Squared Error: 2900.19
R² Score: 0.45
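
To make the two metrics concrete, here is a small sketch that computes them by hand on made-up numbers (the values are illustrative, not taken from the notebook). MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values.

```python
import math

# Made-up actual and predicted values (illustrative only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 7.5, 9.0]

n = len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))  # residual sum of squares
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

mse_manual = ss_res / n
rmse_manual = math.sqrt(mse_manual)  # back in the target's own units
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
```

Because MSE is expressed in squared target units, its square root is often easier to read: for the run above, the square root of 2900.19 is roughly 53.85 in the target's units.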


### 🎨 Visualize Predicted vs. Actual Values

We plot predicted values against actual values in a scatter plot; the dashed diagonal marks where a perfect prediction would fall.

In [8]:
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.55)
# Diagonal using actual range (works for both datasets)
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], linestyle="--")
plt.xlabel("Actual")
plt.ylabel("Predicted")
title_ds = "California Housing" if not USE_DIABETES else "Diabetes (fallback)"
plt.title(f"Actual vs Predicted: Linear Regression ({title_ds})")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

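Note: the captured output for this cell was "NameError: name 'plt' is not defined", which occurs when the cell is run in a fresh kernel before the import cell. Run the cells in order so that plt, y_test, y_pred, and USE_DIABETES are defined.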

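### 🧠 Interpretation
- **Closer to the diagonal line** → better predictions.
- **Low R² / high MSE** → the model explains little of the target's variance (here R² = 0.45, so under half), which points to underfitting or noisy data; consider feature engineering or a more flexible model. Regularization mainly helps when a model overfits.
- Compare metrics across datasets to understand difficulty and signal/noise.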
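
One way to act on these points is to compare candidate models under cross-validation rather than on a single split. The sketch below is an illustration under stated assumptions: it uses the bundled Diabetes dataset, 5-fold cross-validation, and Ridge with alpha=0.1, none of which come from the original notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.1),  # alpha is an arbitrary illustrative choice
}

cv_r2 = {}
for name, est in candidates.items():
    # R^2 on each of 5 held-out folds, then averaged
    scores = cross_val_score(est, X, y, cv=5, scoring="r2")
    cv_r2[name] = float(scores.mean())
    print(f"{name}: mean R^2 = {cv_r2[name]:.3f}")
```

If the regularized model scores about the same as plain least squares, overfitting is probably not the bottleneck, and richer features are the more promising direction.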