# Step-by-Step Tutorial: Linear Regression & Logistic Regression Model Evaluation

In [None]:
!pip install numpy pandas matplotlib seaborn scikit-learn

# 🏡 Predicting Housing Prices in California with Linear Regression

**Problem Statement**  
We aim to predict the median house value in California using Linear Regression based on various features such as:

- Income  
- Age of houses  
- Number of rooms  
- Geographic location  

This is a regression problem because the target variable (`median_house_value`) is continuous, not categorical.

We'll use a sample dataset (based on the one from GeeksforGeeks) that resembles the California Housing Prices Dataset originally from StatLib (also hosted on Kaggle and Scikit-learn).

[Download the California Housing Dataset (CSV)](https://media.geeksforgeeks.org/wp-content/uploads/20240522145850/housing[1].csv)


## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score


## 2. Load Dataset

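We load the California Housing dataset from a CSV file with `pd.read_csv`, which returns a pandas DataFrame for easier manipulation and analysis. (scikit-learn also ships a version of this data via `fetch_california_housing` in `sklearn.datasets`, but its column names differ from the CSV used here.)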

In [None]:
url = "https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv"
data = pd.read_csv(url)
data.head()


## 📊 Step 3: Exploratory Data Analysis (EDA)


###  Basic Information

In [None]:
data.info()
data.describe()


### Check for Missing Values


In [None]:
data.isnull().sum()


### Visualizations

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix of Features")
plt.show()


The heatmap above shows how different features in the housing dataset are related to each other. Each colored square represents the strength of the relationship (correlation) between two features. 

- **Dark red or blue colors** mean a strong relationship, while lighter colors mean a weaker relationship.
- **Positive values** (closer to 1) mean that as one feature increases, the other tends to increase too.
- **Negative values** (closer to -1) mean that as one feature increases, the other tends to decrease.

This chart helps us quickly see which features might be important for predicting house prices. For example, if "median_income" has a strong positive correlation with "median_house_value," it means higher income areas tend to have higher house prices.
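
To read the same information without scanning the whole heatmap, the target's column of the correlation matrix can be sorted directly. The snippet below is a minimal sketch on a synthetic stand-in for the housing data (the `toy` frame and its values are made up for illustration); on the real data you would call the same methods on `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are made up for illustration).
rng = np.random.default_rng(0)
income = rng.uniform(1, 10, size=200)
age = rng.uniform(1, 50, size=200)
toy = pd.DataFrame({
    "median_income": income,
    "housing_median_age": age,
    # Target driven mostly by income, plus random noise.
    "median_house_value": 40_000 * income + rng.normal(0, 20_000, size=200),
})

# Correlation of each feature with the target, strongest (by magnitude) first.
target_corr = (
    toy.corr(numeric_only=True)["median_house_value"]
    .drop("median_house_value")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top, since they are as informative for a linear model as strong positive ones.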

In [None]:
#🔹 Distribution of Median House Value
plt.figure(figsize=(8, 6))
sns.histplot(data["median_house_value"], kde=True, color='skyblue')
plt.title("Distribution of Median House Value")
plt.xlabel("House Value")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

The chart above shows the distribution of the median house values in California. It uses a histogram to display how frequently different house values occur in the dataset. The curve overlaid on the histogram represents the estimated probability density (KDE), helping to visualize the overall shape of the distribution. This plot helps us understand the range, central tendency, and skewness of housing prices in the region.
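
The range, central tendency, and skewness read off the histogram can also be checked numerically. This is a small sketch on made-up, right-skewed values (a log-normal sample standing in for house prices); with the real data, the same calls would apply to `data["median_house_value"]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed toy "prices" drawn from a log-normal distribution (illustrative only).
values = pd.Series(rng.lognormal(mean=12, sigma=0.5, size=5000))

summary = {
    "min": values.min(),
    "median": values.median(),
    "mean": values.mean(),
    "max": values.max(),
    "skew": values.skew(),  # a positive value indicates a longer right tail
}
for name, stat in summary.items():
    print(f"{name}: {stat:,.2f}")
```

A mean noticeably above the median, together with positive skew, is the numeric counterpart of a histogram with a long right tail.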

In [None]:
#🔹 Income vs. House Value
plt.figure(figsize=(8, 6))
sns.scatterplot(x='median_income', y='median_house_value', data=data, hue='ocean_proximity')
plt.title("Income vs House Value")
plt.legend(title="Proximity")
plt.grid(True)
plt.show()


## Preprocess Data

Before training our model, we need to preprocess the data to ensure it is clean and suitable for analysis. This involves:

- **Handling Missing Values:**  
    The dataset contains some missing values in the `total_bedrooms` column. We remove any rows with missing data to avoid errors during model training.

- **Dropping Categorical Columns:**  
    The `ocean_proximity` column is categorical and not directly usable by linear regression without encoding. For simplicity, we drop this column for now.

The following code performs these steps:

- `data = data.dropna()` removes rows with missing values.
- `data = data.drop("ocean_proximity", axis=1)` drops the categorical column.

This ensures our dataset contains only numerical features and no missing values, making it ready for splitting into features and target variables.

In [None]:
data = data.dropna()  # drop missing values
data = data.drop("ocean_proximity", axis=1)  # drop categorical for now
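
Dropping rows and columns is the simplest option, but it discards information. As a hedged alternative, the sketch below fills missing values with the column median and one-hot encodes the categorical column using `pd.get_dummies`. The `toy` frame is invented for illustration and is not part of the tutorial's pipeline.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same two problem columns as the housing data.
toy = pd.DataFrame({
    "total_bedrooms": [400.0, np.nan, 700.0, 300.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"],
    "median_house_value": [350_000, 120_000, 150_000, 400_000],
})

# Alternative to dropna(): fill missing bedrooms with the column median.
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(toy["total_bedrooms"].median())

# Alternative to drop(): one-hot encode the category. drop_first=True removes
# one redundant indicator column, since the model also fits an intercept.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

In a real workflow the median would normally be computed on the training split only, so that no information from the test set leaks into preprocessing.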


## Step 4: Split into Features and Target


In [None]:
X = data.drop("median_house_value", axis=1)
y = data["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


## Step 5: Train the Model


In [None]:
model = LinearRegression()
model.fit(X_train, y_train)


## Step 6: What is Linear Regression?

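Linear Regression fits a line (or, with several features, a hyperplane) by choosing the coefficients that minimize the sum of squared differences between predicted and actual values. This approach is known as ordinary least squares.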

The general form of a linear regression model is:

$$
\text{House Price} = w_0 + w_1 \cdot \text{Income} + w_2 \cdot \text{Rooms} + \ldots + \epsilon
$$

Where:
- $w_0$ is the intercept,
- $w_1, w_2, \ldots$ are the coefficients for each feature,
- $\epsilon$ is the error term.
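
To connect the formula to scikit-learn, the following sketch fits `LinearRegression` on noise-free synthetic data generated from known weights and reads them back from `intercept_` and `coef_`. The data and the name `toy_model` are assumptions made for illustration; the tutorial's own model is the `model` object trained on the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Noise-free target built from known weights: w0 = 3, w1 = 2, w2 = -1.
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

toy_model = LinearRegression().fit(X, y)
print("intercept (w0):", toy_model.intercept_)
print("coefficients (w1, w2):", toy_model.coef_)
```

On real data the error term is not zero, so the recovered weights are estimates rather than exact values.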

## Step 7: Predictions and Evaluation

In this step, we use our trained Linear Regression model to predict housing prices on the test set. We then evaluate the model's performance using metrics such as Mean Squared Error (MSE) and R² Score.

- **Predictions:**  
    The model predicts the median house values for the test data (`X_test`).

- **Evaluation Metrics:**  
    - **Mean Squared Error (MSE):** Measures the average squared difference between actual and predicted values. Lower values indicate better performance.
    - **R² Score:** Represents the proportion of variance in the target variable explained by the model. Values closer to 1 indicate a better fit.
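
Both metrics can be computed by hand, which makes their definitions concrete. The sketch below uses four made-up values (`y_true` and `y_hat` are illustrative, not tutorial outputs) and compares the manual results with scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up actual values
y_hat = np.array([2.5, 5.5, 6.0, 9.0])    # made-up predictions

# MSE: mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```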



In [None]:
# Generate predictions first; the plot in the next step needs y_pred.
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

## Step 8: Visualize Actual vs. Predicted Values

A scatter plot of actual vs. predicted values helps visualize the model's performance:

In [None]:
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, label='Predicted vs Actual')
# Points on the red diagonal are predicted exactly.
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', label='Perfect Prediction')
plt.xlabel("Actual Median House Value")
plt.ylabel("Predicted Median House Value")
plt.title("Actual vs Predicted Median House Value")
plt.legend()
plt.grid(True)
plt.show()

## Step 9: Evaluate Model Performance (Regression)

| Metric | Meaning |
|--------|---------|
| **MSE** | Measures average squared error between actual and predicted values. Lower is better. |
| **R² (coefficient of determination)** | Measures how well the regression fits the data: 1 is a perfect fit, 0 means the model does no better than always predicting the mean, and negative values mean it does worse. |
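
Two practical points help when reading these numbers. MSE is in squared units (squared dollars here), so its square root, the RMSE, is often easier to interpret. R² is anchored to a baseline that always predicts the mean. The sketch below illustrates both on made-up values; the arrays are assumptions for demonstration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100_000.0, 200_000.0, 300_000.0, 400_000.0])  # made-up prices

# Baseline: always predict the mean of the actual values.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

# A deliberately bad predictor: the actual values in reverse order.
bad = y_true[::-1].copy()
r2_bad = r2_score(y_true, bad)

# RMSE is in the same units as the target (dollars).
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline))

print(f"R² of mean baseline: {r2_baseline:.2f}")
print(f"R² of reversed predictions: {r2_bad:.2f}")
print(f"RMSE of mean baseline: ${rmse_baseline:,.2f}")
```

A model is only useful if its R² is clearly above 0 and its RMSE is clearly below that of the mean baseline.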

In [None]:
# Sample test: Predict median house value for a few custom samples
sample_data = pd.DataFrame({
    'longitude': [-122.0, -118.5],
    'latitude': [37.5, 34.0],
    'housing_median_age': [30, 15],
    'total_rooms': [2000, 3500],
    'total_bedrooms': [400, 700],
    'population': [800, 1500],
    'households': [300, 600],
    'median_income': [5.0, 7.5]
})

sample_predictions = model.predict(sample_data)
for i, pred in enumerate(sample_predictions):
    print(f"Sample {i+1} predicted median house value: ${pred:,.2f}")