# California Housing Price Prediction and Classification

This project compares the performance of **Linear Regression**, **Logistic Regression**, and **K-Nearest Neighbors (KNN)** on the California Housing Prices dataset, extended with distance features. The dataset contains 14 attributes describing housing blocks in California from the 1990 census: median house value, median income, median house age, total rooms, total bedrooms, population, households, latitude, longitude, and distances to the coast and to four major cities (Los Angeles, San Diego, San Jose, and San Francisco).

We will:
- Predict **Median House Value** as a regression problem using Linear Regression and KNN Regression.
- Define a binary label (**expensive vs cheap**) from Median House Value and solve a classification problem using Logistic Regression and KNN Classification.
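
As a minimal sketch of the second task, the binary label could be derived by thresholding the target column. The cutoff is not fixed by the text above; the median is used here purely as an assumption (it gives roughly balanced classes), and both the toy values and the `is_expensive` column name are illustrative.

```python
import pandas as pd

# Toy stand-in for the real data; only the target column is needed here.
toy = pd.DataFrame(
    {"Median_House_Value": [90000.0, 150000.0, 180000.0, 260000.0, 500001.0]}
)

# Assumption: split at the median so the two classes are roughly balanced.
threshold = toy["Median_House_Value"].median()

# 1 = expensive (strictly above the threshold), 0 = cheap.
toy["is_expensive"] = (toy["Median_House_Value"] > threshold).astype(int)

print("Threshold:", threshold)
print(toy)
```

A different cutoff (for example a fixed dollar amount or an upper quantile) would change the class balance, which in turn affects how accuracy, precision, and recall should be read.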


In [4]:
# Topic 1: Setup & Data Loading

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    mean_squared_error, r2_score,
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, classification_report
)

# Make plots look nicer
plt.style.use("seaborn-v0_8")
sns.set_palette("viridis")

RANDOM_STATE = 42

csv_path = "/content/California_Houses.csv"

df = pd.read_csv(csv_path)

print("Shape:", df.shape)
display(df.head())  # a bare df.head() in the middle of a cell is not rendered

df.info()
print("\nBasic statistics:")
display(df.describe().T)

print("\nMissing values per column:")
print(df.isna().sum())

Shape: (20640, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Median_House_Value        20640 non-null  float64
 1   Median_Income             20640 non-null  float64
 2   Median_Age                20640 non-null  int64  
 3   Tot_Rooms                 20640 non-null  int64  
 4   Tot_Bedrooms              20640 non-null  int64  
 5   Population                20640 non-null  int64  
 6   Households                20640 non-null  int64  
 7   Latitude                  20640 non-null  float64
 8   Longitude                 20640 non-null  float64
 9   Distance_to_coast         20640 non-null  float64
 10  Distance_to_LA            20640 non-null  float64
 11  Distance_to_SanDiego      20640 non-null  float64
 12  Distance_to_SanJose       20640 non-null  float64
 13  Distance_to_SanFrancisco  20640 non-null  

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Median_House_Value | 20640.0 | 206855.816909 | 115395.615874 | 14999.0 | 119600.0 | 179700.0 | 264725.0 | 500001.0 |
| Median_Income | 20640.0 | 3.870671 | 1.899822 | 0.4999 | 2.5634 | 3.5348 | 4.74325 | 15.0001 |
| Median_Age | 20640.0 | 28.639486 | 12.585558 | 1.0 | 18.0 | 29.0 | 37.0 | 52.0 |
| Tot_Rooms | 20640.0 | 2635.763081 | 2181.615252 | 2.0 | 1447.75 | 2127.0 | 3148.0 | 39320.0 |
| Tot_Bedrooms | 20640.0 | 537.898014 | 421.247906 | 1.0 | 295.0 | 435.0 | 647.0 | 6445.0 |
| Population | 20640.0 | 1425.476744 | 1132.462122 | 3.0 | 787.0 | 1166.0 | 1725.0 | 35682.0 |
| Households | 20640.0 | 499.53968 | 382.329753 | 1.0 | 280.0 | 409.0 | 605.0 | 6082.0 |
| Latitude | 20640.0 | 35.631861 | 2.135952 | 32.54 | 33.93 | 34.26 | 37.71 | 41.95 |
| Longitude | 20640.0 | -119.569704 | 2.003532 | -124.35 | -121.8 | -118.49 | -118.01 | -114.31 |
| Distance_to_coast | 20640.0 | 40509.264883 | 49140.03916 | 120.676447 | 9079.756762 | 20522.019101 | 49830.414479 | 333804.7 |



Missing values per column:
Median_House_Value          0
Median_Income               0
Median_Age                  0
Tot_Rooms                   0
Tot_Bedrooms                0
Population                  0
Households                  0
Latitude                    0
Longitude                   0
Distance_to_coast           0
Distance_to_LA              0
Distance_to_SanDiego        0
Distance_to_SanJose         0
Distance_to_SanFrancisco    0
dtype: int64
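
The statistics above show features on very different scales (Median_Income averages about 3.9, Distance_to_coast about 40,500), which matters for a distance-based model such as KNN. The following is a hedged sketch of how the imported `train_test_split` and `StandardScaler` might be combined. The 80/20 split ratio and the synthetic arrays are assumptions for illustration, not the notebook's actual preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42

# Synthetic stand-in: 100 rows, 3 features on very different scales.
rng = np.random.default_rng(RANDOM_STATE)
X = rng.normal(
    loc=[3.9, 2600.0, 40000.0], scale=[1.9, 2200.0, 49000.0], size=(100, 3)
)
y = rng.normal(loc=200000.0, scale=100000.0, size=100)

# Assumption: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

# Fit the scaler on training rows only, so test statistics do not leak in.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training split and only transforming the test split keeps the evaluation honest; without scaling, the largest-magnitude feature would dominate KNN's distance computation.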
