
# House Price Estimation & Customer Segmentation
## Linear Regression & KNN Classification

**Industry Context:**  
A real estate analytics firm wants to estimate house prices and segment buyers based on their purchasing behavior.



## Part A: House Price Prediction (Linear Regression)



### 1. Import Libraries


In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score



### 2. Create Sample Dataset


In [None]:

# Creating a synthetic dataset
np.random.seed(42)

data = pd.DataFrame({
    'Area': np.random.randint(500, 3000, 100),
    'Bedrooms': np.random.randint(1, 6, 100),
    'LocationScore': np.random.randint(1, 10, 100),
    'Age': np.random.randint(0, 30, 100)
})

data['Price'] = (
    data['Area'] * 3000 +
    data['Bedrooms'] * 500000 +
    data['LocationScore'] * 800000 -
    data['Age'] * 20000 +
    np.random.normal(0, 500000, 100)
)

data.head()



### 3. Exploratory Data Analysis (EDA)


In [None]:

sns.pairplot(data)
plt.show()



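**Assumptions Check:**
- Linearity: Inspect the `Price` row of the pair plot; each feature should show a roughly straight-line trend against price (weak for `Age`, whose effect is small relative to the noise).  
- Multicollinearity: The features are generated independently, so pairwise correlations should be close to zero.  
- Normality: This applies to the residuals, so it can only be verified after the model is fitted in Section 4.  
- Homoscedasticity: Also a residual property; the spread of the residuals should stay roughly constant across fitted values.  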
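The multicollinearity point can be checked numerically rather than by eye. The following is a minimal sketch, assuming the features are regenerated with the same seed as in Section 2. The `vif` helper is hypothetical (not part of the notebook) and computes the variance inflation factor by hand as 1 / (1 - R²), where R² comes from regressing one feature on the others.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Recreate the four synthetic features from Section 2 (same seed, same call order)
np.random.seed(42)
features = pd.DataFrame({
    'Area': np.random.randint(500, 3000, 100),
    'Bedrooms': np.random.randint(1, 6, 100),
    'LocationScore': np.random.randint(1, 10, 100),
    'Age': np.random.randint(0, 30, 100)
})

# Pairwise Pearson correlations between the features
corr = features.corr()
print(corr.round(2))

def vif(df, column):
    """Hypothetical helper: VIF = 1 / (1 - R^2) from regressing `column` on the other features."""
    others = df.drop(columns=column)
    r2 = LinearRegression().fit(others, df[column]).score(others, df[column])
    return 1.0 / (1.0 - r2)

vifs = pd.Series({col: vif(features, col) for col in features.columns})
print(vifs.round(2))
```

A VIF close to 1 means a feature is barely explained by the others. Values above roughly 5 to 10 are a common rule-of-thumb warning sign, although that cutoff is a convention and not a strict test.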



### 4. Train Linear Regression Model


In [None]:

X = data[['Area', 'Bedrooms', 'LocationScore', 'Age']]
y = data['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)



### 5. Model Evaluation


In [None]:

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

rmse, mae, r2
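The normality and homoscedasticity assumptions concern the residuals, so they can be examined once the model is fitted. The sketch below is one possible numeric check, not the notebook's own method. It rebuilds the dataset and model from Sections 2 and 4, then applies a Shapiro-Wilk test to the training residuals and a rank correlation between the absolute residuals and the fitted values. The Spearman check is a rough stand-in for a formal heteroscedasticity test.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the synthetic dataset from Section 2
np.random.seed(42)
data = pd.DataFrame({
    'Area': np.random.randint(500, 3000, 100),
    'Bedrooms': np.random.randint(1, 6, 100),
    'LocationScore': np.random.randint(1, 10, 100),
    'Age': np.random.randint(0, 30, 100)
})
data['Price'] = (
    data['Area'] * 3000 +
    data['Bedrooms'] * 500000 +
    data['LocationScore'] * 800000 -
    data['Age'] * 20000 +
    np.random.normal(0, 500000, 100)
)

X = data[['Area', 'Bedrooms', 'LocationScore', 'Age']]
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Residuals on the training data
fitted = model.predict(X_train)
residuals = y_train - fitted

# Normality: a large p-value gives no evidence against normal residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homoscedasticity (rough check): |residual| should not trend with the fitted value
rho, rho_p = stats.spearmanr(np.abs(residuals), fitted)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Spearman rho(|residual|, fitted): {rho:.3f} (p = {rho_p:.3f})")
```

A plot of residuals against fitted values conveys the same information visually and is usually worth drawing alongside these numbers.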



### 6. Coefficient Interpretation


In [None]:

coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})
coefficients



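**Business Meaning:**  
Each coefficient is the expected change in price for a one-unit increase in that feature, holding the other features constant. Because the data are synthetic, the estimates should land near the values used to generate `Price`:
- Area: Price rises with each additional unit of area (3,000 per unit in the generating formula)  
- Bedrooms: Each extra bedroom adds value (500,000 in the generating formula)  
- Location Score: Each additional point adds value (800,000 in the generating formula)  
- Age: Each additional year lowers price (20,000 in the generating formula)  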
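Since `Price` is built from a known formula in Section 2, the fitted coefficients can be compared directly with the generating values. The sketch below repeats the data generation and model fit from earlier sections. The `true_coefs` dictionary and the `comparison` table are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the synthetic dataset from Section 2
np.random.seed(42)
data = pd.DataFrame({
    'Area': np.random.randint(500, 3000, 100),
    'Bedrooms': np.random.randint(1, 6, 100),
    'LocationScore': np.random.randint(1, 10, 100),
    'Age': np.random.randint(0, 30, 100)
})
data['Price'] = (
    data['Area'] * 3000 +
    data['Bedrooms'] * 500000 +
    data['LocationScore'] * 800000 -
    data['Age'] * 20000 +
    np.random.normal(0, 500000, 100)
)

X = data[['Area', 'Bedrooms', 'LocationScore', 'Age']]
y = data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Coefficients used in the generating formula (taken from Section 2)
true_coefs = {'Area': 3000, 'Bedrooms': 500000, 'LocationScore': 800000, 'Age': -20000}

comparison = pd.DataFrame({
    'Feature': X.columns,
    'Fitted': model.coef_,
    'Generating': [true_coefs[c] for c in X.columns]
})
print(comparison.round(0))

# Worked example: estimated price difference for 100 extra units of area, all else equal
area_coef = dict(zip(X.columns, model.coef_))['Area']
print(f"100 extra units of area: about {100 * area_coef:,.0f}")
```

The fitted values will not match the generating ones exactly, because the model is estimated from 80 noisy training rows. The `Age` effect is the smallest relative to the noise, so its estimate is likely to be the least precise.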



---
## Part B: Buyer Segmentation (KNN Classification)



### 1. Create Customer Dataset


In [None]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

customers = pd.DataFrame({
    'Budget': np.random.randint(2000000, 10000000, 100),
    'PurchaseFrequency': np.random.randint(1, 10, 100),
    'LocationPreference': np.random.randint(1, 5, 100)
})

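# Segment labels are drawn at random, independent of the features,
# so no classifier can be expected to beat chance (about 1/3) on this data
customers['Segment'] = np.random.choice(['Low', 'Medium', 'High'], 100)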

customers.head()



### 2. Data Preprocessing


In [None]:

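X = customers[['Budget', 'PurchaseFrequency', 'LocationPreference']]
y = customers['Segment']

# Split first so the test rows do not influence the scaler's mean and variance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # reuse the training statistics

# Scaled copy of the full dataset, used only for the visualization at the end
X_scaled = scaler.transform(X)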



### 3. KNN Model & Cross-Validation


In [None]:

k_values = range(1, 11)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

cv_scores
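One refinement of the loop above is to wrap the scaler and the classifier in a single pipeline, so that `cross_val_score` refits the scaler on the training part of every fold. The sketch below assumes stand-in customer data with the same columns as Section 1. The `default_rng(0)` seed and the `best_k` name are choices made for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in customer data with the same columns as Section 1 (seed is an assumption)
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    'Budget': rng.integers(2000000, 10000000, 100),
    'PurchaseFrequency': rng.integers(1, 10, 100),
    'LocationPreference': rng.integers(1, 5, 100)
})
customers['Segment'] = rng.choice(['Low', 'Medium', 'High'], 100)

X = customers[['Budget', 'PurchaseFrequency', 'LocationPreference']]
y = customers['Segment']

k_values = range(1, 11)
cv_scores = []
for k in k_values:
    # The scaler is refit inside each fold, so validation rows never affect the scaling
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores.append(cross_val_score(pipe, X, y, cv=5).mean())

best_k = k_values[int(np.argmax(cv_scores))]
print(best_k, round(max(cv_scores), 3))
```

With randomly assigned segments the scores should hover around chance, so the differences between values of K here mostly reflect noise, not a real optimum.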



### 4. Train Optimal KNN Model


In [None]:

optimal_k = k_values[np.argmax(cv_scores)]
knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
optimal_k, accuracy
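An accuracy figure is easier to judge next to a baseline that ignores the features. The sketch below uses scikit-learn's `DummyClassifier` on stand-in data (random features and random labels, both assumptions made for this example) to show how such a baseline might be computed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Stand-in features and randomly assigned segment labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.choice(['Low', 'Medium', 'High'], 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Always predicts the most frequent training label, regardless of the input
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

If the KNN accuracy is not clearly above this baseline, the model has not learned anything useful from the features.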



### 5. Bias–Variance Trade-off in KNN



- Small K → Low bias, high variance (overfitting)  
- Large K → High bias, low variance (underfitting)  
- Optimal K balances both  
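The trade-off can be made visible by comparing training and test accuracy across several values of K. The sketch below uses scikit-learn's `make_moons` toy dataset instead of the customer data, because it contains a learnable pattern. The sample size, noise level, and list of K values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two interleaving half-circles with noise: a small dataset with real class structure
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for k in [1, 5, 15, 51, 151]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this K
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"K={k:>3}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

With K = 1 every training point is its own nearest neighbor, so training accuracy is perfect by construction, and any gap to the test accuracy reflects variance. As K approaches the size of the training set, predictions drift toward the overall majority class, which is the high-bias end of the trade-off.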



## Bonus: Decision Boundary Visualization


In [None]:

from matplotlib.colors import ListedColormap

X_vis = X_scaled[:, :2]
y_vis = y.astype('category').cat.codes

knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(X_vis, y_vis)

x_min, x_max = X_vis[:, 0].min() - 1, X_vis[:, 0].max() + 1
y_min, y_max = X_vis[:, 1].min() - 1, X_vis[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

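# One color per segment code (0, 1, 2), shared by the regions and the points
cmap = ListedColormap(['#1f77b4', '#ff7f0e', '#2ca02c'])

plt.contourf(xx, yy, Z, levels=[-0.5, 0.5, 1.5, 2.5], alpha=0.3, cmap=cmap)
plt.scatter(X_vis[:, 0], X_vis[:, 1], c=y_vis, cmap=cmap, edgecolors='k')
plt.xlabel("Budget (standardized)")
plt.ylabel("PurchaseFrequency (standardized)")
plt.title("KNN Decision Boundary")
plt.show()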
