Part 1: Data Loading and Exploration

In [42]:
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the housing dataset
housing = fetch_california_housing()

X = pd.DataFrame(housing.data, columns=housing.feature_names) 
y = pd.Series(housing.target, name='med_house_value')

#Display the first five rows of the dataset. (5 pts)
print("\nDataset Preview:")
print(X.head())

#Print the feature names and check for missing values. (5 pts)
print("\nFeature names:", X.columns.tolist())

print("\nMissing values per column:")
print(X.isnull().sum())

#Generate summary statistics (mean, min, max, etc.). (10 pts)

print("\nSummary Statistics:")
print(X.describe())



Dataset Preview:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  

Feature names: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Missing values per column:
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

Summary Statistics:
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20

Part 2: Linear Regression on Unscaled Data (30 points)

In [43]:
#Split the dataset into training and test sets (80% training, 20% testing). (5 pts)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, 
                                                            test_size=0.2, 
                                                            random_state=42)

#Train a linear regression model on the unscaled data using sklearn.linear_model.LinearRegression. (5 pts)
lin_reg_raw = LinearRegression()
lin_reg_raw.fit(X_train_raw, y_train)

#Make predictions on the test set. (5 pts)
y_pred_raw = lin_reg_raw.predict(X_test_raw)
print("Predictions:")
print(y_pred_raw)

# View our model's coefficients
print("\nModel Coefficients (Unscaled):")
print(pd.Series(lin_reg_raw.coef_,
                index=X.columns))
print("\nModel Intercept (Unscaled):")
print(pd.Series(lin_reg_raw.intercept_))

Predictions:
[0.71912284 1.76401657 2.70965883 ... 4.46877017 1.18751119 2.00940251]

Model Coefficients (Unscaled):
MedInc        0.448675
HouseAge      0.009724
AveRooms     -0.123323
AveBedrms     0.783145
Population   -0.000002
AveOccup     -0.003526
Latitude     -0.419792
Longitude    -0.433708
dtype: float64

Model Intercept (Unscaled):
0   -37.023278
dtype: float64


In [44]:
#Evaluate model performance using the following metrics:

from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

#Mean Squared Error (MSE) (5 pts)

mse_raw = mean_squared_error(y_test, y_pred_raw)
print("Unscaled Data Model:")
print(f"Mean Squared Error: {mse_raw:.2f}")

#Root Mean Squared Error (RMSE) (5 pts)

rmse_raw = root_mean_squared_error(y_test, y_pred_raw)
print(f"Root Mean Squared Error: {rmse_raw:.2f}")

#R² Score (5 pts)

r2_raw = r2_score(y_test, y_pred_raw)
print(f"R² Score: {r2_raw:.2f}")


Unscaled Data Model:
Mean Squared Error: 0.56
Root Mean Squared Error: 0.75
R² Score: 0.58
R² Score: 0.58


Interpretation Questions:

What does the R² score tell us about model performance?

The R² score is 0.58, which means the model explains about 58% of the variance in median house values on the test set. This is a moderate level of explanatory power: the model captures some important patterns in the data, but roughly 42% of the variance is left unexplained, which could indicate missing features or relationships that a linear model cannot capture.
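
The "share of variance explained" reading follows from how R² is defined: one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal pure-Python sketch of that definition, using made-up toy values rather than the housing data (the function name `r2_manual` is hypothetical):

```python
def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, computed from plain lists."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values (hypothetical, not taken from the housing data)
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 1.5, 3.5, 3.5]
print(f"R² on toy data: {r2_manual(toy_true, toy_pred):.2f}")
```

Under this definition, predicting the mean for every row gives an R² of 0 and perfect predictions give 1, so 0.58 sits a little past the midpoint between those two reference points.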

Which features seem to have the strongest impact on predictions based on the model’s coefficients?

Based on the model’s raw coefficients, the features with the largest magnitudes are Average Bedrooms per Household (AveBedrms, 0.783), Median Income (MedInc, 0.449), Longitude (-0.434), and Latitude (-0.420). AveBedrms has the largest positive coefficient, meaning that, holding the other features fixed, neighborhoods with more bedrooms per household have higher predicted housing prices. MedInc also has a strong positive effect, indicating that areas with higher median incomes tend to see higher prices. In contrast, both Longitude and Latitude have sizable negative coefficients, meaning that moving further north or east tends to lower predicted housing prices, likely reflecting regional price patterns within California. One caveat: the model was fit on unscaled data, so each coefficient is expressed per unit of its own feature, and the magnitudes are not directly comparable across features measured on different scales.
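
Because raw coefficients are expressed per unit of each feature, one way to put them on a more comparable footing is to multiply each coefficient by its feature's standard deviation, which gives the change in prediction for a one-standard-deviation change in that feature. The sketch below is only an illustration: it uses the five preview rows printed in Part 1, which are far too few to be representative, and the helper name `per_std_effect` is hypothetical.

```python
from statistics import pstdev

# Coefficients copied from the unscaled model output above
coefs = {"MedInc": 0.448675, "AveBedrms": 0.783145}

# Only the five preview rows shown in Part 1; used purely to
# illustrate the calculation, not to draw conclusions
preview = {
    "MedInc": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "AveBedrms": [1.023810, 0.971880, 1.073446, 1.073059, 1.081081],
}

def per_std_effect(coefs, columns):
    """Change in prediction for a one-standard-deviation change in each feature."""
    return {name: coefs[name] * pstdev(columns[name]) for name in coefs}

scaled = per_std_effect(coefs, preview)
for name, value in scaled.items():
    print(f"{name}: {value:.3f}")
```

On these five rows, MedInc varies far more than AveBedrms, so its per-standard-deviation effect comes out larger even though its raw coefficient is smaller. In the notebook, the same idea could be applied to the full training set with `lin_reg_raw.coef_ * X_train_raw.std()`.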

How well do the predicted values match the actual values?

Based on the MSE of 0.56 and the RMSE of 0.75, the model’s predictions are moderately close to the actual values; lower MSE and RMSE values indicate better predictive performance. The target is expressed in units of $100,000, so the RMSE of 0.75 means a typical prediction is off by roughly $75,000. Because RMSE is the square root of the mean squared error, it weights large errors more heavily than a simple average of absolute errors would. This is a moderate level of accuracy: the model captures the general trend of the data, but the errors are large enough to suggest room for improvement.
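
The arithmetic behind these two metrics can be checked directly from the reported values. The sketch below assumes only the rounded figures printed above and the dataset's documented target unit of $100,000; the constant name `UNITS_IN_DOLLARS` is hypothetical.

```python
import math

mse_reported = 0.56   # from the output above (rounded)
rmse_reported = 0.75  # from the output above (rounded)

# RMSE is the square root of MSE, so the two reported values should agree
rmse_from_mse = math.sqrt(mse_reported)
print(f"sqrt(MSE) = {rmse_from_mse:.2f}")

# The target is the median house value in units of $100,000
UNITS_IN_DOLLARS = 100_000
print(f"Typical error: about ${rmse_reported * UNITS_IN_DOLLARS:,.0f}")
```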

Part 4: Feature Selection and Simplified Model (25 points)

Select three features from the dataset to build a simplified model. Explain your choice. 

For the simplified model, I selected MedInc (Median Income), AveBedrms (Average Bedrooms per Household), and Latitude. MedInc and AveBedrms had the two largest coefficient magnitudes in the full model, and Latitude’s coefficient (-0.420) is nearly as large as Longitude’s (-0.434). Together they capture key drivers of housing prices while keeping the model manageable and interpretable.

AveBedrms had the largest coefficient, 0.783, which indicates that, holding the other features fixed, more bedrooms per household is associated with higher predicted values. MedInc had the second-largest coefficient, 0.449: higher-income neighborhoods tend to have higher housing prices, reflecting purchasing power. Finally, Latitude captures north-south geographic variation in pricing, which is crucial in a dataset covering California, where location strongly influences home values; its negative coefficient means the southernmost areas tended to have higher predicted values. Together, these three features balance economic factors, physical housing characteristics, and location effects, providing a well-rounded foundation for a simplified predictive model.
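
The ranking by coefficient magnitude can be reproduced from the Part 2 output. This is a small sketch that hard-codes the printed coefficients rather than reading them from the fitted model; it ranks raw magnitudes only, which depend on each feature's units.

```python
# Coefficients copied from the "Model Coefficients (Unscaled)" output in Part 2
coefs = {
    "MedInc": 0.448675, "HouseAge": 0.009724, "AveRooms": -0.123323,
    "AveBedrms": 0.783145, "Population": -0.000002, "AveOccup": -0.003526,
    "Latitude": -0.419792, "Longitude": -0.433708,
}

# Sort feature names by absolute coefficient value, largest first
ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
for name in ranked:
    print(f"{name:<10} {coefs[name]:+.6f}")
```

Longitude and Latitude land in third and fourth place with nearly identical magnitudes, so either one is a defensible geographic feature under this criterion.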

In [45]:
#Train a new linear regression model using only these three features. (5 pts)

x = X[['MedInc', 'AveBedrms', 'Latitude']]  # Features (subset of the full feature frame)
Y = y  # Target variable (same series as in Part 2)

x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size=0.2, random_state=42)

newmodel = LinearRegression()
newmodel.fit(x_train, Y_train)

Y_pred = newmodel.predict(x_test)

#Evaluate the performance of this simplified model and compare it to the full model. (5 pts)

mse_raw2 = mean_squared_error(Y_test, Y_pred)
print("Simplified Model (3 features):")
print(f"Mean Squared Error: {mse_raw2:.2f}")

rmse_raw2 = root_mean_squared_error(Y_test, Y_pred)
print(f"Root Mean Squared Error: {rmse_raw2:.2f}")

r2_raw2 = r2_score(Y_test, Y_pred)
print(f"R² Score: {r2_raw2:.2f}")


Simplified Model (3 features):
Mean Squared Error: 0.70
Root Mean Squared Error: 0.84
R² Score: 0.47


Interpretation Questions:

How does the simplified model compare to the full model?

The simplified model, which uses only three features, shows higher error than the full model. Its Mean Squared Error (MSE) is 0.70 and its Root Mean Squared Error (RMSE) is 0.84, compared with 0.56 and 0.75 for the full model, so its predictions are less accurate. Its R² score is 0.47, meaning it explains only 47% of the variance in housing prices, whereas the full model reaches 0.58. This indicates that the full model, with its additional features, captures more of the underlying patterns in the data and predicts more accurately.
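
The size of the gap between the two models can be quantified from the reported metrics. The sketch below uses the two-decimal values printed in the notebook, so the percentages are approximate; the helper name `pct_increase` is hypothetical.

```python
# Metrics as printed in the notebook (rounded to two decimals)
full = {"MSE": 0.56, "RMSE": 0.75, "R2": 0.58}
simplified = {"MSE": 0.70, "RMSE": 0.84, "R2": 0.47}

def pct_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100

mse_change = pct_increase(full["MSE"], simplified["MSE"])
rmse_change = pct_increase(full["RMSE"], simplified["RMSE"])
r2_drop = full["R2"] - simplified["R2"]

print(f"MSE increase:  {mse_change:.0f}%")
print(f"RMSE increase: {rmse_change:.0f}%")
print(f"R² drop:       {r2_drop:.2f}")
```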

Would you use this simplified model in practice? Why or why not?

In practice, I would consider using the simplified model despite its lower accuracy. Its R² of 0.47 and RMSE of 0.84 are noticeably worse than the full model’s 0.58 and 0.75, but the gap may be acceptable when simplicity matters. With only three features, the model is easier to implement and interpret, and it requires fewer inputs to be collected for each prediction, which is an advantage when data collection is expensive or time-consuming. For applications that need a quick, cost-effective estimate, the simplified model offers a reasonable trade-off between performance and ease of use; where accuracy is the priority, the full model is the better choice.