In [3]:
import pandas as pd

# Part 1: Data Loading and Exploration

#### 1. Load the California Housing dataset from `sklearn.datasets`.

In [None]:
from sklearn.datasets import fetch_california_housing

# Load the housing dataset
housing = fetch_california_housing()

#### 2. Create a Pandas DataFrame for the features and a Series for the target variable (med_house_value).

In [5]:
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='med_house_value')

#### 3. Perform an initial exploration of the dataset

In [6]:
# Display the first five rows of the dataset
display(X.head())

# Print the feature names
print("\nFeature Names:")
print(housing.feature_names)

# Check for missing values
print("\nNumber of Missing Values:")
missing_values = X.isnull().sum()
display(missing_values)

# Generate Summary Statistics
print("\nSummary Statistics for Housing Dataset:")
display(X.describe())

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |



Feature Names:
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Number of Missing Values:


MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64


Summary Statistics for Housing Dataset:


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 |


# Part 2: Linear Regression on Unscaled Data

#### 4. Split the dataset into training and test sets (80% training, 20% testing)

In [7]:
from sklearn.model_selection import train_test_split

# Split the dataset 80/20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Note: random_state is fixed so that the split (and therefore the results)
# is the same for me as it is when you run this code.
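
The effect of fixing `random_state` can be checked directly. The snippet below is a minimal sketch on toy arrays (`X_demo` and `y_demo` are illustrative stand-ins, not the housing data): two calls with the same seed should select exactly the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the housing features and target (illustrative only)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls that share the same seed
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print("Identical splits with the same seed:", same_split)
print("Rows in the test set:", len(split_a[1]))
```

Without a fixed seed, each run would typically draw a different test set, so the reported metrics could vary slightly from run to run.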

#### 5. Train a linear regression model on the unscaled data using `sklearn.linear_model.LinearRegression`

In [8]:
from sklearn.linear_model import LinearRegression

# Initialize and train the linear regression model using unscaled data
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)


#### 6. Make predictions on the test set

In [9]:
# Make predictions on test set
y_pred = lin_reg.predict(X_test)

#### 7. Evaluate model performance using the following metrics:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² Score
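
For reference, the three metrics can be written out from their definitions. The helper below is a small sketch using NumPy only; `regression_metrics` and the toy values are hypothetical names introduced here for illustration, and the notebook itself uses the `sklearn.metrics` functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)           # average squared error
    rmse = np.sqrt(mse)                     # same units as the target
    ss_res = np.sum(residuals ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Toy values, not taken from the housing data
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.0, 2.5, 4.0]
mse_demo, rmse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)

print(f"MSE: {mse_demo:.3f}, RMSE: {rmse_demo:.3f}, R2: {r2_demo:.3f}")
```

Written this way, it is easier to see that RMSE is simply the square root of MSE, and that R² compares the model's squared error against the error of always predicting the mean.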

In [10]:
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

# Evaluate model performance using MSE, RMSE, and R²
mse = mean_squared_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display performance metrics
print("Unscaled Data Model:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R2 Score: {r2:.2f}")

# Display model coefficients
print("\nModel Coefficients:")
print(pd.Series(lin_reg.coef_, index=X.columns))

Unscaled Data Model:
Mean Squared Error: 0.56
Root Mean Squared Error: 0.75
R2 Score: 0.58

Model Coefficients:
MedInc        0.448675
HouseAge      0.009724
AveRooms     -0.123323
AveBedrms     0.783145
Population   -0.000002
AveOccup     -0.003526
Latitude     -0.419792
Longitude    -0.433708
dtype: float64


#### 8. Interpretation Questions:

**What does the R² score tell us about model performance?**
<br>R² is the proportion of variance in the target variable (med_house_value) that is explained by the predictor variables. A value of 1 means the predictions match the actual values exactly, while a value of 0 means the model does no better than always predicting the mean. Our R² of 0.58 therefore means the model explains about 58% of the variance in the test set. It captures a substantial part of the signal but leaves a large share unexplained, so it does not *strongly* predict actual values; however, it may still be acceptable depending on the use case.

**Which features seem to have the strongest impact on predictions based on the model's coefficients?**
<br>Each coefficient represents the change in the predicted median house value for a one-unit change in that feature, holding all other features constant. By raw magnitude, AveBedrms has the largest coefficient, with MedInc, Longitude, and Latitude following behind. AveRooms is noticeably smaller, HouseAge and AveOccup are smaller still, and Population appears negligible. One caveat: because the data is unscaled, the size of a coefficient depends on the units of its feature as well as on its influence (Population values run into the thousands, while AveBedrms is usually close to 1). These raw magnitudes are therefore not directly comparable across features, and they do not measure correlation with the target.
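
One rough way to put the coefficients on a common footing is to multiply each one by its feature's standard deviation, which gives the predicted change per one-standard-deviation change in that feature. The sketch below copies the coefficients and the `std` row from the outputs above. It is an approximation: the standard deviations come from the full dataset via `describe()`, while the coefficients were fit on the training split.

```python
import pandas as pd

# Unscaled coefficients, copied from the model output above
coef = pd.Series({
    "MedInc": 0.448675, "HouseAge": 0.009724, "AveRooms": -0.123323,
    "AveBedrms": 0.783145, "Population": -0.000002, "AveOccup": -0.003526,
    "Latitude": -0.419792, "Longitude": -0.433708,
})

# Feature standard deviations, copied from the summary statistics above
std = pd.Series({
    "MedInc": 1.899822, "HouseAge": 12.585558, "AveRooms": 2.474173,
    "AveBedrms": 0.473911, "Population": 1132.462122, "AveOccup": 10.38605,
    "Latitude": 2.135952, "Longitude": 2.003532,
})

# Predicted change in the target per one-standard-deviation change in each feature
per_std_effect = coef * std

# Rank features by the absolute size of that effect
ranking = per_std_effect.abs().sort_values(ascending=False)
print(ranking.round(3))
```

On this footing the ordering changes: Latitude, Longitude, and MedInc come out on top, and AveBedrms drops to the middle of the list.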

**How well do the predicted values match the actual values?**
<br>Using MSE, RMSE, and R², we can evaluate how well the predicted values match the actual values. As noted above, the R² score of 0.58 indicates that a large share of the variation in the actual values is *not* captured by the predictions. The MSE (0.56) and RMSE (0.75) look close to zero, but both are expressed in the units of the target, so they are only meaningful relative to its scale. The target in this dataset is given in units of \$100,000, so an RMSE of 0.75 corresponds to a typical error of roughly \$75,000, which is not small. Taken together, the metrics suggest a moderate fit rather than a close match.
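
To put the error in context, it can be compared with a baseline that always predicts the mean of the test targets. The sketch below works only from the rounded metrics printed above, so the results are approximate. It also assumes the target is in units of \$100,000, as stated in the scikit-learn description of this dataset.

```python
import math

# Rounded metrics copied from the output above
mse_model = 0.56
r2_model = 0.58

# R^2 = 1 - MSE_model / MSE_baseline, where the baseline always predicts
# the mean of the test targets; solve for the baseline's MSE
mse_baseline = mse_model / (1.0 - r2_model)

rmse_model = math.sqrt(mse_model)
rmse_baseline = math.sqrt(mse_baseline)

# Fraction by which the model lowers RMSE relative to the baseline
rmse_reduction = 1.0 - rmse_model / rmse_baseline

# Assumption: the target is expressed in units of $100,000
rmse_dollars = rmse_model * 100_000

print(f"Baseline RMSE: {rmse_baseline:.2f}")
print(f"Model RMSE:    {rmse_model:.2f}")
print(f"RMSE reduction vs. baseline: {rmse_reduction:.0%}")
print(f"Model RMSE in dollars: {rmse_dollars:,.0f}")
```

Read this way, the model is clearly better than guessing the mean, but its typical error is still a sizeable fraction of a typical house value.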

# Part 3: Linear Regression with Feature Scaling

#### 9. Apply feature scaling (use `sklearn.preprocessing.StandardScaler`)

In [14]:
from sklearn.preprocessing import StandardScaler

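# Initialize the scaler and apply it to our FEATURES
# Note: the scaler is fit on all rows of X here, before the train/test split,
# so its means and standard deviations also reflect rows that end up in the test set.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)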
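
A common alternative to scaling the whole dataset up front is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data, so the array names and coefficients are illustrative and are not the notebook's variables. It is not the approach used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 500.0], size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] + 0.02 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Split BEFORE scaling so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_tr only, then reuses those
# training statistics when transforming X_te inside predict/score
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

fitted_scaler = model.named_steps["standardscaler"]
r2_demo = model.score(X_te, y_te)

print("Scaler means match training means:",
      np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print(f"Test R2: {r2_demo:.2f}")
```

For plain linear regression the test metrics should come out the same either way, but the habit matters for models whose predictions do depend on feature scale.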

#### 10. Train a new linear regression model on the scaled features

In [16]:
# Split the scaled data (only the features are scaled; y is the original target)
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fit the scaled data
lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(X_train_scaled, y_train_scaled)

# Make predictions
y_pred_scaled = lin_reg_scaled.predict(X_test_scaled)
y_pred_scaled

array([0.71912284, 1.76401657, 2.70965883, ..., 4.46877017, 1.18751119,
       2.00940251])

#### 11. Evaluate model performance again using the same metrics (MSE, RMSE, R²)

In [18]:
# Evaluate model performance using MSE, RMSE, and R²
mse_scaled = mean_squared_error(y_test_scaled, y_pred_scaled)
rmse_scaled = root_mean_squared_error(y_test_scaled, y_pred_scaled)
r2_scaled = r2_score(y_test_scaled, y_pred_scaled)

# Display performance metrics - should be the same as the original model
print("Scaled Data Model:")
print(f"Mean Squared Error: {mse_scaled:.2f}")
print(f"Root Mean Squared Error: {rmse_scaled:.2f}")
print(f"R2 Score: {r2_scaled:.2f}")

# Display model coefficients - should be different
print("\nModel Coefficients:")
print(pd.Series(lin_reg_scaled.coef_, index=X.columns))

Scaled Data Model:
Mean Squared Error: 0.56
Root Mean Squared Error: 0.75
R2 Score: 0.58

Model Coefficients:
MedInc        0.852382
HouseAge      0.122382
AveRooms     -0.305116
AveBedrms     0.371132
Population   -0.002298
AveOccup     -0.036624
Latitude     -0.896635
Longitude    -0.868927
dtype: float64


#### 12. Interpretation Questions:
- Compare the metrics before and after scaling. What changed, and why?
- Did the R² score improve? Why or why not?
- What role does feature scaling play in linear regression?
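
One way to probe these questions is a small experiment on synthetic data. The sketch below fits ordinary least squares with and without standardization, using NumPy only. The data and the helper `fit_ols` are hypothetical, introduced here for illustration. For an unregularized linear model, one would expect the predictions to be identical and each scaled coefficient to equal the raw coefficient times that feature's standard deviation.

```python
import numpy as np

# Synthetic data with two features on very different scales (illustrative only)
rng = np.random.default_rng(0)
n = 500
X_raw = np.column_stack([rng.normal(4.0, 2.0, n), rng.normal(1400.0, 1100.0, n)])
y_demo = 0.4 * X_raw[:, 0] - 0.0001 * X_raw[:, 1] + rng.normal(0.0, 0.5, n)

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

# Standardize: subtract the mean and divide by the standard deviation
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mu) / sigma

b0_raw, coef_raw = fit_ols(X_raw, y_demo)
b0_std, coef_std = fit_ols(X_std, y_demo)

pred_raw = b0_raw + X_raw @ coef_raw
pred_std = b0_std + X_std @ coef_std

print("Predictions identical:", np.allclose(pred_raw, pred_std))
print("Scaled coef equals raw coef * std:", np.allclose(coef_std, coef_raw * sigma))
```

Identical predictions imply identical MSE, RMSE, and R², which is consistent with the unchanged metrics reported above. Only the coefficients change, because they are re-expressed per standard deviation instead of per original unit.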

# Part 4: Feature Selection and Simplified Model

#### 13. Select three features from the dataset to build a simplified model. Explain your choice.

In [11]:
# Create new X and y with select columns
X2 = X[["MedInc", "AveBedrms", "Longitude"]]
y2 = pd.Series(housing.target, name='med_house_value')

# MedInc, AveBedrms, and Longitude have the largest coefficients in the
# full model (see #7), so I chose them for their potential to produce
# predictions that more closely match the actual values.
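
Raw coefficient size is one possible selection criterion. Another simple one, shown here only as an alternative and not as the method used above, is to rank features by the absolute value of their correlation with the target. The sketch below uses synthetic data, and the column names `strong`, `weak`, and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features with known relationships to the target (illustrative only)
rng = np.random.default_rng(1)
n = 1000
features = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = 2.0 * features["strong"] + 0.8 * features["weak"] + rng.normal(scale=0.5, size=n)

# Absolute Pearson correlation of each feature with the target, largest first
abs_corr = features.corrwith(target).abs().sort_values(ascending=False)
top_features = list(abs_corr.index[:2])

print(abs_corr.round(2))
print("Selected:", top_features)
```

Unlike raw coefficients, correlations do not depend on the units of each feature, although they look at one feature at a time and so ignore relationships between features.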

#### 14. Train a new linear regression model using only these three features.

In [12]:
# Split the dataset
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, 
                                                    test_size = 0.2, random_state=42)

# Initialize and train linear regression model on unscaled data
lin_reg2 = LinearRegression()
lin_reg2.fit(X_train2, y_train2)

# Make predictions
y_pred2 = lin_reg2.predict(X_test2)

#### 15. Evaluate the performance of this simplified model and compare it to the full model.

In [13]:
# Calculate MSE, RMSE, and R² for simplified model
mse2 = mean_squared_error(y_test2, y_pred2)
rmse2 = root_mean_squared_error(y_test2, y_pred2)
r22 = r2_score(y_test2, y_pred2)

# Evaluate both simplified and original model
# NEW:
print("NEW Data Model:")
print(f"Mean Squared Error: {mse2:.2f}")
print(f"Root Mean Squared Error: {rmse2:.2f}")
print(f"R2 Score: {r22:.2f}")
# ORIGINAL:
print("\nORIGINAL Data Model:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R2 Score: {r2:.2f}")

# View both new and original model coefficients
# NEW:
print("\nNEW Model Coefficients:")
print(pd.Series(lin_reg2.coef_, index=X2.columns))
# ORIGINAL:
print("\nORIGINAL Model Coefficients:")
print(pd.Series(lin_reg.coef_, index=X.columns))

NEW Data Model:
Mean Squared Error: 0.71
Root Mean Squared Error: 0.84
R2 Score: 0.46

ORIGINAL Data Model:
Mean Squared Error: 0.56
Root Mean Squared Error: 0.75
R2 Score: 0.58

NEW Model Coefficients:
MedInc       0.418949
AveBedrms   -0.001910
Longitude   -0.019934
dtype: float64

ORIGINAL Model Coefficients:
MedInc        0.448675
HouseAge      0.009724
AveRooms     -0.123323
AveBedrms     0.783145
Population   -0.000002
AveOccup     -0.003526
Latitude     -0.419792
Longitude    -0.433708
dtype: float64


#### 16. Interpretation Questions:

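**How does the simplified model compare to the full model?**
<br>The simplified model does worse than the full model. Its MSE and RMSE are *further* from zero (0.71 and 0.84 versus 0.56 and 0.75), and its R² is *further* from one (0.46 versus 0.58). The coefficients for AveBedrms and Longitude are also much weaker once the other columns are removed from the model. I am not certain of the cause, but a likely explanation is that a regression coefficient measures a feature's effect while holding the *other features in the model* constant, so dropping features that are related to the remaining ones changes what each coefficient represents. For instance, there may be a relationship between Latitude and Longitude, and between AveBedrms and AveRooms.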
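
The dependency guessed at above can be illustrated with a small synthetic experiment. In the sketch below, `x1` and `x2` are strongly negatively correlated and both carry a negative effect on `y`. These are made-up variables, loosely mimicking how Longitude and Latitude might behave, and are not the housing data. Dropping `x2` should leave `x1` with a coefficient much closer to zero, because `x1` then also absorbs the opposing effect of the omitted feature.

```python
import numpy as np

# Two strongly (negatively) correlated synthetic features, illustrative only
rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = -0.9 * x1 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)

# Both features have a true effect of -1 on the target
y_demo = -1.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=n)

def ols_coefs(X, y):
    """Least-squares slopes (the intercept is fitted but not returned)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

coef_full = ols_coefs(np.column_stack([x1, x2]), y_demo)   # both features
coef_reduced = ols_coefs(x1.reshape(-1, 1), y_demo)        # x2 omitted

print("Full model coefficients:   ", coef_full.round(2))
print("Reduced model coefficient: ", coef_reduced.round(2))
```

In the full model each coefficient is estimated with the other feature held fixed. In the reduced model, the single coefficient describes the combined change in `y` when `x1` moves and `x2` moves along with it.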

**Would you use this simplified model in practice? Why or why not?**
<br>I would not use this simplified model, because our evaluation shows that its predicted values are further from the actual values than those of the full model: it has a higher error (RMSE of 0.84 versus 0.75) and explains less of the variance (R² of 0.46 versus 0.58).