# **S06: EXERCISES**

## Exercise 01

Build a Linear Regression model for the `california_housing` dataset. You can load this dataset by calling `datasets.fetch_california_housing()`.

 - **PART 1**
    - **Question 1** - What are the $R^2$ metrics for train and test sets?
    - **Question 2** - Imagine that my wife and I want to sell our house at *528-426 W Scott Ave, Clovis, CA 93612*, but we have no idea what price to ask. Our house is 30 years old, with 6 rooms and 3 bedrooms. There are 300 people in our geographic block group. Our income is 60K. *Hint: Build a single record with this information and get the prediction using your trained model.*

    - *NOTE: Don't use Latitude and Longitude for this part*

 - **(Optional) PART 2** - Repeat the process, but now include three new variables called `distance2SF`, `distance2SJ` and `distance2SD`, containing the distance in km from each area to San Francisco, San Jose and San Diego, respectively.

    - **Question 3** - What is the recommended sale price of my house now?
    - *NOTE: You can use the `geopy` library to calculate distances between locations. https://geopy.readthedocs.io/en/stable/#module-geopy.distance*
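
One possible way to build the three distance columns is sketched below. It swaps in a plain haversine (great-circle) formula so that it runs without extra dependencies; `geopy.distance` can be used instead, as the note suggests. The city coordinates, the `CITIES` dictionary and the helper names `haversine_km` and `city_distances` are assumptions made for illustration and are not part of the dataset.

```python
import math

# Approximate city-centre coordinates (latitude, longitude) in degrees.
# These values are assumptions for illustration; they are not part of the dataset.
CITIES = {
    "distance2SF": (37.7749, -122.4194),  # San Francisco
    "distance2SJ": (37.3382, -121.8863),  # San Jose
    "distance2SD": (32.7157, -117.1611),  # San Diego
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def city_distances(lat, lon):
    """Return the three distance features for one block group."""
    return {
        name: haversine_km(lat, lon, c_lat, c_lon)
        for name, (c_lat, c_lon) in CITIES.items()
    }


# First row of the dataset: Latitude 37.88, Longitude -122.23
features = city_distances(37.88, -122.23)
for name, km in features.items():
    print(f"{name}: {km:.1f} km")
```

Applying `city_distances` to the Latitude and Longitude of every row would yield the three new columns. The haversine formula treats the Earth as a sphere, so its results can differ slightly from an ellipsoidal geodesic distance.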

**Don't forget...**
 - Split data into train and test in order to evaluate the model with unseen data
 - If you want, you can standardize (`StandardScaler`) data before fitting the model
 - Train your model and apply it to the test data.
 - Evaluate the model with the `score` function, for both train and test datasets.
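
The checklist above can be sketched end to end as follows. To keep the sketch runnable offline it uses synthetic stand-in data instead of the housing dataset, and it chains the scaler and the regressor with `make_pipeline`, which is one convenient option rather than a requirement of the exercise. The names `X_demo`, `y_demo` and `true_coef` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, used so the sketch runs offline):
# 500 rows, 6 features, a linear target plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 6))
true_coef = np.array([1.5, -2.0, 0.5, 0.0, 3.0, -1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=500)

# 1. Hold out 20% of the rows as unseen test data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2-3. Standardize and fit in one pipeline; the scaler is fitted on the training rows only.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

# 4. For a regressor, score() returns R^2.
r2_tr = model.score(X_tr, y_tr)
r2_te = model.score(X_te, y_te)
print(f"train R^2: {r2_tr:.3f}, test R^2: {r2_te:.3f}")
```

With a pipeline, calling `model.predict` on new rows applies the fitted scaling automatically, so there is no separate `transform` step to forget.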

In [14]:
import pandas as pd
import plotly.express as px
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import numpy as np

In [15]:
# Load the California housing dataset
california_housing = fetch_california_housing()

# Split the dataset into the feature matrix X and the target vector y (NumPy arrays)
X, y = california_housing.data, california_housing.target

# Display the feature matrix
print(X)


[[   8.3252       41.            6.98412698 ...    2.55555556
    37.88       -122.23      ]
 [   8.3014       21.            6.23813708 ...    2.10984183
    37.86       -122.22      ]
 [   7.2574       52.            8.28813559 ...    2.80225989
    37.85       -122.24      ]
 ...
 [   1.7          17.            5.20554273 ...    2.3256351
    39.43       -121.22      ]
 [   1.8672       18.            5.32951289 ...    2.12320917
    39.43       -121.32      ]
 [   2.3886       16.            5.25471698 ...    2.61698113
    39.37       -121.24      ]]


In [16]:
# Drop the last two columns (Latitude and Longitude), as required for Part 1
X = X[:, :-2]
print(X)

[[8.32520000e+00 4.10000000e+01 6.98412698e+00 1.02380952e+00
  3.22000000e+02 2.55555556e+00]
 [8.30140000e+00 2.10000000e+01 6.23813708e+00 9.71880492e-01
  2.40100000e+03 2.10984183e+00]
 [7.25740000e+00 5.20000000e+01 8.28813559e+00 1.07344633e+00
  4.96000000e+02 2.80225989e+00]
 ...
 [1.70000000e+00 1.70000000e+01 5.20554273e+00 1.12009238e+00
  1.00700000e+03 2.32563510e+00]
 [1.86720000e+00 1.80000000e+01 5.32951289e+00 1.17191977e+00
  7.41000000e+02 2.12320917e+00]
 [2.38860000e+00 1.60000000e+01 5.25471698e+00 1.16226415e+00
  1.38700000e+03 2.61698113e+00]]


In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
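
As a small illustration of what the cell above does, the sketch below standardizes a single toy column by hand. The numbers are hypothetical. The point is that the mean and the standard deviation are learned from the training rows only and then reused for the test rows. `StandardScaler` uses the population standard deviation, which corresponds to `pstdev` here.

```python
from statistics import fmean, pstdev

# Toy one-feature data (hypothetical values for illustration only).
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 10.0]

# "fit": learn the mean and population standard deviation from the training column only.
mu = fmean(train_col)
sigma = pstdev(train_col)

# "transform": apply the training statistics to both columns.
train_scaled = [(x - mu) / sigma for x in train_col]
test_scaled = [(x - mu) / sigma for x in test_col]

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is transformed with the training statistics.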

In [19]:
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

In [20]:
y_train_pred = lr.predict(X_train_scaled)
y_test_pred = lr.predict(X_test_scaled)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)
print(f"R^2 for the training set: {r2_train}")
print(f"R^2 for the test set: {r2_test}")

R^2 for the training set: 0.5459161602818383
R^2 for the test set: 0.5099337366296424


In [22]:
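# Feature order: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup
# MedInc is in tens of thousands of dollars, so an income of 60K becomes 6.0.
# AveOccup is set to 300/6 as an assumption; the exercise does not give the number of households.
house_features = np.array([[6.0, 30, 6, 3, 300, 300/6]])
house_features_scaled = scaler.transform(house_features)
predicted_price = lr.predict(house_features_scaled)
# The target is expressed in units of 100,000 USD
print(f"Predicted selling price: {predicted_price[0]*100000} USD")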

Predicted selling price: 500672.5944413878 USD
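
Because the record is passed as a bare array, its values must follow the column order of the training matrix exactly. One way to make that harder to get wrong is to build the record from named values, as sketched below. `FEATURE_ORDER` lists the first six feature names of the dataset. The `house` dictionary is a hypothetical helper, and its `AveOccup` entry simply repeats the value used in the notebook, which is itself an assumption because the number of households is not stated.

```python
# Names of the first six California housing columns, in dataset order
# (Latitude and Longitude are dropped in Part 1).
FEATURE_ORDER = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]

# Hypothetical record for the house in Question 2.
house = {
    "MedInc": 6.0,        # income of 60K, assuming units of tens of thousands of dollars
    "HouseAge": 30,
    "AveRooms": 6,
    "AveBedrms": 3,
    "Population": 300,
    "AveOccup": 300 / 6,  # value used in the notebook; the household count is not given
}

# Fail early if a feature is missing instead of silently misaligning columns.
missing = [name for name in FEATURE_ORDER if name not in house]
if missing:
    raise ValueError(f"missing features: {missing}")

# One row, columns in training order, ready for scaler.transform(...)
record = [[float(house[name]) for name in FEATURE_ORDER]]
print(record)
```

The resulting `record` can be passed to `scaler.transform` and then to `lr.predict` in the same way as `house_features`.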
