# 1 - Load the dataset

In [4]:
#sklearn provides loaders for several datasets. We will use one of them, called "California Housing", which is identical
#to the example we saw in the theory part. This dataset has 20640 samples with 8 features (columns). The target variable
#is the median house value of a block group, in units of $100,000.

#import the libs
from sklearn.datasets import fetch_california_housing
#load the dataset
data = fetch_california_housing()  #returns a dictionary-like object with attributes such as data, target, DESCR
#first of all, let's see the shape of the feature matrix
print(data.data.shape)

(20640, 8)


In [5]:
#shape of the target (labels)
print(data.target.shape)

(20640,)


In [6]:
#important info about the dataset
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [7]:
#what the target values look like
data.target[:40]

array([4.526, 3.585, 3.521, 3.413, 3.422, 2.697, 2.992, 2.414, 2.267,
       2.611, 2.815, 2.418, 2.135, 1.913, 1.592, 1.4  , 1.525, 1.555,
       1.587, 1.629, 1.475, 1.598, 1.139, 0.997, 1.326, 1.075, 0.938,
       1.055, 1.089, 1.32 , 1.223, 1.152, 1.104, 1.049, 1.097, 0.972,
       1.045, 1.039, 1.914, 1.76 ])

# 2 - Preprocess the dataset

Since this dataset is already preprocessed (all features are numeric and there are no missing values), we don't have to do anything in this phase.

# 3 - Train a model

In [8]:
from sklearn.linear_model import LinearRegression
#create a linear regression object
lin_reg = LinearRegression()
#train a model
lin_reg.fit(data.data, data.target)

LinearRegression()

In [9]:
#learned weights
lin_reg.coef_

array([ 4.36693293e-01,  9.43577803e-03, -1.07322041e-01,  6.45065694e-01,
       -3.97638942e-06, -3.78654265e-03, -4.21314378e-01, -4.34513755e-01])

In [10]:
#learned intercept
lin_reg.intercept_

-36.94192020718439
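
The learned weights and the intercept fully describe the model: a prediction is the dot product of the weights with a feature vector, plus the intercept. A minimal sketch in plain Python using the values printed above; the helper `linear_predict` and the feature values in `sample` are hypothetical and only illustrate the arithmetic:

```python
# Weights and intercept copied from the outputs above.
# Feature order: MedInc, HouseAge, AveRooms, AveBedrms,
#                Population, AveOccup, Latitude, Longitude
coef = [4.36693293e-01, 9.43577803e-03, -1.07322041e-01, 6.45065694e-01,
        -3.97638942e-06, -3.78654265e-03, -4.21314378e-01, -4.34513755e-01]
intercept = -36.94192020718439


def linear_predict(weights, features, bias):
    """Return bias + sum(w_i * x_i), the formula a linear model applies."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical block group (made-up values, not a row copied from the dataset).
sample = [5.0, 30.0, 6.0, 1.0, 1000.0, 3.0, 37.5, -122.0]
print(linear_predict(coef, sample, intercept))  # in units of $100,000
```

Up to floating-point rounding, this should match what `lin_reg.predict` returns for the same feature vector.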

# 4 - Test a model

In [19]:
# we can use the model to predict as follows
lin_reg.predict(data.data[5].reshape(1,-1))  #sixth sample (index 5); predict expects a 2D array

array([2.67527702])

In [20]:
#let's see what the true value was
data.target[5]  # Pretty close :)

2.697

In [13]:
#find the mean squared error (measured on the same data the model was trained on)
from sklearn.metrics import mean_squared_error
mean_squared_error(data.target, lin_reg.predict(data.data))

0.5243209861846072
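
Because the target is expressed in units of $100,000, the MSE is in squared units and is hard to read directly. Taking the square root (RMSE) brings the error back to the target's scale. A small sketch of that conversion, reusing the MSE value printed above; the variable names are illustrative:

```python
import math

mse = 0.5243209861846072        # value printed above, in ($100,000)^2
rmse = math.sqrt(mse)           # back in units of $100,000
rmse_dollars = rmse * 100_000   # convert to dollars
print(f"RMSE: {rmse:.4f} (about ${rmse_dollars:,.0f})")
```

The result is roughly 0.72, so a typical prediction error is on the order of $72,000.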

In [14]:
#let us calculate the MSE from scratch to make sure it's correct
import numpy as np
np.mean((lin_reg.predict(data.data) - data.target) ** 2)

0.5243209861846072
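
Note that the error above is measured on the same samples the model was trained on, so it tends to be optimistic. A common alternative is to hold out part of the data for testing. The following is a sketch of that idea, not what this notebook does; it uses synthetic stand-in data (an assumption, so it runs without downloading anything), and in the notebook you would pass `data.data` and `data.target` instead:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 samples, 8 features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
true_w = rng.normal(size=8)
y = X @ true_w + rng.normal(scale=0.5, size=500)

# Hold out 20% of the samples; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

The test MSE is the more honest estimate of how the model behaves on unseen houses.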

# 5 - Deploy a model

We can use the **predict** method to predict the median house value for new samples.
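
To call **predict** outside the notebook, the fitted model first has to be saved and loaded again. One possible approach, sketched here with Python's standard `pickle` module and synthetic stand-in data (both are assumptions for illustration, and the same scikit-learn version is assumed when loading):

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for data.data / data.target (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X.sum(axis=1)

model = LinearRegression().fit(X, y)

# Serialize the fitted model to bytes; in practice these could be written to a file.
blob = pickle.dumps(model)

# Later, for example in a separate serving process:
restored = pickle.loads(blob)
new_sample = X[0].reshape(1, -1)  # predict expects shape (n_samples, n_features)
print(restored.predict(new_sample))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.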

As you can see, the main benefit of these libraries is that we do not have to worry about the internal algorithms. They do this work for us.