# Machine Learning - Intro to Linear Regression
### Set up the environment

In [4]:
import pandas as pd
import numpy as np

### Getting the data:
For this project, I'll use the California Housing Prices dataset. 

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv")

### Exploratory Analysis

In [14]:
df.head()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


How many columns are in the dataset?

In [5]:
df.shape[1]

10

Which columns in the dataset have missing values?

In [6]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
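To list only the affected columns instead of scanning the full table, one option is to filter the counts. This is a minimal sketch on a small made-up frame; `toy` and its values are my own illustration, not the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "population": [322.0, 2401.0, 496.0],
})

# Count missing values per column, as in the cell above.
missing_counts = toy.isnull().sum()

# Keep only the columns that have at least one missing value.
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Applied to `df`, the same filter should leave only `total_bedrooms`, which matches the counts shown above.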

How many unique values does the `ocean_proximity` column have?

In [7]:
df["ocean_proximity"].nunique()

5

What's the average value of the `median_house_value` for the houses located near the bay?

In [9]:
df[df.ocean_proximity == 'NEAR BAY']['median_house_value'].mean()

259212.31179039303
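The boolean mask answers the question for one category. If I wanted the average for every `ocean_proximity` value at once, `groupby` is a possible alternative. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration) and shows that both approaches agree.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "ISLAND"],
    "median_house_value": [452600.0, 358500.0, 100000.0, 450000.0],
})

# Boolean mask: mean for a single category.
is_near_bay = toy["ocean_proximity"] == "NEAR BAY"
near_bay_mean = toy.loc[is_near_bay, "median_house_value"].mean()

# groupby: mean for every category in one pass.
mean_by_proximity = toy.groupby("ocean_proximity")["median_house_value"].mean()

print(near_bay_mean)
print(mean_by_proximity)
```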

### Now, let's:
1. Calculate the average of `total_bedrooms` column in the dataset.
2. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
3. Now, calculate the average of `total_bedrooms` again.
4. Has it changed?

In [35]:
df['total_bedrooms'].mean().round(3)

537.871

In [38]:
df['total_bedrooms'].fillna(df['total_bedrooms'].mean().round(3)).mean().round(3)

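537.871

The average has not changed (to three decimal places): replacing missing entries with the mean of the observed values leaves the mean where it was.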

In [39]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
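`total_bedrooms` still shows 207 missing values because `fillna` returns a new Series and the cell above never assigns it back to `df`. A minimal sketch of the assign-back pattern, on a made-up frame (`toy` and its values are assumptions, not the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, np.nan]})

# mean() skips NaN by default, so this is the mean of 100 and 300.
mean_before = toy["total_bedrooms"].mean()

# fillna returns a new Series; the original column is untouched.
filled = toy["total_bedrooms"].fillna(mean_before)
missing_before_assign = int(toy["total_bedrooms"].isnull().sum())

# Assign the result back to actually update the frame.
toy["total_bedrooms"] = filled
missing_after_assign = int(toy["total_bedrooms"].isnull().sum())
mean_after = toy["total_bedrooms"].mean()

print(missing_before_assign, missing_after_assign, mean_before, mean_after)
```

The rest of this notebook does not depend on the fill, since the island rows used below have no missing values.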

### Next:
1. I selected all the rows located on islands.

In [10]:
df['ocean_proximity']=='ISLAND'

0        False
1        False
2        False
3        False
4        False
         ...  
20635    False
20636    False
20637    False
20638    False
20639    False
Name: ocean_proximity, Length: 20640, dtype: bool

In [11]:
df[(df['ocean_proximity']=='ISLAND')]

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
8314,-118.32,33.35,27.0,1675.0,521.0,744.0,331.0,2.1579,450000.0,ISLAND
8315,-118.33,33.34,52.0,2359.0,591.0,1100.0,431.0,2.8333,414700.0,ISLAND
8316,-118.32,33.33,52.0,2127.0,512.0,733.0,288.0,3.3906,300000.0,ISLAND
8317,-118.32,33.34,52.0,996.0,264.0,341.0,160.0,2.7361,450000.0,ISLAND
8318,-118.48,33.43,29.0,716.0,214.0,422.0,173.0,2.6042,287500.0,ISLAND


In [12]:
X = df[(df['ocean_proximity']=='ISLAND')]

In [13]:
X

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
8314,-118.32,33.35,27.0,1675.0,521.0,744.0,331.0,2.1579,450000.0,ISLAND
8315,-118.33,33.34,52.0,2359.0,591.0,1100.0,431.0,2.8333,414700.0,ISLAND
8316,-118.32,33.33,52.0,2127.0,512.0,733.0,288.0,3.3906,300000.0,ISLAND
8317,-118.32,33.34,52.0,996.0,264.0,341.0,160.0,2.7361,450000.0,ISLAND
8318,-118.48,33.43,29.0,716.0,214.0,422.0,173.0,2.6042,287500.0,ISLAND


2. I selected only columns `housing_median_age`, `total_rooms`, `total_bedrooms`.
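3. I took the underlying data and called it `X`. Note that the code below keeps `X` as a pandas DataFrame; calling `X.to_numpy()` would return the underlying NumPy array, and the matrix arithmetic in the following steps produces the same numbers either way.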

In [14]:
X = X[['housing_median_age','total_rooms','total_bedrooms']]

In [86]:
X

,housing_median_age,total_rooms,total_bedrooms
8314,27.0,1675.0,521.0
8315,52.0,2359.0,591.0
8316,52.0,2127.0,512.0
8317,52.0,996.0,264.0
8318,29.0,716.0,214.0


4. I computed the matrix-matrix product of the transpose of `X` and `X` itself. To get the transpose, use `X.T`. Let's call the result `XTX`.


In [87]:
Xt = X.T

In [88]:
Xt

,8314,8315,8316,8317,8318
housing_median_age,27.0,52.0,52.0,52.0,29.0
total_rooms,1675.0,2359.0,2127.0,996.0,716.0
total_bedrooms,521.0,591.0,512.0,264.0,214.0


In [89]:
XTX = Xt.dot(X)

In [90]:
XTX

,housing_median_age,total_rooms,total_bedrooms
housing_median_age,9682.0,351053.0,91357.0
total_rooms,351053.0,14399307.0,3772036.0
total_bedrooms,91357.0,3772036.0,998358.0
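The same product can be reproduced with plain NumPy as a sanity check. The sketch below copies the five island rows from the table above into an array; the names `X_arr` and `XTX_arr` are mine, chosen to avoid clashing with the notebook's `X` and `XTX`.

```python
import numpy as np

# Island rows copied from the table above.
# Columns: housing_median_age, total_rooms, total_bedrooms
X_arr = np.array([
    [27.0, 1675.0, 521.0],
    [52.0, 2359.0, 591.0],
    [52.0, 2127.0, 512.0],
    [52.0, 996.0, 264.0],
    [29.0, 716.0, 214.0],
])

# (3 x 5) @ (5 x 3) gives a 3 x 3 matrix.
XTX_arr = X_arr.T @ X_arr

# Entry (i, j) is the dot product of column i and column j of X,
# so the result is symmetric by construction.
is_symmetric = np.array_equal(XTX_arr, XTX_arr.T)

print(XTX_arr)
print(is_symmetric)
```

The top-left entry, for example, is the sum of squared `housing_median_age` values: 27² + 52² + 52² + 52² + 29² = 9682.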


5. I computed the inverse of `XTX`.

In [91]:
Xinv = np.linalg.inv(XTX)

In [92]:
Xinv

array([[ 9.19403586e-04, -3.66412216e-05,  5.43072261e-05],
       [-3.66412216e-05,  8.23303633e-06, -2.77534485e-05],
       [ 5.43072261e-05, -2.77534485e-05,  1.00891325e-04]])
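A quick way to check an inverse is to multiply it by the original matrix and compare the result with the identity. This sketch hard-codes the `XTX` values printed earlier; `XTX_arr` and `XTX_inv` are names I introduce for illustration.

```python
import numpy as np

# XTX values copied from the output shown earlier in this notebook.
XTX_arr = np.array([
    [9682.0, 351053.0, 91357.0],
    [351053.0, 14399307.0, 3772036.0],
    [91357.0, 3772036.0, 998358.0],
])

XTX_inv = np.linalg.inv(XTX_arr)

# A matrix times its inverse should be the identity,
# up to floating-point rounding error.
product = XTX_arr @ XTX_inv
is_identity = np.allclose(product, np.eye(3))

print(is_identity)
```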

6. I created an array `y` with values `[950, 1300, 800, 1000, 1300]`.

In [108]:
y = np.array([950, 1300, 800, 1000, 1300])

In [109]:
y

array([ 950, 1300,  800, 1000, 1300])

7. I multiplied the inverse of `XTX` by the transpose of `X`, and then multiplied the result by `y`. Let's call the result `w`.
8. What's the value of the last element of `w`?

In [110]:
w = (Xinv @ X.T) @ y

In [111]:
w

0    23.123310
1    -1.481242
2     5.699229
dtype: float64

The last element of `w` is approximately 5.6992.
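Steps 4 to 7 together compute the normal-equation solution for least squares:

$$w = (X^\top X)^{-1} X^\top y$$

As a cross-check, the sketch below recomputes `w` in plain NumPy and compares it with two alternatives that avoid forming the inverse explicitly: `np.linalg.solve` and `np.linalg.lstsq`. The array `X_arr` holds the island rows copied from the table above, and the `w_*` names are my own.

```python
import numpy as np

# Island rows copied from the table above.
# Columns: housing_median_age, total_rooms, total_bedrooms
X_arr = np.array([
    [27.0, 1675.0, 521.0],
    [52.0, 2359.0, 591.0],
    [52.0, 2127.0, 512.0],
    [52.0, 996.0, 264.0],
    [29.0, 716.0, 214.0],
])
y = np.array([950, 1300, 800, 1000, 1300])

# Normal equation with an explicit inverse, as in the notebook.
w_normal = np.linalg.inv(X_arr.T @ X_arr) @ X_arr.T @ y

# Same system solved without forming the inverse.
w_solve = np.linalg.solve(X_arr.T @ X_arr, X_arr.T @ y)

# Least-squares solver applied directly to X and y.
w_lstsq, *_ = np.linalg.lstsq(X_arr, y, rcond=None)

print(w_normal)
print(w_solve)
print(w_lstsq)
```

All three should agree up to rounding. For larger or ill-conditioned problems, solving the system is generally preferred over an explicit inverse.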