# Real Estate estimator

We will try to figure out whether there exists a ***linear relationship*** between:
- the **price** of a flat (our **target** for each flat)
- and some usual **features** such as surface area, number of bedrooms, etc.

**Using only numpy**

In [None]:
import numpy as np

🌆 Suppose that we were able to collect data for the 4 flats below:
- their **features**:
    - `surface` (square feet)
    - `bedrooms`
    - `floors` 
- their **target**:
    - `price` (in thousands of USD)

|flats |surface (square feet)|bedrooms|floors|price (k USD)|
|------|-------------|--------|------|------------|
|flat1 |620|1|1|244|
|flat2 |3280|4|2|671|
|flat3 |1900|2|2|504|
|flat4 |1320|3|3|510|

A first approach to **predict the price of an apartment** is to try to **find a linear relationship between the target and the features** (*i.e. between the price and the (surface, bedrooms, floors)*), by solving the following **system of $n = 4$ linear equations with $p = 4$ unknowns**:



$$\begin{cases}
    244 = \theta_0 + 620\theta_1 + 1\theta_2 + 1\theta_3 \\
    671 = \theta_0 + 3280\theta_1 + 4\theta_2 + 2\theta_3 \\
    504 = \theta_0 + 1900\theta_1 + 2\theta_2 + 2\theta_3 \\
    510 = \theta_0 + 1320\theta_1 + 3\theta_2 + 3\theta_3 \\
\end{cases}$$

which can be written as a matrix equation:

$$\boldsymbol y = \boldsymbol {X \cdot \theta}$$

$$\begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}_{4 \times 1} = \begin{bmatrix}
    1 & 620 & 1 & 1 \\
    1 & 3280 & 4 & 2 \\
    1 & 1900 & 2 & 2 \\
    1 & 1320 & 3 & 3
\end{bmatrix}_{4 \times 4} \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}_{4 \times 1}$$

where:
* $\boldsymbol y$ is the **`target`**, the vector of `price` values
* $\boldsymbol X$ represents the **`matrix of features`**
* $\boldsymbol {\theta} = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$ (*theta*) is the vector of **coefficients/variables/unknowns** to be found
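
To make the matrix form concrete, here is a minimal sketch checking that one row of $\boldsymbol X \cdot \boldsymbol \theta$ reproduces the corresponding equation of the system. The vector `theta_guess` is an arbitrary placeholder chosen only for illustration; it is not the solution.

```python
import numpy as np

# Matrix of features with the leading column of ones (one row per flat)
X = np.array([
    [1, 620, 1, 1],
    [1, 3280, 4, 2],
    [1, 1900, 2, 2],
    [1, 1320, 3, 3],
])

# Arbitrary placeholder coefficients (NOT the solution of the system)
theta_guess = np.array([[1.0], [0.1], [2.0], [3.0]])

# Matrix product: one value per flat, shape (4, 1)
matrix_form = X @ theta_guess

# Same computation written out by hand for flat1 (first equation)
flat1_by_hand = 1.0 + 620 * 0.1 + 1 * 2.0 + 1 * 3.0

print(matrix_form.shape)
print(np.isclose(matrix_form[0, 0], flat1_by_hand))
```

Each row of `X @ theta_guess` is the right-hand side of one equation, which is why the whole system fits in a single matrix product.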

----

🤓 Here, we are using the Greek letter `theta` $\boldsymbol \theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3 \\
\end{bmatrix}$, to represent the coefficients of our **features**:

- A flat with no surface, no bedroom and no floor would cost $\theta_0$ thousand USD (the intercept)
- An increase of one square foot - *holding the number of bedrooms and the number of floors constant* - would increase the price by $\theta_1$ thousand USD
- An additional bedroom - *holding the surface and the number of floors constant* - would increase the price by $\theta_2$ thousand USD
- An additional floor - *holding the surface and the number of bedrooms constant* - would increase the price by $\theta_3$ thousand USD

----

😉 If we manage to solve this system of linear equations (i.e. if we find $\theta_0$, $\theta_1$, $\theta_2$, $\theta_3$), the price of any new flat could be estimated using the following formula: $$y_{newflat} = \boldsymbol x_{newflat} \cdot \boldsymbol \theta$$

## (1) Define the matrix $\boldsymbol X$ of `features`:

Create a $(4,3)$ `numpy.ndarray` storing the values of the 3 features (surface, bedrooms, floors) for the 4 observations. 

In [None]:
X = np.array([
    [620,1,1],
    [3280,4,2],
    [1900,2,2],
    [1320,3,3]])
X

array([[ 620,    1,    1],
       [3280,    4,    2],
       [1900,    2,    2],
       [1320,    3,    3]])

Double-check the ***shape***, the ***number of dimensions*** (`ndim`) and the ***dtype*** of this ***np.array***

In [None]:
print(X.shape)
print(X.ndim)
print(X.dtype)

(4, 3)
2
int64
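
The cell above prints `shape`, `ndim` and `dtype`. As a small complement, the `size` attribute gives the total number of elements; the sketch below rebuilds the same $(4,3)$ array so that it runs on its own.

```python
import numpy as np

# Same (4, 3) matrix of raw features as above
X = np.array([
    [620, 1, 1],
    [3280, 4, 2],
    [1900, 2, 2],
    [1320, 3, 3],
])

# size is the total number of elements: rows * columns
print(X.size)
print(X.size == X.shape[0] * X.shape[1])
```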


Add a "constant" vector of ones $\boldsymbol x_0 = \begin{bmatrix}
    1 \\
    1 \\
    1 \\
    1 \\
\end{bmatrix}$ to create the $(4,4)$ matrix $\boldsymbol X$ representing the linear system of equations.

<details>
    <summary><i>Explanations</i></summary>

🤔 As you've probably noticed, the linear system of equations includes a $\theta_0$ coefficient which appears in the 4 equations. 

❗️ We need an additional feature to represent the y-intercept of the linear regression line 

_Note_: strictly speaking, because of this intercept, the relation between the `price` and the features is an [affine relation](https://math.stackexchange.com/questions/275310/what-is-the-difference-between-linear-and-affine-function) rather than a strictly linear one (_Cf. Decision Science Module_)
    
    
</details>

In [None]:
# Define x0 as a (4,1) column vector filled with ones, using the dedicated NumPy constructor
x0 = np.ones((4, 1))
x0

array([[1.],
       [1.],
       [1.],
       [1.]])

In [None]:
# Use `numpy.hstack` to create the (4,4) matrix X by concatenating x0 to your previous (4,3) matrix
X = np.hstack((x0,X))
X

array([[1.00e+00, 6.20e+02, 1.00e+00, 1.00e+00],
       [1.00e+00, 3.28e+03, 4.00e+00, 2.00e+00],
       [1.00e+00, 1.90e+03, 2.00e+00, 2.00e+00],
       [1.00e+00, 1.32e+03, 3.00e+00, 3.00e+00]])
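
Note that the result is displayed in scientific notation because `np.ones` returns floats, so the stacked matrix is promoted to a float dtype. As a hedged alternative sketch, `np.column_stack` accepts a 1-D vector of ones directly and should give the same matrix; the names `features`, `X_hstack` and `X_colstack` are only illustrative.

```python
import numpy as np

# Raw (4, 3) matrix of features, stored as integers
features = np.array([
    [620, 1, 1],
    [3280, 4, 2],
    [1900, 2, 2],
    [1320, 3, 3],
])

# Approach used above: stack a (4, 1) column of ones to the left
X_hstack = np.hstack((np.ones((4, 1)), features))

# Alternative: column_stack treats a 1-D array as a single column
X_colstack = np.column_stack((np.ones(4), features))

print(X_hstack.dtype)                        # integers were promoted to floats
print(np.array_equal(X_hstack, X_colstack))  # both approaches agree
```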

## (2) Define the vector $\boldsymbol y$ of `Prices`

$\boldsymbol y  = \begin{bmatrix}
    244 \\
    671 \\
    504 \\
    510
\end{bmatrix}$

In order to match our matrix representation $\boldsymbol y  = \boldsymbol {X\cdot \theta}$, what should the shape of $\boldsymbol y$ be? Define $\boldsymbol y$ down below. ❓

<details>
    <summary><i>Hint</i></summary>

$\boldsymbol y$ should be a $(4,1)$ array, i.e. a column vector (4 rows, 1 column) rather than a flat 1-D array of shape $(4,)$
</details>

In [None]:
y = np.array([
    [244],
    [671],
    [504],
    [510]])
y.shape

(4, 1)
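
If the prices are first available as a flat 1-D array, a minimal sketch of the conversion to a column vector could look like this (`y_flat` and `y_col` are illustrative names):

```python
import numpy as np

# Prices as a flat 1-D array of shape (4,)
y_flat = np.array([244, 671, 504, 510])

# reshape(-1, 1): one column, NumPy infers the number of rows
y_col = y_flat.reshape(-1, 1)

print(y_flat.shape)
print(y_col.shape)
```

The values are unchanged; only the shape differs, which matters once $\boldsymbol y$ takes part in matrix products.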

## (3) Find the solution of the system

⏰ Now, it's time to find the vector of coefficients $\boldsymbol \theta = \begin{bmatrix}
    \theta_0 \\
    \theta_1 \\
    \theta_2 \\
    \theta_3
\end{bmatrix}$!

👍 Provided that $\boldsymbol X$ is invertible, the solution of the equation is:
 
$$ \large \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y 
\large \iff \boldsymbol X^{-1} \cdot \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol X^{-1} \cdot \boldsymbol y 
\large \iff \boldsymbol \theta = \boldsymbol X^{-1} \cdot \boldsymbol y
$$

where $\large \boldsymbol X^{-1}$ is the inverse of $\large \boldsymbol X$.
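
The inverse only exists when $\boldsymbol X$ is square and has full rank. One possible sanity check, sketched here with `np.linalg.matrix_rank`, is to confirm this before inverting; the variable names `rank` and `is_invertible` are illustrative.

```python
import numpy as np

# (4, 4) matrix of the system, including the column of ones
X = np.array([
    [1.0, 620.0, 1.0, 1.0],
    [1.0, 3280.0, 4.0, 2.0],
    [1.0, 1900.0, 2.0, 2.0],
    [1.0, 1320.0, 3.0, 3.0],
])

# Numerical rank of X (number of linearly independent rows/columns)
rank = np.linalg.matrix_rank(X)

# A matrix is invertible when it is square and its rank equals its size
is_invertible = X.shape[0] == X.shape[1] and rank == X.shape[0]

print(rank, is_invertible)
```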

In [None]:
# Compute the inverse of the matrix X with the right NumPy method
X_inverse = np.linalg.inv(X)
X_inverse

array([[ 1.64516129e+00, -7.51278738e-18, -2.90322581e-01,
        -3.54838710e-01],
       [-5.37634409e-04, -1.66950831e-19,  1.07526882e-03,
        -5.37634409e-04],
       [ 3.70967742e-01,  5.00000000e-01, -1.24193548e+00,
         3.70967742e-01],
       [-6.82795699e-01, -5.00000000e-01,  8.65591398e-01,
         3.17204301e-01]])

You can check that the inversion worked by testing the following equality:

$$\boldsymbol X^{-1} \cdot\boldsymbol X = \boldsymbol I_4$$
where $\boldsymbol I_4$ is the $ 4 \times 4 $ identity matrix $ \begin{bmatrix}
    1 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 \\
    0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 1
\end{bmatrix}$

In [None]:
# Define I4 using the right NumPy method
I4 = np.eye(4)
I4

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

Now compute $\boldsymbol X^{-1} \boldsymbol X$:

In [None]:
X_invX = np.dot(X_inverse, X)  # matrix product, NOT the element-wise product X_inverse * X
X_invX

array([[ 1.00000000e+00, -2.62012634e-13, -3.33066907e-16,
        -3.33066907e-16],
       [-1.08420217e-19,  1.00000000e+00, -1.08420217e-19,
         1.08420217e-19],
       [ 5.55111512e-17,  2.30482300e-13,  1.00000000e+00,
        -1.66533454e-16],
       [ 1.66533454e-16, -1.79856130e-13,  2.77555756e-16,
         1.00000000e+00]])

Up to floating-point rounding errors (off-diagonal values of the order of $10^{-13}$ or smaller), does it look like $\boldsymbol I_4$?


In [None]:
np.allclose(X_invX, I4)

True
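
`np.allclose` is used here rather than an exact comparison because floating-point arithmetic introduces tiny rounding errors. A classic, self-contained illustration of the same idea on scalars:

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in binary floating point
total = 0.1 + 0.2

exact_match = (total == 0.3)          # strict equality fails
close_match = np.allclose(total, 0.3)  # equality up to a small tolerance

print(exact_match, close_match)
```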

🎉 You can finally compute `theta` using the formula $ \large \boldsymbol \theta = \boldsymbol X^{-1}\cdot \boldsymbol y $:

In [None]:
theta = np.dot(X_inverse, y)
theta

array([[ 74.12903226],
       [  0.13655914],
       [-10.72580645],
       [ 95.93010753]])
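
As a side note, computing the inverse explicitly is not the only option. The sketch below uses `np.linalg.solve`, which solves $\boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y$ directly without forming $\boldsymbol X^{-1}$, and checks that both routes agree. The names `theta_inv` and `theta_solve` are illustrative.

```python
import numpy as np

X = np.array([
    [1.0, 620.0, 1.0, 1.0],
    [1.0, 3280.0, 4.0, 2.0],
    [1.0, 1900.0, 2.0, 2.0],
    [1.0, 1320.0, 3.0, 3.0],
])
y = np.array([[244.0], [671.0], [504.0], [510.0]])

# Route 1: explicit inverse, as in the cells above
theta_inv = np.linalg.inv(X) @ y

# Route 2: solve the linear system directly
theta_solve = np.linalg.solve(X, y)

print(np.allclose(theta_inv, theta_solve))  # both routes give the same theta
print(np.allclose(X @ theta_solve, y))      # theta reproduces the 4 prices
```

In practice `np.linalg.solve` is often preferred for numerical stability, although for a small system like this one the difference is negligible.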

## (4) Estimation of a new flat price

You have solved the system and found $\boldsymbol \theta$, so you can now estimate the `price` (in thousands of USD) of a 5th flat with the following characteristics:

- `surface`: 3000 $\text{ft}^2$
- `bedrooms`: 5
- `floors`: 1

with the following formula:

$$y_{flat5} = \boldsymbol x_{flat5} \cdot \boldsymbol \theta$$

In [None]:
# Define x5
x5 = np.array([[1,3000,5,1]])

# Compute y5
# You should find a price of about 526 (in thousands of USD), i.e. roughly 526,000 USD
y5 = np.dot(x5, theta)
y5

array([[526.10752688]])
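
The same computation can be wrapped in a small helper for any new flat. This is only a sketch: `predict_price` is a hypothetical function name introduced here, and `theta` is recomputed so that the block runs on its own.

```python
import numpy as np

X = np.array([
    [1.0, 620.0, 1.0, 1.0],
    [1.0, 3280.0, 4.0, 2.0],
    [1.0, 1900.0, 2.0, 2.0],
    [1.0, 1320.0, 3.0, 3.0],
])
y = np.array([[244.0], [671.0], [504.0], [510.0]])
theta = np.linalg.solve(X, y)


def predict_price(surface, bedrooms, floors, theta):
    """Return the estimated price (in thousands of USD) of one flat."""
    # Leading 1 multiplies the intercept theta_0
    x_new = np.array([[1.0, surface, bedrooms, floors]])
    return (x_new @ theta).item()


price_flat5 = predict_price(3000, 5, 1, theta)
print(round(price_flat5, 2))
```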

## (5) Reality-check

In reality, the price of a flat is never entirely determined by its surface, number of bedrooms and number of floors.

Let's imagine that the real price $y_{flat5}$ turns out to be 700,000 USD instead of the predicted 526,000 USD.

Could we take this new information into account to improve our model?

Update the linear system of equations $ \large \boldsymbol X \cdot \boldsymbol \theta = \boldsymbol y$ to incorporate the information carried by this new flat.

In [None]:
# Create the new matrix of features X of shape (5,4)
# Print the shape to double-check the shape is indeed (5,4)
new_flat = np.array([[1, 3000, 5, 1]])
X2 = np.vstack((X, new_flat))
X2.shape

(5, 4)

In [None]:
# Create new y of shape (5,1)
y2 = np.vstack((y, np.array([[700]])))
y2.shape

(5, 1)
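
With 5 equations and only 4 unknowns, the updated matrix is no longer square, so it has no inverse, and since the previous $\boldsymbol \theta$ predicts 526 rather than 700 for the 5th flat, no $\boldsymbol \theta$ satisfies all 5 equations exactly. One possible way forward, sketched here as an assumption rather than as the intended next step of this exercise, is a least-squares fit with `np.linalg.lstsq`, which looks for the $\boldsymbol \theta$ minimizing the sum of squared errors over the 5 flats.

```python
import numpy as np

# Updated system: 5 flats, 4 unknowns
X2 = np.array([
    [1.0, 620.0, 1.0, 1.0],
    [1.0, 3280.0, 4.0, 2.0],
    [1.0, 1900.0, 2.0, 2.0],
    [1.0, 1320.0, 3.0, 3.0],
    [1.0, 3000.0, 5.0, 1.0],
])
y2 = np.array([[244.0], [671.0], [504.0], [510.0], [700.0]])

# Previous theta: exact solution for the first 4 flats only
theta_old = np.linalg.solve(X2[:4], y2[:4])

# Least-squares theta: best compromise over the 5 flats
theta_ls, _, rank, _ = np.linalg.lstsq(X2, y2, rcond=None)

# Sum of squared errors over the 5 flats for each candidate
sse_old = float(np.sum((y2 - X2 @ theta_old) ** 2))
sse_ls = float(np.sum((y2 - X2 @ theta_ls) ** 2))

print(theta_ls.shape, rank)
print(sse_ls <= sse_old)
```

The least-squares coefficients no longer reproduce the first 4 prices exactly; they trade a little error on each flat for a smaller total error.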