### TO DO:
- Precalculus vector revision
- Euclidean norm
- Manhattan norm



# California Housing Prices

<b>Task:</b>
Use California census data to build a model of housing prices in the state. <br>
Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

<b>Additional information:</b>
This data includes metrics such as the population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will call them “districts” for short.

## Machine Learning project checklist:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune model.
7. Present solution.
8. Launch system.

# 1. Look at the big picture.

### Pipeline:

![A Machine Learning pipeline for real estate investments](../resources/images/housing/machine-learning-pipeline-for-real-estate-inv.png)

### Current solution:
The district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district, and when they cannot get the median housing price, they estimate it using complex rules.

This is costly and time-consuming, and their estimates are not great; in cases where they manage to find out the actual median housing price, they often realize that their estimates were off by more than 20%. 

### Frame:
- Typical <b>supervised learning task</b>, since you are given labeled training examples (each instance comes with the expected output, i.e., the district’s median housing price).
- Typical <b>regression task</b>, since you are asked to predict a value. More specifically, this is a <b>multiple regression problem</b>, since the system will use multiple features to make a prediction (it will use the district’s population, the median income, etc.).
- It is also a <b>univariate regression problem</b>, since we are only trying to predict a single value for each district. If we were trying to predict multiple values per district, it would be a <b>multivariate regression problem</b>.
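- Plain <b>batch learning</b> is sufficient, since there is no continuous flow of incoming data and the dataset is small enough to fit in memory.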
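
To make this framing concrete, here is a minimal sketch of the array shapes involved. It assumes NumPy and uses three rows copied from the dataset preview shown later in this notebook; the names `X`, `y`, `n_features`, and `n_targets` are illustrative and not part of the original code.

```python
import numpy as np

# Features for three districts: population and median_income
# (values copied from the first rows of the housing data).
X = np.array([
    [322.0, 8.3252],
    [2401.0, 8.3014],
    [496.0, 7.2574],
])

# Labels: median_house_value for the same three districts.
y = np.array([452600.0, 358500.0, 352100.0])

# Multiple regression: more than one feature per instance.
n_features = X.shape[1]
# Univariate regression: a single target value per instance.
n_targets = 1 if y.ndim == 1 else y.shape[1]

print(n_features, n_targets)
```

A multivariate problem would instead have a two-dimensional `y`, with one column per predicted value.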

### Select a performance measure

**Root Mean Square Error** (how much error the system typically makes in its predictions, with a higher weight for large errors):
<br>
<center>
$\mathrm{RMSE}(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2}$
</center> 
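
As a rough sketch, the formula can be computed with NumPy as follows. Here `predictions` stands for the values $h(x^{(i)})$, `labels` for the values $y^{(i)}$, and $m$ is the number of instances. The function name `rmse` and the sample numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def rmse(predictions, labels):
    """Root mean square error between predicted and actual values."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Square each error, average over the m instances, then take the root.
    return float(np.sqrt(np.mean((predictions - labels) ** 2)))

# Hypothetical values, not taken from the housing dataset.
labels = [100.0, 200.0, 300.0]
predictions = [110.0, 190.0, 330.0]

print(rmse(predictions, labels))
```

Because the errors are squared before averaging, the single error of 30 contributes nine times as much to the sum as each error of 10.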

# 2. Implementation

In [27]:
# Imports
import os
import tarfile
import urllib.request
import pandas as pd

In [23]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"

DIR_STRUCTURE = ["..", "resources", "data", "housing"]
HOUSING_DATA_PATH = os.path.join(*DIR_STRUCTURE)
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

In [24]:
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_DATA_PATH):
    """Download housing.tgz into housing_path and extract it there."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [28]:
def load_housing_data(housing_path=HOUSING_DATA_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [25]:
fetch_housing_data()
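
Calling `fetch_housing_data()` downloads the archive again on every run. One possible way to avoid that is a small guard that only fetches when the target file is missing. The helper name `fetch_if_missing` is hypothetical and not part of the original notebook; this is a sketch under that assumption.

```python
import os

def fetch_if_missing(file_path, fetch):
    """Call fetch() only when file_path does not exist yet.

    Returns True if fetch() was called, False if the file was already there.
    """
    if os.path.exists(file_path):
        return False
    fetch()
    return True

# Possible usage in this notebook (assumes the definitions above):
# fetch_if_missing(os.path.join(HOUSING_DATA_PATH, "housing.csv"), fetch_housing_data)
```

The guard checks only that the file exists, not that it is complete or current, so a partial download would still need to be deleted by hand.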

In [29]:
housing = load_housing_data()
housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
