# California housing prices


### The purpose of this project is to build a model of housing prices in California using the California census dataset (data from 1990). This dataset has metrics for each block group (district), which is the smallest geographical unit for which data was published.
### The model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

### Main steps that will be covered:
1. __Frame the problem and look at the big picture__
2. __Get the data__
3. __Discover and visualize the data to gain insights__
4. __Prepare the data for ML algorithms__
5. __Select a model and train it__
6. __Fine-tune the model__
7. __Present the solution__
8. __Launch, monitor, and maintain the system__


## 1. Frame the problem and look at the big picture

### 1. 1 Define the objective in business terms:
__The output of the model will be used as input for another ML system, along with other signals. The downstream system will determine whether it is worth investing in a given area or not.__

### 1. 2 What are the current solutions (if any)?
__The district housing prices are currently estimated manually by experts. The process is expensive and time-consuming, and the estimates are not accurate (off by more than 10%).__


### 1. 3 How should you frame the problem (supervised/unsupervised, batch/online etc)?
__This is a supervised learning problem because the data is labeled: each district comes with its median housing price. It is a regression task because the model predicts a continuous value, and more specifically a multiple regression because the prediction is based on several features. As the system does not need to adapt rapidly to changing data, batch learning can be used.__

### 1. 4 How should the performance be measured?
__We can use the RMSE (Root Mean Squared Error), which corresponds to the l2 norm of the error vector. Because the errors are squared before averaging, it gives a higher weight to large errors.__






In [1]:
from IPython.display import Math
Math(r'RMSE(X, h) = \sqrt{\frac1m \sum_{i=1}^m (h(x^{(i)}) - y^{(i)})^{2}}')

$$\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(h(x^{(i)}) - y^{(i)}\right)^{2}}$$
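
To make the "higher weight for large errors" point concrete, here is a minimal NumPy sketch of the RMSE formula above, compared with the mean absolute error (MAE). The helper names `rmse` and `mae` and the toy values are assumptions made for illustration; they are not taken from the housing dataset.

```python
import numpy as np

def rmse(predictions, labels):
    """Root mean squared error: sqrt of the mean of squared differences."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.sqrt(np.mean((predictions - labels) ** 2)))

def mae(predictions, labels):
    """Mean absolute error, shown only as a point of comparison."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean(np.abs(predictions - labels)))

# Toy labels (hypothetical values, not real housing prices).
labels = np.array([100.0, 100.0, 100.0, 100.0])

# Same total absolute error (40), spread evenly vs. concentrated in one point.
even_errors = labels + np.array([10.0, 10.0, 10.0, 10.0])
one_outlier = labels + np.array([0.0, 0.0, 0.0, 40.0])

print(rmse(even_errors, labels), mae(even_errors, labels))  # 10.0 10.0
print(rmse(one_outlier, labels), mae(one_outlier, labels))  # 20.0 10.0
```

Both prediction sets have the same MAE, but the RMSE doubles when the error is concentrated in a single district, which is why RMSE is more sensitive to outliers.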

### 1. 5 Check the assumptions

## 2. Get the data


In [2]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_FILE_NAME = "housing.tgz"
# Build the URL with plain concatenation: os.path.join would insert backslashes on Windows.
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/" + HOUSING_FILE_NAME

HOUSING_SAVE_PATH = os.path.join("datasets", "housing")

def fetch_data(data_url=HOUSING_URL, save_path=HOUSING_SAVE_PATH, file_name=HOUSING_FILE_NAME):
    """Download the housing archive and extract it into save_path."""
    # Create the target directory if it does not exist yet.
    os.makedirs(save_path, exist_ok=True)

    file_path = os.path.join(save_path, file_name)
    urllib.request.urlretrieve(data_url, file_path)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(file_path) as archived_data:
        archived_data.extractall(path=save_path)

In [3]:
fetch_data()

In [4]:
import pandas as pd

def load_data(path=HOUSING_SAVE_PATH, file_name="housing.csv"):
    """Read the extracted CSV file into a pandas DataFrame."""
    complete_path = os.path.join(path, file_name)
    return pd.read_csv(complete_path)

In [5]:
data = load_data()
data.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
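
Since the goal is to predict `median_house_value` from all the other metrics, it may help to see what that split looks like on the rows shown above. The following is only a sketch: the five rows are typed in by hand so that it runs without downloading anything, and the names `sample`, `TARGET`, `features` and `labels` are assumptions chosen for illustration.

```python
import pandas as pd

# The five rows printed by data.head(), copied by hand so the sketch runs offline.
sample = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24, -122.25, -122.25],
    "latitude": [37.88, 37.86, 37.85, 37.85, 37.85],
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 52.0],
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, 1106.0, 190.0, 235.0, 280.0],
    "population": [322.0, 2401.0, 496.0, 558.0, 565.0],
    "households": [126.0, 1138.0, 177.0, 219.0, 259.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
    "ocean_proximity": ["NEAR BAY"] * 5,
})

# The column the model should learn to predict.
TARGET = "median_house_value"

# Everything except the target is an input feature.
features = sample.drop(columns=[TARGET])
labels = sample[TARGET]

print(features.shape)   # (5, 9)
print(labels.tolist())
```

The same two lines (`drop` and column selection) would apply unchanged to the full `data` DataFrame returned by `load_data()`.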
