<a href="https://colab.research.google.com/github/alexgrand/ml/blob/main/ch2/ch2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# House price prediction project

We are going to use the California Housing Prices dataset from the StatLib repository to predict house prices
in California. This data includes metrics such as the population, median income, and median housing price
for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau
publishes sample data (a block group typically has a population of 600 to 3,000 people). These block groups will be
called districts for simplicity.

## Frame the problem
1. **Business objective:** the predicted median housing price of a district will be fed to another machine learning
system, along with many other signals. The downstream system will determine whether it is worth investing in a
given area.
2. **Current solution:** the district housing prices are currently estimated manually by experts: a team gathers 
up-to-date information about a district, and when they cannot get the median housing price, they estimate it using 
complex rules. This is costly and time-consuming, and their estimates are not great; in cases where they manage to find
out the actual median housing price, they often realize that their estimates were off by more than 30%.
This is why the company thinks that it would be useful to train a model to predict a district’s median housing price, 
given other data about that district.
3. **Data:** labeled data. The census data looks like a great dataset to exploit for this purpose, since it includes
the median housing prices of thousands of districts, as well as other data.
4. **Training Model:**
    - This is clearly a typical __supervised learning task__, since the model can be trained
    with labeled examples (each instance comes with the expected output, i.e., the district’s median housing price).
    -  It is a typical __regression task__, since the model will be asked to predict a value. More specifically,
    this is a __multiple regression problem__, since the system will use multiple features to make a prediction (the 
    district’s population, the median income, etc.). It is also a __univariate__ regression problem, since we are only 
    trying to predict a single value for each district.
    - There is no continuous flow of data coming into the system, there is no particular need to adjust to changing
    data rapidly, and the data is small enough to fit in memory, so __plain batch learning__ should be enough.
5. **Select a Performance Measure:** a typical performance measure for regression problems is\
_the root mean square error (RMSE)_:\
$\text{RMSE}(\textbf{X},h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\textbf{x}^{(i)}) - y^{(i)}\right)^2}$.\
If there are many outlier districts, we could use the _mean absolute error (MAE)_ instead:\
$\text{MAE}(\textbf{X},h)=\frac{1}{m}\sum_{i=1}^{m}\left|h(\textbf{x}^{(i)}) - y^{(i)}\right|$.\
Here $m$ is the number of districts, $\textbf{x}^{(i)}$ is the feature vector of the $i$-th district, $y^{(i)}$ is its
label (the actual median housing price), and $h$ is the prediction function.
The RMSE corresponds to the Euclidean norm, or $\ell_2$ norm, denoted $\lVert \cdot \rVert_2$ (or just
$\lVert \cdot \rVert$). The MAE corresponds to the $\ell_1$ norm, denoted $\lVert \cdot \rVert_1$
(this is sometimes called the Manhattan norm because it measures the distance between two points in a city if you
can only travel along orthogonal city blocks).
<br>**NOTE:**<br>
The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is
more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve),
the RMSE performs very well and is generally preferred.
6. **Check the assumptions:** we talked to the team that works on the downstream system. It looks like they really do
need actual prices from our prediction model. If they instead converted our outputs into categories such as
`cheap`, `medium`, and `expensive`, we would need to frame the problem as a classification task.
7. **The minimum performance needed** to reach the business objective: N/A
8. **Any comparable problems, reusable models:** N/A
9. **Is human expertise available:** N/A
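
To make the two performance measures from step 5 concrete, here is a minimal sketch that computes the RMSE and the MAE in plain Python. The helper names `rmse` and `mae` and the price values are hypothetical, made up for illustration; they do not come from the dataset. The second half replaces one prediction with a badly wrong value to show how a single outlier affects each measure.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared error."""
    m = len(y_true)
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / m)


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute errors."""
    m = len(y_true)
    return sum(abs(p - y) for y, p in zip(y_true, y_pred)) / m


# Hypothetical median housing prices (USD) for five districts.
y_true = [200_000, 150_000, 300_000, 250_000, 100_000]

# Predictions that are each off by exactly 10,000.
y_pred = [210_000, 140_000, 290_000, 260_000, 110_000]

rmse_clean = rmse(y_true, y_pred)
mae_clean = mae(y_true, y_pred)
print(f"no outlier:   RMSE = {rmse_clean:,.0f}  MAE = {mae_clean:,.0f}")

# Same predictions, except the first one is now off by 200,000.
y_pred_outlier = [400_000] + y_pred[1:]

rmse_outlier = rmse(y_true, y_pred_outlier)
mae_outlier = mae(y_true, y_pred_outlier)
print(f"with outlier: RMSE = {rmse_outlier:,.0f}  MAE = {mae_outlier:,.0f}")
```

When every error has the same magnitude, the two measures coincide. With the single large error, the RMSE grows much more than the MAE, which illustrates why the RMSE is described as more sensitive to outliers.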