# <strong>End-to-End ML Project</strong>

An ML project has 8 main steps, which this book calls the `ML checklist`:

<ol>
  <li><strong>Frame the problem and look at the big picture</strong></li>
  Define the problem and how the solution will be used, check whether previous solutions can be reused, and decide which type of model is likely to work best and which performance metric to use.
  <br>
  <li><strong>Get the data</strong></li>
  List all the data you need and where to find it, check legal obligations and get authorization, create the workspace, check that the format is correct, and delete sensitive information. Sample a training set and a test set. Once the test set is put aside, don't look at it again until the final evaluation; it is only for testing purposes.
  <strong><em>Try to automate as much as possible.</em></strong>
  <br>
  <li><strong>Explore and visualize the data to gain insights</strong></li>
  Try to get insights from a domain expert. Create a copy of the data for exploratory data analysis (EDA), study each attribute and how it relates to the others (correlations), and identify transformations that might help. Study how you would solve the problem manually.
  <br>
  <li><strong>Prepare the data for the ML algorithms</strong></li>
  Work on copies of the data; don't touch the original dataset. Clean the data, then perform feature selection, feature engineering, and feature scaling.
  <br>
  <li><strong>Select a model and train it</strong></li>
  If the data is huge, sample smaller training sets so you can compare different models in reasonable time. Try to automate as much as possible. Evaluate each model with cross-validation and compare the mean and standard deviation of its scores. Select the 3 to 5 most promising models. Preferably, choose models that make different types of errors: 5 different models that all fail in the same spot add little.
  <br>
  <li><strong>Fine-tune your model</strong></li>
  Here you will want to use as much data as possible and automate as much as you can. Fine-tune the hyperparameters using cross-validation and try ensemble methods. Then, only at the very end, measure performance on the test set to estimate the generalization error.
  <br>
  <li><strong>Present your solution</strong></li>
  Document every step of what you have done. Create a nice presentation (don't forget to highlight the big picture; stakeholders don't care much about details). Explain the solution with arguments and compelling visualizations.
  <br>
  <li><strong>Launch, monitor and maintain the system</strong></li>
  Get it ready for production, then monitor its performance and the quality of its inputs. Retrain if necessary.<br>
</ol>
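
Two of the steps above, putting a test set aside and comparing models with cross-validation, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the book's code: `split_train_test` and `cross_val_scores` are hypothetical helper names, the labels are made-up numbers, the "model" is a constant baseline that predicts the training mean, and mean absolute error is used only to keep the sketch short. In practice a library such as scikit-learn offers equivalents (for example `train_test_split`).

```python
import random
import statistics


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and split off a test_ratio share as the test set."""
    shuffled = list(rows)                  # work on a copy, never the original
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


def cross_val_scores(labels, k=5):
    """Score a predict-the-mean baseline on k folds; returns one error per fold."""
    scores = []
    for fold in range(k):
        validation = labels[fold::k]  # every k-th label, starting at index `fold`
        training = [y for i, y in enumerate(labels) if i % k != fold]
        prediction = statistics.mean(training)  # the "model": a single constant
        errors = [abs(prediction - y) for y in validation]
        scores.append(statistics.mean(errors))  # mean absolute error on this fold
    return scores


# Toy labels standing in for real targets (made-up numbers, not from housing.csv).
labels = [float(v) for v in range(100, 200)]
train_labels, test_labels = split_train_test(labels)

# Cross-validate on the training set only; the test set stays untouched.
scores = cross_val_scores(train_labels, k=5)
print(f"CV error: mean={statistics.mean(scores):.2f}, std={statistics.stdev(scores):.2f}")
print(f"test set held back: {len(test_labels)} instances")
```

Reporting both the mean and the standard deviation across folds shows how well a model does on average and how stable that estimate is.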

Here we will analyze California house prices from 1990. Each row of the data describes a block group, called a district here: a zone with 600 to 3000 people.

The goal is for the model to predict the median housing price in any district, given all the other metrics (the features) in the data set (`"housing.csv"`).
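
As a rough sketch of what "given all the other features" means in code, the snippet below separates the label from the features while reading a CSV. The column name `median_house_value` is an assumption about the header of `housing.csv`, the sample rows and the other column names are made up, and `load_features_and_labels` is a hypothetical helper.

```python
import csv
import io

LABEL_COLUMN = "median_house_value"  # assumed column name; check your CSV header


def load_features_and_labels(csv_file, label_column=LABEL_COLUMN):
    """Read a CSV and return (features, labels): a list of dicts and a list of floats."""
    features, labels = [], []
    for row in csv.DictReader(csv_file):
        labels.append(float(row.pop(label_column)))  # the value the model must predict
        features.append(row)                         # everything else is an input feature
    return features, labels


# Tiny in-memory stand-in for housing.csv (made-up rows and columns).
sample = io.StringIO(
    "population,median_income,median_house_value\n"
    "1200,3.5,150000\n"
    "800,5.1,240000\n"
)
features, labels = load_features_and_labels(sample)
print(features[0])  # feature values are still strings at this point
print(labels)
```

With the real file, the same function could be called on a handle opened with `open("housing.csv", newline="")`, provided the label column name matches the header.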



When framing the problem, note whether it is `univariate` (the model predicts a single value per instance, as here) or `multivariate` (the model predicts multiple values per instance).

## Select a Performance Measure

For regression problems, the `root mean square error (RMSE)` is the typical performance metric:

$RMSE(X,h) = \sqrt{(1/m)\sum^{m}_{i=1}(h(x^{(i)})-y^{(i)})^{2}}$

Where:

$m$ is the number of instances in the dataset. For example, if there are 2000 districts, then $m = 2000$.

$x^{(i)}$ is a vector of all the feature values (excluding the label or solution) of the $i^{th}$ instance in the dataset, and $y^{(i)}$ is its label.

$X$ is a matrix with all the feature values (without the labels or solutions) of all the instances. `There is one row per instance`.

$h$ is your prediction function, also called a `hypothesis`. When the system is given an instance's feature vector $x^{(i)}$, it outputs a predicted value $\hat{y}^{(i)}=h(x^{(i)})$.

$RMSE(X,h)$ is the cost function measured on the set of examples using the hypothesis $h$.

> I will use lowercase letters for vectors and UPPERCASE for matrices.
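
The RMSE definition translates almost directly into code. The sketch below is a minimal, dependency-free illustration: the toy data and the hypothesis `h` (which simply sums the features) are made up, and the labels are passed explicitly as `y`, whereas the formula leaves them implicit.

```python
import math


def rmse(X, y, h):
    """Root mean square error of hypothesis h over feature vectors X and labels y."""
    m = len(X)  # number of instances
    squared_errors = [(h(x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)]
    return math.sqrt(sum(squared_errors) / m)


# Toy data: each row of X is one instance's feature vector (made-up numbers).
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [3.0, 2.0, 6.0]


def h(x):
    """Hypothetical hypothesis: predict the sum of the features."""
    return sum(x)


print(rmse(X, y, h))
```

Here the predictions are 3, 2, and 4, so the errors are 0, 0, and -2, and the RMSE is $\sqrt{4/3} \approx 1.155$. Because the errors are squared before averaging, a single large error weighs more than several small ones.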