# Chapter 2 - End to End Project

The main steps you will go through in a machine learning project are as follows:<br>
<br>
1 Think about the big picture<br>
2 Get your data<br> 
3 Discover and visualize data to gain insights<br>
4 Prepare data for ML algorithms<br>
5 Select a model and train it<br>
6 Fine tune your model<br>
7 Present your solution <br>
8 Launch, monitor, and maintain your system


## 1. Look at the big picture

Before beginning a project, especially one assigned to you at a job, it is crucial to understand what the system will be used for. It is very difficult to build a system that does what is needed if you do not know the end goal.<br>
Once you understand what you are trying to accomplish, you need to figure out what type of problem it is. Is it a supervised regression problem? Reinforcement learning? Unsupervised clustering? <br>
After deciding on the task, you need a way to measure your model's performance. A typical supervised regression problem uses the Root Mean Squared Error (RMSE):<br>
\begin{equation*}
RMSE(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^m (h(x^{(i)}) - y^{(i)})^2}
\end{equation*}

We can also use the Mean Absolute Error (MAE):
\begin{equation*}
MAE(X,h) = \frac{1}{m} \sum_{i=1}^m |h(x^{(i)}) - y^{(i)}|
\end{equation*}

These are just two of many ways to measure performance.
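
The two formulas above can be computed directly with NumPy. This is a minimal sketch: the helper names `rmse` and `mae` and the sample numbers are made up for illustration. Because RMSE squares each error before averaging, a single large error raises it more than it raises MAE.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))


# Hypothetical labels and predictions; the last prediction is far off
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]

rmse_value = rmse(y_true, y_pred)  # residuals are -1, 0, 1, 4
mae_value = mae(y_true, y_pred)
print(rmse_value, mae_value)
```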

## 2. Get the data

Depending on where your data comes from, there are many different ways to retrieve and load it. Before trying to load the data, make sure you are familiar with its layout. You then want to load the data into a dataframe so you can work with it in Python. The following methods give a quick look at a dataframe: <br>
<br> 
- df.head()
- df.tail()
- df.info()
<br>

Knowing what the data looks like is crucial because it shows what further cleaning you may need, for example handling N/A values. It also tells you whether you are working with categorical or quantitative values. 
<br> 
<br>
The **df.describe()** method shows summary statistics for each numerical column (variable) in your dataframe. <br><br>
We can also use the **df.hist()** method to plot histograms of the numerical variables to see their distributions. <br><br>
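
A short sketch of these inspection methods on a toy dataframe. The column names and values are invented for illustration, and `df.isna().sum()` is added here as one way to count missing values per column. `df.hist()` is left out because it needs a plotting backend.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric column with a missing value and a categorical column
df = pd.DataFrame({
    "income": [2.5, 3.1, np.nan, 4.8, 6.0],
    "rooms": [3, 4, 5, 4, 6],
    "region": ["inland", "coast", "coast", "inland", "coast"],
})

first_rows = df.head(2)  # first two rows
last_rows = df.tail(2)   # last two rows

# df.info() prints to stdout by default; capture it in a buffer instead
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

summary = df.describe()                # summary statistics, numeric columns only
missing_per_column = df.isna().sum()   # number of missing values per column

print(info_text)
print(summary)
```

In this sketch `summary` has no `region` column, because `describe()` skips non-numeric columns by default.<br><br>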
Once you have a rough idea of how the dataset looks, set aside a test set. We do this early because studying the test data can lead us to build assumptions about the dataset and pick specific models because of them. That would make our estimate of the generalization error overly optimistic (data snooping bias).  <br><br>
We can do this using the **train_test_split** function from the **sklearn.model_selection** package.<br><br>
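
A minimal sketch of a random split, assuming a synthetic dataframe and an 80/20 ratio (both chosen only for illustration). Fixing `random_state` makes the split reproducible across runs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.normal(size=100),
    "target": rng.normal(size=100),
})

# Hold out 20% of the rows as the test set
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))
```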

When splitting your dataset, make sure that you are not introducing sampling bias. One way to avoid it is *stratified sampling*, which divides the data into homogeneous subgroups (strata) and then draws from each stratum in proportion to its share of the whole dataset, so that the sample is representative of the total data. <br><br>
- Example: If we want to estimate the average height in a country, our sample would be skewed if it was filled with NBA players.
 <br><br>
 To do stratified sampling we can split our dataset using the **StratifiedShuffleSplit** class from **sklearn.model_selection**. 
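
A sketch of a stratified split, assuming a synthetic dataframe whose hypothetical `group` column defines the strata. The test set should end up with roughly the same group proportions as the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data with three unevenly sized groups (60%, 30%, 10%)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=200),
    "group": np.repeat(["a", "b", "c"], [120, 60, 20]),
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional indices; stratify on the "group" column
train_idx, test_idx = next(splitter.split(df, df["group"]))
strat_train = df.iloc[train_idx]
strat_test = df.iloc[test_idx]

overall_props = df["group"].value_counts(normalize=True)
test_props = strat_test["group"].value_counts(normalize=True)
print(overall_props)
print(test_props)
```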
 

## 3. Discover and Visualize the Data to Gain Insights



There are many ways to take the training set and analyze the data further. One way is to draw a scatter plot, which can tell us a lot about the correlation between two variables. 
<br><br>
We can even make a scatter plot matrix with the following code.

In [1]:
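# pandas.tools.plotting was removed from pandas; scatter_matrix lives in pandas.plotting
from pandas.plotting import scatter_matrix

#scatter_matrix(df[column_names])  # column_names: a list of numeric column names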

On the diagonal, where each variable would be plotted against itself, the scatter matrix plots a histogram of that variable instead. 
<br><br>
This step is more intuitive than mechanical. You can look at the data and try to clean or alter it in whatever ways you see fit.
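
The correlations that a scatter plot suggests visually can also be checked numerically. The sketch below uses `DataFrame.corr()`, which the text above does not mention, on synthetic data built so that one column depends on `x` and another does not. All column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: "related" depends linearly on x, "unrelated" is independent noise
rng = np.random.default_rng(1)
x = rng.normal(size=500)
df = pd.DataFrame({
    "x": x,
    "related": 2.0 * x + rng.normal(scale=0.5, size=500),
    "unrelated": rng.normal(size=500),
})

corr_matrix = df.corr()  # pairwise Pearson correlation coefficients
corr_with_x = corr_matrix["x"].sort_values(ascending=False)
print(corr_with_x)
```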

## 4. Prepare the data for ML Algorithms

- When you begin to transform your data, write the transformations as functions. That makes your work reproducible and easier to reuse in future projects.<br><br>
- It is important to clean your data so that it has no missing values. Most ML algorithms have a difficult time with missing values.<br><br>
- You will want to consider feature scaling when your features have very different scales. You can do this by min-max scaling (normalization) or standardization.<br><br>
- Normalization puts everything on a scale from 0 to 1.<br><br>
- Standardization rescales each feature to have mean 0 and unit variance.<br><br>
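
The points above can be sketched with scikit-learn's `SimpleImputer`, `MinMaxScaler`, and `StandardScaler`. The data, the `fill_missing` helper, and the median strategy are assumptions made for illustration, not the only reasonable choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales, with one missing value
X = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 400.0],
    [4.0, 600.0],
])


def fill_missing(features):
    """Replace missing values with the median of their column."""
    imputer = SimpleImputer(strategy="median")
    return imputer.fit_transform(features)


X_filled = fill_missing(X)

# Normalization: each column rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_filled)

# Standardization: each column rescaled to mean 0 and unit variance
X_standard = StandardScaler().fit_transform(X_filled)
print(X_minmax)
print(X_standard)
```

In a real project the imputer and scalers should be fit on the training set only and then applied to the test set with `transform`.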


## 5. Select a model and train it

Now that everything is set up, things will be much easier. All we need to do is pick a model.<br><br>
From the **sklearn.linear_model** package we can use the **LinearRegression** class to create the model and its predict method (for example **lin_reg.predict**) to create predictions.<br><br>
We can also get our RMSE using the **sklearn.metrics** package, by taking the square root of the output of the mean_squared_error function.<br><br>
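
A minimal sketch of these steps on synthetic data. The data-generating formula is made up so that a linear model fits well, and the RMSE here is measured on the training data itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression data: linear signal plus noise with standard deviation 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=1.0, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

predictions = lin_reg.predict(X)
# RMSE is the square root of the mean squared error
lin_rmse = np.sqrt(mean_squared_error(y, predictions))
print(lin_rmse)
```

An RMSE close to the noise level of the data suggests the model has captured most of the signal.<br><br>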
One could also decide to use the **DecisionTreeRegressor** class from the **sklearn.tree** package.<br><br>
We can also evaluate a model on several train/validation splits using cross-validation. This can be done with the **cross_val_score** function from the **sklearn.model_selection** package, where the parameter *cv* specifies the number of folds.<br><br>
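
A sketch of why cross-validation matters, using the same kind of synthetic data as a stand-in for a real training set. An unconstrained decision tree can fit its training data almost perfectly, while its cross-validated error is much larger. scikit-learn scorers follow a higher-is-better convention, so the scoring string is `neg_mean_squared_error` and the sign is flipped before taking the square root.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=1.0, size=200)

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X, y)

# Error on the data the tree was trained on
train_rmse = np.sqrt(np.mean((tree_reg.predict(X) - y) ** 2))

# 5-fold cross-validation: each fold is held out once for validation
scores = cross_val_score(tree_reg, X, y, scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-scores)
print(train_rmse, cv_rmse.mean(), cv_rmse.std())
```

The gap between `train_rmse` and the mean of `cv_rmse` is a sign of overfitting.<br><br>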
Be sure to save your models for later use, using the following code:

In [2]:
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
import joblib

#to save
#joblib.dump(my_model, 'my_model.pkl')

#to load
#my_model_loaded = joblib.load('my_model.pkl')
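
A runnable round trip of the save and load calls above, as a sketch. It writes to a temporary directory (an assumption made here to avoid leaving files behind) and checks that the reloaded model gives the same predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny hypothetical dataset following y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
my_model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "my_model.pkl")
    joblib.dump(my_model, model_path)          # save
    my_model_loaded = joblib.load(model_path)  # load

same_predictions = np.allclose(my_model.predict(X), my_model_loaded.predict(X))
print(same_predictions)
```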

## 6. Fine-Tune your Model

 One way to tune your hyperparameters is to use the **GridSearchCV** class in the **sklearn.model_selection** package, which tries every combination of the hyperparameter values you specify and finds the best one.<br><br>
 For models with a much larger hyperparameter search space we can use **RandomizedSearchCV**, which evaluates a fixed number of randomly sampled combinations instead.
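
A sketch of both search strategies on a decision tree. The data is synthetic and the hyperparameter ranges are arbitrary choices for illustration. The grid search evaluates all 9 combinations, while the randomized search samples 10 combinations out of 200 possible ones.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Exhaustive search over a small grid: 3 x 3 = 9 combinations
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]}
grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
grid_search.fit(X, y)

# Randomized search: sample n_iter combinations from a larger space
param_distributions = {
    "max_depth": list(range(2, 12)),
    "min_samples_leaf": list(range(1, 21)),
}
random_search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions,
    n_iter=10,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
)
random_search.fit(X, y)

print(grid_search.best_params_)
print(random_search.best_params_)
```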

## 7. Analyze the Best Models

By inspecting the best model you can gain insight into which features matter most and least to it. You could then try tweaking the model, for example by removing unneeded features.<br><br>
Now we can finally evaluate our model on the test set. Beware: there should be no more tuning of the model after this, or the test-set score will no longer be an honest estimate of the generalization error.
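
A sketch of both steps, assuming a tree-based model (which exposes `feature_importances_`) and synthetic data in which the third feature is pure noise. The feature names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first two columns only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=300)
feature_names = ["signal_a", "signal_b", "noise"]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

final_model = DecisionTreeRegressor(max_depth=5, random_state=42)
final_model.fit(X_train, y_train)

# Pair each importance score with its feature name, most important first
importances = sorted(
    zip(final_model.feature_importances_, feature_names), reverse=True
)

# One-time evaluation on the held-out test set
final_predictions = final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(importances)
print(final_rmse)
```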

## 8. Launch, Monitor, Maintain your System

Once your model has launched, you need to monitor it for breakage and for performance that degrades over time. <br><br>
You will want to retrain your models on a regular basis with fresh data, so that performance does not degrade as the data becomes stale.<br><br>
