# Chapter 2: End-to-End Machine Learning Project

1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

#### Pipelines:
A sequence of data processing components is called a data pipeline.

Questions:
1. The first question to ask your boss is what exactly the business objective is; building a model is probably not the end goal. How does the company expect to use and benefit from this model?
<br>
2. Next, ask what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem.
<br>

<b>Frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?</b><br>
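Select a `performance measure`, typically `RMSE` for regression. The Root Mean Square Error is the square root of the average squared prediction error, so it gives a higher weight to large errors. If the errors are roughly normally distributed with zero mean, an `RMSE` of 50,000 means that about 68% of the system’s predictions fall within 50,000 of the actual value, and about 95% of the predictions fall within 100,000.<br>
Another common measure is the Mean Absolute Error (`MAE`). Computing the root of a sum of squares (RMSE) corresponds to the Euclidean (l2) norm of the error vector, and computing the sum of absolutes (MAE) corresponds to the Manhattan (l1) norm; MAE is less sensitive to outliers.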
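
As a minimal sketch of the two measures, assuming NumPy is available; the arrays below are made-up values for illustration, not results from the housing data:

```python
import numpy as np

# Illustrative values only (not taken from the housing dataset).
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 410_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0, 450_000.0])

errors = y_pred - y_true

# RMSE: square the errors, average them, take the square root (l2 norm).
rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average the absolute errors (l1 norm); less sensitive to outliers.
mae = np.mean(np.abs(errors))

print(f"RMSE = {rmse:.1f}, MAE = {mae:.1f}")
```

The single large error (40,000) pulls the RMSE above the MAE, which is the outlier sensitivity described above.
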

3. Check the assumptions.
4. Get the data.
5. Plot and explore the data:

While exploring, watch for capped values (how to handle them depends on what the client needs), attributes with very different scales (fixed by feature scaling), and tail-heavy distributions (fixed by transforming them into more bell-shaped distributions).<br>

How the data is sampled when creating the train and test sets matters: purely random sampling can introduce sampling bias, especially on small datasets.<br>
`stratified sampling`: the population is divided into homogeneous subgroups called strata,
and the right number of instances is sampled from each stratum to guarantee that the
test set is representative of the overall population.
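
A minimal sketch of a stratified split, assuming pandas and Scikit-Learn are installed. It uses `train_test_split` with its `stratify` argument as one convenient way to do this; the DataFrame and its `stratum` column are hypothetical stand-ins for a real category such as an income bracket:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical dataset: one numeric feature and an imbalanced category.
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "stratum": rng.choice(["low", "mid", "high"], size=1000, p=[0.6, 0.3, 0.1]),
})

# stratify= keeps each stratum's share (almost) identical in both sets.
train_set, test_set = train_test_split(
    df, test_size=0.2, stratify=df["stratum"], random_state=42
)

# Compare the category proportions in the full data and in the test set.
overall = df["stratum"].value_counts(normalize=True)
in_test = test_set["stratum"].value_counts(normalize=True)
print(pd.DataFrame({"overall": overall, "test": in_test}))
```
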

<p>
    
6. Look for correlations
7. Experiment with attribute combinations
8. Data Cleaning:
To deal with N/A (missing) values, either get rid of the corresponding rows, get rid of the whole attribute, or replace each missing value with some value such as zero, the mean, or the median.<br><p>
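
The three options can be sketched with pandas as follows. The small DataFrame and its column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in "total_bedrooms".
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows that contain the missing value.
dropped_rows = df.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute.
dropped_column = df.drop(columns=["total_bedrooms"])

# Option 3: fill the gaps with some value, here the median.
median = df["total_bedrooms"].median()
filled = df.assign(total_bedrooms=df["total_bedrooms"].fillna(median))

print(filled)
```

With option 3, keep the computed median around: the same value is needed later to fill gaps in the test set and in new data.
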
    
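Scikit-Learn has an imputer class for this. In current versions it is `SimpleImputer` in `sklearn.impute`; the older `Imputer` in `sklearn.preprocessing` was removed in version 0.22.<br>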
<b>Fit and Transform the imputer object.</b><br>
`Estimators`: Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is an estimator). The estimation itself is performed by the `fit()` method.<br>
`Transformers`: Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once again, the API is quite simple: the transformation is performed by the `transform()` method, and `fit_transform()` does both steps in one call.<br>
`Predictors`: Some estimators are capable of making predictions given a dataset; they are called predictors and expose a `predict()` method.<br>
`Inspection`: An estimator’s hyperparameters are accessible directly via public
instance variables (e.g., `imputer.strategy`), and its learned parameters via public instance variables with an underscore suffix (e.g., `imputer.statistics_` gives the values the imputer computed).<br>
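
A minimal sketch of this API using `SimpleImputer`, assuming a recent Scikit-Learn version; the array is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data with one missing value per column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 40.0]])

imputer = SimpleImputer(strategy="median")  # hyperparameter set at construction

imputer.fit(X)                   # estimator: learns the per-column medians
X_filled = imputer.transform(X)  # transformer: fills the gaps with them

print(imputer.strategy)     # hyperparameter, plain public attribute
print(imputer.statistics_)  # learned parameters, trailing underscore
```
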
<p>
    
    
9. Handling Text and Categorical Attributes:
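`LabelEncoder` maps text categories to integers, `OneHotEncoder` turns integer categories into one-hot vectors, and `LabelBinarizer` does both steps in one shot and returns a dense NumPy array directly. In recent Scikit-Learn versions `OneHotEncoder` accepts text categories directly, while `LabelEncoder` and `LabelBinarizer` are meant for target labels rather than input features.<br>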
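
A minimal sketch of one-hot encoding a text column, assuming a Scikit-Learn version in which `OneHotEncoder` handles strings directly; the category values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column: 2D, one row per instance.
categories = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(categories)  # SciPy sparse matrix by default

print(encoder.categories_)  # categories found, in sorted order
print(one_hot.toarray())    # dense NumPy array, one column per category
```

The sparse output only stores the positions of the non-zero entries, which saves memory when there are many categories.
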

<p>
    
10. Feature Scaling:
Two common approaches are min-max scaling (normalization) and standardization.<br>

Min-max scaling: values are shifted and rescaled so that they end up ranging from 0 to 1.<br>
We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a
transformer called `MinMaxScaler` for this. It has a `feature_range` hyperparameter
that lets you change the range if you don’t want 0–1 for some reason.
<p>
    
    
`Standardization` is quite different: first it subtracts the mean value (so standardized
values always have a zero mean), and then it divides by the standard deviation so that the resulting
distribution has unit variance.<br>
Unlike min-max scaling, standardization does not bound values to a specific range, but it is much less affected by outliers. Scikit-Learn provides a transformer called `StandardScaler` for standardization.<br>
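
A minimal sketch comparing the two scalers on a single made-up feature that contains one outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with an outlier (100.0), made up for illustration.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_minmax = MinMaxScaler().fit_transform(X)  # (x - min) / (max - min)
X_std = StandardScaler().fit_transform(X)   # (x - mean) / standard deviation

print(X_minmax.ravel())  # the outlier squeezes the other values close to 0
print(X_std.ravel())     # zero mean, unit variance, no fixed range
```
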

11. Chain these steps together with `Transformation Pipelines`:

In [None]:
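from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# CombinedAttributesAdder (a custom transformer) and housing_num (the numerical
# attributes) are assumed to be defined earlier in the notebook.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),  # fill missing values with the median
    ('attribs_adder', CombinedAttributesAdder()),   # add the combined attributes
    ('std_scaler', StandardScaler()),               # standardize every feature
    ])
housing_num_tr = num_pipeline.fit_transform(housing_num)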


When you call the pipeline’s `fit()` method, it calls `fit_transform()` sequentially on
all transformers, passing the output of each call as the parameter to the next call, until
it reaches the final estimator, for which it just calls the `fit()` method.<br>
<b>We can build separate sub-pipelines for different tasks (for example, one for numerical and one for categorical attributes) and plug our own custom transformers into them as steps.</b><br><p>
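
A minimal, self-contained sketch of a pipeline that includes a custom transformer. `RatioAdder` is a hypothetical class written for this example (it is not the notebook's `CombinedAttributesAdder`), and the data is made up:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioAdder(TransformerMixin, BaseEstimator):
    """Hypothetical custom transformer: appends column 0 / column 1."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]


# Illustrative data with one missing value in the first column.
X = np.array([[2.0, 1.0], [np.nan, 2.0], [6.0, 3.0], [8.0, 4.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("ratio_adder", RatioAdder()),
    ("std_scaler", StandardScaler()),
])
X_tr = pipe.fit_transform(X)

print(X_tr.shape)                                # one extra column was added
print(pipe.named_steps["imputer"].statistics_)   # medians learned by the first step
```

Implementing `fit()` and `transform()` (and inheriting `fit_transform()` from `TransformerMixin`) is all a class needs to be usable as a pipeline step.
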
        
12. Select and Train a Model:
We select a model and then train it on our training data.<br>
Evaluate the model on the training set first; the test set stays untouched until the very end.<br>
Check for underfitting by computing `sklearn.metrics.mean_squared_error()` on `y_train` and the model's predictions.<br>
If the `sqrt` of this value (the RMSE) is large compared with the typical target values, the model is underfitting. We can then increase the complexity of the model, here by training a decision tree regressor (`sklearn.tree.DecisionTreeRegressor`). Calculate the mean squared error again and check whether the new model overfits (it does in this case).<br><p>
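
The underfitting and overfitting pattern can be reproduced on synthetic data. The sketch below does not use the housing data; the quadratic target is an assumption chosen so that a linear model cannot fit it well:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic nonlinear data standing in for the prepared training set.
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

train_rmse = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X_train, y_train)
    predictions = model.predict(X_train)
    # RMSE on the training set itself.
    train_rmse[type(model).__name__] = np.sqrt(mean_squared_error(y_train, predictions))

print(train_rmse)
```

A training RMSE of (almost) zero for the tree does not mean the model is perfect; it usually means the tree has memorized the training set, which is why a separate validation step is needed.
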
    
    
13. Better Evaluation Using Cross-Validation:
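`sklearn.model_selection.cross_val_score(model, X_train, y_train, scoring='neg_mean_squared_error', cv=10)` evaluates the model with the `K-fold cross-validation technique`: it splits the training set into 10 folds, then trains and evaluates the model 10 times, each time holding out a different fold for evaluation. The scores are negative MSE values, so take `sqrt(-scores)` to get RMSE values.<br>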
<p>
    
The `mean` cross-validation RMSE of the decision tree is worse than that of the linear regression, so yes, it `overfits`.<br>
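
A minimal sketch of the cross-validation call on synthetic data (an assumption for illustration, not the housing data):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic training set for illustration.
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = X_train[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

scores = cross_val_score(
    DecisionTreeRegressor(random_state=0), X_train, y_train,
    scoring="neg_mean_squared_error", cv=10,
)

# Scores are negated MSE values (higher is better), so flip the sign
# before taking the square root.
rmse_scores = np.sqrt(-scores)
print("mean:", rmse_scores.mean(), "std:", rmse_scores.std())
```

The standard deviation of the fold scores gives a rough idea of how precise the estimate is, which a single validation split cannot provide.
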
<p>
    
    
`sklearn.ensemble.RandomForestRegressor` is tried next. It trains many decision trees on random subsets of the features and averages their predictions.<br>
Building a model on top of many
other models is called `Ensemble Learning`, and it is often a great way to push ML algorithms
even further.<br>

<p>
    
    
    
Before tuning the hyperparameters, experiment with other models and compare their cross-validation scores. `We can save sklearn models by using Python's pickle module`.<br>
Or with:<br>
`joblib`, which is more efficient at serializing large NumPy arrays. (`sklearn.externals.joblib` was removed in Scikit-Learn 0.23; import the standalone `joblib` package instead.)


In [None]:
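import joblib  # standalone package; sklearn.externals.joblib was removed in Scikit-Learn 0.23

# my_model is assumed to be an already trained Scikit-Learn estimator.
joblib.dump(my_model, "my_model.pkl")
# and later...
my_model_loaded = joblib.load("my_model.pkl")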

14. Fine-Tune Your Model:
Use Scikit-Learn’s `GridSearchCV` to fine-tune the hyperparameters: it evaluates every combination of the values you list, using cross-validation.

In [None]:
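from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # 3 x 4 = 12 combinations
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # 2 x 3 = 6 more combinations, with bootstrapping turned off
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]
forest_reg = RandomForestRegressor()
# housing_prepared and housing_labels are assumed to come from the earlier preparation steps.
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)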

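Other ways to fine-tune your model are `Randomized search` (`RandomizedSearchCV`) and `Ensemble Methods` that combine the best models.<br>
The best combination is available as `grid_search.best_params_`; see the `feature importances` with `grid_search.best_estimator_.feature_importances_`.<br>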
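
A minimal, self-contained sketch of reading the grid search results. It uses a small synthetic dataset from `make_regression` and a deliberately tiny grid so it runs quickly; both are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data: 4 features, only 2 of which carry signal.
X, y = make_regression(n_samples=150, n_features=4, n_informative=2,
                       noise=5.0, random_state=0)

param_grid = {"n_estimators": [5, 20], "max_features": [2, 4]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                           cv=3, scoring="neg_mean_squared_error")
grid_search.fit(X, y)

print(grid_search.best_params_)
best_rmse = np.sqrt(-grid_search.best_score_)  # best_score_ is a negated MSE
print("best CV RMSE:", best_rmse)

# Rank the features by their importance in the best model.
importances = grid_search.best_estimator_.feature_importances_
for index, importance in sorted(enumerate(importances), key=lambda p: p[1], reverse=True):
    print(f"feature {index}: {importance:.3f}")
```

Features with very low importance are candidates for dropping before the next round of experiments.
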
<p>
    
15. Now evaluate the system on the Test Set.<br>
To present your solution (highlighting
what you have learned, what worked and what did not, what assumptions
were made, and what your system’s limitations are), document everything, and create
nice presentations with clear visualizations and easy-to-remember statements (e.g.,
“the median income is the number one predictor of housing prices”).
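
The final test-set evaluation mentioned at the start of this step can be sketched as follows. The data is synthetic and the two-step pipeline is an assumption for illustration; the point is that the preprocessing is fitted on the training data only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic linear data for illustration.
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

final_model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
final_model.fit(X_train, y_train)  # fitted on the training data only

# predict() only calls transform() on the scaler, so nothing is re-fitted
# on the test data.
final_rmse = np.sqrt(mean_squared_error(y_test, final_model.predict(X_test)))
print("test RMSE:", final_rmse)
```

Resist tweaking hyperparameters after seeing this number; improvements tuned to the test set are unlikely to generalize to new data.
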

16. Launch, Monitor, and Maintain Your System
You need to get your solution ready for production,
in particular by plugging the production input data sources into your system and writing tests.<br>
<p>
    
`Monitoring` is important to check the live performance and to catch sudden breakage. Models also tend to `rot` over time as the data evolves, unless they are regularly retrained on fresh data.<br>

Evaluating the live performance may require a human analysis pipeline. Poor-quality input `signals` also have to be caught early (for example, by maintaining the sensors and raising alerts on bad signals and performance drops).<br><p>
    
    
Automate these alerts and the detection of bad signals as much as possible.<br>
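
One way such an automated alert could look, as a rough sketch in plain Python. `RmseMonitor`, its window size, and its threshold are all hypothetical names and values chosen for illustration:

```python
import math
from collections import deque


class RmseMonitor:
    """Hypothetical helper: tracks the RMSE of the most recent predictions."""

    def __init__(self, window_size, alert_threshold):
        self.squared_errors = deque(maxlen=window_size)  # old entries drop out
        self.alert_threshold = alert_threshold

    def record(self, prediction, actual):
        """Store one labelled prediction; return True if an alert should fire."""
        self.squared_errors.append((prediction - actual) ** 2)
        return self.rolling_rmse() > self.alert_threshold

    def rolling_rmse(self):
        if not self.squared_errors:
            return 0.0
        return math.sqrt(sum(self.squared_errors) / len(self.squared_errors))


monitor = RmseMonitor(window_size=3, alert_threshold=10.0)
labelled_predictions = [(100, 102), (98, 101), (120, 95), (110, 80)]
alerts = [monitor.record(p, a) for p, a in labelled_predictions]
print(alerts)
```

In a real system the actual values often arrive late or need human labelling, so the window and threshold have to be chosen with that delay in mind.
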



