# 1. Overview of the Machine Learning Landscape

• **Popular open data repositories:**

* [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php)
* [Kaggle datasets](https://www.kaggle.com/datasets)
* [Amazon’s AWS datasets](https://registry.opendata.aws/)


• **Meta portals (they list open data repositories):**

* http://dataportals.org/
* http://opendatamonitor.eu/
* http://quandl.com/

• **Other pages listing many popular open data repositories:**

* [Wikipedia List of datasets for machine learning research](https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research)
* [Quora list](https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public)
* [Reddit datasets](https://www.reddit.com/r/datasets)


## Checklist for ML projects

1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning
algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.



**1. Frame the Problem**

The first question to ask your boss is what exactly is the business objective; building a
model is probably not the end goal. How does the company expect to use and benefit
from this model? This is important because it will determine how you frame the
problem, what algorithms you will select, what performance measure you will use to
evaluate your model, and how much effort you should spend tweaking it.

The next question to ask is what the current solution looks like (if any). It will often
give you a reference performance, as well as insights on how to solve the problem.

Okay, with all this information you are now ready to start designing your system.
First, you need to frame the problem: is it supervised, unsupervised, or Reinforcement
Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?

**Select a Performance Measure**

Your next step is to select a performance measure. A typical performance measure for
regression problems is the Root Mean Square Error (RMSE). It roughly corresponds to the
standard deviation of the errors the system makes in its predictions, and it gives a
higher weight to large errors.
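
As a minimal sketch of the measure described above, RMSE is the square root of the mean of the squared differences between predictions and labels. The `rmse` helper and the sample values below are hypothetical and only serve as an illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical labels and predictions, for illustration only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
example_rmse = rmse(y_true, y_pred)
print(example_rmse)
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.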

**Check the Assumptions** 



**2. Get the Data**



A common way to split the data into a training set and a test set is:

    from sklearn.model_selection import train_test_split
    train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
    
So far we have considered purely random sampling methods. This is generally fine if
your dataset is large enough (especially relative to the number of attributes), but if it
is not, you run the risk of introducing a significant sampling bias.

One way to avoid this is stratified
sampling: the population is divided into homogeneous subgroups called strata,
and the right number of instances is sampled from each stratum to guarantee that the
test set is representative of the overall population.

For example, suppose we know that a specific attribute (here, the income category) is very important for the purpose of our model, so we must ensure that the test set represents it well. Scikit-Learn’s StratifiedShuffleSplit class does this:

    from sklearn.model_selection import StratifiedShuffleSplit
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in split.split(housing, housing["income_cat"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]
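
To see what stratification buys, the following sketch compares category proportions in a stratified test set and a purely random one. It assumes a synthetic `toy` DataFrame with a hypothetical `income_cat` column, not the real housing data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the housing data (hypothetical values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10]),
    "value": rng.normal(size=1000),
})

# Stratified split on income_cat
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_index]

# Purely random split, for comparison
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)

# Proportion of each category in the full data and in both test sets
comparison = pd.DataFrame({
    "overall": toy["income_cat"].value_counts(normalize=True),
    "stratified": strat_test["income_cat"].value_counts(normalize=True),
    "random": random_test["income_cat"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column can drift, especially for the rare categories.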

**3. Discover and Visualize the Data to Gain Insights**

**Looking for Correlations**

Since the dataset is not too large, you can easily compute the standard correlation
coefficient (also called Pearson’s r) between every pair of attributes using the corr()
method:

    # numeric_only=True (pandas 1.5+) skips text columns such as ocean_proximity
    corr_matrix = housing.corr(numeric_only=True)

Now let’s look at how much each attribute correlates with the median house value:

    corr_matrix["median_house_value"].sort_values(ascending=False)

    median_house_value 1.000000
    median_income 0.687170
    total_rooms 0.135231
    housing_median_age 0.114220
    households 0.064702
    total_bedrooms 0.047865
    population -0.026699
    longitude -0.047279
    latitude -0.142826

    
The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that
there is a strong positive correlation; for example, the median house value tends to go
up when the median income goes up. When the coefficient is close to –1, it means
that there is a strong negative correlation; you can see a small negative correlation
between the latitude and the median house value (i.e., prices have a slight tendency to
go down when you go north). Coefficients close to zero mean that there is no linear
correlation.

![Example datasets with their Pearson correlation coefficients](https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg)
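
The correlation coefficient only measures linear relationships, so it can be close to zero even when two attributes are strongly related. A small sketch with synthetic data (the variable names are made up for this example):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is Pearson's r
r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]   # perfectly linear, increasing
r_negative = np.corrcoef(x, -3 * x)[0, 1]      # perfectly linear, decreasing
r_quadratic = np.corrcoef(x, x ** 2)[0, 1]     # strong but non-linear relationship

print(r_positive, r_negative, r_quadratic)
```

The quadratic case yields a coefficient near zero even though y is fully determined by x, which is why plotting the data is still worthwhile.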


Another way to check for correlation between attributes is to use Pandas’ scatter_matrix function, which plots every numerical attribute against every other numerical attribute.

    from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed in recent pandas versions
    attributes = ["median_house_value", "median_income", "total_rooms","housing_median_age"]
    scatter_matrix(housing[attributes], figsize=(12, 8))

**Data Cleaning**

Most Machine Learning algorithms cannot work with missing features, so let’s create
a few functions to take care of them. You noticed earlier that the total_bedrooms
attribute has some missing values, so let’s fix this. You have three options:

* Get rid of the corresponding districts.
* Get rid of the whole attribute.
* Set the values to some value (zero, the mean, the median, etc.).

You can accomplish these easily using DataFrame’s dropna(), drop(), and fillna()
methods, respectively.
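
The three options can be sketched as follows. The tiny `housing` frame below is a hypothetical stand-in for the real dataset, with a single missing total_bedrooms value:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the housing data
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: get rid of the districts (rows) with a missing value
option1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: get rid of the whole attribute (column)
option2 = housing.drop("total_bedrooms", axis=1)

# Option 3: fill missing values with the median
median = housing["total_bedrooms"].median()
option3 = housing.assign(total_bedrooms=housing["total_bedrooms"].fillna(median))
```

If you choose option 3, compute the median on the training set and keep it, since the same value is needed later to fill missing values in the test set and in new data.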

**Handling Text and Categorical Attributes**

The ocean_proximity attribute contains text labels, but most Machine Learning algorithms prefer
to work with numbers, so let’s convert these text labels to numbers.
Scikit-Learn provides a transformer for this task called LabelEncoder:

    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    housing_cat = housing["ocean_proximity"]
    housing_cat_encoded = encoder.fit_transform(housing_cat)
    housing_cat_encoded
    array([1, 1, 4, ..., 1, 0, 3])
    
One issue with this representation is that ML algorithms will assume that two nearby
values are more similar than two distant values. Obviously this is not the case (for
example, categories 0 and 4 are more similar than categories 0 and 1). To fix this
issue, a common solution is to create one binary attribute per category: one attribute
equal to 1 when the category is “<1H OCEAN” (and 0 otherwise), another attribute
equal to 1 when the category is “INLAND” (and 0 otherwise), and so on. This is
called one-hot encoding, because only one attribute will be equal to 1 (hot), while the
others will be 0 (cold).

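    from sklearn.preprocessing import OneHotEncoder
    encoder = OneHotEncoder()
    # fit_transform() expects a 2D array, so reshape the 1D array of integer codes
    housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
    housing_cat_1hot.toarray()  # the result is a SciPy sparse matrix; convert it to a dense array

    array([[ 0., 1., 0., 0., 0.],
    [ 0., 1., 0., 0., 0.],
    [ 0., 0., 0., 0., 1.],
    ...,
    [ 0., 1., 0., 0., 0.],
    [ 1., 0., 0., 0., 0.],
    [ 0., 0., 0., 1., 0.]])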

We can apply both transformations (from text categories to integer categories, then
from integer categories to one-hot vectors) in one shot using the LabelBinarizer
class:

    >>> from sklearn.preprocessing import LabelBinarizer
    >>> encoder = LabelBinarizer()
    >>> housing_cat_1hot = encoder.fit_transform(housing_cat)
    >>> housing_cat_1hot
    array([[0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    ...,
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0]])
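
With Scikit-Learn 0.20 or later, OneHotEncoder accepts text categories directly, so the intermediate integer encoding can be skipped. A minimal sketch, assuming a small hypothetical sample of ocean_proximity values rather than the real column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of ocean_proximity values, shaped as one 2D column
housing_cat = np.array(
    ["INLAND", "<1H OCEAN", "NEAR BAY", "INLAND", "ISLAND"]
).reshape(-1, 1)

encoder = OneHotEncoder()                   # returns a SciPy sparse matrix by default
housing_cat_1hot = encoder.fit_transform(housing_cat)
dense = housing_cat_1hot.toarray()          # dense array, one column per category
categories = list(encoder.categories_[0])   # categories in sorted order

print(categories)
print(dense)
```

The `categories_` attribute records which column corresponds to which category, which helps when interpreting the encoded matrix.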

**Feature Scaling**

One of the most important transformations you need to apply to your data is feature
scaling. With few exceptions, Machine Learning algorithms don’t perform well when
the input numerical attributes have very different scales.

There are two common ways to get all attributes to have the same scale: min-max
scaling and standardization.

Min-max scaling (many people call this normalization) is quite simple: values are
shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting
the min value and dividing by the max minus the min. Scikit-Learn provides a
transformer called MinMaxScaler for this. It has a feature_range hyperparameter
that lets you change the range if you don’t want 0–1 for some reason.

Standardization is quite different: first it subtracts the mean value (so standardized
values always have a zero mean), and then it divides by the standard deviation so that the resulting
distribution has unit variance. Unlike min-max scaling, standardization does not
bound values to a specific range, which may be a problem for some algorithms (e.g.,
neural networks often expect an input value ranging from 0 to 1). However, standardization
is much less affected by outliers. For example, suppose a district had a median
income equal to 100 (by mistake). Min-max scaling would then crush all the other
values from 0–15 down to 0–0.15, whereas standardization would not be much affected.
Scikit-Learn provides a transformer called StandardScaler for standardization.

*As with all the transformations, it is important to fit the scalers to
the training data only, not to the full dataset (including the test set).
Only then can you use them to transform the training set and the
test set (and new data).*
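
The following sketch applies both scalers, fitting them on the training data only and then reusing them on the test data. The income values are hypothetical, with one deliberate outlier in the training set:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical median incomes; the last training value is an outlier
train = np.array([[1.5], [3.0], [4.5], [6.0], [100.0]])
test = np.array([[2.0], [5.0]])

# fit() learns the statistics from the training set only
min_max = MinMaxScaler().fit(train)      # learns min and max
standard = StandardScaler().fit(train)   # learns mean and standard deviation

train_min_max = min_max.transform(train)
train_standard = standard.transform(train)

# Reuse the fitted scalers on the test set; do not refit them
test_min_max = min_max.transform(test)
test_standard = standard.transform(test)

print(train_min_max.ravel())
print(train_standard.ravel())
```

After min-max scaling, the outlier maps to 1 and the other training values are squeezed into a narrow band near 0.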

**Transformation Pipelines**

As you can see, there are many data transformation steps that need to be executed in
the right order. Scikit-Learn provides the Pipeline class to chain such steps:

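    from sklearn.impute import SimpleImputer  # replaces the Imputer class removed in Scikit-Learn 0.22
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # CombinedAttributesAdder is a custom transformer that must be defined elsewhere
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])
    housing_num_tr = num_pipeline.fit_transform(housing_num)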
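
The pipeline above relies on a custom CombinedAttributesAdder transformer that these notes do not define. The version below is a hypothetical stand-in that simply appends the ratio of the first two columns, so that the whole pipeline can be run end to end on made-up data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: appends the ratio of column 0 to column 1."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratio = X[:, 0] / X[:, 1]
        return np.c_[X, ratio]

# Hypothetical numeric data (total_rooms, households) with one missing value
housing_num = np.array([
    [880.0, 126.0],
    [7099.0, 1138.0],
    [np.nan, 177.0],
    [1274.0, 219.0],
])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # fill missing values
    ("attribs_adder", CombinedAttributesAdder()),    # add the ratio column
    ("std_scaler", StandardScaler()),                # standardize every column
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
print(housing_num_tr.shape)
```

The order matters here: the imputer must run first so that the ratio is never computed from a missing value.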
    
    
**Training and Evaluating on the Training Set**

**Better Evaluation Using Cross-Validation**

One way to evaluate the Decision Tree model would be to use the train_test_split
function to split the training set into a smaller training set and a validation set, then
train your models against the smaller training set and evaluate them against the validation
set. It’s a bit of work, but nothing too difficult and it would work fairly well.

A great alternative is to use Scikit-Learn’s cross-validation feature. The following code
performs K-fold cross-validation: it splits the training set into 10 distinct
subsets called folds, then it trains and evaluates the Decision Tree model 10 times,
picking a different fold for evaluation every time and training on the other 9 folds.
The result is an array containing the 10 evaluation scores:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
    rmse_scores = np.sqrt(-scores)  # scores are negative MSE values
    
    

Scikit-Learn cross-validation features expect a utility function
(greater is better) rather than a cost function (lower is better), so
the scoring function is actually the opposite of the MSE (i.e., a negative
value), which is why the preceding code computes -scores
before calculating the square root.
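
Here is a self-contained sketch of the same procedure. It assumes synthetic regression data from make_regression in place of housing_prepared and housing_labels, so the numbers it prints say nothing about the housing problem:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the prepared housing data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # flip the sign, then take the square root

print("Mean RMSE:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```

Looking at the standard deviation alongside the mean gives an idea of how precise the performance estimate is, which a single validation set cannot provide.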

To save a trained model, you can use joblib:

    import joblib  # sklearn.externals.joblib was removed in Scikit-Learn 0.23
    joblib.dump(my_model, "my_model.pkl")
    # and later...
    my_model_loaded = joblib.load("my_model.pkl")

**Fine-Tune Your Model**

**GridSearch**

Instead of fiddling with the hyperparameters manually, you can get Scikit-Learn’s GridSearchCV to search for you. All you need to
do is tell it which hyperparameters you want it to experiment with, and what values to
try out, and it will evaluate all the possible combinations of hyperparameter values,
using cross-validation.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    param_grid = [
        # first grid: 3 x 4 = 12 combinations
        {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
        # second grid: 2 x 3 = 6 combinations, with bootstrapping disabled
        {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
    ]

    forest_reg = RandomForestRegressor()
    grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(housing_prepared, housing_labels)
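
Once the search has finished, the best combination and its score are available as attributes of the fitted object. The sketch below uses synthetic data and a smaller grid than the one above so that it runs quickly; both are assumptions made for the example:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared housing data
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

param_grid = [
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3]},
]

grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=3, scoring="neg_mean_squared_error")
grid_search.fit(X, y)

best_params = grid_search.best_params_                 # best hyperparameter combination
best_rmse = (-grid_search.best_score_) ** 0.5          # best_score_ is a negative MSE
n_candidates = len(grid_search.cv_results_["params"])  # number of combinations tried

print(best_params, best_rmse, n_candidates)
```

Each combination is trained once per fold, so the cost grows quickly: a grid of 18 combinations with cv=5 already means 90 rounds of training.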
    
**Randomized Search**

