# ML Example

### Identify the problem
What is the purpose of the project? In the case of house price forecasting, the purpose is to predict the median property price.

### Solutions
- What is the business objective? Building a model is not the end goal in itself.
- How effective is the current solution? For example, the current solution may have a known error rate alpha.
- Frame the problem: is it supervised, unsupervised, or reinforcement learning? Is it classification, regression, or something else such as clustering? Should it use batch learning or online learning?
- Example: We have house price values, so it is a supervised problem. We ultimately want to predict the median house price, so it is a regression problem, and more precisely a multivariate regression because many features influence the prediction. In addition, there is no continuous inflow of data and no particular need to adapt quickly to changing data, and the data is small enough to fit in memory, so batch learning is fine. If the data volume were large, you could either split the batch learning across multiple servers (using MapReduce techniques, as you will see later) or use online learning.

### Select performance indicators
Here we need to choose an evaluation metric. A typical choice for regression problems is the root mean squared error (RMSE), which roughly measures the standard deviation of the system's prediction errors.
Alternatively, the mean absolute error (MAE) can be used; it is less sensitive to outliers.
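
As a minimal sketch of the two metrics, the snippet below computes RMSE and MAE with NumPy. The prices are made-up illustration values, not taken from the housing dataset, and the helper names `rmse` and `mae` are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical prices (in thousands); the last prediction is a large miss
y_true = [200.0, 150.0, 300.0, 250.0]
y_pred = [210.0, 140.0, 290.0, 350.0]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print("RMSE:", rmse_value)
print("MAE:", mae_value)
```

Because RMSE squares the residuals, the single large miss weighs more heavily in it than in MAE.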

### Check assumptions
- Check the data: data types, missing values, outliers, the distribution of each feature, correlations between features, overall data quality, and so on.
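
The checks listed above can be sketched as a small pandas helper. This is only an illustration: the function name `summarize_quality`, the toy DataFrame, and the 1.5 x IQR outlier rule are assumptions, not part of the original notes.

```python
import numpy as np
import pandas as pd

def summarize_quality(df):
    """Return per-column dtype, missing-value count, and IQR-based outlier count."""
    rows = []
    for col in df.columns:
        s = df[col]
        n_outliers = 0
        if pd.api.types.is_numeric_dtype(s):
            # Flag values further than 1.5 * IQR from the quartiles
            q1, q3 = s.quantile(0.25), s.quantile(0.75)
            iqr = q3 - q1
            n_outliers = int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())
        rows.append({"column": col, "dtype": str(s.dtype),
                     "missing": int(s.isna().sum()), "outliers": n_outliers})
    return pd.DataFrame(rows).set_index("column")

# Toy data with one extreme value and one missing value
toy = pd.DataFrame({
    "rooms": [3.0, 4.0, 5.0, 4.0, 3.0, 4.0, 50.0],
    "bedrooms": [1.0, 2.0, np.nan, 2.0, 1.0, 2.0, 3.0],
    "area": ["A", "B", "A", "C", "B", "A", "C"],
})
report = summarize_quality(toy)
print(report)
```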

### Create workspace

Set up a working environment, such as Python with Jupyter, the corresponding libraries (NumPy, pandas, SciPy, scikit-learn, etc.), and frameworks (TensorFlow, etc.).

### Get data
- Download the data
- Load the data
- Explore the data
- Visualize the data

In [None]:
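import pandas as pd
import matplotlib.pyplot as plt
import altair as alt

# Load the dataset and take a first look at it
data = pd.read_csv('housing.csv')
print(data.head())
data.info()
print(data.describe())

# Histogram of every numeric attribute
data.hist(bins=50, figsize=(20, 15))
plt.show()

# Geographic scatter plot colored by median house value.
# Altair limits charts to 5000 rows by default, so plot a random sample.
sample = data.sample(n=min(5000, len(data)), random_state=42)
alt.Chart(sample).mark_point().encode(
    x='longitude',
    y='latitude',
    color='median_house_value:Q'
)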

### Creating a test set

Before exploring the data in depth, it is best to set aside a test set so that its selection is not biased by what you have already seen in the data.
The simplest approach is to pick the test set at random, for example 20% of the data. A plain random split produces a different test set on every run, so fix the random seed (random_state) to make the split reproducible. Note that the seed alone does not keep the test set stable when the dataset itself is updated.

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
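
A fixed seed keeps the split reproducible across runs, but not when rows are added to the dataset. One common workaround, sketched here under the assumption that every row has a stable unique identifier, is to decide test-set membership from a hash of that identifier. The helper names `in_test_set` and `split_by_id` and the toy data are hypothetical.

```python
import zlib
import numpy as np
import pandas as pd

def in_test_set(identifier, test_ratio):
    """Deterministically assign an id to the test set from its CRC32 hash."""
    digest = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return digest < test_ratio * 2**32

def split_by_id(df, test_ratio, id_column):
    """Split df into (train, test) based only on each row's identifier."""
    is_test = df[id_column].apply(lambda i: in_test_set(i, test_ratio))
    return df.loc[~is_test], df.loc[is_test]

# Toy data: the row index serves as the identifier (an assumption)
toy = pd.DataFrame({"value": np.arange(1000)}).reset_index()
train, test = split_by_id(toy, 0.2, "index")

# After appending new rows, existing rows stay in the same set
bigger = pd.DataFrame({"value": np.arange(1200)}).reset_index()
train2, test2 = split_by_id(bigger, 0.2, "index")
print(len(train), len(test), len(train2), len(test2))
```

The test fraction is only approximately `test_ratio`, because membership depends on the hash value of each identifier.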


In addition, purely random sampling risks distorting the distribution of key features. Suppose a feature A contributes strongly to the final label (the two are strongly correlated).
Then the distribution of A in the test set should follow its distribution in the original dataset. This can be achieved with stratified sampling.

In [None]:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Bucket the continuous feature A into at most 5 strata
data["A_new"] = np.ceil(data["A"] / 1.5)
data["A_new"] = data["A_new"].where(data["A_new"] < 5, 5.0)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields positional indices, so use iloc rather than loc
for train_index, test_index in split.split(data, data["A_new"]):
    strat_train_set = data.iloc[train_index].copy()
    strat_test_set = data.iloc[test_index].copy()

# Check the result: the stratum proportions should be nearly identical
print(data["A_new"].value_counts() / len(data))
print(strat_test_set["A_new"].value_counts() / len(strat_test_set))

# Finally, remove the helper column A_new from both sets
for subset in (strat_train_set, strat_test_set):
    subset.drop(["A_new"], axis=1, inplace=True)

This gives us two train/test splits: a random one and a stratified one.

### Data Visualization

It is usually worth taking a first look at the data to see how the features relate to the labels and which features are the most useful or influential. The matplotlib library can be used here.

In [None]:
data.plot(kind="scatter", x="longitude", y="latitude")  # scatter plot of the geographic distribution

### Finding correlations

You can use the corr() method to calculate the standard correlation coefficient (Pearson's r) between each pair of attributes.

In [None]:
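# numeric_only=True skips text columns such as ocean_proximity (needed in pandas 2.x)
corr_m = data.corr(numeric_only=True)
# Strongest positive correlations with the target first
print(corr_m['median_house_value'].sort_values(ascending=False))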


Note that corr() only captures linear relationships and misses non-linear ones, so its usefulness as a reference is limited. Another way to inspect the relationships between attributes is pandas' scatter_matrix function, which plots every attribute against every other.

In [None]:

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
pd.plotting.scatter_matrix(data[attributes], figsize=(12, 8))
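
To illustrate the limitation of corr() noted above, the following sketch builds two synthetic columns that both depend exactly on `x`, one linearly and one quadratically. The data is made up for illustration only.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1.0, 1.0, 201)
df = pd.DataFrame({
    "x": x,
    "linear": 2.0 * x + 1.0,   # exact linear dependence on x
    "quadratic": x ** 2,       # exact but non-linear dependence on x
})

# Pearson correlation of every column with x
corr_with_x = df.corr()["x"]
print(corr_with_x)
```

The linear column correlates perfectly with `x`, while the quadratic column has a coefficient of essentially zero even though it is fully determined by `x`.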


### Attribute combination test

Sometimes the original features alone do not work well. In that case, consider combining features to generate new ones, for example dividing the population by the number of households to get the number of people per household.

In [None]:
data['population_per_household']=data['population']/data['households']


### Prepare the data for machine learning algorithms
Do not do this by hand; write functions instead, for the following reasons:
- Functions let you easily repeat the same data transformations on any dataset (for example, the next time you fetch a fresh dataset).
- You gradually build a library of transformation functions that can be reused in future projects.
- You can use these functions in your live system to transform new data before passing it to the algorithm.
- You can easily try several transformations and see which combination works best.
One thing to keep in mind: always work on a copy of the data so that later processing does not alter the original data, and keep the labels separate from the features (the code below uses housing for the features and housing_labels for the labels).
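
A minimal sketch of such a reusable transformation function is shown below. The function name `add_household_ratios` and the toy values are assumptions for illustration; the column names follow the housing dataset used in these notes.

```python
import pandas as pd

def add_household_ratios(df):
    """Return a copy of df with per-household ratio features added."""
    out = df.copy()  # never modify the caller's DataFrame
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["population_per_household"] = out["population"] / out["households"]
    return out

# Toy rows with made-up values
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})
prepared = add_household_ratios(toy)
print(prepared)
```

Because the function works on a copy, the same call can be applied to the training set, the test set, or a freshly fetched dataset without side effects.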

### Data Cleaning

The original data can contain problems such as missing values, so it needs to be cleaned. This is a critical step.
There are three ways to deal with missing values:
1. Delete the rows that contain the missing value.
2. If a feature has too many missing values, delete the whole feature.
3. Fill in the missing positions with a value (0, the median, the mean, etc.).

In [None]:
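housing.dropna(subset=['total_bedrooms'])    # option 1: drop rows with missing values
housing.drop('total_bedrooms', axis=1)       # option 2: drop the whole feature
median = housing['total_bedrooms'].median()  # option 3: fill with the median
housing['total_bedrooms'].fillna(median)
# Each call returns a new object; assign the result to keep it.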


Prefer the third option so that you make full use of the original data. Scikit-Learn's SimpleImputer class (in sklearn.impute; it replaced the Imputer class of older versions) can be used to handle missing values.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
# The median only exists for numeric columns, so drop the text attribute first
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)


### Processing text and category attributes

The data contains some text attributes. We can encode them as one-hot vectors, which takes two transformations: text categories to integer categories, and integer categories to one-hot vectors.
Both transformations can be done in one step with sklearn's LabelBinarizer.

In [None]:
from sklearn.preprocessing import LabelBinarizer
housing_cat = housing["ocean_proximity"]  # the text attribute
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
print(housing_cat_1hot)


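However, LabelBinarizer is intended for label columns only. For feature columns, use OneHotEncoder, which in current Scikit-Learn versions encodes string categories directly (the CategoricalEncoder class once planned for this purpose was merged into OneHotEncoder and OrdinalEncoder before release), e.g.

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)  # the encoder expects a 2D array
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
print(housing_cat_1hot.toarray())  # the result is a SciPy sparse matrix by default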


### Custom converters

Custom converters (transformers) bundle data processing operations such as the cleaning and attribute combination described earlier, and a homemade converter can work seamlessly with sklearn's pipelines. The sample code for this section can be found in the documentation you wrote (note:). The attribute combinations can be written into such a converter.
Note that you can give the converter a hyperparameter for each added attribute, which makes it easy to check whether that attribute actually helps the ML algorithm.
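
The sample code itself is not included in these notes, so the following is only a hedged sketch of what such a converter might look like. The column positions (`rooms_ix` and so on) are assumptions about the layout of the numeric array, and the hyperparameter name `add_bedrooms_per_room` is hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed column positions in the numeric array (hypothetical; adjust to your data)
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Append ratio features; one of them is switched by a hyperparameter."""

    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        extra = [
            X[:, rooms_ix] / X[:, households_ix],       # rooms per household
            X[:, population_ix] / X[:, households_ix],  # population per household
        ]
        if self.add_bedrooms_per_room:
            extra.append(X[:, bedrooms_ix] / X[:, rooms_ix])
        return np.c_[X, np.column_stack(extra)]

# Toy array with made-up values in the assumed column layout
toy = np.array([[0, 0, 0, 100.0, 20.0, 50.0, 25.0],
                [0, 0, 0, 200.0, 50.0, 80.0, 40.0]])
with_all = CombinedAttributesAdder().fit_transform(toy)
without_ratio = CombinedAttributesAdder(add_bedrooms_per_room=False).fit_transform(toy)
print(with_all.shape, without_ratio.shape)
```

Exposing `add_bedrooms_per_room` as a constructor argument lets a hyperparameter search turn the extra attribute on and off like any other setting.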

### Feature scaling

This step matters when the numerical input attributes have very different scales. For example, if the age attribute ranges from 20 to 50 and the income attribute ranges from 5000 to 100000, many algorithms will not perform well on such data. Target values usually do not need to be scaled.
There are two common approaches:
- Min-max scaling (normalization): subtract the minimum value and divide by the difference between the maximum and the minimum, so that values end up in the range 0 to 1. See sklearn's MinMaxScaler.
- Standardization: subtract the mean and divide by the standard deviation, so that the result has zero mean and unit variance. See sklearn's StandardScaler.
Note: Fit scalers and other data transformations on the training set only, not on the complete dataset, and then use the fitted transformer to transform both the training set and the test set.
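
The sketch below applies both scalers to made-up age and income values and reproduces the results by hand from training-set statistics. The arrays are illustrative assumptions, not data from the housing set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical columns: age (20 to 50) and income (5000 to 100000)
X_train = np.array([[20.0, 5000.0], [35.0, 40000.0],
                    [50.0, 100000.0], [30.0, 20000.0]])
X_test = np.array([[40.0, 60000.0], [60.0, 120000.0]])

# Fit on the training set only, then reuse the fitted scalers on the test set
minmax = MinMaxScaler().fit(X_train)
std = StandardScaler().fit(X_train)
X_train_mm, X_test_mm = minmax.transform(X_train), minmax.transform(X_test)
X_train_std, X_test_std = std.transform(X_train), std.transform(X_test)

# The same results by hand, using training-set statistics
train_min, train_max = X_train.min(axis=0), X_train.max(axis=0)
manual_mm = (X_test - train_min) / (train_max - train_min)
manual_std = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_test_mm)
print(X_test_std)
```

Test values outside the training range map outside 0 to 1 under min-max scaling, which is expected when the scaler is fitted on the training set alone.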

### Transformation pipeline

The purpose of a pipeline is to create a pattern that allows data to be processed and transformed in a certain order. For example, the following is a complete pipeline for processing numeric and category attributes.

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# DataFrameSelector and CombinedAttributesAdder are custom converters
# that must be defined before this cell runs.
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        # Dense one-hot output; scikit-learn < 1.2 spells this sparse=False
        ('cat_encoder', OneHotEncoder(sparse_output=False)),
    ])

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])


In [None]:
housing_prepared=full_pipeline.fit_transform(housing)


Calling fit_transform runs both sub-pipelines on the same input and concatenates their outputs: num_pipeline applies selector -> imputer -> attribs_adder -> std_scaler, and cat_pipeline applies selector -> cat_encoder. The result is the processed training set.
In other words: the numeric sub-pipeline selects the numeric attributes, fills in missing values, adds the combined attributes, and standardizes them; the categorical sub-pipeline selects the category attributes and encodes them as one-hot vectors.
About the selector: it picks the requested attributes (numeric or categorical), discards the rest, and converts the resulting DataFrame into a NumPy array. Older versions of Scikit-Learn had no built-in tool for selecting pandas DataFrame columns (newer versions provide ColumnTransformer), so a simple custom converter does the job:

In [None]:

from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
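
As an alternative to the custom selector above, newer Scikit-Learn versions can route DataFrame columns with ColumnTransformer. The sketch below is a swapped-in technique, not the approach used in these notes; the toy DataFrame is made up and the attribute-combination step is left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one missing numeric value and one text attribute
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, np.nan, 5.6],
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
})
num_cols = ["median_income", "total_rooms"]
cat_cols = ["ocean_proximity"]

num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns it applies to)
preprocess = ColumnTransformer([
    ("num", num_steps, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

prepared = preprocess.fit_transform(toy)
if hasattr(prepared, "toarray"):  # the output may be a sparse matrix
    prepared = prepared.toarray()
print(prepared.shape)
```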

### Select and train models

Training and evaluation on the training set

Here we select a model and train it on the prepared data. (In practice, most of the work goes into preparing the data: cleaning, visualization, transforming text and category attributes, and so on.)

In [None]:
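from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Train a linear regression model and measure its RMSE on the training set
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)

# Evaluate a decision tree with 10-fold cross-validation
tree_reg = DecisionTreeRegressor()
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)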


Of course, linear regression is not the only option; other models such as decision trees and random forests can be used as well, and the steps are the same as above.

Using cross-validation for better evaluation

Alternatively, we can use cross-validation to validate the model; the code above does this for a decision tree.

In that code, the training set is divided into ten distinct subsets called "folds", and the decision tree model is then trained and evaluated 10 times, each time holding out a different fold for evaluation and training on the other 9. The result is an array of 10 scores.
The Scikit-Learn cross-validation function expects a utility function (greater is better) rather than a loss function (lower is better), so the scoring function is actually the negative of the MSE, which is why the code computes -scores before taking the square root.
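
To make the fold procedure concrete, the sketch below repeats it by hand with KFold and compares the result with cross_val_score. It uses synthetic data and a linear model so that it runs on its own; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Manual 10-fold loop: train on 9 folds, evaluate on the held-out fold
manual_rmse = []
for train_idx, val_idx in KFold(n_splits=10).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    manual_rmse.append(np.sqrt(mse))
manual_rmse = np.array(manual_rmse)

# The library call returns negated MSE values, one per fold
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=10)
library_rmse = np.sqrt(-scores)

print(manual_rmse.mean(), library_rmse.mean())
```

For a regressor, an integer cv makes cross_val_score use unshuffled K-fold splitting, which is why the manual loop reproduces its scores.
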
To view the results, the display_scores call in the cross-validation cell under "Training and evaluation on the training set" prints the ten scores, their mean, and their standard deviation.


### Model Fine-tuning

Grid Search

Use Scikit-Learn's GridSearchCV class, here with a random forest as an example.

In [None]:
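from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# Two grids are searched: 3 x 4 combinations, then 2 x 3 with bootstrap disabled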
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

### Randomized Search

When the hyperparameter search space is large, it is usually better to use RandomizedSearchCV. This class is used in much the same way as GridSearchCV, but instead of trying all possible combinations, it evaluates a given number of random combinations, drawing a random value for each hyperparameter at every iteration.
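
A minimal sketch of a randomized search follows. The synthetic data, the parameter ranges, and the small n_iter are assumptions chosen so that the example runs quickly on its own.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data with 6 features
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Distributions to sample from instead of fixed grids
param_distributions = {
    "n_estimators": randint(low=5, high=30),  # integers in [5, 30)
    "max_features": randint(low=1, high=7),   # integers in [1, 7)
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                  # number of random combinations to try
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

The n_iter argument controls the computing budget directly, independent of how many hyperparameters are being searched.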

Ensemble methods

Another way to fine-tune the system is to combine the best-performing models into an ensemble, which often performs better than any single model.
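
One simple way to combine models, sketched here under the assumption that plain averaging of predictions is acceptable, is to fit several regressors and take the mean of their outputs. The synthetic data and the choice of the two models are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a linear and a non-linear component
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = X[:, 0] ** 2 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_val, y_train, y_val = X[:150], X[150:], y[:150], y[150:]

models = [LinearRegression(),
          DecisionTreeRegressor(max_depth=5, random_state=42)]

# One column of validation predictions per model
predictions = np.column_stack(
    [m.fit(X_train, y_train).predict(X_val) for m in models])

# The ensemble prediction is the plain average across models
ensemble_prediction = predictions.mean(axis=1)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

individual_mse = [mse(y_val, predictions[:, i]) for i in range(len(models))]
ensemble_mse = mse(y_val, ensemble_prediction)
print(individual_mse, ensemble_mse)
```

Averaging never does worse than the average of the individual squared errors, but it is not guaranteed to beat the best single model, so compare the numbers on a validation set.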

Analyzing the best models and their errors

A deeper understanding of the problem can often be gained by analyzing the best models. For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions.

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
print(feature_importances)

# Pair each importance score with its attribute name, most important first
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
print(sorted(zip(feature_importances, attributes), reverse=True))

# Final evaluation on the held-out test set: drop the labels, run the
# already-fitted pipeline with transform() (not fit_transform()), then predict
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)


Based on the above importance scores, we can discard some unimportant attributes, etc.

Evaluate the system with test sets

Once the model is tuned, we need to evaluate it on the test set. Note that the test set has been left untouched since we split the data, so we first need to process it: drop the labels, run it through the pipeline, and so on. Then we can apply our model to it.


At this point, we have basically finished creating and testing the model, and then we need to present the results or conclusions.

Launch, monitor and maintain the system

The final model is embedded into the company's system, where it runs automatically.

Practice

Do not spend too much time on advanced algorithms; being able to use them is enough. Focus on understanding the business and the data, and on processing the data.

