<a href="https://colab.research.google.com/github/gerry11/ML/blob/main/ML_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Key Steps

1. Define an ML problem and propose a solution
2. Construct the dataset
  * collect raw data
  * identify feature and label sources
  * select a sampling strategy
  * split the data (or do this after transforming it)
3. Transform data
  * explore and clean your data
  * feature engineering
  * split the data (if not already done in step 2)
4. Train a model (starting from a simple one)
5. Use the model to make predictions

[Data Preparation and Feature Engineering](https://developers.google.com/machine-learning/data-prep/?utm_source=mlcc&utm_campaign=mlcc-next-steps&utm_medium=referral&utm_content=data-prep-ss) takes up most of your time when building models, and plays an important role in the success of your models. 

---






# Before manipulating the data



## 1. Know Your Data 
Before building models, a full understanding of the data is important. Several checks are essential for the success of a model.

To avoid bias, evaluate the fairness of the data by profiling it:
  *   Check for missing values
  *   Check the quantiles of each feature
  *   Plot histograms to see whether any skew misrepresents reality (and to make sure the training and test sets are similar)

The better you know your data, the more insight you'll have as to how to prepare your data for further modelling.
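
The profiling checks above can be sketched with pandas and NumPy. This is a minimal illustration, not part of the original notes: the helper names `profile` and `shared_histogram` and the toy data are hypothetical, and real data would normally come from a loaded DataFrame such as the census data used later.

```python
import numpy as np
import pandas as pd


def profile(df):
    """Summarize missing values per column and quartiles per numeric column."""
    summary = pd.DataFrame({
        "missing": df.isna().sum(),        # count of missing values
        "missing_frac": df.isna().mean(),  # fraction of missing values
    })
    quantiles = df.select_dtypes(include="number").quantile([0.25, 0.5, 0.75]).T
    quantiles.columns = ["q25", "q50", "q75"]
    # Non-numeric columns get NaN in the quantile columns.
    return summary.join(quantiles)


def shared_histogram(train_col, test_col, bins=4):
    """Bin two columns on the same edges so their shapes can be compared."""
    combined = pd.concat([train_col, test_col]).dropna()
    edges = np.histogram_bin_edges(combined, bins=bins)
    train_share = np.histogram(train_col.dropna(), bins=edges)[0] / train_col.notna().sum()
    test_share = np.histogram(test_col.dropna(), bins=edges)[0] / test_col.notna().sum()
    return edges, train_share, test_share


# Hypothetical toy data, for illustration only.
toy_train = pd.DataFrame({
    "age": [25, 38, np.nan, 52, 46],
    "hours_per_week": [40, 50, 40, 60, 20],
    "workclass": ["Private", None, "Private", "State-gov", "Private"],
})
toy_test = pd.DataFrame({"hours_per_week": [40, 45, 35, 80]})

report = profile(toy_train)
print(report)

edges, train_share, test_share = shared_histogram(
    toy_train["hours_per_week"], toy_test["hours_per_week"])
print(edges, train_share, test_share)
```

Large differences between `train_share` and `test_share` in the same bin are a hint that the two sets may not be drawn from similar distributions.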


###  Facets Tool to view data
One useful tool for viewing and understanding data is Facets, which consists of [Facets Overview](https://pair-code.github.io/facets/) and [Facets Dive](https://pair-code.github.io/facets/).

Sample code is listed below:

In [1]:
#@title Initial Environment and Data
# Run on TensorFlow 2.x
%tensorflow_version 2.x
from __future__ import absolute_import, division, print_function, unicode_literals

#@title Import relevant modules and install Facets
import numpy as np
import pandas as pd
import tensorflow as tf  # required below for tf.keras.utils.get_file
# from tensorflow.keras import layers
# from matplotlib import pyplot as plt
# from matplotlib import rcParams
# import seaborn as sns

# The following lines adjust the granularity of reporting. 
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format

from google.colab import widgets
# For facets
from IPython.display import display, HTML
import base64
!pip install facets-overview==1.0.0
from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator

In [None]:
# The Adult Census Income dataset (https://archive.ics.uci.edu/ml/datasets/Census+Income) is used as an example
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]

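# Use a distinct cache filename for each download. Reusing 'adult.data' for
# both calls makes get_file return the cached training file for the test set.
train_csv = tf.keras.utils.get_file('adult.data', 
  'https://download.mlcc.google.com/mledu-datasets/adult_census_train.csv')
test_csv = tf.keras.utils.get_file('adult.test', 
  'https://download.mlcc.google.com/mledu-datasets/adult_census_test.csv')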

train_df = pd.read_csv(train_csv, names=COLUMNS, sep=r'\s*,\s*', 
                       engine='python', na_values="?")
test_df = pd.read_csv(test_csv, names=COLUMNS, sep=r'\s*,\s*', skiprows=[0],
                      engine='python', na_values="?")

In [None]:
#@title Visualize with Facets Overview
fsg = FeatureStatisticsGenerator()
dataframes = [
    {'table': train_df, 'name': 'trainData'}]
censusProto = fsg.ProtoFromDataFrames(dataframes)
protostr = base64.b64encode(censusProto.SerializeToString()).decode("utf-8")


HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

In [None]:
#@title Visualize with Facets Dive
# Set the Number of Data Points to Visualize in Facets Dive

SAMPLE_SIZE = 5000 #@param
  
train_dive = train_df.sample(SAMPLE_SIZE).to_json(orient='records')

HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=train_dive)
display(HTML(html))

---
## Construct Dataset
To construct your dataset (and before doing data transformation), you should:
1. Collect the raw data.
2. Identify feature and label sources.
3. Select a sampling strategy.
4. Split the data (this may be done after transforming the data).




### Collecting the Data
#### [The Size of a Dataset](https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality#the-size-of-a-data-set)
"As a rough rule of thumb, your model should train on **at least an order of magnitude more examples than trainable parameters**. Simple models on large data sets generally beat fancy models on small data sets."
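
As a rough illustration of this rule of thumb, the sketch below counts the trainable parameters of a small fully connected network and compares the count against a dataset size. The helper names, the layer sizes, and the example counts are hypothetical; the factor of 10 is one reading of "an order of magnitude".

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network, e.g. [14, 32, 1]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def meets_rule_of_thumb(num_examples, num_params, factor=10):
    """True if there are at least `factor` times more examples than parameters."""
    return num_examples >= factor * num_params


# Hypothetical network: 14 input features, one hidden layer of 32 units, 1 output.
num_params = dense_param_count([14, 32, 1])
print(num_params)                              # (14*32 + 32) + (32*1 + 1) = 513
print(meets_rule_of_thumb(30000, num_params))  # 30000 >= 5130
print(meets_rule_of_thumb(1000, num_params))   # 1000 < 5130
```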

#### [The Quality of a Dataset](https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality#the-quality-of-a-data-set)
"A quality data set is one that lets you succeed with the business problem you care about. In other words, the data is good if it accomplishes its intended task."

Certain aspects of data quality tend to correspond to better-performing models:
1. [Reliability](https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality#reliability) refers to the degree to which you can trust your data. 
2. [Feature Representation](https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality#feature-representation): "Always consider what data is available to your model at prediction time. During training, use only the features that you'll have available in serving."
3. [Minimize Skew](https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality#training-versus-prediction):
"Make sure your training set is representative of your serving traffic."

### [Joining Logs](https://developers.google.com/machine-learning/data-prep/construct/collect/joining-logs)
### [Label Sources](https://developers.google.com/machine-learning/data-prep/construct/collect/label-sources)


### [Sampling and Splitting Data](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/sampling)

1. Sampling: handle imbalanced data by downsampling the majority class and upweighting the downsampled examples
2. Splitting: random splitting isn't always the best approach
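
The two ideas above can be sketched with pandas. This is an illustrative sketch under stated assumptions: the helper names, the `example_weight` column, and the toy data are hypothetical, and splitting by time is shown only as one alternative to a random split for time-ordered data.

```python
import pandas as pd


def downsample_and_upweight(df, label_col, majority_label, factor, seed=0):
    """Keep 1/factor of the majority class and give each kept row weight `factor`."""
    majority = df[df[label_col] == majority_label]
    minority = df[df[label_col] != majority_label]
    kept = majority.sample(n=len(majority) // factor, random_state=seed)
    kept = kept.assign(example_weight=float(factor))
    minority = minority.assign(example_weight=1.0)
    # Shuffle so the two classes are interleaved.
    combined = pd.concat([kept, minority]).sample(frac=1.0, random_state=seed)
    return combined.reset_index(drop=True)


def split_by_time(df, time_col, train_frac=0.8):
    """Train on the earliest rows and test on the latest ones."""
    ordered = df.sort_values(time_col)
    cutoff = int(len(ordered) * train_frac)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Hypothetical toy data: 100 majority-class rows and 5 minority-class rows.
toy = pd.DataFrame({"label": [0] * 100 + [1] * 5, "day": list(range(105))})

balanced = downsample_and_upweight(toy, "label", majority_label=0, factor=10)
print(balanced["label"].value_counts())

train_part, test_part = split_by_time(toy.sample(frac=1.0, random_state=1), "day")
print(train_part["day"].max(), test_part["day"].min())
```

The weight column keeps the total weight of the majority class equal to its original row count, so a loss that uses these weights still reflects the original class balance.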






---
## [Transform Data](https://developers.google.com/machine-learning/data-prep/transform/introduction)

### Numeric Data 


*   [Normalization](https://developers.google.com/machine-learning/data-prep/transform/normalization)
  * Scaling to a range
      * Suitable when the upper and lower bounds are known and there are few or no outliers
      * Suitable when the data is approximately uniformly distributed
  * Feature clipping
      * To deal with extreme outliers, cap feature values at a minimum and maximum value.
      One simple strategy is to clip by z-score to +/- 3.
  * Log scaling
      * Suitable for data with a *power law distribution*.
  * Z-score
      * Subtract the mean and divide by the standard deviation.
*   [Bucketing](https://developers.google.com/machine-learning/data-prep/transform/bucketing)
Sometimes, when there is no clear correlation between the label and a numeric feature, it is worth transforming that feature into a categorical one using a set of thresholds. This technique is called bucketing.

There are two approaches to bucketing numeric features:
1. Buckets with **equally spaced** boundaries:
  the boundaries are fixed and each bucket spans the same range
2. Buckets with **quantile** boundaries: 
  each bucket has the same number of records.  
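
The normalization and bucketing techniques in this section can be sketched with NumPy and pandas. This is a minimal sketch, not a definitive implementation: the helper names and the toy values are hypothetical, and `pd.cut` and `pd.qcut` are used here as one way to get equally spaced and quantile buckets.

```python
import numpy as np
import pandas as pd


def scale_to_range(x, lower, upper):
    """Map values with known bounds [lower, upper] onto [0, 1]."""
    return (x - lower) / (upper - lower)


def z_score(x):
    """Subtract the mean and divide by the standard deviation."""
    return (x - x.mean()) / x.std()


def clip_by_z_score(x, limit=3.0):
    """Cap z-scores at +/- limit to tame extreme outliers."""
    return np.clip(z_score(x), -limit, limit)


def log_scale(x):
    """Compress a long tail; requires strictly positive values."""
    return np.log(x)


# Hypothetical toy feature: 1..19 plus one extreme outlier.
values = np.append(np.arange(1.0, 20.0), 1000.0)

scaled = scale_to_range(np.array([0.0, 25.0, 100.0]), lower=0.0, upper=100.0)
z = z_score(values)
clipped = clip_by_z_score(values)
logged = log_scale(values)

# Bucketing: integer bucket ids instead of raw values.
equal_width = pd.cut(values, bins=4, labels=False)  # equally spaced boundaries
quantile = pd.qcut(values, q=4, labels=False)       # same number of records per bucket

print(np.bincount(equal_width, minlength=4))
print(np.bincount(quantile, minlength=4))
```

With the outlier present, the equally spaced buckets put almost every value into the first bucket, while the quantile buckets stay evenly filled.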



### [Categorical Data](https://developers.google.com/machine-learning/data-prep/transform/transform-categorical)
1. Vocabulary
2. Hashing
3. Hybrid of Hashing and Vocabulary
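
The three approaches can be sketched in plain Python. This is an illustrative sketch only: the function names, the choice to reserve index 0 for out-of-vocabulary (OOV) values, and the use of MD5 as a stable hash are assumptions made for this example.

```python
import hashlib


def build_vocabulary(values):
    """Map each distinct value to an index starting at 1; 0 is reserved for OOV."""
    return {value: i + 1 for i, value in enumerate(sorted(set(values)))}


def vocab_index(value, vocab):
    """Vocabulary lookup: every unseen value shares the single OOV index 0."""
    return vocab.get(value, 0)


def hash_bucket(value, num_buckets):
    """Hashing: a stable hash modulo the bucket count (collisions are possible)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def hybrid_index(value, vocab, num_oov_buckets):
    """Hybrid: vocabulary index if known, otherwise one of several hashed OOV buckets."""
    if value in vocab:
        return vocab[value]
    return len(vocab) + 1 + hash_bucket(value, num_oov_buckets)


# Hypothetical training values for a categorical feature.
vocab = build_vocabulary(["Private", "State-gov", "Private"])

print(vocab_index("Private", vocab))          # index from the vocabulary
print(vocab_index("Never-worked", vocab))     # unseen value falls back to 0
print(hash_bucket("Never-worked", 8))         # some bucket in [0, 8)
print(hybrid_index("Never-worked", vocab, 4)) # some bucket in [3, 7)
```

`hashlib` is used instead of the built-in `hash` because Python randomizes string hashes per process, which would assign different buckets on each run.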


NOTE that embeddings are not a typical data transformation; they are part of the model and are functionally equivalent to a layer of weights.





---


# Some Effective Guidelines for ML
* Keep your first model simple.
* Focus on ensuring data pipeline correctness.
* Use a simple, observable metric for training & evaluation.
* Own and monitor your input features.
* Treat your model configuration as code: review it, check it in.
* Write down the results of all experiments, especially "failures."

Source: Google's [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)