# Basics of Machine Learning

## How do we go from a problem to a machine learning model?

Learning Approach:

    Data -> Define the problem -> Extract feature vectors -> Algorithm -> Model
    
Inference Stage:

    Test data -> Feature vector -> Model -> Prediction
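
The two flows above can be sketched in plain Python. Everything here is illustrative rather than taken from the notes: the toy dataset and the names `extract_features`, `train`, and `predict` are hypothetical, and the "algorithm" is a nearest-centroid classifier chosen only because it fits in a few lines.

```python
# Hypothetical toy data: ((height_cm, weight_kg), label).
training_data = [((150, 50), "small"), ((160, 55), "small"),
                 ((180, 85), "large"), ((190, 95), "large")]

def extract_features(raw):
    """Turn a raw (height_cm, weight_kg) record into a feature vector."""
    height_cm, weight_kg = raw
    return [height_cm / 100.0, weight_kg / 100.0]

def train(examples):
    """Learning stage: the 'model' is one mean feature vector per label."""
    sums, counts = {}, {}
    for raw, label in examples:
        vec = extract_features(raw)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(model, raw):
    """Inference stage: pick the label whose centroid is closest."""
    vec = extract_features(raw)

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(model, key=lambda label: sq_dist(model[label]))

model = train(training_data)           # data -> features -> algorithm -> model
prediction = predict(model, (185, 90)) # test data -> features -> model -> prediction
```

Note that the same `extract_features` function is used in both stages, so test data reaches the model in the same representation it was trained on.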

## Data Representation and Features

**Vectors** - have two properties: dimension and type.

**Matrix** - a vector of vectors.

**Graphs** - a collection of objects (nodes) that can be linked together with **edges** to represent a network.
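
As a rough illustration, the three representations could look like this in plain Python. The concrete values and the adjacency-list encoding of the graph are assumptions made for the example.

```python
# A vector: a fixed number of values (its dimension) of one type.
vector = [1.0, 2.5, -3.0]
dimension = len(vector)
value_type = type(vector[0])

# A matrix: a vector of vectors (here 2 rows x 3 columns).
matrix = [[1.0, 0.0, 2.0],
          [0.0, 3.0, 1.0]]
rows, cols = len(matrix), len(matrix[0])

# A graph: nodes linked by edges, stored as an adjacency list.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

# Each undirected edge appears twice in the adjacency list, so collect
# the endpoints as unordered pairs to count every edge once.
edges = {frozenset((u, v)) for u, neighbors in graph.items() for v in neighbors}
```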

## How do we choose good features?

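Q: Detecting cars in an image

Step 1. Convert the image to grayscale to reduce the computational load.
Step 2. A grayscale image lets the algorithm focus on learning the shapes and textures in the picture.
Step 3. "Big data" is not always beneficial for machine learning algorithms; performance may suffer.
Step 4. Make sure the dimension of the feature vector is not too large (the curse of dimensionality).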
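
A minimal sketch of the grayscale step, showing how it shrinks the number of values the algorithm has to process. The weights are the commonly used ITU-R BT.601 luma coefficients, which is an assumption since the notes do not name a conversion; `to_grayscale` and the sample image are hypothetical.

```python
def to_grayscale(image):
    """Convert an H x W image of (r, g, b) tuples to H x W luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]

# Hypothetical 2 x 2 RGB image with channel values in 0..255.
rgb = [[(255, 255, 255), (0, 0, 0)],
       [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(rgb)

# Three numbers per pixel before, one number per pixel after.
values_before = sum(len(pixel) for row in rgb for pixel in row)
values_after = sum(len(row) for row in gray)
```

The feature vector built from `gray` has a third of the dimensions of one built from `rgb`, which is the point of Steps 1 and 4.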

Q: When designing a robotic arm that folds clothes, how do we measure a garment's "useful features"?

Ans: 

**Good:** width, height, x-symmetry score, y-symmetry score, flatness

**Bad:** color, cloth texture, material
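
To make the "good" features concrete, here is one possible way to compute width, height, and a symmetry score from a binary silhouette of a garment. The notes do not define these scores, so the definitions below are assumptions: `bounding_box_size` and `mirror_symmetry_score` are hypothetical helpers, and the score is simply the fraction of cells that match their left-right mirror image.

```python
def bounding_box_size(mask):
    """Width and height of the smallest box containing all 1-cells."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1

def mirror_symmetry_score(mask):
    """Fraction of cells equal to their left-right mirror (1.0 = symmetric)."""
    total = sum(len(row) for row in mask)
    matches = sum(a == b for row in mask for a, b in zip(row, reversed(row)))
    return matches / total

# Hypothetical silhouette of a T-shirt (1 = cloth, 0 = background).
shirt = [[0, 1, 1, 1, 0],
         [1, 1, 1, 1, 1],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0]]

width, height = bounding_box_size(shirt)
score = mirror_symmetry_score(shirt)
```

None of these values depend on color or material, which matches the split between good and bad features above.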


Q: For pictures of a desk lamp, a pair of pants, and a dog, which features help tell the three apart?

Ans: 

**Good:** 

1. {brightness, reflection} distinguishes the desk lamp
2. {shape of pants} distinguishes the pants
3. {texture} distinguishes the dog


## How do we measure similarity between features?

Ans: Distance metrics
    
###### L0, L1, L2 norm

**L2 norm:** the Euclidean distance, ||x - y||₂ = √(∑ (x_n - y_n)²)

**L0 norm:** counts the total number of nonzero elements of a vector

**L1 norm:** the Manhattan distance, ∑|x_n - y_n|, i.e. the distance between two vectors measured along the orthogonal axis directions
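
The three norms can be written out in a few lines of Python. This is a minimal sketch; the function names and sample vectors are chosen for illustration.

```python
import math

def l0_norm(v):
    """Number of nonzero elements."""
    return sum(1 for a in v if a != 0)

def l1_norm(v):
    """Sum of absolute values (Manhattan)."""
    return sum(abs(a) for a in v)

def l2_norm(v):
    """Square root of the sum of squares (Euclidean)."""
    return math.sqrt(sum(a * a for a in v))

def distance(x, y, norm):
    """Distance between x and y: the chosen norm of their difference."""
    return norm([a - b for a, b in zip(x, y)])

x = [3.0, 0.0, -1.0]
y = [0.0, 0.0, 3.0]
# The difference x - y is [3, 0, -4].

d0 = distance(x, y, l0_norm)   # two coordinates differ
d1 = distance(x, y, l1_norm)   # |3| + |0| + |-4|
d2 = distance(x, y, l2_norm)   # sqrt(9 + 0 + 16)
```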

###### Q: When should we use a metric other than the L2 norm?

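Scenario 1: Users must not see more than five incorrect search results in any single month.
==> The L-infinity norm of this 12-D vector (one error count per month) must be at most 5.

Scenario 2: Users may see fewer than five incorrect results per year.
==> L1 norm < 5. The sum of all errors over the entire space should be less than 5.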
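
Both scenarios can be checked against the same 12-D vector of monthly error counts. The counts below are made up for illustration, and whether each bound is strict depends on how the requirement is worded, so treat the comparison operators as assumptions.

```python
LIMIT = 5

# Hypothetical count of incorrect results shown in each of 12 months.
errors_per_month = [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0]

# L-infinity norm: the largest single entry, i.e. the worst month.
worst_month = max(abs(e) for e in errors_per_month)

# L1 norm: the sum of all entries, i.e. the total for the year.
yearly_total = sum(abs(e) for e in errors_per_month)

meets_monthly_limit = worst_month <= LIMIT   # Scenario 1 (per-month bound)
meets_yearly_limit = yearly_total < LIMIT    # Scenario 2 (per-year bound)
```

A vector can satisfy the per-month bound while violating the yearly one, for example three errors every month, which is why the choice of norm matters.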

## Types of learning

### === Supervised Learning ===
All about learning from examples laid out by a supervisor.

**labeled data**

**model** - a useful understanding of the data, learned from the labeled examples

**training dataset** - a collection of labeled examples

**the ground truth of x** - the label y associated with x, written f(x) = y

**a regressor** - if y = f(x) can take many values that have a natural ordering, the model is a regressor.

**g(x|𝛉)** - the model's prediction for x given parameters 𝛉. Its success depends on how well it agrees with the ground truth y.

**the cost** - the distance between two vectors: the ground truth and g(x|𝛉)

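The learning algorithm's main job is to compute a 𝛉* (theta star) that minimizes the cost over all data points x ∈ X. Methods used to search for these minimizing parameters include **gradient descent**, **simulated annealing**, and **genetic algorithms**.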
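
As a minimal sketch of the first of these methods, the code below runs gradient descent on a one-variable linear model. The model form g(x|𝛉) = 𝛉₀ + 𝛉₁x, the mean squared error cost, the learning rate, and the toy data are all assumptions made for the example; the notes only say that the cost is a distance between the ground truth and the prediction.

```python
def g(x, theta):
    """Model prediction for x given parameters theta = [bias, slope]."""
    return theta[0] + theta[1] * x

def cost(theta, data):
    """Mean squared distance between predictions and ground truth."""
    return sum((g(x, theta) - y) ** 2 for x, y in data) / len(data)

def gradient(theta, data):
    """Partial derivatives of the cost with respect to bias and slope."""
    n = len(data)
    d_bias = sum(2 * (g(x, theta) - y) for x, y in data) / n
    d_slope = sum(2 * (g(x, theta) - y) * x for x, y in data) / n
    return [d_bias, d_slope]

def gradient_descent(data, learning_rate=0.05, steps=2000):
    """Repeatedly step against the gradient, starting from theta = [0, 0]."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        grad = gradient(theta, data)
        theta = [t - learning_rate * d for t, d in zip(theta, grad)]
    return theta

# Toy labeled data generated from the ground truth f(x) = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
theta_star = gradient_descent(data)
```

On this data `theta_star` approaches a bias of 1 and a slope of 2, the parameters used to generate the labels.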

### === Unsupervised Learning ===

Clustering - like classification of data without knowing any labels.

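k-means - one of the most popular clustering algorithms. It alternates between assigning points to clusters and updating the cluster centers, in the style of the expectation-maximization (E-M) algorithm.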
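
A small pure-Python sketch of k-means follows, alternating an assignment step and an update step. The fixed iteration count, the hand-picked starting centroids, and the toy points are assumptions made to keep the example deterministic; practical implementations usually pick starting centroids randomly and stop once assignments no longer change.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
          [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
centroids, clusters = k_means(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
```

No labels are used anywhere: the two groups emerge only from the distances between points.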

Dimensionality reduction:

- principal component analysis (PCA)
- autoencoders

### === Reinforcement Learning ===

The "environment" acts as the teacher and provides "hints". The system receives feedback on its "actions".

RL trains on information gathered by observing how the environment reacts to actions.

**States** - each state yields a corresponding **reward**. The total reward accumulated from a state onward is called **the value of the state.**

**Actions**

The agent's goal is to find a sequence of actions that MAXIMIZES rewards.
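
To make the goal concrete, the sketch below scores every possible action sequence in a tiny deterministic environment and keeps the best one. This is exhaustive search, not a reinforcement learning algorithm: it only works because the environment is tiny and fully known. The environment, `REWARDS`, `ACTIONS`, and the helper names are all hypothetical.

```python
from itertools import product

# Hypothetical environment: positions 0..3 on a line.
REWARDS = {0: 0, 1: -1, 2: 0, 3: 10}   # reward received on entering a state
ACTIONS = {"left": -1, "right": +1}

def step(state, action):
    """Apply an action; the agent cannot move past either end of the line."""
    next_state = min(max(state + ACTIONS[action], 0), 3)
    return next_state, REWARDS[next_state]

def total_reward(start, actions):
    """Sum of the rewards collected by following a sequence of actions."""
    state, total = start, 0
    for action in actions:
        state, reward = step(state, action)
        total += reward
    return total

def best_sequence(start, horizon):
    """Try every action sequence of the given length and keep the best."""
    return max(product(ACTIONS, repeat=horizon),
               key=lambda seq: total_reward(start, seq))

best = best_sequence(start=0, horizon=3)
best_total = total_reward(0, best)
```

The best sequence accepts the small penalty at state 1 in order to reach the large reward at state 3, which is the kind of trade-off an agent has to learn when it cannot enumerate every sequence.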

# Real-world problems

|Problem|Solution|
|---|---|
|Predicting trends, fitting a curve to data points, describing relationships between variables|Linear Regression|
|Classifying data into two categories, finding the best way to split a dataset|Logistic Regression|
|Classifying data into multiple categories|Softmax Regression|
|Revealing hidden causes of observations, finding the most likely hidden reason for a series of outcomes|Hidden Markov model (Viterbi)|
|Clustering data into a fixed number of categories, automatically partitioning data points into separate classes|K-means|
|Clustering data into arbitrary categories, visualizing high-dimensional data into a lower-dimensional embedding|Self-organizing map|

|Problem|Solution|
|---|---|
|Reducing dimensionality of data, learning latent variables responsible for high-dimensional data| Autoencoder|
|Planning actions in an environment using neural networks (reinforcement learning)|Q-policy NN|
|Classifying data using supervised neural networks|Perceptron|
|Classifying real-world images using supervised neural networks|CNN|
|Producing patterns that match observations using neural networks|RNN|
|Predicting natural language responses to natural language queries| Seq2seq model|
|Learning to rank items by learning their utility| Ranking|