在进行特定编程的情况下,给予计算机学习能力的领域。 —— Arthur Samuel
主要两种类型的机器学习算法
- supervised learning
- unsupervised learning
X --> Y mappings input --> output label
learns from being given "right answers"
我们数据集中的每个样本都有相应的“正确答案”,再根据这些样本作出预测
- predict a number
- infinitely many possible numbers
- predict categories
- small number of possible outputs
术语class/category都表示类别
Given data is not associated with any output labels
That means, data only comes with inputs
无监督学习中没有任何的标签
我们已知数据集,却不知如何处理,也未告知每个数据点是什么
Our job is to find some structure or some pattern or just find some interesting in the unlabeled data.
grouping similar data points together
针对数据集,无监督学习就能判断出数据有多个不同的聚集簇。这是一个,那是另一个,二者不同。无监督学习算法可能会把这些数据分成多个不同的簇,可以对这些簇进行后续的工作。所以叫做聚类算法。
find unusual data points
take a big dataset and almost magically compress it to a much smaller dataset while losing as little as possible
- linear regression with one variable
Machine Learning Terminology 机器学习术语
中文术语 | 英文表达 | 英文释义 |
---|---|---|
训练集 | Data used to train the model | |
输入变量 | input variable / input feature / feature | |
输出变量 | output variable / target variable | |
估计值 | the estimate or the prediction for |
|
训练样本总数or训练集实例数量 | number of training examples | |
训练集中的单个实例 | single training example | |
训练集的第$i$个实例 |
|
In supervised learning, recall the training set includes the input features and the output targets (the output targets are the right answers to the model we will learn from).
graph TB
A[training set]
B[learning algorithm]
C([function])
D{x}
E{y-hat}
subgraph train
A --> B
B --> C
end
subgraph test
D -.-> C
C -.-> E
end
这里function也叫做假设(hypothesis)
function的工作是获取一个新的输入以及输出并进行估计或预测
function
y-hat or
x is called the input or the input feature
注意:$y$是指target,这是训练集中的实际真实值;而y-hat或者说$\hat{y}$是指估计值,它可能是也可能不是实际的真实值
The key question:
How to represent the function
example:
For Linear Regression with one variable (or you are allowed to call it the Univariate Linear Regression), the function $f$ can be written as
The cost function will tell us how well the model is doing so that we can try to get it to do better.
模型参数 parameters: the variables you can adjust during training in order to improve the model
参数(parameters)有时也会称为系数(coefficients)或者权重(weights)
对于之前的线性回归,$w$为斜率,$b$为截距,要做的事就是要找到合适的$w$和$b$,使得 $$\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = w * x^{(i)} + b$$ is close to $y^{(i)}$ for all $(x^{(i)}, y^{(i)})$
How to measure how well a line fits the training data? Cost function!!!
建模误差 error
Cost function 的一个选择为
remember $m$ is the number of training examples.
上面选择的代价函数也被称为平方误差函数或者平方误差代价函数,对于大多数问题,特别是回归问题,都是一个合理的选择。还有其他的代价函数也能很好地发挥作用,但是平方误差代价函数可能是解决回归问题最常用的手段了
最后求解Cost function的最小值,对应的参数值就是我们要找的合适的参数值
总结(以Linear Regression为例):
model | |
parameters |
|
cost function | |
goal |
Cost function的可视化
高维图(如$J(w,b)-w-b$组成的3D图)、等高线图
Gradient Descent is an algorithm that you can use to try to minimize any function.
Outline:
- Start with some
$w$ ,$b$ . - Keep on changing the parameters
$w$ and$b$ to reduce$J(w, b)$ - Until we settle at or near a minimum
Gradient Descent只能找到局部最小值,而无法找到全局最小值
Repeat until convergence (which means you reach the minimum) $$ \begin{align*} w &= w-\alpha*\frac{\partial}{\partial w}J(w,b) \ b &= b-\alpha*\frac{\partial}{\partial b}J(w,b) \end{align*} $$
Here
For linear regression, we always choose
$$
J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)}) - y^{(i)})^2
$$
and then
You should simultaneously update
- If
$\alpha$ is too small, gradient descent may be slow. - If
$\alpha$ is too large, gradient descent may- overshoot, never reach minimum
- fail to converge, diverge
Near a local minimum:
- Derivative becomes smaller
- Update steps become smaller
Can reach minimum without decreasing learning rate
"Batch" Gradient Descent
"Batch": Each step of gradient descent uses all the training examples instead of just a subset of the training data.