
## Data Wrangling and Pre-processing

- Linear regression and logistic regression models with gradient descent learning algorithms

- Evaluate the performance of the models using evaluation metrics


## Gradient descent - find the local minimum of a function 

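- For simple functions, the minimum can be found analytically: set the first derivative to zero, then use the second derivative to tell a minimum from a plateau (inflection) point. For more complex functions and models that is not possible, so an iterative method is used instead.

- The gradient is a vector. Each component gives the rate of change of f in the x_i direction while the other variables are held fixed; for a single-variable function, it is simply the rate of change along the x-axis.

- Each update needs a direction (the opposite of the gradient), a distance (scaled by the learning rate) and a stopping condition (a maximum number of iterations, or |step| = |next_x - current_x| <= precision for a chosen precision).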
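To make "rate of change in the x_i direction with the other variables held fixed" concrete, here is a minimal sketch that approximates each gradient component with a central difference. The helper `numerical_gradient` and the test function `example_f` are hypothetical names chosen for illustration; they are not part of the original notes.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at point x (a list of floats)
    using central differences, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        # Only x_i changes; every other coordinate is held fixed.
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def example_f(x):
    # Hypothetical function f(x0, x1) = x0**2 + 3*x1, chosen for illustration.
    return x[0] ** 2 + 3 * x[1]


grad = numerical_gradient(example_f, [1.0, 2.0])
print(grad)  # approximately [2.0, 3.0]
```

A gradient descent step from this point would move against `grad`, with the learning rate scaling how far.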


In [None]:
# source: https://en.wikipedia.org/wiki/Gradient_descent
# code source: https://en.wikipedia.org/w/index.php?title=Gradient_descent&oldid=966271567

next_x = 6  # We start the search at x = 6
gamma = 0.01  # Step size multiplier
precision = 0.00001  # Desired precision of result
max_iters = 10000  # Maximum number of iterations

# Derivative function, entered by hand
def df(x):
    return 4 * x ** 3 - 9 * x ** 2

# Iterate; stop once the step is within the precision or the maximum number of iterations is reached
for i in range(max_iters):
    current_x = next_x
    next_x = current_x - gamma * df(current_x)
    print(i, next_x, df(current_x))

    step = next_x - current_x
    if abs(step) <= precision:
        break

print("Minimum at ", next_x)

# The output for the above will be something like 
# "Minimum at 2.2499646074278457"
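For this particular function the result can also be checked analytically with the second-derivative test. The sketch below assumes the function being minimised is f(x) = x^4 - 3x^3 plus a constant, which is what the derivative `df` implies; `d2f` is a name introduced here.

```python
def df(x):
    # First derivative, as in the cell above.
    return 4 * x ** 3 - 9 * x ** 2


def d2f(x):
    # Second derivative of the same function.
    return 12 * x ** 2 - 18 * x


# df(x) = x**2 * (4*x - 9) vanishes at x = 0 and x = 9/4.
for x in (0.0, 9 / 4):
    print(x, df(x), d2f(x))
```

At x = 9/4 the second derivative is positive (20.25), so that point is a minimum, and it agrees with the value gradient descent converges to. At x = 0 the second derivative is zero, so the test is inconclusive there; that is the plateau point.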

# Linear regression

SSE (t_d is the target and o_d is the model output for training example d in the data set D):
$$
E[w] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
$$


MSE:
$$
\text{Error}_{(m,b)} = \frac{1}{N} \sum_{i=1}^{N} (y_i - (mx_i + b))^2
$$


Note that both SSE and MSE can be used for linear regression. SSE loss is mostly used in the neural network literature, while MSE is used more in the statistics literature. The SSE carries a constant factor of 1/2 so that it cancels the 2 produced by differentiating the square, which makes the derivative simpler.
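As a small illustration, the sketch below computes both losses on made-up numbers (the `targets` and `outputs` lists are hypothetical). The two differ only by the positive constant N/2, so they are minimised by the same parameters.

```python
def sse(targets, outputs):
    # Sum of squared errors, including the 1/2 factor.
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))


def mse(targets, outputs):
    # Mean of the squared errors.
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


targets = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
outputs = [1.5, 2.0, 2.0, 5.0]  # hypothetical predictions

print(sse(targets, outputs))  # 0.5 * (0.25 + 0 + 1 + 1) = 1.125
print(mse(targets, outputs))  # 2.25 / 4 = 0.5625
```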

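Adjust the parameters:
- Step 1: find the gradient of the error with respect to m and b
$$
\frac{\partial \text{Error}_{(m,b)}}{\partial m} = \frac{2}{N} \sum_{i=1}^{N} -x_i (y_i - (mx_i + b))
$$

$$
\frac{\partial \text{Error}_{(m,b)}}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} -(y_i - (mx_i + b))
$$

- Step 2: move m and b a small step against their gradients, scaled by the learning rate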



In [None]:
import numpy as np

# Compute the error (MSE) of the line y = mx + b
# m is slope, b is y-intercept
def compute_error_for_line_given_points(b, m, points):
    totalError = 0
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(points))

# Compute the gradient and the new b, m (one gradient descent step)
def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

# Update the parameters b, m for num_iterations steps
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    points = np.array(points)  # convert once, outside the loop
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
        error = compute_error_for_line_given_points(b, m, points)
        print(i, b, m, error, 'i, b, m, error')
    return [b, m]

def run():
    points = np.genfromtxt(r"data\data_linearreg.csv", delimiter=",")
    learning_rate = 0.0001
    initial_b = 0  # initial y-intercept guess
    initial_m = 0  # initial slope guess
    num_iterations = 1000
    print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
    print("Running...")
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
    print("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))

if __name__ == '__main__':
    run()
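The per-point loop in `step_gradient` can also be written with NumPy array operations. The sketch below is one possible vectorised version, not the original author's code. Since the CSV file is not reproduced here, it runs on synthetic, noise-free points from the line y = 2x + 1; the function name, the data and the learning rate of 0.1 are all assumptions made for illustration.

```python
import numpy as np


def step_gradient_vectorized(b, m, points, learning_rate):
    """One gradient descent step on the MSE, using array operations."""
    x = points[:, 0]
    y = points[:, 1]
    residual = y - (m * x + b)
    # Same gradients as the loop version: (2/N) * sum(-residual) and
    # (2/N) * sum(-x * residual), written with means.
    b_gradient = -2.0 * residual.mean()
    m_gradient = -2.0 * (x * residual).mean()
    return b - learning_rate * b_gradient, m - learning_rate * m_gradient


# Synthetic, noise-free points on the line y = 2x + 1 (hypothetical data).
x = np.linspace(0.0, 1.0, 50)
points = np.column_stack((x, 2.0 * x + 1.0))

b, m = 0.0, 0.0
for _ in range(5000):
    b, m = step_gradient_vectorized(b, m, points, learning_rate=0.1)

print(b, m)  # should approach b = 1, m = 2
```

A suitable learning rate depends on the scale of x, so the value used here may not work for other data.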

# Linear regression when x is a vector
$$
o = w_0 + w_1 x_1 + \dots + w_n x_n
$$

$$
E[w] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
$$

Our goal is to adjust the parameters of the linear model (w_0, w_1, w_2, ..., w_n) for the given input data (x_1, x_2, ..., x_n) so that the error E[w] is as small as possible.

- Compute the gradient:
$$
\nabla E[w] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]
$$

- Training rule:
$$
\Delta w = -\eta \nabla E[w]
$$

i.e.,

$$
\Delta w_i = -\eta \frac{\partial E}{\partial w_i}
$$
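The training rule can be sketched as batch gradient descent for the linear unit. The sketch relies on the standard result that, for the SSE above, the partial derivative of E with respect to w_i is the negated sum over examples of (t_d - o_d) times the i-th input of example d. The function name `train_linear_unit`, the toy data, `eta=0.05` and `epochs=2000` are assumptions chosen for illustration.

```python
def train_linear_unit(examples, targets, eta=0.05, epochs=2000):
    """Batch gradient descent for o = w_0 + w_1*x_1 + ... + w_n*x_n
    under E[w] = 1/2 * sum_d (t_d - o_d)**2."""
    n = len(examples[0])
    w = [0.0] * (n + 1)  # w[0] is the bias term w_0
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, t in zip(examples, targets):
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            err = t - o
            grad[0] += -err  # dE/dw_0 (the input for w_0 is the constant 1)
            for i in range(n):
                grad[i + 1] += -err * x[i]  # dE/dw_i
        # Training rule: delta w_i = -eta * dE/dw_i
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical data generated from t = 1 + 2*x_1 - 3*x_2.
examples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1.0, 3.0, -2.0, 0.0]
w = train_linear_unit(examples, targets)
print(w)  # should approach [1.0, 2.0, -3.0]
```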


see: https://folk.idi.ntnu.no/keithd/classes/advai/lectures/backprop.pdf

TODO: revisit the derivation
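For reference, a short version of the derivation for the linear unit, using the chain rule on the SSE. The notation x_{i,d} for the i-th input of training example d (with x_{0,d} = 1 for the bias weight w_0) is introduced here and does not appear elsewhere in these notes.

$$
\frac{\partial E}{\partial w_i}
= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
= \sum_{d \in D} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)
= -\sum_{d \in D} (t_d - o_d) \, x_{i,d}
$$

The last step uses the fact that t_d does not depend on w_i, while the derivative of o_d with respect to w_i is x_{i,d}. Substituting into the training rule gives

$$
\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{i,d}
$$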


# Pearson correlation coefficient (PCC): linear relationship
- Measures the strength and direction of the linear relationship between two variables; it ranges from -1 to 1
- 0: no linear relationship
$$
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \cdot \sqrt{\sum (y_i - \bar{y})^2}}
$$
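The formula translates directly into code. The following is a minimal pure-Python sketch; `pearson_r` is a name chosen here, the data is made up, and the function does not handle the case where either variable is constant (zero denominator).

```python
import math


def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square roots of the summed squared deviations.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0])    # perfectly linear, increasing
r_down = pearson_r(xs, [10.0, 8.0, 6.0, 4.0, 2.0])  # perfectly linear, decreasing
print(r_up, r_down)  # close to 1 and -1, up to floating-point rounding
```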

# R-Squared Score
- R-squared is the percentage of the response variable variation that is explained by a linear model.
- It ranges from 0% to 100% for a least-squares fit with an intercept, evaluated on the data it was fitted to.
$$
R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{SSR}{SST}
$$

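- SSR is the residual sum of squares: the variation that the model does not explain.
- SST is the total sum of squares: all of the variation in the response variable.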
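A minimal sketch of the computation, with SSR and SST as defined above. The function name `r_squared` and the numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ssr = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    sst = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    return 1 - ssr / sst


y_true = [1.0, 2.0, 3.0, 4.0]  # hypothetical observations
y_pred = [1.5, 2.0, 2.0, 5.0]  # hypothetical predictions

score = r_squared(y_true, y_pred)
print(score)  # 1 - 2.25 / 5.0, about 0.55

# Predicting the mean for every point explains none of the variation.
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])
print(baseline)  # 0.0
```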
