# List for research in deep learning
1. Activation function
2. Loss function
3. Follow Standford CS 229
     - This is crucial since we are trying to understand the fundementals of ML

In [12]:
import numpy as np

## Part 1 Linear Regression

In [258]:
# Part 1 Linear Regression
housing_data = np.loadtxt("/home/everitt257/Downloads/stanford_dl_ex-master/ex1/housing.data")
features = housing_data[:400, :-1]
ground_truths = housing_data[:400, -1].reshape(400,-1)
# Number of parameters to be trained
n = features.shape[1] # Number of columns
m = features.shape[0] # Number of rows
# Initialize the weights and bias
weights = np.random.random_sample(n).reshape(n,-1)
bias = np.random.random_sample()
# Visualize what the matrix look like
print("Feature shape:{} \
       \nWeight shape: {} \
       \nGround_Truth shape: {} \
       \nBias {}".format(features.shape, weights.shape, ground_truths.shape,bias))

Feature shape:(400, 13)        
Weight shape: (13, 1)        
Ground_Truth shape: (400, 1)        
Bias 0.3385973288082881


- Fitting the line
$$ \hat{y_i} = wx^i = w_0x_0^i + w_1x_1^i + ... + w_nx_n^i + b $$
The $\hat{y}$ should be as close to the original label $y$ as possible. Here I specify the loss function as:
$$loss = \frac{1}{2}\sum_{i=1}^{n}(\hat{y_i} - y_i)^2$$
What we are trying to do here is to minimize the loss function. To minmize the loss function. We'll use the 
gradient decent rule to help us decide what the partial derivatives are for each weight and bias.
**Please note that we can have other loss function defined as well. The reason I choose $loss = \frac{1}{2}\sum_{i=1}^{n}(\hat{y_i} - y_i)^2$ is because it penalizes error quadratically. Secondly it make the math easier when we take the partial derivative.**

- Let's compute some derivatives
    
    According gradient decent 
    
    $$\hat{w_i} = w_i - \alpha\frac{\partial loss}{\partial w}$$
    
    Where $w$ is the weight or bias. *To simplify things, we omits bias for the moment.*
    In our case of study, let's **work out the partial derivative for a single feature**:
    
    $$loss_{single} = \frac{1}{2}(\hat{y_i} - y_i)^2$$
    
    The derived derivative w.r.t to weight is thus just:
    
    \begin{equation}
        \begin{split}
        \frac{\partial loss}{\partial w_i} & =  \frac{\partial loss}{\partial \hat{y_i}}
        \frac{\partial \hat{y_i}}{w_i} \\
        & = (\hat{y_i} - y_i)x_i
        \end{split}
    \end{equation}
    
    The derived derivative w.r.t to bias is
    
    \begin{equation}
        \begin{split}
        \frac{\partial loss}{\partial b} & =  \frac{\partial loss}{\partial \hat{y_i}}
        \frac{\partial \hat{y_i}}{b} \\
            & = (\hat{y_i} - y_i)
        \end{split}
    \end{equation}
    
    So the update rule for the weight and bias of  single feature is just 
    
    \begin{align}
        \hat{w_i} & = w_i + \alpha(y_i-\hat{y_i})x_i \\
        \hat{b} & = b + \alpha(y_i-\hat{y_i})
    \end{align}
    
    For multiple features, the rule can be generalized as:
    
    \begin{align}
        \hat{w_i} & = w_i + \alpha\sum_{j=1}^{n}(y_i - \hat{y_i})x_i^j \\
        \hat{b} & = b + \alpha\sum_{j=1}^{n}(y_i-\hat{y_i})
    \end{align}
    
    Where $\hat{y_i} = wx^i$ and $\alpha$ equals the learning rate.
    
    **Note this is the gradient decent rule that follows the steepest decent. In practice we rarely consider this since it takes every derivative into account. Another solution is to use the so-called stochastic decent to accelerate the optimization**

In [240]:
def normalize(weights):
    """Don't think I will be needing thi"""
    return weights/np.sum(weights)

In [241]:
def partial_d(features, weights, labels):
    y_hat = np.dot(features, weights)
    y_diff = y_hat - labels
    weight_d = np.dot(y_diff.transpose(), features)
    bias_d = np.sum(y_diff)
    return weight_d, bias_d

In [242]:
weight_d, bias_d = partial_d(features, weights, ground_truths)

In [251]:
def update(old_w, old_b, weight_d, bias_d, a):
    new_w = old_w + weight_d*a
    new_b = old_b + bias_d*a
    return new_w, new_b

In [252]:
new_w, new_b = update(weights, bias, weight_d, bias_d, 0.01)

In [257]:
print("Features.shape {} and Weights.shape {}".format(features.shape, weights.shape))
y_hat = np.dot(features, weights)
print(y_hat.shape)

Features.shape (400, 13) and Weights.shape (13, 13)
(400, 13)


In [254]:
learn_rate = 0.001
error_bound = 0.01
for i in range(10):
    weight_d, bias_d = partial_d(features, weights, ground_truths)
    weights, bias = update(weights, bias, weight_d, bias_d, learn_rate)
    
    y_hat = np.dot(features, weights.transpose())
    square_error = 1./400*np.sum((y_hat-ground_truths)**2)
    print(square_error)
    print("square_error is: ".format(square_error))

inf
square_error is: 
inf
square_error is: 
inf
square_error is: 
inf
square_error is: 
inf
square_error is: 
inf
square_error is: 
inf
square_error is: 
inf
square_error is: 
inf
square_error is: 
inf
square_error is: 


  


In [212]:
test = np.random.random_sample(10)

In [215]:
test.reshape(10,-1).shape

(10, 1)

In [221]:
np.sum(weights)

6.1119970072086911

In [223]:
normalized_w = normalize(weights)

In [224]:
normalized_w

array([[ 0.06432218],
       [ 0.14666415],
       [ 0.04983606],
       [ 0.01719232],
       [ 0.09797791],
       [ 0.04185312],
       [ 0.11315203],
       [ 0.11876532],
       [ 0.08418966],
       [ 0.02731797],
       [ 0.02032364],
       [ 0.08274421],
       [ 0.13566143]])