### Building Intuition about NN

In [1]:
import numpy as np

#### Dot product

The best approach to get an intuition for building NN is to start from scratch, and I will start with matrix mutiplication, as it's the foundational corner stone.


First, I will start with the concept of dot product. There are multiple useful ways to think about dot product as following:
-  1 The weighted average. For example the cummlative GPA of a student who study for 5 years could be calucalted by dot producting the weight vector with the gpa vectors

In [2]:
gpa = np.array([3.3,3.1,2.9,3.4,3.5])
gpa

array([3.3, 3.1, 2.9, 3.4, 3.5])

In [3]:
weights = np.array([1/15,2/15,3/15,4/15,5/15])
weights

array([0.06666667, 0.13333333, 0.2       , 0.26666667, 0.33333333])

> We want: 1/15 * 0.066 + 2/15 * 0.1333 + 3.15 * 0.2 + ....

essentially to multiply each gpa with the asssocaited weght. To do this, we need the first rule in matrix multiplication, which dictates the shape of each vector. if we are performing A@B, and A shape is M * N , B shape has to be N * K, which is to say that the rows of B has to match the columns of A.

In [4]:
# let's check the shape of gpa
gpa.shape

(5,)

In [5]:
# we need to reshape gpa to be 1*5

gpa = gpa.reshape(1,5)
gpa

array([[3.3, 3.1, 2.9, 3.4, 3.5]])

In [6]:
# let's check the shape of weights
weights.shape

(5,)

In [7]:
# we need to reshape it to be 5*1

weights = weights.reshape(5,1)
weights

array([[0.06666667],
       [0.13333333],
       [0.2       ],
       [0.26666667],
       [0.33333333]])

In [8]:
# now to get the dot product of gpa and weights we use the @ symbol

gpa@weights

array([[3.28666667]])

 > and this will be the final results for the student.

Now let's say we have more than one student. We can put the results of that student in the second column of gpa matrix. Note that the number of rows in gpa matrix don't require any change in the shape of the weights vectos. Additionally note that the change in the results will be reflected in the number of rows and not in the number of columns. Let's test this

In [9]:
# first, let's create the record of student b and then add it to the gpa matrix
student_b = np.array([3.1,2.5,2.6,2.8,2.9]).reshape(1,5)
gpa = np.vstack([gpa,student_b])
gpa

array([[3.3, 3.1, 2.9, 3.4, 3.5],
       [3.1, 2.5, 2.6, 2.8, 2.9]])

In [10]:
# now let's get the results for both students
gpa@weights

array([[3.28666667],
       [2.77333333]])

# multi-output
We saw how changing the rows or the columns of X change the output. But how about chaning the type of outputs that we want. In out previous examle we had one target which was the gpa, what if we want to change the number of outpouts to be more than one, say for example we want to have the average of grades in addition to the gpa. In this case we need to include another column in the weight matrix to connect with the average. So essentially the number of columns in `w` should be the same as the number of outputs.


In [11]:
weights = np.hstack([weights,np.array([1/5,1/5,1/5,1/5,1/5]).reshape(5,1)])
weights


array([[0.06666667, 0.2       ],
       [0.13333333, 0.2       ],
       [0.2       , 0.2       ],
       [0.26666667, 0.2       ],
       [0.33333333, 0.2       ]])

Now for each of the two students we have, we will calculate the avg in addition to the gpa


In [12]:
gpa@weights

array([[3.28666667, 3.24      ],
       [2.77333333, 2.78      ]])

>The first student, has a cummulative gpa of 3.28 and avg of 3.24

In [13]:
# This knowledge will come handy when we are dealing with NN that has many hidden layer with many units.
# for example let's say we have the seoncd layer with 10 units, and the third layer 5 units, then the w matrix will be 10 by 5, i,e 10 rows and 5 columns.


- 2: The concept of dot product as a similarity measure. While this could not be of obvius value here, thinking about dot product as a similarity measure is very useful. Here we will be thiking of two vector that origiated form the zero and live in the 2D plane. Say vecor A is


[0 3] and nother vector B which is [3 0]. It's obvious that A lives in the x-axis while B lives in the y-axis. The dot product between those vectors is zero reflecting that A and B are basically totally different. Now, if we consider another vector C [0 2.9], we can see clearly that C is very similar to vector A. Indeed the dot product will be much bigger between A and C than between A and B.

#### Solving linear equation system 

Building on what we now know about dot product let's try to solve the following equation:


X.w = y


what does this eqaution tells us is that the dot product of feature matrix `X` and weight vector `w` will get us the response vector `y`. 


let's creat our feature matrix, we will have a data with only one example and 5 features.

In [14]:
X = np.array([10,12,5,3,-5]).reshape(1,5)
X


array([[10, 12,  5,  3, -5]])

We have come to know that there is a dependent variable `y`, that is linearly dependent on those features. That means if we take the right mix of theres features in `X` we will get y.
 of course the million dollar question is that what is the right mix. Let's say we the right mix of feature X1 is w1 and from featureX2 is w2 etc. Now, we will get the following equation

X1.w1 + X2.w2 + ...... Xn.wn = y

we can write the above equation in a compact form in the following form:

X.w = y

Now, to write this equation in the matrix form, we need to know the rules of matrix multiplication



- the rows of w has to be the same as the columns of X
- so if X shape is 1,5 then w needs to be 5x, so we can have any numbe of columns but the rows have to be 5. The resuled matrix will take the shape of (rows of X , columns of w)
- why is this important, well, since the number of rows in X is the numbe of example, observations in the data does not affect. We can have the same shape of w for any number of observations. 
- On the other hands, w is affected by the number of features, of course becuase each w is connected with aparticualr feature.
- Now when it comes the resulted y, of coursee we need one y for each observation, and therefore the numbe of rows in y is the same as number of observation. Now let's say we don't have one output
- I mean we have y1 and y2, we will face this in multi-classificaiton and also in NN when w have a layer with more than one unit. Here th columns of y will increase, and guess who will increase the numbe of columns to accomodate this, yes that's right. That's the NN

In [15]:
# ok, we need to learn about the dot product, so we can peformr X.w

In [16]:
# X is 5 * 1 and the output is 2 *1 so we need the w to be 2*5

w = np.array([
    [ 0.1, 0.1, 0.1,0.1,0.1],
    [0.1,0.1,0.1,0.1,0.1]

])

In [17]:
w = w.reshape(5,2)

In [18]:
w

array([[0.1, 0.1],
       [0.1, 0.1],
       [0.1, 0.1],
       [0.1, 0.1],
       [0.1, 0.1]])

In [19]:
X@w

array([[2.5, 2.5]])

In [20]:
# input layer, 5 nodes        layer 1, 2 nodes

#
#                    #
#
#                    #
#

In [21]:
# let's make up a simple example to see if the NN can actually learn from examples


# y X1 + X2 + X3 + X4 + X5

# examples
X = np.array([[1,1,1,1,1],
             [2,1,2,2,2],
            [3,1,3,3,3],
            [3,2,3,3,3],
            [3,3,3,3,3],
            [3,4,3,3,3],
            [3,5,3,3,3],
            [3,6,3,3,3],
            [3,7,3,3,3],
            [3,8,3,3,3]
             ])


In [22]:
X.shape

(10, 5)

In [23]:
y = np.array([5,9,13,14,15,16,17,18,19,20])
y.shape

(10,)

### Train function

In [24]:
class linear_model:
    def __init__(self):
        pass

    def train(self,X,y):
        # create the weight vectorr
        w = np.ones([X.shape[1],1])
        losses = []
        y_predicted = np.zeros(len(X))
        for i in range(len(X)):
            y_predicted[i] = X[i]@w
            loss = y[i] - y_predicted[i]
            
            print('the true value = ',y[i])
            print('the predicted value = ',y_predicted[i])
            losses.append(loss)
        total_loss = np.sum(losses)
        print('the total loss =', total_loss )
        

In [25]:
lm = linear_model()

In [26]:
lm.train(X,y)

the true value =  5
the predicted value =  5.0
the true value =  9
the predicted value =  9.0
the true value =  13
the predicted value =  13.0
the true value =  14
the predicted value =  14.0
the true value =  15
the predicted value =  15.0
the true value =  16
the predicted value =  16.0
the true value =  17
the predicted value =  17.0
the true value =  18
the predicted value =  18.0
the true value =  19
the predicted value =  19.0
the true value =  20
the predicted value =  20.0
the total loss = 0.0


In [27]:
X[0]

array([1, 1, 1, 1, 1])

In [28]:
# note that we used the correct w for this case and hence we have loss of zero, let's change ws

In [29]:
class linear_model:
    def __init__(self):
        pass

    def train(self,X,y):
        w = np.ones([X.shape[1],1])
        w[0] = 0.2
        losses = []
        y_predicted = np.zeros(len(X))
        for i in range(len(X)):
            y_predicted[i] = X[i]@w
            loss_at_w = y[i] - y_predicted[i]
            loss_at_w_plus_h = y[i] - y_predicted[i]
            
            print('the true value = ',y[i])
            print('the predicted value = ',y_predicted[i])
            losses.append(loss_at_w )
        total_loss = np.sum(losses)
        print('the total loss =', total_loss)
              
        
        

In [30]:
lm = linear_model()

In [31]:
lm.train(X,y)

the true value =  5
the predicted value =  4.2
the true value =  9
the predicted value =  7.4
the true value =  13
the predicted value =  10.6
the true value =  14
the predicted value =  11.6
the true value =  15
the predicted value =  12.6
the true value =  16
the predicted value =  13.6
the true value =  17
the predicted value =  14.6
the true value =  18
the predicted value =  15.6
the true value =  19
the predicted value =  16.6
the true value =  20
the predicted value =  17.6
the total loss = 21.599999999999998


In [32]:
# as you can see here, the loss is no longer 0.

In [33]:
# it is obvious that we need to keep tweeking ws until we get the minium loss

### now comes the calculs part, first we know that we can write the loss as a function of ws

- L = f(w)
- dl/dw = f'(w) = (f(w+h) - f(w))/h

new_w = old+w - step_size* dl/dw

In [34]:
 #ok, I need to update w which is a vector of shape (x[1],1)
        # any change in w, for example if we change w[0], the vector y will change completely, not only y[0] as we dicussed before.
        # now as we said we want to change w by tweeking it a little bit, by adding dw, i.e adding a vector that will be similar in length to w
              
        # for example :
              # w = [1               df/dw = [0.10                                                        [1.2
              #      2                     0.04            > new w will be old_w _ step_size*dw.     >. 2.08         note that here we assumed step*size to be 2
               #     3]                    0.03]                                                         2.06
        
        # let's calculate dl/dw
        # but this will involve calculating the loss twice, one at and then at w + h
        # let's make a function of calculating the loss
        

## Using a simple model to solve the equation 3X = 6

In [35]:
class linear_model:
    def __init__(self):
        pass
    
    def calc_loss(self,X,y,w):
       
        losses = []
        y_predicted = np.zeros(len(X))
       
        
        y_predicted = X@w
        loss= y - y_predicted
            
        print('the true value = ',y)
        print('the predicted value = ',y_predicted)
        losses.append(loss)
        total_loss = np.sum(losses)
        print('the total loss =', total_loss )
        return total_loss
      


    def train(self,X,y):
        w = np.ones([X.shape[1],1])
        w[0] = 5
        
        for i in range(len(w)):
            loss_w = self.calc_loss(X,y,w)
            
            # the case when the loss is positive, in this case we need to decrease w
            
            # we will choose the margin that we will accept as 0.5
            
            if abs(loss_w)> 0:
                # since the equation is 3x =6 , then dw/dl = 3
                dldw = 3
            elif abs(loss_w)< 0:
                dldw = -3
                # we will make the the learning rate is 0.5
            lr =1
            w[0]= w[0] -lr*dldw
            loss_w_dash= self.calc_loss(X,y,w)
            print(loss_w_dash - loss_w)
            loss_w = loss_w_dash

            
            self.w = w
            print(self.w)
        
    def predict(self,X_new):
        predicted_y = X_new@self.w
        
        return predicted_y

In [36]:
X = np.array([3]).reshape(1,1)
y = np.array([6]).reshape(1,1)

In [37]:
X.shape

(1, 1)

In [38]:
w = np.ones([1,1])

In [39]:
X@w

array([[3.]])

In [40]:
lm = linear_model()

In [41]:
lm.train(X,y)

the true value =  [[6]]
the predicted value =  [[15.]]
the total loss = -9.0
the true value =  [[6]]
the predicted value =  [[6.]]
the total loss = 0.0
9.0
[[2.]]


In [42]:
# so essentially the equation we have is

# 2*X, this is our model, because this is basically what got us 6
# let's see what if we try a new X, say for example 4, we expect the new y to be 8

In [43]:
X_new = np.array([4])

In [44]:
y = lm.predict(X_new)

In [45]:
y

array([8.])

### let's now try equations for 2 variables


- X1 = 2
- X2 = 3

- 2X1 + X2 = 7
- 3X1 - X2 = 3

> this is the equivelant of having 2 examples


In [46]:
X = np.array([[2,1],
             [3,-1]])
X

array([[ 2,  1],
       [ 3, -1]])

In [47]:
y = np.array([7,3]).reshape(2,1)
y

array([[7],
       [3]])

In [48]:
class linear_model:
    def __init__(self):
        pass
    
    def calc_loss(self,X,y,w):
       
        losses = []
        y_predicted = np.zeros(len(X))
       
        
        y_predicted = X@w
        loss= y - y_predicted

        print('the true value = ',y)
        print('the predicted value = ',y_predicted)
        losses.append(loss)
        total_loss = np.sum(losses)
        print('the total loss =', total_loss )
        return total_loss
      


    def train(self,X,y):
        w = np.ones([X.shape[1],1])
        w[0] = 5
        w[1] = 3
        
        loss_w = self.calc_loss(X,y,w)
        while abs(loss_w) > 1:
            for i in range(len(w)):
                loss_w = self.calc_loss(X,y,w)

                # the case when the loss is positive, in this case we need to decrease w

                # we will choose the margin that we will accept as 0.5

                if abs(loss_w)> 0:
                    # since the equation is 3x =6 , then dw/dl = 3
                    dldw = 1
                elif abs(loss_w)< 0:
                    dldw = -3
                    # we will make the the learning rate is 0.5
                lr =1
                w[i]= w[i] -lr*dldw
                loss_w_dash= self.calc_loss(X,y,w)
                print(loss_w_dash - loss_w)
                loss_w = loss_w_dash
            
            if abs(loss_w) < 5:
                
                self.w = w
                
                return self.w
                print('converged to solution')
                
            else:
                print('not converged')
            
            

    def predict(self,X_new):
        predicted_y = X_new@self.w
        
        return predicted_y

In [49]:
lm = linear_model()

In [50]:
lm.train(X,y)

the true value =  [[7]
 [3]]
the predicted value =  [[13.]
 [12.]]
the total loss = -15.0
the true value =  [[7]
 [3]]
the predicted value =  [[13.]
 [12.]]
the total loss = -15.0
the true value =  [[7]
 [3]]
the predicted value =  [[11.]
 [ 9.]]
the total loss = -10.0
5.0
the true value =  [[7]
 [3]]
the predicted value =  [[11.]
 [ 9.]]
the total loss = -10.0
the true value =  [[7]
 [3]]
the predicted value =  [[10.]
 [10.]]
the total loss = -10.0
0.0
not converged
the true value =  [[7]
 [3]]
the predicted value =  [[10.]
 [10.]]
the total loss = -10.0
the true value =  [[7]
 [3]]
the predicted value =  [[8.]
 [7.]]
the total loss = -5.0
5.0
the true value =  [[7]
 [3]]
the predicted value =  [[8.]
 [7.]]
the total loss = -5.0
the true value =  [[7]
 [3]]
the predicted value =  [[7.]
 [8.]]
the total loss = -5.0
0.0
not converged
the true value =  [[7]
 [3]]
the predicted value =  [[7.]
 [8.]]
the total loss = -5.0
the true value =  [[7]
 [3]]
the predicted value =  [[5.]
 [5.]]
the

array([[2.],
       [0.]])

In [51]:
# even though this is not the correct value, but the tota loss is actually zero. I need to to solve this proble,

In [52]:
#n the issue of losses cancelling themselves are a direct result of not taking the sum of the absolute value , which is L1

In [53]:
# L1 loss

In [54]:
lm.train(X,y)


the true value =  [[7]
 [3]]
the predicted value =  [[13.]
 [12.]]
the total loss = -15.0
the true value =  [[7]
 [3]]
the predicted value =  [[13.]
 [12.]]
the total loss = -15.0
the true value =  [[7]
 [3]]
the predicted value =  [[11.]
 [ 9.]]
the total loss = -10.0
5.0
the true value =  [[7]
 [3]]
the predicted value =  [[11.]
 [ 9.]]
the total loss = -10.0
the true value =  [[7]
 [3]]
the predicted value =  [[10.]
 [10.]]
the total loss = -10.0
0.0
not converged
the true value =  [[7]
 [3]]
the predicted value =  [[10.]
 [10.]]
the total loss = -10.0
the true value =  [[7]
 [3]]
the predicted value =  [[8.]
 [7.]]
the total loss = -5.0
5.0
the true value =  [[7]
 [3]]
the predicted value =  [[8.]
 [7.]]
the total loss = -5.0
the true value =  [[7]
 [3]]
the predicted value =  [[7.]
 [8.]]
the total loss = -5.0
0.0
not converged
the true value =  [[7]
 [3]]
the predicted value =  [[7.]
 [8.]]
the total loss = -5.0
the true value =  [[7]
 [3]]
the predicted value =  [[5.]
 [5.]]
the

array([[2.],
       [0.]])

# wa hoooooooooo :)

- The breakthrough was the fact hat I have to calculate dwdl once
- Also making the condition of stopping not zero but at 0.05
- And playing with the learning rate
- Of course also th earlier realization that I need to calucalte L1 and not just sum th loss since they were cancelling each other.

In [55]:
# Batch vs stochastic

while I was sleeping yesterday I was thinking about the calculation of gradients. More specifically I was thinking about how it is possible to calculate the gradient using one example, using some examples or all examples,or using all examples. The only difference will be in the way the loss is calculated. If only one example is included then we cacluate one loss, if n the the loss vectors, will have n rows each is associated with a loss, of course since we need only one number we will just take the mean of all losses. Note that we are calculatin the mean squared error using this methodology MAE. If we use L2. then we will be calculting the MSE.

In [56]:
# Using MSE

In [57]:
class linear_model:
    def __init__(self):
        pass
    
    def calc_loss(self,X,y,w, loss_measure = 'MSE'):
        
        
        """
        Calculating the loss or the deviation of the results we got from the model compared with the real values. The first step is to calculate the X@w which represents the results we got from
        the model. The second step is to evaulate the loss which can be done through a loss function that we define. for example here we define two ways to caculate the loss either using MAE, in
        which we first calculate the absolute difference between each predicted and true value and then we calucalte the mean. or using the MSW, in which we calculate the abolute difference between
        true and preidcted value, square it and then find the average. The loss we then be returned.
        
        """
        
       
        losses = []
        y_predicted = np.zeros(len(X))
        y_predicted = X@w
        
        if loss_measure == 'MSE':
            loss = (y - y_predicted)**2
        elif loss_measure == 'MAE':
            loss= abs(y - y_predicted)

        #print('the true value = ',y)
        #print('the predicted value = ',y_predicted)
        losses.append(loss)
        total_loss = np.mean(losses)
        #print('the total loss =', total_loss )
        return total_loss
      


    def train(self,X,y,loss_measure = 'MAE'):
        #w = np.ones([X.shape[1],1])
        np.random.seed(500)
        w = np.random.rand(X.shape[1],1)
        dldw = np.ones([X.shape[1],1])
        dldw[0] = 0.02
        dldw[1] = 0.02
        w[0] = 3
        w[1] = 3
        print(w)
        
    # calculate the derivitives, this happen once, that was my mistake, I was including it in the for loop.

        for i in range(len(w)):

            # f(w)

            loss_w = self.calc_loss(X,y,w,loss_measure = loss_measure)
            # we will choose the margin that we will accept as 0.5

            # f(w+h)

            h = 0.005
            w[i] = w[i] + h

            # dldw = f(w+h) - f(w)/h
            dldw[i] = (self.calc_loss(X,y,w,loss_measure = loss_measure) - loss_w)/h


        print(dldw)   
        
        # updating the weight
        
        lr =0.005
        
        for i in range(20):
            print('step',i)
            loss_w = self.calc_loss(X,y,w)
            print('loss is now',loss_w)
            if loss_w <= 0.08 :
                self.w = w
                print('final w = ',self.w)
                return self.w
            elif loss_w > 0.08 :
                w = w - lr*dldw
#         for i in range(10):
#             counter = counter + 1
            
#             print('loss = ',loss_w_dash)
#             #print(loss_w_dash - loss_w)
#             loss_w = loss_w_dash

#             

#                 return self.w
                
                

    def predict(self,X_new):
        predicted_y = X_new@self.w
        
        return predicted_y

In [58]:
lm = linear_model()
lm.train(X,y,loss_measure = 'MSE')

[[3.]
 [3.]]
[[13.0325]
 [-1.    ]]
step 0
loss is now 6.560162500000003
step 1
loss is now 5.7321160466406065
step 2
loss is now 4.9599711865624725
step 3
loss is now 4.243727919765584
step 4
loss is now 3.5833862462499555
step 5
loss is now 2.978946166015569
step 6
loss is now 2.4304076790624403
step 7
loss is now 1.9377707853905637
step 8
loss is now 1.501035484999939
step 9
loss is now 1.1202017778905646
step 10
loss is now 0.7952696640624431
step 11
loss is now 0.5262391435155735
step 12
loss is now 0.31311021624995605
step 13
loss is now 0.15588288226559058
step 14
loss is now 0.05455714156247709
final w =  [[2.092725]
 [3.075   ]]


array([[2.092725],
       [3.075   ]])

In [59]:
X_2d_new = np.array([[2,3],
                   [0,5]])

In [60]:
lm.predict(X_2d_new)

array([[13.41045],
       [15.375  ]])

In [61]:
# let's use the model we have trained for predicting the results for new data

In [62]:
# The effect of initialzing weighhts with very smalll values. 

In [63]:
# ok, we know that dldw is basically how much the loss is affected when the weight is changed a little bit. Since the trajectory is not linear
# we expect that dldw is not constant. But I basically makes it so in the above code.

I think now I started to grasp the idea of foward pass and backward pass.
Where do we need the dldw, we need it when we want to update w. Now dldw is not constatnt because again , imagin that you are in this half-circle 
down. as you apprach the bottom the slopes acually goes dowm. I am going to do this for the above code.

In [64]:
class linear_model:
    def __init__(self):
        pass
    

    
    def calc_loss(self,X,y,w, loss_measure = 'MSE'):
        
        
        """
        Calculating the loss or the deviation of the results we got from the model compared with the real values. The first step is to calculate the X@w which represents the results we got from
        the model. The second step is to evaulate the loss which can be done through a loss function that we define. for example here we define two ways to caculate the loss either using MAE, in
        which we first calculate the absolute difference between each predicted and true value and then we calucalte the mean. or using the MSW, in which we calculate the abolute difference between
        true and preidcted value, square it and then find the average. The loss we then be returned.
        
        """    
        losses = []
        y_predicted = np.zeros(len(X))
        y_predicted = X@w
        
        if loss_measure == 'MSE':
            loss = (y - y_predicted)**2
        elif loss_measure == 'MAE':
            loss= abs(y - y_predicted)

        #print('the true value = ',y)
        #print('the predicted value = ',y_predicted)
        losses.append(loss)
        total_loss = np.mean(losses)
        #print('the total loss =', total_loss )
        return total_loss
      

    def diff(self,w,X,y, loss_measure,diff_type = 'Analytical'):
        """
        given a function f(x) and a point w, calculate th f'(x)
        
        """
        dldw = np.ones([X.shape[1],1])
        h = 0.005  
        
        if diff_type =='Numerical':
            # calculate the derivitives dl/dw using limits
            for i in range(len(w)):
                print('i',i)
                # f(w)
                loss_w = self.calc_loss(X,y,w,loss_measure = loss_measure)
                # we will choose the margin that we will accept as 0.5
                # f(w+h)
                w[i] = w[i] + h      
                loss_w_plus_h = self.calc_loss(X,y,w,loss_measure = loss_measure)
                # dldw = f(w+h) - f(w)/h
                dldw[i] = (loss_w_plus_h- loss_w)/h
                print('dldw using numerical solution')
                print(dldw)
        elif diff_type == 'Analytical':
            # calculate the derivitves dl/dw using calculus
            # l = (X.w - y)**2
            # dl/dw = 2*(X.w - y)@X = 2*X@()

            dldw = 2*X@(X@w - y)
            print('dldw using analytical solution')
            print(dldw)
        return dldw 
    
    def train(self,X,y,loss_measure = 'MAE', lr = 0.07, diff_type = 'Analytical'):
        """
        training of the model is done by evaluting loss and updating the weights
        """
        # initilization
        np.random.seed(50)
        w = np.random.rand(X.shape[1],1)
        # intialization. I am changing this manually
        #w[0] = 
        #w[1] = 3.5
        print('initial weights')
        print(w)    
        # updating the weight     
        #lr =0.07
        
        for i in range(30):
            print('step =',i)
            
            # forward pass is essentially the process of calculting the loss given the data and the weight, for the loss we need to calculate a predicted value
            # calculate the loss at the current weights
            loss_w = self.calc_loss(X,y,w)
            print('loss is now',loss_w)
            # stoping condition, note that we don't require the loss is to be exactly zero but close enough
            if loss_w <= 0.08 :
                self.w = w
                print('final w = ',self.w)
                return self.w
            
            # updating the weight. 
            elif loss_w > 0.08 :
                # the backpropagation is essentially the calculation of the partially derivives using chain rules. since we have only one layer at this time we will just calculate the dldw
                #First at that particular point we calculate dldw 
                dldw = self.diff(w,X,y,loss_measure, diff_type)
                # Now we update the weight
                w = w -lr*dldw
                print('w')
                print(w)          
    def predict(self,X_new):
        """
        Given a new data, the model will be using the weights stored in the self.w to produce a prediction
        """
        predicted_y = X_new@self.w
        
        return predicted_y

In [65]:
lm = linear_model()
lm.train(X,y,loss_measure = 'MSE', lr = 0.08,diff_type = 'Analytical')

initial weights
[[0.49460165]
 [0.2280831 ]]
step = 0
loss is now 18.241141479081318
dldw using analytical solution
[[-26.61941075]
 [-31.20772529]]
w
[[2.62415451]
 [2.72470113]]
step = 1
loss is now 2.7798160070416325
dldw using analytical solution
[[8.18756534]
 [1.54253606]]
w
[[1.96914928]
 [2.60129824]]
step = 2
loss is now 0.1528493396307682
dldw using analytical solution
[[-1.22931361]
 [-3.37471838]]
w
[[2.06749437]
 [2.87127571]]
step = 3
loss is now 0.054868788984643146
final w =  [[2.06749437]
 [2.87127571]]


array([[2.06749437],
       [2.87127571]])

In [66]:
lm.predict(X_2d_new)

array([[12.74881588],
       [14.35637857]])

In [67]:
# my thought process during building the model from scratch

- note that weight in the first dimention X[0] is being updated correctly, when I initialize with weight that is bigger than the correct once, the weight goes down, and when I initialize with 
a weight that is smaller than the correct weight then the weight will go up. But, for some reason there is almost no movement in the second dimention, X[1], the quetion is why ?I alreay tried initializing with a different set of values, but still there is no movements.


- I guess the solution deponds on how X1 and X2 affect f(x) and hence the loss function. It seems that if we change w a bit in X1 direction, the loss will be affected significantly, but if we 
change w in the X2 direction, then the loss will be affected that much hence the slow conversion to the correct values of w.

# More than one Layer

In [68]:
class linear_model:
    def __init__(self):
        pass
    
    
    def sigmoid(self,z):
        return 1/(1+np.exp(-z))

    def sigmoid_grad(self,z):
        return self.sigmoid(z) * 1-self.sigmoid(z)
    
    
    def calc_loss(self,X,y,w, loss_measure = 'MSE'):
        
        
        """
        Calculating the loss or the deviation of the results we got from the model compared with the real values. The first step is to calculate the X@w which represents the results we got from
        the model. The second step is to evaulate the loss which can be done through a loss function that we define. for example here we define two ways to caculate the loss either using MAE, in
        which we first calculate the absolute difference between each predicted and true value and then we calucalte the mean. or using the MSW, in which we calculate the abolute difference between
        true and preidcted value, square it and then find the average. The loss we then be returned.
        
        """    
        losses = []
        z = X@w
        #y_predicted = z
        y_predicted = self.sigmoid(z)
        
        
        
        
        if loss_measure == 'MSE':
            loss = (y - y_predicted)**2
        elif loss_measure == 'MAE':
            loss= abs(y - y_predicted)

        #print('the true value = ',y)
        #print('the predicted value = ',y_predicted)
        losses.append(loss)
        total_loss = np.mean(losses)
        #print('the total loss =', total_loss )
        return total_loss
      

    def diff(self,w,X,y, loss_measure,diff_type = 'Analytical'):
        """
        given a function f(x) and a point w, calculate th f'(x)
        
        """
        dldw = np.ones([X.shape[1],1])
        h = 0.005  
        
        if diff_type =='Numerical':
            # calculate the derivitives dl/dw using limits
            for i in range(len(w)):
                print('i',i)
                # f(w)
                loss_w = self.calc_loss(X,y,w,loss_measure = loss_measure)
                # we will choose the margin that we will accept as 0.5
                # f(w+h)
                w[i] = w[i] + h      
                loss_w_plus_h = self.calc_loss(X,y,w,loss_measure = loss_measure)
                # dldw = f(w+h) - f(w)/h
                dldw[i] = (loss_w_plus_h- loss_w)/h
                print('dldw using numerical solution')
                print(dldw)
        elif diff_type == 'Analytical':
            # calculate the derivitves dl/dw using calculus
            # l = (X.w - y)**2
            # dl/dw = 2*(X.w - y)@X = 2*X@()
            
            z = X@w
            
            print('sigmpod_grad = ',self.sigmoid_grad(z))
            
            dldw = 2*X@(z - y)
            print('dldw using analytical solution')
            print(dldw)
        return dldw 
    
    def train(self,X,y,loss_measure = 'MAE', lr = 0.07, diff_type = 'Analytical'):
        """
        training of the model is done by evaluting loss and updating the weights
        """
        # initilization
        np.random.seed(50)
        w = np.random.rand(X.shape[1],1)
        # intialization. I am changing this manually
        #w[0] = 
        #w[1] = 3.5
        print('initial weights')
        print(w)    
        # updating the weight     
        #lr =0.07
        
        for i in range(30):
            print('step =',i)
            
            # forward pass is essentially the process of calculting the loss given the data and the weight, for the loss we need to calculate a predicted value
            # calculate the loss at the current weights
            loss_w = self.calc_loss(X,y,w)
            print('loss is now',loss_w)
            # stoping condition, note that we don't require the loss is to be exactly zero but close enough
            if loss_w <= 0.08 :
                self.w = w
                print('final w = ',self.w)
                return self.w
            
            # updating the weight. 
            elif loss_w > 0.08 :
                # the backpropagation is essentially the calculation of the partially derivives using chain rules. since we have only one layer at this time we will just calculate the dldw
                #First at that particular point we calculate dldw 
                dldw = self.diff(w,X,y,loss_measure, diff_type)
                # Now we update the weight
                w = w -lr*dldw
                print('w')
                print(w)          
    def predict(self,X_new):
        """
        Given a new data, the model will be using the weights stored in the self.w to produce a prediction
        """
        predicted_y = X_new@self.w
        
        return predicted_y

In [69]:
lm = linear_model()
lm.train(X,y,loss_measure = 'MSE', lr = 0.08,diff_type = 'Analytical')

initial weights
[[0.49460165]
 [0.2280831 ]]
step = 0
loss is now 21.86457304319812
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[-26.61941075]
 [-31.20772529]]
w
[[2.62415451]
 [2.72470113]]
step = 1
loss is now 20.013641498984054
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[8.18756534]
 [1.54253606]]
w
[[1.96914928]
 [2.60129824]]
step = 2
loss is now 20.08000579699743
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[-1.22931361]
 [-3.37471838]]
w
[[2.06749437]
 [2.87127571]]
step = 3
loss is now 20.075060231466576
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[ 0.68747257]
 [-0.62482808]]
w
[[2.01249656]
 [2.92126196]]
step = 4
loss is now 20.091554636108313
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[ 0.01747578]
 [-0.55492495]]
w
[[2.0110985 ]
 [2.96565596]]
step = 5
loss is now 20.09544591859577
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[ 0.0866909 ]
 [-0.20816136]]
w
[[2.00416323]
 

In [70]:
# it's amazing how we got exactly the right answer just by applying the sigmoid function and even wituout including the derivitive of sigmoid in the backpropagation. 

In [71]:
X

array([[ 2,  1],
       [ 3, -1]])

In [72]:
y

array([[7],
       [3]])

2X1 + X2 = 7

3X1 - X2 = 3

X1 = 2, X2 = 3

Ok, what if X1 =4, X2= 6

then y = [14,6]

In [73]:
y[0][0] = 14


In [74]:
y[1][0] = 6

In [75]:
X

array([[ 2,  1],
       [ 3, -1]])

In [76]:
y

array([[14],
       [ 6]])

In [77]:
lm = linear_model()
lm.train(X,y,loss_measure = 'MSE', lr = 0.08,diff_type = 'Analytical')

initial weights
[[0.49460165]
 [0.2280831 ]]
step = 0
loss is now 101.1286071748941
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[-60.61941075]
 [-67.20772529]]
w
[[5.34415451]
 [5.60470113]]
step = 1
loss is now 97.00014908356842
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[18.02756534]
 [ 4.90253606]]
w
[[3.90194928]
 [5.21249824]]
step = 2
loss is now 97.00758596749306
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[-2.94771361]
 [-6.88831838]]
w
[[4.13776637]
 [5.76356371]]
step = 3
loss is now 97.0064746876639
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[ 1.45585657]
 [-1.06489208]]
w
[[4.02129784]
 [5.84875508]]
step = 4
loss is now 97.00998879944528
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[-0.00432006]
 [-1.08217231]]
w
[[4.02164345]
 [5.93532886]]
step = 5
loss is now 97.01087699175652
sigmpod_grad =  [[0.]
 [0.]]
dldw using analytical solution
[[ 0.17366598]
 [-0.3875084 ]]
w
[[4.00775017]
 [5

In [78]:
# well I again got the correct answer, I should get some time to think about this mistery?


In [112]:
class linear_model:
    def __init__(self):
        pass
    
    
    def sigmoid(self,z):
        return 1/(1+np.exp(-z))

    def sigmoid_grad(self,z):
        return self.sigmoid(z) * (1-self.sigmoid(z))
    
    
    def calc_loss(self,X,y,w1,w2 ,loss_measure = 'MSE'):
        
        
        """
        Calculating the loss or the deviation of the results we got from the model compared with the real values. The first step is to calculate the X@w which represents the results we got from
        the model. The second step is to evaulate the loss which can be done through a loss function that we define. for example here we define two ways to caculate the loss either using MAE, in
        which we first calculate the absolute difference between each predicted and true value and then we calucalte the mean. or using the MSW, in which we calculate the abolute difference between
        true and preidcted value, square it and then find the average. The loss we then be returned.
        
        """    
        losses = []
        
        # the first layer in the model, will take the inputs X, which has two units x1 and x2 and produce z1, z1 will have two units, z1_1, and z2_2
        z1 = X@w1
        print('z1')
        print(z1)
        # the output layer
        y_predicted= z1@w2        
        #y_predicted = self.sigmoid(z2)
        print('y_predicted',y_predicted)
        #print('grad', self.sigmoid(z) * (1-self.sigmoid(z)))
        
        
        
        
        if loss_measure == 'MSE':
            loss = (y - y_predicted)**2
        elif loss_measure == 'MAE':
            loss= abs(y - y_predicted)

        #print('the true value = ',y)
        #print('the predicted value = ',y_predicted)
        losses.append(loss)
        total_loss = np.mean(losses)
        #print('the total loss =', total_loss )
        return total_loss
      

    def diff(self,w,X,y,loss_measure,diff_type = 'Analytical'):
        """
        given a function f(x) and a point w, calculate th f'(x)
        
        """
        dldw = np.ones([X.shape[1],1])
        h = 0.005  
        
        if diff_type =='Numerical':
            # calculate the derivitives dl/dw using limits
            for i in range(len(w)):
                print('i',i)
                # f(w)
                loss_w = self.calc_loss(X,y,w,loss_measure = loss_measure)
                # we will choose the margin that we will accept as 0.5
                # f(w+h)
                w[i] = w[i] + h      
                loss_w_plus_h = self.calc_loss(X,y,w,loss_measure = loss_measure)
                # dldw = f(w+h) - f(w)/h
                dldw[i] = (loss_w_plus_h- loss_w)/h
                print('dldw using numerical solution')
                print(dldw)
        elif diff_type == 'Analytical':
            # calculate the derivitves dl/dw using calculus
            # l = (X.w - y)**2
            # dl/dw = 2*(X.w - y)@X = 2*X@()
            
            z = X@w
            
            print('sigmpod_grad = ',self.sigmoid_grad(z))
            
            dldw = 2*X@(z - y)#*self.sigmoid_grad(z)
            print('dldw using analytical solution')
            print(dldw)
        return dldw 
    
    def train(self,X,y,loss_measure = 'MAE', lr = 0.07, diff_type = 'Analytical'):
        """
        training of the model is done by evaluting loss and updating the weights
        """
        # initilization
        np.random.seed(50)
        w1 = np.random.rand(X.shape[1],2)
        w2 = np.random.rand(X.shape[1],1)
        # intialization. I am changing this manually
        #w[0] = 
        #w[1] = 3.5
        #print('initial weights')
        #print(w1)    
        # updating the weight     
        #lr =0.07
        
        for i in range(30):
            print('step =',i)
            
            # forward pass is essentially the process of calculting the loss given the data and the weight, for the loss we need to calculate a predicted value
            # calculate the loss at the current weights
            loss_w = self.calc_loss(X,y,w1,w2)
            print('loss is now',loss_w)
            # stoping condition, note that we don't require the loss is to be exactly zero but close enough
            if loss_w <= 0.08 :
                self.w = w
                print('final w = ',self.w)
                return self.w
            
            # updating the weight. 
            elif loss_w > 0.08 :
                # the backpropagation is essentially the calculation of the partially derivives using chain rules. since we have only one layer at this time we will just calculate the dldw
                #First at that particular point we calculate dldw 
                
                
                # dldw1 = dl/dz*dz/dw1
                #dl/dz =2(x@w-y), l = (z-y)**2
                #dz/dw = X>>>>>>>>>>>>>why: becauaw z =X@w
                dldw1 = 2*X@(X@w1-y)
                
                
                z1 = X@w1 # this is the new x
                
                # the new w is w2
                
                
                dldw2 =  2*z1@(z1@w2 - y)#self.diff(w1,X,y,loss_measure, diff_type)
                               
                               
                # dl/dw1
                print('dldw1')
                print(dldw1)
                print(w2)
                dldw2 = self.diff(w2,X,y,loss_measure, diff_type)
                # Now we update the weight
                w1 = w1 -lr*dldw1
                w2 = w2 -lr*dldw2
                
                print('w1')
                print(w1)  
                                
                print('w2')
                print(w2) 
                
    def predict(self,X_new):
        """
        Given a new data, the model will be using the weights stored in the self.w to produce a prediction
        """
        predicted_y = X_new@self.w
        
        return predicted_y

In [113]:
lm = linear_model()
lm.train(X,y,loss_measure = 'MSE', lr = 0.08,diff_type = 'Analytical')

step = 0
z1
[[1.24467721 0.85249612]
 [1.22833101 0.2879194 ]]
y_predicted [[1.31921117]
 [0.75040089]]
loss is now 94.18034808664889
dldw1
[[-60.56462911 -64.01417672]
 [-66.98859874 -67.4608621 ]]
[[0.3773151 ]
 [0.99657423]]
sigmpod_grad =  [[0.12602231]
 [0.24885816]]
dldw using analytical solution
[[-60.72444017]
 [-61.76351557]]
w1
[[5.33977197 5.34921724]
 [5.61456182 5.79319888]]
w2
[[5.23527031]
 [5.93765548]]
step = 1
z1
[[16.29410577 16.49163336]
 [10.4047541  10.25445285]]
y_predicted [[183.22568534]
 [115.35910835]]
loss is now 20298.373578560095
dldw1
[[17.98593129 18.47543914]
 [ 4.95512643  6.44089447]]
[[5.23527031]
 [5.93765548]]
sigmpod_grad =  [[7.48188298e-08]
 [5.72392935e-05]]
dldw using analytical solution
[[17.16909531]
 [ 6.91286568]]
w1
[[3.90089747 3.87118211]
 [5.21815171 5.27792732]]
w2
[[3.86174269]
 [5.38462622]]
step = 2
z1
[[13.01994665 13.02029154]
 [ 6.48454071  6.33561901]]
y_predicted [[120.38908701]
 [ 59.15656791]]
loss is now 7072.12927291932
dl

In [111]:
# it is amazing to see that it is working