>>> Work in Progress

### Outline
- Linear Predictors
- Loss minimization
- Stochastic gradient descent

### Linear Predictors
#### Application: spam classification
Input x = email message
Output y $\in$ {spam, not spam}

**Types of prediction tasks**
- Binary classification (email $\Rightarrow$ spam/not spam)
> $x \rightarrow \fbox{f} \rightarrow y \in \{+1, -1\}$
- Regression (location, year $\Rightarrow$ housing price)
> $x \rightarrow \fbox{f} \rightarrow y \in \mathbb R$
- Multiclass classification: y is a category
> 100 Images $\rightarrow \fbox{f} \rightarrow cat$
- Ranking: y is a permutation
> 1 2 3 4 $\rightarrow \fbox{f} \rightarrow$ 2 3 4 1
- Structured prediction: y is an object built from parts
> la casa azul $\rightarrow \fbox{f} \rightarrow$ the blue house
- many more..

----

### Feature extraction

- What properties of x might be relevant for predicting y?
> Input $\xrightarrow[\text{}]{\text{feature extractor}} \fbox{Feature Name: Feature Value}$ 
- <img src="images/02_featureExt.png" width=400 height=400>
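
A feature extractor can be sketched as a function that maps a raw input to a dictionary of feature names and values. The four features below are hypothetical choices made for illustration, not a prescribed set:

```python
def extract_features(x: str) -> dict:
    """Map a raw string x to a sparse {feature name: feature value} dict."""
    n = max(len(x), 1)  # guard against division by zero on empty input
    return {
        "length>10": 1.0 if len(x) > 10 else 0.0,
        "fracOfAlpha": sum(c.isalpha() for c in x) / n,
        "contains_@": 1.0 if "@" in x else 0.0,
        "endsWith_.com": 1.0 if x.endswith(".com") else 0.0,
    }


phi = extract_features("abc@gmail.com")
for name, value in phi.items():
    print(f"{name}: {value:.3f}")
```

Keeping the features in a dictionary makes each value easy to inspect by name; a dense vector in $\mathbb R^{d}$ is obtained by fixing an order for the names.
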
----

#### Feature vector
  > $\phi(x) \in \mathbb R^{d} $

#### Weight vector
- $w \in \mathbb R^{d} $

#### Score
  - a weighted combination of the features, $\in \mathbb R $
  - the score on an example (x,y) is $w.\phi(x)$: how confident we are in predicting +1
  - scaling w by a positive constant does not change the decision boundary (the set of points where the score is 0, orthogonal to w)
    - when used for prediction, only the sign of the score matters, so the magnitude of w does not matter
    - when used for learning, the magnitude does matter
  > $w.\phi(x) = \sum_{j=1}^{d}w_{j}\phi(x)_{j}$
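
As a small numerical check of the formula above, the sketch below computes the score for made-up values of w and $\phi(x)$ (both are illustrative assumptions). It also shows that scaling w by a positive constant changes the size of the score but not its sign:

```python
def dot(w, phi):
    """Score w . phi(x) = sum_j w_j * phi(x)_j."""
    return sum(w_j * phi_j for w_j, phi_j in zip(w, phi))


w = [0.5, -1.0, 2.0]       # illustrative weight vector
phi_x = [1.0, 2.0, 3.0]    # illustrative feature vector

score = dot(w, phi_x)                               # 0.5 - 2.0 + 6.0
scaled_score = dot([10 * w_j for w_j in w], phi_x)  # same direction, 10x longer

print(score, scaled_score)
```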

-----

#### Linear classifier (predictor)
- binary in this case: $f_{w}$
  > $f_{w}(x) = \text{sign}(w.\phi(x)) = 
    \begin{cases}
      +1 & \text{if $w.\phi(x) > 0$}\\
      -1 & \text{if $w.\phi(x) < 0$}\\
       ? & \text{if $w.\phi(x) = 0$}
    \end{cases}       
    $
- Example:
  - w = [2,-1]
  - $\phi(x) \in \{[2,0],[0,2],[2,4]\}$
<img src="images/02_imageClassifier.png" width=400 height=400>
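
The example above can be checked numerically. In this sketch the boundary case (score exactly 0), which the definition leaves open, is reported as 0; that convention is an assumption of the sketch:

```python
def predict(w, phi):
    """Linear classifier f_w(x) = sign(w . phi(x))."""
    score = sum(w_j * phi_j for w_j, phi_j in zip(w, phi))
    if score > 0:
        return +1
    if score < 0:
        return -1
    return 0  # on the decision boundary: undefined, reported as 0 here


w = [2, -1]
for phi in ([2, 0], [0, 2], [2, 4]):
    print(phi, "->", predict(w, phi))
```

The three feature vectors have scores 4, -2 and 0, so they fall on the positive side, on the negative side, and exactly on the decision boundary.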

----

#### Margin
  - Margin on an example (x,y) is $(w.\phi(x))y$: __how correct we are__
  - A margin less than 0 means that y and the score have different signs, so there is a mistake

#### Zero-one (0-1) loss
  - did you make a mistake?
  > $\begin{split}
 \text{Loss}_{0-1}(x,y,w) & = \mathbb 1[f_{w}(x) \neq y] \\
 & = \mathbb 1[(w.\phi(x))y \le 0]  \\
 & = \mathbb 1[\text{Margin} \le 0]  
 \end{split}$
  - $\mathbb 1[\cdot]$ is an indicator function that takes a condition and returns 1 or 0
    - if the condition is true, it returns 1, else 0
    - if the margin is less than or equal to 0, we count it as a mistake
<img src="images/02_zeroOneLoss2.png" width=400 height=400>  
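
The margin and zero-one loss definitions above translate directly into code. The weight vector and labelled examples below are made up for illustration:

```python
def margin(w, phi, y):
    """Margin (w . phi(x)) * y: positive when the score and label agree."""
    return sum(w_j * phi_j for w_j, phi_j in zip(w, phi)) * y


def zero_one_loss(w, phi, y):
    """1 if the margin is <= 0 (counted as a mistake), else 0."""
    return 1 if margin(w, phi, y) <= 0 else 0


w = [2, -1]
examples = [([2, 0], +1), ([0, 2], +1), ([2, 4], -1)]  # (phi(x), y) pairs
for phi, y in examples:
    print(phi, y, margin(w, phi, y), zero_one_loss(w, phi, y))
```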

----

- __Loss function__: how good is a predictor?
  - it is a number that tells us how dissatisfied we are with the prediction made using __w__ on __x__ when the correct output is __y__
  - Loss is on a single example (e.g. the squared residual); TrainLoss is over the whole training set (the average of the per-example losses)
  - High loss is bad, low loss is good
  > Squared Loss(x,y,w)$= (f_{w}(x)-y)^{2} = (w.\phi(x) - y)^{2}$  
  > Train Loss(w)$=\frac{1}{|\mathbb D_{train}|}\sum_{(x,y)\in\mathbb D_{\text{train}}}\text{Loss(x,y,w)}$   
- <img src="images/02_lossFun.png" width=400 height=400>
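
A minimal sketch of the two formulas above, using a tiny one-dimensional data set chosen only for illustration, where each point is a pair ($\phi(x)$, y):

```python
def squared_loss(w, phi, y):
    """Loss on a single example: (w . phi(x) - y)^2."""
    residual = sum(w_j * phi_j for w_j, phi_j in zip(w, phi)) - y
    return residual ** 2


def train_loss(w, points):
    """Average of the per-example losses over the training set."""
    return sum(squared_loss(w, phi, y) for phi, y in points) / len(points)


points = [([2.0], 4.0), ([4.0], 2.0)]  # illustrative training set
print(train_loss([1.0], points))
print(train_loss([0.8], points))
```

For this data set, w = [0.8] gives a lower training loss than w = [1.0], which is the kind of comparison the optimization algorithm automates.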

-----

### Optimization algorithm
- how to compute the best w?
  - Goal: min$_{w}$TrainLoss(w)
  - gradient $\nabla_{w}$TrainLoss(w) is the direction that increases the training loss the most, so we step in the opposite direction  
    - use the chain rule to compute the derivative
  - step size $\eta$  
  > Initialize w = [0,..0]  
  > For t = 1,..,T: (epochs)  
  > $\ \ \ w \leftarrow w - \eta\nabla_{w}$TrainLoss(w)  
  
  - Level curves
- If prediction = target, gradient is zero   
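
For the squared loss defined earlier, the chain rule gives the gradient used in the update. A short derivation in the notation of these notes:

> $\begin{split}
 \nabla_{w}\text{Loss}(x,y,w) & = \nabla_{w}(w.\phi(x) - y)^{2} \\
 & = 2(w.\phi(x) - y)\phi(x) \\
 \nabla_{w}\text{TrainLoss}(w) & = \frac{1}{|\mathbb D_{\text{train}}|}\sum_{(x,y)\in\mathbb D_{\text{train}}} 2(w.\phi(x) - y)\phi(x)
 \end{split}$

The factor $(w.\phi(x) - y)$ is the residual (prediction minus target), so an example whose prediction equals its target contributes nothing to the gradient.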

----

- <img src="images/02_leastSqObjFun.png" width=400 height=400>
- <img src="images/02_minTrainLoss.png" width=400 height=400>

#### Learner
- Optimization problem
  - what properties do we want the classifier to have in terms of data
- Optimization algorithm
  - how to optimize this

- Loss function Loss(x,y,w) quantifies how dissatisfied we would be if we used w to make a prediction on x when the correct output is y

#### Gradient Descent
- Gradient descent is slow
  - the gradient is computed from the training loss
  - and the training loss is a sum over all the training points
  - which makes each step expensive
  - how to avoid this?
    - **Stochastic gradient descent**
      - Rather than looping through all the training examples to compute a single gradient and make one step (which is expensive)
      - loop through the examples and update the weights w based on each example
      - each individual update is less accurate, but we can make many more updates
<img src="images/02_sgdTrainLoss.png" width=400 height=400>
    - Step size
      - $\eta$ near 0: too conservative
      - $\eta$ near 1: too aggressive
      - Strategy
        > Constant: $\eta = 0.1$  
        > Decreasing: $\eta = 1/\sqrt{\text{number of updates made so far}}$        
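
The per-example update with the decreasing step size can be sketched as follows for the squared loss. The synthetic data, the hypothetical `true_w`, the fixed seeds, and the choice to shuffle the examples each epoch are all assumptions of this sketch:

```python
import math
import random


def sgd(points, d, epochs=50, seed=0):
    """Stochastic gradient descent on the squared loss.

    points: list of (phi, y) pairs, where phi is a list of d floats.
    """
    rng = random.Random(seed)
    w = [0.0] * d
    num_updates = 0
    for _ in range(epochs):
        order = list(range(len(points)))
        rng.shuffle(order)  # visit the examples in random order
        for i in order:
            phi, y = points[i]
            residual = sum(w_j * p_j for w_j, p_j in zip(w, phi)) - y
            num_updates += 1
            eta = 1.0 / math.sqrt(num_updates)  # decreasing step size
            # Gradient of the squared loss on this one example: 2 * residual * phi
            w = [w_j - eta * 2 * residual * p_j for w_j, p_j in zip(w, phi)]
    return w


# Synthetic, noise-free data generated from a hypothetical true weight vector
data_rng = random.Random(1)
true_w = [1.0, 2.0]
points = []
for _ in range(100):
    phi = [data_rng.uniform(-1, 1) for _ in true_w]
    y = sum(t * p for t, p in zip(true_w, phi))
    points.append((phi, y))

w = sgd(points, d=len(true_w))
print(w)
```

Each update touches a single example, so one pass over the data makes as many updates as there are points, compared with a single update per pass for full gradient descent.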

#### Classification - Will SGD work on 0-1 loss?
- No? Why?
  - 1) It's not differentiable at margin 0
  - 2) The gradient is zero everywhere else, so it carries no information
    - The weights will not move
<img src="images/02_zeroOneLoss2.png" width=400 height=400>
- How to solve this? 
  - Make the gradient non-zero
  - **Hinge loss**
    - Loss$_{hinge}(x,y,w) = \text{max}\{1-(w.\phi(x))y, 0\}$  
    - <img src="images/02_hingeLoss.png" width=400 height=400>  
    - Calculate gradient of this hinge loss  
      > $\nabla_{w}\text{Loss}_{hinge}(x,y,w) =   
      \begin{cases}  
      -\phi(x)y & \text{if } 1-(w.\phi(x))y > 0 \\   
      0 & \text{otherwise}   
      \end{cases}  
      $   
      > or, equivalently,    
      > $\nabla_{w}\text{Loss}_{hinge}(x,y,w) =   
      \begin{cases}   
      0 & \text{if } (w.\phi(x))y \ge 1 \\   
      -\phi(x)y & \text{otherwise}   
      \end{cases}  
      $   
- Other types of loss functions for classification    
  - Logistic  
  - <img src="images/02_loss4Classification.png" width=400 height=400>  
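
The hinge loss and its gradient from the formulas above can be sketched in a few lines. The weight vector and the two examples are made up for illustration, and at the kink (margin exactly 1) the sketch returns the zero vector:

```python
def hinge_loss(w, phi, y):
    """max{1 - (w . phi(x)) y, 0}."""
    margin = sum(w_j * p_j for w_j, p_j in zip(w, phi)) * y
    return max(1 - margin, 0)


def hinge_gradient(w, phi, y):
    """-phi(x) y when 1 - margin > 0, else the zero vector."""
    margin = sum(w_j * p_j for w_j, p_j in zip(w, phi)) * y
    if 1 - margin > 0:
        return [-p_j * y for p_j in phi]
    return [0] * len(phi)


w = [2, -1]
print(hinge_loss(w, [2, 0], +1), hinge_gradient(w, [2, 0], +1))  # margin 4
print(hinge_loss(w, [1, 2], +1), hinge_gradient(w, [1, 2], +1))  # margin 0
```

Unlike the zero-one loss, the second example (margin 0) yields a non-zero gradient, so a gradient step moves w towards classifying it correctly.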

#### Example - Gradient Descent - Vectorized

In [None]:
# %load lec02-c02-gradientDescentVectorized.py
import pandas as pd
import numpy as np

# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_colwidth', None)

##############################################
# Model

# points = [(np.array([2]),4), (np.array([4]),2)]
# d = 1

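# Generate synthetic data: y = true_w . x + Gaussian noise
iterationCount = 2000  # used both as the number of points and, below, as the number of gradient steps
true_w = np.array([1, 2, 3, 4, 5])  # the vector that gradient descent should recover
d = len(true_w)
dfColNames = [f"w{s+1}" for s in range(d)]
dfColNames.append('F(w)')
points = []
for i in range(iterationCount):
    x = np.random.randn(d)
    y = true_w.dot(x) + np.random.randn()
    points.append((x,y))


def F(w):
    # Training loss: mean squared residual over all points
    return sum((w.dot(x) - y)**2 for x, y in points) / len(points)

def dF(w):
    # Gradient of the training loss with respect to w
    return sum(2*(w.dot(x) - y) * x for x, y in points) / len(points)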

##############################################
# Algorithm

def gradientDescent(F, dF, d):
    w = np.zeros(d)  # start from the zero vector
    eta = 0.01       # constant step size

    rows = []
    for t in range(iterationCount):
        value = F(w)
        gradient = dF(w)
        # Record w together with the loss evaluated at that same w,
        # then take the gradient step
        rows.append([*w, value])
        w = w - eta * gradient
    df = pd.DataFrame(rows, columns = dfColNames)
    df['Iteration'] = df.index
    return df

result = gradientDescent(F, dF, d)

# print(result)



In [89]:
! python lec02-c02-gradientDescentVectorized.py 

import plotly.express as px
import numpy as np

result1 = result.melt(id_vars=['Iteration', 'F(w)'], \
            value_vars=['w1', 'w2', 'w3', 'w4', 'w5'], \
            var_name='w')
# print(result1)
xMin = np.floor(min(result1['value']))
xMax = np.ceil(max(result1['value']))
yMin = np.floor(min(result1['F(w)']))
yMax = np.ceil(max(result1['F(w)']))


# #Creating animation using plotly express
fig = px.line(result1, x="value", y="F(w)", 
#          animation_frame="Iteration", 
#               animation_group="w",
#            size="F(w)", # color="continent", hover_name="country",
#            log_y=True, # size_max=55, 
           color = "w",
           range_x=[xMin,xMax], range_y=[yMin,yMax],
           markers=True,
                )
# fig.update_traces()
fig.show()



#### Example - Stochastic Gradient Descent


In [None]:
%load lec02-c03-stochasticGradientDescent.py


In [5]:
!python lec02-c03-stochasticGradientDescent.py

iteration 0: w = [1.01430472 2.02831837 3.0797153  4.03925903 4.95799282], F(w) = 0.05737914263048566
iteration 1: w = [1.00862689 2.0253523  3.07926712 3.99970066 4.94798952], F(w) = 0.06631244401867277
iteration 2: w = [1.00639475 2.02417345 3.07918764 3.9868582  4.94337729], F(w) = 0.06983585712957409
iteration 3: w = [1.00518906 2.02354366 3.07914705 3.98050392 4.94080453], F(w) = 0.07171100079324301
iteration 4: w = [1.00443137 2.02315139 3.079119   3.97671346 4.93916944], F(w) = 0.07287409975435354
iteration 5: w = [1.00391016 2.02288327 3.07909757 3.97419594 4.93803925], F(w) = 0.07366579671890006
iteration 6: w = [1.00352923 2.02268825 3.07908043 3.97240244 4.93721149], F(w) = 0.07423944326399548
iteration 7: w = [1.00323848 2.02253992 3.07906632 3.97105995 4.93657906], F(w) = 0.07467420733608385
iteration 8: w = [1.00300915 2.02242325 3.07905449 3.97001739 4.93608008], F(w) = 0.07501507753759547
iteration 9: w = [1.00282358 2.02232904 3.07904439 3.96918436 4.93567631], F(w) = 

iteration 96: w = [1.00124224 2.02153201 3.07893173 3.96249542 4.93225025], F(w) = 0.07756401668532137
iteration 97: w = [1.00124028 2.02153102 3.07893154 3.96248761 4.93224602], F(w) = 0.07756675158425681
iteration 98: w = [1.00123835 2.02153005 3.07893136 3.96247997 4.93224188], F(w) = 0.07756943141278436
iteration 99: w = [1.00123646 2.0215291  3.07893119 3.96247247 4.93223783], F(w) = 0.07757205781795257
iteration 100: w = [1.00123461 2.02152817 3.07893101 3.96246513 4.93223385], F(w) = 0.0775746323817828
iteration 101: w = [1.00123279 2.02152725 3.07893084 3.96245793 4.93222995], F(w) = 0.07757715662443725
iteration 102: w = [1.00123101 2.02152636 3.07893067 3.96245087 4.93222613], F(w) = 0.07757963200720945
iteration 103: w = [1.00122926 2.02152548 3.07893051 3.96244394 4.93222238], F(w) = 0.07758205993535838
iteration 104: w = [1.00122755 2.02152461 3.07893035 3.96243715 4.9322187 ], F(w) = 0.07758444176074668
iteration 105: w = [1.00122586 2.02152377 3.07893019 3.96243048 4.932

iteration 197: w = [1.0011434  2.02148223 3.07892223 3.96210542 4.93203831], F(w) = 0.07770096914488561
iteration 198: w = [1.00114292 2.02148199 3.07892218 3.96210354 4.93203728], F(w) = 0.07770163130616632
iteration 199: w = [1.00114244 2.02148175 3.07892213 3.96210167 4.93203626], F(w) = 0.07770228685794381
iteration 200: w = [1.00114197 2.02148151 3.07892209 3.96209983 4.93203526], F(w) = 0.07770293589870023
iteration 201: w = [1.00114151 2.02148128 3.07892204 3.962098   4.93203426], F(w) = 0.07770357852497319
iteration 202: w = [1.00114105 2.02148105 3.07892199 3.9620962  4.93203327], F(w) = 0.07770421483139883
iteration 203: w = [1.00114059 2.02148081 3.07892195 3.96209441 4.9320323 ], F(w) = 0.07770484491075572
iteration 204: w = [1.00114014 2.02148059 3.0789219  3.96209263 4.93203133], F(w) = 0.07770546885402278
iteration 205: w = [1.00113969 2.02148036 3.07892186 3.96209088 4.93203037], F(w) = 0.07770608675040147
iteration 206: w = [1.00113925 2.02148014 3.07892182 3.96208914 

iteration 276: w = [1.00111613 2.02146847 3.07891951 3.96199865 4.93197995], F(w) = 0.07773856963342865
iteration 277: w = [1.00111588 2.02146835 3.07891948 3.96199769 4.93197942], F(w) = 0.07773890881133037
iteration 278: w = [1.00111564 2.02146823 3.07891946 3.96199673 4.9319789 ], F(w) = 0.07773924556125149
iteration 279: w = [1.0011154  2.0214681  3.07891943 3.96199578 4.93197838], F(w) = 0.07773957990918387
iteration 280: w = [1.00111515 2.02146798 3.07891941 3.96199484 4.93197786], F(w) = 0.07773991188073145
iteration 281: w = [1.00111492 2.02146786 3.07891939 3.9619939  4.93197735], F(w) = 0.0777402415011444
iteration 282: w = [1.00111468 2.02146774 3.07891936 3.96199298 4.93197684], F(w) = 0.07774056879531803
iteration 283: w = [1.00111444 2.02146762 3.07891934 3.96199205 4.93197634], F(w) = 0.07774089378778544
iteration 284: w = [1.00111421 2.0214675  3.07891931 3.96199114 4.93197584], F(w) = 0.07774121650274969
iteration 285: w = [1.00111397 2.02146739 3.07891929 3.96199023 4

iteration 355: w = [1.00110092 2.0214608  3.07891797 3.96193928 4.93194743], F(w) = 0.0777595004172058
iteration 356: w = [1.00110077 2.02146072 3.07891795 3.96193869 4.93194711], F(w) = 0.07775970605932701
iteration 357: w = [1.00110062 2.02146064 3.07891794 3.96193811 4.93194679], F(w) = 0.07775991055393117
iteration 358: w = [1.00110047 2.02146057 3.07891792 3.96193754 4.93194647], F(w) = 0.07776011391060016
iteration 359: w = [1.00110033 2.0214605  3.07891791 3.96193696 4.93194616], F(w) = 0.07776031613880204
iteration 360: w = [1.00110018 2.02146042 3.07891789 3.96193639 4.93194585], F(w) = 0.07776051724791774
iteration 361: w = [1.00110004 2.02146035 3.07891788 3.96193583 4.93194553], F(w) = 0.07776071724719555
iteration 362: w = [1.00109989 2.02146028 3.07891786 3.96193526 4.93194523], F(w) = 0.07776091614580155
iteration 363: w = [1.00109975 2.0214602  3.07891785 3.9619347  4.93194492], F(w) = 0.07776111395280237
iteration 364: w = [1.0010996  2.02146013 3.07891784 3.96193414 4

iteration 435: w = [1.00109112 2.02145584 3.07891697 3.96190108 4.93192648], F(w) = 0.0777729737782422
iteration 436: w = [1.00109102 2.02145579 3.07891696 3.96190069 4.93192626], F(w) = 0.07777311100868413
iteration 437: w = [1.00109092 2.02145574 3.07891695 3.96190031 4.93192605], F(w) = 0.07777324761312347
iteration 438: w = [1.00109082 2.02145569 3.07891694 3.96189992 4.93192584], F(w) = 0.07777338359583182
iteration 439: w = [1.00109072 2.02145564 3.07891693 3.96189954 4.93192563], F(w) = 0.07777351896104259
iteration 440: w = [1.00109062 2.02145559 3.07891692 3.96189915 4.93192542], F(w) = 0.0777736537129561
iteration 441: w = [1.00109053 2.02145554 3.07891691 3.96189877 4.93192521], F(w) = 0.07777378785572604
iteration 442: w = [1.00109043 2.02145549 3.0789169  3.9618984  4.931925  ], F(w) = 0.07777392139347153
iteration 443: w = [1.00109033 2.02145545 3.07891689 3.96189802 4.9319248 ], F(w) = 0.07777405433029035
iteration 444: w = [1.00109024 2.0214554  3.07891688 3.96189764 4.

### Summary
<img src="images/02_scoreSummary.png" width=400 height=400> 


  
-----