# Answer to Q1:
"Consider how $\alpha$ and $\beta$ are used in our regression models."

In general, we can think of $\alpha$ as the <i>ratio</i> (step size) applied to our next step. By "next step" we mean the vector $dE + \beta \cdot step_{previous}$, and by ratio, a coefficient that multiplies this vector. So, for example, an $\alpha$ of 0.01 means we move from $P$ to $P + (0.01\,dE + 0.01\beta \cdot step_{previous})$. Further, $\beta$ is the weight of our previous step relative to the derivative. This matters in cases where we don't want the derivative alone to dictate how our location changes, but want our previous steps to have an influence as well.<br>
Particular cases are as follows:<br>
<ul><li>$\alpha$ too small, $\beta$ too small: Since $\beta$ is too small, it is as if it were nonexistent, i.e. $dE$ is the dominant factor. However, since $\alpha$ is also too small, each step changes our location very little, and this kind of algorithm will take many steps to converge to the target point.
<li>$\alpha$ too large, $\beta$ too small: Once again, a $\beta$ that is too small means our steps are dominated by $dE$. However, in this case $\alpha$ being <i>too large</i> means it is large beyond the point of usefulness. In other words, this carries the risk that our sequence diverges, i.e. that it "blows up".
<li>$\alpha$ too small, $\beta$ large: A <i>large</i> $\beta$ means the previous step contributes a significant amount. Further, a rough order-of-magnitude argument suggests $\alpha \beta$ is only <i>small</i> (since $small \cdot large = normal$, and $verysmall \cdot large = small$), so although our step sizes still shrink, they shrink more slowly than in the case where both were small. This situation can be expected to converge to our target point relatively quickly.
<li>$\alpha$ too large, $\beta$ large: Similar to the $2^{nd}$ case, a very large $\alpha$ is one that is beyond usefulness. Therefore, even though $\beta$ is significant, in the end we could have a divergent sequence.
</ul>
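
The four cases above can be tried out on a toy problem. The sketch below is only an illustration under assumptions that are not part of the exercise: it maximises the one-dimensional objective $E(P) = -P^2/2$ (so $dE = -P$ and the target point is $0$), and the function name `run_momentum` and the particular $\alpha$, $\beta$ values are made up for the demonstration. The update itself is the one described above: $step = dE + \beta \cdot step_{previous}$, then $P \leftarrow P + \alpha \cdot step$.

```python
def run_momentum(alpha, beta, steps=100, start=1.0):
    """Maximise the toy objective E(P) = -P**2 / 2, whose derivative is dE = -P."""
    position, step = start, 0.0
    for _ in range(steps):
        gradient = -position              # dE at the current position
        step = gradient + beta * step     # derivative plus weighted previous step
        position = position + alpha * step
    return position

# Hypothetical parameter choices, one per case discussed above.
cases = {
    "alpha small, beta small": run_momentum(0.01, 0.0),
    "alpha large, beta small": run_momentum(2.5, 0.0),
    "alpha small, beta large": run_momentum(0.01, 0.9),
    "alpha large, beta large": run_momentum(5.0, 0.9),
}

for name, final in cases.items():
    print(name, "-> distance from target after 100 steps:", abs(final))
```

On this toy objective the two small-$\alpha$ runs approach $0$ (the one with large $\beta$ noticeably faster), while both large-$\alpha$ runs blow up. Other objectives may of course behave differently.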

# Answer to Q2:
"Use the logistic regression model to determine if input is in class 1; compare with Naive Bayesian method."

In [1]:
# MOST IDEAS/CODE COPIED FROM PREVIOUS IRIS EXERCISE, SEE:
# https://github.com/SWE582-Fall-2017-Bogazici/naive-bayes-alidursen/blob/master/Iris%20-%20A%20Continuous%20Exercise.ipynb

import numpy
import pandas
import math

# header=None: the file has no header row, so columns are numbered 0..4
IrisData = pandas.read_csv('bezdekIris.csv', header=None).set_index([4])
X = IrisData.values
StatsByClass = numpy.dstack((numpy.array([IrisData.loc['Iris-setosa'].mean().values,
                                          IrisData.loc['Iris-versicolor'].mean().values,
                                          IrisData.loc['Iris-virginica'].mean().values]),
                             numpy.array([IrisData.loc['Iris-setosa'].std().values,
                                          IrisData.loc['Iris-versicolor'].std().values,
                                          IrisData.loc['Iris-virginica'].std().values])))

def NormalProb(point, mean=0., std=1., interval=0.01):
    xlower = ((point-mean)-interval*0.5)/std
    xupper = ((point-mean)+interval*0.5)/std
    return (math.erf(xupper/math.sqrt(2.))-math.erf(xlower/math.sqrt(2.)))/2.

def Sigmoid(x):
    return 1/(1+numpy.exp(-x))

def LogSumExp(a,b):
    return a + numpy.log(1+numpy.exp(b-a))

def SetosaBayes(sl,sw,pl,pw):
    # As usual, accepts 4 floats. The earlier exercise returned a matrix of 2 entries:
    # 1st the probability of being 'Setosa', 2nd of not being.
    # But honestly, since we want to easily compare two methods, it is easier to
    # return a boolean, True if P('Setosa')>=0.5, False otherwise
    param = [sl, sw, pl, pw]
    pS = numpy.zeros((4,3))
    for i in range(4):
        for k in range(3):
            pS[i,k] = NormalProb(point=param[i], mean=StatsByClass[k,i,0], std=StatsByClass[k,i,1])
    p = numpy.ones(3)
    for i in range(4):
        p = p*pS[i]
    return ( p[0]*2. >= numpy.sum(p) )

# To get i'th row/data point from data: (returns a numpy array, because why not?)
# def Row(i):
#     return IrisData.iloc[i:i+1,:].as_matrix()
# No need to define a function that never gets used.

y = numpy.array([int(u=='Iris-setosa') for u in IrisData.index.values])
w = numpy.zeros(4)
p = numpy.zeros(4)
yTX = numpy.dot(y.T, X)
# Obtaining a suitable w
alpha = 0.01
beta = 0.8
trials = 1000
for epoch in range(trials):
    L, dL = (numpy.dot(yTX, w)-numpy.sum(LogSumExp(0,numpy.dot(X,w))),
            numpy.dot(X.T,y-Sigmoid(numpy.dot(X,w))))
    p = dL + beta*p
    w = w + alpha*p
    #if (epoch+1)%100==0:
    #    print('Step', epoch,':', L)

def SetosaLogistic(sl,sw,pl,pw):
    # As discussed in SetosaBayes method, will return a boolean.
    # Since I don't fully know if we can use logistic method on attribute subsets,
    # I'll mandate entry of all 4 attributes. This, in turn, will necessitate the same for SetosaBayes.
    
    # Originally we had Sigmoid(V) >= 0.5 here, but by the nature of Sigmoid it's equivalent to V >= 0
    # Of course, this is due to my lax condition: "Better than 50% is good enough." A stricter condition
    # would rightfully use Sigmoid function itself.
    return (numpy.inner(numpy.array([sl,sw,pl,pw]),w) >= 0.)

def MethodComparison(*e):
    """4 floats needed."""
    return (SetosaBayes(*e) == SetosaLogistic(*e))
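
One caveat about `LogSumExp(a,b)` as defined in the cell above: it evaluates `exp(b-a)` directly, which overflows once `b-a` gets large (beyond roughly 709 for 64-bit floats). A common way around this is to factor out the larger of the two arguments first. The sketch below is a plain-Python, scalar illustration of that idea; the name `log_sum_exp` is hypothetical and it is not the version used in the training loop.

```python
import math

def log_sum_exp(a, b):
    """Return log(exp(a) + exp(b)) without exponentiating a large number."""
    hi, lo = max(a, b), min(a, b)
    # exp(lo - hi) is at most 1, so it cannot overflow; log1p keeps precision near 0.
    return hi + math.log1p(math.exp(lo - hi))

print(log_sum_exp(0.0, 0.0))      # log(2)
print(log_sum_exp(0.0, 1000.0))   # exp(1000) alone would overflow
```

For arrays, `numpy.logaddexp(a, b)` computes the same quantity elementwise.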

In [168]:
from random import random
means = numpy.array([5.84,3.05,3.76,1.20])*2
trial = 200
overalltrial = 200
overallc = []
ocM, ocm = 0,1
for i in range(overalltrial):
    c,t = 0,0
    for j in range(trial):
        l = numpy.array([random(),random(),random(),random()])*means
        r = MethodComparison(*l)
        if r:
            c += 1
        t = c/float(trial)
    if t>ocM:
        ocM = t
    if t<ocm:
        ocm = t
    overallc.append(t)
ocM,ocm
# This way of finding max/min is faster than using max(),min() functions respectively.
# Note that, this is still not very fast, there are currently 40,000 trials going on, after all.

(0.755, 0.55)

The numbers above can be changed to one's heart's desire, but I believe it's fair to suggest that the Naive Bayesian and logistic models agree roughly 55% to 75% of the time on these random inputs. Given that the Bayesian model is inherently flawed, this does not necessarily imply a failure of the logistic model or of our implementation of it.<br>
Yet, upon further study, it's seen that we did not include the column of all 1's that is generally added to provide a bias term. Including it might have given us a different $w$ than the one obtained here.
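
To illustrate why the missing column of 1's can matter, here is a small sketch on made-up data (not the Iris set). The data, the helper `predict`, and the hand-picked weights are all assumptions for illustration; nothing here is fitted. With the rule `X.dot(w) >= 0` and no bias column, the decision boundary always passes through the origin, so points that all lie on one side of the origin cannot be split.

```python
import numpy

# Hypothetical one-feature data: class 1 exactly when x > 5.
x = numpy.array([[3.0], [4.0], [6.0], [7.0]])
y = numpy.array([0, 0, 1, 1])

# Append a column of 1's, with the same numpy.insert call style used for X_.
X_bias = numpy.insert(x, 1, 1.0, axis=1)

def predict(X, w):
    """Same rule as SetosaLogistic: class 1 when the inner product is >= 0."""
    return (X.dot(w) >= 0.0).astype(int)

# Without a bias, every positive x gets the same label whatever the sign of w.
acc_no_bias = max(numpy.mean(predict(x, numpy.array([w])) == y) for w in (-1.0, 1.0))

# With the bias column, hand-picked weights [1, -5] put the boundary at x = 5.
acc_bias = numpy.mean(predict(X_bias, numpy.array([1.0, -5.0])) == y)

print(acc_no_bias, acc_bias)
```

Whether this effect is what limits the agreement between the two methods on the Iris data is not something this sketch can show; it only demonstrates that the extra column changes which boundaries are reachable.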

### Q2 - PyTorch Addendum
#### Fatal errors:
Somehow my torch installation is incompatible with my system: whenever I try to run some method (even rather trivial ones!), the kernel shuts down and restarts.<br>However, I had some theoretical problems as well. As commented below, I checked the example pyTorchExample.ipynb. There, an error function is introduced, <br>
<b>EuclidianLoss = torch.nn.MSELoss(size_average=True)</b><br>
which we later call with<br>
<b>EuclidianLoss(f(Variable(x)), Variable(y))</b><br>
What is going on here? Does torch.nn.MSELoss() return a <i>function</i> that accepts 2 parameters? If that is the case, how can we adapt that to our situation? For example, I discovered there is a torch.nn.Softplus() function that can stand in for our LogSumExp function. However, what does it return? A number? A function, like torch.nn.MSELoss() does?<br><br>
Some further problems: Do we need a column of pure 1's to account for bias terms or not? How do we transform a tensor into our data set? All in all, both my technical problems and my unfamiliarity with torch led me to present an incomplete answer to the 2nd question.
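
One possible reading of the questions above: `torch.nn.MSELoss` and `torch.nn.Softplus` are classes, so writing `torch.nn.MSELoss()` builds an object, and that object can itself be called like a function. The sketch below mimics this pattern in plain Python so that it runs without torch. The classes `MeanSquaredError` and `Softplus` here are hypothetical stand-ins written for illustration, not the torch implementations. In torch the inputs and results would be tensors rather than Python lists and floats.

```python
import math

class MeanSquaredError:
    """Stand-in for the 'construct first, call later' pattern of a loss object."""
    def __call__(self, prediction, target):
        squared = [(p - t) ** 2 for p, t in zip(prediction, target)]
        return sum(squared) / len(squared)

class Softplus:
    """Stand-in that applies log(1 + exp(v)) to every entry, i.e. LogSumExp(0, v)."""
    def __call__(self, values):
        return [math.log1p(math.exp(v)) for v in values]

loss_fn = MeanSquaredError()             # step 1: build the object
loss = loss_fn([1.0, 2.0], [0.0, 2.0])   # step 2: call it with 2 arguments

softplus = Softplus()
print(loss, softplus([0.0]))
```

Under this reading, `LSE = torch.nn.Softplus()` gives an object that is later applied to a tensor, in the same way `EuclidianLoss` is applied to a prediction and a target.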

In [5]:
# Before we begin, it's helpful to see these two:
# https://github.com/torch/demos/blob/master/linear-regression/example-linear-regression.lua
# https://github.com/atcemgil/notes/blob/master/pyTorchExample.ipynb

import torch
X_ = torch.Tensor(numpy.insert(X,4,1,axis=1))
# Note that this step introduces some floating-point rounding error,
# since torch.Tensor stores 32-bit floats by default.
# For example, X_[1][2] returns 1.399999976158142 where it should've been just 1.4
# However, this is _very_ insignificant. Just not the _same_ dataset, is all.

# torch provides its own LogSumExp(0,w): Softplus()
LSE = torch.nn.Softplus()

# Learning rate:
eta = 0.01

for epoch in range(10):
    E = 

1.399999976158142