## PA4 - Multiclass Learning

In this exercise you are required to design, implement, train, and test multiclass learning algorithms for the [MNIST](http://yann.lecun.com/exdb/mnist) dataset of images and the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups) dataset of text news data. You will explore two different approaches to multiclass learning: *Cross Entropy*, a general loss appropriate for multiple classes, and *One-vs-All*, a reduction scheme from binary classification.

In [1]:
# Make sure to have the packages numpy, scikit-learn, and TensorFlow
# installed before starting the assignment

# import numpy first
import numpy as np

## The Losses

In the case of Cross Entropy, denoted in short as **CE**, the supervision for an example $\mathbf{x}\in\mathbb{R}^d$  is, instead of a label
$y\in[k]\stackrel{\tiny \mathsf{def}}{=}\{1,\ldots,k\}$, a vector $\mathbf{y}\in\Delta^k$ where
$$\Delta^k = \big\{\mathbf{p}=(p_1,\ldots,p_k)\, \big| \, \sum_i p_i = 1
   \mbox{ and } \forall i: p_i \geq 0 \big\} ~.$$
$\Delta^k$ is called the $k$-dimensional simplex and, informally speaking, it consists of all probability vectors in $\mathbb{R}^k$. The CE loss for a target probability vector $\mathbf{y}$ and predicted probability vector $\mathbf{\hat{y}}$ is defined as follows:
   $$\ell_{CE}(\mathbf{y, \hat{y}}) \stackrel{\tiny \mathsf{def}}{=}
     \sum_{i=1}^k y[i] \log\left(\frac{y[i]}{\hat{y}[i]}\right) ~ .$$
We use the convention that $0\log(0)\stackrel{\tiny \mathsf{def}}{=} 0$. To obtain the vector $\mathbf{\hat{y}}$ we exponentiate the vector of predictions $\mathbf{z} = W\mathbf{x} \in \mathbb{R}^k$ where $\mathbf{\hat{y}}[i] = \frac{\exp({\mathbf{z}[i]})}{Z}$ for all $i \in [k]$ and $Z$ ensures that $\mathbf{\hat{y}}\in \Delta^k $.
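As a quick illustration of the definition above, the following sketch evaluates $\ell_{CE}$ for a single example using plain Python lists. The helper name `ce_loss` and the example vectors are illustrative assumptions and are not part of the assignment skeleton.

```python
import math

def ce_loss(y, y_hat):
    """CE loss sum_i y[i] * log(y[i] / y_hat[i]) for one example.

    Terms with y[i] == 0 are skipped, which implements 0 * log(0) = 0.
    """
    return sum(yi * math.log(yi / yh) for yi, yh in zip(y, y_hat) if yi > 0)

y_hat = [0.7, 0.2, 0.1]            # hypothetical predicted probabilities
one_hot_target = [1.0, 0.0, 0.0]   # target concentrated on the first class

loss_one_hot = ce_loss(one_hot_target, y_hat)  # reduces to -log(0.7)
loss_perfect = ce_loss(y_hat, y_hat)           # prediction equals the target
```

For a one-hot target the sum collapses to $-\log \mathbf{\hat{y}}[c]$ for the true class $c$, and the loss is zero when the prediction equals the target.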

The One-vs-All, or **OvA**, approach instead uses $k$ logistic binary classifiers: the idea is to pick the label that has been *most confidently* chosen for the given example versus the rest. Formally, the loss can be expressed as follows. Given $(\mathbf{x},y)\in\mathbb{R}^d\times[k]$ let us define $\mathbf{\bar{y}}\in\{-1,1\}^k$ as the vector of labels resulting from the multiclass reduction. That is, $\mathbf{\bar{y}}[y]=1$ and for all $j\neq y$ we set $\mathbf{\bar{y}}[j]=-1$. Then the loss for $(\mathbf{x},y)$, with $\mathbf{z}$ defined as before, is
$$
\ell_{OvA}(\mathbf{\bar{y}}, \mathbf{z}) = \sum_{j=1}^k \log\big(1+ e^{-\mathbf{\bar{y}}[j] \, \mathbf{z}[j]}\big) ~ .
$$
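The OvA loss for one example can be sketched in a few lines of plain Python. The function name `ova_loss` is a hypothetical helper, the label is taken to be 0-indexed (as in the notebook's code), and the rewrite of $\log(1+e^{m})$ as $\max(m,0)+\log(1+e^{-|m|})$ is a standard identity used here only to avoid overflow for large scores.

```python
import math

def ova_loss(y, z):
    """Logistic one-vs-all loss for a 0-indexed label y and a list of k scores z."""
    total = 0.0
    for j, zj in enumerate(z):
        y_bar = 1.0 if j == y else -1.0   # +1 for the true class, -1 for the rest
        m = -y_bar * zj
        # log(1 + exp(m)) evaluated as max(m, 0) + log1p(exp(-|m|))
        total += max(m, 0.0) + math.log1p(math.exp(-abs(m)))
    return total

scores = [3.0, -3.0, -3.0]
loss_correct = ova_loss(0, scores)  # scores agree with label 0
loss_wrong = ova_loss(1, scores)    # scores disagree with label 1
```

With all scores equal to zero, each of the $k$ binary classifiers contributes $\log 2$, so the loss is $k \log 2$.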

$Z \in \mathbb{R}$ is a constant number, and it is there to make sure that after exponentiation of the vector $\mathbf{z}$ the vector $\mathbf{\hat{y}}$ belongs to $\Delta^k$. More specifically, the constant $Z$ ensures that $\sum_{j=1}^k \mathbf{\hat{y}}[j] = 1$.
ACT 1 asks you to provide an explicit formula for this constant $Z$ (it will be in terms of the vector $\mathbf{z}$).

### ACT 1
Compute the value of $Z$ explicitly.
$$
Z
=
\sum_{i=1}^k
\exp({\mathbf{z}[i]})
$$

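To compute $\mathbf{\hat{y}}$ from $\mathbf{z}$ in a numerically stable way, I would first rescale $\mathbf{z}$ to unit norm:

$$
\mathbf{z} \leftarrow \frac{\mathbf{z}}{||\mathbf{z}||}
$$

Then, I would proceed as stated above to calculate $\mathbf{\hat{y}}$. Note that this rescaling bounds the exponents but also changes the resulting probabilities, since $\exp(\mathbf{z}[i])/Z$ is not invariant to scaling $\mathbf{z}$; subtracting $\max_i \mathbf{z}[i]$ from every coordinate would instead avoid overflow while leaving $\mathbf{\hat{y}}$ unchanged.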
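A common alternative for numerical stability, sketched here as an assumption rather than as the method used in this notebook, is to subtract the largest score before exponentiating. Because the shift cancels between numerator and normalizer, the probabilities are unchanged. The name `stable_softmax` is a hypothetical helper.

```python
import math

def stable_softmax(z):
    """Return exp(z[i]) / Z, computed after shifting z by its maximum."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # every exponent is <= 0, so no overflow
    Z = sum(exps)                        # normalizer: the entries then sum to 1
    return [e / Z for e in exps]

small = [1.0, 2.0, 3.0]
shifted = [v + 1000.0 for v in small]    # math.exp(1003.0) alone would overflow

p_small = stable_softmax(small)
p_large = stable_softmax(shifted)
```

Both calls return the same probability vector, since adding a constant to every score does not change $\exp(\mathbf{z}[i])/Z$.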

In [105]:
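def calculate_Z(z):
    """
    Normalizer that exp(z) is divided by.

    NOTE: np.average divides the sum by the number of entries, so the
    resulting y_hat sums to that number of entries rather than to 1.
    np.sum(np.exp(z)), commented out below, puts y_hat on the simplex.
    """
    # return np.sum(np.exp(z))
    return np.average(np.exp(z))


def calculate_y_hat(z):
    """
    Scale z to unit norm, exponentiate, and divide by calculate_Z.
    """
    z = normalize_vector(z)
    Z = calculate_Z(z)
    y = np.exp(z) / Z
    return y


def normalize_vector(x):
    """
    Return x divided by its Euclidean (Frobenius) norm.
    """
    return x / np.linalg.norm(x)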

# testing
z = np.random.rand(1, 5)
z = z*100000000000
print(z.shape)
print('z', z)
print('l z', np.linalg.norm(z))

z_normalized = normalize_vector(z)
print('norm z', z_normalized)
print('l norm z', np.linalg.norm(z_normalized))

Z = calculate_Z(z_normalized)
print('Z', Z)

y_hat = calculate_y_hat(z_normalized)
print('y_hat', y_hat)

print('sum y_hat', np.sum(y_hat))

(1, 5)
z [[1.01227388e+10 6.23982677e+10 3.55164100e+10 9.82613777e+10
  9.83637402e+10]]
l z 156806099229.7428
norm z [[0.06455577 0.39793266 0.22649891 0.62664257 0.62729537]]
l norm z 1.0
Z 1.5106972975948252
y_hat [[0.70608789 0.98546795 0.83021347 1.2387109  1.23951979]]
sum y_hat 5.0


Let $X \in \mathbb{R}^{n \times d}$ be the given dataset and $\mathbf{y} \in [k]^n$ the corresponding labels. Construct one-hot vectors for each label to obtain a probability vector from $\Delta^k$: the one-hot vector for the class $c \in [k]$ is defined as $\mathbf{1}_c \in \Delta^k$, a vector of zeros except for coordinate $c$, where it is $1$. For example, if $k=4$ and $c=2$, then $\mathbf{1}_c$ is $(0, 1, 0, 0)$. When needed, we will smooth the one-hot vectors for numerical purposes. The $\epsilon$-smoothed one-hot vector for class $c$ is given by $(1-\epsilon) \mathbf{1}_c + \epsilon \mathbf{\frac1k}$ where $\mathbf{\frac1k} \in \Delta^k$ is the uniform probability vector, namely $\mathbf{\frac1k} = (\frac1k, \frac1k, \ldots, \frac1k)$.

ACT 2: Implement the `one_hot` function that, given a vector of labels $\mathbf{y} \in [k]^n$, returns the matrix $Y \in \mathbb{R}^{n \times k}$ of one-hot vectors. **In the code, the labels $\mathbf{y}$ are $0$-indexed, so you need to adjust accordingly.**

In [76]:
# Given y vector of labels, return matrix Y whose i-th row is the one-hot vector of label y_i 
# If eps is not 0, then need to compute the smoothed one-hot vectors instead
# If k is 0, then need to infer its value instead
def one_hot(y, k=0, eps=0):
    """
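    ACT 2

    Parameters
    ----------
    y : np.array(n x 1)
        Integer labels, 0-indexed.
    k : int
        Number of classes; inferred as max(y) + 1 when 0.
    eps : float
        Smoothing parameter; 0 gives plain one-hot vectors.
    """
    print('-----in one hot ----')
    n, _ = y.shape  # y is a column vector of shape (n, 1)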
    k_max = np.amax(y) 
    
    if not k:
        k = k_max + 1
    # else:
    #    assert k == k_max, "k values do not match!"
    
    print('n, k for Y', (n, k))
    Y = np.zeros((n, k))
    # col_indices = y[0, :]
    col_indices = y[:, 0]
    Y[np.arange(n), col_indices] = 1.0
    
    
    if eps:
        print('smoothing one hot...')
        Y = (1 - eps) * Y + (eps * 1 / k)
    print('----')
    return Y


# testing
y_test = np.random.randint(0, 5, size=(10, 1))
print(y_test)
print(y_test.shape)

y_hot = one_hot(y_test)
y_hot.shape

print(y_hot)

y_hot = one_hot(y_test, eps=0.1)
print(y_hot)

[[1]
 [1]
 [2]
 [2]
 [1]
 [2]
 [1]
 [2]
 [3]
 [4]]
(10, 1)
-----in one hot ----
n, k for Y (10, 5)
----
[[0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
-----in one hot ----
n, k for Y (10, 5)
smoothing one hot...
----
[[0.02 0.92 0.02 0.02 0.02]
 [0.02 0.92 0.02 0.02 0.02]
 [0.02 0.02 0.92 0.02 0.02]
 [0.02 0.02 0.92 0.02 0.02]
 [0.02 0.92 0.02 0.02 0.02]
 [0.02 0.02 0.92 0.02 0.02]
 [0.02 0.92 0.02 0.02 0.02]
 [0.02 0.02 0.92 0.02 0.02]
 [0.02 0.02 0.02 0.92 0.02]
 [0.02 0.02 0.02 0.02 0.92]]


You will first implement an abstract class for losses in general that has all the methods implemented except computing the loss value and the gradient. Then the CE and OvA classes will each implement their specific loss function and gradient. *Feel free to add your own methods to the class skeletons provided.*

It is worth noting that label prediction in both cases is done by predicting the coordinate with the largest value in the vector $\mathbf{z}$ as the label. In order to prevent the parameters from blowing up during training, we will implement methods `RowNorms` and `Project` to compute the norms of the rows of $W \in \mathbb{R}^{k \times d}$ and their projections onto the sphere of radius $r$.

ACT 3: Given a vector $\mathbf{v} \in \mathbb{R}^d$, compute its projection onto the sphere $S_r = \{ \mathbf{u} \in \mathbb{R}^d \, : \, \| \mathbf{u} \| \leq r \}$.

### ACT 3

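Since $S_r$ contains every vector with norm at most $r$, the projection leaves $\mathbf{v}$ unchanged when $||\mathbf{v}|| \leq r$ and rescales it to norm $r$ otherwise:

$$
\mathbf{v}_{proj}
=
\begin{cases}
\mathbf{v} & \text{if } ||\mathbf{v}|| \leq r \\
r \frac{\mathbf{v}}{||\mathbf{v}||} & \text{otherwise}
\end{cases}
$$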
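The projection can be illustrated for a single vector with a short sketch. It assumes $S_r$ is the set $\{\mathbf{u} : \|\mathbf{u}\| \leq r\}$ exactly as written in ACT 3, so vectors already inside the set are returned unchanged. The name `project_to_ball` is a hypothetical helper, not part of the class skeleton.

```python
import math

def project_to_ball(v, r):
    """Project the list v onto {u : ||u|| <= r} in the Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= r:
        return list(v)          # already inside: nothing to do
    scale = r / norm            # shrink so the result has norm exactly r
    return [scale * x for x in v]

outside = project_to_ball([3.0, 4.0], 1.0)   # norm 5, rescaled to norm 1
inside = project_to_ball([0.3, 0.4], 1.0)    # norm 0.5, returned unchanged
```

Handling the inside case first also avoids a division by zero for the zero vector.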

In [77]:
class GenericLoss:
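    def __init__(self, name="", dims=(), W0=None):
        self.name = name
        # len() checks avoid ambiguous elementwise comparisons when W0 is a NumPy array
        has_W0 = W0 is not None and len(W0) > 0
        assert len(dims) > 0 or has_W0, 'Must set dims or W0'
        if has_W0:
            self.W = np.asarray(W0, dtype=float)
        else:
            self.W = np.zeros(dims)
        self.k, self.d = self.W.shape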
    
    # set the parameter matrix to the given W
    def Set(self, W):
        """
        ACT 4
        """
        self.W = W
    
    # get the parameter matrix of the instance
    def Get(self):
        """
        ACT 5
        """
        return self.W
    
    # update the parameter matrix by adding given dW
    def Update(self, dW):
        """
        ACT 6
        """
        self.W += dW  # assuming matrix shapes are compatible
    
    # compute the norms of the rows of the parameter matrix
    def RowNorms(self):
        """
        ACT 7
        """
        return self.norm_rows(self.W)
    
    # compute the projection of each of the rows to the sphere of radius rad
    def Project(self, rad):
        """
        ACT 8
        """
        # Rows with norm <= rad keep their length (scale clipped to 1);
        # longer rows are shrunk to norm rad.
        scale = rad / self.RowNorms()
        scale[scale > 1.0] = 1.0
        self.W *= scale
    
    # compute the numerical predictions (z values) given a data matrix X
    def Predict(self, X):
        """
        ACT 9

        Returns
        -------
        Z : np.array(n x k)
            A matrix of predictions Z.
        """
        scores = np.einsum('ij,kj->ki', self.W, X)
        return scores
    
    # compute the label predictions given a data matrix X
    def PredictLabels(self, X):
        """
        ACT 10
        
        Returns
        -------
        labels : np.array(n x 1)
            Column vector with the argmax of each row of Z.
        """
        print('---predicting labels----')
        labels = np.argmax(self.Predict(X), axis=1)  # argmax over the k classes
        labels = labels.reshape((-1, 1))
        print(labels.shape)
        print('----')
        return labels
    
    # compute the classification error given a data matrix X, and label vector y
    def Error(self, X, y):
        """
        ACT 11
        
        Error is the sum of all missed predictions. For each one of them, +1.
        
        Returns
        -------
        e : scalar
            The cumulative error.
        """
        y = y.reshape((-1, 1))  # match the (n, 1) shape returned by PredictLabels
        labels = self.PredictLabels(X)
        e = (y != labels).astype(int)
        return np.sum(e)
        
    @staticmethod
    def norm_rows(A):
        """
        Returns
        -------
        norm : np.array(rows x 1)
            Euclidean norm of each row of A.
        """
        # keepdims=True keeps shape (rows, 1) so the result broadcasts against A
        return np.linalg.norm(A, axis=1, keepdims=True)

    def normalize_rows(self, A):
        """
        Normalizes the rows of a matrix.
        
        Returns
        -------
        B : np.array
            Normalized copy of A.
        """
        norm = self.norm_rows(A)
        return A / norm

    def Loss(self, X, y):
        pass
    
    def Gradient(self, X, y):
        pass

To implement the specific losses, you simply need to compute the loss value itself given $X \in \mathbb{R}^{n \times d}, \mathbf{y} \in [k]^n$ and the parameter matrix $W \in \mathbb{R}^{k \times d}$ as well as the gradient of that loss with respect to $W$. Let us first express $\ell^{\mathsf{CE}}$ and $\ell^{\mathsf{OvA}}$ in terms of $X, \mathbf{y}, W$.

$$\ell^{\mathsf{CE}}(X, \mathbf{y}, W) = \frac{1}{n} \sum_{i=1}^n \ell_{CE}(\mathbf{y}_i, \hat{\mathbf{y}}_i), \quad \ell^{\mathsf{OvA}}(X, \mathbf{y}, W) = \frac{1}{n} \sum_{i=1}^n \ell_{OvA}(\bar{\mathbf{y}}_i, \mathbf{z}_i) ~ .$$

where for each $i \in [n]$ the vectors $\mathbf{y}_i$ are the rows of $Y =$ `one_hot`$(\mathbf{y}) \in \mathbb{R}^{n \times k}$, $\bar{\mathbf{y}}_i$ are the rows of $\bar{Y} = (2Y - 1) \in \mathbb{R}^{n \times k}$, $\mathbf{z}_i$ are the rows of $\mathbf{Z} = X W^{\top} \in \mathbb{R}^{n \times k}$, and $\hat{\mathbf{y}}_i = \exp(\mathbf{z}_i)/Z_i$ with $Z_i$ making sure $\hat{\mathbf{y}}_i \in \Delta^k$. The gradient computation can be done step-by-step using the chain rule.

ACT 12: Compute the gradient of $\ell^{\mathsf{CE}}$ w.r.t. $W$, given by $\nabla_W \ell^{\mathsf{CE}}$.

Hint: First, show for a single $\mathbf{y}, \mathbf{\hat{y}}, \mathbf{z}$ that $\nabla_{\mathbf{z}} \ell_{CE}(\mathbf{y}, \mathbf{\hat{y}}) = \mathbf{\hat{y}} - \mathbf{y}$ (simplify the loss function explicitly in terms of $\mathbf{z}$ to make things easier). Afterwards, denoting $\hat{Y} \in \mathbb{R}^{n \times k}$ the matrix with rows $\mathbf{\hat{y}}_i$ for all $i \in [n]$, conclude that $\nabla_W \ell^{\mathsf{CE}}(X, \mathbf{y}, W) = \frac1n (\hat{Y} - Y)^{\top} X$.
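The single-example identity in the hint can be sanity-checked numerically before deriving it. The sketch below compares $\mathbf{\hat{y}} - \mathbf{y}$ with a central finite-difference approximation of $\nabla_{\mathbf{z}} \ell_{CE}$; the helper names, the test vectors, and the step size `h` are illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(y, z):
    """CE loss as a function of the scores z, with 0 * log(0) = 0."""
    y_hat = softmax(z)
    return sum(yi * math.log(yi / yh) for yi, yh in zip(y, y_hat) if yi > 0)

def numeric_grad(f, z, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

y = [0.9, 0.05, 0.05]        # a smoothed target on the simplex
z = [0.5, -1.0, 2.0]
analytic = [yh - yi for yh, yi in zip(softmax(z), y)]   # y_hat - y
numeric = numeric_grad(lambda v: ce_loss(y, v), z)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

A small `max_gap` supports the identity. It is not a proof, but the same check is a cheap way to catch sign or indexing mistakes in the matrix form of the gradient.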

ACT 13: Compute the gradient of $\ell^{\mathsf{OvA}}$ w.r.t. $W$, given by $\nabla_W \ell^{\mathsf{OvA}}$.

Hint: First, show for a single $\mathbf{\bar{y}}, \mathbf{z}$ that $\nabla_{\mathbf{z}} \ell_{OvA}(\mathbf{\bar{y}}, \mathbf{z}) = - \frac{\mathbf{\bar{y}}}{1 + \exp(\mathbf{\bar{y}} \mathbf{z})}$ with all the operations in this expression being elementwise. Afterwards, conclude that $\nabla_W \ell^{\mathsf{OvA}}(X, \mathbf{y}, W) = - \frac1n \left( \frac{\bar{Y}}{1 + \exp(\bar{Y} \mathbf{Z})} \right)^{\top} X$, again with operations such as multiplication, exponentiation, and division being elementwise.
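The elementwise OvA gradient with respect to $\mathbf{z}$ can be checked the same way, by comparing $-\mathbf{\bar{y}}/(1+\exp(\mathbf{\bar{y}}\,\mathbf{z}))$ against central finite differences of the loss. The function names and test values below are illustrative assumptions.

```python
import math

def ova_loss(y_bar, z):
    """sum_j log(1 + exp(-y_bar[j] * z[j])) for labels y_bar in {-1, +1}."""
    return sum(math.log1p(math.exp(-yb * zj)) for yb, zj in zip(y_bar, z))

def ova_grad(y_bar, z):
    """Elementwise gradient w.r.t. z: -y_bar / (1 + exp(y_bar * z))."""
    return [-yb / (1.0 + math.exp(yb * zj)) for yb, zj in zip(y_bar, z)]

def numeric_grad(f, z, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

y_bar = [1.0, -1.0, -1.0]    # true class first, the other two as negatives
z = [0.3, -1.2, 0.8]
analytic = ova_grad(y_bar, z)
numeric = numeric_grad(lambda v: ova_loss(y_bar, v), z)
max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

Each coordinate of the gradient has the opposite sign of $\mathbf{\bar{y}}[j]$, so a gradient step raises the score of the true class and lowers the others.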

In [111]:
class CrossEntropy(GenericLoss):

    # In CE, the one-hot vectors should be smoothed with the given eps parameter.
    def __init__(self, name="CrossEntropy", dims=[], W0=[], eps=1e-6):
        super().__init__(name, dims, W0)
        self.eps = eps
    
    def Loss(self, X, y):
        """
        ACT 14
        Compute the cross-entropy loss.
        
        Parameters
        ----------
        X : np.array(n x d)
            The data matrix.
        y : np.array(n x 1)
            The labels vector (0-indexed column vector).
            
        Returns
        -------
        loss : scalar
            The value of the cross-entropy loss.
        """
        print('----calculating CE loss----')
        Y = one_hot(y, k=self.k, eps=self.eps)  # n x k
        Y_hat = self.calculate_Y_hat(X)
        
        print('Y', Y.shape)
        print('Y_hat', Y_hat.shape)

        loss = Y * np.log(Y / Y_hat)
        loss_ce = np.sum(loss, axis=1)     
        loss_CE = np.average(loss_ce)
        print('----')
        return loss_CE
    
    def Gradient(self, X, y):
        """
        ACT 15
        
        Compute the gradient of the cross-entropy loss w.r.t to W.
        
        Parameters
        ----------
        X : np.array(n x d)
            The data matrix.
        y : np.array(n x 1)
            The labels vector (0-indexed column vector).
        
        Returns
        -------
        grad : np.array(k x d)
            The matrix gradient of the cross-entropy loss function.
        """
        print('----calculating CE gradient----')
        n, d = X.shape
        Y = one_hot(y, k=self.k, eps=self.eps)  # n x k
        
        Y_hat = self.calculate_Y_hat(X)
        print('Y_hat shape', Y_hat.shape)
        # Y_hat -= Y
        Y_temp = Y_hat - Y
        print('Y_temp shape', Y_temp.shape)
        
        
        # (1 / n) * (Y_hat - Y)^T X
        print('X.shape', X.shape)
    
        
        # grad = np.einsum('ij,ik->jk', Y_temp, X)  # k x d
        grad = np.matmul(np.transpose(Y_temp), X)
        
        print('X norm', np.linalg.norm(X, ord='fro'))
        print('Y norm', np.linalg.norm(Y, ord='fro'))
        print('Y_hat norm', np.linalg.norm(Y_hat, ord='fro'))
        print('Y_temp norm', np.linalg.norm(Y_temp, ord='fro'))
        print('Grad norm', np.linalg.norm(grad/n, ord='fro'))
        
        print('----')
        return  grad / n
        # return  grad

    def calculate_Y_hat(self, X):
        """
        Computes the Y_hat matrix.
        
        Returns
        ------
        Y_hat : np.array(n x k)
            The Y_hat matrix.
        """
        print('----calculating Y hat----')
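        Z = self.Predict(X)  # n x k
        Z = self.normalize_rows(Z)  # rescale every row of scores to unit norm
        
        print('Z', Z.shape)
        
        Y_hat = np.exp(Z)  # n x k
        # NOTE: the active line divides each column by its mean over the n examples,
        # so the rows of Y_hat do not sum to 1 (the logged 'max Y hat' exceeds 1).
        # A per-row softmax would divide by np.sum(Y_hat, axis=1, keepdims=True).
        # Y_hat = Y_hat / np.sum(Y_hat, axis=0)  # constrain to simplex  ### check this! before commenting it, tests did not pass
        Y_hat = Y_hat / np.average(Y_hat, axis=0)  # constrain to simplex  ### check this! before commenting it, tests did not pass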
        print('max Y hat', np.amax(Y_hat))
        print('Y_hat', Y_hat.shape)
        print('----')
        return Y_hat

    
class LogisticOvA(GenericLoss):
    def __init__(self, name="LogisticOvA", dims=[], W0=[]):
        super().__init__(name, dims, W0)

    def Loss(self, X, y):
        """
        ACT 16
        
        Computes the one-versus-all loss. Operations are element-wise.
        
        Parameters
        ----------
        X : np.array(n x d)
            The data matrix.
        y : np.array(n x 1)
            The labels vector (0-indexed column vector).
            
        Returns
        -------
        loss : scalar
            The value of the one-versus-all loss.
        """
        Y = one_hot(y, k=self.k)  # n x k
        Y_dash = self.calculate_Y_dash(Y)
        
        Z = self.Predict(X)  # n x k
        #  Z = self.normalize_rows(Z)  # TO-DO: Test with and without normalization

        power = -1 * Y_dash * Z
        loss = np.log(1 + np.exp(power))
        loss_ova = np.sum(loss, axis=1)
        loss_OVA = np.average(loss_ova)
        return loss_OVA
        
    def Gradient(self, X, y):
        """
        ACT 17
        
        Compute the gradient of the  one-versus-all loss w.r.t to W.
        
        Parameters
        ----------
        X : np.array(n x d)
            The data matrix.
        y : np.array(n x 1)
            The labels vector (0-indexed column vector).
        
        Returns
        -------
        grad : np.array(k x d)
            The matrix gradient of the one-versus-all loss function.
        """
        n, d = X.shape
        Y = one_hot(y, k=self.k)  # n x k
        Y_dash = self.calculate_Y_dash(Y)
        Z = self.Predict(X)  # n x k
        #  Z = self.normalize_rows(Z)  # TO-DO: Test with and without normalization
        
        A = 1 + np.exp(Y_dash * Z)
        B = Y_dash / A
        grad =  -1 * np.einsum('ij,ik->jk', B, X) / n
        return grad
    
    def calculate_Y_dash(self, Y):
        """
        Computes the Y_dash matrix.
        
        Returns
        ------
        Y_dash : np.array(n x k)
            The Y_dash matrix.
        """
        return 2 * Y - 1.0

After implementing the two loss classes, we will test them the following way:

(i) the loss of random labels should be larger than the loss of the assigned labels (the result of `PredictLabels`);

(ii) the gradient with random labels should have a bigger norm than that with the assigned labels;

(iii) the gradient norm should decrease after a single small gradient step.

In [114]:
# assert the 3 points above for the loss class given by loss_tested
# after implementing make sure both CrossEntropy, LogisticOvA pass all tests
def TestLoss(loss_tested):
    n, d, k, tests = 100, 10, 7, 1000
    test_loss = loss_tested(W0=[], dims=(k, d))
    
    # the number of tests is given by tests.
    # for each test, generate random Gaussian matrices X, W of appropriate sizes.
    for _ in range(tests):
        X, W = np.random.randn(n, d), np.random.randn(k, d)
        test_loss.Set(W)
    
        # assert that the loss value with the assigned labels y is smaller than that
        # with labels uniformly random from the interval [0, k-1].
        y, y_rand = test_loss.PredictLabels(X), np.random.randint(0, k, n).reshape((n, -1))
        ### ACT 18: loss value on X with assigned labels y
        loss1 = test_loss.Loss(X, y)
        ### ACT 19: loss value on X with random labels y_rand
        loss2 = test_loss.Loss(X, y_rand)
        assert loss1 < loss2, "Loss test failed (%f >= %f)" % (loss1, loss2)
    
        # assert that the gradient norm with the assigned labels is smaller than that
        # with labels uniformly random from the interval [0, k-1].
        grad1 = test_loss.Gradient(X, y)
        ### ACT 20: norm of the gradient with X and assigned labels y
        norm_grad1 = np.linalg.norm(grad1, ord='fro')
        # norm_grad1 = np.sqrt(np.sum(np.square(grad1)))
        ### ACT 21: norm of the gradient with X and random labels y_rand
        grad2 = test_loss.Gradient(X, y_rand)
        norm_grad2 = np.linalg.norm(grad2, ord='fro')
        # norm_grad2 = np.sqrt(np.sum(np.square(grad2)))
        assert norm_grad1 < norm_grad2, "Gradient norm test failed (%f >= %f)" % (norm_grad1, norm_grad2)
        
        # assert that after making a single gradient step (in the opposite direction)
        # the gradient norm decreases (choose a small step size).
        test_loss.Update(-0.01 * grad1)
        grad3 = test_loss.Gradient(X, y)
        ### ACT 22: norm of the gradient with X and y after making a single gradient step
        norm_grad3 = np.linalg.norm(grad3, ord='fro')
        assert norm_grad3 < norm_grad1, "Gradient step test failed (%f >= %f)" % (norm_grad3, norm_grad1)
    
    return True

LossTested = CrossEntropy
if TestLoss(LossTested): print('CE Test Passed')

LossTested = LogisticOvA
if TestLoss(LossTested): print('Logistic OvA Test Passed')

---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3834605286094876
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3834605286094876
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3834605286094876
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.156355062858605
Y norm 9.999991428572038
Y_hat norm 28.281532138721953
Y_temp norm 24.220147537490657
Grad norm 0.743281833286918
----
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2741615015942056
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----

----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2141159228003606
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2141159228003606
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.114302436687176
Y norm 9.999991428572038
Y_hat norm 28.210150011360472
Y_temp norm 24.034352825964227
Grad norm 0.9822227094240388
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2141159228003606
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.114302436687176
Y norm 9.999991428572038
Y_hat norm 28.210150011360472
Y_temp norm 26.420286731773793
Grad norm 1.3445561554674808
----

Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.919818309426145
Y norm 9.999991428572036
Y_hat norm 28.22016839420565
Y_temp norm 24.04559939039287
Grad norm 1.0916578557806842
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1916443566336636
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.919818309426145
Y norm 9.999991428572038
Y_hat norm 28.22016839420565
Y_temp norm 26.51743967343381
Grad norm 1.4271489093368852
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.192737630382406
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.919818309426145
Y norm 9.999991428572036
Y_hat norm 28.22053858830994
Y_temp norm 24.045398196878846
Grad norm 1.0907714131385395
----
---predicting labels----

---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2619434078076743
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2619434078076743
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2619434078076743
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.711092273555238
Y norm 9.999991428572038
Y_hat norm 28.282401449819652
Y_temp norm 24.14425876486479
Grad norm 0.6338299713932736
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 

max Y hat 2.3658170847402613
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3658170847402613
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.2259065812809
Y norm 9.999991428572038
Y_hat norm 28.226895525680238
Y_temp norm 23.99360964992095
Grad norm 1.1328978291434353
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3658170847402613
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.2259065812809
Y norm 9.999991428572038
Y_hat norm 28.226895525680238
Y_temp norm 26.176068834505195
Grad norm 1.3634745431845927
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y

---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.31182707658144
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.31182707658144
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.31182707658144
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.83284848003604
Y norm 9.999991428572038
Y_hat norm 28.26119180285599
Y_temp norm 23.935016559181697
Grad norm 0.772078155985095
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.311827

---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3490271311997044
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3490271311997044
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3490271311997044
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.127605152720395
Y norm 9.999991428572038
Y_hat norm 28.302595491801338
Y_temp norm 23.933409030636962
Grad norm 0.8458480869687165
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat

Z (100, 7)
max Y hat 2.380247344846541
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.918646861173865
Y norm 9.999991428572038
Y_hat norm 28.312679020814564
Y_temp norm 23.929712610796177
Grad norm 0.7880188585443336
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.354282676333484
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.354282676333484
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.354282676333484
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.970172568970106
Y 

Z (100, 7)
max Y hat 2.37281583254212
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.59885532672581
Y norm 9.999991428572038
Y_hat norm 28.32680240497613
Y_temp norm 24.169126283217803
Grad norm 0.7087381943557024
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.213261980595021
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.213261980595021
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.213261980595021
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.40870498212154
Y norm

Z (100, 7)
max Y hat 2.3438413191476015
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.17396813833485
Y norm 9.999991428572038
Y_hat norm 28.217124534315992
Y_temp norm 24.213529624021486
Grad norm 0.8823453155794152
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.378237307250267
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.378237307250267
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.378237307250267
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.50589940454597
Y n

Z (100, 7)
max Y hat 2.240263882999084
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.759787197354807
Y norm 9.999991428572038
Y_hat norm 28.222689698991466
Y_temp norm 24.094873499369363
Grad norm 0.6825498568091363
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.4484654554705894
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.4484654554705894
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.4484654554705894
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.381718880809093

Z (100, 7)
max Y hat 2.0081236093678263
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.783870466032116
Y norm 9.999991428572038
Y_hat norm 28.171533134592135
Y_temp norm 26.452789822290544
Grad norm 1.278734944619343
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.0082131654237236
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.783870466032116
Y norm 9.999991428572038
Y_hat norm 28.171731601593752
Y_temp norm 24.173870297329884
Grad norm 0.9222818394359936
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.183946869276398
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----

Z (100, 7)
max Y hat 2.233597987930893
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.455856274825383
Y norm 9.999991428572038
Y_hat norm 28.309799890206055
Y_temp norm 26.255692789049025
Grad norm 1.0143001334978592
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.233838122022897
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.455856274825383
Y norm 9.999991428572036
Y_hat norm 28.309870304060468
Y_temp norm 24.096151156467844
Grad norm 0.6920413405470359
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.5216981666266403
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----

Z (100, 7)
max Y hat 2.33468727255511
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.7394374007055
Y norm 9.999991428572038
Y_hat norm 28.337024627220778
Y_temp norm 24.13659016133996
Grad norm 0.85979715942208
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3212789440790442
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3212789440790442
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3212789440790442
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.41197401490979
Y norm

max Y hat 2.2880312050299545
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2880312050299545
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2880312050299545
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.13426350224963
Y norm 9.999991428572038
Y_hat norm 28.202275694871897
Y_temp norm 24.132362120874685
Grad norm 1.1634750639690599
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2880312050299545
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.13426350224963
Y norm 9.999991428572038
Y_hat norm 28.2022756

Z (100, 7)
max Y hat 2.264893975154738
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.264893975154738
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.018060683004535
Y norm 9.999991428572038
Y_hat norm 28.35722211891866
Y_temp norm 24.216062148660072
Grad norm 0.7729109012352584
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.264893975154738
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.018060683004535
Y norm 9.999991428572038
Y_hat norm 28.35722211891866
Y_temp norm 26.550485419229652
Grad norm 1.151103898484019
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)

max Y hat 2.1399382434296026
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1399382434296026
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1399382434296026
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.476483315383067
Y norm 9.999991428572038
Y_hat norm 28.25193152747567
Y_temp norm 24.05668680424815
Grad norm 0.7642197918220052
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1399382434296026
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.476483315383067
Y norm 9.999991428572038
Y_hat norm 28.2519315

Z (100, 7)
max Y hat 2.164765380353709
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.672332456688295
Y norm 9.999991428572038
Y_hat norm 28.252778315139818
Y_temp norm 26.484580987395525
Grad norm 1.2977180839283688
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1649544506968827
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.672332456688295
Y norm 9.999991428572038
Y_hat norm 28.252940182007855
Y_temp norm 24.22400599005793
Grad norm 0.9616100113839282
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.238215838987832
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----

smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.235491074781554
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.14956809616001
Y norm 9.999991428572036
Y_hat norm 28.22729765498659
Y_temp norm 23.9805630835329
Grad norm 1.078484743004583
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.235491074781554
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.14956809616001
Y norm 9.999991428572038
Y_hat norm 28.22729765498659
Y_temp norm 26.545571470972
Grad norm 1.4118392698579243
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2359509654272878
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.14956809616001
Y norm 9.999991428572036
Y_hat

n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2155790490546248
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.567364215461406
Y norm 9.999991428572038
Y_hat norm 28.2590654704864
Y_temp norm 23.957257724484833
Grad norm 0.8305602149181466
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1419963613076964
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1419963613076964
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1419963613076964
Y_hat (100, 7)
----
Y_hat shape (100, 7)

Z (100, 7)
max Y hat 2.2315348577901615
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.23185193594341
Y norm 9.999991428572038
Y_hat norm 28.2077243612186
Y_temp norm 23.993756077520143
Grad norm 0.7508864987561016
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2534514609011085
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2534514609011085
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2534514609011085
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.25883810296961
Y 

----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.5911842249476105
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.26806657304083
Y norm 9.999991428572038
Y_hat norm 28.370815439974994
Y_temp norm 24.08284177801069
Grad norm 0.9275349889824832
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.5911842249476105
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.26806657304083
Y norm 9.999991428572038
Y_hat norm 28.370815439974994
Y_temp norm 26.665118063538625
Grad norm 1.2397010273100717
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.591690745310077
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)

smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.1368125495343544
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 33.1083838794362
Y norm 9.999991428572038
Y_hat norm 28.200992402089852
Y_temp norm 26.338570186305738
Grad norm 1.494154736891847
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.137462802350126
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 33.1083838794362
Y norm 9.999991428572036
Y_hat norm 28.201457060494313
Y_temp norm 24.13743020523364
Grad norm 1.2454779335250168
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.271038545520202
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----

Z (100, 7)
max Y hat 2.247967724784851
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.799260265553066
Y norm 9.999991428572038
Y_hat norm 28.240531766967408
Y_temp norm 26.43023821759153
Grad norm 1.1018171083105273
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.248074002546669
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.799260265553066
Y norm 9.999991428572038
Y_hat norm 28.24061002010732
Y_temp norm 24.14979189365604
Grad norm 0.6959701107903948
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3061242205630066
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----


---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.157002993075363
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.157002993075363
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.157002993075363
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.346362874371245
Y norm 9.999991428572038
Y_hat norm 28.210275526867818
Y_temp norm 24.083358908459733
Grad norm 0.8307307194724607
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.

max Y hat 2.352189665240886
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.202363718161305
Y norm 9.999991428572038
Y_hat norm 28.297868262903254
Y_temp norm 24.08181688756733
Grad norm 0.6476037601360086
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3358341815597203
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3358341815597203
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3358341815597203
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.23663709401973
Y norm 9.999

smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2293269311781057
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.344866143260674
Y norm 9.999991428572038
Y_hat norm 28.216963797422597
Y_temp norm 26.183788043305384
Grad norm 1.217268111167702
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.229629528077421
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 30.344866143260674
Y norm 9.999991428572038
Y_hat norm 28.217229251024552
Y_temp norm 24.330762480789904
Grad norm 1.0139527952549878
----
---predicting labels----
(100, 1)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.3758815280890913
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----

Z (100, 7)
max Y hat 2.2539992806875917
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2539992806875917
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.26697869969174
Y norm 9.999991428572038
Y_hat norm 28.215797765035536
Y_temp norm 23.997150139041473
Grad norm 1.175562166585825
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.2539992806875917
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 32.26697869969174
Y norm 9.999991428572036
Y_hat norm 28.215797765035536
Y_temp norm 26.561115585208647
Grad norm 1.4867562335751565
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)

Z (100, 7)
max Y hat 2.235964014549963
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE loss----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.235964014549963
Y_hat (100, 7)
----
Y (100, 7)
Y_hat (100, 7)
----
----calculating CE gradient----
-----in one hot ----
n, k for Y (100, 7)
smoothing one hot...
----
----calculating Y hat----
Z (100, 7)
max Y hat 2.235964014549963
Y_hat (100, 7)
----
Y_hat shape (100, 7)
Y_temp shape (100, 7)
X.shape (100, 10)
X norm 31.572654514775177
Y norm 9.999991428572038
Y_hat norm 28.254102202182025
Y_temp norm 24.11611786483245
Grad norm 0.735190347685328
----

Below are methods from previous assignments for data processing and training. No need to reimplement these.

In [115]:
# Sample a mini-batch w/ or w/o replacement
from numpy.random import randint
from numpy.random import permutation
class IndexSampler:
    def __init__(self, d):
        self.d = d
        self.prm = None
    
    def sample_new_index(self, replace=0):
        if replace:
            return randint(self.d)
        if self.prm is None:
            self.prm = permutation(self.d)
            self.head = 0
        ind = self.prm[self.head]
        self.head += 1
        if self.head == self.d:
            self.head = 0
            self.prm = None
        return ind
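
With `replace=0`, the sampler above walks through a fresh random permutation of `range(d)` and reshuffles once every index has been used, so each index is visited exactly once per pass. The sketch below reproduces that behavior as a standalone generator. The names `epoch_indices`, `passes`, and `seed` are hypothetical and are not part of the assignment code.

```python
import numpy as np

def epoch_indices(d, passes, seed=0):
    """Yield indices in shuffled order, reshuffling after each full pass over range(d)."""
    rng = np.random.default_rng(seed)  # hypothetical seeding, for reproducibility only
    for _ in range(passes):
        for ind in rng.permutation(d):
            yield int(ind)

draws = list(epoch_indices(d=5, passes=2))
first_pass, second_pass = draws[:5], draws[5:]
# Each pass contains every index exactly once, in a (generally) different order.
```

In `SGD`, `d` is the number of mini-batches, so sampling without replacement means every mini-batch is used once per epoch.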

In [116]:
# Create a vector of learning-rate values. Mode can be: 'fixed_t', 'linear_t', 'sqrt_t'
# Internal base_t parameter can/should be changed during experiments.
def learning_rate_schedule(eta0, epochs, mode):
    base_t = 10.0
    if mode == 'fixed_t':
        return eta0 * np.ones(epochs)
    if mode == 'sqrt_t':
        return eta0 * np.ones(epochs) / (base_t + np.sqrt(np.arange(epochs)))
    if mode == 'linear_t':
        return eta0 * np.ones(epochs) / (base_t + np.arange(epochs))
    print('invalid mode for learning rate schedule: %s' % mode)
    return []
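
To get a feel for the three modes, the sketch below evaluates the same formulas for a handful of epochs. It restates the expressions from `learning_rate_schedule` rather than calling it, so the snippet runs on its own. The values of `eta0` and `epochs` are chosen only for illustration.

```python
import numpy as np

eta0, epochs, base_t = 1.0, 5, 10.0  # illustrative values; base_t matches the function above
t = np.arange(epochs)

fixed = eta0 * np.ones(epochs)         # 'fixed_t': constant rate
sqrt_t = eta0 / (base_t + np.sqrt(t))  # 'sqrt_t': decays like 1/sqrt(t)
linear_t = eta0 / (base_t + t)         # 'linear_t': decays like 1/t

print(np.round(fixed, 4))
print(np.round(sqrt_t, 4))
print(np.round(linear_t, 4))
```

Note that because of the `base_t` offset, both decaying modes start at `eta0 / base_t` rather than at `eta0`.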

In [117]:
# SGD with general loss class. h is the handle defined above.
def SGD(X, y, Loss, params):
    h = params
    pstr, rad, replace = h['pstr'], h['rad'], h['replace']
    eta0, epochs, bs, lrmode = h['eta0'], h['epochs'], h['batch_size'], h['lr_mode']
    n, d = X.shape
    nbs = int(n / bs)
    k = max(y) + 1
    ls = Loss(W0=[], dims=(k, d))
    eta_t = learning_rate_schedule(eta0, epochs, lrmode)
    losses = [ls.Loss(X, y)]
    errors = [ls.Error(X, y)]
    sampler = IndexSampler(nbs)
    for e in range(1, epochs * nbs):
        head = sampler.sample_new_index(replace) * bs
        Xt, yt = X[head:head + bs], y[head:head + bs]
        gw = ls.Gradient(Xt, yt)
        ls.Update(-eta_t[e // nbs] * gw)
        if rad > 0: ls.Project(rad)
        if e % nbs == 0:
            losses.append(ls.Loss(X, y))
            errors.append(ls.Error(X, y))
        if (e % (nbs * 10)) == 0:
            print(pstr.format(e // nbs, losses[-1], errors[-1]))
    return ls, losses, errors

In [118]:
# import matplotlib and get the mnist dataset from tensorflow.keras
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist as keras_mnist
(X_train, y_train), (X_test, y_test) = keras_mnist.load_data()

In [119]:
# normalize each example (row) by subtracting its mean and dividing by its std;
# if bias != 0, append a constant column with that value
def normalize(X, bias=0):
    n, d = X.shape
    m = np.mean(X, axis=1).reshape(n, 1) * np.ones((1, d))
    s = np.std(X, axis=1).reshape(n, 1) * np.ones((1, d))
    Xn = (X - m) / s
    if bias != 0:
        Xn = np.hstack((Xn, bias * np.ones((n, 1))))
    return Xn

# flatten the images into d-dimensional vectors for training
def flatten_images(X):
    s = X.shape
    n = s[0]
    d = np.prod(s[1:])
    return X.reshape(n, d)

Xtr = normalize(flatten_images(X_train), bias = 1)
Xte = normalize(flatten_images(X_test), bias = 1)
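
As a sanity check on the preprocessing, the sketch below applies the same steps (flatten, per-row standardization, bias column) to a few random toy "images". It is a compact restatement using `keepdims`, not the assignment's functions, and the array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 3, 3)).astype(float)  # 4 toy images of 3x3 pixels

X = images.reshape(images.shape[0], -1)  # flatten to shape (4, 9)
Xn = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
Xn = np.hstack((Xn, np.ones((Xn.shape[0], 1))))  # append the constant bias column

print(Xn.shape)
```

After this step each row has zero mean and unit standard deviation over its pixel features, and the final column is the bias term.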

In [None]:
### ACT 23
# run SGD with CrossEntropy loss on the MNIST training data
# batch size should be 1000, sampling without replacement
# number of epochs should be 500, sphere radius for W is 10.0
# learning rate mode is sqrt_t with eta0=1.0
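
The settings listed above map onto the keys that `SGD` reads from its `params` dictionary. A possible way to write them down is sketched here. The progress-string format in `pstr` is an assumption: `SGD` only requires that it accept three positional fields (epoch, loss, error).

```python
params = {
    'pstr': 'Epoch: {0:3d}  Loss: {1:8.4f}  Error: {2:6.4f}',  # hypothetical format string
    'rad': 10.0,          # sphere radius used by the projection step
    'replace': 0,         # 0 means sampling without replacement
    'eta0': 1.0,          # initial learning rate
    'epochs': 500,
    'batch_size': 1000,
    'lr_mode': 'sqrt_t',
}

# Quick check that the format string accepts (epoch, loss, error).
line = params['pstr'].format(10, 0.5321, 0.1234)
print(line)
```

With the 60,000 MNIST training images and a batch size of 1000, `SGD` computes `nbs = 60` mini-batches per epoch.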

In [None]:
### ACT 24
# compute test error and loss

In [None]:
### ACT 25
# plot loss vs epochs, separately plot error vs epochs

In [None]:
### ACT 26
# using the test data, construct a 10x10 confusion matrix C
# C[i][j] is the percentage of examples of digit i that were classified as j
# assert that each digit is most likely to be classified correctly
assert np.array_equal(np.argmax(C, axis=1), np.arange(k))
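
One way to build a row-normalized confusion matrix with NumPy is sketched below on toy labels. The helper name `confusion_matrix_pct` is hypothetical, and the sketch assumes both label arrays are 1-D integer arrays (a prediction array of shape `(n, 1)` would need `.ravel()` first).

```python
import numpy as np

def confusion_matrix_pct(y_true, y_pred, k):
    """Return a k x k matrix whose entry (i, j) is the percentage of class-i examples predicted as j."""
    C = np.zeros((k, k))
    np.add.at(C, (y_true, y_pred), 1.0)  # accumulate one count per (true, predicted) pair
    row_totals = C.sum(axis=1, keepdims=True)
    return 100.0 * C / np.maximum(row_totals, 1.0)  # guard against classes with no examples

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])
C = confusion_matrix_pct(y_true, y_pred, k=3)
print(np.round(C, 1))
```

`np.add.at` is used instead of `C[y_true, y_pred] += 1` because the latter counts repeated index pairs only once.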

In [None]:
### ACT 27
# perform ACT 23-25 with LogisticOvA instead of CrossEntropy
# compare your results, conclude in 1 sentence

In [None]:
# data processing for 20 newsgroups datasets already done for you.
from sklearn.datasets import fetch_20newsgroups
from collections import Counter

def get_20newsgroups_data():
    newsgroups_train = fetch_20newsgroups(subset='train')
    newsgroups_test  = fetch_20newsgroups(subset='test')
    return newsgroups_train, newsgroups_test

def construct_vocabulary(data, vs):
    vocab = Counter()
    for text in data:
        for word in text.split(' '):
            vocab[word.lower()] += 1
    word2index = dict(vocab.most_common(vs))
    i = 0
    for k in word2index.keys():
        word2index[k] = i
        i += 1
    return word2index

def text_to_vec(data, vocab):
    def norm_rows(M):
        return np.sqrt(np.sum(M * M, axis=1, keepdims=True))
    def project_rows(M, r):
        return M * np.minimum(r / norm_rows(M), 1.0)
    n = len(data)
    d = len(vocab)
    X = np.zeros((n, d))
    i = 0
    for text in data:
        for word in text.split(' '):
            if word.lower() in vocab:
                X[i, vocab[word.lower()]] += 1.0
        i += 1
    # Convert to log-frequencies and project each row onto the unit ball (||X[i,*]|| <= 1)
    X = project_rows(np.log(X + 1.0), 1.0)
    return X
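
The last step of `text_to_vec` rescales only rows whose norm exceeds the radius; shorter rows are left as they are. The sketch below repeats the two inner helpers on a toy count matrix to make that visible. The counts are invented for illustration.

```python
import numpy as np

def norm_rows(M):
    return np.sqrt(np.sum(M * M, axis=1, keepdims=True))

def project_rows(M, r):
    # Scale down rows with norm > r; rows with norm <= r are multiplied by 1.0.
    return M * np.minimum(r / norm_rows(M), 1.0)

counts = np.array([[3.0, 0.0, 1.0],   # log-counts have norm > 1, so this row is rescaled
                   [1.0, 0.0, 0.0]])  # log-counts have norm log(2) < 1, so this row is unchanged
X = project_rows(np.log(counts + 1.0), 1.0)
norms = norm_rows(X).ravel()
print(norms)
```

A document containing none of the vocabulary words gives an all-zero row, for which `r / norm_rows(M)` divides by zero; NumPy emits a runtime warning but the row stays zero.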

In [None]:
vsize = 1000
train, test = get_20newsgroups_data()
vocab = construct_vocabulary(train.data, vsize)
Xtr = text_to_vec(train.data, vocab)
Xte = text_to_vec(test.data, vocab)
y_train = train.target
y_test = test.target

In [None]:
# you will perform the same experiments with this dataset
### ACT 28
# perform ACT 23-27 for the 20newsgroups dataset
# all parameters for training should be the same except:
# we will start with a much larger learning rate eta=1000.0
# and a sphere radius 40.0

In [None]:
### ACT 29
# after training, testing, and plotting make sure to compute the
# row norms of the final parameters (mean and std), and conclude 
# in one sentence whether the results were as you expected
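
Computing the row norms is a one-liner once the weight matrix is available as a `(k, d)` array. The sketch below uses a small stand-in matrix `W`; how the trained parameters are exposed by the loss object depends on your own implementation and is not assumed here.

```python
import numpy as np

# Stand-in for the trained (k, d) weight matrix; replace with your model's parameters.
W = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

row_norms = np.linalg.norm(W, axis=1)  # Euclidean norm of each class's weight vector
mean_norm, std_norm = row_norms.mean(), row_norms.std()
print(row_norms, mean_norm, std_norm)
```

Comparing `mean_norm` against the sphere radius used during training is one way to see whether the projection step was active.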