**<h2>Chapter 9 - Intro to Activation Functions - Modeling Probabilities</h2>**

**<em>Q: Is the new code better than the old one because it uses the three new functions?</em>**

A: Whether the new code outperforms the old one depends on which performance metrics you care about (accuracy, speed, generalization, and so on). Here's a breakdown:

### Advantages of Using the New Functions:

1. **Softmax**: It's often more appropriate for a classification task, especially when the classes are mutually exclusive. It provides probabilities for each class, making the model's outputs more interpretable.
  
2. **Tanh**: The tanh activation function is zero-centered: negative inputs map to negative outputs, positive inputs map to positive outputs, and an input of zero maps to exactly zero. This can make the optimization process easier and potentially quicker.

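3. **tanh2deriv**: Because tanh is used as the hidden-layer activation, backpropagation needs its derivative. This function computes that derivative directly from the tanh output, which the forward pass has already produced, so the backward pass stays cheap. (The notebook cells below define the same function under the name `tan2deriv`.)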
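To make the zero-centered point concrete, here is a small sketch comparing tanh with the logistic sigmoid on symmetric inputs. The `sigmoid` helper and the sample inputs are illustrative assumptions; they are not part of the chapter's code.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, included only for comparison (hypothetical helper).
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)   # symmetric inputs around zero

tanh_out = np.tanh(x)
sigmoid_out = sigmoid(x)

# tanh outputs average out around 0, sigmoid outputs around 0.5.
print("tanh mean:   ", tanh_out.mean())
print("sigmoid mean:", sigmoid_out.mean())
```

On symmetric inputs the tanh outputs cancel out around zero, while the sigmoid outputs are centered on 0.5.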

### Potential Disadvantages:

1. **Softmax**: Computationally more expensive than sigmoid or no activation at all.

2. **Tanh and tanh2deriv**: The tanh function and its derivative could potentially be computationally more expensive than simpler functions like the ReLU (Rectified Linear Unit) and its derivative, depending on the implementation and the hardware.

3. **Learning Rate (`alpha = 2`)**: The high learning rate might cause the model to overshoot the optimal weights and fail to converge, or to generalize poorly.

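4. **Dropout Implementation**: The new code applies 50% dropout to the hidden layer and compensates during training by multiplying the surviving activations by 2 (`layer_1 *= dropout_mask * 2`). One caveat: the derivative is then computed from the already-scaled `layer_1`, so for the kept units it evaluates `1 - (2 * tanh) ** 2` rather than the true tanh derivative, which could potentially affect training dynamics.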
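A minimal sketch of why dropout activations are typically scaled during training. It assumes a keep probability of 0.5 and uses stand-in activations of all ones; the array sizes and the `rng` seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

layer_1 = np.ones((1000, 100))                          # stand-in activations, all equal to 1
dropout_mask = rng.integers(0, 2, size=layer_1.shape)   # 0 or 1, each with probability 0.5

dropped = layer_1 * dropout_mask        # no compensation: the mean falls to about 0.5
scaled = layer_1 * dropout_mask * 2     # scaled by 1 / keep_prob = 2: the mean stays near 1

print(dropped.mean(), scaled.mean())
```

Scaling by 2 keeps the expected input to the next layer the same with and without dropout, so the weights do not need to be rescaled at test time.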

### Conclusion:

Neither version is inherently "better"; they are just suited for different requirements. If you're interested in model interpretability and potentially better training dynamics, the new functions might offer advantages. However, you'll have to test both versions rigorously to conclude which is better empirically for your specific use-case.

### Three New Functions:

#### 1. tanh(x)
This function returns the hyperbolic tangent of the input `x`. It's an activation function that squashes its output into the range between -1 and 1.

```python
def tanh(x):
    return np.tanh(x)
```
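As a quick illustration of the squashing behavior, the sketch below evaluates `tanh` on a few hand-picked inputs (the input values are arbitrary assumptions). Large magnitudes saturate at the ends of the range.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

x = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
y = tanh(x)

# Every output lies in [-1, 1]; zero maps to zero.
print(y)
```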

#### 2. tanh2deriv(output)
This function returns the derivative of the tanh function given its output. It's used for backpropagation to adjust the weights.

```python
def tanh2deriv(output):
    return 1 - (output ** 2)
```
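One way to sanity-check the derivative is to compare it against a central finite difference. This is a sketch under assumed sample points and step size `eps`; note that `tanh2deriv` expects the tanh output, not the raw input.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh2deriv(output):
    # Takes the tanh *output*, not the raw input x.
    return 1 - (output ** 2)

x = np.linspace(-2, 2, 9)
eps = 1e-6

numeric = (tanh(x + eps) - tanh(x - eps)) / (2 * eps)   # central difference
analytic = tanh2deriv(tanh(x))

print(np.max(np.abs(numeric - analytic)))
```

The two agree up to floating-point error, which is consistent with the identity d/dx tanh(x) = 1 - tanh(x)^2.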

#### 3. softmax(x)
This function applies the softmax function to the input `x`. It turns raw scores (logits) into probabilities. The softmax function is often used for the output layer of a classifier.

```python
def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)
```

The softmax function is a commonly used activation function in neural networks, especially in the output layer for multi-class classification problems. The softmax function takes an \(N\)-dimensional vector of real numbers as input and normalizes it into a probability distribution, that is, a vector of positive numbers between 0 and 1 that add up to 1.

For a vector \( \vec{x} = [x_1, x_2, ..., x_N] \), the \(i\)-th component of \( \text{softmax}(\vec{x}) \) is defined as:

\begin{equation*}
\text{softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}
\end{equation*}

Here's how the code implements this:

1. **Exponential Calculation**: `np.exp(x)` computes the exponential of each element \( x_i \) in the input vector \( \vec{x} \). This makes every value strictly positive. Exponentiating also accentuates the differences between the elements: the larger an element \( x_i \) is, the larger \( e^{x_i} \) will be, especially compared to other, smaller elements.

    ```python
    temp = np.exp(x)
    ```

2. **Normalization**: `np.sum(temp, axis=1, keepdims=True)` computes the sum of these exponentials along the specified axis (here axis 1, which yields one sum per row of a 2D array). This serves as the denominator of the softmax function, ensuring that each resulting probability distribution adds up to 1.

    ```python
    return temp / np.sum(temp, axis=1, keepdims=True)
    ```

The `axis=1, keepdims=True` arguments ensure that the division is performed correctly. Setting `axis=1` sums across the columns within each row, giving one total of exponentiated scores per sample in the batch. Setting `keepdims=True` keeps the summed axis as a dimension of size 1, so the result has shape `(batch, 1)` rather than `(batch,)` and broadcasts cleanly against `temp` in the division.
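One practical caveat with this implementation is that `np.exp` overflows for large inputs. A common variant, sketched below, subtracts each row's maximum before exponentiating; this leaves the result mathematically unchanged because the shift cancels between numerator and denominator. The name `stable_softmax` and the sample inputs are assumptions for illustration, not part of the chapter's code.

```python
import numpy as np

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

def stable_softmax(x):
    # Shift each row so its largest entry is 0 before exponentiating.
    shifted = x - np.max(x, axis=1, keepdims=True)
    temp = np.exp(shifted)
    return temp / np.sum(temp, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

# The naive version overflows to inf / inf = nan on large inputs.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax(large)

print(np.isnan(naive_large).all())
print(stable_softmax(large))
```

For moderately sized inputs such as `small`, both versions return the same probabilities.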

Let's break down the three pieces involved (`temp`, `axis=1`, `keepdims=True`) and how each one affects the function `softmax()`.

### 1. `temp`

The variable `temp` stores the exponentials of the input \(x\). Suppose \(x\) is a 2D array representing two examples with scores for three classes, i.e., \(x = [[1, 2, 3], [1, 1, 1]]\).

Computing `temp = np.exp(x)` would give:

\begin{equation*}
\text{temp} = \left[ \left[e^1, e^2, e^3\right], \left[e^1, e^1, e^1\right] \right] = \left[ [2.72, 7.39, 20.09], [2.72, 2.72, 2.72] \right]
\end{equation*}

Each element in the array is exponentiated, which makes every value positive and emphasizes the differences between them.

### 2. `axis=1`

The `axis=1` parameter specifies which axis to sum along. For a 2D array, `axis=0` sums down each column (one total per column), and `axis=1` sums across each row (one total per row).

Continuing with our `temp` example:

- Summing along `axis=1`: `np.sum(temp, axis=1)` would produce `[30.19, 8.15]`.

This is essential for the softmax function because we want to normalize the scores for each example (each row) so that they form a probability distribution.
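The difference between the two axes can be checked directly on the running example. This is a small sketch; the variable names `col_sums` and `row_sums` are chosen here for illustration.

```python
import numpy as np

temp = np.exp(np.array([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]))

col_sums = np.sum(temp, axis=0)   # one total per column (class), shape (3,)
row_sums = np.sum(temp, axis=1)   # one total per row (example), shape (2,)

print(col_sums.shape, row_sums.shape)
print(np.round(row_sums, 2))
```

Softmax needs `row_sums`, because each example's scores must be normalized independently of the other examples in the batch.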

### 3. `keepdims=True`

Setting `keepdims=True` keeps the summed axis in the result as a dimension of size 1, so the output has the same number of dimensions as the input.

If you perform `np.sum(temp, axis=1, keepdims=True)`, it returns a 2D array `[[30.19], [8.15]]` of shape `(2, 1)` instead of a 1D array `[30.19, 8.15]` of shape `(2,)`.

This matters for NumPy's broadcasting rules when we divide `temp` by this sum. Without `keepdims=True`, dividing a `(2, 3)` array by a `(2,)` array raises a broadcasting error, and for a square input the division would silently run along the wrong axis and give incorrect results.
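The sketch below shows both failure modes when the dimension is dropped. The square array `square` is a made-up example added only to demonstrate the silent case.

```python
import numpy as np

temp = np.exp(np.array([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]))

kept = np.sum(temp, axis=1, keepdims=True)   # shape (2, 1)
flat = np.sum(temp, axis=1)                  # shape (2,)

probs = temp / kept                          # (2, 3) / (2, 1): divides row by row

try:
    temp / flat                              # (2, 3) / (2,): shapes do not broadcast
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# With a square input the shapes happen to line up, so NumPy raises no error,
# but column j gets divided by the sum of row j.
square = np.exp(np.array([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [0.0, 0.0, 5.0]]))
wrong = square / np.sum(square, axis=1)

print(kept.shape, flat.shape, broadcast_failed)
print(wrong.sum(axis=1))                     # rows no longer sum to 1
```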

### Putting It All Together

Finally, in `softmax()`:

\begin{equation*}
\text{result} = \frac{\text{temp}}{\text{np.sum(temp, axis=1, keepdims=True)}} = \frac{\left[ [2.72, 7.39, 20.09], [2.72, 2.72, 2.72] \right]}{\left[ [30.19], [8.15] \right]} = \left[ [0.09, 0.24, 0.67], [0.33, 0.33, 0.33] \right]
\end{equation*}

Each row now sums to 1 and represents a probability distribution across the classes. The function has worked as intended!

By implementing softmax this way, each row of the 2D input \( x \) is transformed into a probability distribution over the classes (columns).
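The worked example can be reproduced in a few lines. The sketch also checks one property worth knowing: softmax preserves the ordering of the scores within a row, so taking `argmax` of the raw scores gives the same class as taking `argmax` of the probabilities.

```python
import numpy as np

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])
result = softmax(x)

print(np.round(result, 2))        # probabilities rounded to two decimals
print(result.sum(axis=1))         # each row sums to 1
print(np.argmax(x, axis=1), np.argmax(result, axis=1))
```

This ordering property is why an evaluation loop that only needs the predicted class can skip the softmax step and apply `argmax` to the raw output scores.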

**<em>Q: I mean the 'T' in `True`: why does it need to be capitalized? In C++ we write `true` in all lowercase.</em>**

Ah, I see what you're asking. In Python, `True` and `False` are capitalized simply because that is the spelling the language designers chose; in Python 3 they are reserved keywords. Python is case-sensitive, so `True` and `true` are different: the latter is treated as an ordinary variable name and raises a `NameError` if it is not defined.

C++ and Python are different languages with different design philosophies and conventions. In C++, `true` and `false` are keywords and are all lowercase. However, Python chose to make its Boolean constants (`True` and `False`) start with a capital letter.

There's no deep technical reason for this; it's simply a convention that you'll need to remember when working with Python. Different languages have different rules for syntax and naming, and this is one of those cases.
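A short sketch that demonstrates the difference using the standard-library `keyword` module. The `eval` call is used only so that the failing lookup can be caught and shown.

```python
import keyword

# In Python 3, True is a reserved keyword; lowercase true is just a name.
print(keyword.iskeyword("True"))
print(keyword.iskeyword("true"))

try:
    eval("true")          # looks up an undefined variable called `true`
    raised = False
except NameError:
    raised = True

print(raised)
```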


In [5]:
```python
import numpy as np, sys
np.random.seed(1)

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Use the first 1000 training images, flattened to 784 pixels and scaled to [0, 1].
images, labels = (x_train[0:1000].reshape(1000,28*28) / 255, y_train[0:1000])

# One-hot encode the training labels.
one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

# Prepare the test set the same way.
test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

def tanh(x):
    return np.tanh(x)

def tan2deriv(output):
    # Derivative of tanh, expressed in terms of the tanh output.
    return 1 - (output ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

alpha, iterations, hidden_size = (2, 300, 100)
pixels_per_image, num_labels = (784, 10)
batch_size = 100

weights_0_1 = 0.02*np.random.random((pixels_per_image,hidden_size))-0.01
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    correct_cnt = 0
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end=((i * batch_size),((i+1)*batch_size))
        layer_0 = images[batch_start:batch_end]
        layer_1 = tanh(layer_0.dot(weights_0_1))
        # 50% dropout; surviving activations are doubled to compensate.
        dropout_mask = np.random.randint(2,size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(layer_1.dot(weights_1_2))

        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))

        # Backpropagate: output delta first, then the hidden-layer delta.
        layer_2_delta = (labels[batch_start:batch_end]-layer_2) / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * tan2deriv(layer_1)
        layer_1_delta *= dropout_mask

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    test_correct_cnt = 0
    test_total_error = 0

    # Evaluate without dropout; argmax of the raw scores matches argmax of softmax.
    for i in range(len(test_images)):

        layer_0 = test_images[i:i+1]
        layer_1 = tanh(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        test_total_error += np.sum(np.abs(np.argmax(layer_2) - np.argmax(test_labels[i:i+1])))
        test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))

    # Train-Err is based on the delta of the last training batch only.
    total_error = np.sum(np.abs(layer_2_delta))
    avg_error = total_error / float(len(images))

    test_avg_error = test_total_error / float(len(test_images))

    if(j % 10 == 0):
        sys.stdout.write("\n"+ \
         "I:" + str(j) + \
         " Test-Acc:"+str(test_correct_cnt/float(len(test_images)))+\
         " Train-Acc:" + str(correct_cnt/float(len(images)))+\
         " Test-Err:" + str(test_avg_error) +\
         " Train-Err:" + str(avg_error))
```

```
I:0 Test-Acc:0.394 Train-Acc:0.156 Test-Err:2.0623 Train-Err:1.7959499953429915e-05
I:10 Test-Acc:0.6867 Train-Acc:0.723 Test-Err:1.1351 Train-Err:1.6988449187755914e-05
I:20 Test-Acc:0.7025 Train-Acc:0.732 Test-Err:1.0619 Train-Err:1.521588821935578e-05
I:30 Test-Acc:0.734 Train-Acc:0.763 Test-Err:0.9422 Train-Err:1.3249866834745405e-05
I:40 Test-Acc:0.7663 Train-Acc:0.794 Test-Err:0.8285 Train-Err:1.1221782527969216e-05
I:50 Test-Acc:0.7913 Train-Acc:0.819 Test-Err:0.7367 Train-Err:1.0413538511490937e-05
I:60 Test-Acc:0.8102 Train-Acc:0.849 Test-Err:0.666 Train-Err:9.225930193461022e-06
I:70 Test-Acc:0.8228 Train-Acc:0.864 Test-Err:0.6202 Train-Err:8.408800041933423e-06
I:80 Test-Acc:0.831 Train-Acc:0.867 Test-Err:0.5929 Train-Err:7.715483087549682e-06
I:90 Test-Acc:0.8364 Train-Acc:0.885 Test-Err:0.5722 Train-Err:6.9307721515200216e-06
I:100 Test-Acc:0.8407 Train-Acc:0.883 Test-Err:0.5538 Train-Err:6.504311898401096e-06
I:110 Test-Acc:0.845 Train-Acc:0.891 Test-Err:0.5374 Train-Err
```

In [6]:
```python
# Same network as the previous cell, with alpha changed from 2 to 0.0001.
import numpy as np, sys
np.random.seed(1)

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Use the first 1000 training images, flattened to 784 pixels and scaled to [0, 1].
images, labels = (x_train[0:1000].reshape(1000,28*28) / 255, y_train[0:1000])

# One-hot encode the training labels.
one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

# Prepare the test set the same way.
test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

def tanh(x):
    return np.tanh(x)

def tan2deriv(output):
    # Derivative of tanh, expressed in terms of the tanh output.
    return 1 - (output ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

alpha, iterations, hidden_size = (0.0001, 300, 100)
pixels_per_image, num_labels = (784, 10)
batch_size = 100

weights_0_1 = 0.02*np.random.random((pixels_per_image,hidden_size))-0.01
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    correct_cnt = 0
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end=((i * batch_size),((i+1)*batch_size))
        layer_0 = images[batch_start:batch_end]
        layer_1 = tanh(layer_0.dot(weights_0_1))
        # 50% dropout; surviving activations are doubled to compensate.
        dropout_mask = np.random.randint(2,size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(layer_1.dot(weights_1_2))

        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))

        # Backpropagate: output delta first, then the hidden-layer delta.
        layer_2_delta = (labels[batch_start:batch_end]-layer_2) / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * tan2deriv(layer_1)
        layer_1_delta *= dropout_mask

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    test_correct_cnt = 0
    test_total_error = 0

    # Evaluate without dropout; argmax of the raw scores matches argmax of softmax.
    for i in range(len(test_images)):

        layer_0 = test_images[i:i+1]
        layer_1 = tanh(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        test_total_error += np.sum(np.abs(np.argmax(layer_2) - np.argmax(test_labels[i:i+1])))
        test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))

    # Train-Err is based on the delta of the last training batch only.
    total_error = np.sum(np.abs(layer_2_delta))
    avg_error = total_error / float(len(images))

    test_avg_error = test_total_error / float(len(test_images))

    if(j % 10 == 0):
        sys.stdout.write("\n"+ \
         "I:" + str(j) + \
         " Test-Acc:"+str(test_correct_cnt/float(len(test_images)))+\
         " Train-Acc:" + str(correct_cnt/float(len(images)))+\
         " Test-Err:" + str(test_avg_error) +\
         " Train-Err:" + str(avg_error))
```

```
I:0 Test-Acc:0.072 Train-Acc:0.088 Test-Err:2.9287 Train-Err:1.8015329284054318e-05
I:10 Test-Acc:0.0721 Train-Acc:0.069 Test-Err:2.9281 Train-Err:1.7998444927999157e-05
I:20 Test-Acc:0.0722 Train-Acc:0.087 Test-Err:2.9279 Train-Err:1.8002056144353965e-05
I:30 Test-Acc:0.0723 Train-Acc:0.068 Test-Err:2.9279 Train-Err:1.79944096441887e-05
I:40 Test-Acc:0.0724 Train-Acc:0.066 Test-Err:2.9277 Train-Err:1.801123729156478e-05
I:50 Test-Acc:0.0725 Train-Acc:0.093 Test-Err:2.9279 Train-Err:1.8003343919530262e-05
I:60 Test-Acc:0.0726 Train-Acc:0.087 Test-Err:2.9269 Train-Err:1.8000515106158395e-05
I:70 Test-Acc:0.0729 Train-Acc:0.082 Test-Err:2.9271 Train-Err:1.8001228860941025e-05
I:80 Test-Acc:0.0729 Train-Acc:0.072 Test-Err:2.9275 Train-Err:1.8002585215358698e-05
I:90 Test-Acc:0.0729 Train-Acc:0.083 Test-Err:2.9275 Train-Err:1.8001949170079207e-05
I:100 Test-Acc:0.0729 Train-Acc:0.081 Test-Err:2.9275 Train-Err:1.8001920579155582e-05
I:110 Test-Acc:0.073 Train-Acc:0.076 Test-Err:2.9276 Trai
```