# Binary, MultiClass, and MultiLabel Classification

| Hyperparameter                 | Binary Classification        | MultiClass Classification    | MultiLabel Classification      |
| ------------------------------ | -----------------------------| -----------------------------| ------------------------------ |
| **Number of Classes**          | 2 (binary)                   | More than 2                  | More than 1 (multiple labels)  |
| **Input Layer Shape**          | Features                     | Features                     | Features                       |
| **Hidden Layers**              | 1 or more                    | 1 or more                    | 1 or more                      |
| **Neurons per Hidden Layer**   | 64, 128, 256, etc.           | 64, 128, 256, etc.           | 64, 128, 256, etc.             |
| **Output Layer Shape**         | 1                            | Number of classes            | Number of labels               |
| **Hidden Activation**          | ReLU or other activation     | ReLU or other activation     | ReLU or other activation       |
| **Output Activation**          | Sigmoid                      | Softmax                      | Sigmoid (per label)            |
| **Loss Function**              | Binary Cross-Entropy         | Categorical Cross-Entropy    | Binary Cross-Entropy per label |
| **Optimizer**                  | Adam, SGD, etc.              | Adam, SGD, etc.              | Adam, SGD, etc.                |
| **Decision Threshold**         | Typically 0.5                | None (argmax over classes)   | Typically 0.5 per label        |
| **One-Hot Encoding**           | Not needed                   | Needed for target variable   | Multi-hot target vector        |
| **Example Models/Methods**     | Logistic Regression, SVM     | Logistic Regression, CNN     | Binary Relevance, MLkNN, etc.  |
| **Common Evaluation Metrics**  | Accuracy, Precision, Recall  | Accuracy, Precision, Recall  | Precision, Recall, F1-score    |
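
To make the output-layer rows of the table concrete, here is a minimal NumPy sketch of the two output activations and their matching losses. The logits and targets are made-up values for illustration only, and the helper names (`sigmoid`, `softmax`, `binary_cross_entropy`, `categorical_cross_entropy`) are defined here rather than taken from any library.

```python
import numpy as np

def sigmoid(z):
    # Squashes each value independently into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row maximum for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def binary_cross_entropy(y_true, p, eps=1e-12):
    # Averaged over every output unit (one per label)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    # Expects one-hot targets and probabilities that sum to 1 per row
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=-1))

# Hypothetical logits for one sample from a 3-unit output layer
logits = np.array([[2.0, -1.0, 0.5]])

# MultiClass: softmax couples the outputs, prediction is the argmax
multiclass_probs = softmax(logits)
multiclass_pred = multiclass_probs.argmax(axis=-1)           # [0]
multiclass_loss = categorical_cross_entropy(np.array([[1, 0, 0]]), multiclass_probs)

# MultiLabel: sigmoid treats each label independently, threshold at 0.5
multilabel_probs = sigmoid(logits)
multilabel_pred = (multilabel_probs >= 0.5).astype(int)      # [[1, 0, 1]]
multilabel_loss = binary_cross_entropy(np.array([[1, 0, 1]]), multilabel_probs)
```

The binary case is the multilabel case with a single output unit: one sigmoid, one threshold, and binary cross-entropy on that one value.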



The stochastic gradient descent (SGD) update rule is $\theta_{t+1} = \theta_t - \alpha \nabla_\theta J(\theta_t; x^{(i)}, y^{(i)})$, where:
- $\theta_t$ is the parameter vector at iteration $t$,
- $\alpha$ is the learning rate (a positive scalar),
- $(x^{(i)}, y^{(i)})$ is the training example (or mini-batch) sampled at iteration $t$,
- $J(\theta_t; x^{(i)}, y^{(i)})$ is the loss evaluated on that example,
- $\nabla_\theta J(\theta_t; x^{(i)}, y^{(i)})$ is the gradient of the loss with respect to the parameters, evaluated at $\theta_t$.
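
The update rule above can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions, not a production optimizer: the squared-error loss, the synthetic noiseless data, and the helper names `sgd_step` and `squared_error_grad` are all choices made here for the demonstration.

```python
import numpy as np

def sgd_step(theta, grad, alpha):
    # One SGD update: step against the gradient, scaled by the learning rate
    return theta - alpha * grad

def squared_error_grad(theta, x, y):
    # Gradient of J = 0.5 * (x . theta - y)^2 with respect to theta,
    # computed on a single training example (x, y)
    return (x @ theta - y) * x

# Synthetic, noiseless linear data (an assumption for this sketch)
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -3.0])
X_lin = rng.normal(size=(200, 2))
y_lin = X_lin @ true_theta

theta = np.zeros(2)
for epoch in range(20):
    # Visit the examples in a fresh random order each epoch
    for i in rng.permutation(len(X_lin)):
        grad = squared_error_grad(theta, X_lin[i], y_lin[i])
        theta = sgd_step(theta, grad, alpha=0.05)

print(theta)  # close to true_theta
```

Each update uses the gradient from one example only, which is what distinguishes SGD from full-batch gradient descent.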

The Adam optimization algorithm updates the parameters $ \theta $ using the following equation:

$ \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t $

where:
- $ \theta_t $ is the parameter vector at time step $ t $,
- $ \alpha $ is the learning rate,
- $ \hat{m}_t $ is the bias-corrected first moment estimate (a running mean of the gradients),
- $ \hat{v}_t $ is the bias-corrected second raw moment estimate (a running mean of the squared gradients),
- $ \epsilon $ is a small constant for numerical stability.
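
The text does not spell out how the moment estimates are computed, so the sketch below fills them in with the standard Adam definitions: exponential moving averages with decay rates `beta1` and `beta2`, each divided by `1 - beta**t` to correct the bias from zero initialization. The default values (0.9, 0.999, 1e-8) are the commonly used ones, and the quadratic objective and the name `adam_step` are assumptions made for this example.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction: m and v start at zero, so early values are too small
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Parameter update, matching the equation above
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (an assumption): J(theta) = sum((theta - target)^2)
target = np.array([1.0, -2.0])
theta = np.zeros(2)
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 2001):  # t starts at 1 so the bias correction is defined
    grad = 2 * (theta - target)
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.01)

print(theta)  # close to target
```

On the first step the bias-corrected estimates reduce to the gradient and its square, so the step size is roughly `alpha` regardless of the gradient's magnitude.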


In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import keras

In [None]:
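# Build a toy binary-classification dataset: two concentric circles
from sklearn.datasets import make_circles

samples = 1000

# X has shape (samples, 2); Y holds the 0/1 circle labels
X, Y = make_circles(samples, noise=0.04)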
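
Before building a model it helps to confirm what the generated data looks like. The sketch below regenerates the dataset under separate names (`X_demo`, `Y_demo`) and fixes `random_state=0` for reproducibility; both are choices made for this example and are not part of the cell above.

```python
import numpy as np
from sklearn.datasets import make_circles

# Same settings as above, with a fixed seed (an assumption for reproducibility)
X_demo, Y_demo = make_circles(1000, noise=0.04, random_state=0)

print(X_demo.shape)  # (1000, 2): two features per sample
print(Y_demo.shape)  # (1000,): one label per sample

# Labels are 0 (outer circle) and 1 (inner circle), split evenly
labels, counts = np.unique(Y_demo, return_counts=True)
print(labels, counts)
```

With two features and a single 0/1 target, this dataset fits the binary column of the table: one output unit with a sigmoid activation and binary cross-entropy loss.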