# **Neural Networks**

At its core, a Neural Network is a mathematical function that maps an input to an output. However, unlike a simple linear regression equation (y = mx + c), a neural network is composed of layers of interconnected "neurons" that can approximate incredibly complex, non-linear functions.

### The Biological Inspiration

The concept is loosely inspired by the human brain.

   * **Biological Neuron:** Receives electrical signals through dendrites. If the signal is strong enough, it "fires" (activates) and sends the signal down the axon to other neurons.
   * **Artificial Neuron:** Receives numerical inputs. It multiplies them by specific "weights" (importance), adds a "bias" (threshold), and passes the result through an "activation function" (decision rule).
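
As a minimal sketch of the artificial neuron described above: the input values, weights, and bias below are made-up numbers, and a sigmoid is assumed as the activation function purely for illustration.

```python
import numpy as np

# Hypothetical values chosen only for illustration
x = np.array([0.5, -1.0, 2.0])   # numerical inputs
w = np.array([0.8, 0.2, -0.5])   # weights (importance of each input)
b = 0.1                          # bias (shifts the threshold)

z = np.dot(w, x) + b             # weighted sum plus bias
a = 1 / (1 + np.exp(-z))         # activation function (sigmoid assumed here)

print(z, a)
```

The weighted sum here is negative, so the sigmoid output lands below 0.5: the neuron stays mostly "quiet".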

### The Architecture

A standard Neural Network consists of three types of layers:

   1. **Input Layer:** The raw data enters here (e.g., pixels of an image, words in a sentence). No computation happens here; it's just a doorway.
   2. **Hidden Layers:** The "magic" happens here. These layers transform the input data into abstract features. Deep Learning simply refers to networks with many hidden layers.
   3. **Output Layer:** The final decision (e.g., "Cat" vs. "Dog", or a stock price prediction).

### How It "Learns"

A neural network is initially dumb. It starts with random weights.

   1. **Forward Pass (Guessing):** Data flows through the network. It makes a guess.
   2. **Loss Calculation (Grading):** We compare the guess to the actual answer. The difference is the "Loss" or error.
   3. **Backward Pass (Learning):** We use calculus (Chain Rule) to calculate which weights contributed most to the error.
   4. **Optimization (Correction):** We adjust each weight slightly in the direction opposite to its gradient, so the error shrinks on the next pass.

**Analogy:** Imagine trying to tune a radio to a specific clear station (the optimal function). You turn the knobs (weights) randomly at first. You hear static (high loss). You turn one knob slightly; the static gets worse. You turn it the other way; the static gets better. You keep doing this for all knobs until the signal is clear.

---

## **Topic Tree**

## **Concept 1: The Linear Layer (Forward Propagation)**

The fundamental building block of a neural network is the Linear Layer (also called the Dense or Fully Connected layer). It performs a linear transformation on the input.

### Mechanics & Math

For a single layer, the mathematical operation is:

$$Z = W \cdot A + b$$

Where:

   * $A$: The input to this layer (or the output of the previous layer).
   * $W$: The Weight matrix (the "knobs" we tune).
   * $b$: The Bias vector (shifts the activation up or down).
   * $Z$: The linear output (pre-activation).

**Dimensions (The Most Important Part)**
To make the matrix multiplication work, the dimensions must align. We will use the convention where each **column** is a training example.

   * $A$: Shape $(n_{in}, m)$ - where $n_{in}$ is the number of input features and $m$ is the number of examples.
   * $W$: Shape $(n_{out}, n_{in})$ - where $n_{out}$ is the number of neurons in this layer.
   * $b$: Shape $(n_{out}, 1)$ - NumPy broadcasting will apply this to all $m$ examples.
   * $Z$: Shape $(n_{out}, m)$.
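
The shape rules above can be checked with a quick sketch. The sizes `n_in`, `n_out`, and `m` below are arbitrary values picked for illustration, not values from the exercise.

```python
import numpy as np

n_in, n_out, m = 3, 4, 5                 # hypothetical sizes

A = np.random.randn(n_in, m)             # one column per example
W = np.random.randn(n_out, n_in) * 0.01  # one row per neuron
b = np.zeros((n_out, 1))                 # one bias per neuron

Z = np.dot(W, A) + b                     # b is broadcast across the m columns

print(A.shape, W.shape, b.shape, Z.shape)
```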

### Intuition

Think of $W$ as a filter. If the input $A$ matches the pattern in row $i$ of $W$, the resulting dot product will be high. The bias $b$ acts as a threshold; if the weighted sum isn't high enough to overcome a negative bias, the neuron stays "quiet" (after activation).

### Trade-offs & Pitfalls

* **Initialization:** You cannot initialize $W$ to zeros. If you do, every neuron in a layer computes the same output and receives the same gradient, so they all learn the exact same thing (the symmetry problem). We usually initialize them with small random numbers. $b$ can be initialized to zeros.
* **Broadcasting:** In Python/NumPy, if you aren't careful with the shape of $b$ (e.g., `(n,)` vs `(n, 1)`), broadcasting can silently add the bias along the wrong axis or produce an unexpected shape instead of raising an error.
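
To make the broadcasting pitfall concrete, here is a small sketch. The sizes are hypothetical and deliberately chosen so that the number of neurons equals the number of examples, which is the case where the bug hides.

```python
import numpy as np

n_out, m = 3, 3                      # hypothetical sizes; n_out == m hides the bug
Z_no_bias = np.zeros((n_out, m))     # stand-in for the product of W and A

b_col = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1): one bias per neuron
b_flat = np.array([1.0, 2.0, 3.0])        # shape (3,): rank-1 array

Z_good = Z_no_bias + b_col   # each ROW (neuron) gets its own bias
Z_bad = Z_no_bias + b_flat   # each COLUMN (example) gets a bias instead

print(Z_good)
print(Z_bad)
```

Both results have shape `(3, 3)`, so nothing fails, but the values differ. When `n_out` and `m` are different, the second addition raises a `ValueError` instead, which is the easier bug to catch.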

---

### Your Task

Implement two functions using **NumPy**:

1. `initialize_parameters(n_in, n_out)`:
   * Initialize `W` with random values from a standard normal distribution, scaled by `0.01` (to keep values small).
   * Initialize `b` with zeros.
   * **Return:** `W`, `b`


2. `linear_forward(A, W, b)`:
   * Perform the dot product $W \cdot A$.
   * Add the bias $b$.
   * **Return:** `Z` (the result), and a `cache` tuple containing `(A, W, b)` (we will need these later for backpropagation).



**Input Constraints:**

   * Use `numpy.random.randn` for weights.
   * Ensure `b` has shape $(n_{out}, 1)$.



```python
import numpy as np

class NN1():
    def __init__(self):
        """
        Constructor
        """
        self.sd = 0.01

    def initialize_parameters(self, n_in, n_out):
        """
        Initialize W with random values from a standard normal distribution, scaled by 0.01 (to keep values small).
        Initialize b with zeros.
        Return: W, b
        """
        W = self.sd * np.random.randn(n_out, n_in)
        b = np.zeros((n_out, 1))
        return W, b

    def linear_forward(self, A, W, b):
        """
        Perform the dot product: W.A
        Add the bias b.
        Return: Z (the result), and a cache tuple containing (A, W, b) (we will need these later for backpropagation).
        """
        Z = np.dot(W, A) + b
        return Z, (A, W, b)
```



## **Concept 2: Activation Functions (ReLU & Sigmoid)**

If we only used linear layers (like the one you just built), it wouldn't matter how many layers we stacked—the final output would still be just a linear combination of the inputs. The network could only learn to draw straight lines (linear boundaries).

**Activation Functions** introduce non-linearity, allowing the network to learn complex curves and shapes.
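
The claim that stacked linear layers collapse into a single linear layer can be checked numerically. This is a sketch with arbitrary random matrices; the sizes and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, for reproducibility only
X = rng.standard_normal((2, 5))       # 2 features, 5 examples (hypothetical sizes)
W1 = rng.standard_normal((4, 2))
b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((1, 4))
b2 = rng.standard_normal((1, 1))

# Two stacked linear layers with no activation in between
two_layers = W2 @ (W1 @ X + b1) + b2

# One equivalent linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ X + b_combined

print(np.allclose(two_layers, one_layer))
```

The two outputs agree, which is why depth alone adds nothing without a non-linearity between the layers.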

### 1. ReLU (Rectified Linear Unit)
    
$$A = \max(0, Z)$$

   * **Intuition:** It's a gate. If the input is positive, let it through unchanged. If it's negative, shut it off (make it zero).
   * **Context:** This is the default choice for **hidden layers**.
   * **Why?** It is computationally cheap (just a `max` check) and helps mitigate the "vanishing gradient" problem that older saturating functions (like Sigmoid and Tanh) suffer from in hidden layers.

### 2. Sigmoid

$$A = \frac{1}{1 + e^{-Z}}$$

   * **Intuition:** It squashes any number (from $-\infty$ to $+\infty$) into a range between **0 and 1**.
   * **Context:** This is the standard choice for the **output layer** in binary classification.
   * **Why?** It lets us interpret the output as a probability (e.g., "70% chance this is a Moon").

---

### Your Task

Add these static methods or standalone functions to your class (or outside it, your choice).

1. `sigmoid(Z)`:
   * Implement the formula: `1 / (1 + np.exp(-Z))`
   * **Return:** `A`, `cache` (where `cache` is just `Z`).


2. `relu(Z)`:
   * Implement: Return `Z` if `Z > 0`, else `0`. (Hint: `np.maximum` works element-wise on arrays, whereas Python's built-in `max` does not.)
   * **Return:** `A`, `cache` (where `cache` is just `Z`).



**Why return `Z` as cache?**
Just like in the linear layer, we need the input `Z` to calculate the slope (derivative) of these functions later during backpropagation.


```python
import numpy as np

class N12():
    """
    Class with ReLU and Sigmoid
    """
    def __init__(self, Z):
        self.Z = Z

    def get_sigmoid(self):
        A = 1/(1 + np.exp(-self.Z))
        return A, self.Z

    def get_ReLU(self):
        A = np.maximum(0, self.Z)
        return A, self.Z
```


## **Concept 3: Loss Function (Binary Cross-Entropy)**

Now that our network can make a prediction ($A$), we need to measure how "wrong" that prediction is. This measurement is called the **Loss** or **Cost**.

Since we are doing **Binary Classification** (0 vs 1), we use **Binary Cross-Entropy Loss** (also called Log Loss).

### Mechanics & Math

The cost $J$ over $m$ examples is:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(a^{(i)}) + (1 - y^{(i)}) \log(1 - a^{(i)}) \Big]$$

Where:

   * $y^{(i)}$: The true label (0 or 1).
   * $a^{(i)}$: The network's prediction (probability between 0 and 1).

### Intuition

   * **If correct answer is 1 ($y=1$):** The second term $(1-y)$ becomes 0. We only care about $\log(a)$.
      * If prediction $a$ is close to 1 (e.g., 0.99), $\log(0.99) \approx 0$. **Cost is low.**
      * If prediction $a$ is close to 0 (e.g., 0.01), $\log(0.01)$ is a large negative number, and the leading minus sign turns it into a large positive cost. **Cost is high.**
   * **If correct answer is 0 ($y=0$):** The first term becomes 0. We care about $\log(1-a)$.
      * It works symmetrically: the cost is low when $a$ is close to 0 and high when $a$ is close to 1.


### Trade-offs

* **MSE (Mean Squared Error):** Common for regression (predicting house prices), but a poor fit for classification: combined with a sigmoid output, its gradient becomes tiny when the prediction saturates near 0 or 1, so the network learns slowly even when the error is large.
* **Cross-Entropy:** Heavily penalizes confident wrong predictions, speeding up learning.


Let's strip away the math symbols and look at the intuition.

### Simpler Explanation: The "Strict Teacher" Analogy

Imagine you are taking a test where you have to answer "True" (1) or "False" (0). However, instead of just answering Yes/No, you have to write down your **confidence percentage**.

**Scenario 1: The correct answer is TRUE (1).**

**Student A** says: "99% sure it's True."
Teacher: "Great job. You are right and confident." -> **Loss: 0.01 (Tiny penalty)**

**Student B** says: "50% sure it's True."
Teacher: "Okay, you're wishy-washy." -> **Loss: 0.69 (Medium penalty)**

**Student C** says: "0.1% sure it's True" (They are 99.9% sure it's False).
Teacher: "You are confidently wrong! That is the worst mistake." -> **Loss: 6.91 (Severe penalty)**
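
The three penalties in this scenario can be reproduced with a short sketch. It assumes the natural logarithm and uses the fact that, when the correct answer is 1, the cross-entropy loss reduces to $-\log(a)$.

```python
import numpy as np

# Confidence that the answer is True, for Students A, B, and C
confidence = np.array([0.99, 0.5, 0.001])

# With y = 1 the cross-entropy loss reduces to -log(a)
loss = -np.log(confidence)

print(np.round(loss, 2))
```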



**Why do we use Logs?**
The logarithm function is just a mathematical tool to make the penalty "explode" the closer you get to being 100% wrong.

   * If you are slightly wrong, the Log Loss is small.
   * If you are completely, arrogantly wrong (predicting 100% "No" when it is "Yes"), the Log Loss goes to **infinity**.

**Why not just use simple difference (Mean Squared Error)?**
If we used simple subtraction (like in linear regression), the penalty for being "confidently wrong" isn't harsh enough. The Cross-Entropy loss forces the neural network to fix its worst mistakes (the ones where it is confidently wrong) *very* quickly.

---

### Your Task

Implement the function `compute_cost(AL, Y)`.

**Specifications:**

   1. Implement the formula: $J = -\frac{1}{m} \sum \Big[ y \log(a) + (1 - y) \log(1 - a) \Big]$ using `np.multiply`, `np.log`, and `np.sum`.
   2. `AL` is the activation output of the final layer (shape: $(1, m)$).
   3. `Y` is the true label vector (shape: $(1, m)$).
   4. **Important:** The result `cost` must be a scalar (a single number). Use `np.squeeze()` at the end to turn `[[0.5]]` into `0.5`.



```python
import numpy as np

def compute_cost(AL, Y):
    m = Y.shape[1]  # number of examples (one per column)
    # Binary cross-entropy, averaged over the m examples
    J = -(1 / m) * np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL)))
    return np.squeeze(J)
```

**Pro Tip (Optimization):** In production code, we often add a tiny number (epsilon, e.g., `1e-8`) inside the `np.log()` to prevent `log(0)` errors if the model predicts exactly 0 or 1. For now, your strict implementation is mathematically perfect.

#### 1. The Problem: The "Black Hole" of Log(0)

In math, the function $\log(x)$ asks: *"To what power must I raise $e$ (2.718) to get $x$?"*

If you try to calculate $\log(0)$, you are asking: *"To what power do I raise 2.718 to get 0?"*
There is no answer. You can get very close to zero by using a huge negative power (like $e^{-1000}$), but you never actually touch zero. Therefore, mathematically:
$$\log(0) = -\infty \text{ (Negative Infinity)}$$

**In Python:**
If your neural network predicts a perfect `0` (which can happen due to rounding errors) for a label that is actually `1`:

```python
import numpy as np

np.log(0)  # Output: -inf (NumPy also emits a RuntimeWarning: divide by zero)
```

If you then do math with `-inf`, the cost becomes `inf`, or `NaN` (Not a Number) when `-inf` is multiplied by 0, which is exactly what the $y$ or $(1 - y)$ factor does. NumPy does not raise an exception, but the reported cost is no longer a usable number.

#### 2. The Solution: The "Safety Cushion"

To prevent the code from crashing, we add a tiny, insignificant number called **epsilon** ($\epsilon$), usually $10^{-8}$ (0.00000001).

Instead of calculating $\log(A)$, we calculate $\log(A + \epsilon)$.

   * If $A = 0.5$: $\log(0.50000001) \approx -0.69$. (No noticeable change).
   * If $A = 0$: $\log(0.00000001) \approx -18.4$.
      * Now, instead of "Negative Infinity," we just get a "Very Large Penalty." The computer is happy, and training continues.
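
A minimal sketch of the epsilon trick follows. The function name `compute_cost_stable` and its `eps` parameter are hypothetical and not part of the exercise; the epsilon is added inside both logarithms so that predictions of exactly 0 or exactly 1 are covered.

```python
import numpy as np

def compute_cost_stable(AL, Y, eps=1e-8):
    """Binary cross-entropy with epsilon added inside each log (hypothetical helper)."""
    m = Y.shape[1]
    cost = -(1 / m) * np.sum(Y * np.log(AL + eps) + (1 - Y) * np.log(1 - AL + eps))
    return np.squeeze(cost)

# A prediction of exactly 0 for a true label of 1 now gives a finite penalty
AL = np.array([[0.0, 0.5]])
Y = np.array([[1.0, 0.0]])
print(compute_cost_stable(AL, Y))
```

Another common option is to clip the predictions with `np.clip` before taking the log; which one to use is a design choice.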


This concept was about the **Cross-Entropy Math** itself. "Numerical Stability Hacks" (like epsilon) were left out of the exercise because they add noise when you are just trying to memorize the formula.

---


## **Concept 4: The Chain Rule (Backpropagation Theory)**

We have the Loss ($J$). Now we need to minimize it. To do that, we need to know: **"If I nudge this specific weight $W$ slightly up, does the Loss go up or down?"**

This is the **Gradient** ($\frac{\partial J}{\partial W}$). Since our network is just a chain of functions nested inside each other ($Loss(Sigmoid(Linear(X)))$), we use the **Chain Rule** from calculus to find this gradient.

### Intuition: The "Blame Game"

Imagine a relay race where the team loses. Who is to blame?

1. **The Coach (Loss Function):** Yells at the last runner (Activation). "You were too slow!"
2. **The Last Runner (Activation):** Blames their shoes (Linear Output Z). "The fit was tight!"
3. **The Shoes (Linear Output):** Blame the manufacturer (Weights W).

Mathematically, we calculate the blame (gradient) in reverse order.

### Mechanics & Math

We want to find $\frac{\partial J}{\partial W}$. We break it down into steps:

$$\frac{\partial J}{\partial W} = \underbrace{\frac{\partial J}{\partial A}}_{\text{Change in Loss vs Pred}} \cdot \underbrace{\frac{\partial A}{\partial Z}}_{\text{Change in Pred vs Linear}} \cdot \underbrace{\frac{\partial Z}{\partial W}}_{\text{Change in Linear vs Weights}}$$

This chain allows us to pass the error signal backward through any number of layers.

### The "Magic" Shortcut

For Logistic Regression (and the output layer of our binary classifier), the calculus simplifies beautifully.

If you combine the derivative of the **Cross-Entropy Loss** with the derivative of the **Sigmoid Activation**, the first step of backpropagation ($dZ$) simplifies to just:
$$dZ = A - Y$$

 * If prediction $A$ is 0.8 and Truth $Y$ is 1.0, the "error signal" is $A - Y = 0.8 - 1.0 = -0.2$.
 * This term ($A-Y$) is the "upstream gradient" that kicks off the entire backpropagation process.
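
The shortcut $dZ = A - Y$ can be sanity-checked with a finite-difference estimate. This is a sketch for a single example; the values of `z`, `y`, and the step `h` are arbitrary assumptions, and `loss_from_z` is a hypothetical helper.

```python
import numpy as np

def loss_from_z(z, y):
    """Cross-entropy loss of a single example as a function of the pre-activation z."""
    a = 1 / (1 + np.exp(-z))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.3, 1.0          # hypothetical values
h = 1e-6                 # small step for the finite difference

numerical = (loss_from_z(z + h, y) - loss_from_z(z - h, y)) / (2 * h)
analytic = 1 / (1 + np.exp(-z)) - y      # A - Y

print(numerical, analytic)
```

The two numbers should agree to several decimal places, which is what the combined sigmoid and cross-entropy derivative predicts.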

---

### Your Task (Mental Check)

I will not ask for Python code yet. I want to verify you understand the data flow.

**Scenario:**
You have a network: $X \xrightarrow{W_1} Z_1 \xrightarrow{ReLU} A_1 \xrightarrow{W_2} Z_2 \xrightarrow{Sigmoid} A_2 \xrightarrow{Loss} J$.

**Question:** To calculate the gradient for the first layer's weights ($W_1$), we need the error signal from the future.
Complete this dependency chain:
To get $\frac{\partial J}{\partial W_1}$, we need $\frac{\partial J}{\partial Z_1}$.
To get $\frac{\partial J}{\partial Z_1}$, we need the gradient from the layer ahead of it. Which variable represents the "error signal" coming back from layer 2? (Choose one: $W_2$, $Z_2$, or $dZ_2$?)


**Answer:** $dZ_2$ is the error signal at the linear part of Layer 2, which allows us to calculate everything before it.

---

## **Concept 5: Backward Propagation (The Linear Step)**

Now we translate the math into code. We will tackle this in two parts. First, the **Linear Backward** step.

Assume we already know the error $dZ$ at the end of a linear layer. We need to calculate three things:

1. **$dW$ (Gradient of Weights):** How much to change $W$.
2. **$db$ (Gradient of Bias):** How much to change $b$.
3. **$dA_{prev}$ (Gradient of Input):** The error to send back to the previous layer.

### Mechanics & Math

Recall that $Z = W \cdot A_{prev} + b$.
Using calculus (matrix calculus), the gradients are:

   $$dW = \frac{1}{m} dZ \cdot A_{prev}^T$$
   $$db = \frac{1}{m} \sum_{axis=1} dZ$$
   $$dA_{prev} = W^T \cdot dZ$$

**Key Details:**

   * **Transposing ($T$):** We transpose matrices to make the dimensions align for multiplication.
      * $dZ$ : $(n_{out}, m)$
      * $A_{prev}$ : $(n_{in}, m)$ -> $A_{prev}^T$ : $(m, n_{in})$
      * Result $dW$ : $(n_{out}, n_{in})$ (Matches shape of $W$).

   * **Summing:** For $db$, we sum the errors across all $m$ examples to get one average adjustment per neuron.


### Simpler Explanation: The "Project Post-Mortem"

Imagine a project failed (High Loss), and now we are doing a "post-mortem" meeting to figure out what went wrong. We are working backwards from the failure.

We are currently analyzing the **Linear Layer** ($Z = W \cdot A + b$).
The layer ahead of us (the Activation) has just handed us a report called **$dZ$**. This report says: *"The output $Z$ you gave me was too high/low by this amount."*

Now, the Linear Layer has to decide who is responsible for this error. It splits the blame into three parts:

**1. Blame the Weights ($dW$)**
   * **Question:** "Did we set the importance (weight) of the input incorrectly?"
   * **Intuition:** If the input ($A_{prev}$) was very loud (a large number), even a tiny mistake in the Weight ($W$) would cause a huge error in the result.
   * **Action:** We calculate $dW$ so we can adjust the knobs ($W$) later to fix this.

**2. Blame the Bias ($db$)**
   * **Question:** "Was our baseline threshold (bias) too high or low?"
   * **Intuition:** This is the "average" error that remains after checking the weights. If we were consistently too high across all examples, the bias takes the blame.
   * **Action:** We calculate $db$ so we can shift the baseline ($b$) later.

**3. Pass the Blame Backward ($dA_{prev}$) -- CRITICAL STEP**
   * **Question:** "Maybe I (the Linear Layer) did my job correctly, but the *data given to me* ($A_{prev}$) by the previous guy was garbage?"
   * **Intuition:** We need to calculate how much the *input* contributed to the error.
   * **Action:** We calculate $dA_{prev}$ and hand it backward to the previous layer. They will use this as their "incoming error" to fix their own weights.


### Your Task

Implement the function `linear_backward(dZ, cache)`.

**Specifications:**

* **Inputs:**
   * `dZ`: The gradient of the cost with respect to the linear output ($Z$).
   * `cache`: The tuple `(A_prev, W, b)` you stored during the forward pass.


* **Operations:**
   * Retrieve `A_prev`, `W`, `b` from `cache`.
   * Get `m` from the shape of `A_prev` (number of columns).
   * Calculate `dW`, `db`, `dA_prev` using the formulas above.
   * **Important:** `db` requires `np.sum` with `axis=1` and `keepdims=True` (to maintain shape `(n, 1)`).


* **Returns:** `dA_prev`, `dW`, `db`.


```python
import numpy as np

def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    At = np.transpose(A_prev)
    dW = np.multiply((1/m), np.dot(dZ, At))
    db = np.multiply((1/m), np.sum(dZ, axis=1, keepdims=True))
    dA_prev = np.dot(np.transpose(W), dZ)
    return dA_prev, dW, db
```
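
One way to gain confidence in these formulas is a finite-difference gradient check. The sketch below repeats `linear_backward` so it runs on its own and uses a made-up cost, `toy_cost`, chosen only because its gradient with respect to $Z$ is easy to write down; it is not the cross-entropy cost from earlier.

```python
import numpy as np

def linear_backward(dZ, cache):
    # Same formulas as above, repeated so this snippet runs on its own
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db

def toy_cost(A_prev, W, b):
    # Hypothetical cost used only for this check: J = (1/m) * sum(0.5 * Z**2)
    Z = np.dot(W, A_prev) + b
    return np.sum(0.5 * Z ** 2) / A_prev.shape[1]

rng = np.random.default_rng(1)
A_prev = rng.standard_normal((3, 5))
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))

# For this toy cost, the per-example gradient with respect to Z is Z itself
dZ = np.dot(W, A_prev) + b
dA_prev, dW, db = linear_backward(dZ, (A_prev, W, b))

# Finite-difference estimate of dJ/dW[0, 0]
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
numerical = (toy_cost(A_prev, W_plus, b) - toy_cost(A_prev, W_minus, b)) / (2 * h)

print(numerical, dW[0, 0])
```

If the two printed values match closely, the $dW$ formula and the transposes are wired correctly for that entry.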


You have successfully implemented the "Linear" part of the backward pass. Now we need to implement the "Activation" part to complete the chain.

---

## **Concept 5 (Part 2): Activation Backward**

We just calculated how to move from $dZ$ to $dW$. But how do we get $dZ$ in the first place?
We get it from $dA$ (the gradient coming from the layer ahead).

### Mechanics & Math

The chain rule tells us:
$$dZ = \frac{\partial J}{\partial A} \cdot \frac{\partial A}{\partial Z} = dA \cdot g'(Z)$$

This means calculating $dZ$ is just an **element-wise multiplication** between the incoming error $dA$ and the **slope (derivative)** of the activation function at that point $Z$.

#### 1. Derivative of ReLU

If $Z > 0$, the slope is 1 (linear). If $Z \le 0$, the slope is 0 (flat).
$$g'(Z) = \begin{cases} 1 & \text{if } Z > 0 \\ 0 & \text{if } Z \le 0 \end{cases}$$

   * **Logic:** Copy $dA$ exactly where $Z$ was positive. Zero out $dA$ where $Z$ was zero or negative.

#### 2. Derivative of Sigmoid

The derivative of the sigmoid function $s(Z)$ has a beautiful property:
$$g'(Z) = s(Z) \cdot (1 - s(Z))$$

   * **Logic:** We first calculate the sigmoid $s$ of the cached $Z$, then use it to scale $dA$.

---

### Your Task

Implement two standalone functions:

1. `relu_backward(dA, cache)`
   * **Input:** `dA` (gradient of activation), `cache` (this is just `Z` stored from forward pass).
   * **Logic:** Create `dZ` which is a copy of `dA`. Set values in `dZ` to `0` wherever the corresponding `Z` (from cache) is $\le 0$.
   * **Return:** `dZ`.


2. `sigmoid_backward(dA, cache)`
   * **Input:** `dA`, `cache` (stored `Z`).
   * **Logic:**
      * Calculate $s = 1 / (1 + e^{-Z})$.
      * Calculate $dZ = dA * s * (1 - s)$. (Element-wise multiplication).

   * **Return:** `dZ`.



```python
import numpy as np

def relu_backward(dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

def sigmoid_backward(dA, cache):
    Z = cache
    s = 1/(1 + np.exp(-Z))
    dZ = np.multiply(dA, (s*(1-s)))
    return dZ
```
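
The two derivative rules can be compared against finite-difference slopes. This is a sketch with made-up pre-activation values; setting the upstream gradient `dA` to ones isolates $g'(Z)$, and the `sigmoid` helper is defined here only so the snippet is self-contained.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

Z = np.array([[-2.0, -0.5, 0.5, 2.0]])   # hypothetical pre-activations
dA = np.ones_like(Z)                     # upstream gradient of 1 isolates g'(Z)
h = 1e-6

# Sigmoid: analytic slope s * (1 - s) vs. finite difference
s = sigmoid(Z)
dZ_sigmoid = dA * s * (1 - s)
numerical_sigmoid = (sigmoid(Z + h) - sigmoid(Z - h)) / (2 * h)

# ReLU: analytic slope (1 where Z > 0, else 0) vs. finite difference
dZ_relu = dA * (Z > 0)
numerical_relu = (np.maximum(0, Z + h) - np.maximum(0, Z - h)) / (2 * h)

print(np.allclose(dZ_sigmoid, numerical_sigmoid), np.allclose(dZ_relu, numerical_relu))
```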


We have successfully built the engine (Linear), the transmission (Activation), and the diagnostic tools (Loss & Gradients). Now we need the driver.

---

## **Concept 6: Optimization (Gradient Descent)**

Calculating the gradients ($dW$, $db$) tells us the direction of the steepest slope uphill. Since we want to minimize the loss (go downhill), we simply take a step in the **opposite direction**.

This process is called **Gradient Descent**.

### Intuition: The Mountain in the Fog

Imagine you are lost on a mountain on a foggy night. You want to get to the village at the bottom (Minimum Loss).

   1. You feel the slope of the ground under your feet (Calculate Gradients).
   2. You take a small step downhill (Update Parameters).
   3. You repeat this until the ground flattens out (Convergence).

**The Learning Rate ($\alpha$):**
This controls the size of your step.

   * **Too small:** You will take forever to reach the bottom.
   * **Too big:** You might overshoot the village and end up on the other side of the valley (divergence).

### Mechanics & Math

For every parameter $\theta$ (which represents all our $W$ and $b$ matrices):
$$\theta = \theta - \alpha \cdot d\theta$$

In our specific case with a dictionary of parameters:

   * $W = W - \alpha \cdot dW$
   * $b = b - \alpha \cdot db$
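
The effect of the learning rate can be seen on a toy problem. The sketch below minimises $f(\theta) = \theta^2$, a function chosen only for illustration; `gradient_descent` is a hypothetical helper, and the two learning rates are arbitrary examples of "small enough" and "too big".

```python
def gradient_descent(theta, learning_rate, steps):
    """Minimise the toy function f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        d_theta = 2 * theta                      # gradient at the current point
        theta = theta - learning_rate * d_theta  # step in the opposite direction
    return theta

small_step = gradient_descent(theta=5.0, learning_rate=0.1, steps=50)
big_step = gradient_descent(theta=5.0, learning_rate=1.1, steps=50)

print(small_step, big_step)
```

With the smaller rate, `theta` shrinks toward the minimum at 0. With the larger rate, every step overshoots by more than the previous one and the value grows without bound.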

### Your Task

Implement `update_parameters(parameters, grads, learning_rate)`.

**Specifications:**

   * **Inputs:**
      * `parameters`: A dictionary containing `{"W1": ..., "b1": ..., "W2": ..., "b2": ...}`.
      * `grads`: A dictionary containing `{"dW1": ..., "db1": ..., "dW2": ..., "db2": ...}`. (Output from backprop).
      * `learning_rate`: A scalar (e.g., 0.01).
   
   
   * **Logic:**
      * Update every parameter in the dictionary using the formula above.
      * **Note:** You must update them "in place" or return a new dictionary with the updated values. Returning a new dictionary is safer and cleaner.
   * **Return:** `parameters` (the updated dictionary).



```python
import numpy as np

def update_parameters(parameters, grads, learning_rate):
    new_params = {}
    new_params.update(parameters)

    for key in new_params.keys():
        grad_key = "d" + key
        val = new_params[key]
        new_params[key] = val - np.multiply(learning_rate, grads[grad_key])
    return new_params
```


## **Concept 7: Data Preparation (The "Moons" Dataset)**

We have a functioning neural network engine. Now we need fuel.
We will use the **"Make Moons"** dataset from Scikit-Learn. It generates two interleaving half-circles.

**Why this dataset?**
A linear classifier (like Logistic Regression) cannot solve this. You cannot draw a single straight line to separate the two moons. This proves our Neural Network's ability to learn **non-linear boundaries**.


We are building a Neural Network, which is complex and computationally expensive. If we just wanted to classify data that can be separated by a straight line (like separating "Tall people" from "Short people"), we could just use simple Logistic Regression.

We need a problem that **fails** on simple models to prove our Neural Network is actually working.

   * **The Shape:** The "Moons" dataset consists of two crescent shapes that interlock like two bananas.
   * **The Challenge:** You cannot draw a single straight line to separate these two shapes. You need a curved, wiggly line.
   * **The Proof:** If our network successfully classifies this, it proves the **Hidden Layer (ReLU)** is doing its job of bending the decision boundary. If the network does no better than a straight-line classifier, we know our non-linearity isn't working.

### Mechanics & Dimensions (Critical)

Scikit-Learn returns data in the standard format: **Rows = Examples**.

* `X`: Shape `(m, 2)` (where m is 1000 examples, 2 features).
* `y`: Shape `(m,)` (Rank-1 array).

**Our Network's Requirement:**
Our math assumes **Columns = Examples**.

* We need `X` to be `(2, m)`.
* We need `y` to be `(1, m)` (a proper row vector, not a Rank-1 array).

### Why Transpose and Reshape? (The "Dimension Mismatch" Trap)

This is one of the most common sources of bugs in Deep Learning.

**The World Standard (Scikit-Learn/Pandas):**
Most data libraries store data like a spreadsheet:
   * **Rows:** Examples (Users, Images, etc.)
   * **Columns:** Features (Age, Height, Pixel values)
   * Shape: `(1000 examples, 2 features)`

**Our Math Standard (Linear Algebra/Vectorization):** 
Our formula is $Z = W \cdot X + b$.

   * $W$ (Weights) has shape `(Neurons, Features)`.
   * To multiply $W$ by $X$, the number of rows of $X$ must equal the number of columns of $W$.
   * Therefore, we need $X$ to be `(Features, Examples)`.

If we feed the default Scikit-Learn data into our code:

$$W_{(4, 2)} \cdot X_{(1000, 2)}$$

**CRASH.** Inner dimensions (2 vs 1000) don't match.
We must Transpose $X$ to be `(2, 1000)` so that:
$$W_{(4, 2)} \cdot X_{(2, 1000)} = Z_{(4, 1000)}$$
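
The mismatch and its fix can be reproduced with placeholder arrays. This is a sketch: the arrays are filled with zeros because only the shapes matter here.

```python
import numpy as np

W = np.zeros((4, 2))            # 4 neurons, 2 features (hypothetical layer)
X_rows = np.zeros((1000, 2))    # Scikit-Learn layout: rows = examples

try:
    np.dot(W, X_rows)           # inner dimensions 2 and 1000 do not match
    failed = False
except ValueError:
    failed = True

X = X_rows.T                    # (2, 1000): columns = examples
Z = np.dot(W, X)

print(failed, Z.shape)
```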

**The Rank-1 Array Issue (`y`):**
Scikit-Learn returns labels `y` as a flat array: `[0, 1, 1, 0...]` with shape `(1000,)`.

   * A rank-1 array has no row or column orientation, so NumPy decides how to align it through broadcasting.
   * If you subtract this from a column vector `(1000, 1)`, NumPy silently broadcasts the result into a giant `(1000, 1000)` matrix instead of what you want.
   * **The Fix:** We force it to be a strict row vector: `(1, 1000)`.
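
A short sketch of how a rank-1 array behaves under broadcasting. The prediction arrays `A_row` and `A_col` are placeholders filled with zeros, introduced only to show the resulting shapes.

```python
import numpy as np

m = 1000
y_flat = np.zeros(m)              # shape (1000,): what Scikit-Learn returns
A_row = np.zeros((1, m))          # predictions as a row vector
A_col = np.zeros((m, 1))          # predictions as a column vector

print((A_row - y_flat).shape)     # (1, 1000): aligns as expected
print((A_col - y_flat).shape)     # (1000, 1000): silent blow-up

Y = y_flat.reshape(1, m)          # the fix: an explicit row vector
print((A_row - Y).shape)          # (1, 1000)
```

Keeping every array explicitly two-dimensional means a shape mistake is more likely to raise an error than to broadcast silently.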

### Your Task

Implement a function `load_data()`:

1. Import `sklearn.datasets`.
2. Use `sklearn.datasets.make_moons` to generate 1000 samples (`n_samples=1000`) with `noise=0.2`.
3. **Transpose X:** Convert shape from `(1000, 2)` to `(2, 1000)`.
4. **Reshape Y:** Convert shape from `(1000,)` to `(1, 1000)`.
5. **Return:** `X`, `Y`.

*(Note: No visualization code required yet, just data processing)*


```python
from sklearn.datasets import make_moons
import numpy as np
```
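
One possible way to complete `load_data()` is sketched below, following the five steps in the task. The `random_state=42` argument is an added assumption so that repeated runs produce the same data; the task itself does not ask for it.

```python
import numpy as np
from sklearn.datasets import make_moons

def load_data():
    # random_state is an added assumption so repeated runs give the same data
    X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
    X = X.T                         # (1000, 2) -> (2, 1000)
    Y = y.reshape(1, y.shape[0])    # (1000,)   -> (1, 1000)
    return X, Y

X, Y = load_data()
print(X.shape, Y.shape)
```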

