<a href="https://colab.research.google.com/github/abdel2ty/IntenseAI_Notebooks_v1/blob/main/numpy_assignments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<a href="https://www.zero-grad.com/">
         <img alt="Zero Grad" src="https://i.postimg.cc/y8LZ0CM6/linear-Algebra.png" >
      </a>

# 🧪 Assignment 1: Build Your Own `train_test_split_np()` Using NumPy


## Introduction




In Machine Learning, we build models that **learn from data** and then **make predictions** on new, unseen data.

But if we train and test our model on the same data, we can't tell whether it is actually good: it might just be memorizing the data (overfitting). To avoid this, we **split the dataset** into two parts:

- **Training Set** 🧠  
  Used by the model to learn patterns from data.

- **Testing Set** 🧪  
  Used to evaluate how well the model performs on new, unseen data.

This approach helps us **estimate how the model will perform in the real world**.

---

### ⚖️ Common Split Ratios

- **80% Train / 20% Test** – most common for general tasks.
- **70% Train / 30% Test** – when you want more test data.
- **90% Train / 10% Test** – when the dataset is large enough that 10% is still plenty of test data.

There's no one "right" ratio — it depends on your dataset size and use case.
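
As a rough illustration of what these ratios mean in rows, the sketch below converts each ratio into train and test counts with `int(n_samples * test_ratio)`, the same rounding used later in `train_test_split_np()`. The dataset size of 1,000 is an arbitrary assumption.

```python
# Hypothetical dataset size, chosen only for illustration.
n_samples = 1000

splits = {}
for test_ratio in (0.2, 0.3, 0.1):
    test_size = int(n_samples * test_ratio)   # rows held out for testing
    train_size = n_samples - test_size        # everything else is used for training
    splits[test_ratio] = (train_size, test_size)
    print(f"test_ratio={test_ratio}: {train_size} train / {test_size} test")
```

Because `int()` truncates, a ratio that does not divide the dataset evenly rounds the test set down and leaves the extra row in the training set.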

---

### 🔀 Should We Shuffle the Data?

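Usually, yes!  
If your rows are sorted in a way that has nothing to do with the task (e.g. grouped by class label or by collection batch), you should **shuffle** them before splitting to avoid bias.

The exception is time-based data: if you want to predict the future from the past, do not shuffle. Train on the earlier rows and test on the later ones, otherwise the model gets to peek at the future.

Shuffling makes it far more likely that both the training and test sets represent the overall data distribution fairly.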
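
To make the effect concrete, here is a small sketch with made-up labels that are sorted by class. It assumes the test set is taken from the first rows, as `train_test_split_np()` does later; the labels and the seed are illustrative only.

```python
import numpy as np

# Hypothetical toy labels sorted by class: 50 zeros followed by 50 ones.
y = np.array([0] * 50 + [1] * 50)
test_size = int(len(y) * 0.2)

# Without shuffling, the first 20 rows form the test set: class 0 only.
unshuffled_test = y[:test_size]

# With shuffling, the row indices are permuted first, so both classes can appear.
rng = np.random.default_rng(0)
shuffled_test = y[rng.permutation(len(y))[:test_size]]

print("unshuffled test classes:", np.unique(unshuffled_test))
print("shuffled test classes:  ", np.unique(shuffled_test))
```

A model evaluated on the unshuffled test set would never be scored on class 1 at all, which is exactly the kind of bias that shuffling is meant to prevent.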

---

### 🧠 Real-Life Analogy

Imagine you're studying for an exam.

- You **practice with sample questions** (training set).
- Then you **test yourself with new questions** you've never seen (test set).

If you only "test" yourself using the same practice questions, you're not really testing your understanding — you're just repeating.

---

### 🧪 Summary

| Term           | Purpose                          |
|----------------|----------------------------------|
| Training Set   | Learn from it                    |
| Testing Set    | Evaluate model on unseen data    |
| Shuffle        | Avoid bias from data order       |
| Test Ratio     | Controls how much data is held out for testing |

Understanding this concept is the first step toward building reliable, real-world ML models.


## Implementation




Create a custom `train_test_split_np()` function using **NumPy only**, similar to scikit-learn's `train_test_split`, with full control over:

- Test ratio (e.g., 20% test)
- Shuffle behavior
- Random seed for reproducibility

In [None]:
import numpy as np

def train_test_split_np(X, y, test_ratio=0.2, seed=None, shuffle=True):
    """
    Splits X and y into training and testing sets using NumPy only.

    Parameters:
        X (ndarray): Feature array of shape (n_samples, n_features)
        y (ndarray): Target array of shape (n_samples, 1) or (n_samples,)
        test_ratio (float): Ratio of data to use for testing (0 < test_ratio < 1)
        seed (int): Optional random seed for reproducibility
        shuffle (bool): Whether to shuffle data before splitting

    Returns:
        X_train, X_test, y_train, y_test
    """
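    # Seeding sets NumPy's global random state, so the same seed gives the same split.
    if seed is not None:
        np.random.seed(seed)

    n_samples = X.shape[0]
    indices = np.arange(n_samples)  # original row order, used when shuffle=False

    if shuffle:
        indices = np.random.permutation(n_samples)

    # The first test_size indices become the test set; the rest are for training.
    test_size = int(n_samples * test_ratio)
    test_indices = indices[:test_size]
    train_indices = indices[test_size:]

    # Index X and y with the same indices so every row keeps its target.
    X_train = X[train_indices]
    X_test = X[test_indices]
    y_train = y[train_indices]
    y_test = y[test_indices]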

    return X_train, X_test, y_train, y_test

In [None]:
# Example usage
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([1, 2, 3, 4, 5])

print("Original X:\n", X)
print("Original y:\n", y)


Original X:
 [[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]
Original y:
 [1 2 3 4 5]


In [None]:
X_train, X_test, y_train, y_test = train_test_split_np(X, y, test_ratio=0.2, seed=42, shuffle=True)

print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_train:\n", y_train)
print("y_test:\n", y_test)

X_train:
 [[ 9 10]
 [ 5  6]
 [ 1  2]
 [ 7  8]]
X_test:
 [[3 4]]
y_train:
 [5 3 1 4]
y_test:
 [2]
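
One possible variation, sketched below under the hypothetical name `train_test_split_rng`, validates its inputs and draws the permutation from a local `np.random.default_rng(seed)` generator instead of reseeding NumPy's global state. This is not the assignment's reference solution, and because the generator is different, the same seed will produce a different split than `train_test_split_np()`.

```python
import numpy as np

def train_test_split_rng(X, y, test_ratio=0.2, seed=None, shuffle=True):
    """Hypothetical variant of train_test_split_np with input checks and a local RNG."""
    X, y = np.asarray(X), np.asarray(y)
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of samples")
    if not 0 < test_ratio < 1:
        raise ValueError("test_ratio must be strictly between 0 and 1")

    n_samples = X.shape[0]
    indices = np.arange(n_samples)
    if shuffle:
        # A local Generator leaves NumPy's global random state untouched.
        rng = np.random.default_rng(seed)
        indices = rng.permutation(n_samples)

    test_size = int(n_samples * test_ratio)
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

a = train_test_split_rng(X_demo, y_demo, test_ratio=0.2, seed=0)
b = train_test_split_rng(X_demo, y_demo, test_ratio=0.2, seed=0)

# Same seed gives an identical split, and no target appears in both sets.
same_split = all(np.array_equal(p, q) for p, q in zip(a, b))
disjoint = np.intersect1d(a[2], a[3]).size == 0
print(same_split, disjoint)
```

Keeping the generator local means that calling the function does not change the random numbers produced elsewhere in the notebook.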


# 📈 Assignment 2: Build Linear Regression from Scratch (Using Your Train-Test Split)


## 🧠 What You'll Learn



In this assignment, you’ll build a **Linear Regression model** step-by-step using only **NumPy** — and apply it to the training and test sets you created in **Assignment 1**.

You will:

- Use your own `train_test_split_np()` function  
- Fit a line to training data using the **Normal Equation**  
- Make predictions on test data



### 🧪 The Idea

Linear Regression tries to find the best-fitting line:

```
y = θ₀ + θ₁x
```

The **Normal Equation** is a closed-form formula for the values of `θ₀` and `θ₁` that minimize the squared error between the line and the data:

```
θ = (XᵀX)⁻¹ Xᵀy
```

Then we use this line to make predictions for new data.
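
As a quick sanity check of the formula, the sketch below applies it to four made-up, noise-free points that lie exactly on `y = 1 + 2x`. The data and variable names are illustrative assumptions; the point is only that the Normal Equation recovers the intercept and slope.

```python
import numpy as np

# Hypothetical noise-free points on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y_line = 1.0 + 2.0 * x

# Design matrix: a column of ones (intercept) next to the feature column.
X_design = np.column_stack([np.ones_like(x), x])

# Normal Equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y_line
print(theta)  # approximately [1. 2.]
```

With noisy data, as in the tasks below, the recovered values will be close to the true parameters but not equal to them.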


## 📝 Tasks

### ✅ Step 1: Generate the Dataset


In [None]:
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (100, 1)
y shape: (100, 1)


### ✅ Step 2: Add Intercept Column (x₀ = 1)


> Example:

![Adding 1](https://i.ibb.co/d0zpGpcB/adding-1.png)


In [None]:
ones = np.ones((X.shape[0], 1))
X_b = np.c_[ones, X]  # Add bias term

print("X_b shape:", X_b.shape)

X_b shape: (100, 2)
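
Why the column of ones? With it, a single matrix product `X_b @ theta` computes `θ₀ + θ₁x` for every row at once. The sketch below checks this on three made-up inputs with hypothetical parameter values.

```python
import numpy as np

# Hypothetical inputs and parameters, used only to illustrate the bias column.
x_small = np.array([[0.5], [1.0], [1.5]])
theta_demo = np.array([[4.0], [3.0]])  # theta_0 = 4 (intercept), theta_1 = 3 (slope)

# Prepend a column of ones so theta_0 is multiplied by 1 in every row.
x_small_b = np.c_[np.ones((x_small.shape[0], 1)), x_small]

via_matrix = x_small_b @ theta_demo                         # one matrix product
via_formula = theta_demo[0, 0] + theta_demo[1, 0] * x_small  # theta_0 + theta_1 * x

print(np.allclose(via_matrix, via_formula))  # True
```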


### ✅ Step 3: Split the Data


Use your function to split `X_b` (the features with the intercept column) and `y`:

```python
X_train, X_test, y_train, y_test = train_test_split_np(X_b, y, test_ratio=0.2, seed=42)
```

In [None]:
X_train, X_test, y_train, y_test = train_test_split_np(X_b, y, test_ratio=0.2, seed=42, shuffle=True)

print(" X_train Shape:", X_train.shape)
print(" X_test Shape:", X_test.shape)
print(" y_train Shape:", y_train.shape)
print(" y_test Shape:", y_test.shape)

 X_train Shape: (80, 2)
 X_test Shape: (20, 2)
 y_train Shape: (80, 1)
 y_test Shape: (20, 1)


### ✅ Step 4: Compute θ Using the Normal Equation

![image.png](https://miro.medium.com/v2/resize:fit:1120/1*7ZiWm6xAF4oWiYfWklUMEw.jpeg)
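
For reference, a short derivation of this formula, assuming $X^\top X$ is invertible and taking the cost to be the sum of squared errors:

$$
\begin{aligned}
J(\theta) &= \lVert X\theta - y \rVert^2 \\
\nabla_\theta J(\theta) &= 2X^\top (X\theta - y) = 0 \\
X^\top X\,\theta &= X^\top y \\
\theta &= (X^\top X)^{-1} X^\top y
\end{aligned}
$$

Here $X$ is the feature matrix including the intercept column (`X_b` in the code) and $y$ is the target column.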

In [None]:
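# Closed-form least-squares solution: theta = (X^T X)^(-1) X^T y.
# Note: this fits on all 100 rows (X_b, y), so the 20 test rows also influence theta_best.
# Fitting on X_train and y_train instead would keep the test set truly unseen.
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y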

print("theta_best shape:", theta_best.shape)
print("theta_best:\n", theta_best)

theta_best shape: (2, 1)
theta_best:
 [[4.21509616]
 [2.77011339]]
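
The cell above passes the full `X_b` and `y` to the Normal Equation, so the test rows take part in the fit. A minimal sketch of fitting on the training split only is shown below. It regenerates the same data, repeats the split logic of `train_test_split_np(X_b, y, test_ratio=0.2, seed=42)` inline so the block runs on its own, and swaps `np.linalg.inv` for `np.linalg.solve`, which solves the same linear system without forming the inverse explicitly. The names `theta_train` and `mse_test` are introduced here for illustration, and the values will differ slightly from the full-data fit.

```python
import numpy as np

# Same data generation as Step 1 and Step 2.
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# Same split logic as train_test_split_np(X_b, y, test_ratio=0.2, seed=42).
np.random.seed(42)
idx = np.random.permutation(100)
test_idx, train_idx = idx[:20], idx[20:]
X_train, y_train = X_b[train_idx], y[train_idx]
X_test, y_test = X_b[test_idx], y[test_idx]

# Solve (X^T X) theta = X^T y using the training rows only.
theta_train = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

# Evaluate on rows the fit has never seen.
mse_test = np.mean((y_test - X_test @ theta_train) ** 2)
print("theta_train:\n", theta_train)
print("held-out MSE:", mse_test)
```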


### ✅ Step 5: Predict on the Test Set

In [None]:
y_pred = X_test @ theta_best

print("y_pred shape:", y_pred.shape)
print("y_pred:\n", y_pred)

y_pred shape: (20, 1)
y_pred:
 [[4.56722383]
 [9.1726426 ]
 [8.4935073 ]
 [7.88561985]
 [5.64879594]
 [6.65364079]
 [5.83364376]
 [8.99688487]
 [4.32913892]
 [6.29013335]
 [6.60816951]
 [7.58103241]
 [8.7329374 ]
 [9.47213722]
 [4.8776754 ]
 [5.07947481]
 [8.48810878]
 [4.62532032]
 [8.82701716]
 [5.15983847]]
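
The same product works for inputs that were never part of the dataset, as long as they get the intercept column too. The sketch below reuses the `theta_best` values printed in Step 4; the two new inputs are hypothetical.

```python
import numpy as np

# Parameters copied from the theta_best output in Step 4.
theta_best = np.array([[4.21509616], [2.77011339]])

# Hypothetical new inputs, not part of the original dataset.
x_new = np.array([[0.0], [1.5]])

# New inputs need the same column of ones that X_b has.
x_new_b = np.c_[np.ones((x_new.shape[0], 1)), x_new]
y_new = x_new_b @ theta_best
print(y_new)
```

For `x = 0` the prediction is just the intercept `θ₀`; for `x = 1.5` it is `θ₀ + 1.5 · θ₁`.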


### ✅ Step 6: Evaluate the Model

Use **Mean Squared Error (MSE)** to evaluate your model's performance:

<img src="https://www.i2tutorials.com/wp-content/media/2019/11/Differences-between-MSE-and-RMSE-1-i2tutorials.jpg" width="400">


In [None]:
mse = np.mean((y_test - y_pred) ** 2)
print("Mean Squared Error (MSE):", mse)

Mean Squared Error (MSE): 0.6330083230690094
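
To see what the MSE line computes, here is a hand-checkable sketch with three made-up values: the errors are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3. It also shows a common shape pitfall, on the assumption that one array is `(n, 1)` and the other is `(n,)`.

```python
import numpy as np

# Hypothetical targets and predictions, both shaped (3, 1) like y_test and y_pred.
y_true_demo = np.array([[3.0], [5.0], [7.0]])
y_pred_demo = np.array([[2.0], [5.0], [9.0]])

# Squared errors are 1, 0 and 4, so the mean is 5 / 3.
mse_demo = np.mean((y_true_demo - y_pred_demo) ** 2)
print(mse_demo)

# Pitfall: subtracting an (n,) array from an (n, 1) array broadcasts to (n, n).
wrong = (y_true_demo - y_pred_demo.ravel()) ** 2
print(wrong.shape)  # (3, 3)
```

Checking that `y_test.shape == y_pred.shape` before computing the error avoids silently averaging over the wrong matrix.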
