## 🧠 What's the Big Idea?

**Linear Regression** fits a *straight line* to your data. But what if your data isn't straight? Like a curve or squiggle? 🤔

### Solution: **Feature Engineering + Polynomial Regression**

We can *transform* the input data (called **features**) to help linear regression model curves, waves, or any complex pattern — **still using a straight-line method!** 😄

---

## 📘 Let’s Understand Step-by-Step

---

### 📌 First, Import Libraries

```python
import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import zscore_normalize_features, run_gradient_descent_feng
np.set_printoptions(precision=2)
```

* `numpy` helps with number crunching and arrays.
* `matplotlib.pyplot` helps us **plot** graphs.
* `zscore_normalize_features`: scales features to help gradient descent work better.
* `run_gradient_descent_feng`: runs the **gradient descent** to learn the model.
* `np.set_printoptions(precision=2)`: just makes the numbers display neatly with 2 decimal places.
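
The body of `run_gradient_descent_feng` is not shown in these notes. As a rough mental model only, batch gradient descent for linear regression might look like the sketch below. The name `gradient_descent_sketch` and its defaults are made up for illustration; this is not the course helper.

```python
import numpy as np

def gradient_descent_sketch(X, y, iterations=1000, alpha=1e-2):
    """Hypothetical stand-in for the course helper: plain batch gradient descent."""
    m, n = X.shape
    w = np.zeros(n)   # one weight per feature column
    b = 0.0
    for _ in range(iterations):
        err = X @ w + b - y               # prediction error for every sample
        w -= alpha * (X.T @ err) / m      # gradient of the halved mean squared error w.r.t. w
        b -= alpha * err.mean()           # gradient w.r.t. b
    return w, b

# Tiny check on data that really is a straight line: y = 1 + 2x
x_demo = np.arange(0, 5, 1)
w_demo, b_demo = gradient_descent_sketch(x_demo.reshape(-1, 1), 1 + 2 * x_demo,
                                         iterations=5000, alpha=0.05)
print(w_demo, b_demo)
```

On this small straight-line example the sketch should land very close to `w = 2`, `b = 1`.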

---

## 🌀 Attempt 1: Fit a Curve Using Just `x`

```python
x = np.arange(0, 20, 1)
y = 1 + x**2
X = x.reshape(-1, 1)

model_w,model_b = run_gradient_descent_feng(X, y, iterations=1000, alpha = 1e-2)

plt.scatter(x, y, marker='x', c='r', label="Actual Value")
plt.title("no feature engineering")
plt.plot(x, X @ model_w + model_b, label="Predicted Value")
plt.xlabel("X"); plt.ylabel("y"); plt.legend(); plt.show()
```

🧾 **Explanation:**

* We're trying to model a curve with only the `x` feature.
* Since `y = 1 + x²` is not a straight line, linear regression **can’t fit it well**.
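
To see that the problem is the straight line and not the training settings, here is a small sketch that skips gradient descent and asks NumPy for the best possible line in closed form (`np.polyfit` is swapped in here; the lab itself uses gradient descent).

```python
import numpy as np

x = np.arange(0, 20, 1)
y = 1 + x**2

# Closed-form least-squares line (a swapped-in technique, not the lab's gradient descent)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

print(f"best line: y = {slope:.2f}x + {intercept:.2f}")
print(f"largest miss: {np.abs(residuals).max():.2f}")
```

Even this best-case line misses the curve by a wide margin at both ends of the range, so more iterations would not help.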

---

## 🧠 Solution: Add a Polynomial Feature — `x²`

```python
x = np.arange(0, 20, 1)
y = 1 + x**2

X = x**2
X = X.reshape(-1, 1)

model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha = 1e-5)

plt.scatter(x, y, marker='x', c='r', label="Actual Value")
plt.title("Added x**2 feature")
plt.plot(x, np.dot(X,model_w) + model_b, label="Predicted Value")
plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
```

📌 **What changed?**

* We replaced `x` with `x²`. Now the input matches the true shape of `y = 1 + x²`.
* Gradient descent now recovers `w ≈ 1` easily. The bias converges more slowly: with these settings the lab reports `b ≈ 0.049` against a target of 1.

---

## 🧪 Try More Features: `x`, `x²`, and `x³`

```python
x = np.arange(0, 20, 1)
y = x**2

X = np.c_[x, x**2, x**3]
model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha=1e-7)

plt.scatter(x, y, marker='x', c='r', label="Actual Value")
plt.title("x, x**2, x**3 features")
plt.plot(x, X @ model_w + model_b, label="Predicted Value")
plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
```

📌 **What's happening?**

* Model uses all features but **gives most weight to x²**.
* Output:

  ```python
  w: [0.08 0.54 0.03], b: 0.01
  ```

---

## 🔍 Visualizing Features

```python
X = np.c_[x, x**2, x**3]
X_features = ['x','x^2','x^3']

fig,ax=plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X[:,i], y)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("y")
plt.show()
```

🧾 **Explanation:**

* Shows how each feature relates to `y`.
* Only the **x² plot looks like a straight line** — best for linear regression.
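
One way to put a number on "looks like a straight line" is the Pearson correlation between each feature and `y`. This check is an addition to the lab, shown as a sketch: all three features correlate strongly with `y`, but only `x²` should reach a correlation of 1.

```python
import numpy as np

x = np.arange(0, 20, 1)
y = x**2
features = {"x": x, "x^2": x**2, "x^3": x**3}

# Pearson correlation of each candidate feature with the target
corr = {name: np.corrcoef(col, y)[0, 1] for name, col in features.items()}
for name, r in corr.items():
    print(f"{name}: {r:.4f}")
```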

---

## ⚖️ Scaling Features

```python
x = np.arange(0,20,1)
X = np.c_[x, x**2, x**3]
print(f"Raw X Range: {np.ptp(X,axis=0)}")

X = zscore_normalize_features(X)     
print(f"Normalized X Range: {np.ptp(X,axis=0)}")
```

📌 Why normalize?

* Features like `x³` become huge → slows learning.
* Z-score normalization fixes this.
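
For intuition, z-score normalization subtracts each column's mean and divides by its standard deviation. The sketch below assumes that is all the helper does; `zscore_sketch` is a hypothetical name, and the real `zscore_normalize_features` may differ in details such as its return values.

```python
import numpy as np

def zscore_sketch(X):
    """Hypothetical stand-in: scale each column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)       # per-column mean
    sigma = X.std(axis=0)     # per-column standard deviation
    return (X - mu) / sigma

x = np.arange(0, 20, 1)
X = np.c_[x, x**2, x**3]
X_norm = zscore_sketch(X)

print("raw range:       ", np.ptp(X, axis=0))       # columns span 19, 361 and 6859
print("normalized range:", np.ptp(X_norm, axis=0))
```

After normalization the three columns cover similar ranges, so a single learning rate suits all of them.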

---

## ✅ Final Fit with Scaling

```python
x = np.arange(0,20,1)
y = x**2

X = np.c_[x, x**2, x**3]
X = zscore_normalize_features(X)

model_w, model_b = run_gradient_descent_feng(X, y, iterations=100000, alpha=1e-1)

plt.scatter(x, y, marker='x', c='r', label="Actual Value")
plt.title("Normalized x, x**2, x**3 features")
plt.plot(x,X@model_w + model_b, label="Predicted Value")
plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
```

✨ **Result:**

* Much faster convergence: normalization lets us raise `alpha` from `1e-7` to `1e-1`.
* x² again dominates the model → it’s the right feature.

---

## 🎢 Modeling Complex Functions: `cos(x/2)`

```python
x = np.arange(0,20,1)
y = np.cos(x/2)

X = np.column_stack([x**p for p in range(1, 14)])   # powers x**1 through x**13
X = zscore_normalize_features(X)

model_w,model_b = run_gradient_descent_feng(X, y, iterations=1000000, alpha = 1e-1)

plt.scatter(x, y, marker='x', c='r', label="Actual Value")
plt.title("Normalized Polynomial Features up to x^13")
plt.plot(x,X@model_w + model_b, label="Predicted Value")
plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()
```

📌 This is amazing!

* We're fitting a **wave-like cosine** using **only linear regression** — by just adding many powers of `x`!
* Feature engineering makes linear regression flexible and powerful.
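
As a cross-check that a degree-13 polynomial really can follow this cosine, the sketch below uses NumPy's closed-form `Polynomial.fit` in place of the lab's gradient descent. It is a swapped-in technique, so the coefficients are not comparable with `model_w`.

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.arange(0, 20, 1)
y = np.cos(x / 2)

# Least-squares polynomial of degree 13. Polynomial.fit rescales x to [-1, 1]
# internally, which plays a role similar to feature normalization.
poly = Polynomial.fit(x, y, deg=13)
max_err = np.abs(poly(x) - y).max()
print(f"largest error at the sample points: {max_err:.2e}")
```

The error at the 20 sample points should be tiny. That says nothing about how the polynomial behaves between or beyond those points.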

---

## 🏁 Summary

| Concept                   | Explanation                                                              |
| ------------------------- | ------------------------------------------------------------------------ |
| **Feature Engineering**   | Creating new features like `x²`, `x³`, etc. to help the model fit curves |
| **Polynomial Regression** | Linear regression using polynomial (powered) features                    |
| **Gradient Descent**      | Finds the best-fit parameters for the model                              |
| **Feature Scaling**       | Normalizing features helps gradient descent converge faster              |
| **Model Interpretation**  | After fitting, larger weights mark more useful features and near-zero weights mark unhelpful ones (compare weights on normalized features) |


# Optional Lab: Feature Engineering and Polynomial Regression

![](./images/C1_W2_Lab07_FeatureEngLecture.PNG)


## Goals
In this lab you will:
- explore feature engineering and polynomial regression, which allow you to use the machinery of linear regression to fit very complicated, even highly non-linear functions.


## Tools
You will use the functions developed in previous labs as well as matplotlib and NumPy.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from lab_utils_multi import zscore_normalize_features, run_gradient_descent_feng
np.set_printoptions(precision=2)  # reduced display precision on numpy arrays

<a name='FeatureEng'></a>
# Feature Engineering and Polynomial Regression Overview

Out of the box, linear regression provides a means of building models of the form:
$$f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1+ ... + w_{n-1}x_{n-1} + b \tag{1}$$
What if your features/data are non-linear or are combinations of features? For example, housing prices do not tend to be linear in living area; they penalize very small or very large houses, resulting in the curves shown in the graphic above. How can we use the machinery of linear regression to fit this curve? Recall, the 'machinery' we have is the ability to modify the parameters $\mathbf{w}$, $b$ in (1) to 'fit' the equation to the training data. However, no amount of adjusting $\mathbf{w}$, $b$ in (1) will achieve a fit to a non-linear curve.


<a name='PolynomialFeatures'></a>
## Polynomial Features

Above we were considering a scenario where the data was non-linear. Let's try using what we know so far to fit a non-linear curve. We'll start with a simple quadratic: $y = 1+x^2$

You're familiar with all the routines we're using. They are available in the `lab_utils_multi.py` file for review. We'll use [`np.c_[..]`](https://numpy.org/doc/stable/reference/generated/numpy.c_.html), which is a NumPy routine to concatenate along the column boundary.

In [None]:
# create target data
x = np.arange(0, 20, 1)
y = 1 + x**2
X = x.reshape(-1, 1)

model_w,model_b = run_gradient_descent_feng(X,y,iterations=1000, alpha = 1e-2)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("no feature engineering")
plt.plot(x,X@model_w + model_b, label="Predicted Value");  plt.xlabel("X"); plt.ylabel("y"); plt.legend(); plt.show()

Well, as expected, not a great fit. What is needed is something like $y= w_0x_0^2 + b$, or a **polynomial feature**.
To accomplish this, you can modify the *input data* to *engineer* the needed features. If you swap the original data with a version that squares the $x$ value, then you can achieve $y= w_0x_0^2 + b$. Let's try it. Replace `x` with `x**2` when building `X` below:

In [None]:
X.shape

In [None]:
# create target data
x = np.arange(0, 20, 1)
y = 1 + x**2

# Engineer features 
X = x**2      #<-- added engineered feature

In [None]:
X = X.reshape(-1, 1)  #X should be a 2-D Matrix
model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha = 1e-5)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Added x**2 feature")
plt.plot(x, np.dot(X,model_w) + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Great! A near-perfect fit. Notice the values of $\mathbf{w}$ and $b$ printed right above the graph: `w,b found by gradient descent: w: [1.], b: 0.0490`. Gradient descent modified our initial values of $\mathbf{w}, b$ to be (1.0, 0.049), a model of $y=1 \cdot x_0^2+0.049$, very close to our target of $y=1 \cdot x_0^2+1$. If you ran it longer, it could be a better match.
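
To see where a longer run would be heading, the sketch below solves the same least-squares problem in closed form with `np.linalg.lstsq`. This is a swapped-in technique used only as a reference point, not what the lab helper does.

```python
import numpy as np

x = np.arange(0, 20, 1)
y = 1 + x**2

# Design matrix: engineered feature x**2 plus a column of ones for the bias b
A = np.c_[x**2, np.ones_like(x)]
(w_exact, b_exact), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"w = {w_exact:.4f}, b = {b_exact:.4f}")
```

The closed-form answer recovers the target parameters, so the remaining gap in $b$ above comes from stopping gradient descent early rather than from the model.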

### Selecting Features
<a name='GDF'></a>
Above, we knew that an $x^2$ term was required. It may not always be obvious which features are required. One could add a variety of potential features to try to find the most useful. For example, what if we had instead tried: $y=w_0x + w_1x^2 + w_2x^3+b$?

Run the next cells. 

In [None]:
# create target data
x = np.arange(0, 20, 1)
y = x**2

# engineer features .
X = np.c_[x, x**2, x**3]   #<-- added engineered feature

In [None]:
model_w,model_b = run_gradient_descent_feng(X, y, iterations=10000, alpha=1e-7)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("x, x**2, x**3 features")
plt.plot(x, X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Note the value of $\mathbf{w}$, `[0.08 0.54 0.03]`, and that $b$ is `0.0106`. This implies the model after fitting/training is:
$$ 0.08x + 0.54x^2 + 0.03x^3 + 0.0106 $$
Gradient descent has emphasized the feature that best fits the data, $x^2$, by increasing the $w_1$ term relative to the others. If you were to run for a very long time, it would continue to reduce the impact of the other terms.
>Gradient descent is picking the 'correct' features for us by emphasizing their associated parameters

Let's review this idea:
- a smaller weight implies a less important/correct feature; in the extreme, when the weight becomes zero or very close to zero, the associated feature is not useful in fitting the model to the data.
- above, after fitting, the weight associated with the $x^2$ feature is much larger than the weights for $x$ or $x^3$, as it is the most useful in fitting the data.
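
The "run for a very long time" limit can be previewed without waiting for gradient descent. The sketch below solves the same three-feature problem in closed form with `np.linalg.lstsq`. It is a swapped-in technique used as a reference point, and the finite run above is still far from it.

```python
import numpy as np

x = np.arange(0, 20, 1)
y = x**2

# Features x, x**2, x**3 plus a column of ones for the bias b
A = np.c_[x, x**2, x**3, np.ones_like(x)]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w_limit, b_limit = coef[:3], coef[3]
print(np.round(w_limit, 6), round(float(b_limit), 6))
```

In this limit the weights on $x$ and $x^3$ vanish and only the $x^2$ weight survives, which is the sense in which the fit "selects" the feature.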

### An Alternate View
Above, polynomial features were chosen based on how well they matched the target data. Another way to think about this is to note that we are still using linear regression once we have created new features. Given that, the best features will be linear relative to the target. This is best understood with an example. 

In [None]:
# create target data
x = np.arange(0, 20, 1)
y = x**2

# engineer features .
X = np.c_[x, x**2, x**3]   #<-- added engineered feature
X_features = ['x','x^2','x^3']

In [None]:
fig,ax=plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X[:,i],y)
    ax[i].set_xlabel(X_features[i])
ax[0].set_ylabel("y")
plt.show()

Above, it is clear that the $x^2$ feature mapped against the target value $y$ is linear. Linear regression can then easily generate a model using that feature.
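
Written out for this example (a small worked step added for clarity), substituting $z = x^2$ turns the target into a straight line in $z$:
$$ y = x^2 = 1 \cdot z + 0, \qquad z = x^2 $$
so a single-feature linear model $f_{w,b}(z) = wz + b$ fits it exactly with $w = 1$, $b = 0$, whereas no choice of $w$ and $b$ makes $wx + b$ or $wx^3 + b$ match $x^2$ at every sample.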

### Scaling features
As described in the last lab, if the data set has features with significantly different scales, one should apply feature scaling to speed gradient descent. In the example above, there is $x$, $x^2$ and $x^3$ which will naturally have very different scales. Let's apply Z-score normalization to our example.

In [None]:
# create target data
x = np.arange(0,20,1)
X = np.c_[x, x**2, x**3]
print(f"Peak to Peak range by column in Raw        X:{np.ptp(X,axis=0)}")

# apply z-score normalization
X = zscore_normalize_features(X)     
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X,axis=0)}")

Now we can try again with a more aggressive value of alpha:

In [None]:
x = np.arange(0,20,1)
y = x**2

X = np.c_[x, x**2, x**3]
X = zscore_normalize_features(X) 

model_w, model_b = run_gradient_descent_feng(X, y, iterations=100000, alpha=1e-1)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Normalized x, x**2, x**3 features")
plt.plot(x,X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

Feature scaling allows this to converge much faster.   
Note again the values of $\mathbf{w}$. The $w_1$ term, which is the $x^2$ term is the most emphasized. Gradient descent has all but eliminated the $x^3$ term.

### Complex Functions
With feature engineering, even quite complex functions can be modeled:

In [None]:
x = np.arange(0,20,1)
y = np.cos(x/2)

X = np.c_[x, x**2, x**3,x**4, x**5, x**6, x**7, x**8, x**9, x**10, x**11, x**12, x**13]
X = zscore_normalize_features(X) 

model_w,model_b = run_gradient_descent_feng(X, y, iterations=1000000, alpha = 1e-1)

plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Normalized polynomial features up to x**13")
plt.plot(x,X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()


In [None]:
from sklearn.metrics import r2_score

r2_score(y, X@model_w + model_b)
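
For readers unfamiliar with the metric: `r2_score` reports the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. A minimal NumPy sketch of that formula follows. `r2_sketch` is a hypothetical name, and it skips the options and edge cases that scikit-learn handles.

```python
import numpy as np

def r2_sketch(y_true, y_pred):
    """Hypothetical re-implementation of the coefficient of determination."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_demo = np.cos(np.arange(0, 20, 1) / 2)
r2_perfect = r2_sketch(y_demo, y_demo)                                # perfect predictions
r2_mean = r2_sketch(y_demo, np.full_like(y_demo, y_demo.mean()))      # always predicting the mean
print(r2_perfect, r2_mean)
```

A score of 1 means the predictions match the targets exactly, and a score near 0 means the model does no better than always predicting the mean of `y`.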


## Congratulations!
In this lab you:
- learned how linear regression can model complex, even highly non-linear functions using feature engineering
- recognized that it is important to apply feature scaling when doing feature engineering

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate input range
x = np.linspace(0, 20, 200)  # dense sampling

# Complex target function:
# - a cubic trend
# - a sinusoidal component
# - a piecewise bump between x ∈ [10,12]
# - plus Gaussian noise
y = (
    0.01 * x**3                              # cubic trend
    - 0.5 * x**2                            # quadratic down-curving
    + 3 * x                                 # linear uplift
    + 5 * np.sin(1.5 * x)                   # sinusoidal oscillation
    + np.where((x >= 10) & (x <= 12), 8, 0) # localized “bump”
    + np.random.normal(0, 3.0, size=x.shape) # noise
)
X = x.reshape(-1, 1)

# Show the data
plt.figure(figsize=(10, 4))
plt.scatter(X, y, s=15, c='navy', alpha=0.6)
plt.title("Complex Non‑Linear Dataset")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


In [None]:
print(X.shape , y.shape)

X = np.c_[x, x**2, x**3,x**4, x**5, x**6, x**7, x**8, x**9, x**10, x**11, x**12, x**13,x**14, x**15, x**16, x**17, x**18, x**19, x**20, x**21, x**22, x**23]

print(X.shape , y.shape)

X = zscore_normalize_features(X)  # normalize features

print(X.shape , y.shape)

model_w, model_b = run_gradient_descent_feng(X, y, iterations=1000000, alpha=0.001)
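
Typing out 23 powers by hand is error-prone. One possible alternative is to build the columns with broadcasting, as in the sketch below. `poly_features` is a hypothetical helper, not part of `lab_utils_multi`, and it assumes `x` is a 1-D float array (integer arrays can overflow at high powers).

```python
import numpy as np

def poly_features(x, degree):
    """Hypothetical helper: columns x**1, x**2, ..., x**degree."""
    powers = np.arange(1, degree + 1)
    return x[:, None] ** powers    # broadcasting gives shape (len(x), degree)

x_demo = np.linspace(0, 20, 200)
X_demo = poly_features(x_demo, 23)
print(X_demo.shape)                # (200, 23)
```

The result has the same layout as the hand-written `np.c_[...]` expression, so it can be passed to the normalization step unchanged.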


In [None]:
plt.scatter(x, y, marker='x', c='r', label="Actual Value"); plt.title("Normalized polynomial features up to x**23")
plt.plot(x,X@model_w + model_b, label="Predicted Value"); plt.xlabel("x"); plt.ylabel("y"); plt.legend(); plt.show()

In [None]:
from sklearn.metrics import r2_score
r2_score(y, X@model_w + model_b)