**Chapter 1 – The Machine Learning landscape**

_This notebook contains the code examples in chapter 1. You'll also find the exercise solutions at the end of the notebook. The rest of this notebook is used to generate `lifesat.csv` from the original data sources, and some of this chapter's figures._

You're welcome to go through the code in this notebook if you want, but the real action starts in the next chapter.

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ageron/handson-ml3/blob/main/01_the_machine_learning_landscape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-ml3/blob/main/01_the_machine_learning_landscape.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

# Setup

This project requires Python 3.7 or above:

In [None]:
import sys

assert sys.version_info >= (3, 7)

Scikit-Learn ≥1.0.1 is required:

In [None]:
from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

Let's define the default font sizes, to plot pretty figures:

In [None]:
import matplotlib.pyplot as plt

plt.rc('font', size=12)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=12)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

Make this notebook's output stable across runs:

In [None]:
import numpy as np

np.random.seed(42)

# Code example 1-1

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Download and prepare the data
data_root = "https://github.com/ageron/data/raw/main/"
lifesat = pd.read_csv(data_root + "lifesat/lifesat.csv")
X = lifesat[["GDP per capita (USD)"]].values
y = lifesat[["Life satisfaction"]].values

# Visualize the data
lifesat.plot(kind='scatter', grid=True,
             x="GDP per capita (USD)", y="Life satisfaction")
plt.axis([23_500, 62_500, 4, 9])
plt.show()

# Select a linear model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[37_655.2]]  # Cyprus' GDP per capita in 2020
print(model.predict(X_new)) # outputs [[6.30165767]]

To switch from Linear Regression to k-Nearest Neighbors regression (in this example, k = 3), just replace these two lines in the previous code:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
```

with these two:

```python
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=3)
```

In [None]:
# Select a 3-Nearest Neighbors regression model
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=3)

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
print(model.predict(X_new)) # outputs [[6.33333333]]


# Generating the data and figures – please skip

This is the code I used to generate the `lifesat.csv` dataset. You can safely skip this.

Create a function to save the figures:

In [None]:
from pathlib import Path

# Where to save the figures
IMAGES_PATH = Path() / "images" / "fundamentals"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Load and prepare Life satisfaction data

To create `lifesat.csv`, I downloaded the Better Life Index (BLI) data from [OECD's website](http://stats.oecd.org/index.aspx?DataSetCode=BLI) (to get the Life Satisfaction for each country), and World Bank GDP per capita data from [OurWorldInData.org](https://ourworldindata.org/grapher/gdp-per-capita-worldbank). The BLI data is in `datasets/lifesat/oecd_bli.csv` (data from 2020), and the GDP per capita data is in `datasets/lifesat/gdp_per_capita.csv` (data up to 2020).

If you want to grab the latest versions, please feel free to do so. However, there may be some changes (e.g., in the column names, or different countries missing data), so be prepared to have to tweak the code.

In [None]:
import urllib.request

datapath = Path() / "datasets" / "lifesat"
datapath.mkdir(parents=True, exist_ok=True)

data_root = "https://github.com/ageron/data/raw/main/"
for filename in ("oecd_bli.csv", "gdp_per_capita.csv"):
    if not (datapath / filename).is_file():
        print("Downloading", filename)
        url = data_root + "lifesat/" + filename
        urllib.request.urlretrieve(url, datapath / filename)

In [None]:
oecd_bli = pd.read_csv(datapath / "oecd_bli.csv")
gdp_per_capita = pd.read_csv(datapath / "gdp_per_capita.csv")

Preprocess the GDP per capita data to keep only the year 2020:

In [None]:
gdp_year = 2020
gdppc_col = "GDP per capita (USD)"
lifesat_col = "Life satisfaction"

gdp_per_capita = gdp_per_capita[gdp_per_capita["Year"] == gdp_year]
gdp_per_capita = gdp_per_capita.drop(["Code", "Year"], axis=1)
gdp_per_capita.columns = ["Country", gdppc_col]
gdp_per_capita.set_index("Country", inplace=True)

gdp_per_capita.head()

Preprocess the OECD BLI data to keep only the `Life satisfaction` column:

In [None]:
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")

oecd_bli.head()

Now let's merge the life satisfaction data and the GDP per capita data, keeping only the GDP per capita and Life satisfaction columns:

In [None]:
full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                              left_index=True, right_index=True)
full_country_stats.sort_values(by=gdppc_col, inplace=True)
full_country_stats = full_country_stats[[gdppc_col, lifesat_col]]

full_country_stats.head()

To illustrate the risk of overfitting, I use only part of the data in most figures (all countries with a GDP per capita between `min_gdp` and `max_gdp`). Later in the chapter I reveal the missing countries, and show that they don't follow the same linear trend at all.

In [None]:
min_gdp = 23_500
max_gdp = 62_500

country_stats = full_country_stats[(full_country_stats[gdppc_col] >= min_gdp) &
                                   (full_country_stats[gdppc_col] <= max_gdp)]
country_stats.head()

In [None]:
country_stats.to_csv(datapath / "lifesat.csv")
full_country_stats.to_csv(datapath / "lifesat_full.csv")

In [None]:
country_stats.plot(kind='scatter', figsize=(5, 3), grid=True,
                   x=gdppc_col, y=lifesat_col)

min_life_sat = 4
max_life_sat = 9

position_text = {
    "Turkey": (29_500, 4.2),
    "Hungary": (28_000, 6.9),
    "France": (40_000, 5),
    "New Zealand": (28_000, 8.2),
    "Australia": (50_000, 5.5),
    "United States": (59_000, 5.3),
    "Denmark": (46_000, 8.5)
}

for country, pos_text in position_text.items():
    pos_data_x = country_stats[gdppc_col].loc[country]
    pos_data_y = country_stats[lifesat_col].loc[country]
    country = "U.S." if country == "United States" else country
    plt.annotate(country, xy=(pos_data_x, pos_data_y),
                 xytext=pos_text, fontsize=12,
                 arrowprops=dict(facecolor='black', width=0.5,
                                 shrink=0.08, headwidth=5))
    plt.plot(pos_data_x, pos_data_y, "ro")

plt.axis([min_gdp, max_gdp, min_life_sat, max_life_sat])

save_fig('money_happy_scatterplot')
plt.show()

<details>
<summary><b> ✅AI Understanding Template-ML Landscape - 01 </b></summary>

---

# ✅ **AI Understanding Template – Linear Regression + kNN Regression Code**

## **1. What is it?**

A small ML pipeline that:

* Loads GDP & life-satisfaction data
* Visualizes it
* Trains **Linear Regression** and **k-Nearest Neighbors Regression**
* Predicts Life Satisfaction for Cyprus
* Saves plots and downloads the raw OECD (life satisfaction) and World Bank (GDP per capita) datasets
* Cleans and merges datasets into a final CSV

---

## **2. How does it reason?**

### **Linear Regression**

* Fits a **straight line**:
  $y = aX + b$
* Predicts life satisfaction by projecting GDP onto that line.

### **k-Nearest Neighbors (kNN)**

* Finds the **3 closest GDP values** to Cyprus.
* Averages their Life Satisfaction values.
* More local, non-linear reasoning.
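
The kNN reasoning above can be reproduced by hand. The following is a minimal sketch with NumPy, assuming made-up GDP and life satisfaction values (they are not taken from `lifesat.csv`); `knn_predict` is a hypothetical helper name, not a scikit-learn function.

```python
import numpy as np

# Toy data, invented for illustration (not the real lifesat.csv values)
gdp = np.array([25_000.0, 30_000.0, 36_000.0, 38_000.0, 41_000.0, 55_000.0])
life_sat = np.array([5.5, 6.0, 6.2, 6.4, 6.6, 7.2])

def knn_predict(x_new, X_train, y_train, k=3):
    """Average the targets of the k training points closest to x_new."""
    distances = np.abs(X_train - x_new)   # single feature: absolute difference
    nearest = np.argsort(distances)[:k]   # indices of the k smallest distances
    return y_train[nearest].mean()

prediction = knn_predict(37_655.2, gdp, life_sat, k=3)
print(prediction)
```

Here the three nearest GDP values are 36,000, 38,000 and 41,000, so the prediction is the mean of their targets (about 6.4).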

---

## **3. Where does it fail?**

Linear Regression fails when:

* Relationship is **non-linear**
* Outliers distort the best-fit line

kNN fails when:

* Data is **sparse**
* GDP neighbors are **not similar** socially/culturally
* Wrong value of **k** chosen

Both fail if:

* Input GDP is **outside training range** (extrapolation).
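
The two models fail differently when extrapolating. The sketch below uses invented, perfectly linear toy data (an assumption made only to keep the numbers easy to follow): the linear model keeps extending its line, while kNN returns the same average of the last three training points however far out the input is.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Toy training data, invented for illustration: GDP from 20k to 60k
X_train = np.linspace(20_000, 60_000, 9).reshape(-1, 1)
y_train = 4.0 + 5e-5 * X_train.ravel()    # perfectly linear toy target

lin = LinearRegression().fit(X_train, y_train)
knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

X_far = [[100_000], [200_000]]            # far outside the training range
lin_far = lin.predict(X_far)
knn_far = knn.predict(X_far)
print(lin_far)   # keeps climbing along the fitted line
print(knn_far)   # identical value for both inputs
```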

---

## **4. When should I use it?**

Use Linear Regression when:

* Relationship is roughly **linear**
* You want interpretability

Use kNN when:

* Relationship is **curved** or **local**
* You want predictions based on **similar neighbours**

---

## **5. What is the mental model?**

### Linear Regression

> **“Draw the best straight line through the points and read values from the line.”**

### kNN

> **“Find 3 similar countries and average their happiness.”**

---

## **6. How do I prompt it?**

(Not prompting an AI, but how to *use* the models.)

* Give GDP values as a **2D array**:
  `[[37655.2]]`
* Fit the model with:
  `model.fit(X, y)`
* Predict with:
  `model.predict(X_new)`
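* Scale the features when using kNN with more than one feature, since it relies on distances (with GDP as the only feature, scaling does not change which neighbors are nearest).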

---

## **7. What are alternatives?**

| Method                          | Why use it?                       |
| ------------------------------- | --------------------------------- |
| **Polynomial Regression**       | When curve fits better than line  |
| **Random Forest Regression**    | Handles complex relations, robust |
| **Gradient Boosting (XGBoost)** | High accuracy on tabular data     |
| **Neural Networks**             | When dataset grows large          |
| **LOESS/Smoothing**             | Non-parametric local regression   |

---

# ✅ **Code Explanation (Short & Clear)**

---

## **IMPORTS**

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
```

* Loads libraries for charts, math, dataframes, and ML models.

---

## **LOAD DATA**

```python
data_root = "https://github.com/ageron/data/raw/main/"
lifesat = pd.read_csv(data_root + "lifesat/lifesat.csv")
```

* Downloads pre-cleaned GDP + life satisfaction dataset.

### Split columns:

```python
X = lifesat[["GDP per capita (USD)"]].values
y = lifesat[["Life satisfaction"]].values
```

---

## **VISUALIZE**

```python
lifesat.plot(kind='scatter', grid=True, x="GDP per capita (USD)", y="Life satisfaction")
plt.axis([23500, 62500, 4, 9])
plt.show()
```

* Scatter plot to see the relationship.

---

## **LINEAR MODEL**

```python
model = LinearRegression()
model.fit(X, y)
```

* Fits a straight-line model.

### Predict Cyprus:

```python
X_new = [[37655.2]]
print(model.predict(X_new))
```

---

## **KNN MODEL (3 neighbors)**

```python
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
print(model.predict(X_new))
```

* Predicts using 3 nearest GDP values.

---

## **SAVE FIGURES**

```python
IMAGES_PATH = Path() / "images" / "fundamentals"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)
```

* Prepares folder for saving images.

### Save function:

```python
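def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)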
```

---

## **DOWNLOAD RAW DATA**

```python
for filename in ("oecd_bli.csv", "gdp_per_capita.csv"):
    url = data_root + "lifesat/" + filename  # build the download URL
    urllib.request.urlretrieve(url, datapath / filename)
```

* Downloads the two raw CSV files (OECD BLI and World Bank GDP per capita) from the `ageron/data` GitHub repository.

---

## **CLEAN AND PREPARE DATA**

```python
gdp_per_capita = gdp_per_capita[gdp_per_capita["Year"] == gdp_year]
gdp_per_capita = gdp_per_capita.drop(["Code", "Year"], axis=1)  # drop() returns a new DataFrame
...
```

* Filters for year 2020
* Renames columns
* Sets country as index

---

## **PIVOT LIFE SATISFACTION (OECD data)**

```python
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
oecd_bli = oecd_bli.pivot(...)
```

* Removes inequality variations
* Turns indicators into columns

---

## **MERGE BOTH DATASETS**

```python
full_country_stats = pd.merge(...)
full_country_stats.sort_values(...)
```

* Combine happiness + GDP into one table
* Sort by GDP

---

## **FILTER GDP RANGE**

```python
country_stats = full_country_stats[(...)]
```

* Keeps values between 23.5k and 62.5k USD.
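
The range filter is a boolean mask. Here is a minimal sketch on an invented four-row table (the country labels `A` to `D` are placeholders), which also shows how the complement of the mask gives the excluded rows.

```python
import pandas as pd

# Toy table, invented for illustration
df = pd.DataFrame(
    {"GDP per capita (USD)": [12_000, 30_000, 45_000, 90_000]},
    index=["A", "B", "C", "D"],
)

min_gdp, max_gdp = 23_500, 62_500
col = "GDP per capita (USD)"

in_range = (df[col] >= min_gdp) & (df[col] <= max_gdp)  # element-wise AND
kept = df[in_range]       # rows inside the range
dropped = df[~in_range]   # complement: rows outside the range

# Series.between is an equivalent spelling (inclusive on both ends by default)
kept_between = df[df[col].between(min_gdp, max_gdp)]
print(kept.index.tolist(), dropped.index.tolist())
```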

---

## **SAVE AS CSV**

```python
country_stats.to_csv(...)
```

---

## **ANNOTATE COUNTRIES ON PLOT**

Uses:

```python
plt.annotate(...)
```

* Adds labels + arrows for selected countries
* Helps visualize outliers and unique positions.

---

## **FINAL PLOT + SAVE**

```python
save_fig('money_happy_scatterplot')
plt.show()
```

---


</details>

<details>
<summary><b> ✅AI Understanding Template-ML landscape -01 - CheatSheet and QnA </b></summary>
#1 - Cheatsheet

---

# 🧭 **Life Satisfaction ML Pipeline – Cheat Sheet (Super-Short)**

---

## **1. Imports**

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from pathlib import Path
import urllib.request
```

---

# 📥 **2. Load Dataset**

```python
data_root = "https://github.com/ageron/data/raw/main/"
lifesat = pd.read_csv(data_root + "lifesat/lifesat.csv")
X = lifesat[["GDP per capita (USD)"]].values
y = lifesat[["Life satisfaction"]].values
```

---

# 📊 **3. Visualize Relationship**

```python
lifesat.plot(kind='scatter', grid=True,
             x="GDP per capita (USD)", y="Life satisfaction")
plt.axis([23500, 62500, 4, 9])
plt.show()
```

---

# 🤖 **4. Train Models**

### **Linear Regression**

```python
model = LinearRegression()
model.fit(X, y)
```

### **kNN Regression (k=3)**

```python
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
```

---

# 📌 **5. Predict for Cyprus**

```python
X_new = [[37655.2]]
model.predict(X_new)
```

---

# 💾 **6. Save Figures**

```python
IMAGES_PATH = Path("images/fundamentals")
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id):  # fig_id avoids shadowing the built-in id()
    plt.tight_layout()
    plt.savefig(IMAGES_PATH / f"{fig_id}.png", dpi=300)
```

---

# 🔽 **7. Download Raw Datasets (OECD BLI + GDP per capita)**

```python
datapath = Path("datasets/lifesat")
datapath.mkdir(parents=True, exist_ok=True)

for file in ("oecd_bli.csv", "gdp_per_capita.csv"):
    url = data_root + "lifesat/" + file
    urllib.request.urlretrieve(url, datapath / file)
```

---

# 🧹 **8. Prepare GDP Dataset**

```python
gdp = pd.read_csv(datapath / "gdp_per_capita.csv")
gdp = gdp[gdp["Year"] == 2020]
gdp = gdp.drop(["Code", "Year"], axis=1)
gdp.columns = ["Country", "GDP per capita (USD)"]
gdp.set_index("Country", inplace=True)
```

---

# 🧹 **9. Prepare Life Satisfaction Dataset**

```python
bli = pd.read_csv(datapath / "oecd_bli.csv")
bli = bli[bli["INEQUALITY"] == "TOT"]
bli = bli.pivot(index="Country", columns="Indicator", values="Value")
```

---

# 🔗 **10. Merge Both Datasets**

```python
full_stats = pd.merge(bli, gdp, left_index=True, right_index=True)
full_stats.sort_values(by="GDP per capita (USD)", inplace=True)
```

---

# 🪜 **11. Filter GDP Range**

```python
min_gdp, max_gdp = 23500, 62500
stats = full_stats[(full_stats["GDP per capita (USD)"] >= min_gdp) &
                   (full_stats["GDP per capita (USD)"] <= max_gdp)]
```

---

# 💾 **12. Save Cleaned CSV**

```python
stats.to_csv(datapath / "lifesat.csv")
full_stats.to_csv(datapath / "lifesat_full.csv")
```

---

# 🖊️ **13. Annotate Countries on Plot**

```python
for country, text_pos in position_text.items():
    x = stats.loc[country]["GDP per capita (USD)"]
    y = stats.loc[country]["Life satisfaction"]
    plt.annotate(country, xy=(x, y), xytext=text_pos,
                 arrowprops=dict(facecolor="black"))
```

---

# 📈 **14. Final Visualization**

```python
plt.axis([23500, 62500, 4, 9])
save_fig("money_happy_scatterplot")
plt.show()
```

---

# ⭐ **Ultra-Fast Summary (For Revision)**

### ✔ Load → Visualize → Train (Linear + kNN) → Predict

### ✔ Download raw data → Clean → Merge → Filter → Save

### ✔ Plot + Annotate → Export figure

---



#2 - Interview-style Q&A

---

# 🔍 **Interview-Style Q&A Summary – Life Satisfaction ML Pipeline**

---

### **1. What problem is this code solving?**

It predicts **Life Satisfaction** of a country using **GDP per capita**.

---

### **2. What data is being used?**

Two sources, merged by country:

* **Life satisfaction scores** from the OECD Better Life Index (BLI)
* **GDP per capita** from the World Bank (via OurWorldInData.org)

---

### **3. How is the data loaded?**

```python
pd.read_csv("...lifesat.csv")
```

Or downloaded directly from GitHub using `urllib.request`.

---

### **4. How are features and labels defined?**

* **X** = GDP per capita
* **y** = Life satisfaction

```python
X = lifesat[["GDP per capita (USD)"]].values
y = lifesat[["Life satisfaction"]].values
```

---

### **5. How is the relationship visualized?**

A scatterplot to inspect linearity:

```python
lifesat.plot(kind='scatter', x="GDP per capita (USD)", y="Life satisfaction")
```

---

### **6. What ML models are used?**

Two models:

1. **Linear Regression** (assumes straight-line relationship)
2. **k-Nearest Neighbors Regressor (k=3)** (non-linear, local averaging)

---

### **7. How are models trained?**

```python
model.fit(X, y)
```

---

### **8. What is the prediction example?**

Predict Life Satisfaction for Cyprus using GDP `37655.2`.

* Linear Regression → ~6.30
* kNN (k=3) → ~6.33

---

### **9. Why try two models?**

To compare:

* **Global linear trend** vs
* **Local neighbor-based prediction**

---

### **10. Why restrict GDP range?**

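Not to clean the data: the notebook deliberately trains on only part of the countries, then reveals the excluded ones later to show that they don't follow the same linear trend.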

```python
min_gdp, max_gdp = 23500, 62500
```

---

### **11. How are the datasets merged?**

Using country index:

```python
pd.merge(oecd_bli, gdp_per_capita, left_index=True, right_index=True)
```

---

### **12. Why pivot OECD_BLI?**

To turn indicators into columns → clean country-level table.

---

### **13. What is saved in the end?**

* `lifesat.csv` ‚Üí filtered dataset
* `lifesat_full.csv` ‚Üí full merged dataset
* `money_happy_scatterplot.png` ‚Üí annotated plot

---

### **14. How are the annotated visuals created?**

Using:

```python
plt.annotate(...)
plt.plot(...)
```

for each selected country.

---

### **15. What is the final output?**

A clean ML pipeline that:

* ✔ Downloads → cleans → merges → visualizes data
* ✔ Trains Linear + kNN models
* ✔ Predicts life satisfaction for new GDP values
* ✔ Saves cleaned data + images

---

------------------------------
#3 - 20 interview questions

---

# ✅ **20 Interview Questions Based on the Code**

### **1. What is the primary prediction goal of this ML pipeline?**

To predict *Life Satisfaction* from *GDP per capita*.

---

### **2. How do you load a CSV file from a URL in Python?**

Using `pd.read_csv(url)` or `urllib.request.urlretrieve`.

---

### **3. Why do we convert pandas columns to `.values`?**

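It is not strictly required: scikit-learn estimators accept DataFrames too. `.values` yields plain **NumPy arrays**, so the model is fitted without feature names and a plain nested list such as `[[37_655.2]]` can be passed to `predict()` without a feature-name warning.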

---

### **4. Why is a scatterplot used before training the model?**

To visually inspect if a **linear relationship** exists between variables.

---

### **5. What assumptions does Linear Regression make in this context?**

It assumes a **straight-line** relationship between GDP and life satisfaction.

---

### **6. When would k-Nearest Neighbors outperform Linear Regression?**

When the relationship is **non-linear** and **locally varying**.

---

### **7. How do you choose the value of k in KNN?**

Through **cross-validation** or empirical testing.
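
One way to do this is sketched below, assuming synthetic data (a noisy linear trend invented for illustration) and an arbitrary candidate list for k. It scores each candidate with 5-fold cross-validation and keeps the k with the lowest mean squared error.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, invented for illustration
rng = np.random.default_rng(42)
X = rng.uniform(20_000, 60_000, size=(40, 1))
y = 4.0 + 5e-5 * X.ravel() + rng.normal(scale=0.3, size=40)

scores = {}
for k in (1, 3, 5, 7):                      # candidate values, chosen arbitrarily
    model = KNeighborsRegressor(n_neighbors=k)
    # scikit-learn reports negated MSE so that higher is always better
    neg_mse = cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error")
    scores[k] = -neg_mse.mean()

best_k = min(scores, key=scores.get)
print(scores, best_k)
```

With a dataset as small as `lifesat.csv`, the scores for different k can be close and noisy, so the selected value should be read as a rough guide.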

---

### **8. Why do we filter GDP between 23,500 and 62,500?**

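The filter is not outlier removal. The notebook deliberately trains on **part of the data** first, then reveals the excluded countries to show that they don't follow the same linear trend.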

---

### **9. What does `pivot()` achieve in the OECD_BLI dataset?**

It transforms **indicators into separate columns**, creating a tidy table.
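
A minimal sketch of that reshaping, using an invented long-format table (the countries `A` and `B` and their values are placeholders, and the indicator names are only examples):

```python
import pandas as pd

# Long-format toy table, invented for illustration
long_df = pd.DataFrame({
    "Country":   ["A", "A", "B", "B"],
    "Indicator": ["Life satisfaction", "Employment rate",
                  "Life satisfaction", "Employment rate"],
    "Value":     [7.1, 70.0, 6.3, 65.0],
})

# One row per country, one column per indicator
wide_df = long_df.pivot(index="Country", columns="Indicator", values="Value")
print(wide_df)
```

`pivot()` raises a `ValueError` if the same (index, column) pair appears more than once, so data with several rows per country and indicator has to be filtered first, as the notebook does with `INEQUALITY == "TOT"`.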

---

### **10. How do you merge two datasets on country?**

Using:

```python
pd.merge(left, right, left_index=True, right_index=True)
```

---

### **11. Why do we use `.set_index("Country")` for GDP data?**

To match the index of OECD data for merging.
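
The index-based merge can be sketched on two invented tables (the countries `A` to `D` are placeholders). Since `pd.merge` defaults to an inner join, only countries present in both tables survive.

```python
import pandas as pd

# Toy tables, invented for illustration
bli = pd.DataFrame({"Life satisfaction": [7.1, 6.3, 5.9]},
                   index=pd.Index(["A", "B", "C"], name="Country"))
gdp = pd.DataFrame({"Country": ["B", "C", "D"],
                    "GDP per capita (USD)": [30_000, 45_000, 90_000]})
gdp = gdp.set_index("Country")   # align on the same key as bli

merged = pd.merge(left=bli, right=gdp, left_index=True, right_index=True)
print(merged)   # only B and C appear in both tables
```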

---

### **12. What is the significance of annotating points on the scatterplot?**

To highlight specific countries with unique GDP–Life Satisfaction positions.

---

### **13. What does `plt.axis([..])` do?**

Sets **manual limits** for x and y axes for consistent visualization.

---

### **14. Why do we save the cleaned datasets (`lifesat.csv` and `lifesat_full.csv`)?**

For **reuse**, **reproducibility**, and to avoid re-downloading.

---

### **15. What is the purpose of `save_fig()`?**

To export plots with consistent settings (layout, dpi, path).

---

### **16. Why is `model.predict()` called with a 2D array?**

Because scikit-learn expects inputs shaped as **(n_samples, n_features)**.

---

### **17. What happens if Cyprus’ GDP value is passed as a 1D list?**

scikit-learn raises a `ValueError` because it expects a 2D array; the value must be wrapped as `[[value]]`.
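
A small sketch of both cases, assuming toy training data invented for illustration: the 1D call fails, and `reshape(-1, 1)` turns a 1D array into the `(n_samples, n_features)` shape that `predict()` expects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration
X = np.array([[20_000.0], [40_000.0], [60_000.0]])   # shape (3, 1)
y = np.array([5.0, 6.0, 7.0])
model = LinearRegression().fit(X, y)

try:
    model.predict([37_655.2])              # 1D input: rejected
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", str(err).splitlines()[0])

x_1d = np.array([37_655.2])
pred = model.predict(x_1d.reshape(-1, 1))  # shape (1, 1): one sample, one feature
print(pred)
```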

---

### **18. How does Linear Regression compute predictions?**

Using:

```
y = b0 + b1 * x
```

after estimating optimal coefficients via **least squares**.
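
For a single feature, the least-squares coefficients have a simple closed form. The sketch below computes them with NumPy on invented toy data and cross-checks the result against `np.polyfit`; it illustrates the math and is not a description of scikit-learn's internal solver.

```python
import numpy as np

# Toy data, invented for illustration
x = np.array([20_000.0, 30_000.0, 40_000.0, 50_000.0, 60_000.0])
y = np.array([5.1, 5.4, 6.1, 6.4, 7.0])

# Closed-form simple linear regression
x_mean, y_mean = x.mean(), y.mean()
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()  # slope
b0 = y_mean - b1 * x_mean                                             # intercept

# Cross-check against NumPy's degree-1 least-squares fit
slope_check, intercept_check = np.polyfit(x, y, deg=1)
print(b0, b1)
```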

---

### **19. How does kNN compute predictions?**

It finds the **k nearest GDP values** and returns the **average Life Satisfaction**.

---

### **20. Why export the final annotated figure?**

For **reports**, **publications**, and **visual storytelling** in analysis.

---


</details>

In [None]:
highlighted_countries = country_stats.loc[list(position_text.keys())]
highlighted_countries[[gdppc_col, lifesat_col]].sort_values(by=gdppc_col)

In [None]:
country_stats.plot(kind='scatter', figsize=(5, 3), grid=True,
                   x=gdppc_col, y=lifesat_col)

X = np.linspace(min_gdp, max_gdp, 1000)

w1, w2 = 4.2, 0
plt.plot(X, w1 + w2 * 1e-5 * X, "r")
plt.text(40_000, 4.9, fr"$\theta_0 = {w1}$", color="r")
plt.text(40_000, 4.4, fr"$\theta_1 = {w2}$", color="r")

w1, w2 = 10, -9
plt.plot(X, w1 + w2 * 1e-5 * X, "g")
plt.text(26_000, 8.5, fr"$\theta_0 = {w1}$", color="g")
plt.text(26_000, 8.0, fr"$\theta_1 = {w2} \times 10^{{-5}}$", color="g")

w1, w2 = 3, 8
plt.plot(X, w1 + w2 * 1e-5 * X, "b")
plt.text(48_000, 8.5, fr"$\theta_0 = {w1}$", color="b")
plt.text(48_000, 8.0, fr"$\theta_1 = {w2} \times 10^{{-5}}$", color="b")

plt.axis([min_gdp, max_gdp, min_life_sat, max_life_sat])

save_fig('tweaking_model_params_plot')
plt.show()

In [None]:
from sklearn import linear_model

X_sample = country_stats[[gdppc_col]].values
y_sample = country_stats[[lifesat_col]].values

lin1 = linear_model.LinearRegression()
lin1.fit(X_sample, y_sample)

t0, t1 = lin1.intercept_[0], lin1.coef_.ravel()[0]
print(f"θ0={t0:.2f}, θ1={t1:.2e}")

In [None]:
country_stats.plot(kind='scatter', figsize=(5, 3), grid=True,
                   x=gdppc_col, y=lifesat_col)

X = np.linspace(min_gdp, max_gdp, 1000)
plt.plot(X, t0 + t1 * X, "b")

plt.text(max_gdp - 20_000, min_life_sat + 1.9,
         fr"$\theta_0 = {t0:.2f}$", color="b")
plt.text(max_gdp - 20_000, min_life_sat + 1.3,
         fr"$\theta_1 = {t1 * 1e5:.2f} \times 10^{{-5}}$", color="b")

plt.axis([min_gdp, max_gdp, min_life_sat, max_life_sat])

save_fig('best_fit_model_plot')
plt.show()

In [None]:
cyprus_gdp_per_capita = gdp_per_capita[gdppc_col].loc["Cyprus"]
cyprus_gdp_per_capita

In [None]:
cyprus_predicted_life_satisfaction = lin1.predict([[cyprus_gdp_per_capita]])[0, 0]
cyprus_predicted_life_satisfaction

In [None]:
country_stats.plot(kind='scatter', figsize=(5, 3), grid=True,
                   x=gdppc_col, y=lifesat_col)

X = np.linspace(min_gdp, max_gdp, 1000)
plt.plot(X, t0 + t1 * X, "b")

plt.text(min_gdp + 22_000, max_life_sat - 1.1,
         fr"$\theta_0 = {t0:.2f}$", color="b")
plt.text(min_gdp + 22_000, max_life_sat - 0.6,
         fr"$\theta_1 = {t1 * 1e5:.2f} \times 10^{{-5}}$", color="b")

plt.plot([cyprus_gdp_per_capita, cyprus_gdp_per_capita],
         [min_life_sat, cyprus_predicted_life_satisfaction], "r--")
plt.text(cyprus_gdp_per_capita + 1000, 5.0,
         fr"Prediction = {cyprus_predicted_life_satisfaction:.2f}", color="r")
plt.plot(cyprus_gdp_per_capita, cyprus_predicted_life_satisfaction, "ro")

plt.axis([min_gdp, max_gdp, min_life_sat, max_life_sat])

plt.show()

In [None]:
missing_data = full_country_stats[(full_country_stats[gdppc_col] < min_gdp) |
                                  (full_country_stats[gdppc_col] > max_gdp)]
missing_data

In [None]:
position_text_missing_countries = {
    "South Africa": (20_000, 4.2),
    "Colombia": (6_000, 8.2),
    "Brazil": (18_000, 7.8),
    "Mexico": (24_000, 7.4),
    "Chile": (30_000, 7.0),
    "Norway": (51_000, 6.2),
    "Switzerland": (62_000, 5.7),
    "Ireland": (81_000, 5.2),
    "Luxembourg": (92_000, 4.7),
}

In [None]:
full_country_stats.plot(kind='scatter', figsize=(8, 3),
                        x=gdppc_col, y=lifesat_col, grid=True)

for country, pos_text in position_text_missing_countries.items():
    pos_data_x, pos_data_y = missing_data.loc[country]
    plt.annotate(country, xy=(pos_data_x, pos_data_y),
                 xytext=pos_text, fontsize=12,
                 arrowprops=dict(facecolor='black', width=0.5,
                                 shrink=0.08, headwidth=5))
    plt.plot(pos_data_x, pos_data_y, "rs")

X = np.linspace(0, 115_000, 1000)
plt.plot(X, t0 + t1 * X, "b:")

lin_reg_full = linear_model.LinearRegression()
Xfull = np.c_[full_country_stats[gdppc_col]]
yfull = np.c_[full_country_stats[lifesat_col]]
lin_reg_full.fit(Xfull, yfull)

t0full, t1full = lin_reg_full.intercept_[0], lin_reg_full.coef_.ravel()[0]
X = np.linspace(0, 115_000, 1000)
plt.plot(X, t0full + t1full * X, "k")

plt.axis([0, 115_000, min_life_sat, max_life_sat])

save_fig('representative_training_data_scatterplot')
plt.show()

In [None]:
from sklearn import preprocessing
from sklearn import pipeline

full_country_stats.plot(kind='scatter', figsize=(8, 3),
                        x=gdppc_col, y=lifesat_col, grid=True)

poly = preprocessing.PolynomialFeatures(degree=10, include_bias=False)
scaler = preprocessing.StandardScaler()
lin_reg2 = linear_model.LinearRegression()

pipeline_reg = pipeline.Pipeline([
    ('poly', poly),
    ('scal', scaler),
    ('lin', lin_reg2)])
pipeline_reg.fit(Xfull, yfull)
curve = pipeline_reg.predict(X[:, np.newaxis])
plt.plot(X, curve)

plt.axis([0, 115_000, min_life_sat, max_life_sat])

save_fig('overfitting_model_plot')
plt.show()

In [None]:
w_countries = [c for c in full_country_stats.index if "W" in c.upper()]
full_country_stats.loc[w_countries][lifesat_col]

In [None]:
all_w_countries = [c for c in gdp_per_capita.index if "W" in c.upper()]
gdp_per_capita.loc[all_w_countries].sort_values(by=gdppc_col)

In [None]:
country_stats.plot(kind='scatter', x=gdppc_col, y=lifesat_col, figsize=(8, 3))
missing_data.plot(kind='scatter', x=gdppc_col, y=lifesat_col,
                  marker="s", color="r", grid=True, ax=plt.gca())

X = np.linspace(0, 115_000, 1000)
plt.plot(X, t0 + t1*X, "b:", label="Linear model on partial data")
plt.plot(X, t0full + t1full * X, "k-", label="Linear model on all data")

ridge = linear_model.Ridge(alpha=10**9.5)
X_sample = country_stats[[gdppc_col]]
y_sample = country_stats[[lifesat_col]]
ridge.fit(X_sample, y_sample)
t0ridge, t1ridge = ridge.intercept_[0], ridge.coef_.ravel()[0]
plt.plot(X, t0ridge + t1ridge * X, "b--",
         label="Regularized linear model on partial data")
plt.legend(loc="lower right")

plt.axis([0, 115_000, min_life_sat, max_life_sat])

save_fig('ridge_model_plot')
plt.show()

<details>
<summary><b> ✅AI Understanding Template-01-ML Landscape -02 </b></summary>

---

# ✅ **AI Understanding Template (Applied to Your Linear Regression Code)**

## **1. What is it?**

A complete workflow demonstrating **linear regression on tabular data**, including:

* Selecting/trimming data
* Plotting data
* Visualizing manually-chosen model lines
* Fitting the best linear model
* Predicting for a new country
* Showing missing/out-of-range data
* Showing overfitting via polynomial regression
* Showing regularization via Ridge regression

---

## **2. How does it reason?**

The model learns:

> **Life satisfaction = θ0 + θ1 × GDP_per_capita**

It finds θ0 and θ1 by minimizing the **mean squared error** between predicted and real life satisfaction.
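
Written out, the quantity being minimized is the following, where the notation is an assumption introduced here: $m$ is the number of training countries, and $x^{(i)}$ and $y^{(i)}$ are the GDP per capita and life satisfaction of country $i$.

$$
\mathrm{MSE}(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2
$$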

The later polynomial model learns:

> **Life satisfaction = a0 + a1x + a2x² + … + a10x¹⁰**

The Ridge model reasons by:

> “Fit the data but keep coefficients small to avoid overfitting.”
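
For one feature with an unpenalized intercept, the ridge slope has a closed form: the least-squares numerator divided by the usual denominator plus `alpha`. The sketch below checks this against scikit-learn on invented toy data. Because GDP is in dollars, the sum of squared deviations is around 1e9 here, which suggests why a value as large as `10**9.5` is needed to have a visible effect when the feature is left unscaled (that interpretation is an inference, not something the notebook states).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data, invented for illustration (GDP in USD, so x-values are large)
x = np.array([20_000.0, 30_000.0, 40_000.0, 50_000.0, 60_000.0])
y = np.array([5.1, 5.4, 6.1, 6.4, 7.0])

s_xx = ((x - x.mean()) ** 2).sum()               # 1e9 for this toy data
s_xy = ((x - x.mean()) * (y - y.mean())).sum()

alpha = 1e9                                      # same order of magnitude as s_xx
slope_ols = s_xy / s_xx                          # unregularized slope
slope_ridge = s_xy / (s_xx + alpha)              # penalty inflates the denominator

ridge = Ridge(alpha=alpha).fit(x.reshape(-1, 1), y)
print(slope_ols, slope_ridge, ridge.coef_[0])
```

With `alpha` equal to `s_xx`, the slope is exactly halved; a much smaller `alpha` would leave the line almost unchanged.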

---

## **3. Where does it fail?**

* GDP ‚Üí Life Satisfaction is **not truly linear**.
* The training slice leaves out countries (e.g., South Africa, Switzerland), so it is not representative of the full dataset.
* Polynomial model **overfits** (wild curves).
* Partial-data linear model is biased.
* Real-world happiness depends on **many features**, not just GDP.

---

## **4. When should I use it?**

Use this approach when:

* Relationship looks roughly linear.
* You need a **fast, interpretable** model.
* You want a baseline before deeper models.
* You need a full teaching pipeline: plots → linear → polynomial → Ridge.

---

## **5. Mental model**

Think of it as:

> **Drawing the best straight line through scattered points, then experimenting with curved lines and regularization.**

---

## **6. How do I prompt it?**

(In code terms)

* Input X must be **2D** (GDP column).
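* Target y can be 1D or 2D; here it is **2D** (LifeSat column selected with double brackets), so predictions come back 2D, hence the `[0, 0]` indexing.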
* Fit using `.fit(X, y)`
* Predict using `.predict([[new_value]])`
* Visualize with `plt.plot()` and `DataFrame.plot()`
* Add text with `plt.text()`
* Annotate missing data with `plt.annotate()`

---

## **7. Alternatives**

| Model                             | When to use                             |
| --------------------------------- | --------------------------------------- |
| **LinearRegression**              | Relationship is simple & linear         |
| **Polynomial + LinearRegression** | Need non-linear curve fit               |
| **Ridge/Lasso/ElasticNet**        | Need stable model + prevent overfitting |
| **RandomForestRegressor**         | Tabular + non-linear without math       |
| **XGBoost/CatBoost**              | State-of-the-art tabular performance    |
| **Neural Networks**               | Large dataset, complex patterns         |

---

# ✅ **Code Explanation (Short, Direct, Step-by-Step)**

### **1) Highlight selected countries**

```python
highlighted_countries = country_stats.loc[list(position_text.keys())]
highlighted_countries[[gdppc_col, lifesat_col]].sort_values(by=gdppc_col)
```

You pick a set of countries, extract their rows, and sort by GDP per capita.
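
The select-then-sort pattern can be tried on a tiny DataFrame. This sketch assumes, as the snippet suggests, that `country_stats` is indexed by country name; the country names, values, and label positions below are hypothetical.

```python
import pandas as pd

gdppc_col = "GDP per capita (USD)"
lifesat_col = "Life satisfaction"

# Hypothetical numbers and label positions, for illustration only
country_stats = pd.DataFrame(
    {gdppc_col: [50_000.0, 30_000.0, 40_000.0],
     lifesat_col: [7.0, 6.0, 6.5]},
    index=["Country A", "Country B", "Country C"],
)
position_text = {"Country A": (45_000, 7.5), "Country B": (25_000, 5.0)}

# .loc with a list of index labels returns those rows, in the order given
highlighted = country_stats.loc[list(position_text.keys())]
# sort_values returns a new, sorted DataFrame (it does not modify in place)
highlighted_sorted = highlighted[[gdppc_col, lifesat_col]].sort_values(by=gdppc_col)
print(highlighted_sorted)
```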

---

### **2) Plot the original data**

```python
country_stats.plot(kind='scatter', x=gdppc_col, y=lifesat_col)
```

Shows the raw GDP-vs-LifeSatisfaction scatterplot.

---

### **3) Plot hand-selected linear models**

Each block:

```python
w1, w2 = 4.2, 0
plt.plot(X, w1 + w2 * 1e-5 * X, "r")
```

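You manually draw three trial lines, each with its own hand-picked intercept and slope:

* Red line
* Green line
* Blue line

These are *guesses for θ0, θ1* (named `w1` and `w2` in the code, with the slope expressed in units of 1e-5).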
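
One way to compare such guesses numerically is to compute each line's mean squared error. The sketch below uses hypothetical toy data; only the flat line `(4.2, 0)` comes from the snippet above, the other two candidates are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: GDP per capita (USD) and life satisfaction
X = np.array([25_000.0, 35_000.0, 45_000.0, 55_000.0])
y = np.array([5.5, 6.0, 6.5, 7.0])

def mse(w1, w2):
    """Mean squared error of the line w1 + w2 * 1e-5 * X on the toy data."""
    predictions = w1 + w2 * 1e-5 * X
    return float(np.mean((predictions - y) ** 2))

# (4.2, 0) is the flat line from the snippet; the other two are made up
candidates = {"flat": (4.2, 0.0), "too steep": (3.0, 8.0), "close": (4.25, 5.0)}
errors = {name: mse(w1, w2) for name, (w1, w2) in candidates.items()}
best = min(errors, key=errors.get)
print(errors, best)
```

A lower error means the guessed line sits closer to the points, which is the quantity the learning algorithm minimizes automatically.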

---

### **4) Compute the best-fit line using scikit-learn**

```python
lin1 = linear_model.LinearRegression()
lin1.fit(X_sample, y_sample)
```

After fitting, the model exposes the learned parameters:

```python
t0 = lin1.intercept_[0]     # intercept (θ0)
t1 = lin1.coef_.ravel()[0]  # slope (θ1)
```

Plotting:

```python
plt.plot(X, t0 + t1 * X)
```

This is the **least-squares best-fit line** for the training sample.
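
"Best fit" here means the line that minimizes the sum of squared errors (ordinary least squares). The sketch below solves the same problem directly with NumPy on a hypothetical toy sample, to show what the fit computes; it is an illustration, not the notebook's code.

```python
import numpy as np

# Hypothetical toy sample (GDP per capita in USD, life satisfaction)
X_sample = np.array([[25_000.0], [35_000.0], [45_000.0], [55_000.0]])
y_sample = np.array([[5.6], [5.9], [6.6], [6.9]])

# Add a column of ones so the first weight plays the role of the intercept
A = np.hstack([np.ones_like(X_sample), X_sample])
theta, *_ = np.linalg.lstsq(A, y_sample, rcond=None)
t0, t1 = theta[0, 0], theta[1, 0]

def sse(a, b):
    """Sum of squared errors of the line a + b * x on the sample."""
    return float(np.sum((a + b * X_sample - y_sample) ** 2))

print(t0, t1, sse(t0, t1))
```

Nudging either parameter away from the solution, for example `sse(t0 + 0.1, t1)`, can only increase the error.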

---

### **5) Predict for Cyprus**

```python
cyprus_predicted_life_satisfaction = lin1.predict([[cyprus_gdp]])[0,0]
```

Then:

* Draw a vertical red dashed line
* Mark the predicted point with a red circle

---

### **6) Find missing / out-of-range countries**

```python
missing_data = full_country_stats[(full_country_stats[gdppc_col] < min_gdp) |
                                  (full_country_stats[gdppc_col] > max_gdp)]
```

These are countries not included in the training slice.

You annotate them on the plot with arrows.
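
The range filter can be sketched with boolean masks. The rows and the `min_gdp` / `max_gdp` bounds below are hypothetical values chosen for illustration.

```python
import pandas as pd

gdppc_col = "GDP per capita (USD)"
# Hypothetical rows and bounds, for illustration only
full_country_stats = pd.DataFrame(
    {gdppc_col: [10_000.0, 30_000.0, 50_000.0, 90_000.0],
     "Life satisfaction": [4.5, 6.0, 7.0, 7.5]},
    index=["Country A", "Country B", "Country C", "Country D"],
)
min_gdp, max_gdp = 20_000, 60_000

gdp = full_country_stats[gdppc_col]
# Training slice: countries inside the GDP range
country_stats = full_country_stats[(gdp >= min_gdp) & (gdp <= max_gdp)]
# Left-out countries: below the minimum OR above the maximum
missing_data = full_country_stats[(gdp < min_gdp) | (gdp > max_gdp)]
print(list(missing_data.index))
```

The parentheses around each comparison are required because `|` and `&` bind more tightly than `<` and `>` in Python.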

---

### **7) Fit a linear model on full data**

```python
lin_reg_full.fit(Xfull, yfull)
plt.plot(X, t0full + t1full * X, "k")
```

This produces the line fitted on the **full dataset**, which is more representative than the partial-data fit.

---

### **8) Polynomial regression (degree 10)**

```python
poly = PolynomialFeatures(degree=10)
scaler = StandardScaler()
pipeline_reg = Pipeline([...])
```

Fits a high-degree curve → **overfits**.
The plot shows unrealistic wiggles.
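
A hedged sketch of the same idea on synthetic data (an assumption, not the notebook's dataset): the degree-10 pipeline reaches a lower training error than a straight line, even though the data was generated from a straight line plus noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data: a linear trend plus noise
rng = np.random.default_rng(42)
X = np.linspace(2.0, 6.0, 15).reshape(-1, 1)   # GDP in units of 10,000 USD
y = 4.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=15)

linear = LinearRegression().fit(X, y)
poly = Pipeline([
    ("poly", PolynomialFeatures(degree=10, include_bias=False)),
    ("scal", StandardScaler()),
    ("lin", LinearRegression()),
]).fit(X, y)

mse_linear = mean_squared_error(y, linear.predict(X))
mse_poly = mean_squared_error(y, poly.predict(X))
print(mse_linear, mse_poly)   # training errors of the two models
```

A lower training error is not evidence of a better model here: the extra flexibility is spent on fitting the noise, which is what overfitting means.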

---

### **9) Ridge regression**

```python
ridge = linear_model.Ridge(alpha=10**9.5)
ridge.fit(X_sample, y_sample)
```

This produces a **smoother, more stable line** by shrinking θ1 toward zero.

You plot:

* Partial-data linear model (blue dotted)
* Full-data model (black line)
* Ridge model (blue dashed)
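
The shrinking effect can be seen by comparing slopes on a small sample. The data below is hypothetical; `alpha=10**9.5` is the value used in the snippet above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical toy sample (GDP per capita in USD, life satisfaction)
X_sample = np.array([[25_000.0], [35_000.0], [45_000.0], [55_000.0]])
y_sample = np.array([5.6, 5.9, 6.6, 6.9])

plain = LinearRegression().fit(X_sample, y_sample)
ridge = Ridge(alpha=10**9.5).fit(X_sample, y_sample)

slope_plain = plain.coef_[0]
slope_ridge = ridge.coef_[0]
print(slope_plain, slope_ridge)   # the ridge slope is closer to zero
```

With a single feature the ridge slope is `Sxy / (Sxx + alpha)` instead of `Sxy / Sxx`. Because GDP is in the tens of thousands, `Sxx` is very large, so `alpha` has to be large too before it has a visible effect.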

---


</details>

<details>
<summary><b> ✅ AI Understanding Template-ML Landscape - CheatSheet </b></summary>

An **interview-style Q&A summary**, short and crisp, covering **AI: What / How / Fail / When / Mental Model / Prompting / Alternatives** plus a **code explanation**.

---

# ✅ **INTERVIEW Q&A SUMMARY: Linear Regression GDP vs Life Satisfaction**

---

## **1) What is this code doing?**

This code demonstrates an end-to-end **linear regression workflow** on GDP per capita vs life satisfaction.
It visualizes the data, draws hand-picked lines, learns the best-fit parameters, predicts a value, highlights countries missing from the training sample, shows overfitting, and applies Ridge regularization.

---

## **2) What is the main AI/ML concept used here?**

**Simple Linear Regression**:

$$
\hat{y} = \theta_0 + \theta_1 x
$$

It finds the best straight line to predict life satisfaction from GDP.

---

## **3) How does the model reason?**

It calculates:

* A **slope (θ₁)** telling how life satisfaction changes with GDP
* An **intercept (θ₀)** defining baseline satisfaction

It then predicts by inserting GDP (x) into the linear formula.

Polynomial regression adds higher powers (x^2, x^3, ...) to create curves.
Ridge regression penalizes large coefficients to keep the fit stable.
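
To make "adds higher powers" concrete, here is a small sketch of what `PolynomialFeatures` produces for two hypothetical inputs. The model fitted on these columns is still linear in its parameters; only the inputs are curved.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0], [3.0]])   # two hypothetical inputs, one feature each

expander = PolynomialFeatures(degree=3, include_bias=False)
powers = expander.fit_transform(x)
print(powers)   # columns are x, x^2, x^3
```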

---

## **4) Where does this approach fail?**

It fails when:

* Relationship is **non-linear**
* Data contains **noise/outliers**
* Important features are missing
* Polynomial regression **overfits** small datasets
* Missing countries distort slope/intercept

---

## **5) When should linear regression be used?**

Use when:

* The trend appears **straight-line**
* You want **quick, interpretable** results
* Dataset is **small/clean**
* You're building a **baseline model**

Use polynomial features when the data bends, and Ridge when the model overfits.

---

## **6) What is the mental model behind this code?**

Think of it as:

> “Draw a straight line that best fits all the country points.”

Polynomial model:

> “A bendy line trying too hard to fit every point.”

Ridge model:

> “A stable line that avoids wild slopes.”

---

## **7) How do you prompt or use this model?**

You “prompt” it by giving numeric input:

* Train: `lin1.fit(X, y)`
* Predict: `lin1.predict([[GDP]])`
* Plot: give an X grid → `plt.plot(X, t0 + t1*X)`

For polynomial regression:
Use a `Pipeline` that chains poly → scaler → linear.

---

## **8) What are alternatives to this model?**

| Model                 | Use case                |
| --------------------- | ----------------------- |
| Polynomial Regression | Non-linear curves       |
| Ridge/Lasso           | Regularization needed   |
| Random Forest         | Robust tabular learning |
| Gradient Boosting     | High predictive power   |
| Neural Network        | Complex relationships   |

---

# ‚úÖ **CODE SECTIONS ‚Äî Interview-Style Explanation**

---

### **Q: What does the highlighting code do?**

```python
highlighted_countries = country_stats.loc[list(position_text.keys())]
```

**A:** Selects the rows for specific countries (the keys of `position_text`); a following `sort_values` call orders them by GDP per capita.

---

### **Q: What is the scatter plot used for?**

```python
country_stats.plot(kind='scatter', x=gdppc_col, y=lifesat_col)
```

Shows correlation between GDP and life satisfaction.

---

### **Q: Why plot manual lines (θ₀, θ₁)?**

To visually compare different slopes/intercepts and understand how linear models behave.

---

### **Q: How is the real regression model trained?**

```python
lin1 = LinearRegression()
lin1.fit(X_sample, y_sample)
```

Learns the best θ₀ and θ₁ from the data.

---

### **Q: How is prediction done for Cyprus?**

```python
lin1.predict([[cyprus_gdp]])
```

Feeds GDP into the model → outputs life satisfaction.

---

### **Q: How does the code handle missing or out-of-range countries?**

It filters countries outside the training GDP range and annotates them separately.

---

### **Q: What does polynomial regression demonstrate?**

```python
PolynomialFeatures(degree=10)
```

It creates an overly flexible curve → shows **overfitting**.

---

### **Q: Why use Ridge Regression?**

```python
ridge = Ridge(alpha=10**9.5)
```

Adds a penalty to θ₁ to reduce overfitting and stabilize the model.

---



</details>

<details>
<summary><b> ✅ AI Understanding Template-ML Landscape - 20 Quick fire interview questions </b></summary>

A **1-page cheat sheet + diagram + 20 rapid-fire interview Qs**: all short, crisp, and interview-ready.

---

# ✅ **1-PAGE CHEAT SHEET: GDP vs Life Satisfaction (Linear Regression)**

## **1) Core Idea**

Fit a line:

$$
\hat{y} = \theta_0 + \theta_1 x
$$

Predict life satisfaction (y) using GDP per capita (x).

---

## **2) Workflow Summary**

1. **Load/clean country stats**
2. **Scatter plot** (see correlation)
3. **Try manual θ₀, θ₁ lines** (intuition)
4. **Fit Linear Regression**
5. **Predict for a country (e.g., Cyprus)**
6. **Visualize prediction**
7. **Identify missing/out-of-range countries**
8. **Polynomial Regression → Overfitting**
9. **Ridge Regression → Regularization**
10. **Compare partial vs full dataset models**

---

## **3) Important Concepts**

* **Linear Regression:** Finds the best straight line
* **Intercept (θ₀):** Predicted life satisfaction at zero GDP
* **Slope (θ₁):** Change in predicted satisfaction per unit change in GDP
* **Polynomial Features:** Adds x², x³… → flexible curve
* **Overfitting:** Curve models noise instead of trend
* **Regularization (Ridge):** Penalizes large weights → stable model
* **Pipeline:** Sequential transformations followed by a model

---

## **4) Key Code Blocks**

### **Fit Model**

```python
lin1 = LinearRegression()
lin1.fit(X, y)
```

### **Predict**

```python
lin1.predict([[gdp]])
```

### **Polynomial + Scaling + Regression**

```python
Pipeline([
 ('poly', PolynomialFeatures(10)),
 ('scal', StandardScaler()),
 ('lin', LinearRegression())
])
```

### **Ridge**

```python
Ridge(alpha=10**9.5)
```

---

## **5) When to Use What?**

| Situation                     | Use                   |
| ----------------------------- | --------------------- |
| Trend looks linear            | Linear Regression     |
| Non-linear but smooth         | Polynomial Regression |
| Overfitting risk              | Ridge/Lasso           |
| Complex + noisy               | Random Forest         |
| High accuracy on tabular data | Gradient Boosting     |

---

## **6) Mental Model**

* Linear: **Straight ruler** fitting trend
* Polynomial: **Flexible wire** fitting every bump
* Ridge: **Ruler with weight** preventing wobbling

---

# 🎯 **Diagram: Simple Visual Model**

```
Scatter plot: GDP (x) → Life Satisfaction (y)

                 ●
    ●     ●   ●        ●
 ●         ●  ●  ●
---------------------------- x (GDP)
       Best-Fit Line →
```

Polynomial Regression:

```
A very curvy line trying too hard to fit every point
```

Ridge Regression:

```
Straighter, more stable line
```

---

# 🔥 **20 Rapid-Fire Interview Questions (and short answers)**

### **1. What algorithm is used here?**

Linear Regression.

### **2. What is θ₀?**

Intercept.

### **3. What is θ₁?**

Slope/coefficient.

### **4. What does slope represent?**

How much predicted life satisfaction changes per unit of GDP per capita.

### **5. What library performs regression?**

`sklearn.linear_model`.

### **6. Why plot manual lines?**

To visualize how θ₀, θ₁ affect predictions.

### **7. How do you predict for Cyprus?**

Use `model.predict([[GDP]])`.

### **8. What is overfitting?**

Model learns noise instead of pattern.

### **9. How do polynomial features cause overfitting?**

High-degree terms give the model enough flexibility to fit the noise in a small dataset.

### **10. What does Ridge do?**

Penalizes large weights → reduces overfitting.

### **11. What is a pipeline?**

Chain of transformations + model.

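### **12. Why scale polynomial features?**

High powers of GDP span wildly different magnitudes, which makes the fit numerically unstable; scaling puts them on a comparable range.

### **13. What happens if GDP range is too small?**

The model sees only part of the trend, so the learned slope can be wrong outside that range.

### **14. Why annotate missing countries?**

To show which countries were left out of training and how leaving them out distorts the fit.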

### **15. What is the input feature?**

GDP per capita.

### **16. What is the target?**

Life satisfaction index.

### **17. Why use np.linspace?**

Create smooth X-grid for plotting predictions.

### **18. What is `coef_`?**

Learned slope values.

### **19. What is `intercept_`?**

Learned bias/constant.

### **20. Why use a scatter plot first?**

Check if linear model makes sense.
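
Answers 17 to 19 fit together in a few lines. In this sketch (hypothetical toy data), the line drawn from `intercept_` and `coef_` over an `np.linspace` grid matches what `predict()` returns on the same grid.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy sample
X_sample = np.array([[25_000.0], [35_000.0], [45_000.0], [55_000.0]])
y_sample = np.array([5.5, 6.0, 6.5, 7.0])

model = LinearRegression().fit(X_sample, y_sample)
t0 = model.intercept_   # a scalar here because y_sample is 1D
t1 = model.coef_[0]     # one coefficient per feature

# Evenly spaced grid across the GDP range, reshaped to 2D for predict()
X_grid = np.linspace(20_000, 60_000, 5).reshape(-1, 1)
by_formula = t0 + t1 * X_grid[:, 0]
by_predict = model.predict(X_grid)
print(np.allclose(by_formula, by_predict))
```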

---



</details>

# Exercise Solutions

1. Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.
2. Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).
3. A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.
4. The two most common supervised tasks are regression and classification.
5. Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
6. Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains, since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semi-supervised learning problem, but it would be less natural.
7. If you don't know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
8. Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their labels (spam or not spam).
9. An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.
10. Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
11. An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
12. A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
13. Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm.
14. Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.
15. If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.
16. A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
17. A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
18. The train-dev set is used when there is a risk of mismatch between the training data and the data used in the validation and test datasets (which should always be as close as possible to the data used once the model is in production). The train-dev set is a part of the training set that's held out (the model is not trained on it). The model is trained on the rest of the training set, and evaluated on both the train-dev set and the validation set. If the model performs well on the training set but not on the train-dev set, then the model is likely overfitting the training set. If it performs well on both the training set and the train-dev set, but not on the validation set, then there is probably a significant data mismatch between the training data and the validation + test data, and you should try to improve the training data to make it look more like the validation + test data.
19. If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
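
The roles of the validation and test sets in answers 16, 17 and 19 can be sketched with a simple holdout split. The data is synthetic and the 60/20/20 proportions are an assumption chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): a noisy linear relationship
rng = np.random.default_rng(42)
X = rng.uniform(2.0, 6.0, size=(100, 1))
y = 4.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=100)

# Hold out 20% as the test set, then 25% of the remainder as the validation set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)
# Use the validation error to compare models and tune hyperparameters
val_mse = mean_squared_error(y_val, model.predict(X_val))
# Look at the test error once, at the end, to estimate the generalization error
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(len(X_train), len(X_val), len(X_test), val_mse, test_mse)
```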