# Supervised Learning with scikit-learn: Jupyter Notebook Notes

---

## 1. Introduction to Machine Learning with scikit-learn

- **Focus**: Supervised learning using the scikit-learn library in Python.

---

## 2. What is Machine Learning?

- **Definition**:  
  Machine learning is the process by which computers learn to make decisions from data **without being explicitly programmed**.

---

## 3. Examples of Machine Learning

- **Spam Detection**: Predicting whether an email is spam or not, based on its content and sender.
- **Book Clustering**: Grouping books into categories based on their words, then assigning new books to a category.

---

## 4. Unsupervised Learning

- **Definition**:  
  Unsupervised learning uncovers hidden patterns and structures from **unlabeled data**.
- **Example**:  
  Grouping customers into distinct categories based on purchasing behavior, **without pre-defined labels**.  
  This is known as **clustering**.

![image.png](attachment:0ce25f83-22b6-4132-884c-79b2042ca740.png)
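
The clustering idea above can be sketched with scikit-learn's `KMeans`. This is only an illustration under stated assumptions: the customer figures are invented, and choosing two clusters is a guess made for the example rather than something these notes prescribe.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchasing data: [orders per month, average basket value]
customers = np.array([
    [1, 20.0], [2, 22.0], [1, 18.0],        # occasional, low-spend shoppers
    [10, 95.0], [12, 110.0], [11, 100.0],   # frequent, high-spend shoppers
])

# No labels are passed: KMeans looks for structure in the features alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(customers)
print(cluster_ids)
```

The cluster numbers themselves are arbitrary; what matters is which customers end up grouped together.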

---

## 5. Supervised Learning

- **Definition**:  
  In supervised learning, the **target values are known**.  
  The goal is to build a model that can accurately predict the target value for **unseen data**, given a set of features.
- **Example**:  
  Predicting a basketball player's position (target) based on their points per game (feature).

![image.png](attachment:77cb81d1-2f84-4cb1-b6e0-8534fc5da495.png)

---

## 6. Types of Supervised Learning

- **Classification**:
    - Predicts the **category or label** of an observation.
    - **Example**: Predicting if a bank transaction is **fraudulent (1)** or **not fraudulent (0)** (Binary Classification).
- **Regression**:
    - Predicts a **continuous value**.
    - **Example**: Predicting the **price** of a property based on features like number of bedrooms and size.
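
A minimal sketch of the two task types, assuming invented data. The k-nearest-neighbors estimators are used only as a convenient stand-in (KNN classification is covered later in these notes); any scikit-learn classifier or regressor would follow the same pattern.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: transaction amount -> fraudulent (1) or not fraudulent (0)
amounts = np.array([[5.0], [12.0], [20.0], [900.0], [1200.0], [1500.0]])
is_fraud = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(amounts, is_fraud)
label = clf.predict([[1000.0]])       # a discrete class label
print(label)

# Regression: [bedrooms, size] -> price (all values are hypothetical)
homes = np.array([[1, 40], [2, 60], [3, 90], [4, 120]])
prices = np.array([100_000.0, 150_000.0, 220_000.0, 300_000.0])
reg = KNeighborsRegressor(n_neighbors=2).fit(homes, prices)
price = reg.predict([[3, 100]])       # a continuous value
print(price)
```

The classifier can only return one of the labels it saw during training, whereas the regressor returns a number that need not match any training price.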

---

## 7. Naming Conventions

- **Feature**:  
  Also known as predictor variable or independent variable.
- **Target variable**:  
  Also known as dependent variable or response variable.

![image.png](attachment:603cdad9-0079-4247-936f-515c2af65619.png)

---

## 8. Data Requirements Before Using Supervised Learning

- **No missing values** in the data.
- **Data must be numeric**.
- **Data should be stored as**:
    - pandas DataFrame or Series, or
    - NumPy array.
- **Perform Exploratory Data Analysis (EDA)**:
    - Use pandas methods for descriptive statistics.
    - Create appropriate data visualizations to ensure data is in the correct format.
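
The checks above might look like the following in pandas. The DataFrame is a hypothetical stand-in, and dropping rows plus a 0/1 encoding is just one possible clean-up, not the only valid one.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value and one text column
df = pd.DataFrame({
    "total_day_charge": [45.85, 22.30, np.nan, 34.73],
    "plan": ["basic", "premium", "basic", "basic"],
    "churn": [1, 0, 0, 0],
})

print(df.isna().sum())   # missing values per column
print(df.dtypes)         # text columns show up with a non-numeric dtype
print(df.describe())     # descriptive statistics for the numeric columns

# One possible clean-up: drop incomplete rows, encode the text column as 0/1
clean = df.dropna().copy()
clean["plan"] = (clean["plan"] == "premium").astype(int)
print(clean)
```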

---

## 9. scikit-learn Syntax for Supervised Learning

### General Workflow

1. **Import the Model** from an sklearn module.
2. **Instantiate the Model**.
3. **Fit the Model** to your data (`X`, `y`).
4. **Predict** on new data (`X_new`).
5. **Examine the output**.

### Example Code

```python
from sklearn.module import Model

model = Model()
model.fit(X, y)
predictions = model.predict(X_new)
print(predictions)
```

#### Example Output

```python
[0 0 0 0 1 0]
```
> For example, in a spam classification model, `1` = spam, `0` = not spam.
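
As a concrete instance of the template, here is a sketch that substitutes `KNeighborsClassifier` for `Model`. The choice of estimator and the tiny email dataset (number of links, number of "free" mentions) are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # one concrete choice of Model

# Hypothetical features per email: [number of links, number of "free" mentions]
X = np.array([[0, 0], [1, 0], [0, 1], [8, 5], [9, 7], [7, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = spam, 0 = not spam

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

X_new = np.array([[1, 1], [8, 6]])
predictions = model.predict(X_new)
print(predictions)
```

Swapping in a different estimator changes only the import and the instantiation; the `fit` and `predict` calls stay the same.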

---

### Line-by-Line Code Explanation

#### 1. `from sklearn.module import Model`

- **What**: Imports a specific machine learning model (algorithm) from the relevant scikit-learn module.
- **Why**: To use a particular algorithm suited to the supervised learning task (e.g., classification or regression).
- **Result**: Makes the model class available for use.

#### 2. `model = Model()`

- **What**: Creates (instantiates) the model object.
- **Why**: To prepare the model for training (fitting) on your data.
- **Result**: `model` is now an instance of the selected algorithm.

#### 3. `model.fit(X, y)`

- **What**: Fits (trains) the model on the training data, where `X` is your features and `y` is your target variable.
- **Why**: To learn the relationship between features and target.
- **Result**: The model’s internal parameters are adjusted to best fit the data.

#### 4. `predictions = model.predict(X_new)`

- **What**: Uses the trained model to predict the target values for new, unseen data (`X_new`).
- **Why**: To make predictions on data the model has not seen before.
- **Result**: Returns an array of predicted values (e.g., classes or numbers).

#### 5. `print(predictions)`

- **What**: Displays the predictions.
- **Why**: To observe the output of the model on new data.
- **Result**: You see the predicted labels or values for each observation in `X_new`.

##### Output Significance

- In the context of spam detection:
    - `[0 0 0 0 1 0]`
    - Each number corresponds to a prediction for one email.
    - `1` means predicted as "spam", `0` means "not spam".
- **Interpreting the output**: If you gave the model six emails, this output means the model predicts only the fifth email is spam; the rest are not spam.

---

## 10. Summary and Practice

- You now understand:
    - The difference between supervised and unsupervised learning.
    - Types of supervised learning (classification & regression).
    - Naming conventions for features and targets.
    - Data requirements for supervised learning.
    - The standard scikit-learn workflow for building and using supervised models.
---

## References

- [scikit-learn documentation](https://scikit-learn.org/stable/)

---

## Binary Classification Example

There are two types of supervised learning — **classification** and **regression**.  
Binary classification is used to predict a target variable that has **only two labels**, typically represented numerically with a `0` or a `1`.

### Sample Dataset

| account_length | total_day_charge | total_eve_charge | total_night_charge | total_intl_charge | customer_service_calls | churn |
|----------------|-----------------|-----------------|------------------|-----------------|----------------------|-------|
| 101            | 45.85           | 17.65           | 9.64             | 1.22            | 3                    | 1     |
| 73             | 22.30           | 9.05            | 9.98             | 2.75            | 2                    | 0     |
| 86             | 24.62           | 17.53           | 11.49            | 3.13            | 4                    | 0     |
| 59             | 34.73           | 21.02           | 9.66             | 3.24            | 1                    | 0     |
| 129            | 27.42           | 18.75           | 10.11            | 2.59            | 1                    | 0     |

---

### Question

Looking at this data, **which column could be the target variable for binary classification?**

**Options:**

1. `customer_service_calls`  
2. `total_night_charge`  
3. `churn`  
4. `account_length`  

**Answer:**  
`churn` — because it contains only two labels (0 or 1), indicating whether the customer churned.
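
The same reasoning can be checked programmatically. The sketch below rebuilds the five sample rows as a DataFrame (the variable name `sample` is introduced here for illustration) and looks for columns with exactly two distinct values.

```python
import pandas as pd

# The five sample rows from the table above
sample = pd.DataFrame({
    "account_length": [101, 73, 86, 59, 129],
    "total_day_charge": [45.85, 22.30, 24.62, 34.73, 27.42],
    "total_eve_charge": [17.65, 9.05, 17.53, 21.02, 18.75],
    "total_night_charge": [9.64, 9.98, 11.49, 9.66, 10.11],
    "total_intl_charge": [1.22, 2.75, 3.13, 3.24, 2.59],
    "customer_service_calls": [3, 2, 4, 1, 1],
    "churn": [1, 0, 0, 0, 0],
})

# Count distinct values per column; a binary target has exactly two
unique_counts = sample.nunique()
binary_candidates = unique_counts[unique_counts == 2].index.tolist()
print(binary_candidates)
```

On a five-row sample this is only a heuristic; on the full dataset you would also confirm that the column actually represents the outcome you want to predict.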


### Exercise: The supervised learning workflow

Recall that scikit-learn offers a repeatable workflow for using supervised learning models to predict the target variable values when presented with new data.

Reorder the pseudo-code provided so it accurately represents the workflow of building a supervised learning model and making predictions.

#### Instructions

- Drag the code blocks into the correct order to represent how a supervised learning workflow would be executed.

![image.png](attachment:2da35d1e-65b0-4dfc-98ca-bf1c590aa783.png)

# k-Nearest Neighbors (KNN) Classification with scikit-learn

---

## 1. The Classification Challenge

- In classification, **supervised learning** uses labeled data to build models called **classifiers**.
- **Goal**: Predict the labels of **unseen (unlabeled) data**.

---

## 2. Steps for Classifying Unseen Data

1. **Build a classifier**: The model learns from labeled training data.
2. **Pass the classifier new, unlabeled data**.
3. **Predict labels** for the new data.

- The **training data** consists of feature values and corresponding labels.

---

## 3. k-Nearest Neighbors (KNN) Algorithm

- **KNN** is a popular algorithm for classification tasks.
- **Main Idea**:
    - To predict the label of a new data point:
        - Look at the `k` closest labeled data points (neighbors).
        - Assign the label that is the **majority among those neighbors** (majority voting).

---

![image.png](attachment:19b8bfdb-e324-4d94-844a-51fdbeb981d7.png)

### KNN Example: Majority Voting

- **If k = 3**: The label is determined by the closest 3 neighbors.
    - Example: If 2 are red and 1 is blue, the new point is classified as **red**.

![image.png](attachment:fe2ec879-5258-4c9b-8d5d-603cc417e7f0.png)

- **If k = 5**: Now, if 3 are blue and 2 are red, the label is **blue**.

![image.png](attachment:af89e8aa-3f65-4789-ac58-3341c6e3a9f0.png)
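
To make the voting rule concrete, here is a from-scratch sketch in NumPy. It is not scikit-learn's implementation; the helper `knn_predict` and the sample points are hypothetical, chosen so that the answer flips between `k = 3` and `k = 5` as in the example above.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Return the majority label among the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical points, listed by increasing distance from the origin
X_train = np.array([[1.0, 0.0], [0.0, 1.5], [2.0, 0.0], [0.0, 2.5], [3.0, 0.0]])
y_train = np.array(["red", "red", "blue", "blue", "blue"])
x_new = np.array([0.0, 0.0])

print(knn_predict(X_train, y_train, x_new, k=3))  # two red votes against one blue
print(knn_predict(X_train, y_train, x_new, k=5))  # three blue votes against two red
```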

---

## 4. KNN Intuition

![image.png](attachment:af671a33-b9cb-4093-8c60-9061974bcd25.png)

- **Visualization Example**:  
  - Features: `total_day_charge` vs. `total_eve_charge` for telecom customers.
  - **Color coding**:
    - **Blue**: Customer churned.
    - **Red**: Customer did not churn.
- **KNN Decision Boundary**:
    - The area of the plot is split by KNN into regions (based on labels of neighbors).
    - Example: Using `k = 15`, the algorithm draws boundaries predicting **churn** (gray background) or **no churn** (red background).
    - **New data points** are classified based on which region they fall into.
    - This boundary would be used to make predictions on unseen data.

![image.png](attachment:aa1e43c7-0485-4caf-8412-d4e4777d6031.png)
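
One common way to obtain such regions is to predict a label for every point on a grid that covers the plot area. The sketch below assumes synthetic data in place of the telecom features, and the grid range and resolution are arbitrary choices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for two charge features: two noisy groups of customers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(25, 5, size=(50, 2)), rng.normal(45, 5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=15).fit(X, y)

# Predict a label for every point on a grid covering the plot area
xx, yy = np.meshgrid(np.linspace(0, 70, 100), np.linspace(0, 70, 100))
grid = np.c_[xx.ravel(), yy.ravel()]
regions = knn.predict(grid).reshape(xx.shape)

print(regions.shape)
```

Passing `xx`, `yy`, and `regions` to a filled contour plot such as `plt.contourf` would shade the two areas, and the line where the shading changes is the decision boundary.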

---

## 5. Using scikit-learn to Fit a KNN Classifier

### Code Example: Fitting a KNN Classifier

```python
from sklearn.neighbors import KNeighborsClassifier

# Select features and target from the DataFrame
X = churn_df[["total_day_charge", "total_eve_charge"]].values
y = churn_df["churn"].values

print(X.shape, y.shape)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)
```

#### Output

```python
(3333, 2) (3333,)
```

---

### Line-by-Line Explanation

#### 1. `from sklearn.neighbors import KNeighborsClassifier`

- **What**: Imports the K-Nearest Neighbors classifier from scikit-learn.
- **Why**: To use the KNN algorithm for classification tasks.
- **Expected Result**: You now have access to the KNN classifier class.

#### 2. `X = churn_df[["total_day_charge", "total_eve_charge"]].values`

- **What**: Selects two columns (`total_day_charge` and `total_eve_charge`) from the `churn_df` DataFrame and converts them to a NumPy array.
- **Why**: scikit-learn expects the features as a 2D array-like object (rows are observations, columns are features); `.values` returns the underlying NumPy array.
- **Expected Result**: `X` is a `(3333, 2)` array, where each row is a customer, and each column is a feature.

#### 3. `y = churn_df["churn"].values`

- **What**: Selects the `churn` column (the target variable) and converts it to a NumPy array.
- **Why**: The target should be a 1D array with one label per observation, matching the number of rows in `X`.
- **Expected Result**: `y` is a `(3333,)` array, where each entry is `1` (churn) or `0` (no churn).

#### 4. `print(X.shape, y.shape)`

- **What**: Prints the shapes of `X` and `y`.
- **Why**: To confirm that the sizes match and are in the correct format for scikit-learn.
- **Expected Output**: `(3333, 2) (3333,)` which means 3333 customers, 2 features, and 3333 target values.

#### 5. `knn = KNeighborsClassifier(n_neighbors=15)`

- **What**: Instantiates a KNN classifier, setting `k` (number of neighbors) to 15.
- **Why**: To specify the KNN algorithm and how many neighbors to use for voting.
- **Expected Result**: `knn` is a KNN classifier object with `k=15`.

#### 6. `knn.fit(X, y)`

- **What**: Fits (trains) the KNN model on the training data (`X`, `y`).
- **Why**: To store the training data in the model so it can use it to predict new labels.
- **Expected Result**: The model is ready to make predictions on new, unseen data.

---

## 6. Predicting on Unlabeled Data

### Code Example: Making Predictions

```python
import numpy as np

X_new = np.array([
    [56.8, 17.5],
    [24.4, 24.1],
    [50.1, 10.9]
])

print(X_new.shape)

predictions = knn.predict(X_new)
print('Predictions: {}'.format(predictions))
```

#### Output

```python
(3, 2)
Predictions: [1 0 0]
```

---

### Line-by-Line Explanation

#### 1. `import numpy as np`

- **What**: Imports the NumPy library.
- **Why**: NumPy is needed to create arrays for the new data points.
- **Expected Result**: You can now use `np.array()`.

#### 2. `X_new = np.array([[56.8, 17.5], [24.4, 24.1], [50.1, 10.9]])`

- **What**: Creates a 2D NumPy array with 3 new observations, each with 2 features.
- **Why**: These represent new, unlabeled data points whose labels we want to predict.
- **Expected Result**: `X_new` is a `(3, 2)` array (3 observations, 2 features each).

#### 3. `print(X_new.shape)`

- **What**: Prints the shape of the new data array.
- **Why**: To confirm the format is correct (rows = observations, columns = features).
- **Expected Output**: `(3, 2)`

#### 4. `predictions = knn.predict(X_new)`

- **What**: Uses the trained KNN model to predict the labels for the new data.
- **Why**: To classify each new observation as `churn` (1) or `no churn` (0).
- **Expected Result**: An array of predictions: `[1, 0, 0]`.

#### 5. `print('Predictions: {}'.format(predictions))`

- **What**: Prints out the predicted labels.
- **Why**: To see the model's predictions for the new data points.
- **Expected Output**: `Predictions: [1 0 0]`

---

### Output Significance

- **Interpretation**:
    - First observation (`[56.8, 17.5]`) is predicted to **churn** (`1`).
    - Second and third observations are predicted to **not churn** (`0`).
- **Each value** in the output corresponds to the predicted label for each new data point.
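
Because KNN predicts by majority vote, it can also be useful to see how the vote was split. The sketch below uses `predict_proba`, which for a KNN classifier with default (uniform) weights reports the share of neighbors belonging to each class. The training data here is a small synthetic stand-in, not `churn_df`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: [total_day_charge, total_eve_charge] -> churn
X = np.array([[20.0, 15.0], [22.0, 18.0], [25.0, 16.0],
              [55.0, 17.0], [58.0, 20.0], [52.0, 19.0]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

X_new = np.array([[56.8, 17.5], [24.4, 24.1]])
labels = knn.predict(X_new)        # majority label per row
proba = knn.predict_proba(X_new)   # vote share per class; columns follow knn.classes_
print(labels)
print(proba)
```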

---

## Summary Table: KNN Workflow in scikit-learn

| Step                          | Description                                                 |
|-------------------------------|------------------------------------------------------------|
| Import KNN                    | `from sklearn.neighbors import KNeighborsClassifier`        |
| Prepare data                  | Features as 2D array (`X`), targets as 1D array (`y`)      |
| Instantiate model             | `knn = KNeighborsClassifier(n_neighbors=15)`               |
| Fit model                     | `knn.fit(X, y)`                                            |
| Make predictions              | `knn.predict(X_new)`                                       |

---


### Exercise: k-Nearest Neighbors: Fit

In this exercise, you will build your first classification model using the `churn_df` dataset, which has been preloaded for the remainder of the chapter.

The target, `"churn"`, needs to be a single column with the same number of observations as the feature data. The feature data has already been converted into NumPy arrays.

`"account_length"` and `"customer_service_calls"` are treated as features because account length indicates customer loyalty, and frequent customer service calls may signal dissatisfaction, both of which can be good predictors of churn.

#### Instructions

1. Import `KNeighborsClassifier` from `sklearn.neighbors`.
2. Instantiate a `KNeighborsClassifier` called `knn` with 6 neighbors.
3. Fit the classifier to the data using the `.fit()` method.

```python

# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values

# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

```

## Exercise: k-Nearest Neighbors — Predict

Now that you have fit a KNN classifier, you can use it to predict the labels of new data points. All available data was used for training, but fortunately new observations are available. These have been preloaded for you as `X_new`.

The model `knn`, which you created and fit in the previous exercise, has been preloaded for you. You will use your classifier to predict the labels of a set of new data points:

```python
X_new = np.array([
    [30.0, 17.5],
    [107.0, 24.1],
    [213.0, 10.9]
])
```


---

### Instructions

1. Create `y_pred` by predicting the target values of the unseen features `X_new` using the `knn` model.
2. Print the predicted labels.

```python
# Predict the labels for X_new
y_pred = knn.predict(X_new)

# Print the predictions
print("Predictions: {}".format(y_pred))

# Output:
# Predictions: [0 1 0]
```

# Measuring Model Performance in Classification

---

## 1. Introduction: Why Measure Model Performance?

- **Purpose**:  
  After building a classifier, we need to evaluate how well it predicts the correct labels.
- **Key Question**:  
  *Is the model making good predictions?*

---

## 2. Accuracy: A Common Metric for Classification

- **Definition**:  
  **Accuracy** = (Number of correct predictions) / (Total number of predictions)
- **Usage**:  
  Widely used to measure classifier performance.
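
The definition translates directly into code. The labels below are hypothetical; `accuracy_score` from `sklearn.metrics` computes the same ratio.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for five observations
y_true = np.array([1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0])

# Correct predictions divided by total predictions
manual_accuracy = (y_true == y_pred).sum() / len(y_true)
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)
```

Four of the five predictions match, so both values come out to 0.8.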

---

## 3. Pitfall: Evaluating on Training Data

- **Issue**:  
  Measuring accuracy on the **training data** can give an **overly optimistic** view of performance.
- **Reason**:  
  The model has already seen (and fit to) the training data, so it does not reflect ability to generalize to new, unseen data.
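
A small experiment illustrates the problem. In this sketch the features and labels are random noise (an assumption chosen to make the point starkly), so there is nothing real to learn. A 1-nearest-neighbor model still scores perfectly on its own training data, because each training point is its own nearest neighbor, while its score on held-out data is far less flattering.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features and labels are pure noise, so there is no real pattern to find
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# With k=1 every training point is its own nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_score = knn.score(X_train, y_train)
test_score = knn.score(X_test, y_test)
print("train accuracy:", train_score)
print("test accuracy: ", test_score)
```
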
---

## 4. Solution: Train/Test Split

- **Approach**:  
  - **Split** the dataset into a **training set** and a **test set**.
  - **Train** the model on the training set.
  - **Evaluate** its accuracy on the test set, which simulates new/unseen data.

![image.png](attachment:e63f67e3-f980-4122-b08e-89274891379e.png)

---

### Code Example: Train/Test Split and Accuracy

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.3,         # 30% for testing
    random_state=21,       # ensures reproducibility
    stratify=y             # preserves class proportion
)

# Instantiate the KNN model with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model on the training data
knn.fit(X_train, y_train)

# Evaluate accuracy on the test data
print(knn.score(X_test, y_test))
```

#### Output

```python
0.8800599700149925
```

---

### Line-by-Line Explanation

#### 1. `from sklearn.model_selection import train_test_split`

- **What**: Imports the function to split your data into training and test sets.
- **Why**: To enable proper evaluation of model performance on unseen data.

#### 2. `from sklearn.neighbors import KNeighborsClassifier`

- **What**: Imports KNN classifier.
- **Why**: To build a K-Nearest Neighbors model.

#### 3. `X_train, X_test, y_train, y_test = train_test_split(...)`

- **What**: Splits features (`X`) and targets (`y`) into training and test sets.
- **Parameters**:
    - `test_size=0.3`: 30% for testing, 70% for training.
    - `random_state=21`: Ensures the split is reproducible.
    - `stratify=y`: Maintains the proportion of each class in both splits.
- **Result**: Four arrays: training features, test features, training labels, test labels.
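
To see what `stratify=y` does, the sketch below splits a hypothetical target with a 9:1 class ratio and compares the share of the positive class in each piece. The data is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 zeros and 10 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Share of the positive class in the full data and in each split
print(y.mean(), y_train.mean(), y_test.mean())
```

With stratification, both splits keep the original 10% share of positives. Without it, a small test set could by chance contain very few positive examples, or none at all.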

#### 4. `knn = KNeighborsClassifier(n_neighbors=6)`

- **What**: Instantiates a KNN classifier with `k=6`.
- **Why**: To set up the model for training.

#### 5. `knn.fit(X_train, y_train)`

- **What**: Trains the KNN model on the training data.
- **Why**: The model learns from known data.

#### 6. `print(knn.score(X_test, y_test))`

- **What**: Calculates and prints the accuracy on the test set.
- **Why**: To see how well the model performs on unseen data.
- **Expected Output**: approximately `0.88` (88% accuracy), meaning the model predicts the correct label for about 88% of the test observations.

---

### Output Significance

- **88% accuracy**:  
  Out of all test set observations, 88% were classified correctly.
- **Note**:  
  If the target variable is imbalanced (e.g., a 9:1 ratio), accuracy can be misleading: a model that always predicts the majority class would already score 90%. Other metrics may be more appropriate in that case.
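
The effect of imbalance can be demonstrated with a baseline that ignores the features entirely. The sketch uses scikit-learn's `DummyClassifier` on a hypothetical 9:1 target; the all-zero feature matrix is a deliberate assumption, to show that the score comes from the class ratio alone.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical 9:1 imbalanced target; the features carry no information
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

Any real classifier on data like this should be judged against that baseline rather than against zero.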

---

## 5. Model Complexity: The Role of k in KNN

- **Interpretation of k**:
    - **Larger k**: Simpler model, smoother boundaries, **can cause underfitting** (fails to capture patterns).
    - **Smaller k**: More complex model, more sensitive to noise, **can lead to overfitting** (captures random fluctuations).

![image.png](attachment:22578c0d-ae1c-458d-b656-ca6136576545.png)

---

## 6. Investigating Overfitting/Underfitting: Model Complexity Curve

- **Goal**:  
  Examine how model performance changes as `k` varies.
- **Method**:
    - Measure **accuracy** on both training and test sets for different values of `k`.
    - Plot results to visualize overfitting/underfitting.

### Code Example: Model Complexity Curve

```python
import numpy as np

train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1, 26)

for neighbor in neighbors:
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)
```

#### Output

- `train_accuracies` and `test_accuracies` are dictionaries mapping each k value (from 1 to 25) to accuracy on the training and test set, respectively.

---

### Line-by-Line Explanation

#### 1. `import numpy as np`

- **What**: Imports NumPy for numerical operations and to create a range of neighbor values.
- **Why**: Needed for array operations and iteration.

#### 2. `train_accuracies = {}` and `test_accuracies = {}`

- **What**: Create empty dictionaries to store accuracy scores.
- **Why**: To collect results for plotting and analysis.

#### 3. `neighbors = np.arange(1, 26)`

- **What**: Creates an array of integers from 1 to 25 (inclusive).
- **Why**: These are the k values to try for KNN.

#### 4. `for neighbor in neighbors: ...`

- **What**: Loops through each k value.
- **Why**: To train and evaluate a separate KNN model for each number of neighbors.

#### 5. `knn = KNeighborsClassifier(n_neighbors=neighbor)`

- **What**: Instantiates a KNN classifier with current k value.
- **Why**: To test performance for this k.

#### 6. `knn.fit(X_train, y_train)`

- **What**: Trains the classifier on the training data for current k.
- **Why**: Each model is fit anew for fair comparison.

#### 7. `train_accuracies[neighbor] = knn.score(X_train, y_train)`

- **What**: Computes accuracy on training data for current k and stores it.
- **Why**: To see how well the model fits the training data.

#### 8. `test_accuracies[neighbor] = knn.score(X_test, y_test)`

- **What**: Computes accuracy on test data for current k and stores it.
- **Why**: To see how well the model generalizes.

---

## 7. Plotting the Model Complexity Curve

### Code Example: Plotting Accuracies vs. Number of Neighbors

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.title("KNN: Varying Number of Neighbors")
plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")
plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()
```
![image.png](attachment:ecb98dc1-4d46-4441-8acc-56d6e2d1a4ca.png)

#### Output

- A line plot with:
    - **X-axis**: Number of neighbors (`k`)
    - **Y-axis**: Accuracy
    - **Two lines**: Training accuracy and testing accuracy for each value of k

---

### Line-by-Line Explanation

#### 1. `import matplotlib.pyplot as plt`

- **What**: Imports plotting library.
- **Why**: To create visualizations.

#### 2. `plt.figure(figsize=(8, 6))`

- **What**: Sets the plot size.
- **Why**: For clearer visualization.

#### 3. `plt.title("KNN: Varying Number of Neighbors")`

- **What**: Adds a title to the plot.
- **Why**: To describe what the plot shows.

#### 4. `plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")`

- **What**: Plots training accuracy vs. number of neighbors.
- **Why**: To visualize how model fits training data as k changes.

#### 5. `plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")`

- **What**: Plots test accuracy vs. number of neighbors.
- **Why**: To visualize generalization to new data.

#### 6. `plt.legend()`

- **What**: Adds a legend to distinguish training and testing lines.
- **Why**: For clarity.

#### 7. `plt.xlabel("Number of Neighbors")`, `plt.ylabel("Accuracy")`

- **What**: Label the axes.
- **Why**: Explain what each axis represents.

#### 8. `plt.show()`

- **What**: Displays the plot.
- **Why**: To see the result.

---

### Significance of the Model Complexity Curve

- **Interpretation**:
    - **Low k (complex model)**: High training accuracy, lower test accuracy (overfitting).
    - **High k (simple model)**: Both accuracies drop and plateau (underfitting).
    - **Best test accuracy**: Often at an intermediate k (e.g., peak at k ≈ 13).
- **Goal**:  
  *Choose k where test accuracy is maximized, balancing bias and variance.*
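
Reading the best `k` off the plot can also be done in code. The following sketch is self-contained and therefore uses synthetic data from `make_classification` instead of `churn_df`; the dataset size and the range of `k` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notes themselves use churn_df
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

test_accuracies = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    test_accuracies[k] = knn.score(X_test, y_test)

# The k with the highest test accuracy (the smallest such k wins on ties)
best_k = max(test_accuracies, key=test_accuracies.get)
print(best_k, test_accuracies[best_k])
```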

---

## 8. Summary Table: Evaluating and Tuning KNN Classifiers

| Step                         | Description                                                |
|------------------------------|-----------------------------------------------------------|
| Split data                   | `train_test_split` for train/test evaluation              |
| Fit model                    | Train on `X_train`, `y_train`                             |
| Score model                  | `.score(X_test, y_test)` for accuracy                     |
| Tune model complexity (k)    | Loop over k, store accuracies, plot complexity curve      |
| Interpret results            | Choose k with best test accuracy, avoid over/underfitting |

---

## 9. Key Takeaways

- **Always evaluate models on unseen (test) data**.
- **Model complexity** is controlled by `k` in KNN:
    - Too low: overfitting
    - Too high: underfitting
- **Use plots** to visualize and select the best k for your data.

---


### Exercise: Train/test split + computing accuracy

It's time to practice splitting your data into training and test sets with the `churn_df` dataset!

NumPy arrays have been created for you containing the features as `X` and the target variable as `y`.

#### Instructions

1. Import `train_test_split` from `sklearn.model_selection`.
2. Split `X` and `y` into training and test sets, setting `test_size` equal to 20%, `random_state` to 42, and ensuring the target label proportions reflect that of the original dataset.
3. Fit the `knn` model to the training data.
4. Compute and print the model's accuracy for the test data.

```python
# Import the splitting function and the classifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split into training and test sets, preserving the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

# Output:
# 0.8740629685157422
```

## Exercise: Overfitting and Underfitting

Interpreting **model complexity** is a great way to evaluate supervised learning performance.  
Your aim is to produce a model that can both **interpret the relationship** between features and the target variable, and also **generalize well** when exposed to new observations.

---

### Setup
- Training and test sets are already created:  
  `X_train, X_test, y_train, y_test`
- Libraries imported:  
  `from sklearn.neighbors import KNeighborsClassifier`  
  `import numpy as np`

---

### Instructions
1. Create `neighbors` as a numpy array of values from **1 up to and including 12**.  
2. Instantiate a `KNeighborsClassifier`, with the number of neighbors equal to the neighbor iterator.  
3. Fit the model to the training data.  
4. Calculate accuracy scores for the **training set** and **test set** separately using `.score()`.  
5. Save results in the dictionaries `train_accuracies` and `test_accuracies` using the neighbor value as the key.  

---

### Code
```python
# Create neighbors
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:
    # Set up a KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    
    # Fit the model
    knn.fit(X_train, y_train)
    
    # Compute accuracy
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)

print(neighbors, '\n', train_accuracies, '\n', test_accuracies)
```

---

### Output

#### Values of `neighbors`

```
[ 1  2  3  4  5  6  7  8  9 10 11 12]
```

---

#### Training Accuracy

| Neighbors | Accuracy |
| --------- | -------- |
| 1         | 1.0000   |
| 2         | 0.8879   |
| 3         | 0.9070   |
| 4         | 0.8734   |
| 5         | 0.8829   |
| 6         | 0.8689   |
| 7         | 0.8754   |
| 8         | 0.8659   |
| 9         | 0.8679   |
| 10        | 0.8629   |
| 11        | 0.8644   |
| 12        | 0.8604   |

---

#### Test Accuracy

| Neighbors | Accuracy |
| --------- | -------- |
| 1         | 0.7871   |
| 2         | 0.8501   |
| 3         | 0.8426   |
| 4         | 0.8561   |
| 5         | 0.8553   |
| 6         | 0.8613   |
| 7         | 0.8636   |
| 8         | 0.8606   |
| 9         | 0.8621   |
| 10        | 0.8598   |
| 11        | 0.8598   |
| 12        | 0.8591   |

---

### Interpretation

* **Training accuracy** is perfect at `n_neighbors=1` (overfitting), then generally decreases as the number of neighbors increases.
* **Test accuracy** starts lower at `n_neighbors=1`, improves with more neighbors, and stabilizes around `0.86`.
* The gap between training and test accuracy shows the **bias–variance tradeoff**.
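
The gap can be quantified directly from the two tables. The sketch below copies the tabulated accuracies into dictionaries and computes the train/test difference for each `k`, along with the `k` that gives the highest test accuracy; the name `gaps` is introduced here for illustration.

```python
# Accuracies copied from the two tables above, keyed by number of neighbors
train_accuracies = {1: 1.0000, 2: 0.8879, 3: 0.9070, 4: 0.8734, 5: 0.8829, 6: 0.8689,
                    7: 0.8754, 8: 0.8659, 9: 0.8679, 10: 0.8629, 11: 0.8644, 12: 0.8604}
test_accuracies = {1: 0.7871, 2: 0.8501, 3: 0.8426, 4: 0.8561, 5: 0.8553, 6: 0.8613,
                   7: 0.8636, 8: 0.8606, 9: 0.8621, 10: 0.8598, 11: 0.8598, 12: 0.8591}

# Gap between training and test accuracy for each k
gaps = {k: round(train_accuracies[k] - test_accuracies[k], 4) for k in train_accuracies}
best_k = max(test_accuracies, key=test_accuracies.get)

print(gaps[1], gaps[12])
print(best_k, test_accuracies[best_k])
```

The gap shrinks from roughly 0.21 at `k = 1` to about 0.001 at `k = 12`, and test accuracy is highest at `k = 7`.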


### Exercise: Visualizing model complexity

Now that you have calculated the accuracy of the KNN model on the training and test sets using various values of `n_neighbors`, you can create a model complexity curve to visualize how performance changes as the model becomes less complex!

The variables `neighbors`, `train_accuracies`, and `test_accuracies`, which you generated in the previous exercise, have all been preloaded for you. You will plot the results to aid in finding the optimal number of neighbors for your model.

#### Instructions

1. Add a title `"KNN: Varying Number of Neighbors"`.
2. Plot the `.values()` of `train_accuracies` on the y-axis against `neighbors` on the x-axis, with a label of `"Training Accuracy"`.
3. Plot the `.values()` of `test_accuracies` on the y-axis against `neighbors` on the x-axis, with a label of `"Testing Accuracy"`.
4. Display the plot.

```python
import matplotlib.pyplot as plt

# Add a title
plt.title("KNN: Varying Number of Neighbors")

# Plot training accuracies
plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")

# Plot test accuracies
plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
plt.show()
```
![image.png](attachment:a602f01e-b45e-468b-8ae4-58cecac39dd9.png)

Notice how training accuracy generally decreases and test accuracy increases as the number of neighbors gets larger. For the test set, accuracy peaks with 7 neighbors, suggesting that it is the optimal value for our model.
