---
title: "Classification Methods"
subtitle: "IN2004B: Generation of Value with Data Analytics"
author: 
  - name: Alan R. Vazquez
    affiliations:
      - name: Department of Industrial Engineering
format: 
  revealjs:
    chalkboard: false
    multiplex: false
    footer: "Tecnologico de Monterrey"
    logo: IN1002b_logo.png
    css: style.css
    slide-number: True
    html-math-method: mathjax
editor: visual
jupyter: python3
---


## Agenda

</br>

1. Introduction
2. Classification and Regression Trees (CART)
3. Classification Algorithm Metrics
4. *K* Nearest Neighbors

# Introduction

## Load the libraries

Before we start, let's import the data science libraries into Python.


In [None]:
#| echo: true
#| output: false

# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay 
from sklearn.metrics import accuracy_score

Here, we use specific functions from the **pandas**, **matplotlib**, and **sklearn** libraries in Python.

## Main Data Science Problems

</br>

[**Regression Problems**]{style="color:green;"}. The response is numerical. For example, a person's income, the value of a house, or a patient's blood pressure.

[**Classification Problems**]{style="color:blue;"}. The response is categorical and involves K different categories. For example, the brand of a product purchased (A, B, C) or whether a person defaults on a debt (yes or no).

The predictors ($\boldsymbol{X}$) can be *numerical* or *categorical*.

## Main Data Science Problems

</br>

[**Regression Problems**. The response is numerical. For example, a person's income, the value of a house, or a patient's blood pressure.]{style="color:gray;"}

[**Classification Problems**]{style="color:blue;"}. The response is categorical and involves K different categories. For example, the brand of a product purchased (A, B, C) or whether a person defaults on a debt (yes or no).

The predictors ($\boldsymbol{X}$) can be *numerical* or *categorical*.

## Terminology

</br></br>

Explanatory variables or predictors:

-   $X$ represents an explanatory variable or predictor.
-   $\boldsymbol{X} = (X_1, X_2, \ldots, X_p)$ represents a collection of $p$ predictors.

## 

</br>

[Response]{style="text-decoration: underline;"}:

::: incremental
- $Y$ is a [**categorical variable**]{style="color:darkgreen;"} that takes [**2 categories**]{style="color:darkgreen;"} or [**classes**]{style="color:darkgreen;"}.

- For example, $Y$ can take [0]{style="color:darkgreen;"} or [1]{style="color:darkgreen;"}, [A]{style="color:darkgreen;"} or [B]{style="color:darkgreen;"}, [no]{style="color:darkgreen;"} or [yes]{style="color:darkgreen;"}, [spam]{style="color:darkgreen;"} or [no spam]{style="color:darkgreen;"}.

- When classes are strings, they are usually encoded as 0 and 1.
  - The **target class** is the one for which $Y = 1$. 
  - The **reference class** is the one for which $Y = 0$.
:::

## Classification Algorithms

</br>

Classification algorithms use predictor values [to predict the class]{style="color:blue;"} of the response (target or reference).

</br>

That is, for an unseen record, they use predictor values to predict whether the record belongs to the target class or not.

</br>

Technically, [**they predict the probability**]{style="color:purple;"} that the record belongs to the target class.

## Classification Algorithms

</br></br>

[**Goal**]{style="color:darkgreen;"}: Develop a function $C(\boldsymbol{X})$ for predicting $Y = \{0, 1\}$ from $\boldsymbol{X}$.

</br>

. . .

To achieve this goal, most algorithms consider functions $C(\boldsymbol{X})$ that [**predict the probability**]{style="color:brown;"} that $Y$ takes the value of 1.

</br>

. . .

A probability for each class can be very useful for gauging the model’s confidence about the predicted classification.

## Example 1

Consider a spam filter where $Y$ is the email type.

- The target class is spam. In this case, $Y=1$.
- The reference class is not spam. In this case, $Y=0$.

. . .

![](images/spam.png){fig-align="center" width="556" height="178"}

. . .

Both emails would be classified as spam. However, we would have greater confidence in our classification for the second email.

## 

</br>

Technically, $C(\boldsymbol{X})$ works with the *conditional probability*:

$$P(Y = 1 | X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p) = P(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$$

In words, this is the probability that $Y$ takes a value of 1 [**given that**]{style="color:brown;"} the predictors $\boldsymbol{X}$ have taken the values $\boldsymbol{x} = (x_1, x_2, \ldots, x_p)$.

</br>

. . .

The conditional probability that $Y$ takes the value of 0 is

$$P(Y = 0 | \boldsymbol{X} = \boldsymbol{x}) = 1 - P(Y = 1 | \boldsymbol{X} = \boldsymbol{x}).$$

## Bayes Classifier

</br>

It turns out that, if we know the true structure of $P(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$, we can build a good classification function called the [**Bayes classifier**]{style="color:darkblue;"}:

$$C(\boldsymbol{X}) =
    \begin{cases}
      1, & \text{if}\ P(Y = 1 | \boldsymbol{X} = \boldsymbol{x}) > 0.5 \\
      0, & \text{if}\ P(Y = 1 | \boldsymbol{X} = \boldsymbol{x}) \leq 0.5
    \end{cases}.$$

This function classifies to the most probable class using the conditional distribution $P(Y | \boldsymbol{X} = \boldsymbol{x})$.
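The rule above can be sketched in a few lines of Python. This is only an illustration that assumes the probability $P(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$ is already known; the function name `bayes_classify` and the example probabilities are hypothetical.

```python
def bayes_classify(prob_target, threshold=0.5):
    """Return 1 (target class) if the probability exceeds the threshold, else 0."""
    return 1 if prob_target > threshold else 0

# Hypothetical conditional probabilities P(Y = 1 | X = x) for four records.
probabilities = [0.91, 0.62, 0.50, 0.08]
predicted = [bayes_classify(p) for p in probabilities]
print(predicted)
```

Note that a probability of exactly 0.5 goes to the reference class, as in the definition.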

## 

</br>

[HOWEVER, we don’t (and will never) know the true form of $P(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$!]{style="color:red;"}

</br>

. . .

To overcome this issue, we have several standard solutions:

::: incremental
-   [**Logistic Regression**]{style="color:brown;"}: Impose a structure on $P(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$. This was covered in IN1002B.
-   [**Classification Trees**]{style="color:darkblue;"}: Estimate $P(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$ directly. What we will cover today.
-   [**_K_-Nearest Neighbors**]{style="color:darkgreen;"}: Estimate $P(Y = 1 | \boldsymbol{X} = \boldsymbol{x})$ directly. (Optional).
:::

## Two datasets

</br></br>

The application of data science models needs two data sets:

::: incremental
-   [**Training data**]{style="color:blue;"} is data that we use to train or construct the estimated function $\hat{f}(\boldsymbol{X})$.

-   [**Test data**]{style="color:green;"} is data that we use to evaluate the predictive performance of $\hat{f}(\boldsymbol{X})$ only.
:::

## 

::::: columns
::: {.column width="30%"}
![](images/training.png){width="256"}
:::

::: {.column width="70%"}
</br>

A random sample of $n$ observations.

Use it to **construct** $\hat{f}(\boldsymbol{X})$.
:::
:::::

::::: columns
::: {.column width="30%"}
![](images/test.png){width="262"}
:::

::: {.column width="70%"}
Another random sample of $n_t$ observations, which is independent of the training data.

Use it to **evaluate** $\hat{f}(\boldsymbol{X})$.
:::
:::::

## Validation Dataset

In many practical situations, a test dataset is not available. To overcome this issue, we use a [**validation dataset**]{style="color:orange;"}.

![](images/validation.png){fig-align="center" width="645"}

. . .

**Idea**: Apply the model to your [**validation dataset**]{style="color:orange;"} to mimic what will happen when you apply it to the test dataset.

## Example 1

</br>

The file "BostonHousing.xlsx" contains data collected by the US Bureau of the Census concerning housing in the area of Boston, Massachusetts. The dataset includes data on 506 census housing tracts in the Boston area in the 1970s.

The **goal** is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms.

The [response]{style="color:darkred;"} is the median value of owner-occupied homes in \$1000s, contained in the column `MEDV`.

## The predictors

::: {style="font-size: 70%;"}
-   `CRIM`: per capita crime rate by town.
-   `ZN`: proportion of residential land zoned for lots over 25,000 sq.ft.
-   `INDUS`: proportion of non-retail business acres per town.
-   `CHAS`: Charles River ('Yes' if tract bounds river; 'No' otherwise).
-   `NOX`: nitrogen oxides concentration (parts per 10 million).
-   `RM`: average number of rooms per dwelling.
-   `AGE`: proportion of owner-occupied units built prior to 1940.
-   `DIS`: weighted mean of distances to five Boston employment centers
-   `RAD`: index of accessibility to radial highways ('Low', 'Medium', 'High').
-   `TAX`: full-value property-tax rate per \$10,000.
-   `PTRATIO`: pupil-teacher ratio by town.
-   `LSTAT`: lower status of the population (percent).
:::

## Read the dataset

</br>

We read the dataset and set the variables `CHAS` and `RAD` as categorical.


In [None]:
#| echo: true
#| output: true

Boston_data = pd.read_excel('BostonHousing.xlsx')

Boston_data['CHAS'] = pd.Categorical(Boston_data['CHAS'])
Boston_data['RAD'] = pd.Categorical(Boston_data['RAD'], 
                                      categories=["Low", "Medium", "High"], 
                                      ordered=True)

## 

</br>


In [None]:
#| echo: true
#| output: true

Boston_data.head()

## How do we generate validation data?

We split the current dataset into a training and a validation dataset. To this end, we use the function `train_test_split()` from **scikit-learn**.

</br>

The function has three main inputs:

-   A pandas dataframe with the predictor columns only.
-   A pandas dataframe with the response column only.
-   The parameter `test_size` which sets the portion of the dataset that will go to the validation set.

## Create the predictor matrix

We use the function `.drop()` from **pandas**. This function drops one or more columns from a data frame. Let's drop the response column `MEDV` and store the result in `X_full`.


In [None]:
#| echo: true
#| output: true

# Set full matrix of predictors.
X_full = Boston_data.drop(columns = ['MEDV']) 
X_full.head(4)

## Create the response column

We use the function `.filter()` from **pandas** to extract the column `MEDV` from the data frame. We store the result in `Y_full`.


In [None]:
#| echo: true
#| output: true

# Set full matrix of responses.
Y_full = Boston_data.filter(['MEDV'])
Y_full.head(4)

## Let's partition the dataset

</br>


In [None]:
#| echo: true
#| output: true

# Split the dataset into training and validation.
X_train, X_valid, Y_train, Y_valid = train_test_split(X_full, Y_full, 
                                                      test_size = 0.3)

-   By default, the function shuffles the rows at random before splitting them; it does not look at the response.

-   With a random split, the distribution of the response in the training and validation sets is similar on average. For a categorical response, the argument `stratify` forces both sets to keep the same class proportions.

-   Usually, the proportion of the dataset that goes to the validation set is 20% or 30%.
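The options of `train_test_split()` can be illustrated on a small toy table. This is a sketch on hypothetical data (the names `toy`, `X_toy`, and `Y_toy` are not part of the Boston example), using a 25% validation share so that the counts are easy to check.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 records, one predictor, and a binary response.
toy = pd.DataFrame({
    "x": range(20),
    "y": ["no"] * 12 + ["yes"] * 8,
})
X_toy = toy[["x"]]
Y_toy = toy["y"]

# random_state fixes the shuffle so the split is reproducible.
# stratify keeps the class proportions of Y_toy in both parts.
X_tr, X_va, Y_tr, Y_va = train_test_split(
    X_toy, Y_toy, test_size=0.25, random_state=1, stratify=Y_toy
)
print(len(X_tr), len(X_va))
print(Y_va.value_counts())
```

Without `random_state`, each run produces a different partition, so results on the validation set will vary from run to run.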

## 

The predictors and response in the training dataset are in the objects `X_train` and `Y_train`, respectively. We compile these objects into a single dataset using the function `concat()` from **pandas**. The argument `axis = 1` tells `concat()` to place the datasets side by side, that is, to concatenate them along the columns.


In [None]:
#| echo: true
#| output: true

training_dataset = pd.concat([X_train, Y_train], axis = 1)
training_dataset.head(4)
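To see what the `axis` argument does, the following sketch compares both settings on two tiny hypothetical tables (`X_small` and `Y_small` are made up for illustration).

```python
import pandas as pd

# Hypothetical predictor and response tables that share the same row index.
X_small = pd.DataFrame({"RM": [6.5, 5.9, 7.1]}, index=[10, 11, 12])
Y_small = pd.DataFrame({"MEDV": [24.0, 19.5, 33.2]}, index=[10, 11, 12])

# axis=1 places the tables side by side, matching rows by index.
side_by_side = pd.concat([X_small, Y_small], axis=1)

# axis=0 (the default) stacks the tables on top of each other instead.
stacked = pd.concat([X_small, Y_small], axis=0)

print(side_by_side.shape)  # 3 rows, 2 columns
print(stacked.shape)       # 6 rows, 2 columns
```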

## 

Similarly, the predictors and response in the validation dataset are in the objects `X_valid` and `Y_valid`, respectively.


In [None]:
#| echo: true
#| output: true

validation_dataset = pd.concat([X_valid, Y_valid], axis = 1)
validation_dataset.head()

## Work on your training dataset

After we have partitioned the data, we **work on the** [**training data**]{style="color:blue;"} to develop our predictive pipeline.

The pipeline has two main steps:

1.  Data preprocessing.
2.  Model development.

We will now discuss preprocessing techniques applied to the predictor columns in the training dataset.

Note that all preprocessing techniques will also be applied to the [**validation dataset**]{style="color:orange;"} and [**test dataset**]{style="color:green;"} to prepare them for your model!


# Classification and Regression Trees (CART)

## Decision Tree

It is a supervised learning algorithm that predicts or classifies observations using a hierarchical tree structure.

- Simple and useful for interpretation.

- Can handle numerical and categorical predictors and responses.

- Computationally efficient.

- Nonparametric technique.

## Example 2: Identifying Counterfeit Banknotes

</br>

![](images/clipboard-270396609.png)

Dataset

The data is located in the file "banknotes.xlsx".


In [None]:
#| echo: true
#| output: true

bank_data = pd.read_excel("banknotes.xlsx")
# Set response variable as categorical.
bank_data['Status'] = pd.Categorical(bank_data['Status'])
bank_data.head()

## Generating Training Data

We split the current dataset into two datasets: training and validation. To do this, we use the scikit-learn `train_test_split()` function.


In [None]:
#| echo: true
#| output: true

# Set full matrix of predictors.
X_full = bank_data.drop(columns = ['Status'])

# Set full matrix of responses.
Y_full = bank_data['Status']

# Split the dataset.
X_train, X_val, Y_train, Y_val = train_test_split(X_full, Y_full, 
                                                    test_size=0.3)

The `test_size` parameter sets the portion of the dataset that will go into the validation set.

## 

</br>

- By default, the function partitions the data at random; it does not use the response.

- With a random split, the response distribution in the training and validation sets is similar on average. The argument `stratify` forces both sets to keep the same class proportions.

- Typically, the proportion of the data set allocated to the validation set is 20% or 30%.

- Later, we will use the [**validation data set**]{style="color:orange;"} to evaluate how well the fitted classifier classifies unobserved data.

## Basic idea of a decision tree

Stratify or segment the prediction space into several simpler regions.

![](images/Screenshot%202025-07-28%20at%2011.47.35%20a.m..png){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.012.jpeg){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.013.jpeg){fig-align="center"}

## How do you build a decision tree?

Building decision trees involves two main procedures.

1. [Grow a large tree.]{style="color:darkblue;"}

2. [Prune the tree to prevent overfitting.]{style="color:darkblue;"}

After building a “good” tree, we can predict new observations that are not in the data set we used to build it.
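As a rough sketch of this workflow in **scikit-learn**, the code below grows a tree and predicts unseen records. It uses synthetic stand-in data from `make_classification()` rather than the banknotes file, and the setting `min_samples_leaf=5` is an illustrative choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the banknotes file): 300 records, 4 predictors.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Grow a tree whose terminal nodes keep at least 5 training observations.
tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
tree.fit(X_tr, y_tr)

# Predict classes and class probabilities for records the tree has not seen.
classes = tree.predict(X_va)
probabilities = tree.predict_proba(X_va)
print(classes[:5])
print(probabilities[:5])
```

Each row of `probabilities` holds the estimated probability of each class, which is the proportion of training records of that class in the terminal node where the record falls.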

## How do we grow a tree?

**Using the CART algorithm!**

The algorithm uses a recursive binary splitting strategy that builds the tree using a greedy top-down approach.

Basically, at a given node, it considers all variables and all possible splits of each variable. Then, for classification, it chooses the variable and the split that **minimize** the so-called [***impurity***]{style="color:purple;"}.
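The search for a split can be sketched for a single numerical predictor. This is a simplified illustration, not the scikit-learn implementation: it assumes a 0/1 response, uses the Gini index as the impurity, and the helper names `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Try every midpoint between sorted distinct values; return the best cut."""
    distinct = sorted(set(values))
    best_cut, best_impurity = None, float("inf")
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        # Weighted average impurity of the two child nodes.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_cut, best_impurity = cut, impurity
    return best_cut, best_impurity

# Hypothetical predictor values and classes for eight records.
x = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]
y = [0, 0, 0, 0, 1, 1, 1, 0]
print(best_split(x, y))
```

The full algorithm repeats this search over every predictor and then applies it again inside each of the two resulting nodes.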

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.016.jpeg){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.017.jpeg){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.018.jpeg){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.019.jpeg)

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.020.jpeg){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.021.jpeg){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.022.jpeg){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.023.jpeg){fig-align="center"}

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.024.jpeg){fig-align="center"}

## 

:::::: center
::::: columns
::: {.column width="40%"}
We repeat the partitioning process until the terminal nodes have no less than 5 observations.
:::

::: {.column width="60%"}
![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.025.jpeg){fig-align="center"}
:::
:::::
::::::

## What is impurity?

Node impurity measures how mixed the response classes are at a node. A pure node contains observations from a single class.

:::::: center
::::: columns
::: {.column width="50%"}
![](images/Impurity1.png){fig-align="center"}
:::

::: {.column width="50%"}
![](images/Impurity2.png){fig-align="center"}
:::
:::::
::::::

[*At each split, the CART algorithm chooses the partition that minimizes the impurity of the resulting nodes.*]{style="color:darkgray;"}

# How do we measure impurity?

:::::: center
::::: columns
::: {.column width="40%"}
There are three different metrics for impurity:

- Risk of misclassification.

- Cross entropy.

- Gini impurity index.
:::

::: {.column width="60%"}
![](images/Metrics1.png){fig-align="center"} ![](images/Metrics2.png){fig-align="center"} [Proportion of elements in a class]{.smallcaps}
:::
:::::
::::::
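For a two-class node, the three metrics can be written as functions of the proportion $p$ of records in the target class. The sketch below assumes the usual textbook definitions and a base-2 logarithm for the cross entropy (the natural logarithm is also common); the function names are hypothetical.

```python
import math

def misclassification(p):
    """Misclassification error for a node with target-class proportion p."""
    return min(p, 1 - p)

def cross_entropy(p):
    """Cross entropy (in bits); defined as 0 for a pure node."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini_index(p):
    """Gini impurity index for two classes."""
    return 2 * p * (1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, misclassification(p), round(cross_entropy(p), 3), round(gini_index(p), 3))
```

All three metrics are zero for a pure node and reach their maximum when the two classes are equally frequent.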

## Pruning the Tree

To avoid overfitting, we prune some of the tree's branches. More specifically, we collapse two internal (non-terminal) nodes.

![](images/clipboard-1949573140.png)

## 

![](images/Modulo%202%20-%20Algoritmos%20de%20Clasificacion.029.jpeg){fig-align="center"}

## 

To prune a tree, we use an advanced algorithm to measure the contribution of the tree's branches.

The algorithm has a tuning parameter called $\alpha$, which **sets the penalty on the number of tree nodes** (or size):

- Large values of $\alpha$ result in small trees with few nodes.

- Small values of $\alpha$ result in large trees with many nodes.
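In **scikit-learn**, cost-complexity pruning is controlled by the argument `ccp_alpha` of `DecisionTreeClassifier()`; the sketch below assumes that this argument plays the role of $\alpha$ here. The data are synthetic and the values of `alpha` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; any training set would do.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

# Fit the same tree with increasing penalties on its size.
leaf_counts = []
for alpha in (0.0, 0.005, 0.02, 0.1):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    tree.fit(X_demo, y_demo)
    leaf_counts.append(tree.get_n_leaves())
    print(f"alpha = {alpha}: {leaf_counts[-1]} terminal nodes")
```

In practice, a value of $\alpha$ is usually chosen by comparing the performance of the pruned trees on the validation data.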

## Implementation Details

- Categorical predictors with unordered levels $\{A, B, C\}$. We order the levels in a specific way (works for binary and regression problems).

- Predictors with missing values. For quantitative predictors, we use multiple imputation. For categorical predictors, we create a new "NA" level.

- Ternary or quaternary splits. They do not bring much improvement.

- Diagonal splits (using a linear combination for partitioning). These can lead to improvement, but they impair interpretability.

# Python Example

The “AdultReduced.jmp” data comes from the UCI Machine Learning Repository and is derived from US Census records.

In this data, the goal is to predict whether a person's income was high (defined in 1994 as more than \$50,000) or low.

Predictors include education level, job type (e.g., never worked and local government), capital gains/losses, hours worked per week, country of origin, etc.

The data contains 7,508 records.

## Disadvantages of Decision Trees

- Decision trees have high variance. A small change in the training data can result in a very different tree.

- They have trouble identifying simple data structures.

![](images/clipboard-3265772983.png)

# Classification Algorithm Metrics

## Evaluation

</br>

We evaluate a logistic regression classifier by classifying observations that were not used for training or estimation.

That is, we use the classifier to predict categories in the test data set using only the predictor values from this set.

In Python, we use the commands:


In [None]:
#| echo: true

# Remove problematic predictor from the test set.
#X_val = X_val.drop(columns = ['Right'])

# Add constant to the predictor matrix from the test set.
#X_val = sm.add_constant(X_val)

# Predict probabilities.
#predicted_probability = logit_model.predict(X_val)

## 

The `predict()` function generates [**probabilities**]{style="color:brown;"} instead of the actual classes.


In [None]:
#| echo: true
#| output: true

#predicted_probability.head()

These are the probabilities that a bill is "counterfeit" based on its characteristics (predictor values).

To convert the probabilities into real-world classes, we round them:


In [None]:
#| echo: true
#predicted_classes = round(predicted_probability).astype('int')

## 


In [None]:
#| echo: true
#predicted_classes.head()

- Observations with probabilities greater than 0.5 are classified as "counterfeit."
- Observations with probabilities less than 0.5 are classified as "genuine."

Now, we compare the predictions with the actual categories in the [**validation dataset**]{style="color:orange;"}. [A good logistic regression model shows good agreement between its predictions and the actual categories.]{style="color:darkblue;"}

## Confusion Matrix

- Table used to evaluate the performance of a classifier.

- Compares actual values with the predicted values of a classifier.

- Useful for binary and multiclass classification problems.

![](images/confusion_matrix.png){fig-align="center"}
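The counts in a confusion matrix can be obtained by hand. The following sketch uses hypothetical actual and predicted classes and places actual classes in the rows and predicted classes in the columns, which is the layout that `confusion_matrix()` from **scikit-learn** also uses; the layout of the figure above may differ.

```python
from collections import Counter

# Hypothetical actual and predicted classes (1 = target, 0 = reference).
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

# Count each (actual, predicted) combination.
counts = Counter(zip(actual, predicted))

# Rows are actual classes and columns are predicted classes, in the order 0, 1.
matrix = [[counts[(a, p)] for p in (0, 1)] for a in (0, 1)]
for row in matrix:
    print(row)
```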

## In Python

</br>

We calculate the confusion matrix using the function of the same name, `confusion_matrix()`, from **scikit-learn**.


In [None]:
#| echo: true
#| output: true

# Create dummy variables for test set.
#Y_dummies = pd.get_dummies(Y_val, dtype = 'int')

# Select target variable from test set.
#Y_target_test = Y_dummies['counterfeit']

# Compute confusion matrix.
#cm = confusion_matrix(Y_target_test, predicted_classes)

# Show confusion matrix.
#print(cm)

## 

We can display the confusion matrix using the `ConfusionMatrixDisplay()` function.


In [None]:
#| echo: true
#| output: true
#| fig-align: center

#ConfusionMatrixDisplay(cm).plot()

## Accuracy

A simple metric for summarizing the information in the confusion matrix is **accuracy**. It is the proportion of correct classifications for both classes, out of the total classifications performed.

In Python, we calculate accuracy using the **scikit-learn** `accuracy_score()` function.


In [None]:
#| echo: true
#| output: true

#accuracy = accuracy_score(Y_target_test, predicted_classes)
#print( round(accuracy, 2) )

</br>

The higher the accuracy, the better the performance of the classifier.

## Remarks

</br>

- Accuracy is easy to calculate and interpret.

- It works well when the data set has a balanced class distribution (i.e., the numbers of cases with $Y = 1$ and $Y = 0$ are approximately equal).

- However, it is not ideal for unbalanced data sets. When one class is much more frequent than the other, accuracy can be misleading.

- In addition, there are situations in which correctly identifying the target class is more important than correctly identifying the reference class.
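A small numerical illustration shows how accuracy can mislead. The data below are hypothetical: 95 reference records, 5 target records, and a classifier that always predicts the reference class.

```python
# Hypothetical unbalanced data: 95 reference records (0) and 5 target records (1).
actual = [0] * 95 + [1] * 5

# A useless classifier that always predicts the reference class.
predicted = [0] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
targets_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(accuracy)       # proportion of correct classifications
print(targets_found)  # number of target records correctly identified
```

The accuracy looks high even though the classifier never identifies a single record of the target class.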

## An example

- Let's say we want to create a classifier that tells us whether a mobile phone company's customer will churn next month.

- Customers who churn significantly decrease the company's revenue. That's why it's important to retain these customers.

- To retain that customer, the company will send them a text message with an offer for a low-cost mobile plan.

- Ideally, our classifier correctly identifies customers who will churn, so they get the offer and, hopefully, stay.

##

- In other words, we want to avoid making wrong decisions about customers who will churn.

- Wrong decisions about loyal customers aren't as relevant.

- Because if we classify a loyal customer as one who will churn, the customer will get a good deal. They'll probably pay less but stay anyway.

## Another example

- Another example is developing an algorithm (classifier) that can quickly identify patients who may have a rare disease and need a more extensive and expensive medical evaluation.

- The classifier must make correct decisions about patients with the rare disease, so they can be evaluated and eventually treated.

- A healthy patient who is misclassified with the disease will only incur a few extra dollars to pay for the next test, only to discover that the patient does not have the disease.

## Classification-Specific Metrics

To overcome this limitation of accuracy and error rate, there are several class-specific metrics. The most popular are:

- [**Sensitivity**]{style="color:darkblue;"} or *recall*

- [**Precision**]{style="color:darkgreen;"}

- **Type I error**

These metrics are calculated from the confusion matrix.

## 

![](images/classspecific_metrics.png){fig-align="center"}

[**Sensitivity**]{style="color:darkblue;"} or *recall* = OO/(OO + OR) “How many records of the target class did we predict correctly?”

## 

![](images/classspecific_metrics.png){fig-align="center"}

[**Precision**]{style="color:darkgreen;"} = OO/(OO + RO) “How many of the records we predicted as target class were classified correctly?”

## 

![](images/classspecific_metrics.png){fig-align="center"}

**Type I error** = RO/(RO + RR) “How many of the reference records did we incorrectly predict as targets?”

## Discussion

- There is generally a trade-off between sensitivity and Type I error.

- Intuitively, increasing the sensitivity of a classifier is likely to increase Type I error, because more observations are predicted as positive.

- Possible trade-offs between sensitivity and Type I error may be appropriate when there are different penalties or costs associated with each type of error.

## Example

Assuming the target class is "large"

- Sensitivity = 566/(566 + 214) = 0.726

- Precision = 566/(566 + 156) = 0.784

- Type 1 Error = 156/(156 + 655) = 0.192
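These calculations can be checked in Python. The sketch assumes that the four counts in the example form the complete confusion matrix and follows the notation of the earlier slides, where the first letter is the actual class and the second is the predicted class (O = target, R = reference). The overall accuracy is added only for comparison.

```python
# Counts taken from the example: O = target class, R = reference class.
# The first letter is the actual class and the second is the predicted class.
OO, OR = 566, 214   # actual target: predicted target / predicted reference
RO, RR = 156, 655   # actual reference: predicted target / predicted reference

sensitivity = OO / (OO + OR)
precision = OO / (OO + RO)
type_1_error = RO / (RO + RR)
accuracy = (OO + RR) / (OO + OR + RO + RR)

print(round(sensitivity, 3), round(precision, 3),
      round(type_1_error, 3), round(accuracy, 3))
```

The ratio 566/(566 + 156) matches the formula OO/(OO + RO), so it is the precision; the overall accuracy uses all four counts.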

## Activity 2.1: Classification and Metrics (cooperative mode)

Pair with a partner.

Using the data in the "weight-height.csv" table, apply the CART procedure to build a decision tree useful for predicting a person's sex based on their weight and height.

In this example, the predictor variables are continuous, and the response variable is binary.

##

Interpret the Precision, Accuracy, Sensitivity, and Type 1 Error values for the validation set. If the software doesn't report them, perform the calculations using the confusion matrix. Use "Female" as the target class.

Discuss the effectiveness of the resulting model.

# *K* nearest neighbors

This is a supervised learning algorithm that uses proximity to classify or predict the response of an individual data point.

**Basic idea**: Predict a new observation using the *K* closest observations in the training dataset.

To predict the response for a new observation, *K*-NN uses the *K* nearest neighbors (observations) in [***terms of the predictors!***]{style="color:aqua;"}

The predicted response for the new observation is the most common response among the *K* nearest neighbors.

## The algorithm has 3 steps:

1. Choose the number of nearest neighbors (*K*).

2. For a new observation, find the *K* closest observations in the training data (ignoring the response).

3. For the new observation, the algorithm predicts the value of the most common response among the *K* nearest observations.
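The three steps can be sketched from scratch in plain Python. This is a minimal illustration on hypothetical data, not the **scikit-learn** implementation; the function name `knn_predict` is made up, and ties in the vote are not handled specially.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, new_x, k):
    """Classify new_x by majority vote among its k nearest training points."""
    # Step 2: distances use the predictors only, never the response.
    distances = [
        (math.dist(row, new_x), label) for row, label in zip(train_X, train_y)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most common class among the k neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data with two predictors.
train_X = [(1.0, 1.2), (1.5, 0.8), (1.1, 0.9), (5.0, 5.2), (5.5, 4.8), (4.9, 5.1)]
train_y = ["genuine", "genuine", "genuine", "fake", "fake", "fake"]

# Step 1: choose K.
print(knn_predict(train_X, train_y, (1.3, 1.0), k=3))
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))
```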

##

Suppose we have two groups: the red group and the green group. The number line shows the value of a variable for our training data.

A new observation arrives, and we don't know which group it belongs to.

![](images/ball_example.png){fig-align="center"}

If we had chosen $K=3$, then the three nearest neighbors would vote on which group the new observation belongs to.

##

Using $K = 3$, that's 2 votes for "genuine" and 1 for "fake." So we classify it as "genuine."

![](images/ball_example2.png){fig-align="center"}

Closeness is based on Euclidean distance.

## Implementation Details

**Ties**

- If several observations are tied for the *K*-th nearest position, so that there are more than *K* nearest neighbors, include them all.

- If there is a tie in the vote, set a rule to break the tie. For example, randomly select the class.

##

***KNN uses the Euclidean distance between points***. So it ignores units.

- Example: two predictors: height in cm and arm span in meters. Compare two people: (152.4, 1.52) and (182.88, 1.85).

- These people are separated by 30.48 units of distance in the first variable, but only by 0.33 units in the second.

- Therefore, the first predictor plays a much more important role in classification and can bias the results to the point where the second variable becomes useless.
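The effect of the units can be quantified with the two people from the example. The sketch below computes the Euclidean distance and the share of the squared distance that comes from the first predictor; the variable names are hypothetical.

```python
import math

# The two people from the example: (height, arm span) in different units.
person_a = (152.4, 1.52)
person_b = (182.88, 1.85)

height_gap = abs(person_a[0] - person_b[0])
span_gap = abs(person_a[1] - person_b[1])
distance = math.dist(person_a, person_b)

# Share of the squared distance that comes from the first predictor.
height_share = height_gap**2 / distance**2

print(round(height_gap, 2), round(span_gap, 2))
print(round(distance, 3), round(height_share, 5))
```

Almost all of the distance comes from the height, so the arm span has practically no influence on which neighbors are the closest.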

##

As a first step, we must transform the predictors so that they are on the same scale!

This requires standardizing the predictors, which we carry out in Python.

## Standardization

</br>

Standardization refers to *centering* and *scaling* each numerical predictor individually. This places all predictors on the same scale.

To **center** a predictor variable, the mean value of the predictor is subtracted from all values.

Therefore, the centered predictor has a mean of zero (i.e., its average value is zero).

##

</br>

To **scale** a predictor, each of its values is divided by its standard deviation.

When scaling the data, the values have a common standard deviation of one.

In mathematical terms, we standardize a predictor as:

$${\color{blue} \tilde{X}_{i}} = \frac{X_{i} - \bar{X}}{\sqrt{\frac{1}{n -1} \sum_{j=1}^{n} (X_{j} - \bar{X})^2}},$$

with $\bar{X} = \frac{1}{n} \sum_{j=1}^n X_j$.
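The formula can be applied directly with the standard library. This is a sketch on hypothetical values; the function name `standardize` is made up for illustration.

```python
import statistics

def standardize(values):
    """Center by the mean and scale by the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # divides by n - 1, as in the formula
    return [(v - mean) / sd for v in values]

# Hypothetical predictor values.
horsepower = [69.0, 95.0, 110.0, 150.0, 175.0]
z = standardize(horsepower)

print([round(v, 2) for v in z])
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```

Note that `StandardScaler()` from **scikit-learn** divides by $n$ instead of $n - 1$ when computing the standard deviation, so its output differs slightly from this formula in small samples.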

## Example 1 (cont.)

We use the five numeric predictors from the `complete_sbAuto` dataset.


In [None]:
#| echo: true

#complete_sbAuto.head()

## Two predictors in original units

Consider the previously created `complete_sbAuto` dataset. Consider two points on the graph: $(175, 5140)$ and $(69, 1613)$.

::::: columns
::: {.column width="50%"}

In [None]:
#| echo: false
#| output: true
#| fig-align: center

#plt.figure(figsize=(5,5))
#sns.scatterplot(data = complete_sbAuto, x = 'horsepower', y = 'weight')
#plt.scatter(x = 175, y = 5140, color = 'red')
#plt.scatter(x = 69, y = 1613, color = 'red')
#plt.xlabel('Horsepower', fontsize=14)
#plt.ylabel('Weight', fontsize=14)
#plt.show()

:::

::: {.column width="50%"}
</br>

The distance between these points is $\sqrt{(69 - 175)^2 + (1613-5140)^2}$ $= \sqrt{11236 + 12439729}$ $= 3528.592$.
:::
:::::

## Standardization in Python

</br>

To standardize **numeric** predictors, we use the `StandardScaler()` class from **scikit-learn**. We apply it to the variables using its `fit_transform()` method.

</br>


In [None]:
#| echo: true

#scaler = StandardScaler()
#Xs = scaler.fit_transform(complete_sbAuto)

## 

Unfortunately, the resulting object isn't a pandas data frame. So, we convert it to this format.


In [None]:
#| echo: true

#scaled_df = pd.DataFrame(Xs, columns = complete_sbAuto.columns)
#scaled_df.head()
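Since preprocessing must be applied to the validation data as well, one common pattern is to learn the means and standard deviations on the training data and reuse them on the validation data. The sketch below shows this with `fit_transform()` and `transform()` on hypothetical tables (`train` and `valid` are made-up stand-ins, not the `complete_sbAuto` dataset).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training and validation predictors.
train = pd.DataFrame({"horsepower": [69, 95, 110, 150, 175],
                      "weight": [1613, 2372, 2800, 3900, 5140]})
valid = pd.DataFrame({"horsepower": [88, 130],
                      "weight": [2100, 3500]})

scaler = StandardScaler()
# Learn the means and standard deviations from the training data only.
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)
# Reuse the training means and standard deviations on the validation data.
valid_scaled = pd.DataFrame(scaler.transform(valid), columns=valid.columns)

print(train_scaled.round(2))
print(valid_scaled.round(2))
```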

## Two predictors in standardized units

On the new scale, the two points are now: $(1.82, 2.53)$ and $(-0.91, -1.60)$.

::::: columns
::: {.column width="50%"}

In [None]:
#| echo: false
#| output: true
#| fig-align: center

#plt.figure(figsize=(5,5))
#sns.scatterplot(data = scaled_df, x = 'horsepower', y = 'weight')
#plt.scatter(x = 1.83, y = 2.54, color = 'red')
#plt.scatter(x = -0.90, y = -1.60, color = 'red')
#plt.xlabel('Standardized horsepower', fontsize=14)
#plt.ylabel('Standardized weight', fontsize=14)
#plt.show()

:::

::: {.column width="50%"}
</br>

The distance between these points is $\sqrt{(-0.91 - 1.82)^2 + (-1.60-2.53)^2}$ $= \sqrt{7.45 + 17.06} = 4.95$.
:::
:::::

## Discussion

*K*-NN is intuitive and simple and can produce decent predictions. However, *K*-NN has some disadvantages:

- When the training dataset is very large, *K*-NN is computationally expensive. This is because, to predict an observation, we need to calculate the distance between that observation and all the others in the dataset. ("Lazy learner").

- In this case, a decision tree is more advantageous because it is easy to build, store, and make predictions with.

##

- The predictive performance of *K*-NN deteriorates as the number of predictors increases.

- This is because the expected distance to the nearest neighbor increases dramatically with the number of predictors, unless the size of the dataset increases exponentially with this number.

- This is known as the ***curse of dimensionality***.

![](images/clipboard-72810347.png)

<https://aiaspirant.com/curse-of-dimensionality/>
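A small simulation gives a feel for the curse of dimensionality. The sketch below assumes points drawn uniformly at random in the unit cube and a fixed sample of 200 points; the function name and the chosen dimensions are hypothetical.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, n_dims, rng):
    """Average distance from each point to its nearest neighbor in the unit cube."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

rng = random.Random(0)  # fixed seed so the simulation is reproducible
results = {d: mean_nearest_neighbor_distance(200, d, rng) for d in (1, 2, 10, 50)}
for d, dist in results.items():
    print(d, round(dist, 3))
```

With the sample size held fixed, the nearest neighbor moves farther away as predictors are added, so the "neighbors" used by *K*-NN become less and less similar to the new observation.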

# [Return to main page](https://alanrvazquez.github.io/TEC-IN1002B-Website/)