# DSCI 100 - Introduction to Data Science


## Lecture 6 - Classification using K-Nearest Neighbours

<center>
    <img src = "https://www.tidymodels.org/images/cover.png" width="800"/>
    </center>
Source: https://www.tidymodels.org/

### Housekeeping

**Quiz 1**
- Quiz grading will take roughly 2 weeks

**Projects**
- TAs have been assigned to each of your groups (see `group_TAs` on Canvas)
    - Summer: they will be assigned and you will know who they are on Thursday
- Your `group_contract` will be due next Saturday!
    - it's a group submission; only one team member needs to submit
    - team evaluations are an important component of the project 
    - if you still have not been able to connect with a team member, for Thursday/next week:
        1. send them a message through Canvas
        1. we have sent messages about missing team members, so hopefully you hear back; if not, remember that you will have the opportunity to evaluate your team members
- Visit the `slides_group_project` item on Canvas to see some activities to get your group warmed up!
- See the `group_project_proposal` to explore data sets for the project



### Reminder  

Where are we? Where are we going?

<center>
    <img src = "https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" width="800"/>
    </center>

*source: [R for Data Science](https://r4ds.had.co.nz/) by Grolemund & Wickham*

## Classification

Suppose we have past data on tumour cell diagnoses, each labelled "benign" or "malignant".

Do you think a new cell with Concavity = 4 and Perimeter = 2 would be malignant? How did you decide?

<center><img src="https://datasciencebook.ca/_main_files/figure-html/05-knn-1-1.png" width="600"/></center> 



Do you think a new cell with Concavity = 3.3 and Perimeter = 0.2 would be malignant? How did you decide?

<center><img src="https://ubc-dsci.github.io/introduction-to-datascience/_main_files/figure-html/05-knn-4-1.png" width="600"/></center>

## What kind of data analysis question is this?

Types of questions: 
- descriptive
- exploratory
- **predictive**
- inferential 
- causal 
- mechanistic

## K-nearest neighbours classification

*Predict the label / class for a new observation using the K closest points from our dataset.*




1. Compute the distance between the new observation and each observation in our *training set*<br><br>

<center>
$\text{Distance} = \sqrt{(x_{\text{new}} - x_{\text{train}})^2 + (y_{\text{new}} - y_{\text{train}})^2}$
</center>

<center>
<img src="https://ubc-dsci.github.io/introduction-to-datascience/_main_files/figure-html/05-knn-4-1.png" width="600"/></center>

- The new observation is at Perimeter = 0.2, Concavity = 3.3.
- Recall from your reading: the training dataset is the sample of data used to fit the model.
- Suppose we have a set of data and we want to predict the class of a new observation.
- We calculate the distance between the new observation and each of the other points.
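
Step 1 can be sketched in base R. This is a minimal illustration, not the lecture's code: the three training points below are made-up values rather than rows from the cancer data, and the column name `dist_from_new` is borrowed from the output shown on the next slide.

```r
# Hypothetical training points (illustrative values only)
train <- data.frame(
  Perimeter = c(0.5, -1.0, 0.2),
  Concavity = c(3.0, 2.0, 1.3),
  Class     = factor(c("M", "B", "B"))
)

# New observation from the slide
new_obs <- c(Perimeter = 0.2, Concavity = 3.3)

# Straight-line (Euclidean) distance from the new observation to every row
train$dist_from_new <- sqrt(
  (train$Perimeter - new_obs[["Perimeter"]])^2 +
  (train$Concavity - new_obs[["Concavity"]])^2
)

train
```

Because R arithmetic is vectorized, the distance to every training point is computed in one expression, with no loop.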

## K-nearest neighbours classification

*Predict the label / class for a new observation using the K closest points from our dataset.*

2. Sort the data in ascending order according to the distances
3. Choose the top K rows as "neighbours"
```
## # A tibble: 5 x 5
##        ID Perimeter Concavity Class dist_from_new
##     <dbl>     <dbl>     <dbl> <fct>         <dbl>
## 1   86409     0.241      2.65 B             0.881
## 2  887181     0.750      2.87 M             0.980
## 3  899667     0.623      2.54 M             1.14 
## 4  907914     0.417      2.31 M             1.26 
## 5 8710441    -1.16       4.04 B             1.28
```
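
Steps 2 and 3 can be sketched in base R using the five rows printed above. One assumption to flag: the new observation is placed at Perimeter = 0, Concavity = 3.5, because those coordinates approximately reproduce the `dist_from_new` column (small differences come from the rounded values shown in the table). K = 3 is also an assumed choice.

```r
# The five rows printed above, deliberately shuffled
pts <- data.frame(
  ID        = c(907914, 86409, 8710441, 887181, 899667),
  Perimeter = c(0.417, 0.241, -1.16, 0.750, 0.623),
  Concavity = c(2.31, 2.65, 4.04, 2.87, 2.54),
  Class     = factor(c("M", "B", "B", "M", "M"))
)

# Assumed coordinates of the new observation (inferred, not stated on the slide)
new_perimeter <- 0
new_concavity <- 3.5

pts$dist_from_new <- sqrt((pts$Perimeter - new_perimeter)^2 +
                          (pts$Concavity - new_concavity)^2)

# Step 2: sort ascending by distance; step 3: keep the top K rows
K <- 3
neighbours <- head(pts[order(pts$dist_from_new), ], K)
neighbours
```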

## K-nearest neighbours classification

*Predict the label / class for a new observation using the K closest points from our dataset.*

4. Classify the new observation based on majority vote.

<center>
<img src="https://ubc-dsci.github.io/introduction-to-datascience/_main_files/figure-html/05-knn-5-1.png" width="600"/>
</center>

### What would the predicted class be?
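
A majority vote can be sketched in base R. Here K = 3 is an assumed choice, and the class labels are taken from the first three rows of the sorted table shown earlier.

```r
# Classes of the K = 3 nearest neighbours (rows 1-3 of the sorted table)
neighbour_classes <- factor(c("B", "M", "M"), levels = c("B", "M"))

# Count votes per class and pick the class with the most votes
votes <- table(neighbour_classes)
predicted_class <- names(votes)[which.max(votes)]

votes
predicted_class
```

Note that `which.max()` returns the first maximum, so a tie would silently go to the first class level; with two classes, an odd K avoids ties.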

## We can go beyond 2 predictors

For two observations $u, v$, each with $m$ variables (columns) labelled $1, \dots, m$,
<br>
<br>

<center>
   $\text{Distance} = \sqrt{(u_1-v_1)^2 + (u_2-v_2)^2 + \dots + (u_m - v_m)^2}$ 
</center>

Aside from that, it's the same algorithm!
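
As a small sketch, the general formula is one line of base R. The function name `euclidean_distance` and the two example observations are hypothetical.

```r
# Euclidean distance between two observations with any number of variables
euclidean_distance <- function(u, v) {
  stopifnot(length(u) == length(v))
  sqrt(sum((u - v)^2))
}

# Hypothetical observations with m = 3 variables
u <- c(1, 2, 3)
v <- c(4, 6, 3)
euclidean_distance(u, v)  # sqrt(3^2 + 4^2 + 0^2) = 5
```

Base R's `dist()` computes the same quantity for every pair of rows in a matrix or data frame.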

## Standardizing Data
<center><img src="img/scaling_example1.png" width="400"/></center>

<center><img src="img/scaling_example2.png" width="400"/></center>

- When using K-nearest neighbour classification, the scale of each variable (i.e., its size and range of values) matters, e.g. Salary (10,000+) and Age (0-100).
- Since the classifier predicts classes by identifying the observations nearest to the new observation, variables with a large scale will have a much larger effect than variables with a small scale.
- But just because a variable has a large scale doesn’t mean that it is more important for making accurate predictions. 
- For example, suppose you have a data set with two attributes, weight (in pounds) and height (in feet), and you compare a person who weighs 200 pounds and is 6 feet tall with two others:

<center>
$\text{distance}_1 = \sqrt{(202 - 200)^2 + (6 - 6)^2} = 2$
</center>

<center>
$\text{distance}_2 = \sqrt{(200 - 200)^2 + (8 - 6)^2} = 2$
</center>

The distance is 2 in both cases! A difference of 2 pounds is not that big, but a difference of 2 feet is a lot. So how can we adjust for this? 
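
The two distances can be checked with a short base-R sketch. The variable names are hypothetical; the numbers are the ones from the example.

```r
# Each person is c(weight in pounds, height in feet); values from the example
reference <- c(weight = 200, height = 6)
person_a  <- c(weight = 202, height = 6)  # 2 pounds heavier
person_b  <- c(weight = 200, height = 8)  # 2 feet taller

distance1 <- sqrt(sum((person_a - reference)^2))
distance2 <- sqrt(sum((person_b - reference)^2))

c(distance1 = distance1, distance2 = distance2)  # both are 2
```

On the raw scale, K-nearest neighbours would treat these two people as equally close to the reference person.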

## Nonstandardized Data

What if one variable is much larger than the other? 

<br>
<center>
<img src="img/nonstd.png" width="800"/>
</center>

## Nonstandardized Data vs Standardized Data

What if one variable is much larger than the other? 

*Standardize:* shift and scale so that the average is 0 and the standard deviation is 1.

<br>
<center>
<img src="https://ubc-dsci.github.io/introduction-to-datascience/_main_files/figure-html/05-scaling-plt-1.png" width="1600"/>
</center>



- Standardization: when all variables in a data set have a mean (center) of 0 and a standard deviation (scale) of 1, we say that the data have been standardized.

- In the plot with the original data above, it's very clear that K-nearest neighbours would classify the red dot (new observation) as malignant. However, once we standardize the data, the class label becomes less clear, and it appears to depend on the choice of K.
- Thus, standardizing the data can change things in an important way when we are using predictive algorithms. As a rule of thumb, standardizing your data should be a part of the preprocessing you do before any predictive modelling / analysis.

- In many other predictive models, the center of each variable (e.g., its mean) matters as well. For example, if we had a data set with a temperature variable measured in kelvins, and the same data set with temperature measured in degrees Celsius, the two variables would differ by a constant shift of 273.15 (even though they contain exactly the same information). Likewise, in a hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. 
- Although this doesn’t affect the K-nearest neighbour classification algorithm, such a large shift can change the outcome of many other predictive models.
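
A minimal base-R sketch of both points, using made-up salary, age, and temperature values: `scale()` centers each column to mean 0 and rescales it to standard deviation 1, and a constant shift of a variable leaves distances between observations unchanged.

```r
# Hypothetical data: salary (large scale) and age (small scale)
people <- data.frame(
  salary = c(40000, 55000, 70000, 85000),
  age    = c(25, 35, 45, 55)
)

# scale() subtracts each column's mean and divides by its standard deviation
standardized <- as.data.frame(scale(people))

colMeans(standardized)      # approximately 0 for every column
sapply(standardized, sd)    # 1 for every column

# A constant shift (e.g. Celsius to kelvins) leaves pairwise distances unchanged
temp_celsius <- c(10, 20, 35)
temp_kelvin  <- temp_celsius + 273.15
all.equal(as.vector(dist(temp_celsius)), as.vector(dist(temp_kelvin)))
```

In the lecture's pipeline the same standardization is done by `step_center()` and `step_scale()` inside a recipe rather than by calling `scale()` directly.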


## Introduction to the `tidymodels` package in R

Tidymodels handles computing distances, standardization, balancing, and prediction for us!

0. Load the libraries and data we need (new: `tidymodels`)

```
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)

cancer <- read_csv("data/clean-wdbc.data.csv") %>%
              mutate(Class = as_factor(Class))
```

- Why `tidymodels`? Consistency.
- R is open source: packages are made by different people following different principles, so each has a slightly different interface, and keeping them all straight can be frustrating.
- Aim: a uniform interface for the variety of models that exist in R.

## Introduction to the `tidymodels` package in R

0b. Inspect the data

```
cancer
```

## Introduction to the `tidymodels` package in R

`tidymodels` is a collection of packages:
<center>
    <img src = "https://i0.wp.com/rviews.rstudio.com/post/2019-06-14-a-gentle-intro-to-tidymodels_files/figure-html/tidymodels.png?zoom=2&w=578&ssl=1" width="800"/>
    </center>
Source: https://www.r-bloggers.com/2019/06/a-gentle-introduction-to-tidymodels/

## Introduction to the `tidymodels` package in R

In `tidymodels`, the `recipes` package is named after cooking terms.
 
### 1. Make a `recipe` to specify the predictors/response and preprocess the data
1. `recipe()`: the main argument is the formula. 

Arguments: 

- formula
- data
    
2. `prep()` & `bake()`: you can also `prep` and `bake` a recipe to see what the preprocessing does!

- visit https://recipes.tidymodels.org/reference/index.html to see all the preprocessing steps

```
wdbc_recipe <- recipe(Class ~ Perimeter + Concavity, data=cancer) %>%
                  step_center(all_predictors()) %>%
                  step_scale(all_predictors()) 
wdbc_recipe           
```

- First argument: the model formula.
- Left-hand side of `~`: the response, i.e. the thing we are trying to predict (variables can be selectively removed, but that is a separate step).
- Right-hand side: whichever columns you want to use as predictors (you could use `.` to use everything as a predictor).
- Second argument: the data frame.
- The `step_*()` functions are the pre-processing steps that standardize the data.

`prep()` and `bake()`?

- The preprocessing recipe `wdbc_recipe` has been defined, but no values have been estimated yet.

- The `prep()` function computes everything so that the preprocessing steps can be executed.

- The `bake()` function takes a prepped recipe, applies it to data, and returns the processed data.

- To extract the pre-processed dataset, you can `prep()` and `bake()` the recipe. This isn’t necessary for the pipeline, since it is done under the hood when the model is fit.

- This will be covered more next week.
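
As a rough sketch of what `prep()` and `bake()` do, the example below uses the built-in `iris` data instead of the cancer data so that it runs on its own. The names `iris_recipe`, `iris_prepped`, and `iris_baked` are hypothetical.

```r
library(tidymodels)

# Illustrative recipe on the built-in iris data (not the lecture's cancer data)
iris_recipe <- recipe(Species ~ Petal.Length + Petal.Width, data = iris) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# prep() estimates the means and standard deviations from the training data
iris_prepped <- prep(iris_recipe)

# bake() applies the estimated preprocessing to a data frame
iris_baked <- bake(iris_prepped, new_data = iris)

# After baking, each predictor should have mean ~0 and standard deviation 1
iris_baked %>%
  summarize(across(c(Petal.Length, Petal.Width), list(mean = mean, sd = sd)))
```

Separating the two steps means the means and standard deviations are estimated once, from the training data, and can then be applied unchanged to new observations.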

## Introduction to the `tidymodels` package in R

### 2. Build a model specification (`model_spec`) to specify the model and training algorithm

1. **model type**: kind of model you want to fit

2. **arguments**: model parameter values

3. **engine**: underlying package the model should come from 

4. **mode**: type of prediction (some packages can do both classification and regression)

```
wdbc_model <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>%
       set_engine("kknn") %>%
       set_mode("classification")

wdbc_model
```

- A model specification is a container for information about a model that will be fit.
- It is intended to be functionally independent of the data: the model specification does not interact with the data.
- Most R functions immediately evaluate their arguments. `parsnip` model functions do not; they save the argument expressions to be evaluated later, when `fit()` is called with the actual data.

## Introduction to the `tidymodels` package in R

### 3. Put them together in a workflow and then `fit` it

```
wdbc_workflow <- workflow() %>%
                    add_recipe(wdbc_recipe) %>%
                    add_model(wdbc_model)
wdbc_fit <- wdbc_workflow %>%
                fit(data = cancer)
wdbc_workflow
```

- We may want to use our recipe across several steps as we train and test our model. To simplify this process, we can use a model workflow, which pairs a model and recipe together. 
- A workflow gives us a single function call that prepares the recipe and trains the model on the resulting predictors.
- `wdbc_workflow`: neither the pre-processing steps in the recipe nor the model fit have been carried out yet; it is just the framework.
- `wdbc_fit`: the recipe's pre-processing steps have now been applied and the model has been fit.

## Introduction to the `tidymodels` package in R

### 4. Predict a new class label using `predict`

```
new_obs <- tibble(Perimeter = 0.2, Concavity = 3.3)
predict(wdbc_fit, new_obs)
```

<center>
<img src="https://ubc-dsci.github.io/introduction-to-datascience/_main_files/figure-html/05-knn-5-1.png" width = "600"/>
</center>

- `predict()` applies the recipe to the new data, then passes the result to the fitted model.

## Go forth and ... model?

<br>
<img align="left" src="https://i.imgflip.com/2q2nlu.jpg" width="500" />

## What did we learn today? 
- 
- 
- 

## Class challenge

Suppose we have a new observation in the `iris` dataset, with 

- petal length = 5
- petal width = 0.6

In your groups, discuss the following questions:
- Create a plot to visualize the relationship between the predictors/features. Based on your plot, how would you classify this observation based on $k=3$ nearest neighbours?
- Do you think we need to scale the data? Why or why not? 

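```
# Preview the first few rows of the built-in iris data
head(iris)

# Set the plot size, then plot petal width against petal length, coloured by species
options(repr.plot.width = 6, repr.plot.height = 3)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
    geom_point() + 
    labs(x = "Petal Length (cm)", y = "Petal Width (cm)")
```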