# Classification

## 6.3 Evaluating Accuracy

- Split the data into a training set and a test set
    - Use the training set to train the model, then check whether it correctly predicts the outcomes for the test set

## 6.4 Randomness and seeds

- To avoid bias, let R split the data at random
- ```set.seed()``` fixes the starting point of R's random number generator, so the same seed always gives the same sequence of numbers
    - Use it once at the beginning to ensure that the analysis is reproducible and unbiased
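
A minimal sketch of what the seed does, using only base R. The seed value `2024` and the variable names are arbitrary choices for illustration.

```r
# Setting the same seed reproduces the same "random" draws.
set.seed(2024)                 # arbitrary seed value
first_draw <- sample(1:10)

set.seed(2024)                 # reset to the same seed
second_draw <- sample(1:10)

identical(first_draw, second_draw)  # TRUE: same seed, same shuffle

# Without resetting the seed, the generator simply continues its sequence,
# so a further draw will generally be a different shuffle.
third_draw <- sample(1:10)
```

In a real analysis the seed would be set once at the top of the script; it is reset here only to show that the sequence repeats.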

## 6.5 Accuracy with ```tidymodels```

### Create the train/test set

- ```initial_split()``` first shuffles then stratifies the data by class label, ensuring roughly the same proportion of each class in the training and test sets
    - ```initial_split(cancer, prop = 0.75, strata = Class)``` (75% of the data goes into the training set)
    - `training()` and `testing()` extract the training and test sets from the split object
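
The split can be sketched as follows. The `cancer` tibble built here is a hypothetical stand-in (random numbers, about 60% benign), not the real data set, and the column names `Perimeter` and `Concavity` are assumptions.

```r
library(tidymodels)

set.seed(1)

# Hypothetical stand-in for the `cancer` data: 200 rows, 60% benign ("B")
cancer <- tibble(
  Class = factor(rep(c("B", "M"), times = c(120, 80))),
  Perimeter = rnorm(200),
  Concavity = rnorm(200)
)

# Shuffle and split, stratifying on the class label
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)

# Class proportions should be nearly the same in both sets
cancer_train |> count(Class) |> mutate(prop = n / sum(n))
cancer_test  |> count(Class) |> mutate(prop = n / sum(n))
```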
    
### Preprocess data

- Build the standardization preprocessor (recipe) using only the training data, so that no information from the test set leaks into the model
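
A sketch of such a recipe, assuming the same hypothetical stand-in data as a placeholder for `cancer`. Here `step_scale()` and `step_center()` do the standardization; `prep()` and `bake()` are called only to inspect the result.

```r
library(tidymodels)

set.seed(1)

# Hypothetical stand-in data with unscaled predictors
cancer <- tibble(
  Class = factor(rep(c("B", "M"), times = c(120, 80))),
  Perimeter = rnorm(200, mean = 90, sd = 20),
  Concavity = rnorm(200, mean = 0.1, sd = 0.05)
)
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)

# The recipe sees only the training data
cancer_recipe <- recipe(Class ~ Perimeter + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Means and standard deviations are estimated from the training set...
cancer_prepped <- prep(cancer_recipe)
train_scaled <- bake(cancer_prepped, new_data = NULL)

# ...and the same training-set values are then applied to the test set
test_scaled <- bake(cancer_prepped, new_data = cancer_test)
```

The scaled training predictors have mean 0 and standard deviation 1; the scaled test predictors will only be close to that, because they reuse the training set's estimates.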

### Predict labels in test set

- ```bind_cols()``` to add the column of predictions to the original test data
- ```metrics(truth = Class, estimate = .pred_class)```\
   `filter(.metric == 'accuracy')`
- Confusion matrix
    - `conf_mat(truth = Class, estimate = .pred_class)`
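
Put together, the prediction and evaluation steps might look like the sketch below. The data are a hypothetical stand-in with some artificial class separation, `neighbors = 3` is an arbitrary choice, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(1)

# Hypothetical stand-in data: malignant cases have shifted predictor means
cancer <- tibble(
  Class = factor(rep(c("B", "M"), times = c(120, 80))),
  Perimeter = rnorm(200, mean = ifelse(Class == "M", 2, 0)),
  Concavity = rnorm(200, mean = ifelse(Class == "M", 2, 0))
)
cancer_split <- initial_split(cancer, prop = 0.75, strata = Class)
cancer_train <- training(cancer_split)
cancer_test  <- testing(cancer_split)

cancer_recipe <- recipe(Class ~ Perimeter + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# Train on the training set only
knn_fit <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  fit(data = cancer_train)

# Predict the test labels and attach them to the test data
cancer_test_predictions <- predict(knn_fit, cancer_test) |>
  bind_cols(cancer_test)

# Accuracy on the test set
cancer_test_predictions |>
  metrics(truth = Class, estimate = .pred_class) |>
  filter(.metric == "accuracy")

# Confusion matrix: predicted vs. true labels
cancer_test_predictions |>
  conf_mat(truth = Class, estimate = .pred_class)
```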

### The majority classifier (baseline example)

- The majority of cases in the training set are benign (60%)
- The model's accuracy should be an improvement on simply always predicting the majority class
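
The baseline can be computed directly from the training labels. The label vector below is made up to match the 60% figure above; only base R is used.

```r
# Hypothetical training labels: 60 benign ("B"), 40 malignant ("M")
train_labels <- factor(rep(c("B", "M"), times = c(60, 40)))

# Proportion of each class in the training set
class_props <- prop.table(table(train_labels))

# The majority classifier always predicts the most common class,
# so its expected accuracy is that class's proportion.
majority_class <- names(which.max(class_props))
baseline_accuracy <- max(class_props)

majority_class     # "B"
baseline_accuracy  # 0.6
```

A classifier that scores below this baseline on the test set is doing worse than ignoring the predictors altogether.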

## 6.6 Tuning: choosing the right $K$ value

## Cross Validation
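- Training data -> split further into training + validation sets
    - lets us compare the performance of different $K$ without touching the test set
    - a single split can be unlucky, so use multiple splits
    - split the training data into $C$ evenly sized chunks
    - in turn, use one chunk as the validation set and the remaining $C-1$ chunks as the training set
    - the estimated accuracy is the average over the $C$ validation sets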
- `vfold_cv(cancer_train, v = 5, strata = Class)`
    - reuse the `recipe`
    - fit with `fit_resamples()` instead of `fit()`
    - `collect_metrics()` to aggregate accuracy across the folds
- `knn_spec <- nearest_neighbor(weight_func = 'rectangular', neighbors = tune()) |>`\
               `set_engine('kknn') |>`\
               `set_mode('classification')`
    - `k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))` list of $K$ values to try\
       `knn_results <- workflow() |>`\
        `add_recipe(...) |> add_model(...) |>`\
        `tune_grid(resamples = cancer_vfold, grid = k_vals) |>`\
        `collect_metrics()`
- Which $K$ to choose?
    - accuracy is roughly optimal
    - changing to a nearby value does not significantly decrease accuracy
    - the cost of training is not prohibitive (e.g. $K$ is not too large)
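
The cross-validation and tuning steps above can be sketched end to end. `cancer_train` is a hypothetical stand-in for the real training set, the `kknn` package is assumed to be installed, and `best_k` is a name introduced here for illustration. Picking the single highest mean accuracy is only a starting point, since the criteria above also weigh stability and cost.

```r
library(tidymodels)

set.seed(1)

# Hypothetical stand-in for the training set
cancer_train <- tibble(
  Class = factor(rep(c("B", "M"), times = c(120, 80))),
  Perimeter = rnorm(200, mean = ifelse(Class == "M", 2, 0)),
  Concavity = rnorm(200, mean = ifelse(Class == "M", 2, 0))
)

# 5-fold cross-validation, stratified by class
cancer_vfold <- vfold_cv(cancer_train, v = 5, strata = Class)

cancer_recipe <- recipe(Class ~ Perimeter + Concavity, data = cancer_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# tune() marks `neighbors` as the parameter to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Candidate values of K
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

knn_results <- workflow() |>
  add_recipe(cancer_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = cancer_vfold, grid = k_vals) |>
  collect_metrics()

# Mean cross-validated accuracy for each K
accuracies <- knn_results |>
  filter(.metric == "accuracy")

# K with the highest mean accuracy (first one if several tie)
best_k <- accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
```

Plotting `mean` against `neighbors` from `accuracies` is a common way to judge whether nearby values of $K$ perform similarly.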

## Under/overfitting
- Under: the model is not influenced enough by the training data, e.g. when too many neighbors are chosen
- Over: the model is influenced too much by the training data and becomes unreliable on new data, e.g. when too few neighbors are chosen
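
The two extremes can be illustrated with a small sketch. It uses `knn()` from the `class` package rather than the `kknn` engine used elsewhere in these notes, purely to keep the example short, and the data are random noise with no real signal.

```r
library(class)  # recommended package, normally included with R

set.seed(1)
n <- 100

# Hypothetical data: predictors are pure noise, unrelated to the labels
x <- data.frame(a = rnorm(n), b = rnorm(n))
labels <- factor(rep(c("B", "M"), times = c(60, 40)))

# Overfitting extreme: K = 1, evaluated on the training data itself.
# Each point is its own nearest neighbor, so every label is reproduced,
# even though there is nothing real to learn.
pred_k1 <- knn(train = x, test = x, cl = labels, k = 1)
train_acc_k1 <- mean(pred_k1 == labels)

# Underfitting extreme: K = n, every point votes on every prediction,
# so the model always outputs the majority class and ignores the predictors.
pred_kn <- knn(train = x, test = x, cl = labels, k = n)
table(pred_kn)
```

The perfect training accuracy at $K = 1$ says nothing about new data, which is why $K$ is chosen by cross-validation rather than by training accuracy.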