# DRILL: Tuning KNN – Normalizing Distance, Picking K

A credit card company is trying to figure out if people are going to pay their bills on time. We have everyone's purchases, split into 4 categories: groceries, dining out, utilities, and entertainment. 

What are some ways you might use KNN to create this model? What aspects of KNN would be useful?


## Feedback

This is a thought exercise: assume a dataset and explain your strategy for applying KNN to this kind of problem. You should also explain how and why you would use normalization and weighting, and how you would find the optimal value of K.
Try to cover the following points:
- Data Explanation: Identify the independent features and the target feature
- Data Preparation: How will you transform the data before applying KNN (normalization), and why?
- Training: Explain KNN and the parameters, such as weighting, that you will use to improve model accuracy
- Model Validation: How will you validate your model's accuracy (k-fold cross-validation)?
- Model Improvement: Explain how you will improve model accuracy by finding the optimal value of K. How will you decide which K value is best?

## Independent vs Target Features
- Independent features: date of credit card purchase, $ amount of purchase, category to which purchase belongs
- Target feature: whether credit card purchase was paid off

## Data Preparation & Normalization
- Normalize purchase amounts (i.e. the $ amount of each purchase) so that all categories share a common scale
    - Normalization lets the data show whether a purchase is unusually high, average, or unusually low on the same scale for every category, regardless of how much typical amounts differ between categories (it is also unit-agnostic). Unusual purchases (spending far more than usual on dining out, for example, or far less than usual on utilities) may then lend more insight into whether a purchase will be paid off on time
    - Normalization preserves the relationships between data points while keeping features measured in large units from dominating the distance calculation (especially if more continuous features are added)
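
One way to read the normalization step above is as a z-score computed within each purchase category. The sketch below assumes that reading. The function name `zscore_by_category` and the purchase amounts are made up for illustration and are not part of the drill.

```python
from statistics import mean, pstdev

def zscore_by_category(purchases):
    """Return (category, z-score) pairs, standardizing amounts within each category."""
    # Group the raw amounts by category
    by_cat = {}
    for cat, amount in purchases:
        by_cat.setdefault(cat, []).append(amount)
    # Mean and population standard deviation for each category
    stats = {cat: (mean(v), pstdev(v)) for cat, v in by_cat.items()}
    scaled = []
    for cat, amount in purchases:
        mu, sigma = stats[cat]
        # A category with no spread carries no signal, so map it to 0
        z = (amount - mu) / sigma if sigma > 0 else 0.0
        scaled.append((cat, z))
    return scaled

# Hypothetical data: utilities bills are larger than grocery trips in dollars
purchases = [
    ("groceries", 80.0), ("groceries", 100.0), ("groceries", 120.0),
    ("utilities", 200.0), ("utilities", 220.0), ("utilities", 240.0),
]
scaled = zscore_by_category(purchases)
for cat, z in scaled:
    print(f"{cat:10s} {z:+.2f}")
```

After scaling, the lowest grocery purchase and the lowest utilities bill land at the same z-score even though their dollar amounts differ, which is the "equally across all categories" effect described above.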

## Setting up X and Y Values
- Plot all purchases with the date on the x-axis and the purchase amount on the y-axis
- Perform a t-test on data belonging to different groups (e.g. weekend vs. weekday purchases) to see whether the grouping affects the outcome. If so, include it as a feature
    - Each data point will be color-coded by purchase category
    - Alternatively, differentiate between purchase categories by creating four separate graphs, one per category
- Addendum: for additional features, the data could be broken down by season, weekend vs. weekday, position in the billing cycle, or proximity to high-shopping periods (like Christmas, Valentine's Day, etc.), each on its own graph

## Weighting
- Weight neighbors by proximity, so that closer neighbors (those most similar in both category AND dollar amount) count more toward the prediction
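
As a rough sketch of what proximity weighting changes, the function below lets each of the k nearest neighbors vote with weight 1/distance instead of weight 1. The name `knn_predict`, the two-feature points, and the labels (1 = paid on time, 0 = paid late) are all assumptions made for this example.

```python
import math
from collections import defaultdict

def knn_predict(train_X, train_y, query, k=3, weighted=True):
    """Classify `query` by a vote among its k nearest training points."""
    # Distance from the query to every training point, nearest first
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Inverse-distance weight; the small epsilon avoids division by zero
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical normalized features; 1 = paid on time, 0 = paid late
train_X = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (2.1, 2.0), (2.0, 2.1)]
train_y = [1, 1, 0, 0, 0]
query = (0.05, 0.0)

with_weights = knn_predict(train_X, train_y, query, k=5, weighted=True)
without_weights = knn_predict(train_X, train_y, query, k=5, weighted=False)
print(with_weights, without_weights)
```

With K set to all five points, the plain majority vote follows the three distant "late" points, while the weighted vote follows the two points sitting right next to the query. For real data, scikit-learn's `KNeighborsClassifier` offers the same choice through its `weights` parameter.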

## Validation
- Divide your data into random folds and use cross-validation to check that your model consistently predicts whether a purchase will be paid, rather than performing well on only one split
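
The fold-based check above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: `k_fold_accuracy` and `nearest_label` are hypothetical helpers, the data is synthetic, and the classifier is a bare 1-nearest-neighbor rule used only to keep the example short.

```python
import math
import random

def k_fold_accuracy(X, y, predict, n_folds=5, seed=0):
    """Mean accuracy of predict(train_X, train_y, query) over shuffled folds.

    Assumes n_folds <= len(X) so that no fold is empty.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # random assignment of rows to folds
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        # Train on everything outside the current fold
        train_X = [X[i] for i in idx if i not in held_out]
        train_y = [y[i] for i in idx if i not in held_out]
        correct = sum(predict(train_X, train_y, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)

def nearest_label(train_X, train_y, query):
    """1-nearest-neighbor rule: copy the label of the closest training point."""
    return min(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))[1]

# Synthetic data: two well-separated clusters (1 = paid on time, 0 = late)
X = [(i * 0.1, 0.0) for i in range(10)] + [(5 + i * 0.1, 0.0) for i in range(10)]
y = [1] * 10 + [0] * 10

acc = k_fold_accuracy(X, y, nearest_label, n_folds=5)
print(f"mean accuracy over 5 folds: {acc:.2f}")
```

Because every point is scored only by a model that never saw it, a high mean accuracy here is evidence that the model generalizes rather than memorizes one particular split.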

## Improvement
- Start with a K value that reflects how large the dataset is, e.g. if a user has 20 purchases, 10 is much too high a K value
- Adjust the K value by trial and error to find the point at which the KNN model performs most accurately (use cross-validation to test)
- Turn weighting on and off to see if this affects model accuracy (again, using cross-validation scores)
- Use an odd number for K so that an unweighted vote between the two classes (paid vs. not paid) can never end in a tie
- Plot cross-validated accuracy (how often the predicted value matches the actual value) for different values of K, see where accuracy peaks, and choose that K value
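
The trial-and-error search over K can be sketched as follows. To stay short, this example scores each K with leave-one-out cross-validation (every point is its own fold) instead of larger random folds. The helper names and the one-dimensional toy data, including the single "late" point placed inside the "on time" cluster, are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_vote(train, query, k):
    """Unweighted majority vote among the k nearest (point, label) pairs."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_vote(rest, point, k) == label
    return hits / len(data)

def best_k(data, candidates):
    """Score every candidate K; pass candidates in ascending order."""
    scores = {k: loo_accuracy(data, k) for k in candidates}
    # max() returns the first maximum, so ties go to the smallest K
    return max(scores, key=scores.get), scores

# Toy data: 1 = paid on time, 0 = late, plus one noisy "late" point at 2.5
data = [((float(x),), 1) for x in range(5)]
data += [((float(x),), 0) for x in range(10, 15)]
data.append(((2.5,), 0))

k, scores = best_k(data, [1, 3, 5, 7])  # odd values only, to avoid tied votes
for cand, acc in scores.items():
    print(f"K={cand}: accuracy {acc:.3f}")
print("chosen K:", k)
```

On this toy data K=1 is misled by the single noisy point, and K=3 is the smallest value that recovers. The `scores` dictionary is also what you would plot (K on the x-axis, accuracy on the y-axis) to see where accuracy peaks.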