## General Approach

- Considered three main models:
    1. Logistic Regression
    2. Random Forest
    3. Boosted Trees
- Split the data into train and test sets with an 80-20 split in chronological order, to avoid leakage as per the Kaggle competition rules (i.e. no information about future shots is used in the training phase).
- Tuned each model using k-fold cross-validation (k=10) on the training set, with log-loss as the metric. This gave an estimate of each model's ability to generalise, as well as the variance of that estimate.
- Selected the best model using this approach, and finally assessed its performance using the held-out test set at the end.
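
The splitting and scoring steps above can be sketched roughly as follows. This is an illustration only, not the code used to produce the results: `chronological_split`, `log_loss` and `summarise` are hypothetical helpers, the toy data is made up, and the sketch assumes the rows are already sorted by game date. It also assumes the ± figures quoted in these notes are standard deviations across folds, which the notes do not state.

```python
import math
import statistics


def chronological_split(rows, train_frac=0.8):
    """Split rows that are already sorted by time: earliest part is train, the rest is test."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def summarise(fold_scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Toy data (made up): ten shots, indexed in chronological order.
shots = list(range(10))
train, test = chronological_split(shots)

# Every training shot precedes every test shot, so no future information leaks.
assert max(train) < min(test)

# A constant prediction of 0.5 gives a log-loss of ln(2), a useful baseline.
baseline = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

The ln(2) ≈ 0.693 baseline gives some context for the log-loss values reported in the sections that follow.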

## Logistic Regression

- Initially considered a model with just `shot_distance` as a predictor variable. This resulted in a cross-validated accuracy of 0.5984 ± 0.0143 and a log-loss of 0.6674 ± 0.0072. It also allowed us to estimate the shot distance at which the probability of shot success falls to 50%, which occurred at 8.3 ft.
- Then created a model using all of our predictor variables, but with L1 penalisation, where both the penalisation coefficient and the type of penalisation were found using a grid search. The penalty gives a simpler model, which should reduce variance and generalise better to unseen data.
- This resulted in a cross-validated accuracy of 0.6795 ± 0.0142 and a log-loss of 0.6271 ± 0.0087.
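
For the single-predictor model, the 50% distance follows directly from the fitted coefficients: the predicted probability is 1 / (1 + exp(-(b0 + b1·d))), which equals 0.5 exactly when b0 + b1·d = 0, i.e. d = -b0 / b1. A minimal sketch of that calculation is below. The function names are hypothetical and the coefficient values are made up for illustration; they are not the fitted values behind the 8.3 ft figure.

```python
import math


def success_probability(distance_ft, intercept, slope):
    """Logistic model for the probability of making a shot from a given distance."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * distance_ft)))


def distance_at_half(intercept, slope):
    """Distance at which the linear predictor is zero, so the probability is 0.5."""
    return -intercept / slope


# Hypothetical coefficients (NOT the fitted values): a negative slope means
# the success probability falls as distance increases.
b0, b1 = 0.4, -0.05
d50 = distance_at_half(b0, b1)
p_at_d50 = success_probability(d50, b0, b1)
```

With a negative slope, shots closer than `d50` are predicted to go in more often than not, and shots further away less often.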

## Random Forest

- Carried out a grid search for the random forest hyperparameters, which resulted in a `max_depth` of 17 and a `min_samples_leaf` of 6. The cross-validated accuracy was 0.6796 ± 0.0146 and the log-loss 0.6116 ± 0.0105. Accuracy is essentially unchanged from the penalised logistic regression (0.6795), but the log-loss improves on its 0.6271.
- We also visualised the effect of regularization by varying `max_depth`. The cross-validated error rate decreased for increasingly deep trees up to a cut-off, beyond which it began to increase again, which shows that regularization is required to avoid over-fitting.
- The random forest also gave us an idea of which features were important to the model, using feature importance ranking. The `action_type` variable (which encodes the type of shot taken), the `shot_distance` variable, and the `defensive_ranking` variable were all important in the random forest approach, which makes intuitive sense.
- We also fit a simple decision tree (but didn't attempt to optimize the hyperparameters) in order to show what a typical tree in our random forest model might look like, and visualized this in tree form.
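
A grid search of this kind could look roughly like the sketch below, assuming scikit-learn was used (the notes name `max_depth` and `min_samples_leaf` but not the library). The data is a synthetic stand-in, the grid values and `n_estimators` are illustrative, and 5 folds are used instead of 10 to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real shot features are not reproduced here.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid only; the grid actually searched is not given in the notes.
param_grid = {"max_depth": [3, 6, 12], "min_samples_leaf": [1, 6]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    scoring="neg_log_loss",  # scikit-learn maximises scores, so log-loss is negated
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_log_loss = -search.best_score_  # flip the sign back to an ordinary log-loss

# Impurity-based importances from the refitted best forest; they sum to 1.
importances = search.best_estimator_.feature_importances_
```

Sorting `importances` alongside the column names gives the kind of feature ranking described above.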

## Gradient Boosted Trees

- The gradient boosted trees achieved a cross-validated accuracy of ... and a log-loss of ...

## Choosing a final model

- Comparing the above models on their cross-validated performance, the gradient boosted trees and random forest performed best, as would be expected given that they are ensemble methods. They also had reasonably low variance. The gradient boosted trees marginally outperformed the random forest, and so we chose this as our final model.
- Using our final model on the held-out test data, we obtained an accuracy of 0.6770 and a log-loss of 0.6064. This would have placed us xx/in the top xx% in the Kaggle competition. However, we did not use the same test data as the competition, so the two results are not directly comparable.