
## <font color='blue'> Capstone Part 4 - Written Report</font>

#### Authored by  Eric Nesi

### Executive Summary:
The goal of this project was to use spatial analytics to evaluate the defensive performance of the Sydney Kings during the 2016-2017 season.  Modeling my data with logistic regression and Random Forest, despite the models' lack of accuracy, confirmed what I expected: distance from the hoop was not a very good predictor of a made shot.  Furthermore, breaking down my data visually showed that some spots on the floor truly were more efficient than others.

I used Effective Field Goal Percentage (EFG%) as the main statistic to judge efficiency.  EFG% takes into account the fact that 3-pointers are worth 1.5 times as much as 2-pointers, and as such serves as a better indicator of efficiency than field goal percentage alone.
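
As a minimal sketch, the standard EFG% formula can be written as a small Python function. The function name and the sample totals below are hypothetical and are not taken from my notebooks.

```python
def effective_fg_pct(fgm: int, three_pm: int, fga: int) -> float:
    """EFG% = (FGM + 0.5 * 3PM) / FGA; returns 0.0 when no shots were attempted."""
    if fga == 0:
        return 0.0
    return (fgm + 0.5 * three_pm) / fga


# Hypothetical totals: 6 makes on 15 attempts, 3 of the makes being three-pointers.
fg_pct = 6 / 15                                        # plain field goal percentage
efg_pct = effective_fg_pct(fgm=6, three_pm=3, fga=15)  # credits the extra point on threes
print(f"FG%: {fg_pct:.1%}  EFG%: {efg_pct:.1%}")
```

The gap between the two numbers is exactly the extra value of the made threes, which is why EFG% is the fairer yardstick when comparing locations inside and outside the arc.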

### Data Acquisition: 
I received my original dataset, and drew inspiration, from Andrew Price of SpatialJam.com. This dataset, which contained all the defensive data for the Sydney Kings from the 2016-2017 season, helped me get started in my data exploration. However, I decided to supplement it with shot chart data from the rest of the league for the entire year.

I wrote a function to scrape shot chart data for the rest of the league from FibaLiveStats.com.  After locating the API, I wrote a simple function to create a list of URLs, one for each game.  A few games of data were missing because I was attempting to scrape this data months after the events were completed. To fill the gap, I located part of Andrew Price's dataset on his GitHub, inserted the data for a couple of the games I needed into the csv, and proceeded with cleaning the data.  In the end, I was able to acquire all data except for two games.

Follow this **[LINK][Scrape]** to observe my Python Notebook for the scraping.

[Scrape]:https://git.generalassemb.ly/DSI-SYD-2/capstone_eric/blob/master/Workspace/NBL_Shot_Data_Scrape.ipynb 

### Cleaning Process & EDA:  
I first cleaned and explored the Sydney Kings defensive data.  This was very helpful because the raw data was relatively clean, which let me work through the cleaning process without the 15,000 records I would later have. This process was crucial for my capstone, since cleaning the data and putting it in a format suitable for Tableau visualizations was what I considered the most important and time-consuming element of my project. Here are just a few of the problems I encountered and how I resolved them:

***Problem:*** Calculating distance from the hoop.

***Solution:***  The first step in this process was making sure my court was accurately depicted.  To do this, I examined the **[FIBA][FIBA_Rules]** rulebook for court pictures and measurements.  I took a screenshot of the court, fit it into a Python notebook, and marked down the coordinates of the hoop. A regulation FIBA court is 28m long by 15m wide, so I needed to account for these measurements in my distance formula.  The notebooks for the second and third parts of my capstone, linked at the end of this section, contain these formulas.
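
The idea behind the distance calculation can be sketched as follows. This is not the exact formula from my notebooks: it assumes shot coordinates arrive as percentages (0 to 100) of the court's length and width, that every shot has been normalized to attack the same basket, and that the hoop centre sits 1.575m in from the baseline, midway across the court. The function and constant names are hypothetical.

```python
import math

COURT_LENGTH_M = 28.0  # FIBA regulation length
COURT_WIDTH_M = 15.0   # FIBA regulation width

# Assumption: hoop centre is 1.575 m in from the baseline, centred across the width.
HOOP_X_M = 1.575
HOOP_Y_M = COURT_WIDTH_M / 2


def shot_distance_m(x_pct: float, y_pct: float) -> float:
    """Convert percentage coordinates to metres, then measure the straight-line
    distance from the shot location to the hoop centre."""
    x_m = x_pct / 100 * COURT_LENGTH_M
    y_m = y_pct / 100 * COURT_WIDTH_M
    return math.hypot(x_m - HOOP_X_M, y_m - HOOP_Y_M)


# A shot from the centre circle (50% of the length, 50% of the width).
print(round(shot_distance_m(50, 50), 3))
```

Scaling each axis by its own court dimension before taking the distance is the key step: a raw percentage distance would treat the 28m length and the 15m width as if they were equal.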

***Problem:*** Gridding my court to create my Court Locations variable.

***Solution:*** These locations would eventually become the dummy variables in my regression model, and they are extremely powerful in my interactive Tableau visualizations.  Because my original distance formula did not account for the size of the court in meters, I could not figure out a way to break the court down into logical areas. Once I had distance in meters, my problem was solved, as I could use the measurements provided in the FIBA rulebook to define each area.
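
A rough sketch of how metre coordinates can be bucketed into court locations is shown below. The zone names, the left/right orientation, and the exact boundaries are illustrative assumptions rather than the grid used in my notebooks; the dimensions are the standard FIBA ones (1.25m no-charge semicircle, 4.9m by 5.8m paint, 6.75m three-point arc, corner lines 0.9m from the sideline).

```python
import math

COURT_WIDTH_M = 15.0
HOOP_X_M, HOOP_Y_M = 1.575, COURT_WIDTH_M / 2   # assumed hoop centre
NO_CHARGE_RADIUS_M = 1.25
PAINT_LENGTH_M, PAINT_WIDTH_M = 5.8, 4.9
THREE_ARC_RADIUS_M = 6.75
CORNER_OFFSET_M = 0.9                            # corner 3pt line to sideline

# Distance from the baseline at which the straight corner line meets the arc.
CORNER_LIMIT_X_M = HOOP_X_M + math.sqrt(
    THREE_ARC_RADIUS_M ** 2 - (HOOP_Y_M - CORNER_OFFSET_M) ** 2
)


def court_location(x_m: float, y_m: float) -> str:
    """Label a shot by zone; x_m is metres from the baseline, y_m from one sideline."""
    dist = math.hypot(x_m - HOOP_X_M, y_m - HOOP_Y_M)
    lateral = y_m - HOOP_Y_M
    # Assumption: small y is the "Left" side; shots dead centre count as "Right".
    side = "Left" if lateral < 0 else "Right"

    if x_m <= CORNER_LIMIT_X_M:
        # Near the baseline the three-point boundary is the straight corner line.
        if y_m < CORNER_OFFSET_M or y_m > COURT_WIDTH_M - CORNER_OFFSET_M:
            return f"{side} Corner 3pt"
    elif dist > THREE_ARC_RADIUS_M:
        # Simplification: every non-corner three is a wing three (no top-of-key zone).
        return f"{side} Wing 3pt"

    if dist <= NO_CHARGE_RADIUS_M:
        return "Restricted Area"
    if x_m <= PAINT_LENGTH_M and abs(lateral) <= PAINT_WIDTH_M / 2:
        return "Paint"
    return "Mid-Range"


print(court_location(4.0, 7.5))   # a short shot in the lane
print(court_location(1.0, 0.5))   # deep in one corner
```

Each label can then serve both as a categorical filter in Tableau and as the source of dummy variables for a model.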

***Problem:*** Hexbinning my data for my hexbin plot.

***Solution:*** I actually worked backwards on this problem.  I spent a long time trying to create a grid of malleable hexagons over the court, looking at tools like GIS, MMQIS, and ArcGIS, but I was really struggling. I had almost given up when I luckily stumbled upon an article about Tableau's hexbinning functionality, which worked extremely well for me.  It allowed me to hexbin the court, size each bin according to the frequency of shots, and color each bin according to its relation to the league-average EFG%, a number I had calculated just because I thought it might be useful at some point.  In Capstone Part 3, I took the overall totals for every player in the league and simply copied and saved them as a csv.  I was then able to quickly group by team and calculate the EFG% average for every team and for the entire league.
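
The group-by step can be sketched in plain Python as below. The team names, column names, and numbers are made up for illustration, and the sketch sums makes and attempts per team before computing EFG% rather than averaging player percentages, which is one reasonable choice and not necessarily what my notebook did.

```python
from collections import defaultdict

# Hypothetical player season totals (not real NBL data).
player_totals = [
    {"team": "Team A", "fgm": 150, "three_pm": 40, "fga": 320},
    {"team": "Team A", "fgm": 90, "three_pm": 10, "fga": 200},
    {"team": "Team B", "fgm": 120, "three_pm": 60, "fga": 300},
]
COLUMNS = ("fgm", "three_pm", "fga")


def efg(fgm: int, three_pm: int, fga: int) -> float:
    """Effective field goal percentage from raw totals."""
    return (fgm + 0.5 * three_pm) / fga if fga else 0.0


# Sum each counting stat per team.
team_sums = defaultdict(lambda: dict.fromkeys(COLUMNS, 0))
for row in player_totals:
    for col in COLUMNS:
        team_sums[row["team"]][col] += row[col]

team_efg = {team: efg(**sums) for team, sums in team_sums.items()}

# The league figure uses league-wide totals, so high-volume teams weigh more.
league_sums = {col: sum(s[col] for s in team_sums.values()) for col in COLUMNS}
league_efg = efg(**league_sums)

for team, value in team_efg.items():
    print(f"{team}: {value:.2%}")
print(f"League: {league_efg:.2%}")
```

The league value computed this way is the kind of reference number a hexbin color scale can be centred on.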

Follow this **[LINK][Pt2]** and this **[LINK][Pt3]** to observe my Python Notebook for Cleaning and EDA of my Kings Data.

Follow this **[LINK][TS1]**, or this **[LINK][TS2]** to view my cleaning and analysis on my NBL Total Shots Data.

[FIBA_Rules]:http://www.fiba.com/en/Module/c9dad82f-01af-45e0-bb85-ee4cf50235b4/1faaa885-7478-4f87-ae5a-45b2c5939e96
[Pt2]:https://git.generalassemb.ly/DSI-SYD-2/capstone_eric/blob/master/Cap_pt2/Capstone_EDA_Pt2.ipynb
[Pt3]:https://git.generalassemb.ly/DSI-SYD-2/capstone_eric/blob/master/Cap_pt3/CAPSTONE%20%20-%20Pt.3.ipynb
[TS1]:https://git.generalassemb.ly/DSI-SYD-2/capstone_eric/blob/master/Workspace/All_NBL_Shots.ipynb
[TS2]:https://git.generalassemb.ly/DSI-SYD-2/capstone_eric/blob/master/Workspace/Model_NBL_Total_Data.ipynb

### Modeling:
In my model, I set the shot result as the target. Essentially, I was trying to train and test a model that could predict whether a shot would be made based on floor location.  Since my entire project was geared towards this spatial analysis, I made sure that distance and my floor-location buckets were the predictors.
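
To illustrate the setup, here is a small sketch of turning distance and a floor-location label into a numeric feature row with dummy variables. The zone list, the sample shots, and the helper name are hypothetical; the real feature construction lives in the linked notebooks.

```python
# Hypothetical set of floor-location buckets.
ZONES = ["Restricted Area", "Paint", "Mid-Range", "Corner 3pt", "Wing 3pt"]


def to_features(distance_m: float, zone: str) -> list[float]:
    """Return [distance, one 0/1 dummy per zone] for a single shot."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    dummies = [1.0 if zone == z else 0.0 for z in ZONES]
    return [distance_m] + dummies


# (distance in metres, zone, made) for three made-up shots.
shots = [
    (0.8, "Restricted Area", 1),
    (3.2, "Paint", 0),
    (7.1, "Wing 3pt", 1),
]

X = [to_features(dist, zone) for dist, zone, _ in shots]  # predictors
y = [made for _, _, made in shots]                        # binary target

print(X[0])
print(y)
```

Rows shaped like `X` and a binary `y` are the form that classifiers such as scikit-learn's `LogisticRegression` and `RandomForestClassifier` expect.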

I used both a logistic regression model and a Random Forest model in an attempt to predict this binary target.  However, my accuracy was low for both my Kings defensive data and my NBL Total Shots data.  I chose these models because the target is binary and I wanted to test whether I could accurately predict whether a shot would be made.  Despite attempts to hone my models through grid searching, I feel they lacked sufficient accuracy and, in reality, significant purpose.  That said, working through the modeling process did reveal a few things to me.

First, let's take a look at the best predictor coefficients from both my Kings model and my total NBL model.

<img src="files/Coeff_King.png" width="200" height="200;"/>  <img src="files/Coeff_Total.png" width="200" height="200;"/>

These coefficients reveal that other than Restricted Area, there are not any strong positive predictors for a successful shot result in my data.  That said, from my visualizations I know that there are other ways of presenting shot efficiency. 

Follow this **[LINK][Pt3]** to observe my Kings modeling.

Follow this **[LINK][TS2]** to observe my total NBL modeling. 

[Pt3]:https://git.generalassemb.ly/DSI-SYD-2/capstone_eric/blob/master/Cap_pt3/CAPSTONE%20%20-%20Pt.3.ipynb
[TS2]:https://git.generalassemb.ly/DSI-SYD-2/capstone_eric/blob/master/Workspace/Model_NBL_Total_Data.ipynb

### Visualization:
The crown jewel of my project is my Tableau workbooks, which allowed me to visualize my data on a basketball court.  Using the powerful functionality of Tableau, I was able to clearly visualize the most efficient shots based on location on the court.  I went back and forth between my Jupyter notebooks and Tableau in order to effectively create both hexbin and raw shot location data. Every filter is dynamic, meaning that you can select a team, court location, or distance, and the Effective Field Goal % updates with it.

For Kings Interactive Viz: **[HERE][T1]**

For NBL Total Interactive Viz:  **[HERE][T2]**

When it comes to hexbinning, I did not feel I had enough data per point to draw conclusions from hexbinning my Kings data.  That said, my hexbin plot for all shots reveals an interesting trend I noticed in my data.

***See hexbin data below:***
<img src="files/HEXBIN_TOTAL.png">

[T1]:https://public.tableau.com/profile/eric.nesi#!/vizhome/SydneyKingsDefensiveShotChart2016-2017/Dashboard1
[T2]:https://public.tableau.com/profile/eric.nesi#!/vizhome/All_NBL_Shots/Dashboard1

### Analysis:

I built these visual tools to analyze the data that I acquired. The EFG% marker is set to 50.81%, which is the league average for the season.  This means that shots from the red areas of the hexbin plot were the most effective during the 2016-2017 season.  Furthermore, the size of each bin is adjusted based on frequency.  Unsurprisingly, the most effective shots came within the restricted area.  However, shots from inside the paint but outside the restricted area were fairly inefficient league-wide.  This mirrored my results for shots taken against the Sydney Kings as well.  In other words, shots that are close, but not right at the hoop, were actually inefficient over the entire season, even though these shots made up the largest percentage of shots taken.

Let's take a look at this graph:

<img src="files/EFGvPCT.png">

For the entire season, three-pointers from every location were more effective shots than those in the paint, yet shots in the paint represented a substantially larger number of attempts.  This could be for multiple reasons.  Firstly, shots in the paint that are not in the restricted area are often contested and thus difficult to make.  On top of that, these shots are worth two points rather than three.  Furthermore, there may not be as many players on the court who can shoot threes, while most players can take these twos.  All of that said, this area of the court is an interesting case study in inefficiency.
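
The comparison in this graph, share of attempts against EFG% per location, can be sketched as follows. The shot records below are invented purely to show the calculation and do not reproduce the league numbers.

```python
from collections import Counter

# (zone, made, is_three) for a handful of made-up shots.
shots = [
    ("Paint", False, False), ("Paint", True, False),
    ("Paint", False, False), ("Paint", False, False),
    ("Restricted Area", True, False), ("Restricted Area", True, False),
    ("Restricted Area", False, False),
    ("Wing 3pt", True, True), ("Wing 3pt", False, True), ("Wing 3pt", False, True),
]


def zone_summary(records):
    """Return {zone: {"share": fraction of all attempts, "efg": EFG% in that zone}}."""
    attempts = Counter()
    weighted_makes = Counter()
    for zone, made, is_three in records:
        attempts[zone] += 1
        if made:
            # A made three counts 1.5 times a made two in EFG%.
            weighted_makes[zone] += 1.5 if is_three else 1.0
    total = sum(attempts.values())
    return {
        zone: {"share": n / total, "efg": weighted_makes[zone] / n}
        for zone, n in attempts.items()
    }


summary = zone_summary(shots)
for zone, stats in summary.items():
    print(f"{zone}: {stats['share']:.0%} of attempts, EFG% {stats['efg']:.1%}")
```

A zone with a high share but a low EFG% is exactly the kind of inefficiency described above.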

Let's take a look at that same graph, but for the Kings:

<img src="files/KingEvT.png">

This graph shows that the Kings were very good at preventing the Right Corner 3pt and the Left Corner 3pt, and that they followed the league-wide trend when it comes to shots in the Paint.  However, their inability to prevent three-pointers from the right wing and left wing hurt them. On top of that, teams shot 40% from three, resulting in a 60% EFG%.
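
The step from 40% to 60% follows from the standard EFG% definition, assuming the 40% figure covers three-point attempts only, so that every make (FGM) is also a made three (3PM):

$$
\mathrm{EFG} = \frac{\mathrm{FGM} + 0.5 \times \mathrm{3PM}}{\mathrm{FGA}} = \frac{1.5 \times \mathrm{3PM}}{\mathrm{3PA}} = 1.5 \times 0.40 = 0.60
$$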

### Going Forward:
This project does not end here for me, as this kind of analysis has piqued my interest in creating a model that can predict point differential in games.  Developing the algorithms for this model is going to be a big time investment and was not feasible within this project; however, I want to use elements from this project to build it.  This project has led me to do a lot of reading about basketball analytics modeling, and while there are some examples of models that include spatial analysis, I believe there is a gap in using the predictive power of this kind of analysis.

My goal is to use this type of spatial data to calculate the difference between the percentage of shots taken and the effective field goal percentage at certain points on the floor.  From this analysis, it seems that points in the paint could be a good place to start, along with the highly efficient areas like the Restricted Area and a number of the three-point locations. I would combine this with traditional advanced metrics, such as offensive efficiency, defensive efficiency, and pace.  Ideally, I would be able to use this model to predict outcomes of games and, as a result, betting lines.

Furthermore, this project has been a great way to develop and hone my Tableau skills, and I really enjoyed its data visualization aspects.  I plan on doing some smaller-scale basketball data visualization projects in the near future.