# Predicting Win Probability Using NBA Data

## Project Scope

### CONTEXT

 - The sports industry is fiercely competitive, and the NBA is no exception.
 - A run of poor performances can sink a franchise, so the goal is to win and bag the “Championship” title.
 - A franchise's ability to attract the most talented players and offer them lucrative contracts also depends on its performance.
 - This capstone project is set in that CONTEXT: “winning is everything”.
 
### NEEDS

 - Many tools are already available that predict the chances of winning in sports like the NBA.
 - The need is for a tool that helps coaches adjust their game plan and/or strategy based on what is currently happening (e.g., the opposing team’s current lineup, which team gained first possession, a player on either team getting injured or fouling out, etc.).
 
### VISION

 - Before the game, the team’s coach will get an initial prediction of the team's chances of winning based on historical data (i.e., past games played against the specific opponent).
 - The coach will be able to feed in information (different game plays/strategies based on some assumptions) and pick the game play that gives the team the best chance of winning.
 - During the actual game, the coach will also be able to feed in live information (especially about the opponent) and adjust the game plan/strategy toward the option with the best predicted outcome.

### OUTCOME

 - The main user (assumed to be the coach) will work closely with the data analyst/scientist. The application should keep track of the data fed into the tool and of the expected outcome.
 - The game plan/strategy actually chosen by the coach will also be tracked, along with the actual outcome. Improvements to the tool should be based on actual vs. expected outcomes.
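
One possible way to compare expected and actual outcomes is the Brier score: the mean squared difference between each predicted win probability and the observed result (1 for a win, 0 for a loss). The sketch below is illustrative only. The `brier_score` helper and the sample numbers are assumptions, not part of the project.

```python
from typing import Sequence


def brier_score(predicted: Sequence[float], actual: Sequence[int]) -> float:
    """Mean squared difference between predicted win probabilities and 0/1 outcomes."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return total / len(predicted)


# Hypothetical tracked games: predicted win probability vs. actual result.
expected = [0.70, 0.40, 0.90, 0.55]
outcome = [1, 0, 1, 0]

score = brier_score(expected, outcome)
print(f"Brier score: {score:.4f}")
```

Lower scores are better. A tool that always predicts 50% scores 0.25, which gives a simple baseline to beat.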

## Data Dictionary

| Variable             | Values                  | Description                                       | Mnemonic               |
|----------------------|-------------------------|---------------------------------------------------|------------------------|
| Game ID              | String                  | Official Game ID at NBA.com                       | game_id                |
| Season Type          | String                  | Season that the dataset belongs to                | data_set               |
| Game Date            | MM/DD/YYYY              | Game date                                         | date                   |
| A1 | String | First of the away team's five active players on the court | a1 |
| A2 | String | Second of the away team's five active players on the court | a2 |
| A3 | String | Third of the away team's five active players on the court | a3 |
| A4 | String | Fourth of the away team's five active players on the court | a4 |
| A5 | String | Fifth of the away team's five active players on the court | a5 |
| H1 | String | First of the home team's five active players on the court | h1 |
| H2 | String | Second of the home team's five active players on the court | h2 |
| H3 | String | Third of the home team's five active players on the court | h3 |
| H4 | String | Fourth of the home team's five active players on the court | h4 |
| H5 | String | Fifth of the home team's five active players on the court | h5 |
| Period | Nominal Integer | Period of the game in which the event occurred | period |
| Away Score | Integer | Accumulated score of away team at that moment in the game | away_score |
| Home Score | Integer | Accumulated score of home team at that moment in the game | home_score |
| Time Remaining | HH:MM:SS | Remaining time in the period | remaining_time |
| Time Elapsed | HH:MM:SS | Time passed since the period has started | elapsed |
| Length of Time | HH:MM:SS | Duration of the event described in that row | play_length |
| Play ID | Integer | ID number of the events in the game | play_id |
| Team | Nominal | Team that has executed the described event | team |
| Event Type | Nominal | Type of event (foul, shot, rebound, free throw, turnover, etc.) | event_type |
| Assist | String | The player who made the assist if the event is a "shot" | assist |
| Jumpball Away | String | Player of away team who is participating in a jumpball | away |
| Jumpball Home | String | Player of home team who is participating in a jumpball | home |
| Block | String | Player who blocked a shot | block |
| Check In | String | Player who checks in the game | entered |
| Check Out | String | Player who checks out of the game | left |
| Free Throws | Ordinal Integer | Position of the free throw in its sequence (first, second, third, ...) | num |
| Opponent | String | Player who has drawn a foul | opponent |
| Number of Free Throws | Integer | Total number of free throws to be shot | outof |
| Player Name | String | Player who executed the event | player |
| Points | Integer | Points scored within the event | points |
| Possession | String | Player who grabbed the ball after an event | possession |
| Reason | String | More details on how the event resulted | reason |
| Result | Nominal | Shot made or missed | result |
| Steal | String | Player who steals the ball | steal |
| Type | String | More details of the event | type |
| Distance | Integer | Shot distance in feet | shot_distance |
| NBA X | Integer | X-axis value of the shot at NBA.com. X coordinates range from -250 to +250 | original_x |
| NBA Y | Integer | Y-axis value of the shot at NBA.com. Y coordinates range from -51 to +870 | original_y |
| X | Float | X coordinate in feet, reflecting the 50-foot court width. Ranges from 0 to +50 | converted_x |
| Y | Float | Y coordinate in feet, reflecting the 94-foot court length. Ranges from 0 to +94 | converted_y |
| Explanation | String | Text explanation of event | explanation |
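
The clock and score columns above could be turned into numeric features such as seconds left in the game and the current score margin. The following is a minimal sketch, assuming `remaining_time` really arrives as an `H:MM:SS` string as the dictionary states. The helper names are hypothetical.

```python
REGULATION_PERIODS = 4
PERIOD_SECONDS = 12 * 60  # an NBA regulation quarter lasts 12 minutes


def clock_to_seconds(clock: str) -> int:
    """Convert an H:MM:SS string such as '0:11:43' to seconds."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_left(period: int, remaining_time: str) -> int:
    """Seconds left in regulation, or in the current period during overtime."""
    in_period = clock_to_seconds(remaining_time)
    if period > REGULATION_PERIODS:
        # Overtime: only the current period's clock is known.
        return in_period
    return (REGULATION_PERIODS - period) * PERIOD_SECONDS + in_period


def home_margin(home_score: int, away_score: int) -> int:
    """Positive when the home team leads."""
    return home_score - away_score


print(seconds_left(1, "0:11:43"))  # 2863
print(home_margin(2, 0))           # 2
```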

## Environmental Scan

### Article 1: The problem with win probability

http://www.sloansportsconference.com/wp-content/uploads/2018/02/2011.pdf

 - Lacks sufficient context
   - Win probability models should be responsive to in-game contextual features such as injuries and fouls
 - No measure of uncertainty
   - There are many paths to any one outcome
 - No publicly available datasets or models for comparison
 
### Article 2: Statistical methods in sports with a focus on win probability and performance evaluation

https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=6969&context=etd

 - Discusses common methods of estimating in-game win probability
 - Evaluates the performance and usefulness of each method in the NHL, NBA, and NFL
 
### Insights from articles:

 - Play-by-play data can be used to build the model
 - Random forest can be used as the machine learning algorithm

## Key Challenges

1. Big Dataset 
 - The dataset covers 15 NBA seasons. If the researcher analyzes play-by-play data, this means dealing with millions of records.
 - The researcher is not sure how this will affect running code on her local machine. One recommendation was to put the data in the cloud and work from there.

2. Las Vegas Spread
 - Both articles found during the environmental scan used the Las Vegas spread as one of their feature variables. The current dataset does not provide this.
 - The researcher can use Elo ratings in its place.
 
3. Random Forest
 - The researcher does not yet have experience with, or basic knowledge of, Random Forest as a machine learning algorithm. It will be taught in her Data Science course, but possibly too late to complete the Capstone project in time.
 - If the dataset size and Random Forest prove too challenging, the researcher can re-scope the project to look at the data at a coarser level (e.g., final game results instead of play-by-play). The researcher can also fall back on simpler machine learning algorithms such as Linear Regression and/or Logistic Regression.
 
4. Data Transformation
 - Even though the data is already clean, the data dictionary above shows that most variables are strings. Transforming these into categorical variables will be very challenging, especially given the size of the dataset.
 - The researcher's limited knowledge of NBA team names and player names may also make the transformation challenging, since transforming some of the variables may not be intuitive.
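
For the Elo ratings mentioned under challenge 2, the standard update can be sketched as follows. This is only an illustration of the general Elo formula: the K-factor of 20 and the starting rating of 1500 are assumptions, home-court advantage is ignored, and nothing here is taken from the two articles.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that team A beats team B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 20.0) -> Tuple[float, float]:
    """Return the (winner, loser) ratings after one game."""
    # The winner gains more when the win was less expected.
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Two hypothetical teams that both start at 1500.
home, away = 1500.0, 1500.0
pre_game_win_probability = expected_score(home, away)
home, away = update_elo(home, away)  # the home team wins
print(pre_game_win_probability, home, away)
```

The pre-game rating difference (or `expected_score` itself) could then serve as a feature in place of the spread.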

In [5]:
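from io import StringIO  # pandas.compat.StringIO was removed from recent pandas releases

import pandas as pd
from tensorflow.python.lib.io import file_io


def read_data(gcs_path):
    """Read a CSV file from Google Cloud Storage into a pandas DataFrame."""
    print('downloading csv file from', gcs_path)
    # file_io.FileIO accepts gs:// paths as well as local ones
    file_stream = file_io.FileIO(gcs_path, mode='r')
    data = pd.read_csv(StringIO(file_stream.read()))
    return data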
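
`read_data` loads an entire season into memory at once. Given the dataset-size concern listed under Key Challenges, one alternative is to stream rows and keep only running aggregates. The sketch below uses the standard-library `csv` module on a tiny in-memory stand-in; `count_event_types` and the sample rows are hypothetical. The `chunksize` parameter of `pandas.read_csv` offers a comparable chunk-by-chunk iteration.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_event_types(stream: TextIO) -> Counter:
    """Tally event_type values one row at a time instead of loading the whole file."""
    counts: Counter = Counter()
    for row in csv.DictReader(stream):
        counts[row["event_type"]] += 1
    return counts


# Tiny in-memory stand-in for a season file (hypothetical rows).
sample = io.StringIO(
    "game_id,event_type,points\n"
    "1,shot,2\n"
    "1,foul,\n"
    "1,shot,0\n"
)
event_counts = count_event_types(sample)
print(event_counts.most_common())
```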

In [6]:
df = read_data('gs://mynba/Play_By_Play/2004_2005_season.csv')
df.head()

downloading csv file from gs://mynba/Play_By_Play/2004_2005_season.csv


| Unnamed: 0 | game_id | data_set | date | a1 | a2 | a3 | a4 | a5 | h1 | h2 | ... | reason | result | steal | type | shot_distance | original_x | original_y | converted_x | converted_y | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ="0020400001" | 2004-2005 Regular Season | 2004-11-02 | Jim Jackson | Maurice Taylor | Yao Ming | Tracy McGrady | Charlie Ward | Tayshaun Prince | Rasheed Wallace | ... | | | | start of period | | | | | | |
| 1 | ="0020400001" | 2004-2005 Regular Season | 2004-11-02 | Jim Jackson | Maurice Taylor | Yao Ming | Tracy McGrady | Charlie Ward | Tayshaun Prince | Rasheed Wallace | ... | | | | jump ball | | | | | | Jump Ball Wallace vs. Yao: Tip to Billups |
| 2 | ="0020400001" | 2004-2005 Regular Season | 2004-11-02 | Jim Jackson | Maurice Taylor | Yao Ming | Tracy McGrady | Charlie Ward | Tayshaun Prince | Rasheed Wallace | ... | | made | | Jump Shot | 15.0 | -149.0 | 3.0 | 10.1 | 88.7 | Hamilton 15' Jump Shot (2 PTS) (Wallace 1 AST) |
| 3 | ="0020400001" | 2004-2005 Regular Season | 2004-11-02 | Jim Jackson | Maurice Taylor | Yao Ming | Tracy McGrady | Charlie Ward | Tayshaun Prince | Rasheed Wallace | ... | p.foul | | | p.foul | | | | | | Wallace P.FOUL (P1.T1) |
| 4 | ="0020400001" | 2004-2005 Regular Season | 2004-11-02 | Jim Jackson | Maurice Taylor | Yao Ming | Tracy McGrady | Charlie Ward | Tayshaun Prince | Rasheed Wallace | ... | bad pass | | | bad pass | | | | | | Ward Bad Pass Turnover (P1.T1) |
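
The preview shows `game_id` wrapped in a spreadsheet-style `="..."` and several string columns that would need numeric codes before modeling. The following sketch shows one way to handle both. The helper names are hypothetical, and integer coding is only one option.

```python
from typing import Dict, List


def clean_game_id(raw: str) -> str:
    """Strip the ="..." wrapper seen in the preview, keeping leading zeros."""
    return raw.strip().lstrip("=").strip('"')


def encode_labels(values: List[str]) -> Dict[str, int]:
    """Assign each distinct string an integer code in order of first appearance."""
    codes: Dict[str, int] = {}
    for value in values:
        # setdefault leaves existing codes untouched, so repeats reuse their code
        codes.setdefault(value, len(codes))
    return codes


# Values of the `type` column in the five preview rows.
type_values = ["start of period", "jump ball", "Jump Shot", "p.foul", "bad pass"]
type_codes = encode_labels(type_values)

print(clean_game_id('="0020400001"'))
print([type_codes[value] for value in type_values])
```

Integer codes impose an arbitrary ordering on the categories. That is usually tolerable for tree-based models such as random forest, while linear or logistic regression would more commonly use one-hot columns.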


## Additional Reference

https://www.slideshare.net/ThomasSalierno/national-basketball-association-industry-analysis