# Predicting NBA Game Outcomes Based on a Single Team's Game Stats

This notebook serves as my introduction to Machine Learning and Data Science. It can also serve as a reference for future learners. Common machine learning libraries (pandas, NumPy, scikit-learn) are used throughout the project. A project like this is well suited to beginners because a large amount of NBA game data is readily available.


## The Game Plan  

Let's say we want to create a model that, given one team's stats for a game, predicts whether that team won or lost using only those stats. Would it be more accurate to use all the stats available or just specific ones?

###### What we intend to do:
1. Scrape the Data: 
NBA.com has data on all games dating back to 1983. Scraping the data with Selenium or BeautifulSoup is an option, but it would be painful and time-consuming (a data scientist's biggest dilemma). Luckily, there is an [NBA.com API Client](https://github.com/swar/nba_api) made available by some wonderful GitHub contributors.

2. Clean and Analyze the Data:
This will be a relatively large data set that will probably contain null values and raw output that needs parsing. There are several ways to go about this (Excel, Python libraries, etc.).

3. Create and Test the Model:
The type of model you choose depends on the features of your data and on the output you're trying to predict.


### Scraping the Data
After installing the nba_api module and browsing the [Documentation](https://github.com/swar/nba_api/blob/master/docs/table_of_contents.md), you can see that it has 'teams' and 'leaguegamefinder' modules that will help us gather historical game data:

In [1]:
import pandas as pd

from nba_api.stats.static import teams
from nba_api.stats.endpoints import leaguegamefinder

nba_teams = teams.get_teams() # List of dictionaries, one per NBA team, with various attributes
team_ids = [team['id'] for team in nba_teams] # Collect every team ID
    
#for i in team_ids:
    #gamefinder = leaguegamefinder.LeagueGameFinder(team_id_nullable=i) # Retrieve Historical Game Data
    #games = gamefinder.get_data_frames()[0]
    #games.to_csv('games.csv', mode='a')
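
The commented-out loop above appends to the CSV once per team, which writes the header row again on every append, and back-to-back requests can trip NBA.com's rate limiting. A minimal sketch of one way around both, assuming a short pause between requests is enough: the helper names `fetch_all_games` and `fetch_team_games` and the one-second default pause are illustrative choices, not part of nba_api or of the original notebook.

```python
import time

import pandas as pd


def fetch_all_games(team_ids, fetch_one, pause_seconds=1.0):
    """Call fetch_one(team_id) for each team, pausing between calls,
    and return all results combined into a single DataFrame."""
    frames = []
    for team_id in team_ids:
        frames.append(fetch_one(team_id))
        time.sleep(pause_seconds)  # hypothetical pause length; tune as needed
    # One combined frame with a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


def fetch_team_games(team_id):
    """Retrieve one team's historical games through nba_api."""
    # Imported here so fetch_all_games can be used without nba_api installed
    from nba_api.stats.endpoints import leaguegamefinder

    finder = leaguegamefinder.LeagueGameFinder(team_id_nullable=team_id)
    return finder.get_data_frames()[0]


# Example usage (performs real network requests, so it is left commented out):
# games = fetch_all_games(team_ids, fetch_team_games)
# games.to_csv('games.csv')
```

Writing the file once after concatenation means the header appears only at the top of the CSV.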

You should now have a CSV file of all games played, with their stats, dating all the way back to 1983 (you may have to split the retrieval into batches because NBA.com times you out after a certain number of requests):

In [2]:
df = pd.read_csv('games.csv')
print(df)

       SEASON_ID     TEAM_ID TEAM_ABBREVIATION          TEAM_NAME   GAME_ID  \
0          42021  1610612737               ATL      Atlanta Hawks  42100105   
1          42021  1610612737               ATL      Atlanta Hawks  42100104   
2          42021  1610612737               ATL      Atlanta Hawks  42100103   
3          42021  1610612737               ATL      Atlanta Hawks  42100102   
4          42021  1610612737               ATL      Atlanta Hawks  42100101   
...          ...         ...               ...                ...       ...   
99850      21988  1610612766               CHH  Charlotte Hornets  28800062   
99851      21988  1610612766               CHH  Charlotte Hornets  28800052   
99852      21988  1610612766               CHH  Charlotte Hornets  28800024   
99853      21988  1610612766               CHH  Charlotte Hornets  28800015   
99854      21988  1610612766               CHH  Charlotte Hornets  28800008   

        GAME_DATE      MATCHUP WL  MIN  PTS  ...  F

### Cleaning and Analyzing the Data

To keep the model feasible, I decided to include only the last 10 years of data (beginning in 2012). I also normalized the data to a 0-1 scale as a personal preference. To do this, I used [Excel VBA Scripts](https://docs.microsoft.com/en-us/office/vba/library-reference/concepts/getting-started-with-vba-in-office), as well as the libraries available in Anaconda. You can do this in whatever manner you please.
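
For readers who prefer to stay in Python, the filtering and 0-1 scaling can also be sketched with pandas. This is not the VBA workflow used here, only an illustration under assumptions: the cut-off is applied to the `GAME_DATE` column at 2012-01-01, scaling is plain min-max per column, and the function name `recent_and_scaled` is made up for this example.

```python
import pandas as pd


def recent_and_scaled(games, stat_cols, start="2012-01-01"):
    """Keep games played on or after `start` and min-max scale
    the given stat columns to the 0-1 range."""
    dates = pd.to_datetime(games["GAME_DATE"])
    recent = games[dates >= pd.Timestamp(start)].copy()
    for col in stat_cols:
        values = recent[col].astype(float)
        span = values.max() - values.min()
        if span:
            recent[col] = (values - values.min()) / span
        else:
            recent[col] = 0.0  # constant column: nothing to scale
    return recent


# Example usage (column names taken from the games.csv preview above):
# recent = recent_and_scaled(pd.read_csv('games.csv'), ['PTS'])
```

The input frame is left unchanged because the filtered rows are copied before scaling.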

You will also want to split your data into training and test sets. There are several ways to do it. I did an 80-20 split using Excel VBA; the output is two CSV files: 'nba_train.csv' and 'nba_test.csv'.
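
As one possible pandas alternative to the VBA split, the sketch below draws a random 80% of the rows for training and keeps the rest for testing. It assumes the DataFrame has a unique index; the function name `split_80_20` and the fixed seed are illustrative choices.

```python
import pandas as pd


def split_80_20(df, train_frac=0.8, seed=0):
    """Randomly assign train_frac of the rows to a training frame
    and the remaining rows to a test frame."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)  # relies on a unique index
    return train, test


# Example usage, producing the two files named above:
# train, test = split_80_20(cleaned_df)
# train.to_csv('nba_train.csv')
# test.to_csv('nba_test.csv')
```

Here `cleaned_df` stands for whatever cleaned and normalized DataFrame you ended up with; it is a placeholder name.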

#### Finally, here comes the fun part - machine learning at its finest:

To start the exploratory data analysis, I used [Speedml](https://pythonhosted.org/speedml/) to create some figures that visualize which features are most important. Let's initialize it:

In [4]:
from speedml import Speedml
sml = Speedml('nba_train.csv', 'nba_test.csv', target='Outcome')

sml.train.head() # Preview first 5 rows of training data

Unnamed: 0.1,Unnamed: 0,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV
0,0,0.505376,0.418919,0.544118,0.503001,0.275862,0.357143,0.16,0.393443,0.35,0.857,0.272727,0.535714,0.518519,0.288462,0.296296,0.086957,0.475
1,1,0.462366,0.405405,0.551471,0.480192,0.517241,0.6,0.1785,0.180328,0.25,0.55,0.25,0.535714,0.506173,0.346154,0.074074,0.130435,0.375
2,2,0.596774,0.554054,0.588235,0.615846,0.413793,0.457143,0.1875,0.278689,0.2625,0.81,0.113636,0.553571,0.444444,0.442308,0.185185,0.130435,0.275
3,3,0.564516,0.554054,0.639706,0.565426,0.413793,0.571429,0.15,0.180328,0.175,0.786,0.159091,0.589286,0.493827,0.403846,0.222222,0.086957,0.475
4,4,0.489247,0.391892,0.551471,0.464586,0.344828,0.514286,0.139,0.377049,0.3375,0.852,0.181818,0.535714,0.469136,0.307692,0.296296,0.086957,0.45


Now let's create a heatmap that displays feature correlation (by default it should appear in a tkinter window):

In [None]:
sml.plot.correlate()

![correlation%20map.png](attachment:correlation%20map.png)
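
The heatmap shows all pairwise correlations at once. To read off just the relationship between each feature and the target, a pandas-only sketch is given below. It assumes the `Outcome` column is numeric (for example 1 for a win and 0 for a loss), which the preview above does not confirm; the function name `rank_by_target_correlation` is illustrative.

```python
import pandas as pd


def rank_by_target_correlation(train, target="Outcome"):
    """Return each numeric feature's Pearson correlation with the target,
    ordered from strongest to weakest absolute correlation."""
    numeric = train.select_dtypes("number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Example usage:
# print(rank_by_target_correlation(pd.read_csv('nba_train.csv')))
```

Leftover index columns such as `Unnamed: 0` are numeric too, so they will appear in the ranking unless they are dropped first.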

In [None]:
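from sklearn.model_selection import train_test_split

df = pd.read_csv('games.csv')

# Features: every column except the win/loss label.
# Note: this still includes non-numeric columns (e.g. TEAM_NAME, MATCHUP),
# which need to be dropped or encoded before fitting a model.
X = df.drop(['WL'], axis=1).values
y = df['WL'].values # Labels: the WL (win/loss) column

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)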