# Making Predictions on NBA Games

In this notebook, my goal is to gather statistics from basketball games, including blocks, steals, and free-throw percentages, to see whether machine learning algorithms can predict who will win a game. I also hope to get some insight into what it takes to win.

I begin by scraping the data from basketball-reference.com and organizing it into JSON. I'll be working on functions for extracting features and for efficiently updating the data as the season goes on, and I hope to test a few different algorithms from sklearn. Since I'm in the middle of a semester at school, I can't work on this project consistently, so expect small updates.

In [1]:
import pandas as pd
import numpy as np
import json
from pprint import pprint
from sklearn.linear_model import LogisticRegression

# Load the scraped box scores (a list with one entry per game).
with open('data.json') as f:
    data = json.load(f)

#pprint(data[0:3])


In [3]:
keys = []
frames = []
#for game in data:
#    for key, d in game.items():
#        keys.append(key)
#        frames.append(pd.DataFrame.from_dict(d, orient='index'))
#print(keys)
#print(frames)


reform = [{(outerKey, innerKey): values for outerKey, innerDict in game.items() for innerKey, values in innerDict.items()} for game in data]

# Each game becomes a small frame indexed by (game title, team).
dfs = [pd.DataFrame(game).T for game in reform]

# DataFrame.append was removed in pandas 2.0; concatenate all frames once instead.
game_df = pd.concat(dfs)

print(game_df.tail(1))




                                                                 ast ast_pct  \
Charlotte Hornets at Orlando Magic, February 14... Orlando Magic  32    66.7   

                                                                 blk blk_pct  \
Charlotte Hornets at Orlando Magic, February 14... Orlando Magic   6     9.5   

                                                                 def_rtg drb  \
Charlotte Hornets at Orlando Magic, February 14... Orlando Magic    91.9  42   

                                                                 drb_pct  \
Charlotte Hornets at Orlando Magic, February 14... Orlando Magic    76.4   

                                                                 efg_pct  fg  \
Charlotte Hornets at Orlando Magic, February 14... Orlando Magic    .614  48   

                                                                 fg3   ...    \
Charlotte Hornets at Orlando Magic, February 14... Orlando Magic  17   ...     
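
To make the reshaping step concrete, here is a minimal sketch on made-up data. It assumes each entry of `data.json` is shaped like `{game title: {team: {stat: value}}}`, which is what the comprehension in the cell above implies. The team names, stat values, and the `flatten_games` helper are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical toy data in the assumed shape:
# a list of games, each {game title: {team: {stat: value}}}.
toy_data = [
    {"Team A at Team B, January 1, 2019": {
        "Team A": {"ast": "20", "fg": "40"},
        "Team B": {"ast": "25", "fg": "42"},
    }},
    {"Team B at Team C, January 2, 2019": {
        "Team B": {"ast": "18", "fg": "38"},
        "Team C": {"ast": "22", "fg": "41"},
    }},
]

def flatten_games(games):
    """Build one frame with a (game, team) MultiIndex and one column per stat."""
    keys, rows = [], []
    for game in games:
        for title, teams in game.items():
            for team, stats in teams.items():
                keys.append((title, team))
                rows.append(stats)
    index = pd.MultiIndex.from_tuples(keys, names=["game", "team"])
    return pd.DataFrame(rows, index=index)

toy_df = flatten_games(toy_data)
print(toy_df)

# All games involving one team, looked up on the "team" level of the index.
print(toy_df.xs("Team B", level="team"))
```

Building the rows in one pass and constructing the frame once avoids growing a DataFrame inside a loop.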

                                          

Now that we have a single dataframe with all games in it, we can find games by looking for a certain team.
Below we have all games for the Boston Celtics.

In [4]:
boston_games = game_df.xs('Boston Celtics', level=1).index.values
print("Number of Celtics games:", len(boston_games))
print(boston_games)
print(game_df.loc[boston_games[-1]])

Number of Celtics games: 58
['Philadelphia 76ers at Boston Celtics, October 16, 2018'
 'Boston Celtics at Toronto Raptors, October 19, 2018'
 'Boston Celtics at New York Knicks, October 20, 2018'
 'Orlando Magic at Boston Celtics, October 22, 2018'
 'Boston Celtics at Oklahoma City Thunder, October 25, 2018'
 'Boston Celtics at Detroit Pistons, October 27, 2018'
 'Detroit Pistons at Boston Celtics, October 30, 2018'
 'Milwaukee Bucks at Boston Celtics, November 1, 2018'
 'Boston Celtics at Indiana Pacers, November 3, 2018'
 'Boston Celtics at Denver Nuggets, November 5, 2018'
 'Boston Celtics at Phoenix Suns, November 8, 2018'
 'Boston Celtics at Utah Jazz, November 9, 2018'
 'Boston Celtics at Portland Trail Blazers, November 11, 2018'
 'Chicago Bulls at Boston Celtics, November 14, 2018'
 'Toronto Raptors at Boston Celtics, November 16, 2018'
 'Utah Jazz at Boston Celtics, November 17, 2018'
 'Boston Celtics at Charlotte Hornets, November 19, 2018'
 'New York Knicks at Boston Celtics

The Celtics have played 58 games in this data set. For each game, let's compute the difference between their stats and their opponent's.
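
The per-game difference described above can be sketched on made-up numbers. This is only an illustration under assumptions: the two-row frame, the team names, and the `stat_spread` helper are hypothetical, and the values are stored as strings because that is how they appear in the scraped data.

```python
import pandas as pd

# Hypothetical box-score rows for a single game, indexed by (game, team).
index = pd.MultiIndex.from_tuples(
    [("Team A at Team B, January 1, 2019", "Team A"),
     ("Team A at Team B, January 1, 2019", "Team B")],
    names=["game", "team"],
)
toy_df = pd.DataFrame({"ast": ["20", "25"], "efg_pct": [".500", ".450"]}, index=index)

def stat_spread(df, game, team):
    """Return `team`'s stats minus its opponent's stats for one game."""
    both = df.loc[game].astype(float)   # two rows, one per team
    opponent = both.drop(index=team)    # the single remaining row
    return both.loc[team] - opponent.iloc[0]

spread = stat_spread(toy_df, "Team A at Team B, January 1, 2019", "Team A")
print(spread)
```

A negative value means the team trailed its opponent in that stat for that game.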

In [15]:
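def isHomeTeam(team, game_title):
    # Titles look like "Away Team at Home Team, Month Day, Year".
    # Split on " at " (with spaces) so names containing "at", such as
    # "Golden State Warriors", are not cut in the middle.
    home_team = game_title.split(" at ", 1)[1].split(",")[0]
    return team == home_team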


def getSpread(team, game_df):
    # Titles of every game the team appears in (index level 1 is the team name).
    team_games = game_df.xs(team, level=1).index.values
    game_stats = game_df.loc[team_games].T
    original_team_stats = []
    other_teams_stats = []
    # Columns of the transposed frame are (game title, team name) tuples.
    for game_title, team_name in game_stats:
        row = game_df.loc[game_title].loc[team_name]
        if team_name == team:
            original_team_stats.append(row)
        else:
            other_teams_stats.append(row)

    for i in range(len(original_team_stats)):
        # Stats are stored as strings, so convert before subtracting.
        first_team = original_team_stats[i].values.astype('float', casting='unsafe')
        other_team = other_teams_stats[i].values.astype('float', casting='unsafe')
        spread = first_team - other_team
        print(first_team)
        print(other_team)
        print(spread)
        original_team_stats[i] = pd.Series(index=original_team_stats[i].index.values, data=spread)
        # TODO: append home value??
        print(original_team_stats[i])

    # One Series of stat differences per game, in the same order as team_games.
    return original_team_stats

celtics_spreads = getSpread('Boston Celtics', game_df)

[2.100e+01 5.000e+01 5.000e+00 8.200e+00 8.340e+01 4.300e+01 8.780e+01
 4.900e-01 4.200e+01 1.100e+01 2.970e-01 3.700e+01 3.810e-01 4.330e-01
 9.700e+01 1.000e+01 7.140e-01 1.400e+01 1.440e-01 2.400e+02 1.007e+02
 1.200e+01 2.260e+01 2.000e+01 1.050e+02 7.000e+00 6.700e+00 1.400e+01
 1.190e+01 5.500e+01 5.390e+01 5.090e-01 1.000e+02]
[1.800e+01 5.290e+01 5.000e+00 8.300e+00 1.007e+02 4.100e+01 7.740e+01
 4.200e-01 3.400e+01 5.000e+00 1.920e-01 2.600e+01 2.990e-01 3.910e-01
 8.700e+01 1.400e+01 6.090e-01 2.300e+01 2.640e-01 2.400e+02 8.340e+01
 6.000e+00 1.220e+01 2.000e+01 8.700e+01 8.000e+00 7.700e+00 1.600e+01
 1.410e+01 4.700e+01 4.610e+01 4.480e-01 1.000e+02]
[  3.     -2.9     0.     -0.1   -17.3     2.     10.4     0.07    8.
   6.      0.105  11.      0.082   0.042  10.     -4.      0.105  -9.
  -0.12    0.     17.3     6.     10.4     0.     18.     -1.     -1.
  -2.     -2.2     8.      7.8     0.061   0.   ]
ast                  3.000
ast_pct             -2.900
blk           
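
The game titles printed earlier follow the pattern "Away Team at Home Team, Month Day, Year". Assuming that pattern holds for every title, a home/away flag can be derived from the title alone. The sketch below is one way to do it; `parse_game_title` and `is_home_team` are hypothetical helper names, not part of the notebook.

```python
def parse_game_title(game_title):
    """Split 'Away at Home, Month Day, Year' into (away, home, date)."""
    teams, date = game_title.split(", ", 1)
    # Split on " at " with surrounding spaces: a bare "at" also occurs
    # inside names such as "Golden State Warriors".
    away, home = teams.split(" at ", 1)
    return away, home, date

def is_home_team(team, game_title):
    """True when `team` is the home side of the game."""
    _, home, _ = parse_game_title(game_title)
    return team == home

title = "Boston Celtics at Toronto Raptors, October 19, 2018"
print(parse_game_title(title))
print(is_home_team("Toronto Raptors", title))
print(is_home_team("Boston Celtics", title))
```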

In [61]:
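def lastFiveGames(team, game_df):
    # Titles of the team's five most recent games, assuming rows are in chronological order.
    team_games = game_df.xs(team, level=1).index.values[-5:]
    return team_games

lastFiveGames('Boston Celtics', game_df)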

In [None]:
#get last five games for each team
#for team in dataframe:
    #games = game_df.xs(team, level=1).index.values[-5:]
    #for i in range(len(games)):
        #
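
One possible way to fill in the loop sketched in the comments above is to let pandas group by team and keep the last five rows of each group. This is only a sketch on made-up data: the game titles and team names are hypothetical, and it assumes rows are stored in chronological order, which the `[-5:]` slice also relies on.

```python
import pandas as pd

# Hypothetical schedule: seven games between two teams, in chronological order.
keys = [(f"Game {i}", team) for i in range(1, 8) for team in ("Team A", "Team B")]
index = pd.MultiIndex.from_tuples(keys, names=["game", "team"])
toy_df = pd.DataFrame({"pts": range(len(keys))}, index=index)

# Keep the last five rows for every team in one call.
last_five = toy_df.groupby(level="team").tail(5)

# Game titles of the last five games for a single team.
team_a_titles = last_five.xs("Team A", level="team").index.tolist()
print(team_a_titles)
```

`GroupBy.tail` preserves the original row order, so the selected games stay chronological.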