# 1. Data Preprocessing

This project uses two datasets. The first contains the relative score differential and the total score of every game for each team, which I scraped from ESPN. The second is the "Advanced Team Statistics" of each team, which I also scraped, from Fox Sports, day by day over the 2018-2019 season. We have shared the databases, so you can look through them. The statistics dataset contains almost every predictor we want to train our model on, except the Home/Away information, which we take from the schedule/score dataset and add to the training dataset. (The datasets are not included in the project files; this notebook is only meant to give the intuition behind the opponent-based, double-checked prediction approach and to share the results.)

Although we have data for the whole season, the model we want to build uses only earlier games to predict a game. For example, to predict the 44th game of the season for the Timberwolves, the model is trained only on the games before it and then predicts the result of the 44th game. In addition, the model does not use the statistics of the team whose game we want to predict, but those of its opponents. That is, the statistics of each previous opponent are the explanatory variables and the result of the game against that opponent is the response variable; to predict the Timberwolves' 44th game, we give the model the statistics of that game's opponent.
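
The opponent-based setup described here can be illustrated on a toy table. The sketch below is only an illustration: the five games, the column names `OppOffRtg` and `OppDefRtg`, and all numbers are made up, and it builds the training and test matrices without fitting any model.

```python
import pandas as pd

# Hypothetical mini-season for one team: each row is a game, and the feature
# columns describe the OPPONENT in that game, not the team itself.
games = pd.DataFrame({
    "Game": [1, 2, 3, 4, 5],
    "Opponent": ["Utah", "Dallas", "Denver", "Miami", "Boston"],
    "OppOffRtg": [108.1, 110.4, 112.0, 106.3, 111.5],
    "OppDefRtg": [105.2, 109.9, 107.4, 108.8, 106.0],
    "Score": [-4, 7, -11, 3, 6],  # relative score differential of our team
})
games["WinLose"] = (games["Score"] >= 0).astype(int)

game_to_predict = 5
past = games[games["Game"] < game_to_predict]      # only earlier games
target = games[games["Game"] == game_to_predict]   # the game to predict

feature_columns = ["OppOffRtg", "OppDefRtg"]
x_train = past[feature_columns]
y_train = past["WinLose"]
x_test = target[feature_columns]

print(x_train.shape, y_train.tolist(), x_test.values.tolist())
```

The model for this team never sees its own statistics: it learns from how the team fared against opponents with given numbers and is then asked about the next opponent's numbers.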

In [1]:
import numpy as np
import sqlite3
import pandas as pd

In [2]:
# Connect to the schedule/score and team statistics databases
con_schedule = sqlite3.connect("schedule_scores.db")
cursor_schedule = con_schedule.cursor()

con_stats = sqlite3.connect("team_stats.db")
cursor_stats = con_stats.cursor()

# List of the 30 teams and a city name dictionary to match the names used by ESPN and Fox Sports
teams = ["CHA","PHI","TOR","BOS","CLE","IND","WSH","MIL","MIA","DET","NY","CHI","ORL","BKN","ATL","HOU","GS","POR","NO","MIN","SA","OKC","DEN","LAC","UTA","LAL","SAC","DAL","PHX","MEM"]
city_dic = {"atl":"Atlanta","bkn": "Brooklyn" ,"bos": "Boston", "cha":"Charlotte", "chi":"Chicago", "cle": "Cleveland", "dal": "Dallas", "den": "Denver", "det":"Detroit","gs":"Golden State", "hou": "Houston", "ind":"Indiana","lac":"LA", "lal": "Los Angeles","mem": "Memphis","mia":"Miami","mil":"Milwaukee","min":"Minnesota","no":"New Orleans","ny":"New York","okc":"Oklahoma City","orl":"Orlando","phi":"Philadelphia","phx":"Phoenix","por":"Portland","sa":"San Antonio","sac":"Sacramento","tor":"Toronto","uta":"Utah","wsh":"Washington" }

# Fetching the schedules of each team
schedules = {}
for team in teams:
    cursor_schedule.execute("SELECT * FROM {}".format(team.lower()))
    schedules[team.lower()] = cursor_schedule.fetchall()

# Fetching the stats of each team
stats = {}
for team in teams:
    cursor_stats.execute("SELECT * FROM {}".format(team.lower()))
    stats[team.lower()] = cursor_stats.fetchall()  
    


In [3]:
# Now we will write a function that creates the x and y matrices used to predict a specific game.
# To do this we are going to add the opponents' statistics to the schedule/score dataset.
 
def get_datasets(schedules,stats):
    
    datasets = {}
    for team in teams:
        team = team.lower()
        schedule = schedules[team]

        schedule_df = pd.DataFrame(schedule, columns=['Game', 'Date', 'Opponent','Home/Away(1/0)','Score','TotalScore'])
        # Date column should be formatted from 'Oct 19 2019' to pandas date
        schedule_df['Date'] = pd.to_datetime(schedule_df['Date'],format='%b %d %Y')

        # We are going to collect each opponent's statistics in stats_df and then combine schedule_df and stats_df
        stats_df = pd.DataFrame(columns=["Date_of_stat","GamesPlayed","OffRtg","DefRtg","Pace","FtRate","ThreeFgTend",
        "TrueS","Efg","TurnOver","OffReb","FtFga","EfgAllow","TurnOvAllow","DefRebAllow","FtFgaAllow"])

        # To fill stats_df we are going to iterate over the rows of schedule_df.
        for index, row in schedule_df.iterrows():

            # team_stats.db has long city names instead of short ones like in the schedule_scores.db 
            # therefore a dictionary is used to match the team names ex. opponent_ = "Atlanta" -> opponent = "atl"
            opponent_ = row["Opponent"]
            opponent = list(filter(lambda x: x[1] == opponent_,list(city_dic.items())))[0][0]

            opponent_stats_table = stats[opponent]
            opponent_stats_df = pd.DataFrame(opponent_stats_table,columns=["Date_of_stat","GamesPlayed","OffRtg","DefRtg","Pace","FtRate","ThreeFgTend",
        "TrueS","Efg","TurnOver","OffReb","FtFga","EfgAllow","TurnOvAllow","DefRebAllow","FtFgaAllow"])
            opponent_stats_df['Date_of_stat'] = pd.to_datetime(opponent_stats_df['Date_of_stat'],format='%b %d %Y')

            # Statistics were not gathered every day, so we look up the first statistics row
            # dated on or after the day before the game (searchsorted assumes Date_of_stat is sorted)
            date = row["Date"]
            stats_at_date = opponent_stats_df.iloc[opponent_stats_df.Date_of_stat.searchsorted(date-pd.DateOffset(days=1))].to_frame().T
            stats_df = pd.concat([stats_df,stats_at_date])


        df = pd.concat([schedule_df.reset_index(drop=True),stats_df.reset_index(drop=True)], axis=1)
        # Moving the Home/Away column next to the statistics makes it easier to slice the x and y matrices
        homeAway_column = df.pop('Home/Away(1/0)')
        df.insert(7, 'Home/Away(1/0)', homeAway_column)

        df['WinLose'] = np.where(df['Score'] >= 0, 1, 0)
        
        datasets[team] = df
    return datasets

datasets = get_datasets(schedules,stats)
datasets["atl"].head()

| | Game | Date | Opponent | Score | TotalScore | Date_of_stat | GamesPlayed | Home/Away(1/0) | OffRtg | DefRtg | ... | TrueS | Efg | TurnOver | OffReb | FtFga | EfgAllow | TurnOvAllow | DefRebAllow | FtFgaAllow | WinLose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2018-10-17 | New York | -19.0 | 233.0 | 2018-10-31 00:00:00 | 7 | 0.0 | 106.5 | 110.5 | ... | 0.522 | 0.489 | 11.7 | 24.2 | 16.2 | 0.544 | 14.5 | 75.8 | 21.3 | 0 |
| 1 | 2 | 2018-10-19 | Memphis | -14.0 | 248.0 | 2018-11-02 00:00:00 | 7 | 0.0 | 105.5 | 101.8 | ... | 0.557 | 0.512 | 12.4 | 17.4 | 25.7 | 0.524 | 16.4 | 82.6 | 20.2 | 0 |
| 2 | 3 | 2018-10-21 | Cleveland | 22.0 | 244.0 | 2018-11-01 00:00:00 | 8 | 0.0 | 109.8 | 117.7 | ... | 0.538 | 0.489 | 12.7 | 29.6 | 22.7 | 0.585 | 13.6 | 70.4 | 19.9 | 1 |
| 3 | 4 | 2018-10-24 | Dallas | 7.0 | 215.0 | 2018-10-31 00:00:00 | 7 | 1.0 | 107.8 | 113.5 | ... | 0.542 | 0.507 | 13.2 | 23.9 | 19.4 | 0.568 | 14.2 | 76.1 | 23.4 | 1 |
| 4 | 5 | 2018-10-27 | Chicago | -12.0 | 182.0 | 2018-10-31 00:00:00 | 7 | 1.0 | 108.7 | 118.2 | ... | 0.567 | 0.535 | 12.9 | 17.5 | 20.1 | 0.539 | 10.9 | 82.5 | 22.3 | 0 |
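
The date lookup inside get_datasets relies on `searchsorted`, which returns the position of the first statistics row dated on or after the day before the game. This is why the first rows of the output above pair mid-October games with snapshots dated 2018-10-31 or later. The sketch below reproduces that lookup on made-up snapshot dates; the helper name `snapshot_position` and the clamping of out-of-range positions are assumptions of this sketch, not part of the notebook.

```python
import pandas as pd

# Hypothetical snapshot dates; statistics were not collected every day.
stat_dates = pd.Series(pd.to_datetime(["2018-10-31", "2018-11-02", "2018-11-05"]))

def snapshot_position(game_date):
    """Position of the first snapshot dated on or after the day before game_date."""
    target = pd.Timestamp(game_date) - pd.DateOffset(days=1)
    pos = int(stat_dates.searchsorted(target))
    # searchsorted returns len(stat_dates) when every snapshot is older than
    # the target; clamping to the last row is an assumption of this sketch.
    return min(pos, len(stat_dates) - 1)

for game_date in ["2018-10-17", "2018-11-03", "2018-11-04", "2018-12-01"]:
    pos = snapshot_position(game_date)
    print(game_date, "->", stat_dates.iloc[pos].date())
```

Note that a game on 2018-11-04 is matched with the 2018-11-05 snapshot here, so when no snapshot exists for the day before a game, the selected row can be dated after the game itself.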


## Train Test Split

The get_train_test function takes the last n games before the game we want to predict (n is given by last_n_games) as the training set, and the row of the game we want to predict as the test set.

In [4]:
def get_train_test(df,game_number,last_n_games):
    
    if game_number <= last_n_games:
        first_game = 0
    else:
        first_game = game_number-last_n_games-1

    x_train = df.iloc[first_game:game_number-1,7:-1]
    y_train = df.iloc[first_game:game_number-1,-1]

    
    x_test = df.iloc[game_number-1,7:-1]
    y_test = df.iloc[game_number-1,-1]
    
    return x_train,x_test,y_train,y_test
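
To make the window arithmetic concrete, the sketch below repeats only the index computation of get_train_test and prints the row positions it selects. The helper name `training_window` is hypothetical and exists just for this illustration.

```python
def training_window(game_number, last_n_games):
    """Return (first_train_row, last_train_row_exclusive, test_row) as 0-based positions."""
    if game_number <= last_n_games:
        first_game = 0
    else:
        first_game = game_number - last_n_games - 1
    return first_game, game_number - 1, game_number - 1

# Game 44 with a 60-game window: all 43 earlier games are used for training.
print(training_window(44, 60))  # (0, 43, 43)

# Game 70 with a 60-game window: rows 9..68 (60 games) train, row 69 is the test game.
print(training_window(70, 60))  # (9, 69, 69)
```

Because game numbers start at 1 and row positions at 0, the game to predict always sits at position `game_number - 1` and is never part of the training slice.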

# 2. Models and Algorithm

In this section, several classification algorithms are wrapped in functions so they can be called in the prediction scenarios that follow. Giving every model function the same signature keeps the prediction algorithm simple.

## Logistic Regression

In [5]:
def logistic_regression(x_train,x_test,y_train,y_test):
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X = sc_X.fit_transform(x_train.values)

    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression(random_state = 0)
    classifier.fit(X, y_train)

    x_test_sca = sc_X.transform([x_test])
    y_pred = classifier.predict(x_test_sca)
    
    prob = classifier.predict_proba(x_test_sca)
    
    return y_pred,y_test,prob

## KNN

In [6]:
def knn(x_train,x_test,y_train,y_test):
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X = sc_X.fit_transform(x_train.values)

    from sklearn.neighbors import KNeighborsClassifier
    classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
    classifier.fit(X, y_train)

    x_test_sca = sc_X.transform([x_test])
    y_pred = classifier.predict(x_test_sca)

    prob = classifier.predict_proba(x_test_sca)

    return y_pred,y_test,prob

## SVM

In [7]:
def svm(x_train,x_test,y_train,y_test):
    
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X = sc_X.fit_transform(x_train.values)
    y_train = y_train.astype('int')
    y_test = y_test.astype('int')
    
    from sklearn.svm import SVC
    classifier = SVC(kernel = 'rbf', random_state = 0,probability=True)
    classifier.fit(X, y_train)

    x_test_sca = sc_X.transform([x_test])
    y_pred = classifier.predict(x_test_sca)
    
    prob = classifier.predict_proba(x_test_sca)
    
    return y_pred,y_test,prob

## Naive Bayes

In [8]:
def naive_bayes(x_train,x_test,y_train,y_test):
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X = sc_X.fit_transform(x_train.values)
    
    
    from sklearn.naive_bayes import GaussianNB
    classifier = GaussianNB()
    classifier.fit(X, y_train)

    x_test_sca = sc_X.transform([x_test])
    y_pred = classifier.predict(x_test_sca)
    
    prob = classifier.predict_proba(x_test_sca)
    return y_pred,y_test,prob

## Decision Tree

In [9]:
def decision_tree(x_train,x_test,y_train,y_test):
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X = sc_X.fit_transform(x_train.values)

    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    classifier.fit(X, y_train)

    x_test_sca = sc_X.transform([x_test])
    y_pred = classifier.predict(x_test_sca)
    
    prob = classifier.predict_proba(x_test_sca)
    
    return y_pred,y_test,prob

## Random Forest

In [10]:
def random_forest(x_train,x_test,y_train,y_test):
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X = sc_X.fit_transform(x_train.values)

    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X, y_train)

    x_test_sca = sc_X.transform([x_test])
    y_pred = classifier.predict(x_test_sca)
    
    prob = classifier.predict_proba(x_test_sca)
    
    return y_pred,y_test,prob
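
All six functions above share the same scale, fit and predict steps and differ only in the classifier. As a possible refactoring, not the notebook's own code, the shared steps could live in one helper built on scikit-learn's `make_pipeline`. The helper name `fit_and_predict` and the synthetic data are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_and_predict(classifier, x_train, y_train, x_test):
    """Scale on the training rows only, fit the classifier, and predict one game."""
    model = make_pipeline(StandardScaler(), classifier)
    model.fit(x_train, y_train)
    # A single game arrives as a 1-D vector; scikit-learn expects a 2-D array.
    x_test = np.asarray(x_test, dtype=float).reshape(1, -1)
    return model.predict(x_test), model.predict_proba(x_test)

# Synthetic data for illustration only: 20 past games, 3 opponent statistics.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(20, 3))
y_train = np.array([0, 1] * 10)  # alternating losses and wins
x_test = rng.normal(size=3)

y_pred, prob = fit_and_predict(LogisticRegression(random_state=0), x_train, y_train, x_test)
print(y_pred.shape, prob.shape)
```

Any of the other classifiers used above, for example `SVC(probability=True)` or `GaussianNB()`, could be passed in the same way.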

## Prediction Algorithm

The idea behind the prediction algorithm is to use the opponent's statistics as predictor variables and the result of the game as the response variable. This approach requires training a separate model for each NBA team. With it we aim to answer "How do the Timberwolves perform against a team with XXX statistics this season?", and we hope to find out how the Timberwolves will perform next by feeding the statistics of the next opponent into a classification model.

Since the main motivation of this project is to simulate real-time predictions during an NBA season, when we want to predict a game, for example the 24th game of the Timberwolves, we only use the data from before that game.

In [11]:
def predict_game(df,game_number,model,last_n_games=60):
    
    x_train,x_test,y_train,y_test = get_train_test(df,game_number,last_n_games)
    
    if model == "logistic_regression":
        y_pred,y_test,prob = logistic_regression(x_train,x_test,y_train,y_test)  
    elif model == "knn":
        y_pred,y_test,prob = knn(x_train,x_test,y_train,y_test)  
    elif model == "svm":
        y_pred,y_test,prob = svm(x_train,x_test,y_train,y_test)  
    elif model == "naive_bayes":
        y_pred,y_test,prob = naive_bayes(x_train,x_test,y_train,y_test)  
    elif model == "decision_tree":
        y_pred,y_test,prob = decision_tree(x_train,x_test,y_train,y_test)  
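    elif model == "random_forest":
        y_pred,y_test,prob = random_forest(x_train,x_test,y_train,y_test)  
    else:
        # Fail loudly on an unknown model name instead of an UnboundLocalError at the return below
        raise ValueError("Unknown model: {}".format(model))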
    
    return y_pred,y_test,prob


We are going to use another schedule dataset that covers the whole calendar (not team by team like schedule_scores.db). This dataset, wholeSchedule_1230games.csv, was gathered from "www.basketball-reference.com". A column named Game_Code was added to match the data with other datasets in future projects.

In [12]:
def predict_games_scenario(first_game,last_game,model,last_n_games):
    wholeSchedule = pd.read_csv("wholeSchedule_1230games.csv")
    wholeSchedule['Date'] = pd.to_datetime(wholeSchedule['Date'],format='%Y-%m-%d')
    wholeSchedule = wholeSchedule.sort_values(by="Date")
    
    prediction_results = []
    
    for index,row in wholeSchedule.iterrows():
        
        if index > first_game and index < last_game:
            
            # We need short team names like "atl","phi", the Game_Code column is created as such {game_date}_{visitor_team}_{home_team}. Ex. 16102018_phi_bos
            home_team = row["Game_Code"].split("_")[2]
            visitor_team = row["Game_Code"].split("_")[1]
            date = pd.to_datetime(row["Game_Code"].split("_")[0],format='%d%m%Y') 

            home_df = datasets[home_team]
            visitor_df = datasets[visitor_team]
            
            # We are going to make a prediction with both teams' models and then compare the results.
            # Home team prediction with the selected model
            game_number_for_home_team = home_df[home_df["Date"] == date]["Game"].values[0]
            y_pred_home,y_test_home,prob_home = predict_game(home_df,game_number_for_home_team,model,last_n_games)
            
            # Visitor team prediction with the selected model
            game_number_for_visitor_team = visitor_df[visitor_df["Date"] == date]["Game"].values[0]
            y_pred_visitor,y_test_visitor,prob_visitor = predict_game(visitor_df,game_number_for_visitor_team,model,last_n_games)
            
            prediction_results.append([y_pred_home,y_test_home,max(prob_home[0]),y_pred_visitor,y_test_visitor,max(prob_visitor[0])])
        
    return prediction_results


# 3. Application

We start the application by simulating the prediction model in the middle of the season. Predictions begin after 550 games have been played, so that enough data has been gathered, and we do not predict the last 80 games, since some teams have different motivations at the end of the season, such as tanking or resting for the playoffs. We simulate each model and compare their performance. We also test whether the models of the home and visitor teams predict the same result, and how the correct guess ratio changes when we require both models to predict the same winner.
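
The agreement condition can be sketched on a handful of made-up predictions. Each model labels the game from its own team's point of view (1 means "my team wins"), so the two models name the same winner exactly when their labels differ. The four rows below are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical predictions for four games, one row per game.
predictions = pd.DataFrame({
    "Y_pred_home":    [1, 0, 1, 0],
    "Y_test_home":    [1, 1, 0, 1],
    "Y_pred_visitor": [0, 1, 1, 0],
})

# Both models agree on the winner when the home and visitor labels differ.
agree = predictions["Y_pred_home"] != predictions["Y_pred_visitor"]

# Among the agreed games, the guess is correct when the home label matches the outcome.
correct = agree & (predictions["Y_pred_home"] == predictions["Y_test_home"])

same_predictions = int(agree.sum())
same_and_correct = int(correct.sum())
print(same_predictions, same_and_correct, same_and_correct / same_predictions * 100)
```

Games where both models predict a win for their own team, or both predict a loss, are contradictory and are left out of the ratio.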

In [13]:
models = ["logistic_regression","knn","svm","naive_bayes","decision_tree","random_forest"]

for model in models:

    same_predictions = 0
    same_and_correct_predictions = 0

    total_games = 0
    
    print("Model: ",model)
    
    # We start predicting after 550 games have been played and we do not predict the last 80 games, since most teams have
    # different motives at the end of the season, like tanking or resting key players.
    prediction_results = predict_games_scenario(first_game=550,last_game=1150,model=model,last_n_games=60)
    # prediction_results = [[y_pred_home,y_test_home,prob_home,y_pred_visitor,y_test_visitor,prob_visitor]]
    predictions_df = pd.DataFrame(prediction_results,columns=["Y_pred_home","Y_test_home","Prob_home","Y_pred_visitor","Y_test_visitor","Prob_visitor"])
    
    
    # Each model predicts from its own team's point of view, so the two labels must differ for the models to agree on the winner.
    # Ex. y_pred_home = 0 and y_pred_visitor = 1; both of these predict that the visitor team will win the game
    for index,row in predictions_df.iterrows():
        if row["Y_pred_home"] != row["Y_pred_visitor"]:
            same_predictions+=1
            # Counting correct guesses
            if row["Y_pred_home"] == row["Y_test_home"]:
                same_and_correct_predictions += 1

         
        total_games += 1

    print("Total games:                                 {0}".format(total_games))
    print("*****")
    print("Total same predictions by both models:       {0}".format(same_predictions))
    print("Same predictions by both models and correct: {0} ({1:.2f}%)".format(same_and_correct_predictions,same_and_correct_predictions/same_predictions*100))
    print("-------------------------------------------------------------")


Model:  logistic_regression
Total games:                                 599
*****
Total same predictions by both models:       448
Same predictions by both models and correct: 316 (70.54%)
-------------------------------------------------------------
Model:  knn
Total games:                                 599
*****
Total same predictions by both models:       398
Same predictions by both models and correct: 263 (66.08%)
-------------------------------------------------------------
Model:  svm
Total games:                                 599
*****
Total same predictions by both models:       435
Same predictions by both models and correct: 305 (70.11%)
-------------------------------------------------------------
Model:  naive_bayes
Total games:                                 599
*****
Total same predictions by both models:       429
Same predictions by both models and correct: 297 (69.23%)
-------------------------------------------------------------
Model:  decision_tree
Total game

As the results show, logistic regression, support vector machine and naive Bayes have the best correct guess ratios. In this project we achieved a correct guess ratio of around 70% on the games where both models agree, while keeping the conditions as close to real life as possible. The downside is that the algorithm is not a generic solution: it cannot be used to predict the first games of the season. In future projects we will try to use these predictions in a betting simulation and apply new classification models, such as deep learning, to improve the correct guess percentage; the classification models used in this project can also be optimized. This project was a quick implementation of the opponent-statistics-based, double-checked prediction approach, and it can be improved a lot.