# Problem Statement
***
This project uses machine-learning algorithms to build a model for
predicting NBA game outcomes, which could be used for team and game analysis to improve performance, or even to make betting more fun!


There are three common types of betting lines:

-	Moneyline (who wins)
-	Spread (who wins and by how many points)
-	Total (total points scored by both teams)

In this project the target is the total score of an upcoming game, predicted from the teams' historical data.
### I want to predict if I should bet on Over or Under the score line!  
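
A minimal sketch of the Over/Under decision: compare a total score with the bookmaker's total line. The function name `over_under` and the `'Push'` label for an exact tie are illustrative choices, not part of the original notebook.

```python
def over_under(total_points, total_line):
    """Label a total score relative to the bookmaker's total line."""
    if total_points > total_line:
        return 'Over'
    if total_points < total_line:
        return 'Under'
    return 'Push'  # the combined score equals the line exactly


print(over_under(215, 208.5))
print(over_under(198, 208.5))
```

The same function can be applied to a predicted total (to choose a side) and to the actual total (to check whether that side won).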

# Raw Data
***
### Game Data
  - Game schedule and results for each season (Regular Season 2013 to 2017)
    - This data includes statistical data for each game 
    - Per-player statistics are aggregated into totals for each game
    - Scraped from http://www.basketball-reference.com
    
### Bet Data

  - Total points bet line for each game
    - Scraped from  http://www.oddsshark.com



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import time
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import requests

driver = webdriver.Chrome('../../../../behdad/Downloads/chromedriver')  

from scipy.stats import skew

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

# sklearn.cross_validation and sklearn.grid_search were merged into sklearn.model_selection
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression,Lasso, Ridge
from sklearn.svm import SVR,SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,AdaBoostRegressor
from sklearn.decomposition import PCA


## Scraping Code

### 1- Scraping Game Schedule and results

In [None]:
# One string per season, so that each season gets its own set of monthly URLs
year_names = ['2013', '2014', '2015', '2016', '2017']
month_names = ['october', 'november', 'december','january', 'february', 'march', 'april', 'may','june']

game_table_urls = []
for y in year_names :
    for m in month_names :
        game_table_urls.append('http://www.basketball-reference.com/leagues/NBA_'+y+'_games-'+m+'.html')

row_data = []
i = 1
for g in game_table_urls :
    print(i, g)
    i+= 1
    HTML = requests.get(g).text

    table_rows = Selector(text=HTML).xpath('//*[@id="schedule"]/tbody/tr')
    for row in table_rows:
        row_data.append(row.xpath('.//a/text()').extract() + row.xpath('.//td/text()').extract())  
        
games_data = pd.DataFrame(row_data)

games = games_data.iloc[:,:7]
games.columns = ['GameDate','GuestName','HostName', 'Remove1', 'GameTime','GuestScore', 'HostScore']
games = games[['GameDate', 'GameTime','HostName','HostScore','GuestName','GuestScore']]
# Drop incomplete rows; reassigning avoids modifying a slice of games_data in place
games = games.dropna()

team_shortname = {
    'Atlanta Hawks' : ['ATL','20734','Atlanta'] ,
    'Boston Celtics' : ['BOS','20722','Boston'],
    'Brooklyn Nets' : ['BRK','20749','Brooklyn'],
    'Chicago Bulls' : ['CHI','20732','Chicago'] ,
    'Cleveland Cavaliers' : ['CLE','20735','Cleveland'],
    'Charlotte Hornets' : ['CHO','20751','Charlotte'],
    'Dallas Mavericks' : ['DAL','20727','Dallas'],
    'Denver Nuggets' : ['DEN','20723','Denver'],
    'Detroit Pistons' : ['DET','20743','Detroit'],
    'Golden State Warriors' : ['GSW','20741','Golden State'],
    'Houston Rockets' : ['HOU','20740','Houston'],
    'Indiana Pacers' : ['IND','20737','Indiana'],
    'Los Angeles Lakers' : ['LAL','20739','LA Lakers'],
    'Los Angeles Clippers' : ['LAC','20736','LA Clippers'],
    'New York Knicks' : ['NYK','20747','New York'],
    'New Orleans Pelicans' : ['NOP','20733','New Orleans'],
    'Memphis Grizzlies' : ['MEM','20729','Memphis'],
    'Minnesota Timberwolves' : ['MIN','20744','Minnesota'],
    'Miami Heat' : ['MIA','20726','Miami'],
    'Milwaukee Bucks' : ['MIL','20725','Milwaukee'],
    'Oklahoma City Thunder' : ['OKC','20728','Oklahoma City'],
    'Orlando Magic' : ['ORL','20750','Orlando'],
    'Phoenix Suns' : ['PHO','20730','Phoenix'],
    'Portland Trail Blazers' : ['POR','20748','Portland'],
    'Philadelphia 76ers' : ['PHI','20731','Philadelphia'],
    'Sacramento Kings' : ['SAC','20745','Sacramento'],
    'San Antonio Spurs' : ['SAS','20724','San Antonio'],
    'Toronto Raptors' : ['TOR','20742','Toronto'],
    'Washington Wizards' : ['WAS','20746','Washington'],
    'Utah Jazz' : ['UTA','20738','Utah'], 
    'Charlotte Bobcats' : ['CHA','',''],
    'New Orleans Hornets' : ['NOH','',''],
    'New Jersey Nets' : ['NJN','',''] 
} 

games['host_shortname'] = games.HostName.apply(lambda x : team_shortname[x][0])

games['GameId'] = pd.to_datetime(games.GameDate).astype('str').str.replace('-','')  + '0' + games.host_shortname

games_new = pd.DataFrame(games.values,columns=games.columns)

games_new['HostScore'] = games_new['HostScore'].astype('int')
games_new['GuestScore'] = games_new['GuestScore'].astype('int')
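
The URL and `GameId` construction above can be sketched without pandas. This is a minimal illustration under stated assumptions: the helper names `schedule_urls` and `make_game_id` are hypothetical, and the date and team abbreviation in the usage lines are only example inputs.

```python
from datetime import date

MONTHS = ['october', 'november', 'december', 'january', 'february',
          'march', 'april', 'may', 'june']


def schedule_urls(seasons):
    """Build one schedule URL per (season, month) pair."""
    base = 'http://www.basketball-reference.com/leagues/NBA_{}_games-{}.html'
    return [base.format(s, m) for s in seasons for m in MONTHS]


def make_game_id(game_date, host_shortname):
    """YYYYMMDD + '0' + host abbreviation, the key used in the box-score URL."""
    return game_date.strftime('%Y%m%d') + '0' + host_shortname


urls = schedule_urls(range(2013, 2018))
print(len(urls))                                # 5 seasons x 9 months
print(make_game_id(date(2016, 10, 25), 'CLE'))
```

Building the season list with `range` makes it harder to end up with a single comma-joined string by accident.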


### 2- Scraping Game Details and statistics

In [None]:
game_detail_urls = []

main_or_exception = True

if main_or_exception :
    for g in games_new.GameId.values :
        game_detail_urls.append([g,'http://www.basketball-reference.com/boxscores/' + g + '.html'])
else :
    for e in exceptions :
        game_detail_urls.append(e)

driver = webdriver.Chrome('../../../../behdad/Downloads/chromedriver')
score_by_quarter = []
four_factors = []
basic_advanced = []
exceptions = []

i = 1

for game in game_detail_urls[438:] : 
    print(i, game[0], game[1])
    i += 1
    try : 
        driver.get(game[1])
        time.sleep(4)  # give the page time to render before reading its source
        HTML = driver.page_source

        table_rows = Selector(text=HTML).xpath('//div[@id="content"]')

        for row in table_rows:
            tempscore = row.xpath('//*[@id="line_score"]/tbody/tr/td/text()').extract()
            tempscore.append(game[0])
            score_by_quarter.append(tempscore)

            tempfactors = row.xpath('//*[@id="four_factors"]/tbody/tr/td/text()').extract()
            tempfactors.append(game[0])
            four_factors.append(tempfactors)             

            tempadvanced = row.xpath('//tfoot/tr/td/text()').extract()
            tempadvanced.append(game[0])
            basic_advanced.append(tempadvanced) 
    except Exception :  # remember the failed game so it can be retried in a second pass
        exceptions.append(game)
        time.sleep(10)               
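
The collect-failures-and-retry pattern used in the loop above can be isolated into a small helper. This is a sketch, not the notebook's code: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the Selenium calls so the example runs without a browser.

```python
import time


def fetch_all(urls, fetch, pause=0.0, retry_pause=0.0):
    """Fetch every URL; return (results, failed) so failures can be re-run."""
    results, failed = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
            time.sleep(pause)        # be polite to the server
        except Exception:            # keep going, remember what failed
            failed.append(url)
            time.sleep(retry_pause)  # back off before the next request
    return results, failed


def fake_fetch(url):
    """Stand-in for driver.get(url) + driver.page_source; fails on one URL."""
    if url.endswith('bad'):
        raise IOError('simulated timeout')
    return '<html>' + url + '</html>'


pages, failed = fetch_all(['a', 'bad', 'c'], fake_fetch)
retried, still_failed = fetch_all(failed, lambda u: 'ok')
print(sorted(pages), failed, retried, still_failed)
```

A second call with the `failed` list plays the role of the `main_or_exception = False` branch in the cell above.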

### 3- Cleaning game details (score board, four factors, basic and advanced statistical data) and merging the data together

In [None]:
for s in score_by_quarter :
    if len(s) == 9 :
        s.insert(4,None)
        s.insert(5,None)
        s.insert(6,None)
        s.insert(7,None)
        s.insert(12,None)
        s.insert(13,None)
        s.insert(14,None)
        s.insert(15,None)
    if len(s) == 11 :
        s.insert(5,None)
        s.insert(6,None)
        s.insert(7,None)
        s.insert(13,None)
        s.insert(14,None)
        s.insert(15,None)
    if len(s) == 13 :
        s.insert(6,None)
        s.insert(7,None)
        s.insert(14,None)
        s.insert(15,None)
    if len(s) == 15 :
        s.insert(7,None)
        s.insert(15,None)
score_columns = ['gq1','gq2','gq3','gq4','got1','got2','got3','got4',
                 'hq1','hq2','hq3','hq4','hot1','hot2','hot3','hot4','GameId']    
games_score_by_quarter = pd.DataFrame(score_by_quarter,columns=score_columns)
games_score_by_quarter = games_score_by_quarter.fillna(0)
games_score_by_quarter[[x for x in games_score_by_quarter.columns if x != 'GameId']] = games_score_by_quarter[[x for x in games_score_by_quarter.columns if x != 'GameId']].astype('int')
games_score_by_quarter['total_score'] = games_score_by_quarter.gq1+games_score_by_quarter.gq2+games_score_by_quarter.gq3+games_score_by_quarter.gq4+games_score_by_quarter.hq1+games_score_by_quarter.hq2+games_score_by_quarter.hq3+games_score_by_quarter.hq4


games_four_factors = pd.DataFrame(data= four_factors,columns= ['gPace','geFG%','gTOV%','gORB%','gFT/FGA','gORtg',
                                                                 'hPace','heFG%','hTOV%','hORB%','hFT/FGA','hORtg','GameId'])
                                      
games_four_factors[[x for x in games_four_factors.columns if x != 'GameId']] = games_four_factors[[x for x in games_four_factors.columns if x != 'GameId']].astype('float')

games_basic_advanced = pd.DataFrame(data= basic_advanced,columns=[
                                    'gMP1', 'gFG', 'gFGA', 'gFG%', 'g3P', 'g3PA', 'g3P%', 'gFT', 'gFTA', 'gFT%',
                                    'gORB', 'gDRB', 'gTRB', 'gAST', 'gSTL', 'gBLK', 'gTOV', 'gPF', 'gPTS', 
                                    'gMP2', 'gTS%', 'geFG%2', 'g3PAR', 'gFTr', 'gORB%2', 'gDRB%', 'gTRB%', 'gAST%',
                                    'gSTL%', 'gBLK%', 'gTOV%2', 'gUSG%', 'gORtg2', 'gDRtg', 
                                    'hMP1', 'hFG', 'hFGA', 'hFG%', 'h3P', 'h3PA', 'h3P%', 'hFT', 'hFTA', 'hFT%',
                                    'hORB', 'hDRB', 'hTRB', 'hAST', 'hSTL', 'hBLK', 'hTOV', 'hPF', 'hPTS',
                                    'hMP2', 'hTS%', 'heFG%2', 'h3PAR', 'hFTr', 'hORB%2', 'hDRB%', 'hTRB%', 'hAST%',
                                    'hSTL%', 'hBLK%', 'hTOV%2', 'hUSG%', 'hORtg2', 'hDRtg','GameId'])   

games_basic_advanced = games_basic_advanced[[
                                    'gFG', 'gFGA', 'gFG%', 'g3P', 'g3PA', 'g3P%', 'gFT', 'gFTA', 'gFT%',
                                    'gORB', 'gDRB', 'gTRB', 'gAST', 'gSTL', 'gBLK', 'gTOV', 'gPF', 'gPTS', 
                                    'gTS%', 'g3PAR', 'gFTr', 'gDRB%', 'gTRB%', 'gAST%',
                                    'gSTL%', 'gBLK%', 'gDRtg', 
                                    'hFG', 'hFGA', 'hFG%', 'h3P', 'h3PA', 'h3P%', 'hFT', 'hFTA', 'hFT%',
                                    'hORB', 'hDRB', 'hTRB', 'hAST', 'hSTL', 'hBLK', 'hTOV', 'hPF', 'hPTS',
                                    'hTS%', 'h3PAR', 'hFTr', 'hDRB%', 'hTRB%', 'hAST%',
                                    'hSTL%', 'hBLK%','hDRtg','GameId']]
games_basic_advanced[[x for x in games_basic_advanced.columns if x != 'GameId']] = games_basic_advanced[[x for x in games_basic_advanced.columns if x != 'GameId']].astype('float')

final_details = games_score_by_quarter.merge(games_four_factors,on='GameId').merge(games_basic_advanced,on='GameId')

final_data = games_new.merge(final_details,on='GameId',how='inner')
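
The line-score padding at the top of this step handles each overtime count with its own `if` block. A generic version could look like the sketch below. It assumes, as the code above does, that each scraped list holds the guest's period scores, then the host's, then the game id; the name `pad_line_score` and the sample values are made up for illustration.

```python
def pad_line_score(cells, max_ot=4):
    """Pad [guest periods..., host periods..., game_id] so that each team
    has 4 quarters plus max_ot overtime slots (missing periods become None)."""
    scores, game_id = cells[:-1], cells[-1]
    per_team = len(scores) // 2
    width = 4 + max_ot
    guest = scores[:per_team] + [None] * (width - per_team)
    host = scores[per_team:] + [None] * (width - per_team)
    return guest + host + [game_id]


regulation = ['25', '30', '28', '27', '20', '22', '31', '29', 'GAME1']
one_overtime = ['25', '30', '28', '27', '12',
                '20', '22', '31', '37', '9', 'GAME2']
print(pad_line_score(regulation))
print(pad_line_score(one_overtime))
```

The output always has 17 entries, matching the 17 names in `score_columns`.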


### 4- Scraping Bet Line Data

In [None]:
game_bets_urls = []
for k,v in team_shortname.items() :
    if len(v[1]) > 0 :
        game_bets_urls.append([k,v[0],'http://www.oddsshark.com/stats/gamelog/basketball/nba/' + v[1] + '/2012'])

bet_line_data = []
i = 1
for game in game_bets_urls : 
    print(i, game[2])
    i += 1
    driver.get(game[2])
    time.sleep(3)  # give the page time to render before reading its source
    HTML = driver.page_source
            
    table_rows = Selector(text=HTML).xpath('//*[@class="base-table"]/tbody/tr')

    for row in table_rows[0:1]:
        bet_line_data.append([game,row.xpath('//td/text()').extract()] )

data_dict = {
        'game_date' : [],
        'second_team_name' : [],
        'first_team_score' : [],
        'second_team_score' : [],
        'game_line' : [],
        'total_line' : []}

for i, item in enumerate(bet_line_data) :
    first_team_name = item[0][0]
    first_team_shortname =  item[0][1]
    
    for j, row in enumerate(item[1]) :
        if len(row) == 1 :
            num_of_games = j - 1
            break

    templist = []
    for k, value in enumerate(item[1][num_of_games:]) :
        
        #print k,value
        if (len(value) > 2) and (value != 'REG')  and (value != 'PST') :
            templist.append(value)
    
    # Each game contributes five cells: date, opponent, score, game line, total line
    for counter, cell in enumerate(templist) :
        position = counter % 5
        if position == 0 :
            data_dict['game_date'].append(cell)
        elif position == 1 :
            data_dict['second_team_name'].append(cell)
        elif position == 2 :
            # The score cell is formatted as '<first team score>-<second team score>'
            data_dict['first_team_score'].append(cell.split('-')[0])
            data_dict['second_team_score'].append(cell.split('-')[1])
        elif position == 3 :
            data_dict['game_line'].append(cell)
        else :
            data_dict['total_line'].append(cell)
            
bet_data = pd.DataFrame(data_dict)

team_fullname_by_shortname = {}
for key, value in team_shortname.items() :
    team_fullname_by_shortname[value[2]] = key
bet_data['second_team_full_name'] = bet_data.second_team_name.apply(lambda x: team_fullname_by_shortname[x[1:]])

bet_data.game_line = bet_data.game_line.apply(lambda x : 0 if x == ' Ev' else float(x))
bet_data.total_line = bet_data.total_line.apply(float)
bet_data.first_team_score = bet_data.first_team_score.apply(int)
bet_data.second_team_score = bet_data.second_team_score.apply(int)
bet_data.game_date = pd.to_datetime(bet_data.game_date)
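
The parsing above reads the scraped cells in groups of five and converts them afterwards. A compact sketch of the same idea follows. The function name `parse_bet_cells` is hypothetical, the sample cells are made-up placeholders rather than scraped values, and treating `'Ev'` as a line of 0 mirrors the conversion in the cell above.

```python
def parse_bet_cells(cells, fields_per_game=5):
    """Group a flat list of cell texts into one dict per game."""
    games = []
    for start in range(0, len(cells) - fields_per_game + 1, fields_per_game):
        game_date, opponent, score, game_line, total_line = \
            cells[start:start + fields_per_game]
        first, second = score.split('-')
        games.append({
            'game_date': game_date,
            'second_team_name': opponent,
            'first_team_score': int(first),
            'second_team_score': int(second),
            # 'Ev' (even) means no favourite, so the line is 0
            'game_line': 0.0 if game_line.strip() == 'Ev' else float(game_line),
            'total_line': float(total_line),
        })
    return games


sample = ['Jan 1, 2000', '@Team B', '100-90', '-3.5', '195.5',
          'Jan 3, 2000', ' Team C', '88-99', ' Ev', '189.0']
parsed = parse_bet_cells(sample)
print(parsed[0]['first_team_score'], parsed[1]['game_line'])
```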


### 5- Game and Bet Data Concatenation and Merging

In [None]:
games_2017 = pd.read_csv('./Data/CSV_files/Final_data_2017')
if 'Unnamed: 0' in games_2017.columns :
    del games_2017['Unnamed: 0']
games_2017.insert(0,'Season',2017)
games_2017.GameDate =  pd.to_datetime(games_2017.GameDate)

games_2016 = pd.read_csv('./Data/CSV_files/Final_data_2016')
if 'Unnamed: 0' in games_2016.columns :
    del games_2016['Unnamed: 0']
games_2016.insert(0,'Season',2016)
games_2016.GameDate =  pd.to_datetime(games_2016.GameDate)

games_2015 = pd.read_csv('./Data/CSV_files/Final_data_2015') 
if 'Unnamed: 0' in games_2015.columns :
    del games_2015['Unnamed: 0']
games_2015.insert(0,'Season',2015)
games_2015.GameDate =  pd.to_datetime(games_2015.GameDate)

games_2014 = pd.read_csv('./Data/CSV_files/Final_Data_2014') 
if 'Unnamed: 0' in games_2014.columns :
    del games_2014['Unnamed: 0']
games_2014.insert(0,'Season',2014)
games_2014.GameDate =  pd.to_datetime(games_2014.GameDate)

games_2013 = pd.read_csv('./Data/CSV_files/Final_data_2013') 
if 'Unnamed: 0' in games_2013.columns :
    del games_2013['Unnamed: 0']
games_2013.insert(0,'Season',2013)
games_2013.GameDate =  pd.to_datetime(games_2013.GameDate)

games = pd.concat([games_2013,games_2014,games_2015,games_2016,games_2017])
games = games.sort_values('GameDate').reset_index(drop=True)

In [None]:
def team_name_corrector(teamname) :
    if teamname == 'Charlotte Bobcats' :
        return 'Charlotte Hornets'
    elif teamname == 'New Orleans Hornets' :
        return 'New Orleans Pelicans'
    else :
        return teamname
games.HostName = games.HostName.apply(team_name_corrector)
games.GuestName = games.GuestName.apply(team_name_corrector)


# Use .loc so the assignment writes to games itself rather than to a temporary copy
games.loc[games.host_shortname == 'CHA', 'host_shortname'] = 'CHO'
games.loc[games.host_shortname == 'NOH', 'host_shortname'] = 'NOP'

In [None]:
bets_2017 = pd.read_csv('./Data/CSV_files/bets_2017')
if 'Unnamed: 0' in bets_2017.columns :
    del bets_2017['Unnamed: 0']
bets_2017.game_date =  pd.to_datetime(bets_2017.game_date)

bets_2016 = pd.read_csv('./Data/CSV_files/bets_2016')
if 'Unnamed: 0' in bets_2016.columns :
    del bets_2016['Unnamed: 0']
bets_2016.game_date =  pd.to_datetime(bets_2016.game_date)

bets_2015 = pd.read_csv('./Data/CSV_files/bets_2015')
if 'Unnamed: 0' in bets_2015.columns :
    del bets_2015['Unnamed: 0']
bets_2015.game_date =  pd.to_datetime(bets_2015.game_date)

bets_2014 = pd.read_csv('./Data/CSV_files/bets_2014')
if 'Unnamed: 0' in bets_2014.columns :
    del bets_2014['Unnamed: 0']
bets_2014.game_date =  pd.to_datetime(bets_2014.game_date)

bets_2013 = pd.read_csv('./Data/CSV_files/bets_2013')
if 'Unnamed: 0' in bets_2013.columns :
    del bets_2013['Unnamed: 0']
bets_2013.game_date =  pd.to_datetime(bets_2013.game_date)

bets = pd.concat([bets_2013,bets_2014,bets_2015,bets_2016,bets_2017,bets_new_games]).reset_index()
if 'index' in bets.columns :
    del bets['index']

def bet_lines(row) :
    # Look up the bet lines for this game by host team and game date
    match = bets[(bets.second_team_full_name == row['HostName']) & (bets.game_date == row['GameDate'])]
    row['total_line'] = match['total_line'].values[0]
    row['game_line'] = match['game_line'].values[0]
    return row

games = games.apply(bet_lines,axis=1)

# Feature Engineering

### 1- Creating features for representing physical abilities and motivations

In [None]:
games = pd.read_csv('./Data/CSV_files/games_bets_ready4FE')
if 'Unnamed: 0' in games.columns :
    del games['Unnamed: 0']
games.GameDate =  pd.to_datetime(games.GameDate)

games['ORtg'] = (games['hORtg'] + games['gORtg'])/2.
games['STL%'] = (games['hSTL%'] + games['gSTL%'])/2.
games['TOV%'] = (games['hTOV%'] + games['gTOV%'])/2.
games['ORB%'] = (games['hORB%'] + games['gORB%'])/2.
games['DRB%'] = (games['hDRB%'] + games['gDRB%'])/2.
games['TRB%'] = (games['hTRB%'] + games['gTRB%'])/2.
games['BLK%'] = (games['hBLK%'] + games['gBLK%'])/2.
games['PF'] = (games['hPF'] + games['gPF'])/2.
games['Pace'] = (games['hPace'] + games['gPace'])/2.
games['FGA'] = (games['hFGA'] + games['gFGA'])/2.
games['DRtg'] = (games['hDRtg'] + games['gDRtg'])/2.
games['q1'] = (games['hq1'] + games['gq1'])/2.
games['q2'] = (games['hq2'] + games['gq2'])/2.
games['q3'] = (games['hq3'] + games['gq3'])/2.
games['q4'] = (games['hq4'] + games['gq4'])/2.

def gameranker (row) :
    row['Host_HostRank'] = 1 + len(games[(games.Season == row['Season']) & (games.GameDate < row['GameDate']) & (games.HostName == row['HostName'])])
    row['Host_GameRank'] = 1 + len(games[(games.Season == row['Season']) & (games.GameDate < row['GameDate']) & ((games.HostName == row['HostName']) | (games.GuestName == row['HostName']))])
    row['Guest_GuestRank'] = 1+len(games[(games.Season == row['Season']) & (games.GameDate < row['GameDate']) & (games.GuestName == row['GuestName'])])
    row['Guest_GameRank'] = 1 +len(games[(games.Season == row['Season']) & (games.GameDate < row['GameDate']) & ((games.HostName == row['GuestName']) | (games.GuestName == row['GuestName']))])
    row['Headsup_GameRank_Season'] = 1 + len(games[(games.Season == row['Season']) & (games.GameDate < row['GameDate']) & ((games.HostName == row['HostName']) | (games.GuestName == row['HostName'])) & ((games.HostName == row['GuestName']) | (games.GuestName == row['GuestName']))])
    row['Headsup_GameRank_All'] = 1 + len(games[(games.GameDate < row['GameDate']) & ((games.HostName == row['HostName']) | (games.GuestName == row['HostName'])) & ((games.HostName == row['GuestName']) | (games.GuestName == row['GuestName']))])
    return row
games = games.apply(gameranker,axis=1)
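
`gameranker` rescans the whole table for every row. The per-team game counts can also be accumulated in a single pass over the date-sorted schedule, as in this sketch. It covers only the two overall game ranks, assumes a team plays at most one game per date, and uses the hypothetical name `game_ranks` with a toy schedule.

```python
from collections import defaultdict


def game_ranks(games):
    """games: list of (season, host, guest) tuples sorted by date.
    Returns, per game, the 1-based season game number of host and guest."""
    played = defaultdict(int)          # (season, team) -> games so far
    ranks = []
    for season, host, guest in games:
        played[(season, host)] += 1
        played[(season, guest)] += 1
        ranks.append((played[(season, host)], played[(season, guest)]))
    return ranks


schedule = [(2017, 'A', 'B'), (2017, 'C', 'A'), (2017, 'B', 'A'), (2018, 'A', 'B')]
print(game_ranks(schedule))
```

The counters are keyed by season, so ranks restart at 1 when a new season begins, just as the `Season` filter does above.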

def over_time_counter(row) :
    counter = 0
    if row['hot1'] + row['got1'] > 0 : 
        counter += 1
    if row['hot2'] + row['got2'] > 0 : 
        counter += 1        
    if row['hot3'] + row['got3'] > 0 : 
        counter += 1
    if row['hot4'] + row['got4'] > 0 : 
        counter += 1
    row['ot_counter'] = counter
    return row
games = games.apply(over_time_counter,axis=1)

def if_last_game_had_overtime(row) :
    mask1_host = games['GameDate'] < row['GameDate']
    mask2_host = (games['Season'] == row['Season'])
    mask3_host = ((games['HostName'] == row['HostName']) & (games['Host_GameRank'] == row['Host_GameRank'] -1))
    mask4_host = ((games['GuestName'] == row['HostName']) & (games['Guest_GameRank'] == row['Host_GameRank'] -1))
    temp = games.loc[mask1_host & mask2_host & (mask3_host | mask4_host), 'ot_counter']
    if len(temp) == 0 :
        row['host_lastgame_overtime'] = 0
    else :
        row['host_lastgame_overtime'] = temp.values[0]
    mask1_guest = games['GameDate'] < row['GameDate']
    mask2_guest = (games['Season'] == row['Season'])
    mask3_guest = ((games['GuestName'] == row['GuestName']) & (games['Guest_GameRank'] == row['Guest_GameRank'] -1))
    mask4_guest = ((games['HostName'] == row['GuestName']) & (games['Host_GameRank'] == row['Guest_GameRank'] -1))
    temp = games.loc[mask1_guest & mask2_guest & (mask3_guest | mask4_guest), 'ot_counter']
    if len(temp) == 0 :
        row['guest_lastgame_overtime'] = 0
    else :
        row['guest_lastgame_overtime'] = temp.values[0]
    return row
games = games.apply(if_last_game_had_overtime,axis=1)

def rehab_time_host(row) :
    mask = (((games.HostName == row['HostName']) & (games.Host_GameRank == row['Host_GameRank'] -1)) | ((games.GuestName == row['HostName']) & (games.Guest_GameRank == row['Host_GameRank'] -1))) & (games.Season==row['Season'])
    lastgame_date = games[['GameDate']][mask].max()
    lastgame_date = lastgame_date.values[0]
    thisgame_date = row['GameDate']
    if pd.isnull(lastgame_date) :
        row['Host_LastGameDiff'] = -1
    else :
        row['Host_LastGameDiff'] = (thisgame_date - lastgame_date) / np.timedelta64(1,'D')
    return row                                                                                
games = games.apply(rehab_time_host, axis =1 )

def rehab_time_guest(row) :
    mask = (((games.GuestName == row['GuestName']) & (games.Guest_GameRank == row['Guest_GameRank'] -1)) | ((games.HostName == row['GuestName']) & (games.Host_GameRank == row['Guest_GameRank'] -1))) & (games.Season==row['Season'])
    lastgame_date = games[['GameDate']][mask].max()
    lastgame_date = lastgame_date.values[0]
    thisgame_date = row['GameDate']
    if pd.isnull(lastgame_date) :
        row['Guest_LastGameDiff'] = -1
    else :
        row['Guest_LastGameDiff'] = (thisgame_date - lastgame_date) / np.timedelta64(1,'D')
    return row                                                                                
games = games.apply(rehab_time_guest, axis =1 )

def winner_loser(row) :
    if row['HostScore'] > row['GuestScore'] :
        row['winner'] = row['HostName']
        row['loser'] = row['GuestName']
    else : 
        row['winner'] = row['GuestName']
        row['loser'] = row['HostName']
    return row    
games = games.apply(winner_loser,axis=1)

def win_lose_counter(row) :
    row['host_win_count']  = len(games[(games.GameDate < row['GameDate']) & (games.Season == row['Season']) & (games.winner == row['HostName'])])
    row['host_lose_count'] = len(games[(games.GameDate < row['GameDate']) & (games.Season == row['Season']) & (games.loser == row['HostName'])])
    row['guest_win_count'] = len(games[(games.GameDate < row['GameDate']) & (games.Season == row['Season']) & (games.winner == row['GuestName'])])
    row['guest_lose_count'] = len(games[(games.GameDate < row['GameDate']) & (games.Season == row['Season']) & (games.loser == row['GuestName'])])
    return row
games = games.apply(win_lose_counter,axis=1)

def games_behind(row) :
    row['game_behind'] = ((row['host_win_count'] - row['guest_win_count']) + (row['guest_lose_count'] - row['host_lose_count']))/2.
    return row
games = games.apply(games_behind,axis=1)
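
A worked example of the `game_behind` formula, written as a standalone function. The win and loss counts are made-up numbers chosen only to show the sign convention.

```python
def games_behind(host_wins, host_losses, guest_wins, guest_losses):
    """Positive when the host has the better record, negative otherwise."""
    return ((host_wins - guest_wins) + (guest_losses - host_losses)) / 2.0


print(games_behind(30, 10, 25, 15))   # host 30-10 vs guest 25-15 -> 5.0
print(games_behind(20, 22, 24, 18))   # host 20-22 vs guest 24-18 -> -4.0
```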

def streak_host(row) :
    templist = []
    tempgames = games[['winner','loser']][(games.Season == row['Season']) & (games.GameDate < row['GameDate']) & ((games.loser == row['HostName']) | (games.winner == row['HostName']))]
    for winner in tempgames[['winner']].values :
        if winner == row['HostName'] :
            templist.append(1)
        else :
            templist.append(-1)
    row['host_strike'] = 0
    while len(templist) > 0 :
        item = templist.pop()
        if item == 1 and row['host_strike'] >= 0 :
            row['host_strike'] += item
        elif item == -1 and row['host_strike'] <= 0 :
            row['host_strike'] += item
        else : 
            break
    return row
games = games.apply(streak_host,axis=1)

def streak_guest(row) :
    templist = []
    tempgames = games[['winner','loser']][(games.Season == row['Season']) & (games.GameDate < row['GameDate']) & ((games.loser == row['GuestName']) | (games.winner == row['GuestName']))]
    for winner in tempgames[['winner']].values :
        if winner == row['GuestName'] :
            templist.append(1)
        else :
            templist.append(-1)
    row['guest_strike'] = 0
    while len(templist) > 0 :
        item = templist.pop()
        if item == 1 and row['guest_strike'] >= 0 :
            row['guest_strike'] += item
        elif item == -1 and row['guest_strike'] <= 0 :
            row['guest_strike'] += item
        else : 
            break
    return row
games = games.apply(streak_guest,axis=1)

def streak_place_host(row) :
    templist = []
    tempgames = games[['HostName','GuestName']][(games.Season == row['Season']) & (games.GameDate <= row['GameDate']) & ((games.HostName == row['HostName']) | (games.GuestName == row['HostName']))]
    for team in tempgames[['HostName']].values :
        if team == row['HostName'] :
            templist.append(1)
        else :
            templist.append(-1)
    row['host_place_streak'] = 0
    while len(templist) > 0 :
        item = templist.pop()
        if item == 1 and row['host_place_streak'] >= 0 :
            row['host_place_streak'] += item
        elif item == -1 and row['host_place_streak'] <= 0 :
            row['host_place_streak'] += item
        else : 
            break
    return row
games = games.apply(streak_place_host,axis=1)

def streak_place_guest(row) :
    templist = []
    tempgames = games[['HostName','GuestName']][(games.Season == row['Season']) & (games.GameDate <= row['GameDate']) & ((games.HostName == row['GuestName']) | (games.GuestName == row['GuestName']))]
    for team in tempgames[['GuestName']].values :
        if team == row['GuestName'] :
            templist.append(1)
        else :
            templist.append(-1)
    row['guest_place_streak'] = 0
    while len(templist) > 0 :
        item = templist.pop()
        if item == 1 and row['guest_place_streak'] >= 0 :
            row['guest_place_streak'] += item
        elif item == -1 and row['guest_place_streak'] <= 0 :
            row['guest_place_streak'] += item
        else : 
            break
    return row
games = games.apply(streak_place_guest,axis=1)
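
All four streak functions above share one core step: walk backwards through a list of +1/-1 values and stop at the first sign change. A minimal sketch of that step, with the hypothetical name `current_streak`:

```python
def current_streak(results):
    """results: chronological list of +1 / -1 values (win/loss or home/away).
    Returns +k for k trailing +1s, -k for k trailing -1s, and 0 if empty."""
    streak = 0
    for r in reversed(results):
        if streak == 0 or (r > 0) == (streak > 0):
            streak += r
        else:
            break  # the sign changed, so the streak ends here
    return streak


print(current_streak([1, -1, 1, 1, 1]))
print(current_streak([1, 1, -1, -1]))
```

With win/loss inputs this corresponds to `host_strike` and `guest_strike`; with home/away inputs it corresponds to the place streaks.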


### 2- Creating aggregated features from previous matches

In [None]:
games_avg = games[['Season','GameId','GameDate','GameTime','HostName','GuestName',
                   'total_score',
                   'total_line','game_line', 
                   'Host_HostRank','Host_GameRank','Guest_GuestRank','Guest_GameRank',
                   'Headsup_GameRank_Season','Headsup_GameRank_All', 'Host_LastGameDiff','Guest_LastGameDiff',
                   'host_win_count','host_lose_count','guest_win_count','guest_lose_count', 
                   'game_behind','host_strike','guest_strike','winner','loser',
                   'host_place_streak','guest_place_streak']]

merge_columns = [x for x in games_avg.columns if x != 'GameId']

def historyData_average_host(row) :
    

    host = games[(games['HostName'] == row['HostName']) & (games['Season'] == row['Season']) & (games['Host_HostRank'] < row['Host_HostRank']) & (games['Host_HostRank'] >= row['Host_HostRank'] - n_lastmatch)]
    final_host = host[['HostName','hq1','hq2','hq3','hq4','hPace','heFG%','hTOV%','hORB%','hFT/FGA','hORtg','hFG', 'hFGA', 'hFG%', 'h3P', 'h3PA', 'h3P%', 'hFT', 'hFTA', 'hFT%', 'hORB', 'hDRB', 'hTRB', 'hAST', 'hSTL', 'hBLK', 'hTOV', 'hPF', 'hPTS', 'hTS%', 'h3PAR', 'hFTr', 'hDRB%', 'hTRB%', 'hAST%', 'hSTL%', 'hBLK%', 'hDRtg']].groupby(['HostName']).mean().reset_index()    
    for col in final_host.columns : 
        if col != 'HostName' :
            row[col+'_avg'+ str(n_lastmatch)] = final_host[col].sum()
        
    return row



def historyData_average_guest(row) :
    guest = games[(games['GuestName'] == row['GuestName']) & (games['Season'] == row['Season']) & (games['Guest_GuestRank'] < row['Guest_GuestRank']) & (games['Guest_GuestRank'] >= row['Guest_GuestRank'] - n_lastmatch)]
    final_guest = guest[['GuestName','gq1','gq2','gq3','gq4','gPace','geFG%','gTOV%','gORB%','gFT/FGA','gORtg','gFG', 'gFGA', 'gFG%', 'g3P', 'g3PA', 'g3P%', 'gFT', 'gFTA', 'gFT%', 'gORB', 'gDRB', 'gTRB', 'gAST', 'gSTL', 'gBLK', 'gTOV', 'gPF', 'gPTS', 'gTS%', 'g3PAR', 'gFTr', 'gDRB%', 'gTRB%', 'gAST%', 'gSTL%', 'gBLK%', 'gDRtg']].groupby(['GuestName']).mean().reset_index()       
    for col in final_guest.columns : 
        if col != 'GuestName' :
            row[col+'_avg'+ str(n_lastmatch)] = final_guest[col].sum()
    return row

n_lastmatch = 10
games_avg_as_host  = games_avg[games_avg.Host_HostRank > n_lastmatch].apply(historyData_average_host, axis=1)
games_avg_as_guest  = games_avg[games_avg.Guest_GuestRank > n_lastmatch].apply(historyData_average_guest, axis=1)
games_avg_as_host = games_avg_as_host[[x for x in games_avg_as_host.columns if x not in merge_columns]]
games_avg_as_guest = games_avg_as_guest[[x for x in games_avg_as_guest.columns if x not in merge_columns]]
games_avg_as_host_or_guest = games_avg_as_host.merge(games_avg_as_guest,on='GameId',how='inner')

n_last_headsups = 3
def historyData_last_headsups(row) :
    mask1 = (games['GameDate'] < row['GameDate'])
    mask2 = ((games['HostName'] == row['HostName']) & (games['GuestName'] == row['GuestName'])) | ((games['HostName'] == row['GuestName']) & (games['GuestName'] == row['HostName'])) 
    mask3 = games['Headsup_GameRank_All'] >= row['Headsup_GameRank_All'] - n_last_headsups
    columns = ['q1','q2','q3','q4','Pace','PF','FGA','DRtg','ORtg','STL%','TOV%','ORB%','DRB%','TRB%','BLK%']
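    tempgames = games.loc[mask1 & mask2 & mask3, columns]
    temp = tempgames.mean()
    for col in columns : 
        # NOTE: the suffix uses n_lastmatch (10), not n_last_headsups (3); it is kept
        # because the saved feature columns are named '<col>_headsup10'
        if len(tempgames) > 0 : 
            row[col+'_headsup'+ str(n_lastmatch)] = temp[col]
        else : 
            row[col+'_headsup'+ str(n_lastmatch)] = 0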
    return row
games_avg_headsup = games_avg[games_avg.Headsup_GameRank_All > 1].apply(historyData_last_headsups,axis=1)
games_avg_headsup = games_avg_headsup[[x for x in games_avg_headsup.columns if x not in merge_columns]]
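
The `_avg10` features are means over the previous `n_lastmatch` games, excluding the current one, and are only computed once enough earlier games exist. The sketch below shows that windowing on a plain list. The name `trailing_mean` and the sample scores are illustrative.

```python
def trailing_mean(values, n):
    """For each position i, the mean of the n values before i.
    Positions with fewer than n earlier values get None."""
    out = []
    for i in range(len(values)):
        if i < n:
            out.append(None)
        else:
            window = values[i - n:i]   # excludes the current game
            out.append(sum(window) / float(n))
    return out


print(trailing_mean([100, 110, 120, 130], 2))
```

Excluding the current game from the window matters: including it would leak the outcome being predicted into the features.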

# Model Selection and Implementation

### 1- Load the Data from Data Engineering part

In [22]:
games_avg = pd.read_csv('./Data/CSV_files/games_avg_final02')
del games_avg['Unnamed: 0']
print(games_avg.shape)

features = games_avg[[x for x in games_avg.columns if x not in ['GameTime','GameDate','GameId']]]

print(features.shape)
features = pd.get_dummies(features)
print(features.shape)


(4174, 117)
(4174, 114)
(4174, 230)


### 2- Use LinearRegression on repeated random splits and store the coefficients

In [19]:
y = features.total_score
X = features[[x for x in features.columns if x != 'total_score']]
X = StandardScaler().fit_transform(X)

scores = []
predictions = []
coefs = []
for i in range(1000) :    
    
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33)
    lr = LinearRegression()
    lr.fit(X_train,y_train)

    scores.append(lr.score(X_test,y_test))
    predictions.append(np.mean(np.abs(lr.predict(X_test) - y_test)))
    coefs.append(lr.coef_)
# Mean R^2, then std / min / max / mean of the mean absolute error over the 1000 splits
np.mean(scores), np.std(predictions),np.min(predictions),np.max(predictions) ,np.mean(predictions)

(0.19386523872014066,
 0.32266753459835618,
 12.86213291750647,
 16.01446193751929,
 13.571801057123981)
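
The last four numbers above summarise the mean absolute error (MAE) across the random splits. The sketch below reproduces that bookkeeping in plain Python. To stay self-contained it swaps the linear regression for a trivial baseline that always predicts the training mean, so it shows the evaluation loop rather than the model. All names and the toy scores are illustrative.

```python
import random
import statistics


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / float(len(y_true))


def repeated_split_mae(y, n_repeats=100, test_size=0.33, seed=0):
    """Return (mean, std, min, max) of the MAE over random train/test splits,
    using a predict-the-training-mean baseline instead of a fitted model."""
    rng = random.Random(seed)
    n_test = int(round(test_size * len(y)))
    maes = []
    for _ in range(n_repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        baseline = statistics.mean([y[i] for i in train])
        maes.append(mean_absolute_error([y[i] for i in test], [baseline] * n_test))
    return statistics.mean(maes), statistics.pstdev(maes), min(maes), max(maes)


toy_scores = [190, 205, 212, 198, 220, 185, 201, 207, 195, 210]
summary = repeated_split_mae(toy_scores)
print(summary)
```

A baseline like this gives a reference point: a model is only useful if its MAE is clearly lower.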

### 3- Find the coefficients and select the features whose coefficient sign is the same in more than 95% of the fits

In [20]:
coef_df = pd.DataFrame(coefs,columns=[x for x in features.columns if x != 'total_score'])

coef_df = coef_df.applymap(lambda x: 1 if x >= 0 else 0)
coef_sum = coef_df.sum(0)
coef_effect = pd.DataFrame(coef_sum,columns=['count_positive']).reset_index()
coef_list = coef_effect[(coef_effect.count_positive > 950) | (coef_effect.count_positive <50)]['index']
coef_list

0                  Season
1              total_line
2               game_line
10     Guest_LastGameDiff
16            host_strike
43             hSTL_avg10
54            hSTL%_avg10
58              gq2_avg10
59              gq3_avg10
60              gq4_avg10
64            gORB%_avg10
92            gBLK%_avg10
93            gDRtg_avg10
108        BLK%_headsup10
Name: index, dtype: object
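
The selection rule in this step keeps a feature only if its coefficient rarely changes sign across the repeated fits. A plain-Python sketch of the rule, with a hypothetical function name and a tiny made-up coefficient table:

```python
def stable_sign_features(coef_runs, names, tolerance=0.05):
    """Keep features whose coefficient sign is in the minority in fewer
    than `tolerance` of the runs (coefficients >= 0 count as positive)."""
    n_runs = len(coef_runs)
    keep = []
    for j, name in enumerate(names):
        n_positive = sum(1 for run in coef_runs if run[j] >= 0)
        minority = min(n_positive, n_runs - n_positive)
        if minority < tolerance * n_runs:
            keep.append(name)
    return keep


runs = [[0.5, -1.0, 0.2],
        [0.4, -0.8, -0.1],
        [0.6, -1.2, 0.3],
        [0.5, -0.9, -0.2]]
print(stable_sign_features(runs, ['a', 'b', 'c']))
```

With 1000 runs and a tolerance of 0.05, this corresponds to the `count_positive > 950` or `count_positive < 50` condition above.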

### 4- Re-run the Linear Regression with the selected features

In [21]:
X = games_avg[coef_list]
scores = []       # R^2 on each held-out split
predictions = []  # mean absolute error (MAE) on each held-out split
coefs = []        # fitted coefficients, one row per split
for i in range(1000):
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33)
    lr = LinearRegression()
    lr.fit(X_train,y_train)
    scores.append(lr.score(X_test,y_test))
    predictions.append(np.mean(np.abs(lr.predict(X_test) - y_test)))
    coefs.append(lr.coef_)
# Output: mean R^2, then std, min, max and mean of the MAE
np.mean(scores), np.std(predictions),np.min(predictions),np.max(predictions) ,np.mean(predictions)

(0.25091038989430964,
 0.22535710813267329,
 12.355845727216591,
 13.700889160639568,
 12.998203928208261)

### 5- Grid Search for Random Forest

In [None]:
rf_predicts = []
rf_scores = []
params = {
    'n_estimators':[1500,1000,1200],
    'criterion':['mse'],
    'max_depth':[2,8,16,32,None],
    'min_samples_split':[2,8,16],
    'max_features':[None, 'sqrt']
    }

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33)

rf = RandomForestRegressor()

estimator = GridSearchCV(rf, params, cv= 4, verbose= 1)

estimator.fit(X_train,y_train)

rf_predicts = estimator.predict(X_test)
rf_scores = estimator.score(X_test,y_test)
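
Before launching the search above, it helps to count how many models it will train. The sketch below uses only the standard library. It repeats the parameter grid from the cell and the `cv=4` value passed to `GridSearchCV`; the variable names `grid`, `n_fits` and `trees_grown` are illustrative. The count excludes the final refit on the full training set that `GridSearchCV` performs by default.

```python
from itertools import product

params = {
    'n_estimators': [1500, 1000, 1200],
    'criterion': ['mse'],
    'max_depth': [2, 8, 16, 32, None],
    'min_samples_split': [2, 8, 16],
    'max_features': [None, 'sqrt'],
}
cv_folds = 4

# Expand the grid into one dict per parameter combination.
keys = sorted(params)
grid = [dict(zip(keys, values)) for values in product(*(params[k] for k in keys))]

n_candidates = len(grid)                 # 3 * 1 * 5 * 3 * 2
n_fits = n_candidates * cv_folds         # one fit per candidate per fold
trees_grown = sum(c['n_estimators'] for c in grid) * cv_folds

print('%d candidates, %d cross-validation fits, %d trees'
      % (n_candidates, n_fits, trees_grown))
# 90 candidates, 360 cross-validation fits, 444000 trees
```

Since every candidate grows at least 1000 trees, narrowing `n_estimators` to a single value during the search would cut the work to roughly a third.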

### 6- Predicting with Random Forest Regressor

In [23]:
predicts = []
scores = []

coef_list_final = list(coef_list)
coef_list_final.append('total_score')

features_train = features[features.Season.isin([2013,2014,2015,2016])]#[coef_list_final]
features_test = features[features.Season.isin([2017])]#[coef_list_final]

X_train = features_train[[x for x in features_train.columns if x != 'total_score']]
y_train = features_train.total_score
X_test = features_test[[x for x in features_test.columns if x != 'total_score']]
y_test = features_test.total_score

model = RandomForestRegressor(min_samples_split=16,max_features=None,max_depth=None,n_estimators=1500)

model.fit(X_train,y_train)
predicts = model.predict(X_test)
scores = model.score(X_test,y_test)


### 7- Putting the predictions in a data frame and creating features to evaluate the model
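
The evaluation cell in this step labels each test game as a win, a loss, or a push against the total line, then reports a "pure win" percentage computed as wins minus 1.1 times losses. The factor 1.1 is consistent with standard -110 pricing (risk 1.1 units to win 1), but the notebook does not state the odds, so treat that reading as an assumption. Below is a minimal standard-library sketch of the same bookkeeping. The helper names `bet_result`, `net_units` and `breakeven_win_rate` are hypothetical, and the counts 221, 178 and 406 are the wins, losses and test rows printed by the cell.

```python
def bet_result(y_hat, y_true, line):
    """Return 1 for a winning bet, -1 for a losing bet, 0 for a push."""
    if y_true == line:
        return 0  # push: the total landed exactly on the line
    if (y_hat > line and y_true > line) or (y_hat < line and y_true < line):
        return 1  # prediction and outcome fall on the same side of the line
    return -1


def net_units(wins, losses, risk_per_bet=1.1):
    """Net profit in units when each bet risks `risk_per_bet` to win 1."""
    return wins - risk_per_bet * losses


def breakeven_win_rate(risk_per_bet=1.1):
    """Win rate (pushes excluded) at which the expected profit is zero."""
    return risk_per_bet / (1.0 + risk_per_bet)


wins, losses, n_games = 221, 178, 406
profit = net_units(wins, losses)
print('net units: %.1f' % profit)                                      # 25.2
print('pure win percent: %.2f' % (100.0 * profit / n_games))           # 6.21
print('win rate: %.1f%%' % (100.0 * wins / (wins + losses)))           # 55.4%
print('breakeven win rate: %.1f%%' % (100.0 * breakeven_win_rate()))   # 52.4%
```

Under the assumed pricing, a strategy has to win more than about 52.4% of decided bets to be profitable, so that is the figure to compare the model's win rate against, not 50%.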

In [31]:
mydata = pd.DataFrame(X_test, columns=features.columns)
mydata['y_true'] = y_test
mydata['y_hat'] = predicts
#mydata = mydata[:-1]
team_names = games_avg[['HostName','GuestName']] #,'HostScore','GuestScore'
mydata = mydata.merge(team_names,left_index=True,right_index=True,suffixes=('',''))

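mydata['myerror'] = np.abs(mydata.y_hat - mydata.y_true)        # absolute error of the model
mydata['beterror'] = np.abs(mydata.total_line - mydata.y_true)  # absolute error of the bet line
mydata['errordiff'] = np.abs(mydata.y_hat - mydata.total_line)  # gap between the model and the bet line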

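# Label each prediction as a true/false over or under relative to the bet line.
# Note: a push (y_true == total_line) is counted as a false prediction here.
def confusion_metrix(row) :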
    
    if row['y_hat'] >= row['total_line'] :
        if row['y_true'] > row['total_line'] :
            row['true_over']  = 1
            row['false_over'] = 0
            row['true_under'] = 0
            row['false_under']= 0
        else :
            row['true_over']  = 0
            row['false_over'] = 1
            row['true_under'] = 0
            row['false_under']= 0

    if row['y_hat'] < row['total_line'] :
        if row['y_true'] < row['total_line'] :
            row['true_over']  = 0
            row['false_over'] = 0
            row['true_under'] = 1
            row['false_under']= 0
        else :
            row['true_over']  = 0
            row['false_over'] = 0
            row['true_under'] = 0
            row['false_under']= 1    
    return row

mydata = mydata.apply(confusion_metrix,axis=1)
        
def ifIWon(row) :
    if (row['y_true'] > row['total_line']) & (row['y_hat'] > row['total_line']) :
        row['win_total_line'] = 1
    elif (row['y_true'] < row['total_line']) & (row['y_hat'] < row['total_line']) :
        row['win_total_line'] = 1
    elif row['y_true'] == row['total_line'] :
        row['win_total_line'] = 0
    else :
        row['win_total_line'] = -1
    return row

mydata = mydata.apply(ifIWon, axis = 1)

def baseline(row) :
    if (row['y_true'] > row['total_line'])  :
        row['over_under'] = 1
    elif  (row['y_true'] == row['total_line']):
        row['over_under'] = 0
    else :
        row['over_under'] = -1
    return row

mydata = mydata.apply(baseline, axis = 1)

def my_over_under(row) :
    if (row['y_hat'] >= row['total_line'])  :
        row['my_over_under'] = 1
    else :
        row['my_over_under'] = 0
    return row

mydata = mydata.apply(my_over_under, axis = 1)


errordata = mydata[['Season','GuestName','HostName','y_true','total_line',
                    'y_hat','myerror','beterror','errordiff','over_under','my_over_under',
                    'win_total_line','true_over','true_under','false_over','false_under',
                    'Host_HostRank','Host_GameRank','Guest_GuestRank','Guest_GameRank',
                    'game_line','Headsup_GameRank_Season','Headsup_GameRank_All',
                    'Host_LastGameDiff','Guest_LastGameDiff','host_win_count','host_lose_count',
                    'guest_win_count','guest_lose_count','game_behind','host_strike','guest_strike',
                    'host_place_streak','guest_place_streak'
                   ]]


wheniwin = errordata[errordata.win_total_line == 1]
wheniloose = errordata[errordata.win_total_line == -1]
whenidraw = errordata[errordata.win_total_line == 0]


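error_margin = 13      # absolute-error threshold (points) separating "bad" from "good" predictions
differrormargin = 2    # gap (points) between prediction and bet line that counts as "close"

bigesterrors = errordata[errordata.myerror > error_margin]
smallestsrrors = errordata[errordata.myerror <=error_margin]
closesterrors = errordata[errordata.errordiff <=differrormargin]
# Far from the line: filter on errordiff (not myerror) so this is the complement of closesterrors
faresterrors = errordata[errordata.errordiff > differrormargin]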

print 'Train Data Shape:',X_train.shape
print 'Test Data Shape:',X_test.shape,
print 'Whole Data Shape' ,features.shape
print '---------------------------------'
print 'Regression Model score:', scores
print 'Mean Absolute Error:', errordata.myerror.mean()
print 'Bet Line Mean Absolute Error of All Games:', errordata.beterror.mean()
print '----------------------------------------'
print 'Baseline:'
print 'Total Count/Percent Of Actual Over:(Base Line)',mydata.over_under[mydata.over_under == 1].count(),'--',100.*mydata.over_under[mydata.over_under == 1].count()/mydata.over_under.count()
print 'Total Count/Percent Of Actual Under:(Base Line)',mydata.over_under[mydata.over_under == -1].count(),'--',100.*mydata.over_under[mydata.over_under == -1].count()/mydata.over_under.count()
print 'Total Count/Percent Of Actual Tie:(Base Line)',mydata.over_under[mydata.over_under == 0].count(),'--',100.*mydata.over_under[mydata.over_under == 0].count()/mydata.over_under.count()
print '----------------------------------------'
print 'Confusion Matrix:'
print 'True Over:',mydata.true_over.sum(),'False Over:',mydata.false_over.sum()
print 'False Under:',mydata.false_under.sum(),'True Under:',mydata.true_under.sum()
print 
print 'Total Predicted Over:',mydata.true_over.sum()+mydata.false_over.sum(),
print '-- Over Accuracy:',100.*mydata.true_over.sum()/(mydata.true_over.sum() + mydata.false_over.sum())
print 'Total Predicted Under:',mydata.true_under.sum()+mydata.false_under.sum(),
print '-- Under Accuracy:',100.*mydata.true_under.sum()/(mydata.true_under.sum() + mydata.false_under.sum())
print '****************************************'
print '****************************************'
print 'Number and percent of games that i win::',errordata[errordata.win_total_line == 1]['win_total_line'].count(),' -- ',
print 100.0*errordata[errordata.win_total_line == 1]['win_total_line'].count() /errordata.win_total_line.count() 

print 'Number and percent of games that i lose:',errordata[errordata.win_total_line == -1]['win_total_line'].count(),' -- ',
print 100.0*errordata[errordata.win_total_line == -1]['win_total_line'].count() /errordata.win_total_line.count() 

print 'Number and percent of games that i draw:',errordata[errordata.win_total_line == 0]['win_total_line'].count(),' -- ',
print 100.0*errordata[errordata.win_total_line == 0]['win_total_line'].count() /errordata.win_total_line.count() 

print 'percent of pure win with bet on over/under:',100*(wheniwin.shape[0] - 1.1*wheniloose.shape[0])/X_test.shape[0],'%'
print '****************************************'
print '****************************************'
print
print '% of win, When i predict bad:',100.0*bigesterrors[bigesterrors.win_total_line == 1]['win_total_line'].sum() /bigesterrors.win_total_line.count() 
print 'Mean Residials :',bigesterrors.myerror.mean()
print 'Bet Mean Residials :',bigesterrors.beterror.mean()
print 'number of bad predicted games:',bigesterrors.shape[0]
print '---------------------------------'
print '% of win, When i predict good:',100.0*smallestsrrors[smallestsrrors.win_total_line == 1]['win_total_line'].sum() /smallestsrrors.win_total_line.count() 
print 'Mean Residials :',smallestsrrors.myerror.mean()
print 'Bet Mean Residials :',smallestsrrors.beterror.mean() 
print 'number of good predicted games:', smallestsrrors.shape[0] 
print '---------------------------------'
print '---------------------------------'
print '% of win, When i predict close to the line:',100.0*closesterrors[closesterrors.win_total_line == 1]['win_total_line'].sum() /closesterrors.win_total_line.count() 
print 'Mean Residials :',closesterrors.myerror.mean()
print 'Bet Mean Residials :',closesterrors.beterror.mean()
print 'number of games predicted close to the line:',closesterrors.shape[0]
print '----------------------------------'
print '% of win, When i predict far from the line:',100.0*faresterrors[faresterrors.win_total_line == 1]['win_total_line'].sum() /faresterrors.win_total_line.count() 
print 'Mean Residials :',faresterrors.myerror.mean()
print 'Bet Mean Residials :',faresterrors.beterror.mean() 
print 'number of games predicted far from the line:', faresterrors.shape[0] 


Train Data Shape: (3768, 229)
Test Data Shape: (406, 229) Whole Data Shape (4174, 230)
---------------------------------
Regression Model score: 0.113109595304
Mean Absolute Error: 13.167945692
Bet Line Mean Absolute Error of All Games: 12.9148148148
----------------------------------------
Baseline:
Total Count/Percent Of Actual Over:(Base Line) 193 -- 47.6543209877
Total Count/Percent Of Actual Under:(Base Line) 206 -- 50.8641975309
Total Count/Percent Of Actual Tie:(Base Line) 6 -- 1.48148148148
----------------------------------------
Confusion Matrix:
True Over: 59 False Over: 45
False Under: 139 True Under: 162

Total Predicted Over: 104 -- Over Accuracy: 56.7307692308
Total Predicted Under: 301 -- Under Accuracy: 53.8205980066
****************************************
****************************************
Number and percent of games that i win:: 221  --  54.5679012346
Number and percent of games that i lose: 178  --  43.950617284
Number and percent of games that i draw: 6  --