![](https://i.imgur.com/oRgtzqR.jpg)
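# 2016 MLB reference
    Use baseball game statistics to predict whether the home team wins or loses.
    Of roughly 2,500 games, the first 2,000 are used for training and the remaining games for prediction, to check the accuracy.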
## dataset
[2016 MLB](https://www.kaggle.com/cyaris/2016-mlb-season)
    

In [107]:
import pandas as pd
from io import StringIO  # replaces sklearn.externals.six.StringIO, which newer scikit-learn no longer ships
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import pydotplus

path = './baseball_reference_2016_clean.csv'
pd_data = pd.read_csv(path)

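## drop unused data
    The two teams' run totals are deliberately dropped as well; otherwise the tree would split on nothing but the runs scored.
    Remaining columns, left to right: away team errors, away team hits, home team errors, home team hits, game result.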

In [108]:
pd_data = pd_data.drop('Unnamed: 0', axis = 1)
pd_data = pd_data.drop('attendance', axis = 1)
pd_data = pd_data.drop('away_team', axis = 1)
pd_data = pd_data.drop(['date', 'field_type', 'game_type', 'home_team'], axis = 1)
pd_data = pd_data.drop(['start_time', 'venue', 'day_of_week', 'temperature'], axis = 1)
pd_data = pd_data.drop(['wind_speed', 'wind_direction', 'sky', 'total_runs'], axis = 1)
pd_data = pd_data.drop(['game_hours_dec', 'season', 'home_team_loss', 'home_team_outcome'], axis = 1)
pd_data = pd_data.drop(['away_team_runs', 'home_team_runs'], axis = 1)
pd_data.head()

|   | away_team_errors | away_team_hits | home_team_errors | home_team_hits | home_team_win |
|---|---|---|---|---|---|
| 0 | 1 | 7 | 0 | 9 | 1 |
| 1 | 0 | 5 | 0 | 8 | 1 |
| 2 | 0 | 5 | 0 | 9 | 1 |
| 3 | 0 | 8 | 1 | 8 | 0 |
| 4 | 1 | 8 | 0 | 8 | 0 |
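
Since only five columns survive, the same frame could also be built by selecting them directly instead of listing everything to drop. A minimal sketch on a made-up two-row frame (the `raw` frame and its values are illustrative stand-ins, not taken from the dataset):

```python
import pandas as pd

# Columns kept for the model; names match the notebook's output above.
keep = ['away_team_errors', 'away_team_hits',
        'home_team_errors', 'home_team_hits', 'home_team_win']

# Toy stand-in for the raw CSV (values are made up for illustration).
raw = pd.DataFrame({
    'attendance': [40000, 35000],
    'away_team_errors': [1, 0],
    'away_team_hits': [7, 5],
    'away_team_runs': [3, 2],
    'home_team_errors': [0, 0],
    'home_team_hits': [9, 8],
    'home_team_runs': [4, 6],
    'home_team_win': [1, 1],
})

# Selecting by name keeps the listed columns in the listed order.
selected = raw[keep]
print(list(selected.columns))
```

Selecting by name also fails loudly with a `KeyError` if a column is misspelled, which a long chain of drops would report only one call at a time.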


In [109]:
# training data
train = pd_data[:2000]
train = train.drop('home_team_win', axis = 1)
# test data
test = pd_data[2000:]
test_ans = test['home_team_win'].values
test = test.drop('home_team_win', axis = 1)
train.head()

|   | away_team_errors | away_team_hits | home_team_errors | home_team_hits |
|---|---|---|---|---|
| 0 | 1 | 7 | 0 | 9 |
| 1 | 0 | 5 | 0 | 8 |
| 2 | 0 | 5 | 0 | 9 |
| 3 | 0 | 8 | 1 | 8 |
| 4 | 1 | 8 | 0 | 8 |


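## design my tree rules
    Take the hits and errors of the home and away teams, and additionally compute the quartiles of hits for comparison.
    Home and away hits: mean about 8.7, 75th percentile 11, 25th percentile 6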
![](https://i.imgur.com/ONhcCdh.jpg)

In [110]:
import numpy as np

# mean errors and hits for each side
ate = train['away_team_errors'].mean()
hte = train['home_team_errors'].mean()
ath = train['away_team_hits'].mean()
hth = train['home_team_hits'].mean()
print(ate, hte, ath, hth)

# upper and lower quartiles of hits
ath_75 = np.percentile(train['away_team_hits'], 75)
hth_75 = np.percentile(train['home_team_hits'], 75)
ath_25 = np.percentile(train['away_team_hits'], 25)
hth_25 = np.percentile(train['home_team_hits'], 25)
print(ath_75, hth_75, ath_25, hth_25)

def my_rule(row):
    """Hand-written rule: return 1 if the home team is predicted to win, else 0."""
    h_hits, a_hits = row['home_team_hits'], row['away_team_hits']
    h_err, a_err = row['home_team_errors'], row['away_team_errors']
    if h_hits > hth:
        if a_hits > ath:
            # both teams hit above average
            if h_hits > hth_75:
                return 1
            return 0 if a_hits > ath_75 else 1
        # only the home team hits above average
        if h_err >= 2:
            return 1 if h_hits > hth_75 else 0
        return 1
    if a_hits > ath:
        # only the away team hits above average
        if a_hits > ath_75:
            return 0
        return 1 if a_err >= 3 else 0
    # both teams hit at or below average
    if h_hits > hth_25:
        return 0 if h_err >= 2 else 1
    return 0 if a_hits > ath_25 else 1

# apply the rule row by row (avoids chained assignment such as train['col'][i] = ...)
train['predict_result'] = train.apply(my_rule, axis = 1)

ans = train['predict_result'].values
print(train.head(6))
print(ans)

0.577 0.592 8.753 8.7005
11.0 11.0 6.0 6.0
   away_team_errors  away_team_hits  home_team_errors  home_team_hits  \
0                 1               7                 0               9   
1                 0               5                 0               8   
2                 0               5                 0               9   
3                 0               8                 1               8   
4                 1               8                 0               8   
5                 1              11                 1               7   

   predict_result  
0               1  
1               1  
2               1  
3               1  
4               1  
5               0  
[1 1 1 ... 0 0 1]
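
Before fitting a tree to these rule outputs, it can help to see how often the hand-written rule agrees with the real results. The sketch below uses only the first five games shown earlier in this notebook: `rule_pred` is their `predict_result` and `true_result` is their `home_team_win`. Both variable names are introduced here for illustration, and five rows say nothing about the rule's accuracy on the full training set.

```python
# First five games from the tables above.
rule_pred = [1, 1, 1, 1, 1]     # predict_result from the hand-written rule
true_result = [1, 1, 1, 0, 0]   # home_team_win from the cleaned data

# Count the games where the rule matches the real outcome.
matches = sum(p == t for p, t in zip(rule_pred, true_result))
rule_accuracy = matches / len(true_result)
print(matches, rule_accuracy)
```

Running the same comparison over all 2,000 training rows would require saving `home_team_win` before it is dropped from `train`.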


## build my tree

In [111]:
# fit the tree on the labels produced by the hand-written rule (ans), not on the real outcomes
train = train.drop('predict_result', axis = 1)
dtree = DecisionTreeClassifier(max_depth = 4)
dtree.fit(train, ans)

# export the fitted tree to Graphviz dot format and render it as a PDF
dot_data = StringIO()
export_graphviz(dtree,
                out_file = dot_data,
                filled = True,
                feature_names = list(train),
                class_names = ['lose','win'],
                special_characters = True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("tree.pdf")

True

## my decision tree
![](https://i.imgur.com/QPksa8j.png)
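
If Graphviz is not available, scikit-learn's `export_text` can print the same kind of split rules as plain text. This is a minimal sketch on made-up toy data, not the tree fitted in this notebook; `toy_tree`, `X` and `y` are illustrative names.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up): columns are [away_team_hits, home_team_hits].
X = [[5, 9], [6, 10], [11, 6], [12, 7], [8, 12], [10, 5]]
y = [1, 1, 0, 0, 1, 0]   # 1 = home win, 0 = home loss

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X, y)

# export_text writes each split as "feature <= threshold" / "feature > threshold".
rules = export_text(toy_tree, feature_names=['away_team_hits', 'home_team_hits'])
print(rules)
```

The printed rules use `<=` for the first branch of every split, the same convention as the rendered tree image.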

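## Discussion
    The comparison direction in this decision tree is the opposite of my hand-written rules (the tree tests `<=` where I test `>`), so the left and right branches are mirrored.
    Few games fall into nodes with a high Gini value; most games are separated cleanly.
    At the second level, the tree splits on away-team hits directly at the 75th and 25th percentiles rather than at the midpoint (the mean) that I chose, which saves some redundant checks further down.
    Hits are the most decisive factor: when the two teams' hit counts differ noticeably, the outcome is easy to predict.
    When the hit counts are close, the number of errors can be used to decide the outcome.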
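
For reference, the Gini value shown in each node is the impurity 1 − Σ p_k², where p_k is the fraction of samples in class k. It is 0 for a node holding a single class and 0.5 for an even two-class split. A small sketch with made-up class counts (the `gini` helper is written here for illustration and is not part of scikit-learn's public API):

```python
def gini(counts):
    """Gini impurity of a node given per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([50, 50]))   # evenly mixed node
print(gini([90, 10]))   # mostly one class
print(gini([100, 0]))   # pure node
```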
    
    


In [112]:
dtree.feature_importances_

array([0.00611539, 0.51228585, 0.02232171, 0.45927705])
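
The array follows the order of the training columns, so the values can be paired with their names. A small sketch using the numbers printed above (the variable names are illustrative):

```python
# Feature order as passed to fit(), and the importances printed above.
features = ['away_team_errors', 'away_team_hits',
            'home_team_errors', 'home_team_hits']
importances = [0.00611539, 0.51228585, 0.02232171, 0.45927705]

# Sort from most to least important.
ranked = sorted(zip(features, importances), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")

# Combined share of the two hit-count features.
hits_share = importances[1] + importances[3]
print(f"hits combined: {hits_share:.3f}")
```

The two hit counts together account for about 97% of the total importance, which is consistent with the observation that hits dominate and errors only matter at the margins.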

In [113]:
from sklearn.metrics import accuracy_score

# predict the held-out games and compare against the real outcomes
predict = dtree.predict(test)
accuracy_score(test_ans, predict)

0.7624190064794817

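## **Prediction accuracy is about 0.76**
    Baseball outcomes are volatile: even a team that out-hits its opponent does not always win (no hits in key moments, many walks, and so on).
    In addition, a sacrifice hit can drive in a run (an RBI) without being recorded as a hit, so a more accurate prediction would likely need more data.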