The data comes from [here](http://www.tennis-data.co.uk/alldata.php).

A description of the data is available [here](http://www.tennis-data.co.uk/notes.txt).

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

In [3]:
files = os.listdir('./data')

In [4]:
files

['2019.csv', '2017.csv', '2018.csv', '2016.csv', '2015.csv']

In [5]:
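# Start from the last listed file, then append the others in reverse listing order.
# Note: os.listdir returns names in arbitrary order; with the listing above this
# reads 2015, 2016, 2018, 2017, 2019, so the rows are not strictly chronological.
df = pd.read_csv('data/'+files[-1])
for file in files[-2::-1]:
    df = pd.concat((df, pd.read_csv('data/'+file)), axis=0, sort=False)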
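
Since `os.listdir` gives no ordering guarantee, a chronological load may be safer for features that depend on match order. Below is a minimal sketch, assuming every file in the directory is named `<year>.csv`; the helper name `load_seasons` is hypothetical and not part of the original notebook.

```python
import os
import pandas as pd

def load_seasons(data_dir):
    """Read every <year>.csv in data_dir and stack them oldest first."""
    # Sorting the names sorts the seasons, since each file is named by its year.
    names = sorted(f for f in os.listdir(data_dir) if f.endswith('.csv'))
    frames = [pd.read_csv(os.path.join(data_dir, name)) for name in names]
    # ignore_index gives the combined frame one continuous 0..n-1 index.
    return pd.concat(frames, axis=0, sort=False, ignore_index=True)
```

It would be called as `df = load_seasons('./data')`. Note that `ignore_index=True` replaces the per-file row labels, so the index values shown in this notebook's tables would differ.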

Let's try to predict who wins the most recent match in the data. To do this, we build a separate model for each player.

In [6]:
df.iloc[-1]

ATP                        64
Location               Vienna
Tournament    Erste Bank Open
Date               27.10.2019
Series                 ATP500
Court                  Indoor
Surface                  Hard
Round               The Final
Best of                     3
Winner               Thiem D.
Loser          Schwartzman D.
WRank                       5
LRank                      15
WPts                     5085
LPts                     1950
W1                          3
L1                          6
W2                          6
L2                          4
W3                          6
L3                          3
W4                        NaN
L4                        NaN
W5                        NaN
L5                        NaN
Wsets                       2
Lsets                       1
Comment             Completed
B365W                    1.28
B365L                    3.75
EXW                       NaN
EXL                       NaN
LBW                       NaN
LBL       

At the start of the match, the following data is available to us:

1. Player names: Thiem D. and Schwartzman D.
2. Match conditions: Court and Surface
3. Their rankings at the start of the tournament: WRank, LRank, WPts, LPts
4. Odds on each player from different bookmakers:
    - Bet365 (B365W, B365L)
    - Pinnacle Sports (PSW, PSL)
    - Maximum odds (MaxW, MaxL)
    - Average odds (AvgW, AvgL)

We can also average the number of games won in each set (W1-W5, L1-L5) and the number of sets won (Wsets, Lsets) over the previous 5 matches, along with the share of wins, and train the models on this data.
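
The "previous 5 matches" averaging can be sketched with pandas as a shifted rolling mean. This is an illustration on made-up numbers, not the loop used later in the notebook; the column name `Sets` is an assumption standing in for one player's per-match values.

```python
import pandas as pd

# Toy per-match values for one player, oldest match first (made-up numbers).
matches = pd.DataFrame({'Sets': [2, 1, 2, 0, 2, 2, 1]})

# shift(1) drops the current match from its own window, so each row only
# sees matches that were already finished; rolling(5) then averages them.
matches['AvgSets'] = matches['Sets'].shift(1).rolling(5).mean()
print(matches)
```

Rows with fewer than 5 earlier matches get `NaN`, which makes it explicit that no history-based feature exists for them yet.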

Keep only completed matches.

In [7]:
df = df[df['Comment']=='Completed']

### Thiem D.

In [8]:
# All matches played by Thiem D.
thiem = df[np.logical_or(df['Winner']=='Thiem D.', df['Loser']=='Thiem D.')]

thiem_train_data = pd.DataFrame(columns=['Date', 'Court', 'Surface', 'Rank', 'Pts', 
                                        'Avg1', 'Avg2', 'Avg3', 'Avg4', 'Avg5', 'AvgSets',
                                        'B365', 'PS', 'Max', 'Avg', 'WinAvg', 'Win'])

thiem_train_data[['Date', 'Court', 'Surface']] = thiem[['Date', 'Court', 'Surface']]

thiem_train_data['Rank'] = thiem.apply(lambda x: x['WRank'] if x['Winner'] == 'Thiem D.' else x['LRank'], axis=1)

thiem_train_data['Pts'] = thiem.apply(lambda x: x['WPts'] if x['Winner'] == 'Thiem D.' else x['LPts'],
                                      axis=1)

for i in range(1,6):
    thiem_train_data['Avg'+str(i)] = \
    thiem.apply(lambda x: x['W'+str(i)] if x['Winner'] == 'Thiem D.' \
                else x['L'+str(i)],
                axis=1)

thiem_train_data['AvgSets'] = \
    thiem.apply(lambda x: x['Wsets'] if x['Winner'] == 'Thiem D.' \
                else x['Lsets'],
                axis=1)

thiem_train_data['B365'] = \
    thiem.apply(lambda x: x['B365W'] if x['Winner'] == 'Thiem D.' \
                else x['B365L'],
                axis=1)

thiem_train_data['PS'] = \
    thiem.apply(lambda x: x['PSW'] if x['Winner'] == 'Thiem D.' \
                else x['PSL'],
                axis=1)

thiem_train_data['Max'] = \
    thiem.apply(lambda x: x['MaxW'] if x['Winner'] == 'Thiem D.' \
                else x['MaxL'],
                axis=1)

thiem_train_data['Avg'] = \
    thiem.apply(lambda x: float(x['AvgW']) if x['Winner'] == 'Thiem D.' \
                else float(x['AvgL']),
                axis=1)

thiem_train_data['Win'] = thiem.apply(lambda x: int(x['Winner'] == 'Thiem D.'),
                axis=1)

thiem_train_data.head()

Unnamed: 0,Date,Court,Surface,Rank,Pts,Avg1,Avg2,Avg3,Avg4,Avg5,AvgSets,B365,PS,Max,Avg,WinAvg,Win
86,12.01.2015,Outdoor,Hard,39.0,977.0,7.0,4.0,3.0,,,1.0,1.61,1.68,1.68,1.61,,0
186,20.01.2015,Outdoor,Hard,40.0,977.0,6.0,2.0,3.0,6.0,,1.0,2.75,2.86,3.0,2.75,,0
376,09.02.2015,Indoor,Hard,47.0,917.0,6.0,6.0,,,,2.0,2.37,2.3,2.75,2.5,,1
390,11.02.2015,Indoor,Hard,47.0,917.0,1.0,3.0,,,,0.0,1.4,1.52,1.56,1.46,,0
465,17.02.2015,Indoor,Hard,48.0,897.0,7.0,6.0,,,,2.0,1.66,1.68,1.8,1.67,,1
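
The row-wise `apply` calls above all follow one pattern: take the winner-side column when the player won, otherwise the loser-side column. A possible vectorized sketch of that pattern is shown below; the helper `player_view` and the toy frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def player_view(matches, name, pairs):
    """Return one row per match with the named player's own values.

    pairs maps an output column to its (winner column, loser column).
    """
    played = matches[(matches['Winner'] == name) | (matches['Loser'] == name)]
    won = (played['Winner'] == name).to_numpy()
    out = pd.DataFrame(index=played.index)
    for col, (w_col, l_col) in pairs.items():
        # Take the winner-side value when the player won, else the loser-side one.
        out[col] = np.where(won, played[w_col], played[l_col])
    out['Win'] = won.astype(int)
    return out

# Toy data with made-up opponents and rankings.
toy = pd.DataFrame({
    'Winner': ['Thiem D.', 'Other A.', 'Other B.'],
    'Loser': ['Other A.', 'Thiem D.', 'Other C.'],
    'WRank': [5, 20, 30],
    'LRank': [40, 6, 50],
})
view = player_view(toy, 'Thiem D.', {'Rank': ('WRank', 'LRank')})
print(view)
```

Because the player's name is a parameter, the same function could serve both players instead of two copies of the same cell.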


In [9]:
thiem_train_data.fillna(0, inplace=True)

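# Columns 5-10 (Avg1..Avg5, AvgSets) are replaced by the mean over the previous 5 matches;
# WinAvg (column -2) gets the share of wins (column -1) over the same 5 matches.
# Going backwards keeps earlier rows raw while they are still needed; rows 0-5 stay raw.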
for i in range(len(thiem_train_data)-1, 5, -1):
    for j in range(5,11):
        thiem_train_data.iloc[i,j] = np.mean(thiem_train_data.iloc[i-5:i,j])
    thiem_train_data.iloc[i,-2] = np.mean(thiem_train_data.iloc[i-5:i,-1])

thiem_train_data.head()

Unnamed: 0,Date,Court,Surface,Rank,Pts,Avg1,Avg2,Avg3,Avg4,Avg5,AvgSets,B365,PS,Max,Avg,WinAvg,Win
86,12.01.2015,Outdoor,Hard,39.0,977.0,7.0,4.0,3.0,0.0,0.0,1.0,1.61,1.68,1.68,1.61,0.0,0
186,20.01.2015,Outdoor,Hard,40.0,977.0,6.0,2.0,3.0,6.0,0.0,1.0,2.75,2.86,3.0,2.75,0.0,0
376,09.02.2015,Indoor,Hard,47.0,917.0,6.0,6.0,0.0,0.0,0.0,2.0,2.37,2.3,2.75,2.5,0.0,1
390,11.02.2015,Indoor,Hard,47.0,917.0,1.0,3.0,0.0,0.0,0.0,0.0,1.4,1.52,1.56,1.46,0.0,0
465,17.02.2015,Indoor,Hard,48.0,897.0,7.0,6.0,0.0,0.0,0.0,2.0,1.66,1.68,1.8,1.67,0.0,1


In [10]:
thiem_train_data = pd.concat((thiem_train_data,
pd.get_dummies(thiem_train_data[['Court', 'Surface']], prefix=['Court', 'Surface'])), axis=1)

In [11]:
thiem_train = thiem_train_data.iloc[:300]
thiem_eval = thiem_train_data.iloc[300:]

In [13]:
model_thiem = LogisticRegression(C=0.1, random_state=42, solver='liblinear', penalty='l1')

X_train = thiem_train[['Rank', 'Pts', 'Avg1', 'Avg2', 'Avg3',
       'Avg4', 'Avg5', 'AvgSets', 'B365', 'PS', 'Max', 'Avg', 'WinAvg',
       'Court_Indoor', 'Court_Outdoor', 'Surface_Clay', 'Surface_Grass',
       'Surface_Hard']]
Y_train = thiem_train['Win']

model_thiem.fit(X_train, Y_train)

print(cross_val_score(model_thiem, X_train, Y_train, scoring='roc_auc', cv=5).mean())

0.6736883825417203


In [14]:
X_test = thiem_eval[['Rank', 'Pts', 'Avg1', 'Avg2', 'Avg3',
       'Avg4', 'Avg5', 'AvgSets', 'B365', 'PS', 'Max', 'Avg', 'WinAvg',
       'Court_Indoor', 'Court_Outdoor', 'Surface_Clay', 'Surface_Grass',
       'Surface_Hard']]
Y_test = thiem_eval['Win']

In [15]:
Y_hat_test = model_thiem.predict(X_test)

In [16]:
f1_score(Y_test, Y_hat_test)

0.8750000000000001
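
An F1 score is easier to interpret next to a trivial baseline. The sketch below, on made-up labels, shows the F1 of a predictor that always says "win": its precision equals the share of wins `p` and its recall is 1, so F1 is `2p / (1 + p)`. This check is not part of the original analysis.

```python
from sklearn.metrics import f1_score

# Made-up labels: the player wins 6 of 8 held-out matches.
y_true = [1, 1, 0, 1, 1, 1, 0, 1]
# A baseline that ignores the features and always predicts a win.
y_baseline = [1] * len(y_true)

p = sum(y_true) / len(y_true)                 # share of wins
baseline_f1 = f1_score(y_true, y_baseline)    # F1 of the constant prediction
closed_form = 2 * p / (1 + p)                 # precision = p, recall = 1
print(baseline_f1, closed_form)
```

Applying the same comparison to `Y_test` would show how much of the score comes from the player's win rate alone.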

The model's quality is not terrible, so let's look at the probability that Thiem D. wins the last match.

In [17]:
X_test.iloc[-1].values.reshape(1,-1)

array([[5.000e+00, 5.085e+03, 4.800e+00, 6.000e+00, 2.400e+00, 0.000e+00,
        0.000e+00, 1.600e+00, 1.280e+00, 1.320e+00, 1.330e+00, 1.290e+00,
        8.000e-01, 1.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.000e+00]])

In [19]:
model_thiem.predict_proba(X_test.iloc[-1].values.reshape(1,-1))

array([[0.22555276, 0.77444724]])
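
The columns of `predict_proba` follow the order of `model.classes_`, so with labels 0 and 1 the second number is the win probability. A small sketch on toy data (the feature values are made up) shows how to pick that column explicitly rather than by position:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature problem: larger values mean a win (label 1).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression(solver='liblinear').fit(X, y)

proba = model.predict_proba(np.array([[2.5]]))
# Look up which column belongs to class 1 instead of assuming it is the last one.
win_col = list(model.classes_).index(1)
p_win = proba[0, win_col]
print(model.classes_, p_win)
```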

In [21]:
model_thiem.predict(X_test.iloc[-1].values.reshape(1,-1))

array([1])

As we can see, the model predicts a win for Thiem D. with a probability of about 0.77.

### Schwartzman D.

In [22]:
# All matches played by Schwartzman D.
schwartzman = df[np.logical_or(df['Winner']=='Schwartzman D.', df['Loser']=='Schwartzman D.')]

In [23]:
schwartzman_train_data = pd.DataFrame(columns=['Date', 'Court', 'Surface', 'Rank', 'Pts', 
                                        'Avg1', 'Avg2', 'Avg3', 'Avg4', 'Avg5', 'AvgSets',
                                        'B365', 'PS', 'Max', 'Avg', 'WinAvg', 'Win'])

In [24]:
schwartzman_train_data[['Date', 'Court', 'Surface']] = schwartzman[['Date', 'Court', 'Surface']]

schwartzman_train_data['Rank'] = schwartzman.apply(lambda x: x['WRank'] if x['Winner'] == 'Schwartzman D.' else x['LRank'], axis=1)

schwartzman_train_data['Pts'] = schwartzman.apply(lambda x: x['WPts'] if x['Winner'] == 'Schwartzman D.' else x['LPts'],
                                      axis=1)

In [25]:
for i in range(1,6):
    schwartzman_train_data['Avg'+str(i)] = \
    schwartzman.apply(lambda x: x['W'+str(i)] if x['Winner'] == 'Schwartzman D.' \
                else x['L'+str(i)],
                axis=1)

schwartzman_train_data['AvgSets'] = \
    schwartzman.apply(lambda x: x['Wsets'] if x['Winner'] == 'Schwartzman D.' \
                else x['Lsets'],
                axis=1)

schwartzman_train_data['B365'] = \
    schwartzman.apply(lambda x: x['B365W'] if x['Winner'] == 'Schwartzman D.' \
                else x['B365L'],
                axis=1)

schwartzman_train_data['PS'] = \
    schwartzman.apply(lambda x: x['PSW'] if x['Winner'] == 'Schwartzman D.' \
                else x['PSL'],
                axis=1)

schwartzman_train_data['Max'] = \
    schwartzman.apply(lambda x: x['MaxW'] if x['Winner'] == 'Schwartzman D.' \
                else x['MaxL'],
                axis=1)

schwartzman_train_data['Avg'] = \
    schwartzman.apply(lambda x: float(x['AvgW']) if x['Winner'] == 'Schwartzman D.' \
                else float(x['AvgL']),
                axis=1)

schwartzman_train_data['Win'] = schwartzman.apply(lambda x: int(x['Winner'] == 'Schwartzman D.'),
                axis=1)

schwartzman_train_data.head()

Unnamed: 0,Date,Court,Surface,Rank,Pts,Avg1,Avg2,Avg3,Avg4,Avg5,AvgSets,B365,PS,Max,Avg,WinAvg,Win
93,13.01.2015,Outdoor,Hard,60.0,775.0,6.0,4.0,6.0,,,2.0,4.0,3.78,4.05,3.62,,1
101,14.01.2015,Outdoor,Hard,60.0,775.0,6.0,4.0,1.0,,,1.0,3.25,3.41,3.41,3.18,,0
141,19.01.2015,Outdoor,Hard,59.0,789.0,6.0,5.0,7.0,4.0,,1.0,7.0,7.64,7.64,6.33,,0
409,10.02.2015,Indoor,Clay,66.0,770.0,7.0,6.0,,,,2.0,1.66,1.76,1.77,1.69,,1
421,12.02.2015,Indoor,Clay,66.0,770.0,3.0,6.0,2.0,,,1.0,2.1,2.18,2.3,2.13,,0


In [26]:
schwartzman_train_data.fillna(0, inplace=True)

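# Columns 5-10 (Avg1..Avg5, AvgSets) are replaced by the mean over the previous 5 matches;
# WinAvg (column -2) gets the share of wins (column -1) over the same 5 matches.
# Going backwards keeps earlier rows raw while they are still needed; rows 0-5 stay raw.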
for i in range(len(schwartzman_train_data)-1, 5, -1):
    for j in range(5,11):
        schwartzman_train_data.iloc[i,j] = np.mean(schwartzman_train_data.iloc[i-5:i,j])
    schwartzman_train_data.iloc[i,-2] = np.mean(schwartzman_train_data.iloc[i-5:i,-1])

schwartzman_train_data.head()

Unnamed: 0,Date,Court,Surface,Rank,Pts,Avg1,Avg2,Avg3,Avg4,Avg5,AvgSets,B365,PS,Max,Avg,WinAvg,Win
93,13.01.2015,Outdoor,Hard,60.0,775.0,6.0,4.0,6.0,0.0,0.0,2.0,4.0,3.78,4.05,3.62,0.0,1
101,14.01.2015,Outdoor,Hard,60.0,775.0,6.0,4.0,1.0,0.0,0.0,1.0,3.25,3.41,3.41,3.18,0.0,0
141,19.01.2015,Outdoor,Hard,59.0,789.0,6.0,5.0,7.0,4.0,0.0,1.0,7.0,7.64,7.64,6.33,0.0,0
409,10.02.2015,Indoor,Clay,66.0,770.0,7.0,6.0,0.0,0.0,0.0,2.0,1.66,1.76,1.77,1.69,0.0,1
421,12.02.2015,Indoor,Clay,66.0,770.0,3.0,6.0,2.0,0.0,0.0,1.0,2.1,2.18,2.3,2.13,0.0,0


In [27]:
schwartzman_train_data = pd.concat((schwartzman_train_data,
pd.get_dummies(schwartzman_train_data[['Court', 'Surface']], prefix=['Court', 'Surface'])), axis=1)

schwartzman_train = schwartzman_train_data.iloc[:230]
schwartzman_eval = schwartzman_train_data.iloc[230:]

schwartzman_train.columns

Index(['Date', 'Court', 'Surface', 'Rank', 'Pts', 'Avg1', 'Avg2', 'Avg3',
       'Avg4', 'Avg5', 'AvgSets', 'B365', 'PS', 'Max', 'Avg', 'WinAvg', 'Win',
       'Court_Indoor', 'Court_Outdoor', 'Surface_Clay', 'Surface_Grass',
       'Surface_Hard'],
      dtype='object')

In [28]:
model_schwartzman = LogisticRegression(C=0.1, random_state=42, solver='liblinear', penalty='l1')

In [29]:
X_train = schwartzman_train[['Rank', 'Pts', 'Avg1', 'Avg2', 'Avg3',
       'Avg4', 'Avg5', 'AvgSets', 'B365', 'PS', 'Max', 'Avg', 'WinAvg',
       'Court_Indoor', 'Court_Outdoor', 'Surface_Clay', 'Surface_Grass',
       'Surface_Hard']]
Y_train = schwartzman_train['Win']

In [30]:
model_schwartzman.fit(X_train, Y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=42, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [31]:
cross_val_score(model_schwartzman, X_train, Y_train, scoring='roc_auc', cv=5).mean()

0.5533846153846154

In [32]:
X_test = schwartzman_eval[['Rank', 'Pts', 'Avg1', 'Avg2', 'Avg3',
       'Avg4', 'Avg5', 'AvgSets', 'B365', 'PS', 'Max', 'Avg', 'WinAvg',
       'Court_Indoor', 'Court_Outdoor', 'Surface_Clay', 'Surface_Grass',
       'Surface_Hard']]
Y_test = schwartzman_eval['Win']

In [34]:
Y_hat_test = model_schwartzman.predict(X_test)

In [35]:
f1_score(Y_test, Y_hat_test)

0.8

The model's quality is not terrible, so let's look at the probability that Schwartzman D. wins the last match.

In [36]:
X_test.iloc[-1].values.reshape(1,-1)

array([[1.50e+01, 1.95e+03, 5.80e+00, 6.20e+00, 2.60e+00, 0.00e+00,
        0.00e+00, 1.80e+00, 3.75e+00, 3.80e+00, 3.88e+00, 3.63e+00,
        8.00e-01, 1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.00e+00]])

In [38]:
model_schwartzman.predict_proba(X_test.iloc[-1].values.reshape(1,-1))

array([[0.56571887, 0.43428113]])

In [40]:
model_schwartzman.predict(X_test.iloc[-1].values.reshape(1,-1))

array([0])

As we can see, the model is unsure whether Schwartzman D. wins or loses: it gives him a win probability of about 0.43 and predicts a loss.
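
The two models are trained independently, so their win probabilities for the same match (about 0.77 and 0.43) do not sum to 1. One simple way to reconcile them, offered here as an assumption rather than the notebook's method, is to rescale the two numbers so they sum to 1:

```python
# Win probabilities reported by the two per-player models above.
p_thiem = 0.77444724
p_schwartzman = 0.43428113

# Exactly one of them wins, so rescale the two numbers to sum to 1.
total = p_thiem + p_schwartzman
thiem_share = p_thiem / total
schwartzman_share = p_schwartzman / total
print(round(thiem_share, 3), round(schwartzman_share, 3))
```

After rescaling, Thiem D. comes out at roughly 0.64, which agrees with the direction of both individual predictions.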

### Sources of inspiration

1. https://habr.com/ru/post/306944/
2. https://habr.com/ru/post/307422/
3. https://habr.com/ru/post/456226/