# Research Question

## Can the outcome (or the odds) of a football match be accurately predicted from prior match history and match context?

- Does the league affect how certain teams perform?
- Does the home team have an advantage?
- Do coaches affect the win rate?

# The data

The data in this report contains the match history of over 100k games between teams in many different leagues. The columns are:
- Winning team (home or away) (binary)
- Home team (categorical)
- Away team (categorical)
- League name (categorical)
- Cup game (binary)
- Home coach (categorical)
- Away coach (categorical)

Additionally, there are several columns of historical data, covering up to 10 prior matches for both the home and the away team in each match. Initially we will not use this data, but that may change.
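
Before the categorical columns can index model parameters, they need integer codes. The sketch below shows one way to do this with `pd.factorize`, encoding home and away teams jointly so that a team receives the same code in both roles. The miniature table, the team and league labels, and the `*_idx` column names are hypothetical; only the `home_team_name`, `away_team_name` and `league_name` column names come from the data.

```python
import pandas as pd

# Hypothetical miniature of the match table (labels are made up for illustration).
matches = pd.DataFrame({
    "home_team_name": ["A", "B", "C"],
    "away_team_name": ["B", "C", "A"],
    "league_name": ["X", "X", "Y"],
})

# Factorize home and away names together so each team has one code in both roles.
both_roles = pd.concat(
    [matches["home_team_name"], matches["away_team_name"]], ignore_index=True
)
team_codes, team_names = pd.factorize(both_roles)
n_matches = len(matches)
matches["home_idx"] = team_codes[:n_matches]
matches["away_idx"] = team_codes[n_matches:]

# Leagues get their own, independent code space.
league_codes, league_names = pd.factorize(matches["league_name"])
matches["league_idx"] = league_codes

print(matches[["home_idx", "away_idx", "league_idx"]])
```

Factorizing the home and away columns separately would instead give the same team two unrelated codes, which matters as soon as a model has one parameter per team.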



# Package imports

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import linear_model
import seaborn as sns
import torch

import pyro
import pyro.distributions as dist
from pyro.infer.autoguide import AutoDiagonalNormal, AutoMultivariateNormal
from pyro.infer import MCMC, NUTS, HMC, SVI, Trace_ELBO
from pyro.optim import Adam, ClippedAdam

# fix random generator seeds (for reproducibility of results)
np.random.seed(42)
pyro.set_rng_seed(42)  # also seeds torch, which Pyro uses for sampling

# matplotlib style options
plt.style.use('ggplot')
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 8)

# Data Handling


In [2]:

train=pd.read_csv(r"football-match-probability-prediction/train.csv")

test=pd.read_csv(r"football-match-probability-prediction/test.csv")

train_target_and_scores=pd.read_csv(r"football-match-probability-prediction/train_target_and_scores.csv")






In [3]:
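# Keep only the first ten columns; the remaining ones hold the match history, which is not used yet
train=train.iloc[:,:10]
test=test.iloc[:,:10]
#train.head()
# Column 1 of train is used as the target; all other columns are features
X_train=train.iloc[:,[0,2,3,4,5,6,7,8,9]]
Y_train=train.iloc[:,1]

# NOTE: these positional indices assume test.csv has the same column layout as train.csv
X_test=test.iloc[:,[0,2,3,4,5,6,7,8,9]]
Y_test=test.iloc[:,1]
#Y_train.head()
L=len(X_train.iloc[:,4].unique()) #number_of_leagues (column 4 of X_train is league_name)
print(L)
G=len(X_train) #number_of_games
X_train.head()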



728


| | id | home_team_name | away_team_name | match_date | league_name | league_id | is_cup | home_team_coach_id | away_team_coach_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 11906497 | Newell's Old Boys | River Plate | 2019-12-01 00:45:00 | Superliga | 636 | False | 468196.0 | 468200.0 |
| 1 | 11984383 | Real Estelí | Deportivo Las Sabanas | 2019-12-01 01:00:00 | Primera Division | 752 | False | 516788.0 | 22169161.0 |
| 2 | 11983301 | UPNFM | Marathón | 2019-12-01 01:00:00 | Liga Nacional | 734 | False | 2510608.0 | 456313.0 |
| 3 | 11983471 | León | Morelia | 2019-12-01 01:00:00 | Liga MX | 743 | False | 1552508.0 | 465797.0 |
| 4 | 11883005 | Cobán Imperial | Iztapa | 2019-12-01 01:00:00 | Liga Nacional | 705 | False | 429958.0 | 426870.0 |


There is not much data handling to do: most of the data is categorical and already properly labelled, with no missing values. We can therefore move straight on to model building.
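
A quick way to confirm the state of the columns is to tabulate the dtype, missing-value count and number of distinct values for each one. The helper `summarize_columns` and the toy frame below are hypothetical; on the real data one would call it on `X_train`. It may be worth running because pandas stores an integer column as floats when it contains missing values, so float-valued IDs can be a hint that some entries are absent.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Return dtype, missing-value count and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical toy frame with one deliberately missing coach ID.
toy = pd.DataFrame({
    "league_name": ["Superliga", "Liga MX", "Liga MX"],
    "is_cup": [False, False, True],
    "home_team_coach_id": [468196.0, np.nan, 1552508.0],
})

summary = summarize_columns(toy)
print(summary)
```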

# Model


In [4]:
def hierarchical_model(X, L , G, obs=None):

    input_dim = X.shape[1]

    print(X.shape)
    
    #with pyro.plate("League",L):
    #    mu_l=pyro.sample("mu_l", dist.Normal(torch.zeros(input_dim,L),10*torch.ones(input_dim,L)).to_event())
    #    sigma_l=pyro.sample("sigma_l", dist.HalfNormal(10*torch.ones(input_dim,L)).to_event())
    #    Beta_l=pyro.sample("Beta_l",dist.Normal(mu_l,sigma_l).to_event())        
    with pyro.plate("League",L):
        mu_l=pyro.sample("mu_l", dist.Normal(torch.zeros(L),10*torch.ones(L)).to_event())
        sigma_l=pyro.sample("sigma_l", dist.HalfNormal(10*torch.ones(L)).to_event())
        Beta_l=pyro.sample("Beta_l",dist.Normal(mu_l,sigma_l).to_event())


    with pyro.plate("Games",G):
        league=pyro.sample("league", dist.Categorical(torch.ones(L)/L).to_event())
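        # FIXME: the shapes do not line up here. Because of the .to_event() calls inside the
        # League plate, Beta_l has shape (L, L, L), so Beta_l[:, league] has shape (L, G, L).
        # dist.Beta also needs two concentration parameters, not one.
        confidence = pyro.sample("confidence",dist.Beta(X.matmul(Beta_l[:,league])))
        # FIXME: dist.Dirichlet expects a single concentration tensor, not a Python list.
        outcome = pyro.sample("outcome",dist.Dirichlet([confidence,confidence,confidence]),obs=obs)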



# Intended parameterisation: alpha is a function of Beta_l * inputs, and beta is 1 - alpha,
# so Beta_l * inputs needs to lie between 0 and 1.

        


    return outcome

In [5]:


%%time

X_ = X_train.iloc[:,[1,2,6]]
X_fac=np.zeros([G,3])
#print(X_fac)

X_fac[:,0],X_fac[:,1],X_fac[:,2] = pd.factorize(X_.iloc[:,0])[0], pd.factorize(X_.iloc[:,1])[0], pd.factorize(X_.iloc[:,2])[0]
Y_fac=pd.factorize(Y_train)[0]
#print(Y_fac)
#print(X_fac)
# Define guide function
guide = AutoDiagonalNormal(hierarchical_model)

# Reset parameter values
pyro.clear_param_store()

# Define the number of optimization steps
n_steps = 12000

# Setup the optimizer
adam_params = {"lr": 0.005}
optimizer = ClippedAdam(adam_params)

# Setup the inference algorithm
elbo = Trace_ELBO(num_particles=3)
svi = SVI(hierarchical_model, guide, optimizer, loss=elbo)

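# Do gradient steps
X_fac=torch.tensor(X_fac, dtype=torch.float)  # float32, to match the model parameters
Y_fac=torch.tensor(Y_fac)
for step in range(n_steps):
    loss = svi.step(X_fac, L,G, Y_fac)  # svi.step returns the loss, an estimate of the negative ELBO
    if step % 500 == 0:
        print("[%d] Loss: %.1f" % (step, loss))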


torch.Size([110938, 3])


RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 235181459968 bytes. Error code 12 (Cannot allocate memory)
Trace Shapes:                 
 Param Sites:                 
Sample Sites:                 
    mu_l dist    728 | 728    
        value    728 | 728    
 sigma_l dist    728 | 728    
        value    728 | 728    
  Beta_l dist    728 | 728 728
        value    728 | 728 728
  league dist 110938 |        
        value 110938 |        
Trace Shapes:
 Param Sites:
Sample Sites:
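
The size of the failed allocation can be reproduced from the trace shapes above. `Beta_l` has shape (728, 728, 728), and indexing its second dimension with the 110938 sampled `league` values produces a float32 tensor of shape (728, 110938, 728). The short calculation below, with variable names chosen here for illustration, checks that this accounts for the byte count in the error message.

```python
n_leagues = 728        # L, printed earlier in the notebook
n_games = 110_938      # G, first dimension of the printed X shape
bytes_per_float32 = 4

# Shape of Beta_l[:, league], taken from the trace shapes in the error output.
indexed_shape = (n_leagues, n_games, n_leagues)

n_bytes = bytes_per_float32
for dim in indexed_shape:
    n_bytes *= dim

print(n_bytes)                    # 235181459968, the figure in the error message
print(round(n_bytes / 2**30, 1))  # the same size in GiB
```

The request is for roughly 219 GiB, so the problem lies in the tensor shapes built inside the `League` plate rather than in the size of the data set itself.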