# Predicting Soccer Matches

- Predicting soccer matches is an interesting problem to which to apply machine learning (ML). Soccer clubs, bettors, and others with a stake in the results also want accurate predictions to gain an edge.

### Procedure
- The benchmarks will be:
    - ~50%, the average accuracy of most news outlets and pundits
    - ~60%, the average accuracy of betting sites, which use proprietary algorithms and methods
    - ~70%, the accuracy one would achieve by betting with the market
- As a baseline, a null hypothesis is created that a team will beat another if it is ranked higher in the FIFA/Elo rankings
- Predictions will be of 'Team 1 Win', 'Team 2 Win', or 'Draw' between two sides, which makes this a classification problem
    - I hope to obtain serviceable results with Decision Trees, Random Forest, XGBoost, Support Vector Machines and, if time allows, a Dense Neural Network with a softmax output layer.
    - As 'Draws' are an important and meaningful result, I exclude Logistic Regression, as I believe binary classification would not accurately capture them.
    - I exclude Naive Bayes because its independence assumption does not reflect the trends / 'streaks' of winning that may occur due to factors such as emotion, player fitness, etc.
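
The ranking baseline described above can be sketched in a few lines. This is only an illustration: the function names, the ratings, and the results are hypothetical, and predicting a draw when two ratings are exactly equal is my own assumption (the baseline as stated never predicts a draw otherwise).

```python
def baseline_predict(rating_1, rating_2):
    """Predict the higher-rated side to win.

    Returning 'Draw' for exactly equal ratings is an assumption made
    here so that the function always returns one of the three classes.
    """
    if rating_1 > rating_2:
        return "Team 1 Win"
    if rating_2 > rating_1:
        return "Team 2 Win"
    return "Draw"


def accuracy(predictions, outcomes):
    """Fraction of predictions that match the observed outcomes."""
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return correct / len(outcomes)


# Hypothetical (rating of team 1, rating of team 2, observed result) rows.
matches = [
    (1900, 1750, "Team 1 Win"),
    (1600, 1820, "Team 2 Win"),
    (1700, 1710, "Draw"),
    (1800, 1650, "Team 2 Win"),
]
predictions = [baseline_predict(r1, r2) for r1, r2, _ in matches]
outcomes = [result for _, _, result in matches]
baseline_accuracy = accuracy(predictions, outcomes)
```

Any trained classifier can then be scored with the same `accuracy` function and compared against this baseline and the benchmarks listed above.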

## Features

Apart from the basic statistics available in the datasets, existing formulas developed for sports prediction are incorporated as follows:

### Not dependent on Opponent
**Form**

$$ \frac{1}{10} \sum_{k=1}^{5} \text{result}_k $$
where $\text{result}_k$ is the result of the $k$-th most recent match (a value in $\{0, 1, 2\}$)
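
A minimal sketch of the Form feature follows. The encoding 0 = loss, 1 = draw, 2 = win is an assumption, since only the value set is given here; the function name `form` is also just illustrative.

```python
def form(recent_results):
    """Form score in [0, 1] from the five most recent results.

    Each result is 0, 1 or 2. The mapping 0 = loss, 1 = draw, 2 = win
    is an assumption; the formula itself only needs the numeric values.
    """
    if len(recent_results) != 5:
        raise ValueError("expected exactly five results")
    if any(r not in (0, 1, 2) for r in recent_results):
        raise ValueError("results must be 0, 1 or 2")
    return sum(recent_results) / 10


# Win, win, draw, loss, win under the assumed encoding.
form_score = form([2, 2, 1, 0, 2])
```

Five wins give the maximum score of 1 and five losses give 0, which is why the sum is divided by 10 rather than by 5.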

<br/>

**Concentration**

$$ 1 - 2x $$
where $x$ refers to the most recent match lost to a weak team (one in the bottom 30 of the FIFA rankings)

<br/>

**Motivation**
$$ \min\left(\max\left(1 - \frac{\text{dist}}{3 \cdot \text{left}},\ \text{rivalry},\ \frac{\text{level} + \text{dist}}{2}\right),\ 1\right) $$
where
- rivalry - 1 if the match is a rivalry, 0 otherwise
- dist - distance from the Top 20 or Bottom 20
- left - games left to play at the moment (i.e. 3 if starting the Group Stage, 0 if in an elimination round)
- level - 1 if the game is single elimination or for qualification

_Okay, this one is a little hard to implement, so I will probably leave it out._
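
If this feature is attempted after all, a rough sketch could look like the following. Two things are assumptions on my part: `level` is taken to be 0 when the game is neither single elimination nor a qualifier, and the first term is simply skipped when `left` is 0, because the formula would otherwise divide by zero.

```python
def motivation(dist, left, rivalry, level):
    """Motivation score, capped at 1.

    dist    - distance from the Top 20 or Bottom 20
    left    - games left to play (0 in an elimination round)
    rivalry - 1 if the match is a rivalry, 0 otherwise
    level   - 1 for single elimination or qualification games,
              assumed to be 0 otherwise
    """
    candidates = [float(rivalry), (level + dist) / 2]
    if left > 0:
        # Assumption: with no games left this term is undefined
        # (division by zero), so it is left out of the max.
        candidates.append(1 - dist / (3 * left))
    return min(max(candidates), 1.0)


# Hypothetical group-stage game: not a rivalry, three games left.
group_stage_score = motivation(dist=1, left=3, rivalry=0, level=0)
```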

<br/> <br/>

### Dependent on Opponent
**Goal Difference**
$$ \frac{1}{2} + \frac{\text{diff}}{2 \cdot \text{maxdiff}} $$
where diff is the difference in goals and maxdiff is the maximal difference in goals
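
As a quick sketch of how this rescales the difference, assume diff is signed (goals of team 1 minus goals of team 2) and lies between -maxdiff and maxdiff, so the result lands in [0, 1]. Both the sign convention and the function name are assumptions.

```python
def goal_difference_score(diff, maxdiff):
    """Map a signed goal difference onto [0, 1].

    Assumes -maxdiff <= diff <= maxdiff, so that 0.5 means level,
    values above 0.5 favour team 1 and values below favour team 2.
    """
    if maxdiff <= 0:
        raise ValueError("maxdiff must be positive")
    return 0.5 + diff / (2 * maxdiff)


# Hypothetical values: team 1 is two goals ahead, largest gap seen is four.
gd_score = goal_difference_score(diff=2, maxdiff=4)
```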

<br/>

**Score Difference**<br/>
Same formula as _Goal Difference_, but applied to scores

<br/>

**History** 
$$ \frac{p_\text{team1} + p_\text{team2}}{2} $$

where $p_i$ is the result of the $i$-th most recent match

<br/>

### To be included in a future update:
A paper by _Govan et al._ posits that points scored by and against a team can be used proportionally to create Attack/Defense scores:
http://meyer.math.ncsu.edu/Meyer/Talks/OD_RankingCharleston.pdf

**ODM Attack**

$$ a_{\text{team1}} = \sum_{\text{team2}=1}^{n} P_{\text{t1t2}} \frac{1}{d_{\text{team2}}} $$

**ODM Defense**

$$ d_{\text{team1}} = \sum_{\text{team2}=1}^{n} P_{\text{t2t1}} \frac{1}{a_{\text{team2}}} $$

where $P_{\text{t1t2}}$ is the points scored by team 1 on team 2, $P_{\text{t2t1}}$ is the reverse, and the sums run over the $n$ opponents (team 2) of team 1.

Since A & D are mutually dependent, compute as follows...
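
One way to handle the mutual dependence is a simple alternating iteration: start with every defense score at 1, update all attack scores, then all defense scores, and repeat. The sketch below follows the offense-defense idea of the cited talk, where each term is divided by the opponent's rating; the function name, the fixed iteration count, and the points table are hypothetical, and it assumes every team has both scored and conceded at least once so that no division by zero occurs.

```python
def odm_ratings(points, iterations=50):
    """Alternating update of attack and defense scores.

    points[i][j] is the number of points team i scored on team j.
    A higher attack score is better; a lower defense score is better.
    Assumes every team has scored and conceded at least once.
    """
    n = len(points)
    defense = [1.0] * n
    attack = [0.0] * n
    for _ in range(iterations):
        # Points scored count for more against opponents that defend well.
        attack = [
            sum(points[i][j] / defense[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        # Points conceded count for less against opponents that attack well.
        defense = [
            sum(points[j][i] / attack[j] for j in range(n) if j != i)
            for i in range(n)
        ]
    return attack, defense


# Hypothetical three-team table, for illustration only.
points = [
    [0, 3, 2],
    [1, 0, 2],
    [1, 1, 0],
]
attack, defense = odm_ratings(points)
```

In this toy table team 0 scores the most and concedes the least, so it ends up with the highest attack score and the lowest defense score.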


```python
import pandas as pd
```