# Big Data Project

# Project overview

## 1. Background

In football betting, several betting houses, each with its own mathematical model, publish three quotas (odds) for every game, one for each possible result: home win, draw, and away win.

<img src="../Img/introduction.png">

As the picture above shows, for the same game (Inglaterra vs Panamá) the three betting houses offer different quotas for the three possible results of the match.

Most of the time these quotas are almost identical (for example, all three houses offer 1.22 for an Inglaterra win and roughly 6 euros for a draw). Occasionally, however, the mathematical models disagree sharply. In this case, bet365 offers about 4 euros more than the other two betting houses for a Panamá win.

In other words, the bet365 model considers a Panamá win much less probable than the models of the other two betting houses do.
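To make the link between a quota and a probability concrete, the following minimal sketch converts a quota into the probability it implies, computed as 1/quota. It assumes decimal (European) quotas and ignores the bookmaker margin. The function name and the three Panamá quotas are hypothetical, chosen only to reproduce a gap of about 4 between houses.

```python
def implied_probability(quota: float) -> float:
    """Probability implied by a decimal quota, ignoring the bookmaker margin."""
    if quota <= 0:
        raise ValueError("a quota must be positive")
    return 1.0 / quota


# Hypothetical quotas for the same outcome at three betting houses.
quotas = {"house_a": 13.0, "house_b": 13.0, "house_c": 17.0}
probabilities = {house: implied_probability(q) for house, q in quotas.items()}

# The house with the highest quota assigns the lowest probability.
least_confident = min(probabilities, key=probabilities.get)
```

A higher quota always maps to a lower implied probability, which is why an unusually high quota signals that a house rates that result as less likely than its competitors do.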

The main idea behind this project is to study these large discrepancies between betting houses and, if possible, use them to predict the final result of the match.


## 2. Objective

The general objective of this project is to increase our customer's betting profits. This will be achieved through the following specific objectives:

- Create a model that predicts results from the discrepancies between the mathematical models of different betting houses for the same match, within the framework of a country and competition.


- Design a method to identify the matches with the greatest divergence between forecasts, and therefore the greatest potential profit if the predicted result is correct.


- Determine the reliability of the betting houses by evaluating their success rate, comparing their odds with the results of the matches.


- Identify whether betting houses specialize in a country, competition or team, and evaluate where their success rate is highest with respect to these variables.



## 3. Approach

First of all, we need to get the data. Searching the internet, we found this page:
http://www.football-data.co.uk/data.php. 

The data on this page is updated every week with the results of that week's games.

The data structure is the following:

<img src="../Img/leagues_seasons.gif">

We have a set of leagues divided into main leagues and extra leagues. The main difference is that main leagues are covered by more betting houses than extra leagues. Within a league, we have all the different seasons (starting, for all leagues, around 2003-2004). Finally, within a season, we have all the results of that season distributed in different csv files (one per competition).

To get all this data, we wrote a Python script that builds the URLs of all the files and downloads them. The script also builds our own filesystem with the following structure: Country > Competition > Season.
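The following is a minimal sketch of how each remote file could be paired with its place in the Country > Competition > Season tree. It is not the project's actual script: the function names, the `Data/Raw` root, the catalogue tuples and the choice to store each season as `<season>.csv` are all assumptions made for illustration, and the base URL is passed in rather than hard-coded.

```python
from pathlib import PurePosixPath


def target_path(root: str, country: str, competition: str, season: str) -> PurePosixPath:
    """Location of one season file inside the Country > Competition > Season tree."""
    return PurePosixPath(root) / country / competition / f"{season}.csv"


def build_jobs(base_url: str, catalogue, root: str = "Data/Raw"):
    """Pair every remote file URL with its local destination.

    catalogue: iterable of (country, competition, season, remote_name) tuples,
    where remote_name is the file path relative to base_url (hypothetical layout).
    """
    return [
        (f"{base_url.rstrip('/')}/{remote_name}",
         target_path(root, country, competition, season))
        for country, competition, season, remote_name in catalogue
    ]
```

Each resulting (URL, path) pair can then be handed to any downloader, for example `urllib.request.urlretrieve` from the standard library, after creating the parent folder.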

<img src="../Img/download.gif">

We download all the information but, for now, we have decided to use only the data from the main leagues, because it contains more information about betting houses.

The data from this page is very consistent, but we found some small problems. One of them is that not all the files in the main leagues have the same format: some have more betting houses than others (generally depending on the country). Another is that the data does not start in the same season for every country. 
To solve this, we reviewed all the competitions and selected a starting season from which we have information for all of them. Then, to have the same set of betting houses everywhere, we created empty columns for the missing ones in all the competition files.
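The column alignment described above can be sketched as follows. This is an illustrative version that works on rows represented as dictionaries; the function name is hypothetical and the two rows reuse values from the sample of the dataset shown in this document.

```python
def align_columns(rows, all_columns):
    """Return rows that all share `all_columns`; missing values become None."""
    return [{column: row.get(column) for column in all_columns} for row in rows]


# Two matches coming from files that list different betting houses.
rows = [
    {"HomeTeam": "Bastia", "AwayTeam": "Dijon", "WHH": 1.9},
    {"HomeTeam": "Grays", "AwayTeam": "Ebbsfleet", "WHH": 2.3, "SBH": 2.3},
]
aligned = align_columns(rows, ["HomeTeam", "AwayTeam", "WHH", "SBH"])
```

With pandas, `DataFrame.reindex(columns=...)` gives the same effect on a whole file at once, filling the missing columns with NaN.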

Finally, we joined all the information from the main leagues into a single file (keeping only the columns with valuable information) and explored the data to extract some interesting information, such as which betting house has the highest hit ratio and which betting houses usually offer quotas above or below the average.


In [1]:
import pandas as pd

parse_dates = ["Date"]
ds = pd.read_csv("../Data/Interim/main_competitions.csv", parse_dates=parse_dates, index_col=False)
display(ds.dropna().sample(10))

n_countries = ds['Country'].nunique()
print("Countries   : ", n_countries)
n_competitions = len(ds[['Country', 'Competition']].drop_duplicates())
print("Competitions: ", n_competitions)
n_seasons = len(ds[['Country', 'Competition', 'Season']].drop_duplicates())
print("Seasons     : ", n_seasons)
n_teams = ds['HomeTeam'].nunique()
print("Teams       : ", n_teams)
n_matches = len(ds)
print("Matches     : ", n_matches)

Unnamed: 0,Country,Competition,Season,Div,Date,HomeTeam,AwayTeam,FTR,WHH,WHD,WHA,SBH,SBD,SBA,IWH,IWD,IWA,GBH,GBD,GBA
13327,England,Conference,2008-2009,EC,2008-09-20,Rushden & D,Burton,H,2.1,3.2,3.0,2.1,3.2,3.1,2.1,3.1,3.0,2.0,3.2,3.2
5718,England,Championship,2007-2008,E1,2008-12-01,Cardiff,Sheffield Weds,H,1.73,3.4,4.0,1.85,3.25,4.0,1.8,3.2,3.7,1.85,3.25,4.0
19366,England,League1,2006-2007,E2,2007-06-04,Scunthorpe,Yeovil,H,1.7,3.3,4.33,1.73,3.4,4.33,1.7,3.3,4.0,1.75,3.4,4.25
21120,England,League1,2010-2011,E2,2010-08-21,Peterboro,Huddersfield,H,2.5,3.2,2.8,2.35,3.25,2.7,2.3,3.1,2.6,2.4,3.25,2.65
38896,France,Division2,2006-2007,F2,2006-08-09,Bastia,Dijon,D,1.9,2.75,4.25,1.95,2.88,4.0,1.85,2.7,4.4,1.91,2.85,4.0
34777,England,Premier,2009-2010,E0,2010-01-26,Tottenham,Fulham,H,1.5,3.4,6.0,1.5,3.75,6.0,1.57,3.6,6.0,1.57,3.75,5.75
72341,Netherlands,Eredivisie,2006-2007,N1,2006-10-09,Nijmegen,AZ Alkmaar,A,3.8,3.5,1.72,4.2,3.5,1.73,4.2,3.2,1.7,4.6,3.5,1.66
21408,England,League1,2010-2011,E2,2011-01-02,Tranmere,Rochdale,D,2.4,3.3,2.9,2.3,3.25,2.8,2.4,3.1,2.5,2.35,3.25,2.75
13509,England,Conference,2008-2009,EC,2009-01-01,Grays,Ebbsfleet,H,2.3,3.2,2.62,2.3,3.2,2.75,2.1,3.3,2.8,2.3,3.25,2.75
39717,France,Division2,2009-2010,F2,2009-10-30,Istres,Sedan,H,2.4,2.9,3.0,2.3,2.9,3.0,2.3,2.9,3.0,2.35,2.9,3.0


Countries   :  11
Competitions:  22
Seasons     :  316
Teams       :  643
Matches     :  105503


## 4. Expected outcome

The expected outcomes of this project are:

- A variable or indicator that assesses the **success of a betting house** in its predictions. This indicator will compare the quotas with the results of the matches to determine whether the betting house's prediction was successful.


- A machine learning model that predicts the **result of a match**, within the framework of the country, competition and specific moment. It will indicate the probability of success of the home or away team. 


- A method that identifies the matches with the **greatest divergence** between forecasts, and therefore the greatest potential profit if the predicted result is correct.

In the **production environment** the data will be updated weekly, which is the update frequency of the source page. The update will be an automatically scheduled batch process. After the information is updated, the model will be retrained and the result files will be created.

The user will access the result files and use them for their own analysis.


### Success of a betting house

This indicator is in fact the **accuracy**, because it is the number of successes (TP and TN) divided by the total number of matches.

We expect to define a function like this: **accuracy (match, bethouse, scope)**

With params:

    match: Full information (record) of a match
    bethouse: Abbreviation of the betting house. Values: WH, SB, IW, GB
    scope: Columns used to filter the information. Values: Country, Competition, Season, Team or a combination of them.

This function will calculate accuracy = successes/matches for a specific betting house in the selected scope, relative to the given match.

Example: match =

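| | Country | Competition | Season | Div | Date | HomeTeam | AwayTeam | FTR | WHH | WHD | WHA | SBH | SBD | SBA | IWH | IWD | IWA | GBH | GBD | GBA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Belgium | JupilerLeague | 2003-2004 | B1 | 2003-08-08 | Club Brugge | Genk | H | NaN | NaN | NaN | 1.44 | 3.75 | 6.5 | 1.45 | 3.8 | 5.4 | 1.4 | 3.8 | 6.85 |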

result = accuracy(match, 'WH', ['Country'])

result = 0.6


### Divergence

For an individual match, divergence is a measure of the difference between the quotas of the betting houses.

We expect to define a function that calculates the divergence of a single match (row) and to use it following these steps:

1. Filter the matches of this week. Optionally, we can also filter by country or competition.

2. Map the rows, calculating the divergence of each one.

3. Reduce the result to obtain the top N divergence values.

The result will be a DataFrame with the N rows with the highest divergence, which are the ones we need to pay attention to.


### Prediction Model

We will probably use a classification model, not selected yet, that returns the probability of a win, a draw or a loss.

Example:
features = ['Country','Competition','Season','HomeTeam','AwayTeam','WHH','WHD','WHA','SBH','SBD','SBA','IWH','IWD','IWA','GBH','GBD','GBA']

label = 'FTR'

model = Model.train(features, label)

prediction = model.predict(features)

Result:

Win: 'Belgium JupilerLeague 2003-2004 B1 2003-08-08 Club Brugge Genk  ...', 0.60

Draw: 'Belgium JupilerLeague 2003-2004 B1 2003-08-08 Club Brugge Genk  ...', 0.30


## 5. Success Measures

The results of these variables and methods will be compared with the actual results in the new test data.

We have defined two sets of data:

- **Main** dataset: Includes all the information.


- **Recent** dataset: Includes the information from seasons 2017-18 and 2018-19.

We have divided each dataset into two subdatasets:

- Dataset for **training** the machine learning prediction model. It includes a random 80% of the information.


- Dataset for **testing** the result of the model. It includes the remaining 20% of the information.

<img src="../Img/calendari_anys.jpg">

If the goal of the test is to predict whether the home team will win the match, then after calculating the predictions on the test dataset we can find the following **possible situations**:

* **True Positive:** The prediction and the actual result are the same: the home team has won the match.


* **True Negative:** The prediction and the actual result are the same: the home team has not won the match (draw or away win).


* **False Positive:** The prediction and the actual result differ: the prediction is that the home team will win the match, but it has not won.


* **False Negative:** The prediction and the actual result differ: the prediction is that the home team will not win the match, but it has won.

<img src="../Img/results_schema.gif">

Our model will be some type of classifier. We can test it with these **indicators**: 

- **Accuracy:** among all the samples, how many are predicted correctly 
$$ acc = \frac{TP+TN}{TP+TN+FP+FN}$$


- **Precision:** among the samples that the model labelled as positive, how many are actually positive 
$$ prec = \frac{TP}{TP+FP} $$


- **Recall:** among the samples that are actually positive, how many the model labels correctly 
$$ rec = \frac{TP}{TP+FN} $$


- **F1 measure:** the harmonic mean of precision and recall
$$ F = 2 \cdot \frac{prec \cdot rec}{prec + rec} $$
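As a worked example of these indicators, the following sketch computes them from the four counts of a confusion matrix. The function name and the example counts are hypothetical. F1 is computed here as the harmonic mean of precision and recall, and any ratio with a zero denominator is returned as 0.0 by convention.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a "home team wins" classifier on 100 test matches.
metrics = classification_metrics(tp=40, tn=30, fp=20, fn=10)
```

With these counts the accuracy is 0.70, the precision is about 0.67 and the recall is 0.80, which shows how a single accuracy figure can hide an imbalance between the two kinds of error.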


## 6. Activity & Timing

The tasks for the development of this project will be:

- Selection of the data source, download and understanding of the information.


- Cleaning of the files, logical organisation and validation of data correctness.


- Selection of the final fields and consolidation of all data files into a single dataset.


- Preliminary analysis of the data and statistical description of the information.


- Analysis of different machine learning models and creation of our model.


- Testing of the model and presentation of results.


<img src="../Img/gant.gif">



## 7. Dependencies, Assumptions & Constraints

### Dependencies:

We don’t know yet how to generate a probabilistic model to help us in our objective.

### Assumptions:

We are assuming that the information we get from the page is correct.

We are assuming that the page will always be updating the information with the games of each week.

We are assuming that the new information uploaded to the page will keep the same structure.

### Constraints:
The source web page must remain active and must continue to provide information weekly.

The structure of the source information must not change, and it must remain as accurate as it is today.

Hardware production infrastructure must be provided by the client and depends on other technical providers.



# B. Technical Requirements

## 1. Solution Description & Diagram
A customer has asked us to develop a system that predicts the result of football matches based on the quotas from the betting houses.

We have identified the website http://www.football-data.co.uk/data.php, which contains all the information we need to develop the solution and is updated weekly.

The front-end will consist of a page with the following main options:

* Update information: Downloads the latest information from the source web page and shows the user the result as the number of matches by country and competition.

* Get weekly prediction: The user selects the initial and final date of the analysis (by default, the current week). The solution shows a screen with 4 frames:
   * Betting houses reliability: Graph with the reliability of the betting houses included in the dataset over all the seasons.
   * Top N most divergent matches: Description of the N matches with the highest divergence in this week and their predictions.
   * Top N least divergent matches: Description of the N matches with the lowest divergence in this week and their predictions.
   * Rest of the matches: List of the rest of the matches in the week.
   
* Get individual prediction: The user selects the initial and final date and the match (country, competition and team). The user is shown all the information about the match, including the prediction.

The back-end of the solution will be developed with Spark tools (RDDs and DataFrames) and Pandas.

The data will be stored in csv files.

The calculation process is:

<img src="../Img/esquema.gif">


## 2. Data Inputs
The original data has been obtained from the website: http://www.football-data.co.uk/data.php

This web page offers a large dataset of information about football matches for up to 25 European league divisions and other international leagues. The information covers 15 seasons, going back to 2003-2004, and includes more than 500 football teams.

The original files also include the quotas calculated by different betting houses based on the probability of a win, a draw or a loss.

The information is provided in csv files classified in folders by season, country and competition.

The website provides two separate datasets: main competitions and extra competitions. We have only dealt with the main leagues.

Our first process is responsible for downloading all these files and reordering them in folders by country, competition and season, which fits better with the subsequent processing.

This process is developed in notebook  '**1_download-raw.ipynb**'.

The following table shows the distribution of the downloaded data, grouped by country and competition. It shows the first and last season of every competition and the number of seasons. Every season is a separate file.

<img src="../Img/original_data.gif">


## 3. Data Cleaning Steps

The task that required the most work was the validation and processing of the **downloaded files**. Initially we downloaded all the files in both datasets (main and extra).

The first part of the cleaning process was to validate the downloaded files. We detected almost 40 files with an erroneous format, because they were not valid csv files. We discarded this information.

The next step was to identify the **common fields**. Since each combination of country, competition and season was a separate file, each file could contain different fields. This happened especially with the data of the betting houses. The cause is that each betting house operates in different countries and has been active in different periods of time.

This process was developed in '**2_filesAnalysis-raw_to_correct.ipynb**'

We reviewed the distribution of bookmakers by country and incorporated into our dataset the houses with information in most countries. In the end we included 12 betting houses in our dataset.
The following table is an example of the analysis of the distribution of betting houses.

<img src="../Img/bet_houses_distrib.gif">



After verifying the files, the next step was **unifying** all of them in a global dataset called 'main_competitions.csv'. 

As part of this task we selected which fields to include:

* Classification fields: Country, Competition and Season
* Match information: Date, HomeTeam, AwayTeam, FTR (Full time result)
* Betting house quotas. Field names are composed of the betting house id and the quota type id.
    * Betting house ids are: B365,BS,BW,GB,IW,LB,SO,SB,SJ,SY,VC,WH
    * Quota type ids are: H (Home), D (Draw), A (Away)

This process was developed in notebook '**3_filesUnifying-raw_to_correct.ipynb**'

We also created a second dataset with the matches from **seasons 2017-2018 and 2018-2019**, which will be used in the prediction model. It is called 'main_competitions_recent.csv'

We re-evaluated the betting house fields to check whether all of them were used. We detected and dropped the columns of 4 houses with nulls in all the records of these seasons. In the end this dataset contains the quotas of 6 betting houses.

This process was developed in notebook '**5_recent_seasons_subset.ipynb**' 


The next step was to validate that the **fields** of our dataset are **correct**. In this case all the fields have the correct type and format.

This step was developed in '**4_fieldsAnalysis-raw_to_correct.ipynb**'

The last cleaning step was to validate the **consistency of the fields**. 

This process was developed in '**5_fieldsAnalysis-correct_to_consistent.ipynb**'

We made the following validations:

* Null values in required fields: We checked for nulls in the required fields and deleted the corresponding rows:
    * Date: 145 rows
    * HomeTeam and AwayTeam: 385 rows
    * FTR: 146 rows
    
    
* Null values in betting houses: We reviewed the nulls in the betting house quotas. We decided to drop the columns of the betting houses with nulls in almost all rows, in this case SO and SY. We kept the columns of the remaining betting houses despite their nulls, because we remove nulls before every specific analysis.

In the analysis phase we create subdatasets and review the nulls in each of them.

To remove nulls in this case there are two possible strategies: remove columns (betting houses) or rows (matches). We chose a mixed strategy. We automatically detect the betting houses whose quotas have a large number of nulls (<80% correct values) and drop their columns. As a consequence, the remaining columns have a high level of information. Then we drop the rows with nulls.

In the case of the 'recent' dataset we dropped the 'LB' betting house (67% not null) and kept 5 betting houses. After dropping the rows with nulls we kept 99.38% of the rows.

This specific part is developed in notebook '**8_NaiveBayesModel-probabilities.ipynb**'
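The mixed strategy can be sketched as below. This is an illustrative version and not the notebook's code: the function name is hypothetical, rows are plain dictionaries with `None` for a missing quota, and the 80% threshold is applied per quota column rather than per betting house.

```python
def clean_nulls(rows, quota_columns, min_filled=0.80):
    """Drop sparse quota columns, then drop rows that still contain nulls.

    Returns (kept_columns, cleaned_rows).
    """
    total = len(rows)
    # Keep a column only if at least `min_filled` of its values are present.
    kept = [
        column for column in quota_columns
        if total and sum(row.get(column) is not None for row in rows) / total >= min_filled
    ]
    dropped = set(quota_columns) - set(kept)
    # Remove the dropped columns, then discard rows with nulls in the kept ones.
    cleaned = [
        {key: value for key, value in row.items() if key not in dropped}
        for row in rows
        if all(row.get(column) is not None for column in kept)
    ]
    return kept, cleaned
```

Dropping the sparse columns first is what keeps the row loss small: a row is only discarded when it lacks a value in a column that is otherwise well populated.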

* Unify team names: We detected teams that appear under different names and unified the names.


<img src="../Img/team_names.gif">


## 4. Data Processing Steps

### 4.1 Divergence

To assess how differently the betting houses have valued a match, we created the measure 'divergence'.

We calculate the divergence as the maximum percentage variation between the quotas and the mean of the quotas of the match. This is done for every type of quota (H, D, A).
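One possible reading of this definition is sketched below. It is an assumption, not the notebook's 'calcDivergence': for each quota type it takes the largest absolute deviation from the mean, expressed as a percentage of that mean, and then keeps the largest of the three values. The function name is hypothetical, missing quotas are expected as `None`, and the example match is the Nijmegen vs AZ Alkmaar row from the sample shown earlier.

```python
from statistics import mean


def divergence_of_match(match: dict, houses: list) -> float:
    """Largest percentage deviation of any quota from the mean quota of its type."""
    worst = 0.0
    for kind in ("H", "D", "A"):
        quotas = [match[house + kind] for house in houses
                  if match.get(house + kind) is not None]
        if len(quotas) < 2:
            continue  # a deviation needs at least two houses to compare
        centre = mean(quotas)
        worst = max(worst, max(abs(q - centre) / centre * 100 for q in quotas))
    return worst


match = {
    "WHH": 3.8, "WHD": 3.5, "WHA": 1.72,
    "SBH": 4.2, "SBD": 3.5, "SBA": 1.73,
    "IWH": 4.2, "IWD": 3.2, "IWA": 1.7,
    "GBH": 4.6, "GBD": 3.5, "GBA": 1.66,
}
example_divergence = divergence_of_match(match, ["WH", "SB", "IW", "GB"])  # about 9.52
```

In this example the home quotas drive the result: their mean is 4.2 and the furthest quotas (3.8 and 4.6) are 0.4 away, which is about 9.52% of the mean.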

First, we identify the betting houses and columns included in the dataset with the function 'filterBetHouses'.

Then we calculate the divergence of every individual match with the function '**calcDivergence**'.

After this, we map this calculation over all the rows in the dataset.

Finally, we implemented the function '**topNDivergence**' to obtain the top N matches with the highest divergence and the top N matches with the lowest divergence.
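The top N selection can be sketched with the standard library as follows. This is a hypothetical stand-in for 'topNDivergence' that assumes every match is a dictionary already carrying a numeric 'Divergence' value.

```python
import heapq


def top_n_divergence(matches, n, key="Divergence"):
    """Return the n matches with the highest and the n with the lowest divergence."""
    highest = heapq.nlargest(n, matches, key=lambda match: match[key])
    lowest = heapq.nsmallest(n, matches, key=lambda match: match[key])
    return highest, lowest
```

Using `heapq` avoids sorting the whole dataset when N is small compared with the number of matches.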

To analyse the divergence values we calculated their histogram. The measure has a right-skewed distribution, with values from almost 0% to 80% and a central value of 7-8%. The central quartiles lie between 5.42% and 9.69%. 

The spread of the divergences is small: the interquartile distance is only 4.27%, and the central 80% of the values lie within a range of about 10%.

The next graph shows the histogram and the values of the mean, the median, and the percentiles 10%, 25%, 75% and 100%.


<img src="../Img/divergence1.gif">

When we analyse the matches with the lowest and highest divergence, we will take as limits the 10% quantile (4.31%) and the 90% quantile (14.06%).
 
This has been developed in notebook '**7_divergence.ipynb**'.

After this, we added this measure to the dataset and saved it to disk.

This was done with the function '**calcAndSaveDivergence**'.

This function uses auxiliary functions such as 'reformatRow', 'reformatRDD', 'createDataFrame' and 'saveDFtoCSV' to transform the result of the 'calcDivergence' function into the structure needed to save the dataset to disk again.

We have added the Divergence field to both datasets: main_competitions and main_competitions_recent.

This has been developed in notebook '**7_divergenceSave.ipynb**'.

### 4.2 Bet Houses Reliability


To calculate the reliability of each betting house, we look at the whole history of quotas. For each match we take into account only two values: __FTR__ (Full Time Result) and __the lowest quota of the betting house for that match__. The lowest quota marks the result that the betting house considered most probable for that match. The computation is then simple: if __FTR__ matches the result with the lowest quota, the betting house has hit the result; otherwise, it has not. Finally, we divide the number of hits by the total number of bets of that betting house to obtain its reliability.
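The hit-ratio computation can be sketched as below. The function name is hypothetical, matches are plain dictionaries, and two details are assumptions: matches where the house lacks any of its three quotas are skipped, and ties between quotas are resolved in the order H, D, A. The three example rows reuse the WH quotas from the sample shown earlier.

```python
def house_reliability(matches, house):
    """Share of matches in which the house's lowest quota matched the final result."""
    hits = bets = 0
    for match in matches:
        quotas = {result: match.get(house + result) for result in ("H", "D", "A")}
        if any(quota is None for quota in quotas.values()):
            continue  # the house did not offer a full set of quotas
        favourite = min(quotas, key=quotas.get)  # result with the lowest quota
        bets += 1
        hits += favourite == match["FTR"]
    return hits / bets if bets else None


matches = [
    {"FTR": "H", "WHH": 1.5, "WHD": 3.4, "WHA": 6.0},    # Tottenham vs Fulham
    {"FTR": "A", "WHH": 3.8, "WHD": 3.5, "WHA": 1.72},   # Nijmegen vs AZ Alkmaar
    {"FTR": "D", "WHH": 1.9, "WHD": 2.75, "WHA": 4.25},  # Bastia vs Dijon
]
wh_reliability = house_reliability(matches, "WH")
```

On these three matches the house hits two results and misses the draw, giving a reliability of 2/3.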

The reliability of our bethouses looks like this:

<img src="../Img/reliability.png">

As the image shows, all the betting houses have roughly the same hit ratio, which suggests that their quotas are highly correlated.

### 4.3 Prediction of the match result

To predict the match result we decided to use a classification model based on __Naive Bayes__. We discussed using other kinds of models, such as __Linear Regression__, __Linear Regression with Stochastic Gradient Descent (SGD)__ or __Multinomial Logistic Regression__, but we discarded them because we were aiming for a different type of outcome. What we want is to get, for each match, the probability of the three possible results. 

For example, a desired outcome would be this one: (__Home Wins__: 45%, __Draw__ : 25%, __Away Wins__ : 30%)

We chose a __Naive Bayes__ model because it provides this kind of outcome directly.

Finally, we decided to create two different __Naive Bayes__ models. The first one is trained with the whole history of bets and the second one only with the bets from the current season (2018-2019).
We took this decision because we wanted to see whether the model performance increases with more specific data. It may also be interesting to create a model with only the quotas for one competition, country or team.

To create these two models, we used the __FTR__ column as the __label__ and all the quotas from the betting houses as the __features__ vector.

Finally, we split the datasets into 90% for training and 10% for testing.

The performance of our first model is the following:

<img src="../Img/full_ds.png">
<img src="../Img/full_th.png">
<img src="../Img/full_hh.png">
<img src="../Img/full_dh.png">
<img src="../Img/full_ah.png">


The performance of our second model is the following:

<img src="../Img/partial_ds.png">
<img src="../Img/partial_th.png">
<img src="../Img/partial_hh.png">
<img src="../Img/partial_dh.png">
<img src="../Img/partial_ah.png">



As can be seen, both models seem to struggle with draw predictions. This is why their accuracy is only around 50%.

We can also see that when the data is more specific, the performance increases.

### Calculation of probabilities

To obtain a more precise prediction, it is not enough to calculate the most probable result. We need to know the probability of every result.

We first tried to calculate it with the NaiveBayesModel of the mllib library, which works with RDDs. This model provides the PI logs (log probabilities of the result values) and the THETA logs (an array of log probabilities of the parameters conditioned on each result).

With this model, the conversion from the logs of PI and THETA to the probabilities of each result must be done manually.
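As an illustration of what that manual conversion involves, the sketch below assumes a multinomial Naive Bayes model, where the unnormalised log score of a class is its log prior plus the feature values weighted by the log conditional probabilities. The scores are then normalised into probabilities. The function name and the argument layout (one row of `log_theta` per class) are assumptions.

```python
import math


def posterior_probabilities(log_pi, log_theta, features):
    """Class probabilities from the log priors and log conditionals of a multinomial model."""
    scores = [
        class_log_prior + sum(x * lt for x, lt in zip(features, class_log_theta))
        for class_log_prior, class_log_theta in zip(log_pi, log_theta)
    ]
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    peak = max(scores)
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    return [weight / total for weight in weights]
```

The returned list has one probability per class, in the same order as `log_pi`, and always sums to 1.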

At this point we tested the **ML library**, which works with Spark DataFrames. The 'transform' method of the NaiveBayesModel of this library returns the desired probabilities.

We have developed a second version of the functions to create, train and test the model.

The function '**calcModelAndPrediction**' does all this work and is developed in notebook '**8_NaiveBayesModel-probabilities.ipynb**'

These are the steps followed:
* **cleanNulls**: Removes null information. 

We decided not to replace nulls with a default or calculated value because this would modify the predictions.

To remove nulls there are two possible strategies: remove columns (betting houses) or rows (matches). We chose a mixed strategy. We automatically detect the betting houses whose quotas have a large number of nulls (<80% correct values) and drop their columns. As a consequence, the remaining columns have a high level of information. Then we drop the rows with nulls. 

In the case of the 'recent' dataset we dropped the 'LB' betting house (67% not null) and kept 5 betting houses. After dropping the rows with nulls we kept 99.38% of the rows.

* **calcBetsHousesCols**: Obtains the list of betting house quota columns.
This function checks the dataset columns and determines which ones are available to use as parameters of the model.


* **Split training and test**
We split the resulting dataset into two random subsets: 80% is used to train the model and 20% is used to test it.


* **calcNaiveBayesModel**: Calculates the Naive Bayes model.
We transform the result to numeric values (H=0, D=1, A=2). After creating the vector of labels and parameters, we create the model with the 'multinomial' type and fit (train) it.
After this we calculate the prediction with the 'transform' method. This is the point at which we obtain the probabilities.


* **Evaluate model**
To evaluate the model we used the '**MulticlassClassificationEvaluator**' class and the 'Accuracy' metric.
We use only the accuracy metric because its values are relatively low, so calculating other metrics makes little sense.

**Accuracy** values are:

Model: 48.16 %

Prediction: 47.44 %
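The label encoding and the random split described in the steps above can be sketched in plain Python as follows. This is only an illustration: the function name and the fixed seed are assumptions, and the split here is exact, whereas a split done with Spark's `DataFrame.randomSplit` is only approximately 80/20.

```python
import random

# Numeric encoding of the full time result, as described in the steps above.
LABELS = {"H": 0, "D": 1, "A": 2}


def encode_and_split(matches, train_fraction=0.8, seed=42):
    """Add a numeric 'label' to every match and split them into train and test sets."""
    encoded = [dict(match, label=LABELS[match["FTR"]]) for match in matches]
    random.Random(seed).shuffle(encoded)  # seeded shuffle for a reproducible split
    cut = int(len(encoded) * train_fraction)
    return encoded[:cut], encoded[cut:]
```

Fixing the seed makes the model and prediction accuracies comparable between runs, since both depend on which matches fall in each subset.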


### Partition by divergence

We want to know whether the divergence measure affects the prediction of results and the accuracy of the model.

For this analysis we used the dataset 'main_competitions_recent' and the versioned functions defined in the previous notebook. We divided the information into three subdatasets depending on the value of the divergence:

- Lower divergence: Matches in the lowest 10% interval, that is, values below 4.31.

- Higher divergence: Matches in the highest 10% interval, that is, values over 14.61.

- Central divergence: Matches in the central 80% interval, with values between 4.31 and 14.61.

In addition, every subdataset was divided into two parts: 80% for the model and 20% for the test.
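The three-way partition can be sketched as follows, using the thresholds quoted above as defaults. The function name is hypothetical, and each match is assumed to be a dictionary with a numeric 'Divergence' value.

```python
def partition_by_divergence(matches, low=4.31, high=14.61, key="Divergence"):
    """Split matches into lower, central and higher divergence subsets."""
    parts = {"lower": [], "central": [], "higher": []}
    for match in matches:
        value = match[key]
        if value < low:
            parts["lower"].append(match)
        elif value > high:
            parts["higher"].append(match)
        else:
            parts["central"].append(match)  # thresholds themselves count as central
    return parts
```

Each of the three subsets can then be split again into model and test parts and evaluated separately.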

We calculated the accuracy of the model and of the test in all the subsets. The following graph shows the calculated accuracy:

<img src="../Img/divergence_accuracy.gif">

The higher divergence subdataset has a good accuracy, around 70%, but the remaining subdatasets have an accuracy similar to that of the normal model. 

We also made a more detailed partitioning of the dataset by deciles of divergence, and the result confirms that accuracy increases with the divergence value. This is probably because these matches have a very clear predicted result from the betting houses' point of view.

<img src="../Img/divergence_accuracy2.gif">

Finally, we tried adding the 'Divergence' measure as a parameter of the model, and the result was similar to that of the original model:

* Naive Bayes Model with Divergence

Model accuracy     :  47.81 %

Prediction accuracy:  48.13 %

* Naive Bayes Model without Divergence

Model accuracy     : 48.16%

Prediction accuracy: 47.44%

### Partition by Country

We are interested in knowing whether other variables can affect the model prediction. We take the case of the 'Country' variable. 

Following the same steps as with the 'Divergence' measure, we partitioned the dataset by country and calculated the model and test accuracy.

<img src="../Img/country_accuracy.gif">

Only a few countries (Greece, Portugal, Italy, ...) have a higher accuracy when the model uses a partial dataset partitioned by country.

# C. Outcomes


## 1. Review Expected vs. Attained Outcomes

All our expected outcomes have been achieved.

We have developed functions to calculate the divergence of matches and to identify the top N divergent matches.

We have also developed a process to obtain the reliability of the betting houses and have calculated it with the main competitions dataset.

We have obtained a prediction of the result based on **Naive Bayes** that provides a result like this: (__Home Wins__: 45%, __Draw__ : 25%, __Away Wins__ : 30%).

We have encoded the result as follows:

    0 : Home wins
    1 : Draw
    2 : Away wins

## 2. Review Final Outcomes vs. Objectives


The accuracy of our models is currently around 50%.

We expected to obtain the same prediction accuracy for all three results. However, as we have explained, the models seem to have problems predicting draws, and this lowers the global accuracy.

Note that if the models predicted draws as well as they predict home and away victories, the global accuracy would be almost 70%.

## 3. Conclusions and Future Steps

As we have seen, the model seems to work better when the data used is more specific. It would therefore be a good idea to create new models with the bets for only one team, competition or country.

Finally, we could also try other interesting kinds of models, such as associative ones. It may be worth trying __A-Priori__ to find association rules (if they exist) between bets and results.
