# Big Data Project
## Preliminary Deliverable and Oral Presentation - Project Planning


## 1. Background

In football betting, several betting houses, each with its own mathematical model, publish three odds for every match, one for each possible result (home win, draw, away win).

![introduction.png](../Img/introduction.png)

As the picture above shows, for the same match (Inglaterra vs Panamá) the three betting houses offer different odds for each of the three possible results.

Most of the time these odds are nearly the same (for example, all three houses offer 1.22 for an Inglaterra win, and about 6 euros for a draw). Occasionally, however, the mathematical models disagree strongly. For a Panamá win, for instance, bet365 offers 4 euros more than the other two betting houses.

So, in this case, bet365's model considers a Panamá win much less probable than the models of the other two betting houses do.
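The link between odds and probability can be made explicit with a minimal sketch. It assumes the odds in the picture are decimal odds and ignores the margin that each betting house adds, so the value is only an approximation of what the house's model believes. The function name `implied_probability` is ours, chosen for illustration.

```python
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    return 1.0 / decimal_odds


# Odds of 1.22 for an Inglaterra win, as quoted in the example above.
p_home = implied_probability(1.22)
print(f"{p_home:.3f}")  # prints 0.820
```

Higher odds therefore mean a lower implied probability, which is why a house offering clearly higher odds for Panamá is rating that result as less likely.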

The main idea behind this project is to study these large discrepancies between betting houses and, if possible, use them to predict the final result of the match.


## 2. Objective

The general objective of this project is to increase our customer's betting profits. This will be achieved through the following specific objectives:

- Create a model that predicts results based on the discrepancies between the mathematical models of different betting houses for the same match, within the framework of a country and competition.


- Design a method to identify the matches with the greatest divergence between forecasts, and therefore with the greatest potential profit if the predicted result is correct.


- Determine the reliability of betting houses by evaluating their success rate, comparing their odds with the results of the matches.


- Identify whether betting houses specialize in a country, competition or team, by evaluating where their success rate is highest with respect to these variables.



## 3. Approach

First of all, we need to get the data. Searching the internet, we found this page:
http://www.football-data.co.uk/data.php.

The data on this page is updated every week with the results of that week's games.

The data structure is the following:

![leagues_seasons.gif](attachment:leagues_seasons.gif)

We have a set of leagues divided into main leagues and extra leagues. The difference is, basically, that main leagues are covered by more betting houses than extra leagues. Within a league, we have all the different seasons (starting, for all leagues, around 2003-2004). Finally, within a season, the results are distributed in different CSV files (one for each competition).

To get all this data, we wrote a Python script that builds the URLs of all the files and downloads them. The script also builds our own file system with the following structure: Country > Competition > Season.

![download.gif](attachment:download.gif)
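The download step can be sketched as follows. This is not the project's actual script: `URL_TEMPLATE` is a placeholder, and the season code format, the competition code `COMP1` and the choice of one CSV file per season are assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Placeholder template (an assumption): replace it with the real file URLs.
URL_TEMPLATE = "https://example.invalid/{season}/{competition}.csv"


def season_code(start_year: int) -> str:
    """Compact code for a season, e.g. 2003 -> '0304'."""
    return f"{start_year % 100:02d}{(start_year + 1) % 100:02d}"


def build_url(start_year: int, competition: str) -> str:
    """Fill the placeholder template for one season and competition."""
    return URL_TEMPLATE.format(
        season=season_code(start_year), competition=competition
    )


def local_path(root: str, country: str, competition: str, start_year: int) -> Path:
    """Destination file following the Country > Competition > Season layout."""
    season = f"{start_year}-{(start_year + 1) % 100:02d}"
    return Path(root) / country / competition / f"{season}.csv"


def download(url: str, destination: Path) -> None:
    """Create the folders if needed and save the file (network call)."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(destination))
```

For example, `local_path("data", "England", "COMP1", 2017)` points to `data/England/COMP1/2017-18.csv`, and `download(build_url(2017, "COMP1"), ...)` would fetch the matching file once the template holds a real address.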

We download all the information but, for now, we have decided to use only the data from the main leagues, since it covers more betting houses.

The data from this page is very consistent, but we have found some minor problems. One is that not all the files in the main leagues have the same format: some include more betting houses than others (generally depending on the country). Another is that the data does not start in the same season for every country.
So we reviewed all the competitions and selected a starting season from which we have information for every competition. Finally, to have the same set of betting houses everywhere, we created empty columns for the missing ones in all the competition files.
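The column alignment described above can be sketched with plain Python dictionaries, one per match row. The column names `HouseA_H` and `HouseB_H` are hypothetical, and the helper names are ours. The sketch only illustrates the idea of filling missing betting-house columns with empty values.

```python
def union_columns(tables):
    """Collect every column name seen in any table, keeping first-seen order."""
    seen = {}
    for rows in tables:
        for row in rows:
            for column in row:
                seen.setdefault(column, None)
    return list(seen)


def align_columns(rows, all_columns):
    """Return rows with exactly all_columns; missing values become ''."""
    return [{col: row.get(col, "") for col in all_columns} for row in rows]


# Two competitions with different betting houses (hypothetical column names).
competition_1 = [{"result": "H", "HouseA_H": "1.22", "HouseB_H": "1.25"}]
competition_2 = [{"result": "D", "HouseA_H": "2.10"}]

columns = union_columns([competition_1, competition_2])
aligned_2 = align_columns(competition_2, columns)
print(aligned_2)
```

After this step every file has the same columns, so the files can be concatenated without further checks.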

Finally, we have joined all the information from the main leagues into a single file (keeping only the columns with valuable information) and explored the data a little to extract some interesting facts, such as which betting house has the highest hit ratio, or which betting houses usually offer odds above or below the average.
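One possible way to compute a hit ratio is sketched below. It assumes that a betting house "hits" when the result with its lowest odds (its favourite) is the actual result. The `result` field with values `H`, `D`, `A`, and the `<house>_H`, `<house>_D`, `<house>_A` column naming are assumptions for illustration, as is the sample data.

```python
def favourite(odds_home: float, odds_draw: float, odds_away: float) -> str:
    """Outcome with the lowest odds; ties resolve in the order H, D, A."""
    odds = {"H": odds_home, "D": odds_draw, "A": odds_away}
    return min(odds, key=odds.get)


def hit_ratio(matches, house: str) -> float:
    """Share of matches where the house's favourite was the actual result."""
    hits = total = 0
    for match in matches:
        try:
            odds = [float(match[f"{house}_{o}"]) for o in "HDA"]
        except (KeyError, ValueError):
            continue  # skip matches where this house has no odds
        total += 1
        if favourite(*odds) == match["result"]:
            hits += 1
    return hits / total if total else float("nan")


# Illustrative rows only, not real data.
matches = [
    {"result": "H", "X_H": "1.22", "X_D": "6.0", "X_A": "12.0"},
    {"result": "D", "X_H": "2.0", "X_D": "3.1", "X_A": "3.5"},
    {"result": "A", "X_H": "4.0", "X_D": "3.4", "X_A": "1.9"},
    {"result": "H", "X_H": "", "X_D": "", "X_A": ""},
]
print(round(hit_ratio(matches, "X"), 3))  # prints 0.667
```

Rows with empty odds are skipped rather than counted as misses, so houses that cover fewer matches are not penalised for the empty columns.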


## 4. Expected outcome

The expected outcomes of this project are:

- A variable or indicator that assesses the **success of a betting house** in its predictions. This indicator will compare the odds with the results of the matches to determine how successful the betting house's predictions are.


- A machine learning model that predicts the **result of a match**, within the framework of the country, competition and specific moment. It will indicate the probability of a win for the home or away team.


- A method that identifies matches with **greater divergence** between forecasts, and therefore with greater potential profit if the predicted result is correct.
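As an illustration of the divergence idea, the sketch below scores a match by the largest gap, across betting houses, between the implied probabilities (1 / decimal odds) of the same result. This is only one candidate measure, not a decision already taken in the project, and the house names and odds are made up.

```python
def divergence(odds_by_house) -> float:
    """Largest spread in implied probability across houses, over H, D and A.

    odds_by_house maps a house name to its (home, draw, away) decimal odds.
    """
    spreads = []
    for outcome in range(3):
        probs = [1.0 / odds[outcome] for odds in odds_by_house.values()]
        spreads.append(max(probs) - min(probs))
    return max(spreads)


def rank_by_divergence(matches):
    """Sort (name, odds_by_house) pairs from most to least divergent."""
    return sorted(matches, key=lambda item: divergence(item[1]), reverse=True)


# Made-up odds for two matches and three hypothetical houses.
match_a = {"A": (1.22, 6.0, 12.0), "B": (1.22, 6.5, 12.0), "C": (1.22, 6.0, 16.0)}
match_b = {"A": (2.0, 3.2, 3.6), "B": (2.0, 3.2, 3.6), "C": (2.0, 3.2, 3.6)}

ranking = rank_by_divergence([("match_b", match_b), ("match_a", match_a)])
print([name for name, _ in ranking])  # prints ['match_a', 'match_b']
```

Working with implied probabilities rather than raw odds keeps long-shot results, whose odds are naturally large, from dominating the ranking.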

In the **production environment**, data will be updated weekly, which is the update frequency of the source data page. The update will be an automatically scheduled batch process. After the information is updated, the model will be retrained and the result files will be created.

Users will access the result files and carry out their own analysis with them.


## 5. Success Measures

The results of these variables and methods will be compared with the actual results in the new test data.

We have defined two datasets:
- Dataset for **training** the machine learning prediction model. It includes the information **up to the 2017-18** season.


- Dataset for **testing** the results of the model. It includes the information from the **2018-19** season.

![calendari_anys.jpg](attachment:calendari_anys.jpg)
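The split by season can be sketched in a few lines. The sketch assumes that each row carries a `season` field written as `YYYY-YY` (a hypothetical field name), so that comparing the strings orders the seasons chronologically.

```python
def split_by_season(rows, test_season: str = "2018-19"):
    """Rows before test_season go to training; rows of test_season go to test."""
    train = [row for row in rows if row["season"] < test_season]
    test = [row for row in rows if row["season"] == test_season]
    return train, test


# Illustrative rows only.
rows = [{"season": "2016-17"}, {"season": "2017-18"}, {"season": "2018-19"}]
train, test = split_by_season(rows)
print(len(train), len(test))  # prints 2 1
```

Splitting by time rather than at random keeps future matches out of the training data, which mirrors how the model would be used in production.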

If the goal of the test is to predict whether the home team will win the match, then after computing predictions on the test dataset we can find the following **possible situations**:

* **True Positive:** The prediction and the actual result are the same: the home team has won the match.


* **True Negative:** The prediction and the actual result are the same: the home team has not won the match (it has drawn or lost).


* **False Positive:** The prediction and the actual result differ: the prediction is that the home team will win the match, but it has actually not won.


* **False Negative:** The prediction and the actual result differ: the prediction is that the home team will not win the match, but it has actually won.

![results_schema.gif](attachment:results_schema.gif)

Our model will be some type of classifier. We can evaluate it with these **indicators**:

- **Accuracy:** among all the samples, how many are classified correctly
$$ acc = \frac{TP+TN}{TP+TN+FP+FN}$$


- **Precision:** among the samples the model labelled as positive, how many are actually positive
$$ prec = \frac{TP}{TP+FP} $$


- **Recall:** among the samples that are actually positive, how many the model labels correctly
$$ rec = \frac{TP}{TP+FN} $$


- **F1 measure:** the harmonic mean of precision and recall
$$ F = 2 \cdot \frac{prec \cdot rec}{prec + rec} $$
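The indicators above can be computed directly from the four counts, as in the following sketch. F1 is computed here as the harmonic mean of precision and recall, which is its standard definition. The counts in the example are made up, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"acc": accuracy, "prec": precision, "rec": recall, "f1": f1}


# Made-up counts for illustration: 100 test matches.
metrics = classification_metrics(tp=40, tn=30, fp=20, fn=10)
print({name: round(value, 3) for name, value in metrics.items()})
# prints {'acc': 0.7, 'prec': 0.667, 'rec': 0.8, 'f1': 0.727}
```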


## 6. Activity & Timing

The tasks for the development of this project will be:

- Selection of the data source, download and understanding of the information.

    
- Cleaning of files, logical organisation and validation of data correctness.

    
- Selection of the final fields and files, and consolidation of all data files into a single dataset.

    
- Preliminary analysis of data. Statistical description of information.

    
- Analysis of different machine learning models and creation of our model.

    
- Test of the model and presentation of results.


![gant.gif](attachment:gant.gif)


## 7. Dependencies, Assumptions & Constraints

### Dependencies:

We do not yet know how to build a probabilistic model that helps us reach our objective.

### Assumptions:

We are assuming that the information we get from the page is correct.

We are assuming that the page will continue to be updated every week with that week's games.

We are assuming that the new information uploaded to the page will keep the same structure.

### Constraints:
The source web page must remain active and must continue providing information weekly.

The structure of the source information must not change, and the information must remain as accurate as it is today.

The production hardware infrastructure must be provided by the client and depends on other technical providers.

