-
Notifications
You must be signed in to change notification settings - Fork 0
benallard/ml_bets
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
So, let's try to get some neural network predict the euro tournament. 1. To train the network, we need historic data ---------------------------------------------- https://www.oddsportal.com/soccer/europe/euro-2004/results/ https://www.oddsportal.com/soccer/europe/euro-2008/results/ https://www.oddsportal.com/soccer/europe/euro-2012/results/ https://www.oddsportal.com/soccer/europe/euro-2016/results/ Let's put that in a csv (4 csv ?) and see where we come. So the content comes from a json call. Let's save the payload in the docs dir. And remove the extra stuff. Leave only the html string. Iterate over the lines and extract the data. Hours, Day, Participants, Score are easy to extract. The odd seems to be encoded in an own cipher. The decoding must happen in js in browser. Either it's a lookup table, or there is some trick. At least they stay constant upon refresh of the page. So there is always an 'f' in the middle, and often a 'z' on both side. The z could stand for '.', and maybe only the part after the 'f' is relevant. Okay, we have two values there: the max amoung the brokers (first part) and the mean (the second part). It's some simple substitution code. Refactor the code to process all the data, and generate csv. 2. Once we have the data, we need to define the environment for our ML agent. ----------------------------------------------------------------------------- We settle on gym. So we need an observation space and an action space. The action space is easy: its the predicted score The observation space will simply take all the rows from our data. - date ? nope - score ? you kidding ? nope - only three odds instead of the six So: - 2 x team, 3 x odds Theoretically, every team in the world could play. Okay, it's more difficult than I thought to add countries to the observation space ... Let's remove it for now. 3. Once we have an environment, we can start writing an agent ------------------------------------------------------------- So the agent trained for 100352 steps, and averaged a bit more than 1 point per match. That's very bad. Let's start at 0. again and add the data of the qualification. And add more randomization: don't play always the same four tournaments. Ok, it's been learning the whole night. At the moment it's as ~8 million timesteps, and nothing new emerges: 1.04 points per prediction. 4. Let's do something basic: raw pytorch ---------------------------------------- (Or why have a shot at Reinforcement Learning where the topology of the network is that simple?) We have three nodes in input, two in output, and let's say two hidden layers of 5 each. Everything linear, and no activation function for now [sic.]. We should define a custom loss method based on the rewards of the betting system. Loss functions are a science in themselves. I can't really return contant values as a loose the autograd stuff if I make a new tensor. So I need to combine loss functions, a bit like the `Huber loss`_. What we actually need seems to be an asymmetric loss function: if the right team was predicted as a win, it remains good no matter what the predicted score was. I'm not good at tensor arithmetic. That fails. At the moment, the network doesn't seems to learn anything. As if the optimizer step wouldn't be performed. Looks like it performs better (weights updates themselves) when i remove the `round()` step. So I trained a model on 10000 epochs, and he loves to predict 1:1. That's the problem with the loss function: the draw is not penalized enough, and hence 'near enough' to the right score. 5. Back to 0. add historical knowledge about the FIFA ranking, and integrate the 2020 dataset. ---------------------------------------------------------------------------------------------- Luckily, the FIFA has that data on its website, including historical values. Luckily 2: There is a kaggle (never heard of that site before) with scraped data. Kudos goes here: https://github.com/cnc8/fifa-world-ranking So, download the dataset and write a utility to get values from it. -- I got an idea on the lost function trouble: Instead of getting the network generate the score directly, we could have it generate the tital amount of goals & the delta. So that (2, 0) - giving 1:1 is farther away from (1, 1) or (1, -1) -- The ranking computation is way too slow. Let's work on it and cache some results instead of going through the 3MB dataset everytime. Done. The new model is either predicting 1:1 or 2:1 (well, 1:2 as well) for every match. If I would use it for the current tournament that would yield 35 points (today), which is almost the last place in the classement. I'm a bit disappointed. Let's write a basic manual model. 1:0 for the team that's better according to FIFA => 57 points 1:0 for the team that's better according to the bookies => 53 points That's definitely a 1 for me and a 0 for the AI. What if we add three outputs that acts like a categorisation (won, draw, lose), and consider the delta absolutely. I thought AI won for a moment (in some sense it actually did): If you bet 1:1 for every match, you win 70 something points. Which is more than the best place in our classment. That was wrong, and due to some other error. 6. Let's got where I should have started: linear regression ----------------------------------------------------------- So I installed scikit-learn I gathered all my data together so that I can fit some models. One regression model for the total amount of goal, and the delta. And one other classification model for the winner. I'm amazed by the different results everytime I run the program again. The following models get 72 points (at the end of the 8th) (that's place 24 out of 49): learned reg: [[-0.01054463 -0.06416651 0.22055738 0.19051208 0.49561023] [ 0.10158429 -0.00526354 0.44703517 0.04222541 0.83012052]] learned clf: [[-0.53628824 0.38646718 -2.36716721 1.40582221 1.58215466] [ 0.05510174 -0.11317569 -0.15895176 -0.38535366 -0.03886293] [ 0.73385465 -0.3805209 2.09960059 0.3172455 -1.48905972]] The models tend to favorize 2:1 over 1:0. Let's update our static model to see. Bad news, 1:0 is slightly better. (72:74 for odds, and 76:81 for ranking). That also means that even with the amazing 72 'dynamic' points, we are still below a static model. I would sincerely have believed that we could have done better. I just added a standard calculation model, taken from various places on github, and tweaked the coefficients to maximize the output. => 2020: 79 to 89 points. What now ? ---------- We've got a bunch of models: manuals: draw: 28 points rank: 92 points odds: 91 points calc: 91 points scikit: up to 96 points. pytorch: 20000 epoch, 68 points 40000 79 PPO get 38 points 91 points is place 24 (out of 45).
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published