# ML BGG
## Machine Learning Project with BoardGames
![head](resources/img/head.png)  


### Table of Contents  
[Intro](#Intro)  
[Exploratory Data Analysis](#Exploratory-Data-Analysis)  
[Feature Engineering](#Feature-Engineering)  
[Machine Learning](#Machine-Learning)  
[Results](#results-of-the-training)  
[Final Thoughts](#Final-Thoughts)  
[Sources](#Sources)

### Intro
-------------
This is a Machine Learning project to analyze board games and predict the average rating of other games. The initial dataset was from Kaggle and we worked with the variables to adapt them and make them suitable for the Machine Learning models.

The objective of the project is to try to predict the average rating of boardgames if we provide the model with enough data such as the minimum age to play, the number of players or what kind of mechanics exist within the game.  

### Exploratory Data Analysis
-------------
The dataset was fairly clean and had a manageable number of columns to investigate: initially 15909 rows and 33 columns. Early in the project we created dummy variables for all the categorical columns, but that raised the number of columns to 200, so we discarded the idea.  

While doing the EDA, we discovered a huge increase in releases over the last 30 years, which suggests the market could be oversaturated.  
![gamespublished](resources/img/published.png)

Most of the games were for 2 or 4 players. While people generally prefer to keep it simple when playing (low complexity), some of the newer games try to explore new frontiers. We could also observe that dice and resource management remain the most popular game mechanics.  
![mechanics](resources/img/mechanicslong.png)

Finally, to give a general overview, we took a glimpse at the heatmap to see the correlations between the variables in the dataset.  
![heatmap](resources/img/heatmap.png)
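
As a rough illustration of what sits behind a heatmap like this one, the sketch below computes a correlation matrix with pandas. The tiny table and its column names are made up for the example and are not the project's real data.

```python
import pandas as pd

# Made-up stand-in for the board game table; column names are assumptions.
games = pd.DataFrame({
    "Min Age": [8, 10, 12, 14, 12],
    "Complexity Average": [1.5, 2.0, 3.0, 4.0, 3.5],
    "Rating Average": [6.0, 6.5, 7.5, 8.0, 7.8],
    "Name": ["A", "B", "C", "D", "E"],  # non-numeric, excluded below
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr = games.select_dtypes(include="number").corr()

# Rank the features by how strongly they correlate with the target.
target_corr = (
    corr["Rating Average"]
    .drop("Rating Average")
    .sort_values(ascending=False)
)
print(target_corr)
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would draw the matrix as a heatmap.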

### Feature Engineering
-------------
Getting rid of the NaN values was the top priority. Since there were only a few rows with more NaN values than data (unknown games that lacked a lot of information and sat at the bottom of the ranking), we simply dropped them. After that, we noticed that the Domains column had a lot of missing values, so we first assigned them the category Unknown. We then split the column into dummy variables to help the models classify it.  
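
The cleaning steps described above could look roughly like the following sketch. The sample rows are invented, and the assumption that a game can list several comma-separated domains is ours, so treat this as one possible way to do it rather than the exact code used in the project.

```python
import numpy as np
import pandas as pd

# Invented sample rows; only the Domains column name comes from the text above.
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Min Age": [10, np.nan, 14, 7],
    "Rating Average": [7.1, np.nan, 8.7, 5.4],
    "Domains": [
        "Family Games, Strategy Games",
        np.nan,
        "Strategy Games, Thematic Games",
        np.nan,
    ],
})

# 1. Drop rows that have more missing values than actual data.
mostly_empty = games.isna().sum(axis=1) > games.notna().sum(axis=1)
games = games.loc[~mostly_empty].copy()

# 2. Label the remaining missing domains explicitly.
games["Domains"] = games["Domains"].fillna("Unknown")

# 3. One indicator column per domain; a game may belong to several
#    (assumes domains are separated by ", ").
domain_dummies = games["Domains"].str.get_dummies(sep=", ")
games = pd.concat([games.drop(columns="Domains"), domain_dummies], axis=1)

print(games)
```

Building the dummies only for Domains keeps the column count small, in contrast to the early attempt of encoding every categorical column.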

Before starting the ML part, we also dropped some columns (ID, BGG Rank...) since they were irrelevant for the predictions: they were identifiers and data that had no use for this kind of exercise.

### Machine Learning
-------------
We started with common models such as [Linear](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html), [Polynomial](https://towardsdatascience.com/polynomial-regression-with-scikit-learn-what-you-should-know-bed9d3296f2) and RandomForest. The results were mediocre, so we started delving into other models. We tried [BayesianRidge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html) and [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html), with similar scores. Since the scores were still low across the board, we started to reconsider the previous steps.  

That's when we realized that we were feeding the models data with low numbers, and that affected them. Basically, we had made an artificial split between board games from before 2020 (train) and after that year (test), so the test data had lower numbers of votes, comments and every other kind of interaction. The model had too little to work with, so we changed the split to before and after 1995. That's when the models started improving their scores and we discovered that the XGB Regressor was a strong candidate for a starting point.  
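
A year-based split like the one described above can be sketched as follows. The helper `split_by_year` and the column names `Year Published` and `Users Rated` are assumptions made for the example, and the rows are invented.

```python
import pandas as pd


def split_by_year(df, cutoff, year_col="Year Published"):
    """Return (before, after): rows published before `cutoff`, and the rest."""
    before = df[df[year_col] < cutoff]
    after = df[df[year_col] >= cutoff]
    return before, after


# Invented rows: the most recent game has had little time to collect votes.
games = pd.DataFrame({
    "Year Published": [1980, 1994, 1995, 2010, 2021],
    "Users Rated": [5000, 12000, 40000, 30000, 150],
})

before, after = split_by_year(games, cutoff=1995)
print(len(before), "games before the cutoff,", len(after), "from the cutoff on")

# Comparing vote counts on each side of a cutoff shows how much
# interaction data each part of the split has to work with.
for cutoff in (1995, 2020):
    b, a = split_by_year(games, cutoff)
    print(cutoff, b["Users Rated"].median(), a["Users Rated"].median())
```

Checking a summary such as the median vote count on both sides before training is a cheap way to spot a split that leaves one side with very little interaction data.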

After the initial training runs, the first models showed that the XGB Regressor, RandomForestRegressor and DecisionTreeRegressor were the ones with the highest scores. That's why we decided to explore more parameters within those regressors.  

**Initial iteration**  
| Model | R2 Score |
| ------------- |-------------:|
| [XGB Regressor](https://xgboost.readthedocs.io/en/stable/parameter.html) | 0.857179 |
| [Gradient Boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) | 0.618233 |
| [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) | 0.588974 |
| [Decision Tree](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html) | 0.391220 |

![scoring](resources/img/modelos.png)
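
A comparison like the one in the table can be produced with a short loop over scikit-learn regressors. The sketch below runs on synthetic data with a random train/test split, not on the board game dataset or the year-based split, so its numbers will not match the table. `XGBRegressor` from the `xgboost` package follows the same `fit`/`predict` interface and could be added to the dictionary if that package is installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features and ratings.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

# Fit every model and record R2 on both the training and the held-out data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": r2_score(y_train, model.predict(X_train)),
        "test": r2_score(y_test, model.predict(X_test)),
    }

# Print the models ordered by their score on held-out data.
for name, s in sorted(scores.items(), key=lambda kv: kv[1]["test"], reverse=True):
    print(f"{name:<18} train R2={s['train']:.3f}  test R2={s['test']:.3f}")
```

Reporting both scores side by side makes overfitting visible: a large gap between the train and test R2 means the model memorized the training data.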

### Results of the training
-------------  
While training the models, the feature importances showed that the Bayesian average and interaction with the game (visits to the board game page, comments, wishlist additions...) helped greatly when predicting the rating. The last iteration shows that some models improved more than others, but in the end the ranking stays the same. Please note that the first column shows the score on the training data. On new data, the scores dropped: only slightly for the XGB Regressor, but noticeably for Random Forest.  

**Last iteration**  
| Model | R2 Score | R2 Score with new data |
| :--- | :---: | ---:|
| [XGB Regressor](https://xgboost.readthedocs.io/en/stable/parameter.html) | 0.899273 | 0.889522 |
| [Gradient Boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) | 0.826174 | 0.792521 |
| [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) | 0.800226 | 0.647375 |

### Final Thoughts
-------------
The winner was the XGB Regressor, with the following feature importances:  
![XGBFinal](resources/img/XGBFinal.png)
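
For readers who want to reproduce a ranking like this, the sketch below reads the `feature_importances_` attribute of a fitted tree ensemble. It uses scikit-learn's `RandomForestRegressor` on synthetic data to avoid an extra dependency; `XGBRegressor` exposes an attribute with the same name. The feature names are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features: one strongly predictive, one weak, one pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=400),
    "weak": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y = 5.0 * X["informative"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` would give a bar chart similar in spirit to the image above, provided matplotlib is available.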

The project was very revealing and we can see bright possibilities for the board game industry in the near future. Even though the model is not very good with newer games that have few interactions (number of players wishlisting the game, talking about the game on social media, etc.), it is robust when given enough data, with a high probability of making the correct prediction.  

To wrap up, the model is decent even without being too picky about feature selection. With some more filtering and new parameters, it could probably do a much better job, especially if neural networks enter the picture. That would be the first thing to try in a future iteration.


### Sources
-------------
The initial dataset was from [Kaggle](https://www.kaggle.com/datasets/andrewmvd/board-games).
* Python: 3.7.4
* Python libraries such as pandas, seaborn or sklearn.