# European Transfer Market Prediction

Our club, **Team FC**, competes in the top division in England. Staying in the league depends on several things: elite performance on the pitch from the first-team players, expert work from our coaching staff, and quality recruitment that keeps the club competitive.

As the club's data scientist, I am responsible for using the available data to build systems that support these goals. This report addresses the recruitment side of the club.

### Goal for recruitment

Our recruitment policy as a club is as follows:

* Purchases are limited to players under 26 years old
* The transfer budget is 40 million
* At most one non-European player can be bought
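
The three rules above can be written as a simple check on a shortlist. The following is a minimal sketch, not part of our pipeline: the `Candidate` class, its field names, and the example players are hypothetical, and it assumes the budget applies to the combined fees of all purchases.

```python
from dataclasses import dataclass

MAX_AGE = 26            # players must be younger than this
BUDGET = 40_000_000     # assumed to cover the combined fees of all purchases
MAX_NON_EUROPEAN = 1    # at most one non-European signing


@dataclass
class Candidate:
    # Hypothetical record for one transfer target
    name: str
    age: int
    continent: str
    fee: float


def satisfies_policy(shortlist):
    """Return True if the shortlist respects all three recruitment rules."""
    young_enough = all(p.age < MAX_AGE for p in shortlist)
    within_budget = sum(p.fee for p in shortlist) <= BUDGET
    non_european = sum(p.continent != "Europe" for p in shortlist)
    return young_enough and within_budget and non_european <= MAX_NON_EUROPEAN


# Made-up shortlist of three players
shortlist = [
    Candidate("Player A", 21, "Europe", 15_000_000),
    Candidate("Player B", 24, "South America", 12_000_000),
    Candidate("Player C", 19, "Europe", 10_000_000),
]
print(satisfies_policy(shortlist))
```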

We would like to identify 3 players to recruit based on our model, and will try to maximize the value we receive per player based on the following questions:

* **Which leagues are best for searching for young players (21 years or younger)?**
* **Which position is best to target within the top league with regard to value (cheapest position)?**
* **Which continent's players offer the most potential profit when bought and then sold on after the player turns 27?**

### Data
Our task is to create a model that will help us predict the market value for players based on historical data. Our data set is:

* **[Kaggle: European Football Transfers Dataset](https://www.kaggle.com/giovannibeli/european-football-transfers-database)**

The data set contains several *.csv* files covering different kinds of data. We focused on the following files:

* transfers.csv
* stats_of_players.csv
* dict_players.csv
* clubs_in_leagues.csv
* dict_leagues.csv

### Cleaning the data

We began cleaning by deciding which data to keep and which was unnecessary. We narrowed our features down to the following, listed by the data set each came from:

**Transfers**
* Player ID
* Season
* Fee (if there was a transfer that season)
* *Market value* (this is the value we are attempting to predict)

**Stats for players**
* Goal contributions (Goals + Assists)
* Minutes per appearance (Average per game)
* Total minutes played (in season)

**Dict of Players**
* Height
* Age
* Position (main only)
* Region of origin (continent)

**Clubs in Leagues/Dict of leagues**
* League level (tiers based on overall financial pull per league)



Most of these features were taken directly from the data sets. The exceptions were age, region, and league level, each of which combined data from at least two data sets. For example, we computed a player's age in a given season by taking the birth year from Date of Birth (in dict_players) and subtracting it from the season.
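
The age calculation can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes the season is stored as an integer year and the date of birth as an ISO-style `YYYY-MM-DD` string, which may not match the raw files.

```python
def season_age(season: int, date_of_birth: str) -> int:
    """Approximate a player's age in a season as season year minus birth year."""
    birth_year = int(date_of_birth[:4])  # assumes the string starts with the year
    return season - birth_year


print(season_age(2018, "1995-03-14"))  # 23
```

Because only the year is used, the result can be off by one for players whose birthday falls late in the season.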

We combined everything by grouping the transfers data by season and player ID, then concatenating the data from our other sets onto it, matched on both season and player ID for the stats, or on player ID alone for dict_players.
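
One way to sketch this combination step is with pandas, using key-based joins (`merge`) for the matching. The column names, the sample rows, and the choice of `max` as the aggregation within each player and season are all assumptions made for illustration; they are not taken from the original files.

```python
import pandas as pd

# Tiny made-up stand-ins for transfers.csv, stats_of_players.csv, dict_players.csv
transfers = pd.DataFrame({
    "player_id": [1, 1, 1, 2],
    "season": [2017, 2018, 2018, 2018],
    "fee": [0.0, 5_000_000.0, 1_000_000.0, 0.0],
    "market_value": [2_000_000.0, 6_000_000.0, 6_000_000.0, None],
})
stats = pd.DataFrame({
    "player_id": [1, 1, 2],
    "season": [2017, 2018, 2018],
    "goal_contributions": [4, 11, 2],
})
players = pd.DataFrame({
    "player_id": [1, 2],
    "height": [1.82, 1.90],
    "position": ["Forward", "Goalkeeper"],
})

# One row per player and season (taking the max is an assumption)
per_season = transfers.groupby(["player_id", "season"], as_index=False).agg(
    fee=("fee", "max"),
    market_value=("market_value", "max"),
)

# Stats match on player and season; player attributes match on player only
combined = (
    per_season
    .merge(stats, on=["player_id", "season"], how="left")
    .merge(players, on="player_id", how="left")
)
print(combined)
```

A left join keeps every player-season from the transfers data even when no stats row exists for it, which makes missing values visible instead of silently dropping rows.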

Additionally, we consolidated positions, regions, and league levels by combining entries into groups. For example, our data gives each player's nationality as a three-letter code used by [FIFA](https://en.wikipedia.org/wiki/List_of_FIFA_country_codes). We scraped the individual federation lists of nationalities and used them to convert each code to the continent whose federation the nation belongs to within FIFA.
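
The lookup itself can be sketched as two small dictionaries. The handful of codes below is a hand-picked sample for illustration, not the scraped lists, and the continent labels attached to each confederation are assumptions about how the groups were named.

```python
# Assumed continent label for each FIFA confederation
CONFEDERATION_TO_CONTINENT = {
    "UEFA": "Europe",
    "CONMEBOL": "South America",
    "CONCACAF": "North America",
    "CAF": "Africa",
    "AFC": "Asia",
    "OFC": "Oceania",
}

# Small sample of FIFA country codes; the real table would come from the scraped lists
CODE_TO_CONFEDERATION = {
    "ENG": "UEFA",
    "NED": "UEFA",
    "BRA": "CONMEBOL",
    "USA": "CONCACAF",
    "NGA": "CAF",
    "JPN": "AFC",
    "NZL": "OFC",
}


def continent_for(code):
    """Return the continent for a FIFA country code, or None if the code is unknown."""
    confederation = CODE_TO_CONFEDERATION.get(code.upper())
    if confederation is None:
        return None
    return CONFEDERATION_TO_CONTINENT[confederation]


print(continent_for("BRA"))
```

Returning `None` for unknown codes makes it easy to list the nationalities that still need to be looked up by hand.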

To deal with null or missing values, we went through each column that contained them and either removed the entries when the missing data was not significant or filled in the accurate values when they could easily be found online (mainly nationalities of individual players). The largest set of null values is in the market value column, where nearly half of the values are missing. Because market value is what we are trying to predict, we set aside the rows with a missing market value and apply the model to them later, giving us a good set of unseen data for our final model.
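
Separating the rows by whether market value is present can be sketched with pandas as below. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up rows; None marks a missing market value
data = pd.DataFrame({
    "player_id": [1, 2, 3, 4],
    "age": [22, 25, 19, 24],
    "market_value": [2_000_000.0, None, 750_000.0, None],
})

has_value = data["market_value"].notna()

# Rows with a known market value are used to fit and evaluate the model
labeled = data[has_value]

# Rows without one are set aside; the model's predictions fill the gap later
unlabeled = data[~has_value].drop(columns="market_value")

print(len(labeled), len(unlabeled))
```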

### Exploring the data

Before modeling, we explored various aspects of the data. The plots below show average market value by position within Tiers 1, 2, and 3 of our league tiers.
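
The numbers behind a plot of this kind come from a grouped average. The sketch below uses pandas on a few made-up rows; the league labels, column names, and values are hypothetical.

```python
import pandas as pd

# Made-up rows for two leagues within one tier
tier = pd.DataFrame({
    "league": ["A", "A", "A", "B", "B"],
    "position": ["Forward", "Forward", "Goalkeeper", "Forward", "Goalkeeper"],
    "market_value": [2.0, 4.0, 1.0, 6.0, 0.5],  # in millions
})

# Average market value per league and position, with positions as columns
avg_value = (
    tier.groupby(["league", "position"])["market_value"]
    .mean()
    .unstack("position")
)
print(avg_value)
```

With matplotlib installed, `avg_value.plot.bar()` would draw one group of bars per league, which is one way to produce a chart comparing positions across leagues.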

------------------------

<div style="text-align: center"> <b>TIER 3</b> </div>
<img src='market_tier_3.png'>

This image shows that the leagues differ somewhat in average price per position. In particular, attacking wide players in the Dutch league and goalkeepers in the Turkish league are significantly cheaper than their counterparts, while all other positions are relatively similar. Next we look at the average age per position to see whether it explains this.

<img src='age_tier_3.png'>

Players in the Dutch league are consistently younger than those in the other two leagues, which may explain their lower prices. The average age of wide attacking players in the Dutch league is just above 23! That is very young, which suggests that clubs in the league sell their wide players early on.

On the flip side, the Turkish league is older in every position than the other two leagues. In fact, the next image shows that the Turkish league is consistently older than every league that we are assessing.

<img src='overall_age_tier_3.png'>

----------------------------
We continue this assessment for Tier 2 with similar visualizations.

<div style="text-align: center"> <b>TIER 2</b> </div>
<img src='age_tier_2.png'>

At a glance, the Portuguese league follows a pattern similar to the Dutch league in Tier 3 (goalkeepers excepted), with young players overall and significantly younger wide attacking players in particular. We also see that defenders from Russia and France tend to be on the cheaper side, while the ages are very similar.

We move to Tier 1, which includes the league that our club currently plays in.

-----------


<div style="text-align: center"> <b>TIER 1</b> </div>
<img src='age_tier_1.png'>

Our assessment is that the Italian league is similar in age breakdown to the Dutch and Portuguese leagues. If we are looking for an elite player, the Italian league is likely to have younger players available for purchase, while the English and Spanish leagues will be more expensive no matter the position. Note that goalkeepers in the English league have significantly lower market values than those in the other two leagues, so goalkeepers there may be more affordable while bringing experience from a higher-intensity league.

From our exploratory data analysis, we conclude that the leagues with consistently younger players are the Eredivisie (TIER 3), Liga NOS (TIER 2), and Serie A (TIER 1), depending on the money being spent. We also conclude that the Turkish league is very unlikely to have players who meet our criteria of being young and talented enough to compete.

### Modeling

As stated before, we split our data set in two based on whether the market value is null, leaving us 77,659 data points for training and testing. We decided to use a 70/30 split for training the data and create