# Abstract

In the world of soccer, the transfer market is one of the main sources of revenue for clubs. Over the last few years, the money spent on the transfer market has skyrocketed: FIFA, the governing body of soccer, reported that clubs spent $7.35 billion on player acquisitions in 2019 alone. In this high-stakes environment, it is important for clubs to accurately assess a player's market value before submitting a bid. However, the traditional crowd-sourced approaches have been questioned for their inconsistency and susceptibility to bias. Recognizing this challenge, we aim to build a machine learning alternative to the conventional crowd-sourced approach to determining a player's market value. Our project not only targets a player's market value, but also addresses a more complex issue in soccer finance: the transfer fee. The transfer fee, the total cost a club pays for a player, extends beyond the player's market value and incorporates additional monetary obligations imposed by selling clubs. In this project, we developed a machine learning model that relies on linear regression and random forest algorithms to estimate the transfer fee. This model has the potential to help clubs and transfer agents with strategic financial planning and to reduce the risk of overpayment.

# Introduction

Soccer, globally referred to as football, is one of the most widely followed sports, boasting a fan base exceeding a billion as reported by FIFA in 2017. A critical aspect of this sport is the annual transfer windows, periods when soccer clubs can trade players, which significantly affect the clubs' strategies and finances.

Interestingly, Transfermarkt, one of the best-known websites for player market valuations, has stated that it does not use an algorithm but instead relies on the wisdom of the community to estimate valuations. Given the significant influence of these crowd-sourced values, it becomes crucial to establish fixed parameters for evaluating them, ensuring consistency and fairness. El Mundo, a popular newspaper in Spain, confirms that sports directors and representatives of soccer clubs do recognize the values from Transfermarkt.

Because most market values are crowd-sourced, the primary objective of the project is to propose a model that assigns specific weights to pertinent parameters, thereby ensuring consistency in estimating a player's market value in the transfer market.

Previous attempts at estimating player transfer values have revealed the importance of ratings. A ScienceDirect article (Dennis Coates, Petr Parshakov, 2022) (https://www.sciencedirect.com/science/article/abs/pii/S037722172100895X?via%3Dihub) emphasizes the significance of ratings in the player’s transfer market value. Alternative rating systems exist, such as the plus-minus rating, which calculates a team's net goals when a specific player is on the field (Sæbø, Olav Drivenes, and Lars Magnus Hvattum, 2015, https://core.ac.uk/reader/327107620). Another comparable rating system, the Goal Impact Metric (GIM) (McHale, Benjamin Holmes, 2023, https://www.sciencedirect.com/science/article/pii/S0377221722005082#bib0029), has proponents arguing its superior predictive capabilities. Nonetheless, our project chose to employ standard FIFA ratings from the FIFA game developed by EA Sports due to the extra time required to gather data on goals scored with a player both present and absent from a team.

Contrary to the referenced papers, which restricted their estimates to the English transfer market due to data limitations and potential market variations, our project has broadened its scope. We estimate player transfer fees across not only the English league but also the German, French, Italian, and Spanish leagues. This expansion was facilitated by an extensive dataset spanning FIFA editions from 2015 to 2020, merged with concurrent transfer data.

In our modeling approach, we selected Linear Regression, Lasso (linear regression with regularization), and Random Forest algorithms. More advanced models, such as Elastic-net regression, which integrates lasso and ridge regression, have shown enhanced accuracy in predicting transfer fees (McHale, Benjamin Holmes, 2023). We nonetheless chose Linear Regression, Lasso, and Random Forest because we are familiar with these standard methods and were curious to see how they compare to such advanced models.

https://ieeexplore.ieee.org/document/9721908


Perez-Cutino, Francisco. (2008). Innovative approaches to increase revenues for football clubs. Can good business sense make football better? doi:10.13140/RG.2.1.1634.2164.


# Values Statement

Football is an engaging topic to work with in many respects. Those who follow the sport are particularly interested in the transfer market, as people like to see where their favorite player might end up. For our project, we decided to estimate how much a player is worth and how much a team must pay to acquire that player.

Potential users fall into several groups. Those who follow football could use the app to check the market value of their favorite players as well as how much they would cost a club. Because our market value model is accurate, clubs could also use it to forecast a player's market value. Our transfer fee model is complete but not very accurate, so clubs may benefit less from it. Our model could also serve as a foundation for fantasy soccer: it provides a player's predicted market value and transfer fee, and fantasy soccer users could base their fantasy scores on that information.

Fans and anyone looking for an unbiased market value prediction model, along with a sample transfer fee model, can benefit from our program because the same criteria are applied to every player, which may also help clubs make decisions without bias. Reporters could also use this model to compare a player's crowd-sourced market value, which may be biased, against the value our model assigns to the same player.

As soccer fans ourselves, we found this project fascinating. Because various external factors can affect market value, it is often difficult to know how much a player will actually cost a club, but our model gives a reasonably close estimate of what a club would need to pay for a specific player. Our model also helps us determine a player's market value, which indicates whether a player is highly rated: the graphs in our project show that the higher a player's rating, the higher the player's market value. The project also answered some of our own questions. We had always wanted to look up player statistics without visiting individual club websites, and this model lets us print important information about these players.

In a nutshell, our model would assist many soccer fans in learning a player's value based on two pieces of information provided by our model, and it would assist the world of football in understanding the correlation between transfer fee and market value.


# Materials and Methods

## Data
## Approach

### Transfer Market Model
Our imported data was fairly clean, and we had plenty of samples and features (5,700 samples and 59 features). When preparing the data for model training, we dropped categorical features such as name and field position, as well as any samples containing N/A values. We also dropped release clause and wage because these features seem to be calculated by FIFA with the same formula used for market value, which made them biased predictors. This left us with 5,010 samples and 57 features. We used 20% of this data as a holdout test set and randomized the split with Sklearn’s train_test_split, because our original data was ordered by the player’s overall rating and we didn’t want the model to train on “good players” and test on “bad players” at the tail end of our dataset. We standardized our data using Sklearn’s StandardScaler() so the coefficients of our linear predictors could be more easily interpreted. Our target vector was the “value” column of our data, and the rest of the features comprised our feature matrix.
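
The preparation steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the exact pipeline: the column names other than `value`, the random seed, and the choice to fit the scaler on the training rows only are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player table; all column names except
# "value" are assumptions made for this sketch.
rng = np.random.default_rng(0)
n = 200
players = pd.DataFrame({
    "name": [f"Player {i}" for i in range(n)],
    "position": rng.choice(["ST", "CM", "CB"], size=n),
    "overall": rng.integers(50, 95, size=n).astype(float),
    "potential": rng.integers(50, 95, size=n).astype(float),
    "release_clause": rng.uniform(1e5, 1e8, size=n),
    "wage": rng.uniform(1e3, 5e5, size=n),
    "value": rng.uniform(1e5, 1e8, size=n),
})
players.loc[::25, "potential"] = np.nan  # simulate a few missing entries

# Drop categorical columns and the two suspected biased predictors,
# then drop every sample that still contains a missing value.
clean = players.drop(columns=["name", "position", "release_clause", "wage"]).dropna()

X = clean.drop(columns=["value"])
y = clean["value"]

# Shuffled 80/20 split, so an ordering by overall rating cannot
# separate strong players from weak ones across the two sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# Assumption: fit the scaler on the training rows only, then reuse it.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
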

Thereafter, we performed feature selection. We combined univariate feature selection approaches, such as mutual information and ANOVA, with recursive feature elimination with cross-validation (RFE), using Lasso linear regression, to select the “most important” predictive features.
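
One way to combine these selection methods is sketched below on synthetic data. Several details are assumptions rather than a record of our settings: the ANOVA step is represented by scikit-learn's `f_regression` F-test, the cutoff `k`, the Lasso `alpha`, and the rule for combining the masks are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import (
    RFECV,
    SelectKBest,
    f_regression,
    mutual_info_regression,
)
from sklearn.linear_model import Lasso

# Synthetic regression problem: 10 features, 3 of them informative.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

k = 5  # assumed number of features kept by each univariate filter

# Univariate filter 1: F-test between each feature and the target.
anova_mask = SelectKBest(f_regression, k=k).fit(X, y).get_support()

# Univariate filter 2: mutual information, keeping the k highest scores.
mi_scores = mutual_info_regression(X, y, random_state=0)
mi_mask = np.zeros(X.shape[1], dtype=bool)
mi_mask[np.argsort(mi_scores)[-k:]] = True

# Recursive feature elimination with 5-fold cross-validation,
# ranking features by the magnitude of the Lasso coefficients.
rfecv = RFECV(estimator=Lasso(alpha=1.0), step=1, cv=5, scoring="r2")
rfecv.fit(X, y)

# Assumed combination rule: keep features chosen by RFECV that are
# also supported by at least one univariate filter.
selected = rfecv.support_ & (anova_mask | mi_mask)
print("selected feature indices:", np.flatnonzero(selected))
```
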




NOTE: When performing RFE with standard linear regression instead of regularized linear regression we noticed that our

We trained our models on the two standout features selected by our univariate and RFE steps. We used a LASSO model and a RandomForestRegressor model to observe the difference between linear and non-linear patterns.




We evaluated the models with cross-validation and on the holdout test set, using the coefficient of determination (R²) as the metric.
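
A minimal sketch of this training and evaluation loop is shown below. It runs on synthetic two-feature data, and the hyperparameters (`alpha`, `n_estimators`), the number of folds, and the seeds are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data with two predictors, mirroring the two selected features.
X, y = make_regression(n_samples=400, n_features=2, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One linear and one non-linear model; hyperparameters are illustrative.
models = {
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean R^2 over 5 cross-validation folds of the training set.
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()
    # Refit on the full training set and score on the holdout set.
    model.fit(X_train, y_train)
    test_r2 = r2_score(y_test, model.predict(X_test))
    results[name] = (cv_r2, test_r2)
    print(f"{name}: CV R^2 = {cv_r2:.3f}, test R^2 = {test_r2:.3f}")
```
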
Positional bias coming soon…?

### Transfer Fee Model
Our imported data had 708 samples and 78 features, but it was not very clean. We subset the data to field players only (excluding goalkeepers) because goalkeepers had N/A values for many of the features unrelated to goalkeeping. After also dropping the categorical features, we were left with 414 samples and 73 features for modeling. We used 20% of this data as a holdout test set and randomized the split with Sklearn’s train_test_split. We standardized our data using Sklearn’s StandardScaler() so the coefficients of our linear predictors could be more easily interpreted. Our target vector was the “fee_cleaned” column of our data (which we converted to millions), and the rest of the features comprised our feature matrix.

Following McHale and Holmes (2023), we engineered two features with high predictive power: the average price paid by the selling club and the average price paid by the buying club.
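
Club-level averages like these can be computed with a grouped transform, as in the sketch below. The toy table, the club labels, and the `buying_club` and `selling_club` column names are assumptions; only `fee_cleaned` and the two output column names come from our data. The sketch averages the fees of the transfers in which each club appears as buyer or as seller, which is one reading of the feature definitions.

```python
import pandas as pd

# Tiny made-up transfer table; clubs and fees are illustrative only.
transfers = pd.DataFrame({
    "buying_club": ["A", "A", "B", "B", "C"],
    "selling_club": ["X", "Y", "X", "Y", "X"],
    "fee_cleaned": [10.0, 30.0, 5.0, 15.0, 20.0],  # in millions
})

# Average fee over all transfers sharing the same buying club.
transfers["fee_cleaned_buyer_avg"] = (
    transfers.groupby("buying_club")["fee_cleaned"].transform("mean")
)

# Average fee over all transfers sharing the same selling club.
transfers["fee_cleaned_seller_avg"] = (
    transfers.groupby("selling_club")["fee_cleaned"].transform("mean")
)

print(transfers)
```

Note that averaging over every row, as done here for brevity, lets each transfer's own fee contribute to its feature. A stricter variant would compute the averages from training rows only and map them onto the test rows.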


We performed the same process for feature selection as described for the transfer market model. The most notable and intuitive of these features were: ‘fee_cleaned_buyer_avg’, ‘fee_cleaned_seller_avg’, ‘value_eur’.




We trained our models on the three standout features selected by our univariate and RFE steps. As with the transfer market model, we used a LASSO model and a RandomForestRegressor to observe the difference between linear and non-linear patterns.


## Implementing a Small User Interface for our Model
After reaching nearly 90% accuracy with our market value model, we decided to build a small user interface that, given a player's name as user input, outputs the player's predicted market value. We also integrated our transfer fee model into the UI, which estimates how much a team would have to pay for a player.

We started by creating separate Python files for our market value code and our transfer fee code, which we then used as modules in our main Flask backend file. In the Flask file, we used conditionals to choose the appropriate module, which runs each time user input is received. A minor disadvantage of our program is that, because the model divides the data set into training and testing sets, not all players are available; thus, it may take a few tries to obtain the market value of a specific player (for example, if the user enters "Robert Lewandowski" and he is in the testing data set, his value will be displayed; otherwise, the user may need to enter his name a couple of times until he is in the testing data set). We built the front end in a single HTML file combining HTML, CSS, and JavaScript. The front end itself was not complicated; most of the time went into page design, such as the colors, margins, and positions of each element visible in the UI.
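
The structure described above, a Flask backend that picks one of two model modules with a conditional, might look roughly like the sketch below. Everything here is hypothetical: the `/predict` route, the query parameter names, the `lookup` helper, and the two dictionaries (which stand in for the real model modules and hold placeholder numbers) are illustrations, not our actual code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-ins for the two model modules.
# The numbers are placeholders, not real predictions.
MARKET_VALUES = {"R. Lewandowski": 80.0}
TRANSFER_FEES = {"Robert Lewandowski": 45.0}


def lookup(model, player):
    """Return the stored prediction for a player, or None if unavailable."""
    table = MARKET_VALUES if model == "market_value" else TRANSFER_FEES
    return table.get(player)


@app.route("/predict")
def predict():
    # Read the requested model and player name from the query string.
    model = request.args.get("model", "market_value")
    player = request.args.get("player", "")
    prediction = lookup(model, player)
    if prediction is None:
        # The player is not covered by the chosen model's data.
        return jsonify(error=f"no prediction available for {player!r}"), 404
    return jsonify(player=player, model=model, prediction_millions=prediction)
```

Saved as `app.py`, a sketch like this could be served locally with `flask --app app run` and queried at a URL such as `/predict?model=market_value&player=R.%20Lewandowski`.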

Instructions for using the UI:
- Navigate to the GitHub repository.
- Download all of the files, because each one is needed by the model and the app.
- Open the Python files and navigate to the directory where you saved them on your computer.
- Open app.py, then go to the command shell and run the following command: python app.py
- You can experiment with the UI by clicking on the link printed in the command shell. Unfortunately, because this is a built-in Flask webpage, the link cannot be shared; running the app locally is the only way to experiment with it.
- In the Market Value Model UI, player names are formatted like “R. Lewandowski”: the first name is abbreviated to its first letter followed by a period, and the last name is kept in full. For players without a conventional last name, such as “Neymar Jr”, the name is spelled out as is. In the Transfer Fee Model UI, player names use the full name, such as “Robert Lewandowski”.
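
The naming convention in the last step can be expressed as a small helper that converts a full name into the abbreviated form. This is a hypothetical convenience function, not part of the app: the `to_short_name` name and the `KEEP_AS_IS` exception list are assumptions, and real data may contain further exceptions.

```python
# Names assumed to keep their full form in the Market Value Model data.
KEEP_AS_IS = {"Neymar Jr"}


def to_short_name(full_name: str) -> str:
    """Abbreviate the first name to an initial, e.g. for the market value UI."""
    full_name = full_name.strip()
    parts = full_name.split()
    # Single-word names and listed exceptions are returned unchanged.
    if full_name in KEEP_AS_IS or len(parts) < 2:
        return full_name
    return f"{parts[0][0]}. {' '.join(parts[1:])}"


print(to_short_name("Robert Lewandowski"))  # R. Lewandowski
print(to_short_name("Neymar Jr"))           # Neymar Jr
```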







# Concluding Discussion

# Group Contributions Statement

Our project consisted of three sections: data collection and analysis, modeling the data to obtain predictions, and coding the user interface.
Anweshan handled the data side, while Hedavam and Ayman contributed to this section by scraping half of the data required for the market value model. Hedavam then used the data that Anweshan had analyzed and cleaned to build a best-fit model that would give us good accuracy and predictions. Ayman and Anweshan assisted with the modeling by performing feature engineering to determine which variables would be passed to Hedavam for the model. Ayman coded the user interface, which required integrating JavaScript, HTML, CSS, and Python; Hedavam provided the overall model code to support the UI, and Anweshan created the digital preview of the UI, designed to appeal to the user at first glance.
Beyond the project code, each group member wrote up their own parts of the Materials and Methods section, and some parts were shared among us. Anweshan was in charge of the Abstract and Introduction, Ayman of the Values Statement and Group Contributions Statement, and Hedavam of the Results and Concluding Discussion. Work was distributed evenly, and everyone's contributions reflected the effort they put into the project.


# Personal Reflection