# Football wages in the top 6 leagues 2022
## Introduction

Athletes' salaries have always attracted public interest, and I am no exception. I therefore decided to analyse a dataset with the salaries of football players from the top 6 European leagues (the ranking is based on how teams perform in tournaments such as the Champions League and the Europa League). Throughout the analysis I use several libraries, such as Seaborn and Plotly, to plot the data and present the findings.

First, I should explain in more detail what each column of my dataset contains.
* Wage:

    The player's wage (in euros).
* Age:

    The player's age in years.
* Club:

    The club the player belongs to.
* League:

    The league the player's club plays in. There are 6 of them:
  * La Liga - Spanish league
  * Premier League - English league
  * Primeira Liga - Portuguese league (spelled `Primiera Liga` in the dataset)
  * Ligue 1 - French league (`Ligue 1 Uber Eats` in the dataset)
  * Bundesliga - German league
  * Serie A - Italian league
* Nation:

    The player's nationality; there are many of them.
* Position:

    The player's main position on the field; there are only 4 of them.
* Apps:

    The number of matches the player has played for a club in his career.
* Caps:

    The number of times the player has played for his national team.

### Here is a list of libraries that will be used in my project:

In [None]:
# link to my dataset: https://www.kaggle.com/datasets/ultimus/football-wages-prediction
# !pip3 install pandas
# !pip3 install matplotlib
# !pip3 install plotly
# !pip3 install seaborn
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

**Here is what my dataset looks like:**

In [None]:
df = pd.read_csv('SalaryPrediction.csv')
df


# Data clean up and transformation
1) Check whether there are any NaNs and delete them if they are present.
2) Some numerical data in this dataset are stored as strings (the 'Wage' column is clearly not in an integer format), so I need to convert them to integers. I also have to strip extra spaces from the columns with text data.
3) Remove all players with 0 appearances, because I believe they can distort the statistics significantly. Many players count as members of a team, but 0 appearances means they did not play a single match in the 2022 season either, whereas I want to analyse the wages of players who play regularly.
4) Filter the data by wage with the IQR rule in order to get rid of outliers.
5) Finally, add new columns.

In [None]:
for name in df.columns.values:
    print(df[name].isna().unique())

Every column prints only `False`, which means that there are no NaNs in this dataset, so there is no need to call `dropna()`.
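The same check can be written more compactly. Below is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from the dataset); it counts missing values per column instead of printing the unique flags.

```python
import pandas as pd

# Hypothetical toy data with one missing value, for illustration only
toy = pd.DataFrame({
    'Wage': ['1,000', '2,500', None],
    'Age': [21, 30, 25],
})

# Number of missing values in each column
nan_counts = toy.isna().sum()
print(nan_counts)

# True if the frame contains at least one NaN anywhere
has_nan = bool(toy.isna().any().any())
print(has_nan)
```

Applied to the real `df`, a column of zeros would confirm the result above.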

In [None]:
df['Wage'] = df['Wage'].apply(lambda x: x.replace(',', '')).astype('int')

for column in ['Age', 'Apps', 'Caps']:
    df[column] = df[column].astype('int')

for column in ['League', 'Club', 'Nation', 'Position']:
    df[column] = df[column].apply(lambda x: x.strip())
    
df

Now all numerical data are integers, and the string values no longer have extra spaces that could cause trouble later.
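As an alternative to `apply` with a lambda, pandas offers vectorised string methods. The sketch below shows the same clean-up on made-up rows; the values and club names are hypothetical.

```python
import pandas as pd

# Hypothetical rows imitating the raw file: wages as strings with thousands separators
toy = pd.DataFrame({
    'Wage': ['3,000,000', '1,200,000', '350,000'],
    'Club': [' Club A', 'Club B ', ' Club C '],
})

# Remove the separators as plain text (not as a regex) and convert to integers
toy['Wage'] = toy['Wage'].str.replace(',', '', regex=False).astype('int64')
# Strip leading and trailing spaces from the text column
toy['Club'] = toy['Club'].str.strip()

print(toy.dtypes)
print(toy)
```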

In [None]:
df = df[df['Apps'] > 0]
df

Now only players with at least 1 appearance remain.

In [None]:
# IQR rule: keep wages within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['Wage'].quantile(0.25), df['Wage'].quantile(0.75)
iqr = q3 - q1
df = df[df['Wage'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
df

As mentioned above, I apply this method to drop the outliers, superstars such as Messi and Neymar, because their enormous wages would affect the results of my analysis significantly, whereas I want to analyse more typical salaries.
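To make the IQR rule concrete, here is a small worked example on made-up wages (the numbers are hypothetical; the last one plays the role of a superstar). It is only a sketch of the rule, using pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Hypothetical wages in euros; the last value is deliberately extreme
wages = pd.Series([100_000, 200_000, 300_000, 400_000, 500_000, 10_000_000])

q1 = wages.quantile(0.25)
q3 = wages.quantile(0.75)
iqr = q3 - q1

# Anything outside these fences is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Series.between is inclusive on both ends by default
kept = wages[wages.between(lower_fence, upper_fence)]
print(lower_fence, upper_fence)
print(kept.tolist())
```

Here Q1 is 225,000 and Q3 is 475,000, so the upper fence is 850,000 and only the 10 million wage is dropped.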

In [None]:
# Wage per club appearance, truncated to a whole number of euros
df = df.assign(**{'Wage by Apps': (df['Wage'] / df['Apps']).astype('int')})
df

This new numerical column may be helpful later on.

# Descriptive statistics

**First of all, I decided to check which nationalities dominate among the players.**

In [None]:
df['Nation'].value_counts()[:10]

It can be seen that players from the countries whose national league is in the top 6 dominate.

**Then I decided to check the mean, median, minimum, maximum, standard deviation and other statistics of players' age.**

In [None]:
df['Age'].describe()

Mean and median are almost equal.

**Then I checked the same statistics for players' wages, but I also divided them into four categories by position.**

In [None]:
display(df[['Wage', 'Position']].groupby(by=['Position']).describe().astype('int'))

Strikingly, midfielders usually have the highest salary.

**After that, I did it again but divided them into 6 categories based on each player's league.**

In [None]:
display(df[['Wage', 'League']].groupby(by=['League']).describe().astype('int'))

One can notice that the mean wage and the median wage in the Premier League (England) are very close.
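How close the mean and the median are can also be expressed as a number. The sketch below uses made-up wages for two hypothetical leagues and computes the relative gap between the two statistics; a gap near 0 suggests little skew, while a large positive gap points to a few high earners pulling the mean up.

```python
import pandas as pd

# Hypothetical wages for two made-up leagues
toy = pd.DataFrame({
    'League': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Wage': [100, 200, 300, 100, 100, 700],
})

# Mean and median wage per league
stats = toy.groupby('League')['Wage'].agg(['mean', 'median'])

# Relative gap between mean and median
stats['gap'] = (stats['mean'] - stats['median']) / stats['median']
print(stats)
```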

# Overview

**At first, I decided to visualise the number of players of each age.**

In [None]:
# Count players of each age and plot the counts in age order
ages = df['Age'].value_counts().sort_index()
ax = sns.barplot(x=ages.index, y=ages.values)
ax.set(xlabel='Age', ylabel='Number of players', title='Number of players of each age.')
plt.show()

Players aged 22 to 25 prevail.

**Now I want to check which nations dominate in each of the leagues.**

In [None]:
for league in df['League'].unique():
    # Nationality counts for this league, most common first
    counts = df.loc[df['League'] == league, 'Nation'].value_counts()
    dd = counts.iloc[:5].rename_axis('Nation').reset_index(name='Count')
    # Sum everything outside the top 5 into a single 'Other' slice (kept numeric, not a string)
    dd_add = {'Nation': 'Other', 'Count': int(counts.iloc[5:].sum())}
    dd = pd.concat([dd, pd.DataFrame([dd_add])], ignore_index=True)
    display(px.pie(dd, values='Count', names='Nation', title=f'Percentage of players with different nationalities in {league}.', hole=0.3))

It can be seen that national players prevail in all 6 leagues, and no foreign nation has a big share in any of the leagues, except for Brazilian players in the Primeira Liga (Portugal).
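The "top 5 plus Other" grouping used for the pie charts can be wrapped in a small helper. This is a sketch under my own naming: `top_n_with_other` is a hypothetical function, not part of pandas, and the nationality codes are made up.

```python
import pandas as pd

def top_n_with_other(values, n=5, other_label='Other'):
    """Count values, keep the n most frequent and sum the rest under other_label."""
    counts = values.value_counts()
    top = counts.iloc[:n]
    rest = int(counts.iloc[n:].sum())
    if rest > 0:
        top = pd.concat([top, pd.Series({other_label: rest})])
    return top

# Hypothetical nationality codes for one league
nations = pd.Series(['ESP'] * 5 + ['BRA'] * 3 + ['ARG'] * 2 + ['FRA', 'POR'])
shares = top_n_with_other(nations, n=2)
print(shares)
```

With `n=2` the two most common codes are kept and the remaining four players end up in `Other`.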

**After that, I decided to make a scatter plot of wages of players of each age.**

In [None]:
display(px.scatter(df, x='Age', y='Wage'))

The only conclusion that can be drawn from this graph is that the youngest and the oldest players do not have salaries higher than 3 million.

**Next, I made a boxplot of wages of players of each age.**

In [None]:
sns.set(rc={"figure.figsize":(25, 14)})
plt.show(sns.boxplot(data=df, x='Age', y='Wage', width=0.8).set_title('Wages of players of each age.', fontdict={'size': 30}))

Surprisingly, the median wage (the line inside each box) of players who are 31, 35, 38 and 40 years old is higher than at any other age. Given that many players end their careers around the age of 30, I concluded that football managers and teams are ready to pay decent salaries to experienced players who are still physically capable of playing the game (otherwise they would have ended their careers earlier due to health problems and would not have been included in this dataset).
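One thing worth checking before trusting the oldest age groups is how many players stand behind each box, since a single well-paid veteran can define the whole group. The sketch below uses made-up data; the threshold `min_players` is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Hypothetical data: a single well-paid 38-year-old next to several younger players
toy = pd.DataFrame({
    'Age':  [24, 24, 24, 24, 38],
    'Wage': [500_000, 600_000, 700_000, 800_000, 2_000_000],
})

# Group size, mean and median wage for each age
by_age = toy.groupby('Age')['Wage'].agg(['count', 'mean', 'median'])
print(by_age)

# Keep only ages with enough players to say something about them
min_players = 3  # arbitrary threshold for this sketch
reliable = by_age[by_age['count'] >= min_players]
print(reliable)
```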

# Detailed overview
**I decided to start the detailed overview with a classic pairplot.**

In [None]:
plt.show(sns.pairplot(df, hue='League'))

Unfortunately, it is hard to draw any conclusions from this graph apart from some obvious ones, such as: the older a player is, the more club appearances he is likely to have.

**Subsequently, I made a scatter plot where x coordinate is the player's wage and y is the player's 'Wage by Apps' value (separated by positions).**

In [None]:
px.scatter(df, x='Wage', y='Wage by Apps', color='Position')

Once again, it is hard to draw an interesting conclusion. It can be seen that the Wage by Apps value is rarely higher than 200 thousand, and these high values occur only if the player's wage is less than 2 million.

**Then, I decided to make some scatter plots with overlaid regression lines. The first one is separated by player's position.**

In [None]:
plt.show(sns.lmplot(df, x='Age', y='Wage', hue='Position'))

We can conclude that a player's wage grows with age regardless of his position.

**Next, I made a similar scatter plot but divided players by their league.**

In [None]:
plt.show(sns.lmplot(df, x='Age', y='Wage', hue='League'))

Here more interesting details can be found. Firstly, in the Premier League wages increase drastically with age, which cannot be said of the Primeira Liga. Secondly, if a player wants to spend his whole career in one league, then after the age of 25 Serie A is more beneficial in terms of wage than the Bundesliga, even though he is likely to start his career on a lower salary in the Italian league than in the German one.
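The steepness of each regression line can also be compared numerically. The sketch below fits an ordinary least-squares line of wage against age per league with `numpy.polyfit` on made-up data; `wage_slope` is a hypothetical helper name, and the numbers are not from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages and wages for two made-up leagues
toy = pd.DataFrame({
    'League': ['A'] * 4 + ['B'] * 4,
    'Age':    [20, 25, 30, 35, 20, 25, 30, 35],
    'Wage':   [100, 200, 300, 400, 100, 110, 120, 130],
})

def wage_slope(group):
    """Slope of the least-squares line Wage ~ Age (wage units per year of age)."""
    slope, _intercept = np.polyfit(group['Age'], group['Wage'], deg=1)
    return slope

# One slope per league
slopes = {league: wage_slope(group) for league, group in toy.groupby('League')}
print(slopes)
```

In this toy example wages in league A grow ten times faster with age than in league B.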

**After this, I decided to check for which countries players are likely to be paid more the more they have played for their national team (the Caps column stands for that). For this purpose, I wrote this piece of code:**

In [None]:
nations = df['Nation'].unique()
positions = df['Position'].unique()
for nation in nations:
    for position in positions:
        subset = df[(df['Nation'] == nation) & (df['Position'] == position)]
        # Skip small groups: a correlation over a handful of players is not reliable
        if subset.shape[0] <= 10:
            continue
        correlation = subset[['Wage', 'Age', 'Apps', 'Caps']].corr()
        # A NaN correlation (e.g. all Caps equal) compares as False, so such groups are skipped too
        if correlation.loc['Caps', 'Wage'] > 0.8:
            print(f"Players' nation - {nation}, position - {position}.")
            display(correlation.style.background_gradient())

So, if you are a player from one of these nations who plays in the listed position, the more games you have played for your national team, the higher your club salary is likely to be.
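The filtering idea (a high Caps-to-Wage correlation in a group that is large enough) can be shown on a small example. The data below is made up, and the group-size limit is lowered so that the toy groups pass; a high correlation here describes an association only, not a cause.

```python
import pandas as pd

# Hypothetical players: in group X caps and wage move together, in group Y they do not
toy = pd.DataFrame({
    'Nation': ['X'] * 4 + ['Y'] * 4,
    'Caps':   [0, 10, 20, 30, 0, 10, 20, 30],
    'Wage':   [100, 200, 300, 400, 300, 100, 400, 200],
})

min_players = 4   # the notebook requires more than 10; lowered for the toy data
threshold = 0.8

strong = {}
for nation, group in toy.groupby('Nation'):
    if len(group) < min_players:
        continue
    r = group['Caps'].corr(group['Wage'])  # Pearson correlation by default
    if r > threshold:
        strong[nation] = r
print(strong)
```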

**Then, I made a boxplot that represents wages of players in different leagues split by position.**

In [None]:
plt.show(sns.boxplot(data=df, x='League', y='Wage', hue='Position', palette='dark', width=0.8).set_title('Wages of players in the top 6 leagues separated by positions.', fontdict={'size': 30}))

The descriptive statistics showed that midfielders have the highest salary on average. Looking at each league separately in this box plot (which shows medians), however, this holds only for the Premier League (where midfielders are actually tied with defenders) and the Primeira Liga. In Serie A and the Bundesliga, for example, the median wages of defenders and forwards are higher than the median wage of midfielders. In La Liga and Ligue 1 the goalkeepers' median salary is the highest, which is surprising.

# My hypothesis
**In the overview I displayed pie charts that show the share of each nationality in every league and found that national players dominate in each of the top 6 leagues. This led me to the hypothesis that players prefer to play in their national league because *they are paid better there* than in other leagues. To check this assumption, I created a catplot that shows the average salaries of national players and of all other players combined in the top 6 leagues.**

In [None]:
leagues_natonality = {'La Liga': 'ESP', 'Serie A': 'ITA', 'Premier League': 'ENG', 'Bundesliga': 'GER', 'Ligue 1 Uber Eats': 'FRA', 'Primiera Liga': 'POR'}
rows = []
for league, nation in leagues_natonality.items():
    in_league = df[df['League'] == league]
    # Mean wage of players from the league's own country vs. everyone else
    national_mean_wage = int(in_league.loc[in_league['Nation'] == nation, 'Wage'].mean())
    other_mean_wage = int(in_league.loc[in_league['Nation'] != nation, 'Wage'].mean())
    rows.append({'League': league, 'Nation': 'National', 'Wage': national_mean_wage})
    rows.append({'League': league, 'Nation': 'Other', 'Wage': other_mean_wage})
# Build the frame once instead of concatenating onto an empty DataFrame in the loop
wages = pd.DataFrame(rows, columns=['League', 'Nation', 'Wage'])

plt.show(sns.catplot(data=wages, kind="bar", x="League", y="Wage", hue="Nation", palette="dark", alpha=.6, aspect=3,  height=6).set(title='Mean wages of national and other players in the top 6 leagues.'))    

Surprisingly, the mean wage of national players in the top 6 leagues is lower than the mean wage of players from other nations in the same leagues. This suggests that football managers and clubs do not prioritise national players over others in terms of salary. So perhaps these players are paid better outside their national leagues?
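The national-versus-other split can also be computed without a loop, and adding group sizes and medians next to the means helps to judge how robust the comparison is. The sketch below uses made-up leagues, nation codes and wages; `home_nation` is a hypothetical mapping in the spirit of the dictionary used above.

```python
import pandas as pd

# Hypothetical league-to-nation mapping and player rows
home_nation = {'League A': 'AAA', 'League B': 'BBB'}
toy = pd.DataFrame({
    'League': ['League A', 'League A', 'League A', 'League B', 'League B'],
    'Nation': ['AAA', 'AAA', 'CCC', 'BBB', 'AAA'],
    'Wage':   [100, 200, 600, 300, 500],
})

# A player is 'National' when his nation matches the home nation of his league
is_national = toy['Nation'] == toy['League'].map(home_nation)
toy['Group'] = is_national.map({True: 'National', False: 'Other'})

# Group size, mean and median wage per league and group
summary = toy.groupby(['League', 'Group'])['Wage'].agg(['count', 'mean', 'median'])
print(summary)
```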

In [None]:
sns.set(rc={"figure.figsize":(30,12)})
wages2 = df[df.Nation.isin(leagues_natonality.values()) ]
plt.show(sns.boxplot(data=wages2, x='Nation', y='Wage', hue='League', palette='dark', width=0.8).set(title='Wages of players whose national league is in the top 6.'))

It turns out that reality contradicts my hypothesis: playing in your national league is not the best option in terms of wage. Spanish, Italian, French and Portuguese footballers are better off playing in the English Premier League if they want a more competitive salary. For German players, however, the French Ligue 1 is the most beneficial. Finally, English players are paid the most in the Italian Serie A.
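The "best league per nation" reading of the box plot can be cross-checked with a pivot table of median wages. This is only a sketch on made-up nations, leagues and wages; on the real data it would also make sense to look at how many players stand behind each cell.

```python
import pandas as pd

# Hypothetical wages for players of two nations across two made-up leagues
toy = pd.DataFrame({
    'Nation': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'League': ['League A', 'League B', 'League B', 'League A', 'League A', 'League B'],
    'Wage':   [100, 300, 500, 200, 400, 100],
})

# Median wage for every (nation, league) pair
medians = toy.pivot_table(index='Nation', columns='League', values='Wage', aggfunc='median')
print(medians)

# League with the highest median wage for each nation
best_league = medians.idxmax(axis=1)
print(best_league)
```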