# Transfermarkt Transfers

The dataset was politely scraped from [transfermarkt.com](https://www.transfermarkt.com/statistik/transfertage).

## Preparation

### Import Libraries

In [2]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import sys
import datetime
from IPython.core.interactiveshell import InteractiveShell

pd.set_option('display.max_columns', None)
InteractiveShell.ast_node_interactivity = "all"
warnings.filterwarnings('ignore')

print("python version: ", sys.version)
print("pandas version: ", pd.__version__)
print("seaborn version: ", sns.__version__)

print("last run: ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))


python version:  3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
pandas version:  1.1.3
seaborn version:  0.11.0
last run:  2021-11-05 18:51:00


### Loading Dataset

In [3]:
df = pd.read_csv("transfermarkt-transfer.csv")

# normalize column names to lowercase snake_case
df.columns = [x.lower().replace('-', '_').replace(' ', '_') for x in df.columns.to_list()]

df.sample(5, random_state=1)

Unnamed: 0,player_id,name,age,position,national_1,national_2,left_club,left_club_league,joined_club,joined_club_league,transfer_date,transfer_date_p,market_value,market_value_p,fee,left_club_country,joined_club_country,loan_fee,loan_fee_p,created_at,updated_at
6604,335736,Iliyan Popov,22.0,Right-Back,Bulgaria,,Yantra,Vtora Liga,Without Club,,"Oct 10, 2020",2020-10-10,€50Th.,50000,-,Bulgaria,,,,2021-10-04 16:12:54,2021-10-04 16:12:54
11841,255383,Michal Helik,25.0,Centre-Back,Poland,,Cracovia,Ekstraklasa,Barnsley FC,Championship,"Sep 9, 2020",2020-09-09,€600Th.,600000,€800Th.,Poland,England,,,2021-10-04 16:39:13,2021-10-04 16:39:13
34358,223225,Dan Crowley,23.0,Attacking Midfield,England,Ireland,Birmingham,Championship,Hull City,League One,"Jan 18, 2021",2021-01-18,€1.50m,1500000,loan transfer,England,England,,,2021-11-02 11:42:18,2021-11-02 11:42:18
16016,156629,Ivan Kovacec,32.0,Left Winger,Croatia,,SV Ried,Bundesliga,NK Zagorec,,"Aug 31, 2020",2020-08-31,€200Th.,200000,free transfer,Austria,Croatia,,,2021-10-04 17:01:01,2021-10-04 17:01:01
7583,399336,Vincenzo Garofalo,21.0,Central Midfield,Italy,,Avellino,Serie C - C,Foggia,Serie C - C,"Oct 5, 2020",2020-10-05,€50Th.,50000,free transfer,Italy,Italy,,,2021-10-04 16:17:20,2021-10-04 16:17:20


## Dataset Information & Description

### Data Sample

In [4]:
df.sample(5, random_state=1)

Unnamed: 0,player_id,name,age,position,national_1,national_2,left_club,left_club_league,joined_club,joined_club_league,transfer_date,transfer_date_p,market_value,market_value_p,fee,left_club_country,joined_club_country,loan_fee,loan_fee_p,created_at,updated_at
6604,335736,Iliyan Popov,22.0,Right-Back,Bulgaria,,Yantra,Vtora Liga,Without Club,,"Oct 10, 2020",2020-10-10,€50Th.,50000,-,Bulgaria,,,,2021-10-04 16:12:54,2021-10-04 16:12:54
11841,255383,Michal Helik,25.0,Centre-Back,Poland,,Cracovia,Ekstraklasa,Barnsley FC,Championship,"Sep 9, 2020",2020-09-09,€600Th.,600000,€800Th.,Poland,England,,,2021-10-04 16:39:13,2021-10-04 16:39:13
34358,223225,Dan Crowley,23.0,Attacking Midfield,England,Ireland,Birmingham,Championship,Hull City,League One,"Jan 18, 2021",2021-01-18,€1.50m,1500000,loan transfer,England,England,,,2021-11-02 11:42:18,2021-11-02 11:42:18
16016,156629,Ivan Kovacec,32.0,Left Winger,Croatia,,SV Ried,Bundesliga,NK Zagorec,,"Aug 31, 2020",2020-08-31,€200Th.,200000,free transfer,Austria,Croatia,,,2021-10-04 17:01:01,2021-10-04 17:01:01
7583,399336,Vincenzo Garofalo,21.0,Central Midfield,Italy,,Avellino,Serie C - C,Foggia,Serie C - C,"Oct 5, 2020",2020-10-05,€50Th.,50000,free transfer,Italy,Italy,,,2021-10-04 16:17:20,2021-10-04 16:17:20


### Columns Description

| Column | Description | Data Type |
| --- | ----------- | ------- |
| player_id | The player's ID on the Transfermarkt site | - |
| name | The player's name | - |
| age | The player's age | numerical - discrete |
| position | The player's position | categorical - nominal |
| national_1 | The player's nationality | categorical - nominal |
| national_2 | The player's other nationality | categorical - nominal |
| left_club | The player's former club | categorical - nominal |
| left_club_league | The league of the player's former club | categorical - nominal |
| left_club_country | The country of the player's former club | categorical - nominal |
| joined_club | The player's new club | categorical - nominal |
| joined_club_league | The league of the player's new club | categorical - nominal |
| joined_club_country | The country of the player's new club | categorical - nominal |
| transfer_date | The transfer date (raw text, e.g. `Oct 10, 2020`) | categorical - nominal |
| transfer_date_p | The transfer date (parsed, e.g. `2020-10-10`) | date |
| market_value | The player's market value (raw text, e.g. `€50Th.`) | categorical - nominal |
| market_value_p | The player's market value (parsed, e.g. `50000`) | numerical - discrete |
| fee | The transfer fee as displayed (e.g. `€800Th.`, `free transfer`, `loan transfer`, `-`) | categorical - nominal |
| loan_fee | The loan fee (raw text) | categorical - nominal |
| loan_fee_p | The loan fee (parsed) | numerical - discrete |
| created_at | Time the record was created (scraped) | timestamp |
| updated_at | Time the record was last updated (after scraping) | timestamp |
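
The relationship between a raw column and its parsed `_p` counterpart can be illustrated with a small sketch. The helper `parse_euro_value` below is hypothetical and not part of the original scraper. It assumes only the two suffixes visible in the sample rows (`Th.` for thousands and `m` for millions) and returns `None` for non-amounts such as `free transfer` or `-`.

```python
import re
from typing import Optional

# Hypothetical helper (an assumption, not the scraper's actual code):
# converts a raw Transfermarkt value string into an integer amount in euros.
_MULTIPLIERS = {"Th.": 1_000, "m": 1_000_000}
_VALUE_RE = re.compile(r"^€(\d+(?:\.\d+)?)(Th\.|m)?$")


def parse_euro_value(raw: object) -> Optional[int]:
    """Return the amount in euros, or None when the text is not an amount."""
    if not isinstance(raw, str):
        # Missing values read by pandas arrive as float NaN, not as strings.
        return None
    match = _VALUE_RE.match(raw.strip())
    if match is None:
        # Covers "-", "free transfer", "loan transfer" and similar labels.
        return None
    number, suffix = match.groups()
    # No suffix means the number is already expressed in euros.
    return round(float(number) * _MULTIPLIERS.get(suffix, 1))


for text in ["€50Th.", "€600Th.", "€1.50m", "free transfer", "-"]:
    print(text, "->", parse_euro_value(text))
```

Under these assumptions, applying the helper with `df["fee"].map(parse_euro_value)` would give a numeric fee column comparable to `market_value_p`.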


### Data Information

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38212 entries, 0 to 38211
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   player_id            38212 non-null  int64  
 1   name                 38212 non-null  object 
 2   age                  37992 non-null  float64
 3   position             38212 non-null  object 
 4   national_1           38211 non-null  object 
 5   national_2           5940 non-null   object 
 6   left_club            38212 non-null  object 
 7   left_club_league     27721 non-null  object 
 8   joined_club          37776 non-null  object 
 9   joined_club_league   26800 non-null  object 
 10  transfer_date        38212 non-null  object 
 11  transfer_date_p      38212 non-null  object 
 12  market_value         38212 non-null  object 
 13  market_value_p       38212 non-null  int64  
 14  fee                  38211 non-null  object 
 15  left_club_country    35180 non-null 
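
The output above lists `transfer_date_p` with dtype `object`, so the parsed dates are still stored as strings. A minimal sketch of converting them to real datetimes follows. It uses a toy frame (`toy`, a name introduced here) with values copied from the sample rows. In the notebook the same calls would be applied to `df`, assuming every row follows the formats seen in the sample.

```python
import pandas as pd

# Toy frame with values copied from the sample rows; stands in for `df`.
toy = pd.DataFrame({
    "transfer_date_p": ["2020-10-10", "2020-09-09", "2021-01-18"],
    "created_at": [
        "2021-10-04 16:12:54",
        "2021-10-04 16:39:13",
        "2021-11-02 11:42:18",
    ],
})

# Explicit formats make the conversion fail loudly on unexpected values.
toy["transfer_date_p"] = pd.to_datetime(toy["transfer_date_p"], format="%Y-%m-%d")
toy["created_at"] = pd.to_datetime(toy["created_at"], format="%Y-%m-%d %H:%M:%S")

print(toy.dtypes)
# Datetime columns expose the .dt accessor, e.g. transfers per year.
print(toy["transfer_date_p"].dt.year.value_counts().sort_index())
```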

## Preprocessing

### Check for Duplicates

In [6]:
df.duplicated().sum()

0
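
`df.duplicated()` only flags rows that match in every column, so a transfer scraped twice with different `created_at` timestamps would not be counted. The sketch below shows a key-based check on toy data. The second row is an invented re-scrape added for illustration, and the choice of key columns is an assumption rather than something the dataset documents.

```python
import pandas as pd

# Toy data: row 2 repeats row 1 with a later (invented) scrape timestamp.
toy = pd.DataFrame({
    "player_id": [335736, 335736, 255383],
    "joined_club": ["Without Club", "Without Club", "Barnsley FC"],
    "transfer_date_p": ["2020-10-10", "2020-10-10", "2020-09-09"],
    "created_at": [
        "2021-10-04 16:12:54",
        "2021-11-02 11:42:18",
        "2021-10-04 16:39:13",
    ],
})

# Assumed natural key for a transfer; adjust to the dataset's real semantics.
key = ["player_id", "joined_club", "transfer_date_p"]

full_row_dupes = int(toy.duplicated().sum())         # compares all columns
key_dupes = int(toy.duplicated(subset=key).sum())    # compares key columns only

print("full-row duplicates:", full_row_dupes)
print("key duplicates:", key_dupes)
```

If the key-based count is non-zero on the real data, `df.drop_duplicates(subset=key, keep="last")` would keep one row per key, here the one that appears last.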

## Exploratory Data Analysis

### Descriptive Statistics

#### Numerical

In [7]:
df.select_dtypes(include='number').describe()


Unnamed: 0,player_id,age,market_value_p,loan_fee_p
count,38212.0,37992.0,38212.0,200.0
mean,348299.596252,25.532244,401476.8,937365.0
std,194682.885114,4.69017,1816931.0,1902302.0
min,532.0,15.0,10000.0,2000.0
25%,189447.0,22.0,50000.0,103750.0
50%,340345.5,25.0,100000.0,450000.0
75%,497798.25,29.0,250000.0,1000000.0
max,860256.0,45.0,81000000.0,20000000.0


#### Categorical

In [None]:
df.select_dtypes(exclude='number').describe()
