# Preprocessing

### Demonstration Code Disclaimer

Please note that the code in this document is intended **solely for demonstration purposes** and is not meant to be run as is. It was written to process large datasets that, because of their size, could not be hosted on GitHub.

### Data Source

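The data comes from CSV files obtained from [Kaggle](SOURCE). The notebook loads `appearances.csv`, `clubs.csv`, and `players.csv`; only the players and appearances tables are processed below.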

### Python Notebook Objective

The primary focus of this Python notebook is to:

1. Filter the dataset to include only players whose current club plays in selected domestic competitions.
2. Merge the tables into a single dataset.
3. Compute new features based on the combined data.
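
The three steps above can be sketched on toy data. This is only an illustration under stated assumptions: the rows are invented, and `goals_per_90` is a hypothetical feature chosen for the example, not necessarily one computed in this project.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are invented for illustration)
players = pd.DataFrame({
    'player_id': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'current_club_domestic_competition_id': ['GB1', 'NL1', 'ES1'],
})
appearances = pd.DataFrame({
    'player_id': [1, 1, 2, 3],
    'goals': [1, 0, 2, 1],
    'minutes_played': [90, 45, 90, 90],
})

# 1. Filter players by the competition of their current club
selected = players[players['current_club_domestic_competition_id'].isin(['GB1', 'ES1'])]

# 2. Aggregate appearances per player, then merge into a single table
totals = appearances.groupby('player_id', as_index=False)[['goals', 'minutes_played']].sum()
combined = selected.merge(totals, on='player_id', how='left')

# 3. Compute a new feature (hypothetical example: goals per 90 minutes)
combined['goals_per_90'] = combined['goals'] / combined['minutes_played'] * 90

print(combined[['name', 'goals', 'minutes_played', 'goals_per_90']])
```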

## Import libraries, define the working directory and load the data

In [3]:
import pandas as pd
# Define the path to the data folder
data_folder = 'data/original/'

To work with the data, we first load the CSV files into pandas DataFrames:

In [4]:
# Load appearances data
appearances_df = pd.read_csv(data_folder + 'appearances.csv')

# Load clubs data
clubs_df = pd.read_csv(data_folder + 'clubs.csv')

# Load players data
players_df = pd.read_csv(data_folder + 'players.csv')
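
Because the original files are not included in the repository, a loader that fails early with a clear message can be useful. The sketch below is one possible approach; `load_tables` is a hypothetical helper, not part of the original notebook, and it assumes each table is stored as `<name>.csv` in the given folder.

```python
from pathlib import Path

import pandas as pd


def load_tables(folder, names=('appearances', 'clubs', 'players')):
    """Load <name>.csv for each name and return a dict of DataFrames.

    Hypothetical helper: raises FileNotFoundError listing every missing file
    instead of failing on the first pd.read_csv call.
    """
    folder = Path(folder)
    missing = [n for n in names if not (folder / f'{n}.csv').is_file()]
    if missing:
        raise FileNotFoundError(f'Missing CSV files in {folder}: {missing}')
    return {n: pd.read_csv(folder / f'{n}.csv') for n in names}


# Usage (requires the original data to be present locally):
# tables = load_tables('data/original/')
# players_df = tables['players']
```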


## Players data 

This section outlines the steps taken to filter the player data based on selected competitions and to save the filtered dataset.



### Displaying Initial Data

First, we display the first few rows of the player data to understand its structure:

In [12]:
# Display the first few rows of the players DataFrame
players_df.head()

Unnamed: 0,player_id,first_name,last_name,name,last_season,current_club_id,player_code,country_of_birth,city_of_birth,country_of_citizenship,...,foot,height_in_cm,contract_expiration_date,agent_name,image_url,url,current_club_domestic_competition_id,current_club_name,market_value_in_eur,highest_market_value_in_eur
0,10,Miroslav,Klose,Miroslav Klose,2015,398,miroslav-klose,Poland,Opole,Germany,...,right,184.0,,ASBW Sport Marketing,https://img.a.transfermarkt.technology/portrai...,https://www.transfermarkt.co.uk/miroslav-klose...,IT1,Società Sportiva Lazio S.p.A.,1000000.0,30000000.0
1,26,Roman,Weidenfeller,Roman Weidenfeller,2017,16,roman-weidenfeller,Germany,Diez,Germany,...,left,190.0,,Neubauer 13 GmbH,https://img.a.transfermarkt.technology/portrai...,https://www.transfermarkt.co.uk/roman-weidenfe...,L1,Borussia Dortmund,750000.0,8000000.0
2,65,Dimitar,Berbatov,Dimitar Berbatov,2015,1091,dimitar-berbatov,Bulgaria,Blagoevgrad,Bulgaria,...,,,,CSKA-AS-23 Ltd.,https://img.a.transfermarkt.technology/portrai...,https://www.transfermarkt.co.uk/dimitar-berbat...,GR1,Panthessalonikios Athlitikos Omilos Konstantin...,1000000.0,34500000.0
3,77,,Lúcio,Lúcio,2012,506,lucio,Brazil,Brasília,Brazil,...,,,,,https://img.a.transfermarkt.technology/portrai...,https://www.transfermarkt.co.uk/lucio/profil/s...,IT1,Juventus Football Club,200000.0,24500000.0
4,80,Tom,Starke,Tom Starke,2017,27,tom-starke,East Germany (GDR),Freital,Germany,...,right,194.0,,IFM,https://img.a.transfermarkt.technology/portrai...,https://www.transfermarkt.co.uk/tom-starke/pro...,L1,FC Bayern München,100000.0,3000000.0


### Filtering 
First, we define the list of competition IDs we are interested in:

In [None]:
# Define the list of competition IDs to include
competition_ids = ['GB1', 'IT1', 'ES1', 'L1', 'FR1']

Next, we filter the `players_df` DataFrame to include only players from the specified competitions:

In [None]:
# Filter the players DataFrame by the selected competition IDs
players_df_filtered = players_df[players_df['current_club_domestic_competition_id'].isin(competition_ids)]
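
To see how the `isin` filter behaves, here is a small sketch on invented rows. Note that rows with a missing competition ID are dropped, and that the surviving rows keep their original index labels, which is why a filtered table can show gaps in its index.

```python
import pandas as pd

# Invented rows for illustration only
toy_players = pd.DataFrame({
    'name': ['P1', 'P2', 'P3', 'P4'],
    'current_club_domestic_competition_id': ['GB1', 'GR1', 'L1', None],
})
competition_ids = ['GB1', 'IT1', 'ES1', 'L1', 'FR1']

# Boolean mask: True where the competition ID is one of the selected values
mask = toy_players['current_club_domestic_competition_id'].isin(competition_ids)

# .copy() gives an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
toy_filtered = toy_players[mask].copy()

print(toy_filtered)
print(len(toy_players), 'rows before,', len(toy_filtered), 'rows after')
```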

### Selecting Relevant Columns
We then specify the columns we want to retain in our filtered DataFrame:

In [None]:
# Define the columns to keep
columns_players = [
    'name',
    'current_club_id',
    'country_of_citizenship',
    'date_of_birth',
    'position',
    'foot', 
    'height_in_cm',
    'market_value_in_eur',
    'highest_market_value_in_eur',
    'player_id'
]

# Apply the column filter to the DataFrame
players_df_filtered = players_df_filtered[columns_players]
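
Selecting columns by name raises a `KeyError` if any name is misspelled or absent from the file. A minimal sketch of a more explicit check follows; `select_columns` is a hypothetical helper introduced for illustration and the row is invented.

```python
import pandas as pd


def select_columns(df, columns):
    """Return a copy of df restricted to `columns`, in the given order.

    Hypothetical helper: reports all missing column names at once.
    """
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f'Columns not found: {missing}')
    return df.loc[:, columns].copy()


# Invented single-row example
toy = pd.DataFrame({'player_id': [10], 'name': ['Miroslav Klose'], 'foot': ['right']})
subset = select_columns(toy, ['name', 'player_id'])
print(subset.columns.tolist())
```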

### Display Filtered Data
After filtering, we can display the resulting DataFrame:

In [19]:
players_df_filtered

Unnamed: 0,name,current_club_id,country_of_citizenship,date_of_birth,position,foot,height_in_cm,market_value_in_eur,highest_market_value_in_eur,player_id
0,Miroslav Klose,398,Germany,1978-06-09,Attack,right,184.0,1000000.0,30000000.0,10
1,Roman Weidenfeller,16,Germany,1980-08-06,Goalkeeper,left,190.0,750000.0,8000000.0,26
3,Lúcio,506,Brazil,1978-05-08,Defender,,,200000.0,24500000.0,77
4,Tom Starke,27,Germany,1981-03-18,Goalkeeper,right,194.0,100000.0,3000000.0,80
6,Christoph Metzelder,33,Germany,1980-11-05,Defender,,,1500000.0,9500000.0,123
...,...,...,...,...,...,...,...,...,...,...
30421,Yunus Emre Konak,1148,Turkey,2006-01-10,Midfield,right,181.0,1800000.0,1800000.0,1141628
30439,Luck Zogbé,3911,Cote d'Ivoire,2005-03-24,Defender,left,,100000.0,100000.0,1144999
30443,Mateus Lusuardi,8970,Brazil,2004-01-08,Defender,left,190.0,50000.0,50000.0,1159022
30455,Mahamadou Nagida,273,Cameroon,2005-06-28,Defender,left,176.0,150000.0,150000.0,1176345


### Saving the Filtered Data
Finally, we save the filtered data to a CSV file:

In [20]:
# Save the filtered DataFrame to a CSV file
players_df_filtered.to_csv('data/filtered/players_filtered.csv')
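
By default, `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column when the file is read again. If the index carries no meaning, passing `index=False` avoids this. The sketch below uses in-memory buffers and invented rows to show the difference.

```python
import io

import pandas as pd

# Invented rows; the index has a gap, as it would after filtering
toy = pd.DataFrame({'name': ['A', 'B'], 'player_id': [10, 26]}, index=[0, 3])

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(reloaded.columns.tolist())
print(reloaded_clean.columns.tolist())
```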

## Appearances data
This section of the notebook focuses on processing the appearances data for a set of football players. We'll filter the data to include only relevant statistics for each player and then save this filtered dataset for future analysis.

### Define Columns of Interest

We will filter the DataFrame to only include the columns that are relevant to our analysis.

In [None]:
# Define the columns that we are interested in
columns_appearances = [
    'player_id',
    'yellow_cards',
    'red_cards',
    'goals',
    'assists',
    'minutes_played'
]

# Filter the DataFrame to include only the columns of interest
appearances_df_filtered = appearances_df[columns_appearances]

### Filter Data Based on Unique Player IDs

We will further filter the data based on a list of unique player IDs to ensure we're analyzing the correct subset of players.

In [None]:
# Assuming players_df_filtered has been defined previously and contains the 'player_id' column
player_ids = players_df_filtered['player_id'].unique()

# Keep only the rows whose player ID is in our list, starting from the
# column-filtered DataFrame so that the column selection is preserved
appearances_df_filtered = appearances_df_filtered[appearances_df_filtered['player_id'].isin(player_ids)]
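
The row filter and the column selection can also be applied in a single `.loc` call, which makes it harder to lose one of the two by accident. This is a sketch on invented rows, not the notebook's original code.

```python
import pandas as pd

# Invented rows for illustration only
toy_appearances = pd.DataFrame({
    'player_id': [10, 26, 999, 10],
    'game_id': [1, 1, 2, 3],
    'goals': [1, 0, 2, 0],
    'minutes_played': [90, 90, 45, 60],
})
player_ids = [10, 26]
keep = ['player_id', 'goals', 'minutes_played']

# Rows: players in player_ids; columns: only those listed in keep
result = toy_appearances.loc[toy_appearances['player_id'].isin(player_ids), keep]
print(result)
```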

For reference, the first rows of the original, unfiltered `appearances_df` look like this:

In [21]:
appearances_df.head()

Unnamed: 0,appearance_id,game_id,player_id,player_club_id,player_current_club_id,date,player_name,competition_id,yellow_cards,red_cards,goals,assists,minutes_played
0,2231978_38004,2231978,38004,853,235,2012-07-03,Aurélien Joachim,CLQ,0,0,2,0,90
1,2233748_79232,2233748,79232,8841,2698,2012-07-05,Ruslan Abyshov,ELQ,0,0,0,0,90
2,2234413_42792,2234413,42792,6251,465,2012-07-05,Sander Puri,ELQ,0,0,0,0,45
3,2234418_73333,2234418,73333,1274,6646,2012-07-05,Vegar Hedenstad,ELQ,0,0,0,0,90
4,2234421_122011,2234421,122011,195,3008,2012-07-05,Markus Henriksen,ELQ,0,0,0,1,90


### Save the Filtered Data

Finally, we will save the filtered data to a CSV file for further analysis or future use.

In [None]:
# Save the filtered DataFrame to a CSV file
appearances_df_filtered.to_csv('data/filtered/appearances_filtered.csv')