Data Preprocessing

Import the libraries

In [141]:
import pandas as pd

Load the dataset

In [142]:
df = pd.read_csv("seria.csv")
print(df.head().to_markdown())

|    | Player          | Team   |   Shirt Number | Nation   | Position   | Age    |   Minutes |   Goals |   Assists |   Penalty Shoot on Goal |   Penalty Shoot |   Total Shoot |   Shoot on Target |   Yellow Cards |   Red Cards |   Touches |   Dribbles |   Tackles |   Blocks |   Expected Goals (xG) |   Non-Penalty xG (npxG) |   Expected Assists (xAG) |   Shot-Creating Actions |   Goal-Creating Actions |   Passes Completed |   Passes Attempted | Pass Completion %   |   Progressive Passes |   Carries |   Progressive Carries |   Dribble Attempts |   Successful Dribbles | Date      |
|---:|:----------------|:-------|---------------:|:---------|:-----------|:-------|----------:|--------:|----------:|------------------------:|----------------:|--------------:|------------------:|---------------:|------------:|----------:|-----------:|----------:|---------:|----------------------:|------------------------:|-------------------------:|------------------------:|------------------------:|---------

Replace commas with periods and convert to float in "Pass Completion %" column

In [143]:
df["Pass Completion %"] = df["Pass Completion %"].str.replace(",", ".", regex=False).astype(float)
print(df.head().to_markdown())

|    | Player          | Team   |   Shirt Number | Nation   | Position   | Age    |   Minutes |   Goals |   Assists |   Penalty Shoot on Goal |   Penalty Shoot |   Total Shoot |   Shoot on Target |   Yellow Cards |   Red Cards |   Touches |   Dribbles |   Tackles |   Blocks |   Expected Goals (xG) |   Non-Penalty xG (npxG) |   Expected Assists (xAG) |   Shot-Creating Actions |   Goal-Creating Actions |   Passes Completed |   Passes Attempted |   Pass Completion % |   Progressive Passes |   Carries |   Progressive Carries |   Dribble Attempts |   Successful Dribbles | Date      |
|---:|:----------------|:-------|---------------:|:---------|:-----------|:-------|----------:|--------:|----------:|------------------------:|----------------:|--------------:|------------------:|---------------:|------------:|----------:|-----------:|----------:|---------:|----------------------:|------------------------:|-------------------------:|------------------------:|------------------------:|---------
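
If the column might also contain non-numeric placeholders, `astype(float)` would raise an error. A minimal sketch of a more tolerant variant, on a made-up stand-in series (the values and the "n/a" placeholder are assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the "Pass Completion %" column (values are made up)
raw = pd.Series(["85,7", "90,1", None, "n/a"])

# Swap the decimal comma for a period, then coerce anything unparseable to NaN
cleaned = pd.to_numeric(raw.str.replace(",", ".", regex=False), errors="coerce")
print(cleaned)
```

Here `errors="coerce"` turns unparseable entries into NaN, which a later fill step can then handle.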

Replace commas with " and " in "Position" column

In [144]:
df["Position"] = df["Position"].str.replace(",", " and ", regex=False)
print(df.head().to_markdown())

|    | Player          | Team   |   Shirt Number | Nation   | Position   | Age    |   Minutes |   Goals |   Assists |   Penalty Shoot on Goal |   Penalty Shoot |   Total Shoot |   Shoot on Target |   Yellow Cards |   Red Cards |   Touches |   Dribbles |   Tackles |   Blocks |   Expected Goals (xG) |   Non-Penalty xG (npxG) |   Expected Assists (xAG) |   Shot-Creating Actions |   Goal-Creating Actions |   Passes Completed |   Passes Attempted |   Pass Completion % |   Progressive Passes |   Carries |   Progressive Carries |   Dribble Attempts |   Successful Dribbles | Date      |
|---:|:----------------|:-------|---------------:|:---------|:-----------|:-------|----------:|--------:|----------:|------------------------:|----------------:|--------------:|------------------:|---------------:|------------:|----------:|-----------:|----------:|---------:|----------------------:|------------------------:|-------------------------:|------------------------:|------------------------:|---------

Fill missing values in "Pass Completion %" with the mean

In [145]:
df['Pass Completion %'] = df['Pass Completion %'].fillna(df['Pass Completion %'].mean())
print(df.head().to_markdown())

|    | Player          | Team   |   Shirt Number | Nation   | Position   | Age    |   Minutes |   Goals |   Assists |   Penalty Shoot on Goal |   Penalty Shoot |   Total Shoot |   Shoot on Target |   Yellow Cards |   Red Cards |   Touches |   Dribbles |   Tackles |   Blocks |   Expected Goals (xG) |   Non-Penalty xG (npxG) |   Expected Assists (xAG) |   Shot-Creating Actions |   Goal-Creating Actions |   Passes Completed |   Passes Attempted |   Pass Completion % |   Progressive Passes |   Carries |   Progressive Carries |   Dribble Attempts |   Successful Dribbles | Date      |
|---:|:----------------|:-------|---------------:|:---------|:-----------|:-------|----------:|--------:|----------:|------------------------:|----------------:|--------------:|------------------:|---------------:|------------:|----------:|-----------:|----------:|---------:|----------------------:|------------------------:|-------------------------:|------------------------:|------------------------:|---------
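
Mean imputation is sensitive to outliers, so the median is sometimes preferred. A small sketch comparing the two on made-up values (the numbers are illustrative only):

```python
import pandas as pd

# Toy percentages with one missing entry (values are made up)
pct = pd.Series([60.0, None, 90.0, 96.0])

# Fill the gap with the mean of the observed values: (60 + 90 + 96) / 3
mean_filled = pct.fillna(pct.mean())

# Fill the gap with the median of the observed values instead
median_filled = pct.fillna(pct.median())

print(mean_filled.iloc[1], median_filled.iloc[1])
```

Both `mean()` and `median()` skip NaN by default, so the fill value is computed from the observed entries only.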

Clean the "Age" column
The "Age" column is in 'YY-DDD' format, so we keep the part before the hyphen (the years) and convert it to an integer

In [146]:
df['Age'] = df['Age'].apply(lambda x: int(x.split('-')[0]))
print(df.head().to_markdown())

|    | Player          | Team   |   Shirt Number | Nation   | Position   |   Age |   Minutes |   Goals |   Assists |   Penalty Shoot on Goal |   Penalty Shoot |   Total Shoot |   Shoot on Target |   Yellow Cards |   Red Cards |   Touches |   Dribbles |   Tackles |   Blocks |   Expected Goals (xG) |   Non-Penalty xG (npxG) |   Expected Assists (xAG) |   Shot-Creating Actions |   Goal-Creating Actions |   Passes Completed |   Passes Attempted |   Pass Completion % |   Progressive Passes |   Carries |   Progressive Carries |   Dribble Attempts |   Successful Dribbles | Date      |
|---:|:----------------|:-------|---------------:|:---------|:-----------|------:|----------:|--------:|----------:|------------------------:|----------------:|--------------:|------------------:|---------------:|------------:|----------:|-----------:|----------:|---------:|----------------------:|------------------------:|-------------------------:|------------------------:|------------------------:|-----------
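
The `apply` call above assumes every "Age" value is a string. A sketch of a variant that tolerates missing entries, using the string accessor and pandas' nullable integer type (the sample values are made up):

```python
import pandas as pd

# Toy "Age" values in 'YY-DDD' format, with one missing entry (made up)
age_raw = pd.Series(["25-123", "31-004", None])

# Keep the part before the hyphen; missing entries stay missing
years_text = age_raw.str.split("-").str[0]

# Convert to numbers, then to a nullable integer dtype that can hold <NA>
years = pd.to_numeric(years_text).astype("Int64")
print(years)
```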

Convert "Date" column to datetime objects

In [147]:
df["Date"] = pd.to_datetime(df["Date"])
print(df.head().to_markdown())

|    | Player          | Team   |   Shirt Number | Nation   | Position   |   Age |   Minutes |   Goals |   Assists |   Penalty Shoot on Goal |   Penalty Shoot |   Total Shoot |   Shoot on Target |   Yellow Cards |   Red Cards |   Touches |   Dribbles |   Tackles |   Blocks |   Expected Goals (xG) |   Non-Penalty xG (npxG) |   Expected Assists (xAG) |   Shot-Creating Actions |   Goal-Creating Actions |   Passes Completed |   Passes Attempted |   Pass Completion % |   Progressive Passes |   Carries |   Progressive Carries |   Dribble Attempts |   Successful Dribbles | Date                |
|---:|:----------------|:-------|---------------:|:---------|:-----------|------:|----------:|--------:|----------:|------------------------:|----------------:|--------------:|------------------:|---------------:|------------:|----------:|-----------:|----------:|---------:|----------------------:|------------------------:|-------------------------:|------------------------:|------------------------:|-
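
When the date format is known, passing it explicitly avoids ambiguous parsing. A sketch assuming ISO-style `YYYY-MM-DD` strings (the format and the sample values are assumptions, since the raw "Date" values are not shown here):

```python
import pandas as pd

# Toy date strings; the second one is deliberately invalid (values are made up)
date_raw = pd.Series(["2023-08-19", "not a date"])

# Parse with an explicit format; entries that do not match become NaT
parsed = pd.to_datetime(date_raw, format="%Y-%m-%d", errors="coerce")
print(parsed)
```

Counting `parsed.isna()` afterwards gives a quick check on how many rows failed to parse.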

Check for missing values

In [148]:
df_missing = df.isnull().sum()
print(df_missing)

Player                    0
Team                      0
Shirt Number              0
Nation                    0
Position                  0
Age                       0
Minutes                   0
Goals                     0
Assists                   0
Penalty Shoot on Goal     0
Penalty Shoot             0
Total Shoot               0
Shoot on Target           0
Yellow Cards              0
Red Cards                 0
Touches                   0
Dribbles                  0
Tackles                   0
Blocks                    0
Expected Goals (xG)       0
Non-Penalty xG (npxG)     0
Expected Assists (xAG)    0
Shot-Creating Actions     0
Goal-Creating Actions     0
Passes Completed          0
Passes Attempted          0
Pass Completion %         0
Progressive Passes        0
Carries                   0
Progressive Carries       0
Dribble Attempts          0
Successful Dribbles       0
Date                      0
dtype: int64
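
With this many columns, it can help to print only the ones that actually have gaps. A minimal sketch on a made-up frame (column names and values are illustrative):

```python
import pandas as pd

# Toy frame with gaps in two of its three columns (values are made up)
toy = pd.DataFrame({
    "player": ["A", "B", "C"],
    "goals": [1, None, 0],
    "assists": [None, None, 2],
})

# Count missing values per column, keep only columns with at least one gap
missing = toy.isnull().sum()
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)
```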


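Check for and drop duplicates

In [149]:
# Flag rows that exactly repeat an earlier row
df_duplicated = df.duplicated()
print(df_duplicated)

# Reassign to df so the de-duplicated rows are what later cells rename and save
df = df.drop_duplicates()
print(df.head().to_markdown())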

0       False
1       False
2       False
3       False
4       False
        ...  
3973    False
3974    False
3975    False
3976    False
3977    False
Length: 3978, dtype: bool
|    | Player          | Team   |   Shirt Number | Nation   | Position   |   Age |   Minutes |   Goals |   Assists |   Penalty Shoot on Goal |   Penalty Shoot |   Total Shoot |   Shoot on Target |   Yellow Cards |   Red Cards |   Touches |   Dribbles |   Tackles |   Blocks |   Expected Goals (xG) |   Non-Penalty xG (npxG) |   Expected Assists (xAG) |   Shot-Creating Actions |   Goal-Creating Actions |   Passes Completed |   Passes Attempted |   Pass Completion % |   Progressive Passes |   Carries |   Progressive Carries |   Dribble Attempts |   Successful Dribbles | Date                |
|---:|:----------------|:-------|---------------:|:---------|:-----------|------:|----------:|--------:|----------:|------------------------:|----------------:|--------------:|------------------:|---------------:|------------
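
Printing the full boolean series hides most rows behind the `...`, so a count is usually easier to read. A sketch on a made-up frame (names and values are illustrative):

```python
import pandas as pd

# Toy frame where the third row repeats the first (values are made up)
toy = pd.DataFrame({
    "player": ["A", "B", "A"],
    "date": ["2023-08-19", "2023-08-19", "2023-08-19"],
    "goals": [1, 0, 1],
})

# Number of rows that repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Show every copy of a duplicated row, including the first occurrence
all_copies = toy[toy.duplicated(keep=False)]

# Drop the repeats and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(all_copies), len(deduped))
```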

Rename columns for clarity and consistency

In [150]:
df.rename(columns={
    "Player":"player",
    "Team":"team",
    "Shirt Number":"shirt_number",
    "Nation":"nation",
    "Position":"position",
    "Age":"age",
    "Minutes":"minutes",
    "Goals":"goals",
    "Assists":"assists",
    "Penalty Shoot on Goal":"penalty_shoot_on_goal",
    "Penalty Shoot":"penalty_shoot",
    "Total Shoot":"total_shoot",
    "Shoot on Target":"shoot_on_target",
    "Yellow Cards":"yellow_cards",
    "Red Cards":"red_cards",
    "Touches":"touches",
    "Dribbles":"dribbles",
    "Tackles":"tackles",
    "Blocks":"blocks",
    "Expected Goals (xG)":"expected_goals_xg",
    "Non-Penalty xG (npxG)":"non_penalty_xg_npxg",
    "Expected Assists (xAG)":"expected_assists_xag",
    "Shot-Creating Actions":"shot_creating_actions",
    "Goal-Creating Actions":"goal_creating_actions",
    "Passes Completed":"passes_completed",
    "Passes Attempted":"passes_attempted",
    "Pass Completion %":"pass_completed_%",
    "Progressive Passes":"progressive_passes",
    "Carries":"carries",
    "Progressive Carries":"progressive_carries",
    "Dribble Attempts":"dribble_attempts",
    "Successful Dribbles":"successful_dribbles",
    "Date":"date"
},inplace=True)
print(df.head().to_markdown())

|    | player          | team   |   shirt_number | nation   | position   |   age |   minutes |   goals |   assists |   penalty_shoot_on_goal |   penalty_shoot |   total_shoot |   shoot_on_target |   yellow_cards |   red_cards |   touches |   dribbles |   tackles |   blocks |   expected_goals_xg |   non_penalty_xg_npxg |   expected_assists_xag |   shot_creating_actions |   goal_creating_actions |   passes_completed |   passes_attempted |   pass_completed_% |   progressive_passes |   carries |   progressive_carries |   dribble_attempts |   successful_dribbles | date                |
|---:|:----------------|:-------|---------------:|:---------|:-----------|------:|----------:|--------:|----------:|------------------------:|----------------:|--------------:|------------------:|---------------:|------------:|----------:|-----------:|----------:|---------:|--------------------:|----------------------:|-----------------------:|------------------------:|------------------------:|--------------
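
A hand-written mapping silently skips any key that does not match a column exactly, for example because of a stray space. One alternative is to derive the names with a function. A sketch using a hypothetical helper `to_snake_case` (not part of pandas); note that it drops symbols such as "%", so its output differs slightly from the mapping above:

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column label and join its alphanumeric runs with underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


# Empty toy frame that only carries a few of the original column labels
toy = pd.DataFrame(columns=["Shirt Number", "Expected Goals (xG)", "Pass Completion %"])

# rename accepts a callable and applies it to every column label
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```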

Save the cleaned data

In [151]:
df.to_csv("seria.cleaned.csv",index=False)
print("Saved Successfully")

Saved Successfully
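
CSV does not store dtypes, so the datetime column comes back as text unless it is parsed again on load. A sketch of a round-trip check using an in-memory buffer in place of the file (the toy frame and its values are made up):

```python
import io

import pandas as pd

# Toy cleaned frame with an integer column and a datetime column (made up)
toy = pd.DataFrame({
    "player": ["A", "B"],
    "age": [25, 31],
    "date": pd.to_datetime(["2023-08-19", "2023-08-20"]),
})

# Write to an in-memory buffer instead of disk, then rewind it for reading
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Ask read_csv to parse the date column again when loading
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```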
