# Merge Datasets

In this notebook, we will merge two separate datasets: one containing player appearances and the other containing player details. The goal is to create a comprehensive dataset that includes both performance metrics and personal information of the players.

## Import Libraries and Set Working Directory

Before we begin, we need to import the necessary libraries and set the working directory to the location of our datasets.

In [18]:
# Import necessary libraries
import pandas as pd
import os
from datetime import datetime

# Define the new working directory path
new_working_directory = r'C:\cloudresume\react\resume\sports-data'

# Change the current working directory to the new path
os.chdir(new_working_directory)


## Import Datasets

Next, we will load our datasets from CSV files into pandas DataFrames to prepare for the merging process.

In [19]:
# Define the path where the data files are located
data_folder = 'data/'

# Load the appearances and players datasets into DataFrames
appearances_df = pd.read_csv(data_folder + 'appearances_filtered.csv')
players_df = pd.read_csv(data_folder + 'players_filtered.csv')

# Display the first few rows of each dataframe to verify successful loading
appearances_df.head()
players_df.head()

Unnamed: 0.1,Unnamed: 0,name,current_club_id,country_of_citizenship,date_of_birth,position,foot,height_in_cm,market_value_in_eur,highest_market_value_in_eur,player_id,current_club_domestic_competition_id
0,0,Miroslav Klose,398,Germany,1978-06-09,Attack,right,184.0,1000000.0,30000000.0,10,IT1
1,1,Roman Weidenfeller,16,Germany,1980-08-06,Goalkeeper,left,190.0,750000.0,8000000.0,26,L1
2,3,Lúcio,506,Brazil,1978-05-08,Defender,,,200000.0,24500000.0,77,IT1
3,4,Tom Starke,27,Germany,1981-03-18,Goalkeeper,right,194.0,100000.0,3000000.0,80,L1
4,6,Christoph Metzelder,33,Germany,1980-11-05,Defender,,,1500000.0,9500000.0,123,L1


## Merge both Dataframes and calculate players stats

In this section, we will merge two dataframes containing information on player appearances and their respective player details. After merging these dataframes, we will calculate various statistical measures to gain insights into player performance.

### Merging the Dataframes

The first step is to merge the two dataframes on the 'player_id' column to combine the appearance data with the player details.

In [20]:
# Merging the dataframes on 'player_id'
merged_df = appearances_df.merge(players_df, on='player_id')

### Calculate Player Statistics

We will now calculate various statistics for each player such as the number of games played, total and average minutes played, and the total and average number of yellow and red cards received.

In [21]:
# Calculate the number of games played per player
number_games_played = merged_df.groupby('player_id')['game_id'].nunique().reset_index(name='number_games_played')

# Calculate the total minutes played per player
total_minutes = merged_df.groupby('player_id')['minutes_played'].sum().reset_index(name='total_minutes')

# Calculate the average minutes played per player
average_minutes = merged_df.groupby('player_id')['minutes_played'].mean().reset_index(name='average_minutes')

# Calculate the sum of yellow cards per player
yellow_cards_sum = merged_df.groupby('player_id')['yellow_cards'].sum().reset_index(name='yellow_cards_sum')

# Calculate the average number of yellow cards per game per player
yellow_cards_avg = merged_df.groupby('player_id')['yellow_cards'].mean().reset_index(name='yellow_cards_avg')

# Calculate the sum of red cards per player
red_cards_sum = merged_df.groupby('player_id')['red_cards'].sum().reset_index(name='red_cards_sum')

# Calculate the average number of red cards per game per player
red_cards_avg = merged_df.groupby('player_id')['red_cards'].mean().reset_index(name='red_cards_avg')

# Calculate the total goals per player
goals = merged_df.groupby('player_id')['goals'].sum().reset_index(name='goals')

# Calculate the average goals per game per player
avg_goals_per_game = merged_df.groupby('player_id')['goals'].mean().reset_index(name='avg_goals_per_game')

# Calculate the total assists per player
assists = merged_df.groupby('player_id')['assists'].sum().reset_index(name='assists')

### Extract and Merge Desired Columns

After calculating the necessary statistics, we will extract the columns that contain the player's personal details and merge them with their calculated statistics.

In [22]:
# Extract the desired columns from the merged dataframe
desired_columns = ['player_id', 'name', 'country_of_citizenship', 'height_in_cm', 'foot', 'position', 'highest_market_value_in_eur', 'current_club_domestic_competition_id']
desired_df = merged_df[desired_columns].drop_duplicates()

# Merge all the calculated statistics dataframes with the desired dataframe
final_df = desired_df.merge(number_games_played, on='player_id')
final_df = final_df.merge(total_minutes, on='player_id')
final_df = final_df.merge(average_minutes, on='player_id')
final_df = final_df.merge(yellow_cards_sum, on='player_id')
final_df = final_df.merge(yellow_cards_avg, on='player_id')
final_df = final_df.merge(red_cards_sum, on='player_id')
final_df = final_df.merge(red_cards_avg, on='player_id')
final_df = final_df.merge(goals, on='player_id')
final_df = final_df.merge(avg_goals_per_game, on='player_id')
final_df = final_df.merge(assists, on='player_id')

### Renaming Columns

To conclude our data preparation, we'll rename certain columns to make the dataset clearer and more accessible for analysis.

In [23]:
# Rename the columns for clarity
final_df = final_df.rename(columns={
    'country_of_citizenship': 'country',
    'height_in_cm': 'height',
    'highest_market_value_in_eur': 'highest_market_value'
})

## Calculate Player Age as of Today

To analyze the age distribution of players or to see how age might correlate with other statistics, we need to calculate the age of each player as of today's date. To do this, we will convert the 'date_of_birth' column to a datetime format and create a function to calculate age.

In [24]:
# Convert the 'date_of_birth' column to datetime
players_df['date_of_birth'] = pd.to_datetime(players_df['date_of_birth'])

# Define the function to calculate age
def calculateAge(birthDate):
    today = date.today()
    age = today.year - birthDate.year - ((today.month, today.day) < (birthDate.month, birthDate.day))
    return age

### Applying the Age Calculation Function

Next, we apply the `calculateAge` function to each row in the `players_df` DataFrame to calculate the current age of each player.

In [25]:
# Apply the calculateAge function to each row in the DataFrame
final_df['age'] = players_df['date_of_birth'].apply(lambda x: calculateAge(x.date()))

# Display the first few rows of final_df to verify the 'age' column has been added
final_df.head()


Unnamed: 0,player_id,name,country,height,foot,position,highest_market_value,current_club_domestic_competition_id,number_games_played,total_minutes,average_minutes,yellow_cards_sum,yellow_cards_avg,red_cards_sum,red_cards_avg,goals,avg_goals_per_game,assists,age
0,122011,Markus Henriksen,Norway,187.0,right,Defender,5000000.0,GB1,165,12199,73.933333,15,0.090909,1,0.006061,33,0.2,22,45.0
1,14940,Razvan Rat,Romania,179.0,left,Defender,6500000.0,ES1,97,7690,79.278351,18,0.185567,1,0.010309,3,0.030928,13,43.0
2,14942,Darijo Srna,Croatia,182.0,right,Defender,17500000.0,IT1,227,19598,86.334802,59,0.259912,2,0.008811,22,0.096916,68,45.0
3,26267,Fernandinho,Brazil,179.0,right,Midfield,32000000.0,GB1,399,30325,76.002506,100,0.250627,3,0.007519,29,0.072682,41,42.0
4,55735,Henrikh Mkhitaryan,Armenia,177.0,both,Midfield,37000000.0,IT1,485,35878,73.975258,59,0.121649,0,0.0,128,0.263918,119,43.0


## Calculate Stats Per Year

In this section, we will calculate the yearly statistics for each player. This includes the total number of games, goals, and assists per year. Additionally, we will derive the average number of games, goals, and assists per year for each player's career.

In [26]:

# Convert the 'date' column to datetime and extract the year
appearances_df['date'] = pd.to_datetime(appearances_df['date'])
appearances_df['year'] = appearances_df['date'].dt.year

# Group by 'player_id' and 'year' to calculate sums for games, goals, and assists
yearly_stats_df = appearances_df.groupby(['player_id', 'year']).agg(
    total_games=('game_id', 'count'),
    total_goals=('goals', 'sum'),
    total_assists=('assists', 'sum')
).reset_index()

# Display the yearly statistics dataframe
yearly_stats_df.head()

Unnamed: 0,player_id,year,total_games,total_goals,total_assists
0,10,2012,20,11,1
1,10,2013,29,9,4
2,10,2014,31,8,6
3,10,2015,36,12,8
4,10,2016,20,8,6


### Calculate the Number of Years Played per Player

To understand how long players have been active, we calculate the number of years each player has played.

In [27]:
# Calculate the number of years each player has played
years_played_df = yearly_stats_df.groupby('player_id')['year'].nunique().rename('years_played').reset_index()

# Merge the years played with the final dataframe
final_df = pd.merge(final_df, years_played_df, on='player_id')

# Display the first few rows to verify the 'years_played' column has been added
final_df.head()

Unnamed: 0,player_id,name,country,height,foot,position,highest_market_value,current_club_domestic_competition_id,number_games_played,total_minutes,average_minutes,yellow_cards_sum,yellow_cards_avg,red_cards_sum,red_cards_avg,goals,avg_goals_per_game,assists,age,years_played
0,122011,Markus Henriksen,Norway,187.0,right,Defender,5000000.0,GB1,165,12199,73.933333,15,0.090909,1,0.006061,33,0.2,22,45.0,6
1,14940,Razvan Rat,Romania,179.0,left,Defender,6500000.0,ES1,97,7690,79.278351,18,0.185567,1,0.010309,3,0.030928,13,43.0,5
2,14942,Darijo Srna,Croatia,182.0,right,Defender,17500000.0,IT1,227,19598,86.334802,59,0.259912,2,0.008811,22,0.096916,68,45.0,8
3,26267,Fernandinho,Brazil,179.0,right,Midfield,32000000.0,GB1,399,30325,76.002506,100,0.250627,3,0.007519,29,0.072682,41,42.0,11
4,55735,Henrikh Mkhitaryan,Armenia,177.0,both,Midfield,37000000.0,IT1,485,35878,73.975258,59,0.121649,0,0.0,128,0.263918,119,43.0,13


### Calculate Averages Over Player Careers

With the yearly stats and total years played, we can now calculate the average number of games, goals, and assists per year over a player's career.

In [28]:
# Assuming `final_df` already includes 'number_games_played', 'goals', and 'assists' columns

# Calculate the average games, goals, and assists per year
final_df['avg_games_per_year'] = final_df['number_games_played'] / final_df['years_played']
final_df['avg_goals_per_year'] = final_df['goals'] / final_df['years_played']
final_df['avg_assists_per_year'] = final_df['assists'] / final_df['years_played']

# Print the list of columns to verify the new averages are included
print(final_df.columns.tolist())

['player_id', 'name', 'country', 'height', 'foot', 'position', 'highest_market_value', 'current_club_domestic_competition_id', 'number_games_played', 'total_minutes', 'average_minutes', 'yellow_cards_sum', 'yellow_cards_avg', 'red_cards_sum', 'red_cards_avg', 'goals', 'avg_goals_per_game', 'assists', 'age', 'years_played', 'avg_games_per_year', 'avg_goals_per_year', 'avg_assists_per_year']


### Exporting the Final DataFrame

Finally, we'll export the cleaned and aggregated data to a CSV file for further analysis or sharing.

In [29]:


# Export the final dataframe to a CSV file
# final_df.to_csv(data_folder + 'cleaned_df.csv', index=False)