# Data Processing for the Flight-route Map

This Jupyter notebook contains the data processing steps behind the flight-route map on our website.

The flight-route map shows the viewer the five most popular transfer destinations for players leaving a country. Our data set is the FIFA game dataset, which contains the player_id, fifa_version, name, overall rating, club name and league name of each player. We count the players leaving each country over the course of 8 years (FIFA 15 to 23) and rank the destination countries by the number of occurrences.

At the end, the script prints the five most popular transfer destinations for each country.
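The counting described above can be sketched on a toy table. This is a minimal illustration, not the notebook's implementation: the data and the `top_destinations` helper are hypothetical, and it assumes a `country` column has already been derived from the league and that no country is missing.

```python
import pandas as pd

# Toy data (hypothetical): one row per player per FIFA version.
toy = pd.DataFrame({
    "player_id":    [1, 1, 1, 2, 2, 3, 3],
    "fifa_version": [15, 16, 17, 15, 16, 15, 16],
    "country":      ["Spain", "Spain", "England", "Spain", "France", "Spain", "England"],
})

def top_destinations(df, n=5):
    """Return {origin: destinations ranked by number of moves, at most n}."""
    df = df.sort_values(["player_id", "fifa_version"])
    # Country of the same player's previous record (NaN for the first record).
    previous = df.groupby("player_id")["country"].shift()
    # A move is a record whose country differs from the previous one.
    moved = previous.notna() & (previous != df["country"])
    moves = pd.DataFrame({
        "origin": previous[moved],
        "destination": df.loc[moved, "country"],
    })
    counts = moves.groupby(["origin", "destination"]).size()
    result = {}
    for origin, group in counts.groupby(level="origin"):
        ranked = group.sort_values(ascending=False).head(n)
        result[origin] = list(ranked.index.get_level_values("destination"))
    return result

top = top_destinations(toy)
print(top)
```

Here two players move from Spain to England and one from Spain to France, so England ranks first for Spain.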

Link to the dataset: https://www.datacamp.com/tutorial/markdown-in-jupyter-notebook


In [1]:
import pandas as pd
import numpy as np

# Loading in the data.

In [2]:
male_players_df = pd.read_csv('../data/male_players.csv', usecols=['player_id', 'fifa_version', 'short_name', 'overall', 'age', 'club_name', 'league_name'])

In [3]:
# Show the first 10 rows for a sanity check
display(male_players_df.head(10))
print(male_players_df.shape)

| | player_id | fifa_version | short_name | overall | age | league_name | club_name |
|---|---|---|---|---|---|---|---|
| 0 | 158023 | 23 | L. Messi | 91 | 35 | Ligue 1 | Paris Saint Germain |
| 1 | 165153 | 23 | K. Benzema | 91 | 34 | La Liga | Real Madrid |
| 2 | 188545 | 23 | R. Lewandowski | 91 | 33 | La Liga | FC Barcelona |
| 3 | 192985 | 23 | K. De Bruyne | 91 | 31 | Premier League | Manchester City |
| 4 | 231747 | 23 | K. Mbappé | 91 | 23 | Ligue 1 | Paris Saint Germain |
| 5 | 192119 | 23 | T. Courtois | 90 | 30 | La Liga | Real Madrid |
| 6 | 209331 | 23 | M. Salah | 90 | 30 | Premier League | Liverpool |
| 7 | 167495 | 23 | M. Neuer | 89 | 36 | Bundesliga | FC Bayern München |
| 8 | 190871 | 23 | Neymar Jr | 89 | 30 | Ligue 1 | Paris Saint Germain |
| 9 | 200145 | 23 | Casemiro | 89 | 30 | Premier League | Manchester United |


(10003590, 7)


# Preprocessing of the Data

Remove duplicate rows so that each player appears at most once per FIFA version.
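As a quick check of what this step does, the sketch below counts and drops duplicates on a small hypothetical frame, using `player_id` and `fifa_version` as the key. By default `drop_duplicates` keeps the first occurrence of each key.

```python
import pandas as pd

# Hypothetical sample: the first two rows share the same key.
toy = pd.DataFrame({
    "player_id":    [158023, 158023, 158023, 165153],
    "fifa_version": [23, 23, 22, 23],
    "club_name":    ["Paris Saint Germain", "Paris Saint Germain",
                     "Paris Saint Germain", "Real Madrid"],
})

key = ["player_id", "fifa_version"]

# Rows that repeat an earlier (player_id, fifa_version) pair.
n_duplicates = int(toy.duplicated(subset=key).sum())

# Keep the first row for each pair.
deduped = toy.drop_duplicates(subset=key)

print(n_duplicates, len(deduped))
```

Comparing the row count before and after shows how much of the raw file is redundant.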

In [4]:
# Set pandas display options to show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Drop duplicates based on 'player_id' and 'fifa_version'
players = male_players_df.drop_duplicates(subset=['player_id', 'fifa_version'])

print(players[players['player_id'] == 158023])

         player_id  fifa_version short_name  overall  age league_name  \
0           158023            23   L. Messi       91   35     Ligue 1   
166674      158023            22   L. Messi       92   34     Ligue 1   
1385125     158023            21   L. Messi       93   33     Ligue 1   
2582753     158023            20   L. Messi       94   32     La Liga   
3711090     158023            19   L. Messi       94   31     La Liga   
4941076     158023            18   L. Messi       94   30     La Liga   
6453436     158023            17   L. Messi       93   29     La Liga   
8054866     158023            16   L. Messi       94   28     La Liga   
9041041     158023            15   L. Messi       93   27     La Liga   

                   club_name  
0        Paris Saint Germain  
166674   Paris Saint Germain  
1385125  Paris Saint Germain  
2582753         FC Barcelona  
3711090         FC Barcelona  
4941076         FC Barcelona  
6453436         FC Barcelona  
8054866         FC Ba

# Processing the Leagues

We collect all league names in the dataset and manually build a lookup table that maps each league to its country. We then replace the league names with the country names.

In [5]:
unique_league_names = players['league_name'].unique()
print(unique_league_names)

['Ligue 1' 'La Liga' 'Premier League' 'Bundesliga' 'Pro League' 'Serie A'
 'Major League Soccer' 'Super Lig' 'Eredivisie' 'Liga Portugal'
 'Jupiler Pro League' 'Super League' '1. HNL' 'La Liga 2' nan
 'Liga Profesional' 'Serie B' 'Superliga' 'Premiership' 'Fortuna Liga'
 'Primera Division' 'A-League' 'Championship' 'Ligue 2' 'Primera División'
 'K League 1' 'Allsvenskan' 'Liga Pro' 'Liga BetPlay' 'NB I.'
 '2. Bundesliga' 'Eliteserien' 'Ekstraklasa' 'League One' '1. Division'
 'Liga 1' '3. Liga' 'Liga De Futbol Prof' 'Veikkausliiga' 'League Two'
 'Premier Division' 'National League' 'Liga MX' 'J-League' 'Rest of World']


In [6]:
league_country_mapping = {
    'Ligue 1': 'France',
    'La Liga': 'Spain',
    'Premier League': 'England',
    'Bundesliga': 'Germany',
    'Pro League': 'Belgium',
    'Serie A': 'Italy',
    'Major League Soccer': 'United States',
    'Super Lig': 'Turkey',
    'Eredivisie': 'Netherlands',
    'Liga Portugal': 'Portugal',
    'Jupiler Pro League': 'Belgium',
    'Super League': 'Greece',
    '1. HNL': 'Croatia',
    'La Liga 2': 'Spain',
    'Liga Profesional': 'Argentina',
    'Serie B': 'Italy',
    'Superliga': 'Denmark',
    'Premiership': 'Scotland',
    'Fortuna Liga': 'Slovakia',
    'Primera Division': 'Argentina',
    'A-League': 'Australia',
    'Championship': 'England',
    'Ligue 2': 'France',
    'Primera División': 'Spain',
    'K League 1': 'South Korea',
    'Allsvenskan': 'Sweden',
    'Liga Pro': 'Ecuador',
    'Liga BetPlay': 'Colombia',
    'NB I.': 'Hungary',
    '2. Bundesliga': 'Germany',
    'Eliteserien': 'Norway',
    'Ekstraklasa': 'Poland',
    'League One': 'England',
    '1. Division': 'Denmark',
    'Liga 1': 'Romania',
    '3. Liga': 'Germany',
    'Liga De Futbol Prof': 'Mexico',
    'Veikkausliiga': 'Finland',
    'League Two': 'England',
    'Premier Division': 'Ireland',
    'National League': 'England',
    'Liga MX': 'Mexico',
    'J-League': 'Japan',
    'Rest of World': 'Rest of World'
}
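Because the lookup table is written by hand, it can help to check that every league in the data is covered. The sketch below uses shortened, hypothetical stand-ins for the mapping and the league column; it relies on `Series.map` returning NaN for keys that are not in the dictionary.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the notebook's objects.
league_country_mapping = {"Ligue 1": "France", "La Liga": "Spain"}
leagues = pd.Series(["Ligue 1", "La Liga", "Serie A", None], name="league_name")

# Keys missing from the dictionary map to NaN.
country = leagues.map(league_country_mapping)

# Leagues that are present in the data but missing from the lookup table.
unmapped = sorted(leagues[country.isna() & leagues.notna()].unique())
print(unmapped)
```

A check like this only catches missing keys. Names such as 'Super League' or 'Primera Division' are used in more than one country, so mapping by league name alone remains an assumption.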

In [7]:
# Replace league names with country names
players_copy = players.copy()  # Create a copy of the DataFrame
players_copy['country'] = players_copy['league_name'].map(league_country_mapping)

# Drop the original 'league_name' column
players_copy.drop(columns=['league_name'], inplace=True)

# Print the modified DataFrame
print(players_copy.head(10))

   player_id  fifa_version      short_name  overall  age            club_name  \
0     158023            23        L. Messi       91   35  Paris Saint Germain   
1     165153            23      K. Benzema       91   34          Real Madrid   
2     188545            23  R. Lewandowski       91   33         FC Barcelona   
3     192985            23    K. De Bruyne       91   31      Manchester City   
4     231747            23       K. Mbappé       91   23  Paris Saint Germain   
5     192119            23     T. Courtois       90   30          Real Madrid   
6     209331            23        M. Salah       90   30            Liverpool   
7     167495            23        M. Neuer       89   36    FC Bayern München   
8     190871            23       Neymar Jr       89   30  Paris Saint Germain   
9     200145            23        Casemiro       89   30    Manchester United   

   country  
0   France  
1    Spain  
2    Spain  
3  England  
4   France  
5    Spain  
6  England  
7  G

# Getting the Top 5 Destinations for Each Country
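Rows without a league end up with a missing country, which is why a "nan" group appears in the output of this section. NaN never compares equal to itself, so two consecutive records with a missing country are counted as a move. One possible way to avoid this, sketched here on hypothetical data, is to drop such rows before counting. Note that this is a design choice: a player with a gap between Spain and England would then count as moving directly from Spain to England.

```python
import pandas as pd

# NaN is not equal to itself, so "country changed" is always true for NaN pairs.
nan = float("nan")
nan_differs = nan != nan
print(nan_differs)

# Hypothetical player with a missing country in FIFA 16.
toy = pd.DataFrame({
    "player_id":    [1, 1, 1],
    "fifa_version": [15, 16, 17],
    "country":      ["Spain", None, "England"],
})

# Keep only records whose country is known.
known = toy.dropna(subset=["country"])
print(known["country"].tolist())
```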

In [8]:
from collections import defaultdict

# Initialize a dictionary to store transfer routes for each country
country_transfer_routes = defaultdict(list)

# Iterate through each player
for player_id in players_copy['player_id'].unique():
    # Sort the player's records by FIFA version to ensure consecutive years are in order
    player_records = players_copy[players_copy['player_id'] == player_id].sort_values('fifa_version')
    
    # Track the country of the player's previous record
    previous_country = None
    
    # Iterate through each record of the player
    for _, row in player_records.iterrows():
        current_country = row['country']
        
        # Record a move when the country changes; the first record has no predecessor.
        # Note: a missing country is NaN, and NaN != NaN, so consecutive records
        # without a league are also counted (as a "nan to nan" route).
        if previous_country is not None and current_country != previous_country:
            route = f"{previous_country} to {current_country}"
            country_transfer_routes[previous_country].append(route)
        
        # Update the previous country for the next iteration
        previous_country = current_country

# Initialize a dictionary to store the top 5 destination countries for each country
top_destination_countries = {}

# Iterate through each country's transfer routes
for country, routes in country_transfer_routes.items():
    # Count the occurrences of each destination country
    destination_counts = defaultdict(int)
    for route in routes:
        destination = route.split(' to ')[1]
        destination_counts[destination] += 1
    
    # Sort the destination countries by count in descending order
    sorted_destinations = sorted(destination_counts.items(), key=lambda x: x[1], reverse=True)
    
    # Select the top 5 destination countries
    top_destinations = [destination for destination, count in sorted_destinations[:5]]
    
    # Store the top destination countries for the current country
    top_destination_countries[country] = top_destinations

# Print the top 5 destination countries for each country
for country, destinations in top_destination_countries.items():
    print(f"Top 5 destination countries for {country}:")
    for i, destination in enumerate(destinations, 1):
        print(f"{i}. {destination}")
    print()

Top 5 destination countries for Spain:
1. England
2. Argentina
3. Portugal
4. Italy
5. France

Top 5 destination countries for Germany:
1. England
2. Netherlands
3. Greece
4. Belgium
5. Turkey

Top 5 destination countries for England:
1. Scotland
2. Spain
3. France
4. Germany
5. Rest of World

Top 5 destination countries for Italy:
1. Spain
2. England
3. France
4. Greece
5. Turkey

Top 5 destination countries for Portugal:
1. Spain
2. France
3. Belgium
4. Turkey
5. England

Top 5 destination countries for Scotland:
1. England
2. Ireland
3. Belgium
4. United States
5. Netherlands

Top 5 destination countries for nan:
1. nan
2. Greece
3. Mexico
4. Argentina
5. England

Top 5 destination countries for Norway:
1. Sweden
2. Denmark
3. Belgium
4. Netherlands
5. United States

Top 5 destination countries for France:
1. Englan