# Name matching using dates of birth
*Author: Pierre Adda*
## Context
We want to try models that use per-game player statistics to predict each game's outcome.<br>
The player statistics come from the FIFA player dataset, found at [this url](https://www.kaggle.com/stefanoleone992/fifa-22-complete-player-dataset "Kaggle Fifa complete dataset").<br>
The game results, together with the player appearances, can be found at [this url](https://data.world/dcereijo/player-scores).
This dataset includes 40k+ games from European leagues (including Russia), with 20k+ players from 300+ clubs.<br><br>
The main focus of this notebook is to link the players from the latter dataset to those in the FIFA dataset, in order to build our model training dataframe.<br>
This is not a trivial task: names are not always recorded properly or in the same way, some players are missing from one of the two datasets, and several players share the exact same name or a very similar one.

## Requirements

1. Python 3.8 or higher
2. Pandas and Numpy
3. The following libraries: unidecode, tqdm (for progress bars), and python-Levenshtein and fuzzywuzzy for "fuzzy" string matching:

<code>
!pip install unidecode -q  

!pip install python-Levenshtein -q  
!pip install fuzzywuzzy -q  
!pip install "tqdm>=4.9.0"
</code>
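
Before running the notebook, it can help to confirm that the libraries above are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list holds import names rather than pip names (python-Levenshtein is imported as `Levenshtein`).

```python
from importlib.util import find_spec

# Import names, not pip names: python-Levenshtein is imported as "Levenshtein".
REQUIRED = ["pandas", "numpy", "unidecode", "Levenshtein", "fuzzywuzzy", "tqdm"]

def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_modules(REQUIRED)
print("missing:", missing if missing else "none")
```

Any name reported as missing can be installed with the pip commands listed above.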
<br><br><br><br>

## 1. Imports

In [1]:
import pandas as pd
import numpy as np
import unidecode
from fuzzywuzzy import fuzz #this amazing library uses levenshtein distance (string similarity scores) to match somewhat similar strings
from tqdm import tqdm
import re
tqdm.pandas()

In [2]:
pd.options.display.max_columns =None

## 2. Functions

In [3]:
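# This function preprocesses the FIFA dataset in place (it returns None).
# It removes diacritics from player names and club names, then lowercases the player names,
# drops their apostrophes and turns their hyphens into spaces.
# "dj" is replaced with "d" to avoid a recurrent ambiguity in Eastern European names.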

def decode_fifa(fifa_df):
    fifa_df.short_name = fifa_df.short_name.apply(unidecode.unidecode)
    fifa_df.long_name = fifa_df.long_name.apply(unidecode.unidecode)
    fifa_df.club_name = fifa_df.club_name.apply(unidecode.unidecode)

    fifa_df.short_name = fifa_df.short_name.str.lower().str.replace("'","").str.replace("-"," ").str.replace('dj','d')
    fifa_df.long_name = fifa_df.long_name.str.lower().str.replace("'","").str.replace("-"," ").str.replace('dj','d')
        
        
    return None
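
To see what this cleaning does to a single name, here is a small sketch that applies the same string rules with the standard library only. It is an approximation: `unicodedata` only strips combining accents, whereas `unidecode` also transliterates characters such as "ß" or "Đ". The function name `normalize_name` is hypothetical.

```python
import unicodedata

def normalize_name(name):
    """Approximate the notebook's name cleaning using only the standard library."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_like = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Same string rules as decode_fifa: lowercase, drop apostrophes,
    # hyphens become spaces, "dj" becomes "d".
    return (ascii_like.lower()
            .replace("'", "")
            .replace("-", " ")
            .replace("dj", "d"))

print(normalize_name("N'Golo Kanté"))      # ngolo kante
print(normalize_name("Müller-Wohlfahrt"))  # muller wohlfahrt
```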

In [4]:
# This function searches for FIFA players with the same date of birth as the target player.
# It returns the maximum fuzz.token_set_ratio value (= best string similarity score)
# and the index(es) of the best matching name(s).

def search_name(x,fifa_N):
    #mask selects players with the same date of birth (in fifa_N, not in the global fifa_df)
    mask = fifa_N.dob == x['date_of_birth']
    #if there are NO matching DoB, then:
    if not mask.any():
        max_score = np.nan
        good_index = np.asarray([])
    else:
        fuzz_scores = fifa_N.long_name[mask].apply(lambda y: fuzz.token_set_ratio(x['name'],y)).values
        fifa_index = np.asarray(fifa_N.long_name[mask].index) #array of the fifa_N indexes corresponding to fuzz_scores
        max_score = fuzz_scores.max()
        good_index = fifa_index[np.argwhere(fuzz_scores == max_score).squeeze()] #the fifa_N index(es) with the best matching name

    return max_score , good_index
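
The idea behind this function (filter on date of birth first, then keep every candidate that ties for the best name score) can be sketched without pandas. This is an illustration only: `similarity` uses `difflib` on sorted tokens as a stand-in and does not reproduce `fuzz.token_set_ratio`, and the candidate list is toy data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score in [0, 100] on alphabetically sorted tokens (a stand-in, not token_set_ratio)."""
    a_sorted = " ".join(sorted(a.split()))
    b_sorted = " ".join(sorted(b.split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())

def best_matches(name, dob, candidates):
    """Return (max_score, positions) among the candidates sharing `dob`.

    `candidates` is a list of (long_name, dob) tuples. When no candidate
    shares the date of birth, the result is (None, []).
    """
    scored = [(similarity(name, cand_name), pos)
              for pos, (cand_name, cand_dob) in enumerate(candidates)
              if cand_dob == dob]
    if not scored:
        return None, []
    max_score = max(score for score, _ in scored)
    return max_score, [pos for score, pos in scored if score == max_score]

# Toy candidates: two entries share a date of birth and tie on the name score.
fifa = [
    ("ivan martic", "1990-10-02"),
    ("constantin nica", "1993-03-18"),
    ("martic ivan", "1990-10-02"),
    ("david cornell", "1991-03-28"),
]
print(best_matches("ivan martic", "1990-10-02", fifa))  # (100, [0, 2])
```

Ties are returned as a list on purpose, which mirrors why some rows later hold several FIFA indexes and need a manual choice.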

In [5]:
# This function applies search_name() to every name in the "link" dataframe (built from "players").
# It adds the best matching name index(es) and the fuzz score to the link dataframe.
# It returns the shape of the subset with fuzz_score > 90, i.e. the number of names that probably matched correctly.

def search_all_names(fifa_N):
    results = link.progress_apply(lambda x:search_name(x,fifa_N),axis=1)
    
    link['fuzz_score'] = results.apply(lambda x : x[0])
    link['fifa_index']= results.apply(lambda x : x[1])
    
    return link.loc[link.fuzz_score > 90,'fuzz_score'].shape

## 3. Data

### Game dataset
##### If you do not have the "game" dataset (players.csv, games.csv, appearance.csv, clubs.csv), please take the following steps:


1. Create the ``data`` and ``data/games/`` folders one level above this notebook (the cells below read from ``../data/games/``).  

2. Execute the cell below.

##### If you already have this dataset, make sure it is in the right folder and skip the next cell

In [None]:
# Downloads all the datasets and stores them in ../data/games/

appearance = pd.read_csv('https://query.data.world/s/xemgpklltd3hlau4swg2vafdctgacf')
appearance.to_csv("../data/games/appearance.csv")

clubs = pd.read_csv('https://query.data.world/s/bmpof22nmwcl7dc4s5kf5l2pjf6l62')
clubs.to_csv("../data/games/clubs.csv")

leagues = pd.read_csv('https://query.data.world/s/zmlqmpvqs4atuxn3rsdkdqv5wa6c5o')
leagues.to_csv("../data/games/leagues.csv")

games = pd.read_csv('https://query.data.world/s/ntedgrx2r6shpsvskopamknbnl7sfk')
games.to_csv("../data/games/games.csv")

players = pd.read_csv('https://query.data.world/s/jyeqrkxvhxmqxzqfac2s6kquuxrfuo')
players.to_csv("../data/games/players.csv")
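
The download cell writes into `../data/games/`, and `to_csv` needs that folder to exist. As an alternative to creating the folders by hand, here is a small sketch with `pathlib`; the helper name `ensure_data_dirs` is hypothetical and the default root simply mirrors the paths used in this notebook.

```python
from pathlib import Path

def ensure_data_dirs(root="../data"):
    """Create <root>/games and <root>/FIFA if they do not exist yet."""
    root = Path(root)
    created = []
    for sub in ("games", "FIFA"):
        target = root / sub
        # parents=True also creates <root>; exist_ok=True makes the call repeatable.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created

# ensure_data_dirs()  # uncomment to create ../data/games and ../data/FIFA
```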

In [6]:
appearance = pd.read_csv("../data/games/appearance.csv")
clubs = pd.read_csv("../data/games/clubs.csv")
games = pd.read_csv("../data/games/games.csv")
players = pd.read_csv("../data/games/players.csv")


### Fifa Dataset
If you don't have this dataset, you can download it at [this url](https://www.kaggle.com/stefanoleone992/fifa-22-complete-player-dataset "fifa dataset").  
This dataset should be stored in ``../data/FIFA/`` (the path used by the cell below).

In [9]:
#Choose one of the FIFA datasets.
#The search_name function was initially designed to be used on all FIFA datasets at once, but trust me, it takes long enough with only one FIFA dataset

#Here we chose players_22.csv (FIFA 22)
fifa_df = pd.read_csv("../data/FIFA/players_22.csv", dtype={'nation_position' : str, 'nation_logo_url': str})

In [8]:
fifa_df.columns[[25,108]]

Index(['nation_position', 'nation_logo_url'], dtype='object')

## 4. Preprocessing

In [10]:
fifa_df = fifa_df.loc[:,['sofifa_id','short_name','long_name','club_team_id','club_name','dob']]

# Some club_name values are NaN. Problem: None and NaN cannot be treated as strings.
# Thus, in order to preprocess all names, we replace missing club names with an unmistakably nonexistent name, i.e. "ZZZZZ"
fifa_df.loc[fifa_df.club_name.isnull(),'club_name']="ZZZZZ"

# Only now can we apply the function decode_fifa
decode_fifa(fifa_df)

In [11]:
players.name = players.name.str.replace("-"," ").str.replace('dj','d')
players.name = players.name.apply(unidecode.unidecode)

## 5. Name matching

In [12]:
#This dataframe will be used to link the players from the 'appearance' dataset
link=players.loc[:,['player_id','name','date_of_birth']].copy()
link.head(1)

Unnamed: 0,player_id,name,date_of_birth
0,38790,dmitri golubov,1985-06-24


In [13]:
# Adds to each player the list of clubs in which that player has played at least one game, according to the "appearance" dataset

link['clubs'] = link.player_id.progress_apply(lambda x:
                    appearance.loc[appearance.player_id == x,'player_club_id'].unique() if type(x) == int
                                     else [appearance.loc[appearance.player_id == y,'player_club_id'].unique() for y in x]
                                    )

#let's add the "current club" from the players dataset to the clubs obtained from appearance dataset, IF that club is not already in that list:

link.clubs = link.progress_apply(lambda x: 
                                 np.append(x['clubs'],players.loc[players.player_id == x['player_id'],'current_club_id']) 
                                 if players.loc[players.player_id == x['player_id'],'current_club_id'].values[0] not in x['clubs']
                                 else x['clubs']
                                 ,axis = 1)
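
The two steps above (collect the distinct clubs seen in the appearances, then append the current club if it is missing) can be sketched with plain dictionaries. The function name `clubs_per_player` is hypothetical and the ids below are toy values, not a claim about the real data.

```python
from collections import defaultdict

def clubs_per_player(appearances, current_club):
    """Map player_id -> list of distinct club ids, in first-seen order.

    `appearances` is an iterable of (player_id, club_id) pairs and
    `current_club` maps player_id -> current club id.
    """
    clubs = defaultdict(list)
    for player_id, club_id in appearances:
        if club_id not in clubs[player_id]:
            clubs[player_id].append(club_id)
    # Append the current club only when it is not already listed.
    for player_id, club_id in current_club.items():
        if club_id not in clubs[player_id]:
            clubs[player_id].append(club_id)
    return dict(clubs)

appearances = [(302371, 3329), (302371, 3329), (302371, 720), (38790, 28095)]
current_club = {302371: 720, 38790: 2288}
print(clubs_per_player(appearances, current_club))
# {302371: [3329, 720], 38790: [28095, 2288]}
```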

100%|██████████| 22432/22432 [00:31<00:00, 720.22it/s]
100%|██████████| 22432/22432 [00:08<00:00, 2496.92it/s]


***The actual name searching cell :***

In [14]:
# This call applies fuzz.token_set_ratio to the names in the FIFA dataset that have the same date of birth as the name in players.csv
# It stores the name(s) with the highest fuzz_score in the link dataframe


search_all_names(fifa_df)

100%|██████████| 22432/22432 [00:55<00:00, 407.49it/s]


(7840,)

In [15]:
# Let's save this dataframe in the root folder
# To store this file somewhere else is left to the user's discretion

#don't forget to change the name if you used another fifa dataframe
link.to_csv("fifa22_res.csv")

## 6. Name filtering

**If you want to open a "link" dataframe that was previously saved, run the following cell**

In [None]:
link = pd.read_csv('fifa22_res.csv', index_col=0) # file saved by the previous cell; change the name if you saved another FIFA edition

# Reading a .csv transforms lists into strings.
# Now let's apply some preprocessing to turn the string form of a list back into a list

import ast # ast is part of Python's standard library.
def transform_int_column(x):
    if x[0] == '[':
        # NumPy arrays are saved without commas ("[3329  720]"), so we insert one between numbers.
        # The substitution runs twice because a single pass can miss the gap that follows a single-digit number.
        res = re.sub(r'(\d)\s+(\d)',r'\1 , \2',x.replace('\n',''))
        res = re.sub(r'(\d)\s+(\d)',r'\1 , \2',res)
        res = ast.literal_eval(res) 
        res = np.asarray([np.int64(y) for y in res])
    else:
        res = np.int64(x)
    return res

link.fifa_index = link.fifa_index.apply(transform_int_column)
link.clubs = link.clubs.apply(transform_int_column)
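
To make the parsing step concrete, here is a standalone sketch of the same idea that returns plain Python values instead of NumPy ones. It uses a lookahead so that a single pass is enough; the function name `parse_saved_ids` is hypothetical.

```python
import ast
import re

def parse_saved_ids(text):
    """Turn '[3329  720]' back into [3329, 720]; a bare '1588' becomes 1588."""
    text = text.replace("\n", " ").strip()
    if not text.startswith("["):
        return int(text)
    # Insert a comma wherever whitespace separates two digits. The lookahead
    # leaves the second digit unconsumed, so runs like "1 2 3" work in one pass.
    with_commas = re.sub(r"(\d)\s+(?=\d)", r"\1, ", text)
    return list(ast.literal_eval(with_commas))

print(parse_saved_ids("[3329  720]"))  # [3329, 720]
print(parse_saved_ids("1588"))         # 1588
```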

In [16]:
# We remove players that had no matching DoB in the FIFA's players dataset
print('players with no matching DoB : ',link.loc[link.fuzz_score.isnull(),:].shape[0])
link.drop(link.player_id[link.fuzz_score.isnull()].index,inplace = True)
link.reset_index(drop = True, inplace = True)

players with no matching DoB :  2600


In [17]:
# creates a column with the names corresponding to the fifa indexes

link['name_compare'] = link.fifa_index.progress_apply(lambda x: \
    fifa_df.long_name[x] if type(x) == np.int64 \
    else [fifa_df.long_name[y] for y in x] \
    )
link.head(2)

100%|██████████| 19832/19832 [00:00<00:00, 61466.54it/s]


Unnamed: 0,player_id,name,date_of_birth,clubs,fuzz_score,fifa_index,name_compare
0,106539,aleksandr vasiljev,1992-01-23,[28095],43.0,1588,sergio alvarez diaz
1,164389,rory donnelly,1992-02-18,[2288],45.0,8852,daniel offenbacher


In [18]:
#removes row with fuzz_score <70
link = link.drop(link.loc[(link.fuzz_score<70),:].index).reset_index(drop = True)

#displays rows with multiple fifa index
index = link.loc[(link.fifa_index.apply(type)!=np.int64),:].index
display(link.loc[index,:])
link.name_compare[index].values

Unnamed: 0,player_id,name,date_of_birth,clubs,fuzz_score,fifa_index,name_compare
7797,302371,toni martinez,1997-06-30,"[3329, 720]",76.0,"[1067, 1398]","[yeferson julio soteldo martinez, antonio mart..."


array([list(['yeferson julio soteldo martinez', 'antonio martinez lopez'])],
      dtype=object)

In [19]:
# To be filled in by hand:
# if you do not want to do this step, uncomment and run the next two lines
#   link.drop(index, inplace = True)
#   link.reset_index(drop = True, inplace = True)

# index to drop
#link = link.drop(4089).reset_index(drop = True)
#index = link.loc[(link.fifa_index.apply(type)!=np.int64),:].index

# position of the name to choose in the list of names, for each row
pp = [1]
for i in range(len(pp)):
    link.loc[index[i],'name_compare'] =  link.loc[index[i],'name_compare'][pp[i]]
    link.loc[index[i],'fifa_index'] =  link.loc[index[i],'fifa_index'][pp[i]]
link.loc[index,:]

Unnamed: 0,player_id,name,date_of_birth,clubs,fuzz_score,fifa_index,name_compare
7797,302371,toni martinez,1997-06-30,"[3329, 720]",76.0,1398,antonio martinez lopez


In [20]:
#adds the clubs from the FIFA dataset to help us in identifying good matches

link['fifa_clubs']=link.fifa_index.apply(lambda x: fifa_df.loc[x,'club_name'])
link.loc[:,'clubs'] = link.loc[:,'clubs'].apply(lambda x: np.asarray([clubs.name[clubs.club_id == y] for y in x],dtype=object).squeeze())


In [21]:
#adds the 'short_name' from the FIFA dataset to help us in identifying good matches

link['fifa_last_name'] = link.fifa_index.apply(lambda x: fifa_df.loc[x,'short_name'])#re.search( r'\w+$' ,fifa_df.loc[x,'short_name'].strip()).group(0))
link['last_name'] = link.name.apply(lambda x: re.search( r'(\w+)(\sjr)?$' ,x.strip()).group(0))
link['last_name_compare']=link.apply(lambda x: fuzz.token_set_ratio(x['fifa_last_name'],x['name']) , axis = 1)
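
The regular expression used for `last_name` keeps the final word of the name and, when present, a trailing " jr". A minimal standalone sketch of that behavior follows; the wrapper function and the sample names passed to it are only illustrative.

```python
import re

# Same pattern as in the cell above: last word, optionally followed by " jr".
LAST_NAME = re.compile(r"(\w+)(\sjr)?$")

def last_name(full_name):
    """Return the final word of a name, keeping a trailing ' jr' attached."""
    match = LAST_NAME.search(full_name.strip())
    return match.group(0) if match else ""

print(last_name("andre ayew"))  # ayew
print(last_name("neymar jr"))   # neymar jr
```

Note that the pattern relies on the names being lowercase already, since it only looks for a lowercase "jr".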

In [29]:
#Change these parameters to select as many false matches as possible
link.loc[(link.last_name_compare<50) & (link.fuzz_score <100),:]

Unnamed: 0,player_id,name,date_of_birth,clubs,fuzz_score,fifa_index,name_compare,fifa_clubs,fifa_last_name,last_name,last_name_compare
37,175739,michael ngoo,1992-10-23,kilmarnock-fc,74.0,12353,michael grant harriman,Northampton Town,m. harriman,ngoo,27
397,94523,martin milec,1991-09-20,"[standard-luttich, roda-jc-kerkrade]",72.0,3239,martin linnes,Molde FK,m. linnes,milec,40
452,368582,pablo fernandez,1996-09-17,"[sporting-gijon, real-sociedad-san-sebastian]",75.0,10755,alberto cayarga fernandez,FC Cartagena,berto cayarga,fernandez,29
595,68980,daniel stenderup,1989-05-31,esbjerg-fb,71.0,395,daniel wass,Valencia CF,d. wass,stenderup,27
1138,554463,gustavo moura,1996-06-12,moreirense-fc,70.0,5454,gustavo javier del prete,Estudiantes de La Plata,g. del prete,moura,25
1263,405676,oliver bias,2001-06-15,"[[], [rasenballsport-leipzig]]",71.0,17144,oliver bundgaard kristensen,Randers FC,o. bundgaard,bias,36
1369,45403,andre ayew,1989-12-17,"[olympique-marseille, swansea-city, west-ham-u...",73.0,2024,andre hansen,Rosenborg BK,a. hansen,ayew,44
1714,251860,javi jimenez,1997-03-11,"[fc-valencia, celta-vigo]",74.0,8498,alejandro jair cabeza jimenez,CS Emelec,a. cabeza,jimenez,40
1752,479233,younes delfi,2000-10-02,rsc-charleroi,70.0,19006,alfie jones,Crawley Town,a. jones,delfi,0
1975,587776,jonathan bustos,1994-06-29,ae-larisa,70.0,2152,jonathan cristian silva,Getafe CF,j. silva,bustos,18


In [27]:
#link.loc[1752, 'last_name_compare'] = 00

In [30]:
link = link.drop(link.loc[(link.last_name_compare<50) & (link.fuzz_score <100),:].index).reset_index(drop = True)
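
The drop rule above removes a row only when both signals are weak: the short-name score is below 50 and the full-name score is not a perfect 100. Here is a plain-Python sketch of that rule on three rows taken from the tables in this notebook; the function name `keep_match` and its parameter names are hypothetical.

```python
def keep_match(row, last_name_min=50, perfect=100):
    """Keep a row unless the short-name score is low AND the full-name score is imperfect."""
    return not (row["last_name_compare"] < last_name_min and row["fuzz_score"] < perfect)

rows = [
    {"name": "michael ngoo", "fuzz_score": 74.0, "last_name_compare": 27},
    {"name": "ivan martic", "fuzz_score": 100.0, "last_name_compare": 86},
    {"name": "josh sheehan", "fuzz_score": 77.0, "last_name_compare": 88},
]
kept = [row["name"] for row in rows if keep_match(row)]
print(kept)  # ['ivan martic', 'josh sheehan']
```

Raising `last_name_min` makes the filter stricter, at the cost of discarding more correct matches whose FIFA short name is a nickname.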

In [31]:
link.head()

Unnamed: 0,player_id,name,date_of_birth,clubs,fuzz_score,fifa_index,name_compare,fifa_clubs,fifa_last_name,last_name,last_name_compare
0,73048,ivan martic,1990-10-02,hellas-verona,100.0,7603,ivan martic,FC Sion,i. martic,martic,86
1,70203,ze eduardo,1991-08-16,cesena-fc,82.0,12284,jose eduardo de araujo,Sandefjord Fotball,ze eduardo,eduardo,100
2,221815,constantin nica,1993-03-18,cesena-fc,100.0,15218,constantin nica,FC Dinamo 1948 Bucuresti,c. nica,nica,80
3,197830,josh sheehan,1995-03-30,swansea-city,77.0,8961,joshua luke sheehan,Bolton Wanderers,j. sheehan,sheehan,88
4,89648,david cornell,1991-03-28,swansea-city,100.0,13427,david cornell,Peterborough United,d. cornell,cornell,88


In [None]:
link.drop(columns = ['date_of_birth', 'clubs', 'fuzz_score', 'fifa_clubs', 'fifa_last_name', 'last_name', 'last_name_compare'])\
    .to_csv('link_fifa22.csv')