In [1]:
import pandas as pd
import numpy as np
import csv

Load the CSV in as a DataFrame

In [67]:
df = pd.read_csv('nationality_data/2020.csv')
df.head()

Unnamed: 0,Rk,Nation,# Players,Min,List
0,1,eng England,249,274172.0,Harry Kane Patrick Bamford Ollie Watkins Jamie...
1,2,fr France,32,38786.0,Hugo Lloris Neal Maupay Lucas Digne Alexandre ...
2,3,es Spain,30,39874.0,Pablo Fornals Vicente Guaita Rodri Adama Traor...
3,4,ie Republic of Ireland,29,22244.0,David McGoldrick Enda Stevens John Egan Dara O...
4,5,br Brazil,26,41438.0,Roberto Firmino Richarlison Raphinha Ederson G...


We can see that the "Nation" column has the country abbreviation followed by the country name, so we can split it into two separate columns by separating on the first space (' ').

In [68]:
abb = df['Nation'].apply(lambda x: x.split(' ', 1)[0])
df['Nation'] = df['Nation'].apply(lambda x: x.split(' ',1)[1])
df['Abbreviation'] = abb
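
As an aside, the same split can probably be done without `apply`, using the vectorized string accessor. The sketch below is only an illustration on a small hand-made frame (the name `sample` is made up here, and its values are copied from the output above); it is not the approach this notebook actually uses.

```python
import pandas as pd

# Toy frame mimicking the raw "Nation" column.
sample = pd.DataFrame({"Nation": ["eng England", "ie Republic of Ireland", "br Brazil"]})

# n=1 splits on the first space only; expand=True returns the two parts as columns 0 and 1.
parts = sample["Nation"].str.split(" ", n=1, expand=True)
sample["Abbreviation"] = parts[0]
sample["Nation"] = parts[1]

print(sample)
```

Splitting only once is what keeps multi-word names such as "Republic of Ireland" in one piece.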

In [69]:
df.head()

Unnamed: 0,Rk,Nation,# Players,Min,List,Abbreviation
0,1,England,249,274172.0,Harry Kane Patrick Bamford Ollie Watkins Jamie...,eng
1,2,France,32,38786.0,Hugo Lloris Neal Maupay Lucas Digne Alexandre ...,fr
2,3,Spain,30,39874.0,Pablo Fornals Vicente Guaita Rodri Adama Traor...,es
3,4,Republic of Ireland,29,22244.0,David McGoldrick Enda Stevens John Egan Dara O...,ie
4,5,Brazil,26,41438.0,Roberto Firmino Richarlison Raphinha Ederson G...,br


That worked, and now we can rearrange the columns. The List column holds player names with no separator between players, and unfortunately we can't fix this: we have no way of knowing which names are a first name plus a surname, which surnames contain a space ("Van Persie"), and which players go by a single name ("Richarlison").

In [70]:
df = df[['Rk', 'Nation', 'Abbreviation', '# Players', 'Min', 'List']]
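# Reassign instead of using inplace=True: df is a column selection, and renaming it in place can trigger SettingWithCopyWarning
df = df.rename(columns={'Min': 'Minutes', '# Players': 'Count', 'Rk': 'Rank'})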
df.head()

Unnamed: 0,Rank,Nation,Abbreviation,Count,Minutes,List
0,1,England,eng,249,274172.0,Harry Kane Patrick Bamford Ollie Watkins Jamie...
1,2,France,fr,32,38786.0,Hugo Lloris Neal Maupay Lucas Digne Alexandre ...
2,3,Spain,es,30,39874.0,Pablo Fornals Vicente Guaita Rodri Adama Traor...
3,4,Republic of Ireland,ie,29,22244.0,David McGoldrick Enda Stevens John Egan Dara O...
4,5,Brazil,br,26,41438.0,Roberto Firmino Richarlison Raphinha Ederson G...
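
To make the problem with the List column concrete, here is a small sketch of what goes wrong with the most obvious heuristic, pairing tokens two at a time. The helper `naive_pairs` and the second input string are made up for illustration; they are not part of the dataset or of this notebook's cleaning steps.

```python
def naive_pairs(names: str) -> list[str]:
    """Group whitespace-separated tokens two at a time, assuming 'First Last' names."""
    tokens = names.split()
    return [" ".join(tokens[i:i + 2]) for i in range(0, len(tokens), 2)]

# Works when every player really is 'First Last'.
print(naive_pairs("Harry Kane Ollie Watkins"))
# ['Harry Kane', 'Ollie Watkins']

# Breaks on a three-token name followed by a single-name player.
print(naive_pairs("Robin van Persie Richarlison"))
# ['Robin van', 'Persie Richarlison']
```

Both inputs have four tokens and two players, so even comparing the token count with the player count would not reveal the bad split.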


I want to change the 'Minutes' column to an integer type, but it contains NaN values that raise errors on conversion, so let's deal with those first. A NaN in the Minutes column most likely means that nation's players logged 0 minutes, so let's replace those values with 0.

In [76]:
print(df['Minutes'].isnull().sum())
df.tail(10)

5


Unnamed: 0,Rank,Nation,Abbreviation,Count,Minutes,List
56,57,IR Iran,ir,1,528.0,Alireza Jahanbakhsh
57,58,Guinea,gn,1,520.0,Naby Keïta
58,59,Mauritania,mr,1,315.0,Aboubakar Kamara
59,60,Bosnia and Herzegovina,ba,1,90.0,Sead Kolašinac
60,61,Canada,ca,1,9.0,Theo Corbeanu
61,62,Albania,al,1,,Meritan Shabani
62,63,Bulgaria,bg,1,,Sylvester Jasper
63,64,Ecuador,ec,1,,Moisés Caicedo
64,65,Romania,ro,1,,Florin Andone
65,66,Thailand,th,1,,Thanawat Suengchitthawon


We have 5 NaN values, which should all be at the tail end of the DataFrame. The last five rows above confirm this, so we can change those values to 0 and then convert the column to integers.
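
Rather than relying on the eye alone, the position of the missing values could also be checked in code. This is a minimal sketch on a made-up three-row frame (`sample`, `is_missing`, `n_missing`, and `at_tail` are illustrative names, not variables from this notebook).

```python
import numpy as np
import pandas as pd

# Toy frame: one played row followed by two rows with missing minutes.
sample = pd.DataFrame({
    "Nation": ["Canada", "Albania", "Bulgaria"],
    "Minutes": [9.0, np.nan, np.nan],
})

is_missing = sample["Minutes"].isna()
n_missing = int(is_missing.sum())

# Show exactly which rows are missing a value.
missing = sample[is_missing]
print(missing)

# True only if there are missing values and they occupy the last n_missing rows.
at_tail = n_missing > 0 and bool(is_missing.tail(n_missing).all())
print(at_tail)
```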

In [90]:
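# Fill only the Minutes column, so a missing value elsewhere is not silently turned into 0
df = df.fillna({'Minutes': 0})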
df['Minutes'] = df['Minutes'].astype('int64')
print(df.dtypes)

Rank             int64
Nation          object
Abbreviation    object
Count            int64
Minutes          int64
List            object
dtype: object
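
For comparison, here is a hedged sketch of why the cast fails on NaN and of one alternative to filling with 0: pandas' nullable `Int64` dtype, which keeps the gap as `<NA>`. The series `minutes` is a made-up stand-in, and the nullable dtype is an option shown for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd

minutes = pd.Series([90.0, 9.0, np.nan], name="Minutes")

# A plain int64 cast cannot represent NaN, so pandas raises a ValueError.
try:
    minutes.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# Option used in this notebook: treat missing as 0, then cast.
filled = minutes.fillna(0).astype("int64")

# Alternative: the nullable integer dtype stores integers and keeps the gap as <NA>.
nullable = minutes.astype("Int64")

print(cast_failed)
print(filled.tolist())
print(nullable)
```

Filling with 0 is simpler for later arithmetic, while `Int64` preserves the distinction between "no data" and "zero minutes" if that ever matters.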


In [92]:
df.tail(10)

Unnamed: 0,Rank,Nation,Abbreviation,Count,Minutes,List
56,57,IR Iran,ir,1,528,Alireza Jahanbakhsh
57,58,Guinea,gn,1,520,Naby Keïta
58,59,Mauritania,mr,1,315,Aboubakar Kamara
59,60,Bosnia and Herzegovina,ba,1,90,Sead Kolašinac
60,61,Canada,ca,1,9,Theo Corbeanu
61,62,Albania,al,1,0,Meritan Shabani
62,63,Bulgaria,bg,1,0,Sylvester Jasper
63,64,Ecuador,ec,1,0,Moisés Caicedo
64,65,Romania,ro,1,0,Florin Andone
65,66,Thailand,th,1,0,Thanawat Suengchitthawon


The DataFrame looks good, so we can save to a CSV file and prepare to clean our other files. 

In [93]:
df.to_csv('cleaned_data/2020.csv', index=False)
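
One practical note: `to_csv` does not create missing parent folders, so the output directory has to exist beforehand. A minimal sketch of creating it first, using a temporary folder as a stand-in for the project directory (`base`, `out_dir`, and `out_path` are illustrative names):

```python
from pathlib import Path
import tempfile

import pandas as pd

# Temporary folder standing in for the project directory.
base = Path(tempfile.mkdtemp())

# Create the output folder; exist_ok=True makes this safe to re-run.
out_dir = base / "cleaned_data"
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / "2020.csv"
pd.DataFrame({"Rank": [1], "Nation": ["England"]}).to_csv(out_path, index=False)
print(out_path.exists())
```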

Now, after visually checking that our other CSV files look similar, we can automate the cleaning of those files.

In [101]:
def clean_data(fname):
    df = pd.read_csv('nationality_data/{}.csv'.format(fname))
    abb = df['Nation'].apply(lambda x: x.split(' ', 1)[0])
    df['Nation'] = df['Nation'].apply(lambda x: x.split(' ',1)[1])
    df['Abbreviation'] = abb
    df = df[['Rk', 'Nation', 'Abbreviation', '# Players', 'Min', 'List']]
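    # Reassign rather than rename in place, to avoid SettingWithCopyWarning on the column selection
    df = df.rename(columns={'Min': 'Minutes', '# Players': 'Count', 'Rk': 'Rank'})
    # Fill only the Minutes column; other columns keep their missing values
    df = df.fillna({'Minutes': 0})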
    df['Minutes'] = df['Minutes'].astype('int64')
    return df
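
The visual check that the other files "look similar" could be backed up by a quick header check before running the cleaner over every season. The sketch below assumes the five raw column names seen in the 2020 file; the helper `missing_columns` and the two demo files are hypothetical and exist only for this example.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Raw column names that the cleaning function relies on (taken from the 2020 file).
EXPECTED = {"Rk", "Nation", "# Players", "Min", "List"}

def missing_columns(path):
    """Return the expected columns that are absent from the CSV at `path`."""
    header = pd.read_csv(path, nrows=0)  # nrows=0 reads only the header row
    return EXPECTED - set(header.columns)

# Demo on two throwaway files: one with the full header, one missing two columns.
tmp = Path(tempfile.mkdtemp())
good = tmp / "good.csv"
good.write_text("Rk,Nation,# Players,Min,List\n1,eng England,249,274172.0,Harry Kane\n")
bad = tmp / "bad.csv"
bad.write_text("Rk,Nation,Min\n1,eng England,10\n")

print(missing_columns(good))
print(missing_columns(bad))
```

Running such a check over each file first would surface a season with a different layout before it causes a `KeyError` partway through the loop.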

In [102]:
for i in range(1992, 2023):
    fname = str(i)
    df = clean_data(fname)
    df.to_csv('cleaned_data/{}.csv'.format(fname), index=False)
    print('Saved file {}.csv to cleaned_data'.format(fname))


Saved file 1992.csv to cleaned_data
Saved file 1993.csv to cleaned_data
Saved file 1994.csv to cleaned_data
Saved file 1995.csv to cleaned_data
Saved file 1996.csv to cleaned_data
Saved file 1997.csv to cleaned_data
Saved file 1998.csv to cleaned_data
Saved file 1999.csv to cleaned_data
Saved file 2000.csv to cleaned_data
Saved file 2001.csv to cleaned_data
Saved file 2002.csv to cleaned_data
Saved file 2003.csv to cleaned_data
Saved file 2004.csv to cleaned_data
Saved file 2005.csv to cleaned_data
Saved file 2006.csv to cleaned_data
Saved file 2007.csv to cleaned_data
Saved file 2008.csv to cleaned_data
Saved file 2009.csv to cleaned_data
Saved file 2010.csv to cleaned_data
Saved file 2011.csv to cleaned_data
Saved file 2012.csv to cleaned_data
Saved file 2013.csv to cleaned_data
Saved file 2014.csv to cleaned_data
Saved file 2015.csv to cleaned_data
Saved file 2016.csv to cleaned_data
Saved file 2017.csv to cleaned_data
Saved file 2018.csv to cleaned_data
Saved file 2019.csv to clean