# Exploration

In this notebook we explore the raw data to identify analysis paths that we could follow in the next steps. We will use the following dataset:

* FIFA players: statistics for players in the FIFA 18 videogame

---

First, we import the necessary libraries.

In [1]:
import pandas as pd

## FIFA Players

This dataset contains information about the players in the videogame FIFA 2018.

It was retrieved from the GitHub repository that accompanies a YouTube course on Matplotlib, made by user Keith Galli for the freeCodeCamp.org YouTube channel. It is available at:

:link: https://github.com/KeithGalli/matplotlib_tutorial/blob/master/fifa_data.csv 

In [18]:
# Load the csv in a pandas DataFrame
fifa = pd.read_csv('../data/raw/fifa.csv')

In [3]:
fifa.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,...,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,...,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,...,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,...,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,...,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,...,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,€196.4M


In this case, we have 88 columns that we could use in our analysis (89 in total, counting 'Unnamed: 0', which only duplicates the row index). This many variables makes the analysis richer, but also more complex than this project requires. However, we can still retrieve useful information by working with a reduced set of variables, as we will see later on.

In [4]:
fifa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                18207 non-null  int64  
 1   ID                        18207 non-null  int64  
 2   Name                      18207 non-null  object 
 3   Age                       18207 non-null  int64  
 4   Photo                     18207 non-null  object 
 5   Nationality               18207 non-null  object 
 6   Flag                      18207 non-null  object 
 7   Overall                   18207 non-null  int64  
 8   Potential                 18207 non-null  int64  
 9   Club                      17966 non-null  object 
 10  Club Logo                 18207 non-null  object 
 11  Value                     18207 non-null  object 
 12  Wage                      18207 non-null  object 
 13  Special                   18207 non-null  int64  
 14  Prefer

The dataset contains information about 18207 different players in the game. Some columns are merely descriptive, while others are related to the players' performance in the game. There are some missing values, but most of them are in columns that we will probably not use or that we can easily handle, so they are not a big concern. However, we should keep in mind the column 'Loaned From', which contains only 1264 non-null values and could tell us which patterns players on loan have in common, should we want to analyze loans.
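To see exactly which columns are affected, the missing values can be counted per column. The snippet below is a minimal sketch on a tiny made-up frame (`sample` and its values are assumptions, not rows from the dataset); on the real data the same two lines would be applied to `fifa`.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame (illustrative values only)
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Club": ["X", None, "Y", "Z"],
    "Loaned From": [None, None, "Y", None],
})

# Count missing values per column and keep only the columns that have any
missing = sample.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Sorting in descending order puts the sparsest columns first, which makes columns such as 'Loaned From' stand out immediately.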

In [5]:
fifa.describe()

Unnamed: 0.1,Unnamed: 0,ID,Age,Overall,Potential,Special,International Reputation,Weak Foot,Skill Moves,Jersey Number,...,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
count,18207.0,18207.0,18207.0,18207.0,18207.0,18207.0,18159.0,18159.0,18159.0,18147.0,...,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0
mean,9103.0,214298.338606,25.122206,66.238699,71.307299,1597.809908,1.113222,2.947299,2.361308,19.546096,...,48.548598,58.648274,47.281623,47.697836,45.661435,16.616223,16.391596,16.232061,16.388898,16.710887
std,5256.052511,29965.244204,4.669943,6.90893,6.136496,272.586016,0.394031,0.660456,0.756164,15.947765,...,15.704053,11.436133,19.904397,21.664004,21.289135,17.695349,16.9069,16.502864,17.034669,17.955119
min,0.0,16.0,16.0,46.0,48.0,731.0,1.0,1.0,1.0,1.0,...,5.0,3.0,3.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0
25%,4551.5,200315.5,21.0,62.0,67.0,1457.0,1.0,3.0,2.0,8.0,...,39.0,51.0,30.0,27.0,24.0,8.0,8.0,8.0,8.0,8.0
50%,9103.0,221759.0,25.0,66.0,71.0,1635.0,1.0,3.0,2.0,17.0,...,49.0,60.0,53.0,55.0,52.0,11.0,11.0,11.0,11.0,11.0
75%,13654.5,236529.5,28.0,71.0,75.0,1787.0,1.0,3.0,3.0,26.0,...,60.0,67.0,64.0,66.0,64.0,14.0,14.0,14.0,14.0,14.0
max,18206.0,246620.0,45.0,94.0,95.0,2346.0,5.0,5.0,5.0,99.0,...,92.0,96.0,94.0,93.0,91.0,90.0,92.0,91.0,90.0,94.0


## Deeper exploration

In order to perform a more focused analysis, we will drop the columns with detailed skills and stats for every area of the game, keeping only the general stats for each player. We will also get rid of the columns that contain links to images, such as national flags, club logos or real photos, and other minor stats within the game.

Following those guidelines, we will only use the following 18 columns:

* **Name:** player's name. 

* **Age:** player's age.

* **Nationality:** player's nationality. 

* **Overall:** player's overall performance score within the game.

* **Potential:** player's potential overall performance score within the game.

* **Value:** player's current market value.

* **International Reputation:** player's global reputation.

* **Club:** player's current club.

* **Wage:** player's wage at their current club.

* **Preferred Foot:** player's preferred foot.

* **Weak Foot:** player's ability with their weaker foot.

* **Position:** player's field position.

* **Jersey Number:** player's jersey number.

* **Height:** player's height.

* **Weight:** player's weight.

* **Joined:** date on which the player joined their current club.

* **Contract Valid Until:** expiration year of the player's current contract.

* **Release Clause:** player's current contract release clause.

By doing this, we can explore the dataset with less noise, focusing on the columns we are interested in.

In [10]:
fifa_reduced = fifa.loc[:, ['Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Value', 'International Reputation', 'Club', 
                            'Wage', 'Preferred Foot', 'Weak Foot', 'Position', 'Jersey Number', 'Height', 'Weight', 'Joined', 
                            'Contract Valid Until', 'Release Clause']]
fifa_reduced.head()

Unnamed: 0,Name,Age,Nationality,Overall,Potential,Value,International Reputation,Club,Wage,Preferred Foot,Weak Foot,Position,Jersey Number,Height,Weight,Joined,Contract Valid Until,Release Clause
0,L. Messi,31,Argentina,94,94,€110.5M,5.0,FC Barcelona,€565K,Left,4.0,RF,10.0,5'7,159lbs,"Jul 1, 2004",2021,€226.5M
1,Cristiano Ronaldo,33,Portugal,94,94,€77M,5.0,Juventus,€405K,Right,4.0,ST,7.0,6'2,183lbs,"Jul 10, 2018",2022,€127.1M
2,Neymar Jr,26,Brazil,92,93,€118.5M,5.0,Paris Saint-Germain,€290K,Right,5.0,LW,10.0,5'9,150lbs,"Aug 3, 2017",2022,€228.1M
3,De Gea,27,Spain,91,93,€72M,4.0,Manchester United,€260K,Right,3.0,GK,1.0,6'4,168lbs,"Jul 1, 2011",2020,€138.6M
4,K. De Bruyne,27,Belgium,91,92,€102M,4.0,Manchester City,€355K,Right,5.0,RCM,7.0,5'11,154lbs,"Aug 30, 2015",2023,€196.4M


Let's now take a look at some of the columns here:

* Value is a string with a currency symbol and a magnitude suffix (for example, '€110.5M', in millions of euros). We may need to transform this column into strictly numerical data.

* Wage follows the same format, expressed in thousands of euros (for example, '€565K'). We may need to transform this column into strictly numerical data.

* Jersey Number, International Reputation and Weak Foot are floating-point numbers, but they do not seem to have a decimal part (we will confirm this assumption later). For a cleaner presentation, we should cast these columns to integers.

* Height and Weight are expressed in imperial units (feet and inches, pounds). We want them in International System (metric) units, with the units removed from the values.

* Join dates are stored as strings. We want them in a date format or, at least, split into three columns (day, month and year).

* Release Clause is expressed in millions of euros and uses the same string format as Value. We need to transform its values into strictly numerical data.
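The transformations listed above could be sketched with a few small helpers. This is only one possible approach: the function names (`parse_euros`, `height_to_cm`, `weight_to_kg`, `parse_joined`) are hypothetical, the choice of centimetres and kilograms is an assumption, and the input formats are inferred from the five rows shown in the table above, so other formats in the full dataset would need extra handling.

```python
from datetime import datetime

def parse_euros(text):
    """Convert strings such as '€110.5M' or '€565K' to a float amount in euros."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    cleaned = text.replace("€", "").strip()
    if cleaned and cleaned[-1] in multipliers:
        return float(cleaned[:-1]) * multipliers[cleaned[-1]]
    # No suffix: assume the amount is already expressed in euros
    return float(cleaned)

def height_to_cm(text):
    """Convert a feet'inches string such as "5'7" to centimetres."""
    feet, inches = text.split("'")
    return round((int(feet) * 12 + int(inches)) * 2.54, 1)

def weight_to_kg(text):
    """Convert a string such as '159lbs' to kilograms."""
    return round(float(text.replace("lbs", "")) * 0.45359237, 1)

def parse_joined(text):
    """Convert a string such as 'Jul 1, 2004' to a date (English month names)."""
    return datetime.strptime(text, "%b %d, %Y").date()

print(parse_euros("€110.5M"), parse_euros("€565K"))
print(height_to_cm("5'7"), weight_to_kg("159lbs"))
print(parse_joined("Jul 1, 2004"))
```

On the DataFrame, each helper could be applied with something like `fifa_reduced['Value'].map(parse_euros, na_action='ignore')`, so that missing values are skipped instead of raising an error.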

In [7]:
print('Reputation: ' + str(fifa_reduced['International Reputation'].unique()))
print('Foot: ' + str(fifa_reduced['Weak Foot'].unique()))
print('Jersey: ' + str(fifa_reduced['Jersey Number'].unique()))

Reputation: [ 5.  4.  3.  2.  1. nan]
Foot: [ 4.  5.  3.  2.  1. nan]
Jersey: [10.  7.  1.  9. 15.  8. 21. 13. 22.  5.  3. 14. 12. 11.  2. 23. 26.  6.
 17. 18.  4. 19. 31. 25. 37. 30. 44. 29. 24. 20. 16. 33. 28. 27. 77. 47.
 38. 40. 92. 36. 87. 34. 32. 83. 70. 35. 89. 56. 99. 57. 91. 86. 45. 63.
 39. 43. 42. 93. 72. 71. 88. 55. 80. 50. 66. 60. 73. 67. 74. 69. 76. 41.
 90. 46. 75. 79. 62. 81. 61. 49. 95. 53. 96. 97. 68. 98. 94. 58. 78. nan
 48. 52. 54. 84. 82. 65. 64. 51. 59. 85.]


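As we thought, International Reputation, Weak Foot and Jersey Number store only integer values despite being floating-point numbers. The recommended approach is to cast these columns to integers. Since the three columns contain missing values (nan), these have to be filled or dropped before casting, or the cast has to target a nullable integer type.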
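As a minimal sketch of the cast, assuming we want to keep the missing values for now, pandas' nullable integer dtype `'Int64'` can be used. The `ratings` series below is made-up sample data standing in for one of the three columns.

```python
import pandas as pd

# Illustrative stand-in for a column such as 'International Reputation'
ratings = pd.Series([5.0, 4.0, None, 1.0], name="International Reputation")

# Confirm that every non-missing value is a whole number before casting
non_missing = ratings.dropna()
all_whole = bool((non_missing == non_missing.round()).all())
print(all_whole)

# 'Int64' (capital I) is the nullable integer dtype: missing values become <NA>
as_int = ratings.astype("Int64")
print(as_int)
```

A plain `astype(int)` would raise an error here because of the missing value, which is why the nullable dtype (or a previous fill or drop) is needed.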

In [8]:
fifa_reduced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Name                      18207 non-null  object 
 1   Age                       18207 non-null  int64  
 2   Nationality               18207 non-null  object 
 3   Overall                   18207 non-null  int64  
 4   Potential                 18207 non-null  int64  
 5   Value                     18207 non-null  object 
 6   International Reputation  18159 non-null  float64
 7   Club                      17966 non-null  object 
 8   Wage                      18207 non-null  object 
 9   Preferred Foot            18159 non-null  object 
 10  Weak Foot                 18159 non-null  float64
 11  Position                  18147 non-null  object 
 12  Jersey Number             18147 non-null  float64
 13  Height                    18159 non-null  object 
 14  Weight

We have information about 18207 players, but there are missing values in the following columns:

* **International Reputation.** It is an integer value. We could replace missing values with the minimum value or drop those rows.

* **Club.** Players without a club are usually free agents.

* **Preferred Foot.** The vast majority of players are right-footed, so this is a safe option for replacement.

* **Weak Foot.** A missing value here is probably due to a lack of data. We will assume that the player has not shown enough skill with their weaker foot and replace it with the minimum possible value.

* **Position.** Field position has a key impact on how a player's performance is determined. We will drop the rows where the position is not known.

* **Jersey Number.** We will use this field only to compute some correlations, so we will simply exclude missing values from any analysis that involves this field.

* **Height.** The mean or the median is probably the best replacement for this field.

* **Weight.** The mean or the median is probably the best replacement for this field.

* **Joined.** Probably free agents.

* **Contract Valid Until.** A missing expiration date probably means that there is no contract at all (free agent). If we decide to fill it, we should use the year the data was collected (2018).

* **Release Clause.** A missing release clause means that there is no such clause (or no contract), so we should replace it with 0.
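Some of the guidelines above could be sketched as follows. The `sample` frame is made-up data, and the replacement labels ('Free Agent', '€0') are assumptions chosen for illustration; the actual cleaning may use different values.

```python
import pandas as pd

# Illustrative stand-in for fifa_reduced (not real rows)
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Position": ["ST", None, "GK", "CB"],
    "Preferred Foot": ["Right", "Left", None, "Right"],
    "Weak Foot": [4.0, 3.0, None, 2.0],
    "Club": ["X", "Y", None, "Z"],
    "Release Clause": ["€1M", "€2M", None, "€500K"],
})

# Position: drop the rows where it is unknown
cleaned = sample.dropna(subset=["Position"]).copy()

# Preferred Foot: fill with the most frequent value
most_common_foot = cleaned["Preferred Foot"].mode().iloc[0]
cleaned["Preferred Foot"] = cleaned["Preferred Foot"].fillna(most_common_foot)

# Weak Foot: fill with 1, the lowest score on the 1-5 scale
cleaned["Weak Foot"] = cleaned["Weak Foot"].fillna(1)

# Club and Release Clause: assumed labels for players without a contract
cleaned["Club"] = cleaned["Club"].fillna("Free Agent")
cleaned["Release Clause"] = cleaned["Release Clause"].fillna("€0")

print(cleaned)
```

Height and Weight are left out of this sketch because a mean or median can only be computed once those columns have been converted to numbers.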

In [11]:
fifa_reduced.describe()

Unnamed: 0,Age,Overall,Potential,International Reputation,Weak Foot,Jersey Number
count,18207.0,18207.0,18207.0,18159.0,18159.0,18147.0
mean,25.122206,66.238699,71.307299,1.113222,2.947299,19.546096
std,4.669943,6.90893,6.136496,0.394031,0.660456,15.947765
min,16.0,46.0,48.0,1.0,1.0,1.0
25%,21.0,62.0,67.0,1.0,3.0,8.0
50%,25.0,66.0,71.0,1.0,3.0,17.0
75%,28.0,71.0,75.0,1.0,3.0,26.0
max,45.0,94.0,95.0,5.0,5.0,99.0


Let's take a look at these fields individually:

* **Age:** integer value between 16 and 45, centered around 25 years.

* **Overall:** integer value between 46 and 94, centered around 66 points. If we only want to analyze the performance of relevant players, this would be a suitable field to filter by.

* **Potential:** integer value between 48 and 95, centered around 71 points. Given its meaning, it should be greater than or equal to the Overall score. If we only want to analyze the performance of relevant players, this would be a suitable field to filter by.

* **International Reputation:** integer value between 1 and 5. The vast majority of players have an International Reputation of 1 (even the 75th percentile is 1). If we only want to analyze the performance of relevant players, this would be a suitable field to filter by.

* **Weak Foot:** integer value between 1 and 5. Most players have a score of 3 in this field.

* **Jersey Number:** integer value between 1 and 99.
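The expectation that Potential is never lower than Overall can be verified directly. The snippet below is a small sketch on made-up rows (`sample` is an assumption); on the real data the same comparison would be run on `fifa_reduced`.

```python
import pandas as pd

# Illustrative stand-in for fifa_reduced (not real rows)
sample = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Overall": [94, 91, 70],
    "Potential": [94, 93, 82],
})

# Rows that break the expected rule Potential >= Overall
violations = sample[sample["Potential"] < sample["Overall"]]
print(f"Rows with Potential below Overall: {len(violations)}")

# Room for improvement of each player
growth = sample["Potential"] - sample["Overall"]
print(growth.tolist())
```

An empty `violations` frame would confirm the assumption; any rows in it would deserve a closer look during cleaning.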

We have information about more than eighteen thousand players, but many of them are not elite or top-level professional players. If we want to reduce the noise in our analysis, we should remove the players that are not relevant. With that in mind, there are different ways of doing so:

1. The vast majority of players have an International Reputation of 1, so they are not globally relevant enough. We could keep only the players with an International Reputation above 1.

2. Filter players by their Overall score. This method keeps those players who can make an impact on the field while being unknown around the world.

3. Filter by the club's mean Overall score. This method is similar to the previous one, but filtering by club ensures that every player of a selected club remains in the dataset.

Let's check these options.

In [12]:
# Players with the lowest International Reputation (1)
unknown_players = fifa_reduced.loc[fifa_reduced['International Reputation'] == 1]

unknown_players.describe()

Unnamed: 0,Age,Overall,Potential,International Reputation,Weak Foot,Jersey Number
count,16532.0,16532.0,16532.0,16532.0,16532.0,16520.0
mean,24.722599,65.187031,70.65104,1.0,2.921425,19.91592
std,4.53885,6.122061,5.754878,0.0,0.647883,16.131025
min,16.0,46.0,48.0,1.0,1.0,1.0
25%,21.0,61.0,67.0,1.0,3.0,9.0
50%,24.0,66.0,70.0,1.0,3.0,18.0
75%,28.0,69.0,74.0,1.0,3.0,27.0
max,44.0,85.0,92.0,1.0,5.0,99.0


As we can see, there are players with the lowest International Reputation but high Overall and Potential scores within the game (up to 85 and 92, respectively). This means that International Reputation is not a proper way to filter our dataset: we would be losing information about solid players.

In [13]:
# Players whose Overall score is below 65
low_players = fifa_reduced.loc[fifa_reduced['Overall'] < 65]

low_players.describe()

Unnamed: 0,Age,Overall,Potential,International Reputation,Weak Foot,Jersey Number
count,7235.0,7235.0,7235.0,7187.0,7187.0,7180.0
mean,22.757429,59.605667,67.808155,1.001252,2.816335,23.202507
std,4.366682,3.890539,5.443552,0.035368,0.600997,16.471935
min,16.0,46.0,48.0,1.0,1.0,1.0
25%,19.0,57.0,64.0,1.0,2.0,12.0
50%,22.0,61.0,67.0,1.0,3.0,21.0
75%,25.0,63.0,72.0,1.0,3.0,30.0
max,44.0,64.0,86.0,2.0,5.0,99.0


If we filter by Overall, we could be deleting information about young players who could be elite in the future (the maximum Potential in this group is 86). It does not seem a suitable filter option either.

In [14]:
# Players whose Potential score is below 65
low_potential_players = fifa_reduced.loc[fifa_reduced['Potential'] < 65]

low_potential_players.describe()

Unnamed: 0,Age,Overall,Potential,International Reputation,Weak Foot,Jersey Number
count,2334.0,2334.0,2334.0,2314.0,2314.0,2308.0
mean,26.877892,59.250214,61.927164,1.003457,2.798617,19.734402
std,4.655109,4.295846,2.318751,0.058709,0.625221,15.471581
min,17.0,47.0,48.0,1.0,1.0,1.0
25%,24.0,56.0,61.0,1.0,2.0,9.0
50%,27.0,61.0,63.0,1.0,3.0,18.0
75%,30.0,63.0,64.0,1.0,3.0,27.0
max,44.0,64.0,64.0,2.0,5.0,99.0


Filtering by Potential only removes 2334 players from the dataset, so it barely reduces the noise.

In [15]:
# 1. Defining a Threshold for Overall Club mean
threshold = 70

# 2. Calculating club mean
club_means = fifa_reduced.groupby('Club')['Overall'].transform('mean')

# 3. Filtering the original data
elite_clubs_players = fifa_reduced[club_means > threshold]

# Verification
print(f"Original Clubs: {fifa_reduced['Club'].nunique()}")
print(f"Clubs after filtering: {elite_clubs_players['Club'].nunique()}")
elite_clubs_players.describe()

Original Clubs: 651
Clubs after filtering: 125


Unnamed: 0,Age,Overall,Potential,International Reputation,Weak Foot,Jersey Number
count,3560.0,3560.0,3560.0,3556.0,3556.0,3556.0
mean,25.153371,72.932303,77.860112,1.415917,3.060742,20.866142
std,4.651685,6.737787,5.247575,0.696759,0.697369,17.861006
min,16.0,50.0,58.0,1.0,1.0,1.0
25%,21.0,69.0,74.0,1.0,3.0,9.0
50%,25.0,74.0,78.0,1.0,3.0,17.0
75%,28.0,77.0,81.0,2.0,3.0,27.0
max,40.0,94.0,95.0,5.0,5.0,99.0


This last option seems to be a better way to reduce our dataset. It keeps the players with low Overall scores who play in competitive environments. The club mean could also be used to split our analysis into different club categories, so maybe it is better to avoid filtering at all.
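If we go for club categories instead of filtering, one possible sketch is to bin the club mean with `pd.cut`. The tier boundaries (65 and 75), the tier labels and the 'Club Tier' column name are hypothetical choices for illustration, and `sample` is made-up data.

```python
import pandas as pd

# Illustrative stand-in for fifa_reduced (not real rows)
sample = pd.DataFrame({
    "Club": ["A", "A", "B", "B", "C", "C"],
    "Overall": [90, 80, 72, 66, 60, 58],
})

# Mean Overall of each player's club, aligned with the original rows
club_mean = sample.groupby("Club")["Overall"].transform("mean")

# Hypothetical tiers: (0, 65] low, (65, 75] mid, (75, 100] elite
sample["Club Tier"] = pd.cut(
    club_mean, bins=[0, 65, 75, 100], labels=["low", "mid", "elite"]
)
print(sample)
```

Players without a club get no club mean, so they would end up without a tier and would have to be handled separately.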

We will decide later if we are going to filter our dataset rows or not.

Now let's take a look at the position column:

In [16]:
print(fifa['Position'].nunique())
fifa['Position'].unique()

27


array(['RF', 'ST', 'LW', 'GK', 'RCM', 'LF', 'RS', 'RCB', 'LCM', 'CB',
       'LDM', 'CAM', 'CDM', 'LS', 'LCB', 'RM', 'LAM', 'LM', 'LB', 'RDM',
       'RW', 'CM', 'RB', 'RAM', 'CF', 'RWB', 'LWB', nan], dtype=object)

There are 27 different field positions that a player can fill. So many categories could make the analysis difficult to read, so we want to group them into more general ones based on tactical roles:

* **GK (goalkeeper):** 'GK'.

* **DEF (defenders):** 'CB', 'RCB', 'LCB', 'LB', 'RB', 'LWB', 'RWB'.

* **MDF (midfielders):** 'CM', 'LCM', 'RCM', 'CDM', 'LDM', 'RDM', 'CAM', 'LAM', 'RAM', 'LM', 'RM'.

* **ATT (attackers):** 'ST', 'LS', 'RS', 'CF', 'RF', 'LF', 'LW', 'RW'.
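The grouping above can be written as a lookup table. This is a minimal sketch; the names `POSITION_GROUPS` and `POSITION_TO_GROUP` are hypothetical.

```python
# Tactical groups and the detailed positions that belong to each of them
POSITION_GROUPS = {
    "GK": ["GK"],
    "DEF": ["CB", "RCB", "LCB", "LB", "RB", "LWB", "RWB"],
    "MDF": ["CM", "LCM", "RCM", "CDM", "LDM", "RDM",
            "CAM", "LAM", "RAM", "LM", "RM"],
    "ATT": ["ST", "LS", "RS", "CF", "RF", "LF", "LW", "RW"],
}

# Invert the table: detailed position -> tactical group
POSITION_TO_GROUP = {
    position: group
    for group, positions in POSITION_GROUPS.items()
    for position in positions
}

print(len(POSITION_TO_GROUP))
print(POSITION_TO_GROUP["RCM"], POSITION_TO_GROUP["LWB"])
```

On the DataFrame, the mapping could be applied with something like `fifa_reduced['Position'].map(POSITION_TO_GROUP)`; positions that are missing or not in the table would come out as missing values.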

## Next steps

Now that we have taken a look at the data we are going to use, we can establish the steps we will take to clean the dataset before the analysis:

1. Select only the columns that we are going to use in our analysis.

2. Transform the fields that need it, following the guidelines described after checking the dataset head. We will add the position grouping too.

3. Fill or drop missing values according to the guidelines we established after checking the dataset info.

The cleaning process will be executed and explained in more detail in the notebook '02_cleaning.ipynb', located in the same folder as this one.