## Imports and Defines

In [47]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

In [48]:
teams = pd.read_csv('./src/teams.csv')
players = pd.read_csv('./src/players.csv')
matches = pd.read_csv('./src/matches.csv')

## Players Dataset

Let's take a look at the data in the Players table.

In [49]:
players.sort_values('gk_reflexes', ascending=False).head(4)

Unnamed: 0,player_api_id,player_name,birthday,height,weight,date,overall_rating,potential,preferred_foot,attacking_work_rate,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
72609,30657,Iker Casillas,1981-05-20 00:00:00,185.42,185,2008-08-30 00:00:00,91.0,92.0,left,medium,...,51.0,94.0,22.0,22.0,9.0,94.0,88.0,71.0,90.0,96.0
72608,30657,Iker Casillas,1981-05-20 00:00:00,185.42,185,2009-02-22 00:00:00,91.0,92.0,left,medium,...,51.0,94.0,22.0,22.0,9.0,94.0,88.0,74.0,90.0,96.0
72606,30657,Iker Casillas,1981-05-20 00:00:00,185.42,185,2010-02-22 00:00:00,90.0,92.0,left,medium,...,51.0,94.0,22.0,22.0,9.0,93.0,87.0,74.0,91.0,94.0
159247,24503,Sebastian Frey,1980-03-18 00:00:00,190.5,198,2007-08-30 00:00:00,82.0,85.0,right,medium,...,52.0,82.0,21.0,21.0,34.0,84.0,75.0,83.0,86.0,94.0


In [50]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183978 entries, 0 to 183977
Data columns (total 44 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   player_api_id        183978 non-null  int64  
 1   player_name          183978 non-null  object 
 2   birthday             183978 non-null  object 
 3   height               183978 non-null  float64
 4   weight               183978 non-null  int64  
 5   date                 183978 non-null  object 
 6   overall_rating       183142 non-null  float64
 7   potential            183142 non-null  float64
 8   preferred_foot       183142 non-null  object 
 9   attacking_work_rate  177109 non-null  object 
 10  defensive_work_rate  183142 non-null  object 
 11  crossing             183142 non-null  float64
 12  finishing            183142 non-null  float64
 13  heading_accuracy     183142 non-null  float64
 14  short_passing        183142 non-null  float64
 15  volleys          

Convert the date columns to the datetime type.

In [51]:
players['date'] = players['date'].astype('datetime64[ns]')
players['birthday'] = players['birthday'].astype('datetime64[ns]')
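As a side note, the same conversion can be sketched with `pd.to_datetime`, which parses the string format seen in the output above. The toy series below is hypothetical and only mirrors the shape of the `birthday` and `date` values; it is not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical sample strings in the same format as the 'birthday' / 'date' columns.
raw_dates = pd.Series(['1981-05-20 00:00:00', '2008-08-30 00:00:00'])

# pd.to_datetime parses the strings into a datetime64 series.
parsed_dates = pd.to_datetime(raw_dates)

# The .dt accessor becomes available once the dtype is datetime.
parsed_years = parsed_dates.dt.year.tolist()
print(parsed_years)
```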

Keep only the most recent record for each player.

In [52]:
players['rnk'] = players.groupby('player_api_id')['date'].rank(method='first', ascending=False)
players.query('rnk == 1', inplace=True)
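To illustrate what this step does, here is a minimal sketch on hypothetical toy data (the `ratings` frame and its values are invented for the example). It shows the rank-based selection used above next to a sort-and-deduplicate alternative; on this data both keep the same rows.

```python
import pandas as pd

# Hypothetical toy data: two players with several dated rating snapshots each.
ratings = pd.DataFrame({
    'player_api_id': [1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2014-01-01', '2015-06-01',
                            '2013-03-01', '2016-02-01', '2015-09-01']),
    'overall_rating': [70.0, 72.0, 65.0, 68.0, 67.0],
})

# Approach from the notebook: rank dates within each player, newest first,
# and keep the rows ranked 1.
rnk = ratings.groupby('player_api_id')['date'].rank(method='first', ascending=False)
via_rank = ratings[rnk == 1].sort_values('player_api_id').reset_index(drop=True)

# Alternative sketch: sort newest first, then keep the first row per player.
via_dedup = (
    ratings.sort_values('date', ascending=False)
           .drop_duplicates('player_api_id', keep='first')
           .sort_values('player_api_id')
           .reset_index(drop=True)
)

print(via_dedup)
```

The deduplication variant avoids the temporary `rnk` column, so there is nothing to drop afterwards.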

Drop the categorical attributes: preferred_foot, attacking_work_rate, defensive_work_rate.
Also drop the attributes we cannot use (height, weight) and the helper column rnk.

In [55]:
players.drop(columns=['preferred_foot', 'attacking_work_rate', 'defensive_work_rate', 'height', 'weight', 'rnk'],
             inplace=True)

Check the attributes for missing values.

In [62]:
missing_players_values = players.isnull().sum()
missing_players_percentage = (missing_players_values / players.shape[0]) * 100
missing_players_percentage.sort_values(ascending=False)

volleys               4.321881
sliding_tackle        4.321881
agility               4.321881
curve                 4.321881
vision                4.321881
jumping               4.321881
balance               4.321881
standing_tackle       0.000000
marking               0.000000
penalties             0.000000
gk_diving             0.000000
gk_handling           0.000000
gk_kicking            0.000000
positioning           0.000000
interceptions         0.000000
aggression            0.000000
long_shots            0.000000
strength              0.000000
gk_positioning        0.000000
stamina               0.000000
player_api_id         0.000000
reactions             0.000000
shot_power            0.000000
heading_accuracy      0.000000
birthday              0.000000
date                  0.000000
overall_rating        0.000000
potential             0.000000
crossing              0.000000
finishing             0.000000
short_passing         0.000000
player_name           0.000000
dribblin

Fewer than 5% of the values are missing, so we fill the gaps with the column mean.

In [68]:
# Columns with at least one missing value
players_columns_to_fill = missing_players_percentage[missing_players_percentage > 0].index
# Per-column means (NaN values are skipped by default)
mean_players_values = players[players_columns_to_fill].mean()
players[players_columns_to_fill] = players[players_columns_to_fill].fillna(mean_players_values)
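The mean imputation above can be checked on a small example. This is a minimal sketch on a hypothetical `toy` frame; the column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two columns with gaps, one complete column.
toy = pd.DataFrame({
    'agility': [60.0, np.nan, 80.0],
    'vision': [50.0, 70.0, np.nan],
    'stamina': [65.0, 75.0, 85.0],
})

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

# Select only the columns that actually contain gaps.
cols_to_fill = missing_pct[missing_pct > 0].index

# fillna with a Series fills each column with its own mean (matched by column name).
toy[cols_to_fill] = toy[cols_to_fill].fillna(toy[cols_to_fill].mean())

print(toy)
```

Here the gap in `agility` is filled with 70.0 and the gap in `vision` with 60.0, while `stamina` is left untouched.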

## Teams Dataset

In [None]:
missing_teams_values = teams.isnull().sum()
missing_teams_percentage = (missing_teams_values / teams.shape[0]) * 100
missing_teams_percentage.sort_values(ascending=False)

Drop the attributes with more than 60% missing values. The remaining attributes have no missing values.

In [None]:
teams_columns_to_drop = missing_teams_percentage[missing_teams_percentage > 60].index
teams.drop(columns=teams_columns_to_drop, inplace=True)

## Matches Dataset

In [None]:
missing_matches_values = matches.isnull().sum()
missing_matches_percentage = (missing_matches_values / matches.shape[0]) * 100
missing_matches_percentage.sort_values(ascending=False)

Many attributes have a large share of missing values. Drop the attributes with more than 30% missing values.

In [None]:
matches_columns_to_drop = missing_matches_percentage[missing_matches_percentage > 30].index
matches.drop(columns=matches_columns_to_drop, inplace=True)
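The Teams and Matches sections apply the same pattern with different thresholds (60% and 30%). As a sketch, the pattern could be wrapped in a small helper; `drop_sparse_columns` and the toy frame below are hypothetical names and data introduced only for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold_pct):
    """Return a copy of df without columns whose share of missing values
    exceeds threshold_pct (in percent). Hypothetical helper for illustration."""
    missing_pct = df.isnull().mean() * 100
    columns_to_drop = missing_pct[missing_pct > threshold_pct].index
    return df.drop(columns=columns_to_drop)


# Hypothetical toy frame: 'a' is complete, 'b' is 75% missing, 'c' is 25% missing.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

kept_at_60 = drop_sparse_columns(toy, 60).columns.tolist()
kept_at_20 = drop_sparse_columns(toy, 20).columns.tolist()
print(kept_at_60, kept_at_20)
```

Unlike the `inplace=True` calls above, this helper returns a new frame, so the original data stays available for comparison.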