# **STATIC DATA CLEANING**

## *Data Import & Stats Features Creation*
1) *Import **Packages***
2) *Import **Datasets***
3) **Concatenate** vertically *leagues* dataset with *cups* dataset 
4) Create *Features from **Stats Column*** (represented as a list)
5) **Merge** Original Data with *Stats Data* from point (4)

##### 1) Import Packages

In [1]:
# Import needed packages
import pandas as pd
import numpy as np
from collections.abc import MutableMapping
from tqdm.notebook import trange
from datetime import datetime
from datetime import timedelta
import dateutil.parser
# Change Pandas rows and columns' options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None  # default='warn'

Functions

In [2]:
def flatten(d, parent_key='', sep='_'): 
    """This function turns a nested dictionary into a flattened dictionary.
    d: nested dictionary
    """
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

##### 2) Import Datasets

In [3]:
# Import leagues_data from csv
leagues_data = pd.read_csv('../../Data/From_Collection/Match&Odds/leagues_static.csv', low_memory = False)
leagues_data.set_index('id', inplace = True)
# Import cups_data from csv
cups_data = pd.read_csv('../../Data/From_Collection/Match&Odds/cups_static.csv', low_memory = False)
cups_data.set_index('id', inplace = True)

##### 3) Concatenate Vertically Leagues Dataset and Cups Dataset 

In [4]:
# Print lenght of both dataframes
print('Lenght of Leagues DataFrame: ', len(leagues_data), '\nLenght of Cups DataFrame: ', len(cups_data))
# Compare columns between dataframes
print('Leagues and Cups have the same columns? ', set(cups_data.columns) == set(leagues_data.columns))

# Concatenate leagues_data and cups_data
complete_data = pd.concat([leagues_data, cups_data], ignore_index=False)
print('Concatenated dataset has same lenght as the sum of individual sets? ', len(complete_data)==len(leagues_data)+len(cups_data))
complete_data.tail(3)

Lenght of Leagues DataFrame:  19768 
Lenght of Cups DataFrame:  18305
Leagues and Cups have the same columns?  True
Concatenated dataset has same lenght as the sum of individual sets?  True


Unnamed: 0_level_0,league_id,season_id,stage_id,round_id,group_id,aggregate_id,venue_id,referee_id,localteam_id,visitorteam_id,winner_team_id,weather_report,commentaries,attendance,pitch,details,neutral_venue,winning_odds_calculated,formations_localteam_formation,formations_visitorteam_formation,scores_localteam_score,scores_visitorteam_score,scores_localteam_pen_score,scores_visitorteam_pen_score,scores_ht_score,scores_ft_score,scores_et_score,scores_ps_score,time_status,time_starting_at_date_time,time_starting_at_date,time_starting_at_time,time_starting_at_timestamp,time_starting_at_timezone,time_minute,time_second,time_added_time,time_extra_minute,time_injury_time,coaches_localteam_coach_id,coaches_visitorteam_coach_id,standings_localteam_position,standings_visitorteam_position,assistants_first_assistant_id,assistants_second_assistant_id,assistants_fourth_official_id,leg,colors,deleted,is_placeholder,localTeam_id,localTeam_legacy_id,localTeam_name,localTeam_short_code,localTeam_twitter,localTeam_country_id,localTeam_national_team,localTeam_founded,localTeam_logo_path,localTeam_venue_id,localTeam_current_season_id,localTeam_is_placeholder,visitorTeam_id,visitorTeam_legacy_id,visitorTeam_name,visitorTeam_short_code,visitorTeam_twitter,visitorTeam_country_id,visitorTeam_national_team,visitorTeam_founded,visitorTeam_logo_path,visitorTeam_venue_id,visitorTeam_current_season_id,visitorTeam_is_placeholder,stats,league_active,league_type,league_legacy_id,league_country_id,league_logo_path,league_name,league_is_cup,league_is_friendly,league_current_season_id,league_current_round_id,league_current_stage_id,league_live_standings,league_coverage_predictions,league_coverage_topscorer_goals,league_coverage_topscorer_assists,league_coverage_topscorer_cards,season_name,season_league_id,season_is_current_season,season_current_round_id,season_current_stage_id,round_name,round_league_id,round_season_id,round_stage_id,round_start,round_end,venue_name,venue_surface,venue_address,venue_city,venue_capacity,venue_image_path,venue_coordinates,referee_common_name,referee_fullname,referee_firstname,referee_lastname,localCoach_coach_id,localCoach_team_id,localCoach_country_id,localCoach_common_name,localCoach_fullname,localCoach_firstname,localCoach_lastname,localCoach_nationality,localCoach_birthdate,localCoach_birthcountry,localCoach_birthplace,localCoach_image_path,visitorCoach_coach_id,visitorCoach_team_id,visitorCoach_country_id,visitorCoach_common_name,visitorCoach_fullname,visitorCoach_firstname,visitorCoach_lastname,visitorCoach_nationality,visitorCoach_birthdate,visitorCoach_birthcountry,visitorCoach_birthplace,visitorCoach_image_path,weather_report_code,weather_report_icon,weather_report_type,weather_report_wind_speed,weather_report_wind_degree,weather_report_clouds,weather_report_humidity,weather_report_temperature_temp,weather_report_temperature_unit,colors_localteam_color,colors_localteam_kit_colors,colors_visitorteam_color,colors_visitorteam_kit_colors,weather_report_pressure,weather_report_temperature_celcius_temp,weather_report_temperature_celcius_unit,weather_report_coordinates_lat,weather_report_coordinates_lon,weather_report_updated_at
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1,Unnamed: 137_level_1,Unnamed: 138_level_1,Unnamed: 139_level_1,Unnamed: 140_level_1,Unnamed: 141_level_1,Unnamed: 142_level_1,Unnamed: 143_level_1,Unnamed: 144_level_1,Unnamed: 145_level_1,Unnamed: 146_level_1,Unnamed: 147_level_1,Unnamed: 148_level_1,Unnamed: 149_level_1,Unnamed: 150_level_1,Unnamed: 151_level_1,Unnamed: 152_level_1,Unnamed: 153_level_1,Unnamed: 154_level_1,Unnamed: 155_level_1,Unnamed: 156_level_1
18490986,570,19089,77455898,,,28707.0,9162.0,17340.0,13258,214,,,False,,,,False,True,4-4-2,4-3-3,1,1,,,1-0,1-1,,,FT,2022-02-10 21:30:00,2022-02-10,21:30:00,1644525000,Europe/Rome,90.0,,,,,1467953.0,524009.0,1.0,2.0,15278.0,14028.0,19889.0,1/2,,False,False,13258,138.0,Athletic Club,ATH,@AthleticClub,32.0,False,1898.0,https://cdn.sportmonks.com/images/soccer/teams...,9162.0,18462.0,False,214,287.0,Valencia,VAL,@valenciacf,32.0,False,1919.0,https://cdn.sportmonks.com/images/soccer/teams...,9240.0,18462.0,False,"[{'team_id': 13258, 'fixture_id': 18490986, 's...",True,domestic_cup,21,32,https://cdn.sportmonks.com/images/soccer/leagu...,Copa Del Rey,True,False,19089,,77455898,False,True,True,True,True,2021/2022,570,True,,77455898.0,,,,,,,San Mamés Barria,grass,Rafael Moreno Pitxitxi Kalea,Bilbao,53289.0,https://cdn.sportmonks.com/images/soccer/venue...,"43.263476,-2.948150",J. Munuera Montero,José Luis Munuera Montero,José Luis,Munuera Montero,1467953.0,13258.0,32.0,M. García Toral,Marcelino García Toral,Marcelino,García Toral,Spain,14/08/1965,Spain,Villaviciosa,https://cdn.sportmonks.com/images/soccer/playe...,524009.0,214.0,32.0,J. Bordalás Jiménez,José Bordalás Jiménez,José,Bordalás Jiménez,Spain,05/03/1964,Spain,Alicante,https://cdn.sportmonks.com/images/soccer/playe...,clouds,https://cdn.sportmonks.com/images/weather/04n.png,broken clouds,15.01 m/s,290.0,75%,80%,51.8,fahrenheit,#C40010,"#C40010,#F0F0F0,#F0F0F0,#F0F0F0,#5C8FAE,#FFDF1...",#2B72DE,"#2B72DE,#2B72DE,#0A0A0A,#0A0A0A,#C40010,#0046A...",1028.0,11.0,celcius,43.2627,-2.9253,2022-02-10T22:15:13.505470Z
18490987,570,19089,77455898,,,28707.0,9240.0,14350.0,214,13258,214.0,,False,,,,False,True,3-4-3,4-4-2,1,0,,,1-0,1-0,,,FT,2022-03-02 21:30:00,2022-03-02,21:30:00,1646253000,Europe/Rome,90.0,,,,,524009.0,1467953.0,1.0,2.0,15276.0,12812.0,17341.0,2/2,,False,False,214,287.0,Valencia,VAL,@valenciacf,32.0,False,1919.0,https://cdn.sportmonks.com/images/soccer/teams...,9240.0,18462.0,False,13258,138.0,Athletic Club,ATH,@AthleticClub,32.0,False,1898.0,https://cdn.sportmonks.com/images/soccer/teams...,9162.0,18462.0,False,"[{'team_id': 214, 'fixture_id': 18490987, 'sho...",True,domestic_cup,21,32,https://cdn.sportmonks.com/images/soccer/leagu...,Copa Del Rey,True,False,19089,,77455898,False,True,True,True,True,2021/2022,570,True,,77455898.0,,,,,,,Estadio de Mestalla,grass,Avenida de Suecia,Valencia,54000.0,https://cdn.sportmonks.com/images/soccer/venue...,"39.474602,-0.358257",J. Gil Manzano,Jesús Gil Manzano,Jesús,Gil Manzano,524009.0,214.0,32.0,J. Bordalás Jiménez,José Bordalás Jiménez,José,Bordalás Jiménez,Spain,05/03/1964,Spain,Alicante,https://cdn.sportmonks.com/images/soccer/playe...,1467953.0,13258.0,32.0,M. García Toral,Marcelino García Toral,Marcelino,García Toral,Spain,14/08/1965,Spain,Villaviciosa,https://cdn.sportmonks.com/images/soccer/playe...,clouds,https://cdn.sportmonks.com/images/weather/02n.png,few clouds,4.97 m/s,281.0,21%,63%,51.22,fahrenheit,#F0F0F0,"#0A0A0A,#F0F0F0,#F0F0F0,#0A0A0A,#5C8FAE,#FFDF1...",#BFFFBF,"#BFFFBF,#BFFFBF,#CCCCCC,#CCCCCC,#C40010,#0046A...",1021.0,10.7,celcius,39.3333,-0.8333,2022-03-02T22:15:07.028532Z
18490985,570,19089,77455898,,,28706.0,68.0,15712.0,485,377,,,False,,,,False,True,4-2-3-1,4-2-3-1,1,1,,,0-0,1-1,,,FT,2022-03-03 21:00:00,2022-03-03,21:00:00,1646337600,Europe/Rome,90.0,,,,,523898.0,19960388.0,1.0,2.0,11711.0,16863.0,15823.0,2/2,,False,False,485,305.0,Real Betis,BET,@RealBetis,32.0,False,1907.0,https://cdn.sportmonks.com/images/soccer/teams...,68.0,18462.0,False,377,292.0,Rayo Vallecano,RAY,,32.0,False,1924.0,https://cdn.sportmonks.com/images/soccer/teams...,304396.0,18462.0,False,"[{'team_id': 485, 'fixture_id': 18490985, 'sho...",True,domestic_cup,21,32,https://cdn.sportmonks.com/images/soccer/leagu...,Copa Del Rey,True,False,19089,,77455898,False,True,True,True,True,2021/2022,570,True,,77455898.0,,,,,,,Estadio Benito Villamarín,grass,Avenida de Heliópolis,Sevilla,60721.0,https://cdn.sportmonks.com/images/soccer/venue...,"37.356483,-5.981768",J. Martínez Munuera,Juan Martínez Munuera,Juan,Martínez Munuera,523898.0,485.0,80.0,M. Pellegrini Ripamonti,Manuel Luis Pellegrini Ripamonti,Manuel Luis,Pellegrini Ripamonti,Chile,16/09/1953,Chile,Santiago de Chile,https://cdn.sportmonks.com/images/soccer/playe...,19960388.0,377.0,32.0,A. Iraola Sagarna,Andoni Iraola Sagarna,Andoni,Iraola Sagarna,Spain,22/06/1982,Spain,Usurbil,https://cdn.sportmonks.com/images/soccer/place...,clear,https://cdn.sportmonks.com/images/weather/01n.png,clear sky,12.66 m/s,290.0,0%,61%,54.39,fahrenheit,#339063,"#339063,#F0F0F0,#008000,#008000,#5C8FAE,#FFDF1...",#F0F0F0,"#F0F0F0,#D10918,#F0F0F0,#0A0A0A,#5C8FAE,#FFDF1...",1017.0,12.4,celcius,37.3824,-5.9761,2022-03-03T21:45:03.639626Z


##### 4) Create Features from Stats Column

In [5]:
# Empty list to store stats data
all_stats = []

for i in trange(len(complete_data)):
    diz_single_stats = {}
    single_stats = eval(complete_data['stats'].iloc[i])
    # In case of incomplete stats: single_stats is empty or has data for only one team
    if not single_stats or len(single_stats) == 1:
        all_stats.append(diz_single_stats)
    # This condition covers the case when the home and away team data in stats are in the opposite position
    elif single_stats[0]['team_id'] == complete_data['visitorTeam_id'].iloc[i] and single_stats[1]['team_id'] == complete_data['localTeam_id'].iloc[i]:
        diz_single_stats['Home'] = single_stats[1]
        diz_single_stats['Away'] = single_stats[0]
        all_stats.append(flatten(diz_single_stats))
    # Case with complete and correctly positioned stats
    else:
        diz_single_stats['Home'] = single_stats[0]
        diz_single_stats['Away'] = single_stats[1]
        all_stats.append(flatten(diz_single_stats))

# Create DataFrame
stats_df = pd.DataFrame(all_stats) 

# Check before dropping NAs 
print('Lenghts of Stats data equal to original data (including empty dictionaries)? ', len(stats_df) == len(complete_data)) 
# Drop observations from stats_df with only NA columns
stats_df.dropna(how = 'all', inplace = True)
# Check after dropping NAs 
print('Lenght original dataset: ', len(complete_data), '\nLenght stats datasets after dropping NAs: ', len(stats_df)) 
# Check for fixture_id
print('Percentage of observation sharing ID (home/away): ', len(stats_df.apply(lambda x: x.Home_fixture_id == x.Away_fixture_id, axis=1))/len(stats_df)*100)
stats_df.tail()

  0%|          | 0/38073 [00:00<?, ?it/s]

Lenghts of Stats data equal to original data (including empty dictionaries)?  True
Lenght original dataset:  38073 
Lenght stats datasets after dropping NAs:  19950
Percentage of observation sharing ID (home/away):  100.0


Unnamed: 0,Home_team_id,Home_fixture_id,Home_shots_total,Home_shots_ongoal,Home_shots_offgoal,Home_shots_blocked,Home_shots_insidebox,Home_shots_outsidebox,Home_passes,Home_attacks,Home_fouls,Home_corners,Home_offsides,Home_possessiontime,Home_yellowcards,Home_redcards,Home_yellowredcards,Home_saves,Home_substitutions,Home_goal_kick,Home_goal_attempts,Home_free_kick,Home_throw_in,Home_ball_safe,Home_goals,Home_penalties,Home_injuries,Home_tackles,Away_team_id,Away_fixture_id,Away_shots_total,Away_shots_ongoal,Away_shots_offgoal,Away_shots_blocked,Away_shots_insidebox,Away_shots_outsidebox,Away_passes,Away_attacks,Away_fouls,Away_corners,Away_offsides,Away_possessiontime,Away_yellowcards,Away_redcards,Away_yellowredcards,Away_saves,Away_substitutions,Away_goal_kick,Away_goal_attempts,Away_free_kick,Away_throw_in,Away_ball_safe,Away_goals,Away_penalties,Away_injuries,Away_tackles,Home_passes_total,Home_passes_accurate,Home_passes_percentage,Away_passes_total,Away_passes_accurate,Away_passes_percentage,Home_attacks_attacks,Home_attacks_dangerous_attacks,Away_attacks_attacks,Away_attacks_dangerous_attacks,Home_shots,Away_shots
38068,13258.0,18473799.0,16.0,2.0,14.0,5.0,10.0,5.0,,,12.0,8.0,2.0,42.0,3.0,0.0,0.0,2.0,3.0,,8.0,,,69.0,1.0,0.0,1.0,18.0,3468.0,18473799.0,6.0,2.0,4.0,4.0,2.0,5.0,,,7.0,3.0,3.0,58.0,2.0,0.0,0.0,2.0,2.0,,3.0,,,80.0,0.0,0.0,1.0,19.0,411.0,343.0,83.45,595.0,531.0,89.24,136.0,72.0,81.0,34.0,,
38069,377.0,18490984.0,13.0,7.0,6.0,0.0,6.0,6.0,,,16.0,7.0,3.0,52.0,1.0,0.0,,3.0,4.0,,,,,,1.0,0.0,,16.0,485.0,18490984.0,15.0,5.0,10.0,0.0,8.0,7.0,,,18.0,1.0,1.0,48.0,1.0,0.0,,5.0,3.0,,,,,,2.0,0.0,,19.0,400.0,302.0,75.5,383.0,283.0,73.89,105.0,59.0,108.0,61.0,,
38070,13258.0,18490986.0,12.0,2.0,10.0,0.0,7.0,3.0,,,13.0,5.0,3.0,57.0,1.0,0.0,,2.0,4.0,,,,,,1.0,0.0,,15.0,214.0,18490986.0,6.0,1.0,5.0,0.0,9.0,0.0,,,22.0,8.0,0.0,43.0,3.0,1.0,,1.0,5.0,,,,,,1.0,0.0,,9.0,392.0,267.0,68.11,292.0,166.0,56.85,99.0,51.0,123.0,37.0,,
38071,214.0,18490987.0,8.0,3.0,5.0,0.0,6.0,3.0,,,15.0,3.0,0.0,36.0,3.0,0.0,,1.0,3.0,,,,,,1.0,0.0,,10.0,13258.0,18490987.0,13.0,2.0,11.0,0.0,3.0,7.0,,,18.0,4.0,4.0,64.0,2.0,0.0,,2.0,5.0,,,,,,0.0,0.0,,15.0,257.0,154.0,59.92,480.0,365.0,76.04,87.0,41.0,114.0,54.0,,
38072,485.0,18490985.0,12.0,5.0,7.0,0.0,5.0,6.0,,,8.0,2.0,4.0,60.0,1.0,0.0,,0.0,4.0,,,,,,1.0,0.0,,23.0,377.0,18490985.0,7.0,1.0,6.0,0.0,3.0,4.0,,,17.0,7.0,2.0,40.0,1.0,0.0,,2.0,5.0,,,,,,1.0,0.0,,12.0,474.0,373.0,78.69,312.0,222.0,71.15,120.0,50.0,92.0,48.0,,


##### 5) Merge Original Data with Stats Data

In [6]:
all_df = pd.merge(complete_data.reset_index(), stats_df, how='outer', left_on='id', right_on='Away_fixture_id').set_index('id')
all_df.tail(3)

Unnamed: 0_level_0,league_id,season_id,stage_id,round_id,group_id,aggregate_id,venue_id,referee_id,localteam_id,visitorteam_id,winner_team_id,weather_report,commentaries,attendance,pitch,details,neutral_venue,winning_odds_calculated,formations_localteam_formation,formations_visitorteam_formation,scores_localteam_score,scores_visitorteam_score,scores_localteam_pen_score,scores_visitorteam_pen_score,scores_ht_score,scores_ft_score,scores_et_score,scores_ps_score,time_status,time_starting_at_date_time,time_starting_at_date,time_starting_at_time,time_starting_at_timestamp,time_starting_at_timezone,time_minute,time_second,time_added_time,time_extra_minute,time_injury_time,coaches_localteam_coach_id,coaches_visitorteam_coach_id,standings_localteam_position,standings_visitorteam_position,assistants_first_assistant_id,assistants_second_assistant_id,assistants_fourth_official_id,leg,colors,deleted,is_placeholder,localTeam_id,localTeam_legacy_id,localTeam_name,localTeam_short_code,localTeam_twitter,localTeam_country_id,localTeam_national_team,localTeam_founded,localTeam_logo_path,localTeam_venue_id,localTeam_current_season_id,localTeam_is_placeholder,visitorTeam_id,visitorTeam_legacy_id,visitorTeam_name,visitorTeam_short_code,visitorTeam_twitter,visitorTeam_country_id,visitorTeam_national_team,visitorTeam_founded,visitorTeam_logo_path,visitorTeam_venue_id,visitorTeam_current_season_id,visitorTeam_is_placeholder,stats,league_active,league_type,league_legacy_id,league_country_id,league_logo_path,league_name,league_is_cup,league_is_friendly,league_current_season_id,league_current_round_id,league_current_stage_id,league_live_standings,league_coverage_predictions,league_coverage_topscorer_goals,league_coverage_topscorer_assists,league_coverage_topscorer_cards,season_name,season_league_id,season_is_current_season,season_current_round_id,season_current_stage_id,round_name,round_league_id,round_season_id,round_stage_id,round_start,round_end,venue_name,venue_surface,venue_address,venue_city,venue_capacity,venue_image_path,venue_coordinates,referee_common_name,referee_fullname,referee_firstname,referee_lastname,localCoach_coach_id,localCoach_team_id,localCoach_country_id,localCoach_common_name,localCoach_fullname,localCoach_firstname,localCoach_lastname,localCoach_nationality,localCoach_birthdate,localCoach_birthcountry,localCoach_birthplace,localCoach_image_path,visitorCoach_coach_id,visitorCoach_team_id,visitorCoach_country_id,visitorCoach_common_name,visitorCoach_fullname,visitorCoach_firstname,visitorCoach_lastname,visitorCoach_nationality,visitorCoach_birthdate,visitorCoach_birthcountry,visitorCoach_birthplace,visitorCoach_image_path,weather_report_code,weather_report_icon,weather_report_type,weather_report_wind_speed,weather_report_wind_degree,weather_report_clouds,weather_report_humidity,weather_report_temperature_temp,weather_report_temperature_unit,colors_localteam_color,colors_localteam_kit_colors,colors_visitorteam_color,colors_visitorteam_kit_colors,weather_report_pressure,weather_report_temperature_celcius_temp,weather_report_temperature_celcius_unit,weather_report_coordinates_lat,weather_report_coordinates_lon,weather_report_updated_at,Home_team_id,Home_fixture_id,Home_shots_total,Home_shots_ongoal,Home_shots_offgoal,Home_shots_blocked,Home_shots_insidebox,Home_shots_outsidebox,Home_passes,Home_attacks,Home_fouls,Home_corners,Home_offsides,Home_possessiontime,Home_yellowcards,Home_redcards,Home_yellowredcards,Home_saves,Home_substitutions,Home_goal_kick,Home_goal_attempts,Home_free_kick,Home_throw_in,Home_ball_safe,Home_goals,Home_penalties,Home_injuries,Home_tackles,Away_team_id,Away_fixture_id,Away_shots_total,Away_shots_ongoal,Away_shots_offgoal,Away_shots_blocked,Away_shots_insidebox,Away_shots_outsidebox,Away_passes,Away_attacks,Away_fouls,Away_corners,Away_offsides,Away_possessiontime,Away_yellowcards,Away_redcards,Away_yellowredcards,Away_saves,Away_substitutions,Away_goal_kick,Away_goal_attempts,Away_free_kick,Away_throw_in,Away_ball_safe,Away_goals,Away_penalties,Away_injuries,Away_tackles,Home_passes_total,Home_passes_accurate,Home_passes_percentage,Away_passes_total,Away_passes_accurate,Away_passes_percentage,Home_attacks_attacks,Home_attacks_dangerous_attacks,Away_attacks_attacks,Away_attacks_dangerous_attacks,Home_shots,Away_shots
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1,Unnamed: 137_level_1,Unnamed: 138_level_1,Unnamed: 139_level_1,Unnamed: 140_level_1,Unnamed: 141_level_1,Unnamed: 142_level_1,Unnamed: 143_level_1,Unnamed: 144_level_1,Unnamed: 145_level_1,Unnamed: 146_level_1,Unnamed: 147_level_1,Unnamed: 148_level_1,Unnamed: 149_level_1,Unnamed: 150_level_1,Unnamed: 151_level_1,Unnamed: 152_level_1,Unnamed: 153_level_1,Unnamed: 154_level_1,Unnamed: 155_level_1,Unnamed: 156_level_1,Unnamed: 157_level_1,Unnamed: 158_level_1,Unnamed: 159_level_1,Unnamed: 160_level_1,Unnamed: 161_level_1,Unnamed: 162_level_1,Unnamed: 163_level_1,Unnamed: 164_level_1,Unnamed: 165_level_1,Unnamed: 166_level_1,Unnamed: 167_level_1,Unnamed: 168_level_1,Unnamed: 169_level_1,Unnamed: 170_level_1,Unnamed: 171_level_1,Unnamed: 172_level_1,Unnamed: 173_level_1,Unnamed: 174_level_1,Unnamed: 175_level_1,Unnamed: 176_level_1,Unnamed: 177_level_1,Unnamed: 178_level_1,Unnamed: 179_level_1,Unnamed: 180_level_1,Unnamed: 181_level_1,Unnamed: 182_level_1,Unnamed: 183_level_1,Unnamed: 184_level_1,Unnamed: 185_level_1,Unnamed: 186_level_1,Unnamed: 187_level_1,Unnamed: 188_level_1,Unnamed: 189_level_1,Unnamed: 190_level_1,Unnamed: 191_level_1,Unnamed: 192_level_1,Unnamed: 193_level_1,Unnamed: 194_level_1,Unnamed: 195_level_1,Unnamed: 196_level_1,Unnamed: 197_level_1,Unnamed: 198_level_1,Unnamed: 199_level_1,Unnamed: 200_level_1,Unnamed: 201_level_1,Unnamed: 202_level_1,Unnamed: 203_level_1,Unnamed: 204_level_1,Unnamed: 205_level_1,Unnamed: 206_level_1,Unnamed: 207_level_1,Unnamed: 208_level_1,Unnamed: 209_level_1,Unnamed: 210_level_1,Unnamed: 211_level_1,Unnamed: 212_level_1,Unnamed: 213_level_1,Unnamed: 214_level_1,Unnamed: 215_level_1,Unnamed: 216_level_1,Unnamed: 217_level_1,Unnamed: 218_level_1,Unnamed: 219_level_1,Unnamed: 220_level_1,Unnamed: 221_level_1,Unnamed: 222_level_1,Unnamed: 223_level_1,Unnamed: 224_level_1
18490986,570,19089,77455898,,,28707.0,9162.0,17340.0,13258,214,,,False,,,,False,True,4-4-2,4-3-3,1,1,,,1-0,1-1,,,FT,2022-02-10 21:30:00,2022-02-10,21:30:00,1644525000,Europe/Rome,90.0,,,,,1467953.0,524009.0,1.0,2.0,15278.0,14028.0,19889.0,1/2,,False,False,13258,138.0,Athletic Club,ATH,@AthleticClub,32.0,False,1898.0,https://cdn.sportmonks.com/images/soccer/teams...,9162.0,18462.0,False,214,287.0,Valencia,VAL,@valenciacf,32.0,False,1919.0,https://cdn.sportmonks.com/images/soccer/teams...,9240.0,18462.0,False,"[{'team_id': 13258, 'fixture_id': 18490986, 's...",True,domestic_cup,21,32,https://cdn.sportmonks.com/images/soccer/leagu...,Copa Del Rey,True,False,19089,,77455898,False,True,True,True,True,2021/2022,570,True,,77455898.0,,,,,,,San Mamés Barria,grass,Rafael Moreno Pitxitxi Kalea,Bilbao,53289.0,https://cdn.sportmonks.com/images/soccer/venue...,"43.263476,-2.948150",J. Munuera Montero,José Luis Munuera Montero,José Luis,Munuera Montero,1467953.0,13258.0,32.0,M. García Toral,Marcelino García Toral,Marcelino,García Toral,Spain,14/08/1965,Spain,Villaviciosa,https://cdn.sportmonks.com/images/soccer/playe...,524009.0,214.0,32.0,J. Bordalás Jiménez,José Bordalás Jiménez,José,Bordalás Jiménez,Spain,05/03/1964,Spain,Alicante,https://cdn.sportmonks.com/images/soccer/playe...,clouds,https://cdn.sportmonks.com/images/weather/04n.png,broken clouds,15.01 m/s,290.0,75%,80%,51.8,fahrenheit,#C40010,"#C40010,#F0F0F0,#F0F0F0,#F0F0F0,#5C8FAE,#FFDF1...",#2B72DE,"#2B72DE,#2B72DE,#0A0A0A,#0A0A0A,#C40010,#0046A...",1028.0,11.0,celcius,43.2627,-2.9253,2022-02-10T22:15:13.505470Z,13258.0,18490986.0,12.0,2.0,10.0,0.0,7.0,3.0,,,13.0,5.0,3.0,57.0,1.0,0.0,,2.0,4.0,,,,,,1.0,0.0,,15.0,214.0,18490986.0,6.0,1.0,5.0,0.0,9.0,0.0,,,22.0,8.0,0.0,43.0,3.0,1.0,,1.0,5.0,,,,,,1.0,0.0,,9.0,392.0,267.0,68.11,292.0,166.0,56.85,99.0,51.0,123.0,37.0,,
18490987,570,19089,77455898,,,28707.0,9240.0,14350.0,214,13258,214.0,,False,,,,False,True,3-4-3,4-4-2,1,0,,,1-0,1-0,,,FT,2022-03-02 21:30:00,2022-03-02,21:30:00,1646253000,Europe/Rome,90.0,,,,,524009.0,1467953.0,1.0,2.0,15276.0,12812.0,17341.0,2/2,,False,False,214,287.0,Valencia,VAL,@valenciacf,32.0,False,1919.0,https://cdn.sportmonks.com/images/soccer/teams...,9240.0,18462.0,False,13258,138.0,Athletic Club,ATH,@AthleticClub,32.0,False,1898.0,https://cdn.sportmonks.com/images/soccer/teams...,9162.0,18462.0,False,"[{'team_id': 214, 'fixture_id': 18490987, 'sho...",True,domestic_cup,21,32,https://cdn.sportmonks.com/images/soccer/leagu...,Copa Del Rey,True,False,19089,,77455898,False,True,True,True,True,2021/2022,570,True,,77455898.0,,,,,,,Estadio de Mestalla,grass,Avenida de Suecia,Valencia,54000.0,https://cdn.sportmonks.com/images/soccer/venue...,"39.474602,-0.358257",J. Gil Manzano,Jesús Gil Manzano,Jesús,Gil Manzano,524009.0,214.0,32.0,J. Bordalás Jiménez,José Bordalás Jiménez,José,Bordalás Jiménez,Spain,05/03/1964,Spain,Alicante,https://cdn.sportmonks.com/images/soccer/playe...,1467953.0,13258.0,32.0,M. García Toral,Marcelino García Toral,Marcelino,García Toral,Spain,14/08/1965,Spain,Villaviciosa,https://cdn.sportmonks.com/images/soccer/playe...,clouds,https://cdn.sportmonks.com/images/weather/02n.png,few clouds,4.97 m/s,281.0,21%,63%,51.22,fahrenheit,#F0F0F0,"#0A0A0A,#F0F0F0,#F0F0F0,#0A0A0A,#5C8FAE,#FFDF1...",#BFFFBF,"#BFFFBF,#BFFFBF,#CCCCCC,#CCCCCC,#C40010,#0046A...",1021.0,10.7,celcius,39.3333,-0.8333,2022-03-02T22:15:07.028532Z,214.0,18490987.0,8.0,3.0,5.0,0.0,6.0,3.0,,,15.0,3.0,0.0,36.0,3.0,0.0,,1.0,3.0,,,,,,1.0,0.0,,10.0,13258.0,18490987.0,13.0,2.0,11.0,0.0,3.0,7.0,,,18.0,4.0,4.0,64.0,2.0,0.0,,2.0,5.0,,,,,,0.0,0.0,,15.0,257.0,154.0,59.92,480.0,365.0,76.04,87.0,41.0,114.0,54.0,,
18490985,570,19089,77455898,,,28706.0,68.0,15712.0,485,377,,,False,,,,False,True,4-2-3-1,4-2-3-1,1,1,,,0-0,1-1,,,FT,2022-03-03 21:00:00,2022-03-03,21:00:00,1646337600,Europe/Rome,90.0,,,,,523898.0,19960388.0,1.0,2.0,11711.0,16863.0,15823.0,2/2,,False,False,485,305.0,Real Betis,BET,@RealBetis,32.0,False,1907.0,https://cdn.sportmonks.com/images/soccer/teams...,68.0,18462.0,False,377,292.0,Rayo Vallecano,RAY,,32.0,False,1924.0,https://cdn.sportmonks.com/images/soccer/teams...,304396.0,18462.0,False,"[{'team_id': 485, 'fixture_id': 18490985, 'sho...",True,domestic_cup,21,32,https://cdn.sportmonks.com/images/soccer/leagu...,Copa Del Rey,True,False,19089,,77455898,False,True,True,True,True,2021/2022,570,True,,77455898.0,,,,,,,Estadio Benito Villamarín,grass,Avenida de Heliópolis,Sevilla,60721.0,https://cdn.sportmonks.com/images/soccer/venue...,"37.356483,-5.981768",J. Martínez Munuera,Juan Martínez Munuera,Juan,Martínez Munuera,523898.0,485.0,80.0,M. Pellegrini Ripamonti,Manuel Luis Pellegrini Ripamonti,Manuel Luis,Pellegrini Ripamonti,Chile,16/09/1953,Chile,Santiago de Chile,https://cdn.sportmonks.com/images/soccer/playe...,19960388.0,377.0,32.0,A. Iraola Sagarna,Andoni Iraola Sagarna,Andoni,Iraola Sagarna,Spain,22/06/1982,Spain,Usurbil,https://cdn.sportmonks.com/images/soccer/place...,clear,https://cdn.sportmonks.com/images/weather/01n.png,clear sky,12.66 m/s,290.0,0%,61%,54.39,fahrenheit,#339063,"#339063,#F0F0F0,#008000,#008000,#5C8FAE,#FFDF1...",#F0F0F0,"#F0F0F0,#D10918,#F0F0F0,#0A0A0A,#5C8FAE,#FFDF1...",1017.0,12.4,celcius,37.3824,-5.9761,2022-03-03T21:45:03.639626Z,485.0,18490985.0,12.0,5.0,7.0,0.0,5.0,6.0,,,8.0,2.0,4.0,60.0,1.0,0.0,,0.0,4.0,,,,,,1.0,0.0,,23.0,377.0,18490985.0,7.0,1.0,6.0,0.0,3.0,4.0,,,17.0,7.0,2.0,40.0,1.0,0.0,,2.0,5.0,,,,,,1.0,0.0,,12.0,474.0,373.0,78.69,312.0,222.0,71.15,120.0,50.0,92.0,48.0,,


In [7]:
def IDs_match(df):
    """This function checks Fixtures and Team IDs for consistency across both the home and the away team.
    df: DataFrame to check
    """
    # Check Team's ID between match data and stats data (Home/Away)
    a_t = len(all_df[~(all_df['Away_team_id'].isna())].apply(lambda x: x.Away_team_id != x.visitorTeam_id, axis=1))
    h_t = len(all_df[~(all_df['Home_team_id'].isna())].apply(lambda x: x.Home_team_id != x.localTeam_id, axis=1))
    # Check Fixture's ID between match data and stats data (Home/Away)
    h_f = len(all_df[~(all_df['Home_fixture_id'].isna())].apply(lambda x: x.Home_fixture_id != pd.Series(x.index), axis=1))
    a_f = len(all_df[~(all_df['Away_fixture_id'].isna())].apply(lambda x: x.Away_fixture_id != pd.Series(x.index), axis=1))
    # Communicate if all IDs match or not
    if len(stats_df) == a_t == h_t == h_f == a_f:
        print('All team/fixture IDs match! - Match data and stats data are consistent')
    else:
        print('Errors in team/fixture IDs match!')

IDs_match(all_df)

All team/fixture IDs match! - Match data and stats data are consistent


## *Data Pre-Cleaning*

6) **Drop** useless **columns**
7) **Transform columns'** values 
8) **Rename columns**
9) Change **columns data types**
10) **Explore** Data

##### 6) Drop Useless Columns

In [8]:
# Features to drop
drop_features = ['group_id','aggregate_id','weather_report','pitch','details','neutral_venue','winning_odds_calculated','scores_localteam_pen_score','scores_visitorteam_pen_score','time_starting_at_timestamp','time_second','time_added_time','time_extra_minute','time_injury_time','leg','colors','deleted','is_placeholder','localTeam_id','localTeam_short_code','localTeam_national_team','localTeam_logo_path','localTeam_current_season_id','localTeam_is_placeholder','visitorTeam_id', 'visitorTeam_short_code','visitorTeam_national_team','visitorTeam_logo_path','visitorTeam_current_season_id','visitorTeam_is_placeholder','stats','league_active','league_legacy_id','league_logo_path','league_is_friendly','league_current_season_id','league_current_round_id','league_current_stage_id','league_live_standings','league_coverage_predictions','league_coverage_topscorer_goals','league_coverage_topscorer_assists','league_coverage_topscorer_cards','season_is_current_season','season_current_round_id','season_current_stage_id', 'round_season_id','round_stage_id','venue_address','venue_image_path','referee_common_name','referee_firstname','referee_lastname', 'localCoach_common_name','localCoach_firstname','localCoach_lastname','localCoach_image_path','visitorCoach_common_name','visitorCoach_firstname','visitorCoach_lastname','visitorCoach_image_path','weather_report_icon','weather_report_temperature_unit','weather_report_temperature_celcius_temp','colors_localteam_kit_colors','colors_visitorteam_kit_colors', 'weather_report_temperature_celcius_unit','Home_team_id','Home_fixture_id','Home_passes','Home_attacks','Away_team_id','Away_fixture_id','Away_passes','Away_attacks','Home_shots','Away_shots', 'time_starting_at_time', 'weather_report_updated_at', 'round_id', 'season_league_id', 'round_league_id']

# Check for df's shape - before drops
start_cols = all_df.shape[1] 
print('DataFrame shape BEFORE drop: ', all_df.shape)
# Drop all the columns in drop_features
all_df.drop(drop_features, axis = 1, inplace = True)
# Consider only games that were actually played
all_df = all_df.loc[all_df['time_status'] == 'FT']
all_df.drop('time_status',axis = 1,inplace = True)
# Check for df's shape - after drops
end_cols = all_df.shape[1]
print('DataFrame shape  AFTER drop: ', all_df.shape)
print('N. of columns dropped: ', start_cols - end_cols)

DataFrame shape BEFORE drop:  (38073, 224)
DataFrame shape  AFTER drop:  (36962, 141)
N. of columns dropped:  83


##### 7) Transform columns' values 

In [9]:
# Weather coordinates 
all_df['weather_lat_lon'] = list(zip(all_df.weather_report_coordinates_lat, all_df.weather_report_coordinates_lon))
all_df.drop(['weather_report_coordinates_lat', 'weather_report_coordinates_lon'], axis = 1, inplace = True)
# Format weather columns without measure unit
all_df['weather_report_humidity'] = all_df['weather_report_humidity'].str.replace('%', '')
all_df['weather_report_clouds'] = all_df['weather_report_clouds'].str.replace('%', '')
all_df['weather_report_wind_speed'] = all_df['weather_report_wind_speed'].str.replace(' m/s', '')
# Manipulate venue coordinates to add parentheses
all_df['venue_coordinates'] = '(' + all_df['venue_coordinates'] + ')'
# Format league_is_cup and commentaries as binary 
all_df['league_is_cup'] = np.where(all_df['league_is_cup'] == False, 0, 1)
all_df['commentaries'] = np.where(all_df['commentaries'] == False, 0, 1)
# Convert temperature from °F to °C
all_df['weather_report_temp_celsius'] = round((all_df['weather_report_temperature_temp'] - 32)*(5/9), 1)
all_df.drop('weather_report_temperature_temp', axis = 1, inplace = True)
# Format venue surface as binary
all_df.loc[(all_df['venue_surface'] == 'artificial turf', 'venue_surface')] = 0
all_df.loc[(all_df['venue_surface'] == 'sand pitch', 'venue_surface')] = 0
all_df.loc[(all_df['venue_surface'] == 'grass', 'venue_surface')] = 1


##### 8) Rename Columns

In [10]:
# Rename columns
all_df.rename(columns={'weather_report_humidity': 'weather_report_humidity(%)','weather_report_clouds': 'weather_report_clouds(%)', 'weather_report_wind_speed': 'weather_report_windspeed(m/s)','venue_surface': 'venue_surface_isgrass'}, inplace=True)
# Improve columns naming
all_df.columns = all_df.columns.str.lower().str.replace("localteam", "home").str.replace("visitorteam", "away")\
    .str.replace("local", "home").str.replace("visitor", "away")
all_df.columns = all_df.columns.str.lower()
# Handle duplicated columns
all_df = all_df.loc[:,~all_df.columns.duplicated()]

##### 9) Change columns data types

In [11]:
# Specify columns that will remain floats
float_columns = ['weather_report_pressure','weather_report_temp_celsius','weather_report_windspeed(m/s)']
float_columns.extend(list(all_df.columns[-60:-2]))
# Convert all numeric columns to Int except specified Float 
m = all_df.select_dtypes(np.number).loc[:, ~all_df.select_dtypes(np.number).columns.isin(float_columns)]
all_df[m.columns]= m.round().astype('Int64')
all_df[float_columns] = all_df[float_columns].astype('float64')
# Convert to datetime
all_df['time_starting_at_date_time'] = pd.to_datetime(all_df['time_starting_at_date_time'], infer_datetime_format=True)
all_df['time_starting_at_date'] = pd.to_datetime(all_df['time_starting_at_date'], format = '%Y-%m-%d')
all_df['homecoach_birthdate'] = pd.to_datetime(all_df['homecoach_birthdate'], format = '%d/%m/%Y')
all_df['awaycoach_birthdate'] = pd.to_datetime(all_df['awaycoach_birthdate'], format = '%d/%m/%Y')
all_df['round_start'] = pd.to_datetime(all_df['round_start'], format = '%Y-%m-%d')
all_df['round_end'] = pd.to_datetime(all_df['round_end'], format = '%Y-%m-%d')
# Sort values by datetime
all_df = all_df.sort_values(by='time_starting_at_date_time')

##### 10) Explore Data

In [12]:
print(all_df.shape)
all_df.tail(10)

(36962, 140)


Unnamed: 0_level_0,league_id,season_id,stage_id,venue_id,referee_id,home_id,away_id,winner_team_id,commentaries,attendance,formations_home_formation,formations_away_formation,scores_home_score,scores_away_score,scores_ht_score,scores_ft_score,scores_et_score,scores_ps_score,time_starting_at_date_time,time_starting_at_date,time_starting_at_timezone,time_minute,coaches_home_coach_id,coaches_away_coach_id,standings_home_position,standings_away_position,assistants_first_assistant_id,assistants_second_assistant_id,assistants_fourth_official_id,home_legacy_id,home_name,home_twitter,home_country_id,home_founded,home_venue_id,away_legacy_id,away_name,away_twitter,away_country_id,away_founded,away_venue_id,league_type,league_country_id,league_name,league_is_cup,season_name,round_name,round_start,round_end,venue_name,venue_surface_isgrass,venue_city,venue_capacity,venue_coordinates,referee_fullname,homecoach_coach_id,homecoach_team_id,homecoach_country_id,homecoach_fullname,homecoach_nationality,homecoach_birthdate,homecoach_birthcountry,homecoach_birthplace,awaycoach_coach_id,awaycoach_team_id,awaycoach_country_id,awaycoach_fullname,awaycoach_nationality,awaycoach_birthdate,awaycoach_birthcountry,awaycoach_birthplace,weather_report_code,weather_report_type,weather_report_windspeed(m/s),weather_report_wind_degree,weather_report_clouds(%),weather_report_humidity(%),colors_home_color,colors_away_color,weather_report_pressure,home_shots_total,home_shots_ongoal,home_shots_offgoal,home_shots_blocked,home_shots_insidebox,home_shots_outsidebox,home_fouls,home_corners,home_offsides,home_possessiontime,home_yellowcards,home_redcards,home_yellowredcards,home_saves,home_substitutions,home_goal_kick,home_goal_attempts,home_free_kick,home_throw_in,home_ball_safe,home_goals,home_penalties,home_injuries,home_tackles,away_shots_total,away_shots_ongoal,away_shots_offgoal,away_shots_blocked,away_shots_insidebox,away_shots_outsidebox,away_fouls,away_corners,away_offsides,away_possessiontime,away_yellowcards,away_redcards,away_yellowredcards,away_saves,away_substitutions,away_goal_kick,away_goal_attempts,away_free_kick,away_throw_in,away_ball_safe,away_goals,away_penalties,away_injuries,away_tackles,home_passes_total,home_passes_accurate,home_passes_percentage,away_passes_total,away_passes_accurate,away_passes_percentage,home_attacks_attacks,home_attacks_dangerous_attacks,away_attacks_attacks,away_attacks_dangerous_attacks,weather_lat_lon,weather_report_temp_celsius
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1,Unnamed: 137_level_1,Unnamed: 138_level_1,Unnamed: 139_level_1,Unnamed: 140_level_1
18509793,2,18346,77453619,53.0,19366,503,3477,,1,,3-2-4-1,4-4-2,1,1,0-0,1-1,,,2022-04-12 21:00:00,2022-04-12,Europe/Rome,94,458813,455907,1.0,2.0,16682,19367,19312,39,FC Bayern München,@FCBayern,11,1900,53,140,Villarreal,@VillarrealCF,32,1923,2028,cup_international,41,Champions League,1,2021/2022,,NaT,NaT,Allianz Arena,1.0,Munich,75000.0,"(48.218777,11.624748)",Slavko Vinčić,458813,503,11,Julian Nagelsmann,Germany,1987-07-23,Germany,Landsberg am Lech,455907,3477,32,Unai Emery Etxegoien,Spain,1971-11-03,Spain,Hondarribia,clouds,overcast clouds,4.0,107.0,99.0,64.0,#C40010,#FFDF1B,1015.0,23.0,4.0,19.0,6.0,11.0,12.0,11.0,6.0,5.0,68.0,1.0,0.0,0.0,0.0,3.0,4.0,14.0,9.0,21.0,54.0,1.0,0.0,1.0,16.0,4.0,1.0,3.0,0.0,3.0,1.0,8.0,0.0,1.0,32.0,2.0,0.0,0.0,3.0,3.0,25.0,5.0,16.0,13.0,70.0,1.0,0.0,1.0,15.0,580.0,502.0,86.55,289.0,198.0,68.51,161.0,99.0,66.0,20.0,"(48.1374, 11.5755)",10.2
18509789,2,18346,77453619,230.0,17308,8,605,,1,51373.0,4-3-3,4-2-3-1,3,3,1-1,3-3,,,2022-04-13 21:00:00,2022-04-13,Europe/Rome,95,455353,160758,1.0,2.0,11658,12150,15275,119,Liverpool,@LFC,462,1892,230,123,Benfica,@SLBenfica,20,1904,138,cup_international,41,Champions League,1,2021/2022,,NaT,NaT,Anfield,1.0,Liverpool,54074.0,"(53.430622,-2.960919)",Serdar Gözübüyük,455353,8,11,Jürgen Klopp,Germany,1967-06-16,Germany,Stuttgart,160758,605,20,Nelson Alexandre da Silva Veríssimo,Portugal,1977-04-17,Portugal,,clear,clear sky,0.0,0.0,0.0,80.0,#C40010,#F0F0F0,1017.0,17.0,6.0,11.0,3.0,12.0,5.0,9.0,8.0,1.0,67.0,0.0,0.0,0.0,1.0,5.0,1.0,12.0,16.0,14.0,105.0,3.0,0.0,0.0,12.0,6.0,4.0,2.0,0.0,5.0,1.0,11.0,0.0,5.0,33.0,0.0,0.0,0.0,2.0,5.0,10.0,7.0,10.0,19.0,86.0,3.0,0.0,3.0,16.0,778.0,676.0,86.89,399.0,303.0,75.94,176.0,71.0,65.0,23.0,"(53.4106, -2.9779)",10.8
18509787,2,18346,77453619,140808.0,17234,7980,9,,1,65675.0,5-4-1,4-3-3,0,0,0-0,0-0,,,2022-04-13 21:00:00,2022-04-13,Europe/Rome,102,452946,455361,2.0,1.0,12207,11619,15400,113,Atlético Madrid,@Atleti,32,1903,140808,127,Manchester City,@ManCity,462,1880,151,cup_international,41,Champions League,1,2021/2022,,NaT,NaT,Estadio Wanda Metropolitano,1.0,Madrid,67829.0,"(40.436111,-3.599444)",Daniel Siebert,452946,7980,44,Diego Pablo Simeone,Argentina,1970-04-28,Argentina,Buenos Aires,455361,9,32,Josep Guardiola i Sala,Spain,1971-01-18,Spain,Santpedor,clouds,broken clouds,8.01,70.0,75.0,57.0,#C40010,#022857,1002.0,13.0,3.0,10.0,0.0,9.0,5.0,8.0,2.0,4.0,40.0,4.0,1.0,1.0,1.0,5.0,7.0,7.0,8.0,28.0,87.0,0.0,0.0,0.0,19.0,10.0,1.0,9.0,0.0,7.0,3.0,7.0,4.0,1.0,60.0,5.0,0.0,0.0,3.0,3.0,12.0,5.0,12.0,19.0,102.0,0.0,0.0,2.0,16.0,399.0,301.0,75.44,605.0,525.0,86.78,86.0,37.0,132.0,43.0,"(40.4165, -3.7026)",14.7
18509800,5,18629,77454496,193.0,11710,708,277,277.0,1,,3-4-2-1,3-4-2-1,0,2,0-1,0-2,,,2022-04-14 18:45:00,2022-04-14,Europe/Rome,90,456038,893655,,,11711,14349,17340,333,Atalanta,@Atalanta_BC,251,1907,193,64,RB Leipzig,@DieRotenBullen,11,2009,2171,cup_international,41,Europa League,1,2021/2022,,NaT,NaT,Gewiss Stadium,1.0,Bergamo,26393.0,"(45.708889,9.680833)",Antonio Miguel Mateu Lahoz,456038,708,251,Gian Piero Gasperini,Italy,1958-01-26,Italy,Grugliasco,893655,277,11,Domenico Tedesco,Germany,1985-09-12,Italy,Rossano,clear,clear sky,4.61,140.0,0.0,37.0,#2B72DE,#F0F0F0,1019.0,16.0,3.0,13.0,3.0,8.0,8.0,12.0,4.0,1.0,57.0,5.0,0.0,0.0,5.0,4.0,,8.0,,,72.0,0.0,0.0,0.0,11.0,11.0,5.0,6.0,0.0,8.0,3.0,16.0,5.0,1.0,43.0,5.0,0.0,0.0,3.0,5.0,,10.0,,,79.0,2.0,1.0,1.0,26.0,516.0,422.0,81.78,400.0,306.0,76.5,165.0,70.0,79.0,32.0,"(45.698, 9.669)",18.7
18509802,5,18629,77454496,9236.0,16678,83,366,366.0,1,,4-3-3,3-4-2-1,2,3,0-2,2-3,,,2022-04-14 21:00:00,2022-04-14,Europe/Rome,90,184942,51518,,1.0,12857,12861,17714,130,FC Barcelona,@FCBarcelona,32,1899,9236,47,Eintracht Frankfurt,@eintracht,11,1899,2169,cup_international,41,Europa League,1,2021/2022,,NaT,NaT,Spotify Camp Nou,1.0,Barcelona,98787.0,"(41.380890,2.122813)",Artur Manuel Ribeiro Soares Dias,184942,83,32,Xavier Hernández Creus,Spain,1980-01-25,Spain,Terrassa,51518,366,143,Oliver Glasner,Austria,1974-08-28,Austria,Salzburg,clouds,few clouds,6.91,20.0,20.0,59.0,#0046A8,#F0F0F0,1019.0,9.0,5.0,4.0,0.0,9.0,1.0,7.0,5.0,2.0,75.0,3.0,0.0,0.0,4.0,5.0,,5.0,,,95.0,2.0,1.0,,13.0,15.0,7.0,8.0,0.0,8.0,7.0,18.0,0.0,2.0,25.0,6.0,1.0,1.0,2.0,5.0,,12.0,,,82.0,3.0,1.0,,19.0,679.0,599.0,88.22,231.0,151.0,65.37,176.0,65.0,54.0,16.0,"(41.3888, 2.159)",16.5
18509804,5,18629,77454496,,17285,79,1,1.0,1,,4-2-3-1,4-2-3-1,0,3,0-2,0-3,,,2022-04-14 21:00:00,2022-04-14,Europe/Rome,93,455958,455355,1.0,1.0,12884,15807,17929,552,Olympique Lyonnais,@OL,17,1950,6161,377,West Ham United,@WestHam,462,1895,214,cup_international,41,Europa League,1,2021/2022,,NaT,NaT,,,,,,Sandro Schärer,455958,79,38,Peter Bosz,Netherlands,1963-11-21,Netherlands,Apeldoorn,455355,1,1161,David Moyes,Scotland,1963-04-25,Scotland,Glasgow,,,,,,,#F0F0F0,#832034,,17.0,3.0,14.0,4.0,12.0,5.0,13.0,3.0,1.0,67.0,1.0,0.0,0.0,3.0,4.0,5.0,11.0,15.0,17.0,97.0,0.0,0.0,0.0,15.0,10.0,6.0,4.0,2.0,5.0,5.0,13.0,4.0,3.0,33.0,2.0,0.0,0.0,3.0,3.0,10.0,9.0,14.0,15.0,86.0,3.0,0.0,2.0,13.0,607.0,537.0,88.47,300.0,219.0,73.0,117.0,66.0,72.0,40.0,"(nan, nan)",
18220188,384,18576,77454372,7305.0,15969,345,2930,2930.0,1,,4-2-3-1,3-5-2,1,3,0-1,1-3,,,2022-04-15 19:00:00,2022-04-15,Europe/Rome,90,95726,128160,15.0,2.0,13402,12205,13669,344,Spezia,@acspezia,251,1906,7305,158,Inter,@Inter,251,1908,1721,domestic,251,Serie A,0,2021/2022,33.0,2022-04-15,2022-04-18,Stadio Alberto Picco,0.0,La Spezia,10336.0,"(44.101711,9.808218)",Fabio Maresca,95726,345,251,Thiago Motta,Italy,1982-08-28,Brazil,São Bernardo do Campo,128160,2930,251,Simone Inzaghi,Italy,1976-04-05,Italy,Piacenza,clear,clear sky,4.27,350.0,3.0,51.0,#F0F0F0,#002B87,1013.0,11.0,3.0,8.0,4.0,6.0,4.0,9.0,6.0,1.0,33.0,2.0,0.0,0.0,4.0,5.0,,7.0,,,114.0,1.0,0.0,2.0,11.0,23.0,7.0,16.0,5.0,18.0,5.0,11.0,7.0,1.0,67.0,0.0,0.0,0.0,1.0,5.0,,17.0,,,139.0,3.0,0.0,0.0,13.0,275.0,190.0,69.09,600.0,511.0,85.17,99.0,33.0,114.0,69.0,"(44.1105, 9.8434)",17.1
18220185,384,18576,77454372,1721.0,13970,113,102,113.0,1,,4-3-3,4-3-2-1,2,0,1-0,2-0,,,2022-04-15 21:00:00,2022-04-15,Europe/Rome,90,459100,35571,2.0,18.0,12816,12059,14082,327,Milan,@acmilan,251,1899,1721,324,Genoa,@GenoaCFC,251,1893,86,domestic,251,Serie A,0,2021/2022,33.0,2022-04-15,2022-04-18,Stadio Giuseppe Meazza,1.0,Milano,80018.0,"(45.478025,9.124206)",Daniele Chiffi,459100,113,251,Stefano Pioli,Italy,1965-10-20,Italy,Parma,35571,102,11,Alexander Blessin,Germany,1973-05-28,Germany,,clear,clear sky,6.91,120.0,0.0,49.0,#C40010,#CCCCCC,1018.0,9.0,4.0,5.0,1.0,7.0,3.0,10.0,3.0,3.0,59.0,1.0,0.0,0.0,2.0,5.0,,7.0,,,108.0,2.0,0.0,3.0,17.0,8.0,2.0,6.0,1.0,3.0,4.0,19.0,1.0,1.0,41.0,2.0,0.0,0.0,3.0,5.0,,5.0,,,102.0,0.0,0.0,2.0,30.0,498.0,412.0,82.73,342.0,250.0,73.1,103.0,46.0,101.0,30.0,"(45.4643, 9.1895)",18.5
18165743,564,18462,77454016,133.0,13963,594,485,,1,,4-3-1-2,4-2-3-1,0,0,0-0,0-0,,,2022-04-15 21:00:00,2022-04-15,Europe/Rome,90,530755,523898,6.0,5.0,12257,13964,18344,293,Real Sociedad,@RealSociedad,32,1909,133,305,Real Betis,@RealBetis,32,1907,68,domestic,32,La Liga,0,2021/2022,32.0,2022-04-15,2022-04-18,Reale Arena,1.0,San Sebastián,32076.0,"(43.301376,-1.973602)",Isidro Díaz de Mera Escuderos,530755,594,32,Imanol Alguacil Barrenetxea,Spain,1971-07-04,Spain,Orio,523898,485,80,Manuel Luis Pellegrini Ripamonti,Chile,1953-09-16,Chile,Santiago de Chile,clouds,overcast clouds,3.0,207.0,100.0,85.0,#2B72DE,#EA9C08,1026.0,10.0,3.0,7.0,1.0,7.0,2.0,17.0,8.0,7.0,57.0,3.0,1.0,1.0,2.0,6.0,,6.0,,,77.0,0.0,0.0,,16.0,9.0,2.0,7.0,1.0,3.0,6.0,7.0,1.0,3.0,43.0,4.0,0.0,0.0,2.0,5.0,,7.0,,,94.0,0.0,0.0,,26.0,477.0,408.0,85.53,377.0,305.0,80.9,108.0,78.0,95.0,37.0,"(43.3128, -1.975)",13.0
18157344,301,18441,77453967,135.0,15484,598,6789,6789.0,1,,4-2-3-1,4-2-3-1,2,3,1-1,2-3,,,2022-04-15 21:00:00,2022-04-15,Europe/Rome,90,523953,2158189,3.0,6.0,12253,15840,17292,558,Rennes,@staderennais,17,1901,135,121,Monaco,@AS_Monaco,75285,1919,4451,domestic,17,Ligue 1,0,2021/2022,32.0,2022-04-15,2022-04-17,Roazhon Park,1.0,Rennes,29778.0,"(48.107458,-1.712839)",Stéphanie Frappart,523953,598,17,Bruno Génésio,France,1966-09-01,France,Lyon,2158189,6789,556,Philippe Clement,Belgium,1974-03-22,Belgium,Antwerpen,clear,clear sky,8.05,310.0,0.0,82.0,#C40010,#F0F0F0,1026.0,14.0,4.0,10.0,3.0,7.0,6.0,6.0,6.0,0.0,64.0,1.0,0.0,0.0,1.0,4.0,,9.0,,,103.0,2.0,1.0,1.0,9.0,12.0,4.0,8.0,1.0,9.0,3.0,15.0,0.0,1.0,36.0,3.0,0.0,0.0,2.0,3.0,,11.0,,,99.0,3.0,0.0,0.0,13.0,668.0,569.0,85.18,383.0,293.0,76.5,125.0,58.0,74.0,33.0,"(48.1667, -1.6667)",13.6


## *Data Cleaning Main Steps*
11) Add Target Feature
12) Drop Duplicates & Features
13) Handle Missing Data - Functions
14) Handle missing data
15) Store Data

##### 11) Add Target Feature

Function

In [13]:
def add_result(df):
    """ This functions adds a results column to a DataFrame (possible results are 0 = draw, 1 = home-win, 2 = away-win). The DataFrame must have a column for the home team's goal scored and one for the away team.

    df: Dataframe to add result column
    """
    d = {}
    for i in range(len(df)):
        if df['scores_home_score'].iloc[i] > df['scores_away_score'].iloc[i]:
            d[df.index[i]] = 1
        elif df['scores_home_score'].iloc[i] < df['scores_away_score'].iloc[i]:
            d[df.index[i]] = 2
        elif df['scores_home_score'].iloc[i] == df['scores_away_score'].iloc[i]:
            d[df.index[i]] = 0
    df['result'] = pd.Series(d)
    return df

Add columns for results and goal difference

In [14]:
all_df = add_result(all_df)
all_df['goal_diff'] = all_df['scores_home_score'].sub(all_df['scores_away_score'], axis = 0)

##### 12) Drop Duplicates & Features

In [15]:
# Drop columns that contain only NAs
all_df.dropna(axis=1, how='all', inplace=True) 
# Drop duplicate observations
all_df.drop_duplicates(inplace=True)
# Drop not relevant columns 
feats_drop = ['coaches_away_coach_id','coaches_home_coach_id','assistants_second_assistant_id','awaycoach_birthplace','assistants_fourth_official_id','assistants_first_assistant_id','home_legacy_id','away_legacy_id','homecoach_birthplace','homecoach_team_id','awaycoach_team_id','home_substitutions','home_goal_kick','home_goal_attempts','home_free_kick','home_throw_in','home_ball_safe','home_goals','home_penalties','home_injuries','away_substitutions','away_goal_kick','away_goal_attempts','away_free_kick','away_throw_in','away_ball_safe','away_goals','away_penalties','away_injuries']
all_df.drop(feats_drop, axis = 1, inplace = True)
# Create new dataframe 
final_16 = all_df[all_df['time_starting_at_date_time'] > '2015-07-01']
final_16 = final_16.sort_values(by='time_starting_at_date_time', ascending = True)
print(final_16.shape)

(23483, 113)


##### 13) Handle Missing Data - Functions

###### Filling using on-line searched informations (in dictionaries) 

In [16]:
def check_fill(df, col_check, all):
    """ This functions checks if there are still NAs in a specified column considering either the entire data (all=True) or only data from leagues matches excluding cups (all=False).
    df: Dataframe
    col_check: column to check
    all: True for entire data check, False for only leagues data check
    """
    if all:    
        if len(df[df[col_check].isna()]) != 0:
            print('ERROR: There are still', len(df[df[col_check].isna()]),  'NAs for all data (leagues + cups) in ', col_check, '!!!')
        else:
            print('Successful Filling for all data (leagues + cups) of', col_check)
    elif not all:
        if len(df[(df[col_check].isna()) & (df['league_is_cup'] == 0)]) != 0:
            print('ERROR: There are still', len(df[(df[col_check].isna()) & (df['league_is_cup'] == 0)]), 'NAs for leagues data in ', col_check, '!!!')
        else:
            print('Successful Filling for leagues data of', col_check)

In [17]:
def fill_follow_ID(df, subs_dict, ID_column, column_to_FILL, check = True):
    """ This functions fills NAs in a column using data from a dictionary.
    df: Dataframe
    subs_dict: dictionary (the key is the ID, the value is used to fill NAs)
    ID_column: column containing the IDs
    column_to_FILL: column containing missing values to fill
    check (Default: True): when True prints the checks on missing data 
    """
    for key, val in subs_dict.items():
        df.loc[(df[ID_column] == key) & (df[column_to_FILL].isna()), column_to_FILL] = val
    # Check
    if check:
        check_fill(df=df, col_check=column_to_FILL, all=False)
    return df

In [18]:
def fill_venue_capacity(df):
    """ This functions fills NAs in venue capacity following a dictionary.
    df: Dataframe
    """
    # Create a dictionary containing venues' capacity + manual dictionary 
    d = {df['venue_id'].iloc[i]: df['venue_capacity'].iloc[i] for i in range(len(df)) if df['venue_capacity'].iloc[i] is not pd.NA}
    venue_caps = {339996:33150, 6154:20000, 339714:54726, 232157:1000, 339832:51500, 339831:42358, 340122:29000, 2088:24000}
    d.update(venue_caps)
    df = fill_follow_ID(df=df, subs_dict=d, ID_column='venue_id', column_to_FILL='venue_capacity')
    return df
    
def fill_attendance(df):
    """ This functions fills NAs in venue attendance, as the mean for a specific venue and season, and alternatively by following a dictionary.
    df: Dataframe
    """
    # If possible fill using the mean for a specific venue during a specific season 
    df['attendance'] = df['attendance'].astype(float)
    df['attendance'] = df.groupby(['venue_id', 'season_id'])['attendance'].transform(lambda x: x.fillna(round(x.mean(), 1)))
    # If not possible use this averages dictionary
    avg_attendance_seriea22 = {'Inter':43549,'Hellas Verona':13350,'Torino':9465,'Empoli':6387,'Udinese':11655,'Bologna':14581,'Napoli':27593,'Roma':40723,'Cagliari':9400,'Sampdoria':8754,'Atalanta':10828,'Lazio':22056,'Fiorentina':20346, 'Juventus':22871, 'Sassuolo':6839, 'Genoa':13026, 'Milan':42388,'Salernitana':14323,'Spezia':6704,'Venezia':6731, 'Bastia':10511}
    df = fill_follow_ID(df=df, subs_dict=avg_attendance_seriea22, ID_column='home_name', column_to_FILL='attendance')
    return df

def fill_surface(df):
    # Create a dictionary containing venues' surface type  + manual dictionary 
    d = {df['venue_id'].iloc[i]: df['venue_surface_isgrass'].iloc[i] for i in range(len(df)) if df['venue_surface_isgrass'].iloc[i] is not pd.NA}
    diz_surf = {6154: 1, 339996: 1, 13977: 1, 18207: 0, 339714: 1, 18457: 1, 346: 1, 9325: 1, 232157: 1, 339832: 1, 339831: 1, 340122: 1, 22845: 1,8118: 0, 2088: 1}
    d.update(diz_surf)
    df = fill_follow_ID(df=df, subs_dict=d, ID_column='venue_id', column_to_FILL='venue_surface_isgrass')
    return df

In [19]:
def fill_twitter_names(df):
    """ This functions fills NAs in Twitter names following a dictionary.
    df: Dataframe
    """
    # Create a dictionary containing teams' twitter names + manual dictionary 
    d_H = {df['home_id'].iloc[i]: df['home_twitter'].iloc[i] for i in range(len(df)) if df['home_twitter'].iloc[i] is not pd.NA}
    d_A = {df['away_id'].iloc[i]: df['away_twitter'].iloc[i] for i in range(len(df)) if df['away_twitter'].iloc[i] is not pd.NA}
    twitter_names = {430:'@SCBastia', 3520:'@asnlofficiel', 9257:'@FCLorient', 3:'@SunderlandAFC', 344:'@RealSporting', 482:'@sv98', 573:'@Schanzer', 377:'@RayoVallecano', 6967:'@nimesolympiquel', 956:'@1_fc_nuernberg', 274:'@SDHuesca', 266:'@SB29', 271:'@RCLens', 6827:'@Cadiz_CF', 2927:'@arminia', 1099:'@elchecf', 6898:'@ClermontFoot', 3431:'@kleeblattfuerth', 999:'@VfLBochum1848eV', 1393:'@SMCaen'}
    d_A.update(twitter_names)
    d_H.update(d_A)
    # Home Twitter
    df = fill_follow_ID(df=df, subs_dict=d_H, ID_column='home_id', column_to_FILL='home_twitter')
    # Away Twitter 
    df = fill_follow_ID(df=df, subs_dict=d_H, ID_column='away_id', column_to_FILL='away_twitter')
    return df

def fill_colors(df):
    """ This functions fills NAs in colors following a dictionary.
    df: Dataframe
    """ 
    # Create two dictionaries containing teams' colors (differ between home and away) + manual dictionary   
    d_H = {df['home_id'].iloc[i]: df['colors_home_color'].iloc[i] for i in range(len(df)) if df['colors_home_color'].iloc[i] is not pd.NA}
    d_A = {df['away_id'].iloc[i]: df['colors_away_color'].iloc[i] for i in range(len(df)) if df['colors_away_color'].iloc[i] is not pd.NA}
    df = fill_follow_ID(df=df, subs_dict=d_H, ID_column='home_id', column_to_FILL='colors_home_color', check=False)
    df = fill_follow_ID(df=df, subs_dict=d_A, ID_column='away_id', column_to_FILL='colors_away_color', check=False)
    # Manual dictionary with team_IDs : missing colors
    extra_colors={3:'#EB172B',7:'#E11B22',22:'#F18A01',26:'#E03A3E',30:'#F0F0F0',126:'#A7D6F5',344:'#F0F0F0',429:'#B9D9EC',430:'#202A44',482:'#004F9F',573:'#D71920',1216:'#EEC0C8',1343:'#2F97DA',2708:'#F0F0F0',2921:'#FCC24F',3520:'#F0F0F0', 10729:'F0F0F0'}
    df = fill_follow_ID(df=df, subs_dict=extra_colors, ID_column='home_id', column_to_FILL='colors_home_color')
    df = fill_follow_ID(df=df, subs_dict=extra_colors, ID_column='away_id', column_to_FILL='colors_away_color')
    return df

###### Filling Coaches and Referees NAs

In [20]:
def coach_ref_extras_fill(df):
    """ This functions fills NAs in coach or referee extra columns (not IDs columns), by creating a dictionary and use it to fill missing values. (Make sure there are no missing values in the IDs columns!)
    df: Dataframe
    column: column with missing values to fill
    """
    coach_ref_tofill = ['homecoach_country_id', 'homecoach_birthdate', 'homecoach_nationality', 'homecoach_fullname', 'referee_fullname', 'awaycoach_country_id', 'awaycoach_birthcountry', 'awaycoach_nationality', 'awaycoach_fullname', 'awaycoach_birthdate', 'homecoach_birthcountry']
    # To select the right ID column
    for column in coach_ref_tofill:
        if 'coach' in column.lower(): 
            id_col = 'homecoach_coach_id' if 'home' in column.lower() else ('awaycoach_coach_id' if 'away' in column.lower() \
                else print('Selected a Coach column, but unclear if home or away!'))
        elif 'referee' in column.lower(): id_col = 'referee_id'
        else: print('Selected column is neither Coach or Referee!')
        # Create dictionary with ID column and column to fill and use fill_follow_ID()
        d = {df[id_col].iloc[i]: df[column].iloc[i] for i in range(len(df)) if df[column].iloc[i] is not pd.NA}
        df = fill_follow_ID(df = df, subs_dict = d, ID_column = id_col, column_to_FILL = column)
    return df
    
def coach_ref_ID_fill(df):
    """ This functions fills NAs in coach or referee IDs columns, by using fillna with bfill.
    df: Dataframe
    column: column with missing values to fill
    """
    ids_cols = ['referee_id', 'formations_home_formation', 'formations_away_formation', 'homecoach_coach_id', 'awaycoach_coach_id']
    # To select the right ID column
    for column in ids_cols:
        if 'coach' in column.lower() or 'formations' in column.lower():
            id_col = 'home_id' if 'home' in column.lower() else ('away_id' if 'away' in column.lower() \
                else print('Selected a coach or formation column, but unclear if home or away!'))
            df[column] = df.groupby([id_col, 'season_id'])[column].fillna(method='bfill')
        elif 'referee' in column.lower(): id_col = 'league_id'
        else: print('It is neither coach or referee or formation')
        # Use fillna() on the column
        df[column] = df.groupby([id_col, 'season_id'])[column].fillna(method='bfill')
        # Check
        check_fill(df=df, col_check=column, all=False)
    return df

###### Fill Rounds information

In [21]:
laliga_853 = [['2016-08-19','2016-08-22'],['2016-08-26','2016-08-28'],['2016-09-09','2016-09-11'],['2016-09-16','2016-09-19'],['2016-09-20','2016-09-22'],['2016-09-23','2016-09-26'],['2016-09-30','2016-10-02'],['2016-10-14','2016-10-17'],['2016-10-21','2016-10-23'],['2016-10-28','2016-10-31'],['2016-11-04','2016-11-06'],['2016-11-18','2016-11-21'],['2016-11-25','2016-11-28'],['2016-12-03','2016-12-05'],['2016-12-09','2016-12-12'],['2016-12-16','2016-12-19'],['2017-01-06','2017-01-09'],['2017-01-14','2017-01-16'],['2017-01-20','2017-01-22'],['2017-01-27','2017-01-30'],['2017-02-04','2017-02-06'],['2017-02-10','2017-02-13'], ['2017-02-17','2017-02-20'],['2017-02-22','2017-02-26'],['2017-02-28','2017-03-02'],['2017-03-03','2017-03-06'],['2017-03-08','2017-03-13'],['2017-03-17','2017-03-19'],['2017-03-31','2017-04-03'],['2017-04-04','2017-04-06'],['2017-04-07','2017-04-10'],['2017-04-14','2017-04-17'],['2017-04-21','2017-04-24'],['2017-04-25','2017-04-27'],['2017-04-28','2017-05-01'],['2017-05-05','2017-05-08'],['2017-05-13','2017-05-14'],['2017-05-17','2017-05-21']] # La Liga rounds
seriea_802 = [['2016-08-20','2016-08-21'],['2016-08-27','2016-08-28'],['2016-09-10','2016-09-12'],['2016-09-16','2016-09-18'],['2016-09-20','2016-09-21'],['2016-09-24','2016-09-26'],['2016-10-01','2016-10-02'],['2016-10-15','2016-10-17'],['2016-10-22','2016-10-23'],['2016-10-25','2016-10-27'],['2016-10-29','2016-10-31'],['2016-11-05','2016-11-06'],['2016-11-19','2016-11-20'],['2016-11-26','2016-11-28'],['2016-12-02','2016-12-05'],['2016-12-10','2016-12-12'],['2016-12-15','2016-12-18'],['2016-12-20','2016-12-22'],['2017-01-07','2017-01-08'],['2017-01-14','2017-01-16'],['2017-01-21','2017-01-22'],['2017-01-28','2017-02-01'],['2017-02-04','2017-02-08'],['2017-02-10','2017-02-13'],['2017-02-17','2017-02-19'],['2017-02-25','2017-02-27'],['2017-03-04','2017-03-05'],['2017-03-10','2017-03-13'],['2017-03-18','2017-03-19'],['2017-04-01','2017-04-03'],['2017-04-08','2017-04-09'],['2017-04-15','2017-04-16'],['2017-04-22','2017-04-24'],['2017-04-28','2017-04-30'],['2017-05-06','2017-05-07'],['2017-05-13','2017-05-14'],['2017-05-20','2017-05-22'],['2017-05-27','2017-05-28']] # Serie A rounds
liga2015 = [['2015-08-21','2015-08-24'],['2015-08-28','2015-08-30'],['2015-09-11','2015-09-14'],['2015-09-18','2015-09-20'],['2015-09-22','2015-09-24'],['2015-09-25','2015-09-27'],['2015-10-02','2015-10-04'],['2015-10-17','2015-10-19'],['2015-10-23','2015-10-26'],['2015-10-30','2015-11-01'],['2015-11-06','2015-11-08'],['2015-11-21','2015-11-23'],['2015-11-27','2015-11-29'],['2015-12-05','2015-12-07'],['2015-12-11','2015-12-13'],['2015-12-19','2015-12-20'],['2015-12-30','2015-12-31'],['2016-01-02','2016-01-04'],['2016-01-09','2016-01-10'],['2016-01-16','2016-01-18'],['2016-01-22','2016-01-25'],['2016-01-30','2016-02-01'],['2016-02-05','2016-02-08'],['2016-02-12','2016-02-14'],['2016-02-19','2016-02-21'],['2016-02-26','2016-02-28'],['2016-03-01','2016-03-03'],['2016-03-05','2016-03-07'],['2016-03-11','2016-03-14'],['2016-03-18','2016-03-20'],['2016-04-01','2016-04-04'],['2016-04-08','2016-04-11'],['2016-04-15','2016-04-17'],['2016-04-19','2016-04-21'],['2016-04-22','2016-04-25'],['2016-04-29','2016-05-02'],['2016-05-08','2016-05-09'],['2016-05-13','2016-05-15']] # LaLiga 2015-2016 rounds

def str_date(string):
    return datetime.strptime(string, '%Y-%m-%d').date()

def fill_rounds(df, index, list_of_dates):
    """ This functions fills NAs in round columns using a list of lists of rounds limits.
    df: Dataframe
    index: since this function is called inside a for loop
    list_of_dates: list of lists containing round start and end dates
    """
    # Get round_name
    for round, date_list in enumerate(list_of_dates):
        if str_date(date_list[0]) <= df.loc[index,'time_starting_at_date_time'].date() <= str_date(date_list[1]):
            df.at[index, 'round_name'] = int(round) + 1
    # Get round_start and round_end using round_name
    df.loc[index, 'round_start'] = datetime.strptime(list_of_dates[df.loc[index,'round_name']-1][0], '%Y-%m-%d')
    df.loc[index, 'round_end'] = datetime.strptime(list_of_dates[df.loc[index,'round_name']-1][1], '%Y-%m-%d')
    return df
    
def round_all_fill(df):
    """ This functions fills NAs round specific columns (round_name, round_start, round_end), both using a dictionary and a list of round specific dates (season 2016/2017 of La Liga and Serie A, since both are missing all rounds).
    df: Dataframe
    """
    global laliga_853, seriea_802
    # Handle NAs using manual dictionary: for suspended games and other exceptions
    diz_exc = {299930:3, 301697:19, 301642:18, 301647:18, 404577:16, 405042:21, 405051:21, 404221:16}
    for key, value in diz_exc.items():
        if df.loc[key, 'season_id'] == 802: list_dates = seriea_802
        elif df.loc[key, 'season_id'] == 853: list_dates = laliga_853
        elif df.loc[key, 'season_id'] == 2063: list_dates = liga2015
        df.loc[key, 'round_name'] = value
        df.loc[key, 'round_start'] = datetime.strptime(list_dates[df.loc[key,'round_name']-1][0], '%Y-%m-%d')
        df.loc[key, 'round_end'] = datetime.strptime(list_dates[df.loc[key,'round_name']-1][1], '%Y-%m-%d')
    # Using Loop and checking for NAs 
    for ind in df.index:
        # Serie A
        if (pd.isnull(df.loc[ind, 'round_name']) and pd.isnull(df.loc[ind, 'round_start']) and pd.isnull(df.loc[ind, 'round_end']) \
            and df.loc[ind, 'season_id'] == 802): df = fill_rounds(df = df, index = ind, list_of_dates = seriea_802)
        # La Liga             
        elif pd.isnull(df.loc[ind, 'round_name']) and pd.isnull(df.loc[ind,'round_start']) and pd.isnull(df.loc[ind, 'round_end'])\
            and df.loc[ind, 'season_id'] == 853: df = fill_rounds(df = df, index = ind, list_of_dates = laliga_853)
        elif (pd.isnull(df.loc[ind, 'round_name']) and pd.isnull(df.loc[ind, 'round_start']) and pd.isnull(df.loc[ind, 'round_end']) \
            and df.loc[ind, 'season_id'] == 2063): df = fill_rounds(df = df, index = ind, list_of_dates = liga2015)
    # Check
    check_fill(df=df, col_check='round_name', all=False)
    check_fill(df=df, col_check='round_start', all=False)
    check_fill(df=df, col_check='round_end', all=False)
    return df

###### Fill Stats Columns

In [22]:
def fill_shot_plus(df):
    """ This functions fills NAs for all the stats features where a NA is not correlated with a 0 value.
    df: Dataframe
    """
    # Inside shots + Attacks
    shot_cols = ['away_shots_insidebox', 'home_shots_insidebox']
    for col in shot_cols:
        df[col] = df.groupby(['home_shots_total', 'home_shots_ongoal'])[col].transform(lambda x: x.fillna(round(x.mean(), 1)))
        check_fill(df=df, col_check=col, all=False) # check
    # Tackles
    df['home_tackles'] = df.groupby(['league_id', 'home_possessiontime'])['home_tackles'].\
        transform(lambda x: x.fillna(round(x.mean(), 1)))
    df['away_tackles'] = df.groupby(['league_id', 'away_possessiontime'])['away_tackles'].\
        transform(lambda x: x.fillna(round(x.mean(), 1)))
    check_fill(df=df, col_check='home_tackles', all=False) # check
    check_fill(df=df, col_check='away_tackles', all=False) # check
    # Outside shots
    df['home_shots_outsidebox'] = df['home_shots_outsidebox'].fillna((df['home_shots_total'] - df['home_shots_insidebox']))
    df['away_shots_outsidebox'] = df['away_shots_outsidebox'].fillna((df['away_shots_total'] - df['away_shots_insidebox']))
    df.loc[df['away_shots_outsidebox'] < 0, 'away_shots_outsidebox'] = 0 
    df.loc[df['home_shots_outsidebox'] < 0, 'home_shots_outsidebox'] = 0 
    check_fill(df=df, col_check='home_shots_outsidebox', all=False) # check
    check_fill(df=df, col_check='away_shots_outsidebox', all=False) # check
    # Blocked shots 
    df['home_shots_blocked'] = df['home_shots_blocked'].\
        fillna((df['home_shots_total'] - df['home_shots_ongoal'] - df['home_shots_offgoal']))
    df['away_shots_blocked'] = df['away_shots_blocked'].\
        fillna((df['away_shots_total'] - df['away_shots_ongoal'] - df['away_shots_offgoal']))
    df.loc[df['home_shots_blocked'] < 0, 'home_shots_blocked'] = 0 
    df.loc[df['away_shots_blocked'] < 0, 'away_shots_blocked'] = 0 
    check_fill(df=df, col_check='home_shots_blocked', all=False) # check
    check_fill(df=df, col_check='away_shots_blocked', all=False) # check
    # Saves
    df['home_saves'] = df.groupby(['away_shots_ongoal'])['home_saves'].transform(lambda x: x.fillna(round(x.mean(), 1)))
    df['away_saves'] = df.groupby(['home_shots_ongoal'])['away_saves'].transform(lambda x: x.fillna(round(x.mean(), 1)))
    check_fill(df=df, col_check='home_saves', all=False) # check
    check_fill(df=df, col_check='away_saves', all=False) # check
    # Fouls
    df['home_fouls'] = df.groupby(['home_tackles'])['home_fouls'].transform(lambda x: x.fillna(round(x.mean(), 1)))
    df['away_fouls'] = df.groupby(['away_tackles'])['away_fouls'].transform(lambda x: x.fillna(round(x.mean(), 1)))
    check_fill(df=df, col_check='home_fouls', all=False) # check
    check_fill(df=df, col_check='away_fouls', all=False) # check
    # Attacks
    df['home_attacks_attacks'] = df.groupby(['home_shots_total'])['home_attacks_attacks'].\
        transform(lambda x: x.fillna(round(x.mean(), 1)))
    df['away_attacks_attacks'] = df.groupby(['away_shots_total'])['away_attacks_attacks'].\
        transform(lambda x: x.fillna(round(x.mean(), 1)))
    check_fill(df=df, col_check='home_attacks_attacks', all=False) # check
    check_fill(df=df, col_check='away_attacks_attacks', all=False) # check    
    # Dangerous Attacks
    df['home_attacks_dangerous_attacks'] = df.groupby(['home_shots_ongoal'])['home_attacks_dangerous_attacks'].\
        transform(lambda x: x.fillna(round(x.mean(), 1)))
    df['away_attacks_dangerous_attacks'] = df.groupby(['away_shots_ongoal'])['away_attacks_dangerous_attacks'].\
        transform(lambda x: x.fillna(round(x.mean(), 1)))
    check_fill(df=df, col_check='home_attacks_dangerous_attacks', all=False) # check
    check_fill(df=df, col_check='away_attacks_dangerous_attacks', all=False) # check 
    return df

def fill_0_list(df, stats_list):
    """ This functions fills NAs for other stats columns with value 0.
    df: Dataframe
    stats_list: list of columns to fill with 0
    """
    for stat_col in stats_list:
        df[stat_col] = df[stat_col].fillna(0)   
        check_fill(df=df, col_check=stat_col, all=False) # check
    return df

##### 14) Handle Missing Data - Operations

In [23]:
def run_all_fill_functions(df):
    stats_tofill = ['home_corners', 'home_offsides', 'home_yellowcards', 'home_redcards', 'home_yellowredcards',  'away_corners', 'away_offsides', 'away_yellowcards', 'away_redcards', 'away_yellowredcards']
    df = fill_venue_capacity(df)
    df = fill_attendance(df)
    df = fill_surface(df)
    df = fill_twitter_names(df)
    df = fill_colors(df)
    df = round_all_fill(df)
    df = fill_shot_plus(df)
    df = coach_ref_ID_fill(df)
    df = coach_ref_extras_fill(df)
    df = fill_0_list(df, stats_tofill)
    return df

In [24]:
def some_manual_fills(df):
    # Manually change values 
    df.loc[df['homecoach_coach_id'] == 37606424, 'homecoach_birthdate'] = datetime.strptime('1981-07-02', '%Y-%m-%d')
    df.loc[df['awaycoach_coach_id'] == 37606424, 'awaycoach_birthdate'] = datetime.strptime('1981-07-02', '%Y-%m-%d')
    df.loc[299782, 'scores_ht_score'] = '1-0'
    df.loc[18156836, 'scores_ht_score'] = '0-0'
    df['referee_id'].replace(to_replace=74217, value=77, inplace=True)
    df.loc[df['winner_team_id'].isna(), 'winner_team_id'] = 0
    return df

In [25]:
# Use fill functions created
final_16 = some_manual_fills(final_16)
final_16 = run_all_fill_functions(final_16)

ERROR: There are still 108 NAs for leagues data in  venue_capacity !!!
ERROR: There are still 742 NAs for leagues data in  attendance !!!
ERROR: There are still 147 NAs for leagues data in  venue_surface_isgrass !!!
ERROR: There are still 19 NAs for leagues data in  home_twitter !!!
ERROR: There are still 19 NAs for leagues data in  away_twitter !!!
ERROR: There are still 38 NAs for leagues data in  colors_home_color !!!
ERROR: There are still 19 NAs for leagues data in  colors_away_color !!!
Successful Filling for leagues data of round_name
Successful Filling for leagues data of round_start
Successful Filling for leagues data of round_end
ERROR: There are still 386 NAs for leagues data in  away_shots_insidebox !!!
ERROR: There are still 386 NAs for leagues data in  home_shots_insidebox !!!
ERROR: There are still 392 NAs for leagues data in  home_tackles !!!
ERROR: There are still 392 NAs for leagues data in  away_tackles !!!
ERROR: There are still 386 NAs for leagues data in  home_sho

##### 15) Store Data

In [27]:
final_16.to_csv('../../Data/From_Preparation/match_cleaned.csv')