Thank you, this has been an interesting competition. :-)

I approached the problem in the following way:
- We need to predict the outcome of a football match, so
- We will predict which team scored more goals, so
- We will predict the number of goals scored by each team

1. We will generate some features and later compress some of them with a simple neural network (Keras) used as an encoder/decoder. We set input=output and, once the learning is done, we take the values at the middle of the network and use them as features.
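
The "input=output, keep the middle layer" idea from step 1 can be illustrated in a few lines. This is only a sketch and not the network used in this notebook: it swaps Keras for a tiny linear encoder/decoder trained with plain NumPy gradient descent, on made-up data. The names `W_enc`, `W_dec` and `compressed_features`, the sizes and the learning rate are all assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 rows, 6 features that really depend on only 2 hidden factors
hidden = rng.normal(size=(200, 2))
X = hidden @ rng.normal(size=(2, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature

n_features, n_code = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(n_features, n_code))  # encoder: 6 -> 2
W_dec = rng.normal(scale=0.1, size=(n_code, n_features))  # decoder: 2 -> 6


def reconstruction_error(X, W_enc, W_dec):
    """Mean squared difference between the input and its reconstruction."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))


initial_error = reconstruction_error(X, W_enc, W_dec)

learning_rate = 0.01
for _ in range(2000):
    code = X @ W_enc               # values at the middle of the network
    residual = code @ W_dec - X    # the target is the input itself
    grad_dec = code.T @ residual / len(X)
    grad_enc = X.T @ (residual @ W_dec.T) / len(X)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_error = reconstruction_error(X, W_enc, W_dec)

# After training, only the encoder half is needed: its output is the new feature set
compressed_features = X @ W_enc
print(compressed_features.shape, initial_error, final_error)
```

The decoder exists only to give the encoder something to learn against. Once training is done it is discarded and the 2 middle values replace the 6 original columns.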

2. We will later "compress" the rows to keep only the relevant ones, using the following method: we run a prediction (supervised learning) but use its results more like a clustering algorithm (unsupervised learning). We do this to discard the rows we do not want to analyze, so we can concentrate our effort and our computing power on the most relevant data.

3. Once we get to this point, if we analyze the data carefully we’ll see that some patterns emerge, and it's time to answer the following questions:
    - Will this generalize well? In other words, would freshly generated data following the same approach show the same patterns? Yes
    - Does this put us in a better place to determine the outcome of a football match? Yes 

    Key patterns:  
    - 3-move plays (assist, goal scored, goal conceded)
    - 2-move plays (goal scored, goal conceded)
    - Throughout, we’ll be trying to identify strikers and goalkeepers so we can decide who scored a goal and who conceded a goal.
    - Difficulties:
        - Some data is missing on test
        - Some data is duplicated on train
        - Some players were not correctly identified, so we’ll need to take care of this
        - Own goals are a bit more difficult to process because, for example, a player scoring an own goal will appear as belonging to the rival team

    - Notes:
        - No GPU is used or required
        - Run time on a modest set-up is below 15 minutes
        - Sometimes the code works with "Opposite Team" where it doesn't seem to make a lot of sense. This is not meant to confuse: I started developing the solution from the goalkeeper's point of view, so when I detected a goal it was the Opposite Team scoring it. Later, to improve the model, I also had to consider the strikers.
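
The 3-move pattern listed above can be read with a simple majority rule. The sketch below is a simplified, pure-Python illustration and not the notebook's own function. It assumes each action is a hypothetical `(player_id, opposition_team)` pair, and that the team faced twice in the play is the one that conceded. `read_three_move_play` and the team and player names are made up for the example.

```python
from collections import Counter


def read_three_move_play(play, home_team, away_team):
    """Decide who scored in one 3-move play.

    `play` is a list of (player_id, opposition_team) pairs, one per action.
    """
    team_counts = Counter(opposition for _, opposition in play)
    if len(play) != 3 or len(team_counts) != 2:
        raise ValueError("expected 3 actions involving exactly 2 teams")

    # The team that was faced twice is the one that conceded the goal
    conceding_team = team_counts.most_common(1)[0][0]
    if conceding_team == home_team:
        scoring_team = away_team
    elif conceding_team == away_team:
        scoring_team = home_team
    else:
        raise ValueError("play does not match the teams of this game")

    # Players who faced the conceding team attacked; the remaining one conceded
    attackers = [player for player, opposition in play if opposition == conceding_team]
    keepers = [player for player, opposition in play if opposition != conceding_team]
    return scoring_team, attackers, keepers


# Hypothetical game: "Reds" (home) against "Blues" (away)
play = [("p7", "Reds"), ("p9", "Reds"), ("p1", "Blues")]
result = read_three_move_play(play, home_team="Reds", away_team="Blues")
print(result)  # ('Blues', ['p7', 'p9'], ['p1'])
```

Own goals do not fit this simple reading, because the scorer shows up on the rival side, which is why they are handled separately.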
        

Should you find something wrong or not correctly explained, please feel free to contact me so I have the opportunity to review it.


Versions used:

Python                                3.7.12

numpy                                 1.21.6
pandas                                1.3.5
scikit-learn                          1.0.2
keras                                 2.6.0
tensorflow                            2.6.4
lightgbm                              3.3.2


The notebook generates 2 files in the default output directory (no path added). The one that reached second place is:
Filename: Final_submit2_NuSVC.csv

No external data files were used. Competition files are expected to be placed at:
../input/ladu-data/


In [None]:
# Define variables with values

name="ladu_3030k_v7_" #Name that will be used for the submit

#The following variables are used by two_items_loop() and two_items()
#They define the upper limit and lower limit for both the initial and main phases
#two_items_loop() starts at the initial upper limit and runs down to the initial lower limit
#Then it starts again at the main upper limit and runs down to the main lower limit
#The limits apply to: 
# The number of goals conceded to consider a player a goalkeeper 
# The number of goals scored or assists to consider a player part of the team strikers
#For further information please review the function docstrings

two_items_initial_upper_limit=50 #We'll go in steps of -10 because this is the initial phase 

two_items_main_upper_limit=int(two_items_initial_upper_limit/2) #We'll go in steps of -5 because this is the main phase and we start at half the initial phase value
two_items_main_lower_limit=two_items_initial_lower_limit=1 #These are the lower limits
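
To make the two phases concrete, the sketch below lists the thresholds that the comments above describe. `limit_schedule` is a hypothetical helper, not a function from this notebook. How the real loop treats the lower limit when the step does not land exactly on it is an assumption here: the sketch simply stops at the last value that is still at or above it.

```python
def limit_schedule(upper_limit, lower_limit, step):
    """Thresholds visited when stepping down from upper_limit towards lower_limit."""
    return list(range(upper_limit, lower_limit - 1, -step))


initial_upper_limit = 50
main_upper_limit = int(initial_upper_limit / 2)
lower_limit = 1

initial_phase = limit_schedule(initial_upper_limit, lower_limit, 10)
main_phase = limit_schedule(main_upper_limit, lower_limit, 5)

print(initial_phase)  # [50, 40, 30, 20, 10]
print(main_phase)     # [25, 20, 15, 10, 5]
```

Starting high means that only players seen in many goals are trusted at first. Each lower threshold then admits players with less evidence.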

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# Imports

import numpy as np
import pandas as pd
import warnings
from pandas.core.common import SettingWithCopyWarning
warnings.filterwarnings("ignore", category=SettingWithCopyWarning)
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.svm import NuSVC
from keras import Model, Input, Sequential, preprocessing, regularizers
from keras.layers import Dense
from lightgbm import LGBMClassifier

In [None]:
# Define empty variables

cols=[] # We will be using this list to define the columns that we want to use at each time
keep_train=pd.DataFrame() # We will keep here data that we want to drop at some time and recover later 
keep_test=pd.DataFrame() # We will keep here data that we want to drop at some time and recover later
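strikers=pd.Series(dtype=object) #List of identified strikers (and other attacking players); explicit dtype avoids the empty Series warning
goalkeepers=pd.Series(dtype=object) #List of identified goalkeepers; explicit dtype avoids the empty Series warning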
two_items_initial_current_limit=0 #Here we will store current value to loop on two_items_loop() and two_items()
two_items_main_current_limit=0 #Here we will store current value to loop on two_items_loop() and two_items()
stk_and_gk=[] #Here we will store players that are considered both strikers and goalkeepers
gk_and_stk=[] #Here we will store players that are considered both goalkeepers and strikers

  
import sys


In [None]:
# Functions

def prepare_relevant_actions_df(relevant_actions_df):
    ''' To prepare relevant_actions DataFrames
    We use this function to prepare the data to check if some actions are close to each other in time
    We calculate the difference between current id (taken from the index) and next id  
    We add an arbitrary value (666) to the last row to mark it as last row of the Game '''
    
    relevant_actions_df=relevant_actions_df.sort_values(["Game_ID", "Start_minutes"])
    relevant_actions_df["id"]=relevant_actions_df.index.copy()
    relevant_actions_df["next_id"]=relevant_actions_df["id"].shift(-1).fillna(relevant_actions_df.index.max()).astype(int)
    relevant_actions_df["next_min"]=relevant_actions_df["Start_minutes"].shift(-1).fillna(relevant_actions_df["Start_minutes"].iloc[-1]).astype(int)
    relevant_actions_df["diff_min"]=abs(relevant_actions_df["next_min"]-relevant_actions_df["Start_minutes"])
    relevant_actions_df["diff_id"]=abs(relevant_actions_df["next_id"]-relevant_actions_df["id"])
    relevant_actions_df.iloc[-1, relevant_actions_df.columns.get_loc("diff_id")]=666 #We add an arbitrary value (666) to the last row to mark it as last row of the Game (single positional assignment instead of chained indexing)
    relevant_actions_df.reset_index(drop=True, inplace=True)
    return(relevant_actions_df)

def reality_check():
    ''' To check the difference between predictions and reality
    First load the data from X_train
    Then calculate the differences
    And then show the differences'''
    
    reality["predi_Home"]=X_train["Home"]
    reality["predi_Away"]=X_train["Away"]
    reality["diff_Home"]=reality["Home"]-reality["predi_Home"]
    reality["diff_Away"]=reality["Away"]-reality["predi_Away"]
    print("\nHome goals differences: ")
    print(reality["diff_Home"].value_counts())
    print("\nAway goals differences: ")
    print(reality["diff_Away"].value_counts())
    print("\nTotal goals difference: ", reality["diff_Home"].sum()+ reality["diff_Away"].sum())
    return()

def prediction_log_loss(learner, cols):
    ''' To run a prediction and score it with log_loss 
    First we learn on train and predict on val 
    Then we learn on val and predict on train 
    Finally we append both predictions and check the global result '''
    
    global smallest_error #We store the smallest error so far
    global smallest_error_model #We store the model which generated the smallest error so far
    print(str(learner)[0:20])
    learner.fit(X_train[cols], y_train)
    predict_proba_val1=learner.predict_proba(X_val[cols])
    predict_proba_val1=pd.DataFrame(predict_proba_val1)
    predict_val1=learner.predict(X_val[cols])
    predict_val1=pd.DataFrame(predict_val1)
    print(pd.DataFrame(predict_proba_val1).max(axis=1).min(), pd.DataFrame(predict_proba_val1).max(axis=1).mean(),pd.DataFrame(predict_proba_val1).max(axis=1).max())
    print("log_loss 1:               ",log_loss(y_val, predict_proba_val1))
    check=predict_val1.copy()
    check["real"]=y_val.reset_index(drop=True)
    predict_proba_val1=predict_proba_val1.set_index(y_val.index) #Use the y_val index so later we can run the append with train

    learner.fit(X_val[cols], y_val)
    predict_proba_val2=learner.predict_proba(X_train[cols])
    predict_proba_val2=pd.DataFrame(predict_proba_val2)
    predict_val2=learner.predict(X_train[cols])
    predict_val2=pd.DataFrame(predict_val2)
    print("log_loss 2:               ",log_loss(y_train, predict_proba_val2))
    check=predict_val2.copy()
    check["real"]=y_train
    
    predict_proba_val=predict_proba_val2.append(predict_proba_val1)
    predict_val=predict_val2.append(predict_val1)
    predict_val.reset_index(drop=True, inplace=True) #Reset the index to match y_full index
    y_full=y_train.append(y_val)
    prediction_error=log_loss(y_full, predict_proba_val) #Computed once and reused below
    print("log_loss sum:", prediction_error)
    check=predict_val.copy()
    check["real"]=y_full.copy()
    #print("Number of errors: ", check[check[0]!=check["real"]][0].count())
    #print("Errors: ", check[check[0]!=check["real"]][0].index.values)
    if prediction_error== smallest_error:
        #If it's the same error we want to also record the learner name
        smallest_error_model=smallest_error_model+" - "+str(learner)[0:20]
    if prediction_error< smallest_error:
        #If it's the smallest error we want to print it and to update the smallest error with the new value
        print("We have a new Top Model: ", str(learner)[0:20])
        print("Log_loss:", prediction_error, "\n\n") 
        smallest_error=prediction_error 
        smallest_error_model=str(learner)[0:20]
    return()


def mark_plays(relevant_actions_df):
    ''' To mark each play with a number 
    We mark plays grouped in groups of 2 or 3 items with a number so later we can process them '''
    errors=[] #We start with no errors
    relevant_actions_df["play"]=0 #We set default play to 0 and later we'll check if it changes
    last_play=1 #We start from play 1 because by default we have set 0
    

    for game in relevant_actions_df["Game_ID"].unique():
        relevant_actions_df["diff_min"].iloc[relevant_actions_df[relevant_actions_df["Game_ID"]==game].tail(1).index]=666        

    row=relevant_actions_df.index.min() #We want to start from min and go increasing it
    cut=1.00000001 #Maximum difference to consider two actions as part of same play. We set 1 minute but we avoid the exact value of 1, this gives a tiny improvement 
    while row < relevant_actions_df.index.max(): #We will be increasing row in the code until we reach max
        df="" #We make sure we start with empty DataFrame
        game=relevant_actions_df["Game_ID"].loc[row] #We will work with the game to which current row belongs to
        if game==relevant_actions_df.loc[row+1]["Game_ID"]: #If row+1 also belongs to same Game_ID
            if relevant_actions_df.loc[row]["diff_min"]<cut: #If diff_min (difference between current and next row) is below predefined cut value
                if relevant_actions_df.loc[row+1]["diff_min"]>cut: #If next row difference is above cut should be a 2items play 
                    relevant_actions_df["play"].loc[row]=last_play
                    relevant_actions_df["play"].loc[row+1]=last_play
                    last_play=last_play+1
                    row=row+2

                else: #If next row difference is below cut it should be a 3 items play 
                    if game==relevant_actions_df.loc[row+2]["Game_ID"]: #If row+2 also belongs to same Game_ID (we assume row+1 also belongs to same Game_ID)
                        if relevant_actions_df.loc[row+2]["diff_min"]>cut: #If row+2 difference is over cut (this means the play finished on row+2 and row+3 is another play)
                            relevant_actions_df["play"].loc[row]=last_play
                            relevant_actions_df["play"].loc[row+1]=last_play
                            relevant_actions_df["play"].loc[row+2]=last_play
                            last_play=last_play+1
                            row=row+3
                        else:
                            print("This is not correctly cut, no problem, we'll adjust it later. Issue on : ", row)
                            errors.append(row)
                            row=row+1
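                    else:
                        print("***************** ERROR - Looks like different game on row: ", row)
                        row=row+1 #Move on, otherwise the loop would never advance past this row
            else:
                print("***************** ERROR - First row over cut: ", row)
                print(relevant_actions_df.loc[row]["diff_min"], cut)
                row=row+1
        else:
            #The next row belongs to a different game so this row cannot start a play, skip it so the loop always advances
            row=row+1
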
    games=relevant_actions_df[relevant_actions_df["play"]==0]["Game_ID"].to_list()
    print(len(relevant_actions_df[relevant_actions_df["play"]==0][["Game_ID", "diff_min"]]))
    print(relevant_actions_df[relevant_actions_df["play"]==0][["Game_ID", "diff_min", "diff_id"]])
    return(pd.DataFrame(errors))

def fix_mark_plays_errors(errors, relevant_actions_df):
    ''' To fix the plays which are not correctly marked 
    We detect and adjust the plays which previously were incorrectly marked by mark_plays() '''
    errors["next_1"]=errors[0].shift(-1) #We want to shift 1 position so later we can compare values
    errors["next_2"]=errors[0].shift(-2) #We want to shift 2 positions so later we can compare values
    row=0 #We start from row 0
    print("Errors length: ", len(errors))
    while row < len(errors):#We will be increasing row in the code until we reach max
        #print(row)
        if errors[0].iloc[row]==errors["next_1"].iloc[row]-1: 
            #next_1 minus 1 should be equal to current
            if errors[0].iloc[row]==errors["next_2"].iloc[row]-2:
                #next_2 minus 2 should be equal to current if we have to cut the play at next_2
                relevant_actions_df["diff_min"].iloc[int(errors["next_2"].iloc[row])]=666 #Assign arbitrary value
                row=row+2 #Later we'll add 1 more
            else:
                #Only next_1 is consecutive, so let's assign a high value to the row right after next_1 to cut the play there
                relevant_actions_df["diff_min"].iloc[int(errors["next_1"].iloc[row]+1)]=666 #Assign arbitrary value
                row=row+1 #Later we'll add 1 more
        else:
            #If next_1 is not consecutive then let's mark the row right after the current one with an arbitrary value
            relevant_actions_df["diff_min"].iloc[int(errors[0].iloc[row]+1)]=666 #Assign arbitrary value
            print("Fixing: ", int(errors[0].iloc[row]+1))
        row=row+1
    return()


def three_items(relevant_actions_df, X_df, processing_train):
    ''' To run a three_items loop on both train and test datasets 
    Once we have the data with a known pattern we process it looking for plays of 3 moves'''  
    global strikers
    global goalkeepers
    own_goals=0
    print("Initial length: ", len(relevant_actions_df))
    to_drop=[] #We'll keep rows to drop here
    epoch=0 #We start by epoch 0
    row=relevant_actions_df.index.min() #We want to start from min and go increasing it
    cut=16 #Maximum difference to consider two actions as part of same play
    while row < relevant_actions_df.index.max(): #We will be increasing row in the code until we reach max
        game=relevant_actions_df.loc[row]["Game_ID"] #We will work with the game to which current row belongs to

        if len(relevant_actions_df[relevant_actions_df["play"]==relevant_actions_df["play"].iloc[row]]) == 2:
            #If it's a 2items play we don't want to process it now (we still want to learn more about goalkeepers and strikers)
            row=row+2
        elif len(relevant_actions_df[relevant_actions_df["play"]==relevant_actions_df["play"].iloc[row]]) != 3:     
            print("Error, we shouldn't have plays not 2items or 3items. Please check row: ", row)
            row=row+1
        else:
            df=relevant_actions_df[relevant_actions_df["play"]==relevant_actions_df["play"].iloc[row]]
            if len(df["Opposition_Team"].value_counts()) ==1:
                #If we only have one team we have an error because we expected to have 2 teams
                print(row, "          ERRRORRRRRRRRRRRRRRRRRRRRRRR, only 1 team")
                print(df)
            if len(df["Opposition_Team"].value_counts()) >2:
                #If we have more than 2 teams we have an error 
                print(row, "          ERRRORRRRRRRRRRRRRRRRRRRRRRR, more than 2 teams")
                print(df)
            if len(df["Opposition_Team"].value_counts()) ==2:
                #If we have 2 teams proceed with the data processing (add the strikers, goalkeepers and goals)
                #most_common_team is the Opposition_Team value that appears most often in this play
                #.mode().iloc[0] returns it as a plain value on any pandas version, so the old_pandas string workaround is not needed here
                most_common_team=df["Opposition_Team"].mode().iloc[0]
                if most_common_team == X_df.loc[X_df["Game_ID"]==game]["Home Team"].values:
                    #The most common Opposition_Team is the home team so let's add the goal to the away team
                    X_df.loc[X_df["Game_ID"]==game, "Away"]=X_df.loc[X_df["Game_ID"]==game]["Away"]+1
                if most_common_team == X_df.loc[X_df["Game_ID"]==game]["Away Team"].values:
                    #The most common Opposition_Team is the away team so let's add the goal to the home team
                    X_df.loc[X_df["Game_ID"]==game, "Home"]=X_df.loc[X_df["Game_ID"]==game]["Home"]+1

                if len(df["Player_ID"].unique()) == 3:
                # 3 different players, 1 goalkeeper, 2 strikers
                    strikers=strikers.append(pd.Series(df[df["Opposition_Team"]==most_common_team]["Player_ID"].head(1).to_list()), ignore_index=True)
                    strikers=strikers.append(pd.Series(df[df["Opposition_Team"]==most_common_team]["Player_ID"].tail(1).to_list()), ignore_index=True)
                    goalkeepers=goalkeepers.append(pd.Series(df[df["Opposition_Team"]!=most_common_team]["Player_ID"].to_list()), ignore_index=True)

                if len(df["Player_ID"].unique()) < 3:
                # Own goal by a field player (2) or own goal by goalkeeper (1) in either case no opposing striker was involved so we only categorize goalkeeper
                    own_goals=own_goals+1
                    goalkeepers=goalkeepers.append(pd.Series(df[df["Opposition_Team"]!=most_common_team]["Player_ID"].to_list()), ignore_index=True)
                 
                        
                if processing_train==True:
                    if train_stats["Action"].iloc[df["id"].to_list()].sort_values().to_list() != ['Assists', 'Goals', 'Goals conceded']:
                        #If it's not an Assist, a Goal and a Goal conceded (which is what we expect to have)
                        if train_stats["Action"].iloc[df["id"].to_list()].sort_values().to_list() != ['Goals', 'Goals conceded', 'Own goal']:
                            #If it's not an Own Goal 
                            print("Error, this is not expected: ", train_stats["Action"].iloc[df["id"].to_list()].sort_values().to_list())
                            print(df[cols])
                to_drop.extend([row, row+1, row+2]) #Later we'll drop the data that we've already processed
                row=row+3
                #row=relevant_actions_df[relevant_actions_df["play"]==relevant_actions_df["play"].iloc[row]].index.max()+1
    
    if len(to_drop) !=0:
        relevant_actions_df.drop(to_drop, inplace=True) #We drop the rows we already processed
    #if len(strikers) !=0:
    #    strikers=pd.DataFrame(strikers) #Convert to DataFrame
    #if len(goalkeepers) !=0:
    #    goalkeepers=pd.DataFrame(goalkeepers) #Convert to DataFrame
    print("Own goals:", own_goals)
    print("Final length: ",len(relevant_actions_df)) 
    relevant_actions_df.reset_index(drop=True, inplace=True) #Reset index
    reality_check() #We check the difference between the data we are building and the reality
    return()

def two_items_loop():
    ''' To run a two_items loop on both train and test datasets 
    Once we have the data with a known pattern we process it looking for plays of 2 moves'''  
    epoch_count=666 #We set a value that allows the loop to start
    while epoch_count > 2: #More than 2 epochs (1 for train + 1 for test) means the last pass still processed some plays, so let's try again to process more data
        epoch_count=0
        
        is_train=True #Now we run it for train
        epoch_count=epoch_count+two_items(relevant_actions_train, X_train, is_train) #We add to current epoch count the number of epoch run for train
        
        is_train=False #Now we run it for test
        epoch_count=epoch_count+two_items(relevant_actions_test, X_test, is_train) #We add to current epoch count the number of epoch run for test
        
    return()  

def clean_missclassified():
    ''' To clean misclassified players 
    We do not want to learn about the "Unknown" player because each time it can be a different player
    When a player is listed both as striker and as goalkeeper we keep only the role with more goals'''  
    global strikers
    global goalkeepers
    # Delete classification of players 

    if (len(strikers[strikers=="Unknown"])) !=0:
        #If we have Unknown strikers we remove them
        print ("Dropping",len(strikers[strikers=="Unknown"]), " Unknown strikers")
        strikers.drop(strikers[strikers=="Unknown"].index, inplace=True)
        print ("Now we have",len(strikers[strikers=="Unknown"]), " Unknown strikers")

    if (len(goalkeepers[goalkeepers=="Unknown"])) !=0:
        #If we have Unknown goalkeepers we remove them
        print ("Dropping",len(goalkeepers[goalkeepers=="Unknown"]), " Unknown goalkeepers")
        goalkeepers.drop(goalkeepers[goalkeepers=="Unknown"].index, inplace=True)
        print ("Now we have",len(goalkeepers[goalkeepers=="Unknown"]), " Unknown goalkeepers")
        
    if len(strikers[strikers.isin(goalkeepers)]) > 0:
        #If a player is a striker and a goalkeeper at the same time we check how many
        #goals the player scored and how many the player conceded, and then we know what to do:
        #scored > conceded -> it's a striker, delete it from goalkeepers
        #conceded > scored -> it's a goalkeeper, delete it from strikers
        #scored = conceded -> do nothing
        print("\nWe have some players considered strikers and goalkeepers, let's delete them \n")
        print("Strikers errors: \n", strikers[strikers.isin(goalkeepers)].count())
        print("Strikers errors: \n", strikers[strikers.isin(goalkeepers)].value_counts())
        stk_and_gk.extend(strikers[strikers.isin(goalkeepers)].values)
        gk_and_stk.extend(goalkeepers[goalkeepers.isin(strikers)].values)
        print("Goalkeepers errors: \n", goalkeepers[goalkeepers.isin(strikers)].count())
        print("Goalkeepers errors: \n", goalkeepers[goalkeepers.isin(strikers)].value_counts(),"\n")
        
        #Convert to list and append
        gk=goalkeepers[goalkeepers.isin(strikers)].to_list()
        stk=strikers[strikers.isin(goalkeepers)].to_list()
        players=pd.DataFrame(gk).append(stk)
        
        #We create a DataFrame to control goals scored vs goals conceded by player
        players=pd.DataFrame(pd.Series(gk).value_counts()).rename(columns={0: "goals"})
        players["player"]=players.index
        stk=pd.DataFrame(pd.Series(stk).value_counts()).rename(columns={0: "goals"})
        stk["player"]=stk.index
        #We merge the strikers and goalkeepers goals on the players DataFrame
        players=pd.merge(players, stk, on='player', suffixes=("_gk", "_stk"))

        #Drop each player from the role with fewer goals and display the players with goals conceded = goals scored
        goalkeepers.drop(goalkeepers[goalkeepers.isin(players[players["goals_gk"]<players["goals_stk"]]["player"].to_list())].index, inplace=True)
        strikers.drop(strikers[strikers.isin(players[players["goals_gk"]>players["goals_stk"]]["player"].to_list())].index, inplace=True)
        print("Equal goals scored than goals conceded: ", goalkeepers[goalkeepers.isin(players[players["goals_gk"]==players["goals_stk"]]["player"])])
        #strikers.drop(strikers[strikers.isin(players[(players["goals_gk"]>10) & (players["goals_stk"]<5) ]["player"])].index, inplace=True)
        

    print("\nStrikers errors: ", strikers[strikers.isin(goalkeepers)].count())
    print("Goalkeepers errors: ", goalkeepers[goalkeepers.isin(strikers)].count(), "\n")
    return()

def two_items(relevant_actions_df, X_test, is_train):
    ''' To process relevant_actions_df items in groups of 2 items
    First we consider top strikers and top goalkeepers the ones who participate in the most goals scored and goals conceded
    Note: Top goalkeeper doesn't mean the best goalkeeper but the one who is at the top of the goals conceded list
    Then we check if we have 2 teams involved as we expect
    If it's train we also check that we have a goal scored and a goal conceded
    Then we check for each play if the player is already a known top striker or top goalkeeper to identify which team scored the goal
    We delete the rows, update last item difference count and we print the length '''

    #Load global variables
    global process_true_strikers
    global process_possible_strikers
    global true_strikers
    global true_goalkeepers
    global strikers
    global goalkeepers

    to_drop=[] #We will be dropping later so let's make sure it starts empty
    epoch=0 # We will count epochs so let's start by 0
    current_len=len(relevant_actions_df)+1 #We add 1 to current len so the below while loop can start
    
    clean_missclassified()
        
    if is_train==True:
        print("*** TRAIN ***")
    else:
        print("*** TEST ***")
    while current_len > len(relevant_actions_df):
        epoch=epoch+1
        #We create top_strikers and top_goalkeepers by using data gathered from train initial data, from our process on train and from our process on test 

        top_strikers=true_strikers.copy()
        top_goalkeepers=true_goalkeepers.copy()                

        if two_items_main_current_limit !=0:
            if two_items_initial_current_limit !=0:
                #We should keep either initial or main current limit at 0, if both are different than 0 something went wrong
                print ("\n\n**************************** ERROR both initial and main current limits are different than 0 **************************** ")
            else:
                #two_items_main_current_limit isn't 0 but two_items_initial_current_limit is 0
                #We append strikers to top_strikers and goalkeepers to top_goalkeepers
                temp_df=pd.DataFrame()
                temp_df["Player_ID"]=strikers
                top_strikers=top_strikers.append(temp_df, ignore_index=True)
                temp_df=pd.DataFrame()
                temp_df["Player_ID"]=goalkeepers
                top_goalkeepers=top_goalkeepers.append(temp_df, ignore_index=True)
                del temp_df

        
        #We keep value_counts
        top_strikers=top_strikers["Player_ID"].value_counts()
        top_goalkeepers=top_goalkeepers["Player_ID"].value_counts()

        if two_items_initial_current_limit !=0:
            #If two_items_initial_current_limit is different than 0 let's keep only players above the level defined
            top_strikers=top_strikers[top_strikers>two_items_initial_current_limit]
            top_goalkeepers=top_goalkeepers[top_goalkeepers>two_items_initial_current_limit]

        else:
            #If we aren't applying two_items_initial_current_limit then it's time to apply two_items_main_current_limit
            #let's keep only players above the level defined
            top_strikers=top_strikers[top_strikers>two_items_main_current_limit]
            top_goalkeepers=top_goalkeepers[top_goalkeepers>two_items_main_current_limit]

        if process_true_strikers!=True:
            #If we aren't processing true_strikers (global variable) then let's set it to just "None"
            #because we don't want errors or to have to check in the rest of the code if top_strikers 
            #is empty so we add "None" and all the code will work but will not find player "None"
            top_strikers=true_strikers.copy()
            top_strikers=pd.DataFrame(top_strikers.iloc[0])
            top_strikers["Player_ID"]="None"
        
        print("Epoch: ", epoch, "   Len top strikers: ", len(top_strikers), "   Len top goalkeepers: ", len(top_goalkeepers))
        current_len=len(relevant_actions_df) #We keep current value here so later we can check if we've reduced pending relevant_actions
        row=relevant_actions_df.index.min() #We want to start from min and go increasing it
        while row < relevant_actions_df.index.max(): #We will be increasing row in the code until we reach max
            if str(relevant_actions_df.index.max()) =="nan":
                break #If max is nan we should have processed all the data so let's stop
            df="" #We make sure we start with empty DataFrame
            game=relevant_actions_df.loc[row]["Game_ID"] #We will work with the game to which current row belongs to
            if len(relevant_actions_df[relevant_actions_df["play"]==relevant_actions_df["play"].iloc[row]]) == 2:
                #This is a 2-item play (goal scored + goal conceded), not a 3-item play, so we process it here
                df=relevant_actions_df[relevant_actions_df["play"]==relevant_actions_df["play"].iloc[row]]
                done=0 #We did nothing so far

                if len(df["Opposition_Team"].value_counts()) !=2:
                    #If we are expecting a goal scored and a goal conceded and both actions belong to same team we have an error
                    print("          ERRRORRRRRRRRRRRRRRRRRRRRRRR, only 1 team")
                    print(df[cols])
                    if is_train==True:
                        print(train_stats.iloc[df["id"].to_list()])
                    else:
                        print(test_stats.iloc[df["id"].to_list()])
                if is_train==True:
                    if train_stats["Action"].iloc[df["id"].to_list()].sort_values().to_list() != ['Goals', 'Goals conceded']:
                        #We are expecting a goal scored and a goal conceded. For train we can check if that's true and if it's wrong display the error
                        print("          ERRRORRRRRRRRRRRRRRRRRRRRRRR, not Goals + Goals conceded")
                        print(train_stats["Action"].iloc[df["id"].to_list()].sort_values().to_list())
                        print(df[cols])

                if is_train==True:
                    #We do this only for train dataframe, we'll work with train variables 

                    if (df["Player_ID"].iloc[0] in list(top_goalkeepers.index)) & (done==0):
                        #If the first player is a top goalkeeper and so far we didn't do anything
                        done=1 #We don't want to duplicate goals, we only want to count 1 goal either by identifying the striker or the goalkeeper 
                        goalkeepers=goalkeepers.append(pd.Series(df["Player_ID"].iloc[0]), ignore_index=True) #We identified a goalkeeper, let's append it
                        strikers=strikers.append(pd.Series(df["Player_ID"].iloc[1]), ignore_index=True) #We identified a striker, let's append it
                        to_drop.extend([row, row+1]) #Later we'll drop the data that we've already processed

                        if  df["Opposition_Team"].iloc[0] == X_train.loc[X_train["Game_ID"]==game]["Home Team"].values:
                            #If it's on X_train home team let's add the goal there
                            X_train.loc[X_train["Game_ID"]==game, "Home"]=X_train.loc[X_train["Game_ID"]==game]["Home"]+1
                            train_last_match_lower_limit.loc[train_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        elif  df["Opposition_Team"].iloc[0] == X_train.loc[X_train["Game_ID"]==game]["Away Team"].values:
                            #If it's on X_train away team let's add the goal there
                            X_train.loc[X_train["Game_ID"]==game, "Away"]=X_train.loc[X_train["Game_ID"]==game]["Away"]+1
                            train_last_match_lower_limit.loc[train_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        else:
                            #If it's not home team or away team we have an error
                            print("Error")

                    elif (df["Player_ID"].iloc[1] in list(top_goalkeepers.index)) & (done==0) :
                        done=1 #We don't want to duplicate goals, we only want to count 1 goal either by identifying the striker or the goalkeeper 
                        goalkeepers=goalkeepers.append(pd.Series(df["Player_ID"].iloc[1]), ignore_index=True) #We identified a goalkeeper, let's append it
                        strikers=strikers.append(pd.Series(df["Player_ID"].iloc[0]), ignore_index=True) #We identified a striker, let's append it
                        to_drop.extend([row, row+1]) #Later we'll drop the data that we've already processed

                        if  df["Opposition_Team"].iloc[1] == X_train.loc[X_train["Game_ID"]==game]["Home Team"].values:
                            #If it's on X_train home team let's add the goal there
                            X_train.loc[X_train["Game_ID"]==game, "Home"]=X_train.loc[X_train["Game_ID"]==game]["Home"]+1
                            train_last_match_lower_limit.loc[train_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        elif  df["Opposition_Team"].iloc[1] == X_train.loc[X_train["Game_ID"]==game]["Away Team"].values:
                            #If it's on X_train away team let's add the goal there
                            X_train.loc[X_train["Game_ID"]==game, "Away"]=X_train.loc[X_train["Game_ID"]==game]["Away"]+1
                            train_last_match_lower_limit.loc[train_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        else:
                            #If it's not home team or away team we have an error
                            print("Error")

                    elif (df["Player_ID"].iloc[0] in list(top_strikers.index)) & (done==0):
                        #If the first player is a top strikers and so far we didn't do anything
                        done=1 #We don't want following code to identify a top goalkeeper and count one goal for the goal scored and another for the goal conceded
                        strikers=strikers.append(pd.Series(df["Player_ID"].iloc[0]), ignore_index=True) #We identified a striker, let's append it
                        goalkeepers=goalkeepers.append(pd.Series(df["Player_ID"].iloc[1]), ignore_index=True) #We identified a goalkeeper, let's append it
                        to_drop.extend([row, row+1]) #Later we'll drop the data that we've already processed

                        if  df["Opposition_Team"].iloc[1] == X_train.loc[X_train["Game_ID"]==game]["Home Team"].values:
                            #If it's on X_train home team let's add the goal there
                            X_train.loc[X_train["Game_ID"]==game, "Home"]=X_train.loc[X_train["Game_ID"]==game]["Home"]+1
                            train_last_match_lower_limit.loc[train_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit

                        elif  df["Opposition_Team"].iloc[1] == X_train.loc[X_train["Game_ID"]==game]["Away Team"].values:
                            #If it's on X_train away team let's add the goal there
                            X_train.loc[X_train["Game_ID"]==game, "Away"]=X_train.loc[X_train["Game_ID"]==game]["Away"]+1
                            train_last_match_lower_limit.loc[train_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        else:
                            #If it's not home team or away team we have an error
                            print("Error")

                    elif (df["Player_ID"].iloc[1] in list(top_strikers.index)) & (done==0) :
                        #If the second player is a top strikers and so far we didn't do anything
                        done=1 #We don't want following code to identify a top goalkeeper and count one goal for the goal scored and another for the goal conceded
                        strikers=strikers.append(pd.Series(df["Player_ID"].iloc[1]), ignore_index=True) #We identified a striker, let's append it
                        goalkeepers=goalkeepers.append(pd.Series(df["Player_ID"].iloc[0]), ignore_index=True) #We identified a goalkeeper, let's append it
                        to_drop.extend([row, row+1]) #Later we'll drop the data that we've already processed

                        if  df["Opposition_Team"].iloc[0] == X_train.loc[X_train["Game_ID"]==game]["Home Team"].values:
                            #If it's on X_train home team let's add the goal there
                            X_train.loc[X_train["Game_ID"]==game, "Home"]=X_train.loc[X_train["Game_ID"]==game]["Home"]+1
                            train_last_match_lower_limit.loc[train_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        elif  df["Opposition_Team"].iloc[0] == X_train.loc[X_train["Game_ID"]==game]["Away Team"].values:
                            #If it's on X_train away team let's add the goal there
                            X_train.loc[X_train["Game_ID"]==game, "Away"]=X_train.loc[X_train["Game_ID"]==game]["Away"]+1
                            train_last_match_lower_limit.loc[train_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        else:
                            #If it's not home team or away team we have an error
                            print("Error")
                else:
                    #We do this only for test dataframe, we'll work with test variables 

                    if (df["Player_ID"].iloc[0] in list(top_goalkeepers.index)) & (done==0):
                        done=1 #We don't want to duplicate goals, we only want to count 1 goal either by identifying the striker or the goalkeeper 
                        goalkeepers=goalkeepers.append(pd.Series(df["Player_ID"].iloc[0]), ignore_index=True) #We identified a goalkeeper, let's append it
                        strikers=strikers.append(pd.Series(df["Player_ID"].iloc[1]), ignore_index=True) #We identified a striker, let's append it
                        to_drop.extend([row, row+1]) #Later we'll drop the data that we've already processed

                        if  df["Opposition_Team"].iloc[0] == X_test.loc[X_test["Game_ID"]==game]["Home Team"].values:
                            #If it's on X_test home team let's add the goal there
                            X_test.loc[X_test["Game_ID"]==game, "Home"]=X_test.loc[X_test["Game_ID"]==game]["Home"]+1
                            test_last_match_lower_limit.loc[test_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        elif  df["Opposition_Team"].iloc[0] == X_test.loc[X_test["Game_ID"]==game]["Away Team"].values:
                            #If it's on X_test away team let's add the goal there
                            X_test.loc[X_test["Game_ID"]==game, "Away"]=X_test.loc[X_test["Game_ID"]==game]["Away"]+1
                            test_last_match_lower_limit.loc[test_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        else:
                            #If it's not home team or away team we have an error
                            print("Error")

                    elif (df["Player_ID"].iloc[1] in list(top_goalkeepers.index)) & (done==0) :
                        done=1 #We don't want to duplicate goals, we only want to count 1 goal either by identifying the striker or the goalkeeper 
                        goalkeepers=goalkeepers.append(pd.Series(df["Player_ID"].iloc[1]), ignore_index=True) #We identified a goalkeeper, let's append it
                        strikers=strikers.append(pd.Series(df["Player_ID"].iloc[0]), ignore_index=True) #We identified a striker, let's append it
                        to_drop.extend([row, row+1]) #Later we'll drop the data that we've already processed

                        if  df["Opposition_Team"].iloc[1] == X_test.loc[X_test["Game_ID"]==game]["Home Team"].values:
                            #If it's on X_test home team let's add the goal there
                            X_test.loc[X_test["Game_ID"]==game, "Home"]=X_test.loc[X_test["Game_ID"]==game]["Home"]+1
                            test_last_match_lower_limit.loc[test_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        elif  df["Opposition_Team"].iloc[1] == X_test.loc[X_test["Game_ID"]==game]["Away Team"].values:
                            #If it's on X_test away team let's add the goal there
                            X_test.loc[X_test["Game_ID"]==game, "Away"]=X_test.loc[X_test["Game_ID"]==game]["Away"]+1
                            test_last_match_lower_limit.loc[test_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        else:
                            #If it's not home team or away team we have an error
                            print("Error")

                    elif (df["Player_ID"].iloc[0] in list(top_strikers.index)) & (done==0):
                        done=1 #We don't want to duplicate goals, we only want to count 1 goal either by identifying the striker or the goalkeeper 
                        strikers=strikers.append(pd.Series(df["Player_ID"].iloc[0]), ignore_index=True) #We identified a striker, let's append it
                        goalkeepers=goalkeepers.append(pd.Series(df["Player_ID"].iloc[1]), ignore_index=True) #We identified a goalkeeper, let's append it
                        to_drop.extend([row, row+1]) #Later we'll drop the data that we've already processed

                        if  df["Opposition_Team"].iloc[1] == X_test.loc[X_test["Game_ID"]==game]["Home Team"].values:
                            #If it's on X_test home team let's add the goal there
                            X_test.loc[X_test["Game_ID"]==game, "Home"]=X_test.loc[X_test["Game_ID"]==game]["Home"]+1
                            test_last_match_lower_limit.loc[test_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        elif  df["Opposition_Team"].iloc[1] == X_test.loc[X_test["Game_ID"]==game]["Away Team"].values:
                            #If it's on X_test away team let's add the goal there
                            X_test.loc[X_test["Game_ID"]==game, "Away"]=X_test.loc[X_test["Game_ID"]==game]["Away"]+1
                            test_last_match_lower_limit.loc[test_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        else:
                            #If it's not home team or away team we have an error
                            print("Error")

                    elif (df["Player_ID"].iloc[1] in list(top_strikers.index)) & (done==0) :
                        done=1 #We don't want to duplicate goals, we only want to count 1 goal either by identifying the striker or the goalkeeper 
                        strikers=strikers.append(pd.Series(df["Player_ID"].iloc[1]), ignore_index=True) #We identified a striker, let's append it
                        goalkeepers=goalkeepers.append(pd.Series(df["Player_ID"].iloc[0]), ignore_index=True) #We identified a goalkeeper, let's append it
                        to_drop.extend([row, row+1]) #Later we'll drop the data that we've already processed

                        if  df["Opposition_Team"].iloc[0] == X_test.loc[X_test["Game_ID"]==game]["Home Team"].values:
                            #If it's on X_test home team let's add the goal there
                            X_test.loc[X_test["Game_ID"]==game, "Home"]=X_test.loc[X_test["Game_ID"]==game]["Home"]+1
                            test_last_match_lower_limit.loc[test_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        elif  df["Opposition_Team"].iloc[0] == X_test.loc[X_test["Game_ID"]==game]["Away Team"].values:
                            #If it's on X_test away team let's add the goal there
                            X_test.loc[X_test["Game_ID"]==game, "Away"]=X_test.loc[X_test["Game_ID"]==game]["Away"]+1
                            test_last_match_lower_limit.loc[test_last_match_lower_limit["Game_ID"]==game, "min"]=two_items_main_current_limit+two_items_initial_current_limit
                        else:
                            #If it's not home team or away team we have an error
                            print("Error")
            row=row+2 #Increase the counter to proceed the loop with next row
    print("Len relevant_actions_df: ", len(relevant_actions_df))
    relevant_actions_df.drop(to_drop, inplace=True) #We drop the rows we already processed
    to_drop=[] #We empty the variable
    relevant_actions_df.reset_index(drop=True, inplace=True) #Reset index
    print("Len relevant_actions_df: ", len(relevant_actions_df))
    return(epoch)

def check_NuSVC(cols):
    ''' To run a NuSVC model and show its performance   '''  
    global smallest_error #We store the smallest error so far
    global smallest_error_model #We store the model which generated the smallest error so far
    learner=NuSVC(probability=True, random_state=2022, kernel="linear")
    prediction_log_loss(learner, cols)
    print("\nError: ", smallest_error)
    return()

def adjust_prediction(predict_proba_df):
    ''' To adjust the results 
    This only works if there is at most 1 pending goal for each game
    We could also run this for train but it's only useful for test because we already processed all train
    data
    When we have a goal but we don't know who scored it we give a near 50% chance to each 
    of the possible outcomes (we don't give exactly 50% because then one prediction would be set to 0% and
    log_loss penalizes a 0% prediction very heavily if it turns out to be the true result)
    If home team wins 2 to 0 and there's an extra goal we don't care who scored it, home team still wins
    If teams are drawing and there's an extra goal it's either home win (near 50%) or away win 
    (near 50%) but draw is near 0%
    If home team wins 1 to 0 and there's an extra goal then it's either home win (near 50%) or draw 
    (near 50%) but away win is near 0% because away would need to score at least 2 goals to win
    If away team wins 0 to 1 and there's an extra goal then it's either away win (near 50%) or draw 
    (near 50%) but home win is near 0% because home would need to score at least 2 goals to win '''

    global relevant_actions_test
    global name    
    
    games=pd.Series(relevant_actions_test["Game_ID"].unique()).to_list()
    #If we have only one entry on relevant_actions_test it's probably an error during the processing so let's delete it
    for game in list(games): #We iterate over a copy because we remove items from games inside the loop
        if relevant_actions_test[relevant_actions_test["Game_ID"]==game]["id"].count() < 1:
            print("********** This should NOT happen, check: ", game)     
        elif relevant_actions_test[relevant_actions_test["Game_ID"]==game]["id"].count() == 1:
            print("Only has 1 occurrence, probably a processing error so we drop it: ", game)
            games.remove(game)
            relevant_actions_test.drop(relevant_actions_test[relevant_actions_test["Game_ID"]==game].index, inplace=True)
        elif relevant_actions_test[relevant_actions_test["Game_ID"]==game]["id"].count() > 2:
            print("Too much pending data, we'll have to go with 1/3 for game: ", game)
            predict_proba_df["Draw"].iloc[predict_proba_df[test["Game_ID"]==game].index]=1/3
            predict_proba_df["Away win"].iloc[predict_proba_df[test["Game_ID"]==game].index]=1/3
            predict_proba_df["Home Win"].iloc[predict_proba_df[test["Game_ID"]==game].index]=1/3
        elif relevant_actions_test[relevant_actions_test["Game_ID"]==game]["id"].count() == 2:
            #If -2 or 2 don't care
            #If -1 Away or draw
            #If 1, Home or draw 

            #We define near0 by the maximum of the minimum of the predictions to be consistent with
            #the rest of the predictions
            #near50 is the difference between 100% and the lowest prediction assigned and divided 
            #between 2 because we have 3 predictions (the lower one and 2 equally probable)
            near0=predict_proba_df.min(axis=1).max()
            near50=(1-predict_proba_df.min(axis=1).max())/2
            
            print("\nBefore: ")
            print(predict_proba_df[test["Game_ID"]==game])
            print("Goal difference:", X_test[test["Game_ID"]==game]["diff"].values)
            if X_test[test["Game_ID"]==game]["diff"].values==0: 
                #If now it's a draw and there's an extra goal it's either Away win or Home win
                predict_proba_df["Draw"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near0
                predict_proba_df["Away win"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near50
                predict_proba_df["Home Win"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near50
            elif X_test[test["Game_ID"]==game]["diff"].values==1: 
                #If now Home team wins by 1 one extra goal can only mean Home wins or draw 
                predict_proba_df["Draw"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near50
                predict_proba_df["Away win"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near0
                predict_proba_df["Home Win"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near50
            elif X_test[test["Game_ID"]==game]["diff"].values==-1: 
                #If now Away wins by 1 one extra goal can only mean Away wins or draw
                predict_proba_df["Draw"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near50
                predict_proba_df["Away win"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near50
                predict_proba_df["Home Win"].iloc[predict_proba_df[test["Game_ID"]==game].index]=near0
            else:
                #If it's not 0, 1 or -1, it won't affect match result, if for example the difference in goals is 2 one extra goal 
                #could mean a difference of 3 or 1 either way Home team still wins. 
                print("This goal should not change match result so we don't change the prediction")
            print("\nAfter: ")
            print(predict_proba_df[test["Game_ID"]==game])
    #predict_proba_df.to_csv("adjusted_"+name+".csv", index=False) #We save the adjusted file
    return(predict_proba_df)

def unclear_goals_confidence(algo, predict_proba_df, goal1, goal2, goal3):
    ''' To reduce confidence when decisions were based on a small number of goals 
    We use this function to avoid giving a very high % to a result if it is based on a small
    number of goals: 1, 2 or 3 
    We correct by replacing each prediction with ((prediction*parameter)+1/3)/(parameter+1) '''
    global name
    cols=["Away win", "Draw", "Home Win"]
    print("Goal parameters; ", goal1, goal2, goal3)
    print("\n1 goal")
    for game in test_last_match_lower_limit[test_last_match_lower_limit["min"]==1]["Game_ID"].to_list():
        print(game)
        print("\nBefore")
        print(predict_proba_df[predict_proba_df["Game_ID"]==game][cols])
        for col in cols:
            predict_proba_df[col].iloc[predict_proba_df[predict_proba_df["Game_ID"]==game].index]=predict_proba_df[col].iloc[predict_proba_df[predict_proba_df["Game_ID"]==game].index].mul(goal1).add(1/3).div(goal1+1) #We add a proportion of 1/3 to the mix to reduce confidence
        print("After")
        print(predict_proba_df[predict_proba_df["Game_ID"]==game][cols])
    
    print("\n2 goals")
    for game in test_last_match_lower_limit[test_last_match_lower_limit["min"]==2]["Game_ID"].to_list():
        print(game)
        print("\nBefore")
        print(predict_proba_df[predict_proba_df["Game_ID"]==game][cols])
        for col in cols:
            predict_proba_df[col].iloc[predict_proba_df[predict_proba_df["Game_ID"]==game].index]=predict_proba_df[col].iloc[predict_proba_df[predict_proba_df["Game_ID"]==game].index].mul(goal2).add(1/3).div(goal2+1)
        print("After")
        print(predict_proba_df[predict_proba_df["Game_ID"]==game][cols])
    
    print("\n3 goals")
    for game in test_last_match_lower_limit[test_last_match_lower_limit["min"]==3]["Game_ID"].to_list():
        print(game)
        print("\nBefore")
        print(predict_proba_df[predict_proba_df["Game_ID"]==game][cols])
        for col in cols:
            predict_proba_df[col].iloc[predict_proba_df[predict_proba_df["Game_ID"]==game].index]=predict_proba_df[col].iloc[predict_proba_df[predict_proba_df["Game_ID"]==game].index].mul(goal3).add(1/3).div(goal3+1)
        print("After")
        print(predict_proba_df[predict_proba_df["Game_ID"]==game][cols])
    
    #name2=name+"_adj_"+algo+str(goal1)+"_"+str(goal2)+"_"+str(goal3)
    #predict_proba_df.to_csv(name2, index=False) 
    predict_proba_df.to_csv("Final_submit2_NuSVC.csv", index=False) #Time to save the prediction and wish for the best 
    return()
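
As a side note, the correction used in unclear_goals_confidence can be checked on a toy row. This is only a minimal sketch: the helper name shrink_towards_uniform and the demo values are made up for illustration and are not part of the pipeline. It shows that mixing each prediction with 1/3 keeps the three probabilities summing to 1, and that a smaller parameter pulls the prediction closer to 1/3.

```python
import pandas as pd

def shrink_towards_uniform(proba, weight):
    """Hypothetical helper: blend each probability with 1/3,
    i.e. (p*weight + 1/3) / (weight + 1), the same formula as in the docstring."""
    return (proba * weight + 1 / 3) / (weight + 1)

cols = ["Away win", "Draw", "Home Win"]
# Made-up, very confident prediction for a single game
demo = pd.DataFrame([[0.05, 0.05, 0.90]], columns=cols)

shrunk = shrink_towards_uniform(demo[cols], weight=4)
print(shrunk.round(3))      # Home Win drops from 0.90 to about 0.787
print(shrunk.sum(axis=1))   # The row still sums to 1
```

The row keeps summing to 1 because the three mixed-in terms add up to 3*(1/3)=1, so the total is (weight*1+1)/(weight+1).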

In [None]:
# Load data and set pandas display options and set last_match_lower_limit

#We have to run different code if we use old pandas version
#Additionally in my set-up different environments have different paths
#We define last_match_lower_limit where we will store how low did we have to go for each Game, e.g.
#did we decide the match with strikers over 30 goals or only over 1 goal? (This way we have the
#possibility of reducing % to give less confident predictions)

old_pandas=False
if pd.__version__=="1.0.1":
    print("Using old pandas")
    old_pandas=True

if old_pandas==True:
    train=pd.read_csv("Train.csv")
    test=pd.read_csv("Test.csv")
    train_stats=pd.read_csv("train_game_statistics.csv")
    test_stats=pd.read_csv("test_game_statistics.csv")
else:
    train=pd.read_csv("/content/gdrive/MyDrive/Competition #1/2. Data/2. Processed/Train.csv")
    test=pd.read_csv("/content/gdrive/MyDrive/Competition #1/2. Data/2. Processed/Test.csv")
    train_stats=pd.read_csv("/content/gdrive/MyDrive/Competition #1/2. Data/2. Processed/train_game_statistics.csv")
    test_stats=pd.read_csv("/content/gdrive/MyDrive/Competition #1/2. Data/2. Processed/test_game_statistics.csv")

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

train_last_match_lower_limit=pd.DataFrame(train["Game_ID"]).copy()
test_last_match_lower_limit=pd.DataFrame(test["Game_ID"]).copy()
train_last_match_lower_limit["min"]=100
test_last_match_lower_limit["min"]=100



In [None]:
# Get a list of the true_goalkeepers and true_strikers

#This will be handy later
#Goalkeepers will contain the players that we know from train data that are goalkeepers
#Strikers will contain both the players that scored goals but also the collaborators assisting them (passing the ball to the striker to score)

true_goalkeepers=pd.DataFrame(train_stats[train_stats["Action"]=="Goals conceded"]["Player_ID"])
true_strikers=pd.DataFrame(train_stats[train_stats["Action"].isin(["Goals", "Assists"])]["Player_ID"])

In [None]:
# We check if any of the games has duplicated data

#If duplicated data is found it is corrected; we then check that final sum = initial sum / 2 and display the amount of error 
#At this time we only run it for train because test doesn't have duplicated data

for game in train_stats["Game_ID"].unique():
    df=train_stats[train_stats["Game_ID"]==game]["id"][0:2]
    if df.head(1).values-df.tail(1).values !=-1:  #If the id difference isn't -1 (as it should be)
        print("Duplicated: ", game)
        initial_sum=train_stats[train_stats["Game_ID"]==game]["X"].sum()
        to_drop=[]
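        #We use the current game here: with a hard-coded Game_ID the other games keep their duplicates 
        #and report an error close to initial_sum/2
        game_index=train_stats[train_stats["Game_ID"]==game].index
        for idx in range(game_index[0],game_index[-1],2): #Drop every other row, starting from the first one
            to_drop.append(idx) 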
        train_stats.drop(to_drop, inplace=True)
        print("Amount of error: ", int(train_stats[train_stats["Game_ID"]==game]["X"].sum()-(initial_sum/2)))
train_stats.reset_index(drop=True, inplace=True)

Duplicated:  ID_ZL8511XW
Amount of error:  189364
Duplicated:  ID_XWV89SWE
Amount of error:  204088
Duplicated:  ID_VUN0XPN1
Amount of error:  198193
Duplicated:  ID_VU4MWMEQ
Amount of error:  179556
Duplicated:  ID_V891OYVP
Amount of error:  194392
Duplicated:  ID_U3SD7K1V
Amount of error:  181469
Duplicated:  ID_SSBUAMO7
Amount of error:  186997
Duplicated:  ID_LTKK72S4
Amount of error:  174262
Duplicated:  ID_KST7PRMU
Amount of error:  168681
Duplicated:  ID_KFK1U0BS
Amount of error:  187467
Duplicated:  ID_KDY6XYY0
Amount of error:  191555
Duplicated:  ID_HPYKEW7R
Amount of error:  51
Duplicated:  ID_DKNFRR94
Amount of error:  192654
Duplicated:  ID_7HNMFC11
Amount of error:  205907
Duplicated:  ID_0O3OGCB9
Amount of error:  213472
Duplicated:  ID_0A2WEVY3
Amount of error:  193484
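
An alternative to dropping every other row would be to let pandas find the repeated rows. The following is only a sketch on made-up data, and it assumes that a duplicated row is identical to its twin in every column except id, which may not hold for the real train_stats (two genuinely identical actions would also be collapsed).

```python
import pandas as pd

# Toy stand-in for train_stats: game G2 has every row repeated, only "id" differs
stats = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "Game_ID": ["G1", "G1", "G2", "G2", "G2", "G2"],
    "Player_ID": ["p1", "p2", "p3", "p3", "p4", "p4"],
    "X": [10, 20, 30, 30, 40, 40],
})

# Compare rows on everything except the running id (assumption, see above)
content_cols = [c for c in stats.columns if c != "id"]

# keep="last" mirrors the loop above, which drops the first row of each pair
deduped = stats.drop_duplicates(subset=content_cols, keep="last").reset_index(drop=True)
print(deduped)
```

With this approach the check final sum = initial sum / 2 can still be run per game afterwards.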


In [None]:
# Keep caution information

#We want to keep track of the rows with missing data so we can compare them versus all the data
caution=test_stats[(test_stats["Passes"].isna()) | (test_stats["Half"].isna())].index
print(len(caution))

92


In [None]:
# Adjust test data

#When Manager is not informed we fill in "Shy_Manager"

test_stats["Manager"]=test_stats["Manager"].fillna("Shy_Manager")

In [None]:
# Create reality DataFrame 

#We create reality DataFrame with the Home and Away goals that have been scored
#Later we will use this information as a reference to check how good our predictions are, not in terms of who won but in terms
#of how many goals were scored

cols=['Game_ID', 'Home Team', 'Away Team']
reality=train[cols].copy() #We copy so that adding columns below doesn't raise SettingWithCopyWarning
reality["Home"]=0
reality["Away"]=0
goals_df=train_stats[train_stats["Action"]=="Goals conceded"]
for game in train["Game_ID"]:
    reality.loc[reality["Game_ID"]==game, "Home"]=reality.loc[reality["Game_ID"]==game]["Home"]+goals_df[ (goals_df["Game_ID"]==game) & (goals_df["Team"]==train[train["Game_ID"]==game]["Away Team"].values[0]) ]["Game_ID"].count()
    reality.loc[reality["Game_ID"]==game, "Away"]=reality.loc[reality["Game_ID"]==game]["Away"]+goals_df[ (goals_df["Game_ID"]==game) & (goals_df["Team"]==train[train["Game_ID"]==game]["Home Team"].values[0]) ]["Game_ID"].count()
del goals_df
reality["target"]=train["Score"]
#reality
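
The same Home and Away counts could probably be obtained without looping over every game. This is a minimal sketch on toy data: the helper goals_for is hypothetical, and it relies on the same convention as the loop above, where a "Goals conceded" row with Team equal to the away team counts as a goal for the home team.

```python
import pandas as pd

# Toy stand-ins for train and train_stats (values are made up)
train = pd.DataFrame({
    "Game_ID": ["G1", "G2"],
    "Home Team": ["A", "C"],
    "Away Team": ["B", "D"],
})
train_stats = pd.DataFrame({
    "Game_ID": ["G1", "G1", "G1", "G1", "G2"],
    "Team":    ["B", "B", "A", "A", "C"],
    "Action":  ["Goals conceded", "Goals conceded", "Goals conceded", "Goals", "Goals conceded"],
})

# Number of goals conceded by each team in each game
conceded = (
    train_stats[train_stats["Action"] == "Goals conceded"]
    .groupby(["Game_ID", "Team"]).size()
    .rename("conceded").reset_index()
)

def goals_for(side_conceding):
    """Hypothetical helper: goals conceded by the given side, one value per row of train."""
    merged = train.merge(conceded, how="left",
                         left_on=["Game_ID", side_conceding],
                         right_on=["Game_ID", "Team"])
    return merged["conceded"].fillna(0).astype(int).to_numpy()

reality = train.copy()
reality["Home"] = goals_for("Away Team")  # Home goals = goals conceded by the away team
reality["Away"] = goals_for("Home Team")  # Away goals = goals conceded by the home team
print(reality)
```

The left merge keeps one row per game in the original order, so the counts line up with train.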

In [None]:
# Let's keep some data that can be useful in the future 
#We'll be deleting same columns from X_train and X_test in the following code block

cols=["Game_ID" , "Opposition_Team", "Player_ID"]
keep_train[cols]=train_stats[cols]
keep_test[cols]=test_stats[cols]

In [None]:
# Convert Teams, players and managers into their performance indicators

#We want to convert the teams, the players and the managers into their performance indicators 
#so it has meaning for the AI, we don't want to use a meaningless number, we want their performance

team_performance=pd.DataFrame(index=range(1))
score=pd.DataFrame(train[["Home Team", "Away Team", "Score", "Game_ID"]])
score.loc[score["Score"]=="Draw", "Score"]=0
score.loc[score["Score"]=="Home Win", "Score"]=1
score.loc[score["Score"]=="Away win", "Score"]=-1
score.head(2)
for team in score["Home Team"].unique():
    team_performance[team]=score[score["Home Team"]==team]["Score"].sum()+score[score["Away Team"]==team]["Score"].sum()*-1

cols=["Team", "Opposition_Team", "Player_ID", "Manager"]
for col in cols:
    train_stats[col+"_Performance"]=train_stats[col].copy()
    test_stats[col+"_Performance"]=test_stats[col].copy()

#Convert team into team performance
print("Converting team into team performance")
for team in team_performance.columns:
    train_stats.loc[train_stats["Team"]==team, "Team_Performance"]=team_performance[team].iloc[0]
    train_stats.loc[train_stats["Opposition_Team"]==team, "Opposition_Team_Performance"]=team_performance[team].iloc[0]
    test_stats.loc[test_stats["Team"]==team, "Team_Performance"]=team_performance[team].iloc[0]
    test_stats.loc[test_stats["Opposition_Team"]==team, "Opposition_Team_Performance"]=team_performance[team].iloc[0]
team_performance

train_stats["Score"]=0
print("Computing scores")
for game in train["Game_ID"].unique():
    train_stats.loc[train_stats["Game_ID"]==game, "Score"]=score[score["Game_ID"]==game]["Score"].iloc[0]
del score

print("Converting player into player performance")
for player in train_stats["Player_ID"].unique():
    player_performance=train_stats[train_stats["Player_ID"]==player]["Score"].mean()
    train_stats.loc[train_stats["Player_ID"]==player, "Player_ID_Performance"]=player_performance
    test_stats.loc[test_stats["Player_ID"]==player, "Player_ID_Performance"]=player_performance

print("Converting manager into manager performance")
for manager in train_stats["Manager"].unique():
    manager_performance=train_stats[train_stats["Manager"]==manager]["Score"].mean()
    train_stats.loc[train_stats["Manager"]==manager, "Manager_Performance"]=manager_performance
    test_stats.loc[test_stats["Manager"]==manager, "Manager_Performance"]=manager_performance

for col in cols:
    col=col+"_Performance"
    train_stats[col]=pd.to_numeric(train_stats[col], errors ="coerce").fillna(0).astype("float")
    test_stats[col]=pd.to_numeric(test_stats[col], errors ="coerce").fillna(0).astype("float")    


Converting team into team performance
Computing scores
Converting player into player performance
Converting manager into manager performance
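
The per-player and per-manager loops above can also be written with `groupby` and `map`. The following is only a minimal sketch on toy data, not the notebook's code: `mean_score_encoding`, `toy_train` and `toy_test` are hypothetical names, and the fallback to 0 for entities never seen in train mirrors the `fillna(0)` used above.

```python
import pandas as pd

def mean_score_encoding(train_df, test_df, key, target="Score"):
    """Return (train_encoded, test_encoded): the mean of `target` per `key` value.

    The means are learned on train_df only; keys that never appear in train
    are encoded as 0.0, mirroring the fillna(0) used in the notebook.
    """
    means = train_df.groupby(key)[target].mean()
    train_encoded = train_df[key].map(means).astype(float)
    test_encoded = test_df[key].map(means).fillna(0.0).astype(float)
    return train_encoded, test_encoded

# Hypothetical toy data: Score is 1 (home win), 0 (draw) or -1 (away win).
toy_train = pd.DataFrame({
    "Player_ID": ["A", "A", "B", "B", "B"],
    "Score": [1, 0, -1, -1, 1],
})
toy_test = pd.DataFrame({"Player_ID": ["A", "B", "C"]})

train_perf, test_perf = mean_score_encoding(toy_train, toy_test, "Player_ID")
toy_train["Player_ID_Performance"] = train_perf
toy_test["Player_ID_Performance"] = test_perf
print(toy_test)
```

The same helper could be called with `key="Manager"`. On large frames a single `groupby` is usually much faster than one boolean filter per player.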


In [None]:
# Prepare X_train and X_test

#The target is whether the action is "Goals"; train_stats and test_stats provide the features
#Columns with more than 90% of missing values in train are dropped
#Season is dropped because train only has seasons 1 and 2 and test only has season 3, so the model could not use it on test
#Game_ID is dropped because we can easily get it back from keep_train and keep_test if we want to
#Only columns that exist in both train and test are kept

y_train=(train_stats["Action"]=="Goals") 
X_train=train_stats
X_test=test_stats
X_train.drop(X_train.columns[X_train.isna().sum()>len(train_stats)*0.9], axis=1, inplace=True)
X_train.drop("Season", axis=1, inplace=True)
X_train.drop("Game_ID", axis=1, inplace=True) #The column selection below also drops Game_ID from test
cols=train_stats.columns[train_stats.columns.isin(test_stats.columns)].tolist()
test_stats=test_stats[cols]
cols.append("Action")
train_stats=train_stats[cols]
cols=X_train.columns[X_train.columns.isin(X_test.columns)].tolist()
X_train=X_train[cols]
X_test=X_test[cols]
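
The cell above relies on `X_train` and `train_stats` being the same object, so the in-place drops affect both. A sketch of the same selection that returns copies instead is shown below. It assumes the same rules as above (more than 90% missing, `Season`, `Game_ID`, shared columns only); `align_features` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def align_features(train_df, test_df, max_missing=0.9, drop=("Season", "Game_ID")):
    """Return copies of train/test restricted to usable, shared columns."""
    # Share of missing values per train column.
    missing_share = train_df.isna().mean()
    keep = missing_share[missing_share <= max_missing].index
    # Remove the explicitly unwanted columns, then keep only columns present in test.
    keep = [c for c in keep if c not in drop and c in test_df.columns]
    return train_df[keep].copy(), test_df[keep].copy()

toy_train = pd.DataFrame({
    "Game_ID": ["g1", "g2", "g3", "g4"],
    "Season": [1, 1, 2, 2],
    "X": [10.0, 20.0, 30.0, 40.0],
    "Mostly_missing": [np.nan, np.nan, np.nan, np.nan],
    "Action": ["Pass", "Goals", "Pass", "Pass"],
})
toy_test = pd.DataFrame({
    "Game_ID": ["g5"], "Season": [3], "X": [15.0], "Mostly_missing": [1.0],
})

X_tr, X_te = align_features(toy_train, toy_test)
print(list(X_tr.columns), list(X_te.columns))
```

Because the function works on copies, the original frames keep `Game_ID` and `Action`, which makes it harmless to rerun the cell.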

In [None]:
#Game_Performance_diff is the team's performance minus the opposition team's performance, which summarises 2 features in 1
X_train["Game_Performance_diff"]=X_train["Team_Performance"]-X_train["Opposition_Team_Performance"]
X_test["Game_Performance_diff"]=X_test["Team_Performance"]-X_test["Opposition_Team_Performance"]

In [None]:
#Let's have a look at the columns
X_train.columns

Index(['Player_ID', 'id', 'X', 'Y', 'Team', 'Half', 'Manager',
       'Opposition_Team', 'Shots', 'SoT', 'Accurate passes',
       'Inaccurate passes', 'Passes', 'Start_minutes', 'End_minutes',
       'Team_Performance', 'Opposition_Team_Performance',
       'Player_ID_Performance', 'Manager_Performance',
       'Game_Performance_diff'],
      dtype='object')

In [None]:
# Define features to be compressed

#For each feature we compute the ratio between its mean on goal actions and its mean on all other
#actions. With cut=1.05, features whose ratio is above 1.05 are kept as is and the rest are compressed

cut=1.05
#Absolute value of the ratio between the mean on goal rows and the mean on non-goal rows
mean_diff_goal_vs_no_goal=abs(X_train[y_train==True].describe().loc["mean"]/X_train[y_train==False].describe().loc["mean"])
#Features below the cut are compressed, features above the cut are kept as is
cols_to_compress=mean_diff_goal_vs_no_goal[mean_diff_goal_vs_no_goal<cut].index
do_not_compress=mean_diff_goal_vs_no_goal[mean_diff_goal_vs_no_goal>cut].index
print("Do not compress: \n", mean_diff_goal_vs_no_goal[do_not_compress].sort_values(ascending=False))
print("\n\nCompress:\n", mean_diff_goal_vs_no_goal[cols_to_compress].sort_values(ascending=False))

Do not compress: 
 Game_Performance_diff          5.997882
Opposition_Team_Performance    3.149519
Team_Performance               2.908389
X                              1.753342
End_minutes                    1.087416
Start_minutes                  1.077649
id                             1.069287
Name: mean, dtype: float64


Compress:
 Player_ID_Performance    1.041386
Y                        1.006245
Shots                    0.000000
SoT                      0.000000
Accurate passes          0.000000
Inaccurate passes        0.000000
Passes                   0.000000
Name: mean, dtype: float64
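
The ratio used for the cut can be reproduced on a small example. This is a sketch with made-up numbers; `mean_ratio` and the toy frame are hypothetical and only the column names are borrowed from the notebook.

```python
import pandas as pd

def mean_ratio(features, is_goal):
    """Absolute ratio of each feature's mean on goal rows to its mean on other rows."""
    mask = is_goal.to_numpy()
    return (features[mask].mean() / features[~mask].mean()).abs()

toy = pd.DataFrame({
    "X":     [90.0, 80.0, 40.0, 60.0],
    "Y":     [50.0, 52.0, 49.0, 53.0],
    "Shots": [0.0, 0.0, 1.0, 3.0],
})
is_goal = pd.Series([True, True, False, False])

cut = 1.05
ratio = mean_ratio(toy, is_goal)
do_not_compress = ratio[ratio > cut].index.tolist()
cols_to_compress = ratio[ratio <= cut].index.tolist()
print(ratio)
print(do_not_compress, cols_to_compress)
```

Note that a ratio of 0, as for `Shots` here and in the output above, lands in the compressed group even though the two means differ, because the rule only looks at ratios above the cut.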


In [None]:
# Compress the features

#We will use Keras to compress the features
#We build a neural network whose target equals its input (an autoencoder). Once it is trained we run
#"predict" on train and test to read the values of the middle layer, which holds the compressed features

#We only work with the features selected for compression above, i.e. the ones whose mean differs
#least between goal and no_goal actions
X_train_keras=X_train[cols_to_compress]
X_test_keras=X_test[cols_to_compress]

#We define a keras encoder/decoder model 
input_layer = Input(shape=(X_train_keras.shape[1],))
## encoding part
encoded = Dense(4, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(1, activation='relu')(encoded)
## decoding part
decoded = Dense(1, activation='tanh')(encoded)
decoded = Dense(4, activation='tanh')(decoded)
## output layer
output_layer = Dense(X_train_keras.shape[1], activation='relu')(decoded)
autoencoder = Model(input_layer, output_layer)

#Let's compile the model
autoencoder.compile(optimizer="adadelta", loss="mse")

#Let's run the model with input=output
autoencoder.fit(X_train_keras, X_train_keras, batch_size = 64, epochs = 1,validation_split = 0.20);

#hidden_representation reuses the first 3 layers of the autoencoder (the input layer and the two
#encoding layers) with their learnt weights. The last of them has a single neuron, so all the
#features to compress end up in a single feature
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])

#Let's run the "predict" for train and test to convert all the features that we want to compress into
#a single feature
X_train["compressed_feas"]=hidden_representation.predict(X_train[cols_to_compress])
X_test["compressed_feas"]=hidden_representation.predict(X_test[cols_to_compress])
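
As a sanity baseline for the compressed feature, the same "many features into one" idea can be tried with plain PCA. This is a swapped-in technique, not the Keras model above: at its optimum, a purely linear autoencoder with a one-unit bottleneck and MSE loss spans the same direction as the first principal component, whereas the network above is non-linear. The sketch below uses NumPy only, and all names and data are hypothetical.

```python
import numpy as np

def first_component_feature(train, test):
    """Project train/test onto the first principal component fitted on train."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions of the centred training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    direction = vt[0]
    return (train - mean) @ direction, (test - mean) @ direction, direction, mean

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
# Hypothetical data: 3 features driven by a single latent factor plus small noise.
data = latent @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))
train, test = data[:150], data[150:]

train_feat, test_feat, direction, mean = first_component_feature(train, test)
# Rebuild test from the single compressed feature to see how much was lost.
reconstruction = np.outer(test_feat, direction) + mean
relative_error = np.mean((test - reconstruction) ** 2) / np.mean((test - mean) ** 2)
print(train_feat.shape, test_feat.shape, float(relative_error))
```

If the compressed feature from the autoencoder is much less informative than this baseline, that may hint that the network needs more than one epoch.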



In [None]:
# Split data 50% train and 50% val and print the shapes

#The list of features to use is the list of features not compressed plus "compressed_feas",
#which holds the features that we've compressed using Keras
do_not_compress=do_not_compress.to_list()
do_not_compress.append("compressed_feas")

#Let's keep only the features that we want to use
X_train=X_train[do_not_compress]
X_test=X_test[do_not_compress]

#Let's cut the data
cut=int(len(X_train)*0.5)
X_val=X_train[cut:]
X_train=X_train[:cut]
y_val=y_train[cut:]
y_train=y_train[:cut]
print ("X_train shape: ", X_train.shape, "     X_val shape: ", X_val.shape)

#Fill any missing data in test
X_test=X_test.fillna(0)

X_train shape:  (809259, 8)      X_val shape:  (809259, 8)


In [None]:
# Prediction

#First we fit on train and predict on val
#Then we fit on val and predict on train
#Finally we fit on train+val and predict on test
#We print the roc_auc score
#We create the check DataFrame to compare prediction vs target and count errors
#We display the value counts of the test predictions

print("Val1 = 50% of val, Val2 = the other 50%, Val3 = 100% of val (Val1 + Val2)\n")
learner=LinearDiscriminantAnalysis()
learner.fit(X_train, y_train)
predict_val1=learner.predict(X_val)
predict_val1=pd.DataFrame(predict_val1)
check=predict_val1.copy()
check["real"]=y_val.reset_index(drop=True) #We have to reset index because val data keeps original index
print("Val1 roc_auc score: ", roc_auc_score(y_val, predict_val1), "       Val1 errors: ", check[check[0]!=check["real"]][0].count())
predict_val1=predict_val1.set_index(y_val.index) #We set back original index because later we will append the data

learner.fit(X_val, y_val)
predict_val2=learner.predict(X_train)
predict_val2=pd.DataFrame(predict_val2)
check=predict_val2.copy()
check["real"]=y_train
print("Val2 roc_auc score: ", roc_auc_score(y_train, predict_val2), "       Val2 errors: ", check[check[0]!=check["real"]][0].count())

predict_val=pd.concat([predict_val2, predict_val1])

check=predict_val.copy()
check["real"]=pd.concat([y_train, y_val])
print("Val3 roc_auc score: ", roc_auc_score(pd.concat([y_train, y_val]), predict_val), "       Val3 errors: ", check[check[0]!=check["real"]][0].count())

learner.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val]))

predict_test=learner.predict(X_test)
predict_test=pd.DataFrame(predict_test)
print("\nTest prediction:")
print(predict_test[0].value_counts())

Val1 = 50% of val, Val2 = the other 50%, Val3 = 100% of val (Val1 + Val2)

Val1 roc_auc score:  0.9995209004845397        Val1 errors:  775
Val2 roc_auc score:  0.9984296106352429        Val2 errors:  783
Val3 roc_auc score:  0.9989692832847046        Val3 errors:  1558

Test prediction:
False    800837
True       1261
Name: 0, dtype: int64
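
The two-way scheme above (fit on one half, predict the other, then swap) is what scikit-learn calls out-of-fold prediction. Below is a minimal sketch with `cross_val_predict` on synthetic data; the data, sizes and threshold are assumptions. It also scores ROC AUC on probabilities rather than on hard labels, which is usually more informative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
n_rows = 400
# Hypothetical imbalanced target: roughly 20% of the rows are "goals".
y = rng.random(n_rows) < 0.2
X = rng.normal(size=(n_rows, 2)) + 3.0 * y[:, None]

# Two unshuffled folds: each half is predicted by a model fitted on the other half.
folds = KFold(n_splits=2, shuffle=False)
proba = cross_val_predict(
    LinearDiscriminantAnalysis(), X, y, cv=folds, method="predict_proba"
)
goal_proba = proba[:, 1]          # column 1 is the probability of the True class
labels = goal_proba > 0.5

auc = roc_auc_score(y, goal_proba)
errors = int((labels != y).sum())
print("out-of-fold roc_auc:", auc)
print("out-of-fold errors:", errors)
```

The predictions come back in the original row order, so no index bookkeeping is needed before joining them to the features.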


In [None]:
# Create relevant actions dataframes

#We select the columns that we want
#We append val to train
#We retrieve data from the keep DataFrames, add the predictions, keep only rows predicted True and print the lengths

cols=["Game_ID", "Opposition_Team", "Player_ID"]
X_train=pd.concat([X_train, X_val])
X_train[cols]=keep_train[cols]
X_test[cols]=keep_test[cols]
X_train["prediction"]=predict_val[0]
X_test["prediction"]=predict_test[0]
cols.append("prediction") #We add it now because before it didn't exist
cols.append("Start_minutes") #This will help to deal with the missing data because we end up with some rows not sorted
relevant_actions_train=X_train[X_train["prediction"]==True][cols].copy()
relevant_actions_test=X_test[X_test["prediction"]==True][cols].copy()
print("Train relevant actions length: ", len(relevant_actions_train),"      Test relevant actions length: ", len(relevant_actions_test) )

Train relevant actions length:  2466       Test relevant actions length:  1261
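
The cell above works because pandas aligns on index labels when a column is assigned, which is why the prediction cell restored the original index of the val predictions. A small sketch of that behaviour follows; the frames and values are hypothetical.

```python
import pandas as pd

features = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0]})
first_half, second_half = features[:2], features[2:]

# Predictions come back from a model with a fresh 0..n-1 index ...
pred_on_second = pd.DataFrame([True, False])
# ... so the original labels are restored before concatenating.
pred_on_second = pred_on_second.set_index(second_half.index)
pred_on_first = pd.DataFrame([False, False])

predictions = pd.concat([pred_on_first, pred_on_second])
combined = pd.concat([first_half, second_half])
combined["prediction"] = predictions[0]  # aligned on index labels, not on position
relevant = combined[combined["prediction"]]
print(relevant)
```

Without the `set_index` step both prediction frames would carry the labels 0 and 1, and the assignment would no longer map each prediction to its own row.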


In [None]:
# Reload initial data, create features, relevant_actions DataFrames and caution feature

#Once we have the relevant_actions DataFrames we can reload the initial data and prepare the target
#We also initialize the "Away" and "Home" features, which will hold the number of goals scored (key to know who won the match)

if old_pandas==True:
    X_train=pd.read_csv("Train.csv")
    X_test=pd.read_csv("Test.csv")
else:
    X_train=pd.read_csv("/content/gdrive/MyDrive/Competition #1/2. Data/2. Processed/Train.csv")
    X_test=pd.read_csv("/content/gdrive/MyDrive/Competition #1/2. Data/2. Processed/Test.csv")

y_train=X_train["Score"]
X_train.drop("Score", axis=1, inplace=True)

X_train["Away"]=0
X_train["Home"]=0
X_test["Away"]=0
X_test["Home"]=0

relevant_actions_train=prepare_relevant_actions_df(relevant_actions_train)
relevant_actions_test=prepare_relevant_actions_df(relevant_actions_test)

#We keep track of the caution games (games with missing data) and display the counts to ensure we
#still have 1146+92=1238
relevant_actions_train["caution"]=0
relevant_actions_test["caution"]=0
relevant_actions_test.loc[relevant_actions_test["id"].isin(caution), "caution"]=1
print(relevant_actions_test["caution"].value_counts(), "\n", len(relevant_actions_test))

0    1169
1      92
Name: caution, dtype: int64 
 1261


In [None]:
# Let's mark Train plays

#We mark the plays keeping track of the errors
#Then we fix the errors and rerun the mark plays process to ensure we don't have errors

errors=mark_plays(relevant_actions_train)
fix_mark_plays_errors(errors, relevant_actions_train)
errors=mark_plays(relevant_actions_train)

This is not correctly cut, no problem, we'll adjust it later. Issue on :  409
This is not correctly cut, no problem, we'll adjust it later. Issue on :  410
This is not correctly cut, no problem, we'll adjust it later. Issue on :  411
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1278
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1436
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1437
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1438
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1553
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1554
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1555
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1878
This is not correctly cut, no problem, we'll adjust it later. Issue on :  1879
This is not correctly cut, no problem, we'll adjust it 

In [None]:
# Let's mark Test plays

#We mark the plays keeping track of the errors
#Then we fix the errors and rerun the mark plays process to ensure we don't have errors

errors=mark_plays(relevant_actions_test)
fix_mark_plays_errors(errors, relevant_actions_test)
errors=mark_plays(relevant_actions_test)

This is not correctly cut, no problem, we'll adjust it later. Issue on :  26
This is not correctly cut, no problem, we'll adjust it later. Issue on :  27
This is not correctly cut, no problem, we'll adjust it later. Issue on :  33
This is not correctly cut, no problem, we'll adjust it later. Issue on :  34
This is not correctly cut, no problem, we'll adjust it later. Issue on :  164
This is not correctly cut, no problem, we'll adjust it later. Issue on :  165
This is not correctly cut, no problem, we'll adjust it later. Issue on :  166
This is not correctly cut, no problem, we'll adjust it later. Issue on :  336
This is not correctly cut, no problem, we'll adjust it later. Issue on :  337
This is not correctly cut, no problem, we'll adjust it later. Issue on :  417
This is not correctly cut, no problem, we'll adjust it later. Issue on :  418
This is not correctly cut, no problem, we'll adjust it later. Issue on :  419
This is not correctly cut, no problem, we'll adjust it later. Issue 

In [None]:
# Code to view all the relevant actions of the game that a given relevant action belongs to (row 633 in this example)

#It isn't used at this point but it's handy

#relevant_actions_test[relevant_actions_test["Game_ID"]==relevant_actions_test["Game_ID"].iloc[633]]

In [None]:
# Sanity check about the unknown player

print("Unknown player count on the data without missing data: \n", pd.Series(relevant_actions_test[relevant_actions_test["caution"]==0]["Player_ID"]=="Player_X982WR9W").value_counts(), "\n\nUnknown player count on the data with missing data: \n", pd.Series(relevant_actions_test[relevant_actions_test["caution"]==1]["Player_ID"]=="Player_X982WR9W").value_counts())

Unknown player count on the data without missing data: 
 False    1169
Name: Player_ID, dtype: int64 

Unknown player count on the data with missing data: 
 True     51
False    41
Name: Player_ID, dtype: int64


In [None]:
# Replace the token used for unknown players with "Unknown" so it is easier for a human to read

relevant_actions_test.loc[relevant_actions_test["Player_ID"]=="Player_X982WR9W", "Player_ID"]="Unknown"

In [None]:
# Count the pending plays on train by size (groups of 3 and groups of 2)

relevant_actions_train.groupby("play")["id"].count().value_counts()

3    648
2    261
Name: id, dtype: int64

In [None]:
# Process 3 items plays for train

three_items(relevant_actions_train, X_train, processing_train=True)
print("LEN: ", len(relevant_actions_train))

Streaming output truncated to the last 5000 lines.
479           ERRRORRRRRRRRRRRRRRRRRRRRRRR, only 1 team
         Game_ID Opposition_Team        Player_ID  prediction  Start_minutes  \
479  ID_7HNMFC11   Medusa Merger  Player_EGASDW45        True          43.47   
480  ID_7HNMFC11   Medusa Merger  Player_DQTT1QWW        True          43.47   
481  ID_7HNMFC11   Medusa Merger  Player_EGASDW45        True          43.47   

          id  next_id  next_min  diff_min  diff_id  caution  play  
479  1301885  1301887        43      0.47        2        0   174  
480  1301887  1301889        43      0.47        2        0   174  
481  1301889  1301205        50      6.53      684        0   174  
479           ERRRORRRRRRRRRRRRRRRRRRRRRRR, only 1 team
         Game_ID Opposition_Team        Player_ID  prediction  Start_minutes  \
479  ID_7HNMFC11   Medusa Merger  Player_EGASDW45        True          43.47   
480  ID_7HNMFC11   Medusa Merger  Player_DQTT1QWW        True         

In [None]:
# Sanity check to ensure that we are keeping all the strikers and goalkeepers information

#Expected counts: (initial length - final length) divided by the 3 items of each play. For strikers we
#subtract the own goals and multiply by 2 (the player who scores + the player who assists)

print(len(strikers), len(goalkeepers))
print("Expected numbers: ", int((((2414-512)/3)-27)*2), int((2414-512)/3))

In [None]:
# Check the pending relevant actions on train to ensure only plays of 2 items are left

relevant_actions_train.groupby("play")["id"].count().value_counts()

In [None]:
# Count the pending plays on test by size (groups of 3 and groups of 2)

relevant_actions_test.groupby("play")["id"].count().value_counts()

In [None]:
# Clean misclassified players from the strikers and goalkeepers lists

clean_missclassified()

In [None]:
# Process 3 items plays for test

three_items(relevant_actions_test, X_test, processing_train=False)
print("LEN: ", len(relevant_actions_test))

In [None]:
# Sanity check to ensure that we are keeping all the strikers and goalkeepers information

#Same check as before, now adding the test plays to the train plays:
#(initial length - final length) divided by the 3 items of each play; for strikers we subtract the own
#goals and multiply by 2 (the player who scores + the player who assists)
#This time we also have to subtract the 3 misclassified players

print(len(strikers), len(goalkeepers))
print("Expected numbers: ", int((((((2414-512)/3)-27)*2)+(((1238-368)/3)-8)*2)-3), int(((2414-512)/3)+(1238-368)/3))
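
The arithmetic of both sanity checks can be spelled out step by step. This sketch takes every number from the formulas above (2414, 512 and 27 for train; 1238, 368 and 8 for test; 3 misclassified players); the helper name `expected_counts` is hypothetical.

```python
def expected_counts(initial_rows, remaining_rows, own_goals, rows_per_play=3):
    """Expected number of striker and goalkeeper entries after the 3-item plays."""
    plays = (initial_rows - remaining_rows) // rows_per_play
    # Two striker entries per play (scorer + assistant), own goals excluded.
    strikers = (plays - own_goals) * 2
    # One goalkeeper entry per play.
    goalkeepers = plays
    return strikers, goalkeepers

train_strikers, train_goalkeepers = expected_counts(2414, 512, own_goals=27)
test_strikers, test_goalkeepers = expected_counts(1238, 368, own_goals=8)
misclassified = 3

total_strikers = train_strikers + test_strikers - misclassified
total_goalkeepers = train_goalkeepers + test_goalkeepers
print(train_strikers, train_goalkeepers)    # train only
print(total_strikers, total_goalkeepers)    # train + test
```

For train this gives 634 plays, hence 1214 strikers and 634 goalkeepers; adding the 290 test plays gives 1775 strikers and 924 goalkeepers.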

In [None]:
# Check the pending relevant actions on test to ensure only plays of 2 items are left

relevant_actions_test.groupby("play")["id"].count().value_counts()

In [None]:
# Clean misclassified players from the strikers and goalkeepers lists

clean_missclassified()

In [None]:
# Process the plays in 2 items groups

#This time we stay in the safest zone: we do not use true strikers (checked from train data)
#or possible strikers learnt by our process
#We only use true goalkeepers (confirmed from train data by checking the Action feature)
#The loops lower the required number of goals step by step: by 10 per iteration in the initial loop,
#by 5 in the main loop and by 1 in the final loop. The steps 10, 5 and 1 are a best guess
#to go from safer to less safe. We start by categorizing the players that look clearest: we
#require a lot of goals during the first iterations and end up accepting a far smaller number
#Players that we have just categorized are in turn used to categorize more players, so an early
#mistake would keep accumulating; that is why the most clearly categorized players go first
#A step of 1 goal everywhere could potentially be better, but it would increase the run time
#and the notebook size, because the output would be even bigger than it already is

process_true_strikers=False
process_possible_strikers=False

two_items_main_current_limit=0 #Here we will store current value to loop on two_items_loop() and two_items()

for two_items_initial_current_limit in range(two_items_initial_upper_limit,two_items_initial_lower_limit, -10 ): 
    print("\n\nInitial loop:", two_items_initial_current_limit)
    two_items_loop()    
    #two_items_initial_current_limit=two_items_initial_current_limit-1
two_items_initial_current_limit=0
for two_items_main_current_limit in range(two_items_main_upper_limit,two_items_main_lower_limit, -5 ):   
    print("\n\nMain loop:", two_items_main_current_limit)
    two_items_loop()    
    #two_items_main_current_limit=two_items_main_current_limit-1
for two_items_main_current_limit in range(4,1, -1 ):   
    print("\n\nFinal loop:", two_items_main_current_limit)
    two_items_loop()    
    #two_items_main_current_limit=two_items_main_current_limit-1
reality_check() #We end by comparing our goal estimates versus train goals counted using "Action" feature
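
The order in which the three loops visit their limits can be listed explicitly. The sketch below only builds that list and does not call `two_items_loop()`. The limit values passed in are hypothetical, since the real `two_items_*_limit` parameters are defined elsewhere in the notebook.

```python
def threshold_schedule(initial_upper, initial_lower, main_upper, main_lower):
    """List (stage, limit) pairs in the order the loops above visit them."""
    schedule = [("initial", limit) for limit in range(initial_upper, initial_lower, -10)]
    schedule += [("main", limit) for limit in range(main_upper, main_lower, -5)]
    schedule += [("final", limit) for limit in range(4, 1, -1)]
    return schedule

# Hypothetical limits, chosen only to make the example concrete.
schedule = threshold_schedule(initial_upper=50, initial_lower=20, main_upper=20, main_lower=5)
for stage, limit in schedule:
    print(stage, limit)
```

Since the same three loops are repeated in the next cells with different `process_true_strikers` and `process_possible_strikers` flags, a helper like this could drive all of them from one place.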

In [None]:
# Process the plays in 2 items groups

#This time we stay in a safe zone: we use true goalkeepers (confirmed from train data by checking the
#Action feature) and we add true strikers, also confirmed from the train data Action feature

#The loops lower the required number of goals step by step: by 10 per iteration in the initial loop,
#by 5 in the main loop and by 1 in the final loop. The steps 10, 5 and 1 are a best guess
#to go from safer to less safe. We start by categorizing the players that look clearest: we
#require a lot of goals during the first iterations and end up accepting a far smaller number
#Players that we have just categorized are in turn used to categorize more players, so an early
#mistake would keep accumulating; that is why the most clearly categorized players go first
#A step of 1 goal everywhere could potentially be better, but it would increase the run time
#and the notebook size, because the output would be even bigger than it already is

process_true_strikers=True
process_possible_strikers=False


two_items_main_current_limit=0 #Here we will store current value to loop on two_items_loop() and two_items()

for two_items_initial_current_limit in range(two_items_initial_upper_limit,two_items_initial_lower_limit, -10 ): 
    print("Initial loop:", two_items_initial_current_limit)
    two_items_loop()    
    #two_items_initial_current_limit=two_items_initial_current_limit-1
two_items_initial_current_limit=0
for two_items_main_current_limit in range(two_items_main_upper_limit,two_items_main_lower_limit, -5 ):   
    print("Main loop:", two_items_main_current_limit)
    two_items_loop()    
    #two_items_main_current_limit=two_items_main_current_limit-1
for two_items_main_current_limit in range(4,1, -1 ):   
    print("Final loop:", two_items_main_current_limit)
    two_items_loop()    
    #two_items_main_current_limit=two_items_main_current_limit-1
reality_check() #We end by comparing our goal estimates versus train goals counted using "Action" feature

In [None]:
# Process the plays in 2 items groups

#This time we stay in a less safe zone: we use true goalkeepers (confirmed from train data by
#checking the Action feature), true strikers (also confirmed from the train data Action feature) and
#the possible strikers identified by this process

#The loops lower the required number of goals step by step: by 10 per iteration in the initial loop,
#by 5 in the main loop and by 1 in the final loop. The steps 10, 5 and 1 are a best guess
#to go from safer to less safe. We start by categorizing the players that look clearest: we
#require a lot of goals during the first iterations and end up accepting a far smaller number
#Players that we have just categorized are in turn used to categorize more players, so an early
#mistake would keep accumulating; that is why the most clearly categorized players go first
#A step of 1 goal everywhere could potentially be better, but it would increase the run time
#and the notebook size, because the output would be even bigger than it already is


process_true_strikers=True
process_possible_strikers=True


two_items_main_current_limit=0 #Here we will store current value to loop on two_items_loop() and two_items()

for two_items_initial_current_limit in range(two_items_initial_upper_limit,two_items_initial_lower_limit, -10 ): 
    print("Initial loop:", two_items_initial_current_limit)
    two_items_loop()    
    #two_items_initial_current_limit=two_items_initial_current_limit-1
two_items_initial_current_limit=0
for two_items_main_current_limit in range(two_items_main_upper_limit,two_items_main_lower_limit, -5 ):   
    print("Main loop:", two_items_main_current_limit)
    two_items_loop()    
    #two_items_main_current_limit=two_items_main_current_limit-1
for two_items_main_current_limit in range(4,1, -1 ):   
    print("Final loop:", two_items_main_current_limit)
    two_items_loop()    
    #two_items_main_current_limit=two_items_main_current_limit-1
reality_check() #We end by comparing our goal estimates versus train goals counted using "Action" feature

In [None]:
# Process the plays in 2 items groups: last final loop

#This is the last final loop, so anything goes: the limit is lowered to 1 goal. We keep the same
#strategy of going from safer to less safe parameters, because the players that we've just
#categorized are in turn used to categorize more players and we do not want an early error to
#keep accumulating

process_true_strikers=False
process_possible_strikers=False
for two_items_main_current_limit in range(1,0, -1 ):   
    print("Final loop:", two_items_main_current_limit)
    two_items_loop()   
    
process_true_strikers=True
process_possible_strikers=False
for two_items_main_current_limit in range(1,0, -1 ):   
    print("Final loop:", two_items_main_current_limit)
    two_items_loop()   
    
process_true_strikers=True
process_possible_strikers=True
for two_items_main_current_limit in range(1,0, -1 ):   
    print("Final loop:", two_items_main_current_limit)
    two_items_loop()   

    

In [None]:
# Clean misclassified players from the strikers and goalkeepers lists

clean_missclassified()

In [None]:
# Display train pending relevant actions (still not processed)

relevant_actions_train

In [None]:
# Display test pending relevant actions (still not processed)

relevant_actions_test

In [None]:
# We check pending test relevant actions grouped by number of moves per play

#We don't check train because there are no pending relevant actions on train

relevant_actions_test.groupby("play")["id"].count().value_counts()

In [None]:
# Display train and test match lower limit

#We display the lower limit used for each train and test match (the minimum number of goals used to
#reach conclusions about which team scored, who is a striker (or assisted the striker) and who is
#a goalkeeper)
#This information can be useful if we want a more conservative algorithm for the matches whose
#goals were decided with fewer goals, or even fall back to a 1/3 prediction for each of Home Win,
#Draw and Away win

print("Train match lower limit:")
print(train_last_match_lower_limit["min"].value_counts())

print("\nTest match lower limit:")
print(test_last_match_lower_limit["min"].value_counts())

print("\nTest match lower limit in percentage:")
print(test_last_match_lower_limit["min"].value_counts(normalize=True))

In [None]:
# Create diff feature (goals scored by home team - goals scored by away team) for both train and test

X_train["diff"]=X_train["Home"]-X_train["Away"]
X_test["diff"]=X_test["Home"]-X_test["Away"]

In [None]:
# Create Direct_guess feature

#With the data that we have so far we can prepare a direct guess of the match outcome by considering
#the difference of our prediction of goals scored by home team vs goals scored by away team 

X_train["Direct_guess"]=666 #Placeholder, replaced below by 0 (same goals), 2 (away scored more) or 1 (home scored more)
X_train.loc[X_train["diff"]==0, "Direct_guess"]=0
X_train.loc[X_train["diff"]<0, "Direct_guess"]=2
X_train.loc[X_train["diff"]>0, "Direct_guess"]=1

X_test["Direct_guess"]=666
X_test.loc[X_test["diff"]==0, "Direct_guess"]=0
X_test.loc[X_test["diff"]<0, "Direct_guess"]=2
X_test.loc[X_test["diff"]>0, "Direct_guess"]=1


print("Train: \n", X_train["Direct_guess"].value_counts(),"\nTest: \n", X_test["Direct_guess"].value_counts())
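
The three assignments per frame can be collapsed into one call with `numpy.select`. This is a sketch on toy data that keeps the same codes as above (0 for equal goals, 1 when home scored more, 2 when away scored more, 666 as placeholder); the helper name `direct_guess` is hypothetical.

```python
import numpy as np
import pandas as pd

def direct_guess(diff):
    """Map a goal difference (home - away) to the Direct_guess codes used above."""
    conditions = [diff == 0, diff > 0, diff < 0]
    codes = [0, 1, 2]
    # default=666 only survives where no condition holds, e.g. a missing diff.
    return pd.Series(np.select(conditions, codes, default=666), index=diff.index)

toy = pd.DataFrame({"Home": [2, 1, 0, 3], "Away": [1, 1, 2, 3]})
toy["diff"] = toy["Home"] - toy["Away"]
toy["Direct_guess"] = direct_guess(toy["diff"])
print(toy)
```

The same function can then be applied to both `X_train["diff"]` and `X_test["diff"]`, which avoids keeping two copies of the mapping in sync.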

In [None]:
# We keep a copy of test because later it will be handy to have it

recover_X_test=X_test.copy()

In [None]:
# We drop the data that we do not want to use for this prediction because we have better features

cols=["Date", "Game_ID", "Home Team", "Away Team"]
X_train.drop(cols, axis=1, inplace=True)
X_test.drop(cols, axis=1, inplace=True)
X_train.columns

In [None]:
# Let's split the data: half for train and half for val
#We could use 80% train and 20% val, but with small amounts of data I prefer to learn from less data
#and be more confident about how each change to the data or the model affects the prediction
#I was planning on running cross-validation but it looks like it isn't needed anymore

cut=int(len(X_train)*0.5)
X_val=X_train[cut:]
X_train=X_train[:cut]
y_val=y_train[cut:]
y_train=y_train[:cut]
print(len(X_train), len(X_val), len(X_train)+len(X_val), len(train))

In [None]:
# Delete samples difficult to classify on train

#First we fit an LGBMClassifier on train and predict on the same train data, because we
#want to detect which train samples might be outliers that reduce the accuracy of our test predictions
#This method should only be used in scenarios where you know ALL your future data, like a competition or
#a closed dataset

fi = [] #We will store feature importance here

learner=LGBMClassifier(random_state = 21, metric="multi_logloss")
learner.fit(X_train, y_train)
fi.append(pd.Series(learner.feature_importances_ / learner.feature_importances_.sum(), index=X_train.columns))
fi = pd.concat(fi, axis=1).mean(axis=1)
fi.sort_values(ascending=False).to_frame()
predict_proba_val=learner.predict_proba(X_train)
predict_val=learner.predict(X_train)
print("Feature importance: ")
print( fi.sort_values(ascending=False))
print("\nFeature with 0 importance: ", fi[fi==0])
print(learner)
print("Log_loss: ", log_loss(y_train, predict_proba_val))
predict_val=pd.DataFrame(predict_val)
predict_val[["index", "real"]]=pd.Series(y_train).reset_index() #We add the data to check the errors
print("Print number of errors: ", predict_val[predict_val[0]!=predict_val["real"]][0].count()) #We count the errors
print("This should be deleted to reduce error: ",predict_val[predict_val[0]!=predict_val["real"]]["index"].to_list()) #We list errors
errors=predict_val[predict_val[0]!=predict_val["real"]]["index"].to_list()

X_train.drop(errors, inplace=True)
y_train.drop(errors, inplace=True)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
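
The filtering step at the end of the cell can be isolated from the model. Below is a sketch that drops the rows whose prediction disagrees with the target using a boolean mask. The predictions are hard-coded and hypothetical (no LightGBM model is fitted here), and `drop_misclassified` is a hypothetical helper.

```python
import pandas as pd

def drop_misclassified(X, y, predictions):
    """Return X and y without the rows where the prediction differs from y."""
    wrong = pd.Series(predictions, index=y.index) != y
    dropped_labels = y.index[wrong.to_numpy()].tolist()
    X_clean = X.loc[~wrong].reset_index(drop=True)
    y_clean = y.loc[~wrong].reset_index(drop=True)
    return X_clean, y_clean, dropped_labels

X_toy = pd.DataFrame({"Direct_guess": [1, 0, 2, 1]}, index=[10, 11, 12, 13])
y_toy = pd.Series(["Home Win", "Draw", "Away win", "Draw"], index=[10, 11, 12, 13])
hypothetical_predictions = ["Home Win", "Draw", "Away win", "Home Win"]

X_clean, y_clean, dropped = drop_misclassified(X_toy, y_toy, hypothetical_predictions)
print(dropped, len(X_clean))
```

Working on copies keeps the original frames intact, so the deleted samples can still be inspected afterwards.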

In [None]:
# Let's display the error that we just deleted

predict_val[predict_val[0]!=predict_val["real"]]

In [None]:
# Prepare the table for the feature selection

#Let's prepare a table that shows how good each feature is at predicting the target on its own

fea_selection=pd.DataFrame(columns=['fea','score'], index=range(len(X_train.columns)))

for num, col in enumerate(X_train.columns):
    cols=[col] #One-element list so the model gets a single-column DataFrame, even if the name contains spaces
    learner=NuSVC(probability=True, random_state=2022, kernel="linear")
    learner.fit(X_train[cols], y_train)
    predict_proba_val=learner.predict_proba(X_val[cols])
    fea_selection.loc[num, "fea"]=col
    fea_selection.loc[num, "score"]=log_loss(y_val, predict_proba_val)
fi=fea_selection.sort_values("score", ascending=False)["fea"].to_list() #We want this list for the following feature selection loop
fi=pd.Series(fi)
fi=fi.reindex(fi)
fea_selection.sort_values("score", ascending=False)
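
For reference, the score stored in the table above is the multiclass log loss. Here is a minimal hand-written sketch of that metric, assuming labels 0, 1, 2 and one probability per class; `multiclass_log_loss` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
import math


def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log of the probability given to the true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


# Hypothetical labels: 0 = away win, 1 = draw, 2 = home win
y_true = [2, 0, 1]
uninformative = [[1 / 3, 1 / 3, 1 / 3]] * 3
informative = [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1], [0.2, 0.6, 0.2]]

baseline = multiclass_log_loss(y_true, uninformative)  # equals ln(3)
better = multiclass_log_loss(y_true, informative)
print(baseline, better)
```

Lower is better, so a feature whose score stays close to ln(3) is hardly better than guessing 1/3 for each outcome, and sorting with `ascending=False` lists the weakest features first.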

In [None]:
# Let's add the dummy feature

#NuSVC can work on a single feature but RandomForestClassifier can't do that
#Now we will check that we get same prediction with NuSVC with the dummy feature and without it
#So we will set those features for train, val and test

smallest_error=666 #We set this variable to an arbitrary large value, it will keep the smallest error found
cols=["Direct_guess"]
print("\n", cols)
check_NuSVC(cols)

X_train["dummy"]=0
X_val["dummy"]=0
X_test["dummy"]=0   

cols=["Direct_guess", "dummy" ]
print("\n", cols)
check_NuSVC(["Direct_guess","dummy" ])

X_train=X_train[cols]
X_val=X_val[cols]
X_test=X_test[cols]

In [None]:
# Prepare different predictions so later we can ensemble them to get a stronger prediction

learner=NuSVC(probability=True, random_state=2022, kernel="linear")
learner.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val])) #pd.concat also works on pandas 2.x, where .append was removed
print("\n", learner)
predict_proba_test=learner.predict_proba(X_test)
predict_proba_test=pd.DataFrame(predict_proba_test)
#predict_test=learner.predict(X_test)
#predict_test=pd.DataFrame(predict_test)
predict_proba_test["Game_ID"]=test["Game_ID"] #We will need Game_ID for the submit
predict_proba_test.rename(columns={0 :"Away win",1:"Draw",2 :"Home Win" }, inplace=True) #Let's rename the columns
predict_proba_test_NuSVC=predict_proba_test[['Game_ID','Away win', 'Draw', 'Home Win' ]] #Let's order the columns
#predict_proba_test_NuSVC.to_csv(name+"_NuSVC_.csv", index=False) #Time to save the prediction and wish for the best 

from sklearn.ensemble import RandomForestClassifier
learner=RandomForestClassifier(random_state=2022)
learner.fit(pd.concat([X_train, X_val]), pd.concat([y_train, y_val])) #pd.concat also works on pandas 2.x, where .append was removed
print("\n", learner)
predict_proba_test=learner.predict_proba(X_test)
predict_proba_test=pd.DataFrame(predict_proba_test)
#predict_test=learner.predict(X_test)
#predict_test=pd.DataFrame(predict_test)
predict_proba_test["Game_ID"]=test["Game_ID"] #We will need Game_ID for the submit
predict_proba_test.rename(columns={0 :"Away win",1:"Draw",2 :"Home Win" }, inplace=True) #Let's rename the columns
predict_proba_test_RF=predict_proba_test[['Game_ID','Away win', 'Draw', 'Home Win' ]] #Let's order the columns
#predict_proba_test_RF.to_csv(name+"_RF_.csv", index=False) #Time to save the prediction and wish for the best 


In [None]:
# Submit 1 - Ensemble: Let's be confident in NuSVC but let's add 10% of RandomForest

#We do this to try to get better predictions than with a single model

w_RF=0.1
w_NuSVC=0.9 #Named w_* so the weight does not overwrite the imported NuSVC class

if abs((w_RF+w_NuSVC)-1)>1e-9: #A tolerance avoids false alarms caused by float rounding
    print("*********** Error, sum is not 1 ***********")

predict_proba_test_ensemble=predict_proba_test_NuSVC.copy()
for col in ['Away win', 'Draw', 'Home Win']:
    predict_proba_test_ensemble[col]=predict_proba_test_RF[col].mul(w_RF)+predict_proba_test_NuSVC[col].mul(w_NuSVC)
name2=name+"_NuSVC_"+str(w_NuSVC)+"RF_"+str(w_RF)+".csv"
print(name2)
#predict_proba_test_ensemble.to_csv(name2, index=False) #Time to save the prediction and wish for the best 
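
The blend above is a weighted average of the two probability tables. A minimal sketch for a single match, assuming made-up probabilities and a hypothetical helper `blend`, shows why the weights must sum to 1: only then does the blended row remain a valid probability distribution.

```python
def blend(proba_a, proba_b, weight_a, weight_b):
    """Weighted average of two probability rows; the weights must sum to 1."""
    if abs(weight_a + weight_b - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [weight_a * a + weight_b * b for a, b in zip(proba_a, proba_b)]


# Made-up rows in the order: away win, draw, home win
nusvc_row = [0.20, 0.30, 0.50]
rf_row = [0.10, 0.10, 0.80]

blended = blend(nusvc_row, rf_row, 0.9, 0.1)
print(blended, sum(blended))
```

With 90% on the first model, the blended row stays close to it while still moving a little towards the second model.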

In [None]:
# Let's adjust the predictions

#We have a function to adjust our predictions so let's use it to adjust the matches that still have
#goals to be processed 

X_test=recover_X_test.copy() #We recover the data that first we copied, then we deleted and now we need
predict_proba_test_ensemble_adjusted=adjust_prediction(predict_proba_test_ensemble.copy())
predict_proba_test_ensemble_adjusted.to_csv("Final_submit1_ensemble.csv", index=False) #Time to save the prediction and wish for the best 
print("\n\nFinished with ensemble, we start with NuSVC\n\n")
predict_proba_test_NuSVC_adjusted=adjust_prediction(predict_proba_test_NuSVC.copy())

In [None]:
# Unclear_goals_confidence

#To reduce confidence when decisions were based on a small
#number of goals (1, 2 or 3), we use this function to avoid giving
#a very high % to such a prediction
#We correct each probability with ((prediction*parameter)+1/3)/(parameter+1)

goal1=3
goal2=30
goal3=300
unclear_goals_confidence("NuSVC_", predict_proba_test_NuSVC_adjusted.copy(),goal1, goal2, goal3)
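
The correction formula quoted in the comments above pulls each probability towards 1/3. The sketch below only illustrates that formula: `shrink_towards_uniform` is a hypothetical helper, the input row is made up, and how `unclear_goals_confidence` (defined earlier in the notebook) maps `goal1`, `goal2` and `goal3` to matches is not shown here, so the three values are simply tried in turn.

```python
def shrink_towards_uniform(proba, parameter):
    """Apply (p*parameter + 1/3)/(parameter + 1) to every probability in a row."""
    return [(p * parameter + 1 / 3) / (parameter + 1) for p in proba]


row = [0.05, 0.05, 0.90]  # made-up, very confident prediction

shrunk = {parameter: shrink_towards_uniform(row, parameter)
          for parameter in (3, 30, 300)}
for parameter, values in shrunk.items():
    print(parameter, values, sum(values))
```

A small parameter shrinks strongly and a large one barely changes the prediction. The three corrected values still sum to 1, because the formula adds the same 1/3 to each of the three outcomes before dividing.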

In [None]:
# THIS DOESN'T WORK AND IT ISN'T BEING USED BUT I WANTED TO SHARE

# Last try to deal with Unknown players (the ones not correctly identified by the image AI)

#This didn't work for the matches being analyzed but it could work for different data
#If we have a play for Team A (or opposite Team B) on minute 39 and we can identify the goalkeeper for 
#that team at minutes 30 and 45 we can conclude that the player in that play isn't the goalkeeper so 
#the other player in that play is the goalkeeper so we can safely assign the goal to the correct team

if old_pandas:
    train_stats=pd.read_csv("train_game_statistics.csv")
    test_stats=pd.read_csv("test_game_statistics.csv")
else:
    train_stats=pd.read_csv("../input/ladu-data/train_game_statistics.csv")
    test_stats=pd.read_csv("../input/ladu-data/test_game_statistics.csv")

for row in relevant_actions_test[relevant_actions_test["Player_ID"]!="Unknown"].index.values:
    game=relevant_actions_test["Game_ID"].iloc[row]
    other_team=relevant_actions_test["Opposition_Team"].iloc[row]
    player=relevant_actions_test["Player_ID"].iloc[row]
    print("\n", game, other_team, player)
    #Minutes of the relevant actions in this game
    game_minutes=relevant_actions_test[relevant_actions_test["Game_ID"]==game]["Start_minutes"]
    #Other players with the same Game_ID and the same Opposition_Team
    others=(test_stats["Game_ID"]==game) & (test_stats["Opposition_Team"]==other_team) & (test_stats["Player_ID"]!=player)
    before=test_stats["Player_ID"][others & (test_stats["Start_minutes"]<game_minutes.min())]
    after=test_stats["Player_ID"][others & (test_stats["Start_minutes"]>game_minutes.max())]
    print("Goalkeeper before : \n", before.isin(goalkeepers).values)
    print("Goalkeeper after : \n", after.isin(goalkeepers).values)

In [None]:
# List packages

!pip list |grep pandas 
!python --version
!pip list 

In [None]:
# Print wrongly classified goalkeepers and strikers

print(len(gk_and_stk), len(stk_and_gk))

#Count how many times each player appears in each list
#rename("goals") names the column the same way on old and new pandas versions
players=pd.Series(gk_and_stk).value_counts().rename("goals").to_frame()
players["player"]=players.index
stk=pd.Series(stk_and_gk).value_counts().rename("goals").to_frame()
stk["player"]=stk.index
players=pd.merge(players, stk, on='player', suffixes=("_gk", "_stk"))

print("Wrong goalkeepers: ", players[players["goals_gk"]<players["goals_stk"]]["player"].to_list())
print("\nWrong strikers: ", players[players["goals_gk"]>players["goals_stk"]]["player"].to_list())
#goalkeepers.drop(goalkeepers[goalkeepers.isin(players[players["goals_gk"]<players["goals_stk"]]["player"].to_list())].index, inplace=True)
#strikers.drop(strikers[strikers.isin(players[players["goals_gk"]>players["goals_stk"]]["player"].to_list())].index, inplace=True)
print("As many goals scored as conceded: ", goalkeepers[goalkeepers.isin(players[players["goals_gk"]==players["goals_stk"]]["player"])])

In [None]:
# Print a table of goals conceded as goalkeeper and scored as striker (or assisted the striker)

wrong_strikers=players[players["goals_gk"]>players["goals_stk"]]
wrong_goalkeepers=players[players["goals_gk"]<players["goals_stk"]]
players