<h2 style="font-family: sans-serif; color: black;">
Introduction
</h2>

<p style="font-family: sans-serif; font-size: medium;">
Leading football teams routinely use data analysis to improve their results. Match-analysis software tracks in-game events such as shots on goal and summarizes them in dashboards. The dataset for this exercise comes from one such tool, which has been analyzing the matches of the Shahin team (Captain Tsubasa’s team).
</p>


<h2 style="font-family: sans-serif; color: black;">
Problem Objective
</h2>

<p style="font-family: sans-serif; font-size: medium;">
The goal of this exercise is to predict the probability of each shot resulting in a goal.
<br><br>
After performing basic preprocessing steps like cleaning certain columns and creating new features, we will focus on the main objective of this project: <b>feature selection</b>.
</p>


In [1]:
import numpy as np
import pandas as pd 

In [2]:
df = pd.read_csv('../data/train.csv')
df

Unnamed: 0,matchId,playerId,playType,bodyPart,x,y,interveningOpponents,interveningTeammates,interferenceOnShooter,minute,second,outcome
0,m_91,p_103,جریان بازی,پای راست,13.47,-11.22,1,0,متوسط,70,9,گُل
1,m_17,p_16,جریان بازی,پای چپ,9.48,14.22,3,0,متوسط,55,4,مهار توسط دروازه بان
2,m_111,p_88,ضربه آزاد مستقیم,پای چپ,29.43,-1.25,6,2,کم,86,31,مهار توسط دروازه بان
3,m_142,p_87,جریان بازی,پای راست,26.93,1.00,4,1,متوسط,77,2,موقعیت از دست رفته
4,m_117,p_9,جریان بازی,پای راست,10.72,5.24,2,0,متوسط,76,46,گُل
...,...,...,...,...,...,...,...,...,...,...,...,...
8920,m_57,p_115,جریان بازی,سر,6.48,3.99,3,0,زیاد,69,50,موقعیت از دست رفته
8921,m_59,p_76,جریان بازی,پای راست,21.45,-8.73,4,1,متوسط,15,53,برخورد به دفاع
8922,m_55,p_150,جریان بازی,پای چپ,11.97,3.24,3,0,متوسط,84,34,موقعیت از دست رفته
8923,m_33,p_130,جریان بازی,پای راست,6.48,-6.98,1,0,زیاد,4,39,موقعیت از دست رفته


In [3]:
# Create a binary 'label' column: 1 if 'outcome' is 'گُل' (goal) or 'گُل به خودی' (own goal), otherwise 0
df['label'] = df['outcome'].isin(['گُل', 'گُل به خودی']).astype(int)

df = df.drop('outcome', axis=1)

df


Unnamed: 0,matchId,playerId,playType,bodyPart,x,y,interveningOpponents,interveningTeammates,interferenceOnShooter,minute,second,label
0,m_91,p_103,جریان بازی,پای راست,13.47,-11.22,1,0,متوسط,70,9,1
1,m_17,p_16,جریان بازی,پای چپ,9.48,14.22,3,0,متوسط,55,4,0
2,m_111,p_88,ضربه آزاد مستقیم,پای چپ,29.43,-1.25,6,2,کم,86,31,0
3,m_142,p_87,جریان بازی,پای راست,26.93,1.00,4,1,متوسط,77,2,0
4,m_117,p_9,جریان بازی,پای راست,10.72,5.24,2,0,متوسط,76,46,1
...,...,...,...,...,...,...,...,...,...,...,...,...
8920,m_57,p_115,جریان بازی,سر,6.48,3.99,3,0,زیاد,69,50,0
8921,m_59,p_76,جریان بازی,پای راست,21.45,-8.73,4,1,متوسط,15,53,0
8922,m_55,p_150,جریان بازی,پای چپ,11.97,3.24,3,0,متوسط,84,34,0
8923,m_33,p_130,جریان بازی,پای راست,6.48,-6.98,1,0,زیاد,4,39,0



<p style="line-height:200%;font-family:vazir;font-size:medium">
<font face="vazir" size=3>
    We first train an automated (AutoML) baseline on the raw data, without any preprocessing, and record its ROC AUC. We then compare it with a model trained on the preprocessed data.
</font>
</p>

In [4]:
from sklearn.model_selection import train_test_split
from flaml import AutoML
from sklearn.metrics import roc_auc_score

x_train, x_test, y_train, y_test = train_test_split(
    df.drop('label', axis=1), df.label, random_state=313, stratify=df.label
)

model = AutoML(task='classification', time_budget=60, verbose=0)
model.fit(x_train, y_train)

# predict() returns hard 0/1 class labels, so the AUC below is computed from labels rather than probabilities
y_pred = model.predict(x_test)
print(roc_auc_score(y_test, y_pred))



0.5601638504864311
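
<p style="font-family: sans-serif; font-size: medium;">
Since the objective is a goal probability, ROC AUC is usually computed from predicted probabilities rather than from hard class labels. The sketch below uses a small hand-made example (the values are illustrative, not taken from the dataset) to show that the two inputs can give different scores. With FLAML, the analogous change would presumably be to pass <code>model.predict_proba(x_test)[:, 1]</code> instead of <code>model.predict(x_test)</code>; that variant was not run here.
</p>

```python
from sklearn.metrics import roc_auc_score

# Illustrative values only (not from the shots dataset)
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.10, 0.20, 0.45, 0.30, 0.40, 0.90]  # hypothetical predicted goal probabilities

# Thresholding at 0.5 discards the ranking information among shots on the same side of the threshold
hard_labels = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(f'AUC from probabilities: {auc_from_proba:.4f}')
print(f'AUC from hard labels:   {auc_from_labels:.4f}')
```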


In [5]:
# Drop the match and player identifiers
df = df.drop(columns=['matchId', 'playerId'])
df

Unnamed: 0,playType,bodyPart,x,y,interveningOpponents,interveningTeammates,interferenceOnShooter,minute,second,label
0,جریان بازی,پای راست,13.47,-11.22,1,0,متوسط,70,9,1
1,جریان بازی,پای چپ,9.48,14.22,3,0,متوسط,55,4,0
2,ضربه آزاد مستقیم,پای چپ,29.43,-1.25,6,2,کم,86,31,0
3,جریان بازی,پای راست,26.93,1.00,4,1,متوسط,77,2,0
4,جریان بازی,پای راست,10.72,5.24,2,0,متوسط,76,46,1
...,...,...,...,...,...,...,...,...,...,...
8920,جریان بازی,سر,6.48,3.99,3,0,زیاد,69,50,0
8921,جریان بازی,پای راست,21.45,-8.73,4,1,متوسط,15,53,0
8922,جریان بازی,پای چپ,11.97,3.24,3,0,متوسط,84,34,0
8923,جریان بازی,پای راست,6.48,-6.98,1,0,زیاد,4,39,0


In [6]:
# Merge 'پای راست' (right foot) and 'پای چپ' (left foot) into a single 'پا' (foot) category
def replace_feet(body_part):
    if body_part in ('پای راست', 'پای چپ'):
        return 'پا'
    return body_part

df['bodyPart'] = df['bodyPart'].apply(replace_feet)
df

Unnamed: 0,playType,bodyPart,x,y,interveningOpponents,interveningTeammates,interferenceOnShooter,minute,second,label
0,جریان بازی,پا,13.47,-11.22,1,0,متوسط,70,9,1
1,جریان بازی,پا,9.48,14.22,3,0,متوسط,55,4,0
2,ضربه آزاد مستقیم,پا,29.43,-1.25,6,2,کم,86,31,0
3,جریان بازی,پا,26.93,1.00,4,1,متوسط,77,2,0
4,جریان بازی,پا,10.72,5.24,2,0,متوسط,76,46,1
...,...,...,...,...,...,...,...,...,...,...
8920,جریان بازی,سر,6.48,3.99,3,0,زیاد,69,50,0
8921,جریان بازی,پا,21.45,-8.73,4,1,متوسط,15,53,0
8922,جریان بازی,پا,11.97,3.24,3,0,متوسط,84,34,0
8923,جریان بازی,پا,6.48,-6.98,1,0,زیاد,4,39,0


In [7]:
# Calculate the Euclidean distance from the origin (0, 0), which the angle formula below treats as the goal centre
def calc_distance(x, y):
    return np.sqrt(x ** 2 + y ** 2)

# Calculate the angle (in degrees) that the goal mouth subtends at the shot location (x, y),
# with the goal centre at the origin and the posts at y = ±3.66
def calc_angle(x, y):
    tan_theta = 7.32 * x / (x ** 2 + y ** 2 - (7.32 / 2) ** 2)  # 7.32 is the goal width in meters
    theta = np.arctan(tan_theta)

    # arctan returns values in (-pi/2, pi/2); a negative result corresponds to an obtuse angle, so shift it by pi
    if theta >= 0:
        return np.rad2deg(theta)
    else:
        return np.rad2deg(theta + np.pi)

# Apply the distance and angle calculations to each row in the DataFrame
df['distance'] = df.apply(lambda row: calc_distance(row['x'], row['y']), axis=1)
df['angle'] = df.apply(lambda row: calc_angle(row['x'], row['y']), axis=1)

df = df.drop(columns=['x', 'y'])

df


Unnamed: 0,playType,bodyPart,interveningOpponents,interveningTeammates,interferenceOnShooter,minute,second,label,distance,angle
0,جریان بازی,پا,1,0,متوسط,70,9,1,17.530810,18.544088
1,جریان بازی,پا,3,0,متوسط,55,4,0,17.090313,13.982592
2,ضربه آزاد مستقیم,پا,6,2,کم,86,31,0,29.456534,14.153255
3,جریان بازی,پا,4,1,متوسط,77,2,0,26.948560,15.458384
4,جریان بازی,پا,2,0,متوسط,76,46,1,11.932141,31.315918
...,...,...,...,...,...,...,...,...,...,...
8920,جریان بازی,سر,3,0,زیاد,69,50,0,7.609895,46.818116
8921,جریان بازی,پا,4,1,متوسط,15,53,0,23.158484,16.713121
8922,جریان بازی,پا,3,0,متوسط,84,34,0,12.400746,31.970470
8923,جریان بازی,پا,1,0,زیاد,4,39,0,9.524222,31.529506
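
<p style="font-family: sans-serif; font-size: medium;">
The row-wise <code>apply</code> calls above can also be written as vectorized NumPy operations. The following is a minimal sketch, assuming every shot has <code>x &gt; 0</code> (taken from in front of the goal line); under that assumption <code>np.arctan2</code> yields the same angle as <code>calc_angle</code> without the explicit sign check. The helper name <code>add_shot_geometry</code> and the constant <code>GOAL_WIDTH</code> are introduced here for illustration only.
</p>

```python
import numpy as np
import pandas as pd

GOAL_WIDTH = 7.32  # goal width in meters, as in calc_angle


def add_shot_geometry(shots):
    """Return a copy of `shots` with vectorized 'distance' and 'angle' columns.

    Assumes x > 0 for every row; for such rows arctan2 matches the
    arctan-plus-pi correction used in calc_angle.
    """
    x = shots['x'].to_numpy(dtype=float)
    y = shots['y'].to_numpy(dtype=float)

    out = shots.copy()
    out['distance'] = np.hypot(x, y)
    out['angle'] = np.rad2deg(
        np.arctan2(GOAL_WIDTH * x, x ** 2 + y ** 2 - (GOAL_WIDTH / 2) ** 2)
    )
    return out


# Coordinates of rows 0 and 4 from the table above
sample = pd.DataFrame({'x': [13.47, 10.72], 'y': [-11.22, 5.24]})
geometry = add_shot_geometry(sample)
print(geometry)
```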


In [8]:
df.isna().sum()

playType                  0
bodyPart                  0
interveningOpponents      0
interveningTeammates      0
interferenceOnShooter    34
minute                    0
second                    0
label                     0
distance                  0
angle                     0
dtype: int64

In [9]:
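# Impute missing interference levels from the number of intervening opponents:
# 0 -> 'کم' (low), 1 -> 'متوسط' (medium), 2 or more -> 'زیاد' (high)
def fill_nan_vals_of_interference_on_shooter(interveningOpponents):
    if interveningOpponents == 0:
        return 'کم'
    elif interveningOpponents == 1:
        return 'متوسط'
    else:
        return 'زیاد'

# Assign the result back instead of calling fillna(..., inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to df
df['interferenceOnShooter'] = df['interferenceOnShooter'].fillna(
    df['interveningOpponents'].apply(fill_nan_vals_of_interference_on_shooter)
)
df.isna().sum()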

playType                 0
bodyPart                 0
interveningOpponents     0
interveningTeammates     0
interferenceOnShooter    0
minute                   0
second                   0
label                    0
distance                 0
angle                    0
dtype: int64
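
<p style="font-family: sans-serif; font-size: medium;">
The same rule-based imputation can be expressed without a Python-level function by building the fallback values with <code>np.select</code>. This is a sketch on a toy frame; the English labels <code>'low'</code>, <code>'medium'</code>, and <code>'high'</code> are stand-ins for the Persian categories in the dataset, and the rows are made up for illustration.
</p>

```python
import numpy as np
import pandas as pd

# Toy data: two rows have a missing interference level
toy = pd.DataFrame({
    'interveningOpponents': [0, 1, 4, 2],
    'interferenceOnShooter': ['low', None, None, 'medium'],
})

# Fallback level derived from the opponent count: 0 -> low, 1 -> medium, otherwise high
fallback = pd.Series(
    np.select(
        [toy['interveningOpponents'] == 0, toy['interveningOpponents'] == 1],
        ['low', 'medium'],
        default='high',
    ),
    index=toy.index,
)

# Only the missing entries are replaced; observed values are kept as they are
toy['interferenceOnShooter'] = toy['interferenceOnShooter'].fillna(fallback)
print(toy)
```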

In [10]:
# Perform one-hot encoding
oneHotPlayType = pd.get_dummies(df['playType'], prefix='playType')

oneHotBodyPart = pd.get_dummies(df['bodyPart'], prefix='bodyPart')

oneHotInterferenceOnShooter = pd.get_dummies(df['interferenceOnShooter'], prefix='interferenceOnShooter')

df = pd.concat([df, oneHotPlayType, oneHotBodyPart, oneHotInterferenceOnShooter], axis=1)

df = df.drop(columns=['playType', 'bodyPart', 'interferenceOnShooter'])

df


Unnamed: 0,interveningOpponents,interveningTeammates,minute,second,label,distance,angle,playType_جریان بازی,playType_ضربه آزاد مستقیم,playType_مستقیم از کرنر,playType_پنالتی,bodyPart_سایر,bodyPart_سر,bodyPart_پا,interferenceOnShooter_زیاد,interferenceOnShooter_متوسط,interferenceOnShooter_کم
0,1,0,70,9,1,17.530810,18.544088,1,0,0,0,0,0,1,0,1,0
1,3,0,55,4,0,17.090313,13.982592,1,0,0,0,0,0,1,0,1,0
2,6,2,86,31,0,29.456534,14.153255,0,1,0,0,0,0,1,0,0,1
3,4,1,77,2,0,26.948560,15.458384,1,0,0,0,0,0,1,0,1,0
4,2,0,76,46,1,11.932141,31.315918,1,0,0,0,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8920,3,0,69,50,0,7.609895,46.818116,1,0,0,0,0,1,0,1,0,0
8921,4,1,15,53,0,23.158484,16.713121,1,0,0,0,0,0,1,0,1,0
8922,3,0,84,34,0,12.400746,31.970470,1,0,0,0,0,0,1,0,1,0
8923,1,0,4,39,0,9.524222,31.529506,1,0,0,0,0,0,1,1,0,0


In [11]:
from sklearn.feature_selection import mutual_info_classif

X = df.drop(columns=['label'])
y = df['label']

# Treat every column except the two continuous ones as discrete
discrete_features = [col not in ('distance', 'angle') for col in X.columns]

# Mutual information between each feature and the label, indexed by feature name
feature_importance = pd.DataFrame(
    {'fi': mutual_info_classif(X=X, y=y, discrete_features=discrete_features, random_state=1401)},
    index=X.columns,
)

feature_importance = feature_importance.sort_values(by = 'fi', ascending = False)

feature_importance

Unnamed: 0,fi
angle,0.05955956
distance,0.05800996
interveningOpponents,0.04429951
playType_پنالتی,0.01581679
interveningTeammates,0.008254063
minute,0.005363962
interferenceOnShooter_کم,0.00430743
second,0.003790938
playType_جریان بازی,0.003353142
interferenceOnShooter_زیاد,0.003124208
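
<p style="font-family: sans-serif; font-size: medium;">
One simple way to turn such a ranking into a feature subset is to keep every feature whose score is at or above the median. The sketch below shows that rule on a small, illustrative score table; the values are rounded or invented for the example, and <code>bodyPart_other</code> is a hypothetical feature name.
</p>

```python
import pandas as pd

# Illustrative mutual-information scores (not the full table from the notebook)
scores = pd.Series(
    {
        'angle': 0.060,
        'distance': 0.058,
        'interveningOpponents': 0.044,
        'minute': 0.005,
        'second': 0.004,
        'bodyPart_other': 0.000,  # hypothetical uninformative feature
    },
    name='fi',
)

# Median of the scores; with six values this is the mean of the two middle ones
threshold = scores.quantile(0.5)

# Keep the features scoring at or above the median
selected = scores[scores >= threshold].index.tolist()
print(threshold, selected)
```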


In [12]:
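model = AutoML(task='classification', time_budget=60, verbose=0)

# Keep the features whose mutual-information score is at or above the median
cols_to_train = feature_importance[feature_importance.fi >= feature_importance.fi.quantile(.5)].index
x = df[cols_to_train]
y = df.label

# Caution: the model is fit on all rows before the split, so x_test is a subset of the
# training data and the score printed below is not a held-out estimate
model.fit(x, y)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=313, test_size=.3, stratify=y)
y_pred = model.predict(x_test)
print(f'performance of model is {roc_auc_score(y_test, y_pred)}')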



performance of model is 0.5814467992941821
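
<p style="font-family: sans-serif; font-size: medium;">
For a score that reflects unseen shots, the split has to happen before fitting, and the model must only see the training part. The following is a minimal sketch of that order of operations, using synthetic data and scikit-learn's <code>LogisticRegression</code> as a stand-in for the AutoML model; it illustrates the evaluation pattern and is not a rerun of the experiment above.
</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# 1) Split first, so the test rows never reach the model during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=313, stratify=y
)

# 2) Fit on the training part only
clf = LogisticRegression().fit(X_train, y_train)

# 3) Score the held-out part using the predicted probability of the positive class
test_proba = clf.predict_proba(X_test)[:, 1]
held_out_auc = roc_auc_score(y_test, test_proba)
print(f'held-out ROC AUC: {held_out_auc:.3f}')
```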
