# Introduction

**Author: Angel Captan**

**Assignment: Intro to AI Midsem Project**

## Description
In sports prediction, many factors, including the historical performance of teams, match results, and player data, have to be accounted for to help stakeholders understand the odds of winning or losing.

In this project, I am tasked with building one or more models that predict a player's overall rating given the player's profile.

The specific tasks given are:
1. Demonstrate the data preparation & feature extraction process
2. Create feature subsets that show maximum correlation with the dependent variable.
3. Create and train a suitable machine learning model with cross-validation that can predict a player's rating.
4. Measure the model's performance and fine-tune it as a process of optimization.
5. Use the data from another season (players_22), which was not used during training, to test how good the model is.
6. Deploy the model on a simple web page using Heroku, Streamlit, or Flask, and upload a video that shows how the model performs on the web page/site.

## Imports and Data Loading

This section of the notebook is dedicated to installing and importing libraries and loading the datasets.

In [None]:
# The PyPI package is named scikit-learn; `sklearn` on PyPI is only a deprecated placeholder
!pip install pandas numpy matplotlib seaborn xgboost scikit-learn


In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import pickle

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

With the necessary libraries imported, let's load our datasets.

In [None]:
#loading datasets
player21_df = pd.read_csv("/content/drive/MyDrive/Cap_Mid/players_21.csv") #for training
player22_df =  pd.read_csv("/content/drive/MyDrive/Cap_Mid/players_22.csv") # for testing



# Data Preprocessing

## EDA, Imputation and Encoding

With the data loaded, we now perform exploratory data analysis, identify the features that matter to us, impute missing values, and encode the necessary columns.

This step is necessary because not all columns and rows are needed for the analysis.

In [None]:
#view first few rows and nature of data
player21_df.head()

Unnamed: 0,sofifa_id,player_url,short_name,long_name,player_positions,overall,potential,value_eur,wage_eur,age,...,lcb,cb,rcb,rb,gk,player_face_url,club_logo_url,club_flag_url,nation_logo_url,nation_flag_url
0,158023,https://sofifa.com/player/158023/lionel-messi/...,L. Messi,Lionel Andrés Messi Cuccittini,"RW, ST, CF",93,93,103500000.0,560000.0,33,...,52+3,52+3,52+3,62+3,19+3,https://cdn.sofifa.net/players/158/023/21_120.png,https://cdn.sofifa.net/teams/241/60.png,https://cdn.sofifa.net/flags/es.png,https://cdn.sofifa.net/teams/1369/60.png,https://cdn.sofifa.net/flags/ar.png
1,20801,https://sofifa.com/player/20801/c-ronaldo-dos-...,Cristiano Ronaldo,Cristiano Ronaldo dos Santos Aveiro,"ST, LW",92,92,63000000.0,220000.0,35,...,54+3,54+3,54+3,61+3,20+3,https://cdn.sofifa.net/players/020/801/21_120.png,https://cdn.sofifa.net/teams/45/60.png,https://cdn.sofifa.net/flags/it.png,https://cdn.sofifa.net/teams/1354/60.png,https://cdn.sofifa.net/flags/pt.png
2,188545,https://sofifa.com/player/188545/robert-lewand...,R. Lewandowski,Robert Lewandowski,ST,91,91,111000000.0,240000.0,31,...,60+3,60+3,60+3,61+3,19+3,https://cdn.sofifa.net/players/188/545/21_120.png,https://cdn.sofifa.net/teams/21/60.png,https://cdn.sofifa.net/flags/de.png,,https://cdn.sofifa.net/flags/pl.png
3,190871,https://sofifa.com/player/190871/neymar-da-sil...,Neymar Jr,Neymar da Silva Santos Júnior,"LW, CAM",91,91,132000000.0,270000.0,28,...,49+3,49+3,49+3,62+3,20+3,https://cdn.sofifa.net/players/190/871/21_120.png,https://cdn.sofifa.net/teams/73/60.png,https://cdn.sofifa.net/flags/fr.png,,https://cdn.sofifa.net/flags/br.png
4,192985,https://sofifa.com/player/192985/kevin-de-bruy...,K. De Bruyne,Kevin De Bruyne,"CAM, CM",91,91,129000000.0,370000.0,29,...,69+3,69+3,69+3,75+3,21+3,https://cdn.sofifa.net/players/192/985/21_120.png,https://cdn.sofifa.net/teams/10/60.png,https://cdn.sofifa.net/flags/gb-eng.png,https://cdn.sofifa.net/teams/1325/60.png,https://cdn.sofifa.net/flags/be.png


In [None]:
player22_df.head()

Unnamed: 0,sofifa_id,player_url,short_name,long_name,player_positions,overall,potential,value_eur,wage_eur,age,...,lcb,cb,rcb,rb,gk,player_face_url,club_logo_url,club_flag_url,nation_logo_url,nation_flag_url
0,158023,https://sofifa.com/player/158023/lionel-messi/...,L. Messi,Lionel Andrés Messi Cuccittini,"RW, ST, CF",93,93,78000000.0,320000.0,34,...,50+3,50+3,50+3,61+3,19+3,https://cdn.sofifa.net/players/158/023/22_120.png,https://cdn.sofifa.net/teams/73/60.png,https://cdn.sofifa.net/flags/fr.png,https://cdn.sofifa.net/teams/1369/60.png,https://cdn.sofifa.net/flags/ar.png
1,188545,https://sofifa.com/player/188545/robert-lewand...,R. Lewandowski,Robert Lewandowski,ST,92,92,119500000.0,270000.0,32,...,60+3,60+3,60+3,61+3,19+3,https://cdn.sofifa.net/players/188/545/22_120.png,https://cdn.sofifa.net/teams/21/60.png,https://cdn.sofifa.net/flags/de.png,https://cdn.sofifa.net/teams/1353/60.png,https://cdn.sofifa.net/flags/pl.png
2,20801,https://sofifa.com/player/20801/c-ronaldo-dos-...,Cristiano Ronaldo,Cristiano Ronaldo dos Santos Aveiro,"ST, LW",91,91,45000000.0,270000.0,36,...,53+3,53+3,53+3,60+3,20+3,https://cdn.sofifa.net/players/020/801/22_120.png,https://cdn.sofifa.net/teams/11/60.png,https://cdn.sofifa.net/flags/gb-eng.png,https://cdn.sofifa.net/teams/1354/60.png,https://cdn.sofifa.net/flags/pt.png
3,190871,https://sofifa.com/player/190871/neymar-da-sil...,Neymar Jr,Neymar da Silva Santos Júnior,"LW, CAM",91,91,129000000.0,270000.0,29,...,50+3,50+3,50+3,62+3,20+3,https://cdn.sofifa.net/players/190/871/22_120.png,https://cdn.sofifa.net/teams/73/60.png,https://cdn.sofifa.net/flags/fr.png,,https://cdn.sofifa.net/flags/br.png
4,192985,https://sofifa.com/player/192985/kevin-de-bruy...,K. De Bruyne,Kevin De Bruyne,"CM, CAM",91,91,125500000.0,350000.0,30,...,69+3,69+3,69+3,75+3,21+3,https://cdn.sofifa.net/players/192/985/22_120.png,https://cdn.sofifa.net/teams/10/60.png,https://cdn.sofifa.net/flags/gb-eng.png,https://cdn.sofifa.net/teams/1325/60.png,https://cdn.sofifa.net/flags/be.png


In [None]:
#understand nature of data
player21_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18944 entries, 0 to 18943
Columns: 110 entries, sofifa_id to nation_flag_url
dtypes: float64(16), int64(44), object(50)
memory usage: 15.9+ MB


In [None]:
player22_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19239 entries, 0 to 19238
Columns: 110 entries, sofifa_id to nation_flag_url
dtypes: float64(16), int64(44), object(50)
memory usage: 16.1+ MB


From the info described above, we can see that there are a lot of columns to consider for the analysis, so we will need to drop some. Before finding the features we can work with, let's get the number of missing values in each of our dataframes.

In [None]:
# Checking for missing data
print("Checking sum of missing value for Players 21(Train Data:)")
player21_df.isnull().sum()

Checking sum of missing value for Players 21(Train Data:)


sofifa_id               0
player_url              0
short_name              0
long_name               0
player_positions        0
                    ...  
player_face_url         0
club_logo_url         225
club_flag_url         225
nation_logo_url     17817
nation_flag_url         0
Length: 110, dtype: int64

In [None]:
print("Checking sum of missing value for Players 22(Test Data:)")
player22_df.isnull().sum()

Checking sum of missing value for Players 22(Test Data:)


sofifa_id               0
player_url              0
short_name              0
long_name               0
player_positions        0
                    ...  
player_face_url         0
club_logo_url          61
club_flag_url          61
nation_logo_url     18480
nation_flag_url         0
Length: 110, dtype: int64

## Dropping Missing Values

Now we are going to drop columns in which more than 30% of the data is missing.

In [None]:
total_rows_21 = player21_df.shape[0] #shape for train data

Calculate the 30% threshold from the training set:

In [None]:
threshold_21 = int(0.3 * total_rows_21)

print("The threshold for Players 21 is", threshold_21)

The threshold for Players 21 is 5683


Get a list of all columns with a sum of missing values greater than the threshold:

In [None]:
columns_to_drop = []
for column in player21_df.columns:
    if player21_df[column].isna().sum() > threshold_21:
        columns_to_drop.append(column)
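
The same selection can be written as a vectorized filter on the per-column missing counts. The sketch below is only an illustration on a made-up frame: `toy_df` and its values are hypothetical stand-ins for `player21_df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for player21_df
toy_df = pd.DataFrame({
    "overall": [93, 92, 91, 91, 90],                    # no missing values
    "pace": [85.0, None, 78.0, None, 76.0],             # 2 of 5 missing
    "nation_logo_url": [None, None, None, None, "x"],   # 4 of 5 missing
})

# Same rule as in the notebook: 30% of the row count, truncated to an integer
threshold = int(0.3 * len(toy_df))

# Count missing values per column, then keep the names that exceed the threshold
missing_counts = toy_df.isna().sum()
columns_to_drop = missing_counts[missing_counts > threshold].index.tolist()

print(columns_to_drop)
```

Because the threshold is computed from the training frame only, the resulting list can be applied unchanged to both seasons.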

Drop the columns:

In [None]:
#run once
player21_df = player21_df.drop(columns=columns_to_drop)  # `columns=` already selects the axis
player22_df = player22_df.drop(columns=columns_to_drop)

Let's check info again:

In [None]:
#understand nature of data
player21_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18944 entries, 0 to 18943
Columns: 102 entries, sofifa_id to nation_flag_url
dtypes: float64(13), int64(44), object(45)
memory usage: 14.7+ MB


In [None]:
player22_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19239 entries, 0 to 19238
Columns: 102 entries, sofifa_id to nation_flag_url
dtypes: float64(13), int64(44), object(45)
memory usage: 15.0+ MB


After reviewing Kaggle, reading the data description, and looking at the columns in the data explorer, I came to understand that some columns don't contribute to the overall rating of a player, so we are going to drop those columns too.

Here is a link to explore the columns in the data: [Data Explorer on Kaggle](https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset/?select=players_22.csv)

In [None]:
drop_columns = ['sofifa_id','player_url','long_name','dob','body_type','real_face','player_face_url','club_logo_url','club_flag_url','nation_flag_url']

player21_df = player21_df.drop(drop_columns, axis=1)
player22_df = player22_df.drop(drop_columns, axis=1)

After a further review, some more columns were identified that could be dropped, with the following justification. The `ls` column, for example, is described as the `player attribute playing as LS`.

Such columns are only useful if we wanted to predict a player's effectiveness in a given position, so we drop all columns with that kind of description.


Players are normally played in a specific position at their clubs, which contributes more to their overall rating, so columns like `player_positions`, which holds the `player preferred positions`, are dropped as well.


Other columns reviewed that can be dropped are:


*   `short_name`
*   `league_name`
*   `nationality_name`


In [None]:
#drop newly identified columns
drop_r_cols = ['short_name', 'player_positions', 'league_name', 'nationality_name', 'ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam', 'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm', 'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb', 'gk']

player21_df = player21_df.drop(drop_r_cols, axis=1)
player22_df = player22_df.drop(drop_r_cols, axis=1)

## Imputation x Encoding

Having dropped columns with many missing values, we now do imputation. Imputation is where we fill missing data with certain values.

In [None]:
## Filling missing numeric data with the mean value
num_imputer = SimpleImputer(strategy='mean')

## Filling missing categorical data with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')

In [None]:
# Selecting numerical and categorical features

num_features_21 = player21_df.select_dtypes(include=[np.number]).columns.tolist()
cat_features_21 = player21_df.select_dtypes(include=['object']).columns.tolist()  # `np.object` is deprecated since NumPy 1.20


In [None]:
cat_features_21

['club_name', 'club_position', 'club_joined', 'preferred_foot', 'work_rate']

From Kaggle, the column `overall` is described as the *player current overall attribute*, which translates to **the player rating**, i.e. the crux of this whole project. Thus we remove `overall` from the feature list since it is our target variable.

In [None]:
# Removing the target variable from the features
num_features_21.remove('overall')  # 'overall' is the target variable

Now we do the imputation:

In [None]:
#numeric imputation
player21_df[num_features_21] = num_imputer.fit_transform(player21_df[num_features_21])

#categorical imputation
player21_df[cat_features_21] = cat_imputer.fit_transform(player21_df[cat_features_21])

In [None]:
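#numeric imputation
# Note: fit_transform re-fits the imputer on the 2022 data; num_imputer.transform(...) would reuse the 2021 statistics instead
player22_df[num_features_21] = num_imputer.fit_transform(player22_df[num_features_21])

#categorical imputation
player22_df[cat_features_21] = cat_imputer.fit_transform(player22_df[cat_features_21])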
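
A common alternative is to learn the fill values from the training season only and apply them to the later season. The following is a minimal sketch of that pattern, not the notebook's own code: the arrays `train` and `test` are made-up single-column examples.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data standing in for one numeric feature
train = np.array([[1.0], [3.0], [np.nan]])   # observed training mean is 2.0
test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)                    # learn the mean from the training data only
filled_test = imputer.transform(test) # fill test gaps with the training mean

print(filled_test.ravel())
```

With this pattern the missing test value is filled with the training mean rather than with a statistic computed from the test data itself.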

In [None]:
player21_df.shape

(18944, 61)

In [None]:
player22_df.shape

(19239, 61)

The next task is encoding, which we do only for the categorical columns. We first explored the OneHot Encoding technique, but quickly discovered that we ran out of memory, so we pivoted to encoding with `pd.get_dummies`.

In [None]:
# Using `get_dummies` for one-hot encoding and dropping the first category
player21_encoded_df = pd.get_dummies(player21_df, columns=cat_features_21, drop_first=True)
player22_encoded_df = pd.get_dummies(player22_df, columns=cat_features_21, drop_first=True)

player21_encoded_df.head()  # display the first few rows to verify the changes

Unnamed: 0,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,club_team_id,league_level,club_jersey_number,...,club_joined_2020-09-22,preferred_foot_Right,work_rate_High/Low,work_rate_High/Medium,work_rate_Low/High,work_rate_Low/Low,work_rate_Low/Medium,work_rate_Medium/High,work_rate_Medium/Low,work_rate_Medium/Medium
0,93,93.0,103500000.0,560000.0,33.0,170.0,72.0,241.0,1.0,10.0,...,0,0,0,0,0,0,0,0,1,0
1,92,92.0,63000000.0,220000.0,35.0,187.0,83.0,45.0,1.0,7.0,...,0,1,1,0,0,0,0,0,0,0
2,91,91.0,111000000.0,240000.0,31.0,184.0,80.0,21.0,1.0,9.0,...,0,1,0,1,0,0,0,0,0,0
3,91,91.0,132000000.0,270000.0,28.0,175.0,68.0,73.0,1.0,10.0,...,0,1,0,1,0,0,0,0,0,0
4,91,91.0,129000000.0,370000.0,29.0,181.0,70.0,10.0,1.0,17.0,...,0,1,0,0,0,0,0,0,0,0


In [None]:
player21_encoded_df.shape

(18944, 2594)

Finally, let's describe our datasets before feature analysis.

In [None]:
player21_encoded_df.describe()

Unnamed: 0,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,club_team_id,league_level,club_jersey_number,...,club_joined_2020-09-22,preferred_foot_Right,work_rate_High/Low,work_rate_High/Medium,work_rate_Low/High,work_rate_Low/Low,work_rate_Low/Medium,work_rate_Medium/High,work_rate_Medium/Low,work_rate_Medium/Medium
count,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,...,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0
mean,65.677787,71.086729,2902288.0,9148.482825,25.225823,181.190773,75.016892,47908.905551,1.355468,20.589668,...,0.000581,0.762669,0.041491,0.18238,0.023543,0.002798,0.025443,0.094911,0.048195,0.526816
std,7.002278,6.109985,7695181.0,19774.654223,4.697354,6.825672,7.05714,53585.632395,0.734613,16.955963,...,0.024091,0.425458,0.199428,0.386167,0.151625,0.052821,0.157472,0.2931,0.214183,0.499294
min,47.0,47.0,9000.0,500.0,16.0,155.0,50.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,61.0,67.0,475000.0,1000.0,21.0,176.0,70.0,462.75,1.0,9.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,66.0,71.0,975000.0,3000.0,25.0,181.0,75.0,1920.0,1.0,18.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,70.0,75.0,2100000.0,9000.0,29.0,186.0,80.0,110981.0,1.0,27.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,93.0,95.0,185500000.0,560000.0,53.0,206.0,110.0,114899.0,4.0,99.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
player22_encoded_df.describe()

Unnamed: 0,overall,potential,value_eur,wage_eur,age,height_cm,weight_kg,club_team_id,league_level,club_jersey_number,...,club_joined_2021-09-15,preferred_foot_Right,work_rate_High/Low,work_rate_High/Medium,work_rate_Low/High,work_rate_Low/Low,work_rate_Low/Medium,work_rate_Medium/High,work_rate_Medium/Low,work_rate_Medium/Medium
count,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,...,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0,19239.0
mean,65.772182,71.07937,2850452.0,9017.989363,25.210822,181.299704,74.943032,50580.498123,1.354364,20.94525,...,0.000312,0.762722,0.041998,0.190291,0.02365,0.002339,0.024222,0.097718,0.042102,0.520557
std,6.880232,6.086213,7599043.0,19439.284122,4.748235,6.863179,7.069434,54315.551123,0.746679,17.880953,...,0.017657,0.425425,0.20059,0.392541,0.15196,0.048308,0.153741,0.296941,0.200827,0.49959
min,47.0,49.0,9000.0,500.0,16.0,155.0,49.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,61.0,67.0,475000.0,1000.0,21.0,176.0,70.0,479.0,1.0,9.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,66.0,71.0,975000.0,3000.0,25.0,181.0,75.0,1939.0,1.0,18.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,70.0,75.0,2100000.0,8000.0,29.0,186.0,80.0,111138.0,1.0,27.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,93.0,95.0,194000000.0,350000.0,54.0,206.0,110.0,115820.0,5.0,99.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
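
Because the two seasons are encoded separately, their dummy columns need not match (the two summaries above end in different `club_joined_...` columns). One possible way to keep the columns aligned, shown here only as a sketch on made-up frames, is to fix the category list from the training data before calling `pd.get_dummies`. The frames `train` and `test` and their values are hypothetical.

```python
import pandas as pd

# Made-up frames standing in for the two seasons
train = pd.DataFrame({"age": [33, 35, 31],
                      "work_rate": ["High/Low", "Medium/Medium", "Low/High"]})
test = pd.DataFrame({"age": [34, 32],
                     "work_rate": ["Medium/Medium", "Medium/Medium"]})

# Fix the category list from the training data so both frames share it
categories = sorted(train["work_rate"].unique())
for frame in (train, test):
    frame["work_rate"] = pd.Categorical(frame["work_rate"], categories=categories)

# With a shared categorical dtype, both calls emit the same dummy columns
train_enc = pd.get_dummies(train, columns=["work_rate"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["work_rate"], drop_first=True)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

In this notebook the mismatch is harmless in practice, since only a handful of numeric features are kept for modelling, but it would matter if any dummy column were selected.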


# Feature Engineering

## Feature Extraction
Now we are going to analyze the dataset to understand which features are important for determining a player's overall rating. We use feature importance to identify the necessary features.


Here we are fitting a RandomForestRegressor to obtain feature importances.

In [None]:
# the target variable and features; drop non-numeric columns if necessary
X = player21_encoded_df.drop(columns=['overall'])
y = player21_encoded_df['overall']

In [None]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=69)

In [None]:
# Create a Random Forest Regressor model
model = RandomForestRegressor(random_state=69)
model.fit(X_train, y_train)

In [None]:

# Get feature importances
importances = model.feature_importances_

# Sort them in descending order
indices = np.argsort(importances)[::-1]

# Let's print out the feature importance ranking
print("Feature ranking:")

for i in range(X.shape[1]):
    print(f"{i + 1}. Feature {X.columns[indices[i]]} ({importances[indices[i]]})")



Feature ranking:
1. Feature value_eur (0.655313911307742)
2. Feature release_clause_eur (0.1671399032578144)
3. Feature age (0.10563586970386792)
4. Feature potential (0.04598021500830903)
5. Feature movement_reactions (0.01838299700497527)
6. Feature defending (0.0003302305182612406)
7. Feature wage_eur (0.00029034834456474464)
8. Feature attacking_crossing (0.00021826096369428509)
9. Feature goalkeeping_positioning (0.0001954981518714694)
10. Feature club_joined_2019-07-01 (0.00018700597181429076)
11. Feature mentality_composure (0.00018407466823738676)
12. Feature skill_ball_control (0.00017820083145583438)
13. Feature club_name_1. FSV Mainz 05 (0.00017185890894593782)
14. Feature dribbling (0.00017148550095093503)
15. Feature attacking_short_passing (0.0001673233198858285)
16. Feature goalkeeping_reflexes (0.00016183621540485465)
17. Feature goalkeeping_diving (0.00016131726165922577)
18. Feature physic (0.000153583808687285)
19. Feature passing (0.0001482086864542363)
20. Feature 

In [None]:
# Now, let's get the top 10 features
top_features = [X.columns[indices[i]] for i in range(10)]
print("\nTop 10 features with % Contribution:")

for i in range(10):
    print(f"{i + 1}.  {top_features[i]} ({round(importances[indices[i]]*100,2)}%)")



Top 10 features with % Contribution:
1.  value_eur (65.53%)
2.  release_clause_eur (16.71%)
3.  age (10.56%)
4.  potential (4.6%)
5.  movement_reactions (1.84%)
6.  defending (0.03%)
7.  wage_eur (0.03%)
8.  attacking_crossing (0.02%)
9.  goalkeeping_positioning (0.02%)
10.  club_joined_2019-07-01 (0.02%)


Observing the results of the feature importance process, the top 5 features together account for roughly *99.2%* of the total importance.
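
The combined share of the top features can be checked with a running sum over the printed importances. This is a small worked example using the ten values from the ranking above; the 99% cut-off (`target`) is a hypothetical choice for illustration.

```python
from itertools import accumulate

# Importances of the ten top-ranked features, copied from the ranking printed above
importances = [
    0.655313911307742, 0.1671399032578144, 0.10563586970386792,
    0.04598021500830903, 0.01838299700497527, 0.0003302305182612406,
    0.00029034834456474464, 0.00021826096369428509,
    0.0001954981518714694, 0.00018700597181429076,
]

# Running total of importance as features are added in rank order
cumulative = list(accumulate(importances))

# Smallest number of top features whose combined importance reaches the target
target = 0.99  # hypothetical cut-off chosen for illustration
n_needed = next(i + 1 for i, total in enumerate(cumulative) if total >= target)

print(f"Top 5 combined: {cumulative[4] * 100:.1f}%")
print(f"Features needed for {target:.0%}: {n_needed}")
```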

Thus my strategy is to train only on these top 5 features, since every remaining feature contributes less than 0.04% on its own. The same 5 features are used for testing and, once the model is deployed, for prediction.

Let's see how it goes. On to feature subsetting.

In [None]:
top_features = top_features[:5]

print('Features being used for model development are:\n')
top_features

Features being used for model development are:



['value_eur', 'release_clause_eur', 'age', 'potential', 'movement_reactions']

## Feature Subset

At this stage our goal is to use the top features we have identified at our feature extraction stage to create subsetted data that we will use to train models.

In [None]:
#Now we subset our X feature set
X_top_f = X[top_features]
X_top_f

#no need to do for y

Unnamed: 0,value_eur,release_clause_eur,age,potential,movement_reactions
0,103500000.0,138400000.0,33.0,93.0,94.0
1,63000000.0,75900000.0,35.0,92.0,95.0
2,111000000.0,132000000.0,31.0,91.0,93.0
3,132000000.0,166500000.0,28.0,91.0,91.0
4,129000000.0,161000000.0,29.0,91.0,91.0
...,...,...,...,...,...
18939,70000.0,57000.0,21.0,52.0,48.0
18940,70000.0,72000.0,21.0,53.0,50.0
18941,45000.0,47000.0,28.0,47.0,44.0
18942,130000.0,165000.0,17.0,67.0,53.0


Now let's scale our features, which are our independent variables.

In [None]:
# Initialize the scaler
scaler = StandardScaler()

# Scale the features
X_scaled = scaler.fit_transform(X_top_f)

# The features are now scaled and ready for training the model.
X_scaled_df = pd.DataFrame(X_scaled, columns=X_top_f.columns)

X_scaled_df.head()

Unnamed: 0,value_eur,release_clause_eur,age,potential,movement_reactions
0,13.073165,13.695696,1.655055,3.586563,3.554438
1,7.809992,7.312715,2.080838,3.422893,3.664174
2,14.047827,13.042079,1.229273,3.259222,3.444701
3,16.776879,16.565484,0.590598,3.259222,3.225227
4,16.387015,16.003782,0.80349,3.259222,3.225227
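
With its default settings, `StandardScaler` replaces each value by its z-score: the column mean is subtracted and the result is divided by the column's population standard deviation. The sketch below reproduces that arithmetic with NumPy on made-up numbers; `X_demo` is a hypothetical stand-in for two of the selected columns.

```python
import numpy as np

# Made-up values standing in for two selected columns (age, potential)
X_demo = np.array([
    [33.0, 93.0],
    [35.0, 92.0],
    [21.0, 52.0],
    [17.0, 67.0],
])

col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # population standard deviation (ddof=0)

# z-score: centre each column on 0 and rescale it to unit variance
X_demo_scaled = (X_demo - col_mean) / col_std

print(X_demo_scaled.round(3))
```

This also explains why the top players in the table above have values such as 13 or 16 for `value_eur`: their market values lie many standard deviations above the mean.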


In [None]:
#Saving scaler to use in deployment

with open('scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)


# Training Models

We are now ready to train some models. Here we are going to train 3 models:
1. XGBoost
2. Gradient Boost
3. Random Forest


Let's split the data for training.

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, test_size=0.3, random_state=84)

Now we define a function for training our various models

In [None]:
def train_model(model, param_grid, X, y):
    '''
        Trains a model using grid search with cross-validation and returns the best model.
        Parameters:
            model: scikit-learn model
            param_grid: dictionary with parameters to try
            X: features(independent variables)
            y: target(dependent variable)
    '''
    cv = KFold(n_splits=7 , random_state=69, shuffle=True)

    # Grid search with cross-validation
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)
    grid_search.fit(X, y)

    # Results of the grid search
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best score (MAE): {-grid_search.best_score_}")  # scikit-learn reports MAE negated so that higher is better; flip the sign back

    return grid_search.best_estimator_  # Returns the best model

## Model 1: XGBoost

In [None]:
print("\nTraining XGBoost...")
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_params = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.1, 0.001, 0.01],
    'max_depth': [3, 5, 9, 15],
    'colsample_bytree': [0.5, 0.75, 1]
}
best_xgb = train_model(xgb_model, xgb_params, X_train, y_train)


Training XGBoost...
Best parameters: {'colsample_bytree': 1, 'learning_rate': 0.01, 'max_depth': 15, 'n_estimators': 1000}
Best score (MAE): 0.2875536853687638


## Model 2: Gradient Boosting Regressor

In [None]:
print("\nTraining Gradient Boosting...")
gbr_model = GradientBoostingRegressor(random_state=63)
gbr_params = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.1, 0.001, 0.01],
    'max_depth': [9, 15]
}
best_gbr = train_model(gbr_model, gbr_params, X_train, y_train)


Training Gradient Boosting...
Best parameters: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 500}
Best score (MAE): 0.28019891850304995


## Model 3: Random Forest Regressor



In [None]:
print("\nTraining Random Forest...")
rf_model = RandomForestRegressor(random_state=39)
rf_params = {
    'n_estimators': [500,1000],
    'max_depth': [12, 15],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
best_rf = train_model(rf_model, rf_params, X_train, y_train)


Training Random Forest...
Best parameters: {'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 1000}
Best score (MAE): 0.2748741579368859
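
The cost of each grid search follows from the grid sizes: every parameter combination is fitted once per fold. The worked count below reads the number of candidate values off the three grids above and multiplies by the 7 folds used in `train_model`. It counts cross-validation fits only, not the final refit of the best model.

```python
from math import prod

n_splits = 7  # folds used by KFold in train_model

# Number of candidate values per hyperparameter, read off the three grids above
grid_sizes = {
    "XGBoost": [3, 3, 4, 3],          # n_estimators, learning_rate, max_depth, colsample_bytree
    "Gradient Boosting": [3, 3, 2],   # n_estimators, learning_rate, max_depth
    "Random Forest": [2, 2, 2, 2],    # n_estimators, max_depth, min_samples_split, min_samples_leaf
}

# Combinations per grid times folds gives the number of cross-validation fits
fits = {name: prod(sizes) * n_splits for name, sizes in grid_sizes.items()}

for name, n in fits.items():
    print(f"{name}: {n} fits")
```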


## Model 4: Ensembled Model

From discussion in class, I have come to understand that an ensemble model can improve predictive performance. Here I will combine the best versions of my 3 models into a single ensemble model.

In [None]:
# Create an ensemble model
ensemble = VotingRegressor(
    estimators=[
        ('xgb', best_xgb),
        ('gbr', best_gbr),
        ('rf', best_rf)
    ]
)
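
With no weights given, a `VotingRegressor` predicts the plain average of its base models' predictions. The sketch below illustrates that averaging step in isolation; the helper `average_predictions` and the prediction values are hypothetical and not part of the notebook.

```python
def average_predictions(per_model_predictions):
    """Average predictions sample by sample across several models."""
    return [sum(values) / len(values) for values in zip(*per_model_predictions)]


# Made-up ratings predicted by three models for the same two players
xgb_preds = [66.0, 70.0]
gbr_preds = [67.0, 71.0]
rf_preds = [65.0, 72.0]

ensemble_preds = average_predictions([xgb_preds, gbr_preds, rf_preds])
print(ensemble_preds)
```

Averaging tends to cancel out errors that the individual models make in different directions, which is the usual motivation for this kind of ensemble.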

In [None]:
# Fit model on the training data
print("\nTraining Ensemble Model...")
ensemble.fit(X_train, y_train)

# Predict and evaluate on the training set
train_pred = ensemble.predict(X_train)
train_mae = mean_absolute_error(y_train, train_pred)
print(f"Ensemble model MAE on training set: {train_mae}")


Training Ensemble Model...
Ensemble model MAE on training set: 0.04907840399452329


Now we have our trained models. We are moving on to evaluation on the test set to see how they perform. Before that, let's save them so we don't incur the cost of retraining if the runtime fails.

# Saving Models

In [None]:
%cd "/content/drive/MyDrive/Cap_Mid"

/content/drive/MyDrive/Cap_Mid


In [None]:
with open('best_xgb_model.pkl', 'wb') as file:
    pickle.dump(best_xgb, file)

with open('best_gbr_model.pkl', 'wb') as file:
    pickle.dump(best_gbr, file)

with open('best_rf_model.pkl', 'wb') as file:
    pickle.dump(best_rf, file)

with open('ensemble_model.pkl', 'wb') as file:
    pickle.dump(ensemble, file)

Test whether the model was saved correctly:

In [None]:
with open('ensemble_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

predictions = loaded_model.predict(X_test)

en_mae = mean_absolute_error(y_test, predictions)

print(f"Ensemble model MAE on test set: {en_mae}")

Ensemble model MAE on test set: 0.2637278347406203


The model was saved and reloaded correctly.

# Evaluation

We are going to do two evaluations: one on the test set separated from the training data, the other on `Players 22`, an unseen dataset similar to the data used to train the models.

## Test Set Evaluations

In [None]:
print("\nEvaluating XGBoost...")

#predict on test set
pred_xgb = best_xgb.predict(X_test)
xgb_mae = mean_absolute_error(y_test, pred_xgb)

print(f"XGBoost model MAE on test set: {xgb_mae:.2f}")


Evaluating XGBoost...
XGBoost model MAE on test set: 0.28


In [None]:
print("\nEvaluating Gradient Boost...")

#predict on test set
pred_gbr = best_gbr.predict(X_test)
gbr_mae = mean_absolute_error(y_test, pred_gbr)

print(f"Gradient Boost Regressor model MAE on test set: {gbr_mae:.2f}")


Evaluating Gradient Boost...
Gradient Boost Regressor model MAE on test set: 0.28


In [None]:
print("\nEvaluating Random Forest...")

#predict on test set
pred_rf = best_rf.predict(X_test)
rf_mae = mean_absolute_error(y_test, pred_rf)

print(f"Random Forest Regressor model MAE on test set: {rf_mae:.2f}")


Evaluating Random Forest...
Random Forest Regressor model MAE on test set: 0.27


In [None]:
print("\nEvaluating Ensemble...")

#predict on test set
pred_en = ensemble.predict(X_test)
en_mae = mean_absolute_error(y_test, pred_en)

print(f"Ensemble model MAE on test set: {en_mae:.2f}")


Evaluating Ensemble...
Ensemble model MAE on test set: 0.26
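
The MAE values reported above are the average absolute difference between predicted and actual ratings, so an MAE of 0.26 means predictions are off by about a quarter of a rating point on average. A minimal hand-rolled version, with made-up ratings and a hypothetical helper name, looks like this:

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [93, 92, 91, 91]
predicted = [92.5, 92.0, 91.5, 90.0]  # made-up predictions for illustration

mae = mean_absolute_error_manual(actual, predicted)
print(f"MAE: {mae:.2f}")
```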


## Player 22 Evaluations


Here we test our trained models further on the `player22` data. The data has already been preprocessed, so we only have to extract the top features needed.

In [None]:
player22_encoded_df['overall']

0        93
1        92
2        91
3        91
4        91
         ..
19234    47
19235    47
19236    47
19237    47
19238    47
Name: overall, Length: 19239, dtype: int64

In [None]:
top_features

['value_eur', 'release_clause_eur', 'age', 'potential', 'movement_reactions']

In [None]:
player22_encoded_df[top_features]

Unnamed: 0,value_eur,release_clause_eur,age,potential,movement_reactions
0,78000000.0,144300000.0,34.0,93.0,94.0
1,119500000.0,197200000.0,32.0,92.0,93.0
2,45000000.0,83300000.0,36.0,91.0,94.0
3,129000000.0,238700000.0,29.0,91.0,89.0
4,125500000.0,232200000.0,30.0,91.0,91.0
...,...,...,...,...,...
19234,70000.0,114000.0,22.0,52.0,53.0
19235,110000.0,193000.0,19.0,59.0,49.0
19236,100000.0,175000.0,21.0,55.0,46.0
19237,110000.0,239000.0,19.0,60.0,48.0


In [None]:
#Get player 22 info
y_22 = player22_encoded_df['overall']
X_22 = player22_encoded_df[top_features]


In [None]:
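#Scale input
# Note: fit_transform re-fits the scaler on the 2022 data; scaler.transform(X_22) would reuse the 2021 statistics instead
X_scaled_22 = scaler.fit_transform(X_22)

# The features are now scaled and ready for prediction.
X22_scaled_df = pd.DataFrame(X_scaled_22, columns=X_22.columns)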

In [None]:
#reassign
X_22 = X22_scaled_df

X_22.head()

Unnamed: 0,value_eur,release_clause_eur,age,potential,movement_reactions
0,9.889601,9.591752,1.851089,3.60178,3.599846
1,15.350958,13.244084,1.429869,3.43747,3.489252
2,5.546836,5.380179,2.272309,3.27316,3.599846
3,16.601147,16.109335,0.798039,3.27316,3.046874
4,16.140551,15.66056,1.008649,3.27316,3.268063
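
Scaling a new season can also reuse the mean and standard deviation learned from the training season, so that the models see inputs on the same scale they were trained on. The sketch below shows that pattern on made-up numbers; `scaler_demo`, `train_ages`, and `new_season_ages` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data standing in for the two seasons
train_ages = np.array([[20.0], [25.0], [30.0]])
new_season_ages = np.array([[25.0], [35.0]])

scaler_demo = StandardScaler()
scaler_demo.fit(train_ages)  # learns the mean (25) and standard deviation from training data only

# transform applies the training statistics without re-fitting
scaled_new = scaler_demo.transform(new_season_ages)

print(scaler_demo.mean_, scaled_new.ravel())
```

The same idea applies to a scaler restored from `scaler.pkl` at deployment time: it is loaded and used with `transform`, never fitted again.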


We use the saved models here.

### Loading Saved Models

In [None]:
#move to directory where models are saved
%cd "/content/drive/MyDrive/Cap_Mid"

/content/drive/MyDrive/Cap_Mid


In [None]:
with open('best_xgb_model.pkl', 'rb') as file:
    lbest_xgb = pickle.load(file)

with open('best_gbr_model.pkl', 'rb') as file:
    lbest_gbr = pickle.load(file)

with open('best_rf_model.pkl', 'rb') as file:
    lbest_rf = pickle.load(file)

with open('ensemble_model.pkl', 'rb') as file:
    lensemble = pickle.load(file)

### Testing

In [None]:
print("\nEvaluating XGBoost...")

#predict on Players 22 set
pred_xgb = lbest_xgb.predict(X_22)
xgb_mae = mean_absolute_error(y_22, pred_xgb)

print(f"XGBoost model MAE on Players 22 set: {xgb_mae:.2f}")


Evaluating XGBoost...
XGBoost model MAE on Players 22 set: 0.68


In [None]:
print("\nEvaluating Random Forest...")

#predict on Players 22 set
pred_rf = lbest_rf.predict(X_22)
rf_mae = mean_absolute_error(y_22, pred_rf)

print(f"Random Forest Regressor model MAE on Players 22 set: {rf_mae:.2f}")


Evaluating Random Forest...
Random Forest Regressor model MAE on Players 22 set: 0.58


In [None]:
print("\nEvaluating Gradient Boost...")

#predict on Players 22 set
pred_gbr = lbest_gbr.predict(X_22)
gbr_mae = mean_absolute_error(y_22, pred_gbr)

print(f"Gradient Boost Regressor model MAE on Players 22 set: {gbr_mae:.2f}")


Evaluating Gradient Boost...
Gradient Boost Regressor model MAE on Players 22 set: 0.62


In [None]:
print("\nEvaluating Ensemble...")

#predict on Players 22 set
pred_en = lensemble.predict(X_22)
en_mae = mean_absolute_error(y_22, pred_en)

print(f"Ensemble model MAE on Players 22 set: {en_mae:.2f}")



Evaluating Ensemble...
Ensemble model MAE on Players 22 set: 0.59
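
The two sets of results can be put side by side to see how much each model degrades on the unseen season. The snippet below only rearranges the MAE values printed in this notebook, rounded to two decimals.

```python
# MAE values copied from the evaluation outputs above (rounded to 2 decimals)
test_mae = {"XGBoost": 0.28, "Gradient Boosting": 0.28,
            "Random Forest": 0.27, "Ensemble": 0.26}
players22_mae = {"XGBoost": 0.68, "Gradient Boosting": 0.62,
                 "Random Forest": 0.58, "Ensemble": 0.59}

# Increase in MAE when moving from the held-out test split to Players 22
increase = {name: round(players22_mae[name] - test_mae[name], 2) for name in test_mae}

# Model with the lowest error on the unseen season
best_22 = min(players22_mae, key=players22_mae.get)

for name in test_mae:
    print(f"{name}: test {test_mae[name]:.2f}, Players 22 {players22_mae[name]:.2f}, +{increase[name]:.2f}")
print(f"Lowest MAE on Players 22: {best_22}")
```

Every model's error roughly doubles on the new season, yet it stays well below one rating point on average.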


In [None]:
!pip freeze > requirements.txt