# PIPELINE CREATION FOR FIFA

This notebook builds machine learning pipelines that automate the full workflow of data preprocessing and model training.

Each pipeline integrates feature scaling, model training, and prediction into a single unified object. Because the scaler is fitted only on the training data and then reused unchanged at prediction time, the same preprocessing is applied in both phases, which helps prevent data leakage from the test set and keeps training and inference consistent.
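The consistency described above can be checked directly. The following is a minimal sketch on synthetic data (not the FIFA dataset), using LogisticRegression as a stand-in model; the names `demo_pipeline`, `scaler_mean`, and `train_mean` are illustrative assumptions and do not appear elsewhere in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for player attributes (hypothetical values, not the FIFA data)
rng = np.random.default_rng(0)
X = rng.normal(loc=60.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 60.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
demo_pipeline.fit(X_tr, y_tr)

# The scaler's statistics are learned from the training rows only
scaler_mean = demo_pipeline.named_steps["scaler"].mean_
train_mean = X_tr.mean(axis=0)
print(np.allclose(scaler_mean, train_mean))

# predict() re-applies the stored scaling before calling the model
demo_pred = demo_pipeline.predict(X_te)
```

Fitting the scaler inside the pipeline means the test rows never influence the scaling statistics, which is the leakage this design is meant to avoid.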

Specifically, the notebook builds two separate pipelines:

1. Classification Pipeline:
Used to predict the player’s role (cluster) from the selected attributes. This pipeline includes:

   - StandardScaler for feature standardization
   - A classification model (XGBClassifier in this notebook; RandomForestClassifier or LogisticRegression could be substituted)

2. Regression Pipeline:
Used to predict the overall rating of a player. This pipeline follows the same preprocessing structure but uses a regression model (RandomForestRegressor in this notebook).

Both pipelines are trained, evaluated on a held-out test set, and then exported as .pkl files for deployment in the Streamlit application. Because each file holds the preprocessing and model steps in one object, the application only needs to load the file and call predict, which keeps deployment simpler and less error-prone.
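To illustrate the export and reload step, here is a minimal sketch of a pickle round trip on a small synthetic pipeline. The file name `demo_pipeline.pkl`, the temporary directory, and the LinearRegression stand-in are assumptions made for this example; the notebook's own files are written further down.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (hypothetical values, not the FIFA data)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(50, 4))
y_demo = X_demo.sum(axis=1)

small_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

# Write the fitted pipeline to disk, then read it back
demo_path = os.path.join(tempfile.mkdtemp(), "demo_pipeline.pkl")
with open(demo_path, "wb") as f:
    pickle.dump(small_pipeline, f)

with open(demo_path, "rb") as f:
    restored = pickle.load(f)

# The restored object scales and predicts exactly like the original
same_predictions = np.allclose(
    small_pipeline.predict(X_demo), restored.predict(X_demo)
)
print(same_predictions)
```

A pickled pipeline is generally only safe to load with the same library versions that produced it, so it is worth recording the scikit-learn and xgboost versions alongside the .pkl files.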

## INITIAL STEPS

In [20]:
# IMPORTING THE REQUIRED LIBRARIES
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, r2_score
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBClassifier, XGBRegressor
import pickle
import joblib

In [21]:
# URL OF THE CLEANED FIFA DATASET
url = "https://drive.google.com/uc?export=download&id=12KMhMsgHaXj1BeMMIhpTn1QLxCraZvti"

In [22]:
# USING PANDAS, LOAD THE DATA INTO THE 'fifa_cleaned' VARIABLE
fifa_cleaned = pd.read_csv(url)

In [23]:
fifa_cleaned

Unnamed: 0,pace,shooting,passing,dribbling,defending,physic,gk_diving,gk_reflexes,gk_handling,movement_acceleration,movement_reactions,power_strength,cluster,overall
0,87.0,92.0,92.0,96.0,39.0,66.0,0.0,0.0,0.0,91,95,68,2,94
1,90.0,93.0,82.0,89.0,35.0,78.0,0.0,0.0,0.0,89,96,78,2,93
2,91.0,85.0,87.0,95.0,32.0,58.0,0.0,0.0,0.0,94,92,49,2,92
3,0.0,0.0,0.0,0.0,0.0,0.0,87.0,89.0,92.0,43,88,78,1,91
4,91.0,83.0,86.0,94.0,35.0,66.0,0.0,0.0,0.0,94,90,63,2,91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18273,57.0,23.0,28.0,33.0,47.0,51.0,0.0,0.0,0.0,56,40,47,3,48
18274,58.0,24.0,33.0,35.0,48.0,48.0,0.0,0.0,0.0,55,41,44,3,48
18275,54.0,35.0,44.0,45.0,48.0,51.0,0.0,0.0,0.0,55,52,51,3,48
18276,59.0,35.0,47.0,47.0,45.0,52.0,0.0,0.0,0.0,55,54,55,3,48


In [24]:
# SELECT THE IMPORTANT PLAYER ATTRIBUTES TO USE AS INPUT FEATURES FOR CLASSIFICATION/REGRESSION
selected_features = [
    'pace','shooting','passing','dribbling','defending','physic',
    'gk_diving','gk_reflexes','gk_handling',
    'movement_acceleration','movement_reactions',
    'power_strength'
]

## CLASSIFICATION PIPELINE

The classification pipeline is designed to predict the player role (cluster) using the selected features. The pipeline combines data preprocessing and model training into a single, reusable structure, ensuring that the same transformations applied during training are also applied during prediction.

In [25]:
# SELECT INPUT FEATURES AND TARGET COLUMN FOR THE CLASSIFICATION PIPELINE
X_clf = fifa_cleaned[selected_features]
y_clf = fifa_cleaned['cluster']

In [26]:
# SPLIT THE DATA INTO TRAIN (80%) AND TEST (20%) SETS WITH A FIXED RANDOM STATE FOR REPRODUCIBILITY
X_train, X_test, y_train, y_test = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42
)

In [27]:
# CREATE A CLASSIFICATION PIPELINE THAT FIRST SCALES THE FEATURES AND THEN TRAINS AN XGBOOST CLASSIFIER
clf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', XGBClassifier(eval_metric='mlogloss'))
])

In [28]:
# TRAIN THE CLASSIFICATION PIPELINE USING THE TRAINING DATA
clf_pipeline.fit(X_train, y_train)

In [29]:
# USE THE TRAINED CLASSIFICATION PIPELINE TO PREDICT THE CLUSTER FOR THE TEST SET
y_pred = clf_pipeline.predict(X_test)

In [30]:
# CALCULATE AND PRINT THE CLASSIFICATION ACCURACY OF THE MODEL
accuracy = accuracy_score(y_test, y_pred)
print("Classification Accuracy:", accuracy)

Classification Accuracy: 0.962253829321663


In [31]:
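# SAVE THE TRAINED CLASSIFICATION PIPELINE; THE 'with' BLOCK CLOSES THE FILE AFTER WRITING
with open("classification_pipeline.pkl", "wb") as f:
    pickle.dump(clf_pipeline, f)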

The classification pipeline achieved a test accuracy of 96.23%, suggesting that the selected numerical features carry strong predictive signal for player roles. The high accuracy also suggests that the clusters discovered through unsupervised learning are well separated and can be reliably reproduced by a supervised classifier.
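Overall accuracy can hide a cluster that is classified poorly, so a per-class breakdown may be a useful complement. The following is a minimal sketch on synthetic data generated with make_classification, with RandomForestClassifier as a stand-in for the notebook's model; the names `demo_clf`, `cm`, and `report` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem (hypothetical data, not the FIFA clusters)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# stratify keeps the class proportions the same in both splits
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

demo_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
]).fit(Xs_tr, ys_tr)

ys_pred = demo_clf.predict(Xs_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(ys_te, ys_pred)
# Precision, recall, and F1 for each class
report = classification_report(ys_te, ys_pred)
print(cm)
print(report)
```

Applied to the notebook's own `y_test` and `y_pred`, the same two calls would show whether any single cluster accounts for most of the misclassifications.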

## REGRESSION PIPELINE

A regression pipeline was created using scikit-learn to predict the overall rating of a player. The pipeline includes feature scaling using StandardScaler and a Random Forest Regressor as the estimation model. This ensures a consistent, automated workflow for predicting overall ratings and allows the entire pipeline to be saved and deployed easily.

In [32]:
# SELECT INPUT FEATURES AND SET 'OVERALL' AS THE TARGET VARIABLE FOR THE REGRESSION PIPELINE
X_reg = fifa_cleaned[selected_features]
y_reg = fifa_cleaned['overall']

In [33]:
# SPLIT THE DATA INTO TRAIN (80%) AND TEST (20%) SETS FOR THE REGRESSION PIPELINE
X_train, X_test, y_train, y_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)


In [34]:
# CREATE A REGRESSION PIPELINE THAT SCALES THE FEATURES AND TRAINS A RANDOM FOREST REGRESSOR
reg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(n_estimators=300, random_state=42))
])

In [35]:
# TRAIN THE REGRESSION PIPELINE USING THE TRAINING DATA
reg_pipeline.fit(X_train, y_train)

In [36]:
# USE THE TRAINED REGRESSION PIPELINE TO PREDICT OVERALL RATINGS FOR THE TEST SET
y_pred = reg_pipeline.predict(X_test)

In [37]:
# CALCULATE AND PRINT THE R² SCORE TO MEASURE HOW MUCH OF THE VARIANCE IN 'overall' THE MODEL EXPLAINS
r2 = r2_score(y_test, y_pred)
print("Regression R² Score:", r2)

Regression R² Score: 0.970895231262467
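R² reports the share of variance explained but not how many rating points a typical prediction is off by. As a sketch of complementary error metrics, the example below computes MAE and RMSE on a handful of made-up values; `demo_true` and `demo_pred` are hypothetical numbers chosen for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical ratings and predictions, for illustration only
demo_true = np.array([94, 93, 92, 91, 48])
demo_pred = np.array([92.0, 93.5, 90.0, 91.0, 50.0])

# Mean absolute error: average size of the miss, in rating points
demo_mae = mean_absolute_error(demo_true, demo_pred)
# Root mean squared error: penalizes large misses more heavily
demo_rmse = np.sqrt(mean_squared_error(demo_true, demo_pred))
demo_r2 = r2_score(demo_true, demo_pred)

print("MAE:", demo_mae)
print("RMSE:", demo_rmse)
print("R²:", demo_r2)
```

Both MAE and RMSE are expressed in the same units as the target, so they can be read directly as rating points when applied to the notebook's `y_test` and `y_pred`.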


In [40]:
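# SAVE THE TRAINED REGRESSION PIPELINE; THE 'with' BLOCK CLOSES THE FILE AFTER WRITING
with open("regression_pipeline.pkl", "wb") as f:
    pickle.dump(reg_pipeline, f)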