Adel Movahedian 400102074

---

# Bonus Assignment 8

---

In [9]:
# -------------------------------------------------
# 0. Environment setup
# -------------------------------------------------
import re
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report
import lightgbm as lgb
# -------------------------------------------------
# 1. Load dataset
# -------------------------------------------------
df = pd.read_csv('fifa19.csv')
# -------------------------------------------------
# 2. Keep only the mapped positions (collapsed into 9 target classes)
# -------------------------------------------------
map_pos = {
    "ST":"CF", "LS":"LF", "RS":"RF", "LF":"LF", "RF":"RF",
    "LW":"LF", "RW":"RF", "LAM":"LF", "RAM":"RF",
    "LCM":"CM", "RCM":"CM", "CM":"CM",
    "LDM":"CDM", "RDM":"CDM", "CDM":"CDM",
    "LCB":"CB", "RCB":"CB", "CB":"CB",
    "LWB":"LB", "RWB":"RB", "LB":"LB", "RB":"RB",
    "GK":"GK"
}
df = df[df["Position"].isin(map_pos)].copy()
df["Position"] = df["Position"].map(map_pos)
print("After position filtering:", df.shape)

After position filtering: (14896, 89)
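
As a quick sanity check on the mapping above, the following minimal sketch counts how many distinct target labels `map_pos` can produce. It copies the dictionary from the cell and uses only the standard library; the helper name `target_classes` is introduced here purely for illustration.

```python
# Copy of the position mapping used in the filtering cell.
map_pos = {
    "ST": "CF", "LS": "LF", "RS": "RF", "LF": "LF", "RF": "RF",
    "LW": "LF", "RW": "RF", "LAM": "LF", "RAM": "RF",
    "LCM": "CM", "RCM": "CM", "CM": "CM",
    "LDM": "CDM", "RDM": "CDM", "CDM": "CDM",
    "LCB": "CB", "RCB": "CB", "CB": "CB",
    "LWB": "LB", "RWB": "RB", "LB": "LB", "RB": "RB",
    "GK": "GK",
}


def target_classes(mapping):
    """Return the sorted distinct target labels of a raw-to-target mapping."""
    return sorted(set(mapping.values()))


classes = target_classes(map_pos)
print(len(map_pos), "raw positions ->", len(classes), "target classes")
print(classes)
```

The mapping has 23 raw positions and 9 distinct target labels, which are the same nine classes listed in the classification report.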


In [10]:
# -------------------------------------------------
# 3. Feature engineering
# -------------------------------------------------
# 3a. Weight → kg
if df["Weight"].dtype == object:
    df["Weight"] = df["Weight"].str.replace("lbs", "", regex=False).astype(float) * 0.4536

# 3b. Drop ultra‑unique ID‑like columns
DROP_COLS = ["ID", "Name", "Club", "Photo", "Flag", "Club Logo"]
df.drop(columns=[c for c in DROP_COLS if c in df.columns], inplace=True)

# 3c. Sanitise column names so LightGBM doesn’t complain
safe_cols = {c: re.sub(r"[^A-Za-z0-9_]+", "_", c) for c in df.columns}
df.rename(columns=safe_cols, inplace=True)

# 3d. Cast remaining object columns to pandas 'category'
cat_cols = [c for c in df.select_dtypes("object").columns if c != "Position"]
for c in cat_cols:
    df[c] = df[c].astype("category")
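
One thing worth checking in step 3c is that the regex can map two different column names to the same sanitised name, which would leave duplicate columns after the rename. The sketch below shows one way to detect such clashes before renaming. The helper names `sanitise` and `find_collisions` and the example column names are hypothetical and chosen only to illustrate the idea; they are not taken from the dataset.

```python
import re
from collections import defaultdict


def sanitise(name):
    """Replace every run of characters outside [A-Za-z0-9_] with one underscore."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name)


def find_collisions(columns):
    """Map each sanitised name to its original names, keeping only clashes."""
    groups = defaultdict(list)
    for col in columns:
        groups[sanitise(col)].append(col)
    return {new: olds for new, olds in groups.items() if len(olds) > 1}


# Hypothetical column names: one clean rename and one deliberate clash.
example_cols = ["Preferred Foot", "Work Rate", "Work-Rate", "Age"]
print([sanitise(c) for c in example_cols])
print(find_collisions(example_cols))
```

In the notebook, calling `find_collisions(df.columns)` before the rename should return an empty dictionary if the sanitised names are all unique.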

In [11]:
# -------------------------------------------------
# 4. Split 80 / 20
# -------------------------------------------------
X = df.drop(columns=["Position"])
y = df["Position"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print("Train:", X_train.shape, "Test:", X_test.shape)

Train: (11916, 82) Test: (2980, 82)
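
The printed shapes can be reproduced by hand. The sketch below is a rough check, assuming that a fractional `test_size` is rounded up to a whole number of rows with the remainder going to the training set (which is how `train_test_split` is understood to behave here), and that all six names in `DROP_COLS` were present in the CSV. The helper `split_sizes` is introduced only for this illustration.

```python
from math import ceil


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test): test size rounded up, train gets the rest."""
    n_test = ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_rows, n_cols = 14896, 89           # shape printed after position filtering
n_dropped = 6                        # assumes all six DROP_COLS existed
n_features = n_cols - n_dropped - 1  # minus the target column "Position"

n_train, n_test = split_sizes(n_rows, 0.20)
print("Train:", (n_train, n_features), "Test:", (n_test, n_features))
```

Under these assumptions the numbers agree with the output above: 11916 training rows, 2980 test rows, and 82 feature columns.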


In [12]:
# -------------------------------------------------
# 5. LightGBM
# -------------------------------------------------
params = dict(
    n_estimators=400,
    learning_rate=0.07,
    num_leaves=64,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="multiclass",
    metric="None",
    random_state=42,
    n_jobs=-1,
    verbosity=-1,   # turn off info & warning lines
)
model = lgb.LGBMClassifier(**params)
model.fit(X_train, y_train, categorical_feature=cat_cols)

In [13]:
# -------------------------------------------------
# 6. Evaluation
# -------------------------------------------------
y_pred = model.predict(X_test)
print("\nMacro‑F1:", round(f1_score(y_test, y_pred, average="macro"), 3))
print("\nPer‑class scores:\n", classification_report(y_test, y_pred, digits=2))


Macro‑F1: 0.71

Per‑class scores:
               precision    recall  f1-score   support

          CB       0.91      0.90      0.91       618
         CDM       0.64      0.53      0.58       288
          CF       0.78      0.96      0.86       430
          CM       0.72      0.80      0.76       436
          GK       1.00      1.00      1.00       405
          LB       0.89      0.87      0.88       280
          LF       0.41      0.22      0.29       125
          RB       0.79      0.86      0.82       276
          RF       0.39      0.22      0.28       122

    accuracy                           0.81      2980
   macro avg       0.73      0.71      0.71      2980
weighted avg       0.80      0.81      0.80      2980
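
To make the reported numbers easier to interpret, the sketch below recomputes the macro-averaged F1 as the unweighted mean of the per-class F1 scores, and one per-class F1 from its precision and recall. The values are copied from the report above at two-decimal precision, so the recomputed figures are approximate; the helper names `f1` and `macro_average` are introduced only for this illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores):
    """Unweighted mean: every class counts equally, regardless of support."""
    return sum(scores) / len(scores)


# Per-class F1 values copied from the classification report (two decimals).
report_f1 = {
    "CB": 0.91, "CDM": 0.58, "CF": 0.86, "CM": 0.76, "GK": 1.00,
    "LB": 0.88, "LF": 0.29, "RB": 0.82, "RF": 0.28,
}

print("Macro-F1:", round(macro_average(list(report_f1.values())), 2))
print("LF F1 from precision/recall:", round(f1(0.41, 0.22), 2))
```

Because the macro average ignores class support, the two small and poorly predicted classes (LF and RF) pull the macro F1 down to about 0.71, even though the accuracy is 0.81.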



---

Below is a brief explanation of what the code does:
1. **Data Loading**: The code imports the required libraries and loads the FIFA 19 dataset from the local file `fifa19.csv` into a pandas DataFrame.

2. **Position Mapping**: It keeps only the players whose position appears in `map_pos` and maps the 23 raw positions onto 9 target classes (CB, CDM, CF, CM, GK, LB, LF, RB, RF), so that variants such as LCB, RCB and CB share one label.

3. **Feature Engineering**: Weight is converted from pounds to kilograms, ID-like columns (`ID`, `Name`, `Club`, `Photo`, `Flag`, `Club Logo`) are dropped, and column names are sanitized by replacing special characters with underscores so that they do not cause problems during model training.

4. **Categorical Encoding**: The remaining object-typed columns (all except the target) are cast to the pandas 'category' dtype, which is more memory-efficient and lets LightGBM treat them as categorical features directly.

5. **Data Splitting**: The dataset is split into training and testing sets using an 80/20 stratified split, ensuring that the distribution of player positions is consistent across both sets.

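6. **Model Initialization**: A LightGBM classifier is initialized with a multiclass objective, 400 estimators, a learning rate of 0.07, 64 leaves per tree, and a column subsampling rate of 0.8 per tree. Note that in LightGBM row subsampling only takes effect when `subsample_freq` is positive, so with its default of 0 the `subsample=0.8` setting is inactive.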

7. **Model Training**: The model is fitted on the training data with the categorical columns passed explicitly via `categorical_feature`. The `verbosity=-1` setting, given at initialization, suppresses LightGBM's info and warning messages.

8. **Model Evaluation**: The model is evaluated on the test set with the macro-averaged F1 score and a per-class classification report. It reaches a macro F1 of 0.71 and an accuracy of 0.81. GK is classified perfectly, while LF and RF are the weakest classes (F1 of 0.29 and 0.28).

Together, these steps prepare the FIFA 19 data and train a LightGBM model that predicts a player's position from the remaining attributes.