# Rank prediction using CatBoost Classifier

-- Nishant Ragate

<b><u>Key components of the model</u></b>

Data Preprocessing: After imputing missing values using KNN. We encode categorical variables, and scale numerical features to ensure compatibility with the CatBoost model.

Feature selection: "Name" and "College" are removed and relevant one's are selected based on their importance in predicting the target variable, ensuring that the model focuses on the most informative aspects of the data while reducing complexity and overfitting.

Target feature: The target feature, initially "Round," is later updated to "Pick" to better reflect the desired outcome, which is predicting the order in which players are selected in a draft.

Dataset split: The dataset is split into training and testing sets, with the training set used to train the model and the testing set used to evaluate its performance.


Additionally, a 5-Fold cross-validation strategy is employed during hyperparameter tuning to ensure robustness and generalizability.

In [2]:
import pandas as pd

# Read the CSV file
df = pd.read_csv("data/imputed_data.csv")
print(df.columns)

Index(['Name', 'Position', 'College', 'Round', 'Pick', 'Stat URL', 'Height',
       'Weight', '40 Yard Dash', 'Bench Press', 'Vertical Jump', 'Broad Jump',
       '3 Cone Drill', 'Shuttle', 'conf_abbr', 'games', 'seasons',
       'tackles_solo', 'tackles_assists', 'tackles_total', 'tackles_loss',
       'sacks', 'def_int', 'def_int_yds', 'def_int_td', 'pass_defended',
       'fumbles_rec', 'fumbles_rec_yds', 'fumbles_rec_td', 'fumbles_forced',
       'rec', 'rec_yds', 'rec_yds_per_rec', 'rec_td', 'rush_att', 'rush_yds',
       'rush_yds_per_att', 'rush_td', 'scrim_att', 'scrim_yds',
       'scrim_yds_per_att', 'scrim_td', 'Year'],
      dtype='object')


In [2]:
df.head

In [5]:
df.loc[df.Round != 1, "Round"] = 0

# Dropping the columns which donot contribute in prediction
all_X = df.drop(["Name", "Round", "Pick", "College"], axis=1)
all_X = pd.get_dummies(all_X)

# Splitting testing and training sets
train_X = all_X[(all_X.Year != 2023)].drop(["Year"], axis=1)
test_X = all_X[all_X.Year == 2023].drop(["Year"], axis=1)
train_y = df[(df.Year != 2023)].Round
test_y = df[df.Year == 2023].Round

In [None]:
train_X.head()

In [None]:
test_X.head()

In [3]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.2/98.2 MB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.2.5


In [6]:
from catboost import CatBoostClassifier

# Initialize the baseline CatBoost classifier with custom parameters
baseline_catboost = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    l2_leaf_reg=3,
    border_count=32,
    thread_count=4,
    verbose=False
)

# Train the baseline CatBoost model
baseline_catboost.fit(train_X, train_y)

<catboost.core.CatBoostClassifier at 0x7d71f98a5c60>

In [7]:
# Make predictions on test data
baseline_preds = baseline_catboost.predict_proba(test_X)

# Ranking done according to the probability scores
count = 1
for i in pd.DataFrame(baseline_preds).sort_values(by=1, ascending=False).index:
    print(f"{count} {df[df.Year == 2023].reset_index().at[i, 'Name']}")
    count += 1

1 Jakorian Bennett
2 Will Anderson Jr.
3 DJ Turner
4 Emmanuel Forbes
5 Anthony Richardson
6 Marvin Mims
7 Bryce Young
8 Isaiah Foskey
9 Carrington Valentine
10 Kelee Ringo
11 Christian Gonzalez
12 Trenton Simpson
13 Darnell Wright
14 Tyler Steen
15 C.J. Stroud
16 Tavius Robinson
17 Deonte Banks
18 YaYa Diaby
19 Rejzohn Wright
20 Tre'Vius Hodges-Tomlinson
21 Josh Downs
22 Lukas Van Ness
23 Cory Trice
24 Darrell Luter Jr.
25 Bijan Robinson
26 Yasir Abdullah
27 Robert Beal
28 Ali Gaye
29 Jalin Hyatt
30 Byron Young
31 Ryan Hayes
32 Quentin Johnston
33 Nathaniel Dell
34 Wanya Morris
35 Jalen Carter
36 Anthony Bradford
37 Adetomiwa Adebawore
38 Malaesala Aumavae-Laulu
39 Matthew Bergeron
40 BJ Ojulari
41 Jonathan Mingo
42 Carter Warren
43 Rakim Jarrett
44 Tyler Scott
45 Broderick Jones
46 John Ojukwu
47 Jon Gaines
48 Gervon Dexter
49 Bryan Bresee
50 Nick Hampton
51 Nolan Smith
52 Felix Anudike-Uzomah
53 Jartavius Martin
54 Terell Smith
55 Jason Taylor II
56 Tanner McKee
57 Anton Harrison
58 

In [10]:
from tabulate import tabulate
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Convert predicted probabilities to binary predictions based on a threshold (e.g., 0.5)
predicted_labels = (baseline_preds[:, 1] > 0.5).astype(int)

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-baseline_preds[:, 1])
k = 10
num_relevant = sum(test_y)

# Define ranking evaluation functions
def calculate_MRR(sorted_indices, test_y):
    mrr = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:
            mrr = 1 / (idx + 1)
            break
    return mrr

def calculate_MAP(sorted_indices, test_y):
    ap = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:
            ap += sum(test_y.iloc[:idx + 1]) / (idx + 1)
    map_score = ap / num_relevant
    return map_score

def calculate_NDCG(sorted_indices, test_y):
    dcg = 0
    idcg = sum(1 / np.log2(np.arange(2, k + 2)))
    for idx, i in enumerate(sorted_indices[:k]):
        if test_y.iloc[i] == 1:
            dcg += 1 / np.log2(idx + 2)
    ndcg = dcg / idcg
    return ndcg

def calculate_PAK(sorted_indices, test_y):
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    precision_at_k = tp_at_k / k
    return precision_at_k

def calculate_RAK(sorted_indices, test_y):
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    recall_at_k = tp_at_k / num_relevant
    return recall_at_k

# Calculate all measurements
baseline_measurements = [
    ("Accuracy", accuracy_score(test_y, predicted_labels)),
    ("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Baseline measurements")
print(tabulate(baseline_measurements, headers=["Metric", "Value"]))

Baseline measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.897887
ROC AUC Score                                         0.777552
Mean Reciprocal Rank (MRR)                            0.5
Mean Average Precision (MAP)                          0.123912
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.392158
Precision at k (P@k) at k=10                          0.4
Recall at k (R@k) at k=10                             0.137931


In [28]:
from sklearn.model_selection import cross_val_score

# Find the best parameters for CatBoost
best_catboost = CatBoostClassifier(
    iterations=1500,
    learning_rate=0.1,
    depth=10,
    l2_leaf_reg=1,
    border_count=128,
    scale_pos_weight=(len(train_y) - sum(train_y)) / sum(train_y),
    thread_count=4,
    verbose=False
)

# Cross-validation on the best CatBoost model
best_scores = cross_val_score(best_catboost, train_X, train_y, cv=5)

best_catboost.fit(train_X, train_y)

<catboost.core.CatBoostClassifier at 0x7d7194e3b0a0>

In [29]:
# Make predictions on test data
preds = best_catboost.predict_proba(test_X)

# Ranking done according to the probability scores after tuning the hyperparameters
count = 1
for i in pd.DataFrame(preds).sort_values(by=1, ascending=False).index:
    print(f"{count} {df[df.Year == 2023].reset_index().at[i, 'Name']}")
    count += 1

1 Darnell Wright
2 Anthony Richardson
3 Jakorian Bennett
4 Bryce Young
5 Tyler Steen
6 Richard Gouraige
7 YaYa Diaby
8 C.J. Stroud
9 Byron Young
10 Emmanuel Forbes
11 Nolan Smith
12 Wanya Morris
13 Ryan Hayes
14 Anton Harrison
15 Kelee Ringo
16 Ali Gaye
17 Christian Gonzalez
18 Malaesala Aumavae-Laulu
19 Rejzohn Wright
20 Nathaniel Dell
21 Brandon Hill
22 Marvin Mims
23 Rakim Jarrett
24 Anthony Bradford
25 Matthew Bergeron
26 Carter Warren
27 Sam LaPorta
28 Jay Ward
29 Bijan Robinson
30 Tavius Robinson
31 Jalin Hyatt
32 Isaiah Foskey
33 Jaren Hall
34 DJ Turner
35 Joe Tippmann
36 Zacch Pickens
37 Trenton Simpson
38 Paris Johnson Jr.
39 Rashee Rice
40 Asim Richards
41 Tanner McKee
42 Calijah Kancey
43 Jon Gaines
44 Broderick Jones
45 A.T. Perry
46 Robert Beal
47 Michael Mayer
48 Adetomiwa Adebawore
49 Jalen Carter
50 Tyler Scott
51 Thomas Incoom
52 Darnell Washington
53 Myles Murphy
54 Jerrod Clark
55 Devon Achane
56 Warren McClendon
57 Mazi Smith
58 Carrington Valentine
59 Peter Skorons

In [30]:
# Make predictions on test data
preds = best_catboost.predict(test_X)

# Calculate accuracy
accuracy = (preds == test_y).mean()

# Evaluation metrics
baseline_measurements = [
    ("Accuracy", accuracy)
]

# Print baseline measurements
print("Baseline CatBoost measurements")
print(tabulate(baseline_measurements, headers=["Metric", "Value"]))

Baseline CatBoost measurements
Metric       Value
--------  --------
Accuracy  0.901408


In [31]:
preds

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0])

In [32]:
# Convert predicted probabilities to binary predictions based on a threshold (e.g., 0.5)
predicted_labels = (preds > 0.5).astype(int)

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-preds)

# Calculate all measurements
best_rf_measurements = [
	("Accuracy", accuracy_score(test_y, predicted_labels)),
	("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
	("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
	("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
	("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
	("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
	("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]
# Print measurements in a table format
print("Best Fit measurements")
print(tabulate(best_rf_measurements, headers=["Metric", "Value"]))



Best Fit measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.901408
ROC AUC Score                                         0.777552
Mean Reciprocal Rank (MRR)                            0.5
Mean Average Precision (MAP)                          0.103692
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.55778
Precision at k (P@k) at k=10                          0.6
Recall at k (R@k) at k=10                             0.206897


# Comparative Analysis of Baseline and Best-Fit CatBoost Models for Ranking Prediction

The comparison between the baseline and best-fit CatBoost models reveals notable differences in performance across various metrics. In terms of accuracy the best fit model does slightly better than the baseline, and ROC AUC score in both models exhibit similar results. However, significant improvements are observed in the best-fit model for ranking-related metrics. The Mean Reciprocal Rank (MRR) shows a substantial increase, indicating that the best-fit model provides more relevant and accurate predictions at the top of the ranked list compared to the baseline. Similarly, the Mean Average Precision (MAP) and Precision at k (P@k) at k=10 metrics demonstrate considerable enhancements, implying better precision in predicting relevant instances within the top results. Moreover, the Normalized Discounted Cumulative Gain (NDCG) at k=10 reflects a notable improvement, suggesting that the best-fit model produces more relevant results at the top ranks, which is crucial for ranking tasks. Despite these improvements, the recall at k (R@k) at k=10 remains relatively low for both models, indicating a challenge in capturing all relevant instances within the top k results.

Overall, while the baseline model provides reasonable predictive performance, the best-fit Random Forest model significantly enhances the model's ability to accurately rank and prioritize instances, particularly at the top of the list, thereby improving its utility in predicting NFL Draft.
