# **Rank prediction using XGBoost Classifier**

-- Bhavya Batta

**Key components of the model**

**Data Preprocessing**: Missing values are filled in using K-nearest neighbors, and categorical data is transformed into numeric form through pandas' get_dummies method. This transformation ensures the data is properly formatted for model input.

**Feature selection**:  Features deemed irrelevant, including "Name" and "College," are omitted from the prediction process. Furthermore, "Round" and "Pick" are not considered as they contribute to the target feature.

**Target feature:** Our current goal is to predict rankings based on the "Round" feature. We intend to include "Pick" as part of the target variable in the upcoming final version.

**Dataset split:** Given that this is a ranking problem, the training dataset includes all years except for 2023. Data from 2023 will be used solely for predicting the rank.

The **hyperparameters** are tuned using cross-validation. The disparity between baseline measurements and best-fit measurements demonstrates an improvement in accuracy and other metrics following 5-Fold cross-validation.

Note: This project is ongoing, with objectives to enhance measurement criteria, replace accuracy with ranking metrics in Cross-Validation, and incorporate "Pick" into the target feature for improvement.

# **Comparative Analysis of Baseline and best-fit XGBoost models.**

In [2]:
import pandas as pd

# Read the CSV file
df = pd.read_csv("data/imputed_data.csv")
print(df.columns)

Index(['Name', 'Position', 'College', 'Round', 'Pick', 'Stat URL', 'Height',
       'Weight', '40 Yard Dash', 'Bench Press', 'Vertical Jump', 'Broad Jump',
       '3 Cone Drill', 'Shuttle', 'conf_abbr', 'games', 'seasons',
       'tackles_solo', 'tackles_assists', 'tackles_total', 'tackles_loss',
       'sacks', 'def_int', 'def_int_yds', 'def_int_td', 'pass_defended',
       'fumbles_rec', 'fumbles_rec_yds', 'fumbles_rec_td', 'fumbles_forced',
       'rec', 'rec_yds', 'rec_yds_per_rec', 'rec_td', 'rush_att', 'rush_yds',
       'rush_yds_per_att', 'rush_td', 'scrim_att', 'scrim_yds',
       'scrim_yds_per_att', 'scrim_td', 'Year'],
      dtype='object')


In [3]:
df.head()

Unnamed: 0,Name,Position,College,Round,Pick,Stat URL,Height,Weight,40 Yard Dash,Bench Press,...,rec_td,rush_att,rush_yds,rush_yds_per_att,rush_td,scrim_att,scrim_yds,scrim_yds_per_att,scrim_td,Year
0,Emmanuel Acho,OLB,Texas,6,204,https://www.sports-reference.com/cfb/players/e...,74.0,238.0,4.64,24.0,...,5.29,199.2,1282.58,8.83,14.91,239.71,1747.91,8.22,20.2,2012
1,Joe Adams,WR,Arkansas,4,104,https://www.sports-reference.com/cfb/players/j...,71.0,179.0,4.51,14.59,...,8.5,4.0,69.5,11.65,0.0,96.0,1393.5,14.45,8.5,2012
2,Chas Alecxih,DT,Pittsburgh,0,0,https://www.sports-reference.com/cfb/players/c...,76.0,296.0,5.31,19.0,...,0.0,1.19,5.2,-0.68,0.36,1.36,5.55,0.86,0.36,2012
3,Frank Alexander,DE,Oklahoma,4,103,https://www.sports-reference.com/cfb/players/f...,76.0,270.0,4.8,24.48,...,2.17,22.98,75.37,4.12,4.24,36.81,231.59,6.49,6.41,2012
4,Antonio Allen,S,South Carolina,7,242,https://www.sports-reference.com/cfb/players/a...,73.0,210.0,4.58,17.0,...,1.68,374.69,2061.25,4.94,19.21,420.39,2397.36,6.43,20.89,2012


In [4]:
df.loc[df.Round != 1, "Round"] = 0

# Dropping the columns which donot contribute in prediction
all_X = df.drop(["Name", "Round", "Pick", "College"], axis=1)
all_X = pd.get_dummies(all_X)

# Splitting testing and training sets
train_X = all_X[(all_X.Year != 2023)].drop(["Year"], axis=1)
test_X = all_X[all_X.Year == 2023].drop(["Year"], axis=1)
train_y = df[(df.Year != 2023)].Round
test_y = df[df.Year == 2023].Round

In [38]:
train_X.head()

Unnamed: 0,Height,Weight,40 Yard Dash,Bench Press,Vertical Jump,Broad Jump,3 Cone Drill,Shuttle,games,seasons,...,conf_abbr_CUSA,conf_abbr_Ind,conf_abbr_MAC,conf_abbr_MVC,conf_abbr_MWC,conf_abbr_Pac-10,conf_abbr_Pac-12,conf_abbr_SEC,conf_abbr_Sun Belt,conf_abbr_WAC
0,74.0,238.0,4.64,24.0,35.5,118.0,7.13,4.28,37.0,3.0,...,False,False,False,False,False,False,False,False,False,False
1,71.0,179.0,4.51,14.59,36.0,123.0,7.09,4.12,40.0,4.0,...,False,False,False,False,False,False,False,False,False,False
2,76.0,296.0,5.31,19.0,25.5,99.0,7.74,4.62,34.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3,76.0,270.0,4.8,24.48,31.13,115.26,7.19,4.48,37.0,4.0,...,False,False,False,False,False,False,False,False,False,False
4,73.0,210.0,4.58,17.0,34.0,118.0,7.02,4.25,42.0,4.0,...,False,False,False,False,False,False,False,False,False,False


In [39]:
test_X.head()

Unnamed: 0,Height,Weight,40 Yard Dash,Bench Press,Vertical Jump,Broad Jump,3 Cone Drill,Shuttle,games,seasons,...,conf_abbr_CUSA,conf_abbr_Ind,conf_abbr_MAC,conf_abbr_MVC,conf_abbr_MWC,conf_abbr_Pac-10,conf_abbr_Pac-12,conf_abbr_SEC,conf_abbr_Sun Belt,conf_abbr_WAC
3400,70.0,216.0,4.51,19.42,33.64,115.58,7.03,4.28,31.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3401,73.0,237.0,4.47,17.09,36.5,129.0,7.22,4.25,53.0,5.0,...,False,False,False,False,False,False,False,False,False,False
3402,69.0,188.0,4.32,14.92,33.0,119.26,7.02,4.19,30.0,3.0,...,False,False,False,False,False,False,False,True,False,False
3403,71.0,173.0,4.49,15.14,34.0,122.0,7.0,4.16,35.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3404,74.0,282.0,4.49,27.0,37.5,125.0,7.22,4.47,36.0,4.0,...,False,False,False,False,False,False,False,False,False,False


In [5]:
import numpy as np

# Assuming y_train is a numpy array or a pandas Series
class_counts = np.unique(train_y, return_counts=True)

# Print class labels and their counts
for label, count in zip(class_counts[0], class_counts[1]):
    print(f"Class {label}: {count} instances")

Class 0: 3064 instances
Class 1: 336 instances


In [8]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-2.0.3-py3-none-macosx_12_0_arm64.whl.metadata (2.0 kB)
Downloading xgboost-2.0.3-py3-none-macosx_12_0_arm64.whl (1.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m14.9 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: xgboost
Successfully installed xgboost-2.0.3
Note: you may need to restart the kernel to use updated packages.


In [9]:
import xgboost as xgb
from sklearn.model_selection import cross_val_score

In [19]:
# Initialize the baseline XGBoost classifier with custom parameters
baseline_XGB = xgb.XGBClassifier(colsample_bytree=0.7,
 eta= 0.001,
 eval_metric= 'mae',
 max_depth= 6,
 min_child_weight= 15,
 objective= 'binary:logistic',
 subsample= 0.7)

baseline_XGB.fit(train_X, train_y)

In [34]:
from sklearn.metrics import accuracy_score

# Make predictions on test data
baseline_preds = baseline_XGB.predict_proba(test_X)
count = 1

# Ranking done according to the probability scores
for i in pd.DataFrame(baseline_preds).sort_values(by=1, ascending=False).index:
    print(str(count) + " " + str(df[df.Year==2023].reset_index().at[i, "Name"]))
    count += 1

1 Christian Gonzalez
2 Marvin Mims
3 Emmanuel Forbes
4 Jakorian Bennett
5 DJ Turner
6 Anthony Richardson
7 Jalin Hyatt
8 Byron Young
9 Keaton Mitchell
10 Kelee Ringo
11 Tyler Scott
12 Carrington Valentine
13 Dawand Jones
14 Brandon Hill
15 Quentin Johnston
16 Tyler Steen
17 Trenton Simpson
18 Tavius Robinson
19 Devon Achane
20 Jahmyr Gibbs
21 Asim Richards
22 Eli Ricks
23 Josh Downs
24 Lukas Van Ness
25 Wanya Morris
26 Blake Freeland
27 Rakim Jarrett
28 John Ojukwu
29 Luke Schoonmaker
30 Rejzohn Wright
31 BJ Ojulari
32 YaYa Diaby
33 Joe Tippmann
34 Anton Harrison
35 Jon Gaines
36 Ali Gaye
37 C.J. Stroud
38 Nick Herbig
39 Carter Warren
40 Malaesala Aumavae-Laulu
41 Peter Skoronski
42 Broderick Jones
43 Darrell Luter Jr.
44 Dontayvion Wicks
45 Nolan Smith
46 Gervon Dexter
47 Will Anderson Jr.
48 Isaiah Foskey
49 Bijan Robinson
50 Michael Mayer
51 Thomas Incoom
52 Darnell Wright
53 Bryce Ford-Wheaton
54 Ryan Hayes
55 Yasir Abdullah
56 Isaiah McGuire
57 Darnell Washington
58 Matt Landers
5

In [36]:
print(baseline_preds)

[[0.8406071  0.15939291]
 [0.8347372  0.16526279]
 [0.83031726 0.16968271]
 [0.8404305  0.15956952]
 [0.83618313 0.16381687]
 [0.8395903  0.16040972]
 [0.8419374  0.15806259]
 [0.83860916 0.16139083]
 [0.84091175 0.15908827]
 [0.8376796  0.16232036]
 [0.83310044 0.16689959]
 [0.84568405 0.15431592]
 [0.8424343  0.15756573]
 [0.8378336  0.16216642]
 [0.83551794 0.16448204]
 [0.84077954 0.15922044]
 [0.8459727  0.1540273 ]
 [0.8444205  0.15557952]
 [0.8449366  0.15506339]
 [0.8390686  0.16093141]
 [0.84309506 0.15690497]
 [0.8226664  0.17733356]
 [0.84324485 0.15675515]
 [0.8414581  0.1585419 ]
 [0.83726907 0.16273093]
 [0.8415219  0.15847807]
 [0.84198487 0.15801516]
 [0.8410558  0.15894417]
 [0.8403734  0.15962657]
 [0.84318686 0.15681316]
 [0.83585775 0.16414227]
 [0.8389463  0.16105373]
 [0.839421   0.16057901]
 [0.83582556 0.16417444]
 [0.8408768  0.15912315]
 [0.8355671  0.16443287]
 [0.8386291  0.16137087]
 [0.8381932  0.16180685]
 [0.8360106  0.16398945]
 [0.84200275 0.15799728]


In [58]:
predicted_labels = (baseline_preds[:, 1] > 0.16).astype(int)

In [54]:
print(predicted_labels)

[0 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 1 1 0 1 1
 1 1 0 1 0 0 0 1 0 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 1 1
 1 0 0 1 1 1 1 0 0 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 0 1 0 0 1 1 0 1 1
 0 0 0 0 1 0 1 0 0 1 0 1 1 0 1 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1
 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0 1 1 0 0
 0 1 1 1 0 0 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1
 1 1 0 1 1 1 0 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 1
 0 1 1 1 0 1 0 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 0 1 0]


In [55]:
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Convert predicted probabilities to binary predictions based on a threshold (e.g., 0.5)
predicted_labels = (baseline_preds[:, 1] > 0.5).astype(int)

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-baseline_preds[:, 1])
k = 10
num_relevant = sum(test_y)

def calculate_MRR(sorted_indices, test_y):
    # Calculate Mean Reciprocal Rank (MRR)
    mrr = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:  # Use iloc to access test_y by index
            mrr = 1 / (idx + 1)
            break
    return mrr

def calculate_MAP(sorted_indices, test_y):
    # Calculate Mean Average Precision (MAP)
    ap = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:
            ap += sum(test_y.iloc[:idx + 1]) / (idx + 1)
    map_score = ap / num_relevant
    return map_score

def calculate_NDCG(sorted_indices, test_y):
    # Calculate Normalized Discounted Cumulative Gain (NDCG) at k=10
    dcg = 0
    idcg = sum(1 / np.log2(np.arange(2, k + 2)))
    for idx, i in enumerate(sorted_indices[:k]):
        if test_y.iloc[i] == 1:
            dcg += 1 / np.log2(idx + 2)
    ndcg = dcg / idcg
    return ndcg

def calculate_PAK(sorted_indices, test_y):
    # Calculate Precision at k (P@k)
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    precision_at_k = tp_at_k / k
    return precision_at_k

def calculate_RAK(sorted_indices, test_y):
    # Calculate Recall at k (R@k)
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    recall_at_k = tp_at_k / num_relevant
    return recall_at_k

In [56]:
pip install tabulate

Note: you may need to restart the kernel to use updated packages.


In [59]:
from tabulate import tabulate

# Calculate all measurements
baseline_measurements = [
    ("Accuracy", accuracy_score(test_y, predicted_labels)),
    ("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Baseline measurements")
print(tabulate(baseline_measurements, headers=["Metric", "Value"]))

Baseline measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.559859
ROC AUC Score                                         0.776065
Mean Reciprocal Rank (MRR)                            1
Mean Average Precision (MAP)                          0.107357
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.408536
Precision at k (P@k) at k=10                          0.3
Recall at k (R@k) at k=10                             0.103448


In [62]:
best_XGB = xgb.XGBClassifier(
    colsample_bytree=0.8,
    eta=0.1,
    eval_metric='logloss',
    max_depth=6,
    min_child_weight=1,
    objective='binary:logistic',
    subsample=0.8
)

# Hypertuning parameters using 5-Fold Cross Validation method
scores = cross_val_score(best_XGB, train_X, train_y, cv=5)

best_XGB.fit(train_X, train_y)

In [65]:
# Predict on the testing data
y_pred = best_XGB.predict(test_X)

In [75]:
# Predicting the probabilities of Test set
preds = best_XGB.predict_proba(test_X)
count = 1

# Ranking done according to the probability scores
for i in pd.DataFrame(preds).sort_values(by=1, ascending=False).index:
    print(str(count) + " " + str(df[df.Year==2023].reset_index().at[i, "Name"]))
    count += 1

1 Emmanuel Forbes
2 Byron Young
3 Anthony Richardson
4 Darnell Wright
5 Dawand Jones
6 Jakorian Bennett
7 YaYa Diaby
8 Kelee Ringo
9 Jalin Hyatt
10 Tyler Steen
11 Will Anderson Jr.
12 Joe Tippmann
13 Marvin Mims
14 Carter Warren
15 Anton Harrison
16 Michael Mayer
17 Quentin Johnston
18 DJ Turner
19 Christian Gonzalez
20 Richard Gouraige
21 Bryce Young
22 Blake Freeland
23 C.J. Stroud
24 Lukas Van Ness
25 Malaesala Aumavae-Laulu
26 Thomas Incoom
27 Wanya Morris
28 Anthony Bradford
29 Broderick Jones
30 Jalen Redmond
31 Trenton Simpson
32 Adetomiwa Adebawore
33 Jon Gaines
34 Isaiah Foskey
35 Darnell Washington
36 Carrington Valentine
37 Paris Johnson Jr.
38 Zach Harrison
39 Rakim Jarrett
40 Jonathan Mingo
41 Zacch Pickens
42 Tyree Wilson
43 Nathaniel Dell
44 Tavius Robinson
45 Ikenna Enechukwu
46 Peter Skoronski
47 Asim Richards
48 Will Levis
49 Matthew Bergeron
50 Devon Achane
51 Gervon Dexter
52 Rashee Rice
53 Zach Charbonnet
54 Rejzohn Wright
55 John Ojukwu
56 Tanner McKee
57 Dalton K

In [78]:
best_predicted_labels = (preds[:, 1] > 0.16).astype(int)
best_predicted_labels

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0])

In [80]:
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-preds[:, 1])

# Calculate all measurements
best_rf_measurements = [
    ("Accuracy", accuracy_score(test_y, best_predicted_labels)),
    ("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Best Fit measurements")
print(tabulate(best_rf_measurements, headers=["Metric", "Value"]))

Best Fit measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.862676
ROC AUC Score                                         0.776065
Mean Reciprocal Rank (MRR)                            1
Mean Average Precision (MAP)                          0.11541
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.424926
Precision at k (P@k) at k=10                          0.3
Recall at k (R@k) at k=10                             0.103448


# **Comparative Analysis of Baseline and Best-Fit Random Forest Models for Ranking Prediction**

**Accuracy**: Baseline is slightly higher, indicating it correctly classified a marginally higher percentage of the total.
ROC AUC Score: Both results are identical, showing the same ability to discriminate between classes.

**MRR**: Baseline is perfect, indicating it always ranks the correct item highest. Baseline result shows a significant drop, which could be critical if the goal is to rank a correct item as high as possible.

**MAP:** Best fit is slightly better, indicating a slight improvement in the ranking of relevant items across queries.

**NDCG at k=10:** Best fit is higher, showing it ranks relevant items more effectively within the top 10 positions.

**P@k at k=10:** Best fit is significantly higher, suggesting it has a better top-10 precision.

**R@k at k=10:** Best fit is also higher here, indicating it retrieves a higher proportion of relevant items within its top 10 predictions.

### **Conclusion**

For Ranking Tasks: If the focus is on ranking performance, particularly in retrieving and ranking the most relevant items as high as possible, Best fit is better. It shows superior performance in MAP, NDCG, P@k, and R@k, which are critical for ranking and recommendation systems.




