# **Rank prediction using XGBoost Classifier**

-- Bhavya Batta

**Key components of the model**

**Data Preprocessing**: Missing values are filled in using K-nearest neighbors, and categorical data is transformed into numeric form through pandas' get_dummies method. This transformation ensures the data is properly formatted for model input.

**Feature selection**:  Features deemed irrelevant, including "Name" and "College," are omitted from the prediction process. Furthermore, "Round" and "Pick" are not considered as they contribute to the target feature.

**Target feature:** Our current goal is to predict rankings based on the "Round" feature. We intend to include "Pick" as part of the target variable in the upcoming final version.

**Dataset split:** Given that this is a ranking problem, the training dataset includes all years except for 2023. Data from 2023 will be used solely for predicting the rank.

The **hyperparameters** are tuned using cross-validation. The disparity between baseline measurements and best-fit measurements demonstrates an improvement in accuracy and other metrics following 5-Fold cross-validation.

Note: This project is ongoing, with objectives to enhance measurement criteria, replace accuracy with ranking metrics in Cross-Validation, and incorporate "Pick" into the target feature for improvement.

# **Comparative Analysis of Baseline and best-fit XGBoost models.**

In [61]:
import pandas as pd

# Read the CSV file
df = pd.read_csv("imputed_data.csv")
print(df.columns)

Index(['Name', 'Position', 'College', 'Round', 'Pick', 'Stat URL', 'Height',
       'Weight', '40 Yard Dash', 'Bench Press', 'Vertical Jump', 'Broad Jump',
       '3 Cone Drill', 'Shuttle', 'conf_abbr', 'games', 'seasons',
       'tackles_solo', 'tackles_assists', 'tackles_total', 'tackles_loss',
       'sacks', 'def_int', 'def_int_yds', 'def_int_td', 'pass_defended',
       'fumbles_rec', 'fumbles_rec_yds', 'fumbles_rec_td', 'fumbles_forced',
       'rec', 'rec_yds', 'rec_yds_per_rec', 'rec_td', 'rush_att', 'rush_yds',
       'rush_yds_per_att', 'rush_td', 'scrim_att', 'scrim_yds',
       'scrim_yds_per_att', 'scrim_td', 'Year'],
      dtype='object')


In [37]:
df.head()

Unnamed: 0,Name,Position,College,Round,Pick,Stat URL,Height,Weight,40 Yard Dash,Bench Press,...,rec_td,rush_att,rush_yds,rush_yds_per_att,rush_td,scrim_att,scrim_yds,scrim_yds_per_att,scrim_td,Year
0,Emmanuel Acho,OLB,Texas,6,204,https://www.sports-reference.com/cfb/players/e...,74.0,238.0,4.64,24.0,...,5.29,199.2,1282.58,8.83,14.91,239.71,1747.91,8.22,20.2,2012
1,Joe Adams,WR,Arkansas,4,104,https://www.sports-reference.com/cfb/players/j...,71.0,179.0,4.51,14.59,...,8.5,4.0,69.5,11.65,0.0,96.0,1393.5,14.45,8.5,2012
2,Chas Alecxih,DT,Pittsburgh,0,0,https://www.sports-reference.com/cfb/players/c...,76.0,296.0,5.31,19.0,...,0.0,1.19,5.2,-0.68,0.36,1.36,5.55,0.86,0.36,2012
3,Frank Alexander,DE,Oklahoma,4,103,https://www.sports-reference.com/cfb/players/f...,76.0,270.0,4.8,24.48,...,2.17,22.98,75.37,4.12,4.24,36.81,231.59,6.49,6.41,2012
4,Antonio Allen,S,South Carolina,7,242,https://www.sports-reference.com/cfb/players/a...,73.0,210.0,4.58,17.0,...,1.68,374.69,2061.25,4.94,19.21,420.39,2397.36,6.43,20.89,2012


In [2]:
df.loc[df.Round != 1, "Round"] = 0

# Dropping the columns which donot contribute in prediction
all_X = df.drop(["Name", "Round", "Pick", "College"], axis=1)
all_X = pd.get_dummies(all_X)

# Splitting testing and training sets
train_X = all_X[(all_X.Year != 2023)].drop(["Year"], axis=1)
test_X = all_X[all_X.Year == 2023].drop(["Year"], axis=1)
train_y = df[(df.Year != 2023)].Round
test_y = df[df.Year == 2023].Round

In [38]:
train_X.head()

Unnamed: 0,Height,Weight,40 Yard Dash,Bench Press,Vertical Jump,Broad Jump,3 Cone Drill,Shuttle,games,seasons,...,conf_abbr_CUSA,conf_abbr_Ind,conf_abbr_MAC,conf_abbr_MVC,conf_abbr_MWC,conf_abbr_Pac-10,conf_abbr_Pac-12,conf_abbr_SEC,conf_abbr_Sun Belt,conf_abbr_WAC
0,74.0,238.0,4.64,24.0,35.5,118.0,7.13,4.28,37.0,3.0,...,False,False,False,False,False,False,False,False,False,False
1,71.0,179.0,4.51,14.59,36.0,123.0,7.09,4.12,40.0,4.0,...,False,False,False,False,False,False,False,False,False,False
2,76.0,296.0,5.31,19.0,25.5,99.0,7.74,4.62,34.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3,76.0,270.0,4.8,24.48,31.13,115.26,7.19,4.48,37.0,4.0,...,False,False,False,False,False,False,False,False,False,False
4,73.0,210.0,4.58,17.0,34.0,118.0,7.02,4.25,42.0,4.0,...,False,False,False,False,False,False,False,False,False,False


In [39]:
test_X.head()

Unnamed: 0,Height,Weight,40 Yard Dash,Bench Press,Vertical Jump,Broad Jump,3 Cone Drill,Shuttle,games,seasons,...,conf_abbr_CUSA,conf_abbr_Ind,conf_abbr_MAC,conf_abbr_MVC,conf_abbr_MWC,conf_abbr_Pac-10,conf_abbr_Pac-12,conf_abbr_SEC,conf_abbr_Sun Belt,conf_abbr_WAC
3400,70.0,216.0,4.51,19.42,33.64,115.58,7.03,4.28,31.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3401,73.0,237.0,4.47,17.09,36.5,129.0,7.22,4.25,53.0,5.0,...,False,False,False,False,False,False,False,False,False,False
3402,69.0,188.0,4.32,14.92,33.0,119.26,7.02,4.19,30.0,3.0,...,False,False,False,False,False,False,False,True,False,False
3403,71.0,173.0,4.49,15.14,34.0,122.0,7.0,4.16,35.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3404,74.0,282.0,4.49,27.0,37.5,125.0,7.22,4.47,36.0,4.0,...,False,False,False,False,False,False,False,False,False,False


In [40]:
import xgboost as xgb
from sklearn.model_selection import cross_val_score

In [68]:
# Initialize the baseline XGBoost classifier with custom parameters
baseline_XGB = xgb.XGBClassifier(colsample_bytree=0.7,
 eta= 0.001,
 eval_metric= 'mae',
 max_depth= 6,
 min_child_weight= 15,
 objective= 'binary:logistic',
 subsample= 0.7)

baseline_XGB.fit(train_X, train_y)

In [63]:
# Predict on the testing data
baseline_pred = baseline_XGB.predict(test_X)

# Calculate accuracy
accuracy = accuracy_score(test_y, baseline_pred)
print("Accuracy:", accuracy)

Accuracy: 0.897887323943662


In [64]:
# Predicting the probabilities of Test set
baseline_preds = baseline_XGB.predict_proba(test_X)
count = 1

# Ranking done according to the probability scores
for i in pd.DataFrame(baseline_preds).sort_values(by=1, ascending=False).index:
    print(str(count) + " " + str(df[df.Year==2023].reset_index().at[i, "Name"]))
    count += 1

1 Christian Gonzalez
2 Marvin Mims
3 Jakorian Bennett
4 Jalin Hyatt
5 DJ Turner
6 Anthony Richardson
7 Emmanuel Forbes
8 Byron Young
9 Keaton Mitchell
10 Kelee Ringo
11 Brandon Hill
12 Devon Achane
13 Trenton Simpson
14 Jahmyr Gibbs
15 Tyler Scott
16 Carrington Valentine
17 Dawand Jones
18 Quentin Johnston
19 Tyler Steen
20 Wanya Morris
21 Tavius Robinson
22 Eli Ricks
23 Lukas Van Ness
24 Rejzohn Wright
25 Asim Richards
26 Blake Freeland
27 Gervon Dexter
28 YaYa Diaby
29 Will Anderson Jr.
30 John Ojukwu
31 Rakim Jarrett
32 Joe Tippmann
33 Jon Gaines
34 Malaesala Aumavae-Laulu
35 Josh Downs
36 Anton Harrison
37 Nick Herbig
38 Carter Warren
39 Ali Gaye
40 Nolan Smith
41 Luke Schoonmaker
42 Isaiah Foskey
43 Yasir Abdullah
44 BJ Ojulari
45 C.J. Stroud
46 Broderick Jones
47 Matt Landers
48 Peter Skoronski
49 Bryan Bresee
50 Tanner McKee
51 Nick Hampton
52 Bijan Robinson
53 Bryce Ford-Wheaton
54 Thomas Incoom
55 Myles Murphy
56 Michael Mayer
57 Tre'Vius Hodges-Tomlinson
58 Darnell Wright
59 

In [72]:
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Convert predicted probabilities to binary predictions based on a threshold (e.g., 0.5)
predicted_labels = (baseline_preds[:, 1] > 0.5).astype(int)

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-baseline_preds[:, 1])
k = 10
num_relevant = sum(test_y)

def calculate_MRR(sorted_indices, test_y):
    # Calculate Mean Reciprocal Rank (MRR)
    mrr = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:  # Use iloc to access test_y by index
            mrr = 1 / (idx + 1)
            break
    return mrr

def calculate_MAP(sorted_indices, test_y):
    # Calculate Mean Average Precision (MAP)
    ap = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:
            ap += sum(test_y.iloc[:idx + 1]) / (idx + 1)
    map_score = ap / num_relevant
    return map_score

def calculate_NDCG(sorted_indices, test_y):
    # Calculate Normalized Discounted Cumulative Gain (NDCG) at k=10
    dcg = 0
    idcg = sum(1 / np.log2(np.arange(2, k + 2)))
    for idx, i in enumerate(sorted_indices[:k]):
        if test_y.iloc[i] == 1:
            dcg += 1 / np.log2(idx + 2)
    ndcg = dcg / idcg
    return ndcg

def calculate_PAK(sorted_indices, test_y):
    # Calculate Precision at k (P@k)
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    precision_at_k = tp_at_k / k
    return precision_at_k

def calculate_RAK(sorted_indices, test_y):
    # Calculate Recall at k (R@k)
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    recall_at_k = tp_at_k / num_relevant
    return recall_at_k

In [73]:
from tabulate import tabulate

# Calculate all measurements
baseline_measurements = [
    ("Accuracy", accuracy_score(test_y, predicted_labels)),
    ("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Baseline measurements")
print(tabulate(baseline_measurements, headers=["Metric", "Value"]))

Baseline measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.897887
ROC AUC Score                                         0.764841
Mean Reciprocal Rank (MRR)                            1
Mean Average Precision (MAP)                          0.110987
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.371854
Precision at k (P@k) at k=10                          0.3
Recall at k (R@k) at k=10                             0.103448


In [74]:
best_XGB = xgb.XGBClassifier(
    colsample_bytree=0.8,
    eta=0.1,
    eval_metric='logloss',
    max_depth=6,
    min_child_weight=1,
    objective='binary:logistic',
    subsample=0.8
)

# Hypertuning parameters using 5-Fold Cross Validation method
scores = cross_val_score(best_XGB, train_X, train_y, cv=5)

best_XGB.fit(train_X, train_y)

In [75]:
# Predict on the testing data
y_pred = best_XGB.predict(test_X)

# Calculate accuracy
accuracy = accuracy_score(test_y, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8943661971830986


In [76]:
# Predicting the probabilities of Test set
preds = best_XGB.predict_proba(test_X)
count = 1

# Ranking done according to the probability scores
for i in pd.DataFrame(preds).sort_values(by=1, ascending=False).index:
    print(str(count) + " " + str(df[df.Year==2023].reset_index().at[i, "Name"]))
    count += 1

1 Dawand Jones
2 Darnell Wright
3 Byron Young
4 Marvin Mims
5 Anthony Richardson
6 Emmanuel Forbes
7 C.J. Stroud
8 Kelee Ringo
9 Anton Harrison
10 Jakorian Bennett
11 Will Anderson Jr.
12 Tyler Steen
13 Rejzohn Wright
14 Thomas Incoom
15 Richard Gouraige
16 Lukas Van Ness
17 Adetomiwa Adebawore
18 Christian Gonzalez
19 Michael Mayer
20 Anthony Bradford
21 YaYa Diaby
22 Joe Tippmann
23 Wanya Morris
24 Quentin Johnston
25 Blake Freeland
26 Bryce Young
27 Asim Richards
28 Carrington Valentine
29 Broderick Jones
30 Malaesala Aumavae-Laulu
31 Calijah Kancey
32 DJ Turner
33 Isaiah Foskey
34 Matthew Bergeron
35 Gervon Dexter
36 Mazi Smith
37 Ryan Hayes
38 Trenton Simpson
39 Carter Warren
40 Nolan Smith
41 John Ojukwu
42 Jalin Hyatt
43 Henry To'oTo'o
44 Jon Gaines
45 Zach Charbonnet
46 Zach Harrison
47 Tyree Wilson
48 Jonathan Mingo
49 Tavius Robinson
50 Luke Schoonmaker
51 Tanner McKee
52 Bijan Robinson
53 Peter Skoronski
54 Rakim Jarrett
55 Robert Beal
56 Darrell Luter Jr.
57 Paris Johnson J

In [77]:
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Convert predicted probabilities to binary predictions based on a threshold (e.g., 0.5)
predicted_labels = (preds[:, 1] > 0.5).astype(int)

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-preds[:, 1])

# Calculate all measurements
best_rf_measurements = [
    ("Accuracy", accuracy_score(test_y, predicted_labels)),
    ("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Best Fit measurements")
print(tabulate(best_rf_measurements, headers=["Metric", "Value"]))

Best Fit measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.894366
ROC AUC Score                                         0.764841
Mean Reciprocal Rank (MRR)                            0.5
Mean Average Precision (MAP)                          0.114885
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.442022
Precision at k (P@k) at k=10                          0.5
Recall at k (R@k) at k=10                             0.172414


# **Comparative Analysis of Baseline and Best-Fit Random Forest Models for Ranking Prediction**

**Accuracy**: Baseline is slightly higher, indicating it correctly classified a marginally higher percentage of the total.
ROC AUC Score: Both results are identical, showing the same ability to discriminate between classes.

**MRR**: Baseline is perfect, indicating it always ranks the correct item highest. Baseline result shows a significant drop, which could be critical if the goal is to rank a correct item as high as possible.

**MAP:** Best fit is slightly better, indicating a slight improvement in the ranking of relevant items across queries.

**NDCG at k=10:** Best fit is higher, showing it ranks relevant items more effectively within the top 10 positions.

**P@k at k=10:** Best fit is significantly higher, suggesting it has a better top-10 precision.

**R@k at k=10:** Best fit is also higher here, indicating it retrieves a higher proportion of relevant items within its top 10 predictions.

### **Conclusion**

For Ranking Tasks: If the focus is on ranking performance, particularly in retrieving and ranking the most relevant items as high as possible, Best fit is better. It shows superior performance in MAP, NDCG, P@k, and R@k, which are critical for ranking and recommendation systems.




