# Rank prediction using LightGBM Model

-- Niketan Doddamani

<b><u>Key components of the model</u></b>

Data Preprocessing: After imputing missing values with KNN, we one-hot encode the categorical columns using the pandas get_dummies function, so that every input passed to the model is numeric.

Feature selection: We drop "Name" and "College", which we treat as irrelevant for prediction. "Round" and "Pick" are also excluded from the inputs because they define the target.

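Target feature: Currently, the target is derived from "Round": players picked in round 1 are labelled 1 and all other players 0, and prospects are ranked by the predicted probability of the positive class. In the future, we plan to incorporate "Pick" before final submission.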

Dataset split: Given that this is a ranking problem, the training dataset includes all years except for 2023. Data from 2023 will be used solely for predicting the rank.
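The notebook loads data/imputed_data.csv with missing values already filled, so the imputation step itself is not shown. Below is a minimal sketch of how KNN imputation followed by one-hot encoding might look, assuming scikit-learn's KNNImputer. The four toy rows, the choice of numeric columns, and n_neighbors=2 are illustrative assumptions, not the settings used to build the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with two missing measurements (hypothetical values)
raw = pd.DataFrame({
    "Position": ["WR", "DT", "WR", "S"],
    "Height": [71.0, 76.0, np.nan, 73.0],
    "Weight": [179.0, 296.0, 185.0, np.nan],
    "40 Yard Dash": [4.51, 5.31, 4.48, 4.58],
})
numeric_cols = ["Height", "Weight", "40 Yard Dash"]

# Fill each missing value from the nearest rows, measured on the observed columns
imputer = KNNImputer(n_neighbors=2)
imputed = raw.copy()
imputed[numeric_cols] = imputer.fit_transform(raw[numeric_cols])

# One-hot encode the remaining categorical column
encoded = pd.get_dummies(imputed)
print(encoded)
```

Imputing before encoding keeps the distance computation on the numeric measurements only.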

In [1]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np
from tabulate import tabulate
from sklearn.model_selection import GridSearchCV

In [2]:
import pandas as pd

# Read the CSV file
df = pd.read_csv("data/imputed_data.csv")
df.columns

Index(['Name', 'Position', 'College', 'Round', 'Pick', 'Stat URL', 'Height',
       'Weight', '40 Yard Dash', 'Bench Press', 'Vertical Jump', 'Broad Jump',
       '3 Cone Drill', 'Shuttle', 'conf_abbr', 'games', 'seasons',
       'tackles_solo', 'tackles_assists', 'tackles_total', 'tackles_loss',
       'sacks', 'def_int', 'def_int_yds', 'def_int_td', 'pass_defended',
       'fumbles_rec', 'fumbles_rec_yds', 'fumbles_rec_td', 'fumbles_forced',
       'rec', 'rec_yds', 'rec_yds_per_rec', 'rec_td', 'rush_att', 'rush_yds',
       'rush_yds_per_att', 'rush_td', 'scrim_att', 'scrim_yds',
       'scrim_yds_per_att', 'scrim_td', 'Year'],
      dtype='object')

In [3]:
df.head()

Unnamed: 0,Name,Position,College,Round,Pick,Stat URL,Height,Weight,40 Yard Dash,Bench Press,...,rec_td,rush_att,rush_yds,rush_yds_per_att,rush_td,scrim_att,scrim_yds,scrim_yds_per_att,scrim_td,Year
0,Emmanuel Acho,OLB,Texas,6,204,https://www.sports-reference.com/cfb/players/e...,74.0,238.0,4.64,24.0,...,5.29,199.2,1282.58,8.83,14.91,239.71,1747.91,8.22,20.2,2012
1,Joe Adams,WR,Arkansas,4,104,https://www.sports-reference.com/cfb/players/j...,71.0,179.0,4.51,14.59,...,8.5,4.0,69.5,11.65,0.0,96.0,1393.5,14.45,8.5,2012
2,Chas Alecxih,DT,Pittsburgh,0,0,https://www.sports-reference.com/cfb/players/c...,76.0,296.0,5.31,19.0,...,0.0,1.19,5.2,-0.68,0.36,1.36,5.55,0.86,0.36,2012
3,Frank Alexander,DE,Oklahoma,4,103,https://www.sports-reference.com/cfb/players/f...,76.0,270.0,4.8,24.48,...,2.17,22.98,75.37,4.12,4.24,36.81,231.59,6.49,6.41,2012
4,Antonio Allen,S,South Carolina,7,242,https://www.sports-reference.com/cfb/players/a...,73.0,210.0,4.58,17.0,...,1.68,374.69,2061.25,4.94,19.21,420.39,2397.36,6.43,20.89,2012


In [4]:
# Binarize the target: 1 = first-round pick, 0 = everyone else
df.loc[df.Round != 1, "Round"] = 0

# Drop the columns that do not contribute to prediction
all_X = df.drop(["Name", "Round", "Pick", "College"], axis=1)
all_X = pd.get_dummies(all_X)

# Splitting testing and training sets
train_X = all_X[(all_X.Year != 2023)].drop(["Year"], axis=1)
test_X = all_X[all_X.Year == 2023].drop(["Year"], axis=1)
train_y = df[(df.Year != 2023)].Round
test_y = df[df.Year == 2023].Round

In [5]:
train_X.head()

Unnamed: 0,Height,Weight,40 Yard Dash,Bench Press,Vertical Jump,Broad Jump,3 Cone Drill,Shuttle,games,seasons,...,conf_abbr_CUSA,conf_abbr_Ind,conf_abbr_MAC,conf_abbr_MVC,conf_abbr_MWC,conf_abbr_Pac-10,conf_abbr_Pac-12,conf_abbr_SEC,conf_abbr_Sun Belt,conf_abbr_WAC
0,74.0,238.0,4.64,24.0,35.5,118.0,7.13,4.28,37.0,3.0,...,False,False,False,False,False,False,False,False,False,False
1,71.0,179.0,4.51,14.59,36.0,123.0,7.09,4.12,40.0,4.0,...,False,False,False,False,False,False,False,False,False,False
2,76.0,296.0,5.31,19.0,25.5,99.0,7.74,4.62,34.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3,76.0,270.0,4.8,24.48,31.13,115.26,7.19,4.48,37.0,4.0,...,False,False,False,False,False,False,False,False,False,False
4,73.0,210.0,4.58,17.0,34.0,118.0,7.02,4.25,42.0,4.0,...,False,False,False,False,False,False,False,False,False,False


# Cleaning feature names with a regular expression

In [6]:
import re
def clean_feature_names(df):
    # Define a regular expression pattern to match special characters
    pattern = r'[^\w\s-]'
    
    # Iterate over the columns and clean the feature names
    for col in df.columns:
        # Replace special characters with underscores
        clean_col = re.sub(pattern, '_', col)
        # Rename the column if necessary
        if col != clean_col:
            df.rename(columns={col: clean_col}, inplace=True)

# Clean both feature matrices so that their column names stay aligned
clean_feature_names(train_X)
clean_feature_names(test_X)
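The cleaning step is presumably there because LightGBM rejects feature names that contain special JSON characters. The sketch below shows what the pattern does to individual names; the two column names are hypothetical examples, not columns confirmed to exist in the dataset.

```python
import re

PATTERN = r'[^\w\s-]'  # same pattern as in clean_feature_names

def clean_name(name):
    """Replace every character that is not a word character, whitespace, or hyphen."""
    return re.sub(PATTERN, '_', name)

# Hypothetical column names for illustration
examples = ["conf_abbr_Pac-12", "Stat URL_https://example.com/a.html"]
for name in examples:
    print(repr(name), "->", repr(clean_name(name)))
```

Names made only of letters, digits, underscores, spaces, and hyphens pass through unchanged, while characters such as `:`, `/`, and `.` each become an underscore.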

In [7]:
test_X.head()

Unnamed: 0,Height,Weight,40 Yard Dash,Bench Press,Vertical Jump,Broad Jump,3 Cone Drill,Shuttle,games,seasons,...,conf_abbr_CUSA,conf_abbr_Ind,conf_abbr_MAC,conf_abbr_MVC,conf_abbr_MWC,conf_abbr_Pac-10,conf_abbr_Pac-12,conf_abbr_SEC,conf_abbr_Sun Belt,conf_abbr_WAC
3400,70.0,216.0,4.51,19.42,33.64,115.58,7.03,4.28,31.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3401,73.0,237.0,4.47,17.09,36.5,129.0,7.22,4.25,53.0,5.0,...,False,False,False,False,False,False,False,False,False,False
3402,69.0,188.0,4.32,14.92,33.0,119.26,7.02,4.19,30.0,3.0,...,False,False,False,False,False,False,False,True,False,False
3403,71.0,173.0,4.49,15.14,34.0,122.0,7.0,4.16,35.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3404,74.0,282.0,4.49,27.0,37.5,125.0,7.22,4.47,36.0,4.0,...,False,False,False,False,False,False,False,False,False,False


# Training the Baseline Model

In [8]:
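# Initialize the LightGBM model
baseline_LGB = lgb.LGBMClassifier(
    feature_fraction=0.8,        # alias of colsample_bytree; only one of the two values takes effect
    colsample_bytree=0.7,
    learning_rate=0.01,
    max_depth=6,
    min_child_weight=15,
    subsample=0.7,               # row subsampling applies only when subsample_freq > 0
    num_leaves=32,
    class_weight={0: 1, 1: 10}   # upweight first-round picks, the rare positive class
)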

In [9]:
baseline_LGB.fit(train_X, train_y)

[LightGBM] [Info] Number of positive: 336, number of negative: 3064
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000458 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 7566
[LightGBM] [Info] Number of data points in the train set: 3400, number of used features: 70
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.523039 -> initscore=0.092220
[LightGBM] [Info] Start training from score 0.092220
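The log reports pavg=0.523039 even though only 336 of the 3400 training rows are positive. The arithmetic sketch below shows that the logged values are consistent with the positive class being weighted 10 to 1, as set in class_weight, and with the initial score being the log-odds of that weighted average. The variable names are chosen here for illustration.

```python
import math

n_pos, n_neg = 336, 3064   # counts from the LightGBM log
w_pos, w_neg = 10, 1       # class_weight={0: 1, 1: 10}

weighted_pos = n_pos * w_pos
weighted_neg = n_neg * w_neg

# Weighted share of positives and its log-odds
pavg = weighted_pos / (weighted_pos + weighted_neg)
initscore = math.log(pavg / (1 - pavg))

print(f"raw positive rate: {n_pos / (n_pos + n_neg):.6f}")
print(f"pavg:              {pavg:.6f}")
print(f"initscore:         {initscore:.6f}")
```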


In [10]:
baseline_pred = baseline_LGB.predict(test_X)



In [11]:
accuracy = accuracy_score(test_y, baseline_pred)
print("Accuracy:", accuracy)

# Predicting the probabilities of Test set
baseline_preds = baseline_LGB.predict_proba(test_X)

# Ranking done according to the probability scores
sorted_indices = np.argsort(-baseline_preds[:, 1])
k = 10
num_relevant = sum(test_y)

# Evaluation metrics
def calculate_MRR(sorted_indices, test_y):
    mrr = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:
            mrr = 1 / (idx + 1)
            break
    return mrr

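def calculate_MAP(sorted_indices, test_y):
    ap = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:
            # NOTE: test_y.iloc[:idx + 1] counts positives in the original row order,
            # not among the top idx + 1 ranked items; the MAP reported below reflects this.
            ap += sum(test_y.iloc[:idx + 1]) / (idx + 1)
    map_score = ap / num_relevant
    return map_score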

def calculate_NDCG(sorted_indices, test_y):
    dcg = 0
    idcg = sum(1 / np.log2(np.arange(2, k + 2)))
    for idx, i in enumerate(sorted_indices[:k]):
        if test_y.iloc[i] == 1:
            dcg += 1 / np.log2(idx + 2)
    ndcg = dcg / idcg
    return ndcg

def calculate_PAK(sorted_indices, test_y):
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    precision_at_k = tp_at_k / k
    return precision_at_k

def calculate_RAK(sorted_indices, test_y):
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    recall_at_k = tp_at_k / num_relevant
    return recall_at_k

# Calculate all measurements
baseline_measurements = [
    ("Accuracy", accuracy_score(test_y, baseline_pred)),
    ("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Baseline measurements")
print(tabulate(baseline_measurements, headers=["Metric", "Value"]))

Accuracy: 0.7323943661971831
Baseline measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.732394
ROC AUC Score                                         0.763895
Mean Reciprocal Rank (MRR)                            1
Mean Average Precision (MAP)                          0.106994
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.522496
Precision at k (P@k) at k=10                          0.4
Recall at k (R@k) at k=10                             0.137931
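calculate_MAP accumulates sum(test_y.iloc[:idx + 1]), which counts positives in the original row order rather than along the ranking, so the MAP value in the table is not the usual average precision. Below is a minimal sketch of average precision over a ranked list of 0/1 labels; the function name and the toy list are assumptions made for illustration, and the notebook's reported numbers were not produced with it.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            # Precision among the top `position` items of the ranking
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0

# Toy ranking: relevant items at positions 1 and 3
toy_ranked = [1, 0, 1, 0, 0]
print(average_precision(toy_ranked))  # (1/1 + 2/3) / 2
```

With the notebook's variables, the ranked labels would be test_y.iloc[sorted_indices].tolist().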


In [12]:
# Print the 2023 prospects in ranked order (highest predicted probability first)
names_2023 = df[df.Year == 2023].reset_index()["Name"]
for rank, i in enumerate(sorted_indices, start=1):
    print(f"{rank} {names_2023.at[i]}")

1 Emmanuel Forbes
2 Christian Gonzalez
3 DJ Turner
4 Jakorian Bennett
5 Anthony Richardson
6 C.J. Stroud
7 Tanner McKee
8 Marvin Mims
9 Byron Young
10 YaYa Diaby
11 Tavius Robinson
12 Quentin Johnston
13 Carrington Valentine
14 Rakim Jarrett
15 Wanya Morris
16 Tyler Steen
17 Jalin Hyatt
18 Blake Freeland
19 Kelee Ringo
20 Ali Gaye
21 Asim Richards
22 Dawand Jones
23 Anton Harrison
24 Bryce Young
25 Dante Stills
26 Thomas Incoom
27 Lukas Van Ness
28 Gervon Dexter
29 Joe Tippmann
30 John Ojukwu
31 Peter Skoronski
32 Carter Warren
33 Isaiah Foskey
34 Malaesala Aumavae-Laulu
35 Broderick Jones
36 Darnell Wright
37 Anthony Bradford
38 Jalen Carter
39 Jon Gaines
40 Jarrett Patterson
41 Trenton Simpson
42 Darrell Luter Jr.
43 Rejzohn Wright
44 Dontayvion Wicks
45 Mazi Smith
46 Tuli Tuipulotu
47 Ricky Stromberg
48 Tashawn Manning
49 Tyrus Wheat
50 Adetomiwa Adebawore
51 Jerrod Clark
52 Will Anderson Jr.
53 Bryan Bresee
54 BJ Ojulari
55 Paris Johnson Jr.
56 Matthew Bergeron
57 Nick Herbig
58 Sy

# Model with Tuned Hyperparameters

In [13]:
param_grid = {
    'num_leaves': [20, 30, 40],
    'max_depth': [5, 6, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.6, 0.7, 0.8],
    'colsample_bytree': [0.6, 0.7, 0.8],
    'min_child_weight': [10, 15, 20],
    'class_weight': [{0: 0.5, 1: 12}, {0: 1, 1: 15}],
    'bagging_freq': [1, 3]
}
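A grid search evaluates every combination in the grid, so its cost is the product of the list lengths times the number of folds. The sketch below counts the fits implied by this grid, assuming 4 folds to match the cv=4 argument passed to GridSearchCV.

```python
import math

param_grid = {
    'num_leaves': [20, 30, 40],
    'max_depth': [5, 6, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.6, 0.7, 0.8],
    'colsample_bytree': [0.6, 0.7, 0.8],
    'min_child_weight': [10, 15, 20],
    'class_weight': [{0: 0.5, 1: 12}, {0: 1, 1: 15}],
    'bagging_freq': [1, 3],
}

# One candidate per element of the Cartesian product of all value lists
n_combinations = math.prod(len(values) for values in param_grid.values())
cv_folds = 4  # assumed to match cv=4
n_fits = n_combinations * cv_folds

print("combinations:", n_combinations)
print("cross-validation fits:", n_fits)
```

This gives 2916 combinations and 11664 cross-validation fits, plus one final refit on the full training set when refit is left at its default.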


In [None]:
lgb_classifier = lgb.LGBMClassifier()

# Perform grid search with 4-fold cross-validation
grid_search = GridSearchCV(estimator=lgb_classifier, param_grid=param_grid, cv=4, scoring='accuracy', verbose=2, n_jobs=-1)


grid_search.fit(train_X, train_y)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score (Accuracy):", best_score)

# Instantiate the best LightGBM classifier with the best parameters
best_lgb_classifier = lgb.LGBMClassifier(**best_params)

# Train the best LightGBM classifier on the entire training data
best_lgb_classifier.fit(train_X, train_y)

In [15]:

y_pred = best_lgb_classifier.predict(test_X)

# Calculate accuracy
accuracy = accuracy_score(test_y, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8415492957746479


In [19]:
preds = best_lgb_classifier.predict_proba(test_X)
count = 1

for i in pd.DataFrame(preds).sort_values(by=1, ascending=False).index:
    print(str(count) + " " + str(df[df.Year==2023].reset_index().at[i, "Name"]))
    count += 1

1 Darnell Wright
2 Anthony Richardson
3 Carter Warren
4 Emmanuel Forbes
5 Byron Young
6 Dawand Jones
7 Anthony Bradford
8 Marvin Mims
9 Jakorian Bennett
10 Thomas Incoom
11 Isaiah Foskey
12 Kelee Ringo
13 Wanya Morris
14 Richard Gouraige
15 DJ Turner
16 YaYa Diaby
17 C.J. Stroud
18 Joe Tippmann
19 Tyler Steen
20 Broderick Jones
21 Mazi Smith
22 Carrington Valentine
23 Anton Harrison
24 Trenton Simpson
25 Darrell Luter Jr.
26 Blake Freeland
27 Jalin Hyatt
28 Tanner McKee
29 Matthew Bergeron
30 Jon Gaines
31 Bryce Young
32 Nolan Smith
33 Asim Richards
34 Lukas Van Ness
35 Bryan Bresee
36 Ryan Hayes
37 Rakim Jarrett
38 Christian Gonzalez
39 Tavius Robinson
40 Paris Johnson Jr.
41 Jerrod Clark
42 Ali Gaye
43 Matt Landers
44 Rejzohn Wright
45 Malaesala Aumavae-Laulu
46 Quentin Johnston
47 Bijan Robinson
48 Jay Ward
49 Robert Beal
50 Will Anderson Jr.
51 Michael Mayer
52 Ikenna Enechukwu
53 Riley Moss
54 BJ Ojulari
55 Nathaniel Dell
56 Zach Charbonnet
57 John Ojukwu
58 O'Cyrus Torrence
59 Sa

In [18]:
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Convert predicted probabilities to binary predictions based on a threshold (e.g., 0.5)
predicted_labels = (preds[:, 1] > 0.5).astype(int)

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-preds[:, 1])

# Calculate all measurements
best_rf_measurements = [
    ("Accuracy", accuracy_score(test_y, predicted_labels)),
    ("ROC AUC Score", roc_auc_score(test_y, preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Best Fit measurements")
print(tabulate(best_rf_measurements, headers=["Metric", "Value"]))

Best Fit measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.841549
ROC AUC Score                                         0.7286
Mean Reciprocal Rank (MRR)                            1
Mean Average Precision (MAP)                          0.110746
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.453743
Precision at k (P@k) at k=10                          0.3
Recall at k (R@k) at k=10                             0.103448


**Comparative Analysis of Baseline and Best-Fit LightGBM Models for Ranking Prediction**

The comparison between the baseline and best-fit LightGBM models shows that tuning improved classification accuracy but not ranking quality. Accuracy rises from 0.732 to 0.842, while the ROC AUC score falls from 0.764 to 0.729.

The ranking-related metrics are flat or worse for the best-fit model. The Mean Reciprocal Rank (MRR) is 1 for both models, meaning each places an actual first-round pick at the top of its list.

The Mean Average Precision (MAP) is nearly unchanged (0.107 versus 0.111), whereas Precision at k (P@k) at k=10 drops from 0.4 to 0.3: the baseline places four first-round picks in its top 10 and the best-fit model three. The MAP values should be read with caution, because calculate_MAP counts positives with test_y.iloc[:idx + 1], which follows the original row order rather than the ranked order.

The Normalized Discounted Cumulative Gain (NDCG) at k=10 also decreases, from 0.522 to 0.454, so the baseline concentrates more relevant players at the top ranks, which is crucial for ranking tasks.

Recall at k (R@k) at k=10 remains low for both models (0.138 and 0.103), indicating a challenge in capturing all relevant instances within the top k results. The precision and recall values imply 29 first-round picks in the 2023 test set, so recall at k=10 cannot exceed 10/29 ≈ 0.34.

Overall, the grid search selected parameters by cross-validated accuracy, which plausibly explains why the best-fit model gains accuracy without ranking prospects better. On the ranking metrics that matter most for predicting NFL Draft outcomes, the baseline model is currently the stronger of the two, and tuning against a ranking-oriented score is a natural next step.
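The comparison can be checked directly against the two result tables. The sketch below tabulates the per-metric difference, with values copied from the printed output above; the short metric keys are labels chosen here for illustration.

```python
# Values copied from the "Baseline measurements" and "Best Fit measurements" tables
baseline = {
    "Accuracy": 0.732394, "ROC AUC": 0.763895, "MRR": 1.0, "MAP": 0.106994,
    "NDCG@10": 0.522496, "P@10": 0.4, "R@10": 0.137931,
}
best_fit = {
    "Accuracy": 0.841549, "ROC AUC": 0.7286, "MRR": 1.0, "MAP": 0.110746,
    "NDCG@10": 0.453743, "P@10": 0.3, "R@10": 0.103448,
}

# Positive delta means the best-fit model scores higher than the baseline
deltas = {m: round(best_fit[m] - baseline[m], 6) for m in baseline}
for metric, delta in deltas.items():
    print(f"{metric:<10}{baseline[metric]:>10.6f}{best_fit[metric]:>10.6f}{delta:>+11.6f}")

# Number of relevant (first-round) players implied by P@10 and R@10 of the baseline
hits_at_10 = round(baseline["P@10"] * 10)
n_relevant = round(hits_at_10 / baseline["R@10"])
print("implied relevant players:", n_relevant)
```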