# Rank prediction using Random Forest Classifier

-- Vishwa Sheth

<b><u>Key components of the model</u></b>

Data Preprocessing: After imputing missing values using KNN, we convert categorical data to numeric using the get_dummies function in pandas. This conversion helps to format the data in a way that is suitable for the model.

Feature selection: We remove irrelevant features such as "Name" and "College" from consideration for prediction. Additionally, "Round" and "Pick" are excluded as they are part of the target feature.

Target feature: Currently, we aim to predict ranking using the "Round" feature. In the future, we plan to incorporate "Pick" before final submission.

Dataset split: Given that this is a ranking problem, the training dataset includes all years except for 2023. Data from 2023 will be used solely for predicting the rank.

The hyperparameters are tuned using cross-validation. The disparity between baseline measurements and best-fit measurements demonstrates an improvement in accuracy and other metrics following 5-Fold cross-validation.

Comparative Analysis of Baseline and Best-Fit Random Forest Models for Ranking Prediction

<u>Note</u>: This work is in progress; we aim to improve measurement parameters, include ranking parameters in Cross Validation instead of accuracy and include "Pick" in the target feature.

In [6]:
import pandas as pd

# Read the CSV file
df = pd.read_csv("data/imputed_data.csv")
print(df.columns)

Index(['Name', 'Position', 'College', 'Round', 'Pick', 'Stat URL', 'Height',
       'Weight', '40 Yard Dash', 'Bench Press', 'Vertical Jump', 'Broad Jump',
       '3 Cone Drill', 'Shuttle', 'conf_abbr', 'games', 'seasons',
       'tackles_solo', 'tackles_assists', 'tackles_total', 'tackles_loss',
       'sacks', 'def_int', 'def_int_yds', 'def_int_td', 'pass_defended',
       'fumbles_rec', 'fumbles_rec_yds', 'fumbles_rec_td', 'fumbles_forced',
       'rec', 'rec_yds', 'rec_yds_per_rec', 'rec_td', 'rush_att', 'rush_yds',
       'rush_yds_per_att', 'rush_td', 'scrim_att', 'scrim_yds',
       'scrim_yds_per_att', 'scrim_td', 'Year'],
      dtype='object')


In [7]:
df.head

<bound method NDFrame.head of                  Name Position          College  Round  Pick  \
0       Emmanuel Acho      OLB            Texas      6   204   
1           Joe Adams       WR         Arkansas      4   104   
2        Chas Alecxih       DT       Pittsburgh      0     0   
3     Frank Alexander       DE         Oklahoma      4   103   
4       Antonio Allen        S   South Carolina      7   242   
...               ...      ...              ...    ...   ...   
3679      Luke Wypler        C         Ohio St.      6   190   
3680      Bryce Young       QB          Alabama      1     1   
3681      Byron Young       DT          Alabama      3    70   
3682      Byron Young     EDGE        Tennessee      3    77   
3683    Cameron Young       DT  Mississippi St.      4   123   

                                               Stat URL  Height  Weight  \
0     https://www.sports-reference.com/cfb/players/e...    74.0   238.0   
1     https://www.sports-reference.com/cfb/players/

In [8]:
df.loc[df.Round != 1, "Round"] = 0

# Dropping the columns which donot contribute in prediction
all_X = df.drop(["Name", "Round", "Pick", "College"], axis=1)
all_X = pd.get_dummies(all_X)

# Splitting testing and training sets
train_X = all_X[(all_X.Year != 2023)].drop(["Year"], axis=1)
test_X = all_X[all_X.Year == 2023].drop(["Year"], axis=1)
train_y = df[(df.Year != 2023)].Round
test_y = df[df.Year == 2023].Round



In [9]:
train_X.head()

Unnamed: 0,Height,Weight,40 Yard Dash,Bench Press,Vertical Jump,Broad Jump,3 Cone Drill,Shuttle,games,seasons,...,conf_abbr_CUSA,conf_abbr_Ind,conf_abbr_MAC,conf_abbr_MVC,conf_abbr_MWC,conf_abbr_Pac-10,conf_abbr_Pac-12,conf_abbr_SEC,conf_abbr_Sun Belt,conf_abbr_WAC
0,74.0,238.0,4.64,24.0,35.5,118.0,7.13,4.28,37.0,3.0,...,False,False,False,False,False,False,False,False,False,False
1,71.0,179.0,4.51,14.59,36.0,123.0,7.09,4.12,40.0,4.0,...,False,False,False,False,False,False,False,False,False,False
2,76.0,296.0,5.31,19.0,25.5,99.0,7.74,4.62,34.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3,76.0,270.0,4.8,24.48,31.13,115.26,7.19,4.48,37.0,4.0,...,False,False,False,False,False,False,False,False,False,False
4,73.0,210.0,4.58,17.0,34.0,118.0,7.02,4.25,42.0,4.0,...,False,False,False,False,False,False,False,False,False,False


In [10]:
test_X.head()

Unnamed: 0,Height,Weight,40 Yard Dash,Bench Press,Vertical Jump,Broad Jump,3 Cone Drill,Shuttle,games,seasons,...,conf_abbr_CUSA,conf_abbr_Ind,conf_abbr_MAC,conf_abbr_MVC,conf_abbr_MWC,conf_abbr_Pac-10,conf_abbr_Pac-12,conf_abbr_SEC,conf_abbr_Sun Belt,conf_abbr_WAC
3400,70.0,216.0,4.51,19.42,33.64,115.58,7.03,4.28,31.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3401,73.0,237.0,4.47,17.09,36.5,129.0,7.22,4.25,53.0,5.0,...,False,False,False,False,False,False,False,False,False,False
3402,69.0,188.0,4.32,14.92,33.0,119.26,7.02,4.19,30.0,3.0,...,False,False,False,False,False,False,False,True,False,False
3403,71.0,173.0,4.49,15.14,34.0,122.0,7.0,4.16,35.0,3.0,...,False,False,False,False,False,False,False,False,False,False
3404,74.0,282.0,4.49,27.0,37.5,125.0,7.22,4.47,36.0,4.0,...,False,False,False,False,False,False,False,False,False,False


In [13]:
from sklearn.ensemble import RandomForestClassifier
# Define the parameter values as baseline
n_estimators = 1      
max_depth = None      
min_samples_split = 1000  
min_samples_leaf = 1000   
max_features = None   
bootstrap = False     


# Initialize the Random Forest classifier with custom parameters
baseline_rf = RandomForestClassifier(n_estimators=n_estimators,
                                    max_depth=max_depth,
                                    min_samples_split=min_samples_split,
                                    min_samples_leaf=min_samples_leaf,
                                    max_features=max_features,
                                    bootstrap=bootstrap)

# Initialize and train Random Forest classifier as baseline
# baseline_rf = RandomForestClassifier()
baseline_rf.fit(train_X, train_y)


In [14]:
# Make predictions on test data
baseline_preds = preds = baseline_rf.predict_proba(test_X)
count = 1

# Ranking done according to the probability scores
for i in pd.DataFrame(baseline_preds).sort_values(by=1, ascending=False).index:
    print(str(count) + " " + str(df[df.Year==2023].reset_index().at[i, "Name"]))
    count += 1

1 Israel Abanikanda
2 Mike Morris
3 Tashawn Manning
4 Michael Mayer
5 Warren McClendon
6 Jordan McFadden
7 Tanner McKee
8 Kendre Miller
9 Marvin Mims
10 Keaton Mitchell
11 Wanya Morris
12 Calijah Kancey
13 Myles Murphy
14 Lukas Van Ness
15 John Ojukwu
16 BJ Ojulari
17 Jarrett Patterson
18 Kyle Patterson
19 Jack Podlesny
20 Asim Richards
21 Jaxson Kirkland
22 Darrell Luter Jr.
23 Anton Harrison
24 Clark Phillips III
25 Malik Heath
26 Nick Herbig
27 Ronnie Hickman
28 Brandon Hill
29 Xavier Hutchinson
30 Jalin Hyatt
31 Andre Carter II
32 Rashad Torrence II
33 Thomas Incoom
34 Paris Johnson Jr.
35 Rakim Jarrett
36 Antonio Johnson
37 Quentin Johnston
38 Broderick Jones
39 Dawand Jones
40 Jaylon Jones
41 Will Anderson Jr.
42 Emil Ekiyor Jr.
43 Anthony Richardson
44 Eli Ricks
45 Kelee Ringo
46 Parker Washington
47 DJ Turner
48 Carrington Valentine
49 Deuce Vaughn
50 Andrew Vorhees
51 Dalton Wagner
52 Alex Ward
53 Carter Warren
54 Darnell Washington
55 Tyrus Wheat
56 Tavius Robinson
57 Blake W

In [15]:
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Convert predicted probabilities to binary predictions based on a threshold (e.g., 0.5)
predicted_labels = (baseline_preds[:, 1] > 0.5).astype(int)

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-preds[:, 1])
k = 10
num_relevant = sum(test_y)

def calculate_MRR(sorted_indices, test_y):
    # Calculate Mean Reciprocal Rank (MRR)
    mrr = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:  # Use iloc to access test_y by index
            mrr = 1 / (idx + 1)
            break
    return mrr

def calculate_MAP(sorted_indices, test_y):
    # Calculate Mean Average Precision (MAP)
    ap = 0
    for idx, i in enumerate(sorted_indices):
        if test_y.iloc[i] == 1:
            ap += sum(test_y.iloc[:idx + 1]) / (idx + 1)
    map_score = ap / num_relevant
    return map_score

def calculate_NDCG(sorted_indices, test_y):
    # Calculate Normalized Discounted Cumulative Gain (NDCG) at k=10
    dcg = 0
    idcg = sum(1 / np.log2(np.arange(2, k + 2)))
    for idx, i in enumerate(sorted_indices[:k]):
        if test_y.iloc[i] == 1:
            dcg += 1 / np.log2(idx + 2)
    ndcg = dcg / idcg
    return ndcg

def calculate_PAK(sorted_indices, test_y):
    # Calculate Precision at k (P@k) 
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    precision_at_k = tp_at_k / k
    return precision_at_k

def calculate_RAK(sorted_indices, test_y):
    # Calculate Recall at k (R@k)
    tp_at_k = sum(test_y.iloc[sorted_indices[:k]])
    recall_at_k = tp_at_k / num_relevant
    return recall_at_k

In [16]:
pip install tabulate

Note: you may need to restart the kernel to use updated packages.


In [17]:
from tabulate import tabulate

# Calculate all measurements
baseline_measurements = [
    ("Accuracy", accuracy_score(test_y, predicted_labels)),
    ("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Baseline measurements")
print(tabulate(baseline_measurements, headers=["Metric", "Value"]))


Baseline measurements
Metric                                                    Value
----------------------------------------------------  ---------
Accuracy                                              0.897887
ROC AUC Score                                         0.696552
Mean Reciprocal Rank (MRR)                            0.125
Mean Average Precision (MAP)                          0.115021
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.0694312
Precision at k (P@k) at k=10                          0.1
Recall at k (R@k) at k=10                             0.0344828


In [24]:
from sklearn.model_selection import GridSearchCV
# Training the model using Random Forest by using best parameters

param_grid = {
    'n_estimators': [100, 500, 1000]
}

# Initialize the Random Forest classifier
rf = RandomForestClassifier()

# Hypertuning parameters using 5-Fold Cross Validation method
clf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')
clf.fit(train_X, train_y)

In [25]:
# Get the best parameters
best_params = clf.best_params_
print("Best Parameters:", best_params)

# Use the best estimator to make predictions
best_rf = clf.best_estimator_

Best Parameters: {'n_estimators': 100}


In [26]:
# Predicting the probabilities of Test set
preds = best_rf.predict_proba(test_X)
count = 1

# Ranking done according to the probability scores
for i in pd.DataFrame(preds).sort_values(by=1, ascending=False).index:
    print(str(count) + " " + str(df[df.Year==2023].reset_index().at[i, "Name"]))
    count += 1

1 Jakorian Bennett
2 C.J. Stroud
3 Byron Young
4 Dante Stills
5 Christian Gonzalez
6 Emmanuel Forbes
7 Anthony Richardson
8 Darnell Wright
9 Bryce Young
10 Will Anderson Jr.
11 DJ Turner
12 Adetomiwa Adebawore
13 Blake Freeland
14 Jaren Hall
15 Isaiah Foskey
16 Hendon Hooker
17 Nolan Smith
18 Joe Tippmann
19 Tyler Steen
20 Cam Smith
21 Marvin Mims
22 Myles Brooks
23 Lukas Van Ness
24 Owen Pappoe
25 Kelee Ringo
26 Richard Gouraige
27 Nick Hampton
28 Carter Warren
29 Dawand Jones
30 Quentin Johnston
31 YaYa Diaby
32 Paris Johnson Jr.
33 Tre'Vius Hodges-Tomlinson
34 Carrington Valentine
35 Deonte Banks
36 Ryan Hayes
37 Thomas Incoom
38 John Ojukwu
39 Yasir Abdullah
40 Riley Moss
41 Jartavius Martin
42 Rejzohn Wright
43 Malik Cunningham
44 Jon Gaines
45 Joey Porter Jr.
46 Julius Brents
47 Myles Murphy
48 Jason Taylor II
49 Isaiah McGuire
50 Jalen Redmond
51 Clark Phillips III
52 Ali Gaye
53 Trenton Simpson
54 Sydney Brown
55 Jacob Copeland
56 Cameron Brown
57 Jalen Brooks
58 Josh Downs
59 

In [27]:
from sklearn.metrics import accuracy_score, roc_auc_score
import numpy as np

# Convert predicted probabilities to binary predictions based on a threshold (e.g., 0.5)
predicted_labels = (preds[:, 1] > 0.5).astype(int)

# Evaluation for ranking metrics
# Sort the predictions based on probability scores
sorted_indices = np.argsort(-preds[:, 1])

# Calculate all measurements
best_rf_measurements = [
    ("Accuracy", accuracy_score(test_y, predicted_labels)),
    ("ROC AUC Score", roc_auc_score(test_y, baseline_preds[:, 1])),
    ("Mean Reciprocal Rank (MRR)", calculate_MRR(sorted_indices, test_y)),
    ("Mean Average Precision (MAP)", calculate_MAP(sorted_indices, test_y)),
    ("Normalized Discounted Cumulative Gain (NDCG) at k=10", calculate_NDCG(sorted_indices, test_y)),
    ("Precision at k (P@k) at k=10", calculate_PAK(sorted_indices, test_y)),
    ("Recall at k (R@k) at k=10", calculate_RAK(sorted_indices, test_y))
]

# Print measurements in a table format
print("Best Fit Random Forest measurements")
print(tabulate(best_rf_measurements, headers=["Metric", "Value"]))

Best Fit Random Forest measurements
Metric                                                   Value
----------------------------------------------------  --------
Accuracy                                              0.897887
ROC AUC Score                                         0.696552
Mean Reciprocal Rank (MRR)                            0.5
Mean Average Precision (MAP)                          0.118354
Normalized Discounted Cumulative Gain (NDCG) at k=10  0.591464
Precision at k (P@k) at k=10                          0.7
Recall at k (R@k) at k=10                             0.241379


# Comparative Analysis of Baseline and Best-Fit Random Forest Models for Ranking Prediction

The comparison between the baseline and best-fit Random Forest models reveals notable differences in performance across various metrics. In terms of accuracy and ROC AUC score, both models exhibit similar results. However, significant improvements are observed in the best-fit model for ranking-related metrics. The Mean Reciprocal Rank (MRR) shows a substantial increase, indicating that the best-fit model provides more relevant and accurate predictions at the top of the ranked list compared to the baseline. Similarly, the Mean Average Precision (MAP) and Precision at k (P@k) at k=10 metrics demonstrate considerable enhancements, implying better precision in predicting relevant instances within the top results. Moreover, the Normalized Discounted Cumulative Gain (NDCG) at k=10 reflects a notable improvement, suggesting that the best-fit model produces more relevant results at the top ranks, which is crucial for ranking tasks. Despite these improvements, the recall at k (R@k) at k=10 remains relatively low for both models, indicating a challenge in capturing all relevant instances within the top k results. 

Overall, while the baseline model provides reasonable predictive performance, the best-fit Random Forest model significantly enhances the model's ability to accurately rank and prioritize instances, particularly at the top of the list, thereby improving its utility in predicting NFL Draft.
