# Classifying NBA All-Defensive Votes


# Introduction: 

## Background:
In the National Basketball Association (NBA), players can be elected to an All-Defensive team for exemplary defense throughout the season. A group of voters fills out ballots naming two "teams" of players they think had the best defense: a first team and a second team. A first-team nomination is worth 2 points and a second-team nomination is worth 1 point. The players with the most points are assigned to the All-Defensive teams.
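
The point scheme above can be sketched in a few lines. This is only an illustration: the ballot structure and the player names are hypothetical placeholders, not taken from the dataset.

```python
from collections import Counter

# Points per ballot slot, as described in the paragraph above.
FIRST_TEAM_POINTS = 2
SECOND_TEAM_POINTS = 1


def tally_ballots(ballots):
    """Sum points per player across ballots.

    Each ballot is a (first_team, second_team) pair of player-name lists.
    """
    points = Counter()
    for first_team, second_team in ballots:
        for player in first_team:
            points[player] += FIRST_TEAM_POINTS
        for player in second_team:
            points[player] += SECOND_TEAM_POINTS
    return points


# Hypothetical ballots with placeholder player names.
ballots = [
    (["Player A", "Player B"], ["Player C"]),
    (["Player A", "Player C"], ["Player B"]),
]
points = tally_ballots(ballots)

# A player counts as "voted" if they appear on at least one ballot.
voted = {player: total > 0 for player, total in points.items()}
```

Any player with a nonzero total appeared on at least one ballot, which is the yes/no outcome the classifiers in this project try to predict.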
## The Task
We are building classifiers that use an NBA player's statistics and team to determine whether the player appeared on anyone's ballot (that is, received at least one vote).
## The Data
Our dataset, [NBA Stats (1947-present)](https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats?resource=download&select=Team+Stats+Per+Game.csv), came from [Kaggle](https://www.kaggle.com). It was created by Sumitro Datta and consists of 22 separate CSV files totaling 32 MB. It contains player, team, and award data from the past 78 NBA seasons.  
We use 6 of the CSV files: player advanced stats, player basic stats, team basic stats, team advanced stats, opposing team stats, and end-of-season voting information.

## Preview

Our best classifiers were Naive Bayes and Random Forest, with F1 scores of 92% and ??%, respectively.


# Data Analysis:

## Final Dataset

As mentioned previously, we joined 6 separate tables, but all of them needed pruning first. We removed all data from before the 1980 season from each table. The end-of-season voting data was missing the 2025 voting results, so we added those manually. We converted the voting results into a binary value (received a vote or not) rather than keeping vote totals or team assignments. After joining, the final table has 35 attributes and 3412 rows.
Attributes: 
* abbreviation (Team Name Abbreviation)
* g, gs (Games, Games Started) 
* pos   (Position)
* mp, drb, trb, stl, blk, pf (Minutes Played, Defensive Rebounds, Total Rebounds, Steals, Blocks, Personal Fouls) per game
* drb, trb, stl, blk (Defensive Rebound, Total Rebound, Steal, Block) percentage
* dws, ws, ws_48 (Defensive Win-Shares, Total Win-Shares, Win-Shares per 48 Minutes)
* dbpm, bpm, vorp (Defensive Box Plus-Minus, Box Plus-Minus, Value Over Replacement Player)
* voted (Whether or not a player received a vote) <- This is what we are classifying
* team (Team the player is on; provides the team statistics below)
* w, l, pw, pl, mov, srs, d_rtg, pace, drb_percent (Wins, Losses, Projected Wins, Projected Losses, Margin of Victory, Simple Rating System, Defensive Rating, Pace, Defensive Rebound Percentage)
* opp_e_fg_percent, opp_tov_percent, opp_ft_fga (Opponent Effective Field Goal Percentage, Opponent Turnover Percentage, Opponent Free Throws per Field Goal Attempt)
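
The pruning and joining steps described in this section might look roughly like the pandas sketch below. The tiny tables, the join keys (`season`, `player`, `abbreviation`), the `points` column, and all values are assumptions made for illustration; the real CSV files may use different column names.

```python
import pandas as pd

# Tiny stand-in tables; column names and values are placeholders.
players = pd.DataFrame({
    "season": [1979, 1980, 1980, 2024],
    "player": ["P1", "P2", "P3", "P4"],
    "abbreviation": ["BOS", "BOS", "LAL", "DEN"],
    "stl": [1.2, 2.1, 0.8, 1.5],
})
teams = pd.DataFrame({
    "season": [1980, 1980, 2024],
    "abbreviation": ["BOS", "LAL", "DEN"],
    "d_rtg": [101.9, 103.9, 112.3],
})
votes = pd.DataFrame({
    "season": [1980],
    "player": ["P2"],
    "points": [14],
})

# 1. Drop seasons before 1980.
players = players[players["season"] >= 1980]

# 2. Attach vote totals; players absent from the voting table get 0 points,
#    then the total is collapsed into a binary "voted" label.
merged = players.merge(votes, on=["season", "player"], how="left")
merged["voted"] = merged["points"].fillna(0).gt(0).astype(int)
merged = merged.drop(columns="points")

# 3. Attach team-level statistics by season and team abbreviation.
final = merged.merge(teams, on=["season", "abbreviation"], how="left")
```

Left joins keep every player row even when a matching vote or team row is missing, so missing matches show up as empty values instead of silently dropping players.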


Originally we had 24,615 rows with only 1706 positive instances, which gave our classifiers high accuracy but low precision, recall, and F1. We decided to randomly select 1706 of the roughly 22,900 negative instances so that 50% of our data is positive, rather than the original 6.9%. Below are the before and after balancing results.
    

<img src=plots/defensive_team_pie_chart.png width = 600> </img> <img src=plots/balanced/defensive_team_pie_chart.png width = 600> </img>
<img src=plots/players_per_season_total_vs_voted.png width = 600> </img> <img src=plots/balanced/players_per_season_total_vs_voted.png width = 600> </img>
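
The balancing step is a form of random undersampling of the majority class. A minimal sketch, assuming each row is a dictionary with a binary `voted` field (the helper name `undersample` and the toy rows are hypothetical):

```python
import random


def undersample(rows, label_key="voted", seed=0):
    """Keep all positives and an equally sized random sample of negatives."""
    positives = [row for row in rows if row[label_key] == 1]
    negatives = [row for row in rows if row[label_key] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_negatives = rng.sample(negatives, k=len(positives))
    balanced = positives + sampled_negatives
    rng.shuffle(balanced)
    return balanced


# Toy data with 7% positives, close to the original class ratio.
rows = [{"id": i, "voted": 1 if i < 7 else 0} for i in range(100)]
balanced = undersample(rows)
```

This assumes there are at least as many negatives as positives, which holds for this dataset. Fixing the seed keeps the balanced subset the same across runs, so classifier scores stay comparable.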


# Classification Results: 
We implemented and compared five classifiers: **a Dummy baseline, k-Nearest Neighbors (k=5), a Decision Tree (depth=5), a Random Forest (20 trees, depth=5, max_features=3), and a Naive Bayes classifier**. kNN used standardized features, and the tree-based models used depth limits to reduce overfitting. All models were evaluated on the same held-out test set.
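
For readers who want to try a similar comparison, the sketch below sets up the five models with scikit-learn on synthetic data. It is a stand-in, not the code used in this project: the classifiers here may differ from our implementations, and the Dummy strategy, the Gaussian variant of Naive Bayes, the split size, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the player/team feature table.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Assumed strategy: random guesses that follow the class distribution.
    "dummy": DummyClassifier(strategy="stratified", random_state=0),
    # kNN is distance based, so features are standardized first.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "forest": RandomForestClassifier(
        n_estimators=20, max_depth=5, max_features=3, random_state=0
    ),
    "naive_bayes": GaussianNB(),
}

# Fit every model on the same training split and score on the same test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test))
```

Putting the scaler inside the kNN pipeline means it is fit on the training split only, which avoids leaking test-set statistics into the model.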

To evaluate predictive ability, we used accuracy, precision, recall, F1-score, and confusion matrices, which allowed us to examine not only overall correctness but also the types of errors each model tended to make.
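
As a reminder of how these metrics relate to the confusion matrix, here is a small self-contained sketch. The helper `binary_metrics` and the toy labels are illustrative only.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: one false negative and one false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Precision penalizes false positives (players predicted to get a vote who did not), while recall penalizes false negatives (voted players the model missed).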

### Performance Summary

- Dummy Classifier performed near chance (Accuracy 0.48), providing a baseline.

- kNN (k=5) achieved the highest F1-score (0.892) and the second-highest accuracy (0.888), with strong precision (0.858) and recall (0.929).

- Decision Tree (depth=5) performed slightly worse than kNN but was still strong, with F1 = 0.871 and recall = 0.923.

- Random Forest (20 trees) was comparable to kNN in accuracy (0.884) and achieved the best recall (0.949) and lowest false negatives.

- Naive Bayes had the highest accuracy overall (0.953) but a significantly lower F1-score (0.667), because both its precision (0.630) and recall (0.707) were low. This suggests it predicts the majority class very well but struggles on the minority class.
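
The F1-scores reported above follow from the listed precision and recall values, since F1 is their harmonic mean. A quick check (the helper name `f1_from` is ours):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from the summary above.
knn_f1 = f1_from(0.858, 0.929)  # reported F1: 0.892
nb_f1 = f1_from(0.630, 0.707)   # reported F1: 0.667
```

Small differences in the third decimal place can appear because the reported precision and recall are themselves rounded.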

### Comparison and Best Model

All learned models outperformed the Dummy baseline, but their strengths differ:

- Naive Bayes achieved the highest accuracy but had the weakest F1-score, meaning its performance on the positive class is substantially worse than its overall accuracy suggests.

- kNN offers the best balance of accuracy and F1-score.

- Random Forest remains the strongest choice when minimizing false negatives is important, due to its high recall (0.949).

- Decision Tree performs well and is the easiest to interpret.

Considering both precision and recall together, the Random Forest and kNN provide the most reliable overall performance, while Naive Bayes, despite its high accuracy, performs less effectively on the positive class.

# Conclusion: 
In conclusion, classifying whether a player received an All-Defensive team vote from their individual and team statistics is very feasible. The best classifier to use is Naive Bayes, which provides high scores all around (~90%). Some inaccuracy remains, possibly because All-Defensive voting is subjective or because offense and defense in the league evolve naturally over time, which can make some seasons appear better defensively when in reality the game simply had a different dynamic.
## Challenges
At first we had trouble with precision, recall, and F1-score because our dataset was extremely imbalanced, but once we balanced it (at our professor's suggestion) our scores increased greatly. We lost a little accuracy, which is only natural because a single class no longer made up 93% of the dataset. The dataset was also very large, so it was difficult to pick which attributes to use. Additionally, the game of basketball evolves over time and changes from season to season. For instance, the average points a team scored per game last season was 114, while in 2010 it was 104. This can strongly affect players' defensive ratings purely because it was a different type of game. Finally, some of the classifiers we had developed earlier in class were not efficient enough to handle the amount of data we were processing. We had to switch to numpy arrays to get the classifiers to finish in less than 30 minutes.
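
To illustrate why numpy arrays help, here is a minimal sketch of a vectorized kNN prediction step. It is not our exact implementation; the function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k=5):
    """Predict binary labels by majority vote among the k nearest neighbors."""
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features).
    diffs = X_test[:, None, :] - X_train[None, :, :]
    # Squared Euclidean distances: shape (n_test, n_train).
    dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Indices of the k smallest distances per test row (unordered within the k).
    nearest = np.argpartition(dists, k - 1, axis=1)[:, :k]
    # Majority vote over the neighbors' labels (use odd k to avoid ties).
    votes = y_train[nearest]
    return (votes.mean(axis=1) >= 0.5).astype(int)


# Two well-separated toy clusters.
X_train = np.array(
    [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float
)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.2, 0.1], [5.5, 5.5]])
preds = knn_predict(X_train, y_train, X_test, k=3)
```

The broadcasted difference array replaces two nested Python loops with a single array operation. It needs memory proportional to test rows times training rows times features, so very large inputs may have to be processed in batches.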
## Improvements
We had a couple of ideas for improving our classifier performance. If we split the dataset into smaller chunks of seasons (5 to 10 years instead of 45), it could eliminate some of the inaccuracies mentioned above. We could also experiment with more attributes. To make our classifier more useful, we could have it predict All-NBA teams or votes using different statistics, or predict the specific defensive team each player was on.
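
The era-splitting idea could be prototyped along these lines. This is only a sketch: the `season` field, the helper names, and the choice to start eras at 1980 are assumptions.

```python
from collections import defaultdict


def era_start(season, first_season=1980, era_length=10):
    """Return the first season of the era that contains `season`."""
    offset = (season - first_season) // era_length
    return first_season + offset * era_length


def split_by_era(rows, era_length=10):
    """Group rows into fixed-length eras keyed by each era's first season."""
    eras = defaultdict(list)
    for row in rows:
        eras[era_start(row["season"], era_length=era_length)].append(row)
    return dict(eras)


# Toy rows; a real row would also carry the player and team attributes.
rows = [{"season": s} for s in (1980, 1989, 1990, 2004, 2025)]
eras = split_by_era(rows)
```

A separate classifier could then be trained and evaluated per era, at the cost of having fewer positive instances in each chunk.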


# Acknowledgments: 
Sources: [Kaggle](https://www.kaggle.com), [matplotlib](https://matplotlib.org), [NBA](https://www.nba.com), [scikit-learn](http://scikit-learn.org/), [Decision Tree Classification in Python](https://www.youtube.com/watch?v=sgQAhG5Q7iY&t=909s), and class materials
