This data science project analyzes NBA player performance data and predicts player salaries using data analysis and machine learning models. The analysis identifies the key metrics that influence salaries and provides actionable insights through data visualization, database queries, and modeling.
The dataset was sourced from Kaggle. The data can be downloaded from here: NBA Salary Dataset. The two important datasets used are 'NBA Player Stats.csv' and 'NBA Salaries.csv'.
- Merged the player statistics and salary datasets to include player performance statistics, positions, teams, and salaries. This involved joining the datasets on common attributes.
- Handled missing values and outliers across key features like points, assists, rebounds, and salary. This involved data cleaning, handling null values, and removing outliers using the z-score method.
- Converted numerical columns to the appropriate data types for mathematical operations.
- Standardized features and scaled data where necessary.
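The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`Player`, `Season`, `PTS`, `AST`, `Salary`), the join keys, and the z-score threshold of 3 are all assumptions, and the tiny inline tables stand in for the two CSV files.

```python
import pandas as pd

# Tiny stand-ins for 'NBA Player Stats.csv' and 'NBA Salaries.csv'.
# Column names and join keys are assumptions made for illustration.
stats = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Season": [2022] * 6,
    "PTS": ["1200", "950", "400", None, "1500", "700"],  # read in as text
    "AST": [300, 210, 80, 150, 420, 90],
})
salaries = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Season": [2022] * 6,
    "Salary": [8e6, 6e6, 2e6, 3e6, 9e6, 4e6],
})

# Join the two tables on their shared keys.
merged = stats.merge(salaries, on=["Player", "Season"], how="inner")

# Convert text columns to numeric; unparseable entries become NaN.
merged["PTS"] = pd.to_numeric(merged["PTS"], errors="coerce")

# Drop rows with missing values in the key features.
cleaned = merged.dropna(subset=["PTS", "AST", "Salary"])


def remove_outliers_zscore(df, columns, threshold=3.0):
    """Keep rows whose |z-score| is within `threshold` for every listed column."""
    z = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    return df[(z.abs() <= threshold).all(axis=1)]


filtered = remove_outliers_zscore(cleaned, ["PTS", "AST", "Salary"])
print(filtered)
```

With real data, the threshold is a judgment call: a lower value removes more rows, including legitimately high salaries.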
We derived new variables from existing ones to better capture underlying patterns in the data.
- Weighted Efficiency (WEFF): Combines points, assists, rebounds, steals, blocks, and turnovers, normalized by games played.
- Points Per Game (PPG): Points scored divided by games played.
- Assists Per Game (APG): Assists divided by games played.
- Rebounds Per Game (RPG): Total rebounds divided by games played.
- Steals Per Game (SPG): Steals divided by games played.
- Blocks Per Game (BPG): Blocks divided by games played.
- Turnovers Per Game (TPG): Turnovers divided by games played.
- Usage Rate: Estimate of a player's involvement in offensive plays, based on field goals attempted, free throws attempted, and turnovers.
- Shooting Efficiency: Average of field goal percentage and effective field goal percentage.
- Offensive Contribution: Weighted sum of points, assists, and offensive rebounds.
- Defensive Contribution: Sum of defensive rebounds, steals, and blocks.
- Experience: Estimated years of professional activity, assuming players start their careers at age 19.
- Games Started Percentage (GS%): Games started as a percentage of games played.
- Impact Score: Weighted Efficiency (WEFF) per minute played.
- Minutes Played Per Game (MPG): Total minutes played divided by games played.
- Efficiency Tiers: Players categorized into low, moderate, and high efficiency tiers.
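A few of the derived features listed above could be computed along these lines. This is a sketch under assumptions: the column names follow common box-score abbreviations (`G`, `GS`, `MP`, `PTS`, ...), the WEFF formula uses equal weights because the actual weights are not stated here, and the efficiency tiers are cut at terciles, which is only one possible choice.

```python
import pandas as pd

# Synthetic season totals; column names are assumed, not taken from the dataset.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Age": [24, 31, 20, 27, 35, 22],
    "G": [70, 60, 40, 82, 55, 65],
    "GS": [70, 30, 0, 82, 11, 13],
    "MP": [2450, 1500, 480, 2870, 990, 1300],
    "PTS": [1400, 720, 200, 1968, 440, 585],
    "AST": [350, 180, 40, 574, 110, 130],
    "TRB": [420, 300, 120, 656, 275, 260],
    "STL": [70, 48, 12, 98, 33, 39],
    "BLK": [35, 42, 8, 41, 22, 26],
    "TOV": [140, 90, 32, 246, 55, 78],
})

# Per-game rates: season totals divided by games played.
per_game = {"PTS": "PPG", "AST": "APG", "TRB": "RPG",
            "STL": "SPG", "BLK": "BPG", "TOV": "TPG", "MP": "MPG"}
for total, rate in per_game.items():
    df[rate] = df[total] / df["G"]

# Games started as a percentage of games played.
df["GS%"] = 100 * df["GS"] / df["G"]

# Experience assumes every player started their career at age 19.
df["Experience"] = df["Age"] - 19

# ASSUMPTION: equal weights with turnovers subtracted; the project's
# actual WEFF weights are not given in this README.
df["WEFF"] = (df["PTS"] + df["AST"] + df["TRB"] + df["STL"] + df["BLK"]
              - df["TOV"]) / df["G"]

# ASSUMPTION: tiers are terciles of WEFF.
df["Efficiency Tier"] = pd.qcut(df["WEFF"], q=3,
                                labels=["low", "moderate", "high"])

print(df[["Player", "PPG", "GS%", "Experience", "WEFF", "Efficiency Tier"]])
```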
- Visualization:
  - Created plots of the correlation between each metric and salary to identify the most important metrics.
  - Plotted each numeric metric against salary to find the best features for predicting salary.
  - Some plots we made include:
    - Salary trends by position, efficiency, and season, using bar plots, scatter plots, and line plots.
    - Additional salary-related visualizations to better reveal trends for predicting overall salary from player stats and performance.
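The correlation plots described above rest on a simple ranking of each metric's correlation with salary. A minimal sketch of that ranking, using made-up numbers and assumed column names:

```python
import pandas as pd

# Synthetic values for illustration only.
df = pd.DataFrame({
    "PPG": [20.0, 12.0, 5.0, 24.0, 8.0, 9.0],
    "APG": [5.0, 3.0, 1.0, 7.0, 2.0, 2.0],
    "Age": [24, 31, 20, 27, 35, 22],
    "Salary": [8e6, 5e6, 2e6, 10e6, 3e6, 4e6],
})

# Pearson correlation of every metric with Salary,
# ordered by absolute strength (strongest first).
corr_with_salary = (
    df.corr()["Salary"]
    .drop("Salary")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_salary)
```

The resulting Series can be passed straight to a bar plot (for example `corr_with_salary.plot.barh()` when matplotlib is installed) to get the kind of chart described above.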
- Created a relational SQLite database for querying player statistics.
- Built a schema and then inserted our dataset into the local database.
- Executed advanced SQL queries that:
  - Analyzed salary trends in relation to NBA player stats, performance metrics, and efficiency.
  - Explored year-on-year salary growth and distribution across seasons.
  - Examined salary variations across different age groups and career stages.
  - Investigated the impact of specific contributions (offensive/defensive) and efficiency on player earnings.
- By incorporating database management, we were able to easily query and find salary-related trends, which helped us build the overall machine learning model.
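As an illustration of the database step, the sketch below builds a small table and runs a year-on-year salary growth query. The table name `player_seasons`, its columns, and the rows are hypothetical, and an in-memory database is used here in place of the project's local database file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a local .db file

# Hypothetical schema: one row per player per season.
conn.execute("""
    CREATE TABLE player_seasons (
        player TEXT NOT NULL,
        season INTEGER NOT NULL,
        age    INTEGER,
        ppg    REAL,
        salary REAL,
        PRIMARY KEY (player, season)
    )
""")

rows = [
    ("A", 2021, 23, 18.0, 6_000_000),
    ("A", 2022, 24, 20.0, 8_000_000),
    ("B", 2021, 30, 13.0, 5_000_000),
    ("B", 2022, 31, 12.0, 6_000_000),
]
conn.executemany("INSERT INTO player_seasons VALUES (?, ?, ?, ?, ?)", rows)

# Year-on-year growth of the average salary, computed by joining
# each season's average to the previous season's average.
query = """
    WITH season_avg AS (
        SELECT season, AVG(salary) AS avg_salary
        FROM player_seasons
        GROUP BY season
    )
    SELECT cur.season,
           cur.avg_salary,
           ROUND(100.0 * (cur.avg_salary - prev.avg_salary)
                 / prev.avg_salary, 1) AS pct_growth
    FROM season_avg AS cur
    LEFT JOIN season_avg AS prev ON prev.season = cur.season - 1
    ORDER BY cur.season
"""
growth = conn.execute(query).fetchall()
conn.close()

for season, avg_salary, pct_growth in growth:
    print(season, avg_salary, pct_growth)
```

The first season has no predecessor, so its growth value comes back as `None`.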
- Linear Regression:
  - Established a baseline model for salary prediction. This model did not perform well.
  - Evaluated using Mean Squared Error (MSE), Mean Absolute Error (MAE), and R² score.
- Decision Tree Regressor:
  - Improved prediction accuracy by capturing non-linear relationships through a tree of feature-based splits.
- Neural Network:
  - Uses a network of nodes (one hidden layer, with a greedy optimization approach).
- Random Forest Regressor:
  - Reduced overfitting and improved robustness by combining the predictions of multiple decision trees. This turned out to be the best model, with a relatively high R² value.
- PRESS, Cp, bootstrapping, and k-fold cross-validation:
  - Helped evaluate the model's performance and analyze how robust it was.
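The comparison of models above can be sketched with scikit-learn as follows. The data is synthetic, the hyperparameters are arbitrary, and the neural network is left out to keep the example short, so the numbers printed here say nothing about the project's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: three features (think PPG, APG, RPG) and a
# salary with a non-linear dependence on the first feature.
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0, 30, size=(n, 3))
y = 1e5 * X[:, 0] ** 1.5 + 2e5 * X[:, 1] + rng.normal(0, 2e5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MSE": mean_squared_error(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "R2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['MAE']:,.0f}  R2={scores['R2']:.3f}")
```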
- Integrated a user-interactive dashboard using `ipywidgets` to dynamically input player statistics and predict salaries.
- Added a slider and text boxes for value inputs.
- Used a trained Random Forest Regressor model (the best-performing model) to predict salaries dynamically.
- Displayed predicted salaries after a button click to create a user-friendly dashboard.
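A notebook dashboard of this kind might look roughly like the sketch below. The function names `predict_salary` and `build_dashboard`, the three inputs, and the coefficients are all hypothetical; in the real notebook the prediction would come from the trained Random Forest model rather than a fixed formula.

```python
def predict_salary(ppg, apg, rpg):
    """Placeholder for the trained model's predict call; coefficients are made up."""
    return 300_000 * ppg + 200_000 * apg + 100_000 * rpg


def build_dashboard():
    """Build and display the widgets; call this from a notebook cell."""
    # Imported here so predict_salary stays usable without ipywidgets installed.
    import ipywidgets as widgets
    from IPython.display import display

    ppg = widgets.FloatSlider(value=10.0, min=0.0, max=40.0, step=0.5,
                              description="PPG")
    apg = widgets.FloatText(value=3.0, description="APG")
    rpg = widgets.FloatText(value=4.0, description="RPG")
    button = widgets.Button(description="Predict salary")
    output = widgets.Output()

    def on_click(_button):
        # Replace the previous result with a fresh prediction.
        with output:
            output.clear_output()
            salary = predict_salary(ppg.value, apg.value, rpg.value)
            print(f"Predicted salary: ${salary:,.0f}")

    button.on_click(on_click)
    display(widgets.VBox([ppg, apg, rpg, button, output]))
```

Running `build_dashboard()` in a notebook cell renders the slider, text boxes, and button. The prediction appears below them after each click.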
- Developed a web-based interactive dashboard using Dash for dynamic player salary predictions.
- Integrated sliders and dropdowns for real-time input of player stats.
- Displayed predicted salaries using the best-performing Random Forest Regressor model.
- Enhanced user accessibility with visual styling, animations, and responsiveness for mobile devices.
- This is deployed using Heroku. Here is the link:
- Setup and Environment:
  - Clone this repository and open the Jupyter Notebook file `finalProj.ipynb` in a Python notebook platform like Jupyter Notebook or CodeBench.
- Data Import:
  - The notebook imports multiple `.csv` files containing player statistics/info and salaries. These were downloaded from the Kaggle data link.
  - Reads and retrieves the data using `pandas` and merges datasets into a consolidated DataFrame.
- Data Cleaning and Processing:
  - Handles missing values and outliers.
  - Converts numerical columns to appropriate data types.
  - Performs feature engineering to calculate metrics like Weighted Efficiency (WEFF).
- Visualization:
  - Creates various plots and charts (e.g., bar plots, scatter plots, heatmaps) to understand trends and relationships in the data.
  - Analyzes salary trends by season, player position, and efficiency tiers.
- SQL Queries:
  - Transfers the data into an SQLite database.
  - Executes queries to analyze salary trends, identify top players, evaluate team performance, etc.
- Machine Learning:
  - Trains models (Linear Regression, Decision Tree, Neural Network, Random Forest) to predict salaries based on player performance metrics.
  - Evaluates models using metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R² score.
  - Creates robust models that can help predict NBA player salaries in future seasons/years given their current stats/performance.
- Model Evaluation:
  - Applies statistical concepts such as PRESS and Mallows' Cp to choose the best model and assess its performance.
  - Performs bootstrapping to evaluate the final model's performance on multiple resampled training sets to get 95% confidence intervals for Mean Squared Error (MSE) and R² scores.
  - Does k-fold cross-validation by splitting the dataset into k subsets (folds). The model is trained on k-1 folds and tested on the remaining fold, repeating the process k times so that each fold serves as the test set once. This method ensures that the model is tested on different subsets of data, providing a robust estimate of its performance.
- Local Dashboard:
  - Implements a user-interactive dashboard using `ipywidgets` to allow users to input player stats dynamically.
  - Predicts salaries using the best-performing model (Random Forest Regressor) and displays results interactively.
- Web Dashboard:
  - Deploys an interactive Dash-based web application for real-time salary predictions.
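The model evaluation step in the list above (PRESS, bootstrapping, and k-fold cross-validation) can be illustrated with a small NumPy sketch. It uses an ordinary least-squares model on synthetic data, which is an assumption made to keep the example short. The hat-matrix shortcut for PRESS applies only to linear least-squares fits, while the k-fold and bootstrap loops carry over to any model with a fit and predict step. The helper names are hypothetical.

```python
import numpy as np

# Synthetic data: intercept plus two features, salary in $ millions.
rng = np.random.default_rng(0)
n = 120
X = np.column_stack([np.ones(n), rng.uniform(0, 30, size=(n, 2))])
y = X @ np.array([1.0, 0.4, 0.2]) + rng.normal(0, 1.0, size=n)


def fit(X, y):
    """Ordinary least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def mse(X, y, beta):
    return float(np.mean((y - X @ beta) ** 2))


# PRESS: sum of squared leave-one-out residuals. For least squares each
# one equals e_i / (1 - h_ii), where h_ii is the hat-matrix diagonal.
beta = fit(X, y)
H = X @ np.linalg.inv(X.T @ X) @ X.T
loo_resid = (y - X @ beta) / (1.0 - np.diag(H))
press = float(np.sum(loo_resid ** 2))


def kfold_mse(X, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, once per fold."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(mse(X[test], y[test], fit(X[train], y[train])))
    return scores


def bootstrap_mse_ci(X_train, y_train, X_test, y_test, n_boot=500, seed=0):
    """95% percentile interval for test MSE over resampled training sets."""
    boot_rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        sample = boot_rng.integers(0, len(y_train), size=len(y_train))
        scores.append(mse(X_test, y_test, fit(X_train[sample], y_train[sample])))
    low, high = np.percentile(scores, [2.5, 97.5])
    return float(low), float(high)


cv_scores = kfold_mse(X, y)
ci_low, ci_high = bootstrap_mse_ci(X[:90], y[:90], X[90:], y[90:])

print(f"PRESS: {press:.2f}")
print(f"5-fold mean MSE: {np.mean(cv_scores):.3f}")
print(f"Bootstrap 95% CI for MSE: [{ci_low:.3f}, {ci_high:.3f}]")
```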
The following libraries are used and must be imported:
- `pandas`
- `numpy`
- `matplotlib`
- `seaborn`
- `scikit-learn`
- `sqlite3`
- `statsmodels`
- `tensorflow`
- `ipywidgets`
- `dash`
- Clone this repository:

      git clone https://github.com/vedp2003/NBAMachineLearningProject.git

- Navigate to the project folder:

      cd NBAMachineLearningProject

- Download the dataset CSVs. Save them in the same directory/folder as the Jupyter notebook.
- Open and run the Jupyter notebook `nba_salary_prediction_project.ipynb` cell by cell to execute the analysis.
- Additional steps and installations may be needed to successfully run the interactive dashboard in the notebook. See below for steps.
To enable the local dashboard in Jupyter Notebook, follow these steps:
- Verify Node.js and npm: Run the following commands to check whether Node.js and npm are installed and which versions you have:

      node -v
      npm -v

- Upgrade Node.js if necessary: If your Node.js version is below 20.0.0, you may need to upgrade it using the following commands:

      wget https://nodejs.org/dist/v20.8.0/node-v20.8.0-linux-x64.tar.xz
      tar -xf node-v20.8.0-linux-x64.tar.xz
      mv node-v20.8.0-linux-x64 /path/to/your/directory/nodejs  # Replace /path/to/your/directory with the directory you want
      export PATH=/path/to/your/directory/nodejs/bin:$PATH  # Replace with the same directory as above

- Make sure the updated Node.js path is active: You can ensure the updated Node.js path is active by running this:

      import os
      os.environ['PATH'] = "/path/to/your/directory/nodejs/bin:" + os.environ['PATH']  # Replace /path/to/your/directory with the directory

- Install required Python packages: Install the necessary Python packages for widget functionality:

      pip install --user --upgrade ipywidgets
      pip install --user jupyterlab_widgets
      pip install --upgrade jupyterlab
      pip install --upgrade jupyterlab_widgets

- Install JupyterLab extensions: Install the required JupyterLab extensions for enabling widgets:

      jupyter labextension install @jupyter-widgets/jupyterlab-manager  # Run this as long as there are no permission constraints
      jupyter labextension install @jupyter-widgets/jupyterlab-manager --app-dir=$(jupyter --data-dir)/lab  # You can also run this if you want to install the extensions in your home directory

- Rebuild JupyterLab and restart the kernel: After installing the extensions, rebuild JupyterLab to integrate the changes. Restart the kernel to ensure all updates take effect. It can be rebuilt by running: `!jupyter lab build`

The commands listed above can be executed in the terminal. However, they can also be run directly within Jupyter Notebook cells by adding a `!` in front of each command. For example:

      !pip install --user ipywidgets
      !jupyter labextension install @jupyter-widgets/jupyterlab-manager
      !node -v

To enable the web dashboard, follow these steps:
- Run the dashboard: Run the following command to start the web dashboard:

      python nba_salary_dashboard.py

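For readers unfamiliar with Dash, a script of this kind generally has the shape sketched below. This is not the contents of `nba_salary_dashboard.py`: the component ids, the position labels, the `predict_salary` placeholder, and its coefficients are all hypothetical, and the real dashboard uses the trained Random Forest model.

```python
def predict_salary(ppg, position):
    """Placeholder for the trained model's predict call; numbers are made up."""
    position_bonus = {"G": 0, "F": 250_000, "C": 500_000}[position]
    return 300_000 * ppg + position_bonus


def build_app():
    """Create the Dash app with one slider, one dropdown, and a result line."""
    # Imported here so predict_salary stays usable without Dash installed.
    from dash import Dash, Input, Output, dcc, html

    app = Dash(__name__)
    app.layout = html.Div([
        html.H2("NBA Salary Predictor"),
        dcc.Slider(id="ppg", min=0, max=40, step=0.5, value=10),
        dcc.Dropdown(id="position", options=["G", "F", "C"],
                     value="G", clearable=False),
        html.Div(id="prediction"),
    ])

    # Recompute the prediction whenever either input changes.
    @app.callback(
        Output("prediction", "children"),
        Input("ppg", "value"),
        Input("position", "value"),
    )
    def update_prediction(ppg, position):
        return f"Predicted salary: ${predict_salary(ppg, position):,.0f}"

    return app
```

Calling `build_app().run(debug=True)` starts a local development server in recent Dash releases. Older releases use `run_server` instead.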