This is the final project for Advanced Data Science @ JHU. The project builds a model that predicts the winning probability of each of the two opposing teams in a basketball game from team-level and player-level statistics. After simulating the remaining games of the 2019-2020 season, we present the resulting playoff picture in the Western Conference.
The data folder contains our training and testing data scraped from Basketball Reference:
- game_stats_all.csv: game results (scores) for all regular-season games from season 2015-16 to 2018-19 and for the regular-season games already played in season 2019-20, plus the schedule of the remaining games in season 2019-20
- team_stats_all.csv: team statistics over the first 20 games of each season from 2015-16 to 2019-20
- pre_20_games.rds: results of the first 20 games for each team during season 2019-20
- post_62_games_pred.rds: predictions for the rest of season 2019-20

Only the first two CSV files are needed to run the R Markdown code.
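As a quick check before knitting, the two required CSV files can be loaded and inspected. This is a minimal sketch, assuming the working directory is the repository root and the files live under data/; the helper name load_csv is hypothetical and not part of the project code, and no column names are assumed.

```r
# Assumption: the script runs from the repository root, so inputs are under "data/".
data_dir <- "data"
csv_files <- file.path(data_dir, c("game_stats_all.csv", "team_stats_all.csv"))

# Hypothetical helper: read a CSV if it exists, otherwise return NULL.
load_csv <- function(path) {
  if (file.exists(path)) read.csv(path, stringsAsFactors = FALSE) else NULL
}

inputs <- lapply(csv_files, load_csv)
names(inputs) <- basename(csv_files)

# Report the size of each file that was found.
for (name in names(inputs)) {
  if (is.null(inputs[[name]])) {
    message(name, ": not found under ", data_dir)
  } else {
    message(name, ": ", nrow(inputs[[name]]), " rows, ",
            ncol(inputs[[name]]), " columns")
  }
}
```

If either file is reported as missing, the R Markdown code will likely fail at its first read step.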
The plot folder contains plots from our EDA:
- back2back.png: boxplot of the proportion of games won, stratified by whether the game is a back-to-back game
- home_away.png: histogram of the total wins of each team, stratified by home/away and conference, during season 2018-19
- player_ws_plot.png: boxplot of the players' win share rankings from season 2014-15 to season 2018-19
- team_cluster_plot.png: five clusters of the teams from 2015-16 to 2019-20, found with the k-means algorithm
- team_def_off_plot.png: scatter plot of points scored by each team vs. points scored by its opponents in the first 20 games of season 2018-19
- team_pc_plot.png: the first two principal components of all the teams from 2015-16 to 2019-20
- team_rank_plot.png: scatter plot of final conference standing vs. conference standing after 20 games during season 2018-19
- team_win_plot.png: scatter plot of final total wins vs. total wins after 20 games during season 2018-19
- team_ws_plot.png: line segment plot of the win shares of the best players on each team
- west_east.png: histogram of wins and losses when teams face the opposing conference
- year_dist.png: heatmap of the distance between every pair of seasons
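The cluster and principal component plots listed above combine two standard steps, k-means and PCA. The following is a minimal sketch in base R on synthetic data; the column names, the scaling step, and the nstart value are assumptions for illustration, not the settings used in EDA_visualize.R.

```r
set.seed(1)

# Synthetic stand-in for team statistics: 150 team-seasons, 4 made-up columns.
team_stats <- matrix(
  rnorm(150 * 4), nrow = 150,
  dimnames = list(NULL, c("pts", "opp_pts", "reb", "ast"))
)

# Standardize so that no single statistic dominates the distances.
scaled <- scale(team_stats)

# k-means with five clusters; several random starts reduce sensitivity to initialization.
km <- kmeans(scaled, centers = 5, nstart = 20)

# PCA on the same standardized data; keep the scores on the first two components.
pca <- prcomp(scaled)
pc_scores <- pca$x[, 1:2]

# Cluster sizes, for a quick sanity check.
cluster_sizes <- table(km$cluster)
```

Calling plot(pc_scores, col = km$cluster) would then show the clusters in the plane of the first two components.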
The source folder contains the source code of the functions used in the analysis:
- data_scraping.R: functions to scrape team/player statistics, game results, injury information, and the game schedule
- data_wrangling.R: functions to tidy the data and extract information from the raw data
- EDA_visualize.R: functions to generate the plots in the analysis
- data_integration.R: functions to explore the yearly effect
- feature_eng.R: functions to create team features
- model_build.R: functions to build and test the prediction model
- simulation.R: functions to simulate the games B times based on the predicted probabilities
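The simulation step can be pictured as one Bernoulli draw per game per replicate. The sketch below is illustrative only: the three games, the team abbreviations, and the probabilities are made up, the column name p_home_win is hypothetical, and simulation.R may be organized differently.

```r
set.seed(2020)

B <- 1000  # number of simulated seasons

# Made-up schedule with a hypothetical predicted home-win probability per game.
games <- data.frame(
  home = c("LAL", "DEN", "HOU"),
  away = c("LAC", "UTA", "DAL"),
  p_home_win = c(0.55, 0.60, 0.48),
  stringsAsFactors = FALSE
)

# One Bernoulli draw per game and replicate: rows are games, columns are replicates.
# The probability vector is recycled down each column, so row g always uses p_home_win[g].
home_wins <- matrix(
  rbinom(nrow(games) * B, size = 1, prob = games$p_home_win),
  nrow = nrow(games), ncol = B
)

# Accumulate wins per team within each replicate.
teams <- sort(unique(c(games$home, games$away)))
wins <- matrix(0, nrow = length(teams), ncol = B, dimnames = list(teams, NULL))
for (g in seq_len(nrow(games))) {
  wins[games$home[g], ] <- wins[games$home[g], ] + home_wins[g, ]
  wins[games$away[g], ] <- wins[games$away[g], ] + (1 - home_wins[g, ])
}

# Average simulated wins per team across the B replicates.
expected_wins <- rowMeans(wins)
```

Each column of wins is one simulated outcome of the schedule, so standings can be ranked within a column to obtain a simulated playoff picture.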
The Shiny app presents one simulation based on the predicted probabilities from our model. Users can see the final simulation results as well as daily game predictions.
To get a quick overview of the project, check out our website and YouTube video. For the full analysis, read our final report.
Kate Li (yli324@jhu.edu)
Runzhe Li (rli51@jhmi.edu)
Yifan Zhang (yzhan170@jhu.edu)
Linda Zhou (lzhou54@jhu.edu)