# COGS 108 - Project Proposal

## Authors

- Alexander Huang Liu: Project administration, Conceptualization, Software, Writing – review & editing

- Brody Vandiver: Data curation, Software, Methodology, Writing – original draft

- Jay Ma: Formal analysis, Investigation, Visualization, Writing – original draft

- Justin Wu: Background research (Investigation), Data curation, Methodology, Writing – review & editing

- Srinivasa Perisetla: Formal analysis, Software, Validation, Writing – original draft


## Research Question

Does cumulative travel stress, quantified by the interaction between total distance traveled (miles), time-zone shifts, and rest intervals (e.g., back-to-back games), predict a statistically significant decline in a team's Offensive Efficiency (OE) relative to their season average?

Specifically, can we develop a multiple linear regression model that quantifies the size of this "fatigue tax" on shooting percentage (team FG%) and turnover rate for visiting teams?

This project will test whether geographic factors like eastward travel and high-mileage road trips correlate with specific performance drops. By isolating these variables, we aim to determine if a predictive threshold exists where travel fatigue becomes a dominant predictor of offensive variance.
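
One way to make the regression question concrete is the sketch below, which fits an ordinary least squares model with a single interaction term using NumPy. Everything in it is an assumption for illustration: the data are synthetic (not NBA data), the variable names (`miles_k`, `east_zones`, `b2b`) are hypothetical, and the "true" coefficients are invented so that we can check that the fit recovers them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # number of synthetic visiting-team games

# Synthetic predictors (NOT real NBA data; ranges are assumptions).
miles_k = rng.uniform(0.0, 3.0, n)   # travel distance, in thousands of miles
east_zones = rng.integers(0, 4, n)   # eastward time zones crossed (0 to 3)
b2b = rng.integers(0, 2, n)          # 1 = second night of a back-to-back

# Invented coefficients: intercept, miles, east zones, b2b, east zones x b2b.
true_beta = np.array([0.0, -0.5, -0.8, -1.5, -0.6])

# Design matrix with an intercept column and one interaction term.
X = np.column_stack([np.ones(n), miles_k, east_zones, b2b, east_zones * b2b])

# Response: change in OE relative to season average, plus random noise.
y = X @ true_beta + rng.normal(0.0, 0.5, n)

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

names = ["intercept", "miles_k", "east_zones", "b2b", "east_zones:b2b"]
for name, b in zip(names, beta_hat):
    print(f"{name:>15}: {b:+.3f}")
```

In the actual analysis the response would be each game's OE minus the team's season average, and which interaction terms to include would be decided after exploring the data.
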



## Background and Prior Work

The modern landscape of professional basketball is defined by a rigorous competitive structure that frequently pushes the boundaries of human physiological and cognitive limits.

Within the National Basketball Association (NBA), the standard 82-game regular season involves a high density of competition interspersed with frequent transmeridian travel, creating a unique environment where athletic performance is perpetually modulated by recovery status and circadian alignment.

Travel stress in this context comprises two distinct but overlapping phenomena: travel fatigue (the acute exhaustion caused by the logistics of movement) and jet lag (a circadian rhythm disorder caused by crossing multiple time zones). Jet lag is asymmetric with respect to direction: the human circadian system adapts more readily to westward travel (phase delay) than to eastward travel (phase advance), which requires shortening its cycle <a name="cite_note-1"></a><sup>1</sup>.

Research indicates that sleep disruption is the primary mechanism through which travel stress erodes performance. On the first night after travel, athletes experience a predictable reduction in sleep duration based on the number of time zones crossed, with eastward travel being significantly more disruptive (averaging -24.5 minutes) than westward travel <a name="cite_note-2"></a><sup>2</sup>.

These disruptions can manifest as significant decrements in Offensive Efficiency (OE), a holistic metric that measures scoring effectiveness per possession. Specifically, studies have shown that eastward jet lag is associated with a 1.2% decrease in Effective Field Goal Percentage ($eFG\%$) differential <a name="cite_note-3"></a><sup>3</sup>.

Furthermore, mental fatigue disrupts psychomotor vigilance, essential for shooting accuracy and ball security, often leading to a rise in turnover rates (TOV%) at the end of long road trips.

Existing literature has also identified specific "tipping points" where travel fatigue becomes the dominant predictor of performance variance. One of the most robust findings is that cumulative time zone changes of three or more within a three-day period are significantly detrimental to performance <a name="cite_note-4"></a><sup>4</sup>.

Additionally, the back-to-back game scenario remains the most consistent predictor of decline; visiting teams playing on the second night of a back-to-back win only about 36% of the time.

While total mileage is a factor, its impact is most severe when combined with the duration of a multi-city tour, as the cumulative stress of changing environments leads to a measurable decline in $eFG\%$.

To capture these complex, non-linear interactions, modern sports analytics projects have increasingly moved from simple linear models toward advanced techniques like Random Forests and Gradient Boosting (LightGBM) to accurately identify and predict the "fatigue tax" <a name="cite_note-5"></a><sup>5</sup>.

Footnotes

<a name="cite_note-1"></a> ^ Charest, J., et al. (2021). Eastward Jet Lag is Associated with Impaired Performance and Game Outcome in the National Basketball Association. Journal of Clinical Sleep Medicine. https://pmc.ncbi.nlm.nih.gov/articles/PMC9245584/

<a name="cite_note-2"></a> ^ Vitale, K. C., et al. (2017). Sleep Hygiene for Optimizing Recovery in Athletes: Review and Recommendations. International Journal of Sports Medicine. https://pmc.ncbi.nlm.nih.gov/articles/PMC10520441/

<a name="cite_note-3"></a> ^ NBAstuffer. (2026). Team Stats at Home and Away: How to Find Value in NBA Games. https://www.nbastuffer.com/team-stats-at-home-and-away-how-to-find-value-in-nba-games/

<a name="cite_note-4"></a> ^ Nutting, A. W. (2022). Hiding in plain sight: schedule density and travel influence on NBA game outcomes. ResearchGate. https://www.researchgate.net/publication/357856729

<a name="cite_note-5"></a> ^ MITRA, S. (2026). Predicting-Travel-duration. GitHub. https://github.com/sowmenMITRA/Predicting-Travel-duration


## Hypothesis


We hypothesize that **cumulative travel stress**, specifically the interaction of eastward travel (time-zone shifts), total distance covered, and limited rest (back-to-back games), will be a **statistically significant predictor** of a decline in a visiting team’s Offensive Efficiency (OE).

### Specific Predictions
* **Directional Asymmetry:** Eastward travel across two or more time zones will produce a larger decline in Effective Field Goal Percentage ($eFG\%$) than westward travel of equal distance.
* **The "Fatigue Tax" Tipping Point:** The decline will follow a non-linear trend; the combination of high-mileage road trips and back-to-back scenarios creates a "tipping point" that significantly increases Turnover Rates ($TOV\%$).
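
To test the directional prediction, each trip needs a signed time-zone shift. The sketch below is one possible encoding, assuming a hypothetical lookup table of standard-time UTC offsets keyed by team abbreviation. It covers only four cities and ignores daylight saving time, so it is an illustration rather than a complete mapping; Python's `zoneinfo` module would be a more careful choice for date-aware offsets.

```python
# Standard-time UTC offsets in hours for a small illustrative subset of cities.
# Assumption: daylight saving time is ignored in this sketch.
UTC_OFFSET_HOURS = {
    "LAL": -8,  # Los Angeles (Pacific)
    "DEN": -7,  # Denver (Mountain)
    "CHI": -6,  # Chicago (Central)
    "BOS": -5,  # Boston (Eastern)
}


def signed_zone_shift(origin: str, destination: str) -> int:
    """Time zones crossed: positive = eastward, negative = westward."""
    return UTC_OFFSET_HOURS[destination] - UTC_OFFSET_HOURS[origin]


def direction(shift: int) -> str:
    """Label a signed shift as 'east', 'west', or 'none'."""
    if shift > 0:
        return "east"
    if shift < 0:
        return "west"
    return "none"


for origin, destination in [("LAL", "BOS"), ("BOS", "LAL"), ("CHI", "CHI")]:
    shift = signed_zone_shift(origin, destination)
    print(origin, "->", destination, shift, direction(shift))
```

Keeping the sign lets a model estimate separate effects for eastward and westward trips of the same magnitude.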

### Reasoning
This prediction is based on the biological reality of **circadian asymmetry**, where the human body struggles more with "phase advance" (losing time traveling east) than "phase delay" (traveling west). 



As highlighted in our background research, this sleep disruption impairs both the fine motor skills required for shooting accuracy and the cognitive awareness necessary for ball security. We expect the data to show that while distance matters, the timing and direction of travel are the primary drivers of impact on offensive effectiveness.

## Data

To answer our research question, the ideal dataset would combine NBA game-level performance data with detailed scheduling and geographic travel information for each team. Each observation would correspond to a single NBA game played by a visiting team, allowing us to compare performance relative to that team's season baseline.
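
A minimal sketch of the season-baseline comparison follows. It assumes each row is a dict with hypothetical keys `team`, `season`, and `oe`, and that the rows cover all of a team's games in that season (home and away) so the mean is a genuine season average. The numbers are invented.

```python
from collections import defaultdict
from statistics import mean


def add_baseline_deviation(rows, metric="oe"):
    """Return copies of rows with the metric's deviation from the team-season mean."""
    values = defaultdict(list)
    for row in rows:
        values[(row["team"], row["season"])].append(row[metric])

    # One baseline per (team, season) pair.
    baseline = {key: mean(vals) for key, vals in values.items()}

    return [
        {**row, f"{metric}_vs_avg": row[metric] - baseline[(row["team"], row["season"])]}
        for row in rows
    ]


# Invented example rows, not real game data.
games = [
    {"team": "BOS", "season": 2023, "oe": 110.0},
    {"team": "BOS", "season": 2023, "oe": 114.0},
    {"team": "BOS", "season": 2023, "oe": 106.0},
    {"team": "DEN", "season": 2023, "oe": 120.0},
]
with_deviation = add_baseline_deviation(games)
for row in with_deviation:
    print(row)
```

Using the deviation rather than the raw value keeps strong and weak offenses on a comparable scale.
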

 Key variables would include:
### Outcome / Performance Variables (Dependent Variables)
   1. Offensive Efficiency (OE)
   2. Field Goal Percentage (FG%)
   3. Effective Field Goal Percentage (eFG%)
   4. Turnover Rate (TOV%)
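
These metrics can be derived from box-score totals. The sketch below uses commonly cited public formulas. The 0.44 free-throw weight and the simple possession estimate are conventions, and NBA.com or Basketball-Reference may use more detailed variants, so computed values should be checked against the source. The function names and the example box score are hypothetical.

```python
def efg_pct(fgm, fg3m, fga):
    """Effective field goal percentage as a proportion: three-pointers count 1.5x."""
    return (fgm + 0.5 * fg3m) / fga


def possessions(fga, fta, orb, tov):
    """Simple possession estimate (assumed convention, 0.44 free-throw weight)."""
    return fga + 0.44 * fta - orb + tov


def tov_pct(tov, fga, fta):
    """Turnovers as a proportion of plays ending in a shot, free throws, or turnover."""
    return tov / (fga + 0.44 * fta + tov)


def offensive_efficiency(pts, fga, fta, orb, tov):
    """Points scored per 100 estimated possessions."""
    return 100.0 * pts / possessions(fga, fta, orb, tov)


# Invented box score for one team in one game.
box = {"pts": 110, "fgm": 40, "fg3m": 12, "fga": 88, "fta": 20, "orb": 10, "tov": 14}

print("eFG%:", round(efg_pct(box["fgm"], box["fg3m"], box["fga"]), 4))
print("TOV%:", round(tov_pct(box["tov"], box["fga"], box["fta"]), 4))
print("OE:  ", round(offensive_efficiency(
    box["pts"], box["fga"], box["fta"], box["orb"], box["tov"]), 2))
```
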
### Primary Predictors (Independent Variables)
   1. Total travel distance since last game (miles)
   2. Number of time zones crossed
   3. Rest days since last game
   4. Back-to-back indicator (binary)
   5. Cumulative travel distance over recent games (e.g., last 3–5 games)
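
The schedule-based predictors can be derived from a team's ordered game list. The sketch below assumes hypothetical dict keys `date` and `travel_miles`, defines rest days as the number of full days off between games (so consecutive-day games have 0 rest days and count as a back-to-back), and sums travel over a rolling window of the most recent games including the current one. Other definitions are possible, and the schedule shown is invented.

```python
from datetime import date


def add_rest_features(games, window=3):
    """games: list of dicts sorted by date, each with 'date' and 'travel_miles'."""
    out = []
    for i, game in enumerate(games):
        if i == 0:
            rest_days = None  # no previous game to measure from
        else:
            rest_days = (game["date"] - games[i - 1]["date"]).days - 1

        # Travel accumulated over the current game and the preceding window-1 games.
        recent = games[max(0, i - window + 1): i + 1]

        out.append({
            **game,
            "rest_days": rest_days,
            "back_to_back": rest_days == 0,
            "cum_miles": sum(g["travel_miles"] for g in recent),
        })
    return out


# Invented schedule for one team.
schedule = [
    {"date": date(2024, 1, 1), "travel_miles": 0},
    {"date": date(2024, 1, 2), "travel_miles": 350},
    {"date": date(2024, 1, 5), "travel_miles": 1200},
    {"date": date(2024, 1, 6), "travel_miles": 800},
]
features = add_rest_features(schedule)
for row in features:
    print(row["date"], row["rest_days"], row["back_to_back"], row["cum_miles"])
```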

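To achieve sufficient statistical power, we would ideally collect multiple full NBA seasons (e.g., 5–10 seasons). With 30 teams each playing 41 road games, a full 82-game season yields 1,230 visiting-team game observations, so 5–10 seasons would provide roughly 6,000–12,000 observations, and reaching 10,000+ would take about nine full seasons.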

### Data Collection & Organization
- Game-level statistics would be collected from publicly available NBA statistics websites.
- Schedule data (dates, opponents, locations) would be merged with arena latitude/longitude data.
- Travel distance and time-zone changes would be computed programmatically using arena coordinates and game dates.
- Data would be stored in structured tabular format (CSV files), with one row per game per team.
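
Travel distance between two arenas can be approximated with the haversine (great-circle) formula. The sketch below assumes a spherical Earth with a mean radius of about 3,958.8 miles and uses approximate city-center coordinates for Los Angeles and Boston rather than exact arena locations. Great-circle distance is only a proxy for the route a team actually travels.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (spherical approximation)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Approximate city-center coordinates (assumed, not exact arena locations).
LOS_ANGELES = (34.05, -118.25)
BOSTON = (42.36, -71.06)

la_to_boston = haversine_miles(*LOS_ANGELES, *BOSTON)
print(f"Los Angeles to Boston: {la_to_boston:.0f} miles")
```
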
### Potential Real-World Datasets
1. **NBA Official Statistics**: https://www.nba.com/stats
   This dataset provides comprehensive team-level and game-level statistics, including offensive efficiency, shooting percentages, and turnover rates. The data is publicly accessible but requires web scraping or API usage; care must be taken to follow the site's usage policies.

2. **Basketball-Reference Game Logs**: https://www.basketball-reference.com
   Basketball-Reference offers historical NBA game logs, team statistics, and advanced metrics going back multiple decades. The data is freely available for academic use and can be downloaded manually or scraped responsibly, subject to the site's terms of use.

3. **NBA Schedule & Arena Location Data**: https://github.com/rlabausa/nba-schedule-data
   These datasets provide NBA schedules and arena geographic coordinates necessary to compute travel distances and time-zone shifts.
  

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and whether it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also if you have lost points on the previous checkpoint fix them