# **0. Data Processing Plan for Portfolio Project (Extended Technical Version)**
### *Dataset: NYC Yellow Taxi Trips (2019–2020)*
### *Author: Alvi Sulaj*
### *Holberton School – Machine Learning Track*

---

## **1. Data Sources**

The dataset will be obtained from the official NYC Taxi & Limousine Commission (TLC):

🔗 **Dataset Link:** https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

This dataset contains trip-level information for all Yellow Taxis in NYC.  
It includes timestamps, distances, fares, passenger details, and spatial identifiers.  
It is collected automatically through in-vehicle GPS and meter systems.

### **Additional sources (optional):**
- NOAA NYC Weather Data → for weather-demand correlation
- NYC Borough shapefiles → for mapping and spatial modeling

### **Aggregation plan:**
I will combine multiple monthly CSV files covering 2019–2020 into one unified dataset.
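
The aggregation step could look roughly like the sketch below. It assumes the monthly files follow the `yellow_tripdata_YYYY-MM.csv` naming used later in this plan; the function name `load_monthly_trips` and the extra `source_file` column are illustrative choices, not part of the TLC data.

```python
from pathlib import Path

import pandas as pd


def load_monthly_trips(raw_dir, pattern="yellow_tripdata_*.csv"):
    """Read every monthly CSV in raw_dir and stack them into one DataFrame."""
    files = sorted(Path(raw_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {raw_dir}")
    frames = []
    for path in files:
        month = pd.read_csv(
            path,
            parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
        )
        month["source_file"] = path.name  # hypothetical column: keeps provenance
        frames.append(month)
    return pd.concat(frames, ignore_index=True)
```

With millions of rows per month, loading two full years at once may not fit in memory, so reading a subset of columns or processing month by month may be a safer starting point.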

---

## **2. Data Format**

### **Original format:**
- CSV files
- Monthly trip records
- Large size (1–7 million rows per month)

### **Working format:**
- Pandas DataFrame

### **Final stored format:**
- **Parquet** (for speed + compression)
- **Feather** (optional)
- **CSV** (for compatibility with other tools)

---

## **3. Existing Features & Planned Exploratory Data Analysis**

### **Main columns included:**
- `tpep_pickup_datetime`
- `tpep_dropoff_datetime`
- `passenger_count`
- `trip_distance`
- `PULocationID`, `DOLocationID`
- `fare_amount`
- `tip_amount`
- `tolls_amount`
- `total_amount`
- `payment_type`
- `RatecodeID`
- `VendorID`

### **Planned EDA:**

#### **A. Data Quality**
- Missing value analysis  
- Duplicate detection  
- Impossible values (zero distance, negative fare)  
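
A minimal sketch of these three checks, assuming the trips are already in a pandas DataFrame. The helper name `quality_report` and the sample values are made up for illustration.

```python
import pandas as pd


def quality_report(trips):
    """Summarise missing values, duplicate rows, and impossible values."""
    return {
        "missing_per_column": trips.isna().sum().to_dict(),
        "duplicate_rows": int(trips.duplicated().sum()),
        "non_positive_distance": int((trips["trip_distance"] <= 0).sum()),
        "negative_fare": int((trips["fare_amount"] < 0).sum()),
    }


# Tiny made-up sample: one zero-distance trip, one duplicated row,
# two negative fares, and one missing passenger count.
sample = pd.DataFrame(
    {
        "trip_distance": [1.2, 0.0, 3.4, 3.4],
        "fare_amount": [7.5, 5.0, -3.0, -3.0],
        "passenger_count": [1, None, 2, 2],
    }
)
report = quality_report(sample)
```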

#### **B. Statistics & Distributions**
- Histograms for numerical values  
- Boxplots for outliers  
- Skewness/kurtosis  

#### **C. Temporal Patterns**
- Trips per hour  
- Trips per weekday  
- Seasonal changes  
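
The temporal counts could be derived from the pickup timestamp as sketched here. The function name `trips_by_time` and the sample timestamps are illustrative only.

```python
import pandas as pd


def trips_by_time(trips, pickup_col="tpep_pickup_datetime"):
    """Count trips per hour of day, per weekday, and per calendar month."""
    pickup = pd.to_datetime(trips[pickup_col])
    per_hour = pickup.dt.hour.value_counts().sort_index()
    per_weekday = pickup.dt.dayofweek.value_counts().sort_index()  # 0 = Monday
    per_month = pickup.dt.to_period("M").value_counts().sort_index()
    return per_hour, per_weekday, per_month


sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": [
            "2019-01-07 08:15:00",
            "2019-01-07 08:45:00",
            "2019-01-12 23:10:00",
            "2019-07-04 17:30:00",
        ]
    }
)
per_hour, per_weekday, per_month = trips_by_time(sample)
```

Each returned Series can be passed straight to `.plot(kind="bar")` for the planned charts.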

#### **D. Spatial Patterns**
- Pickup/dropoff hotspot maps  
- Borough-to-borough flow analysis  

#### **E. Fare Analysis**
- Fare-per-mile  
- Tip distribution  
- Anomaly detection for potentially fraudulent fares  

---

## **4. Hypotheses & Testing Methods**

### **Hypothesis 1:**  
Short trips have higher fare-per-mile than long trips.  
**Testing:** compute `fare_per_mile`, compare distributions, run statistical tests.
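
A sketch of the descriptive part of this test, under the assumption that "short" means at most 2 miles. That threshold and the function name are placeholders to be tuned during EDA. A formal statistical test would follow this step.

```python
import pandas as pd


def fare_per_mile_by_length(trips, short_max_miles=2.0):
    """Median fare-per-mile for short vs long trips (threshold is an assumption)."""
    valid = trips[(trips["trip_distance"] > 0) & (trips["fare_amount"] > 0)].copy()
    valid["fare_per_mile"] = valid["fare_amount"] / valid["trip_distance"]
    is_short = valid["trip_distance"] <= short_max_miles
    valid["trip_length"] = is_short.map({True: "short", False: "long"})
    return valid.groupby("trip_length")["fare_per_mile"].median()


# Made-up values; the zero-distance row is dropped before the ratio is computed.
sample = pd.DataFrame(
    {
        "trip_distance": [0.5, 1.0, 5.0, 10.0, 0.0],
        "fare_amount": [5.0, 6.0, 17.5, 30.0, 4.0],
    }
)
medians = fare_per_mile_by_length(sample)
```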

### **Hypothesis 2:**  
Taxi demand peaks during morning and evening rush hours.  
**Testing:** extract hour-of-day, plot rides per hour.

### **Hypothesis 3:**  
Rainy days increase taxi usage.  
**Testing:** merge with weather data, compare ride counts on rainy vs dry days.
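
One way to set up this comparison is sketched below. The weather table layout (a daily `date` column and a `precipitation` column) is an assumption for illustration; the column names in the NOAA export may differ and would need renaming first.

```python
import pandas as pd


def rides_rainy_vs_dry(trips, weather, rain_threshold=0.0):
    """Mean daily ride count on rainy vs dry days.

    `weather` is assumed to have one row per day with hypothetical
    columns `date` and `precipitation`.
    """
    days = pd.to_datetime(trips["tpep_pickup_datetime"]).dt.normalize()
    daily = days.groupby(days).size().rename_axis("date").reset_index(name="rides")
    weather = weather.assign(date=pd.to_datetime(weather["date"]))
    merged = daily.merge(weather, on="date", how="inner")
    is_rainy = merged["precipitation"] > rain_threshold
    merged["day_type"] = is_rainy.map({True: "rainy", False: "dry"})
    return merged.groupby("day_type")["rides"].mean()


# Made-up sample: 3 rides on a rainy day, 1 and 2 rides on two dry days.
trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": [
            "2019-03-01 08:00:00", "2019-03-01 09:00:00", "2019-03-01 18:00:00",
            "2019-03-02 12:00:00",
            "2019-03-03 10:00:00", "2019-03-03 20:00:00",
        ]
    }
)
weather = pd.DataFrame(
    {"date": ["2019-03-01", "2019-03-02", "2019-03-03"], "precipitation": [0.4, 0.0, 0.0]}
)
mean_rides = rides_rainy_vs_dry(trips, weather)
```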

### **Hypothesis 4:**  
Late-night trips have higher tip percentages.  
**Testing:** extract hour data, analyze tip rates by time category.
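
A sketch of the tip-rate comparison. The four time categories and their hour boundaries are assumptions, as is the function name. The sketch keeps only credit-card trips (`payment_type == 1`), because the TLC data dictionary notes that `tip_amount` does not include cash tips.

```python
import pandas as pd


def tip_rate_by_period(trips):
    """Mean tip percentage per time-of-day category (boundaries are assumptions)."""
    valid = trips[(trips["fare_amount"] > 0) & (trips["payment_type"] == 1)]
    hour = pd.to_datetime(valid["tpep_pickup_datetime"]).dt.hour
    period = pd.cut(
        hour,
        bins=[0, 5, 12, 18, 24],  # left-closed: [0, 5), [5, 12), [12, 18), [18, 24)
        right=False,
        labels=["late_night", "morning", "afternoon", "evening"],
    )
    tip_pct = 100 * valid["tip_amount"] / valid["fare_amount"]
    return tip_pct.groupby(period, observed=True).mean()


# Made-up sample; the last row is a cash trip and is excluded.
sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": [
            "2019-05-01 02:00:00", "2019-05-01 03:00:00",
            "2019-05-01 14:00:00", "2019-05-01 14:30:00",
        ],
        "fare_amount": [10.0, 20.0, 10.0, 10.0],
        "tip_amount": [3.0, 4.0, 1.0, 5.0],
        "payment_type": [1, 1, 1, 2],
    }
)
tip_rates = tip_rate_by_period(sample)
```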

---

## **5. Missing Data, Sparsity & Outliers**

### **Missing values handling:**
- Remove rows with missing pickup or dropoff location IDs  
- Impute missing fare components with logical defaults  
- Validate payment and vendor IDs  

### **Outlier removal:**
- Z-score or IQR filtering  
- Remove:
  - Distance > 50 miles  
  - Duration > 2 hours  
  - Negative values  
  - Unrealistic fare/amount combinations  
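
The fixed thresholds and the IQR rule above might be applied as in this sketch. The helper names are illustrative, and the IQR multiplier of 1.5 is the conventional default rather than a value chosen for this dataset.

```python
import pandas as pd


def remove_outliers(trips, max_miles=50, max_hours=2):
    """Drop trips that break the plan's hard limits."""
    duration = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
    keep = (
        trips["trip_distance"].between(0, max_miles)
        & duration.between(pd.Timedelta(0), pd.Timedelta(hours=max_hours))
        & (trips["fare_amount"] >= 0)
        & (trips["total_amount"] >= 0)
    )
    return trips[keep]


def iqr_mask(values, k=1.5):
    """True where a value lies within k * IQR of the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    spread = q3 - q1
    return values.between(q1 - k * spread, q3 + k * spread)


# Made-up sample: only the first row satisfies every limit.
sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(["2019-01-01 10:00"] * 4),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2019-01-01 10:15", "2019-01-01 11:00", "2019-01-01 13:00", "2019-01-01 10:20"]
        ),
        "trip_distance": [2.0, 60.0, 5.0, 3.0],
        "fare_amount": [10.0, 150.0, 20.0, -5.0],
        "total_amount": [12.0, 160.0, 25.0, -5.0],
    }
)
cleaned = remove_outliers(sample)
mask = iqr_mask(pd.Series([1, 2, 3, 4, 100]))
```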

### **Sparse data:**
- Rare pickup/dropoff zones will be grouped into broader borough-level categories.
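
A possible implementation of this grouping is sketched below. It assumes a `zone_to_borough` mapping built from a zone lookup table; the mapping values, the `min_trips` cutoff, and the `borough:` prefix are all illustrative.

```python
import pandas as pd


def group_rare_zones(zone_ids, zone_to_borough, min_trips=100):
    """Keep frequent zone IDs as-is; replace rare ones with their borough."""
    counts = zone_ids.map(zone_ids.value_counts())  # per-row frequency of its zone
    borough = zone_ids.map(zone_to_borough).fillna("Unknown")
    return zone_ids.astype(str).where(counts >= min_trips, "borough:" + borough)


# Illustrative mapping and a tiny sample with a deliberately low cutoff.
zone_to_borough = {161: "Manhattan", 237: "Manhattan", 5: "Staten Island"}
pickups = pd.Series([161, 161, 161, 237, 237, 5])
grouped = group_rare_zones(pickups, zone_to_borough, min_trips=2)
```

Here zone 5 appears only once, so it is replaced by `borough:Staten Island`, while the two frequent zones keep their own IDs.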

---

## **6. Dataset Splitting Strategy**

### **If NOT time-series ML:**
- Train: **70%**  
- Validation: **15%**  
- Test: **15%**  
- Stratify if predicting categorical outcomes

### **If time-series ML:**
- Train: 2019  
- Validation: Jan–Jun 2020  
- Test: Jul–Dec 2020  
- Ensures the model is never trained on data from a later period than the one it is evaluated on  
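
The chronological split can be expressed with plain timestamp filters, as in this sketch (the function name is illustrative). Rows whose pickup time falls outside 2019–2020 end up in none of the three sets.

```python
import pandas as pd


def chronological_split(trips, pickup_col="tpep_pickup_datetime"):
    """Split by pickup time: 2019 / Jan-Jun 2020 / Jul-Dec 2020."""
    pickup = trips[pickup_col]
    start_2019 = pd.Timestamp("2019-01-01")
    start_2020 = pd.Timestamp("2020-01-01")
    mid_2020 = pd.Timestamp("2020-07-01")
    start_2021 = pd.Timestamp("2021-01-01")
    train = trips[(pickup >= start_2019) & (pickup < start_2020)]
    validation = trips[(pickup >= start_2020) & (pickup < mid_2020)]
    test = trips[(pickup >= mid_2020) & (pickup < start_2021)]
    return train, validation, test


sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2019-06-01", "2020-03-01", "2020-09-01", "2018-12-31"]
        )
    }
)
train, validation, test = chronological_split(sample)
```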

---

## **7. Ensuring Unbiased Data**

### **Bias checks:**
- Geographic representation across boroughs  
- Balanced payment types  
- Balanced time-of-day usage  
- Balanced seasonal data  

### **Mitigation:**
- Oversampling or undersampling zones  
- SMOTE for classification  
- Time-based normalization  
- Fare scaling across seasons  

---

## **8. Expected Features for the Model**

### **Engineered features:**
- Trip duration  
- Fare per mile  
- Tip rate  
- Speed (distance / duration)  
- Weekend indicator  
- Rush-hour indicator  
- Distance buckets  
- Origin-destination (encoded)
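
The engineered features could be derived as sketched below. The rush-hour windows (weekdays 07:00–09:59 and 16:00–19:59), the distance bucket edges, and all new column names are assumptions to be revisited after EDA.

```python
import pandas as pd


def add_engineered_features(trips):
    """Return a copy of trips with the planned derived columns added."""
    out = trips.copy()
    pickup = out["tpep_pickup_datetime"]
    out["duration_min"] = (out["tpep_dropoff_datetime"] - pickup).dt.total_seconds() / 60
    # Non-positive denominators become NaN so the ratios never divide by zero.
    distance = out["trip_distance"].where(out["trip_distance"] > 0)
    fare = out["fare_amount"].where(out["fare_amount"] > 0)
    hours = out["duration_min"].where(out["duration_min"] > 0) / 60
    out["fare_per_mile"] = out["fare_amount"] / distance
    out["tip_rate"] = out["tip_amount"] / fare
    out["speed_mph"] = out["trip_distance"] / hours
    out["is_weekend"] = pickup.dt.dayofweek >= 5
    hour = pickup.dt.hour
    out["is_rush_hour"] = (~out["is_weekend"]) & (hour.between(7, 9) | hour.between(16, 19))
    out["distance_bucket"] = pd.cut(
        out["trip_distance"],
        bins=[0, 1, 3, 10, float("inf")],
        labels=["0-1 mi", "1-3 mi", "3-10 mi", "10+ mi"],
        include_lowest=True,
    )
    out["od_pair"] = out["PULocationID"].astype(str) + "_" + out["DOLocationID"].astype(str)
    return out


# Made-up sample: a Monday rush-hour trip and a Saturday zero-distance trip.
sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(["2019-01-07 08:00", "2019-01-12 23:00"]),
        "tpep_dropoff_datetime": pd.to_datetime(["2019-01-07 08:30", "2019-01-12 23:10"]),
        "trip_distance": [5.0, 0.0],
        "fare_amount": [20.0, 3.0],
        "tip_amount": [4.0, 0.0],
        "PULocationID": [161, 237],
        "DOLocationID": [237, 237],
    }
)
features = add_engineered_features(sample)
```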

### **Categorical encoded:**
- Pickup zone  
- Dropoff zone  
- Payment type  
- Rate code  
- Vendor ID  

### **Numerical:**
- Distance  
- Fare  
- Duration  
- Tip amount  

---

## **9. Type of Data in the Dataset**
- **Numerical:** fare, distance, duration, tips  
- **Categorical:** payment type, rate code, borough (derived from location IDs)  
- **Datetime:** pickup/dropoff timestamps  
- **Geospatial:** pickup and dropoff location IDs  

---

## **10. Data Transformations**

### **Transformation steps:**
1. Convert timestamps → hour, weekday, month  
2. Remove outliers  
3. Create engineered features  
4. Log-transform skewed variables  
5. One-hot encode categorical fields  
6. Scale numerical columns (`StandardScaler` or `MinMaxScaler`)  
7. Save cleaned dataset into Parquet format  

### **ML pipeline:**
- `ColumnTransformer` for handling mixed data types  
- `Pipeline` for consistent preprocessing + model training  
- Ensure reproducible transformations  
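
A minimal sketch of how `ColumnTransformer` and `Pipeline` could fit together in scikit-learn. The column lists, the `duration_min` feature name, and the `Ridge` regressor are placeholders; the plan does not yet fix a target variable or model.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column selection; adjust once the feature set is final.
numeric_cols = ["trip_distance", "duration_min"]
categorical_cols = ["payment_type", "RatecodeID"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        # handle_unknown="ignore" keeps prediction working on unseen categories.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

# Placeholder model: any scikit-learn estimator can take the final step.
model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("regressor", Ridge(alpha=1.0)),
    ]
)
```

Calling `model.fit(X_train, y_train)` learns the scaling and encoding from the training split only, and `model.predict(X_test)` reuses them unchanged, which is what keeps the transformations reproducible.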

---

## **11. Data Storage Plan**

### **Local directory structure:**

```
project/
│
├── raw_data/
│   ├── yellow_tripdata_2019-01.csv
│   ├── yellow_tripdata_2019-02.csv
│   └── ...
│
├── processed_data/
│   ├── cleaned_2019_2020.parquet
│   └── features.feather
│
├── notebooks/
│   ├── 00_data_plan.ipynb
│   ├── 01_cleaning.ipynb
│   └── 02_modeling.ipynb
│
├── src/
│   ├── cleaning.py
│   ├── features.py
│   └── model.py
│
└── README.md
```

### **Backups:**
- Google Drive  
- GitHub LFS  
- Local external storage  

---

# ✔ This completes the extended technical data plan.
