# New York Taxi Analysis Project
**Author:** Frederick Richard Apina  
**Date:** 2025-02-26

## 1. Project Overview

In this project, we analyze the New York City Taxi Trips dataset from [Kaggle](https://www.kaggle.com/datasets/dhrulivalave/new-york-city-taxi-trips-2019). The dataset contains detailed information on taxi rides in NYC, including:
- Pickup and dropoff locations and times
- Trip distances
- Fare amounts
- Passenger counts
- Other trip-related features

### Goals 🥅
1. **Regression Goal**: Predict the fare amount for a given taxi trip (continuous target variable).
2. **Classification Goal**: Categorize trips into fare ranges:
   - Class 1: Short-distance fares (< \$10)
   - Class 2: Medium-distance fares (\$10 to under \$30)
   - Class 3: Long-distance fares (\$30–\$60)
   - Class 4: Premium fares (>\$60)
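
The fare ranges above can be turned into class labels with a few comparisons. The following is a minimal sketch, not the project's actual labeling code: the sample fares are made up, and assigning a fare of exactly \$30 to Class 3 is an assumption, since the ranges as written share that boundary.

```python
import numpy as np
import pandas as pd

# Hypothetical fare amounts in dollars, chosen to hit every boundary.
fares = pd.Series([4.5, 10.0, 29.99, 30.0, 60.0, 75.25], name="fare_amount")

# Conditions are checked in order, so each fare gets the first class it matches:
#   Class 1: fare < 10
#   Class 2: 10 <= fare < 30   (assumption: exactly $30 goes to Class 3)
#   Class 3: 30 <= fare <= 60
#   Class 4: fare > 60
conditions = [fares < 10, fares < 30, fares <= 60]
fare_class = pd.Series(
    np.select(conditions, [1, 2, 3], default=4),
    index=fares.index,
    name="fare_class",
)

print(fare_class.tolist())
```

Writing the boundaries as explicit comparisons keeps the handling of edge values visible, which matters when the class counts are later used to judge classifier performance.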

### Why It Matters 📣
- **Pricing Optimization**: Understanding factors influencing fares helps optimize pricing.
- **Fraud Detection**: Anomalies in fare amounts can indicate fraudulent activity.
- **Traffic Pattern Analysis**: Insights into passenger count and travel times can inform city planning.

## 2. Problem Formulation

### 2.1 Problem Statement
We want to investigate how various trip features (e.g., trip distance, pickup time, passenger count) affect the fare amount (`fare_amount`).

### 2.2 Objectives
1. **Primary Objective**: Develop a model to predict the fare amount (regression).
2. **Secondary Objective**: Classify rides into predefined fare ranges (classification).

### 2.3 Expected Outcomes
- A set of data-driven insights about which features most influence fare.
- A cleaned dataset ready for further modeling and analysis.
- Baseline regression and classification models to measure performance.

### 2.4 Key Questions
1. Which variables (distance, passenger count, time of day) are most important in determining fare?
2. Are there noticeable patterns (e.g., peak hours, airport trips) that lead to higher fares?
3. How can the data best be transformed or normalized for accurate predictions?

## 3. Data Analysis & Cleansing

### 3.1 Data Description & Source

**Data Source**: [Kaggle: New York City Taxi Trips (2019)](https://www.kaggle.com/datasets/dhrulivalave/new-york-city-taxi-trips-2019)

**Dataset Columns** (Typical):
- `vendorid`
- `tpep_pickup_datetime`
- `tpep_dropoff_datetime`
- `passenger_count`
- `trip_distance`
- `ratecodeid`
- `store_and_fwd_flag`
- `pulocationid`
- `dolocationid`
- `payment_type`
- `fare_amount`
- `extra`
- `mta_tax`
- `tip_amount`
- `tolls_amount`
- `improvement_surcharge`
- `total_amount`
- `congestion_surcharge`

(Columns may vary depending on the version of the dataset.)

**Size & Format**:
- Approx. X million rows (depending on the subset used).
- SQLite database file (e.g., `2019-01.sqlite`) with one column per trip feature.
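
Since the data ships as a SQLite file, it can be inspected with the standard `sqlite3` module and `pandas.read_sql_query`. The sketch below uses an in-memory database with a made-up table named `trips`; the real table name inside the Kaggle file is not stated here, which is why the example lists the tables first. `load_trips` is a hypothetical helper, not part of the project's `scripts` package.

```python
import sqlite3

import pandas as pd


def load_trips(conn, table, limit=None):
    """Read one table from an open SQLite connection into a DataFrame."""
    query = f'SELECT * FROM "{table}"'
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return pd.read_sql_query(query, conn)


# Stand-in for sqlite3.connect('<path to the .sqlite file>').
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_distance REAL, fare_amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1.2, 7.0), (3.4, 14.5), (9.8, 33.0)],
)

# Discover which tables exist before querying an unfamiliar file.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
)]

sample = load_trips(conn, tables[0], limit=2)
conn.close()

print(tables)
print(sample)
```

Using `LIMIT` while exploring keeps memory use small on a file with millions of rows.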

### 3.2 Data Preprocessing

Here we address:
1. **Importing Libraries & Dataset**
2. **Handling Missing Values**
3. **Removing/Imputing Outliers**
4. **Correcting Data Types** (e.g., `tpep_pickup_datetime` to datetime)
5. **Normalizing/Standardizing** variables
6. **Encoding Categorical Features** (e.g., `payment_type`)
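
Steps 4 and 6 are easy to illustrate in isolation. The following is a small sketch in plain pandas on made-up rows; it is not the implementation inside `scripts.data_processor`, and one-hot encoding is only one possible encoding for `payment_type`.

```python
import pandas as pd

# Hypothetical rows using column names from the dataset description.
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2019-01-01 00:46:40", "2019-01-01 08:15:00"],
    "payment_type": [1, 2],
    "fare_amount": [7.0, 14.0],
})

# Step 4: parse timestamp strings into datetime64 values.
# errors="coerce" turns unparseable values into NaT instead of raising.
trips["tpep_pickup_datetime"] = pd.to_datetime(
    trips["tpep_pickup_datetime"], errors="coerce"
)

# Step 6: one-hot encode the payment type codes into indicator columns.
encoded = pd.get_dummies(trips, columns=["payment_type"], prefix="payment_type")

print(encoded.dtypes)
```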

In [None]:
import sys
import pickle

sys.path.append('../')
from scripts.data_processor import DataManipulator, DataNormalizer, DataCleaner
from scripts.data_visualizer import EDA, DataVisualizer
from scripts.feature_analyzer import FeatureAnalysisAndGenerator

In [None]:
data_loader = DataManipulator(filename='../data/2019/2019-01.sqlite', labels_test='fare_amount')
data_loader.data_train.head()

In [None]:
print("Before data cleaning")
print("Training data shape:", data_loader.data_train.shape)
print("Training labels shape:", data_loader.labels_train.shape)
print("Testing data shape:", data_loader.data_test.shape)
print("Testing labels shape:", data_loader.labels_test.shape)

In [None]:
# Data cleaning
data_cleaner = DataCleaner(data_loader)
data_cleaner.remove_duplicates()
data_cleaner.handle_missing_values()
data_cleaner.remove_outliers()
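
The internals of `DataCleaner` are not shown in this notebook. As a rough illustration of what the three calls above might do, here is a sketch in plain pandas that drops duplicate rows, drops rows with missing values, and removes outliers with the 1.5 × IQR rule. The function `clean`, the choice of dropping rather than imputing, and the IQR threshold are all assumptions.

```python
import pandas as pd


def clean(df, column, k=1.5):
    """Drop duplicates, rows with NaN, and IQR outliers in one column."""
    df = df.drop_duplicates().dropna()
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Keep rows whose value lies within [q1 - k*IQR, q3 + k*IQR].
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep].reset_index(drop=True)


# Made-up data: one duplicate row, one missing distance, one extreme fare.
raw = pd.DataFrame({
    "trip_distance": [1.0, 1.0, 2.0, 3.0, 2.5, None, 1.5],
    "fare_amount": [6.0, 6.0, 9.0, 12.0, 10.0, 8.0, 500.0],
})

cleaned = clean(raw, "fare_amount")
print(cleaned)
```

Whatever rule is used, the bounds should be computed on the training split only and then applied to the test split, so that test data does not influence the cleaning thresholds.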

In [None]:
print("After data cleaning")
print("Training data shape:", data_loader.data_train.shape)
print("Training labels shape:", data_loader.labels_train.shape)
print("Testing data shape:", data_loader.data_test.shape)
print("Testing labels shape:", data_loader.labels_test.shape)

In [None]:
data_loader.data_train.store_and_fwd_flag.unique()

In [None]:
data_loader.data_train.mta_tax.unique()

In [None]:
data_loader.data_train.improvement_surcharge.unique()

In [None]:
# Drop store_and_fwd_flag, improvement_surcharge and mta_tax columns because they have only one unique value after data cleaning
for split in (data_loader.data_train, data_loader.data_test):  # keep train and test columns aligned
    split.drop(['store_and_fwd_flag', 'mta_tax', 'improvement_surcharge'], axis=1, inplace=True)
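
Instead of checking `unique()` column by column, constant columns can be found programmatically. This is a minimal sketch on made-up data; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical frame in which two columns carry a single value.
df = pd.DataFrame({
    "mta_tax": [0.5, 0.5, 0.5],
    "store_and_fwd_flag": ["N", "N", "N"],
    "trip_distance": [1.2, 3.4, 0.8],
})

# A column with at most one distinct value (NaN counted as a value)
# carries no information for a model.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
reduced = df.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```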

In [None]:
# Define categorical and numerical features
categorical_features = ['vendorid', 'ratecodeid', 'payment_type', 'pulocationid', 'dolocationid']
numerical_features = data_loader.data_train.columns.difference(categorical_features).tolist()

In [None]:
# Normalizing the data
data_normalizer = DataNormalizer(data_loader, categorical_features, numerical_features)
data_normalizer.normalize_features()
data_normalizer.data_loader.data_train.head()
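
How `DataNormalizer` scales the features is not visible here. One common choice is z-score standardization, sketched below under that assumption. The key point is that the mean and standard deviation come from the training split and are reused for the test split; the tiny frames are made up.

```python
import pandas as pd

train = pd.DataFrame({"trip_distance": [1.0, 2.0, 3.0], "vendorid": [1, 2, 1]})
test = pd.DataFrame({"trip_distance": [4.0], "vendorid": [2]})
numeric = ["trip_distance"]  # categorical codes such as vendorid are left alone

# Fit the statistics on the training split only.
mean = train[numeric].mean()
std = train[numeric].std(ddof=0)

train_scaled = train.copy()
test_scaled = test.copy()
train_scaled[numeric] = (train[numeric] - mean) / std
# Reuse the training statistics so no test information leaks into the scaling.
test_scaled[numeric] = (test[numeric] - mean) / std

print(train_scaled)
print(test_scaled)
```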

In [None]:
with open('../data/processed/data_loader.pkl', 'wb') as f:
    pickle.dump(data_loader, f)
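
A pickled object can be restored later with `pickle.load`. The sketch below round-trips a plain dictionary through a temporary file as a stand-in for `data_loader`. Note that unpickling the real object requires `scripts.data_processor` to be importable in the loading session, because pickle stores a reference to the class rather than the class itself.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the processed data_loader object.
state = {"columns": ["trip_distance", "passenger_count"], "n_rows": 3}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "state.pkl"
    with open(path, "wb") as f:
        pickle.dump(state, f)
    # Reading uses binary mode as well.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == state)
```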

## 4. Exploratory Data Analysis (EDA)
**Goals**:
- Understand data distribution and relationships
- Identify key patterns
- Use descriptive statistics, histograms, boxplots, correlation matrices
- Attempt dimensionality reduction (PCA, UMAP) to visualize patterns

In [None]:
eda = EDA(data_loader)
eda.perform_eda()
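
What `perform_eda()` reports is defined in `scripts.data_visualizer` and not shown here. As a rough idea of the descriptive statistics and correlation matrix listed in the goals, the following sketch uses plain pandas on a made-up frame.

```python
import pandas as pd

# Hypothetical sample in which fare grows linearly with distance.
df = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 3.0, 4.0],
    "fare_amount": [5.0, 8.0, 11.0, 14.0],
    "passenger_count": [1, 2, 1, 3],
})

summary = df.describe()  # count, mean, std, min, quartiles, max per column
corr = df.corr()         # pairwise Pearson correlation matrix

print(summary)
print(corr.round(3))
```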

In [None]:
data_visualization = DataVisualizer(data_loader)
data_visualization.perform_visualization()

In [None]:
feature_analyzer = FeatureAnalysisAndGenerator(data_loader)

In [None]:
# Perform PCA
feature_analyzer.perform_pca(numerical_features=numerical_features)
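
For readers unfamiliar with PCA, the sketch below shows the underlying computation with NumPy: center the data, take the singular value decomposition, and project onto the leading right singular vectors. It is an illustration on synthetic data, not the code behind `perform_pca`, and the helper name `pca` is hypothetical.

```python
import numpy as np


def pca(X, n_components):
    """Return projected data and explained variance ratios via SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()        # variance share per component
    scores = Xc @ Vt[:n_components].T            # project onto top components
    return scores, explained[:n_components]


# Synthetic data: two strongly correlated features plus a low-variance one.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.1, size=200),
])

scores, ratio = pca(X, n_components=2)
print(scores.shape)
print(ratio)
```

PCA is sensitive to feature scale, which is one reason the numerical features are normalized before this step.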

In [None]:
# Generate new features
feature_analyzer.generate_features()
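
Which features `generate_features()` creates is not documented in this notebook. Typical candidates for taxi data are derived from the two timestamp columns, as in this sketch. The new column names (`trip_duration_min`, `pickup_hour`, `is_weekend`) are hypothetical.

```python
import pandas as pd

trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2019-01-01 08:00:00", "2019-01-05 23:30:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2019-01-01 08:12:30", "2019-01-06 00:05:00"]
    ),
})

# Trip duration in minutes from the two timestamps.
duration = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["trip_duration_min"] = duration.dt.total_seconds() / 60

# Hour of day and a weekend flag (dayofweek: Monday=0 ... Sunday=6).
trips["pickup_hour"] = trips["tpep_pickup_datetime"].dt.hour
trips["is_weekend"] = trips["tpep_pickup_datetime"].dt.dayofweek >= 5

print(trips[["trip_duration_min", "pickup_hour", "is_weekend"]])
```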

In [None]:
# Box plots of the data after feature generation
data_visualization.plot_boxplot()

In [None]:
# Perform relevant feature identification
relevant_features = feature_analyzer.relevant_feature_identification(len(data_loader.data_train.columns))
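
The selection criterion used by `relevant_feature_identification` is not shown here. One simple baseline, given as an assumption rather than the project's method, ranks features by the absolute value of their Pearson correlation with the target. `rank_features` is a hypothetical helper and the data is made up.

```python
import pandas as pd


def rank_features(features, target, top_k):
    """Rank numeric features by absolute Pearson correlation with the target."""
    scores = features.corrwith(target).abs().sort_values(ascending=False)
    return scores.head(top_k)


features = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "passenger_count": [1, 3, 1, 2, 1],
})
target = pd.Series([5.0, 8.5, 11.0, 14.5, 17.0], name="fare_amount")

top = rank_features(features, target, top_k=2)
print(top)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a model-based importance measure may rank features differently.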