# NYC Yellow Taxi Data - Exploratory Data Analysis

This notebook loads one of the Parquet files to examine the data structure and understand what information is available.

## 1. Import Required Libraries

Import the necessary libraries, including pandas and pyarrow for reading Parquet files.

In [2]:
%pip install pyarrow

Defaulting to user installation because normal site-packages is not writeable
Collecting pyarrow
  Using cached pyarrow-21.0.0-cp313-cp313-macosx_12_0_x86_64.whl.metadata (3.3 kB)
Using cached pyarrow-21.0.0-cp313-cp313-macosx_12_0_x86_64.whl (32.7 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-21.0.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Import required libraries
import os

import pandas as pd
import pyarrow.parquet as pq  # not called directly below; pandas can use pyarrow as its Parquet engine

## 2. Load Parquet File

Use pandas to read one of the 12 Parquet files into a DataFrame.

In [2]:
# Define the path to your parquet files
data_path = '/Users/yash/Documents/Projects/NYC_Yellow_Taxi_Analytics/'

# Optional: list all Parquet files in data_path (update the path as needed)
# Uncomment the following lines to see which files are available
# parquet_files = [f for f in os.listdir(data_path) if f.endswith('.parquet')]
# print(f"Found {len(parquet_files)} parquet files:")
# print(parquet_files)

# Load one parquet file (update the filename to match your actual file)
# Example: if your files are named like 'yellow_tripdata_2023-01.parquet'
file_name = 'yellow_tripdata_2025-01.parquet'  # Update this to your actual file name
df = pd.read_parquet(os.path.join(data_path, file_name))

print(f"Successfully loaded: {file_name}")
print(f"DataFrame shape: {df.shape[0]} rows, {df.shape[1]} columns")

Successfully loaded: yellow_tripdata_2025-01.parquet
DataFrame shape: 3475226 rows, 20 columns
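
The commented-out directory listing above could also be written with `pathlib`. The sketch below is one way to do it; the helper name `list_parquet_files` is hypothetical, and the demo runs against a temporary directory with empty placeholder files rather than the real data folder.

```python
from pathlib import Path
import tempfile


def list_parquet_files(data_dir):
    """Return the sorted names of all .parquet files directly inside data_dir."""
    return sorted(p.name for p in Path(data_dir).glob("*.parquet"))


# Demo on a throwaway directory; the files are empty placeholders, not real data
with tempfile.TemporaryDirectory() as tmp:
    for name in ["yellow_tripdata_2025-02.parquet",
                 "yellow_tripdata_2025-01.parquet",
                 "notes.txt"]:
        (Path(tmp) / name).touch()
    found = list_parquet_files(tmp)

print(f"Found {len(found)} parquet files:")
print(found)
```

Sorting the names keeps the monthly files in chronological order as long as they follow the `yellow_tripdata_YYYY-MM.parquet` pattern.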


## 3. Display DataFrame Information

Use the info() method to display basic information about the DataFrame, including column names and data types. Non-null counts do not appear in the output below because pandas omits them by default for DataFrames this large.

In [3]:
# Display DataFrame information
print("DataFrame Information:")
print("=" * 80)
df.info()

DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3475226 entries, 0 to 3475225
Data columns (total 20 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int32         
 1   tpep_pickup_datetime   datetime64[us]
 2   tpep_dropoff_datetime  datetime64[us]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int32         
 8   DOLocationID           int32         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  Airport_fee            float64       
 19  cbd_congestion_fee     float64       
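
The output above shows no non-null counts. A likely reason is that pandas skips them by default once a DataFrame has more rows than the `display.max_info_rows` option (1,690,785 by default), and this one has 3,475,226. The sketch below, run on a small made-up stand-in frame named `sample`, shows how `show_counts=True` forces the counts and how `isna().sum()` gives the same information directly.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real DataFrame (values are made up for illustration)
sample = pd.DataFrame({
    "VendorID": [1, 2, 2, 1],
    "passenger_count": [1.0, np.nan, 3.0, np.nan],
    "store_and_fwd_flag": ["N", None, "N", None],
    "fare_amount": [10.0, 5.1, 7.2, 19.1],
})

# Force the "Non-Null Count" column regardless of the frame's size
sample.info(show_counts=True)

# Missing values per column, as counts and as a percentage of all rows
missing = sample.isna().sum()
missing_pct = (missing / len(sample) * 100).round(2)
summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary)
```

On the full dataset the same calls would be `df.info(show_counts=True)` and `df.isna().sum()`.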

## 4. View Data Header

Use the head() method to display the first few rows of the DataFrame to see what the data looks like.

In [4]:
# Display the first 10 rows of the DataFrame
print("First 10 rows of the dataset:")
print("=" * 80)
df.head(10)

First 10 rows of the dataset:


,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,cbd_congestion_fee
0,1,2025-01-01 00:18:38,2025-01-01 00:26:59,1.0,1.6,1.0,N,229,237,1,10.0,3.5,0.5,3.0,0.0,1.0,18.0,2.5,0.0,0.0
1,1,2025-01-01 00:32:40,2025-01-01 00:35:13,1.0,0.5,1.0,N,236,237,1,5.1,3.5,0.5,2.02,0.0,1.0,12.12,2.5,0.0,0.0
2,1,2025-01-01 00:44:04,2025-01-01 00:46:01,1.0,0.6,1.0,N,141,141,1,5.1,3.5,0.5,2.0,0.0,1.0,12.1,2.5,0.0,0.0
3,2,2025-01-01 00:14:27,2025-01-01 00:20:01,3.0,0.52,1.0,N,244,244,2,7.2,1.0,0.5,0.0,0.0,1.0,9.7,0.0,0.0,0.0
4,2,2025-01-01 00:21:34,2025-01-01 00:25:06,3.0,0.66,1.0,N,244,116,2,5.8,1.0,0.5,0.0,0.0,1.0,8.3,0.0,0.0,0.0
5,2,2025-01-01 00:48:24,2025-01-01 01:08:26,2.0,2.63,1.0,N,239,68,2,19.1,1.0,0.5,0.0,0.0,1.0,24.1,2.5,0.0,0.0
6,1,2025-01-01 00:14:47,2025-01-01 00:16:15,0.0,0.4,1.0,N,170,170,1,4.4,3.5,0.5,2.35,0.0,1.0,11.75,2.5,0.0,0.0
7,1,2025-01-01 00:39:27,2025-01-01 00:51:51,0.0,1.6,1.0,N,234,148,1,12.1,3.5,0.5,2.0,0.0,1.0,19.1,2.5,0.0,0.0
8,1,2025-01-01 00:53:43,2025-01-01 01:13:23,0.0,2.8,1.0,N,148,170,1,19.1,3.5,0.5,3.0,0.0,1.0,27.1,2.5,0.0,0.0
9,2,2025-01-01 00:00:02,2025-01-01 00:09:36,1.0,1.71,1.0,N,237,262,2,11.4,1.0,0.5,0.0,0.0,1.0,16.4,2.5,0.0,0.0
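
Two quantities that the first rows suggest but the table does not contain are trip duration and average speed. As a rough sketch, they can be derived from the pickup and dropoff timestamps. The frame below copies the first three rows shown above, the column names `duration_min` and `avg_speed_mph` are made up for illustration, and `trip_distance` is assumed to be in miles.

```python
import pandas as pd

# First three trips from the head() output above
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2025-01-01 00:18:38", "2025-01-01 00:32:40", "2025-01-01 00:44:04"]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2025-01-01 00:26:59", "2025-01-01 00:35:13", "2025-01-01 00:46:01"]),
    "trip_distance": [1.6, 0.5, 0.6],
})

# Duration in minutes: dropoff minus pickup, converted from seconds
elapsed = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["duration_min"] = elapsed.dt.total_seconds() / 60

# Average speed, assuming trip_distance is measured in miles
trips["avg_speed_mph"] = trips["trip_distance"] / (trips["duration_min"] / 60)

print(trips[["trip_distance", "duration_min", "avg_speed_mph"]])
```

On the full dataset, rows with a zero or negative duration would need to be handled before dividing.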


## 5. Check Data Types and Summary Statistics

Use the dtypes attribute to check the data type of each column, then use describe() to get summary statistics.

In [5]:
# Display data types for each column
print("Data Types:")
print("=" * 80)
print(df.dtypes)
print("\n")

Data Types:
VendorID                          int32
tpep_pickup_datetime     datetime64[us]
tpep_dropoff_datetime    datetime64[us]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int32
DOLocationID                      int32
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
Airport_fee                     float64
cbd_congestion_fee              float64
dtype: object




In [6]:
# Display summary statistics for numerical columns
print("Summary Statistics (Numerical Columns):")
print("=" * 80)
df.describe()

Summary Statistics (Numerical Columns):


,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,cbd_congestion_fee
count,3475226.0,3475226,3475226,2935077.0,3475226.0,2935077.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,2935077.0,2935077.0,3475226.0
mean,1.785428,2025-01-17 11:02:55.910964,2025-01-17 11:17:56.997901,1.297859,5.855126,2.482535,165.1916,164.1252,1.036623,17.0818,1.317737,0.4780991,2.959813,0.4493081,0.9547946,25.61129,2.225237,0.1239111,0.4834093
min,1.0,2024-12-31 20:47:55,2024-12-18 07:52:40,0.0,0.0,1.0,1.0,1.0,0.0,-900.0,-7.5,-0.5,-86.0,-126.94,-1.0,-901.0,-2.5,-1.75,-0.75
25%,2.0,2025-01-10 07:59:01,2025-01-10 08:15:29.500000,1.0,0.98,1.0,132.0,113.0,1.0,8.6,0.0,0.5,0.0,0.0,1.0,15.2,2.5,0.0,0.0
50%,2.0,2025-01-17 15:41:33,2025-01-17 15:59:34,1.0,1.67,1.0,162.0,162.0,1.0,12.11,0.0,0.5,2.45,0.0,1.0,19.95,2.5,0.0,0.75
75%,2.0,2025-01-24 19:34:06,2025-01-24 19:48:31,1.0,3.1,1.0,234.0,234.0,1.0,19.5,2.5,0.5,3.93,0.0,1.0,27.78,2.5,0.0,0.75
max,7.0,2025-02-01 00:00:44,2025-02-01 23:44:11,9.0,276423.6,99.0,265.0,265.0,5.0,863372.1,15.0,10.5,400.0,170.94,1.0,863380.4,2.5,6.75,0.75
std,0.4263282,,,0.7507503,564.6016,11.63277,64.52948,69.40169,0.7013334,463.4729,1.861509,0.1374623,3.779681,2.002582,0.2781938,463.6585,0.9039932,0.472509,0.3619307
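
The summary statistics expose values worth a closer look: a minimum `fare_amount` of -900.0, a maximum `trip_distance` of 276423.6, and pickup timestamps outside January 2025. One possible way to flag such rows is sketched below on a small made-up frame. The 100-mile cutoff `MAX_PLAUSIBLE_MILES` is an arbitrary assumption, not a rule from the dataset.

```python
import pandas as pd

# Made-up rows that mimic the extremes seen in describe()
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2025-01-05 10:00:00", "2024-12-31 20:47:55",
        "2025-01-17 15:41:33", "2025-01-20 08:00:00"]),
    "trip_distance": [1.67, 3.1, 276423.6, 0.98],
    "fare_amount": [12.11, -900.0, 19.5, 8.6],
})

MAX_PLAUSIBLE_MILES = 100  # assumed threshold, chosen only for illustration
start, end = pd.Timestamp("2025-01-01"), pd.Timestamp("2025-02-01")

pickup = sample["tpep_pickup_datetime"]
flags = pd.DataFrame({
    "negative_fare": sample["fare_amount"] < 0,
    "implausible_distance": sample["trip_distance"] > MAX_PLAUSIBLE_MILES,
    "outside_month": ~((pickup >= start) & (pickup < end)),
})

# A row is suspect if any single check fails
sample["suspect"] = flags.any(axis=1)
clean = sample[~sample["suspect"]]

print(flags.sum())
print(f"Kept {len(clean)} of {len(sample)} rows")
```

Whether suspect rows should be dropped, corrected, or kept depends on the analysis; negative amounts, for example, may be refunds or adjustments rather than errors.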


In [7]:
# Display summary statistics for all columns including object types
print("Summary Statistics (All Columns):")
print("=" * 80)
df.describe(include='all')

Summary Statistics (All Columns):


,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,cbd_congestion_fee
count,3475226.0,3475226,3475226,2935077.0,3475226.0,2935077.0,2935077,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,3475226.0,2935077.0,2935077.0,3475226.0
unique,,,,,,,2,,,,,,,,,,,,,
top,,,,,,,N,,,,,,,,,,,,,
freq,,,,,,,2927431,,,,,,,,,,,,,
mean,1.785428,2025-01-17 11:02:55.910964,2025-01-17 11:17:56.997901,1.297859,5.855126,2.482535,,165.1916,164.1252,1.036623,17.0818,1.317737,0.4780991,2.959813,0.4493081,0.9547946,25.61129,2.225237,0.1239111,0.4834093
min,1.0,2024-12-31 20:47:55,2024-12-18 07:52:40,0.0,0.0,1.0,,1.0,1.0,0.0,-900.0,-7.5,-0.5,-86.0,-126.94,-1.0,-901.0,-2.5,-1.75,-0.75
25%,2.0,2025-01-10 07:59:01,2025-01-10 08:15:29.500000,1.0,0.98,1.0,,132.0,113.0,1.0,8.6,0.0,0.5,0.0,0.0,1.0,15.2,2.5,0.0,0.0
50%,2.0,2025-01-17 15:41:33,2025-01-17 15:59:34,1.0,1.67,1.0,,162.0,162.0,1.0,12.11,0.0,0.5,2.45,0.0,1.0,19.95,2.5,0.0,0.75
75%,2.0,2025-01-24 19:34:06,2025-01-24 19:48:31,1.0,3.1,1.0,,234.0,234.0,1.0,19.5,2.5,0.5,3.93,0.0,1.0,27.78,2.5,0.0,0.75
max,7.0,2025-02-01 00:00:44,2025-02-01 23:44:11,9.0,276423.6,99.0,,265.0,265.0,5.0,863372.1,15.0,10.5,400.0,170.94,1.0,863380.4,2.5,6.75,0.75
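
Five columns (`passenger_count`, `RatecodeID`, `store_and_fwd_flag`, `congestion_surcharge`, `Airport_fee`) share the same count of 2,935,077, which is 540,149 fewer than the total row count. That suggests, but does not prove, that the same rows are missing all five values. A minimal check is sketched below on a made-up frame; on the real data, `sample` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Made-up frame in which rows 1 and 3 lack all five values
sample = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 3.0, np.nan, 1.0],
    "RatecodeID": [1.0, np.nan, 1.0, np.nan, 99.0],
    "store_and_fwd_flag": ["N", None, "N", None, "Y"],
    "congestion_surcharge": [2.5, np.nan, 0.0, np.nan, 2.5],
    "Airport_fee": [0.0, np.nan, 0.0, np.nan, 1.75],
})

cols = ["passenger_count", "RatecodeID", "store_and_fwd_flag",
        "congestion_surcharge", "Airport_fee"]
null_mask = sample[cols].isna()

rows_any = null_mask.any(axis=1)  # at least one of the five is missing
rows_all = null_mask.all(axis=1)  # all five are missing

# If both masks match, the gaps always occur together in the same rows
same_rows = rows_any.equals(rows_all)
print(f"Rows missing all five values: {int(rows_all.sum())}")
print(f"Missing values always occur together: {same_rows}")
```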


## 6. Additional Information

List the column names and confirm the column count to better understand the dataset.

In [8]:
# Display all column names
print("Column Names:")
print("=" * 80)
print(df.columns.tolist())
print(f"\nTotal number of columns: {len(df.columns)}")

Column Names:
['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'Airport_fee', 'cbd_congestion_fee']

Total number of columns: 20
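
Several of these columns (`VendorID`, `RatecodeID`, `payment_type`, `store_and_fwd_flag`) hold category codes rather than measurements, so frequency counts tend to be more informative than means. A small sketch with `value_counts()` follows. The `payment_type` values are copied from the ten rows displayed earlier, and no meaning is assigned to the codes here.

```python
import pandas as pd

# payment_type values copied from the first ten rows of the dataset
sample = pd.DataFrame({"payment_type": [1, 1, 1, 2, 2, 2, 1, 1, 1, 2]})

# Absolute frequency of each code, ordered by code
counts = sample["payment_type"].value_counts().sort_index()

# Relative frequency (share of rows) of each code
shares = sample["payment_type"].value_counts(normalize=True).sort_index()

print(pd.DataFrame({"count": counts, "share": shares}))
```

Passing `dropna=False` to `value_counts()` would also count missing values, which matters for columns such as `RatecodeID` that contain nulls.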
