# Notebook Overview

In this notebook, we use the official NYC TLC website to acquire the 2023 yellow taxi trip records for New York City in Parquet format, then combine the monthly files into a single dataset for further analysis.

Source URL: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

### Importing Necessary Libraries

In [2]:
# Import necessary libraries
import pandas as pd
import os
import urllib.request

## Fetching Data From NYC Official Website

In [3]:


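# Configuration
# Landing page that lists the monthly files (kept for reference; the downloads below use direct file URLs)
data_url = "https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
# Local directory for the Parquet files; update this path for your machine
file_path = "/Users/md/Desktop/python_project/parquet_files/2023"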

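# Function to download data
def download_data(url, filename):
    """Download a file from a URL to a local path, skipping files that already exist."""
    if not os.path.exists(filename):
        # Create the target directory if it is missing; urlretrieve does not create it
        os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, filename)
        print(f"Downloaded {filename}")
    else:
        print(f"{filename} already exists")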

# File names mapped to their download URLs, one per month of 2023
base_url = "https://d37ci6vzurychx.cloudfront.net/trip-data"
files_to_download = {
    f"yellow_tripdata_2023-{month:02d}.parquet": f"{base_url}/yellow_tripdata_2023-{month:02d}.parquet"
    for month in range(1, 13)
}

# Download the files
for filename, url in files_to_download.items():
    download_data(url, os.path.join(file_path, filename))

# Loading the data
df_list = []
for filename in files_to_download.keys():
    full_path = os.path.join(file_path, filename)
    df = pd.read_parquet(full_path)
    df_list.append(df)
    print(f"Loaded data from {filename} with shape {df.shape}")

Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-01.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-02.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-03.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-04.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-05.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-06.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-07.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-08.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-09.parquet
Downloaded /Users/md/Desktop/python_project/parquet_files/2023/yellow_tripdata_2023-10.parquet
Downloaded /Users/md/Desktop/python_project/parque
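
One caveat with the `os.path.exists` check in `download_data`: if a download is interrupted, the partial file stays on disk and is treated as "already exists" on the next run. A minimal sketch of one way to avoid that is to write to a temporary file and rename it only after the transfer completes. The helper name `download_atomic` and the `.part` suffix are assumptions for illustration, not part of the original notebook.

```python
import os
import shutil
import tempfile
import urllib.request


def download_atomic(url, dest_path):
    """Download url to dest_path via a temporary file in the same directory.

    Returns True if a download happened, False if dest_path already existed.
    """
    if os.path.exists(dest_path):
        return False
    dest_dir = os.path.dirname(dest_path) or "."
    os.makedirs(dest_dir, exist_ok=True)
    # Keep the temporary file next to the destination so the final rename
    # stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp_file, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp_file)
        # Only a fully written file ever appears under the final name.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Remove the partial file so a rerun starts cleanly, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Under these assumptions, the download loop could call `download_atomic(url, os.path.join(file_path, filename))` in place of `download_data`.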

## Creating One Combined Parquet File for Data Preparation

In [4]:
# Combine all DataFrames into one
combined_df = pd.concat(df_list, ignore_index=True)
print("Combined DataFrame shape:", combined_df.shape)



Combined DataFrame shape: (38310226, 20)
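
`pd.concat` aligns frames by column name, so a column that is spelled differently in one monthly file becomes an extra column that is NaN for every other month. A minimal sketch for spotting this before concatenating follows. `column_mismatches` is a hypothetical helper, and the toy frames and their column names are illustrative, not taken from the TLC files.

```python
import pandas as pd


def column_mismatches(frames):
    """Map each frame label to the columns it lacks relative to the union of all columns."""
    all_columns = set()
    for frame in frames.values():
        all_columns.update(frame.columns)
    return {
        label: sorted(all_columns - set(frame.columns))
        for label, frame in frames.items()
        if all_columns - set(frame.columns)
    }


# Toy monthly frames; the second one spells a column name differently.
toy_frames = {
    "month_a": pd.DataFrame({"fare_amount": [9.3], "airport_fee": [0.0]}),
    "month_b": pd.DataFrame({"fare_amount": [7.9], "Airport_fee": [1.25]}),
}

mismatches = column_mismatches(toy_frames)
# Concatenating anyway yields three columns instead of two.
combined_toy = pd.concat(list(toy_frames.values()), ignore_index=True)
```

In the notebook, the same check could be run on `dict(zip(files_to_download, df_list))`; an empty result would mean every month shares the same column names.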


In [5]:
# Initial Data Check
print(combined_df.head())
combined_df.info()  # info() prints its own summary and returns None, so it is not wrapped in print()
print(combined_df.describe())

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         2  2023-01-01 00:32:10   2023-01-01 00:40:36              1.0   
1         2  2023-01-01 00:55:08   2023-01-01 01:01:27              1.0   
2         2  2023-01-01 00:25:04   2023-01-01 00:37:49              1.0   
3         1  2023-01-01 00:03:48   2023-01-01 00:13:25              0.0   
4         2  2023-01-01 00:10:29   2023-01-01 00:21:19              1.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           0.97         1.0                  N           161           141   
1           1.10         1.0                  N            43           237   
2           2.51         1.0                  N            48           238   
3           1.90         1.0                  N           138             7   
4           1.43         1.0                  N           107            79   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \


In [6]:

# Save the combined DataFrame to a new Parquet file
combined_df.to_parquet(os.path.join(file_path, "combined_yellow_tripdata_2023.parquet"))
print("Saved combined data to disk.")

# Conclusion
print("Data acquisition and initial loading completed.")

Saved combined data to disk.
Data acquisition and initial loading completed.
