# Introduction to Data Preparation for NYC Taxi Trips

## Objective
This Jupyter Notebook is dedicated to the initial stage of our data analytics project—**Data Preparation**. The primary goal here is to prepare the vast New York City Taxi Trips dataset for the year 2021, ensuring it's clean, organized, and ready for in-depth analysis and modeling in subsequent stages. The dataset, which includes over 30 million records, requires meticulous handling to manage its volume and enhance its quality effectively.

## Background
The NYC Taxi Trips dataset is sourced from the NYC Open Data portal and offers a detailed snapshot of taxi activities across New York City. It records every taxi trip's core details, such as times of pickup and dropoff, trip distances, fares charged, and more. These records not only provide insights into the city’s mobility patterns but also serve as a basis for predictive modeling of fares and understanding factors influencing taxi trip dynamics.

## Scope of This Notebook
In this notebook, we will perform several key tasks to prepare the data for further analysis:
1. **Data Loading**: Load the data from four pre-processed partitions to manage the dataset's size efficiently.
2. **Initial Exploration**: Conduct a preliminary examination to understand the dataset's structure, missing values, anomalies, and data types.
3. **Data Cleaning**: Address missing or incorrect values, remove duplicates, and handle any outliers or erroneous entries.
4. **Feature Engineering**: Develop new features that are more informative for analysis and predictive modeling, such as calculating trip durations and categorizing times of day.
5. **Data Transformation**: Standardize and normalize data as necessary to prepare for machine learning algorithms that require standardized input.
6. **Data Reduction**: Reduce dimensionality where applicable to improve model performance and decrease computational requirements.
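
As a minimal sketch of the feature engineering step, assuming the pickup and dropoff columns hold parseable timestamp strings, trip duration and a time-of-day bucket could be derived as shown below. The function name `add_time_features`, the output column names, and the four 6-hour buckets are illustrative assumptions, not choices fixed by this notebook.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with trip duration (minutes) and a time-of-day bucket."""
    out = df.copy()
    pickup = pd.to_datetime(out["tpep_pickup_datetime"])
    dropoff = pd.to_datetime(out["tpep_dropoff_datetime"])

    # Duration in minutes between meter engaged and meter disengaged
    out["trip_duration_min"] = (dropoff - pickup).dt.total_seconds() / 60.0

    # Hypothetical buckets: [0, 6) night, [6, 12) morning, [12, 18) afternoon, [18, 24) evening
    bins = [0, 6, 12, 18, 24]
    labels = ["night", "morning", "afternoon", "evening"]
    out["time_of_day"] = pd.cut(pickup.dt.hour, bins=bins, labels=labels, right=False)
    return out


# Small illustrative sample, not real query output
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["2021-01-01 00:30:10", "2021-01-01 13:05:00"],
    "tpep_dropoff_datetime": ["2021-01-01 00:36:12", "2021-01-01 13:35:00"],
})
features = add_time_features(sample)
print(features[["trip_duration_min", "time_of_day"]])
```

Negative or implausibly long durations produced by this step would be candidates for the cleaning stage rather than something the function itself corrects.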

## Tools and Libraries
We will use Python as our main programming language, leveraging Pandas for data manipulation, NumPy for numerical operations, Matplotlib and Seaborn for visualization, and Dask for loading, memory optimization, and data cleaning. These tools were chosen for their efficiency and ease of use with large datasets like ours.

### Utilizing Dask for Efficient Data Handling in NYC Taxi Trips Analysis
As stated in the Objective, this notebook prepares the 2021 NYC Taxi Trips dataset for exploratory analysis and modeling. Because the dataset holds over 30 million records, efficient handling is essential at every step.

#### Why Dask?
A dataset of this size strains traditional in-memory data processing. Dask is a flexible parallel computing library designed to integrate with Pandas and NumPy. It lets us work with large datasets that do not fit entirely in memory by breaking them into manageable chunks and running computations on those chunks in parallel, on a single machine or across a cluster.

#### Dask's Role in Our Project
Dask will be used primarily for:

* **Efficient Data Loading**: Handling the four pre-processed partitions without overwhelming system memory.
* **Data Manipulation and Transformation**: Concatenating multiple DataFrames, handling missing values, correcting outliers, and engineering features, all in a way that optimizes memory use and computational speed.

By leveraging Dask, we keep our data manipulation steps consistent while making them scalable and efficient. This speeds up processing and makes the complications that come with the dataset's size easier to manage.


## Conclusion
By the end of this notebook, the dataset will be transformed into a clean, comprehensive format suitable for detailed exploratory data analysis and machine learning tasks in the following stages of this project. The meticulous preparation we perform here is crucial for ensuring the accuracy and reliability of our later analyses and predictions.


# 1. Data Loading 
In this section we import the libraries and tools used for data preprocessing, as described above.

We first load the four pickled partitions and examine each one separately to understand its structure. We then combine them into a single Dask DataFrame for further cleaning and data preparation.

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd


# Set some options for displaying the data tables nicely
pd.set_option('display.max_columns', None)  # Show all columns of DataFrames
pd.set_option('display.width', 1000)        # Ensure the display is wide enough to view all DataFrame columns
pd.set_option('display.float_format', '{:,.2f}'.format)  # Format floats for easier reading

# Setting the style for seaborn plots
sns.set(style="whitegrid")


Pickle is a Python-specific binary serialization method used to save and load Python objects directly, preserving their data types and structure. In our NYC Taxi Trips data analytics project, we chose Pickle for its efficiency and ease of use, especially given the large volume of data involved. It enables fast loading and saving of complex Pandas DataFrames, significantly speeding up our workflow by avoiding repeated preprocessing. Although Pickle should be used cautiously due to security risks when dealing with untrusted data sources, it is ideal for our controlled environment where these concerns are mitigated.
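
The dtype-preserving behavior described above can be checked with a short round trip. This is a sketch on a tiny made-up DataFrame written to a temporary directory. The file names are placeholders, not the project's paths.

```python
import os
import tempfile

import pandas as pd

# Tiny illustrative frame with a categorical, a datetime, and a float column
df = pd.DataFrame({
    "VendorID": pd.Categorical(["1.0", "2.0", "1.0"]),
    "tpep_pickup_datetime": pd.to_datetime(
        ["2021-01-01 00:30:10", "2021-01-01 00:51:20", "2021-01-01 00:43:30"]
    ),
    "fare_amount": [8.0, 3.0, 42.0],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "sample.pkl")  # placeholder file name
    csv_path = os.path.join(tmp, "sample.csv")  # placeholder file name
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)  # no dtype hints given on purpose

# Pickle keeps category and datetime dtypes; a plain CSV read has to re-infer them
dtypes_preserved = list(from_pickle.dtypes) == list(df.dtypes)
print("Pickle dtypes preserved:", dtypes_preserved)
print(from_csv.dtypes)
```

This is why the CSV branch of the loader needs an explicit `dtype` mapping while the Pickle branch does not.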



In [21]:
# Define data types for data consistency
dtypes = {
    'VendorID': 'category',
    'tpep_pickup_datetime': 'str',
    'tpep_dropoff_datetime': 'str',
    'passenger_count': 'float64',
    'trip_distance': 'float64',
    'RatecodeID': 'category',
    'store_and_fwd_flag': 'category',
    'PULocationID': 'category',
    'DOLocationID': 'category',
    'payment_type': 'category',
    'fare_amount': 'float64',
    'extra': 'float64',
    'mta_tax': 'float64',
    'tip_amount': 'float64',
    'tolls_amount': 'float64',
    'improvement_surcharge': 'float64',
    'total_amount': 'float64',
    'congestion_surcharge': 'float64'
}

# Load data from CSV files only if Pickle files do not exist or when processing for the first time
try:
    df1 = pd.read_pickle("/Users/md/Desktop/python_project/df1.pkl")
    df2 = pd.read_pickle("/Users/md/Desktop/python_project/df2.pkl")
    df3 = pd.read_pickle("/Users/md/Desktop/python_project/df3.pkl")
    df4 = pd.read_pickle("/Users/md/Desktop/python_project/df4.pkl")
    print("Data loaded from Pickle files.")
except FileNotFoundError:
    print("Pickle files not found. Loading data from CSV files and saving as Pickle.")
    df1 = pd.read_csv("2021_TLC_0.csv", dtype=dtypes)
    df2 = pd.read_csv("2021_TLC_1.csv", dtype=dtypes)
    df3 = pd.read_csv("2021_TLC_2.csv", dtype=dtypes)
    df4 = pd.read_csv("2021_TLC_3.csv", dtype=dtypes)
    # Save DataFrames to Pickle for future use
    df1.to_pickle("/Users/md/Desktop/python_project/df1.pkl")
    df2.to_pickle("/Users/md/Desktop/python_project/df2.pkl")
    df3.to_pickle("/Users/md/Desktop/python_project/df3.pkl")
    df4.to_pickle("/Users/md/Desktop/python_project/df4.pkl")



Data loaded from Pickle files.


In [22]:
df4.head(2)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,2.0,10/28/2021 03:35:00 PM,10/28/2021 03:44:01 PM,1.0,1.8,1.0,N,170.0,79.0,1.0,8.0,0.0,0.5,1.0,0.0,0.3,12.3,2.5
1,2.0,10/28/2021 03:45:48 PM,10/28/2021 04:04:51 PM,1.0,1.45,1.0,N,79.0,170.0,1.0,12.5,0.0,0.5,3.16,0.0,0.3,18.96,2.5


In the section above, we loaded all four partitioned datasets, with data types defined according to the dataset's documentation. We then saved the DataFrames in Pickle format so that later runs load faster and stay consistent throughout the rest of the notebook, without re-reading the large CSV files after every launch.

#### Column Descriptions per the Data Dictionary and the Datasets
* `VendorID`: A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc.
* `tpep_pickup_datetime`: The date and time when the meter was engaged.
* `tpep_dropoff_datetime`: The date and time when the meter was disengaged.
* `passenger_count`: The number of passengers in the vehicle.
* `trip_distance`: The elapsed trip distance in miles reported by the taximeter.
* `RatecodeID`: The final rate code in effect at the end of the trip.
    * 1 = Standard rate
    * 2 = JFK
    * 3 = Newark
    * 4 = Nassau or Westchester
    * 5 = Negotiated fare
    * 6 = Group ride
    * 99 = Null/unknown
* `store_and_fwd_flag`: Indicates whether the trip record was held in vehicle memory before being sent to the vendor ("store and forward") because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip.
* `PULocationID`: TLC Taxi Zone in which the taximeter was engaged.
* `DOLocationID`: TLC Taxi Zone in which the taximeter was disengaged.
* `payment_type`: A numeric code signifying how the passenger paid for the trip.
    * 0 = Flex Fare trip
    * 1 = Credit card
    * 2 = Cash
    * 3 = No charge
    * 4 = Dispute
    * 5 = Unknown
    * 6 = Voided trip
* `fare_amount`: The time-and-distance fare calculated by the meter.
* `extra`: Miscellaneous extras and surcharges.
* `mta_tax`: Tax that is automatically triggered based on the metered rate in use.
* `tip_amount`: Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.
* `tolls_amount`: Total amount of all tolls paid in the trip.
* `improvement_surcharge`: Improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
* `total_amount`: The total amount charged to passengers. Does not include cash tips.
* `congestion_surcharge`: Total amount collected in the trip for the NYS congestion surcharge.
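
For readability in later exploration, the coded columns can be translated into the labels listed above. The sketch below assumes the codes may arrive as strings such as `"1.0"` (as they would after a categorical CSV read). The helper name `decode` and the sample rows are illustrative.

```python
import pandas as pd

# Labels copied from the data dictionary above
PAYMENT_TYPE_LABELS = {
    0: "Flex Fare trip", 1: "Credit card", 2: "Cash", 3: "No charge",
    4: "Dispute", 5: "Unknown", 6: "Voided trip",
}
RATECODE_LABELS = {
    1: "Standard rate", 2: "JFK", 3: "Newark", 4: "Nassau or Westchester",
    5: "Negotiated fare", 6: "Group ride", 99: "Null/unknown",
}


def decode(codes: pd.Series, labels: dict) -> pd.Series:
    """Map numeric codes (stored as numbers, strings, or categories) to text labels."""
    numeric = pd.to_numeric(codes.astype(str), errors="coerce")
    # Missing or unparseable codes stay missing; unlisted codes become None
    return numeric.map(lambda v: labels.get(int(v)) if pd.notna(v) else None)


# Made-up sample rows for illustration
sample = pd.DataFrame({
    "payment_type": ["1.0", "2.0", None],
    "RatecodeID": ["1.0", "99.0", "2.0"],
})
payment_labels = decode(sample["payment_type"], PAYMENT_TYPE_LABELS)
ratecode_labels = decode(sample["RatecodeID"], RATECODE_LABELS)
print(payment_labels.tolist(), ratecode_labels.tolist())
```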


 

### 1.1 Initial Exploration Of Partitioned Datasets

After loading the data, it is crucial to understand its structure and how we can work with it.

#### 1.1.1 Partition 1 Initial Exploration

Partition 1 (`df1`) contains approximately 7.92 million entries across 18 attributes. The data spans categorical variables such as `VendorID` and `RatecodeID`, datetime variables for pickup and dropoff, and numerical values for passenger counts, trip distances, and fare amounts. On average, trips carry 1.41 passengers, cover 5.02 miles, and cost $12.36 in fare. However, significant outliers in fare and distance, along with anomalies such as negative values, suggest data entry errors or rare long-distance trips. The monetary and distance fields have no missing values, but `VendorID`, `passenger_count`, `RatecodeID`, `store_and_fwd_flag`, and `payment_type` are each missing 452,538 entries, which could affect analyses that rely on those fields. Next steps include cleaning to address outliers, filling missing values based on their distribution and impact, and feature engineering such as calculating trip durations and categorizing times of day. These measures will prepare the dataset for further analysis and modeling.


In [23]:
# Display the first few rows of the first DataFrame
print(df1.head(3))

# Display summary information about the DataFrame (info() prints its report and returns None)
df1.info()

# Basic statistical details
print(df1.describe())


  VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge
0      1.0  2021-01-01 00:30:10   2021-01-01 00:36:12             1.00           2.10        1.0                  N        142.0         43.0          2.0         8.00   3.00     0.50        0.00          0.00                   0.30         11.80                  2.50
1      1.0  2021-01-01 00:51:20   2021-01-01 00:52:19             1.00           0.20        1.0                  N        238.0        151.0          2.0         3.00   0.50     0.50        0.00          0.00                   0.30          4.30                  0.00
2      1.0  2021-01-01 00:43:30   2021-01-01 01:11:06             1.00          14.70        1.0                  N        132.0        165.0          1.0        42.00   0.50     0.50        8.

In [5]:
# Count missing values in each column
missing_values1 = df1.isnull().sum()
print("Missing values in each column:\n", missing_values1)


Missing values in each column:
 VendorID                 452538
tpep_pickup_datetime          0
tpep_dropoff_datetime         0
passenger_count          452538
trip_distance                 0
RatecodeID               452538
store_and_fwd_flag       452538
PULocationID                  0
DOLocationID                  0
payment_type             452538
fare_amount                   0
extra                         0
mta_tax                       0
tip_amount                    0
tolls_amount                  0
improvement_surcharge         0
total_amount                  0
congestion_surcharge          0
dtype: int64


In [7]:
print("Data types of each column:\n", df1.dtypes)


Data types of each column:
 VendorID                       category
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                     category
store_and_fwd_flag             category
PULocationID                   category
DOLocationID                   category
payment_type                   category
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
dtype: object


#### 1.1.2 Partition 2 Initial Exploration

Partition 2 (`df2`) contains approximately 7.89 million taxi trips across the same 18 attributes. It includes both categorical variables such as `VendorID` and `RatecodeID` and continuous variables such as `trip_distance` and `fare_amount`. The average trip distance is 7.91 miles and the average fare amount is $13.46, although extremes in the data (such as a maximum trip distance of over 332,541 miles) point to significant outliers or errors. `VendorID`, `passenger_count`, `RatecodeID`, `store_and_fwd_flag`, and `payment_type` are each missing 381,490 entries, which could affect the completeness and reliability of any analysis that depends on them. The categorical and numeric data types are assigned as intended, but the pickup and dropoff timestamps are stored as strings (`object`) and will need to be parsed into datetimes. Next steps include cleaning the dataset to address outliers, filling in missing values where possible, and potentially simplifying the dataset through feature engineering to focus on the most impactful variables.

In [8]:
# Display the first few rows of the second DataFrame
print(df2.head(3))

# Display summary information about the DataFrame (info() prints its report and returns None)
df2.info()

# Basic statistical details
print(df2.describe())


  VendorID    tpep_pickup_datetime   tpep_dropoff_datetime  passenger_count  trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge
0      2.0  05/15/2021 10:35:13 AM  05/15/2021 11:08:19 AM             5.00          17.89        2.0                  N        132.0        224.0          1.0        52.00   0.00     0.50        8.00          6.55                   0.30         69.85                  2.50
1      2.0  05/15/2021 10:13:57 AM  05/15/2021 10:19:57 AM             2.00           1.11        1.0                  N        114.0        249.0          1.0         6.50   0.00     0.50        1.96          0.00                   0.30         11.76                  2.50
2      2.0  05/15/2021 10:26:33 AM  05/15/2021 10:43:56 AM             2.00           4.44        1.0                  N        100.0         12.0          1.0        16.50   0.0

In [9]:
# Count missing values in each column
missing_values2 = df2.isnull().sum()
print("Missing values in each column:\n", missing_values2)


Missing values in each column:
 VendorID                 381490
tpep_pickup_datetime          0
tpep_dropoff_datetime         0
passenger_count          381490
trip_distance                 0
RatecodeID               381490
store_and_fwd_flag       381490
PULocationID                  0
DOLocationID                  0
payment_type             381490
fare_amount                   0
extra                         0
mta_tax                       0
tip_amount                    0
tolls_amount                  0
improvement_surcharge         0
total_amount                  0
congestion_surcharge          0
dtype: int64


In [13]:
print("Data types of each column:\n", df2.dtypes)


Data types of each column:
 VendorID                 category
tpep_pickup_datetime       object
tpep_dropoff_datetime      object
passenger_count           float64
trip_distance             float64
RatecodeID               category
store_and_fwd_flag       category
PULocationID             category
DOLocationID             category
payment_type             category
fare_amount               float64
extra                     float64
mta_tax                   float64
tip_amount                float64
tolls_amount              float64
improvement_surcharge     float64
total_amount              float64
congestion_surcharge      float64
dtype: object


#### 1.1.3 Partition 3 Initial Exploration

The third partition consists of approximately 7.87 million entries, each detailing aspects of taxi trips such as passenger count, trip distance, and fare amounts. The average trip distance is 6.02 miles and the average fare amount is $13.99. Fares and distances span a wide range, including extreme values that point to potential errors or unusual trips. `passenger_count`, `RatecodeID`, `store_and_fwd_flag`, and `congestion_surcharge` are each missing 278,444 entries, which could influence analysis outcomes. Categorical data is stored efficiently and numerical data is in a format suited to statistical operations, although the pickup and dropoff timestamps are still stored as strings (`object`). Future steps involve data cleaning to rectify anomalies and address missing values, alongside exploring data relationships and feature engineering to enrich the dataset's analytical value.




In [11]:
# Display the first few rows of the third DataFrame
print(df3.head(3))

# Display summary information about the DataFrame (info() prints its report and returns None)
df3.info()

# Basic statistical details
print(df3.describe())


  VendorID    tpep_pickup_datetime   tpep_dropoff_datetime  passenger_count  trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge
0      1.0  08/10/2021 04:22:40 PM  08/10/2021 04:40:02 PM             1.00           2.50        1.0                  N        170.0        236.0          1.0        13.00   3.50     0.50        3.45          0.00                   0.30         20.75                  2.50
1      1.0  08/10/2021 04:11:09 PM  08/10/2021 04:19:01 PM             0.00           1.00        1.0                  N        107.0         90.0          2.0         7.00   3.50     0.50        0.00          0.00                   0.30         11.30                  2.50
2      1.0  08/10/2021 04:20:56 PM  08/10/2021 04:35:54 PM             0.00           2.40        1.0                  N         68.0        142.0          1.0        11.50   3.5

In [17]:
# Count missing values in each column
missing_values3 = df3.isnull().sum()
print("Missing values in each column:\n", missing_values3)

Missing values in each column:
 VendorID                      0
tpep_pickup_datetime          0
tpep_dropoff_datetime         0
passenger_count          278444
trip_distance                 0
RatecodeID               278444
store_and_fwd_flag       278444
PULocationID                  0
DOLocationID                  0
payment_type                  0
fare_amount                   0
extra                         0
mta_tax                       0
tip_amount                    0
tolls_amount                  0
improvement_surcharge         0
total_amount                  0
congestion_surcharge     278444
dtype: int64


In [15]:
print("Data types of each column:\n", df3.dtypes)

Data types of each column:
 VendorID                 category
tpep_pickup_datetime       object
tpep_dropoff_datetime      object
passenger_count           float64
trip_distance             float64
RatecodeID               category
store_and_fwd_flag       category
PULocationID             category
DOLocationID             category
payment_type             category
fare_amount               float64
extra                     float64
mta_tax                   float64
tip_amount                float64
tolls_amount              float64
improvement_surcharge     float64
total_amount              float64
congestion_surcharge      float64
dtype: object


#### 1.1.4 Partition 4 Initial Exploration

The fourth partition mirrors the structure of the earlier partitions and contains approximately 7.22 million records. The average trip distance is 8.92 miles and the average fare amount is $14.33, with extremes in both measurements suggesting notable outliers or data integrity issues. `passenger_count`, `RatecodeID`, `store_and_fwd_flag`, and `congestion_surcharge` are each missing 366,223 entries, which limits analyses that depend on those fields. The categorical and numeric data types are assigned as intended, but the pickup and dropoff timestamps are stored as strings (`object`) and must be parsed before any time series work. Next steps for this partition include data cleaning to handle outliers and fill data gaps, further exploratory analysis to uncover underlying patterns, and the potential development of new features that better capture taxi trip economics and behavior in NYC.

In [16]:
# Display the first few rows of the fourth DataFrame
print(df4.head(3))

# Display summary information about the DataFrame (info() prints its report and returns None)
df4.info()

# Basic statistical details
print(df4.describe())

  VendorID    tpep_pickup_datetime   tpep_dropoff_datetime  passenger_count  trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge
0      2.0  10/28/2021 03:35:00 PM  10/28/2021 03:44:01 PM             1.00           1.80        1.0                  N        170.0         79.0          1.0         8.00   0.00     0.50        1.00          0.00                   0.30         12.30                  2.50
1      2.0  10/28/2021 03:45:48 PM  10/28/2021 04:04:51 PM             1.00           1.45        1.0                  N         79.0        170.0          1.0        12.50   0.00     0.50        3.16          0.00                   0.30         18.96                  2.50
2      1.0  10/28/2021 03:08:00 PM  10/28/2021 03:26:20 PM             2.00           1.70        1.0                  N        141.0        142.0          1.0        12.00   2.5

In [18]:
# Count missing values in each column
missing_values4 = df4.isnull().sum()
print("Missing values in each column:\n", missing_values4)

Missing values in each column:
 VendorID                      0
tpep_pickup_datetime          0
tpep_dropoff_datetime         0
passenger_count          366223
trip_distance                 0
RatecodeID               366223
store_and_fwd_flag       366223
PULocationID                  0
DOLocationID                  0
payment_type                  0
fare_amount                   0
extra                         0
mta_tax                       0
tip_amount                    0
tolls_amount                  0
improvement_surcharge         0
total_amount                  0
congestion_surcharge     366223
dtype: int64


In [19]:
print("Data types of each column:\n", df4.dtypes)

Data types of each column:
 VendorID                 category
tpep_pickup_datetime       object
tpep_dropoff_datetime      object
passenger_count           float64
trip_distance             float64
RatecodeID               category
store_and_fwd_flag       category
PULocationID             category
DOLocationID             category
payment_type             category
fare_amount               float64
extra                     float64
mta_tax                   float64
tip_amount                float64
tolls_amount              float64
improvement_surcharge     float64
total_amount              float64
congestion_surcharge      float64
dtype: object
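
The dtype listings above show the pickup and dropoff timestamps as `datetime64[ns]` in partition 1 and as `object` strings (for example `05/15/2021 10:35:13 AM`) in partitions 2 to 4. One possible way to bring both representations to a common datetime type is sketched below. The helper name `parse_mixed_timestamps` and the two format strings are assumptions based on the sample rows shown, and values matching neither format become `NaT`.

```python
import pandas as pd

# Formats inferred from the sample rows above (assumptions, not confirmed for every row)
US_12H_FORMAT = "%m/%d/%Y %I:%M:%S %p"   # e.g. 05/15/2021 10:35:13 AM
ISO_FORMAT = "%Y-%m-%d %H:%M:%S"          # e.g. 2021-01-01 00:30:10


def parse_mixed_timestamps(values: pd.Series) -> pd.Series:
    """Parse timestamps that may be in either of the two observed formats."""
    text = values.astype(str)
    parsed_us = pd.to_datetime(text, format=US_12H_FORMAT, errors="coerce")
    parsed_iso = pd.to_datetime(text, format=ISO_FORMAT, errors="coerce")
    # Prefer the 12-hour parse, fall back to the ISO parse where it failed
    return parsed_us.fillna(parsed_iso)


raw = pd.Series([
    "2021-01-01 00:30:10",
    "05/15/2021 10:35:13 AM",
    "10/28/2021 03:35:00 PM",
    "not a timestamp",
])
parsed = parse_mixed_timestamps(raw)
print(parsed)
```

Passing an explicit `format` avoids per-row format guessing, which matters at tens of millions of rows.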


### 1.2 Integrating Dask for Scalable Data Management

In this section, we integrate Dask to handle the large NYC Taxi Trips dataset. We first define a function, `load_pickle_as_dask`, which reads a pickle file with Pandas and converts the result into a Dask DataFrame with 10 partitions. Each pickle is still read fully into memory by Pandas during loading; the benefit of the conversion is that subsequent operations run partition by partition and can be parallelized. We apply this function to the four data partitions and then use Dask's `concat` function to merge them into a single DataFrame. This sets the stage for the more complex cleaning and transformation steps that follow.

In [24]:
# Function to load a pickle file and convert to a Dask DataFrame
def load_pickle_as_dask(filepath):
    # Read the pickle file using Pandas
    pdf = pd.read_pickle(filepath)
    # Convert the Pandas DataFrame to a Dask DataFrame
    return dd.from_pandas(pdf, npartitions=10)

# Load data using the function defined above
df1 = load_pickle_as_dask("/Users/md/Desktop/python_project/df1.pkl")
df2 = load_pickle_as_dask("/Users/md/Desktop/python_project/df2.pkl")
df3 = load_pickle_as_dask("/Users/md/Desktop/python_project/df3.pkl")
df4 = load_pickle_as_dask("/Users/md/Desktop/python_project/df4.pkl")

# Combine dataframes using Dask
df_combined = dd.concat([df1, df2, df3, df4])

# 2. Initial Exploration Of Combined Dataset

To verify data integrity and see what the combined data looks like, we inspect the concatenated DataFrame.

In [25]:
df_combined.head(1)

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2021-01-01 00:30:10,2021-01-01 00:36:12,1.0,2.1,1.0,N,142.0,43.0,2.0,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5


In [26]:
# Confirm that the combined DataFrame contains every row from the partitioned files
# Calculate the total number of rows in each partition
num_rows_df1 = df1.shape[0].compute()
num_rows_df2 = df2.shape[0].compute()
num_rows_df3 = df3.shape[0].compute()
num_rows_df4 = df4.shape[0].compute()

# Sum the rows of all individual partitions
total_rows_partitions = num_rows_df1 + num_rows_df2 + num_rows_df3 + num_rows_df4

# Calculate the total number of rows in the combined DataFrame
total_rows_combined = df_combined.shape[0].compute()

# Print the results to compare
print(f"Total rows in df1: {num_rows_df1}")
print(f"Total rows in df2: {num_rows_df2}")
print(f"Total rows in df3: {num_rows_df3}")
print(f"Total rows in df4: {num_rows_df4}")
print(f"Total rows in all partitions: {total_rows_partitions}")
print(f"Total rows in combined DataFrame: {total_rows_combined}")

# Check if the sums match
if total_rows_partitions == total_rows_combined:
    print("The sum of rows from all partitions matches the total rows in the combined DataFrame.")
else:
    print("Mismatch in row counts. Please check the data loading and concatenation steps.")

Total rows in df1: 7919804
Total rows in df2: 7887665
Total rows in df3: 7873284
Total rows in df4: 7223319
Total rows in all partitions: 30904072
Total rows in combined DataFrame: 30904072
The sum of rows from all partitions matches the total rows in the combined DataFrame.


In [27]:
# Show the first few rows of the combined DataFrame to understand what it contains
print("First few rows of the combined DataFrame:")
print(df_combined.head())

# Get a summary of the DataFrame (info() prints its report and returns None)
print("Summary of the combined DataFrame:")
df_combined.info()

# Compute basic statistical details like percentile, mean, std etc. of the DataFrame's numeric columns
print("Basic statistical details of the combined DataFrame:")
print(df_combined.describe().compute())

# List all columns to make sure all expected columns are present
print("Columns in the combined DataFrame:")
print(df_combined.columns)

# Additional: Check for missing values in each column
print("Missing values in each column of the combined DataFrame:")
print(df_combined.isnull().sum().compute())

First few rows of the combined DataFrame:
  VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge
0      1.0  2021-01-01 00:30:10   2021-01-01 00:36:12             1.00           2.10        1.0                  N        142.0         43.0          2.0         8.00   3.00     0.50        0.00          0.00                   0.30         11.80                  2.50
1      1.0  2021-01-01 00:51:20   2021-01-01 00:52:19             1.00           0.20        1.0                  N        238.0        151.0          2.0         3.00   0.50     0.50        0.00          0.00                   0.30          4.30                  0.00
2      1.0  2021-01-01 00:43:30   2021-01-01 01:11:06             1.00          14.70        1.0                  N        132.0        165.0          

### Combined DataFrame Overview

#### Data Composition
The dataset comprises an extensive collection of 30,904,072 taxi trips, detailing various aspects such as pickup and dropoff times, trip distances, fares, and more.

#### Data Structure
The DataFrame is organized into 18 columns, providing a broad spectrum of information per taxi trip. This includes categorical data like `VendorID` and `RatecodeID`, alongside numerical data such as `passenger_count`, `trip_distance`, and `fare_amount`.

#### Data Types
A mix of data types optimizes the storage and processing efficiency, with categories used for identifiers and descriptors, and floating-point numbers for quantitative measurements.

#### Statistical Summary
- **Central Tendencies**: The mean trip distance is approximately 6.92 miles and the mean fare amount is 13.52 dollars. Because means are sensitive to the extreme values noted below, these figures may overstate a typical ride.
- **Variability**: `fare_amount` and `trip_distance` vary widely; their standard deviations are large due to extreme values.
- **Extremes**: The presence of extreme values, such as a maximum trip distance of over 351,613 miles and a fare exceeding $818,283, indicates potential outliers or data entry errors.
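
To illustrate why a single extreme value matters for the summary above, here is a minimal sketch using pandas. The distances are hypothetical values chosen for illustration; only the last one is taken from the maximum reported in this notebook.

```python
import pandas as pd

# Hypothetical trip distances in miles; the last value mimics a data-entry error
distances = pd.Series([1.2, 2.1, 3.4, 0.8, 351613.36])

# Compare statistics that react to outliers (mean, std) with one that does not (median)
summary = {
    "mean": distances.mean(),
    "median": distances.median(),
    "std": distances.std(),
}
print(summary)
```

The median stays at a plausible trip length while the mean and standard deviation are dominated by the one erroneous record, which is why the median is the safer reference point until outliers are removed.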

#### Missing Data
Several key columns, including `VendorID`, `RatecodeID`, and `congestion_surcharge`, have missing entries. Over 1.4 million records lack passenger count information, with similar deficiencies noted for rate code and store and forward flag, impacting the completeness of analyses that depend on these fields.

#### Data Integrity and Actionable Insights
While the dataset's comprehensive detail makes it a rich resource for analysis, the presence of outliers and missing values necessitates meticulous data cleaning and preprocessing. Addressing these issues will ensure the robustness of subsequent analyses and modeling efforts, enhancing the reliability of derived insights and predictive models.

This initial overview sets the stage for targeted data cleaning, exploratory analysis, and further in-depth study to uncover insights and develop models based on the dynamics of taxi trips documented in the dataset.


# 3. Data Cleaning 

## 3.1 Handling Missing Values

#### 3.1.1 VendorID   
`VendorID` identifies the taxi service provider. It is not crucial to our analysis, but it can indicate whether different providers use different charging and pricing systems. We therefore keep the records with missing values and fill the gaps using mode imputation.


In [30]:
# mode() returns a Series (there can be ties), so take the first value as a scalar
most_common_vendor = df_combined['VendorID'].mode().compute().iloc[0]
df_combined['VendorID'] = df_combined['VendorID'].fillna(most_common_vendor)


#### 3.1.2 payment_type
`payment_type` has 834,028 missing values. Domain knowledge helps here: the TLC data dictionary notes that `tip_amount` is populated automatically for credit card tips and that cash tips are not included. If `tip_amount` is greater than 0, we therefore infer that a credit card payment was used (code 1). Any remaining missing values are filled with the “Unknown” option.

In [31]:
# Infer the payment type based on tip amount:
df_combined['payment_type'] = df_combined.apply(
    lambda row: '1' if pd.isna(row['payment_type']) and row['tip_amount'] > 0 else row['payment_type'],
    axis=1,
    meta=('payment_type', 'object')  # specifying the meta is important for Dask to know the output format
)

# Fill remaining missing values in 'payment_type' with 'Unknown'
df_combined['payment_type'] = df_combined['payment_type'].fillna('Unknown')

# Calculate and display the updated counts of each payment type
df_combined['payment_type'].value_counts().compute()

1.0        22480190
2.0         6669962
0.0          644667
Unknown      460442
1            373586
3.0          154148
4.0          121073
5.0               4
Name: payment_type, dtype: int64
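
The counts above list `1.0` and `1` as separate categories because the inferred rows receive the string `'1'` while the existing rows keep the numeric code `1.0`. A minimal sketch of a vectorized alternative that keeps one label is shown below. It runs on a small hypothetical pandas sample rather than on `df_combined`, and it assumes that mixed numeric and string labels are acceptable once "Unknown" is added.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same two columns used in the cell above
sample = pd.DataFrame({
    "payment_type": [1.0, 2.0, np.nan, np.nan, 0.0],
    "tip_amount": [2.5, 0.0, 3.0, 0.0, 0.0],
})

# Rows where payment_type is missing but a tip was recorded
inferred_card = sample["payment_type"].isna() & (sample["tip_amount"] > 0)

# Assign the numeric code 1.0 (not the string '1') so inferred rows
# are counted together with the existing credit card rows
payment = sample["payment_type"].mask(inferred_card, 1.0)

# Label whatever is still missing; the column becomes object dtype at this point
payment = payment.astype(object).fillna("Unknown")

print(payment.value_counts())
```

This avoids the row-wise `apply`, which is slow on a dataset of this size. The `mask` and `fillna` methods are also used on Dask Series elsewhere in this notebook, so the same pattern should carry over, though that has not been verified here.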

#### 3.1.3 congestion_surcharge     

`congestion_surcharge` has 644,667 missing values. This surcharge is applied based on specific criteria related to the time and location of the trip, and it can affect the total fare.

For simplicity, we assume that an empty field means these criteria were not met and fill the missing values with 0.

In [32]:
df_combined['congestion_surcharge'] = df_combined['congestion_surcharge'].fillna(0)


#### 3.1.4 Analysis of Missing Data Patterns

The fact that the missing values count is exactly the same for `passenger_count`, `RatecodeID`, and `store_and_fwd_flag` suggests a few potential scenarios:

- **Data Collection Issue**: 
  The missing values might indicate a systematic error in how the data was collected, processed, or extracted. Perhaps there was an issue with the data collection equipment or software in certain taxis that failed to record these specific details.

- **Record Integrity**: 
  These fields could be missing for entire trips, implying that some records are partially complete. This might happen if, for instance, a technical glitch occurred at the start or end of these trips, affecting multiple data fields simultaneously.

- **Data Entry Protocol**: 
  The taxi meters or systems that log this information might have a unified data entry procedure that skips multiple fields when one critical piece of information is unavailable. For example, if the meter does not start properly and doesn’t log `RatecodeID`, it might also skip logging `passenger_count` and `store_and_fwd_flag`.

#### Investigating Further:
To address and confirm the root cause, you might consider the following steps:

- **Cross-Validation with Other Fields**: 
  Check if these records with missing values in the three fields also show anomalies or patterns in other data points like `fare_amount`, `trip_distance`, or timestamps. This could help confirm if the trips are entirely corrupt or if only certain aspects are affected.

- **Temporal Analysis**: 
  Analyze the timestamps (`tpep_pickup_datetime` and `tpep_dropoff_datetime`) of these records to see if the missing data occurred during specific periods, which might indicate a temporary system issue.

- **Source Data Check**: 
  If possible, review the raw data or communicate with the data provider to understand potential reasons for these patterns. There might be logs or metadata that explain anomalies in data collection.

#### Handling Strategy:
Depending on the investigation outcomes, you might choose to:

- **Exclude the Affected Records**: 
  If the records are found to be unreliable or significantly incomplete, consider removing them from the analysis to maintain data integrity.

- **Impute Conservatively**: 
  If the missing data does not compromise the rest of the record's integrity, impute these missing fields based on typical assumptions (e.g., most common values) or predictive models if other related data fields are available and reliable.


In [29]:
# Check if the same rows are missing these fields
missing_data_rows = df_combined[
    df_combined['passenger_count'].isnull() & 
    df_combined['RatecodeID'].isnull() & 
    df_combined['store_and_fwd_flag'].isnull()
]

# Compute to bring the result to pandas DataFrame for easier manipulation and viewing
missing_data_rows = missing_data_rows.compute()

# Analyze if these rows have any temporal patterns or other commonalities
print("Timestamp Statistics:")
print(missing_data_rows['tpep_pickup_datetime'].describe())

print("\nFare and Distance Statistics:")
print(missing_data_rows[['fare_amount', 'trip_distance']].describe())

# Optionally, drop these rows if deemed unreliable
df_combined_cleaned = df_combined.dropna(subset=['passenger_count', 'RatecodeID', 'store_and_fwd_flag'])

# Compute the cleaned DataFrame if you need a pandas DataFrame, else you can keep it lazy in Dask
# df_combined_cleaned = df_combined_cleaned.compute()


Timestamp Statistics:
count                    1478695
unique                   1139451
top       09/15/2021 05:22:00 PM
freq                          25
Name: tpep_pickup_datetime, dtype: object

Fare and Distance Statistics:
       fare_amount  trip_distance
count 1,478,695.00   1,478,695.00
mean         25.50          83.59
std          16.13       3,186.98
min        -134.22           0.00
25%          13.20           2.01
50%          21.07           4.15
75%          33.96           8.83
max       3,554.70     351,613.36


#### Summary

The analysis of records with missing `passenger_count`, `RatecodeID`, and `store_and_fwd_flag` fields reveals several important characteristics and potential data quality issues:

#### Timestamp Statistics
- **Count of Missing Records**: 1,478,695
- **Unique Pickup Times**: 1,139,451
- **Most Frequent Pickup Time**: `09/15/2021 05:22:00 PM`, occurring 25 times.

This distribution indicates that missing data instances are not isolated to specific events but are spread across various times, suggesting systematic data collection issues.

#### Fare and Distance Statistics
- **Mean Fare**: $25.50, nearly double the overall mean fare of $13.52.
- **Standard Deviation of Fare**: $16.13, showing a wide spread of fare amounts.
- **Fare Range**: From -$134.22 (indicating refunds or errors) to $3,554.70 (suggesting outliers or very long trips).
- **Mean Trip Distance**: 83.59 miles, far above the median of 4.15 miles for these records, which shows that the mean is driven by a few extreme values.
- **Standard Deviation of Distance**: 3,186.98 miles, confirming significant variability and the presence of extreme outliers.
- **Distance Range**: From 0 miles (potential cancellations) to 351,613.36 miles (clear data entry errors).

#### Conclusions and Recommendations for Data Cleaning
The analysis suggests significant integrity issues with this subset of the data:
- **Extreme Values**: The records with missing values are associated with unusually high fares and distances, likely indicating outliers or erroneous entries.
- **Cleaning Actions**:
  - **Remove Extreme Outliers**: Apply thresholds to exclude implausibly high fares and distances.
  - **Correct Negative Values**: Investigate and likely remove records with negative fares or distances as they are not valid transactions.
  - **Conservative Imputation**: For records that appear otherwise normal, consider imputing missing values based on typical values or predictive models, ensuring other related data fields are reliable.
  - **Refine Temporal Analysis**: Further investigate if data issues are concentrated during specific times or conditions to better understand underlying causes.
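
For the temporal analysis recommended above, note that `tpep_pickup_datetime` is stored as a string here (dtype `object`), so `describe()` only reports counts and the most frequent value. The sketch below parses the strings and counts affected records per month. The sample rows are hypothetical, and the format string is an assumption based on the `top` value shown in the output above; values in any other format become `NaT` and should be counted separately.

```python
import pandas as pd

# Hypothetical stand-in for missing_data_rows
rows = pd.DataFrame({
    "tpep_pickup_datetime": [
        "09/15/2021 05:22:00 PM",
        "09/15/2021 05:22:00 PM",
        "01/03/2021 11:05:10 AM",
        "12/31/2021 11:59:59 PM",
    ]
})

# Parse the strings into datetimes; unparseable values become NaT instead of raising
pickup = pd.to_datetime(
    rows["tpep_pickup_datetime"],
    format="%m/%d/%Y %I:%M:%S %p",
    errors="coerce",
)

# Count affected records per calendar month to expose any temporal concentration
per_month = pickup.dt.to_period("M").value_counts().sort_index()

print("Unparseable timestamps:", int(pickup.isna().sum()))
print(per_month)
```

A roughly even spread across months would support the reading that the issue is systematic, while a spike in a few months would point to a temporary problem.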



#### 3.1.5 Analysis of Missing Data Patterns: Handling Missing Values

We believe `passenger_count` is a crucial feature for our analysis, so we keep it and replace its 0 values with a measure of central tendency, the median of that feature.

In [None]:
# Compute the median of passenger_count (Dask's quantile is approximate by default)
median_passenger_count = df_combined['passenger_count'].quantile(0.5).compute()

# Count the zero values before replacing them, so the confirmation below is accurate
zero_count = (df_combined['passenger_count'] == 0).sum().compute()

# Replace zero values with the median
df_combined['passenger_count'] = df_combined['passenger_count'].mask(df_combined['passenger_count'] == 0, median_passenger_count)

# Confirm the replacement
print("Number of zero values replaced with median:", zero_count)

In [None]:
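# Calculate the percentage of cases where trip_type is 1 for each RatecodeID.
# Note: 'trip_type' is not among the 18 columns listed earlier; this cell assumes it is available.
trip_type_percentage = (
    df_combined.assign(is_type_1=df_combined['trip_type'] == 1)
    .groupby('RatecodeID')['is_type_1']
    .mean()
    .compute() * 100  # call compute() once on the lazy result, not inside the group function
)

# Print the results
print(trip_type_percentage)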


This code will group the DataFrame by "RatecodeID" and calculate the percentage of cases where "trip_type" is 1 for each group. You can then examine the results to verify if the observation holds true. If the percentages are close to 99% for RatecodeID values 1, 2, and 3, then the observation is confirmed. If not, further investigation may be needed.

`store_and_fwd_flag`: this is not an important feature for our analysis, so we can drop this column. We are not interested in how the trip record was stored and forwarded.


### Removing extreme outliers

For several features, we can determine in advance which values are acceptable for our research purposes. The table below lists each column with its missing-value count and the accepted values.

| Column | Missing values | Accepted values for this study |
|---|---|---|
| `VendorID` | 834028 | Not specified |
| `tpep_pickup_datetime` | 0 | Within 2021 |
| `tpep_dropoff_datetime` | 0 | Within 2021 |
| `passenger_count` | 1478695 | From 1 to 6. Regulations allow no more than 4 people in a taxi, but to include all vehicle types and trips with children we accept up to 6. |
| `trip_distance` | 0 | Because we want to create a model for taxis within the city, trips are bound to the city, with a trip to the airport as the maximum. |
| `RatecodeID` | 1478695 | From 1 to 6. Note that 4 is Nassau County and 3 is Newark, both outside New York City. |
| `store_and_fwd_flag` | 1478695 | Indicates whether there were technical problems during data collection, which is irrelevant to our study. |
| `PULocationID` | 0 | To be bound by the NYC taxi zones provided in another CSV file, which we will use to filter out locations outside NYC. For now, between 1 and 262. |
| `DOLocationID` | 0 | Same as `PULocationID`: for now, between 1 and 262. |
| `payment_type` | 834028 | From 0 to 6 |
| `fare_amount` | 0 | More than 0 |
| `extra` | 0 | Not specified |
| `mta_tax` | 0 | Not specified |
| `tip_amount` | 0 | Not specified |
| `tolls_amount` | 0 | Not specified |
| `improvement_surcharge` | 0 | Not specified |
| `total_amount` | 0 | Not negative |
| `congestion_surcharge` | 644667 | Not specified |
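
The numeric bounds stated above can be applied as a single boolean filter. The following is a minimal sketch on a small hypothetical pandas sample, not the notebook's final cleaning step. The `bounds` dictionary and the variable names are introduced here for illustration, and the datetime and trip distance limits are left out because their exact thresholds are not yet defined.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the dataset
trips = pd.DataFrame({
    "passenger_count": [1, 0, 7, 2],
    "RatecodeID": [1, 1, 2, 6],
    "PULocationID": [142, 238, 132, 300],
    "DOLocationID": [43, 151, 165, 10],
    "fare_amount": [8.0, 3.0, 42.0, -5.0],
    "total_amount": [11.8, 4.3, 51.95, -5.8],
})

# Inclusive (lower, upper) bounds taken from the text above; None means unbounded
bounds = {
    "passenger_count": (1, 6),
    "RatecodeID": (1, 6),
    "PULocationID": (1, 262),
    "DOLocationID": (1, 262),
    "total_amount": (0, None),
}

# Start by keeping every row, then narrow the mask one column at a time.
# Missing values fail every comparison, so rows with NaN in a bounded column are dropped.
keep = pd.Series(True, index=trips.index)
for column, (lower, upper) in bounds.items():
    if lower is not None:
        keep &= trips[column] >= lower
    if upper is not None:
        keep &= trips[column] <= upper

# fare_amount must be strictly greater than 0
keep &= trips["fare_amount"] > 0

filtered = trips[keep]
print(filtered)
```

Keeping the limits in one dictionary makes it easy to adjust a threshold later, for example once the taxi zone CSV replaces the provisional 1 to 262 range.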