# Feature Engineering

With the cleaned data loaded, the analysis can move on to feature engineering and modeling. The planned steps are:

1. **Feature Engineering**: Introduce new features that could be significant for the model:
    - **Time of Day**: Create a new feature based on the `tpep_pickup_datetime` and `tpep_dropoff_datetime` that categorizes each trip into time slots such as morning, afternoon, evening, and night.
    - **Seasons**: Add a feature for the season of the year the trip took place in.
    - **Passenger Count Categories**: Classify the `passenger_count` into categories like low (1), medium (2-4), and high (5-6).
    - **Weekday/Weekend**: Determine whether the trip took place on a weekday or weekend.

2. **Standardization and Normalization**: If your model is sensitive to feature scale, either standardize the numeric features (mean 0, standard deviation 1) or min-max normalize them to the range [0, 1].

3. **Correlation Check**: Use Pearson or Spearman correlation to check how strongly each feature is related to the target variable (`fare_amount`). Also check correlations among the features themselves: highly correlated pairs indicate multicollinearity, and one feature of each pair is a candidate for removal.

4. **Segmentation**: Consider creating segments within your data based on geography (zones), time (rush hour, non-rush hours), and trip types (airport, non-airport) for more targeted analysis.

5. **Handling Missing Values and Duplicates**: Ensure that all missing values are handled appropriately (whether filled or dropped), and remove any duplicates to clean the dataset.

6. **Visualization**: Use visual tools like box plots and histograms to understand distributions and outliers after feature engineering.

7. **Hypothesis Testing**: If there are specific assumptions or hypotheses you want to test (like the impact of weather on fares), this would be the time to conduct such tests using statistical methods.

8. **Model Training and Validation**: Start with simpler models and gradually move to more complex ones. Use cross-validation to assess model performance and avoid overfitting.

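The cross-validation mentioned in step 8 can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the model used in this analysis: the data is synthetic, the baseline is an ordinary least-squares fit of fare on trip distance, and `kfold_rmse` is a hypothetical helper name.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Return the per-fold RMSE of a least-squares fit (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column, then fit on the training folds only.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[test] - A_test @ coef
        scores.append(float(np.sqrt(np.mean(resid ** 2))))
    return scores

# Synthetic stand-in for the taxi data (made-up coefficients and noise).
rng = np.random.default_rng(42)
distance = rng.uniform(1, 30, size=1_000)
fare = 3.0 + 2.5 * distance + rng.normal(0, 2.0, size=1_000)

scores = kfold_rmse(distance.reshape(-1, 1), fare, k=5)
print([round(s, 2) for s in scores])
```

Fold scores that are close to each other suggest a stable model; a large spread between folds, or a much lower training error than validation error, points to overfitting.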

In [2]:
!pip install pyarrow
!pip install fastparquet




In [1]:
import pandas as pd
import glob
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import seaborn as sns
pd.set_option('display.float_format', lambda x: '%.2f' % x)
from pandas.tseries.holiday import USFederalHolidayCalendar as calander

In [3]:


# Load the cleaned taxi data; replace the path below with the location of your own Parquet file
df = pd.read_parquet('/Users/md/Desktop/python_project/parquet_files/cleaned/cleaned_taxi_data.parquet', engine='pyarrow')  # or engine='fastparquet' if you prefer

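Step 5 of the plan (missing values and duplicates) can be verified right after loading. The sketch below runs on a tiny made-up frame rather than on `df`, and the cleaning policy it shows (median fill for `passenger_count`, dropping rows without a distance) is only one possible choice, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the taxi data (values are made up).
trips = pd.DataFrame({
    "passenger_count": [1, 2, np.nan, 2, 1],
    "trip_distance": [1.2, 3.4, 2.0, 3.4, np.nan],
    "fare_amount": [8.0, 15.5, 11.0, 15.5, 7.0],
})

# Count missing values per column and fully duplicated rows before changing anything.
missing_per_column = trips.isna().sum()
duplicate_rows = int(trips.duplicated().sum())
print(missing_per_column)
print("duplicate rows:", duplicate_rows)

# One possible policy: drop duplicates, fill passenger_count with its median,
# and drop rows that still lack a trip distance.
cleaned = trips.drop_duplicates().copy()
cleaned["passenger_count"] = cleaned["passenger_count"].fillna(
    cleaned["passenger_count"].median()
)
cleaned = cleaned.dropna(subset=["trip_distance"]).reset_index(drop=True)
print(cleaned)
```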

In [5]:
df.describe()

statistic,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,General_Airport_Fee,JFK_LGA_Pickup_Fee
count,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0,32449058.0
mean,1.75,1.39,3.57,1.58,165.37,164.2,1.2,19.73,1.61,0.5,3.59,0.6,1.0,28.91,2.5,0.14,0.01
std,0.43,0.88,4.46,7.13,63.57,69.75,0.46,17.92,1.83,0.0,4.04,2.18,0.0,22.74,0.0,0.46,0.11
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.5,0.0,0.0,1.0,0.01,2.5,0.0,0.0
25%,2.0,1.0,1.09,1.0,132.0,114.0,1.0,9.3,0.0,0.5,1.0,0.0,1.0,15.96,2.5,0.0,0.0
50%,2.0,1.0,1.8,1.0,162.0,162.0,1.0,13.5,1.0,0.5,2.88,0.0,1.0,21.0,2.5,0.0,0.0
75%,2.0,1.0,3.42,1.0,234.0,234.0,1.0,21.9,2.5,0.5,4.48,0.0,1.0,30.72,2.5,0.0,0.0
max,2.0,6.0,30.0,99.0,265.0,265.0,4.0,300.0,50.0,0.6,500.0,50.0,1.0,500.0,2.75,1.75,1.25

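The summary statistics show columns on very different scales, which is what step 2 of the plan addresses. A minimal sketch of both scaling options follows, using plain pandas arithmetic on a small made-up sample; the `sample` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Small made-up sample with two numeric columns on different scales.
sample = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 3.0, 4.0, 10.0],
    "fare_amount": [5.0, 9.0, 13.0, 17.0, 40.0],
})

# Standardization (z-score): each column gets mean 0 and standard deviation 1.
# ddof=0 selects the population standard deviation.
standardized = (sample - sample.mean()) / sample.std(ddof=0)

# Min-max normalization: each column is rescaled to the range [0, 1].
normalized = (sample - sample.min()) / (sample.max() - sample.min())

print(standardized.round(2))
print(normalized.round(2))
```

Columns with zero spread, such as `improvement_surcharge` in the summary (minimum and maximum are both 1.0), would cause a division by zero under either formula and are better dropped before scaling. When a model is trained, the scaling statistics should come from the training split only.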

In [None]:
# Convert the pickup and dropoff datetime to pandas datetime format if not already
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

# Time of day segmentation (vectorized; boundaries are inclusive hours)
pickup_hour = df['tpep_pickup_datetime'].dt.hour
df['pickup_time_of_day'] = np.select(
    [pickup_hour.between(5, 11), pickup_hour.between(12, 17), pickup_hour.between(18, 23)],
    ['morning', 'afternoon', 'evening'],
    default='night',  # hours 0-4
)

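# Seasons segmentation (vectorized; meteorological seasons by month)
pickup_month = df['tpep_pickup_datetime'].dt.month
df['pickup_season'] = np.select(
    [pickup_month.between(3, 5), pickup_month.between(6, 8), pickup_month.between(9, 11)],
    ['spring', 'summer', 'autumn'],
    default='winter',  # December, January, February
)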

# Passenger count categories: [0, 1] -> low, (1, 4] -> medium, (4, 6] -> high
df['passenger_count_category'] = pd.cut(df['passenger_count'], bins=[0, 1, 4, 6], include_lowest=True,
                                        labels=['low', 'medium', 'high'])

# Weekday/Weekend segmentation (dayofweek: Monday=0 ... Saturday=5, Sunday=6)
df['pickup_day_type'] = np.where(df['tpep_pickup_datetime'].dt.dayofweek >= 5, 'weekend', 'weekday')

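# Flag pickups that fall on a US federal holiday
# ('pickup_is_holiday' is an illustrative column name, not part of the original dataset)
cal = calander()
holidays = cal.holidays(start=df['tpep_pickup_datetime'].min().normalize(),
                        end=df['tpep_pickup_datetime'].max())
df['pickup_is_holiday'] = df['tpep_pickup_datetime'].dt.normalize().isin(holidays)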

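The segmentation in step 4 of the plan can build on the same datetime accessors. The sketch below uses a four-row made-up frame; the rush-hour windows and the rule that a positive `General_Airport_Fee` marks an airport trip are assumptions for illustration, and the `rush_hour` and `trip_type` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Four made-up trips covering the cases of interest.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-03-06 08:15",  # Monday morning
        "2023-03-06 13:00",  # Monday midday
        "2023-03-11 08:15",  # Saturday morning
        "2023-03-08 17:30",  # Wednesday late afternoon
    ]),
    "General_Airport_Fee": [0.0, 1.75, 0.0, 0.0],
    "fare_amount": [12.0, 52.0, 9.5, 18.0],
})

pickup = trips["tpep_pickup_datetime"]
hour = pickup.dt.hour
is_weekday = pickup.dt.dayofweek < 5

# Assumed rush-hour windows: weekdays 07:00-09:59 and 16:00-18:59.
is_rush = is_weekday & (hour.between(7, 9) | hour.between(16, 18))
trips["rush_hour"] = np.where(is_rush, "rush", "non-rush")

# Assumption: a positive airport fee marks an airport trip.
trips["trip_type"] = np.where(trips["General_Airport_Fee"] > 0, "airport", "non-airport")

# Trip count and mean fare per segment.
segment_summary = trips.groupby(["rush_hour", "trip_type"])["fare_amount"].agg(["count", "mean"])
print(segment_summary)
```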

In [None]:
# Check if 'pickup_season' is correctly categorized
season_counts = df['pickup_season'].value_counts()
print(season_counts)

# Check if 'pickup_time_of_day' is correctly categorized
time_of_day_counts = df['pickup_time_of_day'].value_counts()
print(time_of_day_counts)

# Check if 'pickup_day_type' is correctly categorized for weekdays and weekends
day_type_counts = df['pickup_day_type'].value_counts()
print(day_type_counts)

# Count trips for every combination of day type, time of day, and season
summary = df.groupby(['pickup_day_type', 'pickup_time_of_day', 'pickup_season'])['fare_amount'].count()
print(summary)

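Once the engineered features look sensible, the correlation check from step 3 is a natural next cell. This is a rough sketch on synthetic data, assuming a fare that depends on distance but not on passenger count; with the real `df`, the same `corr` calls would be applied to its numeric columns.

```python
import numpy as np
import pandas as pd

# Synthetic sample: fare depends on distance, passenger count is unrelated noise.
rng = np.random.default_rng(0)
n = 500
distance = rng.uniform(1, 30, size=n)
sample = pd.DataFrame({
    "trip_distance": distance,
    "passenger_count": rng.integers(1, 7, size=n),
    "fare_amount": 3.0 + 2.5 * distance + rng.normal(0, 2.0, size=n),
})

# Pearson measures linear association; Spearman measures monotonic (rank) association.
pearson = sample.corr(method="pearson")
spearman = sample.corr(method="spearman")

# Correlation of every column with the target, strongest first.
print(pearson["fare_amount"].sort_values(ascending=False))
print(spearman["fare_amount"].sort_values(ascending=False))
```

The off-diagonal entries between features (here `trip_distance` versus `passenger_count`) are the ones to inspect for multicollinearity.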