# Feature Engineering 

After exploring the data, we proceed with feature engineering.  
We create new features that are more meaningful and more directly related to the target (`trip_duration`, in seconds), such as:

- **Distances**: `haversine_km`, `manhattan_km`
- **Directions & Speed**: `bearing_deg`, `speed_kmh`
- **Datetime-based features**: `hour`, `is_weekend`, `is_rush_hour`, `weekday_*`, `month_*`

All these functions are organized in the `features` module.  
After applying them, we save the processed dataset for modeling.
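
As a rough illustration of the datetime-based features, the sketch below derives them with pandas. It is not the project's `add_time_features`: the helper name `sketch_time_features` and the rush-hour windows (07:00-09:59 and 16:00-19:59) are assumptions made only for this example.

```python
import pandas as pd

def sketch_time_features(df, datetime_col="pickup_datetime"):
    """Hypothetical helper: derive calendar features from a datetime column."""
    out = df.copy()
    ts = pd.to_datetime(out[datetime_col])
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.weekday          # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month
    out["is_weekend"] = (out["weekday"] >= 5).astype(int)
    # Assumed rush-hour windows; the project's real definition may differ.
    rush = out["hour"].between(7, 9) | out["hour"].between(16, 19)
    out["is_rush_hour"] = rush.astype(int)
    return out

demo = pd.DataFrame({"pickup_datetime": ["2016-06-08 07:36:19", "2016-06-05 02:49:13"]})
demo = sketch_time_features(demo)
print(demo[["hour", "weekday", "month", "is_weekend", "is_rush_hour"]])
```

The two timestamps are the first and third pickup times shown in the raw data preview: a Wednesday morning and a Sunday night.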


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Importing Libraries, Loading the Dataset, and Taking a First Look


In [250]:
import sys
import os

# Allow Python to access the src folder
sys.path.append(os.path.abspath(os.path.join("..", "src")))

# Now you can import the modules
from features.distances import add_distance_features
from features.timeparts import add_time_features
from features.encoding import add_one_hot_encoding
from data.Data_helper import save_processed_data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



In [251]:
df = pd.read_csv(r"G:\ML mostafa saad\slides\my work\13 Project 1 - Regression - Trip Duration Prediction\data\raw\train.csv")

print(df.shape)
print(df.head())
print(df.info())

(1000000, 10)
          id  vendor_id      pickup_datetime  passenger_count  \
0  id2793718          2  2016-06-08 07:36:19                1   
1  id3485529          2  2016-04-03 12:58:11                1   
2  id1816614          2  2016-06-05 02:49:13                5   
3  id1050851          2  2016-05-05 17:18:27                2   
4  id0140657          1  2016-05-12 17:43:38                4   

   pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  \
0        -73.985611        40.735943         -73.980331         40.760468   
1        -73.978394        40.764351         -73.991623         40.749859   
2        -73.989059        40.744389         -73.973381         40.748692   
3        -73.990326        40.731136         -73.991264         40.748917   
4        -73.789497        40.646675         -73.987137         40.759232   

  store_and_fwd_flag  trip_duration  
0                  N           1040  
1                  N            827  
2                 

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Why We Need Feature Engineering  

From the initial EDA, we noticed that the raw dataset is not expressive enough to directly capture the relationship with our target variable (trip duration).  
Therefore, we need to create new **domain-related features** that are more meaningful.  

Examples:  
- **Distance Features**: Haversine distance, Manhattan distance  
- **Direction & Speed Features**: Bearing, average speed (km/h)  
- **Datetime Features**: Extracting hour, day of week, month, weekend/weekday, rush hours  

All these feature engineering steps will be implemented using our organized functions inside the `src\features` module.
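
To make the distance and direction features concrete, here is a minimal sketch of how they are commonly computed from latitude/longitude pairs. It is an assumption about the general technique, not a copy of `add_distance_features`: the Earth radius constant and the "Manhattan" definition (a north-south leg plus an east-west leg, each measured with the haversine formula) are choices made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def manhattan_km(lat1, lon1, lat2, lon2):
    """Assumed definition: north-south leg plus east-west leg."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat1, lon1, lat1, lon2)
    return north_south + east_west

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing in degrees, in the range [-180, 180]."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    y = np.sin(dlon) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)
    return np.degrees(np.arctan2(y, x))

# Pickup and dropoff of the first validation row shown later in this notebook.
d = haversine_km(40.713444, -74.013916, 40.752510, -73.993858)
print(round(float(d), 2))
```

All three functions accept NumPy arrays or pandas columns as well as scalars, so they can be applied to a whole DataFrame without a Python loop.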


In [252]:
# ==== Feature generation ====

# Parse pickup_datetime as datetimes so the time-based features can be extracted
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])

# 1. Distance-based features
df = add_distance_features(df)

# 2. Time-based features (using pickup_datetime)
df = add_time_features(df, datetime_col="pickup_datetime")

print("\n Columns after add_distance_features:")
print(df.columns.tolist())

# 3. Quick check
print("\n✅ Features added successfully!")
print(df.info())
print(df.head(10)[[
    "trip_duration",
    "haversine_km", "manhattan_km", "bearing_deg",
    "hour", "weekday", "month", "is_weekend", "is_rush_hour"
]])
print(df[["trip_duration", "haversine_km", "manhattan_km", "speed_kmh", "bearing_deg"]].head())

✅ add_distance_features: 1000000/1000000 rows got valid speed_kmh

 Columns after add_distance_features:
['id', 'vendor_id', 'pickup_datetime', 'passenger_count', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag', 'trip_duration', 'haversine_km', 'manhattan_km', 'bearing_deg', 'speed_kmh', 'hour', 'weekday', 'month', 'is_weekend', 'is_rush_hour']

✅ Features added successfully!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 19 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   id                  1000000 non-null  object        
 1   vendor_id           1000000 non-null  int64         
 2   pickup_datetime     1000000 non-null  datetime64[ns]
 3   passenger_count     1000000 non-null  int64         
 4   pickup_longitude    1000000 non-null  float64       
 5   pickup_latitude     1000000 non-null  float64

## Insights from the New Features

After adding the new features and inspecting a few of them, we can see that they behave as expected:

- Larger distances lead to longer trip durations.
- For two trips of equal distance, the one with the higher average speed has the shorter duration.
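
The relationship between speed and duration follows from the definition of average speed: distance divided by elapsed time. A minimal sketch, assuming `speed_kmh` is derived from a distance in kilometres and `trip_duration` in seconds (the project's `add_distance_features` may compute it differently; `average_speed_kmh` is a hypothetical name):

```python
import numpy as np

def average_speed_kmh(distance_km, duration_s):
    """Average speed in km/h; NaN where the duration is not positive."""
    distance_km = np.asarray(distance_km, dtype=float)
    duration_s = np.asarray(duration_s, dtype=float)
    hours = duration_s / 3600.0
    return np.divide(
        distance_km,
        hours,
        out=np.full_like(distance_km, np.nan),
        where=hours > 0,
    )

# Same 3 km distance, 10 minutes versus 20 minutes.
speeds = average_speed_kmh([3.0, 3.0], [600, 1200])
print(speeds)
```

If that assumption holds, `speed_kmh` depends on the target itself, so it would not be available for a trip whose duration is still unknown.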


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### Target Variable Distribution and Log Transformation

The target (`trip_duration`) is **right-skewed**, meaning its distribution is far from normal.  
To handle this, we apply a **log transformation** (`np.log1p`) to bring the distribution closer to normal.

Before that, we drop two columns the model cannot use because they are non-numeric: `id`, which we assume is not needed for modeling, and `pickup_datetime`, from which all the features we need have already been extracted.
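
A small sketch of the transformation and its inverse may help here. `np.log1p(x)` computes `log(1 + x)`, and `np.expm1` undoes it, which matters later because predictions made on the log scale must be converted back before they can be read as seconds. The sample durations below are illustrative values, not a subset chosen by the notebook.

```python
import numpy as np

durations_s = np.array([1.0, 827.0, 1040.0])  # illustrative durations in seconds

log_target = np.log1p(durations_s)   # log(1 + x), well defined even at x = 0
recovered = np.expm1(log_target)     # inverse transform: exp(y) - 1

print(log_target)
print(recovered)
```

Using `log1p` rather than a plain `log` keeps the transform finite for a duration of zero.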

In [253]:


df = df.drop(columns=["id"])  # Assuming 'id' is not needed for modeling
df = df.drop(columns=["pickup_datetime"])

print("After dropping:", df.shape)


After dropping: (1000000, 17)


In [254]:
df["log_trip_duration"] = np.log1p(df["trip_duration"])
print (df.head(10))

   vendor_id  passenger_count  pickup_longitude  pickup_latitude  \
0          2                1        -73.985611        40.735943   
1          2                1        -73.978394        40.764351   
2          2                5        -73.989059        40.744389   
3          2                2        -73.990326        40.731136   
4          1                4        -73.789497        40.646675   
5          2                3        -73.969833        40.768570   
6          2                1        -73.988419        40.760006   
7          1                1        -73.987854        40.749695   
8          2                1        -73.955017        40.764462   
9          1                1        -73.971535        40.794827   

   dropoff_longitude  dropoff_latitude store_and_fwd_flag  trip_duration  \
0         -73.980331         40.760468                  N           1040   
1         -73.991623         40.749859                  N            827   
2         -73.973381   

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### One-Hot Encoding for Categorical Variables  

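In our dataset, some columns are categorical: they are either **not numeric** or numeric codes with **no true order**:  

- `vendor_id`  
- `store_and_fwd_flag`  
- `weekday`  
- `month`  

`hour` is kept as a single numeric column; the column list printed after encoding confirms this.  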

Machine learning models cannot directly handle text values or numbers that do not represent a true order.  
If we keep them as plain numbers (1, 2, 3 …), the model may incorrectly assume there is an **order or distance** between them.  

#### Why One-Hot Encoding?  
- Converts each category into a **new binary column (0 or 1)**.  
- Prevents the model from assuming a false relationship between categories.  
- Makes categorical data more interpretable for the model.  

#### Example:  
For `vendor_id` (values: 1 or 2), after encoding we get two new columns:  
- `vendor_id_1`: 1 if `vendor_id` = 1, otherwise 0  
- `vendor_id_2`: 1 if `vendor_id` = 2, otherwise 0  

This way, the categorical features are represented in a clear and model-friendly format.
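
For illustration, the same idea can be reproduced on a toy frame with `pd.get_dummies`. This is a sketch of the general technique, not the body of `add_one_hot_encoding`, whose implementation is not shown here.

```python
import pandas as pd

demo = pd.DataFrame({
    "vendor_id": [1, 2, 2],
    "store_and_fwd_flag": ["N", "N", "Y"],
})

# One binary column per category; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(demo, columns=["vendor_id", "store_and_fwd_flag"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Passing `drop_first=True` would keep only one column per binary feature (for example `vendor_id_2`), which removes the redundancy between `vendor_id_1` and `vendor_id_2` and can matter for linear models.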


In [255]:
df = add_one_hot_encoding(df)
print(df.columns.tolist())

print(df.head(10))

['passenger_count', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'trip_duration', 'haversine_km', 'manhattan_km', 'bearing_deg', 'speed_kmh', 'hour', 'is_weekend', 'is_rush_hour', 'log_trip_duration', 'vendor_id_1', 'vendor_id_2', 'store_and_fwd_flag_N', 'store_and_fwd_flag_Y', 'weekday_0', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'month_1', 'month_2', 'month_3', 'month_4', 'month_5', 'month_6']
   passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  \
0                1        -73.985611        40.735943         -73.980331   
1                1        -73.978394        40.764351         -73.991623   
2                5        -73.989059        40.744389         -73.973381   
3                2        -73.990326        40.731136         -73.991264   
4                4        -73.789497        40.646675         -73.987137   
5                3        -73.969833        40.768570         -73.962646 

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Save Processed Data
We save the processed dataset so that it can be used next for further exploration of correlations and insights.

In [256]:
save_processed_data(df, r"G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/train_processed.csv")


 Processed data saved to G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/train_processed.csv


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Reload the saved file as a quick sanity check.

In [257]:
path = r"G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/train_processed.csv"
df_processed = pd.read_csv(path)
df_processed.head()
print(df_processed.shape)
print(df_processed.info())
df_processed.describe()

(1000000, 31)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 31 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   passenger_count       1000000 non-null  int64  
 1   pickup_longitude      1000000 non-null  float64
 2   pickup_latitude       1000000 non-null  float64
 3   dropoff_longitude     1000000 non-null  float64
 4   dropoff_latitude      1000000 non-null  float64
 5   trip_duration         1000000 non-null  int64  
 6   haversine_km          1000000 non-null  float64
 7   manhattan_km          1000000 non-null  float64
 8   bearing_deg           1000000 non-null  float64
 9   speed_kmh             1000000 non-null  float64
 10  hour                  1000000 non-null  int64  
 11  is_weekend            1000000 non-null  int64  
 12  is_rush_hour          1000000 non-null  int64  
 13  log_trip_duration     1000000 non-null  float64
 14  vendor_id_1          

| | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_duration | haversine_km | manhattan_km | bearing_deg | speed_kmh | hour | is_weekend | is_rush_hour | log_trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 |
| mean | 1.665353 | -73.973475 | 40.750947 | -73.973421 | 40.751829 | 954.885 | 3.439099 | 4.444066 | -15.768007 | 14.428394 | 13.604165 | 0.285775 | 0.428937 | 6.466492 |
| std | 1.315723 | 0.065404 | 0.033745 | 0.065432 | 0.035782 | 3882.07 | 4.424258 | 5.806581 | 104.460439 | 16.575388 | 6.400685 | 0.451783 | 0.494925 | 0.794744 |
| min | 0.0 | -121.933342 | 34.359695 | -121.933304 | 34.359695 | 1.0 | 0.0 | 0.0 | -179.992701 | 0.0 | 0.0 | 0.0 | 0.0 | 0.693147 |
| 25% | 1.0 | -73.991852 | 40.737372 | -73.991341 | 40.735928 | 397.0 | 1.232068 | 1.570859 | -125.293788 | 9.118796 | 9.0 | 0.0 | 0.0 | 5.986452 |
| 50% | 1.0 | -73.981728 | 40.754131 | -73.979767 | 40.754551 | 662.0 | 2.091231 | 2.68598 | 8.043926 | 12.787544 | 14.0 | 0.0 | 0.0 | 6.496775 |
| 75% | 2.0 | -73.967346 | 40.768379 | -73.963036 | 40.769833 | 1074.0 | 3.871264 | 4.993929 | 53.263991 | 17.842133 | 19.0 | 1.0 | 1.0 | 6.980076 |
| max | 7.0 | -61.335529 | 51.881084 | -61.335529 | 43.921028 | 2227612.0 | 1240.908677 | 1336.846365 | 180.0 | 9274.836731 | 23.0 | 1.0 | 1.0 | 14.616441 |


The summary shows some unrealistic values, such as a trip duration of 2,227,612 seconds (about 26 days), a distance of about 1,241 km, and a speed of about 9,275 km/h. These are likely outliers, and we need to detect and handle them.

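To see what an IQR-based cut-off means in practice, the sketch below works through the arithmetic for `trip_duration`, using the quartiles from the summary above (Q1 = 397, Q3 = 1074) and the factor of 6 that the next cell uses. The helper name `iqr_bounds` is hypothetical.

```python
def iqr_bounds(q1, q3, factor):
    """Return (lower, upper) cut-offs at `factor` IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

lower, upper = iqr_bounds(397.0, 1074.0, factor=6)
print(lower, upper)  # -3665.0 5136.0
```

With these numbers the IQR is 677 seconds, so the upper cut-off is 5136 seconds (about 86 minutes). The lower cut-off is negative, so for a non-negative column such as `trip_duration` only the upper bound removes any rows.
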
In [258]:
columns = ['trip_duration', 'haversine_km', 'manhattan_km', 'speed_kmh']

factor = 6  # IQR multiplier; a smaller value (e.g. 2.5) makes the filter stricter

for col in columns:
    Q1 = df_processed[col].quantile(0.25)
    Q3 = df_processed[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    outliers = df_processed[(df_processed[col] < lower) | (df_processed[col] > upper)]
    print(f"{col}: {len(outliers)} outliers ({len(outliers)/len(df_processed)*100:.2f}%)")


trip_duration: 2368 outliers (0.24%)
haversine_km: 18286 outliers (1.83%)
manhattan_km: 20928 outliers (2.09%)
speed_kmh: 227 outliers (0.02%)


## Removing the Outliers  
The percentages above show that each column flags only about 2% of the rows or fewer. That is not high, so it’s better to remove them.


In [259]:
columns = ['trip_duration', 'haversine_km', 'manhattan_km', 'speed_kmh']

factor = 6  # IQR multiplier; a smaller value (e.g. 2.5) makes the filter stricter

df_clean = df_processed.copy()
original_rows = len(df_clean)

for col in columns:
    Q1 = df_clean[col].quantile(0.25)
    Q3 = df_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    df_clean = df_clean[(df_clean[col] >= lower) & (df_clean[col] <= upper)]

cleaned_rows = len(df_clean)
deleted_rows = original_rows - cleaned_rows
deleted_percent = deleted_rows / original_rows * 100



print(f"Original rows: {original_rows}")
print(f"Cleaned rows: {cleaned_rows}")
print(f"Deleted rows: {deleted_rows} ({deleted_percent:.2f}%)")

Original rows: 1000000
Cleaned rows: 974935
Deleted rows: 25065 (2.51%)


In [260]:
# Train
train_min = df_clean['log_trip_duration'].min()
train_max = df_clean['log_trip_duration'].max()
print(f"Train log_trip_duration: min={train_min}, max={train_max}")

Train log_trip_duration: min=0.6931471805599453, max=8.543640361431086


We should also convert any boolean columns to integers (0/1) to make them easier for the model to use.

In [261]:
bool_cols = df_clean.select_dtypes(include='bool').columns

df_clean[bool_cols] = df_clean[bool_cols].astype(int)

In [262]:
df_clean.head()
print(df_clean.shape)
print(df_clean.info())
df_clean.describe()

(974935, 31)
<class 'pandas.core.frame.DataFrame'>
Index: 974935 entries, 0 to 999999
Data columns (total 31 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   passenger_count       974935 non-null  int64  
 1   pickup_longitude      974935 non-null  float64
 2   pickup_latitude       974935 non-null  float64
 3   dropoff_longitude     974935 non-null  float64
 4   dropoff_latitude      974935 non-null  float64
 5   trip_duration         974935 non-null  int64  
 6   haversine_km          974935 non-null  float64
 7   manhattan_km          974935 non-null  float64
 8   bearing_deg           974935 non-null  float64
 9   speed_kmh             974935 non-null  float64
 10  hour                  974935 non-null  int64  
 11  is_weekend            974935 non-null  int64  
 12  is_rush_hour          974935 non-null  int64  
 13  log_trip_duration     974935 non-null  float64
 14  vendor_id_1           974935 non-null  int64

| | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_duration | haversine_km | manhattan_km | bearing_deg | speed_kmh | ... | weekday_3 | weekday_4 | weekday_5 | weekday_6 | month_1 | month_2 | month_3 | month_4 | month_5 | month_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | ... | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 | 974935.0 |
| mean | 1.662727 | -73.976129 | 40.752403 | -73.974813 | 40.752543 | 787.703397 | 3.012779 | 3.854216 | -16.286606 | 14.011199 | ... | 0.149978 | 0.15291 | 0.152057 | 0.133705 | 0.157665 | 0.164202 | 0.175844 | 0.172139 | 0.170021 | 0.160131 |
| std | 1.314605 | 0.061447 | 0.028206 | 0.061185 | 0.032798 | 560.289104 | 2.84974 | 3.630651 | 104.773065 | 7.258885 | ... | 0.35705 | 0.359901 | 0.359077 | 0.340336 | 0.364427 | 0.370459 | 0.380687 | 0.377501 | 0.375651 | 0.366727 |
| min | 0.0 | -121.933342 | 34.359695 | -121.933304 | 34.359695 | 1.0 | 0.0 | 0.0 | -179.992701 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | -73.99202 | 40.738281 | -73.991386 | 40.736652 | 391.0 | 1.215727 | 1.549502 | -127.446289 | 9.051366 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | -73.981964 | 40.754639 | -73.979919 | 40.754791 | 647.0 | 2.040311 | 2.621193 | 8.893652 | 12.6247 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 2.0 | -73.96814 | 40.768635 | -73.963577 | 40.769997 | 1032.0 | 3.662812 | 4.724664 | 52.573708 | 17.430092 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 7.0 | -61.335529 | 43.911762 | -61.335529 | 43.911762 | 5133.0 | 19.642441 | 24.110947 | 180.0 | 67.57329 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [263]:
save_processed_data(df_clean, r"G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/train_processed_filtered.csv")


 Processed data saved to G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/train_processed_filtered.csv


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Preparing Validation and Test Data

We’ve finished cleaning and feature engineering for the training data.  
Now we need to apply the same transformations to the validation and test sets so that all three sets share the same columns.  

Steps to apply:  
- Add the same **engineered features** (distance and time) created for the training set.  
- Apply the same **encoding** for categorical variables.  
- Filter outliers with the same IQR rule and add the same **log-transformed target**.   

**Reason:**  
This ensures the model is evaluated on unseen data while keeping consistency between training, validation, and test sets.
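
One thing to watch when each split is encoded separately: a category that is absent from a split produces no dummy column, so the column sets can diverge. Whether `add_one_hot_encoding` guards against this is not shown here, so the following is only a sketch of one common safeguard, reindexing to the training columns, on toy data.

```python
import pandas as pd

train = pd.DataFrame({"month": [1, 2, 6], "trip_duration": [300, 450, 900]})
val = pd.DataFrame({"month": [2, 2], "trip_duration": [500, 700]})  # months 1 and 6 missing

train_enc = pd.get_dummies(train, columns=["month"], dtype=int)
val_enc = pd.get_dummies(val, columns=["month"], dtype=int)

# Reindex to the training layout: dummies missing from val are filled with 0,
# and any column not seen in training would be dropped.
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)

print(val_enc.columns.tolist())
```

After the reindex, both frames have identical columns in identical order, which is what a fitted model expects at prediction time.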


In [264]:
df_val = pd.read_csv(r"G:\ML mostafa saad\slides\my work\13 Project 1 - Regression - Trip Duration Prediction\data\raw\val.csv")
df_test = pd.read_csv(r"G:\ML mostafa saad\slides\my work\13 Project 1 - Regression - Trip Duration Prediction\data\raw\test.csv")



df_val = df_val.drop(columns=["id"])  # Assuming 'id' is not needed for modeling
df_test = df_test.drop(columns=["id"])  # Assuming 'id' is not needed for modeling



print(df_val.shape)
print(df_val.head())
print(df_val.info())

print(df_test.shape)
print(df_test.head())
print(df_test.info())

(229319, 9)
   vendor_id      pickup_datetime  passenger_count  pickup_longitude  \
0          1  2016-01-10 16:01:46                1        -74.013916   
1          1  2016-06-23 18:41:05                1        -74.005440   
2          1  2016-05-14 21:25:34                1        -73.987587   
3          2  2016-05-02 20:09:00                1        -73.973862   
4          1  2016-05-19 10:01:39                1        -73.999916   

   pickup_latitude  dropoff_longitude  dropoff_latitude store_and_fwd_flag  \
0        40.713444         -73.993858         40.752510                  N   
1        40.727306         -73.983063         40.734715                  N   
2        40.749863         -73.986809         40.757549                  N   
3        40.784153         -73.983025         40.774479                  N   
4        40.733101         -74.008331         40.734177                  N   

   trip_duration  
0           1249  
1            817  
2            366  
3         

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [265]:
#  Add distance features
df_val = add_distance_features(df_val)
df_test = add_distance_features(df_test)

#  Add time-based features
df_val = add_time_features(df_val)
df_test = add_time_features(df_test)

df_val = df_val.drop(columns=["pickup_datetime"])  
df_test = df_test.drop(columns=["pickup_datetime"])

#  Apply one-hot encoding using the same columns as in train
df_val = add_one_hot_encoding(df_val)
df_test = add_one_hot_encoding(df_test)




✅ add_distance_features: 229319/229319 rows got valid speed_kmh
✅ add_distance_features: 229322/229322 rows got valid speed_kmh


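We filter outliers with the same IQR rule (factor 6) that we used for the training dataset. Note that the cells below recompute the quartiles on each dataset, so the exact cut-offs can differ slightly from the training ones.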
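
If the intent is to use exactly the same range as the training set, one option is to compute the cut-offs once on the training data and reuse them for the other splits. The sketch below shows that variant on toy data. The helper names are hypothetical, and unlike the notebook's loop it computes all bounds up front rather than after each column's filter.

```python
import pandas as pd

def fit_iqr_bounds(df, columns, factor=6):
    """Compute (lower, upper) cut-offs per column from one reference frame."""
    bounds = {}
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        bounds[col] = (q1 - factor * iqr, q3 + factor * iqr)
    return bounds

def apply_iqr_bounds(df, bounds):
    """Keep only rows that fall inside every column's cut-offs (inclusive)."""
    mask = pd.Series(True, index=df.index)
    for col, (lower, upper) in bounds.items():
        mask &= df[col].between(lower, upper)
    return df[mask]

train = pd.DataFrame({"trip_duration": [300, 400, 500, 600, 700]})
val = pd.DataFrame({"trip_duration": [450, 650, 90000]})

bounds = fit_iqr_bounds(train, ["trip_duration"])   # learned from train only
val_clean = apply_iqr_bounds(val, bounds)           # reused unchanged on val
print(bounds, len(val_clean))
```

Fitting the bounds on the training data alone keeps the validation and test sets from influencing the preprocessing.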

In [266]:
columns = ['trip_duration', 'haversine_km', 'manhattan_km', 'speed_kmh']
factor = 6  

df_val_clean = df_val.copy()
original_rows = len(df_val_clean)

for col in columns:
    Q1 = df_val_clean[col].quantile(0.25)
    Q3 = df_val_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    df_val_clean = df_val_clean[(df_val_clean[col] >= lower) & (df_val_clean[col] <= upper)]

cleaned_rows = len(df_val_clean)
deleted_rows = original_rows - cleaned_rows
deleted_percent = deleted_rows / original_rows * 100

print(f"Original rows: {original_rows}")
print(f"Cleaned rows: {cleaned_rows}")
print(f"Deleted rows: {deleted_rows} ({deleted_percent:.2f}%)")

Original rows: 229319
Cleaned rows: 223529
Deleted rows: 5790 (2.52%)


In [267]:
columns = ['trip_duration', 'haversine_km', 'manhattan_km', 'speed_kmh']
factor = 6  

df_test_clean = df_test.copy()
original_rows = len(df_test_clean)

for col in columns:
    Q1 = df_test_clean[col].quantile(0.25)
    Q3 = df_test_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - factor * IQR
    upper = Q3 + factor * IQR
    df_test_clean = df_test_clean[(df_test_clean[col] >= lower) & (df_test_clean[col] <= upper)]

cleaned_rows = len(df_test_clean)
deleted_rows = original_rows - cleaned_rows
deleted_percent = deleted_rows / original_rows * 100

print(f"Original rows: {original_rows}")
print(f"Cleaned rows: {cleaned_rows}")
print(f"Deleted rows: {deleted_rows} ({deleted_percent:.2f}%)")

Original rows: 229322
Cleaned rows: 223633
Deleted rows: 5689 (2.48%)


We add the log-transformed target and convert any boolean columns to integers (0/1), as we did for the training set.

In [268]:

df_val_clean["log_trip_duration"] = np.log1p(df_val_clean["trip_duration"])
df_test_clean["log_trip_duration"] = np.log1p(df_test_clean["trip_duration"])

print (df_val_clean.head(10))
print (df_test_clean.head(10))


# After all preprocessing and one-hot encoding
bool_cols = df_val_clean.select_dtypes(include='bool').columns
df_val_clean[bool_cols] = df_val_clean[bool_cols].astype(int)

bool_cols_test = df_test_clean.select_dtypes(include='bool').columns
df_test_clean[bool_cols_test] = df_test_clean[bool_cols_test].astype(int)


   passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  \
0                1        -74.013916        40.713444         -73.993858   
1                1        -74.005440        40.727306         -73.983063   
2                1        -73.987587        40.749863         -73.986809   
3                1        -73.973862        40.784153         -73.983025   
4                1        -73.999916        40.733101         -74.008331   
5                1        -73.988953        40.743847         -73.980835   
6                1        -73.995529        40.744221         -73.991905   
7                1        -73.872879        40.774136         -73.981529   
8                1        -73.986626        40.750439         -73.955849   
9                1        -73.994125        40.751194         -73.971939   

   dropoff_latitude  trip_duration  haversine_km  manhattan_km  bearing_deg  \
0         40.752510           1249      4.661154      6.033519    21.252151   
1    

In [269]:
print(df_val_clean.shape)
print(df_val_clean.head())
print(df_val_clean.info())

print(df_val_clean.columns.tolist())



(223529, 31)
   passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  \
0                1        -74.013916        40.713444         -73.993858   
1                1        -74.005440        40.727306         -73.983063   
2                1        -73.987587        40.749863         -73.986809   
3                1        -73.973862        40.784153         -73.983025   
4                1        -73.999916        40.733101         -74.008331   

   dropoff_latitude  trip_duration  haversine_km  manhattan_km  bearing_deg  \
0         40.752510           1249      4.661154      6.033519    21.252151   
1         40.734715            817      2.057606      2.709162    66.393091   
2         40.757549            366      0.857222      0.920259     4.385297   
3         40.774479            195      1.323778      1.847282  -144.348180   
4         40.734177            283      0.719070      0.828662   -80.421578   

   speed_kmh  ...  weekday_4  weekday_5  weekday_6  mon

In [270]:
print(df_test_clean.shape)
print(df_test_clean.head())
print(df_test_clean.info())
print(df_test_clean.columns.tolist())


(223633, 31)
   passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  \
0                1        -73.988609        40.748978         -73.992798   
1                2        -74.012787        40.715820         -74.002220   
2                1        -73.968391        40.768036         -73.990997   
3                2        -73.991753        40.738033         -73.863556   
4                1        -73.862984        40.769592         -73.989059   

   dropoff_latitude  trip_duration  haversine_km  manhattan_km  bearing_deg  \
0         40.763409            437      1.642979      1.957414   -12.398378   
1         40.740219            676      2.855402      3.603266    18.166238   
2         40.760632           1033      2.074249      2.727281  -113.378772   
3         40.767456           3194     11.283243     14.067781    73.102749   
4         40.758305           1328     10.691987     11.874097   -96.700340   

   speed_kmh  ...  weekday_4  weekday_5  weekday_6  mon

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [271]:

val_min = df_val_clean['trip_duration'].min()
val_max = df_val_clean['trip_duration'].max()
test_min = df_test_clean['trip_duration'].min()
test_max = df_test_clean['trip_duration'].max()

print(f"Validation trip_duration: min={val_min}, max={val_max}")
print(f"test trip_duration: min={test_min}, max={test_max}")

Validation trip_duration: min=1, max=5145
test trip_duration: min=1, max=5122


## Step: Align Validation and Test Files
Now we have standardized the validation and test files to match the training set in terms of columns, without altering the actual column values.

In [272]:
save_processed_data(df_val_clean, r"G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/val_processed.csv")
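# Save the cleaned test set (df_test_clean) so its columns match the training and validation files
save_processed_data(df_test_clean, r"G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/test_processed.csv")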

 Processed data saved to G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/val_processed.csv
 Processed data saved to G:/ML mostafa saad/slides/my work/13 Project 1 - Regression - Trip Duration Prediction/data/processed/test_processed.csv
