<a href="https://colab.research.google.com/github/aymuos/starship/blob/main/fundamental_EDA/actuallyWorking/feature_creator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Objective

- Create features for all three city datasets (Chongqing, Hangzhou, and Shanghai).

In [None]:
base = "/content/drive/MyDrive/ml/"

In [None]:
import pandas as pd
import os

# Construct full file paths
chongqing_path = os.path.join(base, 'chongqing_data.csv')
hangzhou_path = os.path.join(base, 'hangzhou_data.csv')
shanghai_path = os.path.join(base, 'shanghai_data.csv')

# Load data into DataFrames
chongqing_df = pd.read_csv(chongqing_path)
hangzhou_df = pd.read_csv(hangzhou_path)
shanghai_df = pd.read_csv(shanghai_path)

print("DataFrames loaded successfully: chongqing_df, hangzhou_df, shanghai_df")

DataFrames loaded successfully: chongqing_df, hangzhou_df, shanghai_df


In [None]:
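# Keep rows with 0 < horizon_ETA < 240; .copy() avoids SettingWithCopyWarning
# when new feature columns are assigned to the filtered frames later on.
chongqing_df = chongqing_df[(chongqing_df['horizon_ETA'] > 0) & (chongqing_df['horizon_ETA'] < 240)].copy()
hangzhou_df = hangzhou_df[(hangzhou_df['horizon_ETA'] > 0) & (hangzhou_df['horizon_ETA'] < 240)].copy()
shanghai_df = shanghai_df[(shanghai_df['horizon_ETA'] > 0) & (shanghai_df['horizon_ETA'] < 240)].copy()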

print("DataFrames filtered successfully based on 'horizon_ETA' condition.")

DataFrames filtered successfully based on 'horizon_ETA' condition.
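
The filter depends on the exact, case-sensitive column name `horizon_ETA`. The sketch below shows one way to wrap the same condition in a reusable function that fails with a clearer message when the column is missing. The helper name `filter_by_horizon` and the toy values are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd


def filter_by_horizon(df, column="horizon_ETA", low=0, high=240):
    """Keep rows with low < df[column] < high (both bounds exclusive)."""
    if column not in df.columns:
        # Suggest columns that differ only by letter case, e.g. 'horizon_eta'.
        close = [c for c in df.columns if c.lower() == column.lower()]
        raise KeyError(f"Column {column!r} not found; similar columns: {close}")
    mask = (df[column] > low) & (df[column] < high)
    # .copy() returns an independent frame, so adding columns later is safe.
    return df.loc[mask].copy()


# Toy data for illustration only; not values from the real datasets.
toy = pd.DataFrame({"horizon_ETA": [-5, 0, 30, 239, 240, 500]})
filtered = filter_by_horizon(toy)
print(filtered["horizon_ETA"].tolist())  # [30, 239]
```

Under these assumptions, each city DataFrame could be passed through the same function instead of repeating the boolean mask three times.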


In [None]:
chongqing_df = chongqing_df.sort_values(by='delivery_user_id')
hangzhou_df = hangzhou_df.sort_values(by='delivery_user_id')
shanghai_df = shanghai_df.sort_values(by='delivery_user_id')

print("DataFrames sorted successfully by 'delivery_user_id'.")

DataFrames sorted successfully by 'delivery_user_id'.


## Calculate Euclidean Distance

### Subtask:
Create a new feature 'euclidean_distance' for each DataFrame: the straight-line distance between the pickup point ('poi_lat', 'poi_lng') and the drop-off point ('receipt_lat', 'receipt_lng'). The result is expressed in the same units as the coordinate columns.


**Reasoning**:
The subtask requires calculating the Euclidean distance, which involves mathematical operations that can be efficiently handled by the NumPy library. Therefore, the first step is to import NumPy.



In [None]:
import numpy as np

**Reasoning**:
The next step computes the distance for each DataFrame as `sqrt((receipt_lng - poi_lng)**2 + (receipt_lat - poi_lat)**2)` using `np.sqrt` and assigns the result to a new column named 'euclidean_distance'.



In [None]:
chongqing_df['euclidean_distance'] = np.sqrt(
    (chongqing_df['receipt_lng'] - chongqing_df['poi_lng'])**2 +
    (chongqing_df['receipt_lat'] - chongqing_df['poi_lat'])**2
)
hangzhou_df['euclidean_distance'] = np.sqrt(
    (hangzhou_df['receipt_lng'] - hangzhou_df['poi_lng'])**2 +
    (hangzhou_df['receipt_lat'] - hangzhou_df['poi_lat'])**2
)
shanghai_df['euclidean_distance'] = np.sqrt(
    (shanghai_df['receipt_lng'] - shanghai_df['poi_lng'])**2 +
    (shanghai_df['receipt_lat'] - shanghai_df['poi_lat'])**2
)

print("Euclidean distance calculated for all DataFrames.")

Euclidean distance calculated for all DataFrames.
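
The same calculation can be written once and reused for each city. The sketch below is a minimal version that uses `np.hypot`, which evaluates `sqrt(dx**2 + dy**2)` element-wise. The function name `add_euclidean_distance` and the toy coordinates are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_euclidean_distance(df):
    """Return a copy of df with a straight-line 'euclidean_distance' column."""
    out = df.copy()
    # np.hypot(dx, dy) computes sqrt(dx**2 + dy**2) element-wise.
    out["euclidean_distance"] = np.hypot(
        out["receipt_lng"] - out["poi_lng"],
        out["receipt_lat"] - out["poi_lat"],
    )
    return out


# Toy coordinates forming a 3-4-5 triangle; illustrative values only.
toy = pd.DataFrame(
    {"poi_lng": [0.0], "poi_lat": [0.0], "receipt_lng": [3.0], "receipt_lat": [4.0]}
)
result = add_euclidean_distance(toy)
print(result["euclidean_distance"].iloc[0])  # 5.0
```

The value is only a physical distance if the coordinate columns are in a planar system. For raw latitude and longitude in degrees, a great-circle formula would be the usual alternative.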


## Extract Hour of the Day

### Subtask:
Create a new feature 'hour_of_the_day' by extracting the hour from the 'receipt_time' column for each DataFrame.


**Reasoning**:
The subtask requires extracting the hour from the 'receipt_time' column. First, the 'receipt_time' column in each DataFrame needs to be converted to datetime objects to enable hour extraction. Then, the hour can be extracted using the `.dt.hour` accessor and assigned to a new column named 'hour_of_the_day' for all three DataFrames.



In [None]:
chongqing_df['receipt_time'] = pd.to_datetime(chongqing_df['receipt_time'])
hangzhou_df['receipt_time'] = pd.to_datetime(hangzhou_df['receipt_time'])
shanghai_df['receipt_time'] = pd.to_datetime(shanghai_df['receipt_time'])

chongqing_df['hour_of_the_day'] = chongqing_df['receipt_time'].dt.hour
hangzhou_df['hour_of_the_day'] = hangzhou_df['receipt_time'].dt.hour
shanghai_df['hour_of_the_day'] = shanghai_df['receipt_time'].dt.hour

print("'hour_of_the_day' feature created for all DataFrames.")

'hour_of_the_day' feature created for all DataFrames.
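
The raw format of 'receipt_time' is not shown in this notebook. If it happens to be stored as integer Unix epoch seconds (an assumption, not something the source confirms), `pd.to_datetime` needs `unit="s"`; otherwise integers are interpreted as nanoseconds and every hour collapses to 0. The sketch below uses made-up values to show the difference.

```python
import pandas as pd

# Hypothetical raw values: 0 s and 13 hours after the Unix epoch (UTC).
raw = pd.Series([0, 13 * 3600])

# With unit="s" the integers are read as seconds since 1970-01-01.
as_seconds = pd.to_datetime(raw, unit="s")
hours_seconds = as_seconds.dt.hour.tolist()

# Without a unit, integers are read as nanoseconds, so both land at hour 0.
as_default = pd.to_datetime(raw)
hours_default = as_default.dt.hour.tolist()

print(hours_seconds)  # [0, 13]
print(hours_default)  # [0, 0]
```

Epoch values are in UTC, so a timezone conversion may also be needed before the hour reflects local time. If 'receipt_time' already holds date strings, the call used in the notebook is sufficient.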


## Determine Work Hours

### Subtask:
Create a new boolean feature 'is_in_work_hours' for each DataFrame. This will be true if 'hour_of_the_day' is between 8 (inclusive) and 19 (exclusive), and false otherwise.


**Reasoning**:
The subtask requires creating a boolean feature 'is_in_work_hours' based on the 'hour_of_the_day' column for each DataFrame. I will apply the condition 'hour_of_the_day' >= 8 and 'hour_of_the_day' < 19 to each DataFrame to generate this new feature.



In [None]:
chongqing_df['is_in_work_hours'] = (chongqing_df['hour_of_the_day'] >= 8) & (chongqing_df['hour_of_the_day'] < 19)
hangzhou_df['is_in_work_hours'] = (hangzhou_df['hour_of_the_day'] >= 8) & (hangzhou_df['hour_of_the_day'] < 19)
shanghai_df['is_in_work_hours'] = (shanghai_df['hour_of_the_day'] >= 8) & (shanghai_df['hour_of_the_day'] < 19)

print("'is_in_work_hours' feature created for all DataFrames.")

'is_in_work_hours' feature created for all DataFrames.
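
The half-open interval (8 inclusive, 19 exclusive) can also be expressed with `Series.between`. The sketch below assumes pandas 1.3 or later, where `inclusive` accepts the string `"left"`. The sample hours are chosen to probe both boundaries.

```python
import pandas as pd

hours = pd.Series([7, 8, 18, 19], name="hour_of_the_day")

# inclusive="left" keeps the lower bound and excludes the upper one: 8 <= h < 19.
is_in_work_hours = hours.between(8, 19, inclusive="left")

print(is_in_work_hours.tolist())  # [False, True, True, False]
```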


## Extract Day of the Week

### Subtask:
Create a new feature 'day_of_the_week' for each DataFrame by extracting the day of the week from the 'receipt_time' column. It is stored as a plain integer (0 for Monday through 6 for Sunday); no cyclic (sine/cosine) encoding is applied at this stage.


**Reasoning**:
The subtask requires extracting the day of the week from the 'receipt_time' column for each DataFrame. Since 'receipt_time' has already been converted to datetime objects, I will use the `.dt.dayofweek` accessor to get the numerical day of the week and assign it to a new column 'day_of_the_week' for each DataFrame.



In [None]:
chongqing_df['day_of_the_week'] = chongqing_df['receipt_time'].dt.dayofweek
hangzhou_df['day_of_the_week'] = hangzhou_df['receipt_time'].dt.dayofweek
shanghai_df['day_of_the_week'] = shanghai_df['receipt_time'].dt.dayofweek

print("'day_of_the_week' feature created for all DataFrames.")

'day_of_the_week' feature created for all DataFrames.
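
The integer produced by `.dt.dayofweek` places Sunday (6) and Monday (0) six units apart even though they are adjacent days. A common way to make the feature cyclic is to project it onto a circle with sine and cosine. The sketch below is one possible version; the column names `dow_sin` and `dow_cos` and the helper `gap` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

days = pd.Series(range(7), name="day_of_the_week")  # 0 = Monday ... 6 = Sunday

# Map each day to an angle on a 7-step circle, then take sine and cosine.
angle = 2 * np.pi * days / 7
encoded = pd.DataFrame({"dow_sin": np.sin(angle), "dow_cos": np.cos(angle)})


def gap(a, b):
    """Euclidean distance between the encodings of days a and b."""
    return float(np.linalg.norm(encoded.loc[a] - encoded.loc[b]))


# Sunday -> Monday is as close as Monday -> Tuesday in the encoded space.
print(abs(gap(6, 0) - gap(0, 1)) < 1e-9)  # True
```

The same transformation with a period of 24 would apply to 'hour_of_the_day'.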


## Final Sort

### Subtask:
After creating all features, sort each DataFrame again, first by 'delivery_user_id' and then by 'receipt_time'.


**Reasoning**:
The subtask requires sorting each DataFrame by 'delivery_user_id' and then by 'receipt_time'. I will use the `sort_values` method on each DataFrame with a list of the specified columns.



In [None]:
chongqing_df = chongqing_df.sort_values(by=['delivery_user_id', 'receipt_time'])
hangzhou_df = hangzhou_df.sort_values(by=['delivery_user_id', 'receipt_time'])
shanghai_df = shanghai_df.sort_values(by=['delivery_user_id', 'receipt_time'])

print("DataFrames sorted successfully by 'delivery_user_id' and 'receipt_time'.")

DataFrames sorted successfully by 'delivery_user_id' and 'receipt_time'.


## Final Task

### Subtask:
The data loading, filtering, feature creation, and sorting steps have been completed for all three datasets. The processed dataframes are ready for further analysis.


## Summary:

### Data Analysis Key Findings

*   Three datasets, `chongqing_data`, `hangzhou_data`, and `shanghai_data`, were successfully loaded into pandas DataFrames.
*   DataFrames were filtered to keep only records where `horizon_ETA` (estimated time of arrival) is strictly greater than 0 and strictly less than 240 minutes. The column name is case-sensitive: an initial attempt with `horizon_eta` raised a `KeyError`, which was fixed by using `horizon_ETA`.
*   An initial sort was performed on all DataFrames based on the `delivery_user_id` column.
*   A new feature, `euclidean_distance`, was calculated and added to each DataFrame, representing the straight-line distance between the pickup (`poi_lat`, `poi_lng`) and drop-off (`receipt_lat`, `receipt_lng`) coordinates.
*   The `receipt_time` column was converted to datetime objects, and two new temporal features were extracted:
    *   `hour_of_the_day`: The hour component of `receipt_time`.
    *   `day_of_the_week`: The numerical day of the week (0 for Monday, 6 for Sunday) from `receipt_time`.
*   A boolean feature, `is_in_work_hours`, was created, indicating whether the `hour_of_the_day` falls between 8 (inclusive) and 19 (exclusive).
*   Finally, all DataFrames were sorted by `delivery_user_id` and then by `receipt_time`.

### Insights or Next Steps

*   The processed datasets, enriched with temporal and geographical features, are now ready for comparative analysis across the three cities to identify delivery patterns or performance differences.
*   Further analysis could involve exploring the correlation between the newly created features (e.g., `euclidean_distance`, `hour_of_the_day`, `is_in_work_hours`) and delivery efficiency metrics (like `horizon_ETA`) within each city.
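
As a starting point for the correlation analysis suggested above, the sketch below computes the Pearson correlation of each new feature with `horizon_ETA`. It runs on synthetic data generated only so that the example is self-contained. The numbers are fabricated and say nothing about the real datasets.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real DataFrames are not reproduced here.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "euclidean_distance": rng.uniform(0, 10, n),
        "hour_of_the_day": rng.integers(0, 24, n),
    }
)
toy["is_in_work_hours"] = toy["hour_of_the_day"].between(8, 19, inclusive="left")
# Fabricated target that grows with distance, purely so the example has signal.
toy["horizon_ETA"] = 20 + 5 * toy["euclidean_distance"] + rng.normal(0, 3, n)

features = ["euclidean_distance", "hour_of_the_day", "is_in_work_hours"]
# Cast to float so the boolean column takes part in the Pearson correlation.
corr = toy[features].astype(float).corrwith(toy["horizon_ETA"])
print(corr.sort_values(ascending=False))
```

Running the same lines on `chongqing_df`, `hangzhou_df`, and `shanghai_df` in turn would give one correlation table per city for comparison.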


In [None]:
print("Chongqing DataFrame (chongqing_df):")
print(chongqing_df.head(10))
print("\nHangzhou DataFrame (hangzhou_df):")
print(hangzhou_df.head(10))
print("\nShanghai DataFrame (shanghai_df):")
print(shanghai_df.head(10))

Chongqing DataFrame (chongqing_df):
                              order_id                     from_dipan_id  \
7492  7b15e523a08ad352b9f22bb4ecd99810  37e976ad4abb10da92c97c6b34f7f54f   
2346  3d7ca50514737df23244a2c633b7eebe  37e976ad4abb10da92c97c6b34f7f54f   
4431  4767aafe217550daac5895cbc46b467d  37e976ad4abb10da92c97c6b34f7f54f   
4271  60e2c8af8b3f4a692bccb16403e87e14  37e976ad4abb10da92c97c6b34f7f54f   
8007  8a4a6dacc0851a066600a3e6a7d4204e  37e976ad4abb10da92c97c6b34f7f54f   
688   5e5bc229715fd138d41cfabf4a7143a7  37e976ad4abb10da92c97c6b34f7f54f   
1506  5074d3ed479303991c9d83d66275408e  37e976ad4abb10da92c97c6b34f7f54f   
716   ed4c004b1d4d318a4a25c564a2731bb7  37e976ad4abb10da92c97c6b34f7f54f   
3739  b271e95fccb2388097693adc24d2dd9c  37e976ad4abb10da92c97c6b34f7f54f   
2369  df13ce63066334c1aa65953265fcaedc  37e976ad4abb10da92c97c6b34f7f54f   

                      delivery_user_id       poi_lng       poi_lat  \
7492  0008c2b6a2314db8715301b7eeeebc5a  8.903075e+06 -7.6

In [None]:
import pickle
import os

processed_dataframes = {
    'chongqing_df': chongqing_df,
    'hangzhou_df': hangzhou_df,
    'shanghai_df': shanghai_df
}

# Define the path to save the dictionary
output_path = os.path.join(base, 'processed_dataframes.pkl')

# Save the dictionary as a pickle file
with open(output_path, 'wb') as f:
    pickle.dump(processed_dataframes, f)

print(f"Processed dataframes dictionary saved to: {output_path}")

Processed dataframes dictionary saved to: /content/drive/MyDrive/ml/processed_dataframes.pkl
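
A later notebook can restore the dictionary with `pickle.load`. The sketch below demonstrates the round trip on a small stand-in DataFrame written to a temporary directory, so it does not touch the Drive path used above. The toy values are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-in dictionary; the real one holds the three city DataFrames.
frames = {
    "toy_df": pd.DataFrame({"delivery_user_id": ["a", "b"], "horizon_ETA": [30, 45]})
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_dataframes.pkl")
    with open(path, "wb") as f:
        pickle.dump(frames, f)
    # Only unpickle files you created yourself; pickle can run arbitrary code.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored["toy_df"].equals(frames["toy_df"]))  # True
```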
