### 1 - ( Data Preparation ) :



In [1]:
import pandas as pd

df = pd.read_csv('train.csv')

### 2 - ( EDA ) :
- Visualize the data
- Analyze the data (features and target)
- Clean the data (outliers, missing data, duplicates)

This helps us understand the structure and contents of the dataset before modeling.



In [None]:
print(df.head())
print(df.info())

Important questions when thinking about missing data:

- How prevalent is the missing data?
- Is the missing data random, or does it follow a pattern?

How to handle the missing data:
- Depending on the percentage of missing values, I set a threshold above which the affected column is deleted.
- If the missing data follows a pattern, use an ML model to predict the missing values from the other columns, or replace them with a specific value.
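The two strategies above can be sketched on a toy frame. This is only an illustration: the helper name `handle_missing`, the 50% threshold, and the choice of median/mode imputation are assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def handle_missing(frame, drop_threshold=0.5):
    """Drop columns whose share of missing values exceeds drop_threshold,
    then fill the remaining gaps (median for numeric, mode otherwise)."""
    missing_share = frame.isnull().mean()          # fraction of NaN per column
    kept = frame.loc[:, missing_share <= drop_threshold].copy()
    for col in kept.columns:
        if kept[col].isnull().any():
            if pd.api.types.is_numeric_dtype(kept[col]):
                kept[col] = kept[col].fillna(kept[col].median())
            else:
                kept[col] = kept[col].fillna(kept[col].mode().iloc[0])
    return kept

# Hypothetical toy data, not the taxi dataset
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],   # 75% missing -> dropped
    "some_missing": [1.0, np.nan, 3.0, 5.0],           # 25% missing -> imputed
    "category": ["a", None, "a", "b"],
})
cleaned = handle_missing(toy, drop_threshold=0.5)
print(cleaned)
```

Predicting missing values with a model would replace the simple `fillna` calls; the threshold logic stays the same.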


In [None]:
print(df.isnull().sum())
print(f"Duplicates: {df.duplicated().sum()}")

- Fortunately, this dataset has no missing values and no duplicates!
- I filter out the categorical variables because they do not appear to affect the target (if they did, I would convert them to dummy variables), and keep the numerical features.
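If a categorical column did turn out to matter, converting it to dummy variables could look roughly like this. The column name `flag` and its values are made up for illustration; only `pd.get_dummies` itself is a real pandas function.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({"flag": ["N", "Y", "N"], "distance_km": [1.2, 3.4, 0.8]})

# One 0/1 column per category; the original 'flag' column is replaced
encoded = pd.get_dummies(toy, columns=["flag"], dtype=int)
print(encoded)
```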

In [2]:
num_columns = [col for col in df.columns if df[col].dtype != 'object']
num_columns

['vendor_id',
 'passenger_count',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'trip_duration']

### Statistics of the target

In [None]:
df['trip_duration'].describe()

- Max: 3,526,282 sec (~40 days) → extreme outliers

- Std Dev: 5,237 sec (too high because of outliers)

- Fortunately, the minimum trip_duration is greater than zero, so there are no invalid (zero or negative) durations that would harm the model.

In [None]:
print("Skewness: %f" % df['trip_duration'].skew())
print("Kurtosis: %f" % df['trip_duration'].kurt())

- Skewness: 343 → highly right-skewed

- Kurtosis: 192,131 → heavy tails (extreme outliers)
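To see why a skewed target with a few extreme values is usually log-transformed (as this notebook does later with `log1p`), here is a small sketch on synthetic data. The log-normal parameters and the injected outlier are assumptions chosen only to mimic the shape of the problem, not the real taxi data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "durations" plus one artificial extreme outlier
durations = pd.Series(rng.lognormal(mean=6.5, sigma=0.8, size=10_000))
durations.iloc[0] = 3_000_000.0

raw_skew = durations.skew()
log_skew = np.log1p(durations).skew()   # skewness after log(1 + x)

print(f"raw skewness:   {raw_skew:.2f}")
print(f"log1p skewness: {log_skew:.2f}")
```

On data like this the skewness after the transform is far smaller, because the log compresses the long right tail.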

In [None]:

var = 'pickup_longitude'
data = pd.concat([df['trip_duration'], df[var]], axis=1)
data.plot.scatter(x=var, y='trip_duration', ylim=(0,800000))


Trip duration has extreme outliers!

In [None]:
import seaborn as sns
sns.heatmap(df[num_columns].corr(), annot=True)

- pickup_longitude and dropoff_longitude are strongly correlated (0.78).

- pickup_latitude and dropoff_latitude show a moderate correlation (0.49).

- trip_duration has very weak correlations with all other features (the maximum is ~0.02), which means:
   - Linear models on the raw features will not predict trip_duration well.
   - We need feature engineering.
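A near-zero Pearson correlation only rules out a linear relationship, not a relationship in general. The following toy sketch (synthetic values, not the taxi data) shows a target that is fully determined by a feature yet uncorrelated with it, and how an engineered feature exposes the dependence.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero
y = x ** 2                        # y depends on x, but not linearly

r_raw = np.corrcoef(x, y)[0, 1]              # correlation with the raw feature
r_engineered = np.corrcoef(x ** 2, y)[0, 1]  # correlation with an engineered feature

print(f"raw feature:        r = {r_raw:.3f}")
print(f"engineered feature: r = {r_engineered:.3f}")
```

The same idea motivates deriving a distance from the four raw coordinates instead of feeding the coordinates to the model directly.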

### 3 - ( Feature Engineering ) :
- For example: hashing and one-hot encoding, log transform for large values, variance-stabilizing transforms, scaling (min-max / standardization), adding new features, ...


In [4]:
import numpy as np

df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropoff_datetime'] = pd.to_datetime(df['dropoff_datetime'])

df['hour'] = df['pickup_datetime'].dt.hour
df['day_of_week'] = df['pickup_datetime'].dt.dayofweek

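# Keep trips between 10 seconds and 6 hours; .copy() avoids SettingWithCopyWarning on the assignments below
df = df[(df['trip_duration'] >= 10) & (df['trip_duration'] <= 21600)].copy()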

df['log_trip_duration'] = np.log1p(df['trip_duration'])

def haversine_vectorized(lon1, lat1, lon2, lat2):
    # Convert degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    
    # Haversine formula
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # Earth's radius in km
    return c * r

df['distance_km'] = haversine_vectorized(
    df['pickup_longitude'], df['pickup_latitude'],
    df['dropoff_longitude'], df['dropoff_latitude']
)



- I convert the datetime columns to datetime objects so that the hour, day of week, etc. can be extracted.
- I remove outliers in trip_duration (trips shorter than 10 seconds or longer than 21,600 seconds, i.e. 6 hours).
- I apply log(1 + x) to reduce skewness and make the distribution closer to normal.
- I compute the great-circle (haversine) distance between the two GPS points (pickup and dropoff).
- I add it as a new feature: distance_km.
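A quick sanity check of the haversine formula, written here with the standard `math` module for a single pair of points: with an Earth radius of 6371 km, one degree of latitude should come out at roughly 111.19 km, and identical points at 0 km. The function name `haversine_km` and the test coordinates are chosen for this sketch only.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

one_degree_lat = haversine_km(-73.98, 40.0, -73.98, 41.0)   # same longitude, 1 degree apart
same_point = haversine_km(-73.98, 40.75, -73.98, 40.75)     # identical points

print(f"1 degree of latitude: {one_degree_lat:.2f} km")
print(f"same point:           {same_point:.2f} km")
```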

In [None]:
import matplotlib.pyplot as plt
import folium

sns.histplot(df['trip_duration'], bins=100, kde=True)
plt.title("Trip Duration Distribution")
plt.xlabel("Trip Duration (seconds)")
plt.ylabel("Frequency")
plt.show()

sns.scatterplot(x='distance_km', y='trip_duration', data=df, alpha=0.3)
plt.title("Distance vs Duration")
plt.xlabel("Distance (km)")
plt.ylabel("Trip Duration")
plt.show()

sns.boxplot(x='hour', y='trip_duration', data=df)
plt.title("Trip Duration by Hour")
plt.xlabel("Hour of Day")
plt.ylabel("Trip Duration")
plt.show()

# Folium Map centered at given coordinates
m = folium.Map(location=[40.75, -73.98], tiles='CartoDB Positron', zoom_start=12)
m

- Trip Duration Distribution: shows whether most trips are short and whether there are long tails or outliers.
- Distance vs Duration Scatter Plot: checks whether trip duration increases with distance.
- Trip Duration by Hour (Boxplot): compares the distribution of trip durations across the hours of the day.
- Folium Map: provides an interactive base map for visualizing geographic data (pickup/dropoff points).

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

features = ['passenger_count', 'vendor_id', 'hour', 'day_of_week', 'distance_km']
X = df[features]
y = df['log_trip_duration']

# Define categorical and numeric columns
categorical_cols = ['vendor_id']
numeric_cols = ['passenger_count', 'hour', 'day_of_week', 'distance_km']

# Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

- Features: the original features plus the new features I created to improve accuracy.

- Preprocessing pipeline:
   - I apply StandardScaler to the numeric columns for scaling.
   - I apply OneHotEncoder to the categorical columns to convert categories into binary vectors.
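To make the effect of the two transformers concrete, here is a minimal sketch on a three-row toy frame. The values are invented, and `sparse_threshold=0.0` is added here only to force a dense array that is easy to print; the notebook's own preprocessor does not set it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({"distance_km": [1.0, 2.0, 3.0], "vendor_id": [1, 2, 1]})

ct = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["distance_km"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["vendor_id"]),
    ],
    sparse_threshold=0.0,   # always return a dense NumPy array
)
transformed = ct.fit_transform(toy)

# Column 0: distance_km rescaled to mean 0 and standard deviation 1
# Columns 1-2: one binary indicator per vendor_id value
print(np.round(transformed, 3))
```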



### 4 - ( Model Training ) : 

- I used ( RandomForestRegressor ) → a regression algorithm based on an ensemble of decision trees.
   - ( Random Forest ) → is relatively robust to outliers in the features and handles non-linear relationships well.
- I apply a ( Pipeline ) → it combines the preprocessing and model steps into one workflow.

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
from sklearn.pipeline import Pipeline


y_train = df.loc[X_train.index, 'log_trip_duration']
y_test = df.loc[X_test.index, 'log_trip_duration']

model = RandomForestRegressor(
    n_estimators=100,     # more trees give more stable predictions but increase training time
    max_depth=20,         # limit tree depth to reduce overfitting
    min_samples_split=5,  # a node needs at least 5 samples before it can be split
    random_state=42,      # fixed seed so the trained forest is reproducible
    n_jobs=-1             # train in parallel on all CPU cores
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

pipeline.fit(X_train, y_train)

[Output: fitted Pipeline diagram. Steps: 'preprocessor' (ColumnTransformer with remainder='drop'; 'num' → StandardScaler, 'cat' → OneHotEncoder with handle_unknown='ignore') and 'model' (RandomForestRegressor with n_estimators=100, max_depth=20, min_samples_split=5; the remaining listed parameters are at their defaults).]


### 5 - ( Model Validation & Evaluation ) :

In [7]:
y_pred_log = pipeline.predict(X_test)
y_pred = np.expm1(y_pred_log)       # Convert back from log scale
y_test_original = np.expm1(y_test)

rmse = np.sqrt(mean_squared_error(y_test_original, y_pred))
r2 = r2_score(y_test_original, y_pred)

print(f"Model Evaluation:")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")

Model Evaluation:
RMSE: 372.93
R² Score: 0.6868


- Why did I use ( expm1 )?
   - When predicting with the pipeline, the model outputs predictions on the log scale, because it was trained on log_trip_duration. We therefore need to reverse the transformation to get the real duration in seconds: expm1 is the inverse of log1p.
- ( log1p and expm1 ): are numerically stable for small values.
- RMSE (Root Mean Squared Error): the square root of the mean squared difference between predicted and actual durations, here in seconds. Lower RMSE = better accuracy.
- R² Score: the proportion of the variance in the target that the model explains (1.0 is a perfect fit); here it is about 0.69.
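The definitions above can be checked by hand on a few numbers. The four durations below are invented for the example; the formulas are the standard ones, RMSE = sqrt(mean((y - ŷ)²)) and R² = 1 - SS_res / SS_tot.

```python
import numpy as np

# Hypothetical actual and predicted durations in seconds
y_true = np.array([300.0, 600.0, 900.0, 1200.0])
y_hat = np.array([330.0, 540.0, 960.0, 1170.0])

# RMSE: square root of the mean squared error
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# expm1 undoes log1p, so the round trip recovers the original values
round_trip = np.expm1(np.log1p(y_true))

print(f"RMSE: {rmse_manual:.2f}")
print(f"R²:   {r2_manual:.4f}")
print(round_trip)
```

Note that the notebook computes both metrics after converting back to seconds, so the RMSE is directly readable as an error in seconds.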

In [8]:
import joblib

joblib.dump(pipeline, 'BaseLine_pipeline.pkl')
print("Pipeline saved as 'BaseLine_pipeline.pkl'")

Pipeline saved as 'BaseLine_pipeline.pkl'


### Final step :  ( Tuning & finalize )
Brute force: try many different combinations (algorithms, hyperparameters, and anything else in the code that can vary) to improve performance.

In this file I implement only the baseline model; the fine-tuning is done in ( advancedModel.py ).
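The contents of advancedModel.py are not shown here, so the following is only a generic sketch of what a hyperparameter search could look like, not the tuning actually used. It runs scikit-learn's `RandomizedSearchCV` on small synthetic data; the search space, `n_iter`, and `cv` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (stand-in for the real features/target)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0.0, 10.0, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0.0, 0.5, size=200)

# Hypothetical search space; real ranges would depend on the dataset
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,                                # number of sampled combinations
    cv=3,                                    # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",   # higher (closer to 0) is better
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best CV RMSE: {-search.best_score_:.3f}")
```

To tune the whole notebook pipeline instead of a bare model, the parameter names would need the step prefix, for example `model__max_depth`.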