# Assignment 4: Pipelines and Hyperparameter Tuning

Christian Valdez

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

from datetime import datetime

# distance calculation
from math import radians, cos, sin, asin, sqrt

# scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

In [2]:
MAX_DURATION = 7200  # Maximum trip duration in seconds
MIN_DURATION = 180   # Minimum trip duration in seconds
MAX_DISTANCE = 60.0  # Maximum distance in kilometers
MIN_DISTANCE = 1.5   # Minimum distance in kilometers

## Step 1: Data Input (4 marks)

In [3]:
# import dataset
taxi_df = pd.read_csv("train_5000.csv")
taxi_df.shape

(5000, 11)

In [4]:
taxi_df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24,2016-03-14 17:32,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 0:43,2016-06-12 0:54,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35,2016-01-19 12:10,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32,2016-04-06 19:39,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30,2016-03-26 13:38,1,-73.973053,40.793209,-73.972923,40.78252,N,435


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

1. This dataset is sourced from Kaggle and can be accessed via this [link](https://www.kaggle.com/competitions/nyc-taxi-trip-duration/overview).
2. I chose this dataset for convenience, since I am collaborating with a group that uses the same data.
3. Finding a suitable dataset was straightforward, as my primary criterion was cleanliness.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [5]:
# drop unwanted columns
taxi_df.drop(columns=["id", "vendor_id", "store_and_fwd_flag", "passenger_count"], inplace=True)
taxi_df.columns

Index(['pickup_datetime', 'dropoff_datetime', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'trip_duration'],
      dtype='object')

In [6]:
taxi_df.dtypes

pickup_datetime       object
dropoff_datetime      object
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
trip_duration          int64
dtype: object

In [7]:
taxi_df.isnull().sum()

pickup_datetime      0
dropoff_datetime     0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
trip_duration        0
dtype: int64

In [8]:
def extract_datetime_components(dt_str):
    dt_obj = datetime.strptime(dt_str, "%Y-%m-%d %H:%M")
    year = dt_obj.year
    quarter = (dt_obj.month - 1) // 3 + 1
    month = dt_obj.month
    week_day = dt_obj.weekday() + 1  # Monday as 1, Sunday as 7
    hour = dt_obj.hour
    minute = dt_obj.minute
    return pd.Series([year, quarter, month, week_day, hour, minute], index=["year", "quarter", "month", "week_day", "hour", "minute"])

In [9]:
datetime_components = taxi_df["pickup_datetime"].apply(extract_datetime_components)
taxi_df = taxi_df.join(datetime_components)
taxi_df.head()

Unnamed: 0,pickup_datetime,dropoff_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration,year,quarter,month,week_day,hour,minute
0,2016-03-14 17:24,2016-03-14 17:32,-73.982155,40.767937,-73.96463,40.765602,455,2016,1,3,1,17,24
1,2016-06-12 0:43,2016-06-12 0:54,-73.980415,40.738564,-73.999481,40.731152,663,2016,2,6,7,0,43
2,2016-01-19 11:35,2016-01-19 12:10,-73.979027,40.763939,-74.005333,40.710087,2124,2016,1,1,2,11,35
3,2016-04-06 19:32,2016-04-06 19:39,-74.01004,40.719971,-74.012268,40.706718,429,2016,2,4,3,19,32
4,2016-03-26 13:30,2016-03-26 13:38,-73.973053,40.793209,-73.972923,40.78252,435,2016,1,3,6,13,30
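
The same components can also be derived without a row-wise `apply`. The sketch below is an alternative, not the code that produced the output above. It assumes pandas' `.dt` accessor, uses a hypothetical helper name (`add_datetime_components`), and runs on two illustrative zero-padded timestamps.

```python
import pandas as pd


def add_datetime_components(df, column="pickup_datetime"):
    """Return a copy of df with calendar components of `column` added."""
    dt = pd.to_datetime(df[column], format="%Y-%m-%d %H:%M")
    out = df.copy()
    out["year"] = dt.dt.year
    out["quarter"] = dt.dt.quarter
    out["month"] = dt.dt.month
    out["week_day"] = dt.dt.dayofweek + 1  # Monday as 1, Sunday as 7
    out["hour"] = dt.dt.hour
    out["minute"] = dt.dt.minute
    return out


# Illustrative timestamps only (not read from train_5000.csv)
sample = pd.DataFrame({"pickup_datetime": ["2016-03-14 17:24", "2016-06-12 00:43"]})
components = add_datetime_components(sample)
print(components)
```

Working on the whole column at once avoids building one `pd.Series` per row, which tends to matter as the sample grows.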


In [10]:
taxi_df.drop(columns=["pickup_datetime", "dropoff_datetime"], inplace=True)
taxi_df.columns

Index(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'trip_duration', 'year', 'quarter', 'month',
       'week_day', 'hour', 'minute'],
      dtype='object')

In [11]:
def calculate_haversine_distance(pick_up_lat, drop_off_lat, pick_up_lon, drop_off_lon):
    # Convert decimal degrees to radians
    pick_up_lat, drop_off_lat, pick_up_lon, drop_off_lon = map(radians, [pick_up_lat, drop_off_lat, pick_up_lon, drop_off_lon])

    # Haversine formula
    dlon = drop_off_lon - pick_up_lon
    dlat = drop_off_lat - pick_up_lat
    a = sin(dlat / 2)**2 + cos(pick_up_lat) * cos(drop_off_lat) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))

    # Radius of earth in kilometers
    earth_radius = 6371
    return c * earth_radius

In [12]:
taxi_df["distance"] = taxi_df.apply(lambda row: calculate_haversine_distance(row["pickup_latitude"], row["dropoff_latitude"], row["pickup_longitude"], row["dropoff_longitude"]), axis = 1)
taxi_df.head()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration,year,quarter,month,week_day,hour,minute,distance
0,-73.982155,40.767937,-73.96463,40.765602,455,2016,1,3,1,17,24,1.498521
1,-73.980415,40.738564,-73.999481,40.731152,663,2016,2,6,7,0,43,1.805508
2,-73.979027,40.763939,-74.005333,40.710087,2124,2016,1,1,2,11,35,6.385099
3,-74.01004,40.719971,-74.012268,40.706718,429,2016,2,4,3,19,32,1.485499
4,-73.973053,40.793209,-73.972923,40.78252,435,2016,1,3,6,13,30,1.188589
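
For larger samples, the haversine formula can be applied to whole columns with NumPy instead of row by row. This is a minimal sketch of that idea; `haversine_km` is a hypothetical name, and its argument order (`lat1, lon1, lat2, lon2`) differs from the function in cell 11.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; accepts scalars or arrays of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coordinates of the first two trips shown in the table above
pickup_lat = np.array([40.767937, 40.738564])
pickup_lon = np.array([-73.982155, -73.980415])
dropoff_lat = np.array([40.765602, 40.731152])
dropoff_lon = np.array([-73.96463, -73.999481])

distances = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances)
```

With a DataFrame, the four coordinate columns can be passed directly, for example `haversine_km(df["pickup_latitude"], df["pickup_longitude"], df["dropoff_latitude"], df["dropoff_longitude"])`.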


In [13]:
def calculate_velocity(distance, duration):
    if duration and duration > 0:
        # Convert duration from seconds to hours, then compute velocity in km/h
        return distance / (duration / 3600)
    else:
        # No valid velocity when duration is missing, zero, or negative
        return None

In [14]:
taxi_df["velocity"] = taxi_df.apply(lambda row: calculate_velocity(row["distance"], row["trip_duration"]), axis = 1)
taxi_df.head()

Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration,year,quarter,month,week_day,hour,minute,distance,velocity
0,-73.982155,40.767937,-73.96463,40.765602,455,2016,1,3,1,17,24,1.498521,11.856429
1,-73.980415,40.738564,-73.999481,40.731152,663,2016,2,6,7,0,43,1.805508,9.803661
2,-73.979027,40.763939,-74.005333,40.710087,2124,2016,1,1,2,11,35,6.385099,10.822201
3,-74.01004,40.719971,-74.012268,40.706718,429,2016,2,4,3,19,32,1.485499,12.465723
4,-73.973053,40.793209,-73.972923,40.78252,435,2016,1,3,6,13,30,1.188589,9.836602
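
The velocity calculation can be written column-wise as well. The sketch below assumes pandas Series inputs and a hypothetical helper name, `velocity_kmh`; it yields `NaN` rather than `None` for non-positive durations.

```python
import pandas as pd


def velocity_kmh(distance_km, duration_s):
    """Velocity in km/h from distance (km) and duration (s), NaN where duration <= 0."""
    # Non-positive durations become NaN so the division never divides by zero
    duration_h = duration_s.where(duration_s > 0) / 3600
    return distance_km / duration_h


# First row matches the first trip above; the second row is a made-up zero-duration trip
distance = pd.Series([1.498521, 2.0])
duration = pd.Series([455, 0])
velocity = velocity_kmh(distance, duration)
print(velocity)
```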


In [15]:
# Creating masks
trip_duration_mask = (taxi_df["trip_duration"] >= MIN_DURATION) & (taxi_df["trip_duration"] <= MAX_DURATION)
distance_mask = (taxi_df["distance"] >= MIN_DISTANCE) & (taxi_df["distance"] <= MAX_DISTANCE)

# Applying the masks to the DataFrame
taxi_df = taxi_df[trip_duration_mask & distance_mask]

In [16]:
print("After masking")
taxi_df.shape

After masking


(3289, 13)
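
The two masks can also be expressed with `Series.between`, which is inclusive on both ends by default and so matches the `>=` / `<=` comparisons in cell 15. A small sketch, with a hypothetical `filter_trips` helper and the duration and distance values of the five trips shown earlier:

```python
import pandas as pd

MAX_DURATION = 7200  # seconds
MIN_DURATION = 180   # seconds
MAX_DISTANCE = 60.0  # kilometers
MIN_DISTANCE = 1.5   # kilometers


def filter_trips(df):
    """Keep rows whose duration and distance fall inside the configured bounds."""
    duration_ok = df["trip_duration"].between(MIN_DURATION, MAX_DURATION)
    distance_ok = df["distance"].between(MIN_DISTANCE, MAX_DISTANCE)
    return df[duration_ok & distance_ok].reset_index(drop=True)


trips = pd.DataFrame({
    "trip_duration": [455, 663, 2124, 429, 435],
    "distance": [1.498521, 1.805508, 6.385099, 1.485499, 1.188589],
})
kept = filter_trips(trips)
print(kept)
```

Three of these five trips fall just under `MIN_DISTANCE`, which illustrates how a 1.5 km floor can remove a sizable share of short trips.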

In [17]:
# Columns to be scaled
columns_to_scale = ["distance", "trip_duration"]

# Create the ColumnTransformer
ct = ColumnTransformer([('scaler', MinMaxScaler(), columns_to_scale)], remainder='passthrough')

# Apply the ColumnTransformer
taxi_df_scaled = ct.fit_transform(taxi_df)

# Convert back to DataFrame (optional)
taxi_df_scaled = pd.DataFrame(taxi_df_scaled, columns=ct.get_feature_names_out())


In [18]:
# Feature matrix
X = taxi_df_scaled.drop(["scaler__trip_duration"], axis=1)

# Target vector is trip duration
y = taxi_df_scaled["scaler__trip_duration"]

In [19]:
X.shape

(3289, 12)
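
One thing that may be worth checking here is that `remainder__velocity` is still part of `X`. Velocity was computed as distance divided by trip duration, so together with distance it can let a model recover the target almost exactly. A hedged sketch of excluding it follows; the tiny frame is made up, and only the column names come from the notebook.

```python
import pandas as pd

# Made-up values; column names follow ct.get_feature_names_out()
scaled = pd.DataFrame({
    "scaler__distance": [0.1, 0.5, 0.9],
    "scaler__trip_duration": [0.2, 0.4, 0.8],
    "remainder__hour": [17.0, 0.0, 11.0],
    "remainder__velocity": [11.9, 9.8, 10.8],
})

target = "scaler__trip_duration"
derived_from_target = ["remainder__velocity"]  # computed from trip_duration

X_safe = scaled.drop(columns=[target] + derived_from_target)
y_safe = scaled[target]
print(X_safe.columns.tolist())
```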

In [20]:
taxi_df_scaled.dtypes

scaler__distance                float64
scaler__trip_duration           float64
remainder__pickup_longitude     float64
remainder__pickup_latitude      float64
remainder__dropoff_longitude    float64
remainder__dropoff_latitude     float64
remainder__year                 float64
remainder__quarter              float64
remainder__month                float64
remainder__week_day             float64
remainder__hour                 float64
remainder__minute               float64
remainder__velocity             float64
dtype: object

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

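1. There were no missing values. If there had been, I would have filled numerical columns with the column mean, which is a reasonable default when the values are roughly normally distributed.
2. The raw data contains datetime strings (`object`), floats for the coordinates, and an integer trip duration. After extracting the datetime components and scaling, every column is a float, so scaling and feature engineering are the main preprocessing methods needed.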
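
Mean imputation as described in the first answer could look like the sketch below. It uses scikit-learn's `SimpleImputer` on a made-up frame, since the taxi data itself has no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one gap per column
toy = pd.DataFrame({
    "distance": [1.5, np.nan, 4.5],
    "hour": [8.0, 12.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Placing the imputer inside a `Pipeline` would keep the column means from being computed on validation folds.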

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [21]:
# pipelines for regression
pipe_ridge = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
pipe_tree = Pipeline([("scaler", StandardScaler()), ("tree", DecisionTreeRegressor())])
pipe_svm = Pipeline([("scaler", StandardScaler()), ("svm", SVR())])

In [22]:
# Parameter grids for GridSearchCV
param_grid_ridge = {
    "ridge__alpha": [0.1, 1, 10, 100]
}

param_grid_tree = {
    "tree__max_depth": [None, 5, 10],
    "tree__min_samples_split": [2, 5, 10]
}

param_grid_svm = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", "auto"],
    "svm__kernel": ["rbf", "linear"]
}
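
To get a feel for how much work each grid implies, the number of candidates can be counted with `ParameterGrid`. The sketch below repeats the three grids from cell 22 and assumes the 5-fold cross-validation used later in the notebook.

```python
from sklearn.model_selection import ParameterGrid

param_grids = {
    "ridge": {"ridge__alpha": [0.1, 1, 10, 100]},
    "tree": {
        "tree__max_depth": [None, 5, 10],
        "tree__min_samples_split": [2, 5, 10],
    },
    "svm": {
        "svm__C": [0.1, 1, 10],
        "svm__gamma": ["scale", "auto"],
        "svm__kernel": ["rbf", "linear"],
    },
}

CV_FOLDS = 5
counts = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}
for name, n_candidates in counts.items():
    print(f"{name}: {n_candidates} candidates, {n_candidates * CV_FOLDS} cross-validation fits")
# ridge: 4 candidates, 20 cross-validation fits
# tree: 9 candidates, 45 cross-validation fits
# svm: 12 candidates, 60 cross-validation fits
```

`GridSearchCV` adds one more fit per search when `refit` is left at its default. Because `gamma` is ignored by the linear kernel, half of the linear SVM candidates repeat the same model.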

In [23]:
# Scoring metrics (the grid searches below use only r2_scorer)
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
r2_scorer = make_scorer(r2_score)

In [24]:
# GridSearchCV for each pipeline
grid_ridge = GridSearchCV(pipe_ridge, param_grid_ridge, cv=5, scoring=r2_scorer)
grid_tree = GridSearchCV(pipe_tree, param_grid_tree, cv=5, scoring=r2_scorer)
grid_svm = GridSearchCV(pipe_svm, param_grid_svm, cv=5, scoring=r2_scorer)
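
If both R2 and MSE are of interest, `GridSearchCV` can track several metrics in one search. The following is a sketch on synthetic data from `make_regression`, not the taxi features; it uses the built-in scorer names `"r2"` and `"neg_mean_squared_error"`, and `refit="r2"` selects the final model by R2.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the taxi features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
grid = GridSearchCV(
    pipe,
    {"ridge__alpha": [0.1, 1, 10, 100]},
    cv=5,
    scoring={"r2": "r2", "neg_mse": "neg_mean_squared_error"},
    refit="r2",  # pick the best candidate by R2
)
grid.fit(X_demo, y_demo)

best = grid.best_index_
mean_r2 = grid.cv_results_["mean_test_r2"][best]
mean_mse = -grid.cv_results_["mean_test_neg_mse"][best]  # flip sign back to a plain MSE
print(f"Best alpha: {grid.best_params_['ridge__alpha']}, R2: {mean_r2:.3f}, MSE: {mean_mse:.1f}")
```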

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [26]:
grid_ridge.fit(X_train, y_train)
grid_tree.fit(X_train, y_train)
grid_svm.fit(X_train, y_train)

In [27]:
print(f"Best Ridge estimator: {grid_ridge.best_estimator_}")
print(f"Best Ridge parameters: {grid_ridge.best_params_}")
print(f"Best Ridge score: {grid_ridge.best_score_}")

Best Ridge estimator: Pipeline(steps=[('scaler', StandardScaler()), ('ridge', Ridge(alpha=10))])
Best Ridge parameters: {'ridge__alpha': 10}
Best Ridge score: 0.8363714645059653


In [28]:
print(f"Best Tree estimator: {grid_tree.best_estimator_}")
print(f"Best Tree parameters: {grid_tree.best_params_}")
print(f"Best Tree score: {grid_tree.best_score_}")

Best Tree estimator: Pipeline(steps=[('scaler', StandardScaler()),
                ('tree',
                 DecisionTreeRegressor(max_depth=10, min_samples_split=10))])
Best Tree parameters: {'tree__max_depth': 10, 'tree__min_samples_split': 10}
Best Tree score: 0.9602828164377364


In [29]:
print(f"Best svm estimator: {grid_svm.best_estimator_}")
print(f"Best svm parameters: {grid_svm.best_params_}")
print(f"Best svm score: {grid_svm.best_score_}")

Best svm estimator: Pipeline(steps=[('scaler', StandardScaler()),
                ('svm', SVR(C=1, kernel='linear'))])
Best svm parameters: {'svm__C': 1, 'svm__gamma': 'scale', 'svm__kernel': 'linear'}
Best svm score: 0.6892589408988623


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

1. I need regression models because the target, trip duration, is a *continuous* numeric value.
2. I chose **Ridge** as the linear model because its regularization strength (`alpha`) can be tuned. A **decision tree** is suitable for detecting non-linear relationships. **Support Vector Machines (SVM)** are effective in high-dimensional spaces, and their regularization parameter `C` helps reduce the risk of overfitting.
3. The **decision tree** yielded the *best* results (cross-validated R2 of about 0.96), **Ridge** came second (about 0.84), and **SVM** performed the *least* effectively (about 0.69). The decision tree's accuracy aligns with the dataset's cyclical weekly patterns and people's routines, though *overfitting* is a concern. Although **SVM** and **Ridge** models didn't perform the *best*, they may be suitable for more generalized data.

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [30]:
grid_tree.fit(X_test, y_test)

In [31]:
print("Tree Regression R2:", grid_tree.score(X_test, y_test))
print("Tree Regression MSE:", mean_squared_error(y_test, grid_tree.predict(X_test)))

Tree Regression R2: 0.994566786009953
Tree Regression MSE: 6.625179029400504e-05
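
Because `trip_duration` was min-max scaled before modelling, the MSE above is in scaled units. Min-max scaling divides by the range of the column, so an RMSE in seconds is the square root of the scaled MSE multiplied by that range. The range of the filtered data is not printed in the notebook, so the sketch below uses the filter bounds as an upper limit; `rmse_in_seconds` is a hypothetical helper name.

```python
from math import sqrt

MAX_DURATION = 7200  # seconds, upper filter bound
MIN_DURATION = 180   # seconds, lower filter bound


def rmse_in_seconds(scaled_mse, duration_range):
    """Convert an MSE on a min-max scaled target back to an RMSE in seconds."""
    return sqrt(scaled_mse) * duration_range


scaled_mse = 6.625179029400504e-05  # value printed above
# Assumption: the true range is at most the distance between the filter bounds
upper_bound_range = MAX_DURATION - MIN_DURATION
rmse_upper_bound = rmse_in_seconds(scaled_mse, upper_bound_range)
print(f"RMSE is at most about {rmse_upper_bound:.1f} seconds")
```

The exact figure would use `taxi_df["trip_duration"].max() - taxi_df["trip_duration"].min()` in place of the upper bound.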


In [32]:
print(f"Best Tree estimator: {grid_tree.best_estimator_}")
print(f"Best Tree parameters: {grid_tree.best_params_}")
print(f"Best Tree score: {grid_tree.best_score_}")

Best Tree estimator: Pipeline(steps=[('scaler', StandardScaler()),
                ('tree', DecisionTreeRegressor(min_samples_split=5))])
Best Tree parameters: {'tree__max_depth': None, 'tree__min_samples_split': 5}
Best Tree score: 0.9081141385954588
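
Note that `In [30]` calls `fit` on the test split, which re-runs the grid search on that data, so the scores printed above describe a model that has already seen `X_test`. A hedged sketch of the alternative is to keep the search fitted on the training split and only call `predict` or `score` on the held-out rows. Synthetic data from `make_regression` stands in for the taxi features here.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the taxi features
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("tree", DecisionTreeRegressor(random_state=42)),
])
param_grid = {
    "tree__max_depth": [None, 5, 10],
    "tree__min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
grid.fit(X_tr, y_tr)  # the search only ever sees the training split

# Evaluate the refit best estimator on rows it has never seen
y_pred = grid.predict(X_te)
test_r2 = r2_score(y_te, y_pred)
test_mse = mean_squared_error(y_te, y_pred)
print(f"Held-out R2: {test_r2:.3f}, held-out MSE: {test_mse:.1f}")
```

Comparing `grid.best_score_` (cross-validated on the training split) with the held-out R2 then gives a direct read on generalization.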



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

1. I chose the **R2** score as the **accuracy metric**.
2. The results closely resembled those of the training dataset but with slightly **lower** performance: the cross-validated R2 was about 0.91 on the test split versus about 0.96 on the training split. Generally, the decision tree exhibited good **generalization**.
3. I don't believe the **best-performing** model is accurate enough for real-world use. I would want **higher** accuracy, because many people place a high **value** on their time. To improve the model, I would try **feature engineering** with less granular pickup and drop-off points, a larger **sample size**, and a broader **hyperparameter** search.
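
One way to make pickup and drop-off points less granular is to round the coordinates into coarse zones. This is only a sketch of that idea: `add_zone_features` is a hypothetical helper, the two-decimal rounding (roughly 1.1 km in latitude) is an arbitrary choice, and the rows are the first two trips from the table in Step 1.

```python
import pandas as pd


def add_zone_features(df, decimals=2):
    """Add string zone labels built from rounded pickup and drop-off coordinates."""
    out = df.copy()
    for prefix in ("pickup", "dropoff"):
        lat = out[f"{prefix}_latitude"].round(decimals)
        lon = out[f"{prefix}_longitude"].round(decimals)
        out[f"{prefix}_zone"] = lat.astype(str) + "_" + lon.astype(str)
    return out


trips = pd.DataFrame({
    "pickup_latitude": [40.767937, 40.738564],
    "pickup_longitude": [-73.982155, -73.980415],
    "dropoff_latitude": [40.765602, 40.731152],
    "dropoff_longitude": [-73.96463, -73.999481],
})
zones = add_zone_features(trips)
print(zones[["pickup_zone", "dropoff_zone"]])
```

The zone labels are strings, so they would need an encoding step (for example one-hot encoding) before being passed to the models.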

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1. I utilized ChatGPT for code review and addressing my queries, and also referred to Python documentation sites like Pandas and scikit-learn.

- [ChatGPT](https://chat.openai.com/share/b6b8df54-e909-4c1f-906c-3c59d38e31b8)
- [Pandas](https://pandas.pydata.org/docs/index.html)
- [scikit learn](https://scikit-learn.org/stable/index.html)

2. I completed the steps in order. 

3. My initial prompt to ChatGPT was: "Act as a Python developer". Subsequent prompts asked for code samples that demonstrate workflows and posed conceptual questions.

4. Manually tuning the model was challenging; with more time, I would automate the process to identify the best model for the dataset.
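
One possible way to automate the model comparison is to treat the estimator itself as a hyperparameter of a single pipeline, so that one `GridSearchCV` call covers all three model families. The sketch below runs on synthetic data from `make_regression`; the step name `model` and the reduced SVM grid are choices made for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the taxi features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The "model" step is a placeholder that each grid entry swaps out
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
param_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1, 10, 100]},
    {
        "model": [DecisionTreeRegressor(random_state=0)],
        "model__max_depth": [None, 5, 10],
        "model__min_samples_split": [2, 5, 10],
    },
    {"model": [SVR()], "model__C": [0.1, 1, 10], "model__kernel": ["rbf", "linear"]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

best_model_name = type(search.best_params_["model"]).__name__
print(f"Best model: {best_model_name}, cross-validated R2: {search.best_score_:.3f}")
```

All candidates are ranked in the same `cv_results_` table, which removes the need to compare three separate searches by hand.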

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

I **liked** being able to **choose** datasets and **select** my model. While I **enjoyed** having the choice, it was **challenging** to pick from a set of models. There are **numerous** models aimed at achieving the same goal but yielding **varying** results.