# Assignment 4: Pipelines and Hyperparameter Tuning

Christian Valdez

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

from datetime import datetime

# math helpers for the haversine distance calculation
from math import radians, cos, sin, asin, sqrt

## Step 1: Data Input (4 marks)

In [2]:
# import dataset
taxi_df = pd.read_csv("train_5000.csv")
taxi_df.shape

(5000, 11)

In [3]:
taxi_df.head()

| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24 | 2016-03-14 17:32 | 1 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | N | 455 |
| 1 | id2377394 | 1 | 2016-06-12 0:43 | 2016-06-12 0:54 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
| 2 | id3858529 | 2 | 2016-01-19 11:35 | 2016-01-19 12:10 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | N | 2124 |
| 3 | id3504673 | 2 | 2016-04-06 19:32 | 2016-04-06 19:39 | 1 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | N | 429 |
| 4 | id2181028 | 2 | 2016-03-26 13:30 | 2016-03-26 13:38 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | N | 435 |


### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

1. This dataset is sourced from Kaggle and can be accessed via this [link](https://www.kaggle.com/competitions/nyc-taxi-trip-duration/overview).
2. I chose this dataset because it was convenient: I am already collaborating with a group on this data.
3. Finding a suitable dataset was straightforward, as my primary criterion was cleanliness.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [4]:
# drop unwanted columns
taxi_df.drop(columns=["id", "vendor_id", "store_and_fwd_flag", "passenger_count"], inplace=True)
taxi_df.columns

Index(['pickup_datetime', 'dropoff_datetime', 'pickup_longitude',
       'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude',
       'trip_duration'],
      dtype='object')

In [5]:
taxi_df.dtypes

pickup_datetime       object
dropoff_datetime      object
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
trip_duration          int64
dtype: object

In [6]:
taxi_df.isnull().sum()

pickup_datetime      0
dropoff_datetime     0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
trip_duration        0
dtype: int64

In [7]:
def extract_datetime_components(dt_str):
    """Split a 'YYYY-MM-DD HH:MM' timestamp string into calendar features."""
    dt_obj = datetime.strptime(dt_str, "%Y-%m-%d %H:%M")
    return pd.Series(
        {
            "year": dt_obj.year,
            "quarter": (dt_obj.month - 1) // 3 + 1,
            "month": dt_obj.month,
            "week_day": dt_obj.weekday() + 1,  # Monday as 1, Sunday as 7
            "hour": dt_obj.hour,
            "minute": dt_obj.minute,
        }
    )

In [8]:
datetime_components = taxi_df["pickup_datetime"].apply(extract_datetime_components)
taxi_df = taxi_df.join(datetime_components)
taxi_df.head()

| | pickup_datetime | dropoff_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_duration | year | quarter | month | week_day | hour | minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-14 17:24 | 2016-03-14 17:32 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | 455 | 2016 | 1 | 3 | 1 | 17 | 24 |
| 1 | 2016-06-12 0:43 | 2016-06-12 0:54 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | 663 | 2016 | 2 | 6 | 7 | 0 | 43 |
| 2 | 2016-01-19 11:35 | 2016-01-19 12:10 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | 2124 | 2016 | 1 | 1 | 2 | 11 | 35 |
| 3 | 2016-04-06 19:32 | 2016-04-06 19:39 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | 429 | 2016 | 2 | 4 | 3 | 19 | 32 |
| 4 | 2016-03-26 13:30 | 2016-03-26 13:38 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | 435 | 2016 | 1 | 3 | 6 | 13 | 30 |
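
The row-by-row `apply` above is fine for 5,000 rows. On a larger sample, the same features could be derived with pandas' vectorized datetime accessors instead. This is only a minimal sketch, assuming every timestamp follows the pattern shown in the output above; the helper name `add_datetime_features` is hypothetical, and the two-row frame is illustrative rather than the real data.

```python
import pandas as pd


def add_datetime_features(df, column="pickup_datetime"):
    """Return a copy of df with calendar features derived from `column`."""
    # Parse the whole column at once instead of calling strptime per row.
    parsed = pd.to_datetime(df[column], format="%Y-%m-%d %H:%M")
    out = df.copy()
    out["year"] = parsed.dt.year
    out["quarter"] = parsed.dt.quarter
    out["month"] = parsed.dt.month
    out["week_day"] = parsed.dt.dayofweek + 1  # Monday as 1, Sunday as 7
    out["hour"] = parsed.dt.hour
    out["minute"] = parsed.dt.minute
    return out


# Two timestamps copied from the head() output above.
sample = pd.DataFrame({"pickup_datetime": ["2016-03-14 17:24", "2016-06-12 0:43"]})
features = add_datetime_features(sample)
print(features)
```

The `week_day` convention matches `extract_datetime_components`, so the two approaches should be interchangeable for this column layout.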


In [9]:
taxi_df.drop(columns=["pickup_datetime", "dropoff_datetime"], inplace=True)
taxi_df.columns

Index(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'trip_duration', 'year', 'quarter', 'month',
       'week_day', 'hour', 'minute'],
      dtype='object')

In [13]:
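def calculate_haversine_distance(pick_up_lat, drop_off_lat, pick_up_lon, drop_off_lon):
    """Return the great-circle distance in kilometers between two points given in decimal degrees.

    Note the argument order: both latitudes first, then both longitudes.
    """
    # Convert decimal degrees to radians
    pick_up_lat, drop_off_lat, pick_up_lon, drop_off_lon = map(
        radians, [pick_up_lat, drop_off_lat, pick_up_lon, drop_off_lon]
    )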

    # Haversine formula
    dlon = drop_off_lon - pick_up_lon
    dlat = drop_off_lat - pick_up_lat
    a = sin(dlat / 2)**2 + cos(pick_up_lat) * cos(drop_off_lat) * sin(dlon / 2)**2
    c = 2 * asin(sqrt(a))

    # Radius of earth in kilometers
    earth_radius = 6371
    return c * earth_radius

In [17]:
taxi_df["distance"] = taxi_df.apply(
    lambda row: calculate_haversine_distance(
        row["pickup_latitude"], row["dropoff_latitude"], row["pickup_longitude"], row["dropoff_longitude"]
    ),
    axis=1,
)
taxi_df.head()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_duration | year | quarter | month | week_day | hour | minute | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.982155 | 40.767937 | -73.96463 | 40.765602 | 455 | 2016 | 1 | 3 | 1 | 17 | 24 | 1.498521 |
| 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | 663 | 2016 | 2 | 6 | 7 | 0 | 43 | 1.805508 |
| 2 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | 2124 | 2016 | 1 | 1 | 2 | 11 | 35 | 6.385099 |
| 3 | -74.01004 | 40.719971 | -74.012268 | 40.706718 | 429 | 2016 | 2 | 4 | 3 | 19 | 32 | 1.485499 |
| 4 | -73.973053 | 40.793209 | -73.972923 | 40.78252 | 435 | 2016 | 1 | 3 | 6 | 13 | 30 | 1.188589 |
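
The same haversine formula can also be written against whole NumPy arrays, which avoids the per-row `apply`. The sketch below is an illustration rather than the notebook's method: the function name `haversine_km` is hypothetical, and it takes its arguments in (lat, lon, lat, lon) order, which differs from `calculate_haversine_distance`. The coordinates are the rounded values displayed for the first two trips, so the results should be close to, but not necessarily identical to, the `distance` column.

```python
import numpy as np

EARTH_RADIUS_KM = 6371  # same radius as the scalar function above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or arrays in decimal degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# First two trips from the table above (rounded coordinates as displayed).
pickup_lat = np.array([40.767937, 40.738564])
pickup_lon = np.array([-73.982155, -73.980415])
dropoff_lat = np.array([40.765602, 40.731152])
dropoff_lon = np.array([-73.96463, -73.999481])

distances = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances)
```

With the real frame, the call would take the four coordinate columns directly, for example `haversine_km(taxi_df["pickup_latitude"], taxi_df["pickup_longitude"], taxi_df["dropoff_latitude"], taxi_df["dropoff_longitude"])`.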


In [None]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
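
One possible shape for this step is sketched below. It is not a completed answer: which columns to scale, encode, or pass through is an assumption made for illustration (continuous columns scaled, `week_day` one-hot encoded, everything else left unchanged), and the three-row frame is a stand-in built from the values shown earlier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed split of the engineered columns; this grouping is a judgment call.
numeric_features = [
    "pickup_longitude",
    "pickup_latitude",
    "dropoff_longitude",
    "dropoff_latitude",
    "distance",
]
categorical_features = ["week_day"]

preprocessor = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), numeric_features),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ],
    remainder="passthrough",  # leave the remaining columns (here: hour) unchanged
    sparse_threshold=0,  # always return a dense array
)

# Tiny stand-in frame built from the first three rows shown above.
demo = pd.DataFrame(
    {
        "pickup_longitude": [-73.982155, -73.980415, -73.979027],
        "pickup_latitude": [40.767937, 40.738564, 40.763939],
        "dropoff_longitude": [-73.96463, -73.999481, -74.005333],
        "dropoff_latitude": [40.765602, 40.731152, 40.710087],
        "distance": [1.498521, 1.805508, 6.385099],
        "week_day": [1, 7, 2],
        "hour": [17, 0, 11],
    }
)

transformed = preprocessor.fit_transform(demo)
print(transformed.shape)
```

In a full workflow the preprocessor would be fitted on the training split only, typically as the first step of a `Pipeline`, so that scaling statistics do not leak from the test set.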

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [None]:
# Implement pipeline and grid search here. Can add more code blocks if necessary
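
A minimal sketch of how one pipeline and one grid search can compare a linear model against two non-linear ones. The three models (`Ridge`, `RandomForestRegressor`, `KNeighborsRegressor`), the hyperparameter values, and the R² scoring are all assumptions chosen for illustration, and the data is synthetic rather than the taxi dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the taxi features and the trip_duration target.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The "model" step is a placeholder that the grid below swaps out.
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# One dict per candidate model, each with its own hyperparameters.
param_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {
        "model": [RandomForestRegressor(random_state=0)],
        "model__n_estimators": [50, 100],
        "model__max_depth": [None, 5],
    },
    {"model": [KNeighborsRegressor()], "model__n_neighbors": [3, 5, 9]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated R^2:", round(search.best_score_, 3))
```

Passing a list of dicts keeps each model's hyperparameters separate, so the search never tries a combination such as `n_neighbors` on a random forest.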

### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [None]:
# Calculate testing accuracy (1 mark)
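
For a regression target such as `trip_duration`, "testing accuracy" would be reported with a regression metric rather than classification accuracy. The sketch below shows one way to compute R², MAE, and RMSE on a held-out split. The metric choice is an assumption, the data is synthetic, and `best_model` is a placeholder pipeline standing in for whatever estimator the grid search returns.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the taxi features and the trip_duration target.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Placeholder for the best estimator found in Step 3.
best_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
best_model.fit(X_train, y_train)

# Score only on data the model has never seen.
y_pred = best_model.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Test R^2:  {test_r2:.3f}")
print(f"Test MAE:  {test_mae:.3f}")
print(f"Test RMSE: {test_rmse:.3f}")
```

Comparing the test R² with the cross-validated score from the grid search gives a quick read on how well the model generalizes; MAE and RMSE are in the units of the target, which for this dataset would be seconds.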


### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked, and
- what you found interesting, confusing, challenging, or motivating

while working on this assignment.


*ADD YOUR THOUGHTS HERE*