# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name:  ALTON WONG 30201904

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import mglearn 

import calendar
from datetime import datetime

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [14]:
# Import dataset (1 mark)
taxi_df = pd.read_csv("train.csv")
taxi_df = taxi_df.sample(frac=0.001, random_state=0) #Reduce size of dataset to run GridSearchCV
taxi_df.head()

| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 571578 | id2141905 | 2 | 2016-03-31 16:04:41 | 2016-03-31 16:06:34 | 1 | -73.971916 | 40.757042 | -73.974663 | 40.753624 | N | 113 |
| 1280332 | id0996953 | 2 | 2016-04-21 21:54:52 | 2016-04-21 22:28:49 | 2 | -73.961891 | 40.771061 | -73.906311 | 40.908562 | N | 2037 |
| 177838 | id1572284 | 1 | 2016-03-30 11:26:24 | 2016-03-30 11:56:35 | 3 | -74.010338 | 40.711674 | -73.957047 | 40.777634 | N | 1811 |
| 1433776 | id0103694 | 1 | 2016-03-06 20:07:45 | 2016-03-06 20:24:02 | 1 | -74.005898 | 40.740093 | -73.992287 | 40.758511 | N | 977 |
| 757662 | id2548956 | 1 | 2016-04-06 13:45:10 | 2016-04-06 13:50:52 | 1 | -74.011063 | 40.715599 | -74.005035 | 40.720966 | N | 342 |


### Questions (3 marks)

1. (1 mark) What is the source of your dataset? 
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*

This dataset was sourced from Kaggle (https://www.kaggle.com/competitions/nyc-taxi-trip-duration/overview).

I picked this dataset because I am currently using it for a project in ENSF 612.

There were no challenges in finding a dataset because I was already familiar with this one.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [15]:
# Clean data (if needed)
print(f"There are: {taxi_df.isnull().sum().sum()} null values")


#creating column for day of the week
taxi_df['pickup_datetime'] = pd.to_datetime(taxi_df['pickup_datetime'])
taxi_df['day_of_week'] = taxi_df['pickup_datetime'].dt.dayofweek

#creating column for distance travelled based off Haversine formula
def haversine_distance(lat1, lon1, lat2, lon2):
    #Radius of the Earth in km
    R = 6371.0

    lat1_rad = np.radians(lat1)
    lon1_rad = np.radians(lon1)
    lat2_rad = np.radians(lat2)
    lon2_rad = np.radians(lon2)

    #Difference in coordinates
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad

    # Haversine formula
    a = np.sin(dlat / 2)**2 + np.cos(lat1_rad) * np.cos(lat2_rad) * np.sin(dlon / 2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    
    # Distance in kilometers
    distance = R * c

    return distance

# haversine_distance uses only NumPy operations, so it can be applied to whole columns at once
taxi_df['distance_km'] = haversine_distance(taxi_df['dropoff_latitude'], taxi_df['dropoff_longitude'],
                                            taxi_df['pickup_latitude'], taxi_df['pickup_longitude'])


#dropping unimportant features
taxi_df.drop(["vendor_id", "passenger_count", "id", "store_and_fwd_flag","pickup_datetime","dropoff_datetime"], axis = 1, inplace=True)

taxi_df.head()

There are: 0 null values


| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | trip_duration | day_of_week | distance_km |
|---|---|---|---|---|---|---|---|
| 571578 | -73.971916 | 40.757042 | -73.974663 | 40.753624 | 113 | 3 | 0.444935 |
| 1280332 | -73.961891 | 40.771061 | -73.906311 | 40.908562 | 2037 | 3 | 15.988325 |
| 177838 | -74.010338 | 40.711674 | -73.957047 | 40.777634 | 1811 | 2 | 8.599361 |
| 1433776 | -74.005898 | 40.740093 | -73.992287 | 40.758511 | 977 | 6 | 2.34703 |
| 757662 | -74.011063 | 40.715599 | -74.005035 | 40.720966 | 342 | 2 | 0.783716 |
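
As a quick sanity check of the Haversine calculation, the sketch below recomputes the first two distances from the table above. The function name `haversine_km` and the small hand-copied arrays are illustrative assumptions, not part of the original notebook. The arcsine form used here is mathematically equivalent to the `arctan2` form in the cell above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same value as in the cell above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays (hypothetical helper)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coordinates copied by hand from the first two rows of the sample above
pickup_lat = np.array([40.757042, 40.771061])
pickup_lon = np.array([-73.971916, -73.961891])
dropoff_lat = np.array([40.753624, 40.908562])
dropoff_lon = np.array([-73.974663, -73.906311])

distances = haversine_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon)
print(distances)  # should be close to the distance_km column above
```

Because the distance is symmetric, swapping pickup and dropoff gives the same result.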


In [16]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
X = taxi_df.drop(["distance_km"], axis = 1)
y = taxi_df['distance_km']
print(X.shape)


(1459, 6)
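
The written answers in this notebook describe the goal as predicting travel time, while the cell above uses `distance_km` as `y` and leaves `trip_duration` among the features. If travel time is the intended target, the split might look like the sketch below. It runs on a tiny hand-made frame (`demo_df`, with values copied from the processed table above) and is not the configuration that produced the results reported in this notebook.

```python
import pandas as pd

# Tiny stand-in for taxi_df (an assumption), using values from the processed table above
demo_df = pd.DataFrame({
    "pickup_longitude": [-73.971916, -73.961891, -74.010338],
    "pickup_latitude": [40.757042, 40.771061, 40.711674],
    "dropoff_longitude": [-73.974663, -73.906311, -73.957047],
    "dropoff_latitude": [40.753624, 40.908562, 40.777634],
    "trip_duration": [113, 2037, 1811],
    "day_of_week": [3, 3, 2],
    "distance_km": [0.444935, 15.988325, 8.599361],
})

# Make trip_duration the label and keep distance_km as a feature
X_demo = demo_df.drop(columns=["trip_duration"])
y_demo = demo_df["trip_duration"]

print(X_demo.columns.tolist())
print(y_demo.name)
```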


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*

No, there are no missing or null values in the dataset. If there were, I would simply drop the affected rows: the dataset is very large, so losing a few rows would not be detrimental.

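After dropping the identifier, flag, and datetime columns, all remaining features are numerical (`day_of_week` is an integer code from 0 for Monday to 6 for Sunday). I apply standard scaling in the pipeline below.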
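
The pipeline in this notebook scales every column, including `day_of_week`. One alternative, shown here only as a sketch, is to treat `day_of_week` as a category and one-hot encode it with `ColumnTransformer` while scaling the coordinates. The frame `demo_X` and the column groupings are illustrative assumptions, not the preprocessing used for the reported results.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small stand-in feature frame (an assumption), values taken from the sample above
demo_X = pd.DataFrame({
    "pickup_longitude": [-73.971916, -73.961891, -74.010338, -74.005898, -74.011063],
    "pickup_latitude": [40.757042, 40.771061, 40.711674, 40.740093, 40.715599],
    "day_of_week": [3, 3, 2, 6, 2],
})

numeric_cols = ["pickup_longitude", "pickup_latitude"]
categorical_cols = ["day_of_week"]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),  # zero mean, unit variance
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one column per day seen
])

X_prepared = preprocess.fit_transform(demo_X)
print(X_prepared.shape)
```

A transformer like `preprocess` could take the place of the `StandardScaler()` step in a `Pipeline`.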

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [17]:
# Implement pipeline and grid search here. Can add more code blocks if necessary
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler


X_train, X_test, y_train, y_test = train_test_split(X, y , random_state = 0)

pipe = Pipeline([
    ('preprocessing', StandardScaler()),
    ('regressor', LinearRegression()) 
])


param_grid = [
    {
        'regressor': [LinearRegression()]
    },
    {
        'regressor': [RandomForestRegressor(random_state=43)],
        'regressor__max_depth': [3, 5, 7, 9],
        'regressor__n_estimators': [10, 50, 100]  
    },
    {
        'regressor': [SVR(kernel='linear')],
        'regressor__C': [0.01, 0.1, 1.0, 10.0]
    }
]
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(f'Best grid estimator: {grid.best_estimator_ }')
print(f'Best grid parameters: {grid.best_params_}')
print(f'Cross-Validation accuracy {grid.best_score_:.2f}')


Best grid estimator: Pipeline(steps=[('preprocessing', StandardScaler()),
                ('regressor',
                 RandomForestRegressor(max_depth=9, random_state=43))])
Best grid parameters: {'regressor': RandomForestRegressor(max_depth=9, random_state=43), 'regressor__max_depth': 9, 'regressor__n_estimators': 100}
Cross-Validation accuracy 0.91
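
`grid.best_score_` reports only the winning configuration. To compare the model families, the best mean cross-validation score per model can be pulled from `cv_results_`. The sketch below uses a reduced grid and synthetic data (`X_demo`, `y_demo`) so that it runs on its own; all of these are assumptions, and the numbers it prints are unrelated to the taxi results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses the taxi sample)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=120)

pipe = Pipeline([
    ("preprocessing", StandardScaler()),
    ("regressor", LinearRegression()),
])
param_grid = [
    {"regressor": [LinearRegression()]},
    {
        "regressor": [RandomForestRegressor(n_estimators=20, random_state=0)],
        "regressor__max_depth": [3, 5],
    },
]
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_demo, y_demo)

# One row per candidate; label each row with the class name of its regressor
results = pd.DataFrame(grid.cv_results_)
results["model"] = results["param_regressor"].apply(lambda est: type(est).__name__)

# Best mean cross-validated R2 for each model family
summary = results.groupby("model")["mean_test_score"].max()
print(summary)
```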


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

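1. I need regression models for this dataset because the target is a continuous value. The goal is to predict travel time, although the code in Step 2 as written sets `distance_km` as `y` and keeps `trip_duration` as a feature.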

2. Linear regression is a quick, simple baseline. Random Forest Regressor can capture non-linear relationships between the features and the target. SVR can also capture non-linear relationships when a non-linear kernel is used; with `kernel='linear'`, as in the grid above, it fits a linear function. Random Forest and SVR also have hyperparameters that GridSearchCV can tune and test.

3. Random Forest Regressor worked best. This makes sense because a random forest can exploit non-linear relationships, such as the effect of the day of the week on traffic, and its hyperparameters can be tuned to make the model more complex if required.
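
The grid in Step 3 fixes `kernel='linear'` for SVR. One way to let SVR model non-linear relationships is to include an RBF kernel in the search. The sketch below shows this on synthetic sine-shaped data (`X_demo`, `y_demo`, both assumptions) and says nothing about how an RBF kernel would score on the taxi sample.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic non-linear data (an assumption): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(150, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=150)

pipe = Pipeline([
    ("preprocessing", StandardScaler()),
    ("regressor", SVR()),
])

# Search over both kernels so the linear and non-linear variants are compared directly
param_grid = {
    "regressor__kernel": ["linear", "rbf"],
    "regressor__C": [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(f"Best cross-validated R2: {grid.best_score_:.2f}")
```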

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [18]:
# Calculate testing accuracy (1 mark)
print(f'Test accuracy {grid.score(X_test, y_test):.2f}')

Test accuracy 0.89



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

1. I chose the R2 score as the accuracy metric. It is the default returned by `grid.score` for a regressor.

1. The test accuracy (0.89) is slightly lower than the cross-validation accuracy (0.91), but overall the model generalized well.

1. The best model performs "well enough" to be used in the real world, especially because the stakes are low: estimating taxi travel times in NYC is not like predicting whether someone has heart disease. To improve the analysis, I could do more feature engineering or further hyperparameter tuning.
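
R2 is unitless, so on its own it does not say how large a typical error is. Reporting MAE and RMSE alongside it expresses the error in the units of the target, which may make the "well enough" judgment more concrete. In the sketch below, `y_true` reuses the five `trip_duration` values from the sample table and `y_pred` is made up for illustration; neither comes from the fitted model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# trip_duration values (seconds) from the sample table; predictions are invented
y_true = np.array([113.0, 2037.0, 1811.0, 977.0, 342.0])
y_pred = np.array([150.0, 1900.0, 1700.0, 1000.0, 400.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)           # average absolute error, in target units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more heavily

print(f"R2:   {r2:.3f}")
print(f"MAE:  {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
```

With the fitted search from Step 3, the same calls could be applied to `y_test` and `grid.predict(X_test)`.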

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*