# Mini Project: Build a Machine Learning Model

## Predict Total Fare on the NYC Taxi Dataset

Welcome to the NYC Taxi Fare Prediction project! In this Colab, we will continue using the NYC Taxi Dataset to predict the fare amount for taxi rides using a subset of available features. We will go through three main stages: building a baseline model, creating a full model, and performing hyperparameter tuning to enhance our predictions.

Now that you've completed exploratory data analysis on this dataset you should have a good understanding of the feature space.

## Project Objectives

The primary objectives of this project are as follows:

Baseline Model: We will start by building a simple baseline model to establish a benchmark for our predictions. This model will serve as a starting point to compare the performance of our subsequent models.

Full Model: Next, we will develop a more comprehensive model that leverages machine learning techniques to improve prediction accuracy. We will use Scikit-Learn's model pipeline to build a framework that enables rapid experimentation.

Hyperparameter Tuning: Lastly, we will optimize our full model by fine-tuning its hyperparameters. By systematically adjusting the parameters that control model behavior, we aim to achieve the best possible performance for our prediction task.

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from scipy.sparse import vstack
from sklearn.compose import make_column_transformer

Load the NYC taxi dataset into a Pandas DataFrame and do a few basic checks to ensure the data is loaded properly. Note, there are several months of data that can be used. For simplicity, use the Yellow Taxi 2022-01 parquet file [here](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet). Here are your tasks:

  1. Load the `yellow_tripdata_2022-01.parquet` file into Pandas.
  2. Print the first 5 rows of data.
  3. Drop any rows of data that contain NULL values.
  4. Create a new feature, 'trip_duration' that captures the duration of the trip in minutes.
  5. Create a varible named 'target_variable' to store the name of the thing we're trying to predict, 'total_amount'.
  6. Create a list called 'feature_cols' containing the feature names that we'll be using to predict our target variable. The list should contain 'VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', and 'trip_duration'.

In [46]:
# Load the dataset into a pandas DataFrame (from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
taxi_data = pd.read_parquet('yellow_tripdata_2022-01.parquet')

In [47]:
# Display the first few rows of the dataset
taxi_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2022-01-01 00:35:40,2022-01-01 00:53:29,2.0,3.8,1.0,N,142,236,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,0.0
1,1,2022-01-01 00:33:43,2022-01-01 00:42:07,1.0,2.1,1.0,N,236,42,1,8.0,0.5,0.5,4.0,0.0,0.3,13.3,0.0,0.0
2,2,2022-01-01 00:53:21,2022-01-01 01:02:19,1.0,0.97,1.0,N,166,166,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56,0.0,0.0
3,2,2022-01-01 00:25:21,2022-01-01 00:35:23,1.0,1.09,1.0,N,114,68,2,8.0,0.5,0.5,0.0,0.0,0.3,11.8,2.5,0.0
4,2,2022-01-01 00:36:48,2022-01-01 01:14:20,1.0,4.3,1.0,N,68,163,1,23.5,0.5,0.5,3.0,0.0,0.3,30.3,2.5,0.0


In [48]:
taxi_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2463931 entries, 0 to 2463930
Data columns (total 19 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int64         
 8   DOLocationID           int64         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  airport_fee           

In [49]:
# Drop rows with missing values.
taxi_data.dropna(inplace=True)


In [68]:
# Create new feature, 'trip_duration'.
taxi_data['trip_duration'] = (taxi_data['tpep_dropoff_datetime'] - taxi_data['tpep_pickup_datetime']).dt.total_seconds()

In [51]:
# Create a list called feature_col to store column names
feature_col = ['VendorID', 'trip_distance', 'payment_type', 'PULocationID', 'DOLocationID', 'trip_duration']
target_variable = 'total_amount'

Splitting a dataset into training and test sets is a crucial step in machine learning model development. It allows us to evaluate the performance and generalization ability of our models accurately. The training set is used to train the model, while the test set serves as an independent sample for evaluating its performance.

1. **Model Training**: The training set is used to fit the model, allowing it to learn the underlying patterns and relationships between the features and the target variable. By exposing the model to a diverse range of examples in the training set, it can capture the underlying structure of the data.

2. **Model Evaluation**: The test set, which is independent of the training set, is crucial for evaluating how well the trained model generalizes to unseen data. It provides an unbiased assessment of the model's performance on new instances. By measuring the model's accuracy, precision, recall, or other evaluation metrics on the test set, we can estimate how well the model will perform on unseen data.

3. **Preventing Overfitting**: Overfitting occurs when a model learns the training data's noise and idiosyncrasies instead of the underlying patterns. By evaluating the model on the test set, we can identify if the model is overfitting. If the model performs significantly worse on the test set compared to the training set, it indicates overfitting. In such cases, we might need to adjust the model, feature selection, or regularization techniques to improve generalization.

4. **Hyperparameter Tuning**: Splitting the dataset allows us to perform hyperparameter tuning on the model. Hyperparameters are configuration settings that control the learning process, such as learning rate, regularization strength, or the number of hidden layers in a neural network. By using a validation set (often created from a portion of the training set), we can iteratively adjust the hyperparameters and select the best combination that maximizes the model's performance on the validation set. The final evaluation on the test set provides an unbiased estimate of the model's performance.

By splitting the dataset into training and test sets, we can ensure that our models are both well-trained and accurately evaluated. This separation helps us understand how the model will perform on new, unseen data, which is critical for assessing its effectiveness and making informed decisions about its deployment.

Here is your task:

  1. Use Scikit-Learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the data into training and test sets. Don't forget to set the random state.

In [52]:
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(taxi_data[feature_col], taxi_data[target_variable], test_size=0.2, random_state=42)

NOTE: after running into memory issues I've come back to transform these variables into smaller datatypes.

In [53]:
# Integer columns
int_columns = ['VendorID', 'payment_type', 'PULocationID', 'DOLocationID']
for col in int_columns:
    X_train[col] = X_train[col].astype('int16')
    X_test[col] = X_test[col].astype('int16')

# Float columns
float_columns = ['trip_distance', 'trip_duration']
for col in float_columns:
    X_train[col] = X_train[col].astype('float32')
    X_test[col] = X_test[col].astype('float32')

for col in X_train.columns:
    print(f"{col}: {X_train[col].dtype}")

VendorID: int16
trip_distance: float32
payment_type: int16
PULocationID: int16
DOLocationID: int16
trip_duration: float32


The importance of a baseline model, even if it uses a simple strategy like always predicting the mean, cannot be understated. Here's why a baseline model is valuable:

1. **Performance Comparison**: A baseline model serves as a reference point for evaluating the performance of more sophisticated models. By establishing a simple yet reasonable baseline, we can determine whether our advanced models offer any significant improvement over this basic approach. It helps us set realistic expectations and gauge the effectiveness of our efforts.

2. **Model Complexity**: A baseline model provides insight into the complexity required to solve the prediction task. If a simple strategy like predicting the median performs reasonably well, it suggests that the problem might not necessitate complex modeling techniques. Conversely, if the baseline model performs poorly, it indicates the presence of more intricate patterns that need to be captured by more sophisticated models.

3. **Minimum Performance Requirement**: A baseline model can establish a minimum performance requirement for a predictive task. If we cannot outperform the baseline, it suggests that our models have failed to capture even the most fundamental relationships within the data. In such cases, we may need to revisit our data preprocessing steps, feature engineering techniques, or consider other external factors affecting the task.

4. **Identifying Data Issues**: A baseline model can help identify potential issues within the dataset. If the baseline model performs poorly, it may indicate problems like missing values, outliers, or data inconsistencies. These issues can be further investigated and resolved to improve the overall model performance.

While a baseline model like always predicting the median may not offer the highest prediction accuracy, its importance lies in its role as a starting point for model development and evaluation. It provides a solid foundation for comparing and assessing the performance of more complex models, ensuring that any improvements made are meaningful and significant.

Here is your task:

  1. Create a model that always predicts the mean total fare of the training dataset. Use Scikit-Learn's [mean_absolute_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) to evaluate this model. Is it any good?

In [14]:
# Create a baseline for mean absolute error of total amount
def predict_fare_as_avg(feature_vector):
    try:
        # Check if feature_vector is array-like and has correct length
        if not isinstance(feature_vector, (list, np.ndarray, pd.Series)):
            raise TypeError("Input must be a list, numpy array, or pandas Series")

        # Check if length matches number of features
        expected_length = len(feature_col)
        if len(feature_vector) != expected_length:
            raise ValueError(f"Input must have {expected_length} features, got {len(feature_vector)}")

        return np.mean(taxi_data[target_variable])

    except Exception as e:
        print(f"Error in prediction: {str(e)}")
        return None


# test the function
print(predict_fare_as_avg(X_train.iloc[0]))
print(predict_fare_as_avg(X_train.iloc[1]))
print(predict_fare_as_avg(X_test.iloc[2]))

# Show the baseline error
baseline_error = mean_absolute_error(y_test, [predict_fare_as_avg(x) for x in X_test.values])
print(f"Baseline error: {baseline_error}")



19.02453219908813
19.02453219908813
19.02453219908813
Baseline error: 9.180927605339791


The Baseline error was computed at 9.181 in units of the target vector (USD).

With a baseline metric in place, we can try to build a machine learning model. Obviously, if the model can't beat the baseline then there are some major issues to be resolved.

It's always a good idea to start with a simple machine learning model, like linear regression, and build upon it if necessary.

Here are your tasks:

  1. Use Scikit-Learn's [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to preprocess the categorical and continuous features independently. Apply the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to the continuous columns and [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to the categorical columns.

  One-hot encoding is a popular technique used to represent categorical variables numerically in machine learning models. It transforms categorical features into a binary vector representation, where each category is represented by a binary column. Here's an explanation of one-hot encoding:

  When working with categorical variables, such as colors (e.g., red, blue, green) or vehicle types (e.g., car, truck, motorcycle), machine learning algorithms often require numerical inputs. However, directly assigning numerical values to categories can introduce unintended relationships or orderings between them. For example, assigning the values 0, 1, and 2 to the categories red, blue, and green may imply a sequential relationship, which is not desired.

  One-hot encoding solves this problem by creating new binary columns, equal to the number of unique categories in the original feature. Each binary column represents a specific category and takes a value of 1 if the data point belongs to that category, and 0 otherwise. This encoding ensures that no implicit ordering or relationship exists between the categories.

  2. Integrate the preprocessor in the previous step with Scikit-Learn's [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model using a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

  3. Train the pipeline on the training data.

  4. Evaluate the model using mean absolute error as a metric on the test data. Does the model beat the baseline?


In [54]:
num_cols = ['trip_distance', 'trip_duration']  # Float columns
cat_cols = ['VendorID', 'payment_type', 'PULocationID', 'DOLocationID']  # Integer columns

After running into memory issues I chose to batch process the ColumnTransformer

In [55]:
# Setup the preprocessor with sparse output for memory issues
preprocessor = make_column_transformer(
    (StandardScaler(), num_cols),
    (OneHotEncoder(drop='first', sparse_output=True), cat_cols),
    sparse_threshold=1  # Force sparse output
)

In [56]:
# process in pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression(fit_intercept=True, copy_X=False))
])

pipeline.fit(X_train, y_train)

In [57]:
# Make predictions on the test data
lr_predictions = pipeline.predict(X_test)
lr_score_mae = mean_absolute_error(y_test, lr_predictions)
print(f"Linear Regression MAE: {lr_score_mae}")
lr_r2 = pipeline.score(X_test, y_test)
print(f"Linear Regression R-Squared: {lr_r2}")

Linear Regression MAE: 3.39243260825956
Linear Regression R-Squared: 0.7064490184982686


Random Forest Regression and Linear Regression are two commonly used regression algorithms, each with its own advantages and suitability for different scenarios. Random Forest Regression offers several advantages over Linear Regression, including:

1. **Non-linearity**: Random Forest Regressor is capable of capturing non-linear relationships between features and the target variable. In contrast, Linear Regression assumes a linear relationship between the features and the target. When faced with non-linear relationships or complex feature interactions, Random Forest Regressor can provide more accurate predictions.

2. **Robustness to Outliers**: Random Forest Regressor is generally more robust to outliers compared to Linear Regression. Outliers can disproportionately impact the coefficients and predictions of Linear Regression models. However, as an ensemble of decision trees, Random Forest Regressor can mitigate the effect of outliers by averaging predictions from multiple trees.

3. **Feature Importance**: Random Forest Regressor provides a measure of feature importance, which helps identify the most influential features for making predictions. This information is useful for feature selection, understanding the underlying relationships in the data, and gaining insights into the problem domain. Unlike Linear Regression, which provides coefficient values indicating the direction and magnitude of relationships, Random Forest Regressor explicitly highlights feature importance.

4. **Handling of Categorical Variables**: Random Forest Regressor can effectively handle categorical variables without requiring pre-processing steps like one-hot encoding. It can directly incorporate categorical variables into the model, making it more convenient when working with mixed data types. In contrast, Linear Regression often requires categorical variables to be encoded or transformed before use.

5. **Handling of High-Dimensional Data**: Random Forest Regressor can handle datasets with a large number of features (high dimensionality) by automatically selecting subsets of features during the construction of individual decision trees. This reduces the risk of overfitting, which is a concern with Linear Regression when dealing with high-dimensional data.

6. **Resistance to Multicollinearity**: Random Forest Regressor is less affected by multicollinearity, which occurs when predictor variables are highly correlated. In Linear Regression, highly correlated features can lead to unstable coefficient estimates, making it challenging to interpret the individual effects of each feature. Random Forest Regressor, as an ensemble approach, is less impacted by multicollinearity because each tree is built independently.

Here are your tasks:

  1. Build a Random Forest Regressor model using Scikit-Learn's [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) and train it on the train data.

  2. Evaluate the performance of the model on the test data using mean absolute error as a metric. Mess around with various input parameter configurations to see how they affect the model. Can you beat the performance of the linear regression model?

In [58]:
# Build random forest regressor model
# Make predictions on the test data

rf_model = RandomForestRegressor(n_estimators= 100, max_depth= 20, min_samples_split= 100, min_samples_leaf= 100, random_state= 42)
rf_model.fit(X_train, y_train)

rf_model_predictions = rf_model.predict(X_test)
rf_model_score = mean_absolute_error(y_test, rf_model_predictions)
print(f"Random Forest Regression MAE: {rf_model_score}")

Random Forest Regression MAE: 1.6588712450090126


The output is "Random Forest Regression MAE: 1.6589".  This is markably better than the baseling (~9.1) and the linear regressor (~3.4).  

In [30]:
# Changing Parameters

rf_model = RandomForestRegressor(n_estimators= 50, max_depth= 50, min_samples_split= 1000, min_samples_leaf= 1000, random_state= 42)
rf_model.fit(X_train, y_train)

rf_model_predictions = rf_model.predict(X_test)
rf_model_score = mean_absolute_error(y_test, rf_model_predictions)
print(f"Random Forest Regression MAE: {rf_model_score}")

Random Forest Regression MAE: 1.9465893749444014


The output is "RAndom Forest Regression MAE: 1.9465". 
As expected, a Random Forest Regressor with less estimators produces a higher MAE.

Hyperparameter tuning plays a critical role in machine learning model development. It involves selecting the optimal values for the hyperparameters, which are configuration settings that control the behavior of the learning algorithm. Here's why hyperparameter tuning is so important in ML:

1. **Optimizing Model Performance**: The choice of hyperparameters can significantly impact the model's performance. By fine-tuning the hyperparameters, we can improve the model's accuracy, precision, recall, or other performance metrics. It helps to extract the maximum predictive power from the chosen algorithm and ensures that the model is well-suited to the specific problem at hand.

2. **Avoiding Overfitting and Underfitting**: Hyperparameter tuning helps strike a balance between overfitting and underfitting.

3. **Exploring Model Complexity**: Hyperparameter tuning enables us to explore the complexity of the model. For instance, in algorithms like decision trees or neural networks, we can adjust the number of layers, the number of neurons, or the maximum depth of the tree. By systematically modifying these hyperparameters, we can understand how different levels of complexity impact the model's performance and find the right balance between simplicity and complexity.

Note, there are multiple approaches to hyperparemeter tuning.  

While grid search is the easiest to understand and implement there are many advantages of Bayesian search over grid search for hyperparameter tuning:

1. **Efficiency**: Bayesian search is generally more efficient than grid search. Grid search explores all possible combinations of hyperparameter values, which can be computationally expensive and time-consuming, especially when dealing with a large number of hyperparameters or a wide range of values. Bayesian search, on the other hand, intelligently selects the next hyperparameter configuration to evaluate based on the results of previous evaluations. It focuses on areas of the hyperparameter space that are more likely to yield better performance, reducing the number of evaluations needed.

2. **Flexibility**: Bayesian search is flexible in handling continuous and discrete hyperparameters. It can handle both types of hyperparameters naturally and effectively. In contrast, grid search is more suitable for discrete hyperparameters but may struggle with continuous ones, as it requires discretization or defining a finite set of values to search over.

3. **Adaptive Search**: Bayesian search adapts its search strategy based on the results of previous evaluations. It maintains a probability distribution over the hyperparameter space, updating it with each evaluation. This allows it to dynamically allocate more evaluations to promising regions and explore unexplored areas. In contrast, grid search follows a fixed and predefined search grid, regardless of the results of previous evaluations.

4. **Better Convergence**: Bayesian search has the potential to converge to the optimal hyperparameter configuration more quickly.

Here are your tasks:

  1. Perform a grid-search on a Random Forest Regressor model. Only search the space for the parameters 'n_estimators', 'max_depth', and 'min_samples_split'. Note, this can take some time to run. Make sure you set reasonable boundaries for the search space. Use Scikit-Learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) method.

  2. After you've identified the best parameters, train a random forest regression model using these parameters on the full training data.

  3. Evaluate the model from the previous step using the test data. How does your model perform?

In [59]:
# Apparently grid search needs int64
X_train = X_train.astype(np.float64)
y_train = y_train.astype(np.float64)

In [36]:
# Define the hyperparameters to tune.
# Create the grid
param_grid = {'n_estimators': [25, 50, 100, 200],
              'max_depth': [5, 10, 25],
              'min_samples_split': [10, 100, 1000]}


In the sake of time I will only run the grid search on the first 1500 values.

In [37]:
# Perform grid search to find the best hyperparameters. This could take a while.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor  # Fixed spelling and imported

# subset X_train for speed of grid search
X_train_subset = X_train[:1500]
y_train_subset = y_train[:1500]  # Added corresponding y_train subset

# instantiate the model
rf_class = RandomForestRegressor(random_state=42)  

# instantiate the grid search
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,  
    scoring='neg_mean_absolute_error',
    cv=5,
    refit=True,
    return_train_score=True
)

# Fit the grid search
grid_rf_class.fit(X_train_subset, y_train_subset) 

In [39]:
# Get the best model and its parameters.

cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)

print("Best parameters:", grid_rf_class.best_params_)
print("Best score:", -grid_rf_class.best_score_)  # Negative since we used neg_mean_absolute_error

print(cv_results_df.head())

Best parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 100}
Best score: 2.1498680661813823
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.062432      0.002192         0.003550        0.000070   
1       0.160580      0.021174         0.006234        0.000912   
2       0.298721      0.023653         0.008710        0.001081   
3       0.647597      0.112634         0.016607        0.004159   
4       0.058097      0.001014         0.003581        0.000263   

   param_max_depth  param_min_samples_split  param_n_estimators  \
0                5                       10                  25   
1                5                       10                  50   
2                5                       10                 100   
3                5                       10                 200   
4                5                      100                  25   

                                              params  split0_test_score  \
0  {'ma

Output is "Best parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 100}
Best score: 2.1498680661813823"

In [40]:
# Fit the best classifier on the training data.
best_rf_class = RandomForestRegressor(
    n_estimators= 100, 
    max_depth= 10, 
    min_samples_split= 10, 
    random_state= 42
    )
    

best_rf_class.fit(X_train, y_train)


In [42]:
# Make predictions on the test data
best_rf_class_predictions = best_rf_class.predict(X_test)
best_rf_class_MEA = mean_absolute_error(y_test, best_rf_class_predictions)
print(f"Random Forest Regression MAE: {rf_model_score}")

Random Forest Regression MAE: 1.9465893749444014


Output is "Random Forest Regression MAE: 1.9466"
So I'm a bit surprised that the MAE for the Random Forest Regressor here, using the hyperparameters from the grid search, is less accurate than that from the first Random Forest Regressor.  That said, all those values for the grid search show MAE's closer to 2 than to 1.6 (the best score from my first Random Forest).  It's possible that limiting the test data to the first 1500 rows produced sub-optimal results.  I'll expand the parameters of the grid search and run it on a larger subset of the test data to try and get closer.  

In [60]:
# Define the hyperparameters to tune.
# Create the grid
param_grid = {'n_estimators': [100, 150, 200],
              'max_depth': [10, 20, 50],
              'min_samples_split': [10, 100, 1000]}

In [61]:
# Perform grid search to find the best hyperparameters. This could take a while.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor  # Fixed spelling and imported

# subset X_train for speed of grid search
X_train_subset = X_train[:5000]
y_train_subset = y_train[:5000]  # Added corresponding y_train subset

# instantiate the model
rf_class = RandomForestRegressor(random_state=42)  

# instantiate the grid search
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,  
    scoring='neg_mean_absolute_error',
    cv=5,
    refit=True,
    return_train_score=True
)

# Fit the grid search
grid_rf_class.fit(X_train_subset, y_train_subset) 

In [62]:
# Get the best model and its parameters.

cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)

print("Best parameters:", grid_rf_class.best_params_)
print("Best score:", -grid_rf_class.best_score_)  # Negative since we used neg_mean_absolute_error

print(cv_results_df.head())

Best parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}
Best score: 2.1025472187648337
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       1.054273      0.177548         0.024580        0.014646   
1       1.536921      0.164955         0.024686        0.002734   
2       1.987711      0.153435         0.030918        0.004221   
3       0.741696      0.027195         0.013044        0.000589   
4       1.058122      0.010482         0.017729        0.000140   

   param_max_depth  param_min_samples_split  param_n_estimators  \
0               10                       10                 100   
1               10                       10                 150   
2               10                       10                 200   
3               10                      100                 100   
4               10                      100                 150   

                                              params  split0_test_score  \
0  {'ma

Output reads: Best parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 200}
Best score: 2.1025472187648337

So I guess we are kinda stuck here around a best MAE of around 2 to 2.1.  Maybe my early random forest model was just an outlier in itself.  Even just printing the head of the results cv I see values as low as 1.39 but there must be other values dragging the average up over all the cvs.  

In [63]:
# Make predictions on the test data
best_rf_class_predictions = best_rf_class.predict(X_test)
best_rf_class_MEA = mean_absolute_error(y_test, best_rf_class_predictions)
print(f"Random Forest Regression MAE: {rf_model_score}")

Random Forest Regression MAE: 1.6588712450090126


Output: Random Forest Regression MAE: 1.6588712450090126
So I guess we are back to that 1.65 value.  Interesting.  

### Conclusions
I guess the only real conclusion to be drawn is that, as expected based on the math, lower values for max_depth and min_values_split and higher values for number of regressors result in better accuracy of the model.  

In [72]:
print(X_train.columns)

Index(['VendorID', 'trip_distance', 'payment_type', 'PULocationID',
       'DOLocationID', 'trip_duration'],
      dtype='object')


In [73]:
# Just for fun, what does the model actually tell us?
# This data approximates one of the first rows in the dataset
#A ten minute trip traveling one mile.  What will it cost?

input_values_dict = {'VendorID': 2,
                     'trip_distance': 1.2,
                      'payment_type': 1,
                     'PULocationID': 114,
                     'DOLocationID': 68,
                     'trip_duration': 600
                    }



# Create a pandas DataFrame with proper feature names
input_df = pd.DataFrame([input_values_dict])

# Make prediction
predicted_cost = best_rf_class.predict(input_df)

# Print a more informative result
print("Model Predicts a cost of: ", predicted_cost)

Model Predicts a cost of:  [13.57479106]


Ok.  The model predicts that at around $13.57.  That seems reasonable for that trip.  