# Cab Fare Estimation

This project uses machine learning to build a regression model that predicts the cost of a taxi trip from several numeric features of the ride. The model was built using a dataset of Chicago taxi trips.

## Project Objective

To train and test a supervised learning model that can predict the cost of a taxi ride using several features of the ride. 

## Dataset: Taxi Trips Chicago 2024

__Dataset:__ https://www.kaggle.com/datasets/adelanseur/taxi-trips-chicago-2024

__Data Description__

The initial dataset consisted of 23 columns of mostly numeric features. Each row was an individual trip with the following attributes:

- Trip ID: a unique identifier for each trip
- Taxi ID: identifier for the taxi cab
- Trip Start Timestamp: time trip started
- Trip End Timestamp: time trip ended
- Trip Seconds: time in seconds
- Trip Miles: distance in miles
- Pickup Census Tract: Census Tract where trip began
- Dropoff Census Tract: Census Tract where trip ended
- Pickup Community Area: Community Area where the trip began
- Dropoff Community Area: Community Area where the trip ended
- Fare: fare for the trip
- Tips: tip given for the trip
- Tolls: any tolls accrued during trip
- Extras: any extra charges for the trip
- Trip Total: sum of fare, tolls, tips, extra charges
- Payment Type: payment type used
- Company: taxi company
- Pickup Centroid Latitude: latitude of the center of the pickup census tract or the community area
- Pickup Centroid Longitude: longitude of the center of the pickup census tract or the community area
- Pickup Centroid Location: location of the center of the pickup census tract or the community area
- Dropoff Centroid Latitude: latitude of the center of the dropoff census tract or the community area
- Dropoff Centroid Longitude: longitude of the center of the dropoff census tract or the community area
- Dropoff Centroid Location: location of the center of the dropoff census tract or the community area 

There were 865,247 rows in the dataset. 

### Importing Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

## Data Preparation

In [None]:
#load initial csv
initial_data = pd.read_csv('Taxi_Trips.csv')

#show columns in initial dataset
print('Columns in initial dataframe:')
print(initial_data.columns)

I did not want to keep all of the columns in the initial dataset. I created a new dataframe by selecting only features I thought would be most impactful on trip cost. Also, there were a few columns that contained pickup and dropoff location information and I did not need all of them. So, I chose to keep the latitude and longitude columns since they are numeric and recognizable.

In [None]:
#select desired columns
#Trip Start Timestamp, Trip End Timestamp, Trip Seconds, Trip Miles, Fare, Extras
#Company, Pickup Centroid Latitude, Pickup Centroid Longitude, Dropoff Centroid Latitude, Dropoff Centroid Longitude 

cab_fare_df = initial_data.iloc[: ,[2,3,4,5,10,13,16,17,18,20,21]].copy()
print('Columns in new cab fare dataframe')
print(cab_fare_df.columns)
print(cab_fare_df.dtypes)
initial_rows = len(cab_fare_df)
print(f"Initial length of dataframe: {initial_rows}")

In [None]:
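#drop duplicates (drop_duplicates returns a new dataframe, so the result is assigned back)
cab_fare_df = cab_fare_df.drop_duplicates()

print(f"Length of dataframe after dropping duplicates: {len(cab_fare_df)}")


The length of the dataset did not change in the run reported here, and the row counts quoted below reflect that.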
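
`DataFrame.drop_duplicates()` returns a new dataframe by default rather than modifying the original, so an unchanged length is only meaningful when the result is assigned back. A minimal sketch with made-up values (the toy frame and variable names are for illustration only):

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are invented for illustration)
toy = pd.DataFrame({"Trip Miles": [1.2, 1.2, 3.4], "Fare": [7.5, 7.5, 12.0]})

toy.drop_duplicates()            # returns a new frame; `toy` itself is unchanged
unchanged_len = len(toy)

deduped = toy.drop_duplicates()  # assigning the result keeps the de-duplicated frame
deduped_len = len(deduped)

print(unchanged_len, deduped_len)
```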

In [None]:
#check # of NAs

print("Columns containing null values")
print(cab_fare_df.isnull().sum())

The latitude and longitude fields have the most null values. I do not want to simply drop them, as it's tens of thousands of rows. Instead, I'll fill the nulls with the mean for each column. 

In [None]:
#handle blank latitudes and longitudes by filling in with mean
 
cab_fare_df['Pickup Centroid Latitude'] = cab_fare_df['Pickup Centroid Latitude'].fillna(cab_fare_df['Pickup Centroid Latitude'].mean())

cab_fare_df['Pickup Centroid Longitude'] = cab_fare_df['Pickup Centroid Longitude'].fillna(cab_fare_df['Pickup Centroid Longitude'].mean())
cab_fare_df['Dropoff Centroid Latitude'] = cab_fare_df['Dropoff Centroid Latitude'].fillna(cab_fare_df['Dropoff Centroid Latitude'].mean())
cab_fare_df['Dropoff Centroid Longitude'] = cab_fare_df['Dropoff Centroid Longitude'].fillna(cab_fare_df['Dropoff Centroid Longitude'].mean())
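
The four `fillna` calls above can also be written as one column-wise operation. This is only a sketch on a toy frame with invented coordinates; `coord_cols` is a name introduced here for illustration. Passing the Series of column means to `fillna` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

# Toy coordinates with one missing value per column (invented values)
toy = pd.DataFrame({
    "Pickup Centroid Latitude": [41.88, np.nan, 41.90],
    "Pickup Centroid Longitude": [-87.63, -87.65, np.nan],
})

coord_cols = ["Pickup Centroid Latitude", "Pickup Centroid Longitude"]

# .mean() returns one mean per column; fillna matches them to columns by name
filled = toy.copy()
filled[coord_cols] = filled[coord_cols].fillna(filled[coord_cols].mean())

print(filled)
```

Note that every imputed row ends up at the same "average" point, which is worth keeping in mind when reading location-based plots.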


In [None]:
#drop rows with NA values
print('Columns still containing null values')
print(cab_fare_df.isnull().sum())
cab_fare_df = cab_fare_df.dropna()
print(f"Length of dataframe after dropping NAs: {len(cab_fare_df)}")

There were far fewer null values remaining after filling in the latitudes and longitudes. I dropped the rows that still contained nulls: 2,143 rows in total, or 0.25% of the dataset.

In [None]:
#Chicago coordinates: 41.8832° N, 87.6324° W
print(cab_fare_df['Pickup Centroid Latitude'].describe())
print(cab_fare_df['Pickup Centroid Longitude'].describe())
print(cab_fare_df['Dropoff Centroid Latitude'].describe())
print(cab_fare_df['Dropoff Centroid Longitude'].describe())
#min and max of all coordinates are appropriate 

All of the dropoff and pickup coordinates are in a reasonable range. There is no need to drop any coordinates.

In [None]:
#view trip seconds, miles, fare, and extras

print(cab_fare_df['Trip Seconds'].describe())
print(cab_fare_df['Trip Miles'].describe())
print(cab_fare_df['Fare'].describe())
print(cab_fare_df['Extras'].describe())
print(f"Current length of dataframe: {len(cab_fare_df)}")

For the numeric columns seconds, miles, and fare, the minimum value is zero. These are likely erroneous entries and should not be included in the data. I will keep only rows where the trip lasted more than 60 seconds, covered more than zero miles, and had a fare of more than one dollar.

In [None]:
#filter erroneous entries for miles, seconds, fare

cab_fare_df = cab_fare_df[(cab_fare_df['Trip Seconds']>60) & (cab_fare_df['Trip Miles']>0) & (cab_fare_df['Fare']>1)]
print(f"Length of dataframe after removing erroneous entries: {len(cab_fare_df)}")

print(cab_fare_df['Trip Seconds'].describe())
print(cab_fare_df['Trip Miles'].describe())
print(cab_fare_df['Fare'].describe())


A sizeable share of the data was removed as erroneous. The number of rows fell from 863,104 to 767,127, a difference of 95,977 or 11%.

Next, I filter out outliers that might skew the data. I define an outlier using the interquartile range (IQR) rule: any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

In [None]:
#filter outliers for seconds and miles 

#define a function to return the low and high thresholds for outliers
def find_outliers(df_column):
    q1 = df_column.quantile(0.25)
    q3 = df_column.quantile(0.75)
    iqr = q3 - q1
    low = float(q1 - (1.5*iqr))
    high = float(q3 + (1.5*iqr))
    return low, high

#define a function to find the percentage of values in a column that are outliers
def find_percent_outliers(df_column):
    low, high = find_outliers(df_column)
    prop_outliers = 1-(df_column.between(low, high).sum()/len(df_column))
    percent_outlier = prop_outliers*100
    return percent_outlier

print(f"Percent of Trip Seconds that are outliers: {round(find_percent_outliers(cab_fare_df['Trip Seconds']),4)}%")
print(f"Percent of Trip Miles that are outliers: {round(find_percent_outliers(cab_fare_df['Trip Miles']),4)}%")

#since % of outliers is so low, we can filter out outliers for seconds and miles
s_low, s_high = find_outliers(cab_fare_df['Trip Seconds'])
cab_fare_df = cab_fare_df[cab_fare_df['Trip Seconds'].between(s_low, s_high)]

m_low, m_high = find_outliers(cab_fare_df['Trip Miles'])
cab_fare_df = cab_fare_df[cab_fare_df['Trip Miles'].between(m_low, m_high)]

    

In [None]:
#filtering outliers in fare and extras

print(f"Percent of fares that are outliers: {round(find_percent_outliers(cab_fare_df['Fare']),4)}%")
print(f"Percent of extra charges that are outliers: {round(find_percent_outliers(cab_fare_df['Extras']),4)}%")

#drop these outliers
f_low, f_high = find_outliers(cab_fare_df['Fare'])
cab_fare_df = cab_fare_df[cab_fare_df['Fare'].between(f_low, f_high)]

x_low, x_high = find_outliers(cab_fare_df['Extras'])
cab_fare_df = cab_fare_df[cab_fare_df['Extras'].between(x_low, x_high)]

print(f"Length of dataframe after filtering outliers: {len(cab_fare_df)}")


In [None]:
#calculating % dropped
cleaned_rows = len(cab_fare_df)
dropped_rows = initial_rows - cleaned_rows
percent_dropped = round(dropped_rows/initial_rows,4)*100
print(f"Initial number of rows: {initial_rows}")
print(f"Final number of rows after cleaning: {cleaned_rows}")
print(f"Percent of dataset dropped: {round(percent_dropped,2)}%")

The percentage of outliers in each column was very small, so I dropped them from the dataset to avoid skewing the data. Overall, after dealing with null values, erroneous entries, and outliers, the dataset shrank by 16.45%, to a final row count of 722,878.

### Feature Creation

In [None]:
#sum fare and extras columns 
cab_fare_df['Trip Cost'] = cab_fare_df['Fare'] + cab_fare_df['Extras']

I summed the fare and extras columns into one "trip cost" column to be the target variable. 

In [None]:
#convert pickup timestamp column

cab_fare_df['Trip Start Timestamp'] = pd.to_datetime(cab_fare_df['Trip Start Timestamp'], format="%m/%d/%Y %I:%M:%S %p")
#cab_fare_df['Trip End Timestamp'] = pd.to_datetime(cab_fare_df['Trip End Timestamp'])

#create month column 
cab_fare_df['Month'] = cab_fare_df['Trip Start Timestamp'].dt.month
print(cab_fare_df['Month'].unique())

#create hour column
cab_fare_df['Hour'] = cab_fare_df['Trip Start Timestamp'].dt.hour 
print(cab_fare_df['Hour'].unique())


I converted the trip start timestamp column into month and hour columns. 

This dataset only contains rides from January-March 2024. The month column, therefore, will not be useful for analysis since not all the months are represented.

The hours column, however, now contains the hour of the day the trip began in numeric form. All 24 hours are represented in the data.
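
As a quick check of the timestamp format string used above, the sketch below parses two invented timestamps and extracts the month and hour. The 12-hour `%I` plus `%p` combination is what turns "PM" times into 24-hour values.

```python
import pandas as pd

# Invented timestamps in the same layout as the Trip Start Timestamp column
raw = pd.Series(["01/15/2024 03:30:00 PM", "03/02/2024 12:05:00 AM"])

parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

months = parsed.dt.month.tolist()  # calendar month, 1-12
hours = parsed.dt.hour.tolist()    # hour of day, 0-23

print(months, hours)
```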

In [None]:
#convert seconds to minutes
cab_fare_df['Minutes'] = cab_fare_df['Trip Seconds']/60

I decided to convert the seconds column into minutes to improve interpretability. It is generally easier to contextualize time in minutes than in seconds.

In [None]:
#checking current columns
print(cab_fare_df.dtypes)

## Correlation Analysis

In [None]:
#correlation matrix for numeric variables
correlation = cab_fare_df.corr(numeric_only=True)

#heatmap of correlation matrix
sns.heatmap(correlation, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation for Numeric Variables in Taxi Dataset')
plt.show()

Looking at the correlation matrix heatmap, the variables most strongly associated with the dependent variable, trip cost, are seconds/minutes, miles, and pickup longitude. Fare and extras are also associated with cost, because those are the two variables summed to calculate it.

There are very weak associations between trip cost and the pickup month and hour. There is no observed association between cost and dropoff latitude. 

### Target Encoding

In [None]:
#view all taxi companies

all_companies = cab_fare_df['Company'].unique()
print(all_companies, len(all_companies))

In [None]:
#target encoding

#sample ~20% of dataframe
target_sample = cab_fare_df.sample(n=150000, random_state=456)

target_mean = target_sample.groupby('Company')['Trip Cost'].mean()
target_sample['Company_mean'] = target_sample['Company'].map(target_mean)

company_correlation = target_sample['Company_mean'].corr(target_sample['Trip Cost'])
print(f"The correlation between taxi company and trip cost is {company_correlation}")

After target encoding, the correlation between company and cost was only 0.129, a weak association. I therefore felt comfortable leaving company out of any further analysis.
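
For readers unfamiliar with target encoding, the toy example below shows the mechanics on invented data: each company label is replaced by the mean trip cost of that company, and the encoded column is then correlated with the target. Because the encoding here is built from the same rows it is evaluated on, the correlation tends to be optimistic, so a low value is a fairly safe sign of a weak relationship.

```python
import pandas as pd

# Invented companies and costs, for illustration only
toy = pd.DataFrame({
    "Company": ["A", "A", "B", "B", "C"],
    "Trip Cost": [10.0, 14.0, 20.0, 24.0, 30.0],
})

# Mean cost per company: A = 12, B = 22, C = 30
company_mean = toy.groupby("Company")["Trip Cost"].mean()

# Replace each label with its company's mean cost
toy["Company_mean"] = toy["Company"].map(company_mean)

encoded_corr = toy["Company_mean"].corr(toy["Trip Cost"])
print(toy)
print(encoded_corr)
```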

## Exploratory Data Analysis

In [None]:
#means of numeric columns

means = cab_fare_df.mean(numeric_only=True)
print(means)


In [None]:
#creating sample dataframe for EDA

sample = cab_fare_df.sample(n=50000, random_state=123)

I created a sample dataset of 50,000 rows to make plotting more efficient and more visually appealing. 



In [None]:
#boxplots of minutes and miles features

fig = make_subplots(rows=1, cols=2, subplot_titles=("Minutes", "Miles"))

# minutes
fig.add_trace(
    go.Box(y = sample['Minutes'], name="Minutes"), row=1, col=1)

# miles
fig.add_trace(
    go.Box(y = sample['Trip Miles'], name="Miles"), row=1, col=2)

fig.update_layout(width=800, height=400, title="Plots of Minutes and Miles") 
fig.show()


The boxplot of trip time in minutes shows that the median time for trips is about 15 minutes. With the exception of outliers, the longest trip is about 55 minutes and the shortest trip is one minute. 

The median number of miles for a trip is 3.8 with the majority of trips being less than 28 miles. 

In [None]:
#histograms of fares, extra charges, and total cost

dist = make_subplots(rows=3, cols=1, subplot_titles=("Fare", "Extras", "Total Cost"))

dist.add_trace(
    go.Histogram(x=sample["Fare"], name="Distribution of Cab Fares", histnorm="percent",
                xbins=dict(start=0, end=70, size=2)), row=1, col=1)

dist.add_trace(
    go.Histogram(x=sample["Extras"], name="Distribution of Extra Charges", histnorm="percent",
                xbins=dict(start=0, end=10, size=0.5)), row=2, col=1)

dist.add_trace(
    go.Histogram(x=sample["Trip Cost"], name="Distribution of Total Costs", histnorm="percent",
                xbins=dict(start=0, end=80, size=2)), row=3, col=1)

dist.update_layout(width=1100, height=800, title="Distributions of Fares, Extras, and Totals") 
dist.update_yaxes(title_text="Percentage", row=1, col=1)
dist.update_yaxes(title_text="Percentage", row=2, col=1)
dist.update_yaxes(title_text="Percentage", row=3, col=1)

dist.update_xaxes(title_text="Fare in USD", row=1, col=1)
dist.update_xaxes(title_text="Extra Charges in USD", row=2, col=1)
dist.update_xaxes(title_text="Total Cost in USD", row=3, col=1)

dist.show()


The distributions of fare and total cost line up almost perfectly. This makes sense as total cost is the sum of fare and extras, and most extra charges fall between zero and one dollar. So cost is going to be more determined by fare than by extra charges. 

All three of the cost-related distributions are right-skewed. Most fares are low, but a few trips with higher charges pull the distributions to the right.

In [None]:
#histogram of miles
miles_dist = px.histogram(sample, x="Trip Miles", histnorm="percent")
miles_dist.update_xaxes(title_text="Miles")
miles_dist.update_yaxes(title_text="Percentage")
miles_dist.update_layout(title="Distribution of Trip Miles")

miles_dist.show()

The distribution of miles is right-skewed; the mean number of miles is 6.7 while the median is 3.8. Trips of less than five miles make up most of the data, and a relatively small number of trips over 20 miles stretch the right tail of the distribution.

In [None]:
#histogram of minutes
m_dist = px.histogram(sample, x="Minutes", histnorm="percent")
m_dist.update_traces(xbins=dict(start=0, end=60, size=1))
m_dist.update_xaxes(title_text="Minutes")
m_dist.update_yaxes(title_text="Percentage")
m_dist.update_layout(title="Distribution of Trip Minutes")

m_dist.show()

The distribution of trip duration in minutes is also right-skewed; the mean number of minutes is 18.48 and the median is 15.58. 

In [None]:
#scatterplot of miles v cost (trendline='ols' requires the statsmodels package)

m_plot = px.scatter(sample, x='Trip Miles', y='Trip Cost', trendline='ols', trendline_color_override='black',
                    render_mode="webgl")
m_plot.update_layout(title='Trip Miles vs Trip Cost in USD')
m_plot.show()

This plot shows that there is a positive trend between the lengths of trips and their costs; as the number of miles increases, the cost of the trip also increases. There are also a few vertical outliers closer to the lower end of the number of miles. These shorter trips are more expensive than trips of a similar distance tend to be. This could be due to extra charges or traffic increasing the duration of trip time without increasing the mileage. 

In [None]:
#scatterplot of minutes v cost (trendline='ols' requires the statsmodels package)

s_plot = px.scatter(sample, x='Minutes', y='Trip Cost', trendline='ols', trendline_color_override='black',
                    render_mode="webgl")
s_plot.update_layout(title='Trip Time in Minutes vs Trip Cost in USD')
s_plot.update_traces(marker=dict(color='green'))
s_plot.show()

This plot also shows a positive trend between trip cost and trip time in minutes. There is more scatter in this plot than in the plot of miles. There tends to be a wider range of costs associated with any given trip duration in minutes. This could be due to differences in pricing at different hours or maybe charges associated with certain locations like airports. 

In [None]:
#pickup density heatmap 

hb = plt.hexbin(sample['Pickup Centroid Longitude'], sample['Pickup Centroid Latitude'], gridsize=50, cmap='viridis')
plt.close()

verts = hb.get_offsets()
counts = hb.get_array()

fig = go.Figure(go.Scatter(x=verts[:,0], y=verts[:,1], mode='markers', marker=dict(size=10, color=np.log1p(counts),
        colorscale='Viridis', showscale=True, colorbar=dict(title="Log of Pickup Counts per Location"))))

fig.update_layout(width=1000, height=800, xaxis_title='Longitude', yaxis_title='Latitude', title='Pickup Density')
fig.show()

This pickup density grid shows that a few locations account for a disproportionately high number of pickups, while most latitude/longitude cells recorded comparatively few. The counts were so concentrated that I plotted the log of the counts to bring out more distinction in pickup density.

In [None]:
#heatmap of average duration of trip across different hours and distances

miles = np.linspace(sample['Trip Miles'].min(), sample['Trip Miles'].max(), 28)
sample['Miles_bin'] = pd.cut(sample['Trip Miles'], bins=miles)
sample['Miles_bin_center'] = sample['Miles_bin'].apply(lambda x: x.mid)


agg = sample.groupby(['Hour', 'Miles_bin_center'],observed=True)['Minutes'].mean().reset_index(name='Avg_Minutes')

pivot = agg.pivot(index='Hour', columns='Miles_bin_center', values='Avg_Minutes')
t_heatmap = go.Figure(go.Heatmap(z=pivot.values, x=pivot.columns, y=pivot.index, colorscale='Viridis',
    colorbar=dict(title='Avg Trip Minutes')))

t_heatmap.update_layout(title='Average Trip Minutes by Trip Miles and Hour of Day', xaxis_title='Trip Miles',
    yaxis_title='Hour of Day', width=1000, height=600)

t_heatmap.show()

This heatmap illustrates the relationship between hour of day, trip distance, and trip duration in minutes. It confirms what one would expect: shorter trips take less time than longer ones, regardless of time of day. Trips over 20 miles are also rarer, especially in the early morning. One noteworthy observation is that for trips between 15 and 20 miles, the average duration appears to increase around 7-8am and again around 3-6pm. This likely corresponds to morning and evening rush-hour traffic.
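
One detail of the binning step is worth knowing: `pd.cut` builds right-closed intervals by default, so a value equal to the left edge of the first bin (here, the minimum mileage) is left unbinned unless `include_lowest=True` is passed. The toy example below, with invented mileages, shows the effect; in the heatmap cell it would only affect rows at the exact minimum.

```python
import numpy as np
import pandas as pd

# Invented mileages; 0.0 is the minimum and sits on the first bin edge
toy_miles = pd.Series([0.0, 1.0, 5.0, 10.0])
edges = np.linspace(toy_miles.min(), toy_miles.max(), 3)  # edges at 0, 5, 10

# Default: intervals are (0, 5] and (5, 10], so 0.0 falls outside
bins = pd.cut(toy_miles, bins=edges)
n_unbinned = int(bins.isna().sum())

# include_lowest=True widens the first interval to capture the minimum
bins_incl = pd.cut(toy_miles, bins=edges, include_lowest=True)
n_unbinned_incl = int(bins_incl.isna().sum())

# Midpoint of the first default bin, as used for the heatmap's x-axis
first_center = bins.cat.categories[0].mid

print(n_unbinned, n_unbinned_incl, first_center)
```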

In [None]:
#stacked bar of hour vs avg cost

avg_cost = cab_fare_df['Trip Cost'].mean()
print(avg_cost)

#bin cost
sample['more_than_avg'] = np.where(sample['Trip Cost']>avg_cost, 'yes', 'no')

grouped = (sample.groupby(['Hour', 'more_than_avg'], observed=True).size().reset_index(name='Number of Trips'))

bar = px.bar(grouped, x='Hour', y='Number of Trips', color='more_than_avg', title="Hourly Trips by Avg Costs")
bar.show()

This bar chart shows whether trip costs tend to fall above or below the overall average cost depending on the time of day. From 8pm to 12am, and again from 5 to 6am, trips costing more than the overall average of $22.29 outnumber those costing less. From 7am until 8pm, the number of trips increases and a larger share of them cost less than the average.

## Feature Transformation

The exploratory data analysis revealed that several numeric features of the dataframe are skewed. I will apply log transformation to those features to try and mitigate the skewness. 

In [None]:
#log transform minutes, miles, and cost columns

transform_columns = ['Minutes', 'Trip Miles', 'Trip Cost']
cab_fare_df[transform_columns] = np.log1p(cab_fare_df[transform_columns])

In [None]:
#keep only numeric features

cab_fare_df = cab_fare_df.select_dtypes(include=np.number)

### Splitting and Scaling Data

Before feature selection, I will need to split the data into testing and training sets and also apply scaling.

For my features, I dropped the target variable, cost, as well as fare and extra since they are represented in the cost column. I also dropped the seconds column since I'd rather use the minutes column.

I split the data 70/30 for training and testing. I then scaled the features using the StandardScaler, since outliers were already filtered.

In [None]:
#split data

#print(cab_fare_df.columns)

x = cab_fare_df.drop(['Trip Cost', 'Fare', 'Extras', 'Trip Seconds'], axis=1)
y = cab_fare_df['Trip Cost']

x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=0.3, random_state=123)

In [None]:
#scale the data

scaler = StandardScaler().set_output(transform="pandas")

x_train = scaler.fit_transform(x_train)
x_temp = scaler.transform(x_temp)

### Feature selection

I used scikit-learn's univariate feature selection (`SelectKBest` with the `f_regression` score function) to select the top five features for my model.

In [None]:
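# feature selection: f_regression

selector = SelectKBest(f_regression, k=5)
selector.fit(x_train, y_train)

print("Input features: ", selector.feature_names_in_)
print("Selected Features: ", selector.get_feature_names_out())

#keep only the selected features so the models below are trained on them
selected_features = selector.get_feature_names_out()
x_train = x_train[selected_features]
x_temp = x_temp[selected_features]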

__The top five features are: number of miles, pickup latitude and longitude, dropoff longitude, and time duration in minutes.__

## Model Training and Evaluation

I split the held-out 30% in half to create a validation set and a test set, giving a 70/15/15 split overall. I used the validation set to evaluate and fine-tune the models, keeping the test set untouched until the end to avoid data leakage.

In [None]:
#creating validation set
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=0.5, random_state=123)
print("Records in validation set = ", len(x_val))
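
The two-stage split can be checked on a toy array. This sketch uses 1,000 invented rows and the same `test_size` and `random_state` values as above; the capitalized variable names are only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 invented rows standing in for the real features and target
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Stage 1: hold out 30% of the rows
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=123
)

# Stage 2: split the held-out rows in half for validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=123
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)
```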

### Multiple Linear Regression

My first model was a multiple linear regression model using the five selected features. 

In [None]:
# fit MLR model

lr = LinearRegression()
lr.fit(x_train, y_train)

In [None]:
# test mlr model

mlr_pred = lr.predict(x_val)

In [None]:
# undo log

mlr_pred = np.expm1(mlr_pred)      
y_val_ex = np.expm1(y_val) 

In [None]:
#evaluate model using mae

mae_mlr = mean_absolute_error(y_val_ex, mlr_pred)
print("MAE for multiple regression model: ", mae_mlr)

After fitting the MLR model and predicting on the validation set, I applied `np.expm1` (the inverse of `np.log1p`) to the predictions and the actual y-values to return them to their original units.

I chose to use mean absolute error as the evaluation metric due to its interpretability; it is straightforward to understand that predictions are off by about $3.50 on average.
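
The sketch below illustrates both points with invented numbers: `np.expm1` exactly undoes `np.log1p`, and computing MAE after the inverse transform gives an error in dollars rather than in log units.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented true costs and predictions, in dollars
cost_usd = np.array([10.0, 20.0, 30.0])
pred_usd_true = np.array([12.0, 18.0, 33.0])

# What the model sees and outputs: log1p-transformed values
cost_log = np.log1p(cost_usd)
pred_log = np.log1p(pred_usd_true)

# Undo the transform before scoring
recovered = np.expm1(cost_log)
pred_usd = np.expm1(pred_log)

# Absolute errors are 2, 2, and 3 dollars, so the MAE is 7/3
mae_usd = mean_absolute_error(recovered, pred_usd)
print(mae_usd)
```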

### Random Forest 

I decided to build a second model, a Random Forest model, to better capture any complex, non-linear patterns in the data that the linear regression model couldn't. 

In [None]:
# fit random forest

rf = RandomForestRegressor(n_estimators=100, max_depth=8, max_features='sqrt', bootstrap=True,random_state=123)
rf.fit(x_train, y_train)


In [None]:
# test random forest model

rf_pred = rf.predict(x_val)

rf_pred = np.expm1(rf_pred)

In [None]:
# evaluate model using mae

mae_rf = mean_absolute_error(y_val_ex, rf_pred)
print("MAE for random forest model: ", mae_rf)

I started with a smaller number of trees and a smaller max depth as a baseline for the model. With 100 trees and a max depth of 8, the MAE is $1.77.

### Fine-tuning Random Forest

I changed the values of the hyperparameters from the original Random Forest model by increasing the number of trees and the max depth.

In [None]:
# fine tuning random forest

rf2 = RandomForestRegressor(n_estimators=300, max_depth=15, max_features='sqrt', bootstrap=True,random_state=123)
rf2.fit(x_train, y_train)

In [None]:
# test random forest model

rf2_pred = rf2.predict(x_val)
rf2_pred = np.expm1(rf2_pred)

In [None]:
# evaluate model using mae

mae_rf2 = mean_absolute_error(y_val_ex, rf2_pred)
print("MAE for fine-tuned random forest model: ", mae_rf2)

In [None]:
# r2 score

r2_rf2 = r2_score(y_val_ex, rf2_pred)
print("R2 score for fine-tuned model: ", r2_rf2)
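
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The toy calculation below, on invented values, reproduces `r2_score` by hand.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented actual and predicted costs
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 39.0])

# Residual sum of squares: 4 + 4 + 9 + 1 = 18
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares around the mean of 25: 225 + 25 + 25 + 225 = 500
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot  # 1 - 18/500 = 0.964
print(r2_manual, r2_score(y_true, y_pred))
```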

I adjusted the hyperparameters several times relative to the first random forest model. Each adjustment led to a lower mean absolute error, up to the final model. With 300 trees and a max depth of 15, the mean absolute error came out to $1.24. This was a satisfactory MAE, but I also computed the R-squared value as a second check on the model's fit. The R-squared was 0.96, which is very high.
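
The manual adjustments described above could also be organized as a small loop that scores each candidate on the validation set. This is only a sketch on synthetic data, not the procedure actually used here: the data, the candidate settings, and the names `candidates`, `val_mae`, and `best_params` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic non-linear regression problem standing in for the taxi data
X = rng.uniform(0, 10, size=(600, 3))
y = np.sin(X[:, 0]) * 5 + X[:, 1] ** 2 + rng.normal(scale=0.5, size=600)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=123
)

# Hypothetical candidate settings, kept small so the example runs quickly
candidates = [
    {"n_estimators": 20, "max_depth": 3},
    {"n_estimators": 50, "max_depth": 10},
]

# Fit each candidate on the training set and score it on the validation set
val_mae = {}
for params in candidates:
    model = RandomForestRegressor(max_features="sqrt", random_state=123, **params)
    model.fit(X_train, y_train)
    key = (params["n_estimators"], params["max_depth"])
    val_mae[key] = mean_absolute_error(y_val, model.predict(X_val))

# Pick the setting with the lowest validation MAE
best_params = min(val_mae, key=val_mae.get)
print(val_mae, best_params)
```

Keeping the comparison on the validation set, as done in this project, leaves the test set free for a single final evaluation.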

## Final Model Test

After fine-tuning and settling on the final model, I tested it on the actual test set and evaluated again using MAE and R-squared.

In [None]:
#undo log transformation

y_test = np.expm1(y_test)


In [None]:
# test model on actual test set

rf2_test_pred = rf2.predict(x_test)
rf2_test_pred = np.expm1(rf2_test_pred)

In [None]:
# evaluate model 

final_mae = mean_absolute_error(y_test, rf2_test_pred)
final_r2 = r2_score(y_test, rf2_test_pred)

print("Final MAE: ", final_mae)
print("Final R2 score: ", final_r2)

## Conclusion

The dataset began with 23 initial features. After feature engineering, I chose the top five features to use in the model: number of miles, pickup latitude and longitude, dropoff longitude, and number of minutes. 

My initial model was a linear regression model with those five features. That model had a mean absolute error of $3.50.

The first random forest regressor did much better than the linear regression model, with a mean absolute error of $1.77.

After fine-tuning the random forest model by increasing the number of trees and maximum depth, the mean absolute error decreased again to $1.24, which is acceptable. The R-squared value for the model was 0.96, meaning 96% of the variability in cost can be explained by the model, which is very high.

I did not want to increase the complexity of the model any further with additional trees or a higher max depth, so I chose this as the final model.

On the test set, the final model's mean absolute error matched the validation result, $1.24 after rounding. The R-squared value also remained at 0.96. The model was able to predict trip cost with an average error of only a little over one dollar.