# Assignment 10 Arhat Shah

In [1]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder


In [5]:
tripdata_df = pd.read_parquet(path = 'https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2020-02.parquet', #provide the URL to the data source
                      engine = 'fastparquet')

# Include any previous data preparation steps, EDA and visualizations. It's OK to copy and paste your code. 
# However, ensure that you update the code based on the previous feedback from the TAs.

# Commented out print statements are used to check the data set and the data types of the variables.

# print("dimensions: ", tripdata_df.shape)
# print(tripdata_df.dtypes)

# The variables do not all have proper types. 
# The variables without suitable types: RatecodeID, passenger_count, trip_type, and payment_type. 

# print(tripdata_df.isnull().sum())

# Missing values are a widespread issue because multiple columns have null values. 
# The method that can be used to handle missing data is to delete the observations with the missing values. 

# The ehail_fee does not need to be included in the data set because all values are null. 
# print(tripdata_df.loc[:,'ehail_fee'])

# Congestion_surcharge is not included in the data dictionary 
# print(tripdata_df.loc[:,'congestion_surcharge'])

# There are trip total amounts that are negative 
# print(tripdata_df.loc[tripdata_df['total_amount'] < 0.00, :])

# deleting missing and invalid data 
tripdata_df = tripdata_df[tripdata_df.isnull().sum(axis=1) < 4] 
# drop() returns a new DataFrame, so the result must be assigned back
tripdata_df = tripdata_df.drop(columns = ['ehail_fee', 'congestion_surcharge']) 

# make the variables to have proper types 
tripdata_df['store_and_fwd_flag'] = tripdata_df['store_and_fwd_flag'].astype('str') 
tripdata_df['payment_type'] = tripdata_df['payment_type'].astype(int) 
tripdata_df['RatecodeID'] = tripdata_df['RatecodeID'].astype(int) 
tripdata_df['passenger_count'] = tripdata_df['passenger_count'].astype(int) 
tripdata_df['trip_type'] = tripdata_df['trip_type'].fillna(0).astype(int) 

# Drop any rows with null values in tip_amount
tripdata_df.dropna(subset=['tip_amount'], inplace=True)


# negative total amounts removed 
tripdata_df = tripdata_df[tripdata_df['total_amount'] > 0]

In [6]:
# Select the required features to build your model based on the insights from your EDA. 
# Briefly explain the reason for the features that you selected. 
# Ensure that you encode any categorical features.

features = ['RatecodeID', 'passenger_count', 'trip_distance', 'PULocationID', 'DOLocationID']

# These features were selected because each can plausibly affect the tip amount:
# longer trips cost more, and the rate code, passenger count, and pickup/drop-off zones describe the kind of trip.

# Selecting target variable
target = 'tip_amount'

# select only required features and target variable
tripdata_df = tripdata_df[features + [target]]

# encode any categorical features 
label_encoders = {}
for feature in ['RatecodeID', 'PULocationID', 'DOLocationID']:
    label_encoders[feature] = LabelEncoder()

for feature, encoder in label_encoders.items():
    tripdata_df[feature] = encoder.fit_transform(tripdata_df[feature].astype(str))



In [7]:
# Partition the data into train/test split.
X_train, X_test, y_train, y_test = train_test_split(tripdata_df.drop('tip_amount', axis=1), tripdata_df['tip_amount'], test_size=0.2, random_state=42)


In [8]:
# Build a model that predicts the tip amount.

# Create model
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit model
rf_reg.fit(X_train, y_train)

# Predict
y_pred = rf_reg.predict(X_test)

# Evaluate the model using mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 9.768485542098125
R-squared: 0.05030988435573991


### Evaluate the predictions from your model and comment on the results. Ensure that you choose the correct metric. Remember that we evaluate models differently depending on the task, i.e. classification or regression.

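We assessed our regression model using mean squared error (MSE) as the primary metric, with R-squared as a secondary check. The model obtained an MSE of 9.77. Because MSE is measured in squared dollars, its square root (RMSE) is the more interpretable figure: the typical gap between the predicted and actual tip amount is roughly 3.13 dollars, not 9.77 dollars. The R-squared of 0.05 means the model explains only about 5% of the variance in tip amounts, so it barely improves on always predicting the mean tip. A lower MSE indicates a better model, although how low is low enough depends on the situation and the degree of precision required. There is considerable room to improve this model.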
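To make the relationship between these metrics concrete, here is a minimal sketch that computes MSE, RMSE, and MAE on a tiny made-up example and then converts the reported MSE into an RMSE. The toy values and the helper names (`mse`, `rmse`, `mae`) are illustrative assumptions, not part of the original analysis.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared errors (units: squared dollars)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE (units: dollars)."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of absolute errors (units: dollars)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy tip amounts, made up for illustration only
toy_true = [0.0, 2.0, 5.0]
toy_pred = [1.0, 2.0, 3.0]

print("toy MSE: ", mse(toy_true, toy_pred))
print("toy RMSE:", rmse(toy_true, toy_pred))
print("toy MAE: ", mae(toy_true, toy_pred))

# Converting the MSE reported by the model above into dollars
reported_mse = 9.768485542098125
reported_rmse = math.sqrt(reported_mse)
print(f"reported RMSE: {reported_rmse:.2f} dollars")
```

RMSE and MAE are both expressed in dollars, which makes them easier to compare against a typical tip than the raw MSE.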

### How do you feel about the model? Does it do a good job of predicting the tip_amount?

Overall, the model does not do a good job of predicting tip_amount. An MSE of 9.77 together with an R-squared of about 0.05 means the selected features capture very little of what determines a tip. It's also important to note that the model's data originates from just one month (February 2020), so its effectiveness could differ on data from a different month, for example if tipping behavior changes over the year. The model needs to be improved before its predictions would be useful.
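One way to judge how far the model is from a trivial predictor is to back out the baseline error from the two reported numbers. Since R-squared, as computed by `r2_score`, equals 1 minus the ratio of the model's MSE to the MSE of always predicting the mean of the test targets, the baseline MSE can be recovered by simple arithmetic. The sketch below only rearranges that formula; the variable names are illustrative.

```python
import math

# Values reported by the model above
mse_model = 9.768485542098125
r2_model = 0.05030988435573991

# R-squared = 1 - MSE_model / MSE_baseline, so:
mse_baseline = mse_model / (1.0 - r2_model)

rmse_model = math.sqrt(mse_model)
rmse_baseline = math.sqrt(mse_baseline)

print(f"Baseline (always predict the mean) MSE: {mse_baseline:.2f}")
print(f"Baseline RMSE: {rmse_baseline:.2f} dollars")
print(f"Model RMSE:    {rmse_model:.2f} dollars")
print(f"RMSE reduction over baseline: {rmse_baseline - rmse_model:.2f} dollars")
```

Under this reading, the random forest trims the typical error by less than ten cents compared with predicting the average tip for every ride.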

In [9]:
# Tweak the model

# Create a random forest regressor with 50 trees
rf = RandomForestRegressor(n_estimators=50, random_state=42)

# Fit the model on the training data
rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf.predict(X_test)

# Evaluate the model using mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (n_estimators=50):", mse)
print("R-squared (n_estimators=50):", r2)

# Create a random forest regressor with 500 trees
rf = RandomForestRegressor(n_estimators=500, random_state=42)

# Fit the model on the training data
rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf.predict(X_test)

# Evaluate the model using mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (n_estimators=500):", mse)
print("R-squared (n_estimators=500):", r2)

# Create a random forest regressor with 1000 trees
rf = RandomForestRegressor(n_estimators=1000, random_state=42)

# Fit the model on the training data
rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf.predict(X_test)

# Evaluate the model using mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (n_estimators=1000):", mse)
print("R-squared (n_estimators=1000):", r2)

Mean Squared Error (n_estimators=50): 9.79343792148943
R-squared (n_estimators=50): 0.04788401926461572
Mean Squared Error (n_estimators=500): 9.737126599221279
R-squared (n_estimators=500): 0.05335859420527067
Mean Squared Error (n_estimators=1000): 9.726635585151305
R-squared (n_estimators=1000): 0.05437852839285806


As the n_estimators parameter increases from 50 to 1000, the MSE decreases slightly, from 9.79 to 9.73, and R-squared rises from 0.048 to 0.054. Averaging over more trees reduces the variance of the forest's predictions, which accounts for the small gain. However, the improvement is only about 0.07 in MSE (less than 1%), so adding trees alone does not make the model meaningfully better; it mainly increases training time.
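The same comparison can be written as a loop, which keeps each printed label tied to the value actually passed to the model. This is a minimal sketch on small synthetic data so that it runs quickly on its own; the data, the tree counts, and the `results` dictionary are assumptions for illustration, not the taxi data set or the settings used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: only the first column carries signal
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 3))
y = 0.5 * X[:, 0] + rng.normal(0, 1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for n_trees in (5, 20, 50):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Store both metrics keyed by the tree count that produced them
    results[n_trees] = (
        mean_squared_error(y_test, y_pred),
        r2_score(y_test, y_pred),
    )

for n_trees, (mse, r2) in results.items():
    print(f"n_estimators={n_trees}: MSE={mse:.3f}, R-squared={r2:.3f}")
```

Swapping in the real `X_train`, `X_test`, `y_train`, and `y_test` and a tuple such as `(50, 500, 1000)` would reproduce the experiment above in a few lines.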

In [10]:
# Bonus: The Random forest has a method that returns the importance of each feature in your model. 
# Can you find out which of your selected features were the most important when making the predictions?

# random forest model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get importance scores of features
importances = rf.feature_importances_

# DataFrame with feature names and importance scores
feature_importances = pd.DataFrame({'feature': X_train.columns, 'importance': importances})

# Sort the DataFrame by importance score in descending order
feature_importances = feature_importances.sort_values('importance', ascending=False)

# Print out the feature importances
print(feature_importances)

           feature  importance
2    trip_distance    0.504239
4     DOLocationID    0.230406
3     PULocationID    0.186424
1  passenger_count    0.052820
0       RatecodeID    0.026111
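
The scores above come from `feature_importances_`, which is impurity-based. The scikit-learn documentation notes that impurity-based importances can be inflated for high-cardinality features, and the encoded location IDs take many distinct values, so their ranking may deserve a second look. A possible cross-check is permutation importance on the test set. The sketch below uses synthetic data with made-up column names (`signal`, `noise_id`) purely to illustrate the call; it is not a result for the taxi data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
signal = rng.uniform(0, 10, size=n)       # drives the target
noise_id = rng.integers(0, 200, size=n)   # many distinct values, unrelated to the target
X = np.column_stack([signal, noise_id])
y = 2.0 * signal + rng.normal(0, 0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the drop in score
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)

for name, mean_imp in zip(["signal", "noise_id"], result.importances_mean):
    print(f"{name}: {mean_imp:.3f}")
```

Because the scores are measured on held-out data, a feature that the trees split on often but that does not help prediction should receive an importance near zero.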
