## Random Forest Regression Model 
The Random Forest Regressor was selected for its effectiveness in regression tasks, especially on complex datasets (level 3) that may contain non-linear relationships. The choice was driven by the need for a model that balances accuracy and interpretability.

### Steps to Create the Model and Prediction
#### Loading the Data: 
The preprocessed dataset was loaded as a DataFrame. It contains various features related to taxi trips.

#### Data Preparation:
The target variable (fare_amount) was removed from the features (X) and stored in y. Columns judged irrelevant (i.e., 'congestion_surcharge', 'improvement_surcharge', 'passenger_count', 'mta_tax', 'tolls_amount', 'is_peak_month', 'payment_type_Unknown', 'RatecodeID_6.0') were dropped from X so that the model focuses on the features most relevant for predicting fare_amount. These columns were judged irrelevant based on the model's feature importance values. 'total_amount' was also dropped because its high correlation with the target resulted in overfitting. This step ensures that the model trains on features that are likely to influence the outcome.
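
One way to check a suspected problem column such as 'total_amount' before dropping it is to look at its correlation with the target. The sketch below is only an illustration: it uses a small synthetic frame, and both the values and the way `total_amount` is built from the fare are assumptions, not the real trip data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the trip data (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
fare = rng.uniform(5, 50, size=500)
tip = rng.uniform(0, 5, size=500)
demo = pd.DataFrame({
    "fare_amount": fare,
    "tip_amount": tip,
    # Assumption: the total is the fare plus other charges, so it nearly encodes the target
    "total_amount": fare + tip + 0.5,
})

# Pearson correlation of every other column with the target
corr_with_target = demo.corr()["fare_amount"].drop("fare_amount")
print(corr_with_target.sort_values(ascending=False))
```

A column whose correlation with the target is close to 1, as `total_amount` is in this synthetic frame, is a candidate for removal because the model could learn the target almost directly from it.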

#### Splitting the Data:
The dataset was then split into training and testing sets, with 80% of the data used for training and 20% for testing (random_state=42). This split allows the model to learn from one portion of the data while its performance is evaluated on data it has not seen.

#### Initializing the Model:
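The Random Forest Regressor was initialized with 100 trees (i.e., n_estimators=100) and random_state=42.

#### Training the Model:
The model was trained on the training set (i.e., X_train and y_train) using the fit() method.
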
#### Making Predictions:
Predictions were made on the test set (i.e., X_test) using the predict() method, which generated predicted values for the target, fare_amount.



from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd

file_path = 'raw_tripdata_2022-01.csv'
data_cleaned = pd.read_csv(file_path)

# Drop the target and also unimportant features
unimportant_features = ['congestion_surcharge','improvement_surcharge','passenger_count','mta_tax','tolls_amount','is_peak_month', 'payment_type_Unknown','RatecodeID_6.0']
X = data_cleaned.drop(columns=['fare_amount','total_amount'] + unimportant_features)
y = data_cleaned['fare_amount']


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model's performance
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# For a regressor, score() returns R^2 (not classification accuracy), so this equals r2 above
accuracy = rf_model.score(X_test, y_test)
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R^2 Score: {r2:.2f}")
print(f"Model Accuracy: {accuracy}")

# Feature Importance
importances = rf_model.feature_importances_
feature_importances = pd.DataFrame(importances, index=X.columns, columns=["Importance"]).sort_values("Importance", ascending=False)
print(feature_importances)


Mean Absolute Error: 0.58
Mean Squared Error: 2.03
R^2 Score: 0.96
Model Accuracy: 0.9607001571394429
                          Importance
trip_duration               0.806437
trip_distance               0.133688
VendorID_2                  0.009969
DOLocationID                0.009559
PULocationID                0.008449
pickup_hour                 0.006153
tip_amount                  0.006066
pickup_dayofyear            0.005376
pickup_dayofweek            0.003014
RatecodeID_5.0              0.002608
payment_type_Credit Card    0.002471
pickup_weekofyear           0.002130
RatecodeID_2.0              0.001141
extra                       0.001112
pickup_month                0.000606
is_peak_hour                0.000606
is_peak_day                 0.000431
RatecodeID_4.0              0.000119
trip_type_Street-hail       0.000031
payment_type_No Charge      0.000026
payment_type_Dispute        0.000005


### Model Performance:
##### Mean Absolute Error (MAE): 0.58
##### Mean Squared Error (MSE): 2.03
##### R² Score: 0.96
##### Model Accuracy (the R² returned by score()): 0.9607
##### Feature Importance: 
The importance values provide insight into which features contribute most to the predictions. Features with very low importance values (around 0.000005 or lower) were therefore deemed insignificant and dropped.
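
To make the concentration of importance concrete, the following sketch computes a running total over the five largest values copied from the printed importance table. Restricting the example to the top five rows is a choice made here for brevity.

```python
import pandas as pd

# Top importances copied from the printed feature importance table
importances = pd.Series({
    "trip_duration": 0.806437,
    "trip_distance": 0.133688,
    "VendorID_2": 0.009969,
    "DOLocationID": 0.009559,
    "PULocationID": 0.008449,
})

# Running total of importance, from most to least important
cumulative = importances.sort_values(ascending=False).cumsum()
print(cumulative.round(4))

# Combined share of the two most important features
top_two_share = cumulative.iloc[1]
print(f"Top two features: {top_two_share:.4f}")
```

The two leading features, trip_duration and trip_distance, together account for about 94% of the total importance, which is why the remaining features can be pruned with little effect on the scores.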

### Is the Model a Good Model? 
Yes, the model appears to be a good one. Its R² score of 0.96 indicates a strong relationship between the features and the target variable: the model explains about 96% of the variance in fare_amount on the test set.
The Mean Absolute Error (MAE) of 0.58 shows that the prediction error is low in real-world terms: on average, a prediction differs from the actual fare by about 0.58 in the units of fare_amount.
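
Because MSE is expressed in squared units, it can help to convert it to a root mean squared error (RMSE) so that it is comparable with the MAE. A minimal sketch using the two reported values:

```python
import math

mse = 2.03  # Mean Squared Error reported for the test set
mae = 0.58  # Mean Absolute Error reported for the test set

# RMSE is the square root of MSE and is expressed in the target's units
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f}")
print(f"RMSE / MAE: {rmse / mae:.2f}")
```

The RMSE comes out at roughly 1.42, noticeably above the MAE of 0.58. Since squaring weights large errors more heavily, a gap like this may indicate that a minority of trips are predicted with comparatively large errors.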

#### Feature Selection:
Specific features (i.e., 'congestion_surcharge', 'improvement_surcharge', 'passenger_count', 'mta_tax', 'tolls_amount', 'is_peak_month', 'payment_type_Unknown', 'RatecodeID_6.0') were deliberately dropped from the dataset to improve model performance and interpretability. To do this, the model was built repeatedly and the importance values were inspected; in each build, one of the low-importance features was removed. This continued until the best scores were obtained, which yielded the final model.
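
The iterative procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact procedure used: the helper name `backward_eliminate`, the `max_rounds` limit, the rule of stopping as soon as R² drops, and the synthetic data from `make_regression` are all hypothetical choices made for the example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def backward_eliminate(X, y, max_rounds=3, random_state=42):
    """Hypothetical helper: drop the least important feature while R^2 does not get worse."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    kept, dropped = list(X.columns), []
    model = RandomForestRegressor(n_estimators=30, random_state=random_state)
    model.fit(X_train[kept], y_train)
    best_score = model.score(X_val[kept], y_val)  # R^2 on the held-out split

    for _ in range(max_rounds):
        if len(kept) <= 1:
            break
        # Candidate for removal: lowest importance in the current model
        weakest = pd.Series(model.feature_importances_, index=kept).idxmin()
        trial = [c for c in kept if c != weakest]
        trial_model = RandomForestRegressor(n_estimators=30, random_state=random_state)
        trial_model.fit(X_train[trial], y_train)
        trial_score = trial_model.score(X_val[trial], y_val)
        if trial_score < best_score:
            break  # removing the feature hurt the score, so keep it and stop
        kept, model, best_score = trial, trial_model, trial_score
        dropped.append(weakest)
    return kept, dropped, best_score


# Synthetic data (assumption): 6 features, only 3 of which are informative
X_arr, y_arr = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

kept, dropped, score = backward_eliminate(X_demo, y_arr)
print("kept:", kept)
print("dropped:", dropped)
print(f"held-out R^2: {score:.3f}")
```

In this sketch the score that drives the removals comes from a held-out split. Using a split separate from the final test set for that decision keeps the reported test metrics from being tuned to the same data they are measured on.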
