<a href="https://colab.research.google.com/github/diegodemiranda/linear_regression_models/blob/main/chicago_taxi_fare_prediction/linear_regression_chicago_taxi_fare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Linear regression - Chicago taxi fare prediction

---



In this notebook, we will use a dataset to train a model that predicts the fare of a taxi ride in Chicago, Illinois. 🚕

In reality, Chicago taxi cabs use a documented formula to determine cab fares.
For a single passenger paying cash, the fare is calculated like this:

**FARE** = 2.25 * `TRIP_MILES` + 0.12 * `TRIP_MINUTES` + 3.25

Typically with machine learning problems we would not know the 'correct'
formula, but in this case we will use this knowledge to evaluate our model.
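
As a quick illustration, the documented formula can be written as a small Python function. The function name `documented_fare` and the sample trip are made up for this sketch; only the coefficients come from the formula above.

```python
def documented_fare(trip_miles, trip_minutes):
    """Fare for a single passenger paying cash, per the documented formula."""
    return 2.25 * trip_miles + 0.12 * trip_minutes + 3.25

# Hypothetical trip: 10 miles in 20 minutes.
example_fare = documented_fare(trip_miles=10, trip_minutes=20)
print(f"Fare: ${example_fare:.2f}")  # 22.50 + 2.40 + 3.25 = 28.15
```

A trained model with two features (distance and duration) should ideally recover weights and a bias close to these coefficients.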

## Part 1 - Set up the Environment


---

### Load dependencies

The model depends on several **Python** libraries to help with data manipulation, machine learning tasks, and data visualization:
* **NumPy** is a fundamental package for scientific computing in Python.
* **Pandas** is a powerful library for data analysis and manipulation.
* **Keras** is a high-level API for building and training deep learning models.
* **Plotly** is an interactive, open-source, and browser-based graphing library for creating interactive and visually appealing plots.
* **Seaborn** is a statistical data visualization library built on top of Matplotlib. It provides a higher-level interface for creating informative and aesthetically pleasing statistical graphics.

In [None]:
# general
import io

# data
import numpy as np
import pandas as pd

# machine learning
import keras

# data visualization
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns

### Load the dataset
The following code cell loads the dataset and creates a pandas DataFrame.

In [None]:
chicago_taxi_dataset = pd.read_csv("chicago_taxi_train.csv")

training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]

print('Read dataset completed successfully.')
print('Total number of rows: {0}\n\n'.format(len(training_df.index)))

training_df.head(200)

### Normalizing the dataset


Machine learning algorithms, especially those using **gradient descent**, often benefit from normalized data. Features with widely different scales can cause the model to prioritize features with larger values, leading to suboptimal performance. Normalization brings all features to a similar scale, preventing this bias.

Furthermore, normalization can help the model converge faster during training, potentially reducing the number of epochs required to achieve good results. This is because the optimizer can navigate the loss landscape more efficiently when the features are on a similar scale.

To scale values before training a machine learning model, we typically use a technique called **feature scaling**: we normalize or standardize the data by calculating the minimum and maximum values (for **min-max scaling**) or the mean and standard deviation (for **z-score standardization**) from the training data. We then apply those values to transform all features to a common scale, usually between 0 and 1 for min-max scaling, or a mean of 0 and a standard deviation of 1 for z-score standardization.
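
To make the two techniques concrete, here is a minimal sketch that applies both formulas by hand with NumPy. The array `values` is made-up data, not part of the taxi dataset.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up feature values

# Min-max scaling: (x - min) / (max - min) maps the values onto [0, 1].
min_max_scaled = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: (x - mean) / std gives mean 0 and standard deviation 1.
z_scored = (values - values.mean()) / values.std()

print(min_max_scaled)  # values 0, 0.25, 0.5, 0.75, 1
print(z_scored)        # symmetric around 0
```

Min-max scaling preserves the shape of the distribution but is sensitive to extreme values, because a single outlier sets the minimum or maximum.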

We should always apply scaling to both our training and testing data using the parameters calculated from the training set only to avoid **data leakage** from the test set.
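
The sketch below shows one way to follow this rule with scikit-learn: the scaler learns its parameters from the training rows and is then reused, unchanged, on the test rows. The data and the 75/25 split are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up trip distances in miles, one feature per row.
trip_miles = np.array([[0.5], [1.2], [3.4], [7.8], [12.0], [15.5], [20.1], [33.0]])

train, test = train_test_split(trip_miles, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learn min and max from the training rows only
test_scaled = scaler.transform(test)        # reuse those parameters; do not refit on the test rows

print(train_scaled.min(), train_scaled.max())  # approximately 0 and 1
print(test_scaled.ravel())                     # may fall outside [0, 1]
```

Test values can land slightly below 0 or above 1 when they lie outside the range seen during training. That is expected and is not a sign of an error.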

For this specific case, where we are working with `TRIP_MILES` and `TRIP_SECONDS` as features, **min-max scaling** is recommended.
This technique scales the features to a specific range, typically between 0 and 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object
scaler = MinMaxScaler()

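# Select the numeric columns to normalize (categorical columns such as
# 'COMPANY' and 'PAYMENT_TYPE' are excluded). Note that the list includes the label, 'FARE'.
features_to_normalize = ['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'TIP_RATE']

# Fit the scaler and overwrite these columns of chicago_taxi_dataset with the scaled values.
# training_df was created earlier as a separate selection, so it keeps the original, unscaled values.
chicago_taxi_dataset[features_to_normalize] = scaler.fit_transform(chicago_taxi_dataset[features_to_normalize])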

## Part 2 - Dataset Exploration


---



### View dataset statistics

In this step, I will use the `DataFrame.describe` method to view **descriptive statistics** about the dataset and answer some important questions about the data.

In [None]:
training_df.describe(include='all')

### View correlation matrix and visualize relationships with pair plot

In this step, I will use a **correlation matrix** to identify features whose values correlate well with the label. In essence, the matrix shows how strongly the numerical features of the taxi trip data are related to each other.

Correlation values have the following meanings:

* **+1**: Perfect positive correlation (as one feature increases, the other increases proportionally)
* **-1**: Perfect negative correlation (as one feature increases, the other decreases proportionally)
* **0**: No linear correlation (no linear relationship between the features).

In general, the higher the absolute value of a correlation value, the greater its predictive power.
Correlation **does not imply causation**, and the correlation matrix only captures **linear relationships**. If the relationship between the variables is non-linear, the correlation can be close to 0 even when the relationship is strong.
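
A small synthetic example (unrelated to the taxi data) shows how a strong non-linear relationship can produce a correlation near zero. Here `y` is completely determined by `x`, yet the Pearson coefficient is essentially 0 because the relationship is symmetric rather than linear.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)  # symmetric around zero
y = x ** 2                       # y is fully determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```

This is why a low value in the correlation matrix is a reason to look at a plot, not proof that a feature is useless.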

In [None]:
training_df.corr(numeric_only = True)

A **pair plot** generates a grid of pairwise plots that shows the relationship of each feature with every other feature in one place.
Pair plots help us quickly identify **correlations, patterns, distributions, and outliers**.
This visualization helps us understand our data better and guides the next steps in the machine learning process.

In [None]:
sns.pairplot(training_df, x_vars=["FARE", "TRIP_MILES", "TRIP_SECONDS"], y_vars=["FARE", "TRIP_MILES", "TRIP_SECONDS"])

## Part 3 - Train Model


---

### Defining plotting functions

To help us visualize the results of each training run, we will generate two plots at the end of each experiment:

* a **scatter plot** of the features vs. the label, with a line showing the output of the trained model;
* a **loss curve**.

In [None]:
def make_plots(df, feature_names, label_name, model_output, sample_size=200):

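  random_sample = df.sample(n=sample_size).copy()
  # reset_index returns a new DataFrame, so the result must be assigned back.
  random_sample = random_sample.reset_index(drop=True)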
  weights, bias, epochs, rmse = model_output

  is_2d_plot = len(feature_names) == 1
  model_plot_type = "scatter" if is_2d_plot else "surface"
  fig = make_subplots(rows=1, cols=2,
                      subplot_titles=("Loss Curve", "Model Plot"),
                      specs=[[{"type": "scatter"}, {"type": model_plot_type}]])

  plot_data(random_sample, feature_names, label_name, fig)
  plot_model(random_sample, feature_names, weights, bias, fig)
  plot_loss_curve(epochs, rmse, fig)

  fig.show()
  return

def plot_loss_curve(epochs, rmse, fig):
  curve = px.line(x=epochs, y=rmse)
  curve.update_traces(line_color='#ff0000', line_width=3)

  # add_trace replaces the deprecated append_trace.
  fig.add_trace(curve.data[0], row=1, col=1)
  fig.update_xaxes(title_text="Epoch", row=1, col=1)
  fig.update_yaxes(title_text="Root Mean Squared Error", row=1, col=1, range=[rmse.min()*0.8, rmse.max()])

  return

def plot_data(df, features, label, fig):
  if len(features) == 1:
    scatter = px.scatter(df, x=features[0], y=label)
  else:
    scatter = px.scatter_3d(df, x=features[0], y=features[1], z=label)

  fig.add_trace(scatter.data[0], row=1, col=2)
  if len(features) == 1:
    fig.update_xaxes(title_text=features[0], row=1, col=2)
    fig.update_yaxes(title_text=label, row=1, col=2)
  else:
    fig.update_layout(scene1=dict(xaxis_title=features[0], yaxis_title=features[1], zaxis_title=label))

  return

def plot_model(df, features, weights, bias, fig):
  df['FARE_PREDICTED'] = bias[0]

  for index, feature in enumerate(features):
    df['FARE_PREDICTED'] = df['FARE_PREDICTED'] + weights[index][0] * df[feature]

  if len(features) == 1:
    model = px.line(df, x=features[0], y='FARE_PREDICTED')
    model.update_traces(line_color='#ff0000', line_width=3)
  else:
    z_name, y_name = "FARE_PREDICTED", features[1]
    z = [df[z_name].min(), (df[z_name].max() - df[z_name].min()) / 2, df[z_name].max()]
    y = [df[y_name].min(), (df[y_name].max() - df[y_name].min()) / 2, df[y_name].max()]
    x = []
    for i in range(len(y)):
      x.append((z[i] - weights[1][0] * y[i] - bias[0]) / weights[0][0])

    plane=pd.DataFrame({'x':x, 'y':y, 'z':[z] * 3})

    light_yellow = [[0, '#89CFF0'], [1, '#FFDB58']]
    model = go.Figure(data=go.Surface(x=plane['x'], y=plane['y'], z=plane['z'],
                                      colorscale=light_yellow))

  fig.add_trace(model.data[0], row=1, col=2)

  return

def model_info(feature_names, label_name, model_output):
  weights = model_output[0]
  bias = model_output[1]

  nl = "\n"
  header = "-" * 80
  banner = header + nl + "|" + "MODEL INFO".center(78) + "|" + nl + header

  info = ""
  equation = label_name + " = "

  for index, feature in enumerate(feature_names):
    info = info + "Weight for feature[{}]: {:.3f}\n".format(feature, weights[index][0])
    equation = equation + "{:.3f} * {} + ".format(weights[index][0], feature)

  info = info + "Bias: {:.3f}\n".format(bias[0])
  equation = equation + "{:.3f}\n".format(bias[0])

  return banner + nl + info + nl + equation

print("Success: defining plotting functions complete.")

### Defining functions to build and train a model

The following code cell defines the functions we need to build and train the model: `build_model`, `train_model`, and `run_experiment`.

In [None]:
def build_model(my_learning_rate, num_features):
  """Create and compile a simple linear regression model."""
  # Describe the topology of the model.
  # The topology of a simple linear regression model
  # is a single node in a single layer.
  inputs = keras.Input(shape=(num_features,))
  outputs = keras.layers.Dense(units=1)(inputs)
  model = keras.Model(inputs=inputs, outputs=outputs)

  # Compile the model topology into code that Keras can efficiently
  # execute. Configure training to minimize the model's mean squared error.
  model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[keras.metrics.RootMeanSquaredError()])

  return model


def train_model(model, df, features, label, epochs, batch_size):
  """Train the model by feeding it data."""

  # Feed the model the features and the label.
  # The model will train for the specified number of epochs.
  history = model.fit(x=features,
                      y=label,
                      batch_size=batch_size,
                      epochs=epochs)

  # Gather the trained model's weight and bias.
  trained_weight = model.get_weights()[0]
  trained_bias = model.get_weights()[1]

  # The list of epochs is stored separately from the rest of history.
  epochs = history.epoch

  # Isolate the error for each epoch.
  hist = pd.DataFrame(history.history)

  # To track the progression of training, we're going to take a snapshot
  # of the model's root mean squared error at each epoch.
  rmse = hist["root_mean_squared_error"]

  return trained_weight, trained_bias, epochs, rmse


def run_experiment(df, feature_names, label_name, learning_rate, epochs, batch_size):

  print('Info: starting training experiment with features={} and label={}\n'.format(feature_names, label_name))

  num_features = len(feature_names)

  features = df.loc[:, feature_names].values
  label = df[label_name].values

  model = build_model(learning_rate, num_features)
  model_output = train_model(model, df, features, label, epochs, batch_size)

  print('\nSuccess: training experiment complete\n')
  print('{}'.format(model_info(feature_names, label_name, model_output)))
  make_plots(df, feature_names, label_name, model_output)

  return model

print("Success: defining linear regression functions complete.")

### Train a model with one feature

In this step we will train a model to predict the cost of the fare using a single feature. Earlier, we saw that `TRIP_MILES` (distance) correlates **most strongly** with the `FARE`, so let's start with `TRIP_MILES` as the feature for our first training run.

During training, we should see the **root mean squared error (RMSE)** in the output. The units for RMSE are the same as the units for the label (dollars). In other words, we can use the RMSE to determine how far off, on average, the predicted fares are in dollars from the observed values.
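
As a worked example, the RMSE can be computed by hand for a few trips. The fares below are invented for illustration and do not come from the dataset.

```python
import numpy as np

observed_fares = np.array([10.0, 20.0, 30.0])   # made-up fares in dollars
predicted_fares = np.array([12.0, 18.0, 33.0])  # made-up model predictions

errors = predicted_fares - observed_fares        # [2, -2, 3]
rmse = np.sqrt(np.mean(errors ** 2))             # sqrt((4 + 4 + 9) / 3)
print(f"RMSE: ${rmse:.2f}")                      # about $2.38
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.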

In [None]:
# The following variables are the hyperparameters.
# We can adjust these hyperparameters to see how they impact a training run.
learning_rate = 0.001
epochs = 20
batch_size = 50

# Specify the feature and the label.
features = ['TRIP_MILES']
label = 'FARE'

model_1 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)

### Experiment with hyperparameters

In machine learning, it is common to run multiple experiments to find the best set of hyperparameters for training a model.

When the learning rate is **too high**, the loss curve bounces around and does not appear to move towards convergence with each iteration. The resulting model also does not fit the data very well.

When the learning rate is **too small**, the loss curve decreases slowly, without a dramatic drop or a leveling off. We can increase the number of epochs so that the model eventually converges, but training will take longer.

Increasing the batch size makes each epoch run faster, because there are fewer weight updates per epoch. For the same reason, as with a smaller
learning rate, the model does not converge within just a few epochs. We can increase the number of epochs, and eventually we should see the
model converge.

All of these hyperparameters can be tested by adjusting the code above.
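
The effect of the learning rate can also be seen in a self-contained sketch that does not depend on Keras. It uses plain full-batch gradient descent rather than the RMSprop optimizer used by `build_model`, and it runs on synthetic, noise-free data, so the numbers are only illustrative. The function `rmse_per_epoch` is a hypothetical helper written for this example.

```python
import numpy as np

def rmse_per_epoch(x, y, learning_rate, epochs):
    """Fit y ~ w * x + b with full-batch gradient descent; return the RMSE after each epoch."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        error = w * x + b - y
        # Gradients of the mean squared error with respect to w and b.
        w -= learning_rate * 2 * np.mean(error * x)
        b -= learning_rate * 2 * np.mean(error)
        history.append(float(np.sqrt(np.mean((w * x + b - y) ** 2))))
    return history

# Synthetic trips: the fare depends on distance only (an assumption for this sketch).
rng = np.random.default_rng(0)
miles = rng.uniform(0, 10, size=200)
fare = 2.25 * miles + 3.25

results = {lr: rmse_per_epoch(miles, fare, lr, epochs=20) for lr in (0.0001, 0.01, 0.1)}
for lr, history in results.items():
    print(f"learning_rate={lr}: first RMSE={history[0]:.2f}, last RMSE={history[-1]:.2f}")
```

With these settings, the smallest learning rate barely reduces the error in 20 epochs, the middle one reduces it substantially, and the largest one makes the error grow instead of shrink.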

### Train a model with two features

The model we trained with the feature `TRIP_MILES` demonstrates fairly strong predictive power, but is it possible to improve the model by training with two features, `TRIP_MILES` and `TRIP_SECONDS`?

When training a model with more than one feature, it is important that all
numeric values are roughly on the same scale. In this case, `TRIP_SECONDS` and
`TRIP_MILES` do not meet this criterion. The mean value for `TRIP_MILES` is 8.3 and the mean for `TRIP_SECONDS` is 1320, a difference of two orders of magnitude.
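
One simple way to narrow this gap is to express the duration in minutes, the unit the documented fare formula uses. This is a sketch under that assumption, not necessarily the approach taken in the cell below. The rows are made up, and `TRIP_MINUTES` is a hypothetical column that does not exist in the dataset.

```python
import pandas as pd

# Made-up rows; the column names follow the dataset used in this notebook.
trips = pd.DataFrame({"TRIP_MILES": [2.5, 8.3, 14.0],
                      "TRIP_SECONDS": [600, 1320, 2400]})

# Hypothetical extra column: trip duration in minutes, as in the fare formula.
trips["TRIP_MINUTES"] = trips["TRIP_SECONDS"] / 60

print(trips)
```

A mean duration of 1320 seconds becomes 22 minutes, which is on the same order of magnitude as the mean distance of 8.3 miles.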



In [None]:
# The following variables are the hyperparameters.
# We can adjust these hyperparameters to see how they impact a training run.
learning_rate = 0.001
epochs = 20
batch_size = 50

# Specify the features and the label.
features = ['TRIP_MILES', 'TRIP_SECONDS']
label = 'FARE'

model_2 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)

## Part 4 - Validate Model


---