### Linear regression
- models a linear relationship between features and a label
- equation: $y = b + w_1x_1$, where $y$ is the predicted value (output), $b$ is the bias (the y-intercept, sometimes written as $w_0$), $w_1$ is the weight (the slope $m$), and $x_1$ is the feature (input).
    - bias and weight are calculated/updated during training.
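
As a minimal sketch of the equation above, the prediction for one example can be written as a small function. The function name `predict` and the sample numbers are illustrative assumptions, not values from the course.

```python
def predict(x1, w1, b):
    """Return the prediction y = b + w1 * x1 for a single feature value."""
    return b + w1 * x1


# Hypothetical parameters: bias 5.0 and weight 2.0.
print(predict(3.0, w1=2.0, b=5.0))  # 5.0 + 2.0 * 3.0 = 11.0
```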

Loss:
- the difference between the predicted values and the actual values. The direction of the error does not matter, so take the absolute value or the square of the difference.
- mainly 4 types of loss:
    - $L_1$ loss: $\sum|actual\,value - predicted\,value|$
    - MAE (mean absolute error): $\frac{1}{N}\sum|actual\,value - predicted\,value|$
    - $L_2$ loss: $\sum(actual\,value - predicted\,value)^2$
    - MSE (mean squared error): $\frac{1}{N}\sum(actual\,value - predicted\,value)^2$
- MSE and MAE are preferred when processing multiple examples at a time, because they average the loss over all N examples
- choose proper loss function:
    - MSE/$L_2$ loss if you want the model to fit the data tightly, including outliers: squaring amplifies large errors, so outliers incur a high penalty and pull the model strongly toward them.
    - MAE/$L_1$ loss if you want the model to be less influenced by outliers.
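
The four losses above can be sketched in plain Python as follows. The data is made up for illustration (the last example plays the role of an outlier), and the function names are my own.

```python
def l1_loss(actual, predicted):
    """Sum of absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error: L1 loss averaged over the N examples."""
    return l1_loss(actual, predicted) / len(actual)


def l2_loss(actual, predicted):
    """Sum of squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def mse(actual, predicted):
    """Mean squared error: L2 loss averaged over the N examples."""
    return l2_loss(actual, predicted) / len(actual)


# Hypothetical data: the errors are -1, 1, -1 and 20 (an outlier).
actual = [10.0, 12.0, 14.0, 40.0]
predicted = [11.0, 11.0, 15.0, 20.0]

print(l1_loss(actual, predicted))  # 23.0
print(mae(actual, predicted))      # 5.75
print(l2_loss(actual, predicted))  # 403.0
print(mse(actual, predicted))      # 100.75
```

In this made-up sample the outlier contributes 20 of 23 to the $L_1$ loss but 400 of 403 to the $L_2$ loss, which is why a model trained with MSE is pulled more strongly toward outliers.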


Gradient descent:
- an iterative method that is used to find the optimal values for parameters (weights and bias) that produce the lowest loss.
- steps:
    1. Set weight = 0, bias = 0
    2. Calculate the loss using the current parameters
    3. Determine the direction in which to move the weights and bias to reduce the loss
        - direction: calculate the slope of the tangent to the loss function at the current weight and bias, i.e. the derivative of the loss function w.r.t. the weight and the bias, and move against that slope.
    4. Move the weight and bias values a small amount (the gradient multiplied by the learning rate) in the direction determined above
    5. Repeat from step 2 until the model **converges**.
- When you graph the loss surface for a model with one feature, it has a **convex** shape; at its minimum, the slope with respect to the weight and the bias is ~ 0.
    - A linear model has converged when it has found the minimum loss.
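
The steps above can be sketched as plain (full-batch) gradient descent on MSE for a one-feature model. This is a simplified illustration under my own assumptions: the toy data, the function name `gradient_descent`, the learning rate, and the fixed step count are not from the course, and the exercise further down uses Keras with the RMSprop optimizer instead.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing MSE with plain gradient descent."""
    w, b = 0.0, 0.0  # step 1: start from weight = 0, bias = 0
    n = len(xs)
    for _ in range(steps):
        # Step 2: errors of the current parameters (the loss is the mean of errors squared).
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Step 3: derivatives of MSE with respect to w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step 4: move against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 3, so the minimum loss is at w = 2, b = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 3.0
```

A real training loop would usually stop when the loss stops improving rather than after a fixed number of steps; the fixed count keeps the sketch short.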

Hyperparameters: 
- control different aspects of training
- 3 common hyperparameters:
    - Learning rate: how quickly the model converges
        - small -> converges slowly, needs too many iterations
        - large -> never converges, the loss fluctuates
    - Batch size: number of examples the model processes before updating weights and bias
        - **Stochastic gradient descent (SGD)**
            - use only a single example (batch size = 1) per iteration
            - produces noise: variations during training can cause the loss to increase, rather than decrease, during an iteration
        - **Mini-batch stochastic gradient descent**
            - 1 < batch size < N 
            - the examples in each batch are chosen at random; the gradients are averaged and the weights and bias are updated once per iteration
        - larger batch sizes can help reduce the negative effects of having outliers in the data
    - Epochs
        - means the model has processed every example in the training set once
        - e.g., 1000 examples with mini-batch size = 100, it will take 10 iterations to complete one epoch
        - more epochs generally produce a better model, but take more time to train
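
The relationship between batch size, iterations, and epochs can be sketched as below. The helper names `iterations_per_epoch` and `mini_batches` are hypothetical; the last batch is allowed to be smaller than the others, which is consistent with the 634 and 64 steps per epoch shown in the training logs later in these notes (31694 examples with batch sizes 50 and 500).

```python
import math
import random


def iterations_per_epoch(num_examples, batch_size):
    """Number of parameter updates needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def mini_batches(examples, batch_size, rng):
    """Yield randomly shuffled mini-batches that together cover one epoch."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


print(iterations_per_epoch(1000, 100))   # 10
print(iterations_per_epoch(31694, 50))   # 634
print(iterations_per_epoch(31694, 500))  # 64

# One epoch over 10 toy examples with batch size 4 gives batches of 4, 4 and 2.
batches = list(mini_batches(list(range(10)), 4, random.Random(0)))
print([len(batch) for batch in batches])  # [4, 4, 2]
```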

Questions related to batch size raised during experiments:
- Why does a small batch size take longer to execute? See https://github.com/changbai-phu/ML-Courses/blob/main/Google_ML_Crash_Course/ML_models/Batch_Size_clarification.md 

### Programming exercises
- See https://github.com/changbai-phu/ML-Courses/blob/main/ml/cc/exercises/linear_regression_taxi.ipynb for the original files and results

#### P1: Setup

In [None]:
#@title Code - Load dependencies

#general
import io

# data
import numpy as np
import pandas as pd

# machine learning
import keras

# data visualization
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns

chicago_taxi_dataset = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv")

In [None]:
#@title Code - Read dataset

# Updates dataframe to use specific columns.
training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]

print('Read dataset completed successfully.')
print('Total number of rows: {0}\n\n'.format(len(training_df.index)))
training_df.head(200)

#### P2: Dataset exploration

In [None]:
#@title Code - View dataset statistics

print('Total number of rows: {0}\n\n'.format(len(training_df.index)))
training_df.describe(include='all')

In [None]:
Total number of rows: 31694

| | TRIP_MILES | TRIP_SECONDS | FARE | COMPANY | PAYMENT_TYPE | TIP_RATE |
|---|---|---|---|---|---|---|
| count | 31694.000000 | 31694.000000 | 31694.000000 | 31694 | 31694 | 31694.000000 |
| unique | NaN | NaN | NaN | 31 | 7 | NaN |
| top | NaN | NaN | NaN | Flash Cab | Credit Card | NaN |
| freq | NaN | NaN | NaN | 7887 | 14142 | NaN |
| mean | 8.289463 | 1319.796397 | 23.905210 | NaN | NaN | 12.965785 |
| std | 7.265672 | 928.932873 | 16.970022 | NaN | NaN | 15.517765 |
| min | 0.500000 | 60.000000 | 3.250000 | NaN | NaN | 0.000000 |
| 25% | 1.720000 | 548.000000 | 9.000000 | NaN | NaN | 0.000000 |
| 50% | 5.920000 | 1081.000000 | 18.750000 | NaN | NaN | 12.200000 |
| 75% | 14.500000 | 1888.000000 | 38.750000 | NaN | NaN | 20.800000 |
| max | 68.120000 | 7140.000000 | 159.250000 | NaN | NaN | 648.600000 |

- What is the maximum fare?
    - 159.250000
- What is the mean distance across all trips?
    - 8.289463 miles
- How many cab companies are in the dataset?
    - 31
- What is the most frequent payment type?
    - Credit Card
- Are any features missing data?
    - No

NaN: shown when a value cannot be computed or when information is missing. For example, numeric statistics such as mean and std are not defined for categorical features like COMPANY, and unique, top, and freq are not reported for numeric features.

##### Generate a correlation matrix
- use a correlation matrix to identify features whose values correlate well with the label
- the higher the absolute value of a correlation, the greater the feature's predictive power: 1.0 is a perfect positive correlation, -1.0 is a perfect negative correlation, and 0 means no correlation (not linearly related)
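
To make the correlation value concrete, here is a small sketch of the Pearson correlation coefficient, which is the default method of the pandas `corr()` call used below. The function name `pearson` and the toy lists are my own; they are not taken from the taxi dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


miles = [1.0, 2.0, 3.0, 4.0, 5.0]
fare_rising = [5.0, 7.0, 9.0, 11.0, 13.0]   # exactly 2 * miles + 3
fare_falling = [13.0, 11.0, 9.0, 7.0, 5.0]  # exactly -2 * miles + 15
unrelated = [2.0, 1.0, 3.0, 1.0, 2.0]       # no linear trend in miles

print(pearson(miles, fare_rising))   # 1.0
print(pearson(miles, fare_falling))  # -1.0
print(abs(pearson(miles, unrelated)) < 1e-9)  # True: no linear relationship
```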


In [None]:
#@title Code - View correlation matrix
training_df.corr(numeric_only = True)

In [None]:

| | TRIP_MILES | TRIP_SECONDS | FARE | TIP_RATE |
|---|---|---|---|---|
| TRIP_MILES | 1.000000 | 0.800855 | 0.975344 | -0.049594 |
| TRIP_SECONDS | 0.800855 | 1.000000 | 0.830292 | -0.084294 |
| FARE | 0.975344 | 0.830292 | 1.000000 | -0.070979 |
| TIP_RATE | -0.049594 | -0.084294 | -0.070979 | 1.000000 |

- Which feature correlates most strongly to the label FARE?
    - TRIP_MILES, because its correlation with FARE (0.975344) has the largest absolute value
- Which feature correlates least strongly to the label FARE?
    - TIP_RATE, because its correlation with FARE (-0.070979) has the smallest absolute value

##### Visualize relationships in dataset

In [None]:
#@title Code - View pairplot
sns.pairplot(training_df, x_vars=["FARE", "TRIP_MILES", "TRIP_SECONDS"], y_vars=["FARE", "TRIP_MILES", "TRIP_SECONDS"])

#### P3: Train a model

##### Define plotting functions to view model information

In [None]:
#@title Define plotting functions

def make_plots(df, feature_names, label_name, model_output, sample_size=200):

  random_sample = df.sample(n=sample_size).copy()
  # reset_index returns a new DataFrame, so assign the result back.
  random_sample = random_sample.reset_index(drop=True)
  weights, bias, epochs, rmse = model_output

  is_2d_plot = len(feature_names) == 1
  model_plot_type = "scatter" if is_2d_plot else "surface"
  fig = make_subplots(rows=1, cols=2,
                      subplot_titles=("Loss Curve", "Model Plot"),
                      specs=[[{"type": "scatter"}, {"type": model_plot_type}]])

  plot_data(random_sample, feature_names, label_name, fig)
  plot_model(random_sample, feature_names, weights, bias, fig)
  plot_loss_curve(epochs, rmse, fig)

  fig.show()
  return

def plot_loss_curve(epochs, rmse, fig):
  curve = px.line(x=epochs, y=rmse)
  curve.update_traces(line_color='#ff0000', line_width=3)

  fig.append_trace(curve.data[0], row=1, col=1)
  fig.update_xaxes(title_text="Epoch", row=1, col=1)
  fig.update_yaxes(title_text="Root Mean Squared Error", row=1, col=1, range=[rmse.min()*0.8, rmse.max()])

  return

def plot_data(df, features, label, fig):
  if len(features) == 1:
    scatter = px.scatter(df, x=features[0], y=label)
  else:
    scatter = px.scatter_3d(df, x=features[0], y=features[1], z=label)

  fig.append_trace(scatter.data[0], row=1, col=2)
  if len(features) == 1:
    fig.update_xaxes(title_text=features[0], row=1, col=2)
    fig.update_yaxes(title_text=label, row=1, col=2)
  else:
    fig.update_layout(scene1=dict(xaxis_title=features[0], yaxis_title=features[1], zaxis_title=label))

  return

def plot_model(df, features, weights, bias, fig):
  df['FARE_PREDICTED'] = bias[0]

  for index, feature in enumerate(features):
    df['FARE_PREDICTED'] = df['FARE_PREDICTED'] + weights[index][0] * df[feature]

  if len(features) == 1:
    model = px.line(df, x=features[0], y='FARE_PREDICTED')
    model.update_traces(line_color='#ff0000', line_width=3)
  else:
    z_name, y_name = "FARE_PREDICTED", features[1]
    z = [df[z_name].min(), (df[z_name].max() - df[z_name].min()) / 2, df[z_name].max()]
    y = [df[y_name].min(), (df[y_name].max() - df[y_name].min()) / 2, df[y_name].max()]
    x = []
    for i in range(len(y)):
      x.append((z[i] - weights[1][0] * y[i] - bias[0]) / weights[0][0])

    plane=pd.DataFrame({'x':x, 'y':y, 'z':[z] * 3})

    light_yellow = [[0, '#89CFF0'], [1, '#FFDB58']]
    model = go.Figure(data=go.Surface(x=plane['x'], y=plane['y'], z=plane['z'],
                                      colorscale=light_yellow))

  fig.add_trace(model.data[0], row=1, col=2)

  return

def model_info(feature_names, label_name, model_output):
  weights = model_output[0]
  bias = model_output[1]

  nl = "\n"
  header = "-" * 80
  banner = header + nl + "|" + "MODEL INFO".center(78) + "|" + nl + header

  info = ""
  equation = label_name + " = "

  for index, feature in enumerate(feature_names):
    info = info + "Weight for feature[{}]: {:.3f}\n".format(feature, weights[index][0])
    equation = equation + "{:.3f} * {} + ".format(weights[index][0], feature)

  info = info + "Bias: {:.3f}\n".format(bias[0])
  equation = equation + "{:.3f}\n".format(bias[0])

  return banner + nl + info + nl + equation

print("SUCCESS: defining plotting functions complete.")

##### Define functions to build and train a model

In [None]:
#@title Code - Define ML functions

def build_model(my_learning_rate, num_features):
  """Create and compile a simple linear regression model."""
  # Describe the topology of the model.
  # The topology of a simple linear regression model
  # is a single node in a single layer.
  inputs = keras.Input(shape=(num_features,))
  outputs = keras.layers.Dense(units=1)(inputs)
  model = keras.Model(inputs=inputs, outputs=outputs)

  # Compile the model topology into code that Keras can efficiently
  # execute. Configure training to minimize the model's mean squared error.
  model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=my_learning_rate),
                loss="mean_squared_error",
                metrics=[keras.metrics.RootMeanSquaredError()])

  return model


def train_model(model, df, features, label, epochs, batch_size):
  """Train the model by feeding it data."""

  # Feed the model the feature and the label.
  # The model will train for the specified number of epochs.
  # input_x = df.iloc[:,1:3].values
  # df[feature]
  history = model.fit(x=features,
                      y=label,
                      batch_size=batch_size,
                      epochs=epochs)

  # Gather the trained model's weight and bias.
  trained_weight = model.get_weights()[0]
  trained_bias = model.get_weights()[1]

  # The list of epochs is stored separately from the rest of history.
  epochs = history.epoch

  # Isolate the error for each epoch.
  hist = pd.DataFrame(history.history)

  # To track the progression of training, we're going to take a snapshot
  # of the model's root mean squared error at each epoch.
  rmse = hist["root_mean_squared_error"]

  return trained_weight, trained_bias, epochs, rmse


def run_experiment(df, feature_names, label_name, learning_rate, epochs, batch_size):

  print('INFO: starting training experiment with features={} and label={}\n'.format(feature_names, label_name))

  num_features = len(feature_names)

  features = df.loc[:, feature_names].values
  label = df[label_name].values

  model = build_model(learning_rate, num_features)
  model_output = train_model(model, df, features, label, epochs, batch_size)

  print('\nSUCCESS: training experiment complete\n')
  print('{}'.format(model_info(feature_names, label_name, model_output)))
  make_plots(df, feature_names, label_name, model_output)

  return model

print("SUCCESS: defining linear regression functions complete.")

##### Train a model with one feature

In [None]:
#@title Code - Experiment 1

# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 20
batch_size = 50

# Specify the feature and the label.
features = ['TRIP_MILES']
label = 'FARE'

model_1 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)

In [None]:
INFO: starting training experiment with features=['TRIP_MILES'] and label=FARE

Epoch 1/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 348.6346 - root_mean_squared_error: 18.6426
Epoch 2/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 133.5084 - root_mean_squared_error: 11.5241
Epoch 3/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 34.1264 - root_mean_squared_error: 5.8180
Epoch 4/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 16.2051 - root_mean_squared_error: 4.0221
Epoch 5/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - loss: 17.6679 - root_mean_squared_error: 4.1991
Epoch 6/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 16.5441 - root_mean_squared_error: 4.0616
Epoch 7/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 15.2725 - root_mean_squared_error: 3.9015
Epoch 8/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.5931 - root_mean_squared_error: 3.8125
Epoch 9/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 13.5662 - root_mean_squared_error: 3.6773
Epoch 10/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.7848 - root_mean_squared_error: 3.8370
Epoch 11/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.3020 - root_mean_squared_error: 3.7787
Epoch 12/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 13.7281 - root_mean_squared_error: 3.6994
Epoch 13/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.0061 - root_mean_squared_error: 3.7396
Epoch 14/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 15.4342 - root_mean_squared_error: 3.9199
Epoch 15/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 14.1800 - root_mean_squared_error: 3.7613
Epoch 16/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 15.0049 - root_mean_squared_error: 3.8598
Epoch 17/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 13.5835 - root_mean_squared_error: 3.6804
Epoch 18/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.3659 - root_mean_squared_error: 3.7696
Epoch 19/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.5944 - root_mean_squared_error: 3.8178
Epoch 20/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.2164 - root_mean_squared_error: 3.7658

SUCCESS: training experiment complete

--------------------------------------------------------------------------------
|                                  MODEL INFO                                  |
--------------------------------------------------------------------------------
Weight for feature[TRIP_MILES]: 2.277
Bias: 4.970

FARE = 2.277 * TRIP_MILES + 4.970


- How many epochs did it take to converge on the final model?
    - 9, because the error reaches its minimum at epoch 9 and increases afterwards
- How well does the model fit the sample data?
    - pretty well

- Ans:
    - Use the **loss curve** to see where the loss begins to level off during training.
    - With this set of hyperparameters (learning_rate = 0.001, epochs = 20, batch_size = 50), it takes about 5 epochs for the training run to converge to the final model.
    - It appears from the model plot that the model fits the sample data fairly well.

##### Experiment with hyperparameters
- Experiment 1: Increase the learning rate to 1 (batch size at 50).
- Experiment 2: Decrease the learning rate to 0.0001 (batch size at 50).
- Experiment 3: Increase the batch size to 500 (learning rate at 0.001).

In [None]:
# Experiment 1
INFO: starting training experiment with features=['TRIP_MILES'] and label=FARE

Epoch 1/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - loss: 64.4992 - root_mean_squared_error: 7.6212
Epoch 2/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 45.1373 - root_mean_squared_error: 6.6933
Epoch 3/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 47.7835 - root_mean_squared_error: 6.9096
Epoch 4/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 45.2509 - root_mean_squared_error: 6.7005
Epoch 5/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 47.7799 - root_mean_squared_error: 6.8892
Epoch 6/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 46.4212 - root_mean_squared_error: 6.7982
Epoch 7/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 45.5827 - root_mean_squared_error: 6.7395
Epoch 8/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 48.0836 - root_mean_squared_error: 6.9301
Epoch 9/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 47.3604 - root_mean_squared_error: 6.8767
Epoch 10/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 48.5620 - root_mean_squared_error: 6.9658
Epoch 11/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 46.2598 - root_mean_squared_error: 6.7891
Epoch 12/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 45.9575 - root_mean_squared_error: 6.7713
Epoch 13/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 48.8434 - root_mean_squared_error: 6.9865
Epoch 14/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 45.7235 - root_mean_squared_error: 6.7424
Epoch 15/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 46.7988 - root_mean_squared_error: 6.8387
Epoch 16/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 47.8109 - root_mean_squared_error: 6.9128
Epoch 17/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 51.0231 - root_mean_squared_error: 7.1289
Epoch 18/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 48.4777 - root_mean_squared_error: 6.9601
Epoch 19/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 48.9496 - root_mean_squared_error: 6.9924
Epoch 20/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 48.2791 - root_mean_squared_error: 6.9457

SUCCESS: training experiment complete

--------------------------------------------------------------------------------
|                                  MODEL INFO                                  |
--------------------------------------------------------------------------------
Weight for feature[TRIP_MILES]: 2.646
Bias: 4.974

FARE = 2.646 * TRIP_MILES + 4.974

In [None]:
# Experiment 3
INFO: starting training experiment with features=['TRIP_MILES'] and label=FARE

Epoch 1/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 346.5096 - root_mean_squared_error: 18.6143
Epoch 2/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 316.7681 - root_mean_squared_error: 17.7977
Epoch 3/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 296.7084 - root_mean_squared_error: 17.2241
Epoch 4/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 270.0554 - root_mean_squared_error: 16.4328
Epoch 5/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 245.9910 - root_mean_squared_error: 15.6837
Epoch 6/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 220.2236 - root_mean_squared_error: 14.8398
Epoch 7/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 202.0853 - root_mean_squared_error: 14.2153
Epoch 8/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 182.0072 - root_mean_squared_error: 13.4908
Epoch 9/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 163.9141 - root_mean_squared_error: 12.8027
Epoch 10/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 149.6938 - root_mean_squared_error: 12.2338
Epoch 11/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 132.0897 - root_mean_squared_error: 11.4913
Epoch 12/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 115.6083 - root_mean_squared_error: 10.7517
Epoch 13/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 103.9859 - root_mean_squared_error: 10.1946
Epoch 14/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 89.7640 - root_mean_squared_error: 9.4732
Epoch 15/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 76.9155 - root_mean_squared_error: 8.7699
Epoch 16/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 65.8133 - root_mean_squared_error: 8.1118
Epoch 17/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 54.3047 - root_mean_squared_error: 7.3690
Epoch 18/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 47.4587 - root_mean_squared_error: 6.8881
Epoch 19/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 43.1465 - root_mean_squared_error: 6.5618
Epoch 20/20
64/64 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 34.1069 - root_mean_squared_error: 5.8392

SUCCESS: training experiment complete

--------------------------------------------------------------------------------
|                                  MODEL INFO                                  |
--------------------------------------------------------------------------------
Weight for feature[TRIP_MILES]: 2.241
Bias: 1.276

FARE = 2.241 * TRIP_MILES + 1.276

- How did raising the learning rate impact your ability to train the model?
    - based on the loss curve, the model doesn't converge; the loss fluctuates
- How did lowering the learning rate impact your ability to train the model?
    - based on the loss curve, the model converges slowly and has not converged by the time the 20 epochs finish; more epochs are needed for it to converge.
- Did changing the batch size affect your training results?
    - yes: the loss curve has not converged by the time the epochs finish. Based on the earlier notes, a large batch size reduces the impact of outliers, while a small batch size can help train a better model but introduces noise.

- Ans:
    - When the learning rate is too high, the loss curve bounces around and does not
appear to be moving towards convergence with each iteration. Also, notice that
the predicted model does not fit the data very well. With a learning rate that
is too high, it is unlikely that you will be able to train a model with good
results.
    - When the learning rate is too small, it may take longer for the loss curve to
converge. With a small learning rate the loss curve decreases slowly, but does
not show a dramatic drop or leveling off. With a small learning rate you could
increase the number of epochs so that your model will eventually converge, but
it will take longer.
    - Increasing the batch size makes each epoch run faster, but as with the smaller
learning rate, the model does not converge with just 20 epochs. If you have
time, try increasing the number of epochs and eventually you should see the
model converge.
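
The first two answers above can be illustrated with a toy problem. This sketch runs plain gradient descent on the one-parameter loss $w^2$ with three learning rates; the loss function, the function name `descend`, and the specific rates are my own assumptions, and the notebook itself uses RMSprop on the taxi data, so only the qualitative behaviour carries over.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run plain gradient descent on loss(w) = w**2 and return the loss history."""
    history = []
    for _ in range(steps):
        history.append(w ** 2)
        w -= learning_rate * 2 * w  # the gradient of w**2 is 2 * w
    return history


too_high = descend(1.05)    # every step overshoots the minimum: the loss grows
too_low = descend(0.0001)   # every step is tiny: the loss barely moves
reasonable = descend(0.1)   # the loss shrinks steadily toward 0

for name, history in [("too high", too_high), ("too low", too_low), ("reasonable", reasonable)]:
    print(f"{name:>10}: first loss {history[0]:.4f}, last loss {history[-1]:.4f}")
```

With the too-small rate, raising `steps` eventually brings the loss down, which mirrors the advice to increase the number of epochs.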

Extra experiment of my own: batch size = 5
- each epoch finishes much more slowly than with a larger batch size
- but the model converges, whereas with a large batch size it could not converge within the same number of epochs
- the loss curve shows a very steep drop before it converges

##### Train a model with two features
- In this step, try training the model with two features, TRIP_MILES and TRIP_MINUTES, to see if you can improve the model. 

In [None]:
#@title Code - Experiment 3

# The following variables are the hyperparameters.
learning_rate = 0.001
epochs = 20
batch_size = 50

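# training_df was created as a slice of chicago_taxi_dataset, so pandas may emit a
# SettingWithCopyWarning here; building training_df with .copy() would avoid it.
training_df.loc[:, 'TRIP_MINUTES'] = training_df['TRIP_SECONDS']/60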

features = ['TRIP_MILES', 'TRIP_MINUTES']  # features = ['TRIP_MILES', 'TRIP_SECONDS']
label = 'FARE'

model_2 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)

In [None]:
INFO: starting training experiment with features=['TRIP_MILES', 'TRIP_MINUTES'] and label=FARE

Epoch 1/20
<ipython-input-19-8892146dfcd1>:8: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 96.0879 - root_mean_squared_error: 9.7069
Epoch 2/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 31.2299 - root_mean_squared_error: 5.5792
Epoch 3/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 20.7375 - root_mean_squared_error: 4.5379
Epoch 4/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.4589 - root_mean_squared_error: 3.7997
Epoch 5/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 13.8762 - root_mean_squared_error: 3.7199
Epoch 6/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 13.5240 - root_mean_squared_error: 3.6698
Epoch 7/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 12.1855 - root_mean_squared_error: 3.4879
Epoch 8/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 12.6429 - root_mean_squared_error: 3.5509
Epoch 9/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.1179 - root_mean_squared_error: 3.7507
Epoch 10/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 11.4985 - root_mean_squared_error: 3.3820
Epoch 11/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - loss: 11.2461 - root_mean_squared_error: 3.3477
Epoch 12/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 12.0351 - root_mean_squared_error: 3.4657
Epoch 13/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 11.7800 - root_mean_squared_error: 3.4179
Epoch 14/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 14.1723 - root_mean_squared_error: 3.7434
Epoch 15/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 13.0145 - root_mean_squared_error: 3.6026
Epoch 16/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 11.4771 - root_mean_squared_error: 3.3826
Epoch 17/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 11.2512 - root_mean_squared_error: 3.3514
Epoch 18/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - loss: 12.0794 - root_mean_squared_error: 3.4682
Epoch 19/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - loss: 12.1967 - root_mean_squared_error: 3.4893
Epoch 20/20
634/634 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 12.1323 - root_mean_squared_error: 3.4750

SUCCESS: training experiment complete

--------------------------------------------------------------------------------
|                                  MODEL INFO                                  |
--------------------------------------------------------------------------------
Weight for feature[TRIP_MILES]: 2.027
Weight for feature[TRIP_MINUTES]: 0.147
Bias: 3.833

FARE = 2.027 * TRIP_MILES + 0.147 * TRIP_MINUTES + 3.833

In [None]:
# Same experiment rerun with features = ['TRIP_MILES', 'TRIP_SECONDS']
--------------------------------------------------------------------------------
|                                  MODEL INFO                                  |
--------------------------------------------------------------------------------
Weight for feature[TRIP_MILES]: 2.021
Weight for feature[TRIP_SECONDS]: 0.002
Bias: 3.645

FARE = 2.021 * TRIP_MILES + 0.002 * TRIP_SECONDS + 3.645

- Does the model with two features produce better results than one using a single feature?
    - yes: with the same hyperparameters (learning rate 0.001, 20 epochs, batch size 50), the two-feature model has a lower root mean squared error than the one-feature model
- Does it make a difference if you use TRIP_SECONDS instead of TRIP_MINUTES?
    - based on the experiment, with TRIP_SECONDS the loss curve converges very quickly, at about epoch 2, whereas with TRIP_MINUTES it converges at about epoch 5
- How well do you think the model comes to the ground truth fare calculation for Chicago Taxi Trips?
    - based on the model plot, it predicts the fare pretty well.

- Ans:
    - To answer this question for your specific training runs, compare the RMSE for
each model. For example, if the RMSE for the model trained with one feature was
3.7457 and the RMSE for the model with two features is 3.4787, that means that
on average the model with two features makes predictions that are about $0.27
closer to the observed fare.

    - When training a model with more than one feature, it is important that all
numeric values are roughly on the same scale. In this case, TRIP_SECONDS and
TRIP_MILES do not meet this criterion. The mean value for TRIP_MILES is 8.3 and
the mean for TRIP_SECONDS is 1320; that is two orders of magnitude difference.
Converting the trip duration to minutes helps during training because it puts
values for both features on a more comparable scale. Of course, this is not the
only way to scale values before training, but you will learn about that in
another module.

    - In reality, Chicago taxi cabs use a documented formula to determine cab fares.
For a single passenger paying cash, the fare is calculated like this: FARE = 2.25 * TRIP_MILES + 0.12 * TRIP_MINUTES + 3.25

    - Typically with machine learning problems you would not know the 'correct'
formula, but in this case you can use this knowledge to evaluate your model. Take a
look at your model output (the weights and bias) and determine how well it
matches the ground truth fare calculation. You should find that the model is
roughly close to this formula.
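
One way to check how close the trained model is to the documented formula is to evaluate both on a few trips. The sketch below uses the weights and bias reported for the TRIP_MILES/TRIP_MINUTES run in these notes and the formula quoted above; the function names and the three sample trips are hypothetical.

```python
def learned_fare(miles, minutes):
    """Model from the two-feature run above: weights 2.027 and 0.147, bias 3.833."""
    return 2.027 * miles + 0.147 * minutes + 3.833


def documented_fare(miles, minutes):
    """Documented fare formula quoted in the answer above."""
    return 2.25 * miles + 0.12 * minutes + 3.25


# Hypothetical trips as (miles, minutes) pairs.
trips = [(1.0, 5.0), (5.0, 15.0), (10.0, 20.0)]
for miles, minutes in trips:
    learned = learned_fare(miles, minutes)
    documented = documented_fare(miles, minutes)
    print(f"{miles:5.1f} mi, {minutes:5.1f} min: "
          f"learned ${learned:.2f}, documented ${documented:.2f}, "
          f"gap {learned - documented:+.2f}")
```

For these made-up trips the gap stays near one dollar or less, which fits the observation that the model is roughly close to the formula.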

#### P4: Validate model

##### Use the model to make predictions

In [None]:
#@title Code - Define functions to make predictions
def format_currency(x):
  return "${:.2f}".format(x)

def build_batch(df, batch_size):
  batch = df.sample(n=batch_size).copy()
  batch.set_index(np.arange(batch_size), inplace=True)
  return batch

def predict_fare(model, df, features, label, batch_size=50):
  batch = build_batch(df, batch_size)
  predicted_values = model.predict_on_batch(x=batch.loc[:, features].values)

  data = {"PREDICTED_FARE": [], "OBSERVED_FARE": [], "L1_LOSS": [],
          features[0]: [], features[1]: []}
  for i in range(batch_size):
    predicted = predicted_values[i][0]
    observed = batch.at[i, label]
    data["PREDICTED_FARE"].append(format_currency(predicted))
    data["OBSERVED_FARE"].append(format_currency(observed))
    data["L1_LOSS"].append(format_currency(abs(observed - predicted)))
    data[features[0]].append(batch.at[i, features[0]])
    data[features[1]].append("{:.2f}".format(batch.at[i, features[1]]))

  output_df = pd.DataFrame(data)
  return output_df

def show_predictions(output):
  header = "-" * 80
  banner = header + "\n" + "|" + "PREDICTIONS".center(78) + "|" + "\n" + header
  print(banner)
  print(output)
  return

In [None]:
#@title Code - Make predictions

output = predict_fare(model_2, training_df, features, label)
show_predictions(output)

In [None]:
--------------------------------------------------------------------------------
|                                 PREDICTIONS                                  |
--------------------------------------------------------------------------------
   PREDICTED_FARE OBSERVED_FARE L1_LOSS  TRIP_MILES TRIP_SECONDS
0           $5.71         $5.50   $0.21        0.66       294.00
1          $11.55        $13.25   $1.70        2.00      1560.00
2          $23.48        $23.75   $0.27        7.90      1562.00
3           $8.60         $8.75   $0.15        1.69       622.00
4           $5.80         $5.25   $0.55        0.80       217.00
5          $10.31        $10.00   $0.31        2.30       817.00
6          $19.26        $19.74   $0.48        6.63       897.00
7          $28.19        $28.00   $0.19       10.60      1260.00
8          $14.33        $13.75   $0.58        4.61       552.00
9          $13.22        $13.50   $0.28        3.59       936.00
10          $9.18         $9.50   $0.32        1.66       883.00
11          $9.45         $9.50   $0.05        2.06       663.00
12         $31.17        $31.50   $0.33       12.00      1320.00
13         $52.23        $53.00   $0.77       21.73      1886.00
14         $15.73        $15.50   $0.23        5.10       720.00
15          $7.12         $6.75   $0.37        1.21       418.00
16          $6.15         $6.00   $0.15        0.80       360.00
17         $42.66        $42.75   $0.09       17.10      1800.00
18         $25.61        $25.75   $0.14        9.50      1118.00
19         $30.19        $30.50   $0.31       12.09       850.00
20          $9.33         $9.25   $0.08        1.96       697.00
21         $26.95        $26.75   $0.20       10.04      1219.00
22         $30.29        $29.50   $0.79       10.10      2520.00
23         $52.21        $55.25   $3.04       19.48      3716.00
24          $5.96         $5.50   $0.46        0.85       240.00
25         $41.87        $42.00   $0.13       17.00      1560.00
26          $7.66         $7.50   $0.16        1.30       562.00
27         $40.88        $41.50   $0.62       16.59      1497.00
28          $8.92         $8.75   $0.17        1.80       660.00
29          $9.77         $9.50   $0.27        2.20       678.00
30         $61.13        $59.75   $1.38       24.79      2982.00
31         $19.21        $19.74   $0.53        6.55       942.00
32         $29.65        $29.75   $0.10       10.52      1917.00
33         $24.71        $24.50   $0.21        8.00      1980.00
34         $10.08         $9.50   $0.58        2.10       887.00
35          $5.15         $4.75   $0.40        0.60       117.00
36         $26.24        $29.00   $2.76        8.54      2156.00
37          $5.96         $5.50   $0.46        0.82       267.00
38         $15.47        $17.75   $2.28        3.50      1920.00
39          $8.25         $8.00   $0.25        1.62       538.00
40         $10.78        $10.25   $0.53        2.64       726.00
41          $6.61         $6.25   $0.36        0.99       391.00
42         $10.49        $10.46   $0.03        2.31       879.00
43         $12.29        $14.75   $2.46        2.34      1584.00
44         $45.13        $45.25   $0.12       18.17      1925.00
45          $8.07         $8.00   $0.07        1.60       480.00
46         $31.72        $31.25   $0.47       11.10      2280.00
47         $12.17        $12.50   $0.33        2.87      1100.00
48         $38.50        $38.50   $0.00       14.74      2045.00
49         $29.50        $29.50   $0.00       11.27      1245.00

- How close is the predicted value to the label value? In other words, does your model accurately predict the fare for a taxi ride?
    - they are pretty close; most of the L1 losses are under $0.50

- Ans:
    - Based on a random sampling of examples, the model seems to do pretty well
predicting the fare for a taxi ride. Most of the predicted values do not vary
significantly from the observed value. You should be able to see this by looking
at the column L1_LOSS = |observed - predicted|.
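
As a rough sketch of how the L1_LOSS column can be summarized, the snippet below recomputes the per-row L1 loss for the first five rows of the predictions table and averages them. The (predicted, observed) pairs are copied from rows 0 to 4 of the table; the variable names and the $0.50 threshold are my own choices.

```python
# (predicted, observed) fares copied from rows 0-4 of the predictions table.
rows = [(5.71, 5.50), (11.55, 13.25), (23.48, 23.75), (8.60, 8.75), (5.80, 5.25)]

# L1 loss per example: |observed - predicted|.
l1_losses = [abs(observed - predicted) for predicted, observed in rows]

# Mean absolute error over this small sample.
mean_abs_error = sum(l1_losses) / len(l1_losses)

# Fraction of examples whose error is below 50 cents.
share_under_50_cents = sum(loss < 0.50 for loss in l1_losses) / len(l1_losses)

print([f"{loss:.2f}" for loss in l1_losses])  # ['0.21', '1.70', '0.27', '0.15', '0.55']
print(f"MAE: ${mean_abs_error:.3f}")          # MAE: $0.576
print(share_under_50_cents)                   # 0.6
```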