### Linear regression
- models the relationship between features and labels
- equation: $y = b + w_1x_1$, where $y$ is the predicted value (output), $b$ is the bias (the y-intercept, sometimes written as $w_0$), $w_1$ is the weight (the slope $m$), and $x_1$ is the feature (input).
    - bias and weight are calculated/updated during training.
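
The equation above can be evaluated directly. The following is a minimal sketch: the function name `predict` and the parameter values are hypothetical, chosen only for illustration (in practice the bias and weight come from training).

```python
def predict(x1, weight, bias):
    """Return the prediction y = b + w1 * x1 for a single feature value."""
    return bias + weight * x1


# Hypothetical parameters, chosen only for illustration.
bias = 2.0    # b: the predicted value when x1 is 0 (y-intercept)
weight = 3.0  # w1: how much the prediction changes per unit of x1 (slope)

predictions = [predict(x, weight, bias) for x in [0.0, 1.0, 4.0]]
print(predictions)  # [2.0, 5.0, 14.0]
```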

Loss:
- the difference between the predicted values and the actual values; the direction doesn't matter, so take the absolute value or the square of the difference.
- mainly 4 types of loss:
    - $L_1$ loss: $\sum|actual\,value - predicted\,value|$
    - MAE (mean absolute error): $\frac{1}{N}\sum|actual\,value - predicted\,value|$
    - $L_2$ loss: $\sum(actual\,value - predicted\,value)^2$
    - MSE (mean squared error): $\frac{1}{N}\sum(actual\,value - predicted\,value)^2$
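- MSE and MAE are preferred when the loss is computed over multiple examples, because averaging keeps the loss comparable across different numbers of examples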
- choosing a loss function:
    - MSE/$L_2$ loss if the model should fit the data tightly, including outliers: squaring amplifies large errors, so outliers incur a high penalty and the model moves heavily toward them.
    - MAE/$L_1$ loss if the model should be less influenced by outliers.
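
The four losses can be compared on a small example. This is a minimal sketch in plain Python: the function names and the toy values (including the deliberate outlier in the last position) are made up for illustration.

```python
def l1_loss(actual, predicted):
    """Sum of absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error: L1 loss averaged over the N examples."""
    return l1_loss(actual, predicted) / len(actual)


def l2_loss(actual, predicted):
    """Sum of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def mse(actual, predicted):
    """Mean squared error: L2 loss averaged over the N examples."""
    return l2_loss(actual, predicted) / len(actual)


# Toy values; the last example is an outlier with an error of 20.
actual = [3.0, 5.0, 7.0, 30.0]
predicted = [4.0, 5.0, 6.0, 10.0]

print(l1_loss(actual, predicted))  # 22.0
print(mae(actual, predicted))      # 5.5
print(l2_loss(actual, predicted))  # 402.0
print(mse(actual, predicted))      # 100.5
```

The outlier contributes 20 of 22 to the $L_1$ loss but 400 of 402 to the $L_2$ loss, which is why squared losses pull the model harder toward outliers.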


Gradient descent:
- an iterative method that is used to find the optimal values for parameters (weights and bias) that produce the lowest loss.
- steps:
    1. Set weight = 0, bias = 0
    2. Calculate loss using current parameters
    3. Determine the direction to move the weights and bias that reduces loss
        - direction: calculate the slope of the tangent to the loss function at the current weight and bias, i.e., the derivative of the loss function w.r.t. the weight and the bias. The loss decreases in the direction opposite to this slope (the negative gradient).
    4. Move the weight and bias values a small amount (the gradient multiplied by the learning rate) in the direction determined above
    5. Repeat from step 2 until the model **converges**.
- When graphed, the loss surface for a linear model with one feature is **convex**; at its lowest point the slopes with respect to weight and bias are ~ 0.
    - A linear model converges when it has found the minimum loss.
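
The steps above can be sketched in plain Python for a one-feature model with MSE loss. This is a minimal sketch, not the course's implementation: the function name `gradient_descent`, the toy data, the learning rate of 0.05, and the fixed iteration count (used here in place of a real convergence check) are all assumptions for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = bias + weight * x by full-batch gradient descent on MSE."""
    weight, bias = 0.0, 0.0  # step 1: start from zero
    n = len(xs)
    for _ in range(iterations):
        # step 2: prediction errors with the current parameters
        errors = [(bias + weight * x) - y for x, y in zip(xs, ys)]
        # step 3: derivatives of MSE w.r.t. weight and bias
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # step 4: move against the gradient, scaled by the learning rate
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Toy data generated from y = 3 + 2x, so the minimum loss is at w=2, b=3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]

weight, bias = gradient_descent(xs, ys)
print(round(weight, 3), round(bias, 3))  # approximately 2.0 and 3.0
```

On this toy data, a much larger learning rate (for example 0.5) makes the updates overshoot and the parameters diverge instead of settling at the minimum.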

Hyperparameters: 
- control different aspects of training
- 3 common hyperparameters:
    - Learning rate: how quickly the model converges
        - too small -> converges slowly, needs too many iterations
        - too large -> never converges; weights and bias bounce around the minimum
    - Batch size: number of examples the model processes before updating weights and bias
        - **Stochastic gradient descent (SGD)**
            - use only a single example (batch size = 1) per iteration
            - produces noise: variations during training can cause the loss to increase rather than decrease during an iteration
        - **Mini-batch stochastic gradient descent**
            - 1 < batch size < N 
            - the examples in each batch are chosen at random; their gradients are averaged, and weights and bias are updated once per iteration
        - larger batch sizes can help reduce the negative effects of having outliers in the data
    - Epochs
        - one epoch means the model has processed every example in the training set once
        - e.g., with 1000 examples and a mini-batch size of 100, it takes 10 iterations to complete one epoch
        - in general, more epochs produce a better model but take more time to train
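
How batch size and epochs interact can be sketched with a small mini-batch loop. This is an illustrative sketch only: the function name `minibatch_sgd`, the synthetic data, the learning rate, and the seed are assumptions, not part of the course code.

```python
import random


def minibatch_sgd(xs, ys, batch_size, epochs, learning_rate=0.1, seed=0):
    """Train y = bias + weight * x with mini-batch SGD on MSE.

    Returns the parameters and the number of iterations (parameter updates).
    """
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    indices = list(range(len(xs)))
    iterations = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # batches are drawn at random each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(bias + weight * xs[i]) - ys[i] for i in batch]
            # average the gradients over the batch, then update once
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            weight -= learning_rate * grad_w
            bias -= learning_rate * grad_b
            iterations += 1
    return weight, bias, iterations


# 1000 synthetic examples from y = 3 + 2x with x in [0, 1).
xs = [i / 1000 for i in range(1000)]
ys = [3.0 + 2.0 * x for x in xs]

_, _, iterations = minibatch_sgd(xs, ys, batch_size=100, epochs=1)
print(iterations)  # 10 iterations make up one epoch

_, _, sgd_iterations = minibatch_sgd(xs, ys, batch_size=1, epochs=1)
print(sgd_iterations)  # 1000: batch size 1 is plain SGD
```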

### Programming exercises

#### P1: Setup

In [None]:
#@title Code - Load dependencies

# general
import io

# data
import numpy as np
import pandas as pd

# machine learning
import keras

# data visualization
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns

chicago_taxi_dataset = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv")

In [None]:
#@title Code - Read dataset

# Updates dataframe to use specific columns.
training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]

print('Read dataset completed successfully.')
print('Total number of rows: {0}\n\n'.format(len(training_df.index)))
training_df.head(200)

#### P2: Dataset exploration

In [None]:
#@title Code - View dataset statistics

print('Total number of rows: {0}\n\n'.format(len(training_df.index)))
training_df.describe(include='all')

In [None]:
Total number of rows: 31694

| | TRIP_MILES | TRIP_SECONDS | FARE | COMPANY | PAYMENT_TYPE | TIP_RATE |
|---|---|---|---|---|---|---|
| count | 31694.000000 | 31694.000000 | 31694.000000 | 31694 | 31694 | 31694.000000 |
| unique | NaN | NaN | NaN | 31 | 7 | NaN |
| top | NaN | NaN | NaN | Flash Cab | Credit Card | NaN |
| freq | NaN | NaN | NaN | 7887 | 14142 | NaN |
| mean | 8.289463 | 1319.796397 | 23.905210 | NaN | NaN | 12.965785 |
| std | 7.265672 | 928.932873 | 16.970022 | NaN | NaN | 15.517765 |
| min | 0.500000 | 60.000000 | 3.250000 | NaN | NaN | 0.000000 |
| 25% | 1.720000 | 548.000000 | 9.000000 | NaN | NaN | 0.000000 |
| 50% | 5.920000 | 1081.000000 | 18.750000 | NaN | NaN | 12.200000 |
| 75% | 14.500000 | 1888.000000 | 38.750000 | NaN | NaN | 20.800000 |
| max | 68.120000 | 7140.000000 | 159.250000 | NaN | NaN | 648.600000 |

- What is the maximum fare?
    - 159.250000
- What is the mean distance across all trips?
    - 8.289463 miles
- How many cab companies are in the dataset?
    - 31
- What is the most frequent payment type?
    - Credit Card
- Are any features missing data?
    - No

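NaN ("not a number") appears when a statistic cannot be computed or when information is missing. For example, numeric statistics such as mean and std cannot be computed for categorical features like COMPANY, and unique/top/freq are not reported for numeric features.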
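
The NaN pattern can be reproduced on a tiny frame. This is a minimal sketch using a made-up `toy_df` (two columns borrowed by name from the dataset, values invented for illustration), not the Chicago taxi data itself.

```python
import pandas as pd

# Made-up values for illustration only.
toy_df = pd.DataFrame({
    "FARE": [10.0, 20.0, 30.0],
    "PAYMENT_TYPE": ["Cash", "Credit Card", "Credit Card"],
})

stats = toy_df.describe(include="all")
print(stats)

# The mean is undefined for a categorical column, so describe() reports NaN.
print(pd.isna(stats.loc["mean", "PAYMENT_TYPE"]))  # True
# Likewise, "top" is not reported for a numeric column.
print(pd.isna(stats.loc["top", "FARE"]))  # True

# Missing data is checked separately: count the null cells per column.
print(toy_df.isna().sum())
```

A NaN in the `describe` output therefore does not by itself mean the dataset has missing values; the `count` row (or `isna().sum()`) answers that question.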

##### Generate a correlation matrix
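
A correlation matrix for the numeric columns could be produced roughly as follows. This is a hedged sketch on a small made-up DataFrame that reuses column names from `training_df`; the values are invented, and `numeric_only=True` is used so that string columns such as COMPANY are skipped. On the real data the same call would be applied to `training_df`.

```python
import pandas as pd

# Made-up values for illustration; FARE is an exact linear function of TRIP_MILES.
toy_df = pd.DataFrame({
    "TRIP_MILES": [1.0, 2.0, 3.0, 4.0],
    "TRIP_SECONDS": [300, 500, 900, 1100],
    "FARE": [5.0, 7.5, 10.0, 12.5],
    "COMPANY": ["A", "B", "A", "B"],
})

# Pairwise Pearson correlation between numeric columns only.
corr_matrix = toy_df.corr(numeric_only=True)
print(corr_matrix)
```

Values near 1 or -1 indicate a strong linear relationship between two columns; values near 0 indicate little linear relationship.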

#### Train a model