# 1. Evaluating Machine Learning Models

### 1.1 Introduction
Systematically evaluating a machine learning model's performance is crucial for understanding how well it generalizes to new data.

### 1.2 Data Splitting
The first step is to split your data into two parts:

- **Training set** - Fit the model on this subset of the data.
- **Test set** - Evaluate the model on this subset of the data.

```python
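from sklearn.model_selection import train_test_split

# Hold out 30% of the examples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)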
```

### 1.3 Regression Model Evaluation
For regression models, where the task is predicting a continuous target value, two important metrics are:

- **Training error** - Average error on training data. Lower values indicate better fit.
- **Test error** - Average error on held-out test data. Lower values indicate better generalization.

For example, with a squared error cost function:

```python
# average training error (m_train = number of training examples)
J_train = (1 / (2 * m_train)) * sum((y_train - y_train_pred) ** 2)

# average test error (m_test = number of test examples)
J_test = (1 / (2 * m_test)) * sum((y_test - y_test_pred) ** 2)
```

A large gap between training and test error indicates overfitting.
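As a minimal sketch of the squared error cost, assuming NumPy arrays of targets and predictions: the helper name `squared_error_cost` and the toy values are illustrative and not part of the original notes.

```python
import numpy as np

def squared_error_cost(y_true, y_pred):
    """Return (1 / (2 * m)) * sum of squared errors, where m = len(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    m = y_true.shape[0]
    return np.sum((y_true - y_pred) ** 2) / (2 * m)

# Toy values chosen only for illustration
J_train = squared_error_cost([1.0, 2.0, 3.0], [1.1, 1.9, 3.0])
J_test = squared_error_cost([4.0, 5.0], [3.0, 6.5])
print(J_train, J_test)
```

Here the test cost is far larger than the training cost, which is the pattern the text associates with overfitting.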

### 1.4 Classification Model Evaluation
For classification models, where the task is predicting a discrete class label, two useful metrics are:

- **Training accuracy** - Fraction of correct predictions on training data. Higher is better.
- **Test accuracy** - Fraction of correct predictions on held-out test data. Higher is better.

```python
# fraction correct on training set
accuracy_train = sum(y_train == y_train_pred) / m_train

# fraction correct on test set
accuracy_test = sum(y_test == y_test_pred) / m_test
```

Again, a large gap between training and test accuracy indicates overfitting.

### 1.5 Conclusion
By evaluating models on held-out test data, you can identify overfitting and select the best model for your problem. The techniques here form the basis for more advanced methods like cross-validation.

<hr>

# 2. Model Selection and Evaluation

### 2.1 Introduction
Choosing the right machine learning model and evaluating its performance are crucial steps in applying ML to real-world problems. 

### 2.2 Training, Validation, and Test Sets
Best practice is to split data into 3 sets:

- **Training set:** Used to fit the machine learning model parameters. We want the model to learn from these examples.
- **Validation set:** Used to compare candidate models and tune hyperparameters. Because it is not used to fit the parameters, it exposes models that overfit the training data.
- **Test set:** Used to provide an unbiased evaluation of the final model. Since this data is never used for model selection or hyperparameter tuning, it gives a true estimate of model performance.

```python
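data = load_data()  # assumed to return a shuffled, sliceable sequence of examples
n = len(data)

# 60% for training
train_set = data[:int(0.6 * n)]

# 20% for validation
val_set = data[int(0.6 * n):int(0.8 * n)]

# 20% for testing
test_set = data[int(0.8 * n):]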
```

### 2.3 Model Selection
We train several candidate models on the training set:

```python
# Train 3 different polynomial models
model1 = train_polynomial(train_set, degree=1) 
model2 = train_polynomial(train_set, degree=3)
model3 = train_polynomial(train_set, degree=5)
```

Then we evaluate them on the validation set:

```python
# Evaluate models on validation data
val_error1 = evaluate(model1, val_set)
val_error2 = evaluate(model2, val_set) 
val_error3 = evaluate(model3, val_set)
```

We select the best performing model based on validation error:

```python
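# Pick the model with the lowest validation error
candidates = [(val_error1, model1), (val_error2, model2), (val_error3, model3)]
best_model = min(candidates, key=lambda pair: pair[0])[1]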
```

### 2.4 Model Evaluation
Finally, we evaluate the selected model on the test set to estimate generalization performance:
    
```python
# Evaluate selected model on test set 
test_error = evaluate(best_model, test_set)

print('Expected test error: ', test_error)
```

Since the test set was not used during training or model selection, it provides an unbiased estimate of how the model will perform on new data.
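The whole procedure can be sketched end to end. This is an illustrative version only: it assumes synthetic data from a noisy cubic, uses `np.polyfit` in place of the notes' `train_polynomial`, and uses a hypothetical `mse` helper in place of `evaluate`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy cubic
x = rng.uniform(-2, 2, size=200)
y = x**3 - x + rng.normal(scale=0.5, size=x.shape)

# Shuffle the indices, then split 60% / 20% / 20%
idx = rng.permutation(len(x))
n_train, n_val = int(0.6 * len(x)), int(0.2 * len(x))
train, val, test = np.split(idx, [n_train, n_train + n_val])

def mse(coeffs, ids):
    """Mean squared error of a fitted polynomial on the given indices."""
    return np.mean((np.polyval(coeffs, x[ids]) - y[ids]) ** 2)

# Fit each candidate on the training set only
models = {d: np.polyfit(x[train], y[train], deg=d) for d in (1, 3, 5)}

# Select on the validation set, then touch the test set exactly once
val_errors = {d: mse(c, val) for d, c in models.items()}
best_degree = min(val_errors, key=val_errors.get)
test_error = mse(models[best_degree], test)
print(best_degree, val_errors, test_error)
```

The test set is read only in the final step, so it plays no part in choosing the degree.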

### 2.5 Key Points
- Training set: Used for learning model parameters
- Validation set: Used for model selection
- Test set: Used for evaluation of final model
- Keep the test set fully isolated so that the final performance estimate is not optimistically biased

<hr>

# 3. Machine Learning Model Development Workflow

### 3.1 Introduction
Developing a machine learning system is usually a continuous cycle of idea generation, model training, and evaluation. It is rare for a model to perform exceptionally well on its first run, so a key part of the process is deciding what to do next to improve performance. A practical way to make that decision is to examine the model's bias and variance, which indicate why the model is falling short and which changes are likely to help.

### 3.2 Bias and Variance
In machine learning, bias refers to an algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data. **A high bias algorithm often results in underfitting,** where the model is too simple to capture the complexity of the data and does not perform well.

On the other hand, variance refers to an algorithm's sensitivity to small fluctuations in the training set. **A high variance algorithm results in overfitting,** where the model performs well on the training data but does not generalize well to unseen data.

### 3.3 Linear Regression Example
Consider a dataset that we want to fit a model to. We have a few options:
- Fitting a straight line (linear model) might not work well because it underfits the data, hence resulting in a high bias.
- Fitting a high-degree polynomial may overfit the data, capturing not only the underlying trend but also the noise in the data, leading to high variance.
- A model with a degree in between, like a quadratic polynomial, might be a good fit, neither underfitting nor overfitting the data.

### 3.4 Bias-Variance Trade-off
The concepts of bias and variance are tied to the model's performance on the training set and the cross-validation set:

- When a model underfits (high bias), it performs poorly on both the training set ($J_{train}$ is high) and the cross-validation set ($J_{cv}$ is high).
> $$J_{train} \approx J_{cv} \approx \text{high}$$

- When a model overfits (high variance), it performs well on the training set ($J_{train}$ is low) but poorly on the cross-validation set ($J_{cv}$ is high).
> $$J_{cv} \gg J_{train}$$

- These scenarios reflect the well-known bias-variance trade-off in machine learning: reducing a model's bias by making it more flexible tends to increase its variance, and vice versa.

As the model complexity increases (degree of the polynomial increases), the training error ($J_{train}$) decreases — the model fits the training data better.

However, the cross-validation error ($J_{cv}$) decreases initially, reaches a minimum, and then starts increasing. When the model is too simple, it underfits the data, leading to high $J_{cv}$. When the model is too complex, it overfits the training data, and $J_{cv}$ increases.

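Choosing a model complexity near the minimum of the cross-validation error, where the training error is also reasonably low, is key to building a successful machine learning model. The training error alone cannot guide this choice, because it keeps decreasing as complexity grows.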

<hr>

# 4. Learning Algorithm Performance: The Impact of Regularization Parameter

The choice of the regularization parameter (Lambda or $\lambda$) influences the bias and variance, ultimately affecting the overall performance of the learning algorithm.

### 4.1 Regularization Parameter and Model Complexity
We begin with the exploration of the model's behavior under extreme regularization values.

- When $\lambda$ is set to a large value, the model's parameters are compelled to remain small to minimize the regularization term in the cost function. Consequently, this produces a simplistic model with high bias (underfitting) as it doesn't fit the training data well.

- On the other extreme, if $\lambda$ is set to a small value (even zero), the regularization effect is essentially nullified. The model tends to overfit the data, as seen in the form of a high-variance model.

An optimal value of $\lambda$ will ideally strike a balance between these extremes, resulting in a model that fits the data well with relatively low training and cross-validation errors.

### 4.2 Selecting Regularization Parameter Using Cross-validation
To identify a suitable value for $\lambda$, we can utilize cross-validation. This process involves iteratively fitting the model with different $\lambda$ values and computing the corresponding cross-validation error. The $\lambda$ value that yields the lowest cross-validation error is then chosen as the optimal parameter.

### 4.3 Understanding Error as a Function of Lambda
- For smaller $\lambda$ values, the model has high variance (overfitting), which results in a low training error but a high cross-validation error.

- For larger $\lambda$ values, the model suffers from high bias (underfitting), leading to high training and cross-validation errors.
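The selection procedure can be sketched with ridge regression, which has a closed-form solution. Everything here is an illustrative assumption rather than the notes' own setup: the synthetic data, the degree-8 polynomial features, the grid of $\lambda$ values, and the helper names `fit_ridge` and `cost`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data with degree-8 polynomial features
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)
X = np.vander(x, N=9, increasing=True)  # columns: 1, x, ..., x^8

X_train, y_train = X[:40], y[:40]
X_cv, y_cv = X[40:], y[40:]

def fit_ridge(X, y, lam):
    """Closed-form ridge regression; the intercept column is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cost(X, y, w):
    """Unregularized squared-error cost, (1 / (2m)) * sum of squared errors."""
    return np.sum((X @ w - y) ** 2) / (2 * len(y))

lambdas = [0.0, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
results = []
for lam in lambdas:
    w = fit_ridge(X_train, y_train, lam)
    results.append((lam, cost(X_train, y_train, w), cost(X_cv, y_cv, w)))

# Choose the lambda with the lowest cross-validation error
best_lambda = min(results, key=lambda r: r[2])[0]
for lam, j_train, j_cv in results:
    print(f"lambda={lam:<7} J_train={j_train:.4f} J_cv={j_cv:.4f}")
print("selected lambda:", best_lambda)
```

Note that both reported costs omit the regularization term, so they measure fit rather than the penalized objective.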

### 4.4 Frobenius Norm
In a neural network, the regularization term is the sum of the squares of all the weights in the network. For each layer, this is the squared Frobenius norm of the weight matrix, denoted $||W^{[l]}||_F^2$. The matrix $W^{[l]}$ has dimensions $n^{[l]} \times n^{[l-1]}$, where $n^{[l]}$ is the number of units in layer $l$ and $n^{[l-1]}$ is the number of units in layer $l-1$. The squared Frobenius norm is computed as follows:

$$||W^{[l]}||_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2$$

Therefore, the cost function with regularization is given by:

$$J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||W^{[l]}||_F^2$$
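A small numeric sketch of the regularized cost, assuming the unregularized data loss has already been computed: the function name `l2_regularized_cost` and the toy weight matrices are illustrative.

```python
import numpy as np

def l2_regularized_cost(data_loss, weights, lam, m):
    """Add (lam / (2 * m)) * sum of squared Frobenius norms to the data loss."""
    frobenius_sq = sum(np.sum(np.square(W)) for W in weights)
    return data_loss + (lam / (2 * m)) * frobenius_sq

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])   # squared norm: 1 + 4 + 0.25 = 5.25
W2 = np.array([[3.0, 1.0]])                # squared norm: 9 + 1 = 10
J = l2_regularized_cost(data_loss=0.4, weights=[W1, W2], lam=0.1, m=5)
print(J)  # 0.4 + (0.1 / 10) * 15.25 = 0.5525, up to floating-point rounding
```

Only the weight matrices are penalized; the bias vectors are left out, matching the formula.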

### 4.5 Dropout Regularization
Dropout is a regularization technique that randomly drops out a fraction of the units in a layer during the training process. This helps to prevent overfitting and improve the model's generalization ability. The fraction of units to be dropped out is a hyperparameter that can be tuned using cross-validation.

#### 4.5.1 Inverted Dropout
Inverted dropout is the variant of dropout most commonly used in practice. After units are dropped, the activations of the remaining units are scaled by a factor of $\frac{1}{1-p}$, where $p$ is the dropout probability (equivalently, they are divided by the keep probability $1-p$). This keeps the expected value of the activations the same during training and testing, so no extra scaling is needed at test time.

```python
"""
This script demonstrates how to apply inverted dropout on the activation matrix of layer 3. It includes three main steps:
1. Generate a dropout mask: Create a boolean mask for randomly setting a fraction of activations to zero.
2. Apply Dropout: Perform element-wise multiplication of the activation matrix with the dropout mask.
3. Scaling the Activations: Scale the remaining activations to ensure the expected value remains unchanged.

Parameters:
- keep_prob: The probability of keeping a unit active. Set to 0.8, meaning 80% of the neurons are retained.
- a3: The activation from layer 3, a NumPy array with shape (num_units, num_examples).
"""

import numpy as np

keep_prob = 0.8

# Step 1: Generate a dropout mask
mask = np.random.rand(*a3.shape) < keep_prob

# Step 2: Apply Dropout
a3 = np.multiply(a3, mask)

# Step 3: Scale the activations
a3 /= keep_prob

# The modified activation matrix a3 is now ready for forward propagation to the next layer.
```

Dropout was first widely adopted in computer vision, where the input data is high-dimensional and there is rarely enough data to avoid overfitting. In other domains, whether it helps depends on how much the model overfits, so it is best treated as a tool for reducing variance rather than as a default.

One downside of dropout is that the cost function is no longer well-defined. Because units are dropped at random on every iteration, the cost computed during training is a random variable and need not decrease from one iteration to the next. This makes it hard to use the cost to monitor the training process (e.g. to check for convergence or a reduction in cost over many iterations). A common workaround is to first set keep_prob to 1, which turns dropout off, confirm that the cost decreases steadily, and then turn dropout back on for training. Dropout is not applied at test time.
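The difference between training-time and test-time behaviour can be sketched as a single function. The name `inverted_dropout`, the `training` flag, and the all-ones activation matrix are assumptions made for illustration.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng, training=True):
    """Apply inverted dropout during training; return `a` unchanged otherwise."""
    if not training or keep_prob >= 1.0:
        return a
    mask = rng.random(a.shape) < keep_prob   # True for units that are kept
    return a * mask / keep_prob              # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a3 = np.ones((50, 2000))                     # stand-in activations
a3_train = inverted_dropout(a3, keep_prob=0.8, rng=rng)
a3_test = inverted_dropout(a3, keep_prob=0.8, rng=rng, training=False)
print(a3_train.mean(), a3_test.mean())       # both close to 1
```

Each kept unit becomes 1 / 0.8 = 1.25 and each dropped unit becomes 0, so the average stays near the original value of 1.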

### 4.6 Vanishing and Exploding Gradients
The vanishing gradient problem occurs when the gradients become increasingly smaller as the number of layers increases. This is due to the repeated multiplication of small numbers (the gradients) in the backpropagation process. Consequently, the weights in the earlier layers are updated very slowly, which leads to a slow training process.

On the other hand, the exploding gradient problem occurs when the gradients become increasingly larger as the number of layers increases. This is due to the repeated multiplication of large numbers (the gradients) in the backpropagation process. Consequently, the weights in the earlier layers are updated very quickly, which leads to an unstable training process.
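A toy illustration of the repeated-multiplication effect, assuming a deep linear network in which every layer uses the same weight matrix `scale * I`: in that setting the backpropagated gradient is multiplied by the same matrices, so it shrinks or grows at the same rate. The helper name `signal_norm_after` is hypothetical.

```python
import numpy as np

def signal_norm_after(depth, scale, width=4):
    """Norm of a vector after passing through `depth` linear layers W = scale * I."""
    W = scale * np.eye(width)
    v = np.ones(width)
    for _ in range(depth):
        v = W @ v
    return np.linalg.norm(v)

for scale in (0.5, 1.0, 1.5):
    print(scale, signal_norm_after(depth=50, scale=scale))
```

With 50 layers, a scale of 0.5 drives the norm towards zero while a scale of 1.5 makes it enormous; only a scale of 1 keeps it stable.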

### 4.7 Weight Initialization for Deep Networks
The choice of weight initialization method can have a significant impact on the training process of a deep neural network. A good initialization method sets the scale of the random weights so that activations and gradients neither shrink nor grow too much from layer to layer. This helps to partially alleviate the vanishing and exploding gradient problems.

#### 4.7.1 Xavier Initialization
Xavier initialization is a popular weight initialization method that is commonly used in deep neural networks. It involves initializing the weights to random values sampled from a zero-mean distribution (a Gaussian in the expression below) with a variance of $\frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of units in the previous layer. This keeps the weights small, with a scale that shrinks as the number of inputs to the layer grows.

$$ W^{[l]} = \text{np.random.randn}(n^{[l]}, n^{[l-1]}) \cdot \sqrt{\frac{1}{n^{[l-1]}}} $$

This effectively sets the variance to $\text{Var}(W^{[l]}) = \frac{1}{n}$, where $n = n^{[l-1]}$.

- With tanh activations, a variance of $\frac{1}{n}$ (Xavier initialization) works well.
- With ReLU activations, a variance of $\frac{2}{n}$ (often called He initialization) is commonly used instead.
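A minimal sketch of variance-scaled initialization, assuming Gaussian weights with variance $\frac{1}{n}$ for tanh layers (Xavier) and $\frac{2}{n}$ for ReLU layers (He), where $n$ is the number of units in the previous layer. The function name `init_weights` and its `activation` argument are illustrative.

```python
import numpy as np

def init_weights(n_out, n_in, activation="tanh", rng=None):
    """Gaussian weights with variance 2/n_in for ReLU, 1/n_in otherwise."""
    rng = np.random.default_rng() if rng is None else rng
    variance = 2.0 / n_in if activation == "relu" else 1.0 / n_in
    return rng.standard_normal((n_out, n_in)) * np.sqrt(variance)

rng = np.random.default_rng(0)
W_xavier = init_weights(256, 512, "tanh", rng)
W_he = init_weights(256, 512, "relu", rng)

# Empirical variance times n_in should be close to 1 and 2 respectively
print(W_xavier.var() * 512, W_he.var() * 512)
```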

### 4.8 Gradient Checking (Grad Check)
Gradient checking is a technique for verifying the correctness of the gradient computations in a neural network. It involves comparing the gradients computed using backpropagation with the numerical gradients computed using finite differences. If the relative error between the two gradients is small (e.g. less than $10^{-7}$), then we can be confident that the gradient computations are correct.

#### Gradient Checking Implementation
```python
"""Gradient Checking in Python"""

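import numpy as np

# Assumes forward_prop(X, Y, theta) returns (cost, gradient), with the gradient
# flattened in the same order as theta.

# Step 1: Reshape the weight matrices and bias vectors into a single vector theta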
theta = np.concatenate((W1.flatten(), b1.flatten(), W2.flatten(), b2.flatten()))

# Step 2: Reshape the gradient matrices and vectors into a single column vector dtheta
dtheta = np.concatenate((dW1.flatten(), db1.flatten(), dW2.flatten(), db2.flatten()))

# Step 3: Compute the numerical gradient using finite differences
epsilon = 1e-7

# Initialize the gradient vector
num_grad = np.zeros(dtheta.shape)

# Iterate over each element in theta
for i in range(theta.shape[0]):
    # Compute the cost at theta + epsilon
    theta_plus = np.copy(theta)
    theta_plus[i] += epsilon
    J_plus, _ = forward_prop(X, Y, theta_plus)

    # Compute the cost at theta - epsilon
    theta_minus = np.copy(theta)
    theta_minus[i] -= epsilon
    J_minus, _ = forward_prop(X, Y, theta_minus)

    # Compute the numerical gradient
    num_grad[i] = (J_plus - J_minus) / (2 * epsilon)

# Step 4: Compute the gradient using backpropagation
_, grad = forward_prop(X, Y, theta)

# Step 5: Compute the relative error between the numerical gradient and the gradient computed using backpropagation
diff = np.linalg.norm(num_grad - grad) / (np.linalg.norm(num_grad) + np.linalg.norm(grad))
```

#### Gradient Checking Implementation Notes
- Gradient checking is a computationally expensive process. It is usually performed only during the debugging process to verify the correctness of the gradient computations.
- Gradient checking is performed on a small subset of the training data (e.g. 100 examples) to reduce the computational cost.
- Gradient checking does not work with dropout regularization, as the cost function is no longer well-defined. Turn dropout off (keep_prob = 1) while running the check.
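For a version that runs on its own, here is a sketch of the same check applied to logistic regression, a model small enough to write out in full. The model choice, the random data, and the names `cost_and_grad` and `gradient_check` are assumptions for illustration, not the notes' `forward_prop`.

```python
import numpy as np

def cost_and_grad(theta, X, y):
    """Logistic-regression cost and its analytic gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = X.T @ (p - y) / len(y)
    return cost, grad

def gradient_check(theta, X, y, epsilon=1e-7):
    """Relative difference between analytic and two-sided numerical gradients."""
    _, grad = cost_and_grad(theta, X, y)
    num_grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = epsilon
        j_plus, _ = cost_and_grad(theta + step, X, y)
        j_minus, _ = cost_and_grad(theta - step, X, y)
        num_grad[i] = (j_plus - j_minus) / (2 * epsilon)
    return np.linalg.norm(num_grad - grad) / (np.linalg.norm(num_grad) + np.linalg.norm(grad))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.5).astype(float)
theta = rng.normal(size=3)
print(gradient_check(theta, X, y))
```

A very small relative difference suggests the analytic gradient is implemented correctly; a value several orders of magnitude larger usually points to a bug.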

<hr>

# 5. Human-level Performance Benchmarking: Understanding High Bias and High Variance in Learning Algorithms
In this resource, we illustrate how to determine whether a learning algorithm has high bias or high variance using concrete examples from the domain of speech recognition.

### 5.1 Speech Recognition: A Case Study
Speech recognition systems are increasingly being utilized for various tasks such as web search on mobile phones. These systems transcribe audio clips into text, such as "What is today's weather?" or "Coffee shops near me."

### 5.2 Measuring Training and Cross-validation Errors
Let's suppose that our speech recognition system achieves a training error of 10.8%, meaning that it fails to transcribe 10.8% of the audio clips from the training set perfectly. When tested on a separate cross-validation set, it gets 14.8% error. These numbers may initially suggest high bias, as the system is getting more than 10% of the training set wrong.

### 5.3 Benchmarking Against Human-level Performance
However, when dealing with tasks such as speech recognition, it's important to consider the human-level performance. For instance, even fluent speakers may achieve a transcription error rate of 10.6% due to various factors such as noisy audio.

By benchmarking against this human-level performance, we realize that our learning algorithm's training error (10.8%) is just slightly worse than humans (10.6%). Meanwhile, the cross-validation error (14.8%) is significantly higher than both the training error and human-level performance, indicating high variance.

### 5.4 Establishing a Baseline Level of Performance
Establishing a baseline level of performance is crucial in understanding whether an algorithm has high bias or high variance. This baseline can be set by human-level performance or by another competing algorithm's performance. Once we have this baseline level, we can measure:

- The difference between the training error and the baseline level. If this is large, we say that the algorithm has a high bias problem.
- The difference between the training error and the cross-validation error. If this is large, we say that the algorithm has a high variance problem.

For some tasks, the baseline level of performance could be 0%, indicating perfect performance. However, in tasks like speech recognition, where some audio can be noisy, the baseline level could be higher.

### 5.5 High Bias and High Variance
An algorithm can potentially suffer from both high bias and high variance. For instance, if the baseline performance, training error, and cross-validation error yield significant gaps, it would indicate that the algorithm has both high bias (for not achieving baseline performance) and high variance (for the high gap between training and cross-validation errors).
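The two gaps can be computed directly. In this sketch the function name `diagnose` and the 2-percentage-point `threshold` are arbitrary assumptions; what counts as a large gap depends on the application.

```python
def diagnose(baseline_error, train_error, cv_error, threshold=0.02):
    """Label bias and variance from error gaps; `threshold` is an arbitrary cutoff."""
    bias_gap = train_error - baseline_error
    variance_gap = cv_error - train_error
    return {
        "bias_gap": bias_gap,
        "variance_gap": variance_gap,
        "high_bias": bias_gap > threshold,
        "high_variance": variance_gap > threshold,
    }

# Numbers from the speech recognition example above
report = diagnose(baseline_error=0.106, train_error=0.108, cv_error=0.148)
print(report)
```

For these numbers the bias gap is about 0.2 percentage points and the variance gap is about 4 percentage points, so only the variance flag is set.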

<hr>

# 6. Learning Curves - Understanding Your Model's Performance
Learning curves are a powerful tool for understanding the performance of a learning algorithm with respect to the amount of experience it has (for instance, the number of training examples).

### 6.1 Training Error and Cross-Validation Error
As the training set size increases, we observe that the cross-validation error generally decreases. This makes sense as with more data, the algorithm learns a better model.

In contrast, the training error increases as the training set size increases. For a fixed model, such as a quadratic function, it becomes increasingly difficult to fit every training example perfectly as more examples are added.

It is important to note that the cross-validation error is typically higher than the training error since the parameters are fit to the training set. Hence, the model is expected to perform better on the training set than on the cross-validation set.

### 6.2 High Bias vs High Variance
Learning curves can also shed light on whether a model suffers from high bias (underfitting) or high variance (overfitting).

A high bias scenario is seen when a simple linear function is fitted. The training and cross-validation errors both tend to flatten out after a while. This is because the model doesn't change much more with the addition of more examples, hence the errors plateau. **If your learning algorithm has high bias, collecting more training data won't significantly improve the performance.** In this case, you should focus on improving the model (using a more complex model, decreasing regularization, etc.).

In a high variance scenario, the training error increases gradually with the increase in training set size and the cross-validation error is significantly higher than the training error. This indicates that the model performs much better on the training set than on the cross-validation set. **In this case, increasing the training set size could help a lot as it reduces the cross-validation error and improves the performance of the algorithm.**
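A learning curve can be computed by refitting the model on growing prefixes of the training set. This sketch assumes synthetic data from a noisy quadratic and polynomial models fitted with `np.polyfit`; the helper name `learning_curve` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy quadratic
x = rng.uniform(-2, 2, size=300)
y = x**2 + rng.normal(scale=0.5, size=x.shape)
x_train, y_train, x_cv, y_cv = x[:200], y[:200], x[200:], y[200:]

def learning_curve(degree, sizes):
    """Return (m, J_train, J_cv) for models fitted on the first m training examples."""
    rows = []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], deg=degree)
        j_train = np.mean((np.polyval(coeffs, x_train[:m]) - y_train[:m]) ** 2) / 2
        j_cv = np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2) / 2
        rows.append((m, j_train, j_cv))
    return rows

# A straight line fitted to quadratic data: the high bias case
for m, j_train, j_cv in learning_curve(degree=1, sizes=[5, 20, 50, 100, 200]):
    print(f"m={m:<4} J_train={j_train:.3f} J_cv={j_cv:.3f}")
```

Plotting the two error columns against `m` gives the learning curve; calling `learning_curve` with a higher degree shows how the curves change for a more flexible model.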

<hr>

# 7. Machine Learning Development: Iterative Process

Developing a machine learning model often involves an iterative loop:

**1. Decide the architecture of your system:** Choose your machine learning model, decide on what data to use, pick the hyperparameters, etc.

**2. Implement and train a model:** It is important to note that a model almost never works as expected in the first iteration.

**3. Implement diagnostics:** Look at the bias and variance of your algorithm. Based on the insights from these diagnostics, you can then make decisions to improve your model.

**4. Modify the model or data:** Based on the insights from the diagnostics, you can then decide whether you want to change the architecture of your system, add more data, add or subtract features, etc.

**5. Go back to step one** and iterate until you achieve the performance you desire.

<div style="align=center">
    <img src="media/iterative_process.png" width=600>
</div>

The following sections will provide a walkthrough of this process using an example project: building an email spam classifier.

### 7.1 Email Spam Classifier: Project Description
Email spam is a nuisance for many, and this project aims to mitigate it. The goal is to develop a classifier that can differentiate between spam and non-spam emails. This is an example of text classification, where we use a supervised learning algorithm with features derived from the emails and corresponding labels indicating whether an email is spam or not.

For instance, we could construct a feature vector using the 10,000 most common English words and assign binary values indicating whether a word appears in a given email. Alternative feature construction methods could include counting the number of times each word appears in an email. Once these features are established, a classification algorithm like logistic regression or a neural network can be trained to predict whether an email is spam or not.
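The binary feature construction can be sketched as follows. The five-word vocabulary, the tokenizing regular expression, and the function name `binary_features` are illustrative assumptions.

```python
import re

def binary_features(email_text, vocabulary):
    """Return a 0/1 list: 1 if the vocabulary word appears in the email."""
    words = set(re.findall(r"[a-z0-9']+", email_text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Tiny hypothetical vocabulary; a real one might hold the 10,000 most common words
vocabulary = ["a", "andrew", "buy", "deal", "discount"]
x = binary_features("Buy now! Best DEAL, big discount.", vocabulary)
print(x)  # [0, 0, 1, 1, 1]
```

Replacing the membership test with a count of occurrences gives the alternative word-count features mentioned above.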

### 7.2 Next Steps
If the initial model's performance isn't as high as desired, you may consider different strategies for improvement. These could involve collecting more data, creating more sophisticated features based on email routing information or the body of the email, or even developing algorithms to detect deliberate misspellings common in spam emails.

However, it can be challenging to determine which strategies are the most promising. Using diagnostics like bias and variance, and performing error analysis can guide you towards more effective modifications to your model. This is the essence of the iterative loop of machine learning development.

### 7.3 Error Analysis
Error analysis is a process that helps you understand where your model is failing. It involves the following steps:

- **Look at misclassified examples:** From your validation set, find examples that your algorithm has misclassified.
- **Identify common themes:** Go through these misclassified examples and try to identify common properties or traits.
- **Count the categories of errors:** Try to categorize the errors into groups, and count how many errors belong to each category.
- **Prioritize:** Based on this analysis, you can decide which areas need immediate attention and which ones can be deprioritized.

Remember that these categories can overlap, and that you might need to randomly sample a subset of your errors if your validation set is large. The error analysis process can help you gain inspiration for what might be useful to try next and sometimes it can also tell you that certain types of errors are sufficiently rare that they aren't worth as much of your time to try to fix.
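Counting the categories is simple bookkeeping once the misclassified examples have been tagged by hand. The tags and examples below are hypothetical; an example can carry more than one tag, since categories can overlap.

```python
from collections import Counter

# Hypothetical hand-assigned tags for misclassified validation emails
misclassified = [
    {"id": 1, "tags": ["pharma"]},
    {"id": 2, "tags": ["phishing", "deliberate_misspelling"]},
    {"id": 3, "tags": ["pharma"]},
    {"id": 4, "tags": ["unusual_routing"]},
    {"id": 5, "tags": ["pharma", "deliberate_misspelling"]},
]

counts = Counter(tag for example in misclassified for tag in example["tags"])
for tag, n in counts.most_common():
    print(f"{tag}: {n} of {len(misclassified)}")
```

The most frequent categories are the natural candidates to prioritize.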

<hr>

# 8. Machine Learning Data Engineering Techniques
This section covers strategies for managing and enhancing data for machine learning applications.

### 8.1 Adding More Data
Machine learning applications often require large amounts of data. However, obtaining more data for every type can be time-consuming and expensive. **Instead, focus on adding data where analysis indicates it might help the most.** For instance, if error analysis reveals that your model struggles with identifying a particular type of spam email, target your efforts on gathering more examples of that type of spam. This approach can make your algorithm more proficient in identifying that particular type of spam.

### 8.2 Data Augmentation
This technique involves creating new training examples by slightly modifying existing ones. It's especially useful for image and audio data. For instance, an image of a letter can be rotated, enlarged, shrunk, or have its contrast altered to create a new training example. These changes teach the algorithm that these modifications don't change the fact that it's still the same letter.

### 8.3 Advanced Data Augmentation
This method takes data augmentation to the next level. For example, you can superimpose a grid on a letter and introduce random distortions to create a vast array of new training examples. This approach helps the learning algorithm learn more robustly.

### 8.4 Data Augmentation for Audio Data
One method of augmenting audio data is by adding background noise to an original audio clip, thereby creating an artificial scenario where someone is speaking in a noisy environment. This strategy can significantly increase your training dataset size.
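A minimal sketch of mixing background noise into a clip at a chosen signal-to-noise ratio, assuming both signals are 1-D NumPy arrays at the same sample rate. The function name `add_background_noise`, the sine-wave stand-in for speech, and the random stand-in for noise are all illustrative.

```python
import numpy as np

def add_background_noise(clip, noise, snr_db):
    """Mix `noise` into `clip` at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, clip.shape)          # repeat or trim noise to match length
    clip_power = np.mean(clip ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clip_power / (noise_power * 10 ** (snr_db / 10)))
    return clip + scale * noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
clip = np.sin(2 * np.pi * 440 * t)                # stand-in for a speech clip
noise = rng.normal(size=8000)                     # stand-in for recorded background noise
augmented = add_background_noise(clip, noise, snr_db=10)
print(augmented.shape)
```

The label (the transcript) is unchanged, so each original clip can yield several new training examples with different noise recordings or noise levels.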

### 8.5 Data Synthesis
This technique involves creating new examples from scratch, rather than modifying existing examples. A great example of data synthesis is in photo optical character recognition (OCR) tasks. You can generate synthetic data that looks very similar to real-world images by typing random text in different fonts, colors, and contrasts and then capturing screenshots.

### 8.6 Data-centric Approach
Instead of focusing on improving the code or the algorithm, it might be more productive to focus on engineering the data used by your algorithm. Techniques such as targeted data collection, data augmentation, and data synthesis can help improve your learning algorithm's performance.

### 8.7 Transfer Learning
This technique is useful when you don't have much data and it's difficult to obtain more. Transfer learning involves using data from a different, albeit somewhat related, task to improve your algorithm's performance on your application. This strategy is not applicable to every situation, but it can be highly effective when it does apply.

<hr>

# 9. Transfer Learning
Transfer learning is a highly effective technique, particularly useful in applications where data is limited. It works by utilizing data from a different but related task to help your application. 

### 9.1 What is Transfer Learning?
Assume you need to recognize handwritten digits from zero through nine, but you have only a small amount of labeled data. You can, however, find a large dataset of one million images comprising a thousand classes (e.g., cats, dogs, cars, people). The idea is to first train a neural network on this large dataset to recognize any of the 1,000 classes, and then reuse what it has learned for the digit task.

During the training process, the network learns parameters for each layer. To apply transfer learning, copy the neural network, retaining parameters of all layers except the last one, which is replaced with a smaller output layer having just ten (instead of 1,000) output units corresponding to the digit classes (0-9) you want the network to recognize.

Note that the parameters of the last layer can't be copied because the layer's dimension has changed. Therefore, new parameters need to be trained from scratch. Then, run an optimization algorithm with these parameters to further train the network.

There are two ways to train this neural network:

- **Option 1:** Only train the output layer's parameters and leave the others unchanged.
- **Option 2:** Train all the parameters in the network, initializing every layer except the new output layer with the pre-trained values.
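The copy-and-replace step and the two training options can be sketched with plain NumPy dictionaries. This is only a schematic: the layer sizes, the random stand-in for pre-trained weights, and the function name `build_transfer_model` are assumptions, and a real project would typically do this inside a deep learning framework.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a network pre-trained on 1,000 classes (random values here)
pretrained = {
    "W1": rng.normal(size=(64, 784)), "b1": np.zeros((64, 1)),
    "W2": rng.normal(size=(32, 64)), "b2": np.zeros((32, 1)),
    "W3": rng.normal(size=(1000, 32)), "b3": np.zeros((1000, 1)),
}

def build_transfer_model(pretrained, n_classes, train_all_layers, rng):
    """Copy all layers, replace the output layer (W3, b3), and list trainable parameters."""
    params = copy.deepcopy(pretrained)
    n_hidden = params["W3"].shape[1]
    # New output layer, initialized from scratch
    params["W3"] = rng.normal(size=(n_classes, n_hidden)) * np.sqrt(1.0 / n_hidden)
    params["b3"] = np.zeros((n_classes, 1))
    # Option 2 trains everything; Option 1 trains only the new output layer
    trainable = sorted(params) if train_all_layers else ["W3", "b3"]
    return params, trainable

params, trainable = build_transfer_model(pretrained, 10, train_all_layers=False, rng=rng)
print(params["W3"].shape, trainable)
```

An optimizer would then update only the parameters named in `trainable`, leaving the rest at their pre-trained values.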

### 9.2 Why does Transfer Learning work?
Training a neural network to detect diverse objects from images helps it to learn to detect pretty generic features of images, like edges, corners, curves, and basic shapes. This knowledge can be beneficial for many other computer vision tasks, like recognizing handwritten digits.

It's worth noting that the type of input (e.g., image, audio, text) used for pre-training and fine-tuning should be the same. For instance, if the final task is a computer vision task, the pre-training should also be done using an image-based neural network.

### 9.3 Transfer Learning Steps
- **Step 1:** Download (or train) a neural network with parameters pre-trained on a large dataset of the same input type as your application.
- **Step 2:** Further train or fine-tune the network on your data.

Transfer learning isn't a panacea — it can't solve every problem with only a handful of images. But it does offer significant advantages when the dataset for your application isn't that large.

<hr>

# 10. The Full Cycle of a Machine Learning Project: A Guide
In machine learning projects, training a model is just part of the process. Here, we present the complete lifecycle of a machine learning project, using speech recognition as an example.

<div style="align=center">
    <img src="media/full_cycle.png" width=600>
</div>

### 10.1 Scoping the Project
The first step in any machine learning project is scoping, i.e., deciding on the specifics of the project. In our case, the project was to develop speech recognition for voice search – enabling web searches using voice commands rather than typing.

### 10.2 Data Collection
Once you have defined the scope of your project, the next step is to decide what data you need to train your machine learning system. This involves acquiring the necessary audio clips and their respective transcripts to serve as labels for your dataset.

### 10.3 Model Training
With your initial dataset ready, you can start training your speech recognition system. This process usually involves error analysis and iterative improvements to your model. Sometimes, your analysis might suggest the need for more or specific types of data to improve the performance of your learning algorithm. For instance, you may find your model performs poorly in the presence of car noise in the background and you need more data of that type to enhance its performance.

### 10.4 Deployment
After sufficient rounds of training and data collection, when you deem your model is ready, you can deploy it in a production environment. Deployment involves making your system available for users. You should also monitor and maintain your system post-deployment, improving it as necessary. If you have user permission, data from your deployed system can even be used to further enhance its performance.

### 10.5 Deploying in Production

<div style="align=center">
    <img src="media/ml_deployment.png" width=600>
</div>

To give you more insight into deployment, here's a typical scenario:

After training a high-performing speech recognition model, you host it on a server, often called an "inference server." When a user speaks into a mobile application, the app makes an API call that sends the recorded audio clip to the inference server. The server runs your model on the clip and returns its prediction, in this case the text transcript of the spoken words.

Depending on the scale of your application, the amount of software engineering required can vary significantly. You'll also want to log the inputs and predictions for system monitoring. This data can be invaluable in identifying when your model's performance is slipping due to shifts in the data, such as when new names or terms enter common usage. Monitoring then allows you to retrain your model and update it accordingly.
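
The request flow and logging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production design: `handle_transcription_request`, `dummy_model`, and the fields of the log record are hypothetical names, and the model is a stand-in that returns a fixed string. The sketch logs a summary of the input (its size) rather than the raw audio; what you actually store depends on your application and on user permission.

```python
import json
import logging
import time

logger = logging.getLogger("inference_server")


def handle_transcription_request(audio_bytes, model, log_records):
    """Run the model on one audio clip and record the input/prediction pair."""
    transcript = model(audio_bytes)  # hypothetical model: bytes -> str

    # Keep a record of what came in and what was predicted, for monitoring.
    record = {
        "timestamp": time.time(),
        "audio_num_bytes": len(audio_bytes),  # summary of the input, not raw audio
        "transcript": transcript,
    }
    log_records.append(record)
    logger.info(json.dumps(record))

    # This dictionary stands in for the API response sent back to the app.
    return {"transcript": transcript}


def dummy_model(audio_bytes):
    """Stand-in for a trained speech recognition model (assumption)."""
    return "hello world"


log_records = []
response = handle_transcription_request(b"\x00\x01\x02", dummy_model, log_records)
print(response)
```

Reviewing records like these over time is one way to notice when inputs start to drift, for example when transcripts of new names or terms begin to look wrong.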

### 10.6 Machine Learning Operations (MLOps)
MLOps, or Machine Learning Operations, is a growing field that focuses on systematically building, deploying, and maintaining machine learning systems. Practices in MLOps ensure your machine learning model is reliable, scales well, has good logging practices, is consistently monitored, and can be updated as needed to maintain its performance.

<hr>

# 11. Evaluating Machine Learning Systems with Skewed Classes
When the ratio of positive to negative examples is very skewed in machine learning applications, traditional error metrics such as `accuracy` may not give meaningful results. Consider an example where we're training a binary classifier to detect a rare disease in patients based on their data. Let's represent presence of the disease as $y = 1$ and absence as $y = 0$.

Assume we've achieved 1% error on the test set, meaning a 99% correct diagnosis rate. This seems impressive, but if only 0.5% of the population has the disease, a naïve classifier that always predicts $y = 0$ would reach 99.5% accuracy. Clearly, the naïve classifier is not a useful diagnostic tool even though it has lower error: it never detects a single case of the disease.
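
To make the arithmetic concrete, here is a small sketch with a made-up test set of 1,000 patients, of whom 5 (0.5%) have the disease. The data and variable names are assumptions for illustration only.

```python
# Hypothetical test set: 1,000 patients, 0.5% of whom have the disease.
m_test = 1000
num_positive = 5
y_test = [1] * num_positive + [0] * (m_test - num_positive)

# Naive "classifier" that always predicts y = 0.
y_pred = [0] * m_test

# Fraction of correct predictions.
accuracy = sum(t == p for t, p in zip(y_test, y_pred)) / m_test

# Number of actual disease cases the classifier caught.
detected = sum(t == 1 and p == 1 for t, p in zip(y_test, y_pred))

print(f"accuracy: {accuracy:.3f}")                              # accuracy: 0.995
print(f"disease cases detected: {detected} of {num_positive}")  # 0 of 5
```

The accuracy looks excellent, yet none of the patients with the disease are identified.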

In skewed datasets, we often use alternative error metrics like `precision` and `recall`. These are based on a confusion matrix, a 2x2 table with actual classes (1 or 0) on one axis and predicted classes on the other.

The cells of the matrix are:

- True Positives (TP): actual = 1, predicted = 1
- False Positives (FP): actual = 0, predicted = 1
- True Negatives (TN): actual = 0, predicted = 0
- False Negatives (FN): actual = 1, predicted = 0

| | Predicted 1 | Predicted 0 |
| --- | --- | --- |
| Actual 1 | TP | FN |
| Actual 0 | FP | TN |
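
As a minimal sketch, the four cells can be counted directly from lists of actual and predicted labels. The function name `confusion_counts` and the small label lists are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels encoded as 0 and 1."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # true positive
        elif actual == 0 and predicted == 1:
            fp += 1  # false positive
        elif actual == 0 and predicted == 0:
            tn += 1  # true negative
        else:
            fn += 1  # false negative: actual = 1, predicted = 0
    return tp, fp, tn, fn


# Hypothetical labels for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # 2 1 4 1
```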

<div style="align=center">
    <img src="media/precision_recall.png" width=800>
</div>

We define precision and recall as:

- **Precision:** Of all the patients for whom we predicted $y = 1$, what fraction actually has the disease?

$$\text{Precision} = \frac{\text{true positives}}{\text{total predicted positives}} = \frac{TP}{TP + FP}$$

- **Recall:** Of all the patients that actually have the disease, what fraction did we correctly detect as having it?

$$\text{Recall} = \frac{\text{true positives}}{\text{total actual positives}} = \frac{TP}{TP + FN}$$

If an algorithm predicts 0 all the time, it will have zero recall and undefined precision. In such a case, we conventionally define precision as zero as well.

Precision and recall help us identify whether our learning algorithm makes useful predictions and can diagnose a reasonable fraction of actual disease cases.
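
The two formulas, together with the zero-precision convention for a classifier that never predicts 1, can be written as a short function. This is a sketch: `precision_recall` is a hypothetical helper, the counts are made up, and returning a recall of 0 when there are no actual positives is an extra assumption that the text above does not cover.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    predicted_positives = tp + fp
    actual_positives = tp + fn

    # Convention from the text: if nothing is predicted positive,
    # precision is undefined, so define it as 0.
    precision = tp / predicted_positives if predicted_positives > 0 else 0.0

    # Assumption: also return 0 when there are no actual positives.
    recall = tp / actual_positives if actual_positives > 0 else 0.0
    return precision, recall


# Hypothetical counts: 15 true positives, 5 false positives, 10 false negatives.
print(precision_recall(tp=15, fp=5, fn=10))  # (0.75, 0.6)

# A classifier that always predicts 0 has no predicted positives at all.
print(precision_recall(tp=0, fp=0, fn=25))   # (0.0, 0.0)
```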

### 11.1 Precision-Recall Trade-Off
In an ideal world, we want learning algorithms that have high precision and high recall. However, in practice, there's often a trade-off between these two.

#### 11.1.1 Balancing Precision and Recall
When using logistic regression, we can choose the threshold at which we predict $y=1$ (e.g., the presence of a rare disease). For example, we may choose to predict $y=1$ only when we are 70% sure, rather than 50% sure. Increasing this threshold would increase precision (because we're more likely to be right when we predict 1), but it would decrease recall (since we're predicting 1 less often and may miss some true positives).

Conversely, we could lower this threshold if we want to avoid missing too many cases of the rare disease. Lowering the threshold to, say, 30% would result in higher recall but lower precision.
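
The effect of moving the threshold can be seen on a toy example. In this sketch, `probs` stands for the model's predicted probabilities of $y=1$ and `y_true` for the actual labels; both lists and the helper `precision_recall_at` are made up for illustration. On real data the trend is the same, although precision does not always rise strictly with every increase in the threshold.

```python
def precision_recall_at(threshold, probs, y_true):
    """Predict 1 when the probability is at least `threshold`, then score."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


# Hypothetical predicted probabilities and actual labels for eight patients.
probs = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]
y_true = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.7, 0.5, 0.3):
    precision, recall = precision_recall_at(threshold, probs, y_true)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.7: precision=1.00, recall=0.50
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.3: precision=0.67, recall=1.00
```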

### 11.2 Trade-Off Curve
The "best" threshold depends on the relative costs of false positives and false negatives, that is, on how much you value high precision versus high recall. This choice cannot be made through cross-validation; it has to be made based on the specific needs and constraints of the application.

<div style="align=center">
    <img src="media/f1_score.png" width=600>
</div>


### 11.3 F1 Score: Combining Precision and Recall
In cases where we want an automatic way to balance precision and recall, we can use the F1 score. This is a metric that combines precision and recall, with more emphasis on whichever is lower. The F1 score is calculated using the following formula:

$$\text{F1} = \frac{1}{\frac{1}{2}(\frac{1}{\text{P}} + \frac{1}{\text{R}})} = \frac{2 \cdot \text{P} \cdot \text{R}}{\text{P} + \text{R}}$$

The F1 score is the `harmonic mean` of precision and recall, a kind of average that is pulled toward the smaller of the two values. A high F1 score therefore requires both precision and recall to be reasonably high.
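
The sketch below compares the F1 score with a simple average for three hypothetical algorithms. The precision and recall values are illustrative, not measured results, and returning 0 when both inputs are 0 is an assumption made to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumption: define F1 as 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs for three candidate algorithms.
candidates = {
    "algorithm_1": (0.5, 0.4),
    "algorithm_2": (0.7, 0.1),
    "algorithm_3": (0.02, 1.0),  # e.g. a model that predicts y = 1 almost always
}

for name, (p, r) in candidates.items():
    print(f"{name}: average={(p + r) / 2:.3f}, F1={f1_score(p, r):.3f}")
# algorithm_1: average=0.450, F1=0.444
# algorithm_2: average=0.400, F1=0.175
# algorithm_3: average=0.510, F1=0.039

best = max(candidates, key=lambda name: f1_score(*candidates[name]))
print(best)  # algorithm_1
```

A simple average would rank `algorithm_3` first despite its very low precision, whereas the F1 score penalizes it heavily.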
