# Practical Advice for Building Machine Learning Systems

## 1. The Core Challenge: Deciding What to Try Next

When a trained model's performance is not good enough, there are many possible next steps. The key to being an effective ML practitioner is choosing the right path forward to avoid wasting time.

**Common options include:**
- Get more training examples (to reduce overfitting).
- Try a smaller set of features (to reduce overfitting).
- Try a larger set of features (to reduce underfitting).
- Add polynomial features ($x_1^2, x_1x_2$, etc.) (to reduce underfitting).
- Decrease the regularization parameter, $\lambda$ (to reduce underfitting).
- Increase the regularization parameter, $\lambda$ (to reduce overfitting).

To make good choices, we need **diagnostics**: tests you can run to gain insight into what is or isn't working with your algorithm.

## 2. Evaluating a Model

The first step is to have a reliable way to measure your model's performance.

### Training, Cross-Validation, and Test Sets
Instead of using all your data for training, you should split it into three sets:
1.  **Training Set (~60%):** Used to train the model's parameters ($W, B$).
2.  **Cross-Validation (CV) Set (~20%):** Also called the **validation** or **dev** set. Used to tune hyperparameters (like the degree of a polynomial, $\lambda$, or neural network architecture) and for diagnostics like bias/variance analysis.
3.  **Test Set (~20%):** Used only *once* at the very end to get an unbiased estimate of the final model's real-world performance (generalization error).

**Never use the test set to make decisions about the model architecture or hyperparameters.**
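
A minimal sketch of the 60/20/20 split, assuming the data fits in NumPy arrays. The function name `split_dataset` and the toy data are illustrative, not part of the original notes.

```python
import numpy as np

def split_dataset(X, y, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle once, then cut into training, cross-validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_frac * len(X)))
    n_cv = int(round(cv_frac * len(X)))
    train_idx = order[:n_train]
    cv_idx = order[n_train:n_train + n_cv]
    test_idx = order[n_train + n_cv:]        # whatever remains
    return ((X[train_idx], y[train_idx]),
            (X[cv_idx], y[cv_idx]),
            (X[test_idx], y[test_idx]))

# Toy data (assumption): 100 examples with one feature each
X = np.arange(100, dtype=float).reshape(100, 1)
y = 2.0 * X[:, 0]
(X_train, y_train), (X_cv, y_cv), (X_test, y_test) = split_dataset(X, y)
print(len(X_train), len(X_cv), len(X_test))
```

Shuffling before cutting matters when the rows are stored in some meaningful order (for example, sorted by date or label).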

### Error Metrics
We define error functions for each set. For a regression problem, these would be:

- **Training Error:**
  $$ J_{train}(W,B) = \frac{1}{2m_{train}} \sum_{i=1}^{m_{train}} (f_{W,B}(\vec{x}_{train}^{(i)}) - y_{train}^{(i)})^2 $$

- **Cross-Validation Error:**
  $$ J_{cv}(W,B) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} (f_{W,B}(\vec{x}_{cv}^{(i)}) - y_{cv}^{(i)})^2 $$

- **Test Error:**
  $$ J_{test}(W,B) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} (f_{W,B}(\vec{x}_{test}^{(i)}) - y_{test}^{(i)})^2 $$

**Note:** The cost function $J(W,B)$ used for training includes the regularization term. The error metrics $J_{train}, J_{cv}, J_{test}$ **do not**.

For classification, the error is typically the fraction of misclassified examples.
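
Both kinds of error can be sketched in a few lines of NumPy. The helper names below are hypothetical; the squared-error function follows the $\frac{1}{2m}$ convention used above and has no regularization term.

```python
import numpy as np

def squared_error(predictions, targets):
    """J = (1 / 2m) * sum((f(x) - y)^2), without any regularization term."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sum((predictions - targets) ** 2) / (2 * len(targets)))

def misclassification_rate(predicted_labels, true_labels):
    """Fraction of examples whose predicted label differs from the true label."""
    return float(np.mean(np.asarray(predicted_labels) != np.asarray(true_labels)))

# Squared differences are 0, 1, and 4, so J = 5 / (2 * 3)
j_cv = squared_error([2.0, 4.0, 7.0], [2.0, 5.0, 5.0])

# One of the four labels is wrong
err = misclassification_rate([1, 0, 1, 1], [1, 0, 0, 1])
print(j_cv, err)
```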

### The Model Selection Process
1.  Train several different models (e.g., polynomials of different degrees, NNs with different architectures) on the **training set**.
2.  Evaluate each of these trained models on the **cross-validation set** using $J_{cv}$.
3.  Pick the model that has the lowest $J_{cv}$.
4.  Finally, evaluate the chosen model on the **test set** to get a fair estimate of its generalization error.
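
The four steps above can be sketched for polynomial regression as follows. The synthetic quadratic data, the candidate degrees 1 to 6, and the helper `half_mse` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x ** 2 + rng.normal(scale=0.5, size=200)   # synthetic data (assumption)

# 60/20/20 split; the samples are already in random order
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]

def half_mse(coeffs, x, y):
    """(1 / 2m) * sum of squared errors for a fitted polynomial."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2) / 2)

# Step 1: train one model per candidate degree on the training set
fits = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 7)}

# Steps 2 and 3: score each model on the CV set and keep the best
cv_errors = {d: half_mse(c, x_cv, y_cv) for d, c in fits.items()}
best_degree = min(cv_errors, key=cv_errors.get)

# Step 4: touch the test set once, for the chosen model only
test_error = half_mse(fits[best_degree], x_test, y_test)
print(best_degree, test_error)
```

Because `best_degree` was chosen using the CV set, $J_{cv}$ for that model is an optimistic estimate; the test error is the number to report.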

## 3. Diagnosing Bias and Variance

This is one of the most powerful diagnostics for understanding model performance.

- **High Bias (Underfitting):** The model is too simple and fails to capture the underlying patterns in the data.
- **High Variance (Overfitting):** The model is too complex and fits the training data's noise instead of its underlying signal.

### Diagnosing with Error Metrics
To judge whether your model suffers from high bias or high variance, you first need a **baseline level of performance**. This could be:
- Human-level performance.
- A competitor's algorithm performance.
- An estimate based on prior experience.

Let's compare the errors:
- **High Bias:** $J_{train}$ is high (significantly worse than the baseline). $J_{cv}$ will also be high, and typically close to $J_{train}$.
- **High Variance:** $J_{train}$ is low (at or near baseline performance). $J_{cv}$ is **much higher** than $J_{train}$.
- **High Bias AND High Variance:** $J_{train}$ is high, and $J_{cv}$ is even higher. (This is possible with very complex models such as neural networks.)
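
The comparison can be written as a small helper. This is only a sketch: the function `diagnose` and its `tolerance` argument (how large a gap counts as "significant") are assumptions, and the right tolerance depends on the problem. The example numbers are made-up error percentages.

```python
def diagnose(baseline, j_train, j_cv, tolerance):
    """Label a model from its errors; `tolerance` is a problem-specific judgment call."""
    high_bias = (j_train - baseline) > tolerance      # training error far above baseline
    high_variance = (j_cv - j_train) > tolerance      # CV error far above training error
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "neither"

# Illustrative error rates in percent (not measured results)
print(diagnose(baseline=10.6, j_train=10.8, j_cv=14.8, tolerance=1.0))
print(diagnose(baseline=10.6, j_train=15.0, j_cv=15.5, tolerance=1.0))
print(diagnose(baseline=10.6, j_train=15.0, j_cv=19.7, tolerance=1.0))
```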

### Learning Curves
A learning curve plots the training and cross-validation error as a function of the training set size ($m_{train}$).

- **High Bias Learning Curve:** Both $J_{train}$ and $J_{cv}$ are high and they plateau quickly. There is a small gap between them. **Conclusion:** Getting more data alone is unlikely to help.

- **High Variance Learning Curve:** There is a large gap between $J_{train}$ (which is low) and $J_{cv}$ (which is high). As the training set size increases, the gap narrows. **Conclusion:** Getting more data is likely to help.

![Learning Curves](https://i.imgur.com/GgGvA5K.png)
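
A learning curve can be computed by refitting the same model on growing prefixes of the training set. The sketch below assumes synthetic quadratic data and compares a straight line (too simple for this data) with a degree-6 polynomial (more flexible than needed); the sizes and degrees are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=300)
y = x ** 2 + rng.normal(scale=0.5, size=300)   # synthetic data (assumption)
x_pool, y_pool = x[:200], y[:200]              # training examples to draw from
x_cv, y_cv = x[200:], y[200:]                  # fixed cross-validation set

def half_mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2) / 2)

def learning_curve(degree, sizes):
    """Return (m_train, J_train, J_cv) for each training set size."""
    rows = []
    for m in sizes:
        coeffs = np.polyfit(x_pool[:m], y_pool[:m], deg=degree)
        rows.append((m,
                     half_mse(coeffs, x_pool[:m], y_pool[:m]),
                     half_mse(coeffs, x_cv, y_cv)))
    return rows

sizes = [15, 30, 60, 120, 200]
underfit = learning_curve(1, sizes)   # straight line: high bias on this data
flexible = learning_curve(6, sizes)   # degree-6 polynomial: prone to variance when m is small

for name, rows in [("degree 1", underfit), ("degree 6", flexible)]:
    for m, j_train, j_cv in rows:
        print(f"{name}  m={m:3d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```

For the straight line, both errors should stay high no matter how large `m` gets; for the flexible model, the gap between the two errors typically shrinks as `m` grows.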

## 4. Fixing Bias and Variance Problems

Your diagnosis guides your next steps.

| If your model has... | Actions to try...                                                                                                              |
| :------------------- | :---------------------------------------------------------------------------------------------------------------------------- |
| **High Bias** | - **Get additional features**. <br> - **Add polynomial features**. <br> - **Use a more complex model** (e.g., bigger neural network). <br> - **Decrease regularization** ($\lambda$). |
| **High Variance** | - **Get more training data**. <br> - **Try a smaller set of features**. <br> - **Increase regularization** ($\lambda$).                                          |
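
To illustrate the effect of $\lambda$, here is a sketch using regularized polynomial regression solved in closed form. The data, the degree-8 feature map, and the three $\lambda$ values are assumptions; the bias term is left unregularized, a common convention.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=30)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.2, size=30)
x_cv = rng.uniform(-1, 1, size=30)
y_cv = np.sin(3 * x_cv) + rng.normal(scale=0.2, size=30)

def features(x, degree=8):
    # Columns are 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(x, y, lam):
    """Minimize (1/2m) * sum of squared errors + (lam/2m) * sum of squared weights."""
    X = features(x)
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                      # do not regularize the bias term
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def half_mse(w, x, y):
    # Error metric without the regularization term
    return float(np.mean((features(x) @ w - y) ** 2) / 2)

results = []
for lam in [0.01, 1.0, 100.0]:
    w = fit_regularized(x_train, y_train, lam)
    results.append((lam, half_mse(w, x_train, y_train), half_mse(w, x_cv, y_cv)))
    print(f"lambda={lam:6.2f}  J_train={results[-1][1]:.4f}  J_cv={results[-1][2]:.4f}")
```

$J_{train}$ can only go up as $\lambda$ increases, because the fit is pulled away from the training data. The useful $\lambda$ is the one that minimizes $J_{cv}$, which is why it is tuned on the cross-validation set.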

### A Recipe for Neural Networks
Neural networks, when large enough, are "low bias machines." This allows for a simplified workflow:
1.  **Does your model have high bias?** (Is $J_{train}$ too high?).
    - If yes, make the network bigger (more layers/units). Repeat until bias is low.
2.  **Does your model have high variance?** (Is $J_{cv}$ much higher than $J_{train}$?).
    - If yes, get more data. Repeat until variance is low.
    - Regularization is also key. A larger network with well-chosen regularization usually performs at least as well as a smaller one, at the cost of more computation.


```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2

# Define a neural network model for handwritten digit classification
# with L2 regularization added to the layers.

# The regularization strength (lambda) is set inside the l2() function.
# A common practice is to start with a small value like 0.01.
lambda_val = 0.01

model_regularized = Sequential([
    Dense(units=25, activation='relu',
          kernel_regularizer=l2(lambda_val)),  # Add L2 regularization here
    Dense(units=15, activation='relu',
          kernel_regularizer=l2(lambda_val)),  # Add L2 regularization here
    Dense(units=10, activation='linear')       # Usually, the output layer is not regularized
])

# The rest of the compile and fit process remains the same.
# For example, for multiclass classification:
model_regularized.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)
)

print("Model with L2 Regularization has been defined.")
# model_regularized.fit(X_train, y_train, epochs=100)  # Example of how you would train it
```

Output:

```
Model with L2 Regularization has been defined.
```


## 5. The Full Machine Learning Project Cycle

Building a production system is more than just training a model.

1.  **Scope Project**: Define what you want to achieve.
2.  **Collect Data**: Gather and label your initial dataset.
3.  **Train Model**: This is an iterative loop:
    - Train your model.
    - Perform **Error Analysis** and **Bias/Variance Analysis**.
    - Based on diagnostics, improve your model or collect more targeted data.
4.  **Deploy & Monitor**: Make the model available to users (e.g., via an API) and monitor its performance on live data. This is part of a field called **MLOps** (Machine Learning Operations).

### Error Analysis
This is a manual process of examining the examples your algorithm misclassified in the cross-validation set to find common themes. For a spam classifier, you might find that it struggles with:
- Pharmaceutical spam (21/100 errors).
- Phishing emails (18/100 errors).
- Emails with unusual routing (7/100 errors).
- Emails with deliberate misspellings (3/100 errors).

This analysis tells you that focusing on pharma spam and phishing would be more impactful than spending a lot of time on a sophisticated misspelling detector.
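
Once the misclassified examples have been tagged by hand, the tally itself is easy to automate. The tags and the tiny list below are hypothetical; note that one email can carry several tags, so the counts need not add up to the number of errors.

```python
from collections import Counter

# Hypothetical hand-assigned tags for misclassified CV emails
tagged_errors = [
    ["pharma"],
    ["phishing"],
    ["pharma", "misspelling"],
    ["routing"],
    ["phishing"],
    ["pharma"],
    [],                      # an error that fits no category
]

counts = Counter(tag for tags in tagged_errors for tag in tags)
for tag, n in counts.most_common():
    print(f"{tag}: {n}/{len(tagged_errors)} errors")
```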

### Data-Centric AI
While model improvements are important, often the biggest gains come from improving the data.
- **Data Augmentation**: Creating new training examples from existing ones. For images, this includes rotating, shearing, changing contrast, etc. For audio, it could mean adding background noise.
- **Data Synthesis**: Creating brand new, artificial examples from scratch. For example, using different computer fonts to generate images of text for an OCR system.
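
A minimal NumPy sketch of augmentation, assuming grayscale images stored as 2-D arrays with values in [0, 1] and audio stored as 1-D arrays. The function names and parameter values are illustrative. Whether a given transform keeps the label valid depends on the task (rotating a "6" by a large angle can turn it into a "9").

```python
import numpy as np

def augment_image(image):
    """Return simple variants of a grayscale image with values in [0, 1]."""
    rotated = np.rot90(image)   # 90-degree rotation; only safe if the label is rotation-invariant
    higher_contrast = np.clip((image - 0.5) * 1.5 + 0.5, 0.0, 1.0)
    return [rotated, higher_contrast]

def add_background_noise(audio, noise, noise_level=0.1):
    """Mix a scaled noise clip into an audio clip of equal or shorter length."""
    return audio + noise_level * noise[: len(audio)]

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(3, 4))     # stand-in for a real image
audio = np.sin(np.linspace(0.0, 6.28, 50))     # stand-in for a real recording
noise = rng.normal(size=100)                   # stand-in for crowd or car noise

image_variants = augment_image(image)
noisy_audio = add_background_noise(audio, noise)
print(len(image_variants), noisy_audio.shape)
```

Each variant keeps the original label, so one labeled example becomes several.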

### Transfer Learning
This is a powerful technique for when you don't have much data.
1.  **Pre-training**: Take a large, pre-existing neural network that was trained on a massive dataset (e.g., a million images from the internet).
2.  **Fine-tuning**:
    - Remove the final output layer of the pre-trained network.
    - Add a new output layer that matches your specific task (e.g., 10 units for digit classification).
    - You can either **freeze** the early layers and only train the new output layer, or you can train all the layers but with a very small learning rate.

The intuition is that the pre-trained network has already learned useful low-level features (like edges and shapes for images) that are transferable to your task.
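
The freeze-and-replace variant of fine-tuning can be sketched in Keras as follows. The `pretrained` network here is only a stand-in with random weights and made-up layer sizes (400 inputs, 1000 original classes); in practice you would load a network whose weights were actually learned on a large dataset.

```python
import tensorflow as tf
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Stand-in for a pre-trained network (assumption: real weights would be loaded instead)
pretrained = Sequential([
    Input(shape=(400,)),
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=1000, activation='linear'),   # original output layer, to be discarded
])

# Keep every layer except the old output layer, and freeze the kept layers
reused_layers = pretrained.layers[:-1]
for layer in reused_layers:
    layer.trainable = False

# Add a new output layer that matches the new task (10 digit classes)
fine_tuned = Sequential(
    [Input(shape=(400,))] + reused_layers + [Dense(units=10, activation='linear')]
)

fine_tuned.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001)
)

# Only the new layer's kernel and bias remain trainable
print(len(fine_tuned.trainable_weights))
# fine_tuned.fit(X_train, y_train, epochs=20)  # X_train, y_train: your own small dataset
```

To train all layers instead, skip the freezing loop and use a much smaller learning rate so the pre-trained weights change only slightly.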

---

## 6. Handling Skewed Classes (Optional but Important)

When one class is very rare (e.g., disease diagnosis where <1% of patients are positive), accuracy is a misleading metric.

### Precision and Recall
We use a **confusion matrix** to define better metrics:

|                   | **Actual: 1** | **Actual: 0** |
| :---------------- | :----------------- | :----------------- |
| **Predicted: 1** | True Positive (TP) | False Positive (FP)|
| **Predicted: 0** | False Negative (FN)| True Negative (TN) |

- **Precision**: Of all the examples we predicted as positive, what fraction were actually positive?
  $$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$

- **Recall**: Of all the actual positive examples, what fraction did we correctly identify?
  $$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
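
The sketch below computes both metrics from label arrays and shows why accuracy misleads on skewed data. The data is made up (1,000 patients, 5 positives), and returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions require.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # Convention (assumption): report 0.0 when the metric is undefined
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# 1,000 patients, of whom 5 actually have the disease
y_true = np.array([1] * 5 + [0] * 995)

# A "classifier" that always predicts 0 looks excellent by accuracy alone
always_zero = np.zeros(1000, dtype=int)
accuracy = float(np.mean(always_zero == y_true))
print(accuracy, precision_recall(y_true, always_zero))

# A classifier that flags 8 patients: 4 true positives, 4 false positives
y_pred = np.zeros(1000, dtype=int)
y_pred[:4] = 1
y_pred[5:9] = 1
print(precision_recall(y_true, y_pred))
```

The always-zero classifier reaches 99.5% accuracy but has zero recall, which is exactly the failure that precision and recall are meant to expose.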

### Precision-Recall Tradeoff
You can change the prediction threshold (commonly 0.5 by default) to trade off between precision and recall:
- **High Threshold (e.g., 0.9):** Predict 1 only when very confident -> higher precision, lower recall.
- **Low Threshold (e.g., 0.3):** Predict 1 even when not very confident -> lower precision, higher recall.
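
The tradeoff is easy to see by sweeping the threshold over a fixed set of predicted probabilities. The eight labels and scores below are invented for illustration.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for 8 examples
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.95, 0.85, 0.60, 0.55, 0.35, 0.32, 0.20, 0.05])

def precision_recall_at(threshold):
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for threshold in [0.9, 0.5, 0.3]:
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this data, raising the threshold from 0.3 to 0.9 moves precision from about 0.67 to 1.00 while recall falls from 1.00 to 0.25.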

### F1 Score
To combine precision (P) and recall (R) into a single number, we use the **F1 Score**, which is the harmonic mean of the two. Because the harmonic mean is pulled toward the smaller of the two values, a model with very low P or very low R gets a low F1 score.
$$ F_1 \text{ Score} = 2 \frac{P \cdot R}{P + R} $$
A higher F1 score is generally better. Having a single number lets you compare models or thresholds automatically.
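
A short sketch comparing the F1 score with a plain average. The three (P, R) pairs are made-up candidates; "C" resembles a model that predicts 1 for almost everything.

```python
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0          # convention (assumption) for the undefined case
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for three candidate models
candidates = {"A": (0.5, 0.4), "B": (0.7, 0.1), "C": (0.02, 1.0)}

f1 = {name: f1_score(p, r) for name, (p, r) in candidates.items()}
mean = {name: (p + r) / 2 for name, (p, r) in candidates.items()}

best_by_f1 = max(f1, key=f1.get)
best_by_mean = max(mean, key=mean.get)
print(best_by_f1, best_by_mean)
```

The plain average prefers "C" despite its 2% precision, while F1 prefers the balanced model "A".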