
## The Machine Learning Landscape

Machine Learning (ML) is the science (and art) of programming computers so they can learn from data. A general definition, attributed to Arthur Samuel (1959), describes it as the field of study that gives computers the ability to learn without being explicitly programmed. A more engineering-oriented definition, from Tom Mitchell (1997), states: a computer program is said to learn from **experience E** with respect to some **task T** and some **performance measure P**, if its performance on T, as measured by P, improves with experience E.

The system learns from a **training set**, composed of **training instances** (or samples). For a spam filter, the task T is to flag spam, the experience E is the training data, and P could be **accuracy** (ratio of correctly classified emails).
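
As a minimal sketch of the performance measure P, accuracy can be computed as the fraction of predictions that match the true labels. The labels and predictions below are made-up illustration data, not the output of a real spam filter, and the `accuracy` helper is a hypothetical name.

```python
# Hypothetical labels: 1 = spam, 0 = ham (made-up data for illustration)
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

def accuracy(y_true, y_pred):
    """Return the ratio of correctly classified instances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(y_true, y_pred))  # 6 of the 8 emails are classified correctly -> 0.75
```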

---

## Why Use Machine Learning?

ML is valuable because it can solve problems too complex for traditional programming, adapt to **fluctuating environments**, and provide **insights** through **data mining**.

1.  **Simplifying and Improving Code**: ML automatically learns patterns, making programs shorter, easier to maintain, and often more accurate than traditional rule-based approaches.
    * Figure 1.1: [Insert Image Here]
        *Figure 1-1. The traditional approach*
    * Figure 1.2: [Insert Image Here]
        *Figure 1-2. The Machine Learning approach*
2.  **Automatic Adaptation**: An ML system automatically notices new patterns (e.g., spammers changing "4U" to "For U") and adapts without manual updates.
    * Figure 1.3: [Insert Image Here]
        *Figure 1-3. Automatically adapting to change*
3.  **Complex Problems**: ML is essential for problems with no known algorithm, such as large-scale speech recognition.
4.  **Helping Humans Learn (Data Mining)**: Inspecting a trained ML model (like a spam filter) can reveal unsuspected correlations or trends, leading to a better understanding of the problem. Digging into large amounts of data this way to discover patterns that were not immediately apparent is called **data mining**.
    * Figure 1.4: [Insert Image Here]
        *Figure 1-4. Machine Learning can help humans learn*

---

## Examples of Applications

| ML Task | Goal | Technique(s) |
| :--- | :--- | :--- |
| **Image classification** | Classifying products on a production line | Convolutional Neural Networks (CNNs) |
| **Semantic segmentation** | Detecting tumors in brain scans (pixel classification) | CNNs |
| **Text classification (NLP)** | Classifying news articles or flagging offensive comments | Recurrent Neural Networks (RNNs), CNNs, or Transformers |
| **Text summarization (NLP)** | Summarizing long documents | RNNs, CNNs, or Transformers |
| **Chatbots/Assistants** | Creating conversational agents | Natural Language Understanding (NLU) and question-answering |
| **Regression** | Forecasting company revenue | Linear/Polynomial Regression, regression SVM, Random Forest, ANNs; RNNs, CNNs, or Transformers for sequences |
| **Speech recognition** | Making an app react to voice commands | RNNs, CNNs, or Transformers (for audio samples) |
| **Anomaly detection** | Detecting credit card fraud | One-class SVM, Isolation Forest |
| **Clustering** | Segmenting clients based on purchases | K-Means, DBSCAN, Hierarchical Cluster Analysis (HCA) |
| **Visualization/Dimensionality Reduction** | Representing high-dimensional data clearly | PCA, Kernel PCA, LLE, t-SNE |
| **Recommender system** | Recommending products based on past purchases | Artificial Neural Networks (ANNs) |
| **Intelligent game bot** | Building an AI bot for a game (e.g., AlphaGo) | Reinforcement Learning (RL) |

---

## Types of Machine Learning Systems

ML systems are categorized by:
1.  **Supervision**: Supervised, unsupervised, semisupervised, and Reinforcement Learning.
2.  **Incremental Learning**: Online versus batch learning.
3.  **Generalization Method**: Instance-based versus model-based learning.

### 1. Supervised/Unsupervised Learning

#### Supervised Learning
The training set includes desired solutions called **labels**.
* **Tasks**: **Classification** (e.g., spam/ham) and **Regression** (predicting a target numeric value, like price, using **predictors** or **features**).
    * An **attribute** is a data type (e.g., "mileage"), while a **feature** is typically the attribute plus its value (e.g., "mileage = 15,000").
* Figure 1.5: [Insert Image Here]
    *Figure 1-5. A labeled training set for spam classification (an example of supervised learning)*
* Figure 1.6: [Insert Image Here]
    *Figure 1-6. A regression problem: predict a value, given an input feature (there are usually multiple input features, and sometimes multiple output values)*
* **Algorithms**: k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision Trees and Random Forests, Neural networks.
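
The following is a minimal sketch of supervised classification using Scikit-Learn's `KNeighborsClassifier`. The two features (counts of suspicious words and links) and all the data values are assumptions invented for illustration; the point is only that the training set carries a label for every instance.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training set: each row is [suspicious_word_count, link_count]
X_train = [[0, 0], [1, 0], [0, 1], [1, 1], [8, 5], [9, 7], [7, 6], [10, 4]]
# The labels are the desired solutions supplied with the training set
y_train = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Classify two new, unlabeled emails
X_new = [[0, 2], [9, 5]]
print(clf.predict(X_new))  # ['ham' 'spam'] for this clearly separated toy data
```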

#### Unsupervised Learning
The training data is **unlabeled**; the system tries to learn without a teacher.
* **Clustering**: Detecting groups of similar visitors (e.g., K-Means, DBSCAN, HCA).
    * Figure 1.8: [Insert Image Here]
        *Figure 1-8. Clustering*
* **Visualization and Dimensionality Reduction**: Simplifying data (e.g., to 2D/3D plots) while preserving structure (e.g., PCA, t-SNE).
    * **Feature extraction** merges several correlated features into one, reducing the dimensionality of the data.
    * Figure 1.9: [Insert Image Here]
        *Figure 1-9. Example of a t-SNE visualization highlighting semantic clusters*
* **Anomaly Detection** (detecting unusual instances, trained on mostly normal instances) and **Novelty Detection** (detecting new instances that differ from a clean training set).
    * Figure 1.10: [Insert Image Here]
        *Figure 1-10. Anomaly detection*
* **Association Rule Learning**: Discovering interesting relations between attributes (e.g., Apriori, Eclat).
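
As a rough illustration of clustering, the sketch below runs Scikit-Learn's `KMeans` on six made-up 2D points. No labels are provided; the algorithm is only told how many groups to look for (`n_clusters=2` is an assumption that suits this toy data).

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D data: two well-separated groups of "visitors"
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [8.1, 7.9], [7.8, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # no labels are supplied; groups are inferred

print(labels)                   # first three points share one cluster ID, last three the other
print(kmeans.cluster_centers_)  # one centroid per discovered group
```

Which group receives ID 0 and which receives ID 1 is arbitrary, so only the grouping itself is meaningful.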

#### Semisupervised Learning
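Deals with partially labeled data: usually plenty of unlabeled instances and a few labeled ones. Most semisupervised algorithms combine unsupervised and supervised techniques. For example, a photo service can first cluster the people appearing in photos (unsupervised) and then needs only one label per person to name everyone in every photo (supervised).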
* Figure 1.11: [Insert Image Here]
    *Figure 1-11. Semisupervised learning with two classes (triangles and squares): the unlabeled examples (circles) help classify a new instance (the cross) into the triangle class rather than the square class, even though it is closer to the labeled squares*

#### Reinforcement Learning (RL)
An **agent** observes the environment, performs actions, and receives **rewards** (or penalties), learning a **policy** (optimal strategy) to maximize reward over time. AlphaGo is a famous example.
* Figure 1.12: [Insert Image Here]
    *Figure 1-12. Reinforcement Learning*

### 2. Batch and Online Learning

#### Batch Learning
The system is trained **offline** using **all available data** and cannot learn incrementally. To incorporate new data, the system must be retrained from scratch on the full, updated dataset. It requires significant time and computing resources, especially with huge datasets.

#### Online Learning (Incremental Learning)
The system is trained incrementally by feeding it data instances sequentially, either individually or in small groups called **mini-batches**. Each learning step is fast and cheap, so the system can learn **on the fly** as new data arrives. It's ideal for continuous data flows, rapid adaptation, or systems with limited resources.
* It can be used for **out-of-core learning** (training on huge datasets that don't fit in memory) by processing the data in pieces.
* The **learning rate** controls adaptation speed: a high rate adapts rapidly but may forget old data quickly; a low rate learns slowly but is less sensitive to noise.
* A major challenge is that bad data can gradually cause performance degradation, requiring close monitoring and anomaly detection.
* Figure 1.13: [Insert Image Here]
    *Figure 1-13. In online learning, a model is trained and launched into production, and then it keeps learning as new data comes in*
* Figure 1.14: [Insert Image Here]
    *Figure 1-14. Using online learning to handle huge datasets*
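
The sketch below illustrates online learning from scratch with NumPy, under stated assumptions: the data stream is simulated from a made-up relationship (`y = 3x + 2`), the model is a one-feature linear model, and `stream_mini_batches` is a hypothetical helper. Each mini-batch triggers one small update and is then discarded.

```python
import numpy as np

rng = np.random.default_rng(0)

def stream_mini_batches(n_batches, batch_size=10):
    """Simulate a data stream drawn from y = 3x + 2 (made-up ground truth)."""
    for _ in range(n_batches):
        x = rng.uniform(0.0, 1.0, size=batch_size)
        yield x, 3.0 * x + 2.0

theta0, theta1 = 0.0, 0.0   # model: y_hat = theta0 + theta1 * x
learning_rate = 0.1         # higher adapts faster but forgets old data sooner

for x, y in stream_mini_batches(n_batches=2000):
    error = (theta0 + theta1 * x) - y
    # One gradient step on the mean squared error of this mini-batch only
    theta0 -= learning_rate * 2.0 * error.mean()
    theta1 -= learning_rate * 2.0 * (error * x).mean()
    # The mini-batch can now be discarded; nothing is stored between steps

print(theta0, theta1)  # should approach 2 and 3
```

In Scikit-Learn, estimators such as `SGDRegressor` expose a `partial_fit()` method that serves the same purpose; the hand-written loop is used here only to make the update step visible.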

### 3. Instance-Based Versus Model-Based Learning

The goal is to **generalize** to new instances.

#### Instance-Based Learning
The system learns examples by heart and generalizes by using a **similarity measure** to compare new cases to the learned examples (or a subset). **k-Nearest Neighbors** regression is an instance-based algorithm.
* Figure 1.15: [Insert Image Here]
    *Figure 1-15. Instance-based learning*
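
A minimal pure-Python sketch of k-Nearest Neighbors regression is shown below. It stores the five rows of Table 1-1 (listed later in this section) and predicts by averaging the targets of the closest stored instances. Using the absolute GDP difference as the similarity measure and `k=3` are assumptions of this sketch, and `knn_predict` is a hypothetical helper name.

```python
# The five rows of Table 1-1: (GDP per capita in USD, life satisfaction)
training_set = [
    (12240, 4.9),   # Hungary
    (27195, 5.8),   # Korea
    (37675, 6.5),   # France
    (50962, 7.3),   # Australia
    (55805, 7.2),   # United States
]

def knn_predict(gdp, k=3):
    """Average the targets of the k stored instances closest to `gdp`."""
    # Similarity measure (an assumption of this sketch): absolute GDP difference
    nearest = sorted(training_set, key=lambda row: abs(row[0] - gdp))[:k]
    return sum(target for _, target in nearest) / k

# Cyprus's GDP per capita (22,587): nearest rows are Korea, Hungary, and France
print(round(knn_predict(22587), 2))  # (5.8 + 4.9 + 6.5) / 3 -> 5.73
```

Because only five countries are stored here, the prediction is illustrative and will differ from one computed on a full dataset.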

#### Model-Based Learning
The system builds a **model** of the examples to make predictions. This involves:
1.  **Model selection**: Choosing a model type (e.g., Linear Regression) and specifying its architecture.
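2.  **Defining a performance measure**: Either a utility (fitness) function that measures how good the model is, or a cost function that measures how bad it is. For Linear Regression, the cost function typically measures the distance between the model's predictions and the training examples, and the goal is to minimize it.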
3.  **Training the model**: The learning algorithm finds the optimal **model parameters** ($\theta_{0}$, $\theta_{1}$) that best fit the data.
4.  **Inference**: Applying the final model to make predictions on new cases.

* Figure 1.16: [Insert Image Here]
    *Figure 1-16. Model-based learning*
* Figure 1.17: [Insert Image Here]
    *Figure 1-17. Do you see a trend here?*
* Table 1-1. Does money make people happier?

| Country | GDP per capita (USD) | Life satisfaction |
| :--- | :--- | :--- |
| Hungary | 12,240 | 4.9 |
| Korea | 27,195 | 5.8 |
| France | 37,675 | 6.5 |
| Australia | 50,962 | 7.3 |
| United States | 55,805 | 7.2 |

* Equation 1-1. A simple linear model
    $$\text{life\_satisfaction} = \theta_{0} + \theta_{1} \times \text{GDP\_per\_capita}$$
* Figure 1.18: [Insert Image Here]
    *Figure 1-18. A few possible linear models*
* Figure 1.19: [Insert Image Here]
    *Figure 1-19. The linear model that fits the training data best*
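
To make the cost function and training steps concrete, here is a minimal NumPy sketch that fits Equation 1-1 to the five rows of Table 1-1 only. It uses the closed-form least-squares solution for a single feature and mean squared error as the cost; both are choices made for this sketch. Because it uses only five countries, the fitted parameters and prediction will differ from those obtained on a full dataset.

```python
import numpy as np

# The five rows of Table 1-1
gdp = np.array([12240, 27195, 37675, 50962, 55805], dtype=float)
life_sat = np.array([4.9, 5.8, 6.5, 7.3, 7.2])

def cost(theta0, theta1):
    """Mean squared distance between the model's predictions and the examples."""
    predictions = theta0 + theta1 * gdp
    return np.mean((predictions - life_sat) ** 2)

# Training: closed-form least squares for a single feature
gdp_centered = gdp - gdp.mean()
theta1 = np.sum(gdp_centered * (life_sat - life_sat.mean())) / np.sum(gdp_centered ** 2)
theta0 = life_sat.mean() - theta1 * gdp.mean()

print(theta0, theta1)                          # fitted intercept and slope
print(cost(theta0, theta1) < cost(0.0, 2e-4))  # the fitted line beats an arbitrary guess -> True

# Inference: apply the fitted model to a new case (Cyprus's GDP per capita)
print(theta0 + theta1 * 22587)
```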

* Example 1-1. Training and running a linear model using Scikit-Learn
```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
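gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
# NOTE: prepare_country_stats() is not defined in this section; it is assumed to be
# a helper defined elsewhere that merges the two tables into a single DataFrame.
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]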

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]] # Cyprus's GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]
```

---

## Main Challenges of Machine Learning

Challenges are classified as "bad data" or "bad algorithm".

### Bad Data

* **Insufficient Quantity of Training Data**: Most ML algorithms need thousands or millions of examples to work properly.
    * **The Unreasonable Effectiveness of Data**: Studies show that different algorithms perform nearly identically on complex problems when given enough data. Data often matters more than algorithm development.
    * Figure 1.20: [Insert Image Here]
        *Figure 1-20. The importance of data versus algorithms*
* **Nonrepresentative Training Data**: Training data must be representative of the cases you want to generalize to. Nonrepresentativeness is caused by:
    * **Sampling noise**: Nonrepresentative data due to a small sample.
    * **Sampling bias**: Flawed sampling method, even with a large sample. The 1936 Literary Digest poll suffered from sampling bias (wealthier subscribers) and **nonresponse bias** (low response rate).
    * Figure 1.21: [Insert Image Here]
        *Figure 1-21. A more representative training sample*
* **Poor-Quality Data**: Data full of errors, outliers, and noise makes it hard to detect underlying patterns.
    * Cleaning up training data is crucial. Cleaning involves discarding or fixing outliers and handling missing features (ignoring the attribute, ignoring the affected instances, or filling in the missing values).
* **Irrelevant Features**: The training data needs enough relevant features.
    * **Feature engineering** is crucial and involves feature selection, feature extraction (dimensionality reduction helps), and creating new features.

### Bad Algorithm

* **Overfitting the Training Data**: The model performs well on the training data but fails to generalize to new instances. This occurs when the model is **too complex** relative to the amount and noisiness of the training data.
    * **Solutions**: Simplify the model (e.g., less complex model, fewer attributes), gather more training data, or reduce noise in the training data.
    * **Regularization**: Constraining a model to make it simpler and reduce overfitting risk. The amount of regularization is controlled by a **hyperparameter**.
    * Figure 1.22: [Insert Image Here]
        *Figure 1-22. Overfitting the training data*
    * Figure 1.23: [Insert Image Here]
        *Figure 1-23. Regularization reduces the risk of overfitting*
* **Underfitting the Training Data**: The model is too simple to learn the underlying structure of the data.
    * **Solutions**: Select a more powerful model, feed better features (feature engineering), or reduce constraints (e.g., reduce the regularization hyperparameter).
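
To illustrate the effect of regularization described above, the sketch below fits an overly flexible polynomial model to a handful of noisy points, once without constraints and once with Scikit-Learn's `Ridge` (L2 regularization). The data, the polynomial degree, and `alpha=1.0` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Made-up noisy data from a linear trend: y = 2x + noise
x = np.linspace(-1, 1, 12).reshape(-1, 1)
y = 2 * x.ravel() + rng.normal(scale=0.3, size=12)

# A degree-9 polynomial is far too flexible for 12 noisy points
X_poly = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)

unconstrained = LinearRegression().fit(X_poly, y)
regularized = Ridge(alpha=1.0).fit(X_poly, y)   # alpha is the regularization hyperparameter

# The L2 penalty shrinks the weights, giving a simpler model
print(np.linalg.norm(unconstrained.coef_), np.linalg.norm(regularized.coef_))
# The unconstrained model always fits the training set at least as well
print(unconstrained.score(X_poly, y), regularized.score(X_poly, y))
```

A better training score for the unconstrained model is not evidence that it generalizes better; that can only be judged on data the model has not seen.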

---

## Testing and Validating

The only way to know how well a model will generalize is to try it on new cases.

### Training and Test Sets

Data is split into a **training set** (for training) and a **test set** (for testing). The error rate on new cases is the **generalization error** (or **out-of-sample error**). If training error is low but generalization error is high, the model is **overfitting**.
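
A minimal sketch of this split using Scikit-Learn's `train_test_split` follows. The dataset is synthetic, and the 80/20 ratio is a common convention assumed here rather than something prescribed by the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                     # made-up feature
y = 3 * X.ravel() + 5 + rng.normal(scale=1.0, size=200)   # made-up noisy target

# Hold out 20% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))  # estimates the generalization error
print(train_error, test_error)
```

A test error far above the training error would suggest overfitting; for this simple linear model on linear data the two should be similar.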

### Hyperparameter Tuning and Model Selection

Evaluating multiple models or tuning hyperparameters repeatedly on the test set leads to the model overfitting the test set itself, resulting in an inaccurate generalization error estimate.

**Holdout validation** is the common solution:

1.  Hold out a **validation set** (or **dev set**) from the training data.
2.  Train candidate models (with various hyperparameters) on the **reduced training set** (excluding the validation set).
3.  Select the best model based on performance on the **validation set**.
4.  Train the best model on the **full training set** (including the validation set) to get the final model.
5.  Evaluate the final model on the **test set** for an unbiased estimate of generalization error.

**Repeated cross-validation** can be used to average evaluations across many small validation sets for a more accurate performance measure.
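
The holdout validation procedure above can be sketched as follows. The synthetic data, the choice of `KNeighborsRegressor`, the candidate values of `n_neighbors`, and the split ratios are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))                      # made-up feature
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=300)    # made-up noisy target

# Set the test set aside first, then hold out a validation set from what remains
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42)

# Train one candidate per hyperparameter value on the reduced training set
val_errors = {}
for k in (1, 5, 15, 50):
    candidate = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    val_errors[k] = mean_squared_error(y_val, candidate.predict(X_val))

best_k = min(val_errors, key=val_errors.get)   # select on the validation set only

# Retrain the winner on the full training set, then evaluate once on the test set
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_train_full, y_train_full)
test_error = mean_squared_error(y_test, final_model.predict(X_test))
print(best_k, test_error)
```

The test set is touched exactly once, after model selection is finished. For the repeated variant, Scikit-Learn's `cross_val_score` in `sklearn.model_selection` can average the evaluation over several validation folds.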


### Data Mismatch

If the large training data (e.g., web images) is not perfectly representative of the production data (e.g., mobile app images), a **data mismatch** occurs.

* The **validation set** and **test set** must be as representative as possible of the production data.
* A **train-dev set** (held out from the nonrepresentative training data) can be used to diagnose the problem.
    * Good performance on the **train-dev set** but poor performance on the validation set implies a data mismatch problem.
    * Poor performance on the **train-dev set** implies the model has overfit the training set.
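
The diagnostic logic above can be written as a small decision rule. This is only a sketch: the `diagnose` function is hypothetical, the error rates passed to it are made up, and the `tolerance` threshold for what counts as a "large" gap is an arbitrary assumption that would need to be chosen per project.

```python
def diagnose(train_error, train_dev_error, val_error, tolerance=0.02):
    """Compare error rates (fractions in [0, 1]); `tolerance` is an arbitrary gap threshold."""
    # Train-dev data comes from the same distribution as the training data,
    # so a large gap here points to overfitting rather than mismatch.
    if train_dev_error - train_error > tolerance:
        return "overfitting the training set"
    # The model handles unseen training-like data but not production-like data.
    if val_error - train_dev_error > tolerance:
        return "data mismatch"
    return "no large gap detected"

print(diagnose(0.01, 0.09, 0.10))  # -> overfitting the training set
print(diagnose(0.01, 0.02, 0.12))  # -> data mismatch
```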