# Chapter 19 - Learning From Examples

*In which we describe agents that can improve their behavior through diligent study of past
experiences and predictions about the future.* - [Artificial Intelligence: A Modern Approach](http://aima.cs.berkeley.edu/)



## Introduction

- **Learning and machine learning** : An agent is considered to be learning if its performance on tasks improves over time after making observations about the world. This encompasses a broad spectrum of learning, from simple tasks to complex theories. Machine learning specifically refers to this process when the agent is a computer. 
- **Need for machine learning** : There are two primary reasons for utilizing machine learning: 
  - **Anticipation of future situations** : Designers cannot foresee every possible scenario an agent might encounter, such as a robot navigating unknown mazes or a system predicting stock market trends during economic shifts from boom to bust. 
  - **Complexity of programming certain tasks** : For some tasks, like facial recognition, even skilled programmers may not know how to explicitly program a solution due to the subconscious nature of these tasks in humans. Machine learning algorithms offer a viable approach. 
- **Content overview** : The chapter discusses various model classes, including decision trees, linear models, nonparametric models such as nearest neighbors, and ensemble models such as random forests. It also offers practical advice on building machine learning systems and covers the theory underlying machine learning.

## Basics of Machine Learning

Machine learning is a field of artificial intelligence (AI) focused on building systems that learn from data to make decisions or predictions. Here are the basics:
### 1. **What is Machine Learning?** 

Machine learning involves algorithms and statistical models that computer systems use to perform tasks without using explicit instructions. Instead, they rely on patterns and inference derived from data.
### 2. **Types of Machine Learning**  
- **Supervised Learning:**  The model is trained on a labeled dataset, in which each training example is paired with an output label. It learns a mapping from inputs to labels that it can then apply to new inputs. 
- **Unsupervised Learning:**  The model is trained on data without labeled responses. It tries to find patterns and relationships in the data by itself. 
- **Semi-supervised Learning:**  A mix of supervised and unsupervised learning. The model is trained on a partially labeled dataset. 
- **Reinforcement Learning:**  The model learns to make decisions by performing actions in an environment to achieve some goals. It learns from the consequences of its actions, rather than from being taught explicitly.
### 3. **Key Concepts**  
- **Dataset:**  A collection of data that the machine learning model learns from. It's usually divided into training and testing sets. 
- **Features:**  The input variables of the dataset. They are the characteristics based on which the model makes predictions. 
- **Labels:**  In supervised learning, the known output values attached to the training examples, which the model learns to predict. 
- **Model:**  A mathematical representation of the real-world process. It's trained using algorithms on a dataset. 
- **Training:**  The process of teaching a machine learning model to make predictions or decisions based on data. 
- **Inference:**  Using a trained model to make predictions on new, unseen data.
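
The key concepts above can be tied together in a few lines. The following is a minimal sketch, not a recommended method: the dataset, the label names, and the nearest-centroid "model" are all assumptions chosen for brevity. It shows a dataset of (features, label) pairs, a train/test split, a training step, and inference on held-out examples.

```python
# Dataset: each example pairs a feature vector with a label (made-up values).
dataset = [
    ((1.0, 1.2), "small"),
    ((0.8, 1.0), "small"),
    ((1.2, 0.9), "small"),
    ((3.0, 3.2), "large"),
    ((3.3, 2.9), "large"),
    ((2.9, 3.1), "large"),
    ((1.1, 1.1), "small"),
    ((3.1, 3.0), "large"),
]

# Split into a training set and a testing set.
train, test = dataset[:6], dataset[6:]


def fit_centroids(examples):
    """Training: compute the mean feature vector (centroid) of each label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(total / counts[label] for total in acc)
        for label, acc in sums.items()
    }


def predict(model, features):
    """Inference: return the label whose centroid is closest to the features."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


model = fit_centroids(train)
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Evaluating on examples that were not used for training is what makes the reported accuracy an estimate of performance on new data.
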
### 4. **Common Algorithms**  
- **Linear Regression:**  Used for predicting a continuous value. 
- **Logistic Regression:**  Used for binary classification tasks. 
- **Decision Trees:**  Can be used for classification or regression tasks. They split the data based on certain conditions. 
- **Neural Networks:**  Complex models that can capture non-linear relationships in data. They're particularly useful for image and speech recognition.
### 5. **Evaluation Metrics** 

Different tasks use different metrics for evaluating the performance of machine learning models. Common metrics include accuracy, precision, recall, F1 score, and mean squared error.
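
As a rough illustration of how these metrics are computed, here is a small sketch using their standard definitions. The function names and the example labels are made up for this illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def mean_squared_error(y_true, y_pred):
    """Average squared difference, used for regression tasks."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions.
metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 1, 1, 0],
)
print(metrics)
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))
```

Accuracy, precision, recall, and F1 apply to classification, while mean squared error applies to regression.
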
### 6. **Overfitting and Underfitting**  
- **Overfitting:**  The model performs well on the training data but poorly on new, unseen data. It has essentially memorized the training dataset, including the noise and outliers. 
- **Underfitting:**  The model is too simple to capture the underlying structure of the data, leading to poor performance on both the training and testing sets.
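
The contrast can be made concrete with a toy experiment. This is a sketch under stated assumptions: the data are invented (an underlying relationship `y = 2x` with alternating noise of +1 and -1 on the training labels), and the three "models" are deliberately simple stand-ins for a too-simple model, a reasonable model, and a model that memorizes the training set.

```python
# Toy data: the underlying relationship is y = 2x; training labels are noisy.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]  # noise-free test labels, for clarity


def mse(model, xs, ys):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Too simple: always predicts the mean training label."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares straight line (closed form)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Memorizes the training set: returns the label of the nearest x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


fitters = {"constant": fit_constant, "line": fit_line, "memorizer": fit_memorizer}
results = {}
for name, fit in fitters.items():
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:10s} train MSE = {results[name][0]:.3f}  "
          f"test MSE = {results[name][1]:.3f}")
```

On this toy data the memorizer reaches zero training error yet does worse on the test points than the straight line (overfitting), while the constant model does poorly on both sets (underfitting).
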
### 7. **Improving a Model**  
- **Feature Engineering:**  Creating new features or modifying existing ones to improve model performance. 
- **Regularization:**  Techniques to prevent overfitting by penalizing complex models. 
- **Hyperparameter Tuning:**  Adjusting the settings of the machine learning algorithm to optimize performance.

## 19.1 Forms of Learning
- **Learning components in agent programs** : Machine learning can enhance any part of an agent program, influenced by which component is being improved, the agent's prior knowledge, and the available data and feedback. 
- **Agent design components** :
  1. Direct mapping from state conditions to actions.
  2. Inferring world properties from percept sequences.
  3. Understanding how the world evolves and the outcomes of actions.
  4. Utility information for determining the desirability of world states.
  5. Action-value information for assessing the desirability of actions.
  6. Goals that outline the most desirable states.
  7. A problem generator, critic, and learning element for system improvement. 
- **Examples of learning in action** : For instance, a self-driving car learning from a human driver might learn when to brake based on observed conditions (1), recognize objects like buses from camera images (2), learn the effects of its actions by experimentation (3), and adjust its utility function based on passenger feedback (4). 
- **Machine learning in software engineering** : Machine learning technologies have become integral to software development, significantly enhancing efficiency and effectiveness in various applications, from analyzing astronomical images to optimizing data center cooling systems. 
- **Agent models and learning algorithms** : The chapter discusses learning algorithms for different agent models, including atomic, factored, and relational models, based on logic or probability. 
- **Assumptions and induction** : The chapter assumes minimal prior knowledge for the agent, focusing on learning from scratch and, briefly, on transfer learning, where knowledge from one domain is applied to a new one. It emphasizes induction, the process of deriving general rules from specific observations, which differs from deduction in its potential for incorrect conclusions. 
- **Learning problems and inputs** : It covers learning problems where inputs are factored representations or vectors of attribute values, distinguishing between classification (discrete outputs) and regression (numerical outputs) learning problems. 
- **Types of learning based on feedback** : 
  - **Supervised learning** : Learning a function from input-output pairs (labels), where the environment acts as a teacher. 
  - **Unsupervised learning** : Learning patterns in the input without explicit feedback, such as clustering. 
  - **Reinforcement learning** : Learning from a series of rewards and punishments to modify actions towards achieving more rewards in the future.

## 19.2 Supervised Learning
- **Task of supervised learning** : The goal is to discover a function $h$ that approximates an unknown true function $f$, given a training set of example input-output pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$. Here, $h$ is a hypothesis about the world, selected from a hypothesis space $H$ of possible functions. 
- **Hypothesis space (H)** : Can vary greatly, from polynomials of a certain degree to sets of functions like JavaScript functions or 3-SAT Boolean logic formulas. The choice of $H$ depends on prior knowledge or exploratory data analysis of the training data. 
- **Selecting a good hypothesis** : Involves choosing a hypothesis that is consistent with the training data. For continuous outputs, this means seeking a best-fit function. The ultimate measure of a hypothesis is its ability to generalize to unseen data, evaluated using a test set. 
- **Bias and variance** : 
  - **Bias**  refers to the predictive hypothesis's tendency to consistently deviate from the expected value across different training sets. High bias can lead to underfitting, where the hypothesis fails to capture the data's pattern. 
  - **Variance**  refers to the change in the hypothesis with fluctuations in the training data. High variance can result in overfitting, where the hypothesis is too tailored to the training data and performs poorly on unseen data. 
- **Bias-variance tradeoff** : Navigating between complex, low-bias hypotheses that fit training data well and simpler, low-variance hypotheses that may generalize better. The goal is to find a hypothesis that matches the data adequately while maintaining simplicity, guided by principles like Ockham's razor. 
- **Defining simplicity and model fitness** : While simplicity is intuitively appealing, the complexity of models like deep neural networks, which can generalize well despite having billions of parameters, shows that simplicity alone is not always the best criterion. Appropriateness to the data and task is crucial. 
- **Choosing the best hypothesis** : Depends on the data's nature and the task. Supervised learning can select the most probable hypothesis given the data, using Bayesian principles to balance the likelihood of the data under a hypothesis with the prior probability of the hypothesis. 
- **Expressiveness vs. computational complexity** : There is a tradeoff between the expressiveness of the hypothesis space and the computational effort required to find a good hypothesis in it. An expressive space makes it possible for a simple hypothesis to fit complex data, but it is harder to search, and the learned hypothesis $h(x)$ can be more expensive to evaluate. 
- **Focus on simpler representations** : Historically, learning has concentrated on simpler representations because they are computationally efficient to learn and practical to use once learned. However, interest has grown in more complex models such as those in deep learning, where the representation is not simple but computing the hypothesis still takes only a bounded amount of time on appropriate hardware.
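
The Bayesian selection rule mentioned under *Choosing the best hypothesis* can be written out explicitly. This is the standard application of Bayes' rule to hypotheses; the symbol $h^*$ for the selected hypothesis is notation introduced here.

$$
h^* = \arg\max_{h \in H} P(h \mid \text{data}) = \arg\max_{h \in H} P(\text{data} \mid h)\, P(h)
$$

The second equality holds because the normalizing term $P(\text{data})$ is the same for every $h$ and so does not affect the maximization. The likelihood $P(\text{data} \mid h)$ rewards hypotheses that fit the data, while the prior $P(h)$ can encode a preference for simpler hypotheses.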

### 19.2.1 Example problem: Restaurant waiting
- **Problem description** : This supervised learning problem involves deciding whether to wait for a table at a restaurant, based on various factors. The output ($y$) is a Boolean variable named WillWait, indicating whether the decision is to wait. 
- **Input attributes (x)** : Consists of a vector of ten discrete attributes that might influence the waiting decision: 
  1. **Alternate** : Availability of a suitable alternative restaurant nearby. 
  2. **Bar** : Presence of a comfortable bar area to wait in. 
  3. **Fri/Sat** : Indicator for Fridays and Saturdays. 
  4. **Hungry** : Immediate hunger state. 
  5. **Patrons** : Restaurant's current occupancy level (None, Some, Full). 
  6. **Price** : Price range of the restaurant (`$`, `$$`, `$$$`). 
  7. **Raining** : Weather condition outside. 
  8. **Reservation** : Whether a reservation has been made. 
  9. **Type** : Type of restaurant (French, Italian, Thai, Burger). 
  10. **WaitEstimate** : Estimated waiting time given by the host (0–10 mins, 10–30 mins, 30–60 mins, >60 mins). 
- **Data sparsity** : The challenge highlighted by this example is the sparse nature of the data. Despite having 9,216 possible combinations of input attributes, only 12 instances are provided for learning. This illustrates the problem of induction in machine learning, where the goal is to make the best guess about the output for the vast majority of possible inputs based on very limited examples.
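
The figure of 9,216 combinations follows from multiplying the number of values each attribute can take. The sketch below reproduces that arithmetic. The attribute names and value lists come from the list above, the Boolean attributes are assumed to take exactly two values, and the dictionary layout is just one convenient way to write them down.

```python
from math import prod

# Number of possible values per attribute, following the list above.
boolean = [True, False]
attribute_values = {
    "Alternate": boolean,
    "Bar": boolean,
    "Fri/Sat": boolean,
    "Hungry": boolean,
    "Patrons": ["None", "Some", "Full"],
    "Price": ["$", "$$", "$$$"],
    "Raining": boolean,
    "Reservation": boolean,
    "Type": ["French", "Italian", "Thai", "Burger"],
    "WaitEstimate": ["0-10", "10-30", "30-60", ">60"],
}

# Six two-valued attributes, two three-valued, two four-valued.
total_inputs = prod(len(values) for values in attribute_values.values())
num_examples = 12

print(total_inputs)                           # 9216
print(f"{num_examples / total_inputs:.2%}")   # share of inputs with a known answer
```

With 12 examples, the learner sees the correct answer for roughly 0.13% of the possible inputs and must guess the rest.
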


## 19.3 Learning Decision Trees
- **Function representation** : A decision tree maps a vector of attribute values to a single output value, facilitating decision-making through a series of tests. It starts at the root and progresses along branches based on test outcomes, ending at a leaf that provides the decision. 
- **Structure of a decision tree** : 
  - **Internal nodes** : Each corresponds to a test on one of the input attributes. 
  - **Branches** : Labeled with possible attribute values, indicating different paths to follow based on the test outcome. 
  - **Leaf nodes** : Specify the decision or output value to return. 
- **Discrete and continuous values** : Although decision trees can handle both discrete and continuous input and output values, the focus here is on discrete inputs and Boolean classification outputs (true or false). 
- **Boolean classification** : In this context, outputs are classified as either positive (true) or negative (false), with $x_j$ representing the input vector for the $j^{th}$ example, $y_j$ the output, and $x_{j,i}$ denoting the $i^{th}$ attribute of the $j^{th}$ example. 
- **Example application** : The decision tree for deciding whether to wait for a table at a restaurant (the example problem described earlier) illustrates how a decision is reached by evaluating the attributes. For instance, if "Patrons = Full" and "WaitEstimate = 0–10", the example would be classified as positive, indicating a decision to wait for a table.
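
The root-to-leaf procedure described above can be sketched with nested data. This is an illustration, not the textbook's tree: only the path Patrons = Full, WaitEstimate = 0–10 leading to a positive decision is taken from the text, and every other branch and leaf below is a placeholder assumption. The tuple-and-dict encoding and the `classify` function are likewise choices made for this sketch.

```python
# An internal node is (attribute_name, {attribute_value: subtree});
# a leaf is a plain Boolean giving the WillWait decision.
RESTAURANT_TREE = ("Patrons", {
    "None": False,                     # placeholder leaf
    "Some": True,                      # placeholder leaf
    "Full": ("WaitEstimate", {
        "0-10": True,                  # the path described in the text
        "10-30": ("Hungry", {True: True, False: False}),  # placeholder subtree
        "30-60": False,                # placeholder leaf
        ">60": False,                  # placeholder leaf
    }),
})


def classify(node, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[example[attribute]]
    return node


example = {"Patrons": "Full", "WaitEstimate": "0-10"}
print(classify(RESTAURANT_TREE, example))  # True: wait for a table
```

Only the attributes tested along the chosen path are consulted, so an example does not need a value for every attribute to be classified.
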

### 19.3.1 Expressiveness of Decision Trees
- **Logical equivalence** : Boolean decision trees can be represented as logical statements in disjunctive normal form (DNF), where the output is equivalent to a disjunction of paths, each path being a conjunction of attribute-value tests. This allows any function expressible in propositional logic to be represented as a decision tree. 
- **Decision tree as a logical statement** : The structure of a decision tree can be seen as a series of logical decisions leading to an outcome, effectively mimicking a "How To" guide for various decisions, making them intuitively appealing and easy to understand in many cases. 
- **Limitations on expressiveness** :
  - Certain functions, such as the majority function (which requires more than half of the inputs to be true for a true output) and the parity function (which requires an even number of true inputs for a true output), demand exponentially large decision trees for accurate representation. 
  - For real-valued attributes, representing functions like $y > A_1 + A_2$, which have a diagonal decision boundary, is challenging with decision trees due to their inherent structure of dividing the space into rectangular, axis-aligned segments. Approximating a diagonal line would require an impractical number of rectangular segments. 
- **Inefficiency for some functions** : Although decision trees are efficient and effective for certain types of functions, they are not universally optimal. Their structure makes them unsuitable for functions that require complex, non-linear decision boundaries or that depend on a balance of numerous attributes. 
- **Representation limitations** : No single representation can efficiently encode all functions, because there are simply too many of them. With $n$ Boolean attributes a truth table has $2^n$ rows, and each row can independently map to true or false, so there are $2^{2^n}$ distinct Boolean functions. With just 20 attributes that is $2^{1{,}048{,}576}$ functions, a number with more than 300,000 decimal digits, so no representation limited to, say, a million bits can express them all.
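
The logical-equivalence point at the start of this section can be sketched in code: collect every root-to-leaf path that ends in a true leaf, turn each path into a conjunction, and join the conjunctions with OR. The small tree below is a hypothetical example, and the tuple-and-dict encoding and function names are assumptions made for this sketch.

```python
from itertools import product

# Hypothetical tree: internal node = (attribute, {value: subtree}), leaf = bool.
TREE = ("Patrons", {
    "None": False,
    "Some": True,
    "Full": ("Hungry", {True: True, False: False}),
})


def classify(node, example):
    """Evaluate the tree on an example by following matching branches."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[example[attribute]]
    return node


def positive_paths(node, path=()):
    """Return every path (a tuple of attribute-value tests) ending in a true leaf."""
    if not isinstance(node, tuple):
        return [path] if node else []
    attribute, branches = node
    paths = []
    for value, child in branches.items():
        paths.extend(positive_paths(child, path + ((attribute, value),)))
    return paths


def to_dnf(paths):
    """Render the paths as a disjunction of conjunctions."""
    return " OR ".join(
        "(" + " AND ".join(f"{a} = {v}" for a, v in path) + ")" for path in paths
    )


def dnf_value(paths, example):
    """Evaluate the DNF: true if all tests of at least one path hold."""
    return any(all(example[a] == v for a, v in path) for path in paths)


paths = positive_paths(TREE)
dnf = to_dnf(paths)
print(dnf)

# Check that the tree and its DNF agree on every possible input.
agree = all(
    classify(TREE, {"Patrons": p, "Hungry": h})
    == dnf_value(paths, {"Patrons": p, "Hungry": h})
    for p, h in product(["None", "Some", "Full"], [True, False])
)
print(agree)
```

Each conjunction corresponds to exactly one path to a positive leaf, which is why the number of disjuncts grows with the number of positive leaves in the tree.
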