# 1.1 What is machine learning

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

# 1.2 Why use machine learning

The traditional approach: rule-based programming. The rules tend to become long, complex, hard to maintain, and inflexible.

The machine learning approach: the system detects patterns in the data by itself, so the program is shorter, often more accurate, flexible, and adaptable to change.

Data mining: Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent

Machine learning can help humans learn

Summary: machine learning is great for:
- Problems for which existing solutions require a lot of fine-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better than the traditional approach
- Complex problems for which using a traditional approach yields no good solution: the best Machine Learning techniques can perhaps find a solution
- Fluctuating environments: a Machine Learning system can adapt to new data
- Getting insights about complex problems and large amounts of data

# 1.3 Examples of applications

Analyzing images of products on a production line to automatically classify them: This is image classification, typically performed using convolutional neural networks (CNNs)

Detecting tumors in brain scans: This is semantic segmentation, where each pixel in the image is classified (as we want to determine the exact location and shape of tumors), typically using CNNs

Automatically classifying news articles: This is natural language processing (NLP), and more specifically text classification, which can be tackled using recurrent neural networks (RNNs), CNNs or Transformers

Automatically flagging offensive comments on discussion forums: This is also text classification, using the same NLP tools

Summarizing long documents automatically: This is a branch of NLP called text summarization, again using the same tools

Creating a chatbot or a personal assistant: This involves many NLP components, including natural language understanding (NLU) and question-answering modules

Forecasting your company's revenue next year, based on many performance metrics: This is a regression task (i.e. predicting values) that may be tackled using any regression model, such as a Linear Regression or Polynomial Regression model, a regression SVM, a regression Random Forest, or an artificial neural network. If you want to take into account sequences of past performance metrics, you may want to use RNNs, CNNs or Transformers

Making your app react to voice commands: This is speech recognition, which requires processing audio samples. Since they are long and complex sequences, they are typically processed using RNNs, CNNs, or Transformers

Detecting credit card fraud: This is anomaly detection

Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment: This is clustering

Representing a complex, high-dimensional dataset in a clear and insightful diagram: This is data visualization, often involving dimensionality reduction techniques

Recommending a product that a client may be interested in, based on past purchases: This is a recommender system. One approach is to feed past purchases (and other information about the client) to an artificial neural network, and get it to output the most likely next purchase. This neural net would typically be trained on past sequences of purchases across all clients

Building an intelligent bot for a game: This is often tackled using Reinforcement Learning (RL), which is a branch of Machine Learning that trains agents (such as bots) to pick the actions that will maximize their rewards over time (e.g. a bot may get a reward every time the player loses some life points), within a given environment (such as the game). The famous AlphaGo program that beat the world champion at the game of Go was built using RL

# 1.4 Types of machine learning systems

**Criteria**:
- Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)
- Whether or not they can learn incrementally on the fly (online versus batch learning)
- Whether they work by simply comparing new data points to known data points, or instead by detecting patterns in the training data and building a predictive model, much like scientists do (instance-based versus model-based learning)

## 1.4.1 Supervised/Unsupervised learning

**Supervised learning**

In supervised learning, the training set you feed to the algorithm includes the desired solutions, called labels

Typical tasks: classification (predicting a class) and regression (predicting a numeric value)

Examples: k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision Trees and Random Forests, Neural networks
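As a minimal sketch of supervised learning, the snippet below trains a k-Nearest Neighbors classifier on a tiny labeled training set. The data (two features, labels 0 and 1) is made up for illustration, and the choice of classifier is arbitrary.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training set (made up): each row is an instance, each label is the desired solution
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # class 0
           [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]   # class 1
y_train = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)   # learn from the labeled examples

# Predict labels for new, unlabeled instances
predictions = clf.predict([[1.1, 0.9], [5.1, 5.0]])
print(predictions)
```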

**Unsupervised learning**

In unsupervised learning, the training data is unlabeled. The system tries to learn without a teacher

Clustering: k-Means, DBSCAN, Hierarchical Cluster Analysis (HCA)
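A small clustering sketch with k-Means, assuming synthetic data made of two well-separated blobs. No labels are given to the algorithm; it only receives the points and the number of clusters to look for.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two well-separated blobs of unlabeled points (synthetic data)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

# Points from the same blob should end up in the same cluster
print(cluster_ids[:5], cluster_ids[-5:])
```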

Anomaly detection and novelty detection: One-class SVM, Isolation Forest
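A hedged sketch of anomaly detection with an Isolation Forest. The data is synthetic, and the `contamination` value (the assumed share of anomalies) is an illustrative guess rather than a recommended setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(100, 2))  # synthetic "normal" data
outlier = np.array([[8.0, 8.0]])                               # one obvious anomaly
X = np.vstack([normal_points, outlier])

# contamination is the assumed share of anomalies in the data (here about 1%)
detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(X)

# predict() returns -1 for anomalies and 1 for normal instances
outlier_label = detector.predict(outlier)[0]
typical_label = detector.predict([[0.0, 0.0]])[0]
print(outlier_label, typical_label)
```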

Visualization and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally Linear Embedding (LLE), t-Distributed Stochastic Neighbor Embedding (t-SNE)
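A minimal dimensionality reduction sketch using PCA. The 3D dataset is synthetic and built so that its third feature is almost a combination of the first two, which is why two components keep nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 3D data whose third feature is almost a combination of the first two
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2, x3])

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)   # project onto the 2 directions of highest variance

print(X_2d.shape)
print(pca.explained_variance_ratio_.sum())   # share of the variance that is preserved
```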

Association rule learning: Apriori, Eclat

**Semisupervised learning**

Since labeling data is usually time-consuming and costly, you will often have plenty of unlabeled instances and few labeled instances. Some algorithms can deal with data that is partially labeled. This is called semisupervised learning

**Reinforcement learning**

The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation
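To make the agent, reward, and policy vocabulary concrete, here is a small sketch using tabular Q-learning, one common RL algorithm chosen purely for illustration. The environment is a hypothetical five-cell corridor: the agent starts in cell 0 and gets a reward of 1 when it reaches cell 4. The step size and discount factor are arbitrary.

```python
import random

N_STATES = 5              # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = [-1, +1]        # move left or move right
ALPHA, GAMMA = 0.5, 0.9   # step size and discount factor (arbitrary choices)

rng = random.Random(0)
# Q[state][action_index] estimates the long-term reward of taking that action
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(500):
    state = 0
    for step in range(1000):                 # safety cap on episode length
        a = rng.randrange(2)                 # explore with random actions
        next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        done = next_state == N_STATES - 1
        reward = 1.0 if done else 0.0
        # Q-learning update: move toward reward + discounted best future value
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][a] += ALPHA * (target - Q[state][a])
        state = next_state
        if done:
            break

# The learned policy: in each non-terminal cell, pick the action with the highest Q
policy = [ACTIONS[Q[s].index(max(Q[s]))] for s in range(N_STATES - 1)]
print(policy)
```

The resulting `policy` list maps each situation (cell) to an action, which is what a policy is in the sense described above.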

## 1.4.2 Batch and online learning

**Batch learning**

The system is incapable of learning incrementally. It must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore. It just applies what it has learned. This is called offline learning

If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.

Fortunately, the whole process of training, evaluating, and launching a Machine Learning system can be automated fairly easily, so even a batch learning system can adapt to change. Simply update the data and train a new version of the system from scratch as often as needed.

**Online learning**

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called minibatches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives

Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and “replay” the data). This can save a huge amount of space.

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data
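Both uses of online learning described above (learning from a stream, and out-of-core learning on chunks of a large dataset) come down to the same loop. The sketch below assumes scikit-learn's `SGDRegressor` and its `partial_fit` method; the minibatches are generated synthetically here, whereas in practice they would come from a data stream or from successive chunks of a file.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

# Simulate a data stream: each minibatch arrives, is learned from, then discarded
for _ in range(200):
    X_batch = rng.normal(size=(20, 1))
    y_batch = 3.0 * X_batch[:, 0] + 2.0 + rng.normal(scale=0.1, size=20)
    model.partial_fit(X_batch, y_batch)   # one incremental learning step

# The synthetic data was generated with slope 3 and intercept 2
print(model.coef_, model.intercept_)
```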

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (you don’t want a spam filter to flag only the latest kinds of spam it was shown). Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers).
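The trade-off can be illustrated with a deliberately simple running estimate (not a full ML model): each new observation pulls the estimate toward itself by a fraction equal to the learning rate. The function name and the data stream are hypothetical.

```python
def online_estimate(stream, learning_rate, start=0.0):
    """Track a value online: move the estimate toward each new observation."""
    estimate = start
    history = []
    for x in stream:
        estimate += learning_rate * (x - estimate)
        history.append(estimate)
    return history

# The data "changes" from 0 to 10, then a single outlier (100) shows up
stream = [0.0] * 5 + [10.0] * 5 + [100.0]

fast = online_estimate(stream, learning_rate=0.5)
slow = online_estimate(stream, learning_rate=0.05)

print(fast[9], slow[9])                        # estimates after the change to 10
print(fast[10] - fast[9], slow[10] - slow[9])  # jump caused by the outlier
```

With the high learning rate the estimate is close to 10 after five observations but the outlier moves it a lot; with the low learning rate it lags behind the change but is barely disturbed by the outlier.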

A big challenge with online learning is that if bad data is fed to the system, the system’s performance will gradually decline. If it’s a live system, your clients will notice. For example, bad data could come from a malfunctioning sensor on a robot, or from someone spamming a search engine to try to rank high in search results. To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance. You may also want to monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm).

## 1.4.3 Instance-based versus model-based learning

ML systems can also be categorized by how they generalize

**Instance-based learning**

No model parameters are learned: the training instances themselves are kept

Possibly the most trivial form of learning is simply to learn by heart. If you were to create a spam filter this way, it would just flag all emails that are identical to emails that have already been flagged by users—not the worst solution, but certainly not the best.

Instead of just flagging emails that are identical to known spam emails, your spam filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.

This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples (or a subset of them)
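The word-count similarity described above can be sketched in a few lines. The function names, the example emails, and the threshold of three shared words are all hypothetical choices for illustration.

```python
def words_in_common(email_a, email_b):
    """Very basic similarity: number of distinct words shared by two emails."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

def is_spam(email, known_spam, threshold=3):
    """Flag the email if it shares at least `threshold` words with any known spam."""
    return any(words_in_common(email, spam) >= threshold for spam in known_spam)

# "Learning by heart": the system just stores the flagged examples
known_spam = [
    "win a free prize now",
    "cheap loans approved instantly",
]

print(is_spam("claim your free prize now", known_spam))
print(is_spam("meeting moved to friday", known_spam))
```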

**Model-based learning**

Model parameters are learned from the training data

Another way to generalize from a set of examples is to build a model of these examples and then use that model to make predictions. This is called model-based learning

**Note**: Model selection consists in choosing the type of model and fully specifying its architecture. Training a model means running an algorithm to find the model parameters that will make it best fit the training data (and hopefully make good predictions on new data)

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
# (prepare_country_stats() is a helper that is not defined in these notes;
# it must be defined before this line and return a DataFrame with the
# "GDP per capita" and "Life satisfaction" columns used below)
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus's GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]
```

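To use an instance-based model instead, replace the model selection step in the code above with k-Nearest Neighbors regression:

```python
import sklearn.neighbors

# Select a 3-Nearest Neighbors regression model
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
```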

If all went well, your model will make good predictions. If not, you may need to use more attributes (employment rate, health, air pollution, etc.), get more or better-quality training data, or perhaps select a more powerful model (e.g., a Polynomial Regression model).

**Summary**:
- You studied the data
- You selected a model
- You trained it on the training data (i.e. the learning algorithm searched for the model parameter values that minimize a cost function)
- Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well
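To make "minimize a cost function" concrete, the sketch below uses mean squared error as the cost for a one-feature linear model and finds the minimizing parameters with NumPy's least-squares solver. The data is synthetic, and the names `theta0`, `theta1`, and `mse_cost` are illustrative.

```python
import numpy as np

# Synthetic data roughly following y = 4 + 3x (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 4.0 + 3.0 * x + rng.normal(scale=0.1, size=100)

def mse_cost(theta0, theta1):
    """Mean squared error of the linear model theta0 + theta1 * x on the training data."""
    return np.mean((theta0 + theta1 * x - y) ** 2)

# Training = finding the parameter values that minimize the cost.
# For linear regression, least squares gives the minimizer directly.
A = np.column_stack([np.ones_like(x), x])
(theta0, theta1), *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta0, theta1)
print(mse_cost(theta0, theta1), mse_cost(0.0, 1.0))  # fitted parameters vs. an arbitrary guess
```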

# 1.5 Main challenges of machine learning

## 1.5.1 Insufficient quantity of training data

The unreasonable effectiveness of data

## 1.5.2 Nonrepresentative training data

Sampling noise (sample is too small) and sampling bias

## 1.5.3 Poor-quality data

Errors, outliers, and noise

## 1.5.4 Irrelevant features

Garbage in, garbage out 

**Feature engineering**
- Feature selection, select the most useful features to train on among existing features 
- Feature extraction, combine existing features to produce a more useful one, such as dimensionality reduction
- Create new features by gathering new data
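As one possible illustration of feature selection, the sketch below uses scikit-learn's `SelectKBest` with a univariate regression score. The dataset is synthetic: one feature drives the target and the other is pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
useful = rng.normal(size=200)
irrelevant = rng.normal(size=200)   # pure noise, unrelated to the target
X = np.column_stack([useful, irrelevant])
y = 2.0 * useful + rng.normal(scale=0.1, size=200)

# Keep the single feature with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_regression, k=1)
X_selected = selector.fit_transform(X, y)

print(selector.get_support())   # boolean mask of the features that were kept
```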

## 1.5.5 Overfitting the training data

Overgeneralizing and overfitting

The model performs well on the training data, but it does not generalize well

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data
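A rough way to see this is to fit a simple and a complex model to the same small, noisy training set and compare their errors on the training data and on fresh data. The sketch below assumes synthetic data with a linear ground truth and uses polynomial fits of degree 1 and 9; the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
# The true relationship is linear; the training set is small and noisy
x_train = np.linspace(-1, 1, 12)
y_train = 2.0 * x_train + rng.normal(scale=0.3, size=12)
x_new = np.linspace(-1, 1, 100)
y_new = 2.0 * x_new + rng.normal(scale=0.3, size=100)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (given by its coefficients) on (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

simple = np.polyfit(x_train, y_train, deg=1)     # few parameters
complex_ = np.polyfit(x_train, y_train, deg=9)   # many parameters for 12 points

print("train:", mse(simple, x_train, y_train), mse(complex_, x_train, y_train))
print("new:  ", mse(simple, x_new, y_new), mse(complex_, x_new, y_new))
```

The high-degree model always fits the training points at least as well as the line; overfitting shows up when its error on the new data is noticeably worse than its training error.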

**Solutions**:
- Simplify the model by selecting one with fewer parameters (e.g. a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
- Gather more training data
- Reduce the noise in the training data (e.g. fix data errors and remove outliers)

**Regularization**: Constrain a model to make it simpler and reduce the risk of overfitting. The amount of regularization to apply during learning can be controlled by a hyperparameter
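As a minimal sketch, Ridge regression is one regularized linear model whose `alpha` hyperparameter controls the amount of regularization. The data is synthetic and the value `alpha=50.0` is an arbitrary choice to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=30)

unconstrained = LinearRegression().fit(X, y)
# alpha is the regularization hyperparameter: larger values constrain the model more
regularized = Ridge(alpha=50.0).fit(X, y)

print(unconstrained.coef_, regularized.coef_)   # the Ridge slope is pulled toward 0
```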

## 1.5.6 Underfitting the training data

The model is too simple to learn the underlying structure of the data

**Solutions**:
- Select a more powerful model, with more parameters
- Feed better features to the learning algorithm (feature engineering)
- Reduce the constraints on the model (e.g. reduce the regularization hyperparameter)
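The first two solutions can be sketched on synthetic data with a quadratic structure: a straight line underfits it, while the same linear algorithm fed a squared feature (a more powerful model, or equivalently a better feature) fits it well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic structure that a straight line cannot capture
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X[:, 0] ** 2

linear = LinearRegression().fit(X, y)
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

# score() returns R^2 on the given data (1.0 is a perfect fit)
print(linear.score(X, y), quadratic.score(X, y))
```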

## 1.5.7 Stepping back

**Summary**:
- Machine learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules 
- There are many different types of ML systems: supervised or not, batch or online, instance-based or model-based 
- In an ML project you gather data in a training set, and you feed the training set to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set (i.e. to make good predictions on the training set itself), and then hopefully it will be able to make good predictions on new cases as well. If the algorithm is instance-based, it just learns the examples by heart and generalizes to new instances by using a similarity measure to compare them to the learned instances
- The system will not perform well if your training set is too small, or if the data is not representative, is noisy, or is polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit)

# 1.6 Testing and validating

## 1.6.1 Hyperparameter tuning and model selection

Holdout validation: hold out part of the training set to evaluate several candidate models and select the best one

**Steps**:
- Train multiple models with various hyperparameters on the reduced training set (i.e. the full training set minus the validation set)
- Select the model that performs best on the validation set
- After this holdout validation process, train the best model on the full training set (including the validation set), and this gives the final model. 
- Lastly, evaluate this final model on the test set to get an estimate of the generalization error 
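The steps above can be sketched as follows, assuming synthetic data, Ridge regression as the model family, and three hypothetical candidate values for its regularization hyperparameter.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)

# Set aside a test set, then hold out part of the training set for validation
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)

# Train each candidate on the reduced training set and score it on the validation set
candidates = [0.01, 1.0, 100.0]   # hypothetical hyperparameter values
val_scores = {a: Ridge(alpha=a).fit(X_train, y_train).score(X_val, y_val)
              for a in candidates}
best_alpha = max(val_scores, key=val_scores.get)

# Retrain the best model on the full training set (validation set included)
final_model = Ridge(alpha=best_alpha).fit(X_train_full, y_train_full)

# Estimate the generalization performance on the test set, used only once
test_score = final_model.score(X_test, y_test)
print(best_alpha, test_score)
```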

**Cross-validation**

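Cross-validation uses many small validation sets. Each model is evaluated once per validation set after it is trained on the rest of the data. Averaging out all the evaluations of a model gives a much more accurate measure of its performance, at the cost of a training time multiplied by the number of validation sets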
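A short sketch of this procedure using scikit-learn's `cross_val_score` with five folds. The data and the choice of a linear model are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# cv=5: the data is split into 5 folds; each fold serves once as the validation set
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(scores)          # one R^2 score per validation fold
print(scores.mean())   # averaged evaluation of the model
```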

## 1.6.2 Data mismatch

The validation set and the test set must be as representative as possible of the data you expect to use in production, so they should be composed exclusively of representative instances

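Train-dev set: when the training data comes from a different source than the production data, hold out part of the training data as a train-dev set. If the model does well on the training set but poorly on the train-dev set, it is overfitting the training set. If it does well on the train-dev set but poorly on the validation set, the problem comes from the data mismatch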

**No free lunch theorem:**

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. To decide what data to discard and what data to keep, you must make assumptions. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise.

If you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the no free lunch theorem. For some datasets, the best model is a linear model, while for other datasets, it is a neural network. There is no model that is a priori guaranteed to work better. The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.
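In practice, "evaluate only a few reasonable models" might look like the sketch below, which compares two models on two synthetic datasets using cross-validation. The datasets, the two candidate models, and the dictionary names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
datasets = {
    "linear": 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200),
    "wavy": np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.1, size=200),
}
models = {
    "linear regression": LinearRegression(),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=5),
}

# Mean cross-validated R^2 for every (dataset, model) pair
results = {
    (data_name, model_name): cross_val_score(model, X, y, cv=5).mean()
    for data_name, y in datasets.items()
    for model_name, model in models.items()
}
for key, score in results.items():
    print(key, round(score, 3))
```

The linear model's assumption matches the first dataset but not the second, where the nearest-neighbors model scores much higher: which model is better depends on the data.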