# The Machine Learning Landscape

### What Is Machine Learning?

* Machine Learning is the field of study that gives computers the ability to learn
without being explicitly programmed.

Example :

Your spam filter is a Machine Learning program that can learn to flag
spam given examples of spam emails (e.g., flagged by users).

Note : The examples that the system uses to learn are
called the training set. Each training example is called a training instance (or sample).


### Types of Machine Learning Systems

There are so many different types of Machine Learning systems that it is useful to
classify them in broad categories based on:

1. Whether or not they are trained with human supervision (supervised, unsupervised,
semisupervised, and Reinforcement Learning)

2. Whether or not they can learn incrementally on the fly (online versus batch
learning)


3. Whether they work by simply comparing new data points to known data points,
or instead detect patterns in the training data and build a predictive model, much
like scientists do (instance-based versus model-based learning)

### Supervised learning

In supervised learning, the training data you feed to the algorithm includes the desired
solutions, called labels.

Example :

The spam filter is a good example
of this: it is trained with many example emails along with their class (spam or ham),
and it must learn how to classify new emails.

Some of the most important supervised learning algorithms are :

* k-Nearest Neighbors
* Linear Regression
* Logistic Regression
* Support Vector Machines (SVMs)
* Decision Trees and Random Forests
* Neural networks
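
To make the idea of learning from labeled examples concrete, here is a minimal sketch of the first algorithm in the list, k-Nearest Neighbors, written in plain Python. The two features (number of links, number of exclamation marks) and the tiny training set are made up for illustration; a real spam filter would use far richer features.

```python
import math
from collections import Counter

def knn_predict(training_set, new_instance, k=3):
    """Return the majority label among the k training instances closest to new_instance."""
    # Sort the labeled instances by Euclidean distance to the new instance.
    by_distance = sorted(training_set, key=lambda pair: math.dist(pair[0], new_instance))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical features: (number of links, number of exclamation marks).
training_set = [
    ((8, 6), "spam"), ((7, 9), "spam"), ((9, 5), "spam"),
    ((1, 0), "ham"), ((0, 1), "ham"), ((2, 1), "ham"),
]

print(knn_predict(training_set, (6, 7)))  # closest labeled examples are all spam
print(knn_predict(training_set, (1, 1)))  # closest labeled examples are all ham
```

The labels ("spam" or "ham") are what make this supervised: the algorithm never has to discover the classes on its own.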

### Unsupervised learning

In unsupervised learning, as you might guess, the training data is unlabeled.
The system tries to learn without a teacher.

Some of the most important unsupervised learning algorithms are :

* Clustering
    * K-Means
    * DBSCAN
    * Hierarchical Cluster Analysis (HCA)
* Anomaly detection and novelty detection
    * One-class SVM
    * Isolation Forest
* Visualization and dimensionality reduction
    * Principal Component Analysis (PCA)
    * Kernel PCA
    * Locally-Linear Embedding (LLE)
    * t-distributed Stochastic Neighbor Embedding (t-SNE)
* Association rule learning
    * Apriori
    * Eclat


Example :

Say you have a lot of data about your blog’s visitors. You may want to
run a clustering algorithm to try to detect groups of similar visitors.
At no point do you tell the algorithm which group a visitor belongs to:
it finds those connections without your help. For example, it might
notice that 40% of your visitors are males who love comic books and
generally read your blog in the evening, while 20% are young sci-fi lovers.
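
A clustering run like the one described above can be sketched with a bare-bones K-Means in plain Python. The visitor features (age, hour of visit) and the data are invented for illustration, and the initialization (taking the first `k` points as centroids) is a simplification chosen to keep the sketch deterministic; practical implementations use smarter initialization.

```python
import math

def k_means(points, k, n_iterations=10):
    """Group points into k clusters; returns (centroids, clusters)."""
    # Naive initialization (an assumption for brevity): use the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical visitor features: (age, hour of the day of the visit).
visitors = [(35, 21), (16, 15), (38, 22), (17, 16), (40, 20), (15, 14)]

centroids, clusters = k_means(visitors, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are passed in: the two groups (older evening readers, younger afternoon readers) emerge purely from the distances between instances.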

Bonus :

* Another important unsupervised task is anomaly detection: for example, detecting unusual credit card transactions to prevent fraud, or catching manufacturing defects.

* Another common unsupervised task is association rule learning. Suppose you own a supermarket: running an association rule algorithm
on your sales logs may reveal that people who purchase barbecue sauce and potato
chips also tend to buy steak.
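
The supermarket example can be made concrete by measuring how strong a candidate rule is. The sketch below computes two standard measures, support and confidence, for a single hand-picked rule on a made-up sales log. It does not implement the search for rules that Apriori or Eclat perform; it only shows what those algorithms score.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing antecedent, the fraction that also contain consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(1 for t in with_antecedent if consequent <= t) / len(with_antecedent)

# Hypothetical sales log: one set of purchased items per transaction.
sales_log = [
    {"barbecue sauce", "potato chips", "steak"},
    {"barbecue sauce", "potato chips", "steak", "beer"},
    {"barbecue sauce", "potato chips"},
    {"bread", "milk"},
    {"potato chips", "beer"},
]

rule_if = {"barbecue sauce", "potato chips"}
rule_then = {"steak"}

print(support(sales_log, rule_if | rule_then))    # 2 of 5 transactions contain all three items
print(confidence(sales_log, rule_if, rule_then))  # 2 of the 3 sauce-and-chips baskets include steak
```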

Note :

 It is often a good idea to try to reduce the dimension of your training
data using a dimensionality reduction algorithm before you
feed it to another Machine Learning algorithm.

### Semisupervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled
data and a little bit of labeled data. This is called semisupervised learning.

Example :

Some photo-hosting services, such as Google Photos, are good examples of this. Once
you upload all your family photos to the service, it automatically recognizes that the
same person A shows up in photos 1, 5, and 11, while another person B shows up in
photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all
the system needs is for you to tell it who these people are. Just one label per person,
and it is able to name everyone in every photo, which is useful for searching photos.

### Reinforcement Learning


Reinforcement Learning works very differently. The learning system, called an agent
in this context, can observe the environment, select and perform actions, and get
rewards in return (or penalties in the form of negative rewards). It
must then learn by itself what is the best strategy, called a policy, to get the most
reward over time. A policy defines what action the agent should choose when it is in a
given situation.
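
The agent, reward, and policy vocabulary can be illustrated with a small tabular Q-learning sketch. Everything here is an assumption made for the example: the environment is a toy five-cell corridor with the goal at the right end, and the reward values and hyperparameters (`alpha`, `gamma`, `epsilon`) are arbitrary choices, not recommendations.

```python
import random

N_STATES = 5          # corridor positions 0..4; position 4 is the goal
ACTIONS = (-1, +1)    # step left or step right

def step(state, action):
    """Toy environment: move along the corridor, return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True      # reward for reaching the goal
    return next_state, -0.01, False       # small penalty (negative reward) otherwise

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error using the Q-learning update rule."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q_table = train()
# The policy maps each non-goal state to the action with the highest learned value.
policy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

Nobody tells the agent that moving right is correct; it infers this from the rewards it collects.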


Example :

Many robots implement Reinforcement Learning algorithms to learn
how to walk.

### Batch and Online Learning

#### Batch learning

* In batch learning, the system is incapable of learning incrementally: it must be trained
using all the available data.

* First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.

Note :

If you want a batch learning system to know about new data (such as a new type of
spam), you need to train a new version of the system from scratch on the full dataset
(not just the new data, but also the old data), then stop the old system and replace it
with the new one.

#### Online learning

* In online learning, you train the system incrementally by feeding it data instances
sequentially, either individually or by small groups called mini-batches.

* Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
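
Incremental training can be sketched with a linear model that takes one stochastic gradient descent step per incoming instance. The class name `OnlineLinearRegressor`, the synthetic data stream (targets generated as `2x + 1`), and the learning rate are all assumptions made for this illustration.

```python
import random

class OnlineLinearRegressor:
    """Linear model y = theta0 + theta1 * x, updated one instance at a time."""

    def __init__(self, learning_rate=0.1):
        self.theta0 = 0.0
        self.theta1 = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.theta0 + self.theta1 * x

    def learn_one(self, x, y):
        # One gradient descent step on the squared error of this single instance.
        error = self.predict(x) - y
        self.theta0 -= self.learning_rate * error
        self.theta1 -= self.learning_rate * error * x

def data_stream(n, seed=42):
    """Synthetic stream standing in for data that arrives continuously."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.random()
        yield x, 2.0 * x + 1.0

model = OnlineLinearRegressor()
for x, y in data_stream(5000):
    model.learn_one(x, y)   # the instance can be discarded after this step

print(round(model.theta0, 2), round(model.theta1, 2))
```

Because each instance is used once and then dropped, memory use stays constant no matter how long the stream runs.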

Example :

Online learning is great for systems that receive data as a continuous flow (e.g., stock
prices) and need to adapt to change rapidly or autonomously.

Note :

A big challenge with online learning is that if bad data is fed to the system, the
system’s performance will gradually decline.

* To reduce this risk, you need to monitor your system closely and promptly
switch learning off (and possibly revert to a previously working state) if you detect a
drop in performance.

### Instance-Based Versus Model-Based Learning

#### Instance-based learning

* Possibly the most trivial form of learning is simply to learn by heart.

Example :

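Instead of just flagging emails that are identical to known spam emails, your spam
filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have
in common. The system would flag an email as spam if it has many words in common with a known spam email.

This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples.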
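
The word-count similarity measure described above fits in a few lines of Python. The stored spam examples and the threshold of three shared words are made up for illustration.

```python
def similarity(email_a, email_b):
    """Number of distinct words the two emails have in common."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

# The "learned by heart" examples: known spam emails (hypothetical).
known_spam = [
    "win a free prize now",
    "cheap loans approved instantly",
]

def is_spam(email, threshold=3):
    """Flag the email if it shares at least `threshold` words with any known spam email."""
    return any(similarity(email, spam) >= threshold for spam in known_spam)

print(is_spam("claim your free prize now"))  # shares "free", "prize", "now" with the first example
print(is_spam("lunch at noon tomorrow"))     # shares no words with any example
```

There is no training step here: all the work happens at prediction time, by comparing the new email against the stored examples.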

#### Model-based learning


Another way to generalize from a set of examples is to build a model of these examples,
then use that model to make predictions. This is called model-based learning.
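
As a sketch of this approach, the code below fits a straight line to a handful of examples with ordinary least squares, then predicts from the fitted parameters alone. The choice of a linear model and the data points are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = theta0 + theta1 * x; returns (theta0, theta1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Hypothetical training examples.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

theta0, theta1 = fit_line(xs, ys)

def predict(x):
    # Only the two learned parameters are needed; the training data is not.
    return theta0 + theta1 * x

print(theta0, theta1)
print(predict(5))
```

Once `theta0` and `theta1` are known, the training set could be thrown away, which is the key practical difference from instance-based learning.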

### Comparison Table

#### Supervised, unsupervised, semisupervised, and Reinforcement Learning

| Criteria | Supervised ML | Unsupervised ML | Reinforcement ML |
|----------|---------------|-----------------|------------------|
| Definition | Learns from labeled data | Trained on unlabeled data without any guidance | Learns by interacting with an environment |
| Type of data | Labeled data | Unlabeled data | No predefined data |
| Type of problems | Regression and classification | Association and clustering | Sequential decision-making (balancing exploration and exploitation) |
| Supervision | Requires supervision (labels) | No supervision | No supervision (learns from reward signals) |
| Algorithms | Linear Regression, Logistic Regression, SVM, k-NN, etc. | K-Means, C-Means, Apriori | Q-Learning, SARSA |
| Aim | Predict outcomes | Discover underlying patterns | Learn a sequence of actions |
| Application | Risk evaluation, sales forecasting | Recommendation systems, anomaly detection | Self-driving cars, gaming, healthcare |

#### Online versus batch learning

| Criteria | Online machine learning | Batch machine learning |
|----------|-------------------------|------------------------|
| Complexity | More complex because the model keeps evolving over time as more data becomes available. | Less complex because the model is periodically fed more consistent datasets. |
| Computational power | More computational power is required because the continuous feed of data leads to continuous refinement. | Less computational power is needed because data is delivered in batches; the model isn’t continuously refining itself. |
| Use in production | Harder to implement and control because the production model changes in real time according to its data feed. | Easier to implement because offline learning gives engineers more time to perfect the model before deployment. |
| Applications | Used where data patterns change constantly (e.g., weather prediction tools). | Used where data patterns remain stable, without sudden concept drift (e.g., image classification). |

#### Instance-based versus model-based learning

| Model-based learning | Instance-based learning |
|----------------------|-------------------------|
| Prepare the data for model training | Prepare the data in the same way; no difference here |
| Train a model on the training data to estimate its parameters, i.e., discover patterns | Do not train a model; pattern discovery is postponed until a scoring query is received |
| Store the model in a suitable form | There is no model to store |
| Generalize into a model before any scoring instance is seen | No generalization before scoring; generalize for each scoring instance individually, as and when it is seen |
| Predict for an unseen scoring instance using the model | Predict for an unseen scoring instance using the training data directly |
| Training data can be thrown away after model training | Training data must be kept, since each query uses part or all of the training observations |
| Requires a known model form | May not have an explicit model form |
| Storing the model generally requires less storage | Storing the training data generally requires more storage |
| Scoring a new instance is generally fast | Scoring a new instance may be slow |

### Main Challenges of Machine Learning

#### Insufficient Quantity of Training Data

* For a toddler to learn what an apple is, all it takes is for you to point to an apple and
say “apple” (possibly repeating this procedure a few times). Now the child is able to
recognize apples in all sorts of colors and shapes. Genius.

* Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning
algorithms to work properly.



#### Nonrepresentative Training Data

Let's say we want to train a machine learning model to classify whether an email is spam or not. We gather a dataset of 1,000 emails, out of which 900 are non-spam and 100 are spam. However, the dataset is nonrepresentative because the 100 spam emails are all from a particular spam campaign promoting a certain product, while the non-spam emails cover a wide range of topics.

Note :

Sampling Bias is a great example of this.

#### Poor-Quality Data

* If some instances are missing a few features (e.g., 5% of your customers did not
specify their age), you must decide whether you want to ignore this attribute altogether,
ignore these instances, fill in the missing values (e.g., with the median
age), or train one model with the feature and one model without it, and so on.
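
Of the options listed above, filling in missing values with the median is easy to sketch with the standard library. The customer records and the helper name `fill_missing_with_median` are hypothetical.

```python
from statistics import median

def fill_missing_with_median(rows, feature):
    """Return copies of rows where a missing (None) feature is replaced by the median of known values."""
    known = [row[feature] for row in rows if row[feature] is not None]
    fill_value = median(known)
    return [
        {**row, feature: fill_value if row[feature] is None else row[feature]}
        for row in rows
    ]

# Hypothetical customer records; None marks an unspecified age.
customers = [
    {"id": 1, "age": 25},
    {"id": 2, "age": None},
    {"id": 3, "age": 40},
    {"id": 4, "age": 31},
    {"id": 5, "age": None},
]

cleaned = fill_missing_with_median(customers, "age")
print([row["age"] for row in cleaned])
```

In practice the median should be computed on the training set only and then reused to fill gaps in new data, so that the same value is applied consistently.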


#### Irrelevant Features

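A system can only learn if the training data contains enough relevant features and not too many irrelevant ones. Coming up with a good set of features to train on is called feature engineering, and it involves:

* Feature selection: selecting the most useful features to train on among existing
features.
* Feature extraction: combining existing features to produce a more useful one (as
we saw earlier, dimensionality reduction algorithms can help).
* Creating new features by gathering new data.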

#### Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be
tempted to say that all taxi drivers in that country are thieves.

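* Overgeneralizing like this is something humans do all too often, and unfortunately machines can fall into
the same trap if we are not careful. In Machine Learning this is called overfitting: the model performs well on the training data, but it does not generalize well.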

Note :

Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization.

 Example: a linear model with two parameters, θ₀ and θ₁, gives the learning algorithm two degrees of freedom to adapt the model
to the training data: it can tweak both the height (θ₀) and the slope (θ₁) of the line. If
we forced θ₁ = 0, the algorithm would have only one degree of freedom and would
have a much harder time fitting the data properly.
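
Between leaving θ₁ free and forcing it to 0 lies a middle ground: let the slope change but penalize large values. The sketch below uses one common way of doing this, an L2 (ridge) penalty, which is an assumption of this example rather than something the notes prescribe. For a single feature with an unpenalized intercept, the penalized least-squares slope has the closed form Sxy / (Sxx + λ). The data and the name `fit_ridge_line` are made up.

```python
def fit_ridge_line(xs, ys, lam):
    """Fit y = theta0 + theta1 * x minimizing squared error + lam * theta1**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / (sxx + lam)          # the penalty shrinks the slope toward 0
    theta0 = mean_y - theta1 * mean_x   # the intercept is not penalized
    return theta0, theta1

# Hypothetical noisy training data.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

for lam in (0, 10, 1000):
    theta0, theta1 = fit_ridge_line(xs, ys, lam)
    print(lam, round(theta0, 3), round(theta1, 3))
```

With `lam = 0` the fit is unconstrained; as `lam` grows, the slope is squeezed toward 0 and the model approaches the one-degree-of-freedom case described above. The amount of regularization is itself a hyperparameter.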

#### Underfitting the Training Data

Underfitting is the opposite of overfitting: it occurs when your
model is too simple to learn the underlying structure of the data.

The main options to fix this problem are:

* Selecting a more powerful model, with more parameters
* Feeding better features to the learning algorithm (feature engineering)
* Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)

#### Stepping Back

Let's say you are working on a machine learning project to predict housing prices based on various features such as size, location, and number of bedrooms. You initially choose a complex deep learning model that requires a large amount of labeled data to train effectively. However, you soon realize that obtaining a sufficient amount of labeled housing data is difficult and time-consuming.

* In this scenario, you decide to step back and reevaluate your approach. Instead of using a complex deep learning model, you decide to explore other machine learning algorithms that can work well with smaller datasets.

#### Testing and Validating

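The only way to know how well a model will generalize to new cases is to actually try it out on new cases. Rather than putting the model straight into production and watching how it performs, a better option is to split your data into two sets: the training set and the test set. As
these names imply, you train your model using the training set, and you test it using
the test set.

* The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimate of this
error.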
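
The split-then-evaluate procedure can be sketched as follows. The 80/20 ratio, the toy dataset, and the trivial threshold "model" are assumptions chosen to keep the example short; the point is only that the error is measured on instances the model never saw during training.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and return (training set, test set)."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predict, labeled_instances):
    """Fraction of instances whose predicted label differs from the true label."""
    wrong = sum(1 for x, label in labeled_instances if predict(x) != label)
    return wrong / len(labeled_instances)

# Toy dataset: the label is True when x is at least 50.
dataset = [(x, x >= 50) for x in range(100)]
train_set, test_set = train_test_split(dataset)

# "Training": place a threshold halfway between the two classes, using training data only.
largest_negative = max(x for x, label in train_set if not label)
smallest_positive = min(x for x, label in train_set if label)
threshold = (largest_negative + smallest_positive) / 2

def predict(x):
    return x >= threshold

print(len(train_set), len(test_set))
print(error_rate(predict, test_set))  # estimate of the generalization error
```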

#### Hyperparameter Tuning and Model Selection

Let's say you are working on a classification problem where you want to predict whether an online transaction is fraudulent or legitimate. You decide to use a random forest algorithm for this task. However, the random forest algorithm has several hyperparameters that need to be tuned for optimal performance.

* First, you split your labeled dataset into training and validation sets. You then train multiple random forest models with different combinations of hyperparameters. For example, you vary the number of trees in the forest, the maximum depth of each tree, and the minimum number of samples required to split a node.

* After training these models, you evaluate their performance using the validation set. You calculate metrics such as accuracy, precision, recall, and F1 score to assess how well each model is performing. Based on the results, you compare the performance of different models and select the one that achieves the highest accuracy or the best balance between different evaluation metrics.
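
The tuning loop described above can be sketched in plain Python. To keep the example dependency-free, it swaps the random forest for a small k-Nearest Neighbors classifier and tunes a single hyperparameter, `k`; the synthetic "transactions" (two features drawn from two overlapping Gaussian clouds) and the candidate values are invented for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training instances nearest to x."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, evaluation, k):
    """Fraction of evaluation instances classified correctly with the given k."""
    correct = sum(1 for x, label in evaluation if knn_predict(train, x, k) == label)
    return correct / len(evaluation)

def make_data(n, seed):
    """Synthetic stand-in for labeled transactions (hypothetical features)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        fraud = rng.random() < 0.5
        center = 2.0 if fraud else 0.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, "fraud" if fraud else "legit"))
    return data

data = make_data(200, seed=0)
train_set, validation_set = data[:150], data[150:]

# Train one model per candidate value and score each on the validation set.
candidates = [1, 3, 5, 9, 15]
scores = {k: accuracy(train_set, validation_set, k) for k in candidates}
best_k = max(scores, key=scores.get)

print(scores)
print(best_k)
```

Because the validation set was used to pick `best_k`, its score is an optimistic estimate; a separate test set that played no part in the selection gives a fairer measure of the final model.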

#### Data Mismatch


Let's say you are working on a machine learning project to predict customer churn for a subscription-based service. You gather a dataset of customer information, including features such as age, subscription length, and activity level. However, you realize that the dataset you have collected is not representative of the current customer base.

* Upon further investigation, you find that the data was collected from a specific region during a limited time period when the service was relatively new. Since then, the service has expanded to different regions and attracted a more diverse customer base.

* In this case, data mismatch occurs because the collected dataset does not accurately reflect the characteristics and behavior of the current customer base. The customers from different regions and with varying usage patterns may exhibit different churn behaviors, making the model trained on the mismatched data less effective in predicting churn accurately.