## 3.1 Recognising Patterns

In feature analysis, we are said to recognise an object by
considering its constituent parts, or features, and then
assembling them to determine what the object is. For
example, we know that a cat is a small furry animal with
triangular ears, long whiskers and playful claws. When we
see a cat we recognise it for what it is because it satisfies
these (admittedly simplified) rules: if we know what a cat
looks like, we can recognise other cats.
The field of pattern recognition is thus interested in the
systematic detection of regularities in a dataset, based on
the use of algorithms.

## 3.2 Artificial Intelligence and Machine Learning

Machine learning can be seen as a subfield of artificial
intelligence. It is interested in studying the methods that
can be used to improve the performance of an intelligent
agent over time, based on stimuli from the environment.

## 3.4 Learning, Predicting and Classifying

The implementation of machine learning algorithms
involves analysing data that can be used to improve (train)
the agent (model), and subsequently using the results to
make predictions about quantities of interest or to make
decisions in the face of uncertainty. It is important to bear
in mind that machine learning is interested in the
regularities or patterns of the data in order to provide
predictive and/or classifying power. This is not necessarily
the same as causality.

Machine learning tasks are traditionally divided into two
camps: predictive or supervised learning, and descriptive
or unsupervised learning.

Supervised learning can be pictured as a teacher-pupil
situation. A teacher that knows what a cat looks like will
present the pupil with several training images of cats and
other animals, and the pupil is expected to use the features
or attributes of the images presented to learn what a cat
looks like. The teacher will have labelled each of the images
as being of a cat or not. In the testing part, the teacher will
present images of various kinds of animals, and the pupil is
expected to classify which ones show a friendly feline face.

In machine learning parlance we talk about supervised
learning when we are interested in learning a mapping
from input to output with the help of labelled sets of
input-output pairs. We can then make predictions based on
the data that we see and thus apply generalisations.
Each input has a number of features that can be represented
in terms of an N-dimensional vector that will help in the
task of learning the label of each of the training examples.
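
To make the idea of labelled input-output pairs concrete, here is a minimal sketch in which each input is a small feature vector and the "pupil" labels a new input by copying the label of the closest training example. The feature names, the numbers and the nearest-neighbour rule are all illustrative assumptions, not something prescribed by the text.

```python
import math

# Hypothetical training set: each input is a 3-dimensional feature vector
# (ear pointiness 0-1, whisker length in cm, weight in kg) with a label.
training_inputs = [
    (0.9, 7.0, 4.0),
    (0.8, 6.5, 5.0),
    (0.2, 3.0, 30.0),
    (0.1, 0.5, 0.03),
]
training_labels = ["cat", "cat", "not cat", "not cat"]


def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    distances = [math.dist(features, example) for example in training_inputs]
    return training_labels[distances.index(min(distances))]


# "Testing part": inputs the pupil has not seen before.
print(predict((0.85, 6.8, 4.5)))   # cat
print(predict((0.15, 2.5, 28.0)))  # not cat
```
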

The second camp is unsupervised learning. In this case,
following our example of the teacher-pupil situation, the
teacher takes a Montessori-style approach and lets the pupil
develop, on her own, a rule about what a cat (or any other
animal of the pupil’s preference) looks like, without
providing any hints or labels to the learner.
From a machine learning point of view, there are no
input-output pairs. Instead, we only have the unlabelled
inputs and their associated N-dimensional feature vectors,
without being told the kind of pattern that we must look
for. It is worth noting that an unsupervised learning task
may enable us to assign labels to those inputs and thus
open the door to the use of predictive or supervised
learning.


## 3.5 Machine Learning and Data Science

Machine learning algorithms are well suited to solving
problems encountered in the data science and analytics
workflow, where we are interested in deriving valuable
insights from data.
The improvement in learning comes from generalising regular
patterns in the training data to be able to say something
about unobserved data points. We should therefore be
careful not to obtain a model that “memorises” the data,
a problem known as overfitting. We can avoid this by
employing techniques such as regularisation and
cross-validation.


## 3.6 Feature Selection

Unprocessed data can thus be thought of as the raw
material that can be filtered and prepared to obtain the
insights desired. We need to be able to think through the
available independent variables or features (ingredients)
that will be included in the model (recipe). In some cases
using the unprocessed, raw data may be suitable. However,
in many cases it is preferable to create new features that
synthesise important signals spread out in the raw data.
This process is known as feature selection, where not only
should we consider the features readily available, but also
the creation and extraction of new features and even the
elimination of some variables.

A common way to create new features is via
mathematical transformations that make the variables
suitable for exploitation by a particular algorithm. For
instance, many algorithms rely on features having a linear
relationship, and finding a transformation that lets a
nonlinear relationship be represented as linear in a
different feature space is definitely worth considering.
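
As a small illustration of such a transformation, the sketch below uses made-up data that follow an exponential curve and shows that taking the logarithm of the target turns the relationship into a straight line. The data, the constants and the choice of a log transform are assumptions made purely for the example.

```python
import numpy as np

# Made-up data following y = 2 * exp(0.5 * x): clearly nonlinear in x.
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * np.exp(0.5 * x)

# New feature: log(y) = log(2) + 0.5 * x, which is linear in x.
log_y = np.log(y)

# The Pearson correlation measures how linear each relationship is.
corr_raw = float(np.corrcoef(x, y)[0, 1])
corr_transformed = float(np.corrcoef(x, log_y)[0, 1])

print(f"correlation of x with y:      {corr_raw:.3f}")
print(f"correlation of x with log(y): {corr_transformed:.3f}")
```

After the transformation the correlation is 1 up to floating-point error, whereas the raw variables are noticeably further from a linear relationship.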

## 3.7 Bias, Variance and Regularisation: A Balancing Act

Machine learning algorithms enable us to exploit the
regularities in the data. Our task is therefore to generalise
those regularities and apply them to new data points that
have not been observed. This is called generalisation, and
we are interested in minimising the so-called generalisation
error, i.e. a measure of how well our model performs
against unseen data.

If we were able to create an algorithm that recalls the exact
noise in the training data, we would be able to bring our
training error down to zero. That sounds great, and we
would be very happy until we received a new batch of data
to test our model. It is quite likely that the performance of
the model would not be as good as a zero training error
would have us believe. We have ended up with an overfit
model: we describe the noise in our data instead of
uncovering a relationship, given the variance in our data.

The key is to maintain a balance between the propensity of
our model to learn the wrong thing, i.e. the bias, and the
sensitivity to small fluctuations in the data, i.e. the variance.
In the ideal case we are interested in obtaining a model that
encapsulates patterns in the training data and that at the
same time generalises well to data not yet observed. As you
can imagine, the tension between both tasks means that we
cannot do both equally well, and a trade-off must be found
in order to represent the training data well (low bias)
without risking overfitting (high variance).

High-bias models are typically simpler models that do not
overfit; in those cases the danger is that of underfitting.
Models with low bias are typically more complex, and that
complexity enables us to represent the training data more
accurately. The danger here is that the flexibility provided
by higher complexity may end up representing not only a
relationship in the data but also the noise. Another way of
portraying the bias-variance trade-off is in terms of
complexity versus simplicity.
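
The trade-off between simple and complex models can be sketched by fitting polynomials of increasing degree to a few noisy points. The sine curve, the noise level, the sample sizes and the chosen degrees are all assumptions for illustration; the training error can only go down as the degree grows, while the error on fresh data need not follow.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)


def noisy_samples(n_points):
    """Draw points from an assumed 'true' relationship plus Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n_points)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n_points)
    return x, y


x_train, y_train = noisy_samples(15)   # data used to fit the model
x_test, y_test = noisy_samples(200)    # data the model has not seen


def mean_squared_error(model, x, y):
    return float(np.mean((model(x) - y) ** 2))


train_errors, test_errors = {}, {}
for degree in (1, 3, 10):  # from a simple model to a very flexible one
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_errors[degree] = mean_squared_error(model, x_train, y_train)
    test_errors[degree] = mean_squared_error(model, x_test, y_test)
    print(f"degree {degree:2d}: train MSE = {train_errors[degree]:.4f}, "
          f"test MSE = {test_errors[degree]:.4f}")
```

Comparing the two columns of the printout is a quick way to see whether the extra flexibility is capturing the relationship or just the noise.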

The tension between bias and variance, simplicity and
complexity, or underfitting and overfitting is an area in the
data science and analytics process that can be closer to a
craft than a fixed rule. The main challenge is that not only is
each dataset different, but also there are data points that we
have not yet seen at the moment of constructing the model.
Instead, we are interested in building a strategy that enables
us to say something about unseen data from the sample used
in building the model.

In order to prevent overfitting it is possible to introduce
ways to penalise our models for complexity by adding extra
constraints such as smoothness, or requiring bounds on the
norm of the vector space we are working on. This process is
known as regularisation, and the effect of the penalty
introduced can be adjusted with the use of the so-called
regularisation hyperparameter, λ. Regularisation can then
be employed to fine-tune the complexity of the model in
question. Some typical penalty methods introduced for
regularisation are the L1 and L2 norms that we will discuss
in the following section. In Section 3.12 we will touch upon
how the hyperparameter λ can be tuned with the use of
cross-validation.
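
A minimal sketch of a complexity penalty, assuming a linear model with an L2 penalty on its weights (often called ridge regression, a name the text does not use). The synthetic data and the function name `ridge_weights` are hypothetical; the point is only that a larger λ shrinks the weight vector.

```python
import numpy as np


def ridge_weights(X, y, lam):
    """Minimise ||y - Xw||^2 + lam * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Synthetic data, assumed only for the sake of the example.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.5, size=20)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_weights(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:6.1f}   L2 norm of weights = {norm:.3f}")
```

With λ = 0 the penalty vanishes and we recover ordinary least squares; increasing λ trades some fit to the training data for a smaller, simpler set of weights.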


## 3.8 Some Useful Measures: Distance and Similarity

Once we have built a set of models based on the training
data we have, it is important to distinguish a
well-performing model from a less good one. So, how do we
ascertain that a model is good enough for our purposes?
The answer is that we need to evaluate the models with the
aid of a scoring or objective function. The
performance of a model will therefore depend on various
factors such as the distribution of classes, the cost of
misclassification, the size of the dataset, the sampling
methods used to obtain the data, or even the range of values
in the selected features.

In general, model evaluation can be posed as a constrained
optimisation problem given an objective function. The aim
can then be presented as the problem of finding a set of
parameters that minimises that objective function. This is a
very useful way to tackle the problem, as the evaluation
measure can be included as part of the objective function
itself.

For example, consider the case where we are interested in
finding the line of best fit given a number of data points: a
perfect fit would be found in the case where the data points
align flawlessly in a straight line. We can evaluate how
well a line fits the data by taking into account the
difference between the location of a point and its
corresponding prediction as obtained from the model. If we
minimise that distance then we can evaluate and compare
various calculated predictions. This particular evaluation
measure used in regression analysis is known as the sum of
squared residuals (SSR); minimising it is a typical approach
in regression, and we will discuss it in more detail in
Chapter 4.
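
The line-fitting example can be sketched as follows. The five data points and the alternative line are made up for illustration, and `np.polyfit` is used here simply as one convenient way of obtaining the least-squares line.

```python
import numpy as np


def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared differences between the observations and the line."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))


# Made-up observations that lie close to, but not exactly on, a line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Least-squares line: polyfit returns the coefficients, highest degree first.
slope, intercept = np.polyfit(x, y, deg=1)

ssr_best = sum_squared_residuals(x, y, slope, intercept)
ssr_other = sum_squared_residuals(x, y, 1.5, 2.0)  # an arbitrary competing line

print(f"SSR of the fitted line:    {ssr_best:.4f}")
print(f"SSR of an arbitrary line:  {ssr_other:.4f}")
```

A perfect fit, where every point sits on the line, would give an SSR of exactly zero.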

As we can see, the concept of distance arises naturally as
a way to express the evaluation problem, and indeed a
number of conventional evaluation procedures rely on
measures of distance. Consider the points A and B in a
two-dimensional space shown in Figure 3.1. Point A has
coordinates p(p1, p2) and point B has coordinates q(q1, q2).
We are interested in calculating the distance between these
two points. This can be achieved in different ways and we
are familiar with some of these, such as the Euclidean and
the Manhattan distances.

Euclidean distance: This corresponds to the ordinary
distance calculated using the straight line that joins
points A and B; in two dimensions it corresponds to the
distance given by the Pythagorean theorem. Given the
coordinates of each of the two points in question we can
obtain the distance between A and B as:

Euclidean Distance

![alt text](images/euclidean_distance.png "Title")

![alt text](images/euclidean_distance_2.png "Title")

Manhattan distance: It is easy to see why this distance
measure gets this name if we think of the distance that a
yellow cab would cover while travelling along the streets in Manhattan. Apart from Broadway, the cab cannot
move diagonally in the street-avenue grid. Instead, it can
only move North-South and East-West. In the case of
points A and B in Figure 3.1, the Manhattan distance is

Manhattan Distance

![alt text](images/manhattan_distance_1.png "Title")

![alt text](images/manhattan_distance_2.png "Title")
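
Both distances can be computed in a few lines. This is a sketch using the standard definitions (square root of the summed squared differences, and sum of absolute differences); the coordinates chosen for A and B are made up and are not those of Figure 3.1.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def manhattan_distance(p, q):
    """Distance travelled when moving only along the coordinate axes."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


point_a = (1.0, 2.0)  # hypothetical coordinates for A
point_b = (4.0, 6.0)  # hypothetical coordinates for B

print(euclidean_distance(point_a, point_b))  # 5.0
print(manhattan_distance(point_a, point_b))  # 7.0
```

The Manhattan distance is never smaller than the Euclidean one, which matches the intuition that the cab cannot cut corners.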

If the distance is zero we can argue that the
two points are effectively the same one, or at the very least
similar to one another. This idea of similarity is therefore another useful tool in the development of evaluation
measures, particularly in the case where features are not
inherently amenable to being placed in a geometric space.

Cosine similarity: This similarity measure is commonly
used in text mining tasks, for example. In these cases the
words in the documents that comprise the corpora to be
mined correspond to our data features. The features can
be arranged into vectors and our task is to determine if
any two documents are similar or not. Cosine similarity
is based on the calculation of the dot product of the
feature vectors. It is effectively a measure of the angle θ
between the vectors: If θ = 0, then cos θ is 1 and the two
vectors are said to be similar. For any other value of θ the
cosine similarity will be less than 1. The cosine similarity
of vectors v1 and v2 is given by:

![alt text](images/cosine_similarity.png "Title")
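
A minimal sketch of the calculation, using the standard definition (dot product divided by the product of the vector lengths). The word-count vectors are hypothetical, and the function assumes that neither vector is all zeros.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm_1 = math.sqrt(sum(a * a for a in v1))
    norm_2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm_1 * norm_2)


# Hypothetical word counts for three documents over a 4-word vocabulary.
doc_1 = [2, 1, 0, 1]
doc_2 = [4, 2, 0, 2]  # same proportions as doc_1, just a longer document
doc_3 = [0, 0, 3, 0]  # shares no words with doc_1

print(cosine_similarity(doc_1, doc_2))  # essentially 1: same direction
print(cosine_similarity(doc_1, doc_3))  # 0.0: no words in common
```

Note that the measure depends only on the direction of the vectors, so a document and a longer one with the same word proportions are treated as similar.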

Jaccard similarity: The Jaccard similarity measure provides
us with a way to compare unordered collections
of objects, i.e. sets. We define the Jaccard similarity in
terms of the elements that are common to the sets in
question. Consider two sets A and B with cardinalities
|A| and |B|. The common elements of both sets are given
by the intersection A ∩ B. In order to give us an idea of the
relative size of the intersection compared to the sets, we
divide its size by the size of the union of the sets. This can be
expressed as follows:

![alt text](images/jaccard_similarity.png "Title")

In the case of document similarity for example, two
identical documents will have a Jaccard similarity of 1 and those completely dissimilar a value of 0.
Intermediate values correspond to various degrees of
similarity.
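
With Python sets the calculation is almost a direct transcription of the definition. The example documents are made up, and treating two empty sets as identical is an assumption made here to avoid dividing by zero.

```python
def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets are considered identical
    return len(set_a & set_b) / len(union)


# Hypothetical documents represented as sets of words.
doc_1 = {"the", "cat", "sat"}
doc_2 = {"the", "cat", "ran"}
doc_3 = {"a", "dog", "barked"}

print(jaccard_similarity(doc_1, doc_1))  # 1.0  (identical)
print(jaccard_similarity(doc_1, doc_2))  # 0.5  (2 shared words out of 4)
print(jaccard_similarity(doc_1, doc_3))  # 0.0  (nothing in common)
```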

## 3.9 Beware the Curse of Dimensionality

We have been referring to data features as an integral
part of the ingredients we will use with our machine
learning algorithms. For a single feature we have a
one-dimensional space, while two features can be
represented in two dimensions. It follows that as we
increase the number of features, the number of dimensions
that our model must include increases too. Not only that,
but we also increase the amount of information required to
describe the data instances, and therefore the model.

The realisation that the number of data points required to
sample a space grows exponentially with the dimensionality
of the space is usually called the curse of dimensionality.
The curse of dimensionality becomes more apparent in
instances where we have datasets with a number of features
much larger than the number of data points. We can see
why this is the case when we consider the calculation of the
distance between data points in spaces with increasing
dimensionality.

Avoiding the curse of dimensionality can be done by
increasing the amount of data, but even before going down
that route it is worth considering if the features used are
indeed a suitable collection. In that respect, apart from a
careful feature selection process, we can also reduce the
dimensionality of the problem by transforming the data
from a higher-dimensional space into a space with fewer
dimensions, as is the case with Principal Component
Analysis (PCA). We will discuss this technique in Chapter 8.
As for avoiding overfitting, in Section 3.12 we will discuss
the ideas behind cross-validation. But first we need to make
a stop to talk about Scikit-learn.
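
Before moving on, the remark in this section about distances in spaces of increasing dimensionality can be illustrated with a small experiment. The sketch assumes 100 points drawn uniformly at random from the unit hypercube and compares the nearest and farthest neighbour of one of them; the sample size, seed and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 100


def distance_contrast(n_dims):
    """Nearest and farthest Euclidean distance from the first point to the rest."""
    points = rng.uniform(size=(n_points, n_dims))
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return float(distances.min()), float(distances.max())


ratios = {}
for n_dims in (2, 10, 100, 1000):
    nearest, farthest = distance_contrast(n_dims)
    ratios[n_dims] = nearest / farthest
    print(f"{n_dims:5d} dimensions: nearest/farthest = {ratios[n_dims]:.3f}")
```

As the number of dimensions grows with a fixed number of points, the ratio creeps towards 1: every point is roughly as far away as every other, so distance-based notions of "close" carry less and less information.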

In [6]:
# Playing with scikit-learn

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()

In [6]:
iris.data.shape

(150, 4)

In [7]:
iris.data[0:6,0:4]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4]])

In [8]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


## 3.11 Training and Testing

We carry out our modelling and the result can be used to
classify any new iris flower we encounter based on the
4 measurements (features) used. However, how do we
know how well (or how badly) our model performs? We
would have to wait until we get new data not seen by the
model. What is more, we must remember that we build a
model because we are interested in using it effectively. This
means that we should care about its performance with new
unseen data, and one way to assess this is with error rates.
One way to tackle this problem is to prepare two
independent datasets from the original one:

Training set: This is the data that the model will see and
it is used to determine the parameters of the model.

Testing set: We can think of this as “new” data. The model
has not encountered it yet and it will enable us to measure
the performance of the model built with the training set.

In some cases, instead of partitioning the data into two sets,
it is divided into three. The third component is called the
validation set and it is used for tuning the model. All three
parts need to be representative of the data that will be used
with the model.

In [9]:
from sklearn import model_selection
# Hold out 20% of the iris data for testing; random_state fixes the shuffle
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
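
To see how the two sets are used, the sketch below repeats the split, fits a classifier on the training set only, and reports the error rate on the testing set. The choice of a k-nearest-neighbours classifier with 3 neighbours is an assumption for illustration; the text has not yet committed to any particular model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Assumed model: any scikit-learn classifier could be used here instead.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, Y_train)          # the model only ever sees the training set

accuracy = model.score(X_test, Y_test)   # fraction of test flowers classified correctly
error_rate = 1.0 - accuracy

print(f"training examples: {X_train.shape[0]}, testing examples: {X_test.shape[0]}")
print(f"error rate on the testing set: {error_rate:.3f}")
```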

## 3.12 Cross-Validation

Since we are interested in making accurate and useful
predictions, we need to ensure that any models we create
generalise well to unseen data. The parameters that we
obtain with the use of a single training dataset may end up
reflecting the particular way in which the data split was
performed. The solution to this problem is straightforward:
we can use statistical sampling to get more accurate
measurements. This process is usually referred to as
cross-validation. Cross-validation improves statistical
efficiency by repeatedly splitting the data into training and
validation sets, and re-running model training and
evaluation every time. The aim of cross-validation is to use
a dataset to validate the model during the training phase.

A common cross-validation technique is the k-fold
procedure: the original data is divided into k subsets of
equal size. From the k subsets, a single partition is kept for
validating the model, and the remaining k-1 subsets are used
for training. The process is then repeated k times, using each
of the k subsets in turn for validation. We will therefore
have a total of k trained models. The results of each of the
folds can be combined, for instance by averaging, to obtain a
single estimate of the out-of-sample error.

Cross-validation is a useful and straightforward way to get
a more accurate estimate of the out-of-sample error, and
at the same time makes more efficient use of data than a
single training/testing split. This is because each record in
the dataset is used for both training and validation.
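
The k-fold procedure described above can be sketched with scikit-learn's `KFold` splitter. The value k = 5, the shuffling seed and the k-nearest-neighbours classifier are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

k = 5
folds = KFold(n_splits=k, shuffle=True, random_state=0)

fold_errors = []
for train_index, validation_index in folds.split(X):
    # Train a fresh model on k-1 subsets ...
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_index], y[train_index])
    # ... and validate it on the subset that was held out.
    accuracy = model.score(X[validation_index], y[validation_index])
    fold_errors.append(1.0 - accuracy)

# Combine the k results by averaging to estimate the out-of-sample error.
estimated_error = sum(fold_errors) / k
print(f"error per fold: {[round(e, 3) for e in fold_errors]}")
print(f"estimated out-of-sample error: {estimated_error:.3f}")
```

Every flower ends up in a validation subset exactly once and in a training subset k-1 times, which is where the more efficient use of data comes from.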

Cross-validation can also be useful in feature and model
selection procedures. For example, it can be used for tuning
the regularisation parameter λ introduced in Section 3.7: We
split our training data into subsets and train a model for a
fixed value of λ. We can then test it on the remaining
subsets and repeat this procedure while varying λ. Finally,
we select the λ that minimises our measure of error.
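
A sketch of this tuning loop, assuming a ridge regression model (where scikit-learn calls the regularisation hyperparameter `alpha` rather than λ), a synthetic regression dataset, and a small hand-picked grid of candidate values. None of these choices come from the text.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data with many features relative to the number of samples.
X, y = make_regression(n_samples=60, n_features=30, noise=10.0, random_state=0)

candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {}
for lam in candidate_lambdas:
    model = Ridge(alpha=lam)  # alpha plays the role of lambda
    # 5-fold cross-validation; the scorer returns negated mean squared errors.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    cv_errors[lam] = -scores.mean()

# Keep the value with the smallest cross-validated error.
best_lambda = min(cv_errors, key=cv_errors.get)
for lam, err in cv_errors.items():
    print(f"lambda = {lam:7.2f}   mean CV error = {err:10.2f}")
print(f"selected lambda: {best_lambda}")
```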

In [15]:
X_test.shape

(30, 4)