# AI, Machine Learning, and Deep Learning

Artificial Intelligence (AI), a broad technical term generally agreed to be first discussed _in concept_ by [Alan Turing's excellent "Imitation Game" paper in 1950](https://watermark.silverchair.com/lix-236-433.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAqwwggKoBgkqhkiG9w0BBwagggKZMIIClQIBADCCAo4GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMVWWgpsCtEoCPpAH6AgEQgIICXwN8ibRzPvazu4YkOlMZpU9IQu68dd6TzcVh1OKSNO9WZMqUXy6ZtavbAc4-wyUhIo6FfM4cR2RFflVL7vnXK5k12SV372kHP5QbLwVMH9LI5lheHcTPNaNzEYJ4tq5smjYYvmYOs5Cf1topdvP4rmoRymRujjy00HvO6A2WtV5iOAcOI3FISuG4Hp3Cf-_WKdor3SJgMNSd6x_ZZnqk3WiYx2uyL7D6n96nCG7BsdpGSc7F27Z-YwGben1q0NEYxa8QK5B1rwi_n4Q31usLo0V3TGTAb9gANpynV6Y0sg23QM9z4aFPnW1kPazK4lTR4KikRP112subUBix3M1dMABtLAE0ts105m5QUCv3VDXvPVwl2ePKB8PKyiAUvuoW65Yoif5QtjONthEZMcjenO6gi3YrFFDgNw1KqoXvxjU_v0Ii1ujQLAacpBUvA45eg8AD5KbE4I8SRd8zHIQud67sYahj4jwNBKPfrDOL_SwyNAAQM7Kdpaot8xJ0Km8-LPhjZ2rLXMnjEAIsruDOfRjTKzkPT3Uqz2imynq9XxgS9YLJ2y77mHRwWfsF8WH7_hCdSbXzW4Vr5_ZXLWOQcdugZCM1DHefxAhMthjBMuYd-50xgPKdDbGEhAZTSKxPjS8vbayJWg465J-VfE4LGpt3iJ978XYBy7v_KtCfy5m_FK18DLiw8AXiZkKKNpx074Kipg9pwr_hXX3Llz5CG_JyNQ0lYqyxAbAOt5-bbpgepHCYe_cQ0Uy4Z-oQTFVTKvoLKYe6Gyc8p8yJLhj9g1GA7_TP-KGL6CGaWyovEgw), but first explicitly coined in 1956, has taken on a life of its own in the public imagination. In recent years, the apocalyptic connotations bestowed on it by films like _The Terminator_ and _2001: A Space Odyssey_ has begun to be supplanted by its liberal use in [banal](https://www.forbes.com/sites/bernardmarr/2018/05/25/stitch-fix-the-amazing-use-case-of-using-artificial-intelligence-in-fashion-retail/#53bc5c3f3292) [branding](https://www.cmo.com.au/article/669073/how-thirdlove-used-ai-better-personalisation/) [exercises](https://www.forbes.com/sites/parmyolson/2019/03/04/nearly-half-of-all-ai-startups-are-cashing-in-on-hype/#d85ee88d0221) that stand in stark contrast to Elon Musk's [dire warnings about AI domination](https://www.vanityfair.com/news/2017/03/elon-musk-billion-dollar-crusade-to-stop-ai-space-x), leaving most people wondering what it means when a company says they're "using AI."

Of course, the broad use of this term stems in no small part from its (several) broad definitions that play with both the concepts of "artificial" and "intelligence". Recently, it has become common to refer more specifically to "machine learning" (ML), a term referring to a specific "flavor" of AI that manages to avoid the loaded connotations of "intelligence" and is generally more precise, as it is the most common way in which AI is practiced today. Unfortunately, it too has begun to inherit much of the vagueness and hype of AI, especially as it becomes fashionable to let people know both that you're "doing" AI, but are also not one of the hucksters that uses the term "AI".

At this point, volumes have been written about the merits of the "intelligence" displayed by modern machine learning (and specifically deep learning (DL), a sub-category of the sub-category of ML) systems like GPT3 and AlphaZero which are, intelligence aside, astonishing achievements of engineering that deserve praise and careful consideration by both industry professionals and the public. And while I find these conversations fascinating and enlightening, I'm not really interested in writing about it here, both because I consider myself far from an expert and because I feel like sharing code and graphs and other things that those discussions wouldn't necessarily entail (at least not if I want to run it on my laptop, which I do).

Instead, I think it's worth trying to clear up some of the confusion over what these terms *mean* and how these definitions emerge as practical implementations. In this way, we can exercise more skepticism when we hear these terms used popular culture and learn to separate the bland from the exciting from the horrifying.

So let's start with the biggest tent we can: artificial intelligence. What does it mean for a machine to be intelligent? Francois Chollet, creator the deep learning library Keras, recently wrote a [fascinating paper](https://arxiv.org/abs/1911.01547) on the subject, distinguishing "skilled" systems from "intelligent" ones. Skilled systems can perform a constrained and narrowly defined task in some specific domain, whereas intelligent systems (like ourselves) can perform a broad range of tasks across a variety of domains and, importantly, can quickly pick up _new_ skills, often need only a handful of attempts and _without sacrificing skill on other tasks_.

More often than not, the "AI" systems in production today fall under the first category. And while this may seem disappointing in comparison to a truly intelligent system, the reality is that many of these narrow and constrained tasks are still extremely useful! Spam email filters may not be able to beat you in chess, but they make your inbox much less cluttered and risky. Object detectors on car backup cameras may not even be useful when fed images from a different camera, but they still keep kids in your driveway safer. In this sense, we can think of successful AI systems as the combination of intelligent engineering (done by us) of skilled artificial subsystems (whatever algorithm is flagging spam or detecting objects).

With this constrained definition of AI in mind, let's implement a few differnt AI systems on such a simple task and start to get an idea of what makes ML a subset of AI, and what distinguishes the two. The task we'll pick is a common toy problem in the AI/ML literature of classifying iris species based on measurements of their sepal and petal lengths and widths. Mathematically, we want to build a function that can map from these four _continuous_ measurements to a single _categorical_ value. This function is the "skill" in question.

Before we begin, it's worth asking the question: why would I want to do this? Well, the reality is that _you_ probably don't. You probably get along just fine day-to-day without knowing what types of irises you're looking at (or even measuring them for that matter). But the assumption is that if you're reading this, you have some problem that can be solved with a skill *like* this one, and the techniques we'll leverage will be as useful in solving your task. As for why we would want a machine to solve this problem (especially one that we'll see is reasonably simple), the reality is that even slightly more complicated skills, and especially ones that involve more measurements (or **features**), grow exponentially in their difficulty for humans to solve. Moreover, perhaps your problem involves a *volume* of data that isn't reasonable to expect someone (or many thousands of someones) to sit there and practice their skill on all day. Luckily, computers can manipulate numbers *really fast*, and so are well suited to these sorts of problems.

Ok, enough primer, let's take a look at what measurements we have to go off of:

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

dataset = load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['class'] = dataset.target_names[dataset.target]
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


So how can we use this information to get our computer to learn a new skill? Well, just staring at these numbers doesn't give us a lot to go off. What if we looked at which measurement values are most common for a given class, and which aren't? Then we could figure out a rule that looks like "sepal lengths between $x$ and $y$ are most commonly _setosas_. So if you're in that range, then I'm going to say you're a _setosa_". If we can come up with a few good rules and encode them into our computer, then we should be able to feed our computer a *new* set of measurements and have it return the "right" class!

Would such a system be "intelligent", in the sense that you and I are intelligent? Of course not. But if you're a botanist collecting thousands of samples of petal and sepal measurements that you need to classify, that skill would be immensely useful. Let's take a look at what I mean:

In [2]:
from bokeh.io import output_notebook, show
from bokeh.layouts import column, row
from bokeh.models import LegendItem, Legend
import plot_utils
output_notebook()

# plot distributions for each feature
# make the last one a little wider to accommodate
# a legend that all four can use
figures = []
for n, feature_name in enumerate(dataset.feature_names):
    fig_kwargs = {
        "title": feature_name.strip(" (cm)"),
        "x_axis_label": feature_name.split(maxsplit=1)[1],
        "y_axis_label": "Probability",
        "plot_height": 200,
        "plot_width": 520 if n==3 else 400
    }
    p = plot_utils.make_distribution_plot(
        df[[feature_name, 'class']],
        n_bins=10,
        label_column='class',
        fig_kwargs=fig_kwargs
    )
    figures.append(p)

# configure a legend on just the last plot
legend_items = []
for class_name in dataset.target_names:
    r = p.select(name=class_name)
    legend_items.append(LegendItem(label=class_name, renderers=r))
legend = Legend(
    items=legend_items,
    location=(10, 60)
)
p.add_layout(legend, 'right')

grid = column(row(figures[0], figures[1]), row(figures[2], figures[3]))
show(grid)

Ok, these **distributions** are certainly more helpful than the dataframe view above. What can we learn?

Well, it appears that smaller petals tend to correspond to _setosas_. In fact, it looks like there aren't really any _versicolor_ or _viginica_ samples with petal lengths lower than ~2.2 cm, or petal widths lower than ~0.7 cm. So why don't we make that our first rule:

> If your petal length is lower than 2.2 cm, or your petal width lower than 0.7 cm, you will be classified as a _setosa_

Now all we need to do is come up with a rule to distinguish _versicolors_ and _virginicas_. Unfortunately, there seems to be much more overlap between these two. It looks like _virginica_ tends to dominate at petal lengths above 5.1 and petal widths above 1.7 (though there are obviously a finite number of samples that break this rule). Let's make our second rule:

> If your petal length is greater than 5.1 and your petal width is greater than 1.7, you will be classified as _virginica_

So now we'll encode, in code, a function which uses these rules to map from a set of measurements (a **sample**), and returns a class name.

In [3]:
def rules_based_ai_classifier(sample):
    if sample['petal length (cm)'] < 2.2 and sample['petal width (cm)'] < 0.7:
        return 'setosa'
    else:
        if sample['petal width (cm)'] > 1.7 and sample['petal length (cm)'] >= 5.1:
            return 'virginica'
        else:
            return 'versicolor'

So how well does this rule work? Let's run it on our dataset and see how many (or, more usefully, what fraction) it classifies correctly. We can even take a look at the ones we got wrong to see where and why our rule failed.

In [4]:
num_right = 0
wrong_rows, wrong_predictions = [], []
for row_id, x in df.iterrows():
    label = x.pop('class')
    prediction = rules_based_ai_classifier(x)
    if label == prediction:
        num_right += 1
    else:
        wrong_rows.append(row_id)
        wrong_predictions.append(prediction)

accuracy = num_right / len(df)
print('Accuracy: {:0.2f}%'.format(100*accuracy))

wrong_df = df.loc[wrong_rows]
wrong_df["predicted class"] = wrong_predictions
wrong_df

Accuracy: 92.00%


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class,predicted class
106,4.9,2.5,4.5,1.7,virginica,versicolor
113,5.7,2.5,5.0,2.0,virginica,versicolor
119,6.0,2.2,5.0,1.5,virginica,versicolor
121,5.6,2.8,4.9,2.0,virginica,versicolor
123,6.3,2.7,4.9,1.8,virginica,versicolor
126,6.2,2.8,4.8,1.8,virginica,versicolor
127,6.1,3.0,4.9,1.8,virginica,versicolor
129,7.2,3.0,5.8,1.6,virginica,versicolor
133,6.3,2.8,5.1,1.5,virginica,versicolor
134,6.1,2.6,5.6,1.4,virginica,versicolor


So with just these simple rules (and not even leveraging the sepal measurements at all), we manage to classify 92% of our samples correctly. Not bad!

Moreover, it looks like the only mistake we made was calling a bunch of _virginicas_ _versicolors_. This should be unsurprising, given the level of overlap between the feature distributions of these classes.

Still, an astute observer of the missed classifications may be bothered by one observation in particular. Take, for example, the error we made in row $106$.

In [5]:
df.loc[106]

sepal length (cm)          4.9
sepal width (cm)           2.5
petal length (cm)          4.5
petal width (cm)           1.7
class                virginica
Name: 106, dtype: object

Looking back at our distributions, we see that _virginica_ has a relatively low probability of having petal measurements in this region, especially in comparison to _versicolor_. The probability isn't _zero_, so maybe it's unsurprising that _one_ of these measurements would come out so low, but what are the odds that _both_ of them do?!

The resolution to this apparent paradox is that these measurements are **correlated**: the higher petal lengths are more likely to correspond to higher petal widths as well. When we looked at the distributions along each feature individually, we lost the ability to see how class probabilities are distributed across the entire **feature space** (in essence, we projected along each **axis** of the feature space). To see what I mean, let's plot these features against one another and note how often an increase in one leads to an increase in the other.

In [6]:
petal_features = [f for f in dataset.feature_names if f.startswith("petal")]
scatter_df = df[petal_features + ["class"]].copy()

fig_kwargs = {}
for feature, axis in zip(petal_features, ["x", "y"]):
    scatter_df[axis] = scatter_df[feature]
    fig_kwargs["{}_axis_label".format(axis)] = feature

p = plot_utils.make_2d_scatter_plot(scatter_df, label_column="class", fig_kwargs=fig_kwargs)
p.legend.location="top_left"
show(p)

This view should hopefully give us a _much_ clearer idea of how these classes are separated along these two feature axes. In particular, it becomes extremely obvious how different _setosa_ is from the other two classes. This might incentivize us to breaking the problem up into 2 subproblems: separating _setosa_ from _versicolor_ and _virginica_ (relatively easy), and separating _versicolor_ from _virginica_ (more difficult). However, there's another, more general observation I would like us to make.

If increases in petal length are frequently accompanied by increases in petal width, why do we need to bother tracking both of these measurements _separately_? Shouldn't we be be able to combine them into one measurement that tells us just about everything we need to know (as far as classifying is concerened) about the petal?

Philosophically, we can imagine that we've chosen to represent the petal of an iris, a physical object in the real world, by projecting it along a **basis** corresponding to two measurements _about_ the petal, which are floating point numbers in the "data world". However, it appears that this basis is not orthogonal: when I move along one axis, I'm also unavoidably moving along the other axis as well. The less orthogonal these axes are, the more of a waste (and, for reasons we'll touch on later, actually a *problem*) it is that I'm representing this object with two numbers instead of one.

So what one number could we use to represent the petal? Well, if higher lengths tend to lead to higher widths, then it sounds like what we're talking about roughly is higher _areas_ (length times width). So why don't we calculate this value for our samples, and see how that combined measurement is distributed.

In [7]:
area_df = pd.DataFrame()
area_df["area"] = df[petal_features[0]] * df[petal_features[1]]
area_df["class"] = df["class"]

fig_kwargs = {
    "title": "Petal Area",
    "x_axis_label": "Value (cm^2)",
    "y_axis_label": "Probability",
    "plot_height": 300,
    "plot_width": 800
}
p = plot_utils.make_distribution_plot(
    area_df,
    n_bins=10,
    label_column="class",
    fig_kwargs=fig_kwargs
)

# configure a legend on just the last plot
legend_items = []
for class_name in dataset.target_names:
    r = p.select(name=class_name)
    legend_items.append(LegendItem(label=class_name, renderers=r))
legend = Legend(
    items=legend_items,
    location=(10, 60)
)
p.add_layout(legend, 'right')
show(p)

So what if we tried to implement the following rules:
> If your petal area is greater than 8.4 $\text{cm}^2$, you're a _virginica_

> If your petal area is less than 2.7 $\text{cm}^2$, you're a _setosa_

> Otherwise, you're a _versicolor_

In [8]:
def area_rules_based_ai_classifier(sample):
    petal_area = sample[petal_features[0]] * sample[petal_features[1]]
    if petal_area > 8.4:
        return "virginica"
    elif petal_area < 2.7:
        return "setosa"
    else:
        return "versicolor"

Does this rule improve our accuracy?

In [9]:
num_right = 0
wrong_rows, wrong_predictions = [], []
for row_id, x in df.iterrows():
    label = x.pop('class')
    prediction = area_rules_based_ai_classifier(x)
    if label == prediction:
        num_right += 1
    else:
        wrong_rows.append(row_id)
        wrong_predictions.append(prediction)

accuracy = num_right / len(df)
print('Accuracy: {:0.2f}%'.format(100*accuracy))

wrong_df = df.loc[wrong_rows]
wrong_df["predicted class"] = wrong_predictions
wrong_df

Accuracy: 96.00%


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class,predicted class
70,5.9,3.2,4.8,1.8,versicolor,virginica
77,6.7,3.0,5.0,1.7,versicolor,virginica
106,4.9,2.5,4.5,1.7,virginica,versicolor
119,6.0,2.2,5.0,1.5,virginica,versicolor
133,6.3,2.8,5.1,1.5,virginica,versicolor
134,6.1,2.6,5.6,1.4,virginica,versicolor


Not a bad lift from what amounts to just a pretty small change in perspective! One interesting to note is that since our rule has the form (mathematically) of:
> If $\text{petal length}*\text{petal width} > a, ...$

for some value $a$, this amount so checking which side of the curve $y=\frac{a}{x}$ we are in the 2D feature space above. Let's plot these curves and see what they look like.

In [10]:
fig_kwargs = {}
for feature, axis in zip(petal_features, ["x", "y"]):
    scatter_df[axis] = scatter_df[feature]
    fig_kwargs["{}_axis_label".format(axis)] = feature

p = plot_utils.make_2d_scatter_plot(scatter_df, label_column="class", fig_kwargs=fig_kwargs)
p.legend.location="top_left"

x = np.linspace(scatter_df["x"].min(), scatter_df["x"].max(), 100)
for threshold in [2.7, 8.4]:
    y = threshold / x
    mask = y < 1.01*scatter_df["y"].max()
    p.line(x[mask], y[mask], line_color="#000000", line_alpha=0.9, line_dash="2 2")
show(p)

What I like about this view is that it shows us very clearly which samples we misclassified and *why* we misclassified them. More importantly, I think it illustrates the cross-section of what we're doing here both mathematically and philosophically.

For this particular classification rule, we decided that this 2D **slice** of the full 4D feature space (both petal and sepal measurements) can be neatly divided by two curves of the form

$$ \text{petal width}\ =\ \frac{a_i}{\text{petal length}}, i \in \{0, 1\} \text{,}$$

with categories assigned depending on which *side* of the curves a sample falls. The values $a_i$ are called **parameters** of these family of curves. This reduced the problem to finding the *best* values $a_i$ such that we classified our samples as well as possible. And how did we find these best values? We looked at the distributions for the feature $\text{petal length}*\text{petal width}$ and manually picked values that looked good.

We could, of course, have picked another form for the curves that divide up this subspace, and then tried to find the best _particular_ curves that fit that form for our data. But we picked this one. Why? Well, it seemed natural and solvable and, frankly, it worked pretty well. So why not?

Of course, we still haven't even used our sepal measurements. Surely those could add something to the picture. What if we came up with families of curves in the full 4 dimensional space and found which of their members best separated the data? But how would we do that, since we can't *view* the data in the 4D space? Well, we could try to reduce those curves to some 1D distribution like we did above and manually find the cut points. But we relied on some hand-wavy intuition to come up with that "area" feature above, and I for one certainly don't know enough about botany to come up with something that represents information from all four features at once.

Perhaps more importantly, what would this exercise do that might help us on any other problem we might want to solve down the line? Sure, right now irises and their petals and sepals are all we care about, but maybe we'll have a project down the road involving other flowers or plants, maybe ones without petals at all. Maybe we'll stop caring about botany altogether and start applying machine learning to some new field entirely. What good will these insights about the information contained in sepal and petal measurements do us then? We'll have to start completely from scratch and repeat the whole process!

What would be ideal for us would be if there was some way to automatically reduce _any_ dataset down to fewer dimensions in some way that retained information about the _structure_ of the dataset in the original space. Such a tool would not only help us visualize and build rules for *this* dataset, but would (with enormous caveats) help with making rules for any *other* dataset as well. Our job would then be reduced to finding good rules in this 2D space, rather than laboriously trying different curves and feature combinations to see if they work or not.

Luckily, there's a great off-the-shelf statistical tool for building these reduced representations called [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA). We'll **transform** our data using PCA, then visualize it and try to come up with a new and better rule for classification.

In [11]:
from scipy.linalg import eig


def principal_component_analysis(X, n_components=2):
    '''
    simple PCA implementation using eigenvectors of the 
    normalized X^T X matrix
    '''
    # X is a 150 x 4 matrix corresponding to our measurements
    # STEP 1:
    # "normalize" the data: center each feature at 0 w/ standard dev. of 1
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # STEP 2:
    # take the eigenvalues and eigenvectors of matrix product X^T X
    # note that this makes X^T X real and symmetric, so eigenvectors are real
    eigenvalues, eigenvectors = eig(X.T @ X)
    eigenvectors = eigenvectors.real

    # STEP 3:
    # make a matrix whose rows are the top n eigenvectors
    # (i.e. the ones corresponding to the n largest eignvalues)
    top_n_eigenvalue_indices = np.argsort(eigenvalues)[:n_components]
    W = eigenvectors[top_n_eigenvalue_indices]
    assert W.shape == (n_components, 4) # 2 x 4 matrix

    # STEP 4:
    # map X down to the 2D space using this matrix transformation
    X = X @ W.T # (150 x 4)(4 x 2) -> (150 x 2)

    # This is just a little convenience thing I'm doing for
    # plotting to assign these values to the x and y axes,
    # it has nothing to do with PCA
    if n_components == 2:
        X.columns = ["x", "y"]

    # return both the transformed data, as well as the matrix
    # we used to make the transformation, in case we feel like
    # using it to transform anything else (like, say, new samples
    # that we encounter when we deploy our system "in production")
    return X, W

In [12]:
X = df[dataset.feature_names]
X_2d, W = principal_component_analysis(X)
X_2d = pd.concat([X_2d, df], axis=1)

fig_kwargs = {"title": "PCA Transformed Data"}
p = plot_utils.make_2d_scatter_plot(X_2d, label_column="class", fig_kwargs=fig_kwargs)
p.legend.location = 'bottom_left'
show(p)

Pause for a moment and appreciate how impressive this visualization is: here we've used an algorithm that was formulated without knowing *anything* about irises or how large their sepals and petals are supposed to be. And yet, we've managed to use it to build a more *efficient* representation of our data that still manages to create some clear separation between these classes! And since the algorithm didn't rely on anything specific to this particular dataset of iris measurements, or irises at all, we could just as easily use it on another set of measurements or another problem entirely!

Everything we've done up to this point would probably be more commonly termed "data analytics" than "artificial intelligence" or "machine learning". Sure, we used a computer to generate the plots and even execute the requisite logic, but all the intelligence was built-in manually by us, the programmer, entirely from our insights about botany.

What's different about this scenario? If I was sufficiently motivated and equipped with a linear algebra textbook and a lot of time, I could have calculated the PCA matrix with pencil and paper and transformed the original matrix myself. That the computation of these values was executed using silicon and electricity was more a matter of *convenience* than of fundamental intellectual ability. If I had made this graph on paper and used it to build some classification rule, would that still be "machine learning"?

I would argue that it would. What separates this from the analytics we did earlier is that the *algorithm* for generating this representation (and hence any rule we might build from it) *does not rely on a priori knowledge of the system in question*. PCA is an automatic representation building *machine*. Whether those representations are always *useful* is another matter entirely, and will still require a data scientist or domain expert to determine. But in my mind what separates ML techniques from more traditional rules-based AI is the use of these domain-agnostic transformation mechanisms that can be robustly applied across a variety of ontologically disparate but statistically reconcilable problems.

We'll see soon that we can take this a step farther and use ML algorithms to not only build representations, but also learn classification rules *from* those representations. We can even imagine how this might generalize to truly intelligent systems: systems which are capable of not just adapting their representations and classification rules from domain to domain, but of efficiently and robustly adapting the mechanism *by which* they learn representations and rules to new domains. Sufficiently robust mechanisms might require that the representations learned by the algorithm reflect some real knowledge or understanding of the systems in question.

But now we're getting a few football fields out over our skis. Let's get back to the problem at hand. So we have this nice 2D representation of our data. Let's do what we did above: pick families for a couple curves that will break up our 2D space and be used to classify samples, and then figure out which curves in those families are best.

In this case, the family of curves I'm going to pick are straight lines, since those are easiest for me to visualize. Using the grid on the plot as a reference, I'll take a stab at a couple lines below, and we'll plot them with our samples and see how they look.

In [13]:
line1_points = [(-3, -1), (0, 0.5)] # (x,y) end-points on line separating blue from yellow
line2_points = [(0, -1), (2, 0.5)] # (x,y) end-points on line separating yellow from red

xs = [[point[0] for point in line] for line in [line1_points, line2_points]]
ys = [[point[1] for point in line] for line in [line1_points, line2_points]]
p.multi_line(xs, ys, line_color="#000000", line_alpha=0.9, line_dash="2 2")
show(p)

Those look about as good as I'm going to get without zooming in, so let's see how our accuracy looks now by checking which side of each line our points fall.

In [14]:
def pca_rules_based_ai_classifier(sample):
    x, y = sample["x"], sample["y"]
    distance_from_lines = []
    for line_points in [line1_points, line2_points]:
        (x1, y1), (x2, y2) = line_points
        distance_from_line = (x - x1)*(y2 - y1) - (y - y1)*(x2 - x1)
        distance_from_lines.append(distance_from_line)

    d1, d2 = distance_from_lines
    if d1 < 0:
        return 'setosa'
    elif d2 < 0:
        return 'versicolor'
    else:
        return 'virginica'

num_right = 0
wrong_rows, wrong_predictions = [], []
for row_id, x in X_2d.iterrows():
    label = x.pop('class')
    prediction = pca_rules_based_ai_classifier(x)
    if label == prediction:
        num_right += 1
    else:
        wrong_rows.append(row_id)
        wrong_predictions.append(prediction)

accuracy = num_right / len(df)
print('Accuracy: {:0.2f}%'.format(100*accuracy))

wrong_df = df.loc[wrong_rows]
wrong_df["predicted class"] = wrong_predictions
wrong_df

Accuracy: 96.67%


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class,predicted class
70,5.9,3.2,4.8,1.8,versicolor,virginica
119,6.0,2.2,5.0,1.5,virginica,versicolor
129,7.2,3.0,5.8,1.6,virginica,versicolor
133,6.3,2.8,5.1,1.5,virginica,versicolor
134,6.1,2.6,5.6,1.4,virginica,versicolor


So the good news is that we made another slight improvement! But the even better news is that, once again, we did it using a representation that took more or less *no work* and *no thought* (at least about irises) to generate. Sure, we had to code up the PCA implementation. But now that we have it, we can use it forever. In fact, we didn't even really need to put in this work at all. Most (read: all, unless you're doing some groundbreaking research) of these common ML algorithms are already available in open-sourced implementations in Python libraries like [scikit-learn](https://scikit-learn.org/stable/), [RAPIDS](https://rapids.ai/) (for GPU usage), and [TensorFlow](https://www.tensorflow.org/). But where would the fun in that have been?

## Conclusion
Hopefully, this has provided some helpful insights into what ML is trying to *do*, and what distinguishes it as a sub-category of artificial intelligence. Of course, this is just the beginning. The question of *how* ML does these things, and what we can do as data scientists to leverage it most effectively, is a rich and fascinating topic that I'm hoping to explore more in a follow-up notebook to this. In the meantime, I encourage the reader to think critically about not only the math and code that we've used here, but about the philosophical issues at play as well.

For example, we've been measuring the quality of our classification rules by evaluating their accuracy on our dataset. However, we used that same dataset to inform our understanding of the problem and build those rules in the first place, so shouldn't we *expect* that the rules perform reasonably well? We never considered building deliberately *worse* rules at any point. This seems relevant because, ultimately, we don't care about how well our rule does on *this* dataset: once I've used it to make my rules, I frankly don't care if I ever see this data again. We care about how well our rule will perform on *new* data we've never seen before! After all, that's the data that we don't already have classified, and is presumably why we undertook this whole exercise to begin with. So does our evaluation of our rules on the data used to *inform* those rules represent an accurate estimate of how well I should expect those same rules to perform "in the wild"?