# MA2634: Machine Learning for Artificial Intelligence (NCUT)

> **Adapted from MA5634: Fundamentals of Machine Learning (Simon Shaw)**  
> This MA2634 notebook updates framing, outcomes, and assessment alignment per the MA2634 module block (NCUT), while acknowledging Simon Shaw’s original MA5634 materials.

---

## Acknowledgements (MA5634 sources)

# MA5634: Fundamentals of Machine Learning

#### *variationalform* <https://variationalform.github.io/>

#### *Just Enough: progress at pace*

<https://github.com/variationalform>

Simon Shaw  
<https://www.brunel.ac.uk/people/simon-shaw>



<table>
<tr>
<td>
<img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:18px"/>
<img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:18px"/>
<img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg?ref=chooser-v1" style="height:18px"/>
</td>
<td>

<p>
This work is licensed under CC BY-SA 4.0 (Attribution-ShareAlike 4.0 International)

<p>
Visit <a href="http://creativecommons.org/licenses/by-sa/4.0/">http://creativecommons.org/licenses/by-sa/4.0/</a> to see the terms.
</td>
</tr>
</table>

<table>
<tr>
<td>This document uses python</td>
<td>
<img src="https://www.python.org/static/community_logos/python-logo-master-v3-TM.png" style="height:30px"/>
</td>
<td>and also makes use of LaTeX </td>
<td>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/LaTeX_logo.svg/320px-LaTeX_logo.svg.png" style="height:30px"/>
</td>
<td>in Markdown</td>
<td>
<img src="https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png" style="height:30px"/>
</td>
</tr>
</table>

## What this is about

You will be introduced to the **foundational principles of machine learning** with both their mathematical underpinning and their software incarnations. The course explores the role of machine learning (ML) in artificial intelligence (AI), along with the **ethical issues** surrounding the use of data and AI systems.

You will learn to distinguish between:
- **Regression and classification** problems  
- **Supervised and unsupervised** learning approaches  
- **Training, validation, and test** data sets, and their role in model evaluation  

You will also encounter essential ideas such as:
- **Cost and loss functions**, and the notion of optimisation  
- **Decision boundaries** in classification  
- **Bootstrap techniques** for assessing model variability  
- **Clustering and neighbours methods** for unsupervised learning  
- **Trees and perceptrons** as basic prediction machines  

### Mathematics: “just enough, at pace”
You are **not expected to be a mathematician**, but you will be expected to recall or learn the necessary basics in:
- Vectors and matrices  
- Random variables and simulation  
- Differential calculus (only where required for cost/loss functions)  

### Programming: “just enough, at pace”
You are **not expected to be a computer scientist**, but **Python will be introduced and used as a tool**. Only the essential Python syntax, libraries, and techniques will be covered.  
The emphasis will be on **doing and experimenting with ML models**, rather than proving theorems.

---


## Assessment

- **25% Assignment**  
  A small project requiring you to **choose, configure, and implement a machine learning classifier or clustering method**.  
  Deliverables will include a short technical report (with code appendix) and clear communication of your results. Guidance and a detailed brief will be provided in class.

- **75% Examination (2 hours)**  
  A written exam focusing on theory, core concepts (e.g. regression vs. classification, supervised vs. unsupervised learning, validation methods, bootstrap, decision boundaries), and applied interpretation of results.  
  **Revision and reflection time will be allocated** in the final weeks of the course.


## Key Concepts: Glossary of Relevant Terms

The first few of these are debatable, evolving, and subject to change and interpretation.  
It is worth searching and reading for yourself — these are fast-growing areas.

---

#### Data Science
An **interdisciplinary field** combining mathematics, computer science, and statistics with domain expertise.  
It involves extracting knowledge and insight from structured and unstructured data, often at scale.

#### Data Analytics
The **systematic computational analysis of data** to discover patterns, trends, and value.  
Often used to support decision-making and business intelligence.

#### Data Engineering
The stewardship, cleaning, storage, and preparation of data, including pipelines and warehousing, to make data reliable and usable for analysis and machine learning.

#### Machine Learning (ML)
A branch of AI focused on building algorithms that **learn patterns from data** to make predictions or decisions without being explicitly programmed.  
Includes both **supervised** (with labelled outcomes) and **unsupervised** (finding structure without labels) approaches.

#### Supervised Learning
Learning from labelled data: input–output pairs are provided, and the task is to generalise to unseen cases.  
Examples: regression (predicting continuous values), classification (predicting categories).

#### Unsupervised Learning
Learning from unlabelled data: the task is to uncover structure such as clusters, latent factors, or relationships.  
Examples: clustering, dimensionality reduction (e.g. PCA).

#### Cost and Loss Functions
Mathematical formulations that measure the “error” of a model.  
- **Loss function**: error on a single example.  
- **Cost function**: aggregated error over the dataset.  
Optimisation methods aim to minimise these.
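
To make the distinction concrete, here is a minimal sketch. It assumes squared error as the loss and its average as the cost, which is one common choice among many. The numbers are made up for illustration.

```python
import numpy as np

# hypothetical toy data: true labels and a model's predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

# loss: the squared error on each individual example
losses = (y_true - y_pred) ** 2

# cost: the losses aggregated (here, averaged) over the whole data set
cost = losses.mean()

print("per-example losses:", losses)
print("cost (mean squared error):", cost)
```

Training a model then amounts to adjusting its parameters so that `cost` becomes as small as possible.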

#### Validation / Test Sets
- **Validation set**: used for model selection and tuning hyperparameters.  
- **Test set**: held back until the very end to give an unbiased estimate of model performance.
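
A minimal sketch of how such a split might be made, using only `numpy`. The 60/20/20 proportions, the seed and the data set size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the split is repeatable

n = 100                        # hypothetical number of data points
indices = rng.permutation(n)   # the row indices 0, 1, ..., n-1 in random order

# assumed proportions: 60% training, 20% validation, the rest for testing
n_train = (60 * n) // 100
n_val = (20 * n) // 100

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```

Every row lands in exactly one of the three sets, so the test rows are never seen during training or tuning.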

#### Bootstrap
A **resampling technique** that provides estimates of uncertainty (e.g. confidence intervals) by repeatedly sampling (with replacement) from the observed data.
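
A minimal sketch of the idea, assuming we want an approximate 95% interval for a sample mean and using the simple percentile method. The simulated data, the seed and the number of resamples are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# hypothetical observed sample of 50 measurements
data = rng.normal(loc=10.0, scale=2.0, size=50)

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # resample the observed data WITH replacement, same size as the original
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[b] = resample.mean()

# percentile method: the middle 95% of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print("sample mean:", data.mean())
print("approximate 95% interval for the mean:", (lower, upper))
```

The spread of `boot_means` indicates how much the mean might vary if the data were collected again.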

#### Decision Boundary
A surface in the feature space that separates different classes assigned by a model.  
Understanding decision boundaries helps interpret classification models.

#### Perceptron
A simple linear model inspired by biological neurons, able to separate data that are linearly separable. Forms the basis for more complex neural networks.
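
A minimal sketch of the classic perceptron update rule on a tiny data set. The data points, the $\pm 1$ coding of the labels and the cap of 100 passes are assumptions made for this illustration.

```python
import numpy as np

# hypothetical, linearly separable toy data: two features per point
X = np.array([[2.0, 1.0], [3.0, 2.0], [2.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])   # class labels coded as +1 / -1

w = np.zeros(2)   # weights
b = 0.0           # bias

for epoch in range(100):            # cap on passes, in case the data are not separable
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:  # misclassified (or exactly on the boundary)
            w += yi * xi            # nudge the boundary towards the correct side
            b += yi
            mistakes += 1
    if mistakes == 0:               # a full pass with no errors: stop
        break

predictions = np.where(X @ w + b > 0, 1, -1)
print("weights:", w, "bias:", b)
print("all training points classified correctly:", np.array_equal(predictions, y))
```

The decision boundary of this model is the straight line of points $x$ where `w @ x + b` equals zero.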

#### Ethics in AI/ML
Critical consideration of **bias, fairness, accountability, and transparency** in data and models.  
Covers issues of data provenance, privacy, and responsible deployment of AI systems.


#### Artificial Intelligence

The development and deployment of digital systems that can effectively substitute for humans in
tasks beyond the routine application of fixed rules. When you talk to your home assistant, your
phone, or your satellite TV receiver, or your car, or your laptop, and so on, it has no idea
what you are going to say. It doesn't have a bank of pre-answered questions, but instead it
responds dynamically to what it hears. It has been trained on data, and it has learned how to
respond. Incidentally, how do you think these systems even understand what you said? As a child,
it took you months to begin to understand human speech...

#### Machine Learning

The development and deployment of algorithms that are able to learn from data without explicit instructions,
and then analyze, predict, or otherwise draw inferences, from unseen data. These algorithms would typically
be expected to add measurable value by their performance.

> *Consider for example an algorithm that predicted __tails__ for every coin flip. It's
> right half the time* - but there's no value in that.
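
The always-tails predictor is easy to simulate. This is a minimal sketch, assuming a fair coin simulated with `numpy`. The seed and the number of flips are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# simulate 10,000 fair coin flips: 0 = heads, 1 = tails
flips = rng.integers(0, 2, size=10_000)

# a "model" that ignores everything and always predicts tails
predictions = np.ones_like(flips)

# fraction of flips for which the prediction was right
accuracy = (predictions == flips).mean()
print("accuracy of always-tails:", accuracy)
```

The accuracy comes out close to one half, which is no better than guessing.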

#### Learning

Machine learning models do not have intrinsic knowledge but instead learn from data.
Typically a data set comprises a list of items each of which has one or more
*features* which correspond to a *label*. We'll see some examples of this below.

We think of the features as being inputs to the machine learning model, and the label
as being the output. Typically we want to be able to feed in new features, and have
the model predict the label.

To do this we need a **training data set** so that the model can learn how to map the
features to the label: the *input to the output*.

There are three basic learning paradigms:

- **Supervised Learning**:
Here the data is labelled. This means that for a given set of features, or inputs, we also know
their labels, or outputs. Examples of this are where...
  - We could have a list of features of insured drivers, such as age, time since they passed
  their driving test, type of car, locality, and along with those features a monetary
  value on their accident claim. The task would be to learn how much of an insurance premium
  to charge to a new customer once those features have been determined.
  - We might have a bank of images of handwritten digits, and for each image we know what
  digit is represented. The MNIST database of handwritten digits, see
  <http://yann.lecun.com/exdb/mnist/> or <https://en.wikipedia.org/wiki/MNIST_database>
  for example, is a well known example of this. The task is to learn how to predict
  what digit is captured by a new image. This could be used in ANPR systems for example,
  <https://en.wikipedia.org/wiki/Automatic_number-plate_recognition>.
  

- **Unsupervised Learning**:
This is where we only know the features and we want to cluster the data in such a way
that a set of similar features can be associated with some common characteristic (the label).
  - This can be used on data where the analyst doesn't initially know what they are looking
  for. For example, a retailer might have a mass of data of customer age, locale, average spend,
  types of purchased item, time of day of purchase, day of week of purchase, time of year etc.
  What characteristics can be used to group these customers? How can advertising be targeted?
  - Principal component analysis seeks to re-orient data so that its dominant statistical
  properties are revealed. We'll see this later.

- **Reinforcement Learning**:
This seeks to strike a balance between the two above. There are no labels, but instead, as time
progresses the learning algorithm has a *reward* variable which is increased when an action it
has learned has resulted in a measurable benefit. Over time the algorithm develops a policy to
inform its actions.

This last is a major topic and will not be covered in these lectures. We will see examples of
the first two.

#### Regression and Classification

Our algorithms will be developed to perform one of the following tasks:

- **Regression:** here the output, the label, can take any value in a continuous set. For example,
  the height of a tree, given local climate, soil type, genus, age since planting, could be
  considered to be any non-negative real number (although not with equal probability).

- **Classification:** in this case the label will be deemed to be one of a certain class. For
  example, in the handwritten digits example above, the output will be one of the digits
  $\{0,1,2,3,\ldots,9\}$.

Some of the algorithms we study will be able to perform both the regression and classification tasks,
although we won't always delve deeply into both capabilities.
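
As an illustration of one method serving both a regression and a classification task, here is a minimal sketch of a nearest-neighbours predictor written directly in `numpy`. The helper `k_nearest_labels`, the toy data and the choice $k=3$ are all assumptions made for this example, not part of the course code.

```python
import numpy as np

def k_nearest_labels(x_train, labels, x_new, k=3):
    """Return the labels of the k training points closest to x_new."""
    distances = np.abs(x_train - x_new)
    nearest = np.argsort(distances)[:k]
    return labels[nearest]

# hypothetical training data with a single feature
x_train = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y_values = np.array([1.1, 1.9, 3.2, 6.1, 6.8, 8.3])   # continuous labels
y_classes = np.array([0, 0, 0, 1, 1, 1])              # class labels

x_new = 2.5

# regression: average the neighbours' continuous labels
reg_prediction = k_nearest_labels(x_train, y_values, x_new).mean()

# classification: majority vote among the neighbours' class labels
votes = k_nearest_labels(x_train, y_classes, x_new)
clf_prediction = np.bincount(votes).argmax()

print("regression prediction:", reg_prediction)
print("classification prediction:", clf_prediction)
```

The same neighbours are used in both cases. Only the way their labels are combined differs.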

## Reading List

For the data science, our main sources of information are as follows:
    
- MML: Mathematics for Machine Learning, by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong.
  Cambridge University Press. <https://mml-book.github.io>.
- MLFCES: Machine Learning: A First Course for Engineers and Scientists, by Andreas Lindholm,
  Niklas Wahlström, Fredrik Lindsten, Thomas B. Schön. Cambridge University Press.
  <http://smlbook.org>.
- FCLA: A First Course in Linear Algebra, by Ken Kuttler,
  <https://math.libretexts.org/Bookshelves/Linear_Algebra/A_First_Course_in_Linear_Algebra_(Kuttler)>
- AP: Applied Probability, by Paul Pfeiffer
  <https://stats.libretexts.org/Bookshelves/Probability_Theory/Applied_Probability_(Pfeiffer)>
- IPDS: Introduction to Probability for Data Science, by Stanley H. Chan,
  <https://probability4datascience.com>
- SVMS: Support Vector Machines Succinctly, by Alexandre Kowalczyk,
  <https://www.syncfusion.com/succinctly-free-ebooks/support-vector-machines-succinctly>
- VMLS: Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares,
  by Stephen Boyd and Lieven Vandenberghe,
  <https://web.stanford.edu/~boyd/vmls/>


All of the above can be accessed legally and without cost.


There are also these useful references for coding:

- PT: `python`: <https://docs.python.org/3/tutorial>
- NP: `numpy`: <https://numpy.org/doc/stable/user/quickstart.html>
- MPL: `matplotlib`: <https://matplotlib.org>

The capitalized abbreviations will be used throughout to refer to these sources. For example, we could
say *See [MLFCES, Chap 2, Sec. 1] for more discussion of __Supervised Learning__*. This would
just be a quick way of saying

> Look in Section 1, of Chapter 2, of Machine Learning: A First Course for Engineers and
> Scientists, by Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, Thomas B. Schön,
> for more discussion of supervised learning.

There will be other sources shared as we go along. For now these will get us a long way.

## Coding: `python` and some data sets

For each of our main topics we will see some example data, discuss a means of working with it,
and then implement those means in code. We will develop enough theory to understand how
the code works, but our main focus will be the intuition behind the method, and effective
problem solving using code.

We choose `python` because its use in both the commercial and academic data science
arena seems to be pre-eminent.

The data science techniques and algorithms we will study, and the supporting technology
like graphics and number crunching, are implemented in well-known and well-documented
`python` libraries. These are the main ones we will use:

- `matplotlib`: used to create visualizations, plotting 2D graphs in particular.
- `numpy`: this is *numerical python*, it is used for array processing which for us
   will usually mean the numerical calculations involving vectors and matrices.
- `scikit-learn`: a set of well documented and easy to use tools for predictive data analysis.
- `pandas`: a data analysis tool, used for the storing and manipulation of data.
- `seaborn`: a data visualization library for attractive and informative statistical graphics.

There will be others, but these are the main ones. Let's look at some examples of how to use them.


## Binder, Anaconda, Jupyter - a first look at some data

Eventually we will use the Anaconda distribution to access `python` and the libraries
we need. The coding itself will be carried out in a Jupyter notebook. We'll go through this
in an early lab session. We'll start, though, with Binder. Click here:

<https://mybinder.org/v2/gh/variationalform/FML.git/HEAD>

Let's see some code and some data. In the following cell we import `seaborn` and look at
the names of the built in data sets. The `seaborn` library, <https://seaborn.pydata.org>,
is designed for data visualization. It uses `matplotlib`, <https://matplotlib.org>,
which is a graphics library for `python`.

If you want to dig deeper, you can look at
<https://blog.enterprisedna.co/how-to-load-sample-datasets-in-python/>
and <https://github.com/mwaskom/seaborn-data> for the background - but you don't need to.

In [None]:
import seaborn as sns
# we can now refer to the seaborn library functions using 'sns'
# note that you can use another character string - but 'sns' is standard.

# note that # is used to write 'comments'
# Now let's get the names of the built-in data sets.
sns.get_dataset_names()

# type SHIFT+RETURN to execute the highlighted (active) cell

### The `taxis` data set

In [None]:
# let's take a look at 'taxis'
dft = sns.load_dataset('taxis')
# this just displays the first few rows of the data
dft.head()

In [None]:
# this will display the last few rows... There are 6433 records (Why?)
dft.tail()

What we are seeing here is a **data frame**. It is furnished by the `pandas`
library: <https://pandas.pydata.org> which is used by the `seaborn` library
to store its example data sets.

Each row of the data frame corresponds to a single **data point**, which we
could also call an `observation` or `measurement` (depending on context).

Each column (except the left-most) corresponds to a **feature** of the data
point. The first column is just an index giving the row number. Note that this
index starts at zero - so, for example, the third row will be labelled/indexed
as $2$. Be careful of this - it can be confusing.
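
A small sketch of the zero-based index in action. It uses a hand-made three-row data frame, with made-up values, rather than the real `taxis` data.

```python
import pandas as pd

# hypothetical three-row data frame standing in for a real data set
df = pd.DataFrame({"distance": [1.6, 0.8, 3.4], "fare": [7.0, 5.0, 13.5]})

print(df)                 # the left-most column of the printout is the index: 0, 1, 2

third_row = df.iloc[2]    # position 2 is the THIRD row
print(third_row["fare"])
```

Here `iloc` selects a row by its position, counting from zero.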

Here the variable `dft` is a `pandas` data frame: the name stands for 'data frame taxis'.

In [None]:
# let's print the data frame...
print(dft)

#### Visualization

Rows and rows of numbers aren't that helpful.

`seaborn` makes visualization easy - here is a scatter plot of the data.

In [None]:
sns.scatterplot(data=dft, x="distance", y="fare")

> **THINK ABOUT**: it looks like fare is roughly proportional to distance.
> But what could cause the outliers?

In [None]:
# here's another example
sns.scatterplot(data=dft, x="pickup_borough", y="tip")

In [None]:
# is the tip proportional to the fare?
sns.scatterplot(data=dft, x="fare", y="tip")

In [None]:
# is the tip proportional to the distance?
sns.scatterplot(data=dft, x="distance", y="tip")

### The `tips` data set

Let's look now at the `tips` data set. Along the way we'll see a few more
ways we can use the data frame object.




In [None]:
# load the data - dft: data frame tips
# note that this overwrites the previous 'value/meaning' of dft
dft = sns.load_dataset('tips')
dft.head()

An extensive list of data frame methods/functions can be found here:
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame>.
Let's look at some of them.

In [None]:
# info is a method, so it needs parentheses; it prints a summary of the columns itself
dft.info()
print('The shape of the data frame is: ', dft.shape)
print('The size of the data frame is: ', dft.size)
print('Note that 244*7 =', 244*7)

#### Visualization

Again, numbers aren't always that helpful. Plots often give us more insight.

In [None]:
dft.plot()

In [None]:
sns.scatterplot(data=dft, x="total_bill", y="tip")

#### Statistics and Probability

You're assumed to be familiar with basic terms and concepts in these areas,
but we will revise and review those that we need later.

We can get some basic stats for our data set with the `describe()` method...

In [None]:
# here are some descriptive statistics
dft.describe()
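
One detail worth knowing: the `std` row reported by `describe()` is the *sample* standard deviation (dividing by $n-1$), whereas `numpy`'s `std` divides by $n$ by default. A minimal sketch on a made-up series of five values:

```python
import numpy as np
import pandas as pd

tips = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values

summary = tips.describe()

sample_std = tips.std()                      # pandas default: divide by n - 1
population_std = np.std(tips.to_numpy())     # numpy default: divide by n

print(summary)
print("pandas std:", sample_std)
print("numpy std: ", population_std)
```

For large data sets the two values are almost identical, but for small ones the difference is noticeable.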

### The `anscombe` data set

This is pretty famous. There are four sets of 11 coordinate pairs.
When plotted they look completely different.
But they have the same summary statistics (at least the common ones).

See <https://en.wikipedia.org/wiki/Anscombe%27s_quartet>

<img src="https://upload.wikimedia.org/wikipedia/commons/7/7e/Julia-anscombe-plot-1.png" style="height:300px"/>

Image Credit: `https://upload.wikimedia.org/wikipedia/commons/7/7e/Julia-anscombe-plot-1.png`

Let's load the data set and take a look at it - we can look at the head and tail of the
table just as we did above.

In [None]:
dfa = sns.load_dataset('anscombe')
# look at how we get an apostrophe in the string...
print("The size of Anscombe's data set is:", dfa.shape)

In [None]:
dfa.head()

In [None]:
dfa.tail()

It looks like the four data sets are in the `dataset` column. How can we extract them as separate items?

Well, one way is to print the whole dataset and see which rows correspond to each dataset. Like this...

In [None]:
print(dfa)

From this and the `head` and `tail` output above we can infer that there are four
data sets: I, II, III and IV. They each contain $11$ pairs $(x,y)$.

- The first set occupies rows $0,1,2,\ldots,10$
- The second set occupies rows $11,12,\ldots,21$
- The third set occupies rows $22,23,\ldots,32$
- The fourth set occupies rows $33,34,\ldots,43$

However, this kind of technique is not going to be useful if we have a data set
with millions of data points (rows). We certainly won't want to print them all
like we did above.

Is there another way to determine the number of distinct feature values in a
given column of the data frame?

Fortunately, yes. We want to know how many different values the `dataset` column
has. We can do it like this.

In [None]:
dfa.dataset.unique()

We can count the number of different ones automatically too, by asking
for the `shape` of the returned value. Here we go:

In [None]:
dfa.dataset.unique().shape

This tells us that there are 4 items - as expected.
Don't worry too much about it saying `(4,)` rather than just `4`.
We'll come to that later when we discuss `numpy`
(Numerical python: <https://numpy.org>).
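
If only the count is wanted, there are a couple of shorter routes. The sketch below uses a small hand-made data frame, with made-up values, in place of the real data.

```python
import pandas as pd

# hypothetical data frame with a 'dataset' column holding four distinct labels
df = pd.DataFrame({
    "dataset": ["I", "I", "II", "II", "III", "IV"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

labels = df["dataset"].unique()

print(labels.shape)               # a tuple with one entry, because the result is one-dimensional
print(len(labels))                # the plain number of distinct values
print(df["dataset"].nunique())    # pandas can also count them directly
```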

Now, we want to extract each of the four datasets as separate data sets so we can work
with them. We can do that by using `loc` to get the row-wise locations where each
value of the `dataset` feature is the same.

For example, using the hints here
<https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values>,
to get the data for the sub-data-set `I` we can do this:

In [None]:
dfa.loc[dfa['dataset'] == 'I']
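
To see why this works, it helps to look at the comparison on its own. It produces one `True` or `False` per row, and `loc` keeps the rows marked `True`. A minimal sketch on a made-up four-row data frame:

```python
import pandas as pd

# hypothetical data frame with two interleaved subsets
df = pd.DataFrame({"dataset": ["I", "II", "I", "II"], "x": [1.0, 2.0, 3.0, 4.0]})

mask = df["dataset"] == "I"   # a Series of True/False values, one per row
print(mask)

subset = df.loc[mask]         # keep only the rows where the mask is True
print(subset)
```

Note that the selected rows keep their original index labels.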

Now we have this subset of data we can examine it - with a scatter plot
for example.

In [None]:
sns.scatterplot(data=dfa.loc[dfa['dataset'] == 'I'], x="x", y="y")

To really work properly with each subset we should extract them and give each
of them a name that is meaningful.

In [None]:
dfa1 = dfa.loc[dfa['dataset'] == 'I']
dfa2 = dfa.loc[dfa['dataset'] == 'II']
dfa3 = dfa.loc[dfa['dataset'] == 'III']
dfa4 = dfa.loc[dfa['dataset'] == 'IV']
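
Writing one line per subset is fine for four subsets, but it does not scale. As an alternative sketch, the `groupby` method of a data frame splits on every distinct value of a column in one go. The data frame below is hand-made with made-up values.

```python
import pandas as pd

# hypothetical data frame with two subsets
df = pd.DataFrame({
    "dataset": ["I", "I", "II", "II"],
    "x": [1.0, 3.0, 2.0, 6.0],
    "y": [2.0, 4.0, 1.0, 5.0],
})

# one data frame per distinct value of the 'dataset' column
subsets = {name: group for name, group in df.groupby("dataset")}
print(subsets["II"])

# summary statistics for every subset at once
means = df.groupby("dataset")[["x", "y"]].mean()
print(means)
```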

Now let's look at each of the four data sets in a scatter plot,
and use the `describe` method to examine the summary statistics.

The outcome is quite surprising...

#### dataset 1

In [None]:
sns.scatterplot(data=dfa1, x="x", y="y")
dfa1.describe()

#### dataset 2

In [None]:
sns.scatterplot(data=dfa2, x="x", y="y")
dfa2.describe()

#### dataset 3

In [None]:
sns.scatterplot(data=dfa3, x="x", y="y")
dfa3.describe()

#### dataset 4

In [None]:
sns.scatterplot(data=dfa4, x="x", y="y")
dfa4.describe()

## Exercises

For the `taxis` data set:

1. Produce a scatterplot of "dropoff_borough" vs. "tip"
2. Plot the dependence of fare on distance.

```
# here ds = sns.load_dataset('taxis')
1: sns.scatterplot(data=ds, x="dropoff_borough", y="tip")
2: sns.scatterplot(data=ds, x="distance", y="fare")
```

For the `tips` data set:

1. What is the standard deviation of the tips?
2. Plot the scatter of tip against the total bill
3. Plot the scatter of total bill against day
4. Plot the scatter of tip against gender

```
# here ds = sns.load_dataset('tips')
1: ds.describe()  # read off the 'std' entry in the 'tip' column
2: sns.scatterplot(data=ds, x="total_bill", y="tip")
3: sns.scatterplot(data=ds, x="day", y="total_bill")
4: sns.scatterplot(data=ds, x="sex", y="tip")
```

## Technical Notes, Production and Archiving

Ignore the material below. What follows is not relevant to the material being taught.

#### Production Workflow

- Finalise the notebook material above
- Set `OUTPUTTING=1` below
- Clear and fresh run of entire notebook
- Create html slide show:
  - `jupyter nbconvert --to slides 1_intro.ipynb `
- Clear all cell output
- Set `OUTPUTTING=0` below
- Save
- git add, commit and push to FML
- copy PDF, HTML etc to web site
  - git add, commit and push
- rebuild binder

Ignore this - it is done in 2_vectors

For the Anscombe data set:

1. Which of the summary statistics for $x$ are the same or similar for each subset?
2. Which of the summary statistics for $y$ are the same or similar for each subset?


Look at the `diamonds` data set

1. How many diamonds are listed there? How many attributes does each have?
2. Scatter plot price against carat.

```
1: ds = sns.load_dataset('diamonds'); ds.shape  # (53940, 10): 53940 diamonds, 10 attributes
2: sns.scatterplot(data=ds, x="carat", y="price")
```

Some of this originated from

<https://stackoverflow.com/questions/38540326/save-html-of-a-jupyter-notebook-from-within-the-notebook>


In [1]:
!apt-get -y update
!apt-get -y install pandoc texlive-xetex texlive-latex-extra texlive-fonts-recommended fonts-noto-cjk
!pip -q install "nbconvert==7.*" pygments

!which pandoc
!pandoc --version | head -n 2



In [None]:
import json, pathlib, re
from google.colab import _message

nb = _message.blocking_request('get_ipynb')['ipynb']

for c in nb.get("cells", []):
    src = "".join(c.get("source", []))
    tags = c.setdefault("metadata", {}).setdefault("tags", [])
    if c.get("cell_type") == "code" and re.search(r'(%%bash|apt-get|nbconvert|PDFExporter|NBROOTNAME|OUTPUTTING=)', src):
        tags.append("remove_cell")
    if c.get("cell_type") == "markdown" and re.search(r'(Technical Notes|Production Workflow)', src, re.I):
        tags.append("remove_cell")

pathlib.Path("current_tagged.ipynb").write_text(json.dumps(nb), encoding="utf-8")
print("Wrote current_tagged.ipynb")


In [None]:
%%bash
cat > nbconvert_xelatex_hide.py <<'CFG'
c = get_config()
c.TagRemovePreprocessor.enabled = True
c.TagRemovePreprocessor.remove_cell_tags = {"remove_cell"}
c.PDFExporter.latex_command = ["xelatex","-interaction=nonstopmode","{filename}"]
CFG
jupyter nbconvert --to pdf --config nbconvert_xelatex_hide.py current_tagged.ipynb


