# Week 1 - Common machine learning tasks

Machine learning is increasingly in the news and pervades every aspect of our lives.

Machine learning protects our inboxes from spam messages, recommends purchases on shopping sites, and deciphers our handwriting to efficiently route our mail.

What is machine learning?

> Teaching machines how to learn to carry out tasks by themselves.

Machine learning algorithms do not rely on static instructions provided by human programmers. Instead, they use examples to develop a model from which they can make predictions or decisions. 

Machine learning is now used in many fields and for many tasks:
- Detecting spam emails
- Recommending products based on past purchases
- Recognizing handwritten text
- Image segmentation and object recognition
- Self-driving cars
- Speech recognition
- and so on ...

## Computer programs and machine learning

There are similarities between writing a computer program and building a machine learning model. We have a set of inputs, a processing step, and then we get an output.

The difference is in what we need to supply. 

When writing a computer program we specify the processing that will be performed. We then hope that an appropriate output is returned when we supply input values.

![Computer program](files/code.png)

When building a machine learning model we specify the outputs we expect for a variety of inputs and the machine learning algorithm is then used to define the necessary processing.

![Machine learning](files/ml.png)
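
The contrast can be made concrete with a toy sketch. The temperature data, the function names, and the "try every observed value as a threshold" search are all assumptions made for illustration; real learning algorithms search far larger spaces of possible processing steps.

```python
# Traditional program: the programmer writes the rule.
def is_hot_programmed(temperature):
    return temperature > 30.0  # threshold chosen by the programmer


# Machine learning: we supply example inputs with the outputs we expect ...
examples = [(12.0, False), (18.0, False), (22.0, False),
            (27.0, True), (31.0, True), (35.0, True)]


# ... and an algorithm searches for the processing step (here, a threshold).
def learn_threshold(examples):
    """Try each observed temperature as a threshold; keep the one with fewest mistakes."""
    def mistakes(threshold):
        return sum((temp > threshold) != label for temp, label in examples)

    candidates = sorted(temp for temp, _ in examples)
    return min(candidates, key=mistakes)


threshold = learn_threshold(examples)


def is_hot_learned(temperature):
    return temperature > threshold
```

Note that the learned rule follows the labels in the examples, which need not agree with the rule a programmer would have written by hand.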

When we write a program there are a variety of different ways we can approach the task. We might look for existing solutions to similar problems. Or we might attempt to divide the task into smaller sub-tasks that are easier to tackle. Finally, as we finalize our solution our approach may be closer to trial and error as we fix the remaining bugs.

A machine learning algorithm is not so flexible. The approach taken is generally one of trial-and-error. An initial guess is gradually improved over multiple rounds in a process of **optimization**.
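
As a minimal sketch of what "gradually improving an initial guess" can look like, the example below fits a single parameter `w` in the model `y = w * x` using gradient descent. The data, the learning rate, and the number of rounds are assumptions chosen by hand for illustration.

```python
# Hypothetical data generated by the rule y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mean_squared_error(w):
    """How badly the guess w explains the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w):
    """Slope of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # initial guess
learning_rate = 0.01  # step size, chosen by hand
errors = []
for step in range(200):
    errors.append(mean_squared_error(w))
    w -= learning_rate * gradient(w)  # move a small step downhill
```

Each round nudges `w` in the direction that reduces the error, so the values recorded in `errors` shrink as `w` approaches 2.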



Machine learning algorithms can be categorized in several different ways.

We will discuss many of these today and explore the most commonly used in greater detail over the next several weeks.

- Type of task
- Batch vs online
- Supervision

## Type of task

- In __classification__ the algorithm must assign inputs to one of two (or more) classes. 
- In __regression__ the algorithm returns a value for each set of inputs received.
- In __clustering__ the algorithm divides the inputs supplied into two or more subgroups.
- In __density estimation__ the algorithm constructs an estimate of the population distribution based on a small subset of the whole.
- Finally, __dimensionality reduction__ maps inputs onto a lower-dimensional space.

## Batch vs online

In __batch learning__ algorithms all the data is available and can be processed at one time. In __online learning__ algorithms only part of the data is available for inclusion at any one time.

## Supervision

In __supervised learning__ each of the supplied inputs is labeled with the desired output. In __unsupervised learning__ no labels are available and the algorithm must find the underlying structure in the data itself. Somewhere in the middle is __semi-supervised learning__, in which only some of the supplied inputs are labeled with the desired output.

In __reinforcement learning__ the algorithm must interact with an environment to achieve a certain goal, without being told explicitly which actions are correct.

## Classification

Classification is concerned with assigning inputs to different classes or sub-groups.

Examples include predicting survivors/non-survivors, response to a therapy, etc.

![Example data](files/classification-data.png)

![Example plot](files/classification.png)
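
A minimal sketch of a classifier, assuming a nearest-centroid rule chosen here purely for brevity: each class is summarized by the mean of its training examples, and a new input is assigned to the class whose mean is closest. The data and function names are hypothetical.

```python
import numpy as np

# Hypothetical training data: two features per example, labels 0 or 1.
X_train = np.array([[1.0, 1.2], [0.8, 1.0], [1.2, 0.9],
                    [3.0, 3.1], [3.2, 2.9], [2.9, 3.3]])
y_train = np.array([0, 0, 0, 1, 1, 1])


def fit_centroids(X, y):
    """Return the class labels and the mean feature vector of each class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(X, classes, centroids):
    """Assign each row of X to the class with the nearest centroid."""
    # distances[i, j] is the Euclidean distance from row i to centroid j.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]


classes, centroids = fit_centroids(X_train, y_train)
X_new = np.array([[1.1, 1.0], [3.1, 3.0]])
predictions = predict(X_new, classes, centroids)
```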

## Regression

Regression is concerned with predicting a continuous output given an input.

Examples include predicting prices, production, demand, etc.

![Regression example](files/regression.png)
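
A minimal sketch of regression, assuming a straight-line model fitted by least squares with `numpy.polyfit`. The floor areas and prices are invented for illustration.

```python
import numpy as np

# Hypothetical data: floor area (square metres) and price (thousands).
area = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
price = np.array([155.0, 178.0, 242.0, 297.0, 363.0])

# Least-squares fit of a straight line: price ~ slope * area + intercept.
# polyfit returns the coefficients with the highest power first.
slope, intercept = np.polyfit(area, price, deg=1)

# Predict a continuous output for an input that was not in the data.
predicted = slope * 90.0 + intercept
```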

## Clustering

Clustering attempts to find structure in the data by grouping similar items together. The items within a group should be more similar to each other than to items in other groups. The approach taken to determine similarity and to assign items to a group can vary depending on the algorithm used.

Choosing the number of groups into which to divide the observations can be challenging. Some algorithms rely on the number of groups being specified, while other algorithms choose the number of groups based on different parameters.

![Otsu](clustering.png)
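
As one illustration, the sketch below implements a bare-bones version of k-means, an algorithm that relies on the number of groups `k` being specified. Similarity is measured by Euclidean distance. Using the first `k` points as the starting centroids is a simplifying assumption, not a recommended initialization, and the data are hypothetical.

```python
import numpy as np


def k_means(X, k, n_iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = X[:k].copy()
    for _ in range(n_iterations):
        # Assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Hypothetical data with two visually obvious groups.
X = np.array([[1.0, 1.0], [5.0, 5.2], [1.2, 0.8],
              [4.8, 5.0], [0.9, 1.1], [5.1, 4.9]])
labels, centroids = k_means(X, k=2)
```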

## Density estimation

In density estimation the goal is to determine the distribution of a population from an observed sample.

The most common example is a histogram. Although simple, histograms are susceptible to misinterpretation because different bin sizes and positions can significantly change their appearance. More robust approaches, such as kernel density estimation, avoid these issues.

![kernel-density-estimation](files/density-estimation.png)
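
A minimal sketch comparing the two ideas, assuming a Gaussian kernel and a hand-picked bandwidth. The function name `kernel_density`, the sample, and the grid are hypothetical.

```python
import numpy as np


def kernel_density(sample, grid, bandwidth):
    """Average a Gaussian bump centred on every observation."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)


sample = np.array([1.0, 1.3, 1.9, 2.1, 2.4, 4.8, 5.1, 5.5])  # hypothetical
grid = np.linspace(-2.0, 9.0, 1101)
density = kernel_density(sample, grid, bandwidth=0.5)

# The same data summarized with two different histogram binnings.
counts_coarse, edges_coarse = np.histogram(sample, bins=3)
counts_fine, edges_fine = np.histogram(sample, bins=8)
```

The kernel estimate is a smooth curve that does not depend on where bin edges happen to fall, although it does depend on the choice of bandwidth.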

## Dimensionality reduction

Dimensionality reduction attempts to map a high-dimensional feature set onto a lower-dimensional space.

This approach can be used for both feature selection and feature extraction.

Feature selection attempts to pick the subset of variables that best maintains the overall performance of the system.

Feature extraction transforms the data in the high-dimensional space into a lower-dimensional space. By maximizing the variance in the transformed variables, it is hoped that the most important variance in the high-dimensional space is maintained.

![Feature selection](files/feature-selection.png)
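
The text does not name a specific feature extraction method. The sketch below assumes principal component analysis (PCA), computed with a singular value decomposition, as one common technique that keeps the directions of greatest variance. The data and the function name `pca` are hypothetical.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its n_components directions of greatest variance."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance


# Hypothetical data: two features that are almost perfectly correlated.
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]])
X_reduced, components, explained_variance = pca(X, n_components=1)

# Fraction of the total variance kept by the single retained component.
total_variance = X.var(axis=0, ddof=1).sum()
retained_fraction = explained_variance.sum() / total_variance
```

Because the two features move together, one transformed variable retains nearly all of the variance in the original two.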

## Batch learning

In batch, or offline, learning all the data is available at the same time. This is the ideal situation.

## Online learning

The alternative to batch learning is online learning. In this situation we don't have access to all the data. There are a variety of reasons for this. The dataset may simply be too large to process at the same time. Additional data may also be generated as a function of time. Online learning is also important when the properties of the underlying system are changing.

### Statistical learning models

When the system is stable, the advantage of online learning is the ability to utilize the maximum amount of data. Some of the algorithms used in batch learning can also be applied to online learning with minimal modification. Ideally, an online algorithm would need only the next training example, the current state of the function, and a minimal set of additional information whose size is independent of the total number of observations previously processed.

Algorithms such as stochastic gradient descent, which update the model one example at a time, are well suited to online learning. Non-linear algorithms may be harder to implement.
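
A minimal sketch of an online update, assuming a linear model trained with one gradient step per example. The update needs only the current weights and the next example; nothing about earlier observations is stored. The data stream, learning rate, and function names are hypothetical.

```python
def online_update(weights, x, y, learning_rate=0.1):
    """One gradient step for the model y ~ w0 + w1 * x using a single example."""
    w0, w1 = weights
    error = (w0 + w1 * x) - y
    return (w0 - learning_rate * error, w1 - learning_rate * error * x)


def stream():
    """Hypothetical data source yielding one (x, y) pair at a time, with y = 1 + 2x."""
    for i in range(5000):
        x = (i % 10) / 10.0
        yield x, 1.0 + 2.0 * x


weights = (0.0, 0.0)  # current state of the model
for x, y in stream():
    weights = online_update(weights, x, y)  # each example is used once, then discarded
```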

### Adversarial models

If the underlying system being modeled is changing, a different approach is needed. The system may change for a variety of reasons. For example, if we were predicting mortality in patients, the performance of our model would gradually drop as our therapies improve. The system may also be actively changing in an attempt to decrease the model's performance. For example, detection of spam email is a constant battle between those creating spam messages and those attempting to detect and capture them.

Regardless of why the system is changing, recent examples will be more valuable than examples seen further in the past.
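
One simple way to favor recent examples, shown here as an assumption rather than as the method used later in the course, is an exponentially weighted average. The observations and the value of `alpha` are hypothetical.

```python
def exponential_average(values, alpha=0.2):
    """Running estimate in which older observations count for progressively less."""
    estimate = values[0]
    for value in values[1:]:
        # Keep a fraction (1 - alpha) of the old estimate, blend in the new value.
        estimate = (1 - alpha) * estimate + alpha * value
    return estimate


# Hypothetical drifting system: the true rate drops from 0.30 to 0.10 halfway through.
observations = [0.30] * 50 + [0.10] * 50

plain_mean = sum(observations) / len(observations)    # treats all examples equally
recent_estimate = exponential_average(observations)   # tracks the current rate
```

The plain mean sits between the old and new rates, while the weighted estimate ends up close to the current rate of 0.10.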

## Supervised learning

In supervised learning a set of labeled training data is available. The task of the algorithm is to find the underlying structure in the data such that for previously unseen inputs a reasonable output can be predicted.

There are a variety of different algorithms each with their particular strengths and weaknesses. 

![Example classification data](files/classification-data.png)

## Unsupervised learning

Unsupervised learning is concerned with finding the hidden structure in unlabeled data. Without labels, evaluating the performance of a potential solution can be problematic. The approach taken will depend on the algorithm being used.

## Reinforcement learning

Reinforcement learning focuses on adapting the behaviour of an agent in a particular environment such that the cumulative reward is maximized.

Inputs are not paired with correct outputs leaving the algorithm to estimate the utility of any action based on experience. There is a trade-off between taking an action that has had a positive outcome in the past and trying alternative actions that may have higher rewards.

The important point here is that taking the optimal action can have no reward, and taking an inferior action can have a positive reward. Knowing which is which only becomes clear with multiple trials.

This trade-off is perhaps most widely studied in multi-armed bandit problems.
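
A minimal sketch of a multi-armed bandit, assuming an epsilon-greedy strategy: with a small probability the agent explores a random arm, otherwise it exploits the arm with the best estimated reward so far. The payout probabilities, the value of `epsilon`, and the fixed seed are hypothetical choices for illustration.

```python
import random


def epsilon_greedy(true_probabilities, n_rounds=5000, epsilon=0.1, seed=0):
    """Play a bandit whose arms pay 1 with the given (hidden) probabilities."""
    rng = random.Random(seed)
    n_arms = len(true_probabilities)
    counts = [0] * n_arms       # how often each arm has been pulled
    estimates = [0.0] * n_arms  # running mean reward of each arm
    total_reward = 0
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # Even the best arm sometimes pays nothing on a given pull.
        reward = 1 if rng.random() < true_probabilities[arm] else 0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, estimates, total_reward


counts, estimates, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
```

The agent never sees the true probabilities. It can only learn which arm is best by pulling each one repeatedly, which is why a single outcome says little about whether an action was a good choice.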

# Machine learning beyond the algorithms

Much of our time working on machine learning algorithms is actually spent on relatively mundane tasks:

- Reading and cleaning the data
- Understanding the structure of the data and changing it to be suitable for use with our tools.
- Applying our machine learning approach of choice __(perhaps 10% of our time)__
- Measuring performance

We will be spending much of our time with tools we are already familiar with. We will likely rely on Python itself, NumPy, SciPy, pandas, and matplotlib.