# Introduction

This book aims to provide an accessible introduction to applying machine learning with Python, in particular using the scikit-learn library.
I assume that you're already somewhat familiar with Python and the libraries of the scientific Python ecosystem. If you find that you
have a hard time following some of the details of numpy, matplotlib and pandas, I highly recommend you look at Jake VanderPlas' [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/).

## Scope and Goals

After reading this book, you will be able to do exploratory data analysis on a dataset, evaluate potential machine learning solutions, and implement and assess them.
The focus of the book is on tried-and-true methodology for solving real-world machine learning problems. However, we will not go into the details of productionizing and deploying the solutions.
We will mostly focus on what's known as tabular data, i.e. data that would usually be represented as a pandas DataFrame, Excel spreadsheet, or CSV file. While we will discuss working with text data in a later chapter, there are many more advanced techniques, for which I'll point you towards [Dive into Deep Learning](https://d2l.ai/) by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola.
We will not look at image recognition, video or speech data, or time series forecasting, though many of the core concepts described in this book will also apply there.

## What is machine learning?
Machine learning, also known as predictive modeling in statistics, is a research field and a collection of techniques to extract knowledge from data, often used to automate decision-making processes. Applications of machine learning are pervasive in technology, in particular in complex websites such as Facebook, Amazon, YouTube or Google. These sites use machine learning to personalize the experience, show relevant content, decide on advertisements, and much more. Without machine learning, none of these services would look anything like they do today.
Outside the web, machine learning has also become integral to commercial applications in manufacturing, logistics, material design, financial markets and many more. Finally, over the last few years, machine learning has also become essential to research in practically all data-driven sciences, including physics, astronomy, biology, medicine, earth sciences and social sciences.

There are three main sub-areas of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Each applies to a somewhat different setting. We'll discuss each in turn, and give some examples of how they can be used.

### Supervised Learning
Supervised learning is by far the most commonly used in practice. In supervised learning, a model is built from a dataset of input-output pairs, where the input is known as features or independent variables, which we'll denote by $x$, and the output is known as target or label, which we'll denote by $y$. The input here is a representation of an entity of interest, say a customer of your online shop, represented by their age, location and shopping history. The output is a quantity of interest that we want our model to predict, say whether they would buy a particular product if we recommend it to them. To build a model, we need to collect many such pairs, i.e. we need to build records of many customers and their decisions about whether or not they bought the product after a recommendation was shown to them. Such a set of input-output pairs for the purpose of building a supervised machine learning model is called a *training set*.

```{margin} TODO
(customers of bank easier? because discrete products?)
```

Once we have collected this dataset, we can (attempt to) build a supervised machine learning model that will make a prediction for a new user who wasn't included in the training dataset. That might enable us to make better recommendations, i.e. only show recommendations to a user who's likely to buy.
```{margin} TODO
(given an example?)
```
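
As a minimal sketch of what such a training set and model could look like in code: the customer records below are made up for illustration, the column names (`age`, `past_purchases`, `bought`) are assumptions rather than a real schema, and logistic regression is just one of many models that could be used here.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training set: one row per customer who was shown the recommendation.
train = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 41, 38],
    "past_purchases": [1, 7, 3, 12, 0, 9, 2, 15],
    "bought": [0, 1, 0, 1, 0, 1, 0, 1],  # target: did they buy the product?
})

X_train = train[["age", "past_purchases"]]  # input features x
y_train = train["bought"]                   # target y

# Build (fit) the supervised model on the input-output pairs.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# A new customer who was not part of the training set.
new_customer = pd.DataFrame({"age": [33], "past_purchases": [8]})
prediction = model.predict(new_customer)
print(prediction)
```

With only eight made-up rows, the prediction itself means little; the point is the shape of the workflow: collect input-output pairs, fit, then predict for an unseen input.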

The name supervised learning comes from the fact that during learning, the dataset contains the correct targets, which act as a supervisor for the model training.



For both regression and classification, it’s important to keep in mind the concept of generalization. Let’s say we have a regression task. We have features, that is, data vectors $x_i$, and targets $y_i$, drawn from a joint distribution. We now want to learn a function $f$ such that $f(x)$ is approximately $y$, not on this training data, but on new data drawn from the same distribution. This is what’s called generalization, and it is the core distinction from function approximation. In principle we don’t care how well we do on the $x_i$; we only care how well we do on new samples from the distribution. We’ll go into much more detail about generalization later, when we dive into supervised learning.

#### Classification and Regression
```{margin}
 $^1$ There are many other kinds of supervised learning tasks such as ranking or probability estimation, however, we will focus on classification and regression, the most commonly used tasks, in this book.
```
There are two main kinds of supervised learning tasks, called classification and regression$^1$. If the target of interest $y$ that we want to predict is a continuous quantity, the task is a regression problem. If it is discrete, i.e. one of several distinct choices, then it is a classification problem. For example, predicting the time it will take a patient to recover from an illness, say measured in days, is a regression task. We might want our model to predict whether a patient will be ready to leave a hospital 3.5 days after admission or 5 or 10. This is regression because the time is clearly a continuous quantity, and there is a clear sense of ordering and distance between the different possible predictions. If the correct prediction is that the patient can leave after 4.5 days, but instead we predict 5, that might not be exactly correct, but it might still be a useful prediction. Even 6 might be somewhat useful, while 20 would be totally wrong.

```{margin} TODO?
$^2$ This might be more naturally formulated as a multi-label task, which is basically a series of binary classification tasks. There could be more than one medication that leads to success, so this could be phrased as a yes/no question for each candidate.
```

An example of a classification task would be predicting which of a set of medications the patient would respond best to$^2$. Here, we have a fixed set of disjoint candidates that are known a priori, and there is usually no order or sense of distance between the classes. If medication A is the best, then predicting any other medication is a mistake, so we need to predict the exact right outcome for the prediction to be accurate. A very common instance of classification is the special case of binary classification, where there are exactly two choices. Often this can be formulated as a "yes/no" question to which you want to predict an answer. Examples of this are "is this email spam?", "is there a pedestrian on the street?", "will this customer buy this product?" or "should we run an X-ray on this patient?".

The distinction between classification and regression is important, as it will change the algorithms we will use, and the way we measure success. For classification, a common metric is *accuracy*, the fraction of correctly classified examples, i.e. the fraction of times the model predicted the right class. For regression, on the other hand, a common metric is *mean squared error*, which is the average squared distance between the prediction and the correct answer. In other words, in regression, you want the prediction to be close to the truth, while in classification you want to predict exactly the correct class. In practice, the difference is a bit more subtle, and we will discuss model evaluation in depth in chapter TODO.
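
To make the two metrics concrete, here is a small sketch that computes both with scikit-learn on made-up numbers. The arrays are invented for illustration; the mean squared error is also computed by hand as the mean of the squared differences, to show what the library function does.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: made-up true classes and predicted classes.
y_true_class = np.array([1, 0, 1, 1])
y_pred_class = np.array([1, 0, 0, 1])
accuracy = accuracy_score(y_true_class, y_pred_class)  # 3 of 4 correct
print(accuracy)  # 0.75

# Regression: made-up recovery times in days.
y_true_days = np.array([4.5, 3.0, 10.0])
y_pred_days = np.array([5.0, 3.0, 8.0])
mse = mean_squared_error(y_true_days, y_pred_days)
# Same value computed by hand: the mean of the squared differences.
mse_by_hand = np.mean((y_true_days - y_pred_days) ** 2)
print(mse, mse_by_hand)
```

Note how the regression metric rewards being close (the prediction of 5.0 for 4.5 contributes little), while accuracy only counts exact matches.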

Usually it's quite clear whether a task is classification or regression, but there are some cases that could be solved using either approach. A somewhat common example is ratings in the 5-star rating system that's popular on many online platforms. Here, the possible ratings are one star, two stars, three stars, four stars and five stars. So these are discrete choices, and you could apply a classification algorithm. On the other hand, there is a clear ordering, and if the real answer is one star, predicting two stars is probably better than predicting five stars, which means it might be more appropriate to use regression. Here, which one is more appropriate depends on the particular algorithm you're using and how it integrates into your larger workflow.

#### Generalization
When building a model for classification or regression, keep in mind that what we're interested in is applying the model to new data for which we do not know the outcome. If we build a model for detecting spam emails, but it only works on emails in the training set, i.e. emails the model has seen during building of the model, it will be quite useless. What we want from a spam detection algorithm is to predict reasonably well whether an email is spam or not for a *new email* that was not included in the training set. The ability for a supervised model to make accurate predictions on new data is called *generalization* and is the core goal of supervised learning.
Without asking for generalization, an algorithm could solve the spam detection task on the training data by just storing all the data, and when presented with one of these emails, looking up what the correct answer was. This approach is known as memorization, but it's impossible to apply to new data.
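
The memorization approach can be sketched in a few lines of plain Python. The emails and the helper `predict_by_lookup` are made up for illustration; this is not a real spam filter, just a lookup table.

```python
# Made-up training set of (email text, is_spam) pairs.
training_emails = [
    ("Win a free cruise now", True),
    ("Meeting moved to 3pm", False),
    ("Cheap pills, no prescription", True),
    ("Lunch tomorrow?", False),
]

# "Training" by memorization: store every email together with its label.
lookup = dict(training_emails)

def predict_by_lookup(email):
    """Return the stored label, or None if the email was never seen."""
    return lookup.get(email)

# Perfect on the training set ...
train_predictions = [predict_by_lookup(text) for text, _ in training_emails]
train_labels = [label for _, label in training_emails]
print(train_predictions == train_labels)  # True

# ... but no answer at all for an email that was not in the training set.
new_prediction = predict_by_lookup("Win a free phone now")
print(new_prediction)  # None
```

The lookup table is perfect on the training data and useless on anything else, which is why performance on the training set alone says little about generalization.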

#### Conditions for success

For a supervised learning model to generalize well, i.e. for it to be able to learn to make accurate predictions on new data, some key assumptions must be met:

First, **the necessary information for making the correct prediction actually needs to be encoded in the training data**. For example, if I try to learn to predict a fair coin flip before the coin is tossed, I won't be able to build a working machine learning model, no matter what I choose as the input features. The process is essentially random, and the information to make a prediction is just not available. More technically, one might say the process has high intrinsic randomness that we cannot overcome by building better models. While you're unlikely to encounter a case as extreme (and obvious) as a coin toss, many processes in the real world are quite random (such as the behavior of people) and it's impossible to make entirely accurate predictions for them.

In other cases, a prediction might be possible in principle, but we might not have provided the right information to the model. For example, it might be possible for a machine learning model to learn to diagnose pneumonia in a patient, but not if the only information about the patient that we give the model is their shopping habits and wardrobe. If we use a chest x-ray as a representation of the patient, together with a collection of symptoms, we will likely have better success.
Even if the information is represented in the input, learning might still fail if the model is unable to extract the information. For example, visual stimuli are very easy to interpret for humans, but in general much harder to understand for machine learning algorithms.
Consequently, it would be much harder for a machine to determine if a graffiti is offensive by presenting it with a photograph, than if the same information was represented as a text file.

Secondly, **the training dataset needs to be large and varied enough to capture the variability of the process**. In other words, the training data needs to be representative of the whole process, not only representing a small portion of it. Humans are very good at abstracting properties, and a child will be able to understand what a car is after seeing only a handful. Machine learning algorithms on the other hand require a lot of variability to be present. For example, to learn the concept of what a car looks like, an algorithm likely needs to see pictures of vans, of trucks, of sedans, pictures from the front, the side and above, pictures parking and in traffic, pictures in rain and in sunshine, in a garage and outdoors, maybe even pictures taken by a phone camera and pictures taken by a news camera. As we said before, the whole point of supervised learning is to generalize, so we want our model to apply to new settings. However, how new a setting can be depends on the representation of the data and the algorithm in question. If the algorithm has only ever seen trucks, it might not recognize a sedan. If the algorithm has never seen a snow-covered car, it's unlikely it will recognize it.
Photos (also known as natural images in machine learning) are a very extreme example as they have a lot of variability, and so often require a lot of training data. If your data has a simple structure, or the relationship between your features and your target is simple, then only a handful of training examples might be enough.
```{margin} TODO
example of simple training task
```

Third and finally, **the data that the model is applied to needs to be generated from the same process as the data the model was trained on**. A model can only generalize to data that basically adheres to the same rules and has the same structure.

```{admonition} Mathematical Background

From a mathematical standpoint, supervised learning assumes that there is a joint distribution $p(x, y)$ and that the training dataset consists of independent, identically distributed (i.i.d.) samples from this joint distribution.
The model is then applied to new data sampled from the same distribution, but $y$ is unknown. The model is then used to estimate $p(y | x)$, or more commonly the mode of this distribution, i.e. the most likely value for $y$ to take given the $x$ we observed.
In the case of learning to predict a coin flip, you could actually learn a very accurate model of $p(y | x)$, that predicts heads and tails with equal probability. There is no way to predict the particular outcome itself, though.


The third requirement for success is easily expressed as saying that the test data is sampled i.i.d. from the same distribution $p(x, y)$ that the training data was generated from.
```
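
The coin flip case can be simulated as a rough sketch. The features below are random numbers that, by construction, carry no information about the outcome, and logistic regression is an arbitrary choice of model; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up features that carry no information about the outcome.
X_train = rng.normal(size=(10_000, 3))
X_test = rng.normal(size=(10_000, 3))
# Fair coin flips: 1 = heads, 0 = tails, independent of the features.
y_train = rng.integers(0, 2, size=10_000)
y_test = rng.integers(0, 2, size=10_000)

model = LogisticRegression().fit(X_train, y_train)

# On average, the estimated probability of heads is close to 0.5 ...
mean_p_heads = model.predict_proba(X_test)[:, 1].mean()
# ... and the accuracy of the predicted outcomes stays near chance level.
test_accuracy = model.score(X_test, y_test)
print(mean_p_heads, test_accuracy)
```

The model's probability estimate is sensible, yet its predictions of individual outcomes are no better than guessing, because the information needed is simply not in the data.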

```{margin} TODO math
define generalization here for real? Or do that in chapter 2? decompose classification error into irreducible error, estimation error etc...
```

### Unsupervised Learning
In unsupervised machine learning, we are usually just given data points $x$, and the goal is to learn something about the structure of the data. This is usually a more open-ended task than what we saw in supervised learning. This kind of task is called unsupervised, because even during training, there is no "supervision" providing a correct answer. There are several sub-categories of unsupervised learning that we'll discuss in Chapter 3, in particular clustering, dimensionality reduction, and signal decomposition.
Clustering is the task of finding coherent groups within a dataset, say subgroups of customers that behave in a similar way, say "students", "new parents" and "retirees", that each have a distinct shopping pattern.
However, here, in contrast to classification, the groups are not pre-defined. We might not know what the groups are, how many groups there are, or even if there is a coherent way to define any groups.
There might also be several different ways the data could be grouped: say you're looking at portraits. One way to group them could be by whether the subject has glasses or not. Another way to group them could be by the direction they are facing. Yet another might be hair color or skin color. If you tell an algorithm to cluster the data, you don't know which aspect it will pick up on, and usually manually inspecting the groups or clusters is the only way to interpret the results.
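
A minimal clustering sketch, assuming synthetic two-dimensional data and k-means as the algorithm (one of many possible choices). Note that the number of clusters has to be chosen by us, and the algorithm never sees any group labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups; the true group labels are thrown away.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# We have to choose the number of clusters ourselves.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(cluster_labels[:10])
print(kmeans.cluster_centers_)
```

The cluster numbers returned are arbitrary identifiers; interpreting what each cluster means still requires inspecting the points assigned to it.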

Two other, related, unsupervised learning tasks are dimensionality reduction and signal decomposition. In these, we are not looking at groups in the data, but underlying factors of variance, that are potentially more semantic than the original representation.
Going back to the example of portraits, an algorithm might find that head orientation, lighting and hair color are important aspects of the image that vary independently. In dimensionality reduction, we are usually looking for a representation that is lower-dimensional, i.e. that has fewer variables than the original feature space. This can be particularly useful for visualizing datasets with many features, by projecting them into a two-dimensional space that's easily plotted.
A common application of signal decomposition is topic modeling of text data. Here, we are trying to find *topics* among a set of documents, say news articles, or court documents, or social media posts. This is related to clustering, though with the difference that each document can be assigned multiple topics, e.g. topics in the news could be politics, religion, sports and economics, and an article could be about both politics and economics.
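
As a small sketch of dimensionality reduction for visualization, the following projects a four-feature dataset down to two dimensions. The iris dataset bundled with scikit-learn and principal component analysis (PCA) are stand-ins chosen for brevity, not the portrait example from the text.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# Find a two-dimensional representation of the four-dimensional data.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)
# Fraction of the total variance captured by each of the two components.
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can be passed directly to a scatter plot to get a first visual impression of the data.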

Both clustering and signal decomposition are most commonly used in exploratory analysis, where we are trying to understand the data. They are less commonly used in production systems, as they lend themselves less easily to automating a decision process.
Sometimes signal decomposition is used as a mechanism to extract more semantic features from a dataset, on top of which a supervised model is learned. This can be particularly useful if there is a large amount of data, but only a small amount of annotated data, i.e. data for which the outcome $y$ is known.


### Reinforcement Learning
The third main family of machine learning tasks is reinforcement learning, which is quite different from the other two. Both supervised and unsupervised learning basically work on a dataset that was collected and stored, from which we then build a model. Potentially, this model is then applied to new data in the future.
In reinforcement learning, on the other hand, there is no real notion of a dataset. Instead, reinforcement learning is about a program (usually known as an agent) interacting with a particular environment. Through this interaction, the agent learns to achieve a particular goal. A good example of this is a program learning to play a video game. Here, the agent would be an AI playing the game, while the environment would be the game itself, i.e. the world in which it plays out. The agent is presented with the environment, and has choices of actions (say moving forward and backward and jumping) and each of these actions will result in the environment being in a new state (i.e. with the agent placed a bit forward, or backward, or falling in a hole). Given the new environment, the agent again can choose an action and the environment will be in a new state as a consequence.
The learning in reinforcement learning happens with so-called *rewards*, which need to be specified by the data scientist building the system. The agent is trained to seek rewards (hence the name reinforcement learning), and will find series of actions that maximize its reward. In a game, a reward could be given to the agent every time it scores points, or just once when the agent wins the game. In the second case, there might be a long delay between the agent taking an action, and the agent winning the game, and one of the main challenges in reinforcement learning is dealing with such settings (this is known as the credit assignment problem: which of my actions should I give credit for me winning the game?).
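
To illustrate the loop of agent, environment, actions and rewards, here is a toy sketch using tabular Q-learning, one standard reinforcement learning algorithm. The "game" is an invented five-cell corridor in which the agent starts at the left end and is rewarded only once, when it reaches the right end; the environment, the `step` function and all parameter values are assumptions made for this example.

```python
import numpy as np

n_states, n_actions = 5, 2  # actions: 0 = step left, 1 = step right
goal = n_states - 1         # reaching the right end wins the game

def step(state, action):
    """Environment: return the next state and the reward for an action."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
q = np.zeros((n_states, n_actions))  # estimated future reward per state/action
alpha, gamma = 0.5, 0.9              # learning rate and discount factor

for episode in range(500):
    state = 0
    while state != goal:
        action = int(rng.integers(n_actions))  # explore by acting randomly
        next_state, reward = step(state, action)
        # Q-learning update: move the estimate towards
        # reward + discounted value of the best next action.
        target = reward + gamma * q[next_state].max()
        q[state, action] += alpha * (target - q[state, action])
        state = next_state

# The learned greedy policy for the non-goal states.
policy = q[:goal].argmax(axis=1)
print(policy)
```

Even though the reward only arrives at the very end, the update rule passes credit backwards from state to state, so after enough episodes the greedy policy chooses "step right" in every cell.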

Compared with supervised learning, reinforcement learning is a much more indirect way to specify the learning problem: we don't provide the algorithm with the correct answer (i.e. the correct sequence of actions to win the game), instead we only reward the agent once it achieves a goal. This can work surprisingly well in practice. This is like learning a game without someone ever telling you the rules, or what the goal of the game is, but only telling you whether you lost or won at the end. As you might expect, it might take you many, many tries to figure out the game.

However, algorithms are notoriously patient, and researchers have been able to use reinforcement learning to create programs that can play a wide variety of complex games. Potentially one of the most surprising and impressive feats was learning to play the ancient Chinese board game of Go at a superhuman level.
```{margin} TODO citations etc, numbers of games, years...
```
When this was publicized in TODO, many researchers in the area were shocked, as the game was known to be notoriously hard, and many believed it could not be learned by any known algorithms. While the initial work used some human knowledge, later publications learned to play the game from scratch, i.e. without any rewards other than for winning the game, by the agent repeatedly playing the game against itself. The resulting programs are now playing at a superhuman level, meaning that they are basically unbeatable, even by the best human players in the world. Similar efforts are now underway for other games, in particular computer games like StarCraft II and DOTA.
```{margin}
Algorithms achieved super-human performance in the game of chess long before this, in the year TODO with the famous play of Kasparov against Deep Blue.
Chess has far fewer possible moves, and its games are much shorter sequences of actions than those in Go or StarCraft, which makes it much easier to devise algorithms to play chess.

```

Reinforcement learning also has a long history in other areas, in particular robotics, where it is used for learning and tuning behaviors, such as walking or grasping.
While many impressive achievements have been made with reinforcement learning, there are several aspects that limit its broad usefulness.
A potential application of reinforcement learning could be self-driving cars. However, as mentioned above, reinforcement learning usually requires many attempts or iterations before it learns a specific task.
If I wanted to train a car to learn to park, it might fail thousands or hundreds of thousands of times first. Unfortunately, in the real world this is impractical: self-driving cars are very expensive, and we don't want to crash them over and over again. It might also be risky for the person conducting the experiment. With thousands of attempts, even if the car doesn't crash, the gas will run out, and the person having to reset the experiment every time will probably get very tired very quickly.
Therefore, reinforcement learning is most successful when there is a good way to simulate the environment, as is the case with games, and with some aspects of robotics.
For learning how to park, a simulation might actually work well, as the sensing of other cars and the steering of the car can be simulated well.
However, for really learning how to drive, a car would need to be able to deal with a variety of situations, such as different weather conditions, crowded streets, people running on the streets, kids chasing balls, navigating detours and many other scenarios.
Simulating these in a realistic way is very hard, and so reinforcement learning is much harder to apply in the physical world.

A setting that has attracted some attention, and might become more relevant soon, is online platforms that are not games. You could think of a social media timeline as an agent that gets rewarded for you looking at it.
Right now, this is often formulated as a supervised learning task (TODO or more accurately active learning). However, your interactions with social media are not usually independent events. Instead, your behavior online
is shaped by what is being presented to you, and what was shown to you in the past might influence what is shown to you in the future.
A maybe somewhat cynical analogy would be to think of this as a timeline being an agent, playing you, winning whenever you stay glued to the screen (or click an ad or buy a product).
I'm not aware that this has been implemented anywhere, but as computational capacity increases and algorithms become more sophisticated, it is a natural direction to explore.

Reinforcement learning is a fascinating topic, but much beyond the scope of this book. For an introduction, see TODO Sutton and Barto.
For an overview of modern approaches, see TODO. 

As you might have noticed in the table of contents, this book mostly concerns itself with supervised and unsupervised learning, and we will not discuss reinforcement learning any further.
As a matter of fact, the book heavily emphasizes supervised learning, which has found the largest success among the three in practical applications so far.
While all three of these areas are interesting in their own right, when you see an application of machine learning, or when someone says they are using machine learning for something,
chances are they mean supervised learning, which is arguably the most well-understood, and the easiest to productionize and analyze.

### Isn't this just statistics?
Before I go into some general principles, I want to position machine learning in relation to statistics. I recently got chewed out by a colleague for doing that. My goal here is not to say one is better than the other. Actually, there’s really no clear boundary between statistics and machine learning, and anyone who tells you otherwise is lying. Two of the books I recommended for the course are actually statistics textbooks. But I can tell you how the tools that I’m talking about in this course will differ from what you’d learn in a typical stats course.

Statistics is usually about inference, often phrased in terms of hypothesis testing. An example might be a yes-no-question, such as “are women less likely to enroll in a Data Science Program”, and you have a sample population, for example this classroom, and you can then try to make an inference about whether this statement is true. Often this includes making assumptions on how your sample relates to the general population, say this class vs all of DSI or Columbia vs all of the US. You might also have a specific model of how the process behind your question works.

On the other hand, machine learning is about prediction and generalization. We want to learn from past data to predict outcomes on future, unseen data.

We usually want to make statements about individual data points, and we want to build a model that will work on new data that fulfills our assumptions, independent of the population we sampled. Often we don’t have or need a model of the process, but we rely on the assumption that our training data is generated from the same process as any future data will be.

There are statisticians that do predictions and there are machine learning scientists that do inference, but I find this distinction helpful.

Again I’m not saying one or the other is better, I’m just saying that you should know what kind of problem you are trying to solve, and what the right tool for the problem is. And then you can call it machine learning or statistics or probabilistic inference or data science. The tools you learn in this class will usually not help you to make yes/no inferences, and they will only give you a limited insight into the data generating process.

## When to use and not to use machine learning
My first advice would be, don’t try machine learning. Machine learning systems are very complex and often fragile. Whether you’re in research or a startup, don’t immediately start with "oh we can apply deep learning to this". Often it’s good to collect data, and be able to use data to drive and evaluate decisions. But including a complex process like machine learning into whatever you’re trying to do will make it much harder to debug and much harder to understand.

## The role of data
The data you use for building and applying machine learning systems is a critical component, and we will talk a lot about handling and transforming data this semester. Clearly, if you don’t have data, you can’t use machine learning. Let’s say you have a dataset. A very important question you need to ask yourself is: should I get more data? That’s another reason why Kaggle competitions are bad: usually you can’t get more data. So what do you think? Should you get more data? More data from the right source will usually improve the model. So it depends: What’s the marginal cost of more data, what’s the marginal benefit to the model, what’s the marginal benefit of the model to your end-goal? We will talk about how to assess the benefit to the model later in the course. But the other questions are also important.
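
One common way to get a rough idea of the marginal benefit of more data is a learning curve: train on growing subsets of the data you already have and see whether the validation score is still improving. The following is only a sketch, assuming a synthetic dataset and logistic regression as stand-ins for your own data and model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for "the data you already have".
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Train on growing subsets and evaluate each with 5-fold cross-validation.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for size, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(size):4d} training samples: mean accuracy {score:.3f}")
```

If the validation score has flattened out at the largest training sizes, collecting more data of the same kind is unlikely to help much; if it is still rising, more data may be worth its cost.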

The cost of data can be very different, and two kinds of data are particularly common: free data, and very expensive data. What kind of data do you think is free?

Free data is data that you’ll just get more of. And that happens a lot. If you are running an ad company and want to do click prediction, every day you’ll get so much new data, you’ll barely be able to use it all. The same is true for the stock market. In general, if you want to predict the future, and the event is observable, you’ll get more data just by waiting. This can either be because the world just produces the data, or because your business process produces it. You can also be smart, and ensure your business is set up in a way so that it does produce the data. Google used CAPTCHAs to do OCR and to read house numbers, then they used the results for machine learning.

The other extreme of the spectrum is when you want to automate an expensive process with machine learning. This process could be an expert opinion, like a doctor’s diagnosis, or a literary analysis. It could also be an experiment, like an initial drug-trial, or measuring the efficiency of a microchip.

TODO: say why machines are good at some things and bad at others? like medical imaging?

## Metrics and evaluation
One of the most important parts of machine learning is defining the goal, and defining a way to measure that goal. In this way, Kaggle is a really bad way to prepare you for machine learning in the real world, because they already did that work for you. In the real world, people don’t tell you whether you should use unsupervised learning, supervised learning, classification or regression, and what’s the right way to cast something as a machine learning task – or whether to cast it as machine learning at all.

So think in context of your problem. What do you want to achieve? What is the easiest way to achieve this? And what will improving over this baseline buy you?

The problem of metrics is not unique to machine learning, but a problem in any data-driven decision making. And often you have no choice but to use a substitute metric, either because the effect you’re interested in is too hard to measure, or because the influence is too indirect. Imagine Spotify improved their artist radio to be way better. The metric they care about is revenue. Do you think better radio will increase revenue short-term? What would be a good substitute metric?

Let's say Facebook wants to optimize their ad revenue. What should they measure? If people click on ads, that's probably good, right? But you can optimize clicks on ads by increasing accidental clicks, for example by putting ads next to things people click. But accidental clicks will not yield conversions, and if you sell clicks to the ad buyers, and they don't result in conversions, the buyers will go somewhere else.

## Ethics
One aspect of machine learning that has only recently been getting more attention is ethics. There was a recent article in ProPublica about racial bias in risk assessment used in the criminal justice system. Spoiler alert: it’s bad. I recommend reading the article, it’s quite interesting. This is a black-box machine learning system created by some company. If they had to provide explanations, or a more transparent system, the situation would likely be better. But this is not the only place where ethics plays a role in machine learning. There will be a more focused course on ethics in the DSI next semester, and I really recommend looking into it.

Some people think that ethics is not something that the technical people should care about, but I disagree. I think if you build a machine learning system, you should know whether and how it is biased, and whether its application is ethical. Sometimes it’s hard to decide that, though. There’s an example of two high schools, both of which tried to predict which of their students would underperform in the coming year. There are a lot of ways this could be biased based on race, financial background and other factors. But that’s not the point. The point is that one of the schools used the predictions, and kicked out these students before the annual evaluation, so that they got a better evaluation score. The other school used the data to provide these students with targeted support and help. The algorithm could be the same, but the outcome is quite different. Ok, that’s enough about ethics, I hope you’ll keep these considerations in mind. The next thing I want to talk about is data!


## Explainable AI


## Scaling up

There’s another aspect to data collection and dataset size. More data might be more expensive to collect, but it might also be more expensive and more complicated to work with. With the available cloud services, storage might not be that much of an issue any more. But runtime is. And I’m less concerned about buying a bigger cluster, I’m more concerned about your time, the data scientist’s time, and the machine learning analysis. There’s a reason we’ll be using Python in this course. Python is easy to learn, has lots of tools and allows very close interactions with the data. If we tried to use Spark instead, this would be a whole other story. Working with distributed systems is hard, they are not responsive and the tooling is often not as good. So what I often do, no matter how big the dataset is, is to work with a subset of the data that fits into my RAM. Then I can use Python, and everything is easy. And with AWS, I can easily get 512GB of RAM, if I really need to. Arguably I'm a bit biased because I work on scikit-learn. For some applications that subsampling might not make sense, or working on very large data is critical. But I don't think that should ever be the first step.
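
As a minimal sketch of the subsampling idea with pandas: the DataFrame below is a randomly generated stand-in for a large dataset, and the 10% fraction is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for a large dataset; in practice this might come from a file or database.
full_data = pd.DataFrame({
    "feature": rng.normal(size=100_000),
    "target": rng.integers(0, 2, size=100_000),
})

# Keep a random 10% of the rows for fast, interactive exploration.
subset = full_data.sample(frac=0.1, random_state=0)
print(len(full_data), "->", len(subset))
```

Fixing `random_state` makes the subset reproducible, so that experiments on the small sample can be repeated before anything is scaled up to the full data.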