# 1. What is Deep Learning?

In the past few years, **artificial intelligence** (AI) has been a subject of intense media hype. **Machine learning**, **deep learning**, and **AI** come up in countless articles, often outside of technology-minded publications. We’re promised a future of intelligent chatbots, self-driving cars, and virtual assistants—a future sometimes painted in a grim light and other times as utopian, where human jobs will be scarce and most economic activity will be handled by robots or AI agents.

So let’s tackle these questions: 
- What has deep learning achieved so far? 
- How significant is it? 
- Where are we headed next? 
- Should you believe the hype?

<img width="600" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=1yfGHg5tlv9gPij6x1nv-6_gWHLoBv3lB">


## 1.1 Artificial intelligence, machine learning, and deep learning

### 1.1.1 Artificial intelligence

A concise definition of the field would be as follows: **the effort to automate intellectual tasks normally performed by humans**. As such, AI is a general field that encompasses machine learning and deep learning, but that also includes many more approaches that don’t involve any learning. Early chess programs, for instance, only involved hardcoded rules crafted by programmers, and didn’t qualify as machine learning. For a fairly long time, many experts believed that human-level artificial intelligence could be achieved by having programmers handcraft a sufficiently large set of explicit rules for manipulating knowledge. This approach is known as **symbolic AI**, and it was the dominant paradigm in AI from the 1950s to the late 1980s. It reached its peak popularity during the **expert systems boom** of the 1980s.

Although symbolic AI proved suitable for solving well-defined, logical problems, such as playing chess, it turned out to be intractable to figure out explicit rules for solving more complex, fuzzy problems, such as image classification, speech recognition, and language translation. A new approach arose to take symbolic AI’s place: **machine learning**.

### 1.1.2 Machine learning

Machine learning arises from this question: **could a computer go beyond** “what we know how to order it to perform” and learn on its own how to perform a specified task? **Could a computer surprise us**? Rather than programmers crafting data-processing rules by hand, **could a computer automatically learn these rules by looking at data**?

This question opens the door to a new programming paradigm. In classical programming, the paradigm of symbolic AI, humans input rules (a program) and data to be processed according to these rules, and out come answers (see figure 1.2). With machine learning, humans input data as well as the answers expected from the data, and out come the rules. These rules can then be applied to new data to produce original answers.

<img width="600" alt="Figure 1.2: classical programming versus machine learning" src="https://drive.google.com/uc?export=view&id=1rvilB5MT3rIFMHoD_u15sxbYorEYgZDy">

**A machine-learning system is trained rather than explicitly programmed.**
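The contrast between the two paradigms can be made concrete with a small sketch. The temperature data, the labels, and the function names below are made up for illustration. The first function is a rule written by hand; the second searches for the rule from data and expected answers.

```python
# Classical programming: a human writes the rule, the program applies it.
def rule_by_hand(temperature_c):
    # Hypothetical hand-written rule: anything at or above 30 °C is "hot".
    return "hot" if temperature_c >= 30 else "mild"


# Machine learning: we supply data plus expected answers and search for the rule.
def learn_threshold(temperatures, labels):
    """Return the threshold that classifies the most training examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(temperatures):
        correct = sum(
            ("hot" if t >= candidate else "mild") == label
            for t, label in zip(temperatures, labels)
        )
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Toy, made-up training data: inputs and the answers expected from them.
temperatures = [12, 18, 22, 27, 31, 35, 38]
labels = ["mild", "mild", "mild", "mild", "hot", "hot", "hot"]

threshold = learn_threshold(temperatures, labels)


def learned_rule(temperature_c):
    # The rule that came out of the data, applied to new inputs.
    return "hot" if temperature_c >= threshold else "mild"


print(threshold)         # the rule was found from data, not written by hand
print(learned_rule(33))  # applying the learned rule to new data
```

The learned threshold depends entirely on the examples provided, which is the sense in which the system is trained rather than programmed.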

Although machine learning only started to flourish in the 1990s, it has quickly become the most popular and most successful subfield of AI, **a trend driven by the availability of faster hardware and larger datasets**. Machine learning is tightly related to mathematical statistics, but it differs from statistics in several important ways. Unlike statistics, machine learning tends to deal with large, complex datasets (such as a dataset of millions of images, each consisting of tens of thousands of pixels) for which classical statistical analysis such as Bayesian analysis would be impractical. As a result, machine learning, and especially deep learning, exhibits comparatively little mathematical theory—maybe too little—and is engineering oriented. It’s a hands-on discipline in which ideas are proven empirically more often than theoretically.

So that’s what machine learning is, technically: **searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal**. This simple idea allows for solving a remarkably broad range of intellectual tasks, from speech recognition to autonomous car driving.
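A minimal sketch of this definition, under made-up assumptions: the points, the labels, and the four candidate transformations below are invented for illustration. The candidates play the role of the predefined space of possibilities, and classification accuracy is the feedback signal used to pick the most useful representation.

```python
# Made-up 2-D points; a point has label 1 when y > x, otherwise 0.
points = [(1.0, 3.0), (2.0, 2.5), (-1.0, 0.5), (3.0, 1.0), (0.5, -2.0), (2.0, 0.0)]
labels = [1, 1, 1, 0, 0, 0]

# Predefined space of possibilities: a few candidate representations of a point.
candidates = {
    "x": lambda x, y: x,
    "y": lambda x, y: y,
    "x + y": lambda x, y: x + y,
    "y - x": lambda x, y: y - x,
}


def accuracy(transform):
    """Feedback signal: how well does 'value > 0' classify the points?"""
    predictions = [1 if transform(x, y) > 0 else 0 for x, y in points]
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


# Search: score every candidate and keep the most useful one.
scores = {name: accuracy(transform) for name, transform in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Real learning algorithms search far larger spaces and do so more cleverly than by trying every option, but the three ingredients are the same.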

### 1.1.3 The “deep” in deep learning

Deep learning is a specific subfield of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. The deep in deep learning isn’t a reference to any kind of deeper understanding achieved by the approach; rather, it stands for this idea of **successive layers of representations**.

Meanwhile, other approaches to machine learning tend to focus on learning only one or two layers of representations of the data; hence, they’re sometimes called **shallow learning**.

**What do the representations learned by a deep-learning algorithm look like**? Let’s examine how a network several layers deep transforms an image of a digit in order to recognize what digit it is.

As you can see in figure 1.6, the network transforms the digit image into representations that are increasingly different from the original image and increasingly informative about the final result. You can think of a deep network as a multistage information-distillation operation, where information goes through successive filters and comes out increasingly purified (that is, useful with regard to some task).

<img width="600" alt="Figure 1.6: representations learned by a digit-classification network" src="https://drive.google.com/uc?export=view&id=19a5EQpt1NHw67yjXkYgS4PWrw8Y3KZIh">

So that’s what deep learning is, technically: **a multistage way to learn data representations.** 
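The idea of successive layers can be sketched in a few lines of plain Python. The weights below are hand-picked and hypothetical (a real network would learn them), and the three-number input merely stands in for pixel values. Each layer takes the previous representation and produces a new one.

```python
def dense_relu(inputs, weights, biases):
    """One layer: weighted sums of the inputs followed by a ReLU nonlinearity."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0.0, total))
    return outputs


# Hypothetical, hand-picked weights; a real network would learn these from data.
layers = [
    ([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]], [0.0, 0.0]),  # 3 inputs -> 2 units
    ([[1.0, 1.0]], [-0.25]),                             # 2 inputs -> 1 unit
]

representation = [0.2, 0.9, 0.4]  # stand-in for raw input such as pixel values
history = [representation]
for weights, biases in layers:
    # Each layer transforms the previous representation into a new one.
    representation = dense_relu(representation, weights, biases)
    history.append(representation)

for depth, values in enumerate(history):
    print(depth, values)
```

The `history` list holds the input followed by the output of each layer, which is the chain of representations the text describes.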

### 1.1.4 Understanding how deep learning works


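A deep network maps inputs to predictions through a stack of layers, and what each layer does to its input is stored in the layer’s **weights**. A **loss function** compares the network’s predictions with the true targets and computes a **loss score** that measures how far off they are. The fundamental trick in deep learning is to use this score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score for the current example (see figure 1.9). This adjustment is the job of the **optimizer**, which implements what’s called the **backpropagation algorithm**: the central algorithm in deep learning. The next chapter explains in more detail how backpropagation works.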

<img width="600" alt="Figure 1.9: the loss score is used as a feedback signal to adjust the weights" src="https://drive.google.com/uc?export=view&id=1ArEA6BxB6hBrmUz5p0gpZIeW-FT_T7NN">

Initially, the weights of the network are assigned random values, so the network merely implements a series of random transformations. Naturally, its output is far from what it should ideally be, and the loss score is accordingly very high. But with every example the network processes, the weights are adjusted a little in the correct direction, and the loss score decreases. This is the training loop, which, repeated a sufficient number of times (typically tens of iterations over thousands of examples), yields weight values that minimize the loss function. **A network with a minimal loss is one for which the outputs are as close as they can be to the targets: a trained network.**
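The training loop can be sketched for the smallest possible model: a single weight fitted to made-up data. The data, the learning rate, and the number of passes are assumptions chosen for illustration. With only one weight the gradient can be written by hand; in a real network, backpropagation computes the gradients for all weights.

```python
import random

random.seed(0)

# Toy data following target = 3 * x (made up for illustration).
examples = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

weight = random.uniform(-1.0, 1.0)  # start from a random weight
learning_rate = 0.05


def mean_loss(w):
    """Average squared distance between predictions and targets."""
    return sum((w * x - y) ** 2 for x, y in examples) / len(examples)


initial_loss = mean_loss(weight)
for epoch in range(50):                      # iterations over the dataset
    for x, target in examples:
        prediction = weight * x              # forward pass
        error = prediction - target
        gradient = 2.0 * error * x           # d(loss)/d(weight) for squared error
        weight -= learning_rate * gradient   # small step that lowers the loss

final_loss = mean_loss(weight)
print(initial_loss, final_loss, weight)
```

The loss starts high because the weight is random and shrinks as the weight moves toward the value that generated the data.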

## 1.2. Before deep learning: a brief history of machine learning

### 1.2.1 Probabilistic modeling

Probabilistic modeling is the **application of the principles of statistics** to data analysis. It was one of the earliest forms of machine learning, and it’s still widely used to this day. One of the best-known algorithms in this category is the **Naive Bayes algorithm**.

A closely related model is **logistic regression** (logreg for short), which is sometimes considered to be the **“hello world” of modern machine learning**. Don’t be misled by its name: logreg is a classification algorithm rather than a regression algorithm.

Much like **Naive Bayes**, **logreg** predates computing by a long time, yet it’s still useful to this day, thanks to its simple and versatile nature. It’s often the first thing a data scientist will try on a dataset to get a feel for the classification task at hand.
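To see why logreg counts as a classifier, here is a minimal sketch trained with plain gradient descent. The hours-studied data, the learning rate, and the iteration count are made up for illustration. The model outputs a probability, and thresholding that probability at 0.5 gives a class.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Made-up 1-D data: hours studied -> passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = 0.0, 0.0
learning_rate = 0.4
for _ in range(5000):
    # Gradient of the average cross-entropy loss with respect to w and b.
    grad_w = grad_b = 0.0
    for x, y in zip(hours, passed):
        error = sigmoid(w * x + b) - y
        grad_w += error * x
        grad_b += error
    w -= learning_rate * grad_w / len(hours)
    b -= learning_rate * grad_b / len(hours)


def predict_proba(x):
    """Estimated probability of the positive class."""
    return sigmoid(w * x + b)


def predict(x):
    """Class decision: threshold the probability at 0.5."""
    return 1 if predict_proba(x) >= 0.5 else 0


print([predict(x) for x in hours])
```

The "regression" in the name refers to fitting the probability curve; the output that matters in practice is the class.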

### 1.2.2. Early neural networks
Although the core ideas of neural networks were investigated in toy forms as early as the **1950s**, the approach took decades to get started. For a long time, **the missing piece was an efficient way to train large neural networks**. 

This changed in the **mid-1980s**, when multiple people independently rediscovered the **backpropagation algorithm** (a way to train chains of parametric operations using gradient-descent optimization) and started applying it to neural networks.

**The first successful practical application of neural nets came in 1989 from Bell Labs**, when [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) combined the earlier ideas of convolutional neural networks and backpropagation, and applied them to the problem of classifying handwritten digits. The resulting network, dubbed LeNet, was used by the United States Postal Service in the 1990s to automate the reading of ZIP codes on mail envelopes.

### 1.2.3. Kernel methods

As neural networks started to gain some respect among researchers in the 1990s, thanks to this first success, a new approach to machine learning rose to fame and quickly sent neural nets back to oblivion: **kernel methods**. Kernel methods are a **group of classification algorithms**, the best known of which is the **support vector machine (SVM)**. The modern formulation of an SVM was developed by Vladimir Vapnik and Corinna Cortes in the early **1990s** at Bell Labs and published in **1995**, although an older linear formulation was published by Vapnik and Alexey Chervonenkis as early as **1963**.

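SVMs classify by finding a good **decision boundary** between two sets of points. To do so, they map the data to a high-dimensional representation in which the boundary can be expressed as a hyperplane. The technique of mapping data to a high-dimensional representation where a classification problem becomes simpler may look good on paper, but in practice it’s often computationally intractable. This is where the **kernel trick** comes in: finding the boundary only requires the distance between pairs of points in the new space, and a **kernel function** computes that directly from the original points, without ever building the new representation.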
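SVMs usually sidestep this cost with a kernel function, which returns the dot product two points would have in the high-dimensional space without ever mapping them there. The sketch below checks this numerically for a degree-2 polynomial kernel on 2-D points. The points and the function names are chosen for illustration.

```python
import math


def feature_map(point):
    """Explicitly map a 2-D point into a 3-D space of degree-2 features."""
    x1, x2 = point
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(p, q):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(p, q) ** 2


p, q = (1.0, 2.0), (3.0, -1.0)
explicit = dot(feature_map(p), feature_map(q))  # dot product in the 3-D space
implicit = polynomial_kernel(p, q)              # same value, no mapping needed
print(explicit, implicit)
```

For this small case the saving is negligible, but for kernels whose feature space is very large or infinite-dimensional, the implicit computation is the only practical option.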

At the time they were developed, SVMs exhibited state-of-the-art performance on simple classification problems and were one of the few machine-learning methods backed by extensive theory and amenable to serious mathematical analysis, making them well understood and easily interpretable. Because of these useful properties, **SVMs became extremely popular in the field for a long time**.

**But SVMs proved hard to scale to large datasets and didn’t provide good results for perceptual problems such as image classification**. Because an SVM is a shallow method, applying an SVM to perceptual problems requires first extracting useful representations manually (a step called **feature engineering**), which is difficult and brittle.

### 1.2.4. Decision trees, random forests, and gradient boosting machines

**Decision trees** are flowchart-like structures that let you classify input data points or predict output values given inputs. They’re **easy to visualize and interpret**. Decision trees learned from data began to receive significant research interest in the 2000s, and by 2010 they were often preferred to kernel methods.


In particular, the **Random Forest** algorithm introduced a robust, practical take on decision-tree learning that involves building a large number of specialized decision trees and then **ensembling** their outputs. Random forests are applicable to a wide range of problems—you could say that **they’re almost always the second-best algorithm for any shallow machine-learning task**. 
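The flowchart structure of a decision tree and the ensembling step can both be sketched with hand-written trees. The email features, thresholds, and tree names below are hypothetical. A real random forest would learn each tree from a random sample of the training data and a random subset of the features rather than have them written by hand.

```python
from collections import Counter


# Three hypothetical, hand-written decision trees for a made-up spam filter.
# Each tree is a small flowchart of if/else questions about the input.
def tree_links(email):
    if email["num_links"] > 5:
        return "spam"
    return "ham"


def tree_sender(email):
    if email["known_sender"]:
        return "ham"
    return "spam"


def tree_attachment(email):
    if email["has_attachment"]:
        if email["num_words"] < 20:
            return "spam"
        return "ham"
    return "ham"


forest = [tree_links, tree_sender, tree_attachment]


def forest_predict(email):
    """Ensemble the trees by majority vote."""
    votes = Counter(tree(email) for tree in forest)
    return votes.most_common(1)[0][0]


email = {"num_links": 8, "known_sender": True,
         "has_attachment": True, "num_words": 12}
print(forest_predict(email))
```

Here two of the three trees vote "spam", so the ensemble does too, even though one tree disagrees. Averaging over many varied trees is what makes the method robust.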

When the popular machine-learning competition website [Kaggle](http://kaggle.com) got started in **2010**, **random forests quickly became a favorite on the platform—until 2014**, when gradient boosting machines took over. 

**A gradient boosting machine**, much like a random forest, is a machine-learning technique based on **ensembling weak prediction models**, generally decision trees. It uses gradient boosting, a way to improve any machine-learning model by iteratively training new models that specialize in addressing the weak points of the previous models. Applied to decision trees, the use of the gradient boosting technique results in models that **strictly outperform random forests most of the time, while having similar properties**. **It may be one of the best, if not the best, algorithm for dealing with nonperceptual data today**. Alongside deep learning, it’s one of the most commonly used techniques in Kaggle competitions.
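A minimal sketch of the boosting idea for regression with squared error, where the "weak points" of the current ensemble are simply its residuals. The data, the learning rate, and the number of rounds are made up for illustration, and each weak model is a one-split decision tree (a stump). Libraries such as XGBoost are far more elaborate than this.

```python
def fit_stump(xs, targets):
    """Fit a one-split tree: pick the threshold whose two means fit best."""
    best = None
    for threshold in xs:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not right:
            continue  # splitting at the largest x leaves one side empty
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


# Made-up 1-D regression data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

base_prediction = sum(ys) / len(ys)  # start from the mean of the targets
learning_rate = 0.5
stumps = []


def predict(x):
    return base_prediction + learning_rate * sum(stump(x) for stump in stumps)


def mean_squared_error():
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


errors = [mean_squared_error()]
for _ in range(20):
    # Each new stump is trained on what the current ensemble still gets wrong.
    residuals = [y - predict(x) for x, y in zip(xs, ys)]
    stumps.append(fit_stump(xs, residuals))
    errors.append(mean_squared_error())

print(errors[0], errors[-1])
```

Each round adds a small correction, so the training error never increases from one round to the next.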

### 1.2.5. Back to neural networks

**Around 2010**, although neural networks were almost completely shunned by the scientific community at large, a number of people still working on neural networks started to make important breakthroughs: the groups of [Geoffrey Hinton](https://en.wikipedia.org/wiki/Geoffrey_Hinton) at the University of Toronto, [Yoshua Bengio](https://en.wikipedia.org/wiki/Yoshua_Bengio) at the University of Montreal, [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) at New York University, and IDSIA in Switzerland.


- **In 2011**, Dan Ciresan from IDSIA began to win academic image-classification competitions with **GPU-trained** deep neural networks.
- **In 2012**, a team led by Alex Krizhevsky and advised by Geoffrey Hinton entered the yearly large-scale image-classification challenge ImageNet and achieved a top-five accuracy of **83.6%**, a significant breakthrough.
- **By 2015**, the winner of the same challenge reached a top-five accuracy of **96.4%**, and the classification task on ImageNet was considered to be a completely solved problem.
- Since 2012, **deep convolutional neural networks (convnets)** have become the go-to algorithm for all computer vision tasks; more generally, they work on all perceptual tasks.
- **At major computer vision conferences in 2015 and 2016**, it was nearly impossible to find presentations that didn’t involve convnets in some form.
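The top-five accuracy quoted above counts a prediction as correct when the true class is among the model’s five highest-scoring classes. A minimal sketch of the metric, using made-up scores for four examples over six classes:

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Fraction of examples whose true class is among the k highest scores."""
    hits = 0
    for class_scores, label in zip(scores, true_labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Made-up scores: one row per example, one column per class.
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.40, 0.10, 0.05, 0.05],
    [0.06, 0.07, 0.10, 0.60, 0.15, 0.02],
    [0.20, 0.25, 0.05, 0.10, 0.30, 0.10],
]
true_labels = [1, 0, 5, 1]

top1 = top_k_accuracy(scores, true_labels, k=1)
top5 = top_k_accuracy(scores, true_labels, k=5)
print(top1, top5)
```

Top-five accuracy is always at least as high as top-one accuracy, which is why the two should not be compared directly.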


### 1.2.6. The modern machine-learning landscape


A great way to get a sense of the current landscape of machine-learning algorithms and tools is to look at machine-learning competitions on [Kaggle](http://kaggle.com). 

**In 2016 and 2017**, Kaggle was dominated by two approaches: **gradient boosting machines** and **deep learning**. Specifically, **gradient boosting is used for problems where structured data is available**, whereas **deep learning is used for perceptual problems such as image classification**. 

These are the two techniques you should be the most familiar with in order to be successful in applied machine learning today: gradient boosting machines, for shallow-learning problems; and deep learning, for perceptual problems. In technical terms, this means you’ll need to be familiar with **XGBoost** and **Keras**, the two libraries that currently dominate Kaggle competitions.

## 1.3. Why deep learning? Why now?

The two key ideas of **deep learning** for computer vision — **convolutional neural networks** and **backpropagation** — were already well understood in **1989**. The **Long Short-Term Memory (LSTM)** algorithm, which is fundamental to deep learning for **timeseries**, was developed in **1997** and has barely changed since. So why did deep learning only take off after 2012? What changed in these two decades?

In general, three technical forces are driving advances in machine learning:

- Hardware
- Datasets and benchmarks
- Algorithmic advances

Because the field is guided by experimental findings rather than by theory, algorithmic advances only become possible when appropriate data and hardware are available to try new ideas (or scale up old ideas, as is often the case). **Machine learning** isn’t mathematics or physics, where major advances can be made with a pen and a piece of paper. **It’s an engineering science.**

The real **bottlenecks throughout the 1990s and 2000s** were **data** and **hardware**. But here’s what happened during that time: the internet took off, and **high-performance graphics chips were developed** for the needs of the gaming market.


### 1.3.1. A new wave of investment

**In 2011**, right before deep learning took the spotlight, the total venture capital investment in AI was around **\$19 million**, which went almost entirely to practical applications of shallow machine-learning approaches. **By 2014**, it had risen to a staggering **\$394 million**. Dozens of startups launched in these three years, trying to capitalize on the deep-learning hype.

Meanwhile, large tech companies such as **Google**, **Facebook**, **Baidu**, and **Microsoft** have invested in internal research departments in amounts that would most likely dwarf the flow of venture-capital money. Only a few numbers have surfaced: **In early 2014**, **Google acquired** the deep-learning startup **DeepMind** for a reported **\$500 million**, at the time the largest acquisition of an AI company in history. **In 2014**, **Baidu** started a deep-learning research center in Silicon Valley, **investing \$300 million** in the project. The deep-learning hardware **startup Nervana Systems** was acquired by **Intel** in **2016** for over **\$400 million.**

Machine learning—in particular, deep learning—has become central to the product strategy of these tech giants. **In late 2015**, **Google CEO Sundar Pichai stated**, “Machine learning is a core, transformative way by which we’re rethinking how we’re doing everything. **We’re thoughtfully applying it across all our products**, be it search, ads, YouTube, or Play. And we’re in early days, but you’ll see us—in a systematic way—apply machine learning in all these areas.”