## Introduction ##
* This chapter introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart. It will be a high-level overview (the only chapter without much code), all rather simple, but you should make sure everything is crystal-clear to you before continuing to the rest of the book. So grab a coffee and let’s get started!

## What is Machine Learning ##

* Machine Learning is the science (and art) of programming computers so they can learn from data.
* Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)
* A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom Mitchell, 1997)

> **Example:** Spam Filter.
>
> **Task:** Correctly identify spam.
>
> **Experience:** Training data (a collection of emails labeled as spam or ham), from which a Machine Learning algorithm builds a model.
>
> **Performance:** Ratio of correctly classified emails (this measure is called accuracy).
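
The performance measure P above (the ratio of correctly classified emails, commonly called accuracy) can be sketched in a few lines of Python. The label lists are made-up values for illustration only.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of emails whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels for five emails (not real data).
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham"]

print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```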

## Why Use Machine Learning? ##

Machine Learning is great for:

* Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
    > * For example: Spam Filter.
    > * Traditional approach: Write a rule-based system ('4U', 'cre%it car%'). But as spammers adapt (using 'ForU' instead of '4U'), the list of rules keeps growing and quickly becomes unwieldy.
    > * Machine Learning approach: The system keeps learning as new training data is fed to the algorithm.
* Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
    > * For example: Speech recognition for words 'one' & 'two'.
    > * With a traditional rule-based approach, it is very difficult to even define a good solution.
    > * A Machine Learning approach only needs a good training set of recordings of the words.
* Fluctuating environments: a Machine Learning system can adapt to new data.
* Getting insights about complex problems and large amounts of data.
    > * For example: Data mining.
    > * Inspect the ML based solution.
    > * Understand the problem better.
    > * Iterate if needed.

## Types of Machine Learning Systems ##

1. Trained with/without human supervision:
    > * Supervised
    > * Unsupervised
    > * Semisupervised
    > * Reinforcement learning
2. Can/cannot learn incrementally on the fly:
    > * Online learning
    > * Batch learning
3. Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do:
    > * Instance-based learning
    > * Model-based learning
4. Example:
    > * A state-of-the-art spam filter may learn on the fly using a deep neural network model trained using examples of spam and ham; this makes it an online, model-based, supervised learning system.
    
### Supervised Learning ###

* The training data fed to the algorithm includes the desired solutions, called *labels*.
* The training data features (attributes) are called *predictors*.
* Two types:
    1. Classification
        > * Training data labels are discrete classes (categories) rather than continuous numeric values.
        > * Example: Spam filter (spam or ham)
    2. Regression
        > * Training data labels are numeric.
        > * Example: Car prices
        > * Note that some regression algorithms can be used for classification as well, and vice versa. For example, Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class (e.g., 20% chance of being spam).
* Important supervised learning algorithms:
    > * k-Nearest Neighbors
    > * Linear Regression
    > * Logistic Regression
    > * Support Vector Machines (SVMs)
    > * Decision Trees and Random Forests
    > * Neural networks (Some neural network architectures can be unsupervised, such as autoencoders and restricted Boltzmann machines. They can also be semisupervised, such as in deep belief networks and unsupervised pretraining.)
    
 
> **_NOTE_**
> In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.
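
The remark above that Logistic Regression outputs a probability which is then used for classification can be sketched as follows. This is a minimal illustration, not a trained model: the single feature (`num_suspicious_words`) and the `weight` and `bias` values are hypothetical and hand-picked, whereas a real model would learn its parameters from labeled emails.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def spam_probability(num_suspicious_words, weight=0.8, bias=-3.0):
    """Linear score passed through the sigmoid.

    `weight` and `bias` are hypothetical, hand-picked values.
    """
    return sigmoid(weight * num_suspicious_words + bias)

def classify(num_suspicious_words, threshold=0.5):
    """Turn the predicted probability into a class label."""
    if spam_probability(num_suspicious_words) >= threshold:
        return "spam"
    return "ham"

for n in (0, 2, 6):
    print(n, round(spam_probability(n), 3), classify(n))
```

The numeric output is what makes this a regression-style model. The threshold is what turns it into a classifier.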

### Unsupervised Learning ###

* The training data does not contain the desired solutions and hence is *unlabeled*.
* Unsupervised learning tasks:
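    > * **Clustering** - detect groups and subgroups within the data (e.g., with HCA).
    > * **Visualization** - produce a 2D or 3D representation of the data that can be plotted, preserving as much structure (such as clusters in the input space) as possible (e.g., with t-SNE).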
    > * **Dimensionality Reduction** - simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called **_feature extraction_**. 
    > * **TIP:** _It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm). It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better._
    > * **Anomaly Detection** - detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is trained with normal instances, and when it sees a new instance it can tell whether it looks like a normal one or whether it is likely an anomaly.
    > * **Association Rule Learning** - dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other.
* Important unsupervised learning algorithms:
    > * Clustering
    >     * k-Means
    >     * Hierarchical Cluster Analysis (HCA)
    >     * Expectation Maximization
    > * Visualization and dimensionality reduction
    >     * Principal Component Analysis (PCA)
    >     * Kernel PCA
    >     * Locally-Linear Embedding (LLE)
    >     * t-distributed Stochastic Neighbor Embedding (t-SNE)
    > * Association rule learning
    >     * Apriori
    >     * Eclat
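
As an illustration of clustering without labels, here is a minimal k-Means sketch in plain Python. It is a simplified version: the first `k` points seed the centroids (real implementations usually pick smarter starting points), and the 2D points are made up for the example.

```python
def kmeans(points, k, iterations=10):
    """Plain k-Means on 2D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments

# Hypothetical unlabeled data: two visually obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

No labels are given to the algorithm. The cluster indices it returns are arbitrary identifiers, not class names.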

### Semisupervised Learning ###

* The training data has a little bit of labeled data and a lot of unlabeled data.
* Example:
    > Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just one label per person and it is able to name everyone in every photo, which is useful for searching photos.
* Algorithms:
    > Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.
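
The photo example above can be sketched as a two-stage process: an unsupervised step groups faces into clusters, then a single label per cluster is propagated to every photo. The sketch below assumes the clustering has already been done. The cluster ids follow the example, while the names `Alice` and `Bob` are hypothetical.

```python
# Output of a hypothetical clustering step: photo id -> face clusters found in it.
photo_clusters = {
    1: ["A"],
    2: ["B"],
    5: ["A", "B"],
    7: ["B"],
    11: ["A"],
}

# The only labeled data: one name per cluster, supplied by the user.
cluster_names = {"A": "Alice", "B": "Bob"}

def name_people(photo_clusters, cluster_names):
    """Propagate each cluster's single label to every photo in that cluster."""
    return {
        photo: [cluster_names.get(c, "unknown") for c in clusters]
        for photo, clusters in photo_clusters.items()
    }

def search(named_photos, person):
    """Return the sorted ids of the photos that contain `person`."""
    return sorted(p for p, names in named_photos.items() if person in names)

named = name_people(photo_clusters, cluster_names)
print(search(named, "Alice"))  # [1, 5, 11]
print(search(named, "Bob"))    # [2, 5, 7]
```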
    
### Reinforcement Learning ###

* The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself what the best strategy, called a policy, is to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
* Examples:
    > * Robots learning how to walk.
    > * DeepMind’s AlphaGo learned its winning policy by analyzing millions of games, and then playing many games against itself.
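
The agent, reward, and policy vocabulary above can be made concrete with a small sketch. It uses tabular Q-learning, one common reinforcement learning algorithm chosen here purely for illustration (it is not the method behind the examples above). The corridor environment, the reward values, and the hyperparameters are all hypothetical.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True    # reward for reaching the goal
    return next_state, -0.01, False     # small penalty for every other move

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning; the agent explores with purely random actions."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(len(ACTIONS))
            next_state, reward, done = step(state, ACTIONS[a])
            # Value of the move: reward now plus discounted best future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Learned policy: in each non-goal cell, pick the action with the highest value.
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(policy)  # [1, 1, 1, 1], i.e. always move right toward the goal
```

Nobody tells the agent that moving right is correct. It discovers this policy only from the rewards and penalties it collects.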
    
### Batch Learning (aka Offline Learning) ###

* The system must be trained on all the available data at once; it cannot learn incrementally.
* Training can take a lot of time and computing resources, depending on the amount of data.
* Hence, the learning is usually done offline.
* A schedule (daily or weekly) can be set up to retrain the system on the new plus existing data.

### Online Learning ###

* System can be trained incrementally with individual instances or mini-batches of instances.
* Good when:
    > * Data is received in a continuous flow (example: stock prices).
    > * Computing resources are limited: once the system has learned from new instances, it can discard them.
    > * The dataset is too large to fit in one machine's memory. The algorithm loads part of the data, trains on it, and repeats the process until it has run on all of the data. This is called **_out-of-core learning_**.
* **WARNING:** _Out-of-core learning is usually done offline (i.e., not on the live system), so **online learning** can be a confusing name. Think of it as **incremental learning**._
* **Learning Rate** - measure of how fast the system adapts to changing data. With a high learning rate, the system adapts quickly to new data but also tends to forget old data quickly. With a low learning rate, it learns more slowly but is less sensitive to noise and outliers in the new data.
* The system needs to be monitored closely so that you can react in time to abnormal or bad data.
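
Incremental training and the role of the learning rate can be sketched with a tiny linear model updated by stochastic gradient descent, one instance at a time. The class name, the method names, and the noise-free data stream are hypothetical and not taken from any library.

```python
class OnlineLinearModel:
    """Model y = bias + weight * x, updated one instance at a time (SGD)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.bias = 0.0
        self.weight = 0.0

    def predict(self, x):
        return self.bias + self.weight * x

    def learn_one(self, x, y):
        """Nudge the parameters toward the new instance."""
        error = self.predict(x) - y
        self.bias -= self.learning_rate * error
        self.weight -= self.learning_rate * error * x

def data_stream(n):
    """Hypothetical stream of instances; here y = 1 + 2x with no noise."""
    for i in range(n):
        x = (i % 11) / 10
        yield x, 1.0 + 2.0 * x

model = OnlineLinearModel(learning_rate=0.1)
for x, y in data_stream(5000):
    model.learn_one(x, y)   # the instance can be discarded after this call

print(round(model.bias, 3), round(model.weight, 3))
```

Each call to `learn_one` sees a single instance and keeps only the two parameters. That is why online learning suits continuous flows and datasets that do not fit in memory.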

### Instance-based Learning ###

* System learns the examples by heart, then generalizes to new cases using a similarity measure. 
* Example: 
    > * Learn a group of known spam emails by heart, then flag new emails that are similar to them, using the number of words they have in common as the similarity measure.
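
A minimal sketch of this idea: memorize labeled emails, then give a new email the label of the stored email it shares the most words with (a 1-nearest-neighbor rule). The emails are invented for the example. As an assumption beyond the text, the memorized set includes ham as well as spam so that both labels can be predicted.

```python
def words(text):
    """Lowercased set of the words in an email."""
    return set(text.lower().split())

def similarity(email_a, email_b):
    """Similarity measure: number of distinct words the two emails share."""
    return len(words(email_a) & words(email_b))

def predict(new_email, training_set):
    """Copy the label of the most similar memorized email."""
    _, label = max(training_set, key=lambda item: similarity(new_email, item[0]))
    return label

# Hypothetical labeled emails that the system "learns by heart".
training_set = [
    ("win a free prize now", "spam"),
    ("cheap credit card offer for you", "spam"),
    ("meeting agenda for monday morning", "ham"),
    ("lunch on friday with the team", "ham"),
]

print(predict("claim your free prize", training_set))        # spam
print(predict("agenda for the team meeting", training_set))  # ham
```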
    
### Model-based Learning ###

* System learns by making a model out of training instances, then uses that model to make predictions on new instances.
* Example: use data from [OECD - Better Life Index](http://stats.oecd.org/index.aspx?DataSetCode=BLI) & [IMF - GDP per capita](http://www.imf.org/external/pubs/ft/weo/2016/01/weodata/weorept.aspx?pr.x=32&pr.y=8&sy=2015&ey=2015&scsm=1&ssd=1&sort=country&ds=.&br=1&c=512%2C668%2C914%2C672%2C612%2C946%2C614%2C137%2C311%2C962%2C213%2C674%2C911%2C676%2C193%2C548%2C122%2C556%2C912%2C678%2C313%2C181%2C419%2C867%2C513%2C682%2C316%2C684%2C913%2C273%2C124%2C868%2C339%2C921%2C638%2C948%2C514%2C943%2C218%2C686%2C963%2C688%2C616%2C518%2C223%2C728%2C516%2C558%2C918%2C138%2C748%2C196%2C618%2C278%2C624%2C692%2C522%2C694%2C622%2C142%2C156%2C449%2C626%2C564%2C628%2C565%2C228%2C283%2C924%2C853%2C233%2C288%2C632%2C293%2C636%2C566%2C634%2C964%2C238%2C182%2C662%2C453%2C960%2C968%2C423%2C922%2C935%2C714%2C128%2C862%2C611%2C135%2C321%2C716%2C243%2C456%2C248%2C722%2C469%2C942%2C253%2C718%2C642%2C724%2C643%2C576%2C939%2C936%2C644%2C961%2C819%2C813%2C172%2C199%2C132%2C733%2C646%2C184%2C648%2C524%2C915%2C361%2C134%2C362%2C652%2C364%2C174%2C732%2C328%2C366%2C258%2C734%2C656%2C144%2C654%2C146%2C336%2C463%2C263%2C528%2C268%2C923%2C532%2C738%2C944%2C578%2C176%2C537%2C534%2C742%2C536%2C866%2C429%2C369%2C433%2C744%2C178%2C186%2C436%2C925%2C136%2C869%2C343%2C746%2C158%2C926%2C439%2C466%2C916%2C112%2C664%2C111%2C826%2C298%2C542%2C927%2C967%2C846%2C443%2C299%2C917%2C582%2C544%2C474%2C941%2C754%2C446%2C698%2C666&s=NGDPDPC&grp=0&a=) with a linear regression algorithm to build a model that predicts life satisfaction from GDP per capita for countries not included in the training data.
    > * life_satisfaction = x0 + x1 * gdp_per_capita, where x0 (intercept) and x1 (slope) are the model parameters learned from the training data.
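
Fitting the two parameters of this linear model can be sketched with the closed-form least-squares solution. The numbers below are made-up toy values, not the OECD or IMF figures, so the fitted parameters say nothing about real countries.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = x0 + x1 * x (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    x1 = covariance / variance      # slope
    x0 = mean_y - x1 * mean_x       # intercept
    return x0, x1

# Made-up toy values for illustration, NOT the OECD/IMF data.
gdp_per_capita = [10_000, 20_000, 30_000, 40_000, 50_000]
life_satisfaction = [5.0, 5.5, 6.0, 6.5, 7.0]

x0, x1 = fit_line(gdp_per_capita, life_satisfaction)

def predict(gdp):
    """Predict life satisfaction for a country outside the training data."""
    return x0 + x1 * gdp

print(round(predict(25_000), 2))  # 5.75 for this toy data
```

Once `x0` and `x1` are learned, the training instances are no longer needed to make predictions. This is the key difference from instance-based learning.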
    
### Main Challenges of Machine Learning ###

1. Insufficient Quantity of Training Data
    > * Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples.
    > * **THE UNREASONABLE EFFECTIVENESS OF DATA** In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation once they were given enough data. The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled “The Unreasonable Effectiveness of Data” published in 2009.
2. Nonrepresentative Training Data
    > * In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.
    > * If the sample is too small, you will have **_sampling noise_** (i.e., nonrepresentative data as a result of chance).
    > * Even very large samples can be nonrepresentative if the sampling method is flawed. This is called **_sampling bias_**.
3. Poor-Quality Data
    > * If your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), the system will have a harder time detecting the underlying patterns and is less likely to perform well. Cleaning the data involves decisions such as whether to discard or fix instances that are clearly outliers and, when some instances are missing a feature, whether to ignore that feature, ignore those instances, fill in the missing values, or train one model with the feature and one model without it.
4. Irrelevant Features
    > * 