# CH.1 The Machine Learning Landscape

## 1.1 What is Machine Learning?

**Machine Learning**: Science and Art of programming computers so that they can _learn from data_.

The goal is to have the computer learn a task without being explicitly programmed.

_Training Set_: Examples a computer uses as a foundation. Each particular example is called a _training instance_.

We need a performance measure to judge how well the program does its task. For classification tasks, a common measure is _accuracy_: the ratio of correctly classified examples.
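
As a minimal sketch of such a performance measure, the snippet below computes accuracy as the fraction of correct predictions. The function name `accuracy` and the spam-filter style labels are hypothetical, chosen only for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical classifier output: 1 = spam, 0 = not spam.
predicted = [1, 0, 1, 1, 0]
actual = [1, 0, 0, 1, 0]
score = accuracy(predicted, actual)  # 4 of the 5 predictions are correct
print(score)
```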

## 1.2 Why use Machine Learning?

Traditional programming workflow looks like this:

![traditional_programming_flow.PNG](attachment:traditional_programming_flow.PNG)

But consider programming image recognition by hand. An image contains millions of pixels, each with a whole range of values. That's far too many _if-then_ rules to write.

Machine Learning techniques automatically learn which predictors matter for a solution and, more importantly, the solutions keep improving as the available data is updated. This makes Machine Learning programs far more adaptable than their traditional counterparts:

![machine_learning_programming_flow.PNG](attachment:machine_learning_programming_flow.PNG)

![machine_learning_programming_flow_2.PNG](attachment:machine_learning_programming_flow_2.PNG)

Machine Learning is great for:
- Problems for which current solutions require a terribly long list of rules.
- Fluctuating environments, since the system can adapt to new data.
- Finding insights into complex problems with data that's too large to parse through traditionally.

## 1.3 Types of Machine Learning Systems

How do these systems differ?

- Extent of human intervention: Supervised v. Unsupervised v. Semi-supervised v. Reinforcement Learning
- Extent of incremental learning: Online v. Batch Learning
- How the system generalizes: Instance comparison v. Model construction

### 1.3.1 Supervised v. Unsupervised Learning

#### 1.3.1.1 Supervised Learning

**Supervised Learning**: Training data fed to the algorithm includes the answers/solutions (known as _labels_ or _classes_). This is useful for _classification_ or for _regression_, i.e. predicting a target numeric value (e.g., the price of a car) given a set of _features_ or _predictors_.

To train this system, the data would require both the _predictors_ and their associated _labels_.

![regression_example.PNG](attachment:regression_example.PNG)
    
Important examples of supervised learning:

- k-Nearest Neighbors
- Linear Regression
- Logistic Regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural Networks

#### 1.3.1.2 Unsupervised Learning

**Unsupervised Learning**: Training data fed to the algorithm does not include labels or solutions. 

Important examples of unsupervised learning:

- Clustering:
 - k-Means
 - Hierarchical Cluster Analysis (HCA)
 - Expectation Maximization
- Visualization and dimensionality reduction:
 - Principal Component Analysis (PCA)
 - Kernel PCA
 - Locally-Linear Embedding (LLE)
 - t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning:
 - Apriori
 - Eclat

_Clustering_ involves detecting groups of similar instances:

![clustering_example.PNG](attachment:clustering_example.PNG)
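
To make the clustering idea concrete, here is a minimal sketch of k-Means in plain Python. It assumes a naive initialization (the first `k` points become the starting centroids) and a hypothetical set of 2D points; the function name `k_means` is not from the source, and a real project would normally rely on a library implementation.

```python
def k_means(points, k, iterations=10):
    """Very small k-Means: returns (centroids, assignments)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)
```

No labels are given to the algorithm; the two groups emerge only from the distances between points.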

_Visualization_ involves 2D or 3D representations of data that preserve as much structure as possible, so that unsuspected patterns can be identified. This goes along with _dimensionality reduction_ because certain features might be strongly correlated, such as the age of a car and its mileage. Combining those two features into one is a form of _feature extraction_, and it is often a good idea to reduce the dimensions of the training data before feeding it to other Machine Learning algorithms. Below is a good example of visualization:

![t-sne_example.PNG](attachment:t-sne_example.PNG)
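
The car age and mileage example can be sketched with a tiny two-feature PCA in plain Python. This is an illustration under assumptions: the data is hypothetical (age in years, mileage in units of 10,000 km so both features have comparable scales), and the helper name `first_principal_component` and the merged feature name `wear` are invented here.

```python
import math


def first_principal_component(rows):
    """Return (mean, unit direction) of the largest-variance axis for 2-D data."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_y = sum(r[1] for r in rows) / n
    # Entries of the 2x2 covariance matrix.
    sxx = sum((r[0] - mean_x) ** 2 for r in rows) / n
    syy = sum((r[1] - mean_y) ** 2 for r in rows) / n
    sxy = sum((r[0] - mean_x) * (r[1] - mean_y) for r in rows) / n
    # Orientation of the major axis of a symmetric 2x2 matrix.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(angle), math.sin(angle))


# Hypothetical cars: (age in years, mileage in 10,000 km).
cars = [(1, 1.2), (2, 2.9), (3, 4.1), (4, 6.2), (5, 7.4), (6, 9.1)]
mean, direction = first_principal_component(cars)

# Project each car onto the principal axis: two features become one.
wear = [
    (age - mean[0]) * direction[0] + (km - mean[1]) * direction[1]
    for age, km in cars
]
print(wear)
```

Because the two features are strongly correlated, the single `wear` value keeps most of the information that the original pair carried.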


#### 1.3.1.3 Semi-Supervised Learning

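This is a combination of the two above: the training data is only partially labeled, usually with a lot of unlabeled data and a small amount of labeled data.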

#### 1.3.1.4 Reinforcement Learning

**Reinforcement Learning**: The learning system, known as an _agent_, observes its environment, takes actions, and gets _rewards_ or _penalties_ in return. Over time it learns a _policy_: the strategy that defines which action to take in a given situation so as to collect the most reward.


![reinforcement_learning_example.PNG](attachment:reinforcement_learning_example.PNG)
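
The agent, reward, and policy vocabulary can be illustrated with a small sketch. The code below uses tabular Q-learning, which is one specific reinforcement learning algorithm chosen here as an assumption (the notes do not name one). The corridor environment, the reward values, and all names are hypothetical.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # index 0 = step left, index 1 = step right


def step(state, action_index):
    """Environment: returns (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True   # reward for reaching the goal
    return next_state, -0.1, False     # small penalty for every other move


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from randomly explored episodes."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            action = rng.randrange(2)  # explore with random actions
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
# Greedy policy: in each non-goal cell, pick the action with the highest value.
policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the learned policy chooses "step right" in every cell, since that is the shortest way to the reward.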

This is how AlphaGo beat the world champion in May 2017: it learned its winning policy by analyzing millions of games and playing many games against itself, and then applied that learned policy against the champion.


### 1.3.2 Batch v. Online Learning

Can the ML system learn incrementally from streams of incoming data?

#### 1.3.2.1 Batch Learning

**Batch Learning**: The system is incapable of learning incrementally; it must be trained using all available data, in one _batch_.

This takes a lot of computational resources and time, so the system is typically trained offline and launched into production with no further learning (updates can be made by retraining the system from scratch with updated data).

#### 1.3.2.2 Online Learning

**Online/Incremental Learning**: Feed data instances incrementally and sequentially to the ML system to train it.

![online_learning_example.PNG](attachment:online_learning_example.PNG)

Great option for:
- continuous flow of data (e.g., the stock market)
- limited resources (e.g., memory storage on a Mars rover)

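The reason this is good for systems with limited resources is that once the system has learned from a new data instance, that instance can be discarded to save storage costs.

This method must be tuned for how much it weighs old v. new data (rigid, orderly old v. flexible, chaotic new). This is set by the _learning rate_: a high rate adapts quickly to new data but quickly forgets old data, while a low rate learns more slowly but is less sensitive to noise in new data.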
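
A minimal sketch of online learning, assuming a linear model updated by one stochastic gradient step per incoming instance. The data stream is hypothetical (it follows y = 2x + 1 exactly), and the names `online_update` and `stream` are invented for this example.

```python
def online_update(theta, x, y, learning_rate):
    """One gradient step on squared error for the model theta0 + theta1 * x."""
    theta0, theta1 = theta
    error = (theta0 + theta1 * x) - y
    return (theta0 - learning_rate * error, theta1 - learning_rate * error * x)


def stream():
    """Hypothetical data stream following y = 2x + 1, one instance at a time."""
    for i in range(2000):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0


theta = (0.0, 0.0)
for x, y in stream():
    # Learn from the instance, then let it go: nothing is stored.
    theta = online_update(theta, x, y, learning_rate=0.1)

print(theta)
```

The `learning_rate` argument is the knob discussed above: larger values make each new instance count for more, smaller values give the model more inertia.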

### 1.3.3 Instance v. Model Based Learning

How do ML algorithms generalize from the training data to make predictions on data they have never seen before?

#### 1.3.3.1 Instance-Based Learning

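Detecting similarity between instances is the gist of this technique: the system memorizes the training examples, then generalizes to new cases by comparing them to the stored examples with a similarity measure.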

![instance-based-learning-example.PNG](attachment:instance-based-learning-example.PNG)
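
As an illustration of instance-based learning, here is a minimal k-Nearest Neighbors sketch in plain Python. It assumes Euclidean distance as the similarity measure and uses a hypothetical training set; the function name `knn_predict` and the class names are invented for the example.

```python
from collections import Counter


def knn_predict(training_set, query, k=3):
    """Classify `query` by majority vote among its k closest training instances."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    neighbors = sorted(
        training_set, key=lambda item: squared_distance(item[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled instances: (features, label).
training_set = [
    ((1.0, 1.0), "triangle"), ((1.5, 2.0), "triangle"), ((2.0, 1.0), "triangle"),
    ((6.0, 6.0), "square"), ((7.0, 5.5), "square"), ((6.5, 7.0), "square"),
]
prediction = knn_predict(training_set, (2.0, 2.0))
print(prediction)
```

Note that there is no training step at all: the "model" is simply the stored training set plus the distance function.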

#### 1.3.3.2 Model-Based Learning

**Model-Based Learning**: Build a model from examples to make predictions for new instances.

![model-based-learning-example.PNG](attachment:model-based-learning-example.PNG)

Given a plot of data, trends can be fit with a variety of techniques. _Model selection_ is imperative to accuracy. A good, simple example of a model is the _linear model_, which predicts a dependent variable from a given feature. Each model has _parameters_, which are the values that get tuned to fit the model to the data. The $\theta$ values represent these parameters within the model below:

$$ y = \theta_{0} + \theta_{1}x $$

- x: Independent variable, or input data (generally known data)
- y: Dependent variable, or predicted value (generally unknown)

I won't go more into linear models because they should be known.

However, determining how _accurate_ a linear model is remains important. It can be done in one of two ways:

- **Utility function**: measures how good a model is.
- **Cost function**: measures how bad a model is.

For linear models, a good cost function would be the distance between the model's predictions and the training examples. _Training_ the model would be the application of a Linear Regression algorithm to find the parameter values that minimize the cost function.
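
A minimal sketch of this idea, assuming mean squared error as the cost function and the closed-form least-squares solution for a single feature. The data and the function names `mse_cost` and `fit_linear` are hypothetical.

```python
def mse_cost(theta0, theta1, xs, ys):
    """Mean squared distance between predictions and training targets."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_linear(xs, ys):
    """Closed-form least squares for y = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
theta0, theta1 = fit_linear(xs, ys)

# The fitted parameters give a lower cost than a hand-picked guess.
print(mse_cost(theta0, theta1, xs, ys), mse_cost(1.0, 2.0, xs, ys))
```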

Once the model is optimized, it can be treated as a simple input-output block: feed new data in and it returns predictions.

Many more complicated models exist, and they can depend on many more features.

## 1.4 Main Challenges of Machine Learning

- Bad algorithm
- Bad data

### 1.4.1 Bad Data

#### 1.4.1.1 Insufficient Quantity

- Simple problems: thousands of instances
- Complex problems: millions, if not billions, of instances

A famous paper co-authored by Peter Norvig, "The Unreasonable Effectiveness of Data", argued that the more data is available, the more different models converge to similar accuracies.

![effectiveness_of_data.PNG](attachment:effectiveness_of_data.PNG)

That said, they'll converge at different rates, so algorithm selection is still important.

#### 1.4.1.2 Nonrepresentative Training Data

The training data has to be representative of the new cases the model will see. Some things that can go wrong here:
- Sampling noise
 - If the sample is too small, pure chance can make it nonrepresentative
- Sampling bias
 - Even a large sample can be nonrepresentative if the sampling method is flawed. An example would be political polls that only used emails from university directories (biased towards liberals) or only used the phone numbers of retirement homes (biased towards conservatives).
 
#### 1.4.1.3 Poor-quality Data

Errors, outliers, falsified data (malicious or not), and noise can throw off a model. The data must therefore be cleaned as thoroughly as possible. Examples of this:
- Omitting outliers
- Handling missing data:
 - Ignore the attribute altogether
 - Fill it with a median/mean value
 - Train two models, one with the sporadic feature and one without
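
One of the options above, filling missing values with the median, can be sketched as follows. The sketch assumes missing entries are represented as `None`; the function name and the sample values are hypothetical.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


# Hypothetical feature column with two missing entries.
ages = [3, None, 7, 5, None, 10]
filled = fill_missing_with_median(ages)
print(filled)
```

The median is often preferred over the mean here because it is less affected by the outliers mentioned above.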
 
#### 1.4.1.4 Irrelevant Features

_Feature engineering_ is the process of coming up with a good set of features to train on, i.e. those most likely to be related to the value you are trying to predict. Many of the available features will likely be irrelevant. This process involves:

- _Feature selection_: finding most useful features of a set
- _Feature extraction_: combining existing features (e.g., dimensionality reduction)
- Gathering new data with new features
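
As a rough sketch of feature selection, the code below ranks features by the absolute Pearson correlation between each feature and the target. This is only one simple heuristic, chosen here as an assumption; correlation does not establish a causal link. The feature names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical car data: two candidate features and the target price.
features = {
    "age_years": [1, 2, 3, 4, 5, 6],
    "paint_code": [4, 1, 5, 2, 6, 3],
}
price = [19, 17, 15, 12, 11, 9]

# Most strongly correlated feature first.
ranking = sorted(
    features,
    key=lambda name: abs(pearson(features[name], price)),
    reverse=True,
)
print(ranking)
```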

### 1.4.2 Bad Algorithms

#### 1.4.2.1 Overfitting Training Data

**Overfitting**: Building a model that performs well on the training data but does not generalize to new data.

The example below is a high-degree _polynomial model_:

![polynomial_overfitting.PNG](attachment:polynomial_overfitting.PNG)

Deep neural networks might detect patterns in the noise itself. Polynomial models might lose the forest for a few trees. Patterns might also be detected in features with no plausible causal link (the number of certain vowels in the Japanese name for a country correlating with its happiness would be an example).

Some possible solutions to overfitting are:
- _Regularization_: simplify the model by using fewer parameters or by constraining them. For the linear model, setting $\theta_{0} = 0$ would remove one _degree of freedom_. Another solution would be to only allow a limited range of values for the model parameters.
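
The effect can be reproduced with a small sketch. It assumes hypothetical data that follows y = 2x + 1 plus alternating noise of 0.5, and compares a straight line with the degree-5 polynomial that passes through all six training points (built by Lagrange interpolation, a technique picked here only because it needs no libraries).

```python
def lagrange_predict(xs, ys, x_new):
    """Evaluate the unique degree-(n-1) polynomial through all n training points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x_new - xj) / (xi - xj)
        total += yi * weight
    return total


def line_predict(xs, ys, x_new):
    """Least-squares straight line fitted to the training points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y + slope * (x_new - mean_x)


# Hypothetical noisy training data: y = 2x + 1 with +/-0.5 alternating noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.5, 2.5, 5.5, 6.5, 9.5, 10.5]

# New instance just outside the training range; the noise-free value is 13.
poly_prediction = lagrange_predict(xs, ys, 6.0)
line_prediction = line_predict(xs, ys, 6.0)
print(poly_prediction, line_prediction)
```

The polynomial has zero training error, yet its prediction for the new instance is far from the underlying trend, while the simpler line stays close to it.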
- Gather more training data
- Reduce/filter noise in training data

An example of _regularization_ can be seen below:

![regularization_example.PNG](attachment:regularization_example.PNG)

The amount of _regularization_ is controlled with _hyperparameters_: parameters of the learning algorithm (not of the model) that are set prior to training and stay constant during it.
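
A minimal sketch of a regularization hyperparameter, assuming ridge-style (L2) regularization of the slope of a one-feature linear model. This particular penalty is an assumption made for illustration; the data and the names `fit_ridge_line` and `penalty` are hypothetical.

```python
def fit_ridge_line(xs, ys, penalty):
    """Fit y = theta0 + theta1 * x, minimizing squared error + penalty * theta1**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / (sxx + penalty)   # a larger penalty shrinks the slope
    theta0 = mean_y - theta1 * mean_x  # the intercept is left unpenalized
    return theta0, theta1


# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# `penalty` is the hyperparameter: it is fixed before training starts.
unregularized = fit_ridge_line(xs, ys, penalty=0.0)
regularized = fit_ridge_line(xs, ys, penalty=10.0)
print(unregularized, regularized)
```

With `penalty=0.0` the result is the ordinary least-squares line; raising the penalty flattens the slope, trading some fit on the training data for a more constrained model.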

#### 1.4.2.2 Underfitting the Training Data

This is the opposite problem of the above: the model is too simple to learn the underlying structure of the data. The main fixes involve:

- Selecting a more complex or powerful model with more parameters
- Giving the model better or more features
- Reducing constraints on the model


## 1.5 Testing and Validating

**Training Set**: Data used to fit the model's parameters. Typically 80% of the total data.

**Test Set**: Held-out data used to estimate the generalization error of the model.

_Generalization error_, or _out-of-sample error_, is the error rate of the model on instances it has never seen before.

If TRAINING ERROR is low, but the GENERALIZATION ERROR is high, you've OVERFIT the training data.
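
A minimal sketch of an 80/20 split in plain Python, assuming a fixed random seed so the split is reproducible. The function name `train_test_split` and the data are hypothetical and only mimic what a library helper would do.

```python
import random


def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and hold out `test_ratio` of it for testing."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of (x, y) pairs.
data = [(x, 2.0 * x + 1.0) for x in range(50)]
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

The model is fitted on `train_set` only; comparing its error there with its error on `test_set` gives the overfitting check described above.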

However, repeatedly tuning a model against a single held-out set can cause it to overfit that set as well. _Cross-validation_ is a method of splitting the training set into complementary subsets: the model is trained on some of the subsets and validated against the remaining ones.

No model is _a priori_ guaranteed to work better than any other, so the best path forward is often to make reasonable assumptions about your data and evaluate only a few reasonable models.
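
The splitting step of k-fold cross-validation can be sketched as below. This only generates the index sets; training and scoring a model on each fold is left out. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_instances))
        yield train, validation
        start += size


folds = list(k_fold_indices(10, k=3))
for train, validation in folds:
    print(train, validation)
```

Each instance lands in exactly one validation fold, so every part of the training set is used for both fitting and validation across the k rounds.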