# 10 Modeling and Machine Learning
__Math 3080: Fundamentals of Data Science__

Reading:
* Grus, Chapter 11 Machine Learning

Outline:
* What is a Model?
* What is Machine Learning?
  * Supervised/Unsupervised Learning
  * Batch/Online Learning
  * Instance-/Model-based Learning
* Overfitting and Underfitting
  * Bias-Variance Tradeoff
  * Correctness
* Feature Extraction / Dimensionality Reduction

-----
# What is a Model?
* "A mathematical (or probabilistic) relationship that exists between variables"
* A mathematical formula that describes patterns in data
* Used to predict the outcome of specific inputs

# What is Machine Learning?
* "Creating and using models that are *learned from data*"
* "Use existing data to develop models that we can use to *predict* various outcomes"

Machine Learning needs to be:
* useful for the future
* accurate
* fast
* __generalizable__
* __interpretable__
  * People want to know how you got your results, but they don't want details
* __certifiable/guaranteed__

Three criteria for categorizing ML algorithms:
1. Supervision (does the data have labels or not?)
2. Online vs. Batch Learning (can it learn on the fly?)
3. Instance-based vs. Model-based Learning (how does it use new data?)

The criteria are not mutually exclusive: every model can be described along all three of them.

## Supervised/Unsupervised Learning
* Supervised learning (Labelled Data)
  * Classification (Discrete Data)
  * Regression (Continuous Data)
* Unsupervised learning (Unlabelled Data)
  * Clustering (Discrete Data)
  * Visualization and Dimensionality Reduction (Continuous Data)
  * Anomaly Detection and Novelty Detection
  * Association Rule Learning
* Semisupervised learning (Partly automated, with some labels inserted at some point)

Flow Chart of [Types of Machine Learning 2 by Steve Brunton](https://youtu.be/0_lKUPYEYyY?t=209)
* System (Training Data)
  * Labeled?
    * Yes: Supervised
      * Discrete or Continuous?
        * Discrete: Classification
        * Continuous: Regression
    * No: Unsupervised
      * Often is what is referred to as "data mining"
      * Discrete or Continuous?
        * Discrete: Clustering
        * Continuous: Embedding (a.k.a. Feature Extraction, Dimensionality Reduction, Dimension Extraction, Pattern Extraction)
    * Partial: Semi-supervised
      * Model or Modify?
        * Model: Generative Models
        * Modify
          * --> Loops back to System
          


## Batch/Online Learning
__Batch Learning__ or __Offline Learning__

All of the data is provided up front and used to train the model. The model is then rolled out and runs without further learning. If you get more data, you re-train the model from scratch (on old data + new data) and roll out the upgraded model.

* Advantages
  * Automated - just run the program with each new dataset
* Disadvantages
  * Time-intensive - Each training often takes many hours
  * Resource-intensive - Often requires a lot of CPU, memory, disk space, etc.
  * If automated, it needs to carry the data with it (think of a rover on Mars or on the Moon, or of your smartphone)
  
__Incremental Learning__ or __Online Learning__

Train the system incrementally by feeding it data instances sequentially, either as:
* Individual data instances
* Mini-batches (small groups of data instances)

After training on a few initial mini-batches, the system keeps learning by processing each new chunk of data as another mini-batch.
* Great for systems that receive data as a continuous flow (stock, weather, etc.) that need to adapt rapidly and autonomously
* Great if you are limited on resources
  * Online learning algorithms can also be run offline on huge datasets that do not fit in memory, by loading and training on one chunk at a time
  
The speed at which the system adapts: __Learning Rate__
* A high learning rate adapts quickly to new data, but the system also tends to forget old data quickly
* A low learning rate makes the system learn more slowly, but it is less sensitive to noise in the new data
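
The effect of the learning rate can be sketched with a deliberately tiny online learner. This is an illustration only, not a specific algorithm from the reading: the "model" is a single number that tracks the level of a stream, and the function names (`online_update`, `run_stream`) and the data are hypothetical.

```python
def online_update(estimate, x, learning_rate):
    """One incremental step: move the current estimate toward the new instance."""
    return estimate + learning_rate * (x - estimate)

def run_stream(stream, learning_rate, start=0.0):
    """Feed instances one at a time and record the estimate after each step."""
    estimate = start
    history = []
    for x in stream:
        estimate = online_update(estimate, x, learning_rate)
        history.append(estimate)
    return history

# Hypothetical stream: the true level shifts from 10 to 20 halfway through
stream = [10.0] * 20 + [20.0] * 20

fast = run_stream(stream, learning_rate=0.5)    # adapts quickly, forgets quickly
slow = run_stream(stream, learning_rate=0.05)   # adapts slowly, smoother

print(f"final estimate, high rate: {fast[-1]:.2f}")
print(f"final estimate, low rate:  {slow[-1]:.2f}")

# A single bad instance (100 instead of about 10) moves the high-rate learner much further
fast_after_outlier = online_update(10.0, 100.0, learning_rate=0.5)
slow_after_outlier = online_update(10.0, 100.0, learning_rate=0.05)
print(fast_after_outlier, slow_after_outlier)
```

The high-rate learner ends up closer to the new level of 20, but the same property makes it react far more strongly to one bad instance.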

Disadvantage to Online Learning:
* Any bad data will quickly cause the model's performance to decline
  * Monitor the incoming data and switch learning off whenever bad data starts coming in

## Instance-based vs Model-based Learning
Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

### Instance-based Learning
The system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples.
* Results of a model are based on the similarity to other data
* Example: most of the stored pictures that look similar to this one are cats, so this must be a cat, too.
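
One common instance-based method is k-nearest neighbors, which appears later in the course list. The following is a minimal sketch, assuming each example is a pair of 2-D features plus a label; the function name `knn_classify` and the data are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(k, labeled_points, new_point):
    """Predict the majority label among the k stored examples closest to new_point."""
    by_distance = sorted(labeled_points,
                         key=lambda pair: math.dist(pair[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Hypothetical stored examples: (features, label)
examples = [
    ((1.0, 1.0), "cat"),
    ((1.2, 0.9), "cat"),
    ((0.8, 1.1), "cat"),
    ((5.0, 5.0), "dog"),
    ((5.2, 4.8), "dog"),
]

# "Training" is just storing the examples; all the work happens at prediction time
prediction = knn_classify(3, examples, (1.1, 1.0))
print(prediction)
```

Note that no parameters are fitted here: the stored examples themselves act as the model.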

### Model-based Learning
Build a model from a set of examples, then use that model to make *predictions*.

Steps for model-based learning:
1. Study the data
2. Model Selection - What type of model do you want? Linear Regression? 
   1. Define Parameters
      * By convention, $\theta$ is used to represent model parameters
      * Which variables serve as inputs and which will be the output?
   2. Specify a performance measure
      * A *utility function* (or *fitness function*) measures how good the model is
      * A *cost function* measures how bad the model is
3. Train your model
   * Train the model based on some data
   * Test the model on some unused data
4. Run the model to make predictions
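
The steps above can be sketched end to end with simple linear regression. This is a minimal sketch under stated assumptions: the data is synthetic (generated from $y = 3 + 2x$ plus noise), the split is 75/25, the cost function is mean squared error, and all function names are chosen here for illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares estimates of theta = (theta0, theta1) for y = theta0 + theta1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    theta1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
              / sum((x - x_mean) ** 2 for x in xs))
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

def predict(theta, x):
    theta0, theta1 = theta
    return theta0 + theta1 * x

def mean_squared_error(theta, xs, ys):
    """Cost function: measures how bad the model is (lower is better)."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Step 1: study the data (synthetic here: y = 3 + 2x + Gaussian noise)
random.seed(0)
xs = [i / 2 for i in range(40)]
ys = [3 + 2 * x + random.gauss(0, 1) for x in xs]

# Step 2: the model is a line with parameters theta; the performance measure is MSE

# Step 3: train on part of the data, test on the unused part
indices = list(range(len(xs)))
random.shuffle(indices)
train_idx, test_idx = indices[:30], indices[30:]
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

theta = fit_line(train_x, train_y)
print("theta:", theta)
print("train MSE:", mean_squared_error(theta, train_x, train_y))
print("test MSE: ", mean_squared_error(theta, test_x, test_y))

# Step 4: run the model on a new input
new_prediction = predict(theta, 25.0)
print("prediction at x = 25:", new_prediction)
```

Comparing the train and test MSE is the check that the model performs well on new instances, not just on the data it was fitted to.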

# Overfitting and Underfitting
![Figure 11-1 from Grus' Data Science from Scratch textbook](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781492041122/files/assets/dsf2_1101.png)

## Bias-Variance Tradeoff
* High bias and low variance correspond to underfitting
* Low bias but very high variance corresponds to overfitting
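
To make the tradeoff concrete, here is a small sketch that fits three models of increasing flexibility to the same ten training points: a constant, a line, and a polynomial that passes exactly through every training point. The data is hypothetical (true relationship $y = 2x$, with a fixed alternating error of $\pm 0.5$ on the training labels so the run is reproducible), and the function names are chosen here for illustration.

```python
def fit_constant(xs, ys):
    """Degree-0 model: always predict the training mean (high bias)."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Degree-1 model: least-squares line."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return lambda x: y_mean + slope * (x - x_mean)

def fit_interpolant(xs, ys):
    """Degree n-1 model: Lagrange polynomial through every training point (high variance)."""
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return model

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training data: y = 2x with an alternating +/-0.5 error
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]

# Unseen test points halfway between the training points, with the true values
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = [2 * x for x in test_x]

results = {}
for name, fit in [("constant", fit_constant),
                  ("line", fit_line),
                  ("interpolant", fit_interpolant)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:12s} train MSE = {results[name][0]:8.3f}   "
          f"test MSE = {results[name][1]:8.3f}")
```

The constant underfits (large error on both sets), while the interpolating polynomial overfits: its training error is zero, yet its error on the unseen points is far larger than the line's.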

## Correctness
* Confusion Matrix
* Accuracy
* Precision
* Recall
* F1 Score (the harmonic mean of precision and recall)
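
These measures all derive from the four cells of a binary confusion matrix: true positives (`tp`), false positives (`fp`), false negatives (`fn`), and true negatives (`tn`). The sketch below uses the standard definitions; the counts and the helper name `confusion_counts` are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Tally (tp, fp, fn, tn) from parallel lists of True/False labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp, fn, tn):
    """Of the predicted positives, the fraction that were actually positive."""
    return tp / (tp + fp)

def recall(tp, fp, fn, tn):
    """Of the actual positives, the fraction the model identified."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts: (tp, fp, fn, tn)
counts = (40, 10, 20, 30)
print("accuracy: ", accuracy(*counts))
print("precision:", precision(*counts))
print("recall:   ", recall(*counts))
print("F1 score: ", f1_score(*counts))

# The same counts can be tallied from labels and predictions
actual = [True, True, False, False, True]
predicted = [True, False, True, False, True]
print(confusion_counts(actual, predicted))
```

This sketch does not guard against division by zero, which occurs for precision when the model never predicts a positive and for recall when there are no actual positives.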

# Feature Extraction

# Machine Learning Models we will look at
* Linear Regression
* Logistic Regression
* Decision Trees
* If we have time:
  * k-Nearest Neighbors
  * Naive Bayes