# 10 Modeling and Machine Learning
__Math 3080: Fundamentals of Data Science__

Reading:
* Grus, Chapter 11 Machine Learning

Outline:
* What is a Model?
* What is Machine Learning?
  * Supervised/Unsupervised Learning
  * Batch/Online Learning
  * Instance-/Model-based Learning
* Overfitting and Underfitting
  * Bias-Variance Tradeoff
  * Correctness
* Feature Extraction / Dimensionality Reduction

-----
# What is a Model?
* "A mathematical (or probabilistic) relationship that exists between variables"
* A mathematical formula that describes patterns in data
* Used to predict the outcome of specific inputs

# What is Machine Learning?
* "Creating and using models that are *learned from data*"
* "Use existing data to develop models that we can use to *predict* various outcomes"

Machine Learning needs to be:
* useful for the future
* accurate
* fast
* __generalizable__
* __interpretable__
  * People want to know how you got your results, but they don't want details
* __certifiable/guaranteed__

Three different ways to categorize ML algorithms:
1. Supervision (does the data have labels or not?)
2. Online vs. Batch Learning (can it learn on the fly?)
3. Instance-based vs. Model-based Learning (how does it use new data?)

These criteria are not mutually exclusive: every model can be described along each of the three categories.

## Supervised/Unsupervised Learning
* Supervised learning (Labelled Data)
  * Classification (Discrete Data)
  * Regression (Continuous Data)
* Unsupervised learning (Unlabelled Data)
  * Clustering (Discrete Data)
  * Visualization and Dimensionality Reduction (Continuous Data)
  * Anomaly Detection and Novelty Detection
  * Association Rule Learning
* Semisupervised learning (Partly automated, with some labels inserted at some point)

Flow Chart of [Types of Machine Learning 2 by Steve Brunton](https://youtu.be/0_lKUPYEYyY?t=209)
* System (Training Data)
  * Labeled?
    * Yes: Supervised
      * Discrete or Continuous?
        * Discrete: Classification
        * Continuous: Regression
    * No: Unsupervised
      * Often what is referred to as "data mining"
      * Discrete or Continuous?
        * Discrete: Clustering
        * Continuous: Embedding (a.k.a. Feature Extraction, Dimensionality Reduction, Dimension Extraction, Pattern Extraction)
    * Partial: Semi-supervised
      * Model or Modify?
        * Model: Generative Models
        * Modify
          * --> Loops back to System
          


## Batch/Online Learning
__Batch Learning__ or __Offline Learning__

All of the data is provided up front and used to train a model. The model is then rolled out and runs without further learning. Also known as *offline learning*. If you get more data, you re-train your model (with old data + new data) and roll out the upgraded model.

* Advantages
  * Automated - just run the program with each new dataset
* Disadvantages
  * Time-intensive - Each training often takes many hours
  * Resource-intensive - Often requires a lot of CPU, memory, disk space, etc.
  * If automated, it needs to carry the data with it (think of a rover on Mars or on the Moon, or of your smartphone)
  
__Incremental Learning__ or __Online Learning__

Train the system incrementally by feeding it data instances, either as:
* Individual data instances
* Mini-batches (small groups of data instances)

Start by training on a few small groups; the system can then keep learning by processing each batch of new data as another mini-batch.
* Great for systems that receive data as a continuous flow (stock, weather, etc.) that need to adapt rapidly and autonomously
* Great if you are limited on resources
  * Even if you are offline, you can use online learning algorithms on huge datasets
  
The speed at which the system adapts: __Learning Rate__
* High rates adapt quickly, but tend to also forget old data quickly
* Low rates cause the model to learn more slowly, but make it less sensitive to noise in the new data
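
The effect of the learning rate can be sketched with a very small online learner. This is only an illustration, not a specific algorithm from the reading: the function name `online_mean` and the data stream are hypothetical, and the "model" is just a running estimate that moves a fraction of the way toward each new instance.

```python
def online_mean(stream, learning_rate, estimate=0.0):
    """Update a running estimate one data instance at a time."""
    history = []
    for x in stream:
        # Move the estimate a fraction of the way toward the new instance.
        estimate += learning_rate * (x - estimate)
        history.append(estimate)
    return history

# Hypothetical stream: the signal sits at 10, then jumps to 20.
stream = [10.0] * 5 + [20.0] * 5

fast = online_mean(stream, learning_rate=0.9)  # adapts quickly, forgets old data
slow = online_mean(stream, learning_rate=0.1)  # adapts slowly, smooths over changes

print("fast learner ends at:", round(fast[-1], 3))
print("slow learner ends at:", round(slow[-1], 3))
```

With the high rate the estimate has essentially reached the new level by the end of the stream, while with the low rate it is still well short of it.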

Disadvantage to Online Learning:
* Any bad data will quickly cause the model's performance to decline
  * Watch data to switch learning off whenever bad data starts coming in

## Instance-based vs Model-based Learning
Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

### Instance-based Learning
The system learns the examples by heart, then generalizes to new cases by comparing them to the learned examples.
* Results of a model are based on the similarity to other data
* Example: most of the pictures that look like this one are cats, so this must be a cat, too.
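
One common instance-based method is k-nearest neighbors, which appears later in the course list. The sketch below is a minimal version under stated assumptions: the measurements, labels, and the function names `distance` and `knn_predict` are hypothetical, and similarity is taken to be Euclidean distance.

```python
from collections import Counter

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(examples, new_point, k=3):
    """Label new_point by majority vote among its k closest stored examples."""
    by_distance = sorted(examples, key=lambda ex: distance(ex[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (weight in kg, ear length in cm) measurements with labels.
examples = [
    ((4.0, 6.0), "cat"),
    ((4.5, 6.5), "cat"),
    ((5.0, 7.0), "cat"),
    ((20.0, 12.0), "dog"),
    ((25.0, 13.0), "dog"),
]

print(knn_predict(examples, (4.2, 6.2)))    # closest stored examples are cats
print(knn_predict(examples, (22.0, 12.5)))  # two of the three closest are dogs
```

Note that nothing is "trained" here: the stored examples are the model, and all the work happens at prediction time.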

### Model-based Learning
Build a model from a set of examples, then use that model to make *predictions*.

Steps for model-based learning:
1. Study the data
2. Model Selection - What type of model do you want? Linear Regression? 
   1. Define Parameters
      * By convention, $\theta$ is used to represent model parameters
      * Which variables serve as inputs and which will be the output?
   2. Specify a performance measure
      * A *utility function* (or *fitness function*) measures how good the model is
      * A *cost function* measures how bad the model is
3. Train your model
   * Train the model based on some data
   * Test the model on some unused data
4. Run the model to make predictions
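
The steps above can be sketched with simple linear regression. This is a minimal illustration, not the course's implementation: the data (hours studied vs. exam score) is made up, the helper names are hypothetical, the cost function is assumed to be mean squared error, and the train/test split simply holds out the last two points where a real split would usually be random.

```python
def fit_line(xs, ys):
    """Least-squares estimates of theta = (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def predict(theta, x):
    """Evaluate the fitted line at a single input x."""
    intercept, slope = theta
    return intercept + slope * x

def mean_squared_error(theta, xs, ys):
    """Cost function: larger values mean a worse fit."""
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Steps 1-2: hypothetical data; input = hours studied, output = exam score.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [52, 55, 61, 64, 70, 73, 79, 82]

# Step 3: train on part of the data, test on the unused remainder.
train_x, train_y = hours[:6], exam_scores[:6]
test_x, test_y = hours[6:], exam_scores[6:]

theta = fit_line(train_x, train_y)
print("theta (intercept, slope):", theta)
print("training cost:", mean_squared_error(theta, train_x, train_y))
print("test cost:    ", mean_squared_error(theta, test_x, test_y))

# Step 4: run the model on a new input.
print("predicted score for 9 hours:", predict(theta, 9))
```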

# Overfitting and Underfitting
![Figure 11-1 from Grus' Data Science from Scratch textbook](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781492041122/files/assets/dsf2_1101.png)

How do we know whether our model is overfitting, underfitting, or is a good balance? We will look at this in two ways:
* Bias-Variance Tradeoff
* Correctness

## Bias-Variance Tradeoff
* High bias and low variance correspond to underfitting
* Low bias but very high variance corresponds to overfitting
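
One rough way to see this tradeoff is to compare training error with error on unseen data. The sketch below uses made-up data (roughly $y = 2x$ plus noise) and three hypothetical models: a constant (too simple), a straight line, and a "memorizer" that returns the output of the nearest training point. The memorizer is a stand-in chosen for brevity to play the role of an overly flexible model; it is not the high-degree polynomial shown in the figure.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_constant(xs, ys):
    """Underfit: always predict the mean training output."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """Overfit: return the output of the closest training input."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Hypothetical data: y is about 2x, with noise of +1 or -1.
train_x, train_y = [0, 2, 4, 6, 8], [1, 3, 9, 11, 17]
test_x, test_y = [0.5, 2.5, 4.5, 6.5], [0, 6, 8, 14]

errors = {}
for name, fit in [("constant", fit_constant),
                  ("line", fit_line),
                  ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    errors[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:10s} train MSE = {errors[name][0]:.2f}   test MSE = {errors[name][1]:.2f}")
```

The constant model does poorly on both sets (underfitting), while the memorizer is perfect on the training data but clearly worse than the line on the test data (overfitting).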

## Correctness
* Confusion Matrix

|                     | Positive Value | Negative Value |
| ------------------- | -------------- | -------------- |
| Positive Prediction | True positive  | False positive<br>Type I Error |
| Negative Prediction | False negative<br>Type II Error | True negative  |
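
The four cells of a confusion matrix can be tallied directly from paired predictions and true values. This is a small sketch under assumptions: the function name `confusion_counts` and the example lists are hypothetical, and predictions and actual values are represented as booleans.

```python
from collections import Counter

def confusion_counts(predicted, actual):
    """Return (tp, fp, fn, tn) for paired lists of booleans."""
    counts = Counter(zip(predicted, actual))
    tp = counts[(True, True)]    # predicted positive, actually positive
    fp = counts[(True, False)]   # predicted positive, actually negative (Type I)
    fn = counts[(False, True)]   # predicted negative, actually positive (Type II)
    tn = counts[(False, False)]  # predicted negative, actually negative
    return tp, fp, fn, tn

predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]

print(confusion_counts(predicted, actual))  # (2, 1, 1, 2)
```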

Consider a (deliberately silly) new test for leukemia: predict that a person has leukemia if and only if their name is "Luke", since the two words sound similar. Here's some data:

|          | Has leukemia | No leukemia | Total           |
| -------- | :----------: | :---------: | :-------------: |
| Luke     | 70           | 4,930       | __5,000__       |
| Not Luke | 13,930       | 981,070     | __995,000__     |
| *Total*  | *14,000*     | *986,000*   | __*1,000,000*__ |

* Accuracy

|                     | Positive Value           | Negative Value               |
| ------------------- | ------------------------ | ---------------------------- |
| Positive Prediction | __*True positive*__ (70) | *False positive* (4930)      |
| Negative Prediction | *False negative* (13930) | __*True negative*__ (981070) |

$$Accuracy = \frac{Correct~Predictions}{All~Values} = \frac{tp+tn}{tp+fp+fn+tn}$$


```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

accuracy(70, 4930, 13930, 981070)
```

```
0.98114
```

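An accuracy of about 98% looks impressive, yet this test is clearly useless, so accuracy alone can be misleading. There are better measures of how good the results are: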
* Precision

|                     | Positive Value           | Negative Value          |
| ------------------- | ------------------------ | ----------------------- |
| Positive Prediction | __*True positive*__ (70) | *False positive* (4930) |
| Negative Prediction | False negative           | True negative           |

$$Precision = \frac{True~Positives}{Positive~Predictions} = \frac{tp}{tp+fp}$$

```python
def precision(tp, fp):
    """Fraction of positive predictions that were actually positive."""
    return tp / (tp + fp)

precision(70, 4930)
```

```
0.014
```


* Recall

|                     | Positive Value           | Negative Value |
| ------------------- | ------------------------ | -------------- |
| Positive Prediction | __*True positive*__ (70) | False positive |
| Negative Prediction | *False negative* (13930) | True negative  |

$$Recall = \frac{True~Positives}{Positive~Values} = \frac{tp}{tp+fn}$$


```python
def recall(tp, fn):
    """Fraction of actual positives that the model identified."""
    return tp / (tp + fn)

recall(70, 13930)
```

```
0.005
```


* *F1* score (the harmonic mean of precision and recall)

$$F1 = \frac{2 \cdot precision \cdot recall}{precision+recall}$$

```python
def scores(tp, fp, fn, tn):
    """Print accuracy, precision, recall, and F1 for one confusion matrix."""
    pre = precision(tp, fp)
    rec = recall(tp, fn)
    print("Accuracy:  {0}".format(accuracy(tp, fp, fn, tn)))
    print("Precision: {0}".format(pre))
    print("Recall:    {0}".format(rec))
    print("F1 score:  {0}".format(2 * pre * rec / (pre + rec)))

scores(70, 4930, 13930, 981070)
```

```
Accuracy:  0.98114
Precision: 0.014
Recall:    0.005
F1 score:  0.00736842105263158
```


The choice of model is usually based on the tradeoff between these measures. A model that says "yes" more often tends to give more false positives, which lowers precision. A model that says "no" more often tends to give more false negatives, which lowers recall. Choosing a model therefore involves balancing precision against recall. In other words, we typically choose the model that maximizes the *F1* score.
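
The tradeoff can be sketched by taking one set of model outputs and varying how readily it says "yes". Everything here is illustrative: the confidence scores, the labels, the thresholds, and the function name `precision_recall_f1` are hypothetical, and a "model" is assumed to be nothing more than a cutoff on the score.

```python
def precision_recall_f1(confidences, labels, threshold):
    """Predict 'yes' when confidence >= threshold, then score the predictions."""
    predicted = [c >= threshold for c in confidences]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum(not p and y for p, y in zip(predicted, labels))
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

# Hypothetical model confidences and the true labels for eight cases.
confidences = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [True, True, False, True, True, False, True, False]

for threshold in (0.25, 0.5, 0.85):
    pre, rec, f1 = precision_recall_f1(confidences, labels, threshold)
    print(f"threshold {threshold}: precision={pre:.3f} recall={rec:.3f} F1={f1:.3f}")
```

A low threshold says "yes" often (high recall, lower precision); a high threshold says "no" often (high precision, lower recall). Comparing the F1 values is one way to pick among them.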

#### Example of calculating Correctness
You want to see if doughnut shops increase sales during finals week. You create a model to predict whether a particular store will see an increase in sales, then afterwards obtain data from them. Here are the results:

|                       | Sales increased | Sales didn't increase |
| --------------------- | --------------- | --------------------- |
| Increase Predicted    | 42              | 21                    |
| No increase Predicted | 7               | 30                    |

Calculate the accuracy, precision, recall, and F1 score of your model predictions.

# Feature Extraction
Sometimes, the data isn't very clear, and it is hard to tell what influences the data. This is where we start looking for features. A __feature__ is an input we provide to our model. Some features/inputs will have a large effect on the model, others will not. For example, 
* the number of hours worked has a large effect on an employee's salary
* the exact task the employee is doing may not have as large an effect on the salary

Here is another example. To detect whether an email is spam or not, we might ask:
* Does the email contain the word *prince*?
* Does the email contain the word *Mongolia*?
* Does the email contain the phrase *Prince of Mongolia*?
* How many times does the letter *d* appear?
* What was the domain of the sender?

The answers to these questions could be Yes or No (1 or 0), a number, or a choice from a set of options. This is typical of feature extraction.
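
The questions above can be turned into a feature extractor. This is a sketch under assumptions: an email is represented as a dictionary with hypothetical `sender` and `body` keys, the feature names are made up, matching is case-insensitive, and the sample message is invented.

```python
def extract_features(email):
    """Turn one email (a dict with 'sender' and 'body') into a feature dict."""
    body = email["body"].lower()  # assumption: ignore upper/lower case
    return {
        "contains_prince": int("prince" in body),                          # 1 or 0
        "contains_mongolia": int("mongolia" in body),                      # 1 or 0
        "contains_prince_of_mongolia": int("prince of mongolia" in body),  # 1 or 0
        "count_d": body.count("d"),                                        # a number
        "sender_domain": email["sender"].split("@")[-1],                   # a choice from a set
    }

# Hypothetical message.
email = {
    "sender": "royal.office@example.com",
    "body": "Greetings from the Prince of Mongolia. Kindly send your bank details.",
}

features = extract_features(email)
print(features)
```

Each email becomes a small, fixed set of values that a model can take as input, regardless of how long the original message was.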


# Machine Learning Models we will look at
* Linear Regression (Grus, Chapters 14-15)
* Logistic Regression (Grus, Chapter 16)
* Decision Trees (Grus, Chapter 17)
* If we have time:
  * k-Nearest Neighbors (Grus, Chapter 12)
  * Naive-Bayes (Grus, Chapter 13)