# Chapter 11. Machine learning

## 11.1 Modeling

A model is simply a specification of a mathematical or probabilistic relationship that exists between different variables.

For instance, if you’re trying to raise money for your social networking site, you might build a business model (likely in a spreadsheet) that takes inputs like “number of users,” “ad revenue per user,” and “number of employees” and outputs your annual profit for the next several years. A cookbook recipe entails a model that relates inputs like “number of eaters” and “hungriness” to quantities of ingredients needed. And if you’ve ever watched poker on television, you know that each player’s “win probability” is estimated in real time based on a model that takes into account the cards that have been revealed so far and the distribution of cards in the deck.

The business model is probably based on simple mathematical relationships: profit is revenue minus expenses, revenue is units sold times average price, and so on. The recipe model is probably based on trial and error—someone went in a kitchen and tried different combinations of ingredients until they found one they liked. And the poker model is based on probability theory, the rules of poker, and some reasonably innocuous assumptions about the random process by which cards are dealt.
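
As a toy illustration of the business model described above, its relationships can be written directly as code. This is only a sketch: the function name, the parameter names, and the numbers are hypothetical, chosen to show the arithmetic.

```python
def annual_profit(units_sold: int,
                  average_price: float,
                  expenses: float) -> float:
    """Hypothetical business model: profit is revenue minus expenses."""
    revenue = units_sold * average_price  # revenue is units sold times average price
    return revenue - expenses

# Made-up inputs, purely for illustration
profit = annual_profit(units_sold=1_000, average_price=20.0, expenses=15_000.0)
```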

## 11.2 What is machine learning

Machine learning refers to creating and using models that are learned from data. In other contexts this is also called predictive modeling or data mining.

## 11.3 Overfitting and underfitting

Overfitting means producing a model that performs well on the data it was trained on but generalizes poorly to new data. Underfitting means producing a model that doesn't perform well even on the training data. The simplest way to detect overfitting is to split the data, train on one part, and measure performance on the other.

In [4]:
import random 
from typing import TypeVar, List, Tuple

X = TypeVar('X') # generic type to represent a data point 

def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
    '''Split data into fractions [prob, 1 - prob]'''
    data = data[:] # Make a shallow copy 
    random.shuffle(data) # Because shuffle modifies the list 
    cut = int(len(data) * prob) # Use prob to find a cutoff
    return data[:cut], data[cut:] # Split the shuffled list 

data = list(range(1000))
train, test = split_data(data, 0.75)

assert len(train) == 750
assert len(test) == 250
assert sorted(train + test) == data

In [6]:
Y = TypeVar('Y') # generic type to represent output variables 

def train_test_split(xs: List[X],
                     ys: List[Y],
                     test_pct: float) -> Tuple[List[X], List[X], List[Y], List[Y]]:
    '''Split paired xs and ys into train and test sets, keeping each x with its y'''
    # Generate the indices and split them
    idxs = list(range(len(xs)))
    train_idxs, test_idxs = split_data(idxs, 1 - test_pct)
    
    return ([xs[i] for i in train_idxs], # x_train
            [xs[i] for i in test_idxs], # x_test
            [ys[i] for i in train_idxs], # y_train
            [ys[i] for i in test_idxs] # y_test
           )

xs = list(range(1000))
ys = [2 * x for x in xs]

x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.25)
            
assert len(x_train) == len(y_train) == 750
assert len(x_test) == len(y_test) == 250

assert all(y == 2 * x for x, y in zip(x_train, y_train))
assert all(y == 2 * x for x, y in zip(x_test, y_test))

In [None]:
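# SomeKindOfModel is a placeholder for any model class with train and test methods;
# this cell shows the workflow and is not runnable as-is
model = SomeKindOfModel()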
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.33)
model.train(x_train, y_train)
performance = model.test(x_test, y_test)

If the model was overfit to the training data, it should perform poorly on the separate test data. Two things can still go wrong:

1. There may be common patterns in the test and training data that wouldn't generalize to a larger dataset. For example, imagine that your dataset consists of user activity, with one row per user per week. In such a case, most users will appear in both the training data and the test data, and certain models might learn to identify users rather than discover relationships involving attributes.
2. A bigger problem is if you use the test/train split not just to judge a model but also to choose from among many models. In that case, although each individual model may not be overfit, “choosing a model that performs best on the test set” is a meta-training that makes the test set function as a second training set. (Of course the model that performed best on the test set is going to perform well on the test set.) In such a situation, you should split the data into three parts: a training set for building models, a validation set for choosing among trained models, and a test set for judging the final model.
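
The three-part split described in the second item can be sketched as follows. This is a minimal sketch, not a definitive implementation: `train_validation_test_split` and its parameters are hypothetical names, and it simply shuffles a copy of the data and cuts it twice.

```python
import random
from typing import List, Tuple, TypeVar

X = TypeVar('X')  # generic type to represent a data point

def train_validation_test_split(data: List[X],
                                validation_pct: float,
                                test_pct: float) -> Tuple[List[X], List[X], List[X]]:
    """Shuffle a copy of data and cut it into (train, validation, test)."""
    data = data[:]                    # shallow copy, so the caller's list is untouched
    random.shuffle(data)
    n_test = int(len(data) * test_pct)
    n_validation = int(len(data) * validation_pct)
    test = data[:n_test]
    validation = data[n_test:n_test + n_validation]
    train = data[n_test + n_validation:]  # everything left over
    return train, validation, test

data = list(range(1000))
train, validation, test = train_validation_test_split(data, 0.25, 0.25)
```

Models are fit on `train`, compared on `validation`, and only the chosen model is scored once on `test`.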

## 11.4 Correctness

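Given a binary label, every prediction lands in one of four cells of a confusion matrix: true positive (tp), false positive (fp), false negative (fn), or true negative (tn). The functions below take these four counts as inputs.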
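
To make the four counts concrete, here is a small sketch that tallies them from paired predictions and labels. The helper name `confusion_counts` and the sample data are hypothetical; the return order matches the `(tp, fp, fn, tn)` argument order used by the metric functions in this section.

```python
from typing import List, Tuple

def confusion_counts(predicted: List[bool],
                     actual: List[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for paired binary predictions and labels."""
    assert len(predicted) == len(actual), "need one prediction per label"
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)              # predicted True, actually True
    fp = sum(p and not a for p, a in pairs)          # predicted True, actually False
    fn = sum((not p) and a for p, a in pairs)        # predicted False, actually True
    tn = sum((not p) and (not a) for p, a in pairs)  # predicted False, actually False
    return tp, fp, fn, tn

# Made-up predictions and labels, purely for illustration
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
tp, fp, fn, tn = confusion_counts(predicted, actual)
```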

In [7]:
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

assert accuracy(70, 4930, 13930, 981070) == 0.98114

In [8]:
def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp)

assert precision(70, 4930, 13930, 981070) == 0.014

In [9]:
def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn)

assert recall(70, 4930, 13930, 981070) == 0.005

In [10]:
def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)
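
The F1 score is the harmonic mean of precision and recall. As a quick check on the same confusion matrix used in the earlier assertions, the sketch below re-declares the helpers so it runs on its own. `f1_from_counts` is a hypothetical name for the algebraically equivalent form 2tp / (2tp + fp + fn).

```python
import math

def precision(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fp: int, fn: int, tn: int) -> float:
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

def f1_from_counts(tp: int, fp: int, fn: int, tn: int) -> float:
    # 2pr / (p + r) simplifies to 2tp / (2tp + fp + fn)
    return 2 * tp / (2 * tp + fp + fn)

score = f1_score(70, 4930, 13930, 981070)

# Both forms agree up to floating-point rounding
assert math.isclose(score, f1_from_counts(70, 4930, 13930, 981070))
```

With precision 0.014 and recall 0.005, the F1 score comes out at roughly 0.0074, so a model with 98% accuracy can still score very poorly here.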

## 11.5 The bias-variance tradeoff

High bias and low variance typically correspond to underfitting.

Low bias and high variance typically correspond to overfitting.

Holding model complexity constant, the more data you have, the harder it is to overfit.

On the other hand, more data won't help with bias. If your model doesn't use enough features to capture regularities in the data, throwing more data at it won't help.
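
The contrast can be sketched with three toy models on synthetic data. Everything here is an assumption made for illustration: the data (y = 2x plus Gaussian noise), the model names, and the choice of a nearest-neighbor "memorizer" as the high-variance model. The constant model ignores x entirely (high bias), while the memorizer reproduces the training noise exactly (high variance).

```python
import random
from typing import Callable, List, Tuple

random.seed(0)  # fixed seed so the sketch is repeatable

Point = Tuple[float, float]

# Hypothetical data: y = 2x plus Gaussian noise
xs = [i / 10 for i in range(200)]
points: List[Point] = [(x, 2 * x + random.gauss(0, 1)) for x in xs]
random.shuffle(points)
train, test = points[:150], points[150:]

def mse(model: Callable[[float], float], data: List[Point]) -> float:
    """Mean squared error of model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)

def constant_model(x: float) -> float:
    """High bias: ignores x and always predicts the training mean."""
    return mean_y

# Least-squares line fit on the training data
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

def linear_model(x: float) -> float:
    """Matches the true shape of the data."""
    return slope * x + intercept

def memorizing_model(x: float) -> float:
    """High variance: returns the y of the nearest training x, noise included."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

for name, model in [("constant", constant_model),
                    ("linear", linear_model),
                    ("memorizer", memorizing_model)]:
    print(f"{name:>10}: train MSE {mse(model, train):8.3f}, "
          f"test MSE {mse(model, test):8.3f}")
```

The constant model has a large error on both sets, which is the signature of underfitting. The memorizer has zero training error but a nonzero test error, which is the signature of overfitting.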

## 11.6 Feature extraction and selection

Features are whatever inputs we provide to our model. The types of features we have constrain the types of models we can use:

1. The Naive Bayes classifier is suited to yes-or-no features.
2. Regression models require numeric features, which could include dummy variables that are 0s and 1s.
3. Decision trees can deal with numeric or categorical data.
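
The dummy variables mentioned for regression models can be produced with a simple one-hot encoding. This is a minimal sketch: the helper name `one_hot_encode` and the sample feature are hypothetical.

```python
from typing import Dict, List

def one_hot_encode(values: List[str]) -> List[Dict[str, int]]:
    """Turn a categorical feature into 0/1 dummy variables, one per category."""
    categories = sorted(set(values))  # fixed ordering so the output is repeatable
    return [{category: int(value == category) for category in categories}
            for value in values]

# Hypothetical categorical feature
experience_levels = ["junior", "senior", "mid", "junior"]
encoded = one_hot_encode(experience_levels)
```

Each row ends up with exactly one 1. For a regression with an intercept term, it is common to drop one of the dummy columns, because the full set always sums to 1 and is therefore perfectly collinear with the intercept.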

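When there are too many features, dimensionality reduction and regularization are two ways to pare them down or penalize their use.

Choosing features takes a combination of experience and domain expertise.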

## 11.7 For further exploration