# week 4: learning

## supervised learning

- given a data set of input-output pairs, learn a function to map inputs to outputs

**classification**: supervised learning task of learning a function mapping an input point to a discrete category

- given a table of data with variables `humidity` and `pressure`
- it is supervised learning because a human has labeled each entry with `rain` or `no rain`
- we come up with a hypothesis function $h$ that approximates the true mapping from inputs to labels
- we can plot the data points on a graph; more input variables mean more dimensions, which is no obstacle for a computer even above three

**nearest-neighbor classification**: algorithm that, given input data, chooses the class of the nearest data point to that input

**k-nearest-neighbor**: chooses the most common class out of `k` nearest data points
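
A minimal sketch of k-nearest-neighbor classification, assuming Euclidean distance and a simple majority vote (ties go to whichever label was seen first). The `(humidity, pressure)` readings and the function name `k_nearest_neighbor` are made up for illustration. With `k=1` this is plain nearest-neighbor classification.

```python
from collections import Counter
import math


def k_nearest_neighbor(training, point, k=3):
    """Return the most common label among the k training points nearest to point.

    training is a list of (features, label) pairs; distance is Euclidean.
    """
    # Sort training points by distance to the query point and keep the k closest
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], point))[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (humidity, pressure) readings, made up for illustration
training = [
    ((0.9, 0.2), "rain"),
    ((0.8, 0.3), "rain"),
    ((0.7, 0.1), "rain"),
    ((0.2, 0.9), "no rain"),
    ((0.1, 0.8), "no rain"),
]

print(k_nearest_neighbor(training, (0.85, 0.25), k=1))
print(k_nearest_neighbor(training, (0.3, 0.7), k=3))
```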

**linear regression**: fitting a linear function of the inputs, used here as a decision boundary that separates the two classes

- given $x_1$ = Humidity, $x_2$ = Pressure

$$$
h(x_1, x_2)=\text{Rain if } w_0+w_1x_1+w_2x_2\ge0, \text{No Rain otherwise}
$$$

**Weight Vector:** $\vec w=(w_0, w_1, w_2)$

**Input Vector:** $\vec x=(1, x_1, x_2)$

$h_w(x) = 1 \text{ if } \vec w \cdot\vec x \ge0, 0\text{ otherwise}$


## perceptron learning rule

- given data point $(\vec x, y)$, update each weight according to:
$$$
w_i=w_i+\alpha(y-h_w(\vec x))\times x_i
$$$
where $\alpha$ is the learning rate
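
The update rule above can be sketched as follows, using the hard threshold $h_w$ defined earlier and an input vector that starts with a constant 1. The data points, the starting weights of zero, the number of passes over the data, and the default `alpha=0.1` are all assumptions chosen for illustration.

```python
def predict(weights, x):
    """Hard threshold h_w(x): 1 if w . x >= 0, else 0. x includes the leading 1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0


def perceptron_update(weights, x, y, alpha=0.1):
    """Apply w_i = w_i + alpha * (y - h_w(x)) * x_i to every weight."""
    error = y - predict(weights, x)
    return [w + alpha * error * xi for w, xi in zip(weights, x)]


def train(data, epochs=20, alpha=0.1):
    """Sweep over the data repeatedly; the number of epochs is an arbitrary choice."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            weights = perceptron_update(weights, x, y, alpha)
    return weights


# Hypothetical, linearly separable data: x = (1, x1, x2), y = 1 means "Rain"
data = [
    ((1, 0.9, 0.2), 1),
    ((1, 0.8, 0.1), 1),
    ((1, 0.2, 0.9), 0),
    ((1, 0.1, 0.7), 0),
]

weights = train(data)
print(weights)
print([predict(weights, x) for x, _ in data])
```

When the prediction is already correct, `y - h_w(x)` is 0 and the weights do not move; only misclassified points change them.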

- this rule relies on a hard threshold function, so the output is only ever 0 or 1
- sometimes we want more than a hard 0 or 1, such as a likelihood or a measure of confidence

**soft threshold**: using a logistic function to output a *likelihood*, not just a hard 0 or 1
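
As a rough illustration, the standard logistic function $1/(1+e^{-z})$ can be applied to $\vec w \cdot \vec x$ in place of the hard threshold. The weights below are hypothetical values picked by hand, not learned from data.

```python
import math


def soft_threshold(weights, x):
    """Logistic function of w . x: a value strictly between 0 and 1."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-z))


weights = (0.0, 4.0, -4.0)  # hypothetical weights, not learned from data

# A point well on the "rain" side of the boundary gives a value close to 1
print(soft_threshold(weights, (1, 0.9, 0.2)))
# A point exactly on the boundary (w . x = 0) gives 0.5
print(soft_threshold(weights, (1, 0.5, 0.5)))
```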

## support vector machines

choosing a good boundary from a range of "valid" ones

better boundaries are as far as possible from the data points of both classes

**maximum margin separator**: boundary that maximises the distance between itself and the nearest data points
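
To make the idea of a margin concrete, here is a small sketch that measures how far the closest point lies from a given linear boundary. It only compares two hand-picked boundaries; it does not search for the maximum margin separator, which is what an SVM solver does. The points and both boundaries are made up for illustration.

```python
import math


def margin(weights, points):
    """Smallest distance from any point to the boundary w0 + w1*x1 + w2*x2 = 0."""
    w0, w1, w2 = weights
    norm = math.hypot(w1, w2)
    return min(abs(w0 + w1 * x1 + w2 * x2) / norm for x1, x2 in points)


# Hypothetical points: the first two belong to one class, the last two to the other
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]

# Two candidate boundaries that both separate the classes
close_to_one_class = (-4.0, 1.0, 1.0)   # the line x1 + x2 = 4
between_the_classes = (-5.5, 1.0, 1.0)  # the line x1 + x2 = 5.5

print(margin(close_to_one_class, points))
print(margin(between_the_classes, points))
```

Both lines classify the four points correctly, but the second leaves more room on either side, so it is the better choice of the two.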

## regression

supervised learning task of learning a function mapping an input point to a continuous value



## evaluating hypotheses

**loss function**: function that expresses how poorly our hypothesis performs, a loss of utility
$$$
L(\text{actual}, \text{predicted})=0\text{ if actual} = \text{predicted}, 1\text{ otherwise}
$$$

- we can also take into account how far away it was, using L1
$$$
L_1(a, p) = |a - p|
$$$
where `a` = actual, `p`= predicted
- or we can use L2
$$$
L_2(a,p)=(a-p)^2
$$$
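
The three loss functions can be written directly in Python. This sketch assumes the 0-1 loss is 0 for a correct prediction and 1 for a wrong one; the sample values are made up for illustration.

```python
def zero_one_loss(actual, predicted):
    """0 when the prediction is right, 1 when it is wrong."""
    return 0 if actual == predicted else 1


def l1_loss(actual, predicted):
    """Absolute difference between actual and predicted values."""
    return abs(actual - predicted)


def l2_loss(actual, predicted):
    """Squared difference; penalises large errors more heavily than L1."""
    return (actual - predicted) ** 2


# Hypothetical actual and predicted values
actual = [3.0, 5.0, 8.0]
predicted = [2.0, 5.0, 11.0]

total_l1 = sum(l1_loss(a, p) for a, p in zip(actual, predicted))
total_l2 = sum(l2_loss(a, p) for a, p in zip(actual, predicted))
print(total_l1, total_l2)
```

The prediction that is off by 3 contributes 3 to the L1 total but 9 to the L2 total, which is why L2 is the usual choice when outliers should be punished harder.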

## overfitting

- a model that fits too closely to a particular data set and may therefore fail to generalise to future data
- this happens if you only care about minimising loss
- we can build a better cost function with Occam's razor
- we can penalise complexity with $\lambda$
$$$
\text{cost}(h)=\text{loss}(h)+\lambda\,\text{complexity}(h)
$$$
**regularisation**: penalising a hypothesis that is more complex in favour of simpler, more general hypotheses
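
A toy sketch of how $\lambda$ changes which hypothesis wins. The two hypotheses, their loss values, and their complexity scores are invented numbers; how complexity is actually measured depends on the model.

```python
def cost(loss, complexity, lam):
    """cost(h) = loss(h) + lambda * complexity(h)."""
    return loss + lam * complexity


# Hypothetical (loss, complexity) pairs for two candidate hypotheses
hypotheses = {
    "straight line": (4.0, 1),
    "wiggly curve": (3.0, 9),
}


def best_hypothesis(lam):
    """Pick the hypothesis with the lowest regularised cost."""
    return min(hypotheses, key=lambda name: cost(*hypotheses[name], lam))


print(best_hypothesis(0.0))  # only loss matters
print(best_hypothesis(0.5))  # complexity is penalised
```

With $\lambda = 0$ the lower-loss curve wins; with $\lambda = 0.5$ its complexity penalty outweighs the small gain in loss and the simpler line is preferred.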

**holdout cross-validation**: splitting the data into a training set and a testing set, such that learning happens on the training set and the model is evaluated on the testing set

**k-fold cross-validation**: splitting data into `k` sets, and experimenting `k` times, using each set as a test once, and using the remaining data as a training set
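
A minimal sketch of the k-fold split, written in plain Python. The helper name `k_fold_splits` is hypothetical, and it assumes the data has already been shuffled; the integers stand in for real rows.

```python
def k_fold_splits(data, k):
    """Yield (training, testing) pairs, using each of the k folds as the test set once."""
    # Deal the rows into k folds, round-robin
    folds = [data[i::k] for i in range(k)]
    for i, testing in enumerate(folds):
        # Everything outside fold i becomes the training set
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, testing


data = list(range(10))  # stand-in for real rows
splits = list(k_fold_splits(data, k=5))
for training, testing in splits:
    print("test:", testing, "train:", training)
```

Each row appears in exactly one testing set, so averaging the `k` scores uses every data point for evaluation once.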

```python
import csv
import random

from sklearn import svm
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=1)
# model = svm.SVC()
# model = Perceptron()
```

```python
# Read data in from file
with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)

    data = []
    for row in reader:
        data.append({
            "evidence": [float(cell) for cell in row[:4]],
            "label": "Authentic" if row[4] == "0" else "Counterfeit"
        })

# Separate data into training and testing groups
holdout = int(0.40 * len(data))
random.shuffle(data)
testing = data[:holdout]
training = data[holdout:]

# Train model on training set
X_training = [row["evidence"] for row in training]
y_training = [row["label"] for row in training]
model.fit(X_training, y_training)

# Make predictions on the testing set
X_testing = [row["evidence"] for row in testing]
y_testing = [row["label"] for row in testing]
predictions = model.predict(X_testing)

# Compute how well we performed
correct = 0
incorrect = 0
total = 0
for actual, predicted in zip(y_testing, predictions):
    total += 1
    if actual == predicted:
        correct += 1
    else:
        incorrect += 1

# Print results
print(f"Results for model {type(model).__name__}")
print(f"Correct: {correct}")
print(f"Incorrect: {incorrect}")
print(f"Accuracy: {100 * correct / total:.2f}%")
```

```
Results for model KNeighborsClassifier
Correct: 548
Incorrect: 0
Accuracy: 100.00%
```


## reinforcement learning

given a set of rewards or punishments, learn what actions to take in the future

- an agent is situated in the environment
- the environment puts the agent in some state
- the agent makes an action on the environment
- the agent gets a new state, and a reward/punishment

**markov decision process**: model for decision-making, representing states, actions, and their rewards

- set of states `S`
- set of actions `Actions(s)`
- transition model `P(s'|s,a)`
- reward function `R(s, a, s')`

**Q-learning**: method for learning a function $Q(s, a)$, an estimate of the value of performing action $a$ in state $s$

- start with $Q(s,a)=0$ for all $s,a$
- when we have taken an action and received a reward:
    - estimate the value of $Q(s,a)$ based on current reward and expected future rewards
    - update $Q(s,a)$ to take into account old estimate as well as our new estimate
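
The steps above can be sketched with the commonly used update $Q(s,a) \leftarrow Q(s,a) + \alpha\,(\text{new estimate} - Q(s,a))$, where the new estimate is the reward plus the best Q-value available from the next state. The learning rate `alpha`, the discount factor `gamma`, and the state and action names are assumptions not taken from the notes.

```python
from collections import defaultdict


def update_q(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=1.0):
    """Blend the old estimate of Q(state, action) with a new one."""
    old = q[(state, action)]
    # Expected future reward: best value reachable from the next state
    best_future = max((q[(next_state, a)] for a in next_actions), default=0.0)
    new_estimate = reward + gamma * best_future
    q[(state, action)] = old + alpha * (new_estimate - old)


q = defaultdict(float)  # Q(s, a) = 0 for all s, a

# Hypothetical experience: moving "right" from "s0" reaches "s1" with reward 1
update_q(q, "s0", "right", 1.0, "s1", ["left", "right"])
print(q[("s0", "right")])
update_q(q, "s0", "right", 1.0, "s1", ["left", "right"])
print(q[("s0", "right")])
```

Each update moves the stored value part of the way toward the new estimate, so repeated experience of the same reward pulls $Q(s,a)$ steadily closer to it.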

however, Q-learning has to balance exploring against exploiting
- explore vs exploit
    - exploit: using the knowledge the agent already has
    - explore: trying new actions
    - by only exploiting, the agent might stick to a sub-optimal path to the goal

this can be solved with **$\varepsilon$-greedy**

- set $\varepsilon$ equal to how often we want to move randomly
- with probability $1-\varepsilon$, choose the estimated best move
- with probability $\varepsilon$, choose a random move
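
A short sketch of the $\varepsilon$-greedy choice, assuming Q-values are stored in a dictionary keyed by `(state, action)` and that unseen pairs count as 0. The function name and the sample values are hypothetical.

```python
import random


def epsilon_greedy(q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the best known one."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    # Exploit: action with the highest estimated value in this state
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Hypothetical Q-values for a single state
q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
actions = ["left", "right"]

print(epsilon_greedy(q, "s0", actions, epsilon=0.0))  # always exploits
print(epsilon_greedy(q, "s0", actions, epsilon=0.1))  # mostly exploits
```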

**function approximation**: approximating $Q(s,a)$, often by a function combining various features, rather than storing one value for every state-action pair


## unsupervised learning

given input data without any additional feedback, learn patterns

**clustering**: organising a set of objects into groups in such a way that similar objects tend to be in the same group

**k-means clustering**: algorithm for clustering data based on repeatedly assigning points to clusters and updating those clusters' centers
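
A minimal sketch of the assign-then-update loop. It assumes Euclidean distance, a fixed number of iterations, and starting centers passed in by hand so the result is reproducible; in practice the starting centers are usually chosen at random. The points are made up for illustration.

```python
import math


def k_means(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center)
        centers = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical 2-D points forming two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```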