# Ensemble Learning | Lecture 1

* so far - learning methods that learn a single hypothesis from a hypothesis space, which is then used to make predictions

* ensemble learning -> select a collection (ensemble) of hypotheses and combine their predictions

* example - generate 100 different decision trees from the same or different training sets and have them vote on the best classification for a new example

* key motivation: reduce the error rate. the hope is that it becomes much less likely that the ensemble will misclassify an example

## Learning Ensembles

![image.png](attachment:image.png)

## Value of Ensembles

* no free lunch theorem
    * no single algorithm wins all the time

* combine multiple independent and diverse decisions
    * each at least more accurate than random guessing
    * random errors cancel each other out
    * correct decisions are reinforced

* each algorithm makes assumptions which may or may not be valid

## Value of Ensembles

* different learners use different
    * algorithms
    * hyperparameters
    * representations (modalities)
    * training set
    * subproblems

* examples: human ensembles are demonstrably better
    * how many jelly beans in the jar?: individual estimates vs. group average
    * Who Wants to Be a Millionaire: audience vote

## What is the Main Challenge for Developing Ensemble Models?

* Goal is not necessarily to obtain highly accurate base models, 
but rather to obtain base models which make different kinds of errors

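* diversity is measured by the degree of overlap in misclassifications

* more overlap means less independence between two models

* for two models whose sets of misclassified examples are $A$ and $B$, the overlap is

$$ \frac{|A \cap B|}{|A \cup B|} $$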
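The overlap measure can be computed directly from the examples each model gets wrong. The sketch below assumes that $A$ and $B$ are the sets of misclassified example ids; the function name and the ids are made up for illustration.

```python
def error_overlap(errors_a, errors_b):
    """Overlap |A ∩ B| / |A ∪ B| between two sets of misclassified example ids."""
    a, b = set(errors_a), set(errors_b)
    union = a | b
    if not union:  # neither model made any error, so there is nothing to overlap
        return 0.0
    return len(a & b) / len(union)


# Hypothetical ids of the test examples each model got wrong.
model_1_errors = {3, 7, 12, 20}
model_2_errors = {7, 12, 31}

print(error_overlap(model_1_errors, model_2_errors))  # 2 shared / 5 total = 0.4
```

A value near 0 means the two models fail on different examples, which is what an ensemble wants; a value near 1 means they fail on the same ones.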

## Intuitions

* majority vote

* suppose we have 5 completely independent classifiers
    * if accuracy is 70\% for each
        - $(0.7^5) + 5(0.7^4)(0.3)+ 10(0.7^3)(0.3^2)$
        - 83.7\% majority vote accuracy
    * with 101 such classifiers
        - more than 99.9\% majority vote accuracy
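The arithmetic above is a binomial tail: the vote is correct when more than half of the classifiers are correct. A minimal sketch, assuming an odd number of fully independent classifiers with equal accuracy (the function name is illustrative):

```python
from math import comb


def majority_vote_accuracy(n_classifiers, accuracy):
    """Probability that more than half of n independent classifiers are correct."""
    needed = n_classifiers // 2 + 1  # smallest winning majority (n assumed odd)
    return sum(
        comb(n_classifiers, k) * accuracy**k * (1 - accuracy) ** (n_classifiers - k)
        for k in range(needed, n_classifiers + 1)
    )


print(round(majority_vote_accuracy(5, 0.7), 3))  # 0.837
print(majority_vote_accuracy(101, 0.7) > 0.999)  # True
```

The independence assumption does the heavy lifting here; real classifiers trained on related data are correlated, so the gain is usually smaller.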

## Increase power of Ensemble Learning

* enlarges the hypothesis space

* the ensemble itself is a hypothesis, and the new hypothesis space is the set of all possible ensembles constructible from hypotheses of the original space

![image.png](attachment:image.png)

## Why does it work?

* Suppose there are 25 base classifiers
    * each classifier has error rate, $\epsilon = 0.35$
    * assume classifiers are independent
    * probability that the ensemble classifier makes a wrong prediction:

    $$\sum^{25}_{i=13} \left(\begin{array}{c}25 \\ i \end{array}\right) \epsilon^i (1-\epsilon)^{25-i} = 0.06$$
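The same number can be checked by simulation instead of the closed-form sum. This is a rough Monte Carlo sketch under the stated independence assumption; the function name, trial count, and seed are arbitrary choices for illustration.

```python
import random


def simulate_ensemble_error(n_classifiers=25, error_rate=0.35, trials=20_000, seed=0):
    """Estimate how often a majority of independent classifiers is wrong."""
    rng = random.Random(seed)
    majority = n_classifiers // 2 + 1  # 13 wrong votes out of 25 sink the ensemble
    failures = 0
    for _ in range(trials):
        # Each classifier errs independently with probability error_rate.
        wrong_votes = sum(rng.random() < error_rate for _ in range(n_classifiers))
        if wrong_votes >= majority:
            failures += 1
    return failures / trials


estimate = simulate_ensemble_error()
print(f"estimated ensemble error: {estimate:.3f}")  # should land close to 0.06
```

Rerunning with `error_rate` above 0.5 shows the flip side: when the base classifiers are worse than random guessing, voting makes the ensemble worse, not better.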

## Bagging

bagging takes the training set and draws random samples from it with replacement, so a record can appear more than once within a sample, and the same record can appear in several samples

![image.png](attachment:image.png)

* use bootstrapping to generate L training sets and train one base learner on each

* combine the predictions by voting (for regression, use the average or median)

* unstable algorithms profit from bagging

* sampling with replacement

![image.png](attachment:image.png)

* build classifier on each bootstrap sample

* each record has probability $1 - (1 - 1/n)^n$ of being selected for a given bootstrap sample (about 0.632 for large n)
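The bagging steps can be sketched with the standard library alone. The chance that a particular record appears at least once in a bootstrap sample of size n is $1 - (1 - 1/n)^n$, since $(1 - 1/n)^n$ is the chance it is missed on every draw. The helper names and the toy records below are assumptions for illustration; a real pipeline would train a base learner on each sample.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) items uniformly at random, with replacement."""
    return [rng.choice(records) for _ in records]


def inclusion_probability(n):
    """Chance that a given record shows up at least once in a size-n bootstrap sample."""
    return 1 - (1 - 1 / n) ** n


def majority_vote(predictions):
    """Combine the base learners' class predictions by plurality vote."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
records = list(range(1, 11))  # toy record ids 1..10
for round_number in range(1, 4):
    # Some ids repeat and some are missing in each round.
    print(round_number, sorted(bootstrap_sample(records, rng)))

print(round(inclusion_probability(10), 3))    # 0.651
print(round(inclusion_probability(1000), 3))  # 0.632
print(majority_vote(["spam", "ham", "spam"]))  # spam
```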

## Boosting

![image.png](attachment:image.png)

the predictors are trained one after another: the data is fed to a predictor, and its results are handed to the next one, which gives extra weight to the records the earlier predictor got wrong

samples that are harder to classify receive a higher probability of being selected in the next round

it is an iterative procedure: start with every record having an equal weight, then increase the weights of the records that are misclassified

* an iterative procedure to adaptively change the distribution of training data by focusing more on previously misclassified records

* initially, all n records are assigned equal weights

* unlike bagging, weights may change at the end of each boosting round

* records that are wrongly classified will have their weights increased

* records that are classified correctly will have their weights decreased

![image.png](attachment:image.png)

* example 4 is harder to classify

* its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
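The notes say only that weights go up for misclassified records and down for correct ones; they do not give a formula. The sketch below uses an AdaBoost-style update as one common choice, which is an assumption and not necessarily the scheme used in the lecture. It also assumes the incoming weights sum to 1.

```python
import math


def update_weights(weights, misclassified):
    """One AdaBoost-style reweighting step (assumed scheme, not taken from the notes)."""
    # Weighted error of this round's classifier; weights are assumed to sum to 1.
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1 - error) / error)  # importance of this classifier
    scaled = [
        w * math.exp(alpha if wrong else -alpha)  # raise if wrong, lower if right
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]  # renormalize to a probability distribution


n = 10
weights = [1 / n] * n  # initially all records have equal weight
misclassified = [i == 3 for i in range(n)]  # only record 4 (index 3) is wrong
new_weights = update_weights(weights, misclassified)
print(round(new_weights[3], 3), round(new_weights[0], 3))  # 0.5 0.056
```

After the update the single hard record carries half of the total weight, so it is far more likely to be drawn in the next round.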

## Stacking

![image.png](attachment:image.png)

multiple predictors are trained, their outputs are collected, and a final blender combines those outputs into a single prediction
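A very small stacking sketch, assuming two fixed base predictors and a blender that learns a single mixing weight by least squares on held-out data. The predictors, the data, and the one-weight blender are all made up for illustration; real blenders are usually full models trained on the base predictions.

```python
def fit_blend_weight(pred_a, pred_b, targets):
    """Least-squares w for the blend w * pred_a + (1 - w) * pred_b."""
    numerator = sum((y - b) * (a - b) for a, b, y in zip(pred_a, pred_b, targets))
    denominator = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
    return numerator / denominator if denominator else 0.5


def blend(w, a, b):
    """Final prediction produced by the blender."""
    return w * a + (1 - w) * b


def predictor_a(x):
    return 2.0 * x  # hypothetical base model that overshoots


def predictor_b(x):
    return 1.0 * x  # hypothetical base model that undershoots


# Held-out data the base predictors were not trained on (true relation: y = 1.5 x).
holdout_x = [1.0, 2.0, 3.0, 4.0]
holdout_y = [1.5 * x for x in holdout_x]

w = fit_blend_weight(
    [predictor_a(x) for x in holdout_x],
    [predictor_b(x) for x in holdout_x],
    holdout_y,
)
print(w)                                               # 0.5
print(blend(w, predictor_a(10.0), predictor_b(10.0)))  # 15.0
```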

## Mixture of Experts

* voting where weights are input-dependent (gating)

* experts or gating can be nonlinear

![image.png](attachment:image.png)
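Input-dependent voting can be sketched with a softmax gate over linear scores. The two experts and the gating parameters below are hypothetical and hand-picked rather than learned, purely to show how the weights shift with the input.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_predict(x, experts, gate_params):
    """Weighted vote whose weights depend on the input x (gating)."""
    gate_scores = [slope * x + bias for slope, bias in gate_params]
    weights = softmax(gate_scores)
    return sum(w * expert(x) for w, expert in zip(weights, experts))


# Hypothetical experts: one is right for negative inputs, one for positive inputs.
experts = [lambda x: -x, lambda x: x]
# Hypothetical gate: favors expert 0 when x < 0 and expert 1 when x > 0.
gate_params = [(-5.0, 0.0), (5.0, 0.0)]

print(round(mixture_predict(-3.0, experts, gate_params), 3))  # 3.0
print(round(mixture_predict(3.0, experts, gate_params), 3))   # 3.0
```

Each expert is linear, yet the gated combination approximates the nonlinear function |x|, which is the point of letting the gate depend on the input.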

## Cascading

* use learner $d_j$ only if the preceding learners are not confident

* cascade learners in order of complexity

![image.png](attachment:image.png)

in cascading you would order your predictors in the order of complexity.

only use a predictor if there was not a high enough confidence in the previous predictor
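A cascade can be sketched as a loop that stops at the first confident predictor. The two stages, their confidence values, and the 0.9 threshold are invented for illustration; each stage is assumed to return a `(label, confidence)` pair.

```python
def cascade_predict(x, stages, threshold=0.9):
    """Try learners from simplest to most complex; stop at the first confident one.

    Returns the chosen label and the index of the stage that produced it.
    """
    label = None
    for index, stage in enumerate(stages):
        label, confidence = stage(x)
        if confidence >= threshold:
            return label, index
    return label, len(stages) - 1  # nobody was confident: keep the last answer


def simple_rule(x):
    """Cheap stage: confident only far away from the decision boundary."""
    if x > 10:
        return ("positive", 0.95)
    if x < -10:
        return ("negative", 0.95)
    return ("positive" if x >= 0 else "negative", 0.6)


def complex_model(x):
    """Stand-in for an expensive stage that is always confident."""
    return ("positive" if x >= 0 else "negative", 0.99)


stages = [simple_rule, complex_model]  # ordered by increasing complexity
print(cascade_predict(25, stages))  # ('positive', 0)
print(cascade_predict(-2, stages))  # ('negative', 1)
```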

## Summary Ensemble Learning

* use multiple models for decision making

* frequently achieve high accuracy

* less likely to over-fit

* exhibit a low variance

* design key is diversity and not necessarily high accuracy of the base classifiers

* base classifiers of the ensemble should vary in the examples they misclassify

* represents a single hypothesis
    * not necessarily contained within the hypothesis space of the models from which it is built

* can be sensitive to noise

* might not be parallelizable

# Reinforcement Learning | Lecture 2

reinforcement learning applies when you are interacting with the environment

![image.png](attachment:image.png)

it is about making a policy to decide what to do next

![image.png](attachment:image.png)

## What is Reinforcement Learning?

* learning from interaction

* goal-oriented learning

* learning about, from, and while interacting with an external environment

* learning what to do - how to map situations to actions - so as to maximize a numerical reward signal

## Supervised Learning

![image.png](attachment:image.png)

information came in, and our error was the target output minus our output

## Reinforcement Learning

![image.png](attachment:image.png)

## Key Features of RL

* learner is not told which actions to take

* trial-and-error search

* possibility of delayed reward (sacrifice short-term gains for greater long-term gains)

* need to explore and exploit

* considers the whole problem of a goal-directed agent interacting with an uncertain environment
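Trial-and-error search and the explore/exploit trade-off can be illustrated with an epsilon-greedy agent on a multi-armed bandit. This is a minimal sketch and not something taken from the lecture: the arm means, the Gaussian reward noise, `epsilon`, and the function name are all assumptions.

```python
import random


def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking arm, sometimes explore."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # the agent's running estimate of each arm's reward
    counts = [0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_arms)  # explore: try a random arm
        else:
            action = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # The agent is never told the best arm; it only sees a noisy reward.
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward / steps


estimates, counts, average_reward = run_bandit([0.0, 1.0, 2.0])
print("pulls per arm:", counts)
print("estimated means:", [round(e, 2) for e in estimates])
print("average reward:", round(average_reward, 2))
```

With `epsilon=0` the agent can lock onto whichever arm looked good first; with a small positive `epsilon` it keeps sampling the others and can correct an early mistake.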

## Complete Agent

* temporally situated

* continual learning and planning

* object is to affect the environment

* environment is stochastic and uncertain

![image.png](attachment:image.png)

our agent has to make decisions in each situation

## Elements of RL

* policy: what to do

* reward: what is good

* value: what is good because it predicts reward

* model: what follows what

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Assume an imperfect opponent who sometimes makes mistakes; this turns the game into a more stochastic decision process

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

## The Markov Property

* by "the state" at step t, we mean whatever information is available to the agent at step t about its environment

* the state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations

* ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:

![image.png](attachment:image.png)
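For reference, the Markov property is commonly written as below. The notation ($s_t$ for state, $a_t$ for action, $r_t$ for reward) is the usual textbook convention and may differ from the figure.

$$\Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t\}$$

In words: the next state and reward depend only on the current state and action, not on the rest of the history.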



## Markov Decision Processes

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Policy is based on the characteristics of that particular problem

everything can change as we're progressing through the problem

![image.png](attachment:image.png)

^^ more steps = more value

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)