# Cross Entropy Method
Some strengths of this method are:
* **Simplicity**: The method is very simple, which makes it intuitive to follow.
* **Good convergence**: Cross-entropy usually works well in simple environments that have short episodes with frequent rewards and don't require complex, multistep policies to be learned and discovered.

## RL Methods:
* Model-free or Model-based
* Value-based or Policy-based
* On-policy or Off-policy

**Cross Entropy Method** falls into *Model-free* and *Policy-based*.

**Model-Free** means that the method doesn't build a model of the environment or reward; it just directly connects observations to actions (or values that are related to actions). Therefore, the agent takes current observations and does some computations on them, and the result is the action that it should take.

*Model-Free* methods are usually easier to train, as it's hard to build good models of complex environments with rich observations.

*This is closer to how humans operate: we don't know everything about the world or environment, and we base our judgements/decisions on our observations + (stored knowledge of the world/environment).*

**Model-Based** methods try to predict what the next observation and/or reward will be. Based on this prediction, the agent tries to choose the best possible action to take, very often making such predictions multiple times to look more and more steps into the future (e.g. Monte Carlo Tree Search).

*Model-Based* methods are typically used in deterministic environments, such as board games with strict rules.

**Policy-Based** methods directly approximate the policy of the agent: what action the agent should carry out at every step.

The *policy* is usually represented by a probability distribution over the available actions.

**Value-Based**: Instead of the probabilities of actions, the agent calculates the value of every possible action and chooses the action with the best value.
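The difference between the two families shows up in how an action gets picked. A minimal sketch using NumPy; the probabilities and values are made-up numbers for illustration, not the output of a trained agent:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Policy-based: the agent outputs a probability per action and samples from it.
action_probs = np.array([0.1, 0.7, 0.2])  # hypothetical policy output, 3 actions
policy_action = int(rng.choice(len(action_probs), p=action_probs))

# Value-based: the agent outputs a value per action and picks the largest one.
action_values = np.array([1.5, 0.3, 2.1])  # hypothetical value estimates
value_action = int(np.argmax(action_values))
```

With these numbers the value-based choice is always action 2, while the policy-based choice varies from sample to sample.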

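**Off-Policy**: The ability of the method to learn from old historical data (obtained by a previous version of the agent or recorded from human demonstrations). **On-Policy** methods, by contrast, require fresh data obtained from the environment with the current version of the agent.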

# Cross Entropy Method
This method is: **Model-Free, Policy-Based, On-Policy**
* It doesn't build any model of the environment.
* It approximates the policy of the agent
* It requires fresh data obtained from the environment

## Practical Cross-Entropy
We follow a common ML approach, replacing all of the complications of the agent with some kind of nonlinear trainable function, which maps the agent's input (observations from the environment) to some output.

For our cross-entropy method, the nonlinear function is a deep neural network that produces our *policy*, which basically says, for every observation, which action the agent should take.

Observation --> NN --> Policy

In practice, the policy is usually represented as a probability distribution over actions, which makes it very similar to a classification problem, with the number of classes being equal to the number of actions we can carry out.

We need to pass an observation from the environment to the neural network, get a probability distribution over actions, and perform random sampling using that distribution to get an action to carry out.
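The observation-to-action path can be sketched as follows. This is a minimal illustration, assuming a single linear layer with softmax as a stand-in for the neural network; the sizes and the names `softmax` and `select_action` are hypothetical:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def select_action(observation, weights, bias, rng):
    """Map an observation to action probabilities, then sample one action."""
    logits = observation @ weights + bias      # stand-in for a neural network
    probs = softmax(logits)                    # the policy for this observation
    action = int(rng.choice(len(probs), p=probs))
    return action, probs

rng = np.random.default_rng(0)
obs_size, n_actions = 4, 2                     # hypothetical sizes
weights = rng.normal(size=(obs_size, n_actions))  # random weights, so random behaviour
bias = np.zeros(n_actions)
observation = rng.normal(size=obs_size)

action, probs = select_action(observation, weights, bias, rng)
```

Sampling from the distribution, rather than always taking the most probable action, keeps some randomness in the agent's behaviour.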

Loop: 
At the beginning of training, when the weights of our NN are random, the agent behaves randomly. After the agent gets an action to issue, it fires the action to the environment and obtains the next observation and the reward for the last action.
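A rough sketch of this loop is shown below. The environment `CountdownEnv` and the `random_policy` function are hypothetical stand-ins invented for illustration; a real setup would use an actual environment and the neural network policy:

```python
import random

class CountdownEnv:
    """Hypothetical toy environment: the episode lasts a fixed number of steps."""

    def __init__(self, length=5):
        self.length = length
        self.steps_left = length

    def reset(self):
        self.steps_left = self.length
        return self.steps_left                 # observation: steps remaining

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0   # action 1 is the rewarded one
        self.steps_left -= 1
        done = self.steps_left == 0
        return self.steps_left, reward, done

def random_policy(observation, rng):
    # Untrained agent: ignores the observation and acts at random.
    return rng.choice([0, 1])

rng = random.Random(0)
env = CountdownEnv()
obs = env.reset()
history = []
done = False
while not done:
    action = random_policy(obs, rng)            # get an action to issue
    next_obs, reward, done = env.step(action)   # fire it to the environment
    history.append((obs, action, reward))       # remember what happened
    obs = next_obs
```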

During the agent's lifetime, its experience is presented as episodes. Every episode is a sequence of observations that the agent has got from the environment, actions it has issued, and rewards for these actions.

*After our agent has played several such episodes, we can calculate, for every episode, the total reward that the agent has claimed. For simplicity, this will just be the sum of all local rewards in the episode. This total reward shows how good the episode was for the agent.*

Example: 
```episode_i``` = $R = r_j + r_{j+1} + ... + r_n$
* ```i``` = episode number
* $R$ = Total reward for that episode
* $r_j$ = reward for observation at timestep j
* $r_n$ = the last reward for the episode

Each episode is composed of cells (timesteps) with: ```(observation, action, reward)```

Every cell represents the agent's step in the episode.
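One possible way to hold this structure in code is a pair of named tuples. The names `EpisodeStep`, `Episode`, and `make_episode`, as well as the sample values, are assumptions made for this sketch:

```python
from collections import namedtuple

# One cell (timestep) of an episode.
EpisodeStep = namedtuple("EpisodeStep", ["observation", "action", "reward"])
# A whole episode: its total reward plus the list of its steps.
Episode = namedtuple("Episode", ["total_reward", "steps"])

def make_episode(steps):
    # The total reward is simply the sum of the local rewards.
    total = sum(step.reward for step in steps)
    return Episode(total_reward=total, steps=steps)

steps = [
    EpisodeStep(observation=[0.0, 0.1], action=1, reward=1.0),
    EpisodeStep(observation=[0.2, 0.1], action=0, reward=0.0),
    EpisodeStep(observation=[0.3, 0.4], action=1, reward=1.0),
]
episode = make_episode(steps)
print(episode.total_reward)  # 2.0
```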

The **core** of the cross-entropy method is to throw away bad episodes and train on better ones.

1. Play $N$ episodes using our current model and environment
2. Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as the 50th or 70th
3. Throw away all episodes with a reward below that boundary
4. Train on the remaining **elite** episodes, using observations as the input and issued actions as the desired output
5. Repeat from step 1 until we become satisfied with the result
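Steps 2 and 3 can be sketched with NumPy's `np.percentile`. The function name `filter_elite` and the data layout (a list of `(total_reward, steps)` pairs, with each step an `(observation, action)` pair) are assumptions chosen for this illustration:

```python
import numpy as np

def filter_elite(episodes, percentile=70):
    """Keep only episodes whose total reward reaches the percentile boundary.

    `episodes` is a list of (total_reward, steps) pairs, where `steps` is a
    list of (observation, action) pairs.
    """
    rewards = [total for total, _ in episodes]
    boundary = np.percentile(rewards, percentile)   # step 2: reward boundary
    train_obs, train_actions = [], []
    for total, steps in episodes:
        if total < boundary:
            continue                                # step 3: drop the episode
        for obs, action in steps:
            train_obs.append(obs)                   # network input
            train_actions.append(action)            # desired output
    return train_obs, train_actions, boundary

episodes = [
    (1.0, [([0.0], 0)]),
    (2.0, [([0.1], 1)]),
    (3.0, [([0.2], 0)]),
    (4.0, [([0.3], 1)]),
    (5.0, [([0.4], 1), ([0.5], 0)]),
]
train_obs, train_actions, boundary = filter_elite(episodes, percentile=50)
```

Here the boundary is 3.0, so the first two episodes are dropped. The returned observations and actions would then feed a classifier-style training step (step 4), which is left out of this sketch.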

## Limitations of Cross-Entropy Method
* For training, our episodes have to be finite and, preferably, short.
* The total reward for the episodes should have enough variability to separate good episodes from bad ones.
* There is no intermediate indication about whether the agent has succeeded or failed.