# **State–action–reward–state–action (SARSA)**
__SARSA__ *(State-Action-Reward-State-Action)* is a value-based reinforcement learning algorithm for solving Markov Decision Processes (MDPs), the standard formalism for sequential decision-making. It belongs to the family of temporal difference (TD) learning methods, and its tabular form is well suited to training agents in environments with discrete states and actions.

Value-based algorithms evaluate states $s$ or state-action pairs $(s, a)$ by learning one of the value functions, $V^\pi (s)$ or $Q^\pi (s,a)$, and use these evaluations to select actions.

### **Components of SARSA**
- **State (S):** In a reinforcement learning problem, the environment is divided into states (or observations), which represent the situations or configurations that the agent can encounter.<br><br>
- **Action (A):** At each state, the agent can take one of several possible actions. Actions represent the choices made by the agent to interact with the environment.<br><br>
- **Reward (R):** After taking an action in a specific state, the agent receives a reward from the environment. Rewards provide feedback to the agent about the desirability of its actions.<br><br>
- **Policy (π):** The policy defines the agent's strategy or behavior, specifying which action to take in each state. SARSA is an on-policy algorithm, meaning it evaluates and improves the same policy it uses to interact with the environment.<br><br>

### **Key Concepts of SARSA**
- **Q-Values (Action-Value Function):** SARSA aims to estimate the Q-values, denoted as $Q(s, a)$, which represent the expected cumulative discounted reward obtained by starting in state $s$, taking action $a$, and following a specific policy $\pi$ thereafter. The goal is to find the optimal Q-values that maximize the expected return.<br><br>
- **Q-Table:** SARSA maintains a Q-table (or Q-function), which is a data structure that stores a Q-value for each state-action pair. These values are typically initialized arbitrarily, for example to zeros or small random numbers.<br><br>
- **Exploration vs. Exploitation:** SARSA balances exploration (trying different actions to discover the environment) and exploitation (choosing actions that are believed to be the best based on current Q-value estimates). This balance is controlled by an exploration strategy, often based on $\varepsilon$-greedy exploration, where with probability $\varepsilon$, the agent chooses a random action, and with probability $1-\varepsilon$, it chooses the action with the highest estimated Q-value.<br><br>
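
The Q-table and the $\varepsilon$-greedy rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `make_q_table` and `epsilon_greedy`, the dictionary-based table, the zero initialization, and the random tie-breaking are all assumptions made for the example.

```python
import random
from collections import defaultdict


def make_q_table(n_actions):
    """Q-table as a dict mapping state -> list of action values.

    Assumption: unseen states start with all-zero estimates.
    """
    return defaultdict(lambda: [0.0] * n_actions)


def epsilon_greedy(q_table, state, n_actions, epsilon, rng=random):
    """Return a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # explore
    values = q_table[state]
    best = max(values)
    # Exploit; ties between equally good actions are broken at random.
    return rng.choice([a for a, v in enumerate(values) if v == best])


q = make_q_table(n_actions=3)
q["s0"][2] = 1.0
action = epsilon_greedy(q, "s0", n_actions=3, epsilon=0.0)  # greedy pick
```

With `epsilon=0.0` the call is purely greedy, so it selects action `2`, the only action with a non-zero estimate in state `"s0"`. Larger values of `epsilon` shift the balance toward exploration.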

### **SARSA Algorithm**
The SARSA algorithm updates Q-values iteratively through experience gained while interacting with the environment. The update rule for SARSA is as follows:
$$
    Q(s,a) \leftarrow Q(s,a) + \alpha [R+\gamma Q(s',a') - Q(s,a)]
$$
- **Q(s,a)** is the current $Q$-value estimate for state $s$ and action $a$.<br><br>
- **$\alpha$** is the learning rate, which controls the size of the $Q$-value updates.<br><br>
- **R** is the immediate reward received after taking action $a$ in state $s$.<br><br>
- **$\gamma$** is the discount factor, representing the importance of future rewards.<br><br>
- **$s'$** is the next state reached after taking action $a$ in state $s$.<br><br>
- **$a'$** is the action selected in state $s'$ according to the current policy.<br><br>
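
As a rough sketch, a single application of the update rule could look like the function below. The name `sarsa_update`, the dictionary layout of `q`, and the `done` flag (which treats the value after a terminal state as zero) are assumptions introduced for illustration.

```python
def sarsa_update(q, s, a, reward, s_next, a_next, alpha, gamma, done=False):
    """Apply one SARSA update to q in place and return the TD error."""
    # Assumption: no future value is added once the episode has ended.
    target = reward if done else reward + gamma * q[s_next][a_next]
    td_error = target - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


# Two states with two actions each; the values are made up for the example.
q = {"s0": [0.0, 0.0], "s1": [0.5, 1.0]}
delta = sarsa_update(q, "s0", 1, reward=1.0, s_next="s1", a_next=0,
                     alpha=0.1, gamma=0.9)
# Target = 1.0 + 0.9 * 0.5 = 1.45, so Q("s0", 1) moves from 0.0 to about 0.145.
```

Note that the target uses $Q(s',a')$ for the action $a'$ the agent actually selected, not the maximum over actions in $s'$.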

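A SARSA agent interacts with the environment and updates its Q-value estimates based on the actions it actually takes.<br>
The Q-value for a state-action pair is moved toward the target $R+\gamma Q(s',a')$ by the temporal difference error $R+\gamma Q(s',a') - Q(s,a)$, scaled by the learning rate $\alpha$.<br>
Q-values represent the expected cumulative discounted reward from taking action $a$ in state $s$ and following the current policy thereafter.<br>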
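
To show how these pieces fit together over whole episodes, here is a small self-contained training loop. The environment is a hypothetical five-cell corridor invented for this example (start in cell 0, reward 1.0 for reaching cell 4), and the function names and hyperparameter values are assumptions as well; treat it as a sketch of tabular SARSA rather than a tuned implementation.

```python
import random

N_STATES = 5      # hypothetical corridor: cells 0..4, cell 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy dynamics: move one cell; reward 1.0 on reaching the last cell."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def choose(q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # zero-initialized Q-table
    for _ in range(episodes):
        s = 0
        a = choose(q, s, epsilon, rng)          # first action of the episode
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = choose(q, s2, epsilon, rng)    # next action, same policy
            target = r if done else r + gamma * q[s2][a2]
            q[s][a] += alpha * (target - q[s][a])
            s, a = s2, a2                       # carry (s', a') forward
    return q


q = train()
# Greedy action per non-terminal cell after training.
greedy = [max(ACTIONS, key=lambda act: q[s][act]) for s in range(N_STATES - 1)]
```

The action `a2` chosen for the update is the same action executed on the next step, which is what makes the method on-policy. In this toy corridor, the learned estimates should favor moving right in every non-terminal cell.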

### **SARSA Hyperparameters**
- **Learning rate ($\alpha$)** determines to what extent new information overrides old information. A value of 0 makes the agent learn nothing, while a value of 1 makes the agent consider only the most recent information.<br><br>
- **Discount factor ($\gamma$)** determines the importance of future rewards. A discount factor of 0 makes the agent short-sighted ("opportunistic"), considering only immediate rewards, while a factor approaching 1 makes it strive for a long-term high reward.<br><br>
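
The effect of the two hyperparameters can be checked on a single update with made-up numbers. The helper `updated_q` and the values below are purely illustrative assumptions.

```python
def updated_q(q_old, reward, q_next, alpha, gamma):
    """Return the new Q-value after one SARSA update."""
    target = reward + gamma * q_next
    return q_old + alpha * (target - q_old)


q_old, reward, q_next = 2.0, 1.0, 4.0

no_learning = updated_q(q_old, reward, q_next, alpha=0.0, gamma=0.5)  # 2.0
overwrite = updated_q(q_old, reward, q_next, alpha=1.0, gamma=0.5)    # 3.0
myopic = updated_q(q_old, reward, q_next, alpha=1.0, gamma=0.0)       # 1.0
blended = updated_q(q_old, reward, q_next, alpha=0.5, gamma=0.5)      # 2.5
```

With $\alpha = 0$ the old estimate is kept, with $\alpha = 1$ it is replaced by the target $R + \gamma Q(s',a')$, and with $\gamma = 0$ the target collapses to the immediate reward. Intermediate values blend old and new information.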
