$\textbf{Reset application}$

The reset application consists of a Markov chain with $|\mathcal{S}|$ states and two actions ("Do nothing" and "Reset to state 0"). Formally, we consider a discrete-time Markov chain (DTMC) with state space $\mathcal{S}=\{0,1,\ldots,N\}$ for some finite $N\in\mathbb{N}^+$ and transition matrix $\mathcal{P}$. Let action $0$ correspond to "Do nothing" and action $1$ to "Reset to state $0$". The action space of the corresponding MDP is then given by 
\begin{equation*}\mathcal{A}(s)=\begin{cases}
\{0\},&\text{ for }s=0,\\
\{0,1\},&\text{ for }s\in\{1,...,N-1\},\\
\{1\},&\text{ for }s=N.
\end{cases}
\end{equation*}
Furthermore, the transition probabilities $\mathcal{P}^{(a)}_{ss'}$ corresponding to the state-action pair $(s,a)$ are given by
\begin{align*}
    \mathcal{P}_{ss'}^{(0)}=\mathcal{P}_{ss'}\;\text{ and }\;\mathcal{P}_{ss'}^{(1)}=\begin{cases}
    1, &\text{ if }s'=0,\\
    0, &\text{ if }s'\neq 0,
    \end{cases} \qquad\text{for all} \;(s,s')\in \mathcal{S}\times \mathcal{S}.
\end{align*}
We assume that the costs $c(s,a)$ are bounded for all attainable state-action pairs $(s,a)$, i.e. for all $s\in\mathcal{S}$ and $a\in\mathcal{A}(s)$. 
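
The definitions above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the numerical analysis; the function names `action_sets` and `action_transition_matrices` are hypothetical and chosen here only for readability.

```python
import numpy as np

def action_sets(N):
    """Map each state s in {0, ..., N} to its admissible actions A(s)."""
    # Interior states allow both "Do nothing" (0) and "Reset to state 0" (1).
    sets = {s: [0, 1] for s in range(1, N)}
    sets[0] = [0]   # in state 0 only "Do nothing" is available
    sets[N] = [1]   # in state N a reset is forced
    return sets

def action_transition_matrices(P):
    """Return (P0, P1) for a DTMC transition matrix P.

    P0 is the original chain ("Do nothing"); P1 sends every state
    to state 0 with probability 1 ("Reset to state 0").
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    P0 = P.copy()
    P1 = np.zeros((n, n))
    P1[:, 0] = 1.0
    return P0, P1
```

Note that `P1` is defined for every row, as in the formula above, even though action $1$ is not admissible in state $0$; the admissible actions are tracked separately by `action_sets`.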

$\textbf{Model settings for the numerical analysis}$

In the numerical analysis of the reinforcement learning algorithms tailored for the reset application, three different cost realizations of the MDP will be considered. For all three realizations we will investigate various transition matrices and algorithm parameters. Let us first introduce the three cost realizations.


1. The cost function for the first realization is defined by
    \begin{equation}
        c(s,0)=0\quad\forall\quad s\in\mathcal{S} \quad\text{ and }\quad c(s,1)=\begin{cases}
        s,&\text{ for }s=0,...,N-1,\\
        2N,&\text{ for }s=N.
        \end{cases}
    \end{equation}
    In this realization, "Do nothing" incurs no cost, while "Reset to state 0" incurs a cost equal to the current state $s$ for $s<N$ and a higher cost of $2N$ in state $N$. 

2. The cost function for the second realization is given by
    \begin{equation}
        c(s,0)=s+1\quad\forall\quad s\in\mathcal{S}\quad\text{ and }\quad c(s,1)=K\cdot c(s,0)\quad\forall\quad s\in\mathcal{S}\text{ for some }K\in \mathbb{N}^+.
    \end{equation}
    Here, the cost of the action "Do nothing" increases linearly with the state, and the cost of the action "Reset to state 0" is $K$ times the cost of "Do nothing" in the same state. 

3. Finally, the cost function for the third realization is defined by
    \begin{equation}
        c(s,0)=\frac{1}{N+1-s}\quad\forall\quad s\in\mathcal{S}\quad\text{ and }\quad c(s,1)=10^s\quad\forall\quad s\in\mathcal{S}.
    \end{equation}
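
The three cost realizations above can be written as plain Python functions. This is a sketch under the stated definitions; the function names are hypothetical, and the functions do not check whether action `a` is admissible in state `s`.

```python
def cost_realization_1(s, a, N):
    """First realization: free "Do nothing", state-dependent reset cost."""
    if a == 0:
        return 0.0
    # Reset costs s for s < N and 2N in the final state N.
    return float(2 * N) if s == N else float(s)

def cost_realization_2(s, a, N, K):
    """Second realization: c(s,0) = s + 1 and c(s,1) = K * c(s,0)."""
    base = s + 1
    return float(base) if a == 0 else float(K * base)

def cost_realization_3(s, a, N):
    """Third realization: c(s,0) = 1/(N+1-s) and c(s,1) = 10^s."""
    return 1.0 / (N + 1 - s) if a == 0 else float(10 ** s)
```

For example, with $N=3$ the call `cost_realization_1(3, 1, 3)` evaluates the reset cost in the final state, $2N=6$.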

For each of these realizations we will investigate three different transition matrices of the following form,
\begin{align*}
    \mathcal{P}_A=\begin{pmatrix}
    0&1&0&0\\
    0&0&1&0\\
    0&0&0&1\\
    1&0&0&0
    \end{pmatrix},\qquad 
    \mathcal{P}_{B}=\begin{pmatrix}
    0&1&0&0\\
    \frac{1}{2}&0&\frac{1}{2}&0\\
    0&\frac{1}{2}&0&\frac{1}{2}\\
    0&0&1&0
    \end{pmatrix},\;\;\text{and}\;\; 
    \mathcal{P}_{C}=\begin{pmatrix}
    \frac{1}{4}&\frac{1}{4}&\frac{1}{4}&\frac{1}{4}\\
    \frac{1}{4}&\frac{1}{4}&\frac{1}{4}&\frac{1}{4}\\
    \frac{1}{4}&\frac{1}{4}&\frac{1}{4}&\frac{1}{4}\\
    \frac{1}{4}&\frac{1}{4}&\frac{1}{4}&\frac{1}{4}
    \end{pmatrix}.
\end{align*}
Here we took $N=3$, but these matrices extend naturally to any $N\in\mathbb{N}^+$. The transition matrix $\mathcal{P}_A$ corresponds to deterministic transitions of the DTMC to the next state, with state $N$ returning to state $0$. Transition matrix $\mathcal{P}_B$ corresponds to transitions to the next or previous state with equal probability for states $s\in \mathcal{S}\setminus\{0,N\}$; state $0$ moves to state $1$ and state $N$ moves to state $N-1$ with probability $1$. Finally, transition matrix $\mathcal{P}_C$ corresponds to uniform transitions, i.e. from every state, each state $s'\in\mathcal{S}$ is reached with equal probability $1/(N+1)$. Figure 1 illustrates the DTMC for $N=3$ for each of these transition matrices.

![image.png](attachment:image.png)

$\textit{Figure 1: The Markov chain and its transition probabilities for transition matrices $\mathcal{P}_A,\mathcal{P}_B$ and $\mathcal{P}_C$ for $N = 3$.}$
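
The three transition matrices can be generated for arbitrary $N$ along the following lines. The generalization beyond $N=3$ is an assumption made here (cyclic shift for $\mathcal{P}_A$, reflecting boundaries for $\mathcal{P}_B$, uniform rows for $\mathcal{P}_C$), and the function names are hypothetical; for $N=3$ the functions reproduce the matrices shown above.

```python
import numpy as np

def transition_matrix_A(N):
    """Deterministic step to the next state; state N wraps around to 0."""
    n = N + 1
    P = np.zeros((n, n))
    for s in range(n):
        P[s, (s + 1) % n] = 1.0
    return P

def transition_matrix_B(N):
    """Step to the previous or next state with probability 1/2 each.

    The boundary states are reflecting: 0 -> 1 and N -> N-1 with probability 1.
    """
    n = N + 1
    P = np.zeros((n, n))
    P[0, 1] = 1.0
    P[N, N - 1] = 1.0
    for s in range(1, N):
        P[s, s - 1] = 0.5
        P[s, s + 1] = 0.5
    return P

def transition_matrix_C(N):
    """Uniform transitions: every entry equals 1/(N+1)."""
    n = N + 1
    return np.full((n, n), 1.0 / n)
```

Each returned matrix is row-stochastic, which can be checked with `np.allclose(P.sum(axis=1), 1.0)`.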