# Explanation

The broad semantic knowledge and reasoning ability of state-of-the-art LLMs have the potential to solve the task planning problem by translating a high-level goal into the subtasks the robot needs to perform to accomplish that goal.

However, LLMs are not grounded in what is physically possible, so they often suggest infeasible tasks. SayCan addresses this by training RL value functions that estimate, from image observations of the environment, which skills are currently feasible. Those estimates are then used to ground the skills suggested by the LLM.

# Notes

> A significant weakness of language models is that they lack real-world
> experience, which makes it difficult to leverage them for decision making within a given embodiment.

LLMs can provide high level semantic knowledge for complex task planning, but they can come up with narratives that aren’t grounded in the reality of the embodiment or the environment of a robot.

> In this way, the LLM describes the probability that each skill contributes to completing the instruction, and the affordance function describes the probability that each skill will succeed – combining the two provides the probability that each skill will perform the instruction successfully.

The LLM part (Say) provides knowledge on the best skills to perform the task, and the affordance function (Can) grounds it in what skills can actually work in the current environment.

> Grounding the LLM in the real-world via affordances nearly doubles the performance over the non-grounded baselines.

> Additionally, by evaluating the performance of the system with different LLMs, we show that a robot’s performance can be improved simply by enhancing the underlying language model.

Swapping in a stronger LLM improves the robot's performance with no other changes, probably because the better model assigns more accurate skill likelihoods.

### Preliminaries

**1. Large Language Models**

> In this work, we utilize the vast semantic knowledge contained in LLMs to determine useful tasks for solving high-level instructions.

**2. Value Functions and RL**

> Our goal is to be able to accurately predict whether a skill (given by a
> language command) is feasible at a current state. We use temporal-difference-based (TD) reinforcement learning to accomplish this goal.

They use Q-learning to learn whether a skill is feasible in the current state.

The robot receives a sparse reward at the end of each episode: 1 if the skill succeeded and 0 if it failed. The learned value function therefore estimates the probability that the skill succeeds from the current state, which is why it can serve as an affordance function.
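To make the link between a sparse 0/1 reward and an affordance concrete, here is a minimal tabular sketch. It is not the paper's image-based Q-function: the two states, the single skill, and the success probabilities in `TRUE_SUCCESS` are made up for illustration. Because each episode here is a single skill attempt, the TD target reduces to the terminal reward, and the learned value approaches the success probability in each state.

```python
import random

# Hypothetical success probabilities for one skill ("pick up the apple")
# in two made-up states. These numbers are illustrative, not from the paper.
TRUE_SUCCESS = {"apple_visible": 0.8, "no_apple": 0.05}


def learn_affordance(episodes=5000, seed=0):
    """Learn a per-state value for the skill from a sparse 0/1 terminal reward."""
    rng = random.Random(seed)
    q = {state: 0.0 for state in TRUE_SUCCESS}
    visits = {state: 0 for state in TRUE_SUCCESS}
    for _ in range(episodes):
        state = rng.choice(list(TRUE_SUCCESS))
        # Reward is 1 if the skill succeeded, 0 otherwise.
        reward = 1.0 if rng.random() < TRUE_SUCCESS[state] else 0.0
        visits[state] += 1
        alpha = 1.0 / visits[state]  # decaying step size (running average)
        # One-step episode: the TD target is just the terminal reward.
        q[state] += alpha * (reward - q[state])
    return q


q = learn_affordance()
for state, value in q.items():
    print(f"{state}: learned value {value:.2f}, true success {TRUE_SUCCESS[state]:.2f}")
```

The learned values can be read directly as "how likely is this skill to work here", which is the role the value function plays in SayCan.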

### SayCan

![Screenshot 2024-11-01 at 2.08.45 PM.png](../../images/notes/Screenshot_2024-11-01_at_2.08.45_PM.png)

We are given an instruction $i$ and a set of valid skills where each skill $\pi \in \Pi$ has a short language description $\ell_\pi$.

To execute the instruction, we need to model the probability that a given skill $\pi$ makes progress toward completing the instruction $i$ in the current state $s$: $p(c_i | i, s, \ell_\pi)$, where $c_i$ is the event that the skill successfully makes progress on the instruction (an event, not itself a probability).

We can factorize this as $p(c_i | i, s, \ell_\pi) \propto p(c_\pi|s, \ell_\pi)p(\ell_\pi|i)$, where $c_\pi$ is the event that skill $\pi$ itself is executed successfully.

The LLM gives us $p(\ell_\pi|i)$, the probability that the skill $\pi$ with language description $\ell_\pi$ makes progress toward $i$. This is obtained by scoring each candidate description with the LLM rather than sampling free-form text from it.

$p(c_\pi|s, \ell_\pi)$ is the affordance function: the probability that the skill succeeds given the current state. The paper calls this the world-grounding.
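A small worked example of how the two factors interact, under stated assumptions: the instruction, the three skill descriptions, and every number below are hypothetical and chosen only to show the effect of the product. The LLM alone prefers a skill that is not currently feasible, and multiplying by the affordance changes the choice.

```python
# Hypothetical scores for the instruction "I spilled my drink, can you help?"
# in a state where no sponge is within reach. None of these numbers are from the paper.

# p(l_pi | i): how relevant the LLM thinks each skill description is.
llm_score = {
    "find a sponge": 0.35,
    "pick up the sponge": 0.50,
    "go to the table": 0.15,
}

# p(c_pi | s, l_pi): how likely each skill is to succeed in the current state.
affordance = {
    "find a sponge": 0.9,
    "pick up the sponge": 0.1,  # no sponge nearby, so picking it up will likely fail
    "go to the table": 0.9,
}

# Combined score, proportional to p(c_i | i, s, l_pi).
combined = {skill: llm_score[skill] * affordance[skill] for skill in llm_score}

llm_choice = max(llm_score, key=llm_score.get)
saycan_choice = max(combined, key=combined.get)

print("LLM alone picks:", llm_choice)
print("LLM x affordance picks:", saycan_choice)
```

Here the language model favors "pick up the sponge", but its low affordance drops the combined score below that of "find a sponge".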

> The optimal skill according to the language model is computed via $\ell_\pi = \mathrm{arg max}_{\ell_\pi \in \ell_\Pi} p(\ell_\pi|i)$. Once selected, the process proceeds by iteratively selecting a skill and appending it to the instruction.

The prompt is structured as a series of human-robot conversations to give the LLM an idea of the task structure.

> The key idea of SayCan is to ground large language models through value functions – affordance functions that capture the log likelihood that a particular skill will be able to succeed in the current state.

The affordance function models the likelihood that a skill can be executed successfully given the current state (with an image and other sensors).

The LLM score and the value function are combined to select the skill to execute. The description $\ell_\pi$ of the selected skill is then appended to the prompt, and the process repeats until a termination skill is selected.

$$
\pi = \textrm{argmax}_{\pi \in \Pi} \: p(c_\pi|s, \ell_\pi)p(\ell_\pi|i)
$$
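The selection rule and the iterative loop can be sketched as follows. This is a toy illustration, not the paper's implementation: `llm_score` is a hard-coded stand-in for a real LLM scoring call, `affordance` is a hand-written stand-in for the learned value functions, `execute` fakes the robot acting in the world, and the skill names, state fields, and numbers are all hypothetical.

```python
SKILLS = ["find a sponge", "pick up the sponge", "bring it to you", "done"]


def llm_score(instruction, history, skill):
    """Stand-in for p(l_pi | i) given the steps so far. Not a real LLM call."""
    if skill == "done":
        return 0.7 if "bring it to you" in history else 0.05
    if skill in history:
        return 0.1  # already done, so less likely to be the next step
    base = {"find a sponge": 0.3, "pick up the sponge": 0.5, "bring it to you": 0.2}
    return base[skill]


def affordance(state, skill):
    """Stand-in for p(c_pi | s, l_pi), the learned value function."""
    if skill == "find a sponge":
        return 0.9
    if skill == "pick up the sponge":
        return 0.9 if state["near_sponge"] and not state["holding_sponge"] else 0.05
    if skill == "bring it to you":
        return 0.9 if state["holding_sponge"] else 0.05
    return 1.0  # assumption: the termination skill is always feasible


def execute(state, skill):
    """Fake world model: assumes the selected skill always succeeds."""
    new_state = dict(state)
    if skill == "find a sponge":
        new_state["near_sponge"] = True
    elif skill == "pick up the sponge":
        new_state["holding_sponge"] = True
    elif skill == "bring it to you":
        new_state["holding_sponge"] = False
    return new_state


def plan(instruction, state, max_steps=10):
    history = []
    for _ in range(max_steps):
        # argmax over skills of affordance * LLM score
        best = max(
            SKILLS,
            key=lambda s: llm_score(instruction, history, s) * affordance(state, s),
        )
        if best == "done":
            break
        history.append(best)  # the chosen step is added to the context
        state = execute(state, best)
    return history


steps = plan(
    "I spilled my drink, can you help?",
    {"near_sponge": False, "holding_sponge": False},
)
print(steps)
```

Note that the stub LLM scores "pick up the sponge" highest at the first step, but the affordance term defers it until the sponge has been found.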

### Implementing SayCan

> To instantiate SayCan, we must provide it with a set of skills, each of which has a policy, a value function, and a short language description.

> In our implementation, we train the individual skills either with image-based behavioral cloning, following the BC-Z method, or reinforcement learning, following MT-Opt.

SayCan can use either imitation learning via behavioral cloning (from tele-op examples) or reinforcement learning in simulation.

> Regardless of how the skill’s policy is obtained, we utilize value functions trained via TD backups as the affordance model for that skill.

The value functions (affordance functions) use Q-learning in either case.

> While we find that the BC policies achieve higher success rates at the current stage of our data collection process, the value functions provided by the RL policies are crucial as an abstraction to translate control capabilities to a semantic understanding of the scene.

SayCan also conditions the BC/RL models on language by embedding the skill descriptions with a large sentence encoder.

> We utilize both BC and RL policy training procedures to obtain the
> language-conditioned policies and value functions, respectively.

### Experimental Evaluation

> We test across 101 instructions from 7 instruction families.

![Screenshot 2024-11-01 at 2.23.18 PM.png](../../images/notes/Screenshot_2024-11-01_at_2.23.18_PM.png)

They measure **plan success rate**, based on whether the skills selected by the model are correct for a given instruction, and **execution success rate**, based on whether the robot actually executed the task successfully.
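As a sketch of how the two metrics differ, the snippet below computes both from a list of trial records. The `Trial` record type and the four example trials are hypothetical and are not data from the paper. A trial can have a correct plan and still fail in execution.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    instruction: str
    plan_correct: bool          # were the selected skills right for the instruction?
    execution_succeeded: bool   # did the robot actually complete the task?


def success_rates(trials):
    """Return (plan success rate, execution success rate) over all trials."""
    n = len(trials)
    plan_rate = sum(t.plan_correct for t in trials) / n
    exec_rate = sum(t.execution_succeeded for t in trials) / n
    return plan_rate, exec_rate


# Made-up trials for illustration only.
trials = [
    Trial("bring me a snack", True, True),
    Trial("throw away the can", True, False),  # good plan, failed grasp
    Trial("wipe up the spill", True, True),
    Trial("bring me a drink with caffeine", False, False),
]

plan_rate, exec_rate = success_rates(trials)
print(f"plan success: {plan_rate:.0%}, execution success: {exec_rate:.0%}")
```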

> In the mock kitchen, PaLM-SayCan achieved a planning success rate of 84% and an execution rate of 74%.

![Screenshot 2024-11-01 at 2.26.16 PM.png](../../images/notes/Screenshot_2024-11-01_at_2.26.16_PM.png)

> We also find that PaLM-SayCan struggles with negation (e.g., “bring me a snack that isn’t an apple”) and ambiguous references (e.g. asking for drinks with caffeine), which is a known issue inherited from underlying language models.

Robotics systems built on LLMs inherit the failure modes of the underlying LLMs.

**New Capabilities**

> SayCan is capable of integrating new skills by simply adding the new skills as options for the LLM and providing accompanying value functions and add an example in the prompt with that skill.

> SayCan can be integrated with recent work improving LLM reasoning, such as Chain of Thought.

> While not explicitly designed to work with multilingual queries, PaLM-SayCan is able to handle them.

### Conclusion

> We presented SayCan, a method that enables leveraging and grounding the rich knowledge in large language models to complete embodied tasks.

SayCan grounds an LLM in the real world by pre-training a set of robot skills and using their value functions to weight the LLM's choice of next skill.

> More specifically, we use reinforcement learning as a way to learn value functions for the individual skills that provide affordances of what is possible in the world, and then use textual labels for these skills as potential responses that are scored by a language model.

The primary limitation of SayCan is the range of skills it has access to.

Natural language provides semantic understanding but may not be the most descriptive medium for certain tasks.
