# In-Context Learning


In-context learning is a generalisation of few-shot learning: the LLM is given a context as part of the prompt and asked to respond using the information in that context.

* Example: *"Summarize this research article into one paragraph highlighting its strengths and weaknesses: [insert article text]"*
* Example: *"Extract all the quotes from this text and organize them in alphabetical order: [insert text]"*

Retrieval-Augmented Generation (RAG), a very popular technique that you will learn in week 5, is a form of in-context learning, where:
* a search engine is used to retrieve some relevant information
* that information is then provided to the LLM as context
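
The two steps above can be sketched without any external service. The snippet below is a minimal illustration, not a production retriever: the word-overlap scoring, the `retrieve` and `build_prompt` helpers, and the toy documents are all assumptions made for this example. A real RAG system would typically use a search engine or an embedding index instead.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return the k documents sharing the most words with the query.

    Word overlap is only a stand-in for a real search engine (assumption).
    """
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_docs):
    """Place the retrieved text in the prompt so the LLM can use it as context."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Toy corpus, invented for illustration only.
documents = [
    "pypdf is a Python library for reading PDF files.",
    "BeautifulSoup parses HTML and XML documents.",
    "tqdm displays progress bars for Python loops.",
]

query = "Which library reads PDF files?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; the model never sees the documents that were not retrieved.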


In this example we download some recent research papers from arXiv, extract the text from the PDF files, and ask Gemini to summarize each article and identify its main strengths and weaknesses. Finally, we write the summaries to a local HTML file and display them as Markdown.

In [1]:
!pip install requests bs4 google-generativeai pypdf tqdm

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting pypdf
  Downloading pypdf-6.5.0-py3-none-any.whl.metadata (7.1 kB)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Downloading pypdf-6.5.0-py3-none-any.whl (329 kB)
Installing collected packages: pypdf, bs4
Successfully installed bs4-0.0.2 pypdf-6.5.0


In [2]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm
# from google.colab import userdata

In [3]:
API_KEY = "#"  # placeholder: replace with your own Gemini API key
genai.configure(api_key=API_KEY)

We select the papers currently featured on the Hugging Face Papers page.

In [4]:
BASE_URL = "https://huggingface.co/papers"
page = requests.get(BASE_URL)
soup = BeautifulSoup(page.content, "html.parser")
h3s = soup.find_all("h3")

papers = []

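for h3 in h3s:
    a = h3.find("a")
    if a is None or not a.get("href"):
        continue  # skip headings that do not link to a paper
    title = a.text
    # "/papers/<id>" becomes "/<id>", which is appended to the arXiv PDF base URL
    link = a["href"].replace('/papers', '')

    papers.append({"title": title, "url": f"https://arxiv.org/pdf{link}"})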
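
The link rewriting in the loop above can be checked in isolation. The helper below is a small sketch; the function name `paper_link_to_pdf_url` is hypothetical, and it assumes that each link on the page has the form `/papers/<arXiv id>`, as the `replace` call in the cell implies.

```python
def paper_link_to_pdf_url(href):
    """Turn a relative link such as '/papers/2512.21218' into an arXiv PDF URL.

    Assumes the href starts with '/papers/' followed by the arXiv identifier.
    """
    prefix = "/papers/"
    if not href.startswith(prefix):
        raise ValueError(f"Unexpected link format: {href!r}")
    arxiv_id = href[len(prefix):]
    return f"https://arxiv.org/pdf/{arxiv_id}"


print(paper_link_to_pdf_url("/papers/2512.21218"))
# prints https://arxiv.org/pdf/2512.21218
```

Raising an error on unexpected links makes a change in the page layout visible immediately, rather than producing broken PDF URLs later.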

Helper functions to extract text from web pages and PDFs, and to display Markdown.

In [5]:
def extract_paper(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text


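def extract_pdf(url):
    # download the PDF to a local file, then read it page by page
    urlretrieve(url, "pdf_file.pdf")
    reader = PdfReader("pdf_file.pdf")
    text = ""
    for page in reader.pages:
        # guard against pages that yield no text (e.g. image-only pages)
        text += (page.extract_text() or "") + "\n"
    return text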


def printmd(string):
    display(Markdown(string))

In [6]:
LLM = "gemini-2.5-flash"
model = genai.GenerativeModel(LLM)

We use Gemini to summarize the papers.

In [7]:
for paper in tqdm(papers):
    try:
        paper["summary"] = model.generate_content("Summarize this research article into one paragraph without formatting highlighting its strengths and weaknesses. " + extract_pdf(paper["url"])).text
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("Generation failed")
        paper["summary"] = "Paper not available"

100%|██████████| 7/7 [01:07<00:00,  9.65s/it]


We write the results to an HTML file.

In [8]:
# visible content belongs in <body>; <head> only holds metadata such as the title
header = f"<html> <head> <title>Daily Dose of AI Research</title> </head> <body> <h1>Daily Dose of AI Research</h1> <h4>{date.today()}</h4> <p><i>Summaries generated with: {LLM}</i></p>"
with open("papers.html", "w", encoding="utf-8") as f:
    f.write(header)
    for paper in papers:
        f.write(f'<h2><a href="{paper["url"]}">{paper["title"]}</a></h2> <p>{paper["summary"]}</p>')
    f.write("</body> </html>")
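
Titles and summaries come from external sources and may contain characters such as `<` or `&` that have special meaning in HTML. As a hedged sketch, the standard-library `html.escape` function could be applied before interpolation. The `render_paper` helper and the sample dictionary (including its placeholder URL) are made up for this example.

```python
from html import escape


def render_paper(paper):
    """Build one HTML fragment for a paper, escaping text that is not markup."""
    url = escape(paper["url"], quote=True)
    title = escape(paper["title"])
    summary = escape(paper["summary"])
    return f'<h2><a href="{url}">{title}</a></h2> <p>{summary}</p>'


# Sample entry, invented for illustration (the URL is a placeholder).
sample = {
    "title": "Speed & Accuracy <Revisited>",
    "url": "https://arxiv.org/pdf/0000.00000",
    "summary": "Precision > recall in this setting.",
}
fragment = render_paper(sample)
print(fragment)
```

Note that escaping also neutralises intentional markup, so it suits plain-text summaries rather than summaries that are meant to carry HTML formatting.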

We can also print the results to this notebook as markdown.

In [9]:
for paper in papers:
    printmd("**[{}]({})**<br>{}<br><br>".format(paper["title"],
                                                paper["url"],
                                                paper["summary"]))

**[Latent Implicit Visual Reasoning](https://arxiv.org/pdf/2512.21218)**<br>This research introduces Latent Implicit Visual Reasoning (LIVR), a novel task-agnostic method designed to enhance Large Multimodal Models (LMMs) in predominantly visual reasoning tasks, addressing their inherent text-centric bias and the limitations of explicit visual supervision. LIVR achieves this by training LMMs to autonomously discover and utilize visual reasoning tokens through a two-stage visual bottlenecking approach, eliminating the need for costly annotations or hand-crafted intermediate visual steps. A key strength of LIVR is its consistent outperformance of direct supervised fine-tuning and achievement of state-of-the-art results across a diverse range of perception-heavy tasks, including those where defining intermediate visual abstractions is particularly challenging (like Art Style or Visual Similarity), while also demonstrating strong generalization capabilities in multi-task instruction tuning across multiple LMM backbones. The method's ability to implicitly learn useful visual structures without additional data or human biases is a significant advantage. However, a notable weakness of LIVR is that these latent tokens are inherently less interpretable compared to explicit textual explanations, making it more challenging to trace the model's internal reasoning. Furthermore, while effective, the method's performance can be sensitive to the number of latent tokens used, and the authors note future work on scaling to even larger models and datasets.<br><br>

**[Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning](https://arxiv.org/pdf/2512.20605)**<br>This research introduces "internal RL," a novel hierarchical reinforcement learning method that overcomes the inefficiency of token-by-token exploration in autoregressive models by operating within their internal representations. A key strength is the demonstration that pretrained autoregressive models inherently learn behaviorally meaningful temporal abstractions in their internal activations, which an unsupervised metacontroller then learns to discover, sequence, and terminate without explicit labels. This "internal RL" approach significantly outperforms standard RL finetuning and existing hierarchical RL methods on sparse-reward grid world and MuJoCo tasks, enabling efficient exploration and compositional generalization through a learned, quasi-binary switching mechanism that dynamically compresses time. However, the study acknowledges the controlled nature of its experimental setup, noting that scalability to larger models and real-world tasks remains a future challenge. Additionally, optimal metacontroller performance relies on high-quality expert demonstrations for its self-supervised training, and the base model must be kept frozen during this phase, a constraint shown to be critical for the approach's success and the discovery of meaningful abstractions.<br><br>

**[Spatia: Video Generation with Updatable Spatial Memory](https://arxiv.org/pdf/2512.15716)**<br>Spatia introduces a novel spatial memory-aware framework for video generation, addressing the pervasive challenge of maintaining long-term spatial and temporal consistency in generated videos. It achieves this by explicitly preserving and iteratively updating a 3D scene point cloud as persistent spatial memory, leveraging visual SLAM to incorporate new content. This design enables several key strengths: robust dynamic-static disentanglement, allowing generation of dynamic entities within a coherent static scene; enhanced spatial consistency across multiple viewpoints; precise 3D-aware camera control; and intuitive interactive editing of scene elements. The model demonstrates superior visual quality and long-horizon consistency across various benchmarks, particularly excelling in closed-loop generation by effectively "remembering" revisited locations. However, the approach has notable weaknesses, including its heavy reliance on external 3D reconstruction (MapAnything) for initial scene estimation and memory updates, which introduces a dependency and potential points of failure or inaccuracy. Furthermore, it requires a separate segmentation step to remove dynamic entities before point cloud estimation, adding preprocessing complexity. While enhancing consistency, the iterative updating of the scene point cloud for "all previously generated frames" could pose scalability challenges for extremely long sequences, and maintaining high point cloud density for optimal visual quality consumes significant computational resources.<br><br>

**[Schoenfeld's Anatomy of Mathematical Reasoning by Language Models](https://arxiv.org/pdf/2512.19995)**<br>This research introduces ThinkARM, a novel framework that applies Schoenfeld's Episode Theory to systematically analyze the internal reasoning processes of large language models during mathematical problem-solving. The study's key strength lies in its ability to abstract complex LLM reasoning traces into explicit, functional steps like Analysis, Explore, Implement, and Verify, revealing a reproducible "cognitive heartbeat" in reasoning models that progresses from abstract conceptualization to concrete execution and finally to evaluative control. Through large-scale annotation using GPT-5, ThinkARM effectively distinguishes the structured, iterative reasoning of advanced models—characterized by balanced effort distribution and frequent Explore-Monitor/Verify loops—from the more feed-forward, execution-heavy patterns of non-reasoning models. Furthermore, it diagnostically links specific episode transition patterns to solution correctness, showing that successful solutions route exploratory uncertainty into monitoring or re-analysis, and highlights how efficiency-oriented methods selectively prune evaluative feedback steps rather than merely shortening responses. However, a notable weakness is the framework's reliance on an automatic LLM annotator, which, despite strong human agreement, could introduce labeling noise. Additionally, its primary focus on mathematical problem-solving limits the immediate generalizability of these findings to other diverse reasoning domains.<br><br>

**[How Much 3D Do Video Foundation Models Encode?](https://arxiv.org/pdf/2512.19949)**<br>This research quantifies the inherent 3D understanding within Video Foundation Models (VidFMs) by developing a model-agnostic framework that probes frozen VidFM features to estimate 3D points, depth maps, and camera poses. Its primary strength lies in demonstrating that state-of-the-art video generative models, such as WAN and Open-Sora2.0, develop strong, generalizable 3D awareness—often surpassing specialized 3D models on out-of-domain data and significantly outperforming image-only models, underscoring the critical role of temporal reasoning. The study also offers valuable insights into optimal feature extraction layers and timesteps within diffusion models, and highlights the practical benefit of VidFM features for feedforward 3D reconstruction in limited-data regimes. However, a notable weakness is the reliance on publicly available checkpoints, which precluded controlled experimentation on factors like data scale or training strategy, thus limiting definitive attribution of observed differences, such as the mixed impact of model scaling and the potential for 3D fine-tuning to degrade generalization. Additionally, resource constraints prevented the training of large-scale 3D reconstruction models from scratch using VidFM features on massive datasets.<br><br>

**[VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation](https://arxiv.org/pdf/2512.19680)**<br>VA-π introduces a novel, lightweight post-training framework designed to address the critical misalignment in autoregressive (AR) visual generation where models optimized for token likelihood often produce low-quality images when decoded from tokenizers trained solely for pixel reconstruction. Its primary strength lies in formulating this problem as a variational optimization, deriving an Evidence Lower Bound (ELBO) that unifies pixel reconstruction with AR modeling by treating the generator as an RL policy and leveraging pixel-space reconstruction quality (derived from teacher forcing) as an intrinsic reward, alongside a regularization term for token distribution consistency. This approach is highly efficient and adaptable, achieving substantial improvements in image fidelity (FID, IS) and compositional understanding (GenEval) on models like LlamaGen and Janus-Pro with minimal data and compute (e.g., 1% ImageNet-1K, 25 minutes), without requiring tokenizer retraining or external reward models. However, while VA-π effectively bridges the token-pixel gap through intrinsic pixel-level guidance, its performance is inherently bounded by the capabilities of the fixed, pre-trained tokenizer, and although the regularization term aims to maintain distributional consistency, the reliance on teacher forcing for reward calculation does not fundamentally resolve the exposure bias problem that arises during free-running AR generation.<br><br>

**[GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training](https://arxiv.org/pdf/2512.13043)**<br>GTR-Turbo introduces an efficient approach to training multi-turn vision-language model agents by leveraging a "free" teacher derived from merging historical checkpoints generated during the ongoing reinforcement learning process, eliminating the need for costly and often inaccessible external models like GPT or Gemini. This innovative framework guides the agent's reasoning through either supervised fine-tuning or soft logit distillation, leading to significant strengths such as a 10-30% accuracy improvement over baselines, a 50% reduction in wall-clock training time, and a 60% decrease in compute cost, all while stabilizing training and mitigating "thought collapse." However, the system's reliance on self-evolution means a foundational level of capability is required from the base model to prevent issues like passive exploration, potentially necessitating external knowledge for models with very low initial success rates. Additionally, the reported experimental results primarily focus on 7B models, leaving the framework's performance characteristics across larger model scales for future investigation.<br><br>

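# New Version

In this version, the prompt asks for a Markdown table with two columns, Strengths and Weaknesses, instead of a single paragraph.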

In [10]:
prompt = (
    "Summarize this research article. The output must be a Markdown table "
    "with exactly two columns: 'Strengths' and 'Weaknesses'. "
    "Provide a concise summary of the key points in each column. "
    "Do not include any other text or formatting outside the table. "
)

for paper in tqdm(papers):
    try:
        paper["summary"] = model.generate_content(prompt + extract_pdf(paper["url"])).text
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("Generation failed")
        paper["summary"] = "Paper not available"

 71%|███████▏  | 5/7 [00:52<00:15,  7.83s/it]

Generation failed


 86%|████████▌ | 6/7 [00:53<00:05,  5.70s/it]

Generation failed


100%|██████████| 7/7 [00:55<00:00,  7.96s/it]

Generation failed
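
The output above shows three failed requests, but the message alone does not reveal why. If the failures turn out to be transient (rate limiting is one possible cause, though nothing in the output confirms it), retrying with a pause between attempts may help. The `with_retries` helper below is a hypothetical sketch that wraps any callable; the retry count and delays are arbitrary choices.

```python
import time


def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait and retry, doubling the delay each time.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)


# Demonstration with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# A no-op sleep keeps the demonstration instant.
result = with_retries(flaky, sleep=lambda seconds: None)
print(result)
```

In the summarization loop, the model call could be wrapped in the same way, for example `with_retries(lambda: model.generate_content(prompt + text).text)`, where `text` holds the extracted PDF text. Printing the caught exception, as the helper does, also shows what actually went wrong.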





In [11]:
# visible content belongs in <body>; <head> only holds metadata such as the title
header = f"<html> <head> <title>Daily Dose of AI Research</title> </head> <body> <h1>Daily Dose of AI Research</h1> <h4>{date.today()}</h4> <p><i>Summaries generated with: {LLM}</i></p>"
with open("papers.html", "w", encoding="utf-8") as f:
    f.write(header)
    for paper in papers:
        f.write(f'<h2><a href="{paper["url"]}">{paper["title"]}</a></h2> <p>{paper["summary"]}</p>')
    f.write("</body> </html>")

In [12]:
for paper in papers:
    # a blank line before the summary lets a Markdown table start on its own line
    printmd("**[{}]({})**\n\n{}\n\n".format(paper["title"],
                                            paper["url"],
                                            paper["summary"]))

**[Latent Implicit Visual Reasoning](https://arxiv.org/pdf/2512.21218)**

| Strengths | Weaknesses |
| :--- | :--- |
| - **Novelty & Efficiency:** Introduces Latent Implicit Visual Reasoning (LIVR) that implicitly learns visual representations via latent tokens and a visual bottleneck, requiring no explicit supervision, extra data, or annotation costs. | - **Interpretability:** Latent tokens are inherently less interpretable than explicit textual explanations or human-designed intermediate visual steps.                                                                                                                                                     |
| - **Superior Performance:** Consistently outperforms direct supervised fine-tuning (SFT) and explicitly supervised latent reasoning methods (e.g., Mirage) across three diverse LMM backbones and nine perception-heavy tasks.            | - **Scalability (Current Scope):** While effective, the paper suggests future work on scaling to larger models and datasets, implying current evaluations are not at the absolute largest scales.                                                                                                    |
| - **Strong Generalization:** Demonstrates robust performance and generalizes well in both single-task and multi-task instruction tuning settings.                                                                                          | - **Latent Capacity Optimization:** Performance can decrease with an excessive number of latent tokens (e.g., K=32), indicating a need for careful balancing or potentially task-specific tuning for optimal latent capacity.                                                                        |
| - **Enhanced Visual Reasoning:** Particularly effective on complex visual abstraction tasks where human-defined intermediate steps are difficult, indicating deeper learned visual understanding.                                       | - **Hyperparameter Sensitivity:** The method's effectiveness is sensitive to architectural and training choices like latent token placement, specific masking strategy, and the Stage 1/Stage 2 training schedule, requiring careful configuration for peak performance. |
| - **Validated Mechanism:** Ablation studies confirm that both dedicated latent tokens and the visual bottleneck are essential for LIVR's gains and that latents actively encode task-relevant visual information. | |

**[Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning](https://arxiv.org/pdf/2512.20605)**

| Strengths | Weaknesses |
| :--- | :--- |
| **Efficient Hierarchical Reinforcement Learning (HRL):** Overcomes the inefficiency of token-by-token exploration in autoregressive models by operating on temporally-abstract actions, enabling learning from sparse rewards where standard RL fails.                                                                                                                                                                                            | **Limited Experimental Scope:** The demonstrated "overwhelming success" is acknowledged to be within a "controlled nature of our experimental setup" (grid world and MuJoCo tasks), and its applicability to larger-scale or real-world environments, including LLMs, is explicitly stated as future work.                                                                                                                                      |
| **Unsupervised Temporal Abstraction Discovery:** Introduces a novel metacontroller architecture that learns to discover and sequence meaningful temporally-abstract actions, along with their termination conditions, from unlabeled behavioral data without requiring explicit supervision.                                                                                                                                                          | **Reliance on Frozen Pretrained Model:** The method critically depends on pretraining a base autoregressive model and then freezing its parameters. Co-training the metacontroller with the base model, or training from scratch, leads to degenerate abstract action representations and failure in subsequent RL.                                                                                                                                 |
| **Leverages Internal Representations:** Demonstrates that pretrained autoregressive models inherently form linearly controllable, temporally-abstract action representations in their internal residual stream activations, which can be effectively steered by external controllers.                                                                                                                                                             | **Sensitivity to Demonstration Quality:** Empirical evidence suggests that the metacontroller performs better when trained on "cleaner" (more optimal) expert demonstrations, implying potential sensitivity to the quality and optimality of the unlabeled behavioral dataset used for its self-supervised learning phase.                                                                                                                                   |
| **Superior Performance on Challenging Tasks:** Achieves high success rates and efficient credit assignment on hierarchically-structured, sparse-reward tasks where standard RL finetuning (e.g., GRPO) and existing HRL methods (e.g., CompILE) fail to learn effectively.                                                                                                                                                                    | **Comparison Baselines:** While impressive, the performance gap is shown on tasks designed to be extremely difficult for token-level RL (e.g., "minuscule chance" of success by random sampling), which might inflate the relative advantage compared to more competitive scenarios or different task complexities.                                                                                                                                |
| **Compositional Generalization:** The learned internal controllers can be sequenced to solve complex tasks requiring combinations of subgoals that were not explicitly seen during the metacontroller's training.                                                                                                                                                                                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| **Theoretically Grounded Architecture:** Provides a concrete neural architecture that aligns with Schmidhuber's theoretical framework of alternating self-supervised learning for history compression and reinforcement learning for control, demonstrating empirical backing for these ideas. | |

**[Spatia: Video Generation with Updatable Spatial Memory](https://arxiv.org/pdf/2512.15716)**

| Strengths | Weaknesses |
| :--- | :--- |
| **Persistent 3D Spatial Memory:** Novel use of an updatable 3D point cloud ensures long-term spatial and temporal consistency in generated videos. | **Reliance on External Models:** Integrates several off-the-shelf 3D reconstruction/SLAM and video diffusion models, making it dependent on their performance and limitations. |
| **Dynamic-Static Disentanglement:** Enables generating dynamic elements while maintaining a consistent static scene, surpassing static-only generation limits. | **Resource Intensiveness:** Training and maintaining dense 3D spatial memory requires substantial computational resources (e.g., 64 GPUs for training), and point cloud density affects quality. |
| **Explicit 3D Control:** Offers precise camera control via 3D trajectories and supports interactive, 3D-aware editing of scene elements. | **Static-Only Memory:** The explicit spatial memory mechanism models only the static scene, potentially limiting complex dynamic interactions that fundamentally alter the scene geometry within the memory itself. |
| **High Performance & Long-Horizon Coherence:** Achieves superior visual quality and robust spatial coherence over extended video durations, validated across benchmarks. | **Sensitivity to Point Cloud Density:** Achieving fine-grained spatial guidance requires high point cloud density, which directly impacts memory storage and potentially increases processing time, as shown in ablation studies. |

**[Schoenfeld's Anatomy of Mathematical Reasoning by Language Models](https://arxiv.org/pdf/2512.19995)**

| Strengths | Weaknesses |
| :--- | :--- |
| Introduces ThinkARM, a scalable framework for abstracting LLM reasoning into explicit functional steps (e.g., Analysis, Explore, Implement, Verify) using Schoenfeld’s Episode Theory.                  | Relies on an automatic annotator (GPT-5), which, despite high agreement with human annotations, may introduce labeling noise into the large-scale dataset.                                            |
| Reveals reproducible thinking dynamics, distinct lexical patterns for each reasoning step, and a three-phase "cognitive heartbeat" (Initialization, Execution, Convergence) across LLMs.               | The primary focus of the experiments is on mathematical problem-solving, limiting the immediate generalizability of findings to other domains and complex reasoning tasks.                         |
| Differentiates reasoning from non-reasoning models by structural effort distribution (e.g., balanced vs. Implement-heavy) and the presence of iterative feedback loops (e.g., Explore-Monitor/Verify). |                                                                                                                                                                                                        |
| Provides diagnostic insights by linking episode-level patterns to solution correctness (e.g., routing exploration to monitoring/analysis for correct solutions) and analyzing efficiency methods. | |

**[How Much 3D Do Video Foundation Models Encode?](https://arxiv.org/pdf/2512.19949)**<br>Paper not available<br><br>

**[VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation](https://arxiv.org/pdf/2512.19680)**<br>Paper not available<br><br>

**[GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training](https://arxiv.org/pdf/2512.13043)**<br>Paper not available<br><br>
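
Since the prompt requests a table with exactly two columns, a quick structural check can flag responses that deviate from it. The parser below is a simplified sketch that assumes cells contain no escaped pipe characters; `parse_two_column_table` is a hypothetical helper, not part of any library, and the sample response is made up.

```python
def split_row(line):
    """Split one Markdown table row into stripped cell strings."""
    line = line.strip()
    # Remove exactly one leading and one trailing pipe, if present.
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_two_column_table(markdown):
    """Parse a Markdown table with 'Strengths' and 'Weaknesses' columns.

    Returns a list of (strength, weakness) tuples, or raises ValueError
    if the text does not have the expected shape.
    """
    lines = [line for line in markdown.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("Expected a header row and a separator row")

    header = split_row(lines[0])
    if header != ["Strengths", "Weaknesses"]:
        raise ValueError(f"Unexpected header: {header}")

    rows = []
    for line in lines[2:]:  # lines[1] is the separator row
        cells = split_row(line)
        if len(cells) != 2:
            raise ValueError(f"Expected 2 cells, got {len(cells)}: {line!r}")
        rows.append((cells[0], cells[1]))
    return rows


# Small made-up response used only to exercise the parser.
sample = """
| Strengths | Weaknesses |
| :--- | :--- |
| Clear method | Small dataset |
| Strong results | |
"""
rows = parse_two_column_table(sample)
print(rows)
```

A response such as "Paper not available" fails this check with a `ValueError`, which gives the notebook a simple way to separate usable summaries from failed ones.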