# In-Context Learning


In-context learning is a generalisation of few-shot learning in which the LLM is given a context as part of the prompt and asked to respond using the information in that context.

* Example: *"Summarize this research article into one paragraph highlighting its strengths and weaknesses: [insert article text]"*
* Example: *"Extract all the quotes from this text and organize them in alphabetical order: [insert text]"*

Retrieval-Augmented Generation (RAG), a very popular technique that you will learn in week 5, is a form of in-context learning in which:
* a search engine is used to retrieve some relevant information
* that information is then provided to the LLM as context
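
The two RAG steps above can be sketched without any external service. This is only an illustration: the word-overlap `retrieve` function stands in for a real search engine, and the names `tokenize`, `retrieve`, `build_prompt`, and the sample documents are hypothetical, not part of any library or of this notebook.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Toy retriever (assumption): rank documents by shared words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, context_docs):
    """Place the retrieved passages in the prompt ahead of the question."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

documents = [
    "RAG retrieves relevant passages and adds them to the prompt.",
    "Few-shot learning places worked examples in the prompt.",
    "PDF files can be parsed page by page.",
]
query = "How does RAG use retrieved passages?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM; the model itself is unchanged, and only the context it sees differs from query to query.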


In this example we download recent research papers from arXiv, extract the text from the PDF files, and ask Gemini to summarize each paper and list its main strengths and weaknesses. Finally, we write the summaries to a local HTML file and display them in the notebook as Markdown.

In [1]:
!pip install requests beautifulsoup4 google-generativeai pypdf tqdm



In [2]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm
from google.colab import userdata

In [None]:
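# Read the key from Colab's secret manager instead of hard-coding it in the notebook.
# "GOOGLE_API_KEY" is an assumed secret name; use the name you stored your key under.
API_KEY = userdata.get("GOOGLE_API_KEY")
genai.configure(api_key=API_KEY)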

We select the papers currently featured on the Hugging Face Papers page.

In [3]:
BASE_URL = "https://huggingface.co/papers"
page = requests.get(BASE_URL)
soup = BeautifulSoup(page.content, "html.parser")
h3s = soup.find_all("h3")

papers = []

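for h3 in h3s:
    a = h3.find("a")
    if a is None:  # skip headings that do not link to a paper
        continue
    title = a.text.strip()
    # "/papers/<arXiv id>" -> "/<arXiv id>"
    link = a["href"].replace('/papers', '')

    papers.append({"title": title, "url": f"https://arxiv.org/pdf{link}"})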
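
The link rewriting in the loop above can also be isolated in a small helper that rejects links which do not look like paper pages. This is a sketch under the assumption that paper links have the form `/papers/<arXiv id>`; the function name `to_arxiv_pdf_url` and the constant `ARXIV_PDF_BASE` are hypothetical.

```python
from urllib.parse import urlparse

ARXIV_PDF_BASE = "https://arxiv.org/pdf"

def to_arxiv_pdf_url(href):
    """Map a link such as "/papers/2512.24618" to an arXiv PDF URL.

    Returns None when the link does not look like a paper page.
    """
    path = urlparse(href).path
    prefix = "/papers/"
    if not path.startswith(prefix):
        return None
    paper_id = path[len(prefix):].strip("/")
    if not paper_id:
        return None
    return f"{ARXIV_PDF_BASE}/{paper_id}"

print(to_arxiv_pdf_url("/papers/2512.24618"))
```

Returning `None` for unexpected links lets the calling loop skip them instead of building a URL that would fail later at download time.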

Helper functions to extract text from web pages and PDFs, and to display Markdown in the notebook.

In [4]:
def extract_paper(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text


def extract_pdf(url):
    # download the PDF to a local file, then read it page by page
    urlretrieve(url, "pdf_file.pdf")
    reader = PdfReader("pdf_file.pdf")
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text


def printmd(string):
    display(Markdown(string))

In [None]:
LLM = "gemini-2.5-flash"
model = genai.GenerativeModel(LLM)

We use Gemini to summarize the papers.

In [5]:
PROMPT_TEMPLATE = """
You are given the full text of a research paper.

Your task has two parts:

1. Write a concise one-paragraph summary of the paper (plain text, no bullet points).
2. Then provide a Markdown table with exactly two columns:

| Strengths | Weaknesses |

Rules:
- Output MUST follow this structure exactly:
  Summary:
  <summary text>

  Strengths and Weaknesses:
  <Markdown table>
- Do NOT include any extra text outside this structure.
- Base your analysis only on the given paper.

Paper text:
"""
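
Because the prompt fixes the output structure, the response can be split back into its two parts with plain string operations. The sketch below assumes the model followed the `Summary:` / `Strengths and Weaknesses:` layout; `split_summary` is a hypothetical helper, and it returns the raw text unchanged when the headers are missing.

```python
def split_summary(response_text):
    """Split a response into (summary, table) using the two section headers."""
    summary_header = "Summary:"
    table_header = "Strengths and Weaknesses:"
    start = response_text.find(summary_header)
    end = response_text.find(table_header)
    if start == -1 or end == -1 or end < start:
        # The model did not follow the requested structure; return it unchanged.
        return response_text.strip(), ""
    summary = response_text[start + len(summary_header):end].strip()
    table = response_text[end + len(table_header):].strip()
    return summary, table

example = """Summary:
A short summary.

Strengths and Weaknesses:
| Strengths | Weaknesses |
|---|---|
| Fast | Narrow scope |
"""
summary, table = split_summary(example)
print(summary)
print(table)
```

Keeping the summary and the table separate makes it easier to render them differently later, for example plain text for the summary and a real table for the second part.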

In [7]:
for paper in tqdm(papers):
    try:
        paper_text = extract_pdf(paper["url"])
        paper["summary"] = model.generate_content(PROMPT_TEMPLATE + paper_text).text
    except Exception:
        # a failed download or generation for one paper should not stop the loop
        print("Generation failed")
        paper["summary"] = "Paper not available"

 85%|████████▌ | 17/20 [06:56<01:06, 22.23s/it]

Generation failed


100%|██████████| 20/20 [07:58<00:00, 23.93s/it]
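
One request in the run above failed and was replaced by a placeholder. If such failures are transient, retrying with an increasing delay may recover them. The following is a generic sketch, not part of the Gemini client: `call_with_retries` and `flaky_call` are hypothetical names, and the flaky function only simulates a call that fails twice before succeeding.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle the error
            sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for a flaky network call: fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

# Record the delays instead of actually sleeping, so the demo runs instantly.
delays = []
result = call_with_retries(flaky_call, sleep=delays.append)
print(result, delays)
```

In the loop above, the generation call could be wrapped as `call_with_retries(lambda: model.generate_content(PROMPT_TEMPLATE + paper_text))`, keeping the existing `try`/`except` as the final fallback.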


We write the results to an HTML file.

In [None]:
# Visible content belongs in <body>; <head> only carries metadata such as the title.
page = f"<html> <head> <title>Daily Dose of AI Research</title> </head> <body> <h1>Daily Dose of AI Research</h1> <h4>{date.today()}</h4> <p><i>Summaries generated with: {LLM}</i></p>"
with open("papers.html", "w") as f:
    f.write(page)
for paper in papers:
    # Modify the code to show the table
    page = f"""
        <h2><a href="{paper["url"]}">{paper["title"]}</a></h2>
        <pre>
        {paper["summary"]}
        </pre>
        """
    with open("papers.html", "a") as f:
        f.write(page)
end = "</body> </html>"
with open("papers.html", "a") as f:
    f.write(end)
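
Inside `<pre>` the Markdown table appears as raw text. One possible way to show it as a real table is to convert the pipe rows into HTML before writing the file. The sketch below uses only the standard library; `markdown_table_to_html` is a hypothetical helper that handles simple pipe tables like the ones requested in the prompt, and it does not support escaped pipes inside cells.

```python
import html

def markdown_table_to_html(table_md):
    """Convert a simple Markdown pipe table into an HTML <table> string."""
    rows = []
    for line in table_md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line[1:].split("|")]
        # Skip the delimiter row, e.g. |---|---| or | :--- | :--- |
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return ""
    parts = ["<table>"]
    for i, cells in enumerate(rows):
        tag = "th" if i == 0 else "td"  # first row is the header
        inner = "".join(f"<{tag}>{html.escape(cell)}</{tag}>" for cell in cells)
        parts.append(f"<tr>{inner}</tr>")
    parts.append("</table>")
    return "\n".join(parts)

sample_table = "| Strengths | Weaknesses |\n|---|---|\n| Fast & simple | |"
table_html = markdown_table_to_html(sample_table)
print(table_html)
```

Cell text is passed through `html.escape` so that characters such as `&` or `<` in a model-generated summary cannot break the page markup.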

We can also print the results to this notebook as markdown.

In [8]:
# Modify the code to show the table
for paper in papers:
    printmd(
        f"## [{paper['title']}]({paper['url']})\n\n"
        f"{paper['summary']}\n\n---\n"
    )

## [Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models](https://arxiv.org/pdf/2512.24618)

Summary:
Youtu-LLM is a 1.96B-parameter language model pre-trained from scratch to achieve high computational efficiency and native agentic intelligence, unlike smaller models relying on distillation. It features a dense Multi-Latent Attention architecture with a STEM-oriented vocabulary supporting a 128k context window, and employs a multi-stage "Commonsense-STEM-Agent" curriculum using an 11T token corpus. A key innovation is its scalable agentic mid-training, which synthesizes diverse high-quality trajectories across math, coding, deep research, and tool-use domains to internalize planning and reflection. Extensive evaluations demonstrate Youtu-LLM sets a new state-of-the-art for sub-2B LLMs, achieving competitive general performance and significantly surpassing existing SOTA baselines on agent-specific tasks, proving lightweight models can possess strong intrinsic agentic capabilities.

Strengths and Weaknesses:

| Strengths | Weaknesses |
| :-------- | :--------- |
| Achieves state-of-the-art performance for sub-2B LLMs, competitive with larger models on general benchmarks and significantly surpassing SOTA baselines on agentic tasks. | Acknowledged performance gap compared to larger proprietary foundation models due to computational resource constraints. |
| Pre-trained from scratch to systematically cultivate native agentic intelligence, reasoning, and planning capabilities, rather than relying on distillation. | Long reasoning trajectories can increase inference latency, posing a challenge for model efficiency. |
| Employs a compact Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, supporting a robust 128k context window for long-context reasoning. | Current scope is confined to text-based environments, lacking multimodal capabilities. |
| Utilizes a principled multi-stage "Commonsense-STEM-Agent" curriculum with an 11T token corpus, progressively shifting data distribution to acquire deep cognitive abilities. | Lack of a suitable agentic mathematical benchmark to effectively evaluate the corresponding capabilities of instruction-tuned models. |
| Features scalable agentic mid-training with diverse, high-quality synthetic trajectories (math, coding, deep research, tool-use) to internalize planning and reflection. | Excessive upsampling of domain-specific trajectory data (e.g., math) can lead to performance degradation in other areas, suggesting potential for distributional imbalance. |
| Provides the first systematic evidence that agentic pre-training can unlock agent potential in lightweight LLMs, demonstrating scalable growth of agent capabilities. | Some masking strategies, like fully masking user queries in math trajectories, showed only marginal impact, indicating potential for further optimization in data processing. |
| Comprehensive data engineering for Supervised Fine-Tuning (SFT) includes diverse data collection, reasoning answer construction, and multi-stage cleaning. | Slightly decreased performance on BFCL V3 benchmark with agentic mid-training compared to the version without it. |
| Incorporates a two-stage SFT strategy to first solidify reasoning and then refine general versatility, mitigating catastrophic forgetting. | The decision to mask non-assistant content in code trajectory training, while aligning with other experiments, might overlook opportunities to leverage environmental insights from these segments. |
| Reinforcement Learning (RL) includes various verifiers (math, code, complex instructions, safety, general rewards) and refined training dynamics (FP16 precision, consistent sampling) for stability. | |
| Model and code are open-source, promoting accessibility and further research. | |

---


## [mHC: Manifold-Constrained Hyper-Connections](https://arxiv.org/pdf/2512.24880)

Summary:
This paper introduces Manifold-Constrained Hyper-Connections (mHC), a novel framework designed to address the training instability and scalability issues inherent in Hyper-Connections (HC), which expand residual stream width and diversify connectivity. HC compromises the identity mapping property of residual connections, leading to signal divergence and significant memory access overhead. mHC mitigates these problems by projecting the residual connection space onto a specific manifold, specifically enforcing a doubly stochastic constraint on residual mapping matrices using the Sinkhorn-Knopp algorithm, thereby restoring stable signal propagation. Coupled with rigorous infrastructure optimizations like kernel fusion, recomputing, and communication overlapping, mHC demonstrates superior stability, scalability, and performance gains in large-scale language model training with only a marginal increase in computational overhead.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Effectively addresses numerical instability and system overhead issues of Hyper-Connections (HC) by restoring the identity mapping property. | The Sinkhorn-Knopp algorithm provides an approximate solution, leading to slight deviations from perfect theoretical stability in practice, though significantly improved over HC. |
| Utilizes a theoretically sound approach by constraining residual mappings to doubly stochastic matrices, ensuring norm preservation and compositional closure. | While the additional time overhead is low (6.7% for n=4), it is still an increase compared to the baseline residual connection. |
| Demonstrates significant performance improvements and superior training stability (loss and gradient norms) compared to both baseline and HC in large-scale language models. | The experimental validation is primarily focused on language models (LLMs) and MoE architectures, limiting direct generalization to other deep learning domains without further testing. |
| Exhibits robust scalability across various model sizes (3B, 9B, 27B) and training token counts. | The impact of varying the expansion rate `n` beyond the chosen value of 4 on performance, stability, and overhead is not thoroughly explored. |
| Incorporates rigorous infrastructure optimizations (kernel fusion, recomputing, communication overlapping) to ensure high efficiency and minimize practical overhead. | Detailed mathematical derivations for the custom backward kernel of the Sinkhorn-Knopp iteration are not provided in the main paper. |
| Provides clear quantitative and qualitative evidence (e.g., Amax Gain Magnitude, matrix visualizations) to support claims of improved propagation stability. | Mentions "in-house large-scale training experiments" without providing specific results or details, making it harder to fully verify claims beyond the presented 27B model data. |

---


## [Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem](https://arxiv.org/pdf/2512.24873)

Summary:
This paper introduces the Agentic Learning Ecosystem (ALE), a foundational infrastructure designed to streamline the end-to-end development and deployment of agentic Large Language Models (LLMs) for complex, multi-turn tasks. ALE comprises three core components: ROLL, an RL training framework; ROCK, a secure sandbox environment manager; and iFlow CLI, an agent framework for context engineering. Leveraging ALE, the authors developed ROME, an open-source agent LLM trained on over one million trajectories using a principled data composition workflow and a novel policy optimization algorithm called Interaction-Perceptive Agentic Policy Optimization (IPA). Empirical evaluations demonstrate ROME's strong performance across various agentic benchmarks, including a new rigorous benchmark called Terminal Bench Pro, often outperforming similarly sized models and rivaling larger ones, with successful deployment in production.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Provides a comprehensive, end-to-end Agentic Learning Ecosystem (ALE) integrating training (ROLL), execution (ROCK), and agent framework (iFlow CLI) for LLM agents. | Performance on the newly introduced, more rigorous Terminal Bench Pro benchmark is still limited across all evaluated models, including ROME, indicating significant remaining challenges for complex tasks. |
| ROCK offers secure, sandboxed environments with robust fault isolation, permission control, and massive-scale scheduling, addressing critical operational and security concerns. | Initial real-world deployments revealed "unanticipated—and operationally consequential—class of unsafe behaviors," such as internal network probing and cryptomining, highlighting inherent safety risks that required dedicated mitigation. |
| Employs a principled and rigorous data composition workflow, including programming-centric and safety-aligned data, with multi-stage filtering and verification to ensure high-quality training trajectories. | The comprehensive nature of ALE with its multiple interconnected components (ROLL, ROCK, iFlow CLI) and sophisticated training pipeline might present a high barrier to entry or operational complexity for external users. |
| Introduces Interaction-Perceptive Agentic Policy Optimization (IPA), a novel RL algorithm that optimizes at the "interaction chunk" level, improving training stability, credit assignment, and sample efficiency for long-horizon tasks. | Some benchmarks used for evaluation (e.g., ShopAgent, parts of TAU2-Bench) are proprietary or rely on internal data, which could limit full reproducibility or independent verification by the broader research community. |
| ROME achieves strong and competitive performance across diverse agentic benchmarks, often outperforming similarly sized models and rivaling much larger models, demonstrating "scale-breaking" capabilities. | The paper explicitly states that "In future work, we will pursue a more systematic investigation along this direction, and we call for sustained community attention to this phenomenon and to the broader agenda of AI safety" regarding the safety issues, implying that the current safety solutions are not fully comprehensive. |
| ROME has been successfully deployed in production via iFlow CLI, providing real-world validation of the ALE's practical effectiveness and robustness. | While ROME often rivals larger models, it does not consistently surpass all top-tier proprietary models on every benchmark, indicating that parameter scale still plays a significant role in ultimate performance. |
| Introduces Terminal Bench Pro, a new benchmark with improved scale, domain coverage, and contamination control for more rigorous and fine-grained evaluation of terminal agents. | |
| The "Agent Native Mode" ensures consistency in context management between training and deployment, simplifying development and guaranteeing performance alignment. | |

---


## [GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction](https://arxiv.org/pdf/2512.25073)

Summary:
GaMO (Geometry-aware Multi-view Diffusion Outpainter) addresses sparse-view 3D reconstruction by reformulating it as a multi-view outpainting problem, expanding the field of view of existing input images rather than generating new viewpoints. This approach, which employs multi-view conditioning and geometry-aware denoising in a zero-shot manner, inherently preserves geometric consistency and provides broader scene coverage, effectively mitigating common artifacts like holes, ghosting, and inconsistent geometry. Extensive experiments demonstrate GaMO achieves state-of-the-art reconstruction quality on Replica and ScanNet++ datasets, outperforming prior methods in metrics like PSNR and LPIPS, while also achieving a significant 25x speedup over state-of-the-art diffusion-based methods.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Achieves state-of-the-art reconstruction quality (PSNR, SSIM, LPIPS, FID) on Replica and ScanNet++ across 3, 6, and 9 input views. | Cannot recover occluded content invisible from all input views. |
| Provides a significant 25x speedup over state-of-the-art diffusion-based methods, with processing time under 10 minutes. | Performance depends on input view distribution, with clustered or misaligned views yielding poor results. |
| Effectively mitigates common sparse-view reconstruction artifacts such as holes, ghosting, and inconsistent geometry. | The paper suggests future work could explore adaptive outpaint scale selection, implying a current fixed scale limitation. |
| Preserves geometric consistency by expanding the field of view of existing camera poses rather than generating new viewpoints. | The paper suggests future work could explore hybrid approaches for challenging scenarios, implying current limitations in such scenarios. |
| Operates in a zero-shot manner without requiring fine-tuning of the backbone diffusion model. | |
| Demonstrates strong generalization ability to outdoor and unbounded environments (Mip-NeRF 360). | |

---


## [Scaling Open-Ended Reasoning to Predict the Future](https://arxiv.org/pdf/2512.25070)

Summary:
This paper introduces an automated, scalable pipeline to generate OpenForesight, a large-scale dataset of open-ended forecasting questions derived from daily news, specifically designed to train language models to predict future events while rigorously preventing information leakage. By fine-tuning Qwen3 models with Reinforcement Learning using a novel reward function that combines accuracy and an adapted Brier score, the resulting OpenForecaster8B model achieves performance competitive with or superior to much larger proprietary models in accuracy, calibration, and consistency on held-out test sets and external benchmarks. The research further demonstrates that this specialized forecasting training generalizes to improve calibration on standard LLM benchmarks, and all data, code, and models are open-sourced to promote further research in this challenging domain.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Automated and scalable data generation pipeline for open-ended forecasting questions, addressing data scarcity. | Data generation relies solely on news articles, introducing a distributional bias towards reported events. |
| Rigorous methodology to prevent future information leakage during data creation, training, and evaluation. | News reporting delays (e.g., scientific breakthroughs) can make some "predictions" artificially easier. |
| Novel RL reward function combining accuracy and an adapted Brier score, improving both prediction quality and calibration. | The current scope is limited to short-form, textual answers, not addressing more complex long-form forecasts. |
| OpenForecaster8B (an 8B model) achieves competitive or superior performance against much larger proprietary models. | Data generation, especially for high-quality validation and test sets, is noted to be costly. |
| Demonstrated generalization of calibration improvements to diverse, out-of-distribution LLM benchmarks. | Identified systematic failure modes (missing information, over-reliance on general knowledge, entity confusion) indicate areas for further model robustness. |
| Open-sourcing of the dataset, code, and models fosters accessibility and future research in LLM forecasting. | While showing some generalization, primary development is centered on Qwen3 models, which might require adaptation for other base models. |
| Focus on open-ended forecasting, a more challenging and impactful task than binary prediction markets. | |
| Effective integration of retrieval-augmented generation, significantly improving model accuracy. | |

---


## [GR-Dexter Technical Report](https://arxiv.org/pdf/2512.24210)

Summary:
GR-Dexter presents a holistic hardware-model-data framework for language-conditioned, long-horizon manipulation on bimanual robots equipped with 21-DoF dexterous hands. It integrates the compact ByteDexter V2 hand, an intuitive VR-based teleoperation system for efficient data collection, and a robust training recipe that combines teleoperated robot trajectories with large-scale vision-language, cross-embodiment, and human trajectory datasets. This approach enables GR-Dexter to achieve strong in-domain performance and significantly improved generalization to unseen objects, instructions, and spatial configurations in real-world dexterous manipulation tasks, addressing key challenges in scaling VLA policies to high-DoF robotic systems.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Holistic framework integrating hardware, model, and diverse data sources for dexterous manipulation. | Limited scale of human trajectory data utilized, leaving substantial untapped potential. |
| Advanced, compact 21-DoF ByteDexter V2 hand with tactile sensors and improved thumb design. | Separate control of the robot's hand and arm can hinder tight coordination in contact-rich tasks. |
| Intuitive bimanual teleoperation system (VR-based) facilitates efficient data collection for high-DoF robots. | Data collection for high-DoF dexterous systems remains challenging due to hardware availability and scarcity of skilled teleoperators. |
| Robust training recipe leverages a comprehensive mix of teleoperated, vision-language, cross-embodiment, and human trajectory data. | Significant kinematic gap and noise in human demonstration data require complex filtering and retargeting processes. |
| Demonstrates strong in-domain performance and significant generalization to unseen objects, instructions, and spatial layouts. | Co-training with VLM data can sometimes complicate optimization without additional task-specific information in in-distribution settings. |
| Addresses key challenges of high DoF, frequent occlusions, and data scarcity in dexterous manipulation. | The inherent complexity of controlling and training a 56-DoF bimanual system. |
| Effective fingertip-centric alignment method for cross-embodiment motion retargeting, preserving task-relevant contact geometry. | |

---


## [AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents](https://arxiv.org/pdf/2512.23343)

Summary:
This paper presents a comprehensive, unified survey on memory systems, bridging insights from cognitive neuroscience with the rapidly evolving field of LLM-driven autonomous agents. It systematically elucidates the definition, function, taxonomy, storage mechanisms, and complete management lifecycle of memory from both biological and artificial perspectives. Additionally, the survey reviews mainstream benchmarks for evaluating agent memory, investigates memory security from attack and defense standpoints, and envisions future research directions, particularly focusing on multimodal memory systems and skill acquisition, aiming to facilitate cross-disciplinary collaboration.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Provides a comprehensive and unified interdisciplinary review of memory systems, connecting cognitive neuroscience with LLM-driven agents. | As a survey, it primarily synthesizes existing knowledge rather than introducing novel research, empirical findings, or new theoretical models. |
| Covers a broad range of topics including definitions, functions, taxonomy, storage, management, security, and future directions, offering a holistic view. | While comprehensive, the breadth of topics might lead to some simplification of complex concepts from either cognitive neuroscience or advanced AI memory systems. |
| Systematically organizes information, progressing from human brain to LLMs and agents, and comparing biological and artificial mechanisms. | The rapid evolution of LLM-driven agents and memory systems means some aspects of the survey could quickly become outdated. |
| Addresses critical and often overlooked aspects like memory security (attacks and defenses) and provides a review of evaluation benchmarks. | The review of existing methods is more descriptive than critically evaluative, potentially lacking deep comparative analysis of their effectiveness or scalability. |
| Identifies promising future research directions, such as multimodal memory and agent skill acquisition, stimulating further innovation. | The focus is heavily on LLM-driven agents, which might limit its applicability or discussion of memory paradigms in other AI fields. |

---


## [PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation](https://arxiv.org/pdf/2512.24551)

Summary:
This paper introduces PhyGDPO, a novel framework for generating physically consistent text-to-video (T2V) content. The authors first present PhyAugPipe, a physics-augmented video data construction pipeline that leverages a vision-language model (VLM) with chain-of-thought reasoning to curate a large-scale, physics-rich dataset called PhyVidGen-135K. Building upon this, PhyGDPO formulates a Physics-aware Groupwise Direct Preference Optimization based on the Plackett–Luce probabilistic model, enabling holistic preference learning beyond simple pairwise comparisons. Key components include a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to guide optimization towards physical consistency, and a LoRA-Switch Reference (LoRA-SR) scheme that enhances training efficiency and stability by eliminating memory-heavy reference duplication. Comprehensive experiments demonstrate that PhyGDPO significantly outperforms state-of-the-art open-source methods on established benchmarks and achieves higher human preference for physical realism.

Strengths and Weaknesses:

| Strengths | Weaknesses |
| :-------- | :--------- |
| Addresses a critical and challenging problem: lack of physical consistency in T2V generation. | Relies heavily on external VLMs for data filtering, scoring, and rewarding, which could introduce dependencies and potential biases from the VLM's own capabilities. |
| Introduces a novel data construction pipeline (PhyAugPipe) to create a large, physics-rich dataset (PhyVidGen-135K). | The complexity of hyperparameter tuning for the Physics-Guided Rewarding (PGR) scheme might be a practical challenge. |
| Formulates a principled groupwise DPO framework (PhyGDPO) that captures holistic preferences using the Plackett–Luce model, moving beyond pairwise comparisons. | Despite efficiency improvements, the training process remains computationally intensive (e.g., 6 days on 8 H100 GPUs), indicating high resource requirements. |
| Incorporates explicit physics guidance into the DPO reward scheme (PGR) using a physics-aware VLM, enhancing physical plausibility. | The paper notes that VLM-based auto-evaluation tools for benchmarks "may be inaccurate," suggesting a potential limitation in the reliability of some quantitative metrics. |
| Proposes an efficient training mechanism (LoRA-SR) that significantly reduces GPU memory consumption and improves training stability by avoiding full model duplication. | While exploring "implicit physics reasoning," the PhyAugPipe still uses "explicit physics reasoning" in prompt extension, potentially blurring the distinction or implying continued reliance on explicit cues. |
| Demonstrates superior quantitative performance against state-of-the-art open-source and even some closed-source models (Sora2, Veo3.1). | The mathematical derivations for the upper bounds in the loss function might be challenging for readers without a strong background in optimization theory. |
| Validated by user studies showing higher human preference for physical realism in generated videos. | The method focuses on post-training an existing T2V model rather than proposing a new foundational T2V architecture. |
| Ablation studies clearly demonstrate the individual contributions and effectiveness of each proposed component. | |

---


## [SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time](https://arxiv.org/pdf/2512.25075)

Summary:
SpaceTimePilot is a novel video diffusion model that achieves unprecedented disentanglement of spatial and temporal factors in generative rendering, allowing for continuous and arbitrary exploration of dynamic scenes from a single monocular video. It introduces an effective animation time-embedding mechanism for explicit temporal control, a temporal-warping training scheme to augment existing datasets with diverse motion sequences, and a new synthetic Cam×Time dataset providing dense spatiotemporal supervision. Coupled with a precise source-aware camera-conditioning mechanism, SpaceTimePilot enables the generation of coherent videos with retimed motion (e.g., slow motion, reverse, bullet time) and complex camera trajectories, significantly outperforming prior state-of-the-art methods in both quantitative and qualitative evaluations.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Achieves the first fully disentangled spatial and temporal control in video generation from a single monocular video. | Heavy reliance on a custom synthetic dataset (Cam×Time) and a temporal-warping augmentation scheme, which might impact generalization to real-world scenarios not well-represented by these training strategies. |
| Enables continuous and arbitrary exploration across both camera viewpoints and motion sequences, including complex effects like bullet-time, slow motion, and reverse playback. | Likely high computational demands for training and inference, given it's a large video diffusion model (1.3B parameters), though specific resource requirements are not detailed. |
| Introduces innovative methodological components: an effective animation time-embedding, a temporal-warping training scheme, and a precise source-aware camera-conditioning mechanism. | Potential for subtle artifacts or inconsistencies in extremely complex or highly novel real-world dynamic scenes, especially when generating very long or highly divergent space-time trajectories, as suggested by the sensitivity of ablation results. |
| Develops the novel Cam×Time synthetic dataset, providing crucial dense spatiotemporal supervision for fine-grained control and disentanglement. | |
| Demonstrates superior quantitative and qualitative performance compared to state-of-the-art baselines in both temporal and camera control accuracy. | |
| Supports flexible multi-round autoregressive generation for extended video sequences and arbitrary initial camera poses. | |

---


## [Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process](https://arxiv.org/pdf/2512.23988)

Summary:
This paper introduces RISE (Reasoning behavior Interpretability via Sparse auto-Encoder), an unsupervised framework for discovering and manipulating reasoning behaviors in large language models (LLMs). RISE segments chain-of-thought traces into sentence-level steps and trains sparse auto-encoders (SAEs) on their activations to uncover disentangled "reasoning vectors" corresponding to interpretable behaviors like reflection and backtracking, as well as structural properties such as response length. The framework demonstrates that targeted interventions using these SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors during inference without retraining, and can also discover novel behaviors like confidence, leading to improved reasoning accuracy and reduced token usage.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Unsupervised discovery of reasoning behaviors overcomes limitations of human-defined concepts. | Interventions, while altering reasoning style, showed a "slight performance drop" in one case, hinting at potential trade-offs. |
| Enables controllable steering of LLM reasoning during inference without requiring retraining. | The method relies on LLM-as-a-judge for initial validation of human-interpretable behaviors, which could introduce LLM biases. |
| Facilitates the discovery of novel behaviors (e.g., confidence) beyond human supervision. | The interpretability of behaviors might decline in the very final layers due to oversmoothing. |
| Demonstrates improved reasoning accuracy and reduced token usage through targeted steering. | The discovery of "novel behaviors" is primarily demonstrated with confidence, and its broader applicability to more abstract or complex novel behaviors is not fully explored. |
| Provides theoretical justification and empirical validation for the approach's underlying assumptions. | |
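
The "targeted interventions" mentioned in the summary can be pictured as adding a scaled direction to a hidden state at inference time. The snippet below is a minimal sketch of that general idea, not the RISE implementation: the function name `steer`, the unit-normalisation step, and the scalar `alpha` are all assumptions made for illustration.

```python
import math

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalised direction to a hidden state.

    `hidden` and `direction` are plain lists of floats (hypothetical stand-ins
    for a model activation and an SAE-derived reasoning vector).
    Positive alpha amplifies the behavior, negative alpha suppresses it.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return list(hidden)  # nothing to steer along
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy 2-dimensional example with made-up numbers.
print(steer([1.0, 0.0], [0.0, 2.0], alpha=0.5))
print(steer([1.0, 0.0], [0.0, 2.0], alpha=-0.5))
```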

---


## [Pretraining Frame Preservation in Autoregressive Video Memory Compression](https://arxiv.org/pdf/2512.23851)

Summary:
This paper introduces a novel neural network framework for efficient autoregressive video memory compression, addressing the critical trade-off between context quality and length in long-form video generation. The core innovation is a pretraining objective that explicitly focuses on preserving high-frequency details of individual frames at arbitrary temporal positions, allowing a memory encoder to compress long videos (e.g., 20 seconds) into short contexts (e.g., ~5k length) while maintaining high-fidelity frame retrieval. This pretrained model can then be fine-tuned as a memory encoder for autoregressive video diffusion models, enabling long history memory with reduced context cost and improved temporal consistency in generated videos, as validated through extensive quantitative and qualitative experiments.

Strengths and Weaknesses:

| Strengths | Weaknesses |
| :-------- | :--------- |
| Effectively addresses the critical trade-off between context quality and length in autoregressive video generation. | The "error accumulation (drifting)" problem, especially for long continuous shots, is acknowledged and not fully resolved by the proposed method alone, often requiring external design choices. |
| Proposes a novel pretraining objective focused on high-fidelity frame retrieval at arbitrary temporal positions, which is a strong indicator of contextual detail preservation. | While the base model is efficient, enhancements like cross-attention or using multiple compression models increase computational costs and context length, potentially negating some efficiency gains. |
| Achieves significant compression rates (e.g., 20-second video into ~5k context) enabling efficient training and inference on consumer-level GPUs. | The method's performance is inherently tied to the capabilities of the underlying VAEs and Diffusion Transformers (DiTs) it integrates with. |
| Demonstrates improved temporal consistency, facial features, clothing, global video style, and storytelling plots in generated videos through pretraining. | The pretraining phase requires a large dataset of millions of in-the-wild videos, which might be a barrier for some researchers. |
| The pretrained memory encoder is modular and can be directly fine-tuned with existing autoregressive video models (DiTs). | The fundamental trade-off between compression rate and detail preservation still exists; achieving the highest detail (e.g., 2x2x1 compression) comes at the cost of a longer context length. |
| Comprehensive experimental evaluation, including quantitative metrics, qualitative results, and ablation studies, provides strong support for the proposed design choices. | The baseline encoder architecture, while effective, primarily uses standard components (3D convolutions, attention), with the novelty lying more in the pretraining objective and integration. |
| Explores alternative architectural enhancements like adding sliding windows, cross-attention, and combining multiple compression models for further improvements in specific scenarios. | The solution is specifically tailored for autoregressive video generation, limiting its direct applicability to other video generation paradigms. |

---


## [BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts](https://arxiv.org/pdf/2512.24885)

Summary:
This paper introduces BEDA (Belief Estimation for Dialogue Acts), a novel framework that bridges the gap between belief estimation and strategic dialogue act generation by formalizing two core acts—Adversarial and Alignment—as probabilistic constraints on what an agent may generate. BEDA consists of a world set, a belief estimator to infer an interlocutor's beliefs, and a conditional generator that selects dialogue acts and realizes utterances consistent with these inferred beliefs. Evaluated across adversarial (Conditional Keeper–Burglar), cooperative (Mutual Friends), and negotiation (CaSiNo) settings, BEDA consistently outperforms strong baselines, demonstrating significant improvements in success rates, negotiation quality, and cooperative efficiency by leveraging belief estimation to constrain and guide utterance generation.

Strengths and Weaknesses:

| Strengths | Weaknesses |
| :-------- | :--------- |
| Provides a principled and formal mechanism (probabilistic constraints) to bridge belief estimation and dialogue act generation, a gap in prior work. | The "world set" is fixed and given, limiting dynamic adaptation to evolving or open-ended dialogue scenarios. |
| Defines and operationalizes two fundamental strategic dialogue acts (Adversarial and Alignment) with clear mathematical formulations. | Belief estimator accuracy is notably lower on the more complex CaSiNo dataset compared to synthetic ones, suggesting limitations in handling highly nuanced or multi-class belief structures. |
| Demonstrates strong empirical performance across diverse strategic dialogue tasks (adversarial, cooperative, negotiation) and various LLM backbones, consistently outperforming strong baselines. | Relies on the underlying LLM's instruction-following capabilities, as evidenced by unmodified LLaMA models struggling with certain tasks even with BEDA. |
| Improves efficiency in cooperative tasks by reducing average turns and tokens, indicating more targeted and effective communication. | The framework primarily focuses on two broad dialogue act categories; finer-grained act modeling is acknowledged as future work, implying a current limitation in expressiveness. |
| Case studies suggest BEDA mitigates common LLM hallucinations like looping dialogues and irrelevant information in cooperative settings. | The ethical statement regarding aggressive utterances is general and lacks specific details on how the proposed constraint mechanism actively prevents harmful content in adversarial contexts. |
| The modular design, with a trainable belief estimator and a fixed LLM for generation, offers computational efficiency. | The practical implementation of how the conditional generator uses the probabilistic constraints to *select* events (beyond assigning equal probability to satisfying events) could be more sophisticated. |

---


## [Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems](https://arxiv.org/pdf/2512.24385)

Summary:
This paper presents a comprehensive roadmap for achieving Spatial Intelligence in autonomous systems through multi-modal data pre-training. It addresses the challenge of integrating diverse sensor data like cameras and LiDAR to create a unified environmental understanding. The authors propose a taxonomy categorizing pre-training paradigms from single-modality baselines to sophisticated unified frameworks, and investigate the role of textual inputs and occupancy representations for open-world perception and planning. The paper also identifies critical bottlenecks such as computational efficiency and scalability, ultimately outlining a future research agenda towards general-purpose multi-modal foundation models capable of robust, real-world autonomous deployment.

Strengths and Weaknesses:

| Strengths | Weaknesses |
| :-------- | :--------- |
| Provides a comprehensive and systematic review of multi-modal pre-training techniques, datasets, and applications for autonomous systems. | As a survey paper, it does not present novel technical contributions (e.g., new algorithms, models, or experimental results from the authors). |
| Introduces a clear and structured taxonomy for pre-training paradigms (single-modality, cross-modal, unified), aiding in organizing a complex research landscape. | The discussion of individual methods is necessarily high-level, requiring readers to consult original papers for deeper technical understanding. |
| Emphasizes the critical role of foundation models, including generative world models and Vision-Language-Action (VLA) models, highlighting cutting-edge research directions. | The rapid evolution of foundation models and autonomous systems means some specific details or trends discussed might quickly become outdated. |
| Offers a thorough analysis of platform-specific datasets (autonomous vehicles, drones, other robots), detailing their characteristics, evolution, and implications for pre-training. | The categorization of some methods into "LiDAR-centric," "Camera-centric," or "Unified" can sometimes be subjective or have overlapping boundaries. |
| Includes quantitative empirical analysis and benchmark performance comparisons (3D object detection, LiDAR segmentation) to validate the efficacy of different pre-training approaches. | Lacks critical evaluation of specific model architectures or a detailed comparison of their computational efficiency within each category. |
| Clearly outlines current challenges (semantic-geometric gap, data-centric bottlenecks, real-time inference) and proposes promising future research directions. | The empirical analysis relies on reported benchmark results from other papers, rather than presenting a unified experimental setup or novel comparative studies by the authors. |
| Covers the integration of auxiliary sensors like radar and event cameras, providing a holistic view beyond just camera and LiDAR. | Limited discussion on the broader ethical, safety, and societal implications of deploying highly autonomous systems powered by these advanced models. |

---


## [Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking](https://arxiv.org/pdf/2512.24297)

Summary:
FIGR is a novel framework that enhances complex reasoning, particularly in mathematics, by integrating active visual thinking into multi-turn reasoning processes through end-to-end reinforcement learning. Unlike prior methods that struggle with spatial relations, noisy image generation, or limited predefined visual operations, FIGR generates executable code to construct precise and interpretable figures, which then serve as dynamic feedback to guide its reasoning. An adaptive reward mechanism regulates when and how visual reasoning is invoked, promoting selective and effective use of visual cues without requiring supervised cold-start training. Experiments demonstrate that FIGR significantly improves performance on challenging mathematical benchmarks by enabling more stable and coherent reasoning over global structural properties.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Integrates active visual thinking by generating executable code for precise and interpretable figures, addressing limitations of direct image generation or fixed tools. | Relies on external components like a sandboxed interpreter for code execution and a separate large language model for the suitability classifier, introducing dependencies. |
| Achieves substantial performance gains on challenging mathematical reasoning benchmarks (e.g., AIME, BeyondAIME) by enforcing geometric consistency. | The complexity of end-to-end reinforcement learning with a multi-component adaptive reward mechanism can be challenging to implement and tune effectively. |
| Employs an adaptive reward mechanism that intelligently regulates when visual reasoning is invoked, promoting selective use and reducing unnecessary figure generation. | The evaluation primarily focuses on mathematical reasoning problems, leaving the generalizability of the active visual thinking approach to other domains unexplored. |
| Trains via end-to-end reinforcement learning, allowing it to learn how to integrate visual feedback without needing a supervised fine-tuning cold-start stage. | While generating executable code is a strength, the robustness and error-handling capabilities for generating correct and bug-free code for increasingly complex visual tasks are not fully detailed. |

---


## [Factorized Learning for Temporally Grounded Video-Language Models](https://arxiv.org/pdf/2512.24097)

Summary:
This paper introduces D2VLM, a novel framework for video-language models that addresses the limitations of existing approaches in temporal grounding and textual response generation by adopting a factorized learning perspective. D2VLM decouples these tasks into a "grounding then answering with evidence referencing" paradigm, utilizing specialized "evidence tokens" to explicitly capture event-level visual semantics beyond mere timestamp representation. To further enhance learning, the authors propose Factorized Preference Optimization (FPO), which integrates probabilistic temporal grounding into the optimization objective, supported by a synthetic dataset generated through factorized perturbations. Extensive experiments demonstrate that D2VLM significantly outperforms state-of-the-art methods across various video understanding tasks, often with a smaller model size.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Decouples temporal grounding and textual response learning, leading to clearer objectives and improved performance. | Still has room for improvement in F1 scores on specific tasks like episodic memory and YouCook2 dense video captioning. |
| Introduces "evidence tokens" that capture event-level visual semantics, providing crucial context for subsequent answer generation. | Factorized data generation currently focuses only on negative (dis-preferred) samples; generating diverse positive samples is left for future work. |
| Proposes Factorized Preference Optimization (FPO) which explicitly incorporates probabilistic temporal grounding modeling into the optimization objective. | Performance gap compared to some SOTA general video understanding models is attributed to the absence of large-scale generic pretraining and smaller model size. |
| Develops a novel factorized data synthesis approach for preference learning, addressing data scarcity and control issues by introducing structured perturbations. | |
| Achieves state-of-the-art performance across various video understanding tasks (E.T. Bench Grounding, Dense Captioning, Charades-STA, YouCook2). | |
| More efficient, with a 3.8B model outperforming larger SOTA methods (7B-13B) in various tasks. | |
| Demonstrates robustness to similarity threshold variations for salient frame identification during inference. | |
| The frame-wise similarity calculation for evidence tokens is computationally lightweight, taking less than 0.4 ms per token generation. | |
| Explicit consistency constraint enhances alignment between evidence grounding and answer generation stages. | |
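
The table mentions a frame-wise similarity calculation with a similarity threshold for picking salient frames. A rough sketch of how such a selection could look is given below. It assumes cosine similarity and a hypothetical threshold of 0.7; the summary does not state which similarity measure or threshold D2VLM actually uses, and `salient_frames` is an illustrative name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def salient_frames(evidence_token, frame_features, threshold=0.7):
    """Return indices of frames whose similarity to the token reaches the threshold."""
    return [
        i for i, frame in enumerate(frame_features)
        if cosine(evidence_token, frame) >= threshold
    ]

# Toy features: frame 0 matches the token, frame 1 is orthogonal, frame 2 is in between.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(salient_frames([1.0, 0.0], frames))
```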

---


## [A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers](https://arxiv.org/pdf/2512.23380)

Summary:
CoLog is a novel deep learning framework designed for detecting both point and collective anomalies in operating system logs by formulating the task as a multimodal sentiment analysis problem. It utilizes collaborative transformers with multi-head impressed attention and a modality adaptation layer to effectively encode and learn interactions between semantic and sequence log modalities, addressing data heterogeneity and balancing contributions. Extensive experiments demonstrate CoLog's superior performance, achieving mean precision, recall, and F1 scores of over 99.5% across seven benchmark datasets, and exhibiting robustness on unseen data, making it a significant advancement for cybersecurity and system monitoring.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| **Novel Multimodal Approach:** First to frame log anomaly detection as multimodal sentiment analysis, leveraging collaborative transformers, impressed attention, and a modality adaptation layer for nuanced interaction learning. | **Limited Real-time Evaluation:** Primarily assessed in batch-processing mode, with real-time performance under strict latency and computational constraints needing further exploration. |
| **Unified Anomaly Detection:** Capable of detecting both point and collective anomalies within a single, coherent framework, addressing a gap in existing methods. | **Dependency on Labeled Data:** As a supervised method, it requires adequately labeled data for training, which can be challenging and resource-intensive to obtain in real-world scenarios. |
| **Superior Performance:** Achieves state-of-the-art results with high precision, recall, and F1-scores (all over 99.5%) across diverse benchmark datasets, significantly outperforming other methods. | **Challenges with Human-Unreadable Logs:** Acknowledges difficulty in interpreting and labeling certain non-human-readable log entries, potentially impacting training data quality. |
| **Robustness and Generalizability:** Demonstrates strong performance on previously unobserved datasets and maintains high recall even with high rates of unstable log event injection, indicating adaptability. | **Adaptation to Evolving Log Structures:** Requires continuous adaptation and retraining as log formats evolve with system updates, posing a potential maintenance overhead. |
| **Effective Heterogeneity Handling:** The Modality Adaptation Layer (MAL) and balancing layer effectively manage intrinsic variations and balance contributions between different log modalities. | **Higher Inference Time (Compared to simpler DL):** While faster than some complex models, its inference time is higher than simpler deep learning models that use log keys, which could be a factor for very high-throughput systems. |
| **Open-Source Implementation:** Provides an open-source implementation on GitHub, promoting transparency, reproducibility, and further community development. | **Potential Degradation in Extreme Scenarios:** Performance might degrade in extreme class imbalance scenarios (e.g., anomaly ratio <1%) or with rapidly evolving log templates, necessitating further adaptive mechanisms. |
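
For readers unfamiliar with the metrics quoted above, the following sketch computes precision, recall, and F1 from confusion counts. The counts are invented for illustration and are not taken from the paper; they show why a low anomaly ratio (for example 1%) makes recall and F1 more informative than accuracy.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical imbalanced case: 10 anomalies among 1,000 log sequences (1%).
# A detector that flags nothing is 99% accurate but has zero recall.
print(precision_recall_f1(tp=0, fp=0, fn=10))

# A detector that finds 8 of the 10 anomalies and raises 2 false alarms.
print(precision_recall_f1(tp=8, fp=2, fn=2))
```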

---


## [Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers](https://arxiv.org/pdf/2512.22564)

Paper not available

---


## [Guiding a Diffusion Transformer with the Internal Dynamics of Itself](https://arxiv.org/pdf/2512.24176)

Summary:
This paper introduces Internal Guidance (IG), a simple yet effective strategy to enhance diffusion transformer performance by incorporating auxiliary supervision on an intermediate layer during training and extrapolating intermediate and deep layer outputs during sampling. IG significantly improves both training efficiency and generation quality, offering a plug-and-play solution that maintains diversity without requiring additional sampling steps, and is compatible with existing guidance methods like Classifier-Free Guidance (CFG). The method achieves state-of-the-art FID scores on ImageNet 256x256, demonstrating its scalability and ability to alleviate vanishing gradients.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Simple and effective plug-and-play guidance strategy. | Auxiliary supervision is primarily effective on early layers; placement on later or multiple layers can interfere with deep layer training. |
| Achieves state-of-the-art generation quality (e.g., FID 1.19). | Optimal IG coefficient and guidance interval require empirical tuning. |
| Improves training efficiency and accelerates convergence. | The proposed training acceleration method (loss modification) was not used in large-scale experiments, favoring IG's sampling flexibility. |
| Does not require additional sampling steps. | Some training instabilities were observed with the Muon optimizer in LightningDiT, occasionally requiring full precision to resolve. |
| Maintains generation diversity. | While minimal, there is a slight increase in parameters and FLOPs compared to vanilla models. |
| Compatible with existing guidance methods like CFG and guidance intervals. | |
| Addresses limitations of prior guidance methods (e.g., complex degradation, extra training). | |
| Scalable, showing greater improvements with larger models. | |
| Alleviates partial vanishing gradients during training. | |
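
The summary describes sampling as extrapolating between intermediate-layer and deep-layer outputs. One plausible reading, by analogy with Classifier-Free Guidance, is sketched below. The exact formula, the function name `internal_guidance`, and the coefficient `w` are assumptions; the summary does not give the paper's equation.

```python
def internal_guidance(deep_out, mid_out, w):
    """Extrapolate from the intermediate-layer prediction through the deep-layer one.

    Assumed form: mid + w * (deep - mid), applied element-wise.
    w = 1 returns the deep output unchanged; w > 1 pushes further
    away from the weaker intermediate prediction.
    """
    return [m + w * (d - m) for d, m in zip(deep_out, mid_out)]

# Toy predictions with made-up values.
deep = [1.0, 2.0]
mid = [0.5, 2.5]
print(internal_guidance(deep, mid, w=1.0))
print(internal_guidance(deep, mid, w=2.0))
```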

---


## [JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation](https://arxiv.org/pdf/2512.22905)

Summary:
JavisGPT is introduced as the first unified multimodal large language model (MLLM) for joint audio-video comprehension and generation. It employs an encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to integrate with a pretrained JAV-DiT generator. The model is trained using a three-stage pipeline comprising multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, supported by JavisInst-Omni, a novel 200K GPT-4o-curated instruction dataset. JavisGPT achieves state-of-the-art performance on various JAV comprehension and generation benchmarks, particularly excelling in complex and temporally synchronized scenarios, and demonstrates advanced multi-turn conversational abilities.

Strengths and Weaknesses:

| Strengths | Weaknesses |
| :-------- | :--------- |
| First unified MLLM for joint audio-video comprehension and generation, explicitly modeling spatio-temporal synchrony. | Architectural inconsistencies with misaligned training objectives and asymmetric input-output modeling limit full mutual enhancement between comprehension and generation. |
| Novel architecture with SyncFusion for efficient and effective audio-video alignment and synchrony-aware queries for JAV-DiT integration. | Scalability to larger LLMs and training data is not fully explored, potentially limiting further performance gains. |
| Robust three-stage training pipeline (MM-PreTrain, AV-FineTune, MM-InstTune) progressively builds complex multimodal capabilities. | Relies on instruction tuning for basic capabilities; further generalization and advanced reasoning would benefit from reinforcement learning. |
| Introduces JavisInst-Omni, a large-scale, high-quality, GPT-4o-curated instruction dataset for diverse and multi-level audio-video-text interactions. | Potential for misuse in generating misleading or harmful content (deepfakes) and risks of bias/privacy from training data. |
| Achieves state-of-the-art performance across diverse JAV comprehension and generation benchmarks, outperforming existing MLLMs. | While joint training improves quality and consistency, AV synchrony shows only marginal improvement, suggesting a need for more unified mechanisms. |
| Demonstrates superior instruction following, contextual reasoning, and proactive conversation capabilities in multi-turn dialogues. | Dependency on proprietary GPT-4o for large-scale dataset curation. |

---


## [Valori: A Deterministic Memory Substrate for AI Systems](https://arxiv.org/pdf/2512.22280)

Summary:
Valori is a deterministic AI memory substrate that addresses the non-reproducibility in modern AI systems caused by floating-point arithmetic in vector embeddings. It achieves this by replacing floating-point operations with fixed-point arithmetic (Q16.16) and modeling memory as a replayable state machine. This approach guarantees bit-identical memory states, snapshots, and search results across various hardware platforms, thereby enabling auditability, replayability, and safe deployment for AI, especially in safety-critical and regulated environments, while maintaining high semantic retrieval fidelity.

Strengths and Weaknesses:

| Strengths | Weaknesses |
|---|---|
| Guarantees bit-identical memory states, snapshots, and search results across diverse hardware architectures (e.g., x86, ARM, RISC-V, WASM). | Software-based fixed-point arithmetic can be slower than hardware-accelerated floating-point operations. |
| Enables full replayability, auditability, and trustworthiness for AI systems, crucial for regulated and safety-critical applications. | Q16.16 fixed-point format has limited dynamic range, potentially causing quantization errors for extreme vector components, though deemed adequate for normalized embeddings. |
| Models memory as a rigorous, replayable state machine, providing a stable and predictable foundation for AI state. | Does not address the non-determinism of neural network inference itself; determinism is enforced only *after* vectors enter the Valori kernel. |
| Uses fixed-point arithmetic (Q16.16 by default) for numerical stability and cross-platform determinism. | The demonstrated high semantic fidelity (99.8% Recall@10) is an existence proof for a specific configuration, not a universal guarantee across all models/datasets. |
| Allows configurable precision as a "memory contract," enabling trade-offs between performance and precision without sacrificing determinism. | Future work is needed to integrate SIMD integer instructions for performance and support higher precision contracts like Q32.32/Q64.64. |
| Adapts traditionally stochastic indexing structures (like HNSW) to be fully deterministic. | |
| Demonstrates negligible impact on semantic retrieval quality (99.8% Recall@10) for a common embedding model. | |
| Opens new application domains like robotics, regulatory compliance, decentralized AI, and consensus systems. | |
| Open-source reference implementation available. | |
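
To make the Q16.16 idea concrete, here is a minimal sketch of fixed-point quantisation and an integer dot product. It is not Valori's code: the helper names `to_q16_16` and `q_dot` and the saturation behavior are assumptions. Q16.16 stores a number as a signed 32-bit integer with 16 fractional bits, so every operation after quantisation is plain integer arithmetic, which is associative and therefore independent of summation order.

```python
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS  # 65536, the integer that represents 1.0

def to_q16_16(x: float) -> int:
    """Quantise a float to Q16.16, saturating at the signed 32-bit range (assumed policy)."""
    raw = round(x * SCALE)
    return max(-(1 << 31), min((1 << 31) - 1, raw))

def q_dot(u, v) -> int:
    """Dot product of two Q16.16 vectors, returned in Q16.16.

    Products are accumulated at full integer precision and shifted once
    at the end, so the result does not depend on accumulation order.
    """
    acc = 0
    for a, b in zip(u, v):
        acc += a * b
    return acc >> FRAC_BITS

u = [to_q16_16(x) for x in (0.5, 0.5)]
v = [to_q16_16(x) for x in (0.5, 0.5)]
print(q_dot(u, v), q_dot(u, v) / SCALE)
```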

---
