# STORM
### Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking

STORM is a research assistant that extends the idea of 'outline-driven RAG' for richer article generation.

It applies two main insights to produce more organized and comprehensive articles:
1. Creating an outline (planning) by querying similar topics helps improve coverage.
2. Multi-perspective, grounded (in search) conversation simulation helps increase the reference count and information density.

#### Overview
1. Generate initial outline + survey related subjects
2. Identify distinct perspectives
3. "Interview subject matter experts" (role-playing LLMs)
4. Refine outline (using references)
5. Write sections, then write article

The expert interview stage is a conversation between the role-playing article writer and a research expert. The "expert" can query external knowledge and respond to pointed questions, saving cited sources to a vectorstore so that the later refinement stages can synthesize the full article.

Hyperparameters to restrict the potentially infinite research breadth:
- N: number of perspectives to survey
- M: maximum number of conversation turns
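
As a rough illustration of how these two limits bound the work, the sketch below runs at most N interviews of at most M turns each. The `ask` and `answer` callables are hypothetical stand-ins for the role-playing writer and expert LLM calls, and the "thank you" stop phrase is an assumption made for this example, not something defined in this notebook.

```python
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (question, answer)


def run_interviews(
    perspectives: List[str],
    ask: Callable[[str, List[Turn]], str],
    answer: Callable[[str], str],
    n_perspectives: int = 3,  # N: number of perspectives to survey
    max_turns: int = 5,       # M: max conversation turns per interview
) -> Dict[str, List[Turn]]:
    transcripts: Dict[str, List[Turn]] = {}
    # Only the first N perspectives are interviewed.
    for persona in perspectives[:n_perspectives]:
        history: List[Turn] = []
        # Each interview stops after M turns, or earlier if the writer signs off.
        for _ in range(max_turns):
            question = ask(persona, history)
            if 'thank you' in question.lower():  # assumed stop phrase
                break
            history.append((question, answer(question)))
        transcripts[persona] = history
    return transcripts


# Stub callables so the sketch runs without any LLM.
def fake_ask(persona: str, history: List[Turn]) -> str:
    if len(history) >= 2:
        return 'Thank you!'
    return f'{persona} question {len(history) + 1}'


def fake_answer(question: str) -> str:
    return f'Answer to: {question}'


transcripts = run_interviews(
    ['analyst', 'economist', 'engineer', 'lawyer'],
    fake_ask,
    fake_answer,
    n_perspectives=3,
    max_turns=5,
)
```

With these stubs, the fourth perspective is never interviewed and every interview ends once the writer signs off, so the total number of expert calls is at most N × M.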

In [4]:
from dotenv import load_dotenv
load_dotenv()

True

### Select LLMs
We will use a faster LLM for most of the work, and a slower, long-context model to distill the conversations and write the final report.

In [5]:
from langchain_openai import ChatOpenAI

FAST_LLM = ChatOpenAI(model='gpt-3.5-turbo')
GOOD_LLM = ChatOpenAI(model='gpt-4o')

### Generate Initial Outline
For many topics, your LLM may already have an idea of the important and related subtopics. We can generate an initial outline now and refine it after the research stage. Below, we use our 'fast' LLM to generate the outline.

In [15]:
from typing import List, Optional
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_community.retrievers import WikipediaRetriever
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables import chain as as_runnable

In [7]:
direct_gen_outline_prompt = ChatPromptTemplate.from_messages(
    [
        (
            'system',
            'You are an experienced equity research analyst. Write an outline for an analysis report about a user-provided topic. Be comprehensive and specific.'
        ),
        ('user', '{topic}')
    ]
)


class Subsection(BaseModel):
    subsection_title: str = Field(..., title='Title of the subsection')
    description: str = Field(..., title='Content of the subsection')

    @property
    def as_str(self) -> str:
        return f'### {self.subsection_title}\n\n{self.description}'.strip()
    

class Section(BaseModel):
    section_title: str = Field(..., title='Title of the section')
    description: str = Field(..., title='Content of the section')
    subsections: Optional[List[Subsection]] = Field(default=None, title='Titles and descriptions for each subsection of the analysis report.')

    @property
    def as_str(self) -> str:
        subsections = '\n\n'.join(f'### {subsection.subsection_title}\n\n{subsection.description}' for subsection in self.subsections or [])
        return f'## {self.section_title}\n\n{self.description}\n\n{subsections}'.strip()
    

class Outline(BaseModel):
    page_title: str = Field(..., title='Title of the Report')
    sections: List[Section] = Field(
        default_factory=list,
        title='Titles and descriptions for each section of the Report page.'
    )

    @property
    def as_str(self) -> str:
        sections = '\n\n'.join(section.as_str for section in self.sections)
        return f'# {self.page_title}\n\n{sections}'.strip()
    
generate_outline_direct = direct_gen_outline_prompt | FAST_LLM.with_structured_output(Outline)

In [8]:
example_topic = 'Impact on silicon industry due to the China Taiwan war'

In [9]:
initial_outline = generate_outline_direct.invoke({'topic': example_topic})

In [11]:
print(initial_outline.as_str)

# Impact on Silicon Industry Due to the China-Taiwan War

## Introduction

Brief overview of the China-Taiwan conflict and its potential implications on the silicon industry.

## Current State of the Silicon Industry

Overview of the global silicon industry, key players, market size, and growth trends.

## Key Suppliers in Taiwan

Analysis of Taiwan's role in the silicon industry, including key suppliers and their market share.

## Impact of Conflict on Supply Chain

Discussion on how the China-Taiwan conflict could disrupt the silicon supply chain and production.

## Geopolitical Factors

Analysis of geopolitical factors influencing the silicon industry amidst the China-Taiwan conflict.

## Market Reactions

Evaluation of the market reactions to the conflict, including stock price movements and investor sentiment.

## Potential Scenarios and Risks

Assessment of potential scenarios and risks for the silicon industry in the event of escalating tensions.
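
To make the nesting behind this rendered outline concrete, here is a dependency-free sketch in which plain dataclasses stand in for the Pydantic `Subsection`, `Section`, and `Outline` models defined earlier. The sample titles and descriptions are made up for illustration; the `as_str` logic mirrors the notebook's.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subsection:
    subsection_title: str
    description: str

    @property
    def as_str(self) -> str:
        return f'### {self.subsection_title}\n\n{self.description}'.strip()


@dataclass
class Section:
    section_title: str
    description: str
    subsections: List[Subsection] = field(default_factory=list)

    @property
    def as_str(self) -> str:
        subsections = '\n\n'.join(s.as_str for s in self.subsections)
        # strip() drops the trailing blank lines when there are no subsections
        return f'## {self.section_title}\n\n{self.description}\n\n{subsections}'.strip()


@dataclass
class Outline:
    page_title: str
    sections: List[Section] = field(default_factory=list)

    @property
    def as_str(self) -> str:
        sections = '\n\n'.join(s.as_str for s in self.sections)
        return f'# {self.page_title}\n\n{sections}'.strip()


demo = Outline(
    page_title='Demo Report',
    sections=[
        Section('Introduction', 'Scope of the report.'),
        Section(
            'Supply Chain',
            'Where disruption may occur.',
            [Subsection('Foundries', 'Fabrication capacity.')],
        ),
    ],
)
rendered = demo.as_str
print(rendered)
```

Each level maps to one markdown heading depth (`#`, `##`, `###`), which is why the generated outline above contains only `##` sections: the model returned no subsections for this topic.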


### Expand topics
While language models do store some Wikipedia-like knowledge in their parameters, you will get better results by incorporating relevant and recent information using a search engine.

We will start our search by generating a list of related topics, sourced from Wikipedia.

In [12]:
gen_related_topics_prompt = ChatPromptTemplate.from_template(
    '''I'm writing a Wikipedia page for a topic mentioned below. Please identify and recommend some Wikipedia pages on closely related subjects.
    I'm looking for examples that provide insights into interesting aspects commonly associated with this topic, or examples that help me understand the typical content and structure included in Wikipedia pages for similar topics.
    Please list as many subjects and urls as you can.

    Topic of interest: {topic}''')

class RelatedSubjects(BaseModel):
    topics: List[str] = Field(description='Comprehensive list of related subjects as background research')

expand_chain = gen_related_topics_prompt | FAST_LLM.with_structured_output(RelatedSubjects)

In [13]:
related_subjects = await expand_chain.ainvoke({'topic': example_topic})
related_subjects

RelatedSubjects(topics=['Silicon Industry', 'China Taiwan Conflict', 'Economic Impact', 'Global Supply Chain', 'Technology Industry', 'Trade Relations', 'Geopolitics'])

### Generate Perspectives

From these related subjects, we can select representative Wikipedia editors as 'subject matter experts' with distinct backgrounds and affiliations. These will help distribute the research process to encourage a more well-rounded final report.

In [14]:
class Editor(BaseModel):
    affiliation: str = Field(description='Primary affiliation of the editor.')
    name: str = Field(description='Name of the editor', pattern=r'^[a-zA-Z0-9_-]{1,64}$')
    role: str = Field(description='Role of the editor in the context of the topic.')
    description: str = Field(description='Description of the editor\'s focus, concerns and motives')

    @property
    def persona(self) -> str:
        return f'Name: {self.name}\nRole: {self.role}\nAffiliation: {self.affiliation}\nDescription: {self.description}\n'
    

class Perspectives(BaseModel):
    editors: List[Editor] = Field(description='Comprehensive list of editors with their roles and affiliations.')


gen_perspectives_prompt = ChatPromptTemplate.from_messages(
    [
        (
            'system',
            '''You need to select a diverse (and distinct) group of Wikipedia editors who will work together to create a comprehensive article on the topic. Each of them represents a different perspective, role, or affiliation related to this topic. 
            You can use other Wikipedia pages of related topics for inspiration. For each editor, add a description of what they will focus on. 
            
            Wiki pages outlines of related topics for inspiration: 
            {examples}''',
        ),
        (
            'user',
            'Topic of interest: {topic}'
        )
    ]
)

gen_perspectives_chain = gen_perspectives_prompt | FAST_LLM.with_structured_output(Perspectives)

In [18]:
wikipedia_retriever = WikipediaRetriever(load_all_available_meta=True, top_k_results=1)

def format_doc(doc, max_length=1000):
    # Render the page's categories as a bulleted list, one per line
    related = '\n'.join(f'- {category}' for category in doc.metadata['categories'])
    return f"### {doc.metadata['title']}\n\nSummary: {doc.page_content}\n\nRelated\n{related}"[:max_length]

def format_docs(docs):
    return '\n\n'.join(format_doc(doc) for doc in docs)

@as_runnable
async def survey_subjects(topic: str):
    related_subjects = await expand_chain.ainvoke({'topic': topic})
    retrieved_docs = await wikipedia_retriever.abatch(related_subjects.topics, return_exceptions=True)
    all_docs = []
    for docs in retrieved_docs:
        if isinstance(docs, BaseException):
            continue
        all_docs.extend(docs)
    formatted = format_docs(all_docs)
    return await gen_perspectives_chain.ainvoke({'examples': formatted, 'topic': topic})
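
`survey_subjects` passes `return_exceptions=True` to `abatch` so that one failed Wikipedia lookup does not abort the whole survey; failures come back as exception objects and are skipped. The same skip-on-failure pattern can be sketched with only the standard library. Here `fake_retrieve` is a hypothetical stand-in for the retriever, and `asyncio.gather` plays the role of `abatch`.

```python
import asyncio
from typing import List


async def fake_retrieve(topic: str) -> List[str]:
    # Hypothetical retriever: fails for one topic, returns one doc otherwise.
    if topic == 'Unknown Topic':
        raise LookupError(f'no page found for {topic!r}')
    return [f'doc about {topic}']


async def retrieve_all(topics: List[str]) -> List[str]:
    # With return_exceptions=True, failures are returned in place
    # instead of being raised, so the other lookups still complete.
    results = await asyncio.gather(
        *(fake_retrieve(topic) for topic in topics),
        return_exceptions=True,
    )
    all_docs: List[str] = []
    for docs in results:
        if isinstance(docs, BaseException):
            continue  # skip the failed lookup
        all_docs.extend(docs)
    return all_docs


docs = asyncio.run(
    retrieve_all(['Geopolitics', 'Unknown Topic', 'Global Supply Chain'])
)
```

Inside a notebook, where an event loop is already running, call `await retrieve_all(...)` directly instead of `asyncio.run(...)`.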

In [19]:
perspectives = await survey_subjects.ainvoke(example_topic)

In [20]:
perspectives.dict()

{'editors': [{'affiliation': 'Silicon Valley Association',
   'name': 'TechGuru123',
   'role': 'Tech Expert',
   'description': 'TechGuru123 will focus on how the China-Taiwan conflict impacts the supply chain, manufacturing, and technological advancements in the Silicon Valley region.'},
  {'affiliation': 'Global Economics Institute',
   'name': 'EconExpert456',
   'role': 'Economic Analyst',
   'description': "EconExpert456 will analyze the financial implications of the China-Taiwan conflict on the Silicon Valley's economy, investments, and market trends."},
  {'affiliation': 'Environmental Research Foundation',
   'name': 'EnviroWatcher789',
   'role': 'Environmental Analyst',
   'description': 'EnviroWatcher789 will investigate the environmental consequences of the China-Taiwan conflict on the semiconductor industry in Silicon Valley.'}]}
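
Each editor's `persona` property turns an entry like the ones above into a text block that can be placed in a prompt. A minimal sketch, with a dataclass standing in for the Pydantic `Editor` model; the field values are taken from the first editor above, except the description, which is abbreviated here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Editor:
    affiliation: str
    name: str
    role: str
    description: str

    @property
    def persona(self) -> str:
        # Same format string as the notebook's Editor.persona
        return (
            f'Name: {self.name}\nRole: {self.role}\n'
            f'Affiliation: {self.affiliation}\nDescription: {self.description}\n'
        )


editor = Editor(
    affiliation='Silicon Valley Association',
    name='TechGuru123',
    role='Tech Expert',
    description='Supply chain and manufacturing impacts.',  # abbreviated
)
persona = editor.persona
print(persona)
```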