## Part 1: Read a file

In [1]:
from langchain_community.document_loaders.pdf import PyPDFLoader

file_path = "hacking_prices.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

docs

[Document(page_content='Journal of Financial  Economics  143 (2022) 1162–1184  \nContents  lists available  at ScienceDirect  \nJournal  of Financial  Economics  \njournal  homepage:  www.elsevier.com/locate/jfec  \nPrice  revelation  from  insider  trading:  Evidence  from  hacked  \nearnings  news  /p82 \nPat Akey a , Vincent  Grégoire  b , ∗, Charles  Martineau  a \na University  of Toronto, 105 St. George Street, Toronto, Ontario, Canada, M5S 3E6 \nb HEC Montréal,  30 0 0 chemin de la Côte-Sainte-Catherine,  Montréal,  Québec, Canada, H3T 2A7 \na r t i c l e i n f o \nArticle history: \nReceived  20 October 2021 \nRevised 21 December  2021 \nAccepted  21 December  2021 \nAvailable  online 15 January 2022 \nJEL classiﬁcation:  \nG10 \nG12 \nG14 \nKeywords:  \nCyber risks \nEarnings  announcements  \nHard and soft information  \nInformed  trading \nLiquidity  \nMachine  learning \nMarket microstructure  \nPrice discovery  a b s t r a c t \nFrom 2010 to 2015, a group of traders illega

## Part 2: Summarize the document

In [2]:
# Prompt
from langchain_core.prompts import PromptTemplate

prompt_template = """Write a long summary of the following document. 
Only include information that is part of the document. 
Do not include your own opinion or analysis.

Document:
"{document}"
Summary:"""
prompt = PromptTemplate.from_template(prompt_template)

In [3]:
# Define LLM Chain

from langchain_openai import ChatOpenAI
from langchain.chains.llm import LLMChain

llm = ChatOpenAI(
    temperature=0.1,
    model_name="mixtral:latest",
    api_key="ollama",
    base_url="http://localhost:11434/v1",
)
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [4]:
# Create full chain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

stuff_chain = StuffDocumentsChain(
    llm_chain=llm_chain, document_variable_name="document"
)
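
The "stuff" strategy is simple enough to sketch in plain Python, which may help show what the chain sends to the model. This is a minimal illustration, not LangChain's implementation: `Page` and `stuff_prompt` are hypothetical names, and the blank-line separator is an assumption about how pages are joined.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str


def stuff_prompt(pages, template, variable_name="document", separator="\n\n"):
    """Join every page into one string and substitute it into the template."""
    joined = separator.join(page.page_content for page in pages)
    return template.format(**{variable_name: joined})


# Shortened version of the summary prompt, for illustration only
template = 'Document:\n"{document}"\nSummary:'
pages = [Page("First page."), Page("Second page.")]
print(stuff_prompt(pages, template))
```

Because all pages end up in a single prompt, the whole document has to fit in the model's context window in one call.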

In [5]:
result = stuff_chain.invoke(docs)
print(result["output_text"])

 This paper examines whether informed trading activities, such as insider trading and institutional trading, can predict post-earnings announcement drift (PEAD). The authors find that informed trading activities are positively related to PEAD, suggesting that these trades contain valuable information about future stock returns. They also find that the relation between informed trading and PEAD is stronger for stocks with higher levels of information asymmetry, such as those with lower institutional ownership or higher bid-ask spreads. The results suggest that informed traders are able to extract private information from earnings announcements and use it to earn abnormal returns.


In [6]:
from textwrap import fill

print(fill(result["output_text"]))

 This paper examines whether informed trading activities, such as
insider trading and institutional trading, can predict post-earnings
announcement drift (PEAD). The authors find that informed trading
activities are positively related to PEAD, suggesting that these
trades contain valuable information about future stock returns. They
also find that the relation between informed trading and PEAD is
stronger for stocks with higher levels of information asymmetry, such
as those with lower institutional ownership or higher bid-ask spreads.
The results suggest that informed traders are able to extract private
information from earnings announcements and use it to earn abnormal
returns.


In [7]:
# Invoke again on fewer pages: docs[:-3] drops the last three pages
result = stuff_chain.invoke(docs[:-3])
print(fill(result["output_text"]))

 The paper examines the impact of hacked newswire services on informed
trading and stock prices. It finds that hacked firms experience a
significant increase in effective spreads, which is driven by an
increase in realized spreads rather than price impacts. However, there
is no evidence of higher absolute order imbalance or quoted spreads
for these firms. The paper also suggests that liquidity providers may
be adjusting quotes to manage inventory risk associated with large
buy/sell pressure. A placebo test using morning trades shows no
significant differences in informed trading measures, further
supporting the findings.


## Part 3: Query a document with MapReduce

A common way to query a document is retrieval-augmented generation (RAG): split the document into chunks, generate an embedding for each chunk, store the chunks and their embeddings in a database, retrieve the chunks whose embeddings are closest to the query's (semantic search), and pass those chunks to the model to build the answer.
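
For comparison, the retrieval step of RAG can be sketched with a toy example. Everything here is an assumption made for illustration: word counts stand in for a real embedding model, a Python list stands in for the database, and `embed`, `cosine`, `split_into_chunks`, and `retrieve` are hypothetical helper names.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real pipeline uses an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_into_chunks(text, size=8):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "The sample uses trade and quote data from TAQ.",
    "The authors thank the editor and referees for comments.",
]
print(retrieve("which data does the sample use", chunks))
```

In a full pipeline, the retrieved chunks would then be inserted into a prompt, much like the stuffing step in Part 2.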

Here I take a simpler but more processing-intensive approach: MapReduce. The model reads the document one page at a time and extracts from each page the information relevant to the question (the map step). The per-page results are then combined to build the final answer (the reduce step).
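
The control flow can be sketched without any LLM. In this minimal sketch, `fake_map` and `fake_reduce` are hypothetical stand-ins for the two model calls, and the explicit "not relevant" filter is an addition for illustration; in the chain built below, the reduce prompt receives all partial answers.

```python
def map_reduce_query(pages, query, map_fn, reduce_fn):
    """Apply map_fn to each page, drop irrelevant results, then reduce."""
    partial_answers = [map_fn(page, query) for page in pages]
    relevant = [a for a in partial_answers if a.strip().lower() != "not relevant"]
    return reduce_fn(relevant, query)


def fake_map(page, query):
    """Stand-in for the map LLM call: keep pages that mention 'data'."""
    return page if "data" in page.lower() else "not relevant"


def fake_reduce(answers, query):
    """Stand-in for the reduce LLM call: concatenate the partial answers."""
    return " ".join(answers)


pages = ["Intro text.", "We use TAQ data.", "Data from Compustat."]
print(map_reduce_query(pages, "What is the data?", fake_map, fake_reduce))
```

The map step makes one model call per page, which is where the extra processing cost comes from.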

In [8]:
user_query = "What is the data used in this analysis?"

In [10]:
# Map part: Applied to each page
map_template = """The following is a set of documents
{docs}
Based on this list of documents, please identify the information that is most relevant to the following query:
{user_query} 
If the document is not relevant, please write "not relevant".
Helpful Answer:"""
map_prompt = PromptTemplate.from_template(map_template)
map_prompt = map_prompt.partial(user_query=user_query)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

In [11]:
# Reduce part: Applied to the list of results
reduce_template = """The following is a set of partial answers to a user query:
{docs}
Take these and distill them into a final, consolidated answer to the following query:
{user_query} 
Complete Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_prompt = reduce_prompt.partial(user_query=user_query)

In [12]:
# Full chain
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain


reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain, document_variable_name="docs"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is the final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # Used to collapse the documents first if they exceed the context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)
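
The role of `token_max` can be illustrated with a rough sketch of the collapse step: partial answers are packed into groups that fit the budget, each group is combined, and the process repeats until everything fits in one final call. This is an approximation under stated assumptions, not LangChain's code: word counts stand in for the model's token counter, and `count_words`, `group_by_budget`, `collapse_until_fits`, and `first_words` are hypothetical names.

```python
def count_words(text):
    """Stand-in for a token counter (assumption: one word is one token)."""
    return len(text.split())


def group_by_budget(texts, token_max, length_fn=count_words):
    """Greedily pack consecutive texts into groups that stay within token_max."""
    groups, current, used = [], [], 0
    for text in texts:
        n = length_fn(text)
        if current and used + n > token_max:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        groups.append(current)
    return groups


def collapse_until_fits(texts, token_max, combine_fn, length_fn=count_words):
    """Combine groups repeatedly until the texts fit, then combine once more."""
    while len(texts) > 1 and sum(length_fn(t) for t in texts) > token_max:
        groups = group_by_budget(texts, token_max, length_fn)
        if len(groups) == len(texts):
            break  # No progress possible; stop rather than loop forever.
        texts = [combine_fn(group) for group in groups]
    return combine_fn(texts)


def first_words(group):
    """Hypothetical 'summarizer': keep the first word of each text."""
    return " ".join(text.split()[0] for text in group)


texts = ["a b c", "d e", "f g h i", "j"]
print(group_by_budget(texts, token_max=5))
print(collapse_until_fits(texts, token_max=5, combine_fn=first_words))
```

In the notebook, the same `StuffDocumentsChain` is used for both the collapse and the final combine, which corresponds to passing a single `combine_fn` here.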

In [13]:
result = map_reduce_chain.invoke(docs[:-3])
print(fill(result["output_text"]))

 The data used in this analysis includes stock prices, trading volume,
and order imbalance measures for a sample of U.S. firms that
experienced a newswire hack between 2010 and 2014. Specifically, the
authors use minute-level trade and quote data from the Trade and Quote
(TAQ) database, which is maintained by the New York Stock Exchange.
They also use firm-level financial data from Compustat and measures of
media coverage from Factiva. The sample includes all U.S. common
stocks listed on the NYSE, NASDAQ, or AMEX exchanges during the study
period. Not all variables are available for all firms and time
periods, resulting in an unbalanced panel. Additionally, the analysis
uses information from legal documents of SEC prosecutions, newswire
servers, and a set of control variables such as log market
capitalization, fraction of shares held by institutional investors,
natural logarithm of number of analysts, natural logarithm of newswire
news in the quarter leading to the announcement, daily 