In [None]:
#!pip install transformers evaluate portalocker chromadb sentence-transformers rouge-score sec-edgar-downloader langchain xformers

## 14.4 Lab 7 / Case 7: Document Q&A

In this lab, we'll put together several tools we already used to extract information from a set of documents, also called "Document Q&A". We'll retrieve the latest 10-K forms filed by S&P500 top companies and search for information about their reported risks using natural language.

Here are the tickers for the top 25 companies, as of June 2023:

In [None]:
tickers = ['AAPL', 'MSFT', 'AMZN', 'NVDA', 'GOOGL', 'GOOG', 'META', 'BRK.B', 'TSLA', 'UNH', 'XOM', 'JPM',
           'JNJ', 'V', 'LLY', 'PG', 'AVGO', 'MA', 'HD', 'MRK', 'CVX', 'PEP', 'ABBV', 'KO', 'COST']

### 14.4.1 EDGAR

EDGAR is the Securities and Exchange Commission's (SEC) Electronic Data Gathering, Analysis, and Retrieval system.

"_[it] performs automated collection, validation, indexing, acceptance, and forwarding of submissions by companies and others who are required by law to file forms with the U.S. Securities and Exchange Commission (SEC). Its primary purpose is to increase the efficiency and fairness of the securities market for the benefit of investors, corporations, and the economy by accelerating the receipt, acceptance, dissemination, and analysis of time-sensitive corporate information filed with the agency._"

Source: [Important Information About EDGAR](https://www.sec.gov/edgar/searchedgar/aboutedgar.htm)

### 14.4.2 Form 10-K

In this lab, we'll be retrieving the latest 10-K form filed by the companies previously listed.

"_A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC), that gives a comprehensive summary of a company's financial performance. Although similarly named, the annual report on Form 10-K is distinct from the often glossy "annual report to shareholders," which a company must send to its shareholders when it holds an annual meeting to elect directors (though some companies combine the annual report and the 10-K into one document). The 10-K includes information such as company history, organizational structure, executive compensation, equity, subsidiaries, and audited financial statements, among other information._"

Source: [Wikipedia](https://en.wikipedia.org/wiki/Form_10-K)

We'll be paying special attention to the section ["Item 1A - Risk Factors"](https://en.wikipedia.org/wiki/Form_10-K#Item_1A_%E2%80%93_Risk_Factors), where "_...the company lays anything that could go wrong, likely external effects, possible future failures to meet obligations, and other risks disclosed to adequately warn investors and potential investors._"

### 14.4.3 Downloader

While it's possible to retrieve public information directly from EDGAR, you'd have to find the proper identification numbers of companies and filings to download the reports. It is more convenient to use a Python package that handles the nitty-gritty details for us and retrieves as many reports as we want by simply specifying the company's ticker (e.g. MSFT, GOOGL) and the type of report (e.g. 10-K). The package [`sec-edgar-downloader`](https://github.com/jadchaar/sec-edgar-downloader) does exactly that.

We can easily download the forms by creating an instance of a `Downloader` that points to the destination folder where files will be stored, and calling its `get()` method repeatedly, once for each ticker:

In [None]:
from sec_edgar_downloader import Downloader

dest_folder = "./edgar10k_sp500_top25"
dl = Downloader(dest_folder)

form = '10-K'
for ticker in tickers:
    # Retrieve only the latest filing of the given form for each ticker
    dl.get(form, ticker, amount=1, download_details=True)

Alternatively, you can download the compressed folder containing all forms (as of June 2023):

In [None]:
#!wget https://github.com/dvgodoy/assets/releases/download/dataset/edgar10k_sp500_top25.tar.gz
#!tar -xvzf edgar10k_sp500_top25.tar.gz
#!mv filings edgar10k_sp500_top25

It will create a subfolder for each ticker, each containing a folder corresponding to the downloaded form (10-K), and yet another folder named after the form's corresponding ID number. That last folder has two files: `filing-details.html` and `full-submission.txt`. We'll be reading the details file.

In [None]:
!ls -l edgar10k_sp500_top25/sec-edgar-filings/MSFT/10-K/0001564590-22-026876
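
The folder layout described above can also be walked programmatically. The sketch below is only an illustration: the helper name `find_filing_details` and the filing ID `0000000000-00-000000` are made up, and the layout is rebuilt inside a temporary directory so the snippet runs without any downloaded files.

```python
import tempfile
from pathlib import Path

def find_filing_details(root, form="10-K"):
    """Map each ticker to the list of its `filing-details.html` paths."""
    found = {}
    # Assumed layout: <root>/sec-edgar-filings/<ticker>/<form>/<filing id>/filing-details.html
    pattern = f"sec-edgar-filings/*/{form}/*/filing-details.html"
    for path in sorted(Path(root).glob(pattern)):
        ticker = path.parts[-4]  # file, filing id, form, ticker (counting from the end)
        found.setdefault(ticker, []).append(path)
    return found

# Build a throwaway copy of the layout (hypothetical filing ID) to try it out
with tempfile.TemporaryDirectory() as tmp:
    filing = Path(tmp) / "sec-edgar-filings" / "MSFT" / "10-K" / "0000000000-00-000000"
    filing.mkdir(parents=True)
    (filing / "filing-details.html").write_text("<html></html>")
    (filing / "full-submission.txt").write_text("")
    details = find_filing_details(tmp)
    tickers_found = list(details)
    n_files = len(details["MSFT"])

print(tickers_found, n_files)
```

Pointing the same helper at `./edgar10k_sp500_top25` should list one details file per downloaded ticker, assuming the downloads follow the layout above.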

### 14.4.4 Parser

The details file is a mix of HTML and XML tags, and it would be very cumbersome to parse them ourselves. Fortunately, we can easily adapt a parser function, [`parse_10k_filing()`](https://github.com/rsljr/edgarParser/blob/master/parse_10K.py) from the [edgarParser](https://github.com/rsljr/edgarParser) repository, to parse our downloaded files.

Its original docstring states:

"_The function *parse_10k_filing()* parses 10-K forms to extract the following sections: business description, business risk, and management discussioin and analysis. The function takes two arguments, a link and a number indicating the section, and returns a list with the requested sections. Current options are **0(All), 1(Business), 2(Risk), 4(MDA).**_"

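We'll be using option number two to retrieve text related to section "Item 1A - Risk Factors" only. Note that, despite what the docstring says, the adapted code below expects `3` (not `4`) for the MDA section.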

In [None]:
# Adapted from https://github.com/rsljr/edgarParser/blob/master/parse_10K.py
import re
import unicodedata
from bs4 import BeautifulSoup as bs
import requests

def parse_10k_filing(content, section):

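    if section not in [0, 1, 2, 3]:
        # Raise instead of calling sys.exit(): `sys` is never imported here,
        # and exiting the interpreter is unfriendly inside a notebook
        raise ValueError("Not a valid section: use 0 (All), 1 (Business), 2 (Risk), or 3 (MDA)")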

    def get_text(content):
        html = bs(content, "html.parser")
        text = html.get_text()
        text = unicodedata.normalize("NFKD", text).encode('ascii', 'ignore').decode('utf8')
        text = text.split("\n")
        text = " ".join(text)
        return(text)

    def extract_text(text, item_start, item_end):
        item_start = item_start
        item_end = item_end
        starts = [i.start() for i in item_start.finditer(text)]
        ends = [i.start() for i in item_end.finditer(text)]
        positions = list()
        for s in starts:
            control = 0
            for e in ends:
                if control == 0:
                    if s < e:
                        control = 1
                        positions.append([s,e])
        item_length = 0
        item_position = list()
        for p in positions:
            if (p[1]-p[0]) > item_length:
                item_length = p[1]-p[0]
                item_position = p

        item_text = text[item_position[0]:item_position[1]]

        return(item_text)

    text = get_text(content)

    if section == 1 or section == 0:
        try:
            item1_start = re.compile("item\s*[1][\.\;\:\-\_]*\s*\\b", re.IGNORECASE)
            item1_end = re.compile("item\s*1a[\.\;\:\-\_]\s*Risk|item\s*2[\.\,\;\:\-\_]\s*Prop", re.IGNORECASE)
            businessText = extract_text(text, item1_start, item1_end)
        except:
            businessText = "Something went wrong!"

    if section == 2 or section == 0:
        try:
            item1a_start = re.compile("(?<!,\s)item\s*1a[\.\;\:\-\_]\s*Risk", re.IGNORECASE)
            item1a_end = re.compile("item\s*2[\.\;\:\-\_]\s*Prop|item\s*[1][\.\;\:\-\_]*\s*\\b", re.IGNORECASE)
            riskText = extract_text(text, item1a_start, item1a_end)
        except:
            riskText = "Something went wrong!"

    if section == 3 or section == 0:
        try:
            item7_start = re.compile("item\s*[7][\.\;\:\-\_]*\s*\\bM", re.IGNORECASE)
            item7_end = re.compile("item\s*7a[\.\;\:\-\_]\sQuanti|item\s*8[\.\,\;\:\-\_]\s*", re.IGNORECASE)
            mdaText = extract_text(text, item7_start, item7_end)
        except:
            mdaText = "Something went wrong!"

    if section == 0:
        data = [businessText, riskText, mdaText]
    elif section == 1:
        data = [businessText]
    elif section == 2:
        data = [riskText]
    elif section == 3:
        data = [mdaText]
    return(data)

Let's parse the latest 10-K form filed by Microsoft (as of June 2023):

In [None]:
with open('./edgar10k_sp500_top25/sec-edgar-filings/MSFT/10-K/0001564590-22-026876/filing-details.html', 'r',
          encoding='utf-8') as f:
    html = f.read()

res = parse_10k_filing(html, 2)[0]
len(res)

That's about 70,000 characters. We need to split it into more manageable chunks.

### 14.4.5 Chunking Strategies

There is no right or wrong answer to how you should split the text into chunks. It depends on a series of factors such as the type of text you're dealing with (long reports or short tweets, for example), the model you're using to embed the text and the nature of your queries (more on that later), and limitations in size (models typically have a maximum input length as we've already seen).

For more details, check the ["Chunking Strategies for LLM Applications"](https://www.pinecone.io/learn/chunking-strategies/) blog post.

Having said that, it's possible to chunk your text using a fixed-length or a content-aware approach. Let's take a quick look at some of them.

#### 14.4.5.1 Fixed-Length

Fixed-length approaches split the text into equal-length chunks (e.g. 300 characters/words/tokens) with or without some overlap between them. [Langchain](https://github.com/hwchase17/langchain) is a Python package that has grown a lot in popularity in the last few months and that allows you to integrate different tools (e.g. embedding models, vector databases) into a workflow. It also offers a convenient way of [splitting text into chunks](https://python.langchain.com/docs/modules/data_connection/document_transformers/#get-started-with-text-splitters) of fixed length:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=20)
docs = text_splitter.create_documents([res])
docs[:3]
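
To make the roles of chunk size and overlap concrete, here is a minimal character-based sketch. It is not how Langchain's splitter is implemented (that one tries to break at separators such as line breaks and spaces first); the function name `fixed_length_chunks` is an assumption made for this illustration.

```python
def fixed_length_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into chunks of at most `chunk_size` characters, each one
    starting `chunk_size - chunk_overlap` characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_length_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is exactly what the overlap is for: content cut at a boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.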

#### 14.4.5.2 Content-Aware

The problem with fixed-length is that the chunks will most certainly end mid-sentence. The overlap may help mitigate this issue, but it won't work for long sentences. That's when the content-aware approach comes in. We can split it by sentences or paragraphs using indications in the text's structure. We could, for example, naively split the text into sentences using the period (.) as an indication of the end of a sentence. What about exclamation and question marks?
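
As a quick illustration of the naive approach and its limits, the sketch below splits after a period, exclamation mark, or question mark whenever it is followed by whitespace. The helper name `naive_sentences` and the sample text are made up for this example.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Risks are rising. Will demand hold? The U.S. market is volatile!"
sentences = naive_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

The abbreviation "U.S." is wrongly treated as the end of a sentence, so the last sentence is cut in two. Cases like this are why a dedicated sentence tokenizer is preferable.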

Fortunately, sentence tokenizing is a very well-known problem, and the traditional [Natural Language Toolkit (NLTK)](https://www.nltk.org/) package has a sentence tokenizer available. We only need to download the `punkt` tokenizer package and it will be ready to be used:

In [None]:
import nltk
nltk.download('punkt')

In [None]:
from nltk.tokenize import sent_tokenize

docs = sent_tokenize(res)
docs[:3]

As you can see, each element in the list of documents is a single sentence.

Sometimes, as in the case of our 10-K form filed by Microsoft, there's some other indication of the text's structure: it looks like paragraphs are separated by a sequence of two or more spaces.

Let's try it out:

In [None]:
docs = res.split('  ')
docs[:3]

Looks good, these are definitely paragraphs. Unfortunately, this may not be the case for every document: in some 10-K forms, there's no clear indication of a paragraph, and you'll need to rely on a different chunking strategy to move forward. For now, we're sticking with this particular 10-K form, and we'll proceed using paragraphs as chunks.

If we look at the full list, though, we'll see that there are many empty lines as well as really short ones that are likely section headers. We can discard these chunks that are too short (say, less than 10 characters long).

In [None]:
# Keep only chunks longer than 10 characters, stripping surrounding whitespace
paragraphs = [s.strip() for s in res.split('  ') if len(s) > 10]
len(paragraphs)

We got 88 paragraphs. Let's take a look at one of those paragraphs:

In [None]:
text = paragraphs[1]
text

How many words is that?

In [None]:
len(text.split())

Perhaps we can make it shorter?

### 14.4.6 Summarization

Load a pretrained summarization pipeline from HuggingFace and use it to summarize the text above. Try different minimum and maximum lengths and observe the resulting summaries. How do they compare to the original text?

In [None]:
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

summarizer = ...

In [None]:
summarizer(text, max_length=50, min_length=20)

Summarizing text is great, but we may be doing it prematurely at this point. Instead of summarizing individual paragraphs (or other chunks of text), it may be more interesting to find (full) paragraphs of interest first, and only then summarize them as a whole.

If we're doing document Q&A, we need to query our documents (paragraphs) and find those that are more likely to contain the answer, that is, those more closely related to the topic of our query.

How can we search for similar documents?

### 14.4.7 Embeddings

You already know how to search for similar documents. You need to embed them first!

Use the `sentence-transformers` package to load a pretrained model for sentence embeddings (e.g. `all-MiniLM-L12-v2`) and embed every paragraph of text from Microsoft's 10-K form.

In [None]:
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = ...

In [None]:
embeddings = ...
embeddings.shape

### 14.4.8 Searching

There are two alternatives to search for similar embeddings, and we have tried them both already: PyTorch's own cosine similarity, and vector databases such as FAISS or ChromaDB.

At this point, let's keep it as simple as it can be, and stick with cosine similarity. Create an instance of the cosine similarity layer and use it to find five paragraphs that are most similar to the query below (don't forget to embed the query as well):

In [None]:
query = "what are the sources of uncertainties?"

In [None]:
import torch.nn as nn

similarity = ...

In [None]:
# Embed the query and make it a tensor
q = ...
content = torch.as_tensor(embeddings)

In [None]:
# Compute the cosine similarity between query and content
# and get the top 5 results
similarities = ...
most = ...

You should get a list of five indices corresponding to the paragraphs that are most relevant to our query.
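
For intuition about what the similarity search computes, here is a plain-Python sketch on made-up three-dimensional vectors. It does not use PyTorch's cosine similarity layer or `topk()`; the helper names `cosine_similarity` and `top_k` and the toy vectors are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k):
    """Return the indices of the k most similar vectors, plus all scores."""
    scores = [cosine_similarity(query, doc) for doc in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Toy "embeddings": four documents and one query
toy_corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
toy_query = [1.0, 0.0, 0.0]

best, toy_scores = top_k(toy_query, toy_corpus, k=2)
print(best, [round(s, 3) for s in toy_scores])
```

The indices come back ordered by decreasing similarity, not by their position in the corpus.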

### 14.4.9 Context

Now, join all the paragraphs together as a single piece of text. This is going to be what is referred to as the "context". Notice that the indices may be ordered according to their similarity to the query. However, it's probably a good idea to order them as they appear in the text instead.

In [None]:
context = ...
print(context)

The context should contain the relevant information to answer our query, and it is one of the arguments you need to pass to a question answering pipeline.
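
The ordering choice can be sketched on toy data: sorting the matched indices restores the original flow of the text before joining. The paragraphs and indices below are made up for illustration.

```python
toy_paragraphs = ["First paragraph.", "Second paragraph.",
                  "Third paragraph.", "Fourth paragraph."]
toy_most = [2, 0, 3]  # hypothetical indices, ordered by decreasing similarity

# Sort the indices so the paragraphs appear in document order, then join them
toy_context = "\n".join(toy_paragraphs[i] for i in sorted(toy_most))
print(toy_context)
```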

### 14.4.10 Question Answering

We have a question, "_what are the sources of uncertainties?_" and we have a context, five paragraphs from our text that are the most similar to the question. That's everything you need to try a question answering pipeline!

Create an instance of Q&A pipeline, and call it using its `question` and `context` arguments:

In [None]:
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

qa_model = ...

In [None]:
query = "what are the sources of uncertainties?"

qa_model(question=query, context=context)

The Q&A model is good at answering questions that are extractive in nature and can be easily pinpointed in the text. It gives you back the start and end positions in the text that contain the answer to your question.
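
To see what those positions mean, the sketch below uses a hand-written result shaped like the dictionary a question answering pipeline returns (`score`, `start`, `end`, `answer`). The context, the score, and the answer are all hypothetical, not produced by a model.

```python
qa_context = "Our business faces risks from competition and from changes in regulation."

# Hand-written stand-in for a pipeline result (values are made up)
start = qa_context.index("competition")
end = start + len("competition")
toy_result = {"score": 0.42, "start": start, "end": end, "answer": "competition"}

# The answer is simply the slice of the context between `start` and `end`
extracted = qa_context[toy_result["start"]:toy_result["end"]]
print(extracted)
```

Because the answer is always a slice of the context, an extractive model cannot rephrase or combine information from several paragraphs.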

It may be technically correct, but perhaps it's a bit too short, right?

In theory, the context should contain the information relevant to our query. But it is too verbose and it doesn't read well: after all, it is just a sequence of paragraphs patched together. One way of trying to make it look more like an answer is to summarize it.

Use the summarization pipeline you already created to summarize the context above. Make sure the minimum and maximum length are appropriate given the original length of the context.

In [None]:
summary = ...
summary

How do you like the summary? Does it look like an answer to our question? Could it have been better? Pause and ponder for a while: what could you do to get a better answer or, better yet, to get a better context back?

### 14.4.11 Hallucinations

Hallucinations have a bad reputation. The term refers to generated text that's completely fictional while being portrayed as factual in an answer given by a generative model such as ChatGPT.

But hallucinations may be useful if handled with care. You see, one of the issues with the Q&A is that we're searching for paragraphs that are most similar to our question. Wouldn't it make more sense to search for paragraphs that are most similar to an answer instead? Well, yes, but if you knew the answer already, you wouldn't need to ask the question, right?

But no one said it had to be a real or factual answer! We may very well use a fake, made-up, completely hallucinated answer instead. If it looks like an answer, it sounds like an answer, and it has the structure of an answer, it IS an answer as far as semantic search is concerned.

Our original question was "_what are the sources of uncertainties?_", and we're looking for an answer in 10-K filings of S&P 500 companies, Microsoft in this case. So, what if we rephrase our question as "_what uncertainties are likely faced by an S&P500 company?_" and ask a Large Language Model to come up with an answer?

We'll be doing that later in the course but, for now, assume that the answer generated by the LLM is the following:

In [None]:
hallucinated_answer = """
    As an S&P 500 company, uncertainies such as market fluctuations, changes in customer demand,
    and new competitors entering the market can be challenging to handle. Additionally, companies
    must also be prepared to adapt to changing trends and technological innovations, while ensuring
    that their products and services remain competitive and high-quality. Furthermore, companies
    must carefully evaluate their target market to remain relevant and generate revenue.
"""

It surely does look like an answer, right? Now, make this made-up answer your "question" and:
- get its embeddings
- use the embeddings to search for similar paragraphs in the text
- use the search results to build a context
- summarize the context

In [None]:
# Embed the hallucinated answer and make it a tensor
q = ...

# Compute the cosine similarity between query and content
# and get the top 5 results
similarities = ...
most = ...

# Concatenate the corresponding paragraphs together
hallucination_context = ...
hallucination_context

In [None]:
hallucination_summary = summarizer(hallucination_context, min_length=100, max_length=250)
hallucination_summary

How do you like the new summary? Is it better than the previous one?

### 14.4.12 Asymmetric Semantic Search

If you think that resorting to "hallucinated" answers may be impractical, there's an alternative: asymmetric semantic search. As it turns out, the model we've been using to embed our sentences and paragraphs is better suited for assessing similarities of sentences and paragraphs of similar length. It worked well to look for nuclear plant-related news in the AG News Dataset, and it is a good fit for clustering pieces of text together, for example.

However, as we already pointed out, our question is much shorter than any of the paragraphs, a typical case of asymmetric semantic search that asks for a different model to produce embeddings. For more details about recommended models for both [symmetric](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models) and [asymmetric](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) search, check the Sentence Transformers documentation.

In the `sentence_transformers` package, pretrained models for asymmetric search are based on [MSMARCO](https://microsoft.github.io/msmarco/), "_a large scale information retrieval corpus that was created based on real user search queries using Bing search engine._" Let's create an instance of one of these models.

Choose a model from the list of [asymmetric](https://www.sbert.net/docs/pretrained-models/msmarco-v3.html) models, and make sure you pick one that's fine-tuned for cosine similarity, since it is the similarity metric we're using:

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

asymmetric_model = ...

Remember that you cannot mix and match embeddings, so you need to embed the paragraphs once again using the new model first:

In [None]:
asymmetric_embeddings = ...

So far, you've been using PyTorch's cosine similarity layer and its `topk()` function to retrieve the most similar documents. Now, you will use Sentence Transformers' own [`semantic_search()`](https://www.sbert.net/examples/applications/semantic-search/README.html#util-semantic-search) helper function. It takes the following arguments:
- `query_embeddings`: your embedded question
- `corpus_embeddings`: your embedded paragraphs
- `top_k`: how many results it should return

In [None]:
from sentence_transformers.util import semantic_search

query = "what are the sources of uncertainties?"
# Get the embeddings for the query above using the asymmetric model
asymmetric_query = ...

# Use the semantic_search helper function to get a list of the top 5 results
similarities = ...
similarities

Notice that each result is a dictionary, so you have to unpack the paragraph indices (`corpus_id`):

In [None]:
most = [s['corpus_id'] for s in similarities]
most
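
The helper returns one list of hits per query, and each hit is a dictionary with `corpus_id` and `score` keys. The hand-written structure below mimics that shape for a single query; the IDs and scores are made up for illustration.

```python
# Hand-written stand-in for the helper's return value: one inner list per query
toy_hits = [[
    {"corpus_id": 12, "score": 0.61},
    {"corpus_id": 3, "score": 0.58},
    {"corpus_id": 40, "score": 0.55},
]]

first_query_hits = toy_hits[0]  # results for the first (and only) query
toy_most = [hit["corpus_id"] for hit in first_query_hits]
toy_scores = [hit["score"] for hit in first_query_hits]
print(toy_most, toy_scores)
```

With a single query, remember to take the first element of the returned list before unpacking the indices.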

Now, it is business as usual: assemble the context by joining the matched paragraphs and submit it to the summarization pipeline.

In [None]:
asymmetric_context = ...
asymmetric_context

In [None]:
# We're moving the summarizer to the CPU for this, please don't remove this code
device = summarizer.device
summarizer.model.to('cpu')
summarizer.device = torch.device('cpu')

asymmetric_summary = summarizer(asymmetric_context, min_length=100, max_length=250)
asymmetric_summary

Did you get an error by any chance? Good! It means that your context is too long for the summarization pipeline you're using. The maximum length is available in the configuration of the model that powers the pipeline, and you can retrieve it like this:

In [None]:
max_len = summarizer.model.config.max_position_embeddings
max_len

Now, you have to double-check how long (in tokens) your context is. Remember that there are usually more tokens than words in a document. The summarization pipeline also contains the tokenizer it uses to preprocess its inputs. Do you remember which method from the tokenizer can be used to encode a sentence into the corresponding list of token indices? Use this method to get a list of token indices for your context, and see how long it is:

In [None]:
# Tip: set the argument `add_special_tokens=False` to count the real number of tokens
token_list = ...
len(token_list)

For curiosity's sake, let's compare the number of words with the number of tokens:

In [None]:
len(asymmetric_context.split())/len(token_list)

The ratio should be around 0.8, meaning that 1,000 tokens roughly correspond to 800 words. That's the rule of thumb you can use if you're using a paid API since they charge by number of tokens, not words.
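
The rule of thumb can be turned into a back-of-the-envelope estimate. The sketch below assumes the 0.8 words-per-token ratio quoted above, which depends on the tokenizer and the text; the helper name `estimate_tokens` is made up, and the 1,024-token limit is used only as an example.

```python
WORDS_PER_TOKEN = 0.8  # rule of thumb; the actual ratio varies with tokenizer and text

def estimate_tokens(n_words, words_per_token=WORDS_PER_TOKEN):
    """Rough token count for a text with `n_words` words."""
    return round(n_words / words_per_token)

estimated = estimate_tokens(800)
fits = estimate_tokens(900) <= 1024  # example limit of 1,024 tokens
print(estimated, fits)
```

An estimate like this is only useful for a quick sanity check; counting the actual tokens with the tokenizer remains the reliable way to know whether a text fits.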

#### 14.4.12.1 Trimmed Context

Back to our issue: if the number of tokens in your context is higher than what the model can take (1,024 in our case), you need to make the context shorter. One alternative is to simply trim it at exactly 1,022 tokens, that is, two fewer than the maximum length, to account for the special tokens at both the start and the end of the input. You can trim the token list itself and use the tokenizer's `decode()` method to turn it back into text:

In [None]:
trimmed_context = summarizer.tokenizer.decode(token_list[:(max_len-2)])
trimmed_context

In [None]:
# We can bring the summarizer back to its original device now
# (`device` was saved before moving the summarizer to the CPU)
summarizer.model.to(device)
summarizer.device = device

trimmed_summary = summarizer(trimmed_context, min_length=100, max_length=250)
trimmed_summary

#### 14.4.12.2 Unsorted Context

Until now, we've been sorting the matched documents by their indices to preserve the original flow of the text before summarizing. Trimming the context, however, may lead to the most relevant parts being removed from it, so it may be worth trying to keep the documents sorted by their decreasing matching score instead:

In [None]:
unsorted_context = '\n'.join([paragraphs[i] for i in most])
unsorted_context
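
The effect of the ordering on trimming can be seen on toy data. In the sketch below, whitespace-separated words stand in for tokens, and the paragraphs, the ranking, and the budget are all made up for illustration.

```python
def trim_words(text, budget):
    """Keep only the first `budget` words (a stand-in for trimming tokens)."""
    return " ".join(text.split()[:budget])

toy_paragraphs = {0: "alpha " * 6, 1: "beta " * 6, 2: "gamma " * 6}
toy_most = [2, 0, 1]  # hypothetical ranking: paragraph 2 is the best match
budget = 8

by_position = " ".join(toy_paragraphs[i].strip() for i in sorted(toy_most))
by_score = " ".join(toy_paragraphs[i].strip() for i in toy_most)

kept_by_position = trim_words(by_position, budget)
kept_by_score = trim_words(by_score, budget)
print(kept_by_position)
print(kept_by_score)
```

When the context is ordered by position, the best-matching paragraph (the "gamma" one) is trimmed away entirely; ordering by score keeps it whole, at the cost of breaking the original flow of the text.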

In [None]:
# Get the token list of the unsorted context above
token_list = ...
# Trim the token list and use the `decode()` method to get text back
trimmed_unsorted_context = ...
trimmed_unsorted_summary = summarizer(trimmed_unsorted_context, min_length=100, max_length=250)
trimmed_unsorted_summary

### 14.4.13 ROUGE Score

How can you tell if the summary is good or not? ROUGE, which stands for Recall-Oriented Understudy for Gist Evaluation, is a metric used to compare an automatically produced summary against a reference (a human-produced high-quality summary). We're not going into its details here, but the general idea is to compute precision and recall for matching N-grams. A 1-gram is just a word, a 2-gram is a pair of consecutive words, and so on, as illustrated in the figure below:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch14/ngrams.png)

For example, the ROUGE-1 score between "_nice to meet you here_" and "_nice to meet you now_" is 0.8 precision and 0.8 recall because they have 4 out of 5 words (1-grams) in common. Let's use a Python package, [ROUGE Scorer](https://github.com/google-research/google-research/tree/master/rouge), to try it out:

In [None]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [None]:
scorer.score('nice to meet you here', 'nice to meet you now')

For 2-grams, it is 0.75 because it matches 3 out of 4 pairs of words (2-grams): "_nice to_", "_to meet_", and "_meet you_" are matches, but "_you here_" and "_you now_" aren't. Precision and recall are computed using the number of N-grams in the prediction and the reference/target, respectively, in the denominator.

There's also the ROUGE-L metric, which is based on the longest common subsequence (LCS) of words found in both texts (not necessarily consecutive, but in the same order). ROUGE-L precision is the length of the LCS over the number of unigrams in the prediction, and ROUGE-L recall is the length of the LCS over the number of unigrams in the reference/target. In our example, both values are 0.8 because the LCS is four words long ("_nice to meet you_") and both sentences are five words long.
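
The numbers for the example sentences can be reproduced by hand with a simplified sketch. It splits on whitespace and skips the stemming and tokenization that the `rouge_score` package applies, so it illustrates the idea rather than reimplementing the package; the helper names are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(target, prediction, n):
    """Return (precision, recall) based on overlapping n-gram counts."""
    t = Counter(ngrams(target.split(), n))
    p = Counter(ngrams(prediction.split(), n))
    overlap = sum((t & p).values())
    return overlap / sum(p.values()), overlap / sum(t.values())

def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

target, prediction = "nice to meet you here", "nice to meet you now"
rouge1 = rouge_n(target, prediction, 1)
rouge2 = rouge_n(target, prediction, 2)
lcs = lcs_length(target.split(), prediction.split())
rouge_l = (lcs / len(prediction.split()), lcs / len(target.split()))
print(rouge1, rouge2, rouge_l)
```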

We don't have a reference or target here because there's no high-quality human-produced summary to compare to. However, we can use the full context as target, and look at the precision metric to have a rough idea of the commonalities between the generated summary and the original text. You'll notice that recall values are quite low, but that's due to the fact that the original text is much longer, and its length is used in the denominator for computing recall metrics in ROUGE.
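
The asymmetry between precision and recall is easy to see on a toy pair where the "summary" is just the beginning of a longer "context". The texts and the helper name `unigram_precision_recall` are made up, and the computation is a simplified whitespace-based version of ROUGE-1.

```python
from collections import Counter

def unigram_precision_recall(target, prediction):
    """Simplified ROUGE-1: overlap over prediction length and over target length."""
    t, p = Counter(target.split()), Counter(prediction.split())
    overlap = sum((t & p).values())
    return overlap / sum(p.values()), overlap / sum(t.values())

long_context = ("the company faces risks from competition regulation "
                "security breaches and supply constraints")
short_summary = "the company faces risks from competition"

precision, recall = unigram_precision_recall(long_context, short_summary)
print(precision, recall)
```

Every word of the summary appears in the context, so precision is perfect, while recall is capped by the length of the context. The longer the context relative to the summary, the lower the recall.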

Here are the scores for every pair of context and summary you generated in this lab:

In [None]:
scores = scorer.score(context,
                      summary[0]['summary_text'].strip())
scores

In [None]:
scores = scorer.score(hallucination_context,
                      hallucination_summary[0]['summary_text'].strip())
scores

In [None]:
scores = scorer.score(trimmed_context,
                      trimmed_summary[0]['summary_text'].strip())
scores

In [None]:
scores = scorer.score(trimmed_unsorted_context,
                      trimmed_unsorted_summary[0]['summary_text'].strip())
scores