
Install relevant libraries

In [None]:
!pip install google-cloud-aiplatform==1.38.0



**Initialize Vertex AI**
---

In [None]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./nlmatics-1f123f7481dd.json"

In [None]:
import vertexai
vertexai.init(project="nlmatics", location="us-central1")

In [None]:
from vertexai.preview.generative_models import GenerativeModel

def run_prompt(prompt_text, context_text):
  """Send the instruction and the context to Gemini as one prompt; return the first candidate's text."""
  model = GenerativeModel("gemini-pro")
  response = model.generate_content(f"{prompt_text}\n{context_text}", stream=False)
  return response.candidates[0].content.parts[0].text

def summarize_table(table_text):
  return run_prompt("You are a financial analyst and you are required to summarize the key insights of given numerical tables.", table_text)

def summarize_article(article_text):
  return run_prompt("Provide a brief summary for the following article:", article_text)


In [None]:
!pip install llmsherpa

Collecting llmsherpa
  Downloading llmsherpa-0.1.3-py3-none-any.whl (12 kB)
Installing collected packages: llmsherpa
Successfully installed llmsherpa-0.1.3


Reading Google's 10-K Document using LLM Sherpa LayoutPDFReader

In [None]:
# pdf_url = "https://d18rn0p25nwr6d.cloudfront.net/CIK-0000320193/faab4555-c69b-438a-aaf7-e09305f87ca3.pdf"
from llmsherpa.readers import LayoutPDFReader
pdf_url = "https://www.abc.xyz/assets/9a/bd/838c917c4b4ab21f94e84c3c2c65/goog-10-k-q4-2022.pdf"
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
# pdf_url = "https://arxiv.org/pdf/1910.13461.pdf" # also allowed is a file path e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)


**Find and Summarize a table**
---

In [None]:
from IPython.display import display, HTML

# Find the first table whose parent section mentions the cash flow statement.
my_table = None
for table in doc.tables():
  if "CONSOLIDATED STATEMENTS OF CASH" in table.parent_text():
    my_table = table
    break

# To inspect the table itself: display(HTML(my_table.to_html()))
if my_table is not None:
  print(summarize_table(my_table.to_text()))
else:
  print("No cash flow table found")

- The company's net income saw a significant increase in 2021, rising by 89% from $40,269 million to $76,033 million. However, it declined by 21% to $59,972 million in 2022.

- Net cash provided by operating activities improved from $65,124 million in 2020 to $91,652 million in 2021, but slightly decreased to $91,495 million in 2022.

- Investing activities continued to consume cash, with net cash used increasing from $32,773 million in 2020 to $35,523 million in 2021, and further to $20,298 million in 2022. This change was primarily driven by a decrease in purchases of marketable securities.

- Net cash used in financing activities increased significantly from $24,408 million in 2020 to $61,362 million in 2021, mainly due to a surge in stock repurchases and a decrease in proceeds from debt issuance. It further increased to $69,757 million in 2022.

- The company's cash and cash equivalents experienced fluctuations, decreasing from $26,465 million at the end of 2020 to $20,945 million 

**Find and Summarize Section**
---

In [None]:
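selected_section = None
for section in doc.sections():
  if section.title == "Risks Specific to our Company":
    selected_section = section
    break

if selected_section is not None:
  # A bare HTML(...) renders only as the last expression in a cell, so display it explicitly.
  display(HTML(selected_section.to_html(include_children=True, recurse=True)))
  print(run_prompt("Read this and summarize key risks:", selected_section.to_text(include_children=True, recurse=True)))
else:
  print("Section not found")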

- Reduced spending by advertisers, loss of partners, or new technologies that block ads online could harm revenue.


- Termination of contracts by advertisers, publishers, content providers at any time could harm revenue.


- Changes to advertising policies and data privacy practices could impact advertising availability.


- Economic conditions could affect advertising spending, harming revenue.


- Intense competition requires constant innovation to remain competitive, which may not always be successful.


- New technologies and competitors may offer better products/services, reducing usage of Google's products.


- Ongoing investment in new businesses, products, services, and technologies may divert management attention and harm financial performance.


- Revenue growth rate could decline over time due to various factors, impacting profitability.


- Inability to protect intellectual property rights could reduce the value of products/services and affect competitiveness.


- Failure 

**RAG and Semantic Search**
---
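The search in this section ranks chunks by the raw dot product between the query embedding and each chunk embedding. That equals cosine similarity only when the vectors have unit length. The notebook does not state whether the embedding model returns normalized vectors, so the sketch below normalizes explicitly. `top_k_by_cosine` and the toy vectors are hypothetical and exist only for illustration.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k=3):
    """Return (indices, scores) of the k rows of doc_matrix closest to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Scale every vector to unit length so the dot product is the cosine.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    # argsort sorts ascending, so negate the scores to get the best first.
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

# Toy 2-D "embeddings": the third row is long but points away from the query.
toy_docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
indices, scores = top_k_by_cosine([1.0, 0.1], toy_docs, k=2)
print(indices)
```

With these toy vectors a raw dot product would rank the long third row first, while the normalized version ranks the first row first, which is the row that points in nearly the same direction as the query.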



In [None]:
from vertexai.language_models import TextEmbeddingModel
pdf2_url = "https://static.googleusercontent.com/media/about.google/en//belonging/diversity-annual-report/2023/static/pdfs/google_2023_diversity_annual_report.pdf?cachebust=2943cac"
doc2 = pdf_reader.read_pdf(pdf2_url)

def text_embedding() -> list:
    """Text embedding with a Large Language Model."""
    model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
    embeddings = model.get_embeddings(["What is life?"])
    for embedding in embeddings:
        vector = embedding.values
        print(f"Length of Embedding Vector: {len(vector)}")
    return vector

# Collect the text of each chunk together with its section context.
contexts = []
for chunk in doc2.chunks():
  contexts.append(chunk.to_context_text())

# Embed the first 250 chunks; only these chunks are searchable later.
embeddings_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
embeddings = embeddings_model.get_embeddings(contexts[0:250])

# text_embedding()

In [None]:
import numpy as np
emb_values = []
for embedding in embeddings:
  emb_values.append(embedding.values)
# emb_values[0]
doc_emb = np.asarray(emb_values)
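
The embedding cell sends up to 250 chunks in a single `get_embeddings` call. Embedding endpoints often cap the number of texts per request, and the notebook does not state the cap for this model, so the sketch below splits the work into batches. `embed_in_batches`, `fake_embed`, and the batch size of 5 are all assumptions made for illustration.

```python
def embed_in_batches(texts, embed_fn, batch_size=5):
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # embed_fn must return one vector per input text, in order.
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    # Stand-in for a real embedding call: one 1-D vector per text.
    return [[float(len(text))] for text in batch]

demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo_vectors)
```

In this notebook, `embed_fn` could be a small wrapper such as `lambda batch: [e.values for e in embeddings_model.get_embeddings(batch)]`, which would also remove the need for the separate loop that extracts `embedding.values`.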

In [None]:
import numpy as np
def ask(query):
  # Embed the query with the same model used for the document chunks
  query_emb = np.asarray(embeddings_model.get_embeddings([query])[0].values)
  # Score every chunk by its dot product with the query embedding
  scores = np.dot(query_emb, doc_emb.T)
  # Sort chunk indices from highest to lowest score
  max_idx = np.argsort(-scores)
  most_relevant_contexts = []
  top_k = 20

  # Keep only the top-scoring chunks so the prompt sent to Gemini stays small
  for idx in max_idx[0:top_k]:
    most_relevant_contexts.append(contexts[idx])

  # Ask Gemini to synthesize an answer from the retrieved passages
  passages = "\n".join(most_relevant_contexts)
  synthesized_answer = run_prompt(f"Read this and answer question: {query}", passages)

  print(f"Query: {query}")
  print(f"Answer: {synthesized_answer}")
  # print("\nRelevant contexts: \n")
  # for ctx in most_relevant_contexts:
  #     print(ctx)
  #     print("--------")

ask("what did they do for diversity")

Query: what did they do for diversity
Answer: 1. In Our Products:
   - They built health equity into Search and YouTube to ensure that health information is accessible and accurate for everyone.
   - They worked on Project Relate and Tackling Health Equity through Information Quality (THE-IQ) to provide seed funding and expertise to organizations working to improve health outcomes for marginalized communities.
   - They continued to take steps to curb harassment on YouTube and made the platform safer for underrepresented communities.


2. In Our Workplace:
   - They deepened their anti-racism education efforts globally by launching programs to educate employees on racial equity, especially in the context of their regions.
   - They expanded two of their newer Employee Resource Groups (ERGs) globally - the Parents and Caregivers ERG and the Mixed@Google ERG - to support employees in parenting and multi-racial/multi-ethnic roles.


3. In Society:
   - They platformed opportunities for Bl