# This is to demonstrate the core logic for the project

This is to demonstrate the core logic for the project

In [2]:
import time
import os
from dotenv import load_dotenv
import math

from langchain.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader

# from langchain_community.llms import HuggingFaceHub
from langchain_community.llms import HuggingFaceEndpoint

# CHANGE the path otherwise script will fail
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

True

## 1. Select the model

You may use any model of your choice including commercial ones. 

NOTE:

HuggingFace has limit on number of calls you can make per hour. With large PDF documents there is a potential to run out of quota. For testing and experimentation use small PDF documents.

In [3]:
# https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
# Context window = 32K
model_id = 'mistralai/Mistral-7B-Instruct-v0.2'
CONTEXT_WINDOW_SIZE=32000
MAX_TOKENS=2000

# Create the client to LLM
llm = HuggingFaceEndpoint(
        repo_id=model_id, 
        max_new_tokens=MAX_TOKENS
)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to C:\Users\raj\.cache\huggingface\token
Login successful


## 2. Load the PDF & print statistcis

In [10]:
# Sample pdf
# Roughly 788K
pdf_link = "https://sgp.fas.org/crs/misc/R47644.pdf"

loader = PyPDFLoader(pdf_link)
pages = loader.load()
page_count = len(pages)
print("Number of pages: ", page_count)

# Total size of all pages
size = 0 
for page in pages:
    size = size + len(page.page_content)
print("Total content size = ", size)
print("Number of chunks = ", math.ceil(size/CONTEXT_WINDOW_SIZE), " with context size = ", CONTEXT_WINDOW_SIZE)

Number of pages:  15
Total content size =  55594
Number of chunks =  2  with context size =  32000


## 3. Define a template

ADJUST the prompt as each model behaves differently.


In [5]:
template = """
    extend the abstractive summary below with the new content. Keep total size of the extended summary around 3000 words.

    summary: 
    {summary}

    new content:
    {content}

    extended summary:
    
"""

prompt_template = PromptTemplate(
    input_variables = ['summary', 'content'],
    template = template
)

test_template = prompt_template.format(summary='partial summary from the previous pages', content='new content from a set of pages')
print(test_template)


    extend the abstractive summary below with the new content. Keep total size of the extended summary around 3000 words.

    summary: 
    partial summary from the previous pages

    new content:
    new content from a set of pages

    extended summary:
    



## 4. Summarization logic

Create the summary incrementally as all of PDF content may not fit in the context window of the model. 

* Create a chunk by concatenating a set of pages such that *len(partial_summary) + len(new_content) + MAX_TOKENS < CONTEXT_WINDOW_SIZE*
* Using LLM extend the summary with the new chunk
* Continue the process till all pages are included in the summary

In [6]:
# Each chunk should be such that: 
# len(partial_summary) + len(new_content) + MAX_TOKENS < CONTEXT_WINDOW_SIZE

# Holds partial summary
partial_summary = ''



# Index of the first page in the chunk
next_page_index = 0

print("Total pages to process : ", page_count)
# Create the chunk, extend the summary with the chunk
while next_page_index < len(pages):
    print('Processing chunk, starting with page index : ',next_page_index)

    # Holds the chunk = a set of contenated pages
    new_content = ''
    
    # Loop to create chunk 
    for i, doc in enumerate(pages[next_page_index : ]):
        last_i = i
        if len(partial_summary) + len(new_content) + len(doc.page_content) + MAX_TOKENS < CONTEXT_WINDOW_SIZE :
            new_content = new_content + doc.page_content
        else:
            break
            
    # Initialize the new content and next page index
    next_page_index = next_page_index + last_i + 1
        
    # Pass the current summary and new content to LLM for summarization
    query = prompt_template.format(summary=partial_summary, content=new_content)
    partial_summary = llm.invoke(query)


Total pages to process :  15
Processing chunk, starting with page index :  0
Processing chunk, starting with page index :  9


In [7]:
print(partial_summary)

Artificial Intelligence: Overview, Recent Advances, and Considerations for the 118th Congress

Congressional Research Service
August 4, 2023
R47644

Summary:
Artificial intelligence (AI) is a term used to describe computerized systems that work and react in ways commonly thought to require intelligence, such as learning, problem-solving, and decision-making under uncertain and varying conditions. AI technologies have been in development since the 1950s, with recent advances driven by the availability of large datasets, improved machine learning algorithms, and more powerful computers. The widespread availability of AI tools, such as generative AI models like ChatGPT, has renewed debate about appropriate uses and guardrails for AI, particularly in the areas of health care, education, and national security. This report provides an overview of AI, recent advances, benefits and potential risks, and current federal laws addressing AI.

Background and History:
The concept of AI was first int