<a href="https://colab.research.google.com/github/athapa785/LLM_4_Biz_Stanford/blob/main/aditya_thapa_llm4biz_homework_1_compare_texts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Aditya Thapa](https://www.linkedin.com/in/aditya-thapa-56442b11b/)
SLAC National Accelerator Laboratory | Stanford University

# Homework 1
## LLM for Biz with Python
Stanford Continuing Education
TECH 16


This project compares and summarizes texts using OpenAI's GPT-3.5 model (`gpt-3.5-turbo`). It can:

> **Extract text from various sources:** It extracts text from web pages (using *extract_webpage_text*) and from PDF files given as a local path or a URL (using *extract_pdf_text*). It also accepts direct text input.
>
>**Summarize text:** The *summarize* function generates a concise summary of a given text using the GPT-3.5 model.
>
>**Compare texts:** The core functionality is the *compare_texts* function. It takes multiple text inputs, summarizes each one individually, and then asks GPT-3.5 to compare the summaries and highlight significant differences between the texts.
>
>**Token management:** It counts tokens (*num_tokens_from_string*) so that input text can be trimmed to stay within the OpenAI API's token limits.

In short, the project analyzes and compares information from different sources (web pages, PDFs, or user-provided text) using a large language model.

In [97]:
# Initialize key and client

from openai import OpenAI
from google.colab import userdata

open_ai_key = userdata.get('open_ai_key')

client = OpenAI(api_key=open_ai_key)

In [100]:
!pip install PyPDF2
!pip install tiktoken
!pip install validators



In [101]:
import PyPDF2 as PyPDF
from bs4 import BeautifulSoup
import requests
import tiktoken
import validators
import textwrap

In [102]:
def chat(sys_message, user_message):
    """
    Send a system message and a user message to the OpenAI GPT-3.5 model and return its response.

    Parameters:
    sys_message (str): Instructions that set the model's behavior.
    user_message (str): A string containing the user's message to the AI.

    Returns:
    str: The text response generated by the GPT-3.5 model.
    """

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"{sys_message}"},
            {"role": "user", "content": f"{user_message}"}
        ]
    )

    text_only = response.choices[0].message.content
    return text_only

In [103]:
def extract_webpage_text(url):
  """
  Extracts text from a webpage.

  Parameters:
  url (str): The URL of the webpage.

  Returns:
  str: The extracted text from the webpage.
  """
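  # A timeout avoids hanging on an unresponsive server
  response = requests.get(url, timeout=30)
  # Raise an error for 4xx/5xx responses instead of summarizing an error page
  response.raise_for_status()
  soup = BeautifulSoup(response.text, 'html.parser')
  text = soup.get_text()
  return text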
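
`soup.get_text()` usually returns the page text with a lot of leftover whitespace (blank lines and indentation), and those characters still count toward the token budget. As a minimal sketch, assuming a hypothetical helper named `normalize_whitespace` that is not part of the original notebook, the extracted text could be tidied before it is summarized:

```python
def normalize_whitespace(text):
    """Strip each line and drop empty ones (hypothetical helper, not in the notebook)."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "  Quantum computing \n\n\n   uses qubits.  \n"
print(normalize_whitespace(raw))
```

It could be applied to the return value of `extract_webpage_text` before the text is passed on for comparison.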

In [104]:
# helper function to check if arg is a url or not

def is_url(arg):
  """
  Checks if a given argument is a URL.

  Parameters:
  arg (str): The argument to be checked.

  Returns:
  bool: True if the argument is a URL, False otherwise.

  """

  # validators.url returns True for a valid URL and a falsy failure object otherwise
  return bool(validators.url(arg))
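
The helper above exists so that a single argument can be treated either as a URL or as a local file path. For comparison, here is a rough sketch of the same kind of check using only the standard library. The name `looks_like_http_url` is hypothetical, and this check is looser than `validators.url`: it only confirms an `http` or `https` scheme and a non-empty host.

```python
from urllib.parse import urlparse


def looks_like_http_url(arg):
    """Rough URL check using only the standard library (hypothetical alternative to is_url)."""
    if not isinstance(arg, str):
        return False
    try:
        parsed = urlparse(arg)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. unbalanced brackets in the host
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_http_url("https://cs269q.stanford.edu/lectures/lecture1.pdf"))
print(looks_like_http_url("lecture1.pdf"))
```

The first call prints `True` and the second prints `False`, so a bare filename is treated as a local path.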


In [105]:
def extract_pdf_text(pdf_path):
  """
  Extracts text from a PDF file.

  Parameters:
  pdf_path (str): The path to the PDF file.

  Returns:
  str: The extracted text from the PDF.
  """

  # If given a URL, download the PDF to a temporary local file first
  if is_url(pdf_path):
    response = requests.get(pdf_path)
    with open("temp.pdf", "wb") as f:
      f.write(response.content)
    pdf_path = "temp.pdf"

  # Open file and extract text page by page
  with open(pdf_path, 'rb') as pdf_file:
    pdf_reader = PyPDF.PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
      # Guard against pages with no extractable text
      text += page.extract_text() or ""
    return text

In [106]:
def num_tokens_from_string(string, encoding_name="cl100k_base"):
  """Returns the number of tokens in a text string."""
  encoding = tiktoken.get_encoding(encoding_name)
  num_tokens = len(encoding.encode(string))
  return num_tokens

In [107]:
def summarize(text, max_tokens=2048):
  """
  Summarizes a given text using OpenAI GPT-3.5, limiting the input text size.

  Parameters:
  text (str): The text to be summarized.
  max_tokens (int): The maximum number of tokens allowed for the input text.

  Returns:
  str: The summarized text.
  """
  # Limit the input text size to prevent exceeding token limits
  text_tokens = num_tokens_from_string(text)
  if text_tokens > max_tokens:
    text = text[:max_tokens * 4] # Assume roughly 4 chars per token

  sys_message = "You are an AI that takes instructions from a human and produces an answer. Be concise in your output."
  user_message = f"Summarize this text: {text}"
  return chat(sys_message, user_message)
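
`summarize` trims long inputs by characters, assuming roughly 4 characters per token, so the trimmed text may land somewhat above or below `max_tokens`. As a sketch of an alternative, the text could be cut on token boundaries instead. The helper name `truncate_to_tokens` and the toy tokenizer below are hypothetical and exist only so the example runs without downloading an encoding.

```python
def truncate_to_tokens(text, max_tokens, encode, decode):
    """Keep at most max_tokens tokens of text (hypothetical helper, not in the notebook)."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Toy whitespace tokenizer, used only so this sketch runs offline.
def toy_encode(s):
    return s.split(" ")


def toy_decode(tokens):
    return " ".join(tokens)


print(truncate_to_tokens("one two three four five", 3, toy_encode, toy_decode))

# With tiktoken, the same helper could be called as:
#   enc = tiktoken.get_encoding("cl100k_base")
#   text = truncate_to_tokens(text, 2048, enc.encode, enc.decode)
```

Passing `encode` and `decode` as arguments keeps the helper independent of any particular tokenizer. With a real encoding, cutting a token list can split a multi-byte character at the boundary, so the last character of the result may not match the source exactly.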

In [140]:
def compare_texts(*args):
  """
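  Compares two or more pieces of text using OpenAI GPT-3.5.

  Parameters:
  *args (str): The texts to be compared.

  Returns:
  str: The comparison generated by the GPT-3.5 model.
  """
  # Summarize each text so the combined prompt stays small
  summaries = [summarize(arg) for arg in args]

  # Number each summary, starting at 1, and separate them with semicolons
  texts = ""
  for i, text in enumerate(summaries, start=1):
    texts += f"({i})" + text + ";"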

  sys_message = """
                  You will be provided numbered bodies of texts separated by a semicolon.
                  You will compare the bodies of text. Be analytical with your response.
                  Make it readable with bullet points with headings.
                """
  user_message = f"{texts}"

  return chat(sys_message, user_message)
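
The system message promises the model numbered bodies of text separated by semicolons, so each summary needs its own label. A minimal sketch of building such a string is shown below. The helper name `build_comparison_prompt` is hypothetical, and it differs slightly from `compare_texts` in that it puts a space after each label and omits the trailing semicolon.

```python
def build_comparison_prompt(summaries):
    """Label each summary with its 1-based position and join them with semicolons."""
    return ";".join(f"({i}) {text}" for i, text in enumerate(summaries, start=1))


print(build_comparison_prompt(["Overview of qubits.", "Course syllabus."]))
```

This prints `(1) Overview of qubits.;(2) Course syllabus.`, which can be inspected without making an API call.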

# Let's try out an example.

In [114]:
text1 = extract_webpage_text("https://aws.amazon.com/what-is/quantum-computing/#:~:text=Quantum%20computing%20is%20a%20multidisciplinary,hardware%20research%20and%20application%20development.")
text2 = extract_pdf_text("https://cs269q.stanford.edu/lectures/lecture1.pdf")
text3 = "Quantum computing is going to be the next big thing after LLMs."
text4 = extract_webpage_text("https://en.wikipedia.org/wiki/Quantum_computing")

Let's first try comparing the first two.

In [141]:
print(compare_texts(text1, text2))

### Comparison of Texts on Quantum Computing:

#### Topics Covered:
- Text 1 focuses on providing detailed explanations of quantum computing, including its advantages, principles, components, and types of technology.
- Text 2 describes an online course, CS269 Quantum Computer Programming, which takes a programming-first approach to quantum computing. It covers quantum mechanics, operations, hybrid algorithms, and error correction.

#### Focus:
- Text 1 emphasizes how quantum computing utilizes principles from quantum mechanics like qubits, superposition, entanglement, and decoherence to solve problems faster than classical computers.
- Text 2 highlights the new power, opportunities, and fundamental curiosity associated with quantum computing, as well as its potential to solve complex problems with exponential scalability.

#### Approach:
- Text 1 provides a theoretical overview of quantum computing principles and technologies.
- Text 2 focuses on a practical approach to quantum computi

Let's see how it does when comparing all four.

In [142]:
print(compare_texts(text1, text2, text3, text4))

### Comparison of Text Bodies:

#### Overview:
- Text (1) provides a general overview of quantum computing, its components, principles, and applications.
- Text (2) focuses on a specific course, CS269, centered around Quantum Computer Programming with a programming-first approach.
- Text (3) briefly mentions quantum computing as a significant development post-LLMs.
- Text (4) delves deeper into quantum computing, explaining its benefits, components, challenges, and historical context.

#### Topics Discussed:
- Text (1) covers quantum mechanics, qubits, superposition, entanglement, decoherence, quantum algorithms, quantum hardware and software.
- Text (2) mentions problem sets, programming projects, hybrid quantum/classical computing, practical applications, and the exponential scaling power of quantum computing.
- Text (3) connects quantum computing with being the next significant development after LLMs.
- Text (4) discusses qubits, quantum algorithms, challenges in engineering quantum

---

Feel free to change the existing URLs or add more bodies of texts.