<a href="https://colab.research.google.com/github/athapa785/LLM_4_Biz_Stanford/blob/main/aditya_thapa_llm4biz_homework_1_compare_texts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Aditya Thapa](https://www.linkedin.com/in/aditya-thapa-56442b11b/)
SLAC National Accelerator Laboratory | Stanford University

# Homework 1
## LLM for Biz with Python
Stanford Continuing Education
TECH 16


The project is set up to perform text comparison and summarization using OpenAI's GPT-3.5 model. It can:

> **Extract text from various sources:** It extracts text from web pages (using *extract_webpage_text*) and from PDF files given as a local path or a URL (using *extract_pdf_text*). It also accepts direct text input.
>
> **Summarize text:** The *summarize* function uses the GPT-3.5 model to generate a concise summary of a given text.
>
> **Compare texts:** The core functionality lies in the *compare_texts* function. It takes multiple text inputs, summarizes each one individually, and then asks GPT-3.5 to identify and highlight significant differences between them.
>
> **Token management:** It counts tokens (*num_tokens_from_string*) so that input text can be truncated to stay within the OpenAI API's token limits.

Essentially, the project provides a way to analyze and compare information from different sources, whether web pages, PDFs, or user-provided text, by summarizing each source with a large language model and then asking the model to contrast the summaries.

In [144]:
# Initialize key and client

from openai import OpenAI
from google.colab import userdata

open_ai_key = userdata.get('open_ai_key')

client = OpenAI(api_key=open_ai_key)

In [145]:
!pip install PyPDF2
!pip install tiktoken
!pip install validators



In [154]:
import PyPDF2 as PyPDF
from bs4 import BeautifulSoup
import requests
import tiktoken
import validators
from IPython.display import Markdown

In [147]:
def chat(sys_message, user_message):
    """
    Send a system message and a user message to the OpenAI GPT-3.5 model and return its response.

    Parameters:
    sys_message (str): Instructions that set the model's behavior.
    user_message (str): A string containing the user's message to the AI.

    Returns:
    str: The text response generated by the GPT-3.5 model.
    """

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": str(sys_message)},
            {"role": "user", "content": str(user_message)}
        ]
    )

    # Keep only the text of the first returned completion
    text_only = response.choices[0].message.content
    return text_only

In [148]:
def extract_webpage_text(url):
  """
  Extracts text from a webpage.

  Parameters:
  url (str): The URL of the webpage.

  Returns:
  str: The extracted text from the webpage.
  """
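  # Fail on network problems and HTTP errors instead of parsing an error page
  response = requests.get(url, timeout=30)
  response.raise_for_status()
  soup = BeautifulSoup(response.text, 'html.parser')
  text = soup.get_text()
  return text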
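
Text returned by `soup.get_text()` often contains long runs of blank lines and indentation left over from the page layout, and that whitespace counts toward the token budget. One possible cleanup step is sketched below. `collapse_whitespace` is a hypothetical helper, not part of this notebook, and the sample string is a made-up stand-in for scraped text.

```python
import re


def collapse_whitespace(text):
    """Trim each line, drop empty lines, and squeeze runs of spaces or tabs."""
    cleaned_lines = []
    for line in text.splitlines():
        # Replace any run of spaces/tabs with one space, then trim both ends
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


# Made-up stand-in for the kind of text a scraped page can produce
raw = "\n\n  Quantum   computing \n\n\n\t is a field.\n   \n"
print(collapse_whitespace(raw))
```

Under these assumptions it could be applied to the result of `extract_webpage_text` before the text is passed to `summarize`.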

In [149]:
# helper function to check if arg is a url or not

def is_url(arg):
  """
  Checks if a given argument is a URL.

  Parameters:
  arg (str): The argument to be checked.

  Returns:
  bool: True if the argument is a URL, False otherwise.

  """

  # validators.url returns True for a valid URL and a falsy failure object otherwise
  return bool(validators.url(arg))


In [150]:
def extract_pdf_text(pdf_path):
  """
  Extracts text from a PDF file.

  Parameters:
  pdf_path (str): A local path or a URL pointing to the PDF file.

  Returns:
  str: The extracted text from the PDF.
  """

  # If given a URL, download the PDF to a local temporary file first
  if is_url(pdf_path):
    response = requests.get(pdf_path, timeout=30)
    response.raise_for_status()
    with open("temp.pdf", "wb") as f:
      f.write(response.content)
    pdf_path = "temp.pdf"

  # Open file and extract text page by page
  with open(pdf_path, 'rb') as pdf_file:
    pdf_reader = PyPDF.PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
      # Guard against pages that yield no text (for example, scanned images)
      text += page.extract_text() or ""
    return text
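
The example later in this notebook picks the right extractor by hand for each source. A minimal sketch of how that choice could be automated is shown below, using only the standard library. `classify_source` is a hypothetical helper, not part of this notebook, and it relies on two labeled assumptions: PDF sources end in `.pdf`, and any other `http`/`https` URL is a web page.

```python
from urllib.parse import urlparse


def classify_source(source):
    """Return 'pdf', 'webpage', or 'text' for an input string."""
    candidate = source.strip()
    try:
        parsed = urlparse(candidate)
    except ValueError:
        # urlparse rejects some malformed inputs; treat those as plain text
        return "text"
    if parsed.scheme in ("http", "https") and parsed.netloc:
        # Assumption: a URL whose path ends in .pdf points at a PDF file
        return "pdf" if parsed.path.lower().endswith(".pdf") else "webpage"
    if candidate.lower().endswith(".pdf"):
        # Assumption: any other input ending in .pdf is a local file path
        return "pdf"
    return "text"


for source in [
    "https://cs269q.stanford.edu/lectures/lecture1.pdf",
    "https://en.wikipedia.org/wiki/Quantum_computing",
    "Quantum computing is going to be the next big thing after LLMs.",
]:
    print(classify_source(source), "<-", source[:40])
```

A PDF served from a URL without a `.pdf` suffix would be misclassified as a web page here; checking the response's content type would be a more reliable, if slower, signal.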

In [151]:
def num_tokens_from_string(string, encoding_name="cl100k_base"):
  """Returns the number of tokens in a text string."""
  encoding = tiktoken.get_encoding(encoding_name)
  num_tokens = len(encoding.encode(string))
  return num_tokens

In [165]:
def summarize(text, max_tokens=2048):
  """
  Summarizes a given text using OpenAI GPT-3.5, limiting the input text size.

  Parameters:
  text (str): The text to be summarized.
  max_tokens (int): The approximate maximum number of tokens allowed for the input text.

  Returns:
  str: The summarized text.
  """
  # Limit the input text size to prevent exceeding token limits
  text_tokens = num_tokens_from_string(text)
  if text_tokens > max_tokens:
    # Character-based cut: assumes roughly 4 chars per token, so the
    # truncated text only approximates max_tokens
    text = text[:max_tokens * 4]

  sys_message = "You are an AI that takes instructions from a human and produces an answer. Try to capture as many details as you can in a short paragraph."
  user_message = f"Summarize this text: {text}"
  return chat(sys_message, user_message)
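
The four-characters-per-token rule used in `summarize` is an approximation, so the truncated text can land above or below `max_tokens`. An exact cut can be made on the token list itself. The sketch below assumes an encoding object with `encode` and `decode` methods, such as the one returned by `tiktoken.get_encoding`. `truncate_to_token_budget` and `WhitespaceEncoding` are hypothetical names, and the toy encoding exists only so the example runs without extra packages.

```python
def truncate_to_token_budget(text, max_tokens, encoding):
    """Cut text so that it encodes to at most max_tokens tokens."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Keep the first max_tokens tokens and turn them back into a string
    return encoding.decode(tokens[:max_tokens])


class WhitespaceEncoding:
    """Toy stand-in for a tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


sample = "Quantum bits can exist in a superposition of states"
print(truncate_to_token_budget(sample, 4, WhitespaceEncoding()))
```

In the notebook, `tiktoken.get_encoding("cl100k_base")` could be passed in place of the toy encoding, which would keep the truncation consistent with the counts from `num_tokens_from_string`.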

In [166]:
def compare_texts(*args):
  """
  Compares two or more pieces of text using OpenAI GPT-3.5.

  Parameters:
  *args (str): The texts to be compared.

  Returns:
  str: The model's analysis of the differences between the texts.
  """
  # Summarize each text first so the combined prompt stays small
  summaries = [summarize(arg) for arg in args]

  # Label each summary (1), (2), ... and separate them with semicolons
  texts = ""
  for i, text in enumerate(summaries, start=1):
    texts += f"({i})" + text + ";"

  sys_message = """
                  You will be provided numbered bodies of text separated by semicolons.
                  You will compare the bodies of text for important differences. Be analytical with your response.
                  Make it readable with bullet points with headings.
                """
  user_message = texts

  return chat(sys_message, user_message)
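
The system message tells the model to expect numbered bodies of text separated by semicolons. A small sketch of that format, with no API call involved, is shown below. `build_comparison_prompt` is a hypothetical helper, not part of this notebook, and the two summaries are made-up placeholders.

```python
def build_comparison_prompt(summaries):
    """Label each summary (1), (2), ... and end each one with a semicolon."""
    parts = []
    for number, summary in enumerate(summaries, start=1):
        parts.append(f"({number}){summary};")
    return "".join(parts)


# Made-up placeholder summaries, only to show the resulting string
example = build_comparison_prompt([
    "Overview of quantum computing.",
    "Course outline on quantum programming.",
])
print(example)
```

Because a summary can itself contain semicolons, the numeric labels are likely the more dependable cue for the model than the separator.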

# Let's try out an example.

In [167]:
text1 = extract_webpage_text("https://aws.amazon.com/what-is/quantum-computing/#:~:text=Quantum%20computing%20is%20a%20multidisciplinary,hardware%20research%20and%20application%20development.")
text2 = extract_pdf_text("https://cs269q.stanford.edu/lectures/lecture1.pdf")
text3 = "Quantum computing is going to be the next big thing after LLMs."
text4 = extract_webpage_text("https://en.wikipedia.org/wiki/Quantum_computing")

### Summary of each:

In [168]:
Markdown(summarize(text1))

Quantum computing involves utilizing principles of quantum mechanics to solve complex problems at a faster pace than traditional computers. This multidisciplinary field combines computer science, physics, and mathematics to leverage quantum effects like superposition and quantum interference for enhanced computational power. Quantum bits, or qubits, are fundamental units in quantum computing, unlike classical bits, and operate based on quantum principles like superposition, entanglement, and decoherence. Quantum computers consist of hardware components such as qubits and control mechanisms, as well as quantum software that implements specialized algorithms. Despite the current limitations, quantum computing shows promise in transforming industries like machine learning, optimization, and simulations that are impractical for classical computers.

In [169]:
Markdown(summarize(text2))

The text outlines a course on quantum computer programming by Dan Boneh and Will Zeng, focusing on challenging traditional computation methods and quantum physics to prepare students for a new discipline and industry. The course details include online lectures, problem sets, programming projects, and readings, with topics covering quantum mechanics, programming, algorithms, and hardware. It emphasizes the exponential power growth of quantum computing compared to classical systems and explores the relationship between physics and computer science, as well as the hybrid nature of quantum programming. The content also delves into the potentials for quantum computing in various fields like artificial intelligence, energy, and materials science, highlighting the accelerating development of quantum hardware globally.

In [170]:
Markdown(summarize(text3))

The text predicts that quantum computing will be the next significant development after Large Language Models (LLMs) in technology.

In [171]:
Markdown(summarize(text4))

The text discusses quantum computing, a field that leverages quantum mechanical phenomena to perform computations exponentially faster than classical computers. It explains that quantum computers use qubits (quantum bits) that can exist in a superposition of states, allowing for faster calculations through wave interference effects. However, the current state of quantum computing is largely experimental due to challenges in engineering high-quality qubits that are isolated from their environment to prevent quantum decoherence. The text also explores the history of quantum computing, from the convergence of quantum mechanics and computer science to the development of quantum algorithms that demonstrate the potential applications of quantum computers, such as breaking encryption schemes like RSA.

### Let's first try comparing the first two.

In [174]:
Markdown(compare_texts(text1, text2))

### Differences Between the Bodies of Texts:

#### Content Focus:
- **First Body of Text:**
  - Focuses on the broad field of quantum computing as a whole.
  - Discusses the potential applications and challenges of quantum computing.
  - Mentions key principles of quantum mechanics and the components of a quantum computer.
  
- **Second Body of Text:**
  - Specifically discusses a course at Stanford University on Quantum Computer Programming.
  - Emphasizes the teaching of quantum mechanics and programming projects related to quantum computing.
  - Highlights the exponential growth and hybrid nature of quantum-classical computing.

#### Approach:
- **First Body of Text:**
  - Provides an overview of quantum computing technologies and their potential.
  - Mentions AWS's support for advancing quantum computing technologies.
  
- **Second Body of Text:**
  - Focuses on the educational aspect of quantum computing through a course.
  - Emphasizes understanding quantum computers in various practical applications and industries.
  
#### Emphasis:
- **First Body of Text:**
  - Emphasizes the advanced applications and benefits of quantum computing, looking at the field from a technological standpoint.
  
- **Second Body of Text:**
  - Emphasizes the educational aspect of quantum computing and its importance in various fields, focusing more on the academic and practical implications of quantum technology.
  
#### Key Takeaways:
- **First Body of Text:**
  - Highlights the significance of quantum mechanics, qubits, and quantum effects in computing.
  - Points out the potential for solving complex problems currently beyond classical computers' capabilities.
  
- **Second Body of Text:**
  - Stresses the importance of understanding quantum computing through practical programming projects.
  - Focuses on the increasing interest in quantum technology in academia and industry.

### Let's see how it does while comparing all four

In [173]:
Markdown(compare_texts(text1, text2, text3, text4))

### Differences in Texts Discussing Quantum Computing:

#### **Focus of Discussion:**
- The first text talks about the multidisciplinary nature of quantum computing, key components, current performance comparison with classical computers, and various applications.
- The second text discusses a specific course on Quantum Computer Programming, covering topics like quantum mechanics, low-level programming, and hybrid algorithms.
- The third text suggests the emergence of quantum computing as a significant technological advancement after LLMs.
- The fourth text covers the basics of quantum computing, challenges in building practical quantum computers, the history of quantum computing, and potential exponential speed advantages of quantum computers over classical ones.

#### **Content Covered:**
- The first text covers quantum effects like superposition and entanglement, qubits, quantum principles, applications, and "quantum advantage."
- The second text focuses on programming quantum computers, hybrid quantum/classical computing approaches, and learning "just enough" quantum physics.
- The third text hints at the growth of quantum computing post LLMs (Large Language Models) without specifying how.
- The fourth text delves into the basics, history, challenges, and speed advantages of quantum computing compared to classical computers.

#### **Applications and Importance:**
- The first text mentions applications like machine learning, optimization, and simulating physical systems to solve currently impossible problems.
- The second text underscores the importance of quantum computing in providing new opportunities, fundamental curiosity, and how to program quantum computers.
- The third text does not specify applications but suggests significant advancement after LLMs.
- The fourth text mentions the potential exponential speed advantage of quantum computers over classical ones in solving certain computational problems.

#### **Level of Detail:**
- The first text provides a broad overview of quantum computing with key components and applications.
- The second text delves into programming quantum computers and specific topics covered in a course.
- The third text is more abstract, hinting at future advancements without detailed specifics.
- The fourth text offers a detailed explanation of the basics, history, challenges, and speed advantages of quantum computing.

#### **Perspectives on Quantum Computing:**
- The first three texts focus more on the practical aspects of quantum computing like programming, potential advancements, and key components.
- The fourth text takes a more historical and technical approach, discussing the development of quantum algorithms and speed advantages over classical computers.

---

Feel free to change the existing URLs or add more bodies of text.