generativeqa notebook

amaiya · May 10, 2023 · f840efb · f840efb
1 parent 61bd40a
commit f840efb

Showing 4 changed files with 302 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -12,6 +12,17 @@
 
 
 ### News and Announcements
+- **2023-05-10**
+  - **ktrain 0.37.x** is released and supports generative question-answering using OpenAI models like GPT-3.5-turbo. Ask a large text corpus questions and receive answers with citations into your corpus. See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/question_answering_with_openai.ipynb) for more information.
+```python
+import os
+os.environ['OPENAI_API_KEY'] = 'ENTER YOUR OPENAI API KEY HERE'
+from ktrain.text.qa import GenerativeQA
+genqa = GenerativeQA()
+genqa.add_doc(text=your_text_document)
+print(genqa.query('What is ktrain?'))
+```
+
 - **2023-04-21**
   - **ktrain 0.36.x** is released and supports a simple wrapper to **Sentiment Analysis**. See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/sentiment_analysis_example.ipynb) for more information. 
 ```python
@@ -62,7 +73,8 @@ print(model.execute(prompt))
      - **Document Similarity with One-Class Learning**:  given some documents of interest, find and score new documents that are thematically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
      - **Document Recommendation Engines and Semantic Searches**:  given a text snippet from a sample document, recommend documents that are semantically-related from a larger corpus  <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
      - **Text Summarization**:  summarize long documents - no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_summarization.ipynb)]</sup></sub>
-     - **End-to-End Question-Answering**:  ask a large text corpus questions and receive exact answers <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
+     - **Extractive Question-Answering**:  ask a large text corpus questions and receive exact answers using BERT <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
+     - **Generative Question-Answering**:  ask a large text corpus questions and receive answers with citations using OpenAI models <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_openai.ipynb)]</sup></sub>
      - **Easy-to-Use Built-In Search Engine**:  perform keyword searches on large collections of documents <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
      - **Zero-Shot Learning**:  classify documents into user-provided topics **without** training examples <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/zero_shot_learning_with_nli.ipynb)]</sup></sub>
      - **Language Translation**:  translate text from one language to another <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/language_translation_example.ipynb)]</sup></sub>

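The README snippet above assigns the API key as a string literal. A minimal sketch of a guard that could run before `GenerativeQA()` is created is shown here; `require_openai_key` and `PLACEHOLDER` are hypothetical names used for illustration and are not part of ktrain.

```python
import os

# Placeholder text used in the README snippet; treated here as "not set".
PLACEHOLDER = "ENTER YOUR OPENAI API KEY HERE"


def require_openai_key(env=os.environ):
    """Return the key stored under OPENAI_API_KEY, or raise if it is unusable.

    Hypothetical helper for illustration only; ktrain does not provide it.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "Set OPENAI_API_KEY to a real key before creating GenerativeQA"
        )
    return key
```

Reading the key from the environment, rather than pasting it into a notebook cell, keeps it out of files that get committed.
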
diff --git a/examples/README.md b/examples/README.md
@@ -148,7 +148,8 @@ The objective of the CoNLL2003 task is to classify sequences of words as belongi
 - [20newsgroups-recommendation_engine.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text):  given text from a sample document, recommend documents that are semantically similar to it from a larger corpus
 
 ### <a name="bart"></a>Text Summarization with pretrained BART: [text_summarization_with_bart.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
-### <a name="textqa"></a>Open-Domain Question-Answering: [question_answering_with_bert.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
+### <a name="textqa"></a>Question-Answering with BERT: [question_answering_with_bert.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
+### <a name="genqa"></a>Question-Answering with OpenAI: [question_answering_with_openai.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 ### <a name="zsl"></a>Zero-Shot Learning: [zero_shot_learning_with_nli.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 ### <a name="translation"></a>Language Translation: [language_translation_example.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
 ### <a name="textextraction"></a>Text Extraction: [text_extraction_example.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)

diff --git a/examples/text/question_answering_with_openai.ipynb b/examples/text/question_answering_with_openai.ipynb
@@ -0,0 +1,262 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%reload_ext autoreload\n",
+    "%autoreload 2\n",
+    "%matplotlib inline\n",
+    "import os\n",
+    "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n",
+    "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Generative Question-Answering in `ktrain` Using OpenAI Models\n",
+    "\n",
+    "As of v0.37.x, **ktrain** supports **Generative Question-Answering** using OpenAI models such as GPT-3.5-turbo. You can get an API key at the [OpenAI website](https://platform.openai.com/account/api-keys) and set it in the cell below.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "os.environ['OPENAI_API_KEY'] = 'ENTER YOUR OPENAI API KEY HERE'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# This notebook won't incur very many charges, but you\n",
+    "# can go to openai.com to view incurred charges from API calls\n",
+    "os.environ['OPENAI_API_KEY'] = 'ENTER YOUR OPENAI API KEY HERE'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ktrain.text.qa import GenerativeQA\n",
+    "genqa = GenerativeQA()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Let's download the ktrain paper from arXiv and extract text from it using the `TextExtractor`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n",
+    "from ktrain.text.textextractor import TextExtractor\n",
+    "text = TextExtractor().extract('/tmp/downloaded_paper.pdf')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Adding documents to the index\n",
+    "Although we could add the document by supplying the path to the downloaded PDF directly, we will instead use the extracted text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "genqa.add_doc(text=text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Let's submit a query\n",
+    "\n",
+    "The `GenerativeQA` class will return an answer with citations to documents in your index (in our case, there is only one). `GenerativeQA` is a simple wrapper around the `paper-qa` package.  By default, citations are in the form of MD5 hashes of the text supplied as input.  You can supply custom citations and citation keys by passing the `citation` and `key` parameters to `add_doc`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Question: What is ktrain?\n",
+      "\n",
+      "Ktrain is a low-code Python library designed to make machine learning more accessible and easier to apply for both beginners and experienced practitioners. It provides a simple unified interface enabling one to quickly solve a wide range of tasks in as little as three or four \"commands\" or lines of code. Ktrain can be used with any machine learning model implemented in TensorFlow Keras (tf.keras) and includes out-of-the-box support for text data (e.g., text classification, sequence tagging, open-domain question-answering), vision data (e.g., image classification), graph data (e.g., node classification, link prediction), and tabular data. (md5:1daab15d256e4843ffe094079711bf9c)\n",
+      "\n",
+      "However, it should be noted that the provided context only provides a brief overview of ktrain and its capabilities. For more detailed information, it is recommended to refer to the official documentation and resources.\n",
+      "\n",
+      "References\n",
+      "\n",
+      "1. (md5:1daab15d256e4843ffe094079711bf9c): Document md5:1daab15d256e4843ffe094079711bf9c\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(genqa.query('What is ktrain?'))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Save the current state of the document index and other data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "genqa.save('/tmp/my_generative_qa')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Re-load the document index"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "genqa = GenerativeQA()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "genqa.load('/tmp/my_generative_qa')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Question: What is ktrain?\n",
+      "\n",
+      "Ktrain is a low-code Python library that simplifies the process of building, training, inspecting, and applying machine learning models. It provides a unified interface that enables both beginners and experienced practitioners to quickly solve a wide range of tasks with just a few lines of code. Ktrain can be used with any machine learning model implemented in TensorFlow Keras (tf.keras) and includes out-of-the-box support for text data, vision data, graph data, and tabular data. The text also mentions that ktrain provides examples of both supervised and non-supervised machine learning tasks, including named entity recognition, node classification with graph neural networks, theme discovery, and zero-shot topic classification. (md5:1daab15d256e4843ffe094079711bf9c) However, it is not clear from the context what specific machine learning models are supported by ktrain.\n",
+      "\n",
+      "References\n",
+      "\n",
+      "1. (md5:1daab15d256e4843ffe094079711bf9c): Document md5:1daab15d256e4843ffe094079711bf9c\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(genqa.query('What is ktrain?'))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Delete the document index to start over"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "are you sure you want to delete the vector index? (y/n)y\n"
+     ]
+    }
+   ],
+   "source": [
+    "genqa.clear_index()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Since the documents were deleted, there is no longer data to answer the question"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Question: What is ktrain?\n",
+      "\n",
+      "I cannot answer this question due to insufficient information.\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(genqa.query('What is ktrain?'))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "python3",
+   "language": "python",
+   "name": "python3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
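
The notebook says that default citations are MD5 hashes of the input text, and its output shows labels of the form `md5:<hex digest>`. A minimal sketch of how such a label could be derived is given below. The UTF-8 encoding and the helper name `default_citation_key` are assumptions; the hashing actually done inside ktrain and `paper-qa` may differ.

```python
import hashlib


def default_citation_key(text: str) -> str:
    """Build an `md5:<hex digest>` label from a document's text.

    Assumption: the text is hashed as UTF-8 bytes. This is an illustrative
    helper, not the function ktrain or paper-qa uses.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return f"md5:{digest}"


print(default_citation_key("hello"))  # md5:5d41402abc4b2a76b9719d911017c592
```

Because the label depends only on the text, adding the same text twice yields the same citation. Passing the `citation` and `key` parameters to `add_doc`, as the notebook describes, replaces these hashes with more readable references.
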
diff --git a/ktrain/text/qa/generative_qa.py b/ktrain/text/qa/generative_qa.py
@@ -1,3 +1,4 @@
+import sys
 import os
 import pickle
 from typing import Optional
@@ -13,22 +14,44 @@
 DOCS = "docs_obj.pkl"
 
 
+def is_notebook() -> bool:
+    try:
+        shell = get_ipython().__class__.__name__
+        if "google.colab" in sys.modules:
+            return True
+        elif shell == "ZMQInteractiveShell":
+            return True  # Jupyter notebook or qtconsole
+        elif shell == "TerminalInteractiveShell":
+            return False  # Terminal running IPython
+        else:
+            return False  # Other type (?)
+    except NameError:
+        return False  # Probably standard Python interpreter
+
+
 class GenerativeQA:
     """
     Question-answering using OpenAI or open-source GPT or GPT-like generative LLM models
     """
 
-    def __init__(self):
+    def __init__(self, llm=None):
         """
         ```
         GenerativeQA constructor
+
+        Args:
+          llm(str):  The LLM to use.  If None, gpt-3.5-turbo is used.
         ```
         """
         if not PAPERQA_INSTALLED:
             raise Exception(
                 "GenerativeQA in ktrain requires the paper-qa package by Andrew White: pip install paper-qa"
             )
-        self.docs = Docs()
+        self.docs = Docs(llm)
+        if is_notebook():
+            import nest_asyncio
+
+            nest_asyncio.apply()
 
     def load(self, path: str):
         """