generativeqa notebook
amaiya committed May 10, 2023
1 parent 61bd40a commit f840efb
Showing 4 changed files with 302 additions and 4 deletions.
14 changes: 13 additions & 1 deletion README.md
@@ -12,6 +12,17 @@


### News and Announcements
- **2023-05-10**
- **ktrain 0.37.x** is released and supports generative question-answering using OpenAI models like GPT-3.5-turbo. Ask a large text corpus questions and receive answers with citations into your corpus. See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/question_answering_with_openai.ipynb) for more information.
```python
import os
os.environ['OPENAI_API_KEY'] = 'ENTER YOUR OPENAI API KEY HERE'
from ktrain.text.qa import GenerativeQA
genqa = GenerativeQA()
genqa.add_doc(text=your_text_document)
print(genqa.query('What is ktrain?'))
```

- **2023-04-21**
- **ktrain 0.36.x** is released and supports a simple wrapper to **Sentiment Analysis**. See the [example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/develop/examples/text/sentiment_analysis_example.ipynb) for more information.
```python
@@ -62,7 +73,8 @@ print(model.execute(prompt))
- **Document Similarity with One-Class Learning**: given some documents of interest, find and score new documents that are thematically similar to them using [One-Class Text Classification](https://en.wikipedia.org/wiki/One-class_classification) <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-document_similarity_scorer.ipynb)]</sup></sub>
- **Document Recommendation Engines and Semantic Searches**: given a text snippet from a sample document, recommend documents that are semantically-related from a larger corpus <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/20newsgroups-recommendation_engine.ipynb)]</sup></sub>
- **Text Summarization**: summarize long documents - no training required <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/text_summarization.ipynb)]</sup></sub>
- **End-to-End Question-Answering**: ask a large text corpus questions and receive exact answers <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
- **Extractive Question-Answering**: ask a large text corpus questions and receive exact answers using BERT <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
- **Generative Question-Answering**: ask a large text corpus questions and receive answers with citations using OpenAI models <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_openai.ipynb)]</sup></sub>
- **Easy-to-Use Built-In Search Engine**: perform keyword searches on large collections of documents <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/question_answering_with_bert.ipynb)]</sup></sub>
- **Zero-Shot Learning**: classify documents into user-provided topics **without** training examples <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/zero_shot_learning_with_nli.ipynb)]</sup></sub>
- **Language Translation**: translate text from one language to another <sub><sup>[[example notebook](https://nbviewer.jupyter.org/github/amaiya/ktrain/blob/master/examples/text/language_translation_example.ipynb)]</sup></sub>
3 changes: 2 additions & 1 deletion examples/README.md
@@ -148,7 +148,8 @@ The objective of the CoNLL2003 task is to classify sequences of words as belongi
- [20newsgroups-recommendation_engine.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text): given text from a sample document, recommend documents that are semantically similar to it from a larger corpus

### <a name="bart"></a>Text Summarization with pretrained BART: [text_summarization_with_bart.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
### <a name="textqa"></a>Open-Domain Question-Answering: [question_answering_with_bert.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
### <a name="textqa"></a>Question-Answering with BERT: [question_answering_with_bert.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
### <a name="genqa"></a>Question-Answering with OpenAI: [question_answering_with_openai.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
### <a name="zsl"></a>Zero-Shot Learning: [zero_shot_learning_with_nli.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
### <a name="translation"></a>Language Translation: [language_translation_example.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
### <a name="textextraction"></a>Text Extraction: [text_extraction_example.ipynb](https://github.com/amaiya/ktrain/tree/master/examples/text)
262 changes: 262 additions & 0 deletions examples/text/question_answering_with_openai.ipynb
@@ -0,0 +1,262 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%reload_ext autoreload\n",
"%autoreload 2\n",
"%matplotlib inline\n",
"import os\n",
"os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"; "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generative Question-Answering in `ktrain` Using OpenAI Models\n",
"\n",
"As of v0.37.x, **ktrain** supports **Generative Question-Answering** using OpenAI models such as GPT-3.5-turbo. You can get an API key at the [OpenAI website](https://platform.openai.com/account/api-keys) and set it in the cell below.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This notebook won't incur very many charges, but you\n",
"# can go to openai.com to view incurred charges from API calls\n",
"os.environ['OPENAI_API_KEY'] = 'ENTER YOUR OPENAI API KEY HERE'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ktrain.text.qa import GenerativeQA\n",
"genqa = GenerativeQA()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's download the ktrain paper from arXiv and extract its text using the `TextExtractor`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget --user-agent=\"Mozilla\" https://arxiv.org/pdf/2004.10703.pdf -O /tmp/downloaded_paper.pdf -q\n",
"from ktrain.text.textextractor import TextExtractor\n",
"text = TextExtractor().extract('/tmp/downloaded_paper.pdf')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Adding documents to the index\n",
"Although we could add the document by supplying the path to the downloaded PDF directly, we will instead use the extracted text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"genqa.add_doc(text=text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's submit a query\n",
"\n",
"The `GenerativeQA` class returns an answer with citations to documents in your index (in our case, there is only one). `GenerativeQA` is a thin wrapper around the `paper-qa` package. By default, citations are MD5 hashes of the text supplied as input. You can provide custom citations and citation keys through the `citation` and `key` parameters of `add_doc`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: What is ktrain?\n",
"\n",
"Ktrain is a low-code Python library designed to make machine learning more accessible and easier to apply for both beginners and experienced practitioners. It provides a simple unified interface enabling one to quickly solve a wide range of tasks in as little as three or four \"commands\" or lines of code. Ktrain can be used with any machine learning model implemented in TensorFlow Keras (tf.keras) and includes out-of-the-box support for text data (e.g., text classification, sequence tagging, open-domain question-answering), vision data (e.g., image classification), graph data (e.g., node classification, link prediction), and tabular data. (md5:1daab15d256e4843ffe094079711bf9c)\n",
"\n",
"However, it should be noted that the provided context only provides a brief overview of ktrain and its capabilities. For more detailed information, it is recommended to refer to the official documentation and resources.\n",
"\n",
"References\n",
"\n",
"1. (md5:1daab15d256e4843ffe094079711bf9c): Document md5:1daab15d256e4843ffe094079711bf9c\n",
"\n",
"\n"
]
}
],
"source": [
"print(genqa.query('What is ktrain?'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Save the current state of the document index and other data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"genqa.save('/tmp/my_generative_qa')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Re-load the document index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"genqa = GenerativeQA()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"genqa.load('/tmp/my_generative_qa')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: What is ktrain?\n",
"\n",
"Ktrain is a low-code Python library that simplifies the process of building, training, inspecting, and applying machine learning models. It provides a unified interface that enables both beginners and experienced practitioners to quickly solve a wide range of tasks with just a few lines of code. Ktrain can be used with any machine learning model implemented in TensorFlow Keras (tf.keras) and includes out-of-the-box support for text data, vision data, graph data, and tabular data. The text also mentions that ktrain provides examples of both supervised and non-supervised machine learning tasks, including named entity recognition, node classification with graph neural networks, theme discovery, and zero-shot topic classification. (md5:1daab15d256e4843ffe094079711bf9c) However, it is not clear from the context what specific machine learning models are supported by ktrain.\n",
"\n",
"References\n",
"\n",
"1. (md5:1daab15d256e4843ffe094079711bf9c): Document md5:1daab15d256e4843ffe094079711bf9c\n",
"\n",
"\n"
]
}
],
"source": [
"print(genqa.query('What is ktrain?'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Delete the document index to start over"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"are you sure you want to delete the vector index? (y/n)y\n"
]
}
],
"source": [
"genqa.clear_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Since the documents were deleted, there is no longer data to answer the question"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Question: What is ktrain?\n",
"\n",
"I cannot answer this question due to insufficient information.\n",
"\n",
"\n"
]
}
],
"source": [
"print(genqa.query('What is ktrain?'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "python3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
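
The notebook above states that default citations are MD5 hashes of the input text, and its printed output shows keys of the form `md5:<hex digest>`. A minimal sketch of how such a key could be derived is shown below. The helper name `default_citation_key` is hypothetical, and the UTF-8 encoding and `md5:` prefix are assumptions inferred from the output, not confirmed details of ktrain or `paper-qa`.

```python
import hashlib


def default_citation_key(text: str) -> str:
    """Hypothetical helper: build an 'md5:<hexdigest>' key from document text.

    Assumes the text is hashed as UTF-8 bytes; the real library may differ.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return f"md5:{digest}"


if __name__ == "__main__":
    # Identical text always maps to the same key, so re-adding a document
    # would yield the same citation under this scheme.
    print(default_citation_key("ktrain is a low-code library."))
```

Hash-based keys are stable but not readable. For human-friendly references, the notebook text says the `citation` and `key` parameters of `add_doc` can be supplied instead.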
27 changes: 25 additions & 2 deletions ktrain/text/qa/generative_qa.py
@@ -1,3 +1,4 @@
import sys
import os
import pickle
from typing import Optional
@@ -13,22 +14,44 @@
DOCS = "docs_obj.pkl"


def is_notebook() -> bool:
    try:
        shell = get_ipython().__class__.__name__
        if "google.colab" in sys.modules:
            return True
        elif shell == "ZMQInteractiveShell":
            return True  # Jupyter notebook or qtconsole
        elif shell == "TerminalInteractiveShell":
            return False  # Terminal running IPython
        else:
            return False  # Other type (?)
    except NameError:
        return False  # Probably standard Python interpreter


class GenerativeQA:
"""
Question-answering using OpenAI or open-source GPT or GPT-like generative LLM models
"""

def __init__(self):
def __init__(self, llm=None):
"""
```
GenerativeQA constructor
Args:
llm(str): The LLM to use. If None, gpt-3.5-turbo is used.
```
"""
if not PAPERQA_INSTALLED:
raise Exception(
"GenerativeQA in ktrain requires the paper-qa package by Andrew White: pip install paper-qa"
)
self.docs = Docs()
self.docs = Docs(llm)
if is_notebook():
import nest_asyncio

nest_asyncio.apply()

def load(self, path: str):
"""

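The constructor change above calls `nest_asyncio.apply()` only when `is_notebook()` is true. The likely reason, stated here as an assumption rather than something the commit confirms, is that Jupyter already runs an asyncio event loop, so library code that starts its own loop would fail. The sketch below reproduces that failure with the standard library alone; the function names are illustrative and appear in neither ktrain nor `paper-qa`.

```python
import asyncio


async def answer() -> str:
    """Stand-in for an async call made by a library."""
    return "ok"


async def call_from_running_loop() -> str:
    """Simulate a notebook cell: a loop is already running when we get here."""
    coro = answer()
    try:
        # Starting a second loop from inside a running one is rejected.
        asyncio.run(coro)
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(err)
    return "no error"


def demo() -> str:
    """Run the simulation and return the RuntimeError message."""
    return asyncio.run(call_from_running_loop())


if __name__ == "__main__":
    print(demo())
```

`nest_asyncio.apply()` patches asyncio so that such nested calls are permitted, which is why the commit guards it behind the notebook check and leaves plain scripts untouched.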