In [1]:
import os
import warnings
from time import time

from dotenv import load_dotenv
from langchain import HuggingFaceHub
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

In [2]:
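# To index every PDF in a folder instead of a single file, use DirectoryLoader:
# loader = DirectoryLoader('PDF_Testing', glob="./*.pdf", loader_cls=PyPDFLoader)

# Load one PDF; PyPDFLoader returns one Document per page.
loader = PyPDFLoader('HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf')
documents = loader.load()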

In [3]:
# Split the pages into chunks of at most 100 characters, with no overlap between chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

In [4]:
len(texts)

192
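
The effect of chunk_size and chunk_overlap on the number of chunks can be illustrated with a minimal sketch. The helper split_fixed below is hypothetical and is not part of LangChain: it cuts text into fixed-size character windows, whereas RecursiveCharacterTextSplitter prefers to break on separators such as paragraph breaks, newlines, and spaces, so real chunk counts will differ.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Hypothetical helper: cut text into fixed-size character windows.

    A simplification of what a real splitter does; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# 250 characters of toy text (repeating alphabet) standing in for a PDF page.
sample = "".join(chr(97 + i % 26) for i in range(250))

no_overlap = split_fixed(sample, chunk_size=100, chunk_overlap=0)
with_overlap = split_fixed(sample, chunk_size=100, chunk_overlap=20)

print(len(no_overlap), len(with_overlap))
```

With overlap, consecutive chunks share their boundary characters, which can help when a sentence would otherwise be cut between two chunks, at the cost of storing more chunks.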

In [5]:
texts[5]

Document(page_content='paper, he turns to ChatGPT, the \nchatbot that produces fluent \nresponses to almost any query', metadata={'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf', 'page': 1})

In [6]:
_ = load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]

llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    model_kwargs={"temperature": 0.2, "max_length": 256},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)

In [7]:
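from langchain.embeddings import HuggingFaceInstructEmbeddings

# Note: BAAI/bge-base-en-v1.5 is a BGE model rather than an INSTRUCTOR model.
# LangChain also provides HuggingFaceBgeEmbeddings for the BGE family.
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="BAAI/bge-base-en-v1.5",
                                                      model_kwargs={"device": "cuda"})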

load INSTRUCTOR_Transformer
max_seq_length  512


In [8]:
%%time
persist_directory = 'db_HuggingFace'

embedding = instructor_embeddings

# Embed every chunk and index the vectors in a Chroma store rooted at persist_directory.
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

CPU times: total: 1.58 s
Wall time: 3.28 s


In [9]:
# Return the 2 most similar chunks for each query.
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?")

In [10]:
len(docs)

2
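
What a top-k similarity retriever does can be sketched in plain Python. This is only an illustration under stated assumptions: the functions cosine and top_k and the 2-dimensional toy vectors are hypothetical, and cosine similarity is used as a stand-in, since the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings: the first two point roughly the same way as the query.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
nearest = top_k([1.0, 0.0], toy_docs, k=2)
print(nearest)
```

A real vector store replaces the exhaustive sort with an approximate nearest-neighbour index, but the contract is the same: embed the query, score it against the stored chunks, and return the k best.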

In [11]:
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [12]:
def process_llm_response(qa_chain, query):
    """Run a query through the QA chain and print the answer, latency, and top source."""
    print(f"Query: {query}\n")
    time_1 = time()
    llm_response = qa_chain(query)
    time_2 = time()
    print(f"Inference time: {round(time_2 - time_1, 3)} sec.")
    print("\nResult:", llm_response['result'])
    # Only the metadata of the first retrieved chunk is shown.
    print("\nmetadata:", llm_response['source_documents'][0].metadata)

In [13]:
query = "HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?"
process_llm_response(qa_chain, query)

Query: HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?

Inference time: 18.746 sec.

Result:  Generative AI has the potential to significantly disrupt scientific publishing by revolutionizing the way scientific research is communicated and disseminated. Some ways in which AI could transform scientific communication and publishing include:

1. Automated literature reviews: AI algorithms can quickly and accurately scan through vast amounts of scientific literature, identify key findings, and generate summaries and insights that would take human researchers hours or even days to produce.

2. Improved scientific writing: AI-powered writing tools can help researchers write more clearly, concisely, and accurately by suggesting alternative word choices, identifying grammatical errors, and even generating entire paragraphs or sections of text.

3. Enhanced scientific editing: AI algorithms can help editors and reviewers identify plagiarism, inconsistencies, and other errors in scientifi

In [14]:
query = "Give me 3 examples of using Generative AI in scientific publish?"
process_llm_response(qa_chain, query)

Query: Give me 3 examples of using Generative AI in scientific publish?

Inference time: 10.183 sec.

Result:  1. Automated Abstract Generation: Generative AI can analyze the full text of a scientific paper and generate a concise and accurate abstract that accurately summarizes the key findings and methods used. This can save researchers time and effort in writing abstracts, as well as improve the accuracy and consistency of abstracts across different papers. 2. Automated Figure Generation: Generative AI can analyze the data presented in a scientific paper and generate figures that accurately represent the data, as well as suggest alternative visualizations that may be more effective at communicating the results. This can help researchers communicate their findings more clearly and effectively, as well as save time and effort in creating figures. 3. Automated Reviewer Assignment: Generative AI can analyze the content of a scientific paper and suggest reviewers who are experts in the re

In [15]:
query = "What is the potential impact of using Generative AI in publishing scientific research?"
process_llm_response(qa_chain, query)

Query: What is the potential impact of using Generative AI in publishing scientific research?

Inference time: 8.223 sec.

Result:  The potential impacts of using Generative AI in publishing scientific research, as identified by science publishers and others, include concerns about the flood of fakes, the potential for AI to generate false or misleading results, the possibility of plagiarism or copyright infringement, and the need for transparency and accountability in the use of AI in scientific research. Additionally, there are questions about the reliability and accuracy of AI-generated research, as well as concerns about the potential for AI to exacerbate existing inequalities in scientific publishing, such as the overrepresentation of certain types of research or the underrepresentation of certain researchers. Overall, while there is excitement about the potential benefits of using AI in scientific publishing, such as increased efficiency and accuracy, there are also significant c

In [16]:
query = "Any downside of using Generative AI to publish scientific research?"
process_llm_response(qa_chain, query)

Query: Any downside of using Generative AI to publish scientific research?

Inference time: 4.67 sec.

Result:  Yes, there are concerns about the reliability and accuracy of the results generated by AI. While AI can quickly analyze large amounts of data, it may not be able to fully understand the context and nuances of scientific research. Additionally, there is a risk of AI perpetuating existing biases and errors in scientific literature. As a result, it is important for scientists to carefully review and validate the results generated by AI before publishing them in scientific journals.

metadata: {'page': 0, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [17]:
query = "Give me 1 key concern of researchers rely on ChatGPT to publish their scientific research?"
process_llm_response(qa_chain, query)

Query: Give me 1 key concern of researchers rely on ChatGPT to publish their scientific research?

Inference time: 5.359 sec.

Result:  One key concern of researchers relying on ChatGPT to publish their scientific research is that they could potentially rely on the AI to write their reviews with little thought, which could lead to inaccurate or unreliable findings. This raises questions about the integrity and validity of the research being published. Additionally, some researchers have already admitted to using ChatGPT to help write papers without disclosing this fact, which raises concerns about academic integrity and transparency.

metadata: {'page': 2, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [18]:
query = "Why generative AI is “automated plagiarism by design”?"
process_llm_response(qa_chain, query)

Query: Why generative AI is “automated plagiarism by design”?

Inference time: 3.843 sec.

Result:  While some see generative AI as a tool for scientists to rethink their research methods, others view it as a way to more easily produce fake content. This is because generative AI is essentially automated plagiarism by design, as it generates new content by building on existing material. This raises concerns about the potential for misuse and the need for clear guidelines and safeguards to prevent the spread of false information.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [19]:
query = "What is the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science?"
process_llm_response(qa_chain, query)

Query: What is the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science?

Inference time: 3.444 sec.

Result:  According to a Nature survey, the most popular answer that researchers thought the biggest benefits of generative AI might be for science was for suggesting clearer ways to convey ideas. However, it should be noted that while many expect that generative AI will have significant benefits for science, the majority of scientists who use LLMs regularly are still in the minority.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [20]:
query = "How generative AI tools could change the ways of researchers conduct meta-analyses and review?"
process_llm_response(qa_chain, query)

Query: How generative AI tools could change the ways of researchers conduct meta-analyses and review?

Inference time: 6.594 sec.

Result:  Generative AI tools have the potential to significantly change the way researchers conduct meta-analyses and reviews. Some researchers see these tools as a way to rethink how they interpret and synthesize data. However, there are concerns about the use of these tools, as some worry that uploading manuscripts and sections of text to generative AI platforms could result in the work being fed back into the training data of language models (LLMs), potentially leading to plagiarism or other forms of academic misconduct. As the use of generative AI in scientific research continues to grow, it will be important for researchers to carefully consider the potential benefits and drawbacks of these tools and to develop guidelines and best practices for their use.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [21]:
query = "Who is Iris van Rooij? "
process_llm_response(qa_chain, query)
# The source metadata returned for this query is correct.

Query: Who is Iris van Rooij? 

Inference time: 2.138 sec.

Result:  Iris van Rooij is a cognitive scholar at the University of Oxford in the UK who studies the ethical and legal implications of content creation, particularly in regards to issues of bias, consent, and copyright.

metadata: {'page': 2, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [22]:
query = "Give me 2 comments from the scientists or researchers of using Generative AI in their work?"
process_llm_response(qa_chain, query)

Query: Give me 2 comments from the scientists or researchers of using Generative AI in their work?

Inference time: 4.378 sec.

Result:  According to some scientists, Generative AI can greatly assist in their research by providing new insights and perspectives through the use of generative AI tools to interrogate experiments, data, and models. This can lead to a rethinking of how they approach their work. Additionally, users can pose queries to these tools to gain further insights and understandings. However, the full potential of Generative AI in scientific research is still being explored and understood.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [23]:
qa_chain.retriever.search_type, qa_chain.retriever.vectorstore

('similarity', <langchain.vectorstores.chroma.Chroma at 0x1ce13961490>)

In [24]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
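
How a "stuff" chain fills this template can be sketched as follows. This is a rough illustration, not LangChain's implementation: build_stuff_prompt is a hypothetical helper, the chunk strings are placeholders, and joining the retrieved chunks with a blank line is an assumption about the separator.

```python
# Template text copied from the printed prompt above.
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Hypothetical helper: put all retrieved chunks into one prompt."""
    # Assumption: chunks are concatenated with a blank line between them.
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["chunk one text", "chunk two text"],  # placeholder chunk contents
    "Who is Iris van Rooij?",
)
print(prompt)
```

With k=2 and 100-character chunks, the context slot holds at most about 200 characters of source text, so the model has very little grounding material for each answer.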
