In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import HuggingFaceHub
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
import os 
from dotenv import load_dotenv
from time import time
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Alternative: load every PDF in a folder instead of a single file.
# loader = DirectoryLoader('PDF_Testing', glob="./*.pdf", loader_cls=PyPDFLoader)
loader = PyPDFLoader('HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf')
documents = loader.load()

In [3]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
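
To make the effect of `chunk_size=1000` and `chunk_overlap=200` concrete, here is a minimal pure-Python sketch of a fixed-window splitter. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than the limit. The function name `sliding_window_chunks` and the toy input are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Toy input: 2500 characters of repeating digits (an assumption for illustration).
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

With these settings each window advances by 800 characters, so the tail of one chunk reappears at the head of the next. That repetition is what lets a sentence that straddles a boundary still appear whole in at least one chunk.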

In [4]:
len(texts)

22

In [5]:
texts[5]

Document(page_content='of its use, such as fake references or the soft -\nware’s preprogrammed response that it is an \nAI language model.\nIdeally, publishers would be able to detect \nLLM-generated text. In practice, AI-detection \ntools have so far proved unable to pick out \nsuch text reliably while avoiding flagging \nhuman-written prose as the product of an AI.  \nAlthough developers of commercial LLMs \nare working on watermarking LLM-generated \noutput to make it identifiable, no firm has yet \nrolled this out for text. Any watermarks could \nalso be removed, says Sandra Wachter, a legal \nscholar at the University of Oxford, UK, who \nfocuses on the ethical and legal implications \nof emerging technologies. She hopes that law -\nmakers worldwide will insist on disclosure or \nwatermarks for LLMs, and will make it illegal \nto remove watermarking.  \nPublishers are approaching the issue either \nby banning the use of LLMs altogether (as Sci -\nence’s publisher, the American Ass

In [6]:
_ = load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]

llm=HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta", 
    model_kwargs={"temperature":0.2, "max_length":256},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
    )
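
Reading the token with `os.environ[...]` raises a bare `KeyError` when the `.env` file is missing or incomplete. A small sketch of a friendlier lookup follows; the helper name `require_env` and the dummy token are hypothetical, and the demo uses a plain dict rather than the real environment.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or fail with a readable message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Demo against a toy mapping so the example does not depend on a real token.
toy_env = {"HUGGINGFACEHUB_API_TOKEN": "hf_dummy_token"}
print(require_env("HUGGINGFACEHUB_API_TOKEN", toy_env))
```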

In [7]:
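from langchain.embeddings import HuggingFaceInstructEmbeddings

# Note: BAAI/bge-base-en-v1.5 is a BGE model, not an INSTRUCTOR model.
# langchain also provides HuggingFaceBgeEmbeddings, the wrapper intended for BGE checkpoints.
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="BAAI/bge-base-en-v1.5",
                                                      model_kwargs={"device": "cuda"})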

load INSTRUCTOR_Transformer
max_seq_length  512


In [8]:
%%time
persist_directory = 'db_HuggingFace'

embedding = instructor_embeddings

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

CPU times: total: 1.38 s
Wall time: 2.64 s


In [9]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?")

In [10]:
len(docs)

2
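
The retriever with `search_kwargs={"k": 2}` returns the two chunks whose embeddings are closest to the query embedding. A minimal sketch of that ranking idea is below. It uses cosine similarity as one common choice; the metric Chroma actually applies depends on how the collection is configured. The vectors and the `top_k` helper are toy assumptions, not real embeddings or a langchain API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 3-dimensional "embeddings" (real models produce hundreds of dimensions).
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.8, 0.6, 0.0]
print(top_k(toy_query, toy_doc_vecs, k=2))
```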

In [11]:
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [12]:
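def process_llm_response(qa_chain, query):
    """Run a query through the QA chain and print the answer, the latency, and source metadata."""
    print(f"Query: {query}\n")
    time_1 = time()
    llm_response = qa_chain(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult:", llm_response['result'])
    # Only the first returned source document's metadata is printed.
    print("\nmetadata:", llm_response['source_documents'][0].metadata)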
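
Because the retriever returns two chunks (`k=2`) while the helper prints metadata for only the first, it can be useful to list every distinct source when checking where an answer came from. The sketch below assumes the response has the same shape as the one used above (a dict with a `source_documents` list whose items expose `.metadata`); `summarize_sources` and the toy response are hypothetical stand-ins, not langchain objects.

```python
from types import SimpleNamespace


def summarize_sources(llm_response):
    """Collect the distinct (source, page) pairs from all returned source documents."""
    seen = []
    for doc in llm_response["source_documents"]:
        key = (doc.metadata.get("source"), doc.metadata.get("page"))
        if key not in seen:  # keep first-seen order, drop duplicates
            seen.append(key)
    return seen


# Toy response that mimics the structure of the chain's output.
toy_response = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(metadata={"page": 1, "source": "example.pdf"}),
        SimpleNamespace(metadata={"page": 2, "source": "example.pdf"}),
        SimpleNamespace(metadata={"page": 1, "source": "example.pdf"}),
    ],
}
print(summarize_sources(toy_response))
```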

In [13]:
query = "HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?"
process_llm_response(qa_chain, query)

Query: HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?

Inference time: 0.917 sec.

Result:  Generative AI has the potential to significantly disrupt scientific publishing by transforming scientific communication in various ways. While science publishers are already experimenting with AI in scientific publishing, concerns have been raised about the potential impacts of generative AI, including the possibility of a flood of fakes. The accessibility of AI-generated content raises questions about the authenticity and reliability of scientific research, as well as the potential for plagiarism and intellectual property infringement. As AI becomes more sophisticated, it may also lead to a shift in the role of human editors and peer reviewers, as AI-generated content may require less human intervention in the publishing process. Overall, the potential disruption of scientific publishing by generative AI highlights the need for careful consideration and regulation to ensure the integrit

In [14]:
query = "Give me 3 examples of using Generative AI in scientific publish?"
process_llm_response(qa_chain, query)

Query: Give me 3 examples of using Generative AI in scientific publish?

Inference time: 0.311 sec.

Result:  1. Automated abstract generation: Generative AI can analyze the content of a scientific paper and generate an abstract that accurately summarizes the key findings and methods used. This can save researchers time and ensure that their abstract accurately reflects the content of their paper. 2. Automated figure generation: Generative AI can analyze the data presented in a scientific paper and generate figures that accurately represent the results. This can save researchers time and ensure that their figures are clear and accurate. 3. Automated peer review: Generative AI can analyze the content of a scientific paper and provide a preliminary assessment of its scientific merit. This can help editors and reviewers prioritize papers for review and ensure that only high-quality papers are published.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHIN

In [15]:
query = "What is the potential impact of using Generative AI in publishing scientific research?"
process_llm_response(qa_chain, query)

Query: What is the potential impact of using Generative AI in publishing scientific research?

Inference time: 0.31 sec.

Result:  The potential impacts of using Generative AI in publishing scientific research are both positive and concerning. On the positive side, AI can help with tasks such as summarizing research papers, generating abstracts, and identifying key concepts, which can save time and resources for researchers and publishers. Additionally, AI can help with the peer review process by identifying potential plagiarism or errors in manuscripts. However, there are also concerns about the potential for AI to generate fake research or to perpetuate existing biases in scientific literature. Some experts have also raised concerns about the potential for AI to replace human editors and reviewers, which could lead to a loss of expertise and nuance in the publishing process. Overall, the impact of AI on scientific publishing is still uncertain, and more research is needed to fully un

In [16]:
query = "Any downside of using Generative AI to publish scientific research?"
process_llm_response(qa_chain, query)

Query: Any downside of using Generative AI to publish scientific research?

Inference time: 0.305 sec.

Result:  Yes, science publishers and others have identified a range of concerns about the potential impacts of generative AI, including the risk of a flood of fakes and the potential for plagiarism and misinformation. While some see generative AI as a tool to assist with writing manuscripts, peer-review reports, and grant applications, others worry about the potential for AI to generate false or misleading results, as well as the potential for AI to perpetuate existing biases and inequalities in scientific research. As with any new technology, it's important to carefully consider the potential benefits and drawbacks of using generative AI in scientific publishing, and to ensure that any use of AI is transparent, responsible, and subject to rigorous peer review.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [17]:
query = "Give me 1 key concern of researchers rely on ChatGPT to publish their scientific research?"
process_llm_response(qa_chain, query)

Query: Give me 1 key concern of researchers rely on ChatGPT to publish their scientific research?

Inference time: 0.331 sec.

Result:  One key concern of researchers relying on ChatGPT to publish their scientific research is that they could rely on it to whip up reviews with little thought, potentially leading to inaccurate or incomplete reviews.

metadata: {'page': 2, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [18]:
query = "Why generative AI is “automated plagiarism by design”?"
process_llm_response(qa_chain, query)
# Metadata is correct

Query: Why generative AI is “automated plagiarism by design”?

Inference time: 31.721 sec.

Result:  The author, a researcher from the Netherlands, explains that generative AI is "automated plagiarism by design" because users have no idea where such tools source their information from. This means that the AI may be using information that has been plagiarized or copied from other sources without proper attribution, making it a form of automated plagiarism.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [19]:
query = "What is the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science?"
process_llm_response(qa_chain, query)

Query: What is the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science?

Inference time: 0.329 sec.

Result:  According to a Nature survey, the most popular answer that researchers believed the biggest benefits of generative AI might be for science was that it would help researchers who do not have English as their first language.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [20]:
query = "How generative AI tools could change the ways of researchers conduct meta-analyses and review?"
process_llm_response(qa_chain, query)

Query: How generative AI tools could change the ways of researchers conduct meta-analyses and review?

Inference time: 0.323 sec.

Result:  As generative AI tools become more advanced, they could potentially assist researchers in conducting meta-analyses and literature reviews by quickly summarizing large amounts of text and identifying key themes and findings. This could save researchers significant amounts of time and effort, allowing them to focus on more complex analysis and interpretation. However, there are also concerns about the accuracy and reliability of these tools, particularly in terms of their ability to accurately identify and interpret the nuances of scientific literature. As a result, it is likely that researchers will continue to rely on traditional methods of meta-analysis and literature review, at least for the foreseeable future.

metadata: {'page': 2, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [21]:
query = "Who is Iris van Rooij? "
process_llm_response(qa_chain, query)
# Metadata is correct

Query: Who is Iris van Rooij? 

Inference time: 0.365 sec.

Result:  Iris van Rooij is a cognitive scientist at Radboud University in Nijmegen, the Netherlands, who believes that generative AI should be used without concern for issues such as bias, consent, or copyright. She also hopes that lawmakers worldwide will require disclosure regarding these issues.

metadata: {'page': 2, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [22]:
query = "Give me 2 comments from the scientists or researchers of using Generative AI in their work?"
process_llm_response(qa_chain, query)

Query: Give me 2 comments from the scientists or researchers of using Generative AI in their work?

Inference time: 0.335 sec.

Result:  One scientist mentioned in the context that non-native English speakers could benefit the most from these tools as they could help them suggest clearer ways to convey their ideas. Another researcher saw generative AI as a way for scientists to rethink how they interpret and summarize experimental results. However, a Nature survey revealed that while some scientists use LLMs regularly, the majority still falls into the minority category. Many expect that the use of generative AI in scientific work will increase in the future.

metadata: {'page': 1, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [23]:
qa_chain.retriever.search_type, qa_chain.retriever.vectorstore

('similarity', <langchain.vectorstores.chroma.Chroma at 0x1ba2ce2ac50>)

In [24]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
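
The `chain_type="stuff"` option means all retrieved chunks are placed ("stuffed") into the `{context}` slot of the template above and sent to the LLM in a single prompt. A minimal sketch of that assembly step follows. The `build_stuff_prompt` helper, the blank-line separator between chunks, and the toy documents are assumptions for illustration; this is not the library's own code.

```python
from types import SimpleNamespace

# Template text copied from the printed prompt above.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(docs, question, separator="\n\n"):
    """Join the page_content of every retrieved chunk and fill the template."""
    context = separator.join(doc.page_content for doc in docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Toy stand-ins for the two retrieved Document objects.
toy_docs = [
    SimpleNamespace(page_content="First retrieved chunk."),
    SimpleNamespace(page_content="Second retrieved chunk."),
]
prompt = build_stuff_prompt(toy_docs, "Who is Iris van Rooij?")
print(prompt)
```

Since every chunk lands in one prompt, the combined length of the retrieved chunks plus the question has to fit within the model's context window, which is one reason to keep `k` and `chunk_size` modest.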
