In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import HuggingFaceHub
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
import os
from dotenv import load_dotenv
from time import time
import warnings
warnings.filterwarnings('ignore')

In [2]:
#loader = DirectoryLoader('PDF_Testing', glob="./*.pdf", loader_cls=PyPDFLoader)
loader = PyPDFLoader('HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf')
documents = loader.load()

In [3]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=10)
texts = text_splitter.split_documents(documents)

In [4]:
len(texts)

19
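
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and newlines); the function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical sample input, small enough to inspect by eye.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last two characters of the previous one, which is the same idea as the `chunk_overlap=10` setting above, only at character level and without separator awareness.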

In [5]:
texts[5]

Document(page_content='focuses on the ethical and legal implications \nof emerging technologies. She hopes that law -\nmakers worldwide will insist on disclosure or \nwatermarks for LLMs, and will make it illegal \nto remove watermarking.  \nPublishers are approaching the issue either \nby banning the use of LLMs altogether (as Sci -\nence’s publisher, the American Association for \nthe Advancement of Science, has done), or, \nin most cases, insisting on transparency (the \npolicy at Nature and many other journals). A \nstudy examining 100 publishers and journals \nfound that, as of May, 17% of publishers and \n70% of journals had released guidelines on how \ngenerative AI could be used, although they \nvaried on how the tools could be applied, says \nGiovanni Cacciamani, a urologist at the Uni -\nversity of Southern California in Los Angeles, \nwho co-authored the work, which has not yet \nbeen peer reviewed1. He and his colleagues \nare working with scientists and journal edi -\ntors

In [6]:
_ = load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]

llm = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    model_kwargs={"temperature": 0.2, "max_length": 256},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)

In [7]:
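from langchain.embeddings import HuggingFaceInstructEmbeddings

# Note: "BAAI/bge-base-en-v1.5" is a BGE model, not an INSTRUCTOR model. It is loaded
# here through the Instructor wrapper; LangChain also ships HuggingFaceBgeEmbeddings,
# which is the wrapper intended for BGE checkpoints.
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="BAAI/bge-base-en-v1.5",
                                                      model_kwargs={"device": "cuda"})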

load INSTRUCTOR_Transformer
max_seq_length  512


In [8]:
%%time
persist_directory = 'db_HuggingFace'

embedding = instructor_embeddings

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

CPU times: total: 1.7 s
Wall time: 2.62 s


In [9]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?")

In [10]:
len(docs)

2
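
The `k=2` setting asks the retriever for the two stored chunks closest to the query embedding. A rough sketch of that top-k selection, using cosine similarity on toy two-dimensional vectors, is shown below. The helper names and vectors are hypothetical, and the distance metric actually used by the Chroma collection may differ from cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for real chunk vectors.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = top_k([0.8, 0.2], toy_vectors, k=2)
print(nearest)
```

A real vector store does the same ranking over much higher-dimensional vectors, usually with an approximate index rather than a full scan.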

In [11]:
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [12]:
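def process_llm_response(qa_chain, query):
    """Run the query through the chain and print the answer, timing, and first source."""
    print(f"Query: {query}\n")
    time_1 = time()
    llm_response = qa_chain(query)
    time_2 = time()
    print(f"Inference time: {round(time_2 - time_1, 3)} sec.")
    print("\nResult:", llm_response['result'])
    # Only the top-ranked source is shown; the retriever returns k=2 documents.
    print("\nmetadata:", llm_response['source_documents'][0].metadata)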
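
Since the chain is built with `return_source_documents=True`, every retrieved chunk is available in the response, not just the first. The sketch below lists all of them. It assumes the response is a dict with a `source_documents` list whose items expose a `.metadata` dict; `FakeDocument`, `summarize_sources`, and the sample values are hypothetical stand-ins so the snippet runs without LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing only the .metadata attribute used below."""
    metadata: dict = field(default_factory=dict)


def summarize_sources(llm_response):
    """Return one 'rank. source (page N)' line per retrieved document."""
    lines = []
    for rank, doc in enumerate(llm_response.get("source_documents", []), start=1):
        meta = doc.metadata or {}
        lines.append(f"{rank}. {meta.get('source', '?')} (page {meta.get('page', '?')})")
    return lines


# Made-up response shaped like the chain output described above.
example_response = {
    "result": "placeholder answer",
    "source_documents": [
        FakeDocument({"source": "article.pdf", "page": 2}),
        FakeDocument({"source": "article.pdf", "page": 0}),
    ],
}
source_lines = summarize_sources(example_response)
print("\n".join(source_lines))
```

Listing every source makes it easier to spot a retrieved chunk that comes from an unexpected file.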

In [13]:
query = "HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?"
process_llm_response(qa_chain, query)

Query: HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?

Inference time: 8.719 sec.

Result:  The article "HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING" by Gemma Conroy, published in Nature's Vol 622, Issue 7915, on October 12, 2023, explores the potential impact of generative AI on scientific publishing. The article suggests that AI-assisted writing and reviewing could transform the nature of scientific papers by improving the efficiency and accuracy of the writing process, as well as by enabling more rigorous and objective peer review. However, the article also raises concerns about the potential for AI to replace human authors and reviewers, as well as about the need for transparency and accountability in the use of AI in scientific publishing. Overall, the article highlights the need for careful consideration and regulation of the use of AI in scientific publishing to ensure that it enhances, rather than undermines, the integrity and rigor of the scientific process.


In [14]:
query = "Give me 3 examples of using Generative AI in scientific publish?"
process_llm_response(qa_chain, query)

Query: Give me 3 examples of using Generative AI in scientific publish?

Inference time: 11.981 sec.

Result:  While the use of generative AI in scientific publishing is still in its early stages, there are already some examples of how it could be used. Here are three:

1. Writing scientific papers: Generative AI could be used to help researchers write their papers more efficiently. By analyzing existing scientific literature, AI algorithms could suggest potential ideas, hypotheses, and experiments for a researcher to consider. They could also help generate the first draft of a paper, which the researcher could then edit and refine.

2. Reviewing scientific papers: Generative AI could also be used to help reviewers assess the quality and validity of scientific papers. By analyzing the data and methods used in a paper, AI algorithms could identify potential issues or inconsistencies, and suggest ways to address them. They could also help reviewers quickly and accurately assess the novel

In [15]:
query = "What is the potential impact of using Generative AI in publishing scientific research?"
process_llm_response(qa_chain, query)

Query: What is the potential impact of using Generative AI in publishing scientific research?

Inference time: 7.08 sec.

Result:  The use of Generative AI in publishing scientific research has the potential to transform the nature of the scientific paper. This could include AI-assisted writing, which could help researchers to generate more accurate and concise scientific reports, and AI-assisted reviewing, which could help editors and peer reviewers to more efficiently and accurately assess the quality and validity of scientific manuscripts. However, there are also concerns about the potential for AI to perpetuate existing biases in scientific research and publishing, as well as the potential for AI to replace human reviewers and editors altogether. As with any emerging technology, the impact of Generative AI on scientific publishing will depend on how it is developed and used, and it will be important to ensure that it is used in a responsible and transparent manner.

metadata: {'pag

In [16]:
query = "Any downside of using Generative AI to publish scientific research?"
process_llm_response(qa_chain, query)

Query: Any downside of using Generative AI to publish scientific research?

Inference time: 7.048 sec.

Result:  Yes, there are potential downsides to using Generative AI to publish scientific research. While AI can assist in the writing and reviewing process, it cannot replace the critical thinking and expertise of human researchers. There is also a risk of AI generating incorrect or misleading results, as it may not be able to fully understand the nuances and complexities of scientific research. Additionally, there are concerns about the potential for AI to perpetuate existing biases and inequalities in scientific publishing, as it may learn from and reinforce existing patterns of publication and citation. As with any new technology, it is important to carefully consider the potential benefits and drawbacks of using Generative AI in scientific publishing, and to ensure that it is used in a responsible and ethical manner.

metadata: {'page': 0, 'source': 'HOW GENERATIVE AI COULD DISRU

In [17]:
query = "Give me 1 key concern of researchers rely on ChatGPT to publish their scientific research?"
process_llm_response(qa_chain, query)

Query: Give me 1 key concern of researchers rely on ChatGPT to publish their scientific research?

Inference time: 3.962 sec.

Result:  One key concern is that researchers could rely on ChatGPT to whip up reviews with little thought, although the naive act of asking an LLM directly to review a manuscript is likely to produce little of value beyond summaries and copy-editing suggestions, says Mohamad Hosseini, who studies research ethics and integrity at Northwestern University’s Galter Health Sciences Library and Learning Center in Chicago, Illinois.

metadata: {'page': 2, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [18]:
query = "Why generative AI is “automated plagiarism by design”?"
process_llm_response(qa_chain, query)
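# The metadata is wrong: the quote comes from "How generative AI could disrupt scientific publishing",
# but the reported source is NYSE_AXP_2021.pdf. The persisted 'db_HuggingFace' store probably
# still holds chunks from an earlier run that indexed that file.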

Query: Why generative AI is “automated plagiarism by design”?

Inference time: 0.319 sec.

Result:  According to Iris van Rooij, a cognitive scientist at Radboud University in Nijmegen, the Netherlands, generative AI is “automated plagiarism by design” because users have no idea where such tools source their information from when they trawl the internet without concern for bias, consent, or copyright. This means that researchers may unknowingly use information from unreliable sources, which could lead to plagiarism or the spread of false information. Van Rooij argues that if researchers were more aware of this problem, they wouldn't want to use generative AI tools.

metadata: {'page': 40, 'source': 'NYSE_AXP_2021.pdf'}


In [19]:
query = "What is the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science?"
process_llm_response(qa_chain, query)

Query: What is the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science?

Inference time: 0.331 sec.

Result:  According to the article, the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science was "accelerating the pace of discovery" with 43% of respondents selecting this option. Other benefits that received significant support included "improving the quality of research" (32%) and "enhancing the accessibility of scientific knowledge" (25%). The survey received 600 responses from researchers in various fields of science.

metadata: {'page': 0, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [20]:
query = "How generative AI tools could change the ways of researchers conduct meta-analyses and review?"
process_llm_response(qa_chain, query)

Query: How generative AI tools could change the ways of researchers conduct meta-analyses and review?

Inference time: 11.405 sec.

Result:  According to Mineault, generative AI tools could significantly change the way researchers conduct meta-analyses and reviews, but only if the tools' tendency to make up information and references can be addressed adequately. Mineault, a computer scientist at the University of British Columbia in Vancouver, Canada, suggests that these tools could allow researchers to drill into the aspects of a study that are most relevant to them and access a description of the results tailored to their needs. The largest human-generated review Mineault has seen included around 1,600 papers, but working with generative AI could take it much further. Mineault's comments come as companies such as scite and Elicit have already launched search tools that use LLMs to provide researchers with natural-language answers to queries, and Elsevier has launched a pilot version 

In [21]:
query = "Who is Iris van Rooij? "
process_llm_response(qa_chain, query)
# Wrong metadata again (source is NYSE_AXP_2021.pdf, not the article), maybe because two people
# share the same name: one works at AE, the other is the cognitive scientist quoted in the article.

Query: Who is Iris van Rooij? 

Inference time: 0.31 sec.

Result:  Iris van Rooij is a Senior Consultant in the Privacy, Data Protection, Data Governance, and Information and Cyber Security practice at Capgemini Invent. She has over 20 years of experience in the field of data protection and privacy, and has worked with a variety of clients in different industries, including financial services, healthcare, and retail. Her expertise includes data governance, data retention, and data breach response. She is also a certified Information Privacy Professional (CIPP/E) and a member of the International Association of Privacy Professionals (IAPP).

metadata: {'page': 15, 'source': 'NYSE_AXP_2021.pdf'}
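
One way to check whether the store holds chunks from files other than the one loaded in this session is to tally the `source` field across all stored metadata. The sketch below does this on a hand-written list whose values are copied from the outputs above. In LangChain versions where the Chroma wrapper exposes `get()`, the real list could probably be read with `vectordb.get()["metadatas"]`, but that call is an assumption about the installed version and is not used here. The helper names are hypothetical.

```python
from collections import Counter


def count_sources(metadatas):
    """Count stored chunks per 'source' value; entries without one are grouped as '<missing>'."""
    return Counter((meta or {}).get("source", "<missing>") for meta in metadatas)


def unexpected_sources(metadatas, expected):
    """Return the sorted source names that are not in the expected set."""
    counts = count_sources(metadatas)
    return sorted(source for source in counts if source not in expected)


# Sample metadata dicts, shaped like the ones printed in the outputs above.
sample_metadatas = [
    {"page": 0, "source": "HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf"},
    {"page": 2, "source": "HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf"},
    {"page": 15, "source": "NYSE_AXP_2021.pdf"},
    {"page": 40, "source": "NYSE_AXP_2021.pdf"},
]
expected = {"HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf"}
stale = unexpected_sources(sample_metadatas, expected)
print(stale)
```

If unexpected sources show up, pointing `persist_directory` at a fresh, empty folder before indexing is a simple way to keep them out of the results.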


In [None]:
query = "Give me 2 comments from the scientists or researchers of using Generative AI in their work?"
process_llm_response(qa_chain, query)

Query: Give me 2 comments from the scientists or researchers of using Generative AI in their work?



In [None]:
qa_chain.retriever.search_type, qa_chain.retriever.vectorstore

In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)
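
For orientation, the `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below imitates that assembly step in plain Python. The template text, the function name `build_stuff_prompt`, and the sample chunks are hypothetical; the actual template is whatever the cell above prints.

```python
# Hypothetical template; the real one is printed by the cell above.
HYPOTHETICAL_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(chunks, question, template=HYPOTHETICAL_TEMPLATE, separator="\n\n"):
    """Join all chunks into one context string and fill the template with it."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Made-up chunks standing in for the k retrieved documents.
prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What does the article say?",
)
print(prompt)
```

Because everything is packed into one prompt, the combined length of the `k` chunks plus the question has to fit in the model's context window.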