In [26]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import HuggingFaceHub
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import DirectoryLoader
import os 
from dotenv import load_dotenv
from time import time
import warnings
warnings.filterwarnings('ignore')

In [27]:
#loader = DirectoryLoader('PDF_Testing', glob="./*.pdf", loader_cls=PyPDFLoader)
loader = PyPDFLoader('HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf')
documents = loader.load()

In [28]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=400)
texts = text_splitter.split_documents(documents)

In [29]:
len(texts)

11
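To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and newlines, so its boundaries differ); the helper `sliding_chunks` is a hypothetical name that only shows how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: real splitters also respect separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last 4 characters of the previous one, which is the same idea as the 400-character overlap configured above: a sentence cut at a chunk boundary still appears whole in at least one chunk.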

In [30]:
texts[5]

Document(page_content='Nature 621 , 672–675; 2023). “The use of AI tools \ncould improve equity in science, ” says Tatsuya \nAmano, a conservation scientist at the Uni -\nversity of Queensland in Brisbane, Australia. \nAmano and his colleagues surveyed more \nthan 900 environmental scientists who had \nauthored at least one paper in English. Among \nearly-career researchers, non-native English \nspeakers said their papers were rejected owing \nto writing issues more than twice as often as \nnative English speakers did, who also spent \nless time writing their submissions2. ChatGPT \nand similar tools could be a “huge help” for \nthese researchers, says Amano.\nAmano, whose first language is Japanese, \nhas been experimenting with ChatGPT and \nsays the process is similar to working with a \nnative English-speaking colleague, although \nthe tool’s suggestions sometimes fall short. \nHe co-authored an editorial in Science in March \nfollowing that journal’s ban on generative AI \ntools, 

In [31]:
_ = load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.environ["HUGGINGFACEHUB_API_TOKEN"]

llm=HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta", 
    model_kwargs={"temperature":0.2, "max_length":256},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
    )

In [32]:
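from langchain.embeddings import HuggingFaceInstructEmbeddings

# Note: "BAAI/bge-base-en-v1.5" is a BGE model, loaded here through the INSTRUCTOR wrapper;
# langchain also provides HuggingFaceBgeEmbeddings, which is intended for BGE models.
# "device": "cuda" requires a GPU; use "cpu" if none is available.
instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="BAAI/bge-base-en-v1.5",
                                                      model_kwargs={"device": "cuda"})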

load INSTRUCTOR_Transformer
max_seq_length  512


In [33]:
%%time
persist_directory = 'db_HuggingFace'

embedding = instructor_embeddings

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

CPU times: total: 344 ms
Wall time: 1.51 s
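Some answers further down cite 'NYSE_AXP_2021.pdf', a file that was not loaded in this session. One possible explanation (an assumption, not verified here) is that the 'db_HuggingFace' directory still held chunks from an earlier run, because building a store on an existing persist directory adds to what is already there. A minimal sketch of clearing the directory before rebuilding; `reset_persist_directory` is a hypothetical helper and the demo uses a temporary directory rather than the real store.

```python
import shutil
import tempfile
from pathlib import Path


def reset_persist_directory(path):
    """Delete a persisted vector-store directory, if present, and recreate it empty."""
    directory = Path(path)
    shutil.rmtree(directory, ignore_errors=True)  # no error if it does not exist
    directory.mkdir(parents=True, exist_ok=True)
    return directory


# Demo on a throwaway directory so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "db_demo"
    demo_dir.mkdir()
    (demo_dir / "stale.bin").write_text("left over from an earlier run")
    reset_persist_directory(demo_dir)
    remaining = list(demo_dir.iterdir())
    print(remaining)
```

In the notebook this would be called on `persist_directory` before `Chroma.from_documents`, and only when losing the previously indexed documents is acceptable.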


In [61]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
docs = retriever.get_relevant_documents("HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?")

In [62]:
len(docs)

2

In [44]:
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [45]:
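def process_llm_response(qa_chain, query):
    """Run a query through the chain and print the answer, timing, and source metadata."""
    print(f"Query: {query}\n")
    time_1 = time()
    llm_response = qa_chain(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult:", llm_response['result'])
    # Only the top-ranked retrieved chunk's metadata is shown; the retriever returns k chunks.
    print("\nmetadata:", llm_response['source_documents'][0].metadata)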

In [46]:
query = "HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?"
process_llm_response(qa_chain, query)

Query: HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING?

Inference time: 0.313 sec.

Result:  In a world of AI-assisted writing and reviewing, the nature of the scientific paper could be transformed. This could lead to a proliferation of AI-generated research, potentially raising questions about the authenticity and reliability of some findings. However, AI could also help to streamline the peer-review process and enable more efficient and accurate analysis of large datasets. The implications of these developments for the future of scientific publishing are still uncertain, but it is clear that AI will play an increasingly important role in shaping the way we produce and consume scientific knowledge.

metadata: {'page': 0, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [51]:
query = "Give me 3 examples of using Generative AI in scientific publish?"
process_llm_response(qa_chain, query)

Query: Give me 3 examples of using Generative AI in scientific publish?

Inference time: 10.505 sec.

Result:  Based on the context provided, here are three potential examples of using Generative AI in scientific publishing:

1. Integrating LLMs into reviewing systems: Publishers could use Generative AI to assist with peer review by helping to screen manuscripts, select reviewers, and verify the identity of authors. This could potentially speed up the review process and improve the quality of reviews.

2. Enhancing scientific writing: LLMs could be used to help researchers generate more accurate and concise scientific writing, as well as to suggest potential areas for further research based on existing literature. This could help to improve the clarity and impact of scientific papers.

3. Facilitating data analysis: Generative AI could be used to help researchers analyze large datasets more quickly and accurately, by identifying patterns and relationships that might be missed by human 

In [52]:
query = "What is the potential impact of using Generative AI in publishing scientific research?"
process_llm_response(qa_chain, query)

Query: What is the potential impact of using Generative AI in publishing scientific research?

Inference time: 6.147 sec.

Result:  The use of Generative AI in publishing scientific research has the potential to significantly transform the nature of scientific papers. AI-assisted writing and reviewing could lead to a world where scientific papers are generated more quickly and accurately, with the potential for new insights and discoveries to be uncovered through the use of AI to analyze and interpret data. However, there are also concerns about the reliability and reproducibility of results generated by AI, as well as the potential for AI to perpetuate existing biases and inequalities in scientific research. As such, it is important that the use of AI in scientific publishing is approached with caution and rigor, and that the potential benefits and drawbacks are carefully considered and addressed.

metadata: {'page': 0, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.

In [49]:
query = "Any downside of using Generative AI to publish scientific research?"
process_llm_response(qa_chain, query)

Query: Any downside of using Generative AI to publish scientific research?

Inference time: 0.412 sec.

Result:  Yes, according to some experts, the downside of using Generative AI to publish scientific research is the tendency of these tools to make up information and references, which could potentially lead to inaccurate or misleading results. While tools such as scite and Elicit have already launched search tools that use LLMs to provide researchers with natural-language answers to queries, the largest human-generated review that Mineault has seen included around 1,600 papers, and working with generative AI could take it much further. However, the question is whether the tools' tendency to make up information and references can be addressed adequately. As one expert puts it, "Anything disruptive like this can be quite worrying."

metadata: {'page': 0, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [53]:
query = "Give me 1 key concern of researchers rely on ChatGPT to publish their scientific research?"
process_llm_response(qa_chain, query)

Query: Give me 1 key concern of researchers rely on ChatGPT to publish their scientific research?

Inference time: 4.43 sec.

Result:  The key concern of researchers relying on ChatGPT to publish their scientific research is that they could rely on ChatGPT to whip up reviews with little thought, although directly asking an LLM to review a manuscript is likely to produce little of value beyond summaries and copy-editing suggestions, according to Mohamad Hosseini, who studies research ethics and integrity at Northwestern University’s Galter Health Sciences Library and Learning Center in Chicago, Illinois.

metadata: {'page': 2, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [68]:
query = "Why generative AI is “automated plagiarism by design”?"
process_llm_response(qa_chain, query)
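# The printed metadata is wrong: the quote comes from 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'.
# Only the first retrieved chunk's metadata is printed, and here that chunk belongs to a different PDF.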

Query: Why generative AI is “automated plagiarism by design”?

Inference time: 6.068 sec.

Result:  According to Iris van Rooij, a cognitive scientist at Radboud University in Nijmegen, the Netherlands, generative AI is “automated plagiarism by design” because users have no idea where such tools source their information from when they trawl the internet without concern for bias, consent, or copyright. This means that researchers may unknowingly use information from unreliable sources, which could lead to plagiarism or the spread of false information. Van Rooij argues that if researchers were more aware of this problem, they wouldn't want to use generative AI tools.

metadata: {'page': 40, 'source': 'NYSE_AXP_2021.pdf'}
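Because `process_llm_response` prints the metadata of `source_documents[0]` only, while the retriever returns two chunks, the printed source is not necessarily the chunk the answer was drawn from. A small sketch that lists every retrieved source instead; `StubDocument` and the file names are hypothetical stand-ins so the example runs without the vector store.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing only the `metadata` attribute used below."""
    metadata: dict = field(default_factory=dict)


def list_sources(source_documents):
    """Return (source, page) for every retrieved chunk, in retrieval order."""
    return [(doc.metadata.get("source"), doc.metadata.get("page"))
            for doc in source_documents]


# Illustrative values only, not taken from the run above.
retrieved = [
    StubDocument({"page": 3, "source": "first.pdf"}),
    StubDocument({"page": 0, "source": "second.pdf"}),
]
sources = list_sources(retrieved)
for source, page in sources:
    print(f"{source} (page {page})")
```

With the real chain, the same function could be applied to `llm_response['source_documents']` to see whether chunks from more than one PDF reached the prompt.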


In [69]:
query = "What is the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science?"
process_llm_response(qa_chain, query)

Query: What is the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science?

Inference time: 4.895 sec.

Result:  According to the article, the most popular answer that Nature surveyed researchers on what they thought the biggest benefits of generative AI might be for science was "accelerating the pace of discovery" with 43% of respondents selecting this option. Other benefits that received significant support included "improving the quality of research" (32%) and "enhancing the accessibility of scientific knowledge" (25%). The survey received 600 responses from researchers in various fields of science.

metadata: {'page': 0, 'source': 'HOW GENERATIVE AI COULD DISRUPT SCIENTIFIC PUBLISHING.pdf'}


In [20]:
query = "How generative AI tools could change the ways of researchers conduct meta-analyses and review?"
process_llm_response(qa_chain, query)

Query: How generative AI tools could change the ways of researchers conduct meta-analyses and review?

Inference time: 0.509 sec.

Result:  Generative AI tools could significantly change the way researchers conduct meta-analyses and reviews, according to Mineault. While companies such as scite and Elicit have already launched search tools that use LLMs to provide researchers with natural-language answers to queries, Mineault suggests that these tools could take meta-analyses and reviews much further. The largest human-generated review Mineault has seen included around 1,600 papers, but working with generative AI could potentially exploit much more of the scientific literature. However, Mineault also notes that the tools' tendency to make up information and references needs to be addressed adequately. The use of generative AI could change how researchers drill into the aspects of a study that are most relevant to them and access descriptions of results tailored to their needs, potential

In [70]:
query = "Who is Iris van Rooij? "
process_llm_response(qa_chain, query)
# Again the wrong metadata, maybe because two people share the same name: one works in AE, ano

Query: Who is Iris van Rooij? 

Inference time: 5.923 sec.

Result:  Iris van Rooij is a Senior Consultant in the Privacy, Data Protection, Data Governance, and Information and Cyber Security practice at Capgemini Invent. She has over 20 years of experience in the field of data protection and privacy, and has worked with a variety of clients in different industries, including financial services, healthcare, and retail. Her expertise includes data governance, data retention, and data breach response. She is also a certified Information Privacy Professional (CIPP/E) and a member of the International Association of Privacy Professionals (IAPP).

metadata: {'page': 15, 'source': 'NYSE_AXP_2021.pdf'}


In [22]:
query = "Give me 2 comments from the scientists or researchers of using Generative AI in their work?"
process_llm_response(qa_chain, query)

Query: Give me 2 comments from the scientists or researchers of using Generative AI in their work?

Inference time: 9.013 sec.

Result:  1. Gemma Conroy, a reporter for Nature, mentions that according to Yoshua Bengio, a computer scientist at the University of Montreal in Canada, Generative AI could revolutionize the way researchers access and analyze scientific literature. Bengio suggests that Generative AI could allow researchers to drill into the most relevant aspects of a study and access tailored descriptions of results. 2. Vincent Berdejo-Espinola, a computer scientist at the University of Oxford, UK, and his colleagues have used Generative AI to summarize scientific papers in a way that is more accessible to non-experts. Berdejo-Espinola believes that Generative AI has the potential to create new knowledge by synthesizing information from multiple sources, which could be particularly useful for solving complex problems in fields such as climate science and medicine. However, he 

In [23]:
qa_chain.retriever.search_type , qa_chain.retriever.vectorstore

('similarity', <langchain.vectorstores.chroma.Chroma at 0x1b40b9b40d0>)
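The 'similarity' search type means the retriever returns the k chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with cosine similarity on toy 2-D vectors; the metric actually used depends on how the Chroma collection is configured, so cosine is an assumption here, and `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), index)
              for index, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k([1.0, 0.1], doc_vecs, k=2)
print(best)
```

Note that this ranking always returns k results, even when none is a good match, which is one reason an unrelated chunk can end up in the prompt.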

In [24]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
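With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into the `{context}` slot of this template, and the query goes into `{question}`. A minimal sketch of that substitution using plain string formatting; the separator and the helper name `build_stuff_prompt` are assumptions for illustration, not the library's internals.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What do the chunks say?",
)
print(prompt)
```

Since every chunk is stuffed into a single prompt, two 2000-character chunks plus the question must fit within the model's context window, and a chunk from an unrelated document is passed to the model alongside the relevant one.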
