## News Query Bot

A news research tool built on Google's PaLM large language model, the Facebook AI Similarity Search (FAISS) vector 
store, and the LangChain framework. 

Implemented text embedding and integrated FAISS for efficient similarity search over ingested news article data. PaLM is not fine-tuned here; it answers from the passages retrieved for each question.

Natural language questions about the ingested news content are answered from the retrieved passages, supporting efficient research and information retrieval.

___

In [1]:
# Import the required libraries

from langchain.llms import GooglePalm
from langchain_community.llms import ollama

from langchain.document_loaders import UnstructuredURLLoader
from langchain_community.document_loaders import TextLoader

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.embeddings import GooglePalmEmbeddings

import warnings
warnings.filterwarnings('ignore')

In [2]:
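# Read the GooglePalm API key from the environment instead of hardcoding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; set it before running this cell.
import os

api_key = os.environ["GOOGLE_API_KEY"]

llm = GooglePalm(google_api_key=api_key, temperature=0.7)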

In [3]:
# Test the GooglePalm LLM with a sample query.

output = llm("write an email to invite for birthday")
print(output)

Dear [friend's name],

I hope this email finds you well.

I'm writing to invite you to my birthday party! I'm turning [age] on [date], and I'm having a party at [location] from [time] to [time].

I'm really excited to celebrate with all of my friends and family. There will be food, drinks, music, and dancing. I hope you can make it!

Please RSVP to [email address] by [date].

I can't wait to see you there!

Sincerely,
[Your name]


In [4]:
""" UnstructuredURLLoader is a class within the LangChain library specifically designed for loading web content from URLs 
and processing it as unstructured data. """

urls_list = ["https://www.moneycontrol.com/", "https://economictimes.indiatimes.com/news/international/world-news/maldives-parliamentary-elections-pro-china-partys-win-raises-concerns-for-india/articleshow/109493583.cms"]

loader = UnstructuredURLLoader(urls=urls_list)

In [5]:
""" The load() function is a method of the UnstructuredURLLoader class. 
It's responsible for the actual process of fetching and processing the data from the URLs provided. """

data = loader.load()
data

[Document(page_content="Close Ad\n\nEnglish\n\nHindi\n\nGujarati\n\nSpecials\n\nTrending Stocks\n\nVodafone Idea\xa0INE669E01016, IDEA, 532822\n\nCDSL\xa0INE736A01011, CDSL, 0\n\nGTL Infra\xa0INE221H01019, GTLINFRA, 532775\n\nReliance\xa0INE002A01018, RELIANCE, 500325\n\nSuzlon Energy\xa0INE040H01021, SUZLON, 532667\n\n\n\nQuotes\n\nMutual Funds\n\nCommodities\n\nFutures & Options\n\nCurrency\n\nNews\n\nCryptocurrency\n\nForum\n\nNotices\n\nVideos\n\nGlossary\n\nAll\n\nHello, Login\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t Hello, Login\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\tLog-in\r\n\t\t\t\t\t\t\t\t\t\t\tor \r\n\t\t\t\t\t\t\t\t\t\t\tSign-Up\r\n\t\
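The loaded text above contains long runs of tabs, carriage returns and non-breaking spaces left over from the page layout. A minimal sketch of a cleanup step that could be applied to each document's `page_content` before splitting is shown here; `normalize_whitespace` is a hypothetical helper, not part of LangChain, and the sample string is a shortened stand-in for the real output.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse layout whitespace while keeping paragraph breaks."""
    # Replace non-breaking spaces with ordinary spaces.
    text = text.replace("\xa0", " ")
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim every line so whitespace-only lines become empty.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Squeeze three or more newlines into one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Shortened, illustrative sample in the style of the loader output.
sample = "Hello, Login\r\n\t\t\t\r\n\t\t\tLog-in\r\n\t\t\tor \r\n\t\t\tSign-Up"
cleaned = normalize_whitespace(sample)
print(repr(cleaned))
```

Cleaning before splitting keeps the chunk budget for article text rather than layout whitespace.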

In [6]:
"""RecursiveCharacterTextSplitter Functionality: This class is a text splitter specifically designed for handling 
large chunks of text, like the content retrieved from a news article URL. 
It offers several features: recursive splitting, customizable separators (optional), and chunk_size control.
"""

from langchain.text_splitter import RecursiveCharacterTextSplitter

In [7]:
text_splitter = RecursiveCharacterTextSplitter(separators=['\n\n', '\n', '.', ','], chunk_size=1000)

"""split_documents takes the loaded Document objects and splits their text into smaller chunks
(here up to about 1000 characters each), carrying each document's metadata over to its chunks."""

docs = text_splitter.split_documents(data)
docs

[Document(page_content='Close Ad\n\nEnglish\n\nHindi\n\nGujarati\n\nSpecials\n\nTrending Stocks\n\nVodafone Idea\xa0INE669E01016, IDEA, 532822\n\nCDSL\xa0INE736A01011, CDSL, 0\n\nGTL Infra\xa0INE221H01019, GTLINFRA, 532775\n\nReliance\xa0INE002A01018, RELIANCE, 500325\n\nSuzlon Energy\xa0INE040H01021, SUZLON, 532667\n\n\n\nQuotes\n\nMutual Funds\n\nCommodities\n\nFutures & Options\n\nCurrency\n\nNews\n\nCryptocurrency\n\nForum\n\nNotices\n\nVideos\n\nGlossary\n\nAll', metadata={'source': 'https://www.moneycontrol.com/'}),
 Document(page_content='Hello, Login\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t Hello, Login\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\
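The idea behind recursive splitting can be sketched in plain Python: try the coarsest separator first, pack the pieces into chunks that fit, and fall back to the next separator only for pieces that are still too long. This is a simplified illustration, not LangChain's implementation; `recursive_split` is a hypothetical function, it drops separators when it recurses, and it has no chunk overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph about markets.\n\n"
    "Second paragraph, which is somewhat longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(text, ["\n\n", "\n", ".", ","], chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The separator order mirrors the one passed to the splitter above: paragraphs first, then lines, sentences, and clauses.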

In [8]:
"""
* Integration with Hugging Face:
The import below uses the Hugging Face ecosystem from within LangChain. 
Hugging Face is a popular platform for natural language processing (NLP) resources, including pre-trained models and tools.

* Hugging Face Instruct Embeddings: 
HuggingFaceInstructEmbeddings wraps INSTRUCTOR embedding models hosted on Hugging Face. 
Embeddings represent text as numerical vectors, so that semantically similar texts can be found 
by comparing their vectors.

* Focus on Instructions:
"Instruct" refers to instruction-conditioned embeddings: the model embeds each text together with a short task instruction,
so the same model can produce vectors suited to a task such as retrieval. In this project the embeddings serve the retrieval step; 
PaLM 2 then generates the answer from the retrieved text."""

from langchain.embeddings import HuggingFaceInstructEmbeddings

In [9]:
"""hkunlp/instructor-large is a pre-trained INSTRUCTOR embedding model hosted on the Hugging Face Hub. 
"large" denotes a larger variant of the model, which generally trades speed and memory for higher embedding quality."""

embedding = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")

load INSTRUCTOR_Transformer
max_seq_length  512


In [10]:
#embedding_2 = GooglePalmEmbeddings(google_api_key=api_key)

In [11]:
#embedding = OllamaEmbeddings()

In [12]:
"""The FAISS class pertains to the Facebook AI Similarity Search (FAISS) library, a powerful tool for efficient retrieval of 
similar items based on their vector representations.
LangChain integrates FAISS as a vectorstore, essentially a data structure designed to store and manage vector embeddings."""

from langchain.vectorstores import FAISS

In [13]:
# Embed and index only the first 30 chunks; later chunks are not searchable.
vectordb = FAISS.from_documents(documents=docs[:30], embedding=embedding)

In [14]:
vectordb

<langchain.vectorstores.faiss.FAISS at 0x195f120b5b0>
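Conceptually, a vector store keeps one embedding per chunk and answers a query by ranking the stored vectors against the query vector. The brute-force sketch below shows that idea with toy 3-dimensional vectors and cosine similarity; the vectors, texts and function names are made up for illustration, and the actual FAISS index may use a different distance metric and a much faster search structure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (score, text) pairs most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: (chunk text, made-up embedding).
toy_index = [
    ("Maldives election result", [0.9, 0.1, 0.0]),
    ("Trending stocks today", [0.1, 0.9, 0.1]),
    ("Generative AI deals at IT majors", [0.0, 0.2, 0.9]),
]

# Made-up embedding of a question about generative AI deals.
results = top_k([0.1, 0.1, 0.95], toy_index, k=2)
for score, text in results:
    print(round(score, 3), text)
```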

In [15]:
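# Create a retriever for querying the vector database.
# Note: the repr printed below shows search_kwargs={}, so score_threshold=0.7 is not being applied here.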

retriever = vectordb.as_retriever(score_threshold=0.7)
retriever

VectorStoreRetriever(tags=['FAISS'], metadata=None, vectorstore=<langchain.vectorstores.faiss.FAISS object at 0x00000195F120B5B0>, search_type='similarity', search_kwargs={})
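A score threshold is meant to drop retrieved chunks whose relevance score falls below a cutoff. In the LangChain versions I am aware of, this is usually requested with `search_type="similarity_score_threshold"` and `search_kwargs={"score_threshold": 0.7}`; treat that as an assumption to check against the installed version. The filtering itself can be sketched as follows, with made-up scores where higher means more relevant and `filter_by_score` as a hypothetical helper.

```python
def filter_by_score(scored_docs, score_threshold):
    """Keep only (score, text) pairs whose score meets the threshold."""
    return [(score, text) for score, text in scored_docs if score >= score_threshold]


# Made-up relevance scores in [0, 1]; higher means more relevant.
scored = [
    (0.91, "Generative AI deals at IT majors"),
    (0.72, "IT services earnings preview"),
    (0.35, "Trending stocks today"),
]

kept = filter_by_score(scored, score_threshold=0.7)
for score, text in kept:
    print(score, text)
```

With a threshold in place, a question that matches nothing well yields few or no chunks, which pairs naturally with the "I don't know." instruction in the prompt.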

In [16]:
from langchain.prompts import PromptTemplate

In [17]:
prompt_template = """Given the following context and a question, generate an answer based on this context only.
    In the answer try to provide as much text as possible from "response" section in the source document context without making much changes.
    If the answer is not found in the context, kindly state "I don't know." Don't try to make up an answer.

    CONTEXT: {context}

    QUESTION: {question}"""

    
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
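At query time the two placeholders are replaced with the retrieved text and the user's question. The sketch below shows the substitution with plain `str.format`, which behaves like an f-string style template for simple named placeholders; the template is shortened and the context and question strings are illustrative.

```python
# Shortened version of the prompt, with the same two placeholders.
template = """Given the following context and a question, generate an answer based on this context only.
If the answer is not found in the context, kindly state "I don't know."

CONTEXT: {context}

QUESTION: {question}"""

# Illustrative values standing in for retrieved chunks and a user query.
filled = template.format(
    context="IT majors have announced several generative AI deals.",
    question="Which companies announced generative AI deals?",
)
print(filled)
```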

In [18]:
"""The RetrievalQA class specifically defines a chain for implementing a Retrieval-based Question Answering (QA) system.
This type of system retrieves relevant information from a source (often text documents) and then uses an LLM like 
PaLM 2 to answer questions based on the retrieved information."""

from langchain.chains import RetrievalQA

In [19]:
"""RetrievalQA.from_chain_type(...) builds a RetrievalQA chain from a few settings:
chain_type="stuff" places all retrieved chunks into a single prompt, retriever supplies those chunks,
and chain_type_kwargs passes the custom PROMPT to the underlying LLM chain."""

chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=retriever,
                                    input_key="query",
                                    return_source_documents=True,
                                    chain_type_kwargs={"prompt": PROMPT}
                                   )

"""Overall Functionality:

This code defines a RetrievalQA chain that retrieves the news article chunks most relevant to the user's question (using the retriever) 
and then uses the provided LLM to answer the question from the retrieved text and the specified
prompt. Because return_source_documents is True, the chain also returns the retrieved chunks alongside the answer.
"""

chain

RetrievalQA(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, metadata=None, combine_documents_chain=StuffDocumentsChain(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, metadata=None, input_key='input_documents', output_key='output_text', llm_chain=LLMChain(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, metadata=None, prompt=PromptTemplate(input_variables=['context', 'question'], output_parser=None, partial_variables={}, template='Given the following context and a question, generate an answer based on this context only.\n    In the answer try to provide as much text as possible from "response" section in the source document context without making much changes.\n    If the answer is not found in the context, kindly state "I don\'t know." Don\'t try to make up an answer.\n\n    CONTEXT: {context}\n\n    QUESTION: {question}', template_format='f-string', validate_template=True), llm=GooglePalm(cache=N
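The flow of a "stuff" style retrieval chain can be sketched without any library: retrieve chunks, join them into one context string, fill the prompt, and pass it to the model. Everything below is a stand-in under stated assumptions: `fake_retrieve` and `fake_llm` are hypothetical stubs, and the returned dictionary only mirrors the `query`, `result` and `source_documents` keys used in this notebook.

```python
TEMPLATE = "CONTEXT: {context}\n\nQUESTION: {question}"


def fake_retrieve(question):
    """Stub retriever: returns fixed chunks regardless of the question."""
    return [
        "Accenture announced generative AI deals.",
        "TCS announced generative AI deals.",
    ]


def fake_llm(prompt):
    """Stub model: reports the prompt size instead of generating text."""
    return f"(stub answer; prompt had {len(prompt)} characters)"


def answer_with_stuffing(question, retrieve, llm, template):
    """Retrieve chunks, stuff them into one prompt, and call the model."""
    docs = retrieve(question)
    # "Stuffing": all retrieved chunks go into a single context block.
    context = "\n\n".join(docs)
    prompt = template.format(context=context, question=question)
    return {"query": question, "result": llm(prompt), "source_documents": docs}


out = answer_with_stuffing(
    "How do generative AI deals stack up?", fake_retrieve, fake_llm, TEMPLATE
)
print(out["result"])
print(len(out["source_documents"]), "source chunks")
```

Because every retrieved chunk is placed in one prompt, the number and size of chunks must fit within the model's input limit.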

In [20]:
# Run the chain on a question and print the generated answer.

print(chain("How generative AI deals stack up for Accenture, TCS, Infosys, Wipro")['result'])

Retrying langchain.llms.google_palm.generate_with_retry.<locals>._generate_with_retry in 2.0 seconds as it raised InternalServerError: 500 An internal error has occurred. Please retry or report in https://developers.generativeai.google/guide/troubleshooting.


Response: 

IT majors Accenture, TCS, Infosys and Wipro have been making strategic bets on artificial intelligence (AI) and other emerging technologies to drive growth. In the past year, these companies have announced several large generative AI deals, which are expected to help them accelerate innovation and improve their competitive position.

Accenture has been a leader in the AI space for several years, and it has been making significant investments in generative AI. In 2021, the company announced a $1 billion investment in generative AI and related technologies. This investment is expected to help Accenture develop new AI-powered solutions for its clients across industries.

TCS is another major player in the AI space, and it has also been making significant investments in generative AI. In 2021, the company announced a partnership with Google Cloud to develop new generative AI solutions. This partnership is expected to help TCS accelerate its AI efforts and deliver new value to i