In [6]:
!pip install -U langchain-community -q

In [7]:
import warnings
warnings.filterwarnings('ignore')

!pip install groq -q

In [8]:
!pip install langchain faiss-cpu transformers sentence-transformers pypdf -q

In [9]:
!pip install python-dotenv -q

import os
from dotenv import load_dotenv

# load environment variables from .env into os.environ
load_dotenv()

# fail early with a clear message if a required key is missing
for key in ("GROQ_API_KEY", "HF_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; add it to your .env file")


In [10]:
#load the pdf file
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("revised.pdf")
documents = loader.load()

In [11]:
print(documents[7].page_content)

Students at Cambridge University, who come from less affluent backgrounds, are being offered up to
1,000 a year under a bursary scheme.
This sentence contains a non-restrictive relative clause: who come from less affluent backgrounds. This is a form
of parenthetical comment. The sentence implies that most/all students at Cambridge come from less affluent
backgrounds. What the reporter probably meant was a restrictive relative, which should not have commas round it:
Students at Cambridge University who come from less affluent backgrounds are being offered up to 1,000
a year under a bursary scheme.
A restrictive relative is a type of modifier: that is, it further specifies the entities under discussion. Thus it refers to a
subset of the students.
1.6 Information retrieval, information extraction and question answering
Information retrieval involves returning a set of documents in response to a user query: Internet search engines are a
form of IR. However, one change from classical IR is that

In [12]:
# split the document into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
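
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding window. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, whereas the hypothetical `chunk_with_overlap` helper below cuts at fixed character offsets. The overlap idea is the same: consecutive chunks share their boundary text, so a sentence cut at a chunk edge still appears whole in one of them.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=1000` and `chunk_overlap=200`, as used in this notebook, each chunk would repeat roughly the last 200 characters of the previous one.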

In [13]:
docs

[Document(metadata={'source': 'revised.pdf', 'page': 0}, page_content='Natural Language Processing\n2004, 8 Lectures\nAnn Copestake (aac@cl.cam.ac.uk)\nhttp://www.cl.cam.ac.uk/users/aac/\nCopyright c\rAnn Copestake, 2003\x962004\nLecture Synopsis\nAims\nThiscourse aims to introduce the fundamental techniques of natural language processing and to develop an under-\nstanding of the limits of those techniques. It aims to introduce some current research issues, and to evaluate some\ncurrent and potential applications.\n\x0f Introduction. Brief history of NLP research, current applications, generic NLP system architecture, knowledge-\nbased versus probabilistic approaches.\n\x0f Finite-state techniques. In\x03ectional and derivational morphology, \x02nite-state automata in NLP, \x02nite-state\ntransducers.\n\x0f Prediction and part-of-speech tagging. Corpora, simple N-grams, word prediction, stochastic tagging, evalu-\nating system performance.\n\x0f Parsing and generation. Generative gramm

In [14]:
# generate embeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

In [15]:
embeddings

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [16]:
#create a FAISS vector database
from langchain_community.vectorstores import FAISS

vector_db = FAISS.from_documents(docs, embeddings)

In [17]:
vector_db

<langchain_community.vectorstores.faiss.FAISS at 0x19777b980e0>

In [18]:
!pip install -qU langchain-groq

In [19]:
# retriever: return the 3 chunks most similar to the query
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_groq import ChatGroq

retriever = vector_db.as_retriever(search_type="similarity", search_kwargs={"k": 3})
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="Answer the question based on the following context: {context}\n\nQuestion: {question}\nAnswer:",
)

#model
model = ChatGroq(model="llama-3.3-70b-versatile")
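
What a `similarity` retriever with `k=3` does can be sketched in plain Python: embed the query, measure its distance to every stored chunk vector, and keep the three nearest. The toy two-dimensional vectors, chunk labels, and helper names below are made up for illustration. The sketch assumes Euclidean (L2) distance, which I believe is the default in LangChain's FAISS wrapper; treat that as an assumption rather than a guarantee. FAISS also uses optimized index structures instead of this brute-force loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=3):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Toy "index": (chunk text, embedding) pairs with invented 2-d vectors.
toy_index = [
    ("chunk about parsing", [1.0, 0.0]),
    ("chunk about tagging", [0.0, 1.0]),
    ("chunk about morphology", [0.8, 0.6]),
    ("chunk about discourse", [-1.0, 0.0]),
]

results = top_k([1.0, 0.0], toy_index, k=3)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

Because the embedding model ends with a `Normalize()` layer, all vectors have unit length, so ranking by L2 distance gives the same order as ranking by cosine similarity.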

In [20]:
qa_chain = RetrievalQA.from_chain_type(
    llm=model,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt})
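
Conceptually, the `stuff` chain type concatenates the retrieved chunks into one string, substitutes it for `{context}` in the prompt, and sends the result to the model in a single call. A minimal sketch of that assembly step follows. The `build_stuff_prompt` helper and the sample chunk texts are hypothetical, and the blank-line separator is an assumption about the chain's default.

```python
template = (
    "Answer the question based on the following context: {context}\n\n"
    "Question: {question}\nAnswer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join retrieved chunk texts into one context string and fill the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Placeholder texts standing in for the retriever's 3 results.
retrieved = ["First retrieved chunk.", "Second retrieved chunk.", "Third retrieved chunk."]
final_prompt = build_stuff_prompt(retrieved, "summarize this document")
print(final_prompt)
```

One consequence of this design is that the model only ever sees the `k` retrieved chunks, not the whole PDF. A broad request such as a whole-document summary is therefore answered from three chunks.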

In [26]:
query = "summarize this document"
response = qa_chain.invoke({"query": query})

In [27]:
print(response['result'])

The document discusses the relationship between discourse structure and summarization in Natural Language Processing (NLP). It explains that discourse relations can be represented as a binary branching tree structure, where some relationships, such as Explanation, have a main phrase (nucleus) and a subsidiary phrase (satellite) that can be removed without losing coherence. This can be exploited in summarization. However, other relationships, like Narration, give equal weight to both elements, making summarization more challenging. The document also provides an overview of the NLP field, its subareas, and methodologies, and outlines the structure of a course that covers these topics in more detail. 

Key points:

* Discourse structure can be represented as a binary branching tree
* Some relationships (e.g. Explanation) have a main and subsidiary phrase that can be used for summarization
* Other relationships (e.g. Narration) give equal weight to both elements, making summarization harde
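
Since the chain was built with `return_source_documents=True`, the response also carries the chunks the answer was based on, which helps when checking where a summary came from. The sketch below formats such chunks for display. `FakeDocument` is a hypothetical stand-in that mirrors the `page_content` and `metadata` attributes of a LangChain `Document`, so the example runs without the chain. The `source` and `page` keys are the ones visible in the `docs` output earlier in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in with the same attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe_sources(source_documents, preview_chars=60):
    """Return one 'file (page N): preview' line per source chunk."""
    lines = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{source} (page {page}): {preview}")
    return lines


demo = [FakeDocument("Natural Language\nProcessing", {"source": "revised.pdf", "page": 0})]
print("\n".join(describe_sources(demo)))
```

With the real chain, the call would be `describe_sources(response["source_documents"])`. Page numbers in the metadata appear to start at 0, as in the `docs` output above.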