# Text Summarization

### Imports

In [162]:
from dotenv import find_dotenv, load_dotenv
import os
import requests
import wikipedia

## Solution 1: Using BART (Bidirectional and Auto-Regressive Transformers)

**BART (Bidirectional and Auto-Regressive Transformers) is a transformer-based model developed by Facebook AI that is widely used for abstractive text summarization. Its bidirectional encoder reads the whole input, using context on both sides of each token, and its auto-regressive decoder generates the summary one token at a time, which helps produce coherent and contextually accurate summaries. Pretrained BART models can be fine-tuned for specific tasks, so they are useful beyond summarization. The checkpoint used here, `facebook/bart-large-cnn`, was pre-trained on English text and fine-tuned on the CNN/Daily Mail dataset.**

In [163]:
# Fetching the API key

_=load_dotenv(find_dotenv())
HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY")

Querying the BART model through the Hugging Face Inference API

In [164]:
API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
headers = {"Authorization": f"Bearer {HUGGINGFACE_API_KEY}"}

In [165]:
def query(payload):
    # POST the JSON payload to the hosted model and return the parsed JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
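
The `query` helper above returns whatever JSON the API sends back, including error payloads. A minimal sketch of a more defensive variant follows, assuming the Inference API answers with HTTP 503 while the model is still loading. The names `query_with_retry`, `post`, `max_retries` and `wait_seconds` are hypothetical and introduced only for illustration; `post` stands in for a thin wrapper around `requests.post` so the retry logic can be tried without network access.

```python
import time
from typing import Any, Callable, Tuple

# Hypothetical transport type: takes the JSON payload and returns
# (HTTP status code, parsed JSON body).
PostFn = Callable[[dict], Tuple[int, Any]]


def query_with_retry(payload: dict, post: PostFn,
                     max_retries: int = 3, wait_seconds: float = 5.0) -> Any:
    """Call the API, retrying while it reports that the model is loading."""
    for _ in range(max_retries):
        status, body = post(payload)
        if status == 200:
            return body
        if status == 503:
            # Assumed meaning: the model is not ready yet, so wait and retry.
            time.sleep(wait_seconds)
            continue
        # Any other status is treated as a hard failure.
        raise RuntimeError(f"Inference API returned {status}: {body}")
    raise TimeoutError(f"Model still unavailable after {max_retries} attempts")


# Usage with a fake transport that fails once, then succeeds.
_responses = iter([(503, {"error": "loading"}), (200, [{"summary_text": "ok"}])])
result = query_with_retry({"inputs": "some text"}, lambda p: next(_responses),
                          wait_seconds=0)
print(result)
```

With the real API, `post` could be a small function that calls `requests.post(API_URL, headers=headers, json=payload)` and returns `(response.status_code, response.json())`.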

In [166]:
# Set the language for Wikipedia
wikipedia.set_lang("en")  # Change to your desired language

# Define the Wikipedia page title you want to summarize
page_title = "Salman Khan"

# Fetch the content of the Wikipedia page
page_content = wikipedia.page(page_title).content


In [167]:
# Wikipedia Page Content

display(page_content)

'Abdul Rashid Salim Salman Khan (pronounced [səlˈmɑːn xɑːn]; born 27 December 1965) is an Indian actor, film producer, writer and television personality who works predominantly in Hindi films. In a film career spanning over thirty five years, Khan has received numerous awards, including two National Film Awards as a film producer, and two Filmfare Awards as an actor. He is cited in the media as one of the most commercially successful actors of Indian cinema. Forbes has included Khan in listings of the highest-paid celebrities in the world, in 2015 and 2018, with him being the highest-ranked Indian in the latter year.The eldest son of screenwriter Salim Khan, Khan began his acting career with a supporting role in Biwi Ho To Aisi (1988), followed by his breakthrough with a leading role in Sooraj Barjatya\'s romance Maine Pyar Kiya (1989). He established himself in the 1990s, with several commercially successful films, including Barjatya\'s family dramas Hum Aapke Hain Koun..! (1994) and 

In [168]:
output = query({
    "inputs": page_content,
    "parameters": {"min_length": 100, "max_length": 500}
})

### Final Output - BART

In [169]:
summary = output[0]['summary_text']
print(summary)

Abdul Rashid Salim Salman Khan is an Indian actor, film producer, writer and television personality who works predominantly in Hindi films. He is cited in the media as one of the most commercially successful actors of Indian cinema. In a film career spanning over thirty five years, Khan has received numerous awards, including two National Film Awards as a film producer and two Filmfare Awards as an actor. He has starred in the highest-grossing Hindi films of 10 years, the highest for any actor. In 2015, he was convicted of culpable homicide for a negligent driving case in which he ran over five people with his car, killing one. On 5 April 2018, Khan was convicted in a blackbuck poaching case and sentenced to five years imprisonment.
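
Indexing `output[0]['summary_text']` directly works when the call succeeds, but it fails with an unhelpful error if the API returns something else. The sketch below is one possible guard, assuming error responses arrive as a dict with an `"error"` key and successful summarization responses as a list of dicts; `extract_summary` is a hypothetical helper, not part of any library.

```python
from typing import Any


def extract_summary(output: Any) -> str:
    """Return the summary text, or raise if the response is not a summary."""
    # Assumption: error responses are a dict carrying an "error" message.
    if isinstance(output, dict) and "error" in output:
        raise RuntimeError(f"Inference API error: {output['error']}")
    # Successful responses are expected to be a non-empty list of dicts.
    if (isinstance(output, list) and output
            and isinstance(output[0], dict) and "summary_text" in output[0]):
        return output[0]["summary_text"]
    raise ValueError(f"Unexpected response shape: {output!r}")


print(extract_summary([{"summary_text": "A short summary."}]))
```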


## Solution 2: Using Prompt Engineering/LLMs

**LangChain is an open-source framework designed to simplify building applications with large language models (LLMs).
A typical LangChain pipeline works as follows: the user asks a question, the vector representation of the question is used to run a similarity search against a vector database, and the relevant information retrieved is passed to the language model. The language model then generates an answer or takes an action. This notebook uses only LangChain's summarization chains, so no vector database is involved.**

In [170]:
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
import textwrap
import wikipedia

In [171]:
# Set the language for Wikipedia
wikipedia.set_lang("en")  # Change to your desired language

# Define the Wikipedia page title you want to summarize
page_title = "Salman Khan"

# Fetch the content of the Wikipedia page
page_content = wikipedia.page(page_title).content



### Stuff
The stuff documents chain ("stuff" as in "to stuff" or "to fill") is the most straightforward of the document chains. It takes a list of documents, inserts them all into a prompt and passes that prompt to an LLM.
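
To make the idea concrete, here is a minimal sketch of the "stuff" strategy in plain Python. It is not LangChain's implementation: the prompt wording, the `"\n\n"` separator, and the names `stuff_summarize` and `fake_llm` are assumptions chosen for illustration, and `fake_llm` is a stand-in for a real model call.

```python
from typing import Callable, List

# Illustrative prompt; the exact wording is an assumption.
STUFF_PROMPT = 'Write a concise summary of the following:\n"{text}"\nCONCISE SUMMARY:'

prompts_seen: List[str] = []


def stuff_summarize(docs: List[str], llm: Callable[[str], str]) -> str:
    """Join every document into one prompt and make a single LLM call."""
    combined = "\n\n".join(docs)
    return llm(STUFF_PROMPT.format(text=combined))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: records the prompt and reports its size.
    prompts_seen.append(prompt)
    return f"summary of a {len(prompt)}-character prompt"


print(stuff_summarize(["First chunk.", "Second chunk."], fake_llm))
```

Because everything goes into one prompt, this approach only works while the combined text fits in the model's context window.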

In [174]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")
chain = load_summarize_chain(llm, chain_type="stuff")

text_splitter = CharacterTextSplitter()
docs = text_splitter.create_documents([page_content])

output_summary = chain.run(docs)
wrapped_text = textwrap.fill(output_summary,
                             width=100,
                             break_long_words=False,
                             replace_whitespace=False)
print(wrapped_text)

Created a chunk of size 4089, which is longer than the specified 4000
Created a chunk of size 6339, which is longer than the specified 4000


Abdul Rashid Salim Salman Khan, known as Salman Khan, is an Indian actor, film producer, writer, and
television personality. He has received numerous awards and is considered one of the most
commercially successful actors in Indian cinema. Khan has starred in several high-grossing films and
is known for his roles in action and drama movies. He is also involved in philanthropy through his
charity, Being Human Foundation. Khan has been involved in various controversies and legal troubles,
including a hit-and-run case and poaching cases. Despite these controversies, he remains a popular
and influential figure in the Indian entertainment industry.


### Map-Reduce
If you have multiple pages you'd like to summarize, you'll likely run into a token limit. Token limits won't always be a problem, but it is good to know how to handle them if you run into the issue.

The "map_reduce" chain type helps with this: you first generate a summary of each smaller chunk (one that fits within the token limit), and then you generate a summary of those summaries.
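
The two steps can be sketched in plain Python as follows. This is a simplified illustration rather than LangChain's implementation: the prompt wording and the names `map_reduce_summarize` and `fake_llm` are assumptions, and `fake_llm` is a stand-in for a real model call.

```python
from typing import Callable, List

# Illustrative prompts; the wording is an assumption.
MAP_PROMPT = 'Write a concise summary of the following:\n"{text}"\nCONCISE SUMMARY:'
COMBINE_PROMPT = 'Write a concise summary of these summaries:\n"{text}"\nCONCISE SUMMARY:'


def map_reduce_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    # Map step: summarize each chunk independently (one LLM call per chunk).
    partial_summaries = [llm(MAP_PROMPT.format(text=chunk)) for chunk in chunks]
    # Reduce step: summarize the joined partial summaries (one more call).
    combined = "\n\n".join(partial_summaries)
    return llm(COMBINE_PROMPT.format(text=combined))


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: records each prompt it receives.
    calls.append(prompt)
    return f"summary #{len(calls)}"


final = map_reduce_summarize(["chunk one", "chunk two", "chunk three"], fake_llm)
print(final, "after", len(calls), "LLM calls")
```

For N chunks this makes N + 1 model calls, so it costs more than the "stuff" approach but is not bound by the size of the full document.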

In [175]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [176]:
llm.get_num_tokens(page_content)

7582

In [177]:
# Split the text into smaller chunks (chunk_size and chunk_overlap are measured in characters)
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=17000, chunk_overlap=1000)

# Wrap each chunk in a Document to pass to the LLM
docs = text_splitter.create_documents([page_content])

num_docs = len(docs)

num_tokens_first_doc = llm.get_num_tokens(docs[0].page_content)

print(f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens")

Now we have 2 documents and the first one has 3988 tokens
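
Note that `chunk_size` counts characters by default, which is why a 17000-character limit produces a first chunk of roughly 4000 tokens here. The sketch below shows the general idea of packing paragraphs into size-limited chunks. It is a deliberately simplified stand-in, not the `RecursiveCharacterTextSplitter` algorithm: it uses a single separator, has no overlap, and `pack_paragraphs` is a hypothetical name.

```python
from typing import List


def pack_paragraphs(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily pack paragraphs into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for paragraph in text.split(separator):
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            # Either it fits, or a single oversized paragraph is kept whole.
            current = candidate
        else:
            # The paragraph does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(pack_paragraphs(sample, chunk_size=10))
```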


In [178]:
summary_chain = load_summarize_chain(llm=llm, chain_type='map_reduce',
#                                      verbose=True # Set verbose=True if you want to see the prompts being used
                                    )

In [179]:
output = summary_chain.run(docs)

### Summary - GPT

In [180]:
print(output)

Abdul Rashid Salim Salman Khan is a highly successful Indian actor, film producer, writer, and television personality. He is known for his work in Hindi films and has received numerous awards. Khan is considered one of the most commercially successful actors in Indian cinema and has been listed as one of the highest-paid celebrities in the world. He gained fame in the 1990s and achieved even greater stardom in the 2010s. Khan is also a television presenter and promotes humanitarian causes through his charity, Being Human Foundation. However, he has faced controversy and legal troubles throughout his career.


### To get the summary in bullet points

In [181]:
from langchain.prompts import PromptTemplate

# Prompt applied to each chunk in the map step
map_prompt = """
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

In [182]:
combine_prompt = """
Write a concise summary of the following text delimited by triple backquotes.
Return your response in bullet points which covers the key points of the text.
```{text}```
BULLET POINT SUMMARY:
"""
combine_prompt_template = PromptTemplate(template=combine_prompt, input_variables=["text"])
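
Each template has a single `{text}` placeholder: the map prompt receives one chunk, and the combine prompt receives the joined chunk summaries. A quick way to preview what the model will see is to fill the placeholder by hand. The sketch below uses plain `str.format` on the map prompt for that purpose only; the sample sentence is made up for illustration.

```python
map_prompt = """
Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:
"""

# Plain string formatting, used here only to preview the final prompt text.
preview = map_prompt.format(text="Salman Khan is an Indian actor.")
print(preview)
```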

In [183]:
summary_chain = load_summarize_chain(llm=llm,
                                     chain_type='map_reduce',
                                     map_prompt=map_prompt_template,
                                     combine_prompt=combine_prompt_template,
#                                      verbose=True
                                    )

In [186]:
output = summary_chain.run(docs)

### Summary in bullet points

In [187]:
print(output)

- Salman Khan is a popular Indian actor, film producer, writer, and television personality
- He is known for his work in Hindi films and has received numerous awards
- Khan has starred in several top-grossing films, including Wanted, Dabangg, Bajrangi Bhaijaan, and Sultan
- He is also a television presenter and has hosted successful shows like "10 Ka Dum" and "Bigg Boss"
- Khan promotes humanitarian causes through his charity, Being Human Foundation
- He has been involved in various brand endorsements, promoting products such as Thums Up, Pepsi, and Suzuki motorcycles
- Khan's personal life has been marred by controversy and legal troubles, including convictions for a negligent driving case and a blackbuck poaching case
- He has won numerous awards for his acting and has a successful career in Bollywood.


# Summary

### Text summarization methods fall into two broad families: extractive and abstractive. The right technique depends on the specific requirements of the task.

**BART is a good choice when you want a dedicated, pre-trained summarization model and context and coherence are critical. It generates summaries abstractively, although the BART output above stays close to the wording of the source. It suits tasks where maintaining context and readability is essential.**

**GPT is a versatile choice if you have access to a large pretrained model such as GPT-3.5 (used here) or GPT-4. It can generate creative, abstractive summaries in a requested format, such as the bullet points above, but may require careful prompt design or fine-tuning for your specific task.**

# Thank you!