# Notebook Info
This notebook **summarizes a given text** using three different **pre-trained language models** (BART, T5, PEGASUS). The code follows these steps for each model:

1. **Loading the Model**: First, the BART, T5, and PEGASUS models, along with their respective tokenizers, are loaded.
2. **Tokenizing the Text**: The given text is tokenized into a format that each model can understand.
3. **Summarization**: Each model generates a summary of the provided text. **Beam search** is used during decoding to improve the quality of the generated summary.
4. **Decoding Results**: The generated summary from each model is decoded back into text and printed on the screen.
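
To make the beam search mentioned in step 3 more concrete, here is a toy sketch in plain Python. The probability table `NEXT_TOKEN_PROBS` and the function `beam_search` are made up for illustration and are not part of `transformers`; the real `generate()` works on model logits and is considerably more involved.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(num_beams, max_steps=5):
    """Return the highest-scoring sequence found with `num_beams` beams."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_steps):
        # Extend every live beam by every possible next token.
        candidates = [
            (score + math.log(prob), tokens + [token])
            for score, tokens in beams
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items()
        ]
        # Keep only the `num_beams` best candidates.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for candidate in candidates[:num_beams]:
            if candidate[1][-1] == "</s>":
                finished.append(candidate)
            else:
                beams.append(candidate)
        if not beams:
            break
    return max(finished or beams, key=lambda item: item[0])[1]

print(beam_search(num_beams=1))  # greedy decoding
print(beam_search(num_beams=2))
```

With a single beam the search commits to "the" because it is the most likely first token, while two beams keep "a" alive long enough to find "a cat", which has the higher overall probability in this toy table (0.36 versus 0.33).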

**Outcome**: The entire code allows you to summarize the same text using three different models and compare the summaries produced by each model. This helps users understand which model provides the most suitable results.

In [1]:
import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [2]:
!pip install transformers



# Install Transformers
```python
!pip install transformers
```
- This installs the `transformers` library, which is developed by Hugging Face and provides pre-trained models like BART, T5, and PEGASUS for various NLP tasks, including summarization.

# BART Model

In [3]:
from transformers import BartForConditionalGeneration, BartTokenizer
# Load the BART model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)
# Text to summarize
text = """
LangChain is an open-source framework that helps developers build applications powered by large language models. 
With LangChain, you can easily create and manage chains of different language model components that help
automate complex tasks such as question answering, summarization, and text generation. The framework integrates
with various external tools, allowing it to expand the capabilities of AI models for specific use cases.
"""
# Tokenize the text
inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True)

# Run the model to generate the summary
summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("BART Summary: ", summary)

BART Summary:  LangChain is an open-source framework that helps developers build applications powered by large language models. The framework integrateswith various external tools, allowing it to expand the capabilities of AI models for specific use cases. With LangChain, you can easily create and manage chains of different language model components.


# Step 1: **BART Model**

```python
from transformers import BartForConditionalGeneration, BartTokenizer
```
- Imports the `BartForConditionalGeneration` and `BartTokenizer` classes from the `transformers` library.
- `BartForConditionalGeneration`: This is the BART model for conditional generation tasks, such as summarization.
- `BartTokenizer`: Tokenizer used to convert input text into tokens that the BART model can understand.

```python
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)
```
- Loads the pre-trained BART model (`facebook/bart-large-cnn`) and its tokenizer.
- `model` stores the BART model, which will be used for summarization.
- `tokenizer` is used to convert the input text into tokens that the model can process.

```python
text = """
LangChain is an open-source framework that helps developers build applications powered by large language models. 
With LangChain, you can easily create and manage chains of different language model components that help
automate complex tasks such as question answering, summarization, and text generation. The framework integrates
with various external tools, allowing it to expand the capabilities of AI models for specific use cases.
"""
```
- This is the input text that you want to summarize. In this case, it's about the LangChain framework.

```python
inputs = tokenizer([text], max_length=1024, return_tensors="pt", truncation=True)
```
- Tokenizes the input text using the tokenizer.
- `max_length=1024`: Caps the tokenized input at 1,024 tokens, the longest input this BART checkpoint accepts.
- `return_tensors="pt"`: Returns the tokenized text as PyTorch tensors.
- `truncation=True`: Cuts off any tokens beyond `max_length` instead of passing an over-long input to the model.
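
The effect of `max_length` together with `truncation=True` can be pictured with a toy sketch. This is plain Python, not the tokenizer's implementation: the ids `BOS_ID` and `EOS_ID`, the helper `encode_with_truncation`, and the made-up content ids are assumptions for illustration. The point it shows is that the length limit applies to the whole sequence, special tokens included.

```python
BOS_ID, EOS_ID = 0, 2  # assumed special-token ids for this toy example

def encode_with_truncation(content_ids, max_length):
    """Wrap content ids in BOS/EOS, trimming the content so the total fits."""
    room_for_content = max_length - 2  # reserve two slots for BOS and EOS
    return [BOS_ID] + content_ids[:room_for_content] + [EOS_ID]

content = list(range(100, 120))  # 20 made-up token ids

short = encode_with_truncation(content, max_length=1024)
clipped = encode_with_truncation(content, max_length=8)

print(len(short))    # nothing is cut: 20 content ids plus BOS and EOS
print(len(clipped))  # cut down to exactly max_length
print(clipped)
```

The sample text in this notebook is far shorter than 1,024 tokens, so truncation never triggers here; it only matters for long documents.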

```python
summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=50, length_penalty=2.0, num_beams=4, early_stopping=True)
```
- Runs the model to generate a summary.
- `inputs["input_ids"]`: The tokenized input text provided to the model.
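- `max_length=150`: Maximum length of the summary, in tokens.
- `min_length=50`: Minimum length of the summary, in tokens.
- `length_penalty=2.0`: Exponent applied to the sequence length when beam scores are normalized. Because the scores are log-probabilities (negative numbers), values above 0 favor longer summaries and values below 0 favor shorter ones.
- `num_beams=4`: Uses beam search with 4 beams, keeping the 4 most probable candidate summaries at each step.
- `early_stopping=True`: Stops beam search as soon as `num_beams` complete candidate summaries have been found.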
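
The role of `length_penalty` is easy to misread, so here is a small arithmetic sketch. It assumes the normalization described in the `transformers` generation documentation, where a hypothesis's summed log-probability is divided by its length raised to `length_penalty`. The function `beam_score` and the two example hypotheses are hypothetical, chosen only to show the effect.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score: summed log-probability / length ** penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two made-up finished hypotheses: (summed log-probability, length in tokens).
HYPOTHESES = {"short": (-4.0, 8), "long": (-10.0, 16)}

def preferred(length_penalty):
    """Name of the hypothesis with the higher normalized score."""
    return max(
        HYPOTHESES,
        key=lambda name: beam_score(*HYPOTHESES[name], length_penalty),
    )

for penalty in (0.0, 1.0, 2.0):
    print(penalty, preferred(penalty))
```

In this example the short hypothesis wins with a penalty of 0.0 or 1.0, and the long one wins at 2.0, which matches the rule that larger positive values favor longer output.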

```python
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```
- Decodes the summary back into text.
- `skip_special_tokens=True`: Removes special tokens (e.g., padding) from the summary.
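
What `skip_special_tokens=True` does can be sketched with a toy decoder. The vocabulary `ID_TO_TOKEN`, the set `SPECIAL_TOKENS`, and the function `decode` below are hypothetical stand-ins; the real tokenizer also merges subword pieces, which this sketch ignores.

```python
# Hypothetical mini-vocabulary; real vocabularies hold tens of thousands of entries.
ID_TO_TOKEN = {
    0: "<s>", 1: "<pad>", 2: "</s>",
    10: "LangChain", 11: "is", 12: "open-source",
}
SPECIAL_TOKENS = {"<s>", "<pad>", "</s>"}

def decode(token_ids, skip_special_tokens=False):
    """Map ids back to tokens, optionally dropping the special ones."""
    tokens = [ID_TO_TOKEN[token_id] for token_id in token_ids]
    if skip_special_tokens:
        tokens = [token for token in tokens if token not in SPECIAL_TOKENS]
    return " ".join(tokens)

generated = [0, 10, 11, 12, 2, 1, 1]  # start, three words, end, padding

print(decode(generated))
print(decode(generated, skip_special_tokens=True))
```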

```python
print("BART Summary: ", summary)
```
- Prints the summary generated by the BART model.

# T5 Model

In [4]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
# Load the T5 model and tokenizer
model_name = "t5-small"  # Alternatives: t5-base, t5-large
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)
# Text to summarize
text = """
LangChain is an open-source framework that helps developers build applications powered by large language models.
With LangChain, you can easily create and manage chains of different language model components that help
automate complex tasks such as question answering, summarization, and text generation. The framework integrates
with various external tools, allowing it to expand the capabilities of AI models for specific use cases.
"""
# Build the input in the format T5 expects for summarization
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
# Run the model to generate the summary
summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=50, num_beams=4, early_stopping=True)
# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("T5 Summary: ", summary)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


T5 Summary:  LangChain is an open-source framework that helps developers build applications powered by large language models. the framework integrates with various external tools, allowing it to expand the capabilities of AI models for specific use cases. the framework is designed to help developers build applications powered by large language models.


# Step 2: **T5 Model**
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
```
- Imports the `T5ForConditionalGeneration` and `T5Tokenizer` classes from the `transformers` library.
- `T5ForConditionalGeneration`: T5 model for text generation tasks like summarization.
- `T5Tokenizer`: Tokenizer for the T5 model.

```python
model_name = "t5-small"  # Alternative: t5-base, t5-large
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)
```
- Loads a pre-trained T5 model (`t5-small`) and its tokenizer.
- `t5-small` is the smallest T5 checkpoint. `t5-base` and `t5-large` are larger alternatives that typically produce better summaries at the cost of more memory and compute.

```python
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
```
- Prepares the input text for T5 by prefixing it with `summarize:`, which instructs the model to perform summarization.
- `max_length=512`: Caps the tokenized input, including the prefix, at 512 tokens.
- `truncation=True`: Ensures the text is truncated if it exceeds the maximum length.
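
The prefix step can be wrapped in a small helper. This is a sketch, not part of the notebook: the names `T5_PREFIXES` and `build_t5_input` are introduced here for illustration, and stripping the surrounding whitespace of the triple-quoted text is an optional choice the original cell does not make. The translation prefix is included only to show that the same mechanism selects other T5 tasks.

```python
# Task prefixes used with the original T5 checkpoints.
T5_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}

def build_t5_input(task, text):
    """Prepend the task prefix so T5 knows which task to perform."""
    # strip() drops the leading/trailing newlines of a triple-quoted string.
    return T5_PREFIXES[task] + text.strip()

sample = """
LangChain is an open-source framework.
"""

print(build_t5_input("summarize", sample))
```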

```python
summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=50, num_beams=4, early_stopping=True)
```
- Generates the summary using the T5 model with the same length and beam search settings as the BART call, except that `length_penalty` is not passed, so the model's default value is used.

```python
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```
- Decodes the summary back into text.

```python
print("T5 Summary: ", summary)
```
- Prints the summary generated by the T5 model.

# PEGASUS Model

In [5]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
# Load the PEGASUS model and tokenizer
model_name = "google/pegasus-large"  # Alternative: google/pegasus-xsum
model = PegasusForConditionalGeneration.from_pretrained(model_name)
tokenizer = PegasusTokenizer.from_pretrained(model_name)
# Text to summarize
text = """
LangChain is an open-source framework that helps developers build applications powered by large language models.
With LangChain, you can easily create and manage chains of different language model components that help
automate complex tasks such as question answering, summarization, and text generation. The framework integrates
with various external tools, allowing it to expand the capabilities of AI models for specific use cases.
"""
# Tokenize the text
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
# Run the model to generate the summary
summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=50, num_beams=4, early_stopping=True)
# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("PEGASUS Summary: ", summary)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PEGASUS Summary:  LangChain is an open-source framework that helps developers build applications powered by large language models. The framework integrates with various external tools, allowing it to expand the capabilities of AI models for specific use cases. The framework integrates with various external tools, allowing it to expand the capabilities of AI models for specific use cases.


# Step 3: **PEGASUS Model**
```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
```
- Imports the `PegasusForConditionalGeneration` and `PegasusTokenizer` classes from the `transformers` library.
- `PegasusForConditionalGeneration`: PEGASUS model for summarization.
- `PegasusTokenizer`: Tokenizer for the PEGASUS model.

```python
model_name = "google/pegasus-large"  # Alternative: google/pegasus-xsum
model = PegasusForConditionalGeneration.from_pretrained(model_name)
tokenizer = PegasusTokenizer.from_pretrained(model_name)
```
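- Loads the pre-trained PEGASUS model (`google/pegasus-large`) and its tokenizer. This checkpoint is pre-trained only; it has not been fine-tuned on a specific summarization dataset.
- `google/pegasus-xsum` is an alternative checkpoint fine-tuned on the XSum dataset, which targets very short ("extreme") summaries.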

```python
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
```
- Tokenizes the input text for PEGASUS with the same arguments as for BART; no task prefix is needed.

```python
summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=50, num_beams=4, early_stopping=True)
```
- Generates the summary using the PEGASUS model, with the same summary length and beam search parameters as the T5 call.

```python
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```
- Decodes the summary back into text.

```python
print("PEGASUS Summary: ", summary)
```
- Prints the summary generated by the PEGASUS model.

# **Similar & Different Parts:**

## **Similar Parts:**

1. **Loading the Model and Tokenizer**:
   - For each model (BART, T5, PEGASUS), the model and tokenizer are loaded in a similar manner:
   ```python
   model = ModelClass.from_pretrained(model_name)
   tokenizer = TokenizerClass.from_pretrained(model_name)
   ```
   - For example:
     - For BART: `BartForConditionalGeneration.from_pretrained(model_name)`
     - For T5: `T5ForConditionalGeneration.from_pretrained(model_name)`
     - For PEGASUS: `PegasusForConditionalGeneration.from_pretrained(model_name)`
   
2. **Tokenizing the Text**:
   - The input text is tokenized for each model and converted into the format that the model can understand.
   ```python
   inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
   ```
   - In this step, the text is tokenized, but the `max_length` value might differ for some models (e.g., T5 uses `max_length=512`).

3. **Summarization Process**:
   - The same `model.generate()` function is used for each model to generate the summary.
   ```python
   summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=50, num_beams=4, early_stopping=True)
   ```
   - The same core parameters (maximum length, minimum length, beam search, early stopping) are used for all three models; only the BART call adds `length_penalty=2.0`.

4. **Decoding and Printing the Results**:
   - The output from the model is decoded and printed as human-readable text.
   ```python
   summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
   print("Model Summary: ", summary)
   ```
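
The four shared steps can be folded into one helper. The sketch below is an illustration, not code from the notebook: the function name `summarize` and its parameters `prefix` and `max_input_length` are assumptions introduced here. It relies only on the three calls the notebook already uses (`tokenizer(...)`, `model.generate(...)`, and `tokenizer.decode(...)`).

```python
def summarize(model, tokenizer, text, prefix="", max_input_length=1024,
              **generate_kwargs):
    """Tokenize, generate, and decode with any model/tokenizer pair."""
    # Step 2: tokenize (with an optional task prefix, as T5 needs).
    inputs = tokenizer(
        prefix + text,
        return_tensors="pt",
        max_length=max_input_length,
        truncation=True,
    )
    # Step 3: generate summary token ids.
    summary_ids = model.generate(inputs["input_ids"], **generate_kwargs)
    # Step 4: decode the first (best) sequence back into text.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Example use with objects loaded as in the BART cell (left commented out
# because it needs the downloaded checkpoint):
# summary = summarize(model, tokenizer, text, max_length=150, min_length=50,
#                     length_penalty=2.0, num_beams=4, early_stopping=True)
```

For T5, the same helper would be called with `prefix="summarize: "` and `max_input_length=512`.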

## **Different Parts:**

1. **Different Models and Tokenizers**:
   - The key difference is that different models and tokenizers are used:
     - BART: `BartForConditionalGeneration`, `BartTokenizer`
     - T5: `T5ForConditionalGeneration`, `T5Tokenizer`
     - PEGASUS: `PegasusForConditionalGeneration`, `PegasusTokenizer`

2. **Preparing the Input Text for the Model**:
   - The **T5** model requires the `"summarize: "` prefix to indicate that the model should perform summarization:
     ```python
     input_text = "summarize: " + text
     ```
   - This step is not needed for the other models (BART and PEGASUS), which do not require this specific prefix.

3. **Naming the Model**:
   - Each model is loaded from a different pre-trained checkpoint name. For example:
     - For BART: `"facebook/bart-large-cnn"`
     - For T5: `"t5-small"`
     - For PEGASUS: `"google/pegasus-large"`

4. **Different Usage of Parameters**:
   - For T5, the tokenizer's `max_length` parameter is set to 512, while for BART and PEGASUS it is set to 1024.
     - For example, in T5:
     ```python
     inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
     ```
     - In BART and PEGASUS:
     ```python
     inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
     ```

## **In Summary**:
- **Similar Parts**: The steps of loading the model, tokenizing the text, performing the summarization, and decoding/printing the results are similar for all three models.
- **Different Parts**: Each model has its own tokenizer and model name. Additionally, T5 requires the `"summarize: "` prefix, while this is not necessary for BART and PEGASUS. There are also slight differences in parameters: T5 truncates its input at `max_length=512` instead of 1024, and only the BART call sets `length_penalty=2.0`.
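
One way to keep these differences in a single place is a small settings table. The sketch below is an assumption about how one might organize the notebook's values, not something the notebook does: `SummarizerSpec`, `SPECS`, `SHARED_GENERATE_KWARGS`, and `generate_kwargs` are hypothetical names. All checkpoint names and parameter values are taken from the cells above.

```python
from dataclasses import dataclass, field

# Generation settings that all three cells share.
SHARED_GENERATE_KWARGS = {
    "max_length": 150,
    "min_length": 50,
    "num_beams": 4,
    "early_stopping": True,
}

@dataclass(frozen=True)
class SummarizerSpec:
    """Per-model settings that differ between the three cells."""
    checkpoint: str
    prefix: str = ""
    max_input_length: int = 1024
    extra_generate_kwargs: dict = field(default_factory=dict)

SPECS = {
    "BART": SummarizerSpec(
        "facebook/bart-large-cnn",
        extra_generate_kwargs={"length_penalty": 2.0},
    ),
    "T5": SummarizerSpec("t5-small", prefix="summarize: ", max_input_length=512),
    "PEGASUS": SummarizerSpec("google/pegasus-large"),
}

def generate_kwargs(spec):
    """Shared generation settings plus any model-specific extras."""
    return {**SHARED_GENERATE_KWARGS, **spec.extra_generate_kwargs}

for name, spec in SPECS.items():
    print(name, spec.checkpoint, repr(spec.prefix),
          spec.max_input_length, generate_kwargs(spec))
```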