# Text Summarization

---
## Content
1) **Project Overview**
2) **Import Dataset**
- In `.parquet`format
3) **Text Preprocessing**
- Tokenize the text data and convert it to sequences
- Pad or truncate sequences to ensure uniform input length.
- Handle special tokens (e.g., start-of-sequence `<sos>` and end-of-sequence `<eos>`).
4) **Model Development**
- Build a Seq2Seq model in TensorFlow with the following components:
    - Encoder (RNN/LSTM/GRU).
    - Decoder with attention mechanism.
    - Attention layer to enhance summary quality.
5) **Training**
- Split the dataset
    - %80 train data
    - %10 validation data
    - %10 test data
- Train the model on the training set
- Monitor performance using the validation set
- Adjust hyperparameters as necessary.
6) **Evaluation**
- Use the test set to generate summaries.
- Evaluate the generated summaries using the ROUGE metric.
7) **Analysis**
- Compare generated summaries to reference summaries and discuss performance.
- Suggest potential improvements or extensions for better results.

---
## Project Overview
This project focuses on developing a text summarization system for news articles using Sequence-to-Sequence (Seq2Seq) models enhanced with attention mechanisms. By utilizing a custom dataset of news content, the model is trained to generate concise, coherent, and informative summaries that capture the key points of each article.

The Seq2Seq architecture, paired with attention, allows the model to dynamically focus on relevant parts of the input text during the decoding process, improving the quality and accuracy of the summaries. This approach addresses the challenge of long and complex news articles by effectively reducing redundancy and preserving critical information.

The project includes dataset preprocessing, model training, and evaluation using metrics such as ROUGE, with the goal of producing high-quality, human-like summaries. This work aims to contribute to automated news aggregation, efficient information retrieval, and content generation.

---
## Import Dataset

In [1]:
import pandas as pd
import os

In [2]:
# Current directory
print(os.getcwd())

/Users/fako/Desktop/Neural/NLP/Text Summarization/src


In [3]:
# Read the Parquet file
df = pd.read_parquet('../data/ds1.parquet')

df.head(3)

Unnamed: 0,text,prediction,prediction_agent,annotation,annotation_agent,id,metadata,status,event_timestamp,metrics
0,WASHINGTON (Reuters) - President Donald Trump ...,"[{'score': 1.0, 'text': 'Trump ends 'Dreamer' ...",Argilla,,,04de325a-1fbf-41a9-977b-ec7892ef86f0,,Default,2017-09-05,{'text_length': 6904}
1,MOSCOW (Reuters) - Russian property developer ...,"[{'score': 1.0, 'text': 'Russian tycoon, fresh...",Argilla,,,97c7f5e7-ae32-44af-ad0c-e6b17ce31e54,,Default,2017-11-08,{'text_length': 1527}
2,WASHINGTON (Reuters) - The U.S. intelligence c...,"[{'score': 1.0, 'text': 'U.S. not started asse...",Argilla,,,90894659-b843-4817-9df8-bb34d6219cdf,,Default,2017-05-23,{'text_length': 677}


---
## Text Preprocessing

In [4]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

Extract the 'text' from the 'prediction' column. It has also different properties, we don't need them. Also, handle special tokens in target text

In [5]:
df['text_prediction'] = df['prediction'].apply(
    lambda x: x[0]['text'] if isinstance(x, np.ndarray) and len(x) > 0 else ''
)

df['text_prediction'] = df['text_prediction'].apply(lambda x: '<sos> ' + x + ' <eos>')

Tokenize the text data and convert it to sequences using TensorFlow/Keras Tokenizer.

In [6]:
# Initialize Tokenizer for input texts
input_tokenizer = Tokenizer()
input_tokenizer.fit_on_texts(df['text'])
input_sequences = input_tokenizer.texts_to_sequences(df['text'])

# Initialize Tokenizer for target summaries
target_tokenizer = Tokenizer()
target_tokenizer.fit_on_texts(df['text_prediction'])
target_sequences = target_tokenizer.texts_to_sequences(df['text_prediction'])

Find the number of 'text' and 'prediction' words among all rows for padding

In [7]:
# Find the maximum number of words in 'text' column
df['text_length'] = df['metrics'].apply(lambda x: x['text_length'])

# Find the maximum number of words in 'prediction' column
df['prediction_length'] = df['text_prediction'].apply(lambda x: len(str(x).split()))

Pad sequences to ensure uniform input length.

In [8]:
# Find the maximum word length in 'text' column
max_input_length = df['text_length'].max()

# Find the maximum word length in 'prediction' column
max_target_length = df['prediction_length'].max()

# Pad input and target sequences
input_padded = pad_sequences(input_sequences, maxlen=max_input_length, padding='post', truncating='post')
target_padded = pad_sequences(target_sequences, maxlen=max_target_length, padding='post', truncating='post')

---
## Model Development

In [9]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate, RepeatVector
from tensorflow.keras.models import Model

Build a Seq2Seq model in TensorFlow with the following components:
- Encoder (RNN/LSTM/GRU)
- Decoder with attention mechanism.
- Attention layer to enhance summary quality.

In [10]:
# Define Bahdanau Attention as a custom layer
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)
    
    def call(self, query, values):
        # query shape: (batch_size, hidden size)
        # values shape: (batch_size, max_input_length, hidden size)
        
        # Expand query dimensions to match values
        query_with_time_axis = tf.expand_dims(query, 1)  # (batch_size, 1, hidden size)
        
        # Calculate score
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(query_with_time_axis)
        ))  # (batch_size, max_input_length, 1)
        
        # Calculate attention weights
        attention_weights = tf.nn.softmax(score, axis=1)  # (batch_size, max_input_length, 1)
        
        # Context vector
        context_vector = attention_weights * values  # (batch_size, max_input_length, hidden size)
        context_vector = tf.reduce_sum(context_vector, axis=1)  # (batch_size, hidden size)
        
        return context_vector, attention_weights

# Parameters
embedding_dim = 256
units = 512
input_vocab_size = len(input_tokenizer.word_index) + 1
target_vocab_size = len(target_tokenizer.word_index) + 1
decoder_input_length = max_target_length - 1

# Encoder
encoder_inputs = Input(shape=(max_input_length,), name='encoder_inputs')
encoder_embedding = Embedding(input_vocab_size, embedding_dim, mask_zero=True, name='encoder_embedding')(encoder_inputs)
encoder_lstm = LSTM(units, return_sequences=True, return_state=True, name='encoder_lstm')
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(decoder_input_length,), name='decoder_inputs')
decoder_embedding = Embedding(target_vocab_size, embedding_dim, mask_zero=True, name='decoder_embedding')(decoder_inputs)
decoder_lstm = LSTM(units, return_sequences=True, return_state=True, name='decoder_lstm')
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)

# Attention
attention = BahdanauAttention(units)
context_vector, attention_weights = attention(state_h, encoder_outputs)

# Repeat the context vector across all time steps using RepeatVector
context_vector_repeated = RepeatVector(decoder_input_length)(context_vector)  # (batch_size, max_target_length, hidden size)

# Concatenate context vector with decoder outputs using Keras Concatenate layer
decoder_concat = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, context_vector_repeated])

# Dense layer
decoder_dense = Dense(target_vocab_size, activation='softmax', name='output_layer')
decoder_outputs = decoder_dense(decoder_concat)

# Seq2Seq Model with Bahdanau Attention
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

---
## Training
We'll perform the following actions:
- Split the Dataset into Training, Validation, and Testing Sets (80%/10%/10%).
- Prepare Decoder Input and Output Sequences.
- Compile the Model.
- Train the Model While Monitoring Validation Performance.

In [11]:
from sklearn.model_selection import train_test_split

**Split the Dataset**

In [12]:
# First split: 80% training and 20% temp (validation + testing)
input_train, input_temp, target_train, target_temp = train_test_split(
    input_padded, target_padded, test_size=0.2, random_state=42
)

# Second split: 10% validation and 10% testing from temp
input_val, input_test, target_val, target_test = train_test_split(
    input_temp, target_temp, test_size=0.5, random_state=42
)

print(f"Training set size: {input_train.shape[0]}")
print(f"Validation set size: {input_val.shape[0]}")
print(f"Testing set size: {input_test.shape[0]}")

Training set size: 16333
Validation set size: 2042
Testing set size: 2042


**Prepare Decoder Input and Output Sequences**
- Decoder Input: All tokens except the last one.
- Decoder Output: All tokens except the first one.

In [13]:
# Decoder input data: remove the last token
decoder_input_train = target_train[:, :-1]
decoder_input_val = target_val[:, :-1]
decoder_input_test = target_test[:, :-1]

# Decoder output data: remove the first token
decoder_output_train = target_train[:, 1:]
decoder_output_val = target_val[:, 1:]
decoder_output_test = target_test[:, 1:]

# Expand dimensions for sparse categorical crossentropy
decoder_output_train = np.expand_dims(decoder_output_train, -1)
decoder_output_val = np.expand_dims(decoder_output_val, -1)
decoder_output_test = np.expand_dims(decoder_output_test, -1)

**Compile the Model**

We'll compile the model with the Adam optimizer and use sparse_categorical_crossentropy as the loss function since the targets are integer-encoded.

In [14]:
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

**Train the Model**

We'll train the model using the training set and monitor its performance on the validation set.

In [15]:
# Define training parameters
batch_size = 16
epochs = 20

# Train the model
history = model.fit(
    [input_train, decoder_input_train], 
    decoder_output_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=([input_val, decoder_input_val], decoder_output_val)
)

Epoch 1/20
[1m   9/1021[0m [37m━━━━━━━━━━━━━━━━━━━━[0m [1m39:45:52[0m 141s/step - accuracy: 0.3297 - loss: 9.0323

KeyboardInterrupt: 