In [None]:
!pip install chardet

# Sequence-to-Sequence (Seq2Seq) model with LSTM units

Here's a breakdown of the components and the architecture:

1. **Encoder-Decoder Architecture**: This model consists of two primary components: an encoder and a decoder.
   - **Encoder**: Reads the input sequence and compresses it into a fixed-size state vector that summarizes the input.
   - **Decoder**: Takes the output from the encoder and generates the target sequence. The initial state of the decoder is set to the final state of the encoder, allowing the decoder to use the learned context.

2. **LSTM Layers**: Both the encoder and decoder use Long Short-Term Memory (LSTM) layers, a type of recurrent neural network (RNN) suited to sequence prediction. LSTMs help the model retain long-term dependencies and mitigate the vanishing-gradient problem that affects standard RNNs.

3. **Embedding Layer**: Both the encoder and decoder are equipped with an embedding layer that transforms the integer-encoded vocabulary into dense vectors of a fixed size. This provides a more expressive representation of words and reduces the dimensionality compared to one-hot encoding.

4. **Dense Layer**: The output of the decoder's LSTM is passed through a dense (fully connected) layer with a softmax activation function to predict the probability distribution over the vocabulary for each time step in the output sequence.
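
Items 3 and 4 can be illustrated with a toy, framework-free sketch: an embedding is a table lookup from a token id to a dense vector, and the softmax turns the final layer's scores into a probability distribution over the vocabulary. The tiny vocabulary, the 3-dimensional vectors, and the score values are made up for illustration and are not taken from the model trained in this notebook.

```python
import math

# Hypothetical 4-word vocabulary; id 0 is reserved for padding, as in Keras.
vocab = {"<pad>": 0, "profit": 1, "sales": 2, "rose": 3}

# Embedding layer: one dense row per token id (values are arbitrary here).
embedding_table = [
    [0.0, 0.0, 0.0],
    [0.2, -0.1, 0.7],
    [0.3, 0.0, 0.6],
    [-0.5, 0.4, 0.1],
]

def embed(token_ids):
    """Look up the dense vector for each integer-encoded token."""
    return [embedding_table[i] for i in token_ids]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

vectors = embed([vocab["sales"], vocab["rose"]])
probs = softmax([0.1, 2.0, 0.5, 1.0])  # made-up decoder scores for one time step
predicted_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
print(len(vectors[0]), round(sum(probs), 6), predicted_id)
```

In the real model the embedding table and the dense layer's weights are learned during training rather than written by hand.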

# Dataset Used for Training
We are using the BBC News Summary data to train our model: https://www.kaggle.com/datasets/pariza/bbc-news-summary?select=BBC+News+Summary

### Data Preparation

In [3]:
import os
import chardet

def read_files(directory):
    files_content = []
    for filename in sorted(os.listdir(directory)):
        filepath = os.path.join(directory, filename)
        if os.path.isfile(filepath):
            # Detect encoding
            with open(filepath, 'rb') as file:  # Open file in binary mode
                raw_data = file.read()
                encoding = chardet.detect(raw_data)['encoding']
            
            # Read file with detected encoding
            with open(filepath, 'r', encoding=encoding) as file:
                files_content.append(file.read().strip())
    return files_content

def load_data(main_directory):
    """
    Function to load news articles and their summaries from given directory structure.
    """
    categories = ['business', 'entertainment', 'politics', 'sport', 'tech']  # List of categories
    texts = []
    summaries = []

    # Paths for articles and summaries
    articles_path = os.path.join(main_directory, 'News Articles')
    summaries_path = os.path.join(main_directory, 'Summaries')

    for category in categories:
        # Full path to category folder for articles and summaries
        category_articles_path = os.path.join(articles_path, category)
        category_summaries_path = os.path.join(summaries_path, category)

        # Read all articles and summaries from category folder
        category_articles = read_files(category_articles_path)
        category_summaries = read_files(category_summaries_path)

        # Extend the main lists
        texts.extend(category_articles)
        summaries.extend(category_summaries)

    return texts, summaries

main_directory = '/kaggle/input/bbc-news-summary/BBC News Summary'

texts, summaries = load_data(main_directory)
print("Number of texts: ", len(texts))
print("Number of summaries: ", len(summaries))
print("Example text: ", texts[0])
print("Example summary: ", summaries[0])

Number of texts:  2225
Number of summaries:  2225
Example text:  Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (Â£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service
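
The `Â£` in the example text above is the kind of artifact that appears when UTF-8 bytes are decoded as Latin-1 or Windows-1252, which may happen when automatic detection guesses wrong. `chardet.detect` can also report `None` as the encoding for input it cannot classify. As a hedged alternative, the sketch below tries strict UTF-8 first and falls back to Latin-1 only when that fails. The helper name `decode_bytes` is hypothetical, and whether this ordering suits every file in the dataset is an assumption.

```python
def decode_bytes(raw: bytes) -> str:
    """Decode raw file bytes, preferring UTF-8 and falling back to Latin-1."""
    try:
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1")

utf8_bytes = "$1.13bn (£600m)".encode("utf-8")
latin1_bytes = "$1.13bn (£600m)".encode("latin-1")

print(decode_bytes(utf8_bytes))
print(decode_bytes(latin1_bytes))
print(utf8_bytes.decode("latin-1"))  # decoding UTF-8 bytes as Latin-1 garbles the pound sign
```

Such a helper could be applied to the bytes already read in binary mode inside `read_files`, which would also avoid opening each file twice.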

In [4]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts + summaries)

# Convert texts to sequences
text_seq = tokenizer.texts_to_sequences(texts)
summary_seq = tokenizer.texts_to_sequences(summaries)

# Pad sequences
text_seq = pad_sequences(text_seq, maxlen=50)
summary_seq = pad_sequences(summary_seq, maxlen=20)

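
`pad_sequences` pads and truncates at the front by default (`padding='pre'`, `truncating='pre'`), so with `maxlen=50` only the last 50 tokens of each article are kept, and with `maxlen=20` only the last 20 tokens of each summary. The pure-Python function below mimics that default behavior for illustration; `pad_pre` is a hypothetical stand-in, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: truncate and pad at the front."""
    trimmed = seq[-maxlen:]  # keep the LAST maxlen tokens
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([5, 8, 2], maxlen=5))              # short sequence: zeros added in front
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # long sequence: leading tokens dropped
```

Passing `padding='post'` or `truncating='post'` to `pad_sequences` would pad or cut at the end instead; which is preferable for this task is a design choice the notebook leaves open.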


### Model Architecture

In [5]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding

# Parameters
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 300
lstm_units = 256

# Encoder
encoder_inputs = Input(shape=(None,))
encoder_embedding = Embedding(vocab_size, embedding_dim)(encoder_inputs)
encoder_outputs, state_h, state_c = LSTM(lstm_units, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]

# Decoder
decoder_inputs = Input(shape=(None,))
decoder_embedding = Embedding(vocab_size, embedding_dim)(decoder_inputs)
decoder_lstm = LSTM(lstm_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
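
For reference, `sparse_categorical_crossentropy` takes integer targets and, at each time step, scores the prediction as the negative log of the probability assigned to the correct token id. The sketch below computes that average in plain Python on made-up probabilities. Because the `Embedding` layers above are created without `mask_zero=True`, padding positions (id 0) contribute to the loss as well; the optional `ignore_id` argument is a hypothetical illustration of excluding them, not something this notebook does.

```python
import math

def sparse_cross_entropy(probs_per_step, target_ids, ignore_id=None):
    """Mean of -log p[target] over time steps, optionally skipping one id."""
    losses = [
        -math.log(probs[t])
        for probs, t in zip(probs_per_step, target_ids)
        if t != ignore_id
    ]
    return sum(losses) / len(losses)

# Made-up softmax outputs over a 4-token vocabulary for 3 time steps.
probs_per_step = [
    [0.70, 0.10, 0.10, 0.10],  # padding position, target id 0
    [0.10, 0.60, 0.20, 0.10],  # target id 1
    [0.25, 0.25, 0.25, 0.25],  # target id 3, uniform prediction
]
targets = [0, 1, 3]

loss_all = sparse_cross_entropy(probs_per_step, targets)
loss_no_pad = sparse_cross_entropy(probs_per_step, targets, ignore_id=0)
print(loss_all, loss_no_pad)
```

A uniform prediction over V tokens gives a per-step loss of ln V, which is a useful reference point when reading the first-epoch loss.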

### Training
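
The cell below trains with teacher forcing: at each step the decoder receives the previous target token as input and is asked to predict the current one. That is why the decoder input is the summary shifted right by one position with a start id placed in front. A minimal pure-Python sketch of the shift, using a made-up token sequence and the same start id of 1 as the notebook (`shift_right` is a hypothetical helper name):

```python
def shift_right(target_ids, start_id):
    """Build the decoder input: start id followed by all but the last target."""
    return [start_id] + target_ids[:-1]

target = [7, 12, 4, 9]  # hypothetical token ids for one summary
decoder_input = shift_right(target, start_id=1)

# Each pair is (what the decoder sees, what it should predict).
for seen, expected in zip(decoder_input, target):
    print(seen, "->", expected)
```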

In [6]:
# Prepare decoder input data: the summary shifted right by one position (teacher forcing)
decoder_input_data = np.zeros_like(summary_seq)
decoder_input_data[:, 1:] = summary_seq[:, :-1]
# Index 1 stands in for a start token; note that the Tokenizer assigns index 1 to the most frequent word
decoder_input_data[:, 0] = 1

# Prepare decoder target data
decoder_target_data = np.expand_dims(summary_seq, -1)

# Training the model
model.fit([text_seq, decoder_input_data], decoder_target_data, batch_size=16, epochs=150)

Epoch 1/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 8s 28ms/step - loss: 8.7686
Epoch 2/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 4s 27ms/step - loss: 7.1593
Epoch 3/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 5s 28ms/step - loss: 6.9356
Epoch 4/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 6.7560
Epoch 5/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 6.5690
Epoch 6/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 6.4182
Epoch 7/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 5s 27ms/step - loss: 6.2475
Epoch 8/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 4s 27ms/step - loss: 6.0532
Epoch 9/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - loss: 5.8447
Epoch 10/150
140/140 ━━━━━━━━━━━━━━━━━━━━ 4

<keras.src.callbacks.history.History at 0x7c522cd9ef20>

### Inference
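
Inference runs the decoder one step at a time: feed the last predicted token and the current states, take the most probable next token (greedy decoding), and repeat until a stop condition is met. The sketch below shows that loop in isolation with a stand-in step function in place of `decoder_model.predict`; `greedy_decode`, `toy_step`, the token ids, and the `max_tokens` cap are all hypothetical. Unlike the cell below, which stops once the decoded string exceeds 50 characters, this variant caps the number of generated tokens.

```python
def greedy_decode(step, start_id, end_id, max_tokens, state):
    """Repeatedly pick the highest-scoring token until end_id or max_tokens."""
    token, output = start_id, []
    while len(output) < max_tokens:
        scores, state = step(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == end_id:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Stand-in for one decoder step: always favors the next id, up to id 4."""
    scores = [0.0] * 5
    scores[min(token + 1, 4)] = 1.0
    return scores, state + 1  # the state here is just a step counter

full = greedy_decode(toy_step, start_id=0, end_id=4, max_tokens=10, state=0)
capped = greedy_decode(toy_step, start_id=0, end_id=4, max_tokens=2, state=0)
print(full, capped)
```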

In [11]:
# Inference setup: Define encoder model
encoder_model = Model(encoder_inputs, encoder_states)

# Inference setup: Define decoder model
decoder_state_input_h = Input(shape=(lstm_units,))
decoder_state_input_c = Input(shape=(lstm_units,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

# Function to decode sequence
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate an empty target sequence of length 1 holding only the start token.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = 1  # Start token

    # Greedy decoding loop (batch size 1)
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Pick the most probable token (greedy argmax)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = tokenizer.index_word.get(sampled_token_index, 'unknown')  # Handle unknown tokens

        # Append sampled word (or 'unknown') to the decoded sentence
        decoded_sentence += ' ' + sampled_word

        # Exit condition: the word 'end' was sampled or the decoded string exceeds 50 characters.
        if sampled_word == 'end' or len(decoded_sentence) > 50:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence


print(decode_sequence(text_seq[0:1]))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 116ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 144ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 21ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 22ms/step
 profits were lower than in the preceding three quarters


# Inference code: User Input

In [13]:
input_text = input("Enter a text to summarize: ")
input_seq = tokenizer.texts_to_sequences([input_text])
input_seq = pad_sequences(input_seq, maxlen=50)
summary = decode_sequence(input_seq)
print("Summary:", summary)

Enter a text to summarize:  Bloom is to be formally presented with the Hans Christian Andersen Award this spring in Anderson's hometown of Odense.Later at a gala dinner, Danish supermodel Helena Christensen was named a Hans Christian Andersen ambassador.French musician Jean-Michel Jarre is to perform at a concert in Copenhagen to mark the bicentennial of the birth of writer Hans Christian Andersen."Christian Andersen's fairy tales are timeless and universal," said Jarre.The royal couple also visited the Hans Christian Anderson School complex, where Queen Mary read The Ugly Duckling to the young audience."Bloom recognizes the darker aspects of Andersen's authorship," Prince Frederik said.


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
Summary:  chance to utilise some of a one day strike which could
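
One thing to keep in mind for free-form input: the `Tokenizer` in this notebook is created without an `oov_token`, so `texts_to_sequences` silently drops any word that was not seen during fitting, and `pad_sequences(..., maxlen=50)` then keeps only the last 50 remaining tokens. The sketch below imitates that lookup in plain Python with a made-up vocabulary; `encode` and this `word_index` are hypothetical stand-ins, and the whitespace split is a simplification of the Keras tokenizer's filtering.

```python
word_index = {"the": 1, "award": 2, "concert": 3}  # hypothetical fitted vocabulary

def encode(text, word_index, oov_id=None):
    """Map lowercased words to ids; unknown words are dropped unless oov_id is set."""
    ids = []
    for word in text.lower().split():
        if word in word_index:
            ids.append(word_index[word])
        elif oov_id is not None:
            ids.append(oov_id)
    return ids

dropped = encode("The award concert bicentennial", word_index)
kept = encode("The award concert bicentennial", word_index, oov_id=4)
print(dropped, kept)
```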
