In [1]:
data = """You are a simple yet powerful LLM AI language model with natural language processing capabilities designed to respond with concise and informative, summarization of all the context that was provided below from a given text. Your task is to analyze correctly and create a memorable guiding summary.

You will use the conversation history provided after these instructions to answer questions or execute tasks to do your best to serve the instruction you receive at the end.

You will never make up anything that was not present in or not related to the provided context. 

Your objective is to summarize the context below and present it in an understandable and compact fashion.

You should be aware that this conversation is between other individuals and you are only observing without any contribution.

Don't include any comments or speech in your response, output pure summarization and nothing else.

Good luck and have fun!"""

In [2]:
# 1. **Data Preparation**
# First, prepare the dataset. Any text corpus works; for simplicity, we use the `data` string defined above as our training data.

import tensorflow as tf
import numpy

# Split the text into paragraphs on blank lines
lines = data.split("\n\n")

# Join the paragraphs and split on whitespace into words
words = " ".join(lines).split()

2025-02-15 22:58:22.377145: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-15 22:58:22.380407: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-02-15 22:58:22.389973: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1739649502.406526   18056 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739649502.411495   18056 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-15 22:58:22.427996: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU ins

In [3]:
# 2. **Tokenization**
# Next, we need to convert our text data into numerical tensors that can be fed into a neural network.

# Create a character-level tokenizer (each character gets its own integer index)
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)

# Fit the tokenizer on the words
tokenizer.fit_on_texts(words)

# Convert each word into a sequence of character indices
sequences = tokenizer.texts_to_sequences(words)
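
To make the character-level indexing concrete, here is a minimal pure-Python sketch of the idea. It is an approximation for illustration, not the Keras implementation: the helper names `build_char_index` and `encode` are hypothetical, and it assumes lowercasing, frequency-ordered indices with alphabetical tie-breaking, and index 0 left unused for padding.

```python
from collections import Counter

def build_char_index(words):
    """Map each character to an integer, most frequent first, starting at 1."""
    counts = Counter(ch for word in words for ch in word.lower())
    # Sort by descending count; ties are broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}

def encode(words, char_index):
    """Turn each word into a list of character indices, skipping unknown characters."""
    return [[char_index[ch] for ch in word.lower() if ch in char_index] for word in words]

sample_words = ["You", "are", "a", "model"]
char_index = build_char_index(sample_words)
encoded = encode(sample_words, char_index)
print(encoded)
```

Note that the encoded words have different lengths, which is why a padding step is needed before they can be stacked into a single array.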

In [4]:
# 3. **Building the Model**
# Now we can build our language model using a recurrent neural network (RNN), here a bidirectional LSTM.

# Add 1 because index 0 is reserved for padding and never assigned to a character
vocab_size = len(tokenizer.word_index) + 1

# Define the model architecture
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(vocab_size)
])

# Compile the model
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), optimizer='adam')

W0000 00:00:1739649504.015847   18056 gpu_device.cc:2344] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
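
The `from_logits=True` setting means the final `Dense` layer outputs raw scores and the loss applies the softmax itself. As a rough illustration of that computation for a single example, the following pure-Python sketch computes the negative log-probability of the target index. The function name `sparse_ce_from_logits` is hypothetical, and this is not the TensorFlow implementation.

```python
import math

def sparse_ce_from_logits(logits, target):
    """Cross-entropy of one integer target against unnormalized scores."""
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log softmax(logits)[target] = log_sum_exp - logits[target]
    return log_sum_exp - logits[target]

# Equal scores give the loss of a uniform guess over four classes
uniform_loss = sparse_ce_from_logits([0.0, 0.0, 0.0, 0.0], target=2)
# A large score on the correct class drives the loss toward zero
confident_loss = sparse_ce_from_logits([0.0, 0.0, 10.0, 0.0], target=2)
print(uniform_loss, confident_loss)
```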


In [5]:
# 4. **Training the Model**
# We can now train our language model to predict the next character of each word.

# Build (prefix, next character) training pairs from each word.
# The per-word sequences have different lengths, so they cannot be
# converted to a single array before padding.
prefixes, targets = [], []
for seq in sequences:
    for i in range(1, len(seq)):
        prefixes.append(seq[:i])
        targets.append(seq[i])

# Pad the prefixes on the left so they all have equal length
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(prefixes, padding='pre')
targets = numpy.array(targets)

# Train the model on inputs and their next-character targets
model.fit(padded_sequences, targets, epochs=10)
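
Words differ in length, so the per-word index lists are ragged and cannot be packed into one rectangular array until they are padded. The sketch below shows, in plain Python, one way to turn such lists into fixed-length inputs with next-character targets. The helpers `make_pairs` and `pad_left` are hypothetical names for illustration, and left-padding with 0 is an assumption of this sketch.

```python
def make_pairs(sequences):
    """Collect (prefix, next_id) pairs for every position after the first."""
    pairs = []
    for seq in sequences:
        for i in range(1, len(seq)):
            pairs.append((seq[:i], seq[i]))
    return pairs

def pad_left(prefixes, pad_id=0):
    """Left-pad every prefix with pad_id so all rows share the longest length."""
    width = max(len(p) for p in prefixes)
    return [[pad_id] * (width - len(p)) + list(p) for p in prefixes]

ragged = [[9, 3, 8], [1, 7], [1]]  # made-up character indices of unequal length
pairs = make_pairs(ragged)
inputs = pad_left([prefix for prefix, _ in pairs])
targets = [target for _, target in pairs]
print(inputs, targets)
```

Single-character words contribute no pairs, since there is no next character to predict.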

In [None]:
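# 5. **Evaluating the Model**
# Finally, we inspect the trained model by generating text from it.
# Start from a seed character, since the model needs a non-empty input
generated_text = 'y'

for i in range(20):
    # Encode the text so far as a batch of one sequence: shape (1, length)
    input_ids = tf.constant([tokenizer.texts_to_sequences([generated_text])[0]])
    prediction = model.predict(input_ids, verbose=0)
    predicted_id = int(tf.argmax(prediction[0]).numpy())
    # Index 0 is the padding index and maps to no character, so stop there
    if predicted_id == 0:
        break
    generated_text += tokenizer.index_word[predicted_id]

print(generated_text)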
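
The loop above is greedy decoding: at every step it appends the single highest-scoring next character. The following sketch isolates that loop from TensorFlow by swapping in a hypothetical `toy_scores` function in place of `model.predict`; the vocabulary and scores are invented purely for illustration.

```python
index_to_char = {1: "a", 2: "b", 3: "c"}  # toy vocabulary, index 0 is padding

def toy_scores(ids):
    """Stand-in for a model: favor the character after the last one, cycling a -> b -> c -> a."""
    scores = [0.0, 0.0, 0.0, 0.0]
    next_id = ids[-1] % 3 + 1
    scores[next_id] = 1.0
    return scores

def greedy_generate(seed_ids, steps):
    """Repeatedly append the argmax of the scores for the sequence so far."""
    ids = list(seed_ids)
    for _ in range(steps):
        scores = toy_scores(ids)
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == 0:  # padding has no character, so stop
            break
        ids.append(best)
    return "".join(index_to_char[i] for i in ids)

generated = greedy_generate([1], steps=5)
print(generated)
```

Greedy decoding is deterministic, so a given seed always yields the same continuation; sampling from the predicted distribution is a common alternative when more varied output is wanted.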