
r-and-d--try-gpt-bit-pair-100k-encoding #147

Open
2 tasks
david-thrower opened this issue Feb 19, 2024 · 1 comment
Labels
audience/technical (Issue primarily for technical review and service), kind/enhancement (New feature or request), triage/high-priority

Comments

@david-thrower
Owner

Kind of issue: Feature / enhancement; Natural Language Processing.

TLDR:
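The encoding in question is presumably the GPT-4-style byte-pair vocabulary of roughly 100k tokens. A minimal sketch of that encoding step on its own, assuming minbpe's GPT4Tokenizer (the same tokenizer used in the comment below) is installed:

from minbpe import GPT4Tokenizer

# GPT4Tokenizer reproduces the GPT-4-style byte-pair encoding (~100k vocabulary).
tokenizer = GPT4Tokenizer()

# Ordinary text: encode to integer token ids, then decode back.
ids = tokenizer.encode("hello world")
print(ids)
print(tokenizer.decode(ids))

# Special tokens such as <|endoftext|> must be explicitly allowed.
ids_special = tokenizer.encode("<|endoftext|>hello world", allowed_special="all")
print(ids_special)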

@david-thrower added the kind/enhancement, triage/high-priority, and audience/technical labels on Feb 19, 2024
@david-thrower
Owner Author

This may make a suitable base model, but may need further preprocessing:

import tensorflow as tf
from minbpe import GPT4Tokenizer 

from keras_nlp.models import GPT2Preprocessor


class TextEncoderLayer(tf.keras.layers.Layer):
    """Tokenizes a batch of strings with the minbpe GPT4Tokenizer and pads to a fixed length."""

    def __init__(self, sequence_length=100):
        super().__init__()
        # GPT-4-style byte-pair tokenizer (~100k vocabulary) from minbpe.
        self.tokenizer = GPT4Tokenizer()
        self.sequence_length = sequence_length

    def call(self, text):
        # Tokenize each string in the batch. This runs eagerly, so each scalar
        # string tensor is decoded to a Python str before encoding.
        _tokens = []
        for text_0 in text:
            tokens = self.tokenizer.encode(
                text_0.numpy().decode("utf-8"), allowed_special="all")
            _tokens.append(tokens)
        # Pad (or truncate) every token sequence to sequence_length.
        padded_tokens = tf.keras.preprocessing.sequence.pad_sequences(
            _tokens, maxlen=self.sequence_length, padding='post')
        return tf.constant(padded_tokens)

# Usage example
text_1 = tf.constant(["<|endoftext|>hello world"], dtype=tf.string)
text = tf.constant(["<|endoftext|>hello world", "test 9"], dtype=tf.string)
text_encoder_layer = TextEncoderLayer()

print("2 tensor: as layer:")
print(text_encoder_layer(text))

print("One tensor: as layer:")
print(text_encoder_layer(text_1))

# Check that the layer works as an in-model preprocessing step (string input -> embedding):


inp = tf.keras.layers.Input(shape=(), dtype=tf.string)
tokens_1 = TextEncoderLayer()(inp)
vocab_size = 100276
embedded = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=18,
    input_length=100)(tokens_1)
flat = tf.keras.layers.Flatten()(embedded)


m1 = tf.keras.Model(inputs=inp, outputs=flat)

result_1 = m1(text_1)
print("1 Tensor:")
print(result_1)

result = m1(text)

print("2 tensor:")
print(result)
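One caveat with the layer above: it loops over the batch in Python and calls .numpy(), so it only runs eagerly and will not trace into a graph. A rough sketch of one workaround (not tested here), wrapping the same encode-and-pad step in tf.py_function so it can also run inside a tf.data pipeline:

import tensorflow as tf
from minbpe import GPT4Tokenizer

tokenizer = GPT4Tokenizer()
SEQUENCE_LENGTH = 100

def _encode_batch(texts):
    # Runs eagerly inside tf.py_function; texts is an EagerTensor of byte strings.
    token_lists = [
        tokenizer.encode(t.decode("utf-8"), allowed_special="all")
        for t in texts.numpy()
    ]
    return tf.keras.preprocessing.sequence.pad_sequences(
        token_lists, maxlen=SEQUENCE_LENGTH, padding="post")

def tf_encode(texts):
    tokens = tf.py_function(_encode_batch, inp=[texts], Tout=tf.int32)
    tokens.set_shape([None, SEQUENCE_LENGTH])
    return tokens

ds = (tf.data.Dataset
      .from_tensor_slices(["<|endoftext|>hello world", "test 9"])
      .batch(2)
      .map(tf_encode))

for batch in ds:
    print(batch)  # shape (2, 100), int32 token ids padded with zeros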
