
Training on artificial language data (server logs, medical records, etc.) #24

Closed
klimentij opened this issue Dec 30, 2019 · 1 comment

@klimentij

Hi, and thank you for your amazing work! I would like to train GPT-2 on a Colab TPU on non-natural-language sequential categorical data such as server logs, medical records, or weather events. What do I have to change in your code to prepare a dataset with word-level encoding (instead of BPE) and successfully run training?
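(For anyone landing here with the same question: "word-level encoding" for categorical event data can be as simple as mapping each distinct event token to an integer ID. The sketch below is not part of this repository's code; the function names and the `<unk>` convention are illustrative assumptions.)

```python
# Minimal sketch (NOT from this repo): word-level encoding for
# categorical event sequences such as server log lines.
from collections import Counter

def build_vocab(sequences, min_count=1):
    """Map each distinct token (event/word) to an integer ID."""
    counts = Counter(tok for seq in sequences for tok in seq)
    # Reserve ID 0 for an <unk> token so unseen events can still be encoded.
    vocab = {"<unk>": 0}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(seq, vocab):
    """Turn a token sequence into a list of integer IDs."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in seq]

# Toy "server log" corpus: each line pre-split into categorical tokens.
logs = [
    ["GET", "/index", "200"],
    ["GET", "/login", "404"],
    ["POST", "/login", "200"],
]
vocab = build_vocab(logs)
ids = [encode(seq, vocab) for seq in logs]
```

The resulting integer sequences would then replace the BPE-encoded IDs in whatever dataset-preparation step the training pipeline uses, with the model's vocabulary size set to `len(vocab)`.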

P.S. I think it would be very useful for the community to have a quick tutorial section on this in the README.

Thank you!

@ConnorJL (Owner)

ConnorJL commented Jan 7, 2020

Hello there!

There already is a small tutorial in the README under the heading "Using Your Own Data". Unfortunately, it's not very beginner-friendly, due to how shoddy my code is overall. I haven't tested this code in Colab, and word-level encoding is not implemented. This project is somewhat in "archive" mode, as I currently have no time or intention to improve on it. I would recommend looking into other, more mature LM implementations, such as Hugging Face's Transformers library. Hope that helps!

@ConnorJL ConnorJL closed this as completed Jan 7, 2020