In this notebook we will experiment a little with the pretrained GPT-2 model made available by Hugging Face. I recommend enabling the GPU for this exercise.

First we have to install the transformers library:

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.2 transformers-4.24.0


In the next step, we import the 'pipeline' wrapper, a high-level interface that saves us from learning a lot of the syntax of the transformers library, together with set_seed, which makes the generated text reproducible:

In [None]:
from transformers import set_seed, pipeline

We set up a text generator using gpt2-large and ask for it to be put on the CPU with device=-1. If you have enabled a GPU, as recommended above, set device=0 instead:

In [None]:
generator = pipeline(task='text-generation', model='gpt2-large', device=-1)

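Instead of hardcoding the device, it can be chosen at runtime. The following is a minimal sketch, assuming PyTorch is the installed backend; `pick_device` is a hypothetical helper name, not part of the transformers library. It returns 0 when a CUDA GPU is visible and -1 otherwise, following the device convention described above.

```python
def pick_device() -> int:
    """Return 0 if PyTorch can see a CUDA GPU, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch is the backend in use
    except ImportError:
        # Without PyTorch we cannot query the GPU, so fall back to the CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("device =", device)
```

The result can then be passed as `device=device` when the generator is created.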

Now we are ready to generate text. We do this by giving the generator a prompt and some (potentially very many) options. In the example below, I limited the output to 100 tokens (max_length counts tokens rather than words, and it includes the prompt), asked for two texts, and set the 'temperature', which controls how random the choice of each next token is.

In [None]:
set_seed(42)

generator("Deep learning is an interesting topic, because ", max_length=100, num_return_sequences=2, temperature=0.1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Deep learning is an interesting topic, because \xa0it is a very new field, and it is still very much in its infancy. \xa0It is a very powerful tool, and it is very easy to get wrong. \xa0I have seen many people get very excited about the potential of deep learning, and then get very wrong. \xa0I have seen many people get very excited about the potential of deep learning, and then get very wrong. \xa0I have seen many people get very'},
 {'generated_text': "Deep learning is an interesting topic, because \xa0it's a very new field, and it's not clear how much of the current research is actually applicable to real-world problems. \xa0I'm not sure how much of the current research is applicable to real-world problems, but I'm sure there are some things that can be learned from it. \xa0I'm not sure how much of the current research is applicable to real-world problems, but I'm sure there are some things"}]

While the output is clearly English, it is very repetitive and not very impressive.
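One likely contributor to the repetition is the very low temperature. The sketch below illustrates the idea in plain Python, assuming the generator samples each token from a softmax over the model's scores divided by the temperature. The scores are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not a transformers function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for four candidate next tokens.
logits = [2.0, 1.0, 0.5, 0.0]

for t in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

At a temperature of 0.1 almost all of the probability lands on the single highest-scoring token, so sampling behaves nearly like always picking the most likely word, which tends to produce loops. Higher temperatures spread the probability over more candidates.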

Therefore, your first task is:

Make the generator above produce more sensible continuations of the prompt. Achieve this by changing the options given to the generator.
To achieve this, I recommend looking at both the documentation for the generation function: https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation as well as this excellent tutorial: https://huggingface.co/blog/how-to-generate
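Two of the options discussed in the tutorial are `top_k` and `top_p`, which restrict the set of tokens that can be sampled. The following is a conceptual sketch in plain Python, not the library's implementation; the probabilities are made up, and `top_k_filter` and `top_p_filter` are hypothetical helper names.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


# Made-up next-token probabilities for a vocabulary of five tokens.
probs = [0.5, 0.25, 0.15, 0.07, 0.03]

print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.8))
```

Both filters remove the unlikely tail of the distribution before sampling. Other options worth exploring in the documentation include `do_sample`, `num_beams`, `no_repeat_ngram_size` and `repetition_penalty`.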

As motivation, I can tell you that with the same seed and prompt, I have managed to get the following output (which seems a bit more interesting than the above):



> [{'generated_text': "Deep learning is an interesting topic, because \xa0it's a very new field, and it's still in its infancy. \xa0There's no one way to do this, and there's no one way that's going to work for every situation. \xa0I'm going to give you a few tips that I've found helpful in the past. \xa0I hope they'll help you in your own research, and help you in the future as well. \xa0I'll start with a"},

>  {'generated_text': "Deep learning is an interesting topic, because \xa0it is not just about learning a new language. It's about learning a new way to think about the world. It's a new way of thinking about how to make sense of the world, and how to make sense out of data. It's a way of understanding how to make sense and understand the world. It is a new way of looking at how the world is structured, how the world works, and how we can make sense out of"}]



# Optional exercise: fine-tuning

The next, perhaps more interesting, task is to fine-tune GPT-2 for a specific problem. This is somewhat more involved, which is why it is optional. If you have the time, or just want to get more into NLP, I highly recommend trying it out.

I suggest the following:

Carry out this tutorial:

https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272

When that works for you, change the data set to this one instead:
https://www.kaggle.com/datasets/paultimothymooney/recipenlg

This should give you a decent starting point for future specialized NLP projects that you wish to carry out.
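When switching to the recipe data, each record has to be turned into one training string. The sketch below shows one possible way to do that. It rests on assumptions: that each row has `title`, `ingredients` and `directions` fields, and that the latter two are stored as JSON-encoded lists. Check the actual file before relying on this. `format_recipe` and the example row are hypothetical.

```python
import json

EOS = "<|endoftext|>"  # GPT-2's end-of-text token, used here to separate recipes


def format_recipe(row):
    """Turn one recipe record into a single training string."""
    # Assumption: ingredients and directions are JSON-encoded lists of strings.
    ingredients = json.loads(row["ingredients"])
    directions = json.loads(row["directions"])
    lines = [f"Title: {row['title']}", "Ingredients:"]
    lines += [f"- {item}" for item in ingredients]
    lines.append("Directions:")
    lines += [f"{n}. {step}" for n, step in enumerate(directions, start=1)]
    return "\n".join(lines) + EOS


# A made-up record in the assumed layout.
example = {
    "title": "Toast",
    "ingredients": '["2 slices bread", "butter"]',
    "directions": '["Toast the bread.", "Spread with butter."]',
}

text = format_recipe(example)
print(text)
```

Ending every example with the end-of-text token gives the fine-tuned model a way to signal where a recipe stops.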
