# Getting Started with Hugging Face Transformers
---
### Table of Contents
- Description
- Requirements
- Setup
- Usage

**Description**
This Jupyter Notebook demonstrates how to use GPT-2 and DistilBERT with Hugging Face Transformers to perform two tasks: (1) generate text, and (2) classify text.

This notebook was generated by [Galactica](https://galactica.org) and cleaned up to be usable by [crumb](https://twitter.com/aicrumb).

The original generated text is located at [this link](https://galactica.org/?max_new_tokens=1000&prompt=Jupyter+Notebook+for+Getting+Started+with+GPT2+on+Huggingface+Transformers) (0-shot generation, first try).

**Requirements**
To run this notebook, you will need Python 3.6+, Jupyter, PyTorch, scikit-learn, pandas, and Transformers.

### Setup
---
Install the library with `pip` (the rest of the requirements are pre-installed on Google Colab):

In [1]:
%pip install transformers -q
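Before moving on, it can help to confirm that the packages listed under Requirements are actually importable in the current kernel. The sketch below is one way to do that with the standard library only; the helper name `installed_versions` is hypothetical, and `importlib.metadata` assumes Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as published on PyPI (scikit-learn provides the `sklearn` module)
required = ["transformers", "torch", "scikit-learn", "pandas"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can be added with another `%pip install` cell.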


### Usage
---
**Generate text**

You can use GPT-2 to generate text by passing the model a context (e.g., a prompt) and calling `generate()`, which returns the token IDs of the prompt plus its continuation. Decoding those IDs with the tokenizer turns them back into text:

In [2]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and the model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

# Tokenize the prompt; the result holds both input_ids and attention_mask
context = "My name is"
inputs = tokenizer(context, return_tensors="pt").to(device)

# Generate text (greedy decoding by default)
output_ids = model.generate(
    **inputs,
    max_length=50,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
)

# Decode text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)
# Prints the prompt followed by a continuation, e.g. "My name is John. ..."

My name is John. I'm a man of God. I'm a man of God. I'm a man of God. I'm a man of God. I'm a man of God. I'm a man of God. I'm a
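The repeated sentence in the output above is typical of greedy decoding, which always picks the single most likely next token. One common way to get more varied text is to sample instead. The sketch below wraps `generate()` with sampling options (`do_sample`, `top_k`, `top_p`, `no_repeat_ngram_size`, `num_return_sequences`); the helper name `sample_continuations` and the specific values are illustrative assumptions, not settings from the original notebook.

```python
import torch


def sample_continuations(model, input_ids, attention_mask, pad_token_id,
                         num_sequences=3, max_new_tokens=30, seed=0):
    """Return token IDs for several sampled continuations of one prompt."""
    torch.manual_seed(seed)  # make the random draws repeatable
    return model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        do_sample=True,           # sample instead of always taking the top token
        top_k=50,                 # restrict sampling to the 50 most likely tokens
        top_p=0.95,               # then keep the smallest set covering 95% probability
        no_repeat_ngram_size=2,   # never produce the same 2-token sequence twice
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_sequences,
        pad_token_id=pad_token_id,
    )
```

With the notebook's `model` and `tokenizer`, this could be called as `enc = tokenizer("My name is", return_tensors="pt").to(model.device)` followed by `sample_continuations(model, enc["input_ids"], enc["attention_mask"], tokenizer.eos_token_id)`. Each returned row can then be decoded with `tokenizer.decode(row, skip_special_tokens=True)`.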


**Classify text**

You can use DistilBERT to classify text by taking the embedding of the first (`[CLS]`) token of each text from the model's `last_hidden_state` and passing those embeddings to a logistic regression classifier.
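The cell that follows slices the model output with `last_hidden_state[:, 0, :]`. To make that indexing concrete, here is a small sketch on a made-up tensor standing in for the real model output. The shape `(2, 4, 3)` is an assumption chosen for readability; the hidden size of `distilbert-base-uncased` is 768.

```python
import torch

# Stand-in for model(...).last_hidden_state: 2 texts, 4 tokens each, hidden size 3.
# The values are made up; only the shape and the slicing matter here.
last_hidden_state = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)

# Keep every text (:), only the first token (0), and every hidden feature (:)
first_token_embeddings = last_hidden_state[:, 0, :]

print(first_token_embeddings.shape)  # torch.Size([2, 3])
print(first_token_embeddings)
```

The result has one row per text, which is the 2-D feature matrix shape that scikit-learn classifiers expect.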

In [3]:
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import torch

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Load the dataset
train_dataset = pd.DataFrame([
    {"text": "This movie is terrible", "label": 0},
    {"text": "This movie is great", "label": 1},
])

# Encode the dataset; the tokenizer output includes the attention mask,
# which tells the model to ignore padding tokens
train_inputs = tokenizer(list(train_dataset["text"]), padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # Embedding of the first ([CLS]) token of each text
    train_features = model(**train_inputs).last_hidden_state[:, 0, :].numpy()

train_labels = train_dataset["label"].to_numpy()

# Train the classifier
classifier = LogisticRegression()
classifier.fit(train_features, train_labels)

# Classify a text
test_dataset = pd.DataFrame([{"text": "This movie is good"}])
test_inputs = tokenizer(
    list(test_dataset["text"]), padding=True, truncation=True, return_tensors="pt"
)
with torch.no_grad():
    test_features = model(**test_inputs).last_hidden_state[:, 0, :].numpy()
test_labels = classifier.predict(test_features)

print("Test Labels")
print(test_labels)
# Should output [1]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Test Labels
[1]
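The printed `[1]` is a hard label. With so few training examples, it can be more informative to look at the class probabilities, which scikit-learn exposes through `predict_proba`. The sketch below uses made-up 2-D vectors as stand-ins for the DistilBERT embeddings, so it runs without downloading a model; the feature values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 2-D stand-ins for sentence embeddings (real DistilBERT vectors have 768 dimensions)
train_features = np.array([[-1.0, -0.5], [-0.8, -1.2], [1.0, 0.7], [0.9, 1.1]])
train_labels = np.array([0, 0, 1, 1])  # 0 = negative, 1 = positive, as in the notebook

classifier = LogisticRegression()
classifier.fit(train_features, train_labels)

test_features = np.array([[0.6, 0.4]])
probabilities = classifier.predict_proba(test_features)  # shape (n_texts, n_classes)

# Columns follow the order of classifier.classes_, here [0, 1]
for label, p in zip(classifier.classes_, probabilities[0]):
    print(f"P(label={label}) = {p:.3f}")
```

In the notebook itself, the same call would be `classifier.predict_proba(...)` on the embeddings of the test text.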
