In [1]:
import pandas as pd
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

In [2]:
sentence = "Hello, I'm a language model"

In [3]:
p = pipeline(task='text-generation', model='gpt2')

Device set to use cpu


In [7]:
result1 = p(sentence)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [8]:
result1[0]['generated_text']

'Hello, I\'m a language modeler, a programmer, and I am fascinated with the many different ways in which it can be used to solve real-world problems. It is a fascinating and fascinating topic.\n\nIn my time, I\'ve never been a programmer, so I\'m not sure if I would be interested in learning more about programming. I\'m still working on the programming language, and I\'m interested in getting some fun and interesting projects written and tested.\n\nI believe that the answer to all of the problems that I\'m trying to solve is to write a very simple program. I believe that the most important thing is to write code that works on the real world, but do it in a way that is efficient and efficient.\n\n"You have to be extremely clever and hardworking. You have to have a lot of experience, a lot of experience, and you have to be a very good programmer, and you have to be very careful about what you write. If you use the wrong word, you can get really confused. If you use the wrong word, you ca

In [9]:
# return_full_text=False drops the prompt and returns only the continuation
result2 = p(sentence, return_full_text=False)
result2[0]['generated_text']

'. I\'m trying to get more expressive.\n\nYou know I\'m a big fan of the F# toolchain, though its not what I\'m passionate about. It\'s just that there are a lot of tools out there that I wouldn\'t call "official" F#, and they\'re mostly things I haven\'t used much. I\'d say it\'s been a good year for F#.\n\nIt\'s funny. You know, in fact, that I\'m a huge F# fan. It\'s very cool, very fun to use, to work with. It\'s really cool that there\'s a lot of tools out there that I wouldn\'t call "official" F#, and they\'re mostly things I wouldn\'t call "official" F#. But I\'m sure some people would say, "Oh yes, I know. The one that\'s official is here. The one that I\'m just not going to use." But that\'s my experience.\n\nAnd so you\'re saying, "Well, we\'re going to use F# in the future."\n\nDo you have any projects out there that you\'d like to start using in the future?\n\nI\'m going to start doing a lot of other things, but that stuff is'

In [12]:
# Limit the continuation to 10 newly generated tokens
result3 = p(sentence, return_full_text=False, max_new_tokens=10, truncation=True)
result3[0]['generated_text']

'er. I write complex and complex languages. I'

In [13]:
# Plain T5 checkpoints are not instruction-tuned; the FLAN-T5 variants are.
# Alternatives: t5-small, t5-large, google/flan-t5-small, google/flan-t5-base, google/flan-t5-large
model_name = 't5-base'

In [14]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [16]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [17]:
data = pd.read_csv('train_en.txt', sep='\t')
sentences = data['Sentence'].values.tolist()
labels = data['Style'].values.tolist()

In [18]:
data

Unnamed: 0,Sentence,Style
0,he had steel balls too !,toxic
1,"dude should have been taken to api , he would ...",toxic
2,"im not gonna sell the fucking picture , i just...",toxic
3,the garbage that is being created by cnn and o...,toxic
4,the reason they dont exist is because neither ...,toxic
...,...,...
25035,both sides need to calm down or we are heading...,neutral
25036,i 'm sitting here in my calm german city conte...,neutral
25037,"dude , get a clue .",neutral
25038,"I was so high, it was amazing.",neutral


In [19]:
sample_sentence = sentences[4]
sample_sentence

'the reason they dont exist is because neither is a pathological liar like trump .'

In [30]:
prompt = f"Classify the following text into either 'TOXIC' or 'NEUTRAL': {sample_sentence}"
prompt

"Classify the following text into either 'TOXIC' or 'NEUTRAL': the reason they dont exist is because neither is a pathological liar like trump ."

In [31]:
tokens = tokenizer(prompt, return_tensors='pt')
tokens

{'input_ids': tensor([[ 4501,  4921,     8,   826,  1499,   139,   893,     3,    31,  5647,
             4,  4666,    31,    42,     3,    31,  4171,  6675, 21415,    31,
            10,     8,  1053,    79,  2483,  3223,    19,   250,  7598,    19,
             3,     9,  2071,  4478,     3,    40,    23,   291,   114,     3,
          2666,  1167,     3,     5,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [32]:
# Unpack the encoding so generate() receives the attention mask as well as the input IDs
output_ids = model.generate(**tokens)
output_ids

tensor([[    0, 32099,     3,    31,  5647,     4,  4666,    31,    42,     3,
            31,  4171,  6675, 21415,    31,    10,     8,  1053,    79,  2483,
          3223]])

In [33]:
tokenizer.decode(output_ids[0], skip_special_tokens=True)

"'TOXIC' or 'NEUTRAL': the reason they dont exist"

In [36]:
examples = 'Text: delete the page and shut up\nClass: TOXIC\nText: I heard it was on the news.\nClass: NEUTRAL\n'
examples

'Text: delete the page and shut up\nClass: TOXIC\nText: I heard it was on the news.\nClass: NEUTRAL\n'
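
The demonstrations string above is written by hand. As a minimal sketch, the same `Text:`/`Class:` blocks can be generated from a list of (text, label) pairs, followed by the classification instruction and an open `Class:` slot. The helper name `build_few_shot_prompt` and its parameters are hypothetical; they are not part of the notebook or of any library.

```python
def build_few_shot_prompt(examples, text, labels=("TOXIC", "NEUTRAL")):
    """Assemble a few-shot classification prompt (hypothetical helper).

    examples: list of (text, label) demonstration pairs.
    text: the sentence to classify.
    """
    # One "Text: ... / Class: ..." block per demonstration.
    shots = "".join(f"Text: {t}\nClass: {c}\n" for t, c in examples)
    options = " or ".join(f"'{label}'" for label in labels)
    # End with an open "Class:" slot for the model to fill in.
    return f"{shots}\nClassify the following text into either {options}: {text}\nClass:"


demo = [
    ("delete the page and shut up", "TOXIC"),
    ("I heard it was on the news.", "NEUTRAL"),
]
few_shot_prompt = build_few_shot_prompt(demo, "dude , get a clue .")
print(few_shot_prompt)
```

Keeping the demonstrations as data rather than as one long string literal makes it easy to swap them out without retyping the template.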

In [41]:
prompt = f"{examples}\nClassify the following text into either 'TOXIC' or 'NEUTRAL': {sample_sentence}\nClass:"
prompt

"Text: delete the page and shut up\nClass: TOXIC\nText: I heard it was on the news.\nClass: NEUTRAL\n\nClassify the following text into either 'TOXIC' or 'NEUTRAL': the reason they dont exist is because neither is a pathological liar like trump .\nClass:"

In [42]:
tokens = tokenizer(prompt, return_tensors='pt')
tokens

{'input_ids': tensor([[ 5027,    10,  9268,     8,   543,    11,  6979,    95,  4501,    10,
          3001,     4,  4666,  5027,    10,    27,  1943,    34,    47,    30,
             8,  1506,     5,  4501,    10,     3,  4171,  6675, 21415,  4501,
          4921,     8,   826,  1499,   139,   893,     3,    31,  5647,     4,
          4666,    31,    42,     3,    31,  4171,  6675, 21415,    31,    10,
             8,  1053,    79,  2483,  3223,    19,   250,  7598,    19,     3,
             9,  2071,  4478,     3,    40,    23,   291,   114,     3,  2666,
          1167,     3,     5,  4501,    10,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]])}

In [43]:
# Unpack the encoding so generate() receives the attention mask as well as the input IDs
output_ids = model.generate(**tokens)
output_ids

tensor([[    0,  4501,    10,     3,  4171,  6675, 21415,  4501,  4921,     8,
           826,  1499,   139,   893,     3,    31,  5647,     4,  4666,    31,
            42]])

In [44]:
tokenizer.decode(output_ids[0], skip_special_tokens=True)

"Class: NEUTRAL Classify the following text into either 'TOXIC' or"

In [45]:
sample_sentence

'the reason they dont exist is because neither is a pathological liar like trump .'
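
The few-shot output above begins with `Class: NEUTRAL` and then echoes part of the prompt, so the label has to be pulled out of the decoded string before it can be compared with the dataset. The sketch below shows one way to do that; `extract_label` is a hypothetical helper, and it assumes the label (optionally preceded by `Class:`) appears at the very start of the output.

```python
import re

# Matches an optional "Class:" prefix followed by one of the two labels
# at the very start of the decoded output.
_LABEL_RE = re.compile(r"\s*(?:Class:\s*)?(TOXIC|NEUTRAL)\b", re.IGNORECASE)


def extract_label(decoded):
    """Return 'TOXIC' or 'NEUTRAL' if the output starts with a label, else None."""
    match = _LABEL_RE.match(decoded)
    return match.group(1).upper() if match else None


# Few-shot output from the notebook: starts with a label.
print(extract_label("Class: NEUTRAL Classify the following text into either 'TOXIC' or"))
# Zero-shot output from the notebook: only echoes the prompt, so no label is found.
print(extract_label("'TOXIC' or 'NEUTRAL': the reason they dont exist"))
```

Note that the dataset labels this sentence (row 4) as `toxic`, so the extracted `NEUTRAL` is a misclassification rather than a correct answer.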

In [46]:
embedding_model = SentenceTransformer('all-distilroberta-v1')
embeddings = embedding_model.encode(sentences[:20], batch_size=64, show_progress_bar=True)

In [48]:
embeddings.shape

(20, 768)

In [49]:
embeddings

array([[ 0.05055266, -0.06140865, -0.01228614, ..., -0.03881477,
        -0.06116227,  0.0181213 ],
       [-0.00400064, -0.00940307, -0.00860855, ..., -0.06624117,
        -0.0465671 , -0.02007012],
       [ 0.02266759, -0.02643988,  0.00947504, ..., -0.04479922,
        -0.05901453,  0.01749855],
       ...,
       [-0.0221903 , -0.04576224,  0.02153144, ...,  0.04361866,
         0.01497844,  0.04335182],
       [ 0.06613797, -0.06282376, -0.03906498, ..., -0.0683265 ,
        -0.07153116,  0.07797875],
       [ 0.04563302, -0.01121642, -0.01660856, ..., -0.04931672,
        -0.05503166,  0.07557259]], dtype=float32)

In [50]:
query_emb = embedding_model.encode([sample_sentence], batch_size=64, show_progress_bar=True)

In [51]:
query_emb

array([[-3.89828682e-02, -3.87882888e-02,  1.50467232e-02,
        -8.45358446e-02, -1.11309430e-02, -4.69603091e-02,
        -5.80696994e-03,  5.47401458e-02,  5.31626940e-02,
        -2.61463057e-02,  6.25139661e-03, -4.00162709e-04,
        -1.96443740e-02,  2.91153900e-02, -3.76911387e-02,
         2.70483624e-02, -4.54815999e-02,  3.19135264e-02,
        -1.50244581e-02,  3.06181852e-02, -5.93253523e-02,
         2.57888306e-02, -5.10807112e-02,  5.85217923e-02,
        -7.75822066e-03,  2.03761943e-02,  3.13512534e-02,
         6.37368765e-03, -6.70893714e-02,  6.55516284e-03,
        -9.64984018e-03, -1.67814791e-02,  4.00716551e-02,
         3.08850650e-02,  3.76150869e-02,  1.82156339e-02,
        -6.94985166e-02,  2.29601357e-02,  4.17866744e-02,
         2.11373698e-02,  5.40892184e-02,  5.29876277e-02,
         1.55742280e-02,  6.41989782e-02,  3.59797403e-02,
         2.18239576e-02, -9.52616036e-02,  5.04507800e-04,
         4.00748756e-03, -3.48801166e-03, -3.68346833e-0

In [52]:
query_emb.shape

(1, 768)

In [53]:
util.semantic_search(query_emb, embeddings, top_k=10)

[[{'corpus_id': 4, 'score': 1.0000001192092896},
  {'corpus_id': 17, 'score': 0.4185289740562439},
  {'corpus_id': 11, 'score': 0.3201321065425873},
  {'corpus_id': 19, 'score': 0.2643239200115204},
  {'corpus_id': 6, 'score': 0.25656458735466003},
  {'corpus_id': 15, 'score': 0.23332402110099792},
  {'corpus_id': 3, 'score': 0.19737792015075684},
  {'corpus_id': 9, 'score': 0.18650716543197632},
  {'corpus_id': 10, 'score': 0.17953705787658691},
  {'corpus_id': 16, 'score': 0.16733801364898682}]]
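
By default `util.semantic_search` ranks corpus entries by cosine similarity, which is why corpus entry 4 (the query sentence itself) scores approximately 1.0. The following is a rough pure-Python sketch of that ranking for a single query, not the library's implementation; the function name `cosine_search` is hypothetical, and it assumes no vector is all zeros.

```python
import math


def cosine_search(query, corpus, top_k=10):
    """Rank corpus vectors by cosine similarity to the query vector."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query)
    hits = []
    for corpus_id, vec in enumerate(corpus):
        dot = sum(q * x for q, x in zip(query, vec))
        hits.append({"corpus_id": corpus_id, "score": dot / (q_norm * norm(vec))})
    # Highest similarity first, keep only the top_k entries.
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:top_k]


toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vectors, not real embeddings
print(cosine_search([1.0, 0.0], toy_corpus, top_k=2))
```

The returned list mirrors the per-query structure shown above: dictionaries with a `corpus_id` and a `score`, sorted from most to least similar.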

In [54]:
sentences[17]

'disgusting , immoral people , rightwing extremists , who should be driven out of office asap .'

In [55]:
labels[17]

'toxic'
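
The notebook stops after inspecting the nearest neighbour. One plausible next step, which is an assumption and not something the notebook does, is to use the retrieved neighbours as few-shot demonstrations. The sketch below converts search hits into (sentence, label) pairs and skips the query's own corpus entry; `pick_demonstrations` and the toy data are hypothetical stand-ins for the notebook's `sentences` and `labels`.

```python
def pick_demonstrations(hits, sentences, labels, exclude_id=None, k=2):
    """Turn semantic-search hits into (sentence, label) demonstration pairs.

    hits: the per-query list of {'corpus_id', 'score'} dicts.
    exclude_id: corpus index of the query itself, skipped if present.
    """
    picked = []
    for hit in hits:
        idx = hit["corpus_id"]
        if idx == exclude_id:
            continue  # the query trivially matches itself with score ~1.0
        picked.append((sentences[idx], labels[idx].upper()))
        if len(picked) == k:
            break
    return picked


# Toy stand-ins for the notebook's `sentences` and `labels` lists.
toy_sentences = ["a", "b", "c", "d"]
toy_labels = ["toxic", "neutral", "toxic", "neutral"]
toy_hits = [
    {"corpus_id": 2, "score": 1.0},
    {"corpus_id": 0, "score": 0.42},
    {"corpus_id": 3, "score": 0.32},
    {"corpus_id": 1, "score": 0.26},
]
print(pick_demonstrations(toy_hits, toy_sentences, toy_labels, exclude_id=2))
```

In the notebook this would correspond to passing the first element of the `util.semantic_search` result together with `sentences`, `labels`, and `exclude_id=4`.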