<a href="https://colab.research.google.com/github/ZeroAlster/LLM-HugF-Course/blob/main/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers, what can they do?

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [58]:
import json

# Path to the notebook file itself (an .ipynb file, not the folder that contains it).
# The filename is an assumption taken from the Colab link above; change it to your actual notebook name.
path = '/content/drive/MyDrive/Colab Notebooks/Transformers.ipynb'

with open(path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Remove widgets metadata if present
if 'widgets' in data.get('metadata', {}):
    del data['metadata']['widgets']
    print("Removed 'widgets' metadata.")

# Save cleaned notebook
with open(path, 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=1)

print(f"Cleaned and saved notebook: {path}")

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

In [None]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445963859558105, 0.111976258456707, 0.043427448719739914]}
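The zero-shot result is a plain dictionary, with labels listed from highest to lowest score. A minimal sketch of reading the winning label, using the output above as a literal (the variable names `label_scores` and `top_label` are illustrative, not part of the library):

```python
# Result copied from the zero-shot cell above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445963859558105, 0.111976258456707, 0.043427448719739914],
}

# Pair each label with its score, then pick the highest-scoring label.
label_scores = dict(zip(result["labels"], result["scores"]))
top_label = max(label_scores, key=label_scores.get)

print(top_label)  # education
# In this output the scores add up to roughly 1.
print(abs(sum(result["scores"]) - 1.0) < 1e-6)  # True
```

Because the labels already arrive sorted by score, `result["labels"][0]` gives the same answer here; the explicit `max` only makes the intent visible.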

In [None]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

[{'generated_text': 'In this course, we will teach you how to understand and use '
                    'data flow and data interchange when handling user data. We '
                    'will be working with one or more of the most commonly used '
                    'data flows — data flows of various types, as seen by the '
                    'HTTP'}]

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

[{'generated_text': 'In this course, we will teach you how to manipulate the world and '
                    'move your mental and physical capabilities to your advantage.'},
 {'generated_text': 'In this course, we will teach you how to become an expert and '
                    'practice realtime, and with a hands on experience on both real '
                    'time and real'}]

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'sequence': 'This course will teach you all about mathematical models.',
  'score': 0.19619831442832947,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about computational models.',
  'score': 0.04052725434303284,
  'token': 38163,
  'token_str': ' computational'}]

In [None]:
from transformers import pipeline

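# aggregation_strategy="simple" groups sub-word tokens into whole entities (newer spelling of grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")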
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER', 'score': 0.99816, 'word': 'Sylvain', 'start': 11, 'end': 18}, 
 {'entity_group': 'ORG', 'score': 0.97960, 'word': 'Hugging Face', 'start': 33, 'end': 45}, 
 {'entity_group': 'LOC', 'score': 0.99321, 'word': 'Brooklyn', 'start': 49, 'end': 57}
]
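The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing. A small sketch using the output above as a literal (scores are omitted for brevity; `spans` is an illustrative name):

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the NER output above, without the scores.
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]

# Slice the original text with each entity's character offsets.
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]

for group, span in spans:
    print(group, "->", span)
```

The same slicing applies to the `start` and `end` values returned by the question-answering pipeline.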

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6385916471481323, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The '
                  'number of engineering graduates in the U.S. has declined in '
                  'traditional engineering disciplines such as mechanical, civil '
                  ', electrical, chemical, and aeronautical engineering . Rapidly '
                  'developing economies such as China and India, as well as other '
                  'industrial countries in Europe and Asia, continue to encourage '
                  'and advance engineering .'}]

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

## Causal language modeling

Fine-tune DistilGPT2 on the r/askscience subset of the ELI5 dataset, then use the fine-tuned model for inference.


In [None]:
!pip install transformers evaluate

In [None]:
!pip install -U datasets



In [None]:
from datasets import load_dataset


In [None]:
eli5 = load_dataset("eli5_category", split="train[:5000]")




In [None]:
eli5 = eli5.train_test_split(test_size=0.2)

In [None]:
eli5.shape

{'train': (4000, 8), 'test': (1000, 8)}
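The shapes above follow from the split arithmetic: 20% of 5000 rows is 1000 test rows, leaving 4000 for training. The sketch below mimics a shuffled split in plain Python to show that arithmetic. It is not how the `datasets` library implements `train_test_split`; the helper `split_indices` and the `seed` value are hypothetical and exist only for illustration.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices and cut off a test portion (illustrative helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(5000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 4000 1000
```

Fixing a seed is what makes a split reproducible between runs; without one, each run produces a different train/test partition.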

In [None]:
eli5["train"][0]

{'q_id': '7bj0wz',
 'title': "Why can't we start a farm in America where we breed a bunch of Rhinos or other endangered animals.",
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dpidu20'],
  'text': ["I'm going to point out the Giant Panda as an example. In zoos, they have a perfectly safe environment with some roam to roam, plenty of food and a few choices of potential mates. And they don't want to breed. Getting animals to breed in captivity is a hard task."],
  'score': [10],
  'text_urls': [[]]},
 'title_urls': ['url'],
 'selftext_urls': ['url']}

### Preprocessing Phase

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")


In [None]:
eli5 = eli5.flatten()

In [None]:
eli5["train"][0]

{'q_id': '7bj0wz',
 'title': "Why can't we start a farm in America where we breed a bunch of Rhinos or other endangered animals.",
 'selftext': '',
 'category': 'Biology',
 'subreddit': 'explainlikeimfive',
 'answers.a_id': ['dpidu20'],
 'answers.text': ["I'm going to point out the Giant Panda as an example. In zoos, they have a perfectly safe environment with some roam to roam, plenty of food and a few choices of potential mates. And they don't want to breed. Getting animals to breed in captivity is a hard task."],
 'answers.score': [10],
 'answers.text_urls': [[]],
 'title_urls': ['url'],
 'selftext_urls': ['url']}
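Comparing the two printed records, `flatten()` turned the nested `answers` dictionary into top-level columns named `answers.a_id`, `answers.text`, and so on. A plain-Python sketch of that renaming on a single record (the helper `flatten_record` is hypothetical and only imitates the effect; it is not the library's code):

```python
def flatten_record(record, parent_key=""):
    """Recursively lift nested dict fields to dotted top-level keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, full_key))
        else:
            flat[full_key] = value
    return flat


# Shortened version of the record shown above.
record = {"q_id": "7bj0wz", "answers": {"a_id": ["dpidu20"], "score": [10]}}
flat = flatten_record(record)
print(list(flat))  # ['q_id', 'answers.a_id', 'answers.score']
```

After flattening, the answer texts are reachable as `examples["answers.text"]`, which is the column the preprocessing function reads.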

In [None]:
examples = eli5["train"][:3]
examples

{'q_id': ['7bj0wz', '7d64tb', '7fgkbr'],
 'title': ["Why can't we start a farm in America where we breed a bunch of Rhinos or other endangered animals.",
  'How does this picture disappear?',
  'Without animal testing, how do cosmetics and skin care companies determine that their products are safe?'],
 'selftext': ['',
  '[Here’s the post with the picture]( URL_0 ) Stare at a specific point for a couple seconds and you will see it disappears. Why?',
  "I'm a biologist by training and work on mouse models for drug testing. With medicine, every single approved drug was first extensively tested in animal models. My question is, without animal testing, how do cosmetic and skin care companies know that their products won't have adverse effects on humans? Do they test solely on humans? That sounds very dangerous..."],
 'category': ['Biology', 'Biology', 'Chemistry'],
 'subreddit': ['explainlikeimfive', 'explainlikeimfive', 'explainlikeimfive'],
 'answers.a_id': [['dpidu20'],
  ['dpvav38'],
 

In [None]:
# Join all answers of each example into a single string before tokenizing.

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

# Quick check of the input format passed to the tokenizer.
a = [" ".join(x) for x in examples["answers.text"]]
a  # " ".join(x) puts a space between the answer strings in each list, not between the characters of a word

["I'm going to point out the Giant Panda as an example. In zoos, they have a perfectly safe environment with some roam to roam, plenty of food and a few choices of potential mates. And they don't want to breed. Getting animals to breed in captivity is a hard task.",
 'Your brain stops telling you that it is there. Basically, your brain is concerned about *changes* to your environment. New or changing stimuli. Things that never change are - simply put - unimportant. It\'s the reason you never "see" your nose or "feel" your tongue in your mouth (unless you consciously concentrate on it). It\'s unimportant stimuli that doesn\'t ever change. So, when your brain is receiving a constant signal from some constant source, it eventually tunes it out. It\'s why you eventually stop smelling a bad smell (olfactory fatigue) and aren\'t constantly feeling the clothes touching your body (again, unless you focus on it). But, this is actually bad for sight. Vision is pretty important and purely motion-
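Each entry of `answers.text` is a list of strings, so `" ".join(x)` glues whole answers together. Spaces would only appear between characters if a single string were passed in. A tiny sketch with made-up strings shows the difference:

```python
# A list of answers, like one entry of examples["answers.text"] (toy strings).
answers = ["First answer.", "Second answer."]

joined_answers = " ".join(answers)  # joins whole strings
joined_chars = " ".join("abc")      # iterates over characters instead

print(joined_answers)  # First answer. Second answer.
print(joined_chars)    # a b c
```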

In [None]:
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)


Token indices sequence length is longer than the specified maximum sequence length for this model (1282 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2642 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2879 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1051 > 1024). Running this sequence through the model will result in indexing errors



Token indices sequence length is longer than the specified maximum sequence length for this model (1985 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1896 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1197 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4108 > 1024). Running this sequence through the model will result in indexing errors


In [None]:
tokenized_eli5.shape

{'train': (4000, 2), 'test': (1000, 2)}

In [None]:
tokenized_eli5["train"][10]  # the tokenizer also returns an attention_mask: 1 marks a real token, 0 marks padding (no padding yet, so all 1s here)

{'input_ids': [29,
  14927,
  286,
  25304,
  326,
  314,
  1101,
  4737,
  329,
  281,
  7468,
  11,
  475,
  3588,
  470,
  18681,
  703,
  345,
  1833,
  1243,
  30,
  1318,
  389,
  257,
  3155,
  286,
  3689,
  13,
  679,
  1244,
  655,
  307,
  599,
  13660,
  18149,
  1976,
  268,
  7510,
  290,
  655,
  307,
  2642,
  13,
  1320,
  318,
  281,
  3038,
  13,
  1471,
  3737,
  339,
  318,
  2111,
  329,
  257,
  517,
  11800,
  966,
  13,
  50125,
  602,
  2148,
  1321,
  11,
  981,
  262,
  4547,
  318,
  262,
  5798,
  286,
  262,
  36974,
  1428,
  13,
  1002,
  345,
  2993,
  262,
  1321,
  1541,
  345,
  815,
  307,
  1498,
  284,
  1833,
  262,
  3721,
  1541,
  26,
  611,
  345,
  460,
  892,
  1576,
  284,
  24772,
  618,
  340,
  318,
  4893,
  788,
  345,
  714,
  1541,
  1833,
  262,
  3721,
  13,
  1471,
  3863,
  339,
  318,
  1642,
  262,
  966,
  326,
  611,
  345,
  3521,
  470,
  4174,
  3511,
  290,
  3785,
  340,
  503,
  1231,
  1037,
  788,
  345,
  561,
  30

In [None]:
len(tokenized_eli5["train"][10]['attention_mask'])

340

In [None]:
len(tokenized_eli5["train"][10]['input_ids'])

340
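The two lengths match because the attention mask holds one entry per token. The mask becomes informative once sequences of different lengths are padded into a batch: padded positions get a 0 so the model can ignore them. A minimal plain-Python sketch of that idea follows. The helper `pad_batch` and the `pad_id=0` value are assumptions for illustration; GPT-2 style tokenizers define no padding token by default.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token lists to equal length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[40, 1101, 1016], [7120]])
print(batch["input_ids"])       # [[40, 1101, 1016], [7120, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```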

In [None]:
# Tokenized sequences may be longer than the model's maximum input length (its context window).
# To handle this, concatenate all token sequences first, then split the result into fixed-size blocks.

block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that does not fill a whole block. Padding would be an
    # alternative if the model supported it; customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
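To see what the grouping does, it helps to run the same logic on toy data with a tiny block size. The sketch below repeats the function with `block_size` as a parameter (a small change made only so the example is self-contained) and uses made-up token IDs:

```python
def group_texts(examples, block_size):
    # Concatenate the per-example lists into one long list per column.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split each column into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


toy = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts(toy, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens with a block size of 4 yield two full blocks, and the last two tokens (9 and 10) are dropped. Block boundaries ignore the original example boundaries, so one block can contain the end of one answer and the start of the next.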

In [None]:
# Quick checks of the preprocessing functions above.
examples=tokenized_eli5["train"][:2]
examples

{'input_ids': [[40,
   1101,
   1016,
   284,
   966,
   503,
   262,
   17384,
   41112,
   355,
   281,
   1672,
   13,
   554,
   1976,
   16426,
   11,
   484,
   423,
   257,
   7138,
   3338,
   2858,
   351,
   617,
   35563,
   284,
   35563,
   11,
   6088,
   286,
   2057,
   290,
   257,
   1178,
   7747,
   286,
   2785,
   27973,
   13,
   843,
   484,
   836,
   470,
   765,
   284,
   15939,
   13,
   18067,
   4695,
   284,
   15939,
   287,
   33763,
   318,
   257,
   1327,
   4876,
   13],
  [7120,
   3632,
   9911,
   5149,
   345,
   326,
   340,
   318,
   612,
   13,
   20759,
   11,
   534,
   3632,
   318,
   5213,
   546,
   1635,
   36653,
   9,
   284,
   534,
   2858,
   13,
   968,
   393,
   5609,
   25973,
   13,
   11597,
   326,
   1239,
   1487,
   389,
   532,
   2391,
   1234,
   532,
   555,
   18049,
   13,
   632,
   338,
   262,
   1738,
   345,
   1239,
   366,
   3826,
   1,
   534,
   9686,
   393,
   366,
   36410,
   1,
   534,
   11880,
  