- **Author ⚛** **Kandimalla Hemanth**
- **Date last modified ⚛** **09-01-2024**
- **E-mail⚛** **speechhemanth2@gmail.com**
- **Google Colab ⚛** **Exploring the Transformers Module**



In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 datasets

# `Exploring the Transformers Modules!`

In [None]:
#@title
from IPython.display import HTML

HTML('Kandimalla Hemanth exploring the Transformers Modules')

| Task                         | Description                                                    | Modality          | Pipeline Identifier                      |
|------------------------------|----------------------------------------------------------------|-------------------|-------------------------------------------|
| Text classification          | Assign a label to a given sequence of text                     | NLP               | `pipeline(task="sentiment-analysis")`      |
| Text generation              | Generate text given a prompt                                   | NLP               | `pipeline(task="text-generation")`          |
| Summarization                | Generate a summary of a sequence of text or document           | NLP               | `pipeline(task="summarization")`           |
| Image classification         | Assign a label to an image                                      | Computer Vision   | `pipeline(task="image-classification")`    |
| Image segmentation           | Assign a label to each individual pixel of an image            | Computer Vision   | `pipeline(task="image-segmentation")`      |
| Object detection             | Predict the bounding boxes and classes of objects in an image  | Computer Vision   | `pipeline(task="object-detection")`        |
| Audio classification         | Assign a label to some audio data                               | Audio             | `pipeline(task="audio-classification")`    |
| Automatic speech recognition  | Transcribe speech into text                                    | Audio             | `pipeline(task="automatic-speech-recognition")` |
| Visual question answering     | Answer a question about the image, given an image and a question| Multimodal        | `pipeline(task="vqa")`                     |
| Document question answering   | Answer a question about a document, given an image and a question| Multimodal        | `pipeline(task="document-question-answering")` |
| Image captioning             | Generate a caption for a given image                            | Multimodal        | `pipeline(task="image-to-text")`           |


| Argument             | Description                                                                                                     | Type                                              | Default Value                    |
|----------------------|-----------------------------------------------------------------------------------------------------------------|---------------------------------------------------|----------------------------------|
| model                | Pre-trained model to be used for the pipeline.                                                                  | Union["PreTrainedModel", "TFPreTrainedModel"]     | Required (no default)            |
| tokenizer            | Tokenizer associated with the pre-trained model. If not provided, it will be instantiated based on the model. | Optional[PreTrainedTokenizer]                     | None                             |
| feature_extractor    | Feature extractor associated with the pre-trained model.                                                        | Optional[PreTrainedFeatureExtractor]              | None                             |
| image_processor      | Image processor for handling image inputs (if applicable).                                                      | Optional[BaseImageProcessor]                      | None                             |
| modelcard            | Model card providing additional information about the model.                                                     | Optional[ModelCard]                              | None                             |
| framework            | Framework used for the model (e.g., "tf" for TensorFlow, "pt" for PyTorch). If not provided, it is inferred.     | Optional[str]                                    | None                             |
| task                 | Specifies the type of task the pipeline is designed for.                                                          | str                                               | "" (empty string)                |
| args_parser          | Custom argument handler for parsing additional keyword arguments.                                                | ArgumentHandler                                  | None                             |
| device               | Device on which the pipeline runs (CPU or GPU). If None, it is inferred based on available hardware.           | Union[int, "torch.device"]                        | None (inferred)                  |
| torch_dtype          | Data type used by PyTorch (e.g., "float32", "float64").                                                          | Optional[Union[str, "torch.dtype"]]               | None                             |
| binary_output        | If True, outputs are stored in pickle format for large tensor objects.                                           | bool                                              | False                            |
| kwargs               | Additional keyword arguments that can be passed to the underlying model.                                          |                                                   |                                  |


In [28]:
from transformers import pipeline, DistilBertTokenizer, DistilBertForSequenceClassification

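# Load the base DistilBERT checkpoint with a sequence-classification head.
# Note: this checkpoint is not fine-tuned for sentiment analysis, so the head is randomly
# initialized (see the warning in the output) and its predictions are not meaningful.
# A fine-tuned checkpoint such as "distilbert-base-uncased-finetuned-sst-2-english" gives real sentiment labels.
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name)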

# Instantiate a tokenizer associated with the pre-trained model
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# Create a sentiment analysis pipeline with specified arguments
classifier = pipeline(
    task="sentiment-analysis",          # Specifies the type of task for the pipeline
    model=model,                        # Pre-trained model to be used
    tokenizer=tokenizer,                # Tokenizer associated with the model
    framework="pt",                     # Framework used for the model (PyTorch in this case)
    device='cpu',                       # Device on which the pipeline runs (CPU here; pass 0 for the first GPU)
    binary_output=True,                 # Store outputs in pickle format for large tensor objects
    feature_extractor=None,             # No feature extractor used in this case
    image_processor=None,               # No image processor used in this case
    modelcard=None,                     # No model card provided in this case
    torch_dtype="float32"               # Data type used by PyTorch
)

# Test the pipeline by analyzing the sentiment of a sample sentence
result = classifier("I'm really excited about this!")
print(result)
result = result[0]  # The pipeline returns a list with one dict per input text
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")




Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



[{'label': 'LABEL_1', 'score': 0.5345535278320312}]
label: LABEL_1, with score: 0.5346
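The score of roughly 0.53 for `LABEL_1` is what a randomly initialized two-class head tends to produce: both logits are nearly equal, so neither class clearly wins. The following is a minimal sketch in plain Python, assuming the usual softmax-then-argmax post-processing for single-label classification. The logits and the helper names `softmax` and `to_prediction` are made up for illustration and are not part of the transformers API.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, id2label):
    """Pick the most probable class and report it as a label/score dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits from an untrained two-class head: nearly equal values
logits = [-0.07, 0.07]
# Generic label names, as used when a checkpoint defines no task-specific labels
id2label = {0: "LABEL_0", 1: "LABEL_1"}

prediction = to_prediction(logits, id2label)
print(prediction)  # a score only slightly above 0.5
```

With a fine-tuned head the logits are far apart, so the winning score moves close to 1.0 and the label names carry real meaning.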


In [None]:
import torch
from transformers import pipeline, Wav2Vec2ForCTC, Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor

# Load a pre-trained model for automatic speech recognition
model_name = "facebook/wav2vec2-base-960h"
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# Instantiate the tokenizer and feature extractor associated with the pre-trained model.
# When the model is passed as an object, the pipeline may not be able to infer them from a checkpoint name.
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(model_name)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)

# Create an automatic speech recognition pipeline with specified arguments
speech_recognizer = pipeline(
    task="automatic-speech-recognition",  # Specifies the type of task for the pipeline
    model=model,                          # Pre-trained model to be used
    tokenizer=tokenizer,                  # Tokenizer associated with the model
    feature_extractor=feature_extractor,  # Converts raw audio into model inputs
    framework="pt",                       # Framework used for the model (PyTorch in this case)
    device=torch.device("cpu"),           # Device on which the pipeline runs (CPU here)
    binary_output=True,                   # Store outputs in pickle format for large tensor objects
    image_processor=None,                 # No image processor used in this case
    modelcard=None,                       # No model card provided in this case
    torch_dtype="float32"                 # Data type used by PyTorch
)

# Test the pipeline by transcribing speech to text
audio_file = "https://www2.cs.uic.edu/~i101/SoundFiles/BabyElephantWalk60.wav"  # Replace with the path to your audio file
result = speech_recognizer(audio_file)      # Transcribe speech from the given audio file
print(result)


### `Dataset --> load_dataset`

| Argument             | Description                                                                                                                                   | Type                                                                   | Default Value                    |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|----------------------------------|
| path                 | Hub repository ID, or local path to a dataset directory or loading script.                                                                    | str                                                                    | Required                         |
| name                 | Name of the dataset configuration (for example a language subset such as `en-US`).                                                            | str \| None                                                            | None                             |
| data_dir             | Directory where dataset files are stored.                                                                                                     | str \| None                                                            | None                             |
| data_files           | Paths to specific data files or collections of data files comprising the dataset.                                                              | str \| Sequence[str] \| Mapping[str, str \| Sequence[str]] \| None       | None                             |
| split                | Specify the split of the dataset to use (e.g., train, test, validation).                                                                      | str \| Split \| None                                                   | None                             |
| cache_dir            | Directory to cache downloaded files.                                                                                                           | str \| None                                                            | None                             |
| features             | Dataset features (e.g., columns, labels) if applicable.                                                                                        | Features \| None                                                       | None                             |
| download_config      | Configuration for downloading the dataset (e.g., timeouts, proxies).                                                                           | DownloadConfig \| None                                                 | None                             |
| download_mode        | Download/generate mode: `reuse_dataset_if_exists`, `reuse_cache_if_exists`, or `force_redownload`.                                            | DownloadMode \| str \| None                                             | None                             |
| verification_mode    | Checks to run on the downloaded data: `all_checks`, `basic_checks`, or `no_checks`.                                                            | VerificationMode \| str \| None                                         | None                             |
| ignore_verifications | Flag to ignore verifications (deprecated).                                                                                                     | str                                                                    | "deprecated"                     |
| keep_in_memory       | Flag to keep the dataset in memory after loading.                                                                                              | bool \| None                                                           | None                             |
| save_infos           | Flag to save dataset information (e.g., metadata, version, etc.).                                                                              | bool                                                                   | False                            |
| revision             | Dataset revision or version to use.                                                                                                            | str \| Version \| None                                                 | None                             |
| token                | Authentication token for accessing the dataset.                                                                                                | bool \| str \| None                                                    | None                             |
| use_auth_token       | Flag to use authentication token (deprecated).                                                                                                 | str                                                                    | "deprecated"                     |
| task                 | Specify the task associated with the dataset (deprecated).                                                                                      | str                                                                    | "deprecated"                     |
| streaming            | If True, stream the data on the fly instead of downloading it first; an iterable dataset is returned.                                          | bool                                                                   | False                            |
| num_proc             | Number of processes to use when preparing the dataset.                                                                                          | int \| None                                                            | None                             |
| storage_options      | Additional options for storage (e.g., AWS S3 credentials).                                                                                     | Dict \| None                                                           | None                             |
| trust_remote_code    | Whether to allow running a dataset loading script hosted with the dataset (enable only for sources you trust).                                 | bool \| None                                                           | None                             |

The table lists each `load_dataset` argument with its purpose, accepted types, and default value. Adjust the arguments to your own dataset and usage scenario.

In [None]:
from datasets import load_dataset, DownloadConfig

# Define the arguments for loading the dataset
path = "PolyAI/minds14"  # Dataset repository on the Hugging Face Hub
name = "en-US"           # Configuration (language subset) of the dataset
split = "train"          # Split of the dataset (train)

# Define additional arguments (None keeps the library defaults)
data_dir = None   # Directory where dataset files are stored; replace with a real path if needed
cache_dir = None  # Directory to cache downloaded files; None uses the default cache

# Optional schema override. A Features object has to describe every column of the dataset,
# so it is left as None here and the schema is taken from the dataset itself.
features = None

# Define download configuration with appropriate parameters
download_config = DownloadConfig()  # Default download settings

# Load the dataset using the defined arguments without specifying verification mode
dataset = load_dataset(
    path=path,
    name=name,
    split=split,
    data_dir=data_dir,
    cache_dir=cache_dir,
    features=features,
    download_config=download_config
)

# Print dataset info to verify loading
print(dataset)


In [None]:
# Simple snippet from Hugging Face
import torch
from transformers import pipeline
from datasets import load_dataset, Audio
speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate))
result = speech_recognizer(dataset[:10]["audio"])
for text in result:
  print(text["text"])


# `from_pretrained` from the transformers module

| Method/Attribute | Description | Parameters | Example |
|---|---|---|---|
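| **`__init__()`** | Raises an `EnvironmentError`, because `AutoTokenizer` is meant to be created through `from_pretrained` rather than instantiated directly. | N/A | N/A |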
| **`from_pretrained`** | Instantiates tokenizer based on pretrained model vocabulary. | - `pretrained_model_name_or_path`: Identifies the pretrained model. <br>- `inputs`: Additional positional arguments. <br>- `config`: Configuration object. <br>- `cache_dir`: Caching directory path. <br>- `force_download`, `resume_download`: Flags for downloads. <br>- `proxies`: Proxy server configurations. <br>- `revision`: Specific model version. <br>- `subfolder`: Subfolder within the model repository. <br>- `use_fast`: Flag for fast Rust-based tokenizer. <br>- `tokenizer_type`: Type of tokenizer. <br>- `trust_remote_code`: Caution flag for custom models. <br>- `kwargs`: Additional keyword arguments. | - `AutoTokenizer.from_pretrained("bert-base-uncased")` <br>- `AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")` <br>- `AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)` |

The table above covers two members of the `AutoTokenizer` class, with their descriptions, parameters, and examples:

- the `__init__()` method
- the `from_pretrained` class method

| Class Name                               | Purpose                                            | Model Mapping Variable                       |
|------------------------------------------|----------------------------------------------------|---------------------------------------------|
| AutoModelForMaskGeneration               | For models that generate masks for data             | MODEL_FOR_MASK_GENERATION_MAPPING           |
| AutoModelForTextEncoding                 | For models specialized in encoding text             | MODEL_FOR_TEXT_ENCODING_MAPPING             |
| AutoModelForImageToImage                 | For models that map images to images                | MODEL_FOR_IMAGE_TO_IMAGE_MAPPING            |
| AutoModel                                 | General class for models without a specific head    | MODEL_MAPPING                               |
| AutoModelForPreTraining                  | For models with a pretraining head                  | MODEL_FOR_PRETRAINING_MAPPING               |
| _AutoModelWithLMHead (Deprecated)        | For language models with a language modeling head   | MODEL_WITH_LM_HEAD_MAPPING                 |
| AutoModelForCausalLM                     | For causal language models (unidirectional)         | MODEL_FOR_CAUSAL_LM_MAPPING                 |
| AutoModelForMaskedLM                     | For masked language models (BERT-like)              | MODEL_FOR_MASKED_LM_MAPPING                 |
| AutoModelForSeq2SeqLM                    | For sequence-to-sequence language models            | MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING     |
| AutoModelForSequenceClassification       | For sentence or sequence classification            | MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING   |
| AutoModelForQuestionAnswering             | For question answering tasks                        | MODEL_FOR_QUESTION_ANSWERING_MAPPING        |
| AutoModelForTableQuestionAnswering        | For question answering on tabular data              | MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING  |
| AutoModelForVisualQuestionAnswering      | For visual question answering tasks                 | MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING |
| AutoModelForDocumentQuestionAnswering    | For question answering on documents                 | MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING |
| AutoModelForTokenClassification           | For token-level classification (e.g., NER)         | MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING      |
| AutoModelForMultipleChoice               | For multiple-choice tasks                           | MODEL_FOR_MULTIPLE_CHOICE_MAPPING           |
| AutoModelForNextSentencePrediction       | For next sentence prediction tasks                  | MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING |
| AutoModelForImageClassification          | For image classification tasks                      | MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING     |
| AutoModelForZeroShotImageClassification  | For zero-shot image classification                 | MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING |
| AutoModelForImageSegmentation            | For image segmentation tasks                        | MODEL_FOR_IMAGE_SEGMENTATION_MAPPING       |
| AutoModelForSemanticSegmentation         | For semantic segmentation tasks                     | MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING    |
| AutoModelForUniversalSegmentation        | For universal image segmentation                    | MODEL_FOR_UNIVERSAL_SEGMENTATION_MAPPING   |
| AutoModelForInstanceSegmentation         | For instance segmentation tasks                     | MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING   |
| AutoModelForObjectDetection              | For object detection tasks                          | MODEL_FOR_OBJECT_DETECTION_MAPPING         |
| AutoModelForZeroShotObjectDetection      | For zero-shot object detection                     | MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING |
| AutoModelForDepthEstimation              | For depth estimation tasks                          | MODEL_FOR_DEPTH_ESTIMATION_MAPPING         |
| AutoModelForVideoClassification          | For video classification tasks                      | MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING    |
| AutoModelForVision2Seq                   | For vision-to-text modeling                         | MODEL_FOR_VISION_2_SEQ_MAPPING             |
| AutoModelForAudioClassification          | For audio classification tasks                      | MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING    |
| AutoModelForCTC                          | For models using Connectionist Temporal Classification | MODEL_FOR_CTC_MAPPING                  |
| AutoModelForSpeechSeq2Seq                | For sequence-to-sequence speech-to-text modeling   | MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING        |
| AutoModelForAudioFrameClassification     | For audio frame (token) classification             | MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING |
| AutoModelForAudioXVector                 | For audio retrieval via x-vector                    | MODEL_FOR_AUDIO_XVECTOR_MAPPING           |
| AutoModelForTextToSpectrogram            | For converting text to spectrogram representations | MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING     |
| AutoModelForTextToWaveform               | For converting text to waveform (audio)            | MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING       |
| AutoBackbone                             | Generic class for model backbones                   | MODEL_FOR_BACKBONE_MAPPING                |
| AutoModelForMaskedImageModeling          | For masked image modeling tasks                     | MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING  |
| AutoModelWithLMHead (Deprecated)         | Deprecated class for language modeling              | -- (Use specific LM classes instead)      |
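The "Model Mapping Variable" column hints at how the auto classes work: each one looks up the concrete model class that matches a given configuration. The sketch below illustrates that dispatch idea in plain Python. Every name prefixed with `Toy` or `TOY_` is hypothetical and exists only for this illustration; it is a simplified picture under that assumption, not the library's actual implementation.

```python
class ToyBertConfig:
    model_type = "bert"


class ToyGPT2Config:
    model_type = "gpt2"


class ToyBertForSequenceClassification:
    def __init__(self, config):
        self.config = config


class ToyGPT2ForSequenceClassification:
    def __init__(self, config):
        self.config = config


# Plays the role of a MODEL_FOR_..._MAPPING variable: config class -> model class
TOY_SEQUENCE_CLASSIFICATION_MAPPING = {
    ToyBertConfig: ToyBertForSequenceClassification,
    ToyGPT2Config: ToyGPT2ForSequenceClassification,
}


class ToyAutoModelForSequenceClassification:
    _model_mapping = TOY_SEQUENCE_CLASSIFICATION_MAPPING

    def __init__(self):
        # Auto classes are factories, so direct instantiation is refused
        raise EnvironmentError(
            "Use ToyAutoModelForSequenceClassification.from_config(config) instead."
        )

    @classmethod
    def from_config(cls, config):
        """Return an instance of the model class registered for this config type."""
        model_class = cls._model_mapping.get(type(config))
        if model_class is None:
            raise ValueError(f"Unsupported config type: {type(config).__name__}")
        return model_class(config)


model = ToyAutoModelForSequenceClassification.from_config(ToyGPT2Config())
print(type(model).__name__)  # ToyGPT2ForSequenceClassification
```

Keeping one mapping per task is what lets the same call work across architectures: adding support for a new model only requires a new entry in the relevant mapping.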



In [30]:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification, pipeline

# Define the GPT-2 model and tokenizer
gpt_model_name = "gpt2"
gpt_model = AutoModelForCausalLM.from_pretrained(gpt_model_name)
gpt_tokenizer = AutoTokenizer.from_pretrained(gpt_model_name)
# GPT-2 ships without a pad token, so reuse its own end-of-sequence token for padding
if gpt_tokenizer.pad_token is None:
    gpt_tokenizer.pad_token = gpt_tokenizer.eos_token
gpt_tokenizer.padding_side = "right"  # Pad on the right-hand side of each sequence

# Define the sentiment analysis model and tokenizer
sentiment_model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
sentiment_model = AutoModelForSequenceClassification.from_pretrained(sentiment_model_name)
sentiment_tokenizer = AutoTokenizer.from_pretrained(sentiment_model_name)

# Create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis", model=sentiment_model, tokenizer=sentiment_tokenizer)

# Example sentence in French
sentence = "GPT-2 a été entraîné sur une vaste quantité de données textuelles et est réputé pour sa capacité à produire du texte fluide et à capturer la structure et le sens du langage."

# Generate 10 sentences using the GPT model
generated_sentences = []

for _ in range(10):
    inputs = gpt_tokenizer.encode(sentence, return_tensors="pt", max_length=512, truncation=True)
    outputs = gpt_model.generate(inputs, max_length=100, num_return_sequences=1, temperature=0.7, do_sample=True)
    generated_sentence = gpt_tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_sentences.append(generated_sentence)

# Perform sentiment analysis on the generated sentences
for generated_sentence in generated_sentences:
    results = classifier(generated_sentence)
    for result in results:
        print(f"Sentence: {generated_sentence}")
        print(f"Label: {result['label']}, Score: {result['score']:.4f}")



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Sentence: GPT-2 a été entraîné sur une vaste quantité de données textuelles et est réputé pour sa capacité à produire du texte fluide et à capturer la structure et le sens du langage.

The same was true for the other articles.

But in the last few days, a great deal of content from the first few days of the campaign has been published in the French.


Label: 4 stars, Score: 0.4141
Sentence: GPT-2 a été entraîné sur une vaste quantité de données textuelles et est réputé pour sa capacité à produire du texte fluide et à capturer la structure et le sens du langage.

Développement de la vie du langage :

"Les réponditions de la vie du langage" « l'équipe dit. »

Label: 5 stars, Score: 0.4334
Sentence: GPT-2 a été entraîné sur une vaste quantité de données textuelles et est réputé pour sa capacité à produire du texte fluide et à capturer la structure et le sens du langage.

Nous répeits et lorsse aux sommes de la vaste quantité au même à ses ses découps.

"Ce-p
Label: 5 stars, Score: 0.5817
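The generation call above combines `do_sample=True` with `temperature=0.7`, which is why each of the ten runs can continue the same prompt differently. The following is a minimal sketch of temperature sampling in plain Python, assuming the usual formulation in which logits are divided by the temperature before the softmax. The logits and the helper names `temperature_probs` and `sample_token` are made up for illustration; this is not the library's internal code.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index at random from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
rng = random.Random(0)    # seeded only to make the sketch repeatable

for temperature in (1.0, 0.7):
    probs = temperature_probs(logits, temperature)
    draws = [sample_token(logits, temperature, rng) for _ in range(5)]
    print(temperature, [round(p, 3) for p in probs], draws)
```

A temperature below 1.0 sharpens the distribution toward the highest-scoring token, so the output is less random than at 1.0 but still varies from run to run.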


In [None]:

from IPython.display import HTML

HTML('Tokenizer functionality from the transformers module')


In [32]:

from transformers import AutoTokenizer
gpt_model_name = "gpt2"
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
gpt_tokenizer = AutoTokenizer.from_pretrained(gpt_model_name)

# GPT-2 ships without a pad token, so reuse its own end-of-sequence token for padding
if gpt_tokenizer.pad_token is None:
    gpt_tokenizer.pad_token = gpt_tokenizer.eos_token
gpt_tokenizer.padding_side = "right"  # Pad on the right-hand side of each sequence



## Show how tokenization affects the input: tokenize a small batch with the sentiment model's tokenizer
pt_batch = tokenizer(
    ["what is tokenization", " why it is different for different model","why tokenizations very imp to pre-prcocess in the era of LLMs "],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

In [33]:
pt_batch

{'input_ids': tensor([[  101, 11523, 10127, 16925, 13649, 26364,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101, 18469, 10197, 10127, 12850, 10139, 12850, 10713,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101, 18469, 16925, 13649, 26364, 10107, 12495, 73296, 10114, 12021,
           118, 14853, 10805, 15101, 10107, 10104, 10103, 10420, 10108, 17361,
         12932,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1,

- `eos_token` is the end-of-sequence token. It is not specific to sequence-to-sequence models: any model that generates text, including decoder-only models, uses it to mark where a sequence ends.

- In Transformer-based language models, the eos token is typically the final token of a training sequence, so the model learns to emit it when a sequence is complete. At generation time, producing the eos token signals that the output should stop.

- In the given code, the tokenizer's `pad_token` is set to the same value as its `eos_token`. GPT-2's tokenizer has no dedicated padding token, so reusing the eos token lets it pad batches without adding a new token to the vocabulary.

- `pad_token` is a special token that fills the unused positions in a sequence. Sequences in a batch rarely have the same length, so the shorter ones are padded to the length of the longest one, and the attention mask marks the padded positions so the model ignores them.

- In the given code, the pad token is only assigned when the tokenizer does not already define one. This matters for decoder-only models such as GPT-2, which ship without a pad token.
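
The padding behaviour described in the bullets above can be sketched without any library. This is a minimal illustration under stated assumptions, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the token IDs and `pad_id` below are arbitrary values chosen for the example.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens first, then padding (this corresponds to padding_side="right")
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Arbitrary example IDs; pad_id=0 is a hypothetical choice
ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)
print(mask)
```

When the pad token is the same as the eos token, the attention mask is the only thing that distinguishes padding from a genuine end-of-sequence token, which is why the mask should always be passed to the model along with the IDs.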

In [None]:
from transformers import AutoTokenizer

# Define the model names
gpt_model_name = "gpt2"
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

# Initialize tokenizers for both models
tokenizer = AutoTokenizer.from_pretrained(model_name)
gpt_tokenizer = AutoTokenizer.from_pretrained(gpt_model_name)

# For GPT-2, we need to set the pad token to the same as the eos token if it's not already set
if gpt_tokenizer.pad_token is None:
    gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

# Input text samples
texts = [
    "what is tokenization",
    "why it is different for different model",
    "why tokenizations very imp to preprocess in the era of LLMs",
]

# Tokenize text for the sentiment analysis model
pt_batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

# Tokenize text for GPT-2 model
gpt_batch = gpt_tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",  # "pt" for PyTorch, "tf" for TensorFlow
)

# Display the tokenized outputs
print("Tokenization for Sentiment Analysis Model:")
for key, value in pt_batch.items():
    print(f"{key}: {value.tolist()}")

print("\nTokenization for GPT-2 Model:")
for key, value in gpt_batch.items():
    print(f"{key}: {value.tolist()}")

### Tokenization for Sentiment Analysis Model:
- **input_ids:**
    - List of token IDs representing the input text for sentiment analysis model.
    - Example:
        - `[101, 11523, 10127, 16925, 13649, 26364, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`
- **token_type_ids:**
    - Segment IDs that tell the model which sentence each token belongs to when the input is a sentence pair. For a single sentence they are all zeros.
    - Example:
        - `[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`
- **attention_mask:**
    - Indicates which tokens should be attended to or ignored by the model.
    - Example:
        - `[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`

### Tokenization for GPT-2 Model:
- **input_ids:**
    - List of token IDs representing the input text for GPT-2 model.
    - Example:
        - `[10919, 318, 11241, 1634, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]`
- **attention_mask:**
    - Indicates which tokens should be attended to or ignored by the model.
    - Example:
        - `[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]`

### Actual Text:
Input text samples:
- "what is tokenization"
- "why it is different for different model"
- "why tokenizations very imp to preprocess in the era of LLMs"

The provided code snippet demonstrates tokenization outputs for two different models: a sentiment analysis model and GPT-2. The tokenization outputs include `input_ids`, `token_type_ids` (only available for some models), and `attention_mask`, which are essential components used for model input in natural language processing tasks.
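
The `token_type_ids` in the examples above are all zeros because each input is a single sentence. The sketch below shows, under stated assumptions, how a BERT-style sentence pair would be laid out: `build_pair_inputs` is a hypothetical helper, the IDs 101 and 102 are taken from the start and end of the BERT sequences printed earlier (`[CLS]` and `[SEP]`), and the word IDs are arbitrary.

```python
from typing import Dict, List


def build_pair_inputs(
    first: List[int], second: List[int], cls_id: int = 101, sep_id: int = 102
) -> Dict[str, List[int]]:
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with matching segment IDs."""
    input_ids = [cls_id] + first + [sep_id] + second + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP]
    token_type_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    # No padding in this example, so every position is attended to
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Arbitrary word IDs for two short "sentences"
pair = build_pair_inputs([7, 8], [9])
for key, value in pair.items():
    print(f"{key}: {value}")
```

GPT-2's tokenizer returns no `token_type_ids` at all, which is why that key only appears in the BERT output.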



# End-to-end case


In [None]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch
from torch.nn.functional import softmax

# Define the sentiment analysis models and tokenizers
models = {
    "bert": "nlptown/bert-base-multilingual-uncased-sentiment",
    "distilbert": "distilbert-base-uncased-finetuned-sst-2-english",
    "gpt2":"gpt2"
}

# Load the models and create pipelines for sentiment analysis
pipelines = {}
for model_name, model_path in models.items():
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    pipelines[model_name] = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Example sentences
sentences = [
    "I love this product!",
    "This is the worst thing I've ever bought."
]

# Iterate over the models and perform sentiment analysis
for model_name, sentiment_pipeline in pipelines.items():
    print(f"Model: {model_name}")
    for sentence in sentences:
        # Perform sentiment analysis on the sentence
        result = sentiment_pipeline(sentence)
        print(f"Sentence: {sentence}")
        print(f"Sentiment: {result[0]['label']}, Score: {result[0]['score']:.8f}")
    print("")

# Note: GPT-2 is included here only as a test. It is a text-generation model, not a
# sentiment-analysis model, so its classification head is randomly initialized and
# its labels and scores are not meaningful.

# Last stage output

In [None]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch
from torch import nn

# Define the sentiment analysis model and tokenizer
sentiment_model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
sentiment_model = AutoModelForSequenceClassification.from_pretrained(sentiment_model_name)
sentiment_tokenizer = AutoTokenizer.from_pretrained(sentiment_model_name)

# Create a pipeline for sentiment analysis (this is not necessary for converting logits to probabilities but included for context)
classifier = pipeline("sentiment-analysis", model=sentiment_model, tokenizer=sentiment_tokenizer)

# Example sentences
sentences = [
    "I love this product!",
    "This is the worst thing I've ever bought."
]

# Tokenize the text input (this is your `pt_batch`)
pt_batch = sentiment_tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Pass the tokenized batch through the sentiment model to get the raw logits
pt_outputs = sentiment_model(**pt_batch)

# Convert the logits to probabilities using softmax
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)

# Print the probabilities and their shape
print(f"Probabilities:\n{pt_predictions}")
print(f"Shape (sentences, classes): {tuple(pt_predictions.shape)}")
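
To make the logits-to-probabilities step concrete, here is a minimal library-free sketch of softmax. The logits are made-up numbers for one sentence over five classes, not output from the model above.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single sentence (made-up values)
logits = [-2.0, -1.0, 0.0, 1.0, 3.0]
probs = softmax(logits)

# The predicted class is the index with the highest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 4) for p in probs])
print(predicted_class)
```

Softmax preserves the ordering of the logits, so the predicted class can be read from either the logits or the probabilities; the probabilities are only needed when a confidence score is wanted.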

In [None]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch
from torch import nn

# Define the sentiment analysis model and tokenizer
sentiment_model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_model = AutoModelForSequenceClassification.from_pretrained(sentiment_model_name)
sentiment_tokenizer = AutoTokenizer.from_pretrained(sentiment_model_name)

# Create a pipeline for sentiment analysis (this is not necessary for converting logits to probabilities but included for context)
classifier = pipeline("sentiment-analysis", model=sentiment_model, tokenizer=sentiment_tokenizer)

# Example sentences
sentences = [
    "I love this product!",
    "This is the worst thing I've ever bought."
]

# Tokenize the text input (this is your `pt_batch`)
pt_batch = sentiment_tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Pass the tokenized batch through the sentiment model to get the raw logits
pt_outputs = sentiment_model(**pt_batch)

# Convert the logits to probabilities using softmax
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)

# Print the probabilities and their shape
print(f"Probabilities:\n{pt_predictions}")
print(f"Shape (sentences, classes): {tuple(pt_predictions.shape)}")


In [None]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
import torch
from torch import nn

# Define the sentiment analysis model and tokenizer
sentiment_model_name = "gpt2"
sentiment_model = AutoModelForSequenceClassification.from_pretrained(sentiment_model_name)
sentiment_tokenizer = AutoTokenizer.from_pretrained(sentiment_model_name)
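# GPT-2 has no pad token by default, so reuse the eos token for padding
if sentiment_tokenizer.pad_token is None:
    sentiment_tokenizer.pad_token = sentiment_tokenizer.eos_token
# The classification head needs pad_token_id to find the last real token of each padded sequence
sentiment_model.config.pad_token_id = sentiment_tokenizer.pad_token_id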

# Create a pipeline for sentiment analysis (this is not necessary for converting logits to probabilities but included for context)
classifier = pipeline("sentiment-analysis", model=sentiment_model, tokenizer=sentiment_tokenizer)

# Example sentences
sentences = [
    "I love this product!",
    "This is the worst thing I've ever bought."
]

# Tokenize the text input (this is your `pt_batch`)
pt_batch = sentiment_tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Pass the tokenized batch through the sentiment model to get the raw logits
pt_outputs = sentiment_model(**pt_batch)

# Convert the logits to probabilities using softmax
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)

# Print the probabilities and their shape
print(f"Probabilities:\n{pt_predictions}")
print(f"Shape (sentences, classes): {tuple(pt_predictions.shape)}")

### Tokenization of the Dataset
- Tokenization of the dataset using the provided tokenizer.

### Loading Pre-trained Model
- Loading a pre-trained model for sequence classification.

### Training Arguments
- Setting up training arguments relevant for training the model.

### Data Collator Creation
- Creating a data collator for dynamic padding of batches.

### Trainer Initialization
- Initializing the trainer to handle the training loop.

### Training Execution
- Running the trainer to commence training.

### Saving Trained Model and Tokenizer
- Saving the trained model and tokenizer for future use.
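
The dynamic padding mentioned under "Data Collator Creation" is easiest to appreciate by counting tokens. The sketch below compares padding everything to a fixed length with padding each batch to its own longest example. The function names, example lengths, and the fixed length of 512 are all hypothetical values chosen for illustration.

```python
from typing import List


def tokens_with_fixed_padding(lengths: List[int], max_length: int) -> int:
    """Every example is padded to the same fixed length."""
    return len(lengths) * max_length


def tokens_with_dynamic_padding(lengths: List[int], batch_size: int) -> int:
    """Each batch is padded only to the length of its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for four short reviews
lengths = [12, 30, 8, 25]
print(tokens_with_fixed_padding(lengths, max_length=512))
print(tokens_with_dynamic_padding(lengths, batch_size=2))
```

For short texts, most of a fixed-length batch is padding, so padding per batch can reduce the amount of computation considerably.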


In [None]:
!pip install -q -U datasets


In [None]:
import os
from typing import Dict, Any

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from datasets import load_dataset, DatasetDict

# Define a function to tokenize the dataset
# Padding is left to the data collator, which pads each batch to its longest example
def tokenize_dataset(examples: Dict[str, Any], tokenizer: AutoTokenizer) -> Dict[str, Any]:
    return tokenizer(examples["text"], truncation=True)

# Load a dataset
dataset: DatasetDict = load_dataset("rotten_tomatoes")

# Load a tokenizer
tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize the dataset
tokenized_dataset: DatasetDict = dataset.map(lambda x: tokenize_dataset(x, tokenizer), batched=True)

# Load the model
model: AutoModelForSequenceClassification = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Set training arguments
training_args: TrainingArguments = TrainingArguments(
    output_dir="/content/sample_data/Hemanth_module",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    save_strategy="epoch",  # Save the model at the end of each epoch
)

# Initialize a DataCollator
data_collator: DataCollatorWithPadding = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize the Trainer
trainer: Trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Train the model
trainer.train()

# Save the final model and tokenizer to the output directory
trainer.save_model(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)



```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    StoppingCriteriaList,
    MaxLengthCriteria,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
# set pad_token_id to eos_token_id because OPT does not have a PAD token
model.config.pad_token_id = model.config.eos_token_id
input_prompt = "DeepMind Company is"
input_ids = tokenizer(input_prompt, return_tensors="pt")
stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=64)])
outputs = model.contrastive_search(
    **input_ids, penalty_alpha=0.6, top_k=4, stopping_criteria=stopping_criteria
)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['DeepMind Company is a company that focuses on the development and commercialization of artificial intelligence (AI). DeepMind’s mission is to help people understand and solve problems that are difficult to solve in the world today.\n\nIn this post, we talk about the benefits of deep learning in business and how it']
```


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer


model_directory = 'mistralai/Mixtral-8x7B-v0.1'

model = AutoModelForCausalLM.from_pretrained(model_directory)
tokenizer = AutoTokenizer.from_pretrained(model_directory)

# Set pad token to eos token if it's not already set
if tokenizer.pad_token is None:
    # Add special tokens (eos_token as pad_token)
    tokenizer.add_special_tokens({'pad_token':tokenizer.eos_token})

# Input prompt for text generation
prompt = "How susceptible are AI models to data poisoning attack"

# Generate text based on the input prompt
generated = model.generate(tokenizer.encode(prompt, return_tensors="pt"), max_length=1024)

# Decode the generated output and print it
decoded_output = tokenizer.decode(generated[0], skip_special_tokens=True)

print("Generated token IDs:", generated)
print(decoded_output)


In [None]:
!pip install -q -U transformers


In [None]:
from transformers import pipeline

# Load the pre-trained sentiment analysis model
classifier = pipeline("sentiment-analysis")

# Define a function to perform sentiment analysis on a piece of text
def analyze_sentiment(text):
    # Use the sentiment analysis model to get the sentiment label for the text
    result = classifier(text)[0]

    # Determine the sentiment of the text based on the label
    if result["label"] == "POSITIVE":
        return "positive"
    elif result["label"] == "NEGATIVE":
        return "negative"
    else:
        # Fallback only: the default SST-2 model emits just POSITIVE or NEGATIVE
        return "neutral"

# Use the analyze_sentiment function to analyze a list of customer reviews
reviews = [
    "I love this business!",
    "I hate this business!",
    "This business is okay, I guess."
]

sentiments = [analyze_sentiment(review) for review in reviews]

print(sentiments)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


['positive', 'negative', 'positive']


In [None]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

# Define the user inputs for the classification task
user_input_0 = "What is the best way to lose weight?"
user_input_1 = "How much weight can I expect to lose per week?"

# initialize the tokenizer for the bert-base-cased-finetuned-mrpc model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")

# initialize the model for the sequence classification task
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

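# Define the class names for the two output indices.
# In the MRPC label scheme, index 0 means "not a paraphrase" and index 1 means "paraphrase",
# so the second score is the probability that the two inputs say the same thing.
classes = ["repeated", "new"]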

# Tokenize the two sentences together as one pair and return TensorFlow tensors
paraphrase = tokenizer(user_input_0, user_input_1, return_tensors="tf")

# Run the pair through the model to get the raw logits
paraphrase_classification_logits = model(paraphrase).logits

# Apply softmax to the logits to get the probability of each class
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

# Print the classification results
print(f"Classification for {user_input_0} and {user_input_1}:")
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


Classification for What is the best way to lose weight? and How much weight can I expect to lose per week?:
repeated: 12%
new: 88%


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# initialize tokenizer and model with pretrained weights
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

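# define the context passage for question answering
# note: this passage is the log output of the previous cell, not a description of
# 🤗 Transformers, so the model can only pick answer spans from that log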
text = r"""
 All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
Classification for What is the best way to lose weight? and How much weight can I expect to lose per week?:
repeated: 12%
new: 88%



"""

# define a list of questions to ask of the model
questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]

# loop over each question and extract the answer
for question in questions:
    # tokenize the question and the context together as one pair
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")

    # keep the input IDs as a plain list so the answer span can be sliced out later
    input_ids = inputs["input_ids"].tolist()[0]

    # run the model to score every token as a possible answer start or end
    outputs = model(**inputs)

    # extract the start and end scores for the answer
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # use the argmax function to get the most likely start and end positions of the answer
    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1

    # use the tokenizer to convert the input IDs to a string and get the answer
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    # print the question and answer
    print(f"Question: {question}")
    print(f"Answer: {answer}")

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Question: How many pretrained models are available in 🤗 Transformers?
Answer: 12 %
Question: What does 🤗 Transformers provide?
Answer: predictions
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: predictions


In [None]:
# Import the required libraries
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Initialize the AutoTokenizer with 'distilbert-base-cased' pre-trained model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

# Initialize the AutoModelForMaskedLM with 'distilbert-base-cased' pre-trained model
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")

# Define the input sentence
sequence = (
    "Distilled models are smaller than the models they mimic. Using them instead of the large "
    f"versions would help {tokenizer.mask_token} our carbon footprint."
)

# Tokenize the input sequence using the tokenizer
inputs = tokenizer(sequence, return_tensors="pt")

# Extract the mask token index from the input sequence
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Get token logits for the masked position using the model
token_logits = model(**inputs).logits[0, mask_token_index, :]

# Get the top 5 most likely tokens for the masked position
top_5_tokens = torch.topk(token_logits, 5, dim=1).indices[0].tolist()

# Print the top 5 likely replacements for the masked token
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.


In [None]:

from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sequence = "Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]

# get the logits for the next token, taken from the last position of the sequence
next_token_logits = model(**inputs).logits[:, -1, :]

# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample
probs = nn.functional.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

generated = torch.cat([input_ids, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])
print(resulting_string)

Hugging Face is based in DUMBO, New York City, and premiered
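
The filter-then-sample step in the cell above can be sketched in plain Python, which may also help if the `top_k_top_p_filtering` helper is not available in the installed `transformers` version. This is an illustrative re-implementation of top-k filtering only (no top-p), with hypothetical function names and made-up logits over a four-token vocabulary.

```python
import math
import random
from typing import List


def top_k_filter(logits: List[float], k: int) -> List[float]:
    """Keep the k largest logits and set the rest to -inf."""
    if k >= len(logits):
        return list(logits)
    # The k-th largest value is the cut-off; ties at the cut-off are all kept
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]


def softmax(logits: List[float]) -> List[float]:
    highest = max(logits)
    # exp(-inf) is 0.0, so filtered tokens end up with probability zero
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits: List[float], k: int, rng: random.Random) -> int:
    """Sample one token index from the top-k distribution."""
    probs = softmax(top_k_filter(logits, k))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up logits for a tiny vocabulary of four tokens
logits = [1.0, 5.0, 3.0, 0.5]
rng = random.Random(0)
print(top_k_filter(logits, k=2))
print(sample_next_token(logits, k=2, rng=rng))
```

Because the next token is sampled rather than chosen by `argmax`, running the cell again will generally give a different continuation unless the random seed is fixed.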


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing.   """

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length + 1 :]

print(generated)

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (-1). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


Today the weather is really nice and I am planning on visiting up close - this trip is long gone. One day I will return back home and dine with my family, with many other people.<eop><eod> This is one of my favorite articles on a blog. You may see that they link from the Internet to a link you already used. There is a link at a link in your directory that you could have searched around for and not found.


In [None]:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = (
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, "
    "therefore very close to the Manhattan Bridge."
)

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

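outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

# Print each token next to its predicted entity label
for token, prediction in zip(tokens, predictions[0].tolist()):
    print((token, model.config.id2label[prediction]))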
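
Token classification yields one label per word piece, so a post-processing step is usually needed to turn those labels into whole entities. The sketch below groups consecutive tokens of the same entity type and merges `##` word pieces. The helper names are hypothetical, and the tokens and labels are hand-written for illustration rather than taken from the model's output.

```python
from typing import List, Tuple


def merge_wordpieces(pieces: List[str]) -> str:
    """Join word pieces, attaching '##' continuations to the previous piece."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        else:
            text += (" " if text else "") + piece
    return text


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share an entity type into (type, text) pairs."""
    entities = []
    current_type = None
    current_tokens: List[str] = []
    for token, label in zip(tokens, labels):
        ent_type = None if label == "O" else label.split("-", 1)[1]
        # Close the open entity when the type changes or a B- tag starts a new one
        if current_tokens and (ent_type != current_type or label.startswith("B-")):
            entities.append((current_type, merge_wordpieces(current_tokens)))
            current_tokens = []
        if ent_type is not None:
            current_tokens.append(token)
        current_type = ent_type
    if current_tokens:
        entities.append((current_type, merge_wordpieces(current_tokens)))
    return entities


# Hand-written example, not real model output
tokens = ["Hu", "##gging", "Face", "Inc", ".", "is", "in", "New", "York", "City"]
labels = ["I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "O", "I-LOC", "I-LOC", "I-LOC"]
print(group_entities(tokens, labels))
```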


In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
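If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

# Summarize the article with the default summarization pipeline
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))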

In [None]:


from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
    inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In [None]:

from transformers import pipeline
from datasets import load_dataset
import torch

torch.manual_seed(42)
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
audio_file = dataset[0]["audio"]["path"]

audio_classifier = pipeline(
    task="audio-classification", model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)
predictions = audio_classifier(audio_file)
predictions = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in predictions]
predictions

In [None]:
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
from datasets import load_dataset
import torch

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

feature_extractor = AutoFeatureExtractor.from_pretrained("superb/wav2vec2-base-superb-ks")
model = AutoModelForAudioClassification.from_pretrained("superb/wav2vec2-base-superb-ks")

inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_ids = torch.argmax(logits, dim=-1).item()
predicted_label = model.config.id2label[predicted_class_ids]
predicted_label

In [None]:
from transformers import pipeline
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
audio_file = dataset[0]["audio"]["path"]

speech_recognizer = pipeline(task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
speech_recognizer(audio_file)


In [None]:
from transformers import AutoProcessor, AutoModelForCTC
from datasets import load_dataset
import torch

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

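inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits
# Greedy CTC decoding: take the most likely token id at every audio frame.
predicted_ids = torch.argmax(logits, dim=-1)

# batch_decode collapses repeated ids, drops blank tokens, and returns one string per input.
transcription = processor.batch_decode(predicted_ids)
transcription[0]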
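
The cell above takes the arg-max token id for every audio frame and hands the ids to `batch_decode`. As a rough illustration of what greedy CTC decoding does with those ids, the sketch below merges consecutive repeats, drops the blank token, and turns a word-delimiter token into spaces. The vocabulary, the `blank_id`, the `|` delimiter, and the frame sequence are all hypothetical toy values chosen for the example, not the real `facebook/wav2vec2-base-960h` vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Toy greedy CTC decoder (illustrative sketch, not the library's code)."""
    # Step 1: merge runs of identical ids, since one character can span many frames.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, which separates genuine repeated characters.
    kept = [token_id for token_id in collapsed if token_id != blank_id]
    # Step 3: map ids to characters and turn the word delimiter into spaces.
    text = "".join(id_to_token[token_id] for token_id in kept)
    return text.replace(word_delimiter, " ").strip()


# Hypothetical vocabulary and per-frame arg-max ids.
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
frames = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 0]

print(ctc_greedy_decode(frames, toy_vocab))
```

The blank between the two runs of `L` is what lets the decoder keep a double letter: without it, the repeats would be merged into a single character.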

In [None]:
from transformers import pipeline

vision_classifier = pipeline(task="image-classification")
result = vision_classifier(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
print("\n".join([f"Class {d['label']} with score {round(d['score'], 4)}" for d in result]))

In [None]:
from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from datasets import load_dataset

dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]

# AutoImageProcessor resizes and normalizes the image into the tensor format the model expects.
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
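
The pipeline cells report a score for each label, while the manual cells stop at the arg-max of the logits. Assuming a single-label classifier whose scores come from a softmax over the logits, the gap between the two can be sketched in plain Python as follows. The logits, the label map, and the helper names `softmax` and `top_k_labels` are hypothetical and only serve the illustration.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    # Subtract the maximum first so exp() cannot overflow on large logits.
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable labels with rounded scores."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"label": id2label[i], "score": round(probs[i], 4)} for i in ranked]


# Hypothetical logits and label map standing in for model(**inputs).logits
# and model.config.id2label.
toy_logits = [2.0, 0.5, -1.0]
toy_id2label = {0: "cat", 1: "dog", 2: "bird"}

for item in top_k_labels(toy_logits, toy_id2label, k=2):
    print(item)
```

With a real model, the same idea would apply to one row of the logits tensor converted to a list.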

Markdown is a lightweight markup language used for formatting text on the web. Here's an example demonstrating various formatting options in Markdown:

```markdown
# Markdown Formatting Examples

## Text Styling
**Bold Text** or __Bold Text__
*Italic Text* or _Italic Text_
~~Strikethrough~~

## Headings
# Heading 1
## Heading 2
### Heading 3

## Lists
### Unordered List
- Item 1
- Item 2
  - Subitem 2.1
  - Subitem 2.2

### Ordered List
1. First item
2. Second item
   1. Subitem
   2. Another subitem

## Links and Images
[Link to Google](https://www.google.com)

![Image](https://via.placeholder.com/150)

## Code
Inline `code` or code blocks:
     ```python
     def greet():
         print("Hello, Markdown!")
     ```

## Blockquotes
> This is a blockquote.
> - Anonymous

## Horizontal Line
---

## Tables
| Column 1 | Column 2 |
| -------- | -------- |
| Cell 1   | Cell 2   |
| Cell 3   | Cell 4   |

## Task Lists
- [x] Task 1
- [x] Task 2
- [ ] Task 3

The Pythagorean theorem is $a^2 + b^2 = c^2$.
$ E = mc^2 $

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} $$
$$\int_{a}^{b} x^2 \, dx = \frac{b^3 - a^3}{3} $$

$$ A = \begin{bmatrix}
    1 & 2 \\
    3 & 4
\end{bmatrix} $$

$$ \mathbf{V} = \begin{bmatrix}
    v_1 \\
    v_2 \\
    v_3
\end{bmatrix} $$

$$ \alpha, \beta, \gamma, \delta, \pi, \Sigma, \lambda, \Phi, \Omega $$
```
This Markdown example showcases the main formatting options: headings, text styling, lists, links, images, code blocks, blockquotes, horizontal lines, tables, task lists, and LaTeX math (inline with `$...$`, display with `$$...$$`). Use them in your own Markdown documents to structure and style your content. The same source is rendered below.

# Markdown Formatting Examples

## Text Styling
**Bold Text** or __Bold Text__
*Italic Text* or _Italic Text_
~~Strikethrough~~

## Headings
# Heading 1
## Heading 2
### Heading 3

## Lists
### Unordered List
- Item 1
- Item 2
  - Subitem 2.1
  - Subitem 2.2

### Ordered List
1. First item
2. Second item
   1. Subitem
   2. Another subitem

## Links and Images
[Link to Google](https://www.google.com)

![Image](https://via.placeholder.com/150)

## Code
Inline `code` or code blocks:
```python
def greet():
    print("Hello, Markdown!")
```

## Blockquotes
> This is a blockquote.
> - Anonymous

## Horizontal Line
---

## Tables
| Column 1 | Column 2 |
| -------- | -------- |
| Cell 1   | Cell 2   |
| Cell 3   | Cell 4   |

## Task Lists
- [x] Task 1
- [x] Task 2
- [ ] Task 3

The Pythagorean theorem is $a^2 + b^2 = c^2$.
$E = mc^2$

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} $$
$$\int_{a}^{b} x^2 \, dx = \frac{b^3 - a^3}{3} $$

$$ A = \begin{bmatrix}
    1 & 2 \\
    3 & 4
\end{bmatrix} $$

$$ \mathbf{V} = \begin{bmatrix}
    v_1 \\
    v_2 \\
    v_3
\end{bmatrix} $$

$$ \alpha, \beta, \gamma, \delta, \pi, \Sigma, \lambda, \Phi, \Omega $$

