### Installing Libraries

In [None]:
pip install transformers



In [None]:
pip install tf-keras



## 1. Text Generation

In [None]:
from transformers import pipeline

- **truncation=True**: This tells the text generation model to truncate the input text if it exceeds the maximum length the model can handle. This helps prevent errors and ensures efficient processing.

- **num_return_sequences=2**: This instructs the model to generate two different possible continuations for the given input text, providing a variety of options for you to choose from.
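To make the two options concrete, here is a toy sketch in plain Python. It is not how `transformers` implements them: `truncate` and `sample_continuations` are hypothetical helpers, the token IDs and vocabulary are made up, and a uniform random choice stands in for a real model's next-token distribution.

```python
import random


def truncate(token_ids, max_length):
    """Keep at most max_length tokens, dropping the tail of the input."""
    return token_ids[:max_length]


def sample_continuations(prompt_ids, vocab, num_return_sequences, new_tokens=3, seed=0):
    """Return several independently sampled continuations of the same prompt."""
    rng = random.Random(seed)
    return [
        prompt_ids + [rng.choice(vocab) for _ in range(new_tokens)]
        for _ in range(num_return_sequences)
    ]


# Hypothetical token IDs: a 10-token prompt cut down to 4 tokens.
prompt_ids = truncate(list(range(10)), max_length=4)

# Two continuations, mirroring num_return_sequences=2.
outputs = sample_continuations(prompt_ids, vocab=[100, 101, 102], num_return_sequences=2)
for sequence in outputs:
    print(sequence)
```

Every returned sequence starts with the same (possibly truncated) prompt and differs only in the sampled continuation, which is why the real pipeline returns a list with one `generated_text` per sequence.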

**Example 1**

In [None]:
generator=pipeline('text-generation',model='openai-community/gpt2')

generator("Today is Monday",
          truncation=True,
          num_return_sequences=2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Today is Monday, September 4th 2012\n\nA great morning to me and to my family, thanks everyone! Thanks for the wonderful gifts. The very first, the most elegant, was from an ew, one of those gifts that I love'},
 {'generated_text': 'Today is Monday, July 7rd. And with it the return of the "Rise of the Super-Bowl".\n\nAnd with that we are going to have something special to show everyone.\n\nThe Rise of the Star Ruler\n'}]

**Example 2**

In [None]:
generator("Python is a great programming language",
          truncation=True,
          num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Python is a great programming language for creating real-time application-specific tools.\n\nHowever, I won\'t call it "real programming". Rather, the core of this language is "pure". The syntax is as follows:\n\nval foo'},
 {'generated_text': "Python is a great programming language for testing and debugging. There are often many interesting ways out of your favorite programming language, but we've covered some very good ones in this article. Here are some of the top programming languages for debugging with Python - Python"},
 {'generated_text': 'Python is a great programming language but for some I find it hard to fully understand what happens in the program. This article attempts to explain it. I would use Haskell. The Haskell project at Hoc, a Haskell team, uses Haskell every day.'},
 {'generated_text': 'Python is a great programming language and has been gaining a lot of hype in the last decade despite being a little less powerful like C. It has made great prog

## 2. Sentiment Analysis

**Example 1**

In [None]:
from transformers import pipeline

classifier=pipeline('sentiment-analysis')

classifier("I Loved Star Wars so much!")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POSITIVE', 'score': 0.999840259552002}]
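The `score` can be read as a probability. A minimal sketch, assuming the pipeline turns the model's two raw outputs (logits) into probabilities with a softmax and reports the larger one; the logits below are made-up numbers, not real model outputs.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.2, 4.5]  # hypothetical logits for a clearly positive sentence

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

A large gap between the two logits is what produces a score close to 1, as in the output above.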

**Example 2**

In [None]:
pipeline('sentiment-analysis')("I Hated star wars so much")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'NEGATIVE', 'score': 0.995543897151947}]

1. **As the warning shows, we did not specify a model, so the pipeline fell back to a default one.**
2. **So we will specify a model explicitly and then look at the results again.**

### Sentiment analysis with an explicitly chosen model

In [None]:
pipeline('sentiment-analysis', model='facebook/bart-large-mnli')('I Loved Star Wars so much!')

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'neutral', 'score': 0.9278974533081055}]

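### The previous model returns `neutral`, which is obviously not the right sentiment. `facebook/bart-large-mnli` is a natural language inference model whose labels are contradiction, neutral, and entailment, so we will try again with a model trained to classify emotions.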

In [None]:
pipeline('sentiment-analysis', model='SamLowe/roberta-base-go_emotions')('I Loved Star Wars so much!')

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'love', 'score': 0.9498008489608765}]

## 3. Question Answering

**Example 1**

In [None]:
from transformers import pipeline

qa_model=pipeline("question-answering")

qa_model(question="What day is today", context="Today is Friday")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'score': 0.9879868626594543, 'start': 9, 'end': 15, 'answer': 'Friday'}
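The `start` and `end` values in the result are character offsets into the context string, with `end` exclusive. A short check using the values from the output above:

```python
context = "Today is Friday"

# Result copied from the pipeline output above.
result = {"score": 0.9879868626594543, "start": 9, "end": 15, "answer": "Friday"}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)  # Friday
```

This is handy when you want to highlight the answer inside the original text instead of only printing it.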

**Example 2**

In [None]:
from transformers import pipeline

qa_model=pipeline("question-answering")

qa_model(question="What Language am I Learning?",
         context="Python is a great programming language and the codebase is beautiful.\n\nOur goals with Python is to provide an easy, extensible interface to other languages and applications by providing a single code base that we can use and integrate with any other language or")


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'score': 0.5345434546470642, 'start': 0, 'end': 6, 'answer': 'Python'}

## 4. Text Summarization

In [6]:
pip install datasets

Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

**Downloading and loading the dataset**

In [None]:
from datasets import load_dataset

dataset=load_dataset('CShorten/ML-ArXiv-Papers')

**Showing the dataset**

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
        num_rows: 117592
    })
})

**Showing a Sample text**

In [None]:
dataset['train'][0]

{'Unnamed: 0.1': 0,
 'Unnamed: 0': 0.0,
 'title': 'Learning from compressed observations',
 'abstract': '  The problem of statistical learning is to construct a predictor of a random\nvariable $Y$ as a function of a related random variable $X$ on the basis of an\ni.i.d. training sample from the joint distribution of $(X,Y)$. Allowable\npredictors are drawn from some specified class, and the goal is to approach\nasymptotically the performance (expected loss) of the best predictor in the\nclass. We consider the setting in which one has perfect observation of the\n$X$-part of the sample, while the $Y$-part has to be communicated at some\nfinite bit rate. The encoding of the $Y$-values is allowed to depend on the\n$X$-values. Under suitable regularity conditions on the admissible predictors,\nthe underlying family of probability distributions and the loss function, we\ngive an information-theoretic characterization of achievable predictor\nperformance in terms of conditional distortion-rat

**Using a pre-trained model to perform text summarization**

In [None]:
from transformers import pipeline

classifier=pipeline("summarization",model='facebook/bart-large-cnn')

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
classifier(dataset['train'][0]['abstract'])

[{'summary_text': 'Predictors are drawn from some specified class, and the goal is to approach the performance (expected loss) of the best predictor in the class. The ideas areillustrated on the example of nonparametric regression in Gaussian noise. Under suitable regularity conditions on the admissible predictors, we give an information-theoretic characterization of achievable predictorperformance in terms of conditional distortion-rate functions.'}]

## 5. Tokenisation

**Tokenization splits a sentence into tokens, the units a model works with. It is the first step before tasks such as stop-word removal and other text processing.**

In [1]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained('google-bert/bert-base-cased')

In [3]:
sentence="I loved Star Wars so much!"

tokens=tokenizer.tokenize(sentence)

tokens

['I', 'loved', 'Star', 'Wars', 'so', 'much', '!']
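Every word in this sentence is already in the BERT vocabulary, so each one becomes a single token. Rarer words are split into subword pieces. The sketch below shows the greedy longest-match idea behind WordPiece, the scheme BERT tokenizers use. It is a simplified illustration: `wordpiece` is a hypothetical helper and `toy_vocab` is a made-up vocabulary, not the real `bert-base-cased` one.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Made-up vocabulary for illustration only.
toy_vocab = {"I", "loved", "Star", "Wars", "token", "##ization", "!"}

print(wordpiece("loved", toy_vocab))
print(wordpiece("tokenization", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

The `##` prefix tells you that a piece continues the previous one rather than starting a new word.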

## 6. Fine Tune Model

### Summary

1. Load and Prepare dataset
2. Preprocess Data
3. Set Training Arguments
4. Initialise Model
5. Train Model
6. Evaluate Model
7. Save Results

#### a) Load and Prepare dataset

In [7]:
from datasets import load_dataset

dataset=load_dataset('imdb')

#### b) Load the tokenizer of the pretrained model

In [None]:
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained('google-bert/bert-base-cased')

- **As you can see, our dataset is split into three parts: train, test, and unsupervised.**

In [8]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [9]:
dataset['test']['text'][0]

'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as they have

#### c) Create a custom function for tokenization and apply this to original dataset

- **As you now know, you need a tokenizer to process the text and include a padding and truncation strategy to handle any variable sequence lengths.**

In [10]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets=dataset.map(tokenize_function, batched=True)

tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})
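The new `input_ids` and `attention_mask` columns are the result of padding and truncation. A minimal sketch of the idea in plain Python, assuming a padding ID of 0 and a tiny `max_length` of 6 for readability (with `padding="max_length"` the real tokenizer pads to the model's maximum length and also takes care of special tokens, which this sketch ignores). `pad_or_truncate` is a hypothetical helper and the token IDs are made up.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Force a sequence to exactly max_length tokens and build its attention mask."""
    ids = input_ids[:max_length]            # truncation: drop tokens past the limit
    num_padding = max_length - len(ids)     # padding needed to reach the limit
    return {
        "input_ids": ids + [pad_id] * num_padding,
        "attention_mask": [1] * len(ids) + [0] * num_padding,  # 1 = real token, 0 = padding
    }


short = pad_or_truncate([101, 146, 1567, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short)
print(long)
```

Because every example ends up with the same length, the examples can be stacked into batches, and the attention mask tells the model which positions to ignore.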

#### d) Use a sample of 1,000 examples for faster training

- **Training on the full 25,000 examples would otherwise take much longer.**

In [11]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
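Shuffling with a fixed seed and then taking the first 1,000 rows gives a random but reproducible subset. The sketch below mimics that idea with the standard library. `shuffled_sample` is a hypothetical helper, and because `datasets` uses its own random generator, the actual row indices it picks will differ from these.

```python
import random


def shuffled_sample(num_rows, sample_size, seed):
    """Shuffle row indices with a fixed seed and keep the first sample_size of them."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    return indices[:sample_size]


first = shuffled_sample(25000, 1000, seed=42)
second = shuffled_sample(25000, 1000, seed=42)

print(first == second)   # same seed, same subset
print(len(set(first)))   # 1000 distinct rows
```

Without the shuffle, `select(range(1000))` would simply take the first 1,000 rows in their stored order, which is not guaranteed to contain a balanced mix of labels.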

### Training

- **Here we configure where checkpoints are saved, the learning rate, the evaluation strategy, and other training settings.**

#### e) Setting Up Training Arguments

In [12]:
from transformers import TrainingArguments

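training_args = TrainingArguments(output_dir="./results",        # same folder the model is saved to later
                                  evaluation_strategy="epoch",   # evaluate after each epoch (named eval_strategy in newer transformers releases)
                                  learning_rate=2e-5,
                                  num_train_epochs=3,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=16,
                                  weight_decay=0.01)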
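The `learning_rate` is only the starting value. A minimal sketch of how it changes during training, assuming the `Trainer` defaults of a linear decay schedule with no warmup (the notebook does not override them). `linear_lr` is a hypothetical helper, and the 189 total steps are taken from the training output of this run.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


total_steps = 189  # global_step reported by trainer.train() in this run

for step in (0, 63, 126, 189):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the rate falls from 2e-5 at the first step to 0 at the last one, so later epochs make smaller updates than earlier ones.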



#### f) Load the model (we are using a pre-trained model here)

In [13]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### For training we use **small_train_dataset**, which consists of 1000 reviews, and for evaluation we use **small_eval_dataset**, which also consists of 1000 reviews.

#### g) Initialize the Trainer with the model

In [15]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset
)

#### h) Start training

In [16]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.342235 |
| 2 | No log | 0.34017 |
| 3 | No log | 0.390564 |


TrainOutput(global_step=189, training_loss=0.3046949477422805, metrics={'train_runtime': 381.7777, 'train_samples_per_second': 7.858, 'train_steps_per_second': 0.495, 'total_flos': 789333166080000.0, 'train_loss': 0.3046949477422805, 'epoch': 3.0})
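The numbers in `TrainOutput` can be checked by hand. The sketch below assumes a single device and no gradient accumulation, and it assumes the `Trainer` default of logging the training loss every 500 steps, which would explain why the Training Loss column shows "No log" for a run this short.

```python
import math

num_examples = 1000   # size of small_train_dataset
batch_size = 16       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
train_runtime = 381.7777  # seconds, from the output above

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

logging_steps = 500   # assumed Trainer default, not set in this notebook

print(steps_per_epoch)                                         # 63
print(total_steps)                                             # 189
print(round(num_examples * num_epochs / train_runtime, 3))     # 7.858
print(round(total_steps / train_runtime, 3))                   # 0.495
print(total_steps < logging_steps)                             # True
```

The results match `global_step=189`, `train_samples_per_second` and `train_steps_per_second` in the output. Setting a smaller `logging_steps` in `TrainingArguments` should make the training loss appear in the table.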

#### i) Evaluating the results

In [17]:
results=trainer.evaluate()

results

{'eval_loss': 0.3905642032623291,
 'eval_runtime': 30.5672,
 'eval_samples_per_second': 32.715,
 'eval_steps_per_second': 2.061,
 'epoch': 3.0}
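The results contain only the loss because no metric function was given to the `Trainer`. A minimal sketch of an accuracy metric in plain Python, assuming it receives a pair of (logits, labels) as the `Trainer` passes to its `compute_metrics` hook; the example inputs are made-up values, not outputs of the model above.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from per-class logits and the true labels."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews and their true labels.
example_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2], [0.0, 1.0]]
example_labels = [1, 0, 1, 1]

print(compute_metrics((example_logits, example_labels)))  # {'accuracy': 0.75}
```

Passing such a function as `compute_metrics=compute_metrics` when creating the `Trainer` should add an `eval_accuracy` entry to the dictionary returned by `trainer.evaluate()`.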

#### j) Save the Model and tokenizer

In [18]:
model.save_pretrained("./results")
tokenizer.save_pretrained("./results")

('./results/tokenizer_config.json',
 './results/special_tokens_map.json',
 './results/vocab.txt',
 './results/added_tokens.json',
 './results/tokenizer.json')