
In [None]:
!pip install datasets

In [None]:
!pip install evaluate

In [52]:
from transformers import pipeline
from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, DataCollatorWithPadding

# Sentiment analysis

In [2]:
classifier = pipeline("sentiment-analysis")
classifier("Summer is the best season.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998335838317871}]
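
The `score` in the output above is a class probability. The following is a minimal sketch of how such a score can be derived from a classifier's raw outputs, assuming a two-class head whose logits are passed through a softmax. The logit values and the `id2label` mapping are made up for illustration and are not read from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the labels (NEGATIVE, POSITIVE).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-4.2, 4.5]

probs = softmax(logits)
# Pick the index with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best], "score": probs[best]}
print(result)
```

The larger the gap between the two logits, the closer the reported score gets to 1.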

# Zero-shot classification

In [3]:
classifier = pipeline("zero-shot-classification")
classifier(
    "Data Science needs to be studied with professionals",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'sequence': 'Data Science needs to be studied with professionals',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.5705001354217529, 0.35369235277175903, 0.07580748200416565]}

# Text generation


In [7]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "The earth is",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The earth is cold.\n\nIt would look even worse if we had the same meteorological data in our hands...\nThe first time there seems'},
 {'generated_text': 'The earth is moving and it is shifting towards the dark of the night. Its axis is spinning downward. No gravity is required to move it forward.'}]

# Mask Filling

In [11]:
unmasker = pipeline("fill-mask")
unmasker("The <mask> probability is higher than choosing the other side.", top_k=2)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.06063278019428253,
  'token': 1298,
  'token_str': ' winning',
  'sequence': 'The winning probability is higher than choosing the other side.'},
 {'score': 0.056530196219682693,
  'token': 6814,
  'token_str': ' default',
  'sequence': 'The default probability is higher than choosing the other side.'}]

# Named Entity Recognition

In [14]:
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("Artan Ebibi studies at a good university named Faculty Of Computer Engineering.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'entity_group': 'PER',
  'score': 0.98805726,
  'word': 'Artan Ebibi',
  'start': 0,
  'end': 11},
 {'entity_group': 'ORG',
  'score': 0.9929935,
  'word': 'Faculty Of Computer Engineering',
  'start': 47,
  'end': 78}]
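
The `start` and `end` values in the output above are character offsets into the input string. A short sketch, using only the spans copied from that output, shows that they behave like a Python slice (inclusive start, exclusive end):

```python
text = "Artan Ebibi studies at a good university named Faculty Of Computer Engineering."

# Entity spans copied from the pipeline output above.
entities = [
    {"entity_group": "PER", "start": 0, "end": 11},
    {"entity_group": "ORG", "start": 47, "end": 78},
]

# Slice the original text with each span to recover the entity's surface form.
spans = [(e["entity_group"], text[e["start"]:e["end"]]) for e in entities]
for group, surface in spans:
    print(f"{group}: {surface}")
```

This is handy when the entities need to be highlighted or replaced in the original text.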

# Question Answering

In [18]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="What is 2+2?",
    context="2+2=4"
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


{'score': 0.9614691138267517, 'start': 4, 'end': 5, 'answer': '4'}

# Text Summarization

In [29]:
summarizer = pipeline("summarization")
summarizer(
    """
    Situations That Increase Price Sensitivity - Availability of Substitute Products, Higher Total Expenditure, Noticeable Price Differences & Easy Price Comparisons.
    Situations That Decrease Price Sensitivity - Lack of Substitutes, Real or Perceived Necessities, Complementary Products (If the price of one product falls, customers will be less sensitive to the price of complementary products), Perceived Product Benefits (Worth it Products), Situational Influences & Product Differentiation.
    Pricing Service Products - If the service provider sets prices too low, customers will have inaccurate perceptions and expectations about quality. If prices are too high, customers may not give the firm a chance.
    Due to the limited capacity associated with most services, service pricing is also a key issue with respect to balancing supply and demand during peak and off-peak demand times. In these situations, many service firms use yield management systems to balance pricing and revenue considerations with their need to fill unfilled capacity. Yield management allows the service firm to simultaneously control capacity and demand to maximize revenue and capacity utilization
    **The initial price is critical, not only for initial success, but also for maintaining the potential for profit over the long term. There are several different approaches to base pricing, like:
    Price Skimming - This strategy intentionally sets a high price relative to the competition, thereby “skimming” off the profits early after the product’s launch. Price skimming is designed to recover the high R&D and marketing expenses associated with developing a new product.
    Price Penetration - This strategy is designed to maximize sales, gain widespread market acceptance, and capture a large market share quickly by setting a relatively low initial price.
    Prestige Pricing - This strategy sets prices at the top end of all competing products in a category. This is done to promote an image of exclusivity and superior quality. Prestige pricing is a viable approach in situations where it is hard to objectively judge the true value of a product.
    Value-Based Pricing (EDLP) - Firms that use a value-based pricing approach set reasonably low prices but still offer high quality products and adequate customer services. It sets prices so they are consistent with the benefits and costs associated with acquiring the product, example Ikea.
    Competitive Matching - these firms set prices at what most consider to be the “going rate” for the industry, example Oil, Steel, Gold etc.
    Non-Price Strategies - By downplaying price in the marketing program, the firm must be able to emphasize the product’s quality, benefits, and unique features; as well as customer service, promotion, or packaging to make the product stand out against competitors, many of whom will offer similar products at lower prices. For example, theme parks like Disney World, Sea World, and Universal Studios generally compete on excellent service, unique benefits, and one-of-a-kind experiences rather than price. Customers willingly pay for these experiences because they cannot be found in any other setting.
    Techniques of Adjusting the Base Price – Discounting , Reference Pricing (“Originally $99, Now $49”), Price Bundling (Sometimes called solution-based pricing or all-inclusive pricing, price bundling brings together two or more complementary products for a single price), Odd Pricing (Everyone knows that prices are rarely set at whole, round numbers. To say you will cut my grass for $47 sounds like you put a lot more thought into it than if you just said, “I will do it for $40,” even though the first figure is $7 higher), & Price Lining (the price of a competing product is the reference price, takes advantage of the simple truth that some customers will always choose the lowest-priced or highest-priced product.)
    Adjusting Prices in Business Markets - Trade discounts, Discounts and allowances, Geographic pricing, Transfer pricing, Barter and countertrade (exchange products), & Price discrimination.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'summary_text': ' The initial price is critical, not only for initial success, but also for maintaining potential for profit over the long term . Price Skimming is designed to recover the high R&D and marketing expenses associated with developing a new product . Value-based pricing is a viable approach in situations where it is hard to objectively judge the true value of a product .'}]

# Translation

In [34]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-mk")
translator("How are you doing today, you look sad?")

Device set to use cpu


[{'translation_text': 'Како си денес, изгледаш тажно?'}]

# Fine-tuning a pre-trained model

In [35]:
dataset = load_dataset("csv", data_files="https://raw.githubusercontent.com/artanebibi/Datasets/refs/heads/main/emotions-dataset.csv")
dataset

DatasetDict({
    train: Dataset({
        features: ['message', 'emotion'],
        num_rows: 12000
    })
})

In [36]:
df = dataset['train'].to_pandas()

In [37]:
# encode the string emotion labels as integer class ids
encoder = LabelEncoder()
labels = encoder.fit_transform(dataset['train']['emotion'])
dataset['train'] = dataset['train'].add_column("label", labels)
dataset['train'] = dataset['train'].remove_columns('emotion')
dataset['train'] = dataset['train'].rename_column('message', 'text')
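
As a rough illustration of what the label encoding step does, the sketch below assigns integer ids by the sorted order of the unique class names, which is how `LabelEncoder` orders its `classes_`. The emotion values here are hypothetical; the categories in the real dataset may differ.

```python
# Hypothetical emotion values; the real dataset's categories may differ.
emotions = ["joy", "anger", "sadness", "joy", "fear"]

# Assign ids by the sorted order of the unique class names.
classes = sorted(set(emotions))
label2id = {name: i for i, name in enumerate(classes)}
id2label = {i: name for name, i in label2id.items()}

# Encode the strings, then decode them again to check the round trip.
labels = [label2id[e] for e in emotions]
decoded = [id2label[i] for i in labels]
print(classes)
print(labels)
```

Keeping the id-to-name mapping around makes it possible to turn model predictions back into readable emotion names later.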


In [38]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 12000
    })
})

In [41]:
dataset = dataset['train'].train_test_split(test_size=0.2)

In [42]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 7680
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1920
    })
})

In [45]:
# tokenize the texts (convert them to token ids) before feeding them to the model

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_texts = tokenizer(dataset["train"]["text"])

In [54]:
def tokenize(sample):
    # Truncate inputs that exceed the model's maximum sequence length.
    return tokenizer(sample['text'], truncation=True)

In [55]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_dataset = dataset.map(tokenize, batched=True)

In [57]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
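
The collator pads each batch to the length of its longest sequence rather than padding the whole dataset to one fixed length. The sketch below imitates that idea in plain Python. The helper `pad_batch` and the token ids are hypothetical, and the real collator additionally returns tensors and handles other fields such as `token_type_ids`.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two texts of different lengths.
batch = [[101, 7592, 102], [101, 2023, 2003, 2204, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch keeps short batches short, which usually saves computation compared with padding everything to the model's maximum length.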

In [58]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7680
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1920
    })
})

In [68]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="trainer",
    eval_strategy="epoch",
    per_device_train_batch_size=8,  # batch size for training
    per_device_eval_batch_size=8,  # batch size for evaluation
    metric_for_best_model="f1",
)

In [69]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [70]:
import evaluate
import numpy as np

metric = evaluate.load("f1")

In [71]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels, average="weighted")
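
To make `average="weighted"` concrete, here is a small hand-written sketch of a weighted F1: the F1 score is computed per class and then averaged with weights proportional to how often each class occurs in the references. The function `weighted_f1` and the toy data are illustrative assumptions, not the implementation used by `evaluate`.

```python
from collections import Counter

def weighted_f1(predictions, references):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(references)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(p == cls and r == cls for p, r in zip(predictions, references))
        n_pred = sum(p == cls for p in predictions)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(references)

# Hypothetical predictions and references for three classes.
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
score = weighted_f1(predictions, references)
print(round(score, 4))
```

With imbalanced classes, the weighted average is dominated by the frequent classes, so it can hide poor performance on rare ones.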

In [72]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
predictions = trainer.predict(tokenized_dataset["test"])

In [None]:
predictions
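
The object returned by `trainer.predict` holds the raw logits in its `predictions` field, so one more step is needed to get class names. The sketch below shows that step on made-up data: take the argmax of each row and look up the class name. The logits and the `classes` list are hypothetical. In the notebook the equivalent would be applying `np.argmax(..., axis=-1)` to the logits and passing the result to `encoder.inverse_transform`.

```python
# Hypothetical logits for three samples over four classes.
logits = [
    [0.1, 2.3, -1.0, 0.4],
    [1.9, 0.2, 0.3, -0.5],
    [-0.7, 0.0, 0.5, 3.1],
]
# Hypothetical class names, in the order the label encoder assigned ids.
classes = ["anger", "fear", "joy", "sadness"]

def argmax(row):
    """Return the index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

predicted_ids = [argmax(row) for row in logits]
predicted_names = [classes[i] for i in predicted_ids]
print(predicted_names)
```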