1. Neural Machine Translation (NMT) using Hugging Face

We'll translate English to German using the pre-trained model Helsinki-NLP/opus-mt-en-de.

In [None]:
!pip install transformers sentencepiece

In [3]:
from transformers import MarianMTModel, MarianTokenizer

# Load pre-trained model and tokenizer
model_name = 'Helsinki-NLP/opus-mt-en-de'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# English sentences to translate
src_texts = ["How are you?", "The weather is nice today.", "I love learning AI!"]

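# Tokenize the batch (padded to the longest sentence), then generate translations
inputs = tokenizer(src_texts, return_tensors="pt", padding=True)
translated = model.generate(**inputs)

# Decode the generated token IDs back to text, dropping special tokens
output = tokenizer.batch_decode(translated, skip_special_tokens=True)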

# Show output
for i, translation in enumerate(output):
    print(f"EN: {src_texts[i]}\nDE: {translation}\n")




EN: How are you?
DE: Wie geht es dir?

EN: The weather is nice today.
DE: Das Wetter ist heute schön.

EN: I love learning AI!
DE: Ich liebe es, KI zu lernen!
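
The cell above sends all three sentences through `model.generate` in a single call. For longer lists, one common pattern is to translate in fixed-size chunks so that memory use stays bounded. The sketch below assumes a hypothetical helper `translate_in_batches` and a `translate_fn` callable, neither of which is in the original notebook; a stub stands in for the model so the snippet runs without downloading anything.

```python
from typing import Callable, Iterator, List


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(
    texts: List[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Apply `translate_fn` chunk by chunk and keep the input order."""
    results: List[str] = []
    for batch in chunks(texts, batch_size):
        results.extend(translate_fn(batch))
    return results


# Stub translator (assumption, for illustration only). With the Marian model
# loaded as in the cell above, translate_fn could instead tokenize the batch,
# call model.generate, and decode the result.
def fake_translate(batch: List[str]) -> List[str]:
    return [text.upper() for text in batch]


demo = translate_in_batches(["a", "b", "c", "d", "e"], fake_translate, batch_size=2)
print(demo)
```

The batch size of 8 is an arbitrary default; a suitable value depends on sentence length and available memory.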



2. Text Classification using DistilBERT

We fine-tune DistilBERT on a tiny four-sentence dataset for binary sentiment classification (1 = positive, 0 = negative).

In [6]:
!pip install transformers datasets torch scikit-learn nltk




In [16]:
!pip install "transformers[torch]"


In [15]:
import nltk
import torch
from datasets import Dataset
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    IntervalStrategy,
    Trainer,
    TrainingArguments,
)

# Download tokenizer tools
nltk.download('punkt')

# Sample data
texts = [
    "I love this movie!", "It was terrible and boring.",
    "Great acting and storyline.", "Worst film I’ve ever seen."
]
labels = [1, 0, 1, 0]

# Convert to dataset
dataset = Dataset.from_dict({"text": texts, "label": labels})
train_test = dataset.train_test_split(test_size=0.5)
train_dataset = train_test['train']
test_dataset = train_test['test']

# Load tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Tokenize
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# Training setup
training_args = TrainingArguments(
    output_dir="./output",
    # Recent transformers releases renamed `evaluation_strategy` to `eval_strategy`;
    # the old name raises a TypeError there.
    eval_strategy=IntervalStrategy.NO,
    num_train_epochs=2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_dir="./logs"
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

trainer.train()
trainer.evaluate()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Gauri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 2/2 [00:00<00:00, 666.19 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 666.66 examples/s]
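
As configured above, `Trainer` receives no metric function, so `trainer.evaluate()` reports the evaluation loss and timing statistics but no accuracy. A minimal sketch of a `compute_metrics` callback follows; the function body and the example values are assumptions for illustration, not part of the original notebook. `Trainer` passes the callback an object that unpacks into `(logits, labels)`. The sketch is written in plain Python so it runs without extra dependencies and accepts either nested lists or NumPy arrays.

```python
def compute_metrics(eval_pred):
    """Return accuracy given (logits, labels) from an evaluation pass."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Quick check with made-up logits for two examples (illustrative values only).
example_logits = [[0.2, 1.5], [0.9, 0.1]]
example_labels = [1, 1]
metrics = compute_metrics((example_logits, example_labels))
print(metrics)
```

To use it, pass `compute_metrics=compute_metrics` when constructing `Trainer`. With only two evaluation examples the resulting number is not meaningful beyond checking that the pipeline runs.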