# Pretraining a Language Model

Pretraining is a critical step in building modern natural language processing (NLP) systems. It refers to training a model on a large text corpus before fine-tuning it for specific downstream tasks. The goal is for the model to learn general language patterns such as grammar, word relationships, context, and meaning. The representations learned from this large corpus can then be transferred to specialized tasks with comparatively little additional data and compute.

### Benefits of Pretraining

The main benefit of pretraining is that it significantly reduces the need for task-specific labeled data. By leveraging a large, diverse dataset during pretraining, a model can learn general language representations that are useful across a wide range of NLP tasks. This is particularly beneficial for tasks where labeled data is scarce or difficult to obtain. For example, a model pretrained on millions of documents can perform well in tasks such as sentiment analysis, named entity recognition, or text summarization, even with limited task-specific training data.

### BERT and GPT: Architectures and Training Approaches

Two of the best-known pretrained language models are BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer). Both use the Transformer architecture, which has become the foundation for many state-of-the-art NLP systems because it captures long-range dependencies in text efficiently.

BERT is trained as a **masked language model (MLM)**. During pretraining, some of the tokens in a sentence are randomly masked, and BERT learns to predict them from the surrounding context. Because the model sees the words on both sides of each masked position, this bidirectional approach yields richer word representations. For example, in the sentence "The cat ___ on the mat," BERT would predict the word "sat" using both the words before the blank and the words after it.

GPT, on the other hand, is trained as a **causal language model (CLM)**. Unlike BERT, GPT is trained to predict the next word in a sequence, conditioned on all previous words. This unidirectional approach helps GPT generate coherent and contextually appropriate text. For example, in the phrase "The cat sat on the," GPT would predict the next word to be "mat," relying only on the words that come before it. GPT's causal nature makes it well suited to text generation tasks, such as writing articles or answering open-ended questions, with applications in industries from healthcare to entertainment.
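
To make the contrast between the two objectives concrete, here is a minimal sketch of how each one turns the same sentence into training examples. The function names `mlm_example` and `clm_examples` are hypothetical helpers written for this illustration, and the sketch works on whole words rather than the sub-word tokens real models use.

```python
import random


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide some tokens; the targets are the hidden tokens."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)  # this position is not scored
    return inputs, targets


def clm_examples(tokens):
    """Causal LM: every prefix is used to predict the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()

masked_inputs, masked_targets = mlm_example(tokens, mask_prob=0.3)
print("MLM input:", masked_inputs)

for context, target in clm_examples(tokens):
    print("CLM:", context, "->", target)
```

In the masked case the model sees the whole sentence except the hidden positions; in the causal case it only ever sees the prefix to the left of the target.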

### Applications of Pretrained and Fine-Tuned Language Models

The versatility of pretrained language models like BERT and GPT makes them applicable to a wide range of downstream tasks. Fine-tuned models have been used in areas such as:

- **Text classification**: Identifying categories or labels for a given piece of text, such as spam detection or sentiment analysis.
- **Named entity recognition (NER)**: Identifying and classifying entities (people, organizations, locations, etc.) in a text.
- **Question answering**: Finding and providing answers to questions based on a given context or document.
- **Text generation**: Automatically generating coherent and contextually appropriate text, useful in content creation, chatbots, and summarization.
- **Machine translation**: Translating text from one language to another, which has improved significantly with Transformer-based models.

By leveraging the power of pretraining and fine-tuning, language models can be adapted to solve a wide variety of tasks efficiently and accurately, often achieving state-of-the-art results even with relatively small amounts of task-specific data.

In [None]:
# !pip install tokenizers -q
# !pip install datasets -q

In [None]:
import requests
import zipfile
import os
import shutil

# Define the URL of the ZIP file
zip_url = "https://github.com/AsoSoft/AsoSoft-Text-Corpus/raw/master/AsoSoft%20Text%20Corpus%20Small%20Version/AsoSoft Text Corpus- Small Version 1.0 (2018-12-10).zip"

# Define the local filename
local_zip_file = "AsoSoft Text Corpus- Small Version 1.0 (2018-12-10).zip"

# Download the ZIP file; stop here if the request fails so the
# extraction step below never runs on a missing or partial file
print("Downloading ZIP file...")
response = requests.get(zip_url, timeout=60)
response.raise_for_status()
with open(local_zip_file, "wb") as file:
    file.write(response.content)
print("Download complete.")

# Unzip the file and rename the desired file
print("Extracting contents and renaming...")
with zipfile.ZipFile(local_zip_file, "r") as zip_ref:
    extract_dir = "data"  # Directory to extract contents
    os.makedirs(extract_dir, exist_ok=True)
    zip_ref.extractall(extract_dir)

    # Find and rename the text file
    for file_name in os.listdir(extract_dir):
        if file_name.endswith(".txt"):  # Adjust if the file extension is different
            old_file_path = os.path.join(extract_dir, file_name)
            new_file_path = os.path.join(extract_dir, "text_data.txt")
            os.rename(old_file_path, new_file_path)
            print(f"Renamed '{file_name}' to 'text_data.txt'.")

# Optional: Clean up (delete the ZIP file)
os.remove(local_zip_file)
print("Temporary ZIP file deleted.")

In [None]:
from tokenizers import BertWordPieceTokenizer

# Define paths
data_file = "data/text_data.txt"
tokenizer_save_path = "./bert_tokenizer/"

os.makedirs(tokenizer_save_path, exist_ok=True)

# Initialize and train the tokenizer
tokenizer = BertWordPieceTokenizer(lowercase=True, strip_accents=True)
tokenizer.train(
    files=[data_file],
    vocab_size=20_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Save the tokenizer
tokenizer.save_model(tokenizer_save_path)
print(f"Tokenizer saved at {tokenizer_save_path}")
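
Once a WordPiece vocabulary has been trained, each word is split by greedy longest-match-first lookup against that vocabulary. The sketch below illustrates that lookup step only (not the training procedure) using a tiny hand-written vocabulary. Both `wordpiece_tokenize` and `toy_vocab` are hypothetical and are not part of the `tokenizers` library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, made up for this illustration.
toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}

print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("unplayable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A larger `vocab_size` lets more words survive as single pieces, while a smaller one pushes the tokenizer toward shorter fragments.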

In [None]:
from datasets import load_dataset
from transformers import BertTokenizerFast

# Load your text file as a dataset
dataset = load_dataset("text", data_files={"train": data_file})

# Load the trained tokenizer
tokenizer = BertTokenizerFast.from_pretrained(tokenizer_save_path)


# Tokenization and dataset preprocessing
def preprocess_function(examples):
    # Tokenize input text
    encoding = tokenizer(
        examples["text"],
        truncation=True,
        max_length=128,
        padding="max_length",
        return_special_tokens_mask=True,
    )
    return encoding


# Apply preprocessing to the dataset
tokenized_dataset = dataset["train"].map(
    preprocess_function, batched=True, remove_columns=["text"]
)
tokenized_dataset.set_format(
    type="torch", columns=["input_ids", "attention_mask", "token_type_ids"]
)
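
The `truncation`, `max_length`, and `padding="max_length"` options above make every example the same length. The following sketch shows the idea on plain lists of token IDs. `pad_or_truncate` is a hypothetical helper, `pad_id=0` is an assumption, and the real tokenizer additionally keeps the closing `[SEP]` token when it truncates, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]            # cut sequences that are too long
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    # Fill the remainder with the pad ID; 0 in the mask marks padding.
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate([2, 57, 91, 3], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions during training.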

In [None]:
from transformers import DataCollatorForLanguageModeling

# Initialize the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
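
With `mlm=True` and `mlm_probability=0.15`, the collator selects roughly 15% of the non-special tokens in each batch. By default, about 80% of the selected tokens are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged, and every unselected position gets the label `-100` so the loss ignores it. The sketch below imitates that rule on a plain list of IDs. `mask_tokens` is a hypothetical stand-in rather than the library's implementation, and the special-token IDs are assumed for illustration.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, seed=None):
    """Return (masked_input_ids, labels) following the 80/10/10 rule."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        # Never select special tokens; select the rest with mlm_probability.
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id                       # the model must predict this
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the original token in place
    return masked, labels


# Assumed IDs: [PAD]=0, [CLS]=2, [SEP]=3, [MASK]=4
example_ids = [2] + list(range(10, 30)) + [3]
masked_ids, labels = mask_tokens(
    example_ids, special_ids={0, 2, 3}, mask_id=4, vocab_size=100, seed=1
)
print(masked_ids)
print(labels)
```

Because masking happens in the collator, a different set of positions is chosen every time a batch is built, so the model sees varied masks across epochs.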

In [None]:
from transformers import BertConfig, BertForMaskedLM

# Define the BERT configuration
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=256,
    max_position_embeddings=128,
    type_vocab_size=2,
)

# Initialize the BERT model
model = BertForMaskedLM(config)

print(f"{model.num_parameters():,}")
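
The parameter count printed above can be cross-checked by hand. The sketch below is an estimate that assumes the usual Hugging Face layout for `BertForMaskedLM`: no pooler, output weights tied to the word embeddings, and a separate output bias. It also assumes the tokenizer ended up with exactly 20,000 entries, which may not hold for a small corpus. `bert_mlm_param_count` is a hypothetical helper.

```python
def bert_mlm_param_count(vocab_size, hidden_size, num_layers,
                         intermediate_size, max_positions, type_vocab_size):
    """Estimate trainable parameters of a BERT masked-LM with tied weights."""
    h, i = hidden_size, intermediate_size
    layer_norm = 2 * h  # one scale and one shift vector

    # Word, position, and segment embeddings plus their LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab_size) * h + layer_norm

    # Query, key, value, and output projections (weights + biases) + LayerNorm.
    attention = 4 * (h * h + h) + layer_norm

    # Two feed-forward projections (weights + biases) + LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm

    encoder = num_layers * (attention + feed_forward)

    # MLM head: dense transform + LayerNorm + output bias.
    # The output weight matrix is shared with the word embeddings.
    mlm_head = (h * h + h) + layer_norm + vocab_size

    return embeddings + encoder + mlm_head


estimate = bert_mlm_param_count(
    vocab_size=20_000, hidden_size=128, num_layers=2,
    intermediate_size=256, max_positions=128, type_vocab_size=2,
)
print(f"{estimate:,}")
```

Under these assumptions most of the parameters sit in the word-embedding table, so the vocabulary size dominates the size of a model this small. The number of attention heads does not change the count, because the heads only partition the same projection matrices.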

In [None]:
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="./bert-mlm-model",
    overwrite_output_dir=True,
    num_train_epochs=20,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=1,
    logging_steps=1000,
    logging_dir="./logs",
    report_to="none",
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

trainer.train()

In [None]:
# Save the model
model.save_pretrained("./bert-mlm-model")
tokenizer.save_pretrained("./bert-mlm-model")

In [None]:
from transformers import BertForMaskedLM, BertTokenizerFast

# Load the model and tokenizer from the saved path
model_path = "./bert-mlm-model"
model = BertForMaskedLM.from_pretrained(model_path)
tokenizer = BertTokenizerFast.from_pretrained(model_path)

In [None]:
# Input sentence with a masked token
sentence = "سڵاو [MASK] تۆ."

# Tokenize the input
inputs = tokenizer(sentence, return_tensors="pt")
inputs

In [None]:
import torch

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits


predictions

In [None]:
# Get the index of the [MASK] token
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Get the logits for the masked token(s)
mask_token_logits = predictions[0, mask_token_index, :]

# Get the top predicted tokens (e.g., top 5)
top_k = 5
top_tokens = torch.topk(mask_token_logits, top_k, dim=1).indices[0].tolist()

# Convert token IDs to words
predicted_words = [tokenizer.decode([token_id]).strip() for token_id in top_tokens]

print("Top predictions for the masked token:")
print(predicted_words)
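
The logits above are unnormalized scores. To read them as probabilities, apply a softmax before taking the top candidates. Here is a minimal sketch on a toy list of scores, with `softmax` and `top_k` written by hand for illustration instead of using `torch`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probabilities, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probabilities), key=lambda p: p[1], reverse=True)
    return ranked[:k]


toy_logits = [2.0, 0.5, -1.0, 3.0]  # made-up scores for a 4-token vocabulary
probs = softmax(toy_logits)

for token_id, prob in top_k(probs, k=2):
    print(token_id, round(prob, 3))
```

Softmax preserves the ranking of the logits, so the top-k tokens are the same either way; only the scores become interpretable.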

In [None]:
# Or you can use pipeline
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model_path, tokenizer=model_path)
fill_mask(sentence)

# Fine-tune your pretrained BERT

### Fine-Tuning for Downstream Tasks

After pretraining, a language model like BERT or GPT can be fine-tuned for specific downstream tasks. Fine-tuning involves training the model further on task-specific labeled data, adjusting its parameters to optimize performance for that particular task. This allows the model to specialize in areas such as sentiment analysis, named entity recognition, machine translation, or summarization, among others.

For example, fine-tuning BERT for a language detection task involves providing the model with a labeled dataset containing sentences and their corresponding language (English, Arabic, or Kurdish). The model will adjust its parameters to focus on the specific patterns that distinguish one language from another.

In contrast, fine-tuning GPT for a task like text generation would involve training it on specific text types, such as legal documents or medical reports, so that the model can generate relevant, context-specific content in those domains.

In [None]:
from datasets import load_dataset

# Load the dataset from a CSV file
dataset = load_dataset(
    "csv", data_files="/kaggle/input/language-detection/language_detection.csv"
)

# Split the dataset into training and testing sets (80% train, 20% test)
split_dataset = dataset["train"].train_test_split(test_size=0.2, shuffle=True)

# Access train and test sets
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

In [None]:
# Define label mapping
label_mapping = {"English": 0, "Arabic": 1, "Kurdish": 2}

# Apply the label mapping to both train and test datasets
train_dataset = train_dataset.map(lambda x: {"label": label_mapping[x["language"]]})
test_dataset = test_dataset.map(lambda x: {"label": label_mapping[x["language"]]})

In [None]:
from transformers import BertTokenizerFast

# Load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained("./bert-mlm-model")


# Define a tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["sentence"], truncation=True, padding="max_length", max_length=128
    )


# Tokenize the train and test datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Remove unused columns
train_dataset = train_dataset.remove_columns(["sentence", "language"]).with_format(
    "torch"
)
test_dataset = test_dataset.remove_columns(["sentence", "language"]).with_format(
    "torch"
)

In [None]:
from torch.utils.data import DataLoader

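# Create DataLoader objects
# Note: these are only needed for a manual training loop. The Trainer used
# below builds its own loaders from the batch sizes in TrainingArguments.
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)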

In [None]:
from transformers import BertForSequenceClassification

# Load the BERT model for sequence classification
num_labels = len(label_mapping)
model = BertForSequenceClassification.from_pretrained(
    "./bert-mlm-model", num_labels=num_labels
)

In [None]:
from transformers import TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="./bert-classification-model",  # Directory to save model
    eval_strategy="epoch",  # Evaluate every epoch
    save_strategy="epoch",  # Save model every epoch
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",  # Directory for logs
    logging_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    save_total_limit=2,  # Save only the last two checkpoints
    report_to="none",
)

In [None]:
from sklearn.metrics import accuracy_score, f1_score


# Define compute_metrics for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    return {"accuracy": accuracy, "f1": f1}


# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

# Train and log metrics
trainer.train()
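
To show what the two metrics in `compute_metrics` measure, here is a hand-written sketch of accuracy and support-weighted F1 on a small made-up set of labels. It is meant to mirror the definitions behind `accuracy_score` and `f1_score(average="weighted")`, not to replace them, and the helper names are hypothetical.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of examples whose prediction equals the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def weighted_f1(labels, predictions):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(labels)  # number of true examples per class
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(labels)


# Made-up labels: 0 = English, 1 = Arabic, 2 = Kurdish
true_labels = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

print("accuracy:", accuracy(true_labels, predicted))
print("weighted F1:", weighted_f1(true_labels, predicted))
```

Weighting by support means frequent classes influence the score more, which matters when the languages are not equally represented in the test set.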

In [None]:
# Evaluate on the test set
results = trainer.evaluate()
print("Test Results:", results)

In [None]:
def predict(sentence):
    # Tokenize the input
    inputs = tokenizer(
        sentence,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=128,
    )
    inputs = {key: val.to(trainer.model.device) for key, val in inputs.items()}

    # Get predictions
    with torch.no_grad():
        outputs = model(**inputs)
        predicted_label = torch.argmax(outputs.logits, dim=-1).item()

    # Map label ID back to language
    label = [lang for lang, idx in label_mapping.items() if idx == predicted_label][0]
    return label


# Example inference
print(predict("This is an English sentence."))
print(predict("هذه جملة باللغة العربية."))
print(predict("ئەمڕۆ سێ شەمە سەرۆک وەزیرانی فەڕەنسا ڕایگەیاند:"))

# Language Detection with Random Forest

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
df = pd.read_csv("/kaggle/input/language-detection/language_detection.csv")

# Features and labels
X = df["sentence"]
y = df["language"]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

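# One-hot encode target labels
# (scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`,
# and the old name was removed in 1.4)
encoder = OneHotEncoder(sparse_output=False)
y_train_encoded = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.values.reshape(-1, 1))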

# Convert text to numerical features using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train the RandomForest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_vectorized, y_train)

# Make predictions
y_pred = clf.predict(X_test_vectorized)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Perform cross-validation on the same string labels used to fit clf above,
# so the scores are comparable with the test accuracy
cv_scores = cross_val_score(
    clf, X_train_vectorized, y_train, cv=5, scoring="accuracy"
)
# Print results
print("Cross-validation scores:", cv_scores)
print("Mean accuracy:", np.mean(cv_scores))

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
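
The features fed to the random forest are plain word counts. The sketch below imitates what `CountVectorizer` does with its default settings, as far as I understand them: lowercase the text, keep tokens of two or more word characters, and count occurrences against an alphabetically sorted vocabulary. `fit_vocabulary` and `transform` are hypothetical helpers that return dense lists rather than the sparse matrix the real class produces.

```python
import re
from collections import Counter

# Tokens of at least two word characters, similar to CountVectorizer's default.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def fit_vocabulary(sentences):
    """Map every distinct token to a column index, in alphabetical order."""
    words = sorted({w for s in sentences for w in TOKEN_PATTERN.findall(s.lower())})
    return {word: index for index, word in enumerate(words)}


def transform(sentences, vocabulary):
    """Turn each sentence into a row of token counts, one column per word."""
    rows = []
    for sentence in sentences:
        counts = Counter(TOKEN_PATTERN.findall(sentence.lower()))
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


toy_sentences = ["The cat sat", "The cat saw the dog"]
vocabulary = fit_vocabulary(toy_sentences)
matrix = transform(toy_sentences, vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
```

Because languages written in different scripts share almost no tokens, even these simple counts separate them well, which is why a bag-of-words baseline is a useful comparison point for the fine-tuned BERT model.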