In [None]:
%pip install transformers datasets torch


### Importing the Required Libraries

A Hugging Face API key (access token) is required.

[Hugging Face](https://huggingface.co/)

Use the link to sign up and create an API key free of cost.
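
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of reading it from an environment variable instead; the variable name HF_TOKEN and the helper read_hf_token are assumptions made for illustration and are not part of the original notebook:

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in the given environment variable.

    Raises a clear error instead of handing an empty token to login().
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token


# Usage in the notebook (assumes the variable was set before starting Jupyter):
# login(token=read_hf_token())
```

The login call is left commented out so the snippet runs without network access.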


In [None]:
from huggingface_hub import login
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

huggingfacekey = "hugging-face api key"
login(token=huggingfacekey)

**Dataset Loading:** The script loads the IMDB movie reviews dataset using the load_dataset("imdb") function from the Hugging Face datasets library.

**Dataset Structure:** After loading, the structure and content of the dataset are printed, allowing you to understand its format and how it's organized.

**Random Sampling:** Ten random integers in the range 0 to 9,999 are generated using NumPy’s randint() function and printed. They only illustrate random index generation; the sampling in the next step uses Python's random.sample() instead.

In [None]:
# Load the IMDB dataset
dataset = load_dataset("imdb")

# Print the dataset structure
print(dataset)

# Draw 10 random integers in [0, 10000) to illustrate random index generation
rand = np.random.randint(10000, size=10)
print(rand)


1. Sampling from the Training Dataset:

    The script uses Python's random.sample() function to select 200 random samples from the training set (dataset['train']).
    These 200 examples are stored in the train_dataset variable, which can be used for training models.

2. Sampling from the Test Dataset:

    Similarly, 20 random samples are selected from the test set (dataset['test']) and stored in the test_dataset variable for evaluation or testing purposes.

3. Random Sampling:

    random.sample() draws indices without replacement, so no review appears twice within a subset. Because the two subsets come from separate splits, the test reviews are never seen during training.

4. Use Case:

    This process helps with quickly creating a smaller subset of the original dataset for faster model prototyping, testing, or experimentation.
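
Without a fixed seed, every run of the notebook picks different reviews, which makes results hard to compare. A small sketch of reproducible index sampling; the helper name sample_indices and the seed value 42 are assumptions for illustration:

```python
import random


def sample_indices(n_rows, k, seed=42):
    """Pick k distinct row indices from range(n_rows), reproducibly."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return rng.sample(range(n_rows), k)


# 25,000 is the size of the IMDB training split
train_idx = sample_indices(25_000, 200)
print(len(train_idx), len(set(train_idx)))  # 200 200
```

The resulting list could be passed to dataset['train'].select(...) in place of train_samples.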

In [None]:
import random

# Sample 200 random training examples
train_samples = random.sample(range(len(dataset['train'])), 200)
train_dataset = dataset['train'].select(train_samples)

# Sample 20 random test examples
test_samples = random.sample(range(len(dataset['test'])), 20)
test_dataset = dataset['test'].select(test_samples)



**Loading the Tokenizer:**

- The script loads the DistilBERT tokenizer (distilbert-base-uncased) using the AutoTokenizer class from Hugging Face's Transformers library.
- This tokenizer converts raw text into token IDs that can be fed into the DistilBERT model.

**Tokenization Function:**

- The tokenize_function() takes a batch of text samples and tokenizes them using the loaded tokenizer.
- With padding="max_length", every sequence is padded to the model's maximum length (512 tokens for DistilBERT), and truncation=True cuts longer sequences down to that length.

**Tokenizing Train and Test Datasets:**

- The train_dataset and test_dataset are tokenized using the map() method, which applies the tokenization function to each batch.
- The batched=True parameter ensures that the dataset is processed in batches for efficiency.

**Result:**

- After tokenization, tokenized_train_dataset and tokenized_test_dataset contain the input IDs and attention masks necessary for training or evaluating a model using DistilBERT.
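
To make padding, truncation, and the attention mask concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the function pad_or_truncate, the pad ID 0, and the token IDs are made-up illustrations, and a real tokenizer also preserves the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)                   # pad slots still to fill
    input_ids = kept + [pad_id] * n_pad              # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 2023, 3185, 2001, 2307, 102], max_length=5)
print(short_ids, short_mask)  # [101, 2023, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 2023, 3185, 2001, 2307] [1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions.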

In [None]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

# Tokenize the sampled datasets
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)



**Remove Unnecessary Columns:**

The text column is removed from both tokenized training and test datasets since it's not required for model training after tokenization.

**Rename 'label' to 'labels':**

The label column is renamed to labels to match the input format expected by Hugging Face's Trainer class.

**Set PyTorch Format:**

The datasets are converted to PyTorch tensors, specifying the columns input_ids, attention_mask, and labels as the necessary inputs for the model.

In [None]:
# Remove unnecessary columns from the tokenized datasets
tokenized_train_dataset = tokenized_train_dataset.remove_columns(["text"])
tokenized_test_dataset = tokenized_test_dataset.remove_columns(["text"])

# Rename the 'label' column to 'labels' to match the Trainer's expectation
# IMDB labels take the values 0 (negative) and 1 (positive)
tokenized_train_dataset = tokenized_train_dataset.rename_column("label", "labels")
tokenized_test_dataset = tokenized_test_dataset.rename_column("label", "labels")

# Set the format for PyTorch
tokenized_train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


In [None]:
# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

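# Different architectures suit different tasks,
# e.g. BART for summarization and BERT-style models for classification

# Define training arguments
# Only 200 of the 25,000 training reviews are used here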
training_args = TrainingArguments(
    output_dir="./results",          # output directory
    evaluation_strategy="epoch",     # evaluate at the end of every epoch (named eval_strategy in newer transformers releases)
    learning_rate=2e-5,              # learning rate
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=8,    # batch size for evaluation
    num_train_epochs=3,              # total number of training epochs
    weight_decay=0.01,               # strength of weight decay
)




**Create the Trainer:**

The Trainer class from Hugging Face is initialized with:

**model:** The model used for training (such as a transformer model).

**training_args:** Arguments for training like learning rate, epochs, batch size, etc.

**train_dataset:** The tokenized training dataset.

**eval_dataset:** The tokenized test dataset for evaluation.

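**Train-Test Split:**

As a separate illustration, the IMDB training split is divided using an 80-20 ratio (test_size=0.2), meaning 80% is for training and 20% for testing. The Trainer does not use this split; it trains on the 200 sampled reviews and evaluates on the 20 sampled test reviews.

**Model Training:**

The trainer.train() method is called to start the training process, which optimizes the model on the provided training dataset and evaluates it at the end of each epoch. Because no compute_metrics function is passed to the Trainer, only the evaluation loss is reported; metrics like precision, accuracy, F1 score, and recall require supplying one.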
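
The length of the training run follows from the numbers already chosen. A back-of-the-envelope sketch, assuming a single device, no gradient accumulation, and that the last partial batch is kept; the helper name total_optimizer_steps is made up for illustration:

```python
import math


def total_optimizer_steps(n_examples, batch_size, epochs):
    """Steps per epoch (a final partial batch counts) times the number of epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


# 200 sampled reviews, batch size 8, 3 epochs
print(total_optimizer_steps(n_examples=200, batch_size=8, epochs=3))  # 75
```

The same arithmetic gives 3 evaluation batches for the 20 sampled test reviews.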

In [None]:
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
)


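# Metrics such as precision, accuracy, F1 score, and recall are only reported
# when a compute_metrics function is passed to the Trainer

# Illustration of an 80-20 split; test_size is a fraction between 0 and 1 (e.g. 0.1, 0.2)
# train_test_split() is a method of a single split, not of the whole DatasetDict
data = dataset['train'].train_test_split(test_size=0.2)
print(data['train'])
print(data['test'])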

# Train the model
trainer.train()

**Evaluate the Model:**

After training, the trainer.evaluate() method is called to evaluate the performance of the model using the test dataset.
Since the Trainer was created without a compute_metrics function, this method returns the evaluation loss along with runtime statistics; metrics like precision, accuracy, F1 score, and recall are not computed.

**Print Evaluation Results:**

The evaluation results are printed to provide insight into the model's performance.
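
The Trainer reports precision, accuracy, F1 score, and recall only when it is given a compute_metrics function. A minimal sketch of such a function using NumPy alone, assuming binary labels with 1 as the positive class; the toy logits are hand-made and are not model output:

```python
import numpy as np


def compute_metrics(eval_pred):
    """Binary-classification metrics from (logits, labels); the positive class is 1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives

    accuracy = float(np.mean(preds == labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy check with hand-made logits
toy_logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 1])
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

Passing compute_metrics=compute_metrics when constructing the Trainer should add these values to the dictionary returned by trainer.evaluate(), each prefixed with eval_.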

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)


{'eval_loss': 0.7238416075706482, 'eval_runtime': 20.7465, 'eval_samples_per_second': 0.964, 'eval_steps_per_second': 0.145, 'epoch': 3.0}


**New Reviews:**

A list of sample reviews (new_reviews) is provided for which the model will predict sentiments.

**Tokenization:**

The reviews are tokenized using the tokenizer, ensuring that the input format matches what the model expects. Padding and truncation are applied to ensure consistent input lengths.

**Predictions:**

The tokenized inputs are passed through the model. torch.no_grad() ensures no gradients are calculated since we're only making predictions, not training.
The model's logits (output) are used to predict the sentiment by selecting the index of the maximum value using torch.argmax.

**Sentiment Prediction:**

For each review, the predicted label is mapped to a sentiment (1 is "Positive", 0 is "Negative"), and the result is printed.
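
torch.argmax yields only the winning label. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. A plain-Python sketch of that step; the logit values and the LABELS list are made up for illustration:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["Negative", "Positive"]  # index 0 and index 1, matching the IMDB labels

toy_logits = [-1.2, 2.3]  # made-up values, not real model output
probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # plays the role of argmax
print(LABELS[best], round(probs[best], 3))
```

On PyTorch tensors the equivalent conversion is torch.softmax(outputs.logits, dim=-1).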

In [None]:
# Example reviews
new_reviews = [
    "the movie was not good enough to watch",
    "i liked the movie and it was awesome",
]

# Tokenize the new reviews
inputs = tokenizer(new_reviews, padding=True, truncation=True, return_tensors="pt")

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Print predicted labels
for review, predicted in zip(new_reviews, predictions):
    sentiment = "Positive" if predicted.item() == 1 else "Negative"
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")


Review: the movie was not good enough to watch
Predicted Sentiment: Negative

Review: i liked the movie and it was awesome
Predicted Sentiment: Positive

