<a href="https://colab.research.google.com/github/b21renu/ESSENCE/blob/main/Renu_Essence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TASK - 1

```
Get the list of the most popular language models, think about pros-cons of different models if there is research on it.
Big challenge: how computationally-intensive are those models, can you run them on Google Colab? Maybe try and code them.
```

In [None]:
# Install the transformers library
!pip install transformers

In [None]:
# Install SentencePiece
!pip install sentencepiece

## T5 (Google)

In [None]:
# Import necessary libraries
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load pre-trained T5 model and tokenizer
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Input text
input_text = "Translate this English text to French: Hello, how are you?"

# Tokenize and encode the input text
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Generate output
output = model.generate(input_ids)

# Decode and print the output
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded_output)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Bonjour, comment êtes-vous?


## GPT-2 (OpenAI)

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

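# Generate text based on a prompt
prompt = "Happy"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# Beam search is deterministic: temperature, top_k and top_p only take effect with do_sample=True, so they are omitted here
output = model.generate(input_ids, max_length=120, num_beams=5, no_repeat_ngram_size=2, pad_token_id=tokenizer.eos_token_id)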

decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded_output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Happy.

I'm not going to lie, I'm really excited about this project. I've been looking forward to it for a long time, and I can't wait to share it with you all! I hope you enjoy it as much as I do!


## BERT (Google)

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax

def predict_sentiment(sentence, model, tokenizer):
    # Tokenize input sentence
    tokens = tokenizer(sentence, return_tensors='pt')

    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**tokens)

    # Apply softmax to get probabilities
    probs = softmax(outputs.logits, dim=1)

    # Predict the class with the highest probability
    prediction = torch.argmax(probs).item()

    # Return the sentiment label and probabilities
    return prediction, probs

# Load a BERT model and tokenizer fine-tuned for sentiment analysis on product reviews (predicts 1 to 5 stars)
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Example sentence for sentiment analysis
example_sentence = "I wanna die!"

# Perform sentiment analysis
prediction, probabilities = predict_sentiment(example_sentence, model, tokenizer)

# Define sentiment labels
sentiment_labels = ['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive']

# Print results
print(f"Sentence: {example_sentence}")
print(f"Predicted Sentiment: {sentiment_labels[prediction]}")
print(f"Probabilities: {probabilities}")


Sentence: I wanna die!
Predicted Sentiment: Very Negative
Probabilities: tensor([[0.5333, 0.1215, 0.0657, 0.0622, 0.2174]])


## XLNet (Google/CMU)

In [None]:
from transformers import XLNetTokenizer, XLNetLMHeadModel

# Load pre-trained XLNet model together with its own SentencePiece tokenizer
# (token ids from a GPT-2 tokenizer belong to a different vocabulary and decode to gibberish)
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

# Encode input text (without the trailing <sep>/<cls> tokens) and generate output with beam search
input_text = "good morning"
input_ids = tokenizer.encode(input_text, add_special_tokens=False, return_tensors="pt")
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2)

# Decode and print the generated output
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded_output)


## COMPUTATION INTENSITY

In [None]:
import time
import torch  # needed for torch.no_grad() in measure_inference_time
from transformers import GPT2LMHeadModel, GPT2Tokenizer, BertModel, BertTokenizer, XLNetLMHeadModel, XLNetTokenizer

def measure_tokenization_time(tokenizer, input_text):
    start_time = time.time()
    tokens = tokenizer(input_text, return_tensors='pt')
    tokenization_time = time.time() - start_time
    return tokenization_time, tokens

def measure_inference_time(model, tokens):
    start_time = time.time()
    with torch.no_grad():
        output = model(**tokens)
    inference_time = time.time() - start_time
    return inference_time

# Sample input text
input_text = "good morning"

# GPT-2 model
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_tokenization_time, gpt2_tokens = measure_tokenization_time(gpt2_tokenizer, input_text)
gpt2_inference_time = measure_inference_time(gpt2_model, gpt2_tokens)

# BERT model
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_tokenization_time, bert_tokens = measure_tokenization_time(bert_tokenizer, input_text)
bert_inference_time = measure_inference_time(bert_model, bert_tokens)

# XLNet model
xlnet_model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet_tokenization_time, xlnet_tokens = measure_tokenization_time(xlnet_tokenizer, input_text)
xlnet_inference_time = measure_inference_time(xlnet_model, xlnet_tokens)

# Print the results
print(f"GPT-2 Tokenization Time: {gpt2_tokenization_time:.4f} seconds")
print(f"GPT-2 Inference Time: {gpt2_inference_time:.4f} seconds")

print(f"BERT Tokenization Time: {bert_tokenization_time:.4f} seconds")
print(f"BERT Inference Time: {bert_inference_time:.4f} seconds")

print(f"XLNet Tokenization Time: {xlnet_tokenization_time:.4f} seconds")
print(f"XLNet Inference Time: {xlnet_inference_time:.4f} seconds")


GPT-2 Tokenization Time: 0.0004 seconds
GPT-2 Inference Time: 0.0737 seconds
BERT Tokenization Time: 0.0005 seconds
BERT Inference Time: 0.0723 seconds
XLNet Tokenization Time: 0.0004 seconds
XLNet Inference Time: 0.1654 seconds


```
SPEED: BERT / GPT-2 (fastest in the single forward pass timed above; XLNet took roughly twice as long)
DIVERSITY AND CREATIVE GENERATION: GPT-2 / XLNet
 - T5 is an encoder-decoder model with a different architecture. It could be a good choice for text summarization or translation.
    T5's inference time varies with the size of the checkpoint (t5-small is smaller than BERT-base or GPT-2) and the complexity of the task.
 - BERT is excellent for tasks requiring deep understanding of context and relationships in text.
```
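
A single timed call also includes one-off costs such as lazy initialization, so a steadier comparison usually adds warm-up calls and takes the median over several repeats. The sketch below is one way to do that with the standard library only; `benchmark` and `dummy_workload` are hypothetical names, and the stand-in workload would be replaced by a model forward pass in the notebook.

```python
import statistics
import time

def benchmark(fn, warmup=2, repeats=10):
    """Return the median wall-clock time of fn() in seconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload (hypothetical); in the notebook this would wrap model(**tokens)
def dummy_workload():
    return sum(i * i for i in range(10_000))

median_seconds = benchmark(dummy_workload)
print(f"Median time: {median_seconds:.6f} seconds")
```

For a model, the call could look like `benchmark(lambda: gpt2_model(**gpt2_tokens))` inside a `torch.no_grad()` block.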

In [None]:
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, BertModel, BertTokenizer, XLNetLMHeadModel, XLNetTokenizer, T5ForConditionalGeneration, T5Tokenizer

def measure_tokenization_time(tokenizer, input_text):
    start_time = time.time()
    tokens = tokenizer(input_text, return_tensors='pt')
    tokenization_time = time.time() - start_time
    return tokenization_time, tokens

def measure_inference_time(model, tokens):
    start_time = time.time()
    with torch.no_grad():
        output = model(**tokens)
    inference_time = time.time() - start_time
    return inference_time

# Sample input text
input_text = "Translate this English text to French: Hello, how are you?"

# GPT-2 model
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_tokenization_time, gpt2_tokens = measure_tokenization_time(gpt2_tokenizer, input_text)
gpt2_inference_time = measure_inference_time(gpt2_model, gpt2_tokens)

# BERT model
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_tokenization_time, bert_tokens = measure_tokenization_time(bert_tokenizer, input_text)
bert_inference_time = measure_inference_time(bert_model, bert_tokens)

# XLNet model
xlnet_model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet_tokenization_time, xlnet_tokens = measure_tokenization_time(xlnet_tokenizer, input_text)
xlnet_inference_time = measure_inference_time(xlnet_model, xlnet_tokens)

# T5 model
t5_model = T5ForConditionalGeneration.from_pretrained("t5-small")
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5_tokenization_time, t5_tokens = measure_tokenization_time(t5_tokenizer, input_text)

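# T5 is an encoder-decoder model, so a bare forward pass also needs decoder inputs.
# Reusing the encoder input_ids is a shortcut for timing only; it does not produce a translation.
t5_tokens['decoder_input_ids'] = t5_tokens['input_ids']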

t5_inference_time = measure_inference_time(t5_model, t5_tokens)

# Print the results
print(f"GPT-2 Tokenization Time: {gpt2_tokenization_time:.4f} seconds")
print(f"GPT-2 Inference Time: {gpt2_inference_time:.4f} seconds")
print()
print(f"BERT Tokenization Time: {bert_tokenization_time:.4f} seconds")
print(f"BERT Inference Time: {bert_inference_time:.4f} seconds")
print()
print(f"XLNet Tokenization Time: {xlnet_tokenization_time:.4f} seconds")
print(f"XLNet Inference Time: {xlnet_inference_time:.4f} seconds")
print()
print(f"T5 Tokenization Time: {t5_tokenization_time:.4f} seconds")
print(f"T5 Inference Time: {t5_inference_time:.4f} seconds")



GPT-2 Tokenization Time: 0.0006 seconds
GPT-2 Inference Time: 0.1675 seconds

BERT Tokenization Time: 0.0006 seconds
BERT Inference Time: 0.1387 seconds

XLNet Tokenization Time: 0.0004 seconds
XLNet Inference Time: 0.1571 seconds

T5 Tokenization Time: 0.0013 seconds
T5 Inference Time: 0.1100 seconds


# TASK - 2

```
Use and fine-tune different language and large language models (BERT, Huggingface transformers, etc.), compare them with
word embeddings models, and select the model with the best performance. Again, build a reproducible pipeline.
```

LANGUAGE MODELS:
```
A language model is an artificial intelligence system that learns the patterns of human language in order to understand and generate text.
1. N-gram
2. Word2Vec
3. BERT
4. T5
```
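
As a concrete instance of the simplest entry in the list, the sketch below builds a bigram (2-gram) model by counting word pairs and then estimates the probability of the next word. It is a minimal illustration, not a production model: the function names and the three-sentence corpus are made up for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) from raw counts (0.0 if prev was never seen)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# Toy corpus, invented for illustration
toy_corpus = ["good morning everyone", "good morning team", "good night everyone"]
bigram_counts = train_bigram_model(toy_corpus)

# "good" is followed by "morning" twice and "night" once
print(next_word_probability(bigram_counts, "good", "morning"))
```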

LARGE LANGUAGE MODELS:
```
A large language model is a language model with a very large number of parameters, which lets it capture subtle language nuances and perform well on a
wide range of natural language processing tasks.
1. GPT-2
2. XLNet
```

# BERT MODEL (Bidirectional Encoder Representations from Transformers):

```
Definition:
BERT is a specific transformer-based model developed by Google that has significantly advanced the field of NLP. It is known for its
bidirectional training approach, capturing contextual information from both left and right contexts in a sentence.

Usage:
BERT is a pre-trained language model that can be fine-tuned for specific NLP tasks, such as sentiment analysis,
named entity recognition, and question answering.

Strengths of BERT:
BERT excels in natural language understanding by capturing bidirectional context. Its transformer architecture parallelizes well, which makes
training on large datasets practical, and at release it reported state-of-the-art results across diverse NLP tasks. A single pre-trained
checkpoint can be fine-tuned for many different applications.

Considerations and Challenges of BERT:
Despite its effectiveness, BERT demands substantial computational resources during training and deployment, and its large model size poses
challenges for deployment in resource-constrained environments. Interpreting BERT's decision-making process can be complex due to its
attention mechanisms and parameter volume. While pre-trained on extensive data, BERT may require domain-specific fine-tuning for
optimal performance in certain specialized tasks.
```
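
One way to picture "bidirectional" is through the attention pattern: in a bidirectional encoder every token may attend to every other token, whereas a left-to-right model such as GPT-2 only lets a token see itself and earlier positions. The sketch below builds both patterns as 0/1 matrices. It is a simplified illustration with a hypothetical helper name, and it leaves out padding and the masked-token objective used in BERT's pre-training.

```python
def attention_pattern(num_tokens, bidirectional):
    """Row i lists which positions token i may attend to (1 = visible, 0 = hidden)."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

bert_style = attention_pattern(4, bidirectional=True)   # full context
gpt_style = attention_pattern(4, bidirectional=False)   # left context only

for row in bert_style:
    print(row)
print()
for row in gpt_style:
    print(row)
```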

In [None]:
# The transformers library provides pre-trained models for natural language processing tasks.
from transformers import BertTokenizer, BertForSequenceClassification
import torch  # PyTorch is used for deep learning operations.

# Load pre-trained BERT model and tokenizer.
# 'bert-base-uncased' has no fine-tuned classification head: the classifier layer is randomly
# initialized, so the predictions are not meaningful until the model is fine-tuned.
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Example text for sentiment analysis
text = "I really enjoyed watching this movie. The acting was fantastic!"

# Tokenize the input text
tokens = tokenizer(text, return_tensors='pt')

# Forward pass through the model
with torch.no_grad():
    outputs = model(**tokens)

# Get the predicted probabilities for each class
probs = torch.nn.functional.softmax(outputs.logits, dim=1)

# Get the predicted class (0 or 1 in binary classification)
predicted_class = torch.argmax(probs).item()

# Print results
print("Text:", text)
print("Predicted Class:", predicted_class)
print("Class Probabilities:", probs.tolist())


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Text: I really enjoyed watching this movie. The acting was fantastic!
Predicted Class: 0
Class Probabilities: [[0.5139232873916626, 0.48607680201530457]]


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Example sentence for classification
text = "This is a sample sentence for classification."

# Tokenize input text
inputs = tokenizer(text, return_tensors="pt")

# Make prediction using the BERT model
outputs = model(**inputs)
logits = outputs.logits

# Convert logits to probabilities using softmax
probabilities = torch.nn.functional.softmax(logits, dim=1)

print("Predicted probabilities:", probabilities)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Predicted probabilities: tensor([[0.3476, 0.6524]], grad_fn=<SoftmaxBackward0>)


# Hugging Face Transformers:

```
Definition:
Hugging Face Transformers is an open-source library developed by Hugging Face, a company that specializes in
natural language processing (NLP) and machine learning.

Usage:
Developers and researchers use the Hugging Face platform to access a variety of pre-trained models, including BERT, and to leverage tools and
resources for NLP tasks. The library simplifies the process of working with and deploying state-of-the-art language models.
```

In [None]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from torch.nn.functional import softmax
import torch

def predict_sentiment(sentence, model, tokenizer):
    # Tokenize input sentence
    tokens = tokenizer(sentence, return_tensors='pt')

    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**tokens)

    # Apply softmax to get probabilities
    probs = softmax(outputs.logits, dim=1)

    # Predict the class with the highest probability
    prediction = torch.argmax(probs).item()

    # Return the sentiment label and probabilities
    return prediction, probs

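# Load pre-trained RoBERTa model and tokenizer.
# 'roberta-base' is not fine-tuned for sentiment: its classification head is randomly initialized
# (2 labels by default), so the prediction is not meaningful without fine-tuning.
model_name = "roberta-base"
model = RobertaForSequenceClassification.from_pretrained(model_name)
tokenizer = RobertaTokenizer.from_pretrained(model_name)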

# Example sentence for sentiment analysis
example_sentence = "Hugging Face Transformers is amazing!"

# Perform sentiment analysis
prediction, probabilities = predict_sentiment(example_sentence, model, tokenizer)

# Define sentiment labels (one per output class; the model has 2 by default)
sentiment_labels = ['Negative', 'Positive']

# Print results
print(f"Sentence: {example_sentence}")
print(f"Predicted Sentiment: {sentiment_labels[prediction]}")
print(f"Probabilities: {probabilities}")


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: Hugging Face Transformers is amazing!
Predicted Sentiment: Negative
Probabilities: tensor([[0.5238, 0.4762]])


In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax
import torch

def predict_sentiment(sentence, model, tokenizer):
    # Tokenize input sentence
    tokens = tokenizer(sentence, return_tensors='pt')

    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**tokens)

    # Apply softmax to get probabilities
    probs = softmax(outputs.logits, dim=1)

    # Predict the class with the highest probability
    prediction = torch.argmax(probs).item()

    # Return the sentiment label and probabilities
    return prediction, probs

# Load pre-trained BERT model and tokenizer for sentiment analysis
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Example sentence for sentiment analysis
example_sentence = "I love using Hugging Face Transformers for NLP tasks!"

# Perform sentiment analysis
prediction, probabilities = predict_sentiment(example_sentence, model, tokenizer)

# Define sentiment labels
sentiment_labels = ['Very Negative', 'Negative', 'Neutral', 'Positive', 'Very Positive']

# Print results
print(f"Sentence: {example_sentence}")
print(f"Predicted Sentiment: {sentiment_labels[prediction]}")
print(f"Probabilities: {probabilities}")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Text: I really enjoyed watching this movie. The acting was fantastic!
Sentiment Label: POSITIVE
Sentiment Score: 0.9998703002929688


# What is the difference between BERT and word embeddings?  
```
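Word2Vec embeddings do not take the word's position into account: each word gets one fixed vector. The BERT model explicitly takes the
position (index) of each word in the sentence as input before calculating its embedding, so the same word can be represented differently in different sentences.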
```
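
The difference can be sketched with a toy example: a static lookup returns the same vector for a word wherever it occurs, while adding a per-position vector to the token vector (as BERT does at its input layer) makes the two occurrences differ. The tables below are hypothetical 3-dimensional values invented for illustration; real models learn much larger ones.

```python
# Hypothetical toy tables, invented for illustration only
token_vectors = {"bank": [0.5, 0.1, 0.3], "river": [0.2, 0.7, 0.1]}
position_vectors = [[0.0, 0.0, 0.1], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]

def static_embedding(tokens):
    """Word2Vec-style lookup: one fixed vector per word, position ignored."""
    return [token_vectors[tok] for tok in tokens]

def position_aware_input(tokens):
    """Token vector plus position vector, so the index of the word matters."""
    return [
        [t + p for t, p in zip(token_vectors[tok], position_vectors[i])]
        for i, tok in enumerate(tokens)
    ]

tokens = ["bank", "river", "bank"]
static = static_embedding(tokens)
positional = position_aware_input(tokens)

print(static[0] == static[2])          # True: same word, same vector
print(positional[0] == positional[2])  # False: same word, different positions
```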

# Word2Vec Vs Word Embeddings
```
Word2vec is a two-layer neural network used to generate distributed representations of words called word embeddings.
1. Improved performance compared to traditional bag-of-words features.
2. Ability to learn from unlabeled data and reduce the dimension of the feature space.
EXAMPLE: KING - MAN + WOMAN = QUEEN

Word embeddings are a collection of numerical vectors (embeddings) that represent words; Word2Vec is one method for producing them.
```
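
The analogy can be reproduced with plain vector arithmetic and cosine similarity. The sketch below uses hand-picked 3-dimensional vectors (roughly "royalty", "male", "female"), which are hypothetical and chosen so the arithmetic is easy to follow; trained Word2Vec vectors have many more dimensions and no such readable axes.

```python
import math

# Hypothetical vectors (royalty, male, female), hand-picked for illustration
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])]
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, toy_vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```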


# Word2Vec:
```
It is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about
the meaning of the word and its usage in context. The word2vec algorithm estimates these representations by modeling text in a large corpus.

Strengths:
Word2Vec is computationally less intensive than transformer models like BERT. It is effective for capturing semantic relationships between
words and can be useful in tasks that involve understanding word similarity and analogy.

Considerations:
Word2Vec is context-agnostic and may not capture complex contextual information as effectively as BERT. It may not perform as well on
tasks that require a deep understanding of context and dependencies.
```
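
To make "modeling text in a large corpus" more concrete: in the skip-gram variant of Word2Vec, the training data consists of (center word, context word) pairs taken from a sliding window over the text. The sketch below only generates those pairs and does not train anything; the function name and the `window` default are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sample_tokens = ["word", "embeddings", "are", "awesome"]
for pair in skipgram_pairs(sample_tokens, window=1):
    print(pair)
```

The gensim `window=5` argument used in the training cell plays the same role as `window` here.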


In [None]:
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Sample corpus
corpus = "Word embeddings are awesome. They capture semantic relationships."

# Tokenize the corpus
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus.split(".")]

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)

# Access word embeddings
word_embedding = model.wv['word']
print(word_embedding)

[ 8.1681199e-03 -4.4430327e-03  8.9854337e-03  8.2536647e-03
 -4.4352221e-03  3.0310510e-04  4.2744912e-03 -3.9263200e-03
 -5.5599655e-03 -6.5123225e-03 -6.7073823e-04 -2.9592158e-04
  4.4630850e-03 -2.4740540e-03 -1.7260908e-04  2.4618758e-03
  4.8675989e-03 -3.0808449e-05 -6.3394094e-03 -9.2608072e-03
  2.6657581e-05  6.6618943e-03  1.4660227e-03 -8.9665223e-03
 -7.9386048e-03  6.5519023e-03 -3.7856805e-03  6.2549924e-03
 -6.6810320e-03  8.4796622e-03 -6.5163244e-03  3.2880199e-03
 -1.0569858e-03 -6.7875278e-03 -3.2875966e-03 -1.1614120e-03
 -5.4709399e-03 -1.2113475e-03 -7.5633135e-03  2.6466595e-03
  9.0701487e-03 -2.3772502e-03 -9.7651005e-04  3.5135616e-03
  8.6650876e-03 -5.9218528e-03 -6.8875779e-03 -2.9329848e-03
  9.1476962e-03  8.6626766e-04 -8.6784009e-03 -1.4469790e-03
  9.4794659e-03 -7.5494875e-03 -5.3580985e-03  9.3165627e-03
 -8.9737261e-03  3.8259076e-03  6.6544057e-04  6.6607012e-03
  8.3127534e-03 -2.8507852e-03 -3.9923131e-03  8.8979173e-03
  2.0896459e-03  6.24894

----
----
# WORK IN PROGRESS -->>

# Build a reproducible pipeline

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of the PyTorch one
from torch.utils.data import DataLoader, TensorDataset, random_split
from torch.nn import functional as F
from torch.utils.tensorboard import SummaryWriter
import torch
import os

# Load and preprocess data
def load_and_preprocess_data(data_path):
    df = pd.read_csv(data_path)
    # Assume the dataset has 'text' and 'label' columns
    return df

# Tokenize and prepare data for training
def tokenize_and_prepare_data(df, max_length=128, batch_size=32):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    encoded_data = tokenizer(df['text'].tolist(), truncation=True, padding='max_length', max_length=max_length, return_tensors='pt', return_attention_mask=True)

    labels = torch.tensor(df['label'].tolist())

    dataset = TensorDataset(encoded_data['input_ids'], encoded_data['attention_mask'], labels)

    # Split dataset into training and validation sets
    train_size = int(0.8 * len(dataset))
    val_size = len(dataset) - train_size
    train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

    # Create DataLoader
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    return train_dataloader, val_dataloader

# Model training
def train_model(model, train_dataloader, val_dataloader, epochs=3, lr=2e-5):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    optimizer = AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model.train()
        for batch in train_dataloader:
            input_ids, attention_mask, labels = batch
            input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask=attention_mask)
            loss = criterion(outputs.logits, labels)
            loss.backward()
            optimizer.step()

        # Validation
        model.eval()
        val_loss = 0.0
        correct_predictions = 0
        total_samples = 0

        with torch.no_grad():
            for batch in val_dataloader:
                input_ids, attention_mask, labels = batch
                input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

                outputs = model(input_ids, attention_mask=attention_mask)
                loss = criterion(outputs.logits, labels)
                val_loss += loss.item()

                _, predicted = torch.max(outputs.logits, 1)
                correct_predictions += (predicted == labels).sum().item()
                total_samples += labels.size(0)

        accuracy = correct_predictions / total_samples
        avg_val_loss = val_loss / len(val_dataloader)

        print(f'Epoch {epoch + 1}/{epochs} - Val Loss: {avg_val_loss:.4f} - Val Accuracy: {accuracy:.4f}')

# Example usage
if __name__ == "__main__":
    # Set random seed for reproducibility
    seed = 42
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ['PYTHONHASHSEED'] = str(seed)

    # Load and preprocess data
    data_path = "path/to/your/dataset.csv"
    df = load_and_preprocess_data(data_path)

    # Tokenize and prepare data
    train_dataloader, val_dataloader = tokenize_and_prepare_data(df)

    # Initialize BERT model for sequence classification
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

    # Train the model
    train_model(model, train_dataloader, val_dataloader)
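
For reproducibility it can help to collect all seeding in one helper, since Python's `random`, NumPy and PyTorch each keep their own random state. The sketch below is one possible version; `set_seed` is a hypothetical helper name, and NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random

def set_seed(seed=42):
    """Seed every random number generator that is available in this environment."""
    random.seed(seed)
    # Only affects Python processes started after this point, not the running interpreter
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed

# Re-seeding reproduces the same random draw
set_seed(42)
first_draw = random.random()
set_seed(42)
second_draw = random.random()
print(first_draw == second_draw)  # True
```

Calling the helper once at the top of the pipeline, before the data split and model initialization, keeps the train/validation split and the classifier weights the same across runs.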


----
----
# BERT MODEL Vs TinyBERT
### PRE TRAINED BERT MODEL

In [None]:
!pip install transformers

In [None]:
from transformers import BertTokenizer, BertModel
import torch
import time

model_name = "bert-base-uncased"

# Load pre-trained BERT model and tokenizer
start_time_loading = time.time()
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
end_time_loading = time.time()

print(f"Time taken for loading: {end_time_loading - start_time_loading} seconds")

# Example sentences
sentences = ["This is an example sentence.", "Each sentence is converted."]

# Tokenize input sentences
start_time_processing = time.time()
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Forward pass through the model
with torch.no_grad():
    outputs = model(**inputs)

# Extract sentence embeddings from the output
sentence_embeddings = outputs.last_hidden_state.mean(dim=1)
end_time_processing = time.time()

print("BERT Sentence Embeddings:")
print(sentence_embeddings)
print(f"Time taken for processing: {end_time_processing - start_time_processing} seconds")


Time taken for loading: 2.442431926727295 seconds
BERT Sentence Embeddings:
tensor([[-0.1624, -0.4157, -0.2049,  ..., -0.2625,  0.1507,  0.3525],
        [ 0.0124,  0.0350,  0.1593,  ..., -0.0574, -0.0018,  0.2290]])
Time taken for processing: 0.3050954341888428 seconds


## FINE TUNING BERT MODEL

In [None]:
!pip uninstall -y transformers accelerate

In [None]:
!pip install transformers

In [None]:
!pip install accelerate

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import torch

In [None]:
# Example labeled dataset (replace this with your own)
labels = [1, 0]  # Binary labels
texts = ["This is a positive example.", "This is a negative example."]

# Split the dataset into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

In [None]:
# Load pre-trained BERT model and tokenizer for sequence classification
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Adjust num_labels for your task

# Tokenize and format the dataset
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Create PyTorch datasets
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = MyDataset(train_encodings, train_labels)
val_dataset = MyDataset(val_encodings, val_labels)

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert_fine_tuned_model",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_dir="./logs",
    logging_steps=500,
    save_total_limit=2,
    learning_rate=2e-5,
)


In [None]:
import time
from transformers import Trainer, TrainingArguments

# Assume `model`, `training_args`, `train_dataset`, and `val_dataset` are defined

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Record start time
start_time = time.time()

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

# Record end time
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

# Print the elapsed time
print(f"Fine-tuning took {elapsed_time:.2f} seconds")


Epoch,Training Loss,Validation Loss
1,No log,0.513373
2,No log,0.643305
3,No log,0.711918


Fine-tuning took 15.85 seconds


In [None]:
# Number of optimizer steps taken during training (trainer.state.global_step counts steps, not epochs)
num_steps_or_epochs = trainer.state.global_step

# Calculate efficiency metrics
time_per_step = elapsed_time / num_steps_or_epochs
time_per_epoch = elapsed_time / training_args.num_train_epochs

# Print efficiency metrics
print(f"Time per training step: {time_per_step:.2f} seconds")
print(f"Time per epoch: {time_per_epoch:.2f} seconds")

Time per training step: 5.28 seconds
Time per epoch: 5.28 seconds
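
The two figures are identical because each epoch consisted of a single optimizer step. A minimal sketch of that arithmetic, assuming the 80/20 split of the two-example dataset leaves one training example (the helper name `steps_per_epoch` is illustrative, not part of the notebook):

```python
import math

def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps in one epoch (a partial last batch still counts)."""
    return math.ceil(num_examples / batch_size)

num_train_examples = 1   # assumption: 2 examples with test_size=0.2 leaves 1 for training
batch_size = 8           # per_device_train_batch_size used above
epochs = 3               # num_train_epochs used above
elapsed_time = 15.85     # seconds, taken from the recorded output

total_steps = epochs * steps_per_epoch(num_train_examples, batch_size)
time_per_step = elapsed_time / total_steps
time_per_epoch = elapsed_time / epochs

print(f"{total_steps} steps, {time_per_step:.2f} s/step, {time_per_epoch:.2f} s/epoch")
# prints: 3 steps, 5.28 s/step, 5.28 s/epoch
```

With a realistic dataset the two numbers diverge, since one epoch then spans many steps.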


# Tiny BERT MODEL
### PRE TRAINED Tiny BERT MODEL

In [None]:
!pip install -U sentence-transformers

In [None]:
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer
import torch
import time

In [None]:
sentences = ["This is an example sentence", "Each sentence is converted"]

start_time = time.time()

model = SentenceTransformer('sentence-transformers/paraphrase-TinyBERT-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)

[[-0.09970594  0.27598232 -0.09934346 ...  0.27487376 -0.59598863
   0.16181661]
 [-0.2668063  -0.32326776 -0.28500086 ... -0.03816248 -0.3401403
   0.29823408]]


In [None]:
#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-TinyBERT-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-TinyBERT-L6-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

end_time = time.time()
time_tinybert = end_time - start_time

print("Sentence embeddings:")
print(sentence_embeddings)
print(f"Computation Time (TinyBERT): {time_tinybert} seconds\n")

Sentence embeddings:
tensor([[-0.0997,  0.2760, -0.0993,  ...,  0.2749, -0.5960,  0.1618],
        [-0.2668, -0.3233, -0.2850,  ..., -0.0382, -0.3401,  0.2982]])
Computation Time (TinyBERT): 4.1196558475494385 seconds
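
The `mean_pooling` function above averages only the token vectors whose attention-mask entry is 1, so padding does not dilute the sentence embedding. The same idea can be sketched in plain Python on a toy input (the function name `masked_mean` and the numbers are made up for illustration):

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Guard against an all-zero mask, mirroring torch.clamp(..., min=1e-9) above
    return [t / max(count, 1e-9) for t in total]

tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row stands in for a padding token
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the padding row would pull the average far away from the two real tokens.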



## FINE TUNING Tiny BERT MODEL

In [None]:
# token = '<redacted>' -> write  (keep Hugging Face tokens in Colab secrets as HF_TOKEN, not in the notebook)

In [None]:
!pip install torch

In [None]:
!pip install transformers

In [None]:
!pip install sentence-transformers

In [None]:
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer instead
import torch
import time

In [None]:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-TinyBERT-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-TinyBERT-L6-v2')

In [None]:
model = SentenceTransformer('sentence-transformers/paraphrase-TinyBERT-L6-v2')

In [None]:
# Sentences we want to encode
train_texts = ["sentence 1", "sentence 2", ...]
embeddings = model.encode(train_texts)
print(embeddings)
print(type(train_texts))

[[-0.48658934 -0.12511948  0.11874049 ... -0.00476595 -0.3784158
  -0.02270076]
 [-0.35696122 -0.155866   -0.02816081 ... -0.05049503 -0.319653
  -0.02979056]
 [-0.06691432  0.2567988  -0.27058375 ... -0.38106966  0.30398515
   0.09594714]]
<class 'list'>


In [None]:
# Encode sentences using batch_encode_plus
encoded_train = tokenizer.batch_encode_plus(
    sentences,
    padding=True,
    truncation=True,
    return_tensors='pt'
)

# Print the encoded result
print(encoded_train)

{'input_ids': tensor([[ 101, 2023, 2003, 2019, 2742, 6251,  102],
        [ 101, 2169, 6251, 2003, 4991,  102,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}
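
In the output above, the second sentence is one token shorter, so it is padded with `0` and its attention mask ends in `0`. A rough pure-Python sketch of what `padding=True` does to a batch (the helper `pad_batch` is hypothetical, and the pad id of 0 is read off the printed tensors):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask

# Token ids copied from the printed output, without the padding
ids = [
    [101, 2023, 2003, 2019, 2742, 6251, 102],
    [101, 2169, 6251, 2003, 4991, 102],
]
input_ids, attention_mask = pad_batch(ids)
print(input_ids[1])       # [101, 2169, 6251, 2003, 4991, 102, 0]
print(attention_mask[1])  # [1, 1, 1, 1, 1, 1, 0]
```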


In [None]:
train_labels = [label for label in train_labels if label is not Ellipsis]

In [None]:
# If train_labels is a list or tuple, convert it to a tensor
if isinstance(train_labels, (list, tuple)):
    train_labels = torch.tensor(train_labels, dtype=torch.long)

# If train_labels is already a tensor, check its dtype
elif isinstance(train_labels, torch.Tensor):
    if train_labels.dtype != torch.long:
        train_labels = train_labels.to(dtype=torch.long)

# If train_labels is not a list, tuple, or tensor, raise an error
else:
    raise TypeError("train_labels must be a list, tuple, or torch.Tensor")

In [None]:
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(
    encoded_train['input_ids'],
    encoded_train['attention_mask'],
    train_labels  # already a torch.long tensor after the previous cell
)

batch_size = 2
train_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


In [None]:
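# NOTE: this loads bert-base-uncased rather than the TinyBERT checkpoint used above,
# so the losses and timings recorded below are for BERT-base.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")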
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)

epochs = 3
start_time = time.time()

for epoch in range(epochs):
    model.train()
    total_loss = 0

    for batch in train_dataloader:
        input_ids, attention_mask, batch_labels = [t.to(device) for t in batch]

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(train_dataloader)
    print(f'Epoch {epoch + 1}/{epochs}, Average Loss: {avg_loss}')


Epoch 1/3, Average Loss: 0.37068894505500793
Epoch 2/3, Average Loss: 0.22787418961524963
Epoch 3/3, Average Loss: 0.21021877229213715


In [None]:
end_time = time.time()
computation_time = end_time - start_time
print(f"Total Computation Time: {computation_time} seconds")

Total Computation Time: 10.950398206710815 seconds


In [None]:
model.save_pretrained('fine_tuned_tinybert')
tokenizer.save_pretrained('fine_tuned_tinybert')

('fine_tuned_tinybert/tokenizer_config.json',
 'fine_tuned_tinybert/special_tokens_map.json',
 'fine_tuned_tinybert/vocab.txt',
 'fine_tuned_tinybert/added_tokens.json',
 'fine_tuned_tinybert/tokenizer.json')

# OBSERVATION
```
COMPUTATION TIME: TinyBERT
ACCURACY: BERT
COMPUTATION POWER: TinyBERT
```

----
----
# REAL DATA
```
Now, let’s try the models you worked with on our real data. Could you please use those models - T5, GPT, Bert, XLNet, TinyBert - and
1) Compute the average tokenization time of a task title across the dataset, i.e., the average speed of sentence tokenization;
2) Fine-tune the models on the dataset and see what the results are? Let’s take the "task type" classes. Mind that your train-val split
should include all classes, so something like StratifiedSplit from sklearn would help.

In our dataset (https://drive.google.com/file/d/1cwhVDiBWDu5xUwYDLfVTClpqzETcJaEF/view?usp=sharing) there are also ChatGPT-generated tasks.
I think you could drop them from the dataset for the start.
```

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

In [None]:
path = '/content/drive/MyDrive/SEM-4/ESSENCE/dataset-20240111.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,Task name,Type,Intensity,Generated
0,spa,Rest & self-care,Low / Medium,0.0
1,doctor,Rest & self-care,Low / Medium,0.0
2,read,Rest & self-care,Low / Medium,0.0
3,therapist,Rest & self-care,High,0.0
4,meditate,Rest & self-care,High,0.0


In [None]:
df.isnull().any()

Task name    False
Type         False
Intensity    False
Generated    False
dtype: bool

In [None]:
result = df[df['Generated'] == 1]
result.count()
# print(result)

Task name    398
Type         398
Intensity    398
Generated    398
dtype: int64

In [None]:
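# Drop ChatGPT-generated rows and keep a contiguous 0..n-1 index so df.loc[random_index] stays valid
df = df[df['Generated'] != 1].reset_index(drop=True)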
print(df)

                         Task name              Type     Intensity  Generated
0                              spa  Rest & self-care  Low / Medium        0.0
1                           doctor  Rest & self-care  Low / Medium        0.0
2                             read  Rest & self-care  Low / Medium        0.0
3                        therapist  Rest & self-care          High        0.0
4                         meditate  Rest & self-care          High        0.0
..                             ...               ...           ...        ...
313                Pack a suitcase    Other Personal  Low / Medium        0.0
314  Register kids for soccer camp    Other Personal          High        0.0
315                     Gardening     Other Personal          High        0.0
316                    buy a phone    Other Personal          High        0.0
317                fix the scooter    Other Personal          High        0.0

[318 rows x 4 columns]


```
Use the models - T5, GPT, Bert, XLNet, TinyBert - and
1) Compute the average tokenization time of a task title across the dataset i.e. an average speed of sentence tokenization

Tokenization time refers to the time it takes to break down a given piece of text into individual units, often referred to as tokens.
Tokens can be words, subwords, or characters, depending on the level of granularity chosen for the tokenization process.
The duration for tokenization depends on various factors such as the size of the text, the complexity of the
tokenization process, and the efficiency of the tokenization algorithm
```
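
One way to measure this consistently for every model is to time only the tokenizer call and average over the titles. The sketch below assumes a generic callable and uses whitespace splitting as a stand-in tokenizer; `average_tokenization_time` and the `warmup` parameter are illustrative names, not part of the notebook:

```python
import time

def average_tokenization_time(tokenize, sentences, warmup=1):
    """Mean wall-clock seconds per call to `tokenize`, after untimed warm-up passes."""
    for _ in range(warmup):
        for sentence in sentences:
            tokenize(sentence)
    start = time.perf_counter()
    for sentence in sentences:
        tokenize(sentence)
    return (time.perf_counter() - start) / len(sentences)

# Stand-in tokenizer (whitespace split); a Hugging Face `tokenizer.encode` could be passed instead
titles = ["buy aroma candles", "morning routine", "fix the sink"]
avg = average_tokenization_time(str.split, titles)
print(f"Average tokenization time: {avg:.8f} seconds")
```

`time.perf_counter` is used here because it has higher resolution than `time.time` for sub-millisecond spans, and the warm-up pass keeps one-off initialization costs out of the average.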

# T5

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
import time
import random

# Load pre-trained T5 model and tokenizer
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# Input text
random_index = random.randint(0, len(df) - 1)
random_sentence = df.loc[random_index, 'Task name']

print(f"Randomly selected sentence: {random_index} - {random_sentence}")

# Measure tokenization time
start_time = time.time()

# Tokenize and encode the input text
input_ids = tokenizer.encode(random_sentence, return_tensors="pt")

end_time = time.time()

# Calculate tokenization time
tokenization_time = end_time - start_time
print(f"Tokenization time: {tokenization_time} seconds")


Randomly selected sentence: 71 - buy aroma candles
Tokenization time: 0.0009737014770507812 seconds


In [None]:
# Measure tokenization time for each sentence
tokenization_times = []

for sentence in df['Task name']:
    start_time = time.time()

    # Tokenize and encode the input text
    input_ids = tokenizer.encode(sentence, return_tensors="pt")

    end_time = time.time()

    # Calculate tokenization time and append to the list
    tokenization_time = end_time - start_time
    tokenization_times.append(tokenization_time)

# Calculate average tokenization time
average_tokenization_time = sum(tokenization_times) / len(tokenization_times)

print(f"Average Tokenization Time: {average_tokenization_time} seconds")

Average Tokenization Time: 0.00011271800634995946 seconds


# GPT

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
# Generate text based on a random sentence from the DataFrame
random_index = random.randint(0, len(df) - 1)
random_sentence = df.loc[random_index, 'Task name']
print(f"Randomly selected sentence: {random_index} - {random_sentence}")

# Measure tokenization time
start_time = time.time()

# Tokenize and encode the input text
input_ids = tokenizer.encode(random_sentence, return_tensors="pt")

end_time = time.time()

# Generate text based on the tokenized input.
# Beam search only: temperature, top_k, and top_p were dropped because they have no effect unless do_sample=True.
output = model.generate(input_ids, max_length=120, num_beams=5, no_repeat_ngram_size=2)

decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded_output)

# Calculate tokenization time
tokenization_time = end_time - start_time
print(f"Tokenization Time: {tokenization_time} seconds")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Randomly selected sentence: 58 - morning routine
morning routine.

"I'm not going to lie to you," he said. "I don't think I've ever done anything like this before."
Tokenization Time: 0.0025641918182373047 seconds


In [None]:
# Initialize an empty list to store tokenization times
tokenization_times = []

# Tokenize each sentence in the DataFrame
for sentence in df['Task name']:
    # Measure tokenization time
    start_time = time.time()

    # Tokenize and encode the input text
    input_ids = tokenizer.encode(sentence, return_tensors="pt")

    end_time = time.time()

    # Calculate tokenization time and append to the list
    tokenization_time = end_time - start_time
    tokenization_times.append(tokenization_time)

# Calculate average tokenization time
average_tokenization_time = sum(tokenization_times) / len(tokenization_times)
print(f"Average Tokenization Time: {average_tokenization_time} seconds")


Average Tokenization Time: 0.00017567250713612298 seconds


# BERT

In [None]:
from transformers import BertTokenizer, BertModel

model_name = "bert-base-uncased"

# Load pre-trained BERT model and tokenizer
start_time_loading = time.time()
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
end_time_loading = time.time()
print(f"Time taken for loading: {end_time_loading - start_time_loading} seconds")

Time taken for loading: 5.770026922225952 seconds


In [None]:
# Example sentences
random_index = random.randint(0, len(df) - 1)
random_sentence = df.loc[random_index, 'Task name']
print(f"Randomly selected sentence {random_index}: {random_sentence} ")

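# Tokenize input sentences
# NOTE: the timed span below covers tokenization, the forward pass, and pooling,
# so the reported value is not a pure tokenization time.
start_time = time.time()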
inputs = tokenizer(random_sentence, padding=True, truncation=True, return_tensors="pt")

# Forward pass through the model
with torch.no_grad():
    outputs = model(**inputs)

# Extract sentence embeddings from the output
sentence_embeddings = outputs.last_hidden_state.mean(dim=1)
end_time = time.time()

# print("BERT Sentence Embeddings:")
# print(sentence_embeddings)

# Calculate tokenization time
tokenization_time = end_time - start_time
print(f"Tokenization Time: {tokenization_time} seconds")

Randomly selected sentence 175: fix the sink 
Tokenization Time: 0.09710907936096191 seconds


In [None]:
# Initialize an empty list to store tokenization times
tokenization_times = []

# Tokenize each sentence in the DataFrame
for sentence in df['Task name']:
    # Measure elapsed time (covers tokenization, the forward pass, and pooling)
    start_time = time.time()
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")

    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract sentence embeddings from the output
    sentence_embeddings = outputs.last_hidden_state.mean(dim=1)
    end_time = time.time()

    # Calculate tokenization time and append to the list
    tokenization_time = end_time - start_time
    tokenization_times.append(tokenization_time)

# Calculate average tokenization time
average_tokenization_time = sum(tokenization_times) / len(tokenization_times)
print(f"Average Tokenization Time: {average_tokenization_time} seconds")

Average Tokenization Time: 0.09176518707155432 seconds


# XLNet

In [None]:
from transformers import GPT2Tokenizer, XLNetLMHeadModel

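# Load pre-trained XLNet model and use GPT2Tokenizer
# NOTE: the GPT-2 tokenizer does not share XLNet's vocabulary, so the generated text below is
# garbled and the timings measure GPT-2 tokenization; XLNetTokenizer is the matching tokenizer.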
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
# Encode input text and generate output
random_index = random.randint(0, len(df) - 1)
random_sentence = df.loc[random_index, 'Task name']
print(f"Randomly selected sentence {random_index}: {random_sentence} ")

# Measure tokenization time
start_time = time.time()

# Tokenize and encode the input text
input_ids = tokenizer.encode(random_sentence, return_tensors="pt")

end_time = time.time()

# Beam search only: top_k and top_p were dropped because they have no effect unless do_sample=True.
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2)

# Decode and print the generated output
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded_output)

# Calculate tokenization time
tokenization_time = end_time - start_time

print(f"Tokenization Time: {tokenization_time} seconds")

Randomly selected sentence 218: fill out the conference form QA sis 




fill out the conference form QA sis withoutQ2+2242,2*2�2.2]2-282�2s2�2;2t2 expected2early2abouts2 Today British2
Tokenization Time: 0.0021386146545410156 seconds


In [None]:
# Initialize an empty list to store tokenization times
tokenization_times = []

# Tokenize each sentence in the DataFrame
for sentence in df['Task name']:
    # Measure tokenization time
    start_time = time.time()

    # Tokenize and encode the input text
    input_ids = tokenizer.encode(sentence, return_tensors="pt")

    end_time = time.time()

    # Calculate tokenization time
    tokenization_time = end_time - start_time

    # Append tokenization time to the list
    tokenization_times.append(tokenization_time)

# Calculate average tokenization time
average_tokenization_time = sum(tokenization_times) / len(tokenization_times)
print(f"Average Tokenization Time: {average_tokenization_time} seconds")

Average Tokenization Time: 0.00022623898848047797 seconds


# TinyBERT

In [None]:
!pip install -U sentence-transformers

In [None]:
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

In [None]:
# Randomly select a sentence
random_index = random.randint(0, len(df) - 1)
random_sentence = df.loc[random_index, 'Task name']
print(f"Randomly selected sentence {random_index}: {random_sentence} ")

# Measure tokenization time
# NOTE: the timed span also covers loading the model and tokenizer and a full model.encode call
start_tokenization_time = time.time()

# Encode sentence using SentenceTransformer
model = SentenceTransformer('sentence-transformers/paraphrase-TinyBERT-L6-v2')
embeddings = model.encode(random_sentence)

# Tokenize sentences using Hugging Face tokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-TinyBERT-L6-v2')
encoded_input = tokenizer(random_sentence, padding=True, truncation=True, return_tensors='pt')

end_tokenization_time = time.time()
tokenization_time = end_tokenization_time - start_tokenization_time

print(f"Tokenization Time: {tokenization_time} seconds")

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-TinyBERT-L6-v2')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# print("Sentence embeddings:")
# print(sentence_embeddings)

end_time = time.time()
time_tinybert = end_time - start_tokenization_time  # Measured from the start of the tokenization timer, so loading is included

print(f"Computation Time (TinyBERT): {time_tinybert} seconds\n")


Randomly selected sentence 259: English 
Tokenization Time: 0.8371336460113525 seconds
Computation Time (TinyBERT): 1.1602320671081543 seconds



In [None]:
# Function for mean pooling
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/paraphrase-TinyBERT-L6-v2')

# Load the Hugging Face tokenizer
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-TinyBERT-L6-v2')

# Measure average tokenization time
total_tokenization_time = 0
num_sentences = len(df)

for sentence in df['Task name']:
    # NOTE: the timed span covers model.encode (a full forward pass) as well as tokenization
    start_tokenization_time = time.time()

    # Encode sentence using SentenceTransformer
    embeddings = model.encode(sentence)

    # Tokenize sentences using Hugging Face tokenizer
    encoded_input = tokenizer(sentence, padding=True, truncation=True, return_tensors='pt')

    end_tokenization_time = time.time()
    tokenization_time = end_tokenization_time - start_tokenization_time
    total_tokenization_time += tokenization_time

# Calculate average tokenization time
average_tokenization_time = total_tokenization_time / num_sentences
print(f"Average Tokenization Time: {average_tokenization_time} seconds")


Average Tokenization Time: 0.05756535844982795 seconds


```
2) Fine-tune the models on the dataset and see what the results are? Let’s take the "task type" classes. Mind that your train-val split
should include all classes, so something like StratifiedSplit from sklearn would help.
```
```
StratifiedShuffleSplit from scikit-learn: This ensures that the distribution of classes in both the training and validation sets
remains similar to that of the original dataset.
```
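
Whichever splitter is used, it is worth confirming that every class actually lands in both splits. A small standard-library sketch of such a check follows; `split_report` is a hypothetical helper, and the toy labels reuse two of the `Type` values shown in the dataset preview:

```python
from collections import Counter

def split_report(labels, train_idx, val_idx):
    """Per-class counts in each split, plus any classes absent from either split."""
    train_counts = Counter(labels[i] for i in train_idx)
    val_counts = Counter(labels[i] for i in val_idx)
    classes = set(labels)
    missing = {
        "train": sorted(classes - set(train_counts)),
        "val": sorted(classes - set(val_counts)),
    }
    return train_counts, val_counts, missing

# Toy data: 5 examples per class, split 80/20 with one validation example per class
labels = ["Rest & self-care"] * 5 + ["Other Personal"] * 5
train_idx = [0, 1, 2, 3, 5, 6, 7, 8]
val_idx = [4, 9]
train_counts, val_counts, missing = split_report(labels, train_idx, val_idx)
print(missing)  # {'train': [], 'val': []}
```

The index arrays returned by `StratifiedShuffleSplit.split` can be passed straight into a check like this; a non-empty `missing` list means the validation metrics would silently ignore a class.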

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertModel
import torch
import time
import random

In [None]:
path = '/content/drive/MyDrive/RVU S4/INTERNSHIP/dataset-20240111.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,Task name,Type,Intensity,Generated
0,spa,Rest & self-care,Low / Medium,0.0
1,doctor,Rest & self-care,Low / Medium,0.0
2,read,Rest & self-care,Low / Medium,0.0
3,therapist,Rest & self-care,High,0.0
4,meditate,Rest & self-care,High,0.0


In [None]:
del df['Generated']

In [None]:
df.isnull().any()

Task name    False
Type         False
Intensity    False
dtype: bool

# T5

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data import Dataset, DataLoader
from transformers import T5ForConditionalGeneration, T5Tokenizer
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer instead
import torch
import numpy as np
import pandas as pd
import time
import random

In [None]:
# Load pre-trained T5 model and tokenizer
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# Prepare data
X = df['Task name'].values
y = df['Type'].values

In [None]:
# Split data into train and validation sets
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, val_index = next(sss.split(X, y))

X_train, X_val = X[train_index], X[val_index]
y_train, y_val = y[train_index], y[val_index]

In [None]:
# Define a custom dataset class
class CustomDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        inputs = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt"
        )
        input_ids = inputs["input_ids"].squeeze()
        attention_mask = inputs["attention_mask"].squeeze()
        return input_ids, attention_mask

In [None]:
# Create instances of custom dataset class for training and validation sets
train_dataset = CustomDataset(X_train, tokenizer)
val_dataset = CustomDataset(X_val, tokenizer)

In [None]:
# Define hyperparameters
learning_rate = 1e-4
batch_size = 8
epochs = 3

In [None]:
# Define optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Define data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)



In [None]:
# Define the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Fine-tune the model
# NOTE: labels=input_ids trains T5 to reproduce its own input; the 'Type' labels
# (y_train, y_val) are not used by this loop.
model.to(device)  # batches are moved to `device`, so the model must live there too

for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for input_ids, attention_mask in train_loader:
        input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)

    # Validate the model at the end of each epoch
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for input_ids, attention_mask in val_loader:
            input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
            loss = outputs.loss
            val_loss += loss.item()
    val_loss /= len(val_loader)

    print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss}, Val Loss: {val_loss}")


In [None]:
# True values
true_values = df['Intensity']

# Predicted values
# predicted_values = model.predict(df[['Task name', 'Type']])  # Assuming 'model' has a 'predict' method

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load pre-trained T5 model and tokenizer
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Assuming 'df' is your DataFrame containing the dataset
input_sentences = df['Task name'] + " " + df['Intensity'] + " " + df['Type']  # Combine Task name, Intensity, and Type

# Generate predictions using the T5 model
predicted_values = []
for input_sentence in input_sentences:
    input_ids = tokenizer.encode(input_sentence, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = model.generate(input_ids)
    predicted_value = tokenizer.decode(outputs[0], skip_special_tokens=True)
    predicted_values.append(predicted_value)

# Print the predicted values
print(predicted_values)

In [None]:
# Compare the generated strings ('predicted_values') against the ground-truth
# 'Intensity' values ('true_values') defined in the cells above
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(true_values, predicted_values)
print(f"Accuracy: {accuracy}")
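
Generated text rarely matches a label string character for character, so a raw exact comparison on decoded outputs may undercount correct predictions. The following is a minimal sketch, assuming the labels are short strings; `normalize`, `exact_match_accuracy`, and the example values are hypothetical and not taken from the dataset.

```python
def normalize(text):
    """Lower-case, trim, and collapse internal whitespace."""
    return " ".join(str(text).lower().split())


def exact_match_accuracy(true_values, predicted_values):
    """Fraction of predictions equal to the reference after normalisation."""
    if len(true_values) != len(predicted_values):
        raise ValueError("true_values and predicted_values must have equal length")
    if len(true_values) == 0:
        return 0.0
    matches = sum(
        normalize(t) == normalize(p)
        for t, p in zip(true_values, predicted_values)
    )
    return matches / len(true_values)


# Hypothetical labels and decoded model outputs, for illustration only
example_true = ["High", "Low", "Medium", "Low"]
example_pred = ["high ", "Low", "low", " LOW"]
example_accuracy = exact_match_accuracy(example_true, example_pred)
print(f"Exact-match accuracy: {example_accuracy}")
```

Whether case and whitespace should be ignored depends on how the labels were written, so this normalisation is a design choice rather than a requirement.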

# GPT

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data import DataLoader, Dataset
import torch
import time
import random
from sklearn.preprocessing import LabelEncoder

# BERT

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data import DataLoader, Dataset
import torch
import time
import random
from sklearn.preprocessing import LabelEncoder

In [None]:
# Prepare the Data
X = df[['Task name', 'Intensity']].apply(lambda x: ' '.join(x), axis=1).tolist()
y = df['Type'].tolist()

In [None]:
# Convert string labels to numeric values
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
num_labels = len(label_encoder.classes_)

In [None]:
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

train_idx, val_idx = next(sss.split(X, y_encoded))

X_train, X_val = [X[i] for i in train_idx], [X[i] for i in val_idx]
y_train, y_val = [y_encoded[i] for i in train_idx], [y_encoded[i] for i in val_idx]

In [None]:
# Tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

X_train_encodings = tokenizer(X_train, truncation=True, padding=True)
X_val_encodings = tokenizer(X_val, truncation=True, padding=True)

In [None]:
class CustomDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = CustomDataset(X_train_encodings, y_train)
val_dataset = CustomDataset(X_val_encodings, y_val)

In [None]:
# Modify the Model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Specify Hyperparameters
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
batch_size = 16
num_epochs = 3

In [None]:
# Fine-tune the Model
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

In [None]:
# Evaluate Performance
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        predicted = torch.argmax(outputs.logits, dim=1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = correct / total
print(f"Validation Accuracy: {accuracy}")

Validation Accuracy: 0.7222222222222222
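
A validation accuracy is easier to interpret next to the score obtained by always predicting the most frequent class. The sketch below assumes the labels are integer-encoded like `y_val`; the helper name `majority_baseline_accuracy` and the example labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)


# Hypothetical encoded validation labels, for illustration only
example_y_val = [0, 0, 1, 2, 0, 1, 0, 2]
baseline = majority_baseline_accuracy(example_y_val)
print(f"Majority-class baseline: {baseline}")
```

Calling `majority_baseline_accuracy(y_val)` on the real split would give the figure that the fine-tuned model needs to beat.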


# XLNet

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data import Dataset, DataLoader
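import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch optimizer
from transformers import XLNetTokenizer, XLNetLMHeadModel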
from tqdm import tqdm

In [None]:
# Define the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Define the tokenizer
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

In [None]:
# Define the XLNet model
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased").to(device)

In [None]:
# X = df[['Task name', 'Intensity']].apply(lambda x: ' '.join(x), axis=1).tolist()
# y = df['Type'].tolist()

In [None]:
# Prepare data
X = df['Task name'].values
y = df['Type'].values

In [None]:
# Split data into train and validation sets
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, val_index = next(sss.split(X, y))

X_train, X_val = X[train_index], X[val_index]
y_train, y_val = y[train_index], y[val_index]

In [None]:
# Define a custom dataset class
class CustomDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=128):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        inputs = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt"
        )
        input_ids = inputs["input_ids"].squeeze()
        attention_mask = inputs["attention_mask"].squeeze()
        return input_ids, attention_mask

In [None]:
# Create instances of custom dataset class for training and validation sets
train_dataset = CustomDataset(X_train, tokenizer)
val_dataset = CustomDataset(X_val, tokenizer)

In [None]:
# Define hyperparameters
learning_rate = 2e-5
batch_size = 8
epochs = 3

In [None]:
# Define optimizer
optimizer = AdamW(model.parameters(), lr=learning_rate)



In [None]:
# Define data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

In [None]:
# Fine-tune the model
for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for inputs, attention_mask in tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs}", leave=False):
        inputs, attention_mask = inputs.to(device), attention_mask.to(device)
        optimizer.zero_grad()
        outputs = model(inputs, attention_mask=attention_mask, labels=inputs)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)



In [None]:
# Validate the model (run once, after the training loop above has finished;
# as a separate cell this code cannot be indented as part of that loop)
model.eval()
val_loss = 0.0
with torch.no_grad():
    for inputs, attention_mask in val_loader:
        inputs, attention_mask = inputs.to(device), attention_mask.to(device)
        outputs = model(inputs, attention_mask=attention_mask, labels=inputs)
        loss = outputs.loss
        val_loss += loss.item()
val_loss /= len(val_loader)

print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss}, Val Loss: {val_loss}")

# Evaluation on test set
# Define your test set and perform evaluation similarly as done for validation set

Epoch 3/3, Train Loss: 0.1261334433220327, Val Loss: 0.011765315507849058
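
A mean cross-entropy loss can be converted to perplexity with `exp(loss)`, which is often easier to read. This is a small sketch, assuming the loss is measured in nats per token (the default for PyTorch's cross-entropy); `perplexity` is an illustrative helper name, and the two numbers are the losses printed above.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy loss (in nats per token) to perplexity."""
    return math.exp(mean_cross_entropy)


# Losses copied from the epoch 3 output above
train_perplexity = perplexity(0.1261334433220327)
val_perplexity = perplexity(0.011765315507849058)
print(f"Train perplexity: {train_perplexity:.4f}")
print(f"Val perplexity: {val_perplexity:.4f}")
```

A perplexity very close to 1 means the model puts almost all probability on the target token at every position. Because the labels here are the input ids themselves, this may indicate that the model is reproducing tokens it can already see, so the loss is probably not a useful quality signal for predicting `Type` on its own.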


In [None]:
# No separate test split exists, so the full 'Task name' column is reused here;
# it includes the training rows, so the result is not a held-out estimate
X_test = df['Task name'].values

In [None]:
# Create instances of custom dataset class for test set
test_dataset = CustomDataset(X_test, tokenizer)

In [None]:
# Define data loader for test set
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [None]:
# Evaluate the model on the test set
model.eval()
num_correct = 0
total_samples = 0
with torch.no_grad():
    for inputs, attention_mask in test_loader:
        inputs, attention_mask = inputs.to(device), attention_mask.to(device)
        outputs = model(inputs, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
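        # Compare token by token on non-padding positions and normalise by the
        # number of tokens; dividing by the number of sequences gives values above 1
        mask = attention_mask.bool()
        num_correct += (predictions[mask] == inputs[mask]).sum().item()
        total_samples += mask.sum().item()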

In [None]:
test_accuracy = num_correct / total_samples
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 127.68156424581005
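
An accuracy above 1 arises when matches are counted per token but divided by the number of sequences. The sketch below uses plain lists to show the difference, assuming `1` in the attention mask marks a real token and `0` marks padding; `token_accuracy` and the token ids are hypothetical.

```python
def token_accuracy(predictions, targets, attention_mask):
    """Fraction of non-padding positions where prediction equals target."""
    num_correct = 0
    num_tokens = 0
    for pred_row, target_row, mask_row in zip(predictions, targets, attention_mask):
        for pred, target, keep in zip(pred_row, target_row, mask_row):
            if keep:  # skip padding positions
                num_tokens += 1
                num_correct += int(pred == target)
    return num_correct / num_tokens if num_tokens else 0.0


# Hypothetical token ids: two sequences padded to length 4
example_targets = [[5, 6, 7, 0], [8, 9, 0, 0]]
example_predictions = [[5, 6, 1, 0], [8, 9, 0, 0]]
example_mask = [[1, 1, 1, 0], [1, 1, 0, 0]]

token_acc = token_accuracy(example_predictions, example_targets, example_mask)

# Counting matches over every position but dividing by the number of
# sequences is not bounded by 1, so it is not a valid accuracy
raw_matches = sum(
    p == t
    for pred_row, target_row in zip(example_predictions, example_targets)
    for p, t in zip(pred_row, target_row)
)
per_sequence_ratio = raw_matches / len(example_targets)
print(token_acc, per_sequence_ratio)
```

Even when normalised correctly, this measures how well the language-model head reproduces its input tokens, not how well `Type` is predicted; a classification head would be needed for the latter.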


# TinyBERT

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
from torch.utils.data import DataLoader, Dataset
import torch
import time
import random
from sklearn.preprocessing import LabelEncoder