**1. Create your own dataset for text classification. It should contain at least 2000 words in total and at least three categories with at least 100 examples per category (an example can be a poem or a paragraph from a book).**

In [1]:
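import pandas as pd
import numpy as np

# Each line of the file is read as "<category>:<text>".
# Splitting on ':' assumes the text itself contains no colon.
file_path = '/content/Labeled_Unique_Text_Classification_Dataset.txt'
data = pd.read_csv(file_path, delimiter=':', header=None, names=['Category', 'Text'])
data.head()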

|   | Category | Text |
|---|----------|------|
| 0 | literature | Waves crashed against the shore in roaring ap... |
| 1 | tech | A startup introduced a revolutionary battery ... |
| 2 | quote | Aspire to inspire before we expire, make a ma... |
| 3 | quote | Die with memories, not dreams. Live life to i... |
| 4 | literature | What we think, we become. Our thoughts shape ... |
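The size requirements in the task statement can be checked programmatically. The following is a minimal sketch using only the standard library; `check_dataset` and the stand-in rows are hypothetical and are not part of the notebook. With the DataFrame loaded above, the per-category counts would come from `data['Category'].value_counts()` instead.

```python
from collections import Counter

def check_dataset(rows, min_words=2000, min_categories=3, min_per_category=100):
    """Return requirement name -> bool for a list of (category, text) rows."""
    per_category = Counter(category for category, _ in rows)
    total_words = sum(len(text.split()) for _, text in rows)
    return {
        "enough_words": total_words >= min_words,
        "enough_categories": len(per_category) >= min_categories,
        "enough_examples": bool(per_category)
        and min(per_category.values()) >= min_per_category,
    }

# Illustrative stand-in data: 3 categories x 100 examples, 7 words each.
rows = [
    (cat, f"example number {i} for the {cat} category")
    for cat in ("literature", "tech", "quote")
    for i in range(100)
]
report = check_dataset(rows)
print(report)
```

Each value in `report` is `True` only when the corresponding requirement is met, so `all(report.values())` gives a single pass/fail answer.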


**2. Split the dataset into training (at least 240 examples) and test (at least 60 examples) sets.**

In [2]:
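# Remove stray whitespace around the category labels and the text.
data['Category'] = data['Category'].str.strip()
data['Text'] = data['Text'].str.strip()

# Shuffle reproducibly before splitting.
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# 80/20 split: the first 80% of the shuffled rows train, the rest test.
split_index = int(0.8 * len(data))
train_data = data[:split_index]
test_data = data[split_index:]

(train_data.shape, test_data.shape)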

((240, 2), (60, 2))
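The split above shuffles all rows together, so the number of test examples per category is left to chance. If equal representation matters, a stratified split is one option. The sketch below uses only the standard library and stand-in data; `stratified_split` is a hypothetical helper, not something the notebook defines. scikit-learn's `train_test_split(..., stratify=...)` is a common alternative.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split (category, text) rows so every category keeps the same test fraction."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[0]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])

    # Shuffle again so the categories are interleaved within each split.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Illustrative stand-in data: 3 categories x 100 examples.
rows = [
    (cat, f"{cat} example {i}")
    for cat in ("literature", "tech", "quote")
    for i in range(100)
]
train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))
```

With 100 examples per category and a 20% test fraction, each category contributes 20 test examples, which matches the 240/60 sizes required by the task.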

**3. Fine-tune a pre-trained language model capable of generating text (e.g., GPT) that you can take, e.g., from the Hugging Face Transformers library.**

In [None]:
!pip install datasets
import tensorflow as tf
from datasets import Dataset
from transformers import AutoTokenizer, TFAutoModelForCausalLM

# Wrap the pandas splits as Hugging Face datasets.
train_dataset = Dataset.from_pandas(train_data)
test_dataset = Dataset.from_pandas(test_data)

# GPT-2 has no pad token, so the end-of-text token is reused for padding.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Pad or truncate every example to exactly 128 tokens.
    return tokenizer(examples['Text'], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

def map_model_input(input_ids, attention_mask, labels):
    # Keras expects (features, labels) pairs.
    return {"input_ids": input_ids, "attention_mask": attention_mask}, labels

# The labels are the input_ids themselves (unshifted, padding positions included).
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_dataset['input_ids'], train_dataset['attention_mask'], train_dataset['input_ids']))
train_tf_dataset = train_tf_dataset.map(map_model_input).shuffle(1000).batch(8)
test_tf_dataset = tf.data.Dataset.from_tensor_slices((test_dataset['input_ids'], test_dataset['attention_mask'], test_dataset['input_ids']))
test_tf_dataset = test_tf_dataset.map(map_model_input).batch(16)

model = TFAutoModelForCausalLM.from_pretrained("gpt2")

# from_logits=True because the model returns raw logits, not probabilities.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# The test set is also used as validation data during training.
model.fit(train_tf_dataset, epochs=3, validation_data=test_tf_dataset)

eval_loss, eval_acc = model.evaluate(test_tf_dataset)
print(f"Eval Loss: {eval_loss}, Eval Accuracy: {eval_acc}")

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Epoch 1/3
Epoch 2/3
Epoch 3/3
Eval Loss: 0.03460855782032013, Eval Accuracy: 0.994921863079071
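Language-model losses are often reported as perplexity, which is the exponential of the mean cross-entropy. The short calculation below converts the evaluation loss printed above, assuming that value is a mean per-token cross-entropy in nats. The result is only comparable to published perplexities if the loss is computed on next-token prediction with padding excluded, which this sketch does not verify.

```python
import math

# Evaluation loss copied from the output above.
eval_loss = 0.03460855782032013

# Perplexity = exp(mean cross-entropy), with the loss measured in nats.
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.4f}")
```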



1. **Model Architecture and Training Setup**:
   - The model is loaded with `TFAutoModelForCausalLM`, which resolves the `gpt2` checkpoint to `TFGPT2LMHeadModel`, the TensorFlow implementation of GPT-2 with a language-modeling head.
   - The optimizer is Adam with a learning rate of 5e-5.
   - The loss function is `SparseCategoricalCrossentropy` with `from_logits=True`, applied to labels that are the `input_ids` themselves, and the model is compiled with the `'accuracy'` metric.

2. **Training**:
   - The model is trained for 3 epochs on the training dataset (`train_tf_dataset`).
   - During training, loss and accuracy are logged for the training data and for the validation data. The test set (`test_tf_dataset`) doubles as the validation set.
   - As the epochs progress, both training and validation losses decrease, and accuracy increases, indicating that the model is learning and improving its performance.

3. **Evaluation**:
   - After training, the model is evaluated on the test dataset (`test_tf_dataset`).
   - The model reaches an evaluation loss of 0.0346 and a token-level evaluation accuracy of 99.49%.

4. **Analysis**:
   - The high evaluation accuracy is consistent with good performance on unseen data, but it should be read with care: the labels are the unshifted `input_ids` and every sequence is padded to 128 tokens, so the metric is computed at every token position, padding included. It is best treated as a check that the training loop works, not as a measure of generation quality or of classification accuracy.
   - The training and validation metrics show no significant gap, so there is no sign of overfitting. Both losses decrease across epochs, and the validation accuracy ends at a level comparable to the training accuracy.

5. **Conclusion**:
   - The results show no evidence of overfitting: loss is low and accuracy is high on both the training and test sets.
   - The code trains and evaluates GPT-2 end to end on the custom dataset. Judging the quality of the generated text would additionally require sampling from the fine-tuned model, for example with `model.generate`.
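The accuracy discussed above is computed at every token position. For next-token prediction, the targets are usually the inputs shifted by one position, and padding positions are left out of the metric. The sketch below illustrates that idea on plain Python lists; `shift_for_next_token` and `masked_accuracy` are hypothetical helpers, the predictions are made up, and this is not what the notebook's training code does.

```python
def shift_for_next_token(input_ids, attention_mask):
    """Pair each position with the token that follows it.

    Returns (inputs, targets, target_mask); target_mask is 1 where the
    target is a real token and 0 where it is padding.
    """
    inputs = input_ids[:-1]
    targets = input_ids[1:]
    target_mask = attention_mask[1:]
    return inputs, targets, target_mask

def masked_accuracy(predictions, targets, target_mask):
    """Fraction of non-padding targets that were predicted correctly."""
    kept = [(p, t) for p, t, m in zip(predictions, targets, target_mask) if m == 1]
    if not kept:
        return 0.0
    return sum(p == t for p, t in kept) / len(kept)

PAD = 50256  # GPT-2's end-of-text id, reused as the pad token in the setup above
input_ids = [10, 11, 12, 13, PAD, PAD]
attention_mask = [1, 1, 1, 1, 0, 0]

inputs, targets, mask = shift_for_next_token(input_ids, attention_mask)

# Made-up predictions: two of the three real targets are correct.
predictions = [11, 12, 99, PAD, PAD]
print(masked_accuracy(predictions, targets, mask))
```

In this toy case the masked accuracy is 2 out of 3, whereas counting the two padding positions as well would give 4 out of 5. Long runs of padding can inflate an unmasked token accuracy in the same way.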

To improve accuracy, consider the following steps:

Fine-Tuning: Fine-tune the GPT-2 model on a downstream task related to your specific domain. This involves training the model on a task-specific dataset to adapt it to your use case.

Hyperparameter Tuning: Experiment with different hyperparameters, such as learning rate, batch size, and model architecture, to find the combination that works best for your data.

Data Augmentation: Increase the diversity of your training data through data augmentation techniques, such as adding noise, paraphrasing, or using different sentence structures.

More Training Data: If possible, obtain more labeled data for training. A larger and more diverse dataset can often lead to better model performance.

Model Architecture: Experiment with different model architectures or try more advanced models that might be better suited for your task.

Regularization: Apply regularization techniques, such as dropout, to prevent overfitting on the training data.

Error Analysis: Analyze the mistakes made by the model on the test set. Identify patterns in misclassifications and consider incorporating this knowledge into the training process.

Good Dataset: If possible, collect clean text without duplicates, encoding errors, or mislabeled examples.
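The hyperparameter tuning step above can be organized as a small grid search. The sketch below enumerates every combination of learning rate and batch size and keeps the one with the lowest validation loss. `grid_search` is a hypothetical helper, and `toy_evaluate` is a made-up stand-in for "train with these settings and return the validation loss", included only so the example runs without a model.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_loss)."""
    names = sorted(param_grid)
    best_params, best_loss = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

def toy_evaluate(params):
    # Stand-in scoring function with an arbitrary formula; a real version
    # would fine-tune the model and return its validation loss.
    lr_term = abs(params["learning_rate"] - 3e-5) * 1e4
    batch_term = abs(params["batch_size"] - 16) / 16
    return lr_term + batch_term

param_grid = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32],
}
best_params, best_loss = grid_search(param_grid, toy_evaluate)
print(best_params, best_loss)
```

When tuning this way, it is safer to score each configuration on a validation split that is separate from the final test set, so the test metrics stay an unbiased estimate.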