# Architecture 
GPT-2 is a language model built on the transformer architecture. Unlike the original encoder-decoder transformer, it is decoder-only: its key components are input embeddings, a stack of identical transformer blocks with masked self-attention, and an output layer.
1. Input embedding - The input text is converted into numerical representations that the model can process. An embedding layer maps each token in the input sequence to a high-dimensional vector, and a positional embedding is added so the model knows the order of the tokens.
2. Transformer blocks - GPT-2 consists of multiple identical blocks stacked on top of each other. These are often loosely called encoder layers, but GPT-2 has no separate encoder. Each block has two sub-layers: a self-attention mechanism and a feed-forward network. Self-attention lets the model weigh the importance of the different tokens in the input sequence, capturing the dependencies and relationships between them. The feed-forward network processes the self-attention outputs to produce richer representations.
3. Causal masking - What makes these blocks decoder-style is that self-attention is masked: each position can attend only to itself and to earlier tokens, and there is no cross-attention to an encoder. Conditioning each position on the previous tokens enables autoregressive generation, meaning the model predicts the next token in the sequence from the context it has seen so far.
4. Output layer - The final layer of GPT-2 is a linear transformation followed by a softmax. It produces a probability distribution over the vocabulary for the next token. The model generates text by sampling from this distribution or by choosing the token with the highest probability.
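
The two ideas that matter most for generation, conditioning only on previous tokens and turning output scores into a next-token choice, can be sketched in plain Python. This is a toy illustration, not GPT-2's implementation: the four-word vocabulary, the logits, and all function names are made up for the example.

```python
import math
import random

def causal_mask(seq_len):
    """Row i marks which positions token i may attend to (itself and earlier ones)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample_pick(probs, rng):
    """Draw one token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy vocabulary and made-up logits for the next token (illustrative values only)
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 0.5, 1.0, -1.0]

probs = softmax(logits)
mask = causal_mask(3)
next_greedy = vocab[greedy_pick(probs)]
next_sampled = vocab[sample_pick(probs, random.Random(0))]

print(mask)
print(next_greedy, next_sampled)
```

Greedy picking always returns the same token for the same logits, while sampling can return any token with non-zero probability, which is why sampled text is more varied.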

In [2]:
# This cell imports the libraries used in the rest of the notebook: torch for PyTorch functionality, DataLoader for batching,
# GPT2LMHeadModel and GPT2Tokenizer from transformers for the GPT-2 model and tokenizer, and load_dataset from datasets.
# The AdamW optimizer is taken from torch.optim further down, so it is not imported from transformers (that copy is deprecated).

In [4]:
import torch
import datasets
from torch.utils.data import DataLoader  # from PyTorch
from transformers import GPT2LMHeadModel, GPT2Tokenizer  # from Hugging Face
from datasets import load_dataset  # from Hugging Face

In [16]:
!pip install plotly

Collecting plotly
  Downloading plotly-5.22.0-py3-none-any.whl.metadata (7.1 kB)
Collecting tenacity>=6.2.0 (from plotly)
  Downloading tenacity-8.3.0-py3-none-any.whl.metadata (1.2 kB)
Downloading plotly-5.22.0-py3-none-any.whl (16.4 MB)
Downloading tenacity-8.3.0-py3-none-any.whl (25 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.22.0 tenacity-8.3.0

In [3]:
import numpy as np
import pandas as pd
import plotly.express as px
from collections import Counter




In [97]:
dataframe = pd.read_csv('/tf/tensorflow-tutorials/medium_articles.csv')

In [100]:
test_data = dataframe['text'][0:1000]

In [105]:
test_data.to_csv('test.csv', index=False)

In [47]:
dataframe.head(5)

Unnamed: 0,title,text,url,authors,timestamp,tags
0,Mental Note Vol. 24,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,https://medium.com/invisible-illness/mental-no...,['Ryan Fan'],2020-12-26 03:38:10.479000+00:00,"['Mental Health', 'Health', 'Psychology', 'Sci..."
1,Your Brain On Coronavirus,Your Brain On Coronavirus\n\nA guide to the cu...,https://medium.com/age-of-awareness/how-the-pa...,['Simon Spichak'],2020-09-23 22:10:17.126000+00:00,"['Mental Health', 'Coronavirus', 'Science', 'P..."
2,Mind Your Nose,Mind Your Nose\n\nHow smell training can chang...,https://medium.com/neodotlife/mind-your-nose-f...,[],2020-10-10 20:17:37.132000+00:00,"['Biotechnology', 'Neuroscience', 'Brain', 'We..."
3,The 4 Purposes of Dreams,Passionate about the synergy between science a...,https://medium.com/science-for-real/the-4-purp...,['Eshan Samaranayake'],2020-12-21 16:05:19.524000+00:00,"['Health', 'Neuroscience', 'Mental Health', 'P..."
4,Surviving a Rod Through the Head,"You’ve heard of him, haven’t you? Phineas Gage...",https://medium.com/live-your-life-on-purpose/s...,['Rishav Sinha'],2020-02-26 00:01:01.576000+00:00,"['Brain', 'Health', 'Development', 'Psychology..."


In [40]:
dataframe.columns

Index(['title', 'text', 'url', 'authors', 'timestamp', 'tags'], dtype='object')

In [25]:
assert len(dataframe) == len(dataframe["url"].unique())

In [29]:
print(f"There are {len(dataframe['url'].unique())} articles")

There are 192368 articles


In [None]:
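import ast

# Each 'tags' cell is a string such as "['Health', 'Brain']"; literal_eval parses it without running arbitrary code
all_tags = [tag for tags_list in dataframe["tags"] for tag in ast.literal_eval(tags_list)]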
d_tags_counter = Counter(all_tags)
tags, frequencies = list(zip(*d_tags_counter.most_common(n=50)))

fig = px.bar(x=tags, y=frequencies)
fig.update_xaxes(title="tags")
fig.update_yaxes(title="frequencies")
fig.show()
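
The counting step in the cell above can be tried without the full CSV. The sketch below assumes, as the `head()` output suggests, that every `tags` cell is a string holding a Python list literal; the three rows and the helper name `parse_tags` are made up for illustration.

```python
import ast
from collections import Counter

# Hypothetical rows mimicking the string-encoded 'tags' column
raw_tags = [
    "['Mental Health', 'Health', 'Psychology']",
    "['Mental Health', 'Coronavirus', 'Science']",
    "['Brain', 'Health']",
]

def parse_tags(cell):
    """Parse one cell into a list of tags without executing arbitrary code."""
    parsed = ast.literal_eval(cell)
    return parsed if isinstance(parsed, list) else []

# Flatten all rows into one stream of tags and count them
tag_counts = Counter(tag for cell in raw_tags for tag in parse_tags(cell))
top_two = tag_counts.most_common(2)
print(top_two)
```

`ast.literal_eval` only accepts literals (strings, numbers, lists, dicts and so on), so a malformed or malicious cell raises an error instead of being executed.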

In [48]:
dataframe

Unnamed: 0,title,text,url,authors,timestamp,tags
0,Mental Note Vol. 24,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,https://medium.com/invisible-illness/mental-no...,['Ryan Fan'],2020-12-26 03:38:10.479000+00:00,"['Mental Health', 'Health', 'Psychology', 'Sci..."
1,Your Brain On Coronavirus,Your Brain On Coronavirus\n\nA guide to the cu...,https://medium.com/age-of-awareness/how-the-pa...,['Simon Spichak'],2020-09-23 22:10:17.126000+00:00,"['Mental Health', 'Coronavirus', 'Science', 'P..."
2,Mind Your Nose,Mind Your Nose\n\nHow smell training can chang...,https://medium.com/neodotlife/mind-your-nose-f...,[],2020-10-10 20:17:37.132000+00:00,"['Biotechnology', 'Neuroscience', 'Brain', 'We..."
3,The 4 Purposes of Dreams,Passionate about the synergy between science a...,https://medium.com/science-for-real/the-4-purp...,['Eshan Samaranayake'],2020-12-21 16:05:19.524000+00:00,"['Health', 'Neuroscience', 'Mental Health', 'P..."
4,Surviving a Rod Through the Head,"You’ve heard of him, haven’t you? Phineas Gage...",https://medium.com/live-your-life-on-purpose/s...,['Rishav Sinha'],2020-02-26 00:01:01.576000+00:00,"['Brain', 'Health', 'Development', 'Psychology..."
...,...,...,...,...,...,...
192363,Why do you need a cleaning service?,What could be more important than having a tid...,https://medium.com/@ozneedcleaningau/why-do-yo...,[],2021-11-16 08:17:08.950000+00:00,"['Cleaning', 'Cleaning Services', 'Cleaning Co..."
192364,Daily cleaning and maintenance of bedding,Daily cleaning and maintenance of bedding\n\nW...,https://medium.com/@a198blwt/daily-cleaning-an...,[],2021-11-16 05:27:05.359000+00:00,"['Bedding', 'Cleaning', 'Maintain']"
192365,Beneficial Advice on Bond Cleaning!,The most important chore at the end is bond cl...,https://medium.com/@princegohil/beneficial-adv...,['Prince Shrawan'],2021-11-26 08:20:27.660000+00:00,"['Cleaning', 'End Of Lease Cleaning', 'Cleaners']"
192366,How I Learned Romanian in 37 Easy Steps,How I Learned Romanian in 37 Easy Steps\n\nHey...,https://medium.com/@lifeinromania/how-i-learne...,['Sam Ursu'],2017-11-27 08:09:19.025000+00:00,"['Romania', 'Language Learning', 'Storyofmylife']"


In [42]:
pip install -U scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [49]:
from sklearn.model_selection import train_test_split

In [50]:
train_data, test_data = train_test_split(dataframe, test_size=0.2, random_state=42)

In [51]:
len(train_data),len(test_data)

(153894, 38474)
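
The sizes printed above follow from simple arithmetic on the 192368 articles. The sketch below assumes the test part is rounded up, which matches the numbers in the output; `split_sizes` and `shuffled_split` are illustrative helpers, and the shuffle does not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

def split_sizes(n_samples, test_size):
    """Sizes of the train and test parts for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumption: the test part is rounded up
    return n_samples - n_test, n_test

def shuffled_split(items, test_size, seed):
    """Shuffle a copy with a fixed seed, then cut it into train and test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train, _ = split_sizes(len(items), test_size)
    return items[:n_train], items[n_train:]

n_train, n_test = split_sizes(192368, 0.2)
train_ids, test_ids = shuffled_split(range(10), test_size=0.2, seed=42)
print(n_train, n_test)
```

Fixing the seed (here 42, as in the cell above) makes the split repeatable, so later runs train and evaluate on the same rows.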

# GPT2 Model and Tokenizer

In [111]:
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
tokenizer.pad_token = tokenizer.eos_token


model.safetensors:  12%|#1        | 178M/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

# OR

# Either of these models can be used for text generation. Because of GPU memory limitations, I used the DistilGPT2 version

# DISTIL GPT2

In [None]:
model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.pad_token = tokenizer.eos_token


In [54]:
def tokenizer_dataset(text):
    return tokenizer(text, truncation=True, max_length=512, padding="max_length",return_tensors="pt")
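
What `truncation=True`, `max_length` and `padding="max_length"` do to a single sequence can be shown without the tokenizer. This is a simplified sketch that assumes right-side padding and reuses the end-of-text id 50256 as the pad id, as this notebook does; `pad_or_truncate` is a made-up helper, not a transformers function.

```python
EOS_ID = 50256  # GPT-2's end-of-text id, reused as the pad id in this notebook

def pad_or_truncate(token_ids, max_length, pad_id=EOS_ID):
    """Cut the sequence to max_length, then pad on the right up to max_length."""
    ids = list(token_ids[:max_length])
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * n_real + [0] * n_pad,
    }

short = pad_or_truncate([1890, 1374, 5882], max_length=6)
long = pad_or_truncate(list(range(10)), max_length=6)
print(short)
print(long)
```

Every example ends up with the same length, which is what lets the DataLoader stack them into one tensor per batch.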



In [67]:
train_data['tokenized_text']= train_data['text'].apply(tokenizer_dataset)

In [57]:
train_data.to_csv('Final_Dataset_Short.csv',index=False)

In [49]:
# here we are loading saved csv!

In [58]:
train_data = pd.read_csv('/tf/Final_Dataset_Short.csv')



In [68]:
train_data["tokenized_text"]

0         [input_ids, attention_mask]
1         [input_ids, attention_mask]
2         [input_ids, attention_mask]
3         [input_ids, attention_mask]
4         [input_ids, attention_mask]
                     ...             
153889    [input_ids, attention_mask]
153890    [input_ids, attention_mask]
153891    [input_ids, attention_mask]
153892    [input_ids, attention_mask]
153893    [input_ids, attention_mask]
Name: tokenized_text, Length: 153894, dtype: object

In [60]:
result = train_data.dtypes
result

title             object
text              object
url               object
authors           object
timestamp         object
tags              object
tokenized_text    object
dtype: object

In [69]:
train_data.head(5)

Unnamed: 0,title,text,url,authors,timestamp,tags,tokenized_text
0,How To Make A Man Obsessed With You | What Men...,How To Make A Man Obsessed With You | What Men...,https://medium.com/@moikabu43/how-to-make-a-ma...,[],2021-09-15 18:14:06.621000+00:00,"['Relationships Love Dating', 'Relationship Ad...","[input_ids, attention_mask]"
1,Stop using Pandas and start using Spark with S...,Data transformations\n\nMost (if not all) of t...,https://towardsdatascience.com/stop-using-pand...,['Chloe Connor'],2020-06-07 18:56:07.675000+00:00,"['Scala', 'Spark', 'Towards Data Science', 'Pa...","[input_ids, attention_mask]"
2,Entanglements,I tell stories entangled in the ridges of my m...,https://medium.com/scribe/entanglements-108bad...,['Uchechi Obasi'],2020-08-06 12:25:00.903000+00:00,"['Poetry', 'Haiku', 'Heartbreak', 'Dating', 'L...","[input_ids, attention_mask]"
3,SpanFact: Fix your factually incorrect summaries,SpanFact: Fix your factually incorrect summari...,https://towardsdatascience.com/spanfact-fix-yo...,['Rohit Pillai'],2020-12-17 05:06:47.959000+00:00,"['Naturallanguageprocessing', 'Summarization',...","[input_ids, attention_mask]"
4,AVM Android Integration is live for Aion Network,The Pocket team loves celebrating with its pee...,https://medium.com/pocket-network/avm-android-...,['Pocket Network'],2019-08-06 16:12:04.050000+00:00,"['Blockchain', 'Avm', 'Smart Contracts', 'Aion...","[input_ids, attention_mask]"


In [53]:
result = data

# extract input ids and attention mask

In [112]:
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Wraps a dataframe whose 'tokenized_text' column holds tokenizer outputs."""

    def __init__(self, dataframe):
        self.dataframe = dataframe

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        row = self.dataframe.iloc[idx]
        # Each entry is expected to be the tokenizer output: a dict-like object
        # with 'input_ids' and 'attention_mask' tensors of shape (1, max_length)
        tokenized_text = row['tokenized_text']

        input_ids = tokenized_text['input_ids'].squeeze(0)  # Remove the batch dimension
        attention_mask = tokenized_text['attention_mask'].squeeze(0)  # Remove the batch dimension
        # For causal language modelling the labels are the input ids themselves;
        # the model shifts them internally to predict the next token
        labels = input_ids.clone()

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }
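
One refinement worth considering, which this notebook does not apply: because the pad token is the end-of-text token, copying `input_ids` into `labels` means the loss is also computed on padding. A common convention is to set the labels of padded positions to -100, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; `mask_padding_labels` is a made-up helper.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]

# Three real tokens followed by two pad tokens
input_ids = [1890, 1374, 5882, 50256, 50256]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

With tensors, the same effect would come from `labels[attention_mask == 0] = -100` after the `clone()` call.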

# call dataset

In [113]:
# Initialize Dataset
dataset = TextDataset(train_data)

In [118]:

# Create DataLoader
train_dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
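
A quick back-of-the-envelope check shows how much of the training set is actually used. It assumes the 153894 training rows reported earlier, the batch size of 2 set here, and the early stop at step 400 in the training loop further down; both helper names are made up.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches in one pass; the last batch may be smaller."""
    return math.ceil(n_examples / batch_size)

def examples_seen(last_step, batch_size):
    """Examples consumed when the loop stops after finishing 0-based step `last_step`."""
    return (last_step + 1) * batch_size

total_steps = steps_per_epoch(153894, 2)
seen = examples_seen(400, 2)
fraction = seen / 153894

print(total_steps, seen, f"{fraction:.2%}")
```

Under these assumptions the model sees only a small slice of the data, so the run is closer to a short fine-tuning smoke test than a full epoch.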


In [119]:
train_dataloader

<torch.utils.data.dataloader.DataLoader at 0x7fe778c0b110>

# check dataloader

In [115]:
# Test fetching a batch
try:
    batch = next(iter(train_dataloader))
    print("Batch fetched successfully!")
    print("Input IDs:", batch['input_ids'])
    print("Attention Mask:", batch['attention_mask'])
    print("Labels:", batch['labels'])
except Exception as e:
    print("Failed to fetch a batch:", e)

Batch fetched successfully!
Input IDs: tensor([[ 1890,  1374,  5882,  ..., 50256, 50256, 50256],
        [25681,    12,   265,  ...,  1738,   373,   780],
        [ 1212,   318,   262,  ..., 23292,   306,   503],
        ...,
        [ 3844, 20544,   612,  ...,  3038,   318,  2045],
        [   40,   892,   314,  ...,  7062,   284, 12383],
        [   32,  1285,   550,  ...,  8033,   287,   290]])
Attention Mask: tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])
Labels: tensor([[ 1890,  1374,  5882,  ..., 50256, 50256, 50256],
        [25681,    12,   265,  ...,  1738,   373,   780],
        [ 1212,   318,   262,  ..., 23292,   306,   503],
        ...,
        [ 3844, 20544,   612,  ...,  3038,   318,  2045],
        [   40,   892,   314,  ...,  7062,   284, 12383],
        [   32,  1285,   550,  ...,  8033,   287,   29

In [1]:
# clean cuda

In [226]:
torch.cuda.empty_cache() 

# device configuration

In [116]:
# Set up the training parameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Automatic mixed precision training
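
Why the loss is scaled in the loop below: half precision cannot represent very small numbers, so tiny gradients would round to zero. The sketch uses Python's `struct` module to round values to IEEE half precision. The gradient value is made up, and the scale factor of 2**16 is assumed to match the default initial scale of `GradScaler`.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

SCALE = 2.0 ** 16  # assumed scale factor for this illustration

tiny_grad = 1e-8                       # made-up gradient value
unscaled = to_fp16(tiny_grad)          # too small for fp16, rounds to zero
scaled = to_fp16(tiny_grad * SCALE)    # large enough to be stored in fp16
recovered = scaled / SCALE             # unscale in higher precision before the update

print(unscaled, scaled, recovered)
```

This only illustrates representability. In the real loop the scaling happens on the loss, so every gradient produced by `backward()` is already multiplied by the scale, and `scaler.step()` unscales them before the optimizer update.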

In [120]:
import os
from torch.cuda.amp import GradScaler, autocast

# Training loop with automatic mixed precision
scaler = GradScaler()
model.train()
num_epochs = 1
output_path = 'GPT2-model-medium'

for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()

        # Run the forward pass in mixed precision
        with autocast():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

        # Scale the loss before backward so small fp16 gradients do not underflow;
        # scaler.step() unscales the gradients before the optimizer update
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        print("Step-{},Loss-{}".format(step, loss.item()))

        # Stop early after step 400 and save a checkpoint
        if step == 400:
            os.makedirs(output_path, exist_ok=True)
            model.save_pretrained(output_path)
            tokenizer.save_pretrained(output_path)
            print(f"Model and tokenizer are saved in {output_path}")
            break

Step-0,Loss-5.136580467224121
Step-1,Loss-3.4252097606658936
Step-2,Loss-3.0653090476989746
Step-3,Loss-3.4548635482788086
Step-4,Loss-2.964229106903076
Step-5,Loss-3.7356607913970947
Step-6,Loss-4.038450241088867
Step-7,Loss-3.439716100692749
Step-8,Loss-2.9599533081054688
Step-9,Loss-3.1705498695373535
Step-10,Loss-3.3257651329040527
Step-11,Loss-2.8328394889831543
Step-12,Loss-5.30260705947876
Step-13,Loss-3.334078550338745
Step-14,Loss-2.7600533962249756
Step-15,Loss-5.441909313201904
Step-16,Loss-4.976123809814453
Step-17,Loss-3.904688596725464
Step-18,Loss-6.645779609680176
Step-19,Loss-4.560384750366211
Step-20,Loss-4.828579425811768
Step-21,Loss-2.9035515785217285
Step-22,Loss-3.716073989868164
Step-23,Loss-2.8754332065582275
Step-24,Loss-3.516566753387451
Step-25,Loss-3.2154414653778076
Step-26,Loss-3.1334550380706787
Step-27,Loss-2.9585983753204346
Step-28,Loss-2.839010715484619
Step-29,Loss-3.506218910217285
Step-30,Loss-3.3266377449035645
Step-31,Loss-2.619534492492676
Step
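
The per-step loss above is noisy because each step sees only one small batch. A moving average makes the trend easier to read, and `exp(loss)` converts the mean cross-entropy into perplexity. The sketch uses the first ten loss values from the log; the window size of 5 and the function names are arbitrary choices for the example.

```python
import math

# First ten per-step losses copied from the log above
losses = [
    5.136580467224121, 3.4252097606658936, 3.0653090476989746,
    3.4548635482788086, 2.964229106903076, 3.7356607913970947,
    4.038450241088867, 3.439716100692749, 2.9599533081054688,
    3.1705498695373535,
]

def moving_average(values, window):
    """Average of the last `window` values at each step (fewer at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def perplexity(loss):
    """exp(loss), since the loss is a mean cross-entropy in nats."""
    return math.exp(loss)

smoothed = moving_average(losses, window=5)
ppl_first, ppl_last = perplexity(smoothed[0]), perplexity(smoothed[-1])

print([round(v, 3) for v in smoothed])
print(round(ppl_first, 1), round(ppl_last, 1))
```

Because the labels in this notebook include padded positions, these perplexities are only indicative and are not comparable to published GPT-2 numbers.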

In [None]:
# load

In [121]:
model_saved = GPT2LMHeadModel.from_pretrained('GPT2-model-medium')
tokenizer = GPT2Tokenizer.from_pretrained('GPT2-model-medium')

tokenizer.pad_token = tokenizer.eos_token

In [122]:
model_saved.to(device)
model_saved.eval()  # put the reloaded model (not the in-memory training model) into evaluation mode



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(1024, 1024)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-23): 24 x GPT2Block(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=50257, bias=False)
)

In [125]:
prompt = "he next six weeks will be hard as cases continue to explode and government leadership remains nonexistent. I can’t control any of this,"

input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
output = model_saved.generate(input_ids, max_length=100, num_return_sequences=1)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [126]:
for i, generated in enumerate(output):
    text = tokenizer.decode(generated, skip_special_tokens=True)
    print(f'Generated Text {i} : {text}')

Generated Text 0 : he next six weeks will be hard as cases continue to explode and government leadership remains nonexistent. I can’t control any of this, but I can’t help but feel that the world’s attention is on the United States.

The United States is the only country in the world that has a president who has not been elected by the people. The United States is the only country in the world that has a president who has not been elected by the people.
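
The generated text repeats a whole sentence, which is typical when the most probable token is chosen at every step. A small helper can flag such loops by counting word n-grams that occur more than once. This is an illustrative check, not part of the notebook; `repeated_ngrams` is a made-up name.

```python
from collections import Counter

def repeated_ngrams(text, n):
    """Return the word n-grams that occur more than once, with their counts."""
    words = text.split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: count for gram, count in grams.items() if count > 1}

# Second paragraph of the generated text above
generated = (
    "The United States is the only country in the world that has a president "
    "who has not been elected by the people. The United States is the only "
    "country in the world that has a president who has not been elected by the people."
)

repeats = repeated_ngrams(generated, n=4)
print(len(repeats), "repeated 4-grams")
```

If repetition is a problem, `generate` accepts options such as `do_sample=True`, `top_k` and `no_repeat_ngram_size`, which trade some determinism for more varied output.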


