Sample notebook documenting the training process

In [1]:
import pandas as pd
import numpy as np

# connect to drive
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [2]:
data = pd.read_csv('/content/drive/MyDrive/ts_generator/taylor_stitch_info.csv')
data.head()
print(data.shape)

(1186, 3)


In [3]:
# separate the model name and color
# as added info to optionally use later
# and for EDA
def add_name_info(row):
  name = row.Name
  split_text = name.split(' in ')
  model_name = split_text[0]
  row['ModelName']= model_name
  if len(split_text)>1:
    row['Color'] = split_text[1]
  else:
    row['Color'] = 'None'
  return row

In [4]:
data = data.apply(add_name_info,axis=1)
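The split above keeps only the piece between the first and second ` in ` when a name contains that separator more than once. A minimal sketch of the same idea on plain strings, assuming the color is everything after the first ` in `; `split_name` is a hypothetical helper, not something defined in this notebook:

```python
def split_name(name):
    """Split 'Model in Color' into (model_name, color).

    Assumption: the color is everything after the first ' in '.
    Names without ' in ' get the placeholder color 'None', matching
    the convention used by add_name_info.
    """
    model_name, sep, color = name.partition(' in ')
    return model_name, (color if sep else 'None')


print(split_name('The Waylons in Chestnut'))
print(split_name('The Camp Candle'))
```

If this behavior is what you want, something like `data['Name'].map(split_name)` would apply it to the whole column without a row-wise `apply`.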

In [5]:
import textwrap
wrapper = textwrap.TextWrapper(width=50)
for text in data['Description']:
  word_list = wrapper.wrap(text=text)
  for element in word_list:
    print(element)

  print('\n')

Streaming output truncated to the last 5000 lines.
To the seafaring among us, the term leeward will
suggest protection from trade winds, but even
those with a healthy fear of the open ocean can
take shelter in our brand new Leeward shirt. True
to its name, this sharp, versatile shirt is built
to keep you warm and dry no matter the size of the
storm.


To rove is to wander. Fittingly, we designed and
constructed the Rover Jacket with exploration and
discovery in mind. It’s built to be an everyday
garment that accompanies you on everyday
adventures, big and small. After a brief absence,
this Taylor Stitch mainstay is back in a
remarkably elegant birdseye wool. As has become
Rover tradition, it’s been waxed for weather
resistance, this time in collaboration with
industry leaders Halley Stevensons in Scotland.
One for the rover in all of us.


The versatile California cut, newly offered in a
classic navy/ash windowpane plaid is back, sitting
cozily between dressy and casual. 

In [9]:
data['Description']

0       Built by our buddies at VALLON, The Waylons is...
1       Sure, you could hand off your well-worn denim ...
2       The Camp Candle was poured by hand in small ba...
3       This exclusive edition of The Cotton Hemp Tee ...
4       Regenerative agriculture is all about reciproc...
                              ...                        
1181    A solid selection of tees is a must have for a...
1182    A solid selection of tees is a must have for a...
1183    We designed The Pathfinder to be that trusty j...
1184    These boots are the perfect blend of style, fu...
1185    “Utility Shirt” isn’t just some catchy nomencl...
Name: Description, Length: 1186, dtype: object

In [7]:
! pip install transformers -q
import csv
import os  # used by train() when saving checkpoints
import random

import torch
import torch.nn.functional as F
# torch.optim.AdamW replaces the deprecated transformers.AdamW
# (note: its default weight_decay is 0.01 rather than 0.0)
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm, trange
from transformers import GPT2LMHeadModel, GPT2Tokenizer, get_linear_schedule_with_warmup

In [8]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')

In [10]:
data['Name'][0]

'The Waylons in Chestnut'

In [11]:
torch.tensor((tokenizer.encode(data['Description'][100]))).shape

torch.Size([92])

In [12]:
# this could just as easily be a simple function...
class InputData(Dataset):
    def __init__(self, data, gpt2_type="gpt2", max_length=1024):
        # all GPT-2 sizes share one vocabulary, so "gpt2" also works for gpt2-medium
        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.input = []
        for index, row in data.iterrows():
            # product name and description are joined by a '####' separator line
            entry = '<|startoftext|>' + row['Name'] + '\n####\n' + \
                    row['Description'] + '<|endoftext|>'

            # truncate to max_length tokens (GPT-2's context window is 1024)
            self.input.append(torch.tensor(
                self.tokenizer.encode(entry)[:max_length]))

        self.input_length = len(self.input)

    def __len__(self):
        return self.input_length

    def __getitem__(self, item):
        return self.input[item]

dataset = InputData(data, gpt2_type="gpt2", max_length=1024)
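Each training entry and each generation prompt share the same layout: a start marker, the product name, a `####` separator line, and (for training) the description plus an end marker. The sketch below makes that relationship explicit; `format_entry` and `make_prompt` are hypothetical helper names, and the description is a placeholder rather than real data:

```python
START, SEP, END = '<|startoftext|>', '\n####\n', '<|endoftext|>'


def format_entry(name, description):
    """Full training string: start marker, name, separator, description, end marker."""
    return START + name + SEP + description + END


def make_prompt(name):
    """Generation prompt: everything up to and including the separator."""
    return START + name + SEP


entry = format_entry('The Waylons in Chestnut', 'A placeholder description.')
# A prompt is always a prefix of the corresponding training entry
print(entry.startswith(make_prompt('The Waylons in Chestnut')))
print(repr(entry))
```

Because the prompt is a prefix of the training format, the fine-tuned model is asked to continue exactly where a description would begin.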

In [13]:
train = data.sample(1175,random_state=0)
test = data.loc[~data.index.isin(train.index)]

train_dataset = InputData(train)
test_dataset = InputData(test)

In [14]:
np.unique(train.index.isin(test.index))

array([False])
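With 1186 rows and 1175 sampled for training, only 11 rows are held out. A small standard-library sketch of a seeded, disjoint split on row positions follows; it illustrates the idea only and will not reproduce the rows chosen by `data.sample(..., random_state=0)`. `split_indices` is a hypothetical helper:

```python
import random


def split_indices(n_rows, n_train, seed=0):
    """Return (train_idx, test_idx): a seeded, disjoint split of range(n_rows)."""
    rng = random.Random(seed)
    train_idx = set(rng.sample(range(n_rows), n_train))
    test_idx = [i for i in range(n_rows) if i not in train_idx]
    return sorted(train_idx), test_idx


train_idx, test_idx = split_indices(1186, 1175)
print(len(train_idx), len(test_idx))  # 1175 11
print(set(train_idx).isdisjoint(test_idx))  # True
```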

In [15]:
#Get the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')

#Pack several short entries into one sequence of at most max_seq_len tokens,
#so each forward pass uses more of the model's context
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None
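The packing idea can be illustrated without tensors. The sketch below greedily concatenates token-id lists into chunks of at most `max_seq_len` tokens. It is a simplified variant, not a port of `pack_tensor`: it appends rather than prepends, keeps every token, and places a sequence longer than `max_seq_len` in its own chunk untruncated. `pack_sequences` is a hypothetical name:

```python
def pack_sequences(sequences, max_seq_len):
    """Greedily concatenate token-id lists into chunks of at most max_seq_len tokens."""
    packed, current = [], []
    for seq in sequences:
        # Close the current chunk if adding this sequence would overflow it
        if current and len(current) + len(seq) > max_seq_len:
            packed.append(current)
            current = []
        current = current + list(seq)
    if current:
        packed.append(current)
    return packed


toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack_sequences(toy, max_seq_len=5))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```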

In [16]:
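def train(
    dataset, model, tokenizer,
    batch_size=16, epochs=5, lr=2e-5,
    max_seq_len=768, warmup_steps=200,
    gpt2_type="gpt2", output_dir=".", output_prefix="wreckgar",
    test_mode=False,save_model_on_epoch=False,
):
    # NOTE: this definition rebinds the name `train`, which previously held
    # the training DataFrame; only train_dataset is needed from here on.
    import os  # needed for os.path.join when save_model_on_epoch=True
    device=torch.device("cuda")
    model = model.to(device)
    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    # NOTE: with num_training_steps=-1 the linear schedule appears to drop the
    # learning rate to 0 once warmup_steps optimizer steps have passed; pass the
    # real number of optimizer steps if training runs longer than the warmup.
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    # The loader yields one entry at a time; pack_tensor joins entries into
    # sequences of up to max_seq_len tokens, and `batch_size` is the number of
    # packed sequences whose gradients are accumulated per optimizer step.
    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    loss=0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f"Training epoch {epoch}")
        print(loss)  # loss of the last packed sequence from the previous epoch
        for idx, entry in tqdm(enumerate(train_dataloader)):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, max_seq_len)

            # keep packing until the sequence is full or the data runs out
            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
            # start the next pack with the entry that did not fit, if any
            input_tensor = remainder
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model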
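The scheduler used in `train` scales the learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The sketch below re-implements that rule in plain Python from its description, not from the library source, so treat it as an approximation; `linear_warmup_decay` is a hypothetical name. It also suggests why `num_training_steps=-1` deserves care once training runs past the warmup:

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup to 1.0, then linear decay to 0.0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# A schedule with 200 warmup steps out of 1000 total optimizer steps
for step in (0, 100, 200, 600, 1000):
    print(step, linear_warmup_decay(step, 200, 1000))

# With num_training_steps=-1 the multiplier is 0.0 as soon as warmup ends
print(linear_warmup_decay(200, 200, -1))
```

Under this rule, the base `lr=2e-5` is only reached at the end of warmup, so short runs never leave the warmup phase.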

In [None]:
model = train(train_dataset, model, tokenizer)



Training epoch 0
0


1175it [01:33, 12.53it/s]


Training epoch 1
tensor(4.1222, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [01:34, 12.38it/s]


Training epoch 2
tensor(4.0628, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [01:34, 12.40it/s]


Training epoch 3
tensor(3.7085, device='cuda:0', grad_fn=<NllLossBackward0>)


745it [00:59, 12.46it/s]

In [None]:
#torch.save(model, '/content/drive/MyDrive/ts_generator/model.pt')

In [None]:
#Load the model to use it
#model = torch.load('/content/drive/MyDrive/ts_generator/model.pt')

In [None]:
model = train(train_dataset, model, tokenizer)



Training epoch 0
0


1175it [00:34, 33.86it/s]


Training epoch 1
tensor(3.0518, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 31.91it/s]


Training epoch 2
tensor(2.6444, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 31.76it/s]


Training epoch 3
tensor(2.7694, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 32.19it/s]


Training epoch 4
tensor(2.5987, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:37, 31.73it/s]


Training epoch 5
tensor(2.8575, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 32.07it/s]


Training epoch 6
tensor(2.8317, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 32.16it/s]


Training epoch 7
tensor(2.9973, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 31.88it/s]


Training epoch 8
tensor(2.8150, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 31.94it/s]


Training epoch 9
tensor(2.7084, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 32.03it/s]


Training epoch 10
tensor(2.7942, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 31.84it/s]


Training epoch 11
tensor(2.7488, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 32.02it/s]


Training epoch 12
tensor(3.3915, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 32.02it/s]


Training epoch 13
tensor(2.7551, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 32.07it/s]


Training epoch 14
tensor(2.6146, device='cuda:0', grad_fn=<NllLossBackward0>)


1175it [00:36, 31.89it/s]


In [None]:
def generate(
    model,
    tokenizer,
    prompt,
    entry_count=10,
    entry_length=150, #maximum number of generated tokens
    top_p=0.8,
    temperature=1.,
):
    model.eval()
    generated_num = 0
    generated_list = []

    filter_value = -float("Inf")

    with torch.no_grad():

        for entry_idx in trange(entry_count):

            entry_finished = False
            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

            for i in range(entry_length):
                # no labels at inference time: only the logits are needed
                logits = model(generated)[0]
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[
                    ..., :-1
                ].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token in tokenizer.encode("<|endoftext|>"):
                    entry_finished = True

                if entry_finished:

                    generated_num = generated_num + 1

                    output_list = list(generated.squeeze().numpy())
                    output_text = tokenizer.decode(output_list)
                    generated_list.append(output_text)
                    break
            
            if not entry_finished:
              output_list = list(generated.squeeze().numpy())
              output_text = f"{tokenizer.decode(output_list)}<|endoftext|>" 
              generated_list.append(output_text)
                
    return generated_list

#Function to generate one description per product. Test data should be a dataframe
def text_generation(test_data):
  generated_descriptions = []
  for i in range(len(test_data)):
    # use the frame passed in, not the global `test`
    prompt = '<|startoftext|>' + test_data['Name'].values[i] +'\n####\n'
    x = generate(model.cpu(), tokenizer, prompt, entry_count=1)
    generated_descriptions.append(x)
  return generated_descriptions

#Run the functions to generate the descriptions
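The sampling loop in `generate` uses nucleus (top-p) filtering: tokens are sorted by probability and kept until the cumulative probability first exceeds `top_p`, with the token that crosses the threshold included. A plain-Python sketch of just that selection step, using made-up probabilities; `top_p_keep` is a hypothetical helper:

```python
def top_p_keep(probs, top_p):
    """Return the indices kept by nucleus filtering, most probable first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        # Stop once the kept tokens cover more than top_p of the mass
        if cumulative > top_p:
            break
    return kept


probs = [0.125, 0.5, 0.05, 0.25, 0.075]  # illustrative values only
print(top_p_keep(probs, top_p=0.8))  # [1, 3, 0]
```

Every token outside the kept set has its logit set to negative infinity in `generate`, so it can never be sampled.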

In [None]:
def text_gen_from_custom_input(product_name_list):
  generated_text = []
  for entry in product_name_list:
    prompt = '<|startoftext|>' + entry +'\n####\n'
    # two candidate descriptions per product name
    x = generate(model.cpu(), tokenizer, prompt, entry_count=2)
    generated_text.append(x)
  return generated_text

In [None]:
product_list = ['The Jack in Blue Plaid',
                'The Ledge in Purple Stripe', 
                'The Frodo Cloak in Evergreen',
                'The Foundation Pant in Natural Linen',
                'The Short Sleeve California in Wine']

generated_text = text_gen_from_custom_input(product_list)

100%|██████████| 2/2 [01:33<00:00, 46.98s/it]
100%|██████████| 2/2 [00:51<00:00, 25.77s/it]
100%|██████████| 2/2 [00:59<00:00, 29.80s/it]
100%|██████████| 2/2 [01:18<00:00, 39.23s/it]
100%|██████████| 2/2 [01:10<00:00, 35.48s/it]


In [None]:
generated_text[1]

['<|startoftext|>The Ledge in Purple Stripe\n####\nWe’ve put a lot of thought into designing our own striped shirt, and thanks to some meticulous construction, it’s impossible not to notice some subtle texture. Who wouldn’t like a good blend of soft, rugged material? We’ve teamed up with a handy workshop to bring you the perfect selvage shirt for all occasions.<|endoftext|>',
 "<|startoftext|>The Ledge in Purple Stripe\n####\nWe developed the world's first Ledge in spring, and when the weather changed, we found a better fit for a shorter, shorter life. All the work, all the innovation—all the love, and all the scars—was laid bare on this timeless piece that’s held the sharpest edge in our brand. Now, with fresh, distinctive colorways, this soft, organic blend has a clean, cozy feel to it. <|endoftext|>"]
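Each generated string still carries the marker tokens and the `####` separator. A small sketch of pulling the product name and description back out, assuming the output follows the training layout; `parse_generated` is a hypothetical helper and the sample text is a placeholder:

```python
START, SEP, END = '<|startoftext|>', '\n####\n', '<|endoftext|>'


def parse_generated(text):
    """Return (product_name, description) from one generated entry."""
    body = text.replace(START, '').replace(END, '')
    name, _, description = body.partition(SEP)
    return name.strip(), description.strip()


sample = START + 'The Ledge in Purple Stripe' + SEP + 'A placeholder description. ' + END
print(parse_generated(sample))
# ('The Ledge in Purple Stripe', 'A placeholder description.')
```

Splitting on the separator rather than on `>` keeps the parsing correct even if a generated description happens to contain a `>` character.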

In [None]:
generated_text = text_generation(test.sample(2,random_state=5))

100%|██████████| 1/1 [01:18<00:00, 78.62s/it]
100%|██████████| 1/1 [00:24<00:00, 24.87s/it]


In [None]:
for entries in generated_text:
  # strip the marker tokens, then split the name from the description
  text = entries[0].replace('<|startoftext|>', '').replace('<|endoftext|>', '')
  product_name, _, description = text.partition('\n####\n')
  print(product_name + ':\n')
  for element in wrapper.wrap(text=description):
    print(element)

  print('\n')

The Jack in Blue Plaid:

We’ve had some incredible things happen to us over
the years, but no one really needs to know about
them to appreciate The Jack. Here we’re offering
up our very first ever collaborative set of denim
jeans. Our collection spans from denim boots to
Golden Bear Sunroofs, and we’ve designed our fit
and construction to keep you comfortable when the
winds of change take their toll.


The Ledge in Purple Stripe:

We’ve put a lot of thought into designing our own
striped shirt, and thanks to some meticulous
construction, it’s impossible not to notice some
subtle texture. Who wouldn’t like a good blend of
soft, rugged material? We’ve teamed up with a
handy workshop to bring you the perfect selvage
shirt for all occasions.


The Frodo Cloak in Evergreen:

Our beloved trouser pieces have come a long way in
the last decade, and thanks to their all-natural
materials, comfort, and sturdy construction,
they're no longer limited only to everyday wear.
The latest iteration of o

In [None]:
#Keep only the generated description and add it as a new column in the dataframe.
#`test_set` was never defined in the original cell (it raised a NameError);
#it must be the same frame that was passed to text_generation above,
#so that row i lines up with generated_text[i].
test_set = test.sample(2, random_state=5).copy()
my_generations=[]

for i in range(len(generated_text)):
  full_text = generated_text[i][0] #one generated entry per product
  description = full_text.split('\n####\n', 1)[-1] #everything after the separator
  my_generations.append(description.replace('<|endoftext|>', '').strip())

test_set['Generated_Text'] = my_generations


#End each description at its last full stop, dropping any trailing fragment
final=[]

for text in test_set['Generated_Text']:
  head, sep, _ = text.rpartition('.')
  final.append(head + sep if sep else text)

test_set['Generated_Text'] = final