<a href="https://colab.research.google.com/github/davidgibsonp/Rspotify/blob/master/lm_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning Language Models

## Setup

In [None]:
from google.colab import drive
import numpy as np
import collections
import random
import torch
import os

In [None]:
drive.mount('/content/drive')

In [None]:
# Clone transformers and install it from source
!git clone https://github.com/huggingface/transformers
# Use %cd, not !cd: `!cd` runs in a throwaway subshell and leaves the notebook's working directory unchanged
%cd transformers
!pip install .
!pip install -r ./examples/requirements.txt
%cd examples

In [4]:
from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModelWithLMHead

## Fine-Tune Model

In [8]:
!python language-modeling/run_language_modeling.py \
    --output_dir='/content/drive/My Drive/NLP/distilgpt2' \
    --cache_dir='/content/drive/My Drive/NLP/distilgpt2/cache' \
    --model_type=distilgpt2 \
    --model_name_or_path=distilgpt2 \
    --do_train \
    --train_data_file='/content/drive/My Drive/NLP/data/all_descriptions.txt' \
    --save_total_limit=5 \
    --num_train_epochs=1.0 \
    --logging_steps=500 \
    --save_steps=500 \
    --per_gpu_train_batch_size=2 \
    --per_gpu_eval_batch_size=2 \
    --block_size=128 \
    --gradient_accumulation_steps=5

2020-06-19 19:03:44.749405: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
06/19/2020 19:03:46 - INFO - transformers.training_args -   PyTorch: setting up devices
06/19/2020 19:03:46 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='/content/drive/My Drive/NLP/distilgpt2', overwrite_output_dir=False, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=2, per_gpu_eval_batch_size=2, gradient_accumulation_steps=5, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jun19_19-03-46_a58928ade4bc', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=5, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debu
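
The batch-size flags in the command above combine into an effective batch size per optimizer step. The following is a minimal sketch of that arithmetic, assuming a single GPU (the log shows `local_rank=-1`, but the device count is not visible in the truncated output). `effective_batch` is a hypothetical helper written for illustration, not part of transformers.

```python
def effective_batch(per_gpu_batch_size, gradient_accumulation_steps, n_gpu=1):
    """Number of sequences that contribute to each optimizer step."""
    return per_gpu_batch_size * gradient_accumulation_steps * n_gpu


# Values copied from the command above; n_gpu=1 is an assumption.
sequences_per_step = effective_batch(per_gpu_batch_size=2, gradient_accumulation_steps=5)

# Each training example holds at most block_size=128 tokens.
max_tokens_per_step = sequences_per_step * 128

print(sequences_per_step)   # 10
print(max_tokens_per_step)  # 1280
```

Under these assumptions, raising `--gradient_accumulation_steps` is a way to get a larger effective batch without increasing GPU memory use per forward pass.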

# Model Testing

## Test Tuned Model

In [121]:
from transformers import AutoTokenizer, AutoModelWithLMHead
import random

tokenizer = AutoTokenizer.from_pretrained("/content/drive/My Drive/NLP/distilgpt2/")
model = AutoModelWithLMHead.from_pretrained("/content/drive/My Drive/NLP/distilgpt2/")



In [110]:
# encode context the generation is conditioned on
prompt = 'This bra '
input_ids = tokenizer.encode(
    prompt, 
    add_special_tokens=False, 
    return_tensors="pt", 
    add_space_before_punct_symbol=True
    )

print(prompt)
print(input_ids[0])

This bra 
tensor([1212, 8290])


In [122]:
# Sample 5 continuations using top-k (k=10) and nucleus (p=0.9) sampling
sample_outputs = model.generate(
    # generation settings
    input_ids=input_ids,
    max_length=30,
    min_length=10,
    num_return_sequences=5,

    # sample the next token at random instead of always taking the most likely one
    do_sample=True,

    # restrict the candidate pool before sampling
    top_k=10,  # keep only the 10 most likely next tokens
    top_p=0.9  # then keep the smallest set whose cumulative probability reaches 0.9
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("\n{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------

0: This bra from the Calvin Klein Collection. style: 1792-5447 | color: 678 Soft stretch lace waistband Full coverage Real Soft stretch

1: This bra that we call the best-selling in the whole thing!
The perfect fit to your bra.
This bra is made with the right

2: This bra is designed with a soft, light, and flattering silhouette that adds support to your bra.   The color of this bra features a light,

3: This bra has a pretty lace lining, and a low rise under the waistband. This bra is made for comfort and a lot of love.


4: This bra with a low rise waistband and a low rise waistband. This style is perfect for the occasion you're looking for. The lace cups
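
To make the effect of `top_k` and `top_p` in the generation call above more concrete, here is a simplified sketch of the two filters applied to a toy next-token distribution. It is an illustration only, not the library's implementation: the real code works on logits, and details such as renormalization between the two steps may differ. The function name and the toy probabilities are hypothetical.

```python
def top_k_top_p_filter(probs, top_k=10, top_p=0.9):
    """Return the renormalized distribution left after top-k, then top-p filtering.

    probs: dict mapping token -> probability (a toy stand-in for model output).
    """
    # Rank tokens from most to least likely and keep only the top_k best.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution, for illustration only.
toy = {"is": 0.5, "has": 0.25, "with": 0.125, "from": 0.0625, "that": 0.0625}
filtered = top_k_top_p_filter(toy, top_k=3, top_p=0.7)
print(filtered)
```

With these toy numbers, top-k keeps "is", "has", and "with", and top-p then drops "with" because "is" and "has" already cover 0.75 of the probability mass. Sampling then happens only among the surviving tokens.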


## Test Non-Tuned Model

In [118]:
default_tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
default_model = AutoModelWithLMHead.from_pretrained("distilgpt2")



In [119]:
# encode context the generation is conditioned on
prompt = 'This bra '
input_ids = default_tokenizer.encode(
    prompt, 
    add_special_tokens=False, 
    return_tensors="pt", 
    add_space_before_punct_symbol=True
    )

print(prompt)
print(input_ids[0])

This bra 
tensor([1212, 8290])


In [120]:
# Sample 5 continuations using top-k (k=10) and nucleus (p=0.9) sampling
sample_outputs = default_model.generate(
    # generation settings
    input_ids=input_ids,
    max_length=30,
    min_length=10,
    num_return_sequences=5,

    # sample the next token at random instead of always taking the most likely one
    do_sample=True,

    # restrict the candidate pool before sampling
    top_k=10,  # keep only the 10 most likely next tokens
    top_p=0.9  # then keep the smallest set whose cumulative probability reaches 0.9
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("\n{}: {}".format(i, default_tokenizer.decode(sample_output, skip_special_tokens=True)))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Output:
----------------------------------------------------------------------------------------------------

0: This bra, and you‪re ready to go!!”

“It‪s so cute!”

1: This bra that I have always loved. The bra that I have always loved. The bra that I have always loved. The bra that I have always

2: This bra is a great choice of size and I am going to be making a new bra in the next few weeks. I have never tried a bra

3: This bra in this one!

This bra is a little too light, but it's a nice touch and I can easily make this bra even

4: This bra and it has a lot of fun.



I've been wanting this bra for a couple months now, and the only thing
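
One visible difference between the two sets of samples is repetition: sample 1 from the non-tuned model loops on a single sentence. A rough way to quantify this is the fraction of word bigrams that repeat an earlier bigram. The sketch below is an illustrative metric, not an evaluation used in this notebook. `repeated_ngram_fraction` is a hypothetical helper, and the two strings are excerpts from the outputs above.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=2):
    """Fraction of word n-grams in `text` that duplicate an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


# Excerpts from the generated samples above.
tuned_sample = "This bra has a pretty lace lining, and a low rise under the waistband."
base_sample = (
    "This bra that I have always loved. The bra that I have always loved. "
    "The bra that I have always loved."
)

tuned_score = repeated_ngram_fraction(tuned_sample)
base_score = repeated_ngram_fraction(base_sample)
print(tuned_score, base_score)
```

If repetition like this is a problem, `generate` accepts a `no_repeat_ngram_size` argument that blocks any n-gram of that size from appearing twice in one sequence.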
