### Fine-tuning the 6-billion-parameter GPT-J in Colab with LoRA and 8-bit compression

This notebook is a proof of concept for fine-tuning [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) with limited memory. A detailed explanation of how it works can be found in [this model card](https://huggingface.co/hivemind/gpt-j-6B-8bit).

In [None]:
!pip install transformers==4.14.1
!pip install bitsandbytes-cuda111
!pip install datasets==1.16.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.14.1
  Downloading transformers-4.14.1-py3-none-any.whl (3.4 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)

In [None]:
# from google.colab import drive

# drive.mount('/content/gdrive')

In [None]:
!git clone https://github.com/feralvam/easse.git


Cloning into 'easse'...
remote: Enumerating objects: 1960, done.
remote: Counting objects: 100% (141/141), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 1960 (delta 116), reused 102 (delta 102), pack-reused 1819
Receiving objects: 100% (1960/1960), 33.15 MiB | 15.76 MiB/s, done.
Resolving deltas: 100% (1229/1229), done.


In [None]:
%cd /content/easse


/content/easse


In [None]:
!ls
!pip install -e .

demo   example.sh  MANIFEST.in	requirements.txt  setup.py
easse  LICENSE	   README.md	setup.cfg	  tests
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Obtaining file:///content/easse
  Preparing metadata (setup.py) ... done
Collecting tseval@ git+https://github.com/facebookresearch/text-simplification-evaluation.git@main
  Cloning https://github.com/facebookresearch/text-simplification-evaluation.git (to revision main) to /tmp/pip-install-fcvrho5u/tseval_61ee23c3ac9b4632bbff4937ba7f15cc
  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/text-simplification-evaluation.git /tmp/pip-install-fcvrho5u/tseval_61ee23c3ac9b4632bbff4937ba7f15cc
  Resolved https://github.com/facebookresearch/text-simplification-evaluation.git to commit f335e2e27026321c7c3d1dd63857416c7e7397b2
  Preparing metadata (setup.py) ... done
Collecting sacrebleu>=2.0.0
  Using cached sacrebleu-2.3.1-py3-non

In [None]:
import transformers
from easse.sari import corpus_sari
from easse.bleu import corpus_bleu
import torch
import torch.nn.functional as F
from torch import nn
from torch.cuda.amp import custom_fwd, custom_bwd
from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise
from tqdm.auto import tqdm
import pandas as pd
from google.colab import files




### Converting the model to 8 bits.

We convert EleutherAI's GPT-J-6B model to 8 bits using Facebook's [bitsandbytes](https://github.com/facebookresearch/bitsandbytes) library. This reduces the model's size from roughly 20 GB to about 6 GB.

Note that we don't convert the linear layers' biases to 8 bits, as they account for less than 1% of the model's weights anyway.
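
As a rough check of these figures, the parameter count can be estimated from the layer shapes that are printed when the model is loaded later in this notebook. This is a back-of-the-envelope sketch under stated assumptions: the depth of 28 blocks is taken from the published GPT-J-6B configuration rather than from this notebook, and layer norms and the output head's bias are ignored. The estimate lands in the same range as the sizes quoted above; exact values depend on the units used and on what is counted.

```python
# Back-of-the-envelope size estimate for GPT-J-6B.
# Assumption: 28 transformer blocks (from the published GPT-J-6B config).
# Layer shapes are the ones printed when the model is loaded.
N_BLOCKS = 28
HIDDEN, MLP, VOCAB = 4096, 16384, 50400
BLOCK_SIZE = 4096  # number of values that share one absmax scale

attn = 4 * HIDDEN * HIDDEN         # q_proj, k_proj, v_proj, out_proj
mlp = 2 * HIDDEN * MLP             # fc_in, fc_out
embeddings = 2 * VOCAB * HIDDEN    # input embedding (wte) + output head
weights = N_BLOCKS * (attn + mlp) + embeddings
mlp_biases = N_BLOCKS * (MLP + HIDDEN)  # fc_in and fc_out biases only

GB = 1e9
fp32_gb = weights * 4 / GB                   # 4 bytes per float32 weight
int8_gb = weights * 1 / GB                   # 1 byte per quantized weight
absmax_gb = (weights / BLOCK_SIZE) * 4 / GB  # one float32 scale per block
bias_share = mlp_biases / weights

print(f"weights: {weights:,}")
print(f"float32: {fp32_gb:.1f} GB, int8: {int8_gb:.2f} GB (+{absmax_gb:.3f} GB of scales)")
print(f"MLP biases as a share of weights: {bias_share:.4%}")
```

The per-block scales add well under 1% on top of the 8-bit weights, and the MLP biases are a tiny fraction of the total, which is consistent with leaving them unquantized.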

In [None]:

class FrozenBNBLinear(nn.Module):
    def __init__(self, weight, absmax, code, bias=None):
        assert isinstance(bias, nn.Parameter) or bias is None
        super().__init__()
        self.out_features, self.in_features = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
        self.bias = bias
 
    def forward(self, input):
        output = DequantizeAndLinear.apply(input, self.weight, self.absmax, self.code, self.bias)
        if self.adapter:
            output += self.adapter(input)
        return output
 
    @classmethod
    def from_linear(cls, linear: nn.Linear) -> "FrozenBNBLinear":
        weights_int8, state = quantize_blockise_lowmemory(linear.weight)
        return cls(weights_int8, *state, linear.bias)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.in_features}, {self.out_features})"
 
 
class DequantizeAndLinear(torch.autograd.Function): 
    @staticmethod
    @custom_fwd
    def forward(ctx, input: torch.Tensor, weights_quantized: torch.ByteTensor,
                absmax: torch.FloatTensor, code: torch.FloatTensor, bias: torch.FloatTensor):
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        ctx.save_for_backward(input, weights_quantized, absmax, code)
        ctx._has_bias = bias is not None
        return F.linear(input, weights_deq, bias)
 
    @staticmethod
    @custom_bwd
    def backward(ctx, grad_output: torch.Tensor):
        assert not ctx.needs_input_grad[1] and not ctx.needs_input_grad[2] and not ctx.needs_input_grad[3]
        input, weights_quantized, absmax, code = ctx.saved_tensors
        # grad_output: [*batch, out_features]
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        grad_input = grad_output @ weights_deq
        grad_bias = grad_output.flatten(0, -2).sum(dim=0) if ctx._has_bias else None
        return grad_input, None, None, None, grad_bias
 
 
class FrozenBNBEmbedding(nn.Module):
    def __init__(self, weight, absmax, code):
        super().__init__()
        self.num_embeddings, self.embedding_dim = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
 
    def forward(self, input, **kwargs):
        with torch.no_grad():
            # note: both quantized weights and input indices are *not* differentiable
            weight_deq = dequantize_blockwise(self.weight, absmax=self.absmax, code=self.code)
            output = F.embedding(input, weight_deq, **kwargs)
        if self.adapter:
            output += self.adapter(input)
        return output 
 
    @classmethod
    def from_embedding(cls, embedding: nn.Embedding) -> "FrozenBNBEmbedding":
        weights_int8, state = quantize_blockise_lowmemory(embedding.weight)
        return cls(weights_int8, *state)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.num_embeddings}, {self.embedding_dim})"
 
 
def quantize_blockise_lowmemory(matrix: torch.Tensor, chunk_size: int = 2 ** 20):
    assert chunk_size % 4096 == 0
    code = None
    chunks = []
    absmaxes = []
    flat_tensor = matrix.view(-1)
    for i in range((matrix.numel() - 1) // chunk_size + 1):
        input_chunk = flat_tensor[i * chunk_size: (i + 1) * chunk_size].clone()
        quantized_chunk, (absmax_chunk, code) = quantize_blockwise(input_chunk, code=code)
        chunks.append(quantized_chunk)
        absmaxes.append(absmax_chunk)
 
    matrix_i8 = torch.cat(chunks).reshape_as(matrix)
    absmax = torch.cat(absmaxes)
    return matrix_i8, (absmax, code)
 
 
def convert_to_int8(model):
    """Convert linear and embedding modules to 8-bit with optional adapters"""
    for module in list(model.modules()):
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                print(name, child)
                setattr( 
                    module,
                    name,
                    FrozenBNBLinear(
                        weight=torch.zeros(child.out_features, child.in_features, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                        bias=child.bias,
                    ),
                )
            elif isinstance(child, nn.Embedding):
                setattr(
                    module,
                    name,
                    FrozenBNBEmbedding(
                        weight=torch.zeros(child.num_embeddings, child.embedding_dim, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                    )
                )
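
The idea behind blockwise quantization can be illustrated without a GPU. The following is a simplified sketch, not the bitsandbytes implementation: it uses a uniform signed 8-bit grid, whereas `quantize_blockwise` maps values through a non-uniform lookup table (the `code` tensor above). The function names and the tiny block size are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def quantize_blockwise_toy(values: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Quantize to signed 8-bit integers with one absmax scale per block."""
    quantized: List[int] = []
    absmaxes: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        absmaxes.append(absmax)
        quantized.extend(round(v / absmax * 127) for v in block)
    return quantized, absmaxes


def dequantize_blockwise_toy(quantized: List[int], absmaxes: List[float], block_size: int = 4) -> List[float]:
    """Map 8-bit integers back to floats using each block's scale."""
    return [q / 127 * absmaxes[i // block_size] for i, q in enumerate(quantized)]


weights = [0.02, -0.5, 0.31, 0.07, 12.0, -3.0, 0.5, 6.0]
q, scales = quantize_blockwise_toy(weights)
restored = dequantize_blockwise_toy(q, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scales, max_error)
```

Because every block carries its own scale, the large values in the second block do not reduce the precision available to the small values in the first block.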

In [None]:
class GPTJBlock(transformers.models.gptj.modeling_gptj.GPTJBlock):
    def __init__(self, config):
        super().__init__(config)

        convert_to_int8(self.attn)
        convert_to_int8(self.mlp)


class GPTJModel(transformers.models.gptj.modeling_gptj.GPTJModel):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)
        

class GPTJForCausalLM(transformers.models.gptj.modeling_gptj.GPTJForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)


transformers.models.gptj.modeling_gptj.GPTJBlock = GPTJBlock  # monkey-patch GPT-J
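
The frozen layers defined above expose an `adapter` attribute that stays `None` unless something is attached to it. A minimal sketch of how trainable low-rank (LoRA-style) adapters could be attached is shown below. It is an assumption-laden illustration: `FrozenLinearStandIn` and `add_lora_adapters` are hypothetical names, the stand-in uses ordinary float weights so that it runs on CPU, and the rank is arbitrary.

```python
import torch
from torch import nn


class FrozenLinearStandIn(nn.Module):
    """Hypothetical stand-in for FrozenBNBLinear: frozen weight plus adapter slot."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        # A buffer is not a parameter, so it receives no gradient updates.
        self.register_buffer("weight", torch.randn(out_features, in_features))
        self.adapter = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x @ self.weight.t()
        if self.adapter is not None:
            out = out + self.adapter(x)
        return out


def add_lora_adapters(model: nn.Module, rank: int = 8) -> None:
    """Attach a trainable low-rank adapter to every module with an `adapter` slot."""
    for module in list(model.modules()):
        if hasattr(module, "adapter") and hasattr(module, "in_features"):
            down = nn.Linear(module.in_features, rank, bias=False)
            up = nn.Linear(rank, module.out_features, bias=False)
            nn.init.zeros_(up.weight)  # the adapter starts as a no-op
            module.adapter = nn.Sequential(down, up)


layer = FrozenLinearStandIn(16, 32)
x = torch.randn(2, 16)
before = layer(x)
add_lora_adapters(layer, rank=4)
after = layer(x)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(torch.allclose(before, after), trainable)
```

Zero-initializing the second projection means the model's outputs are unchanged right after the adapters are added, and only the small adapter matrices receive gradients during fine-tuning.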

In [None]:
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")


In [None]:
gpt = GPTJForCausalLM.from_pretrained("hivemind/gpt-j-6B-8bit", low_cpu_mem_usage=True)

device = 'cuda'
gpt.to(device)



k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, bias=False)
fc_in Linear(in_features=4096, out_features=16384, bias=True)
fc_out Linear(in_features=16384, out_features=4096, bias=True)
k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, bias=False)
fc_in Linear(in_features=4096, out_features=16384, bias=True)
fc_out Linear(in_features=16384, out_features=4096, bias=True)
k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, 

GPTJForCausalLM(
  (transformer): GPTJModel(
    (wte): FrozenBNBEmbedding(50400, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): FrozenBNBLinear(4096, 4096)
          (v_proj): FrozenBNBLinear(4096, 4096)
          (q_proj): FrozenBNBLinear(4096, 4096)
          (out_proj): FrozenBNBLinear(4096, 4096)
        )
        (mlp): GPTJMLP(
          (fc_in): FrozenBNBLinear(4096, 16384)
          (fc_out): FrozenBNBLinear(16384, 4096)
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
      (1): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
gpt.to(device)


In [None]:
%cd /content

/content


In [None]:
!ls

ADV_INT_train.csv  easse  sample_data  wiki_to_fine_f_train.csv


# Wiki

In [None]:
wiki = pd.read_csv(r'wiki_to_fine_f_train.csv')

In [None]:
wiki_train, wiki_test = wiki[:-100], wiki[-100:]  # hold out the last 100 rows for testing
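
The same "last 100 rows" holdout is applied to every dataset in this notebook. A small helper could express it once; the sketch below uses a hypothetical name, `holdout_split`, and relies only on slicing, so it works on lists as well as on pandas DataFrames (which slice by row position). Note that it does not shuffle, so the test set is whatever happens to be at the end of the file.

```python
def holdout_split(rows, n_test=100):
    """Return (train, test), where test is the last `n_test` rows.

    Works on any object that supports len() and slicing.
    """
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    return rows[:-n_test], rows[-n_test:]


train, test = holdout_split(list(range(1000)), n_test=100)
print(len(train), len(test))
```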

In [None]:
wiki

Unnamed: 0,text
0,simplify the text: \n text: Hofstetten-Flüh ...
1,simplify the text: \n text: It rapidly intensi...
2,simplify the text: \n text: Thomas Eastoe Abbo...
3,simplify the text: \n text: The SAT Reasoning ...
4,simplify the text: \n text: It is claimed that...
...,...
7972,simplify the text: \n text: The award has been...
7973,simplify the text: \n text: It has no hand ope...
7974,simplify the text: \n text: And the Yuna River...
7975,simplify the text: \n text: Bob 's Full House ...


# OneStopEnglish

### Train advanced to intermediate (OneStopEnglish)

In [None]:
df_a_t_i = pd.read_csv(r'ADV-INT.csv')

In [None]:
df_a_t_i

In [None]:
df_a_t_i_train, df_a_t_i_test = df_a_t_i[:-100], df_a_t_i[-100:]  # hold out the last 100 rows for testing

### Train intermediate to elementary (OneStopEnglish)

In [None]:
df_i_t_e = pd.read_csv(r'INT-ELE.csv')

In [None]:
df_i_t_e

In [None]:
df_i_t_e_train, df_i_t_e_test = df_i_t_e[:-100], df_i_t_e[-100:]  # hold out the last 100 rows for testing

# Newsela 

### Advanced to intermediate (Newsela)

In [None]:
newsela_a_to_i = pd.read_csv(r'df_newSela_fine_tune_ad_int.csv')

In [None]:
newsela_a_to_i_train, newsela_a_to_i_test = newsela_a_to_i[:-100], newsela_a_to_i[-100:]  # hold out the last 100 rows for testing

### Intermediate to elementary (Newsela)

In [None]:
newsela_i_to_e = pd.read_csv(r'df_newSela_fine_tune_int_el.csv')

In [None]:
newsela_i_to_e_train, newsela_i_to_e_test = newsela_i_to_e[:-100], newsela_i_to_e[-100:]  # hold out the last 100 rows for testing

In [None]:
# from google.colab import drive
# drive.mount('/content/gdrive')
# torch.save(gpt.state_dict(), 'gdrive/MyDrive/Final Project Data/model_tensor_all.pt')

## Loading the weights for testing

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
gpt.load_state_dict( torch.load('gdrive/MyDrive/Final Project Data/model_tensor_all.pt'))

<All keys matched successfully>

### Text simplification trial

In [None]:
prompt = tokenizer("simplify the text: n/ text: When you see the word Amazon, whats the first thing that springs to mind the worlds biggest forest, the longest river or the largest internet retailer and which do you consider most important? to:", return_tensors='pt')
prompt = {key: value.to(device) for key, value in prompt.items()}
out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
tokenizer.decode(out[0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"simplify the text: n/ text: When you see the word Amazon, whats the first thing that springs to mind the worlds biggest forest, the longest river or the largest internet retailer and which do you consider most important? to: When you see the word Amazon, what images do you first think of the forests largest city, the river's massive expanse or the largest online retailer and which do you consider most important?\n\nto: When you see the word Amazon, what name does it represent for you? the worlds biggest forest, the longest river or the largest internet retailer and which do you consider most important?\n\n"

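In the sample output above, the model keeps generating after its first answer and starts a second `to:` segment. One way to keep only the first answer is to cut the generated continuation at the earliest stop marker. The sketch below is a hypothetical helper, not part of the original notebook; it assumes the continuation has already been separated from the prompt (for example by decoding only the token ids that follow the prompt), and the sample string is a shortened stand-in for real model output.

```python
def first_answer(continuation: str, stop_markers=("\n\n", "to:")) -> str:
    """Cut a generated continuation at the earliest stop marker."""
    cut = len(continuation)
    for marker in stop_markers:
        idx = continuation.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return continuation[:cut].strip()


# Shortened, illustrative continuation in the same shape as the output above.
sample = (
    " When you see the word Amazon, what images do you first think of?"
    "\n\nto: When you see the word Amazon, what name does it represent for you?\n\n"
)
print(first_answer(sample))
```
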
# OneStopEnglish

### Run model from advanced to intermediate on train set




In [None]:
orig_sents_a = []
sys_sents_a_t_i = []
for i in df_a_t_i_train['text'][0:100]:
  orig_sents_a.append(i.split('text:')[1].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  sys_sents_a_t_i.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])

  





Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 135, but ``max_length`` is set to 1
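
The warning above appears because `max_length` in `generate` counts the prompt tokens as well as the newly generated ones, so a 135-token prompt leaves no room under a limit of 128. A sketch of one way to budget the length relative to the prompt follows. `length_budget` is a hypothetical helper, the default of 64 new tokens is arbitrary, and the 2048-token context size is an assumption based on the published GPT-J configuration.

```python
def length_budget(prompt_len: int, new_tokens: int = 64, context_size: int = 2048) -> dict:
    """Return a `max_length` that leaves room for `new_tokens` after the prompt."""
    if prompt_len >= context_size:
        raise ValueError("prompt does not fit in the context window")
    return {"max_length": min(prompt_len + new_tokens, context_size)}


print(length_budget(135))
```

In the loop above this could be used as `gpt.generate(**prompt, do_sample=True, **length_budget(prompt['input_ids'].shape[1]))`, so that long prompts still receive a generated continuation.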

In [None]:
corpus_sari(orig_sents = orig_sents_a,  
            sys_sents = sys_sents_a_t_i, 
            refs_sents= [orig_sents_a] )

28.714976526774326

In [None]:
corpus_bleu(sys_sents = sys_sents_a_t_i, 
            refs_sents=[orig_sents_a])

63.55293367478905

In [None]:
import time
start_time = time.time()
print("--- %s seconds ---" % (time.time() - start_time))

--- 5.5789947509765625e-05 seconds ---
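
The cell above starts and stops the clock back to back, so it reports only a few microseconds rather than the duration of a generation loop. A context manager keeps the timer tied to the code it measures. The sketch below uses a hypothetical helper name, `timed`, and a toy workload in place of the generation loop.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"--- {label}: {time.perf_counter() - start:.2f} seconds ---")


with timed("toy loop"):
    total = sum(range(1_000_000))
```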


### Run model from intermediate to elementary on train set


In [None]:
orig_sents_i = []
sys_sents_i_t_e = []
start_time = time.time()

for i in df_i_t_e_train['text'][0:100]:
  orig_sents_i.append(i.split('text:')[1].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  sys_sents_i_t_e.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])
print("--- %s seconds ---" % (time.time() - start_time))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- 2121.9158697128296 seconds ---


In [None]:
print(corpus_sari(orig_sents = orig_sents_i,  
            sys_sents = sys_sents_i_t_e, 
            refs_sents=[orig_sents_i]))

27.31363357868021


In [None]:
print(corpus_bleu( 
            sys_sents = sys_sents_i_t_e, 
            refs_sents=[orig_sents_i]))

58.877739506521756
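
The generate-and-collect loop is repeated almost verbatim for every dataset and split. A sketch of how it could be factored into one function is shown below. `collect_outputs` and its parameters are hypothetical; in this notebook `generate_fn` would wrap the tokenizer, `gpt.generate`, and `tokenizer.decode`, while the example uses a fake generator and a made-up prompt so that it runs without the model.

```python
from typing import Callable, Iterable, List, Tuple


def collect_outputs(
    prompts: Iterable[str],
    generate_fn: Callable[[str], str],
    split_source: Callable[[str], str],
    split_output: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Run `generate_fn` on each prompt and collect (sources, system outputs)."""
    orig_sents, sys_sents = [], []
    for prompt in prompts:
        orig_sents.append(split_source(prompt))
        sys_sents.append(split_output(generate_fn(prompt)))
    return orig_sents, sys_sents


# Same string splitting as in the cells above.
split_source = lambda s: s.split("text:")[1].split("to:")[0]
split_output = lambda s: s.split("text:")[1].split("to:")[1]
fake_generate = lambda p: p + " The cat rested."  # stands in for the model

orig, sys_out = collect_outputs(
    ["text: The feline reposed. to:"], fake_generate, split_source, split_output
)
print(orig, sys_out)
```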


### Run model from advanced to intermediate on test set

In [None]:
orig_sents_a = []
sys_sents_a_t_i = []
start_time = time.time() 
for i in df_a_t_i_test['text'][0:100]:
  orig_sents_a.append(i.split('text:')[1].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  sys_sents_a_t_i.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])

print("--- %s seconds ---" % (time.time() - start_time))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- 2061.3416831493378 seconds ---


In [None]:
print(corpus_sari(orig_sents = orig_sents_a,  
            sys_sents = sys_sents_a_t_i, 
            refs_sents= [orig_sents_a] ))

28.59484670867276


In [None]:
print(corpus_bleu(sys_sents = sys_sents_a_t_i, 
            refs_sents=[orig_sents_a]))

63.462171483476375


### Run model from intermediate to elementary on test set

In [None]:
orig_sents_i = []
sys_sents_i_t_e = []
start_time = time.time() 
for i in df_i_t_e_test['text'][0:100]:
  orig_sents_i.append(i.split('text:')[1].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  sys_sents_i_t_e.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])

print("--- %s seconds ---" % (time.time() - start_time))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- 2439.6491644382477 seconds ---


In [None]:
print(corpus_sari(orig_sents = orig_sents_i,  
            sys_sents = sys_sents_i_t_e, 
            refs_sents= [orig_sents_i] ))

27.768271108084516


In [None]:
print(corpus_bleu(sys_sents = sys_sents_i_t_e, 
            refs_sents=[orig_sents_i]))

62.380121678719334


# Newsela 

### Run model from advanced to intermediate on train set

In [None]:
# ### Run model from advance to intermediate on train set
# newsela_a_to_i_train
# newsela_i_to_e_train
# newsela_a_to_i_test
#  newsela_i_to_e_test

In [None]:
orig_sents_a = []
sys_sents_a_t_i = []
for i in newsela_a_to_i_train['text'][0:100]:
  orig_sents_a.append(i.split('text:')[1].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  sys_sents_a_t_i.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [None]:
corpus_sari(orig_sents = orig_sents_a,  
            sys_sents = sys_sents_a_t_i, 
            refs_sents= [orig_sents_a] )

29.350137346715034

In [None]:
corpus_bleu(sys_sents = sys_sents_a_t_i, 
            refs_sents=[orig_sents_a])

50.161023925291

### Run model from intermediate to elementary on train set

In [None]:
orig_sents_a = []
sys_sents_a_t_i = []
for i in newsela_i_to_e_train['text'][0:100]:
  orig_sents_a.append(i.split('text:')[1].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  sys_sents_a_t_i.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [None]:
corpus_sari(orig_sents = orig_sents_a,  
            sys_sents = sys_sents_a_t_i, 
            refs_sents= [orig_sents_a] )

26.355697984793423

In [None]:
corpus_bleu(sys_sents = sys_sents_a_t_i, 
            refs_sents=[orig_sents_a])

32.589936943710995

### Run model from advanced to intermediate on test set

In [None]:
orig_sents_a = []
sys_sents_a_t_i = []
start_time = time.time() 
for i in newsela_a_to_i_test['text'][0:100]:
  orig_sents_a.append(i.split('text:')[1].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  sys_sents_a_t_i.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])

print("--- %s seconds ---" % (time.time() - start_time))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- 2303.366279602051 seconds ---


In [None]:
corpus_sari(orig_sents = orig_sents_a,  
            sys_sents = sys_sents_a_t_i, 
            refs_sents= [orig_sents_a] )

29.420630682100192

In [None]:
corpus_bleu(sys_sents = sys_sents_a_t_i, 
            refs_sents=[orig_sents_a])

51.660351439766316

### Run model from intermediate to elementary on test set

In [None]:
orig_sents_a = []
sys_sents_a_t_i = []
start_time = time.time() 
for i in  newsela_i_to_e_test['text'][0:100]:
  orig_sents_a.append(i.split('text:')[1].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  sys_sents_a_t_i.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])

print("--- %s seconds ---" % (time.time() - start_time))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- 2951.5685868263245 seconds ---


In [None]:
corpus_sari(orig_sents = orig_sents_a,  
            sys_sents = sys_sents_a_t_i, 
            refs_sents= [orig_sents_a] )

26.52809843821507

In [None]:
corpus_bleu(sys_sents = sys_sents_a_t_i, 
            refs_sents=[orig_sents_a])

33.92627823914806
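
The generation loops in this notebook pull the simplified sentence out of the decoded text with chained `split` calls, which raise an `IndexError` whenever a sampled output lacks a `text:` or `to:` marker. The following is a minimal sketch of a more forgiving parser, not part of the original notebook: the function name `extract_simplification` is hypothetical, and stripping the surrounding whitespace is an added assumption.

```python
def extract_simplification(decoded, src_marker="text:", tgt_marker="to:"):
    """Return the segment after the first `tgt_marker` that follows `src_marker`.

    Hypothetical helper: returns "" instead of raising when a marker is missing.
    """
    # str.partition never raises; the middle element is "" if the marker is absent.
    _, found_src, after_src = decoded.partition(src_marker)
    if not found_src:
        return ""
    _, found_tgt, after_tgt = after_src.partition(tgt_marker)
    if not found_tgt:
        return ""
    # Keep only the text up to the next target marker, as split(tgt_marker)[1] would.
    # Stripping whitespace is an assumption; the notebook keeps it.
    return after_tgt.split(tgt_marker)[0].strip()


print(extract_simplification("text: The cat sat on the mat. to: A cat sat."))
print(repr(extract_simplification("text: an output with no target marker")))
```

An empty string still counts as a system output, so the lists passed to `corpus_sari` and `corpus_bleu` keep the same length as the list of source sentences.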

### Run model from advanced to elementary on test set

In [None]:
df_a_t_e = pd.read_csv('ADV-ELE.csv')

In [None]:
orig_sents_i = []     # original (advanced) sentences
sys_sents_a_t_e = []  # model outputs
ref_sents_a_t_e = []  # reference (elementary) sentences
start_time = time.time()
for i in range(len(df_a_t_e['new_text'][0:100])):
  orig_sents_i.append(df_a_t_e['org'][i])
  ref_sents_a_t_e.append(df_a_t_e['ref'][i])
  prompt = tokenizer(df_a_t_e['new_text'][i], return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  # Keep the decoded text between the first and second 'to:' markers.
  sys_sents_a_t_e.append(tokenizer.decode(out[0]).split('text:')[1].split('to:')[1])

print("--- %s seconds ---" % (time.time() - start_time))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- 3056.7531909942627 seconds ---


In [None]:
print(corpus_sari(orig_sents = orig_sents_i,  
            sys_sents = sys_sents_a_t_e, 
            refs_sents= [ref_sents_a_t_e] ))

33.30234618051194


In [None]:
print(corpus_bleu(sys_sents = sys_sents_a_t_e, 
            refs_sents=[ref_sents_a_t_e]))

35.31548125951137


### Run model on wiki files

In [None]:
orig_sents_wiki = []  # source sentences parsed from each prompt
sys_sents_wiki = []   # model outputs
start_time = time.time()
for i in wiki_test['text'][0:100]:
  # Index 2 takes what follows the second 'text:' in the prompt, up to 'to:'.
  orig_sents_wiki.append(i.split('text:')[2].split('to:')[0])
  prompt = tokenizer(i, return_tensors='pt')
  prompt = {key: value.to(device) for key, value in prompt.items()}
  out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
  # Keep the decoded text between the first and second 'to:' markers.
  sys_sents_wiki.append(tokenizer.decode(out[0]).split('to:')[1])

print("--- %s seconds ---" % (time.time() - start_time))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

--- 2691.666080236435 seconds ---


In [None]:
print(corpus_sari(orig_sents = orig_sents_wiki,  
            sys_sents = sys_sents_wiki, 
            refs_sents= [orig_sents_wiki] ))

30.583973403238975


In [None]:
print(corpus_bleu(sys_sents = sys_sents_wiki, 
            refs_sents=[orig_sents_wiki]))

42.5788077744111


## Generate examples of sentences

In [None]:
# Load each dataset and hold out its last 100 rows as the test split.
wiki = pd.read_csv('wiki_to_fine_f_train.csv')
wiki_train, wiki_test = wiki[:-100], wiki[-100:]
df_a_t_i = pd.read_csv('ADV-INT.csv')
df_a_t_i_train, df_a_t_i_test = df_a_t_i[:-100], df_a_t_i[-100:]
df_i_t_e = pd.read_csv('INT-ELE.csv')
df_i_t_e_train, df_i_t_e_test = df_i_t_e[:-100], df_i_t_e[-100:]
newsela_a_to_i = pd.read_csv('df_newSela_fine_tune_ad_int.csv')
newsela_a_to_i_train, newsela_a_to_i_test = newsela_a_to_i[:-100], newsela_a_to_i[-100:]
newsela_i_to_e = pd.read_csv('df_newSela_fine_tune_int_el.csv')
newsela_i_to_e_train, newsela_i_to_e_test = newsela_i_to_e[:-100], newsela_i_to_e[-100:]
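
The cell above applies the same rule to every dataset: the last 100 rows become the test split and everything before them is the training split. A minimal sketch of that rule as a reusable function follows; `split_holdout` is a hypothetical name, and the size check is an added assumption. It is demonstrated on a plain list so that it runs without the CSV files, and it should also accept a pandas DataFrame, since plain slices on a DataFrame select rows by position.

```python
def split_holdout(rows, n_test=100):
    """Split `rows` into (train, test), holding out the last `n_test` items.

    Hypothetical helper; raises if the data cannot supply a non-empty train split.
    """
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


train, test = split_holdout(list(range(250)))
print(len(train), len(test))
```

Holding out the tail rather than a random sample keeps the split reproducible across notebook restarts, at the cost of assuming the rows are not ordered in a way that biases the test set.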

Generate sentence simplifications from wiki test data