<a href="https://colab.research.google.com/github/dlmacedo/deep-learning-class/blob/master/C%C3%B3pia_de_smipletransformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Simple Transformers

[W&B live dashboard](https://app.wandb.ai/cayush/simpletransformers/reports/simpleTransformer--Vmlldzo5MTgzNw/edit)

In [None]:
!pip install wandb


# Multi-Label Classification

Follow these steps to download and use Kaggle data within Google Colab:

1. Go to your Kaggle account page, scroll to the API section, and click Expire API Token to revoke any previous tokens.

2. Click Create New API Token. This downloads a `kaggle.json` file to your machine.

3. Upload that file to the Colab runtime and run the cell below.
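
Before running the next cell, it can help to confirm that the uploaded file looks like a Kaggle token. The sketch below assumes the token is a JSON object with `username` and `key` fields, which is the layout Kaggle generates at the time of writing. The helper name `looks_like_kaggle_token` is hypothetical, and the demo credentials are placeholders.

```python
import json
import os
import tempfile


def looks_like_kaggle_token(path):
    """Return True if `path` is a JSON object with non-empty 'username' and 'key'."""
    try:
        with open(path, "r") as f:
            token = json.load(f)
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON.
        return False
    if not isinstance(token, dict):
        return False
    return bool(token.get("username")) and bool(token.get("key"))


# Demo on a throwaway file with placeholder credentials (not a real token).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "kaggle.json")
with open(demo_path, "w") as f:
    json.dump({"username": "example-user", "key": "0123456789abcdef"}, f)

print(looks_like_kaggle_token(demo_path))
print(looks_like_kaggle_token(os.path.join(demo_dir, "missing.json")))
```

In Colab, the same check could be pointed at the uploaded `kaggle.json` in the working directory.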




In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:

!pip install kaggle==1.5.6


In [None]:
!kaggle config set -n path -v /content


In [None]:
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge 

In [None]:
!unzip train.csv.zip

In [None]:
!pip install transformers
!pip install seqeval
!pip install tensorboardx
!pip install simpletransformers


In [None]:
import wandb

# Search space: each run draws one of three learning rates.
sweep_config = {
    'method': 'random',  # grid, random
    'metric': {
        'name': 'eval_loss',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'values': [5e-5, 3e-5, 2e-5]
        },
    }
}

# Default values for a run; note that epochs is not part of the search space.
sweep_defaults = {
    'learning_rate': 5e-5,
    'epochs': 2
}

sweep_id = wandb.sweep(sweep_config, project="uncategorized")


Create sweep with ID: n2mybcri
Sweep URL: https://app.wandb.ai/cayush/uncategorized/sweeps/n2mybcri
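
To make the sweep configuration concrete, the sketch below mimics in plain Python how one run's hyperparameters could be assembled: start from the defaults, then draw one value for each swept parameter. It only illustrates the idea behind `'method': 'random'` and is not W&B's implementation. `sample_run_config` is a hypothetical helper.

```python
import random

sweep_config = {
    "method": "random",
    "metric": {"name": "eval_loss", "goal": "minimize"},
    "parameters": {"learning_rate": {"values": [5e-5, 3e-5, 2e-5]}},
}
sweep_defaults = {"learning_rate": 5e-5, "epochs": 2}


def sample_run_config(config, defaults, rng):
    """Copy the defaults, then override each swept parameter with a random choice."""
    run_config = dict(defaults)
    for name, spec in config["parameters"].items():
        run_config[name] = rng.choice(spec["values"])
    return run_config


rng = random.Random(0)  # seeded only so the demo is repeatable
for _ in range(3):
    print(sample_run_config(sweep_config, sweep_defaults, rng))
```

In this sketch `epochs` always keeps its default value, because it does not appear under `parameters`.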


In [None]:
import pandas as pd
df = pd.read_csv('train.csv')
df = df[:75000]

In [None]:
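# Pack the six binary label columns into one tuple per row.
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
df['labels'] = list(zip(*(df[col].tolist() for col in label_cols)))
# Flatten newlines so each comment is a single line of text.
df['text'] = df['comment_text'].apply(lambda x: x.replace('\n', ' '))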

df.head()
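
The `labels` column built above holds one tuple of six 0/1 flags per comment, in the order toxic, severe_toxic, obscene, threat, insult, identity_hate. The pure-Python sketch below applies the same transformation to two invented rows (not taken from the dataset), so the resulting shape is easy to inspect without pandas. `to_example` is a hypothetical helper.

```python
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Two invented rows shaped like records from train.csv.
rows = [
    {"comment_text": "Thanks for the\nhelpful edit.", "toxic": 0, "severe_toxic": 0,
     "obscene": 0, "threat": 0, "insult": 0, "identity_hate": 0},
    {"comment_text": "Placeholder for an insulting comment.", "toxic": 1, "severe_toxic": 0,
     "obscene": 0, "threat": 0, "insult": 1, "identity_hate": 0},
]


def to_example(row):
    """Return (text, labels): newlines flattened, labels packed into a tuple."""
    text = row["comment_text"].replace("\n", " ")
    labels = tuple(row[col] for col in LABEL_COLUMNS)
    return text, labels


examples = [to_example(row) for row in rows]
for text, labels in examples:
    print(labels, text)
```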

In [None]:
from sklearn.model_selection import train_test_split


train_df, eval_df = train_test_split(df, test_size=0.2)

In [None]:
from simpletransformers.classification import MultiLabelClassificationModel
import pandas as pd

def trainMultiLabel():
  # Seed the run config with sweep_defaults so that values the sweep does not
  # set (such as epochs) are still available on wandb.config.
  wandb.init(config=sweep_defaults)
  print("HyperParams=>>", wandb.config.epochs, wandb.config.learning_rate)

  # Create a MultiLabelClassificationModel
  model = MultiLabelClassificationModel(
      "distilbert",
      "distilbert-base-uncased",
      num_labels=6,
      args={
          "reprocess_input_data": True,
          "overwrite_output_dir": True,
          "num_train_epochs": wandb.config.epochs,
          "learning_rate": wandb.config.learning_rate,
          "wandb_project": "simpletransformers",
          "fp16": False,
          "max_seq_length": 64,
      },
  )
  # You can set class weights by using the optional weight argument
  print(train_df.head())

  # Train the model
  model.train_model(train_df)

  # Evaluate the model
  result, model_outputs, wrong_predictions = model.eval_model(eval_df)
  print(result)
  print(model_outputs)


In [None]:
wandb.agent(sweep_id,function=trainMultiLabel)


# Named Entity Recognition


In [None]:
!kaggle datasets download -d alaakhaled/conll003-englishversion

In [None]:
!unzip conll003-englishversion.zip

In [None]:


def trainNER():
  from simpletransformers.ner import NERModel

  # Seed the run config with sweep_defaults so that wandb.config.epochs is defined.
  wandb.init(config=sweep_defaults)
  print("HyperParam=>>", wandb.config.epochs, wandb.config.learning_rate)

  # Create a NERModel
  model = NERModel(
      'distilbert',
      'distilbert-base-cased',
      args={
          "reprocess_input_data": True,
          "overwrite_output_dir": True,
          "num_train_epochs": wandb.config.epochs,
          "learning_rate": wandb.config.learning_rate,
          "wandb_project": "simpletransformers",
          "fp16": False,
          "max_seq_length": 64,
      },
  )

  # Train the model
  model.train_model('train.txt')

  # Evaluate the model
  result, model_outputs, predictions = model.eval_model('test.txt')

  # Check predictions
  print(predictions[:5])


In [None]:
wandb.agent(sweep_id,function=trainNER)


# Question Answering

In [None]:
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

If using JSON files, each file should contain a single list of dictionaries. Each dictionary represents a single context and its associated questions.
It contains two attributes, "context" and "qas".

* context: The paragraph or text from which the question is asked.
* qas: A list of questions and answers.

Questions and answers are represented as dictionaries. Each dictionary in qas has the following format.

* id: (string) A unique ID for the question. Should be unique across the entire dataset.
* question: (string) A question.
* is_impossible: (bool) Indicates whether the question cannot be answered from the context (True means the context contains no answer).
* answers: (list) The list of correct answers to the question.

A single answer is represented by a dictionary with the following attributes.

* text: (string) The answer to the question. Must be a substring of the context.
* answer_start: (int) Starting index of the answer in the context.
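
As a concrete illustration of the format described above, here is a hand-written toy record (not taken from SQuAD) together with a small consistency check. The function `check_record` is hypothetical, and the answer string is assumed to live under the key `text`, as it does in the SQuAD JSON files.

```python
context = "Simple Transformers wraps Hugging Face models behind a small API."
answer = "Hugging Face models"

toy_record = {
    "context": context,
    "qas": [
        {
            "id": "toy-0001",
            "question": "What does Simple Transformers wrap?",
            "is_impossible": False,
            # answer_start is the character offset of the answer in the context
            "answers": [{"text": answer, "answer_start": context.index(answer)}],
        },
        {
            "id": "toy-0002",
            "question": "Who maintains the library?",
            "is_impossible": True,  # not answerable from this context
            "answers": [],
        },
    ],
}


def check_record(record):
    """Return a list of problems found in one context/qas record."""
    problems = []
    ctx = record["context"]
    for qa in record["qas"]:
        for ans in qa["answers"]:
            start = ans["answer_start"]
            if ctx[start:start + len(ans["text"])] != ans["text"]:
                problems.append(f"{qa['id']}: answer_start does not match text")
        if not qa["is_impossible"] and not qa["answers"]:
            problems.append(f"{qa['id']}: answerable question without answers")
    return problems


train_sample = [toy_record]  # the training data is a list of such records
print([check_record(record) for record in train_sample])
```

Running a check like this over the converted data is a cheap way to catch offset errors before training.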

We can convert the SQuAD data into this format quite easily.

In [None]:
import json


with open('train-v2.0.json', 'r') as f:
    train_data = json.load(f)

train_data = [item for topic in train_data['data'] for item in topic['paragraphs'] ]


In [None]:
train_data = train_data[:5000]

In [None]:
len(train_data)

5000

In [None]:
def trainQA():
  from simpletransformers.question_answering import QuestionAnsweringModel
  wandb.init()
  print("HyperParam=>>" , wandb.config.learning_rate)

  train_args = {
      'learning_rate': wandb.config.learning_rate,
      'num_train_epochs': 2,
      'max_seq_length': 128,
      'doc_stride': 64,
      'overwrite_output_dir': True,
      'reprocess_input_data': False,
      'train_batch_size': 2,
      'fp16': False,
      'wandb_project': "simpletransformers"
  }

  model = QuestionAnsweringModel('distilbert', 'distilbert-base-cased', args=train_args)
  model.train_model(train_data)


In [None]:
wandb.agent(sweep_id,function=trainQA)


# Language Modelling

In [None]:
!wget http://pcai056.informatik.uni-leipzig.de/downloads/corpora/epo_literature_2011_300K.tar.gz
!wget http://pcai056.informatik.uni-leipzig.de/downloads/corpora/epo_mixed_2012_300K.tar.gz
!wget http://pcai056.informatik.uni-leipzig.de/downloads/corpora/epo_wikipedia_2016_300K.tar.gz

In [None]:
!tar -xvzf epo_literature_2011_300K.tar.gz


In [None]:
!tar -xvzf epo_mixed_2012_300K.tar.gz


In [None]:
!tar -xvzf epo_wikipedia_2016_300K.tar.gz

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import os


all_text = []
for fl in os.scandir("."):
    if fl.name.endswith(".txt") and "sentences" in fl.name:

        # on_bad_lines="skip" requires pandas >= 1.3; older versions used error_bad_lines=False.
        df = pd.read_csv(fl, delimiter="\t", engine="python", header=None, on_bad_lines="skip")
        df.columns = ["id", "text"]
        texts = df.text.tolist()
        texts = [t for t in texts if isinstance(t, str)]
        all_text.extend(texts)

train, test = train_test_split(all_text, test_size=0.1)

with open("train.txt", "w") as f:
    for line in train:
        f.write(line + "\n")

with open("test.txt", "w") as f:
    for line in test:
        f.write(line + "\n")
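
The cell above relies on each `*sentences*.txt` file holding one `id<TAB>sentence` pair per line. If pandas struggles with stray quotes or extra tabs, the same step can be approximated with the standard library alone. This is a minimal sketch under that assumption. `read_sentences` is a hypothetical helper, and the sample lines are invented rather than taken from the corpus.

```python
import os
import random
import tempfile


def read_sentences(path):
    """Yield the sentence part of each 'id<TAB>sentence' line, skipping malformed lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t", 1)
            if len(parts) == 2 and parts[1].strip():
                yield parts[1].strip()


# Demo on a tiny invented file instead of the real corpus.
sample_path = os.path.join(tempfile.mkdtemp(), "demo-sentences.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("1\tLa suno brilas.\n2\tMi amas legi.\nline without a tab\n3\tBonan tagon!\n")

sentences = list(read_sentences(sample_path))
random.Random(0).shuffle(sentences)      # seeded only for a repeatable demo
cut = max(1, int(len(sentences) * 0.9))  # roughly 90% train, 10% test
train_part, test_part = sentences[:cut], sentences[cut:]
print(len(train_part), len(test_part))
```

The two lists can then be written out one sentence per line, exactly as in the cell above.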

In [None]:
def trainLM():
  import wandb
  from simpletransformers.language_modeling import LanguageModelingModel

  wandb.init()

  train_args = {
      "reprocess_input_data": False,
      "overwrite_output_dir": True,
      "num_train_epochs": 2,
      "save_eval_checkpoints": True,
      "save_model_every_epoch": False,
      "learning_rate": 3e-2,
      "warmup_steps": 10000,
      "train_batch_size": 64,
      "eval_batch_size": 128,
      "fp16": False,
      "gradient_accumulation_steps": 1,
      "block_size": 128,
      "max_seq_length": 128,
      "dataset_type": "simple",
      'wandb_project': "simpletransformers",
      "wandb_kwargs": {"name": "LM3e-2"},
      "logging_steps": 100,
      "evaluate_during_training": True,
      "evaluate_during_training_steps": 50000,
      "evaluate_during_training_verbose": True,
      "use_cached_eval_features": True,
      "sliding_window": True,
      "vocab_size": 20000,
      "generator_config": {
          "embedding_size": 128,
          "hidden_size": 256,
          "num_hidden_layers": 3,
      },
      "discriminator_config": {
          "embedding_size": 128,
          "hidden_size": 256,
      },
  }

  train_file = "train.txt"
  test_file = "test.txt"

  model = LanguageModelingModel(
      "electra",
      None,
      args=train_args,
      train_files=train_file,
  )

  model.train_model(
      train_file, eval_file=test_file,
  )



In [None]:
wandb.agent(sweep_id,function=trainLM)
