# Oil Markets Binary Classification

This notebook walks through a binary text classification example using the Hugging Face Transformers library on top of the PyTorch framework.

## Data Manipulation

Here I will show how pandas can help manipulate a small CSV file.

In [8]:
# Libraries required for the dataset and notebook display
import pandas as pd # To load the CSV
from datasets import Dataset # To convert the DataFrame into a Hugging Face Dataset
import tqdm as notebook_tqdm # For Jupyter display
import numpy as np # NumPy for zero-matrix creation and argmax

Here we load the CSV file, which uses `|` as its separator, into a pandas DataFrame.

In [9]:
df = pd.read_csv("./data/500_manually_labeled.csv",sep="|")
df.head()

Unnamed: 0,data,labels
0,"The consortium, led by U.S. firm EIG Global En...",Not Relevant
1,The unplanned shutdown two weeks ago of Syncru...,Supply Negative#Uncertain Label
2,Pemex lost $7.7 billion during the first half ...,Not Relevant
3,A spokesman for New York State Comptroller Tho...,Not Relevant
4,(Reuters) — Exxon Mobil Corporation on Tuesday...,Supply Negative


In the next cell we turn each label into its own column, with a binary value (1 if the row carries that label, 0 otherwise).

In [10]:
# Full data transformation
df.dropna(inplace=True)
# Split multi-label cells such as "Supply Negative#Uncertain Label" into lists
df['labels'] = df['labels'].str.split("#", expand=False)
labels = ['Prices Positive','Prices Negative','Supply Positive','Supply Negative','Demand Positive','Demand Negative','Future','Current','Intermediate','Not Relevant']
# Start with one all-zero integer column per label
for label in labels:
    df[label] = np.zeros(len(df), dtype=int)
# Set each label column to 1 where the row carries that label, then drop the original column
for category in labels:
    df[category] = df['labels'].apply(lambda cat: 1 if category in cat else 0)
df.drop(['labels'],axis=1,inplace=True)
df.head()

Unnamed: 0,data,Prices Positive,Prices Negative,Supply Positive,Supply Negative,Demand Positive,Demand Negative,Future,Current,Intermediate,Not Relevant
0,"The consortium, led by U.S. firm EIG Global En...",0,0,0,0,0,0,0,0,0,1
1,The unplanned shutdown two weeks ago of Syncru...,0,0,0,1,0,0,0,0,0,0
2,Pemex lost $7.7 billion during the first half ...,0,0,0,0,0,0,0,0,0,1
3,A spokesman for New York State Comptroller Tho...,0,0,0,0,0,0,0,0,0,1
4,(Reuters) — Exxon Mobil Corporation on Tuesday...,0,0,0,1,0,0,0,0,0,0
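
As an aside, the same one-hot encoding can be sketched more compactly with `Series.str.get_dummies`, which splits on a separator and builds the indicator columns in one call. The snippet below is only an illustration on a made-up three-row frame; the `toy` frame and the shortened `wanted` list are assumptions for the example, not part of the notebook's data.

```python
import pandas as pd

# Hypothetical three-row frame that mimics the shape of the labelled CSV
toy = pd.DataFrame({
    "data": ["text a", "text b", "text c"],
    "labels": ["Not Relevant", "Supply Negative#Uncertain Label", "Supply Negative"],
})

# Shortened label list for the example; the notebook uses ten labels
wanted = ["Supply Negative", "Not Relevant"]

# Split each cell on '#' and build one 0/1 column per distinct label
dummies = toy["labels"].str.get_dummies(sep="#")

# Keep only the wanted labels, adding an all-zero column for any that never occur
encoded = toy[["data"]].join(dummies.reindex(columns=wanted, fill_value=0))
print(encoded)
```

Labels outside the `wanted` list (here `Uncertain Label`) are dropped, which matches what the loop-based version does.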


Now we drop every label column except `Not Relevant`. With only about 500 examples, multi-label, multi-class classification over ten labels is too much to ask of a single neural network, so we reduce the task to a binary one: relevant versus not relevant. Only the `data` and `Not Relevant` columns are passed to the dataset.

In [11]:
dataset = Dataset.from_pandas(df[['data','Not Relevant']])
dataset = dataset.remove_columns('__index_level_0__') # Pandas clean up
dataset = dataset.rename_column('Not Relevant','label')
dataset = dataset.rename_column('data','text')
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 496
})
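
The `__index_level_0__` column removed above most likely appears because `dropna` leaves gaps in the DataFrame index, so the index is no longer a plain range and gets carried over as a regular column. A small sketch of the effect on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame with one missing value
toy = pd.DataFrame({"data": ["a", None, "c"], "Not Relevant": [1, 0, 0]})

toy = toy.dropna()
print(list(toy.index))   # the surviving row labels, with a gap where the row was dropped

# Resetting the index restores a plain 0..n-1 range
clean = toy.reset_index(drop=True)
print(list(clean.index))
```

If the index is reset before conversion (or `preserve_index=False` is passed to `Dataset.from_pandas`), the extra column should not be created, and the `remove_columns` call would then no longer be needed.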

Create a train and a test split, holding out 10% of the examples for testing.

In [12]:
dataset = dataset.train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 446
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50
    })
})
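
The split sizes above can be reproduced with a little arithmetic. This sketch assumes the library rounds the test size up and gives the remainder to the train split, which is consistent with the 446/50 split printed above.

```python
import math

n_rows = 496        # rows in the dataset printed earlier
test_size = 0.1     # fraction passed to train_test_split

# Assumption: the test size is rounded up and the train split takes the rest
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
print(n_train, n_test)  # 446 50
```

The split is random, so the rows that end up in each part change between runs; passing a `seed` argument to `train_test_split` makes it reproducible.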

## Tokenization

This is where the text is turned into tokens. A neural network cannot read raw text, so each sentence has to be converted into a sequence of integer token IDs that the model understands.

In [13]:
from transformers import AutoTokenizer

hugging_face_model_name = "models/econbert_2/model" #Best production: "roberta-base" "bert-base-uncased" "distilbert-base-uncased" "microsoft/deberta-v3-base"

tokenizer = AutoTokenizer.from_pretrained(hugging_face_model_name)
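
To make the idea of tokenization concrete, here is a deliberately simplified sketch. It is not how the real tokenizer works (that one splits text into learned subword units); the `vocab` dictionary and the `toy_encode` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of subword entries
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
         "oil": 4, "supply": 5, "is": 6, "falling": 7}

def toy_encode(text, max_length=None):
    """Map whitespace-separated words to IDs, wrapped in [CLS] ... [SEP]."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]
    if max_length is not None:
        ids = ids[: max_length - 2]  # leave room for the two special tokens
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

full = toy_encode("Oil supply is falling")
truncated = toy_encode("Oil supply is falling fast", max_length=5)
print(full)       # [1, 4, 5, 6, 7, 2]
print(truncated)  # [1, 4, 5, 6, 2]
```

The second call shows the effect of truncation: everything beyond the length limit is cut off before the closing special token is added.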

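Create a preprocessing function that tokenizes the text with `truncation=True`, which is meant to cap sequences at the model's maximum input length. Truncation only takes effect when the tokenizer knows that length; as the warning in the tokenization output further down shows, this checkpoint defines none, so pass `max_length` explicitly (for example 512, the `max_position_embeddings` value in the model config) if sequences need to be capped: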

In [14]:
def preprocess_function(examples):
    return tokenizer(examples['text'],truncation=True)

Here we use the `map` method of the Datasets library to tokenize every example. Setting `batched=True` passes the examples to the tokenizer in batches, which is considerably faster than tokenizing them one at a time.

In [15]:
tokenized_data = dataset.map(preprocess_function, batched=True)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 1/1 [00:00<00:00, 20.42ba/s]
100%|██████████| 1/1 [00:00<00:00, 134.66ba/s]


The data collator is used later during training to pad the sentences in each batch to the same length, so that the model can read them as a single tensor.

In [16]:
from transformers import DataCollatorWithPadding

In [17]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
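
What padding does to a batch can be sketched in a few lines of plain Python. The `pad_batch` helper below is hypothetical and only mirrors the idea: every sequence is padded to the longest one in its batch, and an attention mask marks which positions are real tokens. A pad ID of 0 is assumed here, matching the `pad_token_id` shown in the model config later in the notebook.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[1, 4, 5, 2], [1, 4, 5, 6, 7, 2]])
print(padded["input_ids"])       # [[1, 4, 5, 2, 0, 0], [1, 4, 5, 6, 7, 2]]
print(padded["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

Padding per batch rather than to one global length keeps batches of short sentences small, which saves computation.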

## Create an Evaluate function

In [18]:
import evaluate

accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
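
To see what `compute_metrics` does, the same calculation can be done by hand with NumPy. The logits and labels below are made up for illustration; the model outputs one score per class, and the predicted class is the index of the largest score.

```python
import numpy as np

# Made-up logits for four examples and two classes
logits = np.array([[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]])
labels = np.array([0, 1, 0, 0])

predictions = np.argmax(logits, axis=1)         # index of the largest logit per row
acc = float((predictions == labels).mean())     # share of predictions that match the labels
print(predictions.tolist(), acc)  # [0, 1, 1, 0] 0.75
```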

## Create and train the Model

In [20]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

id2label = {0: "RELEVANT", 1: "NOT RELEVANT"}
label2id = {"RELEVANT": 0, "NOT RELEVANT": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    hugging_face_model_name, 
    num_labels=2, 
    id2label=id2label, 
    label2id=label2id
)

Some weights of the model checkpoint at models/econbert_2/model were not used when initializing DebertaV2ForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at m
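
The two label mappings in the cell above must be exact inverses of each other, and they must agree with how the `label` column was built. One way to rule out a mismatch, shown here as a small sketch, is to derive one mapping from the other:

```python
id2label = {0: "RELEVANT", 1: "NOT RELEVANT"}

# Derive the inverse mapping instead of typing it twice
label2id = {name: idx for idx, name in id2label.items()}
print(label2id)  # {'RELEVANT': 0, 'NOT RELEVANT': 1}

# The dataset's `label` column came from the 'Not Relevant' indicator,
# so a value of 1 has to map to "NOT RELEVANT"
assert id2label[1] == "NOT RELEVANT"
```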

Here we define the training arguments, that is, the fine-tuning hyperparameters passed to the `Trainer`. In most cases the defaults work well.

In [21]:
training_args = TrainingArguments(
    output_dir=f"outputs/{hugging_face_model_name}",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
)
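
These arguments determine how many optimization steps the run takes. The arithmetic below assumes a single device and no gradient accumulation; the result agrees with the "Total optimization steps = 140" line in the training log further down.

```python
import math

num_train_examples = 446   # size of the train split
batch_size = 16            # per_device_train_batch_size, assuming a single device
num_epochs = 5             # num_train_epochs

steps_per_epoch = math.ceil(num_train_examples / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 28 140
```

Because `save_strategy="epoch"`, a checkpoint is written every 28 steps: `checkpoint-28`, `checkpoint-56`, and so on.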

In [22]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [23]:
train_output = trainer.train() # Returns a TrainOutput (step count, training loss, metrics), not the model itself

The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 446
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 140
  Number of trainable parameters = 141896450
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.612237,0.66
2,No log,0.490545,0.74
3,No log,0.54466,0.74
4,No log,0.523005,0.78
5,No log,0.593316,0.78


The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 50
  Batch size = 16
Saving model checkpoint to outputs/models/econbert_2/model/checkpoint-28
Configuration saved in outputs/models/econbert_2/model/checkpoint-28/config.json
Model weights saved in outputs/models/econbert_2/model/checkpoint-28/pytorch_model.bin
tokenizer config file saved in outputs/models/econbert_2/model/checkpoint-28/tokenizer_config.json
Special tokens file saved in outputs/models/econbert_2/model/checkpoint-28/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DebertaV2ForSequenceClassification.for
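
The per-epoch table in the training output can be used to work out which checkpoint is the best one. The sketch below assumes the best model is chosen by lowest validation loss, which, as far as I know, is what the `Trainer` does when `load_best_model_at_end=True` is set without a `metric_for_best_model`.

```python
# Validation loss per epoch, copied from the table in the training output
val_loss = {1: 0.612237, 2: 0.490545, 3: 0.54466, 4: 0.523005, 5: 0.593316}
steps_per_epoch = 28  # ceil(446 / 16); a checkpoint is saved at the end of each epoch

best_epoch = min(val_loss, key=val_loss.get)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"
print(best_epoch, best_checkpoint)  # 2 checkpoint-56
```

Note that accuracy keeps improving after epoch 2 (0.78 at epochs 4 and 5) while the validation loss gets worse, so ranking by accuracy instead would pick a different checkpoint.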

## Using the model

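Here we load the fine-tuned model with the Hugging Face `pipeline` helper, which makes it easy to run predictions on new sentences. The `"sentiment-analysis"` task name is an alias for text classification. We load `checkpoint-56`, saved at the end of epoch 2 (28 optimization steps per epoch), which had the lowest validation loss in the table above.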

In [24]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis",model=f"outputs/{hugging_face_model_name}/checkpoint-56")


loading configuration file outputs/models/econbert_2/model/checkpoint-56/config.json
Model config DebertaV2Config {
  "_name_or_path": "outputs/models/econbert_2/model/checkpoint-56",
  "architectures": [
    "DebertaV2ForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "RELEVANT",
    "1": "NOT RELEVANT"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "NOT RELEVANT": 1,
    "RELEVANT": 0
  },
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": tru

Here we run six tests. Going by the expected labels in the comments, the first, fourth and last sentences should be classified as relevant.

In [25]:
classifier("The oil supply is starting to degrade!")#Relevant

[{'label': 'RELEVANT', 'score': 0.9376997947692871}]

In [26]:
classifier("Today Lisa Laflamme was let go by Bell.")#Not Relevant

[{'label': 'NOT RELEVANT', 'score': 0.6796793341636658}]

In [27]:
classifier("What do you mean we want some natural gas?")#Not Relevant

[{'label': 'NOT RELEVANT', 'score': 0.6778207421302795}]

In [28]:
classifier("The supply is lacking in Germany, because of the tensions with Russia")#Relevant

[{'label': 'NOT RELEVANT', 'score': 0.5414666533470154}]

In [29]:
classifier("The need for Natural Gas has increased in Toronto.")#Not Relevant

[{'label': 'NOT RELEVANT', 'score': 0.654867947101593}]

In [30]:
classifier("The lack of demand in China has reduced the price of the barrel")#Relevant

[{'label': 'RELEVANT', 'score': 0.9461528062820435}]
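
The six results above can be tallied against the expected labels from the cell comments. The `threshold` value in this sketch is a hypothetical cut-off chosen for illustration, not something the notebook defines.

```python
# (expected label from the cell comments, predicted label, score) for the six tests above
results = [
    ("RELEVANT", "RELEVANT", 0.9376997947692871),
    ("NOT RELEVANT", "NOT RELEVANT", 0.6796793341636658),
    ("NOT RELEVANT", "NOT RELEVANT", 0.6778207421302795),
    ("RELEVANT", "NOT RELEVANT", 0.5414666533470154),
    ("NOT RELEVANT", "NOT RELEVANT", 0.654867947101593),
    ("RELEVANT", "RELEVANT", 0.9461528062820435),
]

correct = sum(expected == predicted for expected, predicted, _ in results)
print(f"{correct} of {len(results)} correct")  # 5 of 6 correct

# Hypothetical threshold: flag predictions the model is unsure about
threshold = 0.7
unsure = [i + 1 for i, (_, _, score) in enumerate(results) if score < threshold]
print(unsure)  # [2, 3, 4, 5]
```

The only miss is the fourth sentence, which also has the lowest score (about 0.54). All four "NOT RELEVANT" predictions fall below the example threshold, so low-confidence predictions like these could be sent for manual review.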