# Oil Markets Binary Classification

This notebook shows an example of binary text classification using the Transformers library on top of the PyTorch framework.

## Data Manipulation

Here I show how pandas can help manipulate a small CSV file.

In [None]:
# Libraries required for dataset and Notebook visualization
import pandas as pd # To load the CSV
from datasets import Dataset # To transform into a Transformer Dataset
import tqdm as notebook_tqdm # For Jupyter display  
import numpy as np # NumPy for array operations
from sklearn.model_selection import train_test_split # To split the dataset

Here we load the CSV file, which uses | as its separator, into a pandas DataFrame.

In [None]:
df = pd.read_csv("./data/500_manually_labeled.csv",sep="|")
df.head()

In the next cell we turn the labels into column headers, each holding a binary value (1 if the row carries that label, 0 otherwise).

In [None]:
# Full data transformation
df.dropna(inplace=True)  # Drop rows with missing values
df['labels'] = df['labels'].str.split("#", expand=False)  # "A#B" -> ["A", "B"]
labels = ['Prices Positive','Prices Negative','Supply Positive','Supply Negative','Demand Positive','Demand Negative','Future','Current','Intermediate','Not Relevant']
# Replace the 'labels' column with one 0/1 column per category
for category in labels:
    df[category] = df['labels'].apply(lambda cat: 1 if category in cat else 0)
df.drop(['labels'],axis=1,inplace=True)
df.head()
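
To make the transformation concrete, here is a minimal plain-Python sketch of what happens to a single row. The helper name to_binary_columns and the sample label string are made up for illustration; only the label names and the # separator come from the cell above.

```python
# Plain-Python version of the per-row transformation (no pandas needed).
LABELS = ['Prices Positive', 'Prices Negative', 'Supply Positive', 'Supply Negative',
          'Demand Positive', 'Demand Negative', 'Future', 'Current',
          'Intermediate', 'Not Relevant']

def to_binary_columns(raw_labels, labels=LABELS):
    """Split a '#'-separated label string and return {label: 0 or 1}."""
    present = raw_labels.split("#")
    return {label: int(label in present) for label in labels}

# Hypothetical row value, invented for this example.
row = to_binary_columns("Supply Negative#Current")
print(row["Supply Negative"], row["Current"], row["Not Relevant"])  # 1 1 0
```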

Now we will drop all the label columns except Not Relevant. With only 500 labeled examples, full multi-label, multi-class classification would be too much for a single neural network to learn, so we pass only the text and the Not Relevant column to the dataset.

In [None]:
# preserve_index=False keeps the pandas index out of the dataset,
# so no '__index_level_0__' column has to be removed afterwards
dataset = Dataset.from_pandas(df[['data','Not Relevant']], preserve_index=False)
dataset = dataset.rename_column('Not Relevant','label')
dataset = dataset.rename_column('data','text')
dataset

Create the train and test splits, holding out 10% of the rows for testing.

In [None]:
dataset = dataset.train_test_split(test_size=0.1)
dataset
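
The split above shuffles the rows, so it changes from run to run unless a seed is fixed (train_test_split accepts a seed argument for this). The sketch below shows the idea in plain Python. The helper split_indices, the seed value, and the row count of 500 (which assumes no rows were dropped) are assumptions for illustration, and the library's own rounding may differ.

```python
import random

def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# Assumed row count, for illustration only.
train_idx, test_idx = split_indices(500)
print(len(train_idx), len(test_idx))
```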

## Tokenization

This is where the text is turned into tokens. We do this because the text has to be converted into numbers that the neural network can understand.

In [None]:
from transformers import AutoTokenizer

hugging_face_model_name = "microsoft/deberta-v3-base" # DeBERTa v3 (base size assumed); best in production: "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(hugging_face_model_name)
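
As a rough picture of what a tokenizer does, here is a toy sketch that maps words to integer ids. The vocabulary and the function toy_tokenize are invented for illustration. The real tokenizer loaded above works on subword pieces with a much larger learned vocabulary, so its ids will look nothing like these.

```python
# Toy vocabulary, invented for illustration only.
TOY_VOCAB = {"[UNK]": 0, "the": 1, "oil": 2, "supply": 3, "is": 4, "falling": 5}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Lower-case, split on whitespace, and map each word to an integer id."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

ids = toy_tokenize("The oil supply is shrinking")
print(ids)  # the unknown word "shrinking" falls back to the [UNK] id
```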

Create a preprocessing function to tokenize the text and truncate sequences so they are no longer than the model's maximum input length:

In [None]:
def preprocess_function(examples):
    return tokenizer(examples['text'],truncation=True)

Here we use the map method of the Datasets library to tokenize every sentence. Setting batched=True passes many examples to the tokenizer in each call, which is much faster than tokenizing them one at a time.

In [None]:
tokenized_data = dataset.map(preprocess_function, batched=True)

The data collator will be used later during training to ensure proper padding, so that all sentences in a batch have the same length when they are read by the model.

In [None]:
from transformers import DataCollatorWithPadding

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
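
The padding step can be sketched in plain Python as follows. This is only an illustration of the idea of padding each batch to its longest sequence. The function pad_batch and the pad id of 0 are assumptions, since the real collator takes the pad token id from the tokenizer and returns tensors.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three sentences of different lengths.
batch = pad_batch([[5, 8, 2], [7], [4, 9]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length is chosen per batch, short batches stay short instead of being padded to the model's maximum input length.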

## Create an evaluation function

In [None]:
import evaluate

accuracy = evaluate.load('accuracy')

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
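
To see what compute_metrics does with the model output, here is a small plain-Python sketch of the same two steps: take the index of the largest logit per row, then compare with the true labels. The logits and labels are made up, and the helper names argmax_rows and accuracy_score are hypothetical stand-ins for np.argmax and the accuracy metric.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (what np.argmax(..., axis=1) returns)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy_score(predictions, references):
    """Share of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

# Made-up logits for four sentences, two classes each.
logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]
labels = [0, 1, 0, 0]

preds = argmax_rows(logits)
result = accuracy_score(preds, labels)
print(preds, result)  # [0, 1, 1, 0] {'accuracy': 0.75}
```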

## Create and train the Model

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

id2label = {0: "RELEVANT", 1: "NOT RELEVANT"}
label2id = {"RELEVANT": 0, "NOT RELEVANT": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    hugging_face_model_name, 
    num_labels=2, 
    id2label=id2label, 
    label2id=label2id
)
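
The id2label mapping is what turns the model's numeric output back into a readable name. A minimal sketch of that decoding step, assuming a softmax over the two logits, could look like this. The logit values and the decode helper are invented for illustration.

```python
import math

id2label = {0: "RELEVANT", 1: "NOT RELEVANT"}

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return the label name and probability of the most likely class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits for one sentence.
label, score = decode([0.3, 1.7])
print(label, round(score, 3))
```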

Here we create the training arguments, which are the fine-tuning hyperparameters for the model. In most cases the defaults work well.

In [None]:
training_args = TrainingArguments(
    output_dir="outputs/oilmarkets",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)
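
With save_strategy="epoch", a checkpoint folder named checkpoint-<step> is written at the end of each epoch, so it helps to know how many optimizer steps an epoch takes. The sketch below does that arithmetic under some assumptions: a single device, no gradient accumulation, and the last partial batch kept. The training set size of 440 is hypothetical, since the real number depends on how many rows survive dropna and the split.

```python
import math

def total_optimizer_steps(n_train, batch_size=16, epochs=5):
    """Steps per epoch and total steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # the last partial batch still counts
    return steps_per_epoch, steps_per_epoch * epochs

# Hypothetical training set size, for illustration only.
per_epoch, total = total_optimizer_steps(440)
print(per_epoch, total)  # 28 140
```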

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
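# train() returns a TrainOutput (step count, training loss, metrics), not the model itself;
# the best checkpoint is loaded into trainer.model because load_best_model_at_end=True
train_output = trainer.train()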

## Using the model

Here we load the model using the pipeline function from Hugging Face. This makes it easy to run predictions on new sentences.

In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis",model="outputs/oilmarkets/checkpoint-140")


Here we will run 4 tests; only the first and last should be relevant.

In [None]:
classifier("The oil supply is starting to degrade!")

In [None]:
classifier("Today Lisa Laflamme was let go by Bell.")

In [None]:
classifier("What do you mean we want some natural gas?")

In [None]:
classifier("The supply is lacking in Germany, because of the tensions with Russia")

In [None]:
# I know this one will be a mistake
classifier("The need for Natural Gas has increased in Toronto.")
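
Each classifier call above returns a list with one dict per input, holding a label and a score. A small helper like the one below can turn that into a yes/no answer. The function is_relevant, the threshold, and the sample output values are all hypothetical. They are not results from the trained model.

```python
def is_relevant(pipeline_output, threshold=0.5):
    """True if the top prediction is RELEVANT with at least `threshold` confidence."""
    top = pipeline_output[0]  # one dict per input sentence
    return top["label"] == "RELEVANT" and top["score"] >= threshold

# Made-up outputs in the shape the pipeline returns; not real model results.
example_relevant = [{"label": "RELEVANT", "score": 0.91}]
example_not_relevant = [{"label": "NOT RELEVANT", "score": 0.99}]

print(is_relevant(example_relevant), is_relevant(example_not_relevant))  # True False
```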