## Creating the dataset
We need to prepare a dataset with two categories: *offer letter* and *not offer letter*.

In [3]:
# install dependencies
%pip install pandas scikit-learn --quiet

Note: you may need to restart the kernel to use updated packages.


In [4]:
import os
from pathlib import Path
import pandas as pd

# Define the paths to your directories (relative to the notebook's working directory)
dataset_dir = Path('./datasets') / 'offer_letter_detection'
offer_letters_dir = dataset_dir / 'offer_letters'
not_offer_letters_dir = dataset_dir / 'not_offer_letters'

# Collect one record per document: its text and its label
data = []

# Label 1 = offer letter, label 0 = not offer letter
for directory, label in [(offer_letters_dir, 1), (not_offer_letters_dir, 0)]:
    # sorted() keeps the row order reproducible across file systems
    for path in sorted(directory.iterdir()):
        if path.suffix == '.txt':
            # Assumes the text files are UTF-8 encoded
            data.append({'text': path.read_text(encoding='utf-8'), 'label': label})

# Create a DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the DataFrame
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/offer_letter_detection/offer_letters'
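
The `FileNotFoundError` above is raised when the two dataset folders are not present relative to the notebook's working directory. As a minimal sketch, the layout that the loading cell expects can be checked before any files are read. The helper name `missing_dataset_dirs` is hypothetical and not part of the original notebook; the folder names are taken from the cell above.

```python
from pathlib import Path


def missing_dataset_dirs(root):
    """Return the expected dataset folders that do not exist under root.

    Hypothetical helper for illustration (not part of the original notebook).
    """
    base = Path(root) / 'offer_letter_detection'
    expected = [base / 'offer_letters', base / 'not_offer_letters']
    return [d for d in expected if not d.is_dir()]


missing = missing_dataset_dirs('./datasets')
if missing:
    print('Create these folders and add .txt files to them:')
    for d in missing:
        print(f'  {d.resolve()}')
```

Printing the resolved paths shows where the notebook is actually looking, which is often the quickest way to spot a wrong working directory.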

## Splitting the dataset for training and evaluation
Let's use the `train_test_split` function from scikit-learn to hold out 20% of the documents for validation.

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and validation sets
train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)

# Display the first few rows
train_data.head()


| index | text | label |
|---|---|---|
| 8 | ABC Property Management\n456 Park Avenue\nCity... | 0 |
| 5 | [Your Company Logo]\nCulinary Delights Inc.\n2... | 1 |
| 2 | Future Tech Solutions\n101 Innovation Way\nSea... | 1 |
| 1 | August 7, 2024\n\nJane Doe\n123 Maple Street\n... | 1 |
| 11 | LEASE RENEWAL AGREEMENT\n\nThis Lease Renewal ... | 0 |
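
With only a handful of documents, a purely random split can leave the validation set with very few examples of one class. One option, sketched here on made-up placeholder data rather than the real dataset, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. The texts, labels, and split size below are assumptions chosen for illustration.

```python
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 6 offer letters, 6 other documents
texts = [f'document {i}' for i in range(12)]
labels = [1] * 6 + [0] * 6

# stratify=labels keeps the 50/50 class ratio in both splits;
# an integer test_size is the absolute number of validation examples
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=4, random_state=42, stratify=labels
)

print(sorted(val_labels))                     # class labels in the validation split
print(sum(train_labels), len(train_labels))   # offer letters vs. total in the training split
```

With the notebook's DataFrame, the equivalent call would pass `stratify=df['label']`.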


## Fine-tuning the model

In [None]:
# install dependencies
%pip install transformers datasets torch torchvision torchaudio --quiet



In [None]:
import torch
torch.cuda.is_available()

True

## Tokenize the datasets
Let's tokenize the text in our datasets so that we can feed it to our model.

In [None]:
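# Disable parallelism inside the tokenizers library while the datasets are mapped
os.environ["TOKENIZERS_PARALLELISM"] = "False"
# Let PyTorch's CUDA caching allocator use expandable segments to reduce memory fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"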

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import Dataset

# Load the tokenizer and model
model_name = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the datasets
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

train_dataset = Dataset.from_pandas(train_data).map(tokenize_function, batched=True)
val_dataset = Dataset.from_pandas(val_data).map(tokenize_function, batched=True)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Some weights of Phi3ForSequenceClassification were not initialized from the model checkpoint at microsoft/Phi-3-mini-4k-instruct and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/9 [00:00<?, ? examples/s]

Map:   0%|          | 0/3 [00:00<?, ? examples/s]
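
`padding='max_length'` and `truncation=True` together make every example the same length: longer sequences are cut, and shorter ones are filled with a pad token that the attention mask marks as ignorable. The toy function below only illustrates that idea on lists of integers. It is not the tokenizer's implementation, it pads on the right for simplicity (a real tokenizer may pad on either side), and `pad_or_truncate`, `max_length=6`, and `pad_id=0` are assumptions chosen for the example.

```python
def pad_or_truncate(token_ids, max_length=6, pad_id=0):
    """Toy illustration of fixed-length padding and truncation."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    attention_mask = [1] * len(ids)      # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill up to max_length
    attention_mask += [0] * n_pad        # 0 marks padding the model should ignore
    return {'input_ids': ids, 'attention_mask': attention_mask}


short_example = pad_or_truncate([11, 12, 13])
long_example = pad_or_truncate([21, 22, 23, 24, 25, 26, 27, 28])
print(short_example)  # {'input_ids': [11, 12, 13, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0, 0]}
print(long_example)   # {'input_ids': [21, 22, 23, 24, 25, 26], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

When no `max_length` is given, the tokenizer typically pads to the model's maximum input length, so passing a smaller `max_length` to the tokenizer call is one way to reduce memory use on short documents.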

## Fine-tune the model

In [None]:
# install dependencies
%pip install accelerate --quiet



In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "True"


In [None]:
from transformers import Trainer, TrainingArguments

# Prepare training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    use_cpu=True,  # force training on the CPU even if a GPU is available
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
You are not running the flash-attention implementation, expect numerical differences.
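
The `Trainer` above evaluates once per epoch, but without a metric function it reports only the evaluation loss. A `compute_metrics` callable can be passed to `Trainer` to add accuracy. The sketch below assumes the callable receives a `(logits, labels)` pair with one row of logits per example, and it is written in plain Python so that it can be tried without a model. The name `accuracy_metrics` is hypothetical.

```python
def accuracy_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair.

    Assumes logits has one row per example and one column per class.
    """
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {'accuracy': correct / len(predictions)}


# Toy check with made-up logits for three documents
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3]]
toy_labels = [1, 0, 0]
print(accuracy_metrics((toy_logits, toy_labels)))
```

It would be wired in by passing `compute_metrics=accuracy_metrics` when constructing the `Trainer`.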

