# Skill Prediction Model - Fine-Tuning with BERT

In [2]:
import pandas as pd

# Load the custom CSV dataset
df = pd.read_csv('/kaggle/input/reskill-dataset-v1/reskill_dataset_v1.csv')
df.head()

| Unnamed: 0 | resume_text | skills |
|---|---|---|
| 0 | david garcia lopez daviddgl@gmail.com https //... | ionic, flutter, aws, bdd, ionic, flutter, node... |
| 1 | manjunath email manjunathjava261@gmail.com mob... | java, j2ee, spring, hibernate, jdbc, servlets,... |
| 2 | avinash kumar itpl main road 6th cross kundalh... | html, html5, css, javascript, jquery, bootstra... |
| 3 | contact ibropamela@gmail.com +44 7383 151 935 ... | microsoft office, hris, adp, opentable, seven ... |
| 4 | vandana . salesforce consultant profile deadli... | salesforce, lightning, apex, javascript, visua... |


#### Format Your Data: To predict skills from resume text, pair each resume with the list of skills it contains. The `skills` column stores them as one comma-separated string per resume, so it has to be split into a list first.

In [4]:
# # Assuming 'resume_text' and 'skills' are your column names
# df['skills'] = df['skills'].apply(lambda x: x.split(','))  # Convert skills to a list

#### Multi-Label Encoding: Since each resume can have multiple skills, you need to convert the skills into a binary format where each skill corresponds to a column. If a skill is present in a resume, its column will have a value of 1; otherwise, it will be 0.

In [3]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['skills'].apply(lambda x: x.split(',')))  # Splits on commas only; whitespace around each skill is kept
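
The `skills` strings shown earlier separate entries with a comma followed by a space, so splitting on `,` alone leaves leading spaces, and `" flutter"` and `"flutter"` end up as different labels. Below is a minimal sketch of cleaning the lists before binarizing, using a made-up two-row sample. The helper name `split_skills` is hypothetical and not part of the notebook.

```python
def split_skills(raw):
    """Split a comma-separated skills string into a sorted list of unique, stripped names."""
    return sorted({s.strip() for s in raw.split(',') if s.strip()})

# Made-up sample rows in the same "a, b, c" style as the skills column
rows = ["flutter, aws, flutter", "aws,flutter"]

# Label sets produced by the naive split versus the cleaned split
naive = {s for r in rows for s in r.split(',')}
cleaned = {s for r in rows for s in split_skills(r)}

print(len(naive), sorted(naive))      # 4 distinct labels, two of them with a leading space
print(len(cleaned), sorted(cleaned))  # 2 distinct labels
```

The cleaned lists could then be passed to the binarizer, for example `mlb.fit_transform(df['skills'].apply(split_skills))`. How much this would shrink the label space of this dataset has not been checked here.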

#### Tokenize the Resume Text: Use the BERT tokenizer to convert the resume text into input IDs and attention masks.

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

def tokenize_function(resume_texts):
    return tokenizer(resume_texts, padding="max_length", truncation=True, return_tensors="pt")

tokenized_data = tokenize_function(df['resume_text'].tolist())

In [13]:
tokenized_data[0]

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
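
The `num_tokens=512` in this output follows from `padding="max_length"` and `truncation=True`: every resume is cut or padded to the model's maximum length, which is 512 for this checkpoint. Here is a toy sketch of that behavior on plain integer lists. It is not the tokenizer's actual implementation (the real one also keeps the special tokens in place when truncating), and `pad_or_truncate`, the pad id `0`, and the token ids are illustrative assumptions.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                          # truncation: drop tokens past the limit
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

# Arbitrary token ids, chosen only for illustration
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)  # padded on the right, mask marks the padding with 0
print(long_ids, long_mask)    # truncated to the first 6 ids, mask is all 1
```

Because every sequence is padded to the full 512 tokens regardless of its real length, the memory used per batch depends only on the batch size.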

#### Create a Custom Dataset: You may want to create a custom dataset class to handle the multi-label outputs. This class should return the input IDs, attention masks, and the corresponding labels for each training sample.

In [22]:
import torch
from torch.utils.data import Dataset

class ResumeDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
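        # Encodings are already tensors (return_tensors="pt"), so copy them instead of re-wrapping with torch.tensor
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.float)  # BCEWithLogitsLoss expects float labels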
        return item

    def __len__(self):
        return len(self.labels)

In [23]:
dataset = ResumeDataset(tokenized_data, y)

In [24]:
dataset

<__main__.ResumeDataset at 0x7e09ed873c10>

#### Load the BERT Model: Use the AutoModelForSequenceClassification class to load a pre-trained BERT model with the appropriate number of labels.

In [17]:
from transformers import AutoModelForSequenceClassification

num_labels = len(mlb.classes_)
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Use Appropriate Loss Function: Since this is a multi-label classification problem, use a loss function suitable for multi-label tasks, such as BCEWithLogitsLoss. This will allow the model to predict probabilities for each skill independently.

In [18]:
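# Assigning model.loss_fct has no effect: BertForSequenceClassification chooses its loss
# inside forward() based on config.problem_type, so set that instead.
model.config.problem_type = "multi_label_classification"  # uses BCEWithLogitsLoss internally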
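
For reference, `BCEWithLogitsLoss` applies a sigmoid to each logit and averages the binary cross-entropy over all labels, which is why each skill is scored independently of the others. The following is a dependency-free sketch of that computation for a single example with the default mean reduction. `bce_with_logits` is an illustrative name, not a library function, and PyTorch uses a numerically more stable formulation than this direct one.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid: per-label probability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# Three labels: two uncertain predictions (logit 0) and one confident correct one
loss = bce_with_logits([0.0, 0.0, 2.0], [1.0, 0.0, 1.0])
print(round(loss, 4))
```

A logit of 0 corresponds to a probability of 0.5 and contributes ln 2 to the sum whatever the target is, so an untrained classifier head starts with a loss near 0.69.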

#### Training Loop: Set up your training loop using the Trainer class from the Hugging Face library, ensuring that you pass the dataset and specify evaluation strategies.

In [21]:
print(y.shape)  # Should be (num_samples, num_skills)
print(len(df))  # Should be the same as y.shape[0]

(10325, 43455)
10325
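
The printed shape means the label matrix has 43455 columns, one per distinct skill string. Below is a back-of-the-envelope estimate of what that width costs. It assumes the dense matrix stores 8-byte integers (the usual default on 64-bit Linux, not verified for this run) and uses BERT-base's hidden size of 768.

```python
num_samples, num_labels = 10325, 43455   # from y.shape above
hidden_size = 768                        # BERT-base hidden size
bytes_per_int = 8                        # assumption: int64 label matrix

# Memory for the dense label matrix, and parameter count of the new classifier head
label_matrix_gib = num_samples * num_labels * bytes_per_int / 2**30
classifier_params = hidden_size * num_labels + num_labels  # weight matrix + bias

print(f"dense label matrix: {label_matrix_gib:.2f} GiB")
print(f"classifier head parameters: {classifier_params:,}")
```

`MultiLabelBinarizer(sparse_output=True)` returns a sparse matrix instead, which may be worth considering at this width.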


In [26]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="output_dir",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=347640,  # Far too large: this is the number of samples per GPU step, not the dataset size
    per_device_eval_batch_size=347640,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,  # Ideally, this should be a separate validation dataset
)

trainer.train()


OutOfMemoryError: CUDA out of memory. Tried to allocate 15.12 GiB. GPU 0 has a total capacity of 15.89 GiB of which 10.81 GiB is free. Process 7005 has 5.07 GiB memory in use. Of the allocated memory 4.69 GiB is allocated by PyTorch, and 97.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
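
The failed 15.12 GiB allocation is consistent with one float32 activation tensor covering the whole dataset at once, since a batch size larger than the dataset puts all 10325 padded resumes into a single batch. The sketch below checks that arithmetic. Attributing the allocation to a tensor of shape (batch, 512, 768) is an inference from the numbers, not something the traceback states, and `activation_gib` is an illustrative helper.

```python
def activation_gib(batch_size, seq_len=512, hidden_size=768, bytes_per_value=4):
    """Size in GiB of one float32 tensor of shape (batch_size, seq_len, hidden_size)."""
    return batch_size * seq_len * hidden_size * bytes_per_value / 2**30

print(f"{activation_gib(10325):.2f} GiB")     # whole dataset in one batch: 15.12 GiB
print(f"{activation_gib(8) * 1024:.0f} MiB")  # batch of 8: 12 MiB
```

A small per-device batch such as 8 or 16 is a more typical starting point on a 16 GB GPU, optionally combined with `gradient_accumulation_steps` in `TrainingArguments` to reach a larger effective batch. Which value actually fits was not tested here.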