# Resume Generator AI
## Description of project
This model was written to generate CVs for job offers. It is based on GPT-2, fine-tuned on resume data and on abstracts used for creating context.
## State of project
The project is suspended for now, until I get better hardware or find an optimization method. Training this model on my GPU takes several hours, and even with the current training, AI hallucination occurs frequently.
## Improvement concepts
* Better data preparation
* Training optimization
* Context optimization
* AI hallucination prevention
* More data
### Code

#### Imports

In [1]:
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # the AdamW re-exported by transformers is deprecated
from transformers import get_scheduler, GPT2LMHeadModel, GPT2Tokenizer
from IPython.display import clear_output
import torch



#### Model config

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
#model.resize_token_embeddings(130000)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

#### Data import & stats

In [4]:
resume_csv_files = ['Resume.csv']
abstract_csv_files = ['Abstracts.csv']
resumes = pd.concat([pd.read_csv(file) for file in resume_csv_files], ignore_index=True)
resumes['input_text'] = resumes.apply(lambda x: f"Category: {x['Category']}\nBody CV: {x['Resume_str']}", axis=1)

abstracts = pd.concat([pd.read_csv(file) for file in abstract_csv_files], ignore_index=True)
abstracts = abstracts[:5000].copy()  # copy so the new column below is not written to a slice
abstracts['input_text'] = abstracts.apply(lambda x: f"Subject: {x['subject']}\nTitle CV: {x['title']}\nAbstract: {x['abstract']}", axis=1)
train_texts = resumes['input_text'].to_list() + abstracts['input_text'].to_list()
print(train_texts[-20:])

['Subject: other condensed matter\nTitle CV: Electrically injected cavity polaritons\nAbstract:   We have realised a semiconductor quantum structure that produces electroluminescence while operating in the light-matter strong coupling regime. The mid-infrared light emitting device is composed of a quantum cascade structure embedded in a planar microcavity, based on the GaAs/AlGaAs material system. At zero bias, the structure is characterised using reflectivity measurements which show, up to room temperature, a wide polariton anticrossing between an intersubband transition and the resonant cavity photon mode. Under electrical injection the spectral features of the emitted light change drastically, as electrons are resonantly injected in a reduced part of the polariton branches. Our experiment demonstrates that electrons can be selectively injected into polariton states up to room temperature. ', 'Subject: other condensed matter\nTitle CV: Nonlinear tunneling in two-dimensional lattices\
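
The sample above shows stray whitespace inside the abstract field ("Abstract:   We have realised"). As one possible take on the "Better data preparation" item, here is a minimal sketch that collapses whitespace and drops exact duplicates. The helper names `normalize_text` and `dedupe` are hypothetical and not part of the notebook. Because it also collapses newlines, it would be applied to the raw column values before they are placed into the `input_text` template.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def dedupe(texts):
    """Normalize each text, then drop empty entries and exact duplicates (first one wins)."""
    seen = set()
    unique = []
    for text in texts:
        cleaned = normalize_text(text)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


# Hypothetical sample values, not taken from the real CSV files
sample = [
    "Abstract:   We have  realised a structure. ",
    "Abstract: We have realised a structure.",
    "   ",
]
cleaned_sample = dedupe(sample)
```

With pandas, something like `abstracts['abstract'].map(normalize_text)` would apply the same cleanup per column, assuming the column holds strings.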

#### Data prep

In [5]:
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encodings = self.tokenizer(self.texts[idx], truncation=True, max_length=self.max_length, return_tensors="pt")
        # squeeze(0) removes only the batch dimension, so a one-token text stays a 1-D tensor
        input_ids = encodings['input_ids'].squeeze(0)
        attention_mask = encodings['attention_mask'].squeeze(0)
        return input_ids, attention_mask
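
With `truncation=True` and `max_length=512`, everything past the first 512 tokens of a resume is discarded. One way to use that text (relating to the "More data" item) is to split each token sequence into overlapping windows. The sketch below works on plain lists of token IDs; `chunk_token_ids` and its `stride` parameter are hypothetical names, and this is not what the notebook currently does.

```python
def chunk_token_ids(token_ids, max_length=512, stride=256):
    """Split a token list into windows of at most max_length tokens.

    Consecutive windows start `stride` tokens apart, so they overlap
    by max_length - stride tokens.
    """
    if max_length <= 0 or not 0 < stride <= max_length:
        raise ValueError("need max_length > 0 and 0 < stride <= max_length")
    if not token_ids:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reached the end of the sequence
        start += stride
    return chunks


# Toy example with small numbers so the windows are easy to inspect
toy_chunks = chunk_token_ids(list(range(10)), max_length=4, stride=2)
```

Each window could then become its own training example, at the cost of more batches per epoch.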

#### Padding

In [6]:
def collate_fn(batch):
    input_ids = [item[0] for item in batch]
    attention_masks = [item[1] for item in batch]

    # Find the longest sequence in the batch
    max_len = max(len(ids) for ids in input_ids)

    # Pad the sequences to the same length. Input IDs are padded with the pad token
    # (set to EOS in the model config cell) instead of 0, because ID 0 is the "!" token in GPT-2.
    padded_input_ids = [torch.cat([ids, torch.full((max_len - len(ids),), tokenizer.pad_token_id, dtype=torch.long)]) for ids in input_ids]
    padded_attention_masks = [torch.cat([mask, torch.zeros(max_len - len(mask), dtype=torch.long)]) for mask in attention_masks]

    # Return the stacked tensors
    return torch.stack(padded_input_ids), torch.stack(padded_attention_masks)
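
Since every batch is padded to its longest member, mixing short abstracts with long resumes spends a lot of compute on pad tokens. A rough sketch of how much grouping by length could save (one idea for the "Training optimization" item) is below. `padding_tokens` is a hypothetical helper and the lengths are made-up numbers, not measurements from this dataset.

```python
def padding_tokens(lengths, batch_size):
    """Count pad tokens needed when each consecutive group of batch_size
    sequences is padded to the longest sequence in that group."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# Made-up token counts: long resumes interleaved with short abstracts
lengths = [500, 20, 480, 30, 510, 25, 490, 40]

unsorted_waste = padding_tokens(lengths, batch_size=4)
sorted_waste = padding_tokens(sorted(lengths), batch_size=4)
```

Fully sorting the data removes the randomness of `shuffle=True`, so in practice one would more likely sort within large shuffled chunks; that trade-off is not explored here.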

#### Training Config

In [7]:
train_dataset = TextDataset(train_texts, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
num_training_steps = num_epochs * len(train_loader)

scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
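
For reference, the "linear" schedule with `num_warmup_steps=0` lowers the learning rate in a straight line from `lr` to zero over `num_training_steps`. The sketch below reproduces that shape in plain Python as I understand it; `linear_lr` is a hypothetical helper, and the 1000-step total is an arbitrary example value rather than this notebook's real step count.

```python
def linear_lr(step, base_lr=5e-5, num_warmup_steps=0, num_training_steps=1000):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


lr_start = linear_lr(0)
lr_middle = linear_lr(500)
lr_end = linear_lr(1000)
```

Under these assumptions the rate is `5e-5` at step 0, half of that at the midpoint, and zero at the final step.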



#### Training

In [8]:
model.train()
print("Starting Training")
for epoch in range(num_epochs):
    # During training, the data (input_ids, attention_mask) must also be moved to the GPU
    for i, batch in enumerate(train_loader):
        print(f"executing batch {i}/{len(train_loader)}")
        clear_output(True)
        input_ids, attention_masks = batch
        input_ids = input_ids.to(device)
        attention_masks = attention_masks.to(device)
    
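        # Forward pass; padded positions get label -100 so the loss ignores them
        labels = input_ids.masked_fill(attention_masks == 0, -100)
        outputs = model(input_ids, attention_mask=attention_masks, labels=labels)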
        loss = outputs.loss
        loss.backward()
    
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    print(f"Epoch {epoch + 1} finished with loss {loss.item()}")



Epoch 3 finished with loss 1.440152645111084
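
The value printed above is the loss of the last batch in the epoch only, which can be noisy. A small accumulator like the one below would report the mean over all batches instead. `RunningMean` is a hypothetical helper, and the numbers fed into it are illustrative, not real training losses.

```python
class RunningMean:
    """Accumulates values and reports their arithmetic mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += float(value)
        self.count += 1

    @property
    def mean(self):
        # NaN signals "no data yet" instead of raising ZeroDivisionError
        return self.total / self.count if self.count else float("nan")


epoch_loss = RunningMean()
for batch_loss in [2.0, 1.0, 3.0]:  # illustrative values only
    epoch_loss.update(batch_loss)
mean_loss = epoch_loss.mean
```

In the training loop this would mean calling `epoch_loss.update(loss.item())` after each batch and printing `epoch_loss.mean` at the end of the epoch.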


#### Model Saving

In [9]:
model.save_pretrained('./fine_tuned_gpt2')
tokenizer.save_pretrained('./fine_tuned_gpt2')

('./fine_tuned_gpt2\\tokenizer_config.json',
 './fine_tuned_gpt2\\special_tokens_map.json',
 './fine_tuned_gpt2\\vocab.json',
 './fine_tuned_gpt2\\merges.txt',
 './fine_tuned_gpt2\\added_tokens.json')

#### Test prompt properties & model loading

In [32]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained('./fine_tuned_gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('./fine_tuned_gpt2')

prompt = "generate CV for Samantha, the IT Recruiter, with this sections: skills, previous experience, education, hobbies, contact data"
total_new_tokens = 6000 

#### Test generate & context methods

In [33]:
def generate(prompt, model, max_new_tokens_per_generation):
    if not prompt.strip():
        return ""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(
        input_ids,
        attention_mask=torch.ones_like(input_ids),
        max_new_tokens=max_new_tokens_per_generation,
        num_return_sequences=1,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
    
def contextualize(prompt, generated, model):
    if not prompt.strip():
        return ""
    abstract_prompt = f"Generate one sentence abstract of what you already generated for this prompt: {prompt} Generated: {generated}"
    generated_abstract = generate(abstract_prompt, model, 50)
    generated_abstract = generated_abstract[len(abstract_prompt):] if generated_abstract.startswith(abstract_prompt) else generated_abstract
    print(f"abstract:\n{generated_abstract}")
    return generated_abstract
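
Both `contextualize` and the main loop remove the echoed prompt from the decoded output with the same slicing expression. That step could be factored into one function, sketched below; `strip_prompt` is a hypothetical name. It keeps the notebook's behavior of returning the text unchanged when the decoded output does not start with the prompt exactly.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Return only the continuation if `generated` begins with `prompt`,
    otherwise return `generated` unchanged."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated


continuation = strip_prompt("generate CV", "generate CV and other relevant information.")
unchanged = strip_prompt("generate CV", "Generate CV and more")
```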


#### Test main loop

In [34]:


model.eval()
generated_text = ""
input_ids = tokenizer.encode(prompt, return_tensors="pt")
max_new_tokens_per_generation = 600 
current_length = 0


# Debug data
vocab_size = tokenizer.vocab_size
print(input_ids)
print(f"Vocab size: {vocab_size}")
print(f"Max index in input_ids: {input_ids.max()}")


while current_length < total_new_tokens:
    generated_part = generate(prompt, model, max_new_tokens_per_generation)
    generated_part = generated_part[len(prompt):] if generated_part.startswith(prompt) else generated_part
    generated_text += generated_part
    print(f"generated part:\n{generated_part}")
    prompt = contextualize(prompt, generated_part, model)
    current_length += max_new_tokens_per_generation

print(f"answer:\n{generated_text}")

tensor([[ 8612,   378, 26196,   329, 34778,    11,   262,  7283,  3311,   622,
          2676,    11,   351,   428,  9004,    25,  4678,    11,  2180,  1998,
            11,  3707,    11, 45578,    11,  2800,  1366]])
Vocab size: 50257
Max index in input_ids: 45578
generated part:
 and other relevant information.
 !!!!!!!!!!!!!!!!!

I am a very hard worker who works at my own pace in order to obtain excellent results within an organization that is highly competitive as well as welcoming when working under pressure or on assignment without any supervision from others (including supervisors). I have been able since 1999 to complete several projects simultaneously while still maintaining high standards of detail/task satisfaction throughout all phases of my career development through various roles including Software Development Manager & Project Management Supervisor; Programmer / Architector; Technical Support Analyst ; Lead Developer . My goal has always been to achieve great work ethic
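
In the main loop, `current_length` grows by `max_new_tokens_per_generation` on every round, even when the model stops early or the next prompt comes back empty. A sketch of a loop that counts what was actually produced is shown below. All names here (`generate_in_rounds`, `fake_generate`, `fake_summarize`, `count_words`) are hypothetical stand-ins so the example runs without the model; counting words is only a placeholder for counting tokens.

```python
def generate_in_rounds(first_prompt, generate_fn, summarize_fn, count_tokens,
                       total_new_tokens, max_rounds=20):
    """Call generate_fn repeatedly until the token budget or round limit is hit."""
    parts = []
    produced = 0
    prompt = first_prompt
    for _ in range(max_rounds):
        if produced >= total_new_tokens or not prompt.strip():
            break
        part = generate_fn(prompt)
        new_tokens = count_tokens(part)
        if new_tokens == 0:
            break  # nothing new was generated, so stop instead of spinning
        parts.append(part)
        produced += new_tokens
        prompt = summarize_fn(prompt, part)  # next round is driven by the summary
    return "".join(parts), produced


# Stand-ins for the real model calls
def fake_generate(prompt):
    return " skills: python sql"


def fake_summarize(prompt, part):
    return "summary of" + part


def count_words(text):
    return len(text.split())


text, produced = generate_in_rounds(
    "generate CV", fake_generate, fake_summarize, count_words, total_new_tokens=7
)
```

With the real model, `count_tokens` could be based on the tokenizer's encoding of each part, assuming the re-encoded length is an acceptable approximation of the generated token count.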