### Install Dependencies

First, we'll set up our environment by installing the necessary Python libraries.

* **`unsloth`**: We install the latest version of Unsloth directly from its GitHub repository. Unsloth speeds up LLM fine-tuning and reduces its memory footprint; the project advertises training up to 2x faster with up to 60% less memory, though the actual gains depend on the model and GPU. The `[colab-new]` option ensures compatibility with the latest Google Colab environments.
* **Hugging Face Ecosystem**: We install key libraries for training and optimization:
    * `peft`: Parameter-Efficient Fine-Tuning, for using techniques like LoRA.
    * `trl`: Transformer Reinforcement Learning, for its easy-to-use `SFTTrainer`.
    * `accelerate`: To easily run our training script on any hardware.
    * `bitsandbytes`: For 4-bit quantization (QLoRA), which drastically reduces model size.
* **`xformers`**: Provides memory-efficient attention mechanisms for another performance boost.
* **`wandb`**: Weights & Biases, for logging our experiments and tracking metrics like training loss.
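
Once the install cell has run, it can help to confirm what actually got installed, since `--no-deps` skips dependency resolution. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these packages.

```python
from importlib import metadata

# Distribution names from the install commands in this section.
PACKAGES = ["unsloth", "xformers", "trl", "peft", "accelerate", "bitsandbytes", "wandb"]


def installed_versions(names):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

A `None` entry means the package is missing from the current environment and the corresponding `pip install` line should be rerun.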

In [2]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" -q
!pip install --no-deps xformers trl peft accelerate bitsandbytes -q
!pip install wandb -q

### Log In to Services

To download our model and log our training progress, we need to authenticate with Hugging Face and Weights & Biases (W&B). We use Colab's `userdata` to securely access our API keys without hardcoding them in the notebook.

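* **Hugging Face Hub**: We read our Hugging Face access token from the secret store and later pass it to `from_pretrained` through the `token` argument. A token is needed for gated or private repositories and for uploading models to the Hub.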
* **Weights & Biases**: We log in to `wandb` to enable experiment tracking. This will allow us to monitor metrics like training loss in real-time.

> **Action Required:** Before running this cell, you must store your API keys as secrets in Google Colab.
> 1.  Click the **🔑 (Secrets)** icon on the left sidebar.
> 2.  Create a new secret named `hugging` and paste your Hugging Face access token (with `write` permissions) as the value.
> 3.  Create another secret named `WANDB_API_KEY` and paste your W&B API key as the value.
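
If the notebook may also run outside Colab, a small fallback keeps the same cell working. This is a sketch under assumptions: `get_secret` is a hypothetical helper, and it assumes that outside Colab the same secret names are exported as environment variables.

```python
import os


def get_secret(name):
    """Read a secret from Colab's store, falling back to an environment variable."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata

        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not in Colab, or the secret is missing or not shared with this notebook.
        pass
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab secrets or environment")
    return value
```

With this helper, `hf_token = get_secret("hugging")` would replace the direct `userdata.get` call; either way the key never appears in the notebook itself.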

In [None]:
# Read the Hugging Face token (passed to from_pretrained later)
from google.colab import userdata
hf_token = userdata.get('hugging')

# Log in to wandb
import wandb
wandb_api_key = userdata.get('WANDB_API_KEY')
wandb.login(key=wandb_api_key)

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /teamspace/studios/this_studio/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mjavaemailacount[0m ([33mjavaemailacount-none[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

### Import Core Libraries

With our environment set up and authenticated, we can now import the core components from the libraries we installed. Each of these plays a critical role in the fine-tuning pipeline.

* **`FastLanguageModel`**: The star of the show from Unsloth. This class will load our base model and automatically apply all the necessary optimizations for fast, memory-efficient training.
* **`torch`**: The fundamental deep learning framework.
* **`load_dataset`**: A function from the Hugging Face `datasets` library to easily pull our training data from the Hub.
* **`SFTTrainer`**: A specialized trainer from the `trl` library designed specifically for Supervised Fine-Tuning.
* **`TrainingArguments`**: A configuration class from `transformers` where we will define all the hyperparameters for our training job.
* **`is_bfloat16_supported`**: A utility from Unsloth to check if our hardware supports `bfloat16` precision, which is ideal for training modern transformers.

In [4]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.7.1+cu128)
    Python  3.9.23 (you have 3.10.10)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!


### Load Model and Tokenizer

Now we use Unsloth's `FastLanguageModel` to load our pre-trained model. This single, powerful command handles several critical steps for us:

1.  **Downloads the model** from the Hugging Face Hub.
2.  **Applies 4-bit quantization** to drastically reduce memory usage.
3.  **Patches the model** with performance optimizations for faster training.
4.  **Prepares the tokenizer** for use in training.

Let's look at the key parameters:
* `model_name`: We are loading `"microsoft/Phi-3-mini-4k-instruct"`, a highly capable small language model that is perfect for fine-tuning on consumer hardware.
* `load_in_4bit = True`: This is the core of our memory-saving strategy. It enables 4-bit quantization (QLoRA), reducing the VRAM footprint significantly.
* `max_seq_length = 2048`: We set the maximum context window for our training examples. This offers a good balance between capturing long-range dependencies and managing memory.
* `dtype = None`: This allows Unsloth to automatically detect and use the optimal data type for our GPU (like `bfloat16`), ensuring the best possible training performance.
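
To get a feel for why `load_in_4bit` matters, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch: the parameter count of about 3.82 billion is an assumption, and the calculation ignores quantization constants, layers that stay in higher precision, activations, and optimizer state.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 3.82 billion parameters for Phi-3-mini.
PHI3_MINI_PARAMS = 3.82e9

if __name__ == "__main__":
    for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
        print(f"{label:9s} ~{weight_memory_gb(PHI3_MINI_PARAMS, bits):.2f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the space of the `bfloat16` weights, which is what leaves room for adapters, activations, and optimizer state on a single GPU.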

In [5]:
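# Load the base model in 4-bit precision, together with its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "microsoft/Phi-3-mini-4k-instruct",
    max_seq_length = 2048,   # maximum sequence length used for training
    load_in_4bit = True,     # 4-bit quantization to reduce VRAM usage
    dtype = None,            # let Unsloth pick (e.g. bfloat16 when supported)
    token = hf_token,        # Hugging Face token read from the secret store
)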

==((====))==  Unsloth 2025.8.9: Fast Mistral patching. Transformers: 4.55.4.
   \\   /|    NVIDIA L40S. Num GPUs = 1. Max memory: 44.527 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


### Test the Base Model (Before Fine-Tuning)

Before we fine-tune the model, it's crucial to establish a baseline. We need to see how the pre-trained model performs on a task similar to our goal. This helps us understand its out-of-the-box capabilities and gives us a "before" snapshot to compare against our "after" fine-tuned version.

Our test process involves a few key steps:
1.  **Craft a Prompt**: We create a sample conversation using the standard `system` and `user` roles. The system prompt sets the model's persona (an expert financial analyst), while the user prompt provides a specific context and a question.
2.  **Apply Chat Template**: We use `tokenizer.apply_chat_template`. This is a critical function that formats our structured conversation into the exact string format that `Phi-3-instruct` expects, including special tokens.
3.  **Generate Response**: We run a standard inference using `model.generate()` to get the model's answer based on our prompt.
4.  **Evaluate Output**: We'll examine the response to see if the model correctly follows instructions and extracts the required information from the context.
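
To make step 2 concrete, the sketch below mimics the string layout that appears in this notebook's printed prompts. It is a hand-rolled illustration, not the tokenizer's real template: `render_phi3_prompt` is a hypothetical name, and in practice `tokenizer.apply_chat_template` should always be used, because the template shipped with the model is the source of truth.

```python
def render_phi3_prompt(messages, add_generation_prompt=True):
    """Approximate the Phi-3 chat layout seen in this notebook's output."""
    parts = []
    for message in messages:
        # Each turn: role marker, newline, content, end-of-turn marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    else:
        # Training strings in this notebook end with the end-of-text token.
        parts.append("<|endoftext|>")
    return "".join(parts)


if __name__ == "__main__":
    demo = [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
    ]
    print(render_phi3_prompt(demo))
```

With `add_generation_prompt=True` the string ends on an open `<|assistant|>` turn for inference; with `False` it ends after the last message, which is the form used for training examples.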

In [6]:
# Test base model first to ensure it works
def test_base_model():
    messages = [
        {
            "role": "system", 
            "content": "You are an expert financial analyst. Answer the user's question based only on the provided context."
        },
        {
            "role": "user", 
            "content": """Context: The company's gross margin improved to 45.8% in fiscal year 2023, up from 42.1% in the prior year. The margin expansion was mainly attributable to a favorable product mix with higher sales of our premium software subscriptions, and manufacturing efficiencies gained from our new automated production line in Alexandria, Egypt.

Question: What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?"""
        }
    ]
    
    # Use the model's built-in chat template
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print("Generated prompt:")
    print(prompt)
    print("=" * 50)
    
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode only new tokens
    response_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(response_tokens, skip_special_tokens=True)
    print("Base model response:", response)
    return response

In [7]:
# Test base model first
print("Testing base model...")
base_response = test_base_model()

Testing base model...
Generated prompt:
<|system|>
You are an expert financial analyst. Answer the user's question based only on the provided context.<|end|>
<|user|>
Context: The company's gross margin improved to 45.8% in fiscal year 2023, up from 42.1% in the prior year. The margin expansion was mainly attributable to a favorable product mix with higher sales of our premium software subscriptions, and manufacturing efficiencies gained from our new automated production line in Alexandria, Egypt.

Question: What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?<|end|>
<|assistant|>



Base model response: The two key factors that contributed to the increase in the company's gross margin in fiscal year 2023 were a favorable product mix, which included higher sales of premium software subscriptions, and manufacturing efficiencies gained from the introduction of a new automated production line in Alexandria, Egypt.


### Configure LoRA for Efficient Fine-Tuning

Now we get to the core of Parameter-Efficient Fine-Tuning (PEFT). Instead of training the entire model, we'll use **Low-Rank Adaptation (LoRA)** to inject small, trainable "adapter" matrices into the model's architecture. This means we only need to train a tiny fraction of the total parameters (typically <1%), which is what makes fine-tuning feasible on a single GPU.

Unsloth's `get_peft_model` function seamlessly applies this configuration to our 4-bit model. Let's look at the key hyperparameters:

* `r = 16`: The rank or dimension of the LoRA adapter matrices. A higher rank means more trainable parameters and greater expressive power, but also more memory. `16` is a solid and popular choice.
* `lora_alpha = 16`: The scaling factor for the LoRA weights. A common convention is to set this equal to `r`.
* `target_modules`: This is a critical setting. We specify the names of the layers where the LoRA adapters will be injected: the four attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and the three feed-forward projections (`gate_proj`, `up_proj`, `down_proj`) in every transformer block.
* `use_gradient_checkpointing = "unsloth"`: A crucial memory-saving technique that trades a bit of computation time to drastically reduce VRAM usage, allowing us to use larger batch sizes or longer sequences. The `"unsloth"` option enables a custom, faster implementation.
* `random_state = 2002`: We set a seed for reproducibility. Fun fact: 2002 was the year the modern Bibliotheca Alexandrina was inaugurated, not too far from our model's fictional manufacturing plant in Alexandria.

After this cell, our model is fully prepared for training. The original weights are frozen, and only the LoRA adapter weights will be updated.
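
The "tiny fraction" of trainable parameters can be worked out by hand. A LoRA adapter on a linear layer with `in` inputs and `out` outputs adds two matrices of shapes `r x in` and `out x r`, so `r * (in + out)` parameters. The sketch below applies that formula to the seven target modules; the model dimensions (hidden size 3072, MLP size 8192, 32 layers) are assumptions about Phi-3-mini that are not stated in this notebook.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed Phi-3-mini dimensions (not stated in this notebook).
HIDDEN, INTERMEDIATE, NUM_LAYERS, RANK = 3072, 8192, 32, 16

# (in_features, out_features) for each targeted projection in one block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS

if __name__ == "__main__":
    print(f"{per_layer:,} LoRA parameters per layer")  # 933,888
    print(f"{total:,} LoRA parameters in total")       # 29,884,416
```

The total of 29,884,416 agrees with the trainable-parameter count that the training banner reports further down, about 0.78% of the model.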

In [8]:
# Attach LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model = model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 2002,
)

Unsloth 2025.8.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


### Load and Prepare the Dataset

The quality and format of your training data are paramount for a successful fine-tune. Instruction-tuned models like Phi-3 are highly sensitive to the prompt format they were trained on. In this step, we will load our financial Q&A dataset and transform each entry to perfectly match Phi-3's specific chat template.

Our workflow is as follows:
1.  **Load Dataset**: We start by loading the `virattt/llama-3-8b-financialQA` dataset from the Hugging Face Hub. Each row holds a financial `context`, a `question` about it, and a reference `answer`, plus `ticker` and `filing` metadata.
2.  **Define a Formatting Function**: We create a function, `formatting_prompts_func`, that takes a batch of examples and restructures them. For each row, it builds a conversation with three parts:
    * A `system` message to consistently set the model's persona.
    * A `user` message combining the `context` and `question`.
    * An `assistant` message containing the ground-truth `answer` that we want the model to learn.
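3.  **Apply the Chat Template**: Inside the function, we use the crucial `tokenizer.apply_chat_template` method. This converts the structured conversation into a single string in Phi-3's chat format, with the `<|system|>`, `<|user|>`, `<|assistant|>`, and `<|end|>` markers the model expects. The result is stored in a new `text` column that the trainer reads.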

In [9]:
# Format each example with Phi-3's chat template for training
def formatting_prompts_func(examples):
    questions = examples["question"]
    contexts = examples["context"]  
    responses = examples["answer"]
    texts = []
    
    for question, context, response in zip(questions, contexts, responses):
        # Create proper conversation format
        messages = [
            {
                "role": "system",
                "content": "You are an expert financial analyst. Answer the user's question based only on the provided context."
            },
            {
                "role": "user", 
                "content": f"Context: {context}\n\nQuestion: {question}"
            },
            {
                "role": "assistant",
                "content": response
            }
        ]
        
        # Use the model's chat template
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    
    return {"text": texts}

In [10]:
# Load and format dataset
dataset = load_dataset("virattt/llama-3-8b-financialQA", split="train")
print("Sample before formatting:", dataset[0])

Sample before formatting: {'question': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?', 'answer': 'NVIDIA initially focused on PC graphics.', 'context': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.', 'ticker': 'NVDA', 'filing': '2023_10K'}


In [11]:
dataset = dataset.map(formatting_prompts_func, batched=True)
print("Sample after formatting:", dataset[0]["text"][:500] + "...")

Sample after formatting: <|system|>
You are an expert financial analyst. Answer the user's question based only on the provided context.<|end|>
<|user|>
Context: Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.

Question: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?<|end|>
<|assistant|>
NVIDIA initially focused on PC graphics.<|end|>
<|endoftext|>...


### Configure and Launch the Fine-Tuning Job

We have arrived at the training step. With our model loaded, LoRA configured, and the dataset perfectly formatted, we can now set up the trainer and launch the fine-tuning process.

We will use the `SFTTrainer` from the TRL library, which handles the complexities of the training loop for us. The behavior of the trainer is controlled by a comprehensive set of `TrainingArguments`.

#### Key Hyperparameters:
* **Batching**: We use a `per_device_train_batch_size` of 2 and `gradient_accumulation_steps` of 4. This gives us an effective batch size of `2 * 4 = 8`, which helps stabilize training while keeping memory usage low.
* **Training Steps**: We set `max_steps = 60` for a short, demonstrative training run. In a real-world scenario, you would train for more steps or for a certain number of epochs.
* **Learning Rate**: A `learning_rate` of `2e-4` with a linear scheduler and a few `warmup_steps` is a standard and effective setup for LoRA.
* **Optimizations**: We use the `adamw_8bit` optimizer and enable `bf16` (bfloat16 mixed-precision) if our GPU supports it. These are powerful techniques that accelerate training and reduce memory consumption.
* **Logging and Saving**: We set `logging_steps = 1` to see the loss at every step and save a model checkpoint to the `outputs` directory halfway through training (`save_steps = 30`).
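
The numbers above can be checked with a few lines of arithmetic. The sketch below computes the effective batch size, how much of the 7,000-example dataset a 60-step run covers, and an approximation of the linear warmup and decay schedule. The function `linear_lr` is a hypothetical helper written for illustration; the trainer's own scheduler may index steps slightly differently.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Approximate learning rate after `step` updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from the TrainingArguments in this section.
per_device_batch, grad_accum, max_steps = 2, 4, 60
dataset_size = 7000  # number of examples reported by the training banner

effective_batch = per_device_batch * grad_accum
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / dataset_size

if __name__ == "__main__":
    print(f"effective batch size: {effective_batch}")
    print(f"examples seen: {examples_seen} ({fraction_of_epoch:.1%} of one epoch)")
    for step in (0, 1, 5, 30, 60):
        print(f"step {step:2d}: lr ~ {linear_lr(step):.2e}")
```

So this demonstration run sees 480 examples, well under a tenth of one epoch, which is worth keeping in mind when judging the results.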

#### Launching the Training
With all the components in place, a single call to `trainer.train()` starts the fine-tuning process. As training runs, keep an eye on the logged training loss: with `logging_steps = 1` and an effective batch size of 8 it will be noisy from step to step, but the overall trend should be downward, indicating that the model is learning from our financial Q&A data. Let's kick it off and let the GPU work its magic through the early hours of this Saturday morning.

In [12]:
# Training configuration
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 2002,
        output_dir = "outputs",
        save_strategy = "steps",
        save_steps = 30,
    ),
)

In [12]:
# Train the model
print("Starting training...")
trainer_stats = trainer.train()

Starting training...


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 29,884,416 of 3,850,963,968 (0.78% trained)


Step,Training Loss
1,3.9063
2,3.5754
3,5.2545
4,4.4705
5,5.033
6,3.4329
7,3.2652
8,2.7386
9,2.6936
10,2.4471


Unsloth: Will smartly offload gradients to save VRAM!
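
Because the loss is logged at every step, single values jump around (steps 3 to 5 above are higher than step 1). A simple moving average makes the trend easier to read. This is a small illustrative sketch using the ten values printed above; `moving_average` is a hypothetical helper, not part of the trainer.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Training loss for steps 1-10, copied from the log above.
losses = [3.9063, 3.5754, 5.2545, 4.4705, 5.033, 3.4329, 3.2652, 2.7386, 2.6936, 2.4471]
smoothed = moving_average(losses, window=3)

if __name__ == "__main__":
    for step, value in enumerate(smoothed, start=3):
        print(f"steps {step - 2}-{step}: {value:.3f}")
```

The same curve is available in the W&B dashboard, where smoothing can be adjusted interactively.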


### Test the Fine-Tuned Model

The training is complete! Now for the moment of truth: did our fine-tuning work? We will now test our specialized model and compare its output directly against the baseline we established in the "Test the Base Model" section.

To ensure a fair evaluation, our process is simple but critical:
1.  **Use a Consistent Prompt Format**: Our new `inference` function formats the prompt using the **exact same** system message and chat template that the model was trained on. This consistency is crucial for unlocking the model's new capabilities.
2.  **Rerun the Original Test Case**: We will ask the **exact same question** using the same context from our baseline test.

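This provides our "after" snapshot. Compare this response to the one from the base model. Look for improvements in accuracy, conciseness, formatting (e.g., using a proper list), and overall adherence to the system prompt's instructions. One caveat: `inference` adds `repetition_penalty=1.1`, which the baseline call did not use, and both calls sample with `do_sample=True`, so part of any difference can come from decoding settings rather than from fine-tuning.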
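
For a stricter before-and-after comparison, both runs can share one set of decoding arguments and use greedy decoding, since the baseline call and `inference` in this notebook differ in `repetition_penalty` and both sample. The helper below is a hypothetical sketch, not part of the original notebook; it only builds keyword arguments that `model.generate` accepts.

```python
def comparison_generate_kwargs(tokenizer, max_new_tokens=128):
    """Decoding settings intended to be reused for both the base and fine-tuned runs."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,  # greedy decoding, so no sampling randomness
        "pad_token_id": tokenizer.eos_token_id,
        "eos_token_id": tokenizer.eos_token_id,
    }
```

It would be used as `model.generate(**inputs, **comparison_generate_kwargs(tokenizer))` in both test functions, with the baseline still run before the LoRA adapters are trained.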

After a short but intense training session in the quiet of the Giza night, let's see how our newly specialized financial analyst performs.

In [13]:
# Inference helper that uses the same chat template as training
def inference(question, context):
    messages = [
        {
            "role": "system",
            "content": "You are an expert financial analyst. Answer the user's question based only on the provided context."
        },
        {
            "role": "user",
            "content": f"Context: {context}\n\nQuestion: {question}"
        }
    ]
    
    # Use the same chat template as training
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.1
        )
    
    # Extract only the generated response
    response_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(response_tokens, skip_special_tokens=True)
    
    return response.strip()

In [14]:
# Test the fine-tuned model
print("\n" + "="*50)
print("Testing fine-tuned model...")

context = "The company's gross margin improved to 45.8% in fiscal year 2023, up from 42.1% in the prior year. The margin expansion was mainly attributable to a favorable product mix with higher sales of our premium software subscriptions, and manufacturing efficiencies gained from our new automated production line in Alexandria, Egypt."
question = "What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?"

response = inference(question, context)
print(f"Question: {question}")
print(f"Response: {response}")


Testing fine-tuned model...
Question: What were the two key factors that contributed to the increase in the company's gross margin in fiscal year 2023?
Response: The two key factors that contributed to the increase in the company's gross margin in fiscal year 2023 were (a) a more favorable product mix with higher sales of their premium software subscriptions, which indicates better pricing power or market preference for these high-margin products; and (b) manufacturing efficiencies gained from implementing a new automated production line in Alexandria, Egypt, suggesting cost savings through increased operational efficiency.


## 🔬 <span style="color: #007bff;">Analysis</span>: The <span style="color: #28a745;">Impact</span> of Fine-Tuning

The results above clearly demonstrate the value of Supervised Fine-Tuning (SFT). While the base `Phi-3` model was able to extract the correct facts, our fine-tuned model learned to adopt the **persona of an expert financial analyst**.

Let's compare the outputs.

---

### 🧠 Base Model (Before Fine-Tuning)
<div style="background-color: #262730; color: #EAEAEA; border: 1px solid #444; border-radius: 8px; padding: 15px; margin: 10px 0;">
The two key factors that contributed to the increase in the company's gross margin in fiscal year 2023 were a favorable product mix, which included higher sales of premium software subscriptions, and manufacturing efficiencies gained from the introduction of a new automated production line in Alexandria, <span style="color: #ffc107;">Egypt</span>.
</div>

### 🏆 Fine-Tuned Model (After Fine-Tuning)
<div style="background-color: #262730; color: #EAEAEA; border: 1px solid #444; border-radius: 8px; padding: 15px; margin: 10px 0;">
The two key factors that contributed to the increase in the company's gross margin in fiscal year 2023 were (a) a more favorable product mix with higher sales of their premium software subscriptions, <span style="background-color: #1a472a; color: #98FB98; padding: 2px 4px; border-radius: 4px;">**which indicates better pricing power or market preference for these high-margin products;**</span> and (b) manufacturing efficiencies gained from implementing a new automated production line in Alexandria, <span style="color: #ffc107;">Egypt</span>, <span style="background-color: #1a472a; color: #98FB98; padding: 2px 4px; border-radius: 4px;">**suggesting cost savings through increased operational efficiency.**</span>
</div>

---

### ✨ <span style="color: #007bff;">Key Improvements</span> from Fine-Tuning:

<div style="border-left: 3px solid #007bff; padding-left: 15px; margin-top: 10px;">

* **<span style="color: #28a745;">Persona Adoption & Analytical Tone</span>**: The base model simply stated the facts. The fine-tuned model speaks like an analyst, using phrases like <code style="background-color: #282c34; color: #e5c07b; padding: 2px 5px; border-radius: 4px;">indicates better pricing power</code> and <code style="background-color: #282c34; color: #e5c07b; padding: 2px 5px; border-radius: 4px;">suggesting cost savings</code>.

* **<span style="color: #dc3545;">Interpretive Depth</span>**: Instead of just extracting information, the fine-tuned model now **interprets** it. It explains the *implications* of the facts (e.g., higher sales of premium products lead to higher margins), which is a much higher-level skill.

* **<span style="color: #ffc107;">Improved Structure</span>**: The fine-tuned model learned to structure its answer more clearly using `(a)` and `(b)`, making the two key points distinct and easier to read for a professional audience.

</div>

By showing the model roughly 480 examples from our financial Q&A dataset (60 steps at an effective batch size of 8), we didn't just teach it new facts; we taught it **how to think, speak, and structure information** like an expert financial analyst. This is the true power of fine-tuning.

### Merge, Save, and Package the Final Model

Our work is not complete until the model is saved and ready for deployment. The fine-tuning process created lightweight LoRA "adapter" weights, which are separate from the original base model. For easy, portable inference, we need to merge these adapters back into the base model to create a single, unified set of weights.

This final step effectively "bakes in" our specialized financial knowledge.

1.  **Merge and Unload**: We call `model.merge_and_unload()`. This method comes from the PEFT library (our LoRA-wrapped model is a PEFT model) and performs two actions:
    * **Merges** the trained LoRA weights directly into the base model's attention and MLP layers.
    * **Unloads** the PEFT wrapper, returning a standard Hugging Face `PreTrainedModel` object. This new object is a complete, standalone model that doesn't require the `peft` library for inference.
2.  **Save the Model**: We use the standard `save_pretrained` method to save our new, merged model. We specify `safe_serialization=True` to use the modern and secure `safetensors` format.
3.  **Save the Tokenizer**: Crucially, we also save the tokenizer in the same directory. The model weights and the tokenizer are a pair; you need both to correctly run inference.
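
Conceptually, merging folds each adapter into its layer's weight matrix: `W_merged = W + (lora_alpha / r) * (B @ A)`. The toy example below shows that a merged layer gives the same output as the base layer plus the adapter. It is a pure-Python sketch with made-up 2x2 numbers, and it ignores the quantization handling that a real 4-bit base model involves.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy values: a 2x2 base weight and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r x in)  = (1 x 2)
B = [[0.5], [0.25]]   # shape (out x r) = (2 x 1)
x = [[3.0], [4.0]]    # input as a column vector

merged = merge_lora(W, A, B, lora_alpha=2, r=1)

# Output of the merged layer.
y_merged = matmul(merged, x)

# Output of the unmerged layer: base path plus scaled adapter path.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
y_unmerged = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]

if __name__ == "__main__":
    print(merged)      # [[2.0, 2.0], [0.5, 2.0]]
    print(y_merged)    # [[14.0], [9.5]]
    print(y_unmerged)  # [[14.0], [9.5]]
```

Because the two paths agree, the adapter matrices can be discarded after merging, which is why the merged model no longer needs `peft` at inference time.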

With the first light of dawn approaching over the Giza plateau, our final, expert financial analyst model is now serialized to disk, ready to be uploaded, shared, and deployed.

In [15]:
import os

# Create directory for the complete fine-tuned model
save_directory = "Data/complete_finetuned_model"
os.makedirs(save_directory, exist_ok=True)

In [16]:
print("Merging LoRA adapter with base model...")
# Merge the LoRA adapter with the base model
# This combines the original weights with the learned LoRA weights
merged_model = model.merge_and_unload()

Merging LoRA adapter with base model...




In [17]:
print("Saving merged model and tokenizer...")

# Save the complete merged model
merged_model.save_pretrained(
    save_directory,
    safe_serialization=True,  # Use safetensors format (recommended)
    max_shard_size="2GB"      # Split large models into 2GB chunks
)

# Save the tokenizer
tokenizer.save_pretrained(save_directory)

Saving merged model and tokenizer...


('Data/complete_finetuned_model/tokenizer_config.json',
 'Data/complete_finetuned_model/special_tokens_map.json',
 'Data/complete_finetuned_model/chat_template.jinja',
 'Data/complete_finetuned_model/tokenizer.model',
 'Data/complete_finetuned_model/added_tokens.json',
 'Data/complete_finetuned_model/tokenizer.json')