# Fine-Tuning a DeepSeek Model on Custom Data with Unsloth

## Install Some Libraries

In [1]:
!pip install datasets
!pip install unsloth

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.

## Import Hugging Face and Wandb Keys

In [3]:
from huggingface_hub import login
from google.colab import userdata
hf_token = userdata.get('Hugging_Face')

import wandb

wb_token = userdata.get("wandb")

wandb.login(key=wb_token)
run = wandb.init(
    project='Deepseek Fine tune',
    job_type="training",
    anonymous="allow"
)

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: aniqramzan5758 (aniqramzan5758-atomcamp) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin

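The cell above reads both keys through `google.colab.userdata`, which is only available inside Colab. A minimal sketch of a fallback for other environments follows; the helper name `get_secret` and the convention of reading an upper-cased environment variable are assumptions for illustration, not part of the original notebook.

```python
import os

def get_secret(name, env_var=None):
    """Fetch a secret from Colab's userdata if available, else from the environment.

    `get_secret` is a hypothetical helper; the environment variable name
    defaults to the upper-cased secret name, which is an assumption.
    """
    try:
        # Only importable when running inside Google Colab
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Outside Colab: fall back to an environment variable
        return os.environ.get(env_var or name.upper())
```

With this helper, `hf_token = get_secret("Hugging_Face")` would read `HUGGING_FACE` from the environment when the notebook runs outside Colab, and return `None` if it is not set.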

## Load the Dataset

### Cazton_dataset

A dataset for fine-tuning a chat LLM for Cazton.

### About the Dataset
This is the complete dataset scraped from cazton.com. Its purpose is to fine-tune a chat LLM for personalised communication. The dataset description names phi-1.5B as the target model, whereas this notebook fine-tunes deepseek-llm-7b-chat. The data was collected using Beautiful Soup, Selenium, and the web loaders from LangChain.



In [4]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/complete_data.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,questions,answers
0,0,What big data services does Cazton offer?,Cazton has been a pioneer in Big Data. Our tea...
1,1,What strategies can companies use to better u...,"\nWith every passing second, the amount of dat..."
2,2,What features does Microsoft Fabric offer as ...,\nMicrosoft Fabric offers a comprehensive and ...
3,3,\nWhat advantages does Spark offer over Hadoop...,"\nSpark is an open-source, lightning fast, clu..."
4,4,What was Databricks' annualized revenue two y...,"\nOn Aug 5, 2022, the CEO of Databricks announ..."


In [2]:
# Remove the unnecessary index columns
df = df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore')

In [6]:
from datasets import Dataset

dataset = Dataset.from_pandas(df)

In [7]:
dataset

Dataset({
    features: ['questions', 'answers'],
    num_rows: 1289
})

In [8]:
dataset[0]

{'questions': ' What big data services does Cazton offer?',
 'answers': 'Cazton has been a pioneer in Big Data. Our team includes but not limited to Big Data Engineers, Distributed Systems Engineer, Data Scientists, Hadoop Experts, Spark Experts, Spark.NET Experts, Kafka Experts have years of experience and strong analytical and problem-solving skills. Our experts have hands-on experience with Big Data technologies that includes Hadoop, Spark, HIVE, HBase, Kafka, Impala, PIG, Zookeeper, Cassandra. NoSQL databases like Couchbase, MongoDB and have proven record building solid production level software on various Big data technologies. Contact us now to learn more about our big data services.'}

## Formatting the Dataset

In [9]:
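def formatting_data(example):
    # Build a single "text" field in a User/Assistant layout.
    # Note: "<|endofsentence|>" is a literal string here; it may not match
    # the tokenizer's actual EOS token (tokenizer.eos_token).
    return {
        "text": f"User: {example['questions'].strip()}\n\nAssistant: {example['answers'].strip()}<|endofsentence|>"
    }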

dataset = dataset.map(formatting_data)


In [10]:
dataset['text'][0]

'User: What big data services does Cazton offer?\n\nAssistant: Cazton has been a pioneer in Big Data. Our team includes but not limited to Big Data Engineers, Distributed Systems Engineer, Data Scientists, Hadoop Experts, Spark Experts, Spark.NET Experts, Kafka Experts have years of experience and strong analytical and problem-solving skills. Our experts have hands-on experience with Big Data technologies that includes Hadoop, Spark, HIVE, HBase, Kafka, Impala, PIG, Zookeeper, Cassandra. NoSQL databases like Couchbase, MongoDB and have proven record building solid production level software on various Big data technologies. Contact us now to learn more about our big data services.<|endofsentence|>'

In [11]:
# Remove irrelevant columns
dataset = dataset.remove_columns(['questions', 'answers'])
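The formatting function above hard-codes the end marker as the literal string `<|endofsentence|>`. If that string is not the tokenizer's registered EOS token, it is tokenized as ordinary text and generation may not stop there; the inference output later in this notebook does continue past the marker. A minimal sketch of an alternative, assuming the tokenizer is loaded before the dataset is formatted so that `tokenizer.eos_token` can be passed in. The name `format_example` and its `eos_token` parameter are hypothetical.

```python
def format_example(example, eos_token):
    """Build one training string in the notebook's User/Assistant layout.

    `eos_token` is passed in rather than hard-coded so the marker always
    matches whatever the loaded tokenizer uses.
    """
    question = example["questions"].strip()
    answer = example["answers"].strip()
    return {"text": f"User: {question}\n\nAssistant: {answer}{eos_token}"}

# Stand-in EOS value for illustration only; in the notebook this would be
# tokenizer.eos_token
sample = {"questions": " What is Spark? ", "answers": "\nA cluster computing engine."}
print(format_example(sample, eos_token="</s>")["text"])
```

Once the tokenizer is available, this could be applied with `dataset.map(lambda ex: format_example(ex, tokenizer.eos_token))`.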

## Load the DeepSeek Instruct Model and Tokenizer

In [13]:
# load the model and the tokenizer
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "deepseek-ai/deepseek-llm-7b-chat",
    max_seq_length = 4096,
    dtype = torch.float16,
    load_in_4bit = True,
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


deepseek-ai/deepseek-llm-7b-chat does not have a padding token! Will use pad_token = <|PAD_TOKEN|>.
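The model is loaded with `load_in_4bit = True` on a Tesla T4 that reports 14.741 GB of memory. A rough back-of-the-envelope sketch of why that matters is shown here. It counts weight storage only and ignores quantization overhead, activations, optimizer state, and the LoRA adapters, so treat the numbers as order-of-magnitude estimates. The helper `weight_memory_gb` is a hypothetical name, and the parameter count is the rounded 7,000,000,000 printed in the training banner.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7_000_000_000  # rounded figure from the Unsloth training banner

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions the float16 weights alone would take about 14 GB, close to the whole T4, while 4-bit weights take about 3.5 GB and leave room for training.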


In [14]:
# Apply the LoRA technique (low-rank adapters on the attention and MLP projections)
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 16,
    lora_dropout = 0.02,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.02.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.3.19 patched 30 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
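The training banner later in this notebook reports 18,739,200 trainable parameters. That figure can be reproduced from the LoRA settings: each adapted linear layer gains two small matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes a hidden size of 4096, an MLP intermediate size of 11008, and key/value projections of the same width as the query projection. These dimensions are not stated in the notebook and are assumptions; the 30 layers come from the Unsloth message above.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)

# Assumed model dimensions (not stated in this notebook)
hidden = 4096
intermediate = 11008
num_layers = 30  # "patched 30 layers" in the Unsloth output
r = 8            # LoRA rank from get_peft_model

# (in_features, out_features) for each target module in one decoder layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"per layer: {per_layer:,}  total: {total:,}")
# per layer: 624,640  total: 18,739,200
```

The match with the reported count suggests the assumed dimensions are right, and it shows how the total scales linearly with `r`.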


In [23]:
# Initialize Training Arguments
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

TrainingArgs = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    warmup_steps = 5,
    max_steps = 100,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    optim = "adamw_8bit",
    logging_steps = 10,
    weight_decay = 0.01,
    seed = 3407,
    lr_scheduler_type = "linear",
    save_steps = 25,
    output_dir = "output",
    save_total_limit = 2,
    logging_dir = "logs",
)
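A short sketch of how much data these arguments cover, using the batch settings above and the 1,289 rows in the dataset. It assumes one GPU and that each example is consumed as a single sequence (no packing); the variable names are chosen for illustration.

```python
per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1           # assumption: single GPU
max_steps = 100
num_examples = 1289    # rows in the dataset

# One optimizer step sees per_device_batch * grad_accum * num_gpus examples
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction = examples_seen / num_examples

print(effective_batch, examples_seen, round(fraction, 2))
# 8 800 0.62
```

Under these assumptions, 100 steps cover roughly 62% of the dataset, which is less than one full pass. Raising `max_steps` to about 162 would cover it once.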

In [24]:
from trl import SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArgs,
)


In [25]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,289 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 18,739,200/7,000,000,000 (0.27% trained)


Step,Training Loss
10,1.4854
20,1.5387
30,1.6276
40,1.6724
50,1.4983
60,1.5394
70,1.5827
80,1.5051
90,1.4963
100,1.6804


TrainOutput(global_step=100, training_loss=1.5626351737976074, metrics={'train_runtime': 337.4724, 'train_samples_per_second': 2.371, 'train_steps_per_second': 0.296, 'total_flos': 2299659689852928.0, 'train_loss': 1.5626351737976074})
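One quick way to read the loss table is to compare the average of the first five logged values with the last five. This is only a rough sanity check on the ten values printed above, not a substitute for a validation set.

```python
from statistics import mean

# (step, training loss) pairs copied from the log above
logged = [
    (10, 1.4854), (20, 1.5387), (30, 1.6276), (40, 1.6724), (50, 1.4983),
    (60, 1.5394), (70, 1.5827), (80, 1.5051), (90, 1.4963), (100, 1.6804),
]

losses = [loss for _, loss in logged]
first_half = mean(losses[:5])
second_half = mean(losses[5:])

print(f"overall mean: {mean(losses):.4f}")
print(f"steps 10-50:  {first_half:.4f}")
print(f"steps 60-100: {second_half:.4f}")
```

The overall mean agrees with the reported `training_loss` of about 1.5626, and the two halves differ by well under 0.01, so the loss is essentially flat over these 100 steps.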

## Save the Model

In [26]:
model.save_pretrained("deepseek-cazton-finetuned")
tokenizer.save_pretrained("deepseek-cazton-finetuned")

('deepseek-cazton-finetuned/tokenizer_config.json',
 'deepseek-cazton-finetuned/special_tokens_map.json',
 'deepseek-cazton-finetuned/tokenizer.json')

## Inference

In [5]:
# Reload the fine-tuned model and tokenizer from the saved directory
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/content/deepseek-cazton-finetuned",
    max_seq_length = 4096,
    dtype = torch.float16,
    load_in_4bit = True,
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [12]:
from transformers import TextStreamer

# Prepare the prompt
prompt = "User: What big data services does Cazton offer?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.2,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer
)

# Decode output
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Clean repetitions & <|endofsentence|>
sentences = decoded.split("<|endofsentence|>")
unique_sentences = []
for s in sentences:
    s = s.strip()
    if s and s not in unique_sentences:
        unique_sentences.append(s)

clean_response = " ".join(unique_sentences)
print("\n=== Final Output ===\n")
print(clean_response)

 At Cazton, we specialize in delivering solutions and consulting for Big Data projects. We have worked on Hadoop, Spark, Hbase, Hive, Cassandra, NoSQL databases like MongoDB etc.<|endofsentence|> Our team of experts has vast experience working with various technologies to help clients achieve their business goals.</p> <h3 id="big-data">Big Data Services</h3></li><ul class=“list” style='textalign:<br/>center'><div dir = ‘ltr’>,<span lang ="en"><b>"We provide end-to-end solution development & implementation along with training support." </ b></ span>.< / li><ol type=<i>.1" start="</ p >

=== Final Output ===

User: What big data services does Cazton offer?

Assistant: At Cazton, we specialize in delivering solutions and consulting for Big Data projects. We have worked on Hadoop, Spark, Hbase, Hive, Cassandra, NoSQL databases like MongoDB etc. Our team of experts has vast experience working with various technologies to help clients achieve their business goals.</p> <h3 id="big-data">Big D
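The cleanup loop in the cell above splits on `<|endofsentence|>` and then rejoins every unique piece, so anything generated after the first marker (here, stray HTML fragments) ends up in the final output. A minimal sketch of a stricter post-processing step that drops the echoed prompt and keeps only the text before the first marker follows; `extract_answer` is a hypothetical helper, not part of the original notebook.

```python
def extract_answer(decoded, end_marker="<|endofsentence|>"):
    """Return only the assistant's reply, cut at the first end marker."""
    # Drop the echoed prompt: keep what follows the first "Assistant:" tag,
    # or the whole string if the tag is absent
    _, sep, tail = decoded.partition("Assistant:")
    reply = tail if sep else decoded
    # Discard everything generated after the first end marker
    reply = reply.split(end_marker, 1)[0]
    return reply.strip()

example = (
    "User: What big data services does Cazton offer?\n\n"
    "Assistant: We deliver Big Data solutions.<|endofsentence|> </p> <h3>junk</h3>"
)
print(extract_answer(example))
# We deliver Big Data solutions.
```

In the notebook this would be called as `extract_answer(decoded)` in place of the split-and-join loop.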

In [1]:
### Aniq Ramzan
### Gmail : aniqramzan5758@gmail.com