<a href="https://colab.research.google.com/github/chandanamn45/Dream-Analyzer1/blob/main/dream_analyzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.12.9: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,  # trades extra compute for lower memory use
    random_state=3407,                # seed for reproducible LoRA initialization
    use_rslora=False,                 # rank-stabilized LoRA scaling is disabled
    loftq_config=None,                # no LoftQ initialization
)


Unsloth 2024.12.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
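
As a sanity check on the adapter size, the trainable-parameter count can be reproduced by hand. The sketch below assumes the standard Llama-3-8B projection shapes (hidden size 4096, key/value projection width 1024, MLP width 14336); only the `q_proj` shape is visible in the truncated model printout, so the other shapes are assumptions. Each LoRA adapter contributes `r * (in_features + out_features)` weights, and the helper name `lora_param_count` is hypothetical.

```python
# LoRA adds two matrices per adapted layer: A (in_features x r) and B (r x out_features).
R = 16
NUM_LAYERS = 32

# (in_features, out_features) per target module. Only q_proj is confirmed by the
# model printout; the rest are assumed from the public Llama-3-8B configuration.
MODULE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

def lora_param_count(shapes, r, num_layers):
    """Sum r * (in + out) over all adapted modules, then multiply by the layer count."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(MODULE_SHAPES, R, NUM_LAYERS)
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the result equals the "Number of trainable parameters" figure printed in the training banner.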


In [5]:
model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128255)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

In [6]:
chat_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [7]:

from datasets import load_dataset

EOS_TOKEN = tokenizer.eos_token  # appended so the model learns where a response ends

# Build one training string per example from its instruction, input, and output
def formatting_prompts_func(examples):
    texts = []
    for instruction, input_text, output_text in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        # Without EOS_TOKEN the model never learns to stop generating
        text = f"{instruction}\nInput: {input_text}\nOutput: {output_text}{EOS_TOKEN}"
        texts.append(text)

    return {"text": texts}

# Load the JSON dataset
dataset = load_dataset("json", data_files="/content/dream_data.json", split="train")

# Map the formatting function over the dataset
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)

# Tokenize the formatted dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True)

# Check the result
print(tokenized_dataset[0])



{'instruction': 'Analyze the dream and provide therapeutic suggestions.', 'input': 'Dream Content: I kill my kids (shooting with a gun or drowning); then kill myself.\nProfession: Recreation manager\nDream Pattern Type: Recurring nightmare\nTrigger Events: Witnessed a woman?s suicide at work, past trauma of mother?s death.', 'output': 'Emotion Tag: Guilt, fear\nNegative Thoughts: Helplessness, guilt\nRevised Dream Content: Using lifeguard assistance to save children; nightmares eventually stopped.\nTherapeutic Outcome: Partially resolved, further revisions needed.', 'text': 'Analyze the dream and provide therapeutic suggestions.\nInput: Dream Content: I kill my kids (shooting with a gun or drowning); then kill myself.\nProfession: Recreation manager\nDream Pattern Type: Recurring nightmare\nTrigger Events: Witnessed a woman?s suicide at work, past trauma of mother?s death.\nOutput: Emotion Tag: Guilt, fear\nNegative Thoughts: Helplessness, guilt\nRevised Dream Content: Using lifeguard 
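
The training text above uses an `Input:` / `Output:` layout, whereas the inference cells later in this notebook prompt the model with the `### Instruction` template. A minimal sketch, assuming one wants both stages to share that template: the function name `format_with_template` and the `"</s>"` placeholder EOS string are hypothetical, and in the notebook `tokenizer.eos_token` would take the placeholder's place.

```python
CHAT_PROMPT = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

def format_with_template(examples, eos_token):
    """Batched formatter: one templated string per (instruction, input, output) row."""
    texts = [
        CHAT_PROMPT.format(instruction, input_text, output_text) + eos_token
        for instruction, input_text, output_text in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# A tiny hand-written batch in the same column layout as the dataset (illustrative only).
batch = {
    "instruction": ["Analyze the dream and provide therapeutic suggestions."],
    "input": ["Dream Content: Lab equipment malfunctioning.\nProfession: Scientist"],
    "output": ["Emotion Tag: Frustration, panic"],
}
formatted = format_with_template(batch, eos_token="</s>")
print(formatted["text"][0])
```

With `datasets`, a function like this could be applied through `dataset.map(format_with_template, batched=True, fn_kwargs={"eos_token": EOS_TOKEN})`.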

In [8]:
dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 10241
})

In [9]:
import pprint
# Print a few examples to show what the data looks like
pprint.pprint(dataset[250])
pprint.pprint(dataset[260])
pprint.pprint(dataset[270])

{'input': 'Dream Content: A scientist dreamed of their lab equipment '
          'malfunctioning during a crucial experiment.\n'
          'Profession: Scientist\n'
          'Dream Pattern Type: Recurring professional nightmare\n'
          'Trigger Events: Pressure to succeed with ongoing research projects.',
 'instruction': 'Analyze the dream and provide therapeutic suggestions.',
 'output': 'Emotion Tag: Frustration, panic\n'
           'Negative Thoughts: Fear of wasting valuable research time.\n'
           'Revised Dream Content: Dream revised to depict successful and '
           'accurate scientific findings.\n'
           'Therapeutic Outcome: Resolved by visualizing successful experiment '
           'results despite the equipment issues.'}
{'input': 'Dream Content: A teacher dreamed of students refusing to '
          'participate in a class activity, leaving them frustrated.\n'
          'Profession: Teacher\n'
          'Dream Pattern Type: Symbolic recurring nightmare\n'
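
Each `input` and `output` string packs several `Label: value` lines. A minimal sketch, assuming every line follows that pattern, of splitting such a string into a dictionary; `parse_fields` is a hypothetical helper and not part of the notebook.

```python
def parse_fields(block):
    """Split 'Label: value' lines into a dict, keeping any extra colons in the value."""
    fields = {}
    for line in block.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # skip lines that have no label separator
            fields[label.strip()] = value.strip()
    return fields

# Copied from the dataset[250] example printed above.
sample_input = (
    "Dream Content: A scientist dreamed of their lab equipment "
    "malfunctioning during a crucial experiment.\n"
    "Profession: Scientist\n"
    "Dream Pattern Type: Recurring professional nightmare\n"
    "Trigger Events: Pressure to succeed with ongoing research projects."
)
parsed = parse_fields(sample_input)
print(parsed["Profession"])
print(parsed["Dream Pattern Type"])
```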

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_dataset,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        per_device_train_batch_size=10,
        gradient_accumulation_steps=6,
        warmup_steps=100,
        num_train_epochs=1,
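        max_steps=-1,  # -1 lets num_train_epochs determine the run length
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),  # use fp16 only when bf16 is unavailable (e.g. Tesla T4)
        bf16=torch.cuda.is_bf16_supported(),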
        logging_steps=50,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        save_steps=500,
    ),
)

In [14]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 5,130 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 10 | Gradient Accumulation steps = 6
\        /    Total batch size = 60 | Total steps = 85
 "-____-"     Number of trainable parameters = 41,943,040
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step    Training Loss
50      13.0755
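
The figures in the training banner follow from the arguments. The sketch below assumes the optimizer-step count is the number of per-device batches divided (rounding down) by the accumulation steps, and that linear warmup scales the learning rate by `step / warmup_steps`; both are assumptions about the trainer and scheduler, not output from this run. Under those assumptions, `warmup_steps=100` exceeds the 85 total steps, so the learning rate would still be ramping up when training ends.

```python
import math

num_examples = 5130        # "Num examples" from the banner
per_device_batch = 10
grad_accum = 6
warmup_steps = 100
peak_lr = 2e-4

# One optimizer step consumes per_device_batch * grad_accum examples on a single GPU.
effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(num_examples / per_device_batch)
total_steps = max(batches_per_epoch // grad_accum, 1)

def warmup_lr(step):
    """Learning rate during linear warmup (assumed schedule shape)."""
    return peak_lr * min(step / warmup_steps, 1.0)

final_lr = warmup_lr(total_steps)
print(effective_batch, total_steps, final_lr)
```

If reaching the configured peak matters, a warmup of a few steps (well below the total step count) would be the usual choice.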


In [15]:
FastLanguageModel.for_inference(model) # For faster Inference

inputs = tokenizer(
[
    chat_prompt.format(
        "Analyze the dream",
        "I was falling from a high raised building", # input
        "", # output - leave this blank!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>\n### Instruction:\nAnalyze the dream\n\n### Input:\nI was falling from a high raised building\n\n### Response:\nDreams are a way to express your emotions and thoughts. If you are falling from a high raised building, it means that you are feeling insecure about something. Maybe you are feeling that you are not good enough or that you are not capable of doing something. It is important to remember that dreams are just a way to']

In [16]:
model.save_pretrained("dream_model")
tokenizer.save_pretrained("dream_model")
# model.push_to_hub("your_name/lora_model", token = "...") # If you want to save online
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # If you want to save online

('dream_model/tokenizer_config.json',
 'dream_model/special_tokens_map.json',
 'dream_model/tokenizer.json')

In [17]:
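# Zip the saved model folder so it can be downloaded from Colab
import shutil

folder_path = "/content/dream_model"
zip_file_path = "/content/dream_model.zip"

# make_archive appends ".zip" itself, so pass the path without the extension
shutil.make_archive(zip_file_path.replace(".zip", ""), 'zip', folder_path)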

'/content/dream_model.zip'
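
`shutil.make_archive` takes the archive path without its extension and returns the full path of the file it wrote. A self-contained sketch using a temporary directory; the file name `adapter_config.json` is only a stand-in for the saved model files.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a small stand-in for the saved model folder.
    folder_path = os.path.join(tmp, "dream_model")
    os.makedirs(folder_path)
    with open(os.path.join(folder_path, "adapter_config.json"), "w") as f:
        f.write("{}")

    # base_name has no extension; make_archive adds ".zip" and returns the final path.
    archive_path = shutil.make_archive(os.path.join(tmp, "dream_model"), "zip", folder_path)

    # Read the archive back to confirm what was stored.
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(os.path.basename(archive_path), archived_names)
```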

In [18]:
# To reuse the model later, load it from the saved folder (change the condition to True to run this cell):

if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "dream_model", #model folder
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model)

In [19]:
chat_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [20]:
inputs = tokenizer(
    [
        chat_prompt.format(
            "Analyze the dream",  # instruction
            "Dream : I was falling from a high raised building",  # input
            "",  # output
        )
    ],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
decoded_output = tokenizer.batch_decode(outputs)[0]

response = decoded_output.split("### Response:")[-1].strip()
response = response.split("<|end_of_text|>")[0].strip()

print(response)

1. The dream is about the fear of falling
2. The dream is about the fear of heights
3. The dream is about the fear of death
4. The dream is about the fear of failure
5. The dream is about the fear of success
6. The dream is about the fear of
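
The string handling in the cell above can be wrapped in a small helper. This is a sketch: `extract_response` is a hypothetical name, and the sample string is an abridged copy of the decoded output shown earlier in the notebook, with `<|end_of_text|>` appended for illustration.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded, eos_token="<|end_of_text|>"):
    """Return the text after the last response marker, cut at the first EOS token."""
    response = decoded.split(RESPONSE_MARKER)[-1]
    response = response.split(eos_token)[0]
    return response.strip()

decoded = (
    "<|begin_of_text|>\n### Instruction:\nAnalyze the dream\n\n"
    "### Input:\nI was falling from a high raised building\n\n"
    "### Response:\nDreams are a way to express your emotions and thoughts."
    "<|end_of_text|>"
)
print(extract_response(decoded))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the special tokens before splitting on the marker.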


In [21]:
!apt-get install git


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git is already the newest version (1:2.34.1-1ubuntu1.11).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [22]:
!git config --global user.name "chandanamn45"
!git config --global user.email "chandanamn.cs23@rvce.edu.in"


In [23]:
!git clone https://github.com/chandanamn45/Dream-analyzer


Cloning into 'Dream-analyzer'...
fatal: could not read Username for 'https://github.com': No such device or address
