<a href="https://colab.research.google.com/github/fatemafaria142/Advancing-Agronomic-Question-and-Answering-through-Various-Language-Models/blob/main/Agronomic_Question_and_Answering_using_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Install Required Packages**

In [1]:
!pip install accelerate peft bitsandbytes transformers trl datasets torch

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting peft
  Downloading peft-0.7.1-py3-none-any.whl (168 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
Collecting trl
  Downloading trl-0.7.9-py3-none-any.whl (141 kB)

In [2]:
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


#### **Dataset Link:** https://github.com/JonaOmara/AgroQA-Dataset/tree/main

In [3]:
import pandas as pd
dataset = pd.read_csv("/content/drive/MyDrive/AgroQA Dataset.csv")
dataset.head()

| Unnamed: 0 | Crop | Question | Answer |
|---|---|---|---|
| 0 | maize | Apart from hand weeding, what other method use... | Machinery weeders are available |
| 1 | beans | Apart from insecticide, what other method used... | Use resistant verities and increase on water a... |
| 2 | maize | Apart from sun drying which other method used ... | Use tarpaulins or cemented floor free from dust |
| 3 | cassava | Apart from sun drying, what other method can I... | Solar driers |
| 4 | beans | As a farmer when should I harvest beans. | When the beans pods are yellowish green or dry... |


In [4]:
length_of_data = len(dataset)
# Print the length
print(length_of_data)

3044


In [5]:
from datasets import Dataset, DatasetDict

# Create a DatasetDict for instruct_tune_dataset
instruct_tune_dataset = DatasetDict()

# Select the first 2700 rows for the 'train' split and only keep 'Question' and 'Answer'
instruct_tune_dataset["train"] = Dataset.from_pandas(dataset[["Question", "Answer"]].head(2700))

# Select the last 200 rows for the 'test' split and only keep 'Question' and 'Answer'
instruct_tune_dataset["test"] = Dataset.from_pandas(dataset[["Question", "Answer"]].tail(200))

# Print the updated instruct_tune_dataset
print(instruct_tune_dataset)

DatasetDict({
    train: Dataset({
        features: ['Question', 'Answer'],
        num_rows: 2700
    })
    test: Dataset({
        features: ['Question', 'Answer'],
        num_rows: 200
    })
})
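
As a quick sanity check on the split above, the following sketch works out which row ranges `head(2700)` and `tail(200)` cover in a 3044-row table. It uses only the row counts printed in this notebook and shows that the two splits do not overlap, and that some rows are left out of both.

```python
# Row counts taken from the notebook: 3044 rows, head(2700) and tail(200).
total_rows = 3044
train_rows = 2700
test_rows = 200

train_idx = range(0, train_rows)                      # rows kept by head(2700)
test_idx = range(total_rows - test_rows, total_rows)  # rows kept by tail(200)

overlap = set(train_idx) & set(test_idx)
unused = total_rows - train_rows - test_rows

print(f"train rows: {train_idx.start}..{train_idx.stop - 1}")
print(f"test rows:  {test_idx.start}..{test_idx.stop - 1}")
print(f"overlapping rows: {len(overlap)}, unused rows: {unused}")
```

Because the split is positional rather than random, the test rows all come from the end of the file. If the file is ordered in some way (the first few questions shown by `dataset.head()` look alphabetical), shuffling before splitting may give a more representative test set.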


### **Create the prompt template for each sample**

In [8]:
def create_prompt(sample):
    """
    Build the full training prompt for one sample:
    combine the fixed instructions, the question, and the answer into a single string.
    """
    bos_token = ""
    eos_token = ""

    # Use a predefined template for instructions
    instructions_template = " Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest."



    full_prompt = ""
    full_prompt += bos_token
    full_prompt += "### Instructions:"
    full_prompt += "\n" + instructions_template
    full_prompt += "\n\n### User's Specific Query:"
    full_prompt += "\n" + sample['Question']
    full_prompt += "\n\n### Agronomist's Expertise:"
    full_prompt += "\n" + sample['Answer']
    full_prompt += eos_token

    return full_prompt
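
At inference time the same layout is needed without the answer, so that the model continues from the `### Agronomist's Expertise:` header. The sketch below is one possible way to do that; `build_inference_prompt` is a hypothetical helper that is not part of the original notebook, and the instruction text is copied from `create_prompt`, including its leading space.

```python
# Instruction text copied from create_prompt (the leading space is kept on purpose
# so that the inference prompt matches the training prompt character for character).
INSTRUCTIONS = (
    " Imagine you're an experienced agronomist, entrusted with optimizing crop yields. "
    "Your expertise is crucial in guiding farmers towards success. Picture a field filled "
    "with potential, and now it's your task to provide the necessary insights for "
    "cultivating a thriving harvest."
)

def build_inference_prompt(question: str, instructions: str = INSTRUCTIONS) -> str:
    """Hypothetical helper: same layout as create_prompt, but without the answer."""
    return (
        f"### Instructions:\n{instructions}"
        f"\n\n### User's Specific Query:\n{question}"
        "\n\n### Agronomist's Expertise:"
    )

example_prompt = build_inference_prompt("As a farmer when should I harvest beans.")
print(example_prompt)
```

Keeping one helper for inference avoids small differences in headers and spacing between the prompts used for training and the prompts used for generation.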

In [9]:
create_prompt(instruct_tune_dataset["train"][0])

"### Instructions:\n Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.\n\n### User's Specific Query:\nApart from hand weeding, what other method used to weed maize\n\n### Agronomist's Expertise:\nMachinery weeders are available"

In [10]:
create_prompt(instruct_tune_dataset["train"][1])

"### Instructions:\n Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.\n\n### User's Specific Query:\nApart from insecticide, what other method used to control bean weevils?\n\n### Agronomist's Expertise:\nUse resistant verities and increase on water availability for crop vigor"

In [11]:
create_prompt(instruct_tune_dataset["train"][2])

"### Instructions:\n Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.\n\n### User's Specific Query:\nApart from sun drying which other method used for drying maize\n\n### Agronomist's Expertise:\nUse tarpaulins or cemented floor free from dust"

In [12]:
create_prompt(instruct_tune_dataset["train"][3])

"### Instructions:\n Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.\n\n### User's Specific Query:\nApart from sun drying, what other method can I use to dry cassava\n\n### Agronomist's Expertise:\nSolar driers"

In [13]:
create_prompt(instruct_tune_dataset["train"][4])

"### Instructions:\n Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.\n\n### User's Specific Query:\nAs a farmer when should I harvest beans.\n\n### Agronomist's Expertise:\nWhen the beans pods are yellowish green or dry brown"

### **Initializing the Model**
* Load the model in 4-bit NF4 precision with double quantization, and set float16 as the compute data type.

* We use the instruct-tuned model here rather than the base model, because fine-tuning a base model generally requires substantially more data.

In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype="float16", bnb_4bit_use_double_quant=True
    )
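
To give a rough feel for what 4-bit loading saves, the sketch below estimates the weight memory from the two checkpoint shard sizes shown in the download log of this notebook (9.94 GB and 4.54 GB). It assumes the checkpoint stores 16-bit weights, and it ignores quantization constants and any layers kept in higher precision, so treat the result as a ballpark figure only.

```python
# Shard sizes (GB) as reported by the download log in this notebook.
shard_sizes_gb = [9.94, 4.54]

# Assumptions: the checkpoint stores 2 bytes per parameter, and 4-bit
# quantization stores roughly 0.5 bytes per parameter.
BYTES_PER_PARAM_16BIT = 2.0
BYTES_PER_PARAM_4BIT = 0.5

checkpoint_gb = sum(shard_sizes_gb)
approx_params_billion = checkpoint_gb / BYTES_PER_PARAM_16BIT
approx_4bit_gb = approx_params_billion * BYTES_PER_PARAM_4BIT

print(f"checkpoint size:        {checkpoint_gb:.2f} GB")
print(f"approx. parameters:     {approx_params_billion:.2f} billion")
print(f"approx. 4-bit weights:  {approx_4bit_gb:.2f} GB")
```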


* https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

In [15]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

In [16]:
model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto", use_cache=False
    )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

In [17]:
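# Note: the tokenizer is loaded from the base-model repository, not the Instruct one
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# Reuse the EOS token for padding and pad on the right
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"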

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

### **Let's examine how well the model does at this task before fine-tuning:**
* `temperature=0.5` sets a moderate level of randomness. Lower values make the output more deterministic and higher values make it more varied, so adjust it to suit your use case.
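
To illustrate what the temperature does, here is a minimal pure-Python sketch of a softmax with temperature scaling. The logits are made-up numbers, not model outputs, and the function is only an illustration of the general idea, not the code path used by `model.generate`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

probs_t1 = softmax_with_temperature(toy_logits, 1.0)
probs_t05 = softmax_with_temperature(toy_logits, 0.5)

print("temperature 1.0:", [round(p, 3) for p in probs_t1])
print("temperature 0.5:", [round(p, 3) for p in probs_t05])
```

With the lower temperature, more probability mass goes to the highest-scoring token, which is why sampled text becomes less varied.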

In [18]:
def generate_response(prompt, model):
  # Tokenize the prompt and move the tensors to the GPU
  encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  # Sample up to 1024 new tokens at temperature 0.5
  generated_ids = model.generate(**model_inputs, max_new_tokens=1024, temperature=0.5, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  # Strip the prompt so that only the newly generated text is returned
  return decoded_output[0].replace(prompt, "")

In [19]:
prompt='''### Instruction:\nImagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.\n\n
###User's Specific Query: Apart from sun drying which other method used for drying maize\n\n
### Agronomist's Expertise:'''

In [20]:
generate_response(prompt, model)



"<s>  \n\nApart from sun drying, there are several methods used for drying maize. Some of these include:\n\n1. Artificial drying: This involves using machines such as driers or dryers to dry the maize. Driers work by using hot air to dry the maize, while dryers use microwaves to dry it.\n\n2. Kiln drying: This involves drying the maize in a kiln, which is a building with a high temperature. The heat from the kiln dries the maize, making it more durable and less prone to moisture loss.\n\n3. Mechanical drying: This involves using a machine to dry the maize. Mechanical dryers work by blowing hot air over the maize, which causes the moisture to evaporate.\n\n4. Natural drying: This involves leaving the maize to dry naturally, usually in a well-ventilated area. Natural drying is a low-cost method, but it can be slow and may not produce the same quality of maize as other methods.\n\nIt's important to note that the choice of drying method will depend on factors such as the type of maize bein

### **Setting up the Training**
We will use the Hugging Face `transformers` and `peft` libraries.

In [21]:
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")


* The model must be prepared for training in 4-bit precision, so we use the **`prepare_model_for_kbit_training`** function from `peft`.




In [22]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
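
To see why LoRA with `r=8` is cheap to train, the sketch below counts the adapter parameters for a rank-8 update. The layer shapes, the number of layers, and the choice of adapted modules (`q_proj` and `v_proj`) are assumptions made for illustration; the notebook does not state them, so the total is an estimate, not a measured value.

```python
# LoRA settings from the notebook
r = 8
lora_alpha = 16

# Assumed, not stated in the notebook: (d_in, d_out) of each adapted
# projection and the number of decoder layers.
assumed_modules = {
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),
}
assumed_num_layers = 32

def lora_param_count(d_in, d_out, rank):
    # LoRA adds two small matrices: A (rank x d_in) and B (d_out x rank)
    return rank * d_in + d_out * rank

per_layer = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in assumed_modules.values())
total_lora_params = per_layer * assumed_num_layers

# The low-rank update is multiplied by lora_alpha / r before it is added
scaling = lora_alpha / r

print(f"adapter parameters per layer: {per_layer:,}")
print(f"adapter parameters in total:  {total_lora_params:,}")
print(f"LoRA scaling factor:          {scaling}")
```

Under these assumptions the adapters hold a few million parameters, a tiny fraction of the roughly seven billion parameters in the frozen base model.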

### **Training Hyperparameters**
The right hyperparameters depend on how long you want to train. Pay special attention to the following:

* `num_train_epochs` / `max_steps`: control how many passes or optimizer steps are made over the data. Too many can lead to overfitting. When both are set, `max_steps` takes precedence.

* `learning_rate`: controls the size of each update and therefore how quickly the model converges. A value that is too high can make training unstable.

In [23]:
from transformers import TrainingArguments
output_model= "mistral_Agronomic_generation"
training_arguments = TrainingArguments(
        output_dir=output_model,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=3,   # has no effect here: max_steps below takes precedence
        max_steps=100,
        fp16=True,
)
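
The settings above imply an effective batch size and a learning-rate curve that are easy to work out by hand. The sketch below does so in plain Python. It assumes no warmup (none is configured in this notebook) and uses the standard half-cosine decay formula as an approximation of `lr_scheduler_type="cosine"`; it does not call the `transformers` scheduler itself.

```python
import math

# Values from the TrainingArguments in this notebook
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
learning_rate = 2e-4
max_steps = 100

# Sequences consumed per optimizer step (single device assumed)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
sequences_seen = effective_batch_size * max_steps

def cosine_lr(step, total_steps, base_lr):
    """Half-cosine decay from base_lr to 0, assuming zero warmup steps."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"effective batch size: {effective_batch_size}")
print(f"sequences seen in {max_steps} steps: {sequences_seen}")
for step in (0, 25, 50, 75, 100):
    print(f"step {step:3d}: lr = {cosine_lr(step, max_steps, learning_rate):.2e}")
```

The learning rate starts at its full value, reaches half of it at the midpoint, and decays to zero by the final step.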


### **Setting up the trainer**

`max_seq_length`: maximum length, in tokens, of each training sequence


In [24]:
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  formatting_func=create_prompt, # this will apply the create_prompt mapping to both the training and test datasets
  args=training_arguments,
  train_dataset=instruct_tune_dataset["train"],
  eval_dataset=instruct_tune_dataset["test"]
)

Generating train split: 0 examples [00:00, ? examples/s]
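
With `packing=True`, several short question/answer prompts end up in the same fixed-length training sequence instead of each being padded separately. The sketch below illustrates the general idea with toy token IDs. `pack_sequences` is a hypothetical, simplified helper and not the TRL implementation; the separator token and the dropping of the incomplete tail are assumptions of this sketch.

```python
def pack_sequences(tokenized_samples, seq_length, eos_id):
    """Concatenate samples (each followed by eos_id) and cut the stream
    into chunks of exactly seq_length tokens; an incomplete tail is dropped."""
    buffer = []
    for ids in tokenized_samples:
        buffer.extend(ids)
        buffer.append(eos_id)
    n_chunks = len(buffer) // seq_length
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]

# Toy "tokenized" samples of different lengths (made-up IDs)
toy_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(toy_samples, seq_length=4, eos_id=0)

for chunk in packed:
    print(chunk)
```

Note that a packed chunk can start or end in the middle of a sample, which is the price paid for not wasting tokens on padding.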




### **Training starts here**

In [25]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 10 | 0.8153 |
| 20 | 0.5455 |
| 30 | 0.4406 |
| 40 | 0.416 |
| 50 | 0.4096 |
| 60 | 0.4096 |
| 70 | 0.4033 |
| 80 | 0.3837 |
| 90 | 0.3827 |
| 100 | 0.3844 |




TrainOutput(global_step=100, training_loss=0.459065637588501, metrics={'train_runtime': 1591.5403, 'train_samples_per_second': 0.251, 'train_steps_per_second': 0.063, 'total_flos': 1.74835334381568e+16, 'train_loss': 0.459065637588501, 'epoch': 1.41})
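
The numbers in `TrainOutput` can be cross-checked against the logged losses and the training arguments. The sketch below recomputes them from values printed in this notebook. The final line is an interpretation rather than a reported metric: it assumes that the trainer counts packed sequences as samples, which would explain why 100 steps already cover more than one epoch.

```python
# Values copied from the training log and TrainOutput in this notebook
logged_losses = [0.8153, 0.5455, 0.4406, 0.416, 0.4096,
                 0.4096, 0.4033, 0.3837, 0.3827, 0.3844]
global_step = 100
train_runtime = 1591.5403   # seconds
reported_epoch = 1.41
per_device_train_batch_size = 1
gradient_accumulation_steps = 4

mean_logged_loss = sum(logged_losses) / len(logged_losses)
sequences_seen = global_step * per_device_train_batch_size * gradient_accumulation_steps
steps_per_second = global_step / train_runtime
samples_per_second = sequences_seen / train_runtime

# Assumption: one "sample" is one packed sequence
approx_sequences_per_epoch = sequences_seen / reported_epoch

print(f"mean of logged losses:      {mean_logged_loss:.4f}")
print(f"steps per second:           {steps_per_second:.3f}")
print(f"samples per second:         {samples_per_second:.3f}")
print(f"approx. sequences / epoch:  {approx_sequences_per_epoch:.0f}")
```

The mean of the ten logged losses agrees with the reported `training_loss`, and the throughput figures agree with `train_steps_per_second` and `train_samples_per_second`.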

### **Save the model**

In [26]:
trainer.save_model("mistral_Agronomic_generation")

In [27]:
merged_model = model.merge_and_unload()
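
Conceptually, merging folds each low-rank update back into the weight it adapts, so the adapter no longer has to be applied as a separate step. The sketch below shows that arithmetic on a toy 2x2 weight in plain Python. `merge_lora` is a hypothetical helper for illustration only; it does not reproduce what `peft` does internally, in particular not the handling of 4-bit quantized weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(weight, lora_A, lora_B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scaling = lora_alpha / r
    delta = matmul(lora_B, lora_A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy example with rank r = 1 (all numbers are made up)
toy_weight = [[1.0, 0.0],
              [0.0, 1.0]]
toy_A = [[1.0, 2.0]]     # shape: r x d_in
toy_B = [[0.5],
         [0.0]]          # shape: d_out x r

merged = merge_lora(toy_weight, toy_A, toy_B, lora_alpha=2, r=1)
print(merged)
```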



In [28]:
# Same helper as before, except that the prompt is kept in the returned text
def generate_response(prompt, model):
  # Tokenize the prompt and move the tensors to the GPU
  encoded_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  # Sample up to 1024 new tokens at temperature 0.5
  generated_ids = model.generate(**model_inputs, max_new_tokens=1024, temperature=0.5, do_sample=True, pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

### **Example: 1**

In [30]:
# Example usage
prompt = "### Instructions:\nImagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest."
prompt += "\n### User's Specific Query:\nAs a farmer, How can I control pest"
prompt += "\n\n### Agronomist's Expertise:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)

<s> ### Instructions:
Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.
### User's Specific Query:
As a farmer, How can I control pest

### Agronomist's Expertise:
You can use pesticides, biological control, crop rotation, and cultural practices to control pests. It's important to identify the specific pest and understand its life cycle to effectively manage it.</s>


### **Example: 2**

In [31]:
# Example usage
prompt = "### Instructions:\nImagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest."
prompt += "\n### User's Specific Query:\nAs a farmer which method is good to prevent soil erosion"
prompt += "\n\n### Agronomist's Expertise:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)

<s> ### Instructions:
Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.
### User's Specific Query:
As a farmer which method is good to prevent soil erosion

### Agronomist's Expertise:
You can use crop rotation, conservation tillage, contour ploughing, use of cover crops and mulching to prevent soil erosion.</s>


### **Example: 3**

In [32]:
# Example usage
prompt = "### Instructions:\nImagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest."
prompt += "\n### User's Specific Query:\nAs a farmer when should I harvest beans."
prompt += "\n\n### Agronomist's Expertise:"
response = generate_response(prompt, model)

# Print the response with formatted output
print(response)

<s> ### Instructions:
Imagine you're an experienced agronomist, entrusted with optimizing crop yields. Your expertise is crucial in guiding farmers towards success. Picture a field filled with potential, and now it's your task to provide the necessary insights for cultivating a thriving harvest.
### User's Specific Query:
As a farmer when should I harvest beans.

### Agronomist's Expertise:
Beans should be harvested when the pods are fully developed and the seeds inside are mature. This usually happens around 90-120 days after planting, depending on the bean variety and growing conditions. It's important to time your harvest right because if you wait too long, the beans may lose their quality and become less productive.</s>
