<a href="https://colab.research.google.com/github/azzindani/03_LLM_Data_Preprocess/blob/main/00_LLM_Dataset_Preprocess_v6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Modules

In [1]:
!pip install -q datasets

In [2]:
import torch

from datasets import load_dataset
from datasets import Dataset

from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
  GenerationConfig
)

if torch.cuda.is_available():
  print('GPU is available!')
else:
  print('GPU is not available.')

GPU is not available.


In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

## Import Model Tokenizer

In [4]:
#url = 'https://huggingface.co/Qwen/Qwen2.5-0.5B'
#model_name = url.split('.co/')[-1]

model_name = 'unsloth/Llama-3.2-1B-Instruct'

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
#tokenizer

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Import Dataset

In [6]:
#url = 'https://huggingface.co/datasets/KingNish/reasoning-base-20k'
#dataset_name = url.split('datasets/')[-1]

dataset_name = 'mb7419/legal-advice-reddit'

In [7]:
max_length = 1024

In [8]:
dataset = load_dataset(dataset_name, split = 'train')
dataset

Dataset({
    features: ['ques_title', 'ques_text', 'ques_created', 'ques_score', 'ans_text', 'ans_created', 'ans_score', 'dominant_topic_name', '__index_level_0__'],
    num_rows: 115359
})

In [9]:
dataset.select(range(5)).to_pandas().head()

| | `ques_title` | `ques_text` | `ques_created` | `ques_score` | `ans_text` | `ans_created` | `ans_score` | `dominant_topic_name` | `__index_level_0__` |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Does comcast have the right to place wires on ... | Comcast has wires on my tree and I asked them ... | 2014-06-14 05:57:33 | 6 | You would need to check the deed to your prope... | 2014-06-14 06:39:35 | 9.0 | Landlord refusal to provide internet | 8001 |
| 1 | Out of state plaintiff suing me in question? | I am getting sued by an individual from out of... | 2016-05-30 16:24:14 | 5 | Jurisdiction for a civil suit depends on the f... | 2016-05-30 16:32:07 | 4.0 | Miscellaneous Legal Query | 40138 |
| 2 | Hiring expensive Lawyers in a will contest | Hi I m in Singapore and under UK laws without... | 2020-03-05 07:11:47 | 6 | No. All that means is that the contestants ar... | 2020-03-14 22:48:11 | 2.0 | Family Estate Disputes | 128612 |
| 3 | Homeless people living on property | My grandparents have a veterinary hospital in ... | 2019-03-14 03:22:37 | 7 | First call the police and make a trespassing ... | 2019-03-14 03:26:14 | 19.0 | Miscellaneous Legal Query | 107094 |
| 4 | (Maryland DC suburb) Rental company thinks my ... | So I rented this place in March of 2017. Litt... | 2018-03-18 19:24:56 | 78 | It s a rare day to come out with more than wha... | 2018-03-18 19:55:51 | 77.0 | Landlord Lease Deposit Disputes | 79301 |


In [10]:
dataset[0]

{'ques_title': 'Does comcast have the right to place wires on my tree?',
 'ques_text': 'Comcast has wires on my tree and I asked them before to remove them but they never do. They keep saying we ll send someone out. I live in Boston Massachusetts.',
 'ques_created': '2014-06-14 05:57:33',
 'ques_score': 6,
 'ans_text': 'You would need to check the deed to your property to see if they have an easement granting them a right to place wires on your property. You would need to check to see if the tree is in the public right of way. If the tree or wires are in the right of way  they make have right to hang those wires there. You should probably either contact a lawyer or your local public works department.',
 'ans_created': '2014-06-14 06:39:35',
 'ans_score': 9.0,
 'dominant_topic_name': 'Landlord refusal to provide internet',
 '__index_level_0__': 8001}

In [11]:
features = list(dataset.features.keys())
print(features)

['ques_title', 'ques_text', 'ques_created', 'ques_score', 'ans_text', 'ans_created', 'ans_score', 'dominant_topic_name', '__index_level_0__']


## Dataset Preprocess

In [13]:
prompt_format = """
Below is an legal questions that explain personal condition of the questioner. Write a answer that appropriately giving legal advise.

### Question:
{}

### Answer:
{}"""

In [14]:
EOS_TOKEN = tokenizer.eos_token

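def preprocess(example):
  # Called once per row by dataset.map (non-batched), so each field is a single string.
  question = example['ques_text']
  answer = example['ans_text']

  # Fill the template and append EOS so the model learns where an answer ends.
  text = prompt_format.format(question, answer) + EOS_TOKEN
  return {'prompt' : text}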
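
The function above formats one row per call. If the mapping step is later switched to `batched = True`, each column arrives as a list instead of a string, and the formatting has to loop over rows. The following is a minimal sketch of that variant, not part of the original notebook. The name `preprocess_batched` is hypothetical, and the shortened template and literal EOS string are stand-ins so the block runs without downloading the tokenizer.

```python
# Stand-ins (assumptions) so the sketch runs offline; in the notebook these
# would be the full prompt_format and tokenizer.eos_token.
EOS_TOKEN = "<|eot_id|>"
prompt_format = "### Question:\n{}\n\n### Answer:\n{}"

def preprocess_batched(examples):
    """Format a batch (dict of column name -> list of values) into prompts."""
    texts = [
        prompt_format.format(question, answer) + EOS_TOKEN
        for question, answer in zip(examples["ques_text"], examples["ans_text"])
    ]
    return {"prompt": texts}

# Toy batch shaped like what dataset.map(..., batched=True) passes in.
batch = {"ques_text": ["Q1", "Q2"], "ans_text": ["A1", "A2"]}
result = preprocess_batched(batch)
print(result["prompt"][0])
```

It would be used as `dataset.map(preprocess_batched, batched = True, remove_columns = ...)`, which typically cuts the per-row Python call overhead on a dataset of this size.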

In [15]:
formatted_dataset = dataset.map(
  preprocess,
  remove_columns = list(dataset.features.keys())
)
formatted_dataset

Map:   0%|          | 0/115359 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt'],
    num_rows: 115359
})

In [16]:
print(formatted_dataset[0]['prompt'])


Below is an legal questions that explain personal condition of the questioner. Write a answer that appropriately giving legal advise.

### Question:
Comcast has wires on my tree and I asked them before to remove them but they never do. They keep saying we ll send someone out. I live in Boston Massachusetts.

### Answer:
You would need to check the deed to your property to see if they have an easement granting them a right to place wires on your property. You would need to check to see if the tree is in the public right of way. If the tree or wires are in the right of way  they make have right to hang those wires there. You should probably either contact a lawyer or your local public works department.<|eot_id|>


## Tokenization

In [17]:
def tokenize_data(example, max_length = max_length):
  return tokenizer(example['prompt'], truncation = True, padding = 'max_length', max_length = max_length)

In [18]:
tokenized_dataset = formatted_dataset.map(tokenize_data, batched = True)
tokenized_dataset

Map:   0%|          | 0/115359 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'input_ids', 'attention_mask'],
    num_rows: 115359
})

In [19]:
print(tokenized_dataset[0]['prompt'])


Below is an legal questions that explain personal condition of the questioner. Write a answer that appropriately giving legal advise.

### Question:
Comcast has wires on my tree and I asked them before to remove them but they never do. They keep saying we ll send someone out. I live in Boston Massachusetts.

### Answer:
You would need to check the deed to your property to see if they have an easement granting them a right to place wires on your property. You would need to check to see if the tree is in the public right of way. If the tree or wires are in the right of way  they make have right to hang those wires there. You should probably either contact a lawyer or your local public works department.<|eot_id|>


In [20]:
print(tokenized_dataset[0]['input_ids'])#[-10:]

[128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004, 128004,

In [21]:
print(tokenized_dataset[0]['attention_mask'])#[-10:]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
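
Both printed lists start with a long run of identical values (`128004` in `input_ids`, `0` in `attention_mask`), which suggests that this tokenizer pads on the left and that the real prompt tokens sit at the end of each 1024-token row. A small sketch for checking this on any row follows. The helper name `describe_padding` is hypothetical, and the masks below are toy data rather than values taken from the dataset.

```python
def describe_padding(attention_mask):
    """Return (number of real tokens, padding side) for one attention mask."""
    real_tokens = sum(attention_mask)
    if real_tokens == len(attention_mask):
        return real_tokens, "none"
    # A leading 0 means the padding was added before the real tokens.
    side = "left" if attention_mask[0] == 0 else "right"
    return real_tokens, side

toy_mask = [0, 0, 0, 1, 1]  # toy example: 3 pad positions, then 2 real tokens
print(describe_padding(toy_mask))
```

In the notebook, `describe_padding(tokenized_dataset[0]['attention_mask'])` would report how much of the 1024-token budget the first example actually uses.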

## Prompt Test

In [22]:
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype = torch.float16,
  trust_remote_code = True
).to(device)

config.json:   0%|          | 0.00/927 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

In [25]:
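def assistant(prompt):
  # Fill the template with an empty answer slot so the model completes it.
  inputs = tokenizer(
    [prompt_format.format(prompt, "")],
    return_tensors = 'pt'
  ).to(device)
  # top_k = 1 keeps only the most likely token, so decoding is effectively greedy.
  generation_config = GenerationConfig(
    do_sample = True,
    top_k = 1,
    temperature = 0.1,
    max_new_tokens = 1024,
    pad_token_id = tokenizer.eos_token_id
  )
  outputs = model.generate(
    **inputs,
    generation_config = generation_config
  )
  # Decode the full sequence (prompt plus generated answer) and print it.
  print(tokenizer.decode(outputs[0], skip_special_tokens = True))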
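
The generation settings combine `do_sample = True` with `top_k = 1`. Restricting sampling to the single most likely token leaves nothing to sample from, so the result should match greedy decoding regardless of `temperature`. The toy sketch below illustrates the idea in plain Python. It is not how `transformers` implements sampling, and `sample_top_k` and the logits are hypothetical.

```python
import math
import random

def sample_top_k(logits, top_k, temperature, rng):
    """Sample a token index from the top_k highest temperature-scaled logits."""
    scaled = [value / temperature for value in logits]
    # Indices of the top_k largest scaled logits.
    candidates = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax weights over the surviving candidates (shifted for stability).
    peak = max(scaled[i] for i in candidates)
    weights = [math.exp(scaled[i] - peak) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

logits = [1.0, 3.5, 0.2, 2.9]  # toy logits; index 1 is the argmax
rng = random.Random(0)
picks = {sample_top_k(logits, top_k=1, temperature=0.1, rng=rng) for _ in range(100)}
print(picks)  # {1}
```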

In [26]:
prompt = dataset[0]['ques_text']
assistant(prompt)


Below is an legal questions that explain personal condition of the questioner. Write a answer that appropriately giving legal advise.

### Question:
Comcast has wires on my tree and I asked them before to remove them but they never do. They keep saying we ll send someone out. I live in Boston Massachusetts.

### Answer:
I am unable to provide legal advice, but I can offer some general information about the situation you may face. If you are concerned about the wires on your tree, you may want to consider the following steps:

1.  **Contact Comcast**: Reach out to Comcast's customer service department and explain your situation. They may be able to provide a more specific timeline for when they will send someone to remove the wires.
2.  **Check Local Laws**: In Massachusetts, there are laws that regulate utility companies and their interactions with property owners. You may want to research local laws that may apply to your situation.
3.  **Consider a Mediation**: If you and Comcast ar

In [24]:
print(dataset[0]['ans_text'])

You would need to check the deed to your property to see if they have an easement granting them a right to place wires on your property. You would need to check to see if the tree is in the public right of way. If the tree or wires are in the right of way  they make have right to hang those wires there. You should probably either contact a lawyer or your local public works department.
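
Because the decoded model output repeats the whole prompt, comparing it with the reference answer is easier once the text after the `### Answer:` marker is isolated. A minimal sketch, assuming the marker appears exactly as in the prompt template; the helper name `extract_answer` and the sample string are hypothetical.

```python
ANSWER_MARKER = "### Answer:"

def extract_answer(decoded_text):
    """Return the text after the first answer marker, or the input if absent."""
    _, marker, answer = decoded_text.partition(ANSWER_MARKER)
    return answer.strip() if marker else decoded_text.strip()

sample = "### Question:\nCan they do that?\n\n### Answer:\nCheck your deed first. "
print(extract_answer(sample))
```

Splitting on the first occurrence assumes the question text itself never contains the marker; a question that does would need a stricter delimiter.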
