# Goal: applying the Valhalla/distilbart-mnli-12-3 model
# Introduction:  
### Valhalla is a distilled version of facebook/bart-barge-mnli
It uses 12 encoder layers and 3 decoder layers, offering a good balance between performance and efficiency
#### A. Model Name: valhalla/distilbart-mnli-12-3
#### B. Author: Valhalla (a well-known NLP researcher on Hugging Face)
#### C. Base architecture: BART — a sequence-to-sequence transformer model developed by Facebook AI
#### D. Type: Distilled BART → smaller and faster version of BART
#### E. Trained on: MNLI dataset (Multi-Genre Natural Language Inference)
#### It classifies whether the hypothesis is:
 entailment (text supports it)
 neutral
 contradiction
#### The zero-shot-classification : this model has been pretrained on large amount of general data. We don#t need to collect and label for new task.
#### Instead of retraining on your labels, it tests whether your input “entails” each label phrase.
Example:{
 'sequence': 'I think this movie is too long and a bit boring.',
 'labels': ['negative', 'neutral', 'positive'],
 'scores': [0.87, 0.10, 0.03]
}
# 1. Zero shot classifiction combined with regex:
##### It extracts location, size, price, and amenities, then saves everything in a clean JSON format
# 2. No Fine Tuning:
##### Fine-tuning presents several challenges in this case. The original model is configured to output three classes, which means we would need to reinitialize the base model to accommodate our specific task. Additionally, training with mixed precision (fp16) requires a CUDA-enabled GPU, which may not be available in all environments. For multi-label classification, the model architecture must be modified to use BCEWithLogitsLoss instead of the default loss function. Personally, I believe this model is not well-suited for our application. Therefore, we will proceed with a zero-shot approach instead.


In [3]:
!pip install transformers sentencepiece torch



In [4]:
from transformers import pipeline
import re
import json
import pandas as pd

  from .autonotebook import tqdm as notebook_tqdm


In [5]:
# Introduce the model and build the pipeline

classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3")


Device set to use mps:0


In [6]:
# Load Housing requests
df=pd.read_csv("requests.csv")
df.head()
df["request_lower"]=df["request"].str.lower()


In [None]:
paired_requests=[] 
# Etract the input text 
for i, text_lower in enumerate(df["request_lower"], start=1): print(f"\nPreprocessing Request{i}: {text_lower}") 
# distrct if the district is mentioned in the text, Mitte, Friedrichshain-Kreuzberg, Neukölln, Charlottenburg-Wilmersdorf, Tempelhof-Schöneberg, Treptow-Köpenick, Lichtenberg, Marzahn-Hellersdorf, Spandau, Steglitze-Zehlendorf, Reinickendorf, Pankow, show thg district, if none, show any districts=["Mitte", "Friedrichshain-Kreuzberg", "Neukölln", "Charlottenburg-Wilmersdorf", "Tempelhof-Schöneberg", "Treptow-Köpenick", "Lichtenberg", "Marzahn-Hellersdorf", "Spandau", "Steglitze-Zehlendorf", "Reinickendorf", "Pankow"] if any(district.lower() in text_lower for district in districts): 
district=[district for district in districts if district.lower() in text_lower][0] else: district="any" print("District:", district) 
# size of the apartment, if mentioned, otherwise any 
bedroom_pattern = r'(\d+)\s*(?:bedroom|bedrooms|rooms|room)' 
bedroom_match = re.search(bedroom_pattern, text_lower) 
if bedroom_match: bedroom = bedroom_match.group(1) else: bedroom = "any" 
print("Bedroom:", bedroom) 
# size, default size if not mentioned 
size_pattern = r'(\d{2,4})\s*(?:m2|sqm|square meters|sq meters|sq\.?m|square metre|size)' 
size_match = re.search(size_pattern, text_lower) if size_match: size = int(size_match.group(1)) else: size ="any" 
## price, if mentioned, otherwise any
price_pattern = r'(\d{3,5})\s*(?:eur|euro|€)' price_match = re.search(price_pattern, text_lower) 
if price_match: price = price_match.group(1) else: price = "any" print("Price:", price) 
# if mentioned about kindergarten, parks, public transport, shopping mall, gym, hospital, school, banks, pharmacy, university, restaurant, cafe, bar,crime rate) 
labels = ["kindergarten", "parks", "public transport", "hospitals", "schools", "banks", "universities", "restaurants", "cafes", "bars"] 
results_zs=classifier(text_lower, labels) 
# score >0.9 identify as exist 
def parse_score(label, results, threshold=0.3): 
    idx= results["labels"].index(label) 
    score=results["scores"][idx] 
    return f">{threshold}" if score > threshold else "any" 

kindergarten= parse_score("kindergarten", results_zs) 
parks= parse_score("parks", results_zs) 
public_transport=parse_score("public transport",results_zs) 
hospitals=parse_score("hospitals", results_zs) 
banks=parse_score("banks",results_zs) 
universities=parse_score("universities", results_zs) 
restaurants=parse_score("restaurants", results_zs) 
cafes=parse_score("cafes", results_zs) 
bars=parse_score("bars", results_zs) 
# Output 
structured_request={ "district": district, "bedroom": bedroom, "price":price, "size":size, "parks":parks, "kindergarten":kindergarten, "public_transport":public_transport, "hospitals":hospitals, "banks":banks, "restaurants":restaurants, "cafes":cafes, "bars": bars } 
# Pair original with structured 
paired_requests.append({ "original_request": text_lower, "structured_request": structured_request })
 # Show both 
 print(json.dumps(paired_requests[-1], indent=2, ensure_ascii=False)) 
# ✅ Save all structured requests to a JSON file 
with open("paired_requests.json", "w", encoding="utf-8") as f: json.dump(paired_requests, f, indent=2, ensure_ascii=False)



Preprocessing Request1: i would like a apartment nearby the mitte of berlin, rent price less than 2000

Preprocessing Request2: i would like a small room but can have a pet , price is less than $200

Preprocessing Request3: i like a 2 rooms and neaby mrt, bus stations. the price is less than $2000

Preprocessing Request4: i would like to 3 bedrooms, nearby the mitte district, and rent price less than 2000

Preprocessing Request5: i would like to have 100m2 and price less than 1000 euro

Preprocessing Request6: i would like an apartment 100m2 , price less than 1000 euro, nearby kindergarten and transport stops
District: any
Bedroom: any
Price: 1000
{
  "original_request": "i would like an apartment 100m2 , price less than 1000 euro, nearby kindergarten and transport stops",
  "structured_request": {
    "district": "any",
    "bedroom": "any",
    "size": 100,
    "price": "1000",
    "parks": "any",
    "kindergarten": ">0.3",
    "public_transport": ">0.3",
    "hospitals": "any",
  

# chage to group_labels

In [15]:


# Initialize zero-shot classifier
classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3")

paired_requests = []

# --- Define grouped semantic labels ---
group_labels = {
    "family": ["near kindergarten", "good for families", "children-friendly neighborhood","nearby kindergarten"],
    "safety": ["safe neighborhood", "low crime rate", "quiet area"],
    "transport": ["near public transport", "nearby subway or bus stop", "convenient location","Ubahn","bus station"],
    "nature": ["near parks or pools", "green surroundings", "recreational areas nearby"],
    "convenience": ["near shops or restaurants or bars", "cafes nearby", "shopping area nearby"],
    "healthcare": ["near hospital or clinic", "good medical facilities nearby"],
    "education": ["near school", "near university or college"],
   
}

# --- Loop through user requests ---
for i, text_lower in enumerate(df["request_lower"], start=1):
    print(f"\nPreprocessing Request {i}: {text_lower}")

    # --- Extract district ---
    districts = [
        "Mitte", "Friedrichshain-Kreuzberg", "Neukölln", "Charlottenburg-Wilmersdorf",
        "Tempelhof-Schöneberg", "Treptow-Köpenick", "Lichtenberg", "Marzahn-Hellersdorf",
        "Spandau", "Steglitze-Zehlendorf", "Reinickendorf", "Pankow"
    ]
    if any(district.lower() in text_lower for district in districts):
        district = [district for district in districts if district.lower() in text_lower][0]
    else:
        district = "any"
    print("District:", district)

    # --- Extract number of bedrooms ---
    bedroom_pattern = r'(\d+)\s*(?:bedroom|bedrooms|rooms|room)'
    bedroom_match = re.search(bedroom_pattern, text_lower)
    bedroom = bedroom_match.group(1) if bedroom_match else "any"
    print("Bedroom:", bedroom)

    # --- Extract size (m²) ---
    size_pattern = r'(\d{2,4})\s*(?:m2|sqm|square meters|sq meters|sq\.?m|square metre|size)'
    size_match = re.search(size_pattern, text_lower)
    size = int(size_match.group(1)) if size_match else "any"

    # --- Extract price (€) ---
    price_pattern = r'(?:less than|under|below|<|up to |maximum|max)?\s*([$€]?\s*\d{2,5})\s*(?:eur|euro|€|usd|dollars)?'
    price_match = re.search(price_pattern, text_lower)
    price = price_match.group(1) if price_match else "any"
    print("Price:", price)

    # --- Run zero-shot classification across all group labels ---
    all_labels = [label for group in group_labels.values() for label in group]
    results_zs = classifier(text_lower, all_labels)

    # --- Function to parse scores for each group ---
    def parse_group_score(results, group_label_list, threshold=0.7):
        scores = []
        for label in group_label_list:
            if label in results["labels"]:
                idx = results["labels"].index(label)
                scores.append(results["scores"][idx])
        return ">0.8" if scores and max(scores) > threshold else "any"

    # --- Evaluate each group ---
    parsed_groups = {
        group_name: parse_group_score(results_zs, group_labels[group_name], threshold=0.8)
        for group_name in group_labels
    }

    # --- Structured output ---
    structured_request = {
        "district": district,
        "bedroom": bedroom,
        "size": size,
        "price": price,
    }
    structured_request.update(parsed_groups)

    # --- Pair original with structured ---
    paired_requests.append({
        "original_request": text_lower,
        "structured_request": structured_request
    })

    # --- Print result for current request ---
    print(json.dumps(paired_requests[-1], indent=2, ensure_ascii=False))

# --- Save all structured requests to JSON file ---
with open("paired_requests.json", "w", encoding="utf-8") as f:
    json.dump(paired_requests, f, indent=2, ensure_ascii=False)

print("\n✅ All structured requests saved to paired_requests.json")


Device set to use mps:0



Preprocessing Request 1: i would like a apartment nearby the mitte of berlin, rent price less than 2000
District: Mitte
Bedroom: any
Price: 2000
{
  "original_request": "i would like a apartment nearby the mitte of berlin, rent price less than 2000",
  "structured_request": {
    "district": "Mitte",
    "bedroom": "any",
    "size": "any",
    "price": "2000",
    "family": "any",
    "safety": "any",
    "transport": "any",
    "nature": "any",
    "convenience": "any",
    "healthcare": "any",
    "education": "any"
  }
}

Preprocessing Request 2: i would like a small room but can have a pet , price is less than $200
District: any
Bedroom: any
Price: $200
{
  "original_request": "i would like a small room but can have a pet , price is less than $200",
  "structured_request": {
    "district": "any",
    "bedroom": "any",
    "size": "any",
    "price": "$200",
    "family": "any",
    "safety": "any",
    "transport": "any",
    "nature": "any",
    "convenience": "any",
    "healt

### The Result can check the paired_request.json

# Uning Fine tuning

In [43]:
#Using fine tuning , install packages
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoConfig, AutoModelForAudioClassification
from peft import LoraConfig, get_peft_model, PeftType



In [46]:
from transformers import BartForSequenceClassification

class BartForSeqClsWithExtraKwargs(BartForSequenceClassification):
    def forward(self, *args, **kwargs):
        # Remove any unexpected kwargs
        kwargs.pop("num_items_in_batch", None)
        return super().forward(*args, **kwargs)


In [48]:

# ✅ Step 2: Load tokenizer and base model
model_name = "valhalla/distilbart-mnli-12-3"
num_labels=7
config= AutoConfig.from_pretrained(model_name, num_labels=num_labels, problem_type="multi_label_classification")

tokenizer = AutoTokenizer.from_pretrained(model_name)

model=BartForSeqClsWithExtraKwargs.from_pretrained(model_name, config=config, ignore_mismatched_sizes= True)# original size is 3
# Number of groups as labels
group_labels = {
    "family": ["near kindergarten", "good for families", "children-friendly neighborhood","nearby kindergarten"],
    "safety": ["safe neighborhood", "low crime rate", "quiet area"],
    "transport": ["near public transport", "nearby subway or bus stop", "convenient location","Ubahn","bus station"],
    "nature": ["near parks or pools", "green surroundings", "recreational areas nearby"],
    "convenience": ["near shops or restaurants or bars", "cafes nearby", "shopping area nearby"],
    "healthcare": ["near hospital or clinic", "good medical facilities nearby"],
    "education": ["near school", "near university or college"]
}



# ✅ Step 3: Add LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS"
)
model = get_peft_model(model, lora_config)

# ✅ Step 4: Create small training dataset
data = {
    "text": [
        "I want an apartment near kindergarten and transport stops, less crime.",
        "Looking for a house far from city center, very safe, near school and hospitals.",
        "Apartment under 1000 euro, no kids, doesn’t matter about transport.",
        "I want an apartment near a kindergarten and subway, safe area, under 1000 Euro",
        "Two bedrooms close to subway, low crime, nearby parks, price less than 1200 euro.",
        "2 rooms nearby supermarket and bars or restaurants, price less than 800 Euro.",
        "3-bedrooms, near public transport, shopping malls, and safe area",
        "Apartment near a hospital and clinic, good medical facilities, quiet area.",
        "Looking for a flat close to a hospital and pharmacy, suitable for elderly family members."
    ],
    "labels": [
        [1,1,1,1,0,0,0],  # family ✅, safety ✅, transport ✅, nature ❌, convenience ❌, healthcare ❌, education ❌
        [0,1,0,0,0,1,1],  # safety ✅, education ✅, and healthcare
        [0,0,1,0,0,0,0],  # transport ✅
        [1,1,1,0,0,0,0],  # family ✅, safety ✅, transport ✅
        [0,1,1,1,0,0,0],  # safety ✅, transport ✅, nature ✅
        [0,0,1,0,1,0,0],  # transport ✅, convenience ✅
        [0,1,1,0,1,0,0],  # safety ✅, transport ✅, convenience ✅
        [0,1,0,0,0,1,0],  # safety ✅, healthcare ✅
        [0,1,0,0,0,1,0]   # safety ✅, healthcare ✅
    ]
}

dataset = Dataset.from_dict(data)

# ✅ Step 5: Tokenize dataset
def preprocess(example):
    enc = tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = example["labels"]
    return enc

tokenized_dataset = dataset.map(preprocess)

# ✅ Step 6: Training setup
training_args = TrainingArguments(
    output_dir="./real_estate_lora_results",
    per_device_train_batch_size=2,
    num_train_epochs=5,
    learning_rate=2e-4,
    logging_dir="./logs",
    logging_steps=5,
    save_total_limit=2,
    fp16=False,
    bf16=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

# ✅ Step 7: Fine-tune
trainer.train()

# ✅ Step 8: Save LoRA-adapted model
model.save_pretrained("real_estate_lora_model_distibart")
tokenizer.save_pretrained("real_estate_lora_model_distibart")

print("✅ Fine-tuning completed and model saved.")


Some weights of BartForSeqClsWithExtraKwargs were not initialized from the model checkpoint at valhalla/distilbart-mnli-12-3 and are newly initialized because the shapes did not match:
- classification_head.out_proj.weight: found shape torch.Size([3, 1024]) in the checkpoint and torch.Size([7, 1024]) in the model instantiated
- classification_head.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([7]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 9/9 [00:00<00:00, 267.59 examples/s]


RuntimeError: result type Float can't be cast to the desired output type Long

#Fine tuning has some problem, the original model set output three class, we need to reset the the base model (), and requires the fp16 CUDA GPU, for multiple labeling still require to change the model to BCEWithLogistcLoss, I personaly think the model is not good for this case application. We only do zero shot model.

In [None]:


# ✅ Step 1: Load tokenizer and base model
model_name = "valhalla/distilbart-mnli-12-3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
num_labels=6
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

# ✅ Step 2: Add LoRA
lora_config = LoraConfig(
    r=8,                    # rank
    lora_alpha=32,          # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention layers to adapt, small matrix
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_CLS"
)
model = get_peft_model(model, lora_config)

# ✅ Step 3: Create your small training dataset
data = {
    "text": [
        "I want an apartment near kindergarten and transport stops, less crime.",
        "Looking for a house far from city center, very safe, near school.",
        "Apartment under 1000 euro, no kids, doesn’t matter about transport.",
        "I want an apartment near a kindergarten and subway, safe area, under 1000 Euro, ",
        "Two bedrooms close to subway, low crime, nearby parks, price less than 1200 euro.",
        "2 rooms nearby supermarket and bars or resturants, price less than 800Euro.",
        "3-bedrooms, near public transport, shopping malls, and satey area" 
      
    ],

"labels" :[
    [1, 1, 1, 1, 0, 0],  # Near kindergarten ✅, children-friendly ✅, low crime ✅, transport ✅, parks ❌, shopping ❌
    [0, 1, 1, 0, 0, 0],  # Kindergarten ❌, children-friendly ✅, low crime ✅, transport ❌, parks ❌, shopping ❌
    [0, 0, 0, 1, 0, 0],  # Kindergarten ❌, children-friendly ❌, low crime ❌, transport ✅, parks ❌, shopping ❌
    [1, 1, 1, 1, 0, 0],  # Kindergarten ✅, children-friendly ✅, low crime ✅, transport ✅, parks ❌, shopping ❌
    [0, 1, 1, 1, 1, 0],  # Kindergarten ❌, children-friendly ✅, low crime ✅, transport ✅, parks ✅, shopping ❌
    [0, 1, 0, 0, 0, 1],  # Kindergarten ❌, children-friendly ✅, low crime ❌, transport ❌, parks ❌, shopping ✅
    [0, 1, 1, 1, 0, 1]   # Kindergarten ❌, children-friendly ✅, low crime ✅, transport ✅, parks ❌, shopping ✅
]
}

dataset = Dataset.from_dict(data)

# ✅ Step 4: Tokenize
def preprocess(example):
    enc = tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = example["labels"]
    return enc

tokenized_dataset = dataset.map(preprocess)

# ✅ Step 5: Training setup
training_args = TrainingArguments(
    output_dir="distibart-mnli./results",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_dir="./logs",
    logging_steps=10
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

# ✅ Step 6: Fine-tune
trainer.train()

# ✅ Step 7: Save your LoRA-adapted model
model.save_pretrained("real_estate_lora_model and distibart")


RuntimeError: Error(s) in loading state_dict for Linear:
	size mismatch for weight: copying a param with shape torch.Size([3, 1024]) from checkpoint, the shape in current model is torch.Size([4, 1024]).

In [50]:
# Define the labels
text="I want an apartment near kindergarten and transport stops, less crime.",
labels = ["district", "price", "bedroom", "kindergarten", "parks", "public transport", "shopping mall", "gym", "hospital", "school"]
results = classifier("text", labels)
print("Zero shot classification results:", results)

Zero shot classification results: {'sequence': 'text', 'labels': ['price', 'public transport', 'parks', 'district', 'shopping mall', 'bedroom', 'hospital', 'school', 'kindergarten', 'gym'], 'scores': [0.24442839622497559, 0.15627136826515198, 0.11556384712457657, 0.10741044580936432, 0.10244598984718323, 0.07879061996936798, 0.07345880568027496, 0.057474758476018906, 0.03305509313941002, 0.031100604683160782]}
