### Library Usage in Hybrid Recommender Systems for Property Market

This notebook uses several Python libraries to build a **Hybrid Recommender System** that combines a Knowledge-Based Recommender System (KBRS), a Content-Based Recommender System (CBRS), and Profile Matching for property recommendations. Each library contributes as follows:

- **pandas**: Used for data manipulation and analysis, especially for loading, cleaning, and transforming property datasets.
- **numpy**: Provides efficient numerical operations, particularly useful for handling arrays and numerical computations in feature engineering and similarity calculations.
- **spacy**: Enables advanced Natural Language Processing (NLP) for extracting entities (like schools, hospitals, malls) from property descriptions, enriching the dataset with contextual features.
- **pathlib**: Simplifies file and directory operations, making it easy to manage paths for reading and saving datasets and results.
- **rapidfuzz**: Offers fast fuzzy string matching, which helps in identifying relevant entities in property descriptions even when there are typos or variations in wording.
- **sklearn.preprocessing (MinMaxScaler, OneHotEncoder)**: Used for feature scaling and encoding categorical variables, ensuring that structured property features are normalized and machine-readable for similarity computations.
- **sklearn.metrics.pairwise (cosine_similarity)**: Calculates similarity scores between user preferences and property features, forming the core of the CBRS.
- **scipy.sparse (csr_matrix, hstack)**: Efficiently handles large, sparse feature matrices, which is crucial when combining one-hot encoded and scaled features for similarity calculations.
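
To make the encoding, scaling, and similarity steps concrete, here is a minimal sketch written in plain Python so the arithmetic stays visible. It mirrors what `OneHotEncoder`, `MinMaxScaler`, and `cosine_similarity` compute, but the helper names, the two categories, and the bedroom range are hypothetical values chosen for illustration.

```python
import math

def min_max_scale(value, lo, hi):
    """Scale value into [0, 1] given the column's observed min and max."""
    return (value - lo) / (hi - lo) if hi != lo else 0.0

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of value."""
    return [1.0 if value == c else 0.0 for c in categories]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy setup: two property types, bedrooms ranging from 1 to 5.
categories = ["Apartemen", "Rumah"]
bed_lo, bed_hi = 1, 5

def to_vector(prop_type, bedrooms):
    """Concatenate the one-hot type with the scaled bedroom count."""
    return one_hot(prop_type, categories) + [min_max_scale(bedrooms, bed_lo, bed_hi)]

user = to_vector("Rumah", 3)           # [0.0, 1.0, 0.5]
listing_a = to_vector("Rumah", 3)      # identical to the user vector
listing_b = to_vector("Apartemen", 1)  # [1.0, 0.0, 0.0]

sim_a = cosine(user, listing_a)
sim_b = cosine(user, listing_b)
print(round(sim_a, 3), round(sim_b, 3))
```

A listing identical to the user's preferences scores close to 1, while a listing that shares no non-zero feature with them scores 0.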

By integrating these libraries, the system can:
- Extract and encode both structured and unstructured property features,
- Match properties to user personas (KBRS),
- Compute similarity between user preferences and property listings (CBRS),
- Apply profile matching to rank recommendations based on how closely they fit the user's ideal profile.

This hybrid approach combines domain knowledge (persona rules) with data-driven similarity scores to produce personalized property recommendations.

In [None]:
import pandas as pd
import numpy as np
import spacy
from pathlib import Path
from rapidfuzz import fuzz
from spacy.pipeline import EntityRuler
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix, hstack


## INER: Entity Extraction
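
One thing to watch in the extraction cell below: `fuzz.partial_ratio` scores the best-matching substring, so very short keywords such as "rs" or "sd" can match inside unrelated words. The following is a minimal sketch of a whole-word guard that uses only the standard `re` module. The helper name `contains_keyword` and the sample sentences are hypothetical and not part of the notebook.

```python
import re

def contains_keyword(text, keywords):
    """Return True if any keyword appears as a whole word or phrase in text."""
    for kw in keywords:
        # \b anchors keep "rs" from matching inside words such as "bersih".
        if re.search(r"\b" + re.escape(kw) + r"\b", text):
            return True
    return False

hospital_keywords = ["rumah sakit", "rs"]

no_hospital = contains_keyword("rumah bersih dekat pasar", hospital_keywords)
has_hospital = contains_keyword("dekat rs dan stasiun", hospital_keywords)
print(no_hospital)   # False
print(has_hospital)  # True
```

A guard like this could be applied to short keywords only, leaving the fuzzy fallback for longer ones where typos are more likely.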

In [None]:
# 1. Load your dataset
df = pd.read_csv(r'C:\Users\madea\OneDrive\Documents\Kuliah\Semester 8\Tugas Akhir\Coding\Data Preprocessing\updated_jabodetabeksur_olx_housing_dataset_.csv')
df['description'] = df['description'].fillna('').str.lower()

# 2. Define entity patterns
entity_patterns = {
    "SCHOOL": ["sd", "smp", "sma", "sekolah", "tk", "playgroup"],
    "UNIVERSITY": ["universitas", "kampus", "perguruan tinggi"],
    "HOSPITAL": ["rumah sakit", "rs"],
    "MALL": ["mall", "plaza", "supermall", "pusat perbelanjaan"],
    "MARKET": ["pasar", "market", "traditional market"],
    "TRANSPORT": ["terminal", "stasiun", "halte", "tol", "bandara"],
    "WORSHIP": ["masjid", "gereja", "pura"]
}

# 3. Initialize spaCy with a blank Indonesian model
nlp = spacy.blank('id')
ruler = nlp.add_pipe('entity_ruler')

# 4. Add patterns to EntityRuler
patterns = []
for label, keywords in entity_patterns.items():
    for word in keywords:
        patterns.append({"label": label, "pattern": word})
ruler.add_patterns(patterns)

# 5. Fuzzy matcher fallback
def fuzzy_match(text, keywords, threshold=85):
    return any(fuzz.partial_ratio(text, k) >= threshold for k in keywords)

# 6. Entity extraction function
def extract_entities(description):
    doc = nlp(description)
    found_entities = {label: 0 for label in entity_patterns.keys()}

    # Exact rule-based matches
    for ent in doc.ents:
        found_entities[ent.label_] = 1

    # Fuzzy fallback check
    for label, keywords in entity_patterns.items():
        if found_entities[label] == 0:
            if fuzzy_match(description, keywords):
                found_entities[label] = 1

    return found_entities

# 7. Apply to dataset and expand columns
entity_df = df['description'].apply(extract_entities).apply(pd.Series)
df = pd.concat([df, entity_df], axis=1)


# Export results to CSV
filepath_csv = Path('INER/dataset_with_entities.csv')  
filepath_csv.parent.mkdir(parents=True, exist_ok=True)  
df.to_csv(filepath_csv, index=False, encoding='utf-8-sig')

# Export results to JSON
filepath_json = Path('INER/dataset_with_entities.json')  
filepath_json.parent.mkdir(parents=True, exist_ok=True)  
df.to_json(filepath_json, orient='records', force_ascii=False, indent=2)

# Labeling KBRS
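
As a worked example of the scoring idea used in the cell below, each criterion a listing satisfies counts as one match, and the score is the number of matches divided by the number of criteria checked. This is a simplified sketch: `simple_match_score` and the sample listing are hypothetical, and only range, membership, and exact-value checks are shown.

```python
def simple_match_score(listing, criteria):
    """Fraction of criteria satisfied by a listing (a plain dict)."""
    matches = 0
    for key, wanted in criteria.items():
        value = listing.get(key)
        if value is None:
            continue  # a missing value counts as a miss
        if isinstance(wanted, tuple):    # (low, high) range
            matches += wanted[0] <= value <= wanted[1]
        elif isinstance(wanted, list):   # list of allowed values
            matches += value in wanted
        else:                            # single exact value
            matches += value == wanted
    return matches / len(criteria) if criteria else 0.0

criteria = {
    "type": "Apartemen",
    "land_area": (22, 50),
    "bedrooms": [1, 2],
    "bathrooms": [1, 2],
}
listing = {"type": "Apartemen", "land_area": 36, "bedrooms": 2, "bathrooms": 3}

score = simple_match_score(listing, criteria)
print(score)  # 3 of 4 criteria met -> 0.75
```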

In [None]:
# Load dataset
df = pd.read_csv('INER/dataset_with_entities.csv')

# Define ideal profiles for each persona
ideal_personas = {
    "Pasangan Bekerja dengan Anak": {
        "type": "Rumah",
        "land_area": (200, 600),
        "building_area": (200, 600),
        "bedrooms": [3],
        "bathrooms": [2],
        "SCHOOL": 1,
        "HOSPITAL": 1,
        "TRANSPORT": 1,
        "MARKET": 1
    },
    "Pasangan Bekerja tanpa Anak": {
        "type": ["Apartemen", "Rumah"],
        "land_area": (22, 70),
        "building_area": (22, 70),
        "bedrooms": [2],
        "bathrooms": [2],
        "MALL": 1,
        "TRANSPORT": 1
    },
    "Individu Lajang": {
        "type": "Apartemen",
        "land_area": (22, 50),
        "building_area": (22, 50),
        "bedrooms": [1, 2],
        "bathrooms": [1, 2],
        "MALL": 1,
        "MARKET": 1,
        "TRANSPORT": 1
    }
}

# Helper function to calculate match score
def match_score(property_row, persona_criteria):
    matches = 0
    total = 0

    for key, val in persona_criteria.items():
        if key in ['land_area', 'building_area']:
            # Numeric range check; a missing value counts as a miss
            if not pd.isna(property_row[key]):
                if val[0] <= property_row[key] <= val[1]:
                    matches += 1
            total += 1
        elif key in ['bedrooms', 'bathrooms']:
            # Values such as '3+' or NaN cannot be cast to int and count as a miss
            try:
                if int(property_row[key]) in val:
                    matches += 1
            except (ValueError, TypeError):
                pass
            total += 1
        elif key == 'type':
            # Case-insensitive comparison against one type or a list of types
            allowed = val if isinstance(val, list) else [val]
            if isinstance(property_row[key], str) and property_row[key].lower() in [v.lower() for v in allowed]:
                matches += 1
            total += 1
        else:
            # Binary entity flags such as SCHOOL or HOSPITAL
            if property_row.get(key) == val:
                matches += 1
            total += 1

    return matches / total if total > 0 else 0

# Score each property
persona_scores = []

for _, row in df.iterrows():
    scores = {persona: match_score(row, criteria) for persona, criteria in ideal_personas.items()}
    best_match = max(scores, key=scores.get)

    # Extract main attributes
    property_data = {
        "title": row.get("title", ""),
        "best_persona_match": best_match,
        "match_score": scores[best_match],
        "type": row.get("type"),
        "land_area": row.get("land_area"),
        "building_area": row.get("building_area"),
        "bedrooms": row.get("bedrooms"),
        "bathrooms": row.get("bathrooms"),
        "floors": row.get("floors"),
        
    }

    # Optionally include all entity columns dynamically (assuming they are binary flags like SCHOOL, HOSPITAL, etc.)
    entity_cols = ["SCHOOL", "HOSPITAL", "TRANSPORT", "MARKET", "MALL"]
    for col in entity_cols:
        property_data[col] = row.get(col)

    persona_scores.append(property_data)

# Create DataFrame with results
persona_score_df = pd.DataFrame(persona_scores)

# Save to CSV (optional)  
filepath = Path('KBRS/kbrs_dataset.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
persona_score_df.to_csv(filepath, index=False)  # index=False avoids an extra 'Unnamed: 0' column on reload

# Preview
print(persona_score_df.head())

# Personalized CBRS
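
Because the scaler in the cell below is fitted on the persona-filtered listings, a user value outside that range is mapped outside [0, 1] (`MinMaxScaler` does not clip unless `clip=True` is set). Here is a small sketch of the arithmetic. The bounds are made up for illustration and are not taken from the dataset.

```python
def min_max(value, lo, hi):
    """Min-max scaling as (value - lo) / (hi - lo)."""
    return (value - lo) / (hi - lo)

# Hypothetical bounds: suppose the filtered listings have land_area in [22, 70].
lo, hi = 22, 70

inside = min_max(46, lo, hi)    # halfway through the observed range
outside = min_max(200, lo, hi)  # user asks for more than any listing offers

print(inside)             # 0.5
print(round(outside, 2))  # 3.71
```

A value this far outside the range can dominate the cosine similarity. If that is not intended, `MinMaxScaler(clip=True)` keeps transformed values within the fitted range.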

In [None]:
# === Load dataset ===
df = pd.read_csv("KBRS/kbrs_dataset.csv")

# === USER INPUT ===
user_input = {
    "Persona": "Pasangan Bekerja tanpa Anak",
    "type": "Rumah",
    "land_area": 200,
    "building_area": 50,
    "bedrooms": 2,
    "bathrooms": 2,
    "floors": 2,
    "SCHOOL": 0,
    "HOSPITAL": 1,
    "TRANSPORT": 1,
    "MARKET": 0,
    "MALL": 1
}

# === Filter dataset by persona ===
persona = user_input['Persona']
filtered_df = df[df['best_persona_match'] == persona].copy()

# === Select relevant structured features ===
features = ['type', 'land_area', 'building_area', 'bedrooms', 'bathrooms', 'floors',
            'SCHOOL', 'HOSPITAL', 'TRANSPORT', 'MARKET', 'MALL']

# === Prepare dataset features ===
filtered_df_structured = filtered_df[features].copy()

# Handle categorical encoding (for 'type')
categorical_cols = ['type']
numeric_cols = [col for col in features if col not in categorical_cols]

# One-hot encode 'type'
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_types = encoder.fit_transform(filtered_df_structured[categorical_cols])
encoded_type_cols = encoder.get_feature_names_out(categorical_cols)

def clean_numeric_column(series):
    return (
        series.replace('>10', 10)
              .replace('3+', 3)
              .replace('2+', 2)
              .replace('4+', 4)
              .replace('-', np.nan)
              .astype(float)
    )

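# Clean numeric columns
for col in numeric_cols:
    filtered_df[col] = clean_numeric_column(filtered_df[col])

# cosine_similarity rejects NaN, so fill gaps before scaling
# (assumption: the column median, or 0 if a column is entirely empty, is an acceptable stand-in)
filtered_df[numeric_cols] = (
    filtered_df[numeric_cols].fillna(filtered_df[numeric_cols].median()).fillna(0)
)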

# Scale numeric features
scaler = MinMaxScaler()
scaled_numeric = scaler.fit_transform(filtered_df[numeric_cols])
scaled_numeric_df = pd.DataFrame(scaled_numeric, columns=numeric_cols)

# Combine encoded + scaled features
property_matrix = hstack([csr_matrix(encoded_types), csr_matrix(scaled_numeric)])

# === Prepare user input vector ===
user_df = pd.DataFrame([user_input])
user_structured = user_df[features].copy()

# Encode 'type'
user_encoded_type = encoder.transform(user_structured[categorical_cols])
# Scale numeric
user_scaled_numeric = scaler.transform(user_structured[numeric_cols])

# Combine user features
user_vector = hstack([csr_matrix(user_encoded_type), csr_matrix(user_scaled_numeric)])

# === Compute Cosine Similarity ===
similarity_scores = cosine_similarity(user_vector, property_matrix)[0]

# === Top N Results ===
top_n = 10
top_indices = np.argsort(similarity_scores)[::-1][:top_n]
recommended = filtered_df.iloc[top_indices].copy()
recommended['similarity_score'] = similarity_scores[top_indices]

# === Output ===
print("\nTop Recommendations for Persona:", persona)
print(recommended[['title', 'type', 'bedrooms', 'bathrooms', 'similarity_score']])

# === Save to CSV ===
filepath = Path('CBRS/cbrs_results.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
recommended.to_csv(filepath, index=False)
print(f"\n✅ Top-N recommendations saved to '{filepath}'")


# Decision Support System (SPK) with Profile Matching
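
As a worked example of the weighted gap scoring in the cell below, the sketch here scores one listing on two criteria. The compact `gap_to_score` gives the same values as the notebook's version for integer gaps only. The listing, ideal values, and weights are illustrative.

```python
def gap_to_score(gap):
    """Map an integer gap to a score: 5 for no gap, minus 0.5 per unit, floor 2.5."""
    return max(5 - 0.5 * abs(gap), 2.5)

# Hypothetical two-criterion example.
ideal = {"bedrooms": 2, "bathrooms": 2}
weights = {"bedrooms": 3, "bathrooms": 3}
listing = {"bedrooms": 3, "bathrooms": 2}

weighted_sum = sum(gap_to_score(listing[k] - ideal[k]) * weights[k] for k in ideal)
total = weighted_sum / sum(weights.values())
print(total)  # (4.5 * 3 + 5 * 3) / 6 = 4.75
```

Note that area criteria measured in square metres will almost always have a gap of 5 or more, so they tend to receive the floor score of 2.5 unless the gap is first converted to a coarser scale.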

In [None]:
# Load Top-N CBRS results
df = pd.read_csv("CBRS/cbrs_results.csv")

# Ideal profile based on user input
ideal_profile = {
    "type": "Rumah",
    "land_area": 200,
    "building_area": 50,
    "bedrooms": 2,
    "bathrooms": 2,
    "floors": 2,
    "SCHOOL": 0,
    "HOSPITAL": 1,
    "TRANSPORT": 1,
    "MARKET": 0,
    "MALL": 1
}

# Criterion weights (adjustable)
weights = {
    "type": 5,
    "land_area": 4,
    "building_area": 4,
    "bedrooms": 3,
    "bathrooms": 3,
    "floors": 2,
    "SCHOOL": 2,
    "HOSPITAL": 3,
    "TRANSPORT": 3,
    "MARKET": 2,
    "MALL": 3
}

# Gap conversion scale (difference -> score value)
def gap_to_score(gap):
    if gap == 0:
        return 5
    elif gap == 1 or gap == -1:
        return 4.5
    elif gap == 2 or gap == -2:
        return 4
    elif gap == 3 or gap == -3:
        return 3.5
    elif gap == 4 or gap == -4:
        return 3
    elif gap >= 5 or gap <= -5:
        return 2.5
    else:
        return 1  # fallback

# Compute the weighted total score for each property
def calculate_total_score(row):
    total_score = 0
    total_weight = 0
    for key in ideal_profile:
        if key == "type":
            # str() guards against a missing (NaN) type value
            score = 5 if str(row[key]).lower() == ideal_profile[key].lower() else 1
        else:
            try:
                gap = row[key] - ideal_profile[key]
                score = gap_to_score(gap)
            except (TypeError, KeyError):
                score = 1  # data is missing or not numeric
        total_score += score * weights[key]
        total_weight += weights[key]
    return total_score / total_weight if total_weight else 0

# Compute the score for every property
df["gap_score"] = df.apply(calculate_total_score, axis=1)

# Sort by highest score
df_sorted = df.sort_values(by="gap_score", ascending=False)

# Save the result
df_sorted.to_csv("final_profile_matching_result.csv", index=False)

# Show the top results
df_sorted[["title", "type", "bedrooms", "bathrooms", "similarity_score", "gap_score"]].head(5)

# === Output ===
print("\nRecommended Properties")
print(df_sorted[["title", "type", "bedrooms", "bathrooms", "similarity_score", "gap_score"]])

# Save the final DataFrame to CSV
filepath = Path('profile_matching/hasil_akhir.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)
df_sorted.to_csv(filepath, index=False)