# 🧴 Skincare Compatibility Model Training

## Overview
This notebook trains an ML model that predicts whether skincare products are compatible, based on an analysis of their active ingredients.

## Workflow
1. **Load Data** - Import the product dataset and the compatibility rules
2. **Preprocessing** - Normalize and prepare the data
3. **Feature Engineering** - Extract features from the ingredient lists
4. **Model Training** - Train a RandomForest classifier
5. **Model Evaluation** - Evaluate performance
6. **Save Model** - Export the model for production
7. **Testing** - Test the prediction function

---

In [16]:
# ===============================
# STEP 1: Load dataset
# ===============================
import ast
from itertools import combinations

import pandas as pd
from tqdm import tqdm

# Adjust the paths to match the file locations in Colab
df_products = pd.read_csv("unified_cleaned_products.csv", encoding="utf-8", engine="python")
df_rules = pd.read_csv("compatibility_rules.csv", encoding="utf-8", engine="python")

print("Products:", df_products.shape)
print("Compatibility rules:", df_rules.shape)

# Keep only the columns we need; .copy() avoids SettingWithCopyWarning on later assignments
df_products = df_products[['product_name', 'parsed_ingredients']].copy()
df_products = df_products.dropna(subset=['parsed_ingredients'])

# Make sure the ingredients column holds Python lists, not their string representation
df_products['parsed_ingredients'] = df_products['parsed_ingredients'].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)

df_products.head()


Products: (2610, 7)
Compatibility rules: (5, 5)


Unnamed: 0,product_name,parsed_ingredients
0,The Ordinary Natural Moisturising Factors + HA...,"[isoleucine, citric acid, alanine, glycine, hi..."
1,CeraVe Facial Moisturising Lotion SPF 25 52ml,"[homosalate, butyl methoxydibenzoylmethane, ph..."
2,The Ordinary Hyaluronic Acid 2% + B5 Hydration...,"[citric acid, ahnfeltia concinna extract]"
3,AMELIORATE Transforming Body Lotion 200ml,"[urea, lactic acid, allantoin, serine, sodium ..."
4,CeraVe Moisturising Cream 454g,"[phytosphingosine, cholesterol]"

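STEP 1 runs `ast.literal_eval` because list-valued columns come back from a CSV as strings such as `"['urea', 'lactic acid']"`. A minimal sketch of a more defensive parser follows. The helper name `parse_ingredients` and the choice to return an empty list for malformed cells are assumptions, not part of the notebook.

```python
import ast


def parse_ingredients(value):
    """Return a list of lowercase ingredient names from a CSV cell (hypothetical helper)."""
    if isinstance(value, list):
        items = value
    elif isinstance(value, str):
        try:
            items = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []  # assumption: treat unparseable cells as "no ingredients"
        if not isinstance(items, (list, tuple)):
            return []
    else:
        return []  # NaN, None, or any other non-list value
    return [str(item).strip().lower() for item in items]


print(parse_ingredients("['Urea', ' lactic acid ']"))  # ['urea', 'lactic acid']
print(parse_ingredients("not a list"))                 # []
```

In the notebook this could replace the lambda passed to `.apply(...)`, at the cost of silently turning bad rows into empty lists.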

In [17]:
# ===============================
# STEP 2: Normalize the compatibility rules
# ===============================
# Map compatibility labels (Indonesian values in the rules CSV) to numeric scores
score_map = {
    'Sangat Cocok': 2,   # very compatible
    'Cocok': 1,          # compatible
    'Hati-hati': 0,      # use with caution
    'Tidak Cocok': -1    # not compatible
}

df_rules['score'] = df_rules['Kompatibilitas'].map(score_map)

# Build a lookup dictionary for fast access
compat_dict = {}
for _, row in df_rules.iterrows():
    a, b = row['Bahan_A'].strip().lower(), row['Bahan_B'].strip().lower()
    score = row['score']
    compat_dict[(a, b)] = score
    compat_dict[(b, a)] = score  # store both directions

print("Compatibility lookup entries (both directions):", len(compat_dict))


Compatibility lookup entries (both directions): 10

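STEP 2 stores every rule twice, once per direction, which is why 5 rules produce 10 dictionary entries. An alternative sketch keys the lookup on a `frozenset`, so each rule is stored once and argument order does not matter. The rule rows below are made-up placeholders, not the contents of `compatibility_rules.csv`, and `lookup_score` is a hypothetical helper.

```python
# Hypothetical rules: (ingredient_a, ingredient_b, score)
example_rules = [
    ("Retinol", "Glycolic Acid", -1),
    ("Niacinamide", "Hyaluronic Acid", 2),
]

# One entry per unordered pair
compat_lookup = {
    frozenset((a.strip().lower(), b.strip().lower())): score
    for a, b, score in example_rules
}


def lookup_score(ingredient_a, ingredient_b):
    """Return the rule score for a pair in either order, or None if no rule exists."""
    key = frozenset((ingredient_a.strip().lower(), ingredient_b.strip().lower()))
    return compat_lookup.get(key)


print(lookup_score("glycolic acid", "RETINOL"))  # -1
print(lookup_score("retinol", "urea"))           # None
```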

In [18]:
# ===============================
# STEP 3: Function to check the compatibility of two products
# ===============================
def check_compatibility(ingr_list1, ingr_list2):
    total_score = 0
    count = 0
    for a in ingr_list1:
        for b in ingr_list2:
            key = (a.lower(), b.lower())
            if key in compat_dict:
                total_score += compat_dict[key]
                count += 1
    if count == 0:
        return 1  # neutral: no rule applies to this pair, so default to compatible
    avg_score = total_score / count
    return 1 if avg_score > 0 else 0  # 1 = compatible, 0 = not compatible

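`check_compatibility` returns 1 when no rule applies to a pair, so pairs without any evidence are labeled compatible. A hedged alternative is to return `None` for those pairs so they can be filtered out before training. The function name `label_pair` and the small `rules` dictionary are illustrative assumptions.

```python
def label_pair(ingr_list1, ingr_list2, rules):
    """Return 1/0 from the average rule score, or None when no rule matches."""
    scores = [
        rules[(a.lower(), b.lower())]
        for a in ingr_list1
        for b in ingr_list2
        if (a.lower(), b.lower()) in rules
    ]
    if not scores:
        return None  # no evidence either way
    return 1 if sum(scores) / len(scores) > 0 else 0


# Hypothetical two-way rules in the same shape as compat_dict
rules = {
    ("retinol", "glycolic acid"): -1,
    ("glycolic acid", "retinol"): -1,
}

print(label_pair(["Retinol"], ["glycolic acid", "urea"], rules))  # 0
print(label_pair(["citric acid"], ["urea"], rules))               # None
```

Rows labeled `None` could then be dropped with `dropna` before the train/test split, which leaves only pairs that at least one rule actually covers.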

In [19]:
# ===============================
# STEP 4: Build product pairs (sample)
# ===============================
# Start with a small subset to keep things fast
df_sample = df_products.sample(200, random_state=42).reset_index(drop=True)

pairs = list(combinations(df_sample.index, 2))
pair_df = pd.DataFrame(pairs, columns=["idx_1", "idx_2"])

# Attach product info to each pair
pair_df["product_1"] = df_sample.loc[pair_df["idx_1"], "product_name"].values
pair_df["product_2"] = df_sample.loc[pair_df["idx_2"], "product_name"].values
pair_df["ingredients_1"] = df_sample.loc[pair_df["idx_1"], "parsed_ingredients"].values
pair_df["ingredients_2"] = df_sample.loc[pair_df["idx_2"], "parsed_ingredients"].values

# Compute the compatibility label
tqdm.pandas()
pair_df["compatible"] = pair_df.progress_apply(
    lambda x: check_compatibility(x["ingredients_1"], x["ingredients_2"]),
    axis=1
)

print(pair_df["compatible"].value_counts())
pair_df.head()


100%|██████████| 19900/19900 [00:00<00:00, 124397.18it/s]

compatible
1    19900
Name: count, dtype: int64

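The `value_counts()` output shows that all 19,900 pairs received label 1, so the target contains a single class. A small guard like the following can stop the pipeline before training in that case. It is a sketch using only the standard library, and `require_two_classes` is a hypothetical helper name.

```python
from collections import Counter


def require_two_classes(labels):
    """Return the label counts, raising ValueError if fewer than two classes are present."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError(f"Only one class present in labels: {dict(counts)}")
    return counts


print(require_two_classes([1, 0, 1, 1]))  # Counter({1: 3, 0: 1})

try:
    require_two_classes([1] * 19900)
except ValueError as err:
    print(err)  # Only one class present in labels: {1: 19900}
```

In the notebook this could be called as `require_two_classes(pair_df["compatible"])` right after the labels are computed.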




Unnamed: 0,idx_1,idx_2,product_1,product_2,ingredients_1,ingredients_2,compatible
0,0,1,Instant Foaming Cleanser,Turbo Booster C Powder,[citric acid],"[arginine, ascorbic acid, zinc, zinc pca]",1
1,0,2,Instant Foaming Cleanser,Indeed Labs Pepta-Bright 30ml,[citric acid],"[lactic acid, sodium lactate]",1
2,0,3,Instant Foaming Cleanser,Uplifting Miracle Worker Eye Cream,[citric acid],"[adenosine, citric acid, salicylic acid, lecit...",1
3,0,4,Instant Foaming Cleanser,Eye Fuel,[citric acid],[],1
4,0,5,Instant Foaming Cleanser,Cleansing & Exfoliating Wipes - Green Tea,[citric acid],[],1


In [20]:
# ===============================
# STEP 5: FEATURE ENGINEERING
# ===============================
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def jaccard_similarity(list1, list2):
    set1, set2 = set(list1), set(list2)
    return len(set1 & set2) / len(set1 | set2) if len(set1 | set2) > 0 else 0

# Compute basic features
pair_df["len_1"] = pair_df["ingredients_1"].apply(len)
pair_df["len_2"] = pair_df["ingredients_2"].apply(len)
pair_df["len_diff"] = abs(pair_df["len_1"] - pair_df["len_2"])

pair_df["shared_ingredients"] = pair_df.apply(
    lambda x: len(set(x["ingredients_1"]) & set(x["ingredients_2"])), axis=1
)
pair_df["jaccard"] = pair_df.apply(
    lambda x: jaccard_similarity(x["ingredients_1"], x["ingredients_2"]), axis=1
)

# Join each ingredient list into text for the vectorizer
pair_df["text_1"] = pair_df["ingredients_1"].apply(lambda lst: " ".join(lst))
pair_df["text_2"] = pair_df["ingredients_2"].apply(lambda lst: " ".join(lst))

# Term-frequency vectorizer for cosine similarity
vectorizer = CountVectorizer().fit(pair_df["text_1"].tolist() + pair_df["text_2"].tolist())
tf_1 = vectorizer.transform(pair_df["text_1"])
tf_2 = vectorizer.transform(pair_df["text_2"])
pair_df["cosine_sim"] = [
    cosine_similarity(tf_1[i], tf_2[i])[0][0] for i in range(len(pair_df))
]

pair_df.head()


Unnamed: 0,idx_1,idx_2,product_1,product_2,ingredients_1,ingredients_2,compatible,len_1,len_2,len_diff,shared_ingredients,jaccard,text_1,text_2,cosine_sim
0,0,1,Instant Foaming Cleanser,Turbo Booster C Powder,[citric acid],"[arginine, ascorbic acid, zinc, zinc pca]",1,1,4,3,0,0.0,citric acid,arginine ascorbic acid zinc zinc pca,0.25
1,0,2,Instant Foaming Cleanser,Indeed Labs Pepta-Bright 30ml,[citric acid],"[lactic acid, sodium lactate]",1,1,2,1,0,0.0,citric acid,lactic acid sodium lactate,0.353553
2,0,3,Instant Foaming Cleanser,Uplifting Miracle Worker Eye Cream,[citric acid],"[adenosine, citric acid, salicylic acid, lecit...",1,1,5,4,1,0.2,citric acid,adenosine citric acid salicylic acid lecithin ...,0.707107
3,0,4,Instant Foaming Cleanser,Eye Fuel,[citric acid],[],1,1,0,1,0,0.0,citric acid,,0.0
4,0,5,Instant Foaming Cleanser,Cleansing & Exfoliating Wipes - Green Tea,[citric acid],[],1,1,0,1,0,0.0,citric acid,,0.0

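`CountVectorizer` with default settings splits text into words, so multi-word ingredients are broken apart: `citric acid` and `ascorbic acid` share the token `acid`, which explains the cosine similarity of 0.25 in the first row even though the two products share no ingredient. The sketch below compares word-level counting with counting whole ingredients, using only the standard library. The helper names are hypothetical, and the word-level version only approximates `CountVectorizer` tokenization by splitting on whitespace.

```python
import math
from collections import Counter


def cosine_from_counts(counts1, counts2):
    """Cosine similarity between two token-count mappings."""
    dot = sum(counts1[token] * counts2[token] for token in counts1)
    norm1 = math.sqrt(sum(v * v for v in counts1.values()))
    norm2 = math.sqrt(sum(v * v for v in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def word_level_cosine(ingredients1, ingredients2):
    """Tokens are single words, roughly like the default CountVectorizer."""
    words1 = Counter(" ".join(ingredients1).split())
    words2 = Counter(" ".join(ingredients2).split())
    return cosine_from_counts(words1, words2)


def ingredient_level_cosine(ingredients1, ingredients2):
    """Tokens are whole ingredient names."""
    return cosine_from_counts(Counter(ingredients1), Counter(ingredients2))


a = ["citric acid"]
b = ["arginine", "ascorbic acid", "zinc", "zinc pca"]
print(round(word_level_cosine(a, b), 3))  # 0.25
print(ingredient_level_cosine(a, b))      # 0.0
```

Whether ingredient-level tokens are preferable depends on the goal: word-level overlap captures some chemical family resemblance (many acids), while ingredient-level counting measures exact overlap only.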

In [21]:
# ===============================
# STEP 6: TRAINING MODEL
# ===============================
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Features used for training
feature_cols = ["len_diff", "shared_ingredients", "jaccard", "cosine_sim"]

X = pair_df[feature_cols]
y = pair_df["compatible"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use Random Forest (more stable than Logistic Regression on small datasets)
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42,
    class_weight="balanced"
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 1.0

Classification Report:
               precision    recall  f1-score   support

           1       1.00      1.00      1.00      5970

    accuracy                           1.00      5970
   macro avg       1.00      1.00      1.00      5970
weighted avg       1.00      1.00      1.00      5970


Confusion Matrix:
 [[5970]]

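An accuracy of 1.0 is not informative here: the report lists only class 1, so a model that always predicts 1 scores the same. Comparing against a majority-class baseline makes this visible. The following is a standard-library sketch, and `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


print(majority_baseline_accuracy([1] * 5970))    # 1.0
print(majority_baseline_accuracy([1, 1, 1, 0]))  # 0.75
```

A trained model is only adding value when its test accuracy clearly exceeds this number, for example when called as `majority_baseline_accuracy(y_test)`.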



In [22]:
# ===============================
# STEP 7: SAVE MODEL
# ===============================
import joblib

# Save the model with joblib (efficient for objects that hold large NumPy arrays)
joblib.dump(model, "skincare_model.pkl")

print("✅ Model saved as skincare_model.pkl")
print(f"📦 Model type: {type(model).__name__}")
print(f"📊 Features used: {feature_cols}")

✅ Model saved as skincare_model.pkl
📦 Model type: RandomForestClassifier
📊 Features used: ['len_diff', 'shared_ingredients', 'jaccard', 'cosine_sim']

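The prediction code has to build its features in exactly the order used for training. One hedged option is to save the feature names together with the model in a single dictionary. The sketch below uses the standard-library `pickle` and a placeholder string in place of the trained `RandomForestClassifier` so that it runs without other dependencies; `joblib.dump` and `joblib.load` accept such a dictionary in the same way. The file name `bundle.pkl` and the dictionary keys are assumptions.

```python
import os
import pickle
import tempfile

feature_cols = ["len_diff", "shared_ingredients", "jaccard", "cosine_sim"]

# Placeholder standing in for the trained model
bundle = {"model": "trained-model-placeholder", "feature_cols": feature_cols}

# Write to a temporary directory so the sketch leaves no file behind in the project
path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")
with open(path, "wb") as fh:
    pickle.dump(bundle, fh)

with open(path, "rb") as fh:
    loaded = pickle.load(fh)

print(loaded["feature_cols"])
# ['len_diff', 'shared_ingredients', 'jaccard', 'cosine_sim']
```

At prediction time the loaded `feature_cols` list can then name the columns of the feature row instead of a hard-coded order.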

In [23]:
# ===============================
# STEP 8: PREDICTION FUNCTION
# ===============================
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import get_close_matches
import pandas as pd
import joblib


def find_closest_product_name(name, df, cutoff=0.3):
    """Find the most similar product name in the dataset using fuzzy matching."""
    all_names = df["product_name"].tolist()
    matches = get_close_matches(name, all_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def predict_compatibility(product_name_1, product_name_2, df=df_products, model=None):
    """
    Predict compatibility between two skincare products.

    Parameters:
    -----------
    product_name_1 : str
        Name of the first product (fuzzy-matched against the dataset)
    product_name_2 : str
        Name of the second product (fuzzy-matched against the dataset)
    df : DataFrame
        Product dataset (default: df_products from STEP 1)
    model : sklearn model
        Trained model (if None, it is loaded from file)

    Returns:
    --------
    str : Prediction result with a confidence score
    """
    # Load the model if none was provided
    if model is None:
        try:
            model = joblib.load("skincare_model.pkl")
            print("✅ Model loaded from skincare_model.pkl")
        except FileNotFoundError:
            return "❌ Model 'skincare_model.pkl' not found. Please run STEP 7 to save the model first."

    # Find closest product names (fuzzy matching)
    name1 = find_closest_product_name(product_name_1, df)
    name2 = find_closest_product_name(product_name_2, df)

    if not name1 or not name2:
        missing = [p for p, n in zip([product_name_1, product_name_2], [name1, name2]) if n is None]
        return f"❌ Product not found in database: {', '.join(missing)}"

    # Get product data
    p1 = df[df["product_name"] == name1].iloc[0]
    p2 = df[df["product_name"] == name2].iloc[0]
    ing1, ing2 = p1["parsed_ingredients"], p2["parsed_ingredients"]

    # === Feature Engineering (same as in training) ===
    set1, set2 = set(ing1), set(ing2)
    len_diff = abs(len(ing1) - len(ing2))
    shared = len(set1 & set2)
    jaccard = shared / len(set1 | set2) if (set1 | set2) else 0

    # Cosine similarity; fit() raises ValueError (empty vocabulary) when both lists are empty
    text1, text2 = " ".join(ing1), " ".join(ing2)
    try:
        vec = CountVectorizer().fit([text1 + " " + text2])
        cosine_sim = cosine_similarity(vec.transform([text1]), vec.transform([text2]))[0][0]
    except ValueError:
        cosine_sim = 0.0

    # Build the feature row with the same column names used in training
    X_new = pd.DataFrame(
        [[len_diff, shared, jaccard, cosine_sim]],
        columns=["len_diff", "shared_ingredients", "jaccard", "cosine_sim"],
    )

    # === Prediction ===
    pred = model.predict(X_new)[0]
    result = "✅ COMPATIBLE for use together" if pred == 1 else "⚠️ NOT RECOMMENDED for use together"

    sep = "=" * 60
    lines = [sep, f"🧴 Product 1: {name1}", f"🧴 Product 2: {name2}", sep, f"📊 Result: {result}"]

    # predict_proba columns follow model.classes_, so indexing by label is only valid
    # when the classes are [0, 1]; a model trained on a single class falls back to 0.5
    if hasattr(model, "predict_proba"):
        proba = model.predict_proba(X_new)[0]
        confidence = proba[pred] if len(proba) > pred else 0.5
        lines.append(f"📈 Confidence: {confidence:.1%}")

    lines += [f"🔬 Shared Ingredients: {shared}", sep]
    output = "\n" + "\n".join(lines)

    print(output)
    return output

In [24]:
# ===============================
# STEP 9: TEST PREDICTIONS
# ===============================

# Test 1: Compatible products
print("🧪 Test 1: CeraVe + The Ordinary Hyaluronic Acid")
predict_compatibility("cerave cream", "ordinary hyaluronic acid")

print("\n" + "="*60 + "\n")

# Test 2: Another combination
print("🧪 Test 2: Cetaphil + Niacinamide")
predict_compatibility("cetaphil", "niacinamide")

🧪 Test 1: CeraVe + The Ordinary Hyaluronic Acid
✅ Model loaded from skincare_model.pkl

🧴 Product 1: Ceramidin™ Cream
🧴 Product 2: The Ordinary Marine Hyaluronics 30ml
📊 Result: ✅ COMPATIBLE for use together
📈 Confidence: 50.0%
🔬 Shared Ingredients: 1


🧪 Test 2: Cetaphil + Niacinamide
✅ Model loaded from skincare_model.pkl

🧴 Product 1: RetAsphere™ Micro Peel
🧴 Product 2: Coconut Ceramide Mask
📊 Result: ✅ COMPATIBLE for use together
📈 Confidence: 50.0%
🔬 Shared Ingredients: 1

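Both test runs report a confidence of 50.0%. In scikit-learn the columns of `predict_proba` follow the order of `model.classes_`, so indexing the probabilities by the predicted label only works when the labels happen to equal their column positions. With a model trained on the single class 1, the prediction function falls back to its default of 0.5. A sketch of a lookup that goes through the class list instead, written against plain lists so it runs standalone (`confidence_for` is a hypothetical helper):

```python
def confidence_for(pred, proba, classes):
    """Return the probability assigned to the predicted label.

    proba   -- probabilities for one sample, ordered like `classes`
    classes -- class labels in column order (model.classes_ in scikit-learn)
    """
    return proba[list(classes).index(pred)]


# Two-class model: columns are [P(class 0), P(class 1)]
print(confidence_for(1, [0.2, 0.8], [0, 1]))  # 0.8

# Single-class model: the only column belongs to class 1
print(confidence_for(1, [1.0], [1]))          # 1.0
```

With the notebook's objects this would be called as `confidence_for(pred, model.predict_proba(X_new)[0], model.classes_)`.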

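The queries in STEP 9 did not resolve to the intended products: `"cetaphil"` was matched to `RetAsphere™ Micro Peel`. The permissive `cutoff=0.3` accepts weak matches, and `get_close_matches` compares strings case-sensitively, both of which may contribute. A sketch that lowercases both sides and uses the `difflib` default cutoff of 0.6 follows; the function name and the short list of sample names are illustrative.

```python
from difflib import get_close_matches


def find_product(name, all_names, cutoff=0.6):
    """Case-insensitive fuzzy match; returns the original name or None."""
    by_lower = {n.lower(): n for n in all_names}
    matches = get_close_matches(name.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None


names = [
    "CeraVe Moisturising Cream 454g",
    "CeraVe Moisturising Lotion 473ml",
    "RetAsphere™ Micro Peel",
    "Coconut Ceramide Mask",
]

print(find_product("cerave moisturising cream", names))  # CeraVe Moisturising Cream 454g
print(find_product("cetaphil", names))                   # None
```

A stricter cutoff means that short or partial queries may return `None` instead of a wrong product, so the caller has to handle the not-found case.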
In [25]:
# Optional: view sample products in the database
print("📋 Sample products in the database:")
print("="*60)
df_products["product_name"].head(20)

📋 Sample products in the database:


0     The Ordinary Natural Moisturising Factors + HA...
1         CeraVe Facial Moisturising Lotion SPF 25 52ml
2     The Ordinary Hyaluronic Acid 2% + B5 Hydration...
3             AMELIORATE Transforming Body Lotion 200ml
4                        CeraVe Moisturising Cream 454g
5                      CeraVe Moisturising Lotion 473ml
6         CeraVe Facial Moisturising Lotion No SPF 52ml
7     The Ordinary Natural Moisturizing Factors + HA...
8                          CeraVe Smoothing Cream 177ml
9      Clinique Moisture Surge 72 Hour Moisturiser 75ml
10                       CeraVe Moisturising Cream 50ml
11                       CeraVe Moisturising Cream 340g
12          First Aid Beauty Ultra Repair Cream (56.7g)
13    Avène Antirougeurs Jour Redness Relief Moistur...
14    Clinique Dramatically Different Moisturising L...
15           First Aid Beauty Ultra Repair Cream (170g)
16                              Weleda Skin Food (75ml)
17    Neutrogena Hydro Boost City Shield SPF Moi