🎯 Goal:
Predict 1–3 trip tags a user is likely to select again based on their past preferences.

Step 1: Simulated Dataset
Each row = a trip by a user
Each column = 1 or 0 depending on whether that tag was chosen for that trip

In [1]:
import pandas as pd
import numpy as np
import random

# Define tags and users
tags = ["Adventure", "Relaxation", "Nature", "Cultural"]
user_ids = list(range(101, 111))  # 10 users

# Create 300 trips (rows)
data = []
for _ in range(300):
    user = random.choice(user_ids)
    selected_tags = random.sample(tags, random.randint(1, 3))
    row = {tag: 1 if tag in selected_tags else 0 for tag in tags}
    row["user_id"] = user
    data.append(row)

df = pd.DataFrame(data)
df = df[["user_id"] + tags]  # reorder columns
df.to_csv("tag_prediction_dataset.csv", index=False)
df.head()


   user_id  Adventure  Relaxation  Nature  Cultural
0      105          0           0       1         0
1      104          1           1       0         1
2      102          1           1       1         0
3      103          0           1       0         1
4      108          0           1       1         0


Step 2: Multi-Label Model with Random Forest

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report

# Load the dataset
df = pd.read_csv("tag_prediction_dataset.csv")

# Features and targets
X = df[["user_id"]]
y = df.drop(columns=["user_id"])  # target: multi-label tags

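# One-hot encode user_id; handle_unknown="ignore" encodes an ID that was
# not seen during training as all zeros instead of raising an error
encoder = ColumnTransformer(
    transformers=[("user", OneHotEncoder(handle_unknown="ignore"), ["user_id"])],
    remainder="passthrough"
)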

# Pipeline: encoder + classifier
model = Pipeline([
    ("encode", encoder),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42)))
])

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

# Evaluation
y_pred = model.predict(X_test)
print("📊 Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=y.columns))


📊 Classification Report:

              precision    recall  f1-score   support

   Adventure       0.44      0.43      0.44        28
  Relaxation       0.54      0.42      0.47        33
      Nature       0.50      0.47      0.48        30
    Cultural       0.59      0.62      0.61        32

   micro avg       0.52      0.49      0.50       123
   macro avg       0.52      0.49      0.50       123
weighted avg       0.52      0.49      0.50       123
 samples avg       0.54      0.52      0.49       123



Basic Idea:
Each trip can have multiple correct labels (tags like Adventure, Relaxation, etc.), so the model is evaluated on how well it predicts each tag independently and collectively.

Explanation:
Precision: Of all the times the model predicted a tag, how often was it correct?

e.g., for "Cultural", 59% of predictions were actually right.

Recall: Of all the trips that actually had the tag, how often did the model predict it?

e.g., for "Relaxation", it only caught 42% of the real cases.

F1-score: Harmonic mean of precision and recall — a balanced measure.

Support: How many test samples had that label in reality.
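
To make these definitions concrete, here is a minimal sketch that computes them by hand for a single tag. The label lists are made-up values for illustration, not rows from the dataset above, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for one binary tag column."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # tag predicted and present
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # tag predicted but absent
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # tag present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical ground truth and predictions for one tag over 8 trips
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
support = sum(y_true)  # number of trips that really had the tag

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Here the model predicted the tag 3 times and was right twice (precision 2/3), and it caught 2 of the 4 real cases (recall 1/2), which gives an F1 of 4/7.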



Micro, Macro, Weighted, and Samples Avg:
These are ways to combine all the tag performances into one number.


Metric Type	Meaning
micro avg	Pools true positives, false positives, and false negatives across all tags, then computes the metric once. Frequent tags carry more weight.
macro avg	Simple average of the per-tag metrics. Treats all tags equally, however rare.
weighted avg	Average of the per-tag metrics, weighted by each tag's support (number of true instances).
samples avg	Computes the metric for each trip (sample) over its set of tags, then averages across trips.
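
The four averages can be reproduced by hand on a tiny example. The sketch below does this for F1 only, using made-up toy labels (4 trips, 3 tags); the helper names are hypothetical and the code is meant to illustrate the definitions, not to replace `classification_report`.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written in terms of counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def confusion_counts(true_vals, pred_vals):
    """Count true positives, false positives, and false negatives."""
    pairs = list(zip(true_vals, pred_vals))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


# Hypothetical toy data: 4 trips (rows) x 3 tags (columns)
y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 1, 1], [1, 1, 0], [1, 0, 0]]

# Transpose so that each entry is one tag's column
columns_true = list(zip(*y_true))
columns_pred = list(zip(*y_pred))
per_tag = [confusion_counts(t, p) for t, p in zip(columns_true, columns_pred)]
support = [sum(col) for col in columns_true]

# micro: pool the counts over all tags, then compute F1 once
micro_f1 = f1_from_counts(*(sum(c) for c in zip(*per_tag)))

# macro: plain mean of the per-tag F1 scores
macro_f1 = sum(f1_from_counts(*c) for c in per_tag) / len(per_tag)

# weighted: mean of the per-tag F1 scores, weighted by support
weighted_f1 = sum(s * f1_from_counts(*c) for s, c in zip(support, per_tag)) / sum(support)

# samples: F1 per trip (row), then mean over trips
samples_f1 = sum(
    f1_from_counts(*confusion_counts(t, p)) for t, p in zip(y_true, y_pred)
) / len(y_true)

print(f"micro={micro_f1:.3f} macro={macro_f1:.3f} "
      f"weighted={weighted_f1:.3f} samples={samples_f1:.3f}")
```

On this toy data the third tag is never predicted correctly, which pulls the macro average (13/21) well below the micro average (10/13), because macro gives that rare tag the same weight as the others.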

✅ Interpretation of Your Report:
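Your model scores highest on Cultural (F1 = 0.61) and lowest on Adventure and Relaxation (F1 ~ 0.44–0.47). With only about 30 positive examples per tag in the test set, differences of this size are within normal random variation.

The overall F1 score is around 0.50. Note that this is not the same as "50% of the tags are right": F1 combines precision and recall. It is also roughly what guessing would achieve here, because the simulated tags are drawn at random, independently of user_id, and each tag appears in about half of all trips (support of 28 to 33 out of 60 test trips). The pipeline works end to end, but the model has no real signal to learn until it is trained on actual user preferences.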
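
A useful reference point for these scores is what pure guessing would achieve on the Step 1 data. The sketch below works this out exactly, assuming the generation scheme from Step 1 (1 to 3 tags per trip, sampled uniformly from 4) and a hypothetical guesser that predicts each tag with probability `q`, independently of the truth. The F1 value is an approximation based on expected precision and recall.

```python
from fractions import Fraction

n_tags = 4
sizes = [1, 2, 3]  # random.randint(1, 3) picks each size with probability 1/3

# P(a given tag is in a trip) = sum over sizes k of P(k) * k / n_tags
p_tag = sum(Fraction(1, len(sizes)) * Fraction(k, n_tags) for k in sizes)

# Hypothetical guesser: predicts each tag with probability q, ignoring the input.
# Its expected precision equals the base rate p_tag and its expected recall equals q.
q = Fraction(1, 2)
chance_precision = p_tag
chance_recall = q
chance_f1 = 2 * chance_precision * chance_recall / (chance_precision + chance_recall)

print(f"P(tag present) = {p_tag}")
print(f"chance-level F1 is roughly {float(chance_f1):.2f}")
```

Under these assumptions each tag is present in exactly half of the trips, and a coin-flip guesser lands near F1 = 0.50, so scores close to that level on this dataset should not be read as evidence of learned preferences.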

Step 3: Predict Tags for a Returning User

In [6]:
# Predict tags for a user logging in
user_input = pd.DataFrame({"user_id": [110]})
predicted = model.predict(user_input)[0]

# Decode predicted tags
predicted_tags = [tag for tag, val in zip(y.columns, predicted) if val == 1]
print(f"🔮 Predicted tags for user 110: {predicted_tags}")

🔮 Predicted tags for user 110: ['Nature', 'Cultural']
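
Because each tag is predicted independently, `model.predict` can return no tags at all, or all four, while the goal is 1–3 tags. One possible way to enforce that range is to rank tags by a per-tag score and keep the best ones. The sketch below assumes such scores are available (for instance positive-class probabilities taken from the classifier); the `pick_tags` helper, its parameters, and the example scores are all hypothetical.

```python
def pick_tags(scores, min_tags=1, max_tags=3, threshold=0.5):
    """Select between min_tags and max_tags tags, highest score first.

    scores: dict mapping tag name -> score in [0, 1] (hypothetical input).
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Keep tags that clear the threshold, capped at max_tags
    chosen = [tag for tag, score in ranked if score >= threshold][:max_tags]
    # If too few tags clear the threshold, fall back to the top-ranked ones
    if len(chosen) < min_tags:
        chosen = [tag for tag, _ in ranked[:min_tags]]
    return chosen


# Made-up scores for illustration only
example_scores = {"Adventure": 0.35, "Relaxation": 0.48, "Nature": 0.62, "Cultural": 0.71}
print(pick_tags(example_scores))
```

With these example scores, Cultural and Nature clear the 0.5 threshold and are returned in that order. If no tag clears the threshold, the single best-scoring tag is returned instead.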


Use Cases in Your App:
🎯 Auto-select tags on dashboard when user logs in

🧠 Improve suggestions before the user even interacts

🔁 Update predictions after every trip (new feedback = better accuracy)
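
As a rough illustration of the "update after every trip" idea, the sketch below keeps a running count of each user's tags and suggests the most frequent ones. This is a simple frequency counter, not the Random Forest model from Step 2, and the `TagHistory` class and its methods are hypothetical names. Retraining the model on the growing dataset would be the alternative.

```python
from collections import Counter, defaultdict


class TagHistory:
    """Running per-user tag counts, updated after each trip."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def record_trip(self, user_id, tags):
        """Add the tags chosen for one trip to the user's history."""
        self._counts[user_id].update(tags)

    def suggest(self, user_id, max_tags=3):
        """Return the user's most frequently chosen tags, most frequent first."""
        counts = self._counts.get(user_id, Counter())
        return [tag for tag, _ in counts.most_common(max_tags)]


history = TagHistory()
history.record_trip(110, ["Nature", "Cultural"])
history.record_trip(110, ["Nature"])
history.record_trip(110, ["Adventure", "Nature", "Cultural"])

print(history.suggest(110, max_tags=2))
```

A user with no recorded trips gets an empty list, so the app would need a fallback (for example, globally popular tags) for new users.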