# Generative Text For Tourism Recommendation - Capstone Project

This notebook is part of an additional feature of the 'wisatapas' capstone project. Its goal is to build a model that generates short recommendation summaries for tourist destinations, using the T5 model from Hugging Face Transformers.

## Load Data

In [1]:
import os
import pickle
import re
import numpy as np
import pandas as pd

In [2]:
print("Loading tourism data...")
try:
    destinasi_url = 'https://drive.google.com/uc?id=1lqGH27q8zwN6mlMjTDhOk_jd4vgvWbWU'
    destinasi_df = pd.read_csv(destinasi_url)
    print(f"✓ Data loaded successfully. Shape: {destinasi_df.shape}")
except Exception as e:
    print(f"✗ Error loading data: {str(e)}")


Loading tourism data...
✓ Data loaded successfully. Shape: (604, 13)


## Data Understanding

In [3]:
destinasi_df.head()

Unnamed: 0,Place_Id,Place_Name,Description,Category,City,Price,Rating,Time_Minutes,Coordinate,Lat,Long,Unnamed: 11,Unnamed: 12
0,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,15.0,"{'lat': -6.1753924, 'lng': 106.8271528}",-6.175392,106.827153,,1.0
1,2,Kota Tua,"Kota tua di Jakarta, yang juga bernama Kota Tu...",Budaya,Jakarta,0,4.6,90.0,"{'lat': -6.137644799999999, 'lng': 106.8171245}",-6.137645,106.817125,,2.0
2,3,Dunia Fantasi,Dunia Fantasi atau disebut juga Dufan adalah t...,Taman Hiburan,Jakarta,270000,4.6,360.0,"{'lat': -6.125312399999999, 'lng': 106.8335377}",-6.125312,106.833538,,3.0
3,4,Taman Mini Indonesia Indah (TMII),Taman Mini Indonesia Indah merupakan suatu kaw...,Taman Hiburan,Jakarta,10000,4.5,,"{'lat': -6.302445899999999, 'lng': 106.8951559}",-6.302446,106.895156,,4.0
4,5,Atlantis Water Adventure,Atlantis Water Adventure atau dikenal dengan A...,Taman Hiburan,Jakarta,94000,4.5,60.0,"{'lat': -6.12419, 'lng': 106.839134}",-6.12419,106.839134,,5.0


In [4]:
destinasi_df.isna().sum()

Unnamed: 0,0
Place_Id,0
Place_Name,0
Description,0
Category,0
City,0
Price,0
Rating,0
Time_Minutes,232
Coordinate,0
Lat,0


In [5]:
print(f"Total duplicate values: {destinasi_df.duplicated().sum()}")

Total duplicate values: 0


In [6]:
print("Number of unique destinations:", destinasi_df['Place_Name'].nunique())
print('\n')
print("Number of unique descriptions:", destinasi_df['Description'].nunique())
print('\n')
print("Unique categories:", destinasi_df['Category'].unique())
print("Number of unique categories:", destinasi_df['Category'].nunique())
print('\n')
print("Unique cities:", destinasi_df['City'].unique())
print("Number of unique cities:", destinasi_df['City'].nunique())
print('\n')
# Sort the unique prices in descending order
harga_unik = np.sort(destinasi_df['Price'].unique())[::-1]
print("Unique prices:", harga_unik)
print("Number of unique prices:", destinasi_df['Price'].nunique())

Number of unique destinations: 604


Number of unique descriptions: 604


Unique categories: ['Budaya' 'Taman Hiburan' 'Cagar Alam' 'Bahari' 'Pusat Perbelanjaan'
 'Tempat Ibadah']
Number of unique categories: 6


Unique cities: ['Jakarta' 'Yogyakarta' 'Bandung' 'Semarang' 'Surabaya' 'Malang' 'Batu'
 'Solo' 'Karanganyar' 'Klaten' 'Boyolali' 'Sragen' 'Sukoharjo' 'Jember']
Number of unique cities: 14


Unique prices: [900000 500000 375000 300000 280000 270000 250000 220000 200000 185000
 180000 175000 150000 140000 125000 120000 115000 110000 100000  95000
  94000  85000  81000  80000  75000  70000  65000  63700  60000  55000
  50000  45000  40000  37500  35000  30000  27000  25000  23000  22000
  20000  15000  12000  11000  10000   9000   8000   7500   7000   6000
   5500   5000   4000   3000   2500   2000   1000      0]
Number of unique prices: 58


In [7]:
budget_counts = destinasi_df['Price'].value_counts()
budget_counts

Unnamed: 0_level_0,count
Price,Unnamed: 1_level_1
0,173
10000,76
5000,71
15000,38
25000,23
50000,22
20000,21
3000,20
35000,14
2000,14


## Data Preparation

In [8]:
destinasi_df = destinasi_df.drop_duplicates()
print(f"Number of duplicates in destinasi_df: {destinasi_df.duplicated().sum()}")

Number of duplicates in destinasi_df: 0


In [9]:
destinasi_df.isna().sum()

Unnamed: 0,0
Place_Id,0
Place_Name,0
Description,0
Category,0
City,0
Price,0
Rating,0
Time_Minutes,232
Coordinate,0
Lat,0


In [10]:
destinasi_df = destinasi_df.drop(columns=['Unnamed: 11', 'Unnamed: 12','Time_Minutes'])

In [11]:
destinasi_df.head()

Unnamed: 0,Place_Id,Place_Name,Description,Category,City,Price,Rating,Coordinate,Lat,Long
0,1,Monumen Nasional,Monumen Nasional atau yang populer disingkat d...,Budaya,Jakarta,20000,4.6,"{'lat': -6.1753924, 'lng': 106.8271528}",-6.175392,106.827153
1,2,Kota Tua,"Kota tua di Jakarta, yang juga bernama Kota Tu...",Budaya,Jakarta,0,4.6,"{'lat': -6.137644799999999, 'lng': 106.8171245}",-6.137645,106.817125
2,3,Dunia Fantasi,Dunia Fantasi atau disebut juga Dufan adalah t...,Taman Hiburan,Jakarta,270000,4.6,"{'lat': -6.125312399999999, 'lng': 106.8335377}",-6.125312,106.833538
3,4,Taman Mini Indonesia Indah (TMII),Taman Mini Indonesia Indah merupakan suatu kaw...,Taman Hiburan,Jakarta,10000,4.5,"{'lat': -6.302445899999999, 'lng': 106.8951559}",-6.302446,106.895156
4,5,Atlantis Water Adventure,Atlantis Water Adventure atau dikenal dengan A...,Taman Hiburan,Jakarta,94000,4.5,"{'lat': -6.12419, 'lng': 106.839134}",-6.12419,106.839134


In [12]:
destinasi_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 604 entries, 0 to 603
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Place_Id     604 non-null    int64  
 1   Place_Name   604 non-null    object 
 2   Description  604 non-null    object 
 3   Category     604 non-null    object 
 4   City         604 non-null    object 
 5   Price        604 non-null    int64  
 6   Rating       604 non-null    float64
 7   Coordinate   604 non-null    object 
 8   Lat          604 non-null    float64
 9   Long         604 non-null    float64
dtypes: float64(3), int64(2), object(5)
memory usage: 47.3+ KB


In [13]:
destinasi_df.describe()

Unnamed: 0,Place_Id,Price,Rating,Lat,Long
count,604.0,604.0,604.0,604.0,604.0
mean,302.5,24083.112583,4.408444,-7.39353,41392.44
std,174.504059,58485.815884,0.220458,2.867232,211078.5
min,1.0,0.0,3.4,-75.639,103.9314
25%,151.75,0.0,4.2,-7.806364,107.6105
50%,302.5,7500.0,4.4,-7.310507,110.4105
75%,453.25,22250.0,4.6,-6.897825,110.905
max,604.0,900000.0,5.0,1.07888,1127000.0
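The summary above shows a minimum `Lat` of -75.639 and a maximum `Long` of 1127000.0, which look like data-entry errors rather than real locations. A minimal sketch of a sanity check follows. The bounding box for Indonesia is a rough assumption, `is_plausible_coordinate` is a hypothetical helper that is not part of the original notebook, and the coordinate pairs are illustrative (the extreme values come from the summary, the pairings are made up).

```python
# Assumed, approximate bounding box for Indonesia (degrees); adjust as needed.
LAT_RANGE = (-11.5, 6.5)
LON_RANGE = (94.5, 141.5)

def is_plausible_coordinate(lat: float, lon: float) -> bool:
    """Return True when (lat, lon) falls inside the assumed bounding box."""
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= lon <= LON_RANGE[1])

# Illustrative pairs: one ordinary point plus the two extremes from describe().
points = [
    (-6.175392, 106.827153),   # Monumen Nasional, from the head() output
    (-75.639, 110.4105),       # minimum Lat paired with a made-up Long
    (-7.310507, 1127000.0),    # made-up Lat paired with the maximum Long
]
flagged = [p for p in points if not is_plausible_coordinate(*p)]
print(flagged)
```

On the real frame, the same check could be applied row by row, for example with `destinasi_df.apply(lambda r: is_plausible_coordinate(r["Lat"], r["Long"]), axis=1)`. The augmentation step below does not use the coordinates, so these rows do not affect the fine-tuning data.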


## Data Augmentation for Fine-Tuning

In [14]:
import random
import json
import pandas as pd

print("Loading tourism data...")
try:
    destinasi_url = 'https://drive.google.com/uc?id=1lqGH27q8zwN6mlMjTDhOk_jd4vgvWbWU'
    destinasi_df = pd.read_csv(destinasi_url)
    print(f"✓ Data loaded successfully. Shape: {destinasi_df.shape}")
except Exception as e:
    print(f"✗ Error loading data: {str(e)}")
    destinasi_df = pd.DataFrame()

input_templates = [
    "User menyukai kategori: {category}; lokasi: {city}; tempat: {place}; rating: {rating}",
    "Preferensi pengguna: {category} - {city} - {place} ({rating})"
]

base_output_templates = [
    "{place} adalah salah satu destinasi {category_lower} menarik yang bisa Anda kunjungi di {city}. Tempat ini memiliki rating sebesar {rating}.",
    "Kami merekomendasikan {place} untuk Anda yang mencari pengalaman wisata {category_lower} di {city}. Destinasi ini dinilai baik dengan rating {rating}.",
    "Sedang mencari destinasi {category_lower}? Coba kunjungi {place} di {city}. Tempat ini mendapatkan rating {rating}.",
    "{place} di {city} bisa jadi pilihan tepat untuk menikmati wisata {category_lower}. Rating tempat ini tercatat {rating}.",
    "Tempat seperti {place} di {city} cocok untuk Anda yang menyukai wisata {category_lower}. Saat ini memiliki rating {rating}.",
    "Berwisata ke {place} di {city} dapat menjadi pilihan seru untuk kategori {category_lower}. Ratingnya mencapai {rating}.",
    "{place} merupakan tempat wisata {category_lower} yang direkomendasikan di {city}. Ratingnya saat ini {rating}.",
    "Ingin suasana {category_lower}? {place} di {city} bisa jadi opsi yang menarik, dengan rating {rating}.",
]

augmented_data = []

for _, row in destinasi_df.iterrows():
    try:
        category = str(row['Category'])
        city = str(row['City'])
        place = str(row['Place_Name'])
        rating = float(row['Rating'])
        category_lower = category.lower()

        for input_template in input_templates:
            input_text = input_template.format(
                category=category,
                city=city,
                place=place,
                rating=rating
            )

            selected_outputs = random.sample(base_output_templates, 4)

            for output_template in selected_outputs:
                output_text = output_template.format(
                    category_lower=category_lower,
                    place=place,
                    city=city,
                    rating=rating
                )

                augmented_data.append({
                    "input": input_text,
                    "output": output_text
                })

    except Exception as e:
        print(f"✗ Error processing row: {e}")
        continue

with open("fine_tune_variatif_general_loc.jsonl", "w", encoding='utf-8') as f:
    for item in augmented_data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print(f"{len(augmented_data)} input-output pairs were created and saved as 'fine_tune_variatif_general_loc.jsonl'")


Loading tourism data...
✓ Data loaded successfully. Shape: (604, 13)
4832 input-output pairs were created and saved as 'fine_tune_variatif_general_loc.jsonl'
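The count of 4832 pairs follows directly from the loop structure: each row is formatted with both input templates, and each input is paired with 4 output templates drawn by `random.sample`. The sketch below reproduces that arithmetic and shows that seeding makes the draw repeatable. The seed value and the placeholder template names are assumptions for illustration; the original cell does not set a seed.

```python
import random

N_ROWS = 604            # rows in destinasi_df, from the printed shape
N_INPUT_TEMPLATES = 2   # len(input_templates)
N_SAMPLED_OUTPUTS = 4   # k in random.sample(base_output_templates, 4)

pairs_per_row = N_INPUT_TEMPLATES * N_SAMPLED_OUTPUTS
total_pairs = N_ROWS * pairs_per_row
print(pairs_per_row, total_pairs)  # 8 4832

# random.sample draws without replacement, so the 4 outputs chosen for one
# input are always distinct templates. Seeding makes the draw repeatable.
templates = [f"template_{i}" for i in range(8)]  # placeholder names
random.seed(42)  # assumed seed, not used in the original cell
first_draw = random.sample(templates, 4)
random.seed(42)
second_draw = random.sample(templates, 4)
assert first_draw == second_draw
assert len(set(first_draw)) == 4
```

Without a seed, rerunning the augmentation cell produces a different JSONL file each time, although the number of pairs stays the same.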


## Installing the Required Libraries

In [15]:
!pip install --upgrade datasets

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency r

In [16]:
!pip install -U transformers



## Load the T5 Model

In [17]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Tokenization and Preprocessing

In [18]:
from datasets import load_dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments

dataset = load_dataset('json', data_files='fine_tune_variatif_general_loc.jsonl', split='train')

def preprocess_function(examples):
    inputs = examples['input']
    targets = examples['output']
    if isinstance(inputs, str):
        inputs = [inputs]
    if isinstance(targets, str):
        targets = [targets]

    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(text_target=targets, max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Generating train split: 0 examples [00:00, ? examples/s]
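Because `preprocess_function` pads the targets to `max_length`, the `labels` contain the pad token id wherever the target is shorter than 128 tokens, and those positions count toward the training loss. A common refinement in Hugging Face examples is to replace pad ids in the labels with -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows only that masking step on made-up token ids; `mask_padding` is a hypothetical helper and is not part of the original notebook.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in a batch of label id lists."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Made-up ids; 0 stands in for tokenizer.pad_token_id (T5 uses 0 for padding).
batch = [[71, 1712, 1, 0, 0], [25, 1, 0, 0, 0]]
masked = mask_padding(batch, pad_token_id=0)
print(masked)
# [[71, 1712, 1, -100, -100], [25, 1, -100, -100, -100]]
```

Inside `preprocess_function`, this would amount to assigning `mask_padding(labels["input_ids"], tokenizer.pad_token_id)` to `model_inputs["labels"]`. Doing so changes what the loss measures, so loss values would no longer be comparable with the ones reported in this notebook.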

## Fine-Tuning the T5 Model

In [19]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./t5-finetuned-recommendation",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_steps=10_000,
    save_total_limit=2,
    logging_steps=500,
    learning_rate=5e-5,
    weight_decay=0.01,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

trainer.train()
trainer.save_model("./t5-finetuned-recommendation-final")
tokenizer.save_pretrained("./t5-finetuned-recommendation-final")



Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss
500,0.8327
1000,0.0908
1500,0.0546


('./t5-finetuned-recommendation-final/tokenizer_config.json',
 './t5-finetuned-recommendation-final/special_tokens_map.json',
 './t5-finetuned-recommendation-final/spiece.model',
 './t5-finetuned-recommendation-final/added_tokens.json',
 './t5-finetuned-recommendation-final/tokenizer.json')
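The three rows in the training-loss table can be checked against the training arguments. The sketch below works out the number of optimizer steps, assuming a single device and the default of no gradient accumulation (neither is stated in the notebook).

```python
import math

n_examples = 4832       # size of the tokenized dataset
batch_size = 8          # per_device_train_batch_size
epochs = 3              # num_train_epochs
logging_steps = 500
save_steps = 10_000

# Assumes one device and gradient_accumulation_steps=1.
steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
reaches_save_step = total_steps >= save_steps

print(steps_per_epoch, total_steps)  # 604 1812
print(logged_at)                     # [500, 1000, 1500]
print(reaches_save_step)             # False
```

Under these assumptions the run has 1812 steps, so the loss is logged at steps 500, 1000, and 1500, matching the table. The run also never reaches a multiple of `save_steps`, which is why the explicit `trainer.save_model(...)` call matters for keeping the final weights.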

## Inference

In [20]:
import torch

# Detect the device first
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def generate_recommendation(input_text):
    # Tokenize, then move the resulting tensors to the same device as the model
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=128,
        truncation=True
    ).to(device)

    output_ids = model.generate(
        inputs["input_ids"],
        max_length=128,
        num_beams=4,
        early_stopping=True
    )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)



In [21]:
input_text = "User menyukai kategori: Budaya; lokasi: Jakarta; tempat: Monumen Nasional; rating: 4.6"
print(generate_recommendation(input_text))

Monumen Nasional merupakan tempat wisata budaya yang direkomendasikan di Jakarta. Ratingnya saat ini 4.6.
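The model was fine-tuned only on inputs that follow the two `input_templates`, so prompts at inference time should probably use the same wording. A minimal sketch of a helper that builds the prompt from structured fields follows. `build_prompt` is a hypothetical name, not part of the original notebook, and it mirrors the first input template.

```python
def build_prompt(category: str, city: str, place: str, rating: float) -> str:
    """Format user preferences the same way as the first training input template."""
    return (
        f"User menyukai kategori: {category}; lokasi: {city}; "
        f"tempat: {place}; rating: {rating}"
    )

prompt = build_prompt("Budaya", "Jakarta", "Monumen Nasional", 4.6)
print(prompt)
# User menyukai kategori: Budaya; lokasi: Jakarta; tempat: Monumen Nasional; rating: 4.6
```

The result can be passed straight to `generate_recommendation(prompt)`. Keeping the template in one function reduces the risk that the serving code drifts from the format used during training.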


## Save the Model for Use in the System

In [22]:
!zip -r t5-finetuned-recommendation-final.zip t5-finetuned-recommendation-final

  adding: t5-finetuned-recommendation-final/ (stored 0%)
  adding: t5-finetuned-recommendation-final/training_args.bin (deflated 51%)
  adding: t5-finetuned-recommendation-final/model.safetensors (deflated 10%)
  adding: t5-finetuned-recommendation-final/spiece.model (deflated 48%)
  adding: t5-finetuned-recommendation-final/config.json (deflated 62%)
  adding: t5-finetuned-recommendation-final/special_tokens_map.json (deflated 85%)
  adding: t5-finetuned-recommendation-final/generation_config.json (deflated 29%)
  adding: t5-finetuned-recommendation-final/tokenizer.json (deflated 74%)
  adding: t5-finetuned-recommendation-final/tokenizer_config.json (deflated 95%)
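The `!zip` command relies on a shell utility being available, as it is in Colab. Where it is not, the same archive layout can be produced with the Python standard library. The sketch below is one possible approach, demonstrated on a temporary placeholder directory rather than the real model folder; `zip_directory` and the `demo-model` name are hypothetical.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def zip_directory(src_dir: str, archive_stem: str) -> str:
    """Zip src_dir (keeping its folder name inside the archive); return the zip path."""
    src = Path(src_dir)
    return shutil.make_archive(
        archive_stem, "zip", root_dir=str(src.parent), base_dir=src.name
    )

# Demonstration on a throwaway directory standing in for the model folder.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "demo-model"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}")
    archive = zip_directory(str(model_dir), str(Path(tmp) / "demo-model"))
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

For the notebook's own folder, the call would be `zip_directory("t5-finetuned-recommendation-final", "t5-finetuned-recommendation-final")`, which writes `t5-finetuned-recommendation-final.zip` in the working directory.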


In [23]:
from google.colab import files
files.download("t5-finetuned-recommendation-final.zip")
