# Pokemon Go - Distributed Machine Learning Pipeline (Ray)


## Pipeline Summary
- **Data:** `pokemon_clean.parquet` (296K rows)
- **Baseline Feature Set:** 16 features + 1 target
- **Target:** predicting the Pokemon type (`class`)
- **Framework:** Ray + LightGBM
- **Model:** LightGBM Classifier (Gradient Boosting)

## Data Splitting Strategy
- **Train Set (70%):** model training
- **Validation Set (15%):** hyperparameter optimization
- **Test Set (15%):** final evaluation
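
A 70/15/15 split like the one listed above can be sketched with scikit-learn's `train_test_split` applied twice. This is a minimal illustration on synthetic data, not the notebook's own split (which uses Ray Data further down); the array names and the `stratify` option are assumptions added here to keep class proportions similar across the three sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 4 evenly represented classes (hypothetical)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 4

# Step 1: hold out 30% of the rows, stratified by class
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)

# Step 2: split the held-out 30% in half -> 15% validation, 15% test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
```

Stratification requires at least two rows per class in each split step, so very rare classes would have to be merged or dropped first.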

## Baseline Feature Set
| Category | Features |
|----------|------------|
| Spatial | `cellId_730m`, `terrainType`, `closeToWater`, `population_density`, `urban`, `suburban`, `midurban`, `rural` |
| Temporal | `appearedHour`, `appearedTimeOfDay`, `appearedDayOfWeek` |
| Weather | `weather`, `temperature` |
| POI | `pokestopDistanceKm`, `gymDistanceKm` |
| Co-occurrence | `nearby_pokemon_count` |


## 1) Setup and Ray Initialization


In [1]:
!pip -q install ray[data,train] scikit-learn pyarrow pandas lightgbm

import ray
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import time
import warnings
warnings.filterwarnings('ignore')

ray.init(ignore_reinit_error=True)

print(f"Ray version: {ray.__version__}")
print(f"LightGBM version: {lgb.__version__}")
print(f"Available CPU: {ray.cluster_resources().get('CPU', 0):.0f}")
print(f"Available Memory: {ray.cluster_resources().get('memory', 0) / (1024**3):.2f} GB")


2025-12-28 19:09:15,228	INFO worker.py:2007 -- Started a local Ray instance.


Ray version: 2.53.0
LightGBM version: 4.6.0
Available CPU: 12
Available Memory: 58.10 GB


## 2) Data Loading (Ray Data)


In [2]:
PARQUET_PATH = "pokemon_clean.parquet"

# Baseline Feature Set
BASELINE_FEATURES = [
    "class",                # Target variable
    "cellId_730m", "terrainType", "closeToWater", "population_density",  # Spatial
    "urban", "suburban", "midurban", "rural",  # Spatial - urban/rural categories
    "appearedHour", "appearedTimeOfDay", "appearedDayOfWeek",  # Temporal
    "weather", "temperature",  # Weather
    "pokestopDistanceKm", "gymDistanceKm",  # POI
    "nearby_pokemon_count"  # Co-occurrence
]

print("Loading data with the Baseline Feature Set...")
print(f"Number of selected features: {len(BASELINE_FEATURES)} (target included)")

# Read only the columns listed above instead of the whole file
ds = ray.data.read_parquet(PARQUET_PATH, columns=BASELINE_FEATURES)

print(f"\nData loaded!")
print(f"   Row count: {ds.count():,}")

print(f"\nSchema:")
print(ds.schema())


Loading data with the Baseline Feature Set...
Number of selected features: 17 (target included)


Data loaded!
   Row count: 296,021

Schema:
Column                Type
------                ----
class                 int64
cellId_730m           uint64
terrainType           int64
closeToWater          bool
population_density    double
urban                 bool
suburban              bool
midurban              bool
rural                 bool
appearedHour          int64
appearedTimeOfDay     large_string
appearedDayOfWeek     large_string
weather               large_string
temperature           double
pokestopDistanceKm    double
gymDistanceKm         double
nearby_pokemon_count  int32


In [3]:

print("Baseline Feature Set Validation:")
print("=" * 60)

loaded_cols = ds.schema().names
print(f"Number of loaded columns: {len(loaded_cols)}")
print(f"\nColumns:")
for i, col in enumerate(loaded_cols, 1):
    feature_type = "TARGET" if col == "class" else "Feature"
    print(f"   {i:2d}. {col:<25} [{feature_type}]")

print(f"\nFirst 3 rows:")
print(ds.take(3))


Baseline Feature Set Validation:
Number of loaded columns: 17

Columns:
    1. class                     [TARGET]
    2. cellId_730m               [Feature]
    3. terrainType               [Feature]
    4. closeToWater              [Feature]
    5. population_density        [Feature]
    6. urban                     [Feature]
    7. suburban                  [Feature]
    8. midurban                  [Feature]
    9. rural                     [Feature]
   10. appearedHour              [Feature]
   11. appearedTimeOfDay         [Feature]
   12. appearedDayOfWeek         [Feature]
   13. weather                   [Feature]
   14. temperature               [Feature]
   15. pokestopDistanceKm        [Feature]
   16. gymDistanceKm             [Feature]
   17. nearby_pokemon_count      [Feature]

First 3 rows:


[{'class': 16, 'cellId_730m': 9645139109517197000, 'terrainType': 14, 'closeToWater': False, 'population_density': 2431.2341, 'urban': True, 'suburban': True, 'midurban': True, 'rural': False, 'appearedHour': 5, 'appearedTimeOfDay': 'night', 'appearedDayOfWeek': 'dummy_day', 'weather': 'Foggy', 'temperature': 25.5, 'pokestopDistanceKm': 0.081776, 'gymDistanceKm': 0.049869, 'nearby_pokemon_count': 3}, {'class': 133, 'cellId_730m': 9645139109517197000, 'terrainType': 14, 'closeToWater': False, 'population_density': 2431.2341, 'urban': True, 'suburban': True, 'midurban': True, 'rural': False, 'appearedHour': 5, 'appearedTimeOfDay': 'night', 'appearedDayOfWeek': 'dummy_day', 'weather': 'Foggy', 'temperature': 25.5, 'pokestopDistanceKm': 0.195622, 'gymDistanceKm': 0.259156, 'nearby_pokemon_count': 5}, {'class': 16, 'cellId_730m': 9923201477013144000, 'terrainType': 13, 'closeToWater': False, 'population_density': 761.8856, 'urban': False, 'suburban': True, 'midurban': True, 'rural': False, 

## 3) Data Preprocessing


In [4]:
target_col = "class"

# Missing value check on the first 10,000 rows (not a random sample)
print("Missing Value Check:")
print("=" * 60)
sample = ds.take_batch(10000)
sample_df = pd.DataFrame(sample)

missing_info = []
for col in sample_df.columns:
    if col != target_col:
        null_count = sample_df[col].isnull().sum()
        if null_count > 0:
            null_pct = (null_count / len(sample_df)) * 100
            missing_info.append((col, null_count, null_pct))

if missing_info:
    print(f"{'Column':<30} {'Missing (10K sample)':<20} {'Percent':<10}")
    print("-" * 60)
    for col, count, pct in sorted(missing_info, key=lambda x: x[1], reverse=True):
        print(f"{col:<30} {count:<20,} {pct:<10.2f}%")
else:
    print("No missing values found!")

print("\n" + "=" * 60)
print("Class Distribution Analysis:")
print("=" * 60)
class_dist = sample_df[target_col].value_counts()
print(f"Unique classes: {sample_df[target_col].nunique()}")
print("\nTop 10 Pokemon types:")
print(class_dist.head(10))


Missing Value Check:
No missing values found!

Class Distribution Analysis:
Unique classes: 118

Top 10 Pokemon types:
class
16     1786
19     1482
13      882
21      457
133     412
41      335
48      309
96      292
10      291
46      239
Name: count, dtype: int64
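
The counts above give a rough sense of the class imbalance. As a quick sanity check, the sketch below recomputes the share of the most frequent class and of the ten most frequent classes from the printed numbers. It only reflects the first 10,000 rows, so the proportions in the full dataset may differ; the variable names are illustrative.

```python
# Counts copied from the "Top 10 Pokemon types" output (first 10,000 rows only)
top10_counts = {
    16: 1786, 19: 1482, 13: 882, 21: 457, 133: 412,
    41: 335, 48: 309, 96: 292, 10: 291, 46: 239,
}
sample_size = 10_000

# Accuracy of a trivial model that always predicts the most frequent class
majority_share = max(top10_counts.values()) / sample_size

# Fraction of the sample covered by the ten most frequent classes
top10_share = sum(top10_counts.values()) / sample_size

print(f"Majority-class share: {majority_share:.4f}")
print(f"Top-10 share:         {top10_share:.4f}")
```

A majority-class share of this size is a useful reference point when reading accuracy figures for a model trained with balanced class weights.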


In [5]:
# Categorical feature definition for LightGBM
CATEGORICAL_COLS = ["cellId_730m", "terrainType", "weather",
                    "appearedDayOfWeek", "appearedTimeOfDay"]


def preprocess_batch(batch: pd.DataFrame) -> pd.DataFrame:
    # Imported inside the function so it is available on Ray workers
    import pandas as pd
    df = pd.DataFrame(batch)

    # Cast boolean flags to integers (0/1)
    for col in df.columns:
        if df[col].dtype == "bool":
            df[col] = df[col].astype(int)

    # Normalize object columns to the pandas string dtype
    object_cols = df.select_dtypes(include=['object']).columns
    for col in object_cols:
        df[col] = df[col].astype("string")

    return df

print("\nPreprocessing data...")


ds_processed = ds.map_batches(preprocess_batch, batch_format="pandas")

sample_processed = ds_processed.take(1)[0]
feature_cols = [k for k in sample_processed.keys() if k != target_col]

print(f"\nPreprocessing complete!")
print(f"   Categorical columns: {CATEGORICAL_COLS}")
print(f"   Total features: {len(feature_cols)}")

print(ds_processed.schema())


Preprocessing data...


Preprocessing complete!
   Categorical columns: ['cellId_730m', 'terrainType', 'weather', 'appearedDayOfWeek', 'appearedTimeOfDay']
   Total features: 16



Column                Type
------                ----
class                 int64
cellId_730m           uint64
terrainType           int64
closeToWater          int64
population_density    double
urban                 int64
suburban              int64
midurban              int64
rural                 int64
appearedHour          int64
appearedTimeOfDay     string
appearedDayOfWeek     string
weather               string
temperature           double
pokestopDistanceKm    double
gymDistanceKm         double
nearby_pokemon_count  int32
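
The effect of the preprocessing step on column types can be reproduced on a tiny pandas DataFrame without Ray. The sketch below mirrors the two conversions (boolean to integer, object to string); `preprocess_demo` and the sample values are hypothetical stand-ins, not part of the pipeline.

```python
import pandas as pd


def preprocess_demo(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same two dtype conversions to a plain DataFrame."""
    out = df.copy()

    # Boolean flags become 0/1 integers
    for col in out.columns:
        if pd.api.types.is_bool_dtype(out[col]):
            out[col] = out[col].astype(int)

    # Object columns become pandas string columns
    for col in out.select_dtypes(include=["object"]).columns:
        out[col] = out[col].astype("string")

    return out


raw = pd.DataFrame({
    "closeToWater": [False, True],
    "weather": ["Foggy", "Clear"],
    "temperature": [25.5, 18.0],
})
processed = preprocess_demo(raw)
print(processed.dtypes)
```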


In [6]:
print("Train/Validation/Test Split (70/15/15)...")
print("Validation: hyperparameter selection")
print("Test: final evaluation")

# 1. Train (70%) and Temp (30%)
# Note: shuffle defaults to False in Ray Data, so rows are split in their stored
# order and `seed` only takes effect if shuffle=True is also passed.
train_ds, temp_ds = ds_processed.train_test_split(test_size=0.3, seed=42)

# 2. Split the Temp set in half into Validation (15%) and Test (15%)
val_ds, test_ds = temp_ds.train_test_split(test_size=0.5, seed=42)

print(f"   Train: {train_ds.count():,} | Validation: {val_ds.count():,} | Test: {test_ds.count():,}")

print("\nConverting to pandas...")
train_data = train_ds.to_pandas()
val_data = val_ds.to_pandas()
test_data = test_ds.to_pandas()

# Keep the features as DataFrames (needed for LightGBM's category dtype);
# .copy() avoids SettingWithCopyWarning when columns are reassigned later
X_train = train_data[feature_cols].copy()
y_train = train_data[target_col].values
X_val = val_data[feature_cols].copy()
y_val = val_data[target_col].values
X_test = test_data[feature_cols].copy()
y_test = test_data[target_col].values

# Encode the target variable (0-indexed)
# Note: transform() raises ValueError if val/test contain a class not seen in train
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
y_test_encoded = label_encoder.transform(y_test)

print(f"\n Data ready!")
print(f"   Train Shape: {X_train.shape}")
print(f"   Validation Shape: {X_val.shape}")
print(f"   Test Shape: {X_test.shape}")
print(f"   Number of classes: {len(label_encoder.classes_)}")
print(f"   X_train, X_val, X_test are DataFrames (categorical dtypes preserved)")


Train/Validation/Test Split (70/15/15)...
Validation: hyperparameter selection
Test: final evaluation


   Train: 207,214 | Validation: 44,403 | Test: 44,404

Converting to pandas...

 Data ready!
   Train Shape: (207214, 16)
   Validation Shape: (44403, 16)
   Test Shape: (44404, 16)
   Number of classes: 144
   X_train, X_val, X_test are DataFrames (categorical dtypes preserved)
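
Because the label encoder is fitted on the training labels only, a class that appears only in the validation or test set would make `transform` fail. The sketch below shows that behavior on toy labels and one possible guard (filtering to known classes with `np.isin`); the `_demo` arrays are hypothetical and the filtering step is an assumption, not something the pipeline above does.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_train_demo = np.array([16, 19, 133, 16])
y_val_demo = np.array([19, 999, 16])  # 999 never appears in the training labels

encoder = LabelEncoder().fit(y_train_demo)

# transform() rejects labels that were not seen during fit()
try:
    encoder.transform(y_val_demo)
    unseen_error = False
except ValueError:
    unseen_error = True

# One possible guard: keep only rows whose label is known to the encoder
known_mask = np.isin(y_val_demo, encoder.classes_)
y_val_known = encoder.transform(y_val_demo[known_mask])

print(unseen_error, known_mask, y_val_known)
```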


In [7]:
# Categorical feature names
categorical_features = [col for col in CATEGORICAL_COLS if col in feature_cols]
print(f"Categorical features: {categorical_features}")
print(f"LightGBM automatically detects columns with the 'category' dtype")
print(f"No scaling needed (tree-based model)")

# Convert the categorical columns in X_train, X_val and X_test to the category dtype
print(f"\nConverting categorical columns to the 'category' dtype...")
for col in categorical_features:
    if col in X_train.columns:
        X_train[col] = X_train[col].astype("category")
    if col in X_val.columns:
        X_val[col] = X_val[col].astype("category")
    if col in X_test.columns:
        X_test[col] = X_test[col].astype("category")

print(f"Categorical dtypes set")
print(f"X_train dtypes: {dict(X_train.dtypes)}")
print(f"X_val dtypes: {dict(X_val.dtypes)}")
print(f"X_test dtypes: {dict(X_test.dtypes)}")


Categorical features: ['cellId_730m', 'terrainType', 'weather', 'appearedDayOfWeek', 'appearedTimeOfDay']
LightGBM automatically detects columns with the 'category' dtype
No scaling needed (tree-based model)

Converting categorical columns to the 'category' dtype...
Categorical dtypes set
X_train dtypes: {'cellId_730m': CategoricalDtype(categories=[   43174570332520448,    43175025599053824,
                     43175205987680256,    43175214577614848,
                     43175472275652608,    43175515225325568,
                     43201293619036160,    43201302208970752,
                     43201310798905344,    43201319388839936,
                  ...
                  12280711362249753000, 12280711370839687000,
                  12280711379429622000, 12280768030048256000,
                  12280881468724478000, 12280881477314413000,
                  12280881485904347000, 12280881494494282000,
                  12280882439387087000, 12280882447977021000],
, ordered=False, categories_dtype
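
Calling `astype("category")` separately on each split lets every split derive its own category list. One way to make the mapping explicit is to build a single `pd.CategoricalDtype` from the training data and reuse it for the other splits, as sketched below on toy data. The frames and values are hypothetical; this is an optional variation, not what the cell above does.

```python
import pandas as pd

train_demo = pd.DataFrame({"weather": ["Foggy", "Clear", "Rain", "Clear"]})
val_demo = pd.DataFrame({"weather": ["Rain", "Snow", "Foggy"]})  # "Snow" is unseen

# Build the category list once, from the training split only
weather_dtype = pd.CategoricalDtype(categories=sorted(train_demo["weather"].unique()))

# Apply the same dtype everywhere so integer codes mean the same thing in each split
train_demo["weather"] = train_demo["weather"].astype(weather_dtype)
val_demo["weather"] = val_demo["weather"].astype(weather_dtype)

train_codes = train_demo["weather"].cat.codes.tolist()
val_codes = val_demo["weather"].cat.codes.tolist()  # unseen values become NaN (code -1)

print(train_codes, val_codes)
```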

## 4) Parallel Hyperparameter Tuning with Ray


In [8]:
@ray.remote(num_cpus=2)
def train_and_evaluate(config_id, params, X_train, y_train, X_val, y_val, cat_features):
    import time
    import lightgbm as lgb
    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score

    print(f"[Config {config_id}] Training: {params}")
    start_time = time.time()

    # Make sure the categorical columns use the category dtype
    # (a no-op when the caller has already converted them)
    if cat_features is not None:
        for col in cat_features:
            if col in X_train.columns:
                X_train[col] = X_train[col].astype("category")
            if col in X_val.columns:
                X_val[col] = X_val[col].astype("category")

    # Note: LightGBM applies row subsampling only when subsample_freq > 0
    # (default 0), so `subsample` has no effect as configured here.
    model = lgb.LGBMClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        learning_rate=params['learning_rate'],
        num_leaves=params['num_leaves'],
        min_child_samples=params['min_child_samples'],
        subsample=params.get('subsample', 0.8),
        colsample_bytree=params.get('colsample_bytree', 0.8),
        objective='multiclass',
        metric='multi_logloss',
        boosting_type='gbdt',
        class_weight='balanced',
        random_state=42,
        n_jobs=2,  # matches num_cpus=2 reserved for this task
        verbose=-1,
        force_col_wise=True
    )

    # With a DataFrame input, LightGBM picks up category-dtype columns automatically
    model.fit(X_train, y_train, categorical_feature='auto')
    training_time = time.time() - start_time

    # Evaluate on the validation set
    y_pred_val = model.predict(X_val)

    return {
        'config_id': config_id,
        'params': params,
        'model': model,
        'val_accuracy': accuracy_score(y_val, y_pred_val),
        'val_f1_score': f1_score(y_val, y_pred_val, average='weighted'),
        'training_time': training_time
    }

print("Ray remote function defined")


Ray remote function defined


In [9]:
# Hyperparameter configurations
hyperparameter_configs = [
    {'n_estimators': 450, 'max_depth': 10, 'learning_rate': 0.03, 'num_leaves': 90, 'min_child_samples': 5},   # Deep and Complex
    {'n_estimators': 400, 'max_depth': 15, 'learning_rate': 0.03, 'num_leaves': 128, 'min_child_samples': 5},  # Deep and Complex
    {'n_estimators': 600, 'max_depth': 12, 'learning_rate': 0.01, 'num_leaves': 100, 'min_child_samples': 3},  # Slow and Robust
    {'n_estimators': 350, 'max_depth': -1, 'learning_rate': 0.04, 'num_leaves': 255, 'min_child_samples': 10}, # Unlimited Depth
    {'n_estimators': 300, 'max_depth': 8, 'learning_rate': 0.05, 'num_leaves': 64, 'min_child_samples': 1},    # Aggressive Sensitivity
]

print(f"Starting parallel LightGBM training ({len(hyperparameter_configs)} models)...")
print("=" * 70)
print("Evaluation: VALIDATION set")

# Put the data into the Ray object store once so that all tasks share it
X_train_ref = ray.put(X_train)
y_train_ref = ray.put(y_train_encoded)
X_val_ref = ray.put(X_val)
y_val_ref = ray.put(y_val_encoded)
cat_features_ref = ray.put(categorical_features)  # Pass the categorical feature list

start_time = time.time()

futures = [
    train_and_evaluate.remote(i, config, X_train_ref, y_train_ref, X_val_ref, y_val_ref, cat_features_ref)
    for i, config in enumerate(hyperparameter_configs)
]

print(f"   {len(futures)} parallel jobs submitted, waiting...")

results = ray.get(futures)
total_time = time.time() - start_time

# total_time is wall-clock time for the whole batch, so the per-model figure
# below is total_time / number of models, not an individual training time
print(f"\n{'='*70}")
print(f"All models trained in {total_time:.2f} seconds")
print(f"Average wall-clock time per model: {total_time/len(results):.2f} seconds")
print(f"{'='*70}")


Starting parallel LightGBM training (5 models)...
Evaluation: VALIDATION set
   5 parallel jobs submitted, waiting...
(train_and_evaluate pid=1794) [Config 2] Training: {'n_estimators': 600, 'max_depth': 12, 'learning_rate': 0.01, 'num_leaves': 100, 'min_child_samples': 3}



All models trained in 2766.06 seconds
Average wall-clock time per model: 553.21 seconds


In [10]:
print("\nLightGBM Model Comparison (VALIDATION scores):")
print("=" * 120)
print(f"{'Config':<8} {'n_est':<8} {'depth':<8} {'lr':<8} {'leaves':<8} {'min_child':<10} {'Val_Acc':<12} {'Val_F1':<12} {'Time(s)':<10}")
print("=" * 120)

for result in results:
    p = result['params']
    print(f"{result['config_id']:<8} {p['n_estimators']:<8} {p['max_depth']:<8} "
          f"{p['learning_rate']:<8} {p['num_leaves']:<8} {p['min_child_samples']:<10} "
          f"{result['val_accuracy']:<12.4f} {result['val_f1_score']:<12.4f} {result['training_time']:<10.2f}")

print("=" * 120)

# Select the configuration with the highest weighted F1 on the validation set
best_result = max(results, key=lambda x: x['val_f1_score'])
print(f"\nBest LightGBM Model (by validation F1): Config {best_result['config_id']}")
print(f"   Validation F1 Score: {best_result['val_f1_score']:.4f}")
print(f"   Validation Accuracy: {best_result['val_accuracy']:.4f}")
print(f"\nBest Parameters:")
for k, v in best_result['params'].items():
    print(f"   {k}: {v}")

model = best_result['model']


LightGBM Model Comparison (VALIDATION scores):
Config   n_est    depth    lr       leaves   min_child  Val_Acc      Val_F1       Time(s)   
0        450      10       0.03     90       5          0.0967       0.1016       1315.74   
1        400      15       0.03     128      5          0.1073       0.1101       1356.87   
2        600      12       0.01     100      3          0.0890       0.0932       1813.67   
3        350      -1       0.04     255      10         0.1324       0.1251       1411.39   
4        300      8        0.05     64       1          0.0896       0.0951       797.84    

Best LightGBM Model (by validation F1): Config 3
   Validation F1 Score: 0.1251
   Validation Accuracy: 0.1324

Best Parameters:
   n_estimators: 350
   max_depth: -1
   learning_rate: 0.04
   num_leaves: 255
   min_child_samples: 10
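
The reported timings allow a rough estimate of what running the five configurations in parallel gained. The sketch below compares the sum of the per-configuration training times from the table with the 2766.06-second wall-clock total. It is only an approximation: the individual times were measured while the tasks were competing for the same machine, so a truly sequential run could be faster than their sum.

```python
# Per-configuration training times in seconds, copied from the comparison table
training_times = [1315.74, 1356.87, 1813.67, 1411.39, 797.84]

# Wall-clock time for the whole parallel batch, copied from the training output
wall_clock = 2766.06

# Approximate cost of running the same tasks one after another
sequential_estimate = sum(training_times)

# Ratio of estimated sequential time to observed parallel time
speedup = sequential_estimate / wall_clock

print(f"Sequential estimate: {sequential_estimate:.2f} s")
print(f"Approximate speedup: {speedup:.2f}x")
```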


## 5) Model Evaluation


In [11]:
print("FINAL EVALUATION: TEST SET")
print("=" * 60)

# The categorical columns of X_test were converted earlier; this guard only
# catches columns that are still of object dtype
for col in CATEGORICAL_COLS:
    if col in X_test.columns and X_test[col].dtype == 'object':
        X_test[col] = X_test[col].astype("category")

# Predict on the test set
y_pred = model.predict(X_test)

# Compute all metrics on the test set
accuracy = accuracy_score(y_test_encoded, y_pred)
f1 = f1_score(y_test_encoded, y_pred, average='weighted', zero_division=0)
precision = precision_score(y_test_encoded, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test_encoded, y_pred, average='weighted', zero_division=0)

print("\n" + "=" * 60)
print("BEST MODEL PERFORMANCE (TEST SET)")
print("=" * 60)
print(f"Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"F1 Score:  {f1:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print("=" * 60)

# Validation vs. test comparison
print(f"\n Validation vs Test Comparison:")
print(f"   Validation Accuracy: {best_result['val_accuracy']:.4f} | Test Accuracy: {accuracy:.4f}")
print(f"   Validation F1:       {best_result['val_f1_score']:.4f} | Test F1:       {f1:.4f}")

FINAL EVALUATION: TEST SET

BEST MODEL PERFORMANCE (TEST SET)
Accuracy:  0.1274 (12.74%)
F1 Score:  0.1181
Precision: 0.1133
Recall:    0.1274

 Validation vs Test Comparison:
   Validation Accuracy: 0.1324 | Test Accuracy: 0.1274
   Validation F1:       0.1251 | Test F1:       0.1181
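
Recall and Accuracy are both 0.1274 in the output above. That is expected: with `average='weighted'`, each class's recall is weighted by its support, which works out to the total number of correct predictions divided by the number of samples. The toy example below illustrates this and also prints macro-averaged F1, which treats every class equally and can be a useful complement when classes are imbalanced. The labels are made up for the demonstration.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels: class 0 is frequent, classes 1 and 2 are rare
y_true_demo = [0, 0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 0, 1, 2, 2]

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
demo_weighted_recall = recall_score(y_true_demo, y_pred_demo, average="weighted")

# Weighted F1 follows class frequencies; macro F1 gives each class the same weight
demo_weighted_f1 = f1_score(y_true_demo, y_pred_demo, average="weighted")
demo_macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")

print(f"Accuracy:        {demo_accuracy:.4f}")
print(f"Weighted recall: {demo_weighted_recall:.4f}")
print(f"Weighted F1:     {demo_weighted_f1:.4f}")
print(f"Macro F1:        {demo_macro_f1:.4f}")
```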


## 6) Feature Importance Ranking


In [12]:
# Feature importance: split = number of times a feature is used in a split,
# gain = total gain contributed by those splits
print("Feature Importance Analysis")
print("=" * 70)

importances_split = model.booster_.feature_importance(importance_type='split')
importances_gain = model.booster_.feature_importance(importance_type='gain')

importance_df = pd.DataFrame({
    'Feature': feature_cols,
    'Split': importances_split,
    'Gain': importances_gain
}).sort_values(by='Gain', ascending=False)

# Normalize gain so that the values sum to 1
importance_df['Gain_Norm'] = importance_df['Gain'] / importance_df['Gain'].sum()

print("\nAll Features (by gain):")
print(importance_df.to_string(index=False))

print("\nVisual:")
print("-" * 70)
for _, row in importance_df.iterrows():
    bar = "█" * int(row['Gain_Norm'] * 100)
    print(f"{row['Feature']:<25} {row['Gain_Norm']:.4f} ({row['Gain_Norm']*100:.1f}%) {bar}")
print("-" * 70)

print(f"\nTop 5 total importance: {importance_df.head(5)['Gain_Norm'].sum()*100:.1f}%")


Feature Importance Analysis

All Features (by gain):
             Feature   Split         Gain  Gain_Norm
         cellId_730m  269697 6.037446e+06   0.237920
  pokestopDistanceKm 2578993 5.223248e+06   0.205835
       gymDistanceKm 2423171 4.376105e+06   0.172451
         temperature 1792446 3.102735e+06   0.122271
        appearedHour 1467583 2.514161e+06   0.099077
  population_density  976642 1.829928e+06   0.072113
nearby_pokemon_count  982196 1.717089e+06   0.067666
        closeToWater  104017 1.925778e+05   0.007589
         terrainType    4390 1.018407e+05   0.004013
               urban   26304 8.384118e+04   0.003304
            suburban   17118 7.015399e+04   0.002765
            midurban   21520 5.553045e+04   0.002188
             weather    5028 3.205271e+04   0.001263
   appearedDayOfWeek    3766 1.778013e+04   0.000701
               rural    6984 1.440360e+04   0.000568
   appearedTimeOfDay    1089 7.054314e+03   0.000278

Visual:
-----------------------------------
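
The normalized gain values in the table can be aggregated by hand to see how concentrated the importance is. The sketch below sums the five largest `Gain_Norm` values and the two POI distance features, using numbers copied from the table; the variable names are illustrative.

```python
# Gain_Norm values copied from the feature importance table (top 5 rows)
gain_norm = {
    "cellId_730m": 0.237920,
    "pokestopDistanceKm": 0.205835,
    "gymDistanceKm": 0.172451,
    "temperature": 0.122271,
    "appearedHour": 0.099077,
}

# Combined share of the five most important features
top5_share = sum(gain_norm.values())

# Combined share of the two POI distance features
poi_share = gain_norm["pokestopDistanceKm"] + gain_norm["gymDistanceKm"]

print(f"Top 5 share: {top5_share * 100:.1f}%")
print(f"POI share:   {poi_share * 100:.1f}%")
```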

## 8) Save the Model


In [13]:
import joblib

joblib.dump(model, "pokemon_lgbm_model_ray.joblib")
joblib.dump(label_encoder, "pokemon_encoder_ray.joblib")
joblib.dump(feature_cols, "pokemon_features_ray.joblib")

print("Saved files:")
print("   - pokemon_lgbm_model_ray.joblib")
print("   - pokemon_encoder_ray.joblib")
print("   - pokemon_features_ray.joblib")


Saved files:
   - pokemon_lgbm_model_ray.joblib
   - pokemon_encoder_ray.joblib
   - pokemon_features_ray.joblib
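
At inference time the saved encoder is what turns the model's 0-indexed predictions back into the original `class` IDs. The sketch below shows that round trip with `joblib` on a small stand-in encoder written to a temporary directory; the file name, labels, and predicted indices are hypothetical, and the same pattern would apply to the artifacts saved above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in encoder fitted on a few example class IDs
demo_encoder = LabelEncoder().fit(np.array([16, 19, 133, 16]))

with tempfile.TemporaryDirectory() as tmp_dir:
    encoder_path = Path(tmp_dir) / "encoder_demo.joblib"
    joblib.dump(demo_encoder, encoder_path)      # save
    restored_encoder = joblib.load(encoder_path)  # load in a later session

# Pretend these are 0-indexed predictions returned by model.predict()
predicted_indices = np.array([0, 2, 1])

# Map the encoded predictions back to the original class IDs
predicted_classes = restored_encoder.inverse_transform(predicted_indices)
print(predicted_classes)
```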


## 9) Summary


In [14]:
print("=" * 70)
print("PIPELINE SUMMARY (RAY + LightGBM)")
print("=" * 70)

print(f"\nDATA:")
total_records = train_ds.count() + val_ds.count() + test_ds.count()
print(f"   Total Records: {total_records:,}")
print(f"   Train: {train_ds.count():,} (70%) | Val: {val_ds.count():,} (15%) | Test: {test_ds.count():,} (15%)")
print(f"   Features: {len(feature_cols)} | Classes: {len(label_encoder.classes_)}")

print(f"\nTRAINING:")
print(f"   {len(hyperparameter_configs)} configurations, {total_time:.2f} seconds")
print(f"   Hyperparameters were selected on the VALIDATION set (no data leakage)")

print(f"\nBEST MODEL:")
bp = best_result['params']
print(f"   n_estimators={bp['n_estimators']}, max_depth={bp['max_depth']}, lr={bp['learning_rate']}")

print(f"\nPERFORMANCE (TEST SET - Final):")
print(f"   Accuracy: {accuracy*100:.2f}% | F1: {f1:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f}")

print(f"\nVALIDATION vs TEST:")
print(f"   Val Acc: {best_result['val_accuracy']*100:.2f}% → Test Acc: {accuracy*100:.2f}%")
print(f"   Val F1:  {best_result['val_f1_score']:.4f} → Test F1:  {f1:.4f}")

print(f"\nTOP 3 FEATURES:")
for _, row in importance_df.head(3).iterrows():
    print(f"   {row['Feature']}: {row['Gain_Norm']*100:.1f}%")

print("\n" + "=" * 70)
print("Pipeline complete")
print("=" * 70)


PIPELINE SUMMARY (RAY + LightGBM)

DATA:
   Total Records: 296,021
   Train: 207,214 (70%) | Val: 44,403 (15%) | Test: 44,404 (15%)
   Features: 16 | Classes: 144

TRAINING:
   5 configurations, 2766.06 seconds
   Hyperparameters were selected on the VALIDATION set (no data leakage)

BEST MODEL:
   n_estimators=350, max_depth=-1, lr=0.04

PERFORMANCE (TEST SET - Final):
   Accuracy: 12.74% | F1: 0.1181 | Precision: 0.1133 | Recall: 0.1274

VALIDATION vs TEST:
   Val Acc: 13.24% → Test Acc: 12.74%
   Val F1:  0.1251 → Test F1:  0.1181

TOP 3 FEATURES:
   cellId_730m: 23.8%
   pokestopDistanceKm: 20.6%
   gymDistanceKm: 17.2%

Pipeline complete


In [15]:
ray.shutdown()
print("Ray session closed")


Ray session closed
