# Topic Classifier Training Pipeline

Bu notebook, metin sınıflandırma için hibrit özellik vektörizasyonu kullanan makine öğrenmesi modellerini eğitir.

## Pipeline Workflow:
1. **Veri Yükleme** - Train/test verilerini yükleme ve ön işleme
2. **GloVe Embeddings** - Önceden eğitilmiş word embeddings yükleme  
3. **TF-IDF Hesaplaması** - Kategori-bazlı TF-IDF skorları hesaplama
4. **Feature Vektörizasyonu** - Hibrit vektör oluşturma (GloVe + TF-IDF weighted)
5. **Model Eğitimi** - LogReg (tam vektör) → PCA → Diğer modeller (PCA)
6. **Sonuç Analizi** - Performans karşılaştırması

## Özellikler:
- **Hibrit Vektörler**: GloVe + TF-IDF ağırlıklı kategori vektörleri
- **Eksik Kelime İşleme**: Kategorsel TF-IDF tabanlı KNN ile vektör oluşturma
- **Hızlı Mod**: Önceden optimize edilmiş parametreler
- **PCA Optimizasyonu**: Boyut indirgeme ile hız artışı

## 1. Kütüphane İmportları ve Konfigürasyon

### 1.1 Gerekli Kütüphanelerin İmportları

In [1]:
import pandas as pd
import numpy as np
import os
import time
import pickle
import re
import warnings
import requests
import zipfile
import io
from tqdm import tqdm
from typing import Dict, List, Tuple
from dataclasses import dataclass
from abc import ABC, abstractmethod

# Sklearn
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import _stop_words

warnings.filterwarnings("ignore")

print("Kütüphaneler başarıyla yüklendi!")

Kütüphaneler başarıyla yüklendi!


### 1.2 Konfigürasyon Sınıfı

Bu hücre training pipeline'ının tüm konfigürasyon ayarlarını tanımlar. Hızlı mod (predefined params) ve hibrit vektör kullanımı gibi seçenekleri içerir.

In [2]:
@dataclass
class Config:
    """Training konfigürasyonu"""
    # Data paths
    train_path: str = "archive/train.csv"
    test_path: str = "archive/test.csv"
    glove_dir: str = "glove"
    models_dir: str = "models"
    cache_dir: str = "cache"
    
    # GloVe settings
    glove_dim: int = 300
    
    # TF-IDF settings
    tfidf_max_features: int = 5000
    
    # KNN settings for missing words
    knn_neighbors: int = 5
    
    # PCA settings
    pca_components: int = 100
    
    # Cross-validation
    cv_folds: int = 3
    
    # Random search
    random_search_iter: int = 20
    random_state: int = 42
    
    # Strategy settings
    use_predefined_params: bool = True  # Fast mode with pre-optimized params
    use_hybrid_vectors: bool = True     # Category-weighted hybrid vectors
    
    # Categories
    categories: List[int] = None
    
    def __post_init__(self):
        if self.categories is None:
            self.categories = [1, 2, 3, 4]  # World, Sports, Business, Sci/Tech
        
        # Create directories
        os.makedirs(self.models_dir, exist_ok=True)
        os.makedirs(self.cache_dir, exist_ok=True)
    
    @property
    def hybrid_vector_dim(self) -> int:
        """Calculate hybrid vector dimension"""
        if self.use_hybrid_vectors:
            return self.glove_dim * len(self.categories)  # 300 * 4 = 1200D
        else:
            return self.glove_dim

# Initialize configuration
config = Config()

print("CONFIGURATION")
print(f"├── Training Strategy: {'FAST (predefined params)' if config.use_predefined_params else 'SEARCH (RandomizedSearchCV)'}")
print(f"├── Vector Type: {'Hybrid' if config.use_hybrid_vectors else 'Standard'} ({config.hybrid_vector_dim}D)")
print(f"├── PCA Components: {config.pca_components}")
print(f"├── CV Folds: {config.cv_folds}")
print(f"├── KNN Neighbors: {config.knn_neighbors}")
print(f"└── Cache Directory: {config.cache_dir}")

if config.use_hybrid_vectors:
    print(f"Hybrid Vector Details:")
    print(f"   Base GloVe: {config.glove_dim}D")
    print(f"   Categories: {len(config.categories)} ({config.categories})")
    print(f"   Final: {config.hybrid_vector_dim}D (TF-IDF weighted)")

CONFIGURATION
├── Training Strategy: FAST (predefined params)
├── Vector Type: Hybrid (1200D)
├── PCA Components: 100
├── CV Folds: 3
├── KNN Neighbors: 5
└── Cache Directory: cache
Hybrid Vector Details:
   Base GloVe: 300D
   Categories: 4 ([1, 2, 3, 4])
   Final: 1200D (TF-IDF weighted)


## 2. Utility Classes (Yardımcı Sınıflar)

### 2.1 Metin Temizleme Sınıfı

Bu hücre metinleri temizlemek için kullanılan TextPreprocessor sınıfını tanımlar. URL'ler, email'ler, sayılar ve özel karakterleri kaldırır.

In [3]:
class TextPreprocessor:
    """Text preprocessing utilities"""
    
    @staticmethod
    def clean_text(text: str) -> str:
        """Advanced text cleaning"""
        if not isinstance(text, str):
            return ""
        
        text = text.lower()
        text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
        text = re.sub(r'\S+@\S+', '', text)
        text = re.sub(r'\d+', '', text)
        text = re.sub(r'[^\w\s]', '', text)
        text = re.sub(r'\b\w{1,2}\b', '', text)
        text = re.sub(r'\b\w{15,}\b', '', text)
        text = re.sub(r'(.)\1{2,}', r'\1\1', text)
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text

print("TextPreprocessor sınıfı tanımlandı")

TextPreprocessor sınıfı tanımlandı


### 2.2 GloVe Embeddings Yükleme Sınıfı

Bu hücre GloVe word embeddings'lerini yüklemek için kullanılan sınıfı tanımlar. Otomatik download ve progress bar özellikleri içerir.

In [4]:
class GloVeLoader:
    """GloVe embeddings loader with progress tracking"""
    
    def __init__(self, glove_dir: str, dim: int = 300):
        self.glove_dir = glove_dir
        self.dim = dim
        self.embeddings_index = {}
    
    def load_embeddings(self) -> Dict[str, np.ndarray]:
        """Load GloVe vectors (auto-download if needed)"""
        # Auto-download GloVe files if they don't exist
        glove_path = self.download_glove_model()
        
        print(f"Loading GloVe vectors from: {glove_path}")
        
        # Count total lines for progress bar
        print("Counting lines for progress tracking...")
        total_lines = sum(1 for _ in open(glove_path, encoding='utf-8'))
        
        with open(glove_path, encoding='utf-8') as f:
            for line in tqdm(f, desc="Loading GloVe", total=total_lines, unit="words"):
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                self.embeddings_index[word] = coefs
        
        print(f"Loaded {len(self.embeddings_index):,} word vectors")
        return self.embeddings_index
    
    def download_glove_model(self) -> str:
        """Download GloVe embeddings if not available"""
        glove_path = os.path.join(self.glove_dir, f"glove.6B.{self.dim}d.txt")
        
        if not os.path.exists(glove_path):
            print(f"GloVe vektörleri indiriliyor (boyut: {self.dim}d)...")
            print("Not: Bu büyük bir dosya, indirme işlemi birkaç dakika sürebilir.")
            
            # Create glove directory if it doesn't exist
            os.makedirs(self.glove_dir, exist_ok=True)
            
            try:
                url = "https://nlp.stanford.edu/data/glove.6B.zip"
                print("İndirme başlıyor...")
                r = requests.get(url, stream=True, timeout=120)
                if r.status_code == 200:
                    print("İndirme tamamlandı, zipten çıkarılıyor...")
                    z = zipfile.ZipFile(io.BytesIO(r.content))
                    z.extractall(self.glove_dir)
                    print("GloVe vektörleri başarıyla indirildi ve çıkarıldı.")
                else:
                    print(f"İndirme hatası: {r.status_code}")
                    raise Exception("GloVe vektörleri indirilemedi.")
            except Exception as e:
                print(f"Hata: {e}")
                print("GloVe vektörlerini manuel olarak indirmeniz gerekebilir.")
                print("İndirme linki: https://nlp.stanford.edu/data/glove.6B.zip")
                print(f"Dosyayı {self.glove_dir} klasörüne çıkarın.")
                raise Exception("GloVe vektörleri yüklenemedi.")
        else:
            print(f"GloVe vektörleri zaten mevcut: {glove_path}")
        
        return glove_path

print("GloVeLoader sınıfı tanımlandı")

GloVeLoader sınıfı tanımlandı


### 2.3 TF-IDF Skor Hesaplama Sınıfı

Bu hücre cache kullanmadan TF-IDF skorlarını her seferinde yeniden hesaplayan SimpleTFIDFBuilder sınıfını tanımlar. Kategori-bazlı TF-IDF hesaplamaları yapar.

In [5]:
class SimpleTFIDFBuilder:
    """Simplified TF-IDF builder without caching - calculates everything from scratch"""
    
    def __init__(self, config: Config):
        self.config = config
        self.word_score_cache = {}
        self.category_vectorizers = {}
    
    def build_tfidf_scores(self, texts: List[str], labels: List[int]) -> Dict:
        """Build TF-IDF scores from scratch without any caching"""
        print("Building TF-IDF scores from scratch...")
        
        # Clean texts
        print("Cleaning texts...")
        cleaned_texts = [TextPreprocessor.clean_text(text) for text in tqdm(texts, desc="Cleaning texts")]
        
        # Build category-specific TF-IDF vectorizers
        print("Building category-specific TF-IDF vectorizers...")
        for category in tqdm(self.config.categories, desc="Processing categories"):
            print(f"Processing category {category}...")
            
            # Filter texts for this category
            category_texts = [
                cleaned_texts[i] for i, label in enumerate(labels) 
                if label == category
            ]
            
            if not category_texts:
                print(f"Warning: No texts found for category {category}!")
                continue
            
            # Create TF-IDF vectorizer for this category
            vectorizer = TfidfVectorizer(
                max_features=self.config.tfidf_max_features,
                stop_words='english',
                ngram_range=(1, 1),
                lowercase=True,
                token_pattern=r'\b[a-zA-Z]{3,}\b'
            )
            
            # Fit and store vectorizer
            vectorizer.fit(category_texts)
            self.category_vectorizers[category] = vectorizer
        
        # Build word score cache
        self._build_word_score_cache(cleaned_texts, labels)
        
        print(f"TF-IDF scores built for {len(self.word_score_cache):,} words")
        return self.word_score_cache
    
    def _build_word_score_cache(self, cleaned_texts: List[str], labels: List[int]):
        """Build optimized word score cache"""
        print("Building optimized word score cache...")
        
        # Pre-compute category data
        print("Pre-computing TF-IDF matrices...")
        category_data = {}
        
        for category in tqdm(self.config.categories, desc="Pre-computing matrices"):
            if category in self.category_vectorizers:
                vectorizer = self.category_vectorizers[category]
                
                # Filter category texts
                category_texts = [
                    cleaned_texts[i] for i, label in enumerate(labels) 
                    if label == category
                ]
                
                if category_texts:
                    # Transform texts to TF-IDF matrix
                    tfidf_matrix = vectorizer.transform(category_texts)
                    feature_names = vectorizer.get_feature_names_out()
                    
                    # Compute mean scores for all features
                    feature_scores = np.array(tfidf_matrix.mean(axis=0)).flatten()
                    
                    # Create fast lookup dictionary
                    feature_to_idx = {word: idx for idx, word in enumerate(feature_names)}
                    
                    category_data[category] = {
                        'feature_names': feature_names,
                        'feature_scores': feature_scores,
                        'feature_to_idx': feature_to_idx
                    }
                    
                    print(f"Category {category}: {len(feature_names)} features pre-computed")
        
        # Collect all unique words
        all_words = set()
        for data in category_data.values():
            all_words.update(data['feature_names'])
            
        # Add words from texts that might not be in any category features
        stop_words = _stop_words.ENGLISH_STOP_WORDS
        for text in cleaned_texts:
            tokens = [t for t in text.split() if t not in stop_words and len(t) >= 3]
            all_words.update(tokens)
        
        print(f"Found {len(all_words):,} unique words total")
        
        # Build cache from pre-computed data
        print("Building cache from pre-computed data...")
        for word in tqdm(all_words, desc="Building word cache"):
            self.word_score_cache[word] = {}
            
            for category in self.config.categories:
                if (category in category_data and 
                    word in category_data[category]['feature_to_idx']):
                    # Fast lookup from pre-computed data
                    word_idx = category_data[category]['feature_to_idx'][word]
                    score = category_data[category]['feature_scores'][word_idx]
                    self.word_score_cache[word][category] = float(score) if not np.isnan(score) else 0.0
                else:
                    self.word_score_cache[word][category] = 0.0

print("SimpleTFIDFBuilder sınıfı tanımlandı")

SimpleTFIDFBuilder sınıfı tanımlandı


### 2.4 Eksik Kelime İşleme Sınıfı

Bu hücre GloVe'de bulunmayan kelimeler için TF-IDF bazlı KNN ile vektör oluşturulur.

In [6]:
class TFIDFBasedKNNHandler:
    """TF-IDF based KNN handler for missing words"""
    
    def __init__(self, embeddings_index: Dict, word_score_cache: Dict, config: Config):
        self.embeddings_index = embeddings_index
        self.word_score_cache = word_score_cache
        self.config = config
    
    def find_top_words_by_tfidf(self, target_category: int, min_score: float = 0.01) -> List[Tuple[str, float]]:
        """Find top words by TF-IDF score for target category"""
        category_words = []
        
        for word, scores in self.word_score_cache.items():
            category_score = scores.get(target_category, 0.0)
            if category_score > min_score and word in self.embeddings_index:
                category_words.append((word, category_score))
        
        return sorted(category_words, key=lambda x: x[1], reverse=True)
    
    def create_missing_word_vector(self, missing_word: str, target_category: int) -> np.ndarray:
        """Create vector for missing word using TF-IDF weighted KNN"""
        top_words = self.find_top_words_by_tfidf(target_category)
        
        if not top_words:
            return np.random.normal(0, 0.1, size=self.config.glove_dim)
        
        top_neighbors = top_words[:self.config.knn_neighbors]
        
        vectors = []
        weights = []
        
        for word, tfidf_score in top_neighbors:
            if word in self.embeddings_index:
                vector = self.embeddings_index[word]
                weight = tfidf_score
                
                vectors.append(vector)
                weights.append(weight)
        
        if not vectors:
            return np.random.normal(0, 0.1, size=self.config.glove_dim)
        
        # Weighted average based on TF-IDF scores
        vectors = np.array(vectors)
        weights = np.array(weights)
        weights = weights / weights.sum()  # Normalize
        
        return np.average(vectors, axis=0, weights=weights)
    
    def get_category_statistics(self, target_category: int) -> Dict:
        """Get category statistics for debugging"""
        top_words = self.find_top_words_by_tfidf(target_category)
        
        if not top_words:
            return {'total_words': 0, 'avg_score': 0, 'top_5': []}
        
        scores = [score for _, score in top_words]
        
        return {
            'total_words': len(top_words),
            'avg_score': np.mean(scores),
            'max_score': max(scores),
            'min_score': min(scores),
            'top_5': top_words[:5]
        }

print("TFIDFBasedKNNHandler sınıfı tanımlandı")

TFIDFBasedKNNHandler sınıfı tanımlandı


### 2.5 Feature Vektörizasyon Sınıfı

Bu hücre hibrit feature vektörleri oluşturan FeatureVectorizer sınıfını tanımlar. GloVe + TF-IDF ağırlıklı kategori vektörleri oluşturur ve PCA transformer'ını içerir.

In [7]:
class FeatureVectorizer:
    """Main feature vectorization class"""
    
    def __init__(self, embeddings_index: Dict, word_score_cache: Dict, config: Config):
        self.embeddings_index = embeddings_index.copy()
        self.word_score_cache = word_score_cache
        self.config = config
        self.missing_word_handler = TFIDFBasedKNNHandler(embeddings_index, word_score_cache, config)
        self.missing_word_cache = {}
        self.pca = PCA(n_components=config.pca_components, random_state=config.random_state)
        self.is_fitted = False
    
    def text_to_vector(self, text: str, target_category: int) -> np.ndarray:
        """Convert single text to vector"""
        cleaned_text = TextPreprocessor.clean_text(text)
        tokens = cleaned_text.split()
        stop_words = _stop_words.ENGLISH_STOP_WORDS
        filtered_tokens = [t for t in tokens if t not in stop_words and len(t) >= 3]
        
        if not filtered_tokens:
            if self.config.use_hybrid_vectors:
                return np.zeros(self.config.hybrid_vector_dim)
            else:
                return np.zeros(self.config.glove_dim)
        
        weighted_vectors = []
        total_weight = 0
        
        for token in filtered_tokens:
            if token in self.embeddings_index:
                # Word exists in GloVe
                base_vector = self.embeddings_index[token]
                weight = self._calculate_word_weight(token, target_category)
                
                if self.config.use_hybrid_vectors:
                    hybrid_vector = self._create_hybrid_vector(base_vector, token)
                    weighted_vectors.append(hybrid_vector * weight)
                else:
                    weighted_vectors.append(base_vector * weight)
                
                total_weight += weight
                
            else:
                # Missing word - handle with KNN
                if token in self.missing_word_cache:
                    base_vector = self.missing_word_cache[token]
                else:
                    base_vector = self.missing_word_handler.create_missing_word_vector(token, target_category)
                    self.missing_word_cache[token] = base_vector
                    self.embeddings_index[token] = base_vector
                
                if self.config.use_hybrid_vectors:
                    hybrid_vector = self._create_hybrid_vector(base_vector, token)
                    weighted_vectors.append(hybrid_vector * 0.2)  # Lower weight for missing words
                else:
                    weighted_vectors.append(base_vector * 0.2)
                
                total_weight += 0.2
        
        if not weighted_vectors or total_weight == 0:
            if self.config.use_hybrid_vectors:
                return np.zeros(self.config.hybrid_vector_dim)
            else:
                return np.zeros(self.config.glove_dim)
        
        # Weighted average
        return np.sum(weighted_vectors, axis=0) / total_weight
    
    def _create_hybrid_vector(self, base_vector: np.ndarray, word: str) -> np.ndarray:
        """Create hybrid vector: 300D GloVe → 1200D (4×TF-IDF weighted)"""
        hybrid_parts = []
        
        # For each category, create TF-IDF weighted vector
        for category in self.config.categories:
            tfidf_score = self.word_score_cache.get(word, {}).get(category, 0.0)
            category_weighted_vector = base_vector * tfidf_score
            hybrid_parts.append(category_weighted_vector)
        
        # Concatenate: 300D + 300D + 300D + 300D = 1200D
        return np.concatenate(hybrid_parts)
    
    def _calculate_word_weight(self, word: str, target_category: int) -> float:
        """Calculate word weight based on TF-IDF scores"""
        if word not in self.word_score_cache:
            return 0.1
        
        category_scores = [
            self.word_score_cache[word].get(cat, 0.0) 
            for cat in self.config.categories
        ]
        
        target_score = category_scores[target_category - 1]  # 0-indexed
        
        if target_score <= 0:
            return 0.1
        
        # Cross-category discriminativeness
        other_scores = [
            category_scores[i] for i in range(len(self.config.categories))
            if i != (target_category - 1)
        ]
        
        mean_other = np.mean(other_scores) if other_scores else 0
        std_other = np.std(other_scores) if len(other_scores) > 1 else 0.01
        
        # Z-score based distinctiveness
        if std_other > 0:
            z_score = (target_score - mean_other) / std_other
            distinctiveness = max(0, min(2, z_score)) / 2.0
        else:
            distinctiveness = min(target_score / (mean_other + 0.01), 2.0) / 2.0
        
        final_weight = target_score * (1.0 + distinctiveness)
        return min(final_weight, 2.0)
    
    def texts_to_vectors(self, texts: List[str], labels: List[int]) -> np.ndarray:
        """Convert texts to feature vectors"""
        print("Converting texts to vectors...")
        
        vectors = []
        for text, label in tqdm(zip(texts, labels), total=len(texts), desc="Vectorizing texts"):
            vector = self.text_to_vector(text, label)
            vectors.append(vector)
        
        return np.array(vectors)
    
    def fit_pca(self, vectors: np.ndarray):
        """Fit PCA transformer"""
        input_dim = vectors.shape[1]
        print(f"PCA fitting: {input_dim}D → {self.config.pca_components}D")
        
        if self.config.use_hybrid_vectors:
            print(f"Input: Hybrid vectors ({len(self.config.categories)}×{self.config.glove_dim}D TF-IDF weighted)")
        
        self.pca.fit(vectors)
        self.is_fitted = True
        
        explained_var = self.pca.explained_variance_ratio_.sum()
        print(f"PCA fitted, explained variance ratio: {explained_var:.4f}")
    
    def transform_pca(self, vectors: np.ndarray) -> np.ndarray:
        """Transform vectors using fitted PCA"""
        if not self.is_fitted:
            raise ValueError("PCA not fitted yet!")
        return self.pca.transform(vectors)

print("FeatureVectorizer sınıfı tanımlandı")

FeatureVectorizer sınıfı tanımlandı


## 3. Model Classes (Model Sınıfları)

### 3.1 Temel Model Sınıfı

Bu hücre tüm makine öğrenmesi modellerinin kalıtım alacağı abstract BaseModel sınıfını tanımlar. Hiperparametre optimizasyonu için ortak arayüz sağlar.

In [8]:
class BaseModel(ABC):
    """Base model class with hyperparameter optimization"""
    
    def __init__(self, config: Config):
        self.config = config
        self.model = None
    
    @abstractmethod
    def get_model(self):
        pass
    
    @abstractmethod
    def get_param_grid(self) -> Dict:
        pass
    
    @abstractmethod
    def get_predefined_params(self) -> Dict:
        pass
    
    def hyperparameter_search(self, X: np.ndarray, y: np.ndarray) -> Dict:
        """Hyperparameter tuning: predefined params or RandomizedSearchCV"""
        
        if self.config.use_predefined_params:
            # Use predefined optimal parameters (FAST)
            print(f"Using predefined optimal params for {self.__class__.__name__}...")
            search_start = time.time()
            
            predefined_params = self.get_predefined_params()
            if predefined_params:
                # Create model with optimal parameters
                model = self.get_model()
                model.set_params(**predefined_params)
                
                # Cross validation to measure performance
                cv_scores = cross_val_score(
                    model, X, y, 
                    cv=self.config.cv_folds, 
                    scoring='accuracy',
                    n_jobs=-1
                )
                
                # Full fit
                model.fit(X, y)
                self.model = model
                
                search_time = time.time() - search_start
                print(f"Predefined params applied in {search_time:.2f}s")
                
                return {
                    'best_score': cv_scores.mean(),
                    'best_params': predefined_params,
                    'cv_results': None,
                    'best_estimator': model,
                    'search_time': search_time,
                    'method': 'predefined'
                }
            else:
                print("No predefined params, falling back to default training...")
                return self._train_with_cv(X, y)
        
        else:
            # RandomizedSearchCV hyperparameter search (SLOW)
            print(f"Hyperparameter search for {self.__class__.__name__}...")
            search_start = time.time()
            
            model = self.get_model()
            param_grid = self.get_param_grid()
            
            if not param_grid:
                print("No param grid, using default training...")
                return self._train_with_cv(X, y)
            
            search = RandomizedSearchCV(
                model,
                param_grid,
                n_iter=self.config.random_search_iter,
                cv=self.config.cv_folds,
                scoring='accuracy',
                random_state=self.config.random_state,
                n_jobs=-1,
                verbose=0
            )
            
            search.fit(X, y)
            self.model = search.best_estimator_
            
            search_time = time.time() - search_start
            print(f"Search completed in {search_time:.2f}s")
            
            return {
                'best_score': search.best_score_,
                'best_params': search.best_params_,
                'cv_results': search.cv_results_,
                'best_estimator': search.best_estimator_,
                'search_time': search_time,
                'method': 'random_search'
            }
    
    def _train_with_cv(self, X: np.ndarray, y: np.ndarray) -> Dict:
        """Cross validation ile train et"""
        model = self.get_model()
        
        # Cross validation
        cv_scores = cross_val_score(
            model, X, y, 
            cv=self.config.cv_folds, 
            scoring='accuracy',
            n_jobs=-1
        )
        
        # Full fit
        model.fit(X, y)
        self.model = model
        
        return {
            'best_score': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'cv_scores': cv_scores,
            'best_estimator': model,
            'method': 'cv_only'
        }

print("BaseModel sınıfı tanımlandı")

BaseModel sınıfı tanımlandı


### 3.2 Logistic Regression Model Sınıfı

Bu hücre LogisticRegression modeli için özel sınıfı tanımlar. Hiperparametre gridini ve önceden optimize edilmiş parametreleri içerir.

In [9]:
class LogisticRegressionModel(BaseModel):
    def get_model(self):
        return LogisticRegression(max_iter=2000, random_state=self.config.random_state, n_jobs=-1)
    
    def get_param_grid(self) -> Dict:
        return {
            'C': [0.01, 0.1, 1, 10, 100],
            'penalty': ['l1', 'l2'],
            'solver': ['liblinear', 'saga']
        }
    
    def get_predefined_params(self) -> Dict:
        return {
            'C': 100,
            'penalty': 'l1',
            'solver': 'liblinear'
        }

class SVMModel(BaseModel):
    def get_model(self):
        return SVC(random_state=self.config.random_state)
    
    def get_param_grid(self) -> Dict:
        return {
            'C': [0.1, 1, 10, 100],
            'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
            'kernel': ['rbf', 'linear', 'poly']
        }
    
    def get_predefined_params(self) -> Dict:
        return {
            'C': 100,
            'gamma': 'scale',
            'kernel': 'rbf'
        }

class MLPModel(BaseModel):
    def get_model(self):
        return MLPClassifier(
            random_state=self.config.random_state,
            max_iter=500
        )
    
    def get_param_grid(self) -> Dict:
        return {
            'hidden_layer_sizes': [(100,), (200,), (100, 50), (200, 100)],
            'activation': ['relu', 'tanh'],
            'alpha': [0.0001, 0.001, 0.01],
            'learning_rate': ['constant', 'adaptive']
        }
    
    def get_predefined_params(self) -> Dict:
        return {
            'hidden_layer_sizes': (200, 100),
            'activation': 'relu',
            'alpha': 0.0001,
            'learning_rate': 'constant'
        }

class GradientBoostingModel(BaseModel):
    def get_model(self):
        return GradientBoostingClassifier(random_state=self.config.random_state)
    
    def get_param_grid(self) -> Dict:
        return {
            'n_estimators': [50, 100, 200],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7],
            'subsample': [0.8, 0.9, 1.0]
        }
    
    def get_predefined_params(self) -> Dict:
        return {
            'n_estimators': 200,
            'learning_rate': 0.2,
            'max_depth': 5,
            'subsample': 0.9
        }

print("Tüm model sınıfları tanımlandı")

Tüm model sınıfları tanımlandı


## 4. Veri Yükleme ve Preprocessing

In [10]:
def load_data(config: Config) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Load and preprocess training and test data"""
    print("Loading datasets...")
    
    # Load training data
    train_df = pd.read_csv(config.train_path, header=None)
    train_df.columns = ["label", "title", "description"]
    train_df = train_df[train_df["label"] != "Class Index"]
    train_df["label"] = train_df["label"].astype(int)
    train_df["text"] = train_df["title"] + " " + train_df["description"]
    
    # Load test data
    test_df = pd.read_csv(config.test_path, header=None)
    test_df.columns = ["label", "title", "description"]
    test_df = test_df[test_df["label"] != "Class Index"]
    test_df["label"] = test_df["label"].astype(int)
    test_df["text"] = test_df["title"] + " " + test_df["description"]
    
    print(f"Data loaded - Train: {len(train_df)}, Test: {len(test_df)}")
    print(f"Distribution: {dict(train_df['label'].value_counts().sort_index())}")
    
    return train_df, test_df

# Execute data loading
train_df, test_df = load_data(config)

# Display sample data
print("\nSample Training Data:")
display(train_df.head())

print("\nLabel Distribution:")
label_counts = train_df['label'].value_counts().sort_index()
label_names = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}
for label, count in label_counts.items():
    print(f"  {label} ({label_names[label]}): {count:,} samples")

Loading datasets...
Data loaded - Train: 120000, Test: 7600
Distribution: {1: np.int64(30000), 2: np.int64(30000), 3: np.int64(30000), 4: np.int64(30000)}

Sample Training Data:


Unnamed: 0,label,title,description,text
1,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli...",Wall St. Bears Claw Back Into the Black (Reute...
2,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Carlyle Looks Toward Commercial Aerospace (Reu...
3,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Oil and Economy Cloud Stocks' Outlook (Reuters...
4,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Iraq Halts Oil Exports from Main Southern Pipe...
5,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...","Oil prices soar to all-time record, posing new..."



Label Distribution:
  1 (World): 30,000 samples
  2 (Sports): 30,000 samples
  3 (Business): 30,000 samples
  4 (Sci/Tech): 30,000 samples


## 5. GloVe Embeddings ve TF-IDF Skorları

### 5.1 Embeddings ve TF-IDF Kurulum Fonksiyonu

Bu hücre GloVe embeddings'lerini yükleyen ve TF-IDF skorlarını hesaplayan ana kurulum fonksiyonunu tanımlar.

In [11]:
def setup_embeddings_and_tfidf(config: Config, train_texts: List[str], train_labels: List[int]):
    """Setup GloVe embeddings and TF-IDF scores"""
    
    # Load GloVe embeddings
    print("Loading GloVe embeddings...")
    glove_loader = GloVeLoader(config.glove_dir, config.glove_dim)
    embeddings_index = glove_loader.load_embeddings()
    
    # Build TF-IDF scores from scratch (no cache)
    print("Building TF-IDF scores...")
    tfidf_builder = SimpleTFIDFBuilder(config)
    word_score_cache = tfidf_builder.build_tfidf_scores(train_texts, train_labels)
    
    return embeddings_index, word_score_cache

# Execute embeddings and TF-IDF setup
embeddings_index, word_score_cache = setup_embeddings_and_tfidf(
    config, train_df["text"].tolist(), train_df["label"].tolist()
)

print(f"\nTF-IDF Statistics:")
print(f"├── Total words: {len(word_score_cache):,}")
print(f"├── GloVe coverage: {len([w for w in word_score_cache.keys() if w in embeddings_index]):,}")
print(f"└── Missing words: {len([w for w in word_score_cache.keys() if w not in embeddings_index]):,}")

Loading GloVe embeddings...
GloVe vektörleri zaten mevcut: glove\glove.6B.300d.txt
Loading GloVe vectors from: glove\glove.6B.300d.txt
Counting lines for progress tracking...


Loading GloVe: 100%|██████████| 400000/400000 [00:19<00:00, 20490.09words/s]


Loaded 400,000 word vectors
Building TF-IDF scores...
Building TF-IDF scores from scratch...
Cleaning texts...


Cleaning texts: 100%|██████████| 120000/120000 [00:04<00:00, 26619.99it/s]


Building category-specific TF-IDF vectorizers...


Processing categories:   0%|          | 0/4 [00:00<?, ?it/s]

Processing category 1...


Processing categories:  25%|██▌       | 1/4 [00:00<00:01,  1.84it/s]

Processing category 2...


Processing categories:  50%|█████     | 2/4 [00:01<00:00,  2.00it/s]

Processing category 3...


Processing categories:  75%|███████▌  | 3/4 [00:01<00:00,  2.12it/s]

Processing category 4...


Processing categories: 100%|██████████| 4/4 [00:01<00:00,  2.09it/s]


Building optimized word score cache...
Pre-computing TF-IDF matrices...


Pre-computing matrices:  25%|██▌       | 1/4 [00:00<00:01,  2.19it/s]

Category 1: 5000 features pre-computed


Pre-computing matrices:  50%|█████     | 2/4 [00:00<00:00,  2.06it/s]

Category 2: 5000 features pre-computed


Pre-computing matrices:  75%|███████▌  | 3/4 [00:01<00:00,  1.99it/s]

Category 3: 5000 features pre-computed


Pre-computing matrices: 100%|██████████| 4/4 [00:01<00:00,  2.04it/s]

Category 4: 5000 features pre-computed





Found 83,918 unique words total
Building cache from pre-computed data...


Building word cache: 100%|██████████| 83918/83918 [00:00<00:00, 1041489.43it/s]

TF-IDF scores built for 83,918 words

TF-IDF Statistics:
├── Total words: 83,918
├── GloVe coverage: 54,322
└── Missing words: 29,596





## 6. Feature Vektörizasyonu

### 6.1 Feature Vektör Oluşturma Fonksiyonu

Bu hücre train ve test metinleri için hibrit feature vektörlerini oluşturan ana fonksiyonu tanımlar. FeatureVectorizer'ı kullanarak GloVe + TF-IDF hibrit vektörleri oluşturur.

In [12]:
def create_feature_vectors(config: Config, embeddings_index: Dict, word_score_cache: Dict,
                          train_texts: List[str], train_labels: List[int],
                          test_texts: List[str], test_labels: List[int]):
    """Create feature vectors for training and testing"""
    
    print("Creating feature vectors...")
    
    vectorizer = FeatureVectorizer(embeddings_index, word_score_cache, config)
    
    # TF-IDF KNN debug info
    print("TF-IDF KNN Debug Info:")
    for category in config.categories:
        stats = vectorizer.missing_word_handler.get_category_statistics(category)
        print(f"  Category {category}: {stats['total_words']} words, avg_score={stats['avg_score']:.4f}")
        if stats['top_5']:
            print(f"    Top 3: {[(w, f'{s:.4f}') for w, s in stats['top_5'][:3]]}")
    
    # Create full-dimensional vectors
    X_train_full = vectorizer.texts_to_vectors(train_texts, train_labels)
    X_test_full = vectorizer.texts_to_vectors(test_texts, test_labels)
    
    print(f"Feature vectors created: {X_train_full.shape}")
    
    return vectorizer, X_train_full, X_test_full

# Execute feature vectorization
vectorizer, X_train_full, X_test_full = create_feature_vectors(
    config, embeddings_index, word_score_cache,
    train_df["text"].tolist(), train_df["label"].tolist(),
    test_df["text"].tolist(), test_df["label"].tolist()
)

print(f"\nVector Dimensions:")
print(f"├── Training vectors: {X_train_full.shape}")
print(f"├── Test vectors: {X_test_full.shape}")
print(f"└── Vector type: {'Hybrid' if config.use_hybrid_vectors else 'Standard'} ({config.hybrid_vector_dim}D)")

if config.use_hybrid_vectors:
    print(f"Hybrid vectors: {len(config.categories)}×{config.glove_dim}D category-weighted = {config.hybrid_vector_dim}D total")

Creating feature vectors...
TF-IDF KNN Debug Info:
  Category 1: 13 words, avg_score=0.0136
    Top 3: [('said', '0.0204'), ('iraq', '0.0194'), ('reuters', '0.0184')]
  Category 2: 10 words, avg_score=0.0126
    Top 3: [('new', '0.0150'), ('game', '0.0145'), ('win', '0.0132')]
  Category 3: 18 words, avg_score=0.0140
    Top 3: [('oil', '0.0231'), ('reuters', '0.0214'), ('new', '0.0199')]
  Category 4: 9 words, avg_score=0.0139
    Top 3: [('new', '0.0217'), ('microsoft', '0.0192'), ('software', '0.0142')]
Converting texts to vectors...


Vectorizing texts: 100%|██████████| 120000/120000 [04:31<00:00, 442.68it/s]


Converting texts to vectors...


Vectorizing texts: 100%|██████████| 7600/7600 [00:13<00:00, 553.71it/s]

Feature vectors created: (120000, 1200)

Vector Dimensions:
├── Training vectors: (120000, 1200)
├── Test vectors: (7600, 1200)
└── Vector type: Hybrid (1200D)
Hybrid vectors: 4×300D category-weighted = 1200D total





## 7. Model Eğitimi ve Kaydetme

Bu bölümde modeller sırasıyla eğitilip kaydedilecek:
1. **LogisticRegression** (Full-dimensional vectors) → Eğit ve Kaydet
2. **PCA** (Boyut indirgeme) → Uygula
3. **SVM** (PCA vectors) → Eğit ve Kaydet  
4. **MLP** (PCA vectors) → Eğit ve Kaydet
5. **GradientBoosting** (PCA vectors) → Eğit ve Kaydet

In [13]:
import joblib

def save_model(model, model_name: str, config: Config, results: Dict):
    """Model ve sonuçlarını kaydet"""
    os.makedirs(config.models_dir, exist_ok=True)
    
    # Model dosyasını kaydet
    model_path = os.path.join(config.models_dir, f"{model_name.lower()}_model.pkl")
    joblib.dump(model, model_path)
    
    # Sonuçları kaydet
    results_path = os.path.join(config.models_dir, f"{model_name.lower()}_results.pkl")
    joblib.dump(results, results_path)
    
    print(f"Model saved: {model_path}")
    print(f"Results saved: {results_path}")
    
    return model_path, results_path

# Initialize training variables
results = {}
y_train = train_df["label"].values
y_test = test_df["label"].values

print("Model training starting...")
print("Training Order: LogisticRegression(full-dim) → PCA → SVM/MLP/GradientBoosting(PCA)")
print(f"Strategy: {'Fast (predefined params)' if config.use_predefined_params else 'Search (RandomizedSearchCV)'}")

Model training starting...
Training Order: LogisticRegression(full-dim) → PCA → SVM/MLP/GradientBoosting(PCA)
Strategy: Fast (predefined params)


## 7. Model Eğitimi

### 7.1 LogisticRegression Eğitimi (Full-dimensional vectors)

Bu hücre LogisticRegression modelini tam boyutlu hibrit vektörlerle (1200D) eğitir ve kaydeder. İlk model olarak eğitilir çünkü yüksek boyutlu vektörlerle iyi çalışır.

In [14]:
# LogisticRegression Training (Full-dimensional vectors)
print("\n[1/4] Training LogisticRegression with full-dimensional vectors...")
lr_model = LogisticRegressionModel(config)
start_time = time.time()

lr_results = lr_model.hyperparameter_search(X_train_full, y_train)
y_pred_lr = lr_model.model.predict(X_test_full)
test_acc_lr = accuracy_score(y_test, y_pred_lr)
train_time_lr = time.time() - start_time

results['LogisticRegression'] = {
    'test_accuracy': test_acc_lr,
    'cv_score': lr_results['best_score'],
    'params': lr_results['best_params'],
    'method': lr_results['method'],
    'training_time': train_time_lr,
    'vector_type': 'full-dimensional',
    'predictions': y_pred_lr
}

print(f"Best CV Score: {lr_results['best_score']:.4f}")
print(f"Test Accuracy: {test_acc_lr:.4f}")
print(f"Training Time: {train_time_lr:.2f}s")
print(f"Best Params ({lr_results['method']}): {lr_results['best_params']}")

# Save LogisticRegression model
save_model(lr_model.model, 'LogisticRegression', config, results['LogisticRegression'])
print("✅ LogisticRegression training and saving completed!")


[1/4] Training LogisticRegression with full-dimensional vectors...
Using predefined optimal params for LogisticRegressionModel...
Predefined params applied in 395.13s
Best CV Score: 0.9116
Test Accuracy: 0.9150
Training Time: 395.32s
Best Params (predefined): {'C': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Model saved: models\logisticregression_model.pkl
Results saved: models\logisticregression_results.pkl
✅ LogisticRegression training and saving completed!


### 7.2 PCA Boyut İndirgeme

Bu hücre hibrit vektörleri (1200D) PCA ile boyut indirger (100D). Sonraki modeller bu indirgenmiş vektörleri kullanacak.

In [15]:
# PCA Dimensionality Reduction
print("\n[2/4] Applying PCA dimensionality reduction...")
vectorizer.fit_pca(X_train_full)
X_train_pca = vectorizer.transform_pca(X_train_full)
X_test_pca = vectorizer.transform_pca(X_test_full)

print(f"PCA transformation: {X_train_full.shape[1]}D → {X_train_pca.shape[1]}D")
print(f"Explained variance ratio: {vectorizer.pca.explained_variance_ratio_.sum():.4f}")

# Save PCA transformer
pca_path = os.path.join(config.models_dir, "pca_transformer.pkl")
joblib.dump(vectorizer.pca, pca_path)
print(f"PCA transformer saved: {pca_path}")
print("✅ PCA dimensionality reduction completed!")


[2/4] Applying PCA dimensionality reduction...
PCA fitting: 1200D → 100D
Input: Hybrid vectors (4×300D TF-IDF weighted)
PCA fitted, explained variance ratio: 0.9895
PCA transformation: 1200D → 100D
Explained variance ratio: 0.9895
PCA transformer saved: models\pca_transformer.pkl
✅ PCA dimensionality reduction completed!


### 7.3 SVM Eğitimi (PCA vectors)

Bu hücre Support Vector Machine modelini PCA ile indirgenmiş vektörlerle (100D) eğitir ve kaydeder. RBF kernel ile nonlinear sınıflandırma yapar.

In [16]:
# SVM Training (PCA-reduced vectors)
print("\n[3/5] Training SVM with PCA vectors...")
svm_model = SVMModel(config)
start_time = time.time()

try:
    # Hyperparameter search on PCA vectors
    search_results = svm_model.hyperparameter_search(X_train_pca, y_train)
    
    # Test prediction on PCA vectors
    y_pred = svm_model.model.predict(X_test_pca)
    test_accuracy = accuracy_score(y_test, y_pred)
    
    training_time = time.time() - start_time
    
    results['SVM'] = {
        'test_accuracy': test_accuracy,
        'cv_score': search_results['best_score'],
        'params': search_results['best_params'],
        'method': search_results['method'],
        'training_time': training_time,
        'vector_type': 'PCA-reduced',
        'predictions': y_pred
    }
    
    print(f"Best CV Score: {search_results['best_score']:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Training Time: {training_time:.2f}s")
    print(f"Best Params ({search_results['method']}): {search_results['best_params']}")
    
    # Save SVM model
    save_model(svm_model.model, 'SVM', config, results['SVM'])
    print("✅ SVM training and saving completed!")
    
except Exception as e:
    print(f"❌ Error training SVM: {e}")
    results['SVM'] = {'error': str(e)}


[3/5] Training SVM with PCA vectors...
Using predefined optimal params for SVMModel...
Predefined params applied in 636.12s
Best CV Score: 0.9097
Test Accuracy: 0.9099
Training Time: 650.27s
Best Params (predefined): {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}
Model saved: models\svm_model.pkl
Results saved: models\svm_results.pkl
✅ SVM training and saving completed!


### 7.4 MLP Eğitimi (PCA vectors)

Bu hücre Multi-Layer Perceptron (Neural Network) modelini PCA ile indirgenmiş vektörlerle (100D) eğitir ve kaydeder.

In [17]:
# MLP Training (PCA-reduced vectors)
print("\n[4/5] Training MLP with PCA vectors...")
mlp_model = MLPModel(config)
start_time = time.time()

try:
    # Hyperparameter search on PCA vectors
    search_results = mlp_model.hyperparameter_search(X_train_pca, y_train)
    
    # Test prediction on PCA vectors
    y_pred = mlp_model.model.predict(X_test_pca)
    test_accuracy = accuracy_score(y_test, y_pred)
    
    training_time = time.time() - start_time
    
    results['MLP'] = {
        'test_accuracy': test_accuracy,
        'cv_score': search_results['best_score'],
        'params': search_results['best_params'],
        'method': search_results['method'],
        'training_time': training_time,
        'vector_type': 'PCA-reduced',
        'predictions': y_pred
    }
    
    print(f"Best CV Score: {search_results['best_score']:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Training Time: {training_time:.2f}s")
    print(f"Best Params ({search_results['method']}): {search_results['best_params']}")
    
    # Save MLP model
    save_model(mlp_model.model, 'MLP', config, results['MLP'])
    print("✅ MLP training and saving completed!")
    
except Exception as e:
    print(f"❌ Error training MLP: {e}")
    results['MLP'] = {'error': str(e)}


[4/5] Training MLP with PCA vectors...
Using predefined optimal params for MLPModel...
Predefined params applied in 11625.65s
Best CV Score: 0.9189
Test Accuracy: 0.9272
Training Time: 11626.33s
Best Params (predefined): {'hidden_layer_sizes': (200, 100), 'activation': 'relu', 'alpha': 0.0001, 'learning_rate': 'constant'}
Model saved: models\mlp_model.pkl
Results saved: models\mlp_results.pkl
✅ MLP training and saving completed!


### 7.5 GradientBoosting Eğitimi (PCA vectors)

Bu hücre Gradient Boosting Classifier modelini PCA ile indirgenmiş vektörlerle (100D) eğitir ve kaydeder.

In [18]:
# GradientBoosting Training (PCA-reduced vectors)
print("\n[5/5] Training GradientBoosting with PCA vectors...")
gb_model = GradientBoostingModel(config)
start_time = time.time()

try:
    # Hyperparameter search on PCA vectors
    search_results = gb_model.hyperparameter_search(X_train_pca, y_train)
    
    # Test prediction on PCA vectors
    y_pred = gb_model.model.predict(X_test_pca)
    test_accuracy = accuracy_score(y_test, y_pred)
    
    training_time = time.time() - start_time
    
    results['GradientBoosting'] = {
        'test_accuracy': test_accuracy,
        'cv_score': search_results['best_score'],
        'params': search_results['best_params'],
        'method': search_results['method'],
        'training_time': training_time,
        'vector_type': 'PCA-reduced',
        'predictions': y_pred
    }
    
    print(f"Best CV Score: {search_results['best_score']:.4f}")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Training Time: {training_time:.2f}s")
    print(f"Best Params ({search_results['method']}): {search_results['best_params']}")
    
    # Save GradientBoosting model
    save_model(gb_model.model, 'GradientBoosting', config, results['GradientBoosting'])
    print("✅ GradientBoosting training and saving completed!")
    
except Exception as e:
    print(f"❌ Error training GradientBoosting: {e}")
    results['GradientBoosting'] = {'error': str(e)}

print("\n🎉 All models training completed!")


[5/5] Training GradientBoosting with PCA vectors...
Using predefined optimal params for GradientBoostingModel...
Predefined params applied in 9534.25s
Best CV Score: 0.9130
Test Accuracy: 0.9143
Training Time: 9534.38s
Best Params (predefined): {'n_estimators': 200, 'learning_rate': 0.2, 'max_depth': 5, 'subsample': 0.9}
Model saved: models\gradientboosting_model.pkl
Results saved: models\gradientboosting_results.pkl
✅ GradientBoosting training and saving completed!

🎉 All models training completed!


## 8. Sonuç Analizi ve Karşılaştırma

Bu bölümde tüm modellerin performansları karşılaştırılır ve en iyi model belirlenir.

In [19]:
def analyze_results(results: Dict, y_test: np.ndarray):
    """Analyze and display results"""
    results_data = []
    for model_name, result in results.items():
        if 'error' not in result:
            results_data.append({
                'Model': model_name,
                'Vector Type': result['vector_type'],
                'Method': result['method'],
                'CV Score': f"{result['cv_score']:.4f}",
                'Test Accuracy': f"{result['test_accuracy']:.4f}",
                'Training Time (s)': f"{result['training_time']:.2f}"
            })
    
    df_results = pd.DataFrame(results_data)
    print("MODEL RESULTS SUMMARY")
    print("=" * 50)
    display(df_results)
    
    # Find best model
    best_model_name = max(
        [name for name, result in results.items() if 'error' not in result], 
        key=lambda x: results[x]['test_accuracy']
    )
    best_result = results[best_model_name]
    
    print(f"\nBest Model: {best_model_name}")
    print(f"Test Accuracy: {best_result['test_accuracy']:.4f}")
    print(f"CV Score: {best_result['cv_score']:.4f}")
    
    # Classification report for best model
    y_pred_best = best_result['predictions']
    target_names = ['World', 'Sports', 'Business', 'Sci/Tech']
    report = classification_report(y_test, y_pred_best, target_names=target_names)
    print(f"\nClassification Report - {best_model_name}:")
    print(report)
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred_best)
    cm_df = pd.DataFrame(cm, index=target_names, columns=target_names)
    print(f"\nConfusion Matrix - {best_model_name}:")
    display(cm_df)
    
    return df_results, best_model_name, best_result

# Execute analysis
results_df, best_model, best_result = analyze_results(results, y_test)

print(f"\nPipeline completed successfully!")
print(f"Best performing model: {best_model} ({best_result['test_accuracy']:.4f} accuracy)")
print(f"Training completed at: {time.strftime('%Y-%m-%d %H:%M:%S')}")

MODEL RESULTS SUMMARY


Unnamed: 0,Model,Vector Type,Method,CV Score,Test Accuracy,Training Time (s)
0,LogisticRegression,full-dimensional,predefined,0.9116,0.915,395.32
1,SVM,PCA-reduced,predefined,0.9097,0.9099,650.27
2,MLP,PCA-reduced,predefined,0.9189,0.9272,11626.33
3,GradientBoosting,PCA-reduced,predefined,0.913,0.9143,9534.38



Best Model: MLP
Test Accuracy: 0.9272
CV Score: 0.9189

Classification Report - MLP:
              precision    recall  f1-score   support

       World       0.92      0.94      0.93      1900
      Sports       0.94      0.95      0.95      1900
    Business       0.94      0.90      0.92      1900
    Sci/Tech       0.90      0.92      0.91      1900

    accuracy                           0.93      7600
   macro avg       0.93      0.93      0.93      7600
weighted avg       0.93      0.93      0.93      7600


Confusion Matrix - MLP:


Unnamed: 0,World,Sports,Business,Sci/Tech
World,1780,41,26,53
Sports,38,1810,17,35
Business,60,29,1711,100
Sci/Tech,49,38,67,1746



Pipeline completed successfully!
Best performing model: MLP (0.9272 accuracy)
Training completed at: 2025-07-08 09:45:16


## 9. Essential Components Kaydetme

Bu bölümde gelecekte prediction yapmak için gerekli tüm bileşenler kaydedilir:
- **FeatureVectorizer**: Metin → hibrit vektör dönüşümü
- **GloVe embeddings**: Word embeddings 
- **TF-IDF scores**: Kategori-bazlı ağırlık hesaplamaları
- **Training config**: Aynı parametrelerle çalışma
(sonradan eklendi vectorizerları unutmuşum)

In [24]:
print("\nSaving essential components for future predictions...")

# 1. FeatureVectorizer'ı kaydet (en önemli!)
vectorizer_path = os.path.join(config.models_dir, "feature_vectorizer.pkl")
joblib.dump(vectorizer, vectorizer_path)
print(f"FeatureVectorizer saved: {vectorizer_path}")

# 4. Config'i kaydet
config_path = os.path.join(config.models_dir, "training_config.pkl")
joblib.dump(config, config_path)
print(f"Training config saved: {config_path}")

print("\nAll essential components saved!")
print("\nSaved files in models directory:")
print("├── feature_vectorizer.pkl      (Text → Vector conversion)")
print("├── pca_transformer.pkl         (PCA for dimensionality reduction)")
print("├── training_config.pkl         (Training configuration)")
print("└── [model_name]_model.pkl      (Trained models)")

print(f"\nTo make predictions on new text:")
print("1. Load feature_vectorizer.pkl")
print("2. Load the best model from results")
print("3. Use vectorizer.text_to_vector() → PCA → model.predict()")


Saving essential components for future predictions...
FeatureVectorizer saved: models\feature_vectorizer.pkl
Training config saved: models\training_config.pkl

All essential components saved!

Saved files in models directory:
├── feature_vectorizer.pkl      (Text → Vector conversion)
├── pca_transformer.pkl         (PCA for dimensionality reduction)
├── training_config.pkl         (Training configuration)
└── [model_name]_model.pkl      (Trained models)

To make predictions on new text:
1. Load feature_vectorizer.pkl
2. Load the best model from results
3. Use vectorizer.text_to_vector() → PCA → model.predict()
