# Classifying Smoking Status with Artificial Neural Networks

## 1. Project Description

### Purpose of the Project
This project classifies individuals as smokers or non-smokers based on their health data.

### Problem Addressed
Smoking is one of the leading causes of many health problems. This project aims to predict, from an individual's physical and biochemical measurements, whether they smoke. Such predictions can support risk assessment in health screenings.

### Classification Task
- **Class 0**: Non-smoker
- **Class 1**: Smoker

### Rationale for the Chosen Methods
- **MLP (Multi-Layer Perceptron)**: Can learn non-linear relationships and capture complex patterns among health measurements.
- **Adam Optimizer**: Adapts the learning rate per parameter, which typically speeds up convergence.
- **ReLU Activation**: Mitigates the vanishing gradient problem and works well in deep networks.
- **Early Stopping**: Reduces overfitting by halting training once the validation score stops improving, which tends to improve generalization.
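
The ReLU point in the list above can be illustrated with a small numeric sketch. It compares the product of activation derivatives across a stack of layers for sigmoid and ReLU. The depth and the pre-activation value are hypothetical, and weight matrices are deliberately ignored, so this shows only the tendency, not the behaviour of the model trained in this notebook.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU; 1 for positive inputs, 0 otherwise."""
    return (np.asarray(z) > 0).astype(float)

depth = 10           # hypothetical number of stacked layers
z = np.array(0.5)    # hypothetical pre-activation value at every layer

# Backpropagation multiplies one activation derivative per layer
sigmoid_chain = float(sigmoid_grad(z) ** depth)
relu_chain = float(relu_grad(z) ** depth)

print(f"Sigmoid gradient factor after {depth} layers: {sigmoid_chain:.2e}")
print(f"ReLU gradient factor after {depth} layers:    {relu_chain:.2e}")
```

For active (positive) units the ReLU factor stays at 1, while the sigmoid factor shrinks geometrically with depth.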

## 2. Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
# Silences all warnings, including scikit-learn's ConvergenceWarning
warnings.filterwarnings('ignore')

print("Libraries loaded!")

## 3. Dataset

### 3.1 Dataset Source
- **Source**: Kaggle
- **Name**: Smoking and Drinking Dataset with Body Signal
- **Type**: Health data

### 3.2 Dataset Description

In [None]:
# Load the dataset
df = pd.read_csv('data/smoking.csv')

print("Dataset Summary")
print("="*40)
print(f"Total number of samples: {len(df)}")
print(f"Total number of columns: {len(df.columns)}")
print("\nFirst 5 rows:")
df.head()

In [None]:
# Dataset info
print("Data Types:")
print("-"*40)
print(df.dtypes)

In [None]:
# Check for missing values
print("Missing Value Check:")
print("-"*40)
missing = df.isnull().sum()
print(f"Total missing values: {missing.sum()}")

In [None]:
# List the columns (at this point the list still includes ID and the target)
print("Columns:")
print("-"*40)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")
print("\nTarget variable: smoking (0=Non-smoker, 1=Smoker)")

In [None]:
# Class distribution
print("Class Distribution:")
print("-"*40)
class_counts = df['smoking'].value_counts().sort_index()
class_names = {0: 'Non-smoker', 1: 'Smoker'}
for idx, count in class_counts.items():
    pct = count / len(df) * 100
    print(f"Class {idx} - {class_names[idx]}: {count} samples ({pct:.1f}%)")

# Plot
plt.figure(figsize=(8, 5))
colors = ['#27ae60', '#e74c3c']
bars = plt.bar(['Non-smoker (0)', 'Smoker (1)'], class_counts.values, color=colors)
plt.title('Class Distribution', fontsize=14)
plt.xlabel('Class')
plt.ylabel('Number of Samples')
# Annotate each bar with its count, slightly above the bar top
for bar, count in zip(bars, class_counts.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 500,
             f'{count:,}', ha='center', fontsize=12)
plt.tight_layout()
plt.show()
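
When the classes are not evenly balanced, accuracy is easier to interpret against a majority-class baseline. The sketch below shows one way to compute that baseline with scikit-learn's `DummyClassifier`. The label counts are hypothetical and chosen only for illustration; they are not taken from this dataset's output.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 63 non-smokers and 37 smokers (illustrative only)
y_demo = np.array([0] * 63 + [1] * 37)
X_demo = np.zeros((len(y_demo), 1))  # this baseline ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

A classifier is only informative to the extent that it beats this number.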

In [None]:
# Prepare the data
# Drop the ID column
df = df.drop('ID', axis=1)

# Encode the categorical variables as integers
le = LabelEncoder()
categorical_cols = ['gender', 'oral', 'tartar']
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

# Separate the features (X) from the target (y)
X = df.drop('smoking', axis=1).values
y = df['smoking'].values

# Standardize the features to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset, so the held-out samples
# in the scenarios below also contribute to the scaling statistics.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"X shape: {X_scaled.shape}")
print(f"y shape: {y.shape}")
print(f"\nNumber of features: {X_scaled.shape[1]}")
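
Because the scaler above is fitted on all samples before any splitting, one possible alternative is to put the scaler and the classifier into a scikit-learn pipeline so that scaling is re-fitted inside each training fold. The sketch below is not the notebook's method. It uses synthetic data from `make_classification` as a stand-in for the smoking data, and a deliberately small network to keep it fast; both are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the smoking data (hypothetical, 25 features)
X_demo, y_demo = make_classification(n_samples=400, n_features=25, random_state=0)

# The scaler is fitted on the training part of each fold only
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0),
)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(f"Mean accuracy over 5 folds: {pipe_scores.mean():.3f}")
```

With a large dataset the difference from scaling up front is usually small, but the pipeline keeps the evaluation free of information from the held-out folds.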

## 4. Model Parameters

In [None]:
# Model parameters
HIDDEN_LAYERS = (128, 64)  # 2 hidden layers
ACTIVATION = 'relu'
SOLVER = 'adam'
MAX_ITER = 500

def create_model(random_state=42):
    """Return an untrained MLPClassifier configured with the settings above."""
    return MLPClassifier(
        hidden_layer_sizes=HIDDEN_LAYERS,
        activation=ACTIVATION,
        solver=SOLVER,
        max_iter=MAX_ITER,
        random_state=random_state,
        early_stopping=True,       # hold out part of the training data for validation
        validation_fraction=0.1    # 10% of the training data
    )

print("Model Parameters")
print("="*40)
print(f"Network topology: {X.shape[1]} -> {HIDDEN_LAYERS[0]} -> {HIDDEN_LAYERS[1]} -> 2")
print(f"Activation: {ACTIVATION}")
print(f"Optimizer: {SOLVER}")
print(f"Max iterations: {MAX_ITER}")
print("Early stopping: Yes")
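
To see what these settings produce in practice, the sketch below fits the same configuration on synthetic data (a hypothetical stand-in with 25 features) and inspects the fitted attributes. It also shows a detail worth knowing when reading the topology string: for a two-class problem scikit-learn builds a single logistic output unit rather than two output units.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the smoking data (hypothetical)
X_demo, y_demo = make_classification(n_samples=500, n_features=25, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(128, 64), activation="relu", solver="adam",
                    max_iter=500, early_stopping=True, validation_fraction=0.1,
                    random_state=42)
clf.fit(X_demo, y_demo)

# One weight matrix per pair of consecutive layers
weight_shapes = [w.shape for w in clf.coefs_]
print("Weight matrix shapes:", weight_shapes)
print("Output activation:", clf.out_activation_)

# Early-stopping diagnostics
print("Epochs run:", clf.n_iter_)
print("Best validation accuracy:", clf.best_validation_score_)
```

`n_iter_` well below `max_iter` indicates that early stopping ended training before the iteration limit.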

In [None]:
# Helper that plots a confusion matrix
def plot_confusion_matrix(y_true, y_pred, title):
    """Plot the confusion matrix as a heatmap and return it as an array."""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Non-smoker', 'Smoker'],
                yticklabels=['Non-smoker', 'Smoker'])
    plt.title(f'Confusion Matrix - {title}', fontsize=14)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.tight_layout()
    plt.show()
    return cm
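
The matrix returned by this helper has actual classes as rows and predicted classes as columns. As a worked example of how the headline metrics follow from its four cells, the sketch below uses ten hypothetical labels (not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration: 0 = non-smoker, 1 = smoker
y_true_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 1])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()  # row-major order: [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)   # of predicted smokers, how many really smoke
recall = tp / (tp + fn)      # of actual smokers, how many were found

print(cm_demo)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```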

## 5. Experiments

### Scenario 1: Using the Training Data as the Test Set

In [None]:
print("SCENARIO 1: Using the Training Data as the Test Set")
print("="*50)

# Fit and evaluate on the same samples; this measures fit, not generalization
model1 = create_model()
model1.fit(X_scaled, y)
y_pred1 = model1.predict(X_scaled)
acc1 = accuracy_score(y, y_pred1)

print(f"\nNetwork topology: {X.shape[1]} -> 128 -> 64 -> 2")
print(f"Accuracy: {acc1*100:.2f}%")
print("\nClassification Report:")
print(classification_report(y, y_pred1, target_names=['Non-smoker', 'Smoker']))

In [None]:
cm1 = plot_confusion_matrix(y, y_pred1, "Scenario 1 (Train=Test)")

### Scenario 2: 5-Fold Cross Validation

In [None]:
print("SCENARIO 2: 5-Fold Cross Validation")
print("="*50)

# cross_val_score and cross_val_predict each run their own 5-fold loop
model2 = create_model()
scores_5fold = cross_val_score(model2, X_scaled, y, cv=5)
y_pred2 = cross_val_predict(model2, X_scaled, y, cv=5)

print(f"\nNetwork topology: {X.shape[1]} -> 128 -> 64 -> 2")
print("\nResult per fold:")
for i, score in enumerate(scores_5fold, 1):
    print(f"  Fold {i}: {score*100:.2f}%")
print(f"\nMean accuracy: {scores_5fold.mean()*100:.2f}% (+/- {scores_5fold.std()*100:.2f}%)")
print("\nClassification Report:")
print(classification_report(y, y_pred2, target_names=['Non-smoker', 'Smoker']))

In [None]:
cm2 = plot_confusion_matrix(y, y_pred2, "Scenario 2 (5-Fold CV)")
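
Calling `cross_val_score` and then `cross_val_predict` trains the model twice per fold. If training time matters, one option is a single manual loop that collects both the per-fold scores and the out-of-fold predictions. The sketch below is an assumption-laden illustration: it uses synthetic data and a small network (both hypothetical), and `StratifiedKFold` without shuffling, which is what scikit-learn uses for classifiers when `cv` is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the smoking data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=25, random_state=0)
base_model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=42)

cv = StratifiedKFold(n_splits=5)
fold_scores = []
oof_pred = np.empty_like(y_demo)  # out-of-fold prediction for every sample

for train_idx, test_idx in cv.split(X_demo, y_demo):
    model = clone(base_model)  # fresh, unfitted copy for each fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    oof_pred[test_idx] = model.predict(X_demo[test_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], oof_pred[test_idx]))

fold_scores = np.array(fold_scores)
print(f"Mean accuracy: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```

`fold_scores` plays the role of the `cross_val_score` output, and `oof_pred` can be passed to `classification_report` or a confusion matrix in place of the `cross_val_predict` output.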

### Scenario 3: 10-Fold Cross Validation

In [None]:
print("SCENARIO 3: 10-Fold Cross Validation")
print("="*50)

model3 = create_model()
scores_10fold = cross_val_score(model3, X_scaled, y, cv=10)
y_pred3 = cross_val_predict(model3, X_scaled, y, cv=10)

print(f"\nNetwork topology: {X.shape[1]} -> 128 -> 64 -> 2")
print("\nResult per fold:")
for i, score in enumerate(scores_10fold, 1):
    print(f"  Fold {i:2d}: {score*100:.2f}%")
print(f"\nMean accuracy: {scores_10fold.mean()*100:.2f}% (+/- {scores_10fold.std()*100:.2f}%)")
print("\nClassification Report:")
print(classification_report(y, y_pred3, target_names=['Non-smoker', 'Smoker']))

In [None]:
cm3 = plot_confusion_matrix(y, y_pred3, "Scenario 3 (10-Fold CV)")

### Scenario 4: 75%/25% Train-Test Split (5 Different Random Splits)

In [None]:
print("SCENARIO 4: 75%/25% Train-Test Split (5 Different Seeds)")
print("="*50)

seeds = [42, 123, 456, 789, 999]
results_s4 = []

print(f"\nNetwork topology: {X.shape[1]} -> 128 -> 64 -> 2\n")

for i, seed in enumerate(seeds, 1):
    # Stratify so both parts keep the original class proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.25, random_state=seed, stratify=y
    )

    model4 = create_model(random_state=seed)
    model4.fit(X_train, y_train)
    y_pred4 = model4.predict(X_test)
    acc = accuracy_score(y_test, y_pred4)
    results_s4.append((acc, y_test, y_pred4, seed))

    print(f"Random split {i} (seed={seed}): {acc*100:.2f}%")

avg_acc = np.mean([r[0] for r in results_s4])
std_acc = np.std([r[0] for r in results_s4])
print(f"\nMean accuracy: {avg_acc*100:.2f}% (+/- {std_acc*100:.2f}%)")
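
A note on the spread reported above: `np.std` defaults to the population standard deviation (`ddof=0`). With only five runs, the sample standard deviation (`ddof=1`) is noticeably larger. The sketch below shows the difference on five hypothetical accuracy values, which are illustrative and not results from this notebook.

```python
import numpy as np

# Hypothetical accuracies from five random splits (illustrative only)
accs = np.array([0.75, 0.76, 0.74, 0.77, 0.73])

mean_acc = accs.mean()
std_population = accs.std()        # ddof=0, what np.std uses by default
std_sample = accs.std(ddof=1)      # divides by n - 1 instead of n

print(f"mean={mean_acc:.4f}")
print(f"std (ddof=0)={std_population:.4f}")
print(f"std (ddof=1)={std_sample:.4f}")
```

Either convention is defensible as long as the report states which one is used.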

In [None]:
# En iyi sonucun konfuzyon matrisi
best_idx = np.argmax([r[0] for r in results_s4])
best_result = results_s4[best_idx]
print(f"En iyi sonuc: seed={best_result[3]}, Dogruluk={best_result[0]*100:.2f}%")
print(f"\nSiniflandirma Raporu:")
print(classification_report(best_result[1], best_result[2], target_names=['Icmiyor', 'Iciyor']))

cm4 = plot_confusion_matrix(best_result[1], best_result[2], f"Senaryo 4 (seed={best_result[3]})")

## 6. Comparison of Results

In [None]:
# Summary table
print("COMPARISON OF ALL SCENARIOS")
print("="*60)
print(f"{'Scenario':<40} {'Accuracy':>15}")
print("-"*60)
print(f"{'1. Train = Test':<40} {acc1*100:>13.2f}%")
print(f"{'2. 5-Fold Cross Validation':<40} {scores_5fold.mean()*100:>13.2f}%")
print(f"{'3. 10-Fold Cross Validation':<40} {scores_10fold.mean()*100:>13.2f}%")
print(f"{'4. 75%/25% split (mean of 5 seeds)':<40} {avg_acc*100:>13.2f}%")
print("-"*60)

In [None]:
# Gorsel karsilastirma
scenarios = ['Egitim=Test', '5-Fold CV', '10-Fold CV', '%75-25']
accuracies = [acc1*100, scores_5fold.mean()*100, scores_10fold.mean()*100, avg_acc*100]

plt.figure(figsize=(10, 6))
colors = ['#3498db', '#2ecc71', '#9b59b6', '#e74c3c']
bars = plt.bar(scenarios, accuracies, color=colors)
plt.title('Senaryolara Gore Model Basarisi', fontsize=14)
plt.xlabel('Senaryo')
plt.ylabel('Dogruluk (%)')
plt.ylim(70, 90)

for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3, 
             f'{acc:.2f}%', ha='center', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 7. Network Architecture Visualization

In [None]:
def draw_neural_network():
    fig, ax = plt.subplots(figsize=(12, 8))

    # Layer sizes
    layers = [X.shape[1], 128, 64, 2]
    layer_names = [f'Input\n({layers[0]})', f'Hidden 1\n({layers[1]})',
                   f'Hidden 2\n({layers[2]})', f'Output\n({layers[3]})']
    colors = ['#3498db', '#2ecc71', '#2ecc71', '#e74c3c']

    x_positions = [0.15, 0.4, 0.65, 0.9]
    max_display = [8, 6, 6, 2]  # maximum number of nodes drawn per layer

    node_positions = []

    for layer_size, max_d, x, color, name in zip(layers, max_display, x_positions, colors, layer_names):
        positions = []
        n_display = min(layer_size, max_d)
        y_start = 0.5 + (n_display - 1) * 0.06

        for j in range(n_display):
            node_y = y_start - j * 0.12
            # Use facecolor/edgecolor: passing color= would override ec='black'
            circle = plt.Circle((x, node_y), 0.025, facecolor=color,
                                edgecolor='black', lw=2, zorder=10)
            ax.add_patch(circle)
            positions.append((x, node_y))

        # Mark layers that have more nodes than are drawn
        if layer_size > max_d:
            ax.text(x, y_start - max_d * 0.12, '...', fontsize=16, ha='center', va='center')

        ax.text(x, 0.08, name, fontsize=10, ha='center', va='center', fontweight='bold')
        node_positions.append(positions)

    # Draw the connections between consecutive layers
    for i in range(len(node_positions) - 1):
        for pos1 in node_positions[i]:
            for pos2 in node_positions[i + 1]:
                ax.plot([pos1[0], pos2[0]], [pos1[1], pos2[1]], 'gray', alpha=0.2, lw=0.5)

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title(f'Artificial Neural Network Architecture\n{layers[0]} -> {layers[1]} -> {layers[2]} -> {layers[3]}',
                 fontsize=14, fontweight='bold')
    ax.text(0.5, 0.95, 'Activation: ReLU | Optimizer: Adam', fontsize=10, ha='center', style='italic')

    plt.tight_layout()
    plt.show()

draw_neural_network()
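
The size of this architecture can be made concrete by counting its trainable parameters. Each fully connected layer contributes one weight per input-output pair plus one bias per output. The helper below is a hypothetical illustration, not part of the notebook. It counts both the two-output layout drawn above and the single-output layout that scikit-learn builds for a binary target, assuming 25 input features.

```python
def count_parameters(layer_sizes):
    """Total weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Layout as drawn: two output nodes
params_two_outputs = count_parameters([25, 128, 64, 2])

# Layout scikit-learn uses for a binary target: one logistic output unit
params_single_output = count_parameters([25, 128, 64, 1])

print(f"25 -> 128 -> 64 -> 2: {params_two_outputs:,} parameters")
print(f"25 -> 128 -> 64 -> 1: {params_single_output:,} parameters")
```

Most of the parameters sit between the two hidden layers (128 x 64 weights plus 64 biases).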

## 8. Conclusion

### Model Parameters
- **Network topology**: 25 -> 128 -> 64 -> 2
- **Activation**: ReLU
- **Optimizer**: Adam
- **Early stopping**: Yes

### Evaluation
- Scenario 1 (Train=Test) yields the highest accuracy, but because the model is scored on the same samples it was trained on, this figure is optimistic and cannot reveal overfitting.
- The cross-validation results provide a more realistic estimate of performance.
- 10-Fold CV gives more stable results than 5-Fold CV.
- The 75%/25% split produces consistent results across the different seeds.