# 02 - Baseline Model

In this notebook we build a simple baseline model.

## Goals:
- Split the data into train and test sets
- Train a simple Logistic Regression model
- Evaluate model performance
- Establish baseline metrics

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    average_precision_score
)
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully")

## 1. Load Data

In [None]:
# Load dataset
df = pd.read_csv('../data/creditcard.csv')

print(f"Dataset shape: {df.shape}")
print(f"Fraud cases: {df['Class'].sum()} ({df['Class'].mean()*100:.4f}%)")

## 2. Prepare Data

In [None]:
# Separate features and target
X = df.drop('Class', axis=1)
y = df['Class']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y
)

print(f"Train set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTrain fraud rate: {y_train.mean()*100:.4f}%")
print(f"Test fraud rate: {y_test.mean()*100:.4f}%")
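The `stratify=y` argument matters here because fraud cases are rare: without it, the train and test fraud rates can drift apart by chance. A minimal sketch on toy labels (the 50-in-10,000 ratio and the names `X_toy` and `y_toy` are hypothetical, not taken from the credit card data) compares the two behaviours:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical, not the credit card data): 50 positives in 10,000 rows.
y_toy = np.zeros(10_000, dtype=int)
y_toy[:50] = 1
X_toy = np.arange(10_000).reshape(-1, 1)  # placeholder feature column

# Same call shape as the split above, with and without stratification.
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(f"Stratified   train/test positive rate: {y_tr_s.mean():.4%} / {y_te_s.mean():.4%}")
print(f"Unstratified train/test positive rate: {y_tr_r.mean():.4%} / {y_te_r.mean():.4%}")
```

The stratified rates should match almost exactly, while the unstratified ones depend on the random draw.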

In [None]:
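# Scale Time and Amount only; the remaining columns are left as they are.
# The scaler is fit on the training split and then reused on the test split,
# so no test-set statistics leak into training.
scaler = StandardScaler()

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[['Time', 'Amount']] = scaler.fit_transform(X_train[['Time', 'Amount']])
X_test_scaled[['Time', 'Amount']] = scaler.transform(X_test[['Time', 'Amount']])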

print("✅ Features scaled")
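The copy-and-assign scaling above could also be written with scikit-learn's `ColumnTransformer`, which keeps the "fit on train, transform on test" rule in one object. This is a sketch on a synthetic toy frame, not a replacement for the cell above; the column `Other` is a hypothetical stand-in for the features that are not scaled.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic toy frame; 'Other' is a hypothetical unscaled feature.
toy = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, 200),
    "Other": rng.normal(size=200),
    "Amount": rng.exponential(90, 200),
})
toy_train, toy_test = toy.iloc[:160], toy.iloc[160:]

# Scale only Time and Amount; pass every other column through untouched.
preprocessor = ColumnTransformer(
    [("scale", StandardScaler(), ["Time", "Amount"])],
    remainder="passthrough",
)

# Statistics come from the training rows only.
train_arr = preprocessor.fit_transform(toy_train)
test_arr = preprocessor.transform(toy_test)

# Note: the output puts the scaled columns first, then the passthrough ones,
# so the column order here is [Time, Amount, Other].
print(train_arr[:, :2].mean(axis=0), train_arr[:, :2].std(axis=0))
```

Because the output is a NumPy array with reordered columns, feature names have to be tracked separately if the coefficients are inspected later.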

## 3. Train Baseline Model

In [None]:
# Initialize model
baseline_model = LogisticRegression(
    random_state=42,
    max_iter=1000,
    solver='liblinear'
)

# Train model
print("Training baseline model...")
baseline_model.fit(X_train_scaled, y_train)
print("✅ Model trained")

## 4. Make Predictions

In [None]:
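# Hard class predictions (default 0.5 probability threshold)
y_pred = baseline_model.predict(X_test_scaled)
# Predicted probability of the positive (fraud) class, used for ROC and PR curves
y_pred_proba = baseline_model.predict_proba(X_test_scaled)[:, 1]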

print(f"Predictions made: {len(y_pred)}")
print(f"Predicted frauds: {y_pred.sum()}")
print(f"Actual frauds: {y_test.sum()}")

## 5. Evaluate Model

In [None]:
# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal', 'Fraud']))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=['Predicted Normal', 'Predicted Fraud'],
    y=['Actual Normal', 'Actual Fraud'],
    text=cm,
    texttemplate='%{text}',
    colorscale='Blues'
))

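fig.update_layout(
    title='Confusion Matrix',
    xaxis_title='Predicted',
    yaxis_title='Actual',
    # Heatmaps draw the first row at the bottom by default; reverse the y-axis
    # so 'Actual Normal' sits on top, matching the usual confusion-matrix layout.
    yaxis_autorange='reversed',
    height=500
)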
fig.show()

print(f"\nTrue Negatives: {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives: {cm[1,1]}")
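The four cells of the confusion matrix are enough to recompute the fraud-class metrics from the classification report by hand. The sketch below uses made-up counts chosen for easy arithmetic (they are not results from this notebook) to show why accuracy alone is misleading when one class dominates.

```python
import numpy as np

# Hypothetical counts for illustration only: rows = actual, columns = predicted.
cm_example = np.array([
    [900, 10],   # actual normal: TN, FP
    [30, 60],    # actual fraud:  FN, TP
])
tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / cm_example.sum()
precision = tp / (tp + fp)   # of the flagged transactions, how many are fraud
recall = tp / (tp + fn)      # of the fraud transactions, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

In this toy case accuracy is 0.96 even though a third of the fraud cases are missed, which is why the fraud-class recall and precision deserve more attention than overall accuracy.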

In [None]:
# ROC-AUC Score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

# Average Precision Score
avg_precision = average_precision_score(y_test, y_pred_proba)
print(f"Average Precision Score: {avg_precision:.4f}")
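ROC-AUC and Average Precision answer different questions, and the gap between them tends to widen as the positive class gets rarer. The following sketch uses synthetic scores (the 0.5% positive rate and the score distributions are assumptions chosen for illustration) to compare an informative scorer with a random one on both metrics.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 19_900, 100  # hypothetical 0.5% positive rate
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Informative scorer: positives score higher on average than negatives.
informative = np.concatenate([rng.normal(0, 1, n_neg), rng.normal(2, 1, n_pos)])
# Uninformative scorer: scores are unrelated to the label.
random_scores = rng.uniform(size=n_neg + n_pos)

auc_inf = roc_auc_score(y_true, informative)
ap_inf = average_precision_score(y_true, informative)
auc_rand = roc_auc_score(y_true, random_scores)
ap_rand = average_precision_score(y_true, random_scores)

print(f"Informative: ROC-AUC={auc_inf:.3f}  AP={ap_inf:.3f}")
print(f"Random:      ROC-AUC={auc_rand:.3f}  AP={ap_rand:.3f}")
print(f"Positive rate (AP reference level for random scores): {y_true.mean():.3%}")
```

A random scorer lands near 0.5 on ROC-AUC but near the positive rate on Average Precision, so an AP value should be read against the fraud rate rather than against 0.5.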

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=fpr, y=tpr,
    mode='lines',
    name=f'ROC Curve (AUC = {roc_auc:.4f})',
    line=dict(color='blue', width=2)
))

fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines',
    name='Random Classifier',
    line=dict(color='red', width=2, dash='dash')
))

fig.update_layout(
    title='ROC Curve',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    height=500
)
fig.show()

In [None]:
# Precision-Recall Curve
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_pred_proba)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=recall, y=precision,
    mode='lines',
    name=f'PR Curve (AP = {avg_precision:.4f})',
    line=dict(color='green', width=2)
))

fig.update_layout(
    title='Precision-Recall Curve',
    xaxis_title='Recall',
    yaxis_title='Precision',
    height=500
)
fig.show()
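The arrays returned by `precision_recall_curve` can also be used to look at decision thresholds other than the default 0.5 used by `predict`. This is a sketch on synthetic scores, and picking the F1-maximising threshold is only one possible criterion (an assumption here, not something this notebook prescribes). Note that `precision` and `recall` have one more element than the thresholds array.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

rng = np.random.default_rng(1)

# Synthetic labels and scores for illustration only.
y_true = np.concatenate([np.zeros(2000, dtype=int), np.ones(40, dtype=int)])
scores = np.concatenate([rng.beta(1, 8, 2000), rng.beta(5, 2, 40)])

prec, rec, thr = precision_recall_curve(y_true, scores)

# The final (precision=1, recall=0) point has no threshold, so drop it.
p, r = prec[:-1], rec[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)

best = int(np.argmax(f1))
best_threshold = thr[best]

# Cross-check: thresholding the scores directly reproduces the same F1.
f1_direct = f1_score(y_true, scores >= best_threshold)

print(f"Best threshold by F1: {best_threshold:.3f}")
print(f"Precision={p[best]:.3f}  Recall={r[best]:.3f}  F1={f1[best]:.3f}")
```

A threshold chosen this way should be selected on validation data rather than on the test set, otherwise the reported test metrics become optimistic.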

## 6. Feature Importance

In [None]:
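# Use the logistic regression coefficients as a rough importance signal.
# Magnitudes are only comparable across features that share a similar scale.
# Sorting by absolute value surfaces strong negative and positive effects alike.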
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': baseline_model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print("Top 10 Most Important Features:")
print(feature_importance.head(10))

# Visualize
fig = go.Figure()

fig.add_trace(go.Bar(
    x=feature_importance['feature'][:15],
    y=feature_importance['coefficient'][:15],
    marker_color=['red' if x < 0 else 'green' for x in feature_importance['coefficient'][:15]]
))

fig.update_layout(
    title='Top 15 Feature Coefficients',
    xaxis_title='Features',
    yaxis_title='Coefficient',
    height=500
)
fig.show()

## 7. Summary

### Baseline Model Performance:
- **Model**: Logistic Regression
- **ROC-AUC**: see the metrics cell in section 5
- **Average Precision**: see the metrics cell in section 5

### Key Observations:
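1. The model appears to perform reasonably well despite the class imbalance; confirm this against the fraud-class recall and Average Precision, since ROC-AUC alone can look optimistic on imbalanced data
2. A few features carry large coefficients (see section 6), which suggests strong predictive signal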
3. There's room for improvement with:
   - Feature engineering
   - Handling class imbalance (SMOTE)
   - Trying different models
   - Hyperparameter tuning
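
Before reaching for SMOTE, one lighter option for the class imbalance is scikit-learn's built-in `class_weight='balanced'`, which reweights the loss instead of resampling. To be clear, this is a different technique from SMOTE, shown here only as a sketch on synthetic data from `make_classification`; the 3% positive rate and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (roughly 3% positives), not the credit card data.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=10, weights=[0.97], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# 'balanced' weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_tr)
print(f"Class weights (0, 1): {weights}")

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"Minority recall, unweighted: {recall_plain:.3f}")
print(f"Minority recall, balanced:   {recall_weighted:.3f}")
```

Reweighting usually raises minority-class recall at the cost of more false positives, so precision should be checked alongside it.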

### Next Steps:
1. Feature engineering (03_feature_engineering.ipynb)
2. Apply SMOTE for class balancing
3. Try more complex models
4. Optimize hyperparameters

In [None]:
print("✅ Baseline model completed successfully!")
print(f"\nBaseline ROC-AUC: {roc_auc:.4f}")
print("\nNext: Run 03_feature_engineering.ipynb")