# Titanic Modeling 

This notebook builds an end-to-end machine learning pipeline on the Titanic dataset.
Contents:
- Encoding categorical features as numeric
- Model training
- Performance evaluation
- Creating the Kaggle submission file

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
sns.set(style="whitegrid")

# 3. Modeling – Logistic Regression


In [2]:
df = pd.read_csv("/kaggle/input/d/bzeydanli/titanic/titanic_fe.csv")
print("Data shape:", df.shape)

# Separate the target from the features
y = df['Survived']
X = df.drop(columns=['Survived'])

# Numeric / categorical columns
num_cols = X.select_dtypes(include=['int64','float64']).columns
cat_cols = X.select_dtypes(include=['object','category']).columns

print("Numeric:", list(num_cols))
print("Categorical:", list(cat_cols))

# One-hot encoding
# Note: Name and Ticket are encoded as well, which is why the feature count grows to 1600
X = pd.get_dummies(X, columns=cat_cols, drop_first=True)
print("After encoding:", X.shape)

# Clean column names
import re
def clean_col(col):
    """Replace every run of characters other than letters, digits, or underscore with '_'."""
    col = re.sub(r'[^A-Za-z0-9_]+', '_', col)
    return col

X.columns = [clean_col(c) for c in X.columns]

# Train/Test split (70/30, stratified on the target)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=5, stratify=y
)
print("Train:", X_train.shape, " Test:", X_test.shape)

# Scaling (StandardScaler): fit on the training split only, then apply to the test split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Model 1: Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Prediction and performance measurement
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, classification_report

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]  # probability of the positive class (Survived = 1)

print("\nLogistic Regression Results")
print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob).round(3))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Rank features by the absolute value of their coefficient
coef = pd.DataFrame({
    "Feature": X_train.columns,
    "Coefficient": model.coef_[0]
}).sort_values(by="Coefficient", key=abs, ascending=False)

coef.head(15)

Data shape: (891, 18)
Numeric: ['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'IsAlone', 'Fare_log']
Categorical: ['Name', 'Sex', 'Ticket', 'Embarked', 'AgeGroup', 'FareBin', 'Title', 'Pclass_Sex']
After encoding: (891, 1600)
Train: (623, 1600)  Test: (268, 1600)

Logistic Regression Results

Accuracy: 0.8432835820895522
AUC: 0.886

Confusion Matrix:
 [[149  16]
 [ 26  77]]

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.90      0.88       165
           1       0.83      0.75      0.79       103

    accuracy                           0.84       268
   macro avg       0.84      0.83      0.83       268
weighted avg       0.84      0.84      0.84       268
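The figures in the classification report follow directly from the confusion matrix. As a small worked check, the sketch below recomputes them from the four counts printed above (rows are actual classes, columns are predicted classes); the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted.
cm = np.array([[149, 16],
               [26, 77]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all 268 passengers classified correctly
precision_survived = tp / (tp + fp)    # of those predicted to survive, how many did
recall_survived = tp / (tp + fn)       # of the actual survivors, how many were found
recall_died = tn / (tn + fp)           # of the actual non-survivors, how many were found

print(f"accuracy           = {accuracy:.3f}")
print(f"precision (class 1) = {precision_survived:.3f}")
print(f"recall (class 1)    = {recall_survived:.3f}")
print(f"recall (class 0)    = {recall_died:.3f}")
```

Rounded to two decimals, these match the 0.84 accuracy, 0.83 precision, and 0.75 / 0.90 recall values in the report.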



|      | Feature | Coefficient |
|------|---------|-------------|
| 1591 | Title_Mr | -1.54644 |
| 936 | Ticket_113781 | -1.451041 |
| 899 | Sex_male | -1.175465 |
| 979 | Ticket_1601 | 1.169549 |
| 1 | Pclass | -1.03476 |
| 1593 | Title_Officer | -0.944461 |
| 25 | Name_Allison_Miss_Helen_Loraine | -0.857878 |
| 1597 | Pclass_Sex_2_male | -0.82326 |
| 1043 | Ticket_244252 | -0.778652 |
| 1228 | Ticket_347077 | 0.77419 |
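Several of the largest coefficients above belong to dummies for a single ticket or a single passenger name, which suggests that one-hot encoding identifier-like columns lets the model memorize individual rows. One possible safeguard, sketched here under assumptions, is to measure each categorical column's cardinality and drop near-unique columns before calling `pd.get_dummies`. The helper `high_cardinality_columns`, the 0.5 threshold, and the toy frame are all hypothetical and are not part of the notebook.

```python
import pandas as pd

def high_cardinality_columns(frame, threshold=0.5):
    """Return non-numeric columns whose unique-value ratio exceeds `threshold` (hypothetical helper)."""
    cat = frame.select_dtypes(exclude=["number", "bool"]).columns
    ratios = frame[cat].nunique() / len(frame)
    return ratios[ratios > threshold].index.tolist()

# Toy stand-in for the feature frame (assumption: the real X has the same kind of columns).
toy = pd.DataFrame({
    "Name": ["Allison, Miss. Helen", "Braund, Mr. Owen", "Cumings, Mrs. John",
             "Heikkinen, Miss. Laina", "Allen, Mr. William"],
    "Ticket": ["113781", "A/5 21171", "PC 17599", "STON/O2. 3101282", "373450"],
    "Sex": ["female", "male", "female", "female", "male"],
    "Pclass": [1, 3, 1, 3, 3],
})

to_drop = high_cardinality_columns(toy)            # near-unique columns
encoded = pd.get_dummies(toy.drop(columns=to_drop), drop_first=True)

print("Dropped:", to_drop)
print("Remaining features:", list(encoded.columns))
```

Whether dropping these columns improves the held-out score on the real data would need to be measured; the sketch only shows the mechanics.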


# Model Evaluation Summary

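- Model: Logistic Regression (baseline)
- Train/Test Split: 70/30 stratified
- Accuracy: 0.843
- AUC: 0.886
- Precision (Survived): 0.83
- Recall (Survived): 0.75
- Observation: The model recognizes deaths better than survivals (recall 0.90 vs. 0.75); 26 of the 103 survivors in the test split are misclassified as non-survivors.
- Logistic Regression gave the highest and most stable performance, which is plausible because the Titanic dataset is small and its main effects are close to linear. Only this model and a single 70/30 split are shown in this section.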
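A single 70/30 split gives one accuracy figure, so the stability mentioned in the summary could be checked with stratified k-fold cross-validation. The sketch below shows one way to do that; it uses synthetic data from `make_classification` as a stand-in (an assumption, since the notebook's encoded `X` and `y` are not reproduced here) and wraps the scaler and model in a pipeline so that scaling is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with the notebook's encoded X and y.
X_demo, y_demo = make_classification(n_samples=891, n_features=10, random_state=5)

# Scaling lives inside the pipeline, so each fold fits the scaler on its own training part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean: %.3f  Std: %.3f" % (scores.mean(), scores.std()))
```

A small standard deviation across folds would support the stability claim; a large one would suggest the single-split score depends heavily on which rows landed in the test set.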


# Submission
In this section, the predictions for the test set are written to submission.csv in the format Kaggle expects.
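Before uploading, it can help to verify that the frame really has the expected shape: exactly the two columns `PassengerId` and `Survived`, no duplicates or missing values, and only 0/1 labels. The helper `check_submission` below is a hypothetical sketch of such a check, run here on a tiny hand-made frame rather than on the real predictions.

```python
import pandas as pd

def check_submission(sub):
    """Return a list of format problems; an empty list means the frame looks valid."""
    problems = []
    if list(sub.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId, Survived")
        return problems  # the remaining checks need both columns
    if sub["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if sub.isna().any().any():
        problems.append("missing values")
    if not sub["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Tiny illustrative frame (not the notebook's real predictions).
demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 0, 1]})
print(check_submission(demo))
```

The same function could be called on the `submission` frame just before `to_csv`.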


In [22]:
def apply_fe(df):
    df = df.copy()

    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    
    # Title
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    title_map = {
        'Mlle':'Miss','Ms':'Miss','Mme':'Mrs','Lady':'Royal','Countess':'Royal',
        'Capt':'Officer','Col':'Officer','Major':'Officer','Dr':'Officer','Rev':'Officer',
        'Sir':'Royal','Don':'Royal','Dona':'Royal','Jonkheer':'Royal'
    }
    df['Title'] = df['Title'].replace(title_map)

    # Child + Pclass_Sex
    df['IsChild'] = (df['Age'] < 12).astype(int)
    df['Pclass_Sex'] = df['Pclass'].astype(str) + "_" + df['Sex'].astype(str)

    # Fare log
    df['Fare_log'] = np.log1p(df['Fare'])

    return df



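test_df = pd.read_csv("/kaggle/input/titanic/test.csv")
# Impute missing values with medians taken from the training frame (df), not from the test set
test_df['Age'] = test_df['Age'].fillna(df['Age'].median())
test_df['Fare'] = test_df['Fare'].fillna(df['Fare'].median())
test_df = test_df.drop(columns=['Cabin'])
test_fe = apply_fe(test_df)

cat_cols = test_fe.select_dtypes(include=['object','category']).columns

test_encoded = pd.get_dummies(test_fe, columns=cat_cols, drop_first=True)

# Align to the training columns. Caveats: clean_col is not applied to the test
# column names, and apply_fe does not create AgeGroup or FareBin, so any training
# dummy without an exact name match here is filled with 0. IsChild is dropped
# because it is not among the training columns.
test_encoded = test_encoded.reindex(columns=X_train.columns, fill_value=0)

# Reuse the scaler that was fitted on the training split
test_encoded[num_cols] = scaler.transform(test_encoded[num_cols])

test_pred = model.predict(test_encoded)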

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": test_pred
})

submission.to_csv("submission.csv", index=False)
submission.head()




|   | PassengerId | Survived |
|---|-------------|----------|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
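The submission cell encodes the test set separately and then aligns it to the training columns with `reindex`. An alternative, sketched here under assumptions, is to fit the encoder and scaler once on the training data and reuse the fitted object on the test data, so both frames always produce the same columns. The sketch uses scikit-learn's `ColumnTransformer` and `OneHotEncoder(handle_unknown="ignore")` on two toy frames; it is not the approach used in the notebook, and unlike `drop_first=True` it keeps every category as its own column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames (assumption): stand-ins for the real train and test features.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 8.46],
})
test = pd.DataFrame({
    "Sex": ["female", "male"],
    "Embarked": ["S", "X"],   # "X" never appears in the training data
    "Fare": [7.83, 7.00],
})

pre = ColumnTransformer([
    # Categories unseen during fit are encoded as all zeros instead of raising an error.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    ("num", StandardScaler(), ["Fare"]),
])

train_enc = pre.fit_transform(train)   # learns categories and scaling from train only
test_enc = pre.transform(test)         # applies the same columns and scaling to test

print(train_enc.shape, test_enc.shape)
```

Both outputs have the same number of columns (two for `Sex`, three for `Embarked`, one for `Fare`), and the unseen `Embarked` value contributes zeros rather than a new column.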
