Titanic Survival Prediction
- Learn from past passengers' features and outcomes (Survived), then predict (supervised learning)

In [16]:
# Path setup
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

ROOT = Path.cwd() # mini_projects/
DATA_ROOT = ROOT / "data"
ART_ROOT = ROOT / "artifacts"

DS_DIR = DATA_ROOT / "titanic"
ART_DIR = ART_ROOT / "01_text"
for d in [DS_DIR, ART_DIR]:
  d.mkdir(parents=True, exist_ok=True)

print("[Path setup] DS_DIR :", DS_DIR.resolve())
print("[Path setup] ART_DIR:", ART_DIR.resolve())

[Path setup] DS_DIR : C:\Users\dkjjk\ai-ml\mini_projects\data\titanic
[Path setup] ART_DIR: C:\Users\dkjjk\ai-ml\mini_projects\artifacts\01_text


In [17]:
# Check data files
TRAIN_PATH = DS_DIR / "train.csv"
TEST_PATH = DS_DIR / "test.csv"

# Download if either file is missing (the assert below requires both)
if not (TRAIN_PATH.exists() and TEST_PATH.exists()):
  print("[Missing files] Attempting to download train/test.csv...")

  # Module imports
  import zipfile, sys
  try:
    import kaggle
  except ModuleNotFoundError:
    print("[Missing module] Installing the kaggle package...")

    import subprocess
    # sys.executable: path of the Python interpreter that is currently running
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "kaggle"])

  # Kaggle API authentication
  from kaggle.api.kaggle_api_extended import KaggleApi
  api = KaggleApi()
  try:
    api.authenticate()
  except Exception as e:
    raise SystemExit(
      "[Kaggle API auth failed] Check that the Kaggle token is in place: "
      r"$env:USERPROFILE\.kaggle\kaggle.json"
    ) from e

  # Download the .zip
  print("[Download started] Downloading the Titanic dataset...")
  api.competition_download_files("titanic", path=str(DS_DIR), quiet=False)

  # Extract the archive
  zpath = DS_DIR / "titanic.zip"
  if not zpath.exists():
    raise SystemExit("[Download failed] titanic.zip not found (check that you joined the competition and accepted its rules)")
  with zipfile.ZipFile(zpath, "r") as zf:
    zf.extractall(DS_DIR)
  try:
    zpath.unlink() # delete the archive
  except FileNotFoundError:
    pass # not a fatal error, so move on

assert TRAIN_PATH.exists() and TEST_PATH.exists(), "[Missing files] Failed to download train/test.csv"


In [18]:
# Load data
import pandas as pd

train = pd.read_csv(TRAIN_PATH) # na_values=[...]: option to add extra strings to treat as missing (NaN)
test = pd.read_csv(TEST_PATH)

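EDA
- basic: dataset size / target class distribution / column domains / missing values / actual values
- Lightweight derived features and binning for EDA, done on train.copy()
- Univariate vs. pairwise (scored with a formula) -> 5-8 candidates -> adopt via ablation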

In [19]:
# EDA basic
print("Shape:", train.shape, '\n')
print("Target Dist:", train["Survived"].value_counts(normalize=True).to_string(), '\n')
print("[Columns]")
for c in train.columns:
  domain = train[c].str[0].unique() if c == "Cabin" else train[c].unique()
  cnt = len(domain)
  if cnt > 15:
    print(f"{str(c):12s}| {str(train[c].dtypes):8s}| (count) {cnt}")
  else:
    print(f"{str(c):12s}| {str(train[c].dtypes):8s}| {domain.tolist()}")
print("\nMissing Values:", train.isna().mean().sort_values(ascending=False), sep='\n')
train.head()

Shape: (891, 12) 

Target Dist: Survived
0    0.616162
1    0.383838 

[Columns]
PassengerId | int64   | (count) 891
Survived    | int64   | [0, 1]
Pclass      | int64   | [3, 1, 2]
Name        | object  | (count) 891
Sex         | object  | ['male', 'female']
Age         | float64 | (count) 89
SibSp       | int64   | [1, 0, 3, 4, 2, 5, 8]
Parch       | int64   | [0, 1, 2, 5, 3, 4, 6]
Ticket      | object  | (count) 681
Fare        | float64 | (count) 248
Cabin       | object  | [nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T']
Embarked    | object  | ['S', 'C', 'Q', nan]

Missing Values:
Cabin          0.771044
Age            0.198653
Embarked       0.002245
PassengerId    0.000000
Name           0.000000
Pclass         0.000000
Survived       0.000000
Sex            0.000000
Parch          0.000000
SibSp          0.000000
Fare           0.000000
Ticket         0.000000
dtype: float64


   PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171           7.25  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599         71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282    7.925  NaN    S
3            4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803               53.1  C123   S
4            5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450               8.05  NaN    S


In [20]:
# Lightweight derived features and binning for EDA
import re
import numpy as np
import pandas as pd

df = train.copy()

# Extract Title (the honorific between the comma and the first period in Name)
def extract_title(name):
  m = re.search(r",\s*([^\.]+)\.", str(name))
  t = m.group(1).strip() if m else "None"
  return t if t in {"Mr", "Mrs", "Miss", "Master"} else "Rare"

df["Title"] = df["Name"].apply(extract_title)

# FamilySize / IsAlone
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = np.where(df["FamilySize"] == 1, "Alone", "NotAlone")

# binning -> Age_bin / Fare_bin / FamilySize_bin
df["FamilySize_bin"] = pd.cut(
  df["FamilySize"],
  bins=[0, 1, 3, 5, 20], labels=["1", "2-3", "4-5", "6+"],
  right=True
)

age_bins = [0, 9, 19, 29, 39, 49, 59, 120]
age_labels = ["0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60+"]
df["Age_bin"] = pd.cut(
  df["Age"], bins=age_bins, labels=age_labels,
  include_lowest=True, right=True
).astype("category").cat.add_categories(["Unknown"]).fillna("Unknown") # .cat: category accessor; missing ages become "Unknown"

# cut (fixed edges) vs qcut (quantile edges)
df["Fare_bin"] = pd.qcut(
  df["Fare"].fillna(df["Fare"].median()), # temporarily fill missing values with the median
  q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'], duplicates="drop"
)

print(df[['Title', 'FamilySize', 'IsAlone', 'Age_bin', 'Fare_bin']].head())

  Title  FamilySize   IsAlone Age_bin Fare_bin
0    Mr           2  NotAlone   20-29       Q1
1   Mrs           2  NotAlone   30-39       Q4
2  Miss           1     Alone   20-29       Q2
3   Mrs           2  NotAlone   30-39       Q4
4    Mr           1     Alone   30-39       Q2
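
The `cut` vs `qcut` comment in the cell above can be illustrated with a minimal sketch. The fare values are made up for illustration (they are not read from train.csv), and the names `fare`, `by_width`, `by_quantile` and the labels `B1`-`B4` are hypothetical. On a skewed column, equal-width bins pile most rows into the first bin, while quantile bins keep the counts roughly even, which is presumably why Fare is binned with `qcut` here.

```python
import pandas as pd

# Toy fares with one extreme value (illustrative, not from train.csv)
fare = pd.Series([5.0, 7.25, 8.05, 10.5, 13.0, 26.0, 53.1, 512.0])

# cut: four equal-width ranges between min and max -> uneven counts on skewed data
by_width = pd.cut(fare, bins=4, labels=["B1", "B2", "B3", "B4"])

# qcut: edges placed at quartiles -> roughly equal counts per bin
by_quantile = pd.qcut(fare, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print("cut  :", by_width.value_counts().sort_index().tolist())
print("qcut :", by_quantile.value_counts().sort_index().tolist())
```
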


In [21]:
# Univariate (uni) vs. pairwise combination (cross)

# Terms: level / group / cell

MIN_COUNT = 15 # recommended minimum sample size per cell

# uni score
def group_var_score(df, key, target="Survived", min_count=1):
  base = df[target].mean()
  g = df.groupby(key)[target].agg(['mean', 'size'])

  wvar = float(((g['mean']-base)**2 * g['size']).sum() / g['size'].sum())

  coverage = float((g['size']>=min_count).mean())

  levels = int(g.shape[0])

  return {"wvar": wvar, "coverage": coverage, "levels": levels}

# cross score
def pair_var_score(df, a, b, target="Survived", min_count=MIN_COUNT):
  base = df[target].mean()
  g = df.groupby([a, b])[target].agg(['mean', 'size'])

  wvar = float(((g['mean']-base)**2 * g['size']).sum() / g['size'].sum())

  coverage = float((g['size']>=min_count).mean())

  non_empty = int(g.shape[0])

  score = wvar * coverage * np.log1p(non_empty)
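  # wvar = how well the groups separate the target (discriminative power)
  # coverage = whether that difference rests on enough samples (stability)
  # non_empty = whether the signal shows up across many combinations (breadth) -> log-scaled so over-segmentation is not rewarded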

  return {"score": score, "wvar": wvar, "coverage": coverage, "non_empty": non_empty}


# uni rank
CANDS = ["Title", "Sex", "Pclass", "Age_bin", "FamilySize_bin", "IsAlone", "Fare_bin", "Embarked"]
CANDS = [c for c in CANDS if c in df.columns]

rows = []
for c in CANDS:
  stat = group_var_score(df, c, min_count=MIN_COUNT)
  rows.append({"feature": c, **stat})

uni = pd.DataFrame(rows).sort_values("wvar", ascending=False).reset_index(drop=True)
print("Univariate Rank (wvar desc)")
display(uni)

# cross rank (+ excess)
# Skip pairs whose two features are semantically redundant
SKIP_PAIRS = {tuple(sorted(p)) for p in [
  ("Title", "Sex"),
  ("IsAlone", "FamilySize_bin")
]}

S = dict(zip(uni["feature"], uni["wvar"]))
pair_rows = []
for i in range(len(CANDS)):
  for j in range(i+1, len(CANDS)):
    a, b = CANDS[i], CANDS[j]
    if tuple(sorted((a,b))) in SKIP_PAIRS:
      continue
    stat = pair_var_score(df, a, b)
    wvar = stat["wvar"]
    excess = wvar - 0.5*(S.get(a,0) + S.get(b,0)) # 0 is the fallback when the key is missing
    pair_rows.append({"a":a, "b":b, "pair":f"{a}__x__{b}", **stat, "excess": excess})

pairs = (pd.DataFrame(pair_rows)
         .sort_values(["coverage", "excess", "wvar"], ascending=[False, False, False])
         .reset_index(drop=True))
print("Pair Rank (coverage, excess, wvar desc)")
display(pairs.head(20))

Univariate Rank (wvar desc)


          feature      wvar  coverage  levels
0           Title  0.075202       1.0       5
1             Sex  0.069824       1.0       2
2          Pclass  0.027311       1.0       3
3        Fare_bin  0.021281       1.0       4
4  FamilySize_bin  0.017534       1.0       4
5         IsAlone  0.009781       1.0       2
6        Embarked  0.007039       1.0       3
7         Age_bin  0.006556       1.0       8


Pair Rank (coverage, excess, wvar desc)


         a               b                    pair     score      wvar  coverage  non_empty    excess
0      Sex          Pclass          Sex__x__Pclass  0.181131  0.093083       1.0          6  0.044516
1      Sex        Embarked        Sex__x__Embarked  0.143143  0.073561       1.0          6  0.035129
2      Sex        Fare_bin        Sex__x__Fare_bin  0.171442  0.078027       1.0          8  0.032474
3      Sex         IsAlone         Sex__x__IsAlone  0.115945  0.072041       1.0          4  0.032238
4   Pclass         IsAlone      Pclass__x__IsAlone  0.067895  0.034891       1.0          6  0.016345
5  IsAlone        Embarked    IsAlone__x__Embarked   0.03317  0.017046       1.0          6  0.008636
6      Sex         Age_bin         Sex__x__Age_bin  0.209608  0.078915    0.9375         16  0.040724
7    Title         IsAlone       Title__x__IsAlone  0.155988  0.076213  0.888889          9  0.033721
8      Sex  FamilySize_bin  Sex__x__FamilySize_bin  0.153496  0.079839     0.875          8   0.03616
9  IsAlone        Fare_bin    IsAlone__x__Fare_bin  0.043143   0.02244     0.875          8  0.006909
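
For reference, the quantities computed by `group_var_score` and `pair_var_score` can be written out as follows. This is a restatement of the code in cell 21, not an additional method; the symbols are notation introduced here: $g$ indexes the groups (cells) returned by `groupby`, $n_g$ is the group size, $\bar{y}_g$ the group survival rate, $\bar{y}$ the overall survival rate, and $G$ the number of groups.

```latex
\begin{aligned}
\mathrm{wvar} &= \frac{\sum_{g} n_g\,(\bar{y}_g - \bar{y})^2}{\sum_{g} n_g} \\
\mathrm{coverage} &= \frac{\#\{\, g : n_g \ge \texttt{min\_count} \,\}}{G} \\
\mathrm{score} &= \mathrm{wvar} \cdot \mathrm{coverage} \cdot \ln(1 + G) \\
\mathrm{excess}_{a \times b} &= \mathrm{wvar}_{a \times b} - \tfrac{1}{2}\left(\mathrm{wvar}_a + \mathrm{wvar}_b\right)
\end{aligned}
```

As a check against the tables, Sex x Pclass has wvar = 0.093083, coverage = 1.0 and G = 6, so score = 0.093083 x ln(7) ≈ 0.1811, and excess = 0.093083 - 0.5 x (0.069824 + 0.027311) ≈ 0.0445, both matching the first row of the pair ranking.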


In [22]:
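# Automatic uni/cross selection

COVERAGE_OK = 0.70 # minimum share of cells with enough samples

TOP_SINGLES = 6 # number of top single features to keep
TOP_PAIRS = 8 # number of pairs to inspect

singles_eda = list(uni["feature"].head(TOP_SINGLES))
# NOTE: TOP_SINGLES (6) is passed to nlargest here, which is why 6 pairs are printed below and TOP_PAIRS is unused; TOP_PAIRS was probably intended
pairs_eda = pairs.query("coverage >= @COVERAGE_OK").nlargest(TOP_SINGLES, "excess")["pair"].tolist()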

print("EDA-Selected Singles")
print(singles_eda)
print("EDA-selected Pairs")
print(pairs_eda)

EDA-Selected Singles
['Title', 'Sex', 'Pclass', 'Fare_bin', 'FamilySize_bin', 'IsAlone']
EDA-selected Pairs
['Title__x__Pclass', 'Sex__x__Pclass', 'Sex__x__Age_bin', 'Sex__x__FamilySize_bin', 'Sex__x__Embarked', 'Title__x__IsAlone']


In [23]:
# Pivot check for the selected pairs (mean + count)
def show_pivot(df, a, b, target="Survived"):
  pv_mean = df.pivot_table(index=a, columns=b, values=target, aggfunc="mean")
  pv_count = df.pivot_table(index=a, columns=b, values=target, aggfunc="size")
  print(f"[{a} x {b}] Survival rate")
  display(pv_mean.round(3))
  print("& Count")
  display(pv_count)

for p in pairs_eda:
  a, b = p.split("__x__")
  if a in df.columns and b in df.columns:
    show_pivot(df, a, b)

[Title x Pclass] Survival rate


Pclass      1      2      3
Title
Master    1.0    1.0  0.393
Miss    0.957  0.941    0.5
Mr      0.346  0.088  0.113
Mrs     0.976  0.902    0.5
Rare    0.611  0.111    NaN


& Count


Pclass      1     2      3
Title
Master    3.0   9.0   28.0
Miss     46.0  34.0  102.0
Mr      107.0  91.0  319.0
Mrs      42.0  41.0   42.0
Rare     18.0   9.0    NaN


[Sex x Pclass] Survival rate


Pclass      1      2      3
Sex
female  0.968  0.921    0.5
male    0.369  0.157  0.135


& Count


Pclass    1    2    3
Sex
female   94   76  144
male    122  108  347
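
As a sanity check, the Sex x Pclass score can be recomputed from the two pivot tables alone. The sketch below copies the displayed (rounded) survival rates and counts into a hypothetical `cells` dictionary, so the result should agree with the pair ranking (wvar 0.093083, score 0.181131) only up to rounding.

```python
import math

MIN_COUNT = 15  # same threshold as in the notebook

# (survival rate, count) per Sex x Pclass cell, copied from the pivot output
cells = {
    ("female", 1): (0.968, 94),
    ("female", 2): (0.921, 76),
    ("female", 3): (0.500, 144),
    ("male", 1): (0.369, 122),
    ("male", 2): (0.157, 108),
    ("male", 3): (0.135, 347),
}

n_total = sum(n for _, n in cells.values())
# Overall survival rate, reconstructed as the count-weighted mean of cell rates
base = sum(rate * n for rate, n in cells.values()) / n_total

# Size-weighted variance of cell rates around the overall rate
wvar = sum(n * (rate - base) ** 2 for rate, n in cells.values()) / n_total
# Share of cells that hold at least MIN_COUNT passengers
coverage = sum(n >= MIN_COUNT for _, n in cells.values()) / len(cells)
score = wvar * coverage * math.log1p(len(cells))

print(f"base={base:.3f} wvar={wvar:.4f} coverage={coverage:.2f} score={score:.4f}")
```
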


[Sex x Age_bin] Survival rate


Age_bin    0-9  10-19  20-29  30-39  40-49  50-59    60+  Unknown
Sex
female   0.633  0.756  0.722  0.833  0.688  0.889    1.0    0.679
male     0.594  0.123  0.169  0.215  0.211  0.133  0.136    0.129


& Count


Age_bin  0-9  10-19  20-29  30-39  40-49  50-59  60+  Unknown
Sex
female    30     45     72     60     32     18    4       53
male      32     57    148    107     57     30   22      124


[Sex x FamilySize_bin] Survival rate


FamilySize_bin      1    2-3    4-5     6+
Sex
female          0.786  0.801  0.613  0.286
male            0.156  0.307  0.385  0.038


& Count


FamilySize_bin    1  2-3  4-5  6+
Sex
female          126  136   31  21
male            411  127   13  26


[Sex x Embarked] Survival rate


Embarked      C      Q      S
Sex
female    0.877   0.75   0.69
male      0.305  0.073  0.175


& Count


Embarked   C   Q    S
Sex
female    73  36  203
male      95  41  441


[Title x IsAlone] Survival rate


IsAlone  Alone  NotAlone
Title
Master     NaN     0.575
Miss      0.75     0.634
Mr       0.154     0.167
Mrs        0.9     0.771
Rare      0.45     0.429


& Count


IsAlone  Alone  NotAlone
Title
Master     NaN      40.0
Miss     100.0      82.0
Mr       397.0     120.0
Mrs       20.0     105.0
Rare      20.0       7.0


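Title x Pclass
- The "Rare" x "3" cell is empty (NaN)
- Title already encodes Sex and Age, so this pair overlaps with Sex x Pclass -> keep only one of them in the ablation (to avoid counting the same signal twice)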

Sex x Age_bin
- An Unknown bin exists (missing ages)
- The age effect differs sharply by sex -> A-tier candidate
- Coverage is weak in the tail bin (60+; only 4 women) -> lower min_count or re-bin
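
One way to re-bin the tail, sketched under the assumption that merging "50-59" and "60+" into a single "50+" bin is acceptable. By the count table above, the merged cells would hold 18 + 4 = 22 women and 30 + 22 = 52 men, both above MIN_COUNT = 15. The names `age_bins_v2`, `age_labels_v2` and `age_bin_v2` are hypothetical, and the ages are toy values rather than rows from train.csv.

```python
import pandas as pd

# Hypothetical coarser edges: everything above 49 falls into one "50+" bin
age_bins_v2 = [0, 9, 19, 29, 39, 49, 120]
age_labels_v2 = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]

# Toy ages, including one missing value
ages = pd.Series([4, 22, 38, 54, 71, None])

age_bin_v2 = (
    pd.cut(ages, bins=age_bins_v2, labels=age_labels_v2,
           include_lowest=True, right=True)
    .cat.add_categories(["Unknown"])  # keep missing ages as their own level
    .fillna("Unknown")
)

print(age_bin_v2.tolist())
```
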

Sex x FamilySize_bin
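- Women: survival drops as the family grows (0.801 at 2-3 -> 0.286 at 6+) / Men: those traveling alone fare far worse (0.156) than those in families of 2-5 (0.307-0.385)
- Meaningful, but cell sizes are highly unbalanced (large-family cells are sparse) -> B-tier candidate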

Sex x Embarked
- Differences are minor -> weak signal

Title x IsAlone
- "Master" x "Alone": NaN, i.e. no such passengers at all
- Differences are minor -> weak signal

Ablation
- Pipeline: preprocessing + modeling (baseline held fixed) + CV
- Here, preprocessing = derived/binned/cross feature creation + missing-value imputation + encoding + scaling
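
A minimal sketch of what such an ablation loop could look like with scikit-learn. Everything here is an assumption for illustration: the notes do not name the baseline model (logistic regression is used as a stand-in), the `toy` frame is synthetic rather than the Titanic data, and `cv_accuracy` and `feature_sets` are hypothetical names. The idea is to hold the model and CV split fixed and vary only the feature set, so score differences can be attributed to the added feature.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training frame (NOT the real Titanic data)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Pclass": rng.choice(["1", "2", "3"], size=n),
    "Fare": rng.gamma(2.0, 15.0, size=n),
})
toy.loc[rng.random(n) < 0.1, "Fare"] = np.nan  # inject some missing values
p_survive = np.where(toy["Sex"] == "female", 0.75, 0.20)
toy["Survived"] = (rng.random(n) < p_survive).astype(int)
toy["Sex__x__Pclass"] = toy["Sex"] + "_" + toy["Pclass"]  # cross feature

def cv_accuracy(df, num_cols, cat_cols, target="Survived"):
    """Mean CV accuracy of a fixed baseline on the given feature set."""
    pre = ColumnTransformer([
        # numeric: impute with the median, then scale
        ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                          ("sc", StandardScaler())]), num_cols),
        # categorical: impute with the mode, then one-hot encode
        ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                          ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ])
    model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    X = df[num_cols + cat_cols]
    return cross_val_score(model, X, df[target], cv=cv, scoring="accuracy").mean()

# Each entry: (numeric columns, categorical columns)
feature_sets = {
    "singles": (["Fare"], ["Sex", "Pclass"]),
    "singles + Sex__x__Pclass": (["Fare"], ["Sex", "Pclass", "Sex__x__Pclass"]),
}
results = {name: cv_accuracy(toy, num, cat) for name, (num, cat) in feature_sets.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Because the imputer, encoder and scaler sit inside the pipeline, they are refit on each training fold, which keeps validation-fold statistics from leaking into preprocessing.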