Titanic Survival Prediction
- Train and predict using past passengers' features and outcomes (Survived); a supervised learning task

In [2]:
# Path setup
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

ROOT = Path.cwd() # assumes the notebook is run from mini_projects/
DATA_ROOT = ROOT / "data"
ART_ROOT = ROOT / "artifacts"

DS_DIR = DATA_ROOT / "titanic"
ART_DIR = ART_ROOT / "01_text"
for d in [DS_DIR, ART_DIR]:
  d.mkdir(parents=True, exist_ok=True)

print("[Path setup] DS_DIR :", DS_DIR.resolve())
print("[Path setup] ART_DIR:", ART_DIR.resolve())

[Path setup] DS_DIR : C:\Users\dkjjk\ai-ml\mini_projects\data\titanic
[Path setup] ART_DIR: C:\Users\dkjjk\ai-ml\mini_projects\artifacts\01_text
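
`Path.cwd()` returns whichever directory the kernel was started in, so the `mini_projects/` assumption only holds when the notebook is launched from that folder. Below is a minimal sketch of one way to make this less fragile, assuming the project folder is named `mini_projects`. The helper name `find_project_root` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def find_project_root(start: Path, name: str = "mini_projects") -> Path:
  """Return the nearest directory called `name` at or above `start`.

  Falls back to `start` itself when no such ancestor exists.
  """
  start = start.resolve()
  for candidate in [start, *start.parents]:
    if candidate.name == name:
      return candidate
  return start

ROOT = find_project_root(Path.cwd())
print("[Path setup] ROOT:", ROOT)
```

With this in place, running the notebook from a subfolder such as `mini_projects/notebooks` would still resolve `ROOT` to `mini_projects`.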


In [4]:
# Check for the data files
TRAIN_PATH = DS_DIR / "train.csv"
TEST_PATH = DS_DIR / "test.csv"

# Download when either file is missing
if not (TRAIN_PATH.exists() and TEST_PATH.exists()):
  print("[Missing files] Attempting to download train/test.csv...")

  # Module imports
  import importlib, importlib.util, subprocess, sys, zipfile
  if importlib.util.find_spec("kaggle") is None:
    print("[Missing module] Installing the kaggle package...")
    # sys.executable: path of the Python interpreter running this notebook
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "kaggle"])
    importlib.invalidate_caches() # make the freshly installed package importable

  # Kaggle API authentication
  # (importing the kaggle package may itself fail when the token is missing, so it stays inside the try)
  try:
    from kaggle.api.kaggle_api_extended import KaggleApi
    api = KaggleApi()
    api.authenticate()
  except Exception as e:
    raise SystemExit(
      "[Kaggle API authentication failed] Check that the Kaggle token is in place: "
      r"$env:USERPROFILE\.kaggle\kaggle.json"
    ) from e

  # Download the .zip archive
  print("[Download] Downloading the Titanic dataset...")
  api.competition_download_files("titanic", path=str(DS_DIR), quiet=False)

  # Extract the archive
  zpath = DS_DIR / "titanic.zip"
  if not zpath.exists():
    raise SystemExit("[Download failed] titanic.zip not found (check that you have joined the competition and accepted its rules)")
  with zipfile.ZipFile(zpath, "r") as zf:
    zf.extractall(DS_DIR)
  try:
    zpath.unlink() # delete the archive
  except FileNotFoundError:
    pass # not fatal, so carry on

assert TRAIN_PATH.exists() and TEST_PATH.exists(), "[Missing files] failed to download train/test.csv"
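
The extraction step above unpacks every member of the archive. Here is a small sketch of extracting only the files the notebook actually needs, using the standard-library `zipfile` module. The helper name `extract_members` is hypothetical, and the demo builds a throwaway archive with made-up contents rather than touching the real `titanic.zip`.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_members(zpath, dest, wanted):
  """Extract only the members named in `wanted`; return the names that were missing."""
  with zipfile.ZipFile(zpath, "r") as zf:
    available = set(zf.namelist())
    for name in wanted:
      if name in available:
        zf.extract(name, dest)
  return [name for name in wanted if name not in available]

# Demo on a temporary archive (contents invented for illustration)
with tempfile.TemporaryDirectory() as tmp:
  demo_zip = Path(tmp) / "demo.zip"
  with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("train.csv", "PassengerId,Survived\n1,0\n")
    zf.writestr("extra.csv", "unused\n")
  missing = extract_members(demo_zip, Path(tmp) / "out", ["train.csv", "test.csv"])
  extracted = sorted(p.name for p in (Path(tmp) / "out").iterdir())
  print("extracted:", extracted)
  print("missing  :", missing)
```

Returning the missing names lets the caller report exactly which file is absent instead of relying on a single combined assertion.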


In [5]:
# Load the data
import pandas as pd

train = pd.read_csv(TRAIN_PATH) # na_values=[]: optional list of extra strings to treat as missing (NaN)
test = pd.read_csv(TEST_PATH)
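
To illustrate the `na_values` option mentioned in the comment above, here is a minimal sketch on an in-memory CSV. The sample rows and the `"?"` and `"unknown"` placeholders are invented for illustration and do not come from the Titanic files.

```python
import io
import pandas as pd

# Made-up sample in which "?" and "unknown" stand in for missing values
raw = "Age,Embarked\n22,S\n?,C\n38,unknown\n"

default = pd.read_csv(io.StringIO(raw))
custom = pd.read_csv(io.StringIO(raw), na_values=["?", "unknown"])

# Missing-value counts per column, without and with the extra markers
print(default.isna().sum().to_string())
print(custom.isna().sum().to_string())
print("Age dtype with na_values:", custom["Age"].dtype)
```

Without `na_values`, the placeholder strings are kept as ordinary text and the `Age` column is not parsed as numeric; with it, they become NaN and `Age` is read as a float column.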

In [None]:
# EDA: dataset size, target class distribution, columns and their value domains, missing values, sample rows
print("Shape:", train.shape, '\n')
print("Target Dist:", train["Survived"].value_counts(normalize=True).to_string(), '\n')
print("[Columns]")
for c in train.columns:
  domain = train[c].str[0].unique() if c == "Cabin" else train[c].unique()
  cnt = len(domain)
  if cnt > 15:
    print(f"{str(c):12s}| {str(train[c].dtypes):8s}| (count) {cnt}")
  else:
    print(f"{str(c):12s}| {str(train[c].dtypes):8s}| {domain.tolist()}")
print("\nMissing Values:", train.isna().mean().sort_values(ascending=False), sep='\n')
train.head()

Shape: (891, 12) 

Target Dist: Survived
0    0.616162
1    0.383838 

[Columns]
PassengerId | int64   | (count) 891
Survived    | int64   | [0, 1]
Pclass      | int64   | [3, 1, 2]
Name        | object  | (count) 891
Sex         | object  | ['male', 'female']
Age         | float64 | (count) 89
SibSp       | int64   | [1, 0, 3, 4, 2, 5, 8]
Parch       | int64   | [0, 1, 2, 5, 3, 4, 6]
Ticket      | object  | (count) 681
Fare        | float64 | (count) 248
Cabin       | object  | [nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T']
Embarked    | object  | ['S', 'C', 'Q', nan]

Missing Values:
Cabin          0.771044
Age            0.198653
Embarked       0.002245
PassengerId    0.000000
Name           0.000000
Pclass         0.000000
Survived       0.000000
Sex            0.000000
Parch          0.000000
SibSp          0.000000
Fare           0.000000
Ticket         0.000000
dtype: float64


   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.25     NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.925    NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1     C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.05     NaN    S
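
The target distribution printed above also gives a reference point for any later model: always predicting the majority class (0, did not survive) would already be correct for about 61.6% of the training rows. Below is a small worked sketch using only numbers shown in the output (891 rows, class shares 0.616162 and 0.383838). The variable names are illustrative.

```python
n_rows = 891
class_share = {0: 0.616162, 1: 0.383838}  # copied from the printed target distribution

# Recover approximate class counts from the printed proportions
class_count = {label: round(share * n_rows) for label, share in class_share.items()}

# Accuracy of a classifier that always predicts the most frequent class
majority_label = max(class_share, key=class_share.get)
baseline_accuracy = class_count[majority_label] / n_rows

print("class counts     :", class_count)
print("majority label   :", majority_label)
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

A model is only useful here if its validation accuracy clearly exceeds this majority-class figure.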
