<a href="https://colab.research.google.com/github/clustering-jun/KMU-Data_Science/blob/main/L06_Random_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Random Forest Practice**

## **Titanic - Kaggle Competition**
- Titanic - Machine Learning from Disaster
 - https://www.kaggle.com/competitions/titanic

 - A competition to predict whether each passenger survived, based on Titanic passenger information

<br>

- Titanic Dataset
 - Survived - survival (0 = died / 1 = survived)
 - Pclass - ticket class (1/2/3)
 - Sex - sex (male, female)
 - Age - age
 - SibSp - number of siblings/spouses aboard the Titanic
 - Parch - number of parents/children aboard the Titanic
 - Ticket - ticket number
 - Fare - passenger fare
 - Cabin - cabin number
 - Embarked - port of embarkation (C/Q/S)
   - C = Cherbourg, Q = Queenstown, S = Southampton

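### **Data Preprocessing**
- Fill missing values in the Age column with the column mean
- Select the features to use
- pd.get_dummies: converts categorical columns to one-hot encoding
 - drop_first: drops the first category of each encoded column (to avoid redundancy)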
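
As a small illustration of `drop_first`, the sketch below encodes a toy DataFrame. The values are made up for this example and are not rows from train.csv; the names `toy`, `full`, and `reduced` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only (not rows from train.csv).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Without drop_first, every category gets its own column.
full = pd.get_dummies(toy)

# With drop_first=True, the first category of each column
# ('female' and 'C', in sorted order) is dropped.
reduced = pd.get_dummies(toy, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

The dropped category is still recoverable: a row with `Sex_male` equal to 0 must be female, so the extra column would carry no new information.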

In [17]:
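import pandas as pd

train = pd.read_csv('train.csv')

# Fill missing ages with the mean age of the training set.
train['Age'] = train['Age'].fillna(train['Age'].mean())

# Columns used as model inputs.
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

# One-hot encode the categorical columns (Sex, Embarked) and convert to NumPy arrays.
X = pd.get_dummies(train[features], drop_first=True).values
y = train['Survived'].values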

### **RandomForestClassifier**
- n_estimators: number of decision trees
- max_depth: maximum depth of each decision tree
- max_features: number of features randomly considered at each split
- max_samples: number of samples drawn for each bootstrap sample
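
To make `max_features='sqrt'` concrete, the following sketch fits a single scikit-learn decision tree and reads back the resolved value from its `max_features_` attribute. The data is random and purely illustrative, and the choice of 8 feature columns is an assumption made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random, illustrative data: 20 samples with 8 features and binary labels.
X_demo = rng.normal(size=(20, 8))
y_demo = rng.integers(0, 2, size=20)

tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
tree.fit(X_demo, y_demo)

# With 8 features, 'sqrt' resolves to int(sqrt(8)) = 2 candidate features per split.
print(tree.max_features_)
```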

### **fit()**
- Draws a bootstrap sample from the training data for each decision tree and fits the tree on it
- `np.random.choice(n, m, replace=True)`: draws m samples from the range 0 to n-1 and returns them as an array
 - replace=True: sampling with replacement
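
The bootstrap draw described above can be sketched as follows. The sizes `n` and `m` are hypothetical, and the seed is fixed only so the sketch is repeatable.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

n = 10  # size of the (hypothetical) training set
m = 10  # number of rows to draw

# Draw m row indices from 0..n-1 with replacement.
indices = np.random.choice(n, m, replace=True)

# Because sampling is with replacement, some indices typically repeat
# and others are never drawn.
n_unique = len(np.unique(indices))

print(indices)
print('unique rows drawn:', n_unique)
```

Indexing the training arrays with `indices` (for example `X[indices], y[indices]`) then yields one bootstrap sample.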

### **predict()**
- For input data X, each estimator produces its own predictions, and the final prediction is chosen by majority voting
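
A minimal sketch of the voting step, using `np.bincount` together with `np.apply_along_axis`. The prediction matrix is hypothetical, and the helper name `majority_vote` is introduced only for this example. It assumes class labels are non-negative integers.

```python
import numpy as np

# Hypothetical predictions: 3 estimators (rows) x 4 samples (columns).
all_predictions = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
])

def majority_vote(column):
    # np.bincount counts occurrences of each non-negative integer label;
    # argmax returns the most frequent one (the smallest label on ties).
    return np.bincount(column).argmax()

# Apply the vote to every column, i.e. once per sample.
final = np.apply_along_axis(majority_vote, 0, all_predictions)
print(final)  # [0 1 0 0]
```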

In [23]:
from sklearn.tree import DecisionTreeClassifier
import numpy as np

class RandomForestClassifier:
    def __init__(self, n_estimators=100, max_depth=None, max_features='sqrt', max_samples=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.max_samples = max_samples

        # Create the (unfitted) decision trees up front.
        self.estimators = [
            DecisionTreeClassifier(max_depth=self.max_depth, max_features=self.max_features)
            for _ in range(n_estimators)
        ]

    def fit(self, X, y):
        # Fit each tree on its own bootstrap sample of the training data.
        for estimator in self.estimators:
            X_sample, y_sample = self.sample(X, y)
            estimator.fit(X_sample, y_sample)

    def sample(self, X, y):
        # max_samples is an absolute row count here; None means "as many rows as X has".
        if self.max_samples is None:
            n_samples = X.shape[0]
        else:
            n_samples = min(self.max_samples, X.shape[0])

        # Draw row indices with replacement (bootstrap sampling).
        indices = np.random.choice(X.shape[0], n_samples, replace=True)
        return X[indices], y[indices]

    # This could also be implemented with np.apply_along_axis.
    def predict(self, X):
        # Row i holds the predictions of tree i for every sample in X.
        all_predictions = np.zeros((self.n_estimators, X.shape[0]), dtype=np.int64)

        for i in range(self.n_estimators):
            all_predictions[i] = self.estimators[i].predict(X)

        # Majority vote per sample (assumes non-negative integer class labels).
        return np.array([np.bincount(all_predictions[:, i]).argmax()
                         for i in range(X.shape[0])])


In [33]:
rf = RandomForestClassifier(n_estimators=50, max_depth=10)
rf.fit(X, y)

In [34]:
print(f'Training accuracy: {(rf.predict(X) == y).mean():.2f}')

Training accuracy: 0.92


## **Random Forests with Scikit-learn**
- `oob_score`: measures accuracy on the data that bootstrap sampling did not draw (out of bag)
- min_samples_split: minimum number of samples required to split a node in a decision tree
- Parameter setting guide: https://scikit-learn.org/stable/modules/ensemble.html#parameter
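
To give a feel for how much data ends up out of bag, the sketch below draws one bootstrap sample of size n from n rows and measures the fraction of rows that were never drawn. The row count `n` is hypothetical. When n rows are drawn from n, this fraction is expected to be close to 1/e, about 0.37.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical number of training rows

# One bootstrap sample: n row indices drawn with replacement.
indices = rng.choice(n, size=n, replace=True)

# Rows that never appear in the bootstrap sample are "out of bag".
in_bag = np.zeros(n, dtype=bool)
in_bag[indices] = True
oob_fraction = 1.0 - in_bag.mean()

print(f'out-of-bag fraction: {oob_fraction:.3f}')
```

Each tree can be evaluated on its own out-of-bag rows, which is why the out-of-bag score works as a validation estimate without a separate hold-out set.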

In [60]:
from sklearn.ensemble import RandomForestClassifier

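# max_samples=0.8: each tree draws 0.8 * n_samples rows (with replacement).
# oob_score=True: also score the forest on the rows each tree did not see.
rf = RandomForestClassifier(n_estimators=80, max_samples=0.8, max_depth=None, min_samples_split=20,
                            max_features=4, oob_score=True)

rf.fit(X, y)

print('train accuracy:', (y == rf.predict(X)).mean())
print('out-of-bag score:', rf.oob_score_)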

train accuracy: 0.8731762065095399
out-of-bag score: 0.8260381593714927


In [62]:
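import pandas as pd

test = pd.read_csv('test.csv')

# Apply the same preprocessing as for the training data,
# filling missing values with statistics from the training set.
test['Age'] = test['Age'].fillna(train['Age'].mean())
# Assumption: Fare may contain missing values in test.csv; NaN inputs can make predict() fail.
test['Fare'] = test['Fare'].fillna(train['Fare'].mean())

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']

# Note: this assumes the test set yields the same dummy columns as the training set.
X_test = pd.get_dummies(test[features], drop_first=True).values

y_test = rf.predict(X_test)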

In [63]:
# Write the predictions in the Kaggle submission format.
with open('rf_result.csv', 'w') as f:
    f.write('PassengerId,Survived\n')
    for pid, survived in zip(test['PassengerId'].values, y_test):
        f.write(f'{pid},{survived}\n')