Building a Classifier

In [2]:
# Build a custom classifier
from sklearn.base import BaseEstimator
import numpy as np
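class MyDummyClassifier(BaseEstimator):
  # fit() learns nothing; the prediction rule below is hard-coded
  def fit(self, X, y):
    pass

  # predict() returns 0 (not survived) when Sex == 1, otherwise 1 (survived)
  def predict(self, X):
    pred = np.zeros((X.shape[0],1))
    for i in range(X.shape[0]):
      if X['Sex'].iloc[i] == 1:
        pred[i]=0
      else :
        pred[i]=1
    return pred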

Titanic survival prediction
- Rule: if Sex is 1, classify the passenger as not survived; otherwise classify the passenger as survived
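
The rule above does not need a Python loop. A minimal sketch, assuming Sex has already been label-encoded as 0/1; the array `sex` is made-up illustration data, not values from the Titanic file:

```python
import numpy as np

# Hypothetical label-encoded Sex values for five passengers (illustration only)
sex = np.array([1, 0, 1, 1, 0])

# Rule: Sex == 1 -> predict 0 (not survived), otherwise predict 1 (survived)
pred = np.where(sex == 1, 0, 1)

print(pred)  # [0 1 0 0 1]
```

Inside `predict`, the same expression could be applied to `X['Sex']`; a 1-D result is what scikit-learn's metric functions normally expect.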

In [3]:
import pandas as pd
titanic_df = pd.read_csv('./data/titanic.csv')
X_titanic_df = titanic_df.drop(columns = 'Survived')
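y_titanic_df = titanic_df['Survived']

# Null-handling function
def fillna(df):
    # Assign the result back to the column; calling fillna(inplace=True) on a
    # column selection triggers a pandas warning and stops working in pandas 3.0
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Cabin'] = df['Cabin'].fillna('N')
    df['Embarked'] = df['Embarked'].fillna('N')
    df['Fare'] = df['Fare'].fillna(0)
    return df

# Drop features that the machine learning algorithm does not need
def drop_features(df):
    df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
    return df

# Label-encoding function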
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin', 'Sex', 'Embarked']
    for feature in features:
        le = LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df

# Call the preprocessing functions defined above
def transform_features(df):
    df = fillna(df) 
    df = drop_features(df)
    df = format_features(df)
    return df

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

X_titanic_df = transform_features(X_titanic_df)

In [5]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_titanic_df,
                                                    y_titanic_df, 
                                                    test_size=0.2, 
                                                    random_state=0 )

In [6]:
myclf = MyDummyClassifier()
myclf.fit(X_train, y_train)

In [7]:
my_pred = myclf.predict(X_test)
accuracy_score(y_test, my_pred)

0.7877094972067039

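Confusion matrix
- A matrix that compares each sample's actual class with the class predicted by the model, counting per class how many points were classified correctly and how many were not. In scikit-learn's output, rows are actual classes and columns are predicted classes.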
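
To make the definition concrete, here is a small sketch that tallies a binary confusion matrix by hand. The helper name `confusion_counts` and the label lists are hypothetical, chosen only for illustration; the layout (rows are actual classes, columns are predicted classes) follows the scikit-learn convention.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels.

    Rows index the actual class, columns index the predicted class.
    """
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

print(confusion_counts(y_true, y_pred))  # [[2, 1], [1, 2]]
```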

In [8]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,my_pred)

array([[92, 18],
       [20, 49]])

In [9]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_test, my_pred), recall_score(y_test, my_pred)

(np.float64(0.7313432835820896), np.float64(0.7101449275362319))
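
These scores can be re-derived from the confusion matrix of the dummy classifier. The sketch below hard-codes the four counts from that matrix, read as [[TN, FP], [FN, TP]], and applies the standard definitions; the variable names are only illustrative.

```python
# Counts from the dummy classifier's confusion matrix: [[92, 18], [20, 49]]
tn, fp, fn, tp = 92, 18, 20, 49

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many are found

print(accuracy, precision, recall)
```

The three values agree with the accuracy, precision, and recall reported for the dummy classifier (about 0.788, 0.731, and 0.710).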

Comparing precision and recall for logistic regression, random forest, and KNN

In [10]:
def get_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)

    print(confusion)
    print('='*20)
    print(accuracy, precision, recall)

In [11]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_titanic_df,
                                                    y_titanic_df, 
                                                    test_size=0.2, 
                                                    random_state=0 )

from sklearn.linear_model import LogisticRegression

model_logit = LogisticRegression()
model_logit.fit(X_train,y_train)
pred = model_logit.predict(X_test)

from sklearn.metrics import accuracy_score

get_clf_eval(y_test, pred)

[[92 18]
 [16 53]]
====================
0.8100558659217877 0.7464788732394366 0.7681159420289855


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
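
The warning suggests two remedies: scale the features or raise `max_iter`. A minimal sketch of both, assuming synthetic stand-in data from `make_classification` rather than the Titanic set; applying it to the Titanic features would change the scores shown above, so it is an illustration and not a drop-in replacement.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic set); one column is inflated
# to mimic features that live on very different scales
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize the features, then fit logistic regression with a larger iteration budget
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

pred = model.predict(X_test)
print(pred.shape)  # (40,)
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training split only and then reused on the test split.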


Adjusting the threshold

In [12]:
pred_proba = model_logit.predict_proba(X_test) # probability of class 0 and of class 1
pos_proba = pred_proba[:,1] # probability of the positive class (survived)

threshold = 0.4 # decision threshold
custom_proba = (pos_proba >= threshold).astype(int) # predict 1 when the probability reaches the threshold

get_clf_eval(y_test,custom_proba)

[[86 24]
 [13 56]]
====================
0.7932960893854749 0.7 0.8115942028985508


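Changes in precision and recall
- Adjust the threshold when precision and recall are badly imbalanced, or when business requirements favor one over the other.

- Lowering the threshold tends to lower precision and raise recall: here, moving from the default 0.5 to 0.4 changed precision from 0.746 to 0.700 and recall from 0.768 to 0.812.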
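
The trade-off can be seen on a tiny example. This sketch uses made-up probabilities and labels (not model output), and the helper `precision_recall` is hypothetical; it simply thresholds the probabilities and recomputes both metrics.

```python
# Made-up positive-class probabilities and true labels (illustration only)
pos_proba = [0.9, 0.8, 0.55, 0.45, 0.42, 0.3, 0.2, 0.1]
y_true = [1, 1, 0, 1, 0, 1, 0, 0]


def precision_recall(y_true, proba, threshold):
    """Threshold the probabilities, then return (precision, recall)."""
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in (0.5, 0.4):
    precision, recall = precision_recall(y_true, pos_proba, threshold)
    print(threshold, round(precision, 3), round(recall, 3))
# 0.5 0.667 0.5
# 0.4 0.6 0.75
```

With the lower threshold, two more samples are predicted positive: one is a true positive (recall rises) and one is a false positive (precision falls).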

In [13]:
from sklearn.metrics import f1_score, classification_report
print(f1_score(y_test,pred))
print(classification_report(y_test,pred))

0.7571428571428571
              precision    recall  f1-score   support

           0       0.85      0.84      0.84       110
           1       0.75      0.77      0.76        69

    accuracy                           0.81       179
   macro avg       0.80      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179
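
The F1 score and the per-class rows of the report follow directly from the logistic regression confusion matrix. The sketch below hard-codes those four counts, read as [[TN, FP], [FN, TP]], and uses illustrative variable names.

```python
# Counts from the logistic regression confusion matrix: [[92, 18], [16, 53]]
tn, fp, fn, tp = 92, 18, 16, 53

# Class 1 (positive class)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0: the same formulas with the roles of the two classes swapped
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(f1_1, 4))                               # 0.7571
print(round(precision_1, 2), round(recall_1, 2))    # 0.75 0.77
print(round(precision_0, 2), round(recall_0, 2))    # 0.85 0.84
```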

