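Using the provided travel insurance prediction data, develop a model that predicts TravelInsurance (whether the customer purchased the travel insurance package), and report the model development process and the AUC on the test dataset as your answer.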
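
Since the task is scored on AUC, it helps to recall that AUC is computed from predicted scores or probabilities for the positive class, not from hard 0/1 predictions. A minimal sketch, assuming hypothetical labels and scores that are not taken from this dataset:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1 (illustrative only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (positive, negative) pairs in which the positive is scored higher.
auc = roc_auc_score(y_true, y_score)
print(auc)
```

With a fitted scikit-learn classifier, the scores would typically come from `predict_proba(X_test)[:, 1]` rather than from `predict(X_test)`.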

In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder, MinMaxScaler
from scipy.special import expit, softmax
import warnings 
warnings.filterwarnings('ignore')

# Load the data

In [2]:
df = pd.read_csv("./data/travel_insurance_prediction.csv")

# Inspect the data

In [3]:
df.head()

,Age,Employment Type,GraduateOrNot,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,31,Government Sector,Yes,400000,6,1,No,No,0
1,31,Private Sector/Self Employed,Yes,1250000,7,0,No,No,0
2,34,Private Sector/Self Employed,Yes,500000,4,1,No,No,1
3,28,Private Sector/Self Employed,Yes,700000,3,1,No,No,0
4,28,Private Sector/Self Employed,Yes,700000,8,1,Yes,No,0


In [4]:
df.shape

(1987, 9)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1987 entries, 0 to 1986
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Age                  1987 non-null   int64 
 1   Employment Type      1987 non-null   object
 2   GraduateOrNot        1987 non-null   object
 3   AnnualIncome         1987 non-null   int64 
 4   FamilyMembers        1987 non-null   int64 
 5   ChronicDiseases      1987 non-null   int64 
 6   FrequentFlyer        1987 non-null   object
 7   EverTravelledAbroad  1987 non-null   object
 8   TravelInsurance      1987 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 139.8+ KB


In [6]:
# Check for missing values
df.isnull().sum()

Age                    0
Employment Type        0
GraduateOrNot          0
AnnualIncome           0
FamilyMembers          0
ChronicDiseases        0
FrequentFlyer          0
EverTravelledAbroad    0
TravelInsurance        0
dtype: int64

- No missing values.

In [7]:
df.dtypes

Age                     int64
Employment Type        object
GraduateOrNot          object
AnnualIncome            int64
FamilyMembers           int64
ChronicDiseases         int64
FrequentFlyer          object
EverTravelledAbroad    object
TravelInsurance         int64
dtype: object

In [8]:
df.describe()

,Age,AnnualIncome,FamilyMembers,ChronicDiseases,TravelInsurance
count,1987.0,1987.0,1987.0,1987.0,1987.0
mean,29.650226,932763.0,4.752894,0.277806,0.357323
std,2.913308,376855.7,1.60965,0.44803,0.479332
min,25.0,300000.0,2.0,0.0,0.0
25%,28.0,600000.0,4.0,0.0,0.0
50%,29.0,900000.0,5.0,0.0,0.0
75%,32.0,1250000.0,6.0,1.0,1.0
max,35.0,1800000.0,9.0,1.0,1.0
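
Because TravelInsurance is a 0/1 column, its mean in the summary above is the share of customers who bought the package. A small sketch of that arithmetic, using the count and mean reported by `describe()`; the toy Series at the end is hypothetical:

```python
import pandas as pd

# Figures taken from the describe() output: row count and mean of TravelInsurance.
n_rows = 1987
positive_rate = 0.357323

# For a 0/1 column the mean equals the share of 1s, so the implied class counts are:
n_buyers = round(n_rows * positive_rate)
n_non_buyers = n_rows - n_buyers
print(n_buyers, n_non_buyers)

# The same identity on a small hypothetical 0/1 Series.
toy = pd.Series([0, 1, 0, 0, 1])
toy_rate = toy.value_counts(normalize=True).loc[1]
print(toy.mean(), toy_rate)
```

The classes are moderately imbalanced, which is one reason a threshold-free metric such as AUC is a reasonable choice here.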


# Check the object-type columns

In [10]:
# value_counts() must be called with parentheses to return the per-category counts
df['Employment Type'].value_counts()
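
Note that `value_counts` is a method: written without parentheses, the cell only echoes the bound method together with the Series it belongs to, not the frequencies. A minimal sketch on a hypothetical toy Series:

```python
import pandas as pd

# Hypothetical toy column, standing in for a categorical column such as 'Employment Type'.
s = pd.Series(["A", "B", "B", "B"])

method = s.value_counts      # no parentheses: the bound method object itself
counts = s.value_counts()    # called: a Series of frequencies, most frequent first

print(callable(method))
print(counts.to_dict())
```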

In [11]:
# value_counts() must be called with parentheses to return the per-category counts
df['GraduateOrNot'].value_counts()

- Graduation status does not seem to be needed for predicting the travel insurance package purchase.
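
One way to check this impression before dropping the column is to compare the purchase rate between the two graduation groups. A minimal sketch, assuming a hypothetical toy frame; in the notebook the same `groupby` would be run on `df`:

```python
import pandas as pd

# Hypothetical toy rows (not taken from the dataset).
toy = pd.DataFrame({
    "GraduateOrNot": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "TravelInsurance": [1, 0, 1, 0, 1, 0],
})

# Purchase rate within each graduation group.
# Similar rates across groups would support dropping the column.
rate_by_group = toy.groupby("GraduateOrNot")["TravelInsurance"].mean()
print(rate_by_group)
```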

In [12]:
# value_counts() must be called with parentheses to return the per-category counts
df['FrequentFlyer'].value_counts()

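- Frequent Flyer: a frequent-flyer program is a set of services that many airlines offer to customers who fly with them often. Customers typically join a points-based membership and earn points in proportion to the distance flown.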

In [13]:
df['EverTravelledAbroad'].value_counts()

EverTravelledAbroad
No     1607
Yes     380
Name: count, dtype: int64
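
The counts above can be turned into shares to see how rare overseas travel is in this data. A small sketch of the arithmetic, using only the two counts printed above:

```python
# Counts taken from the value_counts() output for EverTravelledAbroad.
counts = {"No": 1607, "Yes": 380}

# The total should match the number of rows in the DataFrame.
total = sum(counts.values())

# Share of each category.
shares = {label: n / total for label, n in counts.items()}
print(total, round(shares["Yes"], 3))
```

In pandas, `value_counts(normalize=True)` returns these shares directly.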

# Create a new DataFrame

In [14]:
# Create a new DataFrame that excludes the graduation column
df_new = df.drop('GraduateOrNot', axis = 1)

In [15]:
df_new.head()

,Age,Employment Type,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,31,Government Sector,400000,6,1,No,No,0
1,31,Private Sector/Self Employed,1250000,7,0,No,No,0
2,34,Private Sector/Self Employed,500000,4,1,No,No,1
3,28,Private Sector/Self Employed,700000,3,1,No,No,0
4,28,Private Sector/Self Employed,700000,8,1,Yes,No,0


# Data preprocessing

### Convert categorical data to numbers with LabelEncoder

In [16]:
cols = df_new.select_dtypes(include='object').columns

In [17]:
le = LabelEncoder()

In [18]:
for col in cols:
    df_new[col] = le.fit_transform(df_new[col])

In [19]:
df_new.head()

,Age,Employment Type,AnnualIncome,FamilyMembers,ChronicDiseases,FrequentFlyer,EverTravelledAbroad,TravelInsurance
0,31,0,400000,6,1,0,0,0
1,31,1,1250000,7,0,0,0,0
2,34,1,500000,4,1,0,0,1
3,28,1,700000,3,1,0,0,0
4,28,1,700000,8,1,1,0,0
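
Reusing a single `LabelEncoder` across columns works, but after the loop it only remembers the classes of the last column it was fitted on. A possible variant, sketched here on a hypothetical toy frame, keeps one encoder per column so each mapping can be inspected or inverted later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame using category values that appear in the dataset.
toy = pd.DataFrame({
    "Employment Type": ["Government Sector", "Private Sector/Self Employed"],
    "FrequentFlyer": ["No", "Yes"],
})

# One encoder per column, stored by column name.
encoders = {}
for col in ["Employment Type", "FrequentFlyer"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original labels in sorted order; the position is the encoded value.
mapping = {col: list(enc.classes_) for col, enc in encoders.items()}
print(mapping)
```

Because the classes are sorted, "No" maps to 0 and "Yes" maps to 1, which is consistent with the encoded values shown above.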
