<a href="https://colab.research.google.com/github/crazat/crazat.githurb.io/blob/main/%ED%9A%8C%EA%B7%80%EB%AA%A8%EB%8D%B8_%EC%8B%A4%EC%8A%B5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

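# Regression Practice

- Download the insurance premium prediction dataset and upload it to Google Drive
    - Predicts an individual's future medical costs from personal attributes; insurers can use the predictions when setting premiums
    - Training set: https://drive.google.com/file/d/11O7IiJNZo3rsAfnPl6PPIIERbSOxHxh8/view?usp=sharing
    - Evaluation set: https://drive.google.com/file/d/18r6qXql5ARvbJCspqJ5qds_yv7LztvXI/view?usp=sharing

- Mount Google Drive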

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


- Data path variable

In [None]:
DATA_PATH = "/content/drive/MyDrive/머신러닝/data/"
DATA_PATH

'/content/drive/MyDrive/머신러닝/data/'

- Load the data

In [None]:
import pandas as pd
import numpy as np

train = pd.read_csv(f"{DATA_PATH}insurance_train.csv") # training data
test = pd.read_csv(f"{DATA_PATH}insurance_test.csv") # test data
train.shape , test.shape

((936, 7), (402, 6))

- Columns of the insurance premium prediction dataset
    - age: age
    - sex: sex
    - bmi: body mass index
    - children: number of children
    - smoker: whether the person smokes
    - region: region
    - target: medical cost

In [None]:
train.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,target
0,37,male,36.19,0,no,southeast,19214.70553
1,18,male,29.37,1,no,southeast,1719.4363
2,28,male,26.98,2,no,northeast,4435.0942
3,47,male,36.2,1,no,southwest,8068.185
4,32,male,27.835,1,no,northwest,4454.40265


- Select the columns to use as features

In [None]:
cols = ["age","bmi","children"]
train_ft = train[cols].copy()
test_ft = test[cols].copy()

- Check the unique values of the categorical columns

In [None]:
train["sex"].unique() , train["smoker"].unique()  , train["region"].unique()

(array(['male', 'female'], dtype=object),
 array(['no', 'yes'], dtype=object),
 array(['southeast', 'northeast', 'southwest', 'northwest'], dtype=object))

- Encode the sex and smoker columns as 0/1 and add them as features

In [None]:
train_ft["sex"] = train["sex"].map(lambda x : int(x == "male") )
train_ft["smoker"] = train["smoker"].map(lambda x : int(x == "yes"))

test_ft["sex"] = test["sex"].map(lambda x : int(x == "male") )
test_ft["smoker"] = test["smoker"].map(lambda x : int(x == "yes"))

- One-hot encode the categorical variable and add it as features

In [None]:
from sklearn.preprocessing import OneHotEncoder
cols = ['region']
enc = OneHotEncoder(handle_unknown = 'ignore') # categories not seen during fit are ignored (encoded as all zeros)
enc.fit(train[cols])

In [None]:
tmp = pd.DataFrame(
    enc.transform(train[cols]).toarray(), # ndarray
    columns = enc.get_feature_names_out() # column names
)
train_ft = pd.concat([train_ft,tmp],axis=1)
train_ft.head()

Unnamed: 0,age,bmi,children,sex,smoker,region_northeast,region_northwest,region_southeast,region_southwest
0,37,36.19,0,1,0,0.0,0.0,1.0,0.0
1,18,29.37,1,1,0,0.0,0.0,1.0,0.0
2,28,26.98,2,1,0,1.0,0.0,0.0,0.0
3,47,36.2,1,1,0,0.0,0.0,0.0,1.0
4,32,27.835,1,1,0,0.0,1.0,0.0,0.0


In [None]:
# test data
tmp = pd.DataFrame(
    enc.transform(test[cols]).toarray(),
    columns = enc.get_feature_names_out()
)
test_ft = pd.concat([test_ft,tmp],axis=1)
test_ft.head()

Unnamed: 0,age,bmi,children,sex,smoker,region_northeast,region_northwest,region_southeast,region_southwest
0,43,26.03,0,1,0,1.0,0.0,0.0,0.0
1,54,27.645,1,0,0,0.0,1.0,0.0,0.0
2,53,24.32,0,1,0,0.0,1.0,0.0,0.0
3,23,28.31,0,0,1,0.0,1.0,0.0,0.0
4,49,25.84,2,1,1,0.0,1.0,0.0,0.0


- Check for missing values

In [None]:
train_ft.isnull().sum()

Unnamed: 0,0
age,0
bmi,0
children,0
sex,0
smoker,0
region_northeast,0
region_northwest,0
region_southeast,0
region_southwest,0


In [None]:
test_ft.isnull().sum()

Unnamed: 0,0
age,0
bmi,0
children,0
sex,0
smoker,0
region_northeast,0
region_northwest,0
region_southeast,0
region_southwest,0


- Min-Max Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(train_ft)

In [None]:
train_ft[train_ft.columns] = scaler.transform(train_ft) # training data
train_ft.head()

Unnamed: 0,age,bmi,children,sex,smoker,region_northeast,region_northwest,region_southeast,region_southwest
0,0.413043,0.544256,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.360775,0.2,1.0,0.0,0.0,0.0,1.0,0.0
2,0.217391,0.296476,0.4,1.0,0.0,1.0,0.0,0.0,0.0
3,0.630435,0.544525,0.2,1.0,0.0,0.0,0.0,0.0,1.0
4,0.304348,0.319478,0.2,1.0,0.0,0.0,1.0,0.0,0.0


In [None]:
test_ft[test_ft.columns] = scaler.transform(test_ft) # test data
test_ft.head()

Unnamed: 0,age,bmi,children,sex,smoker,region_northeast,region_northwest,region_southeast,region_southwest
0,0.543478,0.270917,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,0.782609,0.314366,0.2,0.0,0.0,0.0,1.0,0.0,0.0
2,0.76087,0.224913,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,0.108696,0.332257,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,0.673913,0.265806,0.4,1.0,1.0,0.0,1.0,0.0,0.0
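
The scaling step can be reproduced by hand. The sketch below applies the min-max formula `(x - min) / (max - min)` with NumPy, taking the minimum and maximum from the training values only. The arrays are made up for illustration; the age range 18 to 64 is an assumption inferred from the scaled outputs above (for example, 37 maps to 0.413043).

```python
import numpy as np

# Hypothetical age values; min and max come from the training values only.
train_age = np.array([18.0, 37.0, 47.0, 64.0])
test_age = np.array([43.0, 70.0])

age_min, age_max = train_age.min(), train_age.max()

def min_max_scale(values, lo, hi):
    """Map lo to 0 and hi to 1; values outside [lo, hi] fall outside [0, 1]."""
    return (values - lo) / (hi - lo)

train_scaled = min_max_scale(train_age, age_min, age_max)
test_scaled = min_max_scale(test_age, age_min, age_max)

print(train_scaled.round(6))
print(test_scaled.round(6))
```

Because the scaler is fitted on the training data, a test value larger than the training maximum (70 here) is mapped above 1. That is expected behavior, not a bug.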


- Target data

In [None]:
target = train["target"]
target

Unnamed: 0,target
0,19214.70553
1,1719.43630
2,4435.09420
3,8068.18500
4,4454.40265
...,...
931,1632.56445
932,1629.83350
933,9563.02900
934,27375.90478


## LinearRegression class
- Linear regression model
- Predicts the dependent variable (target) from the explanatory variables (independent variables, features) that drive it
$$
y = b_0 + b_1x
$$
$$
y = b_0 + b_1x_1 +  b_2x_2 + ... + b_nx_n
$$
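- Linear regression by least squares
    - A statistical approach that chooses the regression coefficients minimizing the residual sum of squares (RSS)
    - The solution is found by setting the partial derivative of the RSS with respect to each coefficient to 0
    - For a single feature, this gives the following coefficients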
$$
b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
$$
<br>
$$
b_0 = \bar{y}-b_1\bar{x}
$$
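
As a quick check of the two closed-form expressions above, the following sketch computes $b_1$ and $b_0$ with NumPy on a small made-up sample (the arrays `x` and `y` are illustrative, not taken from the dataset) and compares them with `np.polyfit`.

```python
import numpy as np

# Small made-up sample with a single feature.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: covariance-like sum divided by the sum of squared deviations of x.
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: the fitted line passes through the point (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

print(b1, b0)

# np.polyfit with degree 1 solves the same least-squares problem
# and returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```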

In [None]:
target.mean() , target.median() , target.skew() # mean, median, skewness

(13361.425757883548, 9447.316375, 1.5316992039020136)

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

cv = KFold(n_splits=5,shuffle=True, random_state=42) # shuffle and fix the seed for reproducibility
model = LinearRegression()
scores = cross_val_score(model,train_ft,target,cv = cv ,scoring='neg_root_mean_squared_error',n_jobs = -1)
-scores # per-fold validation scores (RMSE; the scorer returns negated values)

array([6877.50059715, 5610.12235528, 6757.65632446, 6075.86007517,
       5566.17923959])

In [None]:
-np.mean(scores) # mean

6177.463718328557
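
For intuition about what the cross-validation above computes, here is a rough hand-rolled equivalent using only NumPy on synthetic data. The data, the `rmse` helper, and the fold construction are illustrative assumptions and do not reproduce scikit-learn's exact splits. The scorer name `neg_root_mean_squared_error` reflects scikit-learn's "greater is better" convention: RMSE is returned negated, which is why the notebook flips the sign.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data (hypothetical): 3 features, known coefficients, noise.
X = rng.uniform(0, 1, size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + 5.0 + rng.normal(0, 0.5, size=100)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Shuffle the row indices once, then cut them into 5 folds.
folds = np.array_split(rng.permutation(len(X)), 5)

fold_rmse = []
for k in range(5):
    valid_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])

    # Least-squares fit with an intercept column, using the training folds only.
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

    # Score on the held-out fold.
    A_valid = np.column_stack([np.ones(len(valid_idx)), X[valid_idx]])
    fold_rmse.append(rmse(y[valid_idx], A_valid @ coef))

mean_rmse = float(np.mean(fold_rmse))
print(fold_rmse)
print(mean_rmse)
```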

In [None]:
model = LinearRegression()
model.fit(train_ft,target) # fit on the full training data

In [None]:
model.coef_ # weights (coefficients)

array([12557.77017969, 11696.89137551,  2181.23647705,    61.41010494,
       24035.06249283,   800.56056396,  -178.57281687,  -775.92585942,
         153.93811232])

In [None]:
model.intercept_ # bias (intercept)

-2425.406140089708

In [None]:
pred = model.predict(test_ft) # predict on the test data
pred[:5]

array([ 8430.33110216, 11337.19839898,  9643.03411358, 26682.43489352,
       33936.93442322])

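## Ridge class
- Least-squares linear regression with an L2 penalty
- The penalty grows with the square of each weight, so large weights are penalized most, which helps curb overfitting
- alpha sets the regularization strength: the larger alpha is, the more strongly the weights are penalized and the smaller they become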

In [None]:
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # 1.0 is the default
scores = cross_val_score(model,train_ft,target,cv = cv ,scoring='neg_root_mean_squared_error',n_jobs = -1)
-scores # per-fold validation scores

array([6898.68421508, 5615.04843145, 6754.45864602, 6063.24017356,
       5564.09960718])

In [None]:
-np.mean(scores) # mean

6179.106214657907
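
The shrinking effect of alpha in Ridge can be seen from its closed-form solution $w = (X^TX + \alpha I)^{-1}X^Ty$. The sketch below uses synthetic data and omits the intercept for simplicity (an assumption made here; scikit-learn's `Ridge` fits an unpenalized intercept by default). The helper `ridge_coef` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical) with known coefficients and small noise.
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(0, 0.1, size=50)

def ridge_coef(X, y, alpha):
    """Closed-form ridge solution without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = (0.0, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge_coef(X, y, a))) for a in alphas]

for a, norm in zip(alphas, norms):
    print(a, norm)  # the coefficient norm shrinks as alpha grows
```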

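## Lasso class
- Least-squares linear regression with an L1 penalty
- Drives the weights of less influential features to exactly 0, which acts as feature selection
- alpha sets the regularization strength: the larger alpha is, the more strongly the weights are penalized and the smaller they become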

In [None]:
from sklearn.linear_model import Lasso
model = Lasso(alpha=1.0) # 1.0 is the default
scores = cross_val_score(model,train_ft,target,cv = cv ,scoring='neg_root_mean_squared_error',n_jobs = -1)
-scores # per-fold validation scores

array([6878.6477916 , 5609.79288004, 6757.22977334, 6074.90481329,
       5565.75713153])

In [None]:
-np.mean(scores) # mean

6177.266477959944
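
Why an L1 penalty can set a weight to exactly 0 is easiest to see with a single feature and no intercept. Assuming the objective $\frac{1}{2n}\sum_i (y_i - w x_i)^2 + \alpha |w|$ (the scaling scikit-learn documents for `Lasso`), the minimizer is a soft-thresholded least-squares estimate. The helpers `soft_threshold` and `lasso_1d` below are hypothetical names used for illustration.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink z toward 0 by alpha; anything within [-alpha, alpha] becomes 0."""
    return np.sign(z) * max(abs(z) - alpha, 0.0)

def lasso_1d(x, y, alpha):
    """Exact lasso solution for one feature without an intercept."""
    n = len(x)
    rho = x @ y / n    # correlation term between feature and target
    scale = x @ x / n  # mean squared feature value
    return soft_threshold(rho, alpha) / scale

# Made-up data lying exactly on the line y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w_none = lasso_1d(x, y, alpha=0.0)    # no penalty: ordinary least squares
w_some = lasso_1d(x, y, alpha=3.0)    # moderate penalty: weight is shrunk
w_large = lasso_1d(x, y, alpha=20.0)  # large penalty: weight is exactly 0

print(w_none, w_some, w_large)
```

A Ridge penalty, by contrast, only rescales the weight and never reaches exactly 0 for finite alpha.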

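## ElasticNet class
- Least-squares linear regression with a combination of L1 and L2 penalties
- With L1 alone, coefficients can change abruptly as alpha changes and many of them are driven to 0; ElasticNet adds an L2 penalty to soften this tendency
- l1_ratio parameter
    - Share of the penalty assigned to L1
    - Default = 0.5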

In [None]:
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=1.0,l1_ratio = 0.9)
scores = cross_val_score(model,train_ft,target,cv = cv ,scoring='neg_root_mean_squared_error',n_jobs = -1)
-scores # per-fold validation scores

array([8818.87765646, 7058.46580789, 7852.25928449, 7145.38463601,
       7390.04372841])

In [None]:
-np.mean(scores) # mean

7653.006222651898
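
How `alpha` and `l1_ratio` combine can be sketched with the penalty term from scikit-learn's documented ElasticNet objective, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The function `elastic_net_penalty` and the weight vector are hypothetical, used only to illustrate the formula.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term of the ElasticNet objective for a weight vector w."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))  # L1 norm
    l2 = np.sum(w ** 2)     # squared L2 norm
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

w = [3.0, -4.0]  # made-up weights: L1 norm 7, squared L2 norm 25

pure_l1 = elastic_net_penalty(w, alpha=1.0, l1_ratio=1.0)  # Lasso-style penalty
pure_l2 = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.0)  # Ridge-style penalty
mixed = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.9)    # setting used above

print(pure_l1, pure_l2, mixed)
```

In that objective the squared-error term is divided by `2 * n_samples`, whereas `Ridge` uses the unscaled sum of squares. The same `alpha=1.0` therefore does not mean the same regularization strength in the two classes, which may partly explain the larger cross-validation error seen here.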

# Classification Practice

- Load the Titanic data

In [None]:
train = pd.read_csv(f"{DATA_PATH}titanic_train.csv") # training data
test = pd.read_csv(f"{DATA_PATH}titanic_test.csv") # test data
test_target = pd.read_csv(f"{DATA_PATH}titanic_test_target.csv") # labels for the test data
train.shape , test.shape , test_target.shape

((916, 12), (393, 11), (393, 2))

- Check for missing values

In [None]:
train.isnull().sum()

Unnamed: 0,0
passengerid,0
survived,0
pclass,0
name,0
gender,0
age,180
sibsp,0
parch,0
ticket,0
fare,0


In [None]:
test.isnull().sum()

Unnamed: 0,0
passengerid,0
pclass,0
name,0
gender,0
age,83
sibsp,0
parch,0
ticket,0
fare,1
cabin,308


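- Missing values must be filled with statistics computed from the training data, so that no information from the test set leaks into preprocessing.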

In [None]:
age_mean = train["age"].mean() # mean
fare_median = train["fare"].median() # median
cabin_unk = "UNK" # new category for unknown cabins
embarked_mode = train["embarked"].mode()[0] # mode (most frequent value)
age_mean , fare_median ,cabin_unk , embarked_mode

(29.904891304347824, 14.5, 'UNK', 'S')

- Fill missing values in the training data

In [None]:
train["age"] = train["age"].fillna(age_mean)
train["cabin"] = train["cabin"].fillna(cabin_unk)

- Fill missing values in the test data

In [None]:
test["age"] = test["age"].fillna(age_mean)
test["fare"] = test["fare"].fillna(fare_median)
test["cabin"] = test["cabin"].fillna(cabin_unk)
test["embarked"] = test["embarked"].fillna(embarked_mode)
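
The same pattern on made-up frames, as a minimal sketch: the fill values are computed once from the training frame and then reused for both frames. `toy_train`, `toy_test`, `age_fill`, and `fare_fill` are hypothetical names, not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with a few missing values.
toy_train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "fare": [7.0, 10.0, 100.0, 13.0],
})
toy_test = pd.DataFrame({
    "age": [np.nan, 50.0],
    "fare": [np.nan, 8.0],
})

# Statistics come from the training frame only.
age_fill = toy_train["age"].mean()      # NaN is skipped when averaging
fare_fill = toy_train["fare"].median()  # the median is robust to the outlier 100.0

# The same values are applied to both frames.
toy_train["age"] = toy_train["age"].fillna(age_fill)
toy_test["age"] = toy_test["age"].fillna(age_fill)
toy_test["fare"] = toy_test["fare"].fillna(fare_fill)

print(toy_train)
print(toy_test)
```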

In [None]:
train.isnull().sum().sum() , test.isnull().sum().sum()

(0, 0)

- Select the columns to use as features

In [None]:
cols = ["age","sibsp","parch","fare","pclass","gender","embarked"]
train_ft = train[cols].copy()
test_ft = test[cols].copy() # test data
train_ft.shape, test_ft.shape

((916, 7), (393, 7))

- One-hot encode the categorical variables and add them as features

In [None]:
cols = ['gender','embarked']
enc = OneHotEncoder(handle_unknown = 'ignore') # categories not seen during fit are ignored (encoded as all zeros)
enc.fit(train_ft[cols])

In [None]:
# training data
tmp = pd.DataFrame(
    enc.transform(train_ft[cols]).toarray(), # ndarray
    columns = enc.get_feature_names_out() # column names
)
train_ft = pd.concat([train_ft,tmp],axis=1)
train_ft.head()

Unnamed: 0,age,sibsp,parch,fare,pclass,gender,embarked,gender_female,gender_male,embarked_C,embarked_Q,embarked_S
0,71.0,0,0,49.5042,1,male,C,0.0,1.0,1.0,0.0,0.0
1,34.0,0,0,8.05,3,male,S,0.0,1.0,0.0,0.0,1.0
2,29.0,3,1,22.025,3,male,S,0.0,1.0,0.0,0.0,1.0
3,18.0,1,1,13.0,2,female,S,1.0,0.0,0.0,0.0,1.0
4,48.0,0,0,26.55,1,male,S,0.0,1.0,0.0,0.0,1.0


In [None]:
# test data
tmp = pd.DataFrame(
    enc.transform(test_ft[cols]).toarray(),
    columns = enc.get_feature_names_out()
)
test_ft = pd.concat([test_ft,tmp],axis=1)
test_ft.head()

Unnamed: 0,age,sibsp,parch,fare,pclass,gender,embarked,gender_female,gender_male,embarked_C,embarked_Q,embarked_S
0,62.0,0,0,26.55,1,male,S,0.0,1.0,0.0,0.0,1.0
1,28.0,0,0,47.1,1,male,S,0.0,1.0,0.0,0.0,1.0
2,24.0,0,0,9.5,3,male,S,0.0,1.0,0.0,0.0,1.0
3,29.904891,0,0,7.7333,3,female,Q,1.0,0.0,0.0,1.0,0.0
4,18.5,0,0,7.2833,3,female,Q,1.0,0.0,0.0,1.0,0.0


- Check for missing values introduced while creating the derived features

In [None]:
train_ft.isnull().sum().sum(), test_ft.isnull().sum().sum()

(0, 0)

- Drop the string columns

In [None]:
cols = ["gender","embarked"]
train_ft = train_ft.drop(columns=cols)
test_ft = test_ft.drop(columns=cols)

- Min-Max Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(train_ft)

In [None]:
train_ft[train_ft.columns] = scaler.transform(train_ft) # training data
train_ft.head()

Unnamed: 0,age,sibsp,parch,fare,pclass,gender_female,gender_male,embarked_C,embarked_Q,embarked_S
0,0.88726,0.0,0.0,0.096626,0.0,0.0,1.0,1.0,0.0,0.0
1,0.423776,0.0,0.0,0.015713,1.0,0.0,1.0,0.0,0.0,1.0
2,0.361142,0.375,0.111111,0.04299,1.0,0.0,1.0,0.0,0.0,1.0
3,0.22335,0.125,0.111111,0.025374,0.5,1.0,0.0,0.0,0.0,1.0
4,0.599148,0.0,0.0,0.051822,0.0,0.0,1.0,0.0,0.0,1.0


In [None]:
test_ft[test_ft.columns] = scaler.transform(test_ft) # test data
test_ft.head()

Unnamed: 0,age,sibsp,parch,fare,pclass,gender_female,gender_male,embarked_C,embarked_Q,embarked_S
0,0.774521,0.0,0.0,0.051822,0.0,0.0,1.0,0.0,0.0,1.0
1,0.348616,0.0,0.0,0.091933,0.0,0.0,1.0,0.0,0.0,1.0
2,0.298509,0.0,0.0,0.018543,1.0,0.0,1.0,0.0,0.0,1.0
3,0.372478,0.0,0.0,0.015094,1.0,1.0,0.0,0.0,1.0,0.0
4,0.229613,0.0,0.0,0.014216,1.0,1.0,0.0,0.0,1.0,0.0


- Target data

In [None]:
target = train["survived"]
target

Unnamed: 0,survived
0,0
1,0
2,0
3,1
4,1
...,...
911,1
912,0
913,0
914,0


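## LogisticRegression class
- A classification model, despite the word "Regression" in its name
- Linear regression + sigmoid function
- Model parameters are updated iteratively by a numerical optimizer (gradient-based methods such as gradient descent; scikit-learn's options are listed under solver below)
- The $\sigma$ (sigmoid) function is used to make the prediction
- Properties of the sigmoid function
  - The larger the input, the closer the output is to 1
  - The smaller the input, the closer the output is to 0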

$$
y = \frac{1}{1+e^{-x}}
$$
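
The sigmoid formula above translates directly into NumPy. This minimal sketch evaluates it at a few illustrative inputs to show the behavior described in the bullets; the function name `sigmoid` is chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.array([-10.0, 0.0, 10.0])
values = sigmoid(inputs)

# Very negative inputs approach 0, an input of 0 gives 0.5,
# and very positive inputs approach 1.
print(values)
```

In logistic regression the input is the linear combination of the features and weights, and the output is read as the probability of the positive class.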

- Main parameters
    - random_state: random seed
    - penalty: 'l2' (default), 'l1', 'elasticnet', None or 'none'
        - Since scikit-learn 1.2, None is used in place of 'none'
        - 'none' was removed in version 1.4
    - solver: optimization algorithm for the model parameters
        - 'lbfgs' (default): used with l2, None or 'none'
        - 'liblinear': used with l2, l1
        - 'newton-cg': used with l2, None or 'none'
        - 'newton-cholesky': used with l2, None or 'none'
        - 'sag': used with l2, None or 'none'
        - 'saga': used with l2, l1, 'elasticnet', None or 'none'
    - C: 1.0 (default); must be a positive float, and smaller values regularize the model parameters more strongly.
    - max_iter: 100 (default), maximum number of iterations
    - tol: 1e-4 (default), tolerance used as the stopping criterion

In [None]:
# check the version installed in Colab
import sklearn
sklearn.__version__

'1.6.1'

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42)
scores = cross_val_score(model,train_ft,target,cv = cv ,scoring='roc_auc',n_jobs = -1)
print(scores) # per-fold validation scores
np.mean(scores) # mean CV score

[0.90006266 0.95127697 0.88071166 0.86257993 0.88976273]


0.8968787884837675

- Apply L1 regularization

In [None]:
model = LogisticRegression(random_state=42,penalty="l1",solver="liblinear")
scores = cross_val_score(model,train_ft,target,cv = cv ,scoring='roc_auc',n_jobs = -1)
print(scores) # per-fold validation scores
np.mean(scores) # mean CV score

[0.89467419 0.94972725 0.88202468 0.85833743 0.89029592]


0.8950118947954492

- No regularization

In [None]:
model = LogisticRegression(random_state=42,penalty=None) # on scikit-learn 1.2 or later, pass None (not 'none') as the penalty
scores = cross_val_score(model,train_ft,target,cv = cv ,scoring='roc_auc',n_jobs = -1)
print(scores) # per-fold validation scores
np.mean(scores) # mean CV score

[0.89818296 0.95016117 0.88071166 0.86590015 0.88842975]


0.8966771374066592

- Increase the L2 regularization strength

In [None]:
model = LogisticRegression(random_state=42,C=0.2)
scores = cross_val_score(model,train_ft,target,cv = cv ,scoring='roc_auc',n_jobs = -1)
print(scores) # per-fold validation scores
np.mean(scores) # mean CV score

[0.89843358 0.95115299 0.88123687 0.866515   0.8913623 ]


0.8977401494806436

- Evaluate the model on the test data

In [None]:
from sklearn.metrics import roc_auc_score

model = LogisticRegression(random_state=42,C=0.2)
model.fit(train_ft,target) # refit on the full training data

# predict and evaluate on the test data
y_test = test_target["survived"] # test-set labels
pred = model.predict_proba(test_ft)[:,1] # predicted probability of the positive class
roc_auc_score(y_test,pred) # AUC

0.8892719249862107
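
One way to read the AUC value: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes this directly with NumPy on a tiny made-up example. The helper `pairwise_auc` is hypothetical and is meant for intuition, not as a replacement for `roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
perfect = pairwise_auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9])

print(auc, perfect)
```

Because only the ordering of the scores matters, AUC is computed from `predict_proba` outputs rather than from hard 0/1 predictions.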