> Classification algorithms
* Decision Tree
* Logistic Regression
* Naive Bayes
* Support Vector Machine (SVM)
* K-Nearest Neighbors (KNN)
* Random Forest
* Neural Network

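# 1. Classification with a Decision Tree


## 1) Decision tree algorithm

> Characteristics
* Results are easy to interpret and understand.
* Works with both numerical and categorical data.
* Prone to overfitting, so model complexity must be tuned appropriately to keep the model from overfitting.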
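The overfitting point can be illustrated with a minimal sketch. It assumes a synthetic, noisy dataset from `make_classification` (not the Titanic data used in this notebook) and uses `max_depth` as one possible way to limit complexity; the exact scores depend on the data and library version.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary data (an assumption; not the Titanic data).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unrestricted tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Capping the depth is one simple way to limit model complexity.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print(name,
          "depth:", model.get_depth(),
          "train acc:", round(model.score(X_train, y_train), 3),
          "test acc:", round(model.score(X_test, y_test), 3))
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting. Other parameters such as `min_samples_leaf` can serve the same purpose.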


## 2) Classifying Titanic survivors with a decision tree

### (1) Import required packages

In [None]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

### (2) Load the data

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

### (3) Explore the data

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df.shape

(891, 12)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


-> Age, Cabin, and Embarked contain missing values.

In [None]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


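### (4) Preprocess the data

* Fill missing values in the Age column with the mean.
* Fill missing values in the Embarked column with the mode.
* Exclude the Cabin column from the analysis because it has too many missing values.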

In [None]:
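# Assign the result back instead of calling fillna(inplace=True) on a column;
# the chained in-place form may not update the DataFrame in newer pandas versions.
df["Age"] = df["Age"].fillna(df["Age"].mean())

df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])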

* Encode the Sex column as numbers.
* Encode the Embarked column as numbers as well.
* Use LabelEncoder from sklearn.preprocessing.

In [None]:
from sklearn.preprocessing import LabelEncoder

df.Sex = LabelEncoder().fit_transform(df.Sex)
df.Embarked = LabelEncoder().fit_transform(df.Embarked)
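To make the encoding concrete, here is a small sketch of the mapping `LabelEncoder` produces. The short lists are made-up stand-ins for the Sex and Embarked columns. The encoder stores its classes in sorted order, so the integer codes follow alphabetical order rather than any meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Sex column (illustrative values only).
sex_encoder = LabelEncoder()
codes = sex_encoder.fit_transform(["male", "female", "female", "male"])
print(list(sex_encoder.classes_))  # classes are kept in sorted order
print(list(codes))                 # each value replaced by its class index

# Toy stand-in for the Embarked column.
embarked_encoder = LabelEncoder().fit(["S", "C", "Q", "S"])
mapping = {label: int(code) for code, label in enumerate(embarked_encoder.classes_)}
print(mapping)
```

Because the codes for Embarked imply an order (C < Q < S) that has no real meaning, one-hot encoding is a common alternative for distance-based or linear models.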

* The number of family members aboard is split across two columns, SibSp and Parch, so add them to create a derived variable, FamilySize.

In [None]:
df["FamilySize"] = df.SibSp + df.Parch

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize
0,1,0,3,"Braund, Mr. Owen Harris",1,22.000000,1,0,A/5 21171,7.2500,,2,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.000000,1,0,PC 17599,71.2833,C85,0,1
2,3,1,3,"Heikkinen, Miss. Laina",0,26.000000,0,0,STON/O2. 3101282,7.9250,,2,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.000000,1,0,113803,53.1000,C123,2,1
4,5,0,3,"Allen, Mr. William Henry",1,35.000000,0,0,373450,8.0500,,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",1,27.000000,0,0,211536,13.0000,,2,0
887,888,1,1,"Graham, Miss. Margaret Edith",0,19.000000,0,0,112053,30.0000,B42,2,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,29.699118,1,2,W./C. 6607,23.4500,,2,3
889,890,1,1,"Behr, Mr. Karl Howell",1,26.000000,0,0,111369,30.0000,C148,0,0


### (5) Prepare the analysis dataset

In [None]:
X = df[["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize"]]
y = df["Survived"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(712, 6)
(179, 6)
(712,)
(179,)


### (6) Run the analysis

In [None]:
dt = DecisionTreeClassifier(random_state=111)
dt.fit(X_train, y_train)

In [None]:
pred = dt.predict(X_test)

### (7) Performance evaluation and visualization

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
acc

0.7541899441340782

#### Additional analysis with KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

acc = accuracy_score(y_test, pred)
print(acc)

confusion_matrix(y_test, pred)

0.6703910614525139


array([[91, 22],
       [37, 29]])

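# 2. Classification with KNN

## 1) KNN algorithm

> Characteristics
* The operating principle is simple, so it is easy to understand and implement.
* It is distance-based, so it performs well on numeric attributes.
* Each prediction computes the distance to every training sample, so prediction slows down as the number of samples or dimensions grows.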
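The principle described above can be sketched in a few lines of plain Python. This is an illustrative toy (the function name `knn_predict` and the sample points are made up here), not how scikit-learn implements `KNeighborsClassifier` internally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point (the costly step).
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (hypothetical data).
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))
print(knn_predict(train_points, train_labels, (5.1, 5.0)))
```

The list comprehension visits every training point for a single query, which is why prediction cost grows with the size of the training set.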

## 2) Classifying iris species with KNN

### (1) Import required packages

In [None]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

### (2) Load the data

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

### (3) Explore the data

In [None]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [None]:
df.shape

(150, 5)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [None]:
df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### (4) Preprocess the data

* Apply Min-Max normalization to the four independent variables.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler scales each column independently, so all four can be passed at once.
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])

df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,0.222222,0.625000,0.067797,0.041667,setosa
1,0.166667,0.416667,0.067797,0.041667,setosa
2,0.111111,0.500000,0.050847,0.041667,setosa
3,0.083333,0.458333,0.084746,0.041667,setosa
4,0.194444,0.666667,0.067797,0.041667,setosa
...,...,...,...,...,...
145,0.666667,0.416667,0.711864,0.916667,virginica
146,0.555556,0.208333,0.677966,0.750000,virginica
147,0.611111,0.416667,0.711864,0.791667,virginica
148,0.527778,0.583333,0.745763,0.916667,virginica
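As a check on what Min-Max normalization does, the formula `(x - min) / (max - min)` can be applied by hand. The sketch below uses a few sepal_length values taken from the tables above (the minimum 4.3, the first row 5.1, the median 5.8, and the maximum 7.9); the helper name `min_max_scale` is made up for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; a constant column would need special handling.
    return [(v - lo) / (hi - lo) for v in values]

sepal_length = [4.3, 5.1, 5.8, 7.9]
scaled = min_max_scale(sepal_length)
print([round(s, 6) for s in scaled])
```

The value 5.1 maps to roughly 0.222222, which agrees with the first row of the scaled table.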


### (5) Prepare the analysis dataset

In [None]:
X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
y = df["species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(120, 4)
(30, 4)
(120,)
(30,)


### (6) Run the analysis

In [None]:
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

### (7) Performance evaluation and visualization

* Accuracy

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
acc

0.9666666666666667

* Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)

array([[10,  0,  0],
       [ 0,  7,  0],
       [ 0,  1, 12]])

* Recall, precision, and F1-score

In [None]:
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.88      1.00      0.93         7
   virginica       1.00      0.92      0.96        13

    accuracy                           0.97        30
   macro avg       0.96      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
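The per-class numbers in the report can be recomputed directly from the confusion matrix shown earlier. This sketch copies that 3x3 matrix by hand (rows are true classes, columns are predicted classes); the helper name `per_class_metrics` is an illustrative assumption.

```python
labels = ["setosa", "versicolor", "virginica"]
# Rows are true classes, columns are predicted classes.
matrix = [[10, 0, 0],
          [0, 7, 0],
          [0, 1, 12]]

def per_class_metrics(matrix, index):
    """Return (precision, recall, f1) for the class at `index`."""
    true_positive = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column sum
    actual = sum(matrix[index])                    # row sum
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for i, name in enumerate(labels):
    p, r, f = per_class_metrics(matrix, i)
    print(f"{name:>10}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

Precision divides by the column total (everything predicted as that class), while recall divides by the row total (everything that truly belongs to it).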



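# 3. Classification with SVM

## 1) SVM algorithm

> Characteristics
* The kernel trick allows classification suited to a variety of data characteristics.
* High accuracy can be expected from relatively little data (preprocessing must represent the characteristics of the data well).
* With many variables, visualization is difficult, which makes the classification result hard to interpret.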
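A quick way to see what the kernel trick buys is to compare a linear and an RBF kernel on data that is not linearly separable. The sketch below assumes a synthetic concentric-circles dataset from `make_circles`, which is not part of this notebook's data; exact scores will vary.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them (synthetic data).
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for kernel in ["linear", "rbf"]:
    model = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(kernel, round(scores[kernel], 3))
```

On this kind of data the RBF kernel should clearly outperform the linear one, because it can form a closed, curved decision boundary.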

## 2) Classifying Titanic survivors with SVM

### (1) Import required packages

In [None]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.svm import SVC

from sklearn.model_selection import train_test_split

### (2) Load the data

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

### (3) Explore the data

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


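### (4) Preprocess the data

* Fill missing values in the Age column with the mean.
* Fill missing values in the Embarked column with the mode.
* Exclude the Cabin column from the analysis because it has many missing values.
* Create a derived variable for the number of family members aboard.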

In [None]:
df["Age"] = df["Age"].fillna(df["Age"].mean())
# mode() returns a Series, so take its first element as the fill value.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df["FamilySize"] = df.SibSp + df.Parch

* One-hot encode the Sex and Embarked columns.
* Use the pandas get_dummies() function.

In [None]:
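# dtype=int keeps the indicator columns as 0/1 (newer pandas versions default to bool).
onehot_sex = pd.get_dummies(df.Sex, dtype=int)
df = pd.concat([df, onehot_sex], axis=1)

onehot_embarked = pd.get_dummies(df.Embarked, dtype=int)
df = pd.concat([df, onehot_embarked], axis=1)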

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize,female,male,C,Q,S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,0,1,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,1,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,1,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,1,0,0,0,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,0,1,0,0,1


### (5) Prepare the analysis dataset

In [None]:
X = df[["Pclass", "Age", "Fare", "FamilySize", "female", "male", "C", "Q", "S"]]
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=111)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(623, 9)
(268, 9)
(623,)
(268,)


### (6) Run the analysis

* The kernel option accepts rbf (Radial Basis Function), linear, poly (polynomial), sigmoid, and others.

In [None]:
sv = SVC(kernel="rbf")
sv.fit(X_train, y_train)
pred = sv.predict(X_test)

### (7) Performance evaluation and visualization

* Accuracy

In [None]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
acc

0.6716417910447762

* Confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, pred)
print(mat)

[[144  27]
 [ 61  36]]


* Precision, recall, and F1-score

In [None]:
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)

              precision    recall  f1-score   support

           0       0.70      0.84      0.77       171
           1       0.57      0.37      0.45        97

    accuracy                           0.67       268
   macro avg       0.64      0.61      0.61       268
weighted avg       0.66      0.67      0.65       268



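## 3) Tuning SVM kernel parameters

* Adjust the decision boundary by varying kernel, C (cost), and gamma (inversely related to the allowed standard deviation of the kernel).
* C controls the size of the margin: a smaller C tolerates more margin violations and gives a wider margin.
* As gamma increases, the allowed standard deviation shrinks, so each sample's region of influence gets smaller and the decision boundary becomes more curved.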

In [None]:
# Note: gamma only affects the rbf, poly, and sigmoid kernels; the linear kernel ignores it.
sv = SVC(kernel="linear", C=10, gamma=0.01)
sv.fit(X_train, y_train)
pred = sv.predict(X_test)

acc = accuracy_score(y_test, pred)
mat = confusion_matrix(y_test, pred)
rpt = classification_report(y_test, pred)

print(acc)
print(mat)
print(rpt)

0.7723880597014925
[[142  29]
 [ 32  65]]
              precision    recall  f1-score   support

           0       0.82      0.83      0.82       171
           1       0.69      0.67      0.68        97

    accuracy                           0.77       268
   macro avg       0.75      0.75      0.75       268
weighted avg       0.77      0.77      0.77       268
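Since gamma has no effect on a linear kernel, its influence is easier to observe with `kernel="rbf"`. The following sketch assumes a synthetic two-moons dataset from `make_moons` (not the Titanic data) and only looks at training accuracy, so it shows how tightly the boundary wraps the training points rather than how well the model generalizes.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Noisy two-moons data (synthetic; chosen only to make the effect visible).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

train_acc = {}
for gamma in [0.01, 1, 100]:
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    train_acc[gamma] = model.score(X, y)
    print(f"gamma={gamma}: train acc={train_acc[gamma]:.3f}, "
          f"support vectors={model.n_support_.sum()}")
```

A very large gamma tends to fit the training points closely, which usually signals overfitting. In practice C and gamma are typically chosen together with cross-validation, for example via `GridSearchCV`.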



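# 4. Classification with Logistic Regression

## 1) Logistic regression algorithm

> Characteristics
* Takes the output of a linear regression as input and classifies it into a specific label.
* Uses the sigmoid function.
* Uses the available data to find the optimal weight and bias values by moving in the direction that reduces the error.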
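The first two points can be sketched in plain Python: a linear score `z = weight * x + bias` is passed through the sigmoid to obtain a probability, which is then thresholded at 0.5. The weight and bias values below are hypothetical, chosen only for illustration rather than learned from data.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters; a real model learns these from the training data.
weight, bias = 2.0, -1.0

for x in [-2.0, 0.5, 3.0]:
    z = weight * x + bias          # linear part
    probability = sigmoid(z)       # squashed into (0, 1)
    label = int(probability >= 0.5)
    print(x, round(probability, 3), label)
```

A score of exactly 0 maps to a probability of 0.5, so the decision boundary is where the linear part equals zero.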

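> Approach
* Classification accuracy varies with the type and strength of regularization, so finding appropriate values is important.
* Regularization is used to prevent overfitting.
* The regularization type is set with the penalty parameter of the LogisticRegression class; L2 (ridge) is the default, and L1 (lasso) can be selected with a solver that supports it.
* Regularization strength is set with the C parameter; the default is 1.0, and smaller values mean stronger regularization.
* The predict_proba() method returns the probability of belonging to each class.
* The decision_function() method returns the value of the linear equation for each sample; the fitted coefficients themselves are available in coef_ and intercept_.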
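The last two points can be checked with a short sketch. It assumes the copy of the iris data bundled with scikit-learn (`load_iris`) instead of the CSV used below, and it fits on the unscaled features, so the numbers will not match the outputs in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# C=1.0 is the default regularization strength; max_iter is raised so the solver converges.
lr = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

proba = lr.predict_proba(X[:3])        # one row per sample, one column per class
scores = lr.decision_function(X[:3])   # raw linear scores per class
manual = X[:3] @ lr.coef_.T + lr.intercept_

print(proba.round(3))
print(np.allclose(scores, manual))     # the scores come straight from coef_ and intercept_
```

Each row of `predict_proba` sums to 1, and `predict` returns the class with the highest probability in that row.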


## 2) Classifying iris species with logistic regression


### (1) Import required packages

In [2]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

### (2) Load the data

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

### (3) Explore the data

In [5]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [7]:
df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### (4) Preprocess the data

* Apply Min-Max normalization to the four independent variables.

In [8]:
from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler scales each column independently, so all four can be passed at once.
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])

### (5) Prepare the analysis dataset

In [11]:
X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
y = df["species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(120, 4)
(30, 4)
(120,)
(30,)


### (6) Run the analysis

In [12]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

pred = lr.predict(X_test)

### (7) Performance evaluation and visualization

* Accuracy

In [13]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
acc

0.8666666666666667

* Confusion matrix

In [14]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, pred)
mat

array([[10,  0,  0],
       [ 0,  4,  3],
       [ 0,  1, 12]])

* Precision, recall, and F1-score

In [15]:
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.80      0.57      0.67         7
   virginica       0.80      0.92      0.86        13

    accuracy                           0.87        30
   macro avg       0.87      0.83      0.84        30
weighted avg       0.87      0.87      0.86        30



In [17]:
lr.predict_proba(X_test)

array([[0.85493234, 0.13833697, 0.00673069],
       [0.82797416, 0.16153837, 0.01048747],
       [0.00350905, 0.09779845, 0.89869251],
       [0.02949571, 0.46436668, 0.50613761],
       [0.03331478, 0.42106569, 0.54561953],
       [0.8461796 , 0.14622749, 0.00759291],
       [0.87477447, 0.1192301 , 0.00599543],
       [0.01183667, 0.25620286, 0.73196048],
       [0.00547795, 0.14242547, 0.85209658],
       [0.07510837, 0.43132119, 0.49357044],
       [0.00581459, 0.12951086, 0.86467454],
       [0.85923732, 0.13367938, 0.0070833 ],
       [0.03203171, 0.64390093, 0.32406736],
       [0.01232919, 0.22271063, 0.76496018],
       [0.00289723, 0.17927666, 0.81782611],
       [0.91997412, 0.07021686, 0.00980902],
       [0.03617944, 0.45832129, 0.50549927],
       [0.25710552, 0.65650975, 0.08638473],
       [0.85074961, 0.14378851, 0.00546188],
       [0.00637194, 0.18305861, 0.81056945],
       [0.09353509, 0.62303455, 0.28343036],
       [0.02569438, 0.35418582, 0.6201198 ],
       [0.

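# 5. Classification with Random Forest

## 1) Random forest algorithm

> Characteristics
* Performs relatively well across a wide range of problems.
* Averaging many trees reduces variance (at the cost of slightly higher bias), which lowers the risk of overfitting.
* The trees each end up with slightly different characteristics, which improves generalization.
* During sampling, the same sample can be drawn more than once.
* Good results can often be obtained with the default parameter settings alone.
* A random forest's feature importances aggregate the feature importances of its individual trees.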
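The sampling point refers to bootstrap sampling, that is, drawing rows with replacement. A minimal sketch with the standard library shows why duplicates are expected; the sample size of 1000 is an arbitrary choice for illustration.

```python
import random

random.seed(0)
n = 1000
indices = list(range(n))

# Sampling with replacement: the same row index can be drawn several times.
bootstrap = random.choices(indices, k=n)
unique_fraction = len(set(bootstrap)) / n

print(f"unique rows in the bootstrap sample: {unique_fraction:.1%}")
```

For large n, roughly 63% of the original rows appear in a bootstrap sample, so each tree sees a slightly different subset of the data.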

> Approach
* Tune the number of trees (n_estimators) and the depth of each tree (max_depth) to improve prediction accuracy.
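The two parameters, and the earlier remark about feature importances, can be explored with a short sketch. It assumes the iris data bundled with scikit-learn (`load_iris`) rather than the Titanic CSV, and the parameter values are arbitrary examples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)

# Each fitted tree exposes its own importances; the forest reports their average.
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

print(len(rf.estimators_))                 # number of trees actually built
print(rf.feature_importances_.round(3))    # forest-level importances
print(np.allclose(rf.feature_importances_, per_tree.mean(axis=0)))
```

In practice, n_estimators and max_depth are usually chosen by cross-validation rather than fixed by hand.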

## 2) Classifying Titanic survivors with a random forest

### (1) Import required packages

In [20]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

### (2) Load the data

In [38]:
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

### (3) Explore the data

In [33]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [41]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,13.002015,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,22.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,29.699118,0.0,0.0,14.4542
75%,668.5,1.0,3.0,35.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


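### (4) Preprocess the data

* Fill missing values in the Age column with the mean.
* Fill missing values in the Embarked column with the mode.
* Exclude the Cabin column from the analysis because it has many missing values.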

In [40]:
df["Age"] = df["Age"].fillna(df["Age"].mean())
# mode() returns a Series, so take its first element as the fill value.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

* Label-encode the Sex and Embarked columns.

In [42]:
from sklearn.preprocessing import LabelEncoder

df.Sex = LabelEncoder().fit_transform(df.Sex)

df.Embarked = LabelEncoder().fit_transform(df.Embarked)

* For the number of family members aboard, add the two columns SibSp and Parch to create a derived variable, FamilySize.

In [43]:
df["FamilySize"] = df.SibSp + df.Parch

### (5) Prepare the analysis dataset

In [44]:
X = df[["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize"]]
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(712, 6)
(179, 6)
(712,)
(179,)


### (6) Run the analysis

In [55]:
rf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=20)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)

### (7) Performance evaluation and visualization

In [56]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

acc = accuracy_score(y_test, pred)
mat = confusion_matrix(y_test,  pred)
rpt = classification_report(y_test, pred)

print(acc)
print(mat)
print(rpt)

0.8603351955307262
[[107   7]
 [ 18  47]]
              precision    recall  f1-score   support

           0       0.86      0.94      0.90       114
           1       0.87      0.72      0.79        65

    accuracy                           0.86       179
   macro avg       0.86      0.83      0.84       179
weighted avg       0.86      0.86      0.86       179

