# EXPLORATION 02
## | Project (3): Breast Cancer Diagnosis
### | Project Workflow
1. Import the required modules
2. Prepare and explore the data
- Assign the feature data
- Assign the label data
- Print the target names
- Describe the data
3. Split the data into train and test sets
4. Train several models
- Try a Decision Tree
- Try a Random Forest
- Try an SVM
- Try an SGD Classifier
- Try Logistic Regression
5. Evaluate the models
- Choose an appropriate metric from those provided by sklearn.metrics
- Explain the choice

# 3. load_breast_cancer: Predicting Breast Cancer

### Data info
1. Classes: 2
2. Samples per class: 212(M), 357(B)
3. Samples total: 569
4. Dimensionality: 30
5. Features: real, positive

## 3_1. Import the required modules

In [52]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix  # needed by the confusion-matrix cells in section 3_4
from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

## 3_2. Prepare and explore the data

In [53]:
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()
bc_data = bc.data
bc_label = bc.target
bc_data.shape

(569, 30)

* The breast cancer dataset contains 569 samples, each described by 30 features.

In [54]:
print("Dataset description:\n{}".format(bc.DESCR))

Dataset description:
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is M

In [55]:
bc_label.shape

(569,)

In [56]:
print("Breast cancer dataset keys:\n{}".format(bc.keys()))

Breast cancer dataset keys:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [57]:
bc.target  # labels: 0 = malignant, 1 = benign

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [58]:
print("Feature names:\n{}".format(bc.feature_names))

Feature names:
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [59]:
bc.target_names

array(['malignant', 'benign'], dtype='<U9')

In [60]:
bc.target[[10, 80, 140]]

array([0, 1, 1])
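
The integer labels map onto `target_names` by position, so index 0 is `malignant` and index 1 is `benign`. As a minimal sketch (the variable names `counts` and `class_counts` are illustrative and not part of the original notebook), the class distribution listed under "Data info" can be checked like this:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# np.bincount counts how often each integer label (0 and 1) occurs.
counts = np.bincount(bc.target)

# Pair every class name with its sample count.
class_counts = {str(name): int(n) for name, n in zip(bc.target_names, counts)}
print(class_counts)  # {'malignant': 212, 'benign': 357}
```

The benign class is the larger one, which is worth keeping in mind when reading the per-class rows of the classification reports later on.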

In [61]:
bc_data[0]

array([1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
       3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
       8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
       3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
       1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01])

In [62]:
df = pd.DataFrame(bc_data, columns=bc.feature_names)
df.head()  # show the first five rows

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [63]:
df.shape

(569, 30)

In [64]:
df.describe()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


## 3_3. Split the data into train and test sets

In [65]:
X_train, X_test, y_train, y_test = train_test_split(bc_data,
                                                    bc_label,
                                                    test_size=0.3,
                                                    random_state=32)
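
The split above is purely random. One possible variation, shown here only as a sketch, is to pass `stratify` so that the malignant/benign ratio is kept roughly the same in both parts. This is an assumption about what might be useful, not what the notebook does, and it produces a different split, so the scores reported below would not carry over.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()

# Same 70/30 split and seed as above, but stratified on the labels
# (stratify is an addition that the original cell does not use).
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=32, stratify=bc.target)

print(X_train.shape, X_test.shape)   # (398, 30) (171, 30)

# Fraction of benign samples (label 1) in each part; the two should be close.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```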

## 3_4. Model training
### 3_4_(1) Decision Tree

In [66]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

decision_tree = DecisionTreeClassifier(random_state=32)
print(decision_tree._estimator_type)

classifier


In [67]:
decision_tree.fit(X_train, y_train)

DecisionTreeClassifier(random_state=32)

In [68]:
Decision_pred = decision_tree.predict(X_test)
Decision_pred

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0])

In [69]:
accuracy = accuracy_score(y_test, Decision_pred)
accuracy

0.9415204678362573

In [70]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
clf

DecisionTreeClassifier()

In [71]:
Decision_pred = clf.predict(X_test)
print(confusion_matrix(y_test, Decision_pred))

[[62  4]
 [11 94]]
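
In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, both in label order (here 0 = malignant, 1 = benign). The sketch below re-enters the matrix printed above by hand and reads the main quantities from it; the variable names are illustrative. The resulting accuracy is lower than the 0.9415 obtained earlier, most likely because this matrix comes from the second tree, `clf`, which was created without `random_state`.

```python
import numpy as np

# Confusion matrix printed above for the decision tree (copied by hand).
cm = np.array([[62, 4],
               [11, 94]])

correct = np.trace(cm)        # diagonal: correctly classified samples
total = cm.sum()              # all test samples
accuracy = correct / total

missed_malignant = cm[0, 1]   # truly malignant, predicted benign
false_alarms = cm[1, 0]       # truly benign, predicted malignant

print(f"accuracy = {accuracy:.4f}")            # accuracy = 0.9123
print(f"missed malignant = {missed_malignant}")  # missed malignant = 4
print(f"false alarms = {false_alarms}")          # false alarms = 11
```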


### 3_4_(2) Try Random Forest

In [72]:
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)
random_pred = random_forest.predict(X_test)
random_pred

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1])

In [73]:
print(classification_report(y_test, random_pred))

              precision    recall  f1-score   support

           0       0.95      0.92      0.94        66
           1       0.95      0.97      0.96       105

    accuracy                           0.95       171
   macro avg       0.95      0.95      0.95       171
weighted avg       0.95      0.95      0.95       171



In [74]:
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
clf

RandomForestClassifier()

In [75]:
random_pred = clf.predict(X_test)
print(confusion_matrix(y_test, random_pred))

[[ 63   3]
 [  4 101]]
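
A fitted `RandomForestClassifier` also exposes `feature_importances_`, which gives a rough indication of which features the forest relied on. The following is a minimal sketch that refits the same model on the same split; the names `forest` and `top5` are illustrative, and impurity-based importances should be read as a hint rather than a definitive ranking.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=32)

forest = RandomForestClassifier(random_state=32)
forest.fit(X_train, y_train)

# One importance value per feature; the values sum to 1.
importances = forest.feature_importances_

# Indices of the five largest importances, largest first.
top5 = np.argsort(importances)[::-1][:5]
for i in top5:
    print(f"{bc.feature_names[i]:<25s} {importances[i]:.3f}")
```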


### 3_4_(3) Try SVM

In [76]:
svm_model = svm.SVC()

print(svm_model._estimator_type)

classifier


In [77]:
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
svm_pred

array([1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1])

In [78]:
print(classification_report(y_test, svm_pred))

              precision    recall  f1-score   support

           0       0.93      0.76      0.83        66
           1       0.86      0.96      0.91       105

    accuracy                           0.88       171
   macro avg       0.89      0.86      0.87       171
weighted avg       0.89      0.88      0.88       171



In [79]:
clf = svm.SVC()
clf.fit(X_train, y_train)
clf

SVC()

In [80]:
svm_pred = clf.predict(X_test)
print(confusion_matrix(y_test, svm_pred))

[[ 50  16]
 [  4 101]]
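
The SVM misses 16 of the 66 malignant cases. A likely contributor is that the features are on very different scales (`mean area` reaches the thousands while `mean smoothness` stays around 0.1), and an RBF-kernel `SVC` is sensitive to that. As a sketch under that assumption, and not something the original notebook runs, the features can be standardized in a pipeline before fitting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=32)

# StandardScaler is fitted on the training data only, then applied to the
# test data inside the pipeline, so no test information leaks into training.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_train, y_train)

scaled_acc = accuracy_score(y_test, scaled_svm.predict(X_test))
print(f"accuracy with scaling: {scaled_acc:.3f}")
```

Comparing this number with the 0.88 in the report above shows how much of the gap is due to feature scaling rather than to the model itself.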


### 3_4_(4) Try SGD Classifier

In [81]:
from sklearn.linear_model import SGDClassifier
sgd_model = SGDClassifier()

print(sgd_model._estimator_type)

classifier


In [82]:
sgd_model.fit(X_train, y_train)
sgd_pred = sgd_model.predict(X_test)
sgd_pred

array([1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1])

In [83]:
print(classification_report(y_test, sgd_pred))

              precision    recall  f1-score   support

           0       0.90      0.83      0.87        66
           1       0.90      0.94      0.92       105

    accuracy                           0.90       171
   macro avg       0.90      0.89      0.89       171
weighted avg       0.90      0.90      0.90       171



In [84]:
clf = SGDClassifier()
clf.fit(X_train, y_train)
clf

SGDClassifier()

In [85]:
sgd_pred = clf.predict(X_test)
print(confusion_matrix(y_test, sgd_pred))

[[58  8]
 [ 8 97]]


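* The SGD classifier makes too many type II errors (8 false negatives in the confusion matrix above), so it is a poor model for this task.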
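
`SGDClassifier()` is created here without `random_state`, so each fit shuffles the data differently and the predictions can change from run to run. The sketch below fixes the seed and, as an additional assumption, standardizes the features first, since stochastic gradient descent is sensitive to feature scale. The helper `fit_predict` is hypothetical and exists only for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=32)


def fit_predict(seed):
    """Fit a scaled SGD classifier with a fixed seed and predict the test set."""
    model = make_pipeline(StandardScaler(), SGDClassifier(random_state=seed))
    model.fit(X_train, y_train)
    return model.predict(X_test)


first = fit_predict(32)
second = fit_predict(32)

# With the same seed, two independent fits give identical predictions.
same = bool((first == second).all())
print(same)  # True
```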

### 3_4_(5) Try Logistic Regression

In [86]:
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()

print(logistic_model._estimator_type)

classifier


In [87]:
logistic_model.fit(X_train, y_train)
logistic_pred = logistic_model.predict(X_test)
logistic_pred

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


array([1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1])

In [88]:
print(classification_report(y_test, logistic_pred))

              precision    recall  f1-score   support

           0       0.92      0.88      0.90        66
           1       0.93      0.95      0.94       105

    accuracy                           0.92       171
   macro avg       0.92      0.92      0.92       171
weighted avg       0.92      0.92      0.92       171



In [89]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [90]:
logistic_pred = clf.predict(X_test)
print(confusion_matrix(y_test, logistic_pred))

[[ 58   8]
 [  5 100]]
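
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` message above means the solver hit `max_iter` before converging. The warning itself suggests two remedies: raise `max_iter` or scale the data. A minimal sketch that does both follows; it is one possible fix rather than the notebook's own code, and the value 1000 is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.3, random_state=32)

# Standardize the features, then fit with a more generous iteration budget.
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logit.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
logit = scaled_logit.named_steps["logisticregression"]
print("iterations used:", logit.n_iter_[0])

logit_acc = accuracy_score(y_test, scaled_logit.predict(X_test))
print(f"accuracy: {logit_acc:.3f}")
```

If `n_iter_` stays well below `max_iter`, the solver converged and the warning no longer appears.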


## 3_5. Model evaluation

- [Explanation of the sklearn.metrics evaluation metrics](https://velog.io/@cha-suyeon/%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-sklearn.metrics-%ED%8F%89%EA%B0%80-%EC%A7%80%ED%91%9C)

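#### 1) Training approach
- Built machine learning classifiers to predict whether a tumor is malignant or benign.
- Trained five models on a 70% training split and validated each classifier on the remaining 30% test split.

#### 2) Selected evaluation metric
- f1-score

#### ※ Notes
- accuracy: the fraction of all evaluated samples whose class was predicted correctly.
- macro avg: the metric is computed separately for each class from its TP, FN, FP, and TN counts, and the per-class values are then averaged without weighting.
- weighted avg: the per-class values are averaged with each class weighted by its number of samples (its support).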
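
To make the difference between the two averages concrete, the sketch below recomputes them from the per-class f1-scores and supports in the SVM report of section 3_4_(3). The inputs are the rounded values printed in that report, so this is an approximation of what `classification_report` computes internally from unrounded scores.

```python
# Per-class f1-scores and supports copied from the SVM classification report.
f1 = [0.83, 0.91]      # class 0 (malignant), class 1 (benign)
support = [66, 105]    # number of test samples in each class

# Macro average: plain mean of the per-class scores.
macro_avg = sum(f1) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_avg = sum(f * n for f, n in zip(f1, support)) / sum(support)

print(f"macro avg    = {macro_avg:.2f}")     # macro avg    = 0.87
print(f"weighted avg = {weighted_avg:.2f}")  # weighted avg = 0.88
```

The weighted average is pulled toward the benign class because it has more samples.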

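#### 3) Reason for the choice
- For breast cancer diagnosis, precision (how many of the predicted positives are actually positive) and recall (how many of the actual positives are predicted as positive) both matter, so the F1-score, which is used when both metrics are important for judging a model, was chosen.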
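
As a worked example, precision, recall, and F1 can be computed by hand from the Random Forest confusion matrix in section 3_4_(2), `[[63, 3], [4, 101]]`. The sketch treats malignant (class 0) as the positive class, which is an assumption made for this illustration; the helper name `precision_recall_f1` is hypothetical. Because that matrix comes from the forest fitted without `random_state`, the numbers need not match the classification report printed for the seeded model.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Counts read from [[63, 3], [4, 101]] with malignant (class 0) as positive.
tp = 63   # malignant predicted malignant
fn = 3    # malignant predicted benign
fp = 4    # benign predicted malignant

p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision = {p:.3f}")   # precision = 0.940
print(f"recall    = {r:.3f}")   # recall    = 0.955
print(f"f1        = {f1:.3f}")  # f1        = 0.947
```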

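#### 4) Classification results

||model|accuracy|
|------|---|---|
|0|Decision Tree|0.94|
|1|Random Forest|0.95|
|2|SVM|0.88|
|3|SGD Classifier|0.90|
|4|Logistic Regression|0.92|

#### 5) Summary of results
- Random Forest performs best at 0.95.
- SVM is lowest at 0.88 in the reports shown above. SGD scored 0.90 in the run shown, but because it was created without a fixed random_state its score changes from run to run and can drop much lower (0.74 was observed).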

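#### 6) Interpretation
- The F1 score is the harmonic mean of precision and recall, so it comes out low if either one is extremely low; both must be high for F1 to be high.
- F1 is also commonly used when the classes are imbalanced, as here: malignant (0) has 212 samples and benign (1) has 357.
- In the confusion matrices, Random Forest produced the fewest type II errors (FN), both in count and in rate: only 3 of 66 malignant cases were predicted as benign, so it is a good model.
- Having practiced finding a good classifier, a natural next step is to analyze which features contribute most to a positive result.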
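
The comparison of type II errors can be reproduced from the five confusion matrices printed in section 3_4. The sketch below copies them by hand and counts, for each model, the malignant cases (row 0) that were predicted as benign (column 1). Treating malignant as the positive class is an assumption of this illustration.

```python
# Confusion matrices copied from section 3_4 (rows: true, columns: predicted).
confusion = {
    "Decision Tree":       [[62, 4], [11, 94]],
    "Random Forest":       [[63, 3], [4, 101]],
    "SVM":                 [[50, 16], [4, 101]],
    "SGD Classifier":      [[58, 8], [8, 97]],
    "Logistic Regression": [[58, 8], [5, 100]],
}

missed = {}
for name, cm in confusion.items():
    total_malignant = sum(cm[0])    # all truly malignant test samples
    missed[name] = cm[0][1]         # malignant predicted as benign
    rate = missed[name] / total_malignant
    print(f"{name:<20s} missed {missed[name]:>2d} of {total_malignant} ({rate:.1%})")

best = min(missed, key=missed.get)
print("fewest missed malignant cases:", best)  # Random Forest
```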