# Project: Can You Classify the Three Iris Species?
## (1) Classifying handwritten digits

### 1. Importing the required modules

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### 2. Preparing and inspecting the data

In [2]:
digits = load_digits()
digits.keys()

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])

### 3. Understanding the data
#### Assigning the feature data

In [3]:
digits_data = digits.data
digits_data

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])
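Each row of `digits.data` is an 8x8 image flattened into 64 pixel values. The following is a minimal sketch to confirm that, using only the `data` and `images` keys listed by `digits.keys()`; the variable names `first_image_flat` and `same` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# One row per sample, one column per pixel
print(digits.data.shape)

# The same samples kept as 2-D images
print(digits.images.shape)

# Flattening the first image row by row should reproduce the first feature row
first_image_flat = digits.images[0].reshape(-1)
same = np.array_equal(first_image_flat, digits.data[0])
print(same)
```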

#### Assigning the label data

In [4]:
digits_label = digits.target
digits_label

array([0, 1, 2, ..., 8, 9, 8])

#### Printing the target names

In [5]:
digits.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

#### Describing the data

In [6]:
digits.DESCR  # quick summary of the dataset

".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 1797\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixel

### 4. Splitting into train and test data

In [7]:
# Create X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(digits_data, 
                                                    digits_label, 
                                                    test_size=0.2, 
                                                    random_state=1)
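To check what the split produced, the sketch below repeats it and prints the resulting shapes. The second call adds `stratify`, which is an assumption and not part of the original notebook: it asks `train_test_split` to keep the class proportions of the full dataset in both subsets.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Same split as above: 20% held out for testing, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1
)
print(X_train.shape, X_test.shape)

# Assumption (not in the original notebook): a stratified variant of the split
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1,
    stratify=digits.target,
)
print(Xs_train.shape, Xs_test.shape)
```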

### 5. Training a variety of models
#### Decision tree

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Define and train the decision tree
decision_tree = DecisionTreeClassifier(random_state=1)
decision_tree.fit(X_train, y_train)

# Predict with the decision tree
y_pred_dt = decision_tree.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred_dt)
acc

0.8388888888888889

#### Random Forest

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

random_forest = RandomForestClassifier(random_state=1)
random_forest.fit(X_train, y_train)

y_pred_rf = random_forest.predict(X_test)

acc = accuracy_score(y_test, y_pred_rf)
acc

0.9833333333333333

#### SVM

In [10]:
from sklearn import svm
from sklearn.metrics import accuracy_score

svm_model = svm.SVC(random_state=1)
svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_svm)
acc

0.9916666666666667

#### SGD Classifier

In [11]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

sgd_model = SGDClassifier(random_state=1)
sgd_model.fit(X_train, y_train)

y_pred_sgd = sgd_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_sgd)
acc

0.9472222222222222

#### Logistic Regression

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logistic_model = LogisticRegression(max_iter = 3000, random_state=1)
logistic_model.fit(X_train, y_train)

y_pred_log = logistic_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_log)
acc

0.9722222222222222
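The five cells above repeat the same fit, predict, score pattern. The sketch below runs the same comparison as a loop. The `models` and `accuracies` dictionaries are names introduced here for illustration, and exact scores may differ slightly between scikit-learn versions.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1
)

# Same five classifiers and settings as in the cells above
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "SVM": SVC(random_state=1),
    "SGD Classifier": SGDClassifier(random_state=1),
    "Logistic Regression": LogisticRegression(max_iter=3000, random_state=1),
}

# Train each model and record its test accuracy
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print from best to worst
for name, acc in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```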

### 6. Evaluating the models

In [13]:
from sklearn.metrics import classification_report

print("Decision Tree")
print(classification_report(y_test, y_pred_dt))
print("------------------------------------------------------")
print()

print("Random Forest")
print(classification_report(y_test, y_pred_rf))
print("------------------------------------------------------")
print()

print("SVM")
print(classification_report(y_test, y_pred_svm))
print("------------------------------------------------------")
print()

print("SGD Classifier")
print(classification_report(y_test, y_pred_sgd))
print("------------------------------------------------------")
print()

print("Logistic Regression")
print(classification_report(y_test, y_pred_log))

Decision Tree
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.91      0.83      0.87        35
           2       0.83      0.83      0.83        36
           3       0.78      0.71      0.74        41
           4       0.82      0.84      0.83        38
           5       0.78      0.97      0.87        30
           6       0.88      0.97      0.92        37
           7       0.78      0.76      0.77        37
           8       0.83      0.86      0.85        29
           9       0.75      0.71      0.73        34

    accuracy                           0.84       360
   macro avg       0.84      0.84      0.84       360
weighted avg       0.84      0.84      0.84       360

------------------------------------------------------

Random Forest
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       1.00      1.00      1.00     

- All four models other than Decision Tree reached an accuracy of 90% or higher.
- Checking the confusion_matrix shows that Decision Tree has a generally higher error rate than the other models.
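The cells above print only `classification_report`, so here is a minimal sketch of how the confusion matrix for the Decision Tree could be computed. It retrains the model with the same split and seed; `cm` and `n_errors` are names introduced here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1
)

decision_tree = DecisionTreeClassifier(random_state=1)
decision_tree.fit(X_train, y_train)
y_pred_dt = decision_tree.predict(X_test)

# Rows are the true digits, columns are the predicted digits
cm = confusion_matrix(y_test, y_pred_dt)
print(cm)

# Everything off the diagonal is a misclassified sample
n_errors = cm.sum() - cm.trace()
print(n_errors, "errors out of", cm.sum(), "test samples")
```

The diagonal sum divided by the total equals the accuracy, so the matrix adds detail about which digits are confused with which.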

### On the handwritten digit data, the SVM model performs best, with an accuracy of about 99.17%.


## (2) Classifying wine

### 1. Importing the required modules

In [14]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### 2. Preparing the data

In [15]:
wine = load_wine()
wine.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

### 3. Understanding the data
#### Assigning the feature data

In [16]:
wine_data = wine.data
wine_data

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

#### Assigning the label data

In [17]:
wine_label = wine.target
wine_label

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

#### Printing the target names

In [18]:
wine.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')
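The label array above is sorted by class, which makes the class sizes hard to read at a glance. A short sketch that counts the samples per class with `np.bincount`; the `counts` name is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Number of samples and number of features
print(wine.data.shape)

# Number of samples for each integer label 0, 1, 2
counts = np.bincount(wine.target)
for name, count in zip(wine.target_names, counts):
    print(name, count)
```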

#### Describing the data

In [19]:
wine.DESCR



### 4. Splitting into train and test data

In [20]:
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=2)

### 5. Training a variety of models
#### Decision tree

In [21]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

decision_tree = DecisionTreeClassifier(random_state=2)
decision_tree.fit(X_train, y_train)

y_pred_dt = decision_tree.predict(X_test)

acc = accuracy_score(y_test, y_pred_dt)
acc

0.9444444444444444

#### Random Forest

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

random_forest = RandomForestClassifier(random_state=2)
random_forest.fit(X_train, y_train)

y_pred_rf = random_forest.predict(X_test)

acc = accuracy_score(y_test, y_pred_rf)
acc

1.0

#### SVM

In [23]:
from sklearn import svm
from sklearn.metrics import accuracy_score

svm_model = svm.SVC(random_state=2)
svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_svm)
acc

0.6944444444444444

#### SGD Classifier

In [24]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

sgd_model = SGDClassifier(random_state=2)
sgd_model.fit(X_train, y_train)

y_pred_sgd = sgd_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_sgd)
acc

0.75

#### Logistic Regression

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logistic_model = LogisticRegression(max_iter = 3000, random_state=2)
logistic_model.fit(X_train, y_train)

y_pred_log = logistic_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_log)
acc

0.9444444444444444

### 6. Evaluating the models

In [26]:
from sklearn.metrics import classification_report

print("Decision Tree")
print(classification_report(y_test, y_pred_dt))
print("------------------------------------------------------")
print()

print("Random Forest")
print(classification_report(y_test, y_pred_rf))
print("------------------------------------------------------")
print()

print("SVM")
print(classification_report(y_test, y_pred_svm))
print("------------------------------------------------------")
print()

print("SGD Classifier")
print(classification_report(y_test, y_pred_sgd))
print("------------------------------------------------------")
print()

print("Logistic Regression")
print(classification_report(y_test, y_pred_log))

Decision Tree
              precision    recall  f1-score   support

           0       1.00      0.89      0.94        18
           1       0.82      1.00      0.90         9
           2       1.00      1.00      1.00         9

    accuracy                           0.94        36
   macro avg       0.94      0.96      0.95        36
weighted avg       0.95      0.94      0.95        36

------------------------------------------------------

Random Forest
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        18
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00         9

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36

------------------------------------------------------

SVM
              precision    recall  f1-score   support

           0       0.94      0.89      

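- All models except SVM and SGD Classifier reached an accuracy of 90% or higher.
- Checking the confusion_matrix shows that SVM and SGD Classifier have high error rates.
- SVM and SGD Classifier were therefore judged unsuitable for classifying the wine data, at least as configured here (default settings, no feature scaling).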
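One likely reason for the weak SVM result is feature scale: the wine features range from values below 1 to values above 1000, and SVM is sensitive to that. The sketch below compares the SVM from above with a variant that standardizes the features first. The scaling step (`StandardScaler` inside `make_pipeline`) is an assumption added for illustration and is not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=2
)

# SVM on the raw features, as in the cell above
raw_svm = SVC(random_state=2)
raw_svm.fit(X_train, y_train)
raw_acc = accuracy_score(y_test, raw_svm.predict(X_test))

# Assumption: standardize each feature to zero mean and unit variance first.
# The pipeline fits the scaler on the training data only.
scaled_svm = make_pipeline(StandardScaler(), SVC(random_state=2))
scaled_svm.fit(X_train, y_train)
scaled_acc = accuracy_score(y_test, scaled_svm.predict(X_test))

print(f"SVM without scaling: {raw_acc:.4f}")
print(f"SVM with scaling:    {scaled_acc:.4f}")
```

The same pipeline idea applies to `SGDClassifier`, which is also sensitive to feature scale.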

### On the wine data, the Random Forest model performs best, with an accuracy of 100%.


## (3) Diagnosing breast cancer

### 1. Importing the required modules

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### 2. Preparing the data

In [3]:
breast_cancer = load_breast_cancer()
breast_cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

### 3. Understanding the data
#### Assigning the feature data

In [4]:
breast_cancer_data = breast_cancer.data
breast_cancer_data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

#### Assigning the label data

In [5]:
breast_cancer_label = breast_cancer.target
breast_cancer_label

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

#### Printing the target names

In [6]:
breast_cancer.target_names

array(['malignant', 'benign'], dtype='<U9')
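Since `target_names` lists `malignant` first, label 0 means malignant and label 1 means benign. A short sketch that confirms the mapping and counts the samples in each class; the `counts` name is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

# Number of samples and number of features
print(breast_cancer.data.shape)

# The index into target_names is the integer label
counts = np.bincount(breast_cancer.target)
for label, (name, count) in enumerate(zip(breast_cancer.target_names, counts)):
    print(label, name, count)
```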

#### Describing the data

In [7]:
breast_cancer.DESCR



### 4. Splitting into train and test data

In [8]:
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=3)

### 5. Training a variety of models
#### Decision tree

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

decision_tree = DecisionTreeClassifier(random_state=3)
decision_tree.fit(X_train, y_train)

y_pred_dt = decision_tree.predict(X_test)

acc = accuracy_score(y_test, y_pred_dt)
acc

0.8947368421052632

#### Random Forest

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

random_forest = RandomForestClassifier(random_state=3)
random_forest.fit(X_train, y_train)

y_pred_rf = random_forest.predict(X_test)

acc = accuracy_score(y_test, y_pred_rf)
acc

0.9385964912280702

#### SVM

In [11]:
from sklearn import svm
from sklearn.metrics import accuracy_score

svm_model = svm.SVC(random_state=3)
svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_svm)
acc

0.9122807017543859

#### SGD Classifier

In [12]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

sgd_model = SGDClassifier(random_state=3)
sgd_model.fit(X_train, y_train)

y_pred_sgd = sgd_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_sgd)
acc

0.9122807017543859

#### Logistic Regression

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logistic_model = LogisticRegression(max_iter = 3000, random_state=3)
logistic_model.fit(X_train, y_train)

y_pred_log = logistic_model.predict(X_test)

acc = accuracy_score(y_test, y_pred_log)
acc

0.9385964912280702

### 6. Evaluating the models

In [15]:
from sklearn.metrics import classification_report

print("Decision Tree")
print(classification_report(y_test, y_pred_dt))
print("------------------------------------------------------")
print()

print("Random Forest")
print(classification_report(y_test, y_pred_rf))
print("------------------------------------------------------")
print()

print("SVM")
print(classification_report(y_test, y_pred_svm))
print("------------------------------------------------------")
print()

print("SGD Classifier")
print(classification_report(y_test, y_pred_sgd))
print("------------------------------------------------------")
print()

print("Logistic Regression")
print(classification_report(y_test, y_pred_log))

Decision Tree
              precision    recall  f1-score   support

           0       0.83      0.88      0.85        40
           1       0.93      0.91      0.92        74

    accuracy                           0.89       114
   macro avg       0.88      0.89      0.89       114
weighted avg       0.90      0.89      0.90       114

------------------------------------------------------

Random Forest
              precision    recall  f1-score   support

           0       0.90      0.93      0.91        40
           1       0.96      0.95      0.95        74

    accuracy                           0.94       114
   macro avg       0.93      0.94      0.93       114
weighted avg       0.94      0.94      0.94       114

------------------------------------------------------

SVM
              precision    recall  f1-score   support

           0       0.94      0.80      0.86        40
           1       0.90      0.97      0.94        74

    accuracy                          

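- For a patient, recall is the most important metric, because a false negative means someone who has cancer is told they do not.
- Because this dataset distinguishes malignant from benign tumors, recall matters more than accuracy.
- Note: label 0 (malignant) is the positive class, meaning the disease is diagnosed, and label 1 (benign) is the negative class, meaning no disease is diagnosed. The recall that matters here is therefore the recall for class 0.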
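The recall discussed above can be computed directly for the malignant class. The sketch below retrains the Random Forest with the same split and seed, then passes `pos_label=0` to `recall_score` so that label 0 (malignant) is treated as the positive class. The names `malignant_recall` and `missed_malignant` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

breast_cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data, breast_cancer.target, test_size=0.2, random_state=3
)

random_forest = RandomForestClassifier(random_state=3)
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)

# Recall for the malignant class: of all truly malignant cases, how many were found
malignant_recall = recall_score(y_test, y_pred_rf, pos_label=0)
print(f"Recall for malignant (label 0): {malignant_recall:.4f}")

# Rows are true labels, columns are predicted labels, both ordered [0, 1]
cm = confusion_matrix(y_test, y_pred_rf, labels=[0, 1])
missed_malignant = cm[0, 1]  # malignant cases predicted as benign
print("Malignant cases predicted as benign:", missed_malignant)
```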

### On the breast cancer data, the Random Forest model reaches the highest accuracy, about 93.9%, tied with Logistic Regression.

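# Retrospective

#### The second project at Aiffle!!

- For the handwritten digit and wine data, accuracy is a sufficient basis for judgment, but breast cancer diagnosis involves patients, so I learned that recall matters more than accuracy there.
- Working with several datasets showed me that performance differs by dataset and by model, sometimes substantially: SVM scored about 99% on the digits but only about 69% on the wine data.
- I also learned that it is important to train several models and choose the best one according to the type and nature of the data.
- Interpreting each dataset one by one helped me see which model worked best for it.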
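Each comparison in this project relies on a single train/test split, so a ranking can change with a different `random_state`. As a hedged sketch of a sturdier comparison, the code below scores two of the models on the wine data with 5-fold cross-validation through `cross_val_score`. Cross-validation is an assumption added here and was not used in the original notebook; `mean_scores` is a name introduced for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=2),
    "Random Forest": RandomForestClassifier(random_state=2),
}

# Average accuracy over 5 folds instead of one fixed split
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```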