![Comment by 임도형](https://github.com/dhrim/keras_example_seminia_2020/raw/master/comment.png)

# Overview

- Original: https://keras.io/examples/structured_data/imbalanced_classification/

- Task: classification
- Data: attribute (tabular) data. https://www.kaggle.com/mlg-ulb/creditcardfraud/
- Model: DNN


<br>

# Data

Credit card transaction data consisting only of attribute columns.


<br>

# Model

Nothing special: a plain fully connected network (DNN) with dropout.


<br>

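# Handling class imbalance with class_weight

Passing `class_weight` to `model.fit()` multiplies each sample's loss by the weight of its class, so errors on the rare class count for more. This is similar in spirit to the class-balancing (alpha) term used with focal loss, although focal loss additionally down-weights easy examples; `class_weight` only applies a fixed per-class factor.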
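To make the effect of class weighting concrete, here is a minimal NumPy sketch of a class-weighted binary cross-entropy. The function name `weighted_binary_crossentropy` and the toy values are assumptions for illustration; this is roughly the idea behind `class_weight`, not Keras's actual implementation, whose exact reduction may differ by version.

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, class_weight, eps=1e-7):
    """Mean of per-sample binary cross-entropy, each scaled by its class weight.

    Hypothetical helper for illustration only.
    """
    y_true = np.asarray(y_true, dtype="float64")
    # Clip predictions so log() never sees exactly 0 or 1.
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    # Look up the weight of each sample's class.
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))

# Toy batch: three negatives predicted well, one positive predicted poorly.
y_true = [0, 0, 0, 1]
y_pred = [0.1, 0.2, 0.1, 0.3]

unweighted = weighted_binary_crossentropy(y_true, y_pred, {0: 1.0, 1: 1.0})
weighted = weighted_binary_crossentropy(y_true, y_pred, {0: 1.0, 1: 10.0})
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

With a larger weight on class 1, the poorly predicted positive dominates the loss, which pushes the optimizer to reduce false negatives.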


<br>





# Tags

```
#structured_data
#attribute_data
#class_weight
#np.bincount()
#keras_metrics
```

In [8]:
%%shell
wget https://github.com/dhrim/keras_example_seminia_2020/raw/master/creditcard.csv.zip
unzip creditcard.csv.zip

--2020-09-24 06:43:20--  https://github.com/dhrim/keras_example_seminia_2020/raw/master/creditcard.csv.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dhrim/keras_example_seminia_2020/master/creditcard.csv.zip [following]
--2020-09-24 06:43:21--  https://raw.githubusercontent.com/dhrim/keras_example_seminia_2020/master/creditcard.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 69155672 (66M) [application/zip]
Saving to: ‘creditcard.csv.zip’


2020-09-24 06:43:21 (76.3 MB/s) - ‘creditcard.csv.zip’ saved [69155672/69155672]

Archive:  creditcard.csv.zip
  inflating: creditcard.csv          



# Imbalanced classification: credit card fraud detection

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2019/05/28<br>
**Last modified:** 2020/04/17<br>
**Description:** Demonstration of how to handle highly imbalanced classification problems.

## Introduction

This example looks at the
[Kaggle Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud/)
dataset to demonstrate how
to train a classification model on data with highly imbalanced classes.

## First, vectorize the CSV data

In [9]:
import csv
import numpy as np

# Get the real data from https://www.kaggle.com/mlg-ulb/creditcardfraud/
# fname = "/Users/fchollet/Downloads/creditcard.csv"
fname = "creditcard.csv"

all_features = []
all_targets = []
with open(fname) as f:
    for i, line in enumerate(f):
        if i == 0:
            print("HEADER:", line.strip())
            continue  # Skip header
        fields = line.strip().split(",")
        all_features.append([float(v.replace('"', "")) for v in fields[:-1]])
        all_targets.append([int(fields[-1].replace('"', ""))])
        if i == 1:
            print("EXAMPLE FEATURES:", all_features[-1])

features = np.array(all_features, dtype="float32")
targets = np.array(all_targets, dtype="uint8")
print("features.shape:", features.shape)
print("targets.shape:", targets.shape)


HEADER: "Time","V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
EXAMPLE FEATURES: [0.0, -1.3598071336738, -0.0727811733098497, 2.53634673796914, 1.37815522427443, -0.338320769942518, 0.462387777762292, 0.239598554061257, 0.0986979012610507, 0.363786969611213, 0.0907941719789316, -0.551599533260813, -0.617800855762348, -0.991389847235408, -0.311169353699879, 1.46817697209427, -0.470400525259478, 0.207971241929242, 0.0257905801985591, 0.403992960255733, 0.251412098239705, -0.018306777944153, 0.277837575558899, -0.110473910188767, 0.0669280749146731, 0.128539358273528, -0.189114843888824, 0.133558376740387, -0.0210530534538215, 149.62]
features.shape: (284807, 30)
targets.shape: (284807, 1)


## Prepare a validation set

In [10]:
num_val_samples = int(len(features) * 0.2)
train_features = features[:-num_val_samples]
train_targets = targets[:-num_val_samples]
val_features = features[-num_val_samples:]
val_targets = targets[-num_val_samples:]

print("Number of training samples:", len(train_features))
print("Number of validation samples:", len(val_features))


Number of training samples: 227846
Number of validation samples: 56961


## Analyze class imbalance in the targets

![Comment by 임도형](https://github.com/dhrim/keras_example_seminia_2020/raw/master/comment.png)

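Class 0 has 227,429 samples and class 1 has 417, a ratio of roughly 99.82% to 0.18%.

`np.bincount()` counts how often each non-negative integer value occurs.

The class weights used for training are the inverse counts, 1/227429 and 1/417, so each class 1 sample weighs about 545 times as much as a class 0 sample.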
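As a small illustration of the counting and inverse-frequency weighting described here, the following sketch runs on made-up labels. The `labels` array is a toy assumption, not the real dataset.

```python
import numpy as np

# Toy labels (an assumption for illustration): 8 negatives, 2 positives.
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# counts[k] is the number of times the integer k appears in labels.
counts = np.bincount(labels)

# Inverse-frequency weights: the rarer the class, the larger its weight.
weights = 1.0 / counts
class_weight = {k: float(w) for k, w in enumerate(weights)}

# With inverse-frequency weights, each class contributes the same total weight.
totals = counts * weights

print("counts:", counts)
print("class_weight:", class_weight)
print("total weight per class:", totals)
```

Because each class ends up with the same total weight, the rare class is no longer drowned out by the majority class in the loss.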

In [15]:
counts = np.bincount(train_targets[:, 0])
print(counts)
print(
    "Number of positive samples in training data: {} ({:.2f}% of total)".format(
        counts[1], 100 * float(counts[1]) / len(train_targets)
    )
)

weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]


[227429    417]
Number of positive samples in training data: 417 (0.18% of total)


## Normalize the data using training set statistics

In [12]:
mean = np.mean(train_features, axis=0)
train_features -= mean
val_features -= mean
std = np.std(train_features, axis=0)
train_features /= std
val_features /= std
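The same normalization can be wrapped in a small helper. The sketch below is one possible variant, not the notebook's code: the function name `standardize` and the `eps` guard against constant (zero-variance) columns are assumptions added for illustration. The key point is unchanged: the mean and standard deviation come from the training set only and are reused for the validation set.

```python
import numpy as np

def standardize(train, val, eps=1e-7):
    """Scale both arrays with the training set's mean and std (hypothetical helper)."""
    train = np.asarray(train, dtype="float32")
    val = np.asarray(val, dtype="float32")
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Avoid division by zero for columns that are constant in the training set.
    std = np.where(std < eps, 1.0, std)
    return (train - mean) / std, (val - mean) / std

# Synthetic data, only to exercise the helper.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3)).astype("float32")
train[:, 2] = 1.0  # a constant column
val = rng.normal(loc=5.0, scale=2.0, size=(200, 3)).astype("float32")

train_n, val_n = standardize(train, val)
print("train mean:", train_n.mean(axis=0))
print("train std: ", train_n.std(axis=0))
```

Unlike the in-place version, this returns new arrays and leaves the inputs untouched.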


## Build a binary classification model

In [13]:
from tensorflow import keras

model = keras.Sequential(
    [
        keras.layers.Dense(256, activation="relu", input_shape=(train_features.shape[-1],)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 256)               7936      
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257       
Total params: 139,777
Trainable params: 139,777
Non-trainable params: 0
__________________________________________________

## Train the model with `class_weight` argument

In [22]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)

callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]
class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_features,
    train_targets,
    batch_size=2048,
    epochs=30,
    verbose=2,
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
    class_weight=class_weight,
)

Epoch 1/30
112/112 - 6s - loss: 1.5320e-06 - fn: 58.0000 - fp: 742.0000 - tn: 226687.0000 - tp: 359.0000 - precision: 0.3261 - recall: 0.8609 - val_loss: 0.0307 - val_fn: 14.0000 - val_fp: 145.0000 - val_tn: 56741.0000 - val_tp: 61.0000 - val_precision: 0.2961 - val_recall: 0.8133
Epoch 2/30
112/112 - 5s - loss: 4.5041e-07 - fn: 17.0000 - fp: 1155.0000 - tn: 226274.0000 - tp: 400.0000 - precision: 0.2572 - recall: 0.9592 - val_loss: 0.0302 - val_fn: 14.0000 - val_fp: 65.0000 - val_tn: 56821.0000 - val_tp: 61.0000 - val_precision: 0.4841 - val_recall: 0.8133
Epoch 3/30
112/112 - 5s - loss: 2.6920e-07 - fn: 13.0000 - fp: 826.0000 - tn: 226603.0000 - tp: 404.0000 - precision: 0.3285 - recall: 0.9688 - val_loss: 0.0313 - val_fn: 17.0000 - val_fp: 62.0000 - val_tn: 56824.0000 - val_tp: 58.0000 - val_precision: 0.4833 - val_recall: 0.7733
Epoch 4/30
112/112 - 5s - loss: 2.2644e-07 - fn: 9.0000 - fp: 598.0000 - tn: 226831.0000 - tp: 408.0000 - precision: 0.4056 - recall: 0.9784 - val_loss: 0.

<tensorflow.python.keras.callbacks.History at 0x7fd55a72b0f0>

In [23]:
with_weight_class_result = model.evaluate(val_features, val_targets)



In [18]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)

callbacks = [keras.callbacks.ModelCheckpoint("fraud_model_at_epoch_{epoch}.h5")]
class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_features,
    train_targets,
    batch_size=2048,
    epochs=30,
    verbose=2,
    callbacks=callbacks,
    validation_data=(val_features, val_targets),
    # class_weight=class_weight,  # disabled here: unweighted baseline for comparison
)

Epoch 1/30
112/112 - 6s - loss: 0.0057 - fn: 137.0000 - fp: 100.0000 - tn: 227329.0000 - tp: 280.0000 - precision: 0.7368 - recall: 0.6715 - val_loss: 0.0049 - val_fn: 19.0000 - val_fp: 6.0000 - val_tn: 56880.0000 - val_tp: 56.0000 - val_precision: 0.9032 - val_recall: 0.7467
Epoch 2/30
112/112 - 5s - loss: 0.0031 - fn: 105.0000 - fp: 68.0000 - tn: 227361.0000 - tp: 312.0000 - precision: 0.8211 - recall: 0.7482 - val_loss: 0.0057 - val_fn: 22.0000 - val_fp: 2.0000 - val_tn: 56884.0000 - val_tp: 53.0000 - val_precision: 0.9636 - val_recall: 0.7067
Epoch 3/30
112/112 - 5s - loss: 0.0032 - fn: 133.0000 - fp: 55.0000 - tn: 227374.0000 - tp: 284.0000 - precision: 0.8378 - recall: 0.6811 - val_loss: 0.0049 - val_fn: 20.0000 - val_fp: 6.0000 - val_tn: 56880.0000 - val_tp: 55.0000 - val_precision: 0.9016 - val_recall: 0.7333
Epoch 4/30
112/112 - 6s - loss: 0.0026 - fn: 117.0000 - fp: 54.0000 - tn: 227375.0000 - tp: 300.0000 - precision: 0.8475 - recall: 0.7194 - val_loss: 0.0110 - val_fn: 19.0

<tensorflow.python.keras.callbacks.History at 0x7fd55b5cdfd0>

In [19]:
without_weight_class_result = model.evaluate(val_features, val_targets)



In [30]:
print(model.metrics_names)
print(without_weight_class_result)

['loss', 'fn', 'fp', 'tn', 'tp', 'precision', 'recall']
[0.022713497281074524, 28.0, 2.0, 56884.0, 47.0, 0.9591836929321289, 0.6266666650772095]


In [28]:
print(model.metrics_names)
print(with_weight_class_result)

['loss', 'fn', 'fp', 'tn', 'tp', 'precision', 'recall']
[0.07225595414638519, 12.0, 459.0, 56427.0, 63.0, 0.12068965286016464, 0.8399999737739563]
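The precision and recall values in the two result lists can be checked by hand from the confusion counts. The sketch below uses the `tp`, `fp`, and `fn` values printed above; the helper name `precision_recall` is an assumption for illustration.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Counts taken from the evaluation outputs above.
without_weight = precision_recall(tp=47, fp=2, fn=28)
with_weight = precision_recall(tp=63, fp=459, fn=12)

print("without class_weight: precision={:.4f} recall={:.4f}".format(*without_weight))
print("with class_weight:    precision={:.4f} recall={:.4f}".format(*with_weight))
```

The comparison shows the trade-off: class weighting catches more fraud (higher recall) but flags far more legitimate transactions (lower precision).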



## Conclusions

At the end of training with class weights, out of 56,961 validation transactions, we are:

- Correctly identifying 63 of them as fraudulent
- Missing 12 fraudulent transactions
- At the cost of incorrectly flagging 459 legitimate transactions

In the real world, one would put an even higher weight on class 1,
so as to reflect that False Negatives are more costly than False Positives.
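One way to put a higher weight on class 1 is to multiply its inverse-frequency weight by a cost factor. The sketch below assumes a hypothetical `fn_cost` parameter and helper name; the value 5.0 is arbitrary and would need to be chosen from the real cost of a missed fraud relative to a false alarm.

```python
import numpy as np

def cost_sensitive_class_weight(counts, fn_cost=1.0):
    """Inverse-frequency weights, with class 1 scaled by an extra cost factor.

    Hypothetical helper: fn_cost > 1 means a false negative is considered
    more costly than a false positive.
    """
    counts = np.asarray(counts, dtype="float64")
    weight_for_0 = 1.0 / counts[0]
    weight_for_1 = fn_cost / counts[1]
    return {0: float(weight_for_0), 1: float(weight_for_1)}

# Class counts from the training set in this notebook; fn_cost is an assumption.
class_weight = cost_sensitive_class_weight([227429, 417], fn_cost=5.0)
print(class_weight)
print("class 1 / class 0 weight ratio:", class_weight[1] / class_weight[0])
```

The resulting dictionary has the same shape as the `class_weight` argument used in `model.fit()` above.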

Next time your credit card gets declined in an online purchase -- this is why.