# **Google Cloud Platform Virtual Machine (GCP VM) Setup and Model Training**

## Development environment


*   Machine type
>  custom (4 vCPUs, 16 GB memory)

*   CPU platform
> Intel Broadwell
* OS image
> Ubuntu 16.04


*   GPU
> 2 x NVIDIA Tesla K80

*   Zone
> asia-east1-b

* Boot disk
  * Name: k80x2-ubuntu16-1
  * Size (GB): 100
  * Type: SSD persistent disk
  * Encryption mode: Google-managed
  * Instance mode: boot, read/write
  * On instance deletion: delete disk

---
## 1. CUDA installation
Unlike Google Colab, a GCP VM does not come with CUDA preinstalled. TensorFlow needs CUDA to use the GPU, so the environment has to be set up first. Without CUDA installed and configured, training cannot use the GPU even when one is attached to the VM; it falls back to the CPU and takes tens of times longer than it would on a GPU.
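
Before installing anything, it can help to check whether the NVIDIA tools are already present on the VM. The following is a minimal sketch, not part of the original setup: the helper names `find_cuda_tools` and `gpu_names` are hypothetical, and the code assumes only that `nvidia-smi` and `nvcc` would be on `PATH` once the driver and CUDA toolkit are installed.

```python
import shutil
import subprocess


def find_cuda_tools(tools=("nvidia-smi", "nvcc")):
    """Return {tool_name: absolute_path_or_None} for each tool looked up on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for tool, path in find_cuda_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
    print("GPUs:", gpu_names() or "none detected")
```

If both tools are reported as missing, the installation steps in this section are still required.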

### 1-1. Download CUDA


```
# Download the CUDA repository package
curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
# Install the CUDA repository package
sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb

# Add NVIDIA's public key for the repository
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub

# Refresh the package index so the newly added repository is picked up
sudo apt-get update
```
### 1-2. Install CUDA
```
sudo apt-get install cuda-9-0
```
### 1-3. Enable persistence mode
```
# Turn on persistence mode so the NVIDIA driver stays loaded between jobs
sudo nvidia-smi -pm 1
```
### 1-4. Check running status
```
# Verify that the driver is running and the GPUs are listed
nvidia-smi
```

### 1-5. Set environment variables
```
echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
echo 'export PATH=$PATH:$CUDA_HOME/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64' >> ~/.bashrc
source ~/.bashrc
```
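
After sourcing `~/.bashrc`, the three variables can be sanity-checked from Python. This is an illustrative sketch only: `check_cuda_env` is a hypothetical helper, and it assumes the same `/usr/local/cuda` prefix used in the commands above and Linux-style `:` path separators.

```python
import os


def check_cuda_env(env, cuda_home="/usr/local/cuda"):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    if env.get("CUDA_HOME") != cuda_home:
        problems.append(
            f"CUDA_HOME is {env.get('CUDA_HOME')!r}, expected {cuda_home!r}"
        )
    # PATH and LD_LIBRARY_PATH are colon-separated lists on Linux
    if f"{cuda_home}/bin" not in env.get("PATH", "").split(":"):
        problems.append("PATH does not contain the CUDA bin directory")
    if f"{cuda_home}/lib64" not in env.get("LD_LIBRARY_PATH", "").split(":"):
        problems.append("LD_LIBRARY_PATH does not contain the CUDA lib64 directory")
    return problems


if __name__ == "__main__":
    issues = check_cuda_env(os.environ)
    print("CUDA environment looks consistent" if not issues else "\n".join(issues))
```

An empty result means the current shell sees the same values that were written to `~/.bashrc`.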


## 2. GPU setup
### 2-1. Install tensorflow-gpu
Training on the GPU requires the tensorflow-gpu package. If the plain tensorflow package is already installed, the GPU may not be detected correctly, so uninstall it first and then install tensorflow-gpu. The removal and installation steps are as follows.
```
# Step 0: Uninstall protobuf
pip uninstall protobuf

# Step 1: Uninstall tensorflow
pip uninstall tensorflow
pip uninstall tensorflow-gpu

# Step 2: Force reinstall TensorFlow with GPU support
pip install --upgrade --force-reinstall tensorflow-gpu

# Step 3: If you haven't already, set CUDA_VISIBLE_DEVICES
# With 2 GPUs, for example:
export CUDA_VISIBLE_DEVICES=0,1
```
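
To confirm that only one TensorFlow distribution remains after these steps, the installed packages can be queried from Python. The sketch below is an assumption-laden illustration: `installed_tf_packages` and `has_conflict` are hypothetical helper names, and `importlib.metadata` requires Python 3.8 or later (on older interpreters `pip list` gives the same information).

```python
from importlib import metadata  # standard library from Python 3.8 onward

TF_PACKAGES = ("tensorflow", "tensorflow-gpu")


def installed_tf_packages(names=TF_PACKAGES):
    """Return {package_name: version} for each named package that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # package is not installed; skip it
    return found


def has_conflict(found):
    """True when both the CPU-only and the GPU distribution are installed."""
    return "tensorflow" in found and "tensorflow-gpu" in found


if __name__ == "__main__":
    found = installed_tf_packages()
    print(found or "no TensorFlow package installed")
    if has_conflict(found):
        print("Both tensorflow and tensorflow-gpu are installed; uninstall both and reinstall tensorflow-gpu.")
```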
### 2-2. Verify the GPU setup
Run the following code from a terminal to check that the GPU is set up correctly.
#### Verify gpu setting
```
python3

# Uses the TensorFlow 1.x API (tf.Session, tf.ConfigProto)
import tensorflow as tf

# Pin one matrix multiplication to the CPU
with tf.device('/cpu:0'):
    a_c = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a-cpu')
    b_c = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b-cpu')
    c_c = tf.matmul(a_c, b_c, name='c-cpu')

# Pin the same multiplication to the first GPU
with tf.device('/gpu:0'):
    a_g = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a-gpu')
    b_g = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b-gpu')
    c_g = tf.matmul(a_g, b_g, name='c-gpu')

# log_device_placement=True logs the device each op was assigned to
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c_c))
    print(sess.run(c_g))

quit()
```
#### Expected Results
```
2019-05-28 08:21:03.076023: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions tha
t this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-28 08:21:03.082701: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-05-28 08:21:03.083207: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5639a8b40560 executing 
computations on platform Host. Devices:
2019-05-28 08:21:03.083250: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefin
ed>, <undefined>
2019-05-28 08:21:03.265467: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read f
rom SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-28 08:21:03.267378: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read f
rom SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-28 08:21:03.268674: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5639a8c124b0 executing 
computations on platform CUDA. Devices:
2019-05-28 08:21:03.268727: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K8
0, Compute Capability 3.7
2019-05-28 08:21:03.268753: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (1): Tesla K8
0, Compute Capability 3.7
2019-05-28 08:21:03.269063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties
: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.09GiB
2019-05-28 08:21:03.269156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties
: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:05.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2019-05-28 08:21:03.269456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0,
 1
2019-05-28 08:21:03.271807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecu
tor with strength 1 edge matrix:
2019-05-28 08:21:03.271856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 
2019-05-28 08:21:03.271866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y 
2019-05-28 08:21:03.271872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N 
2019-05-28 08:21:03.272148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/jo
b:localhost/replica:0/task:0/device:GPU:0 with 10790 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bu
s id: 0000:00:04.0, compute capability: 3.7)
2019-05-28 08:21:03.272921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/jo
b:localhost/replica:0/task:0/device:GPU:1 with 10805 MB memory) -> physical GPU (device: 1, name: Tesla K80, pci bu
s id: 0000:00:05.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:1 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capab
ility: 3.7
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla K80, pci bus id: 0000:00:05.0, compute capab
ility: 3.7
2019-05-28 08:21:03.274104: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:1 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capab
ility: 3.7
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla K80, pci bus id: 0000:00:05.0, compute capab
ility: 3.7

c-cpu: (MatMul): /job:localhost/replica:0/task:0/device:CPU:0
2019-05-28 08:21:03.275730: I tensorflow/core/common_runtime/placer.cc:1059] c-cpu: (MatMul)/job:localhost/replica:
0/task:0/device:CPU:0
c-gpu: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-28 08:21:03.275777: I tensorflow/core/common_runtime/placer.cc:1059] c-gpu: (MatMul)/job:localhost/replica:
0/task:0/device:GPU:0
a-cpu: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-05-28 08:21:03.275855: I tensorflow/core/common_runtime/placer.cc:1059] a-cpu: (Const)/job:localhost/replica:0
/task:0/device:CPU:0
b-cpu: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-05-28 08:21:03.275891: I tensorflow/core/common_runtime/placer.cc:1059] b-cpu: (Const)/job:localhost/replica:0
/task:0/device:CPU:0
a-gpu: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-28 08:21:03.275908: I tensorflow/core/common_runtime/placer.cc:1059] a-gpu: (Const)/job:localhost/replica:0
/task:0/device:GPU:0
b-gpu: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-28 08:21:03.275928: I tensorflow/core/common_runtime/placer.cc:1059] b-gpu: (Const)/job:localhost/replica:0
/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]
[[22. 28.]
 [49. 64.]]
```
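
The two matrices printed at the end of the log should be identical, since the CPU and GPU ops multiply the same inputs. As a cross-check that does not need TensorFlow, the same product can be worked out in plain Python; the `matmul` helper below is a hypothetical illustration, not part of the original verification script.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


# The six constants reshaped row by row, as tf.constant does with shape=[2, 3] and [3, 2]
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(matmul(a, b))  # [[22.0, 28.0], [49.0, 64.0]]
```

For instance, the top-left entry is 1·1 + 2·3 + 3·5 = 22, which matches the logged result.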


## 3. ImageAI requirements & ImageAI installation
Image classification training uses the imageai library, which builds on existing state-of-the-art deep learning algorithms for image classification. ImageAI supports the SqueezeNet, ResNet, InceptionV3, and DenseNet algorithms; ResNet was used to train the car image classifier.
The prerequisites are installed as follows.

### 3-1. Install prerequisites

In [0]:
!pip install tensorflow-gpu==2.0.0-alpha0


In [0]:
pip install numpy




In [0]:
pip install scipy



In [0]:
pip install opencv-python



In [0]:
pip install pillow



In [0]:
pip install matplotlib



In [0]:
pip install h5py



In [0]:
pip install keras



### 3-2. Install imageai library

In [0]:
pip install https://github.com/OlafenwaMoses/ImageAI/releases/download/2.0.2/imageai-2.0.2-py3-none-any.whl



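## 4. Preparing the dataset
Prepare the dataset needed for training. The images are uploaded to GCP Storage and then downloaded to the GCP VM for use. The data is organized as follows.


*   json : stores the class identification mapping that is generated when training starts
*   models : directory where trained models are saved
*  test : test images, stored in one subdirectory per class
*  train : training images, stored in one subdirectory per class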
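
The layout above can be checked before training starts. The following is a rough sketch under stated assumptions: `missing_dirs` and `images_per_class` are hypothetical helpers, and the set of image file extensions is a guess, since the original does not say which formats the dataset uses.

```python
from pathlib import Path

EXPECTED_DIRS = ("json", "models", "test", "train")
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumption: adjust to the actual dataset


def missing_dirs(root):
    """Return the expected top-level directories that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def images_per_class(split_dir):
    """Count image files in each class subdirectory of a train/ or test/ directory."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

For example, `missing_dirs("cars_refined")` should return an empty list once the archive is extracted, and `images_per_class("cars_refined/train")` gives a quick view of class balance.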





In [0]:
from google.colab import auth
auth.authenticate_user()

In [0]:
# First, we need to set our project. Replace the assignment below
# with your project ID.
project_id = 'trive-image-classification'

In [0]:
bucket_name = 'image-classification-bucket'

In [0]:
!gcloud config set project {project_id}

Updated property [core/project].


In [0]:
# Copy the zipped dataset from the bucket into the current directory.
# !gsutil ls gs://{bucket_name}/cars_refined
# !gsutil cp -r gs://{bucket_name}/cars_refined .
!gsutil cp -r gs://{bucket_name}/cars_refined.zip .

Copying gs://image-classification-bucket/cars_refined.zip...
/ [1 files][736.2 MiB/736.2 MiB]                                                
Operation completed over 1 objects/736.2 MiB.                                    


In [0]:
!rm -rf cars_refined
import zipfile
# Extract the archive; the context manager closes the file automatically
with zipfile.ZipFile('cars_refined.zip', 'r') as zip_ref:
    zip_ref.extractall('./cars_refined')

In [0]:
!ls cars_refined

json  models  test  train


In [0]:
# -p skips directories that already exist instead of failing
!mkdir -p /content/cars_refined/models
!mkdir -p /content/cars_refined/json


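## 6. Model training
Load the locally stored dataset and train the model. ResNet was used as the algorithm, and the number of training iterations (`num_experiments`) was set to a maximum of 200. Training was stopped once accuracy was judged to be no longer improving.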

In [0]:
from imageai.Prediction.Custom import ModelTraining
  
model_trainer = ModelTraining()
model_trainer.setModelTypeAsResNet()
model_trainer.setDataDirectory("/content/cars_refined")
model_trainer.trainModel(num_objects=10, num_experiments=200, enhance_data=True, batch_size=32, show_network_summary=True)

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 112, 112, 64) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
batch_normalization_v1 (BatchNo (None, 112, 112, 64) 256         conv2d[0][0]                     
__________________________________________________________________________________________________
activation (Activation)         (None, 112, 112, 64) 0           batch_normalization_v1[0][0]     
______________________________________________________________________________________________

KeyboardInterrupt: ignored
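
The run above was interrupted by hand once accuracy stopped improving. The same stopping rule can be written down as a patience-based check. This is only a sketch of the idea in plain Python, not ImageAI's API and not the procedure actually used here; `early_stop_epoch`, `patience`, and `min_delta` are hypothetical names.

```python
def early_stop_epoch(accuracies, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops after `patience` consecutive epochs in which accuracy
    did not exceed the best value seen so far by more than `min_delta`.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best + min_delta:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Accuracy plateaus at 0.7 from epoch 3, so with patience=3 the rule fires at epoch 6
print(early_stop_epoch([0.5, 0.6, 0.7, 0.7, 0.69, 0.7], patience=3))  # 6
```

A larger `patience` tolerates longer plateaus at the cost of extra training time.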