# Exploratory_Data_Analysis

<a class="anchor" id="0"></a>
# Table of Contents

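1. [Package Installation and Loading](#1)
1. [Environment Detection and Setup](#2)
1. [EDA Parameter Settings](#3)
1. [Data Processing](#4)
    -  [Loading the CSV Files](#4.1)
    -  [Checking the CSV for Missing Values](#4.2)
1. [Image Processing](#5)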

# 1. Package Installation and Loading<a class="anchor" id="1"></a>
[Back to Table of Contents](#0)

In [None]:
# Data processing and visualization packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 2. Environment Detection and Setup<a class="anchor" id="2"></a>
[Back to Table of Contents](#0)

In [None]:
'''Runtime environment settings'''

# (Boolean) Whether running on a local machine
LOCAL = False

# (Boolean) Whether running on Colab
COLAB = False


'''File path settings'''

# (String) Root path
if LOCAL:
    PATH = r'../'
elif COLAB:
    PATH = r'/content/drive/My Drive/Colab Notebooks/'
else:
    PATH = r'../input/'
    OUTPUT_PATH = r'/kaggle/working/'

# (String) Data root path
DATA_ROOT_PATH = PATH+r'hpa-single-cell-image-classification/'

# (String) Training data path
TRAIN_DATA_PATH = DATA_ROOT_PATH+r'train'

# (String) Training CSV path; if None, the CSV file is not read
TRAIN_CSV_PATH = DATA_ROOT_PATH+r'train.csv'

# (String) Test data path
TEST_DATA_PATH = DATA_ROOT_PATH+r'test'

# (String) Submission CSV path; if None, the CSV file is not read
SUB_CSV_PATH = DATA_ROOT_PATH+r'sample_submission.csv'

In [None]:
if not LOCAL and COLAB:
    from google.colab import drive
    drive.mount('/content/drive')

# 3. EDA Parameter Settings<a class="anchor" id="3"></a>
[Back to Table of Contents](#0)

In [None]:
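'''Custom parameter settings'''


'''Data parameter settings'''

# (String) Label column in the CSV
LABEL_NAME = 'Label'


'''Chart parameter settings'''

# (Float) Font scale applied to all seaborn charts
ALL_SNS_FONT_SCALE = 1.0

# (Int) Width of the CSV missing-value chart
CSV_COUNTPLOT_FIGSIZE_W = 10

# (Int) Height of the CSV missing-value chart
CSV_COUNTPLOT_FIGSIZE_H = 10

# (Int) Title font size of the CSV missing-value chart
CSV_COUNTPLOT_TITLE_FONTSIZE = 20

# (Int) X-axis label font size of the CSV missing-value chart
CSV_COUNTPLOT_XLABEL_FONTSIZE = 15

# (Int) Y-axis label font size of the CSV missing-value chart
CSV_COUNTPLOT_YLABEL_FONTSIZE = 15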

In [None]:
# Set the font scale for seaborn charts
sns.set(font_scale = ALL_SNS_FONT_SCALE)

# 4. Data Processing<a class="anchor" id="4"></a>
[Back to Table of Contents](#0)

## 4.1 Loading the CSV Files <a class="anchor" id="4.1"></a>
[Back to Table of Contents](#0)

In [None]:
print('Reading data...')

# Read the training set CSV
train_csv = pd.read_csv(TRAIN_CSV_PATH, encoding="utf8")

# Read the test set CSV (the sample submission file)
sub_csv = pd.read_csv(SUB_CSV_PATH, encoding="utf8")

print('Reading data completed')
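
The path settings state that a CSV path may be `None`, in which case the file should not be read, but the cell above calls `pd.read_csv` unconditionally. A minimal sketch of a guard follows. The helper `read_csv_if_set` is hypothetical, and the demo writes a small throwaway file with made-up rows rather than touching the competition data.

```python
import os
import tempfile
from typing import Optional

import pandas as pd


def read_csv_if_set(path: Optional[str]) -> Optional[pd.DataFrame]:
    """Hypothetical helper: skip reading when no path is configured."""
    if path is None:
        return None
    return pd.read_csv(path, encoding="utf8")


# Demo on a temporary file; the rows are invented for illustration only.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write("ID,Label\nx1,0|5\nx2,16\n")
    demo_df = read_csv_if_set(demo_path)

# A None path is skipped instead of raising an error.
skipped = read_csv_if_set(None)
print(demo_df.shape, skipped)
```

Downstream cells would then need to check for `None` before calling methods such as `head()`.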

In [None]:
# Show the first rows of the training set CSV
train_csv.head()

In [None]:
print("Shape of train_data :", train_csv.shape)

In [None]:
# Show the first rows of the submission CSV
sub_csv.head()

In [None]:
print("Shape of test_data :", sub_csv.shape)

## 4.2 Checking the CSV for Missing Values <a class="anchor" id="4.2"></a>
[Back to Table of Contents](#0)

In [None]:
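# Number of missing values per column, largest first
total = train_csv.isnull().sum().sort_values(ascending = False)
# Share of missing values per column, in percent
percent = (train_csv.isnull().sum()/train_csv.isnull().count()*100).sort_values(ascending = False)
# Combine both into one summary table indexed by column name
missing_train_csv  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_train_csv.head())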
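
To make the missing-value summary concrete, here is a small sketch on a toy DataFrame. The rows are invented for illustration, and `isnull().mean()` is used as an equivalent shorthand for dividing the null count by the row count.

```python
import pandas as pd

# Toy data (not from the competition): two of four labels are missing.
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "Label": ["0", None, "5", None],
})

# Count of missing values per column.
total = toy.isnull().sum()
# Percentage of missing values per column; the mean of a boolean mask is its share of True.
percent = toy.isnull().mean() * 100

summary = (
    pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    .sort_values("Total", ascending=False)
)
print(summary)
```

Here the `Label` row reports 2 missing values (50 percent) and the `ID` row reports none.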

In [None]:
print(train_csv[LABEL_NAME].value_counts())
f,ax = plt.subplots(figsize=(CSV_COUNTPLOT_FIGSIZE_W, CSV_COUNTPLOT_FIGSIZE_H))
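# Pass the column as the x keyword: recent seaborn versions treat the first positional argument as data
sns.countplot(x = train_csv[LABEL_NAME], hue = train_csv[LABEL_NAME], ax = ax)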
plt.title("LABEL COUNT", fontsize=CSV_COUNTPLOT_TITLE_FONTSIZE)
plt.xlabel(LABEL_NAME.upper(), fontsize=CSV_COUNTPLOT_XLABEL_FONTSIZE)
plt.ylabel("COUNT", fontsize=CSV_COUNTPLOT_YLABEL_FONTSIZE)
plt.legend()
plt.show()
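
Counting raw `Label` strings treats every distinct label combination as its own category. If the column holds several class IDs joined by `|` (an assumption about the data format, which the cells above do not confirm), a per-class count may be more informative. A minimal sketch on invented toy rows:

```python
import pandas as pd

# Toy rows (not from the competition); assumes labels are joined with "|".
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "Label": ["0|5", "5", "0|5|16", "16"],
})

# Split each label string into a list, put one class ID per row, then count.
per_class = toy["Label"].str.split("|").explode().value_counts()
print(per_class)
```

The same expression could be applied to `train_csv[LABEL_NAME]`, and the resulting Series plotted with a bar chart instead of `sns.countplot`.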