## Summary of D4: Data Types and Processing

### Types: numeric columns generally fall into two categories
1. Discrete variable: takes countable values, such as the number of houses
2. Continuous variable: can take any value within a range, such as speed

### Common column dtypes in a pandas DataFrame
1. float64
2. int64
3. object (not numeric; usually strings or mixed values)
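
To see these dtypes in practice, the following is a minimal sketch on a toy DataFrame. The column names (`n_houses`, `speed`, `weekday`) are made up for illustration and are not part of `application_train.csv`; the string column is created with an explicit `object` dtype so the result does not depend on the pandas version's default string handling.

```python
import pandas as pd

# Toy data; column names are hypothetical, chosen only for illustration
df = pd.DataFrame({
    "n_houses": [1, 2, 0],            # discrete values -> int64
    "speed": [55.2, 61.0, 47.8],      # continuous values -> float64
    "weekday": pd.Series(["MONDAY", "FRIDAY", "MONDAY"], dtype="object"),
})

# dtype of every column
print(df.dtypes)

# How many columns there are of each dtype
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Object columns are the candidates for label or one-hot encoding;
# nunique() shows how many distinct categories each one holds
object_cols = df.select_dtypes(include="object").columns.tolist()
unique_counts = df[object_cols].nunique()
print(unique_counts)
```

Counting the distinct values per object column is a common first step, since it helps decide between label encoding (few categories) and one-hot encoding.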

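### Label encoding and one-hot encoding
Label encoding maps each category to an integer code, while one-hot encoding creates one binary column per category. See [label-encoder-vs-one-hot-encoder](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621) for a comparison.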

__Label Encoding Example__:

```python
from sklearn.preprocessing import LabelEncoder
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            # Train on the training data
            le.fit(app_train[col])
            # Transform the training data
            app_train[col] = le.transform(app_train[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)
```
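
For a version that runs without `app_train`, here is a small sketch of `LabelEncoder` on a toy list. The values `"M"` and `"F"` are illustrative assumptions, not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Toy two-category column (hypothetical values)
values = ["M", "F", "F", "M"]

le = LabelEncoder()
# fit_transform learns the categories and encodes them in one step
codes = le.fit_transform(values)

# classes_ holds the learned categories in sorted order;
# the position of each category is its integer code
print(list(le.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
restored = le.inverse_transform(codes)
print(list(restored))
```

Because the codes follow the sorted order of the categories, they imply an ordering that may not be meaningful, which is one reason the example above restricts label encoding to columns with at most two categories.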

__One Hot Encoding Example__:

```python
import pandas as pd
app_train = pd.get_dummies(app_train)
```
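
The effect of `pd.get_dummies` is easier to see on a small frame. The sketch below uses made-up columns (`amount`, `weekday`); passing `dtype=int` is an optional choice here so the indicator columns print as 0/1 regardless of the pandas version's default.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical names)
toy = pd.DataFrame({
    "amount": [100.0, 250.0, 80.0],
    "weekday": pd.Series(["MONDAY", "FRIDAY", "MONDAY"], dtype="object"),
})

# Numeric columns pass through unchanged; each categorical column is
# replaced by one indicator column per category, named <column>_<category>
encoded = pd.get_dummies(toy, dtype=int)

print(toy.shape, "->", encoded.shape)
print(encoded.columns.tolist())
print(encoded)
```

The column count grows by the number of categories minus one for each encoded column, which is the same change the assignment below asks you to observe with `shape`.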

## Practice

In [3]:
import os
import numpy as np
import pandas as pd

In [5]:
# Set the data path and read app_train
dir_data = '../data/'
f_app_train = os.path.join(dir_data, 'application_train.csv')
app_train = pd.read_csv(f_app_train)

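## Assignment
Apply one-hot encoding to the data subset sub_train below, and observe how the number of columns (using shape) and the column names (using head) change before and after the conversion.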

In [10]:
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
print(sub_train.shape)
sub_train.head()

(307511, 1)


| | WEEKDAY_APPR_PROCESS_START |
|---|---|
| 0 | WEDNESDAY |
| 1 | MONDAY |
| 2 | MONDAY |
| 3 | WEDNESDAY |
| 4 | THURSDAY |


In [11]:
print('Before One Hot Encoding')
print('shape: {}'.format(sub_train.shape))
print('head: \n{}\n\n'.format(sub_train.head()))
## One Hot Encoder ##
sub_train = pd.get_dummies(sub_train)
print('After One Hot Encoding')
print('shape: {}'.format(sub_train.shape))
print('head: \n{}'.format(sub_train.head()))

Before One Hot Encoding
shape: (307511, 1)
head: 
  WEEKDAY_APPR_PROCESS_START
0                  WEDNESDAY
1                     MONDAY
2                     MONDAY
3                  WEDNESDAY
4                   THURSDAY


After One Hot Encoding
shape: (307511, 7)
head: 
   WEEKDAY_APPR_PROCESS_START_FRIDAY  WEEKDAY_APPR_PROCESS_START_MONDAY  \
0                                  0                                  0   
1                                  0                                  1   
2                                  0                                  1   
3                                  0                                  0   
4                                  0                                  0   

   WEEKDAY_APPR_PROCESS_START_SATURDAY  WEEKDAY_APPR_PROCESS_START_SUNDAY  \
0                                    0                                  0   
1                                    0                                  0   
2                                    0     