# Tutorial: Data preprocessing by pandas and scikit-learn (Take air pollution as example)

In [1]:
__author__ = "Allenyummy"
__date__ = "May 29, 2020"

## 基礎概念
資料可粗淺地區分為數值型與類別型，數值型帶有連續的概念，類別則帶有離散的特徵。舉例來說，環境溫度值便是一種數值型資料，天氣狀況(晴、陰、雨)便是一種類別型資料。在做資料前處理之前，必須先對自己手上的資料有一定的了解，切勿一味埋頭苦幹，否則僅會事倍功半。

## 下載資料
首先先至下列網址下載空汙資料至本地端  (自行下載台北古亭測站2019年整年的數據)
https://erdb.epa.gov.tw/DataRepository/EnvMonitor/AirPSIValues.aspx?topic1=大氣&topic2=環境及生態監測&subject=空氣品質

## Step 0 - Import packages
API of pandas 1.0.3: 
https://pandas.pydata.org/pandas-docs/stable/reference/index.html

In [2]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
pd.set_option('display.max_columns', None)

## Step 1.0 - Read Data

In [3]:
data = pd.read_csv('air_pollution_data.csv')

#--- Last 2 rows are all nan. Just drop it. ---#
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv
# data = pd.read_csv('空氣品質即時污染指標.csv', skiprows=[8228, 8229])

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna
data.dropna(how='all', inplace=True)
data = data[[i for i in data.columns if i not in ['污染物', 'PSI值']]]

#--- Check what's inside the dataframe ---#
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head
# data.head()
# data.tail()

#--- Descriptive statistics of dataframe ---#
# data.describe()
# data.info()

#--- list of columns of dataframe ---#
data.columns

Index(['序號', '縣市', '測站', '監測日期', '監測時間', '污染程度', '二氧化硫(ppb)', '一氧化碳(ppm)',
       '臭氧(ppb)', '懸浮微粒(μg/m3)', '細懸浮微粒(μg/m3)', '二氧化氮(ppb)', '風速(m/sec)',
       'AQI值', '臭氧8小時移動平均(ppb)', '氮氧化物(ppb)(NOx)', '一氧化氮(ppb)(NO)',
       '風向(degrees)', '一氧化碳8小時移動平均(ppb)', '懸浮微粒移動平均值(μg/m3)',
       '細懸浮微粒移動平均值(μg/m3)'],
      dtype='object')

## Step 1.1 - Rename chinese columns names into english

In [4]:
col_name_map = {'序號': 'serial_num',
                '縣市': 'city',
                '測站': 'station',
                '監測日期': 'date',
                '監測時間': 'hour',
                '污染物': 'pollutant',
                '污染程度': 'degree_of_pollution',
                'PSI值': 'PSI',
                '二氧化硫(ppb)': 'SO2',
                '一氧化碳(ppm)': 'CO',
                '臭氧(ppb)': 'O3',
                '懸浮微粒(μg/m3)': 'PM10',
                '細懸浮微粒(μg/m3)': 'PM25',
                '二氧化氮(ppb)': 'NO2',
                '風速(m/sec)': 'wind_speed',
                'AQI值': 'AQI',
                '臭氧8小時移動平均(ppb)': 'O3_ma8',
                '氮氧化物(ppb)(NOx)': 'NOx',
                '一氧化氮(ppb)(NO)': 'NO',
                '風向(degrees)': 'degrees',
                '一氧化碳8小時移動平均(ppb)': 'CO_ma8',
                '懸浮微粒移動平均值(μg/m3)': 'PM10_ma',
                '細懸浮微粒移動平均值(μg/m3)': 'PM25_ma'}

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
data.rename(columns=col_name_map, inplace=True)

## Step 1.2 - Numerical vs. Categorical columns (features)

In [5]:
time_col = ['date', 'hour']
categorical_col = ['city', 'station', 'degree_of_pollution']
numerical_col = ['SO2', 'CO', 'O3', 'PM10', 'PM25', 'NO2', 'wind_speed', 'AQI', 'O3_ma8', 'NOx', 'NO', 'degrees', 'CO_ma8', 'PM10_ma', 'PM25_ma']

## Step 1.3 - Ascend order of time by reversing dataframe

In [6]:
data = data.iloc[::-1]
data.index = range(0, len(data))
# data

## Step 1.4 - Handle datetime
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

In [7]:
data['time'] = data['date'] + ' ' + data['hour'].str.slice(0, -1)
data['time'] = pd.to_datetime(data['time'], format='%Y/%m/%d %H')
data.drop(columns=['date', 'hour'], inplace=True)
time_col.append('time')
[time_col.remove(i) for i in ['date', 'hour']]
# data['time']

[None, None]

## Step 2 - Detect outliers from numerical features as NaN
一般來說，我會將異常值標記為NaN，視為遺失值，之後一併處理。

使用 pandas where function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html

### method 1: 平均值、標準差
將平均值±N個標準差之外的資料視為outlier。至於N要設為多少，一切依照任務需求，沒有正確答案。

In [8]:
## cutsomize N std to detect outliers as nan
data_1 = data.copy()
for col in numerical_col:
    mean = data_1[col].mean()
    std = data_1[col].std()
    condition = np.abs(data_1[col]-mean)/std < 7
    data_1[col] = data_1[col].where(condition)

### method 2: Z-score
在常態分佈的情況下，99.7%的資料會落在平均值±3個標準差之內，因此在這之外的資料可視為outlier。

In [9]:
## z-score to detect outliers as nan
data_2 = data.copy()
for col in numerical_col:
    mean = data_2[col].mean()
    std = data_2[col].std()
    condition = np.abs(data_2[col]-mean)/std < 3
    data_2[col] = data_2[col].where(condition)

### method 3: 分位數、四分位距(IQR)
小於(Q1-1.5xIQR) or 大於(Q3+1.5xIQR)，皆視為outlier。

In [10]:
## quantile to dectect outliers as nan
data_3 = data.copy()
for col in numerical_col:
    q1 = data_3[col].quantile(.25)
    q3 = data_3[col].quantile(.75)
    iqr = q3-q1
    condition1 = data_3[col] > q1-1.5*iqr
    condition2 = data_3[col] < q3+1.5*iqr
    data_3[col] = data_3[col].where(condition1|condition2)

### Step 2.1 - Compare different handling methods about nan and plot bar

In [11]:
## original total count of nan
nan_0 = data.isna().sum()

nan_1 = data_1.isna().sum()
nan_2 = data_2.isna().sum()
nan_3 = data_3.isna().sum()

## different situations about nan and plot bar
nan = pd.DataFrame({'original': nan_0, 
                    'after N std': nan_1, 
                    'after z-score': nan_2, 
                    'after quantile': nan_3})
print (nan)
nan.plot(kind='bar')

                     original  after N std  after z-score  after quantile
serial_num                  0            0              0               0
city                        0            0              0               0
station                     0            0              0               0
degree_of_pollution         0            0              0               0
SO2                       362          381            471             362
CO                        372          373            513             372
O3                        329          329            376             329
PM10                      350          355            431             350
PM25                      359          364            509             359
NO2                       493          493            585             493
wind_speed                259          259            284             259
AQI                       175          175            317             175
O3_ma8                    153         

<matplotlib.axes._subplots.AxesSubplot at 0x1d60ce2c7f0>

## Step 3 - Handle missing values
如果從資料集中剔除這些遺失值，對任務需求以及後續模型分析來說，付出的成本與代價在可接受的範圍之內，那麼，就毫不猶豫地丟棄資料遺失值吧，因為丟棄資料非常簡單。但是，通常資料得來不易，秉持著客家精神，能救回多少就救回多少，才是最經濟的套路。

### method 1: 棄值
1. 列表刪除(listwise deletion): 只要某個欄位有缺、或是多個欄位有缺值，整筆資料刪除。
2. 欄位刪除(column/feature delection): 若某欄位/特徵有大量缺值，整個欄位刪除。

In [12]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna
data_p = data_2.dropna()
data_p.index = range(0, len(data_p))
data_p.tail()

# data_p = data_2.dropna(subset=['wind_speed', 'degrees'])

Unnamed: 0,serial_num,city,station,degree_of_pollution,SO2,CO,O3,PM10,PM25,NO2,wind_speed,AQI,O3_ma8,NOx,NO,degrees,CO_ma8,PM10_ma,PM25_ma,time
6971,5.0,臺北市,古亭,普通,2.8,0.41,26.0,25.0,18.0,19.0,2.2,53.0,26.0,20.0,1.0,82.0,0.0,25.0,16.0,2019-12-31 19:00:00
6972,4.0,臺北市,古亭,普通,2.0,0.32,31.0,21.0,15.0,14.0,2.4,53.0,26.0,14.0,1.0,68.0,0.0,25.0,16.0,2019-12-31 20:00:00
6973,3.0,臺北市,古亭,良好,1.6,0.24,38.0,22.0,10.0,7.9,2.8,49.0,27.0,9.0,1.0,81.0,0.0,24.0,15.0,2019-12-31 21:00:00
6974,2.0,臺北市,古亭,良好,1.4,0.25,36.0,23.0,11.0,7.3,2.0,46.0,28.0,7.0,0.0,82.0,0.0,23.0,14.0,2019-12-31 22:00:00
6975,1.0,臺北市,古亭,良好,1.5,0.27,33.0,28.0,10.0,7.9,1.5,43.0,29.0,8.0,0.0,83.0,0.0,24.0,13.0,2019-12-31 23:00:00


### method2: 針對數值型資料補固定值 
(中位數、眾數、平均值、(向前補、向後補))

In [13]:
## method 2: fillna
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html?highlight=fillna#pandas.DataFrame.fillna
# https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html
stat = data_2.agg(['min', 'max', 'mean', 'median'])
data_p = data_2.fillna(data_2.median())
# data_p = data_2.fillna(data_2.mode())
# data_p = data_2.fillna(data_2.mean())
# len(data_p)
# data_p.isnull().sum()

# data_2['SO2'] = data_2['SO2'].fillna(data_2['SO2'].mean())

### method3: 針對數值型資料，用插值法補值 
(線性插值、時序插值、(最近插值、knn插值))

In [14]:
## by index
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
# https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
data_p = data_2.interpolate(method='index', axis=0)

##by time index
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html
data_p = data_2.set_index(pd.DatetimeIndex(data['time']))
data_p = data_p.interpolate(method='time', axis=0, inplace=False)
# data_p.isna().sum()

### 針對類別資料 (較少出現)
1. 補眾數
2. 多增加一類(others)，遺失值當作一類別
3. knn插值

## Step 4 - Feature Scaling
有時候，不同資料特徵的數值範圍不同，有些可能落在1到10000，有些可能是-0.1到0.1，這類資料會使得那些基於計算空間距離或是基於計算梯度下降的模型有不好的效果，例如: 訓練時間增加、網路無法收斂等等。因此我們會試圖在資料前處理的步驟中安插這個步驟，使得不同資料特徵的數值範圍落差不要太大，並快速收斂。題外話: Decision Tree 和 Random Forest 等等基於樹結構的模型則不需要這步驟。

一般有兩種做法: Normalization 和 Standardization。Normalization 適合用在非高斯分布的資料，且適合用在那些沒有假定資料為任何分布的演算法(KNN)。Standardization 適合用在高斯分布的資料(非必要條件)，且標準化是較不受異常值影響。不過，其實可以 try and error 的方式去嘗試這兩種特徵縮放的方法，套用其中一種方法，並看看模型結果誰優誰劣，反推哪種特徵縮放較適合用在當前任務。https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/

### 資料正規化 (Normalization，將數值範圍落在[0,1]之間)

### method 1: custom function of pandas

In [15]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
#--- single column ---#
# min = data_2['SO2'].min()
# diff = data_2['SO2'].max()-data_2['SO2'].min()
# data_2['SO2_1'] = data_2['SO2'].apply(lambda x: (x-min)/diff)

#--- multiple columns ---#
data_2_norm = data_2.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)) if x.name in numerical_col else x)
# print (data_2_norm.head(1))

#--- function ---#
def norm(data):
    if data.name in numerical_col:
        min = data.min()
        diff = data.max()-data.min()
        data = (data-min)/diff
    return data
data_2_norm = data_2.apply(norm)
data_2_norm

Unnamed: 0,serial_num,city,station,degree_of_pollution,SO2,CO,O3,PM10,PM25,NO2,wind_speed,AQI,O3_ma8,NOx,NO,degrees,CO_ma8,PM10_ma,PM25_ma,time
0,8227.0,臺北市,古亭,良好,0.433962,0.235849,0.378734,0.126582,0.173913,0.212264,0.560976,0.168067,0.364865,0.178571,0.095238,0.250070,0.0,0.088235,0.136364,2019-01-01 00:00:00
1,8226.0,臺北市,古亭,良好,0.547170,0.226415,0.402628,0.177215,0.260870,0.174528,0.463415,0.176471,0.378378,0.142857,0.095238,0.238918,0.0,0.088235,0.159091,2019-01-01 01:00:00
2,8225.0,臺北市,古亭,良好,0.320755,0.216981,0.402628,0.139241,0.108696,0.195755,0.634146,0.184874,0.391892,0.160714,0.095238,0.244494,0.0,0.088235,0.159091,2019-01-01 02:00:00
3,8224.0,臺北市,古亭,良好,0.320755,0.198113,0.426523,0.088608,0.021739,0.167453,0.634146,0.201681,0.418919,0.125000,0.095238,0.233343,0.0,0.088235,0.159091,2019-01-01 03:00:00
4,8223.0,臺北市,古亭,良好,0.320755,0.188679,0.438471,0.164557,0.152174,0.127358,0.536585,0.201681,0.432432,0.089286,0.047619,0.213828,0.0,0.088235,0.159091,2019-01-01 04:00:00
5,8222.0,臺北市,古亭,良好,0.320755,0.179245,0.450418,0.113924,0.130435,0.113208,0.512195,0.210084,0.445946,0.089286,0.047619,0.244494,0.0,0.073529,0.136364,2019-01-01 05:00:00
6,8221.0,臺北市,古亭,良好,0.320755,0.179245,0.450418,0.101266,0.065217,0.108491,0.634146,0.218487,0.459459,0.089286,0.047619,0.230555,0.0,0.058824,0.136364,2019-01-01 06:00:00
7,8220.0,臺北市,古亭,良好,0.339623,0.207547,0.414576,0.101266,0.065217,0.167453,0.439024,0.218487,0.459459,0.125000,0.095238,0.236130,0.0,0.044118,0.136364,2019-01-01 07:00:00
8,8219.0,臺北市,古亭,良好,0.301887,0.216981,0.402628,0.113924,0.065217,0.195755,0.707317,0.226891,0.472973,0.160714,0.095238,0.250070,0.0,0.029412,0.113636,2019-01-01 08:00:00
9,8218.0,臺北市,古亭,良好,0.301887,0.235849,0.390681,0.189873,0.065217,0.209906,0.804878,0.226891,0.472973,0.178571,0.095238,0.236130,0.0,0.044118,0.113636,2019-01-01 09:00:00


### method 2: sklearn.preprocessing.MinMaxScaler

In [16]:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_2_norm = data_2.copy()

# fit and transform numerical columns
data_2_norm[numerical_col] = minmax_scaler.fit_transform(data_2_norm[numerical_col])
# data_2_norm

# # inverse transform
# data_2_norm[numerical_col] = minmax_scaler.inverse_transform(data_2_norm[numerical_col])
# data_2_norm

### 資料標準化 (Standardization，使數值的平均值為0，標準差為1)

### method 1: custom function of pandas

In [17]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
#--- single column ---#
# mean = data_2['SO2'].mean()
# std = data_2['SO2'].std()
# data_2['SO2_1'] = data_2['SO2'].apply(lambda x: (x-mean)/std)

#--- multiple columns ---#
data_2_stan = data_2.apply(lambda x: (x - np.mean(x)) / np.std(x) if x.name in numerical_col else x)
# print (data_2_stan.head(1))

#--- function ---#
def standardization(data):
    if data.name in numerical_col:
        mean = data.mean()
        std = data.std()
        data = (data-mean)/std
    return data
data_2_stan = data_2.apply(standardization)
data_2_stan

Unnamed: 0,serial_num,city,station,degree_of_pollution,SO2,CO,O3,PM10,PM25,NO2,wind_speed,AQI,O3_ma8,NOx,NO,degrees,CO_ma8,PM10_ma,PM25_ma,time
0,8227.0,臺北市,古亭,良好,0.117805,-0.762003,0.140363,-1.181528,-0.508487,-0.748621,1.069668,-0.954990,-0.133076,-0.737374,-0.583964,-0.563803,-0.679489,-1.224918,-0.935069,2019-01-01 00:00:00
1,8226.0,臺北市,古亭,良好,0.877422,-0.816795,0.255891,-0.910644,-0.065653,-0.943835,0.626479,-0.910184,-0.066772,-0.929479,-0.583964,-0.612665,-0.679489,-1.224918,-0.815925,2019-01-01 01:00:00
2,8225.0,臺北市,古亭,良好,-0.641813,-0.871588,0.255891,-1.113807,-0.840612,-0.834027,1.402059,-0.865379,-0.000469,-0.833427,-0.583964,-0.588234,-0.679489,-1.224918,-0.815925,2019-01-01 02:00:00
3,8224.0,臺北市,古亭,良好,-0.641813,-0.981173,0.371419,-1.384690,-1.283446,-0.980437,1.402059,-0.775768,0.132138,-1.025531,-0.583964,-0.637096,-0.679489,-1.224918,-0.815925,2019-01-01 03:00:00
4,8223.0,臺北市,古亭,良好,-0.641813,-1.035965,0.429184,-0.978365,-0.619195,-1.187851,0.958871,-0.775768,0.198442,-1.217636,-0.891715,-0.722604,-0.679489,-1.224918,-0.815925,2019-01-01 04:00:00
5,8222.0,臺北市,古亭,良好,-0.641813,-1.090758,0.486948,-1.249249,-0.729904,-1.261056,0.848073,-0.730963,0.264745,-1.217636,-0.891715,-0.588234,-0.679489,-1.298260,-0.935069,2019-01-01 05:00:00
6,8221.0,臺北市,古亭,良好,-0.641813,-1.090758,0.486948,-1.316969,-1.062029,-1.285458,1.402059,-0.686157,0.331049,-1.217636,-0.891715,-0.649311,-0.679489,-1.371602,-0.935069,2019-01-01 06:00:00
7,8220.0,臺北市,古亭,良好,-0.515210,-0.926380,0.313655,-1.316969,-1.062029,-0.980437,0.515682,-0.686157,0.331049,-1.025531,-0.583964,-0.624880,-0.679489,-1.444944,-0.935069,2019-01-01 07:00:00
8,8219.0,臺北市,古亭,良好,-0.768416,-0.871588,0.255891,-1.249249,-1.062029,-0.834027,1.734451,-0.641352,0.397352,-0.833427,-0.583964,-0.563803,-0.679489,-1.518286,-1.054214,2019-01-01 08:00:00
9,8218.0,臺北市,古亭,良好,-0.768416,-0.762003,0.198127,-0.842923,-1.062029,-0.760822,2.177639,-0.641352,0.397352,-0.737374,-0.583964,-0.624880,-0.679489,-1.444944,-1.054214,2019-01-01 09:00:00


### method 2: sklearn.preprocessing.StandardScaler

In [18]:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
stan_scaler = preprocessing.StandardScaler()
data_2_stan = data_2.copy()

# fit and transform numerical columns
data_2_stan[numerical_col] = stan_scaler.fit_transform(data_2_stan[numerical_col])
# data_2_stan

# # inverse transform
# data_2_stan[numerical_col] = stan_scaler.inverse_transform(data_2_stan[numerical_col])
# data_2_stan

### method 3: sklearn.preprocessing.scale

In [19]:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale
data_2_stan = data_2.copy()
data_2_stan[numerical_col] = preprocessing.scale(data_2_stan[numerical_col])
# data_2_stan

## Step 5 - Encode categorical columns

### method 1: Label Encoding (sklearn.preprocessing.LabelEncoder)

In [20]:
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder
le = preprocessing.LabelEncoder()
data_2_le = data_2.copy()

## 1
for col in categorical_col:
    data_2_le[col] = le.fit_transform(data_2_le[col])
    print (le.classes_)

## 2
# from collections import defaultdict
# le_dict = defaultdict(preprocessing.LabelEncoder)
# data_2_le[categorical_col] = data_2_le[categorical_col].apply(lambda x: le_dict[x.name].fit_transform(x))
# # data_2_le
# data_2_le[categorical_col] = data_2_le[categorical_col].apply(lambda x: le_dict[x.name].inverse_transform(x))
# # data_2_le

['臺北市']
['古亭']
['對所有族群不健康' '對敏感族群不健康' '普通' '良好' '設備維護']


### method 2: One Hot Encoding (pandas get dummies)
sklearn.preprocessing.OneHotEncoder (not recommended)
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.fit_transform

In [21]:
## : pandas get dummies (recommended)
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies
data_2_ohe = data_2.copy()
data_2_ohe = data_2_ohe.join(pd.get_dummies(data_2_ohe[categorical_col]))
# data_2_ohe = data_2_ohe.join(pd.get_dummies(data_2_ohe[categorical_col], drop_first=True))
data_2_ohe.head()

Unnamed: 0,serial_num,city,station,degree_of_pollution,SO2,CO,O3,PM10,PM25,NO2,wind_speed,AQI,O3_ma8,NOx,NO,degrees,CO_ma8,PM10_ma,PM25_ma,time,city_臺北市,station_古亭,degree_of_pollution_對所有族群不健康,degree_of_pollution_對敏感族群不健康,degree_of_pollution_普通,degree_of_pollution_良好,degree_of_pollution_設備維護
0,8227.0,臺北市,古亭,良好,1.8,0.25,33.0,10.0,10.0,8.6,2.7,27.0,29.0,10.0,1.0,90.0,0.0,11.0,7.0,2019-01-01 00:00:00,1,1,0,0,0,1,0
1,8226.0,臺北市,古亭,良好,2.4,0.24,35.0,14.0,14.0,7.0,2.3,28.0,30.0,8.0,1.0,86.0,0.0,11.0,8.0,2019-01-01 01:00:00,1,1,0,0,0,1,0
2,8225.0,臺北市,古亭,良好,1.2,0.23,35.0,11.0,7.0,7.9,3.0,29.0,31.0,9.0,1.0,88.0,0.0,11.0,8.0,2019-01-01 02:00:00,1,1,0,0,0,1,0
3,8224.0,臺北市,古亭,良好,1.2,0.21,37.0,7.0,3.0,6.7,3.0,31.0,33.0,7.0,1.0,84.0,0.0,11.0,8.0,2019-01-01 03:00:00,1,1,0,0,0,1,0
4,8223.0,臺北市,古亭,良好,1.2,0.2,38.0,13.0,9.0,5.0,2.6,31.0,34.0,5.0,0.0,77.0,0.0,11.0,8.0,2019-01-01 04:00:00,1,1,0,0,0,1,0
