# Preprocess raw data for a selected spot (e.g. Gyoku Sendo)

1. Three months of parkingbreak data are saved in the *data/* directory, keeping only the rows whose 18th field equals 1:
    - `gzip -dc /mnt/lv/heromiya/OkinawaVisitorPred/2018-12_2019-04_2019-08.csv.gz | awk 'BEGIN{FS=","}$18==1{print}' > data/parkingbreak1-3month.csv`


2. Subset the point data around the tourism spot 'Gyoku Sendo' and save it as *data/sendoRegion_3months.csv*. The selected region is around 127.748361,26.139007 (extent as min lon, min lat, max lon, max lat: 127.74563156,26.13900734,127.75200790,26.14219240).
    - Three months of data: 2641569 points in all regions, 20798 points in the Gyoku Sendo region.
    - The figure below shows the *extent of the selected data*.
    ![GyokuSendo.png](data/GyokuSendo.png)
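
The notebook does not show how the regional subset was produced. As a minimal sketch, a bounding-box filter in pandas could look like the following, assuming the coordinates live in columns named `lon` and `lat` (the names that appear in the table further down). The function name `subset_by_extent` and the toy rows are hypothetical.

```python
import pandas as pd

# Extent quoted in the text: (min_lon, min_lat, max_lon, max_lat)
SENDO_EXTENT = (127.74563156, 26.13900734, 127.75200790, 26.14219240)


def subset_by_extent(points, extent, lon_col="lon", lat_col="lat"):
    """Return the rows whose coordinates fall inside the box (edges included)."""
    min_lon, min_lat, max_lon, max_lat = extent
    inside = (
        points[lon_col].between(min_lon, max_lon)
        & points[lat_col].between(min_lat, max_lat)
    )
    return points[inside].copy()


# Toy points with made-up IDs: the first lies inside the extent, the second does not.
toy = pd.DataFrame({
    "serial": ["AP000001", "AP000002"],
    "lon": [127.749147, 127.700000],
    "lat": [26.141359, 26.200000],
})
region = subset_by_extent(toy, SENDO_EXTENT)
```

Applied to the full three-month table, a filter of this kind would be expected to return the 20798 regional rows, provided the original subset used the same extent and inclusive edges.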

### Read and preprocess *data/sendoRegion_3months.csv*
- The fields 'serial' and 'tlm_datage' are renamed to 'ap_id' and 'timestamp'.

In [1]:
import pandas as pd

In [3]:
csv_parking = '/mnt/lv/bidur/OkinawaVisitorPred/data/sendoRegion_3months.csv'
df = pd.read_csv(csv_parking)
# Rename the raw field names to the names used in the rest of the notebook
df.rename(columns={'serial': 'ap_id', 'tlm_datage': 'timestamp'}, inplace=True)

### Separate the timestamp field into smaller units such as month and day

In [4]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year']  = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day']   = df['timestamp'].dt.dayofweek  # day of the week: Monday=0, Tuesday=1, ..., Sunday=6
df['day_num']  = df['timestamp'].dt.day  # day of the month (1-31)
df['date']  = df['timestamp'].dt.date
df.head()

Unnamed: 0,ap_id,tripid,tripcount,timestamp,lon,lat,year,month,day,day_num,date
0,AP520017,2019-08-02 11:58:22,2452,2019-08-02 13:24:49,127.749147,26.141359,2019,8,4,2,2019-08-02
1,AP520017,2019-08-02 15:22:22,2453,2019-08-02 15:22:27,127.749147,26.141359,2019,8,4,2,2019-08-02
2,AP520017,2019-08-02 15:22:22,2453,2019-08-02 15:24:25,127.749147,26.141359,2019,8,4,2,2019-08-02
3,AP520017,2019-08-04 08:41:14,2467,2019-08-04 09:19:53,127.74913,26.141587,2019,8,6,4,2019-08-04
4,AP520017,2019-08-04 08:41:14,2467,2019-08-04 09:21:55,127.74913,26.141587,2019,8,6,4,2019-08-04
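
As a quick sanity check on the day-of-week encoding, the sketch below decomposes two of the timestamps shown above in a toy frame. The `day_name` column is an addition for readability and is not part of the original notebook.

```python
import pandas as pd

# Two timestamps taken from the output above.
toy = pd.DataFrame({"timestamp": ["2019-08-02 13:24:49", "2019-08-04 09:19:53"]})
toy["timestamp"] = pd.to_datetime(toy["timestamp"])

toy["day"] = toy["timestamp"].dt.dayofweek       # Monday=0 ... Sunday=6
toy["day_name"] = toy["timestamp"].dt.day_name()  # human-readable label
toy["day_num"] = toy["timestamp"].dt.day          # day of the month

print(toy[["timestamp", "day", "day_name", "day_num"]])
```

2 August 2019 was a Friday and 4 August 2019 a Sunday, which agrees with the `day` values 4 and 6 in the table.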


#### Which months and days of the week are present

In [5]:
df.month.unique(), df.day.unique()

(array([8, 4]), array([4, 6, 5, 1, 0, 2, 3]))

### Total data points in the Gyoku Sendo region

In [6]:
len(df)

20798

### Number of distinct cars

In [7]:
len(df.ap_id.unique())

1477

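### Number of data points by month
- `count` tallies rows, so these figures are data points per month (they sum to 20798), not distinct cars.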

In [8]:
#df.groupby(['month']).agg(['mean', 'count'])
df[['ap_id','month']].groupby(['month']).agg(['count'])

month,ap_id (count)
4,9630
8,11168
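
Counting `ap_id` with `count` tallies rows rather than distinct cars: the two monthly figures above sum to the 20798 rows of the full table. If distinct cars per month are wanted, `nunique` is one option. The sketch below shows both side by side on made-up IDs; the column labels `records` and `cars` are hypothetical.

```python
import pandas as pd

# Toy data: car "A" appears in both months, several times in April.
toy = pd.DataFrame({
    "ap_id": ["A", "A", "B", "A", "C", "C"],
    "month": [4, 4, 4, 8, 8, 8],
})

# Named aggregation: rows per month next to distinct cars per month.
per_month = toy.groupby("month")["ap_id"].agg(records="count", cars="nunique")
print(per_month)
```

The same distinction applies to any later grouping that counts `ap_id`, such as counts per day of the week or per date.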


### Number of data points by day of the week

In [9]:
df[['ap_id','month','day']].groupby(['day']).agg(['count'])

day,ap_id (count),month (count)
0,3251,3251
1,2895,2895
2,2765,2765
3,3342,3342
4,2772,2772
5,3028,3028
6,2745,2745


### Total rows per day of the week, for every column

In [10]:
df.groupby('day').count()

day,ap_id,tripid,tripcount,timestamp,lon,lat,year,month,day_num,date
0,3251,3251,3251,3251,3251,3251,3251,3251,3251,3251
1,2895,2895,2895,2895,2895,2895,2895,2895,2895,2895
2,2765,2765,2765,2765,2765,2765,2765,2765,2765,2765
3,3342,3342,3342,3342,3342,3342,3342,3342,3342,3342
4,2772,2772,2772,2772,2772,2772,2772,2772,2772,2772
5,3028,3028,3028,3028,3028,3028,3028,3028,3028,3028
6,2745,2745,2745,2745,2745,2745,2745,2745,2745,2745


### Prepare the daily counts and save them as CSV

In [11]:
# Count the records per date; count() tallies rows, so 'car_count' is the number of data points per day
df_new = df[['ap_id','date', 'month', 'day']].groupby(['date', 'month', 'day']).count()
df_new.rename(columns = { 'ap_id' :'car_count'}, inplace = True)
# 'date', 'month' and 'day' become the index -> convert them back to ordinary columns
df_new.reset_index(inplace=True)
df_new.head()

Unnamed: 0,date,month,day,car_count
0,2019-04-01,4,0,276
1,2019-04-02,4,1,232
2,2019-04-03,4,2,353
3,2019-04-04,4,3,377
4,2019-04-05,4,4,273


In [12]:
len(df_new)

61
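
The 61 rows match the 30 days of April plus the 31 days of August, which suggests that no calendar day is missing in those two months. A minimal sketch of an explicit check follows; the helper `missing_dates` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_dates(daily, start, end, date_col="date"):
    """Return the calendar days in [start, end] that have no row in `daily`."""
    expected = pd.date_range(start, end, freq="D")
    observed = pd.DatetimeIndex(pd.to_datetime(daily[date_col]))
    return expected.difference(observed)


# Toy daily table with 3 April deliberately left out.
toy = pd.DataFrame({
    "date": ["2019-04-01", "2019-04-02", "2019-04-04"],
    "car_count": [10, 12, 9],
})
gaps = missing_dates(toy, "2019-04-01", "2019-04-04")
print(gaps)
```

On the real table the check could be run once per month, for example from 2019-04-01 to 2019-04-30 and from 2019-08-01 to 2019-08-31.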

In [13]:
df_new.to_csv("data/sendoPreprocessed.csv",index=False)
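
When the saved file is read back, the `date` column comes in as plain strings unless it is parsed explicitly. The sketch below shows one way to restore it, using an in-memory buffer in place of *data/sendoPreprocessed.csv* and two toy rows that mimic the saved layout.

```python
import io

import pandas as pd

# Toy rows in the same layout as the saved file.
daily = pd.DataFrame({
    "date": ["2019-04-01", "2019-04-02"],
    "month": [4, 4],
    "day": [0, 1],
    "car_count": [276, 232],
})

# Write without the index, as in the notebook, then read it back.
buffer = io.StringIO()
daily.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates turns the 'date' strings back into datetime64 values.
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```

With a real file, the buffer would simply be replaced by the file path.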