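# Predicting Wait Times for the Call Taxi Service for People with Disabilities
## Step 1. Data preprocessing

## 0. Mission

* (1) Explore the data and gather information
    * The data covers Seoul's call taxi service for people with disabilities from 2015-01-01 to 2022-12-31. Our goal is to predict the average wait time.
    * Explore which cycles and changes appear in the service data by day of week, month, season, and year.
* (2) Build a data structure for analysis
    * Problem definition:
        * Once the call taxi service has closed for the day, we want to predict the next day's wait time.

        * If the next day's wait time can be predicted, dispatching can be adjusted within a certain range, and the result can also inform future policies for improving mobility for people with limited access to transport.
    * Build a data structure for this purpose.
        * The unit of analysis is one day.
        * Data provided: call taxi service records, Seoul weather
        * The weather data are actual measurements, but we treat them as the forecast for the following day.
            * For example,
                * the weather data for 2020-12-23 is treated as the forecast issued the day before (December 22).
                * The ride data of 2020-12-22 is used to predict the wait time of the 23rd, and the weather data to consider is that of the 23rd.
        * Use the ride data as the base and attach the weather data to it.
        * Holiday data is obtained through a package.
    * Feature Engineering
        * Identify factors that affect the wait time (form hypotheses) and turn them into features.
        * Create new features rather than using the data only as given.
            * Date-related features: day of week, month, season ...
            * Features that reflect time-series behavior: average wait time over the last 7 days ...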




## 1. Environment setup

### (1) Import libraries

* **Detailed requirements**
    - The code already imports the libraries that are needed by default.
    - Add any other libraries you consider necessary.

In [270]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import joblib

# Add any further libraries you need below.



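### (2) Load the data
* Datasets provided
    * Call taxi service records: open_data.csv
    * Weather data: weather.csv
* Choose one of the following two options.
    * 1) Local run (Anaconda)
        * Download the provided archive and extract it.
        * Create a project folder in the Anaconda root directory (usually C:/Users/< ID >) and copy the files into it.
    * 2) Google Colab
        * Create a project folder directly under your Google Drive, and
        * copy the data files into it.

#### 1) Local run (Anaconda)
* If the required files are in the project folder and this file was opened from there, no separate path is needed.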

In [271]:
path = ''

#### 2) Google Colab run

* Connect Google Drive

In [272]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [273]:
path = '/content/drive/MyDrive/6_mini/3,4일차/'

#### 3) Read the data

In [274]:
file1 = 'open_data.csv'
file2 = 'weather.csv'

In [275]:
data1 = pd.read_csv(path+file1)
data2 = pd.read_csv(path+file2)

#### 4) Check the basic information
* Use .info() and similar methods to check the basic information.

In [276]:
data1.head()

Unnamed: 0,기준일,차량운행,접수건,탑승건,평균대기시간,평균요금,평균승차거리
0,2015-01-01,213,1023,924,23.2,2427,10764
1,2015-01-02,420,3158,2839,17.2,2216,8611
2,2015-01-03,209,1648,1514,26.2,2377,10198
3,2015-01-04,196,1646,1526,24.5,2431,10955
4,2015-01-05,421,4250,3730,26.2,2214,8663


In [277]:
data2.head()

Unnamed: 0,Date,temp_max,temp_min,rain(mm),humidity_max(%),humidity_min(%),sunshine(MJ/m2)
0,2012-01-01,0.4,-6.6,0.0,77.0,45.0,4.9
1,2012-01-02,-1.2,-8.3,0.0,80.0,48.0,6.16
2,2012-01-03,-0.4,-6.6,0.4,86.0,45.0,4.46
3,2012-01-04,-4.6,-9.5,0.0,66.0,38.0,8.05
4,2012-01-05,-1.4,-9.6,0.0,71.0,28.0,9.14


In [278]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2922 entries, 0 to 2921
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   기준일     2922 non-null   object 
 1   차량운행    2922 non-null   int64  
 2   접수건     2922 non-null   int64  
 3   탑승건     2922 non-null   int64  
 4   평균대기시간  2922 non-null   float64
 5   평균요금    2922 non-null   int64  
 6   평균승차거리  2922 non-null   int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 159.9+ KB


In [279]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4018 entries, 0 to 4017
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             4018 non-null   object 
 1   temp_max         4018 non-null   float64
 2   temp_min         4018 non-null   float64
 3   rain(mm)         4018 non-null   float64
 4   humidity_max(%)  4018 non-null   float64
 5   humidity_min(%)  4018 non-null   float64
 6   sunshine(MJ/m2)  4018 non-null   float64
dtypes: float64(6), object(1)
memory usage: 219.9+ KB
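Beyond `.info()`, it can help to confirm that each table really has one row per calendar day before the two are joined. The following is a minimal sketch on a toy table; the helper name `date_coverage` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def date_coverage(frame: pd.DataFrame, date_col: str = "Date") -> dict:
    """Summarize the date range of a daily table and list missing or repeated days."""
    dates = pd.to_datetime(frame[date_col])
    expected = pd.date_range(dates.min(), dates.max(), freq="D")
    return {
        "start": dates.min(),
        "end": dates.max(),
        "missing": expected.difference(pd.DatetimeIndex(dates)).tolist(),
        "duplicated": dates[dates.duplicated()].tolist(),
    }


# Toy table: 2015-01-03 is absent and 2015-01-02 appears twice.
toy = pd.DataFrame({"Date": ["2015-01-01", "2015-01-02", "2015-01-02", "2015-01-04"]})
report = date_coverage(toy)
print(report)
```

In this notebook the call would presumably be `date_coverage(data1, "기준일")` for the ride data (before the columns are renamed) and `date_coverage(data2)` for the weather data.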


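#### 5) Rename the columns to English
* This step is not strictly necessary, but renaming to English is recommended to make the data easier to handle and to avoid unnecessary warning messages in charts.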


In [280]:
data1 = data1.rename(columns={'기준일': 'Date', 
                              '차량운행': 'vehicle_operation', 
                              '접수건': 'reception_cases', 
                              '탑승건': 'boarding_cases', 
                              '평균대기시간': 'average_wait_time', 
                              '평균요금': 'average_fare', 
                              '평균승차거리': 'average_boarding_distance'})

In [281]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2922 entries, 0 to 2921
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Date                       2922 non-null   object 
 1   vehicle_operation          2922 non-null   int64  
 2   reception_cases            2922 non-null   int64  
 3   boarding_cases             2922 non-null   int64  
 4   average_wait_time          2922 non-null   float64
 5   average_fare               2922 non-null   int64  
 6   average_boarding_distance  2922 non-null   int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 159.9+ KB


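## 2. Basic data exploration

* **Detailed requirements**
    * Examine the pattern of each metric by date component.
        * Daily, by day of week, by week of year, by month, by year
        * Requests received, boardings, distance, fare, wait time, etc.
    * Where possible, explore beyond the scope listed here.

### (1) Add date variables for period-based analysis
* Copy the data.
* Add the day of week, week of year, month, and year to the copied df.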

In [282]:
data1["Date"] = pd.to_datetime(data1["Date"])
data2["Date"] = pd.to_datetime(data2["Date"])
df = data1.copy() 

In [283]:
df["요일"] = df["Date"].dt.day_name()
df["주차"] = df["Date"].dt.isocalendar().week
df["월"] = df["Date"].dt.month
df["연도"] = df["Date"].dt.year
df

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,요일,주차,월,연도
0,2015-01-01,213,1023,924,23.2,2427,10764,Thursday,1,1,2015
1,2015-01-02,420,3158,2839,17.2,2216,8611,Friday,1,1,2015
2,2015-01-03,209,1648,1514,26.2,2377,10198,Saturday,1,1,2015
3,2015-01-04,196,1646,1526,24.5,2431,10955,Sunday,1,1,2015
4,2015-01-05,421,4250,3730,26.2,2214,8663,Monday,2,1,2015
...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-27,669,5635,4654,44.4,2198,8178,Tuesday,52,12,2022
2918,2022-12-28,607,5654,4648,44.8,2161,7882,Wednesday,52,12,2022
2919,2022-12-29,581,5250,4247,52.5,2229,8433,Thursday,52,12,2022
2920,2022-12-30,600,5293,4200,38.3,2183,8155,Friday,52,12,2022


### (2) Daily

* Vehicles in service

In [284]:
df[["Date", "vehicle_operation"]]

Unnamed: 0,Date,vehicle_operation
0,2015-01-01,213
1,2015-01-02,420
2,2015-01-03,209
3,2015-01-04,196
4,2015-01-05,421
...,...,...
2917,2022-12-27,669
2918,2022-12-28,607
2919,2022-12-29,581
2920,2022-12-30,600


* Requests received, boardings

In [285]:
df[["Date", "reception_cases", "boarding_cases"]]

Unnamed: 0,Date,reception_cases,boarding_cases
0,2015-01-01,1023,924
1,2015-01-02,3158,2839
2,2015-01-03,1648,1514
3,2015-01-04,1646,1526
4,2015-01-05,4250,3730
...,...,...,...
2917,2022-12-27,5635,4654
2918,2022-12-28,5654,4648
2919,2022-12-29,5250,4247
2920,2022-12-30,5293,4200


* Wait time

In [286]:
df[["Date", "average_wait_time"]]

Unnamed: 0,Date,average_wait_time
0,2015-01-01,23.2
1,2015-01-02,17.2
2,2015-01-03,26.2
3,2015-01-04,24.5
4,2015-01-05,26.2
...,...,...
2917,2022-12-27,44.4
2918,2022-12-28,44.8
2919,2022-12-29,52.5
2920,2022-12-30,38.3


* Fare

In [287]:
df[["Date", "average_fare"]]

Unnamed: 0,Date,average_fare
0,2015-01-01,2427
1,2015-01-02,2216
2,2015-01-03,2377
3,2015-01-04,2431
4,2015-01-05,2214
...,...,...
2917,2022-12-27,2198
2918,2022-12-28,2161
2919,2022-12-29,2229
2920,2022-12-30,2183


* Ride distance

In [288]:
df[["Date", "average_boarding_distance"]]

Unnamed: 0,Date,average_boarding_distance
0,2015-01-01,10764
1,2015-01-02,8611
2,2015-01-03,10198
3,2015-01-04,10955
4,2015-01-05,8663
...,...,...
2917,2022-12-27,8178
2918,2022-12-28,7882
2919,2022-12-29,8433
2920,2022-12-30,8155


### (3) By day of week

* Vehicles in service

In [289]:
df.groupby("요일")[["vehicle_operation"]].sum().sort_values("vehicle_operation",ascending=False)

Unnamed: 0_level_0,vehicle_operation
요일,Unnamed: 1_level_1
Thursday,206525
Tuesday,206447
Friday,202775
Monday,202251
Wednesday,202131
Saturday,109590
Sunday,93728


* Requests received, boardings

In [290]:
df.groupby("요일")[["reception_cases"]].sum().sort_values("reception_cases",ascending=False)

Unnamed: 0_level_0,reception_cases
요일,Unnamed: 1_level_1
Tuesday,2001969
Thursday,1989963
Wednesday,1976302
Monday,1966172
Friday,1961284
Saturday,864876
Sunday,710060


In [291]:
df.groupby("요일")[["boarding_cases"]].sum().sort_values("boarding_cases",ascending=False)

Unnamed: 0_level_0,boarding_cases
요일,Unnamed: 1_level_1
Tuesday,1679172
Thursday,1662888
Monday,1651838
Wednesday,1650059
Friday,1620747
Saturday,708025
Sunday,622813


* Wait time

In [292]:
df.groupby("요일")[["average_wait_time"]].sum().sort_values("average_wait_time",ascending=False)

Unnamed: 0_level_0,average_wait_time
요일,Unnamed: 1_level_1
Saturday,18174.7
Wednesday,17379.0
Thursday,17156.3
Friday,17119.8
Tuesday,17024.5
Monday,16364.4
Sunday,14554.5


* Fare

In [293]:
df.groupby("요일")[["average_fare"]].sum().sort_values("average_fare",ascending=False)

Unnamed: 0_level_0,average_fare
요일,Unnamed: 1_level_1
Sunday,1027714
Saturday,1015590
Thursday,943161
Friday,938838
Wednesday,938426
Tuesday,935250
Monday,934105


* Ride distance

In [294]:
df.groupby("요일")[["average_boarding_distance"]].sum().sort_values("average_boarding_distance",ascending=False)

Unnamed: 0_level_0,average_boarding_distance
요일,Unnamed: 1_level_1
Sunday,4479112
Saturday,4360354
Thursday,3672925
Friday,3650672
Wednesday,3649600
Tuesday,3616271
Monday,3609793


### (4) By month

* Vehicles in service

In [295]:
df.groupby("월")[["vehicle_operation"]].sum().sort_values("vehicle_operation",ascending=False)

Unnamed: 0_level_0,vehicle_operation
월,Unnamed: 1_level_1
7,108360
11,108116
8,107756
10,107369
9,107139
6,103422
5,102562
12,101475
4,99885
3,98892


* Requests received, boardings

In [296]:
df.groupby("월")[["reception_cases"]].sum().sort_values("reception_cases",ascending=False)

Unnamed: 0_level_0,reception_cases
월,Unnamed: 1_level_1
11,1018558
7,1016710
8,995711
10,993912
12,975026
5,969822
6,960313
4,951475
9,951002
3,915450


In [297]:
df.groupby("월")[["boarding_cases"]].sum().sort_values("boarding_cases",ascending=False)

Unnamed: 0_level_0,boarding_cases
월,Unnamed: 1_level_1
7,852590
11,839640
8,830927
10,822579
5,813052
6,809481
4,806054
12,794113
9,789108
3,785868


* Wait time

In [298]:
df.groupby("월")[["average_wait_time"]].sum().sort_values("average_wait_time",ascending=False)

Unnamed: 0_level_0,average_wait_time
월,Unnamed: 1_level_1
12,11584.9
11,11295.2
10,11031.1
5,10213.0
9,10167.4
7,9928.3
6,9691.6
8,9565.1
4,9510.9
3,8725.2


* Fare

In [299]:
df.groupby("월")[["average_fare"]].sum().sort_values("average_fare",ascending=False)

Unnamed: 0_level_0,average_fare
월,Unnamed: 1_level_1
10,579024
5,576900
3,568317
7,568021
8,566975
12,566954
1,565345
9,559580
4,554571
11,554202


* Ride distance

In [300]:
df.groupby("월")[["average_boarding_distance"]].sum().sort_values("average_boarding_distance",ascending=False)

Unnamed: 0_level_0,average_boarding_distance
월,Unnamed: 1_level_1
10,2352423
5,2346687
9,2278324
7,2269586
3,2264138
8,2263514
12,2250359
1,2243944
4,2235230
6,2230236


### (5) By year

* Vehicles in service

In [301]:
df.groupby("연도")[["vehicle_operation"]].sum().sort_values("vehicle_operation",ascending=False)

Unnamed: 0_level_0,vehicle_operation
연도,Unnamed: 1_level_1
2022,179178
2021,170919
2020,152447
2017,147970
2019,145660
2018,145182
2016,142855
2015,139236


* Wait time

In [302]:
df.groupby("연도")[["average_wait_time"]].sum().sort_values("average_wait_time",ascending=False)

Unnamed: 0_level_0,average_wait_time
연도,Unnamed: 1_level_1
2018,20557.6
2019,19511.3
2017,16112.2
2016,14007.7
2022,13675.1
2015,12431.7
2021,11213.1
2020,10264.5
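The tables in this section rank each period by `sum()`. For columns that are already daily averages (wait time, fare, distance), a sum also grows with the number of days in the group, so months of different lengths are not directly comparable. A mean per group may be easier to read. The sketch below shows this for the weekday case on toy data; the helper name `mean_by_weekday` and the `WEEKDAYS` constant are hypothetical, and the Monday-to-Sunday ordering is a presentation choice.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def mean_by_weekday(frame: pd.DataFrame, value_col: str, date_col: str = "Date") -> pd.Series:
    """Average a daily column per weekday, returned in Monday-to-Sunday order."""
    weekday = pd.Series(
        pd.Categorical(frame[date_col].dt.day_name(), categories=WEEKDAYS, ordered=True),
        index=frame.index,
        name="weekday",
    )
    return frame.groupby(weekday, observed=False)[value_col].mean()


# Toy data: two full weeks starting on Monday 2015-01-05, with values 1..14.
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-05", periods=14, freq="D"),
    "average_wait_time": [float(v) for v in range(1, 15)],
})
result = mean_by_weekday(toy, "average_wait_time")
print(result)
```

The same idea carries over to the month and year groupings by replacing `.sum()` with `.mean()` in the cells above.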


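## 3. Build the data structure

* **Detailed requirements**
    * Conditions:
        * Goal: on the evening before, predict the next day's average wait time.
        * The weather data are actual measurements, but we treat them as the forecast for the following day.
            * For example,
                * the weather data for 2020-12-23 is treated as the forecast issued the day before (December 22).
                * The ride data of 2020-12-22 is used to predict the wait time of the 23rd, and the weather data to consider is that of the 23rd.
    * Attach the weather data to the ride data, using the ride data as the base.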

In [303]:
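# Left join so that every row of the ride data is kept, as the requirement above asks.
data = pd.merge(data1, data2, on='Date', how='left')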

In [304]:
data.head()

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,temp_max,temp_min,rain(mm),humidity_max(%),humidity_min(%),sunshine(MJ/m2)
0,2015-01-01,213,1023,924,23.2,2427,10764,-4.3,-9.8,0.0,52.0,33.0,9.79
1,2015-01-02,420,3158,2839,17.2,2216,8611,-2.0,-8.9,0.0,63.0,28.0,9.07
2,2015-01-03,209,1648,1514,26.2,2377,10198,2.4,-9.2,0.0,73.0,37.0,8.66
3,2015-01-04,196,1646,1526,24.5,2431,10955,8.2,0.2,0.0,89.0,58.0,5.32
4,2015-01-05,421,4250,3730,26.2,2214,8663,7.9,-0.9,0.0,95.0,52.0,6.48


### (1) Create the target
* The target is the average wait time of the day being predicted, that is, the day after each row's date.

In [305]:
data["target"] = data["average_wait_time"].shift(-1)
data.head()

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,temp_max,temp_min,rain(mm),humidity_max(%),humidity_min(%),sunshine(MJ/m2),target
0,2015-01-01,213,1023,924,23.2,2427,10764,-4.3,-9.8,0.0,52.0,33.0,9.79,17.2
1,2015-01-02,420,3158,2839,17.2,2216,8611,-2.0,-8.9,0.0,63.0,28.0,9.07,26.2
2,2015-01-03,209,1648,1514,26.2,2377,10198,2.4,-9.2,0.0,73.0,37.0,8.66,24.5
3,2015-01-04,196,1646,1526,24.5,2431,10955,8.2,0.2,0.0,89.0,58.0,5.32,26.2
4,2015-01-05,421,4250,3730,26.2,2214,8663,7.9,-0.9,0.0,95.0,52.0,6.48,23.6


In [306]:
data.tail()

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,temp_max,temp_min,rain(mm),humidity_max(%),humidity_min(%),sunshine(MJ/m2),target
2917,2022-12-27,669,5635,4654,44.4,2198,8178,3.0,-7.3,0.0,86.0,51.0,10.25,44.8
2918,2022-12-28,607,5654,4648,44.8,2161,7882,-0.3,-5.4,0.1,92.0,40.0,10.86,52.5
2919,2022-12-29,581,5250,4247,52.5,2229,8433,1.7,-7.8,0.0,71.0,34.0,10.88,38.3
2920,2022-12-30,600,5293,4200,38.3,2183,8155,2.1,-4.0,0.0,87.0,38.0,10.84,33.7
2921,2022-12-31,263,2167,1806,33.7,2318,9435,-4.4,-4.4,0.0,66.0,66.0,0.0,


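### (2) Attach the weather data
* Use merge, with the ride data as the base. The tables were merged above; here each weather column is shifted up by one row so that every row carries the next day's weather as its forecast.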

In [307]:
data["temp_max_forecast"] = data["temp_max"].shift(-1)
data["temp_min_forecast"] = data["temp_min"].shift(-1)
data["rain(mm)_forecast"] = data["rain(mm)"].shift(-1)
data["humidity_max(%)_forecast"] = data["humidity_max(%)"].shift(-1)
data["humidity_min(%)_forecast"] = data["humidity_min(%)"].shift(-1)
data["sunshine(MJ/m2)_forecast"] = data["sunshine(MJ/m2)"].shift(-1)
data = data.drop(["temp_max","temp_min","rain(mm)","humidity_max(%)","humidity_min(%)","sunshine(MJ/m2)"],axis=1)
data = data.dropna(axis=0)
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2916,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25
2917,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86
2918,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88
2919,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84
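`shift(-1)` pairs each row with the next row, which matches "the next day" only when the table has exactly one row per consecutive date. An alternative is to align by date arithmetic instead of row position. The sketch below illustrates that idea on toy data; the helper name `attach_next_day` is hypothetical, and it is offered as one possible variant rather than the method the notebook uses.

```python
import pandas as pd


def attach_next_day(rides: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Attach next-day weather (as a forecast) and next-day wait time (as target) by date."""
    one_day = pd.Timedelta(days=1)

    # Weather observed on day t+1 is re-keyed to day t, so it acts as the forecast seen on day t.
    forecast = weather.copy()
    forecast["Date"] = forecast["Date"] - one_day
    forecast = forecast.rename(columns=lambda c: c if c == "Date" else f"{c}_forecast")

    # The wait time of day t+1 is re-keyed to day t in the same way.
    target = rides[["Date", "average_wait_time"]].copy()
    target["Date"] = target["Date"] - one_day
    target = target.rename(columns={"average_wait_time": "target"})

    out = rides.merge(target, on="Date", how="left").merge(forecast, on="Date", how="left")
    # Rows without a following day have no target and are dropped.
    return out.dropna(subset=["target"]).reset_index(drop=True)


rides = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-21", "2020-12-22", "2020-12-23"]),
    "average_wait_time": [30.0, 35.0, 40.0],
})
weather = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-21", "2020-12-22", "2020-12-23"]),
    "temp_max": [1.0, 2.0, 3.0],
})
aligned = attach_next_day(rides, weather)
print(aligned)
```

In the toy result, the row for 2020-12-22 carries the wait time and the weather of the 23rd, which mirrors the example in the requirements.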


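### (3) Create new features
* Add date-related variables: day of week, month, season, year
* Derive other new features: at least two
    * Examples: public holidays, average wait time over the last 7 days, boarding rate, etc.

#### 1) Add date-related variables: day of week, month, season, year
* When you build variables from weekday, season, or month names, create them as categoricals with pd.Categorical and specify the order, so that later charts display them in sequence.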
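As a minimal sketch of the ordered-categorical advice above, the following builds weekday, season, and year columns on toy dates. The function name `add_calendar_features`, the column names `weekday` and `season`, and the month-to-season mapping (Mar-May spring, Jun-Aug summer, Sep-Nov autumn, Dec-Feb winter) are assumptions made for illustration.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
SEASONS = ["spring", "summer", "autumn", "winter"]
# Assumed mapping: meteorological seasons.
MONTH_TO_SEASON = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}


def add_calendar_features(frame: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Return a copy with ordered weekday and season categoricals plus the year."""
    out = frame.copy()
    dates = out[date_col]
    out["weekday"] = pd.Categorical(dates.dt.day_name(), categories=WEEKDAYS, ordered=True)
    out["season"] = pd.Categorical(
        dates.dt.month.map(MONTH_TO_SEASON), categories=SEASONS, ordered=True
    )
    out["year"] = dates.dt.year
    return out


toy = pd.DataFrame({"Date": pd.to_datetime(["2015-01-01", "2015-07-15"])})
features = add_calendar_features(toy)
print(features)
```

Because the categories are ordered, `sort_values("weekday")` and grouped charts follow Monday-to-Sunday order instead of alphabetical order.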


In [308]:
pd.Categorical(data["Date"].dt.month)

[1, 1, 1, 1, 1, ..., 12, 12, 12, 12, 12]
Length: 2921
Categories (12, int64): [1, 2, 3, 4, ..., 9, 10, 11, 12]

In [309]:
data["month"] = pd.Categorical(data["Date"].dt.month)
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2916,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12
2917,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12
2918,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12
2919,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12


In [310]:
data["Boarding rate"] = data['boarding_cases']/data["reception_cases"]
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2916,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983
2917,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909
2918,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073
2919,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952


#### 2) Public holiday information
* Install the workalendar package and pull the public holiday information for South Korea.

* Install the holiday data package

In [311]:
!pip install workalendar

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


* Basic usage

In [312]:
from workalendar.asia import SouthKorea
cal = SouthKorea()
pd.DataFrame(cal.holidays(2023))

Unnamed: 0,0,1
0,2023-01-01,New year
1,2023-01-21,Korean New Year's Day
2,2023-01-22,Korean New Year's Day
3,2023-01-23,Korean New Year's Day
4,2023-03-01,Independence Day
5,2023-05-05,Children's Day
6,2023-05-26,Buddha's Birthday
7,2023-06-06,Memorial Day
8,2023-08-15,Liberation Day
9,2023-09-28,Midautumn Festival


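* Build the holiday dataset for 2015 ~ 2022
* Manually add dates that are actual non-working days but are missing from the workalendar library.
    * Such dates can be found by identifying how the number of call taxi requests changes on non-working days and then querying the data on that basis.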

In [313]:
# Collect the holidays of each year from 2015 to 2022 into a single DataFrame.
holiday = pd.concat([pd.DataFrame(cal.holidays(year)) for year in range(2015, 2023)])
holiday

Unnamed: 0,0,1
0,2015-01-01,New year
1,2015-02-18,Korean New Year's Day
2,2015-02-19,Korean New Year's Day
3,2015-02-20,Korean New Year's Day
4,2015-03-01,Independence Day
...,...,...
10,2022-09-10,Midautumn Festival
11,2022-09-11,Midautumn Festival
12,2022-10-03,National Foundation Day
13,2022-10-09,Hangul Day


In [314]:
holiday=holiday.rename(columns={0:"Date",1:"holiday"})
holiday

Unnamed: 0,Date,holiday
0,2015-01-01,New year
1,2015-02-18,Korean New Year's Day
2,2015-02-19,Korean New Year's Day
3,2015-02-20,Korean New Year's Day
4,2015-03-01,Independence Day
...,...,...
10,2022-09-10,Midautumn Festival
11,2022-09-11,Midautumn Festival
12,2022-10-03,National Foundation Day
13,2022-10-09,Hangul Day
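The requirement above suggests locating non-working days that workalendar does not list by looking at how request counts behave. One possible heuristic is sketched below on toy data: flag Monday-to-Friday dates whose request count falls well below the typical weekday level and that are not already known holidays. The helper name `suspect_holidays` and the `ratio=0.6` threshold are assumptions, and any dates it returns would still need to be checked by hand.

```python
import pandas as pd


def suspect_holidays(frame: pd.DataFrame, known_dates, ratio: float = 0.6) -> list:
    """List Mon-Fri dates with unusually few requests that are not already known holidays."""
    known = pd.to_datetime(pd.Series(list(known_dates)))
    weekdays = frame[frame["Date"].dt.dayofweek < 5]
    # Assumed rule: "unusually few" means below ratio * median weekday requests.
    threshold = ratio * weekdays["reception_cases"].median()
    is_low = weekdays["reception_cases"] < threshold
    is_known = weekdays["Date"].isin(known)
    return weekdays.loc[is_low & ~is_known, "Date"].tolist()


# Toy week (Mon-Fri): two low-demand days, one of which is already a known holiday.
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-05", periods=5, freq="D"),
    "reception_cases": [4000, 4100, 1500, 4200, 1400],
})
candidates = suspect_holidays(toy, known_dates=["2015-01-09"])
print(candidates)
```

With the notebook's tables, the call would presumably look like `suspect_holidays(data, holiday["Date"])`.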


In [315]:
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2916,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983
2917,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909
2918,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073
2919,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952


In [316]:
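holiday["Date"] = pd.to_datetime(holiday["Date"])
# Attach the holiday names by date; dates without a holiday get NaN.
data = pd.merge(data,holiday,on="Date",how="outer")
# Convert the holiday name into a 0/1 flag (1 = holiday, 0 = not a holiday).
data["holiday"] = np.where(data["holiday"].isnull(),0,1)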
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate,holiday
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226,1
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987,0
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689,0
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096,0
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983,0
2918,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909,0
2919,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073,0
2920,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952,0
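Note that the table has 2921 rows before this merge and 2922 rows in the `data.info()` output that follows, which suggests that one date appears twice in `holiday` and that its row was duplicated by the join. An outer join would also add rows for holiday dates that are absent from the ride data. A membership test avoids both effects because it never changes the row count. The sketch below shows the idea on toy data; the helper name `add_holiday_flag` is hypothetical.

```python
import pandas as pd


def add_holiday_flag(frame: pd.DataFrame, holiday_dates, date_col: str = "Date") -> pd.DataFrame:
    """Return a copy with holiday = 1 on listed dates and 0 elsewhere, keeping the row count."""
    unique_dates = pd.to_datetime(pd.Series(list(holiday_dates))).drop_duplicates()
    out = frame.copy()
    out["holiday"] = out[date_col].isin(unique_dates).astype(int)
    return out


days = pd.DataFrame({"Date": pd.date_range("2015-01-01", periods=3, freq="D")})
# The same date listed twice (two holidays on one day) plus a date outside the table.
listed = ["2015-01-01", "2015-01-01", "2016-05-05"]
flagged = add_holiday_flag(days, listed)
print(flagged)
```

Applied to this notebook, `add_holiday_flag(data, holiday["Date"])` would be the corresponding call.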


In [317]:
data["holiday eve"] = data["holiday"].shift(-1)
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate,holiday,holiday eve
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226,1,0.0
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987,0,0.0
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689,0,0.0
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096,0,0.0
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983,0,0.0
2918,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909,0,0.0
2919,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073,0,0.0
2920,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952,0,0.0


In [318]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2922 entries, 0 to 2921
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Date                       2922 non-null   datetime64[ns]
 1   vehicle_operation          2922 non-null   int64         
 2   reception_cases            2922 non-null   int64         
 3   boarding_cases             2922 non-null   int64         
 4   average_wait_time          2922 non-null   float64       
 5   average_fare               2922 non-null   int64         
 6   average_boarding_distance  2922 non-null   int64         
 7   target                     2922 non-null   float64       
 8   temp_max_forecast          2922 non-null   float64       
 9   temp_min_forecast          2922 non-null   float64       
 10  rain(mm)_forecast          2922 non-null   float64       
 11  humidity_max(%)_forecast   2922 non-null   float64       
 12  humidi

In [319]:
data["holiday eve"]=data["holiday eve"].fillna(1)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2922 entries, 0 to 2921
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Date                       2922 non-null   datetime64[ns]
 1   vehicle_operation          2922 non-null   int64         
 2   reception_cases            2922 non-null   int64         
 3   boarding_cases             2922 non-null   int64         
 4   average_wait_time          2922 non-null   float64       
 5   average_fare               2922 non-null   int64         
 6   average_boarding_distance  2922 non-null   int64         
 7   target                     2922 non-null   float64       
 8   temp_max_forecast          2922 non-null   float64       
 9   temp_min_forecast          2922 non-null   float64       
 10  rain(mm)_forecast          2922 non-null   float64       
 11  humidity_max(%)_forecast   2922 non-null   float64       
 12  humidi

* Combine the holiday information with the existing data.
* Store 0 for dates that are not holidays.

#### 3) 7-day moving average of the wait time
* Use rolling().mean()

In [320]:
# 7-day trailing mean of the daily average wait time (the first 6 rows have no full window and are NaN).
data["average_wait_time_7"] = data['average_wait_time'].rolling(7).mean()

In [321]:
data["average_wait_time_7"].mean()

40.31670585929845

In [322]:
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate,holiday,holiday eve,average_wait_time_7
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226,1,0.0,
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987,0,0.0,
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689,0,0.0,
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096,0,0.0,
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647,0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983,0,0.0,43.485714
2918,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909,0,0.0,42.771429
2919,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073,0,0.0,43.514286
2920,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952,0,0.0,42.957143


In [323]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=4,weights="distance")
data["average_wait_time_7"] = imputer.fit_transform(data["average_wait_time_7"].values.reshape(-1, 1))
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate,holiday,holiday eve,average_wait_time_7
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226,1,0.0,40.316706
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987,0,0.0,40.316706
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689,0,0.0,40.316706
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096,0,0.0,40.316706
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647,0,0.0,40.316706
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983,0,0.0,43.485714
2918,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909,0,0.0,42.771429
2919,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073,0,0.0,43.514286
2920,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952,0,0.0,42.957143
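In the output above, the first rows of `average_wait_time_7` are filled with 40.316706, which equals the column mean printed earlier, while the actual wait times in early January 2015 are in the range of 17 to 26 minutes. If values closer to the local level are preferred, one option is to let the rolling window shrink at the start of the series with `min_periods=1`, so that no imputation is needed. This is a sketch of that alternative, not the notebook's method; the helper name `trailing_mean` is hypothetical.

```python
import pandas as pd


def trailing_mean(values: pd.Series, window: int = 7) -> pd.Series:
    """Trailing mean over up to `window` most recent values, including the current one."""
    # min_periods=1 averages whatever is available when fewer than `window` values exist.
    return values.rolling(window, min_periods=1).mean()


# First five wait times from the ride data, with a short window to keep the example small.
wait = pd.Series([23.2, 17.2, 26.2, 24.5, 26.2])
smoothed = trailing_mean(wait, window=3)
print(smoothed)
```

In the notebook this would presumably become `data["average_wait_time_7"] = trailing_mean(data["average_wait_time"])`.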


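#### 4) Boarding rate and day of week
* The boarding rate was already added above as `Boarding rate`; the cell below adds the day of week.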

In [324]:
data["day"] = pd.Categorical(data["Date"].dt.day_name())
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate,holiday,holiday eve,average_wait_time_7,day
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226,1,0.0,40.316706,Thursday
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987,0,0.0,40.316706,Friday
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689,0,0.0,40.316706,Saturday
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096,0,0.0,40.316706,Sunday
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647,0,0.0,40.316706,Monday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983,0,0.0,43.485714,Monday
2918,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909,0,0.0,42.771429,Tuesday
2919,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073,0,0.0,43.514286,Wednesday
2920,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952,0,0.0,42.957143,Thursday


## 4. Save the data
* **Detailed requirements**
    * Use joblib to save the cleaned DataFrame to the working path.
        * File name: data1.pkl

In [325]:
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate,holiday,holiday eve,average_wait_time_7,day
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226,1,0.0,40.316706,Thursday
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987,0,0.0,40.316706,Friday
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689,0,0.0,40.316706,Saturday
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096,0,0.0,40.316706,Sunday
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647,0,0.0,40.316706,Monday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983,0,0.0,43.485714,Monday
2918,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909,0,0.0,42.771429,Tuesday
2919,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073,0,0.0,43.514286,Wednesday
2920,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952,0,0.0,42.957143,Thursday


In [327]:
# Count Saturdays and Sundays as non-working days as well.
data["holiday"] = np.where(data["day"].isin(["Saturday", "Sunday"]),1,data["holiday"])
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate,holiday,holiday eve,average_wait_time_7,day
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226,1,0.0,40.316706,Thursday
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987,0,0.0,40.316706,Friday
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689,1,0.0,40.316706,Saturday
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096,1,0.0,40.316706,Sunday
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647,0,0.0,40.316706,Monday
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983,0,0.0,43.485714,Monday
2918,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909,0,0.0,42.771429,Tuesday
2919,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073,0,0.0,43.514286,Wednesday
2920,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952,0,0.0,42.957143,Thursday


In [328]:
# Recompute the holiday-eve flag now that weekends also count as holidays.
data = data.drop("holiday eve",axis=1)
data["holiday eve"] = data["holiday"].shift(-1)
data

Unnamed: 0,Date,vehicle_operation,reception_cases,boarding_cases,average_wait_time,average_fare,average_boarding_distance,target,temp_max_forecast,temp_min_forecast,rain(mm)_forecast,humidity_max(%)_forecast,humidity_min(%)_forecast,sunshine(MJ/m2)_forecast,month,Boarding rate,holiday,average_wait_time_7,day,holiday eve
0,2015-01-01,213,1023,924,23.2,2427,10764,17.2,-2.0,-8.9,0.0,63.0,28.0,9.07,1,0.903226,1,40.316706,Thursday,0.0
1,2015-01-02,420,3158,2839,17.2,2216,8611,26.2,2.4,-9.2,0.0,73.0,37.0,8.66,1,0.898987,0,40.316706,Friday,1.0
2,2015-01-03,209,1648,1514,26.2,2377,10198,24.5,8.2,0.2,0.0,89.0,58.0,5.32,1,0.918689,1,40.316706,Saturday,1.0
3,2015-01-04,196,1646,1526,24.5,2431,10955,26.2,7.9,-0.9,0.0,95.0,52.0,6.48,1,0.927096,1,40.316706,Sunday,0.0
4,2015-01-05,421,4250,3730,26.2,2214,8663,23.6,4.1,-7.4,3.4,98.0,29.0,10.47,1,0.877647,0,40.316706,Monday,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,2022-12-26,603,5555,4605,39.2,2163,7889,44.4,3.0,-7.3,0.0,86.0,51.0,10.25,12,0.828983,0,43.485714,Monday,0.0
2918,2022-12-27,669,5635,4654,44.4,2198,8178,44.8,-0.3,-5.4,0.1,92.0,40.0,10.86,12,0.825909,0,42.771429,Tuesday,0.0
2919,2022-12-28,607,5654,4648,44.8,2161,7882,52.5,1.7,-7.8,0.0,71.0,34.0,10.88,12,0.822073,0,43.514286,Wednesday,0.0
2920,2022-12-29,581,5250,4247,52.5,2229,8433,38.3,2.1,-4.0,0.0,87.0,38.0,10.84,12,0.808952,0,42.957143,Thursday,0.0


In [329]:
# The last row has no following row, so shift(-1) leaves NaN there; fill it with 1.
data["holiday eve"] = data["holiday eve"].fillna(1)

In [330]:
joblib.dump(data, path + "data1.pkl")

['/content/drive/MyDrive/6_mini/3,4일차/data1.pkl']
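To confirm that the saved file can be read back in the next step with its dtypes intact, a round trip can be checked as follows. This is a minimal sketch on a toy frame written to a temporary directory; the toy values are illustrative only.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd

frame = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02"]),
    "day": pd.Categorical(["Thursday", "Friday"]),
    "target": [17.2, 26.2],
})

with tempfile.TemporaryDirectory() as tmp:
    file_path = str(Path(tmp) / "data1.pkl")
    joblib.dump(frame, file_path)      # write the DataFrame to disk
    restored = joblib.load(file_path)  # read it back

print(restored.dtypes)
```

In the notebook itself, `joblib.load(path + "data1.pkl")` would be the matching call in the follow-up step.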