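# **0. Competition Overview**


- The goal is to use time-series forecasting to predict `store sales` for Corporación Favorita, a large Ecuador-based grocery retailer
  - Specifically, to build a model that more accurately predicts the `unit sales`, by product family, of the thousands of items sold at Favorita stores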


**Evaluation metric**
- Root Mean Squared Logarithmic Error (`RMSLE`)
  - Computed as  
  $ \sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2} $
  - $n$: total number of instances
  - $ \hat{y}_i$: predicted value of the target for instance $i$
  - $y_i$: actual value of the target for instance $i$
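As a minimal sketch of the metric (not part of the original notebook), the formula can be written directly in NumPy. The function name `rmsle` is an assumption chosen here for illustration; `np.log1p(x)` computes $\log(1 + x)$.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), numerically stable for small x
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_error)))

# Identical predictions give an error of exactly zero
perfect = rmsle([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
```

Because the error is taken on a log scale, under-predicting a large value is penalised about as much as under-predicting a small value by the same ratio.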

# **1. Exploring the Data**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format # show floats to two decimal places

import warnings
warnings.filterwarnings('ignore')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


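## **1-1. Training Data**
- Contains dates, store and product information, whether the product was on promotion, and sales figures
- Additional files contain supplementary information that may be useful for building the model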



### **a) train.csv**  
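- Training data  

**Column descriptions**   
- `id`: identifier that distinguishes each row
  - About 3 million rows (3,000,888)
- `date`: sales date
  - Period: 2013/01/01 ~ 2017/08/15
- `store_nbr`: unique store identifier
  - Stores 1 to 54
- `family`: type of product sold
  - 33 product families
- `sales`: total sales of a given product family at a given store on a given date
  - Fractional values are possible because products can be sold in fractional units  
    (e.g. 1.5 kg of cheese, as opposed to 1 bag of chips)  
  - **Target variable**
- `onpromotion`: total number of items in the product family that were on promotion at the store on a given date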

In [3]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/train.csv')
train.head(10)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0
5,5,2013-01-01,1,BREAD/BAKERY,0.0,0
6,6,2013-01-01,1,CELEBRATION,0.0,0
7,7,2013-01-01,1,CLEANING,0.0,0
8,8,2013-01-01,1,DAIRY,0.0,0
9,9,2013-01-01,1,DELI,0.0,0


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 137.4+ MB


In [5]:
train.isna().sum()

id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

In [6]:
print(train['store_nbr'].unique())
print(len(train['store_nbr'].unique()))

[ 1 10 11 12 13 14 15 16 17 18 19  2 20 21 22 23 24 25 26 27 28 29  3 30
 31 32 33 34 35 36 37 38 39  4 40 41 42 43 44 45 46 47 48 49  5 50 51 52
 53 54  6  7  8  9]
54


In [7]:
print(train['family'].unique())
print(len(train['family'].unique()))

['AUTOMOTIVE' 'BABY CARE' 'BEAUTY' 'BEVERAGES' 'BOOKS' 'BREAD/BAKERY'
 'CELEBRATION' 'CLEANING' 'DAIRY' 'DELI' 'EGGS' 'FROZEN FOODS' 'GROCERY I'
 'GROCERY II' 'HARDWARE' 'HOME AND KITCHEN I' 'HOME AND KITCHEN II'
 'HOME APPLIANCES' 'HOME CARE' 'LADIESWEAR' 'LAWN AND GARDEN' 'LINGERIE'
 'LIQUOR,WINE,BEER' 'MAGAZINES' 'MEATS' 'PERSONAL CARE' 'PET SUPPLIES'
 'PLAYERS AND ELECTRONICS' 'POULTRY' 'PREPARED FOODS' 'PRODUCE'
 'SCHOOL AND OFFICE SUPPLIES' 'SEAFOOD']
33


### **b) stores.csv**
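- Store metadata  

**Column descriptions**
  - `store_nbr`: unique store identifier
  - `city`: city where the store is located
    - 22 cities
  - `state`: state where the store is located
    - 16 states
  - `type`: store type
    - A, B, C, D, E
  - `cluster`: grouping of similar stores
    - 1 ~ 17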

In [8]:
stores = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/stores.csv')
stores.head(10)

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4
5,6,Quito,Pichincha,D,13
6,7,Quito,Pichincha,D,8
7,8,Quito,Pichincha,D,8
8,9,Quito,Pichincha,B,6
9,10,Quito,Pichincha,C,15


In [9]:
stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


In [10]:
print(stores['city'].unique())
print(len(stores['city'].unique()))

['Quito' 'Santo Domingo' 'Cayambe' 'Latacunga' 'Riobamba' 'Ibarra'
 'Guaranda' 'Puyo' 'Ambato' 'Guayaquil' 'Salinas' 'Daule' 'Babahoyo'
 'Quevedo' 'Playas' 'Libertad' 'Cuenca' 'Loja' 'Machala' 'Esmeraldas'
 'Manta' 'El Carmen']
22


In [11]:
print(stores['state'].unique())
print(len(stores['state'].unique()))

['Pichincha' 'Santo Domingo de los Tsachilas' 'Cotopaxi' 'Chimborazo'
 'Imbabura' 'Bolivar' 'Pastaza' 'Tungurahua' 'Guayas' 'Santa Elena'
 'Los Rios' 'Azuay' 'Loja' 'El Oro' 'Esmeraldas' 'Manabi']
16


In [12]:
print(stores['type'].unique())
print(len(stores['type'].unique()))

['D' 'B' 'C' 'E' 'A']
5


In [13]:
print(stores['cluster'].unique())
print(len(stores['cluster'].unique()))

[13  8  9  4  6 15  7  3 12 16  1 10  2  5 11 14 17]
17


In [14]:
result = stores.groupby('state')['city'].unique()
print(result)

state
Azuay                                                         [Cuenca]
Bolivar                                                     [Guaranda]
Chimborazo                                                  [Riobamba]
Cotopaxi                                                   [Latacunga]
El Oro                                                       [Machala]
Esmeraldas                                                [Esmeraldas]
Guayas                            [Guayaquil, Daule, Playas, Libertad]
Imbabura                                                      [Ibarra]
Loja                                                            [Loja]
Los Rios                                           [Babahoyo, Quevedo]
Manabi                                              [Manta, El Carmen]
Pastaza                                                         [Puyo]
Pichincha                                             [Quito, Cayambe]
Santa Elena                                                  [Salinas]


### **c) oil.csv**
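- Daily oil price
  - Includes values for both the training and test periods
- Ecuador is an oil-dependent country
  - Its economy is highly vulnerable to swings in oil prices
  - This data can help identify which product families are affected by oil prices

**Column descriptions**  
- `date`: observation date
  - Period: 2013/01/01 ~ 2017/08/31
- `dcoilwtico`: oil price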

In [15]:
oil = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/oil.csv')
oil.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
5,2013-01-08,93.21
6,2013-01-09,93.08
7,2013-01-10,93.81
8,2013-01-11,93.6
9,2013-01-14,94.27


In [16]:
oil.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB


- Some values are missing: `dcoilwtico` has 1,175 non-null values out of 1,218 rows.
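One possible way to handle these gaps, shown here only as a sketch on toy data and not as the approach this notebook takes, is to put the series on a full daily calendar and interpolate. The names `oil_toy` and `oil_daily` are hypothetical, and the choice of linear interpolation is an assumption.

```python
import pandas as pd

# Toy stand-in for oil.csv: weekend dates are absent and the first value is NaN
oil_toy = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"],
    "dcoilwtico": [float("nan"), 93.14, 92.97, 93.12, 93.20],
})
oil_toy["date"] = pd.to_datetime(oil_toy["date"])

# Reindex onto a full daily calendar so that weekend dates exist as rows
full_days = pd.date_range(oil_toy["date"].min(), oil_toy["date"].max(), freq="D")
oil_daily = oil_toy.set_index("date").reindex(full_days)

# Linear interpolation fills interior gaps; bfill covers the leading NaN
oil_daily["dcoilwtico"] = oil_daily["dcoilwtico"].interpolate(method="linear").bfill()
oil_daily = oil_daily.rename_axis("date").reset_index()
```

Without a step like this, a left merge on `date` leaves `dcoilwtico` empty for every sales date that has no oil quote.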

In [17]:
oil.describe()

Unnamed: 0,dcoilwtico
count,1175.0
mean,67.71
std,25.63
min,26.19
25%,46.41
50%,53.19
75%,95.66
max,110.62


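### **d) transactions.csv**
- Transaction data: the number of people who visited a store in a day, or the number of invoices (receipts) issued that day
  - Strongly correlated with the `sales` column of the training data
  - Useful for understanding each store's sales pattern
- `date`: date
  - 2013/01/01 ~ 2017/08/15
- `store_nbr`: unique store identifier
- `transactions`: number of transactions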

In [18]:
transaction = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/transactions.csv')
transaction.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [19]:
transaction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          83488 non-null  object
 1   store_nbr     83488 non-null  int64 
 2   transactions  83488 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.9+ MB


In [20]:
transaction.describe()

Unnamed: 0,store_nbr,transactions
count,83488.0,83488.0
mean,26.94,1694.6
std,15.61,963.29
min,1.0,5.0
25%,13.0,1046.0
50%,27.0,1393.0
75%,40.0,2079.0
max,54.0,8359.0


### **e) holidays_events.csv**
- Holiday and event metadata
- `date`: date
  - 2012/03/02 ~ 2017/12/26
- `type`: kind of holiday
  - Holiday, Event, Additional, Transfer, Bridge, Work Day
- `locale`: scope
  - National, Local, Regional
- `locale_name`: name of the locale
  - 24 locales
- `description`: description
- `transferred`: whether the holiday was moved to another date
  - True, False


In [21]:
holidays_events = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/holidays_events.csv')
holidays_events.head(10)

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
5,2012-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
6,2012-06-23,Holiday,Local,Guaranda,Cantonizacion de Guaranda,False
7,2012-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
8,2012-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
9,2012-06-25,Holiday,Local,Machala,Fundacion de Machala,False


In [22]:
# Drop rows before 2013-01-01, which precede the training period
holidays_events = holidays_events.loc[holidays_events['date'] > '2012-12-31'].reset_index(drop=True)

# Drop rows dated 2017-09-09 or later (the test period ends on 2017-08-31)
holidays_events = holidays_events.loc[holidays_events['date'] < '2017-09-09'].reset_index(drop=True)

In [23]:
### Rename a column
# `type` would clash with the `type` column in stores.csv

holidays_events = holidays_events.rename(columns={'type': 'category'})

In [24]:
holidays_events['category'].unique()

array(['Holiday', 'Work Day', 'Additional', 'Event', 'Transfer', 'Bridge'],
      dtype=object)

In [25]:
holidays_events['locale'].unique()

array(['National', 'Local', 'Regional'], dtype=object)

#### **📌 Handling category**

**Transfer**

In [26]:
# Drop rows where transferred == True

holidays_events = holidays_events.loc[holidays_events['transferred'] != True].reset_index(drop=True) # keep only False

**Bridge & Work Day**

In [27]:
# Drop rows where category == 'Bridge'

holidays_events = holidays_events.loc[holidays_events['category'] != 'Bridge'].reset_index(drop=True)

#### **📌 Handling locale**
- Duplicate dates are likely to cause problems when merging
  - Inspect the duplicated dates and remove duplicate rows where possible
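To make the merge problem concrete, here is a small sketch on toy data (the frames `sales_toy` and `holidays_toy` are hypothetical). When the right-hand table has two rows for one date, a left merge on `date` silently duplicates the matching left-hand row. The `validate` argument of `pd.merge` can be used to raise an error instead.

```python
import pandas as pd

sales_toy = pd.DataFrame({
    "date": ["2013-05-12", "2013-05-13"],
    "store_nbr": [1, 1],
    "sales": [10.0, 12.0],
})
# Two holiday rows share one date, like the duplicated dates inspected in this section
holidays_toy = pd.DataFrame({
    "date": ["2013-05-12", "2013-05-12"],
    "locale": ["Local", "National"],
})

# The 2013-05-12 sales row appears twice in the result
merged = pd.merge(sales_toy, holidays_toy, on="date", how="left")

# validate="many_to_one" requires the merge keys to be unique in the right table
try:
    pd.merge(sales_toy, holidays_toy, on="date", how="left", validate="many_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
```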

In [28]:
## National holidays
# locale_name == 'Ecuador'

print(holidays_events.loc[holidays_events['locale'] == 'National', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'National', 'locale_name'].unique()))

['Ecuador']
1


In [29]:
## Regional (state-level) holidays
# locale_name == state name

print(holidays_events.loc[holidays_events['locale'] == 'Regional', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'Regional', 'locale_name'].unique()))

['Cotopaxi' 'Imbabura' 'Santo Domingo de los Tsachilas' 'Santa Elena']
4


In [30]:
## Local (city-level) holidays
# locale_name == city name

print(holidays_events.loc[holidays_events['locale'] == 'Local', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'Local', 'locale_name'].unique()))

['Manta' 'Cuenca' 'Libertad' 'Riobamba' 'Puyo' 'Guaranda' 'Machala'
 'Latacunga' 'El Carmen' 'Santo Domingo' 'Cayambe' 'Guayaquil'
 'Esmeraldas' 'Ambato' 'Ibarra' 'Quevedo' 'Quito' 'Loja' 'Salinas']
19


In [31]:
## Check for duplicated dates

holidays_events[holidays_events.duplicated(subset = 'date', keep = False)]

Unnamed: 0,date,category,locale,locale_name,description,transferred
13,2013-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
14,2013-05-12,Event,National,Ecuador,Dia de la Madre,False
17,2013-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
18,2013-06-25,Holiday,Local,Machala,Fundacion de Machala,False
19,2013-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
20,2013-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
21,2013-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False
43,2013-12-22,Additional,National,Ecuador,Navidad-3,False
44,2013-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
68,2014-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


In [32]:
holidays_events[holidays_events.duplicated(subset = 'date', keep = False)]['date'].unique()

array(['2013-05-12', '2013-06-25', '2013-07-03', '2013-12-22',
       '2014-06-25', '2014-07-03', '2014-12-22', '2015-06-25',
       '2015-07-03', '2015-12-22', '2016-04-21', '2016-05-01',
       '2016-05-07', '2016-05-08', '2016-05-12', '2016-06-25',
       '2016-07-03', '2016-07-24', '2016-11-12', '2016-12-22',
       '2017-04-14', '2017-06-25', '2017-07-03'], dtype=object)

**2013-05-12**


In [33]:
holidays_events.loc[holidays_events['date'] == '2013-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
13,2013-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
14,2013-05-12,Event,National,Ecuador,Dia de la Madre,False


In [34]:
# Keep only the National row

holidays_events = holidays_events.drop(13).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2013-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
13,2013-05-12,Event,National,Ecuador,Dia de la Madre,False


**2013-06-25**

In [35]:
holidays_events.loc[holidays_events['date'] == '2013-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
16,2013-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
17,2013-06-25,Holiday,Local,Machala,Fundacion de Machala,False
18,2013-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- The three rows refer to unrelated locales -> keep all of them

**2013-07-03**

In [36]:
holidays_events.loc[holidays_events['date'] == '2013-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
19,2013-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
20,2013-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2013-12-22**

In [37]:
holidays_events.loc[holidays_events['date'] == '2013-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
42,2013-12-22,Additional,National,Ecuador,Navidad-3,False
43,2013-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [38]:
# Keep only the National row

holidays_events = holidays_events.drop(43).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2013-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
42,2013-12-22,Additional,National,Ecuador,Navidad-3,False


**2014-06-25**

In [39]:
holidays_events.loc[holidays_events['date'] == '2014-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
66,2014-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
67,2014-06-25,Holiday,Local,Machala,Fundacion de Machala,False
68,2014-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
69,2014-06-25,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Francia,False


In [40]:
# Keep only the National row

holidays_events = holidays_events.drop(66)
holidays_events = holidays_events.drop(67)
holidays_events = holidays_events.drop(68).reset_index(drop=True)

holidays_events.loc[holidays_events['date'] == '2014-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
66,2014-06-25,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Francia,False


**2014-07-03**

In [41]:
holidays_events.loc[holidays_events['date'] == '2014-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
71,2014-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
72,2014-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2014-12-22**

In [42]:
holidays_events.loc[holidays_events['date'] == '2014-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
103,2014-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
104,2014-12-22,Additional,National,Ecuador,Navidad-3,False


In [43]:
# Keep only the National row

holidays_events = holidays_events.drop(103).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2014-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
103,2014-12-22,Additional,National,Ecuador,Navidad-3,False


**2015-06-25**

In [44]:
holidays_events.loc[holidays_events['date'] == '2015-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
125,2015-06-25,Holiday,Local,Machala,Fundacion de Machala,False
126,2015-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
127,2015-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- Keep all

**2015-07-03**

In [45]:
holidays_events.loc[holidays_events['date'] == '2015-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
128,2015-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
129,2015-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2015-12-22**

In [46]:
holidays_events.loc[holidays_events['date'] == '2015-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
153,2015-12-22,Additional,National,Ecuador,Navidad-3,False
154,2015-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [47]:
# Keep only the National row

holidays_events = holidays_events.drop(154).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2015-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
153,2015-12-22,Additional,National,Ecuador,Navidad-3,False


**2016-04-21**

In [48]:
holidays_events.loc[holidays_events['date'] == '2016-04-21',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
172,2016-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
173,2016-04-21,Event,National,Ecuador,Terremoto Manabi+5,False


In [49]:
# Keep only the National row

holidays_events = holidays_events.drop(172).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-04-21',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
172,2016-04-21,Event,National,Ecuador,Terremoto Manabi+5,False


**2016-05-01**

In [50]:
holidays_events.loc[holidays_events['date'] == '2016-05-01',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
182,2016-05-01,Holiday,National,Ecuador,Dia del Trabajo,False
183,2016-05-01,Event,National,Ecuador,Terremoto Manabi+15,False


In [51]:
# Keep only the first (upper) row

holidays_events = holidays_events.drop(183).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-05-01',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
182,2016-05-01,Holiday,National,Ecuador,Dia del Trabajo,False


**2016-05-07**

In [52]:
holidays_events.loc[holidays_events['date'] == '2016-05-07',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
188,2016-05-07,Additional,National,Ecuador,Dia de la Madre-1,False
189,2016-05-07,Event,National,Ecuador,Terremoto Manabi+21,False


In [53]:
# Keep only the first (upper) row

holidays_events = holidays_events.drop(189).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-05-07',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
188,2016-05-07,Additional,National,Ecuador,Dia de la Madre-1,False


**2016-05-08**

In [54]:
holidays_events.loc[holidays_events['date'] == '2016-05-08',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
189,2016-05-08,Event,National,Ecuador,Terremoto Manabi+22,False
190,2016-05-08,Event,National,Ecuador,Dia de la Madre,False


In [55]:
# Keep only the second (lower) row

holidays_events = holidays_events.drop(189).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-05-08',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
189,2016-05-08,Event,National,Ecuador,Dia de la Madre,False


**2016-05-12**

In [56]:
holidays_events.loc[holidays_events['date'] == '2016-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
193,2016-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
194,2016-05-12,Event,National,Ecuador,Terremoto Manabi+26,False


In [57]:
# Keep only the National row

holidays_events = holidays_events.drop(193).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
193,2016-05-12,Event,National,Ecuador,Terremoto Manabi+26,False


**2016-06-25**

In [58]:
holidays_events.loc[holidays_events['date'] == '2016-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
200,2016-06-25,Holiday,Local,Machala,Fundacion de Machala,False
201,2016-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
202,2016-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- Keep all three

**2016-07-03**

In [59]:
holidays_events.loc[holidays_events['date'] == '2016-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
203,2016-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
204,2016-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2016-07-24**

In [60]:
holidays_events.loc[holidays_events['date'] == '2016-07-24',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
206,2016-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
207,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False


In [61]:
# Keep only the second (lower) row

holidays_events = holidays_events.drop(206).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-07-24',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
206,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False


**2016-11-12**

In [62]:
holidays_events.loc[holidays_events['date'] == '2016-11-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
220,2016-11-12,Holiday,Local,Ambato,Independencia de Ambato,False
221,2016-11-12,Work Day,National,Ecuador,Recupero Puente Dia de Difuntos,False


In [63]:
# Keep only the National row

holidays_events = holidays_events.drop(220).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-11-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
220,2016-11-12,Work Day,National,Ecuador,Recupero Puente Dia de Difuntos,False


**2016-12-22**

In [64]:
holidays_events.loc[holidays_events['date'] == '2016-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
227,2016-12-22,Additional,National,Ecuador,Navidad-3,False
228,2016-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [65]:
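# The stated intent is to keep only the National row, but drop(227) removes the
# National row (Navidad-3) and keeps the Local one; drop(228) would match the intent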

holidays_events = holidays_events.drop(227).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
227,2016-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


**2017-04-14**

In [66]:
holidays_events.loc[holidays_events['date'] == '2017-04-14',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
239,2017-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
240,2017-04-14,Holiday,National,Ecuador,Viernes Santo,False


In [67]:
# Keep only the National row

holidays_events = holidays_events.drop(239).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2017-04-14',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
239,2017-04-14,Holiday,National,Ecuador,Viernes Santo,False


**2017-06-25**

In [68]:
holidays_events.loc[holidays_events['date'] == '2017-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
247,2017-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
248,2017-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
249,2017-06-25,Holiday,Local,Machala,Fundacion de Machala,False


- Keep all three

**2017-07-03**

In [69]:
holidays_events.loc[holidays_events['date'] == '2017-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
250,2017-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
251,2017-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both
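The index-based drops in this section depend on the row numbering after each `reset_index`, so they are easy to get wrong. As a hedged alternative, the most frequent decision ("keep only the National row when one exists on that date") can be sketched as a rule. This is an illustration on toy data with hypothetical names (`events_toy`, `events_dedup`), not the notebook's method. It also does not cover the cases decided by hand, such as two National rows on the same date.

```python
import pandas as pd

events_toy = pd.DataFrame({
    "date": ["2013-05-12", "2013-05-12", "2013-06-25", "2013-06-25", "2013-07-03"],
    "locale": ["Local", "National", "Regional", "Local", "Local"],
    "locale_name": ["Puyo", "Ecuador", "Imbabura", "Machala", "El Carmen"],
})

is_national = events_toy["locale"].eq("National")
# True for every row whose date has at least one National entry
date_has_national = is_national.groupby(events_toy["date"]).transform("any")

# On those dates keep only National rows; leave all other dates untouched
keep = ~date_has_national | is_national
events_dedup = events_toy.loc[keep].reset_index(drop=True)
```

Here the Puyo row on 2013-05-12 is removed, while both 2013-06-25 rows survive because no National entry exists on that date.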

In [70]:
### Check the final holidays_events

holidays_events.head(10)

Unnamed: 0,date,category,locale,locale_name,description,transferred
0,2013-01-01,Holiday,National,Ecuador,Primer dia del ano,False
1,2013-01-05,Work Day,National,Ecuador,Recupero puente Navidad,False
2,2013-01-12,Work Day,National,Ecuador,Recupero puente primer dia del ano,False
3,2013-02-11,Holiday,National,Ecuador,Carnaval,False
4,2013-02-12,Holiday,National,Ecuador,Carnaval,False
5,2013-03-02,Holiday,Local,Manta,Fundacion de Manta,False
6,2013-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
7,2013-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
8,2013-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
9,2013-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False


In [71]:
holidays_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259 entries, 0 to 258
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         259 non-null    object
 1   category     259 non-null    object
 2   locale       259 non-null    object
 3   locale_name  259 non-null    object
 4   description  259 non-null    object
 5   transferred  259 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 10.5+ KB


## **1-2. Test Data**
- Period: 2017/08/16 ~ 2017/08/31

In [72]:
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/test.csv')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


In [73]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           28512 non-null  int64 
 1   date         28512 non-null  object
 2   store_nbr    28512 non-null  int64 
 3   family       28512 non-null  object
 4   onpromotion  28512 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ MB


## **1-3. Submission File**

In [74]:
sample_submission = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/sample_submission.csv')
sample_submission.head()

Unnamed: 0,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0


# **2. Merging the Data**
- Left joins are used: every row of `train` is kept in the result, and columns from the other dataframe are filled in only where the merge key matches
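A minimal sketch of that left-join behaviour, on toy frames with hypothetical names (`left`, `right`) and a made-up store number 99 that has no metadata. The `indicator=True` option adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

left = pd.DataFrame({"store_nbr": [1, 2, 99], "sales": [5.0, 7.0, 3.0]})
right = pd.DataFrame({"store_nbr": [1, 2, 3], "city": ["Quito", "Quito", "Quito"]})

# how="left": all three rows of `left` survive; store 3 from `right` is not added
out = pd.merge(left, right, on="store_nbr", how="left", indicator=True)

# Rows without a match keep their left-hand values and get NaN on the right
unmatched = out.loc[out["_merge"] == "left_only", "store_nbr"].tolist()
```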

## **2-1. train + stores**
- Merge on `store_nbr`

In [75]:
# Merge on the store identifier

train = pd.merge(train, stores, on = 'store_nbr', how = 'left')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13
1,1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13
2,2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13
3,3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13
4,4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13


In [76]:
## test

test = pd.merge(test, stores, on = 'store_nbr', how = 'left')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type,cluster
0,3000888,2017-08-16,1,AUTOMOTIVE,0,Quito,Pichincha,D,13
1,3000889,2017-08-16,1,BABY CARE,0,Quito,Pichincha,D,13
2,3000890,2017-08-16,1,BEAUTY,2,Quito,Pichincha,D,13
3,3000891,2017-08-16,1,BEVERAGES,20,Quito,Pichincha,D,13
4,3000892,2017-08-16,1,BOOKS,0,Quito,Pichincha,D,13


## **2-2. train + oil**
- Merge on the `date` column

In [77]:
# Merge on date
train = pd.merge(train, oil, on = 'date', how = 'left')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,
1,1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,
2,2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13,
3,3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13,
4,4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13,


In [78]:
## test

test = pd.merge(test, oil, on = 'date', how = 'left')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type,cluster,dcoilwtico
0,3000888,2017-08-16,1,AUTOMOTIVE,0,Quito,Pichincha,D,13,46.8
1,3000889,2017-08-16,1,BABY CARE,0,Quito,Pichincha,D,13,46.8
2,3000890,2017-08-16,1,BEAUTY,2,Quito,Pichincha,D,13,46.8
3,3000891,2017-08-16,1,BEVERAGES,20,Quito,Pichincha,D,13,46.8
4,3000892,2017-08-16,1,BOOKS,0,Quito,Pichincha,D,13,46.8


## **2-3. train + transaction**
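- Usable for EDA
- Probably hard to use for model training
  - No transaction data exists for the period covered by the test data
  - The variable must therefore be dropped before training
- The primary key of transaction is (`date`, `store_nbr`)
  - A row is uniquely identified only by the pair
  - Merge on `date` and `store_nbr`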
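The composite-key claim can be checked before merging. The sketch below uses toy values and hypothetical names (`tx_toy`, `sales_toy`). It shows that `date` alone repeats while the (`date`, `store_nbr`) pair does not, so a left merge on the pair keeps the row count unchanged.

```python
import pandas as pd

tx_toy = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-02", "2013-01-03"],
    "store_nbr": [1, 2, 1],
    "transactions": [2111, 2358, 1800],  # toy values
})
key = ["date", "store_nbr"]

# duplicated() flags repeats of the given columns
date_is_unique = not tx_toy.duplicated(subset=["date"]).any()
pair_is_unique = not tx_toy.duplicated(subset=key).any()

sales_toy = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-02", "2013-01-02"],
    "store_nbr": [1, 1, 2],
    "family": ["AUTOMOTIVE", "BEAUTY", "AUTOMOTIVE"],
})
# Every family row of a store-day receives that store-day's transaction count
merged = pd.merge(sales_toy, tx_toy, on=key, how="left")
```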

In [79]:
# Merge on date and store identifier
train = pd.merge(train, transaction, on = ['date', 'store_nbr'], how = 'left')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,,
1,1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,,
2,2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13,,
3,3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13,,
4,4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13,,


In [80]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3000888 entries, 0 to 3000887
Data columns (total 12 columns):
 #   Column        Dtype  
---  ------        -----  
 0   id            int64  
 1   date          object 
 2   store_nbr     int64  
 3   family        object 
 4   sales         float64
 5   onpromotion   int64  
 6   city          object 
 7   state         object 
 8   type          object 
 9   cluster       int64  
 10  dcoilwtico    float64
 11  transactions  float64
dtypes: float64(3), int64(4), object(5)
memory usage: 297.6+ MB


- The transaction data cannot be merged into test

## **2-4. train + holidays_events**



### **a) Holidays**

📌 **National holidays (locale = 'National')**  
- Merge on `date`

In [81]:
## Case 1: National holidays
# Merge on date

national = holidays_events.loc[holidays_events['locale'] == 'National', :]
national = pd.merge(train, national[['date', 'locale']], on = 'date', how = 'inner')

In [82]:
national = national.rename(columns={'locale': 'holiday'})
national.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,,,National
1,1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,,,National
2,2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13,,,National
3,3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13,,,National
4,4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13,,,National


📌 **Regional (state-level) holidays (locale = 'Regional')**
- Merge on `date` and `state`
  

In [84]:
## Case 2: Regional holidays
# Merge on date and state

regional = holidays_events.loc[holidays_events['locale'] == 'Regional', :]
regional = regional[['date', 'locale', 'locale_name']]
regional = regional.rename(columns={'locale_name': 'state'})
regional = pd.merge(train, regional, on=['date', 'state'], how = 'inner')

In [85]:
regional = regional.rename(columns={'locale': 'holiday'})
regional.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,160479,2013-04-01,12,AUTOMOTIVE,3.0,0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional
1,160480,2013-04-01,12,BABY CARE,0.0,0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional
2,160481,2013-04-01,12,BEAUTY,4.0,0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional
3,160482,2013-04-01,12,BEVERAGES,762.0,0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional
4,160483,2013-04-01,12,BOOKS,0.0,0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional


📌 **Local (city-level) holidays (locale = 'Local')**
- Merge on `date` and `city`

In [87]:
## Case 3: local holidays (Local)
# Merge on date and city

local = holidays_events.loc[holidays_events['locale'] == 'Local', :]
local = local[['date', 'locale', 'locale_name']]
local = local.rename(columns={'locale_name': 'city'})
local = pd.merge(train, local, on=['date', 'city'], how = 'inner')

In [88]:
local = local.rename(columns={'locale': 'holiday'})
local.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,108471,2013-03-02,52,AUTOMOTIVE,0.0,0,Manta,Manabi,A,11,,,Local
1,108472,2013-03-02,52,BABY CARE,0.0,0,Manta,Manabi,A,11,,,Local
2,108473,2013-03-02,52,BEAUTY,0.0,0,Manta,Manabi,A,11,,,Local
3,108474,2013-03-02,52,BEVERAGES,0.0,0,Manta,Manabi,A,11,,,Local
4,108475,2013-03-02,52,BOOKS,0.0,0,Manta,Manabi,A,11,,,Local
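The three cases differ only in which train column the holiday's `locale_name` is matched against. As a sketch, they could be written as one helper. The name `merge_holidays_by_locale` and the toy data are introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

def merge_holidays_by_locale(train_df, holidays_df, locale, key=None):
    """Inner-merge train rows with the holidays of one locale.

    key is the train column matched against locale_name
    (None for national holidays, which match on date alone).
    """
    cols = ["date", "locale", "locale_name"]
    subset = holidays_df.loc[holidays_df["locale"] == locale, cols]
    if key is None:
        right, on = subset[["date", "locale"]], ["date"]
    else:
        right, on = subset.rename(columns={"locale_name": key}), ["date", key]
    merged = pd.merge(train_df, right, on=on, how="inner")
    return merged.rename(columns={"locale": "holiday"})

# Invented toy data for illustration.
train_toy = pd.DataFrame({
    "id": [0, 1, 2],
    "date": ["2013-01-01", "2013-04-01", "2013-04-01"],
    "city": ["Quito", "Latacunga", "Quito"],
    "state": ["Pichincha", "Cotopaxi", "Pichincha"],
})
holidays_toy = pd.DataFrame({
    "date": ["2013-01-01", "2013-04-01"],
    "locale": ["National", "Regional"],
    "locale_name": ["Ecuador", "Cotopaxi"],
})

national_toy = merge_holidays_by_locale(train_toy, holidays_toy, "National")
regional_toy = merge_holidays_by_locale(train_toy, holidays_toy, "Regional", key="state")
local_toy = merge_holidays_by_locale(train_toy, holidays_toy, "Local", key="city")
print(national_toy["id"].tolist(), regional_toy["id"].tolist(), len(local_toy))
```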


In [90]:
# Collect the holiday rows from all three cases

holiday = pd.concat([national, regional, local], ignore_index=True)
holiday['holiday'] = True # turn the holiday column into a True/False flag
holiday.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,,,True
1,1,2013-01-01,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,,,True
2,2,2013-01-01,1,BEAUTY,0.0,0,Quito,Pichincha,D,13,,,True
3,3,2013-01-01,1,BEVERAGES,0.0,0,Quito,Pichincha,D,13,,,True
4,4,2013-01-01,1,BOOKS,0.0,0,Quito,Pichincha,D,13,,,True


In [91]:
holiday.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 249447 entries, 0 to 249446
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            249447 non-null  int64  
 1   date          249447 non-null  object 
 2   store_nbr     249447 non-null  int64  
 3   family        249447 non-null  object 
 4   sales         249447 non-null  float64
 5   onpromotion   249447 non-null  int64  
 6   city          249447 non-null  object 
 7   state         249447 non-null  object 
 8   type          249447 non-null  object 
 9   cluster       249447 non-null  int64  
 10  dcoilwtico    147378 non-null  float64
 11  transactions  225522 non-null  float64
 12  holiday       249447 non-null  bool   
dtypes: bool(1), float64(3), int64(4), object(5)
memory usage: 23.1+ MB


### **b) Non-holidays**

In [92]:
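# Collect every date listed in holidays_events (all locales)
holiday_dates = set(holidays_events['date'])

# Keep the train rows whose date is not in that set
# (rows on dates that are holidays only in other states/cities are dropped too)
not_holidays = train[~train['date'].isin(holiday_dates)]

# Check the result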
not_holidays.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions
1782,1782,2013-01-02,1,AUTOMOTIVE,2.0,0,Quito,Pichincha,D,13,93.14,2111.0
1783,1783,2013-01-02,1,BABY CARE,0.0,0,Quito,Pichincha,D,13,93.14,2111.0
1784,1784,2013-01-02,1,BEAUTY,2.0,0,Quito,Pichincha,D,13,93.14,2111.0
1785,1785,2013-01-02,1,BEVERAGES,1091.0,0,Quito,Pichincha,D,13,93.14,2111.0
1786,1786,2013-01-02,1,BOOKS,0.0,0,Quito,Pichincha,D,13,93.14,2111.0


In [93]:
not_holidays.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2571426 entries, 1782 to 2999105
Data columns (total 12 columns):
 #   Column        Dtype  
---  ------        -----  
 0   id            int64  
 1   date          object 
 2   store_nbr     int64  
 3   family        object 
 4   sales         float64
 5   onpromotion   int64  
 6   city          object 
 7   state         object 
 8   type          object 
 9   cluster       int64  
 10  dcoilwtico    float64
 11  transactions  float64
dtypes: float64(3), int64(4), object(5)
memory usage: 255.0+ MB
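The row counts above suggest that `holiday` (249,447 rows) and `not_holidays` (2,571,426 rows) together do not cover all of train (3,000,888 rows). Filtering on dates alone removes rows for stores where a regional or local holiday on that date does not apply. If the intent is for the two frames to partition train, one possible alternative is to take the complement by `id`. A sketch on invented toy data:

```python
import pandas as pd

train_toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "date": ["2013-04-01", "2013-04-01", "2013-04-02", "2013-04-02"],
    "state": ["Cotopaxi", "Pichincha", "Cotopaxi", "Pichincha"],
})
# Hypothetical regional holiday that applies to Cotopaxi only.
is_holiday = (train_toy["date"] == "2013-04-01") & (train_toy["state"] == "Cotopaxi")
holiday_toy = train_toy[is_holiday]

# Date-based filter: also drops the Pichincha row on 2013-04-01.
by_date = train_toy[~train_toy["date"].isin(set(holiday_toy["date"]))]

# id-based complement: keeps every row not flagged as a holiday.
by_id = train_toy[~train_toy["id"].isin(holiday_toy["id"])]

print(len(holiday_toy) + len(by_date), len(holiday_toy) + len(by_id))
```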


## **2-5. Checking the merged result**

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.isnull().sum()

In [None]:
test.head()

In [None]:
test.loc[test['date'] == '2017-08-24']

In [None]:
test.info()

## **2-6. Saving the merged file**

In [None]:
train.to_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/train_merged.csv', index = False)
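When the saved file is read back, `date` comes in as a plain string (object) column unless it is parsed explicitly. A small round-trip sketch on invented toy data, writing to a temporary directory instead of the Drive path above:

```python
import os
import tempfile

import pandas as pd

merged_toy = pd.DataFrame({
    "id": [0, 1],
    "date": ["2013-01-01", "2013-01-02"],
    "sales": [0.0, 2.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_merged.csv")
    # index=False avoids writing the row index as an extra unnamed column.
    merged_toy.to_csv(path, index=False)
    # parse_dates converts the date strings back to datetime64 on load.
    reloaded = pd.read_csv(path, parse_dates=["date"])

print(reloaded.dtypes["date"])
```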