# **0. Competition Overview**


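- [Competition link](https://www.kaggle.com/competitions/store-sales-time-series-forecasting)
- The goal is to use time-series forecasting to predict `store sales` for Corporación Favorita, a large grocery retailer based in Ecuador
  - Specifically, to build a model that more accurately predicts, by product family, the `unit sales` of the thousands of products sold at Favorita stores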


**Evaluation metric**
- Root Mean Squared Logarithmic Error (`RMSLE`)
  - Computed as  
  $ \sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2} $
  - $n$: total number of instances
  - $ \hat{y}_i$: predicted target value for instance $i$
  - $y_i$: actual target value for instance $i$
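As a minimal sketch of the formula above (the function name `rmsle` and the toy inputs are assumptions for illustration, not part of the competition code), the metric can be computed with NumPy:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(x) computes log(1 + x)
    squared_log_error = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_error)))

# Toy values, not taken from the competition data
score_perfect = rmsle([0.0, 3.0, 10.0], [0.0, 3.0, 10.0])
# Predicting e - 1 when the truth is 0 gives a log difference of exactly 1
score_off = rmsle([0.0], [np.e - 1.0])
```

Because the error is measured on `log(1 + y)`, the metric penalizes relative rather than absolute differences, and it is defined for zero sales.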

# **1. Exploring the Data**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format # show floats with two decimal places

import warnings
warnings.filterwarnings('ignore')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


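## **1-1. Training Data**
- Contains dates, store and product information, whether the item was on promotion, and sales figures
- Additional files contain supplementary information that may be useful for building the model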



### **a) train.csv**  
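- Training data  

**Column descriptions**   
- `id`: identifier for each row
  - about 3 million rows
- `date`: sales date
  - range: 2013/01/01 ~ 2017/08/15
- `store_nbr`: unique store identifier
  - 1 to 54
- `family`: type of product sold
  - 33 product families
- `sales`: total sales of a given product family at a given store on a given date
  - Fractional values are possible because products can be sold in fractional units  
    (e.g., 1.5 kg of cheese, as opposed to 1 bag of chips)  
  - **target variable**
- `onpromotion`: total number of items in a product family that were on promotion at a store on a given date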

In [3]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA/OB/방학프로젝트/train.csv')
train.head(10)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0
5,5,2013-01-01,1,BREAD/BAKERY,0.0,0
6,6,2013-01-01,1,CELEBRATION,0.0,0
7,7,2013-01-01,1,CLEANING,0.0,0
8,8,2013-01-01,1,DAIRY,0.0,0
9,9,2013-01-01,1,DELI,0.0,0


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 137.4+ MB


In [5]:
train.isna().sum()

id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

#### **📌 date preprocessing**

**Checking the dates**
- 2013 (365), 2014 (365), 2015 (365), 2016 (366), 2017 through 08-15 (227) => 1,688 days
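The day count above can be double-checked with `pd.date_range`; this is only a sanity check of the arithmetic, and the variable names are illustrative:

```python
import pandas as pd

# Every calendar day from the first to the last training date, inclusive
expected_days = pd.date_range("2013-01-01", "2017-08-15", freq="D")
n_days = len(expected_days)

# Days in 2017 up to and including August 15
n_days_2017 = len(pd.date_range("2017-01-01", "2017-08-15", freq="D"))
```

Any number of unique dates below `n_days` means the data has calendar gaps.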

In [6]:
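# Remove any spaces inside the date strings
# (Series.replace only matches whole values, so the .str accessor is needed)
train['date'] = train['date'].str.replace(" ", "", regex=False)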

print(train['date'].unique())
print(len(train['date'].unique()))

['2013-01-01' '2013-01-02' '2013-01-03' ... '2017-08-13' '2017-08-14'
 '2017-08-15']
1684


In [7]:
### Missing dates

start_date = '2013-01-01'
end_date = '2017-08-15'

# Build the full calendar for the given range
date_range = pd.date_range(start=start_date, end=end_date)

# Convert the train['date'] column to datetime
dates = pd.to_datetime(train['date'])

# Find dates in the range that are absent from the data
missing_dates = date_range[~date_range.isin(dates)].astype(str)
print("Missing dates:", missing_dates)

Missing dates: Index(['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-25'], dtype='object')


In [8]:
stores = train['store_nbr'].unique()
stores

array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,  4,
       40, 41, 42, 43, 44, 45, 46, 47, 48, 49,  5, 50, 51, 52, 53, 54,  6,
        7,  8,  9])

In [9]:
families = train['family'].unique()
families

array(['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD'], dtype=object)

In [10]:
### Fill in rows for the missing dates
## (4 dates) * (54 stores) * (33 product families) => 7,128 rows

# Collect the rows in a list first (faster than appending to a DataFrame)
new_rows = []
for date in missing_dates:
  for store in stores:
    for family in families:
      new_rows.append({'id':0, 'date':date, 'store_nbr':store, 'family':family, 'sales':np.nan, 'onpromotion':np.nan})

# Convert to a DataFrame
new_data = pd.DataFrame(new_rows)

# Concatenate the original train DataFrame with the new rows
train = pd.concat([train, new_data]).sort_values('date').reset_index(drop = True)

# Reassign id
train['id'] = train.index
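The triple loop above builds the Cartesian product of dates, stores, and product families. A possible alternative, sketched here on toy stand-ins (the `*_demo` names are assumptions, not the notebook's variables), is `pd.MultiIndex.from_product`:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's missing_dates, stores, and families
missing_dates_demo = ["2013-12-25", "2014-12-25"]
stores_demo = [1, 2, 3]
families_demo = ["AUTOMOTIVE", "BOOKS"]

# One row per (date, store, family) combination
grid = pd.MultiIndex.from_product(
    [missing_dates_demo, stores_demo, families_demo],
    names=["date", "store_nbr", "family"],
)
new_data_demo = grid.to_frame(index=False)
new_data_demo["sales"] = np.nan
new_data_demo["onpromotion"] = np.nan
```

With the real inputs this would produce the same 4 * 54 * 33 = 7,128 rows without Python-level loops.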

In [11]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3008016 entries, 0 to 3008015
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  float64
dtypes: float64(2), int64(2), object(2)
memory usage: 137.7+ MB


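- The row count grew by exactly 7,128 (3,000,888 -> 3,008,016), so the missing dates were filled in with the expected number of rows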

In [12]:
train.loc[train['date'] == '2013-12-25', :]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
637956,637956,2013-12-25,41,PRODUCE,,
637957,637957,2013-12-25,42,BREAD/BAKERY,,
637958,637958,2013-12-25,42,BOOKS,,
637959,637959,2013-12-25,42,BEVERAGES,,
637960,637960,2013-12-25,42,BEAUTY,,
...,...,...,...,...,...,...
639733,639733,2013-12-25,25,PLAYERS AND ELECTRONICS,,
639734,639734,2013-12-25,25,PET SUPPLIES,,
639735,639735,2013-12-25,26,BOOKS,,
639736,639736,2013-12-25,1,BABY CARE,,


In [13]:
# Fill NaN with 0
train[['sales', 'onpromotion']] = train[['sales', 'onpromotion']].fillna(0)

In [14]:
train.isna().sum()

id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

- Missing values handled

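#### **📌 Handling product families that are not sold**
- A product family with zero sales at a store over the entire period -> is it simply not sold there?
- A product family whose total sales over the entire period is 0 -> treated as not sold at that store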

In [15]:
# Total sales per store and product family
sales_by_store_family = train.groupby(['store_nbr', 'family'])['sales'].sum().reset_index()

# Keep the rows whose total sales is 0
zero_sales = sales_by_store_family[sales_by_store_family['sales'] == 0]

# Group the zero-sales rows by store_nbr
zero_sales_by_store = zero_sales.groupby('store_nbr')['family'].apply(list)
zero_sales_by_store

store_nbr
1                              [BABY CARE]
9                                  [BOOKS]
10                                 [BOOKS]
11                                 [BOOKS]
12                                 [BOOKS]
13                      [BABY CARE, BOOKS]
14                [BOOKS, LAWN AND GARDEN]
15                                 [BOOKS]
16                     [BOOKS, LADIESWEAR]
17                                 [BOOKS]
18                                 [BOOKS]
19                                 [BOOKS]
20                                 [BOOKS]
21                                 [BOOKS]
22                                 [BOOKS]
23                             [BABY CARE]
25                            [LADIESWEAR]
28                     [BOOKS, LADIESWEAR]
29                     [BOOKS, LADIESWEAR]
30                [BOOKS, LAWN AND GARDEN]
31                                 [BOOKS]
32                     [BOOKS, LADIESWEAR]
33                     [BOOKS, LADIESWEAR]
3

In [16]:
### Drop the product families that each store does not sell

for store_nbr, zero_sales_families in zero_sales_by_store.items():
  train = train[~((train['store_nbr'] == store_nbr) & (train['family'].isin(zero_sales_families)))]
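The loop above filters the frame once per store. As a sketch of an equivalent single-pass filter (toy data; the names `toy`, `total`, and `toy_sold` are illustrative), `groupby(...).transform("sum")` attaches each group's total to every row:

```python
import pandas as pd

# Toy data: store 1 never sells BOOKS, everything else has some sales
toy = pd.DataFrame({
    "store_nbr": [1, 1, 1, 1, 2, 2],
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY", "BOOKS", "BOOKS"],
    "sales": [0.0, 0.0, 0.0, 5.0, 1.0, 0.0],
})

# Total sales of each (store, family) pair, aligned with the original rows
total = toy.groupby(["store_nbr", "family"])["sales"].transform("sum")

# Keep only rows belonging to pairs with a non-zero total
toy_sold = toy[total != 0].reset_index(drop=True)
```

Individual zero-sales days are kept as long as the pair sold something at some point, which matches the rule stated above.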

In [17]:
### Re-check

# Total sales per store and product family
sales_by_store_family = train.groupby(['store_nbr', 'family'])['sales'].sum().reset_index()

# Keep the rows whose total sales is 0
zero_sales = sales_by_store_family[sales_by_store_family['sales'] == 0]

# Group the zero-sales rows by store_nbr
zero_sales_by_store = zero_sales.groupby('store_nbr')['family'].apply(list)
zero_sales_by_store

Series([], Name: family, dtype: object)

In [18]:
train = train.reset_index(drop = True)

# Reassign id
train['id'] = train.index

- The unsold product families were removed cleanly.

In [19]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2918552 entries, 0 to 2918551
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  float64
dtypes: float64(2), int64(2), object(2)
memory usage: 133.6+ MB


In [20]:
train.isna().sum()

id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

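#### **📌 Handling missing values**
- It seems reasonable to treat rows with zero sales as missing values
  - For a different interpretation, see this [kaggle](https://www.kaggle.com/code/ekrembayar/store-sales-ts-forecasting-a-comprehensive-guide) notebook
- How to handle these values is worth deciding after further EDA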
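If zeros are to be flagged as missing while the decision is still open, one non-destructive option is to keep the original column and add a masked copy. This is only a sketch on toy values; the column name `sales_missing` is an assumption:

```python
import pandas as pd

toy_sales = pd.DataFrame({"sales": [0.0, 2.5, 0.0, 7.0]})

# mask() replaces values with NaN wherever the condition is True
toy_sales["sales_missing"] = toy_sales["sales"].mask(toy_sales["sales"] == 0)

# Share of rows that would count as missing under this interpretation
zero_share = toy_sales["sales_missing"].isna().mean()
```

Keeping both columns makes it easy to compare the two interpretations during EDA.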

In [21]:
train.isna().sum()

id             0
date           0
store_nbr      0
family         0
sales          0
onpromotion    0
dtype: int64

In [22]:
train.loc[train['sales'] == 0,:]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.00,0.00
1,1,2013-01-01,42,CELEBRATION,0.00,0.00
2,2,2013-01-01,42,BREAD/BAKERY,0.00,0.00
3,3,2013-01-01,42,BOOKS,0.00,0.00
4,4,2013-01-01,42,BEVERAGES,0.00,0.00
...,...,...,...,...,...,...
2918522,2918522,2017-08-15,24,SCHOOL AND OFFICE SUPPLIES,0.00,0.00
2918529,2918529,2017-08-15,26,HARDWARE,0.00,0.00
2918541,2918541,2017-08-15,26,BEAUTY,0.00,0.00
2918542,2918542,2017-08-15,26,BABY CARE,0.00,0.00


### **b) stores.csv**
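- Store metadata  

**Column descriptions**
  - `store_nbr`: unique store identifier
  - `city`: city where the store is located
    - 22 cities
  - `state`: state where the store is located
    - 16 states
  - `type`: store type
    - A, B, C, D, E
  - `cluster`: grouping of similar stores
    - 1 to 17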

In [23]:
stores = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA/OB/방학프로젝트/stores.csv')
stores.head(10)

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4
5,6,Quito,Pichincha,D,13
6,7,Quito,Pichincha,D,8
7,8,Quito,Pichincha,D,8
8,9,Quito,Pichincha,B,6
9,10,Quito,Pichincha,C,15


In [24]:
stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


In [25]:
print(stores['city'].unique())
print(len(stores['city'].unique()))

['Quito' 'Santo Domingo' 'Cayambe' 'Latacunga' 'Riobamba' 'Ibarra'
 'Guaranda' 'Puyo' 'Ambato' 'Guayaquil' 'Salinas' 'Daule' 'Babahoyo'
 'Quevedo' 'Playas' 'Libertad' 'Cuenca' 'Loja' 'Machala' 'Esmeraldas'
 'Manta' 'El Carmen']
22


In [26]:
print(stores['state'].unique())
print(len(stores['state'].unique()))

['Pichincha' 'Santo Domingo de los Tsachilas' 'Cotopaxi' 'Chimborazo'
 'Imbabura' 'Bolivar' 'Pastaza' 'Tungurahua' 'Guayas' 'Santa Elena'
 'Los Rios' 'Azuay' 'Loja' 'El Oro' 'Esmeraldas' 'Manabi']
16


In [27]:
print(stores['type'].unique())
print(len(stores['type'].unique()))

['D' 'B' 'C' 'E' 'A']
5


In [28]:
print(stores['cluster'].unique())
print(len(stores['cluster'].unique()))

[13  8  9  4  6 15  7  3 12 16  1 10  2  5 11 14 17]
17


In [29]:
result = stores.groupby('state')['city'].unique()
print(result)

state
Azuay                                                         [Cuenca]
Bolivar                                                     [Guaranda]
Chimborazo                                                  [Riobamba]
Cotopaxi                                                   [Latacunga]
El Oro                                                       [Machala]
Esmeraldas                                                [Esmeraldas]
Guayas                            [Guayaquil, Daule, Playas, Libertad]
Imbabura                                                      [Ibarra]
Loja                                                            [Loja]
Los Rios                                           [Babahoyo, Quevedo]
Manabi                                              [Manta, El Carmen]
Pastaza                                                         [Puyo]
Pichincha                                             [Quito, Cayambe]
Santa Elena                                                  [Salinas]


### **c) oil.csv**
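- Daily oil price
  - Includes values for both the training and test periods
- Ecuador is an oil-dependent country
  - Its economy is highly vulnerable to swings in oil prices
  - This data can help identify which product families are affected by oil prices

**Column descriptions**  
- `date`: observation date
  - range: 2013/01/01 ~ 2017/08/31
- `dcoilwtico`: oil price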

In [30]:
oil = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA/OB/방학프로젝트/oil.csv')
oil.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
5,2013-01-08,93.21
6,2013-01-09,93.08
7,2013-01-10,93.81
8,2013-01-11,93.6
9,2013-01-14,94.27


#### **📌 date preprocessing**

**Checking the dates**
- 2013 (365), 2014 (365), 2015 (365), 2016 (366), 2017 through 08-31 (243) => 1,704 days

In [31]:
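# Remove any spaces inside the date strings
# (Series.replace only matches whole values, so the .str accessor is needed)
oil['date'] = oil['date'].str.replace(" ", "", regex=False)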

print(oil['date'].unique())
print(len(oil['date'].unique()))

['2013-01-01' '2013-01-02' '2013-01-03' ... '2017-08-29' '2017-08-30'
 '2017-08-31']
1218


In [32]:
### Missing dates

start_date = '2013-01-01'
end_date = '2017-08-31'

# Build the full calendar for the given range
date_range = pd.date_range(start=start_date, end=end_date)

# Convert the oil['date'] column to datetime
dates = pd.to_datetime(oil['date'])

# Find dates in the range that are absent from the data
missing_dates = date_range[~date_range.isin(dates)].astype(str)
print("Missing dates:", missing_dates)

Missing dates: Index(['2013-01-05', '2013-01-06', '2013-01-12', '2013-01-13', '2013-01-19',
       '2013-01-20', '2013-01-26', '2013-01-27', '2013-02-02', '2013-02-03',
       ...
       '2017-07-29', '2017-07-30', '2017-08-05', '2017-08-06', '2017-08-12',
       '2017-08-13', '2017-08-19', '2017-08-20', '2017-08-26', '2017-08-27'],
      dtype='object', length=486)


In [33]:
### Fill in rows for the missing dates

# Collect the rows in a list first (faster than appending to a DataFrame)
new_rows = []
for date in missing_dates:
  new_rows.append({'date':date, 'dcoilwtico':np.nan})

# Convert to a DataFrame
new_data = pd.DataFrame(new_rows)

# Concatenate the original oil DataFrame with the new rows
oil = pd.concat([oil, new_data]).sort_values('date').reset_index(drop = True)

In [34]:
oil.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1704 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 26.8+ KB


#### **📌 Handling missing values**
- Fill with the next available value? Interpolate?
  - [Replacing missing values](https://wikidocs.net/153209)
  - For now, **linear interpolation** is used
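To make the options in the list above concrete, here is a small comparison on a toy series (values are made up, not oil prices from the data):

```python
import numpy as np
import pandas as pd

# Two consecutive gaps, like a weekend between two trading days
price = pd.Series([10.0, np.nan, np.nan, 16.0])

filled = pd.DataFrame({
    "ffill": price.ffill(),                        # carry the previous value forward
    "bfill": price.bfill(),                        # pull the next value backward
    "linear": price.interpolate(method="linear"),  # straight line between neighbors
})
```

Forward and backward fill produce a step at one end of the gap, while linear interpolation spreads the change evenly across it.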

In [35]:
oil.isna().sum()

date            0
dcoilwtico    529
dtype: int64

In [36]:
oil.loc[oil['dcoilwtico'] == 0, :] # fortunately, no oil price is recorded as 0 in place of a missing value

Unnamed: 0,date,dcoilwtico


In [37]:
# Fill missing values in the 'dcoilwtico' (oil price) column by linear interpolation
oil['dcoilwtico'] = oil['dcoilwtico'].interpolate(method='linear')

In [38]:
oil.isna().sum()

date          0
dcoilwtico    1
dtype: int64

In [39]:
oil.head()

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-05,93.15


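* One missing value still remains
    
    -> It is the first day's value, which has no earlier observation to interpolate from

    -> Copying the second day's value into it barely changes the data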
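A possible way to handle the leading gap in one expression, sketched on a toy series: by default `interpolate` only fills forward, so a `bfill()` afterwards copies the first observed value into any leading NaN.

```python
import numpy as np
import pandas as pd

# Toy series with a missing first value and an interior gap
s = pd.Series([np.nan, 93.14, np.nan, 92.97])

interpolated = s.interpolate(method="linear")  # leaves the leading NaN in place
completed = interpolated.bfill()               # fills it with the first observed value
```

This gives the same result as assigning the second day's value to the first day by hand.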

In [40]:
oil.at[0, 'dcoilwtico'] = oil.at[1, 'dcoilwtico']
oil['dcoilwtico']

0      93.14
1      93.14
2      92.97
3      93.12
4      93.15
        ... 
1699   46.82
1700   46.40
1701   46.46
1702   45.96
1703   47.26
Name: dcoilwtico, Length: 1704, dtype: float64

In [41]:
oil.isna().sum()

date          0
dcoilwtico    0
dtype: int64

* No missing values remain

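### **d) transactions.csv**
- Transaction data: the number of people who visited a store in a day, or the number of invoices (receipts) issued in a day
  - Highly correlated with the sales column of the training data
  - Helps in understanding each store's sales patterns
- `date`: date
  - 2013/01/01 ~ 2017/08/15
- `store_nbr`: unique store number
- `transactions`: number of transactions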
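One way to check the claimed relationship with sales is to sum sales per store and day, join the result to the transaction counts, and compute a correlation. The sketch below uses made-up toy frames (`toy_train`, `toy_tx`), so the resulting number says nothing about the real data:

```python
import pandas as pd

toy_train = pd.DataFrame({
    "date": ["2013-01-02"] * 2 + ["2013-01-03"] * 2 + ["2013-01-04"] * 2,
    "store_nbr": [1] * 6,
    "family": ["DAIRY", "BOOKS"] * 3,
    "sales": [10.0, 0.0, 20.0, 1.0, 30.0, 2.0],
})
toy_tx = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-03", "2013-01-04"],
    "store_nbr": [1, 1, 1],
    "transactions": [100, 210, 290],
})

# Collapse product families so there is one sales total per (date, store)
daily = toy_train.groupby(["date", "store_nbr"], as_index=False)["sales"].sum()

# Align with transactions on the shared keys
merged = daily.merge(toy_tx, on=["date", "store_nbr"], how="inner")

# Pearson correlation between daily sales and transaction counts
corr = merged["sales"].corr(merged["transactions"])
```

An inner join is used so that days missing from either table do not introduce NaN values into the correlation.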

In [42]:
transaction = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA/OB/방학프로젝트/transactions.csv')
transaction.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [43]:
transaction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          83488 non-null  object
 1   store_nbr     83488 non-null  int64 
 2   transactions  83488 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.9+ MB


#### **📌 date preprocessing**

**Checking the dates**
- 2013 (365), 2014 (365), 2015 (365), 2016 (366), 2017 through 08-15 (227) => 1,688 days

In [44]:
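# Remove any spaces inside the date strings
# (Series.replace only matches whole values, so the .str accessor is needed)
transaction['date'] = transaction['date'].str.replace(" ", "", regex=False)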

print(transaction['date'].unique())
print(len(transaction['date'].unique()))

['2013-01-01' '2013-01-02' '2013-01-03' ... '2017-08-13' '2017-08-14'
 '2017-08-15']
1682


In [45]:
### Missing dates

start_date = '2013-01-01'
end_date = '2017-08-15'

# Build the full calendar for the given range
date_range = pd.date_range(start=start_date, end=end_date)

# Convert the transaction['date'] column to datetime
dates = pd.to_datetime(transaction['date'])

# Find dates in the range that are absent from the data
missing_dates = date_range[~date_range.isin(dates)].astype(str)
print("Missing dates:", missing_dates)

Missing dates: Index(['2013-12-25', '2014-12-25', '2015-12-25', '2016-01-01', '2016-01-03',
       '2016-12-25'],
      dtype='object')
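The same missing-date check is repeated for train, oil, and transactions. A small helper could factor it out; `find_missing_dates` is a hypothetical name, and the sketch assumes dates are parseable by `pd.to_datetime`:

```python
import pandas as pd

def find_missing_dates(date_values, start, end):
    """Return calendar days in [start, end] that do not appear in date_values."""
    observed = pd.DatetimeIndex(pd.to_datetime(list(date_values))).normalize()
    full = pd.date_range(start=start, end=end, freq="D")
    # Index.difference keeps the days of the full calendar not seen in the data
    return full.difference(observed).strftime("%Y-%m-%d").tolist()

# Toy input: December 25 is absent
gaps = find_missing_dates(
    ["2013-12-24", "2013-12-26", "2013-12-26", "2013-12-27"],
    start="2013-12-24",
    end="2013-12-27",
)
```

Called with each table's `date` column and its own date range, it would return the lists printed in the cells above.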


### **e) holidays_events.csv**
- Holiday and event metadata
- `date`: date
  - 2012/03/02 ~ 2017/12/26
- `type`: holiday type
  - Holiday, Event, Additional, Transfer, Bridge, Work Day
- `locale`: scope
  - National, Local, Regional
- `locale_name`: name of the locale
  - 24 locales
- `description`: description
- `transferred`: whether the holiday was moved to another date
  - True, False


In [46]:
holidays_events = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA/OB/방학프로젝트/holidays_events.csv')
holidays_events.head(10)

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
5,2012-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
6,2012-06-23,Holiday,Local,Guaranda,Cantonizacion de Guaranda,False
7,2012-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
8,2012-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
9,2012-06-25,Holiday,Local,Machala,Fundacion de Machala,False


In [47]:
# Drop rows before 2013-01-01, which are not needed
holidays_events = holidays_events.loc[holidays_events['date'] > '2012-12-31'].reset_index(drop=True)

# Drop rows after 2017-08-31, which are not needed
holidays_events = holidays_events.loc[holidays_events['date'] <= '2017-08-31'].reset_index(drop=True)

In [48]:
holidays_events['date'].unique()

array(['2013-01-01', '2013-01-05', '2013-01-12', '2013-02-11',
       '2013-02-12', '2013-03-02', '2013-04-01', '2013-04-12',
       '2013-04-14', '2013-04-21', '2013-04-29', '2013-05-01',
       '2013-05-11', '2013-05-12', '2013-05-24', '2013-06-23',
       '2013-06-25', '2013-07-03', '2013-07-23', '2013-07-24',
       '2013-07-25', '2013-08-05', '2013-08-10', '2013-08-15',
       '2013-08-24', '2013-09-28', '2013-10-07', '2013-10-09',
       '2013-10-11', '2013-11-02', '2013-11-03', '2013-11-06',
       '2013-11-07', '2013-11-10', '2013-11-11', '2013-11-12',
       '2013-12-05', '2013-12-06', '2013-12-08', '2013-12-21',
       '2013-12-22', '2013-12-23', '2013-12-24', '2013-12-25',
       '2013-12-26', '2013-12-31', '2014-01-01', '2014-03-02',
       '2014-03-03', '2014-03-04', '2014-04-01', '2014-04-12',
       '2014-04-14', '2014-04-18', '2014-04-21', '2014-05-01',
       '2014-05-10', '2014-05-11', '2014-05-12', '2014-05-24',
       '2014-06-12', '2014-06-15', '2014-06-20', '2014-

In [49]:
### Rename the column
# Its name collides with the `type` column in stores.csv

holidays_events = holidays_events.rename(columns={'type': 'category'})

In [50]:
holidays_events['category'].unique()

array(['Holiday', 'Work Day', 'Additional', 'Event', 'Transfer', 'Bridge'],
      dtype=object)

In [51]:
holidays_events['locale'].unique()

array(['National', 'Local', 'Regional'], dtype=object)

#### **📌 category handling**

**Transfer**

In [52]:
# Drop rows where transferred == True

holidays_events = holidays_events.loc[holidays_events['transferred'] != True].reset_index(drop=True) # keep only False

In [53]:
holidays_events['transferred'].unique() # dropped correctly

array([False])

**Bridge & Work Day**

- Bridge: an extra day added to extend a holiday
   - Normally a working day, but treated as a holiday

- Work Day
    - A weekend day on which people work
        
        -> Both are expected to affect store sales, so they are not dropped

#### **📌 locale handling**
- Duplicate dates are likely to cause problems when merging
  - Inspect the duplicated rows and remove as many duplicates as possible
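The merge problem mentioned above can be seen on a toy example (frames and values are illustrative): a left merge on `date` repeats every sales row once per matching holiday row, and `validate="many_to_one"` turns that silent duplication into an error.

```python
import pandas as pd

sales_demo = pd.DataFrame({"date": ["2013-05-12", "2013-05-13"], "sales": [5.0, 7.0]})
holidays_demo = pd.DataFrame({
    "date": ["2013-05-12", "2013-05-12"],
    "description": ["Cantonizacion del Puyo", "Dia de la Madre"],
})

# Two holiday rows share one date, so the 2013-05-12 sales row appears twice
merged_demo = sales_demo.merge(holidays_demo, on="date", how="left")

# validate= makes pandas check that the right-hand keys are unique
try:
    sales_demo.merge(holidays_demo, on="date", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```

Row duplication of this kind would inflate total sales after the merge, which is why the duplicated dates are resolved first.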

In [54]:
## National holidays
# locale_name == 'Ecuador'

print(holidays_events.loc[holidays_events['locale'] == 'National', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'National', 'locale_name'].unique()))

['Ecuador']
1


In [55]:
## Regional (state-level) holidays
# locale_name == state name

print(holidays_events.loc[holidays_events['locale'] == 'Regional', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'Regional', 'locale_name'].unique()))

['Cotopaxi' 'Imbabura' 'Santo Domingo de los Tsachilas' 'Santa Elena']
4


In [56]:
## Local (city-level) holidays
# locale_name == city name

print(holidays_events.loc[holidays_events['locale'] == 'Local', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'Local', 'locale_name'].unique()))

['Manta' 'Cuenca' 'Libertad' 'Riobamba' 'Puyo' 'Guaranda' 'Machala'
 'Latacunga' 'El Carmen' 'Santo Domingo' 'Cayambe' 'Guayaquil'
 'Esmeraldas' 'Ambato' 'Ibarra' 'Quevedo' 'Quito' 'Loja' 'Salinas']
19
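Since `locale_name` holds the country, a state, or a city depending on `locale`, one way to attach holidays to stores is to join each locale level on its own key. The sketch below uses toy frames with illustrative dates; it is not the notebook's merge step.

```python
import pandas as pd

# Toy rows; the dates are illustrative, not taken from holidays_events.csv
stores_demo = pd.DataFrame({
    "store_nbr": [1, 5],
    "city": ["Quito", "Santo Domingo"],
    "state": ["Pichincha", "Santo Domingo de los Tsachilas"],
})
holidays_demo = pd.DataFrame({
    "date": ["2013-01-01", "2013-02-01", "2013-03-01"],
    "locale": ["National", "Regional", "Local"],
    "locale_name": ["Ecuador", "Santo Domingo de los Tsachilas", "Quito"],
})

def rows(locale):
    """Select the holiday rows of one locale level."""
    return holidays_demo[holidays_demo["locale"] == locale]

# National holidays apply to every store; the others match on state or city
national = rows("National").merge(stores_demo[["store_nbr"]], how="cross")
regional = rows("Regional").merge(stores_demo, left_on="locale_name", right_on="state")
local = rows("Local").merge(stores_demo, left_on="locale_name", right_on="city")

store_holidays = (
    pd.concat([national, regional, local], ignore_index=True)
    [["date", "store_nbr", "locale"]]
    .sort_values(["date", "store_nbr"])
    .reset_index(drop=True)
)
```

The result has one row per (date, store) on which a holiday applies, which can then be merged onto the training data by `date` and `store_nbr`.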


In [57]:
## Check for duplicated dates

holidays_events[holidays_events.duplicated(subset = 'date', keep = False)]

Unnamed: 0,date,category,locale,locale_name,description,transferred
13,2013-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
14,2013-05-12,Event,National,Ecuador,Dia de la Madre,False
17,2013-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
18,2013-06-25,Holiday,Local,Machala,Fundacion de Machala,False
19,2013-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
20,2013-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
21,2013-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False
43,2013-12-22,Additional,National,Ecuador,Navidad-3,False
44,2013-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
68,2014-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


In [58]:
holidays_events[holidays_events.duplicated(subset = 'date', keep = False)]['date'].unique()

array(['2013-05-12', '2013-06-25', '2013-07-03', '2013-12-22',
       '2014-06-25', '2014-07-03', '2014-12-22', '2014-12-26',
       '2015-06-25', '2015-07-03', '2015-12-22', '2016-04-21',
       '2016-05-01', '2016-05-07', '2016-05-08', '2016-05-12',
       '2016-06-25', '2016-07-03', '2016-07-24', '2016-11-12',
       '2016-12-22', '2017-04-14', '2017-06-25', '2017-07-03'],
      dtype=object)
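Many of the date-by-date decisions that follow apply one rule: when a National row exists on a date, the Local and Regional rows on that date are dropped; otherwise all rows are kept. As a sketch of that rule only (toy rows; dates with two National rows would still need a manual choice):

```python
import pandas as pd

h = pd.DataFrame({
    "date": ["2013-05-12", "2013-05-12", "2013-06-25", "2013-06-25", "2013-06-25"],
    "locale": ["Local", "National", "Regional", "Local", "Local"],
    "locale_name": ["Puyo", "Ecuador", "Imbabura", "Machala", "Latacunga"],
})

# True on every row whose date has at least one National entry
is_national = h["locale"].eq("National")
has_national = is_national.groupby(h["date"]).transform("any")

# Drop non-National rows only on dates that also have a National row
deduped = h[~(has_national & ~is_national)].reset_index(drop=True)
```

On the toy rows this keeps the National row for 2013-05-12 and all three rows for 2013-06-25.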

**2013-05-12**


In [59]:
holidays_events.loc[holidays_events['date'] == '2013-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
13,2013-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
14,2013-05-12,Event,National,Ecuador,Dia de la Madre,False


In [60]:
# Keep only the National row

holidays_events = holidays_events[~((holidays_events['date'] == '2013-05-12') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2013-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
14,2013-05-12,Event,National,Ecuador,Dia de la Madre,False


**2013-06-25**

In [61]:
holidays_events.loc[holidays_events['date'] == '2013-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
17,2013-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
18,2013-06-25,Holiday,Local,Machala,Fundacion de Machala,False
19,2013-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- The three locales do not overlap -> keep all

**2013-07-03**

In [62]:
holidays_events.loc[holidays_events['date'] == '2013-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
20,2013-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
21,2013-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2013-12-22**

In [63]:
holidays_events.loc[holidays_events['date'] == '2013-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
43,2013-12-22,Additional,National,Ecuador,Navidad-3,False
44,2013-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [64]:
# Keep only the National row

holidays_events = holidays_events[~((holidays_events['date'] == '2013-12-22') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2013-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
43,2013-12-22,Additional,National,Ecuador,Navidad-3,False


**2014-06-25**

In [65]:
holidays_events.loc[holidays_events['date'] == '2014-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
68,2014-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
69,2014-06-25,Holiday,Local,Machala,Fundacion de Machala,False
70,2014-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
71,2014-06-25,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Francia,False


In [66]:
# Keep only the National row

holidays_events = holidays_events[~((holidays_events['date'] == '2014-06-25') & (holidays_events['locale'] == 'Local'))]
holidays_events = holidays_events[~((holidays_events['date'] == '2014-06-25') & (holidays_events['locale'] == 'Regional'))]

holidays_events.loc[holidays_events['date'] == '2014-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
71,2014-06-25,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Francia,False


**2014-07-03**

In [67]:
holidays_events.loc[holidays_events['date'] == '2014-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
76,2014-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
77,2014-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2014-12-22**

In [68]:
holidays_events.loc[holidays_events['date'] == '2014-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
108,2014-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
109,2014-12-22,Additional,National,Ecuador,Navidad-3,False


In [69]:
# Keep only the National row

holidays_events = holidays_events[~((holidays_events['date'] == '2014-12-22') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2014-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
109,2014-12-22,Additional,National,Ecuador,Navidad-3,False


**2014-12-26**

In [70]:
holidays_events.loc[holidays_events['date'] == '2014-12-26',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
113,2014-12-26,Bridge,National,Ecuador,Puente Navidad,False
114,2014-12-26,Additional,National,Ecuador,Navidad+1,False


In [71]:
# Keep only the first row (Bridge)

holidays_events = holidays_events[~((holidays_events['date'] == '2014-12-26') & (holidays_events['category'] == 'Additional'))]
holidays_events.loc[holidays_events['date'] == '2014-12-26',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
113,2014-12-26,Bridge,National,Ecuador,Puente Navidad,False


**2015-06-25**

In [72]:
holidays_events.loc[holidays_events['date'] == '2015-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
133,2015-06-25,Holiday,Local,Machala,Fundacion de Machala,False
134,2015-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
135,2015-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- Keep all

**2015-07-03**

In [73]:
holidays_events.loc[holidays_events['date'] == '2015-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
136,2015-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
137,2015-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep all

**2015-12-22**

In [74]:
holidays_events.loc[holidays_events['date'] == '2015-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
161,2015-12-22,Additional,National,Ecuador,Navidad-3,False
162,2015-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [75]:
# Keep only the National row

holidays_events = holidays_events[~((holidays_events['date'] == '2015-12-22') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2015-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
161,2015-12-22,Additional,National,Ecuador,Navidad-3,False


**2016-04-21**

In [76]:
holidays_events.loc[holidays_events['date'] == '2016-04-21',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
181,2016-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
182,2016-04-21,Event,National,Ecuador,Terremoto Manabi+5,False


In [77]:
# Keep only the National row

holidays_events = holidays_events[~((holidays_events['date'] == '2016-04-21') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2016-04-21',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
182,2016-04-21,Event,National,Ecuador,Terremoto Manabi+5,False


**2016-05-01**

In [78]:
holidays_events.loc[holidays_events['date'] == '2016-05-01',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
192,2016-05-01,Holiday,National,Ecuador,Dia del Trabajo,False
193,2016-05-01,Event,National,Ecuador,Terremoto Manabi+15,False


In [79]:
# Keep only the first row (Holiday)

holidays_events = holidays_events[~((holidays_events['date'] == '2016-05-01') & (holidays_events['category'] == 'Event'))]
holidays_events.loc[holidays_events['date'] == '2016-05-01',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
192,2016-05-01,Holiday,National,Ecuador,Dia del Trabajo,False


**2016-05-07**

In [80]:
holidays_events.loc[holidays_events['date'] == '2016-05-07',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
199,2016-05-07,Additional,National,Ecuador,Dia de la Madre-1,False
200,2016-05-07,Event,National,Ecuador,Terremoto Manabi+21,False


In [81]:
# Keep only the first row (Additional)

holidays_events = holidays_events[~((holidays_events['date'] == '2016-05-07') & (holidays_events['category'] == 'Event'))]
holidays_events.loc[holidays_events['date'] == '2016-05-07',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
199,2016-05-07,Additional,National,Ecuador,Dia de la Madre-1,False


**2016-05-08**

In [82]:
holidays_events.loc[holidays_events['date'] == '2016-05-08',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
201,2016-05-08,Event,National,Ecuador,Terremoto Manabi+22,False
202,2016-05-08,Event,National,Ecuador,Dia de la Madre,False


In [83]:
# Keep only the second row (Dia de la Madre)

holidays_events = holidays_events[~((holidays_events['date'] == '2016-05-08') & (holidays_events['description'] == 'Terremoto Manabi+22'))]
holidays_events.loc[holidays_events['date'] == '2016-05-08',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
202,2016-05-08,Event,National,Ecuador,Dia de la Madre,False


**2016-05-12**

In [84]:
holidays_events.loc[holidays_events['date'] == '2016-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
206,2016-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
207,2016-05-12,Event,National,Ecuador,Terremoto Manabi+26,False


In [85]:
# Keep only the National row; drop the Local row

holidays_events = holidays_events[~((holidays_events['date'] == '2016-05-12') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2016-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
207,2016-05-12,Event,National,Ecuador,Terremoto Manabi+26,False


**2016-06-25**

In [86]:
holidays_events.loc[holidays_events['date'] == '2016-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
214,2016-06-25,Holiday,Local,Machala,Fundacion de Machala,False
215,2016-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
216,2016-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- Keep all three (each row applies to a different locale)

**2016-07-03**

In [87]:
holidays_events.loc[holidays_events['date'] == '2016-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
217,2016-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
218,2016-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both (each row applies to a different city)

**2016-07-24**

In [88]:
holidays_events.loc[holidays_events['date'] == '2016-07-24',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
220,2016-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
221,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False


In [89]:
# Keep only the second row (Transfer); drop the Additional row

holidays_events = holidays_events[~((holidays_events['date'] == '2016-07-24') & (holidays_events['category'] == 'Additional'))]
holidays_events.loc[holidays_events['date'] == '2016-07-24',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
221,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False


**2016-11-12**

In [90]:
holidays_events.loc[holidays_events['date'] == '2016-11-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
236,2016-11-12,Holiday,Local,Ambato,Independencia de Ambato,False
237,2016-11-12,Work Day,National,Ecuador,Recupero Puente Dia de Difuntos,False


In [91]:
# Keep only the National row; drop the Local row

holidays_events = holidays_events[~((holidays_events['date'] == '2016-11-12') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2016-11-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
237,2016-11-12,Work Day,National,Ecuador,Recupero Puente Dia de Difuntos,False


**2016-12-22**

In [92]:
holidays_events.loc[holidays_events['date'] == '2016-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
244,2016-12-22,Additional,National,Ecuador,Navidad-3,False
245,2016-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [93]:
# Keep only the National row; drop the Local row

holidays_events = holidays_events[~((holidays_events['date'] == '2016-12-22') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2016-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
244,2016-12-22,Additional,National,Ecuador,Navidad-3,False


**2017-04-14**

In [94]:
holidays_events.loc[holidays_events['date'] == '2017-04-14',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
257,2017-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
258,2017-04-14,Holiday,National,Ecuador,Viernes Santo,False


In [95]:
# Keep only the National row; drop the Local row

holidays_events = holidays_events[~((holidays_events['date'] == '2017-04-14') & (holidays_events['locale'] == 'Local'))]
holidays_events.loc[holidays_events['date'] == '2017-04-14',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
258,2017-04-14,Holiday,National,Ecuador,Viernes Santo,False


**2017-06-25**

In [96]:
holidays_events.loc[holidays_events['date'] == '2017-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
266,2017-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
267,2017-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
268,2017-06-25,Holiday,Local,Machala,Fundacion de Machala,False


- Keep all three (each row applies to a different locale)

**2017-07-03**

In [97]:
holidays_events.loc[holidays_events['date'] == '2017-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
269,2017-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
270,2017-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both (each row applies to a different city)

In [98]:
## Re-check the dates that still appear more than once

holidays_events[holidays_events.duplicated(subset = 'date', keep = False)]

Unnamed: 0,date,category,locale,locale_name,description,transferred
17,2013-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
18,2013-06-25,Holiday,Local,Machala,Fundacion de Machala,False
19,2013-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
20,2013-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
21,2013-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False
76,2014-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
77,2014-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False
133,2015-06-25,Holiday,Local,Machala,Fundacion de Machala,False
134,2015-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
135,2015-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
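
The per-date decisions above all repeat the same boolean-mask pattern. As a minimal sketch, assuming a small hypothetical frame `demo` with the same column names, the pattern could be wrapped in a helper (`drop_holiday_rows` is an illustrative name, not part of the original notebook) and driven by a list of rules:

```python
import pandas as pd

# Hypothetical miniature of holidays_events, built from two of the dates discussed above.
demo = pd.DataFrame({
    'date': ['2016-05-01', '2016-05-01', '2016-05-12', '2016-05-12'],
    'category': ['Holiday', 'Event', 'Holiday', 'Event'],
    'locale': ['National', 'National', 'Local', 'National'],
    'description': ['Dia del Trabajo', 'Terremoto Manabi+15',
                    'Cantonizacion del Puyo', 'Terremoto Manabi+26'],
})

def drop_holiday_rows(df, date, column, value):
    """Return df without the rows where df['date'] == date and df[column] == value."""
    mask = (df['date'] == date) & (df[column] == value)
    return df[~mask]

# The same decisions as in the cells above, written as (date, column, value) rules.
rules = [
    ('2016-05-01', 'category', 'Event'),  # keep the Holiday row
    ('2016-05-12', 'locale', 'Local'),    # keep the National row
]
for date, column, value in rules:
    demo = drop_holiday_rows(demo, date, column, value)

# Dates that still have more than one row after applying the rules.
remaining_dupes = demo[demo.duplicated(subset='date', keep=False)]
print(demo)
print(len(remaining_dupes))
```

Keeping the rules in one list makes it easier to review which rows were dropped and why, at the cost of hiding the before/after output that the individual cells show.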


In [99]:
### Check the final holidays_events

holidays_events.head()

Unnamed: 0,date,category,locale,locale_name,description,transferred
0,2013-01-01,Holiday,National,Ecuador,Primer dia del ano,False
1,2013-01-05,Work Day,National,Ecuador,Recupero puente Navidad,False
2,2013-01-12,Work Day,National,Ecuador,Recupero puente primer dia del ano,False
3,2013-02-11,Holiday,National,Ecuador,Carnaval,False
4,2013-02-12,Holiday,National,Ecuador,Carnaval,False


In [100]:
holidays_events.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 261 entries, 0 to 277
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         261 non-null    object
 1   category     261 non-null    object
 2   locale       261 non-null    object
 3   locale_name  261 non-null    object
 4   description  261 non-null    object
 5   transferred  261 non-null    bool  
dtypes: bool(1), object(5)
memory usage: 12.5+ KB


## **1-2. Test data**
- Period: 2017/08/16 ~ 2017/08/31

In [101]:
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA/OB/방학프로젝트/test.csv')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


In [102]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           28512 non-null  int64 
 1   date         28512 non-null  object
 2   store_nbr    28512 non-null  int64 
 3   family       28512 non-null  object
 4   onpromotion  28512 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ MB


## **1-3. Submission file**

In [103]:
sample_submission = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA/OB/방학프로젝트/sample_submission.csv')
sample_submission.head()

Unnamed: 0,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0


# **2. Merging the data**
- Every row of `train` is kept in the resulting DataFrame; information from the other DataFrame is attached only where the merge key matches (a left join)
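
The left-join behaviour described above can be checked on a toy example. This is a sketch with hypothetical miniature tables (`left`, `right`, and the unmatched key `99` are illustrative); `validate` and `indicator` are optional `pd.merge` arguments that the notebook itself does not use:

```python
import pandas as pd

# Hypothetical miniature tables; store 99 has no match on the right side.
left = pd.DataFrame({'store_nbr': [1, 1, 42, 99], 'sales': [0.0, 5.0, 3.0, 1.0]})
right = pd.DataFrame({'store_nbr': [1, 42], 'city': ['Quito', 'Cuenca']})

merged = pd.merge(
    left, right,
    on='store_nbr',
    how='left',              # keep every row of `left`
    validate='many_to_one',  # raise if `right` contains duplicate keys
    indicator=True,          # add a `_merge` column: 'both' or 'left_only'
)

# The row count is unchanged; unmatched keys get NaN in the right-hand columns.
print(len(merged) == len(left))
print(merged[['store_nbr', 'city', '_merge']])
```

Comparing the row count before and after a merge is a cheap way to notice an accidental many-to-many join.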

## **2-1. train + stores**
- Merge on `store_nbr`

In [104]:
# Merge on the store ID

train = pd.merge(train, stores, on = 'store_nbr', how = 'left')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2


In [105]:
## test

test = pd.merge(test, stores, on = 'store_nbr', how = 'left')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type,cluster
0,3000888,2017-08-16,1,AUTOMOTIVE,0,Quito,Pichincha,D,13
1,3000889,2017-08-16,1,BABY CARE,0,Quito,Pichincha,D,13
2,3000890,2017-08-16,1,BEAUTY,2,Quito,Pichincha,D,13
3,3000891,2017-08-16,1,BEVERAGES,20,Quito,Pichincha,D,13
4,3000892,2017-08-16,1,BOOKS,0,Quito,Pichincha,D,13


## **2-2. train + oil**
- Merge on the `date` column

In [106]:
# Merge on the date
train = pd.merge(train, oil, on = 'date', how = 'left')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,93.14
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2,93.14
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2,93.14
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2,93.14
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2,93.14


In [107]:
## test

test = pd.merge(test, oil, on = 'date', how = 'left')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type,cluster,dcoilwtico
0,3000888,2017-08-16,1,AUTOMOTIVE,0,Quito,Pichincha,D,13,46.8
1,3000889,2017-08-16,1,BABY CARE,0,Quito,Pichincha,D,13,46.8
2,3000890,2017-08-16,1,BEAUTY,2,Quito,Pichincha,D,13,46.8
3,3000891,2017-08-16,1,BEVERAGES,20,Quito,Pichincha,D,13,46.8
4,3000892,2017-08-16,1,BOOKS,0,Quito,Pichincha,D,13,46.8


## **2-3. train + transaction**
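- Usable for EDA
- Probably hard to use for model training
  - There is no transaction data for the period covered by the test data
  - The variable has to be dropped before model training
- The primary key of transaction is (date, store_nbr)
  - Rows can only be told apart by considering the two columns as a pair
  - Merge on `date` and `store_nbr`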
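
Whether (date, store_nbr) really identifies each row can be verified before merging. A minimal sketch on a hypothetical miniature table (`tx` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature of the transactions table.
tx = pd.DataFrame({
    'date': ['2013-01-02', '2013-01-02', '2013-01-03'],
    'store_nbr': [1, 2, 1],
    'transactions': [2111, 2358, 1833],
})

key = ['date', 'store_nbr']
# True when no (date, store_nbr) pair occurs more than once.
is_unique_key = not tx.duplicated(subset=key).any()
# Neither column identifies a row on its own in this sample.
date_alone_unique = tx['date'].is_unique
store_alone_unique = tx['store_nbr'].is_unique
print(is_unique_key, date_alone_unique, store_alone_unique)
```

If the pair were not unique, a left merge on it would duplicate rows of `train`.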

In [108]:
# Merge on the date and the store ID
train = pd.merge(train, transaction, on = ['date', 'store_nbr'], how = 'left')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,93.14,
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2,93.14,
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2,93.14,
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2,93.14,
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2,93.14,


In [109]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2918552 entries, 0 to 2918551
Data columns (total 12 columns):
 #   Column        Dtype  
---  ------        -----  
 0   id            int64  
 1   date          object 
 2   store_nbr     int64  
 3   family        object 
 4   sales         float64
 5   onpromotion   float64
 6   city          object 
 7   state         object 
 8   type          object 
 9   cluster       int64  
 10  dcoilwtico    float64
 11  transactions  float64
dtypes: float64(4), int64(3), object(5)
memory usage: 289.5+ MB


- The transaction data cannot be merged into test
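
The reason can be illustrated on toy data. This sketch assumes, as stated earlier in this section, that no transaction rows exist for the test period; the frames `tx` and `test_demo` and their values are hypothetical:

```python
import pandas as pd

# Hypothetical miniatures: transactions stop on 2017-08-15, test starts on 2017-08-16.
tx = pd.DataFrame({
    'date': ['2017-08-14', '2017-08-15'],
    'store_nbr': [1, 1],
    'transactions': [1800, 1693],  # illustrative values
})
test_demo = pd.DataFrame({
    'date': ['2017-08-16', '2017-08-17'],
    'store_nbr': [1, 1],
    'onpromotion': [0, 2],
})

merged = pd.merge(test_demo, tx, on=['date', 'store_nbr'], how='left')
# Every test row is unmatched, so the merged column carries no information.
all_missing = bool(merged['transactions'].isna().all())

# Columns available on both sides once `transactions` is excluded.
features = [c for c in merged.columns if c != 'transactions']
print(all_missing, features)
```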

## **2-4. train + holidays_events**



In [110]:
mmm = train.copy()

### **b) Holidays**

📌 **National holidays (locale = 'National')**  
- Merge on `date`

In [111]:
holidays_events.loc[holidays_events['locale'] == 'National', ['date', 'locale']]

Unnamed: 0,date,locale
0,2013-01-01,National
1,2013-01-05,National
2,2013-01-12,National
3,2013-02-11,National
4,2013-02-12,National
...,...,...
260,2017-05-01,National
262,2017-05-13,National
263,2017-05-14,National
264,2017-05-26,National


In [112]:
## Case 1: national holidays (National)
# Merge on the date

national = holidays_events.loc[holidays_events['locale'] == 'National', ['date', 'locale']]
national = pd.merge(mmm, national, on='date', how='inner')

In [113]:
national = national.rename(columns={'locale': 'holiday'})
national.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,93.14,,National
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2,93.14,,National
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2,93.14,,National
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2,93.14,,National
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2,93.14,,National


📌 **Regional (state-level) holidays (locale = 'Regional')**
- Merge on `date` and `state`
  

In [114]:
## Case 2: regional holidays (Regional)
# Merge on the date and the state

regional = holidays_events.loc[holidays_events['locale'] == 'Regional', :]
regional = regional[['date', 'locale', 'locale_name']]
regional = regional.rename(columns={'locale_name': 'state'})
regional = pd.merge(mmm, regional, on=['date', 'state'], how = 'inner')

In [115]:
regional = regional.rename(columns={'locale': 'holiday'})
regional.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,156722,2013-04-01,12,DAIRY,186.0,0.0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional
1,156723,2013-04-01,12,CLEANING,1059.0,0.0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional
2,156724,2013-04-01,12,CELEBRATION,0.0,0.0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional
3,156725,2013-04-01,12,BREAD/BAKERY,439.0,0.0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional
4,156726,2013-04-01,12,BEVERAGES,762.0,0.0,Latacunga,Cotopaxi,C,15,97.1,1313.0,Regional


📌 **Local (city-level) holidays (locale = 'Local')**
- Merge on `date` and `city`

In [116]:
## Case 3: local holidays (Local)
# Merge on the date and the city

local = holidays_events.loc[holidays_events['locale'] == 'Local', :]
local = local[['date', 'locale', 'locale_name']]
local = local.rename(columns={'locale_name': 'city'})
local = pd.merge(mmm, local, on=['date', 'city'], how = 'inner')

In [117]:
local = local.rename(columns={'locale': 'holiday'})
local.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,104223,2013-03-02,53,SEAFOOD,0.0,0.0,Manta,Manabi,D,13,90.52,,Local
1,104224,2013-03-02,53,SCHOOL AND OFFICE SUPPLIES,0.0,0.0,Manta,Manabi,D,13,90.52,,Local
2,104225,2013-03-02,53,PRODUCE,0.0,0.0,Manta,Manabi,D,13,90.52,,Local
3,104226,2013-03-02,53,BABY CARE,0.0,0.0,Manta,Manabi,D,13,90.52,,Local
4,104227,2013-03-02,53,AUTOMOTIVE,0.0,0.0,Manta,Manabi,D,13,90.52,,Local


In [118]:
# Concatenate the three DataFrames
a = pd.concat([national, regional, local], axis=0, ignore_index=True)
a.isna().sum()

id                  0
date                0
store_nbr           0
family              0
sales               0
onpromotion         0
city                0
state               0
type                0
cluster             0
dcoilwtico          0
transactions    30324
holiday             0
dtype: int64

In [119]:
a

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.00,0.00,Quito,Pichincha,D,13,93.14,,National
1,1,2013-01-01,42,CELEBRATION,0.00,0.00,Cuenca,Azuay,D,2,93.14,,National
2,2,2013-01-01,42,BREAD/BAKERY,0.00,0.00,Cuenca,Azuay,D,2,93.14,,National
3,3,2013-01-01,42,BOOKS,0.00,0.00,Cuenca,Azuay,D,2,93.14,,National
4,4,2013-01-01,42,BEVERAGES,0.00,0.00,Cuenca,Azuay,D,2,93.14,,National
...,...,...,...,...,...,...,...,...,...,...,...,...,...
254090,2918053,2017-08-15,14,PREPARED FOODS,107.00,0.00,Riobamba,Chimborazo,C,7,47.57,1241.00,Local
254091,2918063,2017-08-15,14,SEAFOOD,6.00,0.00,Riobamba,Chimborazo,C,7,47.57,1241.00,Local
254092,2918064,2017-08-15,14,SCHOOL AND OFFICE SUPPLIES,4.00,3.00,Riobamba,Chimborazo,C,7,47.57,1241.00,Local
254093,2918067,2017-08-15,14,BABY CARE,0.00,0.00,Riobamba,Chimborazo,C,7,47.57,1241.00,Local


In [120]:
mmm

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions
0,0,2013-01-01,1,AUTOMOTIVE,0.00,0.00,Quito,Pichincha,D,13,93.14,
1,1,2013-01-01,42,CELEBRATION,0.00,0.00,Cuenca,Azuay,D,2,93.14,
2,2,2013-01-01,42,BREAD/BAKERY,0.00,0.00,Cuenca,Azuay,D,2,93.14,
3,3,2013-01-01,42,BOOKS,0.00,0.00,Cuenca,Azuay,D,2,93.14,
4,4,2013-01-01,42,BEVERAGES,0.00,0.00,Cuenca,Azuay,D,2,93.14,
...,...,...,...,...,...,...,...,...,...,...,...,...
2918547,2918547,2017-08-15,25,PREPARED FOODS,25.47,0.00,Salinas,Santa Elena,D,1,47.57,849.00
2918548,2918548,2017-08-15,25,POULTRY,172.52,0.00,Salinas,Santa Elena,D,1,47.57,849.00
2918549,2918549,2017-08-15,25,PLAYERS AND ELECTRONICS,3.00,0.00,Salinas,Santa Elena,D,1,47.57,849.00
2918550,2918550,2017-08-15,25,PET SUPPLIES,3.00,0.00,Salinas,Santa Elena,D,1,47.57,849.00


In [None]:
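# Merge back into the full data
# Join on the row identifier `id` and bring in only the `holiday` column;
# joining on `date` alone would match each row with every holiday row of that date.
b = pd.merge(mmm, a[['id', 'holiday']], on='id', how='left')

In [None]:
# Fill non-holiday rows in place, keeping `b` as a DataFrame
b['holiday'] = b['holiday'].fillna('None')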
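
The choice of join key matters when the holiday label is attached to the full data. The sketch below uses hypothetical miniature frames (`rows` standing in for `mmm`, `holiday_rows` for `a`) to compare a join on `date` with a join on `id`:

```python
import pandas as pd

# Hypothetical miniature of `mmm` (all rows) and `a` (the rows that fall on a holiday).
rows = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'date': ['2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02'],
    'store_nbr': [1, 2, 1, 2],
})
holiday_rows = rows[rows['date'] == '2013-01-01'].assign(holiday='National')

# Joining on `date` alone pairs every row of a date with every holiday row of that date.
by_date = pd.merge(rows, holiday_rows[['date', 'holiday']], on='date', how='left')

# Joining on the row identifier keeps one output row per input row.
by_id = pd.merge(rows, holiday_rows[['id', 'holiday']], on='id', how='left')
by_id['holiday'] = by_id['holiday'].fillna('None')

print(len(rows), len(by_date), len(by_id))
```

In this sample the join on `date` grows the frame, while the join on `id` preserves the row count and leaves non-holiday rows labelled `'None'`.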

## **2-5. Checking the merge result**

In [None]:
train.info()

In [None]:
train.isnull().sum()

In [None]:
test.head()

In [None]:
test.info()

## **2-6. Saving the merged file**

In [None]:
train.to_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/train_merged.csv', index = False)