# **0. Competition Overview**


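- [Competition link](https://www.kaggle.com/competitions/store-sales-time-series-forecasting)
- The goal is to use time-series forecasting to predict `store sales` for Corporación Favorita, a large Ecuador-based grocery retailer
  - Specifically, build a model that more accurately predicts the `unit sales` of the thousands of products sold at Favorita stores, per product family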


**Evaluation metric**
- Root Mean Squared Logarithmic Error (`RMSLE`)
  - Computed as  
  $ \sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2} $
  - $n$: total number of instances
  - $ \hat{y}_i$: predicted target value for instance $i$
  - $y_i$: actual target value for instance $i$
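As a quick illustration of the metric, here is a minimal sketch that evaluates the formula above with NumPy. The function name `rmsle` and the toy inputs are assumptions made for this example; they are not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # np.log1p(x) computes log(1 + x)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Identical predictions give an error of 0
perfect = rmsle([0.0, 3.0, 10.0], [0.0, 3.0, 10.0])

# A single instance with y = 0 and y_hat = e - 1 gives log(e) - log(1) = 1
off_by_one_log = rmsle([0.0], [np.e - 1.0])
```

Because the error is taken on the log scale, it penalizes relative rather than absolute differences, which matters when product families differ widely in sales volume.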

# **1. Exploring the Data**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format # display floats with two decimal places

import warnings
warnings.filterwarnings('ignore')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


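## **1-1. Training Data**
- Contains dates, store and product information, whether the item was on promotion, and the sales figures
- Additional files contain supplementary information that may be useful when building the model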



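### **a) train.csv**  
- Training data  

**Column descriptions**   
- `id`: identifier for each row
  - about 3 million rows (3,000,888)
- `date`: sales date
  - Period: 2013/01/01 ~ 2017/08/15
- `store_nbr`: unique store identifier
  - 1 ~ 54
- `family`: type of product sold
  - 33 product families
- `sales`: total sales for a given product family at a given store on a given date
  - Fractional values are possible because products can be sold in fractional units  
    (e.g., 1.5 kg of cheese, as opposed to 1 bag of chips)  
  - **target variable**
- `onpromotion`: total number of items in a product family that were on promotion at a given store on a given date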

In [3]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/train.csv')
train.head(10)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0
1,1,2013-01-01,1,BABY CARE,0.0,0
2,2,2013-01-01,1,BEAUTY,0.0,0
3,3,2013-01-01,1,BEVERAGES,0.0,0
4,4,2013-01-01,1,BOOKS,0.0,0
5,5,2013-01-01,1,BREAD/BAKERY,0.0,0
6,6,2013-01-01,1,CELEBRATION,0.0,0
7,7,2013-01-01,1,CLEANING,0.0,0
8,8,2013-01-01,1,DAIRY,0.0,0
9,9,2013-01-01,1,DELI,0.0,0


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000888 entries, 0 to 3000887
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 137.4+ MB


- 1,684 (days) * 54 (stores) * 33 (product families) = 3,000,888

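#### **📌 Handling Missing Dates**

**Checking the dates**
- 2013 (365), 2014 (365), 2015 (365), 2016 (366), 2017 through 08-15 (227) => 1,688 days, whereas the row count of `train` corresponds to only 1,684 days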
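The day count can be double-checked with a small, self-contained sketch that builds the full calendar and counts days per year. The variable names are chosen for this example only.

```python
import pandas as pd

# Every calendar day in the training period, both endpoints included
calendar = pd.date_range(start="2013-01-01", end="2017-08-15", freq="D")

# Number of days falling in each year
days_per_year = pd.Series(calendar.year).value_counts().sort_index()
total_days = len(calendar)
```

`total_days` comes out to 1,688, so the 1,684 days implied by the row count mean that four calendar dates are absent from `train`.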

In [5]:
print(train['date'].unique())
print(len(train['date'].unique()))

['2013-01-01' '2013-01-02' '2013-01-03' ... '2017-08-13' '2017-08-14'
 '2017-08-15']
1684


In [6]:
### Missing dates

start_date = '2013-01-01'
end_date = '2017-08-15'

# Build the full calendar for the given date range
date_range = pd.date_range(start=start_date, end=end_date)

# Convert the train['date'] column to datetime
dates = pd.to_datetime(train['date'])

# Find calendar dates that do not appear in train (compare datetime with datetime)
missing_dates = date_range[~date_range.isin(dates)].astype(str)
print("Missing dates:", missing_dates)

Missing dates: Index(['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-25'], dtype='object')


In [7]:
# Unique stores

store_list = train['store_nbr'].unique()
store_list

array([ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2, 20, 21, 22, 23, 24,
       25, 26, 27, 28, 29,  3, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,  4,
       40, 41, 42, 43, 44, 45, 46, 47, 48, 49,  5, 50, 51, 52, 53, 54,  6,
        7,  8,  9])

In [8]:
# Unique product families

family_list = train['family'].unique()
family_list

array(['AUTOMOTIVE', 'BABY CARE', 'BEAUTY', 'BEVERAGES', 'BOOKS',
       'BREAD/BAKERY', 'CELEBRATION', 'CLEANING', 'DAIRY', 'DELI', 'EGGS',
       'FROZEN FOODS', 'GROCERY I', 'GROCERY II', 'HARDWARE',
       'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES',
       'HOME CARE', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE',
       'LIQUOR,WINE,BEER', 'MAGAZINES', 'MEATS', 'PERSONAL CARE',
       'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'POULTRY',
       'PREPARED FOODS', 'PRODUCE', 'SCHOOL AND OFFICE SUPPLIES',
       'SEAFOOD'], dtype=object)

In [9]:
### Fill in rows for the missing dates
## (4 dates) * (54 stores) * (33 product families) => 7,128 rows need to be added

# Collect the rows in a list first (faster than appending to a DataFrame row by row)
new_rows = []
for date in missing_dates:
  for store in store_list:
    for family in family_list:
      new_rows.append({'id':0, 'date':date, 'store_nbr':store, 'family':family, 'sales':np.nan, 'onpromotion':np.nan})

# Convert to a DataFrame
new_data = pd.DataFrame(new_rows)

# Concatenate the new rows with the existing train DataFrame
train = pd.concat([train, new_data]).sort_values('date').reset_index(drop = True)

# Reassign id
train['id'] = train.index

In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3008016 entries, 0 to 3008015
Data columns (total 6 columns):
 #   Column       Dtype  
---  ------       -----  
 0   id           int64  
 1   date         object 
 2   store_nbr    int64  
 3   family       object 
 4   sales        float64
 5   onpromotion  float64
dtypes: float64(2), int64(2), object(2)
memory usage: 137.7+ MB


- The missing dates were filled in exactly: 3,000,888 + 7,128 = 3,008,016 rows
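The nested loops used to build the filler rows can also be written with `itertools.product`. The sketch below is an alternative on toy inputs, not the notebook's own code; `toy_dates`, `toy_stores` and `toy_families` are hypothetical stand-ins for `missing_dates`, `store_list` and `family_list`.

```python
from itertools import product

import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values) for the real lists
toy_dates = ["2013-12-25", "2014-12-25"]
toy_stores = [1, 2, 3]
toy_families = ["AUTOMOTIVE", "BEAUTY"]

# One row per (date, store, family) combination
filler = pd.DataFrame(
    list(product(toy_dates, toy_stores, toy_families)),
    columns=["date", "store_nbr", "family"],
)

# The target and promotion count are unknown for these dates
filler["sales"] = np.nan
filler["onpromotion"] = np.nan
```

With the real inputs the same construction would yield 4 * 54 * 33 = 7,128 rows.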

In [11]:
# Check the rows for one of the filled-in dates

train.loc[train['date'] == '2013-12-25', :]

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
637956,637956,2013-12-25,41,PRODUCE,,
637957,637957,2013-12-25,42,BREAD/BAKERY,,
637958,637958,2013-12-25,42,BOOKS,,
637959,637959,2013-12-25,42,BEVERAGES,,
637960,637960,2013-12-25,42,BEAUTY,,
...,...,...,...,...,...,...
639733,639733,2013-12-25,25,PLAYERS AND ELECTRONICS,,
639734,639734,2013-12-25,25,PET SUPPLIES,,
639735,639735,2013-12-25,26,BOOKS,,
639736,639736,2013-12-25,1,BABY CARE,,


> Missing dates filled in! (`sales` and `onpromotion` are still NaN for these rows.)

In [12]:
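# Remove spaces inside the date strings
# (Series.replace only matches whole values, so use the .str accessor)

train['date'] = train['date'].str.replace(" ", "", regex=False)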
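
A note on two similarly named pandas methods: `Series.replace(" ", "")` substitutes only cells whose entire value is a single space, whereas `Series.str.replace(" ", "")` removes spaces inside each string. A minimal sketch on a toy Series (the values are made up for illustration):

```python
import pandas as pd

s = pd.Series(["2013-01-01 ", " ", "2013-01-02"])

# Whole-value replacement: only the cell that equals " " changes
whole_value = s.replace(" ", "")

# Substring replacement: every space inside each string is removed
substring = s.str.replace(" ", "", regex=False)
```

Only the second form strips the trailing space from `"2013-01-01 "`, which is what matters when dates are later used as merge keys.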

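### **b) stores.csv**
- Store metadata  

**Column descriptions**
  - `store_nbr`: unique store identifier
  - `city`: city where the store is located
    - 22 cities
  - `state`: state where the store is located
    - 16 states
  - `type`: store type
    - A, B, C, D, E
  - `cluster`: grouping of similar stores
    - 1 ~ 17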

In [13]:
stores = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/stores.csv')
stores.head(10)

Unnamed: 0,store_nbr,city,state,type,cluster
0,1,Quito,Pichincha,D,13
1,2,Quito,Pichincha,D,13
2,3,Quito,Pichincha,D,8
3,4,Quito,Pichincha,D,9
4,5,Santo Domingo,Santo Domingo de los Tsachilas,D,4
5,6,Quito,Pichincha,D,13
6,7,Quito,Pichincha,D,8
7,8,Quito,Pichincha,D,8
8,9,Quito,Pichincha,B,6
9,10,Quito,Pichincha,C,15


In [14]:
stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB


In [15]:
print(stores['city'].unique())
print(len(stores['city'].unique()))

['Quito' 'Santo Domingo' 'Cayambe' 'Latacunga' 'Riobamba' 'Ibarra'
 'Guaranda' 'Puyo' 'Ambato' 'Guayaquil' 'Salinas' 'Daule' 'Babahoyo'
 'Quevedo' 'Playas' 'Libertad' 'Cuenca' 'Loja' 'Machala' 'Esmeraldas'
 'Manta' 'El Carmen']
22


In [16]:
print(stores['state'].unique())
print(len(stores['state'].unique()))

['Pichincha' 'Santo Domingo de los Tsachilas' 'Cotopaxi' 'Chimborazo'
 'Imbabura' 'Bolivar' 'Pastaza' 'Tungurahua' 'Guayas' 'Santa Elena'
 'Los Rios' 'Azuay' 'Loja' 'El Oro' 'Esmeraldas' 'Manabi']
16


In [17]:
print(stores['type'].unique())
print(len(stores['type'].unique()))

['D' 'B' 'C' 'E' 'A']
5


In [18]:
print(stores['cluster'].unique())
print(len(stores['cluster'].unique()))

[13  8  9  4  6 15  7  3 12 16  1 10  2  5 11 14 17]
17


In [19]:
result = stores.groupby('state')['city'].unique()
print(result)

state
Azuay                                                         [Cuenca]
Bolivar                                                     [Guaranda]
Chimborazo                                                  [Riobamba]
Cotopaxi                                                   [Latacunga]
El Oro                                                       [Machala]
Esmeraldas                                                [Esmeraldas]
Guayas                            [Guayaquil, Daule, Playas, Libertad]
Imbabura                                                      [Ibarra]
Loja                                                            [Loja]
Los Rios                                           [Babahoyo, Quevedo]
Manabi                                              [Manta, El Carmen]
Pastaza                                                         [Puyo]
Pichincha                                             [Quito, Cayambe]
Santa Elena                                                  [Salinas]


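### **c) oil.csv**
- Daily oil price
  - Includes values for both the training and test periods
- Ecuador is an oil-dependent country
  - Its economy is highly vulnerable to oil price fluctuations
  - This lets us examine which product families are affected by oil prices

**Column descriptions**  
- `date`: observation date
  - Period: 2013/01/01 ~ 2017/08/31
- `dcoilwtico`: oil price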

In [20]:
oil = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/oil.csv')
oil.head(10)

Unnamed: 0,date,dcoilwtico
0,2013-01-01,
1,2013-01-02,93.14
2,2013-01-03,92.97
3,2013-01-04,93.12
4,2013-01-07,93.2
5,2013-01-08,93.21
6,2013-01-09,93.08
7,2013-01-10,93.81
8,2013-01-11,93.6
9,2013-01-14,94.27


In [21]:
oil.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1218 entries, 0 to 1217
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        1218 non-null   object 
 1   dcoilwtico  1175 non-null   float64
dtypes: float64(1), object(1)
memory usage: 19.2+ KB
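
The `info()` output shows 1,218 rows, of which 1,175 have a price, so the series lacks some calendar days and also contains missing values. One possible way to obtain a value for every calendar day is to reindex onto a daily calendar and interpolate. This is a sketch under that assumption, run on a few toy rows, and is not necessarily how the gaps are handled later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like oil.csv (weekend dates are absent, first price is missing)
toy_oil = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"],
    "dcoilwtico": [np.nan, 93.14, 92.97, 93.12, 93.20],
})
toy_oil["date"] = pd.to_datetime(toy_oil["date"])

# Reindex onto every calendar day; absent days become NaN rows
daily = toy_oil.set_index("date").reindex(
    pd.date_range("2013-01-01", "2013-01-07", freq="D")
)

# Linear interpolation fills interior gaps; bfill covers the leading NaN
daily["dcoilwtico"] = daily["dcoilwtico"].interpolate(method="linear").bfill()
```

Whether interpolation or a simple forward fill is more appropriate depends on how the price is used as a feature, so treat the choice here as illustrative.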


In [22]:
# Remove spaces inside the date strings
oil['date'] = oil['date'].str.replace(" ", "", regex=False)

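### **d) transactions.csv**
- Transaction data: the number of people who visited a store in a day, or the number of invoices (receipts) issued in a day
  - Highly correlated with the `sales` column of the training data
  - Helps in understanding each store's sales pattern
- `date`: date
  - 2013/01/01 ~ 2017/08/15
- `store_nbr`: unique store number
- `transactions`: number of transactions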
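One way to check the relationship between transactions and sales is to aggregate `sales` to the (date, store) level, join it with the transaction counts, and compute a correlation. The sketch below does this on made-up toy numbers, so the resulting value says nothing about the real data.

```python
import pandas as pd

# Toy rows shaped like train.csv (values are invented for illustration)
toy_train = pd.DataFrame({
    "date": ["2013-01-02"] * 4 + ["2013-01-03"] * 4,
    "store_nbr": [1, 1, 2, 2, 1, 1, 2, 2],
    "family": ["BEAUTY", "DAIRY"] * 4,
    "sales": [2.0, 100.0, 5.0, 300.0, 1.0, 80.0, 4.0, 260.0],
})

# Toy rows shaped like transactions.csv
toy_transactions = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-02", "2013-01-03", "2013-01-03"],
    "store_nbr": [1, 2, 1, 2],
    "transactions": [2111, 2358, 1833, 2033],
})

# Total sales per store per day, summed over product families
daily_sales = toy_train.groupby(["date", "store_nbr"], as_index=False)["sales"].sum()

# Keep only (date, store) pairs present in both tables
joined = daily_sales.merge(toy_transactions, on=["date", "store_nbr"], how="inner")

# Pearson correlation between daily store sales and transaction counts
corr = joined["sales"].corr(joined["transactions"])
```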

In [23]:
transaction = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/transactions.csv')
transaction.head()

Unnamed: 0,date,store_nbr,transactions
0,2013-01-01,25,770
1,2013-01-02,1,2111
2,2013-01-02,2,2358
3,2013-01-02,3,3487
4,2013-01-02,4,1922


In [24]:
transaction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83488 entries, 0 to 83487
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   date          83488 non-null  object
 1   store_nbr     83488 non-null  int64 
 2   transactions  83488 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.9+ MB


In [25]:
# Remove spaces inside the date strings
transaction['date'] = transaction['date'].str.replace(" ", "", regex=False)

### **e) holidays_events.csv**
- Holiday and event metadata
- `date`: date
  - 2012/03/02 ~ 2017/12/26
- `type`: kind of holiday
  - Holiday, Event, Additional, Transfer, Bridge, Work Day
- `locale`: scope
  - National, Local, Regional
- `locale_name`: name of the region
  - 24 regions
- `description`: description
- `transferred`: whether the holiday was moved to another date
  - True, False


In [26]:
holidays_events = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/holidays_events.csv')
holidays_events.head(10)

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
5,2012-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
6,2012-06-23,Holiday,Local,Guaranda,Cantonizacion de Guaranda,False
7,2012-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
8,2012-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
9,2012-06-25,Holiday,Local,Machala,Fundacion de Machala,False


In [27]:
# Drop rows before 2013-01-01, which are not needed
holidays_events = holidays_events.loc[holidays_events['date'] > '2012-12-31'].reset_index(drop=True)

# Drop rows after the test period, which ends on 2017-08-31 (note that the cutoff used here is 2017-09-09)
holidays_events = holidays_events.loc[holidays_events['date'] < '2017-09-09'].reset_index(drop=True)

In [28]:
### Rename a column
# The name `type` collides with the `type` column in stores.csv

holidays_events = holidays_events.rename(columns={'type': 'category'})

In [29]:
holidays_events['category'].unique()

array(['Holiday', 'Work Day', 'Additional', 'Event', 'Transfer', 'Bridge'],
      dtype=object)

In [30]:
holidays_events['locale'].unique()

array(['National', 'Local', 'Regional'], dtype=object)

#### **1️⃣ Handling category**

**Transfer**

In [31]:
# Drop rows where transferred == True

holidays_events = holidays_events.loc[holidays_events['transferred'] != True].reset_index(drop=True) # keep only False

In [32]:
holidays_events['transferred'].unique() # removed as intended

array([False])

**Bridge & Work Day**

- Bridge: keep
- Work Day: keep for now and handle it after the data has been merged

#### **2️⃣ Handling locale**

In [33]:
## National holidays
# locale_name == 'Ecuador'

print(holidays_events.loc[holidays_events['locale'] == 'National', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'National', 'locale_name'].unique()))

['Ecuador']
1


In [34]:
## Regional (state-level) holidays
# locale_name == state name

print(holidays_events.loc[holidays_events['locale'] == 'Regional', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'Regional', 'locale_name'].unique()))

['Cotopaxi' 'Imbabura' 'Santo Domingo de los Tsachilas' 'Santa Elena']
4


In [35]:
## Local (city-level) holidays
# locale_name == city name

print(holidays_events.loc[holidays_events['locale'] == 'Local', 'locale_name'].unique())
print(len(holidays_events.loc[holidays_events['locale'] == 'Local', 'locale_name'].unique()))

['Manta' 'Cuenca' 'Libertad' 'Riobamba' 'Puyo' 'Guaranda' 'Machala'
 'Latacunga' 'El Carmen' 'Santo Domingo' 'Cayambe' 'Guayaquil'
 'Esmeraldas' 'Ambato' 'Ibarra' 'Quevedo' 'Quito' 'Loja' 'Salinas']
19


In [36]:
## Check for duplicated dates

holidays_events[holidays_events.duplicated(subset = 'date', keep = False)]['date'].unique()

array(['2013-05-12', '2013-06-25', '2013-07-03', '2013-12-22',
       '2014-06-25', '2014-07-03', '2014-12-22', '2014-12-26',
       '2015-06-25', '2015-07-03', '2015-12-22', '2016-04-21',
       '2016-05-01', '2016-05-07', '2016-05-08', '2016-05-12',
       '2016-06-25', '2016-07-03', '2016-07-24', '2016-11-12',
       '2016-12-22', '2017-04-14', '2017-06-25', '2017-07-03'],
      dtype=object)

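##### Let's remove as many duplicated dates as possible
- The cells below are tedious manual work and were collapsed in the notebook, so feel free to skip them.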
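Most of the manual decisions in this part follow a single rule: when a date has both a National row and Local or Regional rows, only the National row is kept, and otherwise all rows are kept. A rough sketch of that rule on toy rows follows. It does not reproduce every choice made by hand (for example, dates with two National rows), and the variable names are assumptions for this example.

```python
import pandas as pd

# Toy rows shaped like holidays_events
toy_holidays = pd.DataFrame({
    "date": ["2013-05-12", "2013-05-12", "2013-07-03", "2013-07-03"],
    "locale": ["Local", "National", "Local", "Local"],
    "locale_name": ["Puyo", "Ecuador", "El Carmen", "Santo Domingo"],
})

is_national = toy_holidays["locale"].eq("National")

# True for every row whose date has at least one National entry
has_national = is_national.groupby(toy_holidays["date"]).transform("any")

# Drop non-National rows only on dates that also have a National row
keep = ~has_national | is_national
deduped = toy_holidays.loc[keep].reset_index(drop=True)
```

A rule like this is easier to audit than index-based `drop` calls, which silently depend on the row order at the time each cell is run.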

**2013-05-12**


In [37]:
holidays_events.loc[holidays_events['date'] == '2013-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
13,2013-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
14,2013-05-12,Event,National,Ecuador,Dia de la Madre,False


In [38]:
# Keep only the National row

holidays_events = holidays_events.drop(13).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2013-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
13,2013-05-12,Event,National,Ecuador,Dia de la Madre,False


**2013-06-25**

In [39]:
holidays_events.loc[holidays_events['date'] == '2013-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
16,2013-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
17,2013-06-25,Holiday,Local,Machala,Fundacion de Machala,False
18,2013-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- The three rows cover different regions and do not overlap -> keep all

**2013-07-03**

In [40]:
holidays_events.loc[holidays_events['date'] == '2013-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
19,2013-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
20,2013-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2013-12-22**

In [41]:
holidays_events.loc[holidays_events['date'] == '2013-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
42,2013-12-22,Additional,National,Ecuador,Navidad-3,False
43,2013-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [42]:
# Keep only the National row

holidays_events = holidays_events.drop(43).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2013-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
42,2013-12-22,Additional,National,Ecuador,Navidad-3,False


**2014-06-25**

In [43]:
holidays_events.loc[holidays_events['date'] == '2014-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
66,2014-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
67,2014-06-25,Holiday,Local,Machala,Fundacion de Machala,False
68,2014-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
69,2014-06-25,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Francia,False


In [44]:
# Keep only the National row

holidays_events = holidays_events.drop([66, 67, 68]).reset_index(drop=True)

holidays_events.loc[holidays_events['date'] == '2014-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
66,2014-06-25,Event,National,Ecuador,Mundial de futbol Brasil: Ecuador-Francia,False


**2014-07-03**

In [45]:
holidays_events.loc[holidays_events['date'] == '2014-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
71,2014-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
72,2014-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2014-12-22**

In [46]:
holidays_events.loc[holidays_events['date'] == '2014-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
103,2014-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False
104,2014-12-22,Additional,National,Ecuador,Navidad-3,False


In [47]:
# Keep only the National row

holidays_events = holidays_events.drop(103).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2014-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
103,2014-12-22,Additional,National,Ecuador,Navidad-3,False


**2014-12-26**

In [48]:
holidays_events.loc[holidays_events['date'] == '2014-12-26',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
107,2014-12-26,Bridge,National,Ecuador,Puente Navidad,False
108,2014-12-26,Additional,National,Ecuador,Navidad+1,False


- Both are national days anyway, so there is no reason to distinguish them
  - Keep only the top row

In [49]:
holidays_events = holidays_events.drop(108).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2014-12-26',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
107,2014-12-26,Bridge,National,Ecuador,Puente Navidad,False


**2015-06-25**

In [50]:
holidays_events.loc[holidays_events['date'] == '2015-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
126,2015-06-25,Holiday,Local,Machala,Fundacion de Machala,False
127,2015-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
128,2015-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- Keep all

**2015-07-03**

In [51]:
holidays_events.loc[holidays_events['date'] == '2015-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
129,2015-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
130,2015-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep all

**2015-12-22**

In [52]:
holidays_events.loc[holidays_events['date'] == '2015-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
154,2015-12-22,Additional,National,Ecuador,Navidad-3,False
155,2015-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [53]:
# Keep only the National row

holidays_events = holidays_events.drop(155).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2015-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
154,2015-12-22,Additional,National,Ecuador,Navidad-3,False


**2016-04-21**

In [54]:
holidays_events.loc[holidays_events['date'] == '2016-04-21',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
173,2016-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
174,2016-04-21,Event,National,Ecuador,Terremoto Manabi+5,False


In [55]:
# Keep only the National row

holidays_events = holidays_events.drop(173).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-04-21',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
173,2016-04-21,Event,National,Ecuador,Terremoto Manabi+5,False


**2016-05-01**

In [56]:
holidays_events.loc[holidays_events['date'] == '2016-05-01',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
183,2016-05-01,Holiday,National,Ecuador,Dia del Trabajo,False
184,2016-05-01,Event,National,Ecuador,Terremoto Manabi+15,False


In [57]:
# Keep only the top row

holidays_events = holidays_events.drop(184).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-05-01',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
183,2016-05-01,Holiday,National,Ecuador,Dia del Trabajo,False


**2016-05-07**

In [58]:
holidays_events.loc[holidays_events['date'] == '2016-05-07',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
189,2016-05-07,Additional,National,Ecuador,Dia de la Madre-1,False
190,2016-05-07,Event,National,Ecuador,Terremoto Manabi+21,False


In [59]:
# Keep only the top row

holidays_events = holidays_events.drop(190).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-05-07',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
189,2016-05-07,Additional,National,Ecuador,Dia de la Madre-1,False


**2016-05-08**

In [60]:
holidays_events.loc[holidays_events['date'] == '2016-05-08',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
190,2016-05-08,Event,National,Ecuador,Terremoto Manabi+22,False
191,2016-05-08,Event,National,Ecuador,Dia de la Madre,False


In [61]:
# Keep only the top row

holidays_events = holidays_events.drop(191).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-05-08',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
190,2016-05-08,Event,National,Ecuador,Terremoto Manabi+22,False


**2016-05-12**

In [62]:
holidays_events.loc[holidays_events['date'] == '2016-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
194,2016-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
195,2016-05-12,Event,National,Ecuador,Terremoto Manabi+26,False


In [63]:
# Keep only the National row

holidays_events = holidays_events.drop(194).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-05-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
194,2016-05-12,Event,National,Ecuador,Terremoto Manabi+26,False


**2016-06-25**

In [64]:
holidays_events.loc[holidays_events['date'] == '2016-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
201,2016-06-25,Holiday,Local,Machala,Fundacion de Machala,False
202,2016-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
203,2016-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


- Keep all three

**2016-07-03**

In [65]:
holidays_events.loc[holidays_events['date'] == '2016-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
204,2016-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
205,2016-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

**2016-07-24**

In [66]:
holidays_events.loc[holidays_events['date'] == '2016-07-24',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
207,2016-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
208,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False


In [67]:
# Keep only the bottom row

holidays_events = holidays_events.drop(207).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-07-24',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
207,2016-07-24,Transfer,Local,Guayaquil,Traslado Fundacion de Guayaquil,False


**2016-11-12**

In [68]:
holidays_events.loc[holidays_events['date'] == '2016-11-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
222,2016-11-12,Holiday,Local,Ambato,Independencia de Ambato,False
223,2016-11-12,Work Day,National,Ecuador,Recupero Puente Dia de Difuntos,False


In [69]:
# Keep only the National row

holidays_events = holidays_events.drop(222).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-11-12',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
222,2016-11-12,Work Day,National,Ecuador,Recupero Puente Dia de Difuntos,False


**2016-12-22**

In [70]:
holidays_events.loc[holidays_events['date'] == '2016-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
229,2016-12-22,Additional,National,Ecuador,Navidad-3,False
230,2016-12-22,Holiday,Local,Salinas,Cantonizacion de Salinas,False


In [71]:
# Keep only the National row

holidays_events = holidays_events.drop(230).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2016-12-22',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
229,2016-12-22,Additional,National,Ecuador,Navidad-3,False


**2017-04-14**

In [72]:
holidays_events.loc[holidays_events['date'] == '2017-04-14',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
241,2017-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
242,2017-04-14,Holiday,National,Ecuador,Viernes Santo,False


In [73]:
# Keep only the National row

holidays_events = holidays_events.drop(241).reset_index(drop=True)
holidays_events.loc[holidays_events['date'] == '2017-04-14',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
241,2017-04-14,Holiday,National,Ecuador,Viernes Santo,False


**2017-06-25**

In [74]:
holidays_events.loc[holidays_events['date'] == '2017-06-25',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
249,2017-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
250,2017-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
251,2017-06-25,Holiday,Local,Machala,Fundacion de Machala,False


- Keep all three

**2017-07-03**

In [75]:
holidays_events.loc[holidays_events['date'] == '2017-07-03',:]

Unnamed: 0,date,category,locale,locale_name,description,transferred
252,2017-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
253,2017-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False


- Keep both

#### **3️⃣ Variable Selection**
- `transferred` and `description` are no longer needed, so drop them

In [76]:
## Re-check duplicated dates

holidays_events[holidays_events.duplicated(subset = 'date', keep = False)]

Unnamed: 0,date,category,locale,locale_name,description,transferred
16,2013-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
17,2013-06-25,Holiday,Local,Machala,Fundacion de Machala,False
18,2013-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
19,2013-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
20,2013-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False
71,2014-07-03,Holiday,Local,El Carmen,Cantonizacion de El Carmen,False
72,2014-07-03,Holiday,Local,Santo Domingo,Fundacion de Santo Domingo,False
126,2015-06-25,Holiday,Local,Machala,Fundacion de Machala,False
127,2015-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
128,2015-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False


In [77]:
# The transferred and description columns are no longer needed, so drop them

holidays_events.drop(columns = ['transferred', 'description'], inplace=True)

In [78]:
### Check the final holidays_events

holidays_events.head()

Unnamed: 0,date,category,locale,locale_name
0,2013-01-01,Holiday,National,Ecuador
1,2013-01-05,Work Day,National,Ecuador
2,2013-01-12,Work Day,National,Ecuador
3,2013-02-11,Holiday,National,Ecuador
4,2013-02-12,Holiday,National,Ecuador


In [79]:
holidays_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261 entries, 0 to 260
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   date         261 non-null    object
 1   category     261 non-null    object
 2   locale       261 non-null    object
 3   locale_name  261 non-null    object
dtypes: object(4)
memory usage: 8.3+ KB


In [80]:
# Remove spaces inside the date strings

holidays_events['date'] = holidays_events['date'].str.replace(" ", "", regex=False)

## **1-2. Test Data**
- Period: 2017/08/16 ~ 2017/08/31

In [81]:
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/test.csv')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion
0,3000888,2017-08-16,1,AUTOMOTIVE,0
1,3000889,2017-08-16,1,BABY CARE,0
2,3000890,2017-08-16,1,BEAUTY,2
3,3000891,2017-08-16,1,BEVERAGES,20
4,3000892,2017-08-16,1,BOOKS,0


In [82]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           28512 non-null  int64 
 1   date         28512 non-null  object
 2   store_nbr    28512 non-null  int64 
 3   family       28512 non-null  object
 4   onpromotion  28512 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ MB


- 16 (days) * 54 (stores) * 33 (product families) = 28,512 rows

## **1-3. Submission File**

In [83]:
sample_submission = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/sample_submission.csv')
sample_submission.head()

Unnamed: 0,id,sales
0,3000888,0.0
1,3000889,0.0
2,3000890,0.0
3,3000891,0.0
4,3000892,0.0


In [84]:
sample_submission.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id      28512 non-null  int64  
 1   sales   28512 non-null  float64
dtypes: float64(1), int64(1)
memory usage: 445.6 KB


# **2. Merging the Data**
- Every row of `train` is kept in the result, and information from the other DataFrame is attached only where the merge key matches (a left join)
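The left-join behaviour described above can be sketched on toy frames as follows. The `validate` and `indicator` arguments are optional safeguards added for this example and are not used in the notebook's own merges; store 99 is a made-up key with no match.

```python
import pandas as pd

toy_left = pd.DataFrame({"store_nbr": [1, 1, 2, 99], "sales": [0.0, 2.0, 5.0, 1.0]})
toy_right = pd.DataFrame({"store_nbr": [1, 2], "city": ["Quito", "Quito"]})

merged = pd.merge(
    toy_left, toy_right,
    on="store_nbr", how="left",
    validate="many_to_one",  # raises if the right-hand key is not unique
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Rows of the left frame that found no partner on the right
unmatched = int((merged["_merge"] == "left_only").sum())
```

The row count of `merged` equals that of `toy_left`, and the unmatched row simply gets NaN in the columns coming from the right frame.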

## **2-1. train + stores**
- Merge on `store_nbr`

In [85]:
## train
# Merge on the unique store number

train = pd.merge(train, stores, on = 'store_nbr', how = 'left') # keep every row of train
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2


In [86]:
train.shape

(3008016, 10)

In [87]:
## test

test = pd.merge(test, stores, on = 'store_nbr', how = 'left')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type,cluster
0,3000888,2017-08-16,1,AUTOMOTIVE,0,Quito,Pichincha,D,13
1,3000889,2017-08-16,1,BABY CARE,0,Quito,Pichincha,D,13
2,3000890,2017-08-16,1,BEAUTY,2,Quito,Pichincha,D,13
3,3000891,2017-08-16,1,BEVERAGES,20,Quito,Pichincha,D,13
4,3000892,2017-08-16,1,BOOKS,0,Quito,Pichincha,D,13


In [88]:
test.shape

(28512, 9)

## **2-2. train + oil**
- Merge on the `date` column

In [89]:
## train
# Merge on date

train = pd.merge(train, oil, on = 'date', how = 'left')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2,
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2,
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2,
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2,


In [90]:
train.shape

(3008016, 11)

In [91]:
## test

test = pd.merge(test, oil, on = 'date', how = 'left')
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type,cluster,dcoilwtico
0,3000888,2017-08-16,1,AUTOMOTIVE,0,Quito,Pichincha,D,13,46.8
1,3000889,2017-08-16,1,BABY CARE,0,Quito,Pichincha,D,13,46.8
2,3000890,2017-08-16,1,BEAUTY,2,Quito,Pichincha,D,13,46.8
3,3000891,2017-08-16,1,BEVERAGES,20,Quito,Pichincha,D,13,46.8
4,3000892,2017-08-16,1,BOOKS,0,Quito,Pichincha,D,13,46.8


In [92]:
test.shape

(28512, 10)

## **2-3. train + transaction**
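- Usable during EDA
- Probably hard to use for model training
  - No transaction data exists for the period covered by the test data
  - The column has to be dropped before training
- The primary key of transaction is (`date`, `store_nbr`)
  - Only the pair of both columns identifies each row uniquely
  - Merge on `date` and `store_nbr`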
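
Both points above (the composite key, and the lack of transaction data for the test period) can be checked directly. The following is a minimal sketch on made-up data; `toy_transaction` and `toy_test_dates` are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Toy stand-in for the `transaction` table (values are made up).
toy_transaction = pd.DataFrame({
    "date": ["2017-08-14", "2017-08-14", "2017-08-15", "2017-08-15"],
    "store_nbr": [1, 2, 1, 2],
    "transactions": [1700, 2100, 1650, 2050],
})
# Toy stand-in for the dates that appear in `test`.
toy_test_dates = pd.Series(["2017-08-16", "2017-08-17"])

key = ["date", "store_nbr"]

# 1) `date` alone repeats, but the (date, store_nbr) pair does not.
date_alone_unique = not toy_transaction["date"].duplicated().any()
pair_unique = not toy_transaction.duplicated(subset=key).any()

# 2) No test date appears in the transaction table, so a left merge onto
#    test would only produce NaN in `transactions`.
overlap = bool(toy_test_dates.isin(toy_transaction["date"]).any())

print(date_alone_unique, pair_unique, overlap)
```

On the real data, the same two checks would use `transaction` and `test['date']` in place of the toy objects.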

In [93]:
## train
# Merge on the date and the store ID

train = pd.merge(train, transaction, on = ['date', 'store_nbr'], how = 'left')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,,
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2,,
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2,,
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2,,
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2,,


In [94]:
train.shape

(3008016, 12)

- The transaction data cannot be merged into test

## **2-4. train + holidays_events**
- See the spreadsheet for the holiday processing results.  
[Link](https://docs.google.com/spreadsheets/d/1L4qLNSvs9FJvJ4VZZFTFxon5J3NeLuEl/edit?usp=drive_link&ouid=104031361358442970604&rtpof=true&sd=true)



In [95]:
### Create a new variable
## Create a holiday variable that stores the holiday type
# Initialize it with NaN, then cast it to object dtype

train['holiday'] = np.nan
train['holiday'] = train['holiday'].astype('object')
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,,,
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2,,,
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2,,,
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2,,,
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2,,,


In [96]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3008016 entries, 0 to 3008015
Data columns (total 13 columns):
 #   Column        Dtype  
---  ------        -----  
 0   id            int64  
 1   date          object 
 2   store_nbr     int64  
 3   family        object 
 4   sales         float64
 5   onpromotion   float64
 6   city          object 
 7   state         object 
 8   type          object 
 9   cluster       int64  
 10  dcoilwtico    float64
 11  transactions  float64
 12  holiday       object 
dtypes: float64(4), int64(3), object(6)
memory usage: 321.3+ MB


### **1️⃣ National holidays (locale = 'National')**  
- Merge on `date`  

In [97]:
## Case 1: national holidays (National)
# Merge on the date

national = holidays_events.loc[holidays_events['locale'] == 'National', ['date', 'locale']]
national['date'].nunique()

140

In [98]:
# Left-merge the 'national' DataFrame into 'train' on 'date'
merged = pd.merge(train, national, on='date', how='left')

# Where 'locale' is 'National' in 'merged', set 'holiday' to 'National'
merged.loc[merged['locale'] == 'National', 'holiday'] = 'National'

# Copy the updated 'holiday' column back into 'train'
train['holiday'] = merged['holiday']

In [99]:
train.loc[train['holiday'] == 'National','date'].nunique()

140
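
The assignment `train['holiday'] = merged['holiday']` relies on `merged` having the same rows, in the same order, as `train`. That holds only when each key appears at most once in the right-hand table. The matching counts above (140 and 140) are consistent with that being the case here. As a hedged sketch of how the assumption could be enforced rather than checked afterwards, the toy example below uses the `validate` argument of `pd.merge`; `toy_train` and `toy_national` are made-up stand-ins.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01", "2013-01-02"],
    "store_nbr": [1, 2, 1],
})
# Two holiday records on the same date (e.g. a holiday plus an event).
toy_national = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01"],
    "locale": ["National", "National"],
})

# Without a guard, the duplicated key silently inflates the row count, so an
# index-aligned assignment back into toy_train would no longer line up.
inflated = pd.merge(toy_train, toy_national, on="date", how="left")
print(len(toy_train), len(inflated))

# validate="many_to_one" makes pandas raise MergeError when the
# right-hand keys are not unique.
try:
    pd.merge(toy_train, toy_national, on="date", how="left",
             validate="many_to_one")
    guard_raised = False
except pd.errors.MergeError:
    guard_raised = True

# De-duplicating the right-hand table first keeps the merge row-for-row.
safe = pd.merge(toy_train, toy_national.drop_duplicates(), on="date",
                how="left", validate="many_to_one")
```

The same guard would apply to the Regional and Local merges, with `on=['date', 'state']` and `on=['date', 'city']` respectively.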

### **2️⃣ Regional (state) holidays (locale = 'Regional')**
- Merge on `date` and `state`

In [100]:
## Case 2: regional holidays (Regional)
# Merge on the date and the state

regional = holidays_events.loc[holidays_events['locale'] == 'Regional', ['date', 'locale', 'locale_name']]
regional = regional.rename(columns={'locale_name': 'state'})
regional['date'].nunique()

17

In [101]:
# Left-merge the 'regional' DataFrame into 'train' on 'date' and 'state'
merged = pd.merge(train, regional, on=['date', 'state'], how='left')

# Where 'locale' is 'Regional' in 'merged', set 'holiday' to 'Regional'
merged.loc[merged['locale'] == 'Regional', 'holiday'] = 'Regional'

# Copy the updated 'holiday' column back into 'train'
train['holiday'] = merged['holiday']

In [102]:
train.loc[train['holiday'] == 'Regional','date'].nunique()

17

### **3️⃣ Local (city) holidays (locale = 'Local')**
- Merge on `date` and `city`





In [103]:
## Case 3: local holidays (Local)
# Merge on the date and the city

local = holidays_events.loc[holidays_events['locale'] == 'Local', ['date', 'locale', 'locale_name']]
local = local.rename(columns={'locale_name': 'city'})
local['date'].nunique() # includes 2017-08-24, which falls in the test period

95

In [104]:
# Left-merge the 'local' DataFrame into 'train' on 'date' and 'city'
merged = pd.merge(train, local, on=['date', 'city'], how='left')

# Where 'locale' is 'Local' in 'merged', set 'holiday' to 'Local'
merged.loc[merged['locale'] == 'Local', 'holiday'] = 'Local'

# Copy the updated 'holiday' column back into 'train'
train['holiday'] = merged['holiday']

In [105]:
train.loc[train['holiday'] == 'Local','date'].nunique()

94

### **4️⃣ Non-holidays**
- Label as `Normal`

In [106]:
train.loc[train['holiday'].isna(),'holiday'] = 'Normal'
train['holiday'].unique()

array(['National', 'Normal', 'Local', 'Regional'], dtype=object)
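
For comparison, the four cases above can also be written as one labelling step without intermediate merges. This is only an alternative sketch, not the method used in this notebook, and it assumes the same precedence the cells above produce (Local overwrites Regional, which overwrites National). All `toy_` names and the helper `pair_mask` are hypothetical, and the data is made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made-up rows).
toy_train = pd.DataFrame({
    "date":  ["2013-01-01", "2013-04-01", "2013-04-12", "2013-04-12", "2013-05-05"],
    "city":  ["Quito", "Latacunga", "Cuenca", "Quito", "Quito"],
    "state": ["Pichincha", "Cotopaxi", "Azuay", "Pichincha", "Pichincha"],
})
toy_holidays = pd.DataFrame({
    "date":        ["2013-01-01", "2013-04-01", "2013-04-12"],
    "locale":      ["National", "Regional", "Local"],
    "locale_name": ["Ecuador", "Cotopaxi", "Cuenca"],
})

def pair_mask(frame, cols, locale):
    """True where frame[cols] equals a (date, locale_name) pair of `locale`."""
    rows = toy_holidays.loc[toy_holidays["locale"] == locale]
    wanted = list(zip(rows["date"], rows["locale_name"]))
    return pd.MultiIndex.from_frame(frame[cols]).isin(wanted)

national_dates = set(
    toy_holidays.loc[toy_holidays["locale"] == "National", "date"]
)
is_national = toy_train["date"].isin(national_dates).to_numpy()
is_regional = pair_mask(toy_train, ["date", "state"], "Regional")
is_local = pair_mask(toy_train, ["date", "city"], "Local")

# np.select uses the first matching condition, so Local is listed first to
# mirror the overwrite order National -> Regional -> Local.
toy_train["holiday"] = np.select(
    [is_local, is_regional, is_national],
    ["Local", "Regional", "National"],
    default="Normal",
)
print(toy_train)
```

Since no merge is involved, duplicated holiday records cannot change the number of rows in this variant.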

Weekends and Work Days will be handled later, during preprocessing, by creating derived variables.

### **Processing the test data**

| date       | category | locale | locale_name |
|------------|----------|--------|-------------|
| 2017-08-24 | Holiday  | Local  | Ambato      |

- Only this one holiday falls within the test period, so we handle just this one.



In [107]:
test['holiday'] = np.nan
test['holiday'] = test['holiday'].astype('object')
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28512 entries, 0 to 28511
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           28512 non-null  int64  
 1   date         28512 non-null  object 
 2   store_nbr    28512 non-null  int64  
 3   family       28512 non-null  object 
 4   onpromotion  28512 non-null  int64  
 5   city         28512 non-null  object 
 6   state        28512 non-null  object 
 7   type         28512 non-null  object 
 8   cluster      28512 non-null  int64  
 9   dcoilwtico   21384 non-null  float64
 10  holiday      0 non-null      object 
dtypes: float64(1), int64(4), object(6)
memory usage: 2.6+ MB


In [108]:
# Left-merge the 'local' DataFrame into 'test' on 'date' and 'city'
merged = pd.merge(test, local, on=['date', 'city'], how='left')

# Where 'locale' is 'Local' in 'merged', set 'holiday' to 'Local'
merged.loc[merged['locale'] == 'Local', 'holiday'] = 'Local'

# Copy the updated 'holiday' column back into 'test'
test['holiday'] = merged['holiday']

In [109]:
test.loc[test['holiday'] == 'Local','date'].nunique()

1

In [110]:
test.loc[(test['date'] == '2017-08-24') & (test['city'] == 'Ambato'),:]

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type,cluster,dcoilwtico,holiday
14751,3015639,2017-08-24,23,AUTOMOTIVE,0,Ambato,Tungurahua,D,9,47.24,Local
14752,3015640,2017-08-24,23,BABY CARE,0,Ambato,Tungurahua,D,9,47.24,Local
14753,3015641,2017-08-24,23,BEAUTY,0,Ambato,Tungurahua,D,9,47.24,Local
14754,3015642,2017-08-24,23,BEVERAGES,27,Ambato,Tungurahua,D,9,47.24,Local
14755,3015643,2017-08-24,23,BOOKS,0,Ambato,Tungurahua,D,9,47.24,Local
...,...,...,...,...,...,...,...,...,...,...,...
15769,3016657,2017-08-24,50,POULTRY,0,Ambato,Tungurahua,A,14,47.24,Local
15770,3016658,2017-08-24,50,PREPARED FOODS,0,Ambato,Tungurahua,A,14,47.24,Local
15771,3016659,2017-08-24,50,PRODUCE,2,Ambato,Tungurahua,A,14,47.24,Local
15772,3016660,2017-08-24,50,SCHOOL AND OFFICE SUPPLIES,13,Ambato,Tungurahua,A,14,47.24,Local


In [111]:
test.loc[test['holiday'].isna(),'holiday'] = 'Normal'
test['holiday'].unique()

array(['Normal', 'Local'], dtype=object)

**Checking the merge results**

In [112]:
train.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,type,cluster,dcoilwtico,transactions,holiday
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0.0,Quito,Pichincha,D,13,,,National
1,1,2013-01-01,42,CELEBRATION,0.0,0.0,Cuenca,Azuay,D,2,,,National
2,2,2013-01-01,42,BREAD/BAKERY,0.0,0.0,Cuenca,Azuay,D,2,,,National
3,3,2013-01-01,42,BOOKS,0.0,0.0,Cuenca,Azuay,D,2,,,National
4,4,2013-01-01,42,BEVERAGES,0.0,0.0,Cuenca,Azuay,D,2,,,National


In [113]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3008016 entries, 0 to 3008015
Data columns (total 13 columns):
 #   Column        Dtype  
---  ------        -----  
 0   id            int64  
 1   date          object 
 2   store_nbr     int64  
 3   family        object 
 4   sales         float64
 5   onpromotion   float64
 6   city          object 
 7   state         object 
 8   type          object 
 9   cluster       int64  
 10  dcoilwtico    float64
 11  transactions  float64
 12  holiday       object 
dtypes: float64(4), int64(3), object(6)
memory usage: 321.3+ MB


In [114]:
train.isnull().sum()

id                   0
date                 0
store_nbr            0
family               0
sales             7128
onpromotion       7128
city                 0
state                0
type                 0
cluster              0
dcoilwtico      935550
transactions    252912
holiday              0
dtype: int64

In [115]:
test.head()

Unnamed: 0,id,date,store_nbr,family,onpromotion,city,state,type,cluster,dcoilwtico,holiday
0,3000888,2017-08-16,1,AUTOMOTIVE,0,Quito,Pichincha,D,13,46.8,Normal
1,3000889,2017-08-16,1,BABY CARE,0,Quito,Pichincha,D,13,46.8,Normal
2,3000890,2017-08-16,1,BEAUTY,2,Quito,Pichincha,D,13,46.8,Normal
3,3000891,2017-08-16,1,BEVERAGES,20,Quito,Pichincha,D,13,46.8,Normal
4,3000892,2017-08-16,1,BOOKS,0,Quito,Pichincha,D,13,46.8,Normal


In [116]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28512 entries, 0 to 28511
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           28512 non-null  int64  
 1   date         28512 non-null  object 
 2   store_nbr    28512 non-null  int64  
 3   family       28512 non-null  object 
 4   onpromotion  28512 non-null  int64  
 5   city         28512 non-null  object 
 6   state        28512 non-null  object 
 7   type         28512 non-null  object 
 8   cluster      28512 non-null  int64  
 9   dcoilwtico   21384 non-null  float64
 10  holiday      28512 non-null  object 
dtypes: float64(1), int64(4), object(6)
memory usage: 2.6+ MB


In [117]:
test.isnull().sum()

id                0
date              0
store_nbr         0
family            0
onpromotion       0
city              0
state             0
type              0
cluster           0
dcoilwtico     7128
holiday           0
dtype: int64

## **2-5. Saving the merged files**

In [118]:
train.to_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/train_merged.csv', index = False)

In [119]:
test.to_csv('/content/drive/MyDrive/Colab Notebooks/ESAA 8기/OB/winter_proj/data/test_merged.csv', index = False)