# Preprocessing

### part 1 : fill in missing values & refine

- DepartmentDescription
- FinelineNumber
- Upc
- Rows where DepartmentDescription, FinelineNumber, and Upc are all NaN
- Divide the Upc into manufacturer(company) number and product number.
    
=======================================================================
### part 2 : Encode & Derivation

- DepartmentDescription
- Weekday
- FinelineNumber
- company Upc
- ScanCount
    - Divide each column by abs_ScanCount
- Refund rate

=======================================================================
### part 3 : Feature Selection
- Feature Selection

- X_train
- y_train
- X_test

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import preprocessing_functions as pf
from functools import partial
%matplotlib inline

### Data load

In [2]:
train = pd.read_csv("train.csv")

print(train.shape)
train.tail()

(647054, 7)


Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
647049,39,191346,Sunday,32390000000.0,1,PHARMACY OTC,1118.0
647050,39,191346,Sunday,7874205000.0,1,FROZEN FOODS,1752.0
647051,39,191346,Sunday,4072.0,1,PRODUCE,4170.0
647052,8,191347,Sunday,4190008000.0,1,DAIRY,1512.0
647053,8,191347,Sunday,3800060000.0,1,GROCERY DRY GOODS,3600.0


In [3]:
train.dtypes

TripType                   int64
VisitNumber                int64
Weekday                   object
Upc                      float64
ScanCount                  int64
DepartmentDescription     object
FinelineNumber           float64
dtype: object

In [4]:
test = pd.read_csv("test.csv")

print(test.shape)
test.tail()

(653646, 6)


Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
653641,191348,Sunday,66572110000.0,1,BATH AND SHOWER,1505.0
653642,191348,Sunday,88181390000.0,1,BATH AND SHOWER,1099.0
653643,191348,Sunday,4282557000.0,1,MENS WEAR,8220.0
653644,191348,Sunday,80469190000.0,1,SWIMWEAR/OUTERWEAR,114.0
653645,191348,Sunday,7871536000.0,1,MENS WEAR,4923.0


In [5]:
test.dtypes

VisitNumber                int64
Weekday                   object
Upc                      float64
ScanCount                  int64
DepartmentDescription     object
FinelineNumber           float64
dtype: object

## part 1 : fill in missing values & refine

### DepartmentDescription
- For each VisitNumber, fill missing DepartmentDescription values with the most frequent DepartmentDescription of that visit.
- Replace the 191 values that cannot be inferred with 'UNKNOWN'.
- Rename "MENSWEAR" to "MENS WEAR".
- Rename "HEALTH AND BEAUTY AIDS" to "BEAUTY".
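
The per-visit fill described above can also be written without an explicit loop. The following is a minimal sketch on a toy frame, not the notebook's own code: the helper name `fill_with_group_mode` and the toy data are assumptions for illustration. Note that `Series.mode()` breaks ties by sorted order, which may differ from `value_counts().index[0]` used in the cells below.

```python
import pandas as pd

def fill_with_group_mode(df, group_col, target_col):
    """Fill NaNs in target_col with the most frequent non-null value of the same group."""
    # Groups without any non-null value are absent from `modes`, so they stay NaN.
    modes = (
        df.dropna(subset=[target_col])
        .groupby(group_col)[target_col]
        .agg(lambda s: s.mode().iloc[0])
    )
    return df[target_col].fillna(df[group_col].map(modes))

# Toy data (hypothetical): visit 259 has a clear mode, visit 300 has no known value.
toy = pd.DataFrame({
    "VisitNumber": [259, 259, 259, 259, 300, 300],
    "DepartmentDescription": ["LAWN AND GARDEN", None, "LAWN AND GARDEN",
                              "IMPULSE MERCHANDISE", None, None],
})
toy["DepartmentDescription"] = fill_with_group_mode(toy, "VisitNumber", "DepartmentDescription")
# Values that cannot be inferred fall back to the 'UNKNOWN' label.
toy["DepartmentDescription"] = toy["DepartmentDescription"].fillna("UNKNOWN")
```

On the full data this would likely be much faster than filtering the frame once per visit, at the cost of a slightly different tie-breaking rule.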

#### train

In [6]:
train.loc[train["VisitNumber"]==259, "DepartmentDescription"]

546        LAWN AND GARDEN
547        LAWN AND GARDEN
548                    NaN
549                    NaN
550        LAWN AND GARDEN
551        LAWN AND GARDEN
552    IMPULSE MERCHANDISE
Name: DepartmentDescription, dtype: object

In [7]:
DD_VN_list = train[train["DepartmentDescription"].isna()]["VisitNumber"].unique()

In [8]:
for loc in tqdm(DD_VN_list): # the `if` skips visits whose DepartmentDescription values are all missing
    if len(train[train["VisitNumber"] == loc]["DepartmentDescription"].value_counts().index) != 0:
        train.loc[(train["VisitNumber"] == loc)&(train["DepartmentDescription"].isna()), "DepartmentDescription"] = train[train["VisitNumber"] == loc]["DepartmentDescription"].value_counts().index[0]

100%|██████████| 1172/1172 [00:33<00:00, 34.52it/s]


In [9]:
train.loc[train["VisitNumber"]==259, "DepartmentDescription"]

546        LAWN AND GARDEN
547        LAWN AND GARDEN
548        LAWN AND GARDEN
549        LAWN AND GARDEN
550        LAWN AND GARDEN
551        LAWN AND GARDEN
552    IMPULSE MERCHANDISE
Name: DepartmentDescription, dtype: object

#### test

In [10]:
test.loc[test["VisitNumber"]==874, "DepartmentDescription"]

2115             AUTOMOTIVE
2116             AUTOMOTIVE
2117                    NaN
2118                    NaN
2119             AUTOMOTIVE
2120    IMPULSE MERCHANDISE
Name: DepartmentDescription, dtype: object

In [11]:
DD_VN_list_t = test[test["DepartmentDescription"].isna()]["VisitNumber"].unique()

In [12]:
for loc in tqdm(DD_VN_list_t): # the `if` skips visits whose DepartmentDescription values are all missing
    if len(test[test["VisitNumber"] == loc]["DepartmentDescription"].value_counts().index) != 0:
        test.loc[(test["VisitNumber"] == loc)&(test["DepartmentDescription"].isna()), "DepartmentDescription"] = test[test["VisitNumber"] == loc]["DepartmentDescription"].value_counts().index[0]

100%|██████████| 1141/1141 [00:32<00:00, 35.13it/s]


In [13]:
test.loc[test["VisitNumber"]==874, "DepartmentDescription"]

2115             AUTOMOTIVE
2116             AUTOMOTIVE
2117             AUTOMOTIVE
2118             AUTOMOTIVE
2119             AUTOMOTIVE
2120    IMPULSE MERCHANDISE
Name: DepartmentDescription, dtype: object

#### refine DepartmentDescription

#### Rename "MENSWEAR" to "MENS WEAR"

In [14]:
train.loc[train["DepartmentDescription"] == "MENSWEAR", "DepartmentDescription"] = "MENS WEAR"
train[train["DepartmentDescription"] == "MENSWEAR"]

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber


#### Rename "HEALTH AND BEAUTY AIDS" to "BEAUTY"
- The train data has only two rows with "HEALTH AND BEAUTY AIDS", and the test data has none.

In [15]:
train.loc[train["DepartmentDescription"] == "HEALTH AND BEAUTY AIDS", "DepartmentDescription"] = "BEAUTY"
train[train["DepartmentDescription"] == "HEALTH AND BEAUTY AIDS"]

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber


### FinelineNumber
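- When DepartmentDescription is 'PHARMACY RX' and FinelineNumber is missing, fill it with the most frequent FinelineNumber among 'PHARMACY RX' rows. (The code below alternates between the two most frequent values, 4822 and 5615, and assigns to every 'PHARMACY RX' row.)
- For each VisitNumber, fill missing FinelineNumber values with the most frequent FinelineNumber of that visit.
- Replace the 191 values that cannot be inferred with -9999, a value that does not occur in the existing data.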
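
As a stricter reading of the first and last bullets, the sketch below fills only the missing 'PHARMACY RX' rows with the department's most frequent value and then applies the sentinel. It is an illustration on made-up rows, not the loop used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "DepartmentDescription": ["PHARMACY RX", "PHARMACY RX", "PHARMACY RX", "PHARMACY RX", "DAIRY"],
    "FinelineNumber": [4822.0, np.nan, 4822.0, 5615.0, np.nan],
})

is_rx = toy["DepartmentDescription"] == "PHARMACY RX"
# Most frequent observed FinelineNumber within the department (NaN is ignored).
rx_mode = toy.loc[is_rx, "FinelineNumber"].mode().iloc[0]

# Only rows that are both PHARMACY RX and missing are touched.
toy.loc[is_rx & toy["FinelineNumber"].isna(), "FinelineNumber"] = rx_mode

# Anything still missing gets the -9999 sentinel.
toy["FinelineNumber"] = toy["FinelineNumber"].fillna(-9999.0)
```

Unlike the alternating assignment, this leaves already observed values such as 5615.0 unchanged.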

#### train

In [16]:
train[train["DepartmentDescription"] == 'PHARMACY RX']["FinelineNumber"].value_counts()

4822.0    84
5615.0    63
1335.0     6
1336.0     1
Name: FinelineNumber, dtype: int64

In [17]:
Pharmacy_idx = train[train["DepartmentDescription"]=='PHARMACY RX'].index
number_idx = np.arange(2922)
idx_box = zip(number_idx, Pharmacy_idx)


for idx, Pha_idx in tqdm(idx_box):
    if idx % 2 == 0:
        train.loc[Pha_idx, "FinelineNumber"] = 4822.0
    else:
        train.loc[Pha_idx, "FinelineNumber"] = 5615.0
        
train[train["DepartmentDescription"] == 'PHARMACY RX'][["DepartmentDescription", "FinelineNumber"]].head()

2922it [00:09, 313.50it/s]


Unnamed: 0,DepartmentDescription,FinelineNumber
1155,PHARMACY RX,4822.0
1216,PHARMACY RX,5615.0
1373,PHARMACY RX,4822.0
1455,PHARMACY RX,5615.0
1456,PHARMACY RX,4822.0


In [18]:
train.loc[train["VisitNumber"]==259, "FinelineNumber"]

546    5141.0
547    1748.0
548       NaN
549       NaN
550    2605.0
551    2605.0
552     337.0
Name: FinelineNumber, dtype: float64

In [19]:
FN_VN_list = train[train["FinelineNumber"].isna()]["VisitNumber"].unique()

In [20]:
for loc in tqdm(FN_VN_list): # the `if` skips visits whose FinelineNumber values are all missing
    if len(train[train["VisitNumber"] == loc]["FinelineNumber"].value_counts().index) != 0:
        train.loc[(train["VisitNumber"] == loc)&(train["FinelineNumber"].isna()), "FinelineNumber"] = train[train["VisitNumber"] == loc]["FinelineNumber"].value_counts().index[0]

100%|██████████| 1172/1172 [00:10<00:00, 113.46it/s]


In [21]:
train.loc[train["VisitNumber"]==259, "FinelineNumber"]

546    5141.0
547    1748.0
548    2605.0
549    2605.0
550    2605.0
551    2605.0
552     337.0
Name: FinelineNumber, dtype: float64

#### test

In [22]:
test[test["DepartmentDescription"] == 'PHARMACY RX']["FinelineNumber"].value_counts()

4822.0    79
5615.0    45
1335.0     2
Name: FinelineNumber, dtype: int64

In [23]:
Pharmacy_idx = test[test["DepartmentDescription"]=='PHARMACY RX'].index
number_idx = np.arange(2784)
idx_box = zip(number_idx, Pharmacy_idx)


for idx, Pha_idx in tqdm(idx_box):
    if idx % 2 == 0:
        test.loc[Pha_idx, "FinelineNumber"] = 4822.0
    else:
        test.loc[Pha_idx, "FinelineNumber"] = 5615.0
        
test[test["DepartmentDescription"] == 'PHARMACY RX'][["DepartmentDescription", "FinelineNumber"]].head()

2784it [00:09, 306.35it/s]


Unnamed: 0,DepartmentDescription,FinelineNumber
1188,PHARMACY RX,4822.0
1189,PHARMACY RX,5615.0
1190,PHARMACY RX,4822.0
1314,PHARMACY RX,5615.0
1315,PHARMACY RX,4822.0


In [24]:
test.loc[test["VisitNumber"]==874, "FinelineNumber"]

2115    250.0
2116      9.0
2117      NaN
2118      NaN
2119    253.0
2120    145.0
Name: FinelineNumber, dtype: float64

In [25]:
FN_VN_list_t = test[test["FinelineNumber"].isna()]["VisitNumber"].unique()

In [26]:
for loc in tqdm(FN_VN_list_t): # the `if` skips visits whose FinelineNumber values are all missing
    if len(test[test["VisitNumber"] == loc]["FinelineNumber"].value_counts().index) != 0:
        test.loc[(test["VisitNumber"] == loc)&(test["FinelineNumber"].isna()), "FinelineNumber"] = test[test["VisitNumber"] == loc]["FinelineNumber"].value_counts().index[0]

100%|██████████| 1141/1141 [00:09<00:00, 115.74it/s]


In [27]:
test.loc[test["VisitNumber"]==874, "FinelineNumber"]

2115    250.0
2116      9.0
2117    145.0
2118    145.0
2119    253.0
2120    145.0
Name: FinelineNumber, dtype: float64

### UPC
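- When DepartmentDescription is 'PHARMACY RX' and Upc is missing, fill it with the most frequent Upc among 'PHARMACY RX' rows.
- For each VisitNumber, fill missing Upc values with the most frequent Upc of that visit.
- Replace the 191 values that cannot be inferred with '0000599996', a value that does not occur in the existing data.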

#### train

In [28]:
train.loc[train["VisitNumber"]==259, "Upc"]

546    7.112176e+09
547    4.656118e+09
548             NaN
549             NaN
550    3.146256e+09
551    3.146253e+09
552    4.650073e+09
Name: Upc, dtype: float64

In [29]:
Upc_VN_list = train[train["Upc"].isna()]["VisitNumber"].unique()

In [30]:
for loc in tqdm(Upc_VN_list): # the `if` skips visits whose Upc values are all missing
    if len(train[train["VisitNumber"] == loc]["Upc"].value_counts().index) != 0:
        train.loc[(train["VisitNumber"] == loc)&(train["Upc"].isna()), "Upc"] = train[train["VisitNumber"] == loc]["Upc"].value_counts().index[0]

100%|██████████| 2754/2754 [00:15<00:00, 182.57it/s]


In [31]:
train.loc[train["VisitNumber"]==259, "Upc"]

546    7.112176e+09
547    4.656118e+09
548    4.656118e+09
549    4.656118e+09
550    3.146256e+09
551    3.146253e+09
552    4.650073e+09
Name: Upc, dtype: float64

#### test

In [32]:
test.loc[test["VisitNumber"]==874, "Upc"]

2115    1.284410e+09
2116    8.182000e+10
2117             NaN
2118             NaN
2119    1.284410e+09
2120    3.400001e+09
Name: Upc, dtype: float64

In [33]:
Upc_VN_list_t = test[test["Upc"].isna()]["VisitNumber"].unique()

In [34]:
for loc in tqdm(Upc_VN_list_t): # the `if` skips visits whose Upc values are all missing
    if len(test[test["VisitNumber"] == loc]["Upc"].value_counts().index) != 0:
        test.loc[(test["VisitNumber"] == loc)&(test["Upc"].isna()), "Upc"] = test[test["VisitNumber"] == loc]["Upc"].value_counts().index[0]

100%|██████████| 2706/2706 [00:14<00:00, 180.60it/s]


In [35]:
test.loc[test["VisitNumber"]==874, "Upc"]

2115    1.284410e+09
2116    8.182000e+10
2117    3.400001e+09
2118    3.400001e+09
2119    1.284410e+09
2120    3.400001e+09
Name: Upc, dtype: float64

### Rows where DepartmentDescription, FinelineNumber, and Upc are all NaN

- In 191 rows, DepartmentDescription, FinelineNumber, and Upc are all missing and cannot be inferred from the other columns. Fill them with "UNKNOWN", -9999, and '0000599996' respectively, values that do not appear in the train or test data.
- Every one of these rows has TripType 999.
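
A compact way to locate such rows and fill all three columns at once is sketched below on a toy frame. The data and variable names are assumptions for illustration; the sentinel values are the ones listed above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 have all three columns missing.
toy = pd.DataFrame({
    "TripType": [999, 8, 999],
    "DepartmentDescription": [np.nan, "DAIRY", np.nan],
    "FinelineNumber": [np.nan, 1512.0, np.nan],
    "Upc": [np.nan, 4190007664.0, np.nan],
})

cols = ["DepartmentDescription", "FinelineNumber", "Upc"]
# True only where every one of the three columns is missing.
all_missing = toy[cols].isna().all(axis=1)
n_empty = int(all_missing.sum())
trip_types = toy.loc[all_missing, "TripType"].unique().tolist()

# One sentinel per column; 599996.0 is the numeric form of '0000599996'.
sentinels = {"DepartmentDescription": "UNKNOWN", "FinelineNumber": -9999.0, "Upc": 599996.0}
toy = toy.fillna(value=sentinels)
```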

In [36]:
empty_df = train[(train["DepartmentDescription"].isna())&(train["FinelineNumber"].isna())&(train["Upc"].isna())][["VisitNumber", "DepartmentDescription", "FinelineNumber", "Upc", "TripType", "Weekday", "ScanCount"]]

print(empty_df.shape)
empty_df[["DepartmentDescription", "FinelineNumber", "Upc"]].head()

(191, 7)


Unnamed: 0,DepartmentDescription,FinelineNumber,Upc
959,,,
1134,,,
1135,,,
6285,,,
8524,,,


In [37]:
print("All 191 empty rows have TripType {}.".format(empty_df["TripType"].value_counts().index[0]))
empty_df["TripType"].value_counts()

All 191 empty rows have TripType 999.


999    191
Name: TripType, dtype: int64

#### train

In [38]:
train.loc[train["DepartmentDescription"].isna(), "DepartmentDescription"] = "UNKNOWN"
train[train["DepartmentDescription"].isna()]

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber


In [39]:
train.loc[train["FinelineNumber"].isna(), "FinelineNumber"] = -9999.0
train[train["FinelineNumber"].isna()]

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber


In [40]:
train.loc[train["Upc"].isna(), "Upc"] = 0000599996.0
train[train["Upc"].isna()]

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber


#### test

In [41]:
test.loc[test["DepartmentDescription"].isna(), "DepartmentDescription"] = "UNKNOWN"
test[test["DepartmentDescription"].isna()]

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber


In [42]:
test.loc[test["FinelineNumber"].isna(), "FinelineNumber"] = -9999.0
test[test["FinelineNumber"].isna()]

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber


In [43]:
test.loc[test["Upc"].isna(), "Upc"] = 0000599996.0
test[test["Upc"].isna()]

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber


### Divide the Upc into manufacturer(company) number and product number.

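- UPCs appear in several lengths (3 to 12 digits). Restore them to a common full-length form (the helper names `upc_789101112_to_10` and `upc_3456_to_10` indicate a 10-digit target), then split the result into company_Upc and product_Upc, the parts judged useful for classification, and encode them.
- This uses custom functions from the `preprocessing_functions` module.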
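
The source of `preprocessing_functions` is not shown here, so the following is only a hypothetical sketch of the general idea: zero-pad a numeric code to a fixed width and split it into a manufacturer part and a product part. The function name `split_upc`, the 10-digit width, and the 5/5 split are assumptions, and the sketch does not reproduce every output of the `pf` helpers (for example, the notebook maps `4072.0` to `404072`).

```python
def split_upc(upc, width=10, company_digits=5):
    """Hypothetical helper: normalize a numeric UPC and split it into (company, product)."""
    digits = str(int(float(upc)))          # "7874205336.0" -> "7874205336"
    digits = digits.zfill(width)[-width:]  # pad short codes, keep only the last `width` digits
    return digits[:company_digits], digits[company_digits:]

company, product = split_upc("7874205336.0")
```

Keeping both parts as zero-padded strings avoids losing leading zeros, which matters when the parts are later used as column labels.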

#### train

In [44]:
train["Upc"] = train["Upc"].astype(str)

In [45]:
train["full_Upc"] = train["Upc"].apply(pf.upc_789101112_to_10)
train["full_Upc"] = train["full_Upc"].apply(pf.upc_3456_to_10)

train["company_Upc"] = train["full_Upc"].apply(pf.company_part_Upc)
train["product_Upc"] = train["full_Upc"].apply(pf.product_part_Upc) 

train[["Upc", "full_Upc", "company_Upc", "product_Upc"]].tail()

Unnamed: 0,Upc,full_Upc,company_Upc,product_Upc
647049,32390001778.0,2390001778,23900,1778
647050,7874205336.0,7874205336,78742,5336
647051,4072.0,404072,4,4072
647052,4190007664.0,4190007664,41900,7664
647053,3800059655.0,3800059655,38000,59655


#### test

In [46]:
test["Upc"] = test["Upc"].astype(str)

In [47]:
test["full_Upc"] = test["Upc"].apply(pf.upc_789101112_to_10)
test["full_Upc"] = test["full_Upc"].apply(pf.upc_3456_to_10)

test["company_Upc"] = test["full_Upc"].apply(pf.company_part_Upc)
test["product_Upc"] = test["full_Upc"].apply(pf.product_part_Upc) 

test[["Upc", "full_Upc", "company_Upc", "product_Upc"]].tail()

Unnamed: 0,Upc,full_Upc,company_Upc,product_Upc
653641,66572105763.0,6572105763,65721,5763
653642,88181390024.0,8181390024,81813,90024
653643,4282557050.0,4282557050,42825,57050
653644,80469193740.0,469193740,4691,93740
653645,7871535983.0,7871535983,78715,35983


## part 2 : Encode & Derivation

## Encode

### Sum ScanCount per DepartmentDescription for each VisitNumber (Encode)

#### train

In [48]:
train_department = pd.pivot_table(data=train, index='VisitNumber', columns='DepartmentDescription', values='ScanCount', aggfunc='sum')
train_department = train_department.fillna(0)

In [49]:
train_department.head()

VisitNumber,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,UNKNOWN,WIRELESS
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### test

In [50]:
test_department = pd.pivot_table(data=test, index='VisitNumber', columns='DepartmentDescription', values='ScanCount', aggfunc='sum')
test_department = test_department.fillna(0)

In [51]:
test_department.head()

VisitNumber,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,UNKNOWN,WIRELESS
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Sum ScanCount per Weekday for each VisitNumber (Encode)

#### train

In [52]:
train_weekday = pd.pivot_table(data=train, index='VisitNumber', columns='Weekday', values='ScanCount', aggfunc='sum')
train_weekday = train_weekday.fillna(0)

In [53]:
train_weekday.head()

VisitNumber,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
5,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2.0,0.0,0.0,0.0,0.0,0.0,0.0
8,28.0,0.0,0.0,0.0,0.0,0.0,0.0
9,3.0,0.0,0.0,0.0,0.0,0.0,0.0
10,3.0,0.0,0.0,0.0,0.0,0.0,0.0


#### test

In [54]:
test_weekday = pd.pivot_table(data=test, index='VisitNumber', columns='Weekday', values='ScanCount', aggfunc='sum')
test_weekday = test_weekday.fillna(0)

In [55]:
test_weekday.head()

VisitNumber,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
1,4.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Sum ScanCount per FinelineNumber for each VisitNumber (Encode)

#### train

In [56]:
train_fineline = pd.pivot_table(data=train, index='VisitNumber', columns='FinelineNumber', values='ScanCount', aggfunc='sum')
train_fineline = train_fineline.fillna(0)

In [57]:
train_fineline.head()

VisitNumber,-9999.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,...,9964.0,9966.0,9967.0,9970.0,9971.0,9974.0,9975.0,9991.0,9997.0,9998.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### test


In [58]:
test_fineline = pd.pivot_table(data=test, index='VisitNumber', columns='FinelineNumber', values='ScanCount', aggfunc='sum')
test_fineline = test_fineline.fillna(0)

In [59]:
test_fineline.head()

VisitNumber,-9999.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,...,9967.0,9969.0,9970.0,9971.0,9974.0,9975.0,9991.0,9997.0,9998.0,9999.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Sum ScanCount per company_Upc for each VisitNumber (Encode)

#### train

In [60]:
train["company_Upc"] = train["company_Upc"].astype('str')
train_company_upc = pd.pivot_table(data=train, index='VisitNumber', columns='company_Upc', values='ScanCount', aggfunc='sum')
train_company_upc = train_company_upc.fillna(0)

In [61]:
train_company_upc.head()

VisitNumber,00001,00002,00003,00004,00005,00008,00049,00050,00054,00055,...,99804,99829,99870,99919,99923,99928,99939,99967,99988,99991
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### test

In [62]:
test["company_Upc"] = test["company_Upc"].astype('str')
test_company_upc = pd.pivot_table(data=test, index='VisitNumber', columns='company_Upc', values='ScanCount', aggfunc='sum')
test_company_upc = test_company_upc.fillna(0)

In [63]:
test_company_upc.head()

VisitNumber,00001,00002,00003,00004,00005,00008,00049,00050,00054,00055,...,99800,99804,99829,99870,99874,99919,99923,99939,99967,99988
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Concatenate the columns above

In [64]:
train_df = pd.concat([train_department, train_fineline, train_company_upc, train_weekday], axis=1)

print(train_df.shape)
train_df.tail()

(95674, 10954)


VisitNumber,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,99967,99988,99991,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
191343,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0
191344,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
191345,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0,0.0,0.0
191346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,17.0,0.0,0.0,0.0
191347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0


In [65]:
test_df = pd.concat([test_department, test_fineline, test_company_upc, test_weekday], axis=1)
test_df = test_df.reset_index(drop=False)

## Derivation

### Create abs_ScanCount for the derived columns

#### train

In [66]:
train["abs_ScanCount"] = np.abs(train["ScanCount"])

In [67]:
train_abs_ScanCount = train.groupby(by="VisitNumber")["abs_ScanCount"].sum().reset_index()["abs_ScanCount"]
train_abs_ScanCount.head()

0     1
1     2
2    32
3     3
4     3
Name: abs_ScanCount, dtype: int64

In [68]:
train_ScanCount = train.groupby(by="VisitNumber")["ScanCount"].sum().reset_index()["ScanCount"]
train_ScanCount.head()

0    -1
1     2
2    28
3     3
4     3
Name: ScanCount, dtype: int64

#### test

In [69]:
test["abs_ScanCount"] = np.abs(test["ScanCount"])

In [70]:
test_abs_ScanCount = test.groupby(by="VisitNumber")["abs_ScanCount"].sum().reset_index()["abs_ScanCount"]
test_abs_ScanCount.head()

0    4
1    4
2    2
3    1
4    2
Name: abs_ScanCount, dtype: int64

In [71]:
test_ScanCount = test.groupby(by="VisitNumber")["ScanCount"].sum().reset_index()["ScanCount"]
test_ScanCount.head()

0    4
1    4
2    0
3    1
4    0
Name: ScanCount, dtype: int64

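### Divide each column by abs_ScanCount
- Divide each column by the visit's ScanCount total so that it holds the share of each kind of item purchased.
- The absolute value is used because an item that was bought and later refunded still counts as part of the purchase history.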
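
The same row-wise normalization can be expressed with `DataFrame.div` instead of a loop over columns. This is a minimal sketch on toy values; the frame and series below are assumptions for illustration.

```python
import pandas as pd

# Toy per-visit sums (hypothetical): visit 8 is a pure refund.
counts = pd.DataFrame(
    {"DAIRY": [3.0, 0.0], "PRODUCE": [1.0, -1.0]},
    index=pd.Index([7, 8], name="VisitNumber"),
)
# Sum of |ScanCount| per visit, aligned on the same index.
abs_scan = pd.Series([4, 1], index=counts.index, name="abs_ScanCount")

# Divide every row by that visit's absolute scan total in one vectorized call.
ratios = counts.div(abs_scan, axis=0)
```

Because the division aligns on the index, it does not depend on column order and never divides the identifier column by accident.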

#### train

In [72]:
train_df = train_df.reset_index(drop=False)
train_df["abs_ScanCount"] = train_abs_ScanCount

train_columns = list(train_df.columns)

for col_name in tqdm(train_columns):
    train_df[col_name] = train_df[col_name] / train_df["abs_ScanCount"]

100%|██████████| 10956/10956 [00:08<00:00, 1280.03it/s]


#### test

In [73]:
test_df = test_df.reset_index(drop=False)
test_df["abs_ScanCount"] = test_abs_ScanCount
test_columns = list(test_df.columns)

for col_name in tqdm(test_columns):
    test_df[col_name] = test_df[col_name] / test_df["abs_ScanCount"]
    

100%|██████████| 10986/10986 [00:10<00:00, 1057.54it/s]


### Refund rate
- Most TripType 999 visits are refunds, so the refund rate helps distinguish customer types.
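
The formula used in the cells below follows from two sums. If a visit has P purchased units and R refunded units, the absolute total is P + R and the net total is P - R, so R = (abs - net) / 2 and the refund rate is R / abs. A small worked sketch on toy data (the values are assumptions for illustration):

```python
import pandas as pd

# Toy line items (hypothetical): visit 5 is a single refund, visit 8 mixes both.
toy = pd.DataFrame({
    "VisitNumber": [5, 8, 8, 8],
    "ScanCount": [-1, 3, 2, -1],
})
toy["abs_ScanCount"] = toy["ScanCount"].abs()

per_visit = toy.groupby("VisitNumber")[["ScanCount", "abs_ScanCount"]].sum()
# Refunded units: (abs - net) / 2, then expressed as a share of all scanned units.
refunded = (per_visit["abs_ScanCount"] - per_visit["ScanCount"]) / 2
per_visit["refund_rate"] = refunded / per_visit["abs_ScanCount"]
```

Here visit 5 gets a rate of 1.0 (everything refunded) and visit 8 gets 1/6 (one refunded unit out of six scanned).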

#### train

In [74]:
train_df["ScanCount"] = train_ScanCount
train_df["refund_rate"] = ((train_abs_ScanCount - train_ScanCount) / 2) / train_abs_ScanCount

In [75]:
train_df["refund_rate"].head()

0    1.0000
1    0.0000
2    0.0625
3    0.0000
4    0.0000
Name: refund_rate, dtype: float64

#### test

In [76]:
test_df["ScanCount"] = test_ScanCount
test_df["refund_rate"] = ((test_abs_ScanCount - test_ScanCount) / 2) / test_abs_ScanCount

In [77]:
test_df["refund_rate"].head()

0    0.0
1    0.0
2    0.5
3    0.0
4    0.5
Name: refund_rate, dtype: float64

## part 3 : Feature Selection
- Drop abs_ScanCount and VisitNumber, which are not needed as features.
- Use only the company_Upc values (5304 columns) and FinelineNumber values (5045 columns) that appear in both the train and test data.
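
Restricting both splits to their shared columns can be sketched as follows. The toy frames are assumptions for illustration. Sorting the intersection is an optional addition: the iteration order of a set of strings may vary between Python sessions, so sorting keeps the feature order reproducible.

```python
import pandas as pd

# Toy wide frames (hypothetical): only "00004" and "78742" occur in both.
train_wide = pd.DataFrame({"00004": [1.0], "78742": [0.0], "99991": [2.0]})
test_wide = pd.DataFrame({"00004": [0.0], "78742": [3.0], "99874": [1.0]})

# Keep only features present in both splits, in a deterministic order.
shared = sorted(set(train_wide.columns) & set(test_wide.columns))

X_train_toy = train_wide[shared]
X_test_toy = test_wide[shared]
```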

In [78]:
del train_df["abs_ScanCount"]
del train_df["VisitNumber"]

del test_df["abs_ScanCount"]
del test_df["VisitNumber"]

train_company_list = list(train["company_Upc"].value_counts().index)
test_company_list = list(test["company_Upc"].value_counts().index)

train_fineline_list = list(train["FinelineNumber"].value_counts().index)
test_fineline_list = list(test["FinelineNumber"].value_counts().index)

company_feature = list(set(train_company_list) & set(test_company_list))
fineline_feature = list(set(train_fineline_list) & set(test_fineline_list))

feature_names = list(train_department.columns) + list(train_weekday.columns) + company_feature + fineline_feature + ["ScanCount", "refund_rate"]
print(len(feature_names))

10425


### X_train

In [79]:
X_train = train_df[feature_names]

print(X_train.shape)
X_train.tail()

(95674, 10425)


Unnamed: 0,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,8172.0,8173.0,8175.0,8176.0,1625.0,8180.0,8190.0,8191.0,ScanCount,refund_rate
95669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9,0.0
95670,0.0,0.0,0.0,0.0,0.0,0.8,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,0.0
95671,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17,0.0
95672,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17,0.0
95673,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,0.0


### y_train

In [80]:
y_train = train.groupby(by="VisitNumber")["TripType"].mean().reset_index()["TripType"]

print(y_train.shape)
y_train[:5]

(95674,)


0    999.0
1     30.0
2     26.0
3      8.0
4      8.0
Name: TripType, dtype: float64

### X_test

In [81]:
X_test = test_df[feature_names]

print(X_test.shape)
X_test.tail()

(95674, 10425)


Unnamed: 0,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,8172.0,8173.0,8175.0,8176.0,1625.0,8180.0,8190.0,8191.0,ScanCount,refund_rate
95669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12,0.0
95670,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6,0.0
95671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,0.0
95672,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12,0.0
95673,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7,0.0


### Sparse matrix - csr matrix

In [82]:
from scipy.sparse import csr_matrix

In [83]:
X_train = csr_matrix(X_train)

X_train

<95674x10425 sparse matrix of type '<class 'numpy.float64'>'
	with 1575630 stored elements in Compressed Sparse Row format>

In [84]:
X_test = csr_matrix(X_test)

X_test

<95674x10425 sparse matrix of type '<class 'numpy.float64'>'
	with 1584737 stored elements in Compressed Sparse Row format>

### Label Encode

In [85]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)

print(y_train.shape)
y_train[:5]

(95674,)


array([37, 22, 18,  5,  5])