# Action 2: Budget Optimization via Attribution Analysis and Simulation
- Campaign attribution analysis using domain-based and model-based attribution models

---


# Exploring the Data
- When there are many marketing campaigns, how can their budget be optimized?

<br/>

- Dataset source: https://ailab.criteo.com/ressources/ (Criteo, an ad retargeting company)

<br/>

- **Timestamp**: timestamp of the impression
    - The time at which the campaign ad was shown
- **UID**: unique user identifier
- **Campaign**: unique campaign identifier
- **Conversion**: 1 if there was a conversion in the 30 days after the impression; 0 otherwise
- **Conversion ID**: a unique identifier for each conversion
- **Click**: 1 if the impression was clicked; 0 otherwise
    - 1 if the user clicked after seeing the ad, 0 if the user ignored it
- **Cost**: the price paid for this ad
- **Cat1-Cat9**: categorical features associated with the ad. These features’ semantic meaning is not disclosed.
    - Used here as category information describing the campaign

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('data/criteo_attribution_dataset_sampled_campaign_300_journey_over_2_points_balanced.csv')
df = df.iloc[:, 2:]  # drop the first two columns of the CSV

print(df.shape)
df.head(2)

(602160, 23)


| | timestamp | uid | campaign | conversion | conversion_timestamp | conversion_id | attribution | click | click_pos | click_nb | ... | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cat9 | jid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 209415 | 21670661 | 10474106 | 1 | 1420115 | 17559949 | 0 | 1 | 4 | 8 | ... | 30763035 | 9312274 | 30867372 | 29196072 | 11409684 | 30763035 | 8549927 | 29196072 | 29520626 | 21670661_17559949 |
| 1 | 225512 | 32320979 | 28874676 | 0 | -1 | -1 | 0 | 1 | -1 | -1 | ... | 30763035 | 9312274 | 1461750 | 29196072 | 26611394 | 1973606 | 30600973 | 9068204 | 15351053 | 32320979_-1 |


In [2]:
for col in df.columns:
    print(f"{col} Unique Classes : {df[col].nunique()}")

timestamp Unique Classes : 501483
uid Unique Classes : 107463
campaign Unique Classes : 300
conversion Unique Classes : 2
conversion_timestamp Unique Classes : 78799
conversion_id Unique Classes : 79768
attribution Unique Classes : 2
click Unique Classes : 2
click_pos Unique Classes : 143
click_nb Unique Classes : 89
cost Unique Classes : 524248
cpo Unique Classes : 104681
time_since_last_click Unique Classes : 313370
cat1 Unique Classes : 9
cat2 Unique Classes : 44
cat3 Unique Classes : 818
cat4 Unique Classes : 19
cat5 Unique Classes : 41
cat6 Unique Classes : 30
cat7 Unique Classes : 7431
cat8 Unique Classes : 11
cat9 Unique Classes : 30
jid Unique Classes : 124258


---

### Extracting the 50 most frequent campaigns
- Of the 300 distinct campaigns, keep only the rows that belong to the 50 campaigns with the most impressions.

In [3]:
campaign_50 = (
    df.groupby('campaign')['timestamp']
    .count()
    .sort_values(ascending=False)
    .reset_index()
    .rename({'timestamp': 'camp_cnt'}, axis=1)
    .head(50)
)

filtered_df = df.merge(campaign_50, on='campaign')

print(filtered_df.campaign.nunique())
print(filtered_df.shape)
df = filtered_df

50
(451364, 24)


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 451364 entries, 0 to 451363
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   timestamp              451364 non-null  int64  
 1   uid                    451364 non-null  int64  
 2   campaign               451364 non-null  int64  
 3   conversion             451364 non-null  int64  
 4   conversion_timestamp   451364 non-null  int64  
 5   conversion_id          451364 non-null  int64  
 6   attribution            451364 non-null  int64  
 7   click                  451364 non-null  int64  
 8   click_pos              451364 non-null  int64  
 9   click_nb               451364 non-null  int64  
 10  cost                   451364 non-null  float64
 11  cpo                    451364 non-null  float64
 12  time_since_last_click  451364 non-null  int64  
 13  cat1                   451364 non-null  int64  
 14  cat2                   451364 non-nu

---

### How many campaign impressions a user is exposed to on the way to conversion

In [5]:
# Number of events (impressions) per journey ID (jid)
event_per_jid = df.groupby('jid')['timestamp'].count().reset_index()
event_per_jid.head(2)

| | jid | timestamp |
|---|---|---|
| 0 | 10000148_-1 | 9 |
| 1 | 1000023_5282678 | 2 |


In [6]:
# Number of journeys (jid) for each impression count
jid_cnt_per_event = event_per_jid.groupby('timestamp')['jid'].count()

event_journey = pd.DataFrame({
    'event_count': jid_cnt_per_event.index, 'journey_count': jid_cnt_per_event.values
})
event_journey.head(3)

| | event_count | journey_count |
|---|---|---|
| 0 | 1 | 2653 |
| 1 | 2 | 36617 |
| 2 | 3 | 17702 |
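
The two-step grouping above (count events per journey, then count journeys per event count) amounts to a "count of counts". The sketch below reproduces the same idea on a toy frame; the `toy` data and the variable names are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy impression log (hypothetical data): one row per event, keyed by journey ID.
toy = pd.DataFrame({
    "jid": ["a", "a", "b", "c", "c", "c", "d", "d"],
    "timestamp": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Step 1: number of events in each journey.
events_per_jid = toy.groupby("jid").size()

# Step 2: number of journeys for each event count.
event_journey = (
    events_per_jid.value_counts()
    .sort_index()
    .rename_axis("event_count")
    .reset_index(name="journey_count")
)
print(event_journey)
```

Using `value_counts` on the per-journey sizes yields the same two columns as the cell above without building the intermediate DataFrame by hand.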


### How many distinct campaigns each journey sees

In [7]:
campaign_per_jid = df.groupby('jid')['campaign'].nunique().reset_index()

jid_cnt_per_campaign = campaign_per_jid.groupby('campaign')['jid'].count()

campaign_journey = pd.DataFrame({
    'campaign_count': jid_cnt_per_campaign.index, 'journey_count': jid_cnt_per_campaign.values
})
campaign_journey.head(3)

| | campaign_count | journey_count |
|---|---|---|
| 0 | 1 | 89749 |
| 1 | 2 | 5120 |
| 2 | 3 | 472 |


---

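### Summary
- Filtering to the top 50 campaigns reduced the data from 602,160 rows to 451,364 rows.
- The columns shown by `df.info()` contain no null values. Note that non-converting rows use -1 as a placeholder (for example in `conversion_id`).
- By impression count: 2,653 journeys contain one impression, 36,617 contain two, and 17,702 contain three.
- By distinct campaigns: 89,749 journeys were exposed to a single campaign, 5,120 to two, and 472 to three.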

---


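# Optimizing Campaign Budget Allocation with Logistic Regression
- Train a model that predicts conversion, then extract from its learned weight vector only the coefficients that correspond to each campaign.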
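
As a rough illustration of this idea, the sketch below fits scikit-learn's `LogisticRegression` on synthetic data and slices out the campaign coefficients. Everything in it is assumed for the example: the synthetic data, the names `n_campaigns`, `camp_weights` and `budget_share`, and in particular the final "clip negatives and normalize" allocation rule, which is one possible choice rather than the method used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n_campaigns = 3   # hypothetical; the notebook keeps 50 campaigns
n_other = 2       # stand-ins for the non-campaign features

# Synthetic journey-level features: campaign indicators first, then the rest.
camp = rng.integers(0, 2, size=(500, n_campaigns)).astype(float)
other = rng.normal(size=(500, n_other))
X_toy = np.hstack([camp, other])

# Synthetic labels in which campaign 0 helps conversion the most.
logits = 2.0 * camp[:, 0] + 0.5 * camp[:, 1] - 1.0
y_toy = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Keep only the coefficients that belong to the campaign columns.
camp_weights = model.coef_[0, :n_campaigns]

# Assumed allocation rule: clip negative weights to zero, then normalize.
positive = np.clip(camp_weights, 0.0, None)
budget_share = positive / positive.sum()
print(budget_share)
```

The slice works only because the campaign indicators occupy the first columns of the feature matrix, which is also how the features are ordered in the preprocessing below.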

## Data preprocessing
- Convert the data from one row per event (impression) to one row per journey.

In [8]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

#### Scaling the timestamp

In [9]:
mm_scaler = MinMaxScaler()

df['timestamp_norm'] = mm_scaler.fit_transform(df[['timestamp']])
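
For reference, `MinMaxScaler` with its default range maps each value to `(x - min) / (max - min)`, so the earliest timestamp becomes 0 and the latest becomes 1. A small check on made-up timestamps (the `ts` array is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical timestamps, shaped as a single feature column.
ts = np.array([[0.0], [50.0], [200.0]])

scaled = MinMaxScaler().fit_transform(ts)

# Same result by hand: (x - min) / (max - min).
manual = (ts - ts.min()) / (ts.max() - ts.min())
print(scaled.ravel(), manual.ravel())
```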

#### One-hot encoding the categorical variables

In [10]:
# One-hot encode the campaign variable
camp_encoder = OneHotEncoder(drop='if_binary').fit(df.campaign.values.reshape(-1, 1))

camp_matrix = camp_encoder.transform(df.campaign.values.reshape(-1, 1))

df['campaigns'] = [list(v) for v in camp_matrix.toarray()]

In [11]:
# One-hot encode the category variables
# (cat7 is not in the list, presumably because of its high cardinality)
cate_cols = ['cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat8', 'cat9']
cate_encoder = OneHotEncoder(drop='if_binary').fit(df[cate_cols])

cate_matrix = cate_encoder.transform(df[cate_cols])
df['cates'] = [list(v) for v in cate_matrix.toarray()]
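
To make the shape of the encoded vectors concrete, here is a small sketch of how `OneHotEncoder(drop='if_binary')` behaves when several columns are encoded at once: each column contributes one indicator per level, except that a column with exactly two levels is reduced to a single indicator. The `cats` array is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category IDs: column 0 has three levels, column 1 has two.
cats = np.array([
    [10, 7],
    [20, 7],
    [30, 9],
    [10, 9],
])

encoder = OneHotEncoder(drop="if_binary").fit(cats)
encoded = encoder.transform(cats).toarray()

# The three-level column keeps 3 indicators; the binary column keeps 1.
print(encoded.shape)
print(encoder.categories_)
```

The width of the `cates` vector in the notebook is therefore determined by the number of distinct levels in each of the eight category columns after filtering.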

In [12]:
df.head(2)

| | timestamp | uid | campaign | conversion | conversion_timestamp | conversion_id | attribution | click | click_pos | click_nb | ... | cat5 | cat6 | cat7 | cat8 | cat9 | jid | camp_cnt | timestamp_norm | campaigns | cates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 209415 | 21670661 | 10474106 | 1 | 1420115 | 17559949 | 0 | 1 | 4 | 8 | ... | 11409684 | 30763035 | 8549927 | 29196072 | 29520626 | 21670661_17559949 | 7890 | 0.078397 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... |
| 1 | 1238119 | 3661462 | 10474106 | 1 | 1672978 | 24286866 | 0 | 1 | 3 | 5 | ... | 11409684 | 28928366 | 8549927 | 29196072 | 29520629 | 3661462_24286866 | 7890 | 0.46351 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ... |


In [15]:
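def element_wise_max(series):
    # Column-wise maximum of the per-event one-hot vectors, so the result
    # flags every campaign (or category) seen anywhere in the journey.
    return np.max(series.tolist(), axis=0).tolist()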

aggregation = {
    'campaigns': element_wise_max,
    'cates': element_wise_max,
    'click': 'sum',
    'cost': 'sum',
    'conversion': 'max'
}
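
The effect of this aggregation is easiest to see on a tiny example. The sketch below applies the same `element_wise_max` function to a made-up frame with two journeys; the `toy` data is hypothetical and uses three campaign slots instead of 50.

```python
import numpy as np
import pandas as pd

def element_wise_max(series):
    # Stack the per-event vectors and take the column-wise maximum.
    return np.max(series.tolist(), axis=0).tolist()

# Hypothetical journeys: "j1" saw campaigns 0 and 2, "j2" saw campaign 1 twice.
toy = pd.DataFrame({
    "jid": ["j1", "j1", "j2", "j2"],
    "campaigns": [
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0],
    ],
    "click": [1, 0, 1, 1],
    "conversion": [0, 1, 0, 0],
})

per_journey = toy.groupby("jid").agg(
    {"campaigns": element_wise_max, "click": "sum", "conversion": "max"}
)
print(per_journey)
```

Each journey ends up with a multi-hot campaign vector, its total clicks, and a flag that is 1 if any of its events converted.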

In [16]:
agg_df = df.groupby('jid').agg(aggregation)

agg_df['features'] = agg_df[['campaigns', 'cates', 'click', 'cost']].values.tolist()
agg_df['features'].head(2)

jid
10000148_-1        [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
1000023_5282678    [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
Name: features, dtype: object

In [17]:
agg_df = agg_df[['features', 'conversion']]

print(agg_df.shape)
agg_df.head(2)

(95388, 2)


| jid | features | conversion |
|---|---|---|
| 10000148_-1 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... | 0 |
| 1000023_5282678 | [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... | 1 |


#### Saving the dataset
- Save as a pickle (.pkl) file

In [18]:
import pickle

with open('data/campaign_50.pkl', 'wb') as f:
    pickle.dump(agg_df, f)

---

### Loading the data

In [19]:
with open('data/campaign_50.pkl', 'rb') as f:
    agg_df = pickle.load(f)

In [20]:
X = np.stack(agg_df['features'].map(lambda x: np.hstack(x)).values)
y = agg_df['conversion'].values

print(X.shape, y.shape)

(95388, 406) (95388,)
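
Each `features` entry is a nested list of the form `[campaign vector, category vector, clicks, cost]`, so `np.hstack` flattens it into a single row and `np.stack` assembles the rows into a matrix. With 50 campaign indicators and 2 numeric columns, the 406 columns above imply 354 category indicators. A toy version of the same flattening, with invented values:

```python
import numpy as np

# Hypothetical journey-level rows: [campaign vector, category vector, clicks, cost].
features = [
    [[1.0, 0.0, 1.0], [0.0, 1.0], 2, 0.35],
    [[0.0, 1.0, 0.0], [1.0, 0.0], 0, 0.10],
]

# np.hstack flattens each row into one 1-D vector; np.stack builds the matrix.
X_toy = np.stack([np.hstack(row) for row in features])

# 3 campaign + 2 category + 2 numeric columns per row.
print(X_toy.shape)
```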


### Splitting the dataset

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
# Train / Test Dataset Split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.2)

# Train / Validation Dataset Split
train_x, valid_x, train_y, valid_y = train_test_split(train_x, train_y, test_size = 0.2)
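
Two successive 80/20 splits leave roughly 64% of the journeys for training, 16% for validation, and 20% for testing. The sketch below checks this on stand-in data; the `random_state` and `stratify` arguments are additions for reproducibility and class balance that the cell above does not use, and the data itself is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 journeys, 5 features, binary label.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 5))
y_toy = rng.integers(0, 2, size=1000)

# First split: hold out 20% as the test set.
train_x, test_x, train_y, test_y = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Second split: hold out 20% of the remainder as the validation set.
train_x, valid_x, train_y, valid_y = train_test_split(
    train_x, train_y, test_size=0.2, random_state=42, stratify=train_y
)

print(len(train_x), len(valid_x), len(test_x))
```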

---


## Building the Logistic Regression model