In [1]:
import pandas as pd
import numpy as np

In [2]:
# load data
path = "./data/"
donation = pd.read_csv(path+"donations.csv")
email = pd.read_csv(path+"emails.csv")
year_joined = pd.read_csv(path+"year_joined.csv")

email["emailsOpened"] = email["emailsOpened"].astype("int")
email["user"] = email["user"].astype("int")
donation["user"] = donation["user"].astype("int")
email["week"] = pd.to_datetime(email['week'], format='%Y-%m-%d %H:%M:%S')
donation["timestamp"] = pd.to_datetime(donation['timestamp'], format='%Y-%m-%d %H:%M:%S')
year_joined["yearJoined"] = pd.to_datetime(year_joined['yearJoined'], format='%Y')

In [3]:
year_joined.head(3)

Unnamed: 0,user,userStats,yearJoined
0,0,silver,2014-01-01
1,1,silver,2015-01-01
2,2,silver,2016-01-01


In [4]:
email.head(3)

Unnamed: 0,emailsOpened,user,week
0,3,1,2015-06-29
1,2,1,2015-07-13
2,2,1,2015-07-20


In [5]:
donation.head(3)

Unnamed: 0,amount,timestamp,user
0,25.0,2017-11-12 11:13:44,0
1,50.0,2015-08-25 19:01:45,0
2,25.0,2015-03-26 12:03:47,0


In [6]:
# Look for users who hold more than one membership status
# year_joined.groupby("user").count() : group by user ID and count the rows for each user
# *.groupby("userStats").count() : group by that per-user row count and count how many users have it
year_joined.groupby("user").count().groupby("userStats").count()

userStats,yearJoined
1,1000


Adding a row with the following code and re-running the query shows how a user with more than one membership status would be detected:

```
year_joined.loc[1000] = [0,"gold",2021]
```

```
year_joined.groupby("user").count().groupby("userStats").count()

userStats yearJoined	
1	999
2	1
```
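
The same check can be sketched on toy data. The three-row frame below is made up for illustration and is not taken from `year_joined.csv`; it counts rows per user, then counts how many users share each row count:

```python
import pandas as pd

# Hypothetical toy data: user 0 appears twice with different statuses
toy = pd.DataFrame({
    "user": [0, 1, 0],
    "userStats": ["silver", "silver", "gold"],
    "yearJoined": [2014, 2015, 2021],
})

# Rows per user, then the number of users having each row count
rows_per_user = toy.groupby("user")["userStats"].count()
users_per_count = rows_per_user.value_counts().sort_index()
print(users_per_count)

# Users with more than one membership row
multi = rows_per_user[rows_per_user > 1].index.tolist()
print(multi)
```

Listing the offending user IDs directly, as `multi` does here, is often more convenient than reading the count-of-counts table.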

In [7]:
# Are there any rows in which a user opened no emails at all?
email[email.emailsOpened<1]

Unnamed: 0,emailsOpened,user,week


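The result is empty, so either weeks in which a user opened no email are simply not recorded, or every user opened at least one email in every week.  
Does every user really open at least one email per week? That hypothesis seems unlikely.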

In [8]:
# Look at the records of one particular user
email[email.user==998]

Unnamed: 0,emailsOpened,user,week
25464,1,998,2017-12-04
25465,3,998,2017-12-11
25466,3,998,2017-12-18
25467,3,998,2018-01-01
25468,3,998,2018-01-08
25469,2,998,2018-01-15
25470,3,998,2018-01-22
25471,2,998,2018-01-29
25472,3,998,2018-02-05
25473,3,998,2018-02-12


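There is no record around the end of the year (the week of 2017-12-25 is missing), which suggests that weeks in which no email was opened are not recorded.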

In [9]:
# Compute the length of user 998's membership in weeks
(max(email[email.user==998].week) - min(email[email.user==998].week)).days/7

25.0

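This calculation returns 25, but the number of weeks is 26. For example, given data for April 7, 14, 21, and 28, (latest week - earliest week) / 7 is 3, yet there are 4 weeks. We therefore have to add 1.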
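
The off-by-one described above can be checked with a small sketch. The four dates are the ones from the example; the year is an arbitrary assumption:

```python
import pandas as pd

# The four weekly dates from the example (the year 2014 is an assumption)
weeks = pd.to_datetime(["2014-04-07", "2014-04-14", "2014-04-21", "2014-04-28"])

# Distance between the first and last week, in weeks
span_in_weeks = (weeks.max() - weeks.min()).days / 7

# Inclusive count of weekly slots: add 1 to the distance
n_weeks = int(span_in_weeks) + 1
print(span_in_weeks, n_weeks)
```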

In [10]:
# Check how many weekly rows the data actually contains
email[email.user==998].shape

(24, 3)

There should be 26 weeks, but only 24 rows exist, so some weeks are missing.

In [11]:
# Build an index containing every (week, user) combination
complete_idx = pd.MultiIndex.from_product((set(email.week),set(email.user)))

In [12]:
# Fill in the missing weeks, with 0 opened emails
all_email = email.set_index(["week","user"]).reindex(complete_idx,fill_value=0).reset_index()
all_email.columns = ["week","user","emailsOpened"]
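
To see what the reindexing does, here is a minimal sketch on made-up data (the frame `toy` is hypothetical). It assumes, as above, that a missing (week, user) pair means zero opened emails:

```python
import pandas as pd

# Hypothetical toy data: user 1 has no row for the second week
toy = pd.DataFrame({
    "week": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-01"]),
    "user": [1, 2, 2],
    "emailsOpened": [3, 1, 2],
})

# Every (week, user) pair, sorted so the row order is deterministic
full_idx = pd.MultiIndex.from_product(
    [sorted(set(toy.week)), sorted(set(toy.user))], names=["week", "user"]
)

# Pairs absent from the data are created with emailsOpened = 0
filled = (
    toy.set_index(["week", "user"])
       .reindex(full_idx, fill_value=0)
       .reset_index()
)
print(filled)
```

Naming the index levels in `from_product` means `reset_index()` restores the `week` and `user` column names, so they do not need to be assigned by hand afterwards.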

In [13]:
all_email[all_email.user==998].sort_values("week")

Unnamed: 0,week,user,emailsOpened
71147,2015-02-09,998,0
43119,2015-02-16,998,0
44736,2015-02-23,998,0
70608,2015-03-02,998,0
63601,2015-03-09,998,0
...,...,...,...
30183,2018-04-30,998,3
86239,2018-05-07,998,3
13474,2018-05-14,998,3
93246,2018-05-21,998,3


Until a user opens an email for the first time, they were probably not yet a member. We therefore cut the rows that precede each user's first opened email.
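
As a rough sketch of this trimming on made-up data (the frame `toy` and user 7 are hypothetical), the per-user bounds can be joined back onto the rows and used as a filter. This version also drops rows after the last opened email:

```python
import pandas as pd

# Hypothetical zero-filled data: user 7 first opens an email on 2018-01-08
toy = pd.DataFrame({
    "week": pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-15", "2018-01-22"]),
    "user": [7, 7, 7, 7],
    "emailsOpened": [0, 2, 0, 1],
})

# First and last week with at least one opened email, per user
opened = toy[toy.emailsOpened > 0]
bounds = opened.groupby("user").week.agg(["min", "max"])

# Attach each user's bounds to every row, then keep only the weeks inside them
with_bounds = toy.join(bounds, on="user")
inside = (with_bounds.week >= with_bounds["min"]) & (with_bounds.week <= with_bounds["max"])
trimmed = toy[inside]
print(trimmed)
```

The zero-count week between the first and last opened email is kept, because it lies inside the membership period.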

In [14]:
# Extract the first and last week in which each user opened an email
# agg : applies several aggregations to the groupby at once

cutoff_dates = email.groupby("user").week.agg(["min","max"]).reset_index()
cutoff_dates

Unnamed: 0,user,min,max
0,1,2015-06-29,2018-05-28
1,3,2018-03-05,2018-04-23
2,5,2017-06-05,2018-05-28
3,6,2016-12-05,2018-05-28
4,9,2016-07-18,2018-05-28
...,...,...,...
534,991,2016-10-24,2016-10-24
535,992,2015-02-09,2015-07-06
536,993,2017-09-11,2018-05-28
537,995,2016-09-05,2018-05-28


In [15]:
# Drop the rows before the first nonzero count and after the last nonzero count

for _,row in cutoff_dates.iterrows():
    user=row["user"]
    start_date = row["min"]
    end_date = row["max"]
    # a single combined mask avoids the reindexing warning caused by chained boolean indexing
    all_email.drop(all_email[(all_email.user==user) & (all_email.week<start_date)].index, inplace=True)
    all_email.drop(all_email[(all_email.user==user) & (all_email.week>end_date)].index, inplace=True)


In [16]:
all_email[all_email.user==998]

Unnamed: 0,week,user,emailsOpened
1077,2018-01-01,998,3
1616,2018-03-19,998,2
5389,2018-02-05,998,3
8084,2018-04-02,998,3
11857,2017-12-04,998,1
13474,2018-05-14,998,3
14013,2018-03-12,998,3
18864,2018-03-26,998,2
30183,2018-04-30,998,3
42580,2018-05-28,998,3


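Next we consider the relationship between the email data and the donation data. To do so, we convert the donation data into a weekly time series.  
Syntax of a lambda expression:
```
lambda arguments : return value
```

resample : changes the sampling frequency of time series data  
W-MON : weekly frequency anchored on Monday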
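
A minimal sketch of weekly resampling on made-up donations follows. The frame `toy` is hypothetical; its first timestamp is borrowed from a row shown in `donation.head(3)`, the second is invented:

```python
import pandas as pd

# Hypothetical donations by a single user, indexed by timestamp
toy = pd.DataFrame(
    {"amount": [25.0, 50.0]},
    index=pd.to_datetime(["2015-03-26 12:03:47", "2015-04-14 09:00:00"]),
)

# Weekly bins that end on, and are labeled by, a Monday; empty weeks sum to 0
weekly = toy.amount.resample("W-MON").sum()
print(weekly)

# The same operation written as a lambda, in the form passed to groupby(...).apply(...)
to_weekly = lambda df: df.amount.resample("W-MON").sum()
```

The donation made on Thursday 2015-03-26 is reported under Monday 2015-03-30, the end of its weekly bin.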

In [17]:
donation.set_index("timestamp",inplace=True)
agg_donation = donation.groupby("user").apply(lambda df:df.amount.resample("W-MON").sum().dropna())

In [18]:
agg_donation

user  timestamp 
0     2015-03-30      25.0
      2015-04-06       0.0
      2015-04-13       0.0
      2015-04-20       0.0
      2015-04-27       0.0
                     ...  
995   2017-09-11       0.0
      2017-09-18       0.0
      2017-09-25       0.0
      2017-10-02    1000.0
998   2018-01-08      50.0
Name: amount, Length: 32352, dtype: float64

The code printed in the book does not run as is, so I modified it.  
Reference: https://www.oreilly.com/catalog/errataunconfirmed.csp?isbn=0636920187714

Code that did not work:
```
for user,user_email in all_email.groupby("user"):
    user_donations = agg_donation[agg_donations.user==user]} 
    
    user_donations.set_index("timestamp",inplace=True)
    user_email.set_index("week",inplace=True)
    
    user_email = all_email[all_email.user==user]
    user_email = user_email.sort_values("week").set_index("week")
    
    df = pd.merge(user_email,user_donations,how="left",left_index=True,right_index=True)
    
    df = df.fillna(0)

    df["user"] = df.user_x
    merge_df =merge_df.append(df.reset_index()[["user","week","emailsOpened","amount"]])
```

In [19]:
# Merge the email data with the donation data
merge_df = pd.DataFrame()
for user,user_email in all_email.groupby("user"): # for each user
    
    user_donation = agg_donation[agg_donation.index.get_level_values('user')==user] # get this user's donations
    
    user_donation = user_donation.reset_index(level="user")
    #user_donation.set_index("timestamp",inplace=True)
    #user_email.set_index("week",inplace=True)
    
    #user_email = all_email[all_email.user==user] # get this user's emails
    user_email = user_email.sort_values("week").set_index("week")
    
    df = pd.merge(user_email,user_donation,how="left",left_index=True,right_index=True)
    
    df = df.fillna(0)
    # for debugging
    # print(user_email.loc["2016-05-09"])
    # print(user_donation.loc["2016-05-09"])
    # print(df.loc["2016-05-09"])
    
    df["user"] = df.user_x
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    merge_df = pd.concat([merge_df, df.reset_index()[["user","week","emailsOpened","amount"]]])
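
The core of the loop body is a left join on the weekly index. A small sketch with made-up frames (`emails` and `donations` are hypothetical) shows how weeks without a donation end up with an amount of 0:

```python
import pandas as pd

# Hypothetical weekly email counts and donations for one user, both indexed by week
emails = pd.DataFrame(
    {"emailsOpened": [3, 0, 2]},
    index=pd.to_datetime(["2018-01-01", "2018-01-08", "2018-01-15"]),
)
donations = pd.DataFrame(
    {"amount": [50.0]},
    index=pd.to_datetime(["2018-01-08"]),
)

# A left join keeps every email week; weeks without a donation get NaN, then 0
merged = pd.merge(
    emails, donations, how="left", left_index=True, right_index=True
).fillna(0)
print(merged)
```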

In [20]:
merge_df

Unnamed: 0,user,week,emailsOpened,amount
0,1,2015-06-29,3,0.0
1,1,2015-07-06,0,0.0
2,1,2015-07-13,2,0.0
3,1,2015-07-20,2,0.0
4,1,2015-07-27,3,0.0
...,...,...,...,...
21,998,2018-04-30,3,0.0
22,998,2018-05-07,3,0.0
23,998,2018-05-14,3,0.0
24,998,2018-05-21,3,0.0


To build a model that predicts whether a user will donate from their response to emails, we set up the target.  
shift(.) : generates a lagged time series.
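
The effect of `shift` can be seen in a short sketch on a made-up series (the values are hypothetical):

```python
import pandas as pd

# Hypothetical weekly donation amounts
amount = pd.Series([0.0, 50.0, 0.0, 25.0])

lagged = amount.shift(1)   # value from the previous row; the first row becomes NaN
lead = amount.shift(-1)    # value from the next row; the last row becomes NaN
print(pd.DataFrame({"amount": amount, "shift(1)": lagged, "shift(-1)": lead}))
```

With `shift(1)`, each row holds the previous week's amount, and the first row has no predecessor, which is why a fill value is needed afterwards.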

In [21]:
df = merge_df[merge_df.user==998].copy() # work on a copy so that adding a column does not trigger SettingWithCopyWarning
df["target"] = df.amount.shift(1)
df = df.fillna(1)
df


Unnamed: 0,user,week,emailsOpened,amount,target
0,998,2017-12-04,1,0.0,1.0
1,998,2017-12-11,3,0.0,0.0
2,998,2017-12-18,3,0.0,0.0
3,998,2017-12-25,0,0.0,0.0
4,998,2018-01-01,3,0.0,0.0
5,998,2018-01-08,3,50.0,0.0
6,998,2018-01-15,2,0.0,50.0
7,998,2018-01-22,3,0.0,0.0
8,998,2018-01-29,2,0.0,0.0
9,998,2018-02-05,3,0.0,0.0
