# The initial approach 
As a first step I will use simple heuristics to analyse the population and how different groups respond to different offers. 

The exploratory analysis already touched on some of the data distributions. Here I will look more closely at the behaviour of different groups. 

The matrices I built describe user behaviour. From those data alone we cannot determine whether a user will respond to a specific offer. We can, however, examine whether offers in general increase spending, and which groups respond best overall. 

To remove the effect of offer duration, I will normalise the spending data by the corresponding time values. 
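
One possible way to frame the question "do offers increase spending?" is to compare the spending rate inside offer windows with the rate outside them. The sketch below uses made-up numbers in a toy frame; the column names mirror the user table, while `rate_in`, `rate_out` and `uplift` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "spent_in_window": [40.0, 20.0, 10.0, 30.0],
    "time_in_window": [100.0, 100.0, 100.0, 200.0],
    "spent_no_window": [60.0, 30.0, 50.0, 20.0],
    "time_no_window": [600.0, 600.0, 500.0, 400.0],
})

# Spending per unit of time inside and outside offer windows.
toy["rate_in"] = toy["spent_in_window"] / toy["time_in_window"]
toy["rate_out"] = toy["spent_no_window"] / toy["time_no_window"]

# Positive uplift means the user spends faster while an offer is active.
toy["uplift"] = toy["rate_in"] - toy["rate_out"]

# The median is less sensitive to a few heavy spenders than the mean.
median_uplift = toy.groupby("gender")["uplift"].median()
print(median_uplift)
```

A per-group median of `uplift` close to zero would suggest that the group is not much affected by offers, under the assumption that the in-window and out-of-window periods are otherwise comparable.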

We'll start by investigating whether gender by itself influences the response to offers, then split the data into age groups. Later we'll combine the user data and the offer data to see which groups respond best to the different offers, and whether some groups are not really affected by any offer. 
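
As a rough sketch of what splitting into age groups could look like, `pd.cut` can assign each user to a bin. The bin edges and labels below are assumptions chosen for illustration, not the groups used in the analysis, and the data are a toy stand-in for the user table.

```python
import pandas as pd

# Toy stand-in for the user table.
toy_users = pd.DataFrame({
    "age": [19, 33, 40, 58, 73, 118],
    "gender": ["F", "M", "O", "F", "M", None],
})

# Hypothetical bin edges and labels, chosen only for illustration.
age_bins = [17, 30, 45, 60, 75, 120]
age_labels = ["18-30", "31-45", "46-60", "61-75", "76+"]

# Each bin includes its upper edge (right=True is the default).
toy_users["age_group"] = pd.cut(toy_users["age"], bins=age_bins, labels=age_labels)

# Number of users per age group.
group_sizes = toy_users.groupby("age_group", observed=False).size()
print(group_sizes)
```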

In [1]:
import pandas as pd
import numpy as np

In [2]:
offers = pd.read_pickle('offer_df.pkl')
users = pd.read_pickle('profile_expanded.pkl')

In [3]:
offers.head(2)

Unnamed: 0,offer_id,user_id,offer_type,difficulty,reward,start_time,duration,end_time,viewed,view_time,completed,complet_time,time_in_window,amount_in_window,type_bogo,type_discount,type_informational
0,2906b810c7d4411798c6938adc9daaa5,68be06ca386d4c31939f3a4f0e3dd783,discount,10,2,168,168,336,1,216.0,0,,121,0.0,0,1,0
1,0b1e1539f2cc45b7b9fa7c272da2e1d7,68be06ca386d4c31939f3a4f0e3dd783,discount,20,5,336,240,576,1,348.0,0,,229,10.52,0,1,0


In [4]:
users.head(2)

Unnamed: 0,age,became_member_on,gender,income,user_id,spent_total,spent_in_window,spent_no_window,spent_in_discount,spent_in_bogo,...,time_in_bogo,time_in_informational,view_ratio,completion_ratio,view_and_complete_ratio,num_offers_received,gender_F,gender_M,gender_O,gender_nan
0,33,2017-04-21,M,72000.0,0009655768c64bdeb2e877511632db8f,127.6,30.73,96.87,0.0,0.0,...,0.0,108.0,0.4,0.6,0.0,5,0,1,0,0
1,118,2018-04-25,,,00116118485d4dfda04fdbaba9a87b5c,4.09,0.0,4.09,0.0,0.0,...,138.0,0.0,1.0,0.0,0.0,2,0,0,0,1


The users dataframe has more columns than are useful for this analysis. The subset below is the part I consider useful for answering questions about the users. 

In [5]:
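times = ['time_in_window', 'time_no_window', 'time_in_discount', 'time_in_bogo', 'time_in_informational']
# Add 1 to every time column. These columns are the denominators of the
# normalisation further down, so the offset presumably avoids division by zero.
users.loc[:, times] = np.array(users.loc[:, times]) + 1
# Drop users with a negative time outside offer windows, if any.
users = users.drop(index=users.loc[users['time_no_window'] < 0, 'user_id'].index)
users.columns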


Index(['age', 'became_member_on', 'gender', 'income', 'user_id', 'spent_total',
       'spent_in_window', 'spent_no_window', 'spent_in_discount',
       'spent_in_bogo', 'spent_in_informational', 'time_in_window',
       'time_no_window', 'time_in_discount', 'time_in_bogo',
       'time_in_informational', 'view_ratio', 'completion_ratio',
       'view_and_complete_ratio', 'num_offers_received', 'gender_F',
       'gender_M', 'gender_O', 'gender_nan'],
      dtype='object')

In [6]:
subset = ['gender', 'age', 'income', 'spent_total', 'spent_in_window', 'spent_no_window', 'spent_in_discount',
       'spent_in_bogo', 'spent_in_informational', 'time_in_window',
       'time_no_window', 'time_in_discount', 'time_in_bogo',
       'time_in_informational', 'view_ratio', 'completion_ratio', 'view_and_complete_ratio']

In [7]:
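# Copy the subset so the assignment below does not write into a view of `users`.
users_sub = users.loc[:, subset].copy()
# Label users with unknown gender as 'N'.
users_sub['gender'] = users_sub['gender'].fillna('N')
users_sub.head(3)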

Unnamed: 0,gender,age,income,spent_total,spent_in_window,spent_no_window,spent_in_discount,spent_in_bogo,spent_in_informational,time_in_window,time_no_window,time_in_discount,time_in_bogo,time_in_informational,view_ratio,completion_ratio,view_and_complete_ratio
0,M,33,72000.0,127.6,30.73,96.87,0.0,0.0,30.73,109.0,607.0,1.0,1.0,109.0,0.4,0.6,0.0
1,N,118,,4.09,0.0,4.09,0.0,0.0,0.0,139.0,577.0,1.0,139.0,1.0,1.0,0.0,0.0
2,O,40,57000.0,79.46,33.98,45.48,33.98,22.05,0.0,355.0,361.0,211.0,61.0,145.0,1.0,0.6,0.6


In [8]:
users_sub.describe()

Unnamed: 0,age,income,spent_total,spent_in_window,spent_no_window,spent_in_discount,spent_in_bogo,spent_in_informational,time_in_window,time_no_window,time_in_discount,time_in_bogo,time_in_informational,view_ratio,completion_ratio,view_and_complete_ratio
count,16884.0,14816.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0,16884.0
mean,62.182007,65395.653348,105.136487,40.113361,65.023127,17.262521,17.869157,7.917868,228.34968,487.65032,104.982232,102.081734,35.178038,0.692291,0.443393,0.314992
std,26.460707,21598.750417,126.069911,59.780557,98.590737,40.037113,41.252933,26.202186,152.35947,152.35947,125.000428,103.014948,47.880489,0.252257,0.338742,0.28212
min,18.0,30000.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,45.0,49000.0,22.3,10.94,7.01,0.0,0.0,0.0,109.0,391.0,1.0,7.0,1.0,0.5,0.166667,0.0
50%,58.0,64000.0,70.285,26.8,32.335,9.205,9.88,0.0,205.0,511.0,55.0,85.0,1.0,0.75,0.5,0.25
75%,73.0,80000.0,149.465,53.3525,93.2525,23.38,24.46,3.4925,325.0,607.0,169.0,157.0,67.0,1.0,0.75,0.5
max,118.0,120000.0,1608.69,1073.51,1344.93,1307.2,1136.81,777.47,715.0,715.0,709.0,637.0,355.0,1.0,1.0,1.0


In [9]:
users.loc[users['time_no_window']<0,'user_id']

Series([], Name: user_id, dtype: object)

From the above summary we can observe a couple of important things. First, the min and max values hint at outliers or erroneous data; for example, the maximum age is 118. Most values look reasonable, however. 

Second, the mean and median differ considerably for some of the parameters. For spent_total, for instance, the mean is 105.14 while the median is 70.29. 
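
One way to probe a suspicious extreme such as an age of 118 is to check whether it coincides with other missing profile fields, as it does for the second user shown in `users.head(2)`. The sketch below does this on a toy frame. That 118 acts as a placeholder for a missing age is an assumption here, not something the data confirm on their own.

```python
import numpy as np
import pandas as pd

# Toy rows, modelled on the columns of the user table.
toy_users = pd.DataFrame({
    "age": [33, 118, 40, 59, 118],
    "gender": ["M", None, "O", "F", None],
    "income": [72000.0, np.nan, 57000.0, 90000.0, np.nan],
})

# Users whose age equals the suspicious maximum.
is_age_118 = toy_users["age"] == 118
# Users whose other profile fields are missing.
profile_missing = toy_users["gender"].isna() & toy_users["income"].isna()

n_age_118 = int(is_age_118.sum())
n_both = int((is_age_118 & profile_missing).sum())

# If the two counts match, every age-118 row also lacks gender and income.
print(n_age_118, n_both)
```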

In [10]:
idx = pd.IndexSlice
# Keep only the mean and median (50%) rows of the per-gender summary.
users_sub.groupby(['gender']).describe().transpose().loc[idx[:, ['mean', '50%']], :]

Unnamed: 0,gender,F,M,N,O
age,mean,57.545455,52.11077,118.0,54.400943
age,50%,58.0,53.0,118.0,55.0
income,mean,71301.452587,61179.780583,,63287.735849
income,50%,71000.0,59000.0,,62000.0
spent_total,mean,140.959832,99.667275,19.452384,124.323302
spent_total,50%,116.57,63.14,15.715,93.44
spent_in_window,mean,50.682853,38.670381,13.740397,49.604623
spent_in_window,50%,38.61,26.42,10.125,33.685
spent_no_window,mean,90.276979,60.996894,5.711987,74.718679
spent_no_window,50%,68.02,27.24,3.745,50.15


The table above shows the mean and the median (50%) for all users, split by gender. N is unknown gender and O is other. One thing to note about the mean and median: if they differ significantly, a few users have quite extreme values that pull the mean away from the median. 

In the above summary, the median is clearly lower than the mean for all of the spending parameters. Thus, a solid minority spends far more than the typical (median) user. The same is true for time_in_window, which describes the total time a user spends after viewing an offer. 
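
To make the mean-versus-median point concrete, here is a small sketch with invented spending values in which a single heavy spender pulls the mean far above the median. The numbers are not taken from the data set.

```python
import pandas as pd

# Invented values: five modest spenders and one heavy spender.
spend = pd.Series([5.0, 10.0, 12.0, 15.0, 20.0, 300.0], name="spent_total")

mean_spend = spend.mean()
median_spend = spend.median()

# A positive skewness indicates a long tail towards high values.
skewness = spend.skew()

# Without the largest value the mean drops close to the median.
mean_without_top = spend.drop(spend.idxmax()).mean()

print(mean_spend, median_spend, skewness, mean_without_top)
```

The median barely reacts to the one extreme value, which is why it is the more robust description of the "middle spender".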

However, let's drill a bit further. The raw spending amount is not that interesting by itself. Users receive different types of offers, and some offers are more common or last longer, which affects how much time a user has to spend money within a specific offer. Thus, I will normalise spending by dividing by the respective time in window. 
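
A small worked example of the idea, on toy numbers: two users who spend the same amount but in windows of different length end up with different rates. The sketch also contrasts the +1 offset applied to the time columns earlier in this notebook with an alternative that masks zero times as NaN. The alternative and the names `rate_offset` and `rate_masked` are assumptions for illustration, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data: same spend in windows of different length, plus a user with no window time.
toy = pd.DataFrame({
    "spent_in_window": [30.0, 30.0, 0.0],
    "time_in_window": [100.0, 300.0, 0.0],
})

# Offset variant: shift time by 1 so the denominator is never zero.
toy["rate_offset"] = toy["spent_in_window"] / (toy["time_in_window"] + 1)

# Masking variant (assumed alternative): zero time becomes NaN, so the rate is NaN too.
safe_time = toy["time_in_window"].replace(0, np.nan)
toy["rate_masked"] = toy["spent_in_window"] / safe_time

print(toy)
```

The offset slightly biases every rate downwards but keeps all values finite; masking keeps the rates exact but leaves missing values that have to be handled later.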

In [11]:
users['spent_in_window_norm'] = users.loc[:,'spent_in_window']/users['time_in_window']
users['spent_no_window_norm'] = users.loc[:,'spent_no_window']/users['time_no_window']
users['spent_in_discount_norm'] = users.loc[:,'spent_in_discount']/users['time_in_discount']
users['spent_in_bogo_norm'] = users.loc[:,'spent_in_bogo']/users['time_in_bogo']
users['spent_in_informational_norm'] = users.loc[:,'spent_in_informational']/users['time_in_informational']
users

Unnamed: 0,age,became_member_on,gender,income,user_id,spent_total,spent_in_window,spent_no_window,spent_in_discount,spent_in_bogo,...,num_offers_received,gender_F,gender_M,gender_O,gender_nan,spent_in_window_norm,spent_no_window_norm,spent_in_discount_norm,spent_in_bogo_norm,spent_in_informational_norm
0,33,2017-04-21,M,72000.0,0009655768c64bdeb2e877511632db8f,127.60,30.73,96.87,0.00,0.00,...,5,0,1,0,0,0.281927,0.159588,0.000000,0.000000,0.281927
1,118,2018-04-25,,,00116118485d4dfda04fdbaba9a87b5c,4.09,0.00,4.09,0.00,0.00,...,2,0,0,0,1,0.000000,0.007088,0.000000,0.000000,0.000000
2,40,2018-01-09,O,57000.0,0011e0d4e6b944f998e987f904e8c1e5,79.46,33.98,45.48,33.98,22.05,...,5,0,0,1,0,0.095718,0.125983,0.161043,0.361475,0.000000
3,59,2016-03-04,F,90000.0,0020c2b971eb4e9188eac86d93036a77,196.86,34.87,161.99,17.63,17.24,...,5,1,0,0,0,0.274567,0.275025,0.410000,0.202824,0.000000
4,24,2016-11-11,F,60000.0,0020ccbbb6d84e358d3414a3ff76cffd,154.05,95.37,58.68,11.65,24.85,...,4,1,0,0,0,0.544971,0.108466,0.211818,0.507143,0.806438
5,26,2017-06-21,F,73000.0,003d66b6608740288d6cc97a6903f4f0,48.34,30.92,17.42,22.47,0.00,...,5,1,0,0,0,0.128299,0.036674,0.132959,0.000000,0.129794
6,19,2016-08-09,F,65000.0,00426fe3ffde4c6b9cb9ad6d077a13ea,68.51,49.26,19.25,23.34,0.00,...,5,1,0,0,0,0.451927,0.031713,0.319726,0.000000,0.700541
7,55,2018-05-08,F,74000.0,004b041fbfe44859945daa2c7f79ee64,138.36,47.85,90.51,19.93,27.92,...,3,1,0,0,0,0.293558,0.163671,0.143381,1.116800,0.000000
8,54,2016-03-31,M,99000.0,004c5799adbf42868b9cff0396190900,347.38,101.94,245.44,43.21,58.73,...,5,0,1,0,0,0.886435,0.408386,0.881837,0.876567,0.000000
9,56,2017-12-09,M,47000.0,005500a7188546ff8a767329a2f7c76a,20.36,20.36,0.00,0.00,20.36,...,5,0,1,0,0,0.047681,0.000000,0.000000,0.047681,0.000000


In [13]:
norm_cols = ['spent_in_window_norm', 'spent_no_window_norm', 'spent_in_discount_norm',
             'spent_in_bogo_norm', 'spent_in_informational_norm']
# Fill from `users` itself: `users_sub` does not contain the normalised columns yet,
# so reading them from `users_sub` raises a KeyError.
users.loc[:, norm_cols] = users.loc[:, norm_cols].fillna(0)

In [None]:
users.loc[users['spent_in_window_norm']==np.inf,['time_in_window', 'spent_in_window']]
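
The same kind of check can be run over several columns at once. Below is a minimal sketch on a toy frame with invented values; it counts infinite and missing entries per column and then replaces both with 0. Whether 0 is the right fill value is an assumption that depends on how the rates are used afterwards.

```python
import numpy as np
import pandas as pd

# Toy frame with one infinite and one missing value.
toy = pd.DataFrame({
    "spent_in_window_norm": [0.28, np.inf, 0.10],
    "spent_no_window_norm": [0.16, 0.01, np.nan],
})
norm_cols = ["spent_in_window_norm", "spent_no_window_norm"]

# np.isinf flags both +inf and -inf; isna() does not treat inf as missing.
inf_counts = np.isinf(toy[norm_cols]).sum()
nan_counts = toy[norm_cols].isna().sum()

# Turn infinities into NaN first, then fill every gap with 0.
cleaned = toy[norm_cols].replace([np.inf, -np.inf], np.nan).fillna(0)

print(inf_counts, nan_counts, cleaned, sep="\n")
```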

In [None]:

users.head()

In [None]:
subset2 = ['gender', 'age', 'income', 'spent_in_window_norm', 'spent_no_window_norm', 'spent_in_discount_norm',
           'spent_in_bogo_norm', 'spent_in_informational_norm', 'view_ratio', 'completion_ratio',
           'view_and_complete_ratio']

In [None]:
users_sub = users.loc[:,subset2]
users_sub.head()

With this new subset of the user matrix I would like to compute some statistics. First, let's check that the values make sense. 

In [None]:
users_sub.describe()