# Starbucks Customer Behaviour (Data Cleaning, Formatting & Processing)

In this notebook I will create the two input files needed for the modeling:
- The first will contain all of the user data needed for the K-means demographic clustering model. This should include information on transactions that have not been influenced by offers, and any information from the profile data about an individual user.
- The second input dataset should be each user's daily spend, with an indication of which offers could have influenced the user on any given day. This can then be combined with the above user data and the demographic each user belongs to in order to predict how each marketing campaign will influence user spend.

### Imports

In [22]:
# import general functions
import pandas as pd
import numpy as np
import json

### Functions

In [23]:
def clean_transcript_data(data):
    """
    this process cleans the values column and formats the transcript data
    """
    # creates a column for the type of interaction   
    data['interaction_value'] = [list(x.keys())[0] for x in data['value']]
    
    # creates a column related to the value amount or id    
    data['id'] = [list(x.values())[0] for x in data['value']]
    
    # drops the value column
    data = data.drop(columns=['value'])
    
    # cleans the interaction type column so offer id is consistent
    data['interaction_value'] = [x.replace('offer id','offer_id') for x in data['interaction_value']]
    
    # split out interaction_type
    temp_df = pd.get_dummies(data['interaction_value'])

    # combine the dataframes
    data = pd.concat([temp_df, data], axis=1, sort=True)
    
    # split out event
    temp_df = pd.get_dummies(data['event'])

    # combine the dataframes
    data = pd.concat([temp_df, data], axis=1, sort=True)

    # drop the original columns
    data = data.drop(columns=['interaction_value','event'])    
    
    return data # returns the clean transcript data


def clean_profile_data(data):
    """
    this process cleans the age, income and became_member_on columns in the profile data
    """
    # rename 'id' to 'person' and 'became_member_on' to 'member joined'
    # (renaming by name avoids relying on the column order returned by read_json)
    data = data.rename(columns={'id': 'person', 'became_member_on': 'member joined'})
    
    # replace 118 in the age column with a zero indicating no age 
    # it might be worth looking at this as a separate group of users later on
    data['age'] = data['age'].replace(118,0)

    # update the 'member joined' column to a datetime format
    data['member joined'] = pd.to_datetime(data['member joined'], format='%Y%m%d')
    
    # replace the NaN's in the income
    data['income'] = data['income'].fillna(0)
    
    # replace M, F, O and None types to get the 4 groups of customers
    data['gender'] = data['gender'].replace('M','male')
    data['gender'] = data['gender'].replace('F','female')
    data['gender'] = data['gender'].replace('O','other')
    data['gender'] = data['gender'].fillna('unknown gender')

    # split the column into separate columns
    temp_df = pd.get_dummies(data['gender'])

    # combine the dataframes
    data = pd.concat([temp_df, data], axis=1, sort=True)

    # drop the original column
    data = data.drop(columns=['gender'])

    return data

def clean_portfolio_data(data):
    """
    this process has been created to clean columns in the portfolio data
    """
    # splits the channels column into separate columns
    temp_list = []

    # loop through the rows and record each (row index, channel) pair as a dict
    for index, row in data.iterrows():
        for value in row['channels']:
            temp_list.append({'index': index, 'value': value})

    # change the list into a dataframe and count each channel per offer
    # (pd.DataFrame replaces DataFrame.append, which was removed in pandas 2.0)
    temp_df = pd.DataFrame(temp_list)
    temp_df = temp_df.groupby('index')['value'].value_counts()
    temp_df = temp_df.unstack(level=-1).fillna(0)
    
    # combine the dataframes
    data = pd.concat([temp_df, data], axis=1, sort=True)
    
    # split the column into separate columns
    temp_df = pd.get_dummies(data['offer_type'])

    # combine the dataframes
    data = pd.concat([temp_df, data], axis=1, sort=True)

    # drop the original columns
    data = data.drop(columns=['offer_type','channels'])
    
    return data

### Global Variables

In [24]:
# read in the different datasources
portfolio_df = pd.read_json('data/portfolio.json', lines=True)
profile_df = pd.read_json('data/profile.json', lines=True)
transcript_df = pd.read_json('data/transcript.json', lines=True)

### Initial Processing

Below I will run some initial processing of the three input datasets. Details on what these datasets contain can be found in the 'Customer Behaviour notebook', which was run as a prerequisite to this notebook.

The initial processing splits out any categorical columns into dummy columns of 1's and 0's depending on whether they are true or not. It also cleans and reformats the date columns, replaces NaN values with 0 and updates column names so they are consistent across the dataframes. The functions can be found above or in the .py file containing all functions that will be used in future notebooks.
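As a minimal sketch of the dummy-column step described above, the snippet below one-hot encodes a categorical column on a small made-up frame. The `toy` data and its values are purely illustrative and are not taken from the Starbucks files.

```python
import pandas as pd

# toy stand-in for a categorical column such as offer_type (values are illustrative)
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "offer_type": ["bogo", "discount", "bogo"],
})

# one indicator column per category; dtype=int gives 1/0 instead of True/False
dummies = pd.get_dummies(toy["offer_type"], dtype=int)

# place the indicator columns alongside the original data and drop the source column
encoded = pd.concat([dummies, toy], axis=1).drop(columns=["offer_type"])
print(encoded)
```

Passing `dtype=int` is a choice made for this sketch so that the flags print as 1's and 0's regardless of the pandas version.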

In [29]:
# run the initial cleaning on each dataset
clean_port_df = clean_portfolio_data(portfolio_df)
clean_port_df

Unnamed: 0,bogo,discount,informational,email,mobile,social,web,difficulty,duration,id,reward
0.0,1,0,0,1.0,1.0,1.0,0.0,10,7,ae264e3637204a6fb9bb56bc8210ddfd,10
1.0,1,0,0,1.0,1.0,1.0,1.0,10,5,4d5c57ea9a6940dd891ad53e9dbe8da0,10
2.0,0,0,1,1.0,1.0,0.0,1.0,0,4,3f207df678b143eea3cee63160fa8bed,0
3.0,1,0,0,1.0,1.0,0.0,1.0,5,7,9b98b8c7a33c4b65b9aebfe6a799e6d9,5
4.0,0,1,0,1.0,0.0,0.0,1.0,20,10,0b1e1539f2cc45b7b9fa7c272da2e1d7,5
5.0,0,1,0,1.0,1.0,1.0,1.0,7,7,2298d6c36e964ae4a3e7e9706d1fb8c2,3
6.0,0,1,0,1.0,1.0,1.0,1.0,10,10,fafdcd668e3743c1bb461111dcafc2a4,2
7.0,0,0,1,1.0,1.0,1.0,0.0,0,3,5a8bc65990b245e5a138643cd4eb9837,0
8.0,1,0,0,1.0,1.0,1.0,1.0,5,5,f19421c1d4aa40978ebb69ca19b0e20d,5
9.0,0,1,0,1.0,1.0,0.0,1.0,10,7,2906b810c7d4411798c6938adc9daaa5,2


In [26]:
clean_prof_df = clean_profile_data(profile_df)
clean_prof_df.head()

Unnamed: 0,female,male,other,unknown gender,age,member joined,person,income
0,0,0,0,1,0,2017-02-12,68be06ca386d4c31939f3a4f0e3dd783,0.0
1,1,0,0,0,55,2017-07-15,0610b486422d4921ae7d2bf64640c50b,112000.0
2,0,0,0,1,0,2018-07-12,38fe809add3b4fcf9315a9694bb96ff5,0.0
3,1,0,0,0,75,2017-05-09,78afa995795e4d85b5d9ceeca43f5fef,100000.0
4,0,0,0,1,0,2017-08-04,a03223e636434f42ac4c3df47e8bac43,0.0


In [27]:
clean_trans_df = clean_transcript_data(transcript_df)
clean_trans_df.head()

Unnamed: 0,offer completed,offer received,offer viewed,transaction,amount,offer_id,person,time,id
0,0,1,0,0,0,1,78afa995795e4d85b5d9ceeca43f5fef,0,9b98b8c7a33c4b65b9aebfe6a799e6d9
1,0,1,0,0,0,1,a03223e636434f42ac4c3df47e8bac43,0,0b1e1539f2cc45b7b9fa7c272da2e1d7
2,0,1,0,0,0,1,e2127556f4f64592b11af22de27a7932,0,2906b810c7d4411798c6938adc9daaa5
3,0,1,0,0,0,1,8ec6ce2a7e7949b1bf142def7d0e0586,0,fafdcd668e3743c1bb461111dcafc2a4
4,0,1,0,0,0,1,68617ca6246f4fbc85e91a2a49552598,0,4d5c57ea9a6940dd891ad53e9dbe8da0


Now that some initial cleaning has been performed we can look further into what inputs we will need for future analysis.

### Offer Influence

The first thing we need to decide is how to determine when an offer has influenced a purchase. This will be useful when looking at what users normally spend for the demographic split and how often they spend money. There are three types of offer, each of which needs to be handled slightly differently:

 - BOGO (Buy One Get One Free): We consider a user to have been influenced by this offer when they have viewed the offer and then completed it within the offer period.

 - Discount: Again, we consider a user to have been influenced if they have viewed the offer and completed it within the offer period. We also treat any purchases between viewing and completing the offer as influenced, since it may be an accumulation offer (e.g. spend more than 10 dollars within a certain period).

 - Informational: These are the most difficult to assess, as there is no completion event to measure their influence. For this offer type we'll assume that a user was under the influence of the offer for its duration after viewing it. Any purchase in that period will be treated as influenced.

I'm going to assume that if a BOGO or discount offer was not completed then the user was not influenced by it. A user could have tried to complete a discount offer and failed, but this is difficult to tell from the data provided here.
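The three rules can be sketched as a single check on one offer and one transaction. This is only an illustration: `is_influenced` and its parameters are hypothetical names, all times are assumed to be in hours, and I assume that for BOGO only the purchase that completes the offer is attributed to it.

```python
from typing import Optional

def is_influenced(offer_type: str,
                  time_received: int,
                  time_viewed: Optional[int],
                  time_completed: Optional[int],
                  duration_hours: int,
                  transaction_time: int) -> bool:
    """Hypothetical helper: applies the three rules above to one transaction."""
    # an offer that was never viewed cannot have influenced anything
    if time_viewed is None:
        return False

    if offer_type == "informational":
        # influenced for `duration_hours` after the offer was viewed
        return time_viewed <= transaction_time <= time_viewed + duration_hours

    # BOGO and discount offers must be completed after viewing, within the offer period
    if time_completed is None or time_completed < time_viewed:
        return False
    if time_completed - time_received > duration_hours:
        return False

    if offer_type == "bogo":
        # assumption: only the purchase that completed the offer counts
        return transaction_time == time_completed
    if offer_type == "discount":
        # every purchase between viewing and completing counts
        return time_viewed <= transaction_time <= time_completed
    return False

# made-up example: discount received at hour 336, viewed at 342, completed at 354
print(is_influenced("discount", 336, 342, 354, 168, 350))
```

The functions later in this notebook apply the same kind of conditions with dataframe filters rather than row by row.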

For the first step I will therefore split out the transactions from the offer data:

In [12]:
def transactions(data):
    """
    returns all the transactions from the transcript dataframe
    """
    transactions_df = data[data['transaction'] == 1]
    transactions_df = transactions_df[['person','time','id']]
    transactions_df.columns = ['person','transaction_time','spend']
    
    return transactions_df

In [13]:
# split out all the transactions

transactions_df = transactions(clean_trans_df)
transactions_df.shape

(138953, 3)

In [28]:
transactions_df.head()

Unnamed: 0,person,transaction_time,spend
12654,02c083884c7d45b39cc68e1314fec56c,0,0.83
12657,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,34.56
12659,54890f68699049c2a04d415abc25e717,0,13.23
12670,b2f1cd155b864803ad8334cdf13c4bd2,0,19.51
12671,fe97aa22dd3e48c8b143116a8403dd52,0,18.97


Now we can split out the offers and process them separately based on the criteria above:

In [36]:
def offers(transcript_data, portfolio_data):
    """
    returns all of the offers that were received/viewed/completed combined with portfolio data
    """
    # keep only the received offers
    received_offer = transcript_data[transcript_data['offer received'] == 1]
    received_offer = received_offer[['offer received','person', 'time', 'id']]
    received_offer.columns = ['offer received','person', 'time_received', 'id_offer']    
    
    # keep only the viewed offers
    viewed_offer = transcript_data[transcript_data['offer viewed'] == 1]
    viewed_offer = viewed_offer[['offer viewed','person', 'time', 'id']]
    viewed_offer.columns = ['offer viewed','person', 'time_viewed', 'id_offer']
    
    # keep every event row here rather than only completions, because informational
    # campaigns don't have a completed flag
    completed_offer = transcript_data
    completed_offer = completed_offer[['offer completed','person', 'time', 'id']]
    completed_offer.columns = ['offer completed','person', 'time_completed', 'id_offer']
    
    # merge the offers data into one dataframe based on id and person
    merged_views = received_offer.merge(viewed_offer, on=['person','id_offer']) 
    merged_completed = merged_views.merge(completed_offer, on=['person','id_offer']) 
    
    # keep only rows where the offer was viewed after it was received
    # (a view that is not after the receipt suggests it was a different offer)
    merged_completed = merged_completed[merged_completed['time_viewed'] > 
                                        merged_completed['time_received']]
    
    # merges all of the offer data with info in the portfolio data
    portfolio_data = portfolio_data.rename(columns = {'id':'id_offer'})
    offers_df = merged_completed.merge(portfolio_data, on=['id_offer'])
    
    # change duration time from days to hours
    offers_df['duration'] = offers_df['duration']*24
    
    return offers_df

def influenced_bogo(transcript_data, portfolio_data):
    """
    this function has been created to keep only BOGO offers that influenced a purchase
    """
    # gets all of the offers that were received/viewed/completed formatted together
    offer_data = offers(transcript_data, portfolio_data)
    
    # select only the bogo offers and have been completed
    bogo_offers = offer_data[(offer_data['bogo'] == 1) & 
                             (offer_data['offer completed'] == 1)]
    
    # removes any that were completed prior to being viewed
    bogo_offers = bogo_offers[bogo_offers['time_completed'] >= 
                              bogo_offers['time_viewed']]
    
    # removes offers that were completed outside of the offer timeframe (indicating it was a second offer)
    bogo_offers = bogo_offers[(bogo_offers['duration'] >= (bogo_offers['time_completed'] - 
                                                           bogo_offers['time_received']))
                             ]

    # creates the transaction data
    transactions_data = transactions(transcript_data)
    
    # merge the offers and transactions
    transactions_bogo = transactions_data.merge(bogo_offers, on=['person'])
    
    # filter the transactions, keeping the ones that occurred at the same time as the offer was completed
    transactions_bogo = transactions_bogo[transactions_bogo['transaction_time'] == 
                                          transactions_bogo['time_completed']]
    
    # remove any repeat transactions
    transactions_bogo = transactions_bogo.drop_duplicates(subset=['person','transaction_time','spend'], keep="first")
    
    return transactions_bogo

def influenced_discount(transcript_data, portfolio_data):
    """
    this function has been created to keep only discount offers that influenced a purchase
    """
    # gets all of the offers that were received/viewed/completed formatted together
    offer_data = offers(transcript_data, portfolio_data)
    
    # select only the discount offers that have been completed
    discount_offers = offer_data[(offer_data['discount'] == 1) & 
                                 (offer_data['offer completed'] == 1)]
    
    # removes any that were completed prior to being viewed
    discount_offers = discount_offers[discount_offers['time_completed'] >= 
                                      discount_offers['time_viewed']]
    
    # removes offers that were completed outside of the timeframe (indicating it was a second offer)
    discount_offers = discount_offers[discount_offers['duration'] >= (discount_offers['time_completed'] - 
                                                                      discount_offers['time_received'])]

    # creates the transaction data
    transactions_data = transactions(transcript_data)
    
    # merge the offers and transactions
    transactions_discount = transactions_data.merge(discount_offers, on=['person'])
    
    # filter the transactions, keeping the ones after the offer was viewed but before it was completed
    transactions_discount = transactions_discount[(transactions_discount['transaction_time'] >= transactions_discount['time_viewed']) &
                                                 (transactions_discount['transaction_time'] <= transactions_discount['time_completed'])]
    
    # remove any repeat transactions
    transactions_discount = transactions_discount.drop_duplicates(subset=['person','transaction_time','spend'], keep="first")
    
    return transactions_discount

def influenced_informational(transcript_data, portfolio_data):
    """
    this function has been created to keep only informational offers that influenced a purchase
    """
    # gets all of the offers that were received/viewed/completed formatted together
    offer_data = offers(transcript_data, portfolio_data)
    
    # select only the informational offers
    info_offers = offer_data[(offer_data['informational'] == 1)]

    # creates the transaction data
    transactions_data = transactions(transcript_data)
    
    # merge the offers and transactions
    transactions_info = transactions_data.merge(info_offers, on=['person'])
    
    # filter the transactions, keeping the ones after the offer was viewed
    transactions_info = transactions_info[(transactions_info['transaction_time'] >= transactions_info['time_viewed'])]
    
    # removes transactions that happened outside of duration timeframe of the offer
    transactions_info = transactions_info[transactions_info['duration'] >= (transactions_info['transaction_time'] - 
                                                                            transactions_info['time_viewed'])]
    
    # remove any repeat transactions
    transactions_info = transactions_info.drop_duplicates(subset=['person','transaction_time','spend'], keep="first")
    
    return transactions_info

In [37]:
inf_discount = influenced_discount(clean_trans_df, clean_port_df)
print(inf_discount.shape)
inf_discount.head()

(15428, 20)


Unnamed: 0,person,transaction_time,spend,offer received,time_received,id_offer,offer viewed,time_viewed,offer completed,time_completed,bogo,discount,informational,email,mobile,social,web,difficulty,duration,reward
13,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,354,18.42,1,336,2298d6c36e964ae4a3e7e9706d1fb8c2,1,342,1,354,0,1,0,1.0,1.0,1.0,1.0,7,168,3
20,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,474,21.13,1,408,0b1e1539f2cc45b7b9fa7c272da2e1d7,1,462,1,474,0,1,0,1.0,0.0,0.0,1.0,20,240,5
26,54890f68699049c2a04d415abc25e717,330,15.61,1,168,2298d6c36e964ae4a3e7e9706d1fb8c2,1,186,1,330,0,1,0,1.0,1.0,1.0,1.0,7,168,3
42,bbeb54e861614fc7b22a8844f72dca6c,372,2.36,1,336,2298d6c36e964ae4a3e7e9706d1fb8c2,1,354,1,396,0,1,0,1.0,1.0,1.0,1.0,7,168,3
44,bbeb54e861614fc7b22a8844f72dca6c,390,0.36,1,336,2298d6c36e964ae4a3e7e9706d1fb8c2,1,354,1,396,0,1,0,1.0,1.0,1.0,1.0,7,168,3


In [38]:
inf_bogo = influenced_bogo(clean_trans_df, clean_port_df)
print(inf_bogo.shape)
inf_bogo.head()

(7957, 20)


Unnamed: 0,person,transaction_time,spend,offer received,time_received,id_offer,offer viewed,time_viewed,offer completed,time_completed,bogo,discount,informational,email,mobile,social,web,difficulty,duration,reward
11,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,540,24.3,1,504,4d5c57ea9a6940dd891ad53e9dbe8da0,1,516,1,540,1,0,0,1.0,1.0,1.0,1.0,10,120,10
19,676506bad68e4161b9bbaffeb039626b,636,17.2,1,576,9b98b8c7a33c4b65b9aebfe6a799e6d9,1,588,1,636,1,0,0,1.0,1.0,0.0,1.0,5,168,5
22,4cbe33c601a5407f8202086565c55111,558,31.72,1,504,ae264e3637204a6fb9bb56bc8210ddfd,1,522,1,558,1,0,0,1.0,1.0,1.0,0.0,10,168,10
33,a04fcfd571034456aaa6d56c0a3fd9b6,660,223.07,1,576,f19421c1d4aa40978ebb69ca19b0e20d,1,612,1,660,1,0,0,1.0,1.0,1.0,1.0,5,120,5
39,227f2d69e46a4899b70d48182822cff6,642,24.7,1,576,ae264e3637204a6fb9bb56bc8210ddfd,1,582,1,642,1,0,0,1.0,1.0,1.0,0.0,10,168,10


In [39]:
inf_informational = influenced_informational(clean_trans_df, clean_port_df)
print(inf_informational.shape)
inf_informational.head()

(10290, 20)


Unnamed: 0,person,transaction_time,spend,offer received,time_received,id_offer,offer viewed,time_viewed,offer completed,time_completed,bogo,discount,informational,email,mobile,social,web,difficulty,duration,reward
10,54890f68699049c2a04d415abc25e717,534,20.01,1,408,5a8bc65990b245e5a138643cd4eb9837,1,468,0,408,0,0,1,1.0,1.0,1.0,0.0,0,72,0
24,b2f1cd155b864803ad8334cdf13c4bd2,102,17.53,1,0,5a8bc65990b245e5a138643cd4eb9837,1,66,0,0,0,0,1,1.0,1.0,1.0,0.0,0,72,0
30,b2f1cd155b864803ad8334cdf13c4bd2,222,27.45,1,168,3f207df678b143eea3cee63160fa8bed,1,198,0,168,0,0,1,1.0,1.0,0.0,1.0,0,96,0
174,fe97aa22dd3e48c8b143116a8403dd52,198,28.71,1,168,3f207df678b143eea3cee63160fa8bed,1,198,0,168,0,0,1,1.0,1.0,0.0,1.0,0,96,0
236,fe97aa22dd3e48c8b143116a8403dd52,438,380.24,1,408,5a8bc65990b245e5a138643cd4eb9837,1,420,0,408,0,0,1,1.0,1.0,1.0,0.0,0,72,0


It's possible that some of the transactions above would have been made regardless of whether the user had seen the offers, especially in the case of the informational offers. In the next part of this processing notebook we will calculate how much users spend on average over time and per transaction. This will help to further process the above transactions and also help to decide how to split the different demographic groups.

### Normal Customer Behaviour

In [116]:
# Now that we have all the influenced transactions we can find the ones not influenced by offers
def norm_transactions(clean_trans_df, clean_port_df):
    """
    produces all the transactions that weren't influenced by offers
    """
    # creates the transaction data
    transactions_data = transactions(clean_trans_df)
    
    # all offer affected transactions
    inf_discount = influenced_discount(clean_trans_df, clean_port_df)
    inf_bogo = influenced_bogo(clean_trans_df, clean_port_df)
    inf_informational = influenced_informational(clean_trans_df, clean_port_df)
    
    # combine all the influenced transactions
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    inf_trans = pd.concat([inf_informational, inf_discount, inf_bogo])
    
    # drop to have the same columns as all transactions    
    inf_trans = inf_trans[['person', 'transaction_time', 'spend']]
    
    # remove offer related transactions
    norm_trans = pd.concat([transactions_data, inf_trans]).drop_duplicates(keep=False)
    
    return norm_trans

In [117]:
uninflunced_trans = norm_transactions(clean_trans_df, clean_port_df)
uninflunced_trans.head()

Unnamed: 0,person,transaction_time,spend
12654,02c083884c7d45b39cc68e1314fec56c,0,0.83
12657,9fa9ae8f57894cc9a3b8a9bbe0fc1b2f,0,34.56
12659,54890f68699049c2a04d415abc25e717,0,13.23
12670,b2f1cd155b864803ad8334cdf13c4bd2,0,19.51
12671,fe97aa22dd3e48c8b143116a8403dd52,0,18.97


In [43]:
len(uninflunced_trans)

106098
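One thing to be aware of: `norm_transactions` removes the influenced rows with `drop_duplicates(keep=False)`, which would also discard any uninfluenced transaction that happens to appear twice with the same person, time and spend. An alternative, shown here only as a sketch on made-up data and not what was run above, is a left anti-join using `merge(..., indicator=True)`. The frames `all_trans` and `influenced` are hypothetical.

```python
import pandas as pd

# toy frames standing in for all transactions and the influenced subset
all_trans = pd.DataFrame({
    "person": ["p1", "p1", "p2", "p3"],
    "transaction_time": [0, 30, 12, 48],
    "spend": [5.0, 7.5, 3.0, 9.0],
})
influenced = pd.DataFrame({
    "person": ["p1", "p3"],
    "transaction_time": [30, 48],
    "spend": [7.5, 9.0],
})

keys = ["person", "transaction_time", "spend"]

# left join with an indicator column, then keep rows found only in all_trans
merged = all_trans.merge(influenced.drop_duplicates(subset=keys),
                         on=keys, how="left", indicator=True)
uninfluenced = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(uninfluenced)
```

With this approach a repeated row in `all_trans` is kept as long as it has no match in `influenced`.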

Now that we have all the normal transactions we can create useful metrics based on each user's normal behaviour. These will include the number of transactions, the total spend, the average transaction amount and the average spend per day. It's better not to look at overall transactions alone, as this will be heavily skewed towards users who have been using the app for a long time. In a more in-depth analysis, seasonal trends, trends in the run-up to offers and other changing habits could be taken into account. However, for this analysis I will just look at long-term trends over the whole of each user's membership.
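A rough sketch of these per-user metrics on a made-up transactions frame is shown below. `toy_trans` and `TEST_DAYS` are names introduced for this example, and the 30-day window is an assumption about the period the data covers.

```python
import pandas as pd

# toy uninfluenced transactions (illustrative values only)
toy_trans = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "spend": [4.0, 6.0, 9.0],
})

TEST_DAYS = 30  # assumption: length of the observation window in days

# number of transactions and total spend per user in one pass
per_user = toy_trans.groupby("person")["spend"].agg(["count", "sum"])
per_user.columns = ["total transactions", "total spend"]

# average spend per transaction and per day of the observation window
per_user["spend per trans"] = per_user["total spend"] / per_user["total transactions"]
per_user["spend per day"] = per_user["total spend"] / TEST_DAYS
print(per_user)
```

Dividing by a fixed window rather than by membership length keeps the daily figure comparable between new and long-standing members.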

In [47]:
def user_transactions(profile, transactions):
    """
    this creates summary metrics for each individual user's transactions
    """
    # list of consumers in the transaction data
    consumers = transactions.groupby('person').sum().index

    # calculate the total transaction values for a consumer
    consumer_spend = transactions.groupby('person')['spend'].sum().values

    # calculate the number of transactions per consumer
    consumer_trans = transactions.groupby('person')['spend'].count().values

    # create a dataframe with spend info per consumer
    consumer_data = pd.DataFrame(consumer_trans, index=consumers, columns=['total transactions'])

    # add the total transaction column
    consumer_data['total spend'] = consumer_spend 
    
    # average spend per transaction    
    consumer_data['spend per trans'] = consumer_data['total spend']/consumer_data['total transactions']
    
    # average spend per day over the 30 days covered by the data
    consumer_data['spend per day'] = consumer_data['total spend']/30
    
    # combine profile and transaction data
    consumer_profile = profile.merge(consumer_data, on=['person']).fillna(0)
    
    # take the latest join date as a proxy for the final day on which data was collected
    final_date = consumer_profile['member joined'].max()
    
    # membership length in weeks
    consumer_profile['membership length'] = [round((final_date - x).days / 7,0) for x in consumer_profile['member joined']]

    return consumer_profile

In [48]:
consumer_profiles = user_transactions(clean_prof_df, uninflunced_trans)
consumer_profiles.head()

Unnamed: 0,female,male,other,unknown gender,age,member joined,person,income,total transactions,total spend,spend per trans,spend per day,membership length
0,0,0,0,1,0,2017-02-12,68be06ca386d4c31939f3a4f0e3dd783,0.0,9,20.4,2.266667,0.68,76.0
1,1,0,0,0,55,2017-07-15,0610b486422d4921ae7d2bf64640c50b,112000.0,3,77.01,25.67,2.567,54.0
2,0,0,0,1,0,2018-07-12,38fe809add3b4fcf9315a9694bb96ff5,0.0,5,10.21,2.042,0.340333,2.0
3,1,0,0,0,75,2017-05-09,78afa995795e4d85b5d9ceeca43f5fef,100000.0,4,89.99,22.4975,2.999667,63.0
4,0,0,0,1,0,2017-08-04,a03223e636434f42ac4c3df47e8bac43,0.0,3,4.65,1.55,0.155,51.0


The above consumer profiles can be used as an input for the K-means modeling to determine the main demographics that shop at Starbucks. This will be performed in the first modeling notebook.

Next I will make the initial input for the second part of the modeling. This will be average user spend by day and offers that influenced spend in that day. For instance for each user I should have the spend for each day out of the 30 days with a flag for each offer that was live during that day. This can be done by using the transactions that were influenced by the offers above and the total transactions dataset.
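To make the target shape concrete, here is a minimal sketch on made-up data that converts hours to days and fills in the missing person/day combinations with zeros. The `toy` frame and the `offer_a` flag column are hypothetical.

```python
import numpy as np
import pandas as pd

# toy influenced transactions: time is in hours, offer_a flags one hypothetical offer
toy = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "transaction_time": [6, 30, 50],
    "spend": [4.0, 6.0, 9.0],
    "offer_a": [0, 1, 0],
})

# convert the hour of the transaction into a day number
toy["day"] = np.ceil(toy["transaction_time"] / 24)

# total spend and offer flags per person and day
daily = toy.groupby(["person", "day"])[["spend", "offer_a"]].sum()

# unstack the day level, fill the gaps with zeros, then stack it back
# so every person has a row for every day seen in the data
grid = daily.unstack().fillna(0).stack()
print(grid)
```

Note that only days appearing somewhere in the data become rows, so a day on which nobody transacted would still be missing from the grid.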

### Spend Per Day

In [111]:
def spend_per_day(clean_trans_df, clean_port_df):
    """
    this creates the spend per day by person which will be used for the regression analysis
    """
    # all offer affected transactions
    inf_discount = influenced_discount(clean_trans_df, clean_port_df)
    inf_bogo = influenced_bogo(clean_trans_df, clean_port_df)
    inf_informational = influenced_informational(clean_trans_df, clean_port_df)

    # combine all the influenced transactions
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    inf_trans = pd.concat([inf_informational, inf_discount, inf_bogo])

    # keep only the columns needed
    inf_trans = inf_trans[['person', 'transaction_time', 'spend', 'id_offer']]

    # creates dummies for each offer that was available
    inf_off = pd.get_dummies(inf_trans['id_offer'])

    # concatenates the offer dummies with the transactions
    inf_trans = pd.concat([inf_trans, inf_off], axis=1).drop(columns=['id_offer'])

    # changes the transaction time to a day
    inf_trans['transaction_time'] = np.ceil(inf_trans['transaction_time']/24)

    # groupby the person and transaction_time 
    influenced = inf_trans.groupby(['person','transaction_time']).sum()
    
    # unstack and restack in index to fill days with zeros   
    influenced = influenced.unstack().fillna(0).stack()
    
    # create the same file for all other transactions to get spend   
    trans_up = transactions(clean_trans_df)

    # changes the transaction time to a day
    trans_up['transaction_time'] = np.ceil(trans_up['transaction_time']/24)

    # group all of the transaction
    trans_up = trans_up.groupby(['person','transaction_time']).sum()
    
    # fill any empty days with zeros 
    trans_up = trans_up.unstack().fillna(0).stack()
    
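    # drop the influenced-only spend so the merge keeps a single spend column (total daily spend)
    influenced = influenced.drop(columns=['spend'])

    # merge the files to have spend by day and whether it was influenced by any offers
    # (inner join on the index: only person/day pairs present in both frames are kept)
    spend_per_day = trans_up.merge(influenced, right_index=True, left_index=True) 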
    
    return spend_per_day

In [112]:
spd = spend_per_day(clean_trans_df, clean_port_df)

In [115]:
spd.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,spend,0b1e1539f2cc45b7b9fa7c272da2e1d7,2298d6c36e964ae4a3e7e9706d1fb8c2,2906b810c7d4411798c6938adc9daaa5,3f207df678b143eea3cee63160fa8bed,4d5c57ea9a6940dd891ad53e9dbe8da0,5a8bc65990b245e5a138643cd4eb9837,9b98b8c7a33c4b65b9aebfe6a799e6d9,ae264e3637204a6fb9bb56bc8210ddfd,f19421c1d4aa40978ebb69ca19b0e20d,fafdcd668e3743c1bb461111dcafc2a4
person,transaction_time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0009655768c64bdeb2e877511632db8f,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0009655768c64bdeb2e877511632db8f,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0009655768c64bdeb2e877511632db8f,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0009655768c64bdeb2e877511632db8f,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0009655768c64bdeb2e877511632db8f,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now that I have the two input datasets for the analysis I can start working on the modeling. Not all features from these input datasets will necessarily be used, and after the first step of modeling and some analysis more features may need to be added. The next step is to look into the demographic data of users using k-means to see if we can cluster consumers into similar groups. This will make the process of predicting spend easier.