In [1]:
import pandas as pd

# Problem Description

You have a table of in-app purchases by user. Users who make their first in-app purchase are placed in a marketing campaign where they see calls to action for more in-app purchases. Find the number of users who made additional in-app purchases due to the success of the marketing campaign.


The marketing campaign doesn't start until one day after the initial in-app purchase, so users who only make multiple purchases on their first day do not count. Users who, over time, only repurchase the products they bought on that first day do not count either.

## First look at Data

In [3]:
mkt_campaign = pd.read_csv('marketing_campaign.csv')
mkt_campaign.head(3)

Unnamed: 0,user_id,created_at,product_id,quantity,price
0,10,2019-01-01 00:00:00,101,3,55
1,10,2019-01-02 00:00:00,119,5,29
2,10,2019-03-31 00:00:00,111,2,149


## First Thoughts
* For every user, keep the first purchase date and the list of products bought on that date

* Check whether the same user_id bought other products on a later day

* Count the users who satisfy both rules
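
The three steps above can be sketched in plain Python before touching the real data. This is only an illustration: the `purchases` list is hypothetical toy data, not taken from `marketing_campaign.csv`, and the variable names are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy purchases: (user_id, purchase_date, product_id)
purchases = [
    (1, date(2019, 1, 1), 101),  # user 1: first purchase
    (1, date(2019, 1, 5), 102),  # new product on a later day -> counts
    (2, date(2019, 1, 1), 101),
    (2, date(2019, 1, 1), 103),  # second product on the same day -> does not count
    (3, date(2019, 1, 2), 105),
    (3, date(2019, 2, 1), 105),  # same product again later -> does not count
]

# Step 1: first purchase date per user, then the products bought on that date
first_date = {}
for user, day, _ in purchases:
    if user not in first_date or day < first_date[user]:
        first_date[user] = day

first_day_products = defaultdict(set)
for user, day, product in purchases:
    if day == first_date[user]:
        first_day_products[user].add(product)

# Step 2: users with a later purchase of a product not bought on the first date
converted = {
    user
    for user, day, product in purchases
    if day > first_date[user] and product not in first_day_products[user]
}

# Step 3: count the users that satisfy both rules
print("Number of users:", len(converted))
```

In this toy data only user 1 qualifies: user 2 bought everything on the first day, and user 3 only repeated the first-day product.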

## Data Analysis

In [4]:
#checking for missing values and format of columns
mkt_campaign.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     102 non-null    int64 
 1   created_at  102 non-null    object
 2   product_id  102 non-null    int64 
 3   quantity    102 non-null    int64 
 4   price       102 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 4.1+ KB


There are no missing values, but the date column `created_at` is stored as `object` (string) rather than as a datetime.

created_at -> convert with `pd.to_datetime`

In [5]:
mkt_campaign.created_at = pd.to_datetime(mkt_campaign.created_at, format='%Y-%m-%d')
mkt_campaign.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   user_id     102 non-null    int64         
 1   created_at  102 non-null    datetime64[ns]
 2   product_id  102 non-null    int64         
 3   quantity    102 non-null    int64         
 4   price       102 non-null    int64         
dtypes: datetime64[ns](1), int64(4)
memory usage: 4.1 KB
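
As an alternative to converting after loading, the date column could be parsed while reading the file. The sketch below assumes the column layout reported by `info()` and uses an in-memory string as a stand-in for `marketing_campaign.csv`; the three rows are the ones shown by `head(3)`, and the name `sample` is introduced only for this example.

```python
import io

import pandas as pd

# Stand-in for marketing_campaign.csv (assumption: same columns as reported by info())
csv_text = """user_id,created_at,product_id,quantity,price
10,2019-01-01 00:00:00,101,3,55
10,2019-01-02 00:00:00,119,5,29
10,2019-03-31 00:00:00,111,2,149
"""

# parse_dates asks read_csv to convert created_at to a datetime column while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["created_at"])
print(sample.dtypes)
```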


Now that the data is in the right format, let's start.
## Solution

### 1. More logical approach

In [16]:
# drop unused columns and sort by date
df = mkt_campaign.drop(['quantity', 'price'], axis=1).sort_values('created_at')
count = 0

# loop through the unique user_ids (uid avoids shadowing the built-in id)
for uid in df.user_id.unique():
    # first purchase date: df is sorted by created_at, so it is this user's first row
    first_date = df[df.user_id == uid]['created_at'].values[0]
    # products bought on the first purchase date
    list_prod = list(df[(df.user_id == uid) & (df.created_at == first_date)]['product_id'].values)

    # search for a later order of a product not bought on the first date
    # (df.loc looks rows up by index label; the labels are still 0..n-1 after sorting)
    for i in range(df.shape[0]):
        if (df.loc[i,'user_id'] == uid) and (df.loc[i,'created_at'] > first_date) and (df.loc[i,'product_id'] not in list_prod):
            count += 1 # if found, increment the count
            break # stop scanning and move on to the next user_id
print('Number of users: ',count)

Number of users:  23


### 2. More efficient approach

In [15]:
# drop unused columns and sort by date
df = mkt_campaign.drop(['quantity', 'price'], axis=1).sort_values('created_at')

# keep only the first purchase of each product by each user
df = df.drop_duplicates(subset=['user_id','product_id'])

# keep one row per user per date
df = df.drop_duplicates(subset=['created_at', 'user_id'])

# each remaining row is a distinct date on which the user first bought some product, so group by user and count
df = df.groupby('user_id').count()

# users with more than one row bought a new product after their first purchase date
print('Number of users: ', df[df["created_at"]>1].shape[0])

Number of users:  23
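
As a sanity check, the deduplication logic can be compared with a third formulation that states the rule more directly: find each user's first purchase date, then look for later purchases of products that were not bought on that date. The sketch below runs both on hypothetical toy data (not the real CSV); the names `toy`, `dedup_count` and `new_buyers` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data covering the three cases in the problem description
toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3],
    "created_at": ["2019-01-01", "2019-01-05", "2019-01-01",
                   "2019-01-01", "2019-01-02", "2019-02-01"],
    "product_id": [101, 102, 101, 103, 105, 105],
})
toy["created_at"] = pd.to_datetime(toy["created_at"])

# Deduplication approach, as in the cell above
dedup = (
    toy.sort_values("created_at")
       .drop_duplicates(subset=["user_id", "product_id"])
       .drop_duplicates(subset=["created_at", "user_id"])
)
dedup_count = int((dedup.groupby("user_id").size() > 1).sum())

# Direct formulation: first purchase date per user, aligned to every row
first_date = toy.groupby("user_id")["created_at"].transform("min")
first_day = toy.loc[toy["created_at"] == first_date, ["user_id", "product_id"]].drop_duplicates()
later = toy[toy["created_at"] > first_date]

# Left merge with indicator=True flags later purchases with no first-day match
merged = later.merge(first_day, on=["user_id", "product_id"], how="left", indicator=True)
new_buyers = merged.loc[merged["_merge"] == "left_only", "user_id"].nunique()

print("Dedup approach: ", dedup_count)
print("Merge approach: ", new_buyers)
```

Both formulations agree on the toy data, where only user 1 buys a new product after the first purchase date.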
