This notebook analyzes which factors are useful for predicting user adoption. We start by importing the libraries we need and reading in the data.

In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

In [82]:
df_engagement = pd.read_csv('takehome_user_engagement.csv')

df_engagement.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [83]:
df_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [84]:
df_users = pd.read_csv('takehome_users.csv', encoding="ISO-8859-1")

df_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [85]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   12000 non-null  int64  
 1   creation_time               12000 non-null  object 
 2   name                        12000 non-null  object 
 3   email                       12000 non-null  object 
 4   creation_source             12000 non-null  object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    12000 non-null  int64  
 7   enabled_for_marketing_drip  12000 non-null  int64  
 8   org_id                      12000 non-null  int64  
 9   invited_by_user_id          6417 non-null   float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


In [86]:
df_users['object_id'].is_unique

True

In [87]:
total_visits = df_engagement.groupby('user_id')['visited'].count()

total_visits = pd.DataFrame(total_visits)
total_visits.reset_index(inplace=True)

total_visits.head()

Unnamed: 0,user_id,visited
0,1,1
1,2,14
2,3,1
3,4,1
4,5,1


In [88]:
total_visits['user_id'].is_unique

True

In [89]:
total_visits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8823 entries, 0 to 8822
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   user_id  8823 non-null   int64
 1   visited  8823 non-null   int64
dtypes: int64(2)
memory usage: 138.0 KB


The engagement data covers only 8,823 of the 12,000 users, so 3,177 users have no recorded visits. This matches the number of users with a null last_session_creation_time.

In [92]:
null_session = df_users[df_users['last_session_creation_time'].isnull()]

In [94]:
null_list = list(null_session['object_id'])
visit_list = list(total_visits['user_id'])

check = False

for i in null_list:
    if i in visit_list:
        check = True
        
print(check)

False
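
The same membership check can be written without an explicit loop. The sketch below uses small toy frames (the values are illustrative, not taken from the real files) and `Series.isin` to test whether any user without a session appears in the visit counts, and to count the users with no visits at all.

```python
import pandas as pd

# Toy stand-ins for df_users and total_visits (illustrative values only)
users = pd.DataFrame({
    'object_id': [1, 2, 3, 4],
    'last_session_creation_time': [1398139000.0, None, 1363735000.0, None],
})
visits = pd.DataFrame({'user_id': [1, 3], 'visited': [1, 5]})

# True where the user has no recorded session
no_session = users['last_session_creation_time'].isnull()
# True where the user appears in the visit counts
has_visits = users['object_id'].isin(visits['user_id'])

# Does any user without a session still have visits?
overlap = (no_session & has_visits).any()
# How many users have no visits at all?
n_without_visits = (~has_visits).sum()

print(overlap, n_without_visits)
```

On the real data, the equivalent of `overlap` should come out False, in line with the check above.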


None of the users with a null last_session_creation_time appears in the engagement data, and both groups contain 3,177 users, so the users with no recorded session are exactly the users with no visits. We can therefore drop these rows: these users have no recorded activity and tell us nothing about adoption. We do this by merging the total visits dataframe with our users dataframe, which keeps only the users present in both.

In [99]:
df_users = pd.merge(df_users, total_visits, left_on='object_id', right_on='user_id')

df_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,user_id,visited
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,1,1
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,2,14
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,3,1
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,4,1
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,5,1
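
The rows disappear because `pd.merge` performs an inner join by default. A minimal sketch on toy frames (hypothetical values) makes the join type explicit and uses a left join with `indicator=True` to list which users would be dropped.

```python
import pandas as pd

# Toy stand-ins for df_users and total_visits (illustrative values only)
users = pd.DataFrame({'object_id': [1, 2, 3], 'name': ['a', 'b', 'c']})
visits = pd.DataFrame({'user_id': [1, 3], 'visited': [1, 5]})

# how='inner' is the default; it is spelled out here to make the row drop explicit
merged = pd.merge(users, visits, how='inner',
                  left_on='object_id', right_on='user_id')

# A left join with indicator=True marks the users that had no match
audit = pd.merge(users, visits, how='left',
                 left_on='object_id', right_on='user_id', indicator=True)
dropped = audit.loc[audit['_merge'] == 'left_only', 'object_id'].tolist()

print(len(merged), dropped)
```

Running the audit step before the real merge is one way to confirm that only users without visits are removed.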


In [100]:
df_users.rename(columns={'visited':'total_visits'}, inplace=True)

df_users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8823 entries, 0 to 8822
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   8823 non-null   int64  
 1   creation_time               8823 non-null   object 
 2   name                        8823 non-null   object 
 3   email                       8823 non-null   object 
 4   creation_source             8823 non-null   object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    8823 non-null   int64  
 7   enabled_for_marketing_drip  8823 non-null   int64  
 8   org_id                      8823 non-null   int64  
 9   invited_by_user_id          4776 non-null   float64
 10  user_id                     8823 non-null   int64  
 11  total_visits                8823 non-null   int64  
dtypes: float64(2), int64(6), object(4)
memory usage: 896.1+ KB


In [101]:
df_users.drop(columns='user_id', inplace=True)

Next, we remove the null values from the invited_by_user_id column. Because we care only about whether a user was invited, not about who invited them, we replace the column with a binary flag, invited, set to 0 where invited_by_user_id is null and 1 otherwise.

In [102]:
df_users['invited'] = 1
df_users.loc[df_users['invited_by_user_id'].isnull(), 'invited'] = 0

df_users['invited'].unique()

array([1, 0], dtype=int64)
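
The flag can also be built in a single step. This is a minimal sketch on a toy column (illustrative values), using `notna()` to mark invited users and `astype(int)` to turn the booleans into 0/1.

```python
import pandas as pd

# Toy stand-in for the invited_by_user_id column (illustrative values only)
df = pd.DataFrame({'invited_by_user_id': [10803.0, None, 316.0, None]})

# notna() is True where an inviter is recorded; astype(int) maps True/False to 1/0
df['invited'] = df['invited_by_user_id'].notna().astype(int)

print(df['invited'].tolist())
```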

In [105]:
df_users.drop(columns='invited_by_user_id', inplace=True)

In [106]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8823 entries, 0 to 8822
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   object_id                   8823 non-null   int64  
 1   creation_time               8823 non-null   object 
 2   name                        8823 non-null   object 
 3   email                       8823 non-null   object 
 4   creation_source             8823 non-null   object 
 5   last_session_creation_time  8823 non-null   float64
 6   opted_in_to_mailing_list    8823 non-null   int64  
 7   enabled_for_marketing_drip  8823 non-null   int64  
 8   org_id                      8823 non-null   int64  
 9   total_visits                8823 non-null   int64  
 10  invited                     8823 non-null   int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 827.2+ KB


Finally, we need to determine which users actually adopted the product. To do this, we create a column that records whether a user visited on three separate days within a seven-day period. This information comes from the df_engagement dataframe.

In [108]:
df_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   time_stamp  207917 non-null  object
 1   user_id     207917 non-null  int64 
 2   visited     207917 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 4.8+ MB


In [109]:
df_engagement['time_stamp'] = pd.to_datetime(df_engagement['time_stamp'])

df_engagement.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207917 entries, 0 to 207916
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   time_stamp  207917 non-null  datetime64[ns]
 1   user_id     207917 non-null  int64         
 2   visited     207917 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 4.8 MB


In [115]:
uid = list(set(i for i in df_engagement['user_id']))

uid[:5]

[1, 2, 3, 4, 5]

In [138]:
(df_engagement.loc[2,'time_stamp'].to_pydatetime() - df_engagement.loc[1,'time_stamp'].to_pydatetime()).days

14
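
Subtracting two pandas Timestamps already yields a Timedelta, so the conversion to Python datetimes is optional. The sketch below repeats the calculation with the two timestamps shown for user 2 and contrasts it with subtracting `.day` attributes, which compares days of the month and gives misleading results across month boundaries. The December date in the second part is a made-up example.

```python
import pandas as pd

# The two visits for user 2 shown in df_engagement.head()
first = pd.Timestamp('2013-11-15 03:45:04')
second = pd.Timestamp('2013-11-29 03:45:04')

gap = second - first        # pandas Timedelta
elapsed_days = gap.days     # whole days between the two visits

# Pitfall: .day is the day of the month, not an elapsed-time measure.
# These two dates are 3 days apart, but the .day difference is 2 - 29.
naive = pd.Timestamp('2013-12-02').day - pd.Timestamp('2013-11-29').day

print(elapsed_days, naive)
```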

In [143]:
temp = df_engagement[df_engagement['user_id'] == 10]

temp.shape[0]

284

In [167]:
temp.loc[10, 'time_stamp']

Timestamp('2014-02-13 03:45:04')

In [169]:
adopt = {}
for i in uid:
    adopt[i] = 0
    temp = df_engagement[df_engagement['user_id'] == i]
    # Distinct calendar days with a visit, sorted, with a fresh 0-based index
    days = temp['time_stamp'].dt.normalize().drop_duplicates().sort_values().reset_index(drop=True)
    size = days.shape[0]
    if size > 2:
        for j in range(size - 2):
            # Positional lookup (.iloc) works on any index; .loc[j] raised KeyError on the filtered frame
            diff = (days.iloc[j + 2] - days.iloc[j]).days
            if diff <= 7:
                adopt[i] = 1
                break
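
The adoption flag can also be computed for all users at once, without looking rows up by index label. The sketch below is one possible approach on a toy engagement log (hypothetical values): it keeps one row per user per calendar day, then compares each visit day with the visit day two positions later for the same user. The helper name `adopted_flags` and the `window_days=7` threshold are assumptions; in particular, the threshold depends on how "a week's span" is counted.

```python
import pandas as pd

# Toy engagement log with the same columns as df_engagement (illustrative values only)
log = pd.DataFrame({
    'user_id': [1, 2, 2, 2, 3, 3, 3],
    'time_stamp': pd.to_datetime([
        '2014-04-22 03:53:30',
        '2013-11-15 03:45:04', '2013-11-29 03:45:04', '2013-12-09 03:45:04',
        '2014-01-01 08:00:00', '2014-01-03 09:00:00', '2014-01-06 10:00:00',
    ]),
})

def adopted_flags(engagement, window_days=7):
    """Return a 0/1 Series indexed by user_id (hypothetical helper)."""
    days = (engagement
            .assign(day=engagement['time_stamp'].dt.normalize())
            .drop_duplicates(['user_id', 'day'])
            .sort_values(['user_id', 'day']))
    # For each visit day, the visit day two positions later for the same user
    third_day = days.groupby('user_id')['day'].shift(-2)
    # NaT (fewer than three visit days remaining) compares as False
    within = (third_day - days['day']) <= pd.Timedelta(days=window_days)
    return within.groupby(days['user_id']).any().astype(int)

flags = adopted_flags(log)
print(flags.to_dict())
```

In this toy log only user 3 has three visit days close enough together to count as adopted. The resulting Series could then be joined to df_users on object_id.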