In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import re

# Relax Data Challenge

## Problem:
Identify which factors predict future user adoption
+ Adopted user: user who has logged into the product on three separate days in at least one seven-day period
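
The definition above can be made concrete with a small, standard-library sketch. The function name ``is_adopted`` and its parameters are hypothetical, and it assumes "seven-day period" means seven consecutive calendar days, so three active days qualify when the first and third are at most six days apart.

```python
from datetime import datetime, timedelta

def is_adopted(logins, min_days=3, window_days=7):
    """Return True if `logins` cover at least `min_days` distinct calendar
    days inside some span of `window_days` consecutive days."""
    # Collapse repeated logins on one day into a single calendar date.
    days = sorted({ts.date() for ts in logins})
    # Pair each active day with the one (min_days - 1) positions later; both
    # fit in one window when they are at most (window_days - 1) days apart.
    for first, last in zip(days, days[min_days - 1:]):
        if last - first <= timedelta(days=window_days - 1):
            return True
    return False

weekly = [datetime(2014, 1, d, 9, 0) for d in (1, 8, 15)]     # one login per week
burst = [datetime(2014, 1, d, 9, 0) for d in (1, 3, 7)]       # three days in one week
same_day = [datetime(2014, 1, 1, h, 0) for h in (8, 12, 18)]  # three logins, one day

print(is_adopted(weekly), is_adopted(burst), is_adopted(same_day))
# False True False
```

Three logins on a single day do not count, because the definition asks for three separate days.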

## Data

The first step is to import the data into Python and explore it for missing values or irregularities.

### User Engagement

In [2]:
# Import user engagement
user_engage = pd.read_csv('takehome_user_engagement.csv',
                          index_col = 'time_stamp',
                         parse_dates = True)

user_engage.head()

time_stamp,user_id,visited
2014-04-22 03:53:30,1,1
2013-11-15 03:45:04,2,1
2013-11-29 03:45:04,2,1
2013-12-09 03:45:04,2,1
2013-12-25 03:45:04,2,1


In [3]:
user_engage.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 207917 entries, 2014-04-22 03:53:30 to 2014-01-26 08:57:12
Data columns (total 2 columns):
user_id    207917 non-null int64
visited    207917 non-null int64
dtypes: int64(2)
memory usage: 4.8 MB


In [4]:
user_engage.describe()

,user_id,visited
count,207917.0,207917.0
mean,5913.314197,1.0
std,3394.941674,0.0
min,1.0,1.0
25%,3087.0,1.0
50%,5682.0,1.0
75%,8944.0,1.0
max,12000.0,1.0


A brief look at the data shows no missing values or obvious irregularities. The ``user_id`` values range from 1 to 12,000, and ``visited`` is always 1, so each row represents a single login, as expected. Before moving on, I want to get the date range over which this data was recorded.

In [5]:
# Get the beginning and end date for logs
min_date = user_engage.index.min()
max_date = user_engage.index.max()

print('The log started recording on', min_date)
print('The log finished recording on', max_date)

The log started recording on 2012-05-31 08:20:06
The log finished recording on 2014-06-06 14:58:50
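
Returning to the summary statistics, ``describe()`` shows the range of ``user_id`` but not how many distinct users appear, nor whether a user logs in more than once per day. A minimal sketch of more direct checks follows; ``toy_engage`` is a hypothetical miniature of the log, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the engagement log (not the real data).
toy_engage = pd.DataFrame(
    {"user_id": [1, 2, 2, 2], "visited": [1, 1, 1, 1]},
    index=pd.to_datetime([
        "2014-04-22 03:53:30",
        "2013-11-15 03:45:04",
        "2013-11-15 21:10:00",  # second login by user 2 on the same day
        "2013-11-29 03:45:04",
    ]),
)

n_users = toy_engage["user_id"].nunique()        # distinct users with any login
n_missing = int(toy_engage.isna().sum().sum())   # total missing cells
only_ones = bool((toy_engage["visited"] == 1).all())

# Rows per (user, calendar day); more than one means repeated logins that day.
per_day = toy_engage.groupby(["user_id", toy_engage.index.normalize()]).size()
n_repeat_days = int((per_day > 1).sum())

print(n_users, n_missing, only_ones, n_repeat_days)
# 2 0 True 1
```

Run against ``user_engage``, the same calls would show how many of the 12,000 users ever logged in.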


### Users

In [6]:
# Import user information
users = pd.read_csv('takehome_users.csv', 
                    index_col = 'object_id', 
                    parse_dates = [1], 
                    encoding = 'iso-8859-1')

users.head()

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [7]:
users.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 1 to 12000
Data columns (total 9 columns):
creation_time                 12000 non-null datetime64[ns]
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 937.5+ KB


In [8]:
users.describe()

,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
count,8823.0,12000.0,12000.0,12000.0,6417.0
mean,1379279000.0,0.2495,0.149333,141.884583,5962.957145
std,19531160.0,0.432742,0.356432,124.056723,3383.761968
min,1338452000.0,0.0,0.0,0.0,3.0
25%,1363195000.0,0.0,0.0,29.0,3058.0
50%,1382888000.0,0.0,0.0,108.0,5954.0
75%,1398443000.0,0.0,0.0,238.25,8817.0
max,1402067000.0,1.0,1.0,416.0,11999.0


From the initial findings, the data has missing values in ``last_session_creation_time`` and ``invited_by_user_id``: roughly a quarter of the former (3,177 of 12,000) and nearly half of the latter (5,583 of 12,000). Removing those entries would discard valuable observations, so it should be avoided where possible. Also, according to the data documentation, ``last_session_creation_time`` is encoded as a Unix timestamp, so it is more useful converted from float64 to a datetime.
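
As a quick sanity check on the units, one value can be decoded with the standard library. The sketch below assumes the numbers are seconds since the Unix epoch in UTC and uses the rounded value displayed for user 1, so the result only approximates the stored timestamp.

```python
from datetime import datetime, timezone

# Rounded value shown for user 1 in the preview; the stored value has more digits.
raw_seconds = 1398139000.0

# Interpret the number as seconds since 1970-01-01 00:00:00 UTC.
as_utc = datetime.fromtimestamp(raw_seconds, tz=timezone.utc)
print(as_utc.isoformat())
# 2014-04-22T03:56:40+00:00
```

The decoded date falls within minutes of that user's ``creation_time`` (2014-04-22 03:53:30), which suggests seconds is the right unit.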

In [9]:
# Convert last_session_creation_time to datetime
users['last_session_creation_time'] = pd.to_datetime(users['last_session_creation_time'], unit = 's')

#### ``invited_by_user_id``

I'll start with the missing values for ``invited_by_user_id``, as they may be easier to resolve. One likely reason for a missing value is that the user wasn't invited by another user. In that case, the values under ``creation_source`` for these observations would be _SIGNUP_ or *SIGNUP_GOOGLE_AUTH*.

In [10]:
# Select the rows where invited_by_user_id is missing
missing_invited_by = users[users['invited_by_user_id'].isna()]

In [11]:
# Group by creation_source and obtain count
missing_invited_by.groupby('creation_source')['name'].count()

creation_source
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: name, dtype: int64

In [12]:
# Compare counts to counts for users
users.groupby('creation_source')['name'].count()

creation_source
GUEST_INVITE          2163
ORG_INVITE            4254
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: name, dtype: int64

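As suspected, the missing values correspond to users who signed up directly, either through the website or with Google authentication. In addition, none of the users who signed up for personal projects were invited by another user. These three sources account for every missing value (2,111 + 2,087 + 1,385 = 5,583), and their counts match the counts in the full data set, so no invited user is missing an inviter.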

Now that we know why values of ``invited_by_user_id`` are missing, we can fill them in. Since there's no ``object_id`` (or ``user_id``) of $0$, I'm going to choose that value to fill in the missing values to mean *self*.

In [13]:
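# Fill NaN values in invited_by_user_id with 0
# (assign back; inplace fillna on a selected column may not update the frame in newer pandas)
users['invited_by_user_id'] = users['invited_by_user_id'].fillna(0)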

#### ``last_session_creation_time``

The missing values in ``last_session_creation_time`` pose a more interesting problem. Although they make up roughly 25% of the users, the only real way to fill them in would be to use the user engagement data set, which is also needed to determine adopted users. The most likely explanation is that these users signed up but never logged in to the platform. For that reason, these observations are removed, as they won't provide any insight.
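
The "never logged in" explanation is an assumption that can be tested before any rows are dropped. A minimal sketch follows, using hypothetical miniature tables (``toy_users``, ``toy_engage``) in place of the real data: a user with no last session who still appears in the engagement log would contradict it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature tables (not the real data).
toy_users = pd.DataFrame(
    {"last_session_creation_time": [1398139000.0, np.nan, 1363735000.0]},
    index=pd.Index([1, 2, 3], name="object_id"),
)
toy_engage = pd.DataFrame({"user_id": [1, 3, 3], "visited": [1, 1, 1]})

no_session = toy_users["last_session_creation_time"].isna()
has_logins = toy_users.index.isin(toy_engage["user_id"])

# Users with no last session who still have logins would contradict the assumption.
contradictions = toy_users[no_session & has_logins]
print(len(contradictions))
# 0
```

If the same check on ``users`` and ``user_engage`` also returns zero rows, the explanation holds.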

In [14]:
# Drop only the users with no recorded session
users.dropna(subset=['last_session_creation_time'], inplace=True)

### ``adopted_users``

Relax wants to know how many of its users were adopted and which factors play into adoption. A column has to be added to the users data frame to mark whether each user was adopted.

In [15]:
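adopted_users = []
for i in users.index:
    # One row per calendar day for this user; days without a login become NaN
    user_activity = user_engage[user_engage['user_id'] == i].resample('D').min()
    # Number of active days in each rolling seven-day window (NaN days are skipped)
    rolling_count = user_activity['visited'].rolling(window=7, min_periods=1).sum()
    max_days_active = rolling_count.max()
    # Adopted: at least three active days within some seven-day window
    adopted_users.append(int(max_days_active >= 3))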

In [16]:
users['adopted_users'] = adopted_users
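
The per-user loop can be slow on roughly 200,000 log rows. As a cross-check, here is a vectorized sketch of the same rule; it is an alternative formulation, not the method used above. ``toy_engage`` is hypothetical data with ``time_stamp`` as a regular column (for the real log, ``user_engage.reset_index()`` would provide that column).

```python
import pandas as pd

# Hypothetical miniature log (not the real data): user 1 logs in on 1, 3 and
# 7 January; user 2 logs in once a week.
toy_engage = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "time_stamp": pd.to_datetime([
        "2014-01-01 09:00", "2014-01-03 10:00", "2014-01-07 23:00",
        "2014-01-01 09:00", "2014-01-08 09:00", "2014-01-15 09:00",
    ]),
})

# One row per user per calendar day, in chronological order within each user.
days = (
    toy_engage.assign(day=toy_engage["time_stamp"].dt.normalize())
    .drop_duplicates(["user_id", "day"])
    .sort_values(["user_id", "day"])
)
# Gap between each active day and the active day two positions earlier.
span = days.groupby("user_id")["day"].diff(2)
# A gap of at most six days puts three active days inside one seven-day window.
adopted_ids = sorted(days.loc[span <= pd.Timedelta(days=6), "user_id"].unique().tolist())

print(adopted_ids)
# [1]
```

Comparing the IDs this produces with the ``adopted_users`` column is a cheap way to catch threshold or windowing mistakes.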