# Relax Data Science Challenge
The data is available as two attached CSV files:   
*takehome_user_engagement.csv*  
*takehome_users.csv*

The data has the following two tables:   
1. A user table (*"takehome_users"*) with data on 12,000 users who signed up for the product in the last two years.  
2. A usage summary table (*"takehome_user_engagement"*) that has a row for each day that a user logged into the product.

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one seven-day period, **identify which factors predict future user adoption.** 
 
We suggest spending 1-2 hours on this, but you're welcome to spend more or less. Please send us a brief writeup of your findings (the more concise, the better; no more than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigations you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Read and explore datasets

In [2]:
# Read the datasets into DataFrames.
# Note: takehome_users.csv was first re-saved as UTF-8; in its original encoding
# pd.read_csv may not open it properly.
df_users = pd.read_csv('takehome_users.csv')
df_users.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0
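
In the preview above, `last_session_creation_time` is a float rather than a readable timestamp. It appears to hold Unix epoch seconds, since the first value lands on the same day as that user's `creation_time`. A minimal sketch of the conversion, assuming that interpretation is correct; the two numeric values are copied from the preview as displayed (so they are rounded), and `last_session` and `last_session_dt` are illustrative names:

```python
import numpy as np
import pandas as pd

# Two values copied from the preview (rounded as displayed), plus a NaN
# standing in for a user with no recorded session.
last_session = pd.Series([1398139000.0, 1396238000.0, np.nan])

# Assumption: the floats are seconds since 1970-01-01 00:00:00 UTC.
# Missing values are converted to NaT rather than raising an error.
last_session_dt = pd.to_datetime(last_session, unit='s')

print(last_session_dt.dt.date)
```

Because the preview rounds the raw values, the recovered times are only approximate; on the full column the same call would operate on the unrounded numbers.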


In [3]:
df_engage = pd.read_csv('takehome_user_engagement.csv')
df_engage.head()

Unnamed: 0,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [4]:
len(df_users)

12000

In [5]:
print('There are {} users signed up for the product.'.format(df_users['object_id'].nunique()))
print('There are {} users logged into the product.'.format(df_engage['user_id'].nunique()))

There are 12000 users signed up for the product.
There are 8823 users logged into the product.


In [6]:
# Are there any missing values?
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 10 columns):
object_id                     12000 non-null int64
creation_time                 12000 non-null object
name                          12000 non-null object
email                         12000 non-null object
creation_source               12000 non-null object
last_session_creation_time    8823 non-null float64
opted_in_to_mailing_list      12000 non-null int64
enabled_for_marketing_drip    12000 non-null int64
org_id                        12000 non-null int64
invited_by_user_id            6417 non-null float64
dtypes: float64(2), int64(4), object(4)
memory usage: 937.6+ KB


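8823 users have a non-null last_session_creation_time, i.e. they have logged into the product at least once. This matches the 8823 distinct users in the engagement table, so the remaining 3177 users signed up but never logged in.    
6417 users have a non-null invited_by_user_id, i.e. they were invited by another user; the other 5583 signed up without an invitation.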
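
The two counts above come from null checks, and the same checks can be materialized as boolean features for later modeling. A minimal sketch on a toy frame: the rows are made up for illustration, and `has_logged_in` and `was_invited` are hypothetical column names that do not exist in the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_users; the values are invented for illustration only.
users = pd.DataFrame({
    'object_id': [1, 2, 3],
    'last_session_creation_time': [1398139000.0, np.nan, 1363735000.0],
    'invited_by_user_id': [10803.0, np.nan, np.nan],
})

# A non-null last session means the user logged in at least once.
users['has_logged_in'] = users['last_session_creation_time'].notna()

# A non-null inviter id means the user joined through an invitation.
users['was_invited'] = users['invited_by_user_id'].notna()

# Summing booleans counts the True values in each column.
print(users[['has_logged_in', 'was_invited']].sum())
```

Applied to the real `df_users`, the two sums should reproduce the non-null counts reported by `df_users.info()`.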

In [7]:
# Inspect the value distributions of the categorical and binary features
for feature in ['creation_source', 'opted_in_to_mailing_list', 'enabled_for_marketing_drip']:
    print('{} values:'.format(feature))
    print(df_users[feature].value_counts(), end='\n\n')

creation_source values:
ORG_INVITE            4254
GUEST_INVITE          2163
PERSONAL_PROJECTS     2111
SIGNUP                2087
SIGNUP_GOOGLE_AUTH    1385
Name: creation_source, dtype: int64

opted_in_to_mailing_list values:
0    9006
1    2994
Name: opted_in_to_mailing_list, dtype: int64

enabled_for_marketing_drip values:
0    10208
1     1792
Name: enabled_for_marketing_drip, dtype: int64



### Label "adopted user" 

In [8]:
# convert time_stamp to datetime
df_engage['time_stamp'] = pd.to_datetime(df_engage['time_stamp'])

In [9]:
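# Note: type() only reports the container (a Series);
# df_engage['time_stamp'].dtype would confirm the datetime conversion.
type(df_engage['time_stamp'])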

pandas.core.series.Series

In [10]:
# Keep only the calendar date, so that logins are compared per day
df_engage['date'] = df_engage['time_stamp'].dt.date

In [11]:
df_engage.head()

Unnamed: 0,time_stamp,user_id,visited,date
0,2014-04-22 03:53:30,1,1,2014-04-22
1,2013-11-15 03:45:04,2,1,2013-11-15
2,2013-11-29 03:45:04,2,1,2013-11-29
3,2013-12-09 03:45:04,2,1,2013-12-09
4,2013-12-25 03:45:04,2,1,2013-12-25


In [14]:
df_engage.groupby('user_id')

TypeError: 'bool' object is not callable
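
The definition from the challenge statement (three separate login days within at least one seven-day period) can be sketched as a small pure-Python helper. This is a minimal sketch, not necessarily the approach the notebook goes on to use. `is_adopted`, `min_days`, and `window_days` are hypothetical names, and the code assumes that a "seven-day period" means seven consecutive calendar days, so the first and last of the three login dates may be at most six days apart.

```python
from datetime import date, timedelta


def is_adopted(login_dates, min_days=3, window_days=7):
    """Return True if at least `min_days` distinct dates fall inside
    some span of `window_days` consecutive calendar days."""
    # Deduplicate so that several logins on one day count once, then sort.
    days = sorted(set(login_dates))
    for i in range(len(days) - min_days + 1):
        # days[i] .. days[i + min_days - 1] are min_days consecutive
        # distinct login days; check whether they fit in one window.
        if days[i + min_days - 1] - days[i] < timedelta(days=window_days):
            return True
    return False


# Login dates of user 2 as shown in the df_engage preview.
user_2_preview = [date(2013, 11, 15), date(2013, 11, 29),
                  date(2013, 12, 9), date(2013, 12, 25)]
print(is_adopted(user_2_preview))  # False: these logins are weeks apart

# Hypothetical dates: three distinct days spanning exactly seven calendar days.
print(is_adopted([date(2014, 1, 1), date(2014, 1, 3), date(2014, 1, 7)]))  # True
```

In the notebook, a per-user label could then be obtained with something like `df_engage.groupby('user_id')['date'].apply(is_adopted)`, since the `date` column already holds one calendar date per login row. If the intended reading of "seven-day period" allows a seven-day difference, the `<` comparison would need to become `<=`.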