__INSTRUCTIONS:__

A user table ( "takehome_users" ) with data on 12,000 users who signed up for the
product in the last two years.

A usage summary table ( "takehome_user_engagement") that has a row for each day that a user logged into the product. 

Defining an "adopted user" as a user who has logged into the product on three separate days in at least one sevenday period, identify which factors predict future user adoption.

Please send us a brief writeup of your findings (the more concise, the better no more
than one page), along with any summary tables, graphs, code, or queries that can help us understand your approach. Please note any factors you considered or investigation you did, even if they did not pan out. Feel free to identify any further research or data you think would be valuable.

__1. Perform Cleaning, EDA and Visualizations__

In [1]:
import pandas as pd
import numpy as np
import timestring
from datetime import datetime, timedelta
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


# disable warnings
import warnings
warnings.filterwarnings('ignore')

# converting csv file into dataframe
engage_df = pd.read_csv('takehome_user_engagement.csv', infer_datetime_format=True)
engage_df.dropna(inplace=True)

cols = list(pd.read_csv("takehome_users.csv", nrows =1))
user_df = pd.read_csv('takehome_users.csv', usecols =[i for i in cols if i != 'last_session_creation_time']) # removing last_session_creation_time column as it does not contain date time info
user_df.dropna(inplace=True)

In [4]:
# converting time_stamp column from string into datetime format
engage_df['timestamp'] = engage_df['time_stamp'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
engage_df.drop('time_stamp', axis=1, inplace=True)

In [5]:
# getting list of user_id's that have at least 3 visits
grouped_df = engage_df.groupby('user_id').count()
more_than_3 = grouped_df[grouped_df['visited'] >= 3]
more_than_3_ids = list(more_than_3.index)

# check_adopted function takes in a user_id and checks if customer was adopted (3 logins 
# within 7 days) or not; if customer is adopted, append user_id to adopted_list
adopted_list = []
def check_adopted(user_id):
    user_id_df = engage_df[engage_df['user_id'] == user_id]
    current_date = user_id_df['timestamp'].iloc[0]
        
    for x in range(len(user_id_df)-2):
        week_forward = current_date + timedelta(days=7)
        if week_forward > user_id_df['timestamp'].iloc[x+2]:
            adopted_list.append(user_id)
            break
        else:
            current_date = user_id_df['timestamp'].iloc[x]

            
for user_ids in more_than_3_ids:
    check_adopted(user_ids)

In [6]:
# creating new column 'adopted' in user_df that indicates whether customer was adopted or not
user_df['adopted'] = user_df['object_id'].apply(lambda x: x in adopted_list)

In [7]:
# updating categorical columns with get_dummies
user_df = pd.get_dummies(user_df, columns=['creation_source'])

In [8]:
user_df.head()

Unnamed: 0,object_id,creation_time,name,email,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted,creation_source_GUEST_INVITE,creation_source_ORG_INVITE
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,1,0,11,10803.0,False,1,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,0,0,1,316.0,False,0,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,0,0,94,1525.0,False,0,1
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,0,0,1,5151.0,False,1,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,0,0,193,5240.0,False,1,0


In [9]:
# segregating features and labels
X = user_df.drop(['object_id', 'creation_time', 'name', 'email', 'adopted', 'invited_by_user_id'], axis=1)
y = user_df['adopted']

# feature extraction
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(X, y)

# print out feature importance scores
for x in range(len(X.columns)):
    print('Score for {0}: {1}'.format(X.columns[x], round(fit.scores_[x], 3)))


Score for opted_in_to_mailing_list: 0.669
Score for enabled_for_marketing_drip: 0.097
Score for org_id: 2522.815
Score for creation_source_GUEST_INVITE: 10.51
Score for creation_source_ORG_INVITE: 5.344


__Summary:__

My first step was to determine the users that were considered 'adopted users' (3 days of logins over a 7 day period) and creating a column to identify these users.  Next, I segregated the features and labels and performed feature extraction using SelectKBest to determine the features that were most important in determining whether a customer would be an adopted user or not.  From the results, org_id is overwhelmingly the most important feature, followed by creation source.