## Relax Take-Home Challenge
### C. Bonfield (Springboard Data Science Career Track)

In this notebook, I present my solution to the interview challenge outlined in the PDF contained in this repository. 

In [1]:
# Import statements (standard)
import json
import math
import numpy as np
import pandas as pd

from plotly import tools
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)

# Import statements (ML)
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, recall_score

We are provided with two tables containing the following fields:

(1). user table: 'takehome_users.csv'
   - `name`: user's name
   - `object_id`: user's id
   - `email`: email address
   - `creation_source`: how the account was created
   - `creation_time`: when user created account
   - `last_session_creation_time`: timestamp of last login (epoch time)
   - `opted_in_to_mailing_list`: opted into marketing emails?
   - `enabled_for_marketing_drip`: on regular marketing email drip?
   - `org_id`: organization (group of users) that a given user belongs to
   - `invited_by_user_id`: id corresponding to user that invited them to join the site

(2). usage summary table: 'takehome_user_engagement.csv' (row for each day a user has loggedf into the product)
   - `time_stamp`: time stamp corresponding to login
   - `user_id`: user's id (match with `object_id` in other table)
   - `visited`: indicator (all ones, will drop)

#### Loading/Preparing Data

In [2]:
# Load data.
user_df = pd.read_csv('takehome_users.csv', encoding='cp1252')
engagement_df = pd.read_csv('takehome_user_engagement.csv')

engagement_df.drop('visited', axis=1, inplace=True)
engagement_df.columns = ['login_time', 'user_id']
engagement_df.login_time = pd.to_datetime(engagement_df.login_time)

In [3]:
user_df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [4]:
engagement_df.head(20)

Unnamed: 0,login_time,user_id
0,2014-04-22 03:53:30,1
1,2013-11-15 03:45:04,2
2,2013-11-29 03:45:04,2
3,2013-12-09 03:45:04,2
4,2013-12-25 03:45:04,2
5,2013-12-31 03:45:04,2
6,2014-01-08 03:45:04,2
7,2014-02-03 03:45:04,2
8,2014-02-08 03:45:04,2
9,2014-02-09 03:45:04,2


Now, we need to identify "adopted users". As stated in the instructions, this is defined as a user who *has logged into the product on three separate days in at least one seven-day period*. To identify adopted users, we need only make use of the engagement table. 

As a use case, consider `user_id` = 2. There are a few seven-day intervals that would qualify this user as "adopted", but the first time this occurs starts on 2/3/2014 and ends on 2/9/2014. 

In [5]:
def find_adopted_users(df):
    """
    Identify adopted users. We define adopted users as users who have logged into the product on 
    three separate days in at least one seven-day period.
    
    Note: Probably not the most efficient way of accomplishing this task, but it works!
    
    Inputs:
        df: engagement table (first column: login_time, second column: user_id)
    
    Returns:
        adopted_df: adopted users (first column: user_id, second column: is_adopted)
    """
    
    unique_users = df.user_id.unique()
    adopted_dict = dict()
    
    for u in unique_users:
        #print(u)
        is_adopted = 0
        user_df = df.loc[df.user_id == u].reset_index()
        
        for i in range(len(user_df.index)):
            if (i == (len(user_df.index)-2)) or (i == (len(user_df.index)-1)):
                continue
            else:
                if (user_df.login_time[i+1] - user_df.login_time[i]).days < 7:
                    if (user_df.login_time[i+2] - user_df.login_time[i]).days <= 7:
                        is_adopted = 1
                        break
                    else:
                        continue
                        
        adopted_dict[u] = is_adopted
        
    return adopted_dict

In [6]:
ad_dict = find_adopted_users(engagement_df)
ad_df = pd.DataFrame.from_dict(ad_dict, orient='index').reset_index()
ad_df.columns=['user_id', 'is_adopted']

Just to make sure that our code did as we intended, let's examine a couple of the adopted users.

In [7]:
ad_users_list = ad_df.loc[ad_df.is_adopted==1].user_id.tolist()
#print(ad_users_list)

In [8]:
engagement_df.loc[engagement_df.user_id==10].head(10)

Unnamed: 0,login_time,user_id
20,2013-01-16 22:08:03,10
21,2013-01-22 22:08:03,10
22,2013-01-30 22:08:03,10
23,2013-02-04 22:08:03,10
24,2013-02-06 22:08:03,10
25,2013-02-14 22:08:03,10
26,2013-02-17 22:08:03,10
27,2013-02-19 22:08:03,10
28,2013-02-26 22:08:03,10
29,2013-03-01 22:08:03,10


In [9]:
engagement_df.loc[engagement_df.user_id==1506].head(10)

Unnamed: 0,login_time,user_id
25271,2013-10-25 17:27:05,1506
25272,2013-11-11 17:27:05,1506
25273,2013-11-15 17:27:05,1506
25274,2013-11-18 17:27:05,1506
25275,2013-11-21 17:27:05,1506
25276,2013-11-23 17:27:05,1506
25277,2013-11-27 17:27:05,1506
25278,2013-12-12 17:27:05,1506
25279,2013-12-16 17:27:05,1506


In [10]:
engagement_df.loc[engagement_df.user_id==11036].head(10)

Unnamed: 0,login_time,user_id
194472,2014-03-01 12:01:05,11036
194473,2014-03-02 12:01:05,11036
194474,2014-03-09 12:01:05,11036
194475,2014-03-12 12:01:05,11036
194476,2014-04-09 12:01:05,11036
194477,2014-04-11 12:01:05,11036
194478,2014-04-12 12:01:05,11036
194479,2014-05-05 12:01:05,11036
194480,2014-05-08 12:01:05,11036
194481,2014-05-21 12:01:05,11036


We can confirm visually that these users would indeed be classified as "adopted" based on our definition. With that task performed, let's add the `is_adopted` column to our user table so that we may do a bit of exploratory analysis!

In [11]:
# Merge dataframes.
complete_df = user_df.merge(right=ad_df, left_on='object_id', right_on='user_id')
complete_df.drop('user_id', axis=1, inplace=True) # drop duplicate column

In [12]:
complete_df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,is_adopted
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0
1,2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,1
2,3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,0
3,4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,0
4,5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,0


#### Exploratory Data Analysis

First, let's see if we can deduce anything between our categorical/boolean columns (`creation_source`, `opted_in_to_mailing_list`, `enabled_for_marketing_drip`) and `is_adopted`. (In addition to visualizing the differences, we'll also perform a sequence of two proportion z-tests to determine if there is are statistically significant differences here. The null hypothesis will be that there is no difference in the proportions.)

In [13]:
import math 
from scipy.stats import norm

def two_proportion_ztest(p1, p2, n1, n2, two_sided=True):
    
    z = abs((p1 - p2) / math.sqrt((p1*n1 + p2*n2) / (n1+n2) * ((1 / n1) + (1 / n2))))
    
    if two_sided:
        p = 2.* (1. - norm.cdf(z, loc=0, scale=1))
    else:
        p = 1. - norm.cdf(z, loc=0, scale=1)
    
    return z, p 

In [14]:
# First, examine the distribution of creation sources. 
cs_counts = complete_df.creation_source.value_counts()
classes = list(cs_counts.index)
counts = np.array(cs_counts)

In [15]:
trace = go.Bar(
    x=classes,
    y=counts,
    marker = dict(color='blue')
)

data=[trace]

layout = dict(title = 'Creation Sources',
              yaxis = dict(title = 'Count'),
              xaxis = dict(autorange='reversed',
                           tickfont=dict(size=10))
              )

fig = dict(data=data, layout=layout)
iplot(fig, filename='cs-counts')

In [16]:
# Now, examine relative frequency of is_adopted for each creation source.
cs_dict = dict() 

for cs in classes:
    cs_df = complete_df.loc[complete_df.creation_source == cs]
    isa_freq = cs_df.is_adopted.value_counts()[1] / len(cs_df.index)
    cs_dict[cs] = isa_freq

In [17]:
trace = go.Bar(
    x=list(cs_dict.keys()),
    y=list(cs_dict.values()),
    marker = dict(color='blue')
)

data=[trace]

layout = dict(title = 'RAF within Creation Sources',
              yaxis = dict(title = 'Relative Frequency (within CS)'),
              xaxis = dict(autorange='reversed',
                           tickfont=dict(size=10))
              )

fig = dict(data=data, layout=layout)
iplot(fig, filename='cs-isa-freq')

In [18]:
# Perform two proportion z test between all creation_source classes.
for i in range(5):
    for j in range(5):
        if i < j:
            print(classes[i], classes[j])
            z, p = two_proportion_ztest(cs_dict[classes[i]], cs_dict[classes[j]], counts[i], counts[j])
            print(z, p)
        else:
            continue

ORG_INVITE SIGNUP
1.73994302689 0.0818690224843
ORG_INVITE GUEST_INVITE
3.83332924932 0.000126420563511
ORG_INVITE SIGNUP_GOOGLE_AUTH
0.551756247657 0.581115367103
ORG_INVITE PERSONAL_PROJECTS
2.57588868102 0.00999828296546
SIGNUP GUEST_INVITE
4.90949608701 9.13107217171e-07
SIGNUP SIGNUP_GOOGLE_AUTH
0.937437592355 0.348533511454
SIGNUP PERSONAL_PROJECTS
3.65136317551 0.00026085205911
GUEST_INVITE SIGNUP_GOOGLE_AUTH
3.59693674291 0.000321986746958
GUEST_INVITE PERSONAL_PROJECTS
0.342707773148 0.731818314364
SIGNUP_GOOGLE_AUTH PERSONAL_PROJECTS
2.6672869458 0.00764663543724


The way that I spewed out the significance tests is probably a bit messy (sorry), but let's see what we can glean from them. 
* With the exception of `PERSONAL_PROJECTS`, the differences between `GUEST_INVITE` and the other classes are statistically significant. Similarly, the differences between `PERSONAL_PROJECTS` and all of the other classes (besides `GUEST_INVITE`) are statistically significant. Thus, the conclusion that we may have drawn by eye (`GUEST_INVITE` and `PERSONAL_PROJECTS` have a significantly higher adoption frequency) is, in fact, consistent with our statistics.
* A direct consequence of the previous point is that users that registered via the website, through an organization invitation (as a full member), or by using Google authentication lagged behind the other two avenues. That said, it may be worth evaluating if there is a way to increase adoption for those creation sources.

Next, let's see if being on the mailing list helps with adoption.

In [19]:
# Check opted_in_to_mailing_list.
ml_dict = dict() 

ml_dict['N'] = complete_df.loc[complete_df.opted_in_to_mailing_list == 0].is_adopted.sum() / \
               len(complete_df.loc[complete_df.opted_in_to_mailing_list == 0].index)
ml_dict['Y'] = complete_df.loc[complete_df.opted_in_to_mailing_list == 1].is_adopted.sum() / \
               len(complete_df.loc[complete_df.opted_in_to_mailing_list == 1].index)

In [20]:
trace = go.Bar(
    x=list(ml_dict.keys()),
    y=list(ml_dict.values()),
    marker = dict(color='red')
)

data=[trace]

layout = dict(title = 'RAF within Mailing List',
              yaxis = dict(title = 'Relative Adoption Frequency (within ML)'),
              xaxis = dict(title='Opted into Mailing List?',
                           tickfont=dict(size=10))
              )

fig = dict(data=data, layout=layout)
iplot(fig, filename='ml-isa-freq')

In [21]:
ml_n = len(complete_df.loc[complete_df.opted_in_to_mailing_list == 0].index)
ml_y = len(complete_df.loc[complete_df.opted_in_to_mailing_list == 1].index)

z, p = two_proportion_ztest(ml_dict['N'], ml_dict['Y'], ml_n, ml_y)
print('Mailing List (p-value): %f' % p)

Mailing List (p-value): 0.526319


We fail to reject the null hypothesis, meaning that we are unable to conclude that being on the mailing list makes a significant difference. 

Does being on the regular marketing email drip help matters?

In [22]:
# Check enabled_for_marketing_drip.
md_dict = dict() 

md_dict['N'] = complete_df.loc[complete_df.enabled_for_marketing_drip == 0].is_adopted.sum() / \
               len(complete_df.loc[complete_df.enabled_for_marketing_drip == 0].index)
md_dict['Y'] = complete_df.loc[complete_df.enabled_for_marketing_drip == 1].is_adopted.sum() / \
               len(complete_df.loc[complete_df.enabled_for_marketing_drip == 1].index)

In [23]:
trace = go.Bar(
    x=list(md_dict.keys()),
    y=list(md_dict.values()),
    marker = dict(color='green')
)

data=[trace]

layout = dict(title = 'RAF within Marketing Drip',
              yaxis = dict(title = 'Relative Adoption Frequency (within MD)'),
              xaxis = dict(title='Enabled Marketing Drip?',
                           tickfont=dict(size=10))
              )

fig = dict(data=data, layout=layout)
iplot(fig, filename='md-isa-freq')

In [24]:
md_n = len(complete_df.loc[complete_df.enabled_for_marketing_drip == 0].index)
md_y = len(complete_df.loc[complete_df.enabled_for_marketing_drip == 1].index)

z, p = two_proportion_ztest(md_dict['N'], md_dict['Y'], md_n, md_y)
print('Marketing Drip (p-value): %f' % p)

Marketing Drip (p-value): 0.716464


Again, the answer is no! 

Lastly, let's see if being invited by another user makes for higher adoption rates. Note that this column likely overlaps with the combination of `PERSONAL_PROJECTS`, `GUEST_INVITE`, and `ORG_INVITE` in the `creation_source` column, as one would presumably have to be invited by other users to create an account through those sources.  

In [25]:
# Create new column 'is_invited'. 
complete_df['is_invited'] = complete_df.invited_by_user_id.apply(lambda x: 0 if np.isnan(x) else 1)

ii_dict = dict() 

ii_dict['N'] = complete_df.loc[complete_df.is_invited == 0].is_adopted.sum() / \
               len(complete_df.loc[complete_df.is_invited == 0].index)
ii_dict['Y'] = complete_df.loc[complete_df.is_invited == 1].is_adopted.sum() / \
               len(complete_df.loc[complete_df.is_invited == 1].index)

In [26]:
trace = go.Bar(
    x=list(ii_dict.keys()),
    y=list(ii_dict.values()),
    marker = dict(color='blue')
)

data=[trace]

layout = dict(title = 'RAF (if invited)',
              yaxis = dict(title = 'Relative Adoption Frequency (within II)'),
              xaxis = dict(title='Invited by Another User?',
                           tickfont=dict(size=10))
              )

fig = dict(data=data, layout=layout)
iplot(fig, filename='ii-isa-freq')

In [27]:
ii_n = len(complete_df.loc[complete_df.is_invited == 0].index)
ii_y = len(complete_df.loc[complete_df.is_invited == 1].index)

z, p = two_proportion_ztest(ii_dict['N'], ii_dict['Y'], ii_n, ii_y)
print('Invited by Other User (p-value): %f' % p)

Invited by Other User (p-value): 0.021593


Finally, we're in business! The increase that we see between the two classes (invite vs. no invite) does appear to be real.

We may also explore whether the size of an organization (group of users) has anything to do with adoption.

In [28]:
org_sizes = dict()

for oi in complete_df.org_id.unique():
    org_df = complete_df.loc[complete_df.org_id == oi]
    org_sizes[oi] = len(org_df.index)
    
org_sizes_df = pd.DataFrame.from_dict(org_sizes, orient='index').reset_index()
org_sizes_df.columns = ['org_id', 'org_size']

In [29]:
# Merge organization dfs. 
merge_os_df = org_sizes_df.merge(right=complete_df[['org_id', 'is_adopted']], left_on='org_id', right_on='org_id')

In [30]:
os_isa_dict = dict()

for oi in merge_os_df.org_id.unique():
    org_df = merge_os_df.loc[merge_os_df.org_id == oi]
    os_isa_dict[int(org_df.org_size.unique())] = org_df.is_adopted.sum() / len(org_df.index)

In [31]:
# Create a trace
trace = go.Scatter(
    x = list(os_isa_dict.keys()),
    y = list(os_isa_dict.values()),
    mode = 'markers'
)

data = [trace]

layout = dict(title = 'Effect of Group Size on RAF',
              yaxis = dict(title = 'Relative Adoption Frequency (within groups)'),
              xaxis = dict(title='Group Size',
                           tickfont=dict(size=10))
              )

fig = dict(data=data, layout=layout)
iplot(fig, filename='os-isa-freq')

Interesting! It does appear that larger groups do appear to inhibit relative adoption frequency. 

#### Logistic Regression

We can also use logistic regression to generate a quick and dirty predictive model! Let's do that and see if we get anything more than we did from our EDA.

In [32]:
complete_df = complete_df.merge(right=org_sizes_df, left_on='org_id', right_on='org_id')

In [33]:
complete_df.head()

Unnamed: 0,object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,is_adopted,is_invited,org_size
0,1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,0,1,56
1,151,2013-04-12 11:45:27,Goncalves Melissa,MelissaRibeiroGoncalves@yahoo.com,SIGNUP,1365767000.0,0,0,11,,0,0,56
2,179,2013-04-14 20:47:44,Millen Kai,KaiMillen@yahoo.com,ORG_INVITE,1366059000.0,0,0,11,7701.0,0,1,56
3,254,2014-03-04 19:52:58,Simonsen Niels,NielsHSimonsen@jourrapide.com,PERSONAL_PROJECTS,1394308000.0,1,0,11,,0,0,56
4,518,2014-01-26 03:42:20,Bennett Anthony,AnthonyDBennett@yahoo.com,ORG_INVITE,1401767000.0,1,0,11,6882.0,1,1,56


In [34]:
def categorical_dummies(df_col, labels):
    """
    Given a column containing a categorical feature, df_col, and labels, we generate
    a set of boolean dummy variables. 
    """
    dummies = pd.DataFrame(np.zeros((len(df_col.index), len(labels))))
    
    for i in range(len(labels)):
        dummies.iloc[:,i] = df_col.apply(lambda x: 1 if x == i else 0)
    
    dummies.columns = ['is_' + s for s in labels]
    
    return dummies

In [35]:
# Drop all irrelevant columns, create dummy variables where necessary.
cs_le = LabelEncoder()
complete_df.creation_source = cs_le.fit_transform(complete_df.creation_source)
cs_labels = cs_le.classes_

cs_dummies = categorical_dummies(complete_df.creation_source, cs_labels)
complete_df.drop(['creation_source'], axis=1, inplace=True)
complete_df = complete_df.merge(cs_dummies, left_index=True, right_index=True)

# Drop one of the dummy columns to prevent issues with multicollinearity. 
complete_df.drop('is_PERSONAL_PROJECTS', axis=1, inplace=True)

# Drop other columns that will not be helpful.
complete_df.drop(['object_id','name', 'email', 'org_id', 'invited_by_user_id'], axis=1, inplace=True)

# Generate additional feature from creation_date and last_session_creation_time. Essentially, this is a measure
# of how long an active user has had an account.
# 
# NOTE: I removed this from the model, as it actually dominates our regression (and is kind of obvious - the longer 
#       a user has been signed up, the more likely it is that they will have been "adopted").
#complete_df['created_to_last_time'] = (pd.to_datetime(complete_df.last_session_creation_time, unit='s') - pd.to_datetime(complete_df.creation_time))
#complete_df['created_to_last_time'] = complete_df.created_to_last_time.astype('timedelta64[D]').astype(int)
complete_df.drop(['creation_time', 'last_session_creation_time'], axis=1, inplace=True)

# Split out target variable, drop from feature matrix.
is_adopted = complete_df.is_adopted
complete_df.drop(['is_adopted'], axis=1, inplace=True)

In [36]:
complete_df.head()

Unnamed: 0,opted_in_to_mailing_list,enabled_for_marketing_drip,is_invited,org_size,is_GUEST_INVITE,is_ORG_INVITE,is_SIGNUP,is_SIGNUP_GOOGLE_AUTH
0,1,0,1,56,1,0,0,0
1,0,0,0,56,0,0,1,0
2,0,0,1,56,0,1,0,0
3,1,0,0,56,0,0,0,0
4,1,0,1,56,0,1,0,0


In [37]:
# Random train/test split. 
X_train, X_test, y_train, y_test = train_test_split(complete_df.values, is_adopted.values, test_size=0.2, random_state=42)

In [38]:
# Standardize all input variables. 
scl = StandardScaler()
X_train_scl = scl.fit_transform(X_train)
X_test_scl = scl.transform(X_test)


Data with input dtype int64 was converted to float64 by StandardScaler.



In [39]:
from sklearn.linear_model import LogisticRegression

params = {'C': [0.001, 0.01, 0.1, 1., 10., 100., 1000.]}

logistic = LogisticRegression(solver='sag')

gs = GridSearchCV(logistic, params, n_jobs=1)
gs.fit(X_train_scl, y_train)
log_reg_params = gs.best_params_

logistic = LogisticRegression(**log_reg_params)

# Fit/predict.
logistic.fit(X_train_scl, y_train)
y_predict = logistic.predict(X_test_scl)

In [40]:
print('Accuracy Score: %f' % accuracy_score(y_test, y_predict))
#print('Recall: %f' % recall_score(y_test, y_predict))

Accuracy Score: 0.820963


In [41]:
# Presented a bit nicer.
print('ODDS RATIOS: ')
cols = complete_df.columns
odds_r = np.exp(logistic.coef_)[0]
sort_indices = np.argsort(odds_r)[::-1]

for i in sort_indices:
    print(cols[i]+': ', odds_r[i])

ODDS RATIOS: 
is_GUEST_INVITE:  1.03596697417
enabled_for_marketing_drip:  1.00688699876
opted_in_to_mailing_list:  1.00554728198
is_invited:  1.00221272022
is_SIGNUP_GOOGLE_AUTH:  0.976901636099
is_ORG_INVITE:  0.97464824322
is_SIGNUP:  0.968906963332
org_size:  0.890333947806


#### Conclusions

Based on our logit and some of the hypothesis tests run earlier, we can make the following conclusions about user adoption:
* As I mentioned in some of my code comments, I found that the longer a user was registered, the more likely they were to be considered "adopted" (the odds ratio for that feature was on the order of 3000!). This, in my opinion, is fairly obvious, as the longer a person has had his/her account, the more likely it is that they would have had a spurt of logins that would earn them the "adopted" title. That being said, perhaps accounting for the length of time since the last "period of adoption" (i.e., a user logged in three or more times in a seven-day period) would be more useful here. If a user was "adopted" three years ago and did not login since, would we really consider that a success? 
* The results from our hypothesis tests are similar in spirit to the odds ratios spewed out by our logit. Specifically, it does appear that creating an account via `GUEST_INVITE` or `PERSONAL_PROJECTS` (our reference class for the dummied `creation_source` column in our logit) slightly improves our chances at having an "adopted" user. The conclusion that we drew before about `is_invited`, however, is not mirrored in the corresponding odds ratio from our logit, but that may stem from the overlap between that column and `is_GUEST_INVITE` and `is_ORG_INVITE`. 
* The most marked change in odds occured for the `org_size` feature (although the corresponding decrease in odds ratio was relatively small). That said, it would be worth exploring whether splitting larger groups/organizations into smaller subgroups would increase adoption. If the groups are too large for a user to feel that they are useful (or heard within the organization, for that matter), we may have a reasonable explanation for why this may be the case.
* Interestingly, the pair of features that are presumably in place to boost engagement (marketing drip, mailing list) lead to a significant increase in odds of adoption. Thus, it may be useful to re-evaluate the efficacy of these strategies. 

In closing, I would say that the best ways to go about increasing adoption include (1) encouraging users to send out personal invitations and (2) reducing the size of larger organizations. Additionally, the company should look into finding better ways to keep users engaged, as the mailing list and marketing drip campaigns are not doing much for them.