## Relax Data Science Challenge: Project Overview

This notebook contains my analysis and insights for the Relax Inc. Data Science Challenge. The goal of this project is to understand user behavior and identify key drivers of product adoption. The dataset provided consists of two CSV files:

- takehome_users.csv
- takehome_user_engagement.csv

These datasets include the following information:

User Information (takehome_users.csv)
Contains data on 12,000 users who registered over the past two years. Key fields include:

- object_id: Unique user identifier
- name and email: User contact information
- creation_source: Method by which the account was created (e.g., via Google, organization invite, personal project)
- creation_time: Timestamp of account creation
- last_session_creation_time: Unix timestamp of the user's most recent login
- opted_in_to_mailing_list and enabled_for_marketing_drip: Email marketing engagement flags
- org_id: Organization to which the user belongs
- invited_by_user_id: ID of the user who issued the invitation, if applicable

User Engagement (takehome_user_engagement.csv)
Logs daily user interactions with the product, with one row per user per login event.



## Challenge Objective
An "adopted user" is defined as someone who logged into the platform on at least three separate days within a single seven-day window. The objective is to determine which user attributes or behaviors are most predictive of future product adoption.

While the suggested time for the challenge is 1–2 hours, additional time may be taken to refine the analysis. Deliverables should include a concise summary (one page max), accompanied by code, visualizations, and any other relevant insights or exploration—even if certain avenues didn’t yield strong results. The final submission may also include recommendations for further data or research that could improve predictive performance.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline


In [2]:
df_users = pd.read_csv('takehome_users.csv', encoding='latin-1', index_col=0)
df_users.head()

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0
4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0
5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0


In [3]:
df_engagement = pd.read_csv('takehome_user_engagement.csv')
df_engagement.head()

,time_stamp,user_id,visited
0,2014-04-22 03:53:30,1,1
1,2013-11-15 03:45:04,2,1
2,2013-11-29 03:45:04,2,1
3,2013-12-09 03:45:04,2,1
4,2013-12-25 03:45:04,2,1


In [4]:
# convert from string to datetime format
df_engagement['time_stamp'] = pd.to_datetime(df_engagement['time_stamp'])

In [5]:
# set the time_stamp column as the dataframe index
df_engagement = df_engagement.set_index('time_stamp')
df_engagement.head()

time_stamp,user_id,visited
2014-04-22 03:53:30,1,1
2013-11-15 03:45:04,2,1
2013-11-29 03:45:04,2,1
2013-12-09 03:45:04,2,1
2013-12-25 03:45:04,2,1


With the DataFrame now indexed by datetime, we can use the .resample() method to aggregate user activity by week. Because resample('1W') bins logins into fixed calendar weeks (ending on Sunday), it will miss any qualifying 7-day window that straddles two bins, but it should still capture the majority of users who meet the 3-visits-in-a-week threshold. For this exploratory analysis, that level of accuracy should be sufficient. We'll begin by grouping the data by user_id, then apply weekly resampling and sum the visited column to flag adopted users.

In [6]:
# group data by user_id, resample into weekly dates, and sum the visited column
df_engagement = df_engagement.groupby('user_id').resample('1W').sum()
df_engagement.head()

user_id,time_stamp,user_id,visited
1,2014-04-27,1,1
2,2013-11-17,2,1
2,2013-11-24,0,0
2,2013-12-01,2,1
2,2013-12-08,0,0
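
Fixed calendar-week bins can split a qualifying run of logins across two weeks. As a cross-check, the adoption definition can be applied exactly by comparing each distinct login day with the login day two positions later. The following is a minimal sketch rather than the method used in this notebook; the helper name is_adopted and the toy timestamps are hypothetical.

```python
import pandas as pd

def is_adopted(timestamps, min_days=3, window_days=7):
    """Hypothetical helper: True if logins occur on at least `min_days`
    distinct calendar days within any `window_days`-day span."""
    # Reduce login timestamps to sorted, de-duplicated calendar days
    days = (pd.Series(pd.to_datetime(list(timestamps)))
              .dt.normalize()
              .drop_duplicates()
              .sort_values()
              .reset_index(drop=True))
    # Span from each login day to the (min_days - 1)-th login day after it
    span = days.shift(-(min_days - 1)) - days
    # Comparisons against NaT (too few later logins) evaluate to False
    return bool((span < pd.Timedelta(days=window_days)).any())

# Toy example: Friday, Saturday, and Monday logins span two calendar weeks
straddling = ['2014-01-03 09:00', '2014-01-04 10:00', '2014-01-06 08:30']
# Toy example: three login days, but the first and third are 7 days apart
too_sparse = ['2014-01-01', '2014-01-03', '2014-01-08']

print(is_adopted(straddling))
print(is_adopted(too_sparse))
```

On the raw (un-resampled) engagement log, the helper could be applied per user with a groupby on user_id over the time_stamp column.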


In [7]:
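# keep user-weeks with at least three logins (weekly sum of visited > 2)
df_adoption = df_engagement[df_engagement['visited'] > 2]
# the summed user_id column is meaningless after aggregation; drop it from df_engagement
del df_engagement['user_id']
df_adoption.head()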

user_id,time_stamp,user_id,visited
2,2014-02-09,6,3
10,2013-03-03,30,3
10,2013-04-14,30,3
10,2013-04-28,30,3
10,2013-05-05,40,4


In [8]:
# Extract user IDs from the index of adopted users
adopted_users_list = df_adoption.index.get_level_values(0).unique().tolist()

# Check the number of unique adopted users
len(adopted_users_list)

1445

Out of 12,000 users, 1,445 (~12%) have adopted the product. We'll now label these users in the df_users dataframe.

In [9]:
# Mark adopted users in the dataframe
df_users['adopted'] = df_users.index.isin(adopted_users_list)

# Preview
df_users.head(10)

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,invited_by_user_id,adopted
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,10803.0,False
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,316.0,True
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,1525.0,False
4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,5151.0,False
5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,5240.0,False
6,2013-12-17 03:37:06,Cunha Eduardo,EduardoPereiraCunha@yahoo.com,GUEST_INVITE,1387424000.0,0,0,197,11241.0,False
7,2012-12-16 13:24:32,Sewell Tyler,TylerSewell@jourrapide.com,SIGNUP,1356010000.0,0,1,37,,False
8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,,1,1,74,,False
9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,,0,0,302,,False
10,2013-01-16 22:08:03,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,1401833000.0,1,1,318,4143.0,True


## Data Preprocessing and EDA

Now that the adopted user flag has been added, I’ll explore the df_users DataFrame to identify any necessary data cleaning steps and better understand the available features.

In [10]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12000 entries, 1 to 12000
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   creation_time               12000 non-null  object 
 1   name                        12000 non-null  object 
 2   email                       12000 non-null  object 
 3   creation_source             12000 non-null  object 
 4   last_session_creation_time  8823 non-null   float64
 5   opted_in_to_mailing_list    12000 non-null  int64  
 6   enabled_for_marketing_drip  12000 non-null  int64  
 7   org_id                      12000 non-null  int64  
 8   invited_by_user_id          6417 non-null   float64
 9   adopted                     12000 non-null  bool   
dtypes: bool(1), float64(2), int64(3), object(4)
memory usage: 949.2+ KB


Two columns contain missing values. About a quarter of users (3,177 of 12,000) have no value in the last_session_creation_time field. Since this feature is likely correlated with adoption status (i.e., recently active users are probably adopters), including it would leak information about the target into the model and reduce the insight gained from the other features. For that reason, I’ve chosen to exclude it from further analysis.
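
The share of missing values per column can be computed directly rather than inferred from the info() counts. A minimal sketch on a toy frame (the values are made up for illustration; on the real data the same call would run on df_users):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_users; values are illustrative only
toy_users = pd.DataFrame({
    'last_session_creation_time': [1398139000.0, np.nan, 1363735000.0, np.nan],
    'invited_by_user_id': [10803.0, 316.0, np.nan, np.nan],
    'org_id': [11, 1, 94, 1],
})

# isna() marks missing cells; the column mean of that mask is the missing fraction
missing_share = toy_users.isna().mean().sort_values(ascending=False)
print(missing_share)
```

With the counts reported by info(), 12,000 - 8,823 = 3,177 users (about 26%) lack last_session_creation_time, and 12,000 - 6,417 = 5,583 (about 47%) lack invited_by_user_id.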

Next, I’ll take a closer look at the missing values in the invited_by_user_id column.

In [11]:
df_users['invited_by_user_id'].value_counts(ascending=False, dropna=False)

invited_by_user_id
NaN        5583
10741.0      13
2527.0       12
1525.0       11
2308.0       11
           ... 
2071.0        1
1390.0        1
5445.0        1
8526.0        1
5450.0        1
Name: count, Length: 2565, dtype: int64

The invited_by_user_id column indicates who referred the user, or contains a NaN if there was no inviter. For modeling purposes, I’ll transform this into a boolean feature that simply reflects whether or not a user was invited by someone else.

In [12]:
# Create a boolean column indicating if the user was invited, and drop the original ID column
df_users['invited_by_user'] = df_users['invited_by_user_id'].notna()
df_users.drop(columns='invited_by_user_id', inplace=True)

df_users.head(10)

object_id,creation_time,name,email,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted,invited_by_user
1,2014-04-22 03:53:30,Clausen August,AugustCClausen@yahoo.com,GUEST_INVITE,1398139000.0,1,0,11,False,True
2,2013-11-15 03:45:04,Poole Matthew,MatthewPoole@gustr.com,ORG_INVITE,1396238000.0,0,0,1,True,True
3,2013-03-19 23:14:52,Bottrill Mitchell,MitchellBottrill@gustr.com,ORG_INVITE,1363735000.0,0,0,94,False,True
4,2013-05-21 08:09:28,Clausen Nicklas,NicklasSClausen@yahoo.com,GUEST_INVITE,1369210000.0,0,0,1,False,True
5,2013-01-17 10:14:20,Raw Grace,GraceRaw@yahoo.com,GUEST_INVITE,1358850000.0,0,0,193,False,True
6,2013-12-17 03:37:06,Cunha Eduardo,EduardoPereiraCunha@yahoo.com,GUEST_INVITE,1387424000.0,0,0,197,False,True
7,2012-12-16 13:24:32,Sewell Tyler,TylerSewell@jourrapide.com,SIGNUP,1356010000.0,0,1,37,False,False
8,2013-07-31 05:34:02,Hamilton Danielle,DanielleHamilton@yahoo.com,PERSONAL_PROJECTS,,1,1,74,False,False
9,2013-11-05 04:04:24,Amsel Paul,PaulAmsel@hotmail.com,PERSONAL_PROJECTS,,0,0,302,False,False
10,2013-01-16 22:08:03,Santos Carla,CarlaFerreiraSantos@gustr.com,ORG_INVITE,1401833000.0,1,1,318,True,True


## Email Addresses

The dataset includes a unique email address for every user. While the specific addresses themselves aren't useful for modeling, analyzing the most frequently occurring email domains might reveal patterns about user demographics or behaviors. Let’s extract and examine those domains.

In [13]:
# Extract email domain directly and drop the original email column
df_users['email_domain'] = df_users['email'].str.split('@').str[1]
df_users.drop(columns='email', inplace=True)

# Display the top 10 most common email domains
df_users['email_domain'].value_counts().head(10)

email_domain
gmail.com         3562
yahoo.com         2447
jourrapide.com    1259
cuvox.de          1202
gustr.com         1179
hotmail.com       1165
rerwl.com            2
oqpze.com            2
qgjbc.com            2
dqwln.com            2
Name: count, dtype: int64

There appear to be six frequently occurring email domains among the users. To simplify this feature for modeling, we'll group the less common domains under a single category labeled 'Other', making it easier to treat this as a categorical variable.

In [14]:
# Define common domains
common_domains = ['gmail.com', 'yahoo.com', 'jourrapide.com', 'cuvox.de', 'gustr.com', 'hotmail.com']

# Use np.where to assign 'Other' or the actual domain
df_users['email'] = np.where(df_users['email_domain'].isin(common_domains),
                             df_users['email_domain'],
                             'Other')

# Drop the original domain column
df_users.drop(columns='email_domain', inplace=True)

# View domain distribution
df_users['email'].value_counts(ascending=False)


email
gmail.com         3562
yahoo.com         2447
jourrapide.com    1259
cuvox.de          1202
Other             1186
gustr.com         1179
hotmail.com       1165
Name: count, dtype: int64
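
The list of common domains was hard-coded from the counts above. As an alternative, it could be derived from a frequency threshold. This is a sketch on made-up addresses; the min_count value is an assumption, not something taken from the analysis.

```python
import pandas as pd

# Toy stand-in for the email column; addresses are made up
emails = pd.Series([
    'a@gmail.com', 'b@gmail.com', 'c@gmail.com',
    'd@yahoo.com', 'e@yahoo.com',
    'f@rerwl.com', 'g@oqpze.com',
])

# Take the part after the last '@'
domains = emails.str.rsplit('@', n=1).str[-1]

# Keep domains that occur at least min_count times (threshold is an assumption)
min_count = 2
counts = domains.value_counts()
common = counts[counts >= min_count].index

# Series.where keeps values where the condition holds and substitutes 'Other' elsewhere
grouped = domains.where(domains.isin(common), 'Other')
print(grouped.value_counts())
```

Judging by the counts shown earlier (six domains with at least 1,165 users, the rest with 2 or fewer), any threshold between 3 and 1,165 would reproduce the same six-domain grouping on the real data.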

In [15]:
df_users.head()

object_id,creation_time,name,creation_source,last_session_creation_time,opted_in_to_mailing_list,enabled_for_marketing_drip,org_id,adopted,invited_by_user,email
1,2014-04-22 03:53:30,Clausen August,GUEST_INVITE,1398139000.0,1,0,11,False,True,yahoo.com
2,2013-11-15 03:45:04,Poole Matthew,ORG_INVITE,1396238000.0,0,0,1,True,True,gustr.com
3,2013-03-19 23:14:52,Bottrill Mitchell,ORG_INVITE,1363735000.0,0,0,94,False,True,gustr.com
4,2013-05-21 08:09:28,Clausen Nicklas,GUEST_INVITE,1369210000.0,0,0,1,False,True,yahoo.com
5,2013-01-17 10:14:20,Raw Grace,GUEST_INVITE,1358850000.0,0,0,193,False,True,yahoo.com


## Machine Learning Data Prep

Next, I’ll set up the dataset for modeling by creating a new df_model DataFrame. It has the target variable, adopted_user, as the leading column and keeps object_id as the index to maintain user-level tracking.

In [36]:
df_model = pd.DataFrame({'adopted_user': df_users['adopted'].values}, index=df_users.index)
df_model.head()

object_id,adopted_user
1,False
2,True
3,False
4,False
5,False


In [37]:
# Select relevant features from df_users, shorten the column names, and add them to df_model
features = ['opted_in_to_mailing_list', 'enabled_for_marketing_drip', 'invited_by_user']
df_model = df_model.join(df_users[features].rename(columns={
    'opted_in_to_mailing_list': 'mailing_list',
    'enabled_for_marketing_drip': 'marketing_drip'
}))

df_model.head()

object_id,adopted_user,mailing_list,marketing_drip,invited_by_user
1,False,1,0,True
2,True,0,0,True
3,False,0,0,True
4,False,0,0,True
5,False,0,0,True


In [38]:
# convert org_id column from int to string
df_users['org_id'] = df_users['org_id'].astype('str')

# create the dummy variables
df_dummies = pd.get_dummies(df_users[['creation_source', 'org_id', 'email']])

# add dummy variables to df_model dataframe
df_model = pd.concat([df_model, df_dummies], axis=1)
df_model.head(3)

object_id,adopted_user,mailing_list,marketing_drip,invited_by_user,creation_source_GUEST_INVITE,creation_source_ORG_INVITE,creation_source_PERSONAL_PROJECTS,creation_source_SIGNUP,creation_source_SIGNUP_GOOGLE_AUTH,org_id_0,...,org_id_97,org_id_98,org_id_99,email_Other,email_cuvox.de,email_gmail.com,email_gustr.com,email_hotmail.com,email_jourrapide.com,email_yahoo.com
1,False,1,0,True,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2,True,0,0,True,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3,False,0,0,True,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


## Predictive Modeling

The next step involves building a predictive model. The scikit-learn estimator selection flowchart suggests SVC and ensemble classifiers for classification tasks with fewer than 100,000 samples. Therefore, I will start with a Support Vector Machine classifier and a random forest classifier.

To begin, I will divide the dataset into training and testing subsets:

In [39]:
# divide the data into label and features for use in ml models
X = df_model.iloc[:, 1:]
y = df_model.loc[:, 'adopted_user']

# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6, stratify=y)
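
Passing stratify=y keeps the share of adopters roughly equal in the training and test sets, which matters with only about 12% positives. A small sketch on synthetic labels that mimic the reported class balance (X_demo and y_demo are placeholders, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class balance reported above: 1,445 adopters out of 12,000
y_demo = np.zeros(12000, dtype=bool)
y_demo[:1445] = True
X_demo = np.arange(12000).reshape(-1, 1)  # placeholder feature

# Same call pattern as above; test_size defaults to 0.25 when not given
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=6, stratify=y_demo
)

print(len(y_te), y_te.sum())
print(f'overall rate = {y_demo.mean():.4f}, test rate = {y_te.mean():.4f}')
```

The default 25% test split gives 3,000 test rows, consistent with the support counts in the classification reports further down.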

## Support Vector Machine

In [45]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Create a pipeline to scale features and train an SVM classifier
svm_model = make_pipeline(StandardScaler(), 
                          SVC(class_weight='balanced', random_state=6))

# Train the SVM model
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred_svm = svm_model.predict(X_test)

# Print accuracy scores
print(f'Accuracy on training set = {svm_model.score(X_train, y_train):.4f}')
print(f'Accuracy on test set = {svm_model.score(X_test, y_test):.4f}')


Accuracy on training set = 0.6171
Accuracy on test set = 0.5680


Let's try to improve the hyperparameters with a randomized search with cross-validation (RandomizedSearchCV) and test the new parameters with the SVM.

## Tuned Model Performance

In [41]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.metrics import classification_report, confusion_matrix

# Define pipeline
svm_pipeline = make_pipeline(
    StandardScaler(),
    SVC(class_weight='balanced', probability=True, random_state=6)
)

# Define parameter distribution for RandomizedSearchCV
param_dist = {
    'svc__C': uniform(loc=0.1, scale=10),         # C values from 0.1 to 10.1
    'svc__gamma': uniform(loc=0.001, scale=0.5),  # gamma values from 0.001 to 0.501
    'svc__kernel': ['rbf']                        # keep kernel fixed for now
}

# Initialize search
random_search = RandomizedSearchCV(
    svm_pipeline,
    param_distributions=param_dist,
    n_iter=25,
    scoring='accuracy',
    cv=5,
    random_state=6,
    verbose=1,
    n_jobs=-1
)

# Fit model
random_search.fit(X_train, y_train)

# Predict on test set
y_pred = random_search.predict(X_test)

# Output results
print("Best parameters found:", random_search.best_params_)
print(f'Accuracy on training set = {random_search.score(X_train, y_train):.4f}')
print(f'Accuracy on test set = {random_search.score(X_test, y_test):.4f}')

# Detailed performance
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))



Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best parameters found: {'svc__C': 0.6447450782235463, 'svc__gamma': 0.36031861828209716, 'svc__kernel': 'rbf'}
Accuracy on training set = 0.9322
Accuracy on test set = 0.8283

Classification Report:
              precision    recall  f1-score   support

       False       0.88      0.93      0.90      2639
        True       0.15      0.09      0.11       361

    accuracy                           0.83      3000
   macro avg       0.52      0.51      0.51      3000
weighted avg       0.79      0.83      0.81      3000


Confusion Matrix:
[[2452  187]
 [ 328   33]]
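
To put the accuracy figure in context, the confusion matrix above can be turned into a few summary numbers, including the accuracy of a trivial model that always predicts "not adopted". This is a short worked calculation using only the values printed above.

```python
import numpy as np

# Confusion matrix reported above (rows = actual, columns = predicted; order False, True)
cm = np.array([[2452, 187],
               [328,   33]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tn + tp) / cm.sum()
baseline = cm[0].sum() / cm.sum()   # accuracy of always predicting "not adopted"
recall = tp / (tp + fn)             # share of actual adopters that were found
precision = tp / (tp + fp)          # share of predicted adopters that were correct

print(f'accuracy={accuracy:.4f} baseline={baseline:.4f} '
      f'recall={recall:.4f} precision={precision:.4f}')
```

The tuned SVM's test accuracy (about 0.828) sits below the majority-class baseline (about 0.880), so accuracy alone overstates how well the model separates adopters from non-adopters.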


## Random Forest

In [42]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from scipy.stats import randint

# Random Forest base
rf = RandomForestClassifier(class_weight='balanced_subsample', random_state=6)

# Parameter space (smaller for faster search)
param_dist_rf = {
    'n_estimators': randint(50, 150),         # fewer trees for speed
    'max_depth': randint(5, 20),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5),
    'max_features': ['sqrt', 'log2']
}

# Fast search
rf_search = RandomizedSearchCV(
    rf,
    param_distributions=param_dist_rf,
    n_iter=10,
    scoring='accuracy',
    cv=3,
    n_jobs=-1,
    verbose=1,
    random_state=6
)

rf_search.fit(X_train, y_train)
y_pred_rf = rf_search.predict(X_test)

# Output results
print("Best RF Parameters:", rf_search.best_params_)
print(f"Train Accuracy: {rf_search.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {rf_search.score(X_test, y_test):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))


Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best RF Parameters: {'max_depth': 17, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 58}
Train Accuracy: 0.6882
Test Accuracy: 0.6340

Classification Report:
              precision    recall  f1-score   support

       False       0.90      0.66      0.76      2639
        True       0.15      0.45      0.23       361

    accuracy                           0.63      3000
   macro avg       0.53      0.56      0.49      3000
weighted avg       0.81      0.63      0.70      3000


Confusion Matrix:
[[1738  901]
 [ 197  164]]


Let's try a different parameter search, adding SMOTE oversampling and optimizing macro F1, to address the overfitting and class imbalance:

## Tuned Model Performance

In [44]:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Define pipeline using imblearn's Pipeline
pipeline = Pipeline([
    ('smote', SMOTE(random_state=6)),
    ('rf', RandomForestClassifier(class_weight='balanced', random_state=6))
])

# Define hyperparameter grid
param_grid = {
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [10, 15, 20, None],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4],
    'rf__max_features': ['sqrt', 'log2']
}

# Create RandomizedSearchCV
search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_grid,
    n_iter=15,
    scoring='f1_macro',  # or 'balanced_accuracy'
    cv=3,
    random_state=6,
    verbose=1,
    n_jobs=-1
)

# Fit the model
search.fit(X_train, y_train)

# Predict
y_pred = search.predict(X_test)

# Evaluate (search.score uses the f1_macro scorer passed to RandomizedSearchCV, not accuracy)
print("Best RF Parameters:", search.best_params_)
print(f"Train macro F1: {search.score(X_train, y_train):.4f}")
print(f"Test macro F1: {search.score(X_test, y_test):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Fitting 3 folds for each of 15 candidates, totalling 45 fits
Best RF Parameters: {'rf__n_estimators': 100, 'rf__min_samples_split': 10, 'rf__min_samples_leaf': 4, 'rf__max_features': 'sqrt', 'rf__max_depth': None}
Train macro F1: 0.5258
Test macro F1: 0.4774

Classification Report:
              precision    recall  f1-score   support

       False       0.91      0.59      0.71      2639
        True       0.15      0.55      0.24       361

    accuracy                           0.58      3000
   macro avg       0.53      0.57      0.48      3000
weighted avg       0.81      0.58      0.66      3000

Confusion Matrix:
[[1555 1084]
 [ 163  198]]
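
Because the search was created with scoring='f1_macro', search.score returns macro-averaged F1 rather than accuracy. A quick check from the confusion matrix above, written as a sketch (f1_for_class is a hypothetical helper, not part of scikit-learn):

```python
import numpy as np

# Confusion matrix reported above (rows = actual, columns = predicted; order False, True)
cm = np.array([[1555, 1084],
               [163,   198]])

def f1_for_class(cm, k):
    """F1 for class index k of a confusion matrix, via 2*TP / (2*TP + FP + FN)."""
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp   # predicted as k but actually another class
    fn = cm[k, :].sum() - tp   # actually k but predicted as another class
    return 2 * tp / (2 * tp + fp + fn)

accuracy = np.trace(cm) / cm.sum()
macro_f1 = np.mean([f1_for_class(cm, k) for k in range(cm.shape[0])])

print(f'accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}')
```

The macro F1 works out to about 0.477, matching the 0.4774 printed for the test set, while plain accuracy is about 0.584, matching the 0.58 in the classification report.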


In [49]:
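# rfc is assumed to be the best estimator from the first random forest search (rf_search)
rfc = rf_search.best_estimator_

# create dataframe of feature importances from model
df_features = pd.DataFrame({'rfc': rfc.feature_importances_}, index=df_model.columns[1:])

# sort by highest values
df_features.sort_values('rfc', ascending=False)[:15]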

,rfc
creation_source_PERSONAL_PROJECTS,0.080602
creation_source_GUEST_INVITE,0.047442
org_id_0,0.043563
email_yahoo.com,0.041418
creation_source_ORG_INVITE,0.023307
email_gmail.com,0.022881
org_id_1,0.021587
email_hotmail.com,0.021135
invited_by_user,0.01725
org_id_3,0.016858


In [50]:
# Extract the tuned Random Forest classifier from the pipeline
rfc_tuned = search.best_estimator_.named_steps['rf']

# Get feature importances from the tuned model
df_features['rfc_tuned'] = rfc_tuned.feature_importances_

# Add column for the difference (assuming 'rfc' column already exists)
df_features['difference'] = df_features['rfc_tuned'] - df_features['rfc']

# Sort and view top 15 features by change in importance
df_features.sort_values('difference', ascending=False).head(15)

,rfc,rfc_tuned,difference
org_id_112,0.000101,0.030339,0.030238
org_id_334,0.001708,0.031571,0.029863
org_id_223,7e-05,0.029893,0.029823
org_id_0,0.043563,0.065447,0.021884
org_id_1,0.021587,0.042994,0.021407
org_id_362,0.00041,0.018662,0.018252
org_id_395,0.000486,0.017593,0.017107
org_id_406,0.000226,0.016028,0.015802
org_id_373,3.8e-05,0.014118,0.01408
org_id_46,0.003307,0.016851,0.013544


## Final Analysis & Strategic Recommendations

The evaluation results from both the Support Vector Machine (SVM) and the Random Forest Classifier (RFC) models provide important insights into user adoption behavior on the Relax platform.

The SVM model reached high accuracy on the training (93.2%) and test sets (82.8%), but its test accuracy is below the 88.0% obtained by always predicting non-adoption (2,639 of 3,000 test users are non-adopters), and it rarely identified true adopters (recall of 0.09 for the positive class). Although the model was good at identifying non-adopters, it struggled to detect users likely to adopt the platform.

In contrast, the tuned Random Forest model (with SMOTE for class balancing) produced more balanced results in terms of recall and f1-score for the positive class, even though its overall test accuracy was lower (58%; the 0.4774 returned by search.score is the macro-averaged F1, since the search was scored with f1_macro). It was considerably better at identifying adopters than the SVM, with recall rising from 0.09 to 0.55, although precision for the positive class stayed at 0.15, so most predicted adopters are false positives. This model is therefore the more useful of the two for exploring adoption drivers, especially among the minority class.

## Key Factors Driving Adoption

Analysis of feature importances revealed that the following factors play a significant role in influencing user adoption:

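Account creation source was among the top indicators for adoption, particularly creation_source_PERSONAL_PROJECTS and creation_source_GUEST_INVITE. Feature importances do not show the direction of an effect, so whether these sources raise or lower adoption needs a separate check.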

Specific organizations, especially org_id_0, org_id_1, and newly prominent ones like org_id_112, org_id_223, and org_id_334, showed a marked increase in predictive power in the tuned RFC model.

Email domain features (e.g., yahoo.com, gmail.com, and hotmail.com) consistently held weight in both the original and tuned models.

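Signup method, notably via Google authentication, may also be associated with a higher likelihood of adoption; creation_source_SIGNUP_GOOGLE_AUTH does not appear among the top features listed above, so this should be confirmed directly.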

The comparison between baseline and tuned feature importances highlighted new organizational IDs that became more influential after model optimization. This shift suggests that targeted outreach to specific organizations could be an effective lever for growth.
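
Impurity-based feature importances from a random forest indicate how heavily a feature is used for splitting, not whether it raises or lowers adoption. One way to check direction is to compare adoption rates by category. A minimal sketch on made-up rows (on the real data the same groupby would run on df_users):

```python
import pandas as pd

# Toy stand-in for df_users; rows are made up for illustration
toy = pd.DataFrame({
    'creation_source': ['GUEST_INVITE', 'GUEST_INVITE', 'PERSONAL_PROJECTS',
                        'PERSONAL_PROJECTS', 'SIGNUP_GOOGLE_AUTH', 'SIGNUP_GOOGLE_AUTH'],
    'adopted': [True, False, False, False, True, True],
})

# Mean of a boolean column is the share of True values, i.e. the adoption rate
rates = (toy.groupby('creation_source')['adopted']
            .agg(adoption_rate='mean', users='size')
            .sort_values('adoption_rate', ascending=False))
print(rates)
```

Including the group size next to each rate helps avoid over-reading categories with very few users, which is a particular risk for the org_id features.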

## Recommendations for Increasing User Adoption

Based on these insights, the following strategic actions are advised for Relax:

- Encourage workspace invites: Launch campaigns or incentives that motivate existing users to invite others to their personal workspaces, particularly using guest or full-member invitations.

- Leverage high-converting organizations: Promote organizations that the model flagged as having high adoption influence (e.g., org_id_0, org_id_112, org_id_223). Collaborate with these groups or feature them in marketing campaigns.

- Target email-based outreach: Customize promotional content and onboarding flows for users with popular domains like gmail.com, yahoo.com, and hotmail.com, as they appear more likely to adopt.

- Highlight Google sign-up benefits: Emphasize the ease or added benefits of signing up via Google authentication to capture users more predisposed to engage.

By aligning outreach and user acquisition strategies with these data-driven insights, Relax can improve the efficiency and effectiveness of their adoption initiatives.