# Problem Statement

### How do we increase paid conversion rates in the first 14 days?

# Initialize 

In [59]:
!pip install pandas
!pip install numpy
!pip install scikit-learn


# Load Data from CSV

In [41]:
import pandas as pd
filename = 'train.csv'
raw_data = pd.read_csv(filename)

# 1) Analyze Data

We will use descriptive statistics and visualization to understand the data.

## Preview of raw data

In [42]:
preview = raw_data.head(20)
print(preview)

     idx  time_to_first_matter  time_to_first_time_entry  time_to_first_bill  time_to_second_user  \
0    745                   NaN                       NaN                 NaN                  NaN   
1   1190             117.00000                 117.00000           223.00000         280881.00000   
2   1242             351.00000                 448.00000                 NaN                  NaN   
3   1044                   NaN                       NaN                 NaN                  NaN   
4    304                   NaN                       NaN                 NaN                  NaN   
5    843         1190696.00000             1190448.00000       1191054.00000                  NaN   
6    936             278.00000                 355.00000                 NaN                  NaN   
7    997                   NaN                       NaN                 NaN           1168.00000   
8    110                   NaN                       NaN                 NaN               

## Dimensions of data

In [43]:
shape = raw_data.shape
print(shape)

(1000, 11)


## Data type for each attribute

In [44]:
types = raw_data.dtypes
print(types)

idx                              int64
time_to_first_matter           float64
time_to_first_time_entry       float64
time_to_first_bill             float64
time_to_second_user            float64
page_views_in_first_hour       float64
page_views_in_first_day        float64
page_views_in_first_7_days     float64
page_views_in_first_14_days    float64
time_to_conversion             float64
conversion_value               float64
dtype: object


## Descriptive Statistics
- The train data set contains 1000 accounts.
- Assuming normal conditions, 171 of the 1000 accounts converted, a conversion rate of 17.1%.
- The minimum time to reach each stage can be as short as seconds, while the maximum can take almost the whole 14-day period.
- The biggest drop in the funnel happens between the start of the 14-day trial and the first matter: 55.6% of accounts never create a first matter. Note that accounts can convert without engaging in the application. Still, this is a good area to involve the Sales and Marketing Team to reach out to customers who have neither converted nor engaged after a few days. They can answer questions the account may have and make onboarding smoother in the early stages.
- Another significant drop in the funnel occurs between the first time entry and the first bill: only 18.1% of accounts reach a first bill. The Sales Team and Account Management Team can help increase this rate by touching base with the account to answer any questions about its bill or service.
- 11.6% of accounts add a second user. This is an opportunity for the Sales Team and Account Management Team to engage with the customer when they reach their first bill.
- 17.1% of accounts actually convert, as stated earlier. That is slightly lower than the 18.1% that reach a first bill, which suggests there may be some difficulty with some accounts paying. The Account Management Team may be able to touch base with the customer here to ensure that payment methods are not experiencing issues with charging credit cards. Payment arrangements can also be made here to help ensure we receive payment.
- Average (mean) page views increase steadily from the first hour, to the first day, to the first 7 days, to the first 14 days.
- The average conversion value is $119.23. The Marketing Team should watch this alongside the cost of acquisition per account.
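
The funnel percentages above come from counting non-null values per stage. A minimal sketch of that calculation, assuming each `time_to_*` column is null when the account never reached that stage; `funnel_rates` and the toy frame are hypothetical and only illustrate the idea (in this notebook the input would be `raw_data`):

```python
import pandas as pd

def funnel_rates(df, stage_columns):
    """Count accounts that reached each stage (non-null time) and their share of all accounts."""
    total = len(df)
    reached = df[stage_columns].notna().sum()
    return pd.DataFrame({
        'accounts': reached,
        'share_of_all': (reached / total).round(3),
    })

# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    'time_to_first_matter': [117.0, 351.0, None, None],
    'time_to_first_bill': [223.0, None, None, None],
    'time_to_conversion': [5000.0, None, None, 900.0],
})

rates = funnel_rates(toy, list(toy.columns))
print(rates)
```

Applied to the real data, the `accounts` column should match the `count` row that `describe()` prints.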


In [45]:
pd.set_option('display.width', 100)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
description = raw_data.describe()
print(description)

             idx  time_to_first_matter  time_to_first_time_entry  time_to_first_bill  \
count 1000.00000             444.00000                 365.00000           181.00000   
mean   648.07500          125224.85360              131241.60548        205680.39779   
std    369.84021          254762.37811              255418.15214        301326.67633   
min      2.00000              48.00000                  15.00000           217.00000   
25%    326.50000             437.25000                 551.00000          1239.00000   
50%    658.50000            1955.50000                2819.00000         16591.00000   
75%    970.25000           89674.50000              151014.00000        342753.00000   
max   1283.00000         1190696.00000             1208042.00000       1204219.00000   

       time_to_second_user  page_views_in_first_hour  page_views_in_first_day  \
count            116.00000                 972.00000                972.00000   
mean          188593.61207                  2

## Correlations Between Attributes
- Assuming a normal distribution here.
- Using Pearson's correlation coefficient to measure the linear relationship between two variables at a time.
- Time To First Matter is highly correlated with Time To First Time Entry, with a correlation of 0.82194 (a value of 1 means a perfect positive correlation).
- Page Views In First 7 Days is highly correlated with Page Views In First 14 Days, with a correlation of 0.84131.
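
Because many of the columns contain missing values, each coefficient is computed only on the rows where both variables are present. A minimal pure-Python sketch of Pearson's r on such pairwise-complete rows; `pairwise_pearson` and the toy lists are hypothetical and are not part of the notebook's pipeline:

```python
import math

def pairwise_pearson(xs, ys):
    """Pearson's r using only positions where both values are present."""
    pairs = [
        (x, y) for x, y in zip(xs, ys)
        if x is not None and y is not None
        and not math.isnan(x) and not math.isnan(y)
    ]
    n = len(pairs)
    if n < 2:
        return float('nan')
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float('nan')  # r is undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values: only three positions have both values present
xs = [1.0, 2.0, float('nan'), 4.0, 5.0]
ys = [2.0, 4.0, 6.0, None, 10.0]
r = pairwise_pearson(xs, ys)
print(r)
```

One consequence worth keeping in mind: different pairs of columns can be based on very different numbers of accounts, so coefficients in the same matrix are not always directly comparable.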


In [46]:
correlations = raw_data.corr(method='pearson')
print(correlations)

                                 idx  time_to_first_matter  time_to_first_time_entry  \
idx                          1.00000               0.08228                   0.02448   
time_to_first_matter         0.08228               1.00000                   0.82194   
time_to_first_time_entry     0.02448               0.82194                   1.00000   
time_to_first_bill           0.00232               0.59288                   0.50156   
time_to_second_user         -0.14466               0.20757                  -0.05208   
page_views_in_first_hour     0.01080              -0.22033                  -0.21484   
page_views_in_first_day     -0.00834              -0.19843                  -0.17121   
page_views_in_first_7_days  -0.01309              -0.09506                   0.00353   
page_views_in_first_14_days -0.02819               0.04183                   0.14579   
time_to_conversion          -0.16530              -0.08042                  -0.06521   
conversion_value             0.0

## Relationships Between Conversion And Engagement
- Just over 33% of accounts follow the journey of some engagement and no conversion. There is an opportunity to engage with these customers during onboarding, and we can get the Sales Team involved to help customers with any questions they may have about the product. It's also an opportunity to collect that feedback and share it with the Software Development Team. This would be a great time to ideate and come up with A/B testing experiment ideas.
- Almost 50% have no engagement and no conversion. The Marketing Team would be good to loop in here to see if there is copy or messaging that could entice the prospect to act on a call to action and trigger engagement. This is another place where A/B testing can be used to see if we can get some lift.
- The cohort with some engagement and conversion provides the most conversion value. They are only 16.6% of the train group (166 of 1000 accounts), but there is potential to increase their average revenue of $121.47 by involving the Account Management Team to get them to upgrade.
- The cohort with no engagement and conversion is a small group, but it would be a good opportunity to involve the Sales and Marketing Team to try to understand these accounts' reasons for paying right away. We can use this feedback to ideate on A/B tests to help bring in more of these types of customers. Their average conversion value is smaller than that of the previous cohort. This is an opportunity to involve the Account Management Team to try to upsell.

In [47]:
import numpy as np

# Build a condition list to evaluate based on certain scenarios
condition_list = [
    
    # no engagement and no conversion
    (raw_data['time_to_first_matter'].isna()) & 
    (raw_data['time_to_first_time_entry'].isna()) & 
    (raw_data['time_to_first_bill'].isna()) &
    (raw_data['time_to_second_user'].isna()) &
    (raw_data['time_to_conversion'].isna()),
    
    # some engagement and conversion 
    ((raw_data['time_to_first_matter'] > 0) | 
    (raw_data['time_to_first_time_entry'] > 0) |
    (raw_data['time_to_first_bill'] > 0) |
    (raw_data['time_to_second_user'] > 0)) &
    (raw_data['time_to_conversion'] > 0),
    
     
     # no engagement and conversion
    (raw_data['time_to_first_matter'].isna()) & 
    (raw_data['time_to_first_time_entry'].isna()) & 
    (raw_data['time_to_first_bill'].isna()) &
    (raw_data['time_to_second_user'].isna()) &
    (raw_data['time_to_conversion'] > 0),
    
    # some engagement and no conversion 
    ((raw_data['time_to_first_matter'] > 0) | 
    (raw_data['time_to_first_time_entry'] > 0) |
    (raw_data['time_to_first_bill'] > 0) |
    (raw_data['time_to_second_user'] > 0)) &
    (raw_data['time_to_conversion'].isna()),
]

# Assign scenarios a journey description    
choice_list = ['no engagement and no conversion', 
               'some engagement and conversion',
               'no engagement and conversion',
               'some engagement and no conversion'
              ]
raw_data['journey'] = np.select(condition_list, choice_list, default='uncategorized')

# Group accounts by journey
columns = ['idx', 'journey', 'conversion_value']
analysis = raw_data[columns]

group_analysis = analysis.groupby('journey').agg({'conversion_value': [np.size, 
                                                                       np.sum, 
                                                                       np.mean, 
                                                                       np.max, 
                                                                       np.min]})

group_analysis.head()


                                  conversion_value
                                              size       sum       mean   amax   amin
journey
no engagement and conversion                   5.0    224.06     44.812  52.37  29.59
no engagement and no conversion              498.0       0.0        NaN    NaN    NaN
some engagement and conversion               166.0  20163.99  121.46982  828.0   32.7
some engagement and no conversion            331.0       0.0        NaN    NaN    NaN
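
The cohort shares discussed in this section follow directly from the group sizes in the output. A small sketch of the arithmetic, using the four counts printed above; the variable names are hypothetical:

```python
# Group sizes copied from the group_analysis output
journey_counts = {
    'no engagement and no conversion': 498,
    'some engagement and no conversion': 331,
    'some engagement and conversion': 166,
    'no engagement and conversion': 5,
}

total = sum(journey_counts.values())

# Share of all accounts per journey, in percent
shares = {
    journey: round(100 * count / total, 1)
    for journey, count in journey_counts.items()
}

# Overall conversion rate: both cohorts whose journey ends in conversion
converted = (journey_counts['some engagement and conversion']
             + journey_counts['no engagement and conversion'])
conversion_rate = round(100 * converted / total, 1)

for journey, share in shares.items():
    print(f'{journey}: {share}%')
print(f'overall conversion rate: {conversion_rate}%')
```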


## Skew For Each Attribute
- Checking for skewness because many machine learning techniques assume that the input features are approximately normally distributed

In [49]:
# numeric_only=True skips the text 'journey' column added in the previous cell
skew = raw_data.skew(numeric_only=True)
print(skew)

idx                           -0.04667
time_to_first_matter           2.42557
time_to_first_time_entry       2.41474
time_to_first_bill             1.51663
time_to_second_user            1.77736
page_views_in_first_hour       2.32223
page_views_in_first_day        4.62391
page_views_in_first_7_days     5.31661
page_views_in_first_14_days    6.13625
time_to_conversion             0.48333
conversion_value               3.00399
dtype: float64
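
Most attributes are strongly right-skewed. One common remedy, not applied in this notebook, is a log transform. The sketch below is illustrative only: `sample_skewness` is a hypothetical helper that uses the simple moment-based formula (pandas' `skew` applies a bias correction, so its numbers differ slightly), and the page-view values are made up.

```python
import math

def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed page-view counts (not from train.csv)
page_views = [0, 1, 1, 2, 3, 5, 8, 40, 120]

raw_skew = sample_skewness(page_views)
# log1p(x) = log(1 + x), which is defined for zero page views
log_skew = sample_skewness([math.log1p(v) for v in page_views])

print(f'raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}')
```

Whether a transform like this helps depends on the model chosen later; tree-based models, for example, are largely insensitive to monotonic transforms of the inputs.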


## Data Prep For Prediction

In [64]:
condition = raw_data['time_to_conversion'] > 0
raw_data['convert'] = np.where(condition, 1, 0)

## Model To Predict an Account's Conversion Based on Early Application Engagement
- The approach I would take is Time to Event Analysis, to estimate when an account converts within the 14-day period.
- Since not all accounts convert by day 14, we don't know if and when they will eventually convert.
- Using the Cox proportional hazards model, we can account for censoring (accounts that haven't converted within 14 days). Removing these accounts could bias the results. We can't ignore them, and we need to distinguish them from accounts that reached the conversion event within 14 days.
- Explaining this model to people not in Data, I would say that we don't want to ignore accounts that engage in the first 14 days but do not convert within that window. We could continue to monitor those that haven't converted further out in time (30 days, 45 days, 60 days).
- The model also estimates the probability that a conversion will happen by a certain point in time.
- I would use the lifelines Python package to perform this.
- The recommendation is that the sooner we help accounts engage with the product within the 14-day period, the more likely they are to convert.
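
To make the idea of censoring concrete, here is a minimal sketch of how the data could be encoded for time-to-event analysis. It uses a hand-rolled Kaplan-Meier estimate rather than the Cox model named above, purely to stay dependency-free. `CENSOR_SECONDS`, `to_survival_rows`, and `kaplan_meier` are hypothetical names, and the assumption that times are recorded in seconds is mine.

```python
import math

# Assumption: times are in seconds and observation ends after 14 days
CENSOR_SECONDS = 14 * 24 * 60 * 60

def to_survival_rows(times_to_conversion):
    """Map raw conversion times (None/NaN = not converted) to (duration, event) pairs."""
    rows = []
    for t in times_to_conversion:
        converted = t is not None and not math.isnan(t)
        # Non-converters are censored at the end of the window, not dropped
        rows.append((t if converted else CENSOR_SECONDS, 1 if converted else 0))
    return rows

def kaplan_meier(rows):
    """Return [(time, P(not yet converted))] at each distinct conversion time."""
    curve = []
    survival = 1.0
    event_times = sorted({d for d, e in rows if e == 1})
    for t in event_times:
        at_risk = sum(1 for d, _ in rows if d >= t)           # still unconverted just before t
        events = sum(1 for d, e in rows if d == t and e == 1)  # conversions exactly at t
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical toy values: three conversions, two censored accounts
rows = to_survival_rows([100.0, None, 200.0, float('nan'), 300.0])
curve = kaplan_meier(rows)
for time, surv in curve:
    print(f't={time:.0f}s  P(not converted yet)={surv:.2f}')
```

The same `(duration, event)` layout, with engagement features added as extra columns, is the shape a Cox model expects; in lifelines that would be passed to `CoxPHFitter.fit` via its `duration_col` and `event_col` arguments.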

## Experiment 
- With approximately 50% of accounts showing no engagement and no conversion, we could run experiments on the pricing page, where clients first see plans and prices. The control group would see the pricing page as is, and the test group would see a different arrangement and copy/messaging, to see whether they start to engage with the application.
- Null Hypothesis: Changing the pricing page layout from High To Low Pricing to Low To High Pricing does not change engagement with the application
- Alternate Hypothesis: Changing the pricing page layout to Low To High Pricing increases engagement with the application
- Try to control any confounding variables (pricing should not differ between the two groups, and there should be no difference in business process between the groups during their first 14-day period)
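
One way the hypotheses above could be evaluated once the experiment has run is a one-sided two-proportion z-test on engagement rates. This is a minimal sketch under that assumption; `two_proportion_z_test` is a hypothetical helper and the counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(engaged_control, n_control, engaged_test, n_test):
    """One-sided test of H1: engagement rate in the test group exceeds the control group."""
    p_control = engaged_control / n_control
    p_test = engaged_test / n_test
    # Pooled rate under the null hypothesis of no difference
    pooled = (engaged_control + engaged_test) / (n_control + n_test)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_test))
    z = (p_test - p_control) / std_err
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Hypothetical results: 250 of 500 engaged in control, 280 of 500 in test
z, p_value = two_proportion_z_test(250, 500, 280, 500)
print(f'z = {z:.3f}, one-sided p-value = {p_value:.4f}')
```

The significance level and the required sample size should be fixed before the experiment starts, so that the test is not stopped early on a favorable fluctuation.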