# Description

__Goal__

Our goal is to detect the interface elements and screens of an app at which user engagement drops significantly and users leave the app without registering an account.

__Tasks__

1. Collect data
2. Prepare data 
3. Analyze data
    1. Build pivot tables
    2. Visualize user paths in the app
    3. Build the classifier
        1. The classifier helps to pick out specific user paths
        2. The classifier estimates the probability that a user will leave the app based on their current path. This information can be used to change the app's content dynamically and prevent the user from leaving.
        

__Expected results__

1. Identify the most "problematic" elements of the app
2. Obtain a classifier that predicts whether a user will leave the app based on their current behaviour

# Download data

In [2]:
import os
from retentioneering import utils, init_from_file, Config

First, we need to load the Google Cloud credentials.

In [4]:
client, job_config = init_from_file('./settings_yaml.yaml')
settings = Config('./settings_yaml.yaml')

Execute the query:

* user_filter_event_names -- user filter: keeps only users who had the given event(s)
* dates_users -- dates on which the user_filter_event_names events happened
* users_app_version -- filter on the app version
* event_filter_event_names -- events of interest
* dates_events -- time period of the analysis
* count_events -- number of events per user
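To make the effect of these parameters more concrete, the sketch below applies the same kinds of filters to a toy in-memory event log in plain Python. It is only an illustration under assumed semantics (in particular, that `count_events` caps the number of events kept per user). The function `filter_events_sketch` and the toy records are hypothetical and are not part of `retentioneering`.

```python
from collections import defaultdict


def filter_events_sketch(events, user_filter_event_names,
                         event_filter_event_names, count_events):
    """events: list of (user_id, timestamp, event_name) tuples."""
    # User filter: keep only users who had one of the filter events.
    keep_users = {user for user, _, name in events
                  if name in user_filter_event_names}
    per_user = defaultdict(list)
    for user, ts, name in sorted(events, key=lambda e: (e[0], e[1])):
        # Event filter: keep only events of interest for the kept users.
        if user not in keep_users or name not in event_filter_event_names:
            continue
        # Assumption: count_events is a cap on events kept per user.
        if len(per_user[user]) < count_events:
            per_user[user].append((ts, name))
    return dict(per_user)


toy_events = [
    ("u1", 1, "first_open"), ("u1", 2, "screen_view"),
    ("u1", 3, "myFlights_add"), ("u1", 4, "screen_view"),
    ("u2", 1, "screen_view"), ("u2", 2, "myFlights_add"),
]
filtered = filter_events_sketch(
    toy_events,
    user_filter_event_names={"first_open"},
    event_filter_event_names={"screen_view", "myFlights_add"},
    count_events=2,
)
print(filtered)
```

User `u2` is dropped because it never had `first_open`, and `u1` keeps only its first two events of interest.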

In [None]:
# df = download_events(
#     client,
#     job_config=job_config,
#     user_filter_event_names=[u'first_open'],
#     users_app_version='7.4.2', 
#     event_filter_event_names=[u'screen_view',
#                               u'myFlights_add',
#                               u'myFlights_edit',
#                               u'myFlights_refresh',
#                               u'profile_edit_close',
#                               u'tabbar_select_page',
#                               u'welcome_see_screen',
#                               u'feed_widget_present',
#                               u'welcome_login_google',
#                               u'welcome_login_tripit',
#                               u'welcome__loginFailure',
#                               u'feed_ad_canBePresented',
#                               u'myFlights_connectEmail',
#                               u'myFlights_swipe_action',
#                               u'newFlight_myflights_see',
#                               u'welcome__chooseLoginType',
#                               u'welcome_otherLogin__show',
#                               u'newFlight_awardwallet_see',
#                               u'welcome_otherLogin__close',
#                               u'welcome_login_google_cancel',
#                               u'welcome_privacy_policyShown',
#                               u'welcome_privacy_policyDecline',
#                               u'welcome_privacy_policyAccepted',
#                               u'welcome_privacy_policyTapToPolicy',
#                               u'feed_widget_aircraft_amenities_saw',
#                               u'welcome_otherLogin__chooseLoginType',
#                               u'feed_widget_aircraft_noAircraftImage',
#                               u'welcome_otherLogin_privacy_policyShown',
#                               u'welcome_otherLogin_privacy_policyDecline',
#                               u'welcome_otherLogin_privacy_policyAccepted'], 
#     dates_users=(u'2018-10-01', u'2018-10-01'), 
#     dates_events=(u'2018-10-01', u'2018-10-01'), 
#     count_events=40, 
#     return_dataframe=True
# )

In [None]:
settings['sql']

Alternatively, we can put all of these parameters in `settings['sql']` (see the example in the current directory) and run the query with it:

In [8]:
df = utils.download_events_multi(client, job_config=job_config, settings=settings)
print ('Downloaded DataFrame shape: {}'.format(df.shape))

100%|██████████| 503589/503589 [00:54<00:00, 9223.56it/s]  


Downloaded DataFrame shape: (503589, 12)


#### Prepare your dataset for further analysis

In [10]:
# select target users from settings['users']
print ('Started DataFrame shape: {}'.format(df.shape))
df = utils.preparing.filter_users(df, settings=settings)
print ('DataFrame shape after user filters: {}'.format(df.shape))

# delete events from settings['events']
df = utils.preparing.filter_events(df, settings=settings)
print ('DataFrame shape after event filters: {}'.format(df.shape))

# drop duplicated events happening within settings['events']['duplicate_thr_time']
df = utils.preparing.drop_duplicated_events(df, settings=settings)
print ('DataFrame shape after drop duplicated events: {}'.format(df.shape))

# add passed events from settings['positive_event']
df = utils.preparing.add_passed_event(df, settings=settings)
print ('DataFrame shape after adding passed events: {}'.format(df.shape))

# add lost events from settings['negative_event']
df = utils.preparing.add_lost_events(df, settings=settings)
print ('DataFrame shape after adding lost events: {}'.format(df.shape))

Started DataFrame shape: (503589, 12)
DataFrame shape after user filters: (474403, 12)
DataFrame shape after event filters: (186082, 12)
DataFrame shape after drop duplicated events: (52090, 12)
DataFrame shape after adding passed events: (31281, 12)
DataFrame shape after adding lost events: (33338, 12)
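The duplicate-dropping step accounts for the largest reduction in the output above. The sketch below shows one plausible reading of it in plain Python: an event is dropped if the same event was already kept for that user less than a threshold number of seconds earlier. This rule, the function name `drop_duplicated_events_sketch` and the toy track are assumptions for illustration; the actual logic of `utils.preparing.drop_duplicated_events` may differ.

```python
def drop_duplicated_events_sketch(track, thr_seconds):
    """track: time-ordered list of (timestamp_seconds, event_name) for one user."""
    kept = []
    last_kept = {}  # event_name -> timestamp of its last kept occurrence
    for ts, name in track:
        prev = last_kept.get(name)
        # Assumed rule: a repeat within thr_seconds of the last kept
        # occurrence of the same event is treated as a duplicate.
        if prev is not None and ts - prev < thr_seconds:
            continue
        kept.append((ts, name))
        last_kept[name] = ts
    return kept


toy_track = [
    (0, "screen_view"),
    (1, "screen_view"),     # repeat after 1 second, dropped
    (2, "myFlights_add"),
    (10, "screen_view"),    # 10 seconds after the kept one, retained
]
deduplicated = drop_duplicated_events_sketch(toy_track, thr_seconds=5)
print(deduplicated)
```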


#### Look at first 5 records in prepared dataset

#### Save DataFrame if needed

In [None]:
directory = '../../data' 
if not os.path.exists(directory):
    os.makedirs(directory)

In [None]:
# choose your path
path = '../../data/data_from_bq.csv'
df.to_csv(path, sep=';', index=False)

# Analysis

Now we are ready for data analysis

In [None]:
import pandas as pd
path = '../../data/data_from_bq.csv'
df = pd.read_csv(path, sep=';')

## Ad-hoc

In [None]:
from retentioneering import analysis

#### Pivot tables of event distribution by user steps

In [None]:
desc = analysis.get_desc_table(df, settings=settings, plot=True)

The rows of the table are the sequential step numbers in the user path, and the columns are the events.

Each cell shows the probability that a user chooses the given event at the given step.
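As a rough illustration of what such a table contains, the plain-Python sketch below computes, for every step number, the share of users whose event at that step was a given event. The normalization (by the number of users who reached the step), the function `step_event_table` and the toy tracks are assumptions for illustration, not the implementation of `analysis.get_desc_table`.

```python
from collections import Counter, defaultdict


def step_event_table(tracks):
    """tracks: dict user_id -> ordered list of event names.

    Returns {step_number: {event_name: share of users at that step}}.
    """
    counts = defaultdict(Counter)
    for events in tracks.values():
        for step, name in enumerate(events, start=1):
            counts[step][name] += 1
    table = {}
    for step, counter in counts.items():
        total = sum(counter.values())  # users who reached this step
        table[step] = {name: n / total for name, n in counter.items()}
    return table


toy_tracks = {
    "u1": ["welcome_see_screen", "welcome__chooseLoginType", "passed"],
    "u2": ["welcome_see_screen", "lost"],
    "u3": ["welcome_see_screen", "welcome__chooseLoginType", "lost"],
    "u4": ["welcome_see_screen", "welcome_privacy_policyShown", "lost"],
}
table = step_event_table(toy_tracks)
for step in sorted(table):
    print(step, table[step])
```

Here every user starts with `welcome_see_screen`, and half of them choose `welcome__chooseLoginType` at step 2.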

Drawing detailed conclusions from this single table is difficult, so we split the users into those who left the app (`lost`) and those who went on (`passed`) and compare their behaviour.

In [None]:
lost_users_list = df[df.event_name == 'lost'].user_pseudo_id
filt = df.user_pseudo_id.isin(lost_users_list)
df_lost = df[filt]
df_passed = df[~filt]

desc_loss = analysis.get_desc_table(df_lost, settings, plot=True)

In [None]:
desc_passed = analysis.get_desc_table(df_passed,  settings, plot=True)

Then plot a heatmap of the differences between the two tables

In [None]:
diff_df = analysis.get_diff(desc_loss, desc_passed, settings=settings, precalc=True)
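The idea behind the difference heatmap can be sketched in plain Python as a cell-by-cell subtraction of two step-by-event tables. The sign convention (lost minus passed), the function `table_diff` and the toy tables are assumptions for illustration; `analysis.get_diff` may normalize or order things differently.

```python
def table_diff(table_a, table_b):
    """Cell-wise difference of two {step: {event: probability}} tables.

    Cells missing from one table are treated as probability 0.
    """
    diff = {}
    for step in set(table_a) | set(table_b):
        row_a = table_a.get(step, {})
        row_b = table_b.get(step, {})
        diff[step] = {
            name: row_a.get(name, 0.0) - row_b.get(name, 0.0)
            for name in set(row_a) | set(row_b)
        }
    return diff


toy_lost = {
    1: {"welcome_see_screen": 1.0},
    2: {"welcome__chooseLoginType": 0.25, "lost": 0.75},
}
toy_passed = {
    1: {"welcome_see_screen": 1.0},
    2: {"welcome__chooseLoginType": 1.0},
}
toy_diff = table_diff(toy_lost, toy_passed)
print(toy_diff[2])
```

Large positive cells mark events that are over-represented among lost users at that step, and large negative cells mark events typical of users who passed.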

We can also aggregate data over edges (transitions between events)

In [None]:
agg_list = ['trans_count', 'dt_mean', 'dt_median', 'dt_min', 'dt_max']
df_agg = analysis.get_all_agg(df, agg_list)
df_agg.head()

This shows which transitions take the most time and how often users make each transition.
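For intuition, the aggregates named in `agg_list` can be reproduced on toy data in plain Python: for each pair of consecutive events, count the transitions and summarize the time between them. The function `aggregate_edges` and the toy tracks are hypothetical, and the real `analysis.get_all_agg` may define the statistics differently.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_edges(tracks):
    """tracks: dict user_id -> time-ordered list of (timestamp_seconds, event_name)."""
    deltas = defaultdict(list)
    for track in tracks.values():
        # Pair every event with the next one in the same user's track.
        for (t0, src), (t1, dst) in zip(track, track[1:]):
            deltas[(src, dst)].append(t1 - t0)
    return {
        edge: {
            "trans_count": len(dts),
            "dt_mean": mean(dts),
            "dt_median": median(dts),
            "dt_min": min(dts),
            "dt_max": max(dts),
        }
        for edge, dts in deltas.items()
    }


toy_timed_tracks = {
    "u1": [(0, "welcome_see_screen"), (4, "welcome__chooseLoginType"), (10, "passed")],
    "u2": [(0, "welcome_see_screen"), (8, "welcome__chooseLoginType"), (9, "lost")],
}
edges = aggregate_edges(toy_timed_tracks)
# Most frequent transitions first.
top_edges = sorted(edges.items(), key=lambda kv: kv[1]["trans_count"], reverse=True)
print(top_edges[0])
```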

We can select the 10 most frequent transitions.

In [None]:
df_agg.sort_values('trans_count', ascending=False).head(10)

The table shows how much time users spend on the most frequent transitions. It seems reasonable to analyze only popular events to get stable results.

Build an adjacency matrix from the aggregated data

In [None]:
adj_count = analysis.get_adjacency(df_agg, 'trans_count')
adj_count
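An adjacency matrix of this kind can be sketched in plain Python: each row is a source event, each column a target event, and each cell holds the transition count. The helper `adjacency_from_counts` and the toy counts are hypothetical and only illustrate the data structure, not the output format of `analysis.get_adjacency`.

```python
def adjacency_from_counts(edge_counts):
    """edge_counts: dict (source_event, target_event) -> transition count.

    Returns (nodes, matrix) where matrix[i][j] is the count of
    transitions from nodes[i] to nodes[j].
    """
    nodes = sorted({node for edge in edge_counts for node in edge})
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for (src, dst), count in edge_counts.items():
        matrix[index[src]][index[dst]] = count
    return nodes, matrix


toy_counts = {
    ("welcome_see_screen", "welcome__chooseLoginType"): 2,
    ("welcome__chooseLoginType", "passed"): 1,
    ("welcome__chooseLoginType", "lost"): 1,
}
nodes, matrix = adjacency_from_counts(toy_counts)
for node, row in zip(nodes, matrix):
    print(node, row)
```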

One can also cluster users by how frequently they choose each event

In [None]:
countmap = analysis.utils.plot_frequency_map(df, settings)

In [None]:
analysis.utils.plot_clusters(df, countmap, n_clusters=5, plot_cnt=2)

These groups are visualized in the Lost/not-lost classifier section below

# Graph visualization

Visualize your graph in Python

In [None]:
analysis.utils.plot_graph_python(df_agg, 'trans_count', settings)

Or with our API

`The API sends the aggregated graph to our server for visualization`

In [None]:
from retentioneering.visualization.plot import plot_graph_api
plot_graph_api(df_lost, settings)

# Lost/not-lost classifier

Fit the model

In [None]:
clf = analysis.Model(df, target_event='lost', settings=settings)
clf.fit_model()

Get simple access to your quality metrics

In [None]:
print ('ROC-AUC: {:.2f}'.format(clf.roc_auc_score))
print ('PR-AUC: {:.2f}'.format(clf.average_precision_score))
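ROC-AUC and PR-AUC (average precision) are different quantities and can diverge noticeably on the same predictions. The plain-Python sketch below computes both on toy labels and scores with textbook definitions: ROC-AUC as the share of (positive, negative) pairs ranked correctly, and average precision as the mean precision at the rank of each positive. The function names and toy data are hypothetical, and the scores are assumed to have no ties for the average precision part.

```python
def roc_auc_sketch(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision_sketch(labels, scores):
    """Mean of precision@k over the ranks k at which a positive appears."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits


toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2]
toy_roc_auc = roc_auc_sketch(toy_labels, toy_scores)
toy_pr_auc = average_precision_sketch(toy_labels, toy_scores)
print('ROC-AUC: {:.2f}'.format(toy_roc_auc))
print('PR-AUC: {:.2f}'.format(toy_pr_auc))
```

On this toy data the two metrics differ (0.50 versus roughly 0.76), which is why each printed label has to match its metric.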

Predict probabilities for a certain user

In [None]:
# first we need to aggregate events by a user
data = analysis.prepare_dataset(df, target_event='lost')
# now we can predict probability for her track
vec = clf._get_vectors(data.event_name.iloc[:1])
clf.predict_proba(vec)

Visualize the t-SNE projection of events vs targets

In [None]:
clf.plot_projections()

Or vs the probability predicted by the model

In [None]:
clf.plot_projections(sample=data.event_name.values, ids=data.user_pseudo_id.values)

Select a cluster of interest with a bounding box (bbox) and visualize the trajectories in it

In [None]:
# specify the coordinates of the bbox corners

bbox = [
    [-4, -12],
    [8, -26]
]

clf.plot_cluster_track(bbox)
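Selecting a cluster with a bounding box amounts to keeping the projected points whose coordinates fall inside the rectangle. The plain-Python sketch below assumes that `bbox` lists two opposite corners as `[x, y]` pairs. The helper `points_in_bbox` and the toy points are hypothetical and do not reproduce the internals of `clf.plot_cluster_track`.

```python
def points_in_bbox(points, bbox):
    """Return indices of 2D points inside the rectangle given by two corners."""
    (x1, y1), (x2, y2) = bbox
    # Sort so that the corners can be given in any order.
    x_lo, x_hi = sorted((x1, x2))
    y_lo, y_hi = sorted((y1, y2))
    return [i for i, (x, y) in enumerate(points)
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi]


toy_bbox = [
    [-4, -12],
    [8, -26]
]
toy_points = [(0.0, -20.0), (10.0, -20.0), (-3.0, -13.0), (0.0, 0.0)]
selected = points_in_bbox(toy_points, toy_bbox)
print(selected)  # [0, 2]
```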

In [None]:
from matplotlib import pyplot as plt
fig = plt.figure(figsize=[10, 10])
plt.scatter(clf._cached_tsne[:, 0], clf._cached_tsne[:, 1], c=clf.target)
plt.grid()
plt.title('TSNE over Tf-Idf transform of user tracks')
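The plot title suggests that user tracks are vectorized with TF-IDF before the t-SNE projection. The plain-Python sketch below shows the basic idea by treating each track as a document whose words are event names. It uses the unsmoothed textbook formula `tf * log(N / df)`, which is an assumption: the model's actual vectorizer, its smoothing and its n-gram settings are not shown in this notebook, and `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_vectors(tracks):
    """tracks: list of event-name lists, one per user.

    Returns (vocabulary, vectors) with one TF-IDF vector per track.
    """
    vocab = sorted({name for track in tracks for name in track})
    n_docs = len(tracks)
    # Document frequency: number of tracks containing each event.
    doc_freq = Counter(name for track in tracks for name in set(track))
    vectors = []
    for track in tracks:
        tf = Counter(track)
        vectors.append([
            tf[name] / len(track) * math.log(n_docs / doc_freq[name])
            for name in vocab
        ])
    return vocab, vectors


toy_user_tracks = [
    ["welcome_see_screen", "lost"],
    ["welcome_see_screen", "passed"],
]
vocab, vectors = tfidf_vectors(toy_user_tracks)
print(vocab)
print(vectors)
```

Events shared by every track, such as `welcome_see_screen` here, get zero weight, so the vectors emphasize the events that distinguish one track from another.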

Highlight major nodes and edges with our API