# <font color=blue>USER ENGAGEMENT STUDY</font>

***
From Ref : https://clearbridgemobile.com/5-methods-for-increasing-app-engagement-user-retention/ <br>
Engagement – describes how active users are on the application. While this is a somewhat subjective metric, Localytics describes highly engaged users as those that have 10+ sessions per month.<br>
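As a rough illustration of the 10+ sessions per month threshold, the sketch below counts sessions per user per calendar month on a made-up session log. The column names `CUSTOMER_ID` and `CLIENT_DATE` and all values are assumptions for illustration, not taken from a real data set.

```python
import pandas as pd

# Hypothetical session log: one row per session (names and values are assumptions).
sessions = pd.DataFrame({
    "CUSTOMER_ID": ["a"] * 12 + ["b"] * 3,
    "CLIENT_DATE": pd.to_datetime(
        ["2018-06-%02d" % d for d in range(1, 13)]
        + ["2018-06-01", "2018-06-15", "2018-07-02"]
    ),
})

# Count sessions per user per calendar month.
sessions["MONTH"] = sessions["CLIENT_DATE"].dt.strftime("%Y-%m")
sessions_per_month = sessions.groupby(["CUSTOMER_ID", "MONTH"]).size()

# Keep only the user-months that reach the 10+ sessions threshold.
highly_engaged = sessions_per_month[sessions_per_month >= 10]
print(highly_engaged)
```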

Metrics to study:
0. Retention rate and churn rate
1. Session length (time between opening and closing of app)
2. Time interval between two consecutive sessions
3. Screen flows
4. App crashes
5. Daily uninstalls (new vs existing users, Android vs iOS)
6. Track App Launch to App Launch Retention Cohorts
7. Track Active Users: DAUs (daily Active Users), MAUs (Monthly Active Users), Stickiness = DAU/MAU
8. Track Number of Daily/Monthly Sessions
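The stickiness ratio from item 7 can be sketched on a toy activity log. This is a minimal sketch, assuming hypothetical columns `CUSTOMER_ID` and `CLIENT_DATE`, a single calendar month of data, and DAU averaged over the days that have at least one session.

```python
import pandas as pd

# Hypothetical activity log: one row per session (names and values are assumptions).
log = pd.DataFrame({
    "CUSTOMER_ID": ["a", "b", "a", "c", "a"],
    "CLIENT_DATE": pd.to_datetime(
        ["2018-06-01", "2018-06-01", "2018-06-02", "2018-06-02", "2018-06-03"]
    ),
})

# DAU: distinct users per day. MAU: distinct users over the whole month.
dau = log.groupby("CLIENT_DATE")["CUSTOMER_ID"].nunique()
mau = log["CUSTOMER_ID"].nunique()

# Stickiness = average DAU / MAU (averaged over days that have activity).
stickiness = dau.mean() / mau
print(dau)
print(round(stickiness, 3))
```

A value close to 1 would mean that most monthly users come back nearly every day.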

More references:
https://blog.appsee.com/the-best-metrics-and-tools-for-measuring-user-engagement/ <br>
https://clevertap.com/blog/cohort-analysis-user-retention/

## Load Libraries

In [1]:
#from sklearn import cluster
#from collections import defaultdict
import matplotlib.pyplot as plt
import matplotlib as mpl

#from matplotlib import cm
import pandas as pd
import numpy as np
import seaborn as sns

#from sklearn.metrics.cluster import normalized_mutual_info_score
#from sklearn.metrics.cluster import adjusted_rand_score

%matplotlib inline

In [2]:
def label_encoding(df, col_name):
    # Convert the column to a categorical dtype and store its integer codes (modifies df in place)
    df[col_name] = df[col_name].astype('category')
    df[col_name+"_CAT"] = df[col_name].cat.codes
    return
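The label-encoding idea can be tried on a small frame as follows. This is only a sketch: `encode_labels` is a hypothetical variant that returns a copy instead of modifying the frame in place, and the `PLATFORM` column is made up.

```python
import pandas as pd

def encode_labels(df, col_name):
    """Return a copy of df with col_name as a category and its codes in col_name + '_CAT'."""
    out = df.copy()
    out[col_name] = out[col_name].astype("category")
    out[col_name + "_CAT"] = out[col_name].cat.codes
    return out

# Hypothetical input: categories are ordered alphabetically before codes are assigned.
devices = pd.DataFrame({"PLATFORM": ["ios", "android", "ios"]})
encoded = encode_labels(devices, "PLATFORM")
print(encoded)
```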


# <font color=green>1. USER RETENTION AND CHURN RATE</font>

For the study I am going to use a sample dataset I found online called chapter-12-relay-foods.csv. <br> It is from reference http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/#2.-Determine-the-user's-cohort-group-(based-on-their-first-order) <br> which is the tutorial I am going to follow to estimate the user retention. <br>

NOTE <br> 
I will add the variables I think we should use from the InstaSize dataset in double parentheses (()) so that it is easy to insert them and study when we get a larger sample dataset. I have run this code on the current InstaSize sample dataset, but there isn't enough data to draw a retention curve.

## Load the FOOD Data Set

In [3]:
# Open the data file and read the contents into a dataframe
#df=pd.read_csv("SFData5Users.csv",parse_dates=['FIRST_SESSION_DATE'])
#df.describe()

# Open the test data set
df=pd.read_csv("tute_data/chapter-12-relay-foods.csv",parse_dates=['OrderDate'])
df.head()

IOError: File tute_data/chapter-12-relay-foods.csv does not exist

## Pre-processing (required for the final analysis)

In [None]:
# Keep only the ones required for this analysis
#keep_col = ['CUSTOMER_ID','CLIENT_DATE','CLIENT_TIME','SESSION_UUID','FIRST_SESSION_DATE']
#df = df[keep_col]
#df

In [None]:
df.dtypes

In [None]:
# Find columns with null values
print(df.isnull().sum())


## Cohort Analysis

Reference - http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/ <br>

What is cohort analysis? <br>
A cohort is a group of users who share something in common like their sign-up date, first purchase month, birth date, acquisition channel, etc. Cohort analysis is the method by which these groups are tracked over time, helping us to spot trends, understand repeat behaviors (purchases, engagement, amount spent, etc.), and monitor customer and revenue retention. <br>
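To make the definition concrete, here is a minimal sketch on a toy order log. The column names follow the tutorial data set used in this notebook, the values are invented, and the cohort is assumed to be the month of each user's first order.

```python
import pandas as pd

# Toy order log (hypothetical values).
orders = pd.DataFrame({
    "UserId": [1, 1, 2, 2, 3],
    "OrderDate": pd.to_datetime(
        ["2009-01-05", "2009-02-10", "2009-01-20", "2009-03-02", "2009-02-14"]
    ),
})

# Calendar month of each order, and each user's cohort (month of the first order).
orders["OrderPeriod"] = orders["OrderDate"].dt.strftime("%Y-%m")
first_order = orders.groupby("UserId")["OrderDate"].transform("min")
orders["CohortGroup"] = first_order.dt.strftime("%Y-%m")

# Distinct active users per cohort (rows) and calendar month (columns).
active = orders.groupby(["CohortGroup", "OrderPeriod"])["UserId"].nunique().unstack()
print(active)
```

Each row then shows how many members of one cohort were still ordering in each later month.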

### 1. Create a period column and a year column based on the OrderDate ((FIRST_SESSION_DATE))

In [None]:
#df['FIRST_SESSION_PERIOD'] = df.FIRST_SESSION_DATE.apply(lambda x: x.strftime('%Y-%m'))
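# OrderPeriod (year-month) is needed later when rolling up by CohortGroup & OrderPeriod
df['OrderPeriod'] = df.OrderDate.apply(lambda x: x.strftime('%Y-%m'))
df['OrderYear'] = df.OrderDate.apply(lambda x: x.strftime('%Y'))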
df.head()

### 2. Determine the user's cohort group based on their first order ((first session))


In [None]:
# Create a new column called CohortGroup, which is the year and month in which the user first started using the app.
#df.set_index('CUSTOMER_ID', inplace=True)
#df['COHORT_GROUP'] = df.groupby(level=0)['FIRST_SESSION_DATE'].min().apply(lambda x: x.strftime('%Y-%m'))
df.set_index('UserId', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['OrderDate'].min().apply(lambda x: x.strftime('%Y-%m'))
df.reset_index(inplace=True)
df.head()

### 3. Rollup data by CohortGroup & OrderPeriod ((COHORT_GROUP & FIRST_SESSION_PERIOD))

In [None]:
#grouped = df.groupby(['COHORT_GROUP', 'FIRST_SESSION_PERIOD'])
grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

# count the unique users and orders per Group + Period
#cohorts = grouped.agg({'CUSTOMER_ID': pd.Series.nunique,'SESSION_UUID': pd.Series.nunique})
cohorts = grouped.agg({'UserId': pd.Series.nunique,'OrderId': pd.Series.nunique})

# make the column names more meaningful
#cohorts.rename(columns={'CUSTOMER_ID': 'TOTAL_CUSTOMERS', 'SESSION_UUID': 'TOTAL_SESSIONS'}, inplace=True)
cohorts.rename(columns={'UserId': 'TotalUsers', 'OrderId': 'TotalOrders'}, inplace=True)

cohorts


### 4. Label the CohortPeriod for each CohortGroup ((COHORT_PERIOD for each COHORT_GROUP))


In [None]:
# Check how each cohort has behaved in the months following their first session.
# This allows us to compare cohorts across various stages of their lifetime.
# To do this we need to index each cohort to their first session month. 
def cohort_period(df):
    #df['COHORT_PERIOD'] = np.arange(len(df)) + 1
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df

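# group_keys=False keeps the original (CohortGroup, OrderPeriod) index;
# without it, recent pandas versions prepend the group key as an extra index level
cohorts = cohorts.groupby(level=0, group_keys=False).apply(cohort_period)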
cohorts
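The same period numbering can also be sketched with `groupby(...).cumcount()`, which avoids the helper function. The toy table below is hypothetical and, like the cell above, assumes the rows of each cohort are already sorted by period.

```python
import pandas as pd

# Toy version of the cohorts table, indexed by (CohortGroup, OrderPeriod).
index = pd.MultiIndex.from_tuples(
    [("2009-01", "2009-01"), ("2009-01", "2009-02"), ("2009-02", "2009-02")],
    names=["CohortGroup", "OrderPeriod"],
)
toy_cohorts = pd.DataFrame({"TotalUsers": [10, 4, 7]}, index=index)

# cumcount numbers the rows within each cohort from 0, so add 1 to start at period 1.
toy_cohorts["CohortPeriod"] = toy_cohorts.groupby(level=0).cumcount() + 1
print(toy_cohorts)
```

Both approaches number rows sequentially, so a cohort with no activity in some month would have its later periods shifted down by one.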


## User Retention by Cohort Group

In [None]:
# We look at the percentage change of each CohortGroup over time 

# reindex the DataFrame
cohorts.reset_index(inplace=True)

#cohorts.set_index(['COHORT_GROUP', 'COHORT_PERIOD'], inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)

# create a Series holding the total size of each CohortGroup
#cohort_group_size = cohorts['TOTAL_CUSTOMERS'].groupby(level=0).first()
# (user retention is measured on unique users, so use TotalUsers rather than TotalOrders)
cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()

cohort_group_size.head()


In [None]:
# Now, we'll need to divide the TotalUsers values in cohorts by cohort_group_size.
# Since DataFrame operations are performed based on the indices of the objects, we'll use unstack
# on our cohorts DataFrame to create a matrix where each column represents a CohortGroup and
# each row is the CohortPeriod corresponding to that group.

#cohorts['TOTAL_CUSTOMERS'].unstack(0).head()
cohorts['TotalUsers'].unstack(0).head()


In [None]:
# Now utilize broadcasting to divide each column by the corresponding cohort_group_size.

#user_retention = cohorts['TOTAL_CUSTOMERS'].unstack(0).divide(cohort_group_size, axis=1)
user_retention = cohorts['TotalUsers'].unstack(0).divide(cohort_group_size, axis=1)

user_retention.head(10)
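The column-wise division can be checked on a small matrix. This is a sketch with invented counts, where rows are periods, columns are cohorts, and the cohort size is taken from the first period.

```python
import pandas as pd

# Toy matrix: rows are CohortPeriod, columns are CohortGroup (hypothetical counts).
counts = pd.DataFrame(
    {"2009-01": [10.0, 4.0, 2.0], "2009-02": [8.0, 2.0, float("nan")]},
    index=pd.Index([1, 2, 3], name="CohortPeriod"),
)

# Size of each cohort = number of users active in its first period.
group_size = counts.iloc[0]

# axis=1 aligns group_size with the column labels, so every column
# is divided by the size of its own cohort.
retention = counts.divide(group_size, axis=1)
print(retention)
```

Period 1 is always 1.0 by construction, and periods a cohort has not reached yet stay NaN.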


## Visualizations

### Retention Rate Trend 

In [None]:
# Plot the cohorts over time in an effort to spot behavioral differences or similarities.
#user_retention[['2018-06', '2018-07']].plot(figsize=(10,5))
#user_retention[['2009-01', '2009-02', '2010-01']].plot(figsize=(10,5))
user_retention.plot(figsize=(10,5))

plt.title('Cohorts: User Retention')
plt.xticks(np.arange(1, 12.1, 1))
plt.xlim(1, 12)
#plt.ylabel('% of Active users');
#plt.xlabel('Months following first session');
plt.ylabel('% of Cohort Purchasing');

### Retention Table

In [None]:
sns.set(style='white')

plt.figure(figsize=(12, 8))
plt.title('Cohorts: User Retention')
sns.heatmap(user_retention.T, mask=user_retention.T.isnull(), annot=True, fmt='.0%');
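The churn rate named in the section title is not computed above. One common convention, assumed here, treats it as the complement of retention: the share of the original cohort that is no longer active in a given period. A minimal sketch on a toy matrix shaped like `user_retention` (values are invented):

```python
import pandas as pd

# Toy retention matrix: rows are CohortPeriod, columns are CohortGroup.
toy_retention = pd.DataFrame(
    {"2009-01": [1.0, 0.4, 0.2], "2009-02": [1.0, 0.25, float("nan")]},
    index=pd.Index([1, 2, 3], name="CohortPeriod"),
)

# Churn relative to the original cohort size; NaN cells stay NaN.
toy_churn = 1 - toy_retention
print(toy_churn)
```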

### Monthly Retention or the percent of users who return to the app one month, two months, and three months after the app is downloaded.


In [None]:
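# To estimate this, take the average across cohorts for periods 1, 2 and 3 separately.
# Separate the cohorts by year (2009 and 2010) using the year prefix of each CohortGroup label.
# Transposing avoids groupby(axis=1), which is deprecated in recent pandas versions.
cohort_year = user_retention.columns.str.split("-").str[0]
monthly_retention = user_retention.T.groupby(cohort_year).mean().T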
monthly_retention.head()

In [None]:
# Select all cases where CohortPeriod < 4 (the index of monthly_retention is the CohortPeriod)
first_three_months = monthly_retention[monthly_retention.index < 4]
first_three_months


In [None]:
monthly_retention.plot(y='2009', use_index=True, figsize=(10,5))


In [None]:
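# Grouped bar chart: average retention in periods 1-3, one bar per cohort year
first_three = monthly_retention[monthly_retention.index < 4]
n_years = len(first_three.columns)

fig, ax = plt.subplots(figsize=(10, 5))
index = np.arange(len(first_three.index))
bar_width = 0.8 / n_years
opacity = 0.8

# Draw one set of bars per cohort year, shifted so the bars sit side by side
for i, year in enumerate(first_three.columns):
    ax.bar(index + i * bar_width, first_three[year], bar_width,
           alpha=opacity,
           label=year)

ax.set_xticks(index + bar_width * (n_years - 1) / 2)
ax.set_xticklabels(first_three.index)
ax.set_xlabel('CohortPeriod')
ax.set_ylabel('Average retention')
ax.legend()

plt.tight_layout()
plt.show()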
