### **Customer behaviour metrics quality assurance**
Having run quality assurance on the event data from liveBook's data warehouse, we can now take snapshots of customer behaviour over the measurement period. Recall that popular event types were defined as those averaging at least 0.1 events per account per month. Because this frequency is relatively low, each customer metric is calculated once a month over the preceding 84 days (12 weeks), so that infrequent events still accumulate measurable counts.
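The measurement schedule described above can be sketched as follows. This is a minimal illustration for a single event type, assuming a hypothetical event table with columns `account_id` and `event_time`; the real warehouse schema is not shown in this notebook, and the toy data is invented.

```python
import pandas as pd

WINDOW_DAYS = 84  # trailing window length taken from the text above


def count_events_in_window(events: pd.DataFrame, calc_at: pd.Timestamp,
                           window_days: int = WINDOW_DAYS) -> pd.Series:
    """Count events per account in the interval (calc_at - window_days, calc_at]."""
    start = calc_at - pd.Timedelta(days=window_days)
    in_window = (events["event_time"] > start) & (events["event_time"] <= calc_at)
    return events.loc[in_window].groupby("account_id").size()


# Hypothetical toy data, for illustration only.
events = pd.DataFrame({
    "account_id": [1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2020-01-05", "2020-03-20", "2020-03-01", "2020-01-02"]
    ),
})

# One measurement date per month (first day of each month).
calc_dates = pd.date_range("2020-03-01", "2020-04-01", freq="MS")
snapshots = {d: count_events_in_window(events, d) for d in calc_dates}
```

Each entry of `snapshots` holds one account-level count series per measurement date, so the same event can contribute to several consecutive monthly snapshots while it remains inside the 84-day window.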

---
1. Import `metrics_over_time` data for each popular event and plot summary statistics
2. Manually assess data quality using quality assurance plots


In [1]:
!pip install plotly --upgrade

Collecting plotly
  Downloading plotly-5.16.1-py2.py3-none-any.whl (15.6 MB)
     15.6/15.6 MB 23.5 MB/s eta 0:00:00
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.15.0
    Uninstalling plotly-5.15.0:
      Successfully uninstalled plotly-5.15.0
Successfully installed plotly-5.16.1


In [2]:
from google.colab import drive
drive.mount('drive')

Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).


In [3]:
from google.colab import auth
from google.auth import default
import gspread
import itertools
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [4]:
# From 'Event data QA.ipynb'
popular_events = ['LivebookLogin', 'ProductTocLivebookLinkOpened', 'ReadingOpenChapter', 'HighlightCreated', 'FreeContentCheckout', 'ReadingFreePreview',
                  'EBookDownloaded','ReadingOwnedBook']


### **1. Quality assurance plots**

In [5]:
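# Subplot grid: four rows and one column, one row per summary statistic
r = list(range(1, 5))
c = [1]
subplots = ['max', 'avg', 'min', 'count_with_metric']
pos = list(itertools.product(r, c))  # (row, col) position of each subplot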

for event in popular_events:
  filepath = f"/content/drive/MyDrive/Churn prediction/metrics_over_time_{event}.csv"
  event_df = pd.read_csv(filepath, index_col='calc_at')
  fig = make_subplots(rows=r[-1], cols=c[-1], subplot_titles=subplots)
  fig.update_xaxes(showgrid=False)
  fig.update_yaxes(showgrid=False)
  fig.update_layout(height=600, width=600, title_text=event, showlegend=False)
  for subplot, p in zip(subplots, pos):
    # Leave 30% headroom above the largest value on each y-axis
    fig.update_yaxes(range=[0, 1.3*event_df[subplot].max()], row=p[0], col=p[1])
    # add_trace with row/col replaces the deprecated append_trace
    fig.add_trace(go.Scatter(x=event_df.index, y=event_df[subplot], line=dict(width=0.5), marker_color='blue'), row=p[0], col=p[1])
  fig.show()
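The plots above are assessed by eye. As a complement, a simple automated check could flag measurement dates where `count_with_metric` departs sharply from its usual level. The sketch below is only an illustration: the function name `flag_count_anomalies`, the 50% tolerance and the example values are all assumptions, and it presumes the series median is nonzero.

```python
import pandas as pd


def flag_count_anomalies(event_df: pd.DataFrame, tolerance: float = 0.5) -> pd.DataFrame:
    """Return rows whose count_with_metric deviates from the series median
    by more than `tolerance`, expressed as a fraction of the median."""
    median = event_df["count_with_metric"].median()
    rel_dev = (event_df["count_with_metric"] - median).abs() / median
    return event_df.loc[rel_dev > tolerance]


# Hypothetical values, for illustration only.
example = pd.DataFrame(
    {"count_with_metric": [1000, 1020, 30, 990, 1010]},
    index=pd.Index(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01"],
        name="calc_at",
    ),
)
flagged = flag_count_anomalies(example)  # only the 2020-03-01 row is returned
```

A check like this does not replace the visual review; it only narrows down which dates deserve a closer look.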

### **2. Extract behavioural metrics for current customers**
- What percentage of customers engaged in each behaviour during the 84 days before the measurement date?
- What are the typical and maximum values of each customer metric?

In [6]:
filepath = "/content/drive/MyDrive/Churn prediction/curr_customer_metrics.csv"
curr_customer_metrics = pd.read_csv(filepath)
# One row per metric, with describe()'s default statistics as columns
summary = curr_customer_metrics.describe()
summary = summary.T
summary['skew'] = curr_customer_metrics.skew(numeric_only=True)
summary['1%'] = curr_customer_metrics.quantile(q=0.01, numeric_only=True)
summary['99%'] = curr_customer_metrics.quantile(q=0.99, numeric_only=True)
# Fraction of customers with a nonzero value for each metric
summary['nonzero'] = curr_customer_metrics.astype(bool).sum(axis=0)/curr_customer_metrics.shape[0]
summary = summary[['count', 'nonzero', 'mean', 'std', 'skew', 'min', '1%', '25%', '50%', '75%', '99%', 'max']]

In [7]:
summary

| metric | count | nonzero | mean | std | skew | min | 1% | 25% | 50% | 75% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| readingownedbook | 54613.0 | 0.319136 | 6.70756 | 24.247329 | 8.935092 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 116.0 | 1000.0 |
| ebookdownloaded | 54613.0 | 0.548716 | 2.516745 | 7.991171 | 22.501083 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 29.88 | 693.0 |
| readingfreepreview | 54613.0 | 0.21559 | 1.265083 | 4.839763 | 16.554243 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 18.0 | 314.0 |
| highlightcreated | 54613.0 | 0.02992 | 0.995276 | 17.115185 | 37.697204 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 1349.0 |
| freecontentcheckout | 54613.0 | 0.276125 | 1.574845 | 208.434963 | 233.654902 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 10.0 | 48708.0 |
| readingopenchapter | 54613.0 | 0.156062 | 0.864172 | 3.731637 | 14.540966 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 | 260.0 |
| producttoclivebooklinkopened | 54613.0 | 0.176094 | 0.617307 | 4.97428 | 71.144698 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 544.0 |
| livebooklogin | 54613.0 | 0.344185 | 0.539578 | 1.174135 | 13.934591 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 83.0 |
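One way to read the summary table above is to compare each metric's maximum with its 99th percentile: a very large ratio points to a handful of extreme accounts that may deserve inspection. The sketch below uses three rows copied from the table; the threshold of 100 is an arbitrary assumption chosen for illustration, not a rule from this analysis.

```python
import pandas as pd

# Three rows copied from the summary table above.
summary_excerpt = pd.DataFrame(
    {"99%": [116.0, 10.0, 4.0], "max": [1000.0, 48708.0, 83.0]},
    index=["readingownedbook", "freecontentcheckout", "livebooklogin"],
)

RATIO_THRESHOLD = 100  # assumed cut-off, for illustration only

# How many times larger the maximum is than the 99th percentile.
max_to_p99 = summary_excerpt["max"] / summary_excerpt["99%"]
suspect = max_to_p99[max_to_p99 > RATIO_THRESHOLD].index.tolist()
```

In this excerpt only `freecontentcheckout` exceeds the assumed threshold, with a maximum of 48708 against a 99th percentile of 10.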


In [8]:
summary.to_csv('curr_customer_metrics_summary.csv')
!cp curr_customer_metrics_summary.csv "drive/MyDrive/Churn prediction/"

### **3. Feature engineering**

In [9]:
new_metrics = ['total_events_per_quarter',
     'uniq_prod_ct_per_quarter',
     'total_freebies_per_quarter',
     'pct_downloads_per_quarter',
     'pct_reading_per_quarter']
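The engineered metrics are loaded precomputed, and this notebook does not show how they are defined. Purely as a hypothetical illustration of how a ratio-style metric could be built from two count columns, the sketch below divides one count by another and returns 0 where the denominator is 0. The helper `safe_ratio`, the column `uniq_prod_ct`, the output column name and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two count series, returning 0 where the denominator is 0."""
    denom = denominator.replace(0, np.nan)  # avoid division by zero
    return (numerator / denom).fillna(0.0)


# Hypothetical toy counts, for illustration only.
counts = pd.DataFrame({"ebookdownloaded": [2, 0, 3], "uniq_prod_ct": [4, 1, 0]})
counts["hypothetical_downloads_ratio"] = safe_ratio(
    counts["ebookdownloaded"], counts["uniq_prod_ct"]
)
```

Note that a ratio of two counts is not bounded by 1 unless the numerator is a subset of the denominator, which is worth keeping in mind when reading any metric labelled as a percentage.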

In [10]:
# Same plot layout as in section 1, applied to the engineered metrics
for metric in new_metrics:
  filepath = f"/content/drive/MyDrive/Churn prediction/metrics_over_time_{metric}.csv"
  metric_df = pd.read_csv(filepath, index_col='calc_at')
  fig = make_subplots(rows=r[-1], cols=c[-1], subplot_titles=subplots)
  fig.update_xaxes(showgrid=False)
  fig.update_yaxes(showgrid=False)
  fig.update_layout(height=600, width=600, title_text=metric, showlegend=False)
  for subplot, p in zip(subplots, pos):
    # Leave 30% headroom above the largest value on each y-axis
    fig.update_yaxes(range=[0, 1.3*metric_df[subplot].max()], row=p[0], col=p[1])
    # add_trace with row/col replaces the deprecated append_trace
    fig.add_trace(go.Scatter(x=metric_df.index, y=metric_df[subplot], line=dict(width=0.5), marker_color='blue'), row=p[0], col=p[1])
  fig.show()

### **4. Summary statistics for the engineered metrics**

In [11]:
filepath = "/content/drive/MyDrive/Churn prediction/updated_curr_customer_metrics.csv"
curr_customer_metrics = pd.read_csv(filepath)
# One row per metric, with describe()'s default statistics as columns
summary = curr_customer_metrics.describe()
summary = summary.T
summary['skew'] = curr_customer_metrics.skew(numeric_only=True)
summary['1%'] = curr_customer_metrics.quantile(q=0.01, numeric_only=True)
summary['99%'] = curr_customer_metrics.quantile(q=0.99, numeric_only=True)
# Fraction of customers with a nonzero value for each metric
summary['nonzero'] = curr_customer_metrics.astype(bool).sum(axis=0)/curr_customer_metrics.shape[0]
summary = summary[['count', 'nonzero', 'mean', 'std', 'skew', 'min', '1%', '25%', '50%', '75%', '99%', 'max']]

In [12]:
summary

| metric | count | nonzero | mean | std | skew | min | 1% | 25% | 50% | 75% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| total_events_per_quarter | 57172.0 | 1.0 | 34.962184 | 4593.349453 | 239.076475 | 1.0 | 1.0 | 2.0 | 5.0 | 13.0 | 182.0 | 1098270.0 |
| uniq_prod_ct_per_quarter | 57172.0 | 1.0 | 3.700325 | 5.671036 | 6.797467 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 | 28.0 | 141.0 |
| total_freebies_per_quarter | 57172.0 | 0.469863 | 3.538305 | 203.821167 | 238.670398 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 31.0 | 48708.0 |
| pct_reading_per_quarter | 57172.0 | 0.304852 | 0.180303 | 0.312965 | 1.456898 | 0.0 | 0.0 | 0.0 | 0.0 | 0.285714 | 1.0 | 1.0 |
| pct_downloads_per_quarter | 57172.0 | 0.524155 | 0.585356 | 0.844212 | 4.129836 | 0.0 | 0.0 | 0.0 | 0.25 | 1.0 | 3.5 | 20.0 |
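The summary statistics for the base metrics and for the engineered metrics are built with the same steps, so they could be wrapped in one helper. The sketch below is one possible refactoring, not the notebook's own code: `summarize_metrics` is a hypothetical name, and unlike the cells above it restricts the calculation to numeric columns up front.

```python
import pandas as pd

SUMMARY_COLUMNS = ['count', 'nonzero', 'mean', 'std', 'skew', 'min',
                   '1%', '25%', '50%', '75%', '99%', 'max']


def summarize_metrics(metrics: pd.DataFrame) -> pd.DataFrame:
    """Return one row per numeric metric with the statistics used above."""
    numeric = metrics.select_dtypes(include="number")
    summary = numeric.describe().T
    summary['skew'] = numeric.skew()
    summary['1%'] = numeric.quantile(0.01)
    summary['99%'] = numeric.quantile(0.99)
    # Fraction of rows with a nonzero value for each metric
    summary['nonzero'] = (numeric != 0).mean()
    return summary[SUMMARY_COLUMNS]


# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({"a": [0, 0, 1, 3], "b": [1, 2, 3, 4]})
toy_summary = summarize_metrics(toy)
```

With such a helper, each summary cell would reduce to reading the CSV and calling `summarize_metrics` on the result.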
