

###**Event data quality assurance**

This notebook analyzes Manning's liveBook event data for quality assurance (QA). The assessed data was collected over six months from December 1, 2019 through June 1, 2020.

---

1. Load and prepare data
  - Convert the `events_per_account` query's [results](https://docs.google.com/spreadsheets/d/1lbEZy6SHQ6m7qAWTqEUXEys0YGSrCVQ7zHVf0heF584/edit?usp=sharing) to the `account_metrics` pandas DataFrame
  - Prepare `account_metrics` for analysis
2. Narrow the scope of analysis
    - Determine the types of events users engage in most frequently using `account_metrics`. Popular event types are defined as those that averaged at least 0.1 events per account per month over the measurement period.
3. Assess data quality
  - Check for anomalies in each event type using the `events_per_day` query and time series plots

In [1]:
!pip install plotly --upgrade

Collecting plotly
  Downloading plotly-5.16.1-py2.py3-none-any.whl (15.6 MB)
Installing collected packages: plotly
  Attempting uninstall: plotly
    Found existing installation: plotly 5.15.0
    Uninstalling plotly-5.15.0:
      Successfully uninstalled plotly-5.15.0
Successfully installed plotly-5.16.1


In [2]:
from google.colab import auth, files
from google.auth import default
import gspread
import itertools
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

###**1. Load and prepare `events_per_account` results**

In [3]:
auth.authenticate_user()

creds, _ = default()

# Save events_per_account.gsheet as a Pandas DataFrame
gc = gspread.authorize(creds)
worksheet = gc.open('events_per_account').sheet1
rows = worksheet.get_all_values()
account_metrics = pd.DataFrame.from_records(rows)

In [4]:
# Set header from the first row, then drop that row
account_metrics.columns = rows[0]
account_metrics = account_metrics.iloc[1:].copy()

# Convert numerical columns (values arrive from the sheet as strings)
num_cols = ['event_count', 'account_count', 'events_per_account', 'month_count', 'events_per_account_per_month']
account_metrics[num_cols] = account_metrics[num_cols].apply(pd.to_numeric, errors='coerce')

# Compute each event type's share of all events, in percent
event_total = account_metrics['event_count'].sum() # 3156212
account_metrics['event_pct'] = account_metrics['event_count'] / event_total * 100

# Verify
#account_metrics.dtypes



---

**About the data**

For each type of liveBook event, `account_metrics` contains the number of events per account and the number of events per account per month. Over 3 million (3,156,212) liveBook events were logged by 88,375 active accounts over the six-month measurement period. 35 unique event types were observed.
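
As a minimal sketch of how the per-account metrics relate to the raw counts, the cell below rebuilds them on toy data. The formulas are an assumption (events divided by accounts, then by months); the query that produced the sheet is not shown in this notebook, and the event names and numbers are made up.

```python
import pandas as pd

# Toy rows with made-up numbers; column names follow `num_cols` above
toy = pd.DataFrame({
    "event_type": ["EventA", "EventB"],
    "event_count": [1200, 30],
    "account_count": [100, 100],
    "month_count": [6, 6],
})

# Assumed definitions (not confirmed by the original query)
toy["events_per_account"] = toy["event_count"] / toy["account_count"]
toy["events_per_account_per_month"] = toy["events_per_account"] / toy["month_count"]

# Share of all events contributed by each event type, in percent
toy["event_pct"] = toy["event_count"] / toy["event_count"].sum() * 100

print(toy)
```

If the real sheet follows the same definitions, recomputing these columns and comparing them with the stored values is a quick consistency check.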


###**2. Find popular events using `events_per_account_per_month`**

In [60]:
# Sort by `events_per_account_per_month`
account_metrics.sort_values(by='events_per_account_per_month', ascending=True, inplace=True)

# Plot
fig = px.bar(account_metrics, y='event_type', x='events_per_account_per_month', title='Monthly engagement per account, by event type', height=800)
fig.update_xaxes(title_text=None, showgrid=False, nticks=10, ticks="outside", tickcolor='#003f5c', ticklen=3)
fig.update_yaxes(title_text=None, ticks="outside", tickcolor='white', ticklen=10)
fig.update_layout(
    font_family='Open Sans',
    xaxis={'side': 'top'},
    showlegend=False,
    hoverlabel=dict(
        bgcolor="white",
        font_size=10))
fig.add_vrect(x0=0, x1=0.1, fillcolor="white", opacity=0.35, layer="above", line_width=0)
fig.add_trace(go.Scatter(
    x=[0.2],
    y=["SearchMade"],
    text=["Engagement cutoff"],
    mode="text",
))
fig.add_shape(type="line",
    x0=0.1, y0=35, x1=0.1, y1=0,
    line=dict(color="white",width=3)
)
fig.update_traces(hovertemplate="%{x} %{y} events per account per month")
fig.show()

---

**Most popular events**

Of the 35 event types, 10 (28.6 percent) averaged at least 0.1 events per account per month over the measurement period.

In descending order of frequency, these event types are:

1. `ReadingOwnedBook`
2. `FirstLivebookAccess`
3. `FirstManningAccess`
4. `EBookDownloaded`
5. `ReadingFreePreview`
6. `FreeContentCheckout`
7. `HighlightCreated`
8. `ReadingOpenChapter`
9. `ProductTocLivebookLinkOpened`
10. `LivebookLogin`

Event types related to reading are among the most popular on the liveBook platform. Reading events include `ReadingOwnedBook`, `ReadingFreePreview`, `EBookDownloaded`, and `ReadingOpenChapter`. This is expected, as the liveBook platform was launched to facilitate reading Manning Publications' technical books.
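
The cutoff rule and the share of popular event types can be sketched on toy data as follows. The event names and rates are hypothetical; only the 0.1 cutoff and the column name come from this notebook.

```python
import pandas as pd

CUTOFF = 0.1  # events per account per month, as defined in this notebook

# Hypothetical rates for five made-up event types
toy = pd.DataFrame({
    "event_type": ["A", "B", "C", "D", "E"],
    "events_per_account_per_month": [2.5, 0.4, 0.1, 0.03, 0.0],
})

# "At least 0.1" means the comparison is inclusive
is_popular = toy["events_per_account_per_month"] >= CUTOFF

# Popular event types in descending order of frequency
popular = (
    toy.loc[is_popular]
    .sort_values("events_per_account_per_month", ascending=False)["event_type"]
    .tolist()
)

# Share of event types that clear the cutoff, in percent
share = is_popular.mean() * 100
print(popular, f"{share:.1f} percent")
```

On the real data the same calculation gives 10 of 35 event types, or 28.6 percent.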

---

**Least popular events**

The least common event types averaged approximately 0.0 events per account per month over the measurement period:
1. `ProductLiveaudioUpsell`
2. `SherlockHolmesClueFound`
3. `ProductSeeFreeLinkOpened`
4. `UnknownOriginLivebookLinkOpened`
5. `AddOrUpdateCoupon`
6. `RemoveProductOffering`
7. `SharebleLinkOpened`
8. `CommentCreated`

Event types related to social features on the liveBook platform (i.e., sharing content and communicating with other users and non-users) are among the least common.

In [6]:
# List popular events to narrow the scope of analysis
# (cutoff is inclusive: "at least 0.1 events per account per month")
popular_events = account_metrics.loc[account_metrics['events_per_account_per_month'] >= 0.1, 'event_type'].tolist()
popular_events

['LivebookLogin',
 'ProductTocLivebookLinkOpened',
 'ReadingOpenChapter',
 'HighlightCreated',
 'FreeContentCheckout',
 'ReadingFreePreview',
 'EBookDownloaded',
 'FirstManningAccess',
 'FirstLivebookAccess',
 'ReadingOwnedBook']


###**3. Detect anomalies using `events_per_day`**


---

**Examine the big picture**
- Do events happen equally every day, or are there patterns?
- Are there any gaps in the record of any events?
- Are there any events that only occur in part of the history?
- Are there any anomalies in the number of events?
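
One way to answer the questions about gaps and partial history is to reindex each daily series onto a complete calendar and look for missing days. This is a minimal sketch on made-up counts, not the check used in this notebook; the series name mirrors the `event_count` column.

```python
import pandas as pd

# Hypothetical daily counts with two missing days (December 3 and 4)
daily = pd.Series(
    [5, 7, 6, 8],
    index=pd.to_datetime(["2019-12-01", "2019-12-02", "2019-12-05", "2019-12-06"]),
    name="event_count",
)

# Reindex onto every calendar day in the observed span; absent days become NaN
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
filled = daily.reindex(full_range)

# Days with no record, and the fraction of days that do have one
missing_days = filled[filled.isna()].index
coverage = filled.notna().mean()

print(list(missing_days.strftime("%Y-%m-%d")))
print(f"coverage: {coverage:.0%}")
```

To catch events that only start partway through the history, build `full_range` from the measurement period's start and end dates instead of the series' own first and last dates.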


In [7]:
# Subplot grid positions: 5 rows x 2 columns, one panel per popular event
# (assumes Google Drive is mounted at /content/drive)
r = list(range(1, 6))
c = list(range(1, 3))
pos = list(itertools.product(r, c))

fig = make_subplots(rows=r[-1], cols=c[-1], subplot_titles=popular_events)

for event, p in zip(popular_events, pos):
  filepath = f"/content/drive/MyDrive/Churn prediction/events_per_day_{event}.csv"
  event_df = pd.read_csv(filepath, index_col='measurement_date')
  # add_trace replaces the deprecated append_trace
  fig.add_trace(
      go.Scatter(x=event_df.index, y=event_df['event_count'], line=dict(width=0.5), marker_color='blue'),
      row=p[0], col=p[1])
fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)
fig.update_layout(height=1000, width=1000, title_text="Anomaly detection", showlegend=False)
fig.show()

All event types show weekly seasonality, which suggests that users engage with liveBook mainly on weekdays. This is consistent with Manning Publications' audience of technical professionals, who likely use the books at work.
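
The weekday pattern can also be checked numerically by averaging counts by day of week. The sketch below uses two weeks of made-up counts rather than the notebook's data.

```python
import pandas as pd

# Two weeks of made-up daily counts: higher on weekdays, lower on weekends
dates = pd.date_range("2019-12-02", periods=14, freq="D")  # starts on a Monday
counts = [100, 110, 105, 108, 95, 40, 35] * 2
daily = pd.Series(counts, index=dates, name="event_count")

# Average count by day of week (0 = Monday, 6 = Sunday)
by_weekday = daily.groupby(daily.index.dayofweek).mean()

# Label-based slices on the integer index are inclusive at both ends
weekday_mean = by_weekday.loc[0:4].mean()
weekend_mean = by_weekday.loc[5:6].mean()

print(by_weekday)
print(f"weekday mean: {weekday_mean:.1f}, weekend mean: {weekend_mean:.1f}")
```

A weekday mean well above the weekend mean supports the weekly seasonality seen in the plots.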


- `LivebookLogin` dips slightly in March 2020 and recovers before April 2020.
- `ProductTocLivebookLinkOpened` drops in January 2020 and remains at the lower level.
- `ReadingOpenChapter` shows a slight positive trend.
- `HighlightCreated` dips slightly in March 2020 and recovers before April 2020.
- `FreeContentCheckout` spikes dramatically in April 2020, suggesting that popular free content was offered at that time.
- `ReadingFreePreview` shows a steady positive trend.
- `EBookDownloaded` has two large spikes, suggesting there were two popular offers on e-books in April 2020.
- `FirstManningAccess` and `FirstLivebookAccess` each have only 1 recorded event up until March 11, 2020, suggesting either that these events are tied to a feature added to liveBook later or that measurement of these events began late. Because these event types lack data for more than half of the measurement period, **they will be dropped from the analysis**.
- `ReadingOwnedBook` has a steady upward trend over the measurement period.
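
Spikes like those described above were identified by eye. A simple numeric check, sketched here on made-up counts, compares each day with a rolling median. The 7-day window and the 3x threshold are arbitrary assumptions, not values used in this notebook.

```python
import pandas as pd

# Made-up daily counts with one obvious spike
dates = pd.date_range("2020-04-01", periods=10, freq="D")
daily = pd.Series([20, 22, 19, 21, 20, 180, 23, 21, 20, 22], index=dates)

# Local baseline: median of a centered 7-day window (shorter at the edges)
baseline = daily.rolling(window=7, center=True, min_periods=1).median()
ratio = daily / baseline

# Flag days more than 3x their local baseline
spikes = daily[ratio > 3]
print(spikes)
```

The median is used instead of the mean so that the spike itself barely shifts the baseline it is compared against.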



---

**Investigate up-close**

The time series for each event can be examined in detail using a range selector to choose a date range within the measurement period and a drop-down menu to switch between event types.


In [8]:
# Concatenate events_per_day data for each event, one column per event type
event_series = []
for event in popular_events:
    filepath = f"/content/drive/MyDrive/Churn prediction/events_per_day_{event}.csv"
    event_df = pd.read_csv(filepath, index_col='measurement_date')
    # Keep only the daily count and name the column after the event
    event_series.append(event_df['event_count'].rename(event))

# Outer join keeps every date that appears in any event's file
all_events = pd.concat(event_series, axis=1, join="outer")

# Verify
#all_events.head(4)

In [9]:
# Dropdown: each button swaps the y-data of trace 0 and updates the title
buttons_list = []
for event in popular_events:
  button = {}
  button["args"] = [
      {'y': [all_events[event]], 'visible': True},
      {'title': event},
      [0]]
  button["label"] = event
  button["method"] = "update"
  buttons_list.append(button)

# Interactive plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=all_events.index, y=all_events['LivebookLogin'], visible=True))
fig.update_yaxes(
    showgrid=False)
fig.update_xaxes(
    showgrid=False,
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=3, label="3m", step="month", stepmode="backward"),
            dict(step="all")])))
fig.update_layout(
    title="LivebookLogin",
    updatemenus=[
        dict(
            buttons=buttons_list,
            direction="down",
            showactive=True,
            x=1,
            xanchor="right",
            y=1.16,
            yanchor="top")])
fig.show()