<a href="https://colab.research.google.com/github/brook-miller/2023mbai417/blob/main/2-homework/B2B_SaaS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

B2B SaaS Metrics: Measure a SaaS Business

In this homework assignment, we will use event data to develop some of the key metrics that drive SaaS businesses. We will be using 2 sources of data:
1. The first is a set of events that are generated by users of the SaaS product. These events are stored in a CSV file (the code below loads it from subscription_info.csv). Each event has the following fields:
    - customer_id: the customer id of the user
    - date_time: the timestamp of the event (action)
    - action: what the user did: subscribe, cancel, or upgrade their service
    - amount: the monthly amount of the subscription or upgrade; amounts listed for an upgrade are incremental to the base subscription amount
2. The second is a set of customer attributes that are available via an API. The API is available at:
https://jdw6paifodaa6hb7alta76h4te0erogq.lambda-url.us-west-2.on.aws/

The API returns JSON objects which contain:
- id: the customer id
- name: the name of the customer
- email: the email address of the customer
- company: the company of the customer

The API has two modes of operation:
- GET /{customer_id} returns the attributes for a single customer, for example:
  https://jdw6paifodaa6hb7alta76h4te0erogq.lambda-url.us-west-2.on.aws/2450
- GET /bulk?startid={id}&size={size} returns the attributes for multiple customers as a JSON array, starting at the id given in startid. At most 500 customers are returned per call, for example:
  https://jdw6paifodaa6hb7alta76h4te0erogq.lambda-url.us-west-2.on.aws/bulk?startid=2450&size=500

You will need to make multiple calls to the API to get data for all of the customers in the events data set.
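
One way to page through the bulk endpoint is sketched below. This is a minimal sketch, not the required approach. The names `fetch_bulk`, `fetch_customers`, and the `fetch_page` parameter are hypothetical helpers introduced here for illustration. How the API treats ids that do not exist is not documented above, so the sketch steps through the id range in blocks of 500 and de-duplicates on `id` in case pages overlap.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://jdw6paifodaa6hb7alta76h4te0erogq.lambda-url.us-west-2.on.aws"
MAX_PAGE_SIZE = 500  # per-call maximum stated in the assignment


def fetch_bulk(start_id, size=MAX_PAGE_SIZE):
    """Fetch one page of customer records from the bulk endpoint."""
    url = f"{BASE_URL}/bulk?startid={start_id}&size={size}"
    with urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_customers(first_id, last_id, fetch_page=fetch_bulk):
    """Collect customer records for ids from first_id through last_id.

    Assumption: each record is a dict with an "id" key, as described above.
    Pages are requested every MAX_PAGE_SIZE ids; records are keyed by id so
    that any overlap between pages does not create duplicates.
    """
    by_id = {}
    for start in range(first_id, last_id + 1, MAX_PAGE_SIZE):
        for record in fetch_page(start, MAX_PAGE_SIZE):
            by_id[record["id"]] = record
    return list(by_id.values())
```

With the events loaded into `df`, a call such as `pd.DataFrame(fetch_customers(df.customer_id.min(), df.customer_id.max()))` would give one row per customer. Passing a different `fetch_page` function makes it possible to try the paging logic without touching the network.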

Determine the annual recurring revenue for each year of 2017-2021.  Annual recurring revenue can be calculated by summing the monthly subscription amounts for each year.  For example, the ARR for 2018 is the sum of the monthly subscription amounts for all customers that signed up and were active in 2018.

There is a straightforward example of a join in pandas that will help you get started with [moving from events](https://github.com/brook-miller/2023mbai417/blob/main/2-class/events_to_aggregates.ipynb) to monthly revenues.
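
As a rough illustration of the events-to-monthly-revenue step (independent of the linked notebook, and not necessarily the approach it takes), the sketch below tracks each customer's monthly rate and rolls it up by month and year. The function names `monthly_revenue` and `annual_revenue` are hypothetical. It assumes that a subscribe or upgrade adds its amount to the customer's monthly rate, that a cancel resets the rate to zero, and that a customer counts toward a month's revenue if they are active at the end of that month.

```python
import pandas as pd


def monthly_revenue(events):
    """Total monthly subscription revenue, indexed by the first day of each month."""
    ev = events.copy()
    ev["date_time"] = pd.to_datetime(ev["date_time"], format="%m/%d/%y %H:%M")
    ev = ev.sort_values(["customer_id", "date_time"])
    ev["amount"] = ev["amount"].fillna(0.0)  # cancel rows carry no amount

    # Number each subscription "spell": it increments after every cancel.
    is_cancel = ev["action"].eq("cancel")
    cancel_flag = is_cancel.astype(int)
    ev["spell"] = cancel_flag.groupby(ev["customer_id"]).cumsum() - cancel_flag

    # Running monthly rate within a spell; a cancel drops the rate to zero.
    ev["rate"] = ev.groupby(["customer_id", "spell"])["amount"].cumsum()
    ev.loc[is_cancel, "rate"] = 0.0

    # Change in rate caused by each event, summed per calendar month.
    previous_rate = ev.groupby("customer_id")["rate"].shift().fillna(0.0)
    ev["delta"] = ev["rate"] - previous_rate
    ev["month"] = ev["date_time"].dt.to_period("M").dt.to_timestamp()
    monthly_delta = ev.groupby("month")["delta"].sum()

    # Cumulative sum of the changes gives the total rate at each month end
    # (assumption: month-end state decides whether a month is billed).
    months = pd.date_range(monthly_delta.index.min(),
                           monthly_delta.index.max(), freq="MS")
    return monthly_delta.reindex(months, fill_value=0.0).cumsum()


def annual_revenue(mrr):
    """Sum the monthly amounts per calendar year, per the assignment's definition."""
    return mrr.groupby(mrr.index.year).sum()
```

Calling `annual_revenue(monthly_revenue(df))` on the events frame would yield one value per year. Working with per-event changes rather than a customer-by-month grid keeps memory use small, at the cost of only producing totals rather than per-customer revenue.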


Determine the 12 month churn rate for customers that signed up in each year of 2017-2020.  The 12 month churn rate is the number of customers that cancelled their subscription in the current year divided by the number of customers that signed up in the previous year.  For example, the 12 month churn rate for 2018 is the number of customers that cancelled their subscription in 2018 divided by the number of customers that signed up in 2017.
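
The definition above can be sketched directly in pandas. This is a minimal sketch of the literal ratio (cancellations in a year divided by signups in the prior year); `churn_by_year` is a hypothetical name, and counting distinct customers rather than raw events is an assumption.

```python
import pandas as pd


def churn_by_year(events):
    """Cancelling customers in year Y divided by customers who signed up in Y-1."""
    ev = events.copy()
    ev["year"] = pd.to_datetime(ev["date_time"], format="%m/%d/%y %H:%M").dt.year

    # Distinct customers per year for each action (assumption: count customers,
    # not events, so a customer who subscribes twice in a year counts once).
    signups = ev[ev["action"] == "subscribe"].groupby("year")["customer_id"].nunique()
    cancels = ev[ev["action"] == "cancel"].groupby("year")["customer_id"].nunique()

    # Shift signup years forward by one so year Y-1 signups line up with year Y.
    prior_signups = signups.copy()
    prior_signups.index = prior_signups.index + 1

    # Years missing from either side produce NaN and are dropped.
    return (cancels / prior_signups).dropna()
```

A cohort-based variant, which follows only the customers who signed up in a given year and checks whether each cancelled within 12 months, would be an alternative reading and would generally give different numbers.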

Show the top 10 revenue generating companies in 2020.  By querying the API you can determine which companies the various user subscriptions are associated with.  You can then sum the monthly subscription amounts for each user to determine the revenue generated by each company.
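
The join-and-rank step might look like the sketch below. It assumes a hypothetical per-customer frame `customer_revenue` with columns `customer_id` and `revenue` (for example, 2020 revenue computed from the events), and a `customers` frame built from the API responses with the `id` and `company` fields described earlier. The function name `top_companies` is also illustrative.

```python
import pandas as pd


def top_companies(customer_revenue, customers, n=10):
    """Sum per-customer revenue by company and return the n largest totals."""
    merged = customer_revenue.merge(
        customers[["id", "company"]],
        left_on="customer_id",
        right_on="id",
        how="left",  # keep every revenue row even if the API lookup is missing
    )
    # groupby drops rows whose company is missing, so unmatched customers
    # do not appear in the ranking.
    return merged.groupby("company")["revenue"].sum().nlargest(n)
```

The result is a Series indexed by company name, sorted from highest to lowest revenue.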

Determine the net revenue retention rate for companies who signed up in 2017-2020.  The revenue retention rate is the revenue generated in the current year by companies that were active or acquired in the previous year, divided by the revenue those same companies generated in the previous year.  For example, the revenue retention rate for 2018 is the 2018 revenue generated by customers that were active or acquired in 2017, divided by their 2017 revenue.
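
The ratio can be sketched as follows, assuming a hypothetical table `revenue` indexed by company with one column per calendar year (how that table is built from the events is left open). Treating "active or acquired in the previous year" as "had revenue greater than zero in the previous year" is an assumption of this sketch.

```python
import pandas as pd


def net_revenue_retention(revenue, year):
    """Current-year revenue of last year's companies divided by their prior-year revenue."""
    prior = revenue[year - 1]
    # Cohort: companies with any revenue in the previous year (assumption).
    cohort = prior[prior > 0].index
    return revenue.loc[cohort, year].sum() / prior.loc[cohort].sum()
```

Companies first acquired in the current year are excluded from both the numerator and the denominator, so new sales do not inflate the rate, while upgrades and cancellations within the cohort do move it.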


The venture capital firm Andreessen Horowitz has a great guide to SaaS metrics.  You can find it [here](https://a16z.com/growth/guide-growth-metrics/?view=results&vertical=enterprise&gtm-motion=bottom_up&arr-revenue-scale=scale_0_20_m&software=application)

In your final cell, compare the 2021 metrics of the SaaS company to the metrics of the SaaS companies in the Andreessen Horowitz guide.  How does this company compare to the baseline in their guide?


In [2]:
#@title standard imports - we'll use in most EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

from datetime import datetime, timedelta
from dateutil.parser import parse
from google.colab import data_table
data_table.enable_dataframe_formatter()

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/brook-miller/2023mbai417/main/2-homework/subscription_info.csv')
df



Unnamed: 0,customer_id,date_time,action,amount
0,2448,1/1/17 10:32,subscribe,49.99
1,2449,1/1/17 11:35,subscribe,49.99
2,2450,1/1/17 11:37,subscribe,49.99
3,2451,1/1/17 13:28,subscribe,49.99
4,2452,1/1/17 13:52,subscribe,99.99
...,...,...,...,...
308212,173079,12/31/21 19:16,cancel,
308213,497581,12/31/21 19:36,cancel,
308214,127630,12/31/21 20:30,cancel,
308215,497538,12/31/21 21:30,upgrade,119.99
