# Cohort Analysis: Exploring Consumer Behavior Over Time

**What is cohort analysis?**

A cohort is a group of people who share similar characteristics.  
Cohort analysis is a type of behavioral analytics in which we group users by a characteristic they share within a defined time span, then compare how each group behaves over time. The data can come from a variety of sources, such as e-commerce platforms, product websites, mobile apps, and business sales databases. We can then turn the raw data into a visualization that shows the current state of the business, product, or specific feature.


The first step in conducting cohort analysis is to select a key indicator, the metric that the whole analysis will track. Depending on the purpose of the analysis, this can be the retention rate, churn rate, number of products sold, number of transactions, number of app installs, etc.


Cohort analysis is a simple tool for uncovering important problems in a product or business that aggregate numbers hide. For example, suppose the number of active users of a product stays constant, and we take this as a good indicator of the state of the business. Cohort analysis may show, however, that every day a large number of new users sign up, start using the product within an hour, and then churn. Looking at the same information from this perspective, we understand that we may need to improve user experience, product quality, market targeting, and more.

As a result of cohort analysis, we measure how many users stayed (engagement) instead of how many users came (growth) in a given time span.
In short, cohort analysis helps us separate growth metrics from engagement metrics. 
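
The difference between the two kinds of metric can be sketched with a tiny, made-up activity log (the user ids and days below are hypothetical). The active-user count looks perfectly stable, while the share of returning users shows that nobody comes back.

```python
# Hypothetical daily activity log: day -> set of active user ids.
active_by_day = {
    1: {"a", "b", "c"},
    2: {"d", "e", "f"},
    3: {"g", "h", "i"},
}

days = sorted(active_by_day)

# Growth-style view: how many users were active on each day.
active_counts = [len(active_by_day[d]) for d in days]

# Engagement-style view: share of the previous day's users who returned.
day_over_day_retention = [
    len(active_by_day[prev] & active_by_day[cur]) / len(active_by_day[prev])
    for prev, cur in zip(days, days[1:])
]

print(active_counts)           # [3, 3, 3]
print(day_over_day_retention)  # [0.0, 0.0]
```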

<br>

**Types of cohort analysis**
<br><br>
There are two types of cohort analysis. We'll dive deeper into each of them while coding.
<br>
1. Acquisition cohorts: Groups divided based on when they signed up for your product
2. Behavioral cohorts: Groups divided based on their behaviors and actions in your product
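
As a minimal sketch of the two groupings, assume a small user table with a sign-up date and a flag for one in-product action. The table, the column names, and the `used_wishlist` action are all hypothetical and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical user table; column names are illustrative only.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "signup_date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-17"]
    ),
    "used_wishlist": [True, False, True, False],
})

# Acquisition cohort: users grouped by the month they signed up.
users["acquisition_cohort"] = users["signup_date"].dt.to_period("M")

# Behavioral cohort: users grouped by an action they took (or did not take).
users["behavioral_cohort"] = users["used_wishlist"].map(
    {True: "used wishlist", False: "no wishlist"}
)

acquisition_sizes = users.groupby("acquisition_cohort")["user_id"].count()
behavioral_sizes = users.groupby("behavioral_cohort")["user_id"].count()
print(acquisition_sizes)
print(behavioral_sizes)
```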

![alt text](docs/cohort.png "Cohort Image")

Image from: https://clevertap.com/blog/cohort-analysis/

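There are two ways to read a cohort table:
1. User lifetime perspective (horizontally, to the right): follow a single cohort across successive periods
2. Product lifetime perspective (vertically, down): compare different cohorts at the same period  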
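
The two reading directions can be illustrated on a toy table, assuming cohorts are laid out as rows and periods since the first purchase as columns. The numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy cohort table: rows are cohorts, columns are periods since first purchase.
table = pd.DataFrame(
    [[100, 40, 30], [120, 60, np.nan], [90, np.nan, np.nan]],
    index=pd.Index(["2021-01", "2021-02", "2021-03"], name="cohort"),
    columns=pd.Index([0, 1, 2], name="period"),
)

# Reading along a row: one cohort followed through successive periods.
january_over_time = table.loc["2021-01"]

# Reading down a column: different cohorts compared at the same age.
all_cohorts_after_one_month = table[1].dropna()

print(january_over_time)
print(all_cohorts_after_one_month)
```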


### Data

For cohort analysis, we will use the cleaned data from the previous lesson.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.colors as mcolors
import re
import numpy as np

In [None]:
data = pd.read_csv('data/data_cleared.csv')
data.shape

First, we need to transform our data for cohort analysis. For this, we are going to create a customer-month level dataset, that is, each row will summarize one customer's purchases in one month.
For the analysis, we must have two variables: customer IDs and invoice dates.
For behavioral analysis, we also add the count of invoice records, the total quantity of items bought, and the mean `TotalPrice`.

In [None]:
# data.InvoiceDate.dt.month
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
data['PurchaseMonth'] = data['InvoiceDate'].dt.to_period("M")

In [None]:
data.info()

In [None]:
# x = data.groupby(by=['CustomerID', 'PurchaseMonth'], as_index=False)['PurchaseMonth'].count()
# x[x['CustomerID']==12347]

In [None]:
grouped = data.groupby(['CustomerID', 'PurchaseMonth'], as_index=False).agg({'InvoiceNo' : 'count', 
                                                                             'Quantity' : 'sum', 
                                                                             'TotalPrice' : 'mean'})
grouped['CustomerID'] = grouped['CustomerID'].astype('O')
grouped.sort_values(by=['PurchaseMonth', 'CustomerID'], inplace=True)
grouped.head()

In [None]:
grouped.info()

In [None]:
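# Number of customers who purchased in each month (one row per customer-month)
plt.figure(figsize=(8, 6))
# Convert periods to strings so the months are plotted as plain category labels
ax = sns.countplot(x=grouped['PurchaseMonth'].astype(str))
ax.set(title='Number of customers per month',
       xlabel='Period', 
       ylabel='Number of customers');
plt.xticks(rotation=70);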

In [None]:
grouped['FirstPurchaseMonth'] = grouped.groupby('CustomerID')['PurchaseMonth'].transform('min')
print(grouped[grouped['CustomerID']==12347.0])
grouped.head(25)

The main limitation of our dataset is that it has no earlier history. This means we have to treat a customer's first purchase in the dataset as the date they first came to us. In other words, the first purchase we see may not be the customer's actual first purchase. However, it is impossible to verify this without access to the retailer's full historical data.

We then aggregate the data for the month of purchase and the month of the first purchase and count the number of unique customers in each group. In addition, we add Period Number that indicates the number of periods between the month of the cohort and the month of purchase.
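
The period number relies on how pandas subtracts monthly periods: the result is an offset object, and its `.n` attribute holds the number of months. Here is a small sketch on toy values (the dates are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "PurchaseMonth": pd.PeriodIndex(["2010-12", "2011-01", "2011-03"], freq="M"),
    "FirstPurchaseMonth": pd.PeriodIndex(["2010-12", "2010-12", "2010-12"], freq="M"),
})

# Subtracting two monthly Period columns yields offset objects
# (for example <3 * MonthEnds>); .n is the integer number of months.
offsets = toy["PurchaseMonth"] - toy["FirstPurchaseMonth"]
toy["PeriodNumber"] = offsets.apply(lambda offset: offset.n)

print(toy)
```

Period 0 is therefore the cohort's own first month, and period 3 is three calendar months later.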

In [None]:
cohorts = grouped.groupby(['PurchaseMonth', 'FirstPurchaseMonth'], as_index=False).agg({"CustomerID" : 'count'})
print(cohorts['PurchaseMonth'] - cohorts['FirstPurchaseMonth'])
cohorts['PeriodNumber'] = (cohorts['PurchaseMonth'] - cohorts['FirstPurchaseMonth']).apply(lambda i: i.n)
cohorts = cohorts.rename(mapper={'CustomerID': 'CustomersNumber'}, axis='columns')
cohorts.head()

Next, we create a pivot table in which each row contains information about a given cohort and each column contains the values for a certain period.

In [None]:
cohort_pivot = cohorts.pivot_table(index = 'FirstPurchaseMonth',
                                  columns = 'PeriodNumber',
                                  values = 'CustomersNumber')

cohort_pivot

To get the retention matrix, we divide each row's values by that row's first-column value (period 0), which is the size of the cohort: all customers who made their first purchase in a given month.
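
To make the division concrete, here is a sketch on a toy count table with made-up numbers. Passing `axis=0` to `divide` aligns the cohort sizes with the row index, so each cohort is divided by its own period-0 count. Cells for periods that have not happened yet stay `NaN`.

```python
import numpy as np
import pandas as pd

# Toy customer counts: rows are cohorts, columns are period numbers.
counts = pd.DataFrame(
    [[100, 40, 30], [120, 60, np.nan], [90, np.nan, np.nan]],
    index=["2021-01", "2021-02", "2021-03"],
    columns=[0, 1, 2],
)

# Cohort size is the number of customers in period 0 (the first column).
cohort_size = counts.iloc[:, 0]

# axis=0 matches cohort_size to rows, so each row is divided by its own size.
retention = counts.divide(cohort_size, axis=0)

print(retention)
```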

In [None]:
cohort_size = cohort_pivot.iloc[:,0]
retention_matrix = cohort_pivot.divide(cohort_size, axis = 0)
retention_matrix

Finally, we're going to visualize the pivot tables to better understand the current state of customer retention.

In [None]:
plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
sns.heatmap(cohort_pivot, mask=cohort_pivot.isnull(), annot=True, cmap='RdYlGn', fmt='g')
plt.title('Monthly cohorts: number of customers')
plt.xlabel("Period Number")
# Rows are cohorts, i.e. the month of each customer's first purchase
plt.ylabel("Month of First Purchase");

In [None]:
plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
sns.heatmap(retention_matrix, mask=retention_matrix.isnull(), annot=True, cmap='RdYlGn', fmt='.0%')
plt.title('Monthly cohorts: customer retention')
plt.xlabel("Period Number")
# Rows are cohorts, i.e. the month of each customer's first purchase
plt.ylabel("Month of First Purchase");

Great job! For further analysis, we can use other variables to understand customer retention behavior. For example, the average expenses of a given cohort or the number of products purchased for a specific month. Let's look at one of them.

In [None]:
cohorts_behavior = grouped.groupby(['PurchaseMonth', 'FirstPurchaseMonth'], as_index=False).agg({"TotalPrice" : 'mean'})
# Compute the period number from this frame's own columns rather than from `cohorts`
cohorts_behavior['PeriodNumber'] = (cohorts_behavior['PurchaseMonth'] - cohorts_behavior['FirstPurchaseMonth']).apply(lambda i: i.n)
cohorts_behavior = cohorts_behavior.rename(mapper={'TotalPrice': 'AverageSpendings'}, axis='columns')
cohorts_behavior.head()

In [None]:
cohort_pivot2 = cohorts_behavior.pivot_table(index = 'FirstPurchaseMonth',
                                  columns = 'PeriodNumber',
                                  values = 'AverageSpendings')
cohort_pivot2

In [None]:
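# Period 0 holds each cohort's baseline average spending (not a cohort size)
baseline_spending = cohort_pivot2.iloc[:,0]
# Express later periods as a share of the cohort's first-month average spending
spending_percentage = cohort_pivot2.divide(baseline_spending, axis = 0)
spending_percentage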

In [None]:
plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
# Mask the empty cells of the matrix being plotted
sns.heatmap(spending_percentage, mask=spending_percentage.isnull(), annot=True, cmap='RdYlGn', fmt='.0%')
plt.title('Monthly cohorts: average spending relative to the first month')
plt.xlabel("Period Number")
plt.ylabel("Month of First Purchase");