<a href="https://colab.research.google.com/github/gabi-pacheco/HomeSwap/blob/main/HOME_Exchange_Subscriptions_cleaning_and_exploring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SUMMARY

**Original Subscriptions table:**

Sample of subscriptions since 2019 (particular context in 2020/2021).

If a user has subscribed for several years, there is one line per subscription.

If renew = 1, you will find both the subscription line and the renewal line for the following year.

Schema:

* subscription_date
* user_id
* renew: 1 if the user re-subscribed the following year (in the month their subscription expired)
* first_subscription_date
* first_subscription: 1 if it’s the user's first subscription (the longer users have been subscribed, the more they re-subscribe)
* referral: 1 if the user has been sponsored
* promotion: 1 if the user had a promotion for his subscription
* payment3x: 1 if the user has used the 3x payment to subscribe
* payment2: 1 if the user has paid his 2nd payment
* payment3: 1 if the user has paid his 3rd payment
* country: user country
* region: user region
* department: user department
* city: user city

## Process steps

### Step 1
**Subscription to subs_cleaned_v1** - first cleaning


*   Convert subscription_date and first_subscription_date to datetime objects. ✅

*   Inspect country, region, department, and city for missing values and determine an appropriate strategy. ✅

* Ensure binary columns such as renew, first_subscription, referral, promotion, payment3x, payment2, and payment3 are correctly formatted as 0 or 1. ✅

* Validate the subscription dates and ensure they are logically consistent (e.g., subscription_date should not be earlier than first_subscription_date). ✅

* Check for and remove any duplicate rows based on user_id and subscription_date. -> create subscription_id column ✅

* Check and correct names in regions and city ❗

* Drop department column ✅

* Create country_name column ✅


Exported resulting table (subs_cleaned_v1) to BigQuery

### Step 2
**subs_cleaned_v2**

> After merging country_codes with subs_cleaned_v1 (in BigQuery) to get the countries names (instead of only their codes), I added the correct names of regions by merging the table country_lookup to subs_cleaned_v1. After that point, I imported the resulting table (subs_cleaned_v2_correct_regions) back to this notebook.


* Drop doubled country column ✅

* Create new dimension: Cancellation_date ✅

* Create payment_type column: single payment or instalment plan ✅

* Create customer_lifetime ✅

* Create customer segment types ✅

# Subscription to subs_cleaned_v1

## Import and examination

### Import tables

In [None]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
from google.cloud import bigquery

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

In [None]:
project_id = 'subtle-isotope-421312'
client = bigquery.Client(project=project_id)

subscriptions = client.query('''
  SELECT * FROM `subtle-isotope-421312.Home_exchange.subscriptions`''').to_dataframe()

### EXPORT TO BIGQUERY

In [None]:
#pip install pandas-gbq -U

In [None]:
#import pandas_gbq

#to export to bq
#pandas_gbq.to_gbq(subs_almost_done, 'Home_exchange.subs_cleaned_v3', project_id='subtle-isotope-421312')

100%|██████████| 1/1 [00:00<00:00, 2526.69it/s]


### Initial exploration

In [None]:
subscriptions.head()

Unnamed: 0,subscription_date,user_id,renew,first_subscription_date,first_subscription,referral,promotion,payment3x,payment2,payment3,country,region,department,city
0,2021-10-16,246592,0,2018-10-06,0,0,0,0,0,0,FRA,,,Aussonne
1,2020-04-07,1258705,1,2010-06-02,0,0,0,0,0,0,,,,
2,2021-10-19,2219682,1,2014-03-24,0,0,0,0,0,0,,,,
3,2021-08-26,1349069,1,2011-01-27,0,0,0,0,0,0,DNK,,,Ærøskøbing
4,2020-01-02,1418685,1,2012-01-01,0,0,0,0,0,0,,,,


In [None]:
subscriptions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   subscription_date        100000 non-null  dbdate
 1   user_id                  100000 non-null  Int64 
 2   renew                    100000 non-null  Int64 
 3   first_subscription_date  100000 non-null  dbdate
 4   first_subscription       100000 non-null  Int64 
 5   referral                 100000 non-null  Int64 
 6   promotion                100000 non-null  Int64 
 7   payment3x                100000 non-null  Int64 
 8   payment2                 100000 non-null  Int64 
 9   payment3                 100000 non-null  Int64 
 10  country                  96214 non-null   object
 11  region                   93263 non-null   object
 12  department               86825 non-null   object
 13  city                     87526 non-null   object
dtypes: Int64(8), dbdate(2

In [None]:
#How many unique user are there?
unique_users = subscriptions['user_id'].nunique()
unique_users

70726

In [None]:
#How many users renew their subscriptions?
subscriptions.groupby(['renew'])['user_id'].size().sort_values(ascending=False)

renew
1    66339
0    33661
Name: user_id, dtype: int64

In [None]:
#How many subscriptions were there per country? And which countries renewed the most?

per_country = subscriptions.groupby(['country']).agg({"user_id":"count",
                                                      "renew":"sum"}).sort_values(by='user_id', ascending=False).reset_index()

per_country.head(25)

Unnamed: 0,country,user_id,renew
0,FRA,28689,18469
1,USA,16476,11652
2,ESP,12289,8139
3,CAN,5940,3939
4,ITA,4266,2702
5,NLD,3328,2433
6,DEU,3073,2180
7,GBR,2389,1608
8,AUS,2285,1510
9,DNK,1723,1197
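
The raw renewal count above mostly tracks how many subscriptions each country has, so a renewal rate may be a useful complement when asking which countries renewed the most. The sketch below uses a tiny made-up frame (the values are illustrative, not taken from the dataset), and the names `subscriptions`, `renewals` and `renew_rate` are hypothetical.

```python
import pandas as pd

# Toy data standing in for the subscriptions table (values are made up)
toy = pd.DataFrame({
    "country": ["FRA", "FRA", "FRA", "USA", "USA", "ESP"],
    "user_id": [1, 2, 3, 4, 5, 6],
    "renew":   [1, 1, 0, 1, 0, 0],
})

# Subscriptions, renewals and renewal rate per country
per_country_rate = (
    toy.groupby("country")
       .agg(subscriptions=("user_id", "count"), renewals=("renew", "sum"))
       .assign(renew_rate=lambda d: d["renewals"] / d["subscriptions"])
       .sort_values("subscriptions", ascending=False)
       .reset_index()
)
print(per_country_rate)
```

Sorting by `renew_rate` instead of `subscriptions` would rank countries by how likely a subscription is to be renewed, though small countries can then dominate the top of the list.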


In [None]:
fig = px.bar(per_country.head(25), x='country', y='user_id', title="Users per country")
fig.show()

In [None]:
fig = px.bar(per_country.head(25), x='country', y='renew', title="Renews per country")
fig.show()

## Subscription CLEANING

1. Convert subscription_date and first_subscription_date to datetime objects. ✅

2. Inspect country, region, department, and city for missing values and determine an appropriate strategy. ✅

3. Ensure binary columns such as renew, first_subscription, referral, promotion, payment3x, payment2, and payment3 are correctly formatted as 0 or 1. ✅

4. Validate the subscription dates and ensure they are logically consistent (e.g., subscription_date should not be earlier than first_subscription_date). ✅

5. Check for and remove any duplicate rows based on user_id and subscription_date. -> create subscription_id column ✅

6. Check and correct names in regions and city ❗

7. Drop department column ✅

8. Create country_name column ✅

### Column types

Converting columns to correct types

In [None]:
subscriptions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   subscription_date        100000 non-null  dbdate
 1   user_id                  100000 non-null  Int64 
 2   renew                    100000 non-null  Int64 
 3   first_subscription_date  100000 non-null  dbdate
 4   first_subscription       100000 non-null  Int64 
 5   referral                 100000 non-null  Int64 
 6   promotion                100000 non-null  Int64 
 7   payment3x                100000 non-null  Int64 
 8   payment2                 100000 non-null  Int64 
 9   payment3                 100000 non-null  Int64 
 10  country                  96214 non-null   object
 11  region                   93263 non-null   object
 12  department               86825 non-null   object
 13  city                     87526 non-null   object
dtypes: Int64(8), dbdate(2

Converting date columns

In [None]:
#convert date columns to datetime type
#(BigQuery returns date objects, so no day-first format string is needed)
subscriptions['subscription_date'] = pd.to_datetime(subscriptions['subscription_date'], errors='coerce')
subscriptions['first_subscription_date'] = pd.to_datetime(subscriptions['first_subscription_date'], errors='coerce')
subscriptions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   subscription_date        100000 non-null  datetime64[ns]
 1   user_id                  100000 non-null  Int64         
 2   renew                    100000 non-null  Int64         
 3   first_subscription_date  100000 non-null  datetime64[ns]
 4   first_subscription       100000 non-null  Int64         
 5   referral                 100000 non-null  Int64         
 6   promotion                100000 non-null  Int64         
 7   payment3x                100000 non-null  Int64         
 8   payment2                 100000 non-null  Int64         
 9   payment3                 100000 non-null  Int64         
 10  country                  96214 non-null   object        
 11  region                   93263 non-null   object        
 12  department       

Converting id column into object

In [None]:
#subscriptions['user_id'] = subscriptions['user_id'].astype(str)
#subscriptions.info()

### Drop department column

In [None]:
subscriptions.drop('department', axis=1, inplace=True)
subscriptions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   subscription_date        100000 non-null  datetime64[ns]
 1   user_id                  100000 non-null  Int64         
 2   renew                    100000 non-null  Int64         
 3   first_subscription_date  100000 non-null  datetime64[ns]
 4   first_subscription       100000 non-null  Int64         
 5   referral                 100000 non-null  Int64         
 6   promotion                100000 non-null  Int64         
 7   payment3x                100000 non-null  Int64         
 8   payment2                 100000 non-null  Int64         
 9   payment3                 100000 non-null  Int64         
 10  country                  96214 non-null   object        
 11  region                   93263 non-null   object        
 12  city             

### Missing values

In [None]:
#check for null values
missing_values_sub = subscriptions.isnull().sum()

# Get percentage of missing values for each column
missing_percentage_sub = (missing_values_sub / len(subscriptions)) * 100

#Create df with missing values and percentage
missing_data_sub = pd.DataFrame({'Missing Values': missing_values_sub, 'Percentage': missing_percentage_sub})

#Filter only the ones with missing values (remove columns with no missing values)
missing_data_sub = missing_data_sub[missing_data_sub['Missing Values'] > 0].reset_index()

missing_data_sub

Unnamed: 0,index,Missing Values,Percentage
0,country,3786,3.786
1,region,6737,6.737
2,city,12474,12.474


**2. NULL COLUMNS**

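There are two options for nulls in the geographic columns: fill them with 'unknown' or delete the affected rows. The cell below deletes rows with a missing country or region; since these are far less than 30% of the dataset (6.7% at most, for region), the loss is acceptable. Nulls in city are left in place for now.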

In [None]:
subscriptions = subscriptions.dropna(subset=['country', 'region'])

subscriptions.isnull().sum()

subscription_date             0
user_id                       0
renew                         0
first_subscription_date       0
first_subscription            0
referral                      0
promotion                     0
payment3x                     0
payment2                      0
payment3                      0
country                       0
region                        0
city                       8510
dtype: int64
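
To make the two strategies concrete, here is a minimal sketch on a made-up frame (the rows are illustrative only). Strategy B mirrors the `dropna` call in the cell above; Strategy A is the fill-with-'unknown' alternative.

```python
import pandas as pd

# Toy rows: one complete, one fully missing, one without region, one without city
toy = pd.DataFrame({
    "country": ["FRA", None, "USA", "ESP"],
    "region":  ["Occitanie", None, None, "Catalonia"],
    "city":    ["Toulouse", None, "Austin", None],
})

# Strategy A: keep every row and label missing geography as 'unknown'
filled = toy.fillna({"country": "unknown", "region": "unknown", "city": "unknown"})

# Strategy B: drop rows lacking country or region, leave city nulls alone
dropped = toy.dropna(subset=["country", "region"])

print(len(filled), len(dropped))
```

Strategy A preserves the row count for non-geographic analyses, while Strategy B keeps the later region lookup free of placeholder values.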

 ❗❗❗ **FOR SECOND ITERATION** ❗❗❗

If we decide to use cities, come back to this and drop null values in 'city' column.

### 3. Check data consistency

In [None]:
subscriptions['renew'].value_counts()

renew
1    66339
0    33661
Name: count, dtype: Int64

In [None]:
subscriptions['first_subscription'].value_counts()

first_subscription
0    72737
1    27263
Name: count, dtype: Int64

In [None]:
subscriptions['referral'].value_counts()

referral
0    91830
1     8170
Name: count, dtype: Int64

In [None]:
subscriptions['promotion'].value_counts()

promotion
0    97753
1     2247
Name: count, dtype: Int64

In [None]:
subscriptions['payment3x'].value_counts()

payment3x
0    98746
1     1254
Name: count, dtype: Int64

In [None]:
subscriptions['payment2'].value_counts()

payment2
0    98824
1     1176
Name: count, dtype: Int64

In [None]:
subscriptions['payment3'].value_counts()

payment3
0    98869
1     1131
Name: count, dtype: Int64
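
The seven `value_counts()` checks above could also be condensed into a single pass. The following is a sketch on a toy frame with one deliberately invalid value; `binary_cols` and `invalid_counts` are hypothetical names.

```python
import pandas as pd

binary_cols = ["renew", "first_subscription", "referral", "promotion",
               "payment3x", "payment2", "payment3"]

# Toy frame with one deliberately invalid value (2) in 'promotion'
toy = pd.DataFrame({c: pd.array([0, 1, 1], dtype="Int64") for c in binary_cols})
toy.loc[2, "promotion"] = 2

# Count, per column, the values that are neither 0 nor 1
invalid_counts = (~toy[binary_cols].isin([0, 1])).sum()
print(invalid_counts[invalid_counts > 0])
```

An empty result means every flag column contains only 0 and 1.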

### 4. Dates consistency

In [None]:
#check if dates match
subscriptions['date_diff'] =  (subscriptions['subscription_date'] - subscriptions['first_subscription_date']).dt.days

invalid_dates = subscriptions[subscriptions['date_diff'] < 0]

len(invalid_dates)

0

### 5. Create subscription_id and check for duplicates

In [None]:
# create subscription_id column
subscriptions['subscription_id'] = subscriptions['user_id'].astype(str) + subscriptions['subscription_date'].astype(str)

# check for duplicates
duplicates = subscriptions['subscription_id'].duplicated().sum()

duplicates

274

In [None]:
# Identify duplicate subscription_ids
duplicate_entries = subscriptions[subscriptions['subscription_id'].duplicated(keep=False)]

# Display the duplicate entries
duplicate_entries.sort_values(by='user_id')

Unnamed: 0,subscription_date,user_id,renew,first_subscription_date,first_subscription,referral,promotion,payment3x,payment2,payment3,country,region,city,subscription_id
89331,2019-04-24,8716,1,2015-02-25,0,0,0,0,0,0,FRA,Languedoc-Roussillon,Castelnau-Le-Lez,87162019-04-24
89325,2019-04-24,8716,1,2015-02-25,0,0,0,0,0,0,FRA,Languedoc-Roussillon,Castelnau-Le-Lez,87162019-04-24
70015,2019-04-24,22028,1,2019-04-24,1,0,0,0,0,0,FRA,Île-De-France,Paris,220282019-04-24
69713,2019-04-24,22028,1,2019-04-24,1,0,0,0,0,0,FRA,Île-De-France,Paris,220282019-04-24
9819,2019-12-13,55165,0,2018-12-13,0,0,0,0,0,0,URY,Rocha,La Paloma,551652019-12-13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57072,2021-07-08,3658273,0,2021-07-08,1,0,0,0,0,0,USA,Californie,San Francisco,36582732021-07-08
54010,2021-06-07,3733851,0,2021-06-07,1,0,0,0,0,0,USA,California,Yucca Valley,37338512021-06-07
54013,2021-06-07,3733851,0,2021-06-07,1,0,0,0,0,0,USA,California,Yucca Valley,37338512021-06-07
53579,2021-10-31,3855144,0,2021-10-31,1,0,0,0,0,0,USA,California,San Francisco,38551442021-10-31


In [None]:
#drop duplicates
subs_cleaned_v1 = subscriptions.drop_duplicates(subset=['subscription_id'], keep='first')

# Check the result
subs_cleaned_v1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92985 entries, 6737 to 99999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   subscription_date        92985 non-null  datetime64[ns]
 1   user_id                  92985 non-null  Int64         
 2   renew                    92985 non-null  Int64         
 3   first_subscription_date  92985 non-null  datetime64[ns]
 4   first_subscription       92985 non-null  Int64         
 5   referral                 92985 non-null  Int64         
 6   promotion                92985 non-null  Int64         
 7   payment3x                92985 non-null  Int64         
 8   payment2                 92985 non-null  Int64         
 9   payment3                 92985 non-null  Int64         
 10  country                  92985 non-null  object        
 11  region                   92985 non-null  object        
 12  city                     84501 non
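
As an alternative to building a string key first, duplicates can be detected directly on the pair of columns. This is only a sketch on toy rows (the values are made up); the notebook itself keeps `subscription_id` because it is reused later as a primary key.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [8716, 8716, 22028, 22028, 55165],
    "subscription_date": pd.to_datetime(
        ["2019-04-24", "2019-04-24", "2019-04-24", "2020-04-24", "2019-12-13"]),
    "renew": [1, 1, 1, 0, 0],
})

key = ["user_id", "subscription_date"]

# Rows that share a (user_id, subscription_date) pair with an earlier row
n_duplicates = int(toy.duplicated(subset=key).sum())

# Keep the first row of each pair
deduped = toy.drop_duplicates(subset=key, keep="first")
print(n_duplicates, len(deduped))
```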

### Unidecode

In [None]:
#pip install unidecode

COPY THE FOLLOWING CODE TO A CODE CELL TO RUN IT

import re
import unidecode

# Define a function to clean text
def clean_text(text):
    # Leave missing values (None/NaN) untouched
    if not isinstance(text, str):
        return text
    text = unidecode.unidecode(text)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()

# Apply the cleaning function to relevant columns

subs_cleaned['country'] = subs_cleaned['country'].apply(clean_text)
subs_cleaned['country_name'] = subs_cleaned['country_name'].apply(clean_text)
subs_cleaned['region'] = subs_cleaned['region'].apply(clean_text)
subs_cleaned['city'] = subs_cleaned['city'].apply(clean_text)
cities['name'] = cities['name'].apply(clean_text)
cities['state_name'] = cities['state_name'].apply(clean_text)
cities['country_name'] = cities['country_name'].apply(clean_text)

### Export subs_cleaned_v1
to Home_Exchange_DEV in BigQuery

In [None]:
#pip install pandas-gbq -U

In [None]:
#import pandas_gbq

#to export to bq
#pandas_gbq.to_gbq(subs_cleaned_v1, 'Home_Exchange_DEV.subs_cleaned_v1', project_id='subtle-isotope-421312')

100%|██████████| 1/1 [00:00<00:00, 5722.11it/s]


# SUBS_CLEANED_V2

I merged the following table with subs_cleaned_v1 (in BigQuery) to get the countries names (instead of only their codes)

In [None]:
country_names = client.query('''
  SELECT * FROM `subtle-isotope-421312.Home_Exchange_DEV.countries_codes_names`''').to_dataframe()
country_names.head()

Unnamed: 0,geoname_countryname,fc_country,geonames_countrycode,fc_alpha2,fc_alpha3
0,Andorra,Andorra,AD,AD,AND
1,United Arab Emirates,United Arab Emirates (the),AE,AE,ARE
2,Afghanistan,Afghanistan,AF,AF,AFG
3,Antigua and Barbuda,Antigua and Barbuda,AG,AG,ATG
4,Anguilla,Anguilla,AI,AI,AIA


After that, I used the SQL code below (in BigQuery) to add the correct region names by joining the country_lookup table to subs_cleaned_v1. I then imported the resulting table (subs_cleaned_v2_correct_regions) back into this notebook.


In [None]:
#WITH lookp AS (
#  SELECT
#  name
#  , lower(replacement) as region_correct
#  FROM `subtle-isotope-421312.Home_exchange.country_lookup` )

#SELECT
#-- pk --
#  sub.subscription_id
#  -- user info --
#  , sub.user_id
#  , sub.renew
#  -- time-related info --
#  ,
#  sub.subscription_date,
#  sub.first_subscription,
#  sub.first_subscription_date
#  -- payment info --
#  ,
#  sub.referral,
#  sub.promotion,
#  sub.payment3x AS installments,
#  sub.payment2,
#  sub.payment3
#  -- demographics --
#  ,
#  sub.country_name,
#  sub.fc_country,
#  sub.country_code_subs AS country_code,
#  sub.fc_code2 AS country_code_2,
#  lookp.region_correct AS correct_region,
#  sub.city
#FROM `subtle-isotope-421312.Home_Exchange_DEV.subs_cleaned_v1_country_names` sub
#INNER JOIN lookp
#ON lower(sub.region) = lower(lookp.name)

**Fifty countries in subs_cleaned_v1, each with 5 or fewer subscriptions, appear in the list of wrong regions. Those rows have been dropped, which is one of the reasons for the large drop in the number of countries between v1 and v2. However, this has not heavily affected the size of the dataframe.**
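
An inner join discards unmatched rows silently, so it can help to count them first. The sketch below reproduces the join logic in pandas on made-up rows, assuming the lookup table has the `name` and `replacement` columns used in the SQL; `region_key`, `matched` and `unmatched` are hypothetical names.

```python
import pandas as pd

subs = pd.DataFrame({
    "subscription_id": ["a1", "b2", "c3"],
    "region": ["Écosse", "Utah", "Not A Region"],
})
lookup = pd.DataFrame({
    "name": ["écosse", "utah"],
    "replacement": ["Scotland", "Utah"],
})

# Lowercase both join keys, mirroring lower(sub.region) = lower(lookp.name)
subs["region_key"] = subs["region"].str.lower()
lookup["region_key"] = lookup["name"].str.lower()
lookup["region_correct"] = lookup["replacement"].str.lower()

# Left merge + indicator shows which rows an inner join would discard
merged = subs.merge(lookup[["region_key", "region_correct"]],
                    on="region_key", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(len(matched), len(unmatched))
```

Grouping `unmatched` by country would show which countries disappear entirely between v1 and v2.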

In [None]:
subs_cleaned_v1['country'].nunique()

150

In [None]:
subs_cleaned_v2['country_name'].nunique()

## IMPORT subs_cleaned_v2 here

In [None]:
subs_cleaned_v2_correct_regions = client.query('''
  SELECT * FROM `subtle-isotope-421312.Home_Exchange_DEV.subs_cleaned_v2_no_dups`''').to_dataframe()

subs_cleaned_v2_correct_regions.head()

Unnamed: 0,subscription_id,user_id,renew,subscription_date,first_subscription,first_subscription_date,referral,promotion,installments,payment2,payment3,country_name,fc_country,country_code,country_code_2,original_region,correct_region,city,row_number
0,11300392020-03-26,1130039,1,2020-03-26 00:00:00+00:00,0,2018-03-27 00:00:00+00:00,0,0,0,0,0,Mexico,Mexico,MEX,MX,Quintana Roo,quintana roo,,1
1,11316382021-04-24,1131638,1,2021-04-24 00:00:00+00:00,0,2003-10-16 00:00:00+00:00,0,0,0,0,0,Switzerland,Switzerland,CHE,CH,Vaud,vaud,,1
2,11643142019-08-26,1164314,0,2019-08-26 00:00:00+00:00,0,2006-07-24 00:00:00+00:00,0,0,0,0,0,United Kingdom,United Kingdom of Great Britain and Northern I...,GBR,GB,Écosse,scotland,Edimbourg,1
3,11647722020-08-13,1164772,1,2020-08-13 00:00:00+00:00,0,2006-08-03 00:00:00+00:00,0,0,0,0,0,United States,United States of America (the),USA,US,Utah,utah,Salt Lake City,1
4,11718452019-03-18,1171845,1,2019-03-18 00:00:00+00:00,0,2007-01-27 00:00:00+00:00,0,0,0,0,0,United States,United States of America (the),USA,US,Louisiana,louisiana,New Orleans,1


In [None]:
subs_cleaned_v2_correct_regions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88152 entries, 0 to 88151
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   subscription_id          88152 non-null  object             
 1   user_id                  88152 non-null  object             
 2   renew                    88152 non-null  Int64              
 3   subscription_date        88152 non-null  datetime64[us, UTC]
 4   first_subscription       88152 non-null  Int64              
 5   first_subscription_date  88152 non-null  datetime64[us, UTC]
 6   referral                 88152 non-null  Int64              
 7   promotion                88152 non-null  Int64              
 8   installments             88152 non-null  Int64              
 9   payment2                 88152 non-null  Int64              
 10  payment3                 88152 non-null  Int64              
 11  country_name             881

In [None]:
# fill null names in country_name with the respective name in fc_country

subs_cleaned_v2_correct_regions['country_name'] = subs_cleaned_v2_correct_regions['country_name'].fillna(subs_cleaned_v2_correct_regions['fc_country'])

subs_cleaned_v2_correct_regions['country_name'].isnull().sum()

0

## Further cleaning

1. Drop doubled country column ✅

2. Create new dimension: Cancellation_date ✅

3. Create payment_type column: single payment or instalment plan ✅

4. Create customer_lifetime ✅

5. Create customer segment types ✅

Dropping the duplicated country name column (fc_country) and the row_number helper column

In [None]:
subs_cleaned_v2_correct_regions = subs_cleaned_v2_correct_regions.drop(['fc_country', 'row_number'], axis=1)

subs_cleaned_v2_correct_regions.head()

Unnamed: 0,subscription_id,user_id,renew,subscription_date,first_subscription,first_subscription_date,referral,promotion,installments,payment2,payment3,country_name,country_code,country_code_2,original_region,correct_region,city
0,11300392020-03-26,1130039,1,2020-03-26 00:00:00+00:00,0,2018-03-27 00:00:00+00:00,0,0,0,0,0,Mexico,MEX,MX,Quintana Roo,quintana roo,
1,11316382021-04-24,1131638,1,2021-04-24 00:00:00+00:00,0,2003-10-16 00:00:00+00:00,0,0,0,0,0,Switzerland,CHE,CH,Vaud,vaud,
2,11643142019-08-26,1164314,0,2019-08-26 00:00:00+00:00,0,2006-07-24 00:00:00+00:00,0,0,0,0,0,United Kingdom,GBR,GB,Écosse,scotland,Edimbourg
3,11647722020-08-13,1164772,1,2020-08-13 00:00:00+00:00,0,2006-08-03 00:00:00+00:00,0,0,0,0,0,United States,USA,US,Utah,utah,Salt Lake City
4,11718452019-03-18,1171845,1,2019-03-18 00:00:00+00:00,0,2007-01-27 00:00:00+00:00,0,0,0,0,0,United States,USA,US,Louisiana,louisiana,New Orleans


### Create cancellation_date

In [None]:
from dateutil.relativedelta import relativedelta
#work on a copy under the shorter name subs_cleaned_v2
subs_cleaned_v2 = subs_cleaned_v2_correct_regions.copy()

#create new column
subs_cleaned_v2['cancellation_date'] = None

#a subscription that is not renewed ends one year after its subscription_date
for index, row in subs_cleaned_v2.iterrows():
    if row['renew'] == 0:
        new_cancel_date = row['subscription_date'] + relativedelta(years=1)
        subs_cleaned_v2.at[index, 'cancellation_date'] = new_cancel_date

#convert cancellation_date to datetime
subs_cleaned_v2['cancellation_date'] = pd.to_datetime(subs_cleaned_v2['cancellation_date'])

subs_cleaned_v2.head()

Unnamed: 0,subscription_id,user_id,renew,subscription_date,first_subscription,first_subscription_date,referral,promotion,installments,payment2,payment3,country_name,country_code,country_code_2,original_region,correct_region,city,cancellation_date
0,11300392020-03-26,1130039,1,2020-03-26 00:00:00+00:00,0,2018-03-27 00:00:00+00:00,0,0,0,0,0,Mexico,MEX,MX,Quintana Roo,quintana roo,,NaT
1,11316382021-04-24,1131638,1,2021-04-24 00:00:00+00:00,0,2003-10-16 00:00:00+00:00,0,0,0,0,0,Switzerland,CHE,CH,Vaud,vaud,,NaT
2,11643142019-08-26,1164314,0,2019-08-26 00:00:00+00:00,0,2006-07-24 00:00:00+00:00,0,0,0,0,0,United Kingdom,GBR,GB,Écosse,scotland,Edimbourg,2020-08-26 00:00:00+00:00
3,11647722020-08-13,1164772,1,2020-08-13 00:00:00+00:00,0,2006-08-03 00:00:00+00:00,0,0,0,0,0,United States,USA,US,Utah,utah,Salt Lake City,NaT
4,11718452019-03-18,1171845,1,2019-03-18 00:00:00+00:00,0,2007-01-27 00:00:00+00:00,0,0,0,0,0,United States,USA,US,Louisiana,louisiana,New Orleans,NaT
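
The row-by-row loop can be slow on roughly 88,000 rows. A vectorized variant might look like the sketch below, shown on toy data (values are made up); `not_renewed` is a hypothetical name, and `pd.DateOffset(years=1)` is swapped in for `relativedelta(years=1)`.

```python
import pandas as pd

toy = pd.DataFrame({
    "renew": pd.array([1, 0, 0], dtype="Int64"),
    "subscription_date": pd.to_datetime(
        ["2020-03-26", "2019-08-26", "2020-02-29"], utc=True),
})

# One year after the subscription date, kept only where renew == 0
not_renewed = toy["renew"].eq(0).astype(bool)
toy["cancellation_date"] = (
    toy["subscription_date"] + pd.DateOffset(years=1)
).where(not_renewed)
print(toy)
```

Renewed rows get `NaT`, and a subscription that starts on 29 February is given a cancellation date of 28 February the following year.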


### Clean date columns

In [None]:
subs_cleaned_v2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88152 entries, 0 to 88151
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   subscription_id          88152 non-null  object             
 1   user_id                  88152 non-null  object             
 2   renew                    88152 non-null  Int64              
 3   subscription_date        88152 non-null  datetime64[us, UTC]
 4   first_subscription       88152 non-null  Int64              
 5   first_subscription_date  88152 non-null  datetime64[us, UTC]
 6   referral                 88152 non-null  Int64              
 7   promotion                88152 non-null  Int64              
 8   installments             88152 non-null  Int64              
 9   payment2                 88152 non-null  Int64              
 10  payment3                 88152 non-null  Int64              
 11  country_name             881

In [None]:
subs_cleaned_v2['cancellation_date'] = pd.to_datetime(subs_cleaned_v2['cancellation_date'].dt.date)
subs_cleaned_v2['subscription_date'] = pd.to_datetime(subs_cleaned_v2['subscription_date'].dt.date)
subs_cleaned_v2['first_subscription_date'] = pd.to_datetime(subs_cleaned_v2['first_subscription_date'].dt.date)

subs_cleaned_v2.head()

Unnamed: 0,subscription_id,user_id,renew,subscription_date,first_subscription,first_subscription_date,referral,promotion,installments,payment2,payment3,country_name,country_code,country_code_2,original_region,correct_region,city,cancellation_date
0,11300392020-03-26,1130039,1,2020-03-26,0,2018-03-27,0,0,0,0,0,Mexico,MEX,MX,Quintana Roo,quintana roo,,NaT
1,11316382021-04-24,1131638,1,2021-04-24,0,2003-10-16,0,0,0,0,0,Switzerland,CHE,CH,Vaud,vaud,,NaT
2,11643142019-08-26,1164314,0,2019-08-26,0,2006-07-24,0,0,0,0,0,United Kingdom,GBR,GB,Écosse,scotland,Edimbourg,2020-08-26
3,11647722020-08-13,1164772,1,2020-08-13,0,2006-08-03,0,0,0,0,0,United States,USA,US,Utah,utah,Salt Lake City,NaT
4,11718452019-03-18,1171845,1,2019-03-18,0,2007-01-27,0,0,0,0,0,United States,USA,US,Louisiana,louisiana,New Orleans,NaT


### Create payment_type column

In [None]:
subs_cleaned_v2['payment_type'] = np.where(subs_cleaned_v2['installments'] == 1, 'instalment', 'single')

subs_cleaned_v2['payment_type'].value_counts(normalize=True).reset_index()

Unnamed: 0,payment_type,proportion
0,single,0.98709
1,instalment,0.01291


### Customer Lifetime

CREATE CUSTOMER LIFETIME COLUMN

In [None]:
#create a new column that indicates the time difference in years from first_subscription date to latest subscription_date
subs_cleaned_v2['customer_lifetime'] = (subs_cleaned_v2['subscription_date'] - subs_cleaned_v2['first_subscription_date']).dt.days / 365

#cast it as int
subs_cleaned_v2['customer_lifetime'] = subs_cleaned_v2['customer_lifetime'].astype(int)

subs_cleaned_v2['customer_lifetime'].describe()

count    88152.000000
mean         2.904007
std          3.480094
min          0.000000
25%          0.000000
50%          2.000000
75%          4.000000
max         25.000000
Name: customer_lifetime, dtype: float64

In [None]:
# 'customer_lifetime' column in a histogram to see the distribution of values

fig = px.histogram(subs_cleaned_v2, x="customer_lifetime", nbins=10, title="Distribution of Customer Lifetime")
fig.show()


### Create customer segment column

In [None]:
# segment subs_cleaned['customer_lifetime']

def segment_customer_lifetime(lifetime):
  if lifetime < 1:
    return 'new_customer'
  elif lifetime < 3:
    return 'frequent_user'
  elif lifetime < 5:
    return 'seasoned_customer'
  elif lifetime < 7:
    return 'very_seasoned_customer'
  else:
    return 'old_timer'

subs_cleaned_v2['customer_segment'] = subs_cleaned_v2['customer_lifetime'].apply(segment_customer_lifetime)

# Count the number of customers in each segment
segment_counts = subs_cleaned_v2['customer_segment'].value_counts()

segment_counts

customer_segment
frequent_user             27736
new_customer              26826
old_timer                 13795
seasoned_customer         11571
very_seasoned_customer     8224
Name: count, dtype: int64
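
The same thresholds can be expressed with `pd.cut`, which avoids calling a Python function per row. This is a sketch on a toy series, assuming lifetimes are non-negative whole years as produced above; `bins` and `labels` are hypothetical names.

```python
import numpy as np
import pandas as pd

lifetime = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 25], name="customer_lifetime")

# Left-closed bins: [0,1), [1,3), [3,5), [5,7), [7, inf)
bins = [0, 1, 3, 5, 7, np.inf]
labels = ["new_customer", "frequent_user", "seasoned_customer",
          "very_seasoned_customer", "old_timer"]
segment = pd.cut(lifetime, bins=bins, labels=labels, right=False)
print(segment.value_counts().sort_index())
```

One difference from the function above: a negative lifetime would become missing here rather than `new_customer`, which may be preferable since it flags inconsistent dates.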

1. create dimension to classify type of payments: single payment or instalment plan ✅
        
2. create metric indicating impact of promotions on subscription rates ✅

3. create metric indicating impact of referrals ✅

4. Create customer_lifetime column and segment ✅

5. create is_churner metric ❗

## Churn

### First churners

**How can I determine churners?**

If you consider that churners **could not have had a previous subscription**, then the code below creates a first_churner column that flags only first-time churners.

In [None]:
#first churners are strictly new clients that don't renew

def determine_new_churner(row):
    # 1 only when this is the user's first subscription and it was not renewed
    if row['first_subscription'] == 1 and row['renew'] == 0:
        return 1
    return 0

# Apply the function to each row
subs_cleaned_v2['first_churner'] = subs_cleaned_v2.apply(determine_new_churner, axis=1)

#check column
subs_cleaned_v2['first_churner'].value_counts()

first_churner
0    76740
1    11412
Name: count, dtype: int64

In [None]:
#First-timers churn rate (first_churner equals "not renewed" for first subscriptions)
new_clients_churn_rate = subs_cleaned_v2[subs_cleaned_v2['first_subscription'] == 1]['first_churner'].mean()
print(f"New clients churn rate: {new_clients_churn_rate:.2%}")

New clients churn rate: 45.28%


### Overall churners

❗

If I consider that churners are people who don't renew, **regardless of whether this is their first or a later subscription**, then the code below applies.

In [None]:
#create is_churner
subs_cleaned_v2['is_churner'] = np.where(subs_cleaned_v2['renew'] == 0, 1, 0)

How many users are churners? That is, users that are not necessarily on their first subscription and that are not renewing their current subscription.

In [None]:
subs_cleaned_v2['is_churner'].sum()

29341

How many users on their **first** subscription are not renewing?

In [None]:
subs_cleaned_v2['first_churner'].sum()

11412
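
Both churn definitions can be written as vectorized boolean expressions. The sketch below uses a toy frame (values are made up) and reuses the column names from above; `first_time_churn_rate` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame({
    "first_subscription": [1, 1, 0, 0, 1],
    "renew":              [0, 1, 0, 1, 0],
})

# Overall churner: the subscription was not renewed
toy["is_churner"] = (toy["renew"] == 0).astype(int)

# First churner: not renewed AND it was the user's first subscription
toy["first_churner"] = (
    (toy["first_subscription"] == 1) & (toy["renew"] == 0)
).astype(int)

# Churn rate among first-time subscriptions
first_time_churn_rate = toy.loc[toy["first_subscription"] == 1, "is_churner"].mean()
print(f"{first_time_churn_rate:.2%}")
```

Within first subscriptions the two flags coincide, so either column gives the same first-time churn rate.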

# Export subs_cleaned_v2 back to BigQuery

After V1 cleaning, on v2 the following steps have been done:

1. Created cancellation_date

2. Formatted date columns as dates only

3. Created payment_type column

4. Created customer_lifetime column (number years since first subscription)

5. Created customer segmentation based on number of years since 1st subscription

6. Defined first_churners = users that cancelled their subscription after their first year

7. Defined overall churners = users that cancelled their subscription regardless of their customer lifetime


Following this, the resulting table (subscriptions_v2) was exported to BigQuery and there, shared with the rest of the team.

In [None]:
#import pandas_gbq

#to export to bq
#pandas_gbq.to_gbq(subs_cleaned_v2, 'Home_Exchange_DEV.subscriptions_v2_no_dups', project_id='subtle-isotope-421312')

100%|██████████| 1/1 [00:00<00:00, 731.22it/s]




---



# SUBSCRIPTIONS_v2 EDA ⏰

### Import table

In [None]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
from google.cloud import bigquery

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

In [None]:
project_id = 'subtle-isotope-421312'
client = bigquery.Client(project=project_id)

subscriptions_v2 = client.query('''
  SELECT * FROM `subtle-isotope-421312.Home_Exchange_DEV.subscriptions_v2_no_dups`''').to_dataframe()

subscriptions_v2.head()

Unnamed: 0,subscription_id,user_id,renew,subscription_date,first_subscription,first_subscription_date,referral,promotion,installments,payment2,...,country_code_2,original_region,correct_region,city,cancellation_date,payment_type,customer_lifetime,customer_segment,first_churner,is_churner
0,29823062019-01-23,2982306,0,2019-01-23 00:00:00+00:00,0,2018-10-16 00:00:00+00:00,0,0,0,0,...,CH,Uri,uri,Bürglen,2020-01-23 00:00:00+00:00,single,0,new_customer,0,1
1,26406932021-01-14,2640693,0,2021-01-14 00:00:00+00:00,0,2016-07-23 00:00:00+00:00,0,0,0,0,...,CH,Uri,uri,Altdorf,2022-01-14 00:00:00+00:00,single,4,seasoned_customer,0,1
2,11534702020-09-29,1153470,1,2020-09-29 00:00:00+00:00,0,2012-06-28 00:00:00+00:00,0,0,0,0,...,CH,Uri,uri,Altdorf,NaT,single,8,old_timer,0,0
3,11534702021-09-29,1153470,1,2021-09-29 00:00:00+00:00,0,2012-06-28 00:00:00+00:00,0,0,0,0,...,CH,Uri,uri,Altdorf,NaT,single,9,old_timer,0,0
4,30449372021-07-02,3044937,1,2021-07-02 00:00:00+00:00,0,2020-07-02 00:00:00+00:00,0,0,0,0,...,CH,Zug,zug,Cham,NaT,single,1,frequent_user,0,0


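Date columns still come back with a time and a time zone. Why? ❓ A likely reason is that they are stored as TIMESTAMP in BigQuery, which `to_dataframe()` returns as timezone-aware UTC datetimes, so they are converted back to plain dates here.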

In [None]:
subscriptions_v2['cancellation_date'] = pd.to_datetime(subscriptions_v2['cancellation_date'].dt.date)
subscriptions_v2['subscription_date'] = pd.to_datetime(subscriptions_v2['subscription_date'].dt.date)
subscriptions_v2['first_subscription_date'] = pd.to_datetime(subscriptions_v2['first_subscription_date'].dt.date)

subscriptions_v2.head()

Unnamed: 0,subscription_id,user_id,renew,subscription_date,first_subscription,first_subscription_date,referral,promotion,installments,payment2,...,country_code_2,original_region,correct_region,city,cancellation_date,payment_type,customer_lifetime,customer_segment,first_churner,is_churner
0,29823062019-01-23,2982306,0,2019-01-23,0,2018-10-16,0,0,0,0,...,CH,Uri,uri,Bürglen,2020-01-23,single,0,new_customer,0,1
1,26406932021-01-14,2640693,0,2021-01-14,0,2016-07-23,0,0,0,0,...,CH,Uri,uri,Altdorf,2022-01-14,single,4,seasoned_customer,0,1
2,11534702020-09-29,1153470,1,2020-09-29,0,2012-06-28,0,0,0,0,...,CH,Uri,uri,Altdorf,NaT,single,8,old_timer,0,0
3,11534702021-09-29,1153470,1,2021-09-29,0,2012-06-28,0,0,0,0,...,CH,Uri,uri,Altdorf,NaT,single,9,old_timer,0,0
4,30449372021-07-02,3044937,1,2021-07-02,0,2020-07-02,0,0,0,0,...,CH,Zug,zug,Cham,NaT,single,1,frequent_user,0,0
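
An alternative way to drop the time zone is to stay in datetime dtype the whole time instead of going through Python `date` objects. The sketch below uses a made-up timezone-aware column, assumed to resemble what the query returned; `ts` and `dates_only` are illustrative names.

```python
import pandas as pd

# Toy timezone-aware column, including a missing value
ts = pd.Series(pd.to_datetime(
    ["2019-01-23 00:00:00+00:00", "2021-07-02 00:00:00+00:00", None],
    utc=True,
))

# Drop the time zone, then reset the time part to midnight
dates_only = ts.dt.tz_localize(None).dt.normalize()
print(dates_only)
```

Missing values stay `NaT`, so the same pattern should also work for a column like `cancellation_date`.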


In [None]:
subscriptions_v2['subscription_id'].duplicated().sum()

0

### Users

The table has about 90K values for user_id (89,718), but only 62,553 of them are unique. This shows that each row is a separate subscription: if a client renews, they get a new row for that specific subscription.

In [None]:
print(subscriptions_v2['user_id'].count())
print(subscriptions_v2['user_id'].nunique())

89718
62553
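
To see how the gap between rows and unique users comes about, one can count the rows per user and then look at how many users have one, two or three rows. A minimal sketch on made-up ids (the names `subs_per_user` and `distribution` are illustrative):

```python
import pandas as pd

# Toy data (made up): one row per subscription
toy = pd.DataFrame({"user_id": [10, 10, 11, 12, 12, 12]})

n_rows = toy["user_id"].count()      # subscription rows
n_users = toy["user_id"].nunique()   # distinct users

# Rows per user, then how many users have 1, 2, 3... rows
subs_per_user = toy["user_id"].value_counts()
distribution = subs_per_user.value_counts().sort_index()

print(n_rows, n_users)
print(distribution)
```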


**USERS PER CITY**

In [None]:
users_per_city = subscriptions_v2.groupby(['country_name', 'correct_region'])['city'].value_counts().reset_index()
users_per_city = users_per_city.sort_values(by=['count'], ascending=False)
users_per_city.head(25)

Unnamed: 0,country_name,correct_region,city,count
7663,France,île-de-france,Paris,3162
10054,Netherlands,north holland,Amsterdam,1443
11634,Spain,catalonia,Barcelona,1393
8271,Germany,berlin,Berlin,1370
1500,Canada,quebec,Montréal,757
15014,United States,new york,New York,748
13578,United States,california,San Francisco,588
12196,Spain,madrid,Madrid,578
7187,France,provence-alpes-côte d'azur,Marseille,552
8919,Italy,lazio,Roma,488


## Subscriptions per month and countries

### Subscriptions per month

In [None]:
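# Aggregate subscriptions per calendar month (month-end bins); user_id "size" is the number of subscription rows
# Note: pandas 2.2+ prefers the alias 'ME' over 'M' for month-end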
subs_grouped = subscriptions_v2.resample('M', on='subscription_date').agg({"user_id":"size",
                                                                       "renew":"sum",
                                                                       "first_subscription":"sum",
                                                                       "promotion":"sum",
                                                                       "is_churner":"sum"
                                                                       }).reset_index()

subs_grouped

Unnamed: 0,subscription_date,user_id,renew,first_subscription,promotion,is_churner
0,2019-01-31,2997,1962,744,0,1035
1,2019-02-28,2416,1682,766,0,734
2,2019-03-31,2557,1587,924,0,970
3,2019-04-30,2538,1397,892,0,1141
4,2019-05-31,2258,1205,936,0,1053
5,2019-06-30,2279,1395,884,0,884
6,2019-07-31,2321,1418,980,0,903
7,2019-08-31,2335,1430,755,0,905
8,2019-09-30,2168,1392,684,0,776
9,2019-10-31,2563,1623,746,0,940


In [None]:
fig = px.line(subs_grouped, x='subscription_date', y='user_id', title='Subscriptions per month')
fig.show()

### Subscriptions per country

In [None]:
#How many subscriptions per country
subs_country = subscriptions_v2.groupby(['country_name']).agg({"subscription_id":"count",
                                                      "user_id":"nunique",
                                                      "renew":"sum",
                                                      "first_subscription":"sum",
                                                      "promotion":"sum",
                                                      "referral":"sum",
                                                      "is_churner":"sum",
                                                      "customer_lifetime":"mean"}).reset_index()

subs_country = subs_country.sort_values(['subscription_id'], ascending=[False])

subs_country.head(25)

Unnamed: 0,country_name,subscription_id,user_id,renew,first_subscription,promotion,referral,is_churner,customer_lifetime
24,France,28487,20721,18333,10674,709,3795,10154,2.076947
61,United States,16380,11471,11588,3378,513,671,4792,4.08547
54,Spain,12166,8674,8053,3990,280,1347,4113,2.170804
13,Canada,5925,4112,3926,1317,150,450,1999,3.287089
35,Italy,3939,2766,2492,1078,77,205,1447,3.15867
42,Netherlands,3311,2318,2420,767,64,191,891,3.464814
26,Germany,3049,2111,2160,801,45,215,889,2.687439
60,United Kingdom,2318,1592,1559,419,33,95,759,3.858067
4,Australia,2276,1537,1505,326,42,68,771,4.281195
9,Belgium,1333,935,968,397,18,155,365,2.627157


In [None]:
fig = px.bar(subs_country.head(25), x='country_name', y='subscription_id', title="Subscriptions per country")
fig.show()

### Subscriptions per country per month

In [None]:
#How many subscriptions per country and month
subs_country_month = subscriptions_v2.groupby(['country_name']).resample('M', on='subscription_date').agg({"subscription_id":"count",
                                                                                           "user_id":"nunique",
                                                                                           "renew":"sum",
                                                                                           "first_subscription":"sum",
                                                                                           "promotion":"sum",
                                                                                           "is_churner":"sum",
                                                                                           "customer_lifetime":"mean"}).reset_index()

subs_country_month = subs_country_month.sort_values(['subscription_id', 'subscription_date'], ascending=[False,True])

subs_country_month.head(25)

Unnamed: 0,country_name,subscription_date,subscription_id,user_id,renew,first_subscription,promotion,is_churner,customer_lifetime
627,France,2021-06-30,1421,1419,984,462,0,437,1.872625
628,France,2021-07-31,1282,1280,848,454,160,434,1.732449
615,France,2020-06-30,1249,1248,810,589,69,439,1.61249
608,France,2019-11-30,1225,1217,587,577,0,638,1.663673
611,France,2020-02-29,1189,1182,719,596,0,470,1.678722
610,France,2020-01-31,1185,1185,722,554,0,463,1.869198
622,France,2021-01-31,1096,1096,810,116,32,286,2.532847
631,France,2021-10-31,1088,1088,767,364,84,321,2.330882
623,France,2021-02-28,1087,1087,773,188,112,314,2.486661
626,France,2021-05-31,1072,1072,723,432,28,349,2.173507


### Subscriptions per country and region

In [None]:
#How many subscriptions per country and region
subs_country_region = subscriptions_v2.groupby(['country_name', 'correct_region']).agg({"subscription_id":"count",
                                                      "user_id":"nunique",
                                                      "renew":"sum",
                                                      "first_subscription":"sum",
                                                      "promotion":"sum",
                                                      "referral":"sum",
                                                      "is_churner":"sum",
                                                      "customer_lifetime":"mean"
                                                      }).reset_index()

subs_country_region = subs_country_region.sort_values(['subscription_id'], ascending=[False])

subs_country_region.head(25)

Unnamed: 0,country_name,correct_region,subscription_id,user_id,renew,first_subscription,promotion,referral,is_churner,customer_lifetime
133,France,île-de-france,5106,3796,3010,1826,113,622,2096,2.593028
355,United States,california,4714,3290,3344,869,160,193,1370,4.436996
119,France,auvergne-rhône-alpes,3852,2806,2538,1557,110,542,1314,1.87513
300,Spain,catalonia,3713,2634,2453,1161,80,417,1260,2.270401
132,France,provence-alpes-côte d'azur,3690,2703,2380,1330,100,363,1310,2.196477
130,France,occitanie,3566,2586,2301,1313,83,460,1265,1.952328
84,Canada,quebec,3397,2341,2217,828,84,327,1180,2.902561
121,France,brittany,3324,2389,2187,1179,87,483,1137,2.073406
129,France,nouvelle-aquitaine,3290,2366,2194,1260,72,417,1096,1.889666
291,Spain,andalusia,2220,1598,1425,745,59,240,795,2.196847


## Churn Metrics

### Overall churn rate

In [None]:
# Overall churn rate
churn_rate = subscriptions_v2['is_churner'].mean()
print(f"Overall churn rate: {churn_rate:.2%}")


Overall churn rate: 33.28%


### Churn per month

In [None]:
# user_id holds the number of subscription rows in the month, so this is churn per subscription
subs_grouped['churn_rate'] = subs_grouped['is_churner'] / subs_grouped['user_id']
subs_grouped.head()

Unnamed: 0,subscription_date,user_id,renew,first_subscription,promotion,is_churner,churn_rate
0,2019-01-31,2997,1962,744,0,1035,0.345345
1,2019-02-28,2416,1682,766,0,734,0.303808
2,2019-03-31,2557,1587,924,0,970,0.379351
3,2019-04-30,2538,1397,892,0,1141,0.449567
4,2019-05-31,2258,1205,936,0,1053,0.466342
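
The monthly churn rate is simply the number of churned subscriptions divided by the number of subscriptions that started in that month. The sketch below reproduces that on made-up rows. It groups by `dt.to_period("M")` instead of `resample`, which is a swapped-in technique chosen here to avoid the month-end alias, not what the notebook does.

```python
import pandas as pd

# Toy data (made up): one row per subscription
toy = pd.DataFrame({
    "subscription_date": pd.to_datetime(
        ["2019-01-05", "2019-01-20", "2019-01-28", "2019-02-03", "2019-02-14"]
    ),
    "is_churner": [1, 0, 0, 1, 1],
})

# Count subscriptions and churners per calendar month
monthly = (
    toy.groupby(toy["subscription_date"].dt.to_period("M"))
       .agg(subscriptions=("is_churner", "size"), churners=("is_churner", "sum"))
       .reset_index()
)

# Churn rate = churned subscriptions / all subscriptions in the month
monthly["churn_rate"] = monthly["churners"] / monthly["subscriptions"]
print(monthly)
```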


In [None]:
fig = px.line(subs_grouped, x='subscription_date', y='churn_rate', title='Churn rate per month')
fig.show()

In [None]:
fig = px.line(subs_grouped, x='subscription_date', y='is_churner', hover_data='churn_rate', title='NB of churners per month')
fig.show()

### Churn per country

In [None]:
# Check churn rate per country
churn_per_country = subscriptions_v2.groupby(['country_name']).agg(nb_churners=('is_churner', 'sum'),
                                                          churn_rate=('is_churner', 'mean')).sort_values(by=['nb_churners', 'churn_rate'], ascending=False).reset_index()

churn_per_country.head(20)

Unnamed: 0,country_name,nb_churners,churn_rate
0,France,10154,0.356443
1,United States,4792,0.292552
2,Spain,4113,0.338073
3,Canada,1999,0.337384
4,Italy,1447,0.367352
5,Netherlands,891,0.269103
6,Germany,889,0.291571
7,Australia,771,0.338752
8,United Kingdom,759,0.327437
9,Belgium,365,0.273818


In [None]:
import plotly.express as px

fig = px.bar(churn_per_country.head(20), x='country_name', y='nb_churners', hover_data='churn_rate', title='Nb of churners per country')
fig.show()

**Average churn rate per country**

In [None]:
# distribution of churners per country

import plotly.express as px

# Create the churn per country dataframe
churn_per_country = subscriptions_v2.groupby(['country_name']).agg(nb_churners=('is_churner', 'sum'),
                                                          churn_rate=('is_churner', 'mean')).sort_values(by=['nb_churners', 'churn_rate'], ascending=False).reset_index()

# Create the bar chart
fig = px.bar(churn_per_country.head(25), x='country_name', y='churn_rate', hover_data='nb_churners', title='Average churn rate per country')

# Show the chart
fig.show()


### Churn EDA

In [None]:
subs_country.describe()

Unnamed: 0,subscription_id,user_id,renew,first_subscription,promotion,referral,is_churner,customer_lifetime
count,62.0,62.0,62.0,62.0,62.0,62.0,62.0,62.0
mean,1421.806452,1008.919355,948.564516,406.532258,33.483871,123.725806,473.241935,2.695173
std,4418.279488,3176.965948,2918.290678,1492.233069,115.503991,513.722996,1507.772439,1.632875
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,4.0,3.0,2.0,0.0,0.0,2.0,1.6
50%,80.5,59.5,43.0,21.0,2.0,4.5,29.0,2.627371
75%,557.25,382.25,366.75,133.0,8.0,26.0,209.5,3.694888
max,28487.0,20721.0,18333.0,10674.0,709.0,3795.0,10154.0,7.0


**Define Churn rate per country**

In [None]:
subs_country['CR_subscription'] = subs_country['is_churner'] / subs_country['subscription_id']

**Is churn per user different than churn per subscription?**

In [None]:
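# Is churn per user different than churn per subscription?
# Note: is_churner counts churned subscription rows, so this ratio can exceed 1 when a user churns more than once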
subs_country['CR_user'] = subs_country['is_churner'] / subs_country['user_id']

churn_per_country = subs_country[['country_name', 'CR_user', 'CR_subscription']].sort_values(by=['CR_user', 'CR_subscription'], ascending=False)

churn_per_country.head(25)

Unnamed: 0,country_name,CR_user,CR_subscription
57,Tunisia,1.0,1.0
49,Russian Federation,1.0,1.0
10,Bolivia,1.0,1.0
37,Jordan,1.0,1.0
7,Bahrain,1.0,1.0
41,Morocco,1.0,0.6
11,Bosnia and Herzegovina,1.0,0.571429
58,Turkey,0.839286,0.573171
16,Colombia,0.833333,0.5
15,Chile,0.75,0.642857


In [None]:
churn_per_country.info()

<class 'pandas.core.frame.DataFrame'>
Index: 62 entries, 57 to 14
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country_name     62 non-null     object 
 1   CR_user          62 non-null     Float64
 2   CR_subscription  62 non-null     Float64
dtypes: Float64(2), object(1)
memory usage: 2.1+ KB


In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=2, subplot_titles=("CR_user", "CR_subscription"))

# Add traces for each histogram
fig.add_trace(go.Histogram(x=subs_country["CR_user"]), row=1, col=1)
fig.add_trace(go.Histogram(x=subs_country["CR_subscription"]), row=1, col=2)

# Configure axes and titles
fig.update_xaxes(title_text="Churn Rate", row=1, col=1)
fig.update_xaxes(title_text="Churn Rate", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=2)

# Show the plot
fig.show()


## Renew rates

1. Check overall renewal rate

2. Check the impact of promotions on subscription renewal

3. Check the impact of referrals on subscription renewal

4. Check renewal rate for first time subscribers

5. Check renewal rates per customer segments

6. Check impact of payment types on renewal rate

### Overall renew rate

In [None]:
renew_rate = subscriptions_v2['renew'].mean()
print(f"Renew rate: {renew_rate:.2%}")

Renew rate: 66.72%


In [None]:
#What's the percentage of users in general that renewed their subscriptions?
renewals = subscriptions_v2['renew'].value_counts(normalize=True).reset_index()

fig = px.bar(renewals, x='renew', y='proportion', title="What's the percentage of users in general that renewed their subscriptions?")
fig.show()

### Promo renew rate

In [None]:
promo_renew_rate = subscriptions_v2[subscriptions_v2['promotion'] == 1]['renew'].mean()
print(f"Promo renew rate: {promo_renew_rate:.2%}")

Promo renew rate: 61.03%
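
A single rate is easier to read next to its counterpart. Grouping by the flag gives the renew rate with and without a promotion in one step, together with the group sizes. A minimal sketch on made-up rows (the same pattern would apply to `referral` or `payment_type`):

```python
import pandas as pd

# Toy data (made up): one row per subscription
toy = pd.DataFrame({
    "promotion": [1, 1, 1, 0, 0, 0, 0, 1],
    "renew":     [1, 0, 1, 1, 1, 0, 1, 0],
})

# Renew rate and number of subscriptions for each value of the flag
rates = toy.groupby("promotion")["renew"].agg(renew_rate="mean", n="count")
print(rates)
```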


In [None]:
# Amongst the clients that subscribed with a promotion, how many renewed?
promo_renew = subscriptions_v2[subscriptions_v2['promotion'] == 1]['renew'].value_counts(normalize=True).reset_index()
promo_renew

fig = px.bar(promo_renew, x='renew', y='proportion', title="Amongst the clients that subscribed with a promotion, how many renewed?")
fig.show()

### Referral renew rate

In [None]:
referral_renew_rate = subscriptions_v2[subscriptions_v2['referral'] == 1]['renew'].mean()
print(f"Referral renew rate: {referral_renew_rate:.2%}")

Referral renew rate: 63.76%


In [None]:
# Amongst the clients that subscribed with a referral, how many renewed?
# Use a separate name so the scalar referral_renew_rate computed above is not overwritten
referral_renew = subscriptions_v2[subscriptions_v2['referral'] == 1]['renew'].value_counts(normalize=True).reset_index()
referral_renew

fig = px.bar(referral_renew, x='renew', y='proportion', title="Amongst the clients that subscribed with a referral, how many renewed?")
fig.show()

### Renew rate per first time subscribers

In [None]:
# Renew rate for first time subscriptions
first_timers = subscriptions_v2[subscriptions_v2['first_subscription'] == 1]

first_timers_RR = first_timers['renew'].mean()
print(f"Renew rate for first timers: {first_timers_RR:.2%}")


Renew rate for first timers: 54.72%


In [None]:
# What's the percentage of users on their first subscription that renewed it?
first_timers_renew = first_timers['renew'].value_counts(normalize=True).reset_index()

fig = px.bar(first_timers_renew, x='renew', y='proportion', title="What's the percentage of users on their first subscription that renewed it?")
fig.show()

### Renew rate per customer segment


In [None]:
# Calculate the renew rate for each customer segment
segment_renew_rates = subscriptions_v2.groupby('customer_segment')['renew'].mean().reset_index().sort_values(by='renew', ascending=False)

# bar chart to visualize the renew rates for each customer segment
fig = px.bar(segment_renew_rates, x='customer_segment', y='renew', title='Renew Rate by Customer Segment')
fig.show()

### Single payment renewal rate

In [None]:
# Amongst the people that opted for a single payment, how many renewed?
single_payment = subscriptions_v2[subscriptions_v2['payment_type'] == 'single']

single_payment_RR = single_payment['renew'].mean()
print(f"Renew rate among people that paid one time: {single_payment_RR:.2%}")

Renew rate among people that paid one time: 66.94%


### Installments renewal rate

In [None]:
# Amongst the people that opted for an instalment payment, how many renewed?
instalment_payment = subscriptions_v2[subscriptions_v2['payment_type'] == 'instalment']

instalment_payment_RR = instalment_payment['renew'].mean()
print(f"Renew rate among people that split their payments: {instalment_payment_RR:.2%}")

Renew rate among people that split their payments: 49.74%


### Payment types

In [None]:
# renewal rate based on their payment type

from plotly.subplots import make_subplots
import plotly.graph_objects as go


# Create two traces, one for each payment type
single_trace = go.Bar(
    x=['Renewed', 'Not Renewed'],
    y=[single_payment_RR, 1 - single_payment_RR],
    name='Single Payment',
    marker_color='turquoise'
)

instalment_trace = go.Bar(
    x=['Renewed', 'Not Renewed'],
    y=[instalment_payment_RR, 1 - instalment_payment_RR],
    name='Instalment Payment',
    marker_color='deepskyblue'
)

# Create the figure with two subplots
fig = make_subplots(rows=1, cols=2, subplot_titles=("Single Payment", "Instalment Payment"))

# Add traces to the figure
fig.add_trace(single_trace, row=1, col=1)
fig.add_trace(instalment_trace, row=1, col=2)

# Update the layout
fig.update_layout(title_text='User Renewal Rate by Payment Type')

# Show the figure
fig.show()


# Hypothesis testing

## Hypothesis 1

### Users from certain countries are more likely to churn

Test: Kruskal-Wallis H Test (because there are outliers)

Case: Users from certain regions or countries are more likely to churn.

**Null hypothesis (H0)**: The churn rates are the same across all countries.

**Alternative hypothesis (H1)**: At least one churn rate is different.

**Significance level:** 0.05

Apply Kruskal-Wallis H Test

In [None]:
from scipy.stats import kruskal

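# Extract the relevant data from the churn_per_country dataframe
data = churn_per_country['CR_user'].tolist()
groups = churn_per_country['country_name'].tolist()

# Perform the Kruskal-Wallis H Test
# NOTE: kruskal() expects one sample of observations per group, passed as separate arguments.
# Here the list of rates and the list of country names are passed as two samples, so the
# statistic does not compare churn between countries and should be read with caution.
statistic, pvalue = kruskal(data, groups)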

# Print the results
print("Kruskal-Wallis H Test:")
print("Statistic:", statistic)
print("p-value:", pvalue)

# Interpret the results
alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis.")
    print("There is significant evidence that the churn rates are different across at least two countries.")
else:
    print("Fail to reject the null hypothesis.")
    print("There is not enough evidence to conclude that the churn rates are different across different countries.")


Kruskal-Wallis H Test:
Statistic: 92.29492217016772
p-value: 7.4675660582803555e-22
Reject the null hypothesis.
There is significant evidence that the churn rates are different across at least two countries.
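
For reference, `scipy.stats.kruskal` takes one array of observations per group. A way to set this up for the hypothesis above is to go back to subscription-level rows and pass the `is_churner` values of each country as its own sample. The sketch below does this on synthetic data only: the country labels and churn probabilities are made up, and no result for the real table is implied. With a binary outcome a chi-square test of independence would be another common choice.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Synthetic subscription-level data: made-up countries and churn probabilities
probs = {"A": 0.25, "B": 0.35, "C": 0.50}
frames = [
    pd.DataFrame({"country_name": c, "is_churner": rng.binomial(1, p, size=500)})
    for c, p in probs.items()
]
toy = pd.concat(frames, ignore_index=True)

# One array of observations per country, unpacked as separate arguments
samples = [g["is_churner"].to_numpy() for _, g in toy.groupby("country_name")]
statistic, pvalue = kruskal(*samples)

print("Statistic:", statistic)
print("p-value:", pvalue)
```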


### Diving deeper - Per region ⏰

**Now, let's check if there's churn difference per country region.**

**Case**: Compare churn rates across different regions in a country

**Null hypothesis (H0)**: The churn rates are the same across all regions in a country.

**Alternative hypothesis (H1)**: At least one churn rate is different.

In [None]:
#prepare data
subs_country_region['CR_subscription'] = subs_country_region['is_churner'] / subs_country_region['subscription_id']
subs_country_region['CR_user'] = subs_country_region['is_churner'] / subs_country_region['user_id']

churn_per_country_region = subs_country_region[['country_name', 'correct_region', 'CR_user', 'CR_subscription']].sort_values(by=['CR_user', 'CR_subscription'], ascending=False)

churn_per_country_region.head(25)

Unnamed: 0,country_name,correct_region,CR_user,CR_subscription
218,Mexico,tlaxcala,2.0,1.0
46,Bahamas,central abaco,2.0,0.666667
311,Sweden,uppsala,2.0,0.666667
7,Andorra,ordino,1.5,0.75
215,Mexico,sinaloa,1.0,1.0
67,Brazil,paraíba,1.0,1.0
339,Tunisia,tunis governorate,1.0,1.0
399,United States,west virginia,1.0,1.0
53,Bolivia,santa cruz,1.0,1.0
338,Tunisia,sousse governorate,1.0,1.0


In [None]:
#check cr_user distribution
fig = px.histogram(churn_per_country_region, x='CR_user', title='Churn rate per region distribution')
fig.show()

In [None]:
top_15 = subs_country.sort_values(by=['is_churner'], ascending=False).head(15)
top_15_with_most_churners = top_15['country_name'].unique()
top_15_with_most_churners

array(['France', 'United States', 'Spain', 'Canada', 'Italy',
       'Netherlands', 'Germany', 'Australia', 'United Kingdom', 'Belgium',
       'Brazil', 'Ireland', 'Switzerland', 'Sweden', 'Mexico'],
      dtype=object)

**French regions**

In [None]:
from scipy.stats import kruskal

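# Extract the relevant data from the churn_per_country_region dataframe
data = churn_per_country_region['CR_user'].tolist()
france_group = churn_per_country_region.loc[churn_per_country_region['country_name'] == 'France', 'correct_region'].tolist()
#region_groups = churn_per_country_region['correct_region'].tolist()

# Perform the Kruskal-Wallis H Test
# NOTE: data holds the rates of all regions in all countries and france_group holds the
# names of the French regions. kruskal() treats them as two samples, so the statistic
# does not compare churn between French regions and should be read with caution.
statistic, pvalue = kruskal(data, france_group)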

# Print the results
print("Kruskal-Wallis H Test:")
print("Statistic:", statistic)
print("p-value:", pvalue)

# Interpret the results
alpha = 0.05
if pvalue < alpha:
    print("Reject the null hypothesis.")
    print("There is significant evidence that the churn rates are different across French regions.")
else:
    print("Fail to reject the null hypothesis.")
    print("There is not enough evidence to conclude that the churn rates are different across French regions.")


Kruskal-Wallis H Test:
Statistic: 43.420978241392
p-value: 4.414302316250527e-11
Reject the null hypothesis.
There is significant evidence that the churn rates are different across French regions.
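
Since churn is a yes/no outcome, a chi-square test of independence on a region by churn contingency table is one way to compare regions within a single country. This is a swapped-in technique, not the Kruskal-Wallis test used above, and the sketch runs on synthetic rows: the region labels `r1`, `r2`, `r3`, `s1` and the churn probability are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)

# Synthetic subscription-level data with made-up region labels
toy = pd.DataFrame({
    "country_name": ["France"] * 600 + ["Spain"] * 300,
    "correct_region": ["r1"] * 200 + ["r2"] * 200 + ["r3"] * 200 + ["s1"] * 300,
    "is_churner": rng.binomial(1, 0.35, size=900),
})

# Keep one country, then cross-tabulate region against the churn flag
france = toy[toy["country_name"] == "France"]
table = pd.crosstab(france["correct_region"], france["is_churner"])

# Test whether churn is independent of region within that country
chi2, pvalue, dof, expected = chi2_contingency(table)

print(table)
print("chi2:", chi2, "p-value:", pvalue, "dof:", dof)
```

Regions with very few subscriptions would give small expected counts, so in practice it may be worth filtering them out before running the test.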


## Hypothesis 2