# **PROBLEM SOLVING DESIGN**

![Lean StartUp Feedback Loop](../img/project_structure/lean_startup_feedback_loop.jpg)

# **BUSINESS CONTEXT**

## **What is the company?**

Hotmart

## **What is its business model?**

Two-sided marketplace. It is a platform for buying, selling and promoting digital products in which Hotmart connects product creators/disseminators to their customers.

## **What is the company stage on the market?**

"Virality" (Lean Analytics) or "early majority" (Innovation Adoption Curve). The company found a pain in the market and validated a product that solves the pain; now is the time to increase the customer base.

# **BUSINESS PROBLEM**

## **What is the business problem the company is facing?**

The company wants to get insight based on customers' data in order to unveil new product opportunities, especially in terms of product success, customer segmentation, and revenue estimation.

## **What is the business solution that this project has to deliver?**

A presentation of storytelling insights based on the available data and, possibly, answers to the following questions:
- Does Hotmart depend on the biggest producers on the platform? That is, the top-selling producers are responsible for most of the
Hotmart billing?
- Are there any relevant patterns or trends in the data?
- It is possible to segment users based on their characteristics (revenue, product niche, etc.)?
- What features most impact the success of a product? that is, the What makes a product sell more?
- It is possible to estimate how much revenue Hotmart will generate in the next three months from the last month shown in the dataset?

**References:**
- Case description
- https://hotmart.com/pt-br

# **SCOPE AND BUSINESS ASSUMPTIONS**

- **...**

- **...**


REFERENCES:
...

# **SOLUTION STRATEGY**

![IoT method](../img/project_structure/iot_method.png)*IOT (Input-Output-Taks) is a planning strategy to structure a problem solution and make sure it delivers a solution that solves the initial problem.*

### INPUT

- **Business context**:
    - It is a platform for buying, selling and promoting digital products in which Hotmart connects product creators/disseminators to their customers.
    - In principle, Hotmart makes money by **taxing**, either the creators or the disseminators, **a percentage of the purchase by the customer**.
- **Business problem**:
    - The company wants to get **insights** based on customers' data in order to **unveil new product opportunities**, especially in terms of product success, customer segmentation, and revenue estimation.
- **Business questions**:
    - Does **Hotmart depend** on the **biggest producers** on the platform? That is, the **top-selling producers** are responsible for **most** of the Hotmart **billing**?
    - Are there any **relevant patterns or trends** in the data?
    - It is possible to **segment users** based on their characteristics (revenue, product niche, etc.)?
    - What **features most impact** the success of a **product**? that is, the What makes a **product sell more**?
    - It is possible to **estimate** how much **revenue** Hotmart will generate in the **next three months from the last month** shown in the dataset?
- **Available data**:
    - Data referring to a **sample of purchases made** at Hotmart in 2016. These are more than 1.5 million records of purchases made on our **platform**.

### OUTPUT 

- A presentation of storytelling insights based on the available data and, possibly, answers to the previous questions.

### TASKs

- *QUESTION*:
    - Does **Hotmart depend** on the **biggest producers** on the platform? That is, the **top-selling producers** are responsible for **most** of the Hotmart **billing**?
        - What are the biggest producers on the platform? What is its definition?
            - Assuming higher than 95th percentile of volume of product sold.
        - What it means to be dependent on some producers?
            - Assuming "Pareto rule" like: 80% of revenue comes from the 5th top selling producers
        - What is the revenue difference from this customers to the remaining one?
            - Compare revenues

<br >

- *QUESTION*:
    - Are there any **relevant patterns or trends** in the data?
        - Check for features (correlation between features, feature distributions and time-changes trends) that shows patterns in terms of customers/producers groups or revenue impact or scaling impact.

<br >

- *QUESTION*:
    - It is possible to **segment users** based on their characteristics (revenue, product niche, etc.)?
        - What is the purpose of segmenting customers?
          - Find out what are the best customers and what coould be done to change the behaviour of the not-best ones. 
          - Revenue from best customer could support scaling efforts.
        - Check for features that can cluster customer/producers for better revenue undestanding
          - Initially try RFM (Recency-Frequency-Monetary)

<br >      

- *QUESTION*:
    - What **features most impact** the success of a **product**? that is, what makes a **product sell more**?
        - Success of a product = number of products sold
            - Inspect features with high correlation to the number of product sold
            - Inspect feature with high correlation with an increasing trend of products sold
            - Check for simple causal inference techniques
              - knowing features that best impact the product success, we can use this feature for marketing purpose (scalling effort) and, perhaps, get a better overview about what leads to focus on.

<br >

- *QUESTION*:
    - It is possible to **estimate** how much **revenue** Hotmart will generate in the **next three months from the last month** shown in the dataset?
        - Check the revenue time-series to understand how to extrapolate it to the future
            - Visual inspection
            - Check for trend and seasonality and noise
            - Define baseline (dummy = last available date)
                - Initially, ARIMA model
                - If possible, machine learning models
                - Check model error and extrapolate to business impact
                  - knowing revenue forecast we can predcit scaling investments and even prepone investments.

# **PRODUCT BUILDING ROADMAP**

![CRISP-DS Framework](../img/project_structure/crisp_ds.jpg)

---
---
---

# **0 - HELPERS**

## 0.1 - Libraries

*Import required libraries*

In [1]:
# don't cache libraries (especially project library)
%load_ext autoreload
%autoreload 2

In [2]:
# setup and environment
import os
from   pathlib import Path

# data extraction
from sqlalchemy import create_engine

# data manipulation
import numpy as np
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# project library
from project_lib.initial_config import initial_settings
from project_lib.data_description import (check_dataframe, inspect_dtypes, 
                                          check_na_unique_dtypes, check_dtype_convertion,
                                          summary_statistics, categorical_summary, datetime_summary
                                          )
from project_lib.data_exploration import (numerical_plot, categorical_plot, datetime_plot)

## 0.2 - Functions

*Define functions that will be used on the project*

NOTE: Most functions made for this project are inside the project library. That is, **a package called "project_lib" was created to hold all functions that will be needed for this project.**


For further details, please check the modules inside "project_lib" package [in other words, check .py files inside project_lib folder]

In [3]:
# # example of function created for this project
# help(check_dataframe)

## 0.3 - Setup

*Define basic configurations*

In [4]:
# initial setup of dataframes and plots
initial_settings(storytelling=False)

## 0.4 - Constants

*Define reusuable constants*

In [5]:
# define the project root path that will be the "baseline" for all paths in the notebook
PROJECT_ROOT_PATH = Path.cwd().parent
PROJECT_ROOT_PATH

PosixPath('/home/ds-gustavo-cunha/Projects/hotmart_case')

In [6]:
# # variables to connect to data source
# HOST=os.environ["HOST"]
# PORT=os.environ["PORT"]
# USER=os.environ["USER"]
# PASSWORD=os.environ["PASSWORD"]
# SCHEMA=os.environ["SCHEMA"]
# TABLE=os.environ["TABLE"]

# **1 - DATA EXTRACTION**

## 1.1 - Entity Relationship Diagram

*Display Entity-Relationship Diagram to a better data understanding*

In [7]:
# Not available -> datasets are already merged

## 1.2 - Data Fields Description

*Describe available data in regard to database information*


---

Na Hotmart, possuímos três principais personas que integram nosso negócio: os produtores, os afiliados e os compradores.
- Produtores são pessoas que criam produtos digitais na Hotmart, como cursos de idiomas, ebooks de receitas culinárias, audiolivros, softwares, dentre muitos outros exemplos.
- Afiliados são pessoas que promovem produtos dos produtores em troca de uma comissão na venda, que varia de produto para produto, e de afiliado para afiliado.
- Compradores são pessoas que adquirem um ou mais produtos digitais.
    
Uma venda é feita por um afiliado quando alguém clica em um link de afiliados. Eles geralmente fazem a promoção desses produtos em redes sociais, vídeos, anúncios, etc.

Já uma venda é feita por um produtor quando alguém tem acesso direto ao seu produto, sem intermediação do afiliado. Por exemplo, pessoas que seguem o Whindersson Nunes no Youtube e entraram em seu site oficial para adquirir seu produto, ou clicaram no link do produto sem código de afiliação.

---

---

Durante sua avaliação, você irá analisar dados referentes a uma amostra de compras feitas na Hotmart em 2016. Tratam-se de mais de 1,5 milhão registros de compras realizadas em nossa plataforma. Abaixo, nós iremos detalhar o que significa cada campo:
- **purchase_id**: Identificação da compra na Hotmart;
- **product_id**: Identificação do produto na Hotmart;
- **affiliate_id**: Identificação do afiliado na Hotmart;
- **producer_id**: Identificação do produtor na Hotmart;
- **buyer_id**: Identificação do comprador na Hotmart;
- **purchase_date**: Data e hora em que a compra foi realizada;
- **product_creation_date**: Data e hora em que o produto foi criado na Hotmart;
- **product_category**: categoria do produto na Hotmart. Exemplo: e-book, software, curso online, e-tickets, etc.;
- **product_niche**: nicho de mercado que o produto faz parte. Exemplo: educação, saúde e bem-estar, sexualidade, etc.;
- **purchase_value**: valor da compra. Esse dado, assim como nicho e categoria foi codificado para manter a confidencialidade. O valor apresentado no dataset é o z-score do valor real;
- **affiliate_commission_percentual**: percentual de comissão que o afiliado receberá da compra;
- **purchase_device**: tipo de dispositivo utilizado no momento da compra, como: Desktop, Mobile, Tablet, ou Outros;
- **purchase_origin**: endereço do site do qual a pessoa veio antes da compra. Por exemplo, se uma pessoa veio do Facebook, Youtube, ou até mesmo de outra página no site oficial do produto;
- **is_origin_page_social_network**: informa se essa compra veio de uma URL do Facebook, Youtube, Instagram, Pinterest, ou Twitter.

---

---

Algumas regras de negócio:
- Quando a compra for feita diretamente pelo produtor, ou seja, quando não houver afiliado intermediando a compra, o campo affiliate_commission_percentual terá valor 0, e o campo affiliate_id será igual ao producer_id;
- No campo purchase_origin nós apenas consideramos o host do site. Isso quer dizer que, se uma pessoa veio do site www.meuproduto.com/promocoes, esse campo só irá retornar o valor www.meuproduto.com;

---

## 1.3 - Data Loading

*Load data from required files*

In [8]:
# # define connection "endpoint"
# db_connection_str = f'mysql+pymysql://{USER}:{PASSWORD}@{HOST}/{SCHEMA}'
# # create an engine to connect to database
# db_connection = create_engine(db_connection_str)

# # define query to get data
# query=f"""
# SELECT *
# FROM {TABLE}
# """

# # read all data from database
# df_sql = pd.read_sql(sql=query, con=db_connection)
# df_sql

In [9]:
# # save data to parquet so as to not overload database server unnecessarily
# df_sql.to_parquet(
#     path=os.path.join(PROJECT_ROOT_PATH, "data", "raw_data", "customer_data.parquet")
# )

In [10]:
# read data from local source
df_extraction = pd.read_parquet(
    path=os.path.join(PROJECT_ROOT_PATH, "data", "raw_data", "customer_data.parquet")
)

# inspect results
df_extraction.sample(5)

Unnamed: 0,purchase_id,product_id,affiliate_id,producer_id,buyer_id,purchase_date,product_creation_date,product_category,product_niche,purchase_value,affiliate_commission_percentual,purchase_device,purchase_origin,is_origin_page_social_network,Venda
586283,12036841,143399,27147,1471554,5924461,2016-03-13 15:06:17,2015-03-24 20:05:02,Phisical book,Media training,-0.415,0.0,Desktop,Origin 32cf,0,1
1426024,13669209,114914,3364586,3364586,1428164,2016-06-12 00:30:47,2014-09-03 13:29:48,Podcast,Negotiation,-0.448,0.0,eReaders,Origin 4f06,0,1
646279,12157405,191898,349701,349701,6829336,2016-03-20 09:52:35,2015-12-14 16:22:37,Phisical book,Anxiety management,0.132,0.0,eReaders,Origin d8b2,0,1
600834,12065078,86109,586839,586839,4983211,2016-03-14 19:10:32,2014-02-18 14:01:16,Phisical book,Personal finance,-0.415,0.0,Desktop,Origin 6d94,0,1
792602,12453847,214912,4734645,4734645,6970078,2016-04-06 18:52:33,2016-03-25 01:18:38,Phisical book,Anxiety management,-0.412,0.0,eReaders,Origin 7f8f,0,1


# **2 - DATA DESCRIPTION**

## 2.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [11]:
# create a restore point of the previous section
df_description = df_extraction.copy()

# check dataframe for this new section
check_dataframe( dataframe=df_description, summary_stats=True, head=True )

*************************************************
Dataframe size in memory: 660.704 MB 

-----------------------------
Dataframe overview:


Unnamed: 0,Num NAs,Percent NAs,Num unique [include NAs],Data Type
purchase_id,0,0,1.599.828,int64
product_id,0,0,17.883,int64
affiliate_id,0,0,22.947,int64
producer_id,0,0,8.020,int64
buyer_id,0,0,1.100.649,int64
purchase_date,0,0,1.488.964,datetime64[ns]
product_creation_date,0,0,17.879,datetime64[ns]
product_category,0,0,10,object
product_niche,0,0,25,object
purchase_value,0,0,32.617,float64


-----------------------------

 Dataframe shape is (1599828, 15) 

-----------------------------


Statistics for Numerical Variables [NaNs are ignored]:


Unnamed: 0,attribute,mean,median,std,iqr,min,max,range,skew,kurtosis
0,purchase_id,"12.445.456,601","12.468.487,500","917.581,737","1.579.356,500","1.663.958,000","14.357.203,000","12.693.245,000",-90,-756
1,product_id,"148.595,814","154.310,000","55.543,152","81.796,000",4000,"319.129,000","319.125,000",-482,-702
2,affiliate_id,"2.297.500,688","1.690.428,000","2.092.655,502","3.549.994,000",3000,"7.700.836,000","7.700.833,000",651,-823
3,producer_id,"2.164.479,522","1.377.289,000","2.038.959,782","3.366.648,000",3000,"9.868.481,000","9.868.478,000",724,-699
4,buyer_id,"5.187.551,341","5.999.153,500","2.199.255,869","3.216.124,250",60000,"12.014.792,000","12.014.732,000",-878,-492
5,purchase_value,0000,-0350,1000,0518,-0541,124561,125102,10817,629206
6,affiliate_commission_percentual,7596,0000,18477,0000,0000,100000,100000,2259,3753
7,Venda,1000,1000,0000,0000,1000,1000,0000,0,0


-----------------------------


dataframe.head(5)


Unnamed: 0,purchase_id,product_id,affiliate_id,producer_id,buyer_id,purchase_date,product_creation_date,product_category,product_niche,purchase_value,affiliate_commission_percentual,purchase_device,purchase_origin,is_origin_page_social_network,Venda
0,1663958,6640,209372,116238,1200397,2016-06-26 12:00:00,2011-03-19 15:47:36,Video,Presentation skills,-0.3,,Smart TV,Origin ef2b,0,1
1,1677087,2350,141418,2821,1083764,2016-06-26 12:00:00,2010-07-05 01:50:15,Podcast,Child psychology,-0.2,,Smart TV,Origin ef2b,0,1
2,2017360,35669,618642,618642,1436106,2016-06-26 12:00:00,2012-06-13 02:59:37,Podcast,Presentation skills,-0.5,,Smart TV,Origin ef2b,0,1
3,2017379,57998,1164511,70388,1436118,2016-06-26 12:00:00,2013-05-07 08:51:31,Podcast,Anxiety management,-0.4,,Smart TV,Origin ef2b,0,1
4,2017382,58329,1261488,221253,1386357,2016-06-26 12:00:00,2013-05-12 08:12:06,Podcast,Teaching English,-0.5,,Smart TV,Origin ef2b,0,1


*************************************************


## 2.2 - Rename Columns

*Search for misleading or error-prone column names*

In [12]:
# inspect column names
df_description.columns

Index(['purchase_id', 'product_id', 'affiliate_id', 'producer_id', 'buyer_id',
       'purchase_date', 'product_creation_date', 'product_category',
       'product_niche', 'purchase_value', 'affiliate_commission_percentual',
       'purchase_device', 'purchase_origin', 'is_origin_page_social_network',
       'Venda'],
      dtype='object')

In [13]:
# lower the case of venda column
df_description = df_description.rename(columns={"Venda": "sell"})

# inspect results
df_description.columns

Index(['purchase_id', 'product_id', 'affiliate_id', 'producer_id', 'buyer_id',
       'purchase_date', 'product_creation_date', 'product_category',
       'product_niche', 'purchase_value', 'affiliate_commission_percentual',
       'purchase_device', 'purchase_origin', 'is_origin_page_social_network',
       'sell'],
      dtype='object')

## 2.3 - Check Data Dimensions

*Check dataframe dimensions to know if pandas will be enough to handle such data size or we will need Big Data tools like Spark*

In [14]:
# check number of rows and columns
print( f'\
Dataframe has {df_description.shape[0]:,} \
rows and {df_description.shape[1]} columns' )

Dataframe has 1,599,828 rows and 15 columns


## 2.4 - Data Types

*Check if data types on dataframe makes sense according to database information*

In [15]:
# define shape before dtype convertion
shape_before = df_description.shape

# inspect dataframe types
inspect_dtypes(df_description, 15)

Unnamed: 0,types,random row: 1,random row: 2,random row: 3,random row: 4,random row: 5,random row: 6,random row: 7,random row: 8,random row: 9,random row: 10,random row: 11,random row: 12,random row: 13,random row: 14,random row: 15
purchase_id,int64,11.289.863,13.891.778,12.345.685,11.543.137,13.447.274,12.162.329,12.308.734,12.485.889,13.883.857,11.718.361,13.707.037,13.417.248,11.372.825,11.469.690,12.459.424
product_id,int64,111.830,239.217,209.381,131.756,132.273,83.916,207.407,199.922,121.779,124.027,207.374,135.461,42.903,197.068,218.030
affiliate_id,int64,3.258.278,1.111.682,236.083,811.062,4.057.408,348.488,4.372.178,6.728.566,1.845.090,3.810.829,5.441.590,96.585,3.124.408,1.095.211,41.463
producer_id,int64,3.258.278,1.111.682,236.083,811.062,4.057.408,348.488,4.372.178,6.090.854,2.546.880,3.810.829,5.441.590,96.585,442.241,1.095.211,41.463
buyer_id,int64,6.123.425,5.408.465,3.750.475,4.275.209,7.441.062,4.063.235,6.898.330,6.985.536,1.859.645,5.206.449,7.428.208,1.988.745,4.408.187,6.500.209,3.175.581
purchase_date,datetime64[ns],2016-01-28 22:17:05,2016-06-23 21:20:29,2016-03-31 12:04:26,2016-02-13 21:01:51,2016-05-31 20:01:11,2016-03-20 14:26:54,2016-03-28 19:37:40,2016-04-08 23:41:55,2016-06-23 13:05:24,2016-02-24 13:35:51,2016-06-13 21:54:07,2016-05-29 22:30:56,2016-02-02 16:20:19,2016-02-08 15:30:29,2016-04-07 02:55:50
product_creation_date,datetime64[ns],2014-08-06 17:59:26,2016-06-19 23:42:18,2016-03-05 12:06:48,2015-01-09 17:29:58,2015-01-12 15:15:55,2014-01-28 17:11:10,2016-02-25 17:32:16,2016-01-24 15:07:48,2014-10-29 19:15:23,2014-11-13 21:15:20,2016-02-25 12:56:08,2015-02-01 13:34:24,2012-09-26 15:54:59,2016-01-12 14:14:31,2016-04-05 01:32:57
product_category,object,Podcast,Workshop,Phisical book,Phisical book,Phisical book,Phisical book,Podcast,Phisical book,Phisical book,Phisical book,Podcast,Phisical book,Phisical book,Phisical book,Phisical book
product_niche,object,Careers,Anxiety management,Presentation skills,Anxiety management,YouTube video creation,Anxiety management,Government,Anxiety management,Personal finance,Online course creation,Physics,Global diplomacy,YouTube video creation,Personal finance,Presentation skills
purchase_value,float64,-0532,0204,-0509,0769,-0261,0955,-0414,0204,2748,-0372,-0522,-0415,-0252,-0359,-0491


In [16]:
# inspect basic column descriptions
check_na_unique_dtypes(df_description);

*************************************************
Dataframe size in memory: 660.704 MB 

-----------------------------
Dataframe overview:


Unnamed: 0,Num NAs,Percent NAs,Num unique [include NAs],Data Type
purchase_id,0,0,1.599.828,int64
product_id,0,0,17.883,int64
affiliate_id,0,0,22.947,int64
producer_id,0,0,8.020,int64
buyer_id,0,0,1.100.649,int64
purchase_date,0,0,1.488.964,datetime64[ns]
product_creation_date,0,0,17.879,datetime64[ns]
product_category,0,0,10,object
product_niche,0,0,25,object
purchase_value,0,0,32.617,float64


-----------------------------

 Dataframe shape is (1599828, 15) 



In [17]:
# print report
print(
    f"Unique values in colum 'sell': {set(df_description['sell'].tolist())}"
)

Unique values in colum 'sell': {1}


In [18]:
# print report
print(
    f"Unique values in colum 'sell': {set(df_description['is_origin_page_social_network'].tolist())}"
)

# convert column is_origin_page_social_network to boolean
df_description["is_origin_page_social_network"] = df_description["is_origin_page_social_network"].apply( lambda x: True if x == '0,0' else False if x == '1,0' else "NaN")

# print report
print(
    f"Unique values in colum 'sell' after transformation: {set(df_description['is_origin_page_social_network'].tolist())}"
)

Unique values in colum 'sell': {'0,0', '1,0'}
Unique values in colum 'sell' after transformation: {False, True}


In [19]:
# sanity check
assert df_description.shape == shape_before, "Data was missed during dtype convertion"

## 2.5 - Data Validation

*Check if columns make sense in regard to business understanding*

In [20]:
# as data was already made available to us 
# and there is no way to validate data source,
# no need for data validation right now.

## 2.6 - Check Duplicated Rows

*Inspect duplicated rows and handle them properly*

In [21]:
# define dataframe grain
grain = ["purchase_id"]

# check duplicated rows
print(
    f'{"*"*49}\n\n'
    f'There are {df_description.duplicated(keep=False).sum():,} '
    f'duplicated rows [{df_description.duplicated(keep=False).mean()*100:.2f}%] based on all columns. '
    f'Duplicated rows are double counted.'
    f'\n\n{"*"*49}\n\n'
    f'Dataframe granularity: {grain}\n\n'
    f'There are {df_description.duplicated(subset=grain, keep=False).sum():,} duplicated rows '
    f'[{df_description.duplicated(subset=grain, keep=False).mean()*100:.2f}%] based on table granularity. '
    f'Duplicated rows are double counted.'
    f'\n\n{"*"*49}'
)

*************************************************

There are 0 duplicated rows [0.00%] based on all columns. Duplicated rows are double counted.

*************************************************

Dataframe granularity: ['purchase_id']

There are 0 duplicated rows [0.00%] based on table granularity. Duplicated rows are double counted.

*************************************************


## 2.7 - Check Missing Values

*Inspect number and percentage of missing value per column to decide what to do with them*

In [22]:
#  get number of NA, percent of NA, number of unique and column type
check_na_unique_dtypes(df_description);

*************************************************
Dataframe size in memory: 566.315 MB 

-----------------------------
Dataframe overview:


Unnamed: 0,Num NAs,Percent NAs,Num unique [include NAs],Data Type
purchase_id,0,0,1.599.828,int64
product_id,0,0,17.883,int64
affiliate_id,0,0,22.947,int64
producer_id,0,0,8.020,int64
buyer_id,0,0,1.100.649,int64
purchase_date,0,0,1.488.964,datetime64[ns]
product_creation_date,0,0,17.879,datetime64[ns]
product_category,0,0,10,object
product_niche,0,0,25,object
purchase_value,0,0,32.617,float64


-----------------------------

 Dataframe shape is (1599828, 15) 



In [23]:
# print report
print(
    f'affiliate_commission_percentual\n'
    f'\tmax value {df_description["affiliate_commission_percentual"].max(skipna=True)}\n'
    f'\tmin value {df_description["affiliate_commission_percentual"].min(skipna=True)}'
)

affiliate_commission_percentual
	max value 100.0
	min value 0.0


## 2.8 - Handle Missing Values

*Handle missing value for columns*

**Business rule**
- Quando a compra for feita diretamente pelo produtor, ou seja, quando não houver afiliado intermediando a compra, o campo affiliate_commission_percentual terá valor 0, e o campo affiliate_id será igual ao producer_id;

In [24]:
# get number of NaN in affiliate_commission_percentual
num_nas = df_description["affiliate_commission_percentual"].isna().sum()

In [25]:
# inspect rows where affiliate_commission_percentual is NaN to validate business rule
df_description.loc[
    df_description["affiliate_commission_percentual"].isna(),
    ["affiliate_commission_percentual", "affiliate_id", "producer_id"]    
].sample(5, random_state=7)

Unnamed: 0,affiliate_commission_percentual,affiliate_id,producer_id
85,,213339,213339
106,,195000,195000
22,,431496,298517
11,,618642,618642
65,,8716,361052


In [26]:
# as NaNs in affiliate_commission_percentual don't seem to be due to business rule,
# let's fill NaN with -1 (number outside of the scope of min-max range)
df_description["affiliate_commission_percentual"] = df_description["affiliate_commission_percentual"].fillna(value=-1)

# sanity check
assert (df_description["affiliate_commission_percentual"] == -1).sum() == num_nas, "Misleading fillna operation"

## 2.9 - Descriptive Statistics

*Inspect some summary statistics for numerical columns*

In [27]:
# split dataset into types of features
df_number = df_description.select_dtypes(include=["number", "bool"])
df_date = df_description.select_dtypes(include=["datetime"])
df_string = df_description.select_dtypes(include=["object"])

# sanity check
assert df_number.shape[1] + df_date.shape[1] + df_string.shape[1] == df_description.shape[1], """Revise the previous split, something may be wrong!"""

### 2.9.1 - Numerical Variables

*Inspect numerical variables*

In [28]:
# check summary statistics
summary_statistics(df_number)



Statistics for Numerical Variables [NaNs are ignored]:


Unnamed: 0,attribute,mean,median,std,iqr,min,max,range,skew,kurtosis
0,purchase_id,"12.445.456,601","12.468.487,500","917.581,737","1.579.356,500","1.663.958,000","14.357.203,000","12.693.245,000",-90,-756
1,product_id,"148.595,814","154.310,000","55.543,152","81.796,000",4000,"319.129,000","319.125,000",-482,-702
2,affiliate_id,"2.297.500,688","1.690.428,000","2.092.655,502","3.549.994,000",3000,"7.700.836,000","7.700.833,000",651,-823
3,producer_id,"2.164.479,522","1.377.289,000","2.038.959,782","3.366.648,000",3000,"9.868.481,000","9.868.478,000",724,-699
4,buyer_id,"5.187.551,341","5.999.153,500","2.199.255,869","3.216.124,250",60000,"12.014.792,000","12.014.732,000",-878,-492
5,purchase_value,0000,-0350,1000,0518,-0541,124561,125102,10817,629206
6,affiliate_commission_percentual,7595,0000,18476,0000,-1000,100000,101000,2259,3754
7,sell,1000,1000,0000,0000,1000,1000,0000,0,0


According to business rule:
- purchase_value: "valor da compra. Esse dado, assim como nicho e categoria foi codificado para manter a  confidencialidade. O valor apresentado no dataset é o **z-score** do valor real";
  - So it is fine to have negative values!

### 2.9.2 - Categorical Variables

*Inspect categorical variables*

In [29]:
# check overview of categorical features
categorical_summary(df_string, nunique_threshold=30, unique_name_len_threshold=50)

Overview of string columns:


Unnamed: 0,Num NAs,Percent NAs,Num unique [include NAs],Data Type
product_category,0,0,10.0,object
product_niche,0,0,25.0,object
purchase_device,0,0,5.0,object
purchase_origin,0,0,9.603,object


------------------------------------------------- 

[94m--->[0m The unique values for [94m[1mproduct_category[0m[0m column are: [[1mvalues are truncated[0m] 

['Video', 'Podcast', 'Phisical book', 'eBook', 'In-class course', 'Workshop', 'Webinar', 'eTicket', 'Subscription', 'App']
------------------------------------------------- 

[94m--->[0m The unique values for [94m[1mproduct_niche[0m[0m column are: [[1mvalues are truncated[0m] 

['Presentation skills', 'Child psychology', 'Anxiety management', 'Teaching English', 'Online course creation', 'Media training', 'Storytelling', 'YouTube video creation', 'Procrastination', 'Organization', 'Negotiation', 'Careers', 'Personal finance', 'Filmmaking', 'Government', 'Global diplomacy', 'Immigration', 'Economics', 'Accounting', 'Biology', 'Physics', 'Genetics', 'Disease', 'Thermodynamics', 'Travel hacking']
------------------------------------------------- 

[94m--->[0m The unique values for [94m[1mpurchase_device[0m[0m c

### 2.9.3 - Datetime Variables

*Inspect datetime variables*

In [30]:
# check an overview of datetime features
datetime_summary(df_date)

Unnamed: 0,first date,last date,range [months],mean,median,Num NAs,Percent NAs,count [non-NA],nunique
purchase_date,2016-01-01 00:00:27,2016-06-30 23:59:57,6,2016-04-04 18:39:34.511339776,2016-04-07 18:50:16.500000,0,0,1.599.828,1.488.964
product_creation_date,2008-10-27 01:39:34,2016-12-31 13:43:50,99,2015-02-22 14:52:58.141221376,2015-05-31 00:12:18,0,0,1.599.828,17.879


### 2.9.4 - Investigate further:

*Variables to inspect the real meaning*

In [31]:
# None up to this point

# **3 - FEATURE ENGINEERING**

## 3.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [32]:
# create a restore point for the previous section dataframe
df_f_eng = df_description.copy()

# check dataframe
check_dataframe( df_f_eng )

*************************************************
Dataframe size in memory: 566.315 MB 

-----------------------------
Dataframe overview:


Unnamed: 0,Num NAs,Percent NAs,Num unique [include NAs],Data Type
purchase_id,0,0,1.599.828,int64
product_id,0,0,17.883,int64
affiliate_id,0,0,22.947,int64
producer_id,0,0,8.020,int64
buyer_id,0,0,1.100.649,int64
purchase_date,0,0,1.488.964,datetime64[ns]
product_creation_date,0,0,17.879,datetime64[ns]
product_category,0,0,10,object
product_niche,0,0,25,object
purchase_value,0,0,32.617,float64


-----------------------------

 Dataframe shape is (1599828, 15) 

-----------------------------


dataframe.sample(5)


Unnamed: 0,purchase_id,product_id,affiliate_id,producer_id,buyer_id,purchase_date,product_creation_date,product_category,product_niche,purchase_value,affiliate_commission_percentual,purchase_device,purchase_origin,is_origin_page_social_network,sell
383926,11621960,179883,2264099,2264099,4467102,2016-02-18 23:04:04,2015-10-16 13:18:03,Phisical book,Global diplomacy,-0.2,0.0,Desktop,Origin cb6b,True,1
806968,12483208,116882,4621070,2026525,6983894,2016-04-08 19:34:59,2014-09-19 10:14:10,Phisical book,Negotiation,-0.4,0.0,eReaders,Origin 5159,True,1
587584,12039290,181119,213339,213339,4695931,2016-03-13 16:41:19,2015-10-22 11:05:58,Phisical book,Organization,0.2,0.0,eReaders,Origin 3c5a,True,1
384197,11622527,144782,4719147,3241028,4432108,2016-02-18 23:46:29,2015-04-03 19:29:38,Phisical book,Online course creation,-0.2,45.0,Desktop,Origin 9034,True,1
1392825,13606396,233324,641011,641011,7513266,2016-06-08 17:38:28,2016-05-30 19:07:41,Phisical book,Personal finance,0.3,0.0,eReaders,Origin eeeb,True,1


*************************************************


## 3.2 - Hypothesis Testing List

*Define the list of hypotheses that will be validated during Exploratory Data Analysis (EDA)*

**HYPOTHESIS MIND MAP**

![Business hypothesis mindmap](../img/project_structure/xxx.jpg)

*The above image is the product of a brainstorm that took into consideration many different variables that can impact the main business metric. This mind map is a great help when trying to raise hypotheses that could lead to insights. It is also helpful to guide feature engineering (create new relevant features) and when there is a need to look for more data elsewhere.*

> *Taking into consideration hypothesis mind map (at the beginning of this notebook) and the business case questions:*


**H1**. Does **Hotmart depend** on the **biggest producers** on the platform? That is, the **top-selling producers** are responsible for **most** of the Hotmart **billing**?

**H2**. Are there any **relevant patterns or trends** in the data?

**H3**. It is possible to **segment users** based on their characteristics (revenue, product niche, etc.)?

**H4**. What **features most impact** the success of a **product**? that is, the What makes a **product sell more**?

**H5**. It is possible to **estimate** how much **revenue** Hotmart will generate in the **next three months from the last month** shown in the dataset?


## 3.3 - Feature Creation

*Create new features (columns) that can be meaningful for EDA and, especially, machine learning modelling.*

In [33]:
# create a column to indicate what is the age of the product when it was purchased
# purchase_date - product_creation_date in months
# month = 0 ---> purchased on the month of creation
df_f_eng["product_age_when_purchased"] = df_f_eng["purchase_date"].dt.to_period(freq="M") - df_f_eng["product_creation_date"].dt.to_period(freq="M")
# extract the month information
df_f_eng["product_age_when_purchased"] = df_f_eng["product_age_when_purchased"].apply(lambda x: x.n)

# inspect result
df_f_eng[["product_creation_date", "purchase_date", "product_age_when_purchased"]].sample(10, random_state=7)

Unnamed: 0,product_creation_date,purchase_date,product_age_when_purchased
687286,2015-12-01 12:37:00,2016-03-24 20:01:35,3
307154,2016-01-27 22:56:37,2016-02-08 01:28:05,1
94908,2015-01-10 15:32:09,2016-01-13 13:48:37,12
309710,2013-01-14 13:17:37,2016-02-08 13:19:27,37
1074287,2015-12-27 13:17:06,2016-05-06 12:38:26,5
1398614,2016-04-11 13:35:45,2016-06-09 12:45:41,2
1587844,2015-11-06 23:59:53,2016-06-29 12:32:21,7
1184721,2016-04-14 21:41:31,2016-05-17 02:28:10,1
514906,2016-02-12 14:33:39,2016-03-06 01:52:02,1
810722,2016-02-21 18:28:50,2016-04-09 11:46:53,2


# **4 - DATA FILTERING**

## 4.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [34]:
# create a restore point for the previous section dataframe
df_filter = df_f_eng.copy()

# check dataframe
check_dataframe( df_filter, summary_stats=True )

*************************************************
Dataframe size in memory: 579.113 MB 

-----------------------------
Dataframe overview:


Unnamed: 0,Num NAs,Percent NAs,Num unique [include NAs],Data Type
purchase_id,0,0,1.599.828,int64
product_id,0,0,17.883,int64
affiliate_id,0,0,22.947,int64
producer_id,0,0,8.020,int64
buyer_id,0,0,1.100.649,int64
purchase_date,0,0,1.488.964,datetime64[ns]
product_creation_date,0,0,17.879,datetime64[ns]
product_category,0,0,10,object
product_niche,0,0,25,object
purchase_value,0,0,32.617,float64


-----------------------------

 Dataframe shape is (1599828, 16) 

-----------------------------


Statistics for Numerical Variables [NaNs are ignored]:


Unnamed: 0,attribute,mean,median,std,iqr,min,max,range,skew,kurtosis
0,purchase_id,"12.445.456,601","12.468.487,500","917.581,737","1.579.356,500","1.663.958,000","14.357.203,000","12.693.245,000",-90,-756
1,product_id,"148.595,814","154.310,000","55.543,152","81.796,000",4000,"319.129,000","319.125,000",-482,-702
2,affiliate_id,"2.297.500,688","1.690.428,000","2.092.655,502","3.549.994,000",3000,"7.700.836,000","7.700.833,000",651,-823
3,producer_id,"2.164.479,522","1.377.289,000","2.038.959,782","3.366.648,000",3000,"9.868.481,000","9.868.478,000",724,-699
4,buyer_id,"5.187.551,341","5.999.153,500","2.199.255,869","3.216.124,250",60000,"12.014.792,000","12.014.732,000",-878,-492
5,purchase_value,0000,-0350,1000,0518,-0541,124561,125102,10817,629206
6,affiliate_commission_percentual,7595,0000,18476,0000,-1000,100000,101000,2259,3754
7,sell,1000,1000,0000,0000,1000,1000,0000,0,0
8,product_age_when_purchased,13416,10000,12933,17000,-6000,91000,97000,1277,1555


-----------------------------


dataframe.sample(5)


Unnamed: 0,purchase_id,product_id,affiliate_id,producer_id,buyer_id,purchase_date,product_creation_date,product_category,product_niche,purchase_value,affiliate_commission_percentual,purchase_device,purchase_origin,is_origin_page_social_network,sell,product_age_when_purchased
948881,12762646,218024,4328492,4328492,7115412,2016-04-24 01:23:33,2016-04-05 01:15:29,Podcast,Government,-0.4,0.0,Desktop,Origin 5187,True,1,0
407480,11668526,197735,3971196,3971196,6596833,2016-02-21 14:26:21,2016-01-15 20:04:41,Workshop,Presentation skills,-0.4,0.0,eReaders,Origin e499,True,1,1
1036827,12929735,210723,898929,898929,4199649,2016-05-03 00:49:03,2016-03-09 17:34:23,Phisical book,Presentation skills,0.6,0.0,eReaders,Origin cf02,True,1,2
1134631,13109731,224731,1770119,1770119,7283999,2016-05-11 23:05:22,2016-04-29 19:27:14,Phisical book,Online course creation,1.6,0.0,Cellphone,Origin 5187,True,1,1
18799,10877305,85986,2375948,34602,881614,2016-01-03 17:33:55,2014-02-17 14:25:28,Phisical book,Anxiety management,-0.5,0.0,Smart TV,Origin ef2b,True,1,23


*************************************************


## 4.2 Rows Filtering

*Remove rows with meaningless (or unimportant) data*

### purchase_value column

In [35]:
# According to business rule:
# - purchase_value: valor da compra. Esse dado, assim como nicho e categoria foi codificado para manter a  confidencialidade. O valor apresentado no dataset é o **z-score** do valor real;
# So it is fine to have negative values! ---> no need to filter rows!

### product_age_when_purchased column

In [36]:
# check negative product_age_when_purchased
df_filter[df_filter["product_age_when_purchased"] < 0]


Unnamed: 0,purchase_id,product_id,affiliate_id,producer_id,buyer_id,purchase_date,product_creation_date,product_category,product_niche,purchase_value,affiliate_commission_percentual,purchase_device,purchase_origin,is_origin_page_social_network,sell,product_age_when_purchased
1394257,13609042,319129,1738263,9868481,7049073,2016-06-08 19:41:27,2016-12-31 13:43:50,Phisical book,Negotiation,3.4,20.0,Desktop,Origin 5187,True,1,-6
1438238,13692956,319129,599274,9868481,4450488,2016-06-13 05:57:13,2016-12-31 13:43:50,Phisical book,Negotiation,3.5,30.0,Desktop,Origin 6c05,True,1,-6


In [37]:
# define shape before filtering data
shape_before = df_filter.shape

# it order to avoid misleading data (product was sold before being created)
# we will remove these rows
df_filter = df_filter[df_filter["product_age_when_purchased"] >= 0]

# sanity check
assert (
    df_filter.shape[0] == shape_before[0] - 2
) & (
    df_filter.shape[1] == shape_before[1]
), "Misleading rows filtering!"

## 4.3 - Columns Filtering

*Remove auxiliary columns or columns that won't be available in the prediction moment*

### sell column

In [38]:
# TO-DO
# print report
print(
    f"Unique values in colum 'sell': {set(df_filter['sell'].tolist())}"
)

Unique values in colum 'sell': {1}


In [39]:
# define shape before filtering data
shape_before = df_filter.shape

# column sell is a constant column ---> remove it
df_filter = df_filter.drop(columns=["sell"])

# sanity check
# sanity check
assert (
    df_filter.shape[0] == shape_before[0]
) & (
    df_filter.shape[1] == shape_before[1] - 1
), "Misleading columns filtering!"

# **5 - EXPLORATORY DATA ANALYSIS**

## 5.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_eda = df_filter.copy()

# check dataframe
check_dataframe( df_eda )

## 5.2 - Univariate Analysis

*Explore variables distributions*

In [None]:
# split dataset into types of features
df_eda_num = df_eda.select_dtypes(include=["number", "bool"])
df_eda_date = df_eda.select_dtypes(include=["datetime"])
df_eda_str = df_eda.select_dtypes(include=["object"])

# sanity check
assert df_eda_num.shape[1] + df_eda_date.shape[1] + df_eda_str.shape[1] == df_eda.shape[1], """Revise the previous split, something may be wrong!"""

### 5.2.1 - Numerical Columns

In [None]:
# define categorical figure path
numerical_fig_path = os.path.join(PROJECT_ROOT_PATH, "img", "data_exploration", "numerical_fatures_eda.png")

# plot numerical columns for base data
numerical_plot(
    dataframe=df_eda_num, 
    n_cols=3,
    hist=False,
    save_fig=numerical_fig_path
    )

### 5.2.2 - Categorical Columns

In [None]:
# define categorical figure path
categorical_fig_path = os.path.join(PROJECT_ROOT_PATH, "img", "data_exploration", "categorical_fatures_eda.png")

# plot categorical columns for base data
categorical_plot(
    dataframe=df_eda_str,
    max_num_cat=10,
    n_cols=3,
    trunc_label=20,
    save_fig=categorical_fig_path
    )

### 5.2.3 Datetime Columns

In [None]:
# define datetime figure path
datetime_fig_path = os.path.join(PROJECT_ROOT_PATH, "img", "data_exploration", "datetime_fatures_eda.png")

# plot datetime columns for base data
datetime_plot(
    dataframe=df_eda_date,
    n_cols=3,
    save_fig=datetime_fig_path
    )

## 5.3 - Bivariate Analysis

*Explore relationship between variables (in pairs)*

### 5.3.1 - Initial inspection

In [None]:
# plot pairplot
sns.pairplot( df_eda, diag_kind = "kde" );

### 5.3.2 - Numerical variables

In [None]:
# calculate pearson correlation coefficient
correlation = df_eda_ref.corr( method = 'spearman' )

# create figure and ax object
fig, ax = plt.subplots( figsize = (6, 6) )

# display heatmap of correlation on figure
sns.heatmap( correlation, annot = True, ax = ax)
plt.yticks( rotation = 0 );

### 5.3.3 - Categorical variables

In [None]:
# TO-DO ---> cramer-v heatmap

In [None]:
# create a dataframe with cramer-v for every row-column pair
cramer_v_corr = create_cramer_v_dataframe( multivar_cat_analysis )

# create figure and ax object
fig, ax = plt.subplots( figsize = (20, 20) )

# display heatmap of correlation on figure
sns.heatmap( cramer_v_corr, annot = True, ax = ax);

## 5.4 - Business Hypothesis

*Validate all business hypothesis based on available data*

### **H1. ..**

### **H2. ..**

### **H3. ..**

### **H4. ..**

### **H5. ..**

## 5.5 - Data Space Analysis

**Initial inspection on dimensionality reduction potential**

### PCA

In [None]:
# TO-DO

### UMAP

In [None]:
# TO-DO

### t-SNE

In [None]:
# TO-DO

### PHATE

In [None]:
# TO-DO

### Tree-Base Embedding

In [None]:
# TO-DO

### KMeans Embedding

In [None]:
# TO-DO

# **6 - DATA PREPARATION**

## 6.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_prep = df_eda.copy()

# check dataframe
check_dataframe( df_prep )

## 6.2 - Remove variables that won't be available in the production environment

*Remove variables that model can use on production to make predictions*

In [None]:
# TO-DO

## 6.3 - Train-Validation-Test split

*Split dataframe into training, validation and test dataset*

In [None]:
# TO-DO

## 6.4 - Scale numeric features

*Scale numeric feature to make modelling "easier" for ML models*

### 6.4.1 - Standard Scaler

In [None]:
# TO-DO

### 6.4.2 - Min-Max Scaler

In [None]:
# TO-DO

### 6.4.3 - Robust Scaler

In [None]:
# TO-DO

### 6.4.4 - Discretization

In [None]:
# TO-DO

## 6.5 - Encode categorical features

*Encode categorical feature to make modelling possible for ML models*

### 6.5.1 - One-Hot Encodingm

In [None]:
# TO-DO

### 6.5.2 - Ordinal Encoding

In [None]:
# TO-DO

### 6.5.3 - Target Encoding

In [None]:
# TO-DO

## 6.6 - Response variable transformation

*Transform target variable (e.g. log, sqrt, etc) to make modelling "easier" for ML models*

In [None]:
# TO-DO

## 6.7 - Cyclic variables transformation

*Transform cyclic variables (e.g. days of week, months in year, etc) with a sin and cos functions*

In [None]:
# TO-DO

## 6.8 - Double-check preparation

*Double-check the prepared dataset to make sure it is as expected*

In [None]:
# TO-DO

# **7 - FEATURE SELECTION**

## 7.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_f_selection = df_prep.copy()

# check dataframe
check_dataframe( df_f_selection )

## 7.2 - Logist regression coefficients

In [None]:
# TO-DO

## 7.3 - Random forest feature importance

In [None]:
# TO-DO

## 7.4 - Boruta algorithm

In [None]:
# TO-DO

## 7.5 - Mutual information

In [None]:
# TO-DO

# **8 - ML MODEL TRAINING**

## 8.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_train = df_f_selection.copy()

# check dataframe
check_dataframe( df_train )

## 8.2 - Metrics

*Define the metric of success and the health metrics*

In [None]:
# TO-DO

## 8.3 - Baseline model

*Check the performance metrics with a dummy model to get the baseline metric*

In [None]:
# TO-DO

## 8.4 - ML models

*Get performance metrics of ML model with cross-validation*

In [None]:
# TO-DO

## 8.5 - Final modelling comparison

*Compare all models and decide what one is the best (and will be fine-tuned)*

In [None]:
# TO-DO

# **9 - HYPERPARAMETER TUNNING**

## 9.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_tune = df_train.copy()

# check dataframe
check_dataframe( df_tune )

## 9.2 - Hypertune the best ML model

*Check the best hyperparams for the best ML model*

### 9.2.1 - Grid Search

In [None]:
# TO-DO

### 9.2.2 - Random Search

In [None]:
# TO-DO

### 9.2.3 - Bayesian Search

In [None]:
# TO-DO

## 9.3 - Define best hyperparameters

*Explicitly define best hyper parameters*

In [None]:
# TO-DO

# **10 - PERFORMANCE EVALUATION AND INTERPRETATION**

## 10.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_perform = df_tune.copy()

# check dataframe
check_dataframe( df_perform )

## 10.2 - Training Performance

*Get final model performance on training data*

In [None]:
# TO-DO

## 10.3 - Generalization performance

### 10.3.1 - Final model training

*Get final model performance on validation data*

In [None]:
# TO-DO

### 10.3.2 - Error analysis

*Perform error analysis on final model to make sure it is ready for production*

In [None]:
# TO-DO

## 10.4 - Define prodution model

*Train ML on "training + validation" data*

In [None]:
# TO-DO

## 10.5 - Testing performance

*Get production model performance on testing data*

In [None]:
# TO-DO

## 10.6 - Business performance

*Translate testing performance into business results*

In [None]:
# TO-DO

# **11 - DEPLOYMENT**

![Deployment architecture](../img/....jpg)

## 11.1 - API creation

*Code to create API for ML predictions*

In [None]:
# TO-DO

## 11.2 - Docker container

*Code to create a Docker container and deploy ML model*

In [None]:
# TO-DO