# **PROBLEM SOLVING DESIGN**

![Lean StartUp Feedback Loop](../img/project_structure/lean_startup_feedback_loop.jpg)

# **BUSINESS CONTEXT**

## **What is the company?**

Hotmart

## **What is its business model?**

Two-sided marketplace. It is a platform for buying, selling and promoting digital products in which Hotmart connects product creators/disseminators to their customers.

## **What is the company stage on the market?**

"Virality" (Lean Analytics) or "early majority" (Innovation Adoption Curve). The company found a pain in the market and validated a product that solves the pain; now is the time to increase the customer base.

# **BUSINESS PROBLEM**

## **What is the business problem the company is facing?**

The company wants to get insight based on customers' data in order to unveil new product opportunities, especially in terms of product success, customer segmentation, and revenue estimation.

## **What is the business solution that this project has to deliver?**

A presentation of storytelling insights based on the available data and, possibly, answers to the following questions:
- Does Hotmart depend on the biggest producers on the platform? That is, are the top-selling producers responsible for most of Hotmart's billing?
- Are there any relevant patterns or trends in the data?
- Is it possible to segment users based on their characteristics (revenue, product niche, etc.)?
- What features most impact the success of a product? That is, what makes a product sell more?
- Is it possible to estimate how much revenue Hotmart will generate in the three months following the last month shown in the dataset?

**References:**
- Case description
- https://hotmart.com/pt-br

# **SCOPE AND BUSINESS ASSUMPTIONS**

- **...**

- **...**


REFERENCES:
...

# **SOLUTION STRATEGY**

![IoT method](../img/project_structure/iot_method.png)*IOT (Input-Output-Tasks) is a planning strategy to structure a problem solution and make sure it delivers a solution that solves the initial problem.*

### INPUT

- **Business context**:
    - It is a platform for buying, selling and promoting digital products in which Hotmart connects product creators/disseminators to their customers.
    - In principle, Hotmart makes money by **charging** either the creators or the disseminators **a percentage of each customer purchase**.
- **Business problem**:
    - The company wants to get **insights** based on customers' data in order to **unveil new product opportunities**, especially in terms of product success, customer segmentation, and revenue estimation.
- **Business questions**:
    - Does **Hotmart depend** on the **biggest producers** on the platform? That is, are the **top-selling producers** responsible for **most** of Hotmart's **billing**?
    - Are there any **relevant patterns or trends** in the data?
    - Is it possible to **segment users** based on their characteristics (revenue, product niche, etc.)?
    - What **features most impact** the success of a **product**? That is, what makes a **product sell more**?
    - Is it possible to **estimate** how much **revenue** Hotmart will generate in the **three months following the last month** shown in the dataset?
- **Available data**:
    - Data referring to a **sample of purchases made** at Hotmart in 2016. These are more than 1.5 million records of purchases made on our **platform**.

### OUTPUT 

- A presentation of storytelling insights based on the available data and, possibly, answers to the previous questions.

### TASKS

- *QUESTION*:
    - Does **Hotmart depend** on the **biggest producers** on the platform? That is, are the **top-selling producers** responsible for **most** of Hotmart's **billing**?
        - Who are the biggest producers on the platform? How are they defined?
            - Assuming producers above the 95th percentile of volume of products sold.
        - What is the revenue difference between these producers and the remaining ones?
            - Compare revenues
        - What does it mean to be dependent on some producers?
            - Assuming a "Pareto rule" like: 80% of revenue comes from the top 5% of producers by sales

<br >
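
The concentration check described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `top_producer_share` and the toy data are hypothetical, and it measures the share of purchases (sales volume) rather than summed `purchase_value`, because the dataset stores `purchase_value` as a z-score and summing z-scores does not give revenue.

```python
import pandas as pd


def top_producer_share(purchases: pd.DataFrame, quantile: float = 0.95) -> float:
    """Share of all purchases that belong to producers above a sales-volume quantile.

    Assumes one row per purchase, with `producer_id` and `purchase_id` columns.
    """
    # number of purchases per producer
    sales_per_producer = purchases.groupby("producer_id")["purchase_id"].count()
    # producers strictly above the chosen percentile of sales volume
    threshold = sales_per_producer.quantile(quantile)
    top_producers = sales_per_producer[sales_per_producer > threshold]
    return top_producers.sum() / sales_per_producer.sum()


# toy data (hypothetical): producer 1 makes 8 of the 10 sales
toy = pd.DataFrame({
    "purchase_id": range(10),
    "producer_id": [1] * 8 + [2, 3],
})

# with only three producers, the median is used as the cut-off for illustration
share = top_producer_share(toy, quantile=0.5)
print(f"Top producers account for {share:.0%} of purchases")
```

On the real data the default `quantile=0.95` matches the 95th-percentile assumption stated in the task list; the resulting share can then be compared with the 80% figure of the Pareto-like rule.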

- *QUESTION*:
    - Are there any **relevant patterns or trends** in the data?
        - Check for features that show patterns in terms of customer/producer groups or revenue impact

<br >

- *QUESTION*:
    - Is it possible to **segment users** based on their characteristics (revenue, product niche, etc.)?
        - What is the purpose of segmenting customers?
            - Assuming purchase_value as the target variable
            - Check for features that can cluster customers/producers for a better revenue understanding
                - Initially try RFM (Recency-Frequency-Monetary)

<br >      
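
As a rough illustration of the RFM idea mentioned above, the sketch below builds one row per buyer. The function name `rfm_table`, the reference date and the toy values are assumptions made for this example; the column names follow the data fields description. Since `purchase_value` is a z-score in this dataset, the "monetary" column is only a relative indicator, not an amount in currency.

```python
import pandas as pd


def rfm_table(purchases: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Recency (days since last purchase), frequency (number of purchases)
    and monetary (summed purchase_value) for each buyer."""
    grouped = purchases.groupby("buyer_id")
    return pd.DataFrame({
        "recency_days": (reference_date - grouped["purchase_date"].max()).dt.days,
        "frequency": grouped["purchase_id"].count(),
        "monetary": grouped["purchase_value"].sum(),
    })


# toy data (hypothetical values)
toy = pd.DataFrame({
    "purchase_id": [1, 2, 3],
    "buyer_id": [10, 10, 20],
    "purchase_date": pd.to_datetime(["2016-01-01", "2016-01-10", "2016-01-05"]),
    "purchase_value": [0.5, 1.5, -0.2],
})

rfm = rfm_table(toy, reference_date=pd.Timestamp("2016-01-11"))
print(rfm)
```

The same table could be built per producer by grouping on `producer_id` instead of `buyer_id`.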

- *QUESTION*:
    - What **features most impact** the success of a **product**? That is, what makes a **product sell more**?
        - Success of a product = number of units sold
            - Inspect features with high correlation to the number of units sold
            - Check for simple causal inference techniques

<br >

- *QUESTION*:
    - Is it possible to **estimate** how much **revenue** Hotmart will generate in the **three months following the last month** shown in the dataset?
        - Check the revenue time series to understand how to extrapolate it into the future
            - Visual inspection
            - Check for trend, seasonality and noise
            - Define baseline (dummy = last available date)
                - Initially, ARIMA model
                - If possible, machine learning models
                - Check model error and extrapolate to business impact
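
The dummy baseline mentioned in the last task (repeat the last available value) can be sketched as below. This is only an illustration under stated assumptions: the function name `naive_forecast` and the toy series are hypothetical, and the input is assumed to be a monthly series indexed by month-start dates. Any ARIMA or machine learning model would then need to beat this baseline to be worth using.

```python
import pandas as pd


def naive_forecast(monthly: pd.Series, horizon: int = 3) -> pd.Series:
    """Repeat the last observed monthly value for the next `horizon` months.

    Assumes `monthly` is sorted and indexed by month-start timestamps.
    """
    # build the next `horizon` month-start dates after the last observed month
    future_index = pd.date_range(monthly.index[-1], periods=horizon + 1, freq="MS")[1:]
    return pd.Series(monthly.iloc[-1], index=future_index)


# toy monthly series (hypothetical values)
monthly = pd.Series(
    [100.0, 120.0, 130.0],
    index=pd.date_range("2016-04-01", periods=3, freq="MS"),
)

forecast = naive_forecast(monthly, horizon=3)
print(forecast)
```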

# **PRODUCT BUILDING ROADMAP**

![CRISP-DS Framework](../img/project_structure/crisp_ds.jpg)

---
---
---

# **0 - HELPERS**

## 0.1 - Libraries

*Import required libraries*

In [1]:
# don't cache libraries (especially project library)
%load_ext autoreload
%autoreload 2

In [2]:
# setup and environment
import os
from pathlib import Path

# data extraction
from sqlalchemy import create_engine

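# data manipulation
import numpy as np
import pandas as pd

# data visualization (plt and sns are used in the EDA section)
import matplotlib.pyplot as plt
import seaborn as sns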

# project library
from project_lib.initial_config import initial_settings
from project_lib.data_description import check_dataframe

## 0.2 - Functions

*Define functions that will be used on the project*

NOTE: Most functions made for this project are inside the project library. That is, **a package called "project_lib" was created to hold all functions that will be needed for this project.**


For further details, check the modules (the .py files) inside the "project_lib" package.

In [3]:
# # example of function created for this project
# help(check_dataframe)

## 0.3 - Setup

*Define basic configurations*

In [4]:
# initial setup of dataframes and plots
initial_settings(storytelling=False)

## 0.4 - Constants

*Define reusable constants*

In [5]:
# define the project root path that will be the "baseline" for all paths in the notebook
PROJECT_ROOT_PATH = Path.cwd().parent
PROJECT_ROOT_PATH

PosixPath('/home/ds-gustavo-cunha/Projects/hotmart_case')

In [6]:
# # variables to connect to data source
# HOST=os.environ["HOST"]
# PORT=os.environ["PORT"]
# USER=os.environ["USER"]
# PASSWORD=os.environ["PASSWORD"]
# SCHEMA=os.environ["SCHEMA"]
# TABLE=os.environ["TABLE"]

# **1 - DATA EXTRACTION**

## 1.1 - Entity Relationship Diagram

*Display the Entity-Relationship Diagram for a better understanding of the data*

In [7]:
# Not available -> datasets are already merged

## 1.2 - Data Fields Description

*Describe available data in regard to database information*


---

At Hotmart, there are three main personas that make up our business: producers, affiliates and buyers.
- Producers are people who create digital products on Hotmart, such as language courses, recipe e-books, audiobooks, software, among many other examples.
- Affiliates are people who promote producers' products in exchange for a commission on the sale, which varies from product to product and from affiliate to affiliate.
- Buyers are people who purchase one or more digital products.

A sale is made by an affiliate when someone clicks on an affiliate link. Affiliates usually promote these products on social networks, videos, ads, etc.

A sale is made by a producer when someone accesses the product directly, without an affiliate as intermediary. For example, people who follow Whindersson Nunes on YouTube and went to his official website to buy his product, or who clicked on the product link without an affiliate code.

---

---

During your assessment, you will analyze data from a sample of purchases made on Hotmart in 2016. There are more than 1.5 million records of purchases made on our platform. Below, we detail what each field means:
- **purchase_id**: identifier of the purchase on Hotmart;
- **product_id**: identifier of the product on Hotmart;
- **affiliate_id**: identifier of the affiliate on Hotmart;
- **producer_id**: identifier of the producer on Hotmart;
- **buyer_id**: identifier of the buyer on Hotmart;
- **purchase_date**: date and time when the purchase was made;
- **product_creation_date**: date and time when the product was created on Hotmart;
- **product_category**: category of the product on Hotmart. Examples: e-book, software, online course, e-tickets, etc.;
- **product_niche**: market niche the product belongs to. Examples: education, health and well-being, sexuality, etc.;
- **purchase_value**: value of the purchase. This field, like niche and category, was encoded to preserve confidentiality. The value shown in the dataset is the z-score of the actual value;
- **affiliate_commission_percentual**: percentage of commission the affiliate will receive from the purchase;
- **purchase_device**: type of device used at the time of purchase, such as Desktop, Mobile, Tablet, or Others;
- **purchase_origin**: address of the site the person came from before the purchase. For example, whether the person came from Facebook, YouTube, or even from another page on the product's official site;
- **is_origin_page_social_network**: indicates whether the purchase came from a Facebook, YouTube, Instagram, Pinterest, or Twitter URL.

---

---

Some business rules:
- When the purchase is made directly through the producer, that is, when no affiliate intermediates the purchase, the field affiliate_commission_percentual has value 0 and the field affiliate_id is equal to producer_id;
- In the field purchase_origin we only consider the host of the site. This means that, if a person came from www.meuproduto.com/promocoes, this field only returns the value www.meuproduto.com;

---
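
The first business rule above translates directly into a flag and a consistency check. The sketch below is an illustration only: the function name `flag_direct_sales` and the toy rows are hypothetical, while the column names come from the field descriptions.

```python
import pandas as pd


def flag_direct_sales(purchases: pd.DataFrame) -> pd.Series:
    """True where the producer sold directly (affiliate_id equals producer_id)."""
    return purchases["affiliate_id"] == purchases["producer_id"]


# toy data (hypothetical): first row is a direct sale, second goes through an affiliate
toy = pd.DataFrame({
    "affiliate_id": [1, 5],
    "producer_id": [1, 2],
    "affiliate_commission_percentual": [0.0, 30.0],
})

is_direct = flag_direct_sales(toy)

# rows that break the rule: direct sales should carry a commission of 0
violations = toy[is_direct & (toy["affiliate_commission_percentual"] != 0)]
print(f"{is_direct.sum()} direct sale(s), {len(violations)} rule violation(s)")
```

A check like this could be run during data validation to see how closely the dataset follows the stated rule.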

## 1.3 - Data Loading

*Load data from required files*

In [8]:
# # define connection "endpoint"
# db_connection_str = f'mysql+pymysql://{USER}:{PASSWORD}@{HOST}/{SCHEMA}'
# # create an engine to connect to database
# db_connection = create_engine(db_connection_str)

# # define query to get data
# query=f"""
# SELECT *
# FROM {TABLE}
# """

# # read all data from database
# df_sql = pd.read_sql(sql=query, con=db_connection)
# df_sql

In [9]:
# # save data to parquet so as to not overload database server unnecessarily
# df_sql.to_parquet(
#     path=os.path.join(PROJECT_ROOT_PATH, "data", "raw_data", "customer_data.parquet")
# )

In [10]:
# read data from local source
df_extraction = pd.read_parquet(
    path=os.path.join(PROJECT_ROOT_PATH, "data", "raw_data", "customer_data.parquet")
)

# inspect results
df_extraction.sample(5)

Unnamed: 0,purchase_id,product_id,affiliate_id,producer_id,buyer_id,purchase_date,product_creation_date,product_category,product_niche,purchase_value,affiliate_commission_percentual,purchase_device,purchase_origin,is_origin_page_social_network,Venda
287950,11422946,75112,1710129,1710129,6478103,2016-02-06 00:10:19,2013-10-30 11:16:08,Phisical book,YouTube video creation,0.135,0.0,Desktop,Origin f01f,0,1
280310,11407415,152793,1466995,1466995,6471288,2016-02-05 11:09:33,2015-05-22 16:59:28,Podcast,Government,-0.485,0.0,eReaders,Origin adf0,0,1
442964,11742887,143650,233684,233684,4980693,2016-02-26 09:36:55,2015-03-27 13:41:05,Phisical book,Procrastination,-0.148,0.0,eReaders,Origin 386a,0,1
465287,11789366,88450,289832,289832,5245069,2016-02-28 17:40:06,2014-02-28 23:43:21,Phisical book,Physics,-0.485,0.0,Smart TV,Origin ef2b,0,1
51001,10943595,120214,4181,4181,717475,2016-01-08 03:29:03,2014-10-17 15:32:24,Phisical book,Negotiation,0.341,0.0,Desktop,Origin 71cd,0,1


# **2 - DATA DESCRIPTION**

## 2.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [11]:
# create a restore point of the previous section
df_description = df_extraction.copy()

# check dataframe for this new section
check_dataframe( dataframe=df_description, summary_stats=True, head=True )

*************************************************
Dataframe size in memory: 660.704 MB 

-----------------------------
Dataframe overview:


Unnamed: 0,Num NAs,Percent NAs,Num unique [include NAs],Data Type
purchase_id,0,0,1.599.828,int64
product_id,0,0,17.883,int64
affiliate_id,0,0,22.947,int64
producer_id,0,0,8.020,int64
buyer_id,0,0,1.100.649,int64
purchase_date,0,0,1.488.964,datetime64[ns]
product_creation_date,0,0,17.879,datetime64[ns]
product_category,0,0,10,object
product_niche,0,0,25,object
purchase_value,0,0,32.617,float64


-----------------------------

 Dataframe shape is (1599828, 15) 

-----------------------------


Statistics for Numerical Variables [NaNs are ignored]:


Unnamed: 0,attribute,mean,median,std,iqr,min,max,range,skew,kurtosis
0,purchase_id,"12.445.456,601","12.468.487,500","917.581,737","1.579.356,500","1.663.958,000","14.357.203,000","12.693.245,000",-90,-756
1,product_id,"148.595,814","154.310,000","55.543,152","81.796,000",4000,"319.129,000","319.125,000",-482,-702
2,affiliate_id,"2.297.500,688","1.690.428,000","2.092.655,502","3.549.994,000",3000,"7.700.836,000","7.700.833,000",651,-823
3,producer_id,"2.164.479,522","1.377.289,000","2.038.959,782","3.366.648,000",3000,"9.868.481,000","9.868.478,000",724,-699
4,buyer_id,"5.187.551,341","5.999.153,500","2.199.255,869","3.216.124,250",60000,"12.014.792,000","12.014.732,000",-878,-492
5,purchase_value,0000,-0350,1000,0518,-0541,124561,125102,10817,629206
6,affiliate_commission_percentual,7596,0000,18477,0000,0000,100000,100000,2259,3753
7,Venda,1000,1000,0000,0000,1000,1000,0000,0,0


-----------------------------


dataframe.head(5)


Unnamed: 0,purchase_id,product_id,affiliate_id,producer_id,buyer_id,purchase_date,product_creation_date,product_category,product_niche,purchase_value,affiliate_commission_percentual,purchase_device,purchase_origin,is_origin_page_social_network,Venda
0,1663958,6640,209372,116238,1200397,2016-06-26 12:00:00,2011-03-19 15:47:36,Video,Presentation skills,-0.3,,Smart TV,Origin ef2b,0,1
1,1677087,2350,141418,2821,1083764,2016-06-26 12:00:00,2010-07-05 01:50:15,Podcast,Child psychology,-0.2,,Smart TV,Origin ef2b,0,1
2,2017360,35669,618642,618642,1436106,2016-06-26 12:00:00,2012-06-13 02:59:37,Podcast,Presentation skills,-0.5,,Smart TV,Origin ef2b,0,1
3,2017379,57998,1164511,70388,1436118,2016-06-26 12:00:00,2013-05-07 08:51:31,Podcast,Anxiety management,-0.4,,Smart TV,Origin ef2b,0,1
4,2017382,58329,1261488,221253,1386357,2016-06-26 12:00:00,2013-05-12 08:12:06,Podcast,Teaching English,-0.5,,Smart TV,Origin ef2b,0,1


*************************************************


## 2.2 - Rename Columns

*Search for misleading or error-prone column names*

In [None]:
# TO-DO

## 2.3 - Check Data Dimensions

*Check dataframe dimensions to know whether pandas is enough to handle this data size or whether Big Data tools like Spark are needed*

In [None]:
# check number of rows and columns
print( f'\
Dataframe has {df_description.shape[0]:,} \
rows and {df_description.shape[1]} columns' )

## 2.4 - Data Types

*Check if the data types in the dataframe make sense according to database information*

In [None]:
# inspect dataframe types
inspect_dtypes(df_description, 15)

## 2.5 - Data Validation

*Check if columns make sense in regard to business understanding*

In [None]:
# instantiate data validator object
dv = DataValidator(
    dataframe=df_description, 
    col_funcions_checker=col_funcions_checker,
    dataframe_granularity=df_grain,
    col_aggregations_checker=col_aggregations_checker,
    # pandas_queries = pandas_queries,
    # records_file=validation_file
)

# validate data
dv.validate_data()

# # plot historical validations
# dv.plot_historical_report(records_file=validation_file, save_report=False)

## 2.6 - Check Duplicated Rows

*Inspect duplicated rows and handle them properly*

In [None]:
# check duplicated rows
print(
    f'{"*"*49}\n\n'
    f'There are {df_description.duplicated(keep=False).sum():,} \
duplicated rows [{df_description.duplicated(keep=False).mean()*100:.2f}%] based on all columns. \
Duplicated rows are double counted.'
    f'\n\n{"*"*49}\n\n'
    f'There are {df_description.duplicated(subset=df_grain, keep=False).sum():,} duplicated rows [{df_description.duplicated(subset=df_grain, keep=False).mean()*100:.2f}%] based on table granularity. \
Duplicated rows are double counted.'
    f'\n\n{"*"*49}'
)

## 2.7 - Check Missing Values

*Inspect the number and percentage of missing values per column to decide what to do with them*

In [None]:
#  get number of NA, percent of NA, number of unique and column type
check_na_unique_dtypes(df_description);

## 2.8 - Handle Missing Values

*Handle missing values in each column*

In [None]:
# TO-DO

## 2.9 - Descriptive Statistics

*Inspect some summary statistics for numerical columns*

In [None]:
# split dataset into types of features
df_number = df_description.select_dtypes(include=["number", "bool"])
df_date = df_description.select_dtypes(include=["datetime"])
df_string = df_description.select_dtypes(include=["object"])

# sanity check
assert df_number.shape[1] + df_date.shape[1] + df_string.shape[1] == df_description.shape[1], """Revise the previous split, something may be wrong!"""

### 2.9.1 - Numerical Variables

*Inspect numerical variables*

In [None]:
# check summary statistics
summary_statistics(df_number)

### 2.9.2 - Categorical Variables

*Inspect categorical variables*

In [None]:
# check overview of categorical features
categorical_summary(df_string)

### 2.9.3 - Datetime Variables

*Inspect datetime variables*

In [None]:
# check an overview of datetime features
datetime_summary(df_date)

### 2.9.4 - Investigate further:

*Variables whose real meaning needs further inspection*

In [None]:
# TO-DO

# **3 - FEATURE ENGINEERING**

## 3.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_f_eng = df_description.copy()

# check dataframe
check_dataframe( df_f_eng )

## 3.2 - Hypothesis Testing List

*Define the list of hypotheses that will be validated during Exploratory Data Analysis (EDA)*

**HYPOTHESIS MIND MAP**

![Business hypothesis mindmap](../img/project_structure/xxx.jpg)

*The above image is the product of a brainstorm that took into consideration many different variables that can impact the main business metric. This mind map is a great help when trying to raise hypotheses that could lead to insights. It is also helpful to guide feature engineering (create new relevant features) and when there is a need to look for more data elsewhere.*

> *Taking into consideration the hypothesis mind map (at the beginning of this notebook) and the data available in the dataset:*

H1. **...**

H2. **...**

H3. **...**

H4. **...**

H5. **...**

## 3.3 - Feature Creation

*Create new features (columns) that can be meaningful for EDA and, especially, machine learning modelling.*

In [None]:
# TO-DO

# **4 - DATA FILTERING**

## 4.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_filter = df_f_eng.copy()

# check dataframe
check_dataframe( df_filter )

## 4.2 - Rows Filtering

*Remove rows with meaningless (or unimportant) data*

In [None]:
# TO-DO

## 4.3 - Columns Filtering

*Remove auxiliary columns or columns that won't be available at prediction time*

In [None]:
# TO-DO

Some columns may be removed in the beginning of DATA DESCRIPTION SECTION.

No more columns will be removed so far.

# **5 - EXPLORATORY DATA ANALYSIS**

## 5.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_eda = df_filter.copy()

# check dataframe
check_dataframe( df_eda )

## 5.2 - Univariate Analysis

*Explore variables distributions*

In [None]:
# split dataset into types of features
df_eda_num = df_eda.select_dtypes(include=["number", "bool"])
df_eda_date = df_eda.select_dtypes(include=["datetime"])
df_eda_str = df_eda.select_dtypes(include=["object"])

# sanity check
assert df_eda_num.shape[1] + df_eda_date.shape[1] + df_eda_str.shape[1] == df_eda.shape[1], """Revise the previous split, something may be wrong!"""

### 5.2.1 - Numerical Columns

In [None]:
# plot numerical columns for base data
numerical_plot(df_eda_num, hist=False)

### 5.2.2 - Categorical Columns

In [None]:
# plot categorical columns for base data
categorical_plot(df_eda_str)

### 5.2.3 - Datetime Columns

In [None]:
# plot datetime columns for base data
datetime_plot(df_eda_date)

## 5.3 - Bivariate Analysis

*Explore relationship between variables (in pairs)*

### 5.3.1 - Initial inspection

In [None]:
# plot pairplot
sns.pairplot( df_eda, diag_kind = "kde" );

### 5.3.2 - Numerical variables

In [None]:
# calculate Spearman correlation coefficient between numerical columns
correlation = df_eda_num.corr( method = 'spearman' )

# create figure and ax object
fig, ax = plt.subplots( figsize = (6, 6) )

# display heatmap of correlation on figure
sns.heatmap( correlation, annot = True, ax = ax)
plt.yticks( rotation = 0 );

### 5.3.3 - Categorical variables

In [None]:
# TO-DO ---> cramer-v heatmap

In [None]:
# create a dataframe with cramer-v for every row-column pair
cramer_v_corr = create_cramer_v_dataframe( multivar_cat_analysis )

# create figure and ax object
fig, ax = plt.subplots( figsize = (20, 20) )

# display heatmap of correlation on figure
sns.heatmap( cramer_v_corr, annot = True, ax = ax);
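
The helper `create_cramer_v_dataframe` is not shown in this notebook. As a rough idea of the statistic behind such a heatmap, the sketch below computes Cramér's V for a single pair of categorical columns. The function name `cramers_v` and the toy series are assumptions for illustration, and no bias correction is applied.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V (without bias correction) between two categorical series."""
    # contingency table of observed counts
    observed = pd.crosstab(x, y).to_numpy()
    n = observed.sum()
    # expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # V is normalized by the smaller table dimension minus one
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0


# toy data (hypothetical): `b` is fully determined by `a`, `c` is independent of `a`
a = pd.Series(["x", "x", "y", "y"])
b = pd.Series(["p", "p", "q", "q"])
c = pd.Series(["p", "q", "p", "q"])

print(cramers_v(a, b), cramers_v(a, c))
```

Applying the function to every pair of categorical columns gives the square matrix that the heatmap displays.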

## 5.4 - Business Hypothesis

*Validate all business hypotheses based on available data*

### **H1. ..**

### **H2. ..**

### **H3. ..**

### **H4. ..**

### **H5. ..**

## 5.5 - Data Space Analysis

**Initial inspection on dimensionality reduction potential**

### PCA

In [None]:
# TO-DO
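
One quick way to inspect dimensionality reduction potential with PCA is to look at how much variance each principal component explains. The sketch below does this with plain NumPy; the function name `explained_variance_ratio` and the toy matrix are assumptions for illustration, and features are assumed to be numeric and already on comparable scales.

```python
import numpy as np


def explained_variance_ratio(X: np.ndarray) -> np.ndarray:
    """Fraction of total variance explained by each principal component."""
    # center each column, then take singular values of the centered matrix
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # squared singular values are proportional to the component variances
    variances = singular_values ** 2
    return variances / variances.sum()


# toy data (hypothetical): the second column is exactly twice the first,
# so a single component captures all the variance
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

ratio = explained_variance_ratio(X)
print(ratio)
```

If the first few components carry most of the variance, a low-dimensional projection is likely to preserve the structure of the data.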

### UMAP

In [None]:
# TO-DO

### t-SNE

In [None]:
# TO-DO

### PHATE

In [None]:
# TO-DO

### Tree-Based Embedding

In [None]:
# TO-DO

### KMeans Embedding

In [None]:
# TO-DO

# **6 - DATA PREPARATION**

## 6.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_prep = df_eda.copy()

# check dataframe
check_dataframe( df_prep )

## 6.2 - Remove variables that won't be available in the production environment

*Remove variables that the model cannot use in production to make predictions*

In [None]:
# TO-DO

## 6.3 - Train-Validation-Test split

*Split dataframe into training, validation and test dataset*

In [None]:
# TO-DO
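
One possible way to fill this step, assuming the modelling task is time-dependent (as the revenue forecasting question suggests), is a chronological split, so that validation and test rows are always later than training rows. The function name `chronological_split`, the split fractions and the toy data are hypothetical choices for this sketch.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, date_col: str,
                        train_frac: float = 0.7, valid_frac: float = 0.15):
    """Split rows into train/validation/test by time order (oldest rows first)."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    # round() avoids off-by-one sizes caused by floating point products
    train_end = round(len(ordered) * train_frac)
    valid_end = train_end + round(len(ordered) * valid_frac)
    return ordered.iloc[:train_end], ordered.iloc[train_end:valid_end], ordered.iloc[valid_end:]


# toy data (hypothetical): 20 daily purchases in reverse order
toy = pd.DataFrame({
    "purchase_date": pd.date_range("2016-01-01", periods=20, freq="D")[::-1],
    "purchase_value": range(20),
})

train, valid, test = chronological_split(toy, "purchase_date")
print(len(train), len(valid), len(test))
```

For tasks without a time component, a random split would be the more common choice.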

## 6.4 - Scale numeric features

*Scale numeric features to make modelling "easier" for ML models*

### 6.4.1 - Standard Scaler

In [None]:
# TO-DO

### 6.4.2 - Min-Max Scaler

In [None]:
# TO-DO

### 6.4.3 - Robust Scaler

In [None]:
# TO-DO
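
The three scalers listed above (standard, min-max and robust) differ only in which statistics they subtract and divide by. The sketch below writes them out with pandas to make the formulas explicit; the function name `fit_scalers` and the toy series are assumptions for illustration. The statistics are taken from the training data only, so that validation and test data do not leak into the transformation.

```python
import pandas as pd


def fit_scalers(train: pd.Series) -> dict:
    """Return scaling functions whose statistics come from the training data only."""
    mean, std = train.mean(), train.std(ddof=0)
    low, high = train.min(), train.max()
    median = train.median()
    iqr = train.quantile(0.75) - train.quantile(0.25)
    return {
        "standard": lambda s: (s - mean) / std,         # zero mean, unit variance
        "min_max": lambda s: (s - low) / (high - low),  # maps the train range to [0, 1]
        "robust": lambda s: (s - median) / iqr,         # less sensitive to outliers
    }


# toy data (hypothetical values)
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
scalers = fit_scalers(train)

print(scalers["min_max"](train).tolist())
print(scalers["robust"](train).tolist())
```

The robust variant is often preferred for heavily skewed columns, since the median and interquartile range are barely affected by extreme values.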

### 6.4.4 - Discretization

In [None]:
# TO-DO

## 6.5 - Encode categorical features

*Encode categorical features to make modelling possible for ML models*

### 6.5.1 - One-Hot Encoding

In [None]:
# TO-DO

### 6.5.2 - Ordinal Encoding

In [None]:
# TO-DO

### 6.5.3 - Target Encoding

In [None]:
# TO-DO

## 6.6 - Response variable transformation

*Transform target variable (e.g. log, sqrt, etc) to make modelling "easier" for ML models*

In [None]:
# TO-DO
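
As an example of such a transformation, the sketch below applies `log1p` to a skewed target and inverts it with `expm1` to get predictions back on the original scale. The toy values are hypothetical, and the sketch assumes a non-negative target; a z-scored column such as `purchase_value` contains negative values, so it would need a shift or a different transform.

```python
import numpy as np
import pandas as pd

# toy target (hypothetical, non-negative and right-skewed)
target = pd.Series([0.0, 9.0, 99.0])

# log(1 + y) compresses large values and is defined at y = 0
transformed = np.log1p(target)

# exp(y) - 1 is the exact inverse, used to map predictions back to the original scale
restored = np.expm1(transformed)

print(transformed.round(3).tolist())
```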

## 6.7 - Cyclic variables transformation

*Transform cyclic variables (e.g. days of week, months in year, etc.) with sine and cosine functions*

In [None]:
# TO-DO
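
The sine/cosine transformation places each value on a circle, so that the last value of the cycle ends up next to the first one. A minimal sketch, assuming a day-of-week column coded 0 to 6 (the function name `encode_cyclic` and the column name are hypothetical):

```python
import numpy as np
import pandas as pd


def encode_cyclic(values: pd.Series, period: int) -> pd.DataFrame:
    """Map a cyclic variable to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({
        f"{values.name}_sin": np.sin(angle),
        f"{values.name}_cos": np.cos(angle),
    })


# toy data (hypothetical): day 0, day 1 and day 6 of a 7-day week
day_of_week = pd.Series([0, 1, 6], name="day_of_week")
encoded = encode_cyclic(day_of_week, period=7)

# after encoding, day 6 is as close to day 0 as day 1 is
dist_0_1 = np.linalg.norm(encoded.iloc[0] - encoded.iloc[1])
dist_0_6 = np.linalg.norm(encoded.iloc[0] - encoded.iloc[2])
print(dist_0_1, dist_0_6)
```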

## 6.8 - Double-check preparation

*Double-check the prepared dataset to make sure it is as expected*

In [None]:
# TO-DO

# **7 - FEATURE SELECTION**

## 7.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_f_selection = df_prep.copy()

# check dataframe
check_dataframe( df_f_selection )

## 7.2 - Logistic regression coefficients

In [None]:
# TO-DO

## 7.3 - Random forest feature importance

In [None]:
# TO-DO

## 7.4 - Boruta algorithm

In [None]:
# TO-DO

## 7.5 - Mutual information

In [None]:
# TO-DO

# **8 - ML MODEL TRAINING**

## 8.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_train = df_f_selection.copy()

# check dataframe
check_dataframe( df_train )

## 8.2 - Metrics

*Define the metric of success and the health metrics*

In [None]:
# TO-DO

## 8.3 - Baseline model

*Check the performance metrics with a dummy model to get the baseline metric*

In [None]:
# TO-DO

## 8.4 - ML models

*Get performance metrics of ML model with cross-validation*

In [None]:
# TO-DO

## 8.5 - Final modelling comparison

*Compare all models and decide which one is the best (and will be fine-tuned)*

In [None]:
# TO-DO

# **9 - HYPERPARAMETER TUNING**

## 9.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_tune = df_train.copy()

# check dataframe
check_dataframe( df_tune )

## 9.2 - Hypertune the best ML model

*Check the best hyperparams for the best ML model*

### 9.2.1 - Grid Search

In [None]:
# TO-DO

### 9.2.2 - Random Search

In [None]:
# TO-DO

### 9.2.3 - Bayesian Search

In [None]:
# TO-DO

## 9.3 - Define best hyperparameters

*Explicitly define the best hyperparameters*

In [None]:
# TO-DO

# **10 - PERFORMANCE EVALUATION AND INTERPRETATION**

## 10.1 - Restore Point

*Create a checkpoint of the last dataframe from previous section*

In [None]:
# create a restore point for the previous section dataframe
df_perform = df_tune.copy()

# check dataframe
check_dataframe( df_perform )

## 10.2 - Training Performance

*Get final model performance on training data*

In [None]:
# TO-DO

## 10.3 - Generalization performance

### 10.3.1 - Final model training

*Get final model performance on validation data*

In [None]:
# TO-DO

### 10.3.2 - Error analysis

*Perform error analysis on final model to make sure it is ready for production*

In [None]:
# TO-DO

## 10.4 - Define production model

*Train ML on "training + validation" data*

In [None]:
# TO-DO

## 10.5 - Testing performance

*Get production model performance on testing data*

In [None]:
# TO-DO

## 10.6 - Business performance

*Translate testing performance into business results*

In [None]:
# TO-DO

# **11 - DEPLOYMENT**

![Deployment architecture](../img/....jpg)

## 11.1 - API creation

*Code to create API for ML predictions*

In [None]:
# TO-DO

## 11.2 - Docker container

*Code to create a Docker container and deploy ML model*

In [None]:
# TO-DO