# Example use case

## (1) Run iDora EDA pipeline

Typical EDA workflow:
1. Load the dataset (csv or txt)
2. Create summary statistics for your dataset
3. Create visuals for your dataset (bar plots and boxplots)
4. Transform the dataset (one-hot encoding, label encoding and variable removal)
5. Produce feature importance measures and rank the features
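
Step 5 can be illustrated with a small, library-free sketch. It assumes the importance measure is mutual information between each discrete feature and the target (MI scores are mentioned further down in this walkthrough). The helper names `mutual_information` and `rank_features` are hypothetical and are not part of iDora, and the toy data is invented.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(columns, target_values):
    """Return (feature name, MI score) pairs, highest score first."""
    scores = {name: mutual_information(values, target_values)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only
toy_target = [0, 0, 1, 1]
toy_columns = {
    "copies_target": [0, 0, 1, 1],  # fully informative about the target
    "unrelated": [0, 1, 0, 1],      # independent of the target
}
ranking = rank_features(toy_columns, toy_target)
print(ranking)
```

A feature that determines the target gets the highest score, while an independent feature scores zero, which is the ordering a ranking step relies on.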

Two ways of working with iDora:
- Use the iDora.main_run file to create a pipeline from the pre-selected functions (comment out the functions you don't need)
- Call the functions from iDora one by one

#### Raw dataset load

In [1]:
# Some imports
import os
import pandas as pd
import numpy as np
import sys
import time

In [8]:
# Where is your data? Point path_to_data and file_name at it,
# OR use one of the test datasets below

# # ------ Bank -------
path_to_data = r'C:\Users\pnl0516p\Documents\PyScripts\iDora\idora\test_data\Bank Marketing'
file_name = r'balanced_bank.csv'
target = 'y'  # identify only if already known

# #  ------ Housing Prices -------
# path_to_data = r'C:\Users\pnl0516p\Documents\PyScripts\iDora\idora\test_data\HousingPricesKC'
# file_name = r'kc_house_data.csv'
# target = 'price'

# # ------ Rossman -------
# path_to_data = r'C:\Users\pnl0516p\Documents\PyScripts\iDora\idora\test_data\Rossman'
# file_name = r'train.csv'
# target = 'Sales' 

# Import iDora_main as idora (the tool is referred to as idora from here on)
os.chdir(r'C:/Users/pnl0516p/Documents/PyScripts/iDoraModel/idora')
import iDora_main as idora
cwd = os.getcwd()
os.chdir(path_to_data)
# Read the raw data into a dataframe
df_raw = pd.read_csv(file_name)
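
The cell above changes the working directory twice, once to import iDora and once to read the file. A lighter alternative, sketched here under the assumption that only the full file path is needed, is to build the path explicitly. `PureWindowsPath` is used so that Windows-style folders like the ones in this walkthrough join the same way on any operating system. The helper `build_data_path` and the folder in the example are hypothetical.

```python
from pathlib import PureWindowsPath


def build_data_path(folder, file_name):
    """Join a data folder and a file name without touching the working directory."""
    return PureWindowsPath(folder) / file_name


# Placeholder folder, for illustration only
example_path = build_data_path(r"C:\data\Bank Marketing", "balanced_bank.csv")
print(example_path)
```

On a Windows machine the result can be passed on as `pd.read_csv(str(example_path))`; for code that should follow the host operating system, `pathlib.Path` would be the usual choice.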

#### EDA pipeline
Load the data with iDora and start exploring with the following EDA pipeline:

* Exploration (should always be your first step):
 - _ = explore( df ): get to know what columns you have
 - summary = summarize( df ): produce a summary table for all numerical variables

* Plotting (create and save boxplots and bar plots for numerical variables)
 - settings.boxplots_path = make_boxplots( summary, df=df )
 - settings.barplots_path = make_barplots( summary, df=df )

* Produce HTML (enrich the summary table with plots)
 - df_summary_plots = produce_df_plot( summary )
 - html_example = display_df( df_summary_plots,['Count','Unique','Mean'] )

* Transformation (transform the data so it can be modelled and yields more accurate MI scores)
 - df_sparse = onehot_encode( settings.cat_vars, df )

* Variable removal
 - df_new = remove_vars( df_sparse ) 

* Saving the output
 - save_html( html_example )
 - save_df( df_new )
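
To make the exploration and transformation steps concrete, here is a minimal pandas-only sketch of what a summary table and a one-hot encoding look like. It is not iDora's implementation: the toy dataframe is invented for illustration, the labels `Count`, `Unique` and `Mean` simply mirror the ones passed to `display_df` above, and only standard pandas calls are used.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "age": [25, 40, 40, 58],
    "job": ["admin", "technician", "admin", "retired"],
})

# Summary table for the numerical columns
numeric = toy_df.select_dtypes(include="number")
summary = pd.DataFrame({
    "Count": numeric.count(),
    "Unique": numeric.nunique(),
    "Mean": numeric.mean(),
})

# One-hot encode the categorical column: one indicator column per category
toy_sparse = pd.get_dummies(toy_df, columns=["job"])

print(summary)
print(toy_sparse.columns.tolist())
```

Each category of `job` becomes its own indicator column, which is why the encoded frame is wider than the raw one.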


In [9]:
## Start the pipeline: load the dataframe from a file; if the file is a txt, pass the argument txt_flag=True
df_new = idora.eda_run( path_to_data + '\\' + file_name )

Your data has this dimensionality: (9280, 22) 

You have 0 missing values (cells) in your data. 

These are all the variables you have: ['Unnamed: 0', 'age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'] 

Tell me the ID column name, please.

Tell me the Date column name, please.

You have these variables with numerical values to put in a model:
['Unnamed: 0', 'age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'] 

You have these variables with object type (e.g. strings):
[] 

You might want to (one-hot) encode these variables as they have less than 100 unique numerical values
['age', 'campaign', 'pdays', 'previ
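
The encoding hint in the output above is based on the number of distinct values per column. A rough sketch of such a check, assuming the rule is simply "fewer than 100 unique values" as the message states, could look like this. The helper `low_cardinality_columns` is hypothetical and not an iDora function, and the toy data is invented.

```python
import pandas as pd


def low_cardinality_columns(df, max_unique=100):
    """Names of columns that have fewer than `max_unique` distinct values."""
    return [col for col in df.columns if df[col].nunique() < max_unique]


# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "row_id": range(200),           # 200 distinct values
    "campaign": [1, 2, 3, 4] * 50,  # 4 distinct values
})
print(low_cardinality_columns(toy_df))
```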

## (3) Automatic modelling

In [14]:
params = idora.set_params(n_classes=2,loss="binary:logistic")
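
What `idora.set_params` returns is not shown in this walkthrough. As a rough orientation, a parameter dictionary for a binary XGBoost classifier typically looks like the sketch below. The function `binary_xgb_params` is hypothetical. The keys are standard XGBoost parameter names, and the `error` metric matches the `train-error` and `eval-error` columns in the training log further down.

```python
def binary_xgb_params(max_depth=6, eta=0.3):
    """Hypothetical helper: a plain XGBoost parameter dict for binary classification."""
    return {
        "objective": "binary:logistic",  # outputs the probability of the positive class
        "eval_metric": "error",          # share of misclassified rows at a 0.5 threshold
        "max_depth": max_depth,          # maximum tree depth (XGBoost default: 6)
        "eta": eta,                      # learning rate (XGBoost default: 0.3)
    }


example_params = binary_xgb_params()
print(example_params)
```

With `binary:logistic` no class count is needed; XGBoost's `num_class` parameter only applies to the multiclass objectives such as `multi:softmax`.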

In [15]:
all_features = list(df_new.columns)  # every column, including the target
features = [col for col in all_features if col != target]  # model inputs only

In [16]:
XY_train, XY_test = idora.train_test(df_new)
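
How `idora.train_test` splits the data is not documented here. The sketch below shows one common approach, a seeded random hold-out split over row positions. The function name, the 80/20 ratio and the seed are assumptions.

```python
import random


def split_row_positions(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row positions with a fixed seed and split them into train and test."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = int(round(n_rows * test_fraction))
    return positions[n_test:], positions[:n_test]


train_pos, test_pos = split_row_positions(10)
print(len(train_pos), len(test_pos))
```

With a dataframe, the positions can be applied as `df_new.iloc[train_pos]` and `df_new.iloc[test_pos]`. Fixing the seed keeps the split reproducible between runs.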

In [17]:
model = idora.gmb_xgb_binary(XY_train,XY_test,params=params,all_features=all_features,target=target,features=features)

Train a XGBoost model
[0]	train-error:0.092777	eval-error:0.140067
Multiple eval metrics have been passed: 'eval-error' will be used for early stopping.

Will train until eval-error hasn't improved in 10 rounds.
[1]	train-error:0.088904	eval-error:0.122559
[2]	train-error:0.077959	eval-error:0.119192
[3]	train-error:0.075434	eval-error:0.119192
[4]	train-error:0.072066	eval-error:0.114478
[5]	train-error:0.06853	eval-error:0.119865
[6]	train-error:0.067015	eval-error:0.119192
[7]	train-error:0.064657	eval-error:0.119192
[8]	train-error:0.062974	eval-error:0.119192
[9]	train-error:0.059438	eval-error:0.112458
[10]	train-error:0.057754	eval-error:0.110438
[11]	train-error:0.056744	eval-error:0.113131
[12]	train-error:0.053881	eval-error:0.115152
[13]	train-error:0.052702	eval-error:0.114478
[14]	train-error:0.050682	eval-error:0.115152
[15]	train-error:0.049503	eval-error:0.114478
[16]	train-error:0.049672	eval-error:0.116498
[17]	train-error:0.047651	eval-error:0.116498
[18]	train-error
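
The early-stopping rule announced in the log ("Will train until eval-error hasn't improved in 10 rounds") can be sketched in plain Python. The function below illustrates the rule and is not XGBoost's own code. The error values are the `eval-error` entries for rounds 0 to 17, copied from the log above.

```python
def early_stopping(eval_errors, patience=10):
    """Return (best_round, stopped) for a sequence of per-round evaluation errors.

    Training stops once `patience` consecutive rounds fail to beat the best error.
    """
    best_round, best_error = 0, float("inf")
    for round_idx, error in enumerate(eval_errors):
        if error < best_error:
            best_round, best_error = round_idx, error
        elif round_idx - best_round >= patience:
            return best_round, True
    return best_round, False


eval_errors = [
    0.140067, 0.122559, 0.119192, 0.119192, 0.114478, 0.119865,
    0.119192, 0.119192, 0.119192, 0.112458, 0.110438, 0.113131,
    0.115152, 0.114478, 0.115152, 0.114478, 0.116498, 0.116498,
]
best_round, stopped = early_stopping(eval_errors)
print(best_round, stopped)
```

For the rounds shown, the lowest eval-error is reached at round 10, and the patience of 10 rounds has not yet run out, so the rule has not triggered within this part of the log.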