# Advanced/optional features of `do_datasciencing`

The function `do_datasciencing` has several optional parameters with which you can customize the training. We will explore them in detail in this tutorial.

Note: We import `datasciencefunctions` as `ds`, and in the rest of this tutorial we sometimes refer to the package as `ds` when discussing its modules and their functions.

In [0]:
import datasciencefunctions as ds
import datasciencefunctions.classification as dsclass

## 0. Load and prepare data

* We will load the adult example dataset that ships with Databricks.
* It contains categorical, ordinal and numeric (continuous) predictors representing demographic information about US adults, and a target denoting whether their income exceeded USD 50 000.
* We want to train a model to predict whether a person's income exceeds USD 50 000.

You can read more details in the README printed below.

In [0]:
with open("/dbfs/databricks-datasets/adult/README.md") as f:
    readme = f.read()

print(readme)

In [0]:
schema = """
  age DOUBLE,
  workclass STRING,
  fnlwgt DOUBLE,
  education STRING,
  education_num DOUBLE,
  marital_status STRING,
  occupation STRING,
  relationship STRING,
  race STRING,
  sex STRING,
  capital_gain DOUBLE,
  capital_loss DOUBLE,
  hours_per_week DOUBLE,
  native_country STRING,
  income STRING
"""

df_adult = (
    spark
    .read
    .format("csv")
    .schema(schema)
    .option("header", True)
    .option("path", "dbfs:/databricks-datasets/adult/adult.data")
    .load()
    .sample(fraction=0.35) # only take a sample of the dataset for tutorial purposes
)

df_adult.printSchema()

When running binary classification, datasciencefunctions expects you to specify a label (or target) column with values 1 and 0.

We will create a classification target column called "income_above_50K" with value 1 if the person's income exceeds USD 50K and 0 otherwise.

In [0]:
import pyspark.sql.functions as F

df_adult_ml = (
    df_adult
    .withColumn(
        "income_above_50K", 
        F.when(F.col("income")==" >50K", 1).otherwise(0)
    )
    # we drop the income column because it is perfectly correlated with our label
    .drop("income")
)
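
The comparison value `" >50K"` starts with a space on purpose. Here is a small plain-Python sketch of why, assuming the raw file separates fields with a comma followed by a space (which the comparison in the cell above suggests). The sample line is a shortened, made-up record, not a row copied from the dataset.

```python
# Hypothetical, shortened record in the style of the raw file: fields are
# separated by ", " (comma + space), so a plain comma split keeps the space.
raw_line = "52, Self-emp-not-inc, HS-grad, >50K"

fields = raw_line.split(",")
income = fields[-1]

# The leading space is part of the value, so only " >50K" matches exactly.
label_exact = 1 if income == " >50K" else 0
label_naive = 1 if income == ">50K" else 0

# Stripping whitespace first makes the comparison robust either way.
label_stripped = 1 if income.strip() == ">50K" else 0

print(repr(income), label_exact, label_naive, label_stripped)
```

In the Spark cell, comparing `F.trim(F.col("income"))` with `">50K"` would be the counterpart of the stripped variant.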

In [0]:
display(df_adult_ml.limit(10))

age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_above_50K
50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,0
38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,0
28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,0
52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,1
31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,1
32.0,Private,186824.0,HS-grad,9.0,Never-married,Machine-op-inspct,Unmarried,White,Male,0.0,0.0,40.0,United-States,0
43.0,Self-emp-not-inc,292175.0,Masters,14.0,Divorced,Exec-managerial,Unmarried,White,Female,0.0,0.0,45.0,United-States,1
35.0,Federal-gov,76845.0,9th,5.0,Married-civ-spouse,Farming-fishing,Husband,Black,Male,0.0,0.0,40.0,United-States,0
54.0,?,180211.0,Some-college,10.0,Married-civ-spouse,?,Husband,Asian-Pac-Islander,Male,0.0,0.0,60.0,South,1
49.0,Private,193366.0,HS-grad,9.0,Married-civ-spouse,Craft-repair,Husband,White,Male,0.0,0.0,40.0,United-States,0


## 1. Using MLFlow with `do_datasciencing`
The most straightforward optional parameter of `do_datasciencing` is `use_mlflow`. If it is set to `True` (which it is by default), `do_datasciencing` logs the model summary (including the model itself) to an MLFlow experiment.

If you use Databricks and do not specify an MLFlow experiment, one is created automatically and linked to the source notebook. However, it is better practice to specify where you want your experiment logged. To do that, import `mlflow` and set an experiment by giving the path where it will be stored (on Databricks this is a path within the Databricks Workspace and has to start with `/`).

In [0]:
import mlflow
from datasciencefunctions.utils import current_dbx_notebook_path

# set experiment based on current notebook path, you will probably want to change it to the shared experiment you'll be working on
mlflow.set_experiment(current_dbx_notebook_path(dbutils) + "_test_experiment")

In [0]:
df_test, df_train, model_summary = (
    dsclass.do_datasciencing(
        df_adult_ml,
        model_type=ds.MlModel.spark_GLM_binomial,
        label_col="income_above_50K",
        # we are logging the model into an MLFlow experiment
        # (you don't actually need to set it to True explicitly as it is True by default)
        use_mlflow=True,
        params_fit_model={
            "max_evals": 2,  # just to make it faster
        },
    )
)

## 2. Skip some columns

Sometimes you want to keep some columns in a dataframe but exclude them from the model features (or label). Typically this would be something like a unique ID of a datapoint. The optional parameter `skip_cols` allows you to do just that: pass it a list of column names, and those columns will not be considered as features or label during training and prediction.
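
The dataset in this tutorial has no ID column, so as a sketch the example below skips `fnlwgt` instead (an illustrative choice, not a recommendation from the package). The list comprehension only mimics the described effect in plain Python. The actual filtering happens inside `do_datasciencing`, which is shown in a comment because it needs the Spark dataframe from this notebook.

```python
# Columns of df_adult_ml, as defined by the schema earlier in this tutorial.
all_columns = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_above_50K",
]
label_col = "income_above_50K"

# Illustrative choice: keep fnlwgt in the dataframe but not as a feature.
skip_cols = ["fnlwgt"]

# Plain-Python mimic of the described behaviour (not the library's code):
# skipped columns are neither features nor label.
candidate_features = [
    c for c in all_columns if c != label_col and c not in skip_cols
]
print(candidate_features)

# In the notebook the list would be passed like this:
# df_test, df_train, model_summary = dsclass.do_datasciencing(
#     df_adult_ml,
#     model_type=ds.MlModel.spark_GLM_binomial,
#     label_col="income_above_50K",
#     skip_cols=skip_cols,
# )
```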

## 3. Transformation pipeline options - specify categorical and numerical predictors and scaling

By default, `do_datasciencing` uses a heuristic to find categorical and numerical columns automatically based on the number of unique values in each column. However, this is usually only useful for a quick exploration, and typically you want to specify your categorical and numerical features manually. Likewise, the features are not scaled in the transformation pipeline by default, which you may sometimes want to change.

You can do both in `do_datasciencing` with the optional parameter `params_fit_pipeline`. This parameter (like the other parameters of `do_datasciencing` which start with `params_`) is expected to be a dictionary. It passes specific values to the function `fit_transformation_pipeline`, which is one of the functions that do all the work under the hood of `do_datasciencing`. Therefore, the keys of the `params_fit_pipeline` dictionary are the names of the `fit_transformation_pipeline` parameters you want to change.

**Note:** If you use `params_fit_pipeline` to specify your categorical and/or numerical columns, the heuristic used to find categorical and numerical columns is disabled automatically, and all columns not listed among your categorical or numerical columns are ignored by the model (as if they were listed in the `skip_cols` parameter).

In [0]:
help(ds.data_processing.fit_transformation_pipeline)

From the docstring above (or the [documentation](../../docs)) we see that `cat_cols`, `num_cols` and `scaling` are parameters of `fit_transformation_pipeline`, so we can specify their values in the optional parameter `params_fit_pipeline`.

In [0]:
cat_list = [
    "workclass",
    "education",
    "marital_status",
    "race",
    "sex",
    "native_country",
]

num_list = [
    "age",
    "capital_gain",
    "capital_loss",
    "hours_per_week",
]
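
Since any predictor that appears in neither list is ignored by the model, it can be worth checking which columns that affects. The following is a quick plain-Python check for the two lists above. The names `predictors`, `ignored` and `overlap` are only used for this illustration.

```python
# Predictor columns from the schema defined earlier in this tutorial.
predictors = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
]

cat_list = [
    "workclass", "education", "marital_status", "race", "sex",
    "native_country",
]
num_list = ["age", "capital_gain", "capital_loss", "hours_per_week"]

# Columns in neither list would be left out of the model.
selected = set(cat_list) | set(num_list)
ignored = [c for c in predictors if c not in selected]

# A column listed in both lists would be ambiguous, so check for that too.
overlap = set(cat_list) & set(num_list)

print("ignored:", ignored)
print("overlap:", overlap)
```

For these lists, `fnlwgt`, `education_num`, `occupation` and `relationship` are left out, so add them to one of the lists if you want the model to use them.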

In [0]:
df_test, df_train, model_summary = (
    dsclass.do_datasciencing(
        df_adult_ml,
        model_type=ds.MlModel.spark_GLM_binomial,
        label_col="income_above_50K",
        # we are logging the model as an MLFlow experiment
        use_mlflow=True,
        # we specify the categorical and continuous columns and scaling used
        params_fit_pipeline={
            "cat_cols": cat_list,
            "num_cols": num_list,
            "scaling": "standard",
        },
        params_fit_model={
            "max_evals": 5,  # just to make it faster
        },
    )
)

## 4. Model fitting options - specify the hyperparameter space and how it is explored

When you pass an instance of `MlModel` to `do_datasciencing`, a default hyperparameter search space is passed along with it. By default, `do_datasciencing` uses the [hyperopt](http://hyperopt.github.io/hyperopt/) package to search this space and selects the best hyperparameters based on cross-validation (using the area under the ROC curve as the loss function). The optional parameter `params_fit_model` lets you modify this in several ways.

Just like `params_fit_pipeline`, the parameter `params_fit_model` is expected to be a dictionary of parameters of a function used within `do_datasciencing`, in this case `ds.classification.fit_classification_model`. The following values can be modified using `params_fit_model`:

1. `param_space_search` ... can be set either to "hyperopt" (the default value) or to "param_grid" for a grid search of the hyperparameter space
2. `validation_technique` ... currently the only option is cross-validation, so there is no point in changing it (nothing else will work)
3. `max_evals` ... relevant only when using the hyperopt search; it changes the number of hyperopt iterations. Typically, more evaluations lead to a better choice of hyperparameters, but training time increases with the number of iterations. The default value is `30`, which might be too low or too high depending on the model, the hyperparameter space and the data.
4. `custom_params` ... allows you to change the default hyperparameter search space for each model. The format depends on whether you use "param_grid" or "hyperopt" as your space search. Since this is somewhat more complex than the other optional parameters, we will discuss `custom_params` in the last chapter of the tutorial, where we also show how to set up a custom hyperparameter space within the datasciencefunctions framework even without the use of `do_datasciencing`.

In [0]:
df_test, df_train, model_summary = (
    dsclass.do_datasciencing(
        df_adult_ml,
        model_type=ds.MlModel.spark_GLM_binomial,
        label_col="income_above_50K",
        # we are logging the model as an MLFlow experiment
        use_mlflow=True,
        # we specify the categorical and continuous columns and scaling used
        params_fit_pipeline={
            "cat_cols": cat_list,
            "num_cols": num_list,
            "scaling": "standard",
        },
        params_fit_model={
            "max_evals": 10,
        },
    )
)
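
The cell above keeps the default hyperopt search and only changes `max_evals`. As a sketch of the alternative, the dictionaries below show how a grid search would be requested via `param_space_search`, using the "param_grid" value described in this section. The grid itself is a made-up example with hypothetical hyperparameter names, used only to show how the cost of a grid search grows. It is not in the format that `custom_params` expects.

```python
from itertools import product

# Options for fit_classification_model, passed through do_datasciencing
# as the params_fit_model argument.
params_hyperopt = {"param_space_search": "hyperopt", "max_evals": 10}
params_grid = {"param_space_search": "param_grid"}

# Made-up grid with hypothetical hyperparameter names (illustration only).
example_grid = {
    "reg_param": [0.0, 0.01, 0.1, 1.0],
    "max_iter": [25, 50, 100],
}

# A grid search evaluates every combination, so its cost is the product
# of the number of candidate values per hyperparameter.
grid_points = list(product(*example_grid.values()))
n_grid_evals = len(grid_points)

# hyperopt instead stops after a fixed budget of evaluations.
n_hyperopt_evals = params_hyperopt["max_evals"]

print(n_grid_evals, n_hyperopt_evals)
```

Each of these evaluations is itself cross-validated, so adding one more hyperparameter to a grid multiplies the training time, whereas the hyperopt budget stays at `max_evals` regardless of how large the space is.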

## 5. Modify the train/test split
If you want to change the default ratio of the random train/test split, fix the randomness of the split via a seed, or change some other parameter of the `ds.data_processing.train_test_split` function, you can use the optional parameter `params_split` in the same way as `params_fit_pipeline`.
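
The parameter names of `train_test_split` are not listed in this tutorial, so check `help(ds.data_processing.train_test_split)` before building the `params_split` dictionary. What fixing the seed buys you can be sketched in plain Python. The names `split_rows`, `test_fraction` and `seed` below are hypothetical and belong to this illustration only, not to the library's API.

```python
import random


def split_rows(rows, test_fraction, seed):
    """Randomly assign each row to train or test, reproducibly per seed."""
    rng = random.Random(seed)  # local generator, global state untouched
    train, test = [], []
    for row in rows:
        (test if rng.random() < test_fraction else train).append(row)
    return train, test


rows = list(range(1000))

# Same seed twice, then a different seed.
train_a, test_a = split_rows(rows, test_fraction=0.2, seed=42)
train_b, test_b = split_rows(rows, test_fraction=0.2, seed=42)
train_c, test_c = split_rows(rows, test_fraction=0.2, seed=7)

print("same seed, same split:", test_a == test_b)
print("different seed, same split:", test_a == test_c)
print("test set sizes:", len(test_a), len(test_c))
```

With a fixed seed, repeated runs evaluate on the same test rows, so differences in metrics between runs come from the model settings rather than from a different split.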

## 6. Change the parameters of the automatic categorical columns finder

If you want to use the `ds.data_exploration.get_categorical_and_numeric_cols` heuristic to get the categorical and continuous predictors automatically but want to change some of its parameters, you can use `params_get_cat_num_columns` in the same way as `params_fit_pipeline` to pass specific parameter values to `get_categorical_and_numeric_cols`.
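
To make the idea of such a heuristic concrete, here is a minimal plain-Python sketch that classifies columns by their number of distinct values. It is not the implementation of `get_categorical_and_numeric_cols`: the function name `split_by_cardinality` and the threshold `max_categories` are assumptions made for this illustration. The real parameter names can be looked up with `help(ds.data_exploration.get_categorical_and_numeric_cols)`. The sample values are taken from the first six rows of the data preview earlier in this tutorial.

```python
def split_by_cardinality(columns, max_categories):
    """Split columns into (categorical, numeric) by distinct-value count.

    columns: dict mapping a column name to a list of its values.
    max_categories: hypothetical threshold; numeric columns with at most
        this many distinct values are treated as categorical.
    """
    categorical, numeric = [], []
    for name, values in columns.items():
        is_number = all(isinstance(v, (int, float)) for v in values)
        few_values = len(set(values)) <= max_categories
        if not is_number or few_values:
            categorical.append(name)
        else:
            numeric.append(name)
    return categorical, numeric


sample = {
    "age": [50.0, 38.0, 28.0, 52.0, 31.0, 32.0],
    "sex": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "education_num": [13.0, 9.0, 13.0, 9.0, 14.0, 9.0],
}

print(split_by_cardinality(sample, max_categories=3))
print(split_by_cardinality(sample, max_categories=2))
```

In this sketch the ordinal column `education_num` switches from categorical to numeric when the threshold drops from 3 to 2, which is the kind of borderline case where tuning the heuristic or listing the columns manually makes a difference.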

# FAQ

# Where to now?

[Back to the introductory notebook](classification.ipynb)

[To the next chapter](03_custom_training_pipelines.ipynb)