# Introduction
In this tutorial, we will go through a typical ML workflow with Foreshadow using a subset of the [adult data set](https://archive.ics.uci.edu/ml/datasets/Adult) from the UCI machine learning repository.


# Getting Started
To get started with Foreshadow, install the package with `pip install foreshadow`, which also installs its dependencies. Note that Foreshadow requires Python `>=3.6, <4.0`. Then create a simple Python script that uses Foreshadow with all of its defaults.
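
Before installing, it can help to confirm that the interpreter falls inside the supported range. The snippet below is a minimal sketch of such a check; the helper name `is_supported` is an assumption made for illustration and is not part of Foreshadow.

```python
import sys

# Version bounds taken from the requirement stated above: >=3.6, <4.0.
MIN_VERSION = (3, 6)
MAX_VERSION_EXCLUSIVE = (4, 0)


def is_supported(version_info=None):
    """Return True if the (major, minor) version is in the supported range.

    `is_supported` is a hypothetical helper, not a Foreshadow API.
    """
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


print("Supported interpreter:", is_supported())
```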

First, import the Foreshadow classes, along with the NumPy, pandas, and scikit-learn packages.

In [1]:
from foreshadow import Foreshadow
from foreshadow.intents import IntentType
from foreshadow.utils import ProblemType
from foreshadow.logging import logging

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression



Configure the random seed and logging level.

In [2]:
np.random.seed(42)
logging.set_level('warning')

# Load the dataset

In [3]:
data = pd.read_csv('adult.csv').iloc[:2000]
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [4]:
data.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 38.563 | 185748.34 | 10.0335 | 1373.5955 | 85.4295 | 40.537 |
| std | 13.696051 | 99561.629374 | 2.670757 | 8967.134813 | 398.18108 | 12.05003 |
| min | 17.0 | 13769.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 27.0 | 115531.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 176661.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 235696.25 | 13.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 662460.0 | 16.0 | 99999.0 | 3004.0 | 99.0 |


## Split the data into training and test sets

In [5]:
X_df = data.drop(columns="class")
y_df = data[["class"]]
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2)
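
To make the effect of `test_size=0.2` concrete, the following is a minimal standard-library sketch of a shuffled holdout split over row indices. It is not scikit-learn's implementation, the helper name `shuffled_split` is an assumption, and the specific rows it selects will differ from those chosen by `train_test_split`.

```python
import random


def shuffled_split(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) for a shuffled holdout split.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


# The tutorial keeps the first 2000 rows of the data set.
train_idx, test_idx = shuffled_split(2000)
print(len(train_idx), len(test_idx))  # 1600 400
```

With 2000 rows, the split leaves 1600 rows for training and 400 for testing.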

# Train a Simple LogisticRegression Model with Foreshadow and Make Predictions

**The following example is for classification. For regression problems, use `problem_type=ProblemType.REGRESSION`.**

In [6]:
shadow = Foreshadow(problem_type=ProblemType.CLASSIFICATION, 
                    estimator=LogisticRegression())

In [7]:
_ = shadow.fit(X_train, y_train)



## Making predictions

In [8]:
predictions = shadow.predict(X_test)



In [9]:
predictions.head()

| | 0 |
|---|---|
| 0 | <=50K |
| 1 | <=50K |
| 2 | <=50K |
| 3 | >50K |
| 4 | >50K |


### Use the trained estimator to compute the evaluation score
Note that the scoring method is defined by the selected estimator. For a scikit-learn classifier such as `LogisticRegression`, `score` returns the mean accuracy.

In [10]:
shadow.score(X_test, y_test)

0.835
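
Since the scoring method comes from the estimator, and scikit-learn classifiers report mean accuracy from `score`, the number above can be read as the fraction of correct predictions. The sketch below computes that quantity by hand on a few made-up labels; the `accuracy` helper and the sample labels are assumptions for illustration, not tutorial data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels, purely for illustration.
y_true = ["<=50K", "<=50K", ">50K", ">50K"]
y_pred = ["<=50K", ">50K", ">50K", ">50K"]
print(accuracy(y_true, y_pred))  # 0.75
```

Under that reading, a score of 0.835 on the 400 test rows would correspond to 334 correct predictions.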

## Inspect and change Foreshadow's decisions

Foreshadow uses a machine learning model to power its automatic intent resolving step. As a user, you may not agree with the decisions it makes. The following APIs let you inspect those decisions and change them where you disagree.

### Check the intent of a particular column

In [11]:
shadow.get_intent('education-num')

'Numeric'

### Override the decision of intent resolving
If you want to explore a different intent type, simply call the `override_intent` API. 

In [12]:
shadow.override_intent('education-num', IntentType.CATEGORICAL)

In [13]:
_ = shadow.fit(X_train, y_train)



In [14]:
shadow.score(X_test, y_test)

0.8325

To show that the intent has been updated: 

In [15]:
shadow.get_intent('education-num')

'Categorical'

### You can also override the intent (column type) before fitting the data

This tells Foreshadow to skip automatic intent resolving for the columns you specify and to use your decisions instead.

In [16]:
shadow = Foreshadow(problem_type=ProblemType.CLASSIFICATION, estimator=LogisticRegression())
shadow.override_intent('education-num', IntentType.CATEGORICAL)
_ = shadow.fit(X_train, y_train)
print(shadow.get_intent('education-num'))

                                         

Categorical
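
Conceptually, an override set before fitting takes precedence over whatever the automatic resolver would have chosen for that column. The sketch below illustrates only that precedence rule. It does not reflect Foreshadow's internals: `resolve_intents`, `toy_resolver`, and the hard-coded guesses are all assumptions made for illustration.

```python
def resolve_intents(columns, auto_resolver, overrides=None):
    """Pick an intent per column, preferring user overrides.

    Hypothetical helper; not part of Foreshadow.
    """
    overrides = overrides or {}
    resolved = {}
    for column in columns:
        if column in overrides:
            # The user's decision wins; the resolver is not consulted.
            resolved[column] = overrides[column]
        else:
            resolved[column] = auto_resolver(column)
    return resolved


def toy_resolver(column):
    """Stand-in for the ML-based resolver, using hard-coded guesses."""
    guesses = {
        "age": "Numeric",
        "education-num": "Numeric",
        "workclass": "Categorical",
    }
    return guesses[column]


intents = resolve_intents(
    ["age", "education-num", "workclass"],
    toy_resolver,
    overrides={"education-num": "Categorical"},
)
print(intents["education-num"])  # Categorical
```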




# Search for the Best Model and Hyperparameters
At this point, you have a basic pipeline fitted by Foreshadow using a logistic regression estimator. You can swap in a more powerful estimator and retrain the model. Alternatively, you can use the AutoEstimator option in Foreshadow.

Foreshadow leverages the [TPOT AutoML](https://epistasislab.github.io/tpot/using/) package to search for the best model and hyperparameters for you. **Note that AutoML algorithms can take a long time to finish their search, so here we only configure Foreshadow to search for 2 minutes. Please refer to the TPOT manual for more details.**

In [17]:
from foreshadow.estimators import AutoEstimator
estimator = AutoEstimator(
    problem_type=ProblemType.CLASSIFICATION,
    auto="tpot",
    estimator_kwargs={"max_time_mins": 2},  # search budget in minutes; adjust as needed
)
shadow = Foreshadow(problem_type=ProblemType.CLASSIFICATION, estimator=estimator)
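
The `max_time_mins` setting caps how long the search may run. The sketch below shows the general idea of a time-budgeted search: evaluate candidates until the budget is used up, then return the best one seen so far. It is a deliberately simple sequential loop, not TPOT's genetic programming search, and `budgeted_search`, the candidate names, and the scores are assumptions for illustration.

```python
import time


def budgeted_search(candidates, evaluate, max_time_secs, clock=time.monotonic):
    """Evaluate candidates in order until the time budget is used up.

    Hypothetical helper; `clock` is injectable so the budget can be tested.
    """
    deadline = clock() + max_time_secs
    best_candidate, best_score = None, float("-inf")
    for candidate in candidates:
        if clock() >= deadline:
            break  # Budget exhausted: keep the best result found so far.
        score = evaluate(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


# Made-up candidates and scores, purely for illustration.
scores = {"model_a": 0.81, "model_b": 0.84, "model_c": 0.83}
best, best_score = budgeted_search(scores, scores.get, max_time_secs=2 * 60)
print(best, best_score)  # model_b 0.84
```

A larger budget lets the search see more candidates, which is why the result of a short search should be treated as a starting point rather than a final answer.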

In [18]:
shadow.override_intent('education-num', IntentType.CATEGORICAL)

In [19]:
_ = shadow.fit(X_df, y_df)


10 operators have been imported by TPOT.



_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative.
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 97.
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative.
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative.
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative.
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative.
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False

## Making predictions and evaluations

In [20]:
predictions = shadow.predict(X_test)



In [21]:
shadow.score(X_test, y_test)

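1.0

Note that the AutoEstimator pipeline above was fitted on the full data (`X_df`, `y_df`), which includes the rows in `X_test`. A perfect score here therefore does not measure performance on unseen data; fit on `X_train` and `y_train` instead to get a held-out estimate.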

# Model persistence
## Save the fitted pipeline
After finding the best pipeline, you can export the fitted pipeline as a pickle file for your prediction task. 

In [22]:
pickled_fitted_pipeline_location = "fitted_pipeline.p"
shadow.pickle_fitted_pipeline(pickled_fitted_pipeline_location)

## Load back the pipeline for prediction

In [23]:
import pickle

with open(pickled_fitted_pipeline_location, "rb") as fopen:
    shadow_reload = pickle.load(fopen)
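
The save and load steps above follow the usual `pickle` round trip. The standard-library sketch below shows the same pattern end to end, using a plain dictionary as a stand-in for the fitted pipeline (an assumption, so that the example runs without Foreshadow).

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted pipeline in this sketch.
fitted_stand_in = {
    "estimator": "LogisticRegression",
    "classes": ["<=50K", ">50K"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fitted_pipeline.p")

    # Save: write the object in binary mode.
    with open(path, "wb") as f:
        pickle.dump(fitted_stand_in, f)

    # Load: read it back in binary mode.
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded == fitted_stand_in)  # True
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.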

## Reuse the pipeline to do predictions and evaluations

In [24]:
predictions = shadow_reload.predict(X_test)
predictions.head()



| | 0 |
|---|---|
| 0 | <=50K |
| 1 | >50K |
| 2 | <=50K |
| 3 | <=50K |
| 4 | >50K |


In [25]:
shadow_reload.score(X_test, y_test)

1.0

# [Experimental] Register customized data cleaners
Foreshadow provides several built-in data cleaning transformations, each of which works on a per-column basis:

- datetime cleaner (convert a datetime into separate YYYY, mm, and dd components)
- financial number cleaner (reformat financial numbers by removing signs like "$" and ",")
- drop cleaner (drop a column if a column has over 90% NaN values)
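
To give a feel for what two of the cleaners listed above do, here is a minimal standard-library sketch of the financial-number and drop rules as described. It is not Foreshadow's implementation; `clean_financial` and `should_drop` are hypothetical names, and the sketch assumes missing values appear as `None` or NaN.

```python
import math


def clean_financial(text):
    """Strip "$" and "," from a financial string and parse it as a float."""
    return float(text.replace("$", "").replace(",", ""))


def should_drop(values, threshold=0.9):
    """Return True if more than `threshold` of the values are missing."""
    if not values:
        return False
    missing = sum(
        1
        for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values) > threshold


print(clean_financial("$1,234.50"))       # 1234.5
print(should_drop([None] * 19 + [1.0]))   # True (95% missing)
print(should_drop([None] * 9 + [1.0]))    # False (exactly 90% missing)
```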

It is also possible to provide your own data cleaning transformations. The following (dummy) example shows how to change a column of strings to lowercase.

## Define your own cleaner and transformation function
Defining your own data cleaner involves two components (this may be reduced to a single component in the future).

- One is the transformation you want to apply to each row in a column. 

- The second is a subclass of `CustomizableBaseCleaner`. You need to override its `metric_score` method, which returns a confidence score between 0 and 1 representing how certain it is that this particular cleaner should be applied to the column being processed.
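
A confidence score does not have to be hard-coded per column name; it can also be derived from the data. As one possible approach (an assumption, not something Foreshadow prescribes), the sketch below scores a lowercase cleaner by the fraction of string values that contain an uppercase character. `uppercase_fraction` is a hypothetical helper.

```python
def uppercase_fraction(values):
    """Fraction of string values containing at least one uppercase character.

    Returns 0.0 when there are no string values at all.
    """
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return 0.0
    with_upper = sum(1 for s in strings if any(ch.isupper() for ch in s))
    return with_upper / len(strings)


# Two of the three values contain an uppercase letter.
score = uppercase_fraction([" Private", " Local-gov", " ?"])
print(round(score, 2))  # 0.67
```

Inside a `metric_score` method, where the column arrives as a one-column DataFrame, the values could be passed in with something like `X.iloc[:, 0].tolist()`.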



In [26]:
from foreshadow.concrete.internals.cleaners.customizable_base import (
    CustomizableBaseCleaner,
)

def lowercase_row(row):
    """Lowercase a row.

    Args:
        row: string of text

    Returns:
        transformed row.

    """
    return row if row is None else str(row).lower()

class LowerCaseCleaner(CustomizableBaseCleaner):
    def __init__(self):
        super().__init__(transformation=lowercase_row)

    def metric_score(self, X: pd.DataFrame) -> float:
        """Calculate the matching metric score of the cleaner on this col.

        In this method, you specify the condition on when to apply the
        cleaner and calculate a confidence score between 0 and 1 where 1
        means 100% certainty to apply the transformation.

        Args:
            X: a column as a dataframe.

        Returns:
            the confidence score.

        """
        column_name = list(X.columns)[0]
        # Dummy rule: apply this cleaner only to the "workclass" column.
        return 1.0 if column_name == "workclass" else 0.0

## Register the cleaner with the Foreshadow object, then train the model

In [27]:
# Note: reinitialize the Foreshadow object so that it picks up the customized data cleaner.
shadow = Foreshadow(problem_type=ProblemType.CLASSIFICATION, 
                    estimator=LogisticRegression())

shadow.register_customized_data_cleaner(data_cleaners=[LowerCaseCleaner])

### List the unique values of the workclass column

In [28]:
workclass_values = list(X_train["workclass"].unique())
print(workclass_values)

[' Private', ' Local-gov', ' Self-emp-not-inc', ' State-gov', ' Self-emp-inc', ' Federal-gov', ' ?']


### List the unique values of the workclass column after the transformation

In [29]:
X_train_cleaned = shadow.X_preparer.steps[0][1].fit_transform(X_train)

workclass_values_transformed = list(X_train_cleaned["workclass"].unique())
print(workclass_values_transformed)



[' private', ' local-gov', ' self-emp-not-inc', ' state-gov', ' self-emp-inc', ' federal-gov', ' ?']
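
The transformed values above are lowercase but still carry the leading space present in the raw data. If that whitespace is unwanted, the row transformation could trim it as well. The function below is a hypothetical variant of `lowercase_row`, not part of Foreshadow; it could be passed as the `transformation` in the same way.

```python
def strip_and_lowercase_row(row):
    """Trim surrounding whitespace and lowercase a value; pass None through."""
    return row if row is None else str(row).strip().lower()


values = [" Private", " Local-gov", " ?", None]
print([strip_and_lowercase_row(v) for v in values])
# ['private', 'local-gov', '?', None]
```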


## Train, predict and evaluate as usual

In [30]:
# Note that right now you need to reinitialize the Foreshadow object before retraining.
shadow = Foreshadow(problem_type=ProblemType.CLASSIFICATION, 
                    estimator=LogisticRegression())

shadow.register_customized_data_cleaner(data_cleaners=[LowerCaseCleaner])

shadow.fit(X_train, y_train)
predictions = shadow.predict(X_test)
shadow.score(X_test, y_test)



0.8325