# Introduction
In this tutorial, we will go through a typical ML workflow with Foreshadow using the [Titanic](https://www.kaggle.com/c/titanic) dataset from Kaggle.


# Getting Started
To get started with Foreshadow, install the package with `pip install foreshadow`, which also installs its dependencies. Note that Foreshadow requires Python `>=3.6, <4.0`. Then create a simple Python script that uses Foreshadow's defaults.

First, import the Foreshadow classes, along with scikit-learn, pandas, and NumPy.

In [1]:
# Import required libraries
import re
import pandas as pd 
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

from foreshadow import Foreshadow
from foreshadow.intents import IntentType
from foreshadow.utils import ProblemType
from foreshadow.estimators import AutoEstimator
from foreshadow.concrete.internals.cleaners.customizable_base import CustomizableBaseCleaner

# Set the random state
RANDOM_SEED = 10001
np.random.seed(RANDOM_SEED)

# Load the dataset

In [2]:
# Load the data
titanic = pd.read_csv('data/titanic_train.csv')
print('Data Shape: ' + str(titanic.shape))
X_train, X_test, y_train, y_test = train_test_split(titanic.drop('Survived', axis=1), titanic['Survived'],
                                                    train_size=0.8, test_size=0.2)
titanic.head(5)

Data Shape: (891, 12)


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |
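
The 80/20 split above turns 891 rows into 712 training rows and 179 test rows, which are the counts that appear in the later outputs. The sketch below reproduces that arithmetic, assuming the split rounds the test fraction up and the training fraction down; `split_sizes` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, train_size: float, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_test) for fractional split sizes.

    Assumption: the test fraction is rounded up and the train fraction
    is rounded down, so the two parts never exceed n_samples.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor(train_size * n_samples)
    return n_train, n_test


n_train, n_test = split_sizes(891, train_size=0.8, test_size=0.2)
print(n_train, n_test)  # 712 179
```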


# Model Training Iteration 1 - Use a simple LogisticRegression Model

In [3]:
# Define a function to measure the performance of the model to be trained


def measure(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print('Accuracy = %5.2f' % accuracy_score(y_test, y_pred))
    print('Classification Report:\n')
    print(classification_report(y_test, y_pred))

In [4]:
shadow1 = Foreshadow(problem_type=ProblemType.CLASSIFICATION, estimator=LogisticRegression(random_state=RANDOM_SEED))
shadow1.fit(X_train, y_train)

2020-03-26 17:46:33,716 - foreshadow - INFO - 26003 - Starting cleaning rows...
2020-03-26 17:46:33,720 - foreshadow - INFO - 26003 - Ending cleaning rows...
2020-03-26 17:46:33,855 - foreshadow - INFO - 26003 - Starting cleaning rows...
2020-03-26 17:46:33,859 - foreshadow - INFO - 26003 - Ending cleaning rows...
2020-03-26 17:46:34,980 - foreshadow - INFO - 26003 - Column PassengerId has intent type: Droppable
2020-03-26 17:46:35,287 - foreshadow - INFO - 26003 - Column Pclass has intent type: Categorical
2020-03-26 17:46:35,617 - foreshadow - INFO - 26003 - Column Name has intent type: Text
2020-03-26 17:46:35,936 - foreshadow - INFO - 26003 - Column Sex has intent type: Categorical
2020-03-26 17:46:36,269 - foreshadow - INFO - 26003 - Column Age has intent type: Numeric
2020-03-26 17:46:36,582 - foreshadow - INFO - 26003 - Column SibSp has intent type: Categorical
2020-03-26 17:46:36,900 - foreshadow - INFO - 26003 - Column Parch has intent type: Categorical
2020-03-26 17:46:37,232

Foreshadow(allowed_seconds=300,
           data_columns=['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
                         'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
           estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                        fit_intercept=True, intercept_scaling=1,
                                        l1_ratio=None, max_iter=100,
                                        multi_class='auto', n_jobs=None,
                                        penalty='l2', random_state=10001,
                                        solver='lbfgs', tol=0.0001, verbose=0,
                                        warm_start=False),
           estimator_kwargs=None, n_jobs=1, problem_type='classification',
           random_state=None)

In [5]:
measure(shadow1, X_test, y_test)

2020-03-26 17:46:39,810 - foreshadow - INFO - 26003 - Starting cleaning rows...
2020-03-26 17:46:39,813 - foreshadow - INFO - 26003 - Ending cleaning rows...
2020-03-26 17:46:40,133 - foreshadow - INFO - 26003 - Exported processed data to processed_test_data.csv


Accuracy =  0.78
Classification Report:

              precision    recall  f1-score   support

           0       0.77      0.87      0.81        98
           1       0.81      0.68      0.74        81

    accuracy                           0.78       179
   macro avg       0.79      0.77      0.78       179
weighted avg       0.79      0.78      0.78       179
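
To make the report columns concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 by hand for one class. The labels are toy data invented for this illustration (they are not the Titanic predictions), and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy plus precision/recall/F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 10 samples, 4 of them positive.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["precision"], metrics["recall"])  # 0.8 0.75 0.75
```

In the report, `support` is the number of true samples in each class, and the `macro avg` row is the unweighted mean of the per-class values.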



### You might be curious how Foreshadow handled the input data. Let's take a look.

In [6]:
shadow1.get_data_summary()

| | Pclass | Sex | SibSp | Parch | Cabin | Embarked | PassengerId | Age | Ticket | Fare | Name | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| intent | Categorical | Categorical | Categorical | Categorical | Categorical | Categorical | Droppable | Numeric | Numeric | Numeric | Text | Label |
| count | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 |
| nan_pct | 0 | 0 | 0 | 0 | 79.073 | 0 | 0 | 20.7865 | 25.5618 | 0 | 0 | 0 |
| unique | 3 | 2 | 7 | 7 | 119 | 3 | 712 | 85 | 432 | 226 | 712 | 2 |
| #1_value | 3 56.60% | male 65.31% | 0 66.99% | 0 76.12% | G6 0.56% | S 73.46% | 891 0.14% | 24.0 3.65% | 1601.0 0.84% | 8.05 4.78% | Webber, Miss. Susan 0.14% | 0 63.34% |
| #2_value | 1 78.37% | female 100.00% | 1 91.43% | 1 89.75% | B96 B98 1.12% | C 90.31% | 273 0.28% | 28.0 6.74% | 347082.0 1.54% | 13.0 9.13% | Attalah, Miss. Malake 0.28% | 1 100.00% |
| #3_value | 2 100.00% | | 2 94.80% | 2 98.46% | C22 C26 1.54% | Q 100.00% | 300 0.42% | 30.0 9.69% | 347088.0 2.11% | 7.8958 13.48% | Weisz, Mrs. Leopold (Mathilde Francoise Pede) ... | |
| #4_value | | | 3 96.77% | 3 99.16% | E101 1.97% | | 299 0.56% | 19.0 12.50% | 382652.0 2.67% | 26.0 17.56% | Norman, Mr. Robert Douglas 0.56% | |
| #5_value | | | 4 98.60% | 5 99.58% | C23 C25 C27 2.39% | | 298 0.70% | 18.0 15.31% | 4133.0 3.23% | 7.75 21.63% | Asim, Mr. Adola 0.70% | |
| #6_value | | | 8 99.30% | 4 99.86% | C78 2.67% | | 295 0.84% | 22.0 18.12% | 2666.0 3.79% | 10.5 24.44% | Dakic, Mr. Branko 0.84% | |
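
The percentages in the `#N_value` rows appear to be cumulative: each row adds the share of the next most frequent value, which is why short columns such as Sex reach 100.00% after two rows. The following sketch shows one way such a column could be computed. It is an illustration on toy data, not Foreshadow's implementation, and `cumulative_top_values` is a hypothetical helper.

```python
from collections import Counter


def cumulative_top_values(values, top_n=6):
    """Return [(value, cumulative_percent)] for the most frequent values."""
    total = len(values)
    running = 0
    rows = []
    for value, count in Counter(values).most_common(top_n):
        running += count
        rows.append((value, round(100 * running / total, 2)))
    return rows


# Toy column: 3 of 5 rows are "male", the remaining 2 are "female".
sex = ["male", "male", "male", "female", "female"]
print(cumulative_top_values(sex))  # [('male', 60.0), ('female', 100.0)]
```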


#### Foreshadow uses a machine learning model to identify the 'intent' of each feature. As of v1.0, three intents are supported: 'Categorical', 'Numeric', and 'Text'. Foreshadow transforms each feature according to its intent and statistics. Features that do not belong to any of these three intents are tagged as 'Droppable'. For example, PassengerId is droppable because it has a unique value for each row and provides no signal to the model. In the table above, 'Label' in the intent row indicates the target column.
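
To see why an identifier such as PassengerId carries no signal, it helps to look at the share of distinct values in a column. The sketch below is only an illustrative heuristic, not the model Foreshadow uses to assign intents; `unique_ratio` is a hypothetical helper, and the values come from the five preview rows shown earlier.

```python
def unique_ratio(values):
    """Fraction of non-missing values that are distinct (1.0 = all unique)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


passenger_id = [1, 2, 3, 4, 5]
sex = ["male", "female", "female", "female", "male"]

print(unique_ratio(passenger_id))  # 1.0
print(unique_ratio(sex))           # 0.4
```

A ratio of 1.0 alone is not a sufficient rule, since free-text columns such as Name are also unique per row, which is one reason a learned intent model is more flexible than a fixed threshold.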

# Model Training Iteration 2 - Override 

#### The table above shows that the 'Ticket' column is tagged with the Numeric intent, even though it appears to hold a ticket identification number. To avoid confusing the model, we can override its intent to Droppable. We do the same for the Cabin and Name columns.
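
A quick way to check whether a column like Ticket is genuinely numeric is to count how many values parse as numbers. The sketch below does this for the five ticket values shown in the data preview; `numeric_fraction` is a hypothetical helper for illustration and is not part of Foreshadow.

```python
def numeric_fraction(values):
    """Fraction of values that can be parsed as a float."""
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    return sum(is_number(v) for v in values) / len(values)


# Ticket values from the first five rows of the dataset.
tickets = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"]
print(numeric_fraction(tickets))  # 0.4
```

Only two of the five values are purely numeric, which supports treating the column as an identifier rather than a measurement.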

In [7]:
shadow1 = Foreshadow(problem_type=ProblemType.CLASSIFICATION, estimator=LogisticRegression(random_state=RANDOM_SEED))
shadow1.override_intent('Ticket', IntentType.DROPPABLE)
shadow1.override_intent('Cabin', IntentType.DROPPABLE)
shadow1.override_intent('Name', IntentType.DROPPABLE)
shadow1.fit(X_train, y_train)

2020-03-26 17:46:40,205 - foreshadow - INFO - 26003 - The foreshadow object is not trained yet. Please make sure the column Ticket exist to ensure the override takes effect.
2020-03-26 17:46:40,206 - foreshadow - INFO - 26003 - The foreshadow object is not trained yet. Please make sure the column Cabin exist to ensure the override takes effect.
2020-03-26 17:46:40,207 - foreshadow - INFO - 26003 - The foreshadow object is not trained yet. Please make sure the column Name exist to ensure the override takes effect.
2020-03-26 17:46:40,510 - foreshadow - INFO - 26003 - Starting cleaning rows...
2020-03-26 17:46:40,514 - foreshadow - INFO - 26003 - Ending cleaning rows...
2020-03-26 17:46:40,646 - foreshadow - INFO - 26003 - Starting cleaning rows...
2020-03-26 17:46:40,649 - foreshadow - INFO - 26003 - Ending cleaning rows...
2020-03-26 17:46:41,731 - foreshadow - INFO - 26003 - Column PassengerId has intent type: Droppable
2020-03-26 17:46:42,033 - foreshadow - INFO - 26003 - Column Pcla

Foreshadow(allowed_seconds=300,
           data_columns=['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
                         'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
           estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                        fit_intercept=True, intercept_scaling=1,
                                        l1_ratio=None, max_iter=100,
                                        multi_class='auto', n_jobs=None,
                                        penalty='l2', random_state=10001,
                                        solver='lbfgs', tol=0.0001, verbose=0,
                                        warm_start=False),
           estimator_kwargs=None, n_jobs=1, problem_type='classification',
           random_state=None)

In [8]:
measure(shadow1, X_test, y_test)

2020-03-26 17:46:44,227 - foreshadow - INFO - 26003 - Starting cleaning rows...
2020-03-26 17:46:44,230 - foreshadow - INFO - 26003 - Ending cleaning rows...
2020-03-26 17:46:44,324 - foreshadow - INFO - 26003 - Exported processed data to processed_test_data.csv


Accuracy =  0.82
Classification Report:

              precision    recall  f1-score   support

           0       0.80      0.89      0.84        98
           1       0.84      0.73      0.78        81

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.81       179



# Model Training Iteration 3 - AutoEstimator 

#### Instead of trying a single estimator, we can use AutoEstimator to search over ML models and hyperparameters. When no estimator is provided, Foreshadow creates an AutoEstimator automatically.

In [9]:
shadow2 = Foreshadow(problem_type=ProblemType.CLASSIFICATION, allowed_seconds=300, random_state=RANDOM_SEED, n_jobs=-1)
shadow2.override_intent('Ticket', IntentType.DROPPABLE)
shadow2.override_intent('Cabin', IntentType.DROPPABLE)
shadow2.override_intent('Name', IntentType.DROPPABLE)

2020-03-26 17:46:44,354 - foreshadow - INFO - 26003 - The foreshadow object is not trained yet. Please make sure the column Ticket exist to ensure the override takes effect.
2020-03-26 17:46:44,355 - foreshadow - INFO - 26003 - The foreshadow object is not trained yet. Please make sure the column Cabin exist to ensure the override takes effect.
2020-03-26 17:46:44,356 - foreshadow - INFO - 26003 - The foreshadow object is not trained yet. Please make sure the column Name exist to ensure the override takes effect.


In [10]:
shadow2.fit(X_train, y_train)
measure(shadow2, X_test, y_test)

2020-03-26 17:46:49,523 - foreshadow - INFO - 26003 - Exported processed data to processed_training_data.csv



Generation 1 - Current best internal CV score: 0.8342657342657344
Generation 2 - Current best internal CV score: 0.8342657342657344
Generation 3 - Current best internal CV score: 0.8371023342854329
Generation 4 - Current best internal CV score: 0.8371023342854329
Generation 5 - Current best internal CV score: 0.8399290849995076
Generation 6 - Current best internal CV score: 0.8399290849995076

5.13 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: ExtraTreesClassifier(LinearSVC(RandomForestClassifier(CombineDFs(input_matrix, input_matrix), bootstrap=True, criterion=entropy, max_features=0.6500000000000001, min_samples_leaf=18, min_samples_split=10, n_estimators=100), C=25.0, dual=False, loss=squared_hinge, penalty=l1, tol=0.001), bootstrap=False, criterion=gini, max_features=0.1, min_samples_leaf=1, min_samples_split=9, n_estimators=100)


2020-03-26 17:51:59,095 - foreshadow - INFO - 26003 - Exported processed data to processed_test_data.csv


Accuracy =  0.78
Classification Report:

              precision    recall  f1-score   support

           0       0.76      0.88      0.82        98
           1       0.82      0.67      0.73        81

    accuracy                           0.78       179
   macro avg       0.79      0.77      0.77       179
weighted avg       0.79      0.78      0.78       179



# Model Training Iteration 4 - Data Cleaner 

#### A closer look at the Name column shows that each passenger's title is embedded in the name. Since a title reflects a person's social status, it might affect model performance. Let's try to extract that information.

In [11]:
class TitleExtractor(CustomizableBaseCleaner):
    def __init__(self):
        super().__init__(
            # Raw string avoids an invalid-escape warning for "\."; group(0) is the full match, e.g. " Mr."
            transformation=lambda row: row if row is None else re.search(r' ([A-Za-z]+)\.', str(row)).group(0)
        )

    def metric_score(self, X: pd.DataFrame) -> float:
        return 1 if X.columns[0] == "Name" else 0
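
The regular expression used by `TitleExtractor` can be tried on its own before wiring it into Foreshadow. The sketch below applies the same pattern to names from the data preview. Unlike the cleaner above, this hypothetical `extract_title` helper strips the leading space and returns `None` when no title is found instead of raising an error; both are assumptions made for the illustration.

```python
import re

# Same pattern as in TitleExtractor: a space, a run of letters, then a period.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")


def extract_title(name):
    """Return the title found in a passenger name (e.g. 'Mr.'), or None."""
    if name is None:
        return None
    match = TITLE_PATTERN.search(str(name))
    return match.group(0).strip() if match else None


print(extract_title("Braund, Mr. Owen Harris"))  # Mr.
print(extract_title("Heikkinen, Miss. Laina"))   # Miss.
print(extract_title("no title here"))            # None
```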


In [12]:
shadow3 = Foreshadow(problem_type=ProblemType.CLASSIFICATION, allowed_seconds=300, random_state=RANDOM_SEED, n_jobs=-1)
shadow3.override_intent('Ticket', IntentType.DROPPABLE)
shadow3.override_intent('Cabin', IntentType.DROPPABLE)
shadow3.register_customized_data_cleaner(data_cleaners=[TitleExtractor])

2020-03-26 17:51:59,170 - foreshadow - INFO - 26003 - The foreshadow object is not trained yet. Please make sure the column Ticket exist to ensure the override takes effect.
2020-03-26 17:51:59,172 - foreshadow - INFO - 26003 - The foreshadow object is not trained yet. Please make sure the column Cabin exist to ensure the override takes effect.


In [13]:
shadow3.fit(X_train, y_train)
measure(shadow3, X_test, y_test)

2020-03-26 17:52:00,814 - foreshadow - INFO - 26003 - Exported processed data to processed_training_data.csv



Generation 1 - Current best internal CV score: 0.8427262877967104
Generation 2 - Current best internal CV score: 0.8441445878065597
Generation 3 - Current best internal CV score: 0.8469417906037625
Generation 4 - Current best internal CV score: 0.8469516399093864
Generation 5 - Current best internal CV score: 0.8497389934009651
Generation 6 - Current best internal CV score: 0.8497389934009651

5.08 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: RandomForestClassifier(DecisionTreeClassifier(input_matrix, criterion=entropy, max_depth=3, min_samples_leaf=6, min_samples_split=15), bootstrap=True, criterion=gini, max_features=0.5, min_samples_leaf=1, min_samples_split=20, n_estimators=100)


2020-03-26 17:57:06,338 - foreshadow - INFO - 26003 - Exported processed data to processed_test_data.csv


Accuracy =  0.82
Classification Report:

              precision    recall  f1-score   support

           0       0.79      0.91      0.84        98
           1       0.86      0.70      0.78        81

    accuracy                           0.82       179
   macro avg       0.83      0.81      0.81       179
weighted avg       0.82      0.82      0.81       179



In [14]:
shadow3.get_data_summary()

| | Pclass | Name | Sex | SibSp | Parch | Embarked | PassengerId | Ticket | Cabin | Age | Fare | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| intent | Categorical | Categorical | Categorical | Categorical | Categorical | Categorical | Droppable | Droppable | Droppable | Numeric | Numeric | Label |
| count | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 | 712 |
| nan_pct | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 79.073 | 20.7865 | 0 | 0 |
| unique | 3 | 14 | 2 | 7 | 7 | 3 | 712 | 573 | 119 | 85 | 226 | 2 |
| #1_value | 3 56.60% | Mr. 58.29% | male 65.31% | 0 66.99% | 0 76.12% | S 73.46% | 891 0.14% | CA 2144 0.84% | G6 0.56% | 24.0 3.65% | 8.05 4.78% | 0 63.34% |
| #2_value | 1 78.37% | Miss. 78.51% | female 100.00% | 1 91.43% | 1 89.75% | C 90.31% | 273 0.28% | 1601 1.69% | B96 B98 1.12% | 28.0 6.74% | 13.0 9.13% | 1 100.00% |
| #3_value | 2 100.00% | Mrs. 92.28% | | 2 94.80% | 2 98.46% | Q 100.00% | 300 0.42% | CA. 2343 2.39% | C22 C26 1.54% | 30.0 9.69% | 7.8958 13.48% | |
| #4_value | | Master. 96.91% | | 3 96.77% | 3 99.16% | | 299 0.56% | 347082 3.09% | E101 1.97% | 19.0 12.50% | 26.0 17.56% | |
| #5_value | | Dr. 97.75% | | 4 98.60% | 5 99.58% | | 298 0.70% | 382652 3.65% | C23 C25 C27 2.39% | 18.0 15.31% | 7.75 21.63% | |
| #6_value | | Rev. 98.60% | | 8 99.30% | 4 99.86% | | 295 0.84% | 4133 4.21% | C78 2.67% | 22.0 18.12% | 10.5 24.44% | |
