## Pipelines: Putting your entire ML workflow together

Let's import everything we're going to use this morning, like always:

In [5]:
#data handling, model creation/evaluation
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
import scipy.stats as stats
from sklearn_pandas import DataFrameMapper, cross_val_score
from sklearn.model_selection import cross_val_score as cv_sk
# visualization
%matplotlib inline
import seaborn as sns

### Pipelines: Putting An Entire Model Together End to End

Ok, the last thing we are going to learn is how to combine every aspect of creating and using a supervised machine learning model:

1. Transforming your original data (removing skew, standard scaling, encoding categorical variables as numbers)
2. Training and validating a model on that data
3. Picking parameters for a given model to optimize accuracy/precision/recall/f1 score, etc.

Let's first see how we would do this without a pipeline. We start by getting some data:

In [6]:
columns = ["sex","length","diam","height","whole","shucked","viscera","shell","age"]
numeric_columns = columns[1:-1]
categorical_columns = columns[0]
target = columns[-1]

abalone_data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
                           names=columns)
abalone_data.head()

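|   | sex | length | diam | height | whole | shucked | viscera | shell | age |
|---|-----|--------|------|--------|-------|---------|---------|-------|-----|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |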


Now let's preprocess it in the standard way I've shown you:

1. Let's convert the categorical column using one-hot encoding
2. Standard scale (Z-score) the numeric columns

In [12]:
# variable encoding
## drop the last dummy column because it is redundant given the other two
X_categorical = pd.get_dummies(abalone_data[categorical_columns]).astype(int).iloc[:,:-1]

#get and transform numeric features (work on a copy to avoid pandas' SettingWithCopyWarning)
X_numeric = abalone_data[numeric_columns].copy()
X_numeric[numeric_columns] = StandardScaler().fit_transform(X_numeric)

#get outcome variable
y = abalone_data[target]

#combine transformed categorical and numeric features
X_final = pd.concat((X_numeric,X_categorical),axis=1)

In [13]:
X_final

|   | length | diam | height | whole | shucked | viscera | shell | F | I |
|---|--------|------|--------|-------|---------|---------|-------|---|---|
| 0 | -0.574558 | -0.432149 | -1.064424 | -0.641898 | -0.607685 | -0.726212 | -0.638217 | 0 | 0 |
| 1 | -1.448986 | -1.439929 | -1.183978 | -1.230277 | -1.170910 | -1.205221 | -1.212987 | 0 | 0 |
| 2 | 0.050033 | 0.122130 | -0.107991 | -0.309469 | -0.463500 | -0.356690 | -0.207139 | 1 | 0 |
| 3 | -0.699476 | -0.432149 | -0.347099 | -0.637819 | -0.648238 | -0.607600 | -0.602294 | 0 | 0 |
| 4 | -1.615544 | -1.540707 | -1.423087 | -1.272086 | -1.215968 | -1.287337 | -1.320757 | 0 | 1 |
| 5 | -0.824395 | -1.087206 | -1.064424 | -0.973307 | -0.983919 | -0.940625 | -0.853756 | 0 | 1 |
| 6 | 0.050033 | 0.071741 | 0.250672 | -0.104505 | -0.551363 | -0.356690 | 0.655017 | 1 | 0 |
| 7 | 0.174951 | 0.172519 | -0.347099 | -0.123880 | -0.294533 | -0.283698 | 0.152092 | 1 | 0 |
| 8 | -0.408000 | -0.381760 | -0.347099 | -0.651076 | -0.643732 | -0.621286 | -0.530447 | 0 | 0 |
| 9 | 0.216591 | 0.323686 | 0.250672 | 0.134109 | -0.202164 | -0.270012 | 0.583170 | 1 | 0 |


And now, let's do some standard 10-fold cross-validation scoring:

In [14]:
#create rf regressor and check 10-fold RMSE
rf = RandomForestRegressor()

cross_val_scores = np.abs(cross_val_score(rf,
                                          X_final,
                                          y,
                                          scoring = "neg_mean_squared_error", # minimizing residuals
                                          cv=10))

rmse_cross_val_scores = list(map(np.sqrt, 
                                 cross_val_scores))

print("Mean 10-fold rmse: ", str(np.mean(rmse_cross_val_scores)))
print("Std 10-fold rmse: ", str(np.std(rmse_cross_val_scores)))

Mean 10-fold rmse:  2.242919811448675
Std 10-fold rmse:  0.6206212313126896


Now, we are going to do the same thing using a combination of [sklearn-pandas](https://github.com/scikit-learn-contrib/sklearn-pandas) and Scikit-learn's pipeline feature.

In [16]:
def my_analyzer(x):
    return [x]

#convert categoricals into one-hot-encoded features
#standard scale all numeric columns
mapper = DataFrameMapper([
    ('sex', # column you want to operate on; this can also be a list of columns
     CountVectorizer(analyzer = my_analyzer)), # dummy-coding the categorical variable (returning 3 columns)
    (abalone_data.columns.tolist()[1:-1], StandardScaler())]) # every column except the first (sex) and the last (age) gets scaled

In [17]:
mapper.fit_transform(abalone_data)

array([[ 0.        ,  0.        ,  1.        , ..., -0.60768536,
        -0.72621157, -0.63821689],
       [ 0.        ,  0.        ,  1.        , ..., -1.17090984,
        -1.20522124, -1.21298732],
       [ 1.        ,  0.        ,  0.        , ..., -0.4634999 ,
        -0.35668983, -0.20713907],
       ...,
       [ 0.        ,  0.        ,  1.        , ...,  0.74855917,
         0.97541324,  0.49695471],
       [ 1.        ,  0.        ,  0.        , ...,  0.77334105,
         0.73362741,  0.41073914],
       [ 0.        ,  0.        ,  1.        , ...,  2.64099341,
         1.78744868,  1.84048058]])
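To see why `CountVectorizer(analyzer = my_analyzer)` behaves like a one-hot encoder, here is a minimal sketch on a made-up list of category values (the `toy_sexes` list is an assumption for illustration, not the abalone data). Because the analyzer returns each value as a single token, every row gets a count of 1 in exactly one column.

```python
from sklearn.feature_extraction.text import CountVectorizer

def my_analyzer(x):
    # treat the whole value as one token instead of splitting it into words
    return [x]

toy_sexes = ["M", "F", "I", "M"]  # toy values for illustration only
vectorizer = CountVectorizer(analyzer=my_analyzer)
encoded = vectorizer.fit_transform(toy_sexes).toarray()

# the learned vocabulary is sorted, so the columns are F, I, M
column_names = sorted(vectorizer.vocabulary_)
print(column_names)
print(encoded)
```

This matches the first three columns of the array above, where a male abalone is encoded as `0, 0, 1`.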

In [18]:
from sklearn.pipeline import FeatureUnion, Pipeline

In [19]:
full_pipeline = Pipeline([("all_features", mapper),
                          ("rf_regressor", RandomForestRegressor(n_estimators=100))])

In [20]:
cv_scores = cross_val_score(full_pipeline,
                            abalone_data.iloc[:,:-1],
                            abalone_data.age,
                            cv = 10,
                            scoring = "neg_mean_squared_error",
                            n_jobs = -1,
                            verbose = 1)

[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.3min finished


In [21]:
np.mean(np.sqrt(np.abs(cv_scores)).round(2)) #compute RMSE

2.152
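If `sklearn-pandas` is not installed, a similar pipeline can probably be built with scikit-learn's own `ColumnTransformer`. This is a swapped-in technique, not what the cells above use, and the small synthetic DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# synthetic stand-in for the abalone data (hypothetical values)
rng = np.random.RandomState(0)
n = 60
toy = pd.DataFrame({
    "sex": rng.choice(["M", "F", "I"], size=n),
    "length": rng.rand(n),
    "height": rng.rand(n),
})
toy_target = 10 * toy["length"] + rng.rand(n)

# one-hot encode the categorical column, standard scale the numeric ones
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ("numeric", StandardScaler(), ["length", "height"]),
])
toy_pipeline = Pipeline([
    ("all_features", preprocess),
    ("rf_regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# the encoder and scaler are re-fit on the training part of every fold
neg_mse = cross_val_score(toy_pipeline, toy, toy_target,
                          cv=5, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean())
```

Because the preprocessing lives inside the pipeline, each cross-validation fold only uses its own training rows to learn the scaling statistics.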

#### Exercise Time!

* Change the pipeline so that it performs PCA on the complete set of features (after standard scaling the numeric features and encoding the categorical feature) and keeps only the first 6 components

In [41]:
pass
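One possible approach to the exercise, sketched on synthetic data rather than the abalone set: add a `PCA` step between the feature step and the model. The column names, the toy target and the use of `ColumnTransformer` in place of `DataFrameMapper` are all assumptions made so that the sketch runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.RandomState(0)
n = 80
numeric_names = ["f1", "f2", "f3", "f4", "f5"]  # hypothetical feature names
toy = pd.DataFrame(rng.rand(n, 5), columns=numeric_names)
toy["sex"] = rng.choice(["M", "F", "I"], size=n)
toy_target = 10 * toy["f1"] + rng.rand(n)

features = ColumnTransformer(
    [("categorical", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
     ("numeric", StandardScaler(), numeric_names)],
    sparse_threshold=0,  # always return a dense array for PCA
)
pca_pipeline = Pipeline([
    ("all_features", features),          # 3 one-hot + 5 scaled columns
    ("pca", PCA(n_components=6)),        # keep the first 6 components
    ("rf_regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pca_pipeline.fit(toy, toy_target)

# re-apply the fitted feature and PCA steps to inspect what the model sees
all_features = pca_pipeline.named_steps["all_features"].transform(toy)
reduced = pca_pipeline.named_steps["pca"].transform(all_features)
print(reduced.shape)
```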

Now let's do this on a slightly more involved example, where we will have to do some imputation (filling in of missing values).

Here the process will be as follows:

1. Impute missing categorical values (marked with 0 after encoding) with most frequent category using `Imputer`
2. One-hot encode the categorical columns using `OneHotEncoder`
3. Impute missing numerical values using the median value of each column using `Imputer`
4. Z-score/standardize each numerical column using `StandardScaler`
5. Combine both collections of columns (one-hot encoded categorical columns and standardized numeric columns) using `FeatureUnion`
6. Pass the whole collection to a `RandomForestClassifier` to build a Random Forest classification model.
7. Use `cross_val_score` with 10-fold cross-validation on the entire pipeline.

Ready? Let's start by loading in the data:

In [22]:
# names of the columns
kidney_columns = ["age","bp","sg","al","su","rbc","pc","pcc","ba","bgr","bu","sc","sod","pot","hemo","pcv","wc","rc","htn","dm","cad","appet","pe","ane","class"]

# importing data
kidney_data = pd.read_csv("../day_1/data/chronic_kidney_disease.csv",
                          header = None,
                          na_values = "?",
                          names = kidney_columns)
kidney_data.head()

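|   | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.02 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | 121.0 | ... | 44.0 | 7800.0 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 7.0 | 50.0 | 1.02 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | NaN | ... | 38.0 | 6000.0 | NaN | no | no | no | good | no | no | ckd |
| 2 | 62.0 | 80.0 | 1.01 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | 423.0 | ... | 31.0 | 7500.0 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | 117.0 | ... | 32.0 | 6700.0 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 51.0 | 80.0 | 1.01 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | 106.0 | ... | 35.0 | 7300.0 | 4.6 | no | no | no | good | no | no | ckd |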


Let's rearrange the columns so that the numeric columns are together, followed by all of the categorical columns:

In [23]:
# rearrange kidney columns as before
kidney_columns = kidney_columns[:5] + kidney_columns[9:18] + kidney_columns[5:9] + kidney_columns[18:]

kidney_data = kidney_data[kidney_columns]
kidney_data.head()

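|   | age | bp | sg | al | su | bgr | bu | sc | sod | pot | ... | pc | pcc | ba | htn | dm | cad | appet | pe | ane | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.02 | 1.0 | 0.0 | 121.0 | 36.0 | 1.2 | NaN | NaN | ... | normal | notpresent | notpresent | yes | yes | no | good | no | no | ckd |
| 1 | 7.0 | 50.0 | 1.02 | 4.0 | 0.0 | NaN | 18.0 | 0.8 | NaN | NaN | ... | normal | notpresent | notpresent | no | no | no | good | no | no | ckd |
| 2 | 62.0 | 80.0 | 1.01 | 2.0 | 3.0 | 423.0 | 53.0 | 1.8 | NaN | NaN | ... | normal | notpresent | notpresent | no | yes | no | poor | no | yes | ckd |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | 117.0 | 56.0 | 3.8 | 111.0 | 2.5 | ... | abnormal | present | notpresent | yes | no | no | poor | yes | yes | ckd |
| 4 | 51.0 | 80.0 | 1.01 | 2.0 | 0.0 | 106.0 | 26.0 | 1.4 | NaN | NaN | ... | normal | notpresent | notpresent | no | no | no | good | no | no | ckd |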


In [24]:
kidney_data.dtypes

age      float64
bp       float64
sg       float64
al       float64
su       float64
bgr      float64
bu       float64
sc       float64
sod      float64
pot      float64
hemo     float64
pcv      float64
wc       float64
rc       float64
rbc       object
pc        object
pcc       object
ba        object
htn       object
dm        object
cad       object
appet     object
pe        object
ane       object
class     object
dtype: object

In [25]:
# categorical variables
cat_cols = kidney_data.columns.tolist()[14:-1]

# numerical variables
numeric_cols = kidney_data.columns.tolist()[:14]

In [26]:
# replace ' yes' (with a stray leading space) with "yes"
kidney_data[cat_cols]=kidney_data[cat_cols].replace({' yes':"yes"})

In [27]:
# base classes for writing our own transformer that can be used in any pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# subselect distinct columns in the dataframe 
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
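As a quick check of what `ItemSelector` does, here is a minimal sketch that applies it to a tiny made-up DataFrame (the values are illustrative, and the class is repeated in shortened form so the snippet runs on its own).

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select the column(s) stored under `key`."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self  # nothing to learn, selection is stateless

    def transform(self, data_dict):
        return data_dict[self.key]

toy = pd.DataFrame({"age": [48.0, 7.0],
                    "bp": [80.0, 50.0],
                    "htn": ["yes", "no"]})

# passing a list of column names returns a DataFrame with just those columns
selected = ItemSelector(key=["age", "bp"]).fit_transform(toy)
print(selected.columns.tolist())
```

Since `fit` does nothing, the selector can sit as the first step of any sub-pipeline and simply hand the chosen columns to the next step.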

In [28]:
X,y = kidney_data.iloc[:,:-1], LabelEncoder().fit_transform(kidney_data.iloc[:,-1])

And here is the code for the entire pipeline:

In [41]:
from sklearn.preprocessing import Imputer

X_transformed_pipe = FeatureUnion(
        transformer_list=[
            # Pipeline for filling in missing values, one hot encoding all categorical columns
            ('categoricals_1', Pipeline([
                ('selector', ItemSelector(key = cat_cols)), # select categorical variables 
                ('imputer', Imputer(strategy = "most_frequent")), # choose the mode
                ('encoder', OneHotEncoder(sparse = False))                    
            ])),
            # Pipeline for pulling out numeric features, filling in missing values, and scaling them
            ('numeric', Pipeline([
                ('selector', ItemSelector(key = numeric_cols)), # select numerical variables
                ('imputer', Imputer(strategy = "median")), # choose the median
                ('scaler', StandardScaler()), # standardize (z-score)
            ]))])

full_pipeline = Pipeline([("all_features",X_transformed_pipe), # feature union above
                          ("rf_classifier",RandomForestClassifier())]) # classification pipeline

In [34]:
cv_sk(full_pipeline,
      X,
      y,
      cv = 10)

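ValueError: could not convert string to float: 'no'

This fails because `Imputer` tries to convert its input to floats, so it cannot handle the string-valued categorical columns.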
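A version of this pipeline that runs on string columns can be sketched with `SimpleImputer` from `sklearn.impute` (available in scikit-learn 0.20 and later), which accepts string data for the `most_frequent` strategy. This swaps `SimpleImputer` in for `Imputer` and uses a tiny made-up DataFrame rather than the kidney data, so treat it as an illustration under those assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, data_dict):
        return data_dict[self.key]

# made-up data with missing values in numeric and categorical columns
toy_X = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0, 60.0, 68.0, 24.0, 52.0] * 5,
    "bp": [80.0, 50.0, np.nan, 80.0, 90.0, 70.0, 80.0, 100.0] * 5,
    "htn": ["yes", "no", np.nan, "no", "yes", "no", "no", "yes"] * 5,
})
toy_y = np.array([1, 0, 1, 0, 1, 0, 0, 1] * 5)

toy_features = FeatureUnion([
    ("categoricals", Pipeline([
        ("selector", ItemSelector(key=["htn"])),
        ("imputer", SimpleImputer(strategy="most_frequent")),  # works on strings
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ])),
    ("numeric", Pipeline([
        ("selector", ItemSelector(key=["age", "bp"])),
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])),
])
toy_pipeline = Pipeline([
    ("all_features", toy_features),
    ("rf_classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])

toy_scores = cross_val_score(toy_pipeline, toy_X, toy_y, cv=5)
print(toy_scores.mean())
```

The structure mirrors the `FeatureUnion` above: only the imputer class changes, and the encoder is left at its default sparse output.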

Each pipeline object contains a sequence of steps, which are stored in a list. Each step is a tuple, where the first element is the name you gave the given step, and the second element is the transformation or model you are applying at that step:

In [35]:
my_feature_union = full_pipeline.steps[0][1]

In [36]:
my_feature_union.transformer_list[1][1].steps

[('selector',
  ItemSelector(key=['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc'])),
 ('imputer',
  Imputer(axis=0, copy=True, missing_values='NaN', strategy='median', verbose=0)),
 ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))]
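Besides indexing into `steps` by position, a step can also be looked up by the name you gave it through the pipeline's `named_steps` attribute. A minimal sketch on a made-up two-step pipeline (the step names here are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline([("scaler", StandardScaler()),
                     ("logreg", LogisticRegression())])

# positional access: each step is a (name, estimator) tuple
name, estimator = toy_pipe.steps[0]

# access by name returns the very same estimator object
same_estimator = toy_pipe.named_steps["scaler"]
print(name, estimator is same_estimator)
```

Looking steps up by name tends to be more robust than positional indexing when steps are later added or reordered.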

Let's take a look at a few steps:

In [42]:
print("The first step in the pipeline:\n",full_pipeline.steps[0])
print()
print("The second step in the pipeline: \n", full_pipeline.steps[1])
print()
print("The second step's transformation/model:\n", full_pipeline.steps[1][1])
# print("Since we know this is a random forest model, lets try to get the models feature importances:\n",full_pipeline.steps[1][1].feature_importances_)

The first step in the pipeline:
 ('all_features', FeatureUnion(n_jobs=1,
       transformer_list=[('categoricals_1', Pipeline(memory=None,
     steps=[('selector', ItemSelector(key=['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane'])), ('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='most_frequent',
    verbose=0)), ('encoder', OneHotEncod...tegy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True))]))],
       transformer_weights=None))

The second step in the pipeline: 
 ('rf_classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))

The 

Remember, in order to be able to get feature importances or coefficients of a given model, it needs to be trained first. Just like any other transformation in sklearn, you can fit a pipeline by calling its `fit` method:

In [38]:
# fit the whole pipeline (preprocessing and model) on the data
full_pipeline.fit(X,y)

ValueError: could not convert string to float: 'no'

In [39]:
full_pipeline.predict(X)

NotFittedError: This Imputer instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

Once the pipeline has been fit successfully, we can extract the feature importances as we wanted:

In [171]:
full_pipeline.steps[1][1].feature_importances_.round(3)

array([0.005, 0.003, 0.011, 0.004, 0.   , 0.   , 0.   , 0.   , 0.043,
       0.006, 0.001, 0.027, 0.   , 0.   , 0.001, 0.   , 0.017, 0.005,
       0.002, 0.   , 0.001, 0.01 , 0.058, 0.084, 0.002, 0.031, 0.004,
       0.086, 0.011, 0.006, 0.286, 0.139, 0.013, 0.144])
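The fit-before-inspect requirement can be checked with a small sketch on synthetic data (the arrays and step names below are made up for illustration): asking for `feature_importances_` before fitting raises `NotFittedError`, and after fitting it returns one value per feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
toy_X = rng.rand(40, 3)
toy_y = (toy_X[:, 0] > 0.5).astype(int)

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf_classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

try:
    toy_pipe.named_steps["rf_classifier"].feature_importances_
    available_before_fit = True
except NotFittedError:
    available_before_fit = False

# fitting the pipeline fits every step, including the final model
toy_pipe.fit(toy_X, toy_y)
importances = toy_pipe.named_steps["rf_classifier"].feature_importances_
print(available_before_fit, importances.round(3))
```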

What's really great about pipelines is that you can pass them to `GridSearchCV` and search across parameters to tune your models. To do so, add an entry to `param_grid` whose key is the name you gave the step, followed by two underscores and the parameter name (for example `rf_classifier__n_estimators`), and whose value is the list of settings you want to test:
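Here is a minimal sketch of that naming scheme on synthetic numeric data, so that no imputation is needed. The arrays, step names and parameter values are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
toy_X = rng.rand(60, 4)
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 1).astype(int)

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf_classifier", RandomForestClassifier(random_state=0)),
])

# keys follow the pattern <step name>__<parameter name>
toy_grid = {
    "rf_classifier__n_estimators": [5, 10],
    "rf_classifier__max_depth": [2, None],
}

search = GridSearchCV(toy_pipe, toy_grid, cv=3, scoring="accuracy")
search.fit(toy_X, toy_y)
print(search.best_params_)
```

Every combination in the grid (four here) is cross-validated with the full pipeline, so the scaler is re-fit inside each fold as well.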

In [47]:
# when using GridSearchCV with this Pipeline we must impute first, otherwise it throws an error; this should change in the next sklearn release
X[cat_cols] = Imputer(strategy = "most_frequent").fit_transform(X[cat_cols])
X[numeric_cols] = Imputer(strategy="median").fit_transform(X[numeric_cols])

ValueError: could not convert string to float: 'no'

In [62]:
# for categorical values
def get_mode(my_column):
    return my_column.value_counts().index[0]

mode_per_column = X[cat_cols].apply(get_mode, axis = 0)

In [63]:
X[cat_cols].isnull().sum()

rbc      152
pc        65
pcc        4
ba         4
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
dtype: int64

In [64]:
median_per_columns = X[numeric_cols].apply(lambda x: x.median(), axis = 0)

In [65]:
X[numeric_cols] = X[numeric_cols].fillna(median_per_columns, axis = 0)
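The cells above compute `mode_per_column` but only apply the median fill to the numeric columns. A minimal sketch of applying both fills with `fillna`, shown on a tiny made-up DataFrame (the `toy_*` names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "htn": ["yes", "no", np.nan, "no"],
})
toy_cat_cols, toy_numeric_cols = ["htn"], ["age"]

# most frequent value per categorical column, median per numeric column
toy_mode_per_column = toy[toy_cat_cols].apply(
    lambda col: col.value_counts().index[0])
toy_median_per_column = toy[toy_numeric_cols].median()

# fillna with a Series fills each column with the value stored under its name
toy[toy_cat_cols] = toy[toy_cat_cols].fillna(toy_mode_per_column)
toy[toy_numeric_cols] = toy[toy_numeric_cols].fillna(toy_median_per_column)
print(toy)
```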

In [66]:
X

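|   | age | bp | sg | al | su | bgr | bu | sc | sod | pot | ... | rbc | pc | pcc | ba | htn | dm | cad | appet | pe | ane |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.020 | 1.0 | 0.0 | 121.0 | 36.0 | 1.2 | 138.0 | 4.4 | ... | NaN | normal | notpresent | notpresent | yes | yes | no | good | no | no |
| 1 | 7.0 | 50.0 | 1.020 | 4.0 | 0.0 | 121.0 | 18.0 | 0.8 | 138.0 | 4.4 | ... | NaN | normal | notpresent | notpresent | no | no | no | good | no | no |
| 2 | 62.0 | 80.0 | 1.010 | 2.0 | 3.0 | 423.0 | 53.0 | 1.8 | 138.0 | 4.4 | ... | normal | normal | notpresent | notpresent | no | yes | no | poor | no | yes |
| 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | 117.0 | 56.0 | 3.8 | 111.0 | 2.5 | ... | normal | abnormal | present | notpresent | yes | no | no | poor | yes | yes |
| 4 | 51.0 | 80.0 | 1.010 | 2.0 | 0.0 | 106.0 | 26.0 | 1.4 | 138.0 | 4.4 | ... | normal | normal | notpresent | notpresent | no | no | no | good | no | no |
| 5 | 60.0 | 90.0 | 1.015 | 3.0 | 0.0 | 74.0 | 25.0 | 1.1 | 142.0 | 3.2 | ... | NaN | NaN | notpresent | notpresent | yes | yes | no | good | yes | no |
| 6 | 68.0 | 70.0 | 1.010 | 0.0 | 0.0 | 100.0 | 54.0 | 24.0 | 104.0 | 4.0 | ... | NaN | normal | notpresent | notpresent | no | no | no | good | no | no |
| 7 | 24.0 | 80.0 | 1.015 | 2.0 | 4.0 | 410.0 | 31.0 | 1.1 | 138.0 | 4.4 | ... | normal | abnormal | notpresent | notpresent | no | yes | no | good | yes | no |
| 8 | 52.0 | 100.0 | 1.015 | 3.0 | 0.0 | 138.0 | 60.0 | 1.9 | 138.0 | 4.4 | ... | normal | abnormal | present | notpresent | yes | yes | no | good | no | yes |
| 9 | 53.0 | 90.0 | 1.020 | 2.0 | 0.0 | 70.0 | 107.0 | 7.2 | 114.0 | 3.7 | ... | abnormal | abnormal | present | notpresent | yes | yes | no | poor | no | yes |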


In [72]:
X_no_imputation_transformed_pipe = FeatureUnion(
        transformer_list=[
            # Pipeline for one hot encoding all categorical columns
            ('categoricals_1', Pipeline([
                ('selector', ItemSelector(key=cat_cols)),
                ('encoder', OneHotEncoder(sparse=False))                    
            ])),
            # Pipeline for pulling out numeric features and scaling them
            ('numeric', Pipeline([
                ('selector', ItemSelector(key=numeric_cols)),
                ('scaler', StandardScaler()),
            ]))])

full_pipeline = Pipeline([("all_features",X_no_imputation_transformed_pipe),
                          ("rf_classifier",RandomForestClassifier())])

In [70]:
full_pipeline.fit(X, y)

ValueError: could not convert string to float: 'no'

In [73]:
# when using GridSearchCV with this Pipeline we must impute first, otherwise it will throw an error
from sklearn.model_selection import GridSearchCV


estimators_range = [20,50,100]
param_grid = dict(rf_classifier__n_estimators = estimators_range)

grid = GridSearchCV(full_pipeline, 
                    param_grid, 
                    cv = 20, 
                    scoring = 'accuracy',
                    n_jobs=-1)
grid.fit(X, y)

JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
    435                           exc_info=True)
    436             # Re-raise the exception so that IOLoop.handle_callback_exception

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tornado/stack_context.py in null_wrapper(*args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    295         # Fast path when there are no active contexts.
    296         def null_wrapper(*args, **kwargs):
    297             try:
    298                 current_state = _state.contexts
    299                 _state.contexts = cap_contexts[0]
--> 300                 return fn(*args, **kwargs)
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    301             finally:
    302                 _state.contexts = current_state
    303         null_wrapper._wrapped = True
    304         return null_wrapper

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/kernelbase.py in dispatcher(msg=[<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>])
    278         if self.control_stream:
    279             self.control_stream.on_recv(self.dispatch_control, copy=False)
    280 
    281         def make_dispatcher(stream):
    282             def dispatcher(msg):
--> 283                 return self.dispatch_shell(stream, msg)
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    284             return dispatcher
    285 
    286         for s in self.shell_streams:
    287             s.on_recv(make_dispatcher(s), copy=False)

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/kernelbase.py in dispatch_shell(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, msg={'buffers': [], 'content': {'allow_stdin': True, 'code': "# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 9, 16, 16, 58, 12, 29223, tzinfo=tzutc()), 'msg_id': '8a2630cdb4034d73ab741dbfb41fbf39', 'msg_type': 'execute_request', 'session': '1dfdc30e3599423a92970be11d5a27ed', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': '8a2630cdb4034d73ab741dbfb41fbf39', 'msg_type': 'execute_request', 'parent_header': {}})
    228             self.log.warning("Unknown message type: %r", msg_type)
    229         else:
    230             self.log.debug("%s: %s", msg_type, msg)
    231             self.pre_handler_hook()
    232             try:
--> 233                 handler(stream, idents, msg)
        handler = <bound method Kernel.execute_request of <ipykernel.ipkernel.IPythonKernel object>>
        stream = <zmq.eventloop.zmqstream.ZMQStream object>
        idents = [b'1dfdc30e3599423a92970be11d5a27ed']
        msg = {'buffers': [], 'content': {'allow_stdin': True, 'code': "# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 9, 16, 16, 58, 12, 29223, tzinfo=tzutc()), 'msg_id': '8a2630cdb4034d73ab741dbfb41fbf39', 'msg_type': 'execute_request', 'session': '1dfdc30e3599423a92970be11d5a27ed', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': '8a2630cdb4034d73ab741dbfb41fbf39', 'msg_type': 'execute_request', 'parent_header': {}}
    234             except Exception:
    235                 self.log.error("Exception in message handler:", exc_info=True)
    236             finally:
    237                 self.post_handler_hook()

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/kernelbase.py in execute_request(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, ident=[b'1dfdc30e3599423a92970be11d5a27ed'], parent={'buffers': [], 'content': {'allow_stdin': True, 'code': "# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)", 'silent': False, 'stop_on_error': True, 'store_history': True, 'user_expressions': {}}, 'header': {'date': datetime.datetime(2018, 9, 16, 16, 58, 12, 29223, tzinfo=tzutc()), 'msg_id': '8a2630cdb4034d73ab741dbfb41fbf39', 'msg_type': 'execute_request', 'session': '1dfdc30e3599423a92970be11d5a27ed', 'username': 'username', 'version': '5.2'}, 'metadata': {}, 'msg_id': '8a2630cdb4034d73ab741dbfb41fbf39', 'msg_type': 'execute_request', 'parent_header': {}})
    394         if not silent:
    395             self.execution_count += 1
    396             self._publish_execute_input(code, parent, self.execution_count)
    397 
    398         reply_content = self.do_execute(code, silent, store_history,
--> 399                                         user_expressions, allow_stdin)
        user_expressions = {}
        allow_stdin = True
    400 
    401         # Flush output before sending the reply.
    402         sys.stdout.flush()
    403         sys.stderr.flush()

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/ipkernel.py in do_execute(self=<ipykernel.ipkernel.IPythonKernel object>, code="# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)", silent=False, store_history=True, user_expressions={}, allow_stdin=True)
    203 
    204         self._forward_input(allow_stdin)
    205 
    206         reply_content = {}
    207         try:
--> 208             res = shell.run_cell(code, store_history=store_history, silent=silent)
        res = undefined
        shell.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = "# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)"
        store_history = True
        silent = False
    209         finally:
    210             self._restore_input()
    211 
    212         if res.error_before_exec is not None:

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel/zmqshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, *args=("# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)",), **kwargs={'silent': False, 'store_history': True})
    532             )
    533         self.payload_manager.write_payload(payload)
    534 
    535     def run_cell(self, *args, **kwargs):
    536         self._last_traceback = None
--> 537         return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
        self.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        args = ("# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)",)
        kwargs = {'silent': False, 'store_history': True}
    538 
    539     def _showtraceback(self, etype, evalue, stb):
    540         # try to preserve ordering of tracebacks and print statements
    541         sys.stdout.flush()

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell="# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)", store_history=True, silent=False, shell_futures=True)
   2657         -------
   2658         result : :class:`ExecutionResult`
   2659         """
   2660         try:
   2661             result = self._run_cell(
-> 2662                 raw_cell, store_history, silent, shell_futures)
        raw_cell = "# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)"
        store_history = True
        silent = False
        shell_futures = True
   2663         finally:
   2664             self.events.trigger('post_execute')
   2665             if not silent:
   2666                 self.events.trigger('post_run_cell', result)

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py in _run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell="# using GridSearchCV with Pipeline, must impute ...0, scoring = 'accuracy',n_jobs=-1)\ngrid.fit(X, y)", store_history=True, silent=False, shell_futures=True)
   2780                 self.displayhook.exec_result = result
   2781 
   2782                 # Execute the user code
   2783                 interactivity = 'none' if silent else self.ast_node_interactivity
   2784                 has_raised = self.run_ast_nodes(code_ast.body, cell_name,
-> 2785                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler object>
   2786                 
   2787                 self.last_execution_succeeded = not has_raised
   2788                 self.last_execution_result = result
   2789 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.ImportFrom object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Expr object>], cell_name='<ipython-input-73-5d96fb502e01>', interactivity='last', compiler=<IPython.core.compilerop.CachingCompiler object>, result=<ExecutionResult object at 10dd85278, execution_...rue silent=False shell_futures=True> result=None>)
   2902                     return True
   2903 
   2904             for i, node in enumerate(to_run_interactive):
   2905                 mod = ast.Interactive([node])
   2906                 code = compiler(mod, cell_name, "single")
-> 2907                 if self.run_code(code, result):
        self.run_code = <bound method InteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 0x105caac00, file "<ipython-input-73-5d96fb502e01>", line 9>
        result = <ExecutionResult object at 10dd85278, execution_...rue silent=False shell_futures=True> result=None>
   2908                     return True
   2909 
   2910             # Flush softspace
   2911             if softspace(sys.stdout, 0):

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 0x105caac00, file "<ipython-input-73-5d96fb502e01>", line 9>, result=<ExecutionResult object at 10dd85278, execution_...rue silent=False shell_futures=True> result=None>)
   2956         outflag = True  # happens in more places, so it's easier as default
   2957         try:
   2958             try:
   2959                 self.hooks.pre_run_code_hook()
   2960                 #rprint('Running code', repr(code_obj)) # dbg
-> 2961                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 0x105caac00, file "<ipython-input-73-5d96fb502e01>", line 9>
        self.user_global_ns = {'BaseEstimator': <class 'sklearn.base.BaseEstimator'>, 'CountVectorizer': <class 'sklearn.feature_extraction.text.CountVectorizer'>, 'DataFrameMapper': <class 'sklearn_pandas.dataframe_mapper.DataFrameMapper'>, 'FeatureUnion': <class 'sklearn.pipeline.FeatureUnion'>, 'GridSearchCV': <class 'sklearn.model_selection._search.GridSearchCV'>, 'Imputer': <class 'sklearn.preprocessing.imputation.Imputer'>, 'In': ['', "#data handling, model creation/evaluation\nimport...gic('matplotlib', 'inline')\nimport seaborn as sns", "#data handling, model creation/evaluation\nimport...gic('matplotlib', 'inline')\nimport seaborn as sns", 'columns = ["sex","length","diam","height","whole.../abalone.data",names=columns)\nabalone_data.head()', '#get categorical features\n#drop off last column ...nal = pd.concat((X_numeric,X_categorical),axis=1)', "#data handling, model creation/evaluation\nimport...gic('matplotlib', 'inline')\nimport seaborn as sns", 'columns = ["sex","length","diam","height","whole.../abalone.data",names=columns)\nabalone_data.head()', '#get categorical features\n#drop off last column ...nal = pd.concat((X_numeric,X_categorical),axis=1)', 'Sys', 'sys.argv', 'import sys', 'import sys\nsys.path', '#get categorical features\n#drop off last column ...nal = pd.concat((X_numeric,X_categorical),axis=1)', 'X_final', '#create rf regressor and check 10-fold RMSE\nrf =...fold rmse: ", str(np.std(rmse_cross_val_scores)))', 'mapper.fit_transform(abalone_data)', 'def my_analyzer(x):\n    return [x]\n\n#convert cat... first column gets scaled (columns first to last)', 'mapper.fit_transform(abalone_data)', 'from sklearn.pipeline import FeatureUnion, Pipeline', 'full_pipeline = Pipeline([("all_features", mappe...ssor", RandomForestRegressor(n_estimators=100))])', ...], 'ItemSelector': <class '__main__.ItemSelector'>, 'KFold': <class 'sklearn.model_selection._split.KFold'>, 'LabelEncoder': <class 'sklearn.preprocessing.label.LabelEncoder'>, ...}
        self.user_ns = {'BaseEstimator': <class 'sklearn.base.BaseEstimator'>, 'CountVectorizer': <class 'sklearn.feature_extraction.text.CountVectorizer'>, 'DataFrameMapper': <class 'sklearn_pandas.dataframe_mapper.DataFrameMapper'>, 'FeatureUnion': <class 'sklearn.pipeline.FeatureUnion'>, 'GridSearchCV': <class 'sklearn.model_selection._search.GridSearchCV'>, 'Imputer': <class 'sklearn.preprocessing.imputation.Imputer'>, 'In': ['', "#data handling, model creation/evaluation\nimport...gic('matplotlib', 'inline')\nimport seaborn as sns", "#data handling, model creation/evaluation\nimport...gic('matplotlib', 'inline')\nimport seaborn as sns", 'columns = ["sex","length","diam","height","whole.../abalone.data",names=columns)\nabalone_data.head()', '#get categorical features\n#drop off last column ...nal = pd.concat((X_numeric,X_categorical),axis=1)', "#data handling, model creation/evaluation\nimport...gic('matplotlib', 'inline')\nimport seaborn as sns", 'columns = ["sex","length","diam","height","whole.../abalone.data",names=columns)\nabalone_data.head()', '#get categorical features\n#drop off last column ...nal = pd.concat((X_numeric,X_categorical),axis=1)', 'Sys', 'sys.argv', 'import sys', 'import sys\nsys.path', '#get categorical features\n#drop off last column ...nal = pd.concat((X_numeric,X_categorical),axis=1)', 'X_final', '#create rf regressor and check 10-fold RMSE\nrf =...fold rmse: ", str(np.std(rmse_cross_val_scores)))', 'mapper.fit_transform(abalone_data)', 'def my_analyzer(x):\n    return [x]\n\n#convert cat... first column gets scaled (columns first to last)', 'mapper.fit_transform(abalone_data)', 'from sklearn.pipeline import FeatureUnion, Pipeline', 'full_pipeline = Pipeline([("all_features", mappe...ssor", RandomForestRegressor(n_estimators=100))])', ...], 'ItemSelector': <class '__main__.ItemSelector'>, 'KFold': <class 'sklearn.model_selection._split.KFold'>, 'LabelEncoder': <class 'sklearn.preprocessing.label.LabelEncoder'>, ...}
   2962             finally:
   2963                 # Reset our crash handler in place
   2964                 sys.excepthook = old_excepthook
   2965         except SystemExit as e:

...........................................................................
/Users/suesong/ml_workshop/2018-ml-workshop/day_3/<ipython-input-73-5d96fb502e01> in <module>()
      4 
      5 estimators_range = [20,50,100]
      6 param_grid = dict(rf_classifier__n_estimators = estimators_range)
      7 
      8 grid = GridSearchCV(full_pipeline, param_grid, cv=20, scoring = 'accuracy',n_jobs=-1)
----> 9 grid.fit(X, y)

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self=GridSearchCV(cv=20, error_score='raise',
       ...ore='warn',
       scoring='accuracy', verbose=0), X=      age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[400 rows x 24 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1]), groups=None, **fit_params={})
    635                                   return_train_score=self.return_train_score,
    636                                   return_n_test_samples=True,
    637                                   return_times=True, return_parameters=False,
    638                                   error_score=self.error_score)
    639           for parameters, (train, test) in product(candidate_params,
--> 640                                                    cv.split(X, y, groups)))
        cv.split = <bound method StratifiedKFold.split of Stratifie...d(n_splits=20, random_state=None, shuffle=False)>
        X =       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[400 rows x 24 columns]
        y = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1])
        groups = None
    641 
    642         # if one choose to see train score, "out" will contain train score info
    643         if self.return_train_score:
    644             (train_score_dicts, test_score_dicts, test_sample_counts, fit_time,

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=-1), iterable=<generator object BaseSearchCV.fit.<locals>.<genexpr>>)
    784             if pre_dispatch == "all" or n_jobs == 1:
    785                 # The iterable was consumed all at once by the above for loop.
    786                 # No need to wait for async callbacks to trigger to
    787                 # consumption.
    788                 self._iterating = False
--> 789             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=-1)>
    790             # Make sure that we get a last message telling us we are done
    791             elapsed_time = time.time() - self._start_time
    792             self._print('Done %3i out of %3i | elapsed: %s finished',
    793                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError                                         Sun Sep 16 12:58:12 2018
PID: 3510                              Python 3.6.5: /usr/local/bin/python3
...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _fit_and_score>, (Pipeline(memory=None,
     steps=[('all_features...None, verbose=0,
            warm_start=False))]),       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[400 rows x 24 columns], array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1]), {'score': make_scorer(accuracy_score)}, array([ 13,  14,  15,  16,  17,  18,  19,  20,  ..., 392, 393, 394, 395, 396, 397,
       398, 399]), array([  0,   1,   2,   3,   4,   5,   6,   7,  ...,
       250, 251, 252, 253, 254, 255, 256, 257]), 0, {'rf_classifier__n_estimators': 20}), {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': False, 'return_times': True, 'return_train_score': 'warn'})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_and_score>
        args = (Pipeline(memory=None,
     steps=[('all_features...None, verbose=0,
            warm_start=False))]),       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[400 rows x 24 columns], array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1]), {'score': make_scorer(accuracy_score)}, array([ 13,  14,  15,  16,  17,  18,  19,  20,  ..., 392, 393, 394, 395, 396, 397,
       398, 399]), array([  0,   1,   2,   3,   4,   5,   6,   7,  ...,
       250, 251, 252, 253, 254, 255, 256, 257]), 0, {'rf_classifier__n_estimators': 20})
        kwargs = {'error_score': 'raise', 'fit_params': {}, 'return_n_test_samples': True, 'return_parameters': False, 'return_times': True, 'return_train_score': 'warn'}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator=Pipeline(memory=None,
     steps=[('all_features...None, verbose=0,
            warm_start=False))]), X=      age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[400 rows x 24 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1]), scorer={'score': make_scorer(accuracy_score)}, train=array([ 13,  14,  15,  16,  17,  18,  19,  20,  ..., 392, 393, 394, 395, 396, 397,
       398, 399]), test=array([  0,   1,   2,   3,   4,   5,   6,   7,  ...,
       250, 251, 252, 253, 254, 255, 256, 257]), verbose=0, parameters={'rf_classifier__n_estimators': 20}, fit_params={}, return_train_score='warn', return_parameters=False, return_n_test_samples=True, return_times=True, error_score='raise')
    453 
    454     try:
    455         if y_train is None:
    456             estimator.fit(X_train, **fit_params)
    457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
        estimator.fit = <bound method Pipeline.fit of Pipeline(memory=No...one, verbose=0,
            warm_start=False))])>
        X_train =       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns]
        y_train = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])
        fit_params = {}
    459 
    460     except Exception as e:
    461         # Note fit time as time until error
    462         fit_time = time.time() - start_time

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self=Pipeline(memory=None,
     steps=[('all_features...None, verbose=0,
            warm_start=False))]), X=      age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]), **fit_params={})
    243         Returns
    244         -------
    245         self : Pipeline
    246             This estimator
    247         """
--> 248         Xt, fit_params = self._fit(X, y, **fit_params)
        Xt = undefined
        fit_params = {}
        self._fit = <bound method Pipeline._fit of Pipeline(memory=N...one, verbose=0,
            warm_start=False))])>
        X =       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns]
        y = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])
    249         if self._final_estimator is not None:
    250             self._final_estimator.fit(Xt, y, **fit_params)
    251         return self
    252 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self=Pipeline(memory=None,
     steps=[('all_features...None, verbose=0,
            warm_start=False))]), X=      age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]), **fit_params={})
    208                 else:
    209                     cloned_transformer = clone(transformer)
    210                 # Fit or load from cache the current transfomer
    211                 Xt, fitted_transformer = fit_transform_one_cached(
    212                     cloned_transformer, None, Xt, y,
--> 213                     **fit_params_steps[name])
        fit_params_steps = {'all_features': {}, 'rf_classifier': {}}
        name = 'all_features'
    214                 # Replace the transformer of the step with the fitted
    215                 # transformer. This is necessary when loading the transformer
    216                 # from the cache.
    217                 self.steps[step_idx] = (name, fitted_transformer)

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py in __call__(self=NotMemorizedFunc(func=<function _fit_transform_one at 0x119b430d0>), *args=(FeatureUnion(n_jobs=1,
       transformer_list=[..._std=True))]))],
       transformer_weights=None), None,       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])), **kwargs={})
    357     # Should be a light as possible (for speed)
    358     def __init__(self, func):
    359         self.func = func
    360 
    361     def __call__(self, *args, **kwargs):
--> 362         return self.func(*args, **kwargs)
        self.func = <function _fit_transform_one>
        args = (FeatureUnion(n_jobs=1,
       transformer_list=[..._std=True))]))],
       transformer_weights=None), None,       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]))
        kwargs = {}
    363 
    364     def call_and_shelve(self, *args, **kwargs):
    365         return NotMemorizedResult(self.func(*args, **kwargs))
    366 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer=FeatureUnion(n_jobs=1,
       transformer_list=[..._std=True))]))],
       transformer_weights=None), weight=None, X=      age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]), **fit_params={})
    576 
    577 
    578 def _fit_transform_one(transformer, weight, X, y,
    579                        **fit_params):
    580     if hasattr(transformer, 'fit_transform'):
--> 581         res = transformer.fit_transform(X, y, **fit_params)
        res = undefined
        transformer.fit_transform = <bound method FeatureUnion.fit_transform of Feat...std=True))]))],
       transformer_weights=None)>
        X =       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns]
        y = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])
        fit_params = {}
    582     else:
    583         res = transformer.fit(X, y, **fit_params).transform(X)
    584     # if we have a weight for this transformer, multiply output
    585     if weight is None:

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self=FeatureUnion(n_jobs=1,
       transformer_list=[..._std=True))]))],
       transformer_weights=None), X=      age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]), **fit_params={})
    734         """
    735         self._validate_transformers()
    736         result = Parallel(n_jobs=self.n_jobs)(
    737             delayed(_fit_transform_one)(trans, weight, X, y,
    738                                         **fit_params)
--> 739             for name, trans, weight in self._iter())
        self._iter = <bound method FeatureUnion._iter of FeatureUnion...std=True))]))],
       transformer_weights=None)>
    740 
    741         if not result:
    742             # All transformers are None
    743             return np.zeros((X.shape[0], 0))

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=1), iterable=<generator object FeatureUnion.fit_transform.<locals>.<genexpr>>)
    774         self.n_completed_tasks = 0
    775         try:
    776             # Only set self._iterating to True if at least a batch
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
        self.dispatch_one_batch = <bound method Parallel.dispatch_one_batch of Parallel(n_jobs=1)>
        iterator = <generator object FeatureUnion.fit_transform.<locals>.<genexpr>>
    780                 self._iterating = True
    781             else:
    782                 self._iterating = False
    783 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self=Parallel(n_jobs=1), iterator=<generator object FeatureUnion.fit_transform.<locals>.<genexpr>>)
    620             tasks = BatchedCalls(itertools.islice(iterator, batch_size))
    621             if len(tasks) == 0:
    622                 # No more tasks available in the iterator: tell caller to stop.
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
        self._dispatch = <bound method Parallel._dispatch of Parallel(n_jobs=1)>
        tasks = <sklearn.externals.joblib.parallel.BatchedCalls object>
    626                 return True
    627 
    628     def _print(self, msg, msg_args):
    629         """Display the message on stout or stderr depending on verbosity"""

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self=Parallel(n_jobs=1), batch=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    583         self.n_dispatched_tasks += len(batch)
    584         self.n_dispatched_batches += 1
    585 
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
        job = undefined
        self._backend.apply_async = <bound method SequentialBackend.apply_async of <...lib._parallel_backends.SequentialBackend object>>
        batch = <sklearn.externals.joblib.parallel.BatchedCalls object>
        cb = <sklearn.externals.joblib.parallel.BatchCompletionCallBack object>
    589         self._jobs.append(job)
    590 
    591     def dispatch_next(self):
    592         """Dispatch more data for parallel processing

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self=<sklearn.externals.joblib._parallel_backends.SequentialBackend object>, func=<sklearn.externals.joblib.parallel.BatchedCalls object>, callback=<sklearn.externals.joblib.parallel.BatchCompletionCallBack object>)
    106             raise ValueError('n_jobs == 0 in Parallel has no meaning')
    107         return 1
    108 
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
        result = undefined
        func = <sklearn.externals.joblib.parallel.BatchedCalls object>
    112         if callback:
    113             callback(result)
    114         return result
    115 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self=<sklearn.externals.joblib._parallel_backends.ImmediateResult object>, batch=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    327 
    328 class ImmediateResult(object):
    329     def __init__(self, batch):
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
        self.results = undefined
        batch = <sklearn.externals.joblib.parallel.BatchedCalls object>
    333 
    334     def get(self):
    335         return self.results
    336 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function _fit_transform_one>, (Pipeline(memory=None,
     steps=[('selector', I...nknown='error', n_values='auto', sparse=False))]), None,       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])), {})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0=<list_iterator object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        func = <function _fit_transform_one>
        args = (Pipeline(memory=None,
     steps=[('selector', I...nknown='error', n_values='auto', sparse=False))]), None,       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]))
        kwargs = {}
    132 
    133     def __len__(self):
    134         return self._size
    135 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer=Pipeline(memory=None,
     steps=[('selector', I...nknown='error', n_values='auto', sparse=False))]), weight=None, X=      age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]), **fit_params={})
    576 
    577 
    578 def _fit_transform_one(transformer, weight, X, y,
    579                        **fit_params):
    580     if hasattr(transformer, 'fit_transform'):
--> 581         res = transformer.fit_transform(X, y, **fit_params)
        res = undefined
        transformer.fit_transform = <bound method Pipeline.fit_transform of Pipeline...known='error', n_values='auto', sparse=False))])>
        X =       age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns]
        y = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])
        fit_params = {}
    582     else:
    583         res = transformer.fit(X, y, **fit_params).transform(X)
    584     # if we have a weight for this transformer, multiply output
    585     if weight is None:

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self=Pipeline(memory=None,
     steps=[('selector', I...nknown='error', n_values='auto', sparse=False))]), X=      age     bp     sg   al   su    bgr     bu ...o   no  good   no   no  

[379 rows x 24 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]), **fit_params={})
    278             Transformed samples
    279         """
    280         last_step = self._final_estimator
    281         Xt, fit_params = self._fit(X, y, **fit_params)
    282         if hasattr(last_step, 'fit_transform'):
--> 283             return last_step.fit_transform(Xt, y, **fit_params)
        last_step.fit_transform = <bound method OneHotEncoder.fit_transform of One..._unknown='error', n_values='auto', sparse=False)>
        Xt =           rbc        pc         pcc          ba ... no   no  good   no   no

[379 rows x 10 columns]
        y = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])
        fit_params = {}
    284         elif last_step is None:
    285             return Xt
    286         else:
    287             return last_step.fit(Xt, y, **fit_params).transform(Xt)

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/data.py in fit_transform(self=OneHotEncoder(categorical_features='all', dtype=...e_unknown='error', n_values='auto', sparse=False), X=          rbc        pc         pcc          ba ... no   no  good   no   no

[379 rows x 10 columns], y=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1]))
   2014         ----------
   2015         X : array-like, shape [n_samples, n_feature]
   2016             Input array of type int.
   2017         """
   2018         return _transform_selected(X, self._fit_transform,
-> 2019                                    self.categorical_features, copy=True)
        self.categorical_features = 'all'
   2020 
   2021     def _transform(self, X):
   2022         """Assumes X contains only categorical features."""
   2023         X = check_array(X, dtype=np.int)

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/data.py in _transform_selected(X=          rbc        pc         pcc          ba ... no   no  good   no   no

[379 rows x 10 columns], transform=<bound method OneHotEncoder._fit_transform of On..._unknown='error', n_values='auto', sparse=False)>, selected='all', copy=True)
   1804 
   1805     Returns
   1806     -------
   1807     X : array or sparse matrix, shape=(n_samples, n_features_new)
   1808     """
-> 1809     X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
        X =           rbc        pc         pcc          ba ... no   no  good   no   no

[379 rows x 10 columns]
        copy = True
   1810 
   1811     if isinstance(selected, six.string_types) and selected == "all":
   1812         return transform(X)
   1813 

...........................................................................
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array=          rbc        pc         pcc          ba ... no   no  good   no   no

[379 rows x 10 columns], accept_sparse='csc', dtype=<class 'numpy.float64'>, order=None, copy=True, force_all_finite=True, ensure_2d=True, allow_nd=False, ensure_min_samples=1, ensure_min_features=1, warn_on_dtype=False, estimator=None)
    428 
    429     if sp.issparse(array):
    430         array = _ensure_sparse_format(array, accept_sparse, dtype, copy,
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
        array =           rbc        pc         pcc          ba ... no   no  good   no   no

[379 rows x 10 columns]
        dtype = <class 'numpy.float64'>
        order = None
        copy = True
    434 
    435         if ensure_2d:
    436             if array.ndim == 1:
    437                 raise ValueError(

ValueError: could not convert string to float: 'no'
___________________________________________________________________________
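
The traceback above ends in `ValueError: could not convert string to float: 'no'`. The `OneHotEncoder` in the scikit-learn release shown there validates its input as numeric, so string columns such as `rbc` and `pc` cannot be passed to it directly. Here is a minimal sketch of one way around this, assuming a newer scikit-learn (0.20 or later), where `OneHotEncoder` accepts string categories. The toy DataFrame and its values are made up for illustration; only the column names come from the traceback.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; only the column names echo the traceback.
toy = pd.DataFrame({
    "rbc": ["normal", "abnormal", "normal", "normal"],
    "pc": ["no", "yes", "no", "no"],
})

# Assumes scikit-learn >= 0.20, where string categories are accepted directly.
encoder = OneHotEncoder(handle_unknown="ignore")

# The default output is a sparse matrix, so convert it to a dense array.
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_)  # categories learned for each column
print(encoded.shape)        # one output column per (column, category) pair
```

On the older release shown in the traceback, the categorical columns would need to be mapped to integers first (for example with `LabelEncoder` on each column) before one-hot encoding.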

In [None]:
## n_estimators sets the number of trees in the random forest
## estimators_range is the list of candidate values to search over

param_grid = dict(rf_classifier__n_estimators=estimators_range)
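
Keys in `param_grid` follow the `<step name>__<parameter name>` convention, so each key has to match both a step in the pipeline and a real parameter of that step. The sketch below shows one way to check which keys are valid, using `get_params()`. The pipeline here is a stand-in: `scaler` is a hypothetical step name, while `rf_classifier` matches the step name used in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: "scaler" is a hypothetical step name,
# "rf_classifier" matches the step name used in this notebook.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf_classifier", RandomForestClassifier()),
])

# get_params() lists every tunable key in <step>__<param> form.
valid_keys = sorted(
    key for key in demo_pipeline.get_params() if key.startswith("rf_classifier__")
)
print(valid_keys)

# The plural spelling is a real parameter; the singular one is not.
print("rf_classifier__n_estimators" in valid_keys)
print("rf_classifier__n_estimator" in valid_keys)
```

A key that is not in this list makes `GridSearchCV` fail at fit time with an invalid-parameter error, so checking against `get_params()` first can save a long wait.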

In [201]:
print("Best cross-validated accuracy: ",grid.best_score_)
print()
print("Best parameter found:\n",grid.best_params_)
print()
print("Fitted_model:\n",grid.best_estimator_.steps[1][1])

Best cross-validated accuracy:  0.995

Best parameter found:
 {'rf_classifier__n_estimators': 20}

Fitted_model:
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


And you can predict directly with the fitted, grid-searched `pipeline` as well!

In [202]:
grid.predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
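
Calling `predict` on a fitted `GridSearchCV` works because, with the default `refit=True`, the search refits the best pipeline on the full data and forwards `predict` to it. The following is a self-contained sketch on synthetic data. The dataset from `make_classification`, the `scaler` step, the candidate values `[5, 10, 20]`, and the `random_state` settings are all assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rf_classifier", RandomForestClassifier(random_state=0)),
])

# Hypothetical candidate values for the number of trees.
demo_param_grid = {"rf_classifier__n_estimators": [5, 10, 20]}

demo_grid = GridSearchCV(demo_pipeline, demo_param_grid, cv=5, scoring="accuracy")
demo_grid.fit(X_demo, y_demo)

# With refit=True (the default), predict is delegated to best_estimator_.
preds_from_grid = demo_grid.predict(X_demo)
preds_from_best = demo_grid.best_estimator_.predict(X_demo)

print(np.array_equal(preds_from_grid, preds_from_best))
print(demo_grid.best_params_)
```

Note that predicting on the same rows used for fitting, as done here and in the cell above, gives an optimistic picture; `best_score_` is the cross-validated estimate.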

#### Exercise Time!

* Add a PCA transformation step before training the classifier.
* Search over the number of PCA components to keep using `GridSearchCV` (test whether to keep the first 5, 10, or all components).