# GUIDED PRACTICE: Pipelines StumbleUpon Evergreen

## 1. Introduction

StumbleUpon: build a classifier that categorizes web pages as evergreen or ephemeral

We will use the StumbleUpon dataset to build our first Pipeline. StumbleUpon is a website that recommends pages and content to its users based on their interests. Some of the recommended pages are relevant only for a short time (news, cooking recipes, etc.), while others hold their interest over time and can be recommended to users long after they were published. Pages can be classified as "ephemeral" or "evergreen".

The goal, then, is to build a classifier that assigns pages to these two categories in order to improve the site's recommendation system.

Along the way, we will try to show how useful pipelines are.

**Note:** this practice is based on a [Kaggle challenge](https://www.kaggle.com/c/stumbleupon).

## 2. "Simple" pipelines

* First we import the data, packages, etc.

In [1]:
from sklearn.pipeline import Pipeline
import pandas as pd
import json

data = pd.read_csv("../Data/stumbleupon.tsv", sep='\t')
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,1,1,24,0,5424,170,8,0.152941,0.07913,0
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,1,1,40,0,4973,187,9,0.181818,0.125448,1
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,1,1,55,0,2240,258,11,0.166667,0.057613,1
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,1,0,24,0,2737,120,5,0.041667,0.100858,1
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,1,1,14,0,12032,162,10,0.098765,0.082569,0


In [2]:
data.sample()['boilerplate'].values

array(['{"title":"Knowing when to quit Buzz knowing when to quit","body":"7 Athletes Don t Know When To Quit By Brian Wisniewski Like the Kenny Rogers song goes You ve got to know when to hold em know when to fold em know when to walk and know when to run Time and time again we see athletes in sports that simply don t know when that time is The most recent case being Chuck Liddell who came back for one more swan song at UFC 115 only to earn another beating and some more brain damage On that note here s a list of seven athletes that should have thrown in the towel folded hit the road or just simply retired much sooner 1 Chuck Liddell UFC Career 12 Years Should Have Retired After UFC 97 Chuck Liddell fought at UFC 17 and UFC 115 He s been around for nearly 100 UFC s and he looked like it in his latest knockout loss to Rich Franklin The Iceman was the face of the UFC and a big reason why the sport was able to adjust to the mainstream Not only was he one of the building blocks of the UFC h

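* From the "boilerplate" field we take the "title" and "body" subfields and add them to `data` as columns.
* Keys that are missing from the JSON default to `''`.
* We then check the resulting values: a title stored as an explicit `null` in the JSON still comes through as a missing value.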

In [3]:
data['title'] = data.boilerplate.apply(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.apply(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...


In [4]:
data['boilerplate'][data['title'].isna()]

17      {"title":null,"body":"The annual Chap Olympiad...
1222    {"title":null,"body":"The only show at NYFW I ...
1642    {"title":null,"body":"When you wake up with yo...
2111    {"title":null,"body":"Stories of human flight ...
3026    {"title":null,"body":"Researchers at Volkswage...
3753    {"title":null,"body":"December 31 2008 12 01 a...
3909    {"title":null,"body":"Soft and chewy peanut bu...
4142    {"title":null,"body":null,"url":"icanhascheezb...
5358    {"title":null,"body":"18 Feb Like my play on w...
6138    {"title":null,"body":" discover fall's top bea...
6941    {"title":null,"body":"A new photograph analyzi...
7029    {"title":null,"body":"Today I have a recipe th...
Name: boilerplate, dtype: object

#### Class balance

Let's check how balanced the classes are:

In [5]:
data.label.value_counts()

1    3796
0    3599
Name: label, dtype: int64

The classes look well balanced, so accuracy will be a good performance measure.
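As a reference point for the accuracy figures that follow, the two counts above imply a majority-class baseline. A minimal sketch that uses only the numbers printed by `value_counts()`; the variable names are illustrative:

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 3796, 0: 3599}

total = sum(label_counts.values())
# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(label_counts.values()) / total

print(f"total pages: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

A classifier is only useful here if it clearly beats this figure of roughly 0.51.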

## Experimentation

Before building a pipeline we have to decide which combinations of preprocessing and models we are going to explore.

* For preprocessing, we will use the `CountVectorizer` class to extract a word-count vector from the titles. (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

    **Parameters:**

    1. `max_features`: considers only the top X features, ordered by term frequency across the corpus.
    2. `ngram_range`: tuple (min_n, max_n) with the smallest and largest n-gram sizes to extract; with `(1, 2)` it takes words one at a time and two at a time.
    3. `stop_words`: with `'english'`, discards articles and other English words with no predictive power. Custom lists can also be used.
    4. `binary`: if `True`, each entry is 0 or 1 (presence or absence; counts are not accumulated).
    
    
* For the classification model, since the input is text, we will use `MultinomialNB`, without exploring the regularization hyperparameters seen previously, since NB has none. `MultinomialNB` does have a hyperparameter `alpha` which, unlike regularization, gives the model some slack to cope with features that were rare or never observed for a class when the model was fitted: it keeps their estimated probabilities from being exactly zero. This `alpha` is known as Laplace smoothing and defaults to 1. The rationale for this parameter comes from a classic problem about [the probability that the sun will rise each morning](https://en.wikipedia.org/wiki/Sunrise_problem).
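The effect of `alpha` can be sketched by hand with additive smoothing. The helper `smoothed_probabilities` and the word counts below are hypothetical, written only for illustration, and are not part of scikit-learn:

```python
def smoothed_probabilities(word_counts, alpha=1.0):
    """Additive smoothing of the word counts observed for one class.

    P(word | class) = (count + alpha) / (total + alpha * vocabulary_size)
    """
    vocabulary_size = len(word_counts)
    total = sum(word_counts.values())
    return {
        word: (count + alpha) / (total + alpha * vocabulary_size)
        for word, count in word_counts.items()
    }

# Hypothetical counts for one class; "recipe" never appeared in training.
counts_in_class = {"news": 3, "election": 1, "recipe": 0}

unsmoothed = smoothed_probabilities(counts_in_class, alpha=0.0)
laplace = smoothed_probabilities(counts_in_class, alpha=1.0)

print(unsmoothed["recipe"])  # exactly zero: one unseen word would veto the class
print(laplace["recipe"])     # small but non-zero

# Sunrise problem: after n sunrises in n days, the rule of succession
# estimates the probability of one more sunrise as (n + 1) / (n + 2).
n = 10
print((n + 1) / (n + 2))
```

With `alpha=0` a single unseen word drives the whole class likelihood to zero, which is the situation the smoothing term is there to avoid.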


With these steps we will create a pipeline that contains:

1. The text vectorizer
2. The classification model

### Train/test split

To estimate how the selected model performs on unseen data, we start by splitting the data into train and test sets.

In [6]:
X = data[['title','body']].fillna('')
y = data['label']
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

A brief review of CountVectorizer

In [8]:
print(X.title[:5].values)

['IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries'
 'The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races'
 "Fruits that Fight the Flu fruits that fight the flu | cold & flu | men's health"
 '10 Foolproof Tips for Better Sleep '
 "The 50 Coolest Jerseys You Didn t Know Existed coolest jerseys you haven't seen"]


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X.title[:5])
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
vectorizer.get_feature_names()

['10',
 '50',
 'advantages',
 'air',
 'batteries',
 'better',
 'breathing',
 'calls',
 'cold',
 'coolest',
 'didn',
 'electronic',
 'eliminates',
 'existed',
 'fight',
 'flu',
 'foolproof',
 'for',
 'fruits',
 'fully',
 'futuristic',
 'gun',
 'haven',
 'health',
 'holographic',
 'ibm',
 'in',
 'jerseys',
 'know',
 'men',
 'races',
 'seen',
 'sees',
 'sleep',
 'starting',
 'that',
 'the',
 'tips',
 'you']

In [10]:
title_vec = vectorizer.transform(X.title[:5])
title_vec

<5x39 sparse matrix of type '<class 'numpy.int64'>'
	with 42 stored elements in Compressed Sparse Row format>

In [11]:
pd.DataFrame(title_vec.todense(), columns=vectorizer.get_feature_names()).T

Unnamed: 0,0,1,2,3,4
10,0,0,0,1,0
50,0,0,0,0,1
advantages,0,3,0,0,0
air,2,0,0,0,0
batteries,2,0,0,0,0
better,0,0,0,1,0
breathing,2,0,0,0,0
calls,2,0,0,0,0
cold,0,0,1,0,0
coolest,0,0,0,0,2
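The review above uses the default settings. To see the parameters listed earlier (`ngram_range`, `stop_words`, `binary`) in action, here is a small sketch on a made-up two-document corpus; it reads the learned vocabulary from the `vocabulary_` attribute:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
docs = ["foolproof tips for better sleep", "sleep tips and more sleep tips"]

# Unigrams and bigrams, English stop words removed, presence/absence only.
toy_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english", binary=True)
toy_matrix = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each extracted term to its column index.
toy_vocabulary = toy_vectorizer.vocabulary_
print(sorted(toy_vocabulary))
print(toy_matrix.toarray())
```

Even though "sleep" occurs twice in the second document, `binary=True` caps its entry at 1.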


### Simple pipeline without grid search

We import the classes and create the pipeline. We train it on the training set and run it on the test set.

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
model = MultinomialNB()

pipeline = Pipeline([
        ('vec', vectorizer),
        ('model', model)   
    ])

# We are going to run the pipeline on the titles
X_train_tit = X_train['title']
X_test_tit = X_test['title']

pipeline.fit(X_train_tit, y_train)
pred = pipeline.predict(X_test_tit)
pred

array([1, 1, 1, ..., 1, 1, 1])
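To make explicit what the pipeline does on `fit` and `predict`, the sketch below chains the same two steps by hand and compares the result with a `Pipeline`. The texts and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training and test sets, for illustration only.
train_texts = ["healthy fruit recipes", "breaking election news",
               "easy fruit salad", "election results tonight"]
train_labels = [1, 0, 1, 0]
test_texts = ["fruit salad recipes", "election news"]

# Manual version: fit the vectorizer on train only, then reuse it on test.
vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vec.transform(test_texts))

# Pipeline version: the same two steps behind a single fit/predict.
pipe = Pipeline([("vec", CountVectorizer()), ("model", MultinomialNB())])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

print(manual_pred, pipe_pred)
```

The pipeline removes the bookkeeping of calling `fit_transform` on train and `transform` on test, which is where leakage mistakes tend to creep in.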

* Let's compare the predictions with the labels.
* `predict` already returns class labels (0 or 1), so they can be compared with the true labels directly; we run the classification report and compute the accuracy.

In [13]:
from sklearn.metrics import accuracy_score, classification_report
print(classification_report(y_test, pred))
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.78      0.72      0.75      1198
           1       0.75      0.81      0.78      1243

    accuracy                           0.77      2441
   macro avg       0.77      0.77      0.77      2441
weighted avg       0.77      0.77      0.77      2441

0.7668988119623106
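For reference, the columns of the report can be reproduced by hand from the confusion counts. A minimal sketch with made-up labels and predictions, shown for label 1 only:

```python
# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 1, 0, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_hat))

# Confusion counts for label 1.
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # share of predicted 1s that are really 1
recall = tp / (tp + fn)     # share of real 1s that were predicted as 1
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(precision, recall, f1, accuracy)
```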


## 3. Combining pipelines and GridSearchCV

Let's now see how to use pipelines together with hyperparameter tuning via `GridSearchCV`.

In [14]:
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

We build a pipeline with three stages:

1. A text vectorizer: `CountVectorizer`
2. A transformer applied to the count matrix: `TfidfTransformer`
3. A classifier based on Multinomial Naive Bayes

Note that in this case we do not instantiate them beforehand.

In [15]:
pipeline = Pipeline([
   ('vect', CountVectorizer()), 
   ('tfidf', TfidfTransformer()), 
   ('clf', MultinomialNB()), 
])
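The first two stages together compute TF-IDF weights from raw text. As a sanity check, the sketch below (on a made-up corpus) compares them with scikit-learn's combined `TfidfVectorizer`, which is documented as equivalent to `CountVectorizer` followed by `TfidfTransformer`:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up corpus, for illustration only.
docs = ["fruit salad recipes", "election news tonight", "fruit news"]

# Two explicit steps, as in the pipeline above.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One combined step.
one_step = TfidfVectorizer().fit_transform(docs)

same_weights = np.allclose(two_step.toarray(), one_step.toarray())
# With the default norm='l2', every document vector has unit length.
row_norms = np.linalg.norm(two_step.toarray(), axis=1)

print(same_weights, row_norms)
```

Keeping the two steps separate, as this notebook does, lets a grid search tune the counting and the weighting independently.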

#### Experimentation

* We define the parameters to search over.
  - Note how the parameters are passed: in general they are written as `[step name]__[parameter]`.
  In this first stage we want to find out whether it helps to add n-grams (two-word combinations) to the model as features, and whether we should drop words that appear in fewer than a given number of documents in the corpus.
* So the parameters we use in `GridSearchCV` are:
  - for `CountVectorizer` (called `vect` in the pipeline): `min_df`, `max_df` and `ngram_range`


In [16]:
parameters = {
    'vect__min_df': [1,2,3,4],
    'vect__max_df': np.linspace(0.01,1,5),
    'vect__ngram_range': ((1, 1), (1, 2)),
}
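The `step__parameter` naming is not specific to grid search: `set_params` and `get_params` understand it too. The sketch below also uses `ParameterGrid` to count the candidates in a grid with the same values as `parameters`; the `clf__alpha=0.5` setting is just an example value:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# "vect__min_df" means: parameter min_df of the step named "vect".
demo_pipe.set_params(vect__min_df=2, clf__alpha=0.5)
print(demo_pipe.get_params()["vect__min_df"])
print(demo_pipe.named_steps["clf"].alpha)

# The grid is the Cartesian product of all listed values.
demo_grid = {
    "vect__min_df": [1, 2, 3, 4],
    "vect__max_df": np.linspace(0.01, 1, 5),
    "vect__ngram_range": [(1, 1), (1, 2)],
}
n_candidates = len(ParameterGrid(demo_grid))
print(n_candidates)  # 4 * 5 * 2 combinations
```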

In [17]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=2, verbose=2, cv=3)

In [18]:
print("Performing grid search...") 
grid_search.fit(X_train_tit, y_train)

Performing grid search...
Fitting 3 folds for each of 40 candidates, totalling 120 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:    2.8s
[Parallel(n_jobs=2)]: Done 120 out of 120 | elapsed:    8.2s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

* And we print the best parameters

In [19]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t %s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.756
Best parameters set:
	 vect__max_df: 0.2575
	 vect__min_df: 2
	 vect__ngram_range: (1, 1)
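A fitted search also exposes the winning values directly in `best_params_`, and the score of every candidate in `cv_results_`. A small sketch on an invented corpus (the texts, labels and two-value grid are assumptions made for brevity):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus, just large enough for 2-fold cross-validation.
texts = ["fruit salad recipes", "healthy fruit tips", "easy salad tips",
         "fruit recipes", "election news tonight", "breaking election news",
         "news results tonight", "election results"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
toy_search = GridSearchCV(toy_pipe, {"vect__ngram_range": [(1, 1), (1, 2)]}, cv=2)
toy_search.fit(texts, labels)

# best_params_ holds only the searched parameters of the winning candidate.
print(toy_search.best_params_)

# cv_results_ has one row per candidate, including its mean validation score.
toy_results = pd.DataFrame(toy_search.cv_results_)
print(toy_results[["params", "mean_test_score", "rank_test_score"]])
```

Looking at the full table is a quick way to tell whether the best candidate wins by a clear margin or is effectively tied with others.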


#### Evaluating the search result on unseen data

In [20]:
y_pred = grid_search.predict(X_test_tit)

In [21]:
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.74      0.75      1198
           1       0.76      0.78      0.77      1243

    accuracy                           0.76      2441
   macro avg       0.76      0.76      0.76      2441
weighted avg       0.76      0.76      0.76      2441

0.7615731257681279


**BONUS:** How much faster is the computation with `RandomizedSearchCV`?
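Before measuring anything, the amount of work can be estimated by counting model fits. A rough sketch, assuming the same `parameters` grid and `cv=3` as above and the `n_iter=10` used in the next cell:

```python
from math import prod

# Sizes of the three value lists in the `parameters` grid used above.
grid_sizes = {"vect__min_df": 4, "vect__max_df": 5, "vect__ngram_range": 2}
cv_folds = 3
n_iter = 10  # number of candidates the randomized search samples

grid_candidates = prod(grid_sizes.values())
grid_fits = grid_candidates * cv_folds
random_fits = n_iter * cv_folds

print(grid_candidates, grid_fits, random_fits)
print(f"share of the grid explored: {n_iter / grid_candidates:.0%}")
```

Fewer fits means less time, at the price of exploring only part of the grid; the elapsed times in the logs also depend on `n_jobs`, which differs between the two runs.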

In [22]:
from sklearn.model_selection import RandomizedSearchCV
rand_search = RandomizedSearchCV(pipeline, parameters, n_jobs=3, verbose=2, n_iter=10, cv=3)

In [23]:
print("Performing randomized search...") 
rand_search.fit(X_train_tit, y_train)

Performing randomized search...
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  30 out of  30 | elapsed:    1.5s finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('vect',
                                              CountVectorizer(analyzer='word',
                                                              binary=False,
                                                              decode_error='strict',
                                                              dtype=<class 'numpy.int64'>,
                                                              encoding='utf-8',
                                                              input='content',
                                                              lowercase=True,
                                                              max_df=1.0,
                                                              max_features=None,
                                                              min_df=1,
                                          

## 4. Pipelines and grid search with custom functions

Sometimes the classes available in sklearn's preprocessing module fall short. That is, we may need to define some other preprocessing transformation that does not exist in the module.


### 4.1. Extending the base classes in scikit-learn


In this example we create a very simple transformer that returns the title + the body, or only the title, depending on the value of the `include_body` parameter.

In [24]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

In [25]:
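class BodyIncluder(BaseEstimator, TransformerMixin):
    """Return the title alone, or title + body when include_body is True."""

    def __init__(self, include_body=False):
        self.include_body = include_body

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data.
        return self

    def transform(self, X):
        if self.include_body:
            # Plain concatenation, with no separator added between the fields.
            return X['title'].astype(str) + X['body'].astype(str)
        return X['title']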

In [26]:
X_train.head()

Unnamed: 0,title,body
2598,OpenDonor Bash s content Kwotes serving it,chat quote database - view tons of funny chat...
6554,ALEX AND CHLOE ONLINE SHOP,WICKEY NECKLACE MINI FOREVER AND NEVER COLLECT...
6952,You And Your Cholesterol Level by Mike Field H...,Hundreds of millions of Americans have high ch...
3542,Contagion TakePart News Culture Videos and Pho...,
4614,UFC Ultimate Fictional Characters Fighting Ran...,


In [27]:
bi = BodyIncluder(include_body= False)

In [28]:
bi.transform(X_train).head().values

array(['OpenDonor Bash s content Kwotes serving it ',
       'ALEX AND CHLOE ONLINE SHOP ',
       'You And Your Cholesterol Level by Mike Field Heart you and your cholesterol level by mike field - heart - insidershealth.com',
       'Contagion TakePart News Culture Videos and Photos That Make the World Better ',
       'UFC Ultimate Fictional Characters Fighting Random RR '],
      dtype=object)

In [29]:
bi = BodyIncluder(include_body=True)

In [30]:
bi.transform(X_train).head().values

array(['OpenDonor Bash s content Kwotes serving it  chat quote database - view tons of funny chat dialogs quotes, kwotes, chat dialogs, funny chats, humorous dialog, irc, chat history',
       'ALEX AND CHLOE ONLINE SHOP WICKEY NECKLACE MINI FOREVER AND NEVER COLLECTIONPRICE 28 00 CHOOSE STYLE SIZE BLACKWHITECLASSIC CROSS UPSIDE DOWN NECKLACETHE TRINITY COLLECTIONPRICE 70 00 CHOOSE STYLE SIZE ANT BRASSANT SILVERGUNMETALCLASSIC CROSS NECKLACETHE TRINITY COLLECTIONPRICE 70 00 CHOOSE STYLE SIZE ANT BRASSANT SILVERGUNMETALCROSS UPSIDE DOWN NECKLACE MINI ACRYLIC THE TRINITY COLLECTIONPRICE 28 00 CHOOSE STYLE SIZE BLACK W ANT BRASSBLACK W GUNMETALWHITE W ANT BRASSWHITE W GUNMETALCROSS NECKLACE MINI ACRYLIC THE TRINITY COLLECTIONPRICE 28 00 CHOOSE STYLE SIZE BLACK W ANT BRASSBLACK W GUNMETALWHITE W ANT BRASSWHITE W GUNMETALCROWN OF THORNS NECKLACE WHITE HEMATITETHE TRINITY COLLECTIONPRICE 180 00 CHOOSE STYLE SIZE ANT BRASSANT SILVERGUNMETALCROWN OF THORNS RING ANT BRASS WITH BLACK ONYXTHE TRI

* Suppose we wanted to build a transformer that extracts the length of the body of each text...
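One possible sketch of such a transformer follows. The class name `BodyLengthExtractor` and the sample rows are hypothetical; it assumes the input is a DataFrame with a `body` column, like `X_train`:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class BodyLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: body length in characters as one numeric feature."""

    def fit(self, X, y=None):
        # Nothing to learn; the method exists so the class works inside a Pipeline.
        return self

    def transform(self, X):
        lengths = X["body"].fillna("").astype(str).str.len()
        # Return a 2-D array of shape (n_samples, 1), as estimators expect.
        return lengths.to_numpy().reshape(-1, 1)

# Made-up rows with the same two columns as X_train.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "body": ["short text", None, "a somewhat longer body"],
})
body_lengths = BodyLengthExtractor().fit_transform(sample)
print(body_lengths)
```

Because the output is numeric rather than text, it cannot be fed to `CountVectorizer`; it would have to be combined with the text features by some other means, for example a `FeatureUnion`.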

### 4.2. Experimenting in the pipeline with BodyIncluder

We want to try increasing the model's complexity by including the body of the pages and not only the title.

In [31]:
pipeline = Pipeline([
   ('bi', BodyIncluder()),  
   ('vect', CountVectorizer()), 
   ('tfidf', TfidfTransformer()), 
   ('clf', MultinomialNB()), 
])

In [32]:
parameters = {
    'vect__min_df': [2,3,4],
    #'vect__max_df': np.linspace(0.01,1,5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__stop_words': ['english', None],
    'bi__include_body': [True, False]
}

In [33]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=2, verbose=2, cv=3)

In [34]:
print("Performing grid search...") 
grid_search.fit(X_train, y_train)

Performing grid search...
Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:  1.1min
[Parallel(n_jobs=2)]: Done  72 out of  72 | elapsed:  1.2min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('bi', BodyIncluder(include_body=False)),
                                       ('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                       

In [35]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t %s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.796
Best parameters set:
	 bi__include_body: True
	 vect__min_df: 2
	 vect__ngram_range: (1, 2)
	 vect__stop_words: 'english'


#### 4.3 Evaluating the model on unseen data

In [36]:
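# With the default refit=True, GridSearchCV has already refit the best pipeline
# on all of X_train, so this call is redundant.
grid_search.best_estimator_.fit(X_train, y_train)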

Pipeline(memory=None,
         steps=[('bi', BodyIncluder(include_body=True)),
                ('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=2,
                                 ngram_range=(1, 2), preprocessor=None,
                                 stop_words='english', strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         ver

In [37]:
y_pred = grid_search.predict(X_test)

In [38]:
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.79      0.79      1198
           1       0.80      0.80      0.80      1243

    accuracy                           0.79      2441
   macro avg       0.79      0.79      0.79      2441
weighted avg       0.79      0.79      0.79      2441

0.7935272429332241


### 5. Using FunctionTransformer from the preprocessing module

FunctionTransformer is another way of generating features with user-defined transformations.

* If we want a step that applies specific mathematical transformations to the features, we can use FunctionTransformer()

In [39]:
from sklearn.preprocessing import FunctionTransformer

In [40]:
transformer = FunctionTransformer(np.log, validate=True)

In [41]:
X = np.array([[0.5,1],[2,3]])

In [42]:
transformer.transform(X)

array([[-0.69314718,  0.        ],
       [ 0.69314718,  1.09861229]])
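`np.log` is undefined at zero, which is common in count-like features. As a variation on the example above, the sketch below swaps in `np.log1p` and supplies `np.expm1` as `inverse_func` so the transformation can be undone; the input values are made up:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x) is defined at x = 0, unlike np.log.
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)

counts = np.array([[0.0, 1.0], [9.0, 99.0]])
transformed = log_transformer.fit_transform(counts)
restored = log_transformer.inverse_transform(transformed)

print(transformed)
print(np.allclose(restored, counts))
```

Since a `FunctionTransformer` has `fit` and `transform`, it can be used as a pipeline step just like the custom class from section 4.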