# GUIDED PRACTICE: Pipelines with StumbleUpon Evergreen

## 1. Introduction

We will use the StumbleUpon dataset to build our first pipeline. StumbleUpon is a website that recommends pages and content to its users based on their interests. Some of the recommended pages are relevant only for a short time (news, cooking recipes, etc.), while others stay interesting over time and can be recommended to users long after they were published. Pages can therefore be classified as "ephemeral" or "evergreen".

The goal, then, is to build a classifier that assigns pages to one of these two categories in order to improve the site's recommendation system.

Along the way, we will try to show how useful pipelines are.

**Note:** this practice is based on a [Kaggle challenge](https://www.kaggle.com/c/stumbleupon).

## 2. "Simple" pipelines

* First we import the data, packages, etc.

In [1]:
from sklearn.pipeline import Pipeline
import pandas as pd
import json

data = pd.read_csv("../Data/stumbleupon.tsv", sep='\t')
data['boilerplate'].head()

0    {"title":"IBM Sees Holographic Calls Air Breat...
1    {"title":"The Fully Electronic Futuristic Star...
2    {"title":"Fruits that Fight the Flu fruits tha...
3    {"title":"10 Foolproof Tips for Better Sleep "...
4    {"title":"The 50 Coolest Jerseys You Didn t Kn...
Name: boilerplate, dtype: object

* From the `boilerplate` field we take the `title` and `body` subfields and add them to `data`
* We fill empty values with `''`

* We check the values obtained in the vector

In [2]:
data['title'] = data.boilerplate.apply(lambda x: json.loads(x).get('title', ''))
data['body'] = data.boilerplate.apply(lambda x: json.loads(x).get('body', ''))
data.head()

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label,title,body
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,24,0,5424,170,8,0.152941,0.07913,0,IBM Sees Holographic Calls Air Breathing Batte...,A sign stands outside the International Busine...
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,40,0,4973,187,9,0.181818,0.125448,1,The Fully Electronic Futuristic Starting Gun T...,And that can be carried on a plane without the...
2,http://www.menshealth.com/health/flu-fighting-...,1164,"{""title"":""Fruits that Fight the Flu fruits tha...",health,0.996526,2.382883,0.562016,0.321705,0.120155,0.042636,...,55,0,2240,258,11,0.166667,0.057613,1,Fruits that Fight the Flu fruits that fight th...,Apples The most popular source of antioxidants...
3,http://www.dumblittleman.com/2007/12/10-foolpr...,6684,"{""title"":""10 Foolproof Tips for Better Sleep ""...",health,0.801248,1.543103,0.4,0.1,0.016667,0.0,...,24,0,2737,120,5,0.041667,0.100858,1,10 Foolproof Tips for Better Sleep,There was a period in my life when I had a lot...
4,http://bleacherreport.com/articles/1205138-the...,9006,"{""title"":""The 50 Coolest Jerseys You Didn t Kn...",sports,0.719157,2.676471,0.5,0.222222,0.123457,0.04321,...,14,0,12032,162,10,0.098765,0.082569,0,The 50 Coolest Jerseys You Didn t Know Existed...,Jersey sales is a curious business Whether you...
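
As a small illustration of why missing values still have to be filled after this extraction, the sketch below parses two made-up boilerplate strings (they are not taken from the dataset). `dict.get` only falls back to its default when the key is absent; a key that is present but holds a JSON `null` comes back as `None`.

```python
import json

# Hypothetical boilerplate strings, written only for this illustration
with_null_body = '{"title": "Some title", "body": null}'
without_body = '{"title": "Another title"}'

# The key exists but holds null: .get returns None, not the default
body_from_null = json.loads(with_null_body).get('body', '')

# The key is missing: .get returns the default ''
body_from_missing = json.loads(without_body).get('body', '')

print(repr(body_from_null), repr(body_from_missing))
```

This is one plausible reason for calling `fillna('')` on the extracted columns afterwards.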


In [3]:
titles = data['title'].fillna('')
body =  data['body'].fillna('')
y = data['label']
titles[0:3]

#y[0:3]
#y.value_counts() / len(y)

0    IBM Sees Holographic Calls Air Breathing Batte...
1    The Fully Electronic Futuristic Starting Gun T...
2    Fruits that Fight the Flu fruits that fight th...
Name: title, dtype: object

#### Class balance

Let's check how balanced the classes are:

In [4]:
y.value_counts()

1    3796
0    3599
Name: label, dtype: int64

The classes look well balanced, so accuracy will be a reasonable performance measure.
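
To put later accuracy figures in context, it can help to know what a trivial model would achieve. The sketch below uses only the two counts printed above and computes the accuracy of always predicting the most frequent label; the variable names are ours.

```python
# Class counts copied from the value_counts() output above
n_label_1 = 3796
n_label_0 = 3599

# Accuracy of a trivial model that always predicts the most frequent label
baseline_accuracy = max(n_label_1, n_label_0) / (n_label_1 + n_label_0)
print(round(baseline_accuracy, 3))
```

Any classifier worth keeping should score clearly above this value.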

## Experimentation

Before building a pipeline we need to decide which combinations of preprocessing and models we are going to explore.

* For preprocessing, we will use the `CountVectorizer` class to extract a vector of word counts from the titles.

    **Parameters:**

    1. `max_features`: keeps only the top X features, ordered by term frequency across the corpus.
    2. `ngram_range`: tuple `(min_n, max_n)` with the n-gram sizes to extract; with `(1, 2)` it takes single words and pairs of consecutive words.
    3. `stop_words`: with `'english'`, discards articles and other English words with little predictive power. Custom lists can also be used.
    4. `binary`: if `True`, each entry is 0 or 1 (occurrences are not accumulated).


* For the classification model, since the input is text, we will use `MultinomialNB`. Its main hyperparameter is the smoothing term `alpha`, which we will not explore here.


With these steps we will create a pipeline that contains:

1. The text vectorizer
2. The classification model
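
The effect of some of these `CountVectorizer` parameters is easier to see on a tiny corpus. The following is a minimal sketch on two invented titles (not from the dataset). It reads the learned vocabulary from the `vocabulary_` attribute so that it does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented titles, used only for this illustration
toy_titles = ["cheap flights cheap hotels", "cheap hotels guide"]

# ngram_range: single words only vs. single words plus consecutive pairs
unigram_vocab = sorted(CountVectorizer(ngram_range=(1, 1)).fit(toy_titles).vocabulary_)
bigram_vocab = sorted(CountVectorizer(ngram_range=(1, 2)).fit(toy_titles).vocabulary_)

# binary: raw counts vs. presence/absence (columns follow alphabetical term order)
counts = CountVectorizer().fit_transform(toy_titles).toarray()
binary_counts = CountVectorizer(binary=True).fit_transform(toy_titles).toarray()

# max_features: keep only the 2 most frequent terms in the corpus
top2_vocab = sorted(CountVectorizer(max_features=2).fit(toy_titles).vocabulary_)

print(unigram_vocab)
print(bigram_vocab)
print(counts[0], binary_counts[0])
print(top2_vocab)
```

In the first title "cheap" occurs twice, so its count is 2 in `counts` but 1 in `binary_counts`.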

### Train/test split

To estimate how the selected model performs on unseen data, we start by splitting the data into training and test sets.

In [5]:
X = data[['title','body']].fillna('')
y = data['label']
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
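
As a sanity check on the size of the two sets, the sketch below applies the same `test_size` and `random_state` to a plain array of row positions. The total number of rows is taken from the class counts shown earlier; `row_ids` is a stand-in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 3796 + 3599  # total rows, from the class counts shown earlier
row_ids = np.arange(n_rows)  # stand-in for the real DataFrame rows

train_ids, test_ids = train_test_split(row_ids, test_size=0.33, random_state=42)
print(len(train_ids), len(test_ids))
```

The test size agrees with the total support (2441) reported in the classification report of the simple pipeline.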

### Simple pipeline without grid search

We import the required classes and create the pipeline.
We train it on the training set and run it on the test set.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer()
model = MultinomialNB()

pipeline = Pipeline([
        ('vec', vectorizer),
        ('model', model)   
    ])

# Run the pipeline on the titles only
X_train_tit = X_train['title']
X_test_tit = X_test['title']

pipeline.fit(X_train_tit, y_train)
pred = pipeline.predict(X_test_tit)
pred

array([1, 1, 1, ..., 1, 1, 1])
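
To make explicit what the pipeline is doing, the sketch below runs the same two steps by hand on a few invented texts and compares the result with the pipeline's predictions. The toy data and variable names are ours; only the step classes come from the notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data invented for this sketch
train_texts = ["easy cake recipe", "bread recipe tips",
               "election news today", "market news update"]
train_labels = [1, 1, 0, 0]
test_texts = ["pasta recipe", "sports news"]

# Manual version: fit the vectorizer on train, reuse it (transform only) on test
vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vec.transform(test_texts))

# Pipeline version: fit and predict chain the same calls internally
toy_pipeline = Pipeline([('vec', CountVectorizer()), ('model', MultinomialNB())])
toy_pipeline.fit(train_texts, train_labels)
pipeline_pred = toy_pipeline.predict(test_texts)

print(manual_pred, pipeline_pred)
```

The pipeline guarantees that the vectorizer is only ever fitted on the data passed to `fit`, which is easy to get wrong when the steps are chained by hand.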

* Let's compare the predictions with the labels
* `predict` already returns class labels (0 or 1), so we can pass the predictions directly to the classification report and compute the accuracy

In [8]:
#pred_bool=pred[:,0]<0.5
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(classification_report(y_test, pred))
print(accuracy_score(y_test, pred))

             precision    recall  f1-score   support

          0       0.78      0.72      0.75      1198
          1       0.75      0.81      0.78      1243

avg / total       0.77      0.77      0.77      2441

0.766898811962


## 3. Combining pipelines and GridSearchCV

Let's now see how to use pipelines together with hyperparameter tuning via `GridSearchCV`.

In [10]:
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

We build a pipeline with three stages:

1. A text vectorizer: `CountVectorizer`
2. A transformer of the count matrix: `TfidfTransformer`
3. A classifier based on Multinomial Naive Bayes

Note that in this case we do not instantiate the steps beforehand.

In [11]:
pipeline = Pipeline([
   ('vect', CountVectorizer()), 
   ('tfidf', TfidfTransformer()), 
   ('clf', MultinomialNB()), 
])

#### Experimentation

* We define the parameters to search over.
  - Note how the parameters are passed: in general they are written as `[step name]__[parameter]`.
  In this first stage we want to determine whether it helps to add n-grams (combinations of two words) as features, and whether we should drop words that appear in fewer than a given number of documents or in more than a given fraction of them.
* So the parameters we use in `GridSearchCV` are
  - for `CountVectorizer` (called `vect` in the pipeline): `min_df`, `max_df` and `ngram_range`
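
The double-underscore naming is not specific to grid search: it is how a `Pipeline` addresses the parameters of its steps in general. A minimal sketch, using a throwaway pipeline named `demo_pipeline` (our name) with the same step names:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# 'vect__min_df' means: parameter min_df of the step named 'vect'
demo_pipeline.set_params(vect__min_df=2, vect__ngram_range=(1, 2))

print(demo_pipeline.named_steps['vect'].min_df)
print(demo_pipeline.get_params()['vect__ngram_range'])
```

`GridSearchCV` relies on the same `set_params` mechanism when it tries each combination of the grid.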


In [12]:
parameters = {
    'vect__min_df': [1,2,3,4],
    'vect__max_df': np.linspace(0.01,1,5),
    'vect__ngram_range': ((1, 1), (1, 2)),
}
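
As a quick check on the size of this search, the grid can be expanded with `ParameterGrid` before fitting anything. The sketch below rebuilds the same dictionary so that it runs on its own. The number of folds is read from the "Fitting 3 folds" line in the log of this notebook; recent scikit-learn versions default to 5 folds instead.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

parameters = {
    'vect__min_df': [1, 2, 3, 4],
    'vect__max_df': np.linspace(0.01, 1, 5),
    'vect__ngram_range': ((1, 1), (1, 2)),
}

# Every combination of the listed values is one candidate: 4 * 5 * 2
n_candidates = len(ParameterGrid(parameters))

n_folds = 3  # as reported in this notebook's log; an assumption for other versions
print(n_candidates, n_candidates * n_folds)
```

The search cost grows multiplicatively with each parameter added to the grid.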

In [13]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=2)

In [14]:
print("Performing grid search...") 
grid_search.fit(X_train_tit, y_train)

Performing grid search...
Fitting 3 folds for each of 40 candidates, totalling 120 fits
[CV] vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 1) .....
[CV] vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 1) .....
[CV] vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 1) .....
[CV]  vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 1), total=   0.3s
[CV] vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 2) .....
[CV]  vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 1), total=   0.3s
[CV]  vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 1), total=   0.3s
[CV] vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 2) .....
[CV] vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 2) .....
[CV]  vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 2), total=   0.6s
[CV] vect__max_df=0.01, vect__min_df=2, vect__ngram_range=(1, 1) .....
[CV]  vect__max_df=0.01, vect__min_df=1, vect__ngram_range=(1, 2), total=   0.6s
[CV] vect_

[Parallel(n_jobs=3)]: Done  35 tasks      | elapsed:    8.3s


[CV]  vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 1), total=   0.3s
[CV] vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 2) ...
[CV]  vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 1), total=   0.3s
[CV] vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 2) ...
[CV]  vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 1), total=   0.3s
[CV] vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 2) ...
[CV]  vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 2), total=   0.6s
[CV] vect__max_df=0.2575, vect__min_df=4, vect__ngram_range=(1, 1) ...
[CV]  vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 2), total=   0.6s
[CV]  vect__max_df=0.2575, vect__min_df=3, vect__ngram_range=(1, 2), total=   0.6s
[CV] vect__max_df=0.2575, vect__min_df=4, vect__ngram_range=(1, 1) ...
[CV] vect__max_df=0.2575, vect__min_df=4, vect__ngram_range=(1, 1) ...
[CV]  vect__max_df=0.2575, vect__min_df=4, vect__ngram_range=(1, 1), total= 

[CV] vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 1) ...
[CV]  vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 1), total=   0.3s
[CV] vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 2) ...
[CV]  vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 1), total=   0.3s
[CV]  vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 1), total=   0.3s
[CV] vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 2) ...
[CV] vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 2) ...
[CV]  vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 2), total=   0.6s
[CV] vect__max_df=1.0, vect__min_df=1, vect__ngram_range=(1, 1) ......
[CV]  vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 2), total=   0.6s
[CV]  vect__max_df=0.7525, vect__min_df=4, vect__ngram_range=(1, 2), total=   0.6s
[CV] vect__max_df=1.0, vect__min_df=1, vect__ngram_range=(1, 1) ......
[CV] vect__max_df=1.0, vect__min_df=1, vect__ngram_range=(1, 1) ......
[CV] 

[Parallel(n_jobs=3)]: Done 120 out of 120 | elapsed:   27.1s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid=True, n_jobs=3,
       param_grid={'vect__min_df': [1, 2, 3, 4], 'vect__max_df': array([ 0.01  ,  0.2575,  0.505 ,  0.7525,  1.    ]), 'vect__ngram_range': ((1, 1), (1, 2))},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=2)

* And we print the best parameters

In [15]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t %s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.756
Best parameters set:
	 vect__max_df: 0.25750000000000001
	 vect__min_df: 2
	 vect__ngram_range: (1, 1)


#### Evaluating the result of the search on unseen data

In [None]:
grid_search.best_estimator_.fit(X_train_tit,y_train)

In [None]:
y_pred = grid_search.best_estimator_.predict(X_test_tit)

In [None]:
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
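
A side note on the refit step: with the default `refit=True` (visible in the `GridSearchCV` repr above), the search already refits the best pipeline on the full training data, so the search object can be used for prediction directly. A minimal sketch on invented texts, with a deliberately tiny grid and `cv=2` chosen only to keep the toy example valid:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# Toy data invented for this sketch
texts = ["easy chocolate cake recipe", "healthy chicken soup recipe",
         "simple pasta recipe ideas", "homemade bread recipe tips",
         "election results news today", "stock market news update",
         "breaking sports news tonight", "weather news alert today"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_search = GridSearchCV(
    Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())]),
    {'vect__ngram_range': [(1, 1), (1, 2)]},
    cv=2,
)
toy_search.fit(texts, labels)

new_texts = ["quick soup recipe", "late news today"]
# Both calls use the same refitted pipeline
direct_pred = toy_search.predict(new_texts)
estimator_pred = toy_search.best_estimator_.predict(new_texts)
print(direct_pred, estimator_pred)
```

Calling `fit` again on `best_estimator_` is therefore harmless but usually redundant.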

**BONUS:** How much does the computation time improve with `RandomizedSearchCV`?

In [None]:
from sklearn.model_selection import RandomizedSearchCV
rand_search = RandomizedSearchCV(pipeline, parameters, n_jobs=3, verbose=2, n_iter=10)

In [None]:
print("Performing randomized search...") 
rand_search.fit(X_train_tit, y_train)
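
Before timing anything, a rough expectation can be derived by counting candidates. The sketch below uses `ParameterSampler`, the same sampling that `RandomizedSearchCV` is built on, with the grid from section 3 written out as plain lists. The `random_state` is our addition to make the sketch repeatable.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {
    'vect__min_df': [1, 2, 3, 4],
    'vect__max_df': [0.01, 0.2575, 0.505, 0.7525, 1.0],
    'vect__ngram_range': [(1, 1), (1, 2)],
}

n_grid = len(ParameterGrid(parameters))

# With plain lists, candidates are drawn from the grid without replacement
sampled = list(ParameterSampler(parameters, n_iter=10, random_state=0))

print(n_grid, len(sampled), len(sampled) / n_grid)
```

With `n_iter=10` the randomized search fits a quarter of the candidates, so, all else being equal, its running time should be roughly a quarter of the full grid search, at the cost of possibly missing the best combination.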

## 4. Pipelines and grid search with custom functions

Sometimes the classes available in sklearn's preprocessing module fall short. That is, we may need to define a preprocessing transformation that does not exist in the module.


### 4.1. Extending the base classes in scikit-learn


In this example we create a very simple transformer that returns either the title alone or the title followed by the body, depending on the `include_body` parameter:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

In [None]:
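class BodyIncluder(BaseEstimator, TransformerMixin):
    """Select the text column(s) that are passed on to the vectorizer."""

    def __init__(self, include_body=False):
        self.include_body = include_body

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data
        return self

    def transform(self, X):
        if self.include_body:
            # Join with a space so the last title word and the first body word stay separate tokens
            return X['title'].astype(str) + ' ' + X['body'].astype(str)
        return X['title']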
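
What inheriting from `BaseEstimator` buys us is parameter handling: `get_params` and `set_params` are derived from the `__init__` signature, which is what later allows a grid search to tune a constructor argument. The sketch below uses a hypothetical stand-in class, `ToyIncluder`, with the same constructor shape, so that it runs on its own.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class ToyIncluder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in with the same constructor shape as BodyIncluder."""

    def __init__(self, include_body=False):
        # Store constructor arguments unchanged, under the same name
        self.include_body = include_body

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


toy = ToyIncluder()
default_params = toy.get_params()   # read from the __init__ signature
toy.set_params(include_body=True)   # what a search does for each candidate
toy_copy = clone(toy)               # fresh, unfitted copy with the same parameters

print(default_params, toy.include_body, toy_copy.include_body)
```

For this to work, `__init__` should only store its arguments as attributes with matching names, as the transformer above does.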

In [None]:
X_train.head()

In [None]:
bi = BodyIncluder(include_body= False)

In [None]:
bi.transform(X_train).head()

In [None]:
bi = BodyIncluder(include_body=True)

In [None]:
bi.transform(X_train).head()

* Suppose we wanted to build a transformer that extracts the length of the body text...
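
One possible way to write such a transformer is sketched below. The class name `BodyLengthExtractor`, the choice of character count as the length measure, and the toy DataFrame are all assumptions made for illustration; the notebook does not define this class.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class BodyLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: maps each row to the character length of its body."""

    def fit(self, X, y=None):
        # Nothing to learn: the length depends only on each row
        return self

    def transform(self, X):
        lengths = X['body'].fillna('').astype(str).str.len()
        # Return a 2D array (n_samples, 1), the shape downstream estimators expect
        return lengths.to_numpy().reshape(-1, 1)


# Toy data invented for this sketch
toy_pages = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'body': ['abc', None, 'hello world'],
})
body_lengths = BodyLengthExtractor().fit_transform(toy_pages)
print(body_lengths.ravel())
```

Missing bodies are treated as empty strings here, so they get length 0.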

### 4.2. Experimenting in the pipeline with BodyIncluder

We want to try increasing the complexity of the model by including the body of the pages and not only the title.

In [None]:
pipeline = Pipeline([
   ('bi', BodyIncluder()),  
   ('vect', CountVectorizer()), 
   ('tfidf', TfidfTransformer()), 
   ('clf', MultinomialNB()), 
])

In [None]:
parameters = {
    'vect__min_df': [2,3,4],
    #'vect__max_df': np.linspace(0.01,1,5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__stop_words': ['english',None],
    'bi__include_body': [True,False]
}

In [None]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=2)

In [None]:
print("Performing grid search...") 
grid_search.fit(X_train, y_train)

In [None]:
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t %s: %r" % (param_name, best_parameters[param_name]))

### 4.3. Evaluating the model on unseen data

In [None]:
grid_search.best_estimator_.fit(X_train,y_train)

In [None]:
y_pred = grid_search.best_estimator_.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

## 5. Using FunctionTransformer from the preprocessing module

`FunctionTransformer` is another way to generate features with user-defined transformations.

* If we want a step that applies specific mathematical transformations to the features, we can use `FunctionTransformer()`

In [None]:
from sklearn.preprocessing import FunctionTransformer

In [None]:
transformer = FunctionTransformer(np.log)

In [None]:
X = np.array([[0.5,1],[2,3]])

In [None]:
transformer.transform(X)
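
The cells above can be condensed into one self-contained sketch that also checks the result against a direct call to `np.log`. The second transformer, based on `np.log1p`, is our addition: it is a common alternative when a feature can be exactly 0, where the plain logarithm is not finite.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[0.5, 1.0], [2.0, 3.0]])

# Wrap a plain function so it can be used as a pipeline step
log_transformer = FunctionTransformer(np.log)
X_log = log_transformer.fit_transform(X)

# np.log1p computes log(1 + x), which stays finite when x is exactly 0
log1p_transformer = FunctionTransformer(np.log1p)
X_zero_safe = log1p_transformer.fit_transform(np.array([[0.0, 1.0]]))

print(X_log)
print(X_zero_safe)
```

Because the wrapped object follows the transformer interface, it can be placed in a `Pipeline` like any other step.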