# Pipelines

### Loading the Dataset
We import the movie review dataset.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import sklearn
import numpy as np
from sklearn.datasets import load_files

In [3]:
# Fill in the path to the movie dataset we used for assignment 5
moviedir = r'../DataSets/movie_reviews' 
movie_reviews = load_files(moviedir, shuffle=True)

Recall that this dataset is a dictionary-like object in which the key `data` holds the text of each review and the key `target` holds a 1 if the review was positive and a 0 if it was negative.
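As a reminder of how `load_files` builds that structure, here is a minimal sketch on a throwaway corpus. The folder names `neg` and `pos` and the two sample sentences are hypothetical stand-ins, not the real dataset; the point is only that each subfolder becomes a class and labels follow the alphabetical order of the folder names.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy corpus: one subfolder per class, one text file per document.
samples = {"neg": ["a dull , sloppy film ."], "pos": ["a sharp , funny film ."]}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for label, texts in samples.items():
        (root / label).mkdir()
        for i, text in enumerate(texts):
            (root / label / f"{i}.txt").write_text(text, encoding="utf-8")

    # File contents are read eagerly, so `toy` stays usable after the folder is removed.
    toy = load_files(str(root), shuffle=True, random_state=0)

print(toy.target_names)     # class names, taken from the folder names
print(type(toy.data[0]))    # raw bytes, because no encoding was requested
print(toy.target)           # integer labels indexing into target_names
```

Because the documents come back as `bytes`, they print with a leading `b`, just like the review shown in the next cell.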

In [4]:
movie_reviews.data[0]

b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actually so absurd , that they would fit right in with dogma . \nyes , the film is 

In [5]:
movie_reviews.target

array([0, 1, 1, ..., 1, 0, 0])

### Building the pipeline
First, before defining our workflow, we split the data into train and test sets.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    movie_reviews.data, movie_reviews.target,
    test_size=0.20, stratify=movie_reviews.target, random_state=12)
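To see what `stratify` contributes to the split above, the following sketch uses made-up, deliberately imbalanced labels (80 negatives and 20 positives, an assumption for illustration only) and checks that both partitions keep the same class proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # dummy features, one per sample

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=12)

# With stratification, the share of positives matches in both partitions.
print("train positives:", y_tr.mean())
print("test positives: ", y_te.mean())
```

Without `stratify`, the share of positives in the test set would vary with the random seed, which makes scores on small test sets harder to compare.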

Now we define a very simple workflow: first we vectorize the text, then we apply a tf-idf transformation, and finally we classify with a linear SVM.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = LinearSVC()
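Before chaining these objects, it may help to look at what the first two do on their own. The sketch below runs them by hand on a three-sentence corpus invented for this example, and also compares the result with `TfidfVectorizer`, which scikit-learn provides as a single-step equivalent of the two.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus, not taken from the movie review dataset.
docs = ["the film is good", "the film is bad", "good good film"]

# Step 1: raw token counts, one row per document and one column per word.
counts = CountVectorizer().fit_transform(docs)

# Step 2: reweight the counts with tf-idf (rows are L2-normalized by default).
weights = TfidfTransformer().fit_transform(counts)

print(counts.toarray())
print(weights.toarray().round(2))

# TfidfVectorizer performs both steps in one object.
one_step = TfidfVectorizer().fit_transform(docs)
print(np.allclose(weights.toarray(), one_step.toarray()))
```

Keeping the two steps separate, as this notebook does, makes it easy to tune or swap each one independently later.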

Now we put these three steps inside a single object, an instance of `Pipeline`, which we call `pipeline`. Like the other models, it also exposes the `fit` and `predict` methods.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

pipeline = Pipeline([('vect', vect), ('tfidf', tfidf), ('clf', clf)])
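The cell above also imports `make_pipeline`. As a rough sketch of how the two constructors differ, the example below builds the same three-step chain both ways on a tiny invented corpus (the sentences and labels are assumptions for illustration) and prints the step names each one produces.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 1 = positive review, 0 = negative review.
docs = ["great fun film", "boring slow film", "fun and great", "slow and boring"]
labels = [1, 0, 1, 0]

# Pipeline takes explicit (name, estimator) pairs.
named = Pipeline([("vect", CountVectorizer()),
                  ("tfidf", TfidfTransformer()),
                  ("clf", LinearSVC(random_state=0))])

# make_pipeline derives each name from the lowercased class name.
auto = make_pipeline(CountVectorizer(), TfidfTransformer(),
                     LinearSVC(random_state=0))

named_steps = [name for name, _ in named.steps]
auto_steps = [name for name, _ in auto.steps]
print(named_steps)
print(auto_steps)

# A single fit call fits every step in order; predict reuses the fitted transformers.
named.fit(docs, labels)
preds = named.predict(["great film", "boring film"])
print(preds)
```

Explicit names are convenient when step parameters need to be addressed later, for example as `clf__C` in a grid search.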

In [8]:
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

array([0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1,