# Pipeline and Multiple Outputs

## Skills

1. Understand the basic vocabulary of machine learning.
2. Explain the importance of training and testing data.
3. Train and evaluate a Support Vector Machine.
4. **Build a classification pipeline.**
5. **Use a multilabel classifier.**
6. Train and evaluate a transformer classifier.

## Vocabulary List

* **model.** A mathematical or computational function that takes in some data, outputs some other data, and has parameters that can be fit. A line ($y=mx+b$) is a very simple model to get you from $x$ to $y$.
* **Naïve Bayes Classifier.** A simple model using Bayes' Theorem to calculate the probability that data is in some class, assuming each word appears independently of the others.
* **parameter.** Variables in a model that are fit during training.
* **Support Vector Machine.** A simple model that attempts to put a line between two classes of observations. SVMs are able to account for interactions, depending on the kernel chosen.
* **Transformer Architecture.** A deep learning model which takes into account word meanings and order.
* **x variables.** The inputs to a model.
* **y variable.** The output of a model, the thing we are trying to predict.
* **training set.** The data used to train a model.
* **validation set.** Data used to evaluate a model and find the best hyperparameters.
* **testing set.** Data used in a final evaluation of a model.
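
To make the three kinds of data set concrete, here is a minimal sketch of one common way to produce them: call `train_test_split` twice. The toy documents, the 60/20/20 proportions, and the variable names are assumptions for illustration, not something this notebook prescribes.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 numbered "documents" with alternating labels (purely illustrative)
X = [f"document {i}" for i in range(100)]
y = [i % 2 for i in range(100)]

# First hold out 20% of the data as the testing set...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then carve a validation set out of what remains (25% of the remaining 80 documents)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The testing set is only touched once, at the very end; any tuning of hyperparameters is done against the validation set.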

## Additional Resources
1. [Scikit-Learn SVM Page](https://scikit-learn.org/stable/modules/svm.html)
2. [Working with Text Data in Scikit-Learn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics

import pandas as pd

## Scikit-Learn Pipelines

The work in our previous notebook was rather tedious. There were several pre-processing steps that were necessary after the train/test split, and also for any new data after the model was fit. This can all be condensed using a pipeline, in a way similar to what we saw with spaCy. In this case, our pipeline is:

* Tokenize the data and convert it to a matrix of word counts (using `CountVectorizer`)
* Convert counts to frequencies and then TF-IDF (using `TfidfTransformer`)
* Make predictions based on the model (`SGDClassifier` or `MultinomialNB` or whatever).
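
For comparison, here is a rough sketch of what those three steps look like when done by hand, which is the tedium a pipeline removes. The tiny corpus and its labels are made up for illustration; only the scikit-learn classes come from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Tiny made-up corpus, for illustration only
train_docs = ["a detective solves a murder", "a chef competes in a cooking show",
              "the detective returns", "a new cooking competition"]
train_labels = ["MOVIE", "SHOW", "MOVIE", "SHOW"]
new_docs = ["a detective on a train"]

# Each step must be fit on the training data...
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train_docs)
tfidf = TfidfTransformer()
features = tfidf.fit_transform(counts)
model = SGDClassifier(random_state=0).fit(features, train_labels)

# ...and then re-applied, in the same order, with transform() (not fit) to any new data
new_features = tfidf.transform(vectorizer.transform(new_docs))
print(features.shape, new_features.shape)
print(model.predict(new_features))
```

Forgetting a step, or accidentally calling `fit_transform` on the new data, is an easy mistake to make here; a pipeline handles that bookkeeping for us.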

To create the pipeline, we put each component in a list as a `(label, component)` pair. The label is a descriptive name of our choosing that lets us refer to that step later.
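
As a small sketch of why the labels are useful: a step can be looked up by its label through `named_steps`, and its settings can be changed with `set_params` using the pattern `<label>__<parameter>`. The particular values set below (`alpha=0.001`, English stop words) are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tokenize", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", SGDClassifier()),
])

# Look up a step by its label
print(type(pipe.named_steps["tokenize"]).__name__)

# Change a step's setting using "<label>__<parameter>"
pipe.set_params(classifier__alpha=0.001, tokenize__stop_words="english")
print(pipe.named_steps["classifier"].alpha)
```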

### Worked Example

In [None]:
netflix = pd.read_csv("https://raw.githubusercontent.com/Greg-Hallenbeck/class-datasets/main/datasets/netflix.csv")

In [None]:
netflix.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,...,history,horror,music,reality,romance,scifi,sport,thriller,war,western
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,...,0,0,0,0,0,0,0,0,0,0
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,...,0,0,0,0,0,0,0,0,0,0
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,...,0,0,0,0,0,0,0,0,0,0
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,...,0,0,0,0,0,0,0,0,0,0
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,...,0,1,0,0,0,0,0,0,0,0


In [None]:
X = netflix["description"]
y = netflix["type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=314159)

In [None]:
text_pipeline = Pipeline([
     ('tokenize', CountVectorizer(max_features=10000, stop_words="english")),
     ('tfidf', TfidfTransformer(use_idf=True)),
     ('classifier', SGDClassifier())
    ])

In [None]:
# This automatically applies each component of the pipeline, and fits them
text_pipeline.fit(X_train, y_train)

In [None]:
# Produce predictions just by running them through the pipeline using .predict()
y_pred_test = text_pipeline.predict(X_test)
y_pred_train = text_pipeline.predict(X_train)

In [None]:
metrics.accuracy_score(y_train, y_pred_train)

0.9775377969762419

In [None]:
metrics.accuracy_score(y_test, y_pred_test)

0.7107081174438687

In [None]:
# Print out a report of how well the classifier does
# precision of X = If I predict something is X, what % of the time am I right?
# recall of X = what % of X do I find?
print(metrics.classification_report(y_test, y_pred_test, zero_division=0))

              precision    recall  f1-score   support

       MOVIE       0.74      0.83      0.78       723
        SHOW       0.64      0.51      0.57       435

    accuracy                           0.71      1158
   macro avg       0.69      0.67      0.68      1158
weighted avg       0.70      0.71      0.70      1158
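
To see where the precision and recall columns come from, here is a minimal sketch that computes both by hand for the SHOW class and checks them against scikit-learn. The eight labels are made up for illustration and are not taken from the Netflix data.

```python
from sklearn import metrics

# Made-up labels for eight titles (illustrative only)
y_true = ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"]
y_pred = ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "SHOW"]

pairs = list(zip(y_true, y_pred))
true_pos = sum(t == "SHOW" and p == "SHOW" for t, p in pairs)  # correctly predicted SHOW
predicted = sum(p == "SHOW" for t, p in pairs)                 # everything predicted as SHOW
actual = sum(t == "SHOW" for t, p in pairs)                    # everything that really is a SHOW

precision = true_pos / predicted  # of my SHOW predictions, what fraction were right?
recall = true_pos / actual        # of the real SHOWs, what fraction did I find?
print(precision, recall)

# The same two numbers from scikit-learn
sk_precision = metrics.precision_score(y_true, y_pred, pos_label="SHOW")
sk_recall = metrics.recall_score(y_true, y_pred, pos_label="SHOW")
print(sk_precision, sk_recall)
```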



In [None]:
# And predict on new data. Hopefully the model does better on other examples than it does on this one.
new_data = ["When a murder occurs on the train on which he's travelling, celebrated detective Hercule Poirot is recruited to solve the case."]
text_pipeline.predict(new_data)

array(['SHOW'], dtype='<U5')

## Multi-Output Classifier

One issue you've all run into is the question of how to predict the columns which are multi-valued, that is, that contain more than one value. We see this for the genres: they're not in these nice, neat categories like simply fantasy *or* drama, but tend to be *both* fantasy *and* drama. Is it possible to do a single prediction for all of them?

Yes, it can be done, and it can be done easily, but perhaps not in the way you'd expect. If there are 19 possible genres, then what we are doing is producing 19 separate models, one predicting if the piece of media is a comedy, one predicting if it is a drama, and so on. To do that, we first transform the genres column using one-hot encoding, so that each genre gets its own column of 0s and 1s.
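
Before applying this to the real data, here is a minimal sketch of the encoding on three made-up titles. Each title's list of genres becomes a row of 0s and 1s, with one column per genre that appears anywhere in the data.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Three made-up titles, each with a list of genres
genres = [["crime", "drama"], ["comedy", "fantasy"], ["comedy"]]

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(genres), columns=mlb.classes_)
print(encoded)
```

The columns come out in alphabetical order, and a title with two genres simply has two 1s in its row.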


In [None]:
temp = netflix["genres"]
# Strip the quotes, brackets, and spaces out of the genres.
# regex=False treats "[" and "]" as literal characters rather than regex syntax.
temp = temp.str.replace("'", "", regex=False)
temp = temp.str.replace("[", "", regex=False)
temp = temp.str.replace("]", "", regex=False)
temp = temp.str.replace(" ", "", regex=False)

# Split each entry into a list of the genres present, as strings
temp = temp.str.split(",")
temp.head()


0      [documentation]
1       [crime, drama]
2    [comedy, fantasy]
3             [comedy]
4             [horror]
Name: genres, dtype: object

In [None]:
mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per genre; mlb.classes_ holds the genre names
temp = pd.DataFrame(mlb.fit_transform(temp), columns=mlb.classes_)

In [None]:
# Each row now has a 1 under every genre it belongs to. The first row was a
# documentary, so it has a 1 under "documentation"; the second row was a crime
# drama, so it has a 1 under both "crime" and "drama".
temp.head()

Unnamed: 0,Unnamed: 1,action,animation,comedy,crime,documentation,drama,european,family,fantasy,history,horror,music,reality,romance,scifi,sport,thriller,war,western
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [None]:
netflix = netflix.join(temp)

In [None]:
netflix.columns

Index(['id', 'title', 'type', 'description', 'release_year',
       'age_certification', 'runtime', 'genres', 'production_countries',
       'seasons', 'imdb_id', 'imdb_score', 'imdb_votes', 'tmdb_popularity',
       'tmdb_score', '', 'action', 'animation', 'comedy', 'crime',
       'documentation', 'drama', 'european', 'family', 'fantasy', 'history',
       'horror', 'music', 'reality', 'romance', 'scifi', 'sport', 'thriller',
       'war', 'western'],
      dtype='object')

In [None]:
netflix.head(5)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,...,history,horror,music,reality,romance,scifi,sport,thriller,war,western
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,...,0,0,0,0,0,0,0,0,0,0
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,...,0,0,0,0,0,0,0,0,0,0
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,...,0,0,0,0,0,0,0,0,0,0
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,...,0,0,0,0,0,0,0,0,0,0
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,...,0,1,0,0,0,0,0,0,0,0


Then, we use a pipeline again, but our classifier of choice needs to be loaded into a `MultiOutputClassifier`, which then goes in the pipeline:

In [None]:
single_classifier = SGDClassifier()

text_pipeline_multi = Pipeline([
     ('tokenize', CountVectorizer(max_features=10000, stop_words="english")),
     ('tfidf', TfidfTransformer(use_idf=True)),
     ('classifier', MultiOutputClassifier(single_classifier)),  # fits one separate copy of single_classifier per output column
    ])

In [None]:
X = netflix["description"]
y = netflix.loc[:,"action":]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=314159)

In [None]:
# This automatically applies each component of the pipeline, and fits them
text_pipeline_multi.fit(X_train, y_train)
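
To check that the wrapper really does train one model per label, here is a small sketch on toy data. The four documents and the two label columns (comedy, crime) are made up for illustration; `estimators_` is the attribute where a fitted `MultiOutputClassifier` keeps its per-column models.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

docs = ["a funny heist", "a sad heist", "a funny wedding", "a sad wedding"]
# Two made-up label columns: [comedy, crime]
y = np.array([[1, 1], [0, 1], [1, 0], [0, 0]])

pipe = Pipeline([
    ("tokenize", CountVectorizer()),
    ("classifier", MultiOutputClassifier(SGDClassifier(random_state=0))),
])
pipe.fit(docs, y)

# One fitted SGDClassifier per label column
fitted = pipe.named_steps["classifier"].estimators_
print(len(fitted))

# Predictions come back with one column per label
prediction = pipe.predict(["a funny heist"])
print(prediction.shape)
```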

In [None]:
# Produce predictions just by running them through the pipeline using .predict()
y_pred_test = text_pipeline_multi.predict(X_test)
y_pred_train = text_pipeline_multi.predict(X_train)

In [None]:
metrics.accuracy_score(y_train, y_pred_train)

0.8669546436285097

In [None]:
# 13.7% on the testing set versus 86.7% on the training set: a very large gap, so this model is badly overfit
metrics.accuracy_score(y_test, y_pred_test)

0.13730569948186527
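
One caveat when reading this number: for multilabel targets, `accuracy_score` reports subset accuracy, meaning a title only counts as correct if every one of its genre labels is predicted correctly. The sketch below, on made-up labels, contrasts that with the fraction of individual labels that are right (one minus the Hamming loss).

```python
import numpy as np
from sklearn import metrics

# Made-up truth and predictions for 4 titles and 3 genres
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# Subset accuracy: a row only counts if every label in it is right
subset_accuracy = metrics.accuracy_score(y_true, y_pred)
print(subset_accuracy)

# Fraction of the individual 0/1 labels that are right
label_accuracy = 1 - metrics.hamming_loss(y_true, y_pred)
print(label_accuracy)
```

Here two of the four rows contain a single wrong label, so subset accuracy is much lower than the per-label figure. This is a stricter standard than the single-label accuracy used earlier, which is part of why the per-genre report below is more informative.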

In [None]:
# Print out a report of how well the classifier does
# precision of X = If I predict something is X, what % of the time am I right?
# recall of X = what % of X do I find?
print(metrics.classification_report(y_test, y_pred_test, zero_division=0))

              precision    recall  f1-score   support

           0       0.60      0.42      0.49       224
           1       0.69      0.38      0.49       139
           2       0.65      0.54      0.59       463
           3       0.63      0.39      0.48       183
           4       0.70      0.43      0.54       176
           5       0.70      0.71      0.70       603
           6       0.30      0.03      0.06        95
           7       0.59      0.24      0.34       126
           8       0.59      0.30      0.40       130
           9       0.43      0.06      0.11        47
          10       0.78      0.25      0.38        84
          11       0.59      0.31      0.41        42
          12       0.83      0.34      0.48        44
          13       0.61      0.29      0.40       211
          14       0.70      0.32      0.44       132
          15       0.86      0.21      0.34        28
          16       0.57      0.43      0.49       243
          17       0.75    

In [None]:
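temp.head(10)
# The row numbers in the report are column positions in y, not rows of the data:
# row 0 is the first column of y ("action") and row 4 is "documentation".
# A precision of 0.60 for row 0 means about 60% of the titles predicted to be action really are action.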

0                    [documentation]
1                     [crime, drama]
2                  [comedy, fantasy]
3                           [comedy]
4                           [horror]
5                 [comedy, european]
6          [thriller, crime, action]
7    [drama, music, romance, family]
8                   [romance, drama]
9             [drama, crime, action]
Name: genres, dtype: object
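
The numbered rows in the multi-output report are hard to read. One possible fix, sketched here on made-up data, is to pass the genre names to `classification_report` through its `target_names` argument. The three genre names and the label arrays below are assumptions for illustration.

```python
import numpy as np
from sklearn import metrics

genre_names = ["action", "comedy", "drama"]  # hypothetical column order of y
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# target_names replaces the 0, 1, 2, ... row labels with readable names
report = metrics.classification_report(
    y_true, y_pred, target_names=genre_names, zero_division=0
)
print(report)
```

In this notebook the equivalent would presumably be `target_names=list(y_test.columns)`, since `y_test` is a DataFrame whose columns are the genre names.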