# Building a clickbait classification pipeline

![Clickbait](http://www.ogilvy.com/wp-content/uploads/2015/10/166_main.jpg)

## Introduction

The aim of this workshop is to walk you through the process of going from raw data to a production machine learning classifier for text. You'll build a classifier that detects clickbait from an article's title and description, and eventually roll the classifier out into an app.

We'll be using:
 - pandas for data wrangling
 - matplotlib for plotting
 - scikit-learn for feature engineering, model building, and model analysis
 - flask for building our web app
 - jupyter for getting stuff done

By the end of the workshop you'll understand the steps needed to build a basic text processing pipeline.

## Loading data and wrangling

You've been presented with tabular data in a `.csv` format. A natural choice for loading the data is pandas, which is based around tabular representations.

We'll load the data and take a quick look at the first few rows.

In [None]:
import pandas as pd
%matplotlib inline 

In [None]:
df = pd.read_csv('./data/training_data.csv')
df.head(5)

### Article details

An article contains:
 - **author** (string) - the article author's name
 - **description** (string) - a short description of the article
 - **label** (integer) - 1=clickbait 0=not clickbait
 - **publishedAt** (string) - a timestamp for the time of publication
 - **title** (string) - the title of the article

We can print the content of one article:

In [None]:
print(df['title'][1])
print('-----')
print(df['description'][1])
print('Published at:', df['publishedAt'][1])

### Basic statistics

Pandas makes it easy to inspect, plot, and transform our data. Pandas is easy to use but still has lots of functionality. When you can chain multiple functions together, you've become a pandas pro!

In [None]:
# Number of clickbait and non-clickbait articles
df['label'].value_counts()

In [None]:
# Plotting the number of author fields that are Null
df['author'].isnull().value_counts().plot(kind='barh')

In [None]:
# The number of characters in the description field
df['description'].apply(len).mean()

In [None]:
# Comparing the number of description characters in clickbait to news
df['description'].apply(len).groupby(df['label']).mean()
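
To see chaining on something small enough to check by hand, here is a minimal sketch on a made-up three-row table. The `toy` frame and its values are invented for illustration and are not part of the workshop data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'description': ['aaaa', 'bb', 'cccccc'],
    'label': [1, 1, 0],
})

# Chain: measure each description, group the lengths by label, then average
mean_chars = toy['description'].str.len().groupby(toy['label']).mean()
print(mean_chars)
```

The two clickbait rows have 4 and 2 characters, so their group mean is 3, while the single non-clickbait row gives 6.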

In [None]:
# TEST YOUR KNOWLEDGE
# Can you write a one-liner to compute the number of clickbait articles
# written by each author? Hint: you might find the .sum() function helpful!

### Create the full content column

In [None]:
df['full_content'] = df.description + ' ' + df.title
df.head(1)

Nice work - have a panda

![panda](http://www.nathab.com/uploaded-files/carousels/TRIPS/Wild-China/Asia-Wild-China-4-panda.JPG)

---

# Scikit-learn text classification pipeline


Some important terminology:

<img src="http://tfwiki.net/mediawiki/images2/thumb/3/37/Optimusg1.jpg/350px-Optimusg1.jpg" alt="optimus" style="width:50px;" align="left"/> **TRANSFORMERS** - take some input data and transform it into another format. Often we want to transform textual data or image data into numerical data. We may also transform our input data into new features.
<br/>
<br/>

<img src="http://www.kennyskiphire.co.uk/blog/wp-content/uploads/Wheelie-Bins.jpg" alt="bins" style="width:70px;" align="left"/> **CLASSIFIERS** - take some input data and classify the sample by assigning a label to the input data. In binary classification we often use the labels 1 and 0.
<br/>
<br/>

<img src="https://reichanjapan.files.wordpress.com/2016/02/mariogiftcard.png?w=230&h=335" alt="pipe" style="width:50px;" align="left"/> **PIPELINE** - consists of one or more transformer steps followed by a classifier. We can use pipelines to elegantly chain together operations and construct an easy-to-use interface.

## Textual to numerical data (bag of words model) 

Our classifier isn't going to understand text like we can - we must create numerical data. A common approach to this is the bag of words model. 

For example - **Literally just 8 really really cute dogs** - transforms into the bag of words:

| Token | id | Count |
|---|---|---|
| cute | 0 |1 |
| dogs | 1 |1 |
| just | 2 | 1 |
| literally | 3 | 1 |
| really | 4 | 2 |

This is easily achieved with a scikit-learn `CountVectorizer`. There are two steps:
 - **Fit** the vectorizer, which collects all the tokens in the left-hand column and assigns each a numerical id
 - **Transform** the data, which turns a sentence into its bag of words representation

Note that the bag of words representation of a sentence ignores word order and any dependencies between words. The `8` is missing from the table because, by default, `CountVectorizer` lowercases the text and only keeps tokens of two or more alphanumeric characters.
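
The fit and transform steps can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's implementation: it assumes lowercasing and a token rule of two or more word characters, which is why the single-character `8` gets no id. The names `tokenize`, `fit_vocabulary` and `transform` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: tokens are runs of two or more word characters, found after lowercasing
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(sentence):
    return TOKEN.findall(sentence.lower())

def fit_vocabulary(sentences):
    """Assign an id to every distinct token, in alphabetical order."""
    tokens = sorted({tok for s in sentences for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(sentences, vocabulary):
    """Count how often each known token occurs; unknown tokens are skipped."""
    rows = []
    for s in sentences:
        counts = Counter(tokenize(s))
        row = [0] * len(vocabulary)
        for tok, idx in vocabulary.items():
            row[idx] = counts[tok]
        rows.append(row)
    return rows

sentence = ["Literally just 8 really really cute dogs"]
vocabulary = fit_vocabulary(sentence)
bag = transform(sentence, vocabulary)
print(vocabulary)
print(bag)
```

The printed vocabulary and counts match the table above: five tokens, with `really` counted twice.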

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

Fitting the `CountVectorizer` learns the vocabulary.

In [None]:
sentence = ["Literally just 8 really really cute dogs"]
vectorizer.fit(sentence)
print(vectorizer.vocabulary_) # dictionary of words and ids

We can then transform our textual data into numerical data.

In [None]:
vectorizer.transform(sentence).toarray()

Note: words that are not in the learned vocabulary are simply ignored when we transform new text, so only `dogs` is counted in the example below.

In [None]:
sentence = ["OMG 5 truly hilarious dogs 😂"]
vectorizer.transform(sentence).toarray()

## Classifier

In the classification task, we take a single example (such as an article row) and decide which class it belongs to (e.g., clickbait or not clickbait). A standard approach to classification is to find a boundary that best separates the training examples according to their class. In the binary classification problem below, we've drawn a linear boundary that separates the data pretty well. Each sample is described by the number of times that word1 and word2 occur. In reality we will have many more words associated with each sample, but the concept remains the same.

We determine this boundary using a Support Vector Machine. The SVM has two steps:
 - **Fit** we learn the boundary from labelled data
 - **Predict** we predict the classes of unlabelled data

![classify](images/svm-classify.png)

In [None]:
from sklearn.svm import LinearSVC
svc = LinearSVC()

Let's make up some samples.

In [None]:
bag_of_words = [
    [1, 5], [1, 4], [2, 6], [4, 2], [3, 4], [2, 1]
]
labels = [1, 1, 1, 0, 0, 0]

In [None]:
from utils.plotting import plot_2d_samples
plot_2d_samples(bag_of_words, labels)

Now we learn the boundary!

In [None]:
svc = svc.fit(bag_of_words, labels)

In [None]:
from utils.plotting import plot_2d_trained_svc
plot_2d_trained_svc(bag_of_words, labels, svc)

Once we have learned the boundary, we can predict the labels of new samples.

In [None]:
svc.predict([[3, 1], [2, 4]])
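
To make "which side of the boundary" concrete, the sketch below rebuilds the prediction by hand from the learned weights. It refits a separate model (`toy_svc`, with `random_state=0` added as an assumption for repeatability) on the same made-up samples, then scores new points with `w . x + b`.

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[1, 5], [1, 4], [2, 6], [4, 2], [3, 4], [2, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

toy_svc = LinearSVC(random_state=0).fit(X, y)

# The boundary is the set of points where w . x + b == 0
w, b = toy_svc.coef_[0], toy_svc.intercept_[0]

new_samples = np.array([[3, 1], [2, 4]])
scores = new_samples @ w + b           # signed score relative to the boundary
manual = (scores > 0).astype(int)      # positive side -> label 1, otherwise 0

print(manual, toy_svc.predict(new_samples))
```

The hand-computed labels agree with `predict`, and `scores` is the same quantity that `decision_function` returns.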

## Putting it all together in a pipeline

A pipeline consists of one or more transform steps and a final classification step. It is a convenient way to wrap all our transformations in one easy-to-use box. In general we use the following methods:
 - **Fit** to fit the transformers and classifier
 - **Predict** to transform data and predict its label
 
Below is a detailed schematic of the data flow in the pipeline when we call these two methods.

![pipeobj](images/sklearn-pipeline.png)

We'll create a pipeline with two steps:
1. Transform textual data to a bag of words vector
2. Predict label from the bag of words vector

So the input of our pipeline is text data and the output is a label!

In [None]:
# Pipeline expects a list of (name, estimator) pairs
steps = [
    ('vectorizer', CountVectorizer()),
    ('classifier', LinearSVC())
]

In [None]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps)
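
As a rough illustration of what the pipeline saves us, the sketch below runs the two steps by hand and then through a `Pipeline`, and checks that both give the same predictions. The titles, labels and `toy_` names are invented for illustration, and `random_state=0` is an assumption added so the two classifiers are fitted identically.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical titles and labels, invented for illustration only
toy_titles = [
    "10 cute dogs you need to see",
    "You won't believe these 5 tricks",
    "Parliament votes on new budget",
    "Central bank holds interest rates",
]
toy_labels = [1, 1, 0, 0]
new_titles = ["7 cute tricks you need", "Bank votes on rates"]

# By hand: fit and apply each step in turn
toy_vectorizer = CountVectorizer()
toy_classifier = LinearSVC(random_state=0)
toy_classifier.fit(toy_vectorizer.fit_transform(toy_titles), toy_labels)
by_hand = toy_classifier.predict(toy_vectorizer.transform(new_titles))

# With a pipeline: the same two steps behind one fit/predict interface
toy_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LinearSVC(random_state=0)),
])
toy_pipeline.fit(toy_titles, toy_labels)
piped = toy_pipeline.predict(new_titles)

print(by_hand, piped)
```

Note that the hand-written version has to remember to call `transform` (not `fit_transform`) on new data; the pipeline takes care of that distinction for us.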

**Congratulations!** You've built your first text classification pipeline.
![pipeygif](https://tinynin.files.wordpress.com/2012/01/warppipe-copy.gif)

## Training our pipeline on real data

Now that we know how the vectorizer and classifier work together to form a pipeline, we can train it on the real data. 

**Machine learning discipline 101**
 - Split your data into a training and testing set
 - NEVER look at your testing data. Hide it away. Save it for later. Lock it in a drawer!
 - Your training data helps you to fit your models and select one
 - Your testing data is used for final evaluation
 
Scikit-learn's `train_test_split` shuffles our data and splits it into two sets. We can also use *stratified sampling* to ensure that both sets have the same distribution of labels.

In [None]:
from sklearn.model_selection import train_test_split
training, testing = train_test_split(
    df,                # The dataset we want to split
    train_size=0.7,    # The proportional size of our training set
    stratify=df.label, # The labels are used for stratification
    random_state=400   # Use the same random state for reproducibility
)
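
The effect of stratification is easiest to see on an imbalanced toy example. The sketch below uses invented labels (`toy_labels`) with the same `train_size` and `random_state` as above and checks the share of clickbait in each set.

```python
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 non-clickbait (0) and 20 clickbait (1)
toy_labels = [0] * 80 + [1] * 20

toy_train, toy_test = train_test_split(
    toy_labels,
    train_size=0.7,
    stratify=toy_labels,
    random_state=400,
)

# Both sets keep the original 20% share of clickbait
print(sum(toy_train) / len(toy_train), sum(toy_test) / len(toy_test))
```

Without `stratify`, the split is purely random, so a small test set could end up with noticeably more or fewer clickbait samples than the training set.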

In [None]:
training.head(5)

In [None]:
print(len(training))
print(len(testing))

### Train

Now we're ready to train!

In [None]:
pipeline = pipeline.fit(training.title, training.label)

What? That was it?! 

That's right. You've just built a machine learning classifier for clickbait and it was that easy to train. Let's test it out:

In [None]:
pipeline.predict(["10 things you need to do..."])

In [None]:
pipeline.predict(["French election polls show an early lead for Macron."])

## Introspect our model

In [None]:
from utils.plotting import print_top_features
print_top_features(pipeline, n_features=10)

## Evaluating our model

Cross-validation is a method of measuring model performance:
1. Split the training data into $n$ chunks
2. Train the pipeline on all but one of the chunks
3. Predict the label of the samples in the remaining chunk
4. Repeat this until we have predicted the labels of the entire training set
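
The four steps above can be sketched as an explicit loop. This is a minimal illustration on the made-up bag-of-words samples from earlier, assuming three stratified chunks and a `LinearSVC` with a fixed `random_state`; the names `X`, `y`, `model` and `manual` are chosen for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import LinearSVC

X = np.array([[1, 5], [1, 4], [2, 6], [4, 2], [3, 4], [2, 1]])
y = np.array([1, 1, 1, 0, 0, 0])
model = LinearSVC(random_state=0)

# 1. Split the data into n chunks, keeping the label balance in each chunk
folds = StratifiedKFold(n_splits=3)

manual = np.full(len(y), -1)
for train_idx, held_out_idx in folds.split(X, y):
    # 2. Train a fresh copy on all but one chunk
    fitted = clone(model).fit(X[train_idx], y[train_idx])
    # 3. Predict the labels of the held-out chunk
    manual[held_out_idx] = fitted.predict(X[held_out_idx])
# 4. After the loop every sample has exactly one out-of-fold prediction

print(manual)
print(cross_val_predict(model, X, y, cv=3))
```

Because each prediction comes from a model that never saw that sample during training, the resulting labels give a fairer picture of performance than predicting on the data the model was fitted on.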

In [None]:
from sklearn.model_selection import cross_val_predict
predicted_labels = cross_val_predict(pipeline, training.title, training.label)

Now that we have our predicted labels, we can check our performance.

In [None]:
from utils.plotting import pipeline_performance
pipeline_performance(training.label, predicted_labels)
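
`pipeline_performance` comes from the workshop's `utils` package and its internals aren't shown here. As a rough idea of the kind of numbers a report like this typically contains (an assumption about the helper, not a description of it), the sketch below computes confusion counts, accuracy, precision and recall by hand. The function name `binary_report` and the labels are made up.

```python
def binary_report(true_labels, predicted):
    """Confusion counts and summary metrics for labels 1 (clickbait) and 0."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # clickbait, caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # news flagged as clickbait
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # clickbait, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # news, left alone
    return {
        'confusion': {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn},
        'accuracy': (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels, invented for illustration only
report = binary_report([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(report)
```

Precision answers "of the articles we flagged, how many were really clickbait?", while recall answers "of the real clickbait, how much did we catch?".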

It's always a good idea to inspect those samples that we got wrong.

In [None]:
training[training.label != predicted_labels]

## Improving the model

We had pretty high accuracy but can we do better?

### Use title and description

Rather than using just the title, we could use the title and description together.

In [None]:
predicted_labels = cross_val_predict(pipeline, training.full_content, training.label)
pipeline_performance(training.label, predicted_labels)

### TfidfVectorizer Transformer

The `TfidfVectorizer` is a refinement of the `CountVectorizer`. Text is still transformed into a bag of words, but each count is reweighted so that words appearing in many articles count for less and rarer, more distinctive words count for more.
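
To show where the weights come from, the sketch below rebuilds the TF-IDF values by hand and compares them with `TfidfVectorizer`. It assumes the default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1` for `n` documents and document frequency `df`. The three documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical documents, invented for illustration only
docs = ["cute dogs", "cute cats", "election polls"]

counts = CountVectorizer().fit_transform(docs).toarray()   # term counts per document
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                         # documents containing each term

# Smoothed inverse document frequency (assumed default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf
# Scale each row to unit length (assumed default, norm='l2')
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

In the first document, `dogs` (found in one document) ends up with a larger weight than `cute` (found in two), even though both occur exactly once.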

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

steps = [
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LinearSVC())
]
pipeline = Pipeline(steps)

predicted_labels = cross_val_predict(pipeline, training.full_content, training.label)
pipeline_performance(training.label, predicted_labels)

### Replace numbers

Earlier we saw that numbers are very common in clickbait articles, so they are likely to be a very descriptive feature. However, our vocabulary is limited to the numbers that we saw in training. We could replace every number with a single placeholder token so that all numbers are represented equally.

All we need to do is write a string preprocessor and pass it to the vectorizer.

In [None]:
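import re

def mask_integers(s):
    # A custom preprocessor replaces the vectorizer's built-in lowercasing,
    # so we lowercase here before masking every run of digits
    return re.sub(r'\d+', 'INTMASK', s.lower())

steps = [
    ('vectorizer', TfidfVectorizer(preprocessor=mask_integers)),
    ('classifier', LinearSVC())
]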
pipeline = Pipeline(steps)

predicted_labels = cross_val_predict(pipeline, training.full_content, training.label)
pipeline_performance(training.label, predicted_labels)
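
One detail worth checking when passing a `preprocessor`: it replaces the vectorizer's built-in preprocessing stage, which is where lowercasing normally happens. The sketch below compares a mask-only preprocessor with a variant that lowercases first. The function names `mask_only` and `lower_then_mask` and the two titles are hypothetical, and `CountVectorizer` is used only to keep the output easy to read.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def mask_only(s):
    # Every run of digits becomes one placeholder token; case is left untouched
    return re.sub(r'\d+', 'INTMASK', s)

def lower_then_mask(s):
    # Hypothetical variant that restores lowercasing before masking
    return re.sub(r'\d+', 'INTMASK', s.lower())

# Hypothetical titles, invented for illustration only
titles = ["10 Things to know", "7 things to know"]

vocab_mask_only = sorted(CountVectorizer(preprocessor=mask_only).fit(titles).vocabulary_)
vocab_lowered = sorted(CountVectorizer(preprocessor=lower_then_mask).fit(titles).vocabulary_)

print(vocab_mask_only)
print(vocab_lowered)
```

With the mask-only version, `Things` and `things` become two separate features. In both versions `10` and `7` collapse into the single `INTMASK` token, which is the point of the exercise.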

### Hyperparameters

When we constructed our `LinearSVC` and vectorizers we used the default parameters. These additional parameters (hyperparameters) can be tuned to improve our classifier by performing a grid search. A grid search trains and cross-validates a classifier for every combination of the parameter values and records how each one performs. We can then pick the best one.

In [None]:
steps = [
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LinearSVC())
]

pipeline = Pipeline(steps)

Some parameters we can tune:
 - `stop_words` - we can ignore very common words (the, a, it, ...). scikit-learn has a built-in `'english'` stop word list we can use
 - `ngram_range` - so far we have split sentences into single words. We could also try pairs of words
 - `C` - the regularisation parameter of the SVM; smaller values regularise more strongly
 
We set up our grid as a dictionary. Note that each key must be prefixed with the step name and a double underscore so that scikit-learn knows which component the parameter belongs to:

In [None]:
gs_params = {
    'vectorizer__stop_words': ['english', None],
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vectorizer__preprocessor': [mask_integers, None],
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]
}
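
Before launching the search it helps to know how much work it involves. The sketch below counts the combinations in a grid of the same shape, using a string as a stand-in for the `mask_integers` function. The fold count of 5 is an assumption based on the default `cv` in recent scikit-learn versions.

```python
from itertools import product

# Same grid shape as above; a string stands in for the mask_integers function
grid = {
    'vectorizer__stop_words': ['english', None],
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vectorizer__preprocessor': ['mask_integers', None],
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
}

# Every combination of one value per parameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

n_folds = 5  # assumption: default number of cross-validation folds
print(len(combinations), 'combinations,', len(combinations) * n_folds, 'model fits')
```

The count grows multiplicatively (2 x 3 x 2 x 6 here), so adding even one more parameter with a handful of values can noticeably lengthen the search.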

In [None]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(pipeline, gs_params, n_jobs=-1)
gs.fit(training.full_content, training.label)

In [None]:
print(gs.best_params_)
print(gs.best_score_)

## Testing generality

So far we have selected our model based on the training data alone. It looks like our performance is excellent but we may have overfit to the training data. Does our classifier generalise well?

In other words, do we perform as well on data that we have never seen before?

In [None]:
pipeline = gs.best_estimator_
predicted_labels = pipeline.predict(testing.full_content)
pipeline_performance(testing.label, predicted_labels)

HURRAH!

# Going into the wild

![hunger](https://ronanwills.files.wordpress.com/2015/06/vlcsnap-2015-06-11-23h05m49s76.png?w=625)

An excellent feature of scikit-learn is that we can save a trained classifier using Python's `pickle` module. We can load it later for
 - data analysis
 - data provenance
 - sharing with somebody
 - providing ML as a service (coming up)

In [None]:
filename = 'classifiers/clickbait_svc_v1'

In [None]:
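import os
import pickle

# Make sure the target directory exists before writing
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, 'wb') as f:
    pickle.dump(pipeline, f)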
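
Loading works the same way in reverse. The sketch below is a self-contained round trip on a tiny stand-in pipeline (`toy_pipeline`, trained on two invented titles) written to a temporary file, so it does not touch the classifier saved above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the trained pipeline
toy_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LinearSVC(random_state=0)),
]).fit(
    ["10 cute dogs you need to see", "Parliament votes on new budget"],
    [1, 0],
)

# Save to a temporary file, then load it back as a separate object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_classifier')
    with open(path, 'wb') as f:
        pickle.dump(toy_pipeline, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

sample = ["5 cute dogs"]
print(toy_pipeline.predict(sample), restored.predict(sample))
```

Two practical points: only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and load the file with the same scikit-learn version that saved it where possible.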

# Congratulations

You've built a clickbait classifier. Head over to `advanced.ipynb` or look through the `webservice` directory to put your pipeline to use.

![gatsby](https://media.tenor.co/images/78f5d1acd72e8a66257ea671b4aefd5f/tenor.gif)
