# Playing around with our classifier

The data directory contains a much larger collection of news articles from various news sources. We can combine our classifier with pandas to gain some insights.

In [None]:
import pandas as pd
%matplotlib inline

Load in the data

In [None]:
df = pd.read_csv('./data/all_data.csv')
df.head()

Load our pre-trained classifier

In [None]:
import pickle
CLASSIFIER_PATH = './classifiers/clickbait_svc_v1'
# Only unpickle files you trust: pickle.load can run arbitrary code.
with open(CLASSIFIER_PATH, 'rb') as f:
    classifier = pickle.load(f)

Create the combined text field. First replace any missing values (NaN) with empty strings, because concatenating NaN with a string would leave the whole field missing.

In [None]:
df['description'] = df['description'].fillna('')
df['title'] = df['title'].fillna('')
df['text'] = df['description'] + df['title']
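To see why the `fillna('')` step matters, here is a minimal sketch on a toy DataFrame. The rows are made up for illustration and are not taken from `all_data.csv`; the concatenation mirrors the one used above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows) where one description is missing.
toy = pd.DataFrame({
    'description': ['Markets fell sharply on Monday.', np.nan],
    'title': ['Stocks slide', 'You will not believe this trick'],
})

# Without fillna, NaN + string stays NaN, so the title is lost too.
naive = toy['description'] + toy['title']

# With fillna(''), the row keeps whatever text it does have.
filled = toy['description'].fillna('') + toy['title'].fillna('')

print(naive.isna().tolist())
print(filled.tolist())
```

The second row survives only in the `filled` version, which is what the classifier needs, since it expects a string for every article.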

Now we predict the clickbait labels (remember: 1 = clickbait, 0 = not clickbait) and add them to our data as a new column.

In [None]:
predicted_labels = classifier.predict(df['text'])
df = df.assign(label=predicted_labels)

## Which news source is the most clickbaity?

In [None]:
# The mean of a 0/1 label is the fraction of articles predicted as clickbait.
df.groupby('source')['label'].mean().sort_values().plot(kind='barh')

The Financial Times is surprisingly high. Let's inspect.

In [None]:
df[df['source'] == 'financial-times']

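It seems that there aren't many FT articles, and the ones we have appear to be opinion columns. With so few articles, a handful of predictions can dominate the source's clickbait share, so it's understandable that our rudimentary classifier makes the FT look clickbaity.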
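One way to spot this kind of small-sample effect earlier is to show the article count next to the clickbait share for each source. The sketch below uses a toy DataFrame with made-up labels; the `bbc-news` source name and the column names `clickbait_share` and `n_articles` are assumptions for illustration only.

```python
import pandas as pd

# Toy predictions (hypothetical values, not the notebook's real data).
toy = pd.DataFrame({
    'source': ['financial-times', 'financial-times', 'bbc-news',
               'bbc-news', 'bbc-news', 'bbc-news'],
    'label': [1, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 label is the clickbait share; size is the article count.
summary = (
    toy.groupby('source')['label']
       .agg(['mean', 'size'])
       .rename(columns={'mean': 'clickbait_share', 'size': 'n_articles'})
       .sort_values('n_articles')
)
print(summary)
```

Running the same aggregation on `df` would put sources with very few articles at the top of the table, where a high share is easier to read with appropriate caution.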