Skip to content
Branch: master
Find file History
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.

This model currently achieves an F-score of 92% (obtained through 5-fold cross-validation). It is trained on 8000 sensational and 8000 objective headlines and articles.

It takes as input a 2-column CSV file, where the first column corresponds to the headlines and second one corresponds to the article texts. The output file contains a third column with the label - 1 if the input is categorized as sensationalist, 0 if not.

The classifier is a non-linear SVM, and it uses the following features:

POS tags (unigrams and bigrams)

Punctuation counts

Average sentence length

Number of all-cap tokens (excluding common abbreviations, and normalized by length of text)

Number of words that overlap with the Pattern Profanity word list (normalized by length of text)

Polarity and Subjectivity scores (obtained through the Pattern Sentiment module)

This directory contains the trained model, as well as Python code to train a model and to classify new data, the data used for training, and a sample of test data (a held-out set) with TRUE labels.

Usage for the Python code

python -args

The arguments are:

-t, --trainset: Path to training data (if you are training a model)

-m, --model: Path to model (if you are using a pre-trained model)

-d, --dump: Dump trained model? Default is False

-v, --verbose: Default is non-verbose

-c, --classify: Path to new inputs to classify

-s, --save: Path to the output file (default is 'output.csv')

Experiments and Results

Below is the classification report for the final model:

	precision	recall	f-score
	objective	0.91	0.93	0.92
	sensationalist	0.92	0.91	0.92

Below are f-scores for models trained on subsets of features (text statistics refers to the combination of sentence length, all-caps, profanity, polarity, and subjectivity features):

	Features	F-score
	Text Stats	71
	Punctuation	85
	POS unigrams	79
	POS bigrams	86
	Text Stats + Punctuation	86
	Text Stats + POS Bigrams	88
	Punctuation + POS Bigrams	91
	Text Stats + Punctuation + POS Bigrams	92

This model is trained on features from both the headlines and the articles. Training the same model ONLY on headlines or ONLY on articles results in lower f-scores (82% and 89% respectively).


- Scikit-learn for modeling

- NLTK for POS-tagging

- Pattern for Polarity, Subjectivity, and Profanity features
You can’t perform that action at this time.