# News Classifier Notebook #

This file is an [IPython notebook](https://ipython.org/ipython-doc/3/notebook/notebook.html) that defines a simple machine learning classifier for news articles.

## Install ##
The easiest way to explore machine learning with Python is to [install the
Anaconda Distribution](https://www.anaconda.com/distribution/), which bundles a
variety of useful Python packages in a single installation. Alternatively, you
can [download and install Python](https://www.python.org/downloads/) and then
use [pipenv](https://pipenv.readthedocs.io/en/latest/) to install the packages
specified in this repo's Pipfile by running `pipenv update`.
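
Whichever route you take, a quick check that the packages imported by this notebook are visible to your interpreter can save a failed first cell. The sketch below assumes the import names `newspaper` and `sklearn` used in the cells that follow; `missing_packages` is a hypothetical helper, not part of this repo.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names taken from the notebook's first code cell
missing = missing_packages(["newspaper", "sklearn"])
if missing:
    print("Not installed: {}".format(", ".join(missing)))
else:
    print("All required packages were found.")
```

If you used pipenv, run the check with `pipenv run python` so that it sees the same environment as the notebook.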

Once you have Python and Jupyter installed, run `jupyter notebook NewsClassifier.ipynb`. If you used pipenv, run `pipenv run jupyter notebook NewsClassifier.ipynb` instead.

At that point, this notebook should be running on your local computer, and you can follow along by running the code in the following cells.

In [59]:
from newspaper import Article
import csv
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

## How does this work? ##

The following function loads the [TrainingData.csv](TrainingData.csv) file, which contains URLs and
corresponding classification labels gathered from the [News Classification
Form](https://goo.gl/forms/nLcf2ol0o5dAJxGw1). It skips the header row and returns a list with one dictionary per data row, each holding that row's `url` and `label`.


In [60]:
def get_csv_data(local_file):
    labeled_data = []
    # newline='' lets the csv module handle line endings itself
    with open(local_file, newline='') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        line_count = 0
        for row in csv_reader:
            if line_count == 0:
                print("Discarding first CSV header row: {}".format(", ".join(row)))
            else:
                # for each row, append a new dictionary with "url" and "label" keys
                labeled_data.append({"url": row[1], "label": row[2]})
            line_count += 1
        print("Processed {} lines.".format(line_count))
    return labeled_data
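
The function above picks columns by position (`row[1]`, `row[2]`). As a hedged alternative, `csv.DictReader` consumes the header row itself and lets each column be addressed by name. This is only a sketch: it assumes the header names are exactly `URL` and `Label`, as the header printed in the run output suggests, and `get_csv_data_by_name` and the sample rows are hypothetical.

```python
import csv
import io

def get_csv_data_by_name(csv_file):
    """Return one {"url", "label"} dict per data row of an open CSV file."""
    reader = csv.DictReader(csv_file)  # reads the header row automatically
    return [{"url": row["URL"], "label": row["Label"]} for row in reader]

# Hypothetical sample data in the same three-column layout
sample = io.StringIO(
    "Timestamp,URL,Label\n"
    "2019/03/16,https://example.com/a,Politics\n"
    "2019/03/17,https://example.com/b,Sports\n"
)
rows = get_csv_data_by_name(sample)
print(rows)
```

Addressing columns by name keeps the loader working if the form later adds or reorders columns, at the cost of failing with a `KeyError` if a header is renamed.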


The following function takes a URL as input and returns a [newspaper3k](https://newspaper.readthedocs.io/en/latest/) `Article` object with the page at that URL downloaded and parsed.

In [61]:
def process_article(url):
    a = Article(url, fetch_images=False)
    a.download()
    a.parse()
    return a
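
`process_article` does no error handling, so a dead link in the training data would stop the whole run. One possible guard is sketched below, assuming a failed download or parse surfaces as a Python exception. `try_process` and `fake_fetch` are hypothetical; in the notebook, `process_article` would be passed as `fetch`.

```python
def try_process(url, fetch):
    """Call fetch(url); return None instead of raising if it fails."""
    try:
        return fetch(url)
    except Exception as err:  # broad on purpose: the exact exception type is an assumption
        print("Skipping {}: {}".format(url, err))
        return None

# Hypothetical stand-in for process_article, so the sketch runs offline
def fake_fetch(url):
    if "dead" in url:
        raise IOError("download failed")
    return "text of " + url

results = [try_process(u, fake_fetch) for u in ["https://example.com/ok",
                                                "https://example.com/dead"]]
print(results)
```

Callers then need to drop the `None` entries, together with their labels, before training.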

The following function shuffles the labeled data in place, downloads every article, and returns two parallel lists.

The first list contains the text content of each article. The second list contains the label for the article at the same index.

In [62]:
def get_doc_and_label_lists(labeled_data):
    # Shuffle the labeled data
    random.shuffle(labeled_data)
    docs = []
    labels = []
    for item in labeled_data:
        article = process_article(item['url'])
        docs.append(article.text)
        labels.append(item['label'])
    return docs, labels
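
Because `random.shuffle` is unseeded here and reorders `labeled_data` in place, each run yields a different train/test split and therefore a different score. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for your experiments; `shuffled_copy` and the seed value are hypothetical, not part of the notebook.

```python
import random

def shuffled_copy(items, seed=0):
    """Return a shuffled copy of items, leaving the original list untouched."""
    rng = random.Random(seed)  # private generator; the global random state is not used
    copy = list(items)
    rng.shuffle(copy)
    return copy

# Hypothetical rows shaped like the output of get_csv_data
data = [{"url": "u{}".format(i), "label": "L{}".format(i % 2)} for i in range(6)]
first = shuffled_copy(data, seed=42)
second = shuffled_copy(data, seed=42)
print(first == second)  # same seed, same order
```

Each dictionary still carries its own URL and label, so shuffling the rows never separates a document from its label.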

The following function uses our model to return the predicted label for a given URL.

In [80]:
def classify(model, url):
    art = process_article(url)
    label = model.predict([art.text])
    return label[0]
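
`classify` wraps the article text in a list and takes element `[0]` of the result because a scikit-learn `predict` call takes a collection of documents and returns one label per document. The toy example below illustrates this with the same pipeline type the notebook builds later; the four training sentences and their labels are invented for illustration and are not taken from TrainingData.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented miniature training set
train_docs = [
    "the team won the baseball game",
    "the pitcher threw a strike in the game",
    "the senate passed the election bill",
    "voters elected a new senator",
]
train_labels = ["Sports", "Sports", "Politics", "Politics"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(train_docs, train_labels)

# predict() receives a list of documents and returns an array of labels
predicted = toy_model.predict(["the pitcher won the game"])
print(len(predicted), predicted[0])
```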

The following code uses the functions defined above to prepare data. It then uses code similar to the code found in the [Multinomial Naive Bayes section of the Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html#Multinomial-Naive-Bayes) to create our classifier.



In [118]:
labeled_data = get_csv_data('TrainingData.csv')
print("Scraping data from URLs...")
docs, labels = get_doc_and_label_lists(labeled_data)

# use 80% of the data for training, reserve 20% for testing
split = int(len(docs) * 0.8)  

print("Processing / vectorizing input text for the model")
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

print("Generating the model using the training data")
model.fit(docs[:split], labels[:split])

# This is debug code to manually inspect results
#for test_doc, test_label in zip(docs[split:], labels[split:]):
#    label = model.predict([test_doc])
#    print(label, " | ", test_label, " ||||| ", test_doc[:200])
#    print("--------")
    
# this does the same thing as the loop above,
# while keeping track of right and wrong answers, returning an accuracy score
score = model.score(docs[split:], labels[split:])
print("The model accurately classified {:.0%} of {} test articles".format(score, len(labels[split:])))

Discarding first CSV header row: Timestamp, URL, Label
Processed 11 lines.
Scraping data from URLs...
Processing / vectorizing input text for the model
Generating the model using the training data
The model accurately classified 0% of 2 test articles
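
To make the numbers in this run concrete: 11 CSV lines minus the header leave 10 labeled articles, so the 80% split trains on 8 and tests on 2. The sketch below reproduces that arithmetic and the accuracy calculation that `model.score` performs for a classifier. `split_index` and `accuracy` are hypothetical helpers, and the predictions passed to `accuracy` are made up.

```python
def split_index(n_items, train_fraction=0.8):
    """Index separating training items from test items."""
    return int(n_items * train_fraction)

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if not actual:
        raise ValueError("no test labels to score against")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

n_rows = 10                      # labeled rows in the run above
split = split_index(n_rows)
print(split, n_rows - split)     # training count, test count

# Made-up predictions for a two-article test set: one right, one wrong
print(accuracy(["Politics", "Sports"], ["Politics", "Science and Technology"]))
```

With only two test articles the score can only be 0%, 50%, or 100%, so a single run says little about the model. More labeled data gives a more informative estimate.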


The following code defines a set of test URLs, classifies each one, and prints the results.

In [99]:
testcases = ['https://www.propublica.org/series/machine-bias', 
             'https://www.propublica.org/article/facebook-blocks-ad-transparency-tools',
             'https://www.propublica.org/series/a-users-guide-to-democracy',
             'https://www.chicagotribune.com/sports/baseball/cubs/ct-spt-cubs-diamondbacks-20190316-story.html']
for t in testcases:
    label = classify(model, t)
    print("{} classified as {}".format(t, label))

https://www.propublica.org/series/machine-bias classified as Science and Technology
https://www.propublica.org/article/facebook-blocks-ad-transparency-tools classified as Science and Technology
https://www.propublica.org/series/a-users-guide-to-democracy classified as Science and Technology
https://www.chicagotribune.com/sports/baseball/cubs/ct-spt-cubs-diamondbacks-20190316-story.html classified as Politics
