## Classifying Documents

This example shows how to train a Naive Bayes classifier with NLTK that labels sports news items as MLB or NFL, using the words in each item's RSS summary as features.

In [1]:
import nltk
import random
import feedparser

In [3]:
urls = {
    'mlb': 'https://sports.yahoo.com/mlb/rss.xml',
    'nfl': 'https://sports.yahoo.com/nfl/rss.xml'
}

In [5]:
feedmap = {}
# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing.
stopwords = set(nltk.corpus.stopwords.words('english'))

In [6]:
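def feature_extractor(words):
    """Return a bag-of-words feature dict: one True-valued entry per non-stopword."""
    features = {}
    for word in words:
        # Case-sensitive check: NLTK's stopword list is lowercase, so "The" is kept.
        if word not in stopwords:
            features[f"word({word})"] = True
    return features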
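
To see what the extractor produces, the following minimal sketch runs the same logic on a sample sentence. The small `STOPWORDS` set and the sample text are stand-ins chosen for illustration (assumptions, not the NLTK list), so the snippet runs without downloading any corpus.

```python
# Stand-in stopword set for illustration only; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "in", "to", "of"}

def feature_extractor(words, stopwords=STOPWORDS):
    """Map each non-stopword token to a True-valued 'word(...)' feature."""
    features = {}
    for word in words:
        if word not in stopwords:
            features[f"word({word})"] = True
    return features

# Hypothetical sample text, split on whitespace as in the notebook.
sample = "The pitcher threw a shutout in the ninth".split()
features = feature_extractor(sample)
print(features)
```

Note that the capitalized "The" survives because the stopword check is case-sensitive. Lowercasing each token before the lookup would be one way to filter it out.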

In [7]:
sentences = []

for category, url in urls.items():
    # Announce the download before fetching and parsing the feed.
    print('downloading {}'.format(url))
    feedmap[category] = feedparser.parse(url)
    for entry in feedmap[category]['entries']:
        # Tokenize each summary on whitespace and keep it with its label.
        words = entry['summary'].split()
        sentences.append((category, words))

downloading https://sports.yahoo.com/mlb/rss.xml
downloading https://sports.yahoo.com/nfl/rss.xml


In [9]:
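# Pair each feature dict with its label, then shuffle so the two
# categories are mixed before the data is split.
feature_sets = [(feature_extractor(words), category)
                for category, words in sentences]
random.shuffle(feature_sets)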

In [10]:
# Hold out the first half for testing and train on the rest.
total = len(feature_sets)
off = total // 2
trainset = feature_sets[off:]
testset = feature_sets[:off]

classifier = nltk.NaiveBayesClassifier.train(trainset)
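
Because the shuffle is unseeded, the split (and so the reported accuracy) can differ from run to run. A minimal sketch of a reproducible variant follows, using a toy list in place of `feature_sets`. The `split_half` helper and the seed value are hypothetical names introduced here, not part of the notebook.

```python
import random

def split_half(items, seed=0):
    """Shuffle a copy of items with a fixed seed, then split as the notebook does."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    off = len(shuffled) // 2
    # The first half is held out for testing; the remainder is used for training.
    return shuffled[off:], shuffled[:off]

# Toy (features, label) pairs standing in for the real feature sets.
toy = [({"word(x%d)" % i: True}, "mlb" if i % 2 else "nfl") for i in range(7)]
trainset, testset = split_half(toy)
print(len(trainset), len(testset))
```

With an odd number of items, the integer division puts the extra item in the training half.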

In [11]:
nltk.classify.accuracy(classifier, testset)

0.9019607843137255
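
Accuracy here is the fraction of test items whose predicted label matches the true label. The sketch below computes that fraction by hand. The `accuracy` function and the rule-based `toy_classify` are hypothetical stand-ins for `nltk.classify.accuracy` and the trained classifier, used so the snippet runs without NLTK.

```python
def accuracy(classify, gold):
    """Return the fraction of (features, label) pairs that classify labels correctly."""
    if not gold:
        return 0.0
    correct = sum(1 for features, label in gold if classify(features) == label)
    return correct / len(gold)

def toy_classify(features):
    # Hypothetical rule: items mentioning 'inning' are baseball, everything else football.
    return "mlb" if "word(inning)" in features else "nfl"

gold = [
    ({"word(inning)": True}, "mlb"),
    ({"word(touchdown)": True}, "nfl"),
    ({"word(home)": True}, "mlb"),  # misclassified by the toy rule
    ({"word(quarterback)": True}, "nfl"),
]
score = accuracy(toy_classify, gold)
print(score)
```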

In [12]:
classifier.show_most_informative_features(5)

# Classify the first four NFL entries. Features are built from the title,
# while the summary is what gets printed next to the predicted category.
for entry in feedmap['nfl']['entries'][:4]:
    features = feature_extractor(entry['title'].split())
    category = classifier.classify(features)
    print('{} -> {}'.format(category, entry['summary']))

Most Informative Features
              word(list) = True              nfl : mlb    =      5.9 : 1.0
              word(home) = True              mlb : nfl    =      4.8 : 1.0
              word(back) = True              nfl : mlb    =      4.5 : 1.0
              word(team) = True              nfl : mlb    =      4.5 : 1.0
            word(player) = True              nfl : mlb    =      3.8 : 1.0
nfl -> Doug Pederson is the second coach in the league known to have contracted the coronavirus. Saints coach Sean Payton tested positive in March.
nfl -> It looks like the Raiders have added a veteran running back to their mix. Jeremy Hill tweeted Sunday morning "I'm a Raider." NFL Media's Ian Rapoport reported Friday that the Raiders hosted Hill for a workout and planned to sign the running back.
mlb -> The Vikings are nearing completion on a multiyear deal for General Manager Rick Spielman, Tom Pelissero of NFL Media reports Sunday night. The move was expected after coach Mike Zimmer recen