# Programming a Category Predictor

A category predictor predicts the category to which a given piece of text belongs.
This is a common text classification task, used to categorize text documents. Search engines, for example,
frequently use such predictors to order search results by relevance.

To build this predictor, we will use a statistic called *Term Frequency – Inverse Document Frequency (tf-idf)*. We will not go into detail, but in general the tf-idf statistic tells us how important a given word is to a document in a set of documents.

Simply put, the term frequency (tf) measures how often a word appears in a given document: the number of times the word appears in the document divided by the total number of words in that document. Every document has its own term frequencies. The second part of the statistic, the inverse document frequency (idf), measures how unique a word is to this document within the given set of documents: words that occur in many documents get a low idf, words that occur in only a few get a high one.
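To make the two parts concrete, here is a minimal sketch in plain Python. The three-document corpus and the helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are made up for illustration, and the idf uses the textbook form `log(N / df)`. scikit-learn's `TfidfTransformer` uses a slightly different, smoothed formula and normalizes each document vector by default, so its numbers will differ.

```python
import math

# Toy corpus (made up for illustration): each document is a list of words.
docs = [
    "the engine of the car".split(),
    "the hockey game".split(),
    "the car race".split(),
]

def term_frequency(word, doc):
    # occurrences of the word divided by the total number of words in the document
    return doc.count(word) / len(doc)

def inverse_document_frequency(word, docs):
    # log of (number of documents / number of documents containing the word);
    # assumes the word occurs in at least one document
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)

# score three words of the first document
for word in ("the", "car", "engine"):
    print(word, round(tf_idf(word, docs[0], docs), 3))
```

In this toy corpus "the" scores 0 because it occurs in every document, while "engine", which occurs in only one document, scores highest.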

## 1. Import and download the data

First import the following packages.

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

We can use the `fetch_20newsgroups` function to get a scikit-learn dataset with articles from 20 different newsgroups. Every article belongs to one of the following categories:

```
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

We define a map of categories that we will use for training. We will use seven categories from the list of 20. The keys in this dictionary object refer to the categories in the scikit-learn dataset. The values in the dictionary give us a more comprehensible description of the category.

In [None]:
# define the category map
category_map = {'talk.politics.misc': 'Politics', 'rec.autos': 'Cars',
 'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics',
 'sci.med': 'Medicine', 'talk.politics.guns' : 'Guns', 'sci.space' : 'Space'}

We can get the training dataset using the `fetch_20newsgroups` function. You can find more information in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html).

In [None]:
# get the training dataset
training_data = fetch_20newsgroups(subset='train', categories=category_map.keys(), shuffle=True, random_state=5)

## 2. Explore the data

The function returns a `Bunch` object (called `bunch` below) with the following attributes:

- bunch.data: list, length [n_samples]
- bunch.target: array, shape [n_samples]
- bunch.filenames: list, length [n_samples]
- bunch.DESCR: a description of the dataset.
- bunch.target_names: a list of categories of the returned data, length [n_classes]. This depends on the categories parameter.

So you can display a list of categories of the returned data as follows:

In [None]:
print(training_data.target_names)

The category of the first article can be found as follows:

In [None]:
print(training_data.target[0])

So the first article belongs to:

In [None]:
print(training_data.target_names[training_data.target[0]])

Let's check this by printing the article itself:

In [None]:
print(training_data.data[0])

We can also print the number of articles.

In [None]:
print(len(training_data.data))

### Exercise

Print the first 10 articles together with the category.

## 3. Create and train the classifier

We build a count vectorizer and extract term counts (we already did this in the lesson about *Content Based Recommender Systems*, more information can be found there).

In [None]:
# build a count vectorizer and extract term counts
count_vectorizer = CountVectorizer()
train_tc = count_vectorizer.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)

As you can see above there are 3,983 articles containing 53,448 different words.

Next, create a Term Frequency – Inverse Document Frequency (tf-idf) transformer and fit it on the term counts of the training data. As mentioned earlier, you don't need to understand the details of the tf-idf statistic.

In [None]:
# create the tf-idf transformer
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)

Finally, train a Multinomial Naive Bayes classifier using the training data:

In [None]:
# train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(train_tfidf, training_data.target)
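As an alternative way to organize the three steps (count vectorizer, tf-idf transformer, classifier), scikit-learn's `make_pipeline` can chain them into a single object. The sketch below is not part of the original lesson; it uses a made-up miniature training set (`toy_texts`, `toy_labels`) instead of the newsgroups data, so it runs without downloading anything.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up miniature training set, only for illustration
toy_texts = [
    "the engine and the brakes of the car",
    "a new car with a powerful engine",
    "the goalie stopped the puck on the ice",
    "the hockey team scored on the ice",
]
toy_labels = ["Cars", "Cars", "Hockey", "Hockey"]

# chain the three steps: term counts -> tf-idf -> Naive Bayes
toy_model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# the pipeline applies the same transformations to new text before predicting
toy_predictions = toy_model.predict(
    ["the car needs a new engine", "the puck slid over the ice"]
)
print(toy_predictions)
```

With a pipeline, new text goes through the same vectorizer and transformer automatically, which avoids forgetting one of the transformation steps at prediction time.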

## 4. Convert the test data to the right format

Let's define some sample articles (from Wikipedia and some newspaper sites) that will be used for testing. We need to transform them in the same manner so they can be used for testing with the classifier.

In [None]:
# define test data 
input_data = [
    'Electronic circuits usually use direct current sources. The load of an electronic circuit may be as simple as a few resistors, capacitors, and a lamp, all connected together to create the flash in a camera. Or an electronic circuit can be complicated, connecting thousands of resistors, capacitors, and transistors. It may be an integrated circuit such as the microprocessor in a computer. Resistors and other circuit elements can be connected in series or in parallel. Resistance in series circuits is the sum of the resistance.',
    'Van Doren became European champions with the Belgium under-21 squad in 2012. His first selection for the national team was at the age of 17.[8] With Belgium, he became European vice-champion at the 2013 European Championship on home ground in Boom and at the 2017 European Championship in Amstelveen, Netherlands. He was a part of the Belgian squad which won the silver medal at the 2016 Summer Olympics. In 2016 he won the FIH Rising Star of the Year award, which he won again in 2017 together with the FIH Player of the Year award. He was a part of the Belgian squad which won Belgium its first World Cup and European title.',
    'NRA stands for National Rifle Association. The group was founded in 1871 as a recreational group designed to "promote and encourage rifle shooting on a scientific basis". The NRA s path into political lobbying began in 1934 when it began mailing members with information about upcoming firearms bills.',
    'Frank, Viscount De Winne (born 25 April 1961, in Ledeberg, Belgium) is a Belgian Air Component officer and an ESA astronaut. He is Belgiums second person in space (after Dirk Frimout). He was the first ESA astronaut to command a space mission when he served as commander of ISS Expedition 21. ESA astronaut de Winne serves currently as Head of the European Astronaut Centre of the European Space Agency in Cologne/Germany (Köln).',
    'The Chevrolet Silverado, and its mechanically identical cousin the GMC Sierra, are a series of full-size and heavy-duty pickup trucks manufactured by General Motors and introduced in 1998 as the successor to the long-running Chevrolet C/K line. The Silverado name was taken from a trim level previously used on its predecessor, the Chevrolet C/K pickup truck from 1975 through 1998.',
    'Ambassador Gordon Sondland said that President Trump and his personal lawyer, Rudolph W. Giuliani, sought to condition a White House invite for Ukraines new president to demands that his country publicly launch investigations that could damage Trumps opponents.',
    'Viagra (sildenafil) relaxes muscles found in the walls of blood vessels and increases blood flow to particular areas of the body. Viagra is used to treat erectile dysfunction (impotence) in men. Another brand of sildenafil is Revatio, which is used to treat pulmonary arterial hypertension and improve exercise capacity in men and women. This page contains specific information for Viagra, not Revatio. Do not take Viagra while also taking Revatio, unless your doctor tells you to.'
]

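Transform the input data using the count vectorizer, then transform the resulting term counts using the tf-idf transformer so that they can be passed to the classifier. Note that we call `transform` here, not `fit_transform`, so the vocabulary and idf weights learned from the training data are reused.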

In [None]:
# transform input data using count vectorizer
input_tc = count_vectorizer.transform(input_data)

# transform vectorized data using tfidf transformer
input_tfidf = tfidf.transform(input_tc)
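The reason for reusing the fitted vectorizer can be illustrated with a small sketch on made-up sentences (the names `demo_vectorizer`, `demo_train` and `demo_new` are hypothetical). `transform` keeps the columns learned during fitting, so new text gets the same number of features, and words that were never seen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_vectorizer = CountVectorizer()

# fit_transform() learns the vocabulary from the (made-up) training sentences
demo_train = demo_vectorizer.fit_transform(
    ["the car has an engine", "the hockey game"]
)

# transform() reuses that vocabulary for new text
demo_new = demo_vectorizer.transform(["the spaceship has an engine"])

# both matrices have the same number of columns
print(demo_train.shape[1] == demo_new.shape[1])

# the unseen word "spaceship" has no column and is not counted
print("spaceship" in demo_vectorizer.vocabulary_)
print(demo_new.toarray())
```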

## 5. Predict the category

Finally let's predict the output using the tf-idf transformed vector and print the output category for each sample in the input test data. Check if our classifier can find the category of the sample articles.

In [None]:
# predict the output categories
predictions = classifier.predict(input_tfidf)

# print the outputs
for text, category in zip(input_data, predictions):
    print('\nInput:', text, '\nPredicted category:',
          category_map[training_data.target_names[category]])
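Besides the predicted category, `MultinomialNB` can also report how probable it considers each category through `predict_proba`. The following sketch shows this on a made-up miniature dataset (all `proba_` names are hypothetical and the tf-idf step is left out to keep it short); the same call could be tried on the trained classifier from this lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature training set, only for illustration
proba_texts = [
    "the engine and the brakes of the car",
    "a new car with a powerful engine",
    "the goalie stopped the puck on the ice",
    "the hockey team scored on the ice",
]
proba_labels = ["Cars", "Cars", "Hockey", "Hockey"]

proba_vectorizer = CountVectorizer()
proba_classifier = MultinomialNB().fit(
    proba_vectorizer.fit_transform(proba_texts), proba_labels
)

# one probability per class, in the order of proba_classifier.classes_
proba_sample = proba_vectorizer.transform(["the engine of the car"])
proba_values = proba_classifier.predict_proba(proba_sample)[0]

for label, probability in zip(proba_classifier.classes_, proba_values):
    print(f"{label}: {probability:.3f}")
```

Probabilities that are close together suggest the classifier is unsure about a sample, which can help when judging individual predictions.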

As a next step, try some other article categories by changing the entries in `category_map`.