# Text classification

In this exercise we will work with a dataset called [20 newsgroups](http://qwone.com/~jason/20Newsgroups/), which contains posts from newsgroups on twenty different subjects. However, we will only work with two of these: one on computer graphics (comp.graphics) and one on medicine (sci.med).

Scikit-learn has a function that makes downloading this data easy:

In [0]:
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, random_state=0)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Let us look at an example from the dataset:

In [0]:
print("\n".join(twenty_train.data[0].split("\n")[:50]))

From: harti@mikro.ee.tu-berlin.de (Stefan Hartmann (Behse))
Subject: Genoa graphics board Drivers FTP site!
Article-I.D.: mailgzrz.1qpf1r$9ti
Organization: TUBerlin/ZRZ
Lines: 29
NNTP-Posting-Host: mikro.ee.tu-berlin.de

Hi,

well I have opened up a FTP site for getting the latest software drivers
for Genoa graphics cards.

Here is how to access it:

ftp 192.109.42.11
login:ftp
password:ftp
cd pub/genoa
ls -l
binary
prompt
hash

(now if you wanna have the latest drivers for the 7900 board)

cd 7000series
mget *

quit

This is the sequence to get the drivers.

If you have any further question, please email me.

Best regards, Stefan Hartmann
email to: harti@mikro.ee.tu-berlin.de



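To turn the 1178 documents in our dataset into vectors, we can use scikit-learn's **CountVectorizer()**. It extracts all words from the dataset and builds one row per document, where each column holds the number of times a particular word occurs in that document.

A total of 24614 unique words are found in the dataset, so each document becomes a vector with 24614 entries: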

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(1178, 24614)
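
To make the idea of a document-term matrix concrete, here is a minimal pure-Python sketch of the same counting scheme. It is not scikit-learn's implementation: the toy corpus and the helper names `tokenize` and `count_vector` are made up for illustration, and the tokenizer only approximates the default behaviour of `CountVectorizer` (lowercasing and keeping tokens of two or more word characters).

```python
import re
from collections import Counter

# Toy corpus (hypothetical, not taken from 20 newsgroups).
docs = [
    "The graphics card needs new drivers",
    "The doctor studies medicine",
    "graphics graphics drivers",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Every unique token gets one column index, assigned in sorted order.
unique_tokens = sorted({token for d in docs for token in tokenize(d)})
vocabulary = {word: idx for idx, word in enumerate(unique_tokens)}

def count_vector(text):
    # One row of the matrix: how often each vocabulary word occurs in the text.
    counts = Counter(tokenize(text))
    row = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            row[vocabulary[word]] = n
    return row

matrix = [count_vector(d) for d in docs]
print(len(matrix), len(matrix[0]))  # number of rows (documents), number of columns (words)
print(matrix[2])                    # counts for the third document
```

In this sketch the third document has a count of 2 in the `graphics` column and 1 in the `drivers` column, and zeros everywhere else. Real document-term matrices are mostly zeros in the same way, which is why scikit-learn returns a sparse matrix.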

Let's have a look at some of the words in the vocabulary. Each word is mapped to the index of its column in the matrix.

In [0]:
count_vect.vocabulary_

{'from': 10161,
 'harti': 11081,
 'mikro': 14891,
 'ee': 8648,
 'tu': 22675,
 'berlin': 4477,
 'de': 7414,
 'stefan': 21073,
 'hartmann': 11083,
 'behse': 4412,
 'subject': 21310,
 'genoa': 10403,
 'graphics': 10721,
 'board': 4745,
 'drivers': 8378,
 'ftp': 10187,
 'site': 20415,
 'article': 3829,
 'mailgzrz': 14223,
 '1qpf1r': 823,
 '9ti': 2510,
 'organization': 16414,
 'tuberlin': 22680,
 'zrz': 24601,
 'lines': 13813,
 '29': 1182,
 'nntp': 15837,
 'posting': 17635,
 'host': 11533,
 'hi': 11304,
 'well': 23945,
 'have': 11107,
 'opened': 16324,
 'up': 23131,
 'for': 9969,
 'getting': 10457,
 'the': 22081,
 'latest': 13500,
 'software': 20626,
 'cards': 5428,
 'here': 11266,
 'is': 12576,
 'how': 11552,
 'to': 22308,
 'access': 2788,
 'it': 12624,
 '192': 634,
 '109': 253,
 '42': 1504,
 '11': 262,
 'login': 13950,
 'password': 16861,
 'cd': 5573,
 'pub': 18142,
 'ls': 14039,
 'binary': 4581,
 'prompt': 18009,
 'hash': 11090,
 'now': 15969,
 'if': 11815,
 'you': 24506,
 'wanna': 23795

So we have a dataset with 1178 rows (documents) and 24614 columns (words). We can train a perceptron on this:

In [0]:
from sklearn.linear_model import Perceptron

classifier = Perceptron()
classifier.fit(X_train_counts, twenty_train.target) 

Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
           fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=None,
           penalty=None, random_state=0, shuffle=True, tol=0.001,
           validation_fraction=0.1, verbose=0, warm_start=False)
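
For intuition about what fitting a perceptron involves, the classic perceptron learning rule can be sketched in a few lines of plain Python. This is a simplified illustration and not scikit-learn's `Perceptron` code, which among other things shuffles the data and applies a stopping tolerance. The function names, the four-column toy data, and the label encoding below are assumptions made for the example.

```python
def predict(weights, bias, x):
    # Linear score followed by a hard threshold.
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def train_perceptron(X, y, epochs=10, eta=1.0):
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, y):
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            if error != 0:
                # Move the decision boundary towards the misclassified example.
                weights = [w + eta * error * xi for w, xi in zip(weights, x)]
                bias += eta * error
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights, bias

# Hypothetical count vectors over the columns [graphics, drivers, medicine, doctor].
X = [[2, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 2, 0]]
y = [0, 0, 1, 1]  # 0 = comp.graphics, 1 = sci.med

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(weights, bias, predictions)
```

On this toy data the weights for the graphics-related columns end up negative and those for the medicine-related columns positive, so the sign of the weighted sum of word counts decides the class.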

To test the classifier we can write a couple of documents about computer graphics and medicine and see how they are classified.

In [0]:
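doc = ["something about GPU graphics", "something about medicine human body science"]
# Reuse the fitted vectorizer so the columns match the training data
doc_counts = count_vect.transform(doc)

predicted = classifier.predict(doc_counts)
# Map the predicted class indices back to newsgroup names
[twenty_train.target_names[i] for i in predicted]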

['comp.graphics', 'sci.med']
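
One thing to keep in mind when writing test documents: `transform` only counts words that were in the vocabulary at fit time, and any other word is ignored. The sketch below illustrates the consequence with a made-up four-word vocabulary and made-up weights standing in for `count_vect.vocabulary_` and a trained linear classifier. None of these values come from the model above.

```python
# Hypothetical stand-ins for a fitted vocabulary and a trained linear model.
vocabulary = {"graphics": 0, "drivers": 1, "medicine": 2, "doctor": 3}
weights = [-2.0, -1.0, 1.0, 1.0]
bias = 0.0
labels = ["comp.graphics", "sci.med"]

def count_vector(text):
    row = [0] * len(vocabulary)
    for token in text.lower().split():
        if token in vocabulary:  # tokens outside the vocabulary are dropped
            row[vocabulary[token]] += 1
    return row

def classify(text):
    x = count_vector(text)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return labels[1 if score > 0 else 0]

print(classify("doctor prescribes medicine"))  # uses two known words
print(count_vector("rash fever nausea"))       # no known words: an all-zero vector
print(classify("rash fever nausea"))           # decided by the bias alone
```

In this toy setup the second document is clearly medical, yet it contains no known words, so the classifier falls back on its bias term. For the real model, `"some_word" in count_vect.vocabulary_` tells you whether a given word was seen during training.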

Try to change the documents above.

- Can you come up with documents that fool the classifier?