## Text Classification

To demonstrate a number of applications of text mining, we will use a subset of a corpus of Internet newsgroup articles provided by Scikit-learn.
(**Note: this requires a net connection to download the articles**.)

In [None]:
from sklearn.datasets import fetch_20newsgroups
# We will only download documents for 2 of the 20 newsgroups: Space & Autos
newsgroups = fetch_20newsgroups(subset="train", categories=["sci.space","rec.autos"])

Extract the raw documents and the corresponding two class labels for those documents:

In [None]:
documents = newsgroups.data
target = newsgroups.target
target_names = newsgroups.target_names
print("Dataset has %d documents. Targets are %s" % (len(documents), set(target_names)) )

Apply standard text pre-processing steps to create a document-term matrix:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
# remove English stop words and ignore terms that appear in fewer than 10 documents
vectorizer = TfidfVectorizer(stop_words="english", min_df=10)
X = vectorizer.fit_transform(documents)
print(X.shape)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
terms = vectorizer.get_feature_names_out()
print("Vocabulary has %d distinct terms" % len(terms))

In [None]:
# Display a set of sample terms
print(terms[200:220])

### Term Frequency Analysis

Once we have constructed a document-term matrix, a simple analysis procedure is to identify the most frequent terms (or the highest weighted terms, after TF-IDF is applied).

In [None]:
# sum each column (term) over all documents, giving a 1 x n_terms matrix of total weights
freqs = X.sum(axis=0)

In [None]:
# sort the indexes of the array by value, and then reverse it
sorted_term_indexes = freqs.argsort()
sorted_term_indexes = sorted_term_indexes[0, ::-1]

In [None]:
# display the top 10 terms
for i in range(10):
    term_index = sorted_term_indexes[0,i]
    print("%s = %.2f" % ( terms[term_index], freqs[0,term_index] ) )

### Classifying Documents

If we consider the annotated class labels (targets) for this corpus subset, a more interesting application is to predict the label of a newsgroup post, choosing between the two classes in the data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
# build a model on the document-term matrix created from the original set of documents
model.fit(X, target)
print(model)
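
To illustrate how a nearest-neighbour classifier assigns labels to text, here is a minimal sketch on a hypothetical toy corpus. The documents, labels, and the use of `make_pipeline` to chain the vectorizer and classifier are assumptions made for illustration; the lab itself fits the two steps separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical labelled toy documents, used only for illustration
toy_train_docs = [
    "rocket launch reached orbit",
    "satellite in orbit around the moon",
    "the rocket carried a satellite",
    "car engine needs new oil",
    "the car has four tires",
    "brakes and tires on my car",
]
toy_train_labels = ["space", "space", "space", "autos", "autos", "autos"]

# Chain TF-IDF vectorization and a 3-nearest-neighbour classifier
toy_model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
toy_model.fit(toy_train_docs, toy_train_labels)

# Each new document takes the majority label of its 3 closest training documents
toy_predictions = toy_model.predict(["rocket orbit", "car tires"])
print(list(toy_predictions))
```

Each query shares terms only with training documents from one class, so its three nearest neighbours all carry that class label.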

We can also use the test data provided by Scikit-learn from the same corpus.
(**Note: this requires a net connection to download the articles**.)

In [None]:
test_newsgroups = fetch_20newsgroups(subset="test", categories=["sci.space","rec.autos"])

In [None]:
test_documents = test_newsgroups.data
test_target = test_newsgroups.target
test_target_names = test_newsgroups.target_names
print("Test dataset has %d documents. Targets are %s" % (len(test_documents), set(test_target_names)) )

We also convert the test set to a document-term matrix. Note that we call *transform()*, not *fit_transform()*. This ensures that we use the same terms as the original training set.

In [None]:
test_X = vectorizer.transform(test_documents)
print(test_X.shape)
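
The difference between the two calls can be seen in a minimal sketch on hypothetical toy documents (`toy_train_docs` and `toy_new_docs` are illustrative names, not part of the lab): `fit_transform()` learns the vocabulary, while `transform()` reuses it and ignores unseen terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, used only for illustration
toy_train_docs = ["the rocket reached orbit", "the car engine stalled"]
toy_new_docs = ["the shuttle reached orbit", "a new car engine"]

toy_vectorizer = TfidfVectorizer()
toy_train_X = toy_vectorizer.fit_transform(toy_train_docs)  # learns the vocabulary
toy_new_X = toy_vectorizer.transform(toy_new_docs)          # reuses that vocabulary

# Both matrices have one column per term seen during fitting
print(toy_train_X.shape, toy_new_X.shape)

# "shuttle" never appeared in the training documents, so it has no column
print("shuttle" in toy_vectorizer.vocabulary_)
```

Because the column layout is identical, a model fitted on the training matrix can be applied directly to the transformed test matrix.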

Now we can make predictions and evaluate the accuracy of those predictions, as we saw in previous labs:

In [None]:
from sklearn.metrics import accuracy_score
predicted = model.predict(test_X)
acc = accuracy_score(test_target, predicted)
print("Classification accuracy = %.2f" % acc)
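
Accuracy is simply the fraction of test documents whose predicted label matches the true label. The sketch below checks this on hypothetical label arrays (these are not results from the newsgroup data) and also prints a confusion matrix, which is not used in this lab but shows where the errors fall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, used only for illustration
toy_true = np.array([0, 0, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 1, 1, 0, 0])

# Accuracy as reported by scikit-learn
toy_acc = accuracy_score(toy_true, toy_pred)

# The same value computed by hand: proportion of matching labels
manual_acc = np.mean(toy_true == toy_pred)

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)

print("Accuracy = %.2f (manual = %.2f)" % (toy_acc, manual_acc))
print(toy_cm)
```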