# NLP Core 2 Exercise 1: Hot News

In this exercise we will learn how to perform document classification in order to predict the category of news articles from the Reuters Corpus using a **bag-of-words** model and **one-hot encoding**. We will then see how we can use **TF-IDF** to improve our features for classification. Finally, we will perform topic modeling with **LDA** to see whether we can predict the categories of news articles without any labelled data.

## The Reuters Corpus

The Reuters Corpus is a collection of news documents with category tags, commonly used to test document classification. It is split into two sets: the *training* documents used to train a classification algorithm, and the *test* documents used to test the classifier's performance.

The Reuters Corpus is accessible through NLTK; for more information see the [NLTK Corpus HOWTO](http://www.nltk.org/howto/corpus.html#categorized-corpora).
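
As a minimal sketch of how the split can be read off the file IDs: the code further down relies on each ID starting with `training` or `test`. The IDs below are hypothetical stand-ins in the same `split/number` shape, and `split_counts` is an illustrative helper, not part of NLTK.

```python
from collections import Counter

def split_counts(file_ids):
    """Count file IDs per split, taking the split name from the path prefix."""
    return Counter(file_id.split("/", 1)[0] for file_id in file_ids)

# Hypothetical IDs shaped like the corpus file IDs ("training/..." or "test/...").
toy_ids = ["training/1", "training/2", "training/3", "test/4"]

counts = split_counts(toy_ids)
total = sum(counts.values())

# Percentage of documents in each split.
percentages = {split: 100 * n / total for split, n in counts.items()}
print(percentages)
```

With the real corpus, `reuters.fileids()` would take the place of `toy_ids`.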

**Questions**:
  1. How many documents are in the Reuters Corpus? What percentage are training and what percentage are testing documents?
  2. How many words are in the training documents? In the testing documents?
  3. What are the five most common categories in the training documents?

In [1]:
from nltk.corpus import reuters

In [2]:
import nltk
nltk.download('reuters')

[nltk_data] Downloading package reuters to /Users/Yohan/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

In [3]:
import pandas as pd
import numpy as np

In [4]:
reuters_files = reuters.fileids()

In [5]:
reuters_len = len(reuters_files)
print("There are {} documents in the Reuters Corpus".format(reuters_len))

There are 10788 documents in the Reuters Corpus


In [6]:
# File IDs start with "training" or "test", so count each prefix directly
train_per = sum(mydoc.startswith("training") for mydoc in reuters_files) / reuters_len * 100
test_per = sum(mydoc.startswith("test") for mydoc in reuters_files) / reuters_len * 100
print("Training % is: {}".format(train_per))
print("Test % is: {}".format(test_per))

Training % is: 72.01520207638117
Test % is: 27.98479792361884


2.

In [7]:
training_files = np.array(reuters_files)[np.array([mydoc[0:8] for mydoc in reuters.fileids()])=="training"]

In [8]:
training_words = len([word for mydoc in training_files for word in reuters.words(mydoc)])

In [9]:
test_files = np.array(reuters_files)[np.array([mydoc[0:4] for mydoc in reuters.fileids()])=="test"]

In [10]:
test_words = len([word for mydoc in test_files for word in reuters.words(mydoc)])

In [11]:
train_words = training_words  # reuse the count from In [8] instead of recomputing it

In [12]:
print("There are {} words in the training documents".format(train_words))

There are 1253696 words in the training documents


In [13]:
print("There are {} words in the testing documents".format(test_words))

There are 467205 words in the testing documents


3.

In [14]:
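# Note: this counts categories over test_files, so the output below is for the test split;
# for the training documents asked about in question 3, iterate over training_files instead.
pd.Series([category for mydoc in test_files for category in reuters.categories(mydoc)]).value_counts()[0:5]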

earn        1087
acq          719
crude        189
money-fx     179
grain        149
dtype: int64

## Bag of words representations

We will now see how a sentence can be transformed into a feature vector using a bag of words model. Consider the following sentences:

In [15]:
sentences = [
  'This is the first document.',
  'This document is the second document.',
  'And this is the third one.',
   'Is this the first document?',
]

We can represent each word as a **one-hot** encoded vector (with a single 1 in the column for that word) and add these vectors together to get the feature vector for a sentence. A word that occurs twice in a sentence therefore contributes a 2 in its column.
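
The idea can be sketched by hand with NumPy. The vocabulary and sentence here are a made-up toy example (not the sentences above), and `one_hot` and `bag_of_words` are illustrative helper names.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted alphabetically; one column per word.
vocabulary = ["cat", "dog", "saw", "the"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Vector of zeros with a single 1 in the column for `word`."""
    vec = np.zeros(len(vocabulary), dtype=int)
    vec[index[word]] = 1
    return vec

def bag_of_words(tokens):
    """Sum the one-hot vectors of all tokens in a sentence."""
    return sum((one_hot(t) for t in tokens), np.zeros(len(vocabulary), dtype=int))

tokens = "the dog saw the cat".split()
print(bag_of_words(tokens))  # "the" occurs twice, so its column holds a 2
```

Word order is lost in the sum, which is why this is called a bag of words.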

**Questions:**
  4. Use CountVectorizer from scikit-learn to get an array of one-hot encoded vectors for the given sentences. What do the rows and columns of the feature matrix X represent?
  5. What word does the second column of X represent? What about the third column? (If you are stuck, look at *vectorizer.get_feature_names_out()*; in scikit-learn versions before 1.0 this method is called *get_feature_names()*.)
 
 **Bonus**: Try using TfidfVectorizer instead of CountVectorizer, and try to explain why some values of X become smaller than others.

4.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

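Each row corresponds to one sentence and each column to one word in the vocabulary learned from all the sentences. Each entry is the number of times that word occurs in that sentence, which is why the second row contains a 2.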

5.

In [17]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vector_words = vectorizer.get_feature_names_out()
print("second word in column: {}".format(vector_words[1]))
print("third word in column: {}".format(vector_words[2]))

second word in column: document
third word in column: first
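
For the bonus question, the following is a rough pure-Python sketch of TF-IDF weighting on a lowercased, punctuation-free version of the four sentences. The smoothed IDF formula and the L2 normalisation are assumed to mirror TfidfVectorizer's defaults; the helper names `idf` and `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

sentences = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
docs = [s.split() for s in sentences]
n_docs = len(docs)

# Document frequency: in how many sentences does each word occur?
df = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed inverse document frequency (assumed to match scikit-learn's default).
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf_row(doc):
    """TF-IDF weights for one sentence, scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

row = tfidf_row(docs[2])
for word in sorted(row):
    print(f"{word:8s} {row[word]:.3f}")
```

Words such as "the" and "is" occur in every sentence, so their IDF is the smallest possible and their weights end up lower than those of words like "and" or "third", which occur in only one sentence.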


## Classifying Reuters

Now let's put these together in order to build a classifier for Reuters articles.

**Questions:**
  6. Convert the training and testing documents into matrices X and X2 of feature vectors using CountVectorizer(), and convert the category labels into matrices y and y2 of binary features for classification using MultiLabelBinarizer() from scikit-learn. (Hint: use fit_transform() first on the training set, and then transform() on the testing set.)
  7. Add code to fit a multiclass SVM classifier on the training data. (Hint: use *OneVsRestClassifier(LinearSVC())* as the classifier object, and then call its fit() and predict() methods on the data.) Use sklearn.metrics.classification_report to evaluate its performance.
  
 **Bonus**: Try using TF-IDF (TfidfVectorizer) weighted features. Does the classifier's performance improve?

In [18]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

train_docs = [reuters.raw(train_id) for train_id in training_files]
test_docs = [reuters.raw(test_id) for test_id in test_files]

In [19]:
# (A) Question 6: convert the training documents into a matrix of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_train.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 3, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 2, 0, ..., 0, 0, 0]], dtype=int64)

In [20]:
X_test = vectorizer.transform(test_docs)
X_test.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 3, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [21]:
# convert the category labels into binary features for classification
mlb = MultiLabelBinarizer()
y = mlb.fit_transform([reuters.categories(train_id) for train_id in training_files])
y2 = mlb.transform([reuters.categories(test_id) for test_id in test_files])
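
To see what the binarizer produces, here is a small sketch on made-up label lists. The names `toy_labels`, `toy_mlb` and `toy_y` are illustrative and the labels are not taken from the corpus output.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists; a document may carry more than one category.
toy_labels = [["earn"], ["acq", "trade"], ["trade"]]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_labels)

print(toy_mlb.classes_)  # column order: the sorted set of labels seen in fit
print(toy_y)             # one row per document, one 0/1 column per label

# transform() reuses the columns learned in fit; inverse_transform() maps back.
unseen = toy_mlb.transform([["acq", "earn"]])
print(toy_mlb.inverse_transform(unseen))
```

Fitting on the training labels and only transforming the test labels keeps the column order identical for both sets.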

In [22]:
svm = OneVsRestClassifier(LinearSVC())
svm.fit(X_train, y)  # fit() returns the fitted classifier itself, so there is no need to store the result
predictions = svm.predict(X_test)



In [23]:
print(classification_report(y2, predictions))

              precision    recall  f1-score   support

           0       0.97      0.95      0.96       719
           1       1.00      0.39      0.56        23
           2       1.00      0.64      0.78        14
           3       0.78      0.70      0.74        30
           4       0.92      0.67      0.77        18
           5       0.00      0.00      0.00         1
           6       1.00      0.83      0.91        18
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         3
           9       0.93      0.93      0.93        28
          10       1.00      0.78      0.88        18
          11       0.00      0.00      0.00         1
          12       0.91      0.86      0.88        56
          13       1.00      0.50      0.67        20
          14       0.00      0.00      0.00         2
          15       0.70      0.50      0.58        28
          16       0.00      0.00      0.00         1
          17       0.84    
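
The numeric row labels in the report are column positions in `mlb.classes_`. The sketch below, on a tiny made-up dataset, shows one way to print category names instead and to try the TF-IDF bonus by wrapping the vectorizer and classifier in a pipeline. All `toy_` names and documents are hypothetical, and `zero_division=0` is used here on the assumption that the 0.00 rows come from labels that were never predicted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical miniature corpus standing in for the Reuters documents.
toy_train_docs = [
    "company profit rose in the quarter",
    "net profit and dividend increased",
    "firm agreed to acquire rival company",
    "merger and takeover bid accepted",
    "profit rose after the takeover of a rival",
]
toy_train_labels = [["earn"], ["earn"], ["acq"], ["acq"], ["acq", "earn"]]
toy_test_docs = ["quarterly profit and dividend", "takeover bid for the firm"]
toy_test_labels = [["earn"], ["acq"]]

toy_mlb = MultiLabelBinarizer()
y_train = toy_mlb.fit_transform(toy_train_labels)
y_test = toy_mlb.transform(toy_test_labels)

# TF-IDF features feeding one linear SVM per category.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(toy_train_docs, y_train)
toy_predictions = model.predict(toy_test_docs)

# target_names replaces the numeric row labels with the category names.
report = classification_report(
    y_test, toy_predictions, target_names=list(toy_mlb.classes_), zero_division=0
)
print(report)
```

On the real data, the same pattern would use `train_docs`, `test_docs`, `y` and `y2` from the cells above.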



## Topic Modeling with LDA

Now we will see if we can use topic modeling to discover the topics in the Reuters news articles without using the labels provided in the corpus.

**Questions:**

8. Encode the articles as a matrix of feature vectors using one-hot encoding. Exclude stopwords by using NLTK's list of English stopwords (see *nltk.corpus.stopwords*).
9. Create a model *lda* by using scikit-learn's LatentDirichletAllocation to model the topics in the documents. Set the argument *n_components* to equal the number of categories in Reuters, and use the matrix from question 8 as input to the model's *fit_transform()* function. What does the output of this function represent?

**Bonus:** Plot three histograms of the most prominent topic for documents with the categories: 'trade', 'acq', 'cocoa'. (Hint: use *np.argmax(topic_matrix, axis = 1)* to find the most prominent topic for each document.)

In [55]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/Yohan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [72]:
# 8.
vectorizer = CountVectorizer(stop_words=nltk.corpus.stopwords.words('english'))
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)
from scipy.sparse import vstack  # stack the sparse matrices directly; a dense copy of the whole corpus is very large
X = vstack([X_train, X_test])

In [73]:
from sklearn.decomposition import LatentDirichletAllocation

In [74]:
#9.
lda = LatentDirichletAllocation(n_components=len(reuters.categories()))
lda.fit_transform(X)

array([[3.30687831e-05, 3.30687831e-05, 3.30687831e-05, ...,
        3.30687831e-05, 3.30687831e-05, 2.77947887e-02],
       [1.97357876e-01, 7.55857899e-05, 7.55857899e-05, ...,
        7.55857899e-05, 7.55857899e-05, 7.55857899e-05],
       [1.68350168e-04, 1.68350168e-04, 3.99961325e-01, ...,
        1.68350168e-04, 1.68350168e-04, 1.68350168e-04],
       ...,
       [1.98412698e-04, 1.98412698e-04, 1.98412698e-04, ...,
        1.98412698e-04, 4.33735046e-01, 1.98412698e-04],
       [5.14403292e-05, 5.14403292e-05, 5.14403292e-05, ...,
        2.61525055e-02, 5.14403292e-05, 5.14403292e-05],
       [1.82149362e-04, 1.88235546e-01, 1.82149362e-04, ...,
        1.82149362e-04, 1.82149362e-04, 1.82149362e-04]])

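The output is the document-topic matrix: each row corresponds to one article and each column to one of the learned topics. An entry gives the weight of that topic in that article, so each row can be read as a distribution over topics. Because LDA is unsupervised, the topics are not guaranteed to match the Reuters categories. Ideally, though, articles with similar content receive similar topic distributions, so the topics act as a soft clustering of the articles.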
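
For the bonus question, the following sketch shows how the most prominent topic per document could be tallied for a given category. The topic matrix and category lists are small made-up stand-ins for the output of `lda.fit_transform(X)` and for `reuters.categories(...)`, and `topic_histogram` is an illustrative helper.

```python
import numpy as np

# Hypothetical document-topic matrix: 5 documents, 3 topics, rows sum to 1.
toy_topic_matrix = np.array([
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
    [0.15, 0.75, 0.10],
])
# Hypothetical category lists, one per document (a document may have several).
toy_categories = [["trade"], ["trade"], ["acq"], ["acq"], ["cocoa"]]

# Index of the highest-weighted topic for every document.
dominant_topic = np.argmax(toy_topic_matrix, axis=1)

def topic_histogram(category):
    """Count how often each topic is dominant among documents in `category`."""
    mask = np.array([category in cats for cats in toy_categories])
    return np.bincount(dominant_topic[mask], minlength=toy_topic_matrix.shape[1])

for category in ["trade", "acq", "cocoa"]:
    print(category, topic_histogram(category))
```

The resulting counts can be passed to a plotting library as bar heights. If a category is concentrated in one or two topics, that suggests LDA has recovered something close to it without seeing the labels.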