# Reuters-21578 News classification

#### Author: Qihao LIU
Date: 02/01/2021

Reuters-21578 is arguably the most commonly used collection for text classification. It contains structured information about newswire articles that can be assigned to several classes, making it a multi-label problem. 
It has a highly skewed distribution of documents over categories, where a large proportion of documents belong to few topics. The collection originally consisted of 21,578 documents but a subset and split is traditionally used. 
The most common split is Mod-Apte which only considers categories that have at least one document in the training set and the test set. The Mod-Apte split has 90 categories with a training set of 7769 documents and a test set of 3019 documents.This method of splitting can directly been used by library nltk.

## 1. Data Cleaning -  Data Preparation

### 1.1 Introduction

This part goes through a necessary step of any data science project - data cleaning. Data cleaning is a time consuming and unenjoyable task. Feeding dirty data into a model will give us results that are meaningless.

Processing:

* Getting the data - in this case, we'll be scraping data from the 22 sgm files containing the 21 578 Reuters articles
* Cleaning the data - we will walk through popular text pre-processing techniques
* Organizing the data - we will organize the cleaned data into a way that is easy to input into other algorithms

The output of this part - organized data in two standard text formats:

* Corpus - a collection of text;
* Document-Term Matrix - Word2Vec in matrix format

### 1.2 Problem Statement

The Reuters-21578 dataset contains 21 578 financial articles tagged with topics.
There are 135 different topics, but this exercise will focus on only 5 of them:
-	Money/Foreign Exchange (MONEY-FX)
-	Shipping (SHIP)
-	Interest Rates (INTEREST)
-	Mergers/Acquisitions (ACQ)
-	Earnings and Earnings Forecasts (EARN)

So this is a supervised classification, we do not need to use topic modelling techniques for modelling topics form data.

### 1.3 Getting The Data

There are two ways to get the reuter21578 data:

* Download the collection and parse the multiple SGML files in order to recreate the original dataset;
* Or, much easier way with the NLTK library which has the reuters corpus already available. 

Libraies:
Using BeautifulSoup libray to help us pick out certain sections from sgml files in order to parse all SGML files, removing all unwanted tags and a simple regex to strip the ending signature.

The following code shows how to deal with the original dataset sgml but actually using NLTK library with Mode-Apte will be efficient:

In [7]:
# Selected categories
selected_categories = ["to_money-fx","to_ship","to_interest","to_acq","to_earn"] 
# Category files
category_files = { 'to_': ('Topics', 'all-topics-strings.lc.txt')}

In [8]:
from pandas import DataFrame
# Create category dataframe

# Read all categories
category_data = []

# Newsline folder and format
data_folder = 'C:/Users/LQH/Desktop/CA CIB/reuters21578/'

# Building dataframe for visualising topics details like numbers of documents to this topic
for category_prefix in category_files.keys():
    with open(data_folder + category_files[category_prefix][1], 'r') as file:
        for category in file.readlines():
            category_data.append([category_prefix + category.strip().lower(), 
                                  category_files[category_prefix][0], 0])

# Create category dataframe
news_categories = DataFrame(data=category_data, columns=['Name', 'Type', 'Newslines'])
print(news_categories.head())

         Name    Type  Newslines
0      to_acq  Topics          0
1     to_alum  Topics          0
2  to_austdlr  Topics          0
3  to_austral  Topics          0
4   to_barley  Topics          0


In [9]:
import numpy as np

#update the numbres of documents with same topic (count nbs)
def update_frequencies(categories):
    for category in categories:
        idx = news_categories[news_categories.Name == category].index[0]
        f = news_categories.get_value(idx, 'Newslines')
        news_categories.set_value(idx, 'Newslines', f+1)

#building vector Y(labels) for each document to represent whether the documents relative to the topic in 
#list ["to_money-fx","to_ship","to_interest","to_acq","to_earn"]
def to_category_vector(categories, target_categories):
    vector = np.zeros(len(target_categories)).astype(np.float32)
    for i in range(len(target_categories)):
        if target_categories[i] in categories:
            vector[i] = 1.0
    return vector

In [10]:
from bs4 import BeautifulSoup
import re
import xml.sax.saxutils as saxutils
from glob import glob

# Parse SGML files
document_X = {}
document_Y = {}

# removing Special symbol like < for reading the specific part in the files
#'&amp;','&lt;','&gt;'
def strip_tags(text):
    return re.sub('<[^<]+?>', '', text).strip()
def unescape(text):
    return saxutils.unescape(text)

f_list = glob('C:/Users/LQH/Desktop/CA CIB/reuters21578/*.sgm')

# Iterate all files
for filename in f_list:
    print('Start parsing {0}...'.format(filename))
    file = open(filename, 'rb')
    content = BeautifulSoup(file.read().lower())
    file.close()
    for newsline in content('reuters'):
        document_categories = []           
        # News-line Id
        document_id = newsline['newid']
        # Extracting document body
        document_body = strip_tags(str(newsline('text')[0])).replace('reuter\n&#3;', '')
        document_body = unescape(document_body)
        # News-line categories
        topics = newsline.topics.contents
        for topic in topics:
            document_categories.append('to_' + strip_tags(str(topic)))                
        # Create new document    
        update_frequencies(document_categories)
        #Filter the documents in list of ["to_money-fx","to_ship","to_interest","to_acq","to_earn"] 
        if sum(to_category_vector(document_categories, selected_categories))>=1.0:
            document_Y[document_id] = to_category_vector(document_categories, selected_categories)
            document_X[document_id] = document_body

Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-000.sgm...


  import sys
  


Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-001.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-002.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-003.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-004.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-005.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-006.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-007.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-008.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-009.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-010.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-011.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-012.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-013.sgm...
Start parsing C:/Users/LQH/Desktop/CA CIB/reuters21578\reut2-014

In [11]:
print(document_body)
print(topic)

american exchange introduces institutional index
    new york, oct 19 - the american stock exchange said it has
introduced options with expirations of up to three years on the
institutional index.
    with the ticker symbol <xii>, the index is a guage of the
core equity holdings of the nation's largest institutions, the
exchange explained.
    the new listings represent the first long-term options to
be traded by the amex, it added.
    it said the long-term institutional index options began
trading monday with expirations of december 1988 <xiv> and
december 1989 <xix>.
   
    the amex said a third long-term option with an expiration
of december 1990 will begin trading following the december 1987
expiration.
    it said strike prices on the long-term options have been
set at 50 point intervals with initial strikes of 250, 300 and
350. to avoid conflicting strike price codes, the 350 stike
prices will carry the ticker symbols <xvv> for the option
expiring in december 1988 and <xvx> for

In [12]:
print(len(document_Y))
#print(document_Y)
print(len(document_X))

7824
7824


In [13]:
#print(document_X["5"])
#topic_list = ["to_money-fx","to_ship","to_interest","to_acq","to_earn"] 
news_categories.sort_values(by='Newslines', ascending=False, inplace=True)
news_categories.head(10)

Unnamed: 0,Name,Type,Newslines
35,to_earn,Topics,3987
0,to_acq,Topics,2448
73,to_money-fx,Topics,801
28,to_crude,Topics,634
45,to_grain,Topics,628
126,to_trade,Topics,552
55,to_interest,Topics,513
130,to_wheat,Topics,306
108,to_ship,Topics,305
19,to_corn,Topics,254


The following table shows the numbers of documents relative to the topic that we consider in this exercise, where the sample is not evenly distributed.

In [14]:
news_categories.loc[[35,0,73,55,108]]
#print(document_X.keys())

Unnamed: 0,Name,Type,Newslines
35,to_earn,Topics,3987
0,to_acq,Topics,2448
73,to_money-fx,Topics,801
55,to_interest,Topics,513
108,to_ship,Topics,305


### 1.4 Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

We're going to execute just the common cleaning steps here:
- Make text all lower case
- Remove punctuation
- Remove numerical values
- Remove common non-sensical text (/n)
- Tokenize text
- Remove stop words


In [37]:
from nltk import word_tokenize
from nltk.stem.snowball import PorterStemmer, SnowballStemmer
import re
from nltk.corpus import stopwords
import string 

cachedStopWords = stopwords.words("english")
stemmer = SnowballStemmer('english')
#Make text lowercase, remove text in square brackets,remove punctuation, remove \n and remove words containing numbers.
def tokenize(text):
    min_length = 3
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('\w*\d\w*', '', text)
    words = map(lambda word: word.lower(), word_tokenize(text))
    words = [word for word in words if word not in cachedStopWords]
   # words = ' '.join([word for word in text.split() if word not in cachedStopWords])
    tokens =(list(map(lambda token: PorterStemmer().stem(token),words)))
   # tokens = ' '.join(stemmer.stem(i) for i in words)
    p = re.compile('[\'a-zA-Z]+') #Matches one or more alphabetical characters.
   # filtered_tokens = ' '.join(token for token in tokens if (p.match(token) and len(token)>=min_length))
    filtered_tokens = list(filter(lambda token: p.match(token) and len(token)>=min_length,tokens))
    return filtered_tokens

In [38]:
# Tokenized document collection
newsline_documents = []
word_nb = 0
# Tokenize
for key in document_X.keys():
    newsline_documents.append(tokenize(document_X[key]))
    word_nb += len(tokenize(document_X[key]))
number_of_documents = len(document_X)
print(number_of_documents)
print(word_nb)

7824
0


In [41]:
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import PorterStemmer, SnowballStemmer
def tokenize1(text):
    """function to clean text by removing punctuations, and numbers
    ---------------------------------------------
    
    :param text: a string
    
    :returns: string with punctuations and numbers removed
    """
    min_length = 2
    stopwords_set= set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    text = text.replace('\n',' ').lower().strip()
    text = re.sub("[^a-zA-Z]+", " ", text).split()
    text = ' '.join(stemmer.stem(i) for i in text)
    stemmed = ' '.join([word for word in text.split() if word not in stopwords_set and len(word)>=min_length])
    return(stemmed)

In [42]:
# Tokenized document collection
newsline_documents = []
word_nb = 0
# Tokenize
for key in document_X.keys():
    newsline_documents.append(tokenize1(document_X[key]))
    word_nb += len(tokenize1(document_X[key]))
number_of_documents = len(document_X)
print(number_of_documents)
print(word_nb)

7824
3401172


In [43]:
print(newsline_documents[10])
#print(document_X["10"])
# 7824
# 3415142
# 7824
# 489300
#document_X[key]

owen minor inc obod rais qtli dividend richmond va feb qtli div eight cts vs cts prior pay march record march reuter


### TF-IDF transformation in sklearn

In [44]:
import time
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk.tokenize import RegexpTokenizer


body_list = newsline_documents
start = time.clock()
vectorizer = TfidfVectorizer()
vectorizer.fit(newsline_documents)
print ("sklearn TFIDF processing time: {0:.5f} s".format(time.clock() - start))

  # Remove the CWD from sys.path while we load stuff.


sklearn TFIDF processing time: 1.52021 s


  del sys.path[0]


In [46]:
matrix_tfidf = vectorizer.transform(newsline_documents)
matrix_tfidf.shape

(7824, 18249)

In [47]:
num_categories = len(selected_categories)
topic_class = np.zeros(shape=(number_of_documents, num_categories))
for idx, key in enumerate(document_Y.keys()):
    topic_class[idx, :] = document_Y[key]
topic_class

array([[0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [49]:
from sklearn.model_selection import train_test_split, cross_val_score
#split train and test dataset with ratio of 0.3
X_train, X_test, Y_train, Y_test = train_test_split(matrix_tfidf, topic_class, test_size=0.3)
print(X_train.shape)
print(Y_train.shape)

(5476, 18249)
(5476, 5)


In [50]:
print(X_train)

  (0, 16502)	0.0630335651497171
  (0, 16120)	0.14835008665637345
  (0, 15592)	0.07553828422534323
  (0, 15462)	0.10061046790307208
  (0, 14594)	0.06026518968723119
  (0, 14591)	0.08036762016998687
  (0, 14524)	0.061006377090775644
  (0, 14494)	0.2043151895341332
  (0, 14070)	0.08589557766381234
  (0, 13743)	0.32396102634753565
  (0, 13637)	0.019531329156803508
  (0, 13610)	0.06048435177120554
  (0, 13157)	0.09917668555384727
  (0, 13141)	0.08122198744509536
  (0, 12930)	0.12852100619352408
  (0, 12752)	0.06136301929722827
  (0, 12667)	0.15946153342884392
  (0, 12179)	0.10455567067064062
  (0, 12053)	0.040828144390355106
  (0, 11803)	0.09756131485583543
  (0, 11503)	0.08726465611651313
  (0, 11482)	0.17567377673879003
  (0, 9808)	0.055534656072440285
  (0, 9673)	0.06654663157461375
  (0, 9201)	0.09756131485583543
  :	:
  (5475, 14591)	0.13925626617681716
  (5475, 13639)	0.16596120473504186
  (5475, 13637)	0.03384273375389291
  (5475, 13610)	0.10480371288795148
  (5475, 13595)	0.16111672

In [56]:
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

def linearsvc_classifier(train_docs, train_labels):
    classifier = OneVsRestClassifier(LinearSVC(random_state=42))
    classifier.fit(train_docs, train_labels)
    return classifier

In [57]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

def naive_classifier(train_docs, train_labels):
    classifier = OneVsRestClassifier(MultinomialNB(alpha=0.01))
    classifier.fit(train_docs, train_labels)
    return classifier

In [66]:
import logging
from sklearn.model_selection import train_test_split, cross_val_score

# estimator:估计方法对象(分类器)
# X：数据特征(Features)
# y：数据标签(Labels)
# soring：调用方法(包括accuracy和mean_squared_error等等)
# cv：几折交叉验证
# n_jobs：同时工作的cpu个数（-1代表全部）

def cross_validation(clf):
    num_folds = 10
    # logger.info("Cross validation")
    print("Cross validation")
    scores = cross_val_score(clf,
                             X_train, Y_train,
                             cv=num_folds,
                             n_jobs=-1,
                             verbose=0)
    print(f"Real risk by {num_folds}-fold CV : {scores.mean():.2} (+/- {scores.std():.2})")

In [59]:
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='micro')
    recall = recall_score(test_labels, predictions, average='micro')
    f1 = f1_score(test_labels, predictions, average='micro')
    print("Micro-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

    precision = precision_score(test_labels, predictions,average='macro',zero_division=1)
    recall = recall_score(test_labels, predictions, average='macro',zero_division=1)
    f1 = f1_score(test_labels, predictions, average='macro',zero_division=1)

    print("Macro-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

In [67]:
cross_validation(OneVsRestClassifier(LinearSVC(random_state=42)))

Cross validation
Real risk by 10-fold CV : 0.94 (+/- 0.0067)


In [61]:
#documents = reuters.fileids()
model_svm = linearsvc_classifier(X_train, Y_train)
predictions_svm = model_svm.predict(X_test)
evaluate(Y_test, predictions_svm)

Micro-average quality numbers
Precision: 0.9753, Recall: 0.9680, F1-measure: 0.9716
Macro-average quality numbers
Precision: 0.9510, Recall: 0.9273, F1-measure: 0.9386


In [62]:
model_svm = linearsvc_classifier(X_train, Y_train)
predictions_svm = model_svm.predict(X_train)
evaluate(Y_train, predictions_svm)

Micro-average quality numbers
Precision: 0.9935, Recall: 0.9954, F1-measure: 0.9944
Macro-average quality numbers
Precision: 0.9863, Recall: 0.9887, F1-measure: 0.9875


#### Navie bayes

In [68]:
cross_validation(OneVsRestClassifier(MultinomialNB(alpha=0.01)))

Cross validation
Real risk by 10-fold CV : 0.84 (+/- 0.013)


In [69]:
model_naive = naive_classifier(X_train, Y_train)
predictions_navie = model_naive.predict(X_test)
evaluate(Y_test, predictions_navie)

Micro-average quality numbers
Precision: 0.8955, Recall: 0.9298, F1-measure: 0.9123
Macro-average quality numbers
Precision: 0.8391, Recall: 0.9420, F1-measure: 0.8802


In [70]:
model_naive = naive_classifier(X_train, Y_train)
predictions_navie = model_naive.predict(X_train)
evaluate(Y_train, predictions_navie)

Micro-average quality numbers
Precision: 0.9399, Recall: 0.9834, F1-measure: 0.9611
Macro-average quality numbers
Precision: 0.8913, Recall: 0.9905, F1-measure: 0.9317


### 1.5 Organizing The Data
Organized data in two standard text formats:

* Corpus - a collection of text
* Document-Term Matrix - using Word2Vec

Using One-Vs-Rest categorization method, using Word2Vec  (implemented by Gensim), which is much more effective than a standard bag-of-words or Tf-Idf approach.

In [12]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count

# Word2Vec number of features
num_features = 100

# Create new Gensim Word2Vec model
w2v_model = Word2Vec(newsline_documents, size=num_features, min_count=1, window=10, workers=cpu_count())
w2v_model.init_sims(replace=True)
w2v_model.save(data_folder + 'reuters.word2vec')


In [13]:
# Limit each newsline to a fixed number of words
document_max_num_words = 100

num_categories = len(selected_categories)
X = np.zeros(shape=(number_of_documents, document_max_num_words, num_features)).astype(np.float32)
Y = np.zeros(shape=(number_of_documents, num_categories)).astype(np.float32)

empty_word = np.zeros(num_features).astype(np.float32)

for idx, document in enumerate(newsline_documents):
    for jdx, word in enumerate(document):
        if jdx == document_max_num_words:
            break            
        else:
            if word in w2v_model:
                X[idx, jdx, :] = w2v_model[word]
            else:
                X[idx, jdx, :] = empty_word

for idx, key in enumerate(document_Y.keys()):
    Y[idx, :] = document_Y[key]
    

  from ipykernel import kernelapp as app
  app.launch_new_instance()


## 2. Classifying Reuters 


In order to classify the collection, we have to apply a number of steps which are standard for the majority of classification problems:

* Define our training and testing subsets to make sure that we do not evaluate with documents that the system has learnt from. In our case, split train and test dataset with ratio of 0.3.
* Represent all the documents in each subset.
* Train a classifier on the represented training data.
* Predict the labels for each one of the represented testing documents.
* Compare the real and predicted document labels to evaluate our solution.

Model:
* Using model linear SVM (LinearSVC), this model has traditionally produced good quality with text classification problems;
* and Gaussian Naive Bayes (GaussianNB).
The problem we are solving has a multi-label nature, we have to train our model (which is binary by nature) N times, once per category, where the negative cases will be the documents in all the other categories. This allows our model to make a binary decision per category and produce multi-label results. This can be done with the OneVsRestClassifier object in Scikit-learn. This step might change depending on the estimator like kNN which is multi-label by nature.

In [14]:
from sklearn.model_selection import train_test_split
nsamples, nx, ny = X.shape
X_2dim = X.reshape((nsamples,nx*ny))

#split train and test dataset with ratio of 0.3
X_train, X_test, Y_train, Y_test = train_test_split(X_2dim, Y, test_size=0.3)
print(X_train.shape)
print(Y_train.shape)

(5476, 10000)
(5476, 5)


In [27]:
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

def linearsvc_classifier(train_docs, train_labels):
    classifier = OneVsRestClassifier(LinearSVC(penalty='l2', loss='squared_hinge',random_state=42))
    classifier.fit(train_docs, train_labels)
    return classifier

In [25]:
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='micro')
    recall = recall_score(test_labels, predictions, average='micro')
    f1 = f1_score(test_labels, predictions, average='micro')
    print("Micro-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

    precision = precision_score(test_labels, predictions,average='macro',zero_division=1)
    recall = recall_score(test_labels, predictions, average='macro',zero_division=1)
    f1 = f1_score(test_labels, predictions, average='macro',zero_division=1)

    print("Macro-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

In [28]:
#documents = reuters.fileids()
model = linearsvc_classifier(X_train, Y_train)
predictions = model.predict(X_test)
evaluate(Y_test, predictions)



Micro-average quality numbers
Precision: 0.9270, Recall: 0.9144, F1-measure: 0.9207
Macro-average quality numbers
Precision: 0.8622, Recall: 0.8362, F1-measure: 0.8487


In [23]:
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier

def naive_classifier(train_docs, train_labels):
    classifier = OneVsRestClassifier(GaussianNB())
    classifier.fit(train_docs, train_labels)
    return classifier

In [26]:
print(Y_train.shape)
print(X_train.shape)

model = naive_classifier(X_train, Y_train)
predictions = model.predict(X_test)
evaluate(Y_test, predictions)

(5476, 5)
(5476, 10000)
Micro-average quality numbers
Precision: 0.4318, Recall: 0.7328, F1-measure: 0.5434
Macro-average quality numbers
Precision: 0.3475, Recall: 0.6624, F1-measure: 0.4259


In [None]:
## RandomForestRegressor model
# from sklearn.ensemble import RandomForestRegressor

# RF = RandomForestRegressor(n_estimators=10, criterion="mae", max_depth=3)
# RF.fit(X_train, Y_train)
# predictions = model.predict(X_test)
# evaluate(Y_test, predictions)

## 3 Predictive analysis

Our samples are not evenly distributed:
	Name	     Type	Newslines
35	to_earn	     Topics	3987
0	to_acq	     Topics	2448
73	to_money-fx	  Topics	801
55	to_interest 	Topics	513
108	to_ship	     Topics	305

So Micro-average is better performed because it chonsider this problem in the formula:
* Linear SVM (LinearSVC):
Micro-average quality numbers
Precision: 0.9270, Recall: 0.9144, F1-measure: 0.9207
Macro-average quality numbers
Precision: 0.8622, Recall: 0.8362, F1-measure: 0.8487

* Gaussian Naive Bayes (GaussianNB):
Micro-average quality numbers
Precision: 0.4318, Recall: 0.7328, F1-measure: 0.5434
Macro-average quality numbers
Precision: 0.3475, Recall: 0.6624, F1-measure: 0.4259

We could see Linear SVM preform much better than Naive Bayes in this case, 

## 4 Topic Modeling

In this case, our problem is supervised. Generally, NLP problems are unsupervised which need to do the topic modeling. The ultimate goal of topic modeling is to find various topics that are present in the corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

Method: Latent Dirichlet Allocation (LDA), which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, we need to provide 
* a document-term matrix;
* and the number of topics we would like the algorithm to pick up.

Once the topic modeling technique is applied, our job as a human is to interpret the results and see if the mix of words in each topic make sense. If they don't make sense, we can try changing up the number of topics, the terms in the document-term matrix, model parameters, or even try a different model.

##  Getting reuters dataset from NLTK library

The most common split is Mod-Apte which only considers categories that have at least one document in the training set and the test set. The Mod-Apte split has 90 categories with a training set of 7769 documents and a test set of 3019 documents.This method of splitting can directly been used by library nltk.

Useful blog:
https://martin-thoma.com/nlp-reuters/
https://ana.cachopo.org/datasets-for-single-label-text-categorization
https://towardsdatascience.com/analysis-and-visualization-of-unstructured-text-data-2de07d9adc84

#### Thanks!