# Reuters-21578 News classification

#### Author: Qihao LIU
Date: 02/01/2021

Reuters-21578 is arguably the most commonly used collection for text classification. It contains structured information about newswire articles, each of which can be assigned to several classes, making it a multi-label problem.
The distribution of documents over categories is highly skewed: a large proportion of documents belong to a few topics. The collection originally consisted of 21,578 documents, but a subset and split is traditionally used.
The most common split is ModApte, which only considers categories that have at least one document in both the training set and the test set. The ModApte split has 90 categories, with a training set of 7769 documents and a test set of 3019 documents. This split can be used directly through the NLTK library.

### Contents:
1. Problem Statement
2. Data Cleaning - Data Preparation
    * 2.1 Introduction
    * 2.2 Getting The Data
    * 2.3 Cleaning The Data
    * 2.4 Organizing The Data
3. Classifying Reuters
4. Predictive analysis
5. Topic Modeling

## 1. Problem Statement

The Reuters-21578 dataset contains 21 578 financial articles tagged with topics.
There are 135 different topics, but this exercise will focus on only 5 of them:
-	Money/Foreign Exchange (MONEY-FX)
-	Shipping (SHIP)
-	Interest Rates (INTEREST)
-	Mergers/Acquisitions (ACQ)
-	Earnings and Earnings Forecasts (EARN)

Since the topics are given, this is a supervised classification problem; we do not need topic modelling techniques to infer topics from the data.

## 2. Data Cleaning -  Data Preparation

### 2.1 Introduction

This part covers a necessary step of any data science project: data cleaning. It is time-consuming and tedious, but feeding dirty data into a model gives meaningless results.

Processing:

* Getting the data - in this case, we'll be scraping data from the 22 sgm files containing the 21 578 Reuters articles;
* Cleaning the data - we will walk through popular text pre-processing techniques;
* Organizing the data - we will organize the cleaned data into a way that is easy to input into other algorithms.

The output of this part is the organized data in two standard text formats:

* Corpus - a collection of texts;
* Document-Term Matrix - built with a TF-IDF transformation (word importance) or with Word2Vec (each document has a fixed number of words and each word a fixed number of features).

### 2.2 Getting The Data

There are two ways to get the Reuters-21578 data:

* Download the collection and parse the SGML files in order to recreate the original dataset;
* Or, much more easily, use the NLTK library, which already includes the Reuters corpus.

Libraries:
We use the BeautifulSoup library to pick out the relevant sections of each SGML file and strip unwanted tags, plus a simple replacement to remove the ending signature.
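To make the record structure concrete, here is a minimal, dependency-free sketch that pulls the id, the topics and the body out of one record with regular expressions. It is an illustration only, not the parser used in this notebook: the sample record is made up, and the helper name `parse_records` is hypothetical.

```python
import html
import re

# Made-up sample record in the layout of the .sgm files (shortened for illustration).
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="1" NEWID="42">
<TOPICS><D>money-fx</D><D>interest</D></TOPICS>
<TEXT>
<TITLE>SAMPLE BANK CUTS RATES</TITLE>
<BODY>A sample bank cut its lending rate &lt;by a point&gt;.
 Reuter
&#3;</BODY></TEXT>
</REUTERS>"""

RECORD_RE = re.compile(r"<REUTERS([^>]*)>(.*?)</REUTERS>", re.DOTALL)
NEWID_RE = re.compile(r'NEWID="(\d+)"')
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_records(sgml_text):
    """Yield (newid, topics, body) for each <REUTERS> record in the text."""
    for attrs, inner in RECORD_RE.findall(sgml_text):
        newid = NEWID_RE.search(attrs).group(1)
        topics_match = TOPICS_RE.search(inner)
        topics = TOPIC_RE.findall(topics_match.group(1)) if topics_match else []
        body_match = BODY_RE.search(inner)
        body = body_match.group(1) if body_match else ""
        # Drop the end-of-text marker, then decode entities such as &lt; and &amp;.
        body = html.unescape(body.replace("&#3;", "")).strip()
        yield newid, topics, body


records = list(parse_records(SAMPLE))
print(records)
```

Regular expressions are enough for this fixed layout, but a real parser such as BeautifulSoup is more tolerant of irregular records.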

The following code shows how to parse the original SGML dataset (although using the NLTK library with the ModApte split would be simpler):

In [2]:
# Selected categories
selected_categories = ["to_money-fx","to_ship","to_interest","to_acq","to_earn"] 
# Category files; the 'to_' prefix marks categories that come from the topics list
category_files = { 'to_': ('Topics', 'all-topics-strings.lc.txt')}

In [3]:
from pandas import DataFrame
# Create category dataframe

# Read all categories
category_data = []

# Newsline folder and format
data_folder = 'reuter/'

# Building dataframe for visualising topics details like numbers of documents to this topic
for category_prefix in category_files.keys():
    with open(data_folder + category_files[category_prefix][1], 'r') as file:
        for category in file.readlines():
            category_data.append([category_prefix + category.strip().lower(), 
                                  category_files[category_prefix][0], 0])

# Create category dataframe
news_categories = DataFrame(data=category_data, columns=['Name', 'Type', 'Newslines'])
print(news_categories.head())

         Name    Type  Newslines
0      to_acq  Topics          0
1     to_alum  Topics          0
2  to_austdlr  Topics          0
3  to_austral  Topics          0
4   to_barley  Topics          0


In [4]:
import numpy as np

# Update the number of documents per topic
def update_frequencies(categories):
    """Increment the document count of every topic in `categories`.
    ---------------------------------------------
    :param categories: a list of categories read from one Reuters document
    :returns: None; the 'Newslines' column of news_categories is updated in place
    """
    for category in categories:
        idx = news_categories[news_categories.Name == category].index[0]
        # .at is the public pandas accessor for a single cell
        news_categories.at[idx, 'Newslines'] += 1

# Build the label vector Y of a document: one entry per topic in
# ["to_money-fx","to_ship","to_interest","to_acq","to_earn"]
def to_category_vector(categories, target_categories):
    """Encode the topics of one document as a multi-hot vector.
    ---------------------------------------------
    :param categories: a list of categories read from one Reuters document
    :param target_categories: the selected topics, e.g. ["to_money-fx","to_ship","to_interest","to_acq","to_earn"]
    :returns: a vector with one entry per target category, 1.0 if the document has that topic, else 0.0
    """
    vector = np.zeros(len(target_categories)).astype(np.float32)
    for i in range(len(target_categories)):
        if target_categories[i] in categories:
            vector[i] = 1.0
    return vector

In [5]:
from bs4 import BeautifulSoup
import re
import xml.sax.saxutils as saxutils
from glob import glob

# Parse SGML files
document_X = {}
document_Y = {}

f_list = glob('reuter/*.sgm')

# Helpers: strip markup tags and decode XML entities ('&amp;', '&lt;', '&gt;')
def strip_tags(text):
    return re.sub('<[^<]+?>', '', text).strip()
def unescape(text):
    return saxutils.unescape(text)

# Iterate all files
for filename in f_list:
    print('Start parsing {0}...'.format(filename))
    # The context manager closes the file even if parsing fails
    with open(filename, 'rb') as file:
        content = BeautifulSoup(file.read().lower())
    for newsline in content('reuters'):
        document_categories = []           
        # News-line Id
        document_id = newsline['newid']
        # Extracting document body
        document_body = strip_tags(str(newsline('text')[0])).replace('reuter\n&#3;', '')
        document_body = unescape(document_body)
        # News-line categories
        topics = newsline.topics.contents
        for topic in topics:
            document_categories.append('to_' + strip_tags(str(topic)))                
        # Count this document towards each of its topics
        update_frequencies(document_categories)
        # Keep only documents tagged with at least one of
        # ["to_money-fx","to_ship","to_interest","to_acq","to_earn"]
        category_vector = to_category_vector(document_categories, selected_categories)
        if category_vector.sum() >= 1.0:
            document_Y[document_id] = category_vector
            document_X[document_id] = document_body

Start parsing reuter/reut2-020.sgm...
Start parsing reuter/reut2-001.sgm...
Start parsing reuter/reut2-003.sgm...
Start parsing reuter/reut2-004.sgm...
Start parsing reuter/reut2-007.sgm...
Start parsing reuter/reut2-018.sgm...
Start parsing reuter/reut2-012.sgm...
Start parsing reuter/reut2-011.sgm...
Start parsing reuter/reut2-006.sgm...
Start parsing reuter/reut2-000.sgm...
Start parsing reuter/reut2-005.sgm...
Start parsing reuter/reut2-013.sgm...
Start parsing reuter/reut2-015.sgm...
Start parsing reuter/reut2-014.sgm...
Start parsing reuter/reut2-016.sgm...
Start parsing reuter/reut2-002.sgm...
Start parsing reuter/reut2-021.sgm...
Start parsing reuter/reut2-008.sgm...
Start parsing reuter/reut2-019.sgm...
Start parsing reuter/reut2-009.sgm...
Start parsing reuter/reut2-010.sgm...
Start parsing reuter/reut2-017.sgm...


In [6]:
print(document_body)
#document_Y


u.s. and soviets draft euromissile treaty
    geneva, june 2 - u.s. and soviet negotiators have completed
the text of a draft treaty calling for the elimination of
medium-range missiles in europe, a soviet negotiator said.
    "we must say that as a result of the work done at the
current round the sides have drafted the first joint draft text
of the treaty on medium-range missiles," alexei obukhov, deputy
leader of the soviet negotiating team, told reporters.
    he said there was still much work to be done and several
areas of disagreement remained to be resolved.         
 reuter



In [8]:
print(len(document_Y))
#print(document_Y)
print(len(document_X))

7824
7824


In [9]:
#print(document_X["5"])
news_categories.sort_values(by='Newslines', ascending=False, inplace=True)
news_categories.head(10)

            Name    Type  Newslines
35       to_earn  Topics       3987
0         to_acq  Topics       2448
73   to_money-fx  Topics        801
28      to_crude  Topics        634
45      to_grain  Topics        628
126     to_trade  Topics        552
55   to_interest  Topics        513
130     to_wheat  Topics        306
108      to_ship  Topics        305
19       to_corn  Topics        254


The following table shows the number of documents for each topic considered in this exercise; the classes are not evenly distributed.

In [10]:
# Rows with these index labels are the 5 topics in selected_categories
news_categories.loc[[35,0,73,55,108]]
#print(document_X.keys())

            Name    Type  Newslines
35       to_earn  Topics       3987
0         to_acq  Topics       2448
73   to_money-fx  Topics        801
55   to_interest  Topics        513
108      to_ship  Topics        305


### 2.3 Cleaning The Data

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

We're going to apply just the common cleaning steps here:
- Make text all lower case
- Remove punctuation
- Remove numerical values
- Remove common non-sensical text (\n)
- Tokenize text
- Stem words
- Remove stop words


In [11]:
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import PorterStemmer, SnowballStemmer
def tokenize1(text):
    """Clean a text: lower-case it, keep letters only, stem, then drop stop words and short words.
    ---------------------------------------------
    :param text: a string
    :returns: a string of space-separated stems, each of length >= 3
    """
    min_length = 3
    stopwords_set = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')
    text = text.replace('\n', ' ').lower().strip()
    # Replace every run of non-letters (punctuation, digits) with a space, then split into words
    words = re.sub("[^a-zA-Z]+", " ", text).split()
    stems = [stemmer.stem(word) for word in words]
    # Stop words are filtered after stemming, so only stems that still match the stop list are dropped
    stemmed = ' '.join(word for word in stems if word not in stopwords_set and len(word) >= min_length)
    return stemmed

In [12]:
# Cleaned document collection (one string per document)
newsline_documents = []
char_nb = 0
for key in document_X.keys():
    cleaned = tokenize1(document_X[key])
    newsline_documents.append(cleaned)
    # tokenize1 returns a string, so len() counts characters, not words
    char_nb += len(cleaned)
number_of_documents = len(document_X)
print(number_of_documents)
print(char_nb)

7824
3333699


In [80]:
print(newsline_documents[10])
#print(document_X["10"])
#document_X[key]

poehl say rate rise caus concern frankfurt oct rise west german intern interest rate caus concern bundesbank interest higher capit market rate bundesbank presid karl otto poehl said consid interest rate increas occur intern problem caus concern poehl told invest confer would like stress bundesbank interest higher capit market rate said short poehl spoke bundesbank announc tender secur repurchas pact fix rate pct previous tender last month interest rate seen alloc rate facil rise pct last week pact last fix rate tender late septemb bundesbank reduct key alloc rate pct herald monday repeat inject money market liquid pct move cap interest rate follow meet poehl financ minist gerhard stoltenberg treasuri secretari jame baker monday frankfurt offici said afterward three men reaffirm commit louvr accord currenc stabil weekend critic baker tighten west german monetari polici prompt sharp fall dollar specul louvr cooper end dollar ralli news monday meet nervous trade trade abov mark tuesday po

### 2.4 Organizing The Data
#### 2.4.1 TF-IDF transformation in sklearn

At this point, we want to weight each feature according to its "importance" for the document. We use TF-IDF, where a term's weight is higher the more often it occurs in the document and the less often it occurs in the collection.
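As a rough illustration of this weighting, the sketch below computes TF-IDF by hand on three tiny made-up documents. It assumes the smoothed formula documented as scikit-learn's default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation; the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (each document is a list of tokens)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency (assumed formula, see the text above).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        # L2-normalise so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({term: w / norm for term, w in raw.items()} if norm else {})
    return weighted


docs = [
    "rate rise rate".split(),
    "rate cut".split(),
    "ship port".split(),
]
weights = tfidf(docs)
print(weights[0])
print(weights[1])
```

In the first document "rate" outweighs "rise" because it occurs twice; in the second, "cut" outweighs "rate" because "rate" appears in more documents of the collection.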

Organized data in two standard text formats:
* Corpus - a collection of text: list variable --- body_list
* Document-Term Matrix: sparse matrix --- matrix_tfidf

In [13]:
import time
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from nltk.tokenize import RegexpTokenizer

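body_list = newsline_documents
# time.clock() was removed in Python 3.8; perf_counter() is the replacement for timing
start = time.perf_counter()
vectorizer = TfidfVectorizer()
vectorizer.fit(newsline_documents)
print("sklearn TFIDF processing time: {0:.5f} s".format(time.perf_counter() - start))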

sklearn TFIDF processing time: 0.26124 s


In [14]:
# TF-IDF matrix:
matrix_tfidf = vectorizer.transform(newsline_documents)
matrix_tfidf.shape

(7824, 17972)

In [15]:
#label:
num_categories = len(selected_categories)
topic_class = np.zeros(shape=(number_of_documents, num_categories))
for idx, key in enumerate(document_Y.keys()):
    topic_class[idx, :] = document_Y[key]
topic_class

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.]])

#### 2.4.2 Word2Vec transformation in gensim

At this point, we want to build a more compact dense matrix with the Word2Vec method: each document has a fixed number of words and each word a fixed number of features.

Organized data in two standard text formats:
* Corpus: a collection of text --- newsline_documents
* Document-Term Matrix: dense matrix --- X

In [56]:
# Second version of the tokenizer, which removes more words
from nltk import word_tokenize
from nltk.stem.snowball import PorterStemmer, SnowballStemmer
import re
from nltk.corpus import stopwords
import string 

# Tokenizer adapted for Word2Vec:
# remove markup in angle brackets, punctuation, newlines and words containing digits,
# then lower-case, drop stop words, stem and keep tokens of length >= 3.
def tokenize2(text):
    min_length = 3
    cachedStopWords = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Newlines are removed without inserting a space, so words on either side of a line break are joined
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    words = [word.lower() for word in word_tokenize(text)]
    words = [word for word in words if word not in cachedStopWords]
    tokens = [stemmer.stem(token) for token in words]
    p = re.compile(r"['a-zA-Z]+")  # token must start with one or more letters or apostrophes
    filtered_tokens = [token for token in tokens if p.match(token) and len(token) >= min_length]
    return filtered_tokens

In [47]:
# Tokenized document collection (one list of tokens per document)
newsline_documents = []
word_nb = 0
for key in document_X.keys():
    # Tokenize each document once and reuse the result
    tokens = tokenize2(document_X[key])
    newsline_documents.append(tokens)
    word_nb += len(tokens)
number_of_documents = len(document_X)
print(number_of_documents)
print(word_nb)

7824
489300


In [48]:
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count

# Word2Vec number of features
num_features = 200

# Create new Gensim Word2Vec model
# (gensim 3.x API; from gensim 4.0 the `size` argument is named `vector_size`)
w2v_model = Word2Vec(newsline_documents, size=num_features, min_count=1, window=10, workers=cpu_count())
w2v_model.init_sims(replace=True)
w2v_model.save(data_folder + 'reuters.word2vec')

In [49]:
# Limit each newsline to a fixed number of words
document_max_num_words = 100
num_categories = len(selected_categories)
X = np.zeros(shape=(number_of_documents, document_max_num_words, num_features)).astype(np.float32)
Y = np.zeros(shape=(number_of_documents, num_categories)).astype(np.float32)

empty_word = np.zeros(num_features).astype(np.float32)
for idx, document in enumerate(newsline_documents):
    for jdx, word in enumerate(document):
        if jdx == document_max_num_words:
            break            
        else:
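            # Look words up through the model's KeyedVectors (.wv); indexing the model itself is deprecated
            if word in w2v_model.wv:
                X[idx, jdx, :] = w2v_model.wv[word]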
            else:
                X[idx, jdx, :] = empty_word
for idx, key in enumerate(document_Y.keys()):
    Y[idx, :] = document_Y[key]
    
print(X.shape)
print(Y.shape)


(7824, 100, 200)
(7824, 5)


## 3. Classifying Reuters 


To classify the collection, we apply a number of steps that are standard for most classification problems:

* Define our training and testing subsets so that we do not evaluate on documents the system has learnt from. In our case, 30% of the documents are held out for testing (test_size=0.3).
* Represent all the documents in each subset.
* Train a classifier on the represented training data.
* Predict the labels for each of the represented testing documents.
* Compare the real and predicted document labels to evaluate our solution.

Models:
* Linear SVM (LinearSVC), which has traditionally produced good results on text classification problems;
* Naive Bayes (MultinomialNB for the TF-IDF features, GaussianNB for the Word2Vec features).

The problem we are solving is multi-label, so we train our model (which is binary by nature) N times, once per category; for each category, the negative cases are the documents that do not carry that label. This allows the model to make a binary decision per category and produce multi-label results. This can be done with the OneVsRestClassifier object in Scikit-learn. This step may differ for estimators such as kNN, which handle multi-label targets natively.
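The one-vs-rest idea can be sketched in a few lines of plain Python. This is an illustration of the decomposition only, not scikit-learn's implementation: `CentroidClassifier` is a hypothetical toy binary model, the helper names are made up, and the toy data assumes that every category has at least one positive and one negative example.

```python
class CentroidClassifier:
    """Toy binary classifier (hypothetical): predicts the class whose centroid is nearer."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, target in zip(X, y) if target == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [
            1 if sq_dist(x, self.centroids[1]) < sq_dist(x, self.centroids[0]) else 0
            for x in X
        ]


def one_vs_rest_fit(X, Y, make_classifier):
    """Train one binary model per category (one column of the indicator matrix Y)."""
    n_categories = len(Y[0])
    models = []
    for c in range(n_categories):
        y_c = [row[c] for row in Y]  # 1 = document has category c, 0 = it does not
        models.append(make_classifier().fit(X, y_c))
    return models


def one_vs_rest_predict(models, X):
    """Stack the per-category binary decisions into one multi-label row per document."""
    per_category = [model.predict(X) for model in models]
    return [list(row) for row in zip(*per_category)]


X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
models = one_vs_rest_fit(X, Y, CentroidClassifier)
predictions = one_vs_rest_predict(models, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(predictions)
```

Because each category gets its own independent decision, a document can receive zero, one or several labels.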

### 3.1 Classifying Reuters with TF-IDF transformation

* Split the data into training and test sets, holding out 30% for testing;
* Train models with the SVM and MultinomialNB methods;
* Evaluate the models with cross-validation (cross_val_score in sklearn);
* Evaluate on the test set with micro- and macro-averaged Precision, Recall and F1.

In [24]:
from sklearn.model_selection import train_test_split, cross_val_score
#split train and test dataset with ratio of 0.3
X_train, X_test, Y_train, Y_test = train_test_split(matrix_tfidf, topic_class, test_size=0.3)
print(X_train.shape)
print(Y_train.shape)
# print(X_train)

(5476, 17972)
(5476, 5)


In [25]:
import logging
from sklearn.model_selection import train_test_split, cross_val_score

## Parameters of cross_val_score:
# estimator: the classifier (clf)
# X: features
# y: labels
# scoring: metric such as 'accuracy'; defaults to the estimator's own score method
# cv: number of folds
# n_jobs: number of CPUs (-1 for all)

def cross_validation(clf,num_folds = 10):
    # logger.info("Cross validation")
    print("Cross validation")
    scores = cross_val_score(clf,
                             X_train, Y_train,
                             cv=num_folds,
                             n_jobs=-1,
                             verbose=0)
    print(f"Real risk by {num_folds}-fold CV : {scores.mean():.2} (+/- {scores.std():.2})")
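To clarify what the cross-validation figure measures, here is a small sketch of the two ingredients. It assumes, as far as we understand scikit-learn's defaults, that a classifier scored without an explicit `scoring` argument on multi-label indicator targets is measured by subset accuracy, i.e. a document counts as correct only if all of its labels match. The helpers `k_fold_indices` and `subset_accuracy` are hypothetical and unshuffled, unlike a production splitter.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def subset_accuracy(y_true, y_pred):
    """Share of documents whose whole label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


folds = list(k_fold_indices(10, 3))
print([len(test) for _, test in folds])

score = subset_accuracy([[1, 0], [0, 1], [1, 1]], [[1, 0], [0, 1], [1, 0]])
print(score)
```

Under that assumption, the cross-validation score is stricter than the micro- and macro-averaged metrics, which give credit for each correct label separately.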

In [26]:
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(test_labels, predictions):
    precision = precision_score(test_labels, predictions, average='micro')
    recall = recall_score(test_labels, predictions, average='micro')
    f1 = f1_score(test_labels, predictions, average='micro')
    print("Micro-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

    precision = precision_score(test_labels, predictions,average='macro')
    recall = recall_score(test_labels, predictions, average='macro')
    f1 = f1_score(test_labels, predictions, average='macro')

    print("Macro-average quality numbers")
    print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

#### 3.1.1 Linear SVM method

In [27]:
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

def linearsvc_classifier(train_docs, train_labels):
    #classifier = LinearSVC(["penalty='l2'", "loss='squared_hinge'", "multi_class='ovr'"])
    classifier = OneVsRestClassifier(LinearSVC(random_state=42))
    classifier.fit(train_docs, train_labels)
    return classifier

In [28]:
cross_validation(OneVsRestClassifier(LinearSVC(random_state=42)))
#cross_validation(LinearSVC(["penalty='l2'", "loss='squared_hinge'", "multi_class='ovr'"]))

Cross validation
Real risk by 10-fold CV : 0.94 (+/- 0.0097)


In [29]:
#documents = reuters.fileids()
model_svm = linearsvc_classifier(X_train, Y_train)
predictions_svm = model_svm.predict(X_test)
evaluate(Y_test, predictions_svm)

Micro-average quality numbers
Precision: 0.9678, Recall: 0.9598, F1-measure: 0.9638
Macro-average quality numbers
Precision: 0.9439, Recall: 0.9199, F1-measure: 0.9310


In [30]:
model_svm = linearsvc_classifier(X_train, Y_train)
predictions_svm = model_svm.predict(X_train)
evaluate(Y_train, predictions_svm)

Micro-average quality numbers
Precision: 0.9940, Recall: 0.9975, F1-measure: 0.9958
Macro-average quality numbers
Precision: 0.9851, Recall: 0.9931, F1-measure: 0.9891


#### 3.1.2 Naive Bayes 

In [31]:
from sklearn.naive_bayes import MultinomialNB,GaussianNB

def naive_classifier(train_docs, train_labels):
    classifier = OneVsRestClassifier(MultinomialNB(alpha=0.01))
    classifier.fit(train_docs, train_labels)
    return classifier

In [32]:
cross_validation(OneVsRestClassifier(MultinomialNB(alpha=0.01)))

Cross validation
Real risk by 10-fold CV : 0.84 (+/- 0.017)


In [33]:
model_naive = naive_classifier(X_train, Y_train)
predictions_naive = model_naive.predict(X_test)
evaluate(Y_test, predictions_naive)

Micro-average quality numbers
Precision: 0.8788, Recall: 0.9254, F1-measure: 0.9015
Macro-average quality numbers
Precision: 0.8304, Recall: 0.9303, F1-measure: 0.8701


In [34]:
model_naive = naive_classifier(X_train, Y_train)
predictions_naive = model_naive.predict(X_train)
evaluate(Y_train, predictions_naive)

Micro-average quality numbers
Precision: 0.9419, Recall: 0.9851, F1-measure: 0.9630
Macro-average quality numbers
Precision: 0.8888, Recall: 0.9900, F1-measure: 0.9294


### 3.2 Classifying Reuters with Word2Vec transformation
* New idea: average the 100 word vectors of each document, turning X (shape: 7824, 100, 200) into X (shape: 7824, 200);
* Split the data into training and test sets, holding out 30% for testing;
* Train models with the SVM and GaussianNB methods;
* Evaluate the models with cross-validation (cross_val_score in sklearn);
* Evaluate on the test set with micro- and macro-averaged Precision, Recall and F1.
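The averaging step can be sketched as follows. The notebook divides the summed vectors by the fixed length of 100, so the zero padding of short documents is part of the average; the `pad_aware=True` variant below, which divides by the number of words actually present, is an alternative shown for comparison and is not what the notebook does. The function name and toy vectors are hypothetical.

```python
def average_document_vector(word_vectors, max_words=100, pad_aware=False):
    """Average up to max_words word vectors into a single document vector.

    pad_aware=False divides by max_words, so zero padding counts towards the mean.
    pad_aware=True divides by the number of word vectors actually present.
    """
    kept = word_vectors[:max_words]
    if not kept:
        return []
    dim = len(kept[0])
    totals = [sum(vec[d] for vec in kept) for d in range(dim)]
    divisor = len(kept) if pad_aware else max_words
    return [total / divisor for total in totals]


short_doc = [[1.0, 2.0], [3.0, 4.0]]  # two words, two features each
fixed = average_document_vector(short_doc, max_words=4)
aware = average_document_vector(short_doc, max_words=4, pad_aware=True)
print(fixed)  # [1.0, 1.5]
print(aware)  # [2.0, 3.0]
```

With the fixed divisor, short documents end up with smaller vector norms than long ones, which may or may not be desirable for the classifier.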

In [50]:
from sklearn.model_selection import train_test_split

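# Average the word vectors of each document: X (shape: 7824, 100, 200) becomes (7824, 200).
# Documents shorter than 100 words are zero-padded, and the padding is included in the average.
matrix_word2vec = X.sum(axis=1) / document_max_num_words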
topic_class = Y
#split train and test dataset with ratio of 0.3
X_train, X_test, Y_train, Y_test = train_test_split(matrix_word2vec, topic_class, test_size=0.3)
print(X_train.shape)
print(Y_train.shape)

(5476, 200)
(5476, 5)


#### 3.2.1 Linear SVM method

In [51]:
cross_validation(OneVsRestClassifier(LinearSVC(random_state=42)),num_folds = 10)

Cross validation
Real risk by 10-fold CV : 0.83 (+/- 0.021)


In [52]:
#documents = reuters.fileids()
model = linearsvc_classifier(X_train, Y_train)
predictions = model.predict(X_test)
evaluate(Y_test, predictions)

Micro-average quality numbers
Precision: 0.9511, Recall: 0.8696, F1-measure: 0.9085
Macro-average quality numbers
Precision: 0.9068, Recall: 0.7217, F1-measure: 0.7949


In [53]:
#documents = reuters.fileids()
model = linearsvc_classifier(X_train, Y_train)
predictions = model.predict(X_train)
evaluate(Y_train, predictions)

Micro-average quality numbers
Precision: 0.9421, Recall: 0.8649, F1-measure: 0.9018
Macro-average quality numbers
Precision: 0.9005, Recall: 0.7168, F1-measure: 0.7889


#### 3.2.2 Naive Bayes 

In [54]:
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier

def naive_classifier(train_docs, train_labels):
    classifier = OneVsRestClassifier(GaussianNB())
    classifier.fit(train_docs, train_labels)
    return classifier

In [57]:
cross_validation(OneVsRestClassifier(GaussianNB()))

Cross validation
Real risk by 10-fold CV : 0.63 (+/- 0.024)


In [55]:
print(Y_train.shape)
print(X_train.shape)
model = naive_classifier(X_train, Y_train)
predictions = model.predict(X_test)
evaluate(Y_test, predictions)

(5476, 5)
(5476, 200)
Micro-average quality numbers
Precision: 0.6715, Recall: 0.8559, F1-measure: 0.7526
Macro-average quality numbers
Precision: 0.5587, Recall: 0.8089, F1-measure: 0.6378


In [59]:
model = naive_classifier(X_train, Y_train)
predictions = model.predict(X_train)
evaluate(Y_train, predictions)

Micro-average quality numbers
Precision: 0.6798, Recall: 0.8516, F1-measure: 0.7560
Macro-average quality numbers
Precision: 0.5720, Recall: 0.8130, F1-measure: 0.6493


In [None]:
## RandomForestRegressor model
# from sklearn.ensemble import RandomForestRegressor

# RF = RandomForestRegressor(n_estimators=10, criterion="mae", max_depth=3)
# RF.fit(X_train, Y_train)
# predictions = model.predict(X_test)
# evaluate(Y_test, predictions)

## 4. Predictive analysis

Our samples are not evenly distributed across the five topics:
* to_earn: 3987 documents
* to_acq: 2448 documents
* to_money-fx: 801 documents
* to_interest: 513 documents
* to_ship: 305 documents

Micro-averaging pools the decisions of all classes, so the frequent classes dominate the score, whereas macro-averaging gives each class the same weight. In every experiment the micro-averaged F1 is higher than the macro-averaged F1, which suggests that the rare classes are classified less well than the frequent ones.

With the TF-IDF transformation (the same pattern holds for Word2Vec):

Linear SVM (LinearSVC):
* Micro-average quality numbers
* Precision: 0.9678, Recall: 0.9598, F1-measure: 0.9638
* Macro-average quality numbers
* Precision: 0.9439, Recall: 0.9199, F1-measure: 0.9310

Multinomial Naive Bayes (MultinomialNB):
* Micro-average quality numbers
* Precision: 0.8788, Recall: 0.9254, F1-measure: 0.9015
* Macro-average quality numbers
* Precision: 0.8304, Recall: 0.9303, F1-measure: 0.8701
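The difference between micro- and macro-averaging can be checked on a toy example. The sketch below recomputes both averages from indicator matrices in plain Python; the data is made up (one frequent class, one rare class), and the function names are hypothetical. Macro F1 is taken as the mean of the per-class F1 scores.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_macro(y_true, y_pred):
    """Return ((P, R, F1) micro-averaged, (P, R, F1) macro-averaged)."""
    n_classes = len(y_true[0])
    counts = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts of all classes, then compute the metrics once.
    micro = precision_recall_f1(*[sum(col) for col in zip(*counts)])
    # Macro: compute the metrics per class, then take the unweighted mean.
    per_class = [precision_recall_f1(*c) for c in counts]
    macro = tuple(sum(scores[i] for scores in per_class) / n_classes for i in range(3))
    return micro, macro


# Eight documents of a frequent class (all correct), two of a rare class (one wrong).
y_true = [[1, 0]] * 8 + [[0, 1]] * 2
y_pred = [[1, 0]] * 8 + [[0, 1], [1, 0]]
micro, macro = micro_macro(y_true, y_pred)
print("micro F1: {:.4f}".format(micro[2]))
print("macro F1: {:.4f}".format(macro[2]))
```

A single mistake on the rare class lowers the macro F1 much more than the micro F1, which is the pattern seen in the results above.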

Results:
* Linear SVM performs much better than Naive Bayes with both transformations;
* With averaged Word2Vec vectors (the mean of the word vectors in a document), Linear SVM still performs well, but Naive Bayes performs poorly;
* The averaged Word2Vec representation (7824, 200) has far fewer dimensions than TF-IDF (7824, 17972), and the SVM results remain good, although below those obtained with TF-IDF:
    * Micro-average quality numbers
    * Precision: 0.9511, Recall: 0.8696, F1-measure: 0.9085
    * Macro-average quality numbers
    * Precision: 0.9068, Recall: 0.7217, F1-measure: 0.7949
* Word2Vec features could feed a neural network if we wanted to handle all categories, but for our problem, with its low-dimensional feature space, classical ML methods are sufficient to classify the documents well. Possible neural architectures:
    * CNN + word2vec
    * LSTM + word2vec

## 5. Topic Modeling

In this case, our problem is supervised. Many NLP problems, however, are unsupervised and call for topic modeling. The goal of topic modeling is to find the topics present in a corpus. Each document in the corpus is made up of at least one topic, and often several.

Method: Latent Dirichlet Allocation (LDA), which is one of many topic modeling techniques. It was specifically designed for text data.

To use a topic modeling technique, we need to provide 
* a document-term matrix;
* and the number of topics we would like the algorithm to pick up.

Once the topic modeling technique is applied, our job as humans is to interpret the results and see whether the mix of words in each topic makes sense. If it doesn't, we can try changing the number of topics, the terms in the document-term matrix or the model parameters, or even try a different model.

## Getting the Reuters dataset from the NLTK library

As noted in the introduction, the most common split is ModApte, which only considers categories that have at least one document in both the training set and the test set. The ModApte split has 90 categories, with a training set of 7769 documents and a test set of 3019 documents. This split is available directly in the NLTK library.
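A minimal sketch of how the split can be recovered from NLTK is shown below. It relies on the NLTK Reuters corpus naming its files `training/<id>` and `test/<id>`; the helper names are hypothetical, and `load_reuters_modapte` needs the `nltk` package plus a downloaded `reuters` corpus, so it is defined here but not called.

```python
def split_modapte(fileids):
    """Split NLTK Reuters file ids into (training ids, test ids) by their prefix."""
    train = [f for f in fileids if f.startswith("training/")]
    test = [f for f in fileids if f.startswith("test/")]
    return train, test


def load_reuters_modapte():
    """Return (train, test) lists of (raw text, categories) pairs from NLTK.

    Requires nltk and the corpus, e.g. after nltk.download('reuters').
    """
    from nltk.corpus import reuters  # imported lazily so split_modapte works without nltk

    train_ids, test_ids = split_modapte(reuters.fileids())
    train = [(reuters.raw(f), reuters.categories(f)) for f in train_ids]
    test = [(reuters.raw(f), reuters.categories(f)) for f in test_ids]
    return train, test


# Illustration with made-up file ids that follow the NLTK naming scheme.
train_ids, test_ids = split_modapte(["training/1", "training/2", "test/14826"])
print(len(train_ids), len(test_ids))
```

Each document comes back with its list of categories, so the multi-hot label vectors used in this notebook can be built from them directly.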

Useful blogs:
* https://martin-thoma.com/nlp-reuters/
* https://ana.cachopo.org/datasets-for-single-label-text-categorization
* https://towardsdatascience.com/analysis-and-visualization-of-unstructured-text-data-2de07d9adc84

#### Thanks!