In [1]:
import pandas as pd
from bs4 import BeautifulSoup

Beautiful Soup is a Python library for extracting data out of HTML and XML type files. The Reuters data is a SGM type file, which is like HTML in terms of tages.

In [2]:
files=[]
for i in range(0,10):
    files.append('reut2-00'+str(i)+'.sgm')
for i in range(10,22):
    files.append('reut2-0'+str(i)+'.sgm') 

There are 21 files, so we save the names of the files in a list such that we can extract them one after the another.

In [3]:
df=pd.DataFrame(columns=['Topics','Body'])

In [4]:
for file in files:
    
    f = open(file, 'r')
    data= f.read()
    
    soup = BeautifulSoup(data)
    
    topics=soup.findAll('topics') 
    #Finds all occurrences of the tag "topic" and stores it as a list. 
    
    y_topics=[]
    for j in range(0,len(topics)):
        x=[]
        for i in topics[j]:
            x.append(i.text)
            '''i.text converts everything found between the tag as a string 
            and ignores all tags that come in between.'''
        y_topics.append(x)
    topics=y_topics  
    '''The above loop and all is done because not all documents have topics.
    So we don't want a mismatch between the document's index and its topics's.'''
    
    txt = soup.findAll('text')    
    body=list()    
    for i in txt:
        body.append(str(i.text))
    '''Extracting the body was easy as it was definitely there for all docs'''
    
    '''Now we consider only those docs in a data frame who have topics, 
    because otherwise we can't train our model.'''
    for i in range(0,len(topics)):
        
        if len(topics[i])!=0:       
            
            body[i]=body[i].replace('\d+'," num ")       
            '''replacing all the digits by num 
            as they are not really the ones defining our topics
            and doing so makes our program run faster.'''
            
            z={'Topics':topics[i],'Body':body[i]}
            df=df.append(z,ignore_index=True)

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score,jaccard_similarity_score,hamming_loss

In [6]:
mlb = MultiLabelBinarizer()
x=df['Body']
y=mlb.fit_transform(df['Topics'])

The classifier does not allow the topics to be in list format, but rather expects them to be as a sparse matrix.
fit_transform is used to convert multilabel to  a binary matrix that indicates the presence of class label. 

Thus the topics column is fitted and transformed using the multilabel binarizer. This transforms each item into a binary row 
of 0's and 1's and says if a topic is present(1), or not(0), in a particular item in the topics column.

Multilabel binarizer is used as we can have multiple labels for the same document.

In [7]:
x_train,x_test,y_train,y_test=train_test_split(x, y, test_size=0.10, random_state=100)

We split our data into training and test, taking the test size to be 10%

In [8]:
classifier = LabelPowerset(ComplementNB())

So now we had to define a classifier.

Label Powerset is a problem transformation approach to multi-label classification that transforms a multi-label problem to a multi-class problem with 1 multi-class classifier trained on all unique label combinations found in the training data, i.e. each combination is treated as a separate class and probability will be estimated per that class.

I implemented the Complement Naive Bayes algorithm as it is suited for imbalanced data sets.
Instead of calculating the likelihood of a word occurring in a class, it calculates the likelihood that the word occurs in other classes. A higher value means that it is highly likely that a document with those words does not belong to that class. 
 
Complement Naive Bayes outperforms Multinomial Naive Bayes on text classification tasks, as I also checked out.

In [9]:
pipeline = Pipeline([('vectorizer', TfidfVectorizer(max_features=5000,analyzer="word",stop_words=stopwords.words('english'))),
                         ('classifier', classifier)])

A pipeline connects a series of steps into one object which you train and then use to make predictions, which leads to convenience in creating a easy to understand workflow. The purpose of the pipeline is to assemble several steps that can be crossvalidated together while setting different parameters. 

Here I've used a pipeline to connect the vectorization and classification tasks. 

TFIDF resolves the issue that a high frequency word might also be having a high frequency of occurrence in other documents as well. So TFIDF value is high only if a word has a high frequency of occurrence in that specific document but lower in all the other documents.

Stop words are the less meaningful English words, which are supposed to be ignored while text classification.

Note that TFIDF considers punctuations,etc., as separators and converts everything to lowercase by default. So we don't need to worry about the difference of words on the basis of case or presence of punctuation marks.  

In [10]:
pipeline.fit(x_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...entNB(alpha=1.0, class_prior=None, fit_prior=True, norm=False),
       require_dense=[True, True]))])

This is learning the vocabulary stage.

In [11]:
y_test_pred=pipeline.predict(x_test)

This predicts the view for the documents in our test set

In [12]:
print("The cross validation scores are: {}".format(cross_val_score(pipeline, x_train, y_train, cv=4)))
print("The fraction of correctly classified samples according to \nAccuracy Score : {}".format(accuracy_score(y_test, y_test_pred)))
print("1-Hamming Loss : {}".format(1-hamming_loss(y_test, y_test_pred)))

The cross validation scores are: [0.78225176 0.7853792  0.79429018 0.77708252]
The fraction of correctly classified samples according to 
Accuracy Score : 0.8082673702726473
1-Hamming Loss : 0.996621225447083


Our general method of splitting the dataset into 2 parts for test and train has the disadvantage of our 
classifier not getting trained and validated on all examples in the data set, which is overcome by cross validation.
The cross validation scores obtained as output should not vary too much for a good fit.

Accuracy score computes the subset accuracy i.e. the set of labels predicted for a sample must 
exactly match the corresponding set of labels in y_true.

Hamming loss is the fraction of labels that are incorrectly predicted.
So (1-Hamming loss) is the fraction of correctly predicted labels. 
This penalizes the individual labels, unlike the accuracy_score which considers the entire set of labels 
for a given sample as incorrect if it does entirely match the true set of labels.

In [14]:
y_test_pred_actual=mlb.inverse_transform(y_test_pred)
y_test_actual=mlb.inverse_transform(y_test)

'''.fit_transform() -(Used before) It fits the label sets' binarizer and transforms the given label sets. 
   .inverse_transform -It tansforms the binary array to list of topics '''

df_new=pd.DataFrame(columns=['Actual Labels','Predicted Labels','Document Body'])
x_test=list(x_test)
for j in range(0,len(y_test)):
    df_new=df_new.append({'Actual Labels':y_test_actual[j],'Predicted Labels':y_test_pred_actual[j],'Document Body':x_test[j]},ignore_index=True)

'''This is just to get the predicted and actual labels of our test data stored in an excel file.'''

writer = pd.ExcelWriter('output_fitted.xlsx', engine='xlsxwriter') 
df_new.to_excel(writer, sheet_name='Sheet1') 
writer.save() 