# Supervised Topic Classification 

## Version 2.0

Welcome to my Supervised Topic Classification Learning Experiment, using the BBC News Kaggle dataset. This series of notebooks is meant to serve me, and potentially others, as I learn bits of NLP relating to topic classification. Each notebook is split into versions to capture my design and learning progress, and to demonstrate in situ the mistakes and poor judgement calls I make.

At the end of each iteration, I will leave some notes discussing how I will (hopefully) improve my work in future versions.

See data source here https://www.kaggle.com/pariza/bbc-news-summary 

Last accessed on 2022-Mar-12.



In [1]:
# Import some standard dependencies
import numpy as np
import pandas as pd

In [2]:
# If you're cloning the GitLab repo, this cell won't work for you unless you save the data in a similar directory.
# A link to the data is in the notebook header text.

import glob
files = glob.glob('data/News Articles/**/*.txt',
                  recursive=True)

### The Data

The data repo contains **five** folders, {business, entertainment, politics, sport, tech}, each containing `.txt` files, with one file per article.

I know already, because I looked at *some* of them, that these files are pretty clean. There is no header/footer text to remove or other structural features, but there are a lot of symbols for currency (£, $, Eur), percentages (%) etc. So these will need to be handled. 
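Before writing any custom clean-up, it may help to see what the default `TfidfVectorizer` tokenizer already does with these symbols. A small sketch, using an invented sample sentence (an assumption for illustration, not text from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn raw text into tokens
analyzer = TfidfVectorizer().build_analyzer()

# Invented sentence containing the kinds of symbols seen in the articles
sample = "Shares rose 5% to £20 after the $3bn deal"
tokens = analyzer(sample)
print(tokens)
```

With default settings the currency and percent signs are dropped, and so is the lone digit `5`, because the default token pattern only keeps tokens of two or more word characters. Whether that loss matters for topic classification is a separate judgement call.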

**Things to consider**
* The variation in article length
* The number of articles per 'label'
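Both points can be checked with a couple of pandas calls. A rough sketch on a made-up frame (the `toy` DataFrame, its rows, and the `nWords` column name are illustrative assumptions, not the real corpus):

```python
import pandas as pd

# Toy stand-in for the real corpus; the rows are invented for illustration
toy = pd.DataFrame({
    "topic": ["sport", "sport", "tech"],
    "text": [
        "Wales win again",
        "A long match report with many more words",
        "New browser released",
    ],
})

# Article length measured as a whitespace-separated word count
toy["nWords"] = toy["text"].str.split().str.len()

# Spread of article lengths per topic (count, mean, min, max, quartiles)
length_summary = toy.groupby("topic")["nWords"].describe()

# Number of articles per label
label_counts = toy["topic"].value_counts()
```

The same two lines applied to the real `corpus` frame would show whether any topic has unusually long or short articles.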

In [3]:
# We create data 'labels' by using the name of the folder the Article is saved in
labels,fileNames = zip(*[s.split('/')[2:] for s in files])
# Then make a pandas.DataFrame to store the data
corpus = pd.DataFrame(data = {
    "filePath":files, "topic":labels, "fileName":fileNames})

# Open and read each file, save the contents to a list, and add the list to the corpus pd.DataFrame as a column
# The "unicode_escape" encoding copes with the symbols in the text
text = []
for name in files:
    with open(name,"r",encoding="unicode_escape") as f:
        text.append(f.read())
corpus['text'] = text

### Label Encoding

This task is strictly multi-Class classification. See https://scikit-learn.org/stable/modules/multiclass.html

We want our articles to have one label (target). They are either business, or sport etc. but cannot be both. Having multiple possible labels would be a multi-LABEL classification problem. 

The type of problem determines how we need to encode our labels. 

See https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html

and https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

We will be doing Label Encoding, as we want to turn our string labels into numerical values which go from 0 to n-1, where n is the number of unique labels.
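To make the difference between the two encoders concrete, here is a small sketch on an invented list of topics (the `topics` list and the `le_demo`/`lb_demo` names are assumptions for illustration):

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Invented labels, only for demonstration
topics = ["sport", "business", "tech", "business"]

# LabelEncoder: one integer per class, assigned in sorted (alphabetical) order
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(topics)

# inverse_transform maps the integers back to the original strings
decoded = le_demo.inverse_transform(encoded)

# LabelBinarizer: one indicator column per class (a one-vs-rest layout)
lb_demo = LabelBinarizer()
binarized = lb_demo.fit_transform(topics)

print(encoded)
print(binarized)
```

The encoder produces a single column of integers, which is what a multi-class estimator expects as its target, while the binarizer produces one 0/1 column per class.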

In [4]:
# Create Encoded labels
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(corpus['topic'].unique())
corpus['topicLabel'] = le.transform(corpus['topic'])
# Is this overkill? Maybe. But it makes me happy.

In [5]:
# Let's check out our corpus DataFrame
corpus

Unnamed: 0,filePath,topic,fileName,text,topicLabel
0,data/News Articles/business/052.txt,business,052.txt,Italy to get economic action plan\n\nItalian P...,0
1,data/News Articles/business/019.txt,business,019.txt,India widens access to telecoms\n\nIndia has r...,0
2,data/News Articles/business/260.txt,business,260.txt,Asia shares defy post-quake gloom\n\nIndonesia...,0
3,data/News Articles/business/007.txt,business,007.txt,Jobs growth still slow in the US\n\nThe US cre...,0
4,data/News Articles/business/042.txt,business,042.txt,UK Coal plunges into deeper loss\n\nShares in ...,0
...,...,...,...,...,...
2220,data/News Articles/entertainment/335.txt,entertainment,335.txt,De Niro film leads US box office\n\nFilm star ...,1
2221,data/News Articles/entertainment/006.txt,entertainment,006.txt,Bennett play takes theatre prizes\n\nThe Histo...,1
2222,data/News Articles/entertainment/187.txt,entertainment,187.txt,Double eviction from Big Brother\n\nModel Capr...,1
2223,data/News Articles/entertainment/034.txt,entertainment,034.txt,Vera Drake scoops film award\n\nOscar hopefuls...,1


### Shuffle, Stratify, Test, Train, Split 

So, unlike Version 1.0, here we'll be following a more well-established procedure and splitting out our test dataset early. We'll also be doing some stratification, to ensure our train and test datasets are split proportionally.

This isn't going to help our ML model that much if our data topics have massively different populations, but I know already they are just about okay. We can shelve the outlier cases for another project. 

(I want to be a bit careful with my wording here. I have read that if the population counts are very different, your ML model of choice will struggle and you may need to do some sort of rebalancing, for example resampling or class weighting. Google "imbalanced classes". This is a different problem to feature scaling.) See also [this link](https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/)

In [6]:
corpus['topic'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: topic, dtype: int64
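The counts above are fairly even, so nothing special is needed here. If they were not, one option (among several) is class weighting. The sketch below only illustrates how scikit-learn's "balanced" weights would come out for these counts; it is not a step this notebook actually applies, and the `counts` dictionary is simply copied from the output above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Article counts per topic, copied from the value_counts() output
counts = {"business": 510, "entertainment": 386, "politics": 417,
          "sport": 511, "tech": 401}
classes = np.array(list(counts))

# Rebuild a label vector with the same class frequencies
y = np.repeat(classes, list(counts.values()))

# 'balanced' weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_by_topic = dict(zip(classes, weights))
print(weight_by_topic)
```

Rarer topics receive weights above 1 and more common ones below 1. Many estimators, `SVC` included, accept `class_weight="balanced"` directly, which applies the same formula internally.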

In [7]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
for train_index, test_index in split.split(corpus, corpus['topicLabel']):
    strat_train_set = corpus.loc[train_index]
    strat_test_set = corpus.loc[test_index]
# n_splits=10 (which is also the default value) generates 10 different shuffled splits.
# Each pass through the loop overwrites the previous one, so only the last split is kept.
# Somewhat overkill, and might quickly blow up for bigger datasets

In [8]:
strat_train_set['topic'].value_counts()/len(strat_train_set)

sport            0.229775
business         0.229213
politics         0.187079
tech             0.180337
entertainment    0.173596
Name: topic, dtype: float64

In [9]:
corpus['topic'].value_counts()/len(corpus)

sport            0.229663
business         0.229213
politics         0.187416
tech             0.180225
entertainment    0.173483
Name: topic, dtype: float64
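The two sets of proportions above agree to about three decimal places, which is what stratification is meant to achieve. When only a single split is needed, `train_test_split` with its `stratify` argument is a shorter route to the same kind of result. A minimal sketch on an invented frame (the `toy` data is an assumption for illustration, not the BBC corpus):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an 80/20 class mix; the contents are invented
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(50)],
    "topicLabel": [0] * 40 + [1] * 10,
})

# stratify= keeps the class mix the same in both pieces
train_set, test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["topicLabel"], random_state=42
)

# Class proportions in each piece should mirror the full frame
train_props = train_set["topicLabel"].value_counts(normalize=True)
test_props = test_set["topicLabel"].value_counts(normalize=True)
```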

In [10]:
strat_train_set

Unnamed: 0,filePath,topic,fileName,text,topicLabel
56,data/News Articles/business/391.txt,business,391.txt,Yukos heading back to US courts\n\nRussian oil...,0
558,data/News Articles/tech/022.txt,tech,022.txt,Sun offers processing by the hour\n\nSun Micro...,4
42,data/News Articles/business/072.txt,business,072.txt,S Korean consumers spending again\n\nSouth Kor...,0
485,data/News Articles/business/408.txt,business,408.txt,South African car demand surges\n\nCar manufac...,0
879,data/News Articles/tech/117.txt,tech,117.txt,Joke e-mail virus tricks users\n\nA virus that...,4
...,...,...,...,...,...
253,data/News Articles/business/270.txt,business,270.txt,Oil prices reach three-month low\n\nOil prices...,0
778,data/News Articles/tech/310.txt,tech,310.txt,Latest Opera browser gets vocal\n\nNet browser...,4
174,data/News Articles/business/179.txt,business,179.txt,Irish duo could block Man Utd bid\n\nIrishmen ...,0
1358,data/News Articles/sport/396.txt,sport,396.txt,Wales silent on Grand Slam talk\n\nRhys Willia...,3


### Transformation Pipeline

It is more consistent in bigger projects to ensure your transformations occur within a sklearn pipeline. This allows for more interoperability with sklearn tools, and allows for a more logical understanding of the steps taken. Or at least I think so.

Anyway, here I am going ahead and creating a pipeline which carries out the **TfidfVectorization**, giving myself the flexibility _later_ to change properties such as:
* stop words
* ngram range
* other stuff?

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
textPreprocess_pipeline = Pipeline([
     ("tfidfVec", TfidfVectorizer(stop_words = 'english',ngram_range = (1,1)))
]) # for now I'm setting some keyword parameters to be explicit 

# These lines below are not used, but are here just to demonstrate how a parameter matrix is set up ...I think.
# Will need to be deleted/moved to the correct section at a later iteration. 
tf_params = {
    'tfidfVec__ngram_range':((1,1),(1,2)),
    'tfidfVec__stop_words':(None,'english')
}


### An Open Question

Should this **TfidfVectorization** step (and any later alterations to the text) be part of a discrete pre-processing step that is outside of GridSearch/Cross_val/RandomSearch? Or should it be part of one pipeline, including the model? I honestly don't know.
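One common recommendation is to keep the vectorizer inside the same pipeline as the model, so that during cross-validation the vocabulary and IDF weights are fitted only on each fold's training portion and nothing leaks in from the held-out portion. It also lets a search tune the vectorizer's parameters, using a grid shaped like `tf_params` above. A minimal sketch of that setup, assuming a tiny invented corpus (`docs`, `labels`, and `full_pipeline` are illustrative names, not part of this notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny invented corpus, only here to make the sketch runnable
docs = [
    "shares fell as the bank cut rates", "the bank reported record profits",
    "oil prices hit the stock market", "investors sold shares in the bank",
    "the striker scored in the final match", "the team won the cup final",
    "the coach picked a new striker", "fans cheered as the team scored",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

full_pipeline = Pipeline([
    ("tfidfVec", TfidfVectorizer()),
    ("svm", SVC()),
])

# Keys follow the <step name>__<parameter> convention
param_grid = {
    "tfidfVec__ngram_range": [(1, 1), (1, 2)],
    "tfidfVec__stop_words": [None, "english"],
}

# Every candidate refits the vectorizer on the training folds only
search = GridSearchCV(full_pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)
best_params = search.best_params_
print(best_params)
```

On a corpus this small the winning parameters mean nothing; the point is only the mechanics of searching over vectorizer settings and model together.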

In [27]:
# Let's try out this pipeline first, before we add more to it

tfidfTrain = textPreprocess_pipeline.fit_transform(strat_train_set['text'])

In [28]:
# Extracting components from a pipeline needs a little more unpacking, but it is possible
tfidf_features = textPreprocess_pipeline.named_steps['tfidfVec'].get_feature_names_out()

In [29]:
## TO DO - Beautify output for demonstration
pd.DataFrame(data = tfidfTrain.toarray(),columns = tfidf_features)
#textPreprocess_pipeline.named_steps['tfidfVec'].get_feature_names_out()

Unnamed: 0,00,000,0001,000bn,000m,000s,000th,001,001and,001st,...,zooms,zooropa,zornotza,zorro,zubair,zuluaga,zurich,zutons,zvonareva,zvyagintsev
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.023236,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.042693,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.023043,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1775,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1776,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1777,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1778,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### The Full Model Pipeline

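We now bring in a model. For a simple first-pass classification we will use SVC (support vector classifier). Note that `SVC()` defaults to an RBF kernel, so as configured below it is not strictly a linear model; `SVC(kernel='linear')` or `LinearSVC` would be the linear variants.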

In [30]:
from sklearn.svm import SVC
svm_clf = SVC()

In [31]:
# Construct our full pipeline
model_pipeline = Pipeline([
    ("tfidfVec", TfidfVectorizer(stop_words = 'english',ngram_range = (1,1))),
    ('svm',SVC())
])

### Cross Val Score 

The cross-validation method is a common way to 'cheat' with your data and create several train/validation subsets using slices of only the training data. While this is done at the expense of computing power and time, it gives a more reliable estimate of how well your model generalises, since every score comes from data the model was not fitted on, and the spread between scores shows how sensitive the model is to the particular split. (I need to work on this wording, it's not the most convincing. Why is it more reliable? Less overfitting? More tests = more trustworthiness in final outputs?)
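To see what happens under the hood, the loop below reproduces by hand roughly what `cross_val_score` does for a classifier: split, fit a fresh copy on the training part, score on the held-out part. This is a sketch on an invented corpus (`docs`, `labels`, and `pipe` are illustrative assumptions), using 2 folds only because the toy data is so small:

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny invented corpus, only here to make the sketch runnable
docs = np.array([
    "shares fell as the bank cut rates", "the bank reported record profits",
    "oil prices hit the stock market", "investors sold shares in the bank",
    "the striker scored in the final match", "the team won the cup final",
    "the coach picked a new striker", "fans cheered as the team scored",
])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

pipe = Pipeline([("tfidfVec", TfidfVectorizer()), ("svm", SVC())])

fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=2).split(docs, labels):
    fold_model = clone(pipe)  # fresh, unfitted copy for every fold
    fold_model.fit(docs[train_idx], labels[train_idx])
    # Accuracy on the slice this copy has never seen
    fold_scores.append(fold_model.score(docs[val_idx], labels[val_idx]))

# The built-in helper should give the same numbers for the same splitter
reference = cross_val_score(pipe, docs, labels, cv=2)
```

No single model comes out of this process; the scores describe how the pipeline behaves, and a final model still has to be fitted on the full training set afterwards.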

In [32]:
from sklearn.model_selection import cross_val_score

In [33]:
tfSVM_cvs = cross_val_score(model_pipeline,
                            X = strat_train_set['text'], 
                            y = strat_train_set['topicLabel'],
                            cv = 5, # i.e. 5 cross val splits
                            scoring = 'accuracy' # default for SVC
                           )
# Warning -- Accuracy is typically not a good metric for classification
# will be writing more on this later

In [34]:
tfSVM_cvs
print("For {} cross-val splits, the model accuracy outputs:\n".format(len(tfSVM_cvs)))
print("{}\n".format(tfSVM_cvs))

For 5 cross-val splits, the model accuracy outputs:

[0.98314607 0.96910112 0.9747191  0.96910112 0.98314607]



### Classification Accuracy

When it comes to classification problems, "accuracy" is often not the most reliable metric to judge your model. Admittedly, the raw outputs above are pretty good, and high enough to be better than a "dumb" estimator that simply guesses the topic.

Instead, we'll look at measures such as the:
* **precision**
* **recall** 
* and the **confusion matrix**

These can be created using **cross_val_predict**. (I need a better explanation here as to why we use cross_val_predict. The accuracy gives us an idea of how well the model does with respect to the predicted label compared to the actual label. This seems fine, but it doesn't give us a view on how many of those cases are false positives or false negatives. Of course, if the accuracy is 100%, this wouldn't really be an issue. However, knowing the rates of false positives and false negatives gives us a better insight into how the model works, and lets us make better choices in tuning our model for the desired output based on our use-case.)
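As a rough illustration of how these pieces fit together: `cross_val_predict` returns one out-of-fold prediction per document, and those predictions can then be fed to the metric functions. The sketch below uses an invented two-topic corpus (`docs`, `labels`, and `pipe` are assumptions for illustration), so the numbers say nothing about the BBC data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny invented corpus, only here to make the sketch runnable
docs = [
    "shares fell as the bank cut rates", "the bank reported record profits",
    "oil prices hit the stock market", "investors sold shares in the bank",
    "the striker scored in the final match", "the team won the cup final",
    "the coach picked a new striker", "fans cheered as the team scored",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("tfidfVec", TfidfVectorizer()), ("svm", SVC())])

# Each document is predicted by a model that never saw it during fitting
predicted = cross_val_predict(pipe, docs, labels, cv=2)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(labels, predicted)

# average=None returns one score per class rather than a single number
precision = precision_score(labels, predicted, average=None, zero_division=0)
recall = recall_score(labels, predicted, average=None, zero_division=0)
print(cm)
```

Off-diagonal entries in the confusion matrix show which topics get mistaken for which, something a single accuracy figure cannot reveal.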

In [34]:
from sklearn.model_selection import cross_val_predict

In [47]:
tfSVM_cvs

array([0.97979798, 0.96795953, 0.96795953])