# Supervised Topic Classification 

## Version 1.0
(see **warning** below)

Supervised Topic classification using the BBC News Kaggle dataset.

See data source here https://www.kaggle.com/pariza/bbc-news-summary 

Last accessed on 2022-Mar-12.

**Warning** 
This is a notebook created for my own learning, so it will be rough and likely full of mistakes. I will try to keep some of the more interesting mistakes, poor judgements, and bad choices in, so they can be used as learning points for myself and others. Over time I might split this notebook into different versions, so there is not too much scrolling to reach the most up-to-date "correct" implementation. At the end of each version, I'll add a "To do!" to bookmark the end of the current version and describe how I'll try to improve it.


### What is this Notebook?

This Notebook has been created as a personal learning exercise.

It focuses on using basic NLP tools to carry out topic 
classification (and maybe topic modelling)
with a labelled dataset of BBC news articles. 


In [1]:
import numpy as np
import pandas as pd 

In [2]:
""" 
The raw data is .gitignore'd. 
It is available from the source at the beginning of this notebook.
"""
import glob
files = glob.glob('data/News Articles/**/*.txt',
                  recursive=True)

### The Data
Each file is a .txt.

We can generate our labelled column for the dataset by using the name of the 
parent folder of each .txt file.

Then we'll build up a pandas DataFrame. 


In [3]:
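# Paths look like 'data/News Articles/<label>/<file>.txt', so the last two
# components are the label and the file name (assumes '/' as the separator).
labels, fileNames = zip(*[s.split('/')[2:] for s in files])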
corpus = pd.DataFrame(data = {
    "filePath":files, "label":labels, "fileName":fileNames})

In [4]:
# Open and read each file, save to list, and append to the
# corpus pd.DataFrame
text = []
for name in files:
    with open(name,"r",encoding="unicode_escape") as f:
        text.append(f.read())
corpus['text'] = text

### Unicode Nightmares

Regarding the `encoding="unicode_escape"` argument...
There seem to be some mixed encodings in the raw BBC 
news articles. I've been a bit lazy and just used this codec to
avoid decoding errors.

However, this might come back to bite me when I tokenize 
my text and see dodgy-looking tokens, because `unicode_escape` 
also interprets any backslash sequences in the text...

There is the question: did I need to bother with this? 
Would sklearn have handled it?
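
One possible alternative to `unicode_escape` is to read the raw bytes and try a short list of encodings in turn. The sketch below assumes the files are mostly UTF-8 with the odd Latin-1 byte, which has not been checked against every article. `read_text_with_fallback` is a hypothetical helper, not part of any library.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: return the first successful decode of the file."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 can decode any byte, so this line is only reached if the
    # caller passes a list of encodings without such a catch-all.
    return raw.decode("utf-8", errors="replace")


# Tiny demonstration on a throwaway file containing a Latin-1 pound sign.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wb") as f:
        f.write(b"Profits rose by \xa3100m")
    demo_text = read_text_with_fallback(demo_path)
```

On the sklearn question: `TfidfVectorizer` has `encoding` and `decode_error` parameters, which apply when it is given bytes or file names (`input='filename'`) rather than already-decoded strings.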

## Category (label) to numbers

While our labelled dataset category is nice, most ML processes prefer numbers. There are a few ways we could create a numeric equivalent of the `corpus['label']` column. Remember, this is just the label (target) column for our supervised model. We're not editing a feature.

{Do I even really need to do this?}

As an aside, if we were really playing around and turning a categorical feature into something more ML-friendly, then we might want to consider using a **OneHotEncoder**.
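
For the target column specifically, sklearn's `LabelEncoder` is the purpose-built tool. Here is a minimal sketch on a made-up list of labels (not the real corpus), showing how the integer codes map back to category names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for corpus['label'].
labels = ["sport", "business", "tech", "business", "politics", "entertainment"]

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# classes_ holds the sorted category names; the position is the integer code.
code_to_label = dict(enumerate(label_encoder.classes_))

# inverse_transform recovers the original strings from the codes.
decoded = list(label_encoder.inverse_transform(encoded))
```

sklearn classifiers also accept string labels directly, so encoding the target is optional here.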

In [1]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

# OrdinalEncoder expects a 2D input, hence the double brackets.
corpus_label_encoded = ordinal_encoder.fit_transform(corpus[['label']])
corpus['cat'] = corpus_label_encoded.ravel()
# I've now learnt that what I'm doing here is MULTI-CLASS SUPERVISED CLASSIFICATION
# (each article has exactly one label), not multi-label.

## TF-IDF & Test-Train-Split

In this example, I am going to be pretty fast and loose (a.k.a. not very scientific) and perform the TF-IDF BEFORE the test-train split. This is, I believe, generally ill-advised, as it leaks information between the training and test sets: the vocabulary and IDF weights are computed using the test documents too. The resulting scores are then an over-optimistic guide to how the model will cope with real-world text, which will contain tokens/terms that are not in the original training data.

See step three https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50

**So why am I doing it here?**

I want to create a first-pass model, even if it does have some flaws, and then I will correct this later down the line. So, just bear with me. This is a learning journey!

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', ngram_range=(1,1))
corpus_tfidf = vectorizer.fit_transform(corpus['text'])
vectorizer.get_feature_names_out()

array(['00', '000', '0001', ..., 'zutons', 'zvonareva', 'zvyagintsev'],
      dtype=object)
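
For contrast, a leak-free ordering would split first, fit the vectorizer on the training texts only, and then reuse it on the test texts. This is a minimal sketch on a toy corpus. The texts, labels and variable names are made up for illustration and are not the BBC data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for corpus['text'] and corpus['label'].
texts = [
    "shares fell as the bank cut rates",
    "the striker scored twice in the final",
    "profits rose at the retailer",
    "the team won the cup match",
    "the bank reported higher profits",
    "the coach praised the striker",
]
labels = ["business", "sport", "business", "sport", "business", "sport"]

# Split the raw texts BEFORE any vectorizing.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=2, stratify=labels, random_state=42)

vectorizer = TfidfVectorizer(stop_words="english")
# Vocabulary and IDF weights are learnt from the training texts only.
X_train = vectorizer.fit_transform(train_texts)
# The test texts reuse them; words unseen in training are simply ignored.
X_test = vectorizer.transform(test_texts)
```

Both matrices end up with the same columns, because the test set is projected onto the vocabulary learnt from the training set.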

### A few things to note...

Removal of **stop words** is probably fine for this program, and will likely have a net positive effect on the model. (I should try this out!) But in more advanced models it is generally not carried out, because: 

* The definition of stop words is different between Python libraries and can even change over time within the same library. 
* You do lose information. Though as we're doing the tf-idf anyway, it is relatively minor. 
* tf-idf already balances out the frequent words. So why bother?

I've also selected a 1-token **ngram** range. (I should also code up some tests for higher ranges.) I'd recommend looking up n-grams if you are not familiar with them. In this use case, I would likely benefit from having 2-grams as well as 1-grams, but that does blow up the size of the vocabulary, and hence the calculation, quite quickly. 

## Test-Train-Split

In [None]:
# Now let's check out the data, making sure there's nothing dodgy going on 
# between our labelled categories

In [8]:
corpus['label'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: label, dtype: int64

The categories are slightly different sizes. 

For the sake of showing
the process, I will make sure our test-train-split pulls a proportional 
selection from each category, 
a.k.a. **stratified sampling**.

We've also not worried too much about assigning a unique identifier to each row. If we were perhaps planning to make this a production-level model where new data is added over time, we would need to fix this. New model runs with new data would want to avoid mixing the old test and the old train data. I.e. you'd want to ensure your future testing data contains only the old test data + a proportion of the new data. 
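
One common way to get such a stable split is to hash a unique identifier for each row and send the row to the test set when the hash falls in the bottom 20% of the hash range. The sketch below uses hypothetical identifiers built from label and file name. `in_test_set` is an illustrative helper, not a library function.

```python
from zlib import crc32


def in_test_set(identifier, test_ratio=0.2):
    """Hypothetical helper: True if this identifier hashes into the test bucket."""
    # crc32 returns the same unsigned 32-bit integer for the same bytes on
    # every run, so a row's assignment never changes as more data arrives.
    return crc32(identifier.encode("utf-8")) < test_ratio * 2**32


# Made-up identifiers: an "old" batch and a "new" batch added later.
old_ids = [f"business/{i:03d}.txt" for i in range(1, 201)]
new_ids = [f"sport/{i:03d}.txt" for i in range(1, 201)]

old_test = {i for i in old_ids if in_test_set(i)}
combined_test = {i for i in old_ids + new_ids if in_test_set(i)}
```

Every old test row stays in the test set after the new batch is added, and no old training row moves across. Note that this split is not stratified, so the category proportions are only approximately preserved.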

In [9]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# split() yields positional indices, so use .iloc rather than .loc
for train_index, test_index in split.split(corpus, corpus['label']):
    strat_train_set = corpus.iloc[train_index]
    strat_test_set = corpus.iloc[test_index]

We can check that this generates (almost) the same proportions as the full dataset:

In [10]:
strat_train_set['label'].value_counts()/len(strat_train_set)

sport            0.229775
business         0.229213
politics         0.187079
tech             0.180337
entertainment    0.173596
Name: label, dtype: float64

In [11]:
corpus['label'].value_counts()/len(corpus)

sport            0.229663
business         0.229213
politics         0.187416
tech             0.180225
entertainment    0.173483
Name: label, dtype: float64

## Fitting 

In [12]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import cross_val_score

# With 5 classes, LDA can use at most n_classes - 1 = 4 discriminant components.
lda = LDA(n_components=4)

In [17]:
# LDA needs a dense array, so convert only the training rows.
scores = cross_val_score(estimator=lda,
                         X=corpus_tfidf[train_index].toarray(),
                         y=strat_train_set['cat'],
                         cv=10)

In [18]:
scores

array([0.52808989, 0.3988764 , 0.53932584, 0.56741573, 0.51123596,
       0.38764045, 0.47752809, 0.46629213, 0.58426966, 0.58426966])

### Result Discussion
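I'm being a bit impatient, because I haven't evaluated the model on the test set yet, but I can say that the scores above (roughly 0.5 mean accuracy across the 10 folds) aren't great.

First, these scores *might* be normal for a first-pass supervised LDA model. The dimensionality reduction provided by LDA relies on the human labels, which may not be distinct enough, or there may just not be enough data to accurately capture the 5 categories. The latter seems plausible: the TF-IDF matrix has one column per vocabulary term, which is probably far more columns than the 1,780 training rows.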
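
A quick way to put the fold scores in context is to summarise them and compare against a naive baseline. The sketch below copies the ten scores printed above and uses the share of the largest category (sport, 511 of 2225 articles) as the "always guess the biggest class" baseline.

```python
from statistics import mean, stdev

# The ten cross-validation accuracies printed above.
scores = [0.52808989, 0.3988764, 0.53932584, 0.56741573, 0.51123596,
          0.38764045, 0.47752809, 0.46629213, 0.58426966, 0.58426966]

mean_score = mean(scores)
spread = stdev(scores)  # sample standard deviation across folds

# Accuracy of a model that always predicts the largest category.
majority_baseline = 511 / 2225

print(f"mean accuracy: {mean_score:.3f} (+/- {spread:.3f})")
print(f"majority-class baseline: {majority_baseline:.3f}")
```

So the model does beat the naive baseline, but the fold-to-fold spread is wide.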

## Future Version Improvements

* Correct the highly questionable processing step of TF-IDF
    * Turn into a sklearn pipeline
    * Apply to train and test data separately (fit_transform on train vs transform on test)
* Change the validation method to more standard classifier-validation techniques, e.g.:
    * Confusion Matrix
    * precision - recall plots (PR)
    * ROC plots
* Play around with classifier options such as:
    * One vs One Classifying
    * One vs All Classifying
* Code up a grid and randomized search to allow us to explore:
    * effects of n-grams
    * effects of stop word removal
* Explore some explainable AI methods to clarify how models work
    * Create textplots showing how different words sway the classification
* Explore how we can use UNSUPERVISED models to seed into a supervised classification model

* Blue Sky...
    * Get more labelled news data
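
As a starting point for several of the items above, here is a rough sketch of a pipeline with a grid search over n-grams and stop-word removal, followed by a confusion matrix on held-out data. It runs on a made-up two-category corpus, not the BBC data. `LogisticRegression` is swapped in for LDA purely because it accepts the sparse TF-IDF matrix directly. The step names `tfidf` and `clf` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Made-up toy corpus with two categories.
texts = [
    "shares fell as the bank cut interest rates",
    "profits rose at the high street retailer",
    "the bank reported higher quarterly profits",
    "oil prices pushed the market lower",
    "the retailer warned on falling sales",
    "investors sold shares after the rates decision",
    "the striker scored twice in the cup final",
    "the team won the league match",
    "the coach praised the young striker",
    "the champion lost the tennis final",
    "fans cheered as the team scored",
    "the match ended in a late goal",
]
labels = ["business"] * 6 + ["sport"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=4, stratify=labels, random_state=42)

# Inside a pipeline, TF-IDF is re-fitted on the training part of every
# cross-validation fold, so nothing leaks from the validation part.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__stop_words": [None, "english"],
}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, search.predict(X_test),
                      labels=["business", "sport"])
print(search.best_params_)
print(cm)
```

With so little data the chosen parameters mean nothing. The point is only the shape of the workflow.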
    
    