# Supervised Topic Classification 

## Version 2.0

Welcome to my Supervised Topic Classification learning experiment, using the BBC News Kaggle dataset. This series of notebooks is meant to serve me, and potentially others, as I learn bits of NLP relating to topic classification. Each notebook is split into versions to capture my design and learning progress, and to demonstrate in situ the mistakes and poor judgement calls I make.

At the end of each iteration, I will leave some notes discussing how I will (hopefully) improve my work in future versions.

See data source here https://www.kaggle.com/pariza/bbc-news-summary 

Last accessed on 2022-Mar-12.



In [14]:
# Import some standard dependencies
import numpy as np
import pandas as pd

In [15]:
# If you're cloning the GitLab repo, this cell won't work for you unless you save the data in a similar directory.
# The data source is linked in the notebook header text.

import glob
files = glob.glob('data/News Articles/**/*.txt',
                  recursive=True)

### The Data

The data repo contains **five** folders, {business, entertainment, politics, sport, tech}, each containing `.txt` files, one per article.

I know already, because I looked at *some* of them, that these files are pretty clean. There is no header/footer text to remove or other structural features, but there are a lot of symbols for currency (£, $, Eur), percentages (%) etc. So these will need to be handled. 
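One possible way to handle those symbols is to swap each one for a word token before vectorising. This is only a minimal sketch: the token names (`gbp`, `usd`, `eur`, `percent`), the `replace_symbols` helper, and the use of the `€` character for "Eur" are all my own assumptions, not something taken from the dataset.

```python
import re

# Hypothetical symbol-to-token mapping; the token choices are an assumption
SYMBOL_TOKENS = {"£": " gbp ", "$": " usd ", "€": " eur ", "%": " percent "}
pattern = re.compile("|".join(re.escape(s) for s in SYMBOL_TOKENS))

def replace_symbols(text):
    """Swap each symbol for its word token, then collapse repeated whitespace."""
    replaced = pattern.sub(lambda m: SYMBOL_TOKENS[m.group(0)], text)
    return re.sub(r"\s+", " ", replaced).strip()

sample = "Shares fell 5% to £1.20, or about $2.30."
print(replace_symbols(sample))
# Shares fell 5 percent to gbp 1.20, or about usd 2.30.
```

Whether tokens like these help or hurt the classifier is something to test rather than assume; dropping the symbols entirely is an equally plausible choice.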

**Things to consider**
* The variation in article length
* The number of articles per 'label'
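
Both points can be checked with a quick `groupby`. The sketch below uses a tiny made-up DataFrame with the same `topic` and `text` columns as the corpus; the rows and the `nWords` column name are hypothetical, and whitespace word counts are only a rough proxy for article length.

```python
import pandas as pd

# Toy stand-in for the corpus DataFrame (hypothetical rows)
toy = pd.DataFrame({
    "topic": ["sport", "sport", "tech", "business"],
    "text": [
        "Short match report",
        "A much longer match report with extra detail",
        "New phone released today",
        "Shares fall",
    ],
})

# Word count per article, using simple whitespace tokenisation
toy["nWords"] = toy["text"].str.split().str.len()

# Articles per label, plus the spread of article length within each label
summary = toy.groupby("topic")["nWords"].agg(["count", "min", "median", "max"])
print(summary)
```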

In [16]:
# We create data 'labels' by using the name of the folder the article is saved in.
# Note: this assumes '/' path separators and the 'data/News Articles/<topic>/<file>' layout.
labels, fileNames = zip(*[s.split('/')[2:] for s in files])
# Then make a pandas.DataFrame to store the data
corpus = pd.DataFrame(data = {
    "filePath":files, "topic":labels, "fileName":fileNames})

# Open and read each file, save to list, and append to the corpus pd.DataFrame
# "unicode_escape" deals with the symbols in the text
text = []
for name in files:
    with open(name,"r",encoding="unicode_escape") as f:
        text.append(f.read())
corpus['text'] = text
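
The label extraction above relies on splitting on `'/'` and on the folder sitting at a fixed depth. A more portable alternative, sketched here with `pathlib` and two made-up example paths in the same layout, reads the folder and file name directly from each path. The variable names are my own and this is not what the cell above does.

```python
from pathlib import Path

# Hypothetical example paths in the same layout as the dataset
example_files = [
    "data/News Articles/business/052.txt",
    "data/News Articles/sport/001.txt",
]

# parent.name is the containing folder (the label); name is the file name
example_labels = [Path(p).parent.name for p in example_files]
example_names = [Path(p).name for p in example_files]
print(example_labels, example_names)
```

This keeps working if the data directory is moved to a different depth, since it only looks at the last two path components.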

### Label Encoding

This task is strictly multi-Class classification. See https://scikit-learn.org/stable/modules/multiclass.html

We want our articles to have one label (target). They are either business, or sport etc. but cannot be both. Having multiple possible labels would be a multi-LABEL classification problem. 

The type of problem determines how we need to encode our labels. 

See https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html

and https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

We will be doing label encoding, as we want to turn our string labels into numerical values which go from 0 to n-1, where n is the number of unique labels.
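
To make the difference between the two encoders concrete, here is a small sketch on a toy list of labels (the list and the `le_demo`/`lb_demo` names are made up for illustration). `LabelEncoder` gives one integer per class, while `LabelBinarizer` gives one indicator column per class.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

topics = ["business", "sport", "tech", "business"]  # toy labels

# LabelEncoder: one integer per class, assigned in sorted order of class name
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(topics)
print(encoded)           # [0 1 2 0]
print(le_demo.classes_)  # ['business' 'sport' 'tech']

# LabelBinarizer: one 0/1 indicator column per class
lb_demo = LabelBinarizer()
binarized = lb_demo.fit_transform(topics)
print(binarized.shape)   # (4, 3)

# Integer codes can be mapped back to the original strings
decoded = le_demo.inverse_transform(encoded)
print(list(decoded))
```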

In [18]:
# Create Encoded labels
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(corpus['topic'].unique())
corpus['topicLabel'] = le.transform(corpus['topic'])
# Is this overkill? Maybe. But it makes me happy.

In [19]:
# Let's check out our corpus DataFrame
corpus

index,filePath,topic,fileName,text,topicLabel
0,data/News Articles/business/052.txt,business,052.txt,Italy to get economic action plan\n\nItalian P...,0
1,data/News Articles/business/019.txt,business,019.txt,India widens access to telecoms\n\nIndia has r...,0
2,data/News Articles/business/260.txt,business,260.txt,Asia shares defy post-quake gloom\n\nIndonesia...,0
3,data/News Articles/business/007.txt,business,007.txt,Jobs growth still slow in the US\n\nThe US cre...,0
4,data/News Articles/business/042.txt,business,042.txt,UK Coal plunges into deeper loss\n\nShares in ...,0
...,...,...,...,...,...
2220,data/News Articles/entertainment/335.txt,entertainment,335.txt,De Niro film leads US box office\n\nFilm star ...,1
2221,data/News Articles/entertainment/006.txt,entertainment,006.txt,Bennett play takes theatre prizes\n\nThe Histo...,1
2222,data/News Articles/entertainment/187.txt,entertainment,187.txt,Double eviction from Big Brother\n\nModel Capr...,1
2223,data/News Articles/entertainment/034.txt,entertainment,034.txt,Vera Drake scoops film award\n\nOscar hopefuls...,1


### Shuffle, Stratify, Test, Train, Split 

So, unlike Version 1.0, here we'll be following a more well-established procedure and splitting out our test dataset early. We'll also be doing some stratification, to ensure our train and test datasets keep the same topic proportions as the full corpus.

Stratification isn't going to help that much if our classes are massively imbalanced, but I know already that the class sizes are just about okay (see the counts below). We can shelve the outlier cases for another project.

In [20]:
corpus['topic'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: topic, dtype: int64

In [39]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
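# split() yields positional indices, so we index the rows with .iloc
for train_index, test_index in split.split(corpus, corpus['topicLabel']):
    strat_train_set = corpus.iloc[train_index]
    strat_test_set = corpus.iloc[test_index]
# n_splits=10 (actually the default) generates 10 independent shuffled
# splits, but each pass overwrites the previous one, so only the last
# split is kept. Somewhat overkill (n_splits=1 would do), and might
# quickly blow up for bigger datasets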
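
Since only one train/test split is actually needed, a simpler route is `train_test_split` with its `stratify` argument. This is a sketch on a made-up 100-row DataFrame (60 `sport`, 40 `tech`), not the cell above, and the variable names are my own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the corpus: 60 'sport' rows and 40 'tech' rows (hypothetical)
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(100)],
    "topic": ["sport"] * 60 + ["tech"] * 40,
})

# A single shuffled split that preserves the topic proportions
train_set, test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["topic"], random_state=42
)
print(len(train_set), len(test_set))
print(test_set["topic"].value_counts())
```

With a 60/40 class balance and a 20% test size, the test set should hold 12 `sport` and 8 `tech` rows, mirroring the full data.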

In [40]:
strat_train_set['topic'].value_counts()/len(strat_train_set)

sport            0.229775
business         0.229213
politics         0.187079
tech             0.180337
entertainment    0.173596
Name: topic, dtype: float64

In [41]:
corpus['topic'].value_counts()/len(corpus)

sport            0.229663
business         0.229213
politics         0.187416
tech             0.180225
entertainment    0.173483
Name: topic, dtype: float64
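
The two outputs above are easier to compare side by side. The sketch below simply copies the printed proportions into two Series and computes the absolute difference per topic; the `comparison` and `absDiff` names are my own.

```python
import pandas as pd

# Proportions copied from the two outputs above
train_props = pd.Series({
    "sport": 0.229775, "business": 0.229213, "politics": 0.187079,
    "tech": 0.180337, "entertainment": 0.173596,
})
overall_props = pd.Series({
    "sport": 0.229663, "business": 0.229213, "politics": 0.187416,
    "tech": 0.180225, "entertainment": 0.173483,
})

comparison = pd.DataFrame({"overall": overall_props, "train": train_props})
comparison["absDiff"] = (comparison["train"] - comparison["overall"]).abs()
print(comparison)
```

Every topic's share in the training set is within 0.001 of its share in the full corpus, which is what we'd hope for from a stratified split.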