Kaggle Competition: BBC News Classification
A Jupyter notebook covering exploratory data analysis (EDA), model building and training with non-negative matrix factorization (NMF), and a comparison with supervised learning.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df_train = pd.read_csv("https://raw.githubusercontent.com/gt2onew/dtsa5510/main/week4/BBC%20News%20Train.csv")
     

In [3]:
df_test = pd.read_csv("https://raw.githubusercontent.com/gt2onew/dtsa5510/main/week4/BBC%20News%20Test.csv")


In [5]:
df_train.head()

   ArticleId  Text                                               Category
0       1833  worldcom ex-boss launches defence lawyers defe...  business
1        154  german business confidence slides german busin...  business
2       1101  bbc poll indicates economic gloom citizens in ...  business
3       1976  lifestyle governs mobile choice faster bett...     tech
4        917  enron bosses in $168m payout eighteen former e...  business


In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  1490 non-null   int64 
 1   Text       1490 non-null   object
 2   Category   1490 non-null   object
dtypes: int64(1), object(2)
memory usage: 35.1+ KB


In [7]:
df_train.Category.value_counts()

Category
sport            346
business         336
politics         274
entertainment    273
tech             261
Name: count, dtype: int64

In [8]:
df_train.Category.value_counts().sum()

1490

In [9]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 735 entries, 0 to 734
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  735 non-null    int64 
 1   Text       735 non-null    object
dtypes: int64(1), object(1)
memory usage: 11.6+ KB


In [10]:
df_test.head()

   ArticleId  Text
0       1018  qpr keeper day heads for preston queens park r...
1       1319  software watching while you work software that...
2       1138  d arcy injury adds to ireland woe gordon d arc...
3        459  india s reliance family feud heats up the ongo...
4       1020  boro suffer morrison injury blow middlesbrough...


Cleaning Data

In [12]:
import nltk

In [13]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/home/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /Users/home/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string

In [15]:
translator = str.maketrans('', '', string.punctuation)

In [16]:
def preprocessDataset(train_text):
  # Tokenize, drop English stop words, strip punctuation, then stem each token.
  stop_words = set(stopwords.words('english'))
  words = word_tokenize(train_text)
  words = [word for word in words if word not in stop_words]
  stemmer = PorterStemmer()
  stem_text = ' '.join([stemmer.stem(word.translate(translator)) for word in words])
  return stem_text


In [17]:
preprocessDataset(df_train.iloc[0].Text)

'worldcom exboss launch defenc lawyer defend former worldcom chief berni ebber batteri fraud charg call compani whistleblow first wit  cynthia cooper worldcom exhead intern account alert director irregular account practic us telecom giant 2002 warn led collaps firm follow discoveri  11bn  £57bn  account fraud  mr ebber plead guilti charg fraud conspiraci  prosecut lawyer argu mr ebber orchestr seri account trick worldcom order employe hide expens inflat revenu meet wall street earn estim  ms cooper run consult busi told juri new york wednesday extern auditor arthur andersen approv worldcom account earli 2001 2002 said andersen given green light procedur practic use worldcom  mr ebber lawyer said unawar fraud argu auditor alert problem  ms cooper also said sharehold meet mr ebber often pass technic question compani financ chief give brief answer  prosecut star wit former worldcom financi chief scott sullivan said mr ebber order account adjust firm tell hit book  howev ms cooper said mr 
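The double spaces in the cleaned text above (for example after "wit") appear where a pure-punctuation token was reduced to an empty string before joining. A minimal sketch of that effect, using a hypothetical token list rather than the real tokenizer output, and one way to drop the empty tokens:

```python
import string

# Same translation table as in the notebook: deletes all ASCII punctuation.
translator = str.maketrans('', '', string.punctuation)

# Hypothetical tokens, standing in for word_tokenize output after stop-word removal.
tokens = ["fraud", ",", "mr", "ebbers", ".", "ex-boss"]

stripped = [t.translate(translator) for t in tokens]

# Joining as-is keeps the empty strings, which show up as double spaces.
joined_raw = ' '.join(stripped)

# Filtering out empty strings first gives single-spaced text.
joined_clean = ' '.join(t for t in stripped if t)

print(repr(joined_raw))
print(repr(joined_clean))
```

The extra spaces do not change the TF-IDF features, because the default `TfidfVectorizer` token pattern only matches runs of word characters, so the filter is cosmetic.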

In [18]:
df_train.iloc[0].Text



In [19]:
df_train['TextCleaned'] = df_train['Text'].apply(preprocessDataset)

In [20]:
df_train['TextCleaned']

0       worldcom exboss launch defenc lawyer defend fo...
1       german busi confid slide german busi confid fe...
2       bbc poll indic econom gloom citizen major nati...
3       lifestyl govern mobil choic faster better funk...
4       enron boss  168m payout eighteen former enron ...
                              ...                        
1485    doubl evict big brother model capric holbi cit...
1486    dj doubl act revamp chart show dj duo jk joel ...
1487    weak dollar hit reuter revenu media group reut...
1488    appl ipod famili expand market appl expand ipo...
1489    santi worm make unwelcom visit thousand websit...
Name: TextCleaned, Length: 1490, dtype: object

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
vect = TfidfVectorizer()
X = vect.fit_transform(df_train.TextCleaned)
X

<1490x19645 sparse matrix of type '<class 'numpy.float64'>'
	with 218394 stored elements in Compressed Sparse Row format>

In [23]:
X.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02407628, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02546556, 0.        , ..., 0.        , 0.        ,
        0.        ]])
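Almost every entry printed above is zero. A quick check based only on the numbers in the sparse-matrix repr (1490 x 19645, 218394 stored elements), assuming 8 bytes per float64 value, shows how sparse the matrix is and how much memory the dense copy from `toarray()` needs:

```python
# Figures taken from the sparse-matrix repr shown earlier.
n_rows, n_cols, nnz = 1490, 19645, 218394

total_cells = n_rows * n_cols
density = nnz / total_cells          # fraction of non-zero entries
dense_mb = total_cells * 8 / 1e6     # float64 = 8 bytes per entry

print(f"cells: {total_cells}")
print(f"density: {density:.2%}")
print(f"dense array size: {dense_mb:.0f} MB")
```

Fewer than 1% of the cells are non-zero, so keeping `X` in sparse form is usually the better choice; `NMF` and `RandomForestClassifier` both accept the sparse matrix directly, as the later cells do.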

Now let's build and train our models

In [24]:
from sklearn.decomposition import NMF


In [25]:
model = NMF(n_components=5, random_state=42)
W = model.fit_transform(X)
H = model.components_

In [26]:
W.shape


(1490, 5)

In [27]:
H.shape


(5, 19645)

In [28]:
W

array([[6.92692014e-04, 4.07263255e-02, 1.24470770e-02, 5.30154711e-03,
        5.88158134e-02],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        2.14560228e-01],
       [1.47048247e-02, 3.34845561e-02, 2.23136449e-02, 0.00000000e+00,
        1.21383956e-01],
       ...,
       [8.27934717e-03, 0.00000000e+00, 1.19923608e-04, 8.25323565e-03,
        1.60985809e-01],
       [0.00000000e+00, 0.00000000e+00, 2.18646066e-01, 1.38263954e-02,
        2.41925381e-02],
       [0.00000000e+00, 0.00000000e+00, 1.15825008e-01, 0.00000000e+00,
        0.00000000e+00]])

In [29]:
predicted_label = W.argmax(axis=1)


In [30]:
predicted_label


array([4, 4, 4, ..., 4, 2, 2])

In [31]:
predicted_label.shape


(1490,)

In [32]:
type(predicted_label)


numpy.ndarray

In [33]:
label_to_categ = {}


In [39]:
for i in range(5):
  label_to_categ[i] = df_train.iloc[np.where(predicted_label == i)[0]]['Category'].value_counts().idxmax()
     

In [40]:
label_to_categ


{0: 'sport', 1: 'politics', 2: 'tech', 3: 'entertainment', 4: 'business'}

In [41]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [42]:
predicted_categ = np.vectorize(label_to_categ.get)(predicted_label)
predicted_categ

array(['business', 'business', 'business', ..., 'business', 'tech',
       'tech'], dtype='<U13')

In [43]:
accuracy_score(df_train.Category, predicted_categ)

0.9194630872483222

In [44]:
confusion_matrix(df_train.Category, predicted_categ)

array([[312,   1,  13,   0,  10],
       [ 10, 226,   6,   3,  28],
       [ 18,   0, 249,   3,   4],
       [  2,   4,   0, 340,   0],
       [  4,   6,   3,   5, 243]])
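The matrix above has no row or column names. When `labels` is not passed, `confusion_matrix` orders the classes by sorted label value, which here is alphabetical. The sketch below re-enters the reported matrix by hand, attaches those names, and derives the accuracy and per-class recall from it:

```python
import numpy as np
import pandas as pd

# Sorted label order used by confusion_matrix when labels=None.
labels = ["business", "entertainment", "politics", "sport", "tech"]

# Values copied from the confusion matrix output above
# (rows = true category, columns = predicted category).
cm = np.array([
    [312,   1,  13,   0,  10],
    [ 10, 226,   6,   3,  28],
    [ 18,   0, 249,   3,   4],
    [  2,   4,   0, 340,   0],
    [  4,   6,   3,   5, 243],
])

cm_df = pd.DataFrame(cm, index=labels, columns=labels)

# Accuracy is the diagonal (correct predictions) over all articles.
accuracy = np.trace(cm) / cm.sum()

# Recall per true category: correct predictions over the row total.
recall = pd.Series(np.diag(cm) / cm.sum(axis=1), index=labels)

print(cm_df)
print(f"accuracy: {accuracy:.4f}")
print(recall.round(3))
```

The row totals match the category counts from `value_counts()`, and the recall series shows that entertainment is the weakest category, with most of its errors predicted as tech.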

In [45]:
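# Clean and vectorize the test articles, then map each article's strongest NMF topic to a category.
X_test = vect.transform(df_test['Text'].apply(preprocessDataset))
test_pred_categ = np.vectorize(label_to_categ.get)(model.transform(X_test).argmax(axis=1))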
     

In [46]:
type(test_pred_categ)

numpy.ndarray

In [47]:
test_pred_categ.shape

(735,)

In [48]:
df_test['Category'] = test_pred_categ

In [49]:
df_test

     ArticleId  Text                                               Category
  0       1018  qpr keeper day heads for preston queens park r...  sport
  1       1319  software watching while you work software that...  tech
  2       1138  d arcy injury adds to ireland woe gordon d arc...  sport
  3        459  india s reliance family feud heats up the ongo...  business
  4       1020  boro suffer morrison injury blow middlesbrough...  sport
...        ...  ...                                                ...
730       1923  eu to probe alitalia state aid the european ...    business
731        373  u2 to play at grammy awards show irish rock ba...  entertainment
732       1704  sport betting rules in spotlight a group of mp...  tech
733        206  alfa romeos to get gm engines fiat is to sto...    business


In [50]:
df_test.drop(['Text'], axis = 1).to_csv('submission.csv', index = False)

In [51]:
model_kl = NMF(n_components=5, random_state=42, solver = 'mu', beta_loss = 'kullback-leibler')
W_kl = model_kl.fit_transform(X)
predicted_label_kl = np.squeeze(np.asarray(W_kl.argmax(axis=1)))
label_to_categ_kl = {}
for i in range(5):
  label_to_categ_kl[i] = df_train.iloc[np.where(predicted_label_kl == i)[0]]['Category'].value_counts().idxmax()
print(label_to_categ_kl)
test_pred_categ_kl = np.vectorize(label_to_categ_kl.get)(model_kl.transform(vect.transform(df_test['Text'].apply(preprocessDataset))).argmax(axis=1))
df_test['Category'] = test_pred_categ_kl
df_test.drop(['Text'], axis = 1).to_csv('submission_kl.csv', index = False)
     


{0: 'sport', 1: 'politics', 2: 'tech', 3: 'entertainment', 4: 'business'}


In [52]:
model_is = NMF(n_components=5, random_state=42, solver = 'mu', beta_loss = 'itakura-saito')
W_is = model_is.fit_transform(X.todense()+0.00000001)
predicted_label_is = np.squeeze(np.asarray(W_is.argmax(axis=1)))
label_to_categ_is = {}
for i in range(5):
  label_to_categ_is[i] = df_train.iloc[np.where(predicted_label_is == i)[0]]['Category'].value_counts().idxmax()
print(label_to_categ_is)
test_pred_categ_is = np.vectorize(label_to_categ_is.get)(model_is.transform(vect.transform(df_test['Text'].apply(preprocessDataset)).todense()+0.00000001).argmax(axis=1))
df_test['Category'] = test_pred_categ_is
df_test.drop(['Text'], axis = 1).to_csv('submission_is.csv', index = False)
     

TypeError: np.matrix is not supported. Please convert to a numpy array with np.asarray. For more information see: https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
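The TypeError above comes from `todense()`, which returns an `np.matrix` when called on a SciPy sparse matrix; scikit-learn only accepts plain arrays. A minimal sketch of the likely fix, using `toarray()` instead, is shown below on a small random matrix (`X_sparse` is a hypothetical stand-in for the TF-IDF matrix, and the sizes are arbitrary). The small constant is kept because the Itakura-Saito loss requires strictly positive input.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import NMF

# Hypothetical stand-in for the TF-IDF matrix: a small random sparse matrix.
X_sparse = sparse.random(30, 12, density=0.3, random_state=0, format="csr")

# todense() yields np.matrix (rejected by scikit-learn); toarray() yields ndarray.
print(type(X_sparse.todense()).__name__)
print(type(X_sparse.toarray()).__name__)

# Itakura-Saito needs strictly positive values, so shift the zeros slightly.
X_dense = X_sparse.toarray() + 1e-8

model_is = NMF(n_components=3, random_state=42, solver="mu",
               beta_loss="itakura-saito", max_iter=300)
W_is = model_is.fit_transform(X_dense)
print(W_is.shape)
```

In the notebook cell, the same change would apply to both `todense()` calls, on the training matrix and on the transformed test matrix. The run may emit a convergence warning, and how well this loss separates the categories is not something the sketch checks.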

In [53]:
for min_df in [10,50,100]:
  vect = TfidfVectorizer(min_df = min_df)
  X = vect.fit_transform(df_train.TextCleaned)
  model_kl = NMF(n_components=5, random_state=42, solver = 'mu', beta_loss = 'kullback-leibler')
  W_kl = model_kl.fit_transform(X)
  predicted_label_kl = np.squeeze(np.asarray(W_kl.argmax(axis=1)))
  label_to_categ_kl = {}
  for i in range(5):
    label_to_categ_kl[i] = df_train.iloc[np.where(predicted_label_kl == i)[0]]['Category'].value_counts().idxmax()
  print(label_to_categ_kl)
  test_pred_categ_kl = np.vectorize(label_to_categ_kl.get)(model_kl.transform(vect.transform(df_test['Text'].apply(preprocessDataset))).argmax(axis=1))
  df_test['Category'] = test_pred_categ_kl
  df_test.drop(['Text'], axis = 1).to_csv('min_df_'+str(min_df)+'_submission_kl.csv', index = False)
     


{0: 'sport', 1: 'politics', 2: 'business', 3: 'entertainment', 4: 'tech'}
{0: 'business', 1: 'sport', 2: 'business', 3: 'entertainment', 4: 'tech'}
{0: 'business', 1: 'sport', 2: 'entertainment', 3: 'politics', 4: 'tech'}


Let's compare with Supervised Learning


In [54]:
from sklearn.ensemble import RandomForestClassifier

In [55]:
rfc = RandomForestClassifier(random_state=42)

In [56]:
rfc.fit(X, df_train.Category)

In [57]:
rfc.score(X, df_train.Category)

1.0
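A score of 1.0 on the data the model was fitted on says little about how it generalizes. A common check is to hold out part of the labelled data and score on that. The sketch below shows the idea on synthetic features from `make_classification`, a stand-in for the TF-IDF matrix; every name ending in `_demo` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 5-class data standing in for the TF-IDF features and categories.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=40, n_informative=10,
    n_classes=5, random_state=42,
)

# Keep 20% of the labelled rows aside, stratified by class.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42,
)

rfc_demo = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = rfc_demo.score(X_tr, y_tr)    # accuracy on rows seen during fitting
val_acc = rfc_demo.score(X_val, y_val)    # accuracy on held-out rows

print(f"train accuracy: {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

The gap between the two numbers is a rough measure of overfitting. With the notebook's data, the same split could be applied to `X` and `df_train.Category` before calling `rfc.fit`.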

In [58]:
df_test['Category'] = rfc.predict(vect.transform(df_test['Text'].apply(preprocessDataset)))
df_test.drop(['Text'], axis = 1).to_csv('submission_rfc.csv', index = False)


In [59]:
for frac in [0.1,0.2,0.5]:
  df_train_sample = df_train.sample(frac=frac, random_state=42)
  df_train_sample['TextCleaned'] = df_train_sample['Text'].apply(preprocessDataset)
  #nmf kl
  vect = TfidfVectorizer()
  X = vect.fit_transform(df_train_sample.TextCleaned)
  model_kl = NMF(n_components=5, random_state=42, solver = 'mu', beta_loss = 'kullback-leibler')
  W_kl = model_kl.fit_transform(X)
  predicted_label_kl = np.squeeze(np.asarray(W_kl.argmax(axis=1)))
  label_to_categ_kl = {}
  for i in range(5):
    label_to_categ_kl[i] = df_train_sample.iloc[np.where(predicted_label_kl == i)[0]]['Category'].value_counts().idxmax()
  print(label_to_categ_kl)
  test_pred_categ_kl = np.vectorize(label_to_categ_kl.get)(model_kl.transform(vect.transform(df_test['Text'].apply(preprocessDataset))).argmax(axis=1))
  df_test['Category'] = test_pred_categ_kl
  df_test.drop(['Text'], axis = 1).to_csv('submission_kl_frac'+str(frac)+'.csv', index = False)
  #rfc
  rfc = RandomForestClassifier(random_state=42)
  rfc.fit(X, df_train_sample.Category)
  df_test['Category'] = rfc.predict(vect.transform(df_test['Text'].apply(preprocessDataset)))
  df_test.drop(['Text'], axis = 1).to_csv('submission_rfc_frac'+str(frac)+'.csv', index = False)
     

{0: 'tech', 1: 'politics', 2: 'entertainment', 3: 'business', 4: 'sport'}
{0: 'sport', 1: 'politics', 2: 'business', 3: 'entertainment', 4: 'tech'}
{0: 'tech', 1: 'politics', 2: 'sport', 3: 'entertainment', 4: 'business'}


Here is a comparison of the NMF (KL) score and the Random Forest Classifier score for different fractions of the training data:

Train data used    NMF (KL) score    RandomForestClassifier score
100%               0.940             0.946
50%                0.916             0.951
20%                0.932             0.936
10%                0.829             0.849

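Conclusion:
- Random Forest Classifier scores highest with 50% of the training data (0.951), slightly above its score with 100% (0.946).
- NMF scores highest with 100% of the training data (0.940).
- The small drop in the Random Forest Classifier score from 50% to 100% of the training data may point to overfitting, and its training accuracy of 1.0 is consistent with that. NMF gains more than Random Forest Classifier when going from 10% to 20% of the training data (0.829 to 0.932 versus 0.849 to 0.936).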