# BBC News Headlines Classification Model

We will use the BBC news text dataset, in which each article is labeled with one of 5 categories (`tech`, `sport`, `business`, `entertainment`, and `politics`), and train two classifiers on it: Logistic Regression and Naive Bayes.

Finally, we will test the chosen model on a few headlines taken from outside the dataset to see whether it assigns them to the correct category.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/bbc-fulltext-and-category/bbc-text.csv


## Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn

## Data Reading and Understanding

In [3]:
bbc_text = pd.read_csv('../input/bbc-fulltext-and-category/bbc-text.csv')
bbc_text

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...
...,...,...
2220,business,cars pull down us retail figures us retail sal...
2221,politics,kilroy unveils immigration policy ex-chatshow ...
2222,entertainment,rem announce new glasgow concert us band rem h...
2223,politics,how political squabbles snowball it s become c...


In [4]:
bbc_text.category.unique()

array(['tech', 'business', 'sport', 'entertainment', 'politics'],
      dtype=object)

There are a total of 5 news classes in our dataset. Since the model requires numeric labels, let's map each class to an integer.

In [5]:
bbc_text.category = bbc_text.category.map({'tech':0, 'business':1, 'sport':2, 'entertainment':3, 'politics':4})
bbc_text.category.unique()

array([0, 1, 2, 3, 4])

In [6]:
bbc_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   int64 
 1   text      2225 non-null   object
dtypes: int64(1), object(1)
memory usage: 34.9+ KB


In [7]:
bbc_text.shape

(2225, 2)

### Train Test Split

In [8]:
# bbc_news = bbc_text.values

X = bbc_text.text
y = bbc_text.category

#split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = 1)
print(X_train)
print(y_train)

407     bush to get  tough  on deficit us president ge...
1065    bellamy under new fire newcastle boss graeme s...
1141    mandelson warns bbc on campbell the bbc should...
987     john peel replacement show begins the permanen...
2145    wales coach elated with win mike ruddock paid ...
                              ...                        
960     fed warns of more us rate rises the us looks s...
905     bets off after big brother  leak  a bookmaker ...
1096    internet boom for gift shopping cyberspace is ...
235     consumer spending lifts us growth us economic ...
1061    t in the park sells out in days tickets for sc...
Name: text, Length: 1557, dtype: object
407     1
1065    2
1141    4
987     3
2145    2
       ..
960     1
905     3
1096    0
235     1
1061    3
Name: category, Length: 1557, dtype: int64


### Creating the Bag of Words Representation

We now have to convert the text into a format that can be used for training the model. We'll use the **bag of words representation** for each document.

Imagine breaking every document in X into individual words and putting them all in a bag. We then pick the unique words from the bag one by one and build a dictionary (the vocabulary). Each document is then described by how often each vocabulary word occurs in it.

This is called **vectorization**. scikit-learn provides the class `CountVectorizer()` for it.

We will also pass `stop_words='english'` to remove common English stop words from the data.
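To make the idea concrete, here is a minimal sketch on two made-up sentences. The toy sentences and the `toy_` names are illustrative assumptions and are not part of the BBC data. It prints the vocabulary that `CountVectorizer` learns and the resulting count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration
toy_docs = ["the cat sat on the mat", "the dog chased the cat"]

toy_vec = CountVectorizer(stop_words='english')
toy_counts = toy_vec.fit_transform(toy_docs)  # sparse document-term matrix

# vocabulary_ maps each kept word to its column index;
# sorting by that index gives the column order of the matrix
toy_vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(toy_vocab)

# one row per document, one column per vocabulary word
print(toy_counts.toarray())
```

Stop words such as "the" do not appear in the vocabulary, and each row simply counts how often the remaining words occur in that document.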


In [9]:
# countVectorizer

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words = 'english')

In [10]:
# fit the vectorizer on the training data

vec.fit(X_train)
# scikit-learn >= 1.0; older versions use vec.get_feature_names(), which was removed in 1.2
print(len(vec.get_feature_names_out()))
vec.vocabulary_

25184


{'bush': 4146,
 'tough': 23043,
 'deficit': 6671,
 'president': 17651,
 'george': 10063,
 'pledged': 17266,
 'introduce': 12277,
 'federal': 9034,
 'budget': 4034,
 'february': 9031,
 'bid': 3304,
 'halve': 10693,
 'country': 6014,
 'years': 25056,
 'trade': 23100,
 'deep': 6629,
 'red': 18614,
 'helping': 11030,
 'push': 18101,
 'dollar': 7481,
 'lows': 13925,
 'euro': 8501,
 'fuelling': 9775,
 'fears': 9018,
 'economy': 7913,
 'mr': 15289,
 'indicated': 11887,
 'strict': 21797,
 'discipline': 7198,
 'non': 15777,
 'defence': 6652,
 'spending': 21299,
 'vow': 24306,
 'cut': 6334,
 'election': 8011,
 'declarations': 6594,
 'hit': 11181,
 'record': 18579,
 '412bn': 715,
 '211': 430,
 '6bn': 979,
 '12': 115,
 'months': 15153,
 '30': 565,
 'september': 20295,
 '377bn': 665,
 'previous': 17697,
 'year': 25052,
 'submit': 21905,
 'fits': 9295,
 'times': 22862,
 'said': 19765,
 'provide': 17962,
 'tool': 22969,
 'resource': 19106,
 'military': 14868,
 'protect': 17932,
 'homeland': 11266,
 '

In [11]:
# transform the training text into a sparse document-term count matrix
X_transformed = vec.transform(X_train)
X_transformed

<1557x25184 sparse matrix of type '<class 'numpy.int64'>'
	with 224873 stored elements in Compressed Sparse Row format>

In [12]:
print(X_transformed)

  (0, 115)	1
  (0, 430)	1
  (0, 565)	1
  (0, 665)	1
  (0, 715)	1
  (0, 979)	1
  (0, 1466)	1
  (0, 2422)	1
  (0, 3256)	1
  (0, 3304)	1
  (0, 3644)	1
  (0, 4034)	6
  (0, 4146)	6
  (0, 4259)	1
  (0, 5358)	1
  (0, 5508)	1
  (0, 6014)	1
  (0, 6305)	1
  (0, 6334)	1
  (0, 6519)	2
  (0, 6594)	1
  (0, 6629)	1
  (0, 6652)	1
  (0, 6671)	7
  (0, 6672)	1
  :	:
  (1556, 17865)	1
  (1556, 17955)	1
  (1556, 18579)	1
  (1556, 19233)	1
  (1556, 19765)	2
  (1556, 20057)	1
  (1556, 20058)	1
  (1556, 20233)	2
  (1556, 20238)	1
  (1556, 20904)	1
  (1556, 21086)	4
  (1556, 21158)	1
  (1556, 21468)	1
  (1556, 21782)	1
  (1556, 22104)	1
  (1556, 22350)	1
  (1556, 22732)	1
  (1556, 22825)	2
  (1556, 22857)	1
  (1556, 23198)	1
  (1556, 24494)	1
  (1556, 24554)	1
  (1556, 24557)	1
  (1556, 24766)	1
  (1556, 25052)	5


In [13]:
X_transformed.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0]])

In [14]:
# convert the sparse matrix to a dense DataFrame, just for readability
# (scikit-learn >= 1.0; older versions use vec.get_feature_names())
pd.DataFrame(X_transformed.toarray(), columns=vec.get_feature_names_out())

Unnamed: 0,00,000,000bn,000m,000th,001,001and,001st,0051,007,...,zone,zonealarm,zones,zoom,zooms,zorro,zubair,zuluaga,zurich,zvonareva
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1552,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1553,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1554,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1555,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We don't use this dense form for model building: it has one column for every word in the vocabulary, and in any given row almost all positions hold zero. The sparse matrix stores only the non-zero counts, so we pass it to the model directly.
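As a rough back-of-the-envelope check, the shapes printed above (1557 rows, 25184 columns, 224873 stored elements) let us compare the two storage formats. This sketch assumes 8-byte `int64` values, as reported in the output, and 4-byte index arrays for the CSR format, which is an assumption about SciPy's defaults for a matrix of this size.

```python
# Figures taken from the printed training matrix
n_rows, n_cols, n_stored = 1557, 25184, 224873

VALUE_BYTES = 8   # int64 counts, as shown in the matrix summary
INDEX_BYTES = 4   # assumed int32 column indices and row pointers

# Dense storage: every cell is stored, zero or not
dense_bytes = n_rows * n_cols * VALUE_BYTES

# CSR storage: values + column indices + one row pointer per row (plus one)
sparse_bytes = (n_stored * VALUE_BYTES
                + n_stored * INDEX_BYTES
                + (n_rows + 1) * INDEX_BYTES)

print(f"dense : {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
print(f"ratio : {dense_bytes / sparse_bytes:.0f}x")
```

Under these assumptions the dense array needs roughly a hundred times more memory than the sparse one, which is why the dense form is only used here for display.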

In [15]:
# for test data
X_test_transformed = vec.transform(X_test)
X_test_transformed

<668x25184 sparse matrix of type '<class 'numpy.int64'>'
	with 92787 stored elements in Compressed Sparse Row format>

In [16]:
print(X_test_transformed)

  (0, 1)	1
  (0, 559)	1
  (0, 565)	1
  (0, 809)	1
  (0, 1466)	1
  (0, 1606)	1
  (0, 2494)	1
  (0, 3304)	2
  (0, 3364)	4
  (0, 3397)	3
  (0, 3509)	2
  (0, 3768)	2
  (0, 3950)	1
  (0, 3985)	1
  (0, 4175)	1
  (0, 4254)	1
  (0, 4277)	1
  (0, 4976)	1
  (0, 5144)	1
  (0, 5146)	2
  (0, 5579)	1
  (0, 6110)	1
  (0, 6174)	1
  (0, 6176)	1
  (0, 6210)	1
  :	:
  (667, 16129)	1
  (667, 16148)	1
  (667, 16712)	1
  (667, 16767)	1
  (667, 17019)	1
  (667, 17328)	1
  (667, 17661)	1
  (667, 17701)	1
  (667, 17772)	1
  (667, 17793)	1
  (667, 17802)	2
  (667, 18760)	1
  (667, 18847)	1
  (667, 19310)	1
  (667, 19765)	2
  (667, 19915)	1
  (667, 20446)	1
  (667, 20484)	1
  (667, 20562)	1
  (667, 20935)	1
  (667, 20951)	1
  (667, 22515)	1
  (667, 23911)	1
  (667, 24548)	1
  (667, 24682)	1


In [17]:
# convert the sparse test matrix to a dense DataFrame, just for readability
# (scikit-learn >= 1.0; older versions use vec.get_feature_names())
pd.DataFrame(X_test_transformed.toarray(), columns=vec.get_feature_names_out())

Unnamed: 0,00,000,000bn,000m,000th,001,001and,001st,0051,007,...,zone,zonealarm,zones,zoom,zooms,zorro,zubair,zuluaga,zurich,zvonareva
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
663,1,0,0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
664,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
665,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
666,0,0,0,5,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Building the model

### Logistic Regression

In [18]:
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression()
logit.fit(X_transformed, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [19]:
# fit (redundant here: the model was already fitted in the previous cell)
logit.fit(X_transformed, y_train)

# predict class
y_pred_class = logit.predict(X_test_transformed)

# predict probabilities
y_pred_proba = logit.predict_proba(X_test_transformed)

## Model Evaluation Logistic Regression

In [20]:
# printing the overall accuracy
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9835329341317365

In [21]:
confusion = metrics.confusion_matrix(y_test, y_pred_class)
print(confusion)
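# NOTE: this is a 5x5 matrix, so the indices below only pick out the
# tech (0) vs business (1) 2x2 block; they are not overall TN/FP/FN/TP.
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]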

[[116   1   1   1   0]
 [  0 168   0   0   0]
 [  0   1 153   0   0]
 [  0   1   1 106   0]
 [  1   3   1   0 114]]


In [22]:
sensitivity = TP / float(FN + TP)
print("sensitivity",sensitivity)

specificity = TN / float(TN + FP)
print("specificity",specificity)

sensitivity 1.0
specificity 0.9914529914529915
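
The two numbers above only describe the tech vs business block of the matrix. A one-vs-rest computation per class could be sketched as follows; it reuses the Logistic Regression confusion matrix printed above, and the variable names (`cm`, `tp`, `fn`, `fp`, `tn`, `labels`) are illustrative assumptions.

```python
import numpy as np

# Confusion matrix printed above (rows = true class, columns = predicted class)
cm = np.array([[116,   1,   1,   1,   0],
               [  0, 168,   0,   0,   0],
               [  0,   1, 153,   0,   0],
               [  0,   1,   1, 106,   0],
               [  1,   3,   1,   0, 114]])
labels = ['tech', 'business', 'sport', 'entertainment', 'politics']

total = cm.sum()
tp = np.diag(cm)                 # correctly predicted, per class
fn = cm.sum(axis=1) - tp         # true class k, predicted as something else
fp = cm.sum(axis=0) - tp         # predicted k, but true class differs
tn = total - tp - fn - fp        # everything not involving class k

sensitivity = tp / (tp + fn)     # per-class recall
specificity = tn / (tn + fp)
accuracy = tp.sum() / total

for name, sens, spec in zip(labels, sensitivity, specificity):
    print(f"{name:13s} sensitivity={sens:.4f} specificity={spec:.4f}")
print("accuracy", accuracy)
```

The accuracy recomputed this way agrees with the value reported by `metrics.accuracy_score` above.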


In [23]:
# with average='micro' on single-label multiclass data, precision, recall and F1
# all equal the overall accuracy
print("PRECISION SCORE :", metrics.precision_score(y_test, y_pred_class, average='micro'))
print("RECALL SCORE :", metrics.recall_score(y_test, y_pred_class, average='micro'))
print("F1 SCORE :", metrics.f1_score(y_test, y_pred_class, average='micro'))

PRECISION SCORE : 0.9835329341317365
RECALL SCORE : 0.9835329341317365
F1 SCORE : 0.9835329341317365


Let's now build a Naive Bayes model and see whether we get better results.

### Naive Bayes

In [24]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_transformed, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [25]:
# fit (redundant here: the model was already fitted in the previous cell)
nb.fit(X_transformed, y_train)

# predict class
y_pred_class = nb.predict(X_test_transformed)

# predict probabilities
y_pred_proba = nb.predict_proba(X_test_transformed)

## Model Evaluation Naive Bayes

In [26]:
# printing the overall accuracy
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9865269461077845

In [27]:
# confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
# help(metrics.confusion_matrix)

array([[117,   0,   1,   0,   1],
       [  1, 165,   0,   0,   2],
       [  0,   1, 153,   0,   0],
       [  0,   0,   0, 108,   0],
       [  1,   2,   0,   0, 116]])

In [28]:
confusion = metrics.confusion_matrix(y_test, y_pred_class)
print(confusion)
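# NOTE: this is a 5x5 matrix, so the indices below only pick out the
# tech (0) vs business (1) 2x2 block; they are not overall TN/FP/FN/TP.
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]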

[[117   0   1   0   1]
 [  1 165   0   0   2]
 [  0   1 153   0   0]
 [  0   0   0 108   0]
 [  1   2   0   0 116]]


In [29]:
sensitivity = TP / float(FN + TP)
print("sensitivity",sensitivity)

specificity = TN / float(TN + FP)
print("specificity",specificity)

sensitivity 0.9939759036144579
specificity 1.0


In [30]:
print("PRECISION SCORE :",metrics.precision_score(y_test, y_pred_class, average = 'micro'))
print("RECALL SCORE :", metrics.recall_score(y_test, y_pred_class, average = 'micro'))
print("F1 SCORE :",metrics.f1_score(y_test, y_pred_class, average = 'micro'))

PRECISION SCORE : 0.9865269461077845
RECALL SCORE : 0.9865269461077845
F1 SCORE : 0.9865269461077845


Logistic Regression (accuracy 0.9835) and Naive Bayes (accuracy 0.9865) offer similar performance on the test set. We will go ahead with Naive Bayes as our final model.

------------------------------------------------------------------------------------------------------------

## Test from random data outside the dataset

> Let's choose a few news headlines from the internet and see whether our model classifies them well.

In [31]:
s1 = ['FIR against Delhi Minorities Commission chairman for inflammatory content on social media']
vec1 = vec.transform(s1).toarray()
print('Headline:' ,s1)
print(str(list(nb.predict(vec1))[0]).replace('0', 'TECH').replace('1', 'BUSINESS').replace('2', 'SPORTS').replace('3','ENTERTAINMENT').replace('4','POLITICS'))

Headline: ['FIR against Delhi Minorities Commission chairman for inflammatory content on social media']
POLITICS


In [32]:
relabel = {'0': 'tech', '1': 'business', '2': 'sport', '3': 'entertainment', '4': 'politics'}
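
A mapping like `relabel` could replace the chained `.replace()` calls used in the test cells. The following is a minimal sketch, not the notebook's own method: `ID_TO_LABEL` and `predict_label` are hypothetical names, and the two-sentence training set is a made-up stand-in for the fitted `vec` and `nb` so that the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Inverse of the category mapping applied to bbc_text.category
ID_TO_LABEL = {0: 'tech', 1: 'business', 2: 'sport', 3: 'entertainment', 4: 'politics'}

def predict_label(model, vectorizer, headline):
    """Return the category name predicted for a single headline string."""
    features = vectorizer.transform([headline])  # sparse input is accepted by predict
    class_id = int(model.predict(features)[0])
    return ID_TO_LABEL[class_id]

# Toy stand-in for the fitted vectorizer and model (illustration only)
toy_texts = ["stock market shares profit", "football match goal win"]
toy_labels = [1, 2]  # business, sport
toy_vec = CountVectorizer().fit(toy_texts)
toy_nb = MultinomialNB().fit(toy_vec.transform(toy_texts), toy_labels)

toy_prediction = predict_label(toy_nb, toy_vec, "late goal decides football match")
print(toy_prediction)
```

With the notebook's own objects the call would be `predict_label(nb, vec, s1[0])`. Note that `.toarray()` is not needed before `predict`.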

In [33]:
s2 = ['Need to restart economy but with caution: Yogi Adityanath at E-Agenda AajTak']
vec2 = vec.transform(s2).toarray()
print('Headline:' ,s2)
print(str(list(nb.predict(vec2))[0]).replace('0', 'TECH').replace('1', 'BUSINESS').replace('2', 'SPORTS').replace('3','ENTERTAINMENT').replace('4','POLITICS'))

Headline: ['Need to restart economy but with caution: Yogi Adityanath at E-Agenda AajTak']
BUSINESS


In [34]:
s3 = ['2 doctors attacked in Andhra Pradesh Vijayawada']
vec3 = vec.transform(s3).toarray()
print('Headline:', s3)
print(str(list(nb.predict(vec3))[0]).replace('0', 'TECH').replace('1', 'BUSINESS').replace('2', 'SPORTS').replace('3','ENTERTAINMENT').replace('4','POLITICS'))

Headline: ['2 doctors attacked in Andhra Pradesh Vijayawada']
POLITICS


In [35]:
s4 = ['If I bat for an hour, you’ll see a big one: How Dravid spelt doom for Pak']
vec4 = vec.transform(s4).toarray()
print('Headline:', s4)
print(str(list(nb.predict(vec4))[0]).replace('0', 'TECH').replace('1', 'BUSINESS').replace('2', 'SPORTS').replace('3','ENTERTAINMENT').replace('4','POLITICS'))

Headline: ['If I bat for an hour, you’ll see a big one: How Dravid spelt doom for Pak']
SPORTS


#### Our Naive Bayes model assigns plausible categories to these four news headlines from outside the dataset!