# Implementing classification algorithms when working with text data

This notebook uses the BBC text dataset to show how textual data points can be converted into numeric features and then classified.

In [1]:
import pandas as pd

### BBC Text Dataset
Source : https://www.kaggle.com/yufengdev/bbc-fulltext-and-category/downloads/bbc-text.csv

category: One of 5 categories

text: The title and body of the article, concatenated.

In [2]:
data = pd.read_csv('datasets/bbc-text.csv')

data.head()

| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |


### Unique categories in the data

In [3]:
data.category.unique()

array(['tech', 'business', 'sport', 'entertainment', 'politics'],
      dtype=object)

### Number of articles in each category

In [4]:
data['category'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

In [5]:
data.shape

(2225, 2)

In [6]:
data['text'][15]

's korean credit card firm rescued south korea s largest credit card firm has averted liquidation following a one trillion won ($960m; £499m) bail-out.  lg card had been threatened with collapse because of its huge debts but the firm s creditors and its former parent have stepped in to rescue it. a consortium of creditors and lg group  a family owned conglomerate  have each put up $480m to stabilise the firm. lg card has seven million customers and its collapse would have sent shockwaves through the country s economy.  the firm s creditors - which own 99% of lg card - have been trying to agree a deal to secure its future for several weeks. they took control of the company in january when it avoided bankruptcy only through a $4.5bn bail-out.  they had threatened to delist the company  a move which would have triggered massive debt redemptions and forced the company into bankruptcy  unless agreement was reached on its future funding.  lg card will not need any more financial aid after th

### Transforming text using count encoding

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
x_train_count = count_vectorizer.fit_transform(data['text'])

count_vectorizer.vocabulary_

{'tv': 27299,
 'future': 11495,
 'in': 13801,
 'the': 26462,
 'hands': 12561,
 'of': 18726,
 'viewers': 28191,
 'with': 28955,
 'home': 13183,
 'theatre': 26464,
 'systems': 26007,
 'plasma': 20146,
 'high': 13002,
 'definition': 7862,
 'tvs': 27302,
 'and': 2429,
 'digital': 8332,
 'video': 28174,
 'recorders': 21691,
 'moving': 17832,
 'into': 14340,
 'living': 16089,
 'room': 22779,
 'way': 28572,
 'people': 19740,
 'watch': 28540,
 'will': 28852,
 'be': 3620,
 'radically': 21317,
 'different': 8316,
 'five': 10875,
 'years': 29260,
 'time': 26665,
 'that': 26454,
 'is': 14510,
 'according': 1650,
 'to': 26730,
 'an': 2395,
 'expert': 10212,
 'panel': 19383,
 'which': 28749,
 'gathered': 11644,
 'at': 2994,
 'annual': 2524,
 'consumer': 6764,
 'electronics': 9414,
 'show': 24042,
 'las': 15561,
 'vegas': 28057,
 'discuss': 8504,
 'how': 13346,
 'these': 26500,
 'new': 18278,
 'technologies': 26246,
 'impact': 13703,
 'one': 18837,
 'our': 19052,
 'favourite': 10530,
 'pastimes': 195

In [8]:
x_train_count.shape

(2225, 29421)
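To make the shape above easier to read, here is a minimal sketch of count encoding on a tiny corpus. The two sentences and the names `docs`, `toy_vectorizer`, and `toy_counts` are made up for illustration and are not part of the BBC data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the BBC dataset
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index; columns follow sorted token order
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab.items(), key=lambda kv: kv[1]))

# Dense view: one row per document, one column per vocabulary term
dense = toy_counts.toarray()
print(dense)
```

Each row is a document and each column a vocabulary term, which is why the BBC matrix has 2225 rows and 29421 columns.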

In [9]:
print(x_train_count[15])

  (0, 11495)	2
  (0, 13801)	4
  (0, 26462)	13
  (0, 18726)	9
  (0, 28955)	1
  (0, 2429)	6
  (0, 14340)	2
  (0, 28852)	3
  (0, 3620)	1
  (0, 29260)	1
  (0, 14510)	1
  (0, 26730)	8
  (0, 2395)	1
  (0, 28749)	2
  (0, 2994)	1
  (0, 6764)	1
  (0, 18837)	2
  (0, 26588)	2
  (0, 12684)	3
  (0, 3706)	3
  (0, 28740)	2
  (0, 26501)	2
  (0, 11102)	1
  (0, 17721)	1
  (0, 3744)	1
  :	:
  (0, 11992)	1
  (0, 17240)	1
  (0, 15206)	1
  (0, 2408)	1
  (0, 23593)	1
  (0, 23552)	1
  (0, 24429)	1
  (0, 21703)	1
  (0, 7268)	1
  (0, 4776)	1
  (0, 4917)	1
  (0, 17375)	1
  (0, 6766)	1
  (0, 10599)	1
  (0, 22132)	1
  (0, 20825)	1
  (0, 23681)	1
  (0, 18174)	1
  (0, 11483)	1
  (0, 26594)	1
  (0, 25306)	1
  (0, 10104)	1
  (0, 10089)	1
  (0, 2948)	1
  (0, 22940)	1
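Each printed pair is `(row, column)` followed by the count, so the column numbers only become meaningful once they are mapped back to tokens. A small sketch of that decoding step, using a made-up two-document corpus (`docs`, `vec`, and `index_to_token` are illustrative names):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["credit card firm rescued", "card firm has seven million customers"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Invert vocabulary_ (token -> column) into column -> token
index_to_token = {idx: tok for tok, idx in vec.vocabulary_.items()}

# Densify a single row and keep only the non-zero columns
row = X[0].toarray().ravel()
word_counts = {index_to_token[int(i)]: int(row[i]) for i in np.flatnonzero(row)}
print(word_counts)
```

The same inversion of `count_vectorizer.vocabulary_` should work for row 15 of the BBC matrix.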


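### Transforming text using TF-IDF
TfidfVectorizer tokenizes and counts the text in the same way as CountVectorizer, then weights each count by the term's inverse document frequency (IDF) and normalizes each row.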

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
x_train_tfidf = tfidf_vectorizer.fit_transform(data['text']) 

tfidf_vectorizer.vocabulary_

{'tv': 27299,
 'future': 11495,
 'in': 13801,
 'the': 26462,
 'hands': 12561,
 'of': 18726,
 'viewers': 28191,
 'with': 28955,
 'home': 13183,
 'theatre': 26464,
 'systems': 26007,
 'plasma': 20146,
 'high': 13002,
 'definition': 7862,
 'tvs': 27302,
 'and': 2429,
 'digital': 8332,
 'video': 28174,
 'recorders': 21691,
 'moving': 17832,
 'into': 14340,
 'living': 16089,
 'room': 22779,
 'way': 28572,
 'people': 19740,
 'watch': 28540,
 'will': 28852,
 'be': 3620,
 'radically': 21317,
 'different': 8316,
 'five': 10875,
 'years': 29260,
 'time': 26665,
 'that': 26454,
 'is': 14510,
 'according': 1650,
 'to': 26730,
 'an': 2395,
 'expert': 10212,
 'panel': 19383,
 'which': 28749,
 'gathered': 11644,
 'at': 2994,
 'annual': 2524,
 'consumer': 6764,
 'electronics': 9414,
 'show': 24042,
 'las': 15561,
 'vegas': 28057,
 'discuss': 8504,
 'how': 13346,
 'these': 26500,
 'new': 18278,
 'technologies': 26246,
 'impact': 13703,
 'one': 18837,
 'our': 19052,
 'favourite': 10530,
 'pastimes': 195

In [11]:
x_train_tfidf.shape

(2225, 29421)
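The shape matches the count matrix because the vocabulary is the same; only the cell values change. As a sketch under default settings, applying `TfidfTransformer` to a count matrix should give the same result as `TfidfVectorizer` on the raw text. The corpus here is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus
docs = [
    "plasma tv sales rise",
    "tv viewers watch digital video",
    "digital video recorders in the living room",
]

# Two-step route: raw counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route, as used in this notebook
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(same, row_norms)
```

With the default `norm='l2'`, every row has unit length, which is why the TF-IDF values printed for a single article are all well below 1.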

In [12]:
print(x_train_tfidf[15])

  (0, 22940)	0.03762214126336901
  (0, 2948)	0.0509409236776086
  (0, 10089)	0.06981179437208619
  (0, 10104)	0.043640674374489895
  (0, 25306)	0.04331658623558077
  (0, 26594)	0.05521129576584879
  (0, 11483)	0.03425691772201353
  (0, 18174)	0.03731684355489801
  (0, 23681)	0.039443696544706294
  (0, 20825)	0.046240612976897
  (0, 22132)	0.065541422283846
  (0, 10599)	0.041923535563925994
  (0, 6766)	0.04321073576201849
  (0, 17375)	0.04408891365788541
  (0, 4917)	0.06251154506896749
  (0, 4776)	0.065541422283846
  (0, 7268)	0.053682233979380184
  (0, 21703)	0.06391790216496572
  (0, 24429)	0.06070161241172701
  (0, 23552)	0.049507175061705364
  (0, 23593)	0.0844122929783236
  (0, 2408)	0.042696876120132315
  (0, 15206)	0.057811234368255894
  (0, 17240)	0.04353153315462155
  (0, 11992)	0.04321073576201849
  :	:
  (0, 3744)	0.023561865350246923
  (0, 17721)	0.017870201923334023
  (0, 11102)	0.011226274875132187
  (0, 26501)	0.03492083190497568
  (0, 28740)	0.04159225642155242
  (0, 370

In [13]:
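# Pair every vocabulary term with its IDF weight.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it from 1.0 onward.
idf_scores = list(zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_))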

In [14]:
idf_scores[4000:4050]

[('biocidal', 8.014814351275545),
 ('biodiversity', 8.014814351275545),
 ('biogen', 8.014814351275545),
 ('biographer', 8.014814351275545),
 ('biography', 6.143012174373953),
 ('biological', 7.3216671707156),
 ('biology', 8.014814351275545),
 ('biometric', 6.916202062607435),
 ('biometrics', 7.60934924316738),
 ('biopic', 5.817589773939326),
 ('biotech', 8.014814351275545),
 ('bioware', 8.014814351275545),
 ('bipedal', 8.014814351275545),
 ('bipin', 8.014814351275545),
 ('birch', 8.014814351275545),
 ('birchfield', 7.3216671707156),
 ('bird', 6.628519990155654),
 ('birds', 8.014814351275545),
 ('birgit', 7.60934924316738),
 ('birkbeck', 8.014814351275545),
 ('birkenhead', 7.09852361940139),
 ('birkett', 7.60934924316738),
 ('birkin', 8.014814351275545),
 ('birmingham', 4.776135899111164),
 ('birth', 5.874748187779274),
 ('birthday', 5.99991133073328),
 ('birthdays', 7.3216671707156),
 ('birthistle', 8.014814351275545),
 ('birthplace', 8.014814351275545),
 ('births', 8.014814351275545),
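The repeated value 8.0148... can be reproduced by hand. Assuming the vectorizer's default `smooth_idf=True`, scikit-learn computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing the term. The helper name `smoothed_idf` below is illustrative.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smooth_idf=True variant."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 2225  # number of articles, from data.shape

# Terms that appear in 1, 2, or 3 articles
for doc_freq in (1, 2, 3):
    print(doc_freq, smoothed_idf(n_docs, doc_freq))
```

Under that assumption, a score of 8.0148 (for example 'biocidal') corresponds to a term found in exactly one article, 7.6093 to two articles, and 7.3217 to three. More common terms such as 'birmingham' get lower scores.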

### Split the data into training and test sets

In [15]:
from sklearn.model_selection import train_test_split

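# Target labels: the article category
Y = data['category']

# Hold out 20% of the rows for testing. No random_state is set, so the split
# (and the accuracy reported later) changes from run to run.
x_train, x_test, y_train, y_test = train_test_split(x_train_tfidf, Y, test_size=0.2)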

In [16]:
x_train.shape, y_train.shape

((1780, 29421), (1780,))

In [17]:
x_test.shape, y_test.shape

((445, 29421), (445,))
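Note that the TF-IDF vocabulary and IDF weights were fitted on all 2225 articles before this split, so they have already seen the test rows. One possible alternative, sketched here on made-up data, is to split the raw text first and let a pipeline fit the vectorizer on the training portion only. The corpus, `random_state=0`, and `stratify` are assumptions for the example, not settings from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for data['text'] and data['category']
texts = [
    "team wins cup final match",
    "striker scores late goal",
    "coach praises team defence",
    "match ends with penalty goal",
    "shares fall as profits drop",
    "bank reports record profits",
    "market rallies on bank deal",
    "firm shares rise after deal",
]
labels = ["sport"] * 4 + ["business"] * 4

# Split the raw text first; stratify keeps the class balance in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The vectorizer inside the pipeline is fitted on the training text only
model = make_pipeline(
    TfidfVectorizer(),
    DecisionTreeClassifier(max_depth=10, random_state=0),
)
model.fit(text_train, y_tr)
preds = model.predict(text_test)
print(preds)
```

On the BBC data the effect on accuracy may be small, but this keeps the test set fully unseen during feature extraction.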

### Decision Tree
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [18]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=10)
clf.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=10, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
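A fitted tree exposes `feature_importances_`, which can be mapped back to words to see which terms it splits on. This is a sketch on a made-up corpus; `terms` and `top_terms` are illustrative names, and on the BBC data the same idea would use `clf` and `tfidf_vectorizer.vocabulary_`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus
texts = [
    "goal match team",
    "match team win",
    "market shares profit",
    "profit bank market",
]
labels = ["sport", "sport", "business", "business"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, labels)

# Column index -> term, derived from vocabulary_ so it works across versions
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Keep only the terms the tree actually used, most important first
importances = tree.feature_importances_
top_terms = [
    (terms[i], float(importances[i]))
    for i in np.argsort(importances)[::-1]
    if importances[i] > 0
]
print(top_terms)
```

With `max_depth=10`, a tree can use only a small fraction of the 29421 columns, which may partly explain a modest accuracy on this task.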

In [19]:
y_pred = clf.predict(x_test)

y_pred[0:10]

array(['sport', 'politics', 'entertainment', 'politics', 'business',
       'tech', 'sport', 'business', 'entertainment', 'sport'],
      dtype=object)

### Accuracy score

In [20]:
from sklearn.metrics import accuracy_score

print("Accuracy : ", accuracy_score(y_test, y_pred))

Accuracy :  0.7775280898876404
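A single accuracy number hides which categories get confused with each other. A confusion matrix breaks it down per class. The labels below are invented for the sketch; with the notebook's variables the call would take `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true = ["sport", "tech", "tech", "business", "sport", "business"]
y_hat = ["sport", "business", "tech", "business", "sport", "tech"]
class_names = ["business", "sport", "tech"]

# Rows are true classes, columns are predicted classes, in class_names order
cm = confusion_matrix(y_true, y_hat, labels=class_names)
acc = accuracy_score(y_true, y_hat)

print(cm)
print(acc, cm.trace() / cm.sum())  # accuracy equals the diagonal share
```

Off-diagonal cells show the mistakes, for example 'tech' articles predicted as 'business'.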


### Table for predicted and actual values

In [21]:
df_y = pd.DataFrame({'y_test' : y_test, 'y_pred' : y_pred})

df_y.sample(10)

| | y_test | y_pred |
|---|---|---|
| 560 | sport | sport |
| 1724 | tech | business |
| 211 | politics | politics |
| 1964 | sport | sport |
| 1677 | business | business |
| 939 | entertainment | entertainment |
| 2065 | tech | tech |
| 2109 | entertainment | entertainment |
| 812 | politics | politics |
| 1073 | entertainment | entertainment |
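A random sample mostly shows correct rows, so it can help to filter for the disagreements instead. This sketch rebuilds three rows from the sample above in a small frame named `df_demo` (an illustrative name). On the full data the same filter would apply to `df_y`.

```python
import pandas as pd

# Three rows copied from the sample above, keeping the original index
df_demo = pd.DataFrame(
    {
        "y_test": ["sport", "tech", "politics"],
        "y_pred": ["sport", "business", "politics"],
    },
    index=[560, 1724, 211],
)

# Keep only the rows where the prediction differs from the true label
errors = df_demo[df_demo["y_test"] != df_demo["y_pred"]]
print(errors)
```

The index values point back to rows of `data`, so the misclassified article text can be looked up with `data['text'][1724]`.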
