### Data preparation and feature extraction
- Convert categorical label values into numerical values
- Extract features from text, including converting text to a numeric representation as vectors
- Split data into training and test datasets

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import scipy
import util # the utility module contains the feature extraction functions

In [2]:
df = pd.read_csv('bbc-text.csv')

#### Convert categorical label values into numerical values

In [3]:
le = LabelEncoder().fit(df["category"])
df['encoded_category'] = le.transform(df["category"])

In [4]:
df.head()

Unnamed: 0,category,text,encoded_category
0,tech,tv future in the hands of viewers with home th...,4
1,business,worldcom boss left books alone former worldc...,0
2,sport,tigers wary of farrell gamble leicester say ...,3
3,sport,yeading face newcastle in fa cup premiership s...,3
4,entertainment,ocean s twelve raids box office ocean s twelve...,1


Check the inverse transform

In [5]:
le.inverse_transform([3])[0]

'sport'

Check the encoded categories

In [6]:
df[['category', 'encoded_category']].drop_duplicates().sort_values('encoded_category')

Unnamed: 0,category,encoded_category
1,business,0
4,entertainment,1
5,politics,2
2,sport,3
0,tech,4


#### Before transforming text into input features, split the sample data into train and test datasets

In [7]:
x_train, x_test, y_train, y_test = train_test_split(
    df['text'], df['encoded_category'], test_size=.2, stratify=df['category'], random_state=42)
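
To illustrate what `stratify` is meant to achieve, here is a rough standard-library sketch that splits each class separately so the test set keeps the class proportions. The `stratified_split` helper and the toy labels are hypothetical and do not reproduce scikit-learn's actual algorithm or its shuffling.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                      # shuffle within the class
        n_test = round(len(members) * test_size)  # same fraction per class
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

# Hypothetical, imbalanced toy labels (not the BBC data).
labels = ["sport"] * 50 + ["tech"] * 30 + ["politics"] * 20
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 20% of its samples to the test set, so an imbalanced dataset stays equally imbalanced in both splits.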

Check one training text (the row with index label 0, which is part of the training split), showing only the first 1000 characters

In [8]:
x_train[0][:1000]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

Encode the input text using TF-IDF (term frequency-inverse document frequency) features. TF-IDF assigns each word a weight that reflects how relevant that particular word is to a document, relative to the rest of the corpus.
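
As a rough illustration of how such a weight is computed, the sketch below works through TF-IDF by hand on three hypothetical sentences. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which scikit-learn documents as the `TfidfVectorizer` default; tokenization is simplified to whitespace splitting and no stop words are removed.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the BBC data).
docs = [
    "the match ended in a draw",
    "the market ended lower",
    "the match was a thriller",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(tokenized[1])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.3f}")
```

Words that occur in only one document ("market", "lower") receive the largest weights, while "the", which occurs in every document, receives the smallest.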

In [10]:
def tfidf_transform(x_train, x_test):
    kwargs = {
            'ngram_range': (1, 1),  # Use unigrams (1-grams) only.
            'analyzer': 'word',  # Split text into word tokens.
            'min_df': 1,
            'stop_words': "english",
    }
    vectorizer = TfidfVectorizer(**kwargs)
    # Learn vocabulary from training texts and vectorize training texts.
    x_train_transformed = vectorizer.fit_transform(x_train)
    # Vectorize test texts with the vocabulary learned from training.
    x_test_transformed = vectorizer.transform(x_test)
    return x_train_transformed, x_test_transformed

tfidf_train, tfidf_test = tfidf_transform(x_train, x_test)
print(tfidf_train.shape)

(1780, 26501)


Check the first sample after the TF-IDF transform

In [12]:
tfidf_train[0].data

array([0.09250662, 0.0604464 , 0.08807122, 0.04670963, 0.04975965,
       0.06622331, 0.05306013, 0.0616137 , 0.04227424, 0.07514611,
       0.04037069, 0.15099678, 0.06576305, 0.05473991, 0.09012997,
       0.08247605, 0.01788117, 0.06669774, 0.10944471, 0.05377262,
       0.07787397, 0.12013145, 0.08000394, 0.07019845, 0.14357862,
       0.10116865, 0.07212817, 0.10600434, 0.0789022 , 0.07178931,
       0.05613192, 0.0789022 , 0.08000394, 0.0811905 , 0.07644979,
       0.05997197, 0.06384666, 0.07319189, 0.07556854, 0.08030768,
       0.06669774, 0.0805858 , 0.08935123, 0.0953176 , 0.17084337,
       0.11388011, 0.0550763 , 0.08906626, 0.04704264, 0.09005206,
       0.03632227, 0.05377262, 0.05877387, 0.04684198, 0.07473494,
       0.04232373, 0.04959988, 0.06306585, 0.14039689, 0.13084726,
       0.06821546, 0.06903429, 0.05997197, 0.03497236, 0.05779686,
       0.03356176, 0.04377833, 0.10992671, 0.0953176 , 0.12013145,
       0.05806975, 0.06821546, 0.06268979, 0.07514611, 0.17813

When feeding the TF-IDF encoded sparse matrix to a deep learning network, it needs to be converted to a dense matrix first.
Convert the sparse matrix to a dense matrix and check the first sample

In [18]:
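# Convert the SciPy sparse matrix to a dense numpy.matrix.
train_tfidf_dense = tfidf_train.todense()
print(train_tfidf_dense[0])  # first sample: mostly zeros
print(len(train_tfidf_dense))  # number of training samples
# Show only the non-zero TF-IDF weights of the first sample.
print(train_tfidf_dense[0][train_tfidf_dense[0] != 0][0])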

[[0. 0. 0. ... 0. 0. 0.]]
1780
[[0.0953176  0.0616137  0.07514611 0.05589101 0.06576305 0.03632227
  0.54722354 0.07212817 0.09012997 0.06268979 0.0604464  0.08807122
  0.08906626 0.05806975 0.13643092 0.07178931 0.07473494 0.18038548
  0.10992671 0.05473991 0.05377262 0.14039689 0.05997197 0.06821546
  0.08625528 0.05613192 0.0550763  0.04670963 0.08000394 0.06306585
  0.17084337 0.17813252 0.07787397 0.09250662 0.06821546 0.0811905
  0.07644979 0.07787397 0.18133919 0.05997197 0.12013145 0.07319189
  0.08030768 0.04037069 0.04684198 0.0789022  0.06622331 0.10944471
  0.0805858  0.1232274  0.06576305 0.04959988 0.14357862 0.06903429
  0.05877387 0.05145562 0.06466913 0.07514611 0.09005206 0.04377833
  0.08713653 0.05377262 0.07019845 0.04975965 0.08935123 0.04704264
  0.15099678 0.06669774 0.10116865 0.07738448 0.05779686 0.0789022
  0.06384666 0.01788117 0.07556854 0.07113314 0.06669774 0.10600434
  0.08247605 0.08000394 0.11388011 0.12013145 0.0953176  0.07178931
  0.1261317  0.0530
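
The dense conversion has a memory cost worth estimating before running it on a larger corpus. A quick back-of-the-envelope calculation, assuming the values are stored as 8-byte `float64` (an assumption, not verified against this notebook):

```python
n_rows, n_cols = 1780, 26501      # shape printed by tfidf_train.shape
bytes_per_value = 8               # assumption: float64 values

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"dense: {dense_bytes / 1e6:.1f} MB")
```

Under that assumption the dense training matrix takes roughly 377 MB, almost all of it zeros.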


Another way to encode text into a numeric representation is to tokenize the text and pad the resulting integer sequences to a fixed length

In [19]:
def pad_sequence_transform(x_train, x_test, vocab_size=50000, max_len=20000):
    """Convert raw input texts into padded integer sequences.

    Args:
        x_train: array of raw training texts
        x_test: array of raw test texts
        vocab_size: maximum vocabulary size used for tokenization,
                    defaults to 50000
        max_len: length of the padded sequences, defaults to 20000

    Returns:
        Tuple (train_padded, test_padded) of integer matrices.
    """
    oov_tok = '<OOV>'
    tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
    # Learn the vocabulary from the training texts only.
    tokenizer.fit_on_texts(x_train)
    x_seq = tokenizer.texts_to_sequences(x_train)
    # Pad with zeros at the end (or truncate) so every sequence has length max_len.
    train_padded = pad_sequences(x_seq, padding='post', maxlen=max_len)
    test_padded = pad_sequences(tokenizer.texts_to_sequences(x_test), padding='post', maxlen=max_len)
    return train_padded, test_padded

train_padded, test_padded = pad_sequence_transform(x_train, x_test)

In [20]:
train_padded.shape

(1780, 20000)

In [21]:
train_padded[0][:10]

array([4353, 2571,  361,  629, 3112, 1528, 3331, 4353, 2572,  653],
      dtype=int32)
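
To make the idea concrete, here is a simplified plain-Python sketch of tokenizing and padding. The helpers `fit_vocab` and `to_padded` are hypothetical and only approximate the Keras behavior: index 0 is reserved for padding, index 1 for out-of-vocabulary words, more frequent words get smaller indices, and over-long sequences lose tokens from the front (mirroring the `pad_sequences` default `truncating='pre'`). Punctuation filtering and the `num_words` cap are left out.

```python
from collections import Counter

def fit_vocab(texts, oov_token="<OOV>"):
    """Map words to integer ids, most frequent first; id 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        vocab[word] = rank
    return vocab

def to_padded(texts, vocab, max_len):
    """Encode texts as id sequences, zero-padded at the end to max_len."""
    rows = []
    for text in texts:
        seq = [vocab.get(word, 1) for word in text.lower().split()]
        seq = seq[-max_len:]                       # keep the last max_len ids
        rows.append(seq + [0] * (max_len - len(seq)))
    return rows

# Hypothetical toy texts (not the BBC data).
train = ["the cup final", "the final whistle"]
vocab = fit_vocab(train)
padded = to_padded(["the cup final replay", "final"], vocab, max_len=3)
print(padded)
```

The unseen word "replay" maps to the OOV id 1, and the short text is filled with zeros, which is why every row ends up with the same length.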

I put the two feature extraction functions into a util.py module so they can be called when building machine learning or deep learning models