# 5. Feature extraction

Now that we have both the cleaned data in the right format and the set of 100 words with the highest predictive power, we're finally ready to extract the features that will later be used to train our classification model. More specifically, for each word in the text, we count how many times each of the top 100 words occurs above and below it within a given window (e.g. within 100 words above and 100 words below). This yields 200 features in total: 100 counts above (one per top word) and 100 counts below.
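
To make the target concrete, here is a minimal brute-force sketch of the same idea on a toy example. The token list, the two-word vocabulary, the window of 2, and the helper name `window_counts` are all made up for illustration; the notebook itself uses a sparse implementation further below.

```python
import numpy as np

# Hypothetical toy document and a two-word "top words" list
tokens = np.array(['the', 'drug', 'is', 'the', 'new', 'drug'])
top_words = ['the', 'drug']
window = 2  # look at most 2 tokens above and 2 tokens below

def window_counts(tokens, top_words, window):
    """Counts each top word among the `window` tokens above and below every position."""
    n = len(tokens)
    above = np.zeros((n, len(top_words)), dtype=np.int64)
    below = np.zeros((n, len(top_words)), dtype=np.int64)
    for i in range(n):
        for j, word in enumerate(top_words):
            # the token at position i itself is not counted
            above[i, j] = np.sum(tokens[max(0, i - window):i] == word)
            below[i, j] = np.sum(tokens[i + 1:i + 1 + window] == word)
    return np.hstack([above, below])

features = window_counts(tokens, top_words, window)
print(features.shape)  # (6, 4): 2 "above" columns followed by 2 "below" columns
print(features[2])     # [1 1 1 0] for the token 'is'
```

With 100 top words instead of 2, the same layout gives the 200 columns described above.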

## Table of Contents

1. [Load pre-processed data and best features](#Load-pre-processed-data-and-best-features)
1. [One-hot encode top words](#One-hot-encode-top-words)
1. [Define a function to shift sparse matrix](#Define-a-function-to-shift-sparse-matrix)
1. [Calculate cumulative sums](#Calculate-cumulative-sums)
1. [Save the features matrix](#Save-the-features-matrix)

In [1]:
import numpy as np
import pandas as pd
import sys
from collections import Counter
from IPython.display import display
from scipy.sparse import csr_matrix, csc_matrix, lil_matrix, save_npz, load_npz, hstack
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
%matplotlib inline

## Load pre-processed data and best features

In [2]:
data = np.load('files/data.npz', allow_pickle=True)
data.files

['company', 'train', 'labels', 'words']

In [3]:
best_features = np.load('files/best_features.npz', allow_pickle=True)['best_features']
best_features.shape

(100,)

## One-hot encode top words

First, we need to one-hot encode the words in our dataset. I will be using `sklearn`'s `OneHotEncoder` for this purpose, passing our 100 top words as a predefined list of categories.

In [4]:
# Reshape the data set to fit OneHotEncoder's input shape
words = data['words'].reshape(-1, 1)
words.shape

(23742514, 1)

In [5]:
# Instantiate the encoder and fit_transform the data
enc = OneHotEncoder(categories=[best_features], handle_unknown='ignore', dtype=np.int64)
words_enc = enc.fit_transform(words)
words_enc.shape

(23742514, 100)
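
Because of `handle_unknown='ignore'`, any word outside the 100 categories becomes an all-zero row rather than raising an error. The sketch below mimics that result with plain `numpy` and `scipy` on a made-up four-token example; it is only an illustration of the output layout, not how `OneHotEncoder` works internally.

```python
import numpy as np
from scipy.sparse import csr_matrix

top_words = np.array(['able', 'clinical'])  # hypothetical top words
tokens = np.array(['able', 'the', 'clinical', 'able'])

# Map each token to its column, or -1 if it is not a top word
column_of = {word: j for j, word in enumerate(top_words)}
cols = np.array([column_of.get(t, -1) for t in tokens])
rows = np.flatnonzero(cols >= 0)

# One stored 1 per known token; unknown tokens leave their row empty
onehot_demo = csr_matrix(
    (np.ones(len(rows), dtype=np.int64), (rows, cols[rows])),
    shape=(len(tokens), len(top_words)),
)
print(onehot_demo.toarray())
```

The second row (for `'the'`) stays empty, which is why the encoded matrix has fewer stored values than rows.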

To make sure that one-hot encoding ran correctly, let's spot check a couple of data points vs. the original word array:

In [6]:
# Calculate total counts by feature from the one-hot encoded array
totals = words_enc.sum(axis=0)

In [7]:
def spot_check_onehot(index):
    '''Spot checks one-hot encoding for a given feature index.'''
    
    word = enc.categories_[0][index]
    count1 = totals[0, index]
    count2 = (words==word).sum()
    
    print('Word: {}'.format(word))
    print('Count in one-hot encoded array: {}'.format(count1))
    print('Count in the full array: {}'.format(count2))
    print('Check passes: {}'.format(count1==count2))

In [8]:
# spot check for feature in position 4
spot_check_onehot(4)

Word: able
Count in one-hot encoded array: 9359
Count in the full array: 9359
Check passes: True


In [9]:
# spot check for feature in position 25
spot_check_onehot(25)

Word: clinical
Count in one-hot encoded array: 9147
Count in the full array: 9147
Check passes: True


## Define a function to shift sparse matrix

Next, we need a couple of helper functions that let us calculate running counts of each word above and below a given word, grouped by company name. While these functions might be more straightforward to write in `pandas`, the size of the data set makes `pandas` DataFrames, and even dense `numpy` arrays, impractical. Instead, I'll stick with the `csr_matrix` sparse format, which, luckily, is the default output type of `sklearn`'s `OneHotEncoder`.
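
As background, a CSR matrix stores three arrays: `data`, `indices`, and `indptr`. The small sketch below (toy values, not the notebook's data) shows how they relate and why editing `indptr` is enough to move whole rows up or down.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 1, 0],
                  [0, 0, 0],
                  [2, 0, 3]])
m = csr_matrix(dense)

print(m.data)     # non-zero values, row by row
print(m.indices)  # column index of each stored value
print(m.indptr)   # row i owns data[indptr[i]:indptr[i + 1]]

# Prepending a 0 to indptr and dropping its last entry moves every row down by one.
# The values of the original last row (here 2 and 3) fall off and must be removed.
n_last = m.indptr[-1] - m.indptr[-2]
keep = len(m.data) - n_last
shifted = csr_matrix(
    (m.data[:keep], m.indices[:keep], np.insert(m.indptr[:-1], 0, 0)),
    shape=m.shape,
)
print(shifted.toarray())
```

No stored value is touched except those that fall off the end, which keeps a shift cheap even for a matrix with millions of rows.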

In [10]:
def is_same_after_shift(vector, shift=1):
    '''Checks if the value in a numpy array stays the same if all values are shifted up or down.'''
    
    return (vector==np.roll(vector, shift)).astype(np.int64)
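
A quick illustration of what this comparison returns, using a made-up company vector. Note that `np.roll` wraps around, so the first entry is compared with the last one (and vice versa).

```python
import numpy as np

company = np.array(['A', 'A', 'B', 'B', 'B'])  # hypothetical company labels, one per token

# shift=1 compares each entry with the one directly above it
same_as_above = (company == np.roll(company, 1)).astype(np.int64)
# shift=-1 compares each entry with the one directly below it
same_as_below = (company == np.roll(company, -1)).astype(np.int64)

print(same_as_above)  # 0 marks the first token of each company
print(same_as_below)  # 0 marks the last token of each company
```

These 0/1 vectors are what later prevents counts from leaking across company boundaries.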

In [11]:
def csr_shift(csr, down=True):
    '''Shifts all rows of a CSR sparse matrix by one row down or up.'''
    
    y = csr.copy()
    
    if down:
        # prepend an empty row to the index pointer array and drop the last row's pointer
        y.indptr = np.insert(csr.indptr[:-1], 0, 0)
        
        # if the last row has values, drop all of them from the indices and data arrays
        number_of_values_to_drop = csr.indptr[-1] - csr.indptr[-2]
        if number_of_values_to_drop != 0:
            y.indices = y.indices[:-number_of_values_to_drop].copy()
            y.data = y.data[:-number_of_values_to_drop].copy()
    else:
        # drop the first row from the index pointer matrix, subtract its value from all remaining index pointers
        y.indptr = np.insert(csr.indptr[2:] - csr.indptr[1], 0, 0)
        # add an empty row (same value as the one before it) at the end of the index pointer matrix
        y.indptr = np.append(y.indptr, [y.indptr[-1]])
    
        # if the first row had values, drop them from index and data matrices
        number_of_values_to_drop = csr.indptr[1]
        if (number_of_values_to_drop!=0):
            y.indices = y.indices[number_of_values_to_drop:].copy()
            y.data = y.data[number_of_values_to_drop:].copy()
    
    y.check_format()
    
    return y

In [12]:
def spot_check_shift(csr_init, csr_shifted):
    '''Spot checks a shifted CSR matrix against the original (totals, head and tail).'''

    # check total counts
    total1 = csr_init.sum()
    total2 = csr_shifted.sum()
    print('Total counts')
    print('  - Original matrix: {}'.format(total1))
    print('  - Shifted matrix: {}'.format(total2))
    print('\n')
    
    # check heads
    print('Matrix head:')
    print('  - Original matrix:')
    print(csr_init[:10])
    print('  - Shifted matrix:')
    print(csr_shifted[:10])
    print('\n')
    
    # check tails
    print('Matrix tail:')
    print('  - Original matrix:')    
    print('\n'.join(str(csr_init).split('\n')[-3:]))
    print('  - Shifted matrix:')
    print('\n'.join(str(csr_shifted).split('\n')[-3:]))

In [13]:
# spot check shifting down
spot_check_shift(words_enc, csr_shift(words_enc))

Total counts
  - Original matrix: 11587505
  - Shifted matrix: 11587504


Matrix head:
  - Original matrix:
  (1, 1)	1
  (5, 82)	1
  (6, 13)	1
  - Shifted matrix:
  (2, 1)	1
  (6, 82)	1
  (7, 13)	1


Matrix tail:
  - Original matrix:
  (23742510, 89)	1
  (23742511, 27)	1
  (23742513, 1)	1
  - Shifted matrix:
  (23742510, 22)	1
  (23742511, 89)	1
  (23742512, 27)	1


In [14]:
# spot check shifting up
spot_check_shift(words_enc, csr_shift(words_enc, down=False))

Total counts
  - Original matrix: 11587505
  - Shifted matrix: 11587505


Matrix head:
  - Original matrix:
  (1, 1)	1
  (5, 82)	1
  (6, 13)	1
  - Shifted matrix:
  (0, 1)	1
  (4, 82)	1
  (5, 13)	1


Matrix tail:
  - Original matrix:
  (23742510, 89)	1
  (23742511, 27)	1
  (23742513, 1)	1
  - Shifted matrix:
  (23742509, 89)	1
  (23742510, 27)	1
  (23742512, 1)	1


## Calculate cumulative sums

Now that we have the necessary helper functions in place, we can define a custom cumulative sum function. It takes a one-hot encoded `csr_matrix` and a vector of company names and, over a given window, adds up all instances of each top word above or below each word, without crossing company boundaries:

In [15]:
def csr_cum_sum(onehot_csr, company_vector, window_size, above=True):
    '''Calculates cumulative sum of sparse matrix items above or below each row.'''
    
    cumsum_csr = csr_matrix(onehot_csr.shape, dtype=onehot_csr.dtype)
    shifted_csr = onehot_csr.copy()
    direction = 1 if above else -1
    
    for i in range(1, window_size + 1):
        print('   Processing step #{}/{}...'.format(i, window_size), end='\r', flush=True)
        # 1 where the row i steps away belongs to the same company, 0 otherwise
        same_company = csr_matrix(is_same_after_shift(company_vector, i * direction))
        # shift by one more row, so shifted_csr is now i rows away from the original
        shifted_csr = csr_shift(shifted_csr, above)
        # zero out rows that crossed a company boundary, then accumulate
        cumsum_csr += same_company.T.multiply(shifted_csr)
    print('   Calculations complete, variable assignment in process...')
    
    return cumsum_csr
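
The shift-mask-accumulate loop is easier to follow on a dense toy example. The sketch below uses made-up data (five tokens, two companies, a single tracked word, a window of 2) and plain `numpy` in place of the sparse helpers, so it illustrates the logic rather than the actual implementation.

```python
import numpy as np

# Hypothetical toy data: 5 tokens from two companies, one tracked word
company = np.array(['A', 'A', 'A', 'B', 'B'])
onehot = np.array([[1], [0], [1], [1], [0]])
window = 2

counts_above = np.zeros_like(onehot)
for step in range(1, window + 1):
    # Row i receives the value from row i - step
    shifted = np.roll(onehot, step, axis=0)
    shifted[:step] = 0  # discard rows that wrapped around from the end
    # Keep the contribution only if both rows belong to the same company
    same_company = (company == np.roll(company, step)).astype(onehot.dtype)
    counts_above += shifted * same_company[:, None]

print(counts_above.ravel())  # [0 1 1 0 1]
```

The fourth token is the first one of company `B`, so its count stays at 0 even though the tracked word appears directly above it in company `A`.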

Finally, we can define a function that takes a one-hot encoded data set, a vector with company names, and a window size, and extracts all necessary features from the data. More specifically, these features are the counts of our top 100 words above and below a given word within a given window (say, 100 words above and 100 words below).

Because `csr_matrix` format is not well suited for horizontal stacking (which we need to do after we calculate word counts above and below a given word), we'll temporarily convert the results into `csc_matrix` format. Also, given the size of our data set, this ends up being a highly memory-intensive operation. So, after the horizontal stacking is done, we'll drop all intermediate variables to free up memory and avoid `out of memory` errors.

In [16]:
def extract_features(onehot_csr, company_vec, window_size):
    '''Extracts cumulative counts above and below a given word given one-hot encoded text and window size.'''
    
    num_of_entries, num_of_words = onehot_csr.shape
    
    print('Calculating cumulative counts above:')
    above = csr_cum_sum(onehot_csr, company_vec, window_size, above=True).tocsc()
    
    print('Calculating cumulative counts below:')
    below = csr_cum_sum(onehot_csr, company_vec, window_size, above=False).tocsc()
    
    print('All calculations complete, stacking features in process...')
    X = hstack([above, below])
    
    print('Stacking complete, cleaning up memory...')
    del above
    del below
    
    print('Converting the matrix to csr format...')
    X = X.tocsr()
    
    return X
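
As a side note, `scipy.sparse.hstack` also accepts a `format` argument, so the output format can be requested directly. The toy sketch below only demonstrates the call on two small made-up matrices; whether it changes peak memory use on a matrix of this size has not been checked here.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins for the "above" and "below" count matrices
above_demo = csr_matrix(np.array([[1, 0], [0, 2]]))
below_demo = csr_matrix(np.array([[0, 3], [4, 0]]))

# Columns of below_demo are appended to the right of above_demo
X_demo = hstack([above_demo, below_demo], format='csr')
print(X_demo.format, X_demo.shape)
print(X_demo.toarray())
```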

In [17]:
# set the window size
window = 100

In [18]:
# extract all features
X = extract_features(words_enc, data['company'], window)

Calculating cumulative counts above:
   Calculations complete, variable assignment in process...
Calculating cumulative counts below:
   Calculations complete, variable assignment in process...
All calculations complete, stacking features in process...
Stacking complete, cleaning up memory...
Converting the matrix to csr format...


In [19]:
X.shape

(23742514, 200)

Now, to ensure that feature extraction ran correctly, we need to do a couple of spot checks vs. direct calculations on the original data.

In [20]:
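def spot_check(sparse, word_vector, company_vector, features, window_size, above=True, company=None, feature=None):
    '''Compares extracted counts for one company and one word against a direct rolling-sum calculation.'''
    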
    # if not specified, pull a random company and/or a random word
    if not company:
        company = np.random.choice(np.unique(company_vector))
        print('Sampled company:', company)
    if not feature:
        feature = np.random.choice(features)
        print('Sampled word:', feature)
    
    # calculate counts directly and store in array1
    step = [-1, 1][above]
    is_target_feature = (word_vector[company_vector==company]==feature)
    cumsum = np.cumsum(is_target_feature[::step])
    cumsum_tail = np.roll(cumsum, window_size)
    np.put(cumsum_tail, range(window_size), 0)
    rolling_sum = cumsum - cumsum_tail
    array1 = np.roll(np.nan_to_num(rolling_sum), 1)[::step]
    rolled_over = [-1, 0][above]
    array1[rolled_over] = 0
    
    # pull counts from the pre-calculated array and store in array2
    array2 = sparse[company_vector==company, features.tolist().index(feature)].toarray().flatten()
    
    not_matching = (array2!=array1)
    
    print('Number of mismatched entries:', sum(not_matching))
    
    return array1, array2, not_matching
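
The direct calculation in `spot_check` relies on a standard trick: a rolling sum is the difference between a cumulative sum and the same cumulative sum lagged by the window size. A minimal sketch on a made-up indicator vector, for the "above" direction only:

```python
import numpy as np

hits = np.array([1, 0, 1, 1, 0, 1])  # hypothetical indicator: 1 where the target word occurs
window = 2

cumsum = np.cumsum(hits)
# Cumulative sum lagged by `window` positions, zero-filled at the start
lagged = np.concatenate([np.zeros(window, dtype=cumsum.dtype), cumsum[:-window]])
inclusive = cumsum - lagged  # counts over positions i - window + 1 .. i
# Shift by one so that position i only counts the `window` entries strictly above it
counts_above = np.concatenate([[0], inclusive[:-1]])

print(counts_above)  # [0 1 1 1 2 1]
```

For the "below" direction, the same steps can be applied to the reversed vector and the result reversed back, which is what the `[::step]` slicing in `spot_check` does.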

In [21]:
def spot_check_samples(X, words_vector, company_vector, best_features, num_samples, window_size, above=True):
    '''Runs spot_check on several random company/word pairs and reports how many had no mismatches.'''
    
    samples = []
    for i in range(num_samples):
        print('Sample #{}:'.format(i+1))
        array1, array2, not_matching = spot_check(X, words_vector, company_vector, best_features, window_size, above)
        samples.append(sum(not_matching)==0)
        print('-'*100)
    print('Total number of samples without errors: {}/{}'.format(sum(samples), num_samples))

In [22]:
spot_check_samples(X[:,:100], data['words'], data['company'], best_features, 5, window)

Sample #1:
Sampled company: DPW Holdings, Inc.
Sampled word: is
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Sample #2:
Sampled company: AMERICAN WOODMARK CORP
Sampled word: marketing
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Sample #3:
Sampled company: PCT LTD
Sampled word: data
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Sample #4:
Sampled company: SMARTSHEET INC
Sampled word: forward-looking
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Sample #5:
Sampled company: Jacksam Corp
Sampled word: sales
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Total number of samples without errors: 5/5

In [23]:
spot_check_samples(X[:,100:], data['words'], data['company'], best_features, 5, window, above=False)

Sample #1:
Sampled company: REGO PAYMENT ARCHITECTURES, INC.
Sampled word: million
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Sample #2:
Sampled company: Frelii, Inc.
Sampled word: an
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Sample #3:
Sampled company: Achaogen, Inc.
Sampled word: or
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Sample #4:
Sampled company: Longwen Group Corp.
Sampled word: 
Number of mismatched entries: 0
----------------------------------------------------------------------------------------------------
Sample #5:
Sampled company: STONEMOR PARTNERS LP
Sampled word: customers
Number of mismatched entries: 0
---------------------------------------------------------------------------------------------------

## Save the features matrix

In [25]:
save_npz('files/X.npz', X)
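
The saved matrix can be read back with `load_npz`, which is already imported at the top of this notebook. The round trip below uses a tiny made-up matrix and a temporary directory instead of `files/X.npz`, just to show that the format and values are preserved.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, save_npz, load_npz

X_demo = csr_matrix(np.array([[0, 1], [2, 0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'X_demo.npz')  # hypothetical file name
    save_npz(path, X_demo)
    X_loaded = load_npz(path)

# The loaded matrix keeps the CSR format and has no differing entries
print(X_loaded.format, (X_demo != X_loaded).nnz == 0)
```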