# Preparing Text Data For Machine Learning
by **Dane Arnesen** on November 4, 2017. 

Our dataset is a collection of medical texts that we will use to classify different variants of cancer mutations. Because we cannot feed raw text into any kind of machine learning algorithm, we must first do some preprocessing to convert the text into some numeric representation. Below is a list of the high-level preprocessing tasks:
- Split the data into train and test sets 
- Separate the raw text on white space into individual words
- Remove all punctuation
- Set all words to lowercase
- Remove all words that are not purely compromised of alphabetical characters
- Remove stop words
- Perform stemming
- Remove words that are not at least 2 characters in length

### Load the Data
First we need to load the raw data from file. We will use the pandas package to assist us with this task. There are actually two files that we need to load: 
1. **Training Variants**: contains the genetic mutation class (the target) 
2. **Training Text**: contains the raw medical text 

Both files contain a field called **ID** which we can use to join the files together.

In [1]:
import os
import pandas as pd
 
# Set the working directory for the project
os.chdir('C://Users/Dane/Documents/GitHub/seis735_project/')

# Training variants
variants = pd.read_csv("data/raw/training_variants")

# Load the data from file
text = pd.read_csv("data/raw/training_text", 
                   sep="\|\|", 
                   header=None, 
                   skiprows=1, 
                   names=["ID","Text"],
                   engine="python"
                  )

print(variants.shape)
print(text.shape)

(3321, 4)
(3321, 2)


### Merge the Variants and Text
Now that the two files have been loaded into memory, we merge the files together using the **ID** field.

In [2]:
# Use inner join to merge the datasets on ID
merged = pd.merge(left=variants, right=text, how="inner", on="ID")

# Dropping the variants and text datasets as we won't need them anymore
del variants, text

print(merged.shape)
print(merged.dtypes)

(3321, 5)
ID            int64
Gene         object
Variation    object
Class         int64
Text         object
dtype: object


Get an idea of what the merged dataframe looks like

In [3]:
merged.head(10)

Unnamed: 0,ID,Gene,Variation,Class,Text
0,0,FAM58A,Truncating Mutations,1,Cyclin-dependent kinases (CDKs) regulate a var...
1,1,CBL,W802*,2,Abstract Background Non-small cell lung canc...
2,2,CBL,Q249E,2,Abstract Background Non-small cell lung canc...
3,3,CBL,N454D,3,Recent evidence has demonstrated that acquired...
4,4,CBL,L399V,4,Oncogenic mutations in the monomeric Casitas B...
5,5,CBL,V391I,4,Oncogenic mutations in the monomeric Casitas B...
6,6,CBL,V430M,5,Oncogenic mutations in the monomeric Casitas B...
7,7,CBL,Deletion,1,CBL is a negative regulator of activated recep...
8,8,CBL,Y371H,4,Abstract Juvenile myelomonocytic leukemia (JM...
9,9,CBL,C384R,4,Abstract Juvenile myelomonocytic leukemia (JM...


### Split the Data into Train and Test Sets
Our model needs to be able to have predictive power on previously unseen data. It is important that we do not train our model using the entire dataset, because this will give our model prior knowledge of the validation data. In other words, we run into a leaky data problem. Therefore, when preparing the data to train the model, we constrain it to only the designated training data.

In [4]:
from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train, test = train_test_split(merged, test_size=0.1, random_state=20171104)

print(train.shape)
print(test.shape)

(2988, 5)
(333, 5)


In [5]:
train.head(15)

Unnamed: 0,ID,Gene,Variation,Class,Text
769,769,ERBB2,S310Y,7,A 58-year-old woman was evaluated at our insti...
1640,1640,FLT3,FLT3 internal tandem duplications,7,Internal tandem duplications of the FMS-like t...
2635,2635,BRCA1,P1856T,6,Abstract The BRCA1 gene from individuals at ...
2706,2706,BRAF,K483E,2,Precision medicine approaches are ideally suit...
2113,2113,SRC,Amplification,2,"The non-receptor tyrosine kinase c-Src, hereaf..."
2290,2290,STAT3,S614R,7,Lymphomas arising from NK or gd-T cells are ve...
2834,2834,BRCA2,K2950N,5,Mutation screening of the breast and ovarian c...
3233,3233,NTRK2,R715G,5,Purpose TrkB has been involved in poor cancer...
736,736,ERBB2,T733I,2,Purpose: Mutations associated with resistance ...
2044,2044,IL7R,T244_I245insCPT,7,Signaling mediated by IL-7 and IL-7R is essent...


After splitting the datasets, we want to verify we have relatively similar distribution of the target class in both the training and test dataset.

In [6]:
# Print the number of each class in each dataset. 
print(train.groupby('Class').agg({'Class':'count'}).apply(lambda x: 100.0 * x / float(x.sum())))
print()
print(test.groupby('Class').agg({'Class':'count'}).apply(lambda x: 100.0 * x / float(x.sum())))

           Class
Class           
1      17.068273
2      13.386881
3       2.710843
4      20.649264
5       7.463186
6       8.467202
7      28.781794
8       0.468541
9       1.004016

           Class
Class           
1      17.417417
2      15.615616
3       2.402402
4      20.720721
5       5.705706
6       6.606607
7      27.927928
8       1.501502
9       2.102102


### Define our Vocabulary
We will create a function called **clean_doc** that will perform the following tasks:
- Separate the raw text on white space into individual words 
- Remove all punctuation 
- Set all words to lowercase
- Remove all words that are not purely compromised of alphabetical characters
- Remove stop words
- Perform stemming
- Remove words that are not at least 2 characters in length

In [7]:
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Function that turns a doc into clean tokens
def clean_doc(doc, stemmer, stop_words):
    # Split into individual tokens by white space
    tokens = doc.split()
    # Remove punctuation and set to lowercase
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table).lower() for w in tokens]
    # Remove words that are not entirely alphabetical
    tokens = [w for w in tokens if w.isalpha()]
    # Removing all known stop words
    tokens = [w for w in tokens if not w in stop_words]
    # Remove tokens that aren't at least two characters in length
    tokens = [w for w in tokens if len(w) > 1]
    # Stem the remaining tokens
    tokens = [stemmer.stem(w) for w in tokens]
    return(tokens)

In [9]:
# Get a distinct list of stop words
stop_words = set(stopwords.words('english'))

# Initialize a stemmer
stemmer = PorterStemmer()

We've built a function that will do pre-processing on a single text. We want to apply that function to our entire collection of texts in order to build our vocabulary. We will develop our vocabulary as a counter, which is a dictionary that maps words to their counts.

In [10]:
from collections import Counter

# Define vocab
vocab = Counter()

# Iterate over each of the texts in our training sample
for text in train['Text']:
    # Create a list of tokens
    tokens = clean_doc(text, stemmer, stop_words)
    # Add tokens to vocab
    vocab.update(tokens)

In [11]:
# Print the size of the vocab
print('Size of the vocabulary: %d' % len(vocab))
print()

# Print the 20 most common words in the vocab
print(vocab.most_common(20))

Size of the vocabulary: 87330

[('mutat', 327526), ('cell', 266388), ('activ', 168430), ('mutant', 107869), ('protein', 107245), ('express', 106428), ('use', 103634), ('tumor', 102478), ('cancer', 100389), ('et', 96765), ('patient', 96397), ('figur', 92763), ('al', 92599), ('fig', 92463), ('gene', 88950), ('variant', 81290), ('domain', 74107), ('function', 69263), ('result', 67523), ('studi', 65508)]


We can remove all words from the vocabulary that occur very infrequently. These words are likely not very important to making any sort of prediction. For the time being we'll remove words that do not occur at least n times. This reduces our vocabulary by almost half.

In [12]:
# Removing words that do not occur at least n number of times in the vocabulary
tokens = [k for k,c in vocab.items() if c >= 10]

print(len(tokens))

23899


Let's save our vocabulary to a text file so that we can use it later.

In [13]:
# Function to save list to file
def save_list(lines, filename):
    # convert lines to a single blob of text
    data = '\n'.join(lines)
    # open file
    file = open(filename, 'w')
    # write text
    file.write(data)
    # close file
    file.close()

# Save tokens to a vocabulary file
save_list(tokens, 'data/interim/vocab.txt')

### Represent Text As A Vector
In order to feed our text data into a machine learning algorithm like XGBoost or neural networks, it needs to be converted from text into a numerical vector. Each medical text will be converted into a vector, where the number of items in the vector corresponds to the words in our previously defined vocabulary. Most importantly, each word in the text will be scored. There are multiple scoring methods, including but not limited to: 
- Binary: Words are marked as present (1) or not present (0)
- Count: Counts the occurrence of each word in the document
- Tfidf: Scores each word based on occurrence within a document and across all documents. Words that are frequent across many documents will receive a lower score.
- Frequency: Scores each word based on their frequency of occurrence within the document

First, load the vocab that we previously created.

In [14]:
# Open the vocab file
file = open('data/interim/vocab.txt', 'r')

# Read the vocab from file
vocab = file.read()

# Close the file
file.close()

# Unique list of our vocab
vocab = set(vocab.split())

print(len(vocab))

23899


Next we need to filter each of our texts down to only the words in our defined vocabulary. Important note, although we've filtered out all of the words that aren't in our defined vocabulary, all of the words are still in order of occurrence in each text.

In [15]:
# A container object that will hold the words of each individual document
lines = list()

# Iterate over each of the texts in our training sample
for text in train['Text']:
    # Create a list of tokens
    tokens = clean_doc(text, stemmer, stop_words)
    # Filter the words in the document by our defined vocabulary
    tokens = [w for w in tokens if w in vocab]
    # Concatentate each word in the document by a single space and append to our lines container
    lines.append(' '.join(tokens))

# Printing the size of our lines object. It should be 2,988 in length
print(len(lines))

2988


In [16]:
for i,l in enumerate(lines):
    print(l[:25])
    if i > 25:
        break

woman evalu institut mana
intern tandem duplic fmsl
abstract gene individu ri
precis medicin approach i
nonreceptor tyrosin kinas
lymphoma aris nk gdt cell
mutat screen breast ovari
purpos trkb involv poor c
purpos mutat associ resis
signal mediat essenti nor
heterozyg mutat gene enco
pten phosphatas tensin ho
pioneer transcript factor
patient chronic myeloid l
abstract cancerspecif mut
protein kinas braf mutat 
abstract gene individu ri
tyrosin kinas domain muta
mutat account major hered
publish analys effect mis
hallmark stem cell myelop
screen tumor suppressor g
kinaseakt pathway promot 
gefitinib effect firstlin
purpos kit major oncogen 
forkhead box fox superfam
era person medicin unders


Saving the cleansed training text in case we want to reference it later. The cleansing step takes a long time, so it would help if we could avoid repeating that step.

In [17]:
import pickle

# Pickle the cleansed training text
with open('models/training_text.pickle', 'wb') as output:
    pickle.dump(lines, output, pickle.HIGHEST_PROTOCOL)

Filter the test data using the exact same approach.

In [18]:
# A container object that will hold the words of each individual document
lines_test = list()

# Iterate over each of the texts in our training sample
for text in test['Text']:
    # Create a list of tokens
    tokens = clean_doc(text, stemmer, stop_words)
    # Filter the words in the document by our defined vocabulary
    tokens = [w for w in tokens if w in vocab]
    # Concatentate each word in the document by a single space and append to our lines container
    lines_test.append(' '.join(tokens))

# Printing the size of our lines object. It should be 2,988 in length
print(len(lines_test))

333


In [19]:
for i,l in enumerate(lines_test):
    print(l[:25])
    if i > 25:
        break

nuclear factor erythroid 
cancer aris owe mutat sub
tumor suppressor protein 
inherit mutat affect locu
compar genom hybrid cgh r
util genet screen yeast s
cancer genom character ef
activ cyclin ddepend kina
background numer biolog e
oxid electrophil stress s
famili adenomat polyposi 
alter ofth egfr gene occu
sequencespecif dna bind e
graphic abstract imag unl
report fusion monocyt leu
pten phosphatas tensin ho
dna helicas directli inte
rhabdomyosarcoma rm child
cell lung cancer nsclc di
function character cancer
fusion defin subset lung 
purpos plateletderiv grow
abstract classif rare mis
mutat caus inactiv tumor 
cancer develop acquisit c
function character cancer
gene encod transcript act


In [20]:
# Pickle the cleansed test text
with open('models/test_text.pickle', 'wb') as output:
    pickle.dump(lines_test, output, pickle.HIGHEST_PROTOCOL)

We will use Keras to help us wrap up our preprocessing tasks. Keras has a class called **Tokenizer** which can help us encode our data in matrix format using the four different methods we described above.

In [21]:
from keras.preprocessing.text import Tokenizer

# Instantiate a tokenizer
tokenizer = Tokenizer()

# Fit the tokenizer on the documents
tokenizer.fit_on_texts(lines)

Using TensorFlow backend.


Before moving any further, we should save the trained tokenizer object so we can reference it again at a later point in time. We'll use pickle to save the tokenizer.

In [22]:
import pickle

# Pickle the tokenizer
with open('models/tokenizer.pickle', 'wb') as output:
    pickle.dump(tokenizer, output, pickle.HIGHEST_PROTOCOL)
    
# Load our tokenizer object that was already trained 
#with open('models/tokenizer.pickle', 'rb') as obj:
#    tokenizer = pickle.load(obj)

Now, let's encode our data using the **Frequency** method.

In [23]:
train_vector = tokenizer.texts_to_matrix(lines, mode='freq')
print(train_vector.shape)

(2988, 23900)


In [24]:
train_vector[0:20, 1:5]

array([[ 0.03722804,  0.02127317,  0.0069299 ,  0.00918614],
       [ 0.02322581,  0.0116129 ,  0.01225806,  0.00193548],
       [ 0.04157667,  0.00215983,  0.01781857,  0.00809935],
       [ 0.01090151,  0.0132714 ,  0.03507441,  0.01516731],
       [ 0.00209986,  0.02438171,  0.02449837,  0.00116659],
       [ 0.02727542,  0.0444708 ,  0.00770827,  0.01156241],
       [ 0.02140613,  0.00039277,  0.00078555,  0.        ],
       [ 0.0265596 ,  0.05188388,  0.01050031,  0.01050031],
       [ 0.05757242,  0.02181885,  0.01210121,  0.00311698],
       [ 0.02622951,  0.03636364,  0.004769  ,  0.01639344],
       [ 0.0367878 ,  0.01435621,  0.01076716,  0.01570211],
       [ 0.04830508,  0.00988701,  0.02542373,  0.0019774 ],
       [ 0.00705171,  0.05372733,  0.00235057,  0.00436535],
       [ 0.01406534,  0.00181488,  0.00317604,  0.00045372],
       [ 0.02444324,  0.01792504,  0.01412276,  0.03910918],
       [ 0.01983038,  0.01234722,  0.05861811,  0.0193315 ],
       [ 0.02125057,  0.

Convert the matrix to a dataframe and export to csv.

In [25]:
# Convert the train vector to a dataframe 
train_df = pd.DataFrame(train_vector[:,1:], columns=[key for key in tokenizer.word_counts])

print(train_df.shape)
print(train.shape)

# Merge the original train dataset to the vectorized dataset
train_final = pd.concat([train.reset_index(drop=True), train_df], axis=1)

print(train_final.shape)

# Drop the text blob from the original train dataset
train_final.drop(['Text'], inplace=True, axis=1)

# Confirm the dataframe shape
print(train_final.shape)

(2988, 23899)
(2988, 5)
(2988, 23904)
(2988, 23903)


In [26]:
train_final[['Class','woman','evalu','institut']].head(20)

Unnamed: 0,Class,woman,evalu,institut
0,7,0.037228,0.021273,0.00693
1,7,0.023226,0.011613,0.012258
2,6,0.041577,0.00216,0.017819
3,2,0.010902,0.013271,0.035074
4,2,0.0021,0.024382,0.024498
5,7,0.027275,0.044471,0.007708
6,5,0.021406,0.000393,0.000786
7,5,0.02656,0.051884,0.0105
8,2,0.057572,0.021819,0.012101
9,7,0.02623,0.036364,0.004769


In [27]:
# Doing some cleanup
del train_df, train_vector

In [28]:
# Export the dataframe to a compressed file
train_final.to_csv('data/interim/train_freq.gz', index=False, compression='gzip')

In [29]:
# Convert the test data to matrix format using the frequency method
test_vector = tokenizer.texts_to_matrix(lines_test, mode='freq')

# Convert the train vector to a dataframe 
test_df = pd.DataFrame(test_vector[:,1:], columns=[key for key in tokenizer.word_counts])

# Merge the original train dataset to the vectorized dataset
test_final = pd.concat([test.reset_index(drop=True), test_df], axis=1)

# Drop the text blob from the original train dataset
test_final.drop(['Text'], inplace=True, axis=1)

# Confirm the dataframe shape
print(test_final.shape)

# Doing some cleanup
del test_df, test_vector

# Export the dataframe to a csv file
test_final.to_csv('data/interim/test_freq.gz', index=False, compression='gzip')

(333, 23903)


Encode our data using the **binary** method.

In [30]:
# Convert the train data to matrix format using the binary method
train_vector = tokenizer.texts_to_matrix(lines, mode='binary')

# Convert the train vector to a dataframe 
train_df = pd.DataFrame(train_vector[:,1:], columns=[key for key in tokenizer.word_counts])

# Merge the original train dataset to the vectorized dataset
train_final = pd.concat([train.reset_index(drop=True), train_df], axis=1)

# Drop the text blob from the original train dataset
train_final.drop(['Text'], inplace=True, axis=1)

# Confirm the dataframe shape
print(train_final.shape)

# Doing some cleanup
del train_df, train_vector

# Export the dataframe to a compressed file
train_final.to_csv('data/interim/train_binary.gz', index=False, compression='gzip')

(2988, 23903)


In [31]:
# Convert the test data to matrix format using the binary method
test_vector = tokenizer.texts_to_matrix(lines_test, mode='binary')

# Convert the train vector to a dataframe 
test_df = pd.DataFrame(test_vector[:,1:], columns=[key for key in tokenizer.word_counts])

# Merge the original train dataset to the vectorized dataset
test_final = pd.concat([test.reset_index(drop=True), test_df], axis=1)

# Drop the text blob from the original train dataset
test_final.drop(['Text'], inplace=True, axis=1)

# Confirm the dataframe shape
print(test_final.shape)

# Doing some cleanup
del test_df, test_vector

# Export the dataframe to a compressed file
test_final.to_csv('data/interim/test_binary.gz', index=False, compression='gzip')

(333, 23903)


Encode our data using the **Tfidf** method.

In [32]:
# Convert the train data to matrix format using the tfidf method
train_vector = tokenizer.texts_to_matrix(lines, mode='tfidf')

# Convert the train vector to a dataframe 
train_df = pd.DataFrame(train_vector[:,1:], columns=[key for key in tokenizer.word_counts])

# Merge the original train dataset to the vectorized dataset
train_final = pd.concat([train.reset_index(drop=True), train_df], axis=1)

# Drop the text blob from the original train dataset
train_final.drop(['Text'], inplace=True, axis=1)

# Confirm the dataframe shape
print(train_final.shape)

# Doing some cleanup
del train_df, train_vector

# Export the dataframe to a compressed file
train_final.to_csv('data/interim/train_tfidf.gz', index=False, compression='gzip')

(2988, 23903)


In [33]:
# Convert the test data to matrix format using the tfidf method
test_vector = tokenizer.texts_to_matrix(lines_test, mode='tfidf')

# Convert the train vector to a dataframe 
test_df = pd.DataFrame(test_vector[:,1:], columns=[key for key in tokenizer.word_counts])

# Merge the original train dataset to the vectorized dataset
test_final = pd.concat([test.reset_index(drop=True), test_df], axis=1)

# Drop the text blob from the original train dataset
test_final.drop(['Text'], inplace=True, axis=1)

# Confirm the dataframe shape
print(test_final.shape)

# Doing some cleanup
del test_df, test_vector

# Export the dataframe to a compressed file
test_final.to_csv('data/interim/test_tfidf.gz', index=False, compression='gzip')

(333, 23903)


Encode our data using the **Count** method

In [34]:
# Convert the train data to matrix format using the count method
train_vector = tokenizer.texts_to_matrix(lines, mode='count')

# Convert the train vector to a dataframe 
train_df = pd.DataFrame(train_vector[:,1:], columns=[key for key in tokenizer.word_counts])

# Merge the original train dataset to the vectorized dataset
train_final = pd.concat([train.reset_index(drop=True), train_df], axis=1)

# Drop the text blob from the original train dataset
train_final.drop(['Text'], inplace=True, axis=1)

# Confirm the dataframe shape
print(train_final.shape)

# Doing some cleanup
del train_df, train_vector

# Export the dataframe to a compressed file
train_final.to_csv('data/interim/train_count.gz', index=False, compression='gzip')

(2988, 23903)


In [35]:
# Convert the test data to matrix format using the count method
test_vector = tokenizer.texts_to_matrix(lines_test, mode='count')

# Convert the train vector to a dataframe 
test_df = pd.DataFrame(test_vector[:,1:], columns=[key for key in tokenizer.word_counts])

# Merge the original train dataset to the vectorized dataset
test_final = pd.concat([test.reset_index(drop=True), test_df], axis=1)

# Drop the text blob from the original train dataset
test_final.drop(['Text'], inplace=True, axis=1)

# Confirm the dataframe shape
print(test_final.shape)

# Doing some cleanup
del test_df, test_vector

# Export the dataframe to a compressed file
test_final.to_csv('data/interim/test_count.gz', index=False, compression='gzip')

(333, 23903)
