## XGBoost with Word2Vec Word Encoding
This model adds a pre-processing step based on Word2Vec word embeddings. Each word's embedding vector is averaged into a single value and combined with the TF-IDF dataset. The resulting data is then used to train an XGBoost model.

### Data Prep
First, we need to load and prepare the dataset. In a previous exercise we already cleansed the text per our requirements, so here we just import those pickled objects.

In [None]:
import os
import pickle
import pandas as pd

# Set the working directory for the project
os.chdir('C://Users/Dane/Documents/GitHub/seis735_project/')

# Import the pre-defined training tokens
with open('models/training_text.pickle', 'rb') as obj:
    texts_train = pickle.load(obj)
    
# Import the pre-defined test tokens
with open('models/test_text.pickle', 'rb') as obj:
    texts_test = pickle.load(obj)
    
# Printing the size of our lines object. It should be 2,988 in length
print(len(texts_train))
print(len(texts_test))

Next we need to split our cleansed texts into tokens (words).

In [None]:
tokens_train = [line.split() for line in texts_train]
tokens_test = [line.split() for line in texts_test]
print(len(tokens_train))
print(len(tokens_test))

We are now ready to train a Word2Vec model. We will use Gensim for this task.

In [None]:
from gensim.models import Word2Vec

# Train our model on tokens_train
# Note: gensim 4.0+ renamed `size` to `vector_size`; use vector_size=100 there
model = Word2Vec(tokens_train, min_count=10, size=100)

# Summarize the model
print(model)

# Save the model
model.save('models/word2vec_train.bin')
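
The `min_count=10` argument tells Gensim to ignore any word whose total frequency in the training tokens is below 10, so those words get no vector. The sketch below mimics that threshold with a plain `Counter` on a toy corpus. The helper name `words_kept` and the toy data are hypothetical; this only illustrates the filtering idea, not Gensim's internals.

```python
from collections import Counter

def words_kept(token_lists, min_count):
    """Return the set of words whose total count meets min_count.

    Hypothetical helper: it only illustrates the frequency threshold,
    it does not reproduce Gensim's vocabulary building.
    """
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word for word, n in counts.items() if n >= min_count}

# Toy corpus (made up for illustration)
toy_tokens = [
    "mutation in gene".split(),
    "gene expression and mutation".split(),
    "rare variant in gene".split(),
]

kept = words_kept(toy_tokens, min_count=2)
print(sorted(kept))
```

Running something like this on `tokens_train` with `min_count=10` gives a rough idea of how many words will have a vector before training the model.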

Now, let's import our TF-IDF datasets.

In [None]:
# Import the files
train = pd.read_csv('data/interim/train_freq.gz', compression='gzip', encoding='ISO-8859-1')
test = pd.read_csv('data/interim/test_freq.gz', compression='gzip', encoding='ISO-8859-1')

# Drop the attributes that aren't needed for the prediction
train.drop(['ID','Gene','Variation'], inplace=True, axis=1)
test.drop(['ID','Gene','Variation'], inplace=True, axis=1)
print(train.shape)
print(test.shape)

In [None]:
# Drop columns whose names appear to be mis-encoded (non-ASCII debris)
train.drop(['aë\x9a','ï\x83','ï\x88','ï\x83aweight'], axis=1, inplace=True)
test.drop(['aë\x9a','ï\x83','ï\x88','ï\x83aweight'], axis=1, inplace=True)

Next, we want to weight the TF-IDF value of each word by the average of that word's Word2Vec embedding vector.

In [None]:
import numpy as np

# Initialize an empty dictionary
avg_encoding = dict()

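# Get the average value of the embedding vector for each word
# (the first column is skipped; it is used as the target further down)
# model.wv[c] replaces the deprecated model[c] lookup
for c in train.columns[1:]:
    avg_encoding[c] = np.mean(model.wv[c])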

print(len(avg_encoding))
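
Because the Word2Vec model was trained with `min_count=10`, a TF-IDF column whose word is rarer than that in the training tokens has no vector, and looking it up raises a `KeyError`. Below is a minimal sketch of one way to guard against this. It assumes a fallback weight of 0.0 is acceptable (which zeroes out that column), and it uses a plain dict named `toy_vectors` as a hypothetical stand-in for the trained model's word vectors.

```python
def average_encoding(columns, vectors, default=0.0):
    """Map each column name to the mean of its embedding vector.

    `vectors` is any mapping from word to a sequence of floats; here it is
    a hypothetical stand-in for the trained model's word vectors.
    Words missing from `vectors` get `default` and are reported back.
    """
    averages, missing = {}, []
    for word in columns:
        if word in vectors:
            vec = vectors[word]
            averages[word] = sum(vec) / len(vec)
        else:
            averages[word] = default
            missing.append(word)
    return averages, missing

# Toy 4-dimensional vectors (made up for illustration)
toy_vectors = {
    "gene": [0.2, -0.1, 0.4, 0.3],
    "mutation": [-0.5, 0.1, 0.0, 0.0],
}

avg, missing = average_encoding(["gene", "mutation", "zebrafish"], toy_vectors)
print(avg)
print(missing)
```

Returning the list of missing words makes it easy to see how many columns were affected. Passing `default=1.0` instead would leave the TF-IDF values of those columns unchanged rather than zeroing them.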

In [None]:
# Transform the training data
for c in train.columns[1:]:
    train[c] = train[c] * avg_encoding[c]

In [None]:
# Transform the test data
for c in test.columns[1:]:
    test[c] = test[c] * avg_encoding[c]
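
The two loops above multiply every TF-IDF column by a single scalar weight. Here is a small sketch of the same arithmetic on a toy matrix, using NumPy broadcasting instead of a loop. The array values and the names `tfidf` and `weights` are made up for illustration.

```python
import numpy as np

# Toy TF-IDF matrix: 2 documents (rows) x 3 words (columns)
tfidf = np.array([
    [0.5, 0.0, 0.2],
    [0.1, 0.3, 0.0],
])

# One averaged-embedding weight per word, in column order
weights = np.array([0.2, -0.1, 0.4])

# Broadcasting multiplies each column j by weights[j]
weighted = tfidf * weights

# The explicit column loop gives the same result
looped = tfidf.copy()
for j in range(tfidf.shape[1]):
    looped[:, j] = tfidf[:, j] * weights[j]

print(weighted)
print(np.allclose(weighted, looped))
```

Note that a negative weight flips the sign of its column, so the weighted features are no longer guaranteed to be non-negative.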

Before feeding this dataset into XGBoost, we should perform feature reduction. We'll use an autoencoder to help us with this task.

In [None]:
# Convert the training and test data into matrix format
# (.values replaces DataFrame.as_matrix(), which was removed in pandas 1.0)
x_train = train.values[:, 1:]
y_train = train.values[:, 0]
x_test = test.values[:, 1:]
y_test = test.values[:, 0]

# Cleanup
del train, test

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
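
The autoencoder defined next ends in a sigmoid layer, whose outputs lie strictly between 0 and 1, so inputs outside that range cannot be reconstructed exactly. If the weighted features do fall outside it, one option is to rescale them first. The following is a minimal sketch, assuming that min-max scaling fitted on the training matrix only is acceptable here. The helper names `fit_min_max` and `apply_min_max` are hypothetical, and the toy arrays stand in for `x_train` and `x_test`.

```python
import numpy as np

def fit_min_max(x):
    """Return per-column minimum and range, computed on the training data."""
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Avoid division by zero for constant columns
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(x, col_min, col_range):
    """Scale columns with the training statistics and clip to [0, 1]."""
    return np.clip((x - col_min) / col_range, 0.0, 1.0)

# Toy stand-ins for x_train and x_test (made up for illustration)
toy_train = np.array([[-0.2, 0.0], [0.4, 0.0], [0.1, 0.0]])
toy_test = np.array([[0.7, 0.3]])

col_min, col_range = fit_min_max(toy_train)
train_scaled = apply_min_max(toy_train, col_min, col_range)
test_scaled = apply_min_max(toy_test, col_min, col_range)

print(train_scaled)
print(test_scaled)
```

Fitting the statistics on the training data and reusing them for the test data keeps test information out of the pre-processing. Test values beyond the training range are clipped in this sketch.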

In [1]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout

dims = x_train.shape[1]

# This is the shape of our input
inputs = Input(shape=(dims,))

# These are the model layers
encoded = Dropout(rate=0.25)(inputs)
encoded = Dense(1000, activation='relu')(encoded)
decoded = Dense(dims, activation='sigmoid')(encoded)

# This model maps an input to its reconstruction
autoencoder = Model(inputs, decoded)

# This model maps an input to its encoded representation
encoder = Model(inputs, encoded)

# Compile the autoencoder
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

# Train the model to reconstruct its own input
autoencoder.fit(x_train, x_train, epochs=15, batch_size=50)
