## Word2Vec

A series of Word2Vec experiments. They use a custom class called Word2Vec, which relies on gensim's word2vec.

In [1]:
%load_ext autoreload
%autoreload 2
import numpy as np
from word2vec import generateBin, Word2Vec
from sklearn.linear_model import LogisticRegression
from helpers import trainScore, testScore



Importing positive, negative and test tweets

In [2]:
with open ('train_pos.txt', encoding = 'utf-8') as f:
    pos = f.readlines ()
with open ('train_neg.txt', encoding = 'utf-8') as f:
    neg = f.readlines ()
with open ('test_data.txt', encoding = 'utf-8') as f:
    test = f.readlines ()

## Segmented

Here we'll generate two word2vec models: one trained on the positive tweets, the other on the negative tweets.

In [3]:
generateBin ('train_pos.txt', 'pos_model.bin', nbFeatures = 50)
generateBin ('train_neg.txt', 'neg_model.bin', nbFeatures = 50)

Bin file pos_model.bin has been created
Bin file neg_model.bin has been created


Creating positive Word2Vec model

In [4]:
pos_word2vec = Word2Vec ('pos_model.bin', pos, neg, 50)

Word2Vec Instantiation
	Converting Train Set...
		Extracting Features from Positive Tweets... [####################]    (100%)    ETA : 20:44:12
		Extracting Features from Negative Tweets... [####################]    (100%)    ETA : 20:44:17
	Standardizing...
Terminated
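
The custom `Word2Vec` class is not shown in this notebook. One common way to turn a tweet into a fixed-length feature vector is to average the embeddings of its words; the sketch below illustrates that idea with a toy embedding table. `tweet_to_vector` and the `embeddings` dictionary are hypothetical names, not the class's actual API.

```python
import numpy as np

def tweet_to_vector(tweet, embeddings, nb_features):
    """Average the embeddings of the known words of a tweet.

    Words missing from `embeddings` are skipped; a tweet with no known
    word maps to the zero vector.
    """
    vectors = [embeddings[w] for w in tweet.split() if w in embeddings]
    if not vectors:
        return np.zeros(nb_features)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional embedding table standing in for a trained model
# (assumption: the real models use 50 dimensions, as passed above).
embeddings = {
    'good': np.array([1.0, 0.0]),
    'day': np.array([0.0, 1.0]),
}

features = np.vstack([
    tweet_to_vector(t, embeddings, 2)
    for t in ['good day', 'good unknown', 'nothing here']
])
print(features.shape)  # (3, 2)
```

Each tweet ends up with the same number of features regardless of its length, which is what a classifier such as logistic regression needs.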


Creating negative Word2Vec model

In [5]:
neg_word2vec = Word2Vec ('neg_model.bin', pos, neg, 50)

Word2Vec Instantiation
	Converting Train Set...
		Extracting Features from Positive Tweets... [####################]    (100%)    ETA : 20:44:28
		Extracting Features from Negative Tweets... [####################]    (100%)    ETA : 20:44:33
	Standardizing...
Terminated


Concatenating into full train set

In [6]:
X_seg = np.hstack ((pos_word2vec.getX (), neg_word2vec.getX ()))
Y_seg = pos_word2vec.getY ()
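
Both feature matrices are built from the same tweets in the same order, so stacking them column-wise gives each tweet one row holding the features from both models. A small illustration with made-up arrays (the names and shapes are stand-ins, not the real data):

```python
import numpy as np

# Stand-ins for pos_word2vec.getX () and neg_word2vec.getX ():
# 4 tweets, 3 features per model (the notebook uses 50 per model).
X_from_pos_model = np.arange(12).reshape(4, 3)
X_from_neg_model = -np.arange(12).reshape(4, 3)

# Column-wise concatenation: rows stay aligned, feature counts add up.
X = np.hstack((X_from_pos_model, X_from_neg_model))

print(X.shape)  # (4, 6)
```

With 50 features per model, the real `X_seg` would therefore have 100 columns. Taking the labels from only one of the two objects presumably works because the labels depend on the tweets, not on the embedding model.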

Using Logistic Regression model

In [7]:
clf_seg = LogisticRegression ()
clf_seg.fit (X_seg, Y_seg)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Scoring on train

In [8]:
trainScore (X_seg, Y_seg, clf_seg)

Computing Train Score [####################]    (100%)    ETA : 20:46:02


0.7800947368421053
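
`trainScore` comes from the local `helpers` module, which is not shown here. Assuming it reports plain accuracy (the fraction of tweets whose predicted label matches the true label), the computation can be sketched as follows; `accuracy` is a hypothetical name.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels: 1 for positive, -1 for negative (the encoding is an assumption).
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]
print(accuracy(y_true, y_pred))  # 0.75
```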

Creating test features

In [9]:
test_X_seg = np.hstack ((pos_word2vec.convertTest (test), neg_word2vec.convertTest (test)))

Converting Test Set
	Extracting Features from Test Tweets... [####################]    (100%)    ETA : 20:46:29
	Standardizing...
Terminated
Converting Test Set
	Extracting Features from Test Tweets... [####################]    (100%)    ETA : 20:46:30
	Standardizing...
Terminated
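
The log shows a standardizing step for the test features as well. How `convertTest` does this is not visible here; a common practice is to reuse the mean and standard deviation computed on the train features, so that train and test features share the same scale. A minimal sketch of that practice, with hypothetical helper names:

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-column mean and standard deviation of the train features."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    return mean, std

def standardize(X, mean, std):
    """Center and scale X with statistics computed elsewhere."""
    return (X - mean) / std

X_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_test = np.array([[3.0, 12.0]])

mean, std = fit_standardizer(X_train)
X_train_std = standardize(X_train, mean, std)
X_test_std = standardize(X_test, mean, std)  # uses train statistics, not its own
```

Standardizing the test set with its own statistics would shift its features relative to what the classifier saw during training.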


Scoring on test

In [11]:
testScore (test_X_seg, clf_seg)

Computing Test Score [####################]    (100%)    ETA : 20:46:44


0.7665

## Single

This time we won't distinguish between positive and negative tweets: we build a single Word2Vec model from both train sets.

First, generating the full train set.

In [15]:
with open ('train_all.txt', 'w', encoding = 'utf-8') as f:
    for path in ('train_pos.txt', 'train_neg.txt'):
        with open (path, encoding = 'utf-8') as src:
            f.write (src.read ())

Generating the single model

In [16]:
generateBin ('train_all.txt', 'single_model.bin')

Bin file single_model.bin has been created


Creating the Word2Vec model

In [17]:
single_word2vec = Word2Vec ('single_model.bin', pos, neg)

Word2Vec Instantiation
	Converting Train Set...
		Extracting Features from Positive Tweets... [####################]    (100%)    ETA : 20:49:49
		Extracting Features from Negative Tweets... [####################]    (100%)    ETA : 20:49:54
	Standardizing...
Terminated


Using Logistic Regression model

In [18]:
clf_single = LogisticRegression ()
clf_single.fit (single_word2vec.getX (), single_word2vec.getY ())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Scoring on train

In [20]:
trainScore (single_word2vec.getX (), single_word2vec.getY (), clf_single)

Computing Train Score [####################]    (100%)    ETA : 20:50:49


0.7167736842105263

Generating test features

In [21]:
test_X_single = single_word2vec.convertTest (test)

Converting Test Set
	Extracting Features from Test Tweets... [####################]    (100%)    ETA : 20:51:34
	Standardizing...
Terminated


Scoring on test

In [22]:
testScore (test_X_single, clf_single)

Computing Test Score [####################]    (100%)    ETA : 20:51:45


0.715

Splitting the embeddings into separate positive and negative models matters: the segmented setup reaches 0.7665 on the test set, against 0.715 for the single model.

## Full Train

We now repeat the segmented approach on more tweets (800,000 instead of 190,000). We'll also try a Multi-Layer Perceptron model for prediction.

In [23]:
pos_full = open ('train_pos_full.txt', encoding = 'utf-8').readlines () [:400000]
neg_full = open ('train_neg_full.txt', encoding = 'utf-8').readlines () [:400000]
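
`readlines ()` loads the whole file before the slice keeps the first 400,000 lines. If memory is a concern, `itertools.islice` stops reading once enough lines have been collected. A sketch, with `read_first_lines` as a hypothetical helper and a small temporary file standing in for the real data:

```python
import os
import tempfile
from itertools import islice

def read_first_lines(path, n):
    """Return at most the first n lines of a UTF-8 text file."""
    with open(path, encoding='utf-8') as f:
        return list(islice(f, n))

# Demonstration on a throwaway 5-line file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(''.join(f'tweet {i}\n' for i in range(5)))

first_three = read_first_lines(tmp.name, 3)
os.remove(tmp.name)
print(first_three)  # ['tweet 0\n', 'tweet 1\n', 'tweet 2\n']
```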

Creating the segmented models

In [24]:
generateBin ('train_pos_full.txt', 'pos_full_model.bin', nbFeatures = 50)
generateBin ('train_neg_full.txt', 'neg_full_model.bin', nbFeatures = 50)

Bin file pos_full_model.bin has been created
Bin file neg_full_model.bin has been created


Generating features from the positive model

In [25]:
pos_full_word2vec = Word2Vec ('pos_full_model.bin', pos_full, neg_full, 50)

Word2Vec Instantiation
	Converting Train Set...
		Extracting Features from Positive Tweets... [####################]    (100%)    ETA : 20:59:26
		Extracting Features from Negative Tweets... [####################]    (100%)    ETA : 20:59:47
	Standardizing...
Terminated


Generating features from the negative model

In [26]:
neg_full_word2vec = Word2Vec ('neg_full_model.bin', pos_full, neg_full, 50)

Word2Vec Instantiation
	Converting Train Set...
		Extracting Features from Positive Tweets... [####################]    (100%)    ETA : 21:00:11
		Extracting Features from Negative Tweets... [####################]    (100%)    ETA : 21:00:32
	Standardizing...
Terminated


Concatenating into train set

In [27]:
X_full_seg = np.hstack ((pos_full_word2vec.getX (), neg_full_word2vec.getX ()))
Y_full_seg = pos_full_word2vec.getY ()

Using Logistic Regression model

In [28]:
clf_seg_full = LogisticRegression ()
clf_seg_full.fit (X_full_seg, Y_full_seg)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Scoring on train

In [29]:
trainScore (X_full_seg, Y_full_seg, clf_seg_full)

Computing Train Score [####################]    (100%)    ETA : 21:03:00


0.7805725

Generating test features

In [30]:
test_X_full_seg = np.hstack ((pos_full_word2vec.convertTest (test), neg_full_word2vec.convertTest (test)))

Converting Test Set
	Extracting Features from Test Tweets... [####################]    (100%)    ETA : 21:03:05
	Standardizing...
Terminated
Converting Test Set
	Extracting Features from Test Tweets... [####################]    (100%)    ETA : 21:03:06
	Standardizing...
Terminated


Scoring on test

In [31]:
testScore (test_X_full_seg, clf_seg_full)

Computing Test Score [####################]    (100%)    ETA : 21:03:09


0.7745

### MLP

In [33]:
from sklearn.neural_network import MLPClassifier

clf_seg_mlp_full = MLPClassifier ()
clf_seg_mlp_full.fit (X_full_seg, Y_full_seg)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
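
The printed parameters describe a network with one hidden layer of 100 ReLU units. To make that concrete, here is a NumPy sketch of the forward pass of such a network for binary classification, with a logistic output. The weights are random placeholders rather than the trained ones, and `mlp_forward` is a hypothetical name.

```python
import numpy as np

def mlp_forward(X, W1, b1, W2, b2):
    """One-hidden-layer forward pass: ReLU hidden units, logistic output."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU activation
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # probability of the positive class

rng = np.random.default_rng(0)
n_features, n_hidden = 100, 100  # 50 + 50 segmented features, 100 hidden units

X = rng.normal(size=(4, n_features))                     # 4 fake tweets
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))  # placeholder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

proba = mlp_forward(X, W1, b1, W2, b2)
print(proba.shape)  # (4, 1)
```

The hidden layer lets the model combine features non-linearly, which a logistic regression on the raw features cannot do.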

Scoring on train

In [34]:
trainScore (X_full_seg, Y_full_seg, clf_seg_mlp_full)

Computing Train Score [####################]    (100%)    ETA : 21:11:15


0.83579875

Scoring on test

In [36]:
testScore (test_X_full_seg, clf_seg_mlp_full)

Computing Test Score [####################]    (100%)    ETA : 21:11:30


0.8175

## Exporting to Kaggle

Importing true test set

In [38]:
true_test = open ('true_test_data.txt', encoding = 'utf-8').readlines ()

Generating features for the true test set

In [39]:
true_test_X_full_seg = np.hstack ((pos_full_word2vec.convertTest (true_test), neg_full_word2vec.convertTest (true_test)))

Converting Test Set
	Extracting Features from Test Tweets... [####################]    (100%)    ETA : 21:12:37
	Standardizing...
Terminated
Converting Test Set
	Extracting Features from Test Tweets... [####################]    (100%)    ETA : 21:12:38
	Standardizing...
Terminated


Predicting labels

In [40]:
from helpers import predict, exportPredictions

pred = predict (true_test_X_full_seg, clf_seg_mlp_full)

Computing Predictions [####################]    (100%)    ETA : 21:12:42


Exporting for Kaggle

In [41]:
exportPredictions (pred, 'kaggle_submission')
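
`exportPredictions` is defined in the local `helpers` module and is not shown. Kaggle submissions are typically CSV files with an identifier column and a prediction column; the sketch below writes such a file with the standard `csv` module. The function name, the `Id` and `Prediction` headers, and the 1-based ids are assumptions to check against the competition's sample submission.

```python
import csv
import os
import tempfile

def export_predictions(predictions, path):
    """Write one 'Id,Prediction' row per predicted label (assumed format)."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Id', 'Prediction'])
        for i, label in enumerate(predictions, start=1):
            writer.writerow([i, int(label)])

# Demonstration in a temporary directory with three toy predictions.
out_path = os.path.join(tempfile.mkdtemp(), 'kaggle_submission.csv')
export_predictions([1, -1, 1], out_path)

with open(out_path, encoding='utf-8') as f:
    content = f.read()
print(content)
```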