# Demonstration of LDA Approach

Our dataset is too large to build and test models on our laptops, so we ran the following steps on the full dataset using a high-performance computing system. To demonstrate what was done, here we downsample the dataset and go through the same steps.

## 1. Downsampling

We begin by loading our cleaned training datasets: one contains the combined review text and summary, and the other contains just the summaries. We then perform stratified sampling according to the sentiment of the reviews.

In [1]:
import pandas as pd

import numpy as np
np.random.seed(2018)
import gensim
from gensim import corpora
from numpy.random import RandomState
rng = RandomState(2018)  # seed the generator passed to DataFrame.sample so the sample is repeatable

In [2]:
trainingDf_combined = pd.read_csv('data_cleaned_full_text_and_summaries.csv')

In [3]:
trainingDf_summaries = pd.read_csv('data_cleaned_full_summaries.csv')

In [4]:
trainingDf_combined.head()

Unnamed: 0.1,Unnamed: 0,text,sentiment
0,0,"['not', 'good', 'buy', 'book', 'read', 'glow',...",0
1,1,"['buyer', 'beware', 'self', 'publish', 'book',...",0
2,2,"['bad', 'complete', 'waste', 'time', 'typograp...",0
3,3,"['please', 'guess', 'romance', 'novel', 'lover...",0
4,4,"['awful', 'beyond', 'belief', 'feel', 'write',...",0


In [5]:
trainingDf_summaries.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,sentiment
0,0,0,"['not', 'good']",0
1,1,1,"['buyer', 'beware']",0
2,2,2,[u'bad'],0
3,3,3,['please'],0
4,4,4,"['awful', 'beyond', 'belief']",0


In [6]:
trainingDf_summaries.shape

(2399972, 4)

In [7]:
ones = trainingDf_combined[trainingDf_combined['sentiment'] == 1]
zeros = trainingDf_combined[trainingDf_combined['sentiment'] == 0]

num_ones = round(len(ones) * 1/2.4)
num_zeros = round(len(zeros) * 1/2.4)

sample_ones = ones.sample(n=num_ones, random_state=rng)
sample_zeros = zeros.sample(n=num_zeros, random_state=rng)

sampleframe = [sample_ones,sample_zeros]

train_combined = pd.concat(sampleframe)

train_summaries = trainingDf_summaries.loc[train_combined.index,:]
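
The per-class sampling above can also be written as a single grouped call. The following is a minimal sketch on made-up data, assuming pandas 1.1 or later for `DataFrameGroupBy.sample`; the helper name `stratified_sample` and the toy frames are hypothetical and not part of the original pipeline.

```python
import pandas as pd

def stratified_sample(df, label_col, frac, seed):
    """Sample the same fraction of rows from every class in `label_col`."""
    return df.groupby(label_col).sample(frac=frac, random_state=seed)

# Toy stand-ins for the two cleaned training frames (hypothetical data).
toy_combined = pd.DataFrame({
    "text": [f"review {i}" for i in range(24)],
    "sentiment": [1] * 18 + [0] * 6,
})
toy_summaries = pd.DataFrame({
    "text": [f"summary {i}" for i in range(24)],
    "sentiment": [1] * 18 + [0] * 6,
})

sample_combined = stratified_sample(toy_combined, "sentiment", frac=0.5, seed=2018)

# Select the matching summaries by index label so both samples cover the same reviews.
sample_summaries = toy_summaries.loc[sample_combined.index]

print(sample_combined["sentiment"].value_counts().to_dict())
```

Because each class is sampled at the same rate, the class proportions of the sample match those of the full frame.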

In [8]:
train_combined.head()

Unnamed: 0.1,Unnamed: 0,text,sentiment
2363882,2363909,"['hip', 'hop', 'masterpiece', 'book', 'dope', ...",1
1634847,1634869,"['nice', 'change', 'usual', 'chapter', 'book',...",1
2206461,2206487,"['secret', 'weapon', 'not', 'terrific', 'easy'...",1
956732,956745,"['extremadamente', 'chistoso', 'realista', 'es...",1
2046680,2046705,"['public', 'know', 'book', 'becomes', 'relevan...",1


In [9]:
train_summaries.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,sentiment
2363882,2363882,2363909,"['hip', 'hop', 'masterpiece']",1
1634847,1634847,1634869,"['nice', 'change', 'usual', 'chapter', u'book']",1
2206461,2206461,2206487,"['secret', 'weapon']",1
956732,956732,956745,"['extremadamente', 'chistoso', 'realista']",1
2046680,2046680,2046705,"['public', 'know']",1


## 2. LDA Model

We now proceed to create an LDA model for the first dataset.

In [10]:
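import ast

def myprocess(thisdoc):
    ## Runs on documents stored as stringified token lists, e.g. "['not', 'good']" or "[u'bad']"
    ## ast.literal_eval accepts the u'...' prefix and leaves tokens ending in "u" (such as 'you') intact
    return ast.literal_eval(thisdoc)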

docs=train_combined['text'].map(myprocess) 

In [11]:
dict_LoS = corpora.Dictionary(docs)

dict_LoS.filter_extremes(no_below=5, no_above=0.5)

bow_corpus = [dict_LoS.doc2bow(w) for w in docs]
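
To make the effect of `filter_extremes(no_below=5, no_above=0.5)` concrete, here is a pure-Python illustration of the idea on a toy corpus. It is a sketch of the filtering rule only, not gensim's code: the function names are hypothetical, and it ignores gensim's additional cap on vocabulary size (`keep_n`).

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set() so a token counts once per document
    return df

def filter_vocabulary(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    df = document_frequencies(docs)
    max_docs = no_above * len(docs)
    return sorted(tok for tok, n in df.items() if no_below <= n <= max_docs)

toy_docs = [
    ["book", "good", "plot"],
    ["book", "bad", "plot"],
    ["book", "good", "typo"],
    ["book", "bad", "ending"],
]

# "book" is in every document (too common); "typo" and "ending" are in one (too rare).
vocab = filter_vocabulary(toy_docs, no_below=2, no_above=0.5)
print(vocab)
```

Tokens that appear almost everywhere or almost nowhere carry little information for separating topics, which is the motivation for dropping them before fitting the model.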

In [12]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=58, id2word=dict_LoS, passes=1, workers=2)

Now, get the topic distributions for the documents in the training set, and convert them into a dataframe.

In [13]:
lda_scores = lda_model[bow_corpus]

In [23]:
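# num_terms fixes the number of topic columns even if some topic never exceeds the probability threshold
all_topics_csr = gensim.matutils.corpus2csc(lda_scores, num_terms=lda_model.num_topics)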
all_topics_numpy = all_topics_csr.T.toarray()
all_topics_pandas = pd.DataFrame(all_topics_numpy)

In [24]:
all_topics_pandas.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.0,0.390647,0.0,0.0,0.0,0.225551,0.0,0.0,0.0,...,0.0,0.0,0.0,0.054489,0.079873,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.028978,0.0,0.0,0.0,...,0.0,0.0,0.0,0.064256,0.0,0.026534,0.0,0.523261,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.034275,0.0,0.0,0.019339,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.132576,0.0,0.0,0.0,0.029975,0.0,0.0
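
Many entries in this table are exactly zero because indexing a gensim LDA model returns, for each document, only the `(topic_id, probability)` pairs above the model's `minimum_probability` threshold; the remaining topics are filled with zeros when the scores are made dense. The sketch below illustrates that conversion with NumPy on made-up scores. The function name `dense_topic_matrix` is hypothetical, and this is not how `corpus2csc` is implemented.

```python
import numpy as np

def dense_topic_matrix(sparse_scores, num_topics):
    """Turn per-document lists of (topic_id, probability) pairs into an
    array of shape (n_documents, num_topics); missing topics become 0."""
    dense = np.zeros((len(sparse_scores), num_topics))
    for row, pairs in enumerate(sparse_scores):
        for topic_id, prob in pairs:
            dense[row, topic_id] = prob
    return dense

# Made-up sparse scores for three documents and eight topics.
toy_scores = [
    [(2, 0.39), (6, 0.23)],
    [(7, 0.52)],
    [],
]

X = dense_topic_matrix(toy_scores, num_topics=8)
print(X.shape)
```

Passing the number of topics explicitly keeps the number of columns fixed, whichever topics happen to appear in the documents.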


The topic scores are our features, X, and the sentiments of the reviews are our target, y.

In [25]:
y_train=train_combined['sentiment']
X_train=all_topics_pandas

In [26]:
y_train=pd.DataFrame(y_train, columns=['sentiment'])

## 3. Classification

Now build a boosted decision tree model, reweighting the classes because they are unbalanced.

In [27]:
from xgboost import XGBClassifier
import xgboost as xgb

In [31]:
posScaling=(len(y_train['sentiment'])-sum(y_train['sentiment']))/sum(y_train['sentiment'])
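
The value computed here is the ratio of negative to positive examples, which is the usual choice for XGBoost's `scale_pos_weight`. A small worked example on made-up labels (the helper name `neg_pos_ratio` is hypothetical):

```python
def neg_pos_ratio(labels):
    """Return (#negative examples) / (#positive examples) for 0/1 labels."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos

# Four positive reviews and one negative review.
toy_labels = [1, 1, 1, 1, 0]
ratio = neg_pos_ratio(toy_labels)
print(ratio)
```

When positives outnumber negatives the ratio falls below 1, so each positive example counts for less in the loss; when negatives dominate it rises above 1.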

In [32]:
model_combined = XGBClassifier(
    objective = 'binary:logistic',
    eval_metric='logloss', 
    scale_pos_weight = posScaling, 
    learning_rate =0.05,
    n_estimators=10000)

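Here we use cross-validation with early stopping to choose how many trees should be combined into our boosted decision tree. In this demonstration the cross-validation call is commented out and `n_est` is set directly.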

In [33]:
#xgb_param = model_combined.get_xgb_params()
#xgtrain = xgb.DMatrix(X_train.values, label=y_train.values)

#cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=model_combined.get_params()['n_estimators'], nfold=3,
#                  early_stopping_rounds=10, metrics={'logloss'})


#n_est=cvresult.shape[0]
#print(cvresult.shape[0])
n_est=300

10000
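
The early-stopping rule behind this choice can be illustrated without XGBoost. The sketch below is a simplified version of the rule applied to a made-up sequence of validation losses, not XGBoost's implementation; `rounds_kept` and `toy_losses` are hypothetical names. In the commented-out cell, the number of rows in `cvresult` plays the role of the value returned here.

```python
def rounds_kept(losses, patience):
    """Return how many boosting rounds are kept when training stops after
    `patience` consecutive rounds without a new best validation loss."""
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round + 1

# Made-up validation log loss per boosting round.
toy_losses = [0.69, 0.60, 0.55, 0.53, 0.54, 0.535, 0.56, 0.57]
n_rounds = rounds_kept(toy_losses, patience=3)
print(n_rounds)
```

If the loss keeps improving until the last round, the rule never triggers and every round is kept.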


Now, redefine the boosted decision tree model to have that number of trees, and fit it to the training data.

In [34]:
model_combined = XGBClassifier(
    objective = 'binary:logistic',
    eval_metric='logloss', 
    scale_pos_weight = posScaling, 
    learning_rate =0.05,
    n_estimators=n_est)

In [35]:
model_combined.fit(X_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.05, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=10000, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=None, ...)

We repeat the process for the dataset containing just the summaries. To save time, we reuse the same boosted decision tree settings as above.

In [36]:
model_summaries = XGBClassifier(
    objective = 'binary:logistic',
    eval_metric='logloss', 
    scale_pos_weight = posScaling, 
    learning_rate =0.05,
    n_estimators=n_est)

In [37]:
docs_summaries=train_summaries['text'].map(myprocess) 
dict_LoS_summaries = corpora.Dictionary(docs_summaries)
dict_LoS_summaries.filter_extremes(no_below=5, no_above=0.5)
bow_corpus_summaries = [dict_LoS_summaries.doc2bow(w) for w in docs_summaries]
lda_model_summaries = gensim.models.LdaMulticore(bow_corpus_summaries, num_topics=58, id2word=dict_LoS_summaries, passes=1, workers=2)
lda_scores_summaries = lda_model_summaries[bow_corpus_summaries]

all_topics_csr_summaries = gensim.matutils.corpus2csc(lda_scores_summaries, num_terms=lda_model_summaries.num_topics)
all_topics_numpy_summaries = all_topics_csr_summaries.T.toarray()
all_topics_pandas_summaries = pd.DataFrame(all_topics_numpy_summaries)

In [38]:
X_train_summaries=all_topics_pandas_summaries

In [39]:
X_train_summaries

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.337458,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.000000,0.000000,0.000000,0.508606,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,...,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241
4,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999984,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
999985,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,...,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241,0.017241
999986,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
999987,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.201163,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [40]:
y_train

Unnamed: 0,sentiment
2363882,1
1634847,1
2206461,1
956732,1
2046680,1
...,...
279045,0
122808,0
308630,0
146413,0


In [41]:
model_summaries.fit(X_train_summaries, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.05, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=10000, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=None, ...)

Now load the test sets. We process them in the same way as the training sets, turning them into bags of words with the dictionaries built from the training data. We then compute topic scores for each document using the LDA models fitted above, and convert these into dataframes as before.

In [42]:
testDf_combined = pd.read_csv('test_cleaned_text_and_summaries.csv')
testDf_summaries = pd.read_csv('test_cleaned_summaries.csv')

In [None]:
docs_combined_test=testDf_combined['text'].map(myprocess) 
bow_corpus_combined_test = [dict_LoS.doc2bow(w) for w in docs_combined_test]
lda_scores_combined_test = lda_model[bow_corpus_combined_test]
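# num_terms keeps the test features at the same number of columns as the training features
all_topics_csr_combined_test = gensim.matutils.corpus2csc(lda_scores_combined_test, num_terms=lda_model.num_topics)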
all_topics_numpy_combined_test = all_topics_csr_combined_test.T.toarray()
all_topics_pandas_combined_test = pd.DataFrame(all_topics_numpy_combined_test)
X_test_combined=all_topics_pandas_combined_test

In [None]:
docs_summaries_test=testDf_summaries['text'].map(myprocess) 
bow_corpus_summaries_test = [dict_LoS_summaries.doc2bow(w) for w in docs_summaries_test]
lda_scores_summaries_test = lda_model_summaries[bow_corpus_summaries_test]
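# num_terms keeps the test features at the same number of columns as the training features
all_topics_csr_summaries_test = gensim.matutils.corpus2csc(lda_scores_summaries_test, num_terms=lda_model_summaries.num_topics)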
all_topics_numpy_summaries_test = all_topics_csr_summaries_test.T.toarray()
all_topics_pandas_summaries_test = pd.DataFrame(all_topics_numpy_summaries_test)
X_test_summaries=all_topics_pandas_summaries_test
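
Reusing the training dictionaries matters because the LDA models only know the token ids assigned during training; `doc2bow` skips any test token that is not in the dictionary. The following pure-Python sketch illustrates that behaviour on toy data. `build_token2id` and `to_bow` are hypothetical stand-ins, not gensim functions.

```python
from collections import Counter

def build_token2id(train_docs):
    """Assign an integer id to every distinct token seen in training."""
    vocab = sorted({tok for doc in train_docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

train_docs = [["good", "book"], ["bad", "book", "plot"]]
token2id = build_token2id(train_docs)

# "sequel" never appeared in training, so it is dropped from the test document.
test_bow = to_bow(["good", "good", "sequel", "book"], token2id)
print(test_bow)
```

Building a fresh dictionary on the test set would assign different ids to the same words, and the topic scores would no longer line up with the features the classifiers were trained on.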

## 4. Prediction for test set

We are now ready to compute our prediction probabilities for the test set.

In [None]:
pred_probs_combined = model_combined.predict_proba(X_test_combined)
pred_probs_summaries = model_summaries.predict_proba(X_test_summaries)

In [None]:
pred_probs_combined_df=pd.DataFrame(pred_probs_combined[:,1],columns=['Probability of positive'])
pred_probs_combined_df.to_csv('pred_probs_combined_1M.csv', index=False)

pred_probs_summaries_df=pd.DataFrame(pred_probs_summaries[:,1],columns=['Probability of positive'])
pred_probs_summaries_df.to_csv('pred_probs_summaries_1M.csv', index=False)

In [None]:
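import pickle
# File names must be strings; the context managers close the files after writing
with open('model_combined.pkl', 'wb') as f:
    pickle.dump(model_combined, f)
with open('model_summaries.pkl', 'wb') as f:
    pickle.dump(model_summaries, f)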
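
A pickled model can later be read back with `pickle.load`. The round trip is sketched below with a plain dictionary standing in for a fitted classifier and a temporary directory standing in for the working directory; both are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object behaves the same way.
toy_model = {"n_estimators": 300, "learning_rate": 0.05}

# Write to a temporary location so the sketch leaves no files behind in the project.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")

with open(path, "wb") as f:
    pickle.dump(toy_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
```

Pickle files should only be loaded from sources you trust, and a pickled classifier is generally best reloaded with the same library version that wrote it.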

In [22]:
train_summaries.to_csv('train_summaries_1M.csv', index=False)
train_combined.to_csv('train_combined_1M.csv', index=False)