# Doc2Vec demonstration 

In this notebook, let us take a look at how to "learn" document embeddings and use them for text classification. We will be using the dataset of "Sentiment and Emotion in Text" from [Kaggle](https://www.kaggle.com/c/sa-emotions/data).

"In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts. Hundreds to thousands of examples across 13 labels. A subset of this data is used in an experiment we uploaded to Microsoft’s Cortana Intelligence Gallery."


In [2]:
import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [7]:
#Load the dataset and explore.
filepath = "https://github.com/duybluemind1988/Data-science/blob/master/Practical%20NLP%20Oreilly/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/text_emotion.csv?raw=true"
df = pd.read_csv(filepath)
print(df.shape)
df.head()

(40000, 4)


| | tweet_id | sentiment | author | content |
|---|---|---|---|---|
| 0 | 1956967341 | empty | xoshayzers | @tiffanylue i know i was listenin to bad habi... |
| 1 | 1956967666 | sadness | wannamama | Layin n bed with a headache ughhhh...waitin o... |
| 2 | 1956967696 | sadness | coolfunky | Funeral ceremony...gloomy friday... |
| 3 | 1956967789 | enthusiasm | czareaquino | wants to hang out with friends SOON! |
| 4 | 1956968416 | neutral | xkilljoyx | @dannycastillo We want to trade with someone w... |


In [8]:
df['sentiment'].value_counts()

neutral       8638
worry         8459
happiness     5209
sadness       5165
love          3842
surprise      2187
fun           1776
relief        1526
hate          1323
empty          827
enthusiasm     759
boredom        179
anger          110
Name: sentiment, dtype: int64

In [9]:
#Let us take the top 3 categories and leave out the rest.
shortlist = ['neutral', "happiness", "worry"]
df_subset = df[df['sentiment'].isin(shortlist)]
df_subset.shape

(22306, 4)
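The three retained classes are not equally frequent, which is why the classifier later in this notebook is created with `class_weight="balanced"`. As a rough illustration of what that setting implies, the sketch below applies the "balanced" heuristic that scikit-learn documents (n_samples / (n_classes * class_count)) to the label counts printed earlier. The helper name `balanced_class_weights` is made up for this example, and the counts come from the full subset rather than the training split, so the numbers are only indicative.

```python
# Label counts for the three retained classes, copied from the value_counts() output above.
label_counts = {"neutral": 8638, "worry": 8459, "happiness": 5209}

def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

weights = balanced_class_weights(label_counts)
for label, weight in weights.items():
    # Rarer classes receive weights above 1, more frequent ones below 1.
    print(label, round(weight, 3))
```

Under this heuristic the least frequent class, happiness, gets the largest weight, so its errors count more during training.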

# Text pre-processing:
Tweets are different from other kinds of text. Some things to consider:
- Removing @mentions and URLs, perhaps?
- Using the NLTK TweetTokenizer instead of a regular tokenizer.
- Removing stopwords and numbers, as usual.
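The first item in the list can be sketched with plain regular expressions. This is a minimal illustration, not what the notebook itself does: the notebook relies on `strip_handles=True` for mentions and leaves URLs in place. The patterns and the helper name `strip_urls_and_mentions` are assumptions made for this example, and the mention pattern is deliberately crude (it would also clip e-mail addresses).

```python
import re

# Hypothetical patterns for illustration; real tweets may need more careful rules.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    """Remove URLs and @mentions, then collapse the leftover whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return " ".join(text.split())

print(strip_urls_and_mentions("Hmmm. http://www.djhero.com/ is down"))
print(strip_urls_and_mentions("@dannycastillo We want to trade with someone"))
```

Whether dropping URLs helps is an empirical question for this task; it is worth comparing classifier results with and without this step.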

After loading the dataset and taking a subset of the three most frequent labels, an
important step to consider here is pre-processing the data. What’s different here
compared to previous examples? Why can’t we just follow the same procedure as
before? There are a few things that are different about tweets compared to news
articles or other such text, as we briefly discussed in Chapter 2 when we talked
about text pre-processing. First, they are very short. Second, our traditional
tokenizers may not work well with tweets, splitting smileys, hashtags, Twitter
handles, etc., into multiple tokens. Such specialized needs prompted a lot of
research into NLP for Twitter in the recent past, which resulted in several preprocessing options for tweets. One such solution is a TweetTokenizer,
implemented in the NLTK [21] library in Python. We’ll discuss more on this topic in
Chapter 8. For now, let’s see how we can use a TweetTokenizer in the following
code snippet:

In [11]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [24]:
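tweeter = TweetTokenizer(strip_handles=True, preserve_case=False)  #defined here so the cell runs in top-to-bottom order
tweeter.tokenize('We want to trade with someone')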

['we', 'want', 'to', 'trade', 'with', 'someone']

In [12]:
#strip_handles removes personal information such as twitter handles, which don't
#contribute to emotion in the tweet. preserve_case=False converts everything to lowercase.
tweeter = TweetTokenizer(strip_handles=True,preserve_case=False)
mystopwords = set(stopwords.words("english"))

#Function to tokenize tweets, remove stopwords and numbers. 
#Keeping punctuations and emoticon symbols could be relevant for this task!
def preprocess_corpus(texts):
    def remove_stops_digits(tokens):
        #Nested function that removes stopwords and digits from a list of tokens
        return [token for token in tokens if token not in mystopwords and not token.isdigit()]
    #The return statement below uses the function above to further process the tweet tokenizer output:
    #each tweet in texts is tokenized into words, and then stopwords and digits are removed from those tokens.
    return [remove_stops_digits(tweeter.tokenize(content)) for content in texts]
#df_subset contains only the three categories we chose. 
mydata = preprocess_corpus(df_subset['content'])
mycats = df_subset['sentiment']
print(len(mydata), len(mycats))

22306 22306


In [13]:
df_subset['content']

4        @dannycastillo We want to trade with someone w...
5        Re-pinging @ghostridah14: why didn't you go to...
7                     Hmmm. http://www.djhero.com/ is down
10                                        cant fall asleep
11                                 Choked on her retainers
                               ...                        
39992    @jasimmo Ooo showing of your French skills!! l...
39993    @sendsome2me haha, yeah. Twitter has many uses...
39994                        Succesfully following Tayla!!
39995                                     @JohnLloydTaylor
39998    @niariley WASSUP BEAUTIFUL!!! FOLLOW ME!!  PEE...
Name: content, Length: 22306, dtype: object

In [17]:
mydata[:5] #after tweeter.tokenize and removing stopwords and digits from each token list

[['want', 'trade', 'someone', 'houston', 'tickets', ',', 'one', '.'],
 ['re-pinging', ':', 'go', 'prom', '?', 'bc', 'bf', 'like', 'friends'],
 ['hmmm', '.', 'http://www.djhero.com/'],
 ['cant', 'fall', 'asleep'],
 ['choked', 'retainers']]

In [19]:
#Split data into train and test, following the usual process
train_data, test_data, train_cats, test_cats = train_test_split(mydata,mycats,random_state=1234)

In [28]:
print(len(mydata))
print(len(train_data))
print(len(test_data))
print('split:',len(test_data)/len(mydata))

22306
16729
5577
split: 0.25002241549358917
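The split above relies on the default 25% test size and does not ask for the label proportions to be preserved. Since the three classes differ in size, one option worth considering is the `stratify` argument of `train_test_split`. The sketch below uses made-up toy documents and labels, not the tweet data, to show the effect.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for tokenized tweets and their labels (hypothetical data).
docs = [["tok%d" % i] for i in range(100)]
labels = ["neutral"] * 40 + ["worry"] * 40 + ["happiness"] * 20

# stratify=labels keeps the class proportions the same in both partitions.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=1234
)

print(Counter(train_labels))
print(Counter(test_labels))
```

With stratification, each class contributes a quarter of its examples to the test set, so rarer labels are not under-represented by chance.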


In [39]:
print(train_data[:5])

[['good', 'morning', 'plan', 'day', ':', 'church', 'followed', 'f1', '&', 'lunch', 'mum', '&', 'dads', '.', 'dm', 'discussions', 'star', 'trek', '!'], ['happy', 'anniversary', '.', 'know', 'whyyyy', '.', 'three', 'years', 'baby', '!', '!', '!'], ['never', '...'], ['lol', '...', 'maybe', '...', 'still', 'go', 'monday', '.'], ['got', 'home', 'leave']]


The next step in this process is to train a Doc2vec model to learn tweet
representations. Ideally, any large dataset of tweets will work for this step. However,
since we don’t have such a ready-made corpus, we have split our dataset into train and test
sets and will use the training data for learning the Doc2vec representations. The first part of
this process involves converting the data into a format readable by the Doc2vec
implementation, which can be done using the TaggedDocument class. It’s used to
represent a document as a list of tokens, followed by a “tag,” which in its simplest
form can be just the filename or ID of the document. Beyond learning representations, Doc2vec by itself
can also be used as a nearest neighbor classifier for both multiclass and multilabel
classification problems. We’ll leave this as an exploratory exercise for the
reader. Let’s now see how to train a Doc2vec model for tweets through the code
snippet below:

In [30]:
#prepare training data in doc2vec format:
train_doc2vec = [TaggedDocument((d), tags=[str(i)]) for i, d in enumerate(train_data)]
train_doc2vec[:5]

[TaggedDocument(words=['good', 'morning', 'plan', 'day', ':', 'church', 'followed', 'f1', '&', 'lunch', 'mum', '&', 'dads', '.', 'dm', 'discussions', 'star', 'trek', '!'], tags=['0']),
 TaggedDocument(words=['happy', 'anniversary', '.', 'know', 'whyyyy', '.', 'three', 'years', 'baby', '!', '!', '!'], tags=['1']),
 TaggedDocument(words=['never', '...'], tags=['2']),
 TaggedDocument(words=['lol', '...', 'maybe', '...', 'still', 'go', 'monday', '.'], tags=['3']),
 TaggedDocument(words=['got', 'home', 'leave'], tags=['4'])]

In [22]:
#Train a doc2vec model to learn tweet representations. Use only training data!!
model = Doc2Vec(vector_size=50, alpha=0.025, min_count=10, dm=1, epochs=100)
model.build_vocab(train_doc2vec)
model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)
model.save("d2v.model")
print("Model Saved")

Model Saved




Training for Doc2vec involves making several choices regarding parameters, as seen
in the model definition in the code snippet above. vector_size refers to the
dimensionality of the learned embeddings; alpha is the learning rate; min_count
is the minimum frequency of words that remain in the vocabulary; dm, which stands for
distributed memory, selects one of the two representation learners implemented in Doc2vec
(dm=1 uses distributed memory, dm=0 uses dbow, or distributed bag of words); and epochs is the number of
training iterations. There are a few other parameters that can be customized. While
there are some guidelines on choosing optimal parameters for training Doc2vec
models [22], these are not exhaustively validated, and we don’t know if the
guidelines work for tweets. 

The best way to address this issue is to explore a range of values for the ones that
matter to us (e.g., dm versus dbow, vector sizes, learning rate) and compare
multiple models. How do we compare these models, as they only learn the text
representation? One way to do it is to start using these learned representations in a downstream task, in this case text classification. Doc2vec’s infer_vector
function can be used to infer the vector representation for a given text using a trained model. Inference is itself a short, randomized training procedure, so
the inferred vectors can differ each time we extract them. For this
reason, to get a more stable representation, we run inference for multiple iterations per document
(the steps argument, called epochs in newer gensim releases). Let’s use the learned model to infer features for our data and
train a logistic regression classifier:

In [None]:
#Infer the feature representation for training and test data using the trained model
model= Doc2Vec.load("d2v.model")

In [32]:
#infer with multiple training passes per tweet to get a stable representation.
#(epochs is the current name of this argument; older gensim releases also accepted steps.)
train_vectors = [model.infer_vector(list_of_tokens, epochs=50) for list_of_tokens in train_data]
test_vectors = [model.infer_vector(list_of_tokens, epochs=50) for list_of_tokens in test_data]
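Once every tweet is a fixed-length vector, the nearest neighbor idea mentioned earlier can be tried without any classifier training: label a new vector with the label of its most similar training vector. The sketch below is one possible reading of that exercise, using cosine similarity on made-up three-dimensional vectors; the function names and toy data are assumptions for illustration, and in the notebook the inputs would be `train_vectors`, `train_cats`, and a vector from `test_vectors`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor_label(query, vectors, labels):
    """Return the label of the training vector most similar to the query."""
    best_index = max(range(len(vectors)),
                     key=lambda i: cosine_similarity(query, vectors[i]))
    return labels[best_index]

# Toy vectors standing in for inferred document embeddings (hypothetical data).
toy_vectors = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
toy_labels = ["happiness", "neutral", "worry"]

print(nearest_neighbor_label([0.9, 0.2, 0.1], toy_vectors, toy_labels))
```

This brute-force search compares the query with every training vector, which is fine for a quick experiment but slow on the full training set.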

In [38]:
print(len(train_vectors))
print(len(train_vectors[0]))
train_vectors[:2]
# The training data has about 16,000 tweets, so train_vectors also holds that many entries;
# each one has been turned into a vector of 50 features by the Doc2Vec model (vector_size=50).

16729
50


[array([-0.10931911, -0.7313683 , -0.7262368 ,  0.25416514,  0.64902323,
         0.8690006 , -0.21135978, -0.42349106,  1.2730827 , -1.3717281 ,
        -0.5986231 ,  0.08442881, -0.42836455, -0.1387815 ,  0.33349255,
        -0.08933026, -0.6297568 ,  0.15202159, -0.26946872, -0.9958995 ,
         0.01033765, -0.27450958, -0.86155444, -0.1431924 ,  0.36794204,
         0.88481367, -0.7720944 ,  0.427225  , -0.461724  ,  1.1377105 ,
         0.7434162 , -0.499262  , -0.20816197, -0.4692031 , -0.23724903,
        -1.2255505 ,  0.04729887, -0.45278624,  0.7991501 ,  0.05607954,
        -0.39569253,  0.8643881 , -0.68532467, -0.9694727 ,  0.22574818,
        -0.1878743 , -0.04316855, -0.2019228 , -0.18028729,  0.41176215],
       dtype=float32),
 array([ 0.01172177, -0.33072165,  0.5792671 ,  0.11403397, -0.7678863 ,
         0.23992556, -0.31349587, -0.69862753, -0.44846058,  0.35984936,
        -0.05374104, -0.22830594, -0.24324952, -0.24897365,  0.6162712 ,
        -0.20070425, -0.550

In [23]:
#Use any regular classifier like logistic regression
from sklearn.linear_model import LogisticRegression

myclass = LogisticRegression(class_weight="balanced") #because classes are not balanced. 
myclass.fit(train_vectors, train_cats)

preds = myclass.predict(test_vectors)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(test_cats, preds))

#print(confusion_matrix(test_cats,preds))




              precision    recall  f1-score   support

   happiness       0.46      0.54      0.50      1331
     neutral       0.49      0.55      0.52      2143
       worry       0.57      0.43      0.49      2103

    accuracy                           0.50      5577
   macro avg       0.51      0.51      0.50      5577
weighted avg       0.51      0.50      0.50      5577
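The per-class and macro-averaged F1 values in this report can be checked by hand from the precision and recall columns. The sketch below copies those rounded figures from the report, so the results only approximate what scikit-learn computes from the unrounded values.

```python
# (precision, recall) per class, copied from the classification report above.
report = {
    "happiness": (0.46, 0.54),
    "neutral": (0.49, 0.55),
    "worry": (0.57, 0.43),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

per_class_f1 = {label: f1(p, r) for label, (p, r) in report.items()}
# Macro averaging gives every class the same weight, regardless of its support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print({label: round(score, 2) for label, score in per_class_f1.items()})
print(round(macro_f1, 2))
```

The macro average treats the three classes equally, whereas the weighted average in the last row of the report scales each class by its support.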



Now, the performance of this model seems rather poor, achieving an F1 score of
about 0.50 on a reasonably large corpus, with only three classes. There are a couple of
interpretations for this poor result. First, unlike full news articles or even well-formed sentences, tweets contain very little data per instance. Further, people write
with wide variation in spelling and syntax when they tweet. There are a lot of
emoticons in different forms. Our feature representation should be able to capture
such aspects. While tuning the algorithms by searching a large parameter space for
the best model may help, an alternative could be to explore problem-specific feature
representations, as we discussed in Chapter 3. We’ll see how to do this for tweets in
Chapter 8. An important point to keep in mind when using Doc2vec is the same as
for fastText: if we have to use Doc2vec for feature representation, we have to store
the model that learned the representation. While it’s not typically as bulky as
fastText, it’s also not as fast to train. Such trade-offs need to be considered and
compared before we make a deployment decision.
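To make the parameter search mentioned above concrete, one simple approach is to enumerate a small grid of candidate settings and then train and evaluate one Doc2vec model per setting. The sketch below only builds the grid. The specific values are placeholders chosen for illustration, not recommendations, and the helper name `candidate_configs` is an assumption of this example.

```python
from itertools import product

# Hypothetical search space; the keys match Doc2Vec keyword arguments used above.
search_space = {
    "dm": [0, 1],               # dbow versus distributed memory
    "vector_size": [50, 100],
    "alpha": [0.025, 0.05],
}

def candidate_configs(space):
    """Yield one keyword-argument dict per combination of values in the grid."""
    keys = sorted(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(candidate_configs(search_space))
print(len(configs))
for config in configs[:3]:
    print(config)
```

Each dictionary could be passed to the model constructor as keyword arguments, with the downstream classification score on held-out data used to compare the candidates.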