# Doc2Vec demonstration 

In this notebook, let us take a look at how to "learn" **document embeddings** and use them for text classification. We will be using the dataset of "Sentiment and Emotion in Text" from [Kaggle](https://www.kaggle.com/c/sa-emotions/data).

"In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts. Hundreds to thousands of examples across 13 labels. A subset of this data is used in an experiment we uploaded to Microsoft’s Cortana Intelligence Gallery."


In [1]:
import pandas as pd
import nltk
# nltk.download('stopwords')

from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [2]:
"""
# Downloading the data
!wget -P DATAPATH https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/train_data.csv
!wget -P DATAPATH https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/Data/Sentiment%20and%20Emotion%20in%20Text/test_data.csv
!ls -lah DATAPATH
"""


In [2]:
#Load the dataset and explore.
data_path = '/Users/chenwang/Workspace/github/practical_nlp/chapter4_text_classification/data/Sentiment_and_Emotion_in_Text/'
filepath = data_path + "train_data.csv"
df = pd.read_csv(filepath)
print(df.shape)


(30000, 2)


In [3]:
df.head()


   sentiment   content
0  empty       @tiffanylue i know i was listenin to bad habi...
1  sadness     Layin n bed with a headache ughhhh...waitin o...
2  sadness     Funeral ceremony...gloomy friday...
3  enthusiasm  wants to hang out with friends SOON!
4  neutral     @dannycastillo We want to trade with someone w...


In [4]:
df['sentiment'].value_counts()

worry         7433
neutral       6340
sadness       4828
happiness     2986
love          2068
surprise      1613
hate          1187
fun           1088
relief        1021
empty          659
enthusiasm     522
boredom        157
anger           98
Name: sentiment, dtype: int64

The dataset has 30,000 rows in total, covering 13 different sentiment labels. Below, we pick three of these emotions and classify the documents into them.

In [5]:
# Keep three categories and leave out the rest.
# Note: these are not strictly the top 3 by count (sadness is more frequent than happiness).
shortlist = ['neutral', "happiness", "worry"]
df_subset = df[df['sentiment'].isin(shortlist)]
df_subset.shape

(16759, 2)

After filtering, 16,759 rows remain, covering 3 sentiments.

# Text pre-processing:
Tweets are different from regular text. Some things to consider:
- Removing @mentions, and URLs perhaps?
- Using the NLTK Tweet tokenizer instead of a regular one
    - ;-) should be 1 token rather than 3
- stopwords, numbers as usual.
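The tokenizer used in this notebook strips handles but leaves URLs in place. As a minimal sketch of the first consideration above, the helper below removes both with regular expressions before tokenization. The function name and the two patterns are assumptions made for illustration; they are deliberately simple and will miss edge cases (for example, `@\w+` also matches the domain part of an email address).

```python
import re

# Hypothetical patterns for illustration; simple rather than exhaustive.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    """Remove URLs and @mentions, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return " ".join(text.split())

example = "@dannycastillo check this out http://example.com/page ;-)"
print(strip_urls_and_mentions(example))  # check this out ;-)
```

Emoticons such as `;-)` pass through untouched, so a tweet-aware tokenizer can still treat them as single tokens afterwards.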

In [6]:
# strip_handles removes personal information such as twitter handles, which don't contribute to emotion in the tweet. 
# preserve_case=False converts everything to lowercase.
tweeterTokenizer = TweetTokenizer(strip_handles=True,preserve_case=False)
mystopwords = set(stopwords.words("english"))

# Function to tokenize tweets, remove stopwords and numbers. 
# Punctuation and emoticon symbols are kept, since they could be relevant for this task!
def preprocess_corpus(texts):
    def remove_stops_digits(tokens):
        #Nested function that removes stopwords and digits from a list of tokens
        return [token for token in tokens if token not in mystopwords and not token.isdigit()]
    #This return statement below uses the above function to process twitter tokenizer output further. 
    return [remove_stops_digits(tweeterTokenizer.tokenize(content)) for content in texts]

#df_subset contains only the three categories we chose. 
mydata = preprocess_corpus(df_subset['content'])
mycats = df_subset['sentiment']
print(len(mydata), len(mycats))

16759 16759


In [7]:
# Split data into train and test, following the usual process
train_data, test_data, train_cats, test_cats = train_test_split(mydata,mycats,random_state=1234)

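#### Converting the data into a format readable by the Doc2vec implementation

`TaggedDocument` is used to represent a document as a list of tokens, followed by a “tag,” which in its simplest form can be just the filename or ID of the document. Here we use the document's ID, that is, its index in the training data.

It is clear what to do here, but less clear why. The reason is that Doc2vec learns one vector per tag, so every training document needs a tag that identifies its embedding.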

In [8]:
# Prepare training data in doc2vec format: each document is tagged with its index.
train_doc2vec = [TaggedDocument(words=d, tags=[str(i)]) for i, d in enumerate(train_data)]


#### Training

Parameters:
- `vector_size`: the dimensionality of the learned embeddings; 
- `alpha`: the initial learning rate; 
- `min_count`: the minimum frequency of words that remain in the vocabulary; 
- `dm`: selects the training algorithm. `dm=1` uses distributed memory, one of the representation learners implemented in Doc2vec; `dm=0` uses the other, distributed bag of words (often abbreviated `dbow`); 
- `epochs`: the number of training iterations.

On how to tune these parameters, see: Lau, Jey Han and Timothy Baldwin. “An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation”. (2016).

The best way to choose parameters is **grid search**: explore a range of values for the ones that matter to us (e.g., dm versus dbow, vector sizes, learning rate) and compare multiple models.

Because the model's output is a document vector, how do we tell which of two parameter settings gives the better result? One approach is to evaluate them on a downstream application.
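The comparison described above can be sketched as a small grid search. This is a minimal sketch, not the notebook's method: `grid_search`, `param_grid`, and `toy_score` are hypothetical names, and the grid values are illustrative. In practice the scoring function would train a Doc2Vec model with the given parameters, fit the downstream classifier on inferred vectors, and return a validation metric; a placeholder score is used here so the example runs on its own.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every parameter combination; return (best_params, best_score, results)."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, score_fn(params)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results

# Illustrative grid over the parameters discussed above.
param_grid = {
    "dm": [0, 1],
    "vector_size": [50, 100],
    "alpha": [0.025, 0.05],
}

def toy_score(params):
    # Placeholder only, NOT a real evaluation. A real score_fn would train
    # Doc2Vec with `params` and return the downstream classifier's validation score.
    return params["vector_size"] / 100 - abs(params["alpha"] - 0.025) + 0.1 * params["dm"]

best_params, best_score, results = grid_search(param_grid, toy_score)
print(len(results), "combinations tried; best:", best_params)
```

Scoring on a held-out validation split rather than the test set keeps the final test numbers honest.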

In [9]:
%%time

#Train a doc2vec model to learn tweet representations. Use only training data!!
model = Doc2Vec(vector_size=50, alpha=0.025, min_count=5, dm =1, epochs=100)
model.build_vocab(train_doc2vec)
model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)


CPU times: user 1min 45s, sys: 18.8 s, total: 2min 3s
Wall time: 1min 38s


In [10]:
model.save("output/d2v.model")
print("Model Saved")

Model Saved


In [12]:
# Load the saved model; it is used below to infer feature representations for the training and test data
model= Doc2Vec.load("output/d2v.model")


There is some amount of randomness in how vectors are inferred, so the inferred vectors can differ each time we extract them. For this reason, to get a more stable representation, we run inference for multiple iterations (called **steps**; newer gensim releases name this argument `epochs`).

In [13]:
# Infer with multiple iterations to get a more stable representation.
# `epochs` is the current name of the older `steps` argument, which gensim 4.0 removed.
train_vectors = [model.infer_vector(list_of_tokens, epochs=50) for list_of_tokens in train_data]
test_vectors = [model.infer_vector(list_of_tokens, epochs=50) for list_of_tokens in test_data]
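Separately from the number of inference iterations, another way to smooth out run-to-run variation is to infer a vector several times and average the results. The sketch below is an assumption about how that could look, not something this notebook does: `average_inferred` and `noisy_infer` are hypothetical names, and `noisy_infer` is a stand-in so the example runs without a trained model. With gensim, `infer_fn` could be `model.infer_vector`.

```python
import random

def average_inferred(infer_fn, tokens, runs=5):
    """Call infer_fn(tokens) `runs` times and return the element-wise mean vector."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    vectors = [list(infer_fn(tokens)) for _ in range(runs)]
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / runs for k in range(dim)]

# Stand-in for a stochastic inference call (hypothetical, for illustration only).
rng = random.Random(0)

def noisy_infer(tokens):
    # A fixed vector plus small Gaussian noise, mimicking run-to-run variation.
    return [1.0 + rng.gauss(0, 0.1), -2.0 + rng.gauss(0, 0.1)]

stable = average_inferred(noisy_infer, ["good", "morning"], runs=200)
print(stable)
```

Averaging costs `runs` times as much inference, so it is worth checking whether it actually changes downstream accuracy before adopting it.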




In [14]:
#Use any regular classifier like logistic regression
from sklearn.linear_model import LogisticRegression

myclass = LogisticRegression(class_weight="balanced") #because classes are not balanced. 
myclass.fit(train_vectors, train_cats)

preds = myclass.predict(test_vectors)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(test_cats, preds))






              precision    recall  f1-score   support

   happiness       0.38      0.44      0.41       713
     neutral       0.46      0.55      0.50      1595
       worry       0.58      0.45      0.51      1882

   micro avg       0.49      0.49      0.49      4190
   macro avg       0.47      0.48      0.47      4190
weighted avg       0.50      0.49      0.49      4190
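The classifier above uses `class_weight="balanced"`, which in scikit-learn weights each class by `n_samples / (n_classes * class_count)`. As a rough illustration, the sketch below applies that formula to the counts of the full three-class subset shown earlier. The classifier itself computes weights from the training split only, so its actual values will differ slightly.

```python
# Class counts for the three selected sentiments (from value_counts above).
class_counts = {"worry": 7433, "neutral": 6340, "happiness": 2986}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label:<10} weight = {weight:.2f}")
```

The rarest class (happiness) gets the largest weight, so its errors count more during training.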



In [15]:
print(confusion_matrix(test_cats,preds))



[[314 262 137]
 [233 873 489]
 [272 754 856]]
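To connect the confusion matrix to the classification report, the per-class precision and recall can be recomputed by hand. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and its alphabetical label order; the variable names are illustrative.

```python
labels = ["happiness", "neutral", "worry"]
# Rows are true classes, columns are predicted classes.
cm = [
    [314, 262, 137],
    [233, 873, 489],
    [272, 754, 856],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

per_class = {}
for i, label in enumerate(labels):
    true_count = sum(cm[i])                      # row sum: actual examples of this class
    predicted_count = sum(row[i] for row in cm)  # column sum: predictions of this class
    precision = cm[i][i] / predicted_count
    recall = cm[i][i] / true_count
    per_class[label] = (precision, recall)
    print(f"{label:<10} precision={precision:.2f} recall={recall:.2f}")

print(f"accuracy={accuracy:.2f}")
```

The values match the report above. The matrix also shows the main confusion: 754 worry tweets were predicted as neutral, and 489 neutral tweets as worry.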


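Judging from the results above, accuracy is low. Possible reasons:
- Many of the texts are short, which makes them harder than long texts such as news articles
- Emoticons, spelling, etc.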

#### Tips

An important point to keep in mind when using Doc2vec is the same as for fastText: if we have to use Doc2vec for feature representation, we have to store the model that learned the representation. While it’s not typically as bulky as fastText, it’s also not as fast to train. Such trade-offs need to be considered and compared before we make a deployment decision.