# Text Pre-processing

**Tasks**
- remove html tags 

## import packages

In [1]:
import pandas as pd
import sys
sys.path.append("/Users/lesleymi/data_science_tutorials/IMDB_Sentiment_Analysis/src")
import imdb_functions as imdb

# text cleaning and modelling 
from bs4 import BeautifulSoup
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import spacy

import multiprocessing

## functions 

In [11]:
def preprocess_text(text):
    """ 
    Produces the text with html tags removed and converts to all lower case. 
    
    Arguments
    ---------
    text (pandas.core.series.Series) A series of text documents. 
    
    Returns
    -------
    pandas.core.series.Series A series of text with html tags removed & lower case letters. 
        
    """
    # initialize a list for cleaned text 
    clean_text = []
    for doc in text:
        
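        # remove html tags with beautifulsoup (parser named explicitly to avoid a warning)
        soup = BeautifulSoup(doc, "html.parser")
        # use a new name so the `text` argument is not overwritten
        cleaned = soup.get_text().lower()

        # append the cleaned document to the list 
        clean_text.append(cleaned)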

    # convert list to a pandas series 
    clean_text = pd.Series(clean_text)
    
    return clean_text
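
To see what the cleaning step does to a single review without any third-party packages, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for BeautifulSoup. The names `_TextExtractor` and `strip_tags_lower` are illustrative only and are not part of `imdb_functions`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text between tags (illustrative stand-in for get_text)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # called for every run of text that is not part of a tag
        self.chunks.append(data)


def strip_tags_lower(doc):
    """Return the document with html tags removed and lower-cased."""
    parser = _TextExtractor()
    parser.feed(doc)
    parser.close()
    return "".join(parser.chunks).lower()


sample = "Great film.<br /><br />Would watch AGAIN."
cleaned_sample = strip_tags_lower(sample)
print(cleaned_sample)
```

Note that no space is inserted where the `<br />` tags were, so sentences on either side of a tag run together.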
    

## load data 

In [45]:
train = pd.read_csv("data/train_clean.csv")
X_train = train.text
train_raw = pd.read_csv("data/Train.csv")
y_train = train_raw.label

In [46]:
X_train.head()

0    grow watch love thunderbirds mate school watch...
1    movie dvd player sit coke chip expectation hop...
2    people know particular time past like feel nee...
3    great interest biblical movie bore death minut...
4    be die hard dad army fan change get tape dvd a...
Name: text, dtype: object

In [27]:
print("There are {} training documents.".format(len(X_train)))

There are 40000 training documents.


In [28]:
y_train.head()

0    0
1    0
2    0
3    0
4    1
Name: label, dtype: int64

## Tokenize text

In [29]:
# convert documents into tokens
docs = imdb.tokenize(X_train)
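
The `imdb.tokenize` helper lives in the project's `src` folder and is not shown in this notebook. Because the cleaned text already appears to be lower-cased and lemmatized, one plausible minimal stand-in is plain whitespace splitting. This is an assumption about its behaviour, not the actual implementation, and `tokenize_sketch` is a hypothetical name.

```python
def tokenize_sketch(documents):
    """Split each document on whitespace (assumed behaviour, not the real helper)."""
    return [str(doc).split() for doc in documents]


# two shortened examples in the style of the cleaned training text
sample_docs = ["grow watch love thunderbirds", "movie dvd player sit"]
sample_tokens = tokenize_sketch(sample_docs)
print(sample_tokens[0])
```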

## Build Doc2Vec Model 
Each `TaggedDocument` takes two inputs: a single document represented as a list of unicode strings (tokens), and a list of `tags` that identifies the document. A unique integer index per document is enough. 

The data structure passed to `Doc2Vec` should be a list of `TaggedDocument` objects. 
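
To make the expected input shape concrete, the sketch below builds a tiny tagged corpus with a named tuple that has the same two fields, `words` and `tags`. `TaggedDocumentSketch` and the toy documents are illustrative stand-ins so the example runs without gensim.

```python
from collections import namedtuple

# stand-in with the same two fields as a tagged document
TaggedDocumentSketch = namedtuple("TaggedDocumentSketch", ["words", "tags"])

# toy token lists, one per document
toy_docs = [["great", "movie"], ["bore", "death"], ["love", "film"]]

# the position of each document in the list is used as its unique tag
toy_tagged = [TaggedDocumentSketch(words=doc, tags=[tag]) for tag, doc in enumerate(toy_docs)]

for td in toy_tagged:
    print(td.tags, td.words)
```

Using the list position as the tag means a tag can later be used directly as an index back into the original data.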

**How the model is trained**

The `dm` model parameter selects the training algorithm and its underlying model architecture. Here `dm=1` selects the distributed memory (`PV-DM`) version, which learns not only the document vectors but the individual word vectors as well. 
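
As a rough illustration of the PV-DM idea, the sketch below builds the input to the prediction step by averaging a document vector with the vectors of the surrounding context words. The vectors are made-up toy numbers and `pv_dm_input` is a hypothetical helper. The real training loop also predicts the target word and updates the vectors, which is omitted here, and the library can combine vectors in other ways depending on its settings.

```python
def pv_dm_input(doc_vector, context_word_vectors):
    """Average the document vector with the context word vectors, dimension by dimension."""
    vectors = [doc_vector] + list(context_word_vectors)
    n_dims = len(doc_vector)
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(n_dims)]


# toy 2-dimensional vectors (not taken from the trained model)
toy_doc_vector = [1.0, 0.0]
toy_context = [[0.0, 1.0], [2.0, 2.0]]

combined = pv_dm_input(toy_doc_vector, toy_context)
print(combined)
```

Because the document vector takes part in every prediction made inside that document, it ends up acting as a memory of what the context words alone do not capture.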



In [30]:
# tag the documents 
tagged_docs = [TaggedDocument(words=doc, tags=[tag]) for tag, doc in enumerate(docs)]

In [19]:
# set number of processing cores 
cores = multiprocessing.cpu_count()

In [20]:
# set model params
max_epochs = 100
vec_size = 100
min_count = 2
alpha = 0.025
dm = 1
window = 10

# initialize the model 
model = Doc2Vec(vector_size=vec_size,
                min_count=min_count,
                alpha=alpha,
                dm=dm,
                epochs=max_epochs,
                window=window, 
                workers=cores)


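The `.build_vocab` method builds a dictionary for the model. It consists of the unique words from the training corpus along with their frequency counts in the corpus. Words that appear fewer than `min_count` times are dropped. 

The vocabulary can be accessed with `model.wv.vocab` (gensim 3.x; gensim 4 replaced this attribute with `model.wv.key_to_index`).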
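
Conceptually, the vocabulary step is a word count followed by a frequency filter. A minimal sketch of that idea with the standard library, where `build_vocab_sketch` and the toy corpus are illustrative and not the library's implementation:

```python
from collections import Counter


def build_vocab_sketch(token_lists, min_count=2):
    """Count every token in the corpus and keep those seen at least min_count times."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word: count for word, count in counts.items() if count >= min_count}


toy_corpus = [["love", "movie"], ["love", "plot"], ["movie", "bore"]]
toy_vocab = build_vocab_sketch(toy_corpus, min_count=2)
print(toy_vocab)
```

Here `plot` and `bore` each appear once, so they fall below `min_count=2` and are left out of the vocabulary.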

In [25]:
# build the vocabulary 
model.build_vocab(tagged_docs)

In [26]:
# get the number of times "love" is used in the corpus
print("Word love is used {} times throughout the corpus.".format(model.wv.vocab['love'].count))

Word love is used 10403 times throughout the corpus.


In [27]:
# train the model
model.train(documents=tagged_docs, 
            total_examples=model.corpus_count, 
            epochs=model.epochs)

In [29]:
# save the model 
model.save("results/d2v.model")

## Explore the Model 

**Most similar documents**

My main takeaway so far is that I do not understand the most similar docs. The scores are low: the most similar doc for either a positive or a negative review is below 0.5 cosine similarity, which may explain why the documents don't seem related at all. At a superficial glance, nothing about the words used gives me any intuition for why they rank as most similar. 


**Most similar words**

For the most part, these make more sense. The most similar words are usually synonyms or antonyms (for example `hero` and `villain`), although some queries, such as `grass`, return words that look unrelated. 

In [31]:
# load a saved model (this file name differs from the "results/d2v.model" path saved above)
model = Doc2Vec.load("results/d2v_train_clean.model")

In [32]:
# load spacy model 
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

# infer doc vector for a new document 
test_data = nlp("i love the imdb.")

# get the text from the document
test_data_text = [token.text for token in test_data]

In [33]:
# look at the test_data
test_data_text

['i', 'love', 'the', 'imdb', '.']

### Infer a vector 

In [34]:
# look at the inferred vector for the test document 
test_data_vector = model.infer_vector(test_data_text)
test_data_vector

array([-8.7601978e-01, -7.1176410e-02,  7.0591187e-01, -4.8143679e-01,
       -4.8137760e-01, -3.5408694e-02, -6.3677318e-02,  1.2787600e-01,
        1.5174484e-02, -7.7707690e-01, -5.6983024e-01,  1.5190088e+00,
       -5.5980694e-01,  1.0904826e+00, -6.5231264e-01,  2.8780508e-01,
       -4.6022081e-01,  5.5990957e-02,  3.1254226e-01,  2.3393029e-01,
       -2.3947120e-01,  3.2985991e-01, -6.5190800e-02,  3.7434554e-01,
        1.0477492e+00,  6.2189631e-02, -2.0642418e-01, -4.9563262e-01,
       -8.6489081e-02, -7.7440524e-01, -2.5099444e-01,  5.6423777e-01,
        5.8801621e-01,  4.6902564e-01,  1.4091048e+00,  1.9119181e-01,
       -8.9552552e-01,  1.0647823e+00, -1.1273570e-01,  6.1967146e-01,
       -3.8818705e-01,  1.7704718e-01,  7.8034365e-01,  3.7679043e-01,
       -2.9321137e-01, -7.2484118e-01, -1.0252564e-01, -5.6680793e-01,
        2.7603558e-01,  8.9655185e-01, -1.1634492e+00,  1.1330364e+00,
       -3.0282557e-03, -2.1823885e-01,  1.8622139e-02,  9.7461119e-02,
      

In [17]:
# check that the vector is 100 dimensions 
len(test_data_vector)

100

### Get doc vector for a document in training data 


In [35]:
# get doc vector for document with tag 0
model.docvecs[0]

array([ 2.3663208 , -0.8338305 , -1.5876216 , -0.4484202 ,  0.37763113,
        0.18997066, -0.15291765, -0.33705813,  1.0348196 , -1.830415  ,
       -1.8290714 , -0.59267575,  2.2594986 ,  0.47995684,  0.94431174,
        3.1104918 ,  1.8962055 ,  0.25603727, -0.9457041 ,  0.39135978,
        2.0950372 ,  1.123366  ,  1.5927137 ,  0.23922084,  2.2375321 ,
       -1.6106455 , -1.9829898 , -1.9768217 ,  0.24138217, -2.785498  ,
        2.7964442 , -1.17678   ,  0.20068131,  3.0975914 ,  0.41910535,
        2.8714657 , -0.15959644, -0.39355338, -2.368877  ,  0.85839856,
        0.4438577 , -0.56905115, -0.01014811, -1.033843  ,  0.42319435,
       -1.2470948 ,  1.8577793 ,  0.6409478 , -3.4192915 , -1.5226547 ,
        0.82558554,  1.6715181 , -1.1165489 , -1.0488758 , -2.9167726 ,
        0.6252657 ,  3.7159734 , -1.2257144 , -1.5467651 , -0.8403192 ,
        0.9729997 , -1.0113367 , -2.298275  , -2.788685  , -1.6347834 ,
        0.18489991, -2.1888363 ,  0.8171673 ,  0.91485846, -2.83

In [36]:
# get the doc vector for the document with tag 1
model.docvecs[1]

array([ 1.6520679e+00, -3.8588434e-01, -6.7569941e-01,  3.4696057e+00,
       -1.9679284e+00, -2.2057221e+00, -1.4142027e+00, -4.7539362e-01,
       -7.5158638e-01, -1.0545136e+00, -1.4270027e+00, -3.2239280e+00,
       -1.2632240e+00,  3.6208701e+00, -2.3817019e+00,  1.4604803e+00,
       -1.2440633e+00, -1.5918282e+00, -2.2776539e+00,  2.8869450e+00,
        4.9006929e+00, -3.4761369e-01, -1.0802556e+00,  1.8751051e+00,
        9.4934928e-01,  1.8222005e+00, -6.7758745e-01, -3.0567009e+00,
        7.6635689e-02,  1.9879228e+00, -5.4529065e-01,  1.3167815e-02,
        2.3686938e+00, -5.3301048e-01,  8.2191074e-01,  6.8047595e-01,
       -1.7336306e-01,  2.6591143e-01,  2.5094647e+00, -6.7854130e-01,
       -6.3020778e-01,  2.7396691e+00,  2.2822950e+00, -1.8262483e-01,
       -1.6745805e+00, -3.9558086e-01,  1.2627271e+00, -8.4782064e-01,
       -1.9645272e+00, -2.8780472e+00, -5.5718642e-01,  4.7494566e-01,
       -1.8261996e-01,  1.3231115e+00, -1.6106341e+00,  6.6597298e-02,
      

### Most similar document: negative
This returns the tags of the most similar documents along with their cosine similarity scores to `doc 1`, a negative review. 
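
For intuition about what these scores mean, here is a small plain-Python sketch of cosine similarity and of ranking documents by it. The toy vectors are invented for illustration, and `cosine_similarity` and `most_similar_sketch` are hypothetical helpers that mimic the idea, not the library's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 is same direction, 0 is orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_sketch(query_tag, doc_vectors, topn=2):
    """Rank every other document by cosine similarity to the query document."""
    query = doc_vectors[query_tag]
    scores = [(tag, cosine_similarity(query, vec))
              for tag, vec in doc_vectors.items() if tag != query_tag]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# toy 2-dimensional document vectors keyed by tag
toy_vectors = {0: [1.0, 0.0], 1: [1.0, 1.0], 2: [0.0, 1.0], 3: [-1.0, 0.0]}
ranked = most_similar_sketch(0, toy_vectors, topn=2)
print(ranked)
```

A score near 1 means two vectors point in nearly the same direction, so scores below 0.5 indicate documents that are only loosely aligned in the embedding space.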

In [37]:
most_sim_docs = model.docvecs.most_similar(1)
most_sim_docs

[(24141, 0.4895094037055969),
 (7551, 0.4757860004901886),
 (7709, 0.4681105613708496),
 (28398, 0.46748244762420654),
 (37855, 0.4662354588508606),
 (38331, 0.4656111001968384),
 (26914, 0.45377689599990845),
 (37692, 0.4481343626976013),
 (32846, 0.44798168540000916),
 (5742, 0.445815771818161)]

In [38]:
# look at doc 1 
query_text = " ".join(tagged_docs[1].words)
query_text

'movie dvd player sit coke chip expectation hope movie contain strong point movie awsome animation good flow story excellent voice cast funny comedy kick ass soundtrack disappointment find atlantis milo return read review let follow paragraph direct see movie enjoy primarily point scene appear shock pick atlantis milo return display case local videoshop expectation music feel bad imitation movie voice cast replace fit exception character like voice sweet actual drawing not bad animation particular sad sight storyline pretty weak like episode schooby doo single adventurous story get time not misunderstand good schooby doo episode not laugh single time snigger audience see movie especially care similar sequel fast review movie stand product like schooby doo like movie enjoy movie suspect good kid movie know well milo return episode series cartoon channel breakfast tv'

In [48]:
# get original raw text of doc 1 
train_raw.text[1]

"When I put this movie in my DVD player, and sat down with a coke and some chips, I had some expectations. I was hoping that this movie would contain some of the strong-points of the first movie: Awsome animation, good flowing story, excellent voice cast, funny comedy and a kick-ass soundtrack. But, to my disappointment, not any of this is to be found in Atlantis: Milo's Return. Had I read some reviews first, I might not have been so let down. The following paragraph will be directed to those who have seen the first movie, and who enjoyed it primarily for the points mentioned.<br /><br />When the first scene appears, your in for a shock if you just picked Atlantis: Milo's Return from the display-case at your local videoshop (or whatever), and had the expectations I had. The music feels as a bad imitation of the first movie, and the voice cast has been replaced by a not so fitting one. (With the exception of a few characters, like the voice of Sweet). The actual drawings isnt that bad, 

In [39]:
# get the sentiment of the query doc 
print("The sentiment of Doc 1 is: {}".format(y_train[1]))

The sentiment of Doc 1 is: 0


In [40]:
# initialize a dictionary to hold most similar texts 
# keyed by their index in the TaggedDocuments lists 
most_similar_texts = {}

# get the texts of the most similar docs
for most_sim_doc in most_sim_docs:
    # get the tagged doc index 
    index = most_sim_doc[0]
    
    # convert the tokens from most similar into text
    most_sim_text = " ".join(tagged_docs[index].words)
    
    # append the text to the list 
    most_similar_texts[index] = most_sim_text
    

In [41]:
# convert the most similar texts into a dataframe 
most_similar_df = pd.DataFrame(most_similar_texts, index=[1]).transpose().rename(columns={1:'most_similar_texts'})
most_similar_df

| | most_similar_texts |
|---|---|
| 24141 | new bear big fan surface think script computer... |
| 7551 | warn movie scary horror movie fan especially c... |
| 7709 | contain spoiler inuyasha good anime actually o... |
| 28398 | shock surprise negative review see web think c... |
| 37855 | despite disney well effort enjoyable movie fol... |
| 38331 | entertain random love hate expect sophisticate... |
| 26914 | despite review angel outfield pretty good movi... |
| 37692 | lion king doubt favorite disney movie time fig... |
| 32846 | lot music see movie time tonight road picture ... |
| 5742 | see movie teenager come theater way see nearly... |
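
The lookup loop used to build `most_similar_texts` can be wrapped in a small function so it is reusable for other query documents. The sketch below is one hypothetical way to do that with plain dictionaries. `collect_similar_texts` is an illustrative name, and the toy tuples stand in for `tagged_docs` and for the `(tag, score)` pairs returned by `most_similar`.

```python
from collections import namedtuple

# stand-in for a tagged document with `words` and `tags` fields
Tagged = namedtuple("Tagged", ["words", "tags"])


def collect_similar_texts(tagged_docs, most_sim_docs):
    """Map each similar document's index to its tokens joined into one string."""
    return {index: " ".join(tagged_docs[index].words) for index, _score in most_sim_docs}


toy_tagged_docs = [
    Tagged(["good", "movie"], [0]),
    Tagged(["bad", "plot"], [1]),
    Tagged(["fun", "cast"], [2]),
]
# (tag, similarity) pairs, highest similarity first
toy_most_sim = [(2, 0.48), (0, 0.45)]

similar_texts = collect_similar_texts(toy_tagged_docs, toy_most_sim)
print(similar_texts)
```

The dictionary keeps the ranking order of the input pairs, so the first key is the closest match.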


#### Most similar doc

In [42]:
# most similar text 24141
most_similar_df.loc[24141][0]

'new bear big fan surface think script computer graphic exceptional good sci fi flick see theater february tv guide say season finale announcer say effect season finale surface season finale series finale wait fall go happen fall get find nbc go pick sci fi usa bay watch long abc usa pick go gang buster bet abc chock ha series mini series loyal fan closure happen guy trap church steeple creature chaple nim grouth spert clone guy come unanswered question thank listen babble'

In [47]:
# original text 
train_raw.text[24141]

'I am new at this, so bear with me please. I am a big fan of Surface. I thought the script and the computer graphics were exceptional, as good as any Sci Fi flick I\'ve seen at the theater. In February the TV guide said Season Finale, the announcer for the show said something to the effect of, "...and now for the season finale of Surface." Season Finale, not series finale! I couldn\'t wait for fall to get here, to see was going to happen next. So fall gets here and it\'s nowhere to be found! If NBC isn\'t going to pick it up, what about Sci Fi or USA? It seems to me that Bay Watch didn\'t last long on ABC & then USA picked it up, and it went gang busters! (I bet ABC was chocking) Ha! If not a series, then at least a mini series, to give all us loyal fans closure. What happened to our guy\'s trapped in the church steeple? Was the creature in the chaple Nim? Did he have a grouth spert? Does the cloned guy come over to our side? There are so many unanswered questions. Thank\'s for listeni

In [49]:
print("Sentiment of most similar doc is: {}".format(y_train[24141]))

Sentiment of most similar doc is: 1


#### 2nd most similar doc 

In [50]:
most_similar_df.loc[7551][0]

'warn movie scary horror movie fan especially child play fan think incredibly funny will scare bad movie scary'

In [52]:
# original text
train_raw.text[7551]



In [53]:
print("Sentiment of 2nd most similar doc is: {}".format(y_train[7551]))

Sentiment of 2nd most similar doc is: 1


#### 3rd most similar doc

In [54]:
most_similar_df.loc[7709][0]

'contain spoiler inuyasha good anime actually overrate absolutely story line plot drag story filler episode plot progress filler story repeat episode plot kagome sense jewel shard worm slime tentacle demon thing pop inuyasha say wind scar iron reaver soul stealer etc kill demon jewel shard repeat scene repeat episode repeat comedic device funny anymore wait sexual harassment funny viz rate series old teen idea rate bad call funny sexual harassment kind suggestive arrest know inuyasha overrate videogame suck especially mask game play friend house interest game slow bore nintendo like graphic magazine get rate say role play game slow milkshake move cocktail straw stupid inuyasha toy action figure trade card sticker color book color book think inuyasha maybe member inuyasha group msn half people guess inuyasha little kid anime think small bite edit show toonami manga volume help wonder mile forest cut sad music music annoy hear song episode episode music get annoy anime music fit mood hea

In [55]:
print("Sentiment of 3rd most similar doc is: {}".format(y_train[7709]))

Sentiment of 3rd most similar doc is: 0


In [56]:
# get original text 
train_raw.text[7709]

'May or may not contain spoilers. <br /><br />Inuyasha is not a good anime. It\'s actually very overrated. Why? There\'s absolutely no story line, no plot, and the show just drags on... and on... and on... That\'s because there are more side stories and fillers than episodes that make the plot progress. And the fillers are just the same stories being repeated over and over again. The same episodes seem to go with the same plot: Kagome sensing a jewel shard, a worm/slime/tentacle demon thing pops up, Inuyasha says "Wind-Scar", "Iron Reaver Soul Stealer", etc. and kills the demon, they get the jewel shard, and then we just repeat this scene 160 or more times.<br /><br />Besides the repeating of episodes, there\'s the repeating of comedic devices, and they\'re not funny anymore. Wait, they never were. Sexual harassment is NOT funny. Viz rated the series Older Teens, 16+. I have no idea why they rated it that. There\'s nothing bad about it except for the so-called funny sexual harassment, 
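
Rather than comparing sentiments one at a time, the agreement between a query document and its neighbours can be summarized as a single fraction. The sketch below uses only the labels printed above (doc 1 is 0; docs 24141, 7551 and 7709 are 1, 1 and 0). `label_agreement` is a hypothetical helper.

```python
def label_agreement(query_label, neighbor_labels):
    """Fraction of neighbours whose label equals the query label."""
    if not neighbor_labels:
        raise ValueError("neighbor_labels must not be empty")
    matches = sum(1 for label in neighbor_labels if label == query_label)
    return matches / len(neighbor_labels)


# labels taken from the printed outputs above
query_label = 0               # doc 1
top3_labels = [1, 1, 0]       # docs 24141, 7551, 7709

agreement = label_agreement(query_label, top3_labels)
print("{:.2f} of the top 3 neighbours share the query label".format(agreement))
```

With only one of the three nearest neighbours sharing the negative label, the neighbours here carry little sentiment signal for this query.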

### Most similar document: positive

In [57]:
# query document 
most_sim_docs = model.docvecs.most_similar(4)
most_sim_docs

[(5435, 0.4771555960178375),
 (18429, 0.4644291400909424),
 (27699, 0.44662174582481384),
 (9161, 0.4408680498600006),
 (36322, 0.43742644786834717),
 (186, 0.43081188201904297),
 (13058, 0.42809751629829407),
 (16188, 0.4186444878578186),
 (25912, 0.41499632596969604),
 (23975, 0.4139326214790344)]

In [58]:
# look at doc 4
query_text = " ".join(tagged_docs[4].words)
query_text

'be die hard dad army fan change get tape dvd audiobooks time watch listen brand new film film run certain episode man hour enemy gate battle school numerous different edge introduction new general instead captain square brilliant especially cash cheque rarely follow early year get equipment uniform start train great film bore sunday afternoon draw back germans bogus dodgy accent come germans not pronounce letter w like cast liz frazer instead familiar janet davis like liz film like carry ons carry correctly janet davis well choice'

In [59]:
print("Sentiment of doc 4 is: {}".format(y_train[4]))

Sentiment of doc 4 is: 1


In [60]:
# get most similar texts
most_similar_df = imdb.get_most_similar_docs(tagged_docs=tagged_docs, most_sim_docs=most_sim_docs)
most_similar_df

| | most_similar_texts |
|---|---|
| 5435 | music laurence olivier sombre delivery set ton... |
| 18429 | watch series avidly wonder lengthy break tune ... |
| 27699 | sitcom big screen spin offs come list serve bl... |
| 9161 | think series go fun action series dynamic plot... |
| 36322 | like idea female turtle know tmnt brother teac... |
| 186 | good movie typical war flick bite different mo... |
| 13058 | episode man man dean learner air scratch episo... |
| 16188 | memorable line short live view episode line in... |
| 25912 | movie kick ass bar bam crue film get dvd day a... |
| 23975 | well martial fu movie time u love martial art ... |


In [61]:
most_similar_df.loc[5435][0]

'music laurence olivier sombre delivery set tone perfectly outstanding documentary ww ii buff descendant participant conflict politician think thing way extend foreign policy deck aircraft carrier hear george bush curious need know s s s aspect conflict episode roughly chronological order see sequence self contain bind new insight new viewer sheer volume present actual footage battle intersperse interview involve story interview 2 line authority support personnel main character private captain secretary eyewitness like real upfront taste war presently watch dvd version original television documentary strongly recommend wear gaptoothed overpriced vhs offering available ebay pay cdn dvd disc new release include bonus material screen mode menu easy follow choice episode want view select give option chapter episode play episode understandable comprehensive presentation tiny navigation menu impact diminish year nay year war remember watch broadcast buffalo pbs station move london wish right

In [14]:
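# note: 39119 is not among the most similar docs listed above (the top match is 5435),
# so this cell reports the sentiment of a different document
print("Sentiment of most similar doc is: {}".format(y_train[39119]))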

Sentiment of most similar doc is: 0


In [62]:
most_similar_df.loc[18429][0]

'watch series avidly wonder lengthy break tune episode series hook excellent telly grind break stuff like mission impossible character round expand series go change adapt readily new surrounding cleverly remain strictly character possible sympathy think feel sorry crush happen hope year lose look forward actually mind answer mystery'

In [63]:
print("Sentiment of 2nd most similar doc is: {}".format(y_train[18429]))

Sentiment of 2nd most similar doc is: 1



### Most similar words

In [66]:
model.wv.most_similar('love')

[('flat', 0.49725016951560974),
 ('asleep', 0.4771397113800049),
 ('apart', 0.46345919370651245),
 ('category', 0.45624393224716187),
 ('lust', 0.4409942030906677),
 ('lover', 0.4356623888015747),
 ('vassar', 0.41631919145584106),
 ('short', 0.40895459055900574),
 ('donna', 0.40465378761291504),
 ('ect', 0.390438973903656)]

In [67]:
model.wv.most_similar('violent')

[('brutal', 0.5618649125099182),
 ('vicious', 0.5486468076705933),
 ('violence', 0.5431817770004272),
 ('gory', 0.5042787194252014),
 ('cannibalism', 0.44713398814201355),
 ('tame', 0.44368988275527954),
 ('explicit', 0.4399290680885315),
 ('gruesome', 0.4369996190071106),
 ('exploit', 0.41187113523483276),
 ('levres', 0.4106130003929138)]

In [68]:
model.wv.most_similar('grass')

[('decieve', 0.43368470668792725),
 ('facepaint', 0.40532469749450684),
 ('occassionaly', 0.4004054665565491),
 ('unspoken', 0.39546847343444824),
 ('mindbogglingly', 0.3859484791755676),
 ('abstinence', 0.382354736328125),
 ('temple', 0.3801541328430176),
 ('beret', 0.3800097405910492),
 ('walrus', 0.3783597946166992),
 ('curtain', 0.37587088346481323)]

In [69]:
model.wv.most_similar('cat')

[('nancy', 0.4182920455932617),
 ('fenchurch', 0.40522515773773193),
 ('dog', 0.39421069622039795),
 ('mammy', 0.38828516006469727),
 ('michael', 0.3875651955604553),
 ('mutt', 0.3847420811653137),
 ('monkey', 0.3802471160888672),
 ('oog', 0.3756932020187378),
 ('bobbi', 0.37538284063339233),
 ('edgar', 0.3712940216064453)]

In [70]:
model.wv.most_similar('hero')

[('villain', 0.6178774833679199),
 ('heroic', 0.5215709805488586),
 ('protagonist', 0.5046625137329102),
 ('guy', 0.46219581365585327),
 ('soldier', 0.4537964463233948),
 ('criminal', 0.4315299093723297),
 ('evil', 0.4274206757545471),
 ('cop', 0.4197346866130829),
 ('character', 0.4120820164680481),
 ('gun', 0.4019153416156769)]

## Extract Doc Vectors 

In [95]:
# initialize a dict to store the doc vectors 
vector_dict = {}
colnames = []
for i in range(len(model.docvecs)):
    # build the dict of doc vectors 
    vector_dict[i] = model.docvecs[i]
    
# create the column names
for dim in range(vec_size):
    colname = "dim_{0}".format(dim)
    colnames.append(colname)

In [96]:
# create a dataframe of doc vectors
vector_df = pd.DataFrame(vector_dict).transpose()
# set the col names to be number of dimensions
vector_df.columns = colnames

In [98]:
print("There are {0} documents; each document is represented in {1} dimensions.".format(vector_df.shape[0], vector_df.shape[1]))

There are 40000 documents; each document is represented in 100 dimensions.
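
The reshaping in this section can be hard to picture: the dictionary maps each document tag to its vector, and the transpose turns that into one row per document with one column per dimension. A minimal plain-Python sketch of the same layout, using toy 3-dimensional vectors instead of the model's (all `toy_` names are illustrative):

```python
toy_vec_size = 3

# tag -> vector, mirroring how vector_dict is filled
toy_doc_vectors = {0: [0.1, 0.2, 0.3], 1: [0.4, 0.5, 0.6]}

# one column name per dimension: dim_0, dim_1, ...
toy_colnames = ["dim_{0}".format(dim) for dim in range(toy_vec_size)]

# one row per document, keyed by column name
toy_rows = {tag: dict(zip(toy_colnames, vector)) for tag, vector in toy_doc_vectors.items()}

n_docs = len(toy_rows)
n_dims = len(toy_colnames)
print("There are {0} documents; each document is represented in {1} dimensions.".format(n_docs, n_dims))
```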


In [91]:
# look at the first few documents 
# each row represents a movie review 
vector_df.head()

| | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5 | dim_6 | dim_7 | dim_8 | dim_9 | ... | dim_90 | dim_91 | dim_92 | dim_93 | dim_94 | dim_95 | dim_96 | dim_97 | dim_98 | dim_99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.784555 | -0.397502 | -0.331985 | 0.965592 | 2.012932 | 1.887488 | 0.510781 | 3.288674 | 3.433098 | -3.79123 | ... | -1.244087 | 1.192884 | 0.891242 | 2.866597 | 8.549128 | -0.478778 | 1.692087 | 2.417104 | 1.771905 | -1.312251 |
| 1 | 1.027664 | -3.78656 | -1.583889 | 3.12444 | -0.892188 | -1.558566 | 3.319089 | 1.988021 | 2.08938 | -0.524098 | ... | 0.397586 | 2.595253 | 2.952577 | -2.68418 | 1.753658 | -1.200642 | -0.714399 | -1.44035 | 1.5492 | -1.61397 |
| 2 | 0.591049 | -1.170077 | 0.830686 | 3.815988 | -0.730998 | 0.646123 | 4.21671 | 3.768117 | 1.581684 | 0.66036 | ... | 2.918566 | 0.399827 | -1.357417 | -0.124357 | 0.657655 | -2.304216 | -3.644226 | -0.174466 | 2.901997 | 0.991823 |
| 3 | -1.579851 | 0.370804 | 0.325788 | 2.72486 | 0.076738 | 0.271273 | 2.576356 | 1.767729 | -1.268051 | -3.105602 | ... | 1.757116 | -3.50896 | -0.034007 | 0.554777 | 2.150062 | 1.295318 | -0.126439 | 2.238556 | 1.850339 | -1.993807 |
| 4 | -1.05891 | -1.690887 | -0.004952 | -2.187342 | -3.583359 | 2.466982 | 2.651749 | 0.484369 | 1.344823 | -2.241074 | ... | -2.418538 | -4.10603 | 2.691415 | -2.924291 | -0.589641 | -2.774781 | -1.087569 | -1.415747 | -0.630574 | 0.065931 |


In [99]:
# save the document vector dataset
vector_df.to_csv("data/train_d2v.csv")