# Word Vectors

## *"I know words. I have the best words!"*
    - Noam Chomsky

## Discrete Sparse Representations

In [None]:
# documents = [line.strip() for line in open('../data/moby_dick.txt', encoding='utf8')]
import pandas as pd
df = pd.read_csv('../data/reviews.tsv', sep='\t')
documents = df.text.values.tolist()
print(documents[:2])

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

sentences_2 = documents[:1]

small_vectorizer = CountVectorizer()

X1 = small_vectorizer.fit_transform(sentences_2)

The result is a *sparse count matrix*:

In [None]:
# indexed representation
print(X1)

# dense representation
print(X1.todense())

We can access the mapping from vector position to feature names via `get_feature_names_out()` (called `get_feature_names()` in scikit-learn versions before 1.0):

In [None]:
print(small_vectorizer.get_feature_names_out())

The inverse (the mapping from feature names to vector positions) is stored as a dictionary in `vocabulary_`:

In [None]:
print(small_vectorizer.vocabulary_)
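
To make the two mappings concrete, here is a minimal sketch on a toy corpus. The sentences and the names `toy_docs`, `toy_vectorizer`, and `index_to_word` are made up for illustration and are not part of the reviews data. It inverts `vocabulary_` by hand, so it does not depend on which feature-name method your scikit-learn version offers.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the reviews data
toy_docs = [
    "the food was good",
    "the delivery was late",
    "good food good delivery",
]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each feature name to its column index
vocab = toy_vectorizer.vocabulary_

# Invert the mapping: column index -> feature name
index_to_word = {index: word for word, index in vocab.items()}
features = [index_to_word[i] for i in range(len(index_to_word))]

print(features)
print(toy_X.toarray())
```

Each row of the dense array is one document and each column is one vocabulary entry, so the `2` in the last row is the word `good` occurring twice in the third sentence.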

## Terminology 

![](../../material/pics/matrix.pdf)

Let's redo this for the entire corpus:

In [None]:
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0.001, max_df=0.75, stop_words='english')

X = vectorizer.fit_transform(documents)

print(X.shape)

## Exercise

Use vector operations to find out 
- what the 5 most frequent words are in `X`
- in how many different documents the word `delivery` occurs
- what percentage of the overall corpus that number corresponds to

In [None]:
# your code here
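
As a hint, the kind of vector operations involved can be sketched on a tiny hand-made count matrix. Everything here (`toy_terms`, `toy_counts`, the numbers) is hypothetical; the same operations should carry over to `X`, which is also a SciPy sparse matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy count matrix: 4 documents (rows) x 3 terms (columns)
toy_terms = np.array(["delivery", "food", "late"])
toy_counts = csr_matrix(np.array([
    [2, 1, 0],
    [0, 3, 0],
    [1, 0, 1],
    [0, 1, 0],
]))

# Total frequency per term: sum over all rows, flattened to a 1-D array
term_totals = np.asarray(toy_counts.sum(axis=0)).ravel()

# The 2 most frequent terms, most frequent first
top = np.argsort(term_totals)[::-1][:2]
top_terms = toy_terms[top].tolist()

# Document frequency: number of rows with a non-zero count in each column
doc_freq = np.asarray((toy_counts > 0).sum(axis=0)).ravel()

# Share of documents that contain each term
share = doc_freq / toy_counts.shape[0]

print(top_terms, doc_freq, share)
```

Summing over `axis=0` collapses the documents, which is what turns a document-by-term matrix into per-term statistics.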

## Character $n$-grams

We can also use characters to analyze text:

In [None]:
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 6), min_df=1, max_df=0.75)

C = char_vectorizer.fit_transform(documents[:10])
C

In [None]:
print(char_vectorizer.vocabulary_)
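
The vocabulary above is large, so a minimal sketch on a single short string may be easier to read. The input `"to be"` and the name `toy_char_vectorizer` are made up for illustration. With `analyzer='char'`, spaces are ordinary characters, so n-grams can span word boundaries.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character 2-grams and 3-grams of one hypothetical toy string
toy_char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 3))
toy_char_vectorizer.fit(["to be"])

# The keys of vocabulary_ are the extracted character n-grams
char_ngrams = sorted(toy_char_vectorizer.vocabulary_)
print(char_ngrams)
```

scikit-learn also offers `analyzer='char_wb'`, which builds n-grams only from text inside word boundaries.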

## Syntactic $n$-grams

In [None]:
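import spacy

# assumption: the small English model is installed (python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# pair each token's lemma with the lemma of its syntactic head
features = [' '.join(["{}_{}".format(c.lemma_, c.head.lemma_) 
                      for c in nlp(sentence)])
            for sentence in documents[:100]]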

syntax_vectorizer = CountVectorizer()
# separate name, so the corpus matrix X from above is not overwritten
X_syntax = syntax_vectorizer.fit_transform(features)

In [None]:
print(syntax_vectorizer.vocabulary_)

# Dense Distributed Representations

## Word embeddings with `Word2vec`

In [None]:
from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION

corpus = [document.split() for document in documents]
# initialize model
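w2v_model = Word2Vec(vector_size=100,  # `size` in gensim < 4.0
                     window=15, 
                     hs=0,
                     sample=0.000001,
                     negative=5, 
                     min_count=100,
                     workers=4,        # number of threads; -1 starts none, so nothing is trained
                     epochs=100        # `iter` in gensim < 4.0
)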

w2v_model.build_vocab(corpus)

w2v_model.train(corpus, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs)


Now we can use the model's embeddings:

In [None]:
w2v_model.wv['delivery']

In [None]:
# birthday - present + husband => present:birthday as husband:?
w2v_model.wv.most_similar(positive=['birthday', 'husband'], negative=['present'], topn=3)
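
The arithmetic behind such an analogy query can be sketched with plain NumPy. The 4-dimensional vectors and the helper names `unit` and `toy_analogy` below are invented for illustration; this is a simplified approximation of the idea, not gensim's actual implementation. The positive vectors are added, the negative ones subtracted, and the remaining words are ranked by cosine similarity to the result.

```python
import numpy as np

# Hypothetical 4-dimensional toy embeddings, invented for illustration
toy_embeddings = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def unit(vector):
    """Scale a vector to length 1."""
    return vector / np.linalg.norm(vector)

def toy_analogy(positive, negative, embeddings):
    """Rank the remaining words by cosine similarity to sum(positive) - sum(negative)."""
    query = (sum(unit(embeddings[w]) for w in positive)
             - sum(unit(embeddings[w]) for w in negative))
    query = unit(query)
    # The query words themselves are excluded from the candidates
    candidates = [w for w in embeddings if w not in positive and w not in negative]
    scores = {w: float(np.dot(query, unit(embeddings[w]))) for w in candidates}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# king - man + woman => man:king as woman:?
ranking = toy_analogy(positive=["king", "woman"], negative=["man"],
                      embeddings=toy_embeddings)
print(ranking)
```

On real embeddings the result is much noisier than in this toy setup, particularly for words that are rare in the training corpus.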

In [None]:
word1 = "birthday"
word2 = "weekend"

# retrieve the actual vector
print(w2v_model.wv[word1])

# compare
print(w2v_model.wv.similarity(word1, word2))

# get the 3 most similar words
print(w2v_model.wv.most_similar(word1, topn=3))
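
The score returned by `similarity` is the cosine similarity between the two word vectors. A minimal sketch of that measure, using made-up 2-dimensional vectors (`toy_vectors` and `cosine_similarity` are illustrative names, not gensim functions):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors; real embeddings have e.g. 100 dimensions
toy_vectors = {
    "birthday": np.array([1.0, 0.0]),
    "weekend":  np.array([1.0, 1.0]),
    "delivery": np.array([0.0, 2.0]),
}

sim_close = cosine_similarity(toy_vectors["birthday"], toy_vectors["weekend"])
sim_orthogonal = cosine_similarity(toy_vectors["birthday"], toy_vectors["delivery"])
print(sim_close, sim_orthogonal)
```

Because the measure depends only on the angle, the length of the vectors does not matter: `delivery` scores 0 against `birthday` here although it is the longest of the three.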



### Exercise
Use `spacy` to restrict the words in the documents to *content words*, i.e., nouns, verbs, and adjectives. Transform the words to lower case and append the POS tag with an underscore, e.g.:

`love_VERB old-fashioneds_NOUN`

This also allows us to distinguish between homographs, i.e., words that are written the same, but belong to different word classes, e.g., *love* in "I **love** old-fashioneds" vs. "He felt so sick, it must have been **love**".


Make sure to exclude sentences that contain none of the above.

Write the resulting corpus to a variable called `word_corpus`.

In [None]:
# Your code here


Rerun the `Word2vec` model from above on the new data set and test the same words as before.

In [None]:
# Your code here

## Exercise

Train 4 more `Word2vec` models and average the resulting embedding matrices.

In [None]:
# Your code here

## Document embeddings with `Doc2Vec`

In [None]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import FAST_VERSION
from gensim.models.doc2vec import TaggedDocument

df2 = pd.read_csv('../data/reviews.full.tsv', sep='\t')

corpus = []
# for docid, document in enumerate(documents):
#     corpus.append(TaggedDocument(document.split(), tags=["{0:0>4}".format(docid)]))
# tag each document with its review score, so that one vector is learned per score
for _, row in df2.iterrows():
    label = row.score
    text = row.text
    corpus.append(TaggedDocument(text.split(), tags=[str(label)]))

print('done')
d2v_model = Doc2Vec(vector_size=100, 
                    window=15,
                    hs=0,
                    sample=0.000001,
                    negative=5,
                    min_count=100,
                    workers=4,  # number of threads; -1 starts none, so nothing is trained
                    epochs=500,
                    dm=0, 
                    dbow_words=1)

d2v_model.build_vocab(corpus)

d2v_model.train(corpus, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)

We can now inspect the tags for which the model has learned a vector:

In [None]:
# gensim >= 4.0; older versions expose the tags as `d2v_model.docvecs.doctags`
d2v_model.dv.index_to_key

In [None]:
target_doc = '1'

# gensim >= 4.0; older versions use `d2v_model.docvecs.most_similar`
similar_docs = d2v_model.dv.most_similar(target_doc, topn=5)
print(similar_docs)

## Exercise

What are the 10 most similar ***words*** to each category?

In [None]:
# your code here
