# Fast Forward into Word Embeddings - Laboratory
**Modules:**
1. Playing around with classical embedding models
2. Playing around with deep embedding models
3. Playing around with a simple information retrieval mechanism

In [0]:
!pip install glove_python

In [0]:
import nltk
nltk.download('punkt')

from bs4 import BeautifulSoup
from gensim.models import Word2Vec
from glove import Corpus, Glove
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from nltk.tokenize import sent_tokenize
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
import string
from tqdm import tqdm
tqdm.pandas()

## 0. Preparing and Preprocessing the Dataset
We will use the same dataset as in the W1 Lab. You can reuse the preprocessing methods you learned in W1 here.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
data = pd.read_csv('DATASET_PATH')
data[['Text']].head()

In [0]:
print(data.shape)

In [0]:
# preprocess the text
PUNCTUATION = set(string.punctuation)
def remove_punctuations(text):
    # drop every ASCII punctuation character
    return ''.join(ch for ch in text if ch not in PUNCTUATION)
def strip_html(text):
    # parse the markup and keep only the visible text
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()
def preprocess(text):
    text = strip_html(text)
    text = remove_punctuations(text)
    return text.lower()

In [0]:
data['text_tfidf'] = data['Text'].progress_apply(preprocess)
data[['text_tfidf']].head()

## 1. Playing around with classical embedding models
We will use sklearn for this module. There are 3 different classes in sklearn which we can use:
- CountVectorizer,
- HashingVectorizer, and
- TfidfVectorizer.

### Task 1.1: Transform each document into vectors using TfidfVectorizer
TfidfVectorizer is easy to use. It bundles preprocessing options (e.g. `lowercase`, `stop_words`) with term-weighting parameters.
**Task**: Let's use the default weighting to build the embeddings, i.e. *raw_count* tf and *smooth_idf*.
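
To make the weighting concrete, here is a minimal pure-Python sketch of raw-count tf combined with smoothed idf. It assumes the formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 that scikit-learn documents for `smooth_idf=True`, and it applies no normalization, mirroring `norm=None`. The toy corpus and the helper name `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    tf is the raw term count; idf is the smoothed variant
    ln((1 + n) / (1 + df)) + 1, where n is the number of documents
    and df is the number of documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

toy_docs = ["dog dog cat", "cat bird"]  # illustrative corpus, not the lab dataset
weights = tfidf_weights(toy_docs)
```

In this toy corpus `cat` occurs in every document, so its idf is exactly 1, while `dog` occurs in only one document and therefore receives a larger idf.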

In [0]:
tfidf_model = TfidfVectorizer(
                    preprocessor=None, # pass a callable to use a custom preprocessor
                    tokenizer=None, # pass a callable to use a custom tokenizer
                    stop_words='english', # a custom list of stop words also works
                    max_features=1000, # keep only the 1000 most frequent terms (or pass vocabulary=...)
                    norm=None, # no normalization here; 'l2' is useful for cosine similarity
                    binary=False, # if True, every non-zero tf becomes 1
                    sublinear_tf=False, # if True, tf is replaced by 1 + log(tf)
                    use_idf=True, # if False, every idf is 1 (unary idf)
                    smooth_idf=True # if True, idf = ln((1 + n) / (1 + df)) + 1
                )
tfidf_model.fit(data.text_tfidf.values) # we can also use fit_transform to get the result directly

In [0]:
# useful properties
# print(tfidf_model.get_stop_words()) # get the stop words
# print(tfidf_model.idf_) # get the idf of each word
# print(tfidf_model.vocabulary_) # get the vocabulary (term -> column index)

In [0]:
print(data.at[0,'text_tfidf'])
print(tfidf_model.transform([data.at[0,'text_tfidf']]))

In [0]:
text = 'I love dog'
print(text)
print(tfidf_model.transform([text]))
# print(tfidf_model.transform([text]).toarray()) # to convert the sparse matrix into an array

### Task 1.2: Playing around with TfidfVectorizer
We have tried the default setting of TfidfVectorizer. **Task**: Let's try using *binary_tf* and/or *unary_idf*.
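
One possible starting point, shown on a toy corpus rather than the lab dataset: `binary=True` gives binary tf and `use_idf=False` gives unary idf. The variable names and the two example documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["dog dog cat", "cat bird"]  # illustrative corpus, not the lab dataset

# binary=True caps every non-zero term count at 1 (binary tf);
# use_idf=False turns off idf weighting, i.e. every idf is 1 (unary idf).
binary_unary_model = TfidfVectorizer(binary=True, use_idf=False, norm=None)
binary_unary_vectors = binary_unary_model.fit_transform(toy_docs).toarray()

# list the terms in column order
vocabulary = sorted(binary_unary_model.vocabulary_, key=binary_unary_model.vocabulary_.get)
```

With both switches set, each row simply marks which vocabulary terms occur in the document, so the repeated `dog` in the first document no longer counts twice.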

In [0]:
# code here

### Task 1.3: Transform each document into vectors using CountVectorizer and HashingVectorizer.
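CountVectorizer and HashingVectorizer are used in much the same way as TfidfVectorizer. The main difference is that they do not apply idf weighting: TfidfVectorizer is equivalent to CountVectorizer followed by an idf reweighting step, so the other two do slightly less work. HashingVectorizer also stores no vocabulary, which keeps memory usage low on large corpora, at the cost of not being able to map columns back to words.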
**Task**: Let's play around with CountVectorizer and HashingVectorizer.
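
The idea behind HashingVectorizer, the hashing trick, can be sketched without sklearn. The sketch below hashes each token into one of `n_buckets` columns instead of looking it up in a stored vocabulary. It uses `hashlib.md5` as a stand-in hash, and sklearn's own hash function and sign handling differ, so this illustrates the idea rather than reproducing sklearn's output. `hashed_counts` and `n_buckets` are hypothetical names.

```python
import hashlib

def hashed_counts(text, n_buckets=8):
    """Count tokens into a fixed-size vector via hashing (hypothetical helper)."""
    vector = [0] * n_buckets
    for token in text.lower().split():
        # a stable hash of the token decides the column; no vocabulary is stored
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % n_buckets] += 1
    return vector

vec_a = hashed_counts("dog dog cat")
vec_b = hashed_counts("cat dog dog")
```

Because nothing is learned from the data, there is no fitting step, and the same bag of words always maps to the same vector. The price is that different tokens can collide in one column.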

In [0]:
# code here

## 2. Playing around with deep embedding models
We will use gensim and glove_python for this module to create word embeddings with Word2Vec and GloVe.

In [0]:
# preprocess the data: split every document into sentences, then every sentence into tokens
sentences = [sent_tokenize(text) for text in tqdm(data.Text.values)]
sentences = [preprocess(stc).split() for stcs in tqdm(sentences) for stc in stcs]
print(len(sentences))

### Task 2.1: Build word embeddings using Word2Vec
Using the same dataset, we would like to build our own word embeddings. Note that our dataset is actually very small; here, we focus only on the mechanics of building the word embeddings. **Task**: Let's build skip-gram (SG) word embeddings with 128 dimensions.
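
To see what skip-gram is trained on, the sketch below lists the (center, context) pairs that a fixed window produces for one sentence. It is a simplification: gensim, for example, randomly shrinks the window per position, which this sketch does not do. `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (center, context) training pairs for one tokenized sentence."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                yield center, tokens[j]

pairs = list(skipgram_pairs(["i", "love", "my", "dog"], window=1))
```

With `window=1`, every word is paired only with its immediate neighbours; a larger window pairs it with words further away.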

In [0]:
w2v_model = Word2Vec(
                    sentences=sentences,
                    size=128, # number of vector dimensions (renamed vector_size in gensim 4)
                    window=5, # maximum distance between the current and predicted words
                    min_count=2, # ignore words that occur fewer times than this
                    max_vocab_size=None, # limit on the vocabulary size (None = no limit)
                    sg=1, # if 1, then skip-gram is used; if 0, then CBOW
                    negative=20, # number of negative samples drawn per positive example
                    iter=10, # number of passes over the corpus (renamed epochs in gensim 4)
                    workers=4 # number of worker threads; must be positive, -1 does not mean "all cores"
                )

In [0]:
# useful properties
# print(w2v_model.wv['dog']) # get the word vector of 'dog'
# print(w2v_model.wv.index2word) # get the vocabulary
# print(w2v_model.wv.vectors) # get all word vectors

In [0]:
# get most similar words
w2v_model.wv.most_similar('store')

### Task 2.2: Build word embeddings using GloVe
**Task**: Let's build GloVe word embeddings with 128 dimensions.
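
The `alpha` and `max_count` arguments of `Glove` control how strongly each co-occurrence count contributes to the loss. The sketch below assumes glove_python follows the weighting function from the GloVe paper, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise; `glove_weight` is a hypothetical helper, not part of the library.

```python
def glove_weight(cooccurrence_count, max_count=100, alpha=0.75):
    """Weight of one co-occurrence cell in the GloVe loss (hypothetical helper)."""
    # rare pairs are down-weighted; pairs seen max_count times or more get weight 1
    return min(1.0, cooccurrence_count / max_count) ** alpha

example_weights = [glove_weight(x) for x in (1, 10, 100, 1000)]
```

The weights grow with the count and saturate at 1, so very frequent pairs cannot dominate the training objective.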

In [0]:
corpus = Corpus()
corpus.fit(sentences, window=5)

In [0]:
glove = Glove(
            no_components=128, # number of vector dimension
            learning_rate=0.05,
            alpha=0.75, # weighting
            max_count=100 # weighting
        )
glove.fit(corpus.matrix, epochs=10, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

In [0]:
# useful properties
# print glove.dictionary
# print glove.word_biases
# print glove.word_vectors

In [0]:
# get most similar words
glove.most_similar('store', 10)

## 3. Playing around with a simple information retrieval mechanism
Now that we can transform every document into a vector, we would like to build a simple information retrieval algorithm using *cosine similarity*. We will do the matrix calculations with numpy.

### Task 3.1: Find most similar documents using tfidf vectors
We have built our tfidf_model in Task 1.1. However, its output vectors are not normalized yet, because we set `norm=None`. Once every vector has unit L2 length, the cosine similarity of two vectors reduces to their dot product, which makes the calculation easier.
**Task**: Let's normalize each tfidf document vector and build a simple document search.
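
The reasoning above can be checked on a tiny example. The sketch below normalizes a made-up matrix and a made-up query, then ranks rows by dot product. `top_k_cosine` and the toy numbers are hypothetical and independent of the lab dataset.

```python
import numpy as np

def top_k_cosine(doc_vectors, query_vector, k=2):
    """Return indices and scores of the k rows most similar to the query."""
    # scale every vector to unit L2 length; the epsilon guards against all-zero rows
    docs = doc_vectors / (np.linalg.norm(doc_vectors, axis=1, keepdims=True) + 1e-10)
    query = query_vector / (np.linalg.norm(query_vector) + 1e-10)
    scores = docs @ query  # dot product of unit vectors = cosine similarity
    order = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return order, scores[order]

toy_vectors = np.array([[1.0, 0.0, 0.0],
                        [2.0, 2.0, 0.0],
                        [0.0, 0.0, 3.0]])
indices, scores = top_k_cosine(toy_vectors, np.array([1.0, 1.0, 0.0]))
```

The second row points in the same direction as the query, so it ranks first with a score close to 1 even though its length differs from the query's.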

In [0]:
# transform documents into tfidf vectors
tfidf_vectors = tfidf_model.transform(data.text_tfidf.values[:10000]).toarray() # same preprocessed column the model was fit on
print(tfidf_vectors.shape)

In [0]:
# normalize with the L2 norm so that each vector has length 1
tfidf_vectors /= np.linalg.norm(tfidf_vectors, axis=1).reshape(-1,1) + 1e-10

In [0]:
def retrieve_documents(text, k=5):
    # print the top-k documents most similar to the query text
    print('query text: {}'.format(text))
    print('')

    # preprocess the query the same way as the documents the model was fit on
    query_vector = tfidf_model.transform([preprocess(text)]).toarray()
    query_vector /= np.linalg.norm(query_vector, axis=1).reshape(-1,1) + 1e-10

    # both sides have unit length, so the dot product is the cosine similarity
    scores = np.sum(tfidf_vectors*query_vector, axis=1)
    top_k = np.argsort(scores)[::-1][:k] # indices of the k highest scores

    for rank, idx in enumerate(top_k):
        print('result {}: {}'.format(rank+1, scores[idx]))
        print(data.Text.values[idx])
        print('')

In [0]:
retrieve_documents('I love dog')

In [0]:
retrieve_documents(data.at[15000,'Text'])

### Task 3.2: Find most similar documents using Word2Vec & Glove vectors
We have built our w2v_model in Task 2.1 and obtained the word embeddings. However, we would like to compare document similarity, not word similarity. The easiest method is to average all the word vectors in a document to get the document vector. **Task**: Let's use the Word2Vec embeddings to find the most similar documents.
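
The averaging step might look like the sketch below. It works on any mapping from word to vector, shown here with a made-up dictionary of 2-D vectors rather than trained embeddings. `document_vector` is a hypothetical helper, and skipping unknown words is one possible design choice among several.

```python
import numpy as np

def document_vector(tokens, word_vectors, dim):
    """Average the vectors of the known tokens (hypothetical helper).

    word_vectors is any mapping from word to 1-D array, e.g. a plain dict;
    tokens missing from it are skipped, and a document with no known
    tokens maps to the zero vector.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

toy_word_vectors = {  # made-up 2-D vectors, not trained embeddings
    "dog": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
}
doc_vec = document_vector("my dog likes my cat".split(), toy_word_vectors, dim=2)
empty_vec = document_vector(["unknown"], toy_word_vectors, dim=2)
```

Since `w2v_model.wv` supports both `word in w2v_model.wv` and `w2v_model.wv[word]`, it should be usable in place of the toy dictionary, with `dim=128`.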

In [0]:
# code here