# Document Encoding using <a href = 'https://arxiv.org/pdf/1801.06146.pdf'>ULMFIT</a> 

In this approach to finding the similarity between documents, we encode each document into a high-dimensional vector by taking the output of the language model encoder at the last time step.
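
As a minimal sketch of the comparison step: once each document is a vector, similarity reduces to the cosine of the angle between two vectors. The 4-dimensional vectors below are made-up stand-ins for real encodings, and `cosine` is a hypothetical helper, not part of fastai or scikit-learn.

```python
import numpy as np

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the vector norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy "document encodings" (made-up values, for illustration only)
doc_a = np.array([1.0, 2.0, 0.0, 1.0])
doc_b = np.array([2.0, 4.0, 0.0, 2.0])   # same direction as doc_a
doc_c = np.array([0.0, 0.0, 3.0, 0.0])   # orthogonal to doc_a

sim_ab = cosine(doc_a, doc_b)  # close to 1: the vectors point the same way
sim_ac = cosine(doc_a, doc_c)  # 0: the vectors are orthogonal
print(sim_ab, sim_ac)
```

Because cosine similarity ignores vector length, `doc_a` and `doc_b` count as maximally similar even though their magnitudes differ.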

In [1]:
from fastai import *
from fastai.text import *
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
with open('../list_of_sentences', 'r') as f:
    # strip only the trailing newline, so a final line without one is not truncated
    documents = [line.rstrip('\n') for line in f]

In [3]:
documents

['good morning',
 'how are you doing ?',
 'the weather is awesome today',
 'samsung',
 'good afternoon',
 'baseball is played in the USA',
 'there is a thunderstorm ',
 'are you doing good ?',
 'The polar regions are melting"',
 'apple',
 'nokia',
 'cricket is a fun game',
 'the climate change is a problem']

In [4]:
path = untar_data(URLs.IMDB) #this will download a 176 MB tgz file(only downloads once)
path.ls()

[PosixPath('/home/anubhav/.fastai/data/imdb/unsup'),
 PosixPath('/home/anubhav/.fastai/data/imdb/test'),
 PosixPath('/home/anubhav/.fastai/data/imdb/train'),
 PosixPath('/home/anubhav/.fastai/data/imdb/imdb.vocab'),
 PosixPath('/home/anubhav/.fastai/data/imdb/tmp_clas'),
 PosixPath('/home/anubhav/.fastai/data/imdb/README'),
 PosixPath('/home/anubhav/.fastai/data/imdb/tmp_lm')]

Here we use the vocabulary of the IMDB movie reviews dataset to encode the documents.

In [5]:
data_lm = (TextList.from_folder(path)
            .filter_by_folder(include=['train', 'test', 'unsup']) 
            .split_by_rand_pct(0.01, seed=42)
            .label_for_lm()           
            .databunch(bs=32, num_workers=-1))

In [18]:
data_lm.save('lm_db_movie.pkl')

In [7]:
len(data_lm.vocab.itos)

60000
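
To make the role of the vocabulary concrete, here is a toy sketch of numericalization: each token is replaced by its index in `itos`, and any token missing from the vocabulary falls back to the unknown token. The six-entry vocabulary and the `numericalize` helper are hypothetical. The real `data_lm.vocab` has 60000 entries, and fastai's tokenizer also inserts special tokens that are not modelled here.

```python
from collections import defaultdict

# toy vocabulary (hypothetical); index 0 plays the role of the unknown token
itos = ['xxunk', 'xxpad', 'the', 'movie', 'is', 'good']

# string-to-index lookup that returns 0 (unknown) for unseen tokens
stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})

def numericalize(text):
    # whitespace tokenization only; a real tokenizer does much more
    return [stoi[tok] for tok in text.lower().split()]

ids = numericalize('the weather is good')
print(ids)  # 'weather' is not in the toy vocabulary, so it maps to 0
```

This matters here because the vocabulary comes from movie reviews, so words that are rare in that domain may be encoded as the unknown token.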

The architecture chosen is <a href = "https://docs.fast.ai/text.models.html#AWD_LSTM">AWD-LSTM</a>, pretrained on the WikiText-103 dataset.

In [8]:
learn = language_model_learner(data_lm, AWD_LSTM)

In [9]:
learn.save('learn_similar')

**This language model learner object has two sub-networks: an encoder and a decoder**

In [10]:
learn.model

SequentialRNN(
  (0): AWD_LSTM(
    (encoder): Embedding(60000, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(60000, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1152, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1152, 1152, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1152, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=400, out_features=60000, bias=True)
    (output_dp): RNNDropout()
  )
)

**The encoding of each document will be a 400-dimensional vector, taken from the last time step of the language model encoder**

In [11]:
def get_one_item(learn, doc):
    xb, yb = learn.data.one_item(doc)
    return xb

In [12]:
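def encode_doc(learn, doc):
    xb = get_one_item(learn, doc)
    lstm_encoder = learn.model[0]
    # switch to eval mode (disables dropout) and clear the hidden state left by the previous document
    lstm_encoder.eval()
    lstm_encoder.reset()
    with torch.no_grad():
        out = lstm_encoder(xb)
    # out[0]: raw outputs per layer, [2]: last LSTM layer, [0]: first item in batch, [-1]: last time step
    # .cpu() keeps this working when the model runs on a GPU
    return out[0][2][0][-1].cpu().numpy()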
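
The index chain `out[0][2][0][-1]` can be read step by step. The sketch below mimics the structure with NumPy arrays, assuming the encoder returns a pair of lists holding one `(batch, sequence length, hidden size)` array per LSTM layer, with hidden sizes 1152, 1152, and 400 as in the model printed above. The sequence length of 5 and the random values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len = 1, 5             # one document of 5 tokens (made-up length)
hidden_sizes = [1152, 1152, 400]  # per-layer output sizes from the model summary

# stand-in for the encoder output: (raw per-layer outputs, dropped-out per-layer outputs)
raw_outputs = [rng.standard_normal((batch, seq_len, h)) for h in hidden_sizes]
outputs = [r.copy() for r in raw_outputs]
out = (raw_outputs, outputs)

last_layer = out[0][2]      # raw output of the third LSTM layer: (1, 5, 400)
first_doc = last_layer[0]   # first (only) item in the batch: (5, 400)
doc_vector = first_doc[-1]  # last time step: (400,)
print(doc_vector.shape)
```

Only the final time step is kept, so the whole document is summarised by whatever the last LSTM layer holds after reading the final token.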

We store the vector representations of the documents in the variable `document_matrix`.

In [13]:
document_matrix = []
for doc in documents:
    doc_vector = encode_doc(learn, doc)
    document_matrix.append(doc_vector)

In [14]:
document_matrix = np.array(document_matrix)
document_matrix.shape

(13, 400)

In [15]:
def find_similar(documents, document_matrix):
    '''
    for each document, return it followed by its two most similar documents,
    ranked by the cosine similarity of their vectors
    '''
    similar_list = []
    for i in range(len(documents)):
        sim = cosine_similarity(document_matrix[i:i+1], document_matrix)[0]
        # keep candidates with 0 < similarity < 0.99; this drops the document
        # itself (similarity 1) along with any near-identical vectors
        candidates = [(j, s) for j, s in enumerate(sim) if 0. < s < 0.99]
        # most similar first
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        sim_sentences = [documents[i]]
        sim_sentences.extend(documents[j] for j, _ in candidates[:2])
        similar_list.append(sim_sentences)

    return similar_list

In [16]:
find_similar(documents, list(document_matrix))

[['good morning', 'good afternoon', 'apple'],
 ['how are you doing ?',
  'are you doing good ?',
  'the weather is awesome today'],
 ['the weather is awesome today',
  'the climate change is a problem',
  'how are you doing ?'],
 ['samsung', 'apple', 'good afternoon'],
 ['good afternoon', 'good morning', 'samsung'],
 ['baseball is played in the USA',
  'the weather is awesome today',
  'cricket is a fun game'],
 ['there is a thunderstorm ',
  'the weather is awesome today',
  'the climate change is a problem'],
 ['are you doing good ?',
  'how are you doing ?',
  'the weather is awesome today'],
 ['The polar regions are melting"',
  'the weather is awesome today',
  'the climate change is a problem'],
 ['apple', 'samsung', 'nokia'],
 ['nokia', 'apple', 'good afternoon'],
 ['cricket is a fun game',
  'the weather is awesome today',
  'baseball is played in the USA'],
 ['the climate change is a problem',
  'the weather is awesome today',
  'cricket is a fun game']]
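
The per-document loop can also be expressed with a single similarity matrix. The sketch below is an alternative, not the function used above: it uses plain NumPy and excludes each document by its index rather than by a similarity threshold, so it can return different neighbours when two documents have near-identical vectors or a negative similarity. The helper name `top_k_similar` and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_similar(matrix, k=2):
    # normalize rows so that a dot product equals cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # never return the document itself
    # indices of the k largest similarities per row, most similar first
    return np.argsort(-sim, axis=1)[:, :k]

# toy 2-dimensional "encodings": rows 0 and 1 are close, rows 2 and 3 are close
toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [0.1, 0.9]])
neighbours = top_k_similar(toy, k=1)
print(neighbours.ravel())
```

Each row of `neighbours` holds document indices, which can be mapped back to sentences with `documents[j]`.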

Compared with TF-IDF, this approach is able to find the similarity in the example below:

['cricket is a fun game',
  'the weather is awesome today',
  'baseball is played in the USA']

The result contains **'baseball is played in the USA'**, which is also related to **sports**.
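
For context, a TF-IDF comparison of the two sports sentences might look like the sketch below. The TF-IDF experiment itself is not shown here, so this is an assumed baseline: it uses scikit-learn's `TfidfVectorizer` with default settings and fits it on only these two sentences. With those defaults the sentences share just the token `is`, so their cosine similarity is low.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sports = ['cricket is a fun game', 'baseball is played in the USA']

# default settings: lowercasing, word tokens of two or more characters
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sports)

tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# terms with a non-zero weight in both sentences
dense = tfidf.toarray()
shared_terms = sorted(t for t, i in vectorizer.vocabulary_.items()
                      if dense[0, i] > 0 and dense[1, i] > 0)
print(shared_terms, round(tfidf_sim, 3))
```

A purely lexical score like this depends on shared words, whereas the encoder vectors can place two sentences close together even when they have few words in common.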



## Scope of improvements

* With a larger corpus of our own text documents, we could fine-tune the language model on it, which would likely let us group similar documents more accurately.