<a href="https://colab.research.google.com/github/farahFif/Facts-database-query-with-word2vec/blob/master/Searching_in_fact_database_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Searching the curious facts database with doc2vec and training doc2vec on a topic-modeling dataset

The goal is to retrieve facts relevant to a query. For example, typing "good mood" should surface the fact that Cherophobia is the fear of fun.

In [9]:
!pip install gensim



In [1]:
from gensim.models.doc2vec import Doc2Vec
import pickle
import os 
import re 
import numpy as np 
import nltk
import nltk.tokenize as tokenizer
nltk.download('punkt')
from sklearn.metrics.pairwise import cosine_similarity 
from collections import Counter
import heapq


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


First we need to read the facts. The facts file is available [here](https://github.com/hsu-ai-course/hsu.ai/blob/master/code/datasets/nlp/facts.txt).

In [2]:
# Read the facts into a list, one fact per line
# The facts can be found here https://github.com/hsu-ai-course/hsu.ai/blob/master/code/datasets/nlp/facts.txt

with open('facts.txt') as file:
    # Drop the trailing newline of every line
    facts = [line.rstrip("\n") for line in file]
print(facts[0])

1. If you somehow found a way to extract all of the gold from the bubbling core of our lovely little planet, you would be able to cover all of the land in a layer of gold up to your knees.
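
Every fact starts with its number ("1. If you somehow found ..."), and that prefix carries no meaning for retrieval. As a minimal sketch, the prefix can be removed with a regular expression anchored at the start of the line, so that digits inside the fact itself stay untouched. The helper name `strip_fact_number` is hypothetical and not part of the original notebook.

```python
import re

def strip_fact_number(line):
    """Hypothetical helper: drop a leading 'N.' prefix and the trailing newline.

    The pattern is anchored with ^, so numbers inside the fact are preserved.
    """
    return re.sub(r"^\s*\d+\.\s*", "", line.rstrip("\n"))

sample = "76. You breathe on average about 8,409,600 times a year\n"
print(strip_fact_number(sample))
```

An unanchored pattern such as `\d+.` would also eat digits in the middle of a sentence, together with the character that follows them.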


Next we turn every fact into a vector with doc2vec.
We tokenize the facts, load the pre-trained doc2vec model from https://github.com/jhlau/doc2vec, and infer one vector per fact.


In [4]:
# Transforming sentences to vectors

def is_apt_word(word):
    """ Checking that the token consists of letters only """
    return word.isalpha()

def norm_vectors(A):
    """ Scaling every row of A to unit L2 norm """
    norm = np.linalg.norm(A, axis=1).reshape(-1, 1)
    return A / norm

words = []
for fact in facts:
    # Strip only the leading "N." numbering; other digits are dropped by is_apt_word
    st = re.sub(r"^\d+\.", "", fact)
    tok = tokenizer.word_tokenize(st)
    words.append([w for w in tok if is_apt_word(w)])
print(words)

# Generating vectors
model = Doc2Vec.load('doc2vec.bin', mmap=None)
# Iterate over the token lists directly: np.array(words) would build a ragged object array
sent_vecs = np.array([model.infer_vector(v) for v in words])
sent_vecs = norm_vectors(sent_vecs)

[['If', 'you', 'somehow', 'found', 'a', 'way', 'to', 'extract', 'all', 'of', 'the', 'gold', 'from', 'the', 'bubbling', 'core', 'of', 'our', 'lovely', 'little', 'planet', 'you', 'would', 'be', 'able', 'to', 'cover', 'all', 'of', 'the', 'land', 'in', 'a', 'layer', 'of', 'gold', 'up', 'to', 'your', 'knees'], ['McDonalds', 'calls', 'frequent', 'buyers', 'of', 'their', 'food', 'heavy', 'users'], ['The', 'average', 'person', 'spends', 'months', 'of', 'their', 'lifetime', 'waiting', 'on', 'a', 'red', 'light', 'to', 'turn', 'green'], ['The', 'largest', 'recorded', 'snowflake', 'was', 'in', 'Keogh', 'MT', 'during', 'year', 'and', 'was', 'inches', 'wide'], ['You', 'burn', 'more', 'calories', 'sleeping', 'than', 'you', 'do', 'watching', 'television'], ['There', 'are', 'more', 'lifeforms', 'living', 'on', 'your', 'skin', 'than', 'there', 'are', 'people', 'on', 'the', 'planet'], ['Southern', 'sea', 'otters', 'have', 'flaps', 'of', 'skin', 'under', 'their', 'forelegs', 'that', 'act', 'as', 'pockets'



Next we need to find the 5 facts closest to the query. Closeness is measured by the cosine similarity between the query vector and each fact vector.
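
Because the fact vectors are scaled to unit length, cosine similarity reduces to a plain dot product. The sketch below checks this on two random vectors; the vectors and their size are made up for illustration and have nothing to do with the real model.

```python
import numpy as np

# Two arbitrary vectors (illustrative only, not model output)
rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Cosine similarity straight from its definition
cos_ab = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product after scaling both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = np.dot(a_unit, b_unit)

# The two quantities agree up to floating point error
print(np.isclose(cos_ab, dot_unit))
```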

In [7]:
def get_words_from_sentence(sentences):
    for sentence in sentences: 
        yield nltk.word_tokenize(sentence.split('.', 1)[1])

def find_k_closest(query, dataset, k=5):
    # Find the k closest rows in dataset in terms of cosine similarity.
    # Since vectors in dataset are already normed, cosine similarity is just dot product.
    cos = np.dot(dataset, query)
    # Indices of the k largest similarities, best match first
    indx = heapq.nlargest(k, range(len(cos)), cos.take)
    return [(j, cos[j]) for j in indx]


query = "good mood"
query_vec = model.infer_vector(nltk.word_tokenize(query))
query_vec_normed = query_vec/np.linalg.norm(query_vec)
r = find_k_closest(query_vec_normed,sent_vecs)

print("Results for query:", query)
for k, p in r:
    print("\t", facts[k], "sim=", p)

Results for query: good mood
	 144. Dolphins sleep with one eye open! sim= 0.6115808
	 68. Cherophobia is the fear of fun. sim= 0.60771847
	 57. Gorillas burp when they are happy sim= 0.59873986
	 76. You breathe on average about 8,409,600 times a year sim= 0.5648149
	 110. Cats have 32 muscles in each of their ears. sim= 0.56407446
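
For larger collections the top-k search can be done without a heap. The following is a sketch of an alternative, assuming unit-length rows as above: `np.argpartition` selects the k best candidates and only those are sorted. The function name `top_k_by_dot` and the toy matrix are hypothetical.

```python
import numpy as np

def top_k_by_dot(query, dataset, k=5):
    """Return (index, similarity) pairs for the k rows most similar to query.

    Assumes the rows of dataset and the query are already unit length,
    so the dot product equals cosine similarity.
    """
    sims = dataset @ query
    k = min(k, len(sims))
    if k <= 0:
        return []
    # argpartition moves the k largest values into the last k slots, unordered
    candidates = np.argpartition(sims, -k)[-k:]
    # Sort only those k candidates, highest similarity first
    order = candidates[np.argsort(sims[candidates])[::-1]]
    return [(int(i), float(sims[i])) for i in order]

# Toy data: three unit vectors in 2D
toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(top_k_by_dot(np.array([0.0, 1.0]), toy, k=2))
```

For a few hundred facts the difference is negligible; the heap-based version is just as suitable here.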


# Training doc2vec model on topic-modeling dataset



In [8]:
# First we download the dataset: 4 files, each covering one specific topic

! wget 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt'
! wget 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt'
! wget 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_fuel_845docs.txt'
! wget 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_braininjury_10000docs.txt'

--2020-06-04 15:10:18--  https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_music_2084docs.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.23.128, 2404:6800:4008:c01::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.23.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13985603 (13M) [application/octet-stream]
Saving to: ‘testdata_news_music_2084docs.txt’


2020-06-04 15:10:19 (19.3 MB/s) - ‘testdata_news_music_2084docs.txt’ saved [13985603/13985603]

--2020-06-04 15:10:21--  https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/topic-modeling-tool/testdata_news_economy_2073docs.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.23.128, 2404:6800:4008:c01::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.23.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length

In [0]:
def read_dataset(file_path):
    """ Read one document per line and tokenize each line """
    with open(file_path) as fp:
        return [nltk.word_tokenize(line) for line in fp]

fuel_data = read_dataset("testdata_news_fuel_845docs.txt")
brain_inj_data = read_dataset("testdata_braininjury_10000docs.txt")
economy_data = read_dataset("testdata_news_economy_2073docs.txt")
music_data = read_dataset("testdata_news_music_2084docs.txt")

all_data = fuel_data + brain_inj_data + economy_data + music_data

In [11]:
print(len(all_data))
assert len(all_data) == 15002

15002


In [12]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tag every tokenized document with its index in all_data
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(all_data)]

# train a model
model_d2v = Doc2Vec(
    documents,       # collection of tagged documents
    vector_size=300, # output vector size
    window=2,        # maximum distance between the target word and its neighboring word
    min_count=1,     # minimal number of occurrences for a word to be kept in the vocabulary
    workers=4        # number of worker threads used in parallel
)

# clean training data (gensim 3.x API; this method was removed in gensim 4.0)
model_d2v.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

# save and load
model_d2v.save("d2v.model")
model_d2v = Doc2Vec.load("d2v.model")



In [0]:
# organizing labels: music keeps the default 0, fuel = 1, brain injury = 2, economy = 3
all_labels = np.zeros((len(all_data)))
all_labels[:len(fuel_data)] = 1
all_labels[len(fuel_data):len(fuel_data) + len(brain_inj_data)] = 2
all_labels[len(fuel_data) + len(brain_inj_data): len(fuel_data) + len(brain_inj_data) + len(economy_data)] = 3     

# transforming data to vectors
all_data_vecs = np.array(list(model_d2v.infer_vector(sent) for sent in all_data))
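
The slice arithmetic for the labels can also be written with `np.repeat`. This is only a sketch: it uses the document counts advertised in the file names (845, 10000, 2073, 2084) as stand-ins for `len(fuel_data)` and friends, on the assumption that each file holds exactly that many lines. Their sum, 15002, agrees with the assertion on `all_data`.

```python
import numpy as np

# Sizes in concatenation order: fuel, brain injury, economy, music
# (taken from the file names; an assumption, not measured list lengths)
sizes = [845, 10000, 2073, 2084]
# Label ids in the same order; music is 0, matching the zero-initialised array
label_ids = [1, 2, 3, 0]

# Repeat each label id once per document of that topic
labels = np.repeat(label_ids, sizes).astype(float)

print(len(labels))
print(np.bincount(labels.astype(int)))
```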

In [0]:
from sklearn.model_selection import train_test_split

# all_data_vecs was inferred in the previous cell; split it with stratification on the labels
X_train, X_test, y_train, y_test = train_test_split(all_data_vecs, all_labels, test_size=0.33, 
                                                    random_state=0, stratify=all_labels)

In [15]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# SVM

clf = LinearSVC(random_state=0, tol=1e-5)
clf.fit(X_train, y_train)
target_names = ["music", "fuel", "brain", "economy"] # (0,1,2,3)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names = target_names))

              precision    recall  f1-score   support

       music       0.79      0.89      0.83       688
        fuel       0.67      0.36      0.47       279
       brain       1.00      1.00      1.00      3300
     economy       0.76      0.80      0.78       684

    accuracy                           0.92      4951
   macro avg       0.80      0.76      0.77      4951
weighted avg       0.92      0.92      0.92      4951
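
The gap between the macro and weighted averages comes from class imbalance: the brain injury class dominates the test set and is classified almost perfectly, while fuel is small and hard. As a rough check, the sketch below recomputes both precision averages from the rounded per-class values in the report, so the results can differ from the report in the last digit.

```python
# Per-class precision and support copied from the report above
precision = {"music": 0.79, "fuel": 0.67, "brain": 1.00, "economy": 0.76}
support = {"music": 688, "fuel": 279, "brain": 3300, "economy": 684}

total = sum(support.values())

# Macro average: every class counts equally
macro = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support
weighted = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg precision:    {macro:.3f}")
print(f"weighted avg precision: {weighted:.3f}")
```

Both values land close to the 0.80 and 0.92 shown in the report, which suggests that the weighted average and the accuracy mostly reflect the brain injury class.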



