# Distributed_Information_Retrieval (Phase 6)

#### Input:
   Query from the end user.
#### Output:
   List of relevant documents.
#### Algorithm:
   a) Get the cached dataframes and load the models using joblib. <br>
   b) Transform the input query with the TfidfVectorizer.<br>
   c) If the TF-IDF vector is all zeros, the query is outside the domain, so return.<br>
   d) Otherwise, transform the query using SVD and Doc2Vec.<br>
   e) Compute the cosine similarity of the query vector with the Doc2Vec and SVD feature matrices.<br>
   f) Multiply the cosine similarities by the weights to penalise lower-weight results.<br>
   g) Aggregate and sort the results to get the top 10 relevant files.<br>
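
Steps e) and f) above can be sketched on toy data as follows. This is a minimal illustration in plain NumPy rather than the Dask and scikit-learn calls used in the cells below; the helper names `cosine_to_query` and `weighted_scores` and the toy arrays are assumptions made for the example, not part of the notebook.

```python
import numpy as np

def cosine_to_query(doc_matrix, query_vec):
    """Cosine similarity between every row of doc_matrix and one query vector."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float).reshape(-1)
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Avoid division by zero for all-zero rows or an all-zero query.
    norms = np.where(norms == 0, 1.0, norms)
    return doc_matrix @ query_vec / norms

def weighted_scores(svd_docs, svd_query, d2v_docs, d2v_query, weights):
    """Weight each similarity by the per-row weight, then add the two scores."""
    w = np.asarray(weights, dtype=float)
    svd_sim = cosine_to_query(svd_docs, svd_query) * w
    d2v_sim = cosine_to_query(d2v_docs, d2v_query) * w
    return svd_sim + d2v_sim

# Toy data (hypothetical): three rows, two-dimensional embeddings.
svd_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d2v_docs = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
weights = np.array([1.0, 1.0, 0.5])

scores = weighted_scores(svd_docs, query, d2v_docs, query, weights)
print(scores)
```

The third row is halved by its weight of 0.5, which is how lower-weight rows are penalised before aggregation.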

In [1]:
from dask.distributed import get_client
import joblib
from gensim.models import Doc2Vec
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
from sklearn.metrics.pairwise import cosine_similarity
import time
import gc

print("Time: 0")
start = time.time()

Time: 0


In [2]:
client = get_client(address="tcp://127.0.0.1:40605")

In [3]:
client

Client: Scheduler: tcp://127.0.0.1:40605, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 16.69 GB


In [4]:
client.list_datasets()

('doc2vec_feature_matrix', 'files', 'svd_feature_matrix', 'weights')

In [5]:
weights = client.get_dataset("weights")

In [6]:
svd_feature_matrix = client.get_dataset("svd_feature_matrix")

In [7]:
doc2vec_feature_matrix = client.get_dataset("doc2vec_feature_matrix")

In [8]:
files = client.get_dataset("files")

In [9]:
%%time
dv, tf, svd = joblib.load("./model/joblib/joblib/__main__--home-dhiraj-Recommendation-__ipython-input__/cache_models/72a755383fba437e4dead6ff3e3d81e3/output.pkl")

CPU times: user 459 ms, sys: 40.3 ms, total: 499 ms
Wall time: 497 ms


In [10]:
shape = list(files.tail(1).keys())[0]

In [11]:
query = "projects related to deep learning"

In [12]:
tfidf = tf.transform([query]).toarray()

In [13]:
len(tf.get_feature_names())

2246

In [14]:
np.sum(tfidf.reshape(-1) == 0)

2243
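
Here 2243 of the 2246 features are zero, so three query terms matched the vocabulary and the query is in domain. The cell above only counts zeros; a guard for step c) of the algorithm might look like the sketch below, where `is_out_of_domain` is a hypothetical helper name and not something defined in this notebook.

```python
import numpy as np

def is_out_of_domain(tfidf_row):
    """Return True when the TF-IDF vector has no non-zero entry,
    i.e. none of the query terms are in the vectorizer vocabulary."""
    return not np.any(np.asarray(tfidf_row))

# Toy vectors (hypothetical) standing in for tf.transform([query]).toarray()
unknown_query = np.zeros((1, 5))
known_query = np.array([[0.0, 0.3, 0.0, 0.0, 0.7]])

print(is_out_of_domain(unknown_query), is_out_of_domain(known_query))
```

With such a check, the SVD and Doc2Vec steps could be skipped early for queries that share no vocabulary with the corpus.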

In [15]:
latent_matrix = svd.transform(tfidf)

In [16]:
%%time
svd_matrix_similarity = da.from_array(cosine_similarity(svd_feature_matrix, latent_matrix))

CPU times: user 3.36 s, sys: 2.7 s, total: 6.06 s
Wall time: 4.37 s


In [17]:
del latent_matrix
gc.collect()

230

In [18]:
inf = dv.infer_vector(query.split(" "), epochs=100).reshape(1, -1)

In [19]:
%%time
doc2vec_matrix_similarity = da.from_array(cosine_similarity(doc2vec_feature_matrix, inf))

CPU times: user 157 ms, sys: 105 ms, total: 261 ms
Wall time: 215 ms


In [20]:
del inf
gc.collect()

90

In [21]:
svd_matrix_similarity = svd_matrix_similarity.reshape(-1)

In [22]:
weights = weights.to_dask_array(lengths=True)

In [23]:
svd_matrix_similarity = svd_matrix_similarity*weights

In [24]:
doc2vec_matrix_similarity = doc2vec_matrix_similarity.reshape(-1)

In [25]:
doc2vec_matrix_similarity = doc2vec_matrix_similarity*weights

In [26]:
total_weights = svd_matrix_similarity + doc2vec_matrix_similarity

In [27]:
total_weights = dd.from_array(total_weights)

In [28]:
similar_files = dd.concat([files, total_weights], axis=1)

In [29]:
similar_files.columns = ["files", "total"]

In [30]:
agg_files = similar_files.groupby("files").agg({"total": sum})

In [31]:
agg_files["total"].nlargest(10).compute().keys()

Index(['lazynotes_54.pdf', 'AUTOMATED BRAKING TEST FOR_VEHICLE LICENSES.pdf',
       'PATTERN DETECTION AND_RECOGNITION SYSTEM FOR VEHICLES.pdf',
       'DEEP LEARNING IN MEDICAL IMAGE_ANALYSIS.pdf',
       'INTELLIGENT TOLL AUTOMATION SYSTEM_50.pdf',
       'AUTOMATED GLAUCOMA DETECTION.pdf',
       'QUERY BASED CAR MAKE AND MODEL_RECOGNITION SYSTEM USING DEEP_LEARNING.pdf',
       'Robust Speaker Recognition System for  online authentication and real-time verification.pdf',
       'Development of Intelligent automated indoor_navigator and assistance system.pdf',
       'Visual Question Answering Using Deep_Learning_30.pdf'],
      dtype='object', name='files')
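
The group-and-rank step can be illustrated without Dask. The sketch below assumes, as the `groupby("files")` call suggests, that one file may appear in several rows and that its row scores are summed; `top_files` and the toy data are hypothetical names and values used only for this example.

```python
from collections import defaultdict
import heapq

def top_files(file_names, scores, k=10):
    """Sum the scores of rows that share a file name and return the k best files."""
    totals = defaultdict(float)
    for name, score in zip(file_names, scores):
        totals[name] += score
    # heapq.nlargest avoids sorting the full dictionary when k is small.
    return heapq.nlargest(k, totals, key=totals.get)

# Toy data (hypothetical): "a.pdf" appears in two rows.
names = ["a.pdf", "b.pdf", "a.pdf", "c.pdf"]
row_scores = [0.4, 0.7, 0.5, 0.1]

result = top_files(names, row_scores, k=2)
print(result)
```

Summing favours files with many matching rows; averaging per file would be an alternative if long documents should not dominate.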

In [32]:
print(time.time() - start)

7.536633014678955


The algorithm took about 7.54 s to return results.

# End of phase 6