# Distributed_Information_Retrieval (Phase 6)

#### Input:
   A query from the end user.
#### Output:
   List of relevant documents.
#### Algorithm:
   a) Fetch the cached dataframes and load the models using joblib. <br>
   b) Transform the input query with the TfidfVectorizer.<br>
   c) If the TF-IDF vector is all zeros, the query is outside the domain, so return.<br>
   d) Otherwise, transform the query using SVD and Doc2Vec.<br>
   e) Predict the query's k-means cluster and compute the cosine similarity of the query vector with the Doc2Vec and SVD features of that cluster.<br>
   f) Multiply the cosine similarities by the weights to penalise lower-weight results.<br>
   g) Aggregate and sort the results to get the top 10 relevant files.<br>
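
Steps (e) to (g) can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the notebook's Dask implementation: the helper names `cosine_to_query` and `top_files` and the toy vectors, weights and file names are all hypothetical.

```python
import numpy as np

def cosine_to_query(matrix, query):
    """Cosine similarity between each row of `matrix` and a 1-D `query` vector."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Rows (or a query) with zero norm get similarity 0 instead of NaN.
    return np.divide(matrix @ query, norms, out=np.zeros(len(matrix)), where=norms > 0)

def top_files(files, weights, svd_docs, svd_query, d2v_docs, d2v_query, k=10):
    """Rank files by the summed, weighted similarity of their rows."""
    weights = np.asarray(weights, dtype=float)
    # Steps (e) and (f): similarity in each space, scaled by the row weight.
    score = weights * (cosine_to_query(svd_docs, svd_query)
                       + cosine_to_query(d2v_docs, d2v_query))
    # Step (g): sum the row scores per file, then sort in descending order.
    totals = {}
    for name, s in zip(files, score):
        totals[name] = totals.get(name, 0.0) + float(s)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy data (hypothetical): three rows belonging to two files.
files = ["a.pdf", "a.pdf", "b.pdf"]
weights = [1.0, 0.5, 1.0]
svd_docs = [[1, 0], [0, 1], [1, 1]]
d2v_docs = [[1, 0], [1, 0], [0, 1]]
ranked = top_files(files, weights, svd_docs, [1, 0], d2v_docs, [1, 0])
```

Here `a.pdf` ranks first because both of its rows contribute to its total, which is the effect the aggregation in step (g) is meant to have.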

In [1]:
from dask.distributed import get_client
import joblib
from gensim.models import Doc2Vec
import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.array as da
from sklearn.metrics.pairwise import cosine_similarity
import time
import gc

print("Time: 0")
start = time.time()

Time: 0


In [2]:
client = get_client(address="tcp://127.0.0.1:37417")

In [3]:
kmeans = client.get_dataset("cluster")

In [4]:
svd_feature_matrix = client.get_dataset("svd_feature_matrix")

In [5]:
doc2vec_feature_matrix = client.get_dataset("doc2vec_feature_matrix")

In [6]:
df = client.get_dataset("df")

In [7]:
with joblib.parallel_backend("dask"):
    dv, tf, svd = joblib.load("./model/joblib/joblib/__main__--home-dhiraj-Recommendation-__ipython-input__/cache_models/72a755383fba437e4dead6ff3e3d81e3/output.pkl")

In [8]:
query = "convolutional neural networks"

In [9]:
tfidf = tf.transform([query]).toarray()

In [10]:
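# Vocabulary size of the fitted vectorizer.
# On scikit-learn >= 1.0, get_feature_names_out() is the replacement for this method.
len(tf.get_feature_names())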

2246

In [11]:
np.sum(tfidf.reshape(-1) == 0)

2243
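
The count above shows that 3 of the 2246 TF-IDF features are non-zero, so this query is inside the domain. The notebook does not spell out the early return from step (c); a minimal sketch of that check could look like the following, where `is_out_of_domain` and the two sample arrays are hypothetical stand-ins for `tf.transform([query]).toarray()`.

```python
import numpy as np

def is_out_of_domain(tfidf_row):
    """Return True when no query term is in the vectorizer's vocabulary."""
    return not np.any(np.asarray(tfidf_row))

# Hypothetical stand-ins for the dense TF-IDF row of a query.
in_domain = np.array([[0.0, 0.6, 0.0, 0.8]])  # some known terms
unknown = np.zeros((1, 4))                     # no known terms

if is_out_of_domain(unknown):
    print("Query is outside the domain; nothing to rank.")
```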

In [12]:
inf = dv.infer_vector(query.split(" "), epochs=10).reshape(1, -1)

In [13]:
cluster = kmeans.predict(inf)

In [14]:
prediction = cluster.compute()[0]

In [15]:
doc2vec_cluster = doc2vec_feature_matrix[doc2vec_feature_matrix["labels"] == prediction][doc2vec_feature_matrix.columns.difference(['labels'])]

In [16]:
svd_cluster = svd_feature_matrix[svd_feature_matrix["labels"] == prediction][svd_feature_matrix.columns.difference(['labels'])]

In [17]:
df = df[df["labels"] == prediction]

In [18]:
latent_matrix = svd.transform(tfidf)

In [19]:
with joblib.parallel_backend("dask"):
    svd_cluster_similarity = dd.from_array(cosine_similarity(svd_cluster, latent_matrix))
    doc2vec_cluster_similarity = dd.from_array(cosine_similarity(doc2vec_cluster, inf))

In [20]:
svd_cluster_similarity.columns = ["bow"]
doc2vec_cluster_similarity.columns = ["doc2vec"]

In [21]:
svd_cluster_similarity["bow"] = svd_cluster_similarity["bow"]*df["weights"]

In [22]:
doc2vec_cluster_similarity["doc2vec"] = doc2vec_cluster_similarity['doc2vec']*df['weights']

In [23]:
total_weights = svd_cluster_similarity["bow"] + doc2vec_cluster_similarity["doc2vec"]
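
One thing worth checking in the multiplications above: in pandas, arithmetic between two Series aligns on index labels rather than on position, and Dask dataframes are built from pandas partitions, so the same caveat may apply here. Whether it affects these cells depends on the index of `df` after the cluster filter, which the notebook does not show. The sketch below uses plain pandas with hypothetical values to show the pitfall and one possible way around it.

```python
import numpy as np
import pandas as pd

# Hypothetical weights for three rows that survived a filter,
# so they keep their original labels 4, 7 and 9.
weights = pd.Series([1.0, 0.5, 2.0], index=[4, 7, 9])

# Similarities built from a plain array get a fresh 0..n-1 index.
similarity = pd.Series(np.array([0.9, 0.2, 0.4]))

# Aligned by label: no labels match, so every entry is NaN.
misaligned = similarity * weights

# Aligned by position after resetting the index of the filtered rows.
aligned = similarity * weights.reset_index(drop=True)
```

If the indices do differ, resetting the index of the filtered rows before multiplying is one option; it assumes the similarity rows are in the same order as the filtered rows.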

In [24]:
similar_files = dd.concat([df["files"], total_weights], axis=1)

In [25]:
similar_files.columns = ["files", "total"]

In [26]:
# Sum the weighted similarities of all rows that belong to the same file.
agg_files = similar_files.groupby("files").agg({"total": "sum"})

In [27]:
agg_files["total"].nlargest(10).compute().keys()

Index(['SCRUTINIZING BEHAVIOUR OF VM COMMUNICATION.pdf',
       'APPLICATION TO DETERMINE THE SAFEST ROUTE USING CRIME JANALYSIS VIA DECISION TREE JALGORITHM_78.pdf',
       'ONLINE REVIEW ANALYSIS.pdf',
       'PATTERN DETECTION AND_RECOGNITION SYSTEM FOR VEHICLES.pdf',
       'Marathi translation using wsd concept_36.pdf',
       'E-Care-Android Application For Health Monitoring_73.pdf',
       'CONVERSION OF VIDEO CONTAINING SIGN_LANGUAGE TO TEXTUAL FORMAT.pdf',
       'INTELLIGENT TOLL AUTOMATION SYSTEM_50.pdf',
       'Object Based Visual Sentiment Analysis.pdf',
       'Robust Speaker Recognition System for  online authentication and real-time verification.pdf'],
      dtype='object', name='files')

In [28]:
print(time.time() - start)

6.52833104133606


The full run, measured from the import cell to the final ranking, took about 6.5 s.
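
The timer above starts before the datasets and models are loaded, so the measured figure mixes setup time with query time. A minimal sketch for timing the two phases separately is shown below; the `timed` helper, the `timings` dictionary and the stand-in workloads are hypothetical and not part of the notebook.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - t0

with timed("load"):
    time.sleep(0.01)   # stand-in for get_dataset and joblib.load
with timed("query"):
    sum(range(1000))   # stand-in for transform, similarity and ranking

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```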

# End of phase 6