## Script 
This script reads an input file named "input.dat" and returns the top 5 matching entries from the knowledge graph.

## The knowledge graph 
The knowledge graph is hosted in the graph database neo4j. We connect to the database with py2neo, a package that connects neo4j to Python, and run a query that returns the words that best match the query words listed in "input.dat". Before running this script, you must load the knowledge graph (db.csv) into neo4j and make sure the database is running. The instructions for loading the knowledge graph are in the README.md file on GitHub.

## Format for input.dat
The file must contain query words separated by commas, and every query word must exist in the graph. For example:
 fawn,pet
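
The format above can be parsed defensively with a small helper. This is only a sketch: `read_query_words` is a hypothetical name that does not appear in the notebook, and it assumes that only the first line of the file matters and that stray whitespace should be ignored.

```python
import os
import tempfile


def read_query_words(path):
    """Return the comma-separated words found on the first line of `path`."""
    with open(path, "r") as handle:
        first_line = handle.readline()
    # Strip surrounding whitespace (including the trailing newline)
    # and drop empty entries such as those left by a trailing comma.
    return [word.strip() for word in first_line.split(",") if word.strip()]


# Demonstration on a throwaway file so a real input.dat is left untouched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "input.dat")
    with open(demo_path, "w") as handle:
        handle.write("fawn, pet\n")
    demo_words = read_query_words(demo_path)

print(demo_words)
```

Stripping each word matters because a trailing newline on the last word would otherwise keep it from matching any node name in the graph.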

In [1]:
import time
import pandas as pd
from py2neo import Graph, Node
graph = Graph(password = "rosebay")

# Single case Match

In [12]:
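# Read the query words from the first line of the dat file
with open("input.dat", "r") as f:
    # strip() removes the trailing newline that would otherwise stay on the last word
    query_words = [word.strip() for word in f.readline().split(',')]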

In [13]:
# Todo : Return mismatch message if no word matches in the knowledge graph
# open the result file
result = open("results.csv","w")
d = {}
for each in query_words:
    
    print("--------------------------------------------")
    print(each)
    result.write("\n----------------\n")
    result.write(each)
    result.write("\n----------------\n")
    print("--------------------------------------------")
    query1 = '''
MATCH (n:Word)-[r]->(n2:Word) where n.name= '%s' RETURN n2.name as words,r.weight as %s order by %s asc
    '''%(each,each,each)
    data = graph.run(query1).data()
    d[each]= pd.DataFrame(data)
# close the result file so the written headers are flushed to disk
result.close()

--------------------------------------------
captain
--------------------------------------------
--------------------------------------------
chair
--------------------------------------------
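
Because the query word is interpolated directly into the Cypher text, both as a string literal and as a column alias, it may be worth validating it first. The sketch below is one possible approach under stated assumptions: `build_neighbor_query` is a hypothetical helper, and the rule that a query word looks like a plain identifier is an assumption, not something the notebook states. The alias is wrapped in backticks so that words which collide with Cypher keywords can still be used as column names.

```python
import re

# Assumed rule: a query word is a plain identifier (letters, digits, underscore).
_SAFE_WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_neighbor_query(word):
    """Build the one-hop neighbour query for `word`, rejecting unsafe input."""
    if not _SAFE_WORD.fullmatch(word):
        raise ValueError("unsupported query word: %r" % word)
    return (
        "MATCH (n:Word)-[r]->(n2:Word) "
        "WHERE n.name = '%s' "
        "RETURN n2.name AS words, r.weight AS `%s` "
        "ORDER BY `%s` ASC" % (word, word, word)
    )


print(build_neighbor_query("chair"))
```

The returned string can be passed to `graph.run(...)` in the same way as `query1`.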


### Neighbor based recursive greedy algorithm

In [14]:
total=pd.concat(d.values(),axis=0)
total.set_index('words')
# remove rows containing words in the query words


| words | captain | chair |
|-------|---------|-------|
| sit   | NaN     | 5.0   |
| laugh | NaN     | 5.0   |
| sofa  | NaN     | 6.0   |
| table | NaN     | 6.0   |
| fox   | NaN     | 6.0   |
| take  | NaN     | 6.0   |
| bob   | NaN     | 6.0   |
| empty | NaN     | 6.0   |
| lift  | NaN     | 7.0   |
| pilot | NaN     | 7.0   |


In [15]:
merged_total=total.groupby(by=['words']).agg('sum')
merged_total.replace(0,1000,inplace=True)

In [16]:
merged_total['total']=merged_total.sum(axis=1)

In [17]:
merged_total.sort_values('total',inplace=True)
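
The scoring in the cells above can be illustrated without pandas or a database. The sketch below is a simplified reading of it: each candidate's weights are summed across the query words, a missing edge counts as the placeholder 1000, and a lower total is treated as a closer match. `rank_candidates` and the toy weights are illustrative only; some weights are borrowed from the outputs shown in this notebook.

```python
MISSING_PENALTY = 1000  # same placeholder value the notebook substitutes for 0


def rank_candidates(neighbors_by_query):
    """Rank candidates by summed weight.

    neighbors_by_query maps each query word to {candidate: weight}.
    Returns (candidate, total) pairs, closest first.
    """
    candidates = set()
    for neighbors in neighbors_by_query.values():
        candidates.update(neighbors)
    totals = {}
    for candidate in candidates:
        # A candidate with no edge from a query word is charged the penalty.
        totals[candidate] = sum(
            neighbors.get(candidate, MISSING_PENALTY)
            for neighbors in neighbors_by_query.values()
        )
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


toy = {
    "captain": {"pilot": 13, "angry": 3, "smile": 4},
    "chair": {"pilot": 7, "sit": 5},
}
ranking = rank_candidates(toy)
for candidate, total in ranking:
    print(candidate, total)
```

Only `pilot` is connected to both query words, so it is the only candidate whose total stays below the penalty.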

#### Expanding the hops
Steps:
1. Build the list of the closest nodes.
2. Set the maximum threshold.
3. Expand each neighbour and select those whose summed distance is less than the maximum threshold.
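
The steps can be sketched on a plain in-memory graph. This is one possible reading, not the notebook's exact procedure: it assumes that step 3 keeps a two-hop candidate only when its summed distance is below the threshold from step 2, and that a missing path costs the placeholder 1000. `expand_hops` and the toy graph are hypothetical.

```python
MISSING_PENALTY = 1000  # assumed stand-in for "no path", as in the notebook


def expand_hops(graph, query_words, top_k=5):
    """graph maps a word to {neighbour: weight}. Returns (total, word) pairs."""

    def score(distances_by_query, candidate):
        return sum(d.get(candidate, MISSING_PENALTY) for d in distances_by_query)

    # Step 1: rank the one-hop neighbours of the query words.
    one_hop = [dict(graph.get(q, {})) for q in query_words]
    candidates = set().union(*one_hop) - set(query_words)
    ranked = sorted((score(one_hop, c), c) for c in candidates)[:top_k]
    if not ranked:
        return []

    # Step 2: the worst total in the current list is the threshold.
    max_distance = ranked[-1][0]
    best = {word: total for total, word in ranked}

    # Step 3: expand each listed node and keep closer two-hop candidates.
    for _, via in ranked:
        two_hop = []
        for q in query_words:
            first_leg = graph.get(q, {}).get(via)
            if first_leg is None:
                two_hop.append({})  # no path from q through this node
            else:
                two_hop.append(
                    {n2: first_leg + w for n2, w in graph.get(via, {}).items()}
                )
        for n2 in set().union(*two_hop) - set(query_words):
            total = score(two_hop, n2)
            if total < max_distance and total < best.get(n2, float("inf")):
                best[n2] = total

    return sorted((total, word) for word, total in best.items())[:top_k]


toy_graph = {
    "captain": {"pilot": 2, "angry": 3},
    "chair": {"pilot": 3, "sit": 1},
    "pilot": {"plane": 1},
}
expanded = expand_hops(toy_graph, ["captain", "chair"], top_k=2)
print(expanded)
```

In this toy graph, `plane` is reachable from both query words through `pilot`, so it displaces `sit`, which is connected to only one of them.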

In [18]:
global_list =merged_total[:5]
global_list.head(3)

| words | captain | chair  | total  |
|-------|---------|--------|--------|
| pilot | 13.0    | 7.0    | 20.0   |
| angry | 3.0     | 1000.0 | 1003.0 |
| smile | 4.0     | 1000.0 | 1004.0 |


In [19]:
# get maximum value
max_distance = global_list['total'].iloc[4]

In [None]:
# For each node having value less than max
# find the neighbours of that node
for row_index,row in global_list.iterrows():
    d2 = {}
    #print(row_index)
    # query_words is the list read from input.dat above
    for each in query_words:
        query='''// neighbor nodes and total distance to a query word
        match(n:Word)-[r]->(n1:Word)-[r2]->(n2:Word) 
        where n1.name='%s' AND n.name='%s'
        return n2.name as words,sum(r.weight+r2.weight) as %s
        order by words,%s'''%(row_index,each,each,each)
        data2 = graph.run(query).data()
        d2[each]=pd.DataFrame(data2)
    aggregate=pd.concat(d2.values(),axis=0)
    #TODO: Discard empty dataframes
    if aggregate.shape[0]==0:
        continue
    # Remove rows whose word is one of the query words
    aggregate = aggregate[~aggregate['words'].isin(query_words)].copy()

    # Add a penalty column for any query word with no path through this node
    missing = set(query_words) - set(aggregate)
    for column in missing:
        aggregate[column]= 1000
    print(row_index)
    print(aggregate)
    aggregate_total=aggregate.groupby(by=['words']).agg('sum')
    aggregate_total.replace(0,1000,inplace=True)
    aggregate_total['total']=aggregate_total.sum(axis=1)
    # Keep only candidates closer than the current maximum distance
    aggregate_total=aggregate_total[aggregate_total['total']<max_distance]
    aggregate_total=aggregate_total.sort_values('total')
    #print(aggregate_total)
    # DataFrame.append is deprecated in recent pandas; concat is equivalent here
    global_list=pd.concat([global_list,aggregate_total])

In [None]:
global_list.sort_values('total',inplace=True)
global_list.head(5)

## Minimum Spanning Tree

In [None]:
# Read input from the dat file (one list of words per line)
with open("input.dat","r") as f:
    words = [line.strip().split(',') for line in f]

In [None]:
words[0]

In [None]:
# Substitute the list of query words into the Cypher query
query2 = '''Match (n:Word)-[r]->(n2:Word) 
where n.name in %s
return n,r.weight,n2'''%(words[0],)

In [None]:
query2

In [None]:
data = graph.run(query2).data()
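
The notebook stops at fetching the edges, so the way the spanning tree is built is not shown. As a sketch only, the code below applies Kruskal's algorithm to a list of `(source, weight, target)` tuples, treating the edges as undirected. The function name, the tuple layout, and the toy weights are all assumptions made for illustration.

```python
def minimum_spanning_tree(edges):
    """Kruskal's algorithm over (source, weight, target) tuples.

    Returns the chosen edges. For a disconnected graph the result is a
    spanning forest rather than a single tree.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    # Consider the lightest edges first.
    for source, weight, target in sorted(edges, key=lambda edge: edge[1]):
        root_a, root_b = find(source), find(target)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((source, weight, target))
    return tree


# Illustrative weights only; they are not taken from the knowledge graph.
toy_edges = [
    ("sofa", 6, "chair"),
    ("sofa", 2, "sit"),
    ("sit", 5, "chair"),
    ("fawn", 4, "sit"),
    ("fawn", 9, "sofa"),
]
tree = minimum_spanning_tree(toy_edges)
print(tree)
```

To use this with the rows returned by `query2`, each record would first need to be reduced to a node name, a weight, and a second node name.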

In [37]:
aggregate

In [38]:
aggregate.shape[0]

0