## Script 
This script reads an input file named "input.dat" and returns the top 5 matches from the knowledge graph.

## The knowledge graph 
The knowledge graph is hosted in the graph database Neo4j. We connect to the database with py2neo, a package that connects Neo4j to Python, and run a query that returns the closest matching words for the query words listed in "input.dat". Before running this script, you must have loaded the knowledge graph (db.csv) into Neo4j, and the database must be running. Instructions for loading the knowledge graph are in the README.md file on GitHub.

## Format for input.dat
 The file must contain query words separated by commas, and each query word must exist in the graph. For example:
 fawn,pet
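
A minimal sketch of how a file in this format could be parsed and cleaned, assuming one comma-separated query per line. The helper name `parse_query_lines` is hypothetical and is not part of the script.

```python
def parse_query_lines(lines):
    """Split each non-empty line on commas and trim whitespace around each word."""
    queries = []
    for line in lines:
        # Drop empty fragments such as the one left by a trailing comma
        words = [w.strip() for w in line.split(",") if w.strip()]
        if words:
            queries.append(words)
    return queries


# Example with the line shown above
print(parse_query_lines(["fawn,pet\n"]))  # [['fawn', 'pet']]
```

Trimming each word matters because a trailing newline on the last word would otherwise prevent it from matching a node name in the graph.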

In [2]:
import time
import pandas as pd
from py2neo import Graph, Node
graph = Graph(password = "rosebay")

## Single case match

In [3]:
# Read input from the dat file; strip newlines so the last word on a line still matches
with open("input.dat","r") as f:
    words = [line.strip().split(',') for line in f if line.strip()]

In [4]:
# TODO: Return a mismatch message if no word matches in the knowledge graph
d = {}
# Pass the word as a query parameter rather than formatting it into the Cypher text
query1 = '''
MATCH (n:Word)-[r]->(n2:Word) WHERE n.name = $name
RETURN n2.name AS words, r.weight AS weight ORDER BY weight ASC
'''
# Open the result file; the with block closes it when the loop ends
with open("results.csv","w") as result:
    for each in words[0]:
        print("--------------------------------------------")
        print(each)
        result.write("\n----------------\n")
        result.write(each)
        result.write("\n----------------\n")
        print("--------------------------------------------")
        data = graph.run(query1, name=each).data()
        # Name the weight column after the query word so the frames can be combined later
        d[each] = pd.DataFrame(data).rename(columns={"weight": each})

--------------------------------------------
captain
--------------------------------------------
--------------------------------------------
chair
--------------------------------------------
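
The TODO about reporting words that have no match could be handled roughly as follows. This is a sketch, not the script's actual behaviour: `fetch_neighbours` is a hypothetical helper, it assumes a py2neo-style `graph.run(query, **params).data()` interface and a Neo4j version that accepts `$name` parameters, and `FakeGraph` only stands in for a real database so the example runs on its own.

```python
QUERY = """
MATCH (n:Word)-[r]->(n2:Word)
WHERE n.name = $name
RETURN n2.name AS words, r.weight AS weight
ORDER BY weight ASC
"""


def fetch_neighbours(graph, word):
    """Return neighbour rows for `word`, or None (with a message) if nothing matches."""
    rows = graph.run(QUERY, name=word).data()
    if not rows:
        print("No match for '%s' in the knowledge graph" % word)
        return None
    return rows


class FakeCursor:
    """Mimics the object returned by run(); only data() is provided."""
    def __init__(self, rows):
        self._rows = rows

    def data(self):
        return self._rows


class FakeGraph:
    """Stand-in for py2neo.Graph (assumption: only run(...).data() is used)."""
    def __init__(self, edges):
        self._edges = edges  # {word: [(neighbour, weight), ...]}

    def run(self, query, **params):
        pairs = sorted(self._edges.get(params["name"], []), key=lambda p: p[1])
        return FakeCursor([{"words": w, "weight": wt} for w, wt in pairs])


fake_graph = FakeGraph({"captain": [("ring", 4.0), ("angry", 3.0)]})
print(fetch_neighbours(fake_graph, "captain"))
print(fetch_neighbours(fake_graph, "unicorn"))
```

With a real database, the same helper would be called with the `Graph` object created at the top of the notebook.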


### Neighbor based recursive greedy algorithm

In [131]:
total=pd.concat(d.values(),axis=0)
total.set_index('words')
# remove rows containing words in the query words



| words | captain | chair |
|---|---|---|
| angry | 3.0 | NaN |
| ring | 4.0 | NaN |
| whale | 4.0 | NaN |
| shake | 4.0 | NaN |
| smile | 4.0 | NaN |
| man | 4.0 | NaN |
| nod | 5.0 | NaN |
| shout | 5.0 | NaN |
| sea | 6.0 | NaN |
| cricket | 6.0 | NaN |


In [148]:
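# Sum the weights of each candidate word across all query words
merged_total=total.groupby(by=['words']).agg('sum')
# A sum of 0 means the candidate is not a neighbour of that query word; charge a large distance instead
merged_total.replace(0,1000,inplace=True)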

In [149]:
merged_total['total']=merged_total.sum(axis=1)

In [160]:
merged_total.sort_values('total',inplace=True)
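
To make the ranking concrete, here is a library-free sketch of the same idea: sum each candidate's weight over all query words, charge a penalty of 1000 where a candidate is not a neighbour of a query word, and sort ascending. The function name `rank_candidates` is hypothetical; the weights for `pilot`, `angry`, and `ring` are taken from the outputs shown in this notebook.

```python
PENALTY = 1000  # distance charged when a candidate is not linked to a query word


def rank_candidates(neighbours, penalty=PENALTY):
    """neighbours maps query word -> {candidate: weight}.

    Returns [(candidate, total)] sorted by ascending total distance.
    """
    candidates = set()
    for weights in neighbours.values():
        candidates.update(weights)
    totals = {
        c: sum(weights.get(c, penalty) for weights in neighbours.values())
        for c in candidates
    }
    # Sort by total distance, then by name so ties are deterministic
    return sorted(totals.items(), key=lambda item: (item[1], item[0]))


neighbours = {
    "captain": {"angry": 3.0, "ring": 4.0, "pilot": 13.0},
    "chair": {"pilot": 7.0},
}
print(rank_candidates(neighbours)[:5])
# [('pilot', 20.0), ('angry', 1003.0), ('ring', 1004.0)]
```

A word linked to every query word, such as `pilot`, ranks ahead of words that are very close to only one query word. Note that the tie-break by name is a choice made for this sketch; pandas may order tied rows differently.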

#### Expanding the hops
Steps:

1. List the top closest nodes.
2. Set the maximum threshold.
3. Expand each neighbour and select those whose sum is less than the maximum threshold.
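
One possible reading of these steps, sketched without the database. The notebook does not show the finished expansion, so everything here is an assumption: the helper name `expand_once`, the rule that a neighbour's distance is its parent's total plus the edge weight, and the toy edges `plane` and `shout`.

```python
def expand_once(ranked, edges, k=5):
    """Run one round of hop expansion.

    ranked: [(word, total)] sorted ascending; edges: {word: {neighbour: weight}}.
    Returns the k best (word, total) pairs after expanding the current top k.
    """
    top = ranked[:k]             # step 1: the closest nodes so far
    threshold = top[-1][1]       # step 2: worst total among them
    best = dict(top)
    for word, total in top:
        if total >= threshold:
            continue             # only nodes strictly below the threshold are expanded
        for neighbour, weight in edges.get(word, {}).items():
            candidate = total + weight  # cost of reaching the neighbour through `word`
            # step 3: keep the neighbour only if it beats the threshold and any known total
            if candidate < threshold and candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
    return sorted(best.items(), key=lambda item: (item[1], item[0]))[:k]


ranked = [("pilot", 20.0), ("angry", 1003.0), ("ring", 1004.0)]
edges = {"pilot": {"plane": 2.0}, "angry": {"shout": 5.0}}  # made-up edges
print(expand_once(ranked, edges, k=3))
# [('pilot', 20.0), ('plane', 22.0), ('angry', 1003.0)]
```

Calling the function again on its own output would give the recursive, greedy behaviour the section title suggests, stopping once a round adds no new node.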

In [161]:
top_list =merged_total.head(5)

In [162]:
top_list

| words | captain | chair | total |
|---|---|---|---|
| pilot | 13.0 | 7.0 | 20.0 |
| angry | 3.0 | 1000.0 | 1003.0 |
| smile | 4.0 | 1000.0 | 1004.0 |
| shake | 4.0 | 1000.0 | 1004.0 |
| ring | 4.0 | 1000.0 | 1004.0 |


In [177]:
# get maximum value
max_distance = top_list['total'].iloc[4]

In [178]:
# For each node having value less than max
#find the neighbours of that node 
for i in range(4):
    print(top_list.iloc[i])

captain    13.0
chair       7.0
total      20.0
Name: pilot, dtype: float64
captain       3.0
chair      1000.0
total      1003.0
Name: angry, dtype: float64
captain       4.0
chair      1000.0
total      1004.0
Name: smile, dtype: float64
captain       4.0
chair      1000.0
total      1004.0
Name: shake, dtype: float64


## Minimum Spanning Tree

In [2]:
# Read input from the dat file; strip newlines so the last word on a line still matches
with open("input.dat","r") as f:
    words = [line.strip().split(',') for line in f if line.strip()]

In [3]:
words[0]

['fawn', 'pet']

In [13]:
# Restrict the start nodes to the query words read from input.dat
query2 = '''Match (n:Word)-[r]->(n2:Word) 
where n.name in %s
return n,r.weight,n2''' % (words[0],)

In [14]:
query2

"Match (n:Word)-[r]->(n2:Word) \nwhere n.name in ['fawn', 'pet']\nreturn n,r.weight,n2"

In [15]:
data = graph.run(query2).data()
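
The notebook stops before the spanning tree itself is computed. As a hedged illustration of what that step could look like, the sketch below applies Kruskal's algorithm with a union-find structure to `(source, weight, target)` triples. Both the choice of Kruskal and the helper name `kruskal` are assumptions, the edges are treated as undirected, and the toy weights are invented rather than taken from the knowledge graph.

```python
def kruskal(edges):
    """edges: iterable of (u, weight, v). Returns (tree_edges, total_weight).

    If the input is disconnected, the result is a minimum spanning forest.
    """
    parent = {}

    def find(x):
        # Find the representative of x's component, shortening the path as we go
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree, total = [], 0.0
    for u, w, v in sorted(edges, key=lambda e: e[1]):
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            tree.append((u, w, v))
            total += w
    return tree, total


# Made-up edges around the query words 'fawn' and 'pet'
edges = [
    ("fawn", 2.0, "deer"),
    ("pet", 3.0, "deer"),
    ("fawn", 6.0, "pet"),
    ("pet", 1.0, "dog"),
]
tree, total = kruskal(edges)
print(tree, total)
```

To feed it the query result, each row of `data` would first need to be reduced to a triple of the two node names and `r.weight`.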