## Script 
This script reads a input file name "input.dat" and returns top 5 matching data from the knowledge graph. 

## The knowledge graph 
It is hosted in the graph database neo4j. With py2neo, a package which helps to connect neo4j with python, we connect to the database. We run the query to return the most matching words from the list of query words named "input.dat". Before running this script, you must have loaded the knowledge graph(db.csv) in the neo4j graph and should be running. The instructions for loading the knowledge graph is mentioned in the README.md file in github. 

## Format for input.dat
 The file must contain query words seperated by commas. And the query words must be contained in the graph. For eg:
 fawn,pet

In [46]:
import time
import pandas as pd
from py2neo import Graph, Node
graph = Graph(password = "rosebay")

# Single case Match

In [179]:
# Read input from the dat file
words = [word.split(',') for word in open("input.dat","r").readlines()]
words[0]

['captain', 'chair']

In [176]:
# Todo : Return mismatch message if no word matches in the knowledge graph
# open the result file
result = open("results.csv","w")
d = {}
for each in words[0]:
    
    print("--------------------------------------------")
    print(each)
    result.write("\n----------------\n")
    result.write(each)
    result.write("\n----------------\n")
    print("--------------------------------------------")
    query1 = '''
MATCH (n:Word)-[r]->(n2:Word) where n.name= '%s' RETURN n2.name as words,r.weight as %s order by %s asc
    '''%(each,each,each)
    data = graph.run(query1).data()
    d[each]= pd.DataFrame(data)

--------------------------------------------
captain
--------------------------------------------
--------------------------------------------
chair
--------------------------------------------


### Neighbor based recursive greedy algorithm

In [49]:
total=pd.concat(d.values(),axis=0)
total.set_index('words')
# remove rows containing words in the query words

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """Entry point for launching an IPython kernel.


Unnamed: 0_level_0,captain,chair
words,Unnamed: 1_level_1,Unnamed: 2_level_1
sit,,5.0
laugh,,5.0
sofa,,6.0
table,,6.0
fox,,6.0
take,,6.0
bob,,6.0
empty,,6.0
lift,,7.0
pilot,,7.0


In [50]:
merged_total=total.groupby(by=['words']).agg('sum')
merged_total.replace(0,1000,inplace=True)

In [51]:
merged_total['total']=merged_total.sum(axis=1)

In [52]:
merged_total.sort_values('total',inplace=True)

#### Expanding the hops
Steps:
    1. list of top close nodes
    2. set maximum threshold
    3. expand each neighbour and select that whose sum is less than that of 

In [185]:
global_list =merged_total[:5]
global_list

Unnamed: 0_level_0,captain,chair,total
words,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
pilot,13.0,7.0,20.0
angry,3.0,1000.0,1003.0
smile,4.0,1000.0,1004.0
shake,4.0,1000.0,1004.0
ring,4.0,1000.0,1004.0


In [186]:
# get maximum value
max_distance = top_list['total'].iloc[4]

In [189]:
# For each node having value less than max
#find the neighbours of that node

for row_index,row in top_list.iterrows():
    d2 = {}
    #print(row_index)
    for each in words[0]:
        query='''// neighbor nodes and total distance to a query word
        match(n:Word)-[r]->(n1:Word)-[r2]->(n2:Word) 
        where n1.name='%s' AND n.name='%s'
        return n2.name as words,sum(r.weight+r2.weight) as %s
        order by words,%s'''%(row_index,each,each,each)
        data2 = graph.run(query).data()
        d2[each]=pd.DataFrame(data2)
    aggregate=pd.concat(d2.values(),axis=0)
    
    # Remove rows with duplicate values and having distance greater than max
    aggregate= aggregate.drop(['father'])
    
    #check if columns are missing
    A = set(aggregate)
    B = set(words[0])
    missing=list(B-A)
    for each in missing:
        aggregate[each]= 1000
    aggregate_total=aggregate.groupby(by=['words']).agg('sum')
    aggregate_total.replace(0,1000,inplace=True)
    aggregate_total['total']=aggregate_total.sum(axis=1)
    aggregate_total.sort_values('total',inplace=True)
    print(aggregate_total)
    global_list=global_list.append(aggregate_total)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  from ipykernel import kernelapp as app


KeyError: "['father'] not found in axis"

In [173]:
global_list.sort_values('total',inplace=True)
global_list

Unnamed: 0_level_0,captain,chair,total
words,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
pilot,13.0,7.0,20.0
father,25.0,13.0,38.0
plant,25.0,13.0,38.0
wing,27.0,14.0,41.0
whale,27.0,14.0,41.0
helicopter,27.0,14.0,41.0
down,27.0,14.0,41.0
fly,29.0,15.0,44.0
plane,31.0,16.0,47.0
train,48.0,25.0,73.0


In [127]:
A = set(aggregate_total)
B = set(words[0])

In [139]:
missing=list(B-A)

In [140]:
for each in missing:
    aggregate_total[each]= 1000

In [141]:
aggregate_total

Unnamed: 0_level_0,captain,total,chair
words,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
pop,9,9,1000
round,9,9,1000
beautiful,10,10,1000
box,10,10,1000
diamond,10,10,1000
kiss,10,10,1000
loud,10,10,1000
seal,10,10,1000
bell,11,11,1000
break,11,11,1000


In [199]:
aggregate[aggregate.words != ['captain']]

ValueError: Arrays were different lengths: 18 vs 1

## Minimum Spanning Tree

In [2]:
# Read input from the dat file
words = [word.split(',') for word in open("input.dat","r").readlines()]

In [3]:
words[0]

['fawn', 'pet']

In [13]:
query2 = '''Match (n:Word)-[r]->(n2:Word) 
where n.name in ['sofa','fawn']
return n,r.weight,n2'''%words[0]

In [14]:
query2

"Match (n:Word)-[r]->(n2:Word) \nwhere n.name in ['sofa','fawn']\nreturn n,r.weight,n2"

In [15]:
data = graph.run(query2).data()