<a href="https://colab.research.google.com/github/mneedham/data-science-training/blob/master/03_Recommendations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommendations

In this notebook we're going to learn how to make recommendations using Neo4j. As with the other notebooks, let's start by setting up our environment.

In [None]:
!pip install py2neo pandas matplotlib

And let's import those libraries:

In [1]:
from py2neo import Graph
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)


Update the cell below with the same Sandbox credentials that you used in the first notebook:

In [2]:
# Change the line of code below to use the IP Address, Bolt Port, and Password of your Sandbox.
# graph = Graph("bolt://<IP Address>:<Bolt Port>", auth=("neo4j", "<Password>")) 

# graph = Graph("bolt://18.234.168.45:33679", auth=("neo4j", "daybreak-cosal-rumbles")) 
graph = Graph("bolt://3.17.177.63:7687", auth=("neo4j", "digityser"))
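
If you would rather not paste credentials into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `NEO4J_HOST`, `NEO4J_BOLT_PORT` and `NEO4J_PASSWORD` and the helper `sandbox_settings` are assumptions made up for this example, not something the Sandbox or py2neo defines.

```python
import os

def sandbox_settings(env=os.environ):
    """Build (uri, auth) for Graph() from hypothetical environment variables."""
    host = env.get("NEO4J_HOST", "localhost")
    port = env.get("NEO4J_BOLT_PORT", "7687")
    password = env.get("NEO4J_PASSWORD", "")
    return f"bolt://{host}:{port}", ("neo4j", password)

uri, auth = sandbox_settings()
# graph = Graph(uri, auth=auth)  # same constructor call as in the cell above
```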

##  Finding popular authors

Since we're going to make collaborator suggestions, let's first find the authors who have written the most articles so that we have some data to work with.

In [3]:
popular_authors_query = """
MATCH (author:Author)
RETURN author.name, size((author)<-[:AUTHOR]-()) AS articlesPublished
ORDER BY articlesPublished DESC
LIMIT 10
"""

graph.run(popular_authors_query).to_data_frame()

Unnamed: 0,articlesPublished,author.name
0,89,Peter G. Neumann
1,80,Peter J. Denning
2,72,Moshe Y. Vardi
3,71,Pamela Samuelson
4,65,Bart Preneel
5,56,Vinton G. Cerf
6,53,Barry W. Boehm
7,49,Mark Guzdial
8,47,Edwin R. Hancock
9,46,Josef Kittler


Let's pick one of these authors...

In [4]:
author_name = "Peter G. Neumann"

Let's have a look at the articles they've published and how many citations each one has received:

In [5]:
author_articles_query = """
MATCH (:Author {name: $authorName})<-[:AUTHOR]-(article)
RETURN article.title AS article, article.year AS year, size((article)<-[:CITED]-()) AS citations
ORDER BY citations DESC
LIMIT 20
"""

graph.run(author_articles_query,  {"authorName": author_name}).to_data_frame()

Unnamed: 0,article,citations,year
0,"The foresight saga, redux",2,2012
1,Security by obscurity,2,2003
2,Risks of automation: a cautionary total-system perspective of our cyberfuture,1,2016
3,Crypto policy perspectives,1,1994
4,Risks of National Identity Cards,1,2001
5,"Computers, ethics, and values",1,1991
6,Are dependable systems feasible,1,1993
7,Information system security redux,1,2003
8,The foresight saga,1,2006
9,Robust open-source software,1,1999


Next, let's find the author's collaborators:

In [6]:
collaborations_query = """
MATCH (:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor)
RETURN coauthor.name AS coauthor, count(*) AS collaborations
ORDER BY collaborations DESC
LIMIT 10
"""

graph.run(collaborations_query,  {"authorName": author_name}).to_data_frame()

Unnamed: 0,coauthor,collaborations
0,Lauren Weinstein,3
1,Whitfield Diffie,3
2,Susan Landau,3
3,Steven Michael Bellovin,2
4,Matt Blaze,2
5,Rebecca T. Mercuri,2
6,Alfred Z. Spector,1
7,Seymour E. Goodman,1
8,David Lorge Parnas,1
9,Douglas Miller,1


How would we suggest some future collaborators for this author? One way is by looking at the collaborators of their collaborators!

In [7]:
collaborations_query = """
MATCH (author:Author {name: $authorName})<-[:AUTHOR]-(article)-[:AUTHOR]->(coauthor),
      (coauthor)<-[:AUTHOR]-()-[:AUTHOR]->(coc)
WHERE not((coc)<-[:AUTHOR]-()-[:AUTHOR]->(author)) AND coc <> author      
RETURN coc.name AS coauthor, count(*) AS collaborations
ORDER BY collaborations DESC
LIMIT 10
"""

graph.run(collaborations_query,  {"authorName": author_name}).to_data_frame()

Unnamed: 0,coauthor,collaborations
0,John Ioannidis,10
1,Scott Bradner,9
2,Angelos D. Keromytis,8
3,John Kelsey,7
4,Virgil D. Gligor,5
5,David Wagner,4
6,Gerald Jay Sussman,4
7,David K. Gifford,4
8,Ran Canetti,4
9,Peter Wolcott,4
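
To make the logic of that query easier to follow, here is a rough pure-Python sketch of the same idea on a tiny, made-up dataset. The article ids and author names are hypothetical, and `suggest_collaborators` is a helper invented for this illustration. It counts two-hop co-authorship paths and drops the author and anyone they have already worked with.

```python
from collections import Counter

# Hypothetical toy data: article id -> list of authors.
articles = {
    "a1": ["Ann", "Bob"],
    "a2": ["Ann", "Cat"],
    "a3": ["Bob", "Dan"],
    "a4": ["Bob", "Dan", "Eve"],
    "a5": ["Cat", "Dan"],
    "a6": ["Cat", "Bob"],
}

def coauthor_links(articles):
    """Yield (author, coauthor) once for every article the two share."""
    for authors in articles.values():
        for a in authors:
            for b in authors:
                if a != b:
                    yield a, b

def suggest_collaborators(articles, author, limit=10):
    links = list(coauthor_links(articles))
    # First hop: one entry per shared article, like the first MATCH pattern.
    first_hop = [co for a, co in links if a == author]
    known = set(first_hop)
    counts = Counter()
    for coauthor in first_hop:
        # Second hop: collaborators of this collaborator.
        for a, coc in links:
            if a == coauthor and coc != author and coc not in known:
                counts[coc] += 1
    return counts.most_common(limit)

print(suggest_collaborators(articles, "Ann"))
# → [('Dan', 3), ('Eve', 1)]
```

Dan ranks first because three different paths lead to him: two through Bob and one through Cat.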


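What if an author wanted to search for articles on a topic, with the articles closest to their own work showing up first? One way to do this is to combine a full-text search with PageRank personalized on the author's own articles. First we need a full-text index over article titles and abstracts: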

In [8]:
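# Create a full-text index over article titles and abstracts.
# This only needs to run once; expect an error if an index named 'articles' already exists.
query = """
CALL db.index.fulltext.createNodeIndex('articles', ['Article'], ['title', 'abstract'])
"""
graph.run(query)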

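Before combining the index with PageRank, it can help to query it on its own. The helper below is a minimal sketch that assumes the `articles` index exists and that `graph` is the py2neo connection created earlier. `search_articles` is a name made up for this example.

```python
def search_articles(graph, search_term, limit=5):
    """Return the best full-text matches from the 'articles' index as a DataFrame."""
    search_query = """
    CALL db.index.fulltext.queryNodes('articles', $searchTerm)
    YIELD node, score
    RETURN node.title AS article, score
    ORDER BY score DESC
    LIMIT $limit
    """
    params = {"searchTerm": search_term, "limit": limit}
    return graph.run(search_query, params).to_data_frame()

# Example call, assuming `graph` is connected to the Sandbox:
# search_articles(graph, "open source")
```

The score returned here is the text relevance from the index. It is a different number from the PageRank score used in the next cell.
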
In [9]:
query = """
MATCH (a:Article)-[:AUTHOR]->(author:Author)
WHERE author.name=$authorName
WITH author, collect(a) as articles
CALL algo.pageRank.stream(
  'CALL db.index.fulltext.queryNodes("articles", $searchTerm)
   YIELD node, score
   RETURN id(node) as id',
  'MATCH (a1:Article)-[:CITED]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  {sourceNodes: articles,graph:'cypher', params: {searchTerm: $searchTerm}})
YIELD nodeId, score
WITH author, algo.getNodeById(nodeId) AS n, score
WHERE not(exists((author)<-[:AUTHOR]-(n)))
RETURN n.title AS article, score, [(n)-[:AUTHOR]->(other) | other.name][..5] AS authors
ORDER BY score DESC LIMIT 10
"""

params = {"authorName": "Tao Xie", "searchTerm": "open source"}
graph.run(query, params).to_data_frame()

Unnamed: 0,article,authors,score
0,Static detection of cross-site scripting vulnerabilities,"[Zhendong Su, Gary Wassermann]",0.236
1,Characterizing logging practices in open-source software,"[Ding Yuan, Soyeon Park, Yuanyuan Zhou]",0.128
2,"Automated, contract-based user testing of commercial-off-the-shelf components","[Lionel C. Briand, Yvan Labiche, Michal M. Sówka]",0.128
3,Concern graphs: finding and describing concerns using structural program dependencies,"[Gail C. Murphy, Martin P. Robillard]",0.128
4,Who should fix this bug,"[Lyndon Hiew, John Anvik, Gail C. Murphy]",0.128
5,Conceptual module querying for software reengineering,"[Gail C. Murphy, Elisa L. A. Baniassad]",0.108
6,IBM's pragmatic embrace of open source,[Pamela Samuelson],0.0
7,Open courseware and open source software,"[Stefan Baldi, Anett Mehler-Bicher, Hauke Heier]",0.0
8,Reusing Open-Source Software and Practices: The Impact of Open-Source on Commercial Vendors,"[Alan W. Brown, Grady Booch]",0.0
9,From Research Software to Open Source,[Susan L. Graham],0.0

Running the same search for a different author changes the ranking, because PageRank now starts from that author's articles:


In [10]:
params = {"authorName": "Margus Veanes", "searchTerm": "open source"}
graph.run(query, params).to_data_frame()

Unnamed: 0,article,authors,score
0,"TEG: A High-Performance, Scalable, Multi-network Point-to-Point Communications Methodology","[Ralph H. Castain, Mitchel W. Sukalski, Edgar Gabriel, Graham E. Fagg, Jack Dongarra]",3.319
1,"Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation","[Edgar Gabriel, Vishal Sahay, Timothy S. Woodall, Jack Dongarra, Jeffrey M. Squyres]",2.87
2,Concern graphs: finding and describing concerns using structural program dependencies,"[Gail C. Murphy, Martin P. Robillard]",2.626
3,Conceptual module querying for software reengineering,"[Gail C. Murphy, Elisa L. A. Baniassad]",2.542
4,Who should fix this bug,"[Lyndon Hiew, John Anvik, Gail C. Murphy]",2.342
5,Hipikat: recommending pertinent software development artifacts,"[Gail C. Murphy, Davor Cubranic]",2.205
6,Version Sensitive Editing: Change History as a Programming Tool,[David L. Atkins],2.151
7,DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones,"[Stéphane Glondu, Zhendong Su, Lingxiao Jiang, Ghassan Misherghi]",2.145
8,Recovering documentation-to-source-code traceability links using latent semantic indexing,"[Andrian Marcus, Jonathan I. Maletic]",2.006
9,Coverage is not strongly correlated with test suite effectiveness,"[Laura Inozemtseva, Reid Holmes]",1.928
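
To build some intuition for why the ranking depends on the author, here is a small pure-Python sketch of personalized PageRank on a made-up citation graph. It is not the Neo4j library's implementation, and its scores are not on the same scale as the ones above. The node names and the function `personalized_pagerank` are hypothetical. The point is only that rank flows outwards from the source nodes, so articles that cannot be reached from them end up with a score of zero.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=20):
    """Power-iteration PageRank where every restart jumps back to source_nodes."""
    nodes = {n for edge in edges for n in edge} | set(source_nodes)
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # The restart distribution is uniform over the source nodes only.
    restart = {n: (1.0 / len(source_nodes) if n in source_nodes else 0.0)
               for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * restart[n] for n in nodes}
        for n, targets in out_links.items():
            for t in targets:
                new_rank[t] += damping * rank[n] / len(targets)
        # Rank that reaches a node with no outgoing citations is dropped in this sketch.
        rank = new_rank
    return rank

# Hypothetical citations as (citing, cited) pairs.
edges = [("mine", "x"), ("x", "y"), ("other", "z")]
scores = personalized_pagerank(edges, source_nodes=["mine"])
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Here `x` and `y` receive a score because they are reachable from `mine`, while `z` stays at zero. That is consistent with the 0.0 scores in the first result table, where some articles matched the search text but were presumably not reachable from the author's own articles.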
