# Deep Code Curator

## Prerequisites

Please check our [readme](./) for the requirements file and other prerequisites.

## Import Modules

In [105]:
import sys
import os

sys.path.append(os.path.abspath('../../src'))

import text2graph
import diagram2graph
import code2graph
import pytesseract
import IPython

from visualize import get_vis
from rdflib import Graph, URIRef

%load_ext autoreload
%autoreload 2

print("Necessary modules have been successfully imported!")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Necessary modules have been successfully imported!


## Specify Input/Output Folders and Required Dependencies
<a id="input"></a>

In [61]:
# --------- INPUT ---------
# Path to the folder which contains the input pdf file(s)
### We included two sample papers in the demo_input folder
inputFolder = 'demo_input'

# Path to the folder which contains the code repositories of the papers in the inputFolder
### For the sample papers we provided, download the project code as a zip
### from the links below and extract each archive into the demo_code folder:
### https://github.com/ShichenLiu/CondenseNet
### https://github.com/mikacuy/pointnetvlad
codeFolder = "demo_code"

# CSV file that maps the pdf file name in the inputFolder to the code repository name in the codeFolder
# A sample CSV file is provided for the sample papers above
inputCSV = 'input.csv'

# --------- OUTPUT ---------
# Path to the folder where the output from all three modalities will be written
outputFolder = 'demo_output'

# --------- DEPENDENCIES ---------
# The dependencies that you downloaded by following the instructions in the README
ontology_file = "DeepSciKG.nt"
text2graph_models_dir = "text2graph_models"
image2graph_models_dir = "image2graph_models"
grobid_client = "grobid-client-python"

# On Linux, comment out the line below; on Windows, update the path to your Tesseract installation
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
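
Before running the modules, it can help to confirm that the configured folders and files actually exist. The sketch below is not part of the DCC code: `find_missing_paths` is a hypothetical helper, and it assumes all paths are relative to the notebook's working directory.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk, in input order."""
    return [p for p in paths if not Path(p).exists()]


# Hypothetical usage with the variables defined in the cell above:
# missing = find_missing_paths([inputFolder, codeFolder, inputCSV, ontology_file,
#                               text2graph_models_dir, image2graph_models_dir,
#                               grobid_client])
# if missing:
#     print("Missing paths:", missing)
```

An empty list means every configured path was found; anything else points to a dependency that still needs to be downloaded or a path that needs to be corrected.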

## Run Text2Graph!
Reminder: Make sure that the Grobid server is running! (cd into the grobid-0.5.5 folder and run `gradle run`)
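
One way to check the server from the notebook is to query Grobid's health endpoint. This is a minimal sketch, assuming the server listens on Grobid's default address `http://localhost:8070` and exposes `/api/isalive`; `is_grobid_alive` is a hypothetical helper, not part of text2graph.

```python
import urllib.error
import urllib.request


def is_grobid_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if a Grobid server answers its health check at `base_url`."""
    try:
        with urllib.request.urlopen(base_url + "/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat the server as not running
        return False
```

If the function returns `False`, start the server as described above before running the next cell. Adjust `base_url` if your Grobid instance uses a different host or port.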

In [66]:
from text2graph import text2rdfgraph

# The two lines below clear any proxy settings for this session.
# If your network requires a proxy, set them to your proxy address instead.
os.environ['http_proxy'] = ""
os.environ['https_proxy'] = ""

text2rdfgraph.run_demo(inputFolder, outputFolder, ontology_file, text2graph_models_dir, grobid_client)

[Info] Extracting XML from PDF's...
[Info] Extracting abstracts from XML's...
[Info] Extracting entities/relationships and generating RDF's...
Done with file input_paper
Saving rdf file demo_output/text2graph/input_paper_text2graph.ttl
[Info] Completed text2graph pipeline!


If you see the message "[Info] Completed text2graph pipeline!", you can explore the output of the text2graph module in the `text2graph` subfolder of the outputFolder you specified. For each paper, a file named `<pdf_name>_text2graph.ttl` contains the output graph!

Below, we visualize the results as a graph:

In [50]:
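g = Graph()
# Change the file name to match a paper processed in the run above;
# output files follow the pattern <pdf_name>_text2graph.ttl
g.parse(outputFolder + "/text2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper_text2graph.ttl", format="ttl")
g_vis = get_vis(g, "Text")
g_vis.show("text2graph.html")

# g_image = Graph()
# g_image.parse(outputFolder + "/image2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper/image2graph.ttl", format="ttl")
# g_image_vis = get_vis(g_image, "Image")
# g_image_vis.show("image2graph.html")

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Backup-Restored\\DCC\\DCC\\demo\\run_all_modalities\\demo_output\\text2graph\\Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper_text2graph.ttl'

The `FileNotFoundError` above means that the file passed to `g.parse` does not exist in the outputFolder. For example, the Text2Graph log above shows output only for `input_paper`. Update the file name to match a paper that was actually processed, then rerun the cell.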
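
To see which papers have Text2Graph output before visualizing one, the output folder can be listed first. This is a minimal sketch, assuming the `<outputFolder>/text2graph/<pdf_name>_text2graph.ttl` layout shown in the Text2Graph log; `list_text2graph_outputs` is a hypothetical helper, not part of the DCC code.

```python
from pathlib import Path


def list_text2graph_outputs(output_folder):
    """Return the sorted paths of all *_text2graph.ttl files in output_folder/text2graph."""
    return sorted(Path(output_folder, "text2graph").glob("*_text2graph.ttl"))


# Hypothetical usage with the outputFolder variable defined earlier:
# for ttl_path in list_text2graph_outputs(outputFolder):
#     print(ttl_path.name)
```

Any path returned by the helper can be passed to `g.parse(str(ttl_path), format="ttl")` in place of the hard-coded file name.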

## Run Image2Graph!

In [106]:
from diagram2graph.FigAnalysis.ShapeExtraction import i2graph

i2graph.run(inputFolder, outputFolder, ontology_file, image2graph_models_dir)

[INFO] Loading trained models ...
Loaded binary classifier model from disk
Loaded multiclass classifier model from disk
[INFO] Loading and analyzing images ...
[INFO] Creating RDF graph ...
Processing paper: input_paper.pdf
demo_output\image2graph\input_paper\image2graph.ttl
[Info] Completed image2graph pipeline!


If you see the message "[Info] Completed image2graph pipeline!", you can explore the output of the image2graph module in the `image2graph` subfolder of the outputFolder you specified. Each paper gets its own subfolder containing an `image2graph.ttl` file with the output graph!

Below, we visualize the results as a graph:

In [6]:
g = Graph()
g.parse(outputFolder + "/image2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper/image2graph.ttl", format="ttl")
g_vis = get_vis(g, "Image")
g_vis.show("image2graph.html")

## Run Code2Graph!

In [9]:
from code2graph import c2graph

# Uncomment below two lines if you have a proxy in your network
# Update the ip address and the port number with your proxy information
# os.environ['http_proxy'] = "194.138.0.9:9400" 
# os.environ['https_proxy'] = "194.138.0.9:9400" 

c2graph.run(codeFolder, outputFolder, ontology_file, inputCSV)

python script_lightweight.py -ip demo_code/condensenet-tensorflow-master -opt 3 -ont DeepSciKG.nt -pid CondenseNet.pdf --arg --url
Listing 'C:\\home\\projects\\DARPA ASKE\\DCC-dev\\demo\\run_all_modalities\\demo_code\\condensenet-tensorflow-master'...
Compiling 'C:\\home\\projects\\DARPA ASKE\\DCC-dev\\demo\\run_all_modalities\\demo_code\\condensenet-tensorflow-master\\cifar10.py'...
Listing 'C:\\home\\projects\\DARPA ASKE\\DCC-dev\\demo\\run_all_modalities\\demo_code\\condensenet-tensorflow-master\\data'...
Compiling 'C:\\home\\projects\\DARPA ASKE\\DCC-dev\\demo\\run_all_modalities\\demo_code\\condensenet-tensorflow-master\\data\\generate_cifar10_tfrecords.py'...
Compiling 'C:\\home\\projects\\DARPA ASKE\\DCC-dev\\demo\\run_all_modalities\\demo_code\\condensenet-tensorflow-master\\experiment.py'...
Compiling 'C:\\home\\projects\\DARPA ASKE\\DCC-dev\\demo\\run_all_modalities\\demo_code\\condensenet-tensorflow-master\\main.py'...
Compiling 'C:\\home\\projects\\DARPA ASKE\\DCC-dev\\demo

If you see the message "[Info] Completed code2graph pipeline for all repositories!", you can explore the output of the code2graph module in the outputFolder you specified. The file `code2graph.ttl` contains the output graph for each paper!

A graph visualization is created for each input Python file in the repository. Change the file name in the cell below to display a different graph visualization.

In [10]:
# Visualize code2graph results
# Change the file name below to display the graph for a different Python file
vis_file_to_display = outputFolder + "/code2graph/pointnetvlad-master/pointnetvlad_clsquadruplet_loss.html"
# Quote the attribute values so that paths containing spaces still load
iframe = '<iframe src="' + vis_file_to_display + '" width="100%" height="350"></iframe>'
IPython.display.HTML(iframe)