# Deep Code Curator

## Prerequisites

See our [README](./) for the requirements file and other prerequisites.

## Import Modules

In [1]:
import sys
import os

sys.path.append(os.path.abspath('../../src'))

import text2graph
import diagram2graph
import code2graph
import pytesseract
import IPython

from visualize import get_vis
from rdflib import Graph, URIRef

%load_ext autoreload
%autoreload 2

print("Necessary modules have been successfully imported!")

Necessary modules have been successfully imported!


## Specify Input/Output Folders and Required Dependencies
<a id="input"></a>

In [2]:
# --------- INPUT ---------
# Path to the folder which contains the input pdf file(s)
### We included two sample papers in the demo_input folder
inputFolder = 'demo_input_Copy'

# Path to the folder that contains the code repositories of the papers in the inputFolder
### For the sample papers we provided, you can download the project code
### as a zip from the links below
### https://github.com/ShichenLiu/CondenseNet and extract to the folder demo_code
### https://github.com/mikacuy/pointnetvlad and extract to the folder demo_code
codeFolder = "demo_code_Copy"

# CSV file that maps the pdf file name in the inputFolder to the code repository name in the codeFolder
# A sample CSV file is provided for the sample papers above
inputCSV = 'input_Copy.csv'

# --------- OUTPUT ---------
# Path to the folder where the output from all three modalities will be placed
outputFolder = 'demo_output_Copy'

# --------- DEPENDENCIES ---------
# The dependencies that you downloaded by following the instructions in the README
ontology_file = "DeepSciKG.nt"
text2graph_models_dir = "text2graph_models"
image2graph_models_dir = "image2graph_models"
grobid_client = "grobid-client-python"

# On Linux, comment out the line below. On Windows, update the path to your Tesseract installation.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
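
Before running the pipelines, it can help to confirm that the configured folders and files exist. The following is a minimal sketch, not part of the project: `find_missing_paths` is a hypothetical helper, and the dictionary simply mirrors the values from the configuration cell above.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the labels of configured paths that do not exist on disk."""
    return [label for label, p in paths.items() if not Path(p).exists()]


# Values mirror the configuration cell above; adjust them to your setup.
config_paths = {
    "inputFolder": "demo_input_Copy",
    "codeFolder": "demo_code_Copy",
    "inputCSV": "input_Copy.csv",
    "ontology_file": "DeepSciKG.nt",
    "text2graph_models_dir": "text2graph_models",
    "image2graph_models_dir": "image2graph_models",
    "grobid_client": "grobid-client-python",
}

missing = find_missing_paths(config_paths)
if missing:
    print("Missing paths:", ", ".join(missing))
else:
    print("All configured paths exist.")
```

The output folder is left out of the check on purpose, since it may not exist before the first run.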

## Run Text2Graph!
Reminder: Make sure that the Grobid server is running! (cd into the grobid-0.5.5 folder and run `gradle run`)
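
One way to verify that the server is reachable is to query its health endpoint. This is a minimal sketch under stated assumptions: it assumes Grobid listens on its default address `http://localhost:8070` and exposes `/api/isalive`, and `grobid_is_alive` is a hypothetical helper that is not part of this project.

```python
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if a Grobid server answers on its isalive endpoint."""
    try:
        with urllib.request.urlopen(base_url + "/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL
        return False


print("Grobid reachable:", grobid_is_alive())
```

If your server runs on a different host or port, pass that address as `base_url`.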

In [3]:
from text2graph import t2graph

# The two lines below clear the proxy environment variables
# If your network requires a proxy, set them to your proxy address instead
os.environ['http_proxy'] = ""
os.environ['https_proxy'] = ""

t2graph.run(inputFolder, outputFolder, ontology_file, text2graph_models_dir, grobid_client)

[Info] Extracting XML from PDF's...
[Info] Extracting abstracts from XML's...
[Info] Extracting entities/relationships and generating RDF's...
Completed processing file CondenseNet.txt
Saving rdf file demo_output_Copy/text2graph/CondenseNet_text2graph.ttl
Completed processing file input_paper.txt
Saving rdf file demo_output_Copy/text2graph/input_paper_text2graph.ttl
Completed processing file Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper.txt
Saving rdf file demo_output_Copy/text2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper_text2graph.ttl
[Info] Completed text2graph pipeline!


If you have seen the message "[Info] Completed text2graph pipeline!", you can now explore the output of the text2graph module in the outputFolder you specified. For each paper, the file `<paper name>_text2graph.ttl` in the `text2graph` subfolder contains the output graph.

Below, we visualize the results as a graph:

In [4]:
g = Graph()
g.parse(outputFolder + "/text2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper_text2graph.ttl", format="ttl")
g_vis = get_vis(g, "Text")
g_vis.show("text2graph.html")

# g_image = Graph()
# g_image.parse(outputFolder + "/image2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper/image2graph.ttl", format="ttl")
# g_image_vis = get_vis(g_image, "Image")
# g_image_vis.show("image2graph.html")

## Run Image2Graph!

In [5]:
from diagram2graph.FigAnalysis.ShapeExtraction import i2graph

i2graph.run(inputFolder, outputFolder, ontology_file, image2graph_models_dir)

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


[INFO] Loading trained models ...
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Loaded binary classifier model from disk
Loaded multiclass classifier model from disk
[INFO] Loading and analyzing images ...
[INFO] Creating RDF graph ...
Processing paper: CondenseNet.pdf
demo_output_Copy/image2graph/CondenseNet/diag2graph
[]
Processing paper: input_paper.pdf
demo_output_Copy/image2graph/input_paper/diag2graph
['demo_output_Copy/image2graph/input_paper/diag2graph\\input_paper-Figure2-1.txt', 'demo_output_Copy/image2graph/input_paper/diag2graph\\input_paper-Figure4-1.txt']
Processing paper: Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper.pdf
demo_output_Copy/image2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper/diag2graph
['demo_output_Copy/image2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper/diag2graph\\Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper-Figure

If you have seen the message "[Info] Completed image2graph pipeline!", you can now explore the output of the image2graph module in the outputFolder you specified. For each paper, the file `image2graph.ttl` in that paper's subfolder of `image2graph` contains the output graph.

Below, we visualize the results as a graph:

In [6]:
g = Graph()
g.parse(outputFolder + "/image2graph/Uy_PointNetVLAD_Deep_Point_CVPR_2018_paper/image2graph.ttl", format="ttl")
g_vis = get_vis(g, "Image")
g_vis.show("image2graph.html")
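
The log above prints an empty list for papers from which no diagram text was extracted. The following minimal sketch lists the `.txt` files per paper so such cases are easy to spot. It assumes the `<outputFolder>/image2graph/<paper>/diag2graph` layout visible in the log, and `diagram_files_per_paper` is a hypothetical helper.

```python
from pathlib import Path


def diagram_files_per_paper(output_folder):
    """Map each paper folder under image2graph to its diag2graph .txt file names."""
    root = Path(output_folder) / "image2graph"
    result = {}
    if not root.is_dir():
        return result
    for paper_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        diag_dir = paper_dir / "diag2graph"
        files = sorted(f.name for f in diag_dir.glob("*.txt")) if diag_dir.is_dir() else []
        result[paper_dir.name] = files
    return result


# Folder name taken from the configuration cell; adjust it to your setup.
for paper, files in diagram_files_per_paper("demo_output_Copy").items():
    print(paper, "->", len(files), "diagram file(s)")
```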

## Run Code2Graph!

In [7]:
from code2graph import c2graph

# Uncomment the two lines below if your network uses a proxy
# Replace the IP address and the port number with your proxy information
# os.environ['http_proxy'] = "194.138.0.9:9400"
# os.environ['https_proxy'] = "194.138.0.9:9400"

c2graph.run(codeFolder, outputFolder, ontology_file, inputCSV)

python script_lightweight.py -ip demo_code_Copy/condensenet-tensorflow-master -opt 3 -ont DeepSciKG.nt -pid CondenseNet.pdf --arg --url
Listing 'C:\\DCC\\DCC\\demo\\run_all_modalities\\demo_code_Copy\\condensenet-tensorflow-master'...
Compiling 'C:\\DCC\\DCC\\demo\\run_all_modalities\\demo_code_Copy\\condensenet-tensorflow-master\\cifar10.py'...
Listing 'C:\\DCC\\DCC\\demo\\run_all_modalities\\demo_code_Copy\\condensenet-tensorflow-master\\data'...
Compiling 'C:\\DCC\\DCC\\demo\\run_all_modalities\\demo_code_Copy\\condensenet-tensorflow-master\\data\\generate_cifar10_tfrecords.py'...
Compiling 'C:\\DCC\\DCC\\demo\\run_all_modalities\\demo_code_Copy\\condensenet-tensorflow-master\\experiment.py'...
Compiling 'C:\\DCC\\DCC\\demo\\run_all_modalities\\demo_code_Copy\\condensenet-tensorflow-master\\main.py'...
Compiling 'C:\\DCC\\DCC\\demo\\run_all_modalities\\demo_code_Copy\\condensenet-tensorflow-master\\model.py'...
Reading cached tf types file
C:\DCC\DCC\demo\run_all_modalities\demo_cod

If you have seen the message "[Info] Completed code2graph pipeline for all repositories!", you can now explore the output of the code2graph module in the outputFolder you specified. The file `code2graph.ttl` contains the output graph for each paper!

A graph visualization is created for each input Python file in the repository. Change the file name in the cell below to display a different visualization.

In [8]:
# Visualize code2graph results
# Change the file name below to display the graph for another Python file
vis_file_to_display = outputFolder + "/code2graph/pointnetvlad-master/pointnetvlad_clsquadruplet_loss.html"
# Quote the attribute values so that paths containing spaces still work
iframe = '<iframe src="' + vis_file_to_display + '" width="100%" height="350"></iframe>'
IPython.display.HTML(iframe)
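
To see which visualizations are available, the generated HTML files can be listed first. This is a minimal sketch, assuming the `<outputFolder>/code2graph/<repository>/<file>.html` layout used in the cell above. `list_visualizations` is a hypothetical helper.

```python
from pathlib import Path


def list_visualizations(output_folder):
    """Return the paths of all HTML visualizations under the code2graph folder."""
    root = Path(output_folder) / "code2graph"
    if not root.is_dir():
        return []
    return sorted(p.as_posix() for p in root.rglob("*.html"))


# Folder name taken from the configuration cell; adjust it to your setup.
for html_file in list_visualizations("demo_output_Copy"):
    print(html_file)
```

Any printed path can be assigned to `vis_file_to_display` to show that graph.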

