# Importing LDBC data into KatanaGraph using Python Pandas/Dask DataFrames

**This section details how to import LDBC data into KatanaGraph using Python Pandas/Dask DataFrames.**
<br>
This Notebook is part of a series of Notebooks demonstrating use of the KatanaGraph DataFrame importer.
Where the other Notebooks import from database servers, this Notebook imports the LDBC
data set. As LDBC is a data set and not a database, we first read the LDBC CSV files into a
Dask DataFrame using the read_csv() function (here, dd.read_csv()).

Further:

-  As LDBC is already a graph-modeled data set, no data manipulation is required.
-  We load the following vertices and edge: Person (vertex), City (vertex), and IS_LOCATED_IN (edge).
<br>
&nbsp;&nbsp;&nbsp;&nbsp; -- At the smallest scale factor, City is stored in a single CSV file together with other Place data (such as Country), which we filter out.
<br>
&nbsp;&nbsp;&nbsp;&nbsp; -- At the smallest scale factor, Person is spread across several CSV files, so we demonstrate a technique to handle that condition.
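
The two conditions above can be tried out on a small scale before touching the real files. The following is a minimal sketch using plain Pandas and in-memory CSV text; the sample rows, column subsets, and variable names are made up for illustration and are not taken from the LDBC files.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Place file: pipe-delimited, with a header row,
# mixing City rows with other place types.
place_csv = (
    "id|name|url|type\n"
    "1|Berlin|http://example.org/Berlin|City\n"
    "2|Germany|http://example.org/Germany|Country\n"
    "3|Paris|http://example.org/Paris|City\n"
)

# Hypothetical stand-ins for two Person part files that share one header.
person_parts = [
    "id|firstName|lastName\n10|Ada|Lovelace\n11|Alan|Turing\n",
    "id|firstName|lastName\n12|Grace|Hopper\n",
]

# Condition 1: keep only the City rows of the shared Place file.
place_df = pd.read_csv(io.StringIO(place_csv), sep="|", header=0)
city_df = place_df[place_df["type"] == "City"]

# Condition 2: read every part file and stack the results into one DataFrame.
person_df = pd.concat(
    (pd.read_csv(io.StringIO(part), sep="|", header=0) for part in person_parts),
    ignore_index=True,
)

print(len(city_df), len(person_df))
```

With Dask, the second step is not needed by hand, since dd.read_csv() accepts a list of file names and returns a single DataFrame.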


#  Setup; filenames and other control settings

In [None]:
import dask.dataframe as dd
import numpy as np

print("--")    
    

In [None]:
#  The LDBC data set, as it sits on our storage buckets, requires a good amount of string
#  manipulation to process file names. We will avoid that burden (as it's unlikely we'd
#  provide exactly the processing required by any consumer), and just hard code a number
#  of input CSV file names.


NUM_PARTITIONS = 5

DB_NAME     = "my_db"
GRAPH_NAME  = "my_graph"


l_parent    = "gs://katana-demo-datasets/csv-datasets/ldbc/"
  
l_place     = [
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/static/Place/part-00000-c16729dd-16d5-456b-84ca-d0f19fa783d6-c000.csv"
   ]
   
l_person    = [
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00000-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00001-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00002-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00003-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00004-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00005-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00006-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00007-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00008-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00009-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00010-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00011-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00012-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00013-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00014-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person/part-00015-18fa0376-2cfb-4b82-a7a6-e136cd266801-c000.csv",
   ]

l_located   = [
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00000-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00001-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00002-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00003-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00004-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00005-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00006-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00007-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00008-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00009-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00010-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00011-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00012-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00013-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00014-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   l_parent + "sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person_isLocatedIn_Place/part-00015-cc301033-2cd0-4094-b5cd-c85099d3f7ed-c000.csv",
   ]
    
print("--")    
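
For readers who prefer not to hard code the lists, the file names above follow a regular pattern and could be generated. This is only a sketch: `part_files` is a hypothetical helper, and the directory, run id, and file count are copied from the Person list above, so they would need to change for any other scale factor or export.

```python
def part_files(parent, directory, run_id, count):
    """Return `count` part-file URLs: part-00000-..., part-00001-..., and so on."""
    return [
        f"{parent}{directory}/part-{i:05d}-{run_id}-c000.csv"
        for i in range(count)
    ]


# Values copied from the hard-coded Person list in this Notebook.
person_files = part_files(
    parent="gs://katana-demo-datasets/csv-datasets/ldbc/",
    directory="sf-0.003/csv/bi/composite-projected-fk/initial_snapshot/dynamic/Person",
    run_id="18fa0376-2cfb-4b82-a7a6-e136cd266801",
    count=16,
)

print(person_files[0])
print(len(person_files))
```

dd.read_csv() also accepts a glob pattern in place of a list, which may be simpler still when every CSV file in a directory is wanted.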
    

#  Read 3 of the LDBC data files

In [None]:
#  Read just City. City rows share one file with all other places: Country, City, and possibly other types.

l_place_df = dd.read_csv(
   l_place,
   delimiter ="|",
   header    = 0,
   dtype={
      "id"   : int,
      "name" : np.dtype("O"),
      "url"  : np.dtype("O"),
      "type" : np.dtype("O"),
      })
         #
l_city_df  =  l_place_df.loc[(l_place_df["type"].isin(["City"]))]


display("Number of records: " + str(len(l_city_df.index)))
   #
display(l_city_df.head(5))


print("--")


In [None]:
#  Read Person
#
#  Even at the smallest scale factor, Person is spread across 16 files.

l_person_df = dd.read_csv(
   l_person,
   delimiter ="|",
   header    = 0,
   dtype={
      "creationDate"  : np.dtype("O"),
      "id"            : int,
      "firstName"     : np.dtype("O"),
      "lastName"      : np.dtype("O"),
      "gender"        : np.dtype("O"),
      "birthday"      : np.dtype("O"),
      "locationIP"    : np.dtype("O"),
      "browserUsed"   : np.dtype("O"),
      "language"      : np.dtype("O"),
      "email"         : np.dtype("O"),
      })


display("Number of records: " + str(len(l_person_df.index)))
   #
display(l_person_df.head(n=5, npartitions=NUM_PARTITIONS))


display("--")


In [None]:
#  Read the edge, IS_LOCATED_IN

l_located_df = dd.read_csv(
   l_located,
   delimiter ="|",
   header    = 0,
   dtype={
      "creationDate" : np.dtype("O"),
      "Person.id"    : int,
      "Place.id"     : int,
      })


display("Number of records: " + str(len(l_located_df.index)))
   #
display(l_located_df.head(n=5, npartitions=NUM_PARTITIONS))


display("--")


#  Graph setup ..

In [None]:
import os

from katana import remote
from katana.remote import import_data


my_client = remote.Client()

print(my_client)


In [None]:
#  CREATE DATABASE

my_database = my_client.create_database(name=DB_NAME)

print(my_database.database_id)

In [None]:
#  CREATE A GRAPH

my_graph = my_client.get_database(name=DB_NAME).create_graph(name=GRAPH_NAME, num_partitions=NUM_PARTITIONS)

print(my_graph)

#  Make the Graph from the 3 previously imported DataFrames
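
Because City was filtered out of the larger Place file, it can be worth checking that every edge endpoint has a matching node before importing. The sketch below uses small, made-up Pandas DataFrames and a hypothetical helper, `dangling_edges`; how the importer itself treats unmatched edges is not covered here. The same idea should carry over to the Dask DataFrames once the node ids are materialized, for example with `.compute()`.

```python
import pandas as pd

# Hypothetical miniature versions of the three DataFrames.
city_df = pd.DataFrame({"id": [1, 3], "name": ["Berlin", "Paris"]})
person_df = pd.DataFrame({"id": [10, 11, 12]})
located_df = pd.DataFrame({"Person.id": [10, 11, 12], "Place.id": [1, 3, 2]})


def dangling_edges(edges, nodes, edge_column, node_column="id"):
    """Return the edge rows whose endpoint is missing from the node DataFrame."""
    return edges[~edges[edge_column].isin(nodes[node_column])]


# Edges whose source Person or destination City has no matching node row.
bad_sources = dangling_edges(located_df, person_df, "Person.id")
bad_destinations = dangling_edges(located_df, city_df, "Place.id")

print(len(bad_sources), len(bad_destinations))
```

An empty result for both checks suggests the node and edge DataFrames are consistent with each other.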

In [None]:
# Import the 3 previously created (LDBC) Python DataFrames into KatanaGraph

with import_data.DataFrameImporter(my_graph) as df_importer:   
    
   df_importer.nodes_dataframe(l_city_df,                      #  City set of Nodes
      id_column             = "id",    
      id_space              = "City",  
      label                 = "City",  
      )
    
   df_importer.nodes_dataframe(l_person_df,                    #  Person set of Nodes
      id_column             = "id",
      id_space              = "Person", 
      label                 = "Person", 
      )
   
   df_importer.edges_dataframe(l_located_df,                    #  Our Edge, specifying the relationship between Person --> IS_LOCATED_IN --> City
      source_id_space       = "Person", 
      destination_id_space  = "City",   
      source_column         = "Person.id",
      destination_column    = "Place.id",
      type                  = "IS_LOCATED_IN"
      )

print("--")


#  The resulting graph should resemble

![Data Model](./01_Images/result_set2.png)

In [None]:
#  Take a look at the graph ..

display(my_graph.num_nodes())
display(my_graph.num_edges())

l_result = my_graph.query("""

   MATCH (n)-[r]->(m)
   RETURN n, m, r
   LIMIT 100
   
   """, contextualize=True)

l_result.view()
