
# Importing MongoDB database server data into KatanaGraph using Python Pandas/Dask DataFrames

[![](./01_Images/mdb_kg_logos1.png)](https://www.katanagraph.com)

**This section details how to import data from a MongoDB database server into KatanaGraph using Python Pandas/Dask DataFrames.**
<br>
Presumably, MongoDB is acting as the system of record for operational data, and you wish to import that data into KatanaGraph,
where you can perform graph queries, graph analytics, and graph machine learning and mining.

## Importing data into KatanaGraph:

There are several ways to get data into KatanaGraph, as well as to change data that is already present. These include
importing, manipulating, and mutating.

- For a full treatment of importing, manipulating, and mutating data in KatanaGraph, see [Here](https://www/google.com).
- Here we detail importing data from MongoDB into KatanaGraph using Python Pandas/Dask DataFrames.

To source data from MongoDB, we use PyMongo, the standard MongoDB driver for Python.
There is also a Dask MongoDB project on GitHub, [Here](https://github.com/coiled/dask-mongo), which we chose not to use. While that project offers
increased parallelism, its query capabilities are much reduced compared to PyMongo.


## In this section, the following assumptions are made:

-  You have a functional KatanaGraph cluster that you can authenticate against, with at least 3 worker nodes. See [Here](https://www/google.com).
-  You have a functional MongoDB database server that you can authenticate against. For the example presented below, the MongoDB system is available
   at **localhost, and port 7777**. (The steps presented also work for remotely accessed MongoDB systems.) 
-  For this example, we use the common MongoDB demonstration database titled **mFlix (Sample_mFlix).** 
<br>&nbsp;&nbsp; - Documentation on this MongoDB sample data set is available [Here](https://www.mongodb.com/docs/atlas/sample-data/sample-mflix/).
<br>&nbsp;&nbsp; - While both MongoDB and KatanaGraph offer polymorphic schemas, they model data differently: 
                   MongoDB is a document database, and KatanaGraph is a graph platform.
<br>&nbsp;&nbsp; -  From the MongoDB mFlix demonstration database, we use the collections titled Users, Movies, and Comments. 
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- The MongoDB collections titled **Users and Movies, and the documents they contain,
                                            become nodes (two node types, each with many nodes)** inside KatanaGraph, and remain
                                            largely unmodified.
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- The MongoDB collection titled **Comments becomes relationships (one edge type, between
                                            Users and Movies, with many relationships)** inside KatanaGraph. 
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- Here we choose to run a MongoDB aggregate query to shape the data to our
                                            specific needs. (Model a Collection --> Edge/Relationships)

![Data Model](./01_Images/models.png)


#  Get a MongoDB connection handle, read data

In [None]:

#  Get a MongoDB connection handle ..
#
#  The following assumptions are in place,
#
#     .  A working MongoDB server is operating at localhost, port 7777
#        (All instructions that follow work against a remotely operating MongoDB server. Edit the hostname and port as needed.)

import pymongo
from   pymongo import MongoClient
   #
from bson.objectid import ObjectId

from pandas import DataFrame

   ###

l_cn_m = MongoClient("localhost:7777")                #  <--  Edit this line as needed

l_db_m = l_cn_m.my_db_m                               #  Our MongoDB database name; where our MongoDB collections (tables) should be found


print("--")
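
Before reading any data, it can help to confirm that the server answers and that the expected collections exist. The following is a minimal sketch, not part of the original notebook: `check_mongo_source` is a hypothetical helper name, and the database name `my_db_m` and the lowercase collection names are assumptions taken from the cells in this section. It relies only on the PyMongo calls `client.admin.command("ping")` and `Database.list_collection_names()`.

```python
def check_mongo_source(client, db_name, required=("users", "movies", "comments")):
    """Ping the server, then return the required collections that are missing."""
    client.admin.command("ping")                      # raises if the server is unreachable
    present = set(client[db_name].list_collection_names())
    return sorted(set(required) - present)


# Hypothetical usage against the connection created in the cell above:
#
#   from pymongo import MongoClient
#   l_cn_m  = MongoClient("localhost:7777", serverSelectionTimeoutMS=5000)
#   missing = check_mongo_source(l_cn_m, "my_db_m")
#   print("Missing collections:", missing or "none")
```

An empty list means all three collections were found. Passing `serverSelectionTimeoutMS` is optional; it only shortens the wait when the host or port is wrong.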


In [None]:

#  Get the MongoDB collections titled Users and Movies
#
#  .  These Python DataFrames may be either Pandas or Dask, and will become vertices/nodes in KatanaGraph.
#
#  The following assumptions are in place,
#
#     .  The MongoDB sample database titled mFlix (Sample_mFlix) is created/loaded in a database
#        whose name is reflected in the Python variable titled l_db_m

#  Get the MongoDB Users collection

l_result_u =  DataFrame(list(l_db_m.users.find( {}, { "_id": 0, "name" : 1, "email": 1, "password": 1 })))


#  Get the MongoDB  Movies collection

#  The commented-out line below fails. See Jira, https://katanagraph.atlassian.net/browse/KAT-6522
#  The source of the error is unknown. Possibly it is jagged rows (NULL values) inside MongoDB, or a given datatype.
#
# l_result_m = DataFrame(list(l_db_m.movies.find( {} )))

l_result_m = DataFrame(list(l_db_m.movies.find( {}, { "_id": 1, "title": 1, "awards": 1, "lastupdated": 1, "year": 1, "imdb": 1, "type": 1 })))

   
print("--")
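
Since the cause of the failure on the full Movies collection is unknown, one way to narrow it down is to profile which fields are absent from some documents or hold more than one Python type. The sketch below is only a diagnostic aid under that assumption; it is not a confirmed fix for KAT-6522. The helper names `profile_fields` and `suspect_fields` are hypothetical, and the sample rows are invented rather than real mFlix data.

```python
from collections import defaultdict


def profile_fields(documents):
    """Return {field: set of Python type names seen}, using 'missing' for absent fields."""
    documents = list(documents)
    all_fields = set()
    for doc in documents:
        all_fields.update(doc)

    seen = defaultdict(set)
    for doc in documents:
        for field in all_fields:
            seen[field].add(type(doc[field]).__name__ if field in doc else "missing")
    return dict(seen)


def suspect_fields(documents):
    """Fields that are absent from some documents or hold more than one type."""
    return sorted(f for f, kinds in profile_fields(documents).items() if len(kinds) > 1)


# Invented sample rows: "year" mixes int and str, "awards" is absent from the second row.
sample = [
    {"title": "A", "year": 1999, "awards": {"wins": 1}},
    {"title": "B", "year": "2001"},
]
print(suspect_fields(sample))   # ['awards', 'year']
```

Against the live collection, the same function could be fed a bounded cursor, for example `suspect_fields(l_db_m.movies.find({}).limit(1000))`, to decide which fields to leave out of the projection.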


In [None]:

#  Get the MongoDB collection titled Comments
#
#  .  Because MongoDB is a document database, and KatanaGraph is a graph platform, the two (servers) model
#     data differently. As such, and expecting the user knows MongoDB, we use the MongoDB aggregate() 
#     method to reshape our data.
#
#  .  Further, some Comments (documents, records) may not relate to existing Users or Movies, so we add
#     those filters to our aggregate() method.
#
#
#  This routine takes a few seconds to run ..
#
#
#  Intersection between Comments and Movies, then Comments and Users ..
#
#  (We use this query to create the Edge record between Users and Movies.)
#
#    50,304  Comments
#    23,539  Movies
#
#    41,080  Comments have a movie_id found in Movies
#       then
#    41,006  Comments have an email found in Users
#
#
#  The MongoDB aggregate query below includes the following steps ..
#
#     .  (Read from Comments)
#     .  "Lookup" (effectively, a left outer join) into Movies. These results return as an array.
#     .  "Project" to add a derived property; the length of the results array above.
#     .  "Match", to filter out (rows) where there were no matching Movies.
#
#     .  Repeat this (Lookup, Project, Match) pattern, outer joining into Users.
#
#     .  And a final Project to suppress unneeded properties.


l_result_c = DataFrame(list(l_db_m.comments.aggregate( [
   
   #  Start with Comments --> Movies
   #
   {
   "$lookup" :
      {
      "localField"   : "movie_id",
         #
      "from"         : "movies",
      "foreignField" : "_id",
      "as"           : "ddd"
      }
   },
   {
   "$project" :
      {
      "_id"     : 0,                                      #  PK from Comments, don't need this
         #
      "movie_id": 1,                                      #  FK into Movies
      "email"   : 1,                                      #  FK into Users
         #
      "name"    : 1,                                      #  What we get from Comments
      "text"    : 1,
         #
      "ddd_size": {"$size": "$ddd"}                       #  These are effectively outer joins, filter out where no match existed
      }
   } ,
   {
   "$match":                                              #  41,080  rows will pass this point  (Comments --> Movies)
      {
      "ddd_size": {"$gt": 0}
      }
   },
    
   #  Move now to Comments --> Users
   # 
   {
   "$lookup" :
      {
      "localField"   : "email",
         #
      "from"         : "users",
      "foreignField" : "email",
      "as"           : "eee"
      }
   },
   {
   "$project" :
      {
      "_id"     : 0,                                      #  PK from Comments, don't need this
         #
      "movie_id": 1,                                      #  FK into Movies
      "email"   : 1,                                      #  FK into Users
         #
      "name"    : 1,                                      #  What we get from Comments
      "text"    : 1,
         #
      "eee_size": {"$size": "$eee"}                       #  These are effectively outer joins, filter out where no match existed
      }
   } ,
   {
   "$match":                                              #  41,006  rows will pass this point  (Comments --> Users)
      {
      "eee_size": {"$gt": 0}
      }
   },
    
   #  Final project
   #
   {
   "$project" :
      {
      "_id"     : 0,                                      #  PK from Comments, don't need this
         #
      "movie_id": 1,                                      #  FK into Movies
      "email"   : 1,                                      #  FK into Users
         #
      "name"    : 1,                                      #  What we get from Comments
      "text"    : 1,
      }
   } 
    
   ] ) ))


print("--")
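
The Lookup, Project, Match pattern used above amounts to keeping only those Comments whose `movie_id` and `email` both resolve to an existing document. The following plain-Python sketch illustrates that logic on a few invented documents (not real mFlix data). It is an illustration of what the pipeline computes, not a replacement for running it inside MongoDB, and `keep_matching_comments` is a hypothetical name.

```python
def keep_matching_comments(comments, movies, users):
    """Keep only comments whose movie_id and email both resolve, as the pipeline does."""
    movie_ids   = {m["_id"] for m in movies}        # stands in for $lookup into movies
    user_emails = {u["email"] for u in users}       # stands in for $lookup into users
    kept_fields = ("movie_id", "email", "name", "text")

    edges = []
    for c in comments:
        if c.get("movie_id") not in movie_ids:      # first $match:  ddd_size > 0
            continue
        if c.get("email") not in user_emails:       # second $match: eee_size > 0
            continue
        edges.append({k: c[k] for k in kept_fields if k in c})   # final $project
    return edges


# Invented sample documents.
movies   = [{"_id": "m1", "title": "First"}]
users    = [{"email": "ann@example.com", "name": "Ann"}]
comments = [
    {"_id": 1, "movie_id": "m1", "email": "ann@example.com", "name": "Ann", "text": "ok"},
    {"_id": 2, "movie_id": "m9", "email": "ann@example.com", "name": "Ann", "text": "orphan movie"},
    {"_id": 3, "movie_id": "m1", "email": "bob@example.com", "name": "Bob", "text": "orphan user"},
]

edges = keep_matching_comments(comments, movies, users)
print(len(edges))   # 1
```

Only the first comment survives, because the second points at a movie that does not exist and the third at an unknown user. Doing the filtering in the aggregate query keeps the rejected rows from ever leaving the server.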


#  Get a KatanaGraph connection handle, setup graph

In [None]:
#  Get a KatanaGraph connection handle
#
#  The following assumptions are in place,
#
#     .  A working KatanaGraph cluster is accessible at localhost, port 8080
#     .  There are at least 3 KatanaGraph worker nodes; a requirement for Python Dask

import os

from katana import remote
from katana.remote import import_data


l_cn_k = remote.Client()

print(l_cn_k)


In [None]:
#  Create a KatanaGraph database, graph


NUM_PARTITIONS  = 3
   #
DB_NAME         = "my_db"
GRAPH_NAME      = "my_graph"


l_db_k = l_cn_k.create_database(name=DB_NAME)
   #
print(l_db_k)


l_graph = l_cn_k.get_database(name=DB_NAME).create_graph(name=GRAPH_NAME, num_partitions=NUM_PARTITIONS)
   #
print(l_graph)


#  Import the 3 previously created (MongoDB) Python DataFrames into KatanaGraph

In [None]:

# Import the 3 previously created (MongoDB) Python DataFrames into KatanaGraph

with import_data.DataFrameImporter(l_graph) as df_importer:   
    
   df_importer.nodes_dataframe(l_result_u,                     #  Our Users set of Nodes
      id_column             = "email",
      id_space              = "Users",
      label                 = "Users"
      )
    
   df_importer.nodes_dataframe(l_result_m,                      #  Our Movies set of Nodes
      id_column             = "_id",
      id_space              = "Movies",
      label                 = "Movies"
      )
   
   df_importer.edges_dataframe(l_result_c,                      #  Our Edge, specifying the relationship between Users --> COMMENT_ON --> Movies
      source_id_space       = "Users",
      destination_id_space  = "Movies",
      source_column         = "email",
      destination_column    = "movie_id",
      type                  = "COMMENT_ON"
      )

print("--")
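
The import above links each edge to its nodes through `email` and `movie_id`, so it may be worth checking those keys before importing. The sketch below assumes, without confirmation from the KatanaGraph documentation, that each `id_column` value should identify exactly one node and that every edge endpoint should exist. The helper names are hypothetical and the sample records are invented.

```python
from collections import Counter


def duplicate_ids(records, key):
    """Return the key values that appear on more than one record."""
    counts = Counter(r[key] for r in records)
    return sorted(value for value, n in counts.items() if n > 1)


def dangling_edges(edges, users, movies):
    """Return the edges whose email or movie_id has no matching node record."""
    emails    = {u["email"] for u in users}
    movie_ids = {m["_id"] for m in movies}
    return [e for e in edges
            if e["email"] not in emails or e["movie_id"] not in movie_ids]


# Invented sample records: one duplicated user email, one edge to an unknown user.
users  = [{"email": "ann@example.com"}, {"email": "ann@example.com"}, {"email": "bob@example.com"}]
movies = [{"_id": "m1"}]
edges  = [
    {"email": "ann@example.com", "movie_id": "m1"},
    {"email": "cy@example.com",  "movie_id": "m1"},
]

print(duplicate_ids(users, "email"))               # ['ann@example.com']
print(len(dangling_edges(edges, users, movies)))   # 1
```

With the DataFrames from this section, the record lists could be obtained with `l_result_u.to_dict("records")`, `l_result_m.to_dict("records")`, and `l_result_c.to_dict("records")`.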


#  The result set should resemble the following

![Data Model](./01_Images/result_set.png)

In [None]:

#  Take a look at the graph ..

display(l_graph.num_nodes())
display(l_graph.num_edges())

l_result = l_graph.query("""

   MATCH (n)  - [ r ] ->  (m )
   RETURN n, m, r
   LIMIT 100
   
   """, contextualize=True)

l_result.view()
