# Import AMLSim Small CSV Data
This notebook shows how to use the [import_data.DataFrameImporter](../../reference/python/katana.remote.import_data.rst#katana_enterprise.remote.dataframe_importer.DataFrameImporter) class to import AMLSim data from CSV files through the Katana Service.

## Before you start
* Make sure you have a running [Katana cluster](../../getting-started/index.rst) (cloud or local deployment) and have set up access.
* Make sure the CSV data is in a location accessible from the Katana cluster.
* Make sure the schema files described in the [import data requirements](import-data-requirements.rst) section are accessible from the Client or the Katana cluster.

## Set up the Katana Client
A Katana remote Client is required to interface with the Katana remote service and schedule distributed operations. Provide the Katana Graph server address by setting the environment variable `KATANA_SERVER_ADDRESS`. This variable applies only to your current shell session. To set it automatically for future shell sessions, add it to your shell startup file. For more detailed information, refer to the Getting Started section of the documentation.

Alternatively, you can pass the address of the remote service directly by calling `Client(address="localhost:8080")` on Linux machines or `Client(address="host.docker.internal:8080")` on macOS and Windows machines.
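
The address lookup described above can be sketched in plain Python. This is only an illustration: `resolve_server_address` is a hypothetical helper, not part of the Katana API, and the fallback addresses assume the local deployment defaults quoted above.

```python
import os
import platform

ENV_VAR = "KATANA_SERVER_ADDRESS"


def resolve_server_address(environ=None, system=None):
    """Return an address suitable for Client(address=...).

    Hypothetical helper: prefers the environment variable and otherwise
    falls back to the local defaults mentioned in this notebook.
    """
    environ = os.environ if environ is None else environ
    system = platform.system() if system is None else system

    # Prefer the environment variable when it is set and non-empty.
    address = environ.get(ENV_VAR)
    if address:
        return address

    # Assumed local defaults: Linux reaches the service on localhost,
    # macOS and Windows go through the Docker host alias.
    if system == "Linux":
        return "localhost:8080"
    return "host.docker.internal:8080"


print(resolve_server_address({}, system="Linux"))
```

The result could then be passed explicitly, for example `remote.Client(address=resolve_server_address())`.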

In [1]:
from katana import remote

# Connect to the Katana Server
client = remote.Client()

                Environment variable MODIN_ENGINE is not set to python, if you run into issues please try setting it by doing:
                import os
                os.environ['MODIN_ENGINE']='python'
                


In [2]:
graph = client.create_graph(num_partitions=3)

## Sample data

This example uses generated AMLSim data from [IBM's AMLSim project](https://github.com/IBM/AMLSim), stored in BigQuery. The source data can be found at `s3://` and `gs://katana-demo-datasets/csv-datasets/AMLSim`.

In [3]:
dataset_dir = "gs://katana-internal2/csv-datasets/amlsim-small/csv"

### Graph schema

![AML SIM schema](images/aml-sim-schema.svg)

### Data Exploration Setup
Import Dask and create a `gcsfs` file system object to explore the data in Google Cloud Storage.

In [4]:
import dask.dataframe as dd
import gcsfs
    
fs = gcsfs.GCSFileSystem()

## Create DataFrames from tables
Create a Dask DataFrame from each AMLSim table.

### Create a DataFrame for each CSV file

In [5]:
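dfs = {}
# Walk the dataset directory and read every CSV file into a Dask DataFrame,
# keyed by the file name up to its first dot (for example, "accounts").
for root, dirs, names in fs.walk(dataset_dir, maxdepth=2):
    for name in names:
        dfs[name.split('.')[0]] = dd.read_csv("gs://" + root + "/" + name)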

In [6]:
dfs.keys()

dict_keys(['accounts', 'address', 'email', 'names', 'ssn', 'transactions'])
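
The keys above come from the part of each file name before the first dot. A minimal sketch of that rule follows, using hypothetical file names inferred from the printed keys (the actual bucket contents are not listed in this notebook). It also shows one caveat: names that contain more than one dot collapse to the same prefix.

```python
from pathlib import PurePosixPath


def table_key(file_name):
    """Return the dictionary key for a file: its name up to the first dot."""
    return file_name.split(".")[0]


# Hypothetical file names, assumed from the keys printed above.
file_names = ["accounts.csv", "address.csv", "transactions.csv"]
keys = [table_key(name) for name in file_names]
print(keys)

# For names with a single extension, PurePosixPath.stem gives the same key.
stems = [PurePosixPath(name).stem for name in file_names]

# Caveat: with extra dots, the first-dot rule and .stem disagree, and two
# such files would overwrite each other in the dictionary.
multi_dot_key = table_key("accounts.part1.csv")
multi_dot_stem = PurePosixPath("accounts.part1.csv").stem
```

If the dataset ever contains multi-part names, a key based on `.stem` or on the full name may be the safer choice.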

In [7]:
dfs["transactions"]

| | ID_IGNORE | TX_ID | SENDER_ACCOUNT_ID | RECEIVER_ACCOUNT_ID | TX_TYPE | TX_AMOUNT | TIMESTAMP | IS_FRAUD | ALERT_ID | method | suspectRegion | DOT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=1 | | | | | | | | | | | | |
| | int64 | int64 | int64 | int64 | object | float64 | int64 | bool | int64 | object | bool | object |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [8]:
graph.graph_id

'Em4vR3NdzGUAd4m7wx3hEpGBe7v1pL8f5qwez35nxWLA'

In [9]:
%%time
from katana_enterprise.remote import import_data
with import_data.DataFrameImporter(graph) as df_importer:
    # Account
    df_importer.nodes_dataframe(dfs["accounts"].drop(['CUSTOMER_ID'], axis=1),
                                id_column='ACCOUNT_ID',
                                id_space='ACCOUNT')
    # Customer
    df_importer.nodes_dataframe(dfs["names"],
                                id_column='CUSTOMER_ID',
                                id_space='CUSTOMER')
    # Customer->Account
    df_importer.edges_dataframe(dfs["accounts"][['CUSTOMER_ID','ACCOUNT_ID']],
                                source_id_space='CUSTOMER',
                                destination_id_space='ACCOUNT',
                                source_column='CUSTOMER_ID',
                                destination_column='ACCOUNT_ID',
                                type='hasAccount')
    # Address
    df_importer.nodes_dataframe(dfs["address"].drop(['CUSTOMER_ID'], axis=1),
                                id_column='ADDRESS_ID',
                                id_space='ADDRESS')
    # Customer->Address
    df_importer.edges_dataframe(dfs["address"][['CUSTOMER_ID','ADDRESS_ID']],
                                source_id_space='CUSTOMER',
                                destination_id_space='ADDRESS',
                                source_column='CUSTOMER_ID',
                                destination_column='ADDRESS_ID',
                                type='hasAddress')
    # Email
    df_importer.nodes_dataframe(dfs["email"].drop(['CUSTOMER_ID'], axis=1),
                                id_column='EMAIL_ID',
                                id_space='EMAIL')
    # Customer->Email
    df_importer.edges_dataframe(dfs["email"][['CUSTOMER_ID','EMAIL_ID']],
                                source_id_space='CUSTOMER',
                                destination_id_space='EMAIL',
                                source_column='CUSTOMER_ID',
                                destination_column='EMAIL_ID',
                                type='hasEmail')
    # SSN
    df_importer.nodes_dataframe(dfs["ssn"].drop(['CUSTOMER_ID'], axis=1),
                                id_column='SSN_ID',
                                id_space='SSN')
    # Customer->SSN
    df_importer.edges_dataframe(dfs["ssn"][['CUSTOMER_ID','SSN_ID']],
                                source_id_space='CUSTOMER',
                                destination_id_space='SSN',
                                source_column='CUSTOMER_ID',
                                destination_column='SSN_ID',
                                type='hasSSN')
    # Account->Account (transactions; the edge type is read from the TX_TYPE column)
    df_importer.edges_dataframe(dfs["transactions"],
                                source_id_space='ACCOUNT',
                                source_column='SENDER_ACCOUNT_ID',
                                destination_id_space='ACCOUNT',
                                destination_column='RECEIVER_ACCOUNT_ID',
                                type_column='TX_TYPE')
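
The import cell uses the same pattern for most tables: the node DataFrame drops the `CUSTOMER_ID` column, and a separate two-column DataFrame supplies the edges from customer to that node. The sketch below illustrates that split with plain Python dictionaries instead of Dask. The helper name, the sample rows, and the `EMAIL` column are hypothetical and only stand in for the real table contents.

```python
def split_node_and_edge_rows(rows, id_column, parent_column):
    """Split table rows into node rows and (parent, child) edge pairs."""
    # Node rows keep every column except the reference to the parent.
    nodes = [
        {key: value for key, value in row.items() if key != parent_column}
        for row in rows
    ]
    # Edge pairs keep only the two ID columns, parent first.
    edges = [(row[parent_column], row[id_column]) for row in rows]
    return nodes, edges


# Hypothetical rows standing in for the "email" table.
email_rows = [
    {"EMAIL_ID": 10, "CUSTOMER_ID": 1, "EMAIL": "a@example.com"},
    {"EMAIL_ID": 11, "CUSTOMER_ID": 2, "EMAIL": "b@example.com"},
]

nodes, edges = split_node_and_edge_rows(email_rows, "EMAIL_ID", "CUSTOMER_ID")
print(nodes)
print(edges)
```

Keeping the parent reference out of the node rows means the relationship is stored once, as an edge, rather than duplicated as a node property.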




In [10]:
graph.schema().view()
