# PayPal Requirements

##### Describes the requirements from the PayPal story, which can be found here: [PayPal Story](https://docs.google.com/document/d/1oVM5ewbC3UVdr21dlR3kjD5pRChdSm7y/edit)

---

## Connect to the cluster

Run this command in a separate terminal to forward port 8080 from the Katana controller node:

```
gcloud compute ssh --zone us-east1-b --project CLUSTER_NAME katana-controller -- -NL 8080:127.0.0.1:8080 -vvv
```
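Before running the cells below, it can help to confirm that the tunnel is actually listening. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, not part of the Katana client.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

From the terminal that runs the tunnel, `port_is_open("127.0.0.1", 8080)` should return `True` once forwarding is up. From inside the Docker container the host would be `host.docker.internal` instead, assuming that name resolves in your Docker setup.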

Connect the local Docker container to the Katana controller:

In [None]:
import os
from timeit import default_timer as timer

import pandas as pd
from katana import remote
from katana.remote import analytics, import_data

os.environ["KATANA_SERVER_ADDRESS"] = "host.docker.internal:8080"

## Initial Load - Social Graph (2M nodes, 41M edges)

Specify input and output locations on cloud storage

In [None]:
graph = remote.Client().create_graph(num_partitions=4)

In [None]:
start = timer()
print("Importing graph from CSV files...")
import_data.csv(
    graph,
    input_node_path="gs://katana-demo-datasets/csv-datasets/social/mapping_files/node_list_2M_full.txt",
    input_edge_path="gs://katana-demo-datasets/csv-datasets/social/mapping_files/edge_list_2M_full.txt",
    input_dir="gs://katana-demo-datasets/csv-datasets/social/csv/2M",
    data_delimiter="|",
    schema_delimiter="|",
    ids_are_integers=True,
)
end = timer()
print(f"  import: {end - start:.1f} seconds")

# Call each count once so the timing covers a single request.
start = timer()
num_nodes = graph.num_nodes()
end = timer()
assert num_nodes == 2_000_000
print(f"  num nodes: {num_nodes}")
print(f"  count nodes: {end - start:.1f} seconds")

start = timer()
num_edges = graph.num_edges()
end = timer()
assert num_edges == 41_002_994
print(f"  num edges: {num_edges}")
print(f"  count edges: {end - start:.1f} seconds")
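The `start = timer()` / `end = timer()` pattern recurs in almost every cell of this notebook. One way to reduce the repetition is a small context manager. This is a sketch only: `timed` and `timings` are hypothetical names introduced here, and nothing in it depends on the Katana client.

```python
from contextlib import contextmanager
from timeit import default_timer as timer


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time spent in the enclosed block.

    If `results` is a dict, the elapsed seconds are also stored under `label`,
    which makes it easy to collect all timings for a report.
    """
    start = timer()
    try:
        yield
    finally:
        elapsed = timer() - start
        if results is not None:
            results[label] = elapsed
        print(f"  {label}: {elapsed:.1f} seconds")


# Usage with a stand-in workload; in the notebook the body would be a call
# such as graph.num_nodes() or import_data.csv(...).
timings = {}
with timed("example", timings):
    total = sum(range(1_000_000))
```

Because the timing is recorded in a `finally` clause, the elapsed time is still printed if the enclosed call raises.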

In [None]:
from katana_enterprise.client.rest.v2 import client

## Adjust the number of rows (None: no limit) to limit the result set size.
client.Limits.set_max_operation_result_rows(1000000)

In [None]:
start = timer()
query1_result = graph.query(
    """
match (acct1)-[:FriendConB]->(acct2)-[:FriendConB]->(acct3)
where (acct3.userIsBlacklisted="Y" or acct3.userIsSuspended="Y")
and not (acct1)-->(acct3)
return acct1.id, count(acct3) ;
"""
)
end = timer()
print(query1_result)
print(f"  query1: {end - start:.1f} seconds")

Report requirements: 
- Cluster size and SKU used for data ingestion
- % of completion
- Time used for ingestion
- Warnings and errors encountered so far
- Number of vertices and edges successfully ingested, and number not ingested due to errors, etc.
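The report fields listed above can be pictured as a single record. The sketch below is illustrative only: `IngestReport` and all of its field names are hypothetical, it is not an interface provided by Katana, and the progress numbers in the usage example are placeholders rather than measurements.

```python
from dataclasses import dataclass, field


@dataclass
class IngestReport:
    """Hypothetical container for the ingestion report fields."""

    cluster_size: int            # number of machines used for ingestion
    sku: str                     # machine type used for ingestion
    expected_nodes: int
    expected_edges: int
    ingested_nodes: int = 0
    ingested_edges: int = 0
    failed_nodes: int = 0        # not ingested due to errors
    failed_edges: int = 0        # not ingested due to errors
    elapsed_seconds: float = 0.0
    warnings: list = field(default_factory=list)
    errors: list = field(default_factory=list)

    def percent_complete(self) -> float:
        """Share of expected records processed so far (ingested or failed)."""
        expected = self.expected_nodes + self.expected_edges
        if expected == 0:
            return 100.0
        processed = (
            self.ingested_nodes + self.ingested_edges
            + self.failed_nodes + self.failed_edges
        )
        return 100.0 * processed / expected


# Placeholder values: the expected counts match the initial load above,
# the cluster size, SKU, and progress counts are made up for illustration.
report = IngestReport(
    cluster_size=4,
    sku="placeholder-sku",
    expected_nodes=2_000_000,
    expected_edges=41_002_994,
    ingested_nodes=1_000_000,
    ingested_edges=20_501_497,
)
print(f"{report.percent_complete():.1f}% complete")
```

Counting failed records as processed is a design choice made here so that the percentage reaches 100 even when some records are rejected; the failure counts are reported separately.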



Output: 
- Report already shows skew (complete - available now)
- Cluster size and SKU used for data ingestion (complete - available now)
- Time used for ingestion (complete - available now)
- % of completion
- Warnings and errors encountered so far (while the load is in progress)
- Number of vertices and edges successfully ingested, and number not ingested due to errors, etc. (complete - available now)


## Incremental Data Loading - Social Graph 

Requirements (data will be staged in Google Cloud Storage):
- 80K vertices to be inserted
- 7M edges to be inserted


In [None]:
# Incremental node + edge inserts. These should also work separately by leaving the node or edge mapping path blank.

next_node_ingest_batch = (
    "gs://katana-demo-datasets/csv-datasets/social/mapping_files/incremental/node_insert_2M_100k.txt"
)
next_edge_ingest_batch = (
    "gs://katana-demo-datasets/csv-datasets/social/mapping_files/incremental/edge_insert_2M_1.5M.txt"
)

start = timer()
# re-uses the graph from initial load
import_data.csv(
    graph,
    operation=import_data.Operation.Insert,
    input_node_path=next_node_ingest_batch,
    input_edge_path=next_edge_ingest_batch,
    input_dir="gs://katana-demo-datasets/csv-datasets/social/csv/",
    data_delimiter="|",
    schema_delimiter="|",
    ids_are_integers=True,
)

end = timer()
print(f"  Insert Op: {end - start:.1f} seconds")

- 100K vertices to be updated
- 1M edges to be updated


In [None]:
# incremental node + edge updates (XXX these paths are guesses)
next_node_ingest_batch = (
    "gs://katana-demo-datasets/csv-datasets/social/mapping_files/incremental/node_update_2M_100k.txt"
)
next_edge_ingest_batch = "gs://katana-demo-datasets/csv-datasets/social/mapping_files/incremental/edge_update_2M_1M.txt"

start = timer()
# re-uses the graph from initial load
import_data.csv(
    graph,
    operation=import_data.Operation.Update,
    input_node_path=next_node_ingest_batch,
    input_edge_path=next_edge_ingest_batch,
    input_dir="gs://katana-demo-datasets/csv-datasets/social/csv/",
    data_delimiter="|",
    schema_delimiter="|",
    ids_are_integers=True,
)

end = timer()
print(f"  Update Op: {end - start:.1f} seconds")

- 100K vertices to be deleted
- 1M edges to be deleted


In [None]:
# incremental node + edge deletes (XXX these paths are guesses; they currently reuse the update mapping files)
next_node_ingest_batch = (
    "gs://katana-demo-datasets/csv-datasets/social/mapping_files/incremental/node_update_2M_100k.txt"
)
next_edge_ingest_batch = "gs://katana-demo-datasets/csv-datasets/social/mapping_files/incremental/edge_update_2M_1M.txt"

start = timer()
# re-uses the graph from initial load
import_data.csv(
    graph,
    operation=import_data.Operation.Delete,
    input_node_path=next_node_ingest_batch,
    input_edge_path=next_edge_ingest_batch,
    input_dir="gs://katana-demo-datasets/csv-datasets/social/csv/",
    data_delimiter="|",
    schema_delimiter="|",
    ids_are_integers=True,
)

end = timer()
print(f"  Delete Op: {end - start:.1f} seconds")

Output: 
- Report already shows skew (complete - available now)
- Cluster size and SKU used for data ingestion (complete - available now)
- Time used for ingestion (complete - available now)
- % of completion
- Warnings and errors encountered so far (while the load is in progress)
- Number of vertices and edges successfully ingested, and number not ingested due to errors, etc. (complete - available now)


## f) Query execution and output

#### Run 2-hop aggregations only on edges of type FriendConB: for each account, find all neighbors exactly 2 hops away (excluding accounts that are 1 hop from the account) and compute the following aggregations over those accounts:

- `sum(user_is_blacklisted='Y' or suspended='Y') / (# of accounts in 2 hops)`
- `sum(phone_is_verified='Y') / (# of accounts in 2 hops)`
- `avg(cnt_decl)`

##### FriendConB Connections Only
- 2 hop query by account
- Global triangles for totals
- Users suspended and blacklisted
- Users not verified
- Average count of users declined

In [None]:
## Corresponds to https://github.com/KatanaGraph/katana-tools/blob/main/bench/queries/paypal/11.q, used by QA
start = timer()

query = """
MATCH (acct1)-[:FriendConB]->(acct2)-[:FriendConB]->(acct3) 
WHERE NOT (acct1)-[:FriendConB]->(acct3) 
WITH acct1, avg(acct3.cntDecl) as avgCntDecl, count(acct3) as totalAccounts 
MATCH (acct1)-[:FriendConB]->(acct2)-[:FriendConB]->(acct3Bad) 
WHERE (acct3Bad.userIsBlacklisted="Y" OR acct3Bad.userIsSuspended="Y") 
AND NOT (acct1)-[:FriendConB]->(acct3Bad) 
WITH acct1, avgCntDecl, totalAccounts, count(acct3Bad) as suspAccounts 
MATCH (acct1)-[:FriendConB]->(acct2)-[:FriendConB]->(acct3Bad) 
WHERE acct3Bad.phoneIsVerified="N" 
AND NOT (acct1)-[:FriendConB]->(acct3Bad) 
WITH acct1, avgCntDecl, totalAccounts, suspAccounts, count(acct3Bad) as unverifiedAccounts 
RETURN acct1.id, avgCntDecl, toFloat(suspAccounts) / toFloat(totalAccounts) as percentSusp, toFloat(unverifiedAccounts) / toFloat(totalAccounts) as percentUnverified;
"""

query2_result = graph.query(query)
end = timer()
print(query2_result)
print(f"  query2: {end - start:.1f} seconds")

Output: 
- Report already shows skew (complete - available now)
- Cluster size and SKU used for data ingestion (complete - available now)
- Time used for ingestion (complete - available now)
- % of completion
- Warnings and errors encountered so far (while the load is in progress)
- Number of vertices and edges successfully ingested, and number not ingested due to errors, etc. (complete - available now)


#### Run 2-hop aggregations on the full graph: for each account, find all neighbors exactly 2 hops away (excluding accounts that are 1 hop from the account) and compute the following aggregations over those accounts:

- `sum(user_is_blacklisted='Y' or suspended='Y') / (# of accounts in 2 hops)`
- `sum(phone_is_verified='Y') / (# of accounts in 2 hops)`
- `avg(cnt_decl)`

##### Full Social Graph
- 2 hop query by account
- Global triangles for totals
- Users suspended and blacklisted
- Users not verified
- Average count of users declined

In [None]:
## Corresponds to https://github.com/KatanaGraph/katana-tools/blob/main/bench/queries/paypal/12.q, used by QA
start = timer()

query = """
MATCH (acct1)-->(acct2)-->(acct3) 
WHERE NOT (acct1)-->(acct3) 
WITH acct1, avg(acct3.cntDecl) as avgCntDecl, count(acct3) as totalAccounts 
MATCH (acct1)-->(acct2)-->(acct3Bad) 
WHERE (acct3Bad.userIsBlacklisted="Y" OR acct3Bad.userIsSuspended="Y") 
AND NOT (acct1)-->(acct3Bad) 
WITH acct1, avgCntDecl, totalAccounts, count(acct3Bad) as suspAccounts 
MATCH (acct1)-->(acct2)-->(acct3Bad) 
WHERE acct3Bad.phoneIsVerified="N" 
AND NOT (acct1)-->(acct3Bad) 
WITH acct1, avgCntDecl, totalAccounts, suspAccounts, count(acct3Bad) as unverifiedAccounts 
RETURN acct1.id, avgCntDecl, toFloat(suspAccounts) / toFloat(totalAccounts) as percentSusp, toFloat(unverifiedAccounts) / toFloat(totalAccounts) as percentUnverified;
"""

query3_result = graph.query(query)
end = timer()
print(query3_result)
print(f"  query3: {end - start:.1f} seconds")


## Algorithms

#### PageRank (Social Graph)

In [None]:
start = timer()
analytics.pagerank(graph, result_property_name="page_rank")
end = timer()
print(f"  Pagerank: {end - start:.1f} seconds")

result = graph.query(
    """
    MATCH (n)
    RETURN n, n.page_rank
    ORDER BY n.page_rank DESC
    LIMIT 30
    """,
    contextualize=True,
)
pd.DataFrame(result[0:10])
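For readers who want a reference point for the `page_rank` values, the following is a minimal single-machine PageRank by power iteration. It is a textbook sketch, not Katana's distributed implementation; the damping factor, tolerance, and handling of nodes without out-edges are assumptions chosen here and may differ from what `analytics.pagerank` does.

```python
def pagerank(out_edges, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank on a dict mapping node -> list of target nodes."""
    nodes = sorted(set(out_edges) | {v for targets in out_edges.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-edges is redistributed evenly.
        dangling = sum(rank[v] for v in nodes if not out_edges.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            targets = out_edges.get(u, [])
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: both a and b point to c, and c points back to a.
toy_ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for node, score in sorted(toy_ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(node, round(score, 4))
```

In the toy graph `c` receives links from both other nodes and ends up with the highest score, while `b`, which has no incoming links, keeps only the baseline share.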

#### Distributed Louvain (Social Graph)

In [None]:
prop_name = "cluster_id"
start = timer()
louvain_graph = graph.project(node_types=["Account"], edge_types=["AssetConB"])
end = timer()
print(f"  Projection: {end - start:.1f} seconds")

start = timer()
analytics.louvain_clustering(louvain_graph, result_property_name=prop_name, is_symmetric=True)
end = timer()
print(f"  Louvain: {end - start:.1f} seconds")

result = graph.query(
    f"""
    MATCH (n)
    WHERE exists(n.{prop_name})
    RETURN n, n.{prop_name}
    LIMIT 30
    """,
    contextualize=True,
)
pd.DataFrame(result[0:10])