This notebook prepares data for loading into a `neo4j` database.
The original data has the relationship
`(:Lender)-[:LEND]->(:Loan)-[:TAGGED_WITH]->(:Tag)`

We will attempt to create new relationships:

- `(:Lender)-[:INTEREST]->(:Tag)`
- `(:Lender)-[:SHARE_TAGS]->(:Lender)`
- `(:Lender)-[:SHARE_LOANS]->(:Lender)`

The last two take `O(n^2)` in the number of lenders, and that is where I am struggling. There are two options:

- Subsample the data: use only one or a few countries to build the graph
- Or use tools like `cuGraph` or `cuDF` to prebuild the relationships, then load them into `neo4j` only for visualization and querying
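To make the cost concrete, here is a minimal sketch of how the derived relationships could be prebuilt from two edge lists. It uses plain pandas on toy data; the frame names `lends` and `loan_tags`, the column names, and the helper functions are assumptions for illustration, not the real schema. It also assumes each (lender, loan) pair appears only once.

```python
import pandas as pd

# Hypothetical toy edge lists; names and values are for illustration only.
lends = pd.DataFrame({
    "lender_id": ["a", "b", "c", "a", "b"],
    "loan_id":   [1,   1,   1,   2,   2],
})
loan_tags = pd.DataFrame({
    "loan_id": [1, 1, 2],
    "tag":     ["women", "eco", "women"],
})


def lender_tag_interest(lends, loan_tags):
    """(:Lender)-[:INTEREST]->(:Tag): number of funded loans per (lender, tag)."""
    joined = lends.merge(loan_tags, on="loan_id")
    return (joined.groupby(["lender_id", "tag"])
                  .size()
                  .reset_index(name="count"))


def shared_loans(lends):
    """(:Lender)-[:SHARE_LOANS]->(:Lender): number of loans both lenders funded."""
    # The self-join emits every ordered pair of lenders on the same loan,
    # so a loan with k lenders produces k * k rows. This is the quadratic part.
    pairs = lends.merge(lends, on="loan_id", suffixes=("_a", "_b"))
    # Keep each unordered pair once and drop self-pairs.
    pairs = pairs[pairs["lender_id_a"] < pairs["lender_id_b"]]
    return (pairs.groupby(["lender_id_a", "lender_id_b"])
                 .size()
                 .reset_index(name="count"))


interest = lender_tag_interest(lends, loan_tags)
shared = shared_loans(lends)
```

`SHARE_TAGS` could be built the same way by self-joining the lender-tag table on `tag`, but popular tags are shared by many lenders, so the intermediate join grows even faster.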

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import cudf

tqdm.pandas()

# Import raw data
First, read the data from the `.jsonl` file into a pandas DataFrame.
Then store the DataFrame in `.parquet` format for easy access later.

In [None]:
%%script false --no-raise-error

df = pd.read_json("../fulldata/kiva_activity_2023-08-28T11-09-39.jsonl", lines=True)
df = pd.json_normalize(df["loan"], sep='_')

In [None]:
%%script false --no-raise-error

df["loanAmount"] = df["loanAmount"].astype(float)
df["loanFundraisingInfo_fundedAmount"] = df["loanFundraisingInfo_fundedAmount"].astype(float)
df["raisedDate"] = pd.to_datetime(df["raisedDate"])
df["fundraisingDate"] = pd.to_datetime(df["fundraisingDate"])
df["geocode_country_name"] = df["geocode_country_name"].astype("category")
df["sector_id"] = df["sector_id"].astype(int)
df["sector_name"] = df["sector_name"].astype("category")
df["activity_id"] = df["activity_id"].astype(int)
df["activity_name"] = df["activity_name"].astype("category")

In [None]:
%%script false --no-raise-error
df.to_parquet("../fulldata/kiva_activity_2023-08-28T11-09-39.parquet")

In [None]:
ds = cudf.read_parquet("../fulldata/kiva_activity_2023-08-28T11-09-39.parquet")

In [None]:
ds.dropna(axis=0, how="all", inplace=True)
ds.tail()

# Construct a Graph

The idea is to construct a graph with the following node types:
- `Lender`
- `Loan`
- `Tag`

With the following relationships:
- `Lender`s can `LEND` to `Loan`s
- `Loan`s can be `TAGGED_WITH` `Tag`s

`Lender` has properties:
- `id`
- `name`
- `publicId`

`Loan` has properties:
- `id`
- `name`
- `loanAmount`
- `fundedAmount`
- `postDate`
- `raisedDate`

`Tag` has properties:
- `name`

`LEND` has properties:
- `shareAmount`
- `date`

`TAGGED_WITH` has no properties.

## Construct full graph using `neo4j-admin database import`


> The most efficient way of performing a first import of large amounts of data into a new database is the neo4j-admin database import command.
[batch_data_creation](https://neo4j.com/docs/python-manual/current/performance/#_batch_data_creation)

We now create five files like this:

`tags.csv`

```csv
name:ID,:LABEL
women,Tag
user_favorite,Tag
```

`lenders.csv`

```csv
id:ID,name,publicId,:LABEL
123,"dat","datnt527",Lender
```

`loans.csv`

```csv
id:ID,name,fundraisingDate:date,raisedDate:date,loanAmount:float,loanFundraisingInfo_fundedAmount:float,geocode_country_name,sector_id,sector_name,activity_id,activity_name,:LABEL
2622552,"Elsa","2023-08-18T04:40:27Z","2023-08-21T16:46:54Z",550.00,550.00,"Philippines",14,"Construction",24,"Construction Supplies",Loan
```

Relationship between `Lender` and `Loan`:

`lender_loan.csv`

```csv
:START_ID,:END_ID,shareAmount,date,:TYPE
123,2622552,25.0,2023-04-10 00:00:00,LEND
```

`loan_tags.csv`

```csv
:START_ID,:END_ID,:TYPE
2622552,women,TAGGED_WITH
```
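Once the five files exist, they can be passed to the import tool. The sketch below only assembles the command line; it assumes the Neo4j 5 syntax `neo4j-admin database import full` (4.x used a different form), the database name `neo4j`, and a hypothetical helper `build_import_command`. Nothing is executed.

```python
import shlex


def build_import_command(data_dir, database="neo4j"):
    """Return the argument list for a full import of the five CSV files."""
    nodes = ["tags.csv", "lenders.csv", "loans.csv"]
    relationships = ["lender_loan.csv", "loan_tags.csv"]
    cmd = ["neo4j-admin", "database", "import", "full"]
    cmd += [f"--nodes={data_dir}/{name}" for name in nodes]
    cmd += [f"--relationships={data_dir}/{name}" for name in relationships]
    # The target database name goes last (assumed to be the default "neo4j").
    cmd.append(database)
    return cmd


cmd = build_import_command("../data/neo4jtry")
# Print a copy-pasteable shell command instead of running it.
print(shlex.join(cmd))
```

Printing the command first makes it easy to check the paths before running it against a stopped or empty database.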

In [None]:
ds['geocode_country_name'].value_counts()['Vietnam']

Filter the data to keep only `Vietnam`.
Why? The full dataset has a lot of rows, so we localize the task to one country.

In [None]:
ds = ds[ds['geocode_country_name'] == 'Vietnam']

In [None]:
ds.head()

### Create `tags` df

In [None]:
# Build the Tag node table: one row per distinct tag
ds_tags = ds[['tags']].explode('tags').drop_duplicates().dropna()
ds_tags[':LABEL'] = 'Tag'
ds_tags.rename(columns={'tags': 'name:ID'}, inplace=True)
ds_tags.to_csv('../data/neo4jtry/tags.csv',index=False)
del ds_tags

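### Create `loans` df

Some loans are duplicated: the same `id` appears with different `loanFundraisingInfo_fundedAmount` values, probably because the rows were queried at different times. We keep the row with the highest funded amount.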

In [None]:
ds_loan = ds.drop(['tags', 'lendingActions_totalCount', 'lendingActions_values'], axis=1)
ds_loan.drop_duplicates(inplace=True)

In [None]:
ds_loan.loc[[9628, 1366545]]

In [None]:
temp = ds_loan.groupby('id', group_keys=False)[['loanFundraisingInfo_fundedAmount']].idxmax()
iloc = temp['loanFundraisingInfo_fundedAmount'].values # NOTE: just iloc, not loc

In [None]:
ds_loan = ds_loan.iloc[iloc]

In [None]:
ds_loan[ds_loan.duplicated(subset=['id'], keep=False)] # expect no duplicates left

In [None]:
ds_loan.loc[[9628, 1366545]] # only the row with the higher fundedAmount is kept
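An alternative way to keep the highest-funded row per loan, which does not depend on how index labels and positions line up, is to sort and then drop duplicates. This is a minimal pandas sketch on toy data; `keep_highest_funded` is a hypothetical helper, and I assume (untested here) that cuDF accepts the same two calls.

```python
import pandas as pd

# Toy frame with a non-default index, mimicking rows left after filtering.
loans = pd.DataFrame(
    {
        "id": [10, 10, 11],
        "loanFundraisingInfo_fundedAmount": [100.0, 250.0, 75.0],
    },
    index=[9628, 1366545, 42],
)


def keep_highest_funded(loans):
    """Keep, for each loan id, the row with the largest funded amount."""
    # After an ascending sort, the last row of each id has the largest amount.
    ordered = loans.sort_values("loanFundraisingInfo_fundedAmount")
    return ordered.drop_duplicates(subset="id", keep="last")


deduped = keep_highest_funded(loans)
```

The original index labels survive, so a later `.loc` lookup still refers to the same source rows.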

In [None]:
ds_loan[':LABEL'] = 'Loan'
ds_loan.rename(columns={'id': 'id:ID(Loan-ID)'}, inplace=True)
ds_loan.to_csv('../data/neo4jtry/loans.csv',index=False)
del ds_loan

### Create `lenders` df

In [None]:
ds_lender = ds[['lendingActions_values']].explode('lendingActions_values')
ds_lender.dropna(inplace=True)
ds_lender.iloc[0]['lendingActions_values']['lender']

In [None]:
df_lender = ds_lender.to_pandas()

In [None]:
df_lender = df_lender.progress_apply(lambda x: x['lendingActions_values']['lender'], axis=1)
df_lender = pd.json_normalize(df_lender)
df_lender.tail(2)

In [None]:
ds_lender = cudf.from_pandas(df_lender)
ds_lender.drop_duplicates(inplace=True)
del df_lender

In [None]:
# Among duplicated lender ids, drop the rows whose publicId is missing
duplicated_lender_id = ds_lender[ds_lender.duplicated(subset=['id'])]['id']
should_remove = ds_lender[(ds_lender['id'].isin(duplicated_lender_id)) & (ds_lender['publicId'].isna())]
ds_lender.drop(should_remove.index, axis=0, inplace=True)
# Duplicates can remain, possibly because a user changed name and publicId; keep one row per id
ds_lender.drop_duplicates(subset='id', inplace=True)
# Display any remaining duplicates (expected to be empty)
ds_lender[ds_lender.duplicated(subset=['id'], keep=False)]
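The same rule (prefer a row with a `publicId`, then keep one row per `id`) can be written as a single sort plus `drop_duplicates`. This is a sketch in plain pandas on made-up lenders; `dedupe_lenders` and the temporary `_missing` column are hypothetical names.

```python
import pandas as pd

# Made-up lenders: id 1 has a row without publicId, id 3 changed its name.
lenders = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "name": ["Ann", "Ann", "Bob", "Cy", "Cyrus"],
    "publicId": [None, "ann1", None, "cy1", "cyrus9"],
})


def dedupe_lenders(lenders):
    """Keep one row per lender id, preferring rows that have a publicId."""
    flagged = lenders.assign(_missing=lenders["publicId"].isna())
    # A stable sort moves rows with a missing publicId to the end while
    # preserving the original order among the remaining rows.
    ordered = flagged.sort_values("_missing", kind="mergesort")
    kept = ordered.drop_duplicates(subset="id", keep="first")
    return kept.drop(columns="_missing")


unique_lenders = dedupe_lenders(lenders)
```

A lender whose only row has no `publicId` (id 2 here) is still kept, which matches the behaviour of the cell above.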

In [None]:
ds_lender.rename(columns={'id': 'id:ID(Lender-ID)'}, inplace=True)
ds_lender[':LABEL'] = 'Lender'
ds_lender.to_csv('../data/neo4jtry/lenders.csv',index=False)
del ds_lender

### Create `loan-tags` relationship

In [None]:
ds_loan_tags = ds[['id', 'tags']].explode('tags')
ds_loan_tags.dropna(inplace=True)
ds_loan_tags.isna().sum()

In [None]:
ds_loan_tags.drop_duplicates(inplace=True)
ds_loan_tags.duplicated().sum()

In [None]:
ds_loan_tags['tags'].value_counts()

In [None]:
# Be careful with the empty tag '', and remove the `user_favorite`, `user_like`, `volunteer_pick` and `volunteer_like` tags
ds_loan_tags = ds_loan_tags[~ds_loan_tags.tags.isin(['', 'user_favorite', 'user_like', 'volunteer_pick', 'volunteer_like'])]

In [None]:
ds_loan_tags.rename(columns={'id': ':START_ID(Loan-ID)', 'tags':':END_ID'}, inplace=True)
ds_loan_tags[':TYPE'] = 'TAGGED_WITH'
ds_loan_tags.to_csv('../data/neo4jtry/loan_tags.csv', index=False)
del ds_loan_tags

### Create `lender-loan` relationship

In [None]:
ds_lender_loan = ds[['id', 'lendingActions_values']].explode('lendingActions_values')
ds_lender_loan.dropna(inplace=True)
ds_lender_loan.tail(5)

In [None]:
df_lender_loan = ds_lender_loan.to_pandas()

In [None]:
df_lender_loan['lender_id'] = df_lender_loan.progress_apply(lambda x: x['lendingActions_values']['lender']['id'], axis=1)
df_lender_loan['shareAmount'] = df_lender_loan.progress_apply(lambda x: x['lendingActions_values']['shareAmount'], axis=1)
df_lender_loan['date'] = df_lender_loan.progress_apply(lambda x: x['lendingActions_values']['latestSharePurchaseDate'], axis=1)
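The three row-wise `progress_apply` calls each walk the whole frame. A possibly faster variant is to flatten the nested records once with `pd.json_normalize`. The sketch below assumes each `lendingActions_values` entry is a plain dict with the keys used above; the second sample record and the helper `flatten_actions` are made up for illustration.

```python
import pandas as pd

# Toy input shaped like the exploded frame; the second record is invented.
df = pd.DataFrame({
    "id": [2622552, 2622552],
    "lendingActions_values": [
        {"lender": {"id": 123, "name": "dat", "publicId": "datnt527"},
         "shareAmount": "25.00",
         "latestSharePurchaseDate": "2023-04-10T00:00:00Z"},
        {"lender": {"id": 456, "name": "someone", "publicId": None},
         "shareAmount": "50.00",
         "latestSharePurchaseDate": "2023-04-11T00:00:00Z"},
    ],
})


def flatten_actions(df):
    """Expand the nested lending actions into flat columns in one pass."""
    # With sep="_", the nested key lender -> id becomes the column "lender_id".
    flat = pd.json_normalize(df["lendingActions_values"].tolist(), sep="_")
    return pd.DataFrame({
        "id": df["id"].to_numpy(),
        "lender_id": flat["lender_id"].to_numpy(),
        "shareAmount": flat["shareAmount"].to_numpy(),
        "date": flat["latestSharePurchaseDate"].to_numpy(),
    })


out = flatten_actions(df)
```

Using `to_numpy()` sidesteps index alignment, since the exploded frame keeps repeated index labels while `json_normalize` returns a fresh `0..n-1` index.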

In [None]:
ds_lender_loan = cudf.from_pandas(df_lender_loan)

In [None]:
ds_lender_loan.drop(['lendingActions_values'], axis=1, inplace=True)

In [None]:
ds_lender_loan.drop_duplicates(inplace=True)

In [None]:
ds_lender_loan[':TYPE'] = 'LEND'
ds_lender_loan.rename(columns={'lender_id': ':START_ID(Lender-ID)', 'id':':END_ID(Loan-ID)'}, inplace=True)
ds_lender_loan.to_csv('../data/neo4jtry/lender_loan.csv', index=False)
del ds_lender_loan