# Runway Notebook

This application takes a CSV file and optional descriptions of its data as input and produces the following:
- A discovery summary of your data
- A graph data model in JSON format
- Ingestion code in the following formats:
  - PyIngest YAML file
  - load_csv Cypher file
  - constraints Cypher file
- A database size estimate
- Automatic data loading, if Neo4j credentials are provided

A free Neo4j Aura database can be created here: https://console.neo4j.io/

In [1]:
import os

import json
import pandas as pd
import regex as re

from summarizer.summarizer import Summarizer
from llm.llm import LLM
from ingestion.generate_ingest import IngestionGenerator
from pyingest.pyingest import PyIngest

## Gather CSV Information

In [3]:
data = pd.read_csv("data/csv/ppp_loan_data.csv")

In [6]:
for col in list(data.columns):
    print(col)

Unnamed: 0
LoanNumber
DateApproved
SBAOfficeCode
ProcessingMethod
BorrowerName
BorrowerAddress
BorrowerCity
BorrowerState
BorrowerZip
LoanStatusDate
LoanStatus
Term
SBAGuarantyPercentage
InitialApprovalAmount
CurrentApprovalAmount
UndisbursedAmount
FranchiseName
ServicingLenderLocationID
ServicingLenderName
ServicingLenderAddress
ServicingLenderCity
ServicingLenderState
ServicingLenderZip
RuralUrbanIndicator
HubzoneIndicator
LMIIndicator
BusinessAgeDescription
ProjectCity
ProjectCountyName
ProjectState
ProjectZip
CD
JobsReported
NAICSCode
Race
Ethnicity
UTILITIES_PROCEED
PAYROLL_PROCEED
MORTGAGE_INTEREST_PROCEED
RENT_PROCEED
REFINANCE_EIDL_PROCEED
HEALTH_CARE_PROCEED
DEBT_INTEREST_PROCEED
BusinessType
OriginatingLenderLocationID
OriginatingLender
OriginatingLenderCity
OriginatingLenderState
Gender
Veteran
NonProfit
ForgivenessAmount
ForgivenessDate


The application requires the user to name the columns it should use. The user may optionally describe each chosen column and give a general description of the data. If a column has no description, its value must be an empty string.

In [2]:
general_description = 'The data in my .csv file contains information about financial loans made to businesses.'

USER_GENERATED_INPUT = {
    'General Description': general_description,
    'BorrowerName': 'BorrowerName contains the name of the Business that applied for the loan.',
    'BusinessType': 'BusinessType contains the type of business (i.e., Corp, Partnership, LLC, etc.)',
    'LoanNumber': 'LoanNumber contains the unique identifier for the loan.',
    'CurrentApprovalAmount': 'CurrentApprovalAmount contains the financial amount of the loan.',
    'JobsReported': 'JobsReported contains the number of jobs the loan supports.',
    'ProjectState': 'ProjectState contains the state where the funds will be used.',
    'OriginatingLender': 'OriginatingLender contains the lender that originated the loan.',
    'UTILITIES_PROCEED': 'UTILITIES_PROCEED contains the amount of the loan the borrower said they will use to pay utilities.',
    'PAYROLL_PROCEED': 'PAYROLL_PROCEED contains the amount of the loan the borrower said they will use for payroll.',
    'MORTGAGE_INTEREST_PROCEED': 'MORTGAGE_INTEREST_PROCEED contains the amount of the loan the borrower said they will use to pay mortgage interest.',
    'RENT_PROCEED': 'RENT_PROCEED contains the amount of the loan the borrower said they will use to pay rent.',
    'REFINANCE_EIDL_PROCEED': 'REFINANCE_EIDL_PROCEED contains the amount of the loan the borrower said they will use to refinance an existing loan.',
    'HEALTH_CARE_PROCEED': 'HEALTH_CARE_PROCEED contains the amount of the loan the borrower said they will use to pay employee health care.',
    'DEBT_INTEREST_PROCEED': 'DEBT_INTEREST_PROCEED contains the amount of the loan the borrower said they will use to pay debt interest.'
}
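
The input shape described above can be checked before it is handed to the Summarizer. The following is a minimal sketch, not part of the application: the helper name `build_user_input` and the short column list are hypothetical. It fills in an empty string for any desired column that has no description and rejects names that are not columns in the CSV.

```python
def build_user_input(available_columns, desired_columns, descriptions, general_description=""):
    """Return a dict in the USER_GENERATED_INPUT shape (hypothetical helper).

    available_columns: column names present in the CSV, e.g. data.columns
    desired_columns:   columns the model should be built from
    descriptions:      optional mapping of column name -> description
    """
    available = set(available_columns)
    missing = [col for col in desired_columns if col not in available]
    if missing:
        raise ValueError(f"Columns not found in the CSV: {missing}")

    user_input = {"General Description": general_description}
    for col in desired_columns:
        # Columns without a description must map to an empty string.
        user_input[col] = descriptions.get(col, "")
    return user_input


# A few column names from the listing above, used only for illustration.
csv_columns = ["LoanNumber", "BorrowerName", "Term"]

user_input = build_user_input(
    available_columns=csv_columns,
    desired_columns=["LoanNumber", "BorrowerName"],
    descriptions={"LoanNumber": "LoanNumber contains the unique identifier for the loan."},
    general_description="Loans made to businesses.",
)
print(user_input)
```

With the notebook's DataFrame, `available_columns` would be `data.columns`.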

## Initialize LLM & Summarizer

In [4]:
llm = LLM()

In [5]:
summarizer = Summarizer(llm=llm, user_input=USER_GENERATED_INPUT, data=data)

## Discovery

In [6]:
discovery = summarizer.run_discovery()
print(discovery)

Based on the provided information, here are some preliminary observations about your data:

1. **Data Size and Completeness**: The dataset is quite large with 968,525 entries and 14 features. However, there are missing values in several columns that need to be addressed. The columns with the most missing values are `UTILITIES_PROCEED`, `MORTGAGE_INTEREST_PROCEED`, `RENT_PROCEED`, `REFINANCE_EIDL_PROCEED`, `HEALTH_CARE_PROCEED`, and `DEBT_INTEREST_PROCEED`. These missing values could potentially skew any analysis and should be handled appropriately.

2. **Data Types**: The dataset contains a mix of numerical (float64 and int64) and categorical (object) data types. The numerical data are mostly amounts related to the loan and its intended use, while the categorical data represent the borrower's name, business type, project state, and originating lender.

3. **Unique Identifiers**: The `LoanNumber` column appears to be a unique identifier for each loan, as it has no missing values and its
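
The observations in the discovery text (row and column counts, missing values, data types, and whether a column is a unique identifier) can be cross-checked directly with pandas. This is a minimal sketch under stated assumptions: `profile_columns` is a hypothetical helper, not part of the Summarizer, and the small DataFrame holds stand-in rows rather than values from the PPP file.

```python
import pandas as pd


def profile_columns(df, id_column):
    """Summarize the properties the discovery text comments on."""
    ids = df[id_column]
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_per_column": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
        # A usable identifier has no missing values and no duplicates.
        "id_is_unique": bool(ids.notna().all() and ids.is_unique),
    }


# Stand-in rows for illustration only.
sample = pd.DataFrame(
    {
        "LoanNumber": [101, 102, 103],
        "BorrowerName": ["Acme LLC", "Acme LLC", "Birch Corp"],
        "RENT_PROCEED": [None, 500.0, None],
    }
)

profile = profile_columns(sample, id_column="LoanNumber")
print(profile["missing_per_column"])
print(profile["id_is_unique"])
```

In the stand-in data `BorrowerName` repeats, so the same check with `id_column="BorrowerName"` reports that it is not unique. Running the check on the real data is a quick way to test any unique constraint a generated model proposes.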

## Generate the Initial Model

We make two calls to the LLM to generate the initial model. The first call returns a model that can almost always be improved, so we call the LLM a second time to obtain a (hopefully) better one.

In [7]:
summarizer.create_initial_model()
summarizer.iterate_model(iterations=1)
summarizer.current_model_viz

Validating response...
```json
{
    "Nodes": [
        {
            "Label": "Loan",
            "Properties": ["LoanNumber", "CurrentApprovalAmount", "JobsReported", "UTILITIES_PROCEED", "MORTGAGE_INTEREST_PROCEED", "RENT_PROCEED", "REFINANCE_EIDL_PROCEED", "HEALTH_CARE_PROCEED", "DEBT_INTEREST_PROCEED"],
            "Unique Constraints": ["LoanNumber"],
            "Reasoning": "The 'Loan' node represents each unique loan in the dataset. The properties are all the features related to the loan itself. The 'LoanNumber' is a unique identifier for each loan, so it is used as a unique constraint."
        },
        {
            "Label": "Borrower",
            "Properties": ["BorrowerName", "BusinessType"],
            "Unique Constraints": ["BorrowerName"],
            "Reasoning": "The 'Borrower' node represents each unique borrower in the dataset. The properties are all the features related to the borrower. The 'BorrowerName' is a unique identifier for each borrower, so it is used
```

## Iterate on the Data Model

If the data model needs changes, they can be made here. Describe the corrections in the user_corrections variable, or leave it as an empty string to run another iteration without user guidance.

In [10]:
user_corrections = ""

summarizer.iterate_model(iterations=1, user_corrections=user_corrections)
summarizer.current_model_viz

Validating response...


'```json\n{\n    "Nodes": [\n        {\n            "Label": "Loan",\n            "Properties": ["LoanNumber", "CurrentApprovalAmount", "JobsReported", "UTILITIES_PROCEED", "MORTGAGE_INTEREST_PROCEED", "RENT_PROCEED", "REFINANCE_EIDL_PROCEED", "HEALTH_CARE_PROCEED", "DEBT_INTEREST_PROCEED"],\n            "Unique Constraints": ["LoanNumber"],\n            "Reasoning": "The \'Loan\' node represents each unique loan in the dataset. The properties are all the features related to the loan itself. The \'LoanNumber\' is a unique identifier for each loan, so it is used as a unique constraint."\n        },\n        {\n            "Label": "Borrower",\n            "Properties": ["BorrowerName", "BusinessType"],\n            "Unique Constraints": ["BorrowerName"],\n            "Reasoning": "The \'Borrower\' node represents each unique borrower in the dataset. The properties are all the features related to the borrower. The \'BorrowerName\' is a unique identifier for each borrower, so it is used a
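
The value displayed above is a string in which the JSON model is wrapped in a Markdown code fence, so it has to be unwrapped before it can be read as a dictionary. The following is a minimal sketch of that step, assuming the string has this fenced shape. `extract_json_block` is a hypothetical helper, not part of the Summarizer API, and the sample response is a shortened stand-in.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly


def extract_json_block(response):
    """Return the parsed JSON found inside a fenced block of `response`.

    Falls back to parsing the whole string when no fence is present.
    """
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    payload = match.group(1) if match else response
    return json.loads(payload)


# Shortened stand-in for a fenced model string, for illustration only.
response = (
    FENCE + "json\n"
    '{"Nodes": [{"Label": "Loan", "Unique Constraints": ["LoanNumber"]}]}\n'
    + FENCE
)

model = extract_json_block(response)
print(model["Nodes"][0]["Label"])
```

`json.loads` raises `json.JSONDecodeError` on a truncated or malformed payload, which is a convenient signal that the response should be regenerated.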

## Neo4j Credentials

In [None]:
username = "neo4j"
password = "password"
database = "neo4j"
uri = "bolt://localhost:7687"
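
Hard-coded credentials are fine for a local test database, but for a shared notebook or an Aura instance it is safer to read them from the environment. The following is a minimal sketch, assuming the variable names `NEO4J_USERNAME`, `NEO4J_PASSWORD`, `NEO4J_DATABASE`, and `NEO4J_URI`. These names are a choice made here for illustration, not something the application requires, and the values from the cell above serve as fallbacks.

```python
import os


def load_neo4j_credentials(env=os.environ):
    """Read connection settings from environment variables (names are assumed)."""
    return {
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "password"),
        "database": env.get("NEO4J_DATABASE", "neo4j"),
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
    }


credentials = load_neo4j_credentials()
username = credentials["username"]
password = credentials["password"]
database = credentials["database"]
uri = credentials["uri"]
```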

## Generate Ingestion Code

In [None]:
# Use the most recent model by default.
model_to_use = summarizer.current_model
# To use an earlier model instead, uncomment these lines and set the 1-based version number.
# model_version_to_use = 1
# model_to_use = summarizer.model_history[model_version_to_use - 1].dict
gen = IngestionGenerator(data_model=model_to_use,
                         username=username,
                         password=password,
                         database=database,
                         uri=uri)

In [None]:
gen.generate_pyingest_yaml_file()
gen.generate_load_csv_file()
gen.generate_constraints_cypher_file()
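
To give a sense of what a constraints file contains, the sketch below turns the "Unique Constraints" entries of a model into Cypher statements. It is an illustration only and does not reproduce the IngestionGenerator's actual output: `constraint_statements` and the constraint naming scheme are hypothetical, and the `REQUIRE ... IS UNIQUE` form assumes a recent Neo4j version.

```python
def constraint_statements(model):
    """Build one uniqueness constraint per (label, property) pair in `model`."""
    statements = []
    for node in model["Nodes"]:
        label = node["Label"]
        for prop in node.get("Unique Constraints", []):
            name = f"{label}_{prop}".lower()  # hypothetical naming scheme
            statements.append(
                f"CREATE CONSTRAINT {name} IF NOT EXISTS "
                f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE;"
            )
    return statements


# Labels and constraints taken from the model output shown earlier.
model = {
    "Nodes": [
        {"Label": "Loan", "Unique Constraints": ["LoanNumber"]},
        {"Label": "Borrower", "Unique Constraints": ["BorrowerName"]},
    ]
}

for statement in constraint_statements(model):
    print(statement)
```

Creating constraints before loading matters because a uniqueness constraint also creates an index, which speeds up the MERGE lookups performed during ingestion.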

## Ingest Data into Neo4j Database

In [None]:
PyIngest(yaml_string=gen.generate_pyingest_yaml_string())