# Demo CSV --> Graph Notebook

This notebook demonstrates the data flow for generating a graph from a CSV file.

In [1]:
import json
import os

import pandas as pd
import regex as re  # third-party `regex` package, aliased to `re`

from llm.llm import LLM
from summarizer.summarizer import Summarizer

## Initialize Test Data

In [2]:
USER_GENERATED_INPUT = {
    'General Description': 'The data in my .csv file contains information about financial loans made to businesses.',
    'BorrowerName': 'BorrowerName contains the name of the Business that applied for the loan.',
    'BusinessType': 'BusinessType contains the type of business (e.g., Corp, Partnership, LLC, etc.)',
    'LoanNumber': 'LoanNumber contains the unique identifier for the loan.',
    'CurrentApprovalAmount': 'CurrentApprovalAmount contains the financial amount of the loan.',
    'JobsReported': 'JobsReported contains the number of jobs the loan supports.',
    'ProjectState': 'ProjectState contains the state where the funds will be used.',
    'OriginatingLender': 'OriginatingLender contains the lender that originated the loan.',
    'UTILITIES_PROCEED': 'UTILITIES_PROCEED contains the amount of the loan the borrower said they will use to pay utilities.',
    'PAYROLL_PROCEED': 'PAYROLL_PROCEED contains the amount of the loan the borrower said they will use for payroll.',
    'MORTGAGE_INTEREST_PROCEED': 'MORTGAGE_INTEREST_PROCEED contains the amount of the loan the borrower said they will use to pay mortgage interest.',
    'RENT_PROCEED': 'RENT_PROCEED contains the amount of the loan the borrower said they will use to pay rent.',
    'REFINANCE_EIDL_PROCEED': 'REFINANCE_EIDL_PROCEED contains the amount of the loan the borrower said they will use to refinance an existing loan.',
    'HEALTH_CARE_PROCEED': 'HEALTH_CARE_PROCEED contains the amount of the loan the borrower said they will use to pay employee health care.',
    'DEBT_INTEREST_PROCEED': 'DEBT_INTEREST_PROCEED contains the amount of the loan the borrower said they will use to pay debt interest.'
}

In [3]:
data = pd.read_csv("data/csv/ppp_loan_data.csv")
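
Before handing the descriptions and the data to the summarizer, it can help to confirm that the described column names actually match the CSV header. The cell below is a minimal sketch of such a check. The helper `check_descriptions` and the tiny stand-in frame are hypothetical and not part of the `summarizer` package; the only assumption taken from this notebook is that `'General Description'` is the one key that does not name a column.

```python
import pandas as pd


def check_descriptions(descriptions, df, general_key="General Description"):
    """Compare described column names against the DataFrame's columns.

    Hypothetical helper for illustration; not part of the summarizer package.
    """
    described = set(descriptions) - {general_key}
    columns = set(df.columns)
    return {
        # Keys that were described but do not exist in the data
        "described_but_missing": sorted(described - columns),
        # Columns in the data that have no description
        "undescribed_columns": sorted(columns - described),
    }


# Small stand-in frame so the sketch runs without the real CSV file.
sample = pd.DataFrame(
    {"LoanNumber": [1, 2], "BorrowerName": ["A", "B"], "ProjectState": ["TX", "CA"]}
)
sample_descriptions = {
    "General Description": "Loans made to businesses.",
    "LoanNumber": "Unique identifier for the loan.",
    "BorrowerName": "Name of the business.",
    "JobsReported": "Number of jobs the loan supports.",
}

report = check_descriptions(sample_descriptions, sample)
print(report)
```

With the real inputs, the call would be `check_descriptions(USER_GENERATED_INPUT, data)`; two empty lists mean the descriptions and the header agree.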

## Initialize LLM

In [4]:
llm = LLM()

In [5]:
summarizer = Summarizer(llm=llm, user_input=USER_GENERATED_INPUT, data=data)

In [6]:
discovery = summarizer.run_discovery()
print(discovery)

Based on the initial analysis of the data, we can see that we have a large dataset, containing 968,525 entries and 14 features. The memory usage is 103.4+ MB.

Below are some important details about the dataset:

1. Missing Values: Some columns have missing values, such as `BorrowerName`, `BusinessType`, `JobsReported`, `ProjectState` and all columns related to the purposes of the loan, e.g., `UTILITIES_PROCEED`, `PAYROLL_PROCEED`, etc. This missing data will need to be addressed during data preprocessing.

2. Distributions: The distributions of the `CurrentApprovalAmount` and `LoanNumber` seem to be right-skewed, meaning most of the loan amounts and numbers are on the lower end of the scale, but there are some loans with very high amounts and numbers. The `JobsReported` is also right-skewed, suggesting that most loans support a lesser number of jobs, but a few loans support a large number of jobs.

3. Unique Values: `BorrowerName` has 857334 unique values, indicating a high cardinalit
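
The discovery text above refers to missing values, skewed distributions, and unique-value counts. The cell below sketches how those three statistics could be computed directly with pandas, which is useful for cross-checking what the LLM reports. How `run_discovery` gathers its statistics is not shown in this notebook, so the `profile` helper here is an assumption for illustration, and the sample frame is made up.

```python
import pandas as pd


def profile(df):
    """Per-column missing count, unique count, and skewness.

    Hypothetical helper; skewness is only defined for numeric columns,
    so non-numeric columns get NaN in the `skew` column.
    """
    return pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "unique": df.nunique(),  # NaN values are not counted
            "skew": df.skew(numeric_only=True),
        }
    )


# Made-up sample: one missing name, one very large job count.
sample = pd.DataFrame(
    {
        "BorrowerName": ["A", "B", None, "A"],
        "JobsReported": [1.0, 2.0, 3.0, 100.0],
    }
)

summary = profile(sample)
print(summary)
```

A positive `skew` value corresponds to the right-skew described above. For an identifier column such as `LoanNumber`, skewness carries no real meaning, so that part of the LLM's summary is best read with caution.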

In [7]:
initial_model = summarizer.create_initial_model()
print(initial_model)

Sure, here's a preliminary suggestion for translating your data into a Neo4j graph data model. 

```
{
    "Nodes": [
        {
            "Label": "Loan",
            "Properties": ["LoanNumber", "CurrentApprovalAmount", "UTILITIES_PROCEED", "PAYROLL_PROCEED"],
            "Reasoning": "Each loan can be uniquely identified by its LoanNumber, and it encapsulates key details about the loan like the amount and purpose. These attributes are important for identifying patterns related to potential fraud."
        },
        {
            "Label": "Borrower",
            "Properties": ["BorrowerName", "BusinessType"],
            "Reasoning": "Each borrower is a distinct entity that can be uniquely identified by its name. The business type can provide context about the borrower and might be relevant for fraud detection."
        },
        {
            "Label": "Lender",
            "Properties": ["OriginatingLender"],
            "Reasoning": "The lender is a distinct entity and can provi

In [8]:
summarizer.iterate_model(iterations=1)

'Certainly, based on your feedback, we might extract more granularity from the data by considering the \'UTILITIES_PROCEED\' and \'PAYROLL_PROCEED\' as separate nodes instead of properties of the \'Loan\' node. This could help in identifying patterns where specific types of loans are more prone to fraudulent behavior.\n\nHere\'s the updated suggestion for the data model.\n\n```\n{\n    "Nodes": [\n        {\n            "Label": "Loan",\n            "Properties": ["LoanNumber", "CurrentApprovalAmount"],\n            "Reasoning": "Each loan can be uniquely identified by its LoanNumber, and the CurrentApprovalAmount provides key information about the loan. These attributes form the core identity of a loan."\n        },\n        {\n            "Label": "Borrower",\n            "Properties": ["BorrowerName", "BusinessType"],\n            "Reasoning": "Each borrower is a distinct entity that can be uniquely identified by its name. The business type can provide context about the borrower and

In [12]:
summarizer.model_history[-1]

'Certainly, based on your feedback, we might extract more granularity from the data by considering the \'UTILITIES_PROCEED\' and \'PAYROLL_PROCEED\' as separate nodes instead of properties of the \'Loan\' node. This could help in identifying patterns where specific types of loans are more prone to fraudulent behavior.\n\nHere\'s the updated suggestion for the data model.\n\n```\n{\n    "Nodes": [\n        {\n            "Label": "Loan",\n            "Properties": ["LoanNumber", "CurrentApprovalAmount"],\n            "Reasoning": "Each loan can be uniquely identified by its LoanNumber, and the CurrentApprovalAmount provides key information about the loan. These attributes form the core identity of a loan."\n        },\n        {\n            "Label": "Borrower",\n            "Properties": ["BorrowerName", "BusinessType"],\n            "Reasoning": "Each borrower is a distinct entity that can be uniquely identified by its name. The business type can provide context about the borrower and

In [13]:
summarizer.parse_model_from_response(summarizer.model_history[-1])

{'Nodes': [{'Label': 'Loan',
   'Properties': ['LoanNumber', 'CurrentApprovalAmount'],
   'Reasoning': 'Each loan can be uniquely identified by its LoanNumber, and the CurrentApprovalAmount provides key information about the loan. These attributes form the core identity of a loan.'},
  {'Label': 'Borrower',
   'Properties': ['BorrowerName', 'BusinessType'],
   'Reasoning': 'Each borrower is a distinct entity that can be uniquely identified by its name. The business type can provide context about the borrower and might be relevant for fraud detection.'},
  {'Label': 'Lender',
   'Properties': ['OriginatingLender'],
   'Reasoning': "The lender is a distinct entity and can provide valuable information about the loan's origin. Patterns from specific lenders might be indicative of fraud."},
  {'Label': 'State',
   'Properties': ['ProjectState'],
   'Reasoning': 'The state where the project is located can provide geographical context that could be relevant for identifying regional patterns i
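
Once the model is a plain dictionary, it is easy to check that every property the LLM proposed is a real column in the CSV. The cell below is a sketch under the assumption that the parsed model keeps the `Nodes` / `Label` / `Properties` layout shown above. The helper `unknown_properties` is hypothetical, and the `Region` property in the example is invented to trigger the check.

```python
def unknown_properties(model, columns):
    """Map each node label to the properties that are not known columns.

    Hypothetical helper for illustration; returns an empty dict when
    every property matches a column.
    """
    known = set(columns)
    problems = {}
    for node in model.get("Nodes", []):
        missing = [p for p in node.get("Properties", []) if p not in known]
        if missing:
            problems[node.get("Label", "<unlabeled>")] = missing
    return problems


# Invented example: "Region" is not a column, so it should be flagged.
example_model = {
    "Nodes": [
        {"Label": "Loan", "Properties": ["LoanNumber", "CurrentApprovalAmount"]},
        {"Label": "State", "Properties": ["ProjectState", "Region"]},
    ]
}
example_columns = ["LoanNumber", "CurrentApprovalAmount", "ProjectState"]

issues = unknown_properties(example_model, example_columns)
print(issues)
```

In this notebook the equivalent call would pass the parsed model and `data.columns`.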

In [11]:
# json.loads(re.findall(r"(?:```\njson|```)\n(\{[\n\s\w\"\:\[\]\{\\},\'\.\-]*)```", test)[0])
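
The commented-out line above pulls the JSON out of a fenced block with a character-class regex. A simpler alternative is sketched below: match whatever sits between the first pair of fences lazily and let `json.loads` do the validation. This is an assumption about one workable approach, not the logic inside `parse_model_from_response`. It uses the standard-library `re` module rather than `regex`, and the response string is a short invented one because the real outputs in this notebook are truncated.

```python
import json
import re

# Build the fence marker programmatically so this cell can itself be fenced.
FENCE = "`" * 3

# Opening fence, optional "json" tag, then everything up to the next fence.
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_json_block(response):
    """Return the first fenced block in `response`, parsed as JSON.

    Hypothetical helper; raises ValueError if no fenced block is found
    (json.JSONDecodeError, a ValueError subclass, if the block is not valid JSON).
    """
    match = FENCED_BLOCK.search(response)
    if match is None:
        raise ValueError("no fenced block found in response")
    return json.loads(match.group(1))


# Invented stand-in for an LLM response.
response = (
    "Here's the updated suggestion for the data model.\n\n"
    + FENCE
    + '\n{"Nodes": [{"Label": "Loan", "Properties": ["LoanNumber"]}]}\n'
    + FENCE
    + "\nLet me know if you want changes."
)

model = extract_json_block(response)
print(model["Nodes"][0]["Label"])
```

The lazy match stops at the first closing fence, so a response whose JSON strings themselves contain three consecutive backticks would need a more careful parser.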