# Demo CSV --> Graph Notebook

This notebook demonstrates the data flow for generating a graph from a CSV file.

In [1]:
import json
import os

import pandas as pd
import regex as re

from summarizer.summarizer import Summarizer
from llm.llm import LLM

## Initialize Test Data

In [2]:
USER_GENERATED_INPUT = {
    'General Description': 'The data in my .csv file contains information about financial loans made to businesses.',
    'BorrowerName': 'BorrowerName contains the name of the Business that applied for the loan.',
    'BusinessType': 'BusinessType contains the type of business (e.g., Corp, Partnership, LLC).',
    'LoanNumber': 'LoanNumber contains the unique identifier for the loan.',
    'CurrentApprovalAmount': 'CurrentApprovalAmount contains the financial amount of the loan.',
    'JobsReported': 'JobsReported contains the number of jobs the loan supports.',
    'ProjectState': 'ProjectState contains the state where the funds will be used.',
    'OriginatingLender': 'OriginatingLender contains the lender that originated the loan.',
    'UTILITIES_PROCEED': 'UTILITIES_PROCEED contains the amount of the loan the borrower said they will use to pay utilities.',
    'PAYROLL_PROCEED': 'PAYROLL_PROCEED contains the amount of the loan the borrower said they will use for payroll.',
    'MORTGAGE_INTEREST_PROCEED': 'MORTGAGE_INTEREST_PROCEED contains the amount of the loan the borrower said they will use to pay mortgage interest.',
    'RENT_PROCEED': 'RENT_PROCEED contains the amount of the loan the borrower said they will use to pay rent.',
    'REFINANCE_EIDL_PROCEED': 'REFINANCE_EIDL_PROCEED contains the amount of the loan the borrower said they will use to refinance an existing loan.',
    'HEALTH_CARE_PROCEED': 'HEALTH_CARE_PROCEED contains the amount of the loan the borrower said they will use to pay employee health care.',
    'DEBT_INTEREST_PROCEED': 'DEBT_INTEREST_PROCEED contains the amount of the loan the borrower said they will use to pay debt interest.'
}

In [3]:
data = pd.read_csv("data/csv/ppp_loan_data.csv")
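
Before the descriptions and the DataFrame are passed to the summarizer, it can be useful to confirm that the described columns and the CSV columns line up. The following is a minimal sketch, not part of the `summarizer` package: `check_description_coverage` is a hypothetical helper, and it assumes that `'General Description'` is the only key that does not name a column. The small `sample` frame stands in for `data`.

```python
import pandas as pd


def check_description_coverage(user_input, columns, general_key="General Description"):
    """Compare described column names against the columns actually present.

    Hypothetical helper for illustration; not part of the summarizer package.
    """
    # Every key except the general description is assumed to name a column.
    described = {key for key in user_input if key != general_key}
    present = set(columns)
    return {
        "described_but_missing": sorted(described - present),
        "present_but_undescribed": sorted(present - described),
    }


# Small stand-in frame; in the notebook this would be `data`.
sample = pd.DataFrame({"LoanNumber": [1, 2], "ProjectState": ["TX", "CA"]})
sample_input = {
    "General Description": "Loans made to businesses.",
    "LoanNumber": "Unique identifier for the loan.",
    "BorrowerName": "Name of the business that applied.",
}
coverage = check_description_coverage(sample_input, sample.columns)
print(coverage)
```

In the notebook itself, the equivalent call would be `check_description_coverage(USER_GENERATED_INPUT, data.columns)`; two empty lists would indicate that every column has a description and every description has a column.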

## Initialize LLM

In [4]:
llm = LLM()

In [5]:
summarizer = Summarizer(llm=llm, user_input=USER_GENERATED_INPUT, data=data)

In [6]:
discovery = summarizer.run_discovery()
print(discovery)

Based on the provided information, here are some preliminary observations about your data:

1. **Data Size and Completeness**: The dataset is quite large with 968,525 entries and 14 features. However, there are missing values in several columns that need to be addressed. The columns with the most missing values are `UTILITIES_PROCEED`, `MORTGAGE_INTEREST_PROCEED`, `RENT_PROCEED`, `REFINANCE_EIDL_PROCEED`, `HEALTH_CARE_PROCEED`, and `DEBT_INTEREST_PROCEED`. These missing values could potentially skew any analysis and should be handled appropriately.

2. **Data Types**: The dataset contains a mix of numerical (float64 and int64) and categorical (object) data types. The numerical data are mostly amounts related to the loan and its intended use, while the categorical data include the borrower's name, business type, project state, and originating lender.

3. **Key Features**: The most important features in this dataset are likely to be `LoanNumber`, `CurrentApprovalAmount`, `JobsReported`, 
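
The observations about missing values and data types can be cross-checked directly with pandas rather than taken from the LLM output alone. This is a minimal sketch under stated assumptions: `summarize_columns` is a hypothetical helper that is not part of the `summarizer` package, and the `sample` frame uses made-up values purely to keep the example self-contained.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame):
    """Return one row per column with its dtype, missing count, and missing share."""
    missing = frame.isna().sum()
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(2),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("missing", ascending=False)


# Illustrative stand-in data; in the notebook this would be `data`.
sample = pd.DataFrame({
    "LoanNumber": [101, 102, 103, 104],
    "CurrentApprovalAmount": [1000.0, 2500.0, 400.0, 900.0],
    "RENT_PROCEED": [np.nan, 200.0, np.nan, np.nan],
    "ProjectState": ["TX", None, "CA", "NY"],
})
summary = summarize_columns(sample)
print(summary)
```

Running `summarize_columns(data)` on the loan data would show whether the `*_PROCEED` columns are in fact the ones with the most missing values, as the discovery text states.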

In [7]:
initial_model = summarizer.create_initial_model()
print(initial_model)

Validating response...
```json
{
    "Nodes": [
        {
            "Label": "Loan",
            "Properties": ["LoanNumber", "CurrentApprovalAmount", "JobsReported", "UTILITIES_PROCEED", "MORTGAGE_INTEREST_PROCEED", "RENT_PROCEED", "REFINANCE_EIDL_PROCEED", "HEALTH_CARE_PROCEED", "DEBT_INTEREST_PROCEED"],
            "Reasoning": "The 'Loan' node encapsulates all the key information about each loan. The properties chosen represent the unique identifier of the loan, the amount approved, the number of jobs reported, and the intended use of the loan. These properties are crucial for understanding the characteristics of each loan and can be used to identify patterns that may indicate potential fraud."
        },
        {
            "Label": "Borrower",
            "Properties": ["BorrowerName", "BusinessType"],
            "Reasoning": "The 'Borrower' node represents the entity that receives the loan. The properties chosen provide information about the borrower's identity and the type

In [8]:
summarizer.current_model

{'Nodes': [{'Label': 'Loan',
   'Properties': ['LoanNumber',
    'CurrentApprovalAmount',
    'JobsReported',
    'UTILITIES_PROCEED',
    'MORTGAGE_INTEREST_PROCEED',
    'RENT_PROCEED',
    'REFINANCE_EIDL_PROCEED',
    'HEALTH_CARE_PROCEED',
    'DEBT_INTEREST_PROCEED'],
   'Reasoning': "The 'Loan' node encapsulates all the key information about each loan. The properties chosen represent the unique identifier of the loan, the amount approved, the number of jobs reported, and the intended use of the loan. These properties are crucial for understanding the characteristics of each loan and can be used to identify patterns that may indicate potential fraud."},
  {'Label': 'Borrower',
   'Properties': ['BorrowerName', 'BusinessType'],
   'Reasoning': "The 'Borrower' node represents the entity that receives the loan. The properties chosen provide information about the borrower's identity and the type of business they operate. This information can be used to identify patterns in the types 

In [9]:
summarizer.iterate_model(iterations=1)

Validating response...


'```json\n{\n    "Nodes": [\n        {\n            "Label": "Loan",\n            "Properties": ["LoanNumber", "CurrentApprovalAmount", "JobsReported", "UTILITIES_PROCEED", "MORTGAGE_INTEREST_PROCEED", "RENT_PROCEED", "REFINANCE_EIDL_PROCEED", "HEALTH_CARE_PROCEED", "DEBT_INTEREST_PROCEED"],\n            "Reasoning": "The \'Loan\' node encapsulates all the key information about each loan. The properties chosen represent the unique identifier of the loan, the amount approved, the number of jobs reported, and the intended use of the loan. These properties are crucial for understanding the characteristics of each loan and can be used to identify patterns that may indicate potential fraud."\n        },\n        {\n            "Label": "Borrower",\n            "Properties": ["BorrowerName"],\n            "Reasoning": "The \'Borrower\' node represents the entity that receives the loan. The \'BorrowerName\' property provides information about the borrower\'s identity. This information can be 

In [10]:
summarizer.current_model

{'Nodes': [{'Label': 'Loan',
   'Properties': ['LoanNumber',
    'CurrentApprovalAmount',
    'JobsReported',
    'UTILITIES_PROCEED',
    'MORTGAGE_INTEREST_PROCEED',
    'RENT_PROCEED',
    'REFINANCE_EIDL_PROCEED',
    'HEALTH_CARE_PROCEED',
    'DEBT_INTEREST_PROCEED'],
   'Reasoning': "The 'Loan' node encapsulates all the key information about each loan. The properties chosen represent the unique identifier of the loan, the amount approved, the number of jobs reported, and the intended use of the loan. These properties are crucial for understanding the characteristics of each loan and can be used to identify patterns that may indicate potential fraud."},
  {'Label': 'Borrower',
   'Properties': ['BorrowerName'],
   'Reasoning': "The 'Borrower' node represents the entity that receives the loan. The 'BorrowerName' property provides information about the borrower's identity. This information can be used to identify patterns in the types of borrowers that may be more likely to engage 