# Document types

### 📦 **Sales & Purchase Documents**

1. **Invoice** – A bill issued by a seller to request payment from a buyer.
2. **Receipt** – Proof of payment received for a transaction.
3. **Purchase Order (PO)** – A buyer’s official request to a seller for goods/services.
4. **Sales Order** – Confirmation from the seller accepting a customer’s purchase order.
5. **Delivery Note** – Document listing items delivered, often signed by the recipient.
6. **Credit Note** – Issued to reduce the amount owed by a customer due to returns or errors.
7. **Debit Note** – A buyer’s request to reduce the amount payable to a supplier.

---

### 📘 **Core Accounting Records**

8. **Journal Entry** – A record of a financial transaction in the accounting journal.
9. **General Ledger** – The master record of all company financial transactions.
10. **Trial Balance** – A report that lists all ledger account balances to verify that total debits equal total credits.
11. **Fixed Asset Register** – A record of company-owned assets and their depreciation.

---

### 📊 **Financial Statements**

12. **Balance Sheet** – A financial snapshot showing assets, liabilities, and equity.
13. **Income Statement (P&L)** – Shows revenues, expenses, and profit/loss over a period.
14. **Cash Flow Statement** – Details the inflows and outflows of cash in a business.
15. **Financial Statement Notes** – Explanatory notes that provide context to financial statements.
16. **Audit Report** – An independent review of financial statements for accuracy and compliance.

---

### 💼 **Operational Reports**

17. **Payroll Record** – Document showing employee wages, taxes, and deductions.
18. **Inventory Report** – A list showing quantity and value of items in stock.
19. **Budget Report** – Comparison between actual performance and planned budget.
20. **Expense Report** – Document submitted to record and justify business expenses.

---

### 💳 **Receivables & Payables**

21. **Accounts Receivable Aging Report** – Lists unpaid customer invoices, grouped by how long they have been outstanding.
22. **Accounts Payable Aging Report** – Lists the company’s unpaid supplier invoices, grouped by how long they have been outstanding.

---

### 🏦 **Banking & Tax**

23. **Bank Statement** – A summary from the bank showing transactions and balances.
24. **Tax Return** – Official report filed with tax authorities showing income and taxes owed.

---

# Libraries

In [None]:
import json
from pathlib import Path

import kagglehub
import pandas as pd
import pdfplumber
from charset_normalizer import from_path
from datasets import load_dataset
from huggingface_hub import list_datasets
from IPython.display import JSON, display
from kaggle.api.kaggle_api_extended import KaggleApi


# Q&A service

## Data from Hugging Face

### Functions

In [None]:
def hf_search_by_keyword(keyword):
    # Search the Hugging Face Hub for datasets matching the keyword
    datasets = list_datasets(search=keyword)
    # Collect the dataset IDs in a list
    hf_dataset_list = [ds.id for ds in datasets]
    return hf_dataset_list

In [None]:
# Define a function to safely load a dataset
def safe_load(dataset_id):
    try:
        dataset = load_dataset(dataset_id)
        if not dataset:
            print(f"[!] No dataset found for {dataset_id}")
            return None, None
        return list(dataset.keys()), dataset
    except Exception as e:
        print(f"[!] Failed to load {dataset_id}: {e}")
        return None, None
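
Some search results may fail to load (for example, datasets that require authentication), in which case `safe_load` returns `(None, None)`. A minimal sketch of trying several candidates in order is shown below. The names `first_loadable`, `loader`, and `max_tries` are hypothetical and introduced only for illustration; `loader` is assumed to behave like `safe_load`, and a stand-in loader is used so the sketch runs without network access.

```python
def first_loadable(dataset_ids, loader, max_tries=5):
    """Return (dataset_id, keys, dataset) for the first ID that loads.

    `loader` is assumed to return (keys, dataset) on success and
    (None, None) on failure, like safe_load.
    """
    for dataset_id in dataset_ids[:max_tries]:
        keys, dataset = loader(dataset_id)
        if keys:  # skip failed loads and datasets with no splits
            return dataset_id, keys, dataset
    return None, None, None


# Stand-in loader (hypothetical) so the example runs offline
def fake_loader(dataset_id):
    if dataset_id == "ok/dataset":
        return ["train"], {"train": [{"text": "hello"}]}
    return None, None


result = first_loadable(["bad/one", "ok/dataset"], fake_loader)
```

In the notebook, `first_loadable(dataset_list, safe_load)` could stand in for `safe_load(dataset_list[0])` when the first search result is not usable.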

### Retrieve data

In [None]:
# Search datasets by keyword
dataset_list = hf_search_by_keyword("banking chat")
dataset_list

In [None]:
# Load dataset
keys, dataset = safe_load(dataset_list[0])
keys, dataset

In [None]:
# View data of the first split in dataset format
# dataset[keys[0]]

# Turn into dataframe
df = dataset[keys[0]].to_pandas()
# View dataframe
# df.info()
# df

# Convert to JSON as a Python list of records
data_as_json = json.loads(df.to_json(orient="records", force_ascii=False))

# Pretty display in Jupyter
display(JSON(data_as_json))
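
The DataFrame-to-JSON step can be checked in isolation. The sketch below uses a small in-memory DataFrame with made-up values (not data from any dataset above) to show what `to_json(orient="records", force_ascii=False)` followed by `json.loads` produces.

```python
import json

import pandas as pd

# Small in-memory stand-in for a dataset split (hypothetical values)
df_demo = pd.DataFrame({"id": [1, 2], "text": ["héllo", None]})

# orient="records" yields one JSON object per row;
# force_ascii=False keeps non-ASCII characters unescaped
json_text = df_demo.to_json(orient="records", force_ascii=False)

# Parse back into a plain Python list of dicts
records = json.loads(json_text)
```

Going through `to_json` and `json.loads` turns missing values into `None` and yields plain Python types, which is convenient for `IPython.display.JSON`.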

## Data from Kaggle

### Functions

In [None]:
def kaggle_search_by_keyword(keyword):
    # Search Kaggle by keyword
    # (uses the global, authenticated `api` object created under "Retrieve data")
    datasets = api.dataset_list(search=keyword)
    # Collect the dataset references (owner/dataset-name) in a list
    kaggle_dataset_list = [ds.ref for ds in datasets]
    return kaggle_dataset_list

In [None]:
def detect_encoding(file_path):
    # Pick the most likely encoding; from_path is imported in the Libraries cell
    result = from_path(str(file_path)).best()
    if result is None or result.encoding is None:
        print(f"⚠️ Encoding detection failed for: {file_path}")
        return None
    return result.encoding
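
When detection returns `None`, one possible fallback is to try a short list of common encodings in order. This is a different, simpler technique than `charset_normalizer` detection, not a replacement for it. The helper name `guess_encoding` and the candidate list are assumptions made for this sketch; `latin-1` is placed last because it decodes any byte sequence and would otherwise mask the other candidates.

```python
import tempfile
from pathlib import Path


def guess_encoding(file_path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the file, else None."""
    raw = Path(file_path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# Demo on two temporary files holding the same text in different encodings
with tempfile.TemporaryDirectory() as tmp:
    utf8_file = Path(tmp) / "utf8.csv"
    utf8_file.write_bytes("name\ncafé\n".encode("utf-8"))
    legacy_file = Path(tmp) / "legacy.csv"
    legacy_file.write_bytes("name\ncafé\n".encode("cp1252"))

    utf8_guess = guess_encoding(utf8_file)
    legacy_guess = guess_encoding(legacy_file)
```

A successful decode only shows that the bytes are valid in that encoding, not that the encoding is the right one, so the result is best treated as a guess.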

In [None]:
# Download a Kaggle dataset and return the local paths of all its files
def kaggle_load_file(dataset_id):
    # Download the dataset; kagglehub returns the local folder path
    path = kagglehub.dataset_download(dataset_id)

    # List all files in the dataset
    files = api.dataset_list_files(dataset_id)
    if not files.files:
        print(f"[!] No files found in dataset: {dataset_id}")
        return []

    # Build full paths with pathlib so the separator is correct on every OS
    file_paths = [str(Path(path) / f.name) for f in files.files]
    print("📂 Available files:")
    for f in file_paths:
        print(f" - {f}")

    return file_paths

In [None]:

# Function to read a CSV and return either a DataFrame or JSON
def read_csv_file(file_path, output_format="json"):
    file_path = Path(file_path).expanduser().resolve()

    encoding = detect_encoding(file_path)
    if encoding is None:
        print(f"⚠️ Skipping file due to undetectable encoding: {file_path.name}")
        return None

    try:
        df = pd.read_csv(file_path, encoding=encoding)
    except Exception as e:
        print(f"⚠️ Failed to read CSV: {e}")
        return None

    if output_format == "dataframe":
        return df
    else:
        return json.loads(df.to_json(orient="records", force_ascii=False))
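
`read_csv_file` only handles CSV input, while a downloaded dataset may also contain other file types. A small sketch of filtering the path list first is shown below; `csv_only` is a hypothetical helper and the demo paths are made up.

```python
from pathlib import Path


def csv_only(file_paths):
    """Keep only paths whose extension is .csv (case-insensitive)."""
    return [p for p in file_paths if Path(p).suffix.lower() == ".csv"]


# Hypothetical paths standing in for the output of kaggle_load_file
demo_paths = ["data/ledger.csv", "data/README.md", "data/Chat.CSV"]
csv_paths = csv_only(demo_paths)
```

The filtered list can then be passed to `read_csv_file` one path at a time.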

### Retrieve data

In [None]:
# Prerequisite: on Kaggle, go to Profile > Settings > Account > Create New Token,
# then save the downloaded kaggle.json file to the .kaggle folder on your local computer
api = KaggleApi()
api.authenticate()

In [None]:
# Search Kaggle; the argument is the keyword to search for
kaggle_dataset_list = kaggle_search_by_keyword("banking chat")
kaggle_dataset_list

In [None]:
# List the files of the first four datasets
# kaggle_load_file takes a single argument: the dataset ID
for i in kaggle_dataset_list[0:4]:
    kaggle_load_file(i)

In [None]:
# Load a single dataset (the first search result)
files = kaggle_load_file(kaggle_dataset_list[0])

# Read from all files
# for file in files:
#     output = read_csv_file(file, output_format="json")  # or "dataframe"
#     print(output)

# Read a single file
data_as_json = read_csv_file(files[0], output_format="json")

# Pretty display in Jupyter
display(JSON(data_as_json))

# Report Generation

## Data from Kaggle

In [None]:
# Prerequisite: on Kaggle, go to Profile > Settings > Account > Create New Token,
# then save the downloaded kaggle.json file to the .kaggle folder on your local computer
api = KaggleApi()
api.authenticate()

In [None]:
# Search Kaggle; the argument is the keyword to search for
kaggle_dataset_list = kaggle_search_by_keyword("general ledger")
kaggle_dataset_list

In [None]:
# List the files of every dataset found
# kaggle_load_file takes a single argument: the dataset ID
for i in kaggle_dataset_list:
    kaggle_load_file(i)

In [None]:
# Load a single dataset (here the second search result)
files = kaggle_load_file(kaggle_dataset_list[1])

# Read from all files
# for file in files:
#     output = read_csv_file(file, output_format="json")  # or "dataframe"
#     print(output)

# Read a single file
read_csv_file(files[0], output_format="dataframe")

## Data from Hugging Face
https://huggingface.co/datasets?sort=trending&search=accounting

In [None]:
dataset_list = hf_search_by_keyword('accounting')
dataset_list

In [None]:
# Load dataset
keys, dataset = safe_load(dataset_list[0])
keys, dataset

In [None]:
# View data of the first split in dataset format
# dataset[keys[0]]

# Turn into dataframe
df = dataset[keys[0]].to_pandas()
# View dataframe
# df.info()
# df

# Convert to JSON as a Python list of records
data_as_json = json.loads(df.to_json(orient="records", force_ascii=False))

# Pretty display in Jupyter
display(JSON(data_as_json))

Example for the "bank statement" keyword

In [None]:
dataset_list = hf_search_by_keyword('bank statement')
dataset_list

In [None]:

# Load dataset
keys, dataset = safe_load(dataset_list[0])

# View data of the first split in dataset format
# dataset[keys[0]]

# Turn into dataframe
df = dataset[keys[0]].to_pandas()
# View dataframe
# df.info()
# df

# Convert to JSON as a Python list of records
data_as_json = json.loads(df.to_json(orient="records", force_ascii=False))

# Pretty display in Jupyter
display(JSON(data_as_json))