# SQLAlchemy

This notebook demonstrates how to load documents from a [CrateDB] database,
using the document loader `CrateDBLoader`, which is based on [SQLAlchemy].

The loader runs a database query and returns one document per result row.

[CrateDB]: https://github.com/crate/crate
[SQLAlchemy]: https://www.sqlalchemy.org/

## Prerequisites

First, install the required dependencies by uncommenting and running the
`pip install` command below. Restart the notebook runtime environment
afterwards. If you run into any installation problems, please report them
back to us.

In [40]:
#!pip install -U -r https://github.com/crate/cratedb-examples/raw/main/topic/machine-learning/llm-langchain/requirements.txt

Next, define the database connection and populate the database with example data.

In [3]:
# Connect to a self-managed CrateDB instance.
CONNECTION_STRING = "crate://crate@localhost/?schema=notebook"

In [4]:
import requests
from cratedb_toolkit.util import DatabaseAdapter


def import_mlb_teams_2012():
    """
    Import data into database table `mlb_teams_2012`.

    TODO: Refactor into general purpose package.
    """
    cratedb = DatabaseAdapter(dburi=CONNECTION_STRING)
    url = "https://github.com/crate-workbench/langchain/raw/cratedb/docs/docs/integrations/document_loaders/example_data/mlb_teams_2012.sql"
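    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early if the download did not succeed.
    sql = response.text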
    cratedb.run_sql(sql)
    cratedb.refresh_table("mlb_teams_2012")


import_mlb_teams_2012()

## Usage

In [5]:
import sqlalchemy as sa
from langchain_community.document_loaders import CrateDBLoader
from langchain_community.utilities.sql_database import SQLDatabase
from pprint import pprint

db = SQLDatabase(engine=sa.create_engine(CONNECTION_STRING))

loader = CrateDBLoader(
    'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;',
    db=db,
)
documents = loader.load()

In [6]:
pprint(documents)

[Document(page_content='Team: Angels\nPayroll (millions): 154.49\nWins: 89'),
 Document(page_content='Team: Astros\nPayroll (millions): 60.65\nWins: 55'),
 Document(page_content='Team: Athletics\nPayroll (millions): 55.37\nWins: 94'),
 Document(page_content='Team: Blue Jays\nPayroll (millions): 75.48\nWins: 73'),
 Document(page_content='Team: Braves\nPayroll (millions): 83.31\nWins: 94')]
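Judging from the output above, the default page content renders every column of a row as a `name: value` line. The following is a minimal sketch of that mapping in plain Python, not the loader's actual implementation. The helper `default_page_content` and the sample row are hypothetical and only mirror the first document shown above.

```python
def default_page_content(row: dict) -> str:
    """Render each column as `name: value`, one column per line."""
    return "\n".join(f"{column}: {value}" for column, value in row.items())


# Hypothetical row, using the values of the first document printed above.
row = {"Team": "Angels", "Payroll (millions)": 154.49, "Wins": 89}

page_content = default_page_content(row)
print(page_content)
```

Because every column ends up in the page content, wide tables produce long documents. Selecting only the needed columns in the query is one way to keep them compact.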


## Specifying Which Columns are Content vs Metadata

In [15]:
loader = CrateDBLoader(
    'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;',
    db=db,
    page_content_mapper=lambda row: f"Team: {row['Team']}",
    metadata_mapper=lambda row: {"Payroll (millions)": row["Payroll (millions)"]},
)
documents = loader.load()

In [16]:
pprint(documents)

[Document(page_content='Team: Angels', metadata={'Payroll (millions)': 154.49}),
 Document(page_content='Team: Astros', metadata={'Payroll (millions)': 60.65}),
 Document(page_content='Team: Athletics', metadata={'Payroll (millions)': 55.37}),
 Document(page_content='Team: Blue Jays', metadata={'Payroll (millions)': 75.48}),
 Document(page_content='Team: Braves', metadata={'Payroll (millions)': 83.31})]
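The two mappers are plain callables that each receive one result row. As a rough illustration, the sketch below applies such callables to dictionary rows without touching the database. The helper `map_rows` and the sample rows are hypothetical, and the content mapper here is written so that it reproduces the page content printed above.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def map_rows(
    rows: List[Row],
    page_content_mapper: Callable[[Row], str],
    metadata_mapper: Callable[[Row], Dict[str, Any]],
) -> List[Tuple[str, Dict[str, Any]]]:
    """Apply both mappers to every row and return (content, metadata) pairs."""
    return [(page_content_mapper(row), metadata_mapper(row)) for row in rows]


# Hypothetical rows, using values from the output above.
rows = [
    {"Team": "Angels", "Payroll (millions)": 154.49, "Wins": 89},
    {"Team": "Astros", "Payroll (millions)": 60.65, "Wins": 55},
]

mapped = map_rows(
    rows,
    page_content_mapper=lambda row: f"Team: {row['Team']}",
    metadata_mapper=lambda row: {"Payroll (millions)": row["Payroll (millions)"]},
)

for content, metadata in mapped:
    print(content, metadata)
```

Columns that neither mapper references, such as `Wins` here, do not appear in the resulting documents.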


## Adding Source to Metadata

In [17]:
loader = CrateDBLoader(
    'SELECT * FROM mlb_teams_2012 ORDER BY "Team" LIMIT 5;',
    db=db,
    source_columns=["Team"],
)
documents = loader.load()

In [18]:
pprint(documents)

[Document(page_content='Team: Angels\nPayroll (millions): 154.49\nWins: 89', metadata={'source': 'Angels'}),
 Document(page_content='Team: Astros\nPayroll (millions): 60.65\nWins: 55', metadata={'source': 'Astros'}),
 Document(page_content='Team: Athletics\nPayroll (millions): 55.37\nWins: 94', metadata={'source': 'Athletics'}),
 Document(page_content='Team: Blue Jays\nPayroll (millions): 75.48\nWins: 73', metadata={'source': 'Blue Jays'}),
 Document(page_content='Team: Braves\nPayroll (millions): 83.31\nWins: 94', metadata={'source': 'Braves'})]
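With a single source column, the `source` metadata entry is simply that column's value, as the output above shows. The sketch below illustrates one plausible way to derive it, including the case of several columns. The helper `source_metadata` is hypothetical, and joining multiple values with `-` is an assumption made for illustration, not confirmed behavior of the loader.

```python
from typing import Any, Dict, List


def source_metadata(
    row: Dict[str, Any], source_columns: List[str], separator: str = "-"
) -> Dict[str, str]:
    """Build a `source` entry from the stringified values of the given columns.

    The separator is an assumption for illustration purposes.
    """
    return {"source": separator.join(str(row[column]) for column in source_columns)}


# Hypothetical row, using the values of the first document printed above.
row = {"Team": "Angels", "Payroll (millions)": 154.49, "Wins": 89}

single = source_metadata(row, ["Team"])
combined = source_metadata(row, ["Team", "Wins"])
print(single)
print(combined)
```

Choosing columns that uniquely identify a row makes the `source` value usable as a stable reference back to the database record.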
