Given the following:
- An interesting PostgreSQL database with numbers that can be plotted.
- A local large language model that can generate SQL queries from plain text questions.
- A small language model that can generate python code.
Can we assemble a program that can, from a plain text English query, generate a plot of data germane to the query?
- PostgreSQL database and administration tool: https://www.postgresql.org/download/
- A large language model that can generate SQL from plain text. I chose Defog.ai's SQLCoder2 and specifically the GGUF formatted one generated by TheBloke: https://huggingface.co/TheBloke/sqlcoder2-GGUF. I am utilizing the prompt format from Defog as well which gives me the most consistent results.
- A small language model that can generate python code (specifically code that can setup matplotlib plots from proposed datasets). For this I chose Microsoft's Phi-2, but again, went for the newer GGUF formatted version here: https://huggingface.co/TheBloke/phi-2-GGUF.
- langchain, llama-cpp-python, and psycopg2
- A C++ compiler all setup and ready to be used from your commandline (needed by llama-cpp-python). You might also need CMake if I remember correctly.
- A computer with a GPU or a superior CPU.
- 32 GB of RAM.
Basically, get everything I mentioned in requirements and clone this repo. Place the models in a subdirectory called models. Create yourself a python environment and activate it:
python -m venv dbvizenv [windows] dbvizenv/scripts/activate [linux] source dbvizenv/bin/activate
I've included two sample databases with some sample data and a small data generator:
- product_sales.sql is the one I would recommend. This is a fake toy store with roughly 30 products and several sales regions. GPT-4 and I faked up some good sales numbers for each of the regions along with some rough times for a couple of regions to enable interesting trend scenarios.
- books_n_authors.sql is another nice one, though probably pretty boring to plot anything from.
As you can see, the design is fairly simple. After setting up the database (not shown in the picture), the user is placed into a chat mode. During that time, and with knowledge of things like "sales" and "products" and "regions" (using my favorite example), the user can ask basic questions. Once issued, the prompt generator will include information about the structure of the target DB into the prompt and issue it to the first LLM. This will generate SQL code which we then immediately execute against the database. This produces results. We take those results and use a second LLM (good at generating python) and have it use the data to build a nice chart in using matplotlib.
(dbvizenv) dbviz> python main.py product_sales.sql sqlcoder2.Q5_K_M.gguf Generating fake sales figures... Done generating sales figures.
In the Northwest region, which products were sold and in what quantity in 2015?
SELECT p.description, SUM(s.quantitysold) AS totalsales FROM sales s JOIN products p ON s.productid = p.productid JOIN regions r ON s.regionid = r.regionid WHERE r.regionname ILIKE '%northwest%' AND s.year::TEXT ILIKE '%2015%' GROUP BY p.description;
import matplotlib.pyplot as plt data = [('Butterfly Net and Kit', 1345450), ('Cuddly Kitten Plush', 1124063), ('Dinosaur Adventure Playset', 323667), ('Doctor Role Play Set', 257225), ('Electronic Keyboard for Kids', 282534), ('Fairy Garden Kit', 344791), ('Firefighter Play Kit', 357457), ('Giant Bubble Maker', 3206946), ('Glow-in-the-Dark Stars', 2339025), ('Interactive Learning Tablet', 187643), ('Jungle Animal Puzzle', 981863), ('Junior Detective Kit', 851040), ('Knights and Dragons Set', 328808), ('Little Chef Cooking Set', 456430), ('Magic Trick Set', 404113), ('Magical Princess Castle', 563102), ('Mermaid Doll', 819576), ('Miniature Train Set', 186533), ('Mystical Unicorn Figurine', 448914), ('Ocean Explorer Submarine', 429554), ('Pirate Ship Adventure', 398840), ('Rainbow Building Blocks', 598887), ('Remote Control Race Car', 231165), ('Robot Builder Kit', 300455), ('Solar System Mobile Kit', 554076), ('Space Ranger Action Figure', 756401), ('Superhero Cape and Mask', 1112981), ('Teddy Bear Plush', 758804), ('Treasure Hunt Board Game', 789839), ('Wild West Cowboys', 573475)]
x, y = zip(*data)
plt.bar(x,y)
plt.title('Toy Sales Data') plt.xlabel('Toys') plt.ylabel('Sales (in thousands of dollars)') plt.show()
q or quit will quit the chat. logging is enabled, but I have the sensitivity set high. If you want to watch the inner workings, you can set it to DEBUG.