feat:support pdf txt
byebyebruce committed Oct 26, 2023
1 parent 7a0f6bd commit d3f343c
Showing 13 changed files with 433 additions and 249 deletions.
27 changes: 17 additions & 10 deletions CHANGELOG.md
@@ -1,28 +1,35 @@
# CHANGELOG

## v0.5.2 - 2023-10-26
### Features
* Support txt/pdf
* Choose doc to chat
### Fix
* Compare local cache

## v0.5.1 - 2023-10-23
### Features
* support chat to html. e.g. `chat2data html http://someurl`
* change command mode
* Support chat to html.
* Change command mode

## v0.4.0 - 2023-07-18
### Features
* add web ui `chat2data --csv=./testdata/csv/routes.csv --web=8088`
* Add web ui

## v0.2.8 - 2023-07-11
### Features
* add csv support
* Add csv support
### Fixed
* all table mode panic
* All table mode panic

## v0.2.0 - 2023-07-10
### Features
* add postgre support
* add docker support
* Add postgre support
* Add docker support
### Fixed
* load .env error
* Load .env error

## v0.1.0 - 2023-07-04
### Features
* add sqlite3 support
* add mysql support
* Add sqlite3 support
* Add mysql support
45 changes: 24 additions & 21 deletions README.md
@@ -7,15 +7,14 @@
<img src="https://readme-typing-svg.demolab.com/?lines=Chat+2+Data&size=50&height=80&center=true&vCenter=true&&duration=1000&pause=5000">
</div>

> 🗣 📊Chat2Data is tool for interacting with databases, supporting MySQL, PostgreSQL, SQLite3, and CSV files, HTML page.
> 🗣 📊 Chat2Data is a tool for interacting with your data, including MySQL, PostgreSQL, SQLite3, CSV, Text, PDF, and HTML pages.
## Feature
* 🗣 Easy Interaction: Chat2Data lets you chat with your databases, making it intuitive to use.
* 🔗 Multiple Databases: It supports MySQL, PostgreSQL, SQLite3, and CSV files.
* 🐳 Docker Support: It provides a Docker image for easy deployment.
* 💻 CLI and Web UI: It offers both a command line and a web interface.
* ⚙️ Simple Installation: It's easy to install with Go command.
* 🤖 AI Integration: It leverages OpenAI API for advanced natural language processing.

🗣 Easy Interaction: Chat2Data allows you to chat with your data, making it intuitive to use.
🔗 Multiple Databases: It supports MySQL, PostgreSQL, SQLite3, CSV, Text, PDF, and HTML pages.
🐳 Docker Support: It provides a Docker image for easy deployment.
💻 CLI and Web UI: It offers both a command line and a web interface.
⚙️ Simple Installation: It's easy to install with a Go command.
🤖 AI Integration: It leverages the OpenAI API for advanced natural language processing.
## Preview
![CLI](doc/cli.jpg)
![Web UI](doc/web-ui.png)
@@ -30,7 +29,7 @@
## Quick Run
* Binary
```bash
OPENAI_API_KEY=xxx ./chat2data db -c ./testdata/world_happiness_2015.db
OPENAI_API_KEY=xxx chat2data db -c testdata/world_happiness_2015.db
```
Ask: `Which is the highest happiness country?`

@@ -41,28 +40,30 @@ docker run --rm -it -e OPENAI_API_KEY=xxx -p 8088:8088 bailu1901/chat2data html
Open `http://localhost:8088` in browser, then ask: `What is the feature of chat2data?`

## Config
* Use local `.env` file `./cp .env.template .env` then edit it.
* Use local `.env` file `cp .env.template .env` then edit it.
* You can also use `export OPENAI_API_KEY=xxx` to specify the environment variables.
* Or run with env `OPENAI_API_KEY=xxx OPENAI_BASE_URL=https://api.openai.com/v1 ./chat2data db root:pwd@tcp(localhost:3306)/mydb`
* Or run with env `OPENAI_API_KEY=xxx OPENAI_BASE_URL=https://api.openai.com/v1 chat2data db root:pwd@tcp(localhost:3306)/mydb`

## Usage
* help `./chat2data --help`
* help `chat2data --help`
global flags
```bash
--web -w web ui port
--cli -c cli mode
```
1. Run CLI(command line interface)
* mysql `./chat2data db -c root:pwd@tcp(localhost:3306)/mydb`
* postgre `./chat2data db -c postgres://db_user:mysecretpassword@localhost:5438/test?sslmode=disable`
* sqlite3 `./chat2data db -c ./sqlite.db`
* csv `./chat2data csv -c ./csvfile.csv` or `./chat2data csv ./csvdir`
* html `./chat2data html -c https://github.com/byebyebruce/chat2data`
* mysql `chat2data db -c root:pwd@tcp(localhost:3306)/mydb`
* postgre `chat2data db -c postgres://db_user:mysecretpassword@localhost:5438/test?sslmode=disable`
* sqlite3 `chat2data db -c sqlite.db`
* csv `chat2data csv -c csvfile.csv` or `chat2data csv csvdir`
* html `chat2data html -c https://github.com/byebyebruce/chat2data`
* text `chat2data txt -c textfile.txt`
* with env `OPENAI_API_KEY=xxx chat2data db -c root:pwd@tcp(localhost:3306)/mydb`
2. Run Web UI
* mysql `./chat2data db root:example@tcp(10.12.21.101:3306)/mydb`
* html `./chat2data html https://github.com/byebyebruce/chat2data`
* sqlite3 `./chat2data db -w=:0.0.0.0:8088 ./mytest.db`
* mysql `chat2data db root:example@tcp(10.12.21.101:3306)/mydb`
* html `chat2data html https://github.com/byebyebruce/chat2data`
* pdf `chat2data pdf testdata/sample.pdf`
* sqlite3 `chat2data db -w=:0.0.0.0:8088 mytest.db`

## Build
`git clone github.com/byebyebruce/chat2data`
@@ -82,7 +83,9 @@ docker build -t chat2data .
- [x] Add Web ui
- [x] Local vector database
- [x] Support load html
- [ ] Doc QA
- [x] Support load pdf
- [x] Doc QA
- [ ] Support word
- [ ] Beautiful CLI

## [Change Log](CHANGELOG.md)
184 changes: 184 additions & 0 deletions cmd/chat2data/doc.go
@@ -0,0 +1,184 @@
package main

import (
	"fmt"
	"os"
	"path"

	"github.com/byebyebruce/chat2data/localvectordb"
	"github.com/byebyebruce/chat2data/qa/ragchain"
	"github.com/byebyebruce/chat2data/ui/cli"
	"github.com/fatih/color"
	"github.com/manifoldco/promptui"
	"github.com/spf13/cobra"
	embedding_openai "github.com/tmc/langchaingo/embeddings/openai"
	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/schema"
	"github.com/tmc/langchaingo/textsplitter"
)

const dbFile = ".chat2data.tmp"

var (
	tmpDBFilePath string
)
var (
	printDocsFlag  bool
	topN           int
	scoreThreshold float64
	chunkSize      int
	chunkOverlap   int
)

func init() {
	dir, err := os.UserHomeDir()
	if err != nil {
		dir = os.TempDir()
	}
	tmpDBFilePath = path.Join(dir, dbFile)
}
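
The `init` function above builds the cache path with `path.Join`, which always joins with forward slashes. A minimal sketch of the same home-directory-then-temp-directory fallback using `path/filepath`, which uses the OS-specific separator, might look like the following. The names `cacheFileName` and `cachePath`, and the convention that an empty `dir` triggers the fallback, are assumptions made for this illustration and are not part of the commit.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cacheFileName mirrors the dbFile constant from the diff.
const cacheFileName = ".chat2data.tmp"

// cachePath joins dir and the cache file name with the OS-specific
// separator. An empty dir falls back to the system temp directory
// (a convention invented for this sketch).
func cachePath(dir string) string {
	if dir == "" {
		dir = os.TempDir()
	}
	return filepath.Join(dir, cacheFileName)
}

func main() {
	dir, err := os.UserHomeDir()
	if err != nil {
		dir = "" // no home directory available: use the fallback
	}
	fmt.Println(cachePath(dir))
}
```

Keeping the fallback inside a small function, rather than in `init`, makes the path logic testable without touching package-level state.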

func docFlag(cmd *cobra.Command) {
	cmd.PersistentFlags().IntVarP(&chunkSize, "chunk-size", "s", 1000, "chunk size")
	cmd.PersistentFlags().IntVarP(&chunkOverlap, "chunk-overlap", "o", 0, "chunk overlap")
	cmd.PersistentFlags().IntVarP(&topN, "topN", "n", 5, "vector search topN")
	cmd.PersistentFlags().BoolVarP(&printDocsFlag, "print-docs", "p", false, "print docs")
	cmd.PersistentFlags().Float64VarP(&scoreThreshold, "score-threshold", "t", 0.7, "score threshold")
}
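
The `topN` and `score-threshold` flags above are passed on to the vector search. How `localvectordb` applies them is not visible in this diff, so the following is only a sketch of one plausible interpretation: drop hits below the threshold, then keep the `n` best. The `scored` type and `filterTopN` function are hypothetical, and whether the real comparison is inclusive (`>=`) is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// scored is a hypothetical search hit: a text chunk and its similarity score.
type scored struct {
	Text  string
	Score float64
}

// filterTopN keeps hits whose score is at least threshold, orders them by
// descending score, and returns at most n of them.
func filterTopN(hits []scored, n int, threshold float64) []scored {
	kept := make([]scored, 0, len(hits))
	for _, h := range hits {
		if h.Score >= threshold {
			kept = append(kept, h)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
	if n >= 0 && len(kept) > n {
		kept = kept[:n]
	}
	return kept
}

func main() {
	hits := []scored{{"a", 0.91}, {"b", 0.42}, {"c", 0.78}, {"d", 0.70}}
	for _, h := range filterTopN(hits, 2, 0.7) {
		fmt.Printf("%s %.2f\n", h.Text, h.Score)
	}
}
```

With the defaults from the flags (`topN` of 5, threshold of 0.7), a query that matches only weakly could return fewer than five chunks, or none.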

func printDocs(docs []schema.Document) {
	for i, doc := range docs {
		fmt.Println(color.RedString("Page #%d", i+1))
		fmt.Println(doc)
		fmt.Println()
		fmt.Println()
	}
}

func splitterText(from string, str string) ([]schema.Document, error) {
	ts := textsplitter.NewRecursiveCharacter()
	ts.ChunkOverlap = chunkOverlap
	ts.ChunkSize = chunkSize
	chunks, err := ts.SplitText(str)
	if err != nil {
		return nil, err
	}
	var docs []schema.Document
	for _, c := range chunks {
		docs = append(docs, schema.Document{
			PageContent: c,
		})
	}
	return docs, nil
}
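
To illustrate what `chunk-size` and `chunk-overlap` mean in `splitterText`, here is a deliberately simplified stand-in: a fixed-width window over runes in which consecutive windows share `overlap` runes. This is not the recursive character splitter from langchaingo, which also tries to break on separators; `chunkRunes` is a hypothetical helper written only for this sketch.

```go
package main

import "fmt"

// chunkRunes splits s into windows of at most size runes, where each
// window starts size-overlap runes after the previous one.
func chunkRunes(s string, size, overlap int) ([]string, error) {
	if size <= 0 {
		return nil, fmt.Errorf("chunk size must be positive, got %d", size)
	}
	if overlap < 0 || overlap >= size {
		return nil, fmt.Errorf("overlap must be in [0, %d), got %d", size, overlap)
	}
	runes := []rune(s)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window reached the end of the text
		}
	}
	return chunks, nil
}

func main() {
	chunks, err := chunkRunes("abcdefghij", 4, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", chunks) // prints ["abcd" "defg" "ghij"]
}
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help retrieval when an answer straddles a chunk boundary, at the cost of embedding more text.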

type RunE func(cmd *cobra.Command, args []string) error

func docWrapper(llm llms.LanguageModel, f func(cmd *cobra.Command, args []string) (string, []schema.Document, error)) RunE {
	return func(cmd *cobra.Command, args []string) error {
		name, docs, err := f(cmd, args)
		if err != nil {
			return err
		}
		if len(docs) == 0 {
			return fmt.Errorf("no docs")
		}
		if printDocsFlag {
			printDocs(docs)
		}

		e, err := embedding_openai.NewOpenAI()
		if err != nil {
			return err
		}

		db, err := localvectordb.New(tmpDBFilePath)
		if err != nil {
			return err
		}
		fmt.Println("load cache db file", tmpDBFilePath)
		defer db.Close()

		ok, err := ragchain.RefreshDoc(db, e, name, docs)
		if err != nil {
			return err
		}
		if !ok {
			fmt.Println("use cached doc")
		}
		qa, err := ragchain.NewDocRAGChain(llm, db, e, name, topN, scoreThreshold)
		if err != nil {
			return err
		}

		return runUI(qa, name)
	}
}
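
In `docWrapper`, `ragchain.RefreshDoc` returns a boolean, and a false value leads to the "use cached doc" message. Its implementation is not shown in this diff. One way such a cache check could work, offered purely as an assumption, is to fingerprint the document chunks and re-index only when the fingerprint changes. The `fingerprint` function and `cache` type below are hypothetical and use an in-memory map rather than the local vector database.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes the chunks in order. Each chunk is prefixed with its
// length so that ["ab", "c"] and ["a", "bc"] produce different digests.
func fingerprint(chunks []string) string {
	h := sha256.New()
	for _, c := range chunks {
		fmt.Fprintf(h, "%d:", len(c))
		h.Write([]byte(c))
	}
	return hex.EncodeToString(h.Sum(nil))
}

// cache maps a document name to the fingerprint of its last indexed content
// (an in-memory stand-in invented for this sketch).
type cache map[string]string

// refresh reports true when the document had to be (re)indexed and false
// when the stored fingerprint still matches.
func (c cache) refresh(name string, chunks []string) bool {
	fp := fingerprint(chunks)
	if c[name] == fp {
		return false
	}
	c[name] = fp
	return true
}

func main() {
	c := cache{}
	fmt.Println(c.refresh("a.txt", []string{"hello", "world"})) // first load
	fmt.Println(c.refresh("a.txt", []string{"hello", "world"})) // unchanged
	fmt.Println(c.refresh("a.txt", []string{"hello", "there"})) // edited
}
```

Skipping unchanged documents avoids repeating embedding requests, which is the likely motivation for a cache here, though the commit itself does not say so.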

func docCMD(llm llms.LanguageModel) *cobra.Command {
	cmd := &cobra.Command{
		Use:   "doc",
		Short: "Choose doc to chat(need add doc to cache first)",
	}
	cmd.RunE = func(cmd *cobra.Command, args []string) error {
		e, err := embedding_openai.NewOpenAI()
		if err != nil {
			return err
		}

		db, err := localvectordb.New(tmpDBFilePath)
		if err != nil {
			return err
		}
		defer db.Close()
		names, err := db.List()
		if err != nil {
			return err
		}
		if len(names) == 0 {
			fmt.Println("no doc in cache")
			return nil
		}
		for {
			sel := promptui.Select{
				Label: "Select doc",
				Items: names,
			}
			_, name, err := sel.Run()
			if err != nil {
				if err == promptui.ErrInterrupt {
					return nil
				}
				return err
			}

			qa, err := ragchain.NewDocRAGChain(llm, db, e, name, topN, scoreThreshold)
			if err != nil {
				return err
			}

			err = cli.CLI(qa, name)
			if err != nil {
				return err
			}
		}
	}
	return cmd
}

func cleanDocCacheCMD() *cobra.Command {
	cmd := &cobra.Command{
		Use:   "clean",
		Short: "clean doc cache",
	}
	cmd.RunE = func(cmd *cobra.Command, args []string) error {
		err := os.RemoveAll(tmpDBFilePath)
		if err != nil {
			fmt.Println("error:", err)
		} else {
			fmt.Println("cache cleaned", tmpDBFilePath)
		}
		return nil
	}
	return cmd
}