This is a three-model implementation: two open-source models and one proprietary. Groq is used as the inference provider, Brave as the search provider, and Preqin as the research provider.
An interactive demo can be found here. The code is not production-level; it is a demo showcasing several tools and approaches. A production version would involve different architectural choices and implementation.
Several weeks ago, the frontier open-source model Llama had a major update. One of the biggest changes is the capability to fetch data online. This code tries to reproduce the logic of that ensemble interaction using Brave Search and PDF parsing.
The code covers two cases.
- Revenue/valuation for a specific company. If the information is unavailable to the LLM, Brave Search is used as a source of up-to-date information.
- 'Talking to a PDF', specifically for the alternative-investment domain.
The current implementation is loosely based on an ensemble architecture consisting of three models working in tandem (a minimal instantiation sketch follows the list):
1. Base General Model (Llama 3.1 70B). This model answers from its own knowledge only.
2. Base Parsing Model (GPT-4o-mini). In this implementation, it specializes in parsing.
3. Meta Model (Llama 3.1 8B). It gauges the output of the Base models and controls the flow.
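A minimal sketch of how the three models might be instantiated with LangChain. The `langchain-groq`/`langchain-openai` packages and the Groq model IDs shown are assumptions based on the providers named above, not confirmed details of the demo code:

```python
# Assumes GROQ_API_KEY and OPENAI_API_KEY are set in the environment.
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI

# Base General Model: answers from its own knowledge only.
base_general = ChatGroq(model="llama-3.1-70b-versatile", temperature=0)

# Base Parsing Model: specializes in parsing search results and PDFs.
base_parsing = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Meta Model: gauges the Base models' output and controls the flow.
meta = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
```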
The code is orchestrated by the Models and Pipelines components of LangChain, utilizing Brave Search (Llama's default option), in the following way (a condensed sketch follows the list):
- The Base General Model attempts to answer the user's question.
- The Meta Model evaluates the quality of the answer, and if it identifies it as unsatisfactory, it
- employs the Base Parsing Model, which in turn searches for the answer on the Internet using Brave Search.
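A condensed sketch of that flow, using the models from the sketch above. The YES/NO gating prompt is a simplification assumed for illustration; `BraveSearch` is LangChain's community tool for the Brave Search API:

```python
import os
from langchain_community.tools import BraveSearch

# base_general, base_parsing, and meta are the models instantiated above.
brave = BraveSearch.from_api_key(
    api_key=os.environ["BRAVE_API_KEY"],
    search_kwargs={"count": 5},  # number of search results to fetch
)

def answer(question: str) -> str:
    # Step 1: the Base General Model answers from its own knowledge.
    draft = base_general.invoke(question).content

    # Step 2: the Meta Model gauges the draft.
    verdict = meta.invoke(
        f"Question: {question}\nAnswer: {draft}\n"
        "Is this answer satisfactory? Reply YES or NO."
    ).content
    if "YES" in verdict.upper():
        return draft

    # Step 3: fall back to Brave Search; the Parsing Model extracts the answer.
    results = brave.run(question)
    return base_parsing.invoke(
        f"Using these search results:\n{results}\n\nAnswer: {question}"
    ).content
```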
- Paid tiers of TechCrunch/PitchBook are much more accurate, and using their APIs would be more appropriate for this task in production. In addition, most LLMs have a knowledge cut-off; in practice, this means virtually any request would benefit from enrichment with up-to-date online information. The two-step flow was introduced mostly for illustrative purposes. Finally, the current implementation depends on the quality of Brave search results, which lags behind engines like Perplexity or established ones like Google Search.
- Ollama with Llama 3.1 (the default backend for development) doesn't support tooling/search out of the box on Metal (only CUDA), hence LangChain tooling was leveraged instead.
- 'Talking to files' should be implemented with embeddings and vector databases for production. The current implementation uses just the Python standard library (a rough sketch appears after this list).
- LangChain's 'chain of chains' approach might be used more effectively, but it would require more effort to facilitate a non-linear flow and would be more constraining for this simple showcase. An agent architecture may be even more suitable but would require more upfront investment, which is not in line with the intention of this demo.
- OpenAI was included mostly for showcase purposes, since many companies prefer not to upload sensitive/valuable data. Llama 70B (the default option for the development environment) and, more recently, Llama 405B are very capable for most use cases. Paired with Groq, they offer balanced results given the triple constraints of accuracy, cost, and speed.
- The code is fully reproducible on a local server with Ollama as the backend. In this case, the getpass invocation should be activated in the code, in addition to selecting local models of choice (see the sketch at the end of this list).
- To make the demo interactive, Replit is used to run the code, manage all dependencies, and hold the API keys. At a production level, containerization solutions like Docker would be more suitable for this and similar cases.
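A rough illustration of the standard-library approach to 'talking to files' mentioned above: chunk the extracted PDF text and score chunks by keyword overlap with the question. The function names are hypothetical; production would replace this with embeddings and a vector store:

```python
import re

def chunk_text(text: str, size: int = 1000) -> list[str]:
    # Fixed-size character chunks; crude, but standard-library only.
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_chunks(question: str, text: str, k: int = 3) -> list[str]:
    # Rank chunks by how many question keywords they share.
    q_words = set(re.findall(r"\w+", question.lower()))
    scored = sorted(
        chunk_text(text),
        key=lambda c: len(q_words & set(re.findall(r"\w+", c.lower()))),
        reverse=True,
    )
    return scored[:k]
```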
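And a sketch of the local Ollama reproduction referenced in the list, assuming the `langchain-ollama` package and locally pulled models (e.g. `ollama pull llama3.1:70b`); getpass collects the remaining API keys interactively instead of reading them from the host:

```python
import getpass
import os
from langchain_ollama import ChatOllama

# Activate the getpass invocation for keys the local setup still needs.
os.environ["BRAVE_API_KEY"] = getpass.getpass("Brave API key: ")

# Local models of choice served by the Ollama backend.
base_general = ChatOllama(model="llama3.1:70b", temperature=0)
meta = ChatOllama(model="llama3.1:8b", temperature=0)
```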