# **SNOWFLAKE CORTEX COMPLETE FINANCIAL SERVICES DEMO**

## Authors: John Heisler, Garrett Frere

In this demo, using Snowflake Cortex (https://www.snowflake.com/en/data-cloud/cortex/), we will build an AI-infused Data Pipeline with Cortex Complete.

NOTE: This notebook has been modified from the original by Carlos Carrero so that the demo can be repeated by running the whole notebook. It also uses Snowflake's native capabilities to process PDF files instead of PyPDF2.

### AI Pipeline Overview

We'll learn how to extract raw text from a PDF, perform prompt engineering, and pass custom prompts and data to a large language model of our choosing, all without leaving Snowflake.

Specifically, we will be taking on the role of an AI Engineer who is working closely with a portfolio team at an asset manager. The portfolio team would like to speed up their ingestion and comprehension of statements by the Federal Open Market Committee (FOMC), which determines the direction of monetary policy by directing open market operations. Ultimately, they would like a signal as to whether interest rates will increase, remain the same, or decrease (hawkish, neutral, or dovish, respectively).

I refer to this as an AI pipeline because we apply AI to this type of signal generation much further up the data delivery value chain. In this way, we maximize the value of our work by building the signal into a common dataset. End users will not need to invoke any additional logic; good design is invisible!

### Next Steps

 * To industrialize this demo with continuous ingestion and scoring, please check out the `FSI_Cortex_AI_Pipeline_Industrialization.ipynb` notebook in this repository
 * Check out the companion demo in this repository: `FSI_Cortex_Search.ipynb`

# 🛑 BEFORE YOU START 🛑

**Be sure to do the following FIRST to create dependent database objects for the following steps**:
1. Run the `1_SQL_SETUP_FOMC.sql` script

------

### AI Pipeline: Step 1 - Copy PDF Files

We need to extract text from the PDFs. We will do that with a new Python function.

> Note that we're building this function directly in SQL.

The steps below require the `langchain`, `pypdf2` and `pandas` packages. To import packages from Anaconda, install them first using the package selector at the top of the page.

In [None]:

USE DATABASE GEN_AI_FSI;
USE SCHEMA FOMC;

Create the stages that store the PDF files and the stream that detects new files:

In [None]:
--create stage fed_logic;
CREATE OR REPLACE STAGE gen_ai_fsi.fomc.fed_logic
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
    DIRECTORY = (ENABLE = TRUE);

--create stage fed_pdf;
CREATE OR REPLACE STAGE gen_ai_fsi.fomc.fed_pdf
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
    DIRECTORY = (ENABLE = TRUE);

-- create a stream on the directory
CREATE OR REPLACE STREAM gen_ai_fsi.fomc.fomc_stream on DIRECTORY(@gen_ai_fsi.fomc.fed_pdf);

In [None]:
COPY FILES
    INTO @gen_ai_fsi.fomc.fed_pdf
    FROM @gen_ai_fsi.fomc.git_repo/branches/main/FOMC_DOCS/;

In [None]:
alter stage gen_ai_fsi.fomc.fed_pdf refresh;

### AI Pipeline: Step 2 - Create and Register `generate_prompt` Function

As we load data into our system, we want to automatically generate a signal. To do so, we need to call an LLM and pass it our prompt. 

Below, we define our specialized prompt engineering as a Python function and then register the function for later reuse when loading data.

In [None]:
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.types import StringType

session = get_active_session() 

def generate_prompt(document_text):
    prompt = f"""
        <Role> You are an experienced Senior Economist deeply knowledgeable on Federal Reserve guidance including FOMC or Federal Open Market Committee meeting minutes and communications.
        You are an expert in interpreting Hawkish and Dovish signals from the Fed or Federal Reserve. Such signals are derived from guidance conveyed in FOMC meeting notes and communications.
        
        As an analyst, you excel at discerning macroeconomic trends for each FOMC meeting notes and communications published by the Federal Reserve.
        The  signal or trends are either Hawkish or Dovish based on the growth outlook and inflation outlook of the Fed. The Federal Reserve has a long 
        term objective of keeping inflation around 2%, and low unemployment. Hawkish sentiment could imply 
        the Federal Reserve intends to raise interest rates to increase the cost of borrowing and slow economic activity. 
        The Fed typically increases interest rates when inflation is high or rising, or when the unemployment 
        rate is low or falling. Conversely, dovish sentiment could imply the Federal Reserve intends to lower interest 
        rates to allow easier access to borrowing and lower the cost of money to stimulate economic activity. The Fed
        typically decreases interest rates when inflation is low or falling, or when the unemployment rate is high or rising.
        
        Signal categories known as Economic Policy Stances:
        Hawkish stance or attitude for economic policy
        -characterized by a focus on combating inflation and often involves advocating for higher interest rates and tolerant to higher levels of unemployment.
        -concerned about rising inflation. Hawkish stance believes higher interest rates can help keep inflation in check, even if it slows down economic growth or increases unemployment.
        
        Dovish stance or attitude for economic policy
        -characterized by a focus on prioritizing stimulating economic growth, reducing unemployment, and tolerant to higher levels of inflation.
        -concerned with boosting economic activity, reducing unemployment and, for this reason, lower interest rates are preferred to create economic growth and employment.
        
        Neutral stance or attitude for economic policy
        -characterized by a focus on balance between combating inflation and supporting economic growth, with no strong inclination toward either side.
        -concerned with maintaining a steady economic environment without significant deviations. They seek to neither overly stimulate the economy nor excessively tighten it.
        </Role>
        
        <Data> 
        You are provided the text of a Federal Reserve Guidance or FOMC meeting notes as context. These generally are released before the Federal Reserve takes action on economic policy. 
        </Data>

        <FOMC_meeting_notes>
        {document_text}
        </FOMC_meeting_notes>
        
        <Task>: Follow these instructions,
        1) Review the provided FOMC communication or meeting notes text. Then,
        2) Consider the FOMC members or Committee Members tone and sentiment around economic conditions. Then,
        3) Consider specific guidance and stated conditions that validate the tone and signal FOMC members make concerning current macro economic conditions. Then,
        4) Based on this sentiment classify if the FOMC communication text indicates Hawkish, Dovish, or Neutral outlook for the economy. Be critical and do not categorize sentiment as "Neutral" unless necessary. This will be output as [Signal].
        5) Summarize a concise and accurate rationale for classifying the sentiment Hawkish, Neutral, or Dovish sentiment. This will be output as [Signal_Summary].
        </Task>
        
        <Output> 
        produce valid JSON. Absolutely do not include any additional text before or following the JSON. Output should use following JSON_format
        </Output>
        
        <JSON_format>
        {{
            "Signal": (A trend sentiment classification of Hawkish, Neutral or Dovish),
            "Signal_Summary": (A concise summary of sentiment trend)
        }}
        </JSON_format>"""
    return prompt

session.add_packages("snowflake-snowpark-python", "snowflake-ml-python", "snowflake")

session.udf.register(
  func = generate_prompt,
  return_type = StringType(),
  input_types = [ StringType()],
  is_permanent = True,
  name = 'generate_prompt',
  replace = True,
  stage_location = '@gen_ai_fsi.fomc.fed_logic')
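
One detail of the template above is easy to miss: because the prompt is an f-string, the literal braces of the JSON example are written as `{{` and `}}`, while `{document_text}` is substituted with the document. The sketch below shows the same mechanics on a much shorter template so it can be checked locally without a Snowflake session. `build_mini_prompt` is a hypothetical stand-in, not part of the notebook.

```python
def build_mini_prompt(document_text: str) -> str:
    """Hypothetical, shortened stand-in for generate_prompt."""
    # Single braces are substituted; doubled braces render as literal braces.
    return f"""
    <FOMC_meeting_notes>
    {document_text}
    </FOMC_meeting_notes>
    <JSON_format>
    {{
        "Signal": (Hawkish, Neutral or Dovish)
    }}
    </JSON_format>"""


mini_prompt = build_mini_prompt("The Committee decided to raise the target range.")
print(mini_prompt)
```

Printing the rendered prompt before registering the UDF is a cheap way to confirm that the document text lands inside the intended tags and that the JSON example shows single braces.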

### AI Pipeline: Step 3 - Ingest Text and Determine Signal

Now we're using the functions that we've just created in a simple insert statement. Encapsulating complexity for later reuse in SQL pipelines multiplies the value of our work: the logic is written once and can be called from many pipelines.

### 🤯 🧠 CHECK IT OUT! 🧠 🤯 
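* We're calling the native PDF text extraction function, `SNOWFLAKE.CORTEX.PARSE_DOCUMENT`! (line 10 of the insert statement)
* We're calling our prompt function, `generate_prompt`! (line 24 of the insert statement)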

In [None]:
select * from gen_ai_fsi.fomc.fomc_stream;

In [None]:
CREATE OR REPLACE TABLE gen_ai_fsi.fomc.pdf_full_text (
    id            NUMBER(19, 0),
    relative_path VARCHAR(16777216),
    size          NUMBER(38, 0),
    last_modified TIMESTAMP_TZ(3),
    md5           VARCHAR(16777216),
    etag          VARCHAR(16777216),
    file_url      VARCHAR(16777216),
    file_text     VARCHAR(16777216),
    file_date     DATE,
    sentiment     VARCHAR(16777216)
);

In [None]:
INSERT INTO gen_ai_fsi.fomc.pdf_full_text (id, relative_path, size, last_modified, md5, etag, file_url, file_text, file_date, sentiment)
WITH cte AS (SELECT gen_ai_fsi.fomc.fed_pdf_full_text_sequence.nextval AS id,
                    relative_path                                      AS relative_path,
                    size                                               AS size,
                    last_modified                                      AS last_modified,
                    md5                                                AS md5,
                    etag                                               AS etag,
                    file_url                                           AS file_url,
                    REPLACE(TO_VARCHAR (
                        SNOWFLAKE.CORTEX.PARSE_DOCUMENT ('@gen_ai_fsi.fomc.fed_pdf', relative_path)), '''', '')  AS file_text,
                    TRY_TO_DATE(REGEXP_SUBSTR(relative_path, '\\d{8}'), 'YYYYMMDD') AS file_date
             FROM directory(@gen_ai_fsi.fomc.fed_pdf)
             WHERE relative_path LIKE '%.pdf'
)
SELECT id,
       relative_path,
       size,
       last_modified,
       md5,
       etag,
       file_url,
       file_text,
       file_date,
       snowflake.cortex.try_complete('mistral-large2', gen_ai_fsi.fomc.generate_prompt(file_text)) AS signal_mis
FROM cte;
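
The `file_date` expression in the statement above takes the first run of eight digits in `relative_path` and tries to read it as `YYYYMMDD`, yielding NULL when there is no such run or the digits are not a valid date. As a rough local illustration of the same logic, here is a minimal Python sketch. `extract_file_date` and the sample file name are hypothetical and not part of the notebook.

```python
import re
from datetime import date, datetime
from typing import Optional


def extract_file_date(relative_path: str) -> Optional[date]:
    """Return the date encoded by the first 8-digit run in the path, or None."""
    match = re.search(r"\d{8}", relative_path)
    if match is None:
        # No 8-digit run: the SQL expression would produce NULL here.
        return None
    try:
        return datetime.strptime(match.group(0), "%Y%m%d").date()
    except ValueError:
        # Like TRY_TO_DATE, digits that do not form a valid date yield None.
        return None


# Hypothetical file name used only for illustration.
print(extract_file_date("monetary20230726a1.pdf"))
```

Rows whose file name carries no parsable date still load, so a quick check for NULL `file_date` values after the insert can reveal naming surprises.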

In [None]:
select * from gen_ai_fsi.fomc.pdf_full_text;
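
The `sentiment` column stores the model's raw reply as text. The prompt asks for bare JSON, but `try_complete` returns NULL when a call fails, and models sometimes wrap their answer in extra text. A defensive parser along the following lines may help when consuming the column downstream, for example from a Python client. This is only a sketch: `parse_signal` and `VALID_SIGNALS` are hypothetical names, not part of the notebook.

```python
import json
from typing import Optional

VALID_SIGNALS = {"Hawkish", "Neutral", "Dovish"}


def parse_signal(raw_reply: Optional[str]) -> Optional[dict]:
    """Extract Signal and Signal_Summary from a raw model reply, or return None."""
    if raw_reply is None:
        # A NULL sentiment value arrives in Python as None.
        return None
    # Keep only the span from the first "{" to the last "}" to drop wrapper text.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        payload = json.loads(raw_reply[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    # Normalize casing so "dovish" and "DOVISH" both map to "Dovish".
    signal = str(payload.get("Signal", "")).strip().capitalize()
    if signal not in VALID_SIGNALS:
        return None
    return {"Signal": signal, "Signal_Summary": payload.get("Signal_Summary")}


reply = '{"Signal": "Hawkish", "Signal_Summary": "Rates are expected to rise."}'
print(parse_signal(reply))
```

Returning `None` for anything unexpected keeps malformed replies out of downstream aggregates; those rows can then be inspected or re-scored separately.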