<div style="display: flex; align-items: left;">
    <a href="https://sites.google.com/corp/google.com/genai-solutions/home?authuser=0">
        <img src="https://storage.googleapis.com/miscfilespublic/Linkedin%20Banner%20%E2%80%93%202.png" style="margin-right">
    </a>
</div>

In [4]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# **Open Data QnA**

---

This notebook assumes that you have already set up the vector store and that the variables are assigned in the `config.ini` file.


The notebook covers the following steps: 

> 1. Take the user's question and generate SQL in the dialect of the data source

> 2. Execute the SQL query and retrieve the data

> 3. Generate a natural language response and charts to display

> 4. Clean up resources



## 🚧 **0. Pre-requisites**

Make sure that you have completed the initial setup process using [1_SetUpVectorStore.ipynb](1_SetUpVectorStore.ipynb). If the 1_SetUpVectorStore notebook has run successfully, the following are set up:
* GCP project and all the required IAM permissions

* Environment to run the solution

* Data source and Vector store for the solution


## ⚙️ **1. Retrieve Configuration Parameters**
The notebook loads all configuration parameters from the `config.ini` file in the root directory. 
Most of these parameters were set in the initial notebook `1_SetUpVectorStore.ipynb` and saved to the `config.ini` file.
Use the cells below to retrieve these values and to specify any additional ones required for this notebook. 

In [5]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
sys.path.append(module_path)

import configparser
config = configparser.ConfigParser()
config.read(module_path+'/config.ini')

PROJECT_ID = config['GCP']['PROJECT_ID']
DATA_SOURCE = config['CONFIG']['DATA_SOURCE']
VECTOR_STORE = config['CONFIG']['VECTOR_STORE']

BQ_OPENDATAQNA_DATASET_NAME = config['BIGQUERY']['BQ_OPENDATAQNA_DATASET_NAME']
BQ_LOG_TABLE_NAME = config['BIGQUERY']['BQ_LOG_TABLE_NAME'] 
BQ_DATASET_REGION = config['BIGQUERY']['BQ_DATASET_REGION']
BQ_DATASET_NAME = config['BIGQUERY']['BQ_DATASET_NAME']
BQ_TABLE_LIST = config['BIGQUERY']['BQ_TABLE_LIST']

# The PostgreSQL settings are not used here, but some of the API calls below depend on them being set.
PG_SCHEMA = config['PGCLOUDSQL']['PG_SCHEMA']
PG_DATABASE = config['PGCLOUDSQL']['PG_DATABASE']
PG_USER = config['PGCLOUDSQL']['PG_USER']
PG_REGION = config['PGCLOUDSQL']['PG_REGION'] 
PG_INSTANCE = config['PGCLOUDSQL']['PG_INSTANCE'] 
PG_PASSWORD = config['PGCLOUDSQL']['PG_PASSWORD']
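`ConfigParser.read` silently skips files it cannot open, so a wrong path only surfaces later as a `KeyError` on the first lookup. The sketch below shows one way to check for missing entries up front. The helper name `missing_config_keys` and the `REQUIRED_KEYS` mapping are hypothetical and cover only a few of the keys read above; the sample values are placeholders.

```python
import configparser

# Hypothetical subset of the sections and keys this notebook reads from config.ini
REQUIRED_KEYS = {
    "GCP": ["PROJECT_ID"],
    "CONFIG": ["DATA_SOURCE", "VECTOR_STORE"],
}


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required sections or 'SECTION.KEY' entries absent from config."""
    missing = []
    for section, keys in required.items():
        if not config.has_section(section):
            missing.append(section)
            continue
        for key in keys:
            if not config.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing


# Placeholder configuration with VECTOR_STORE deliberately left out
sample = configparser.ConfigParser()
sample.read_string("""
[GCP]
PROJECT_ID = my-project

[CONFIG]
DATA_SOURCE = bigquery
""")

print(missing_config_keys(sample))
```

In the notebook, the same helper could be called on `config` right after `config.read(...)` to fail early with a readable message.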

## 🔐 **2. Authenticate and Connect to Google Cloud Project**
Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.

You can do this within Google Colab or using the Application Default Credentials in the Google Cloud CLI.

In [6]:
"""Colab Auth""" 
# from google.colab import auth
# auth.authenticate_user()


"""Google CLI Auth"""
# !gcloud auth application-default login


import google.auth
credentials, project_id = google.auth.default()

# Configure gcloud.
!gcloud config set project {PROJECT_ID}
print(f'Project has been set to {PROJECT_ID}')
!gcloud auth application-default set-quota-project {PROJECT_ID}

import os
os.environ['GOOGLE_CLOUD_QUOTA_PROJECT']=PROJECT_ID
os.environ['GOOGLE_CLOUD_PROJECT']=PROJECT_ID

Updated property [core/project].
Project has been set to uk-bh-experiments-argolis

Credentials saved to file: [/home/brendanhills/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "uk-bh-experiments-argolis" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.


## ▶️ **3. Run the Open Data QnA Pipeline**

### 🔗 **3A. Connect to Datasource, Vector Source and Vertex AI**


In [7]:
# Fetch the USER_DATABASE based on data source
from dbconnectors import pgconnector, bqconnector
if DATA_SOURCE=='bigquery':
    USER_DATABASE=BQ_DATASET_NAME 
    src_connector = bqconnector
else: 
    USER_DATABASE=PG_SCHEMA
    src_connector = pgconnector

print("Source selected is : "+ str(DATA_SOURCE) + "\nSchema or Dataset Name is : "+ str(USER_DATABASE))
print("Vector Store selected is : "+ str(VECTOR_STORE))


# Set the vector store parameters
if VECTOR_STORE=='bigquery-vector':
    region=BQ_DATASET_REGION
    vector_connector = bqconnector
    call_await = False

else:
    region=PG_REGION
    vector_connector = pgconnector
    call_await=True


num_table_matches = 5
num_column_matches = 10
similarity_threshold = 0.3
num_sql_matches=3


RUN_DEBUGGER = True 
EXECUTE_FINAL_SQL = True 

from google.api_core.exceptions import NotFound
from google.cloud import aiplatform
import vertexai

from agents import EmbedderAgent, BuildSQLAgent, DebugSQLAgent, ValidateSQLAgent, ResponseAgent, VisualizeAgent


embedder = EmbedderAgent('vertex') 

SQLBuilder = BuildSQLAgent('gemini-1.0-pro')
SQLChecker = ValidateSQLAgent('gemini-1.0-pro')
SQLDebugger = DebugSQLAgent('gemini-1.0-pro')
Responder = ResponseAgent('gemini-1.0-pro')
Visualize = VisualizeAgent()

found_in_vector = 'N'
final_sql='Not Generated Yet'

vertexai.init(project=PROJECT_ID, location=region)
aiplatform.init(project=PROJECT_ID, location=region)

current dir:  /home/brendanhills/dev/applied-ai-engineering-samples/notebooks
root_dir set to: /home/brendanhills/dev/applied-ai-engineering-samples
Source selected is : bigquery
Schema or Dataset Name is : breedr
Vector Store selected is : bigquery-vector
Creating agent with model_id: gemini-1.0-pro
Creating agent with model_id: gemini-1.0-pro
Creating agent with model_id: gemini-1.0-pro
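The values `num_table_matches`, `num_column_matches`, `num_sql_matches` and `similarity_threshold` set above control how many candidates the vector search returns and how close they must be to the question. The sketch below illustrates that idea in plain Python, assuming cosine similarity over small in-memory vectors. It is not the connectors' implementation: `getSimilarMatches` runs against the configured vector store and may use a different distance metric. The names `cosine_similarity`, `top_matches` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query, candidates, num_matches, threshold):
    """Keep candidates scoring at least `threshold`, best first, capped at `num_matches`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:num_matches]


# Made-up two-dimensional "embeddings" for three table descriptions
candidates = {
    "animals": [1.0, 0.0],
    "farms": [0.6, 0.8],
    "invoices": [0.0, 1.0],
}

matches = top_matches([1.0, 0.0], candidates, num_matches=5, threshold=0.3)
print([name for name, _ in matches])
```

Raising the threshold narrows the context passed to the SQL builder, while lowering it admits more loosely related tables and columns.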


###  ❓ **3B. Ask your Natural Language Question**

In [8]:
print("\033[1mData Source:- "+ DATA_SOURCE)
print("Vector Store:- "+ VECTOR_STORE)
print("Schema:- "+USER_DATABASE)
    
# Suggested question for 'fda_food' dataset: "What are the top 5 cities with highest recalls?"
#  Suggested question for 'google_dei' dataset: "How many asian men were part of the leadership workforce in 2021?"

prompt_for_question = "Please enter your question for source :" + DATA_SOURCE + " and database : " + USER_DATABASE
user_question = input(prompt_for_question)  # Prompts for the question interactively
# user_question = ''  # Or comment out the line above and enter the question here

print(f"Asked database {USER_DATABASE} the {user_question}")

Data Source:- bigquery
Vector Store:- bigquery-vector
Schema:- breedr
Asked database breedr the WHich animal was the heaviest?


In [9]:
# Fetch the embedding of the user's input question 
embedded_question = embedder.create(user_question)

# Reset AUDIT_TEXT
AUDIT_TEXT = ''

AUDIT_TEXT = AUDIT_TEXT + "\nUser Question : " + str(user_question) + "\nUser Database : " + str(USER_DATABASE)
process_step = "\n\nGet Exact Match: "
# Look for exact matches in known questions 
exact_sql_history = vector_connector.getExactMatches(user_question) 

if exact_sql_history is not None:
    found_in_vector = 'Y' 
    final_sql = exact_sql_history
    invalid_response = False
    AUDIT_TEXT = AUDIT_TEXT + "\nExact match has been found! Going to retrieve the SQL query from cache and serve!"


else:
    # No exact match found. Proceed looking for similar entries in db 
    AUDIT_TEXT = AUDIT_TEXT +  process_step + "\nNo exact match found in query cache, retrieving relevant schema and known good queries for few shot examples using similarity search...."
    process_step = "\n\nGet Similar Match: "
    if call_await:
        similar_sql = await vector_connector.getSimilarMatches('example', USER_DATABASE, embedded_question, num_sql_matches, similarity_threshold)
    else:
        similar_sql = vector_connector.getSimilarMatches('example', USER_DATABASE, embedded_question, num_sql_matches, similarity_threshold)

    process_step = "\n\nGet Table and Column Schema: "
    # Retrieve matching tables and columns
    if call_await: 
        table_matches =  await vector_connector.getSimilarMatches('table', USER_DATABASE, embedded_question, num_table_matches, similarity_threshold)
        column_matches =  await vector_connector.getSimilarMatches('column', USER_DATABASE, embedded_question, num_column_matches, similarity_threshold)
    else:
        table_matches =  vector_connector.getSimilarMatches('table', USER_DATABASE, embedded_question, num_table_matches, similarity_threshold)
        column_matches =  vector_connector.getSimilarMatches('column', USER_DATABASE, embedded_question, num_column_matches, similarity_threshold)

    AUDIT_TEXT = AUDIT_TEXT +  process_step + "\nRetrieved Similar Known Good Queries, Table Schema and Column Schema: \n" + '\nRetrieved Tables: \n' + str(table_matches) + '\n\nRetrieved Columns: \n' + str(column_matches) + '\n\nRetrieved Known Good Queries: \n' + str(similar_sql)
    # If similar table and column schemas found: 
    if len(table_matches.replace('Schema(values):','').replace(' ','')) > 0 or len(column_matches.replace('Column name(type):','').replace(' ','')) > 0 :

        # GENERATE SQL
        process_step = "\n\nBuild SQL: "
        generated_sql = SQLBuilder.build_sql(DATA_SOURCE,user_question,table_matches,column_matches,similar_sql)
        final_sql=generated_sql
        AUDIT_TEXT = AUDIT_TEXT + process_step +  "\nGenerated SQL : " + str(generated_sql)
        
        if 'unrelated_answer' in generated_sql :
            invalid_response=True

        # If agent assessment is valid, proceed with checks  
        else:
            invalid_response=False

            if RUN_DEBUGGER: 
                generated_sql, invalid_response, AUDIT_TEXT = SQLDebugger.start_debugger(DATA_SOURCE, generated_sql, user_question, SQLChecker, table_matches, column_matches, AUDIT_TEXT, similar_sql) 
                # AUDIT_TEXT = AUDIT_TEXT + '\n Feedback from Debugger: \n' + feedback_text

            final_sql=generated_sql
            AUDIT_TEXT = AUDIT_TEXT + "\nFinal SQL after Debugger: \n" +str(final_sql)


    # No matching table found 
    else:
        invalid_response=True
        print('No tables found in Vector ...')
        AUDIT_TEXT = AUDIT_TEXT + "\nNo tables have been found in the Vector DB. The question cannot be answered with the provided data source!"

print(f'\n\n AUDIT_TEXT: \n {AUDIT_TEXT}')

No exact match found for the user prompt
Did not find any results for example. Adjust the query parameters.
Found 1 similarity matches for table.
Found 10 similarity matches for column.
```sql
SELECT
    `uk-bh-experiments-argolis.breedr.animals`.animal_subtype,
    `uk-bh-experiments-argolis.breedr.animals`.kill_fat_score,
    `uk-bh-experiments-argolis.breedr.animals`.kill_fat_score_int,
    `uk-bh-experiments-argolis.breedr.animals`.kill_quality_int,
    `uk-bh-experiments-argolis.breedr.animals`.kill_weight,
    `uk-bh-experiments-argolis.breedr.animals`.name,
    `uk-bh-experiments-argolis.breedr.animals`.pedigree
  FROM
    `uk-bh-experiments-argolis.breedr.animals` AS `uk-bh-experiments-argolis.breedr.animals`
  WHERE `uk-bh-experiments-argolis.breedr.animals`.kill_weight IS NOT NULL
   AND `uk-bh-experiments-argolis.breedr.animals`.kill_fat_score IS NOT NULL
   AND `uk-bh-experiments-argolis.breedr.animals`.kill_fat_score_int IS NOT NULL
   AND `uk-bh-experiments-argolis.breedr

In [10]:
if not invalid_response:
    try: 
        if EXECUTE_FINAL_SQL is True:
                final_exec_result_df=src_connector.retrieve_df(final_sql.replace("```sql","").replace("```","").replace("EXPLAIN ANALYZE ",""))
                print('\nQuestion: ' + user_question + '\n')
                # print('\n Final SQL Execution Result: \n')
                # print(final_exec_result_df)
                response = final_exec_result_df
                _resp=Responder.run(user_question, response)
                AUDIT_TEXT = AUDIT_TEXT + "\nModel says " + str(_resp) 


        else:  # Do not execute final SQL
                print("Not executing final SQL since EXECUTE_FINAL_SQL variable is False\n ")
                response = "Please enable the Execution of the final SQL so I can provide an answer"
                _resp=Responder.run(user_question, response)
                AUDIT_TEXT = AUDIT_TEXT + "\nModel says " + str(_resp) 

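    except ValueError as e:
        # Report the failure instead of silently swallowing it, and keep _resp defined for the final print below
        print(f"Could not execute the final SQL: {e}")
        _resp = "I am sorry, the final SQL could not be executed."
    # except Exception as e: 
    #     print(f"An error occurred. Aborting... Error Message: {e}")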
        
else:  # Do not execute final SQL
    print("Not executing final SQL as it is invalid, please debug!")
    response = "I am sorry, I could not come up with a valid SQL."
    _resp=Responder.run(user_question, response)
    AUDIT_TEXT = AUDIT_TEXT + "\nModel says " + str(_resp)

print("Final Answer:" + str(_resp))
bqconnector.make_audit_entry(DATA_SOURCE, USER_DATABASE, "gemini-1.0-pro", user_question, final_sql, found_in_vector, "", process_step, "", AUDIT_TEXT)  


Question: WHich animal was the heaviest?

Final Answer:## I'm sorry, but I can't answer your question.

According to the data provided, there are no animals in the database. Therefore, I cannot tell you which animal was the heaviest.

Is there any other information you can provide that might help me answer your question? For example, do you know what time period the data covers? Or, are there any other animals in the database that you are interested in?



'Log Row added'
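Before execution, the cell above strips the markdown fence and any `EXPLAIN ANALYZE ` prefix from the generated SQL with chained `replace` calls. The sketch below wraps the same clean-up in a small function. The name `clean_sql` is hypothetical, and the trailing `strip()` is an addition that the original cell does not perform.

```python
import re


def clean_sql(generated_sql):
    """Remove markdown code fences and an EXPLAIN ANALYZE prefix from model output."""
    # `{3} matches a run of three backticks, optionally followed by the "sql" tag
    without_fences = re.sub(r"`{3}(?:sql)?", "", generated_sql)
    return without_fences.replace("EXPLAIN ANALYZE ", "").strip()


# Build a fenced sample without typing the fence literally
fence = "`" * 3
raw = f"{fence}sql\nEXPLAIN ANALYZE SELECT 1\n{fence}"

print(clean_sql(raw))
```

Because the fence pattern is removed wherever it occurs, the function also handles output that was cut off before the closing fence.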

### Create Charts for the results (Run only when you have proper results in the above cells)
The agent suggests two Google Charts to display in a UI, rendered into the elements with IDs `chart_div` and `chart_div_1`.

In [11]:
# Ask the agent for chart suggestions based on the question, the final SQL and the query result
chart_js = Visualize.generate_charts(user_question, final_sql, response)
# print(chart_js["chart_div_1"])

Charts Suggested : ['Bar Chart', 'Table Chart']


In [12]:
from IPython.display import HTML

html_code = f'''
<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
<script type="text/javascript">
{chart_js["chart_div"]}
</script>
<div id="chart_div"></div>
'''

HTML(html_code)


In [13]:
html_code = f'''
<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
<script type="text/javascript">
{chart_js["chart_div_1"]}
</script>
<div id="chart_div_1"></div>
'''

HTML(html_code)
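The two cells above differ only in the element ID they render into. A minimal sketch of a shared helper follows, assuming (as the cells above suggest) that `chart_js` maps each element ID to a JavaScript snippet. The name `chart_html` and the sample dictionary are hypothetical.

```python
def chart_html(chart_js, element_id):
    """Build the HTML that loads Google Charts and renders one chart into element_id."""
    return f'''
<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
<script type="text/javascript">
{chart_js[element_id]}
</script>
<div id="{element_id}"></div>
'''


# Placeholder standing in for the dictionary returned by the visualization agent
sample_js = {
    "chart_div": "// JavaScript for the first chart",
    "chart_div_1": "// JavaScript for the second chart",
}

html_snippet = chart_html(sample_js, "chart_div_1")
print(html_snippet)
```

In the notebook, each chart could then be displayed with `HTML(chart_html(chart_js, "chart_div"))` and `HTML(chart_html(chart_js, "chart_div_1"))`.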