# 0. Setting up Elastic DB and Kibana

Kibana is optional

-----

Follow quick start guide using Docker for Elastic DB: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/getting-started.html

Also run a terminal to get to kibana from the gui download the sample global flight dataset

test in another terminal to see if your db is working
bash: curl -X GET http://localhost:9200/

Install and run Elasticsearch

Install and start Docker Desktop.
Run:

```python
docker network create elastic
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.17.25
docker run --name es01-test --net elastic -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.17.25
```
Install and run Kibana

To analyze, visualize, and manage Elasticsearch data using an intuitive UI, install Kibana.

In a new terminal session, run:

```python
docker pull docker.elastic.co/kibana/kibana:7.17.25
docker run --name kib01-test --net elastic -p 127.0.0.1:5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01-test:9200" docker.elastic.co/kibana/kibana:7.17.25
To access Kibana, go to http://localhost:5601
```

Hooking up your API Key (takes about 50¢ to add 1536 dimension vector embeddings to 13000 entries)

In [6]:
import os
from openai import OpenAI

# Set up OpenAI API key
os.environ["OPENAI_API_KEY"] = "replace_with_your_api_key"
openai_api_key = os.getenv("OPENAI_API_KEY")
print(f"API Key loaded: {openai_api_key is not None}")

# Initialize OpenAI client
client = OpenAI()

# Set the embedding model
EMBEDDING_MODEL = "text-embedding-ada-002" # this embedding is 1536 dimensional e.g. [0.1, 0.1, 0.3] is 3 dimensional

API Key loaded: True


-----

# Code

In [7]:
import os
import json
from openai import OpenAI
from tqdm import tqdm  # For progress tracking
import openai
from elasticsearch import Elasticsearch, helpers
from elasticsearch.helpers import reindex
import time
import tiktoken # for truncating long inputs, not necessary if you can tailor inputs ahead of embedding

es = Elasticsearch("http://localhost:9200", basic_auth=("elastic"))
print(es.info())
source_index = "kibana_sample_data_flights"
target_index = "flights_with_embeddings"


{'name': '574b453b4115', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'PK-zP1pNQwehaNMvWa62VQ', 'version': {'number': '7.17.25', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'f9b6b57d1d0f76e2d14291c04fb50abeb642cfbf', 'build_date': '2024-10-16T22:06:36.904732810Z', 'build_snapshot': False, 'lucene_version': '8.11.3', 'minimum_wire_compatibility_version': '6.8.0', 'minimum_index_compatibility_version': '6.0.0-beta1'}, 'tagline': 'You Know, for Search'}


  print(es.info())


In [8]:
rebuild = False
if rebuild: # run the following if you want to delete your database
    es.indices.delete(index=target_index, ignore=[404])
    print(f"Deleted the {target_index} index.")

In [10]:
def get_embedding(text, model=EMBEDDING_MODEL):
    """for embedding your query
    """
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding


def get_embeddings(texts, model=EMBEDDING_MODEL):
    """for adding an embedding vector to each entry in database
    """
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return [data.embedding for data in response.data]

def prepare_text_for_embedding(doc_source, max_tokens=8191, model=EMBEDDING_MODEL, truncate=True):
    # Extract specified fields
    fields_to_include = [
        ('Flight Number', 'FlightNum'),
        ('Carrier', 'Carrier'),
        ('From', 'OriginCityName'),
        ('Origin Country', 'OriginCountry'),
        ('To', 'DestCityName'),
        ('Destination Country', 'DestCountry'),
        ('Distance Kilometers', 'DistanceKilometers'),
        ('Distance Miles', 'DistanceMiles'),
        ('Flight Time Hour', 'FlightTimeHour'),
        ('Weather at Origin', 'OriginWeather'),
        ('Weather at Destination', 'DestWeather'),
        ('Average Ticket Price', 'AvgTicketPrice'),
        ('Delay', 'FlightDelay'),
        ('Cancellation', 'Cancelled'),
        ('Date', 'timestamp')
    ]
    
    # Build the text parts
    text_parts = []
    for label, field in fields_to_include:
        value = doc_source.get(field, 'N/A')
        if isinstance(value, bool):
            value = 'Yes' if value else 'No'
        else:
            value = str(value)
        text_parts.append(f"{label}: {value}")
    
    # Include additional fields
    additional_fields = set(doc_source.keys()) - {field for _, field in fields_to_include}
    for field in additional_fields:
        value = doc_source.get(field, '')
        if isinstance(value, (dict, list)):
            value = json.dumps(value)
        else:
            value = str(value)
        text_parts.append(f"{field}: {value}")
    
    combined_text = '\n'.join(text_parts)
    
    if truncate:
        # Use tiktoken to count tokens and truncate if necessary
        encoding = tiktoken.encoding_for_model(model)
        tokens = encoding.encode(combined_text)
        
        if len(tokens) > max_tokens:
            # Truncate tokens to max_tokens
            tokens = tokens[:max_tokens]
            # Decode tokens back to text
            combined_text = encoding.decode(tokens)
    
    return combined_text

def fetch_documents_in_batches(index, es_client, batch_size=100):
    query = {"query": {"match_all": {}}}
    scan_response = helpers.scan(es_client, index=index, query=query, scroll='1m')
    batch = []
    for doc in scan_response:
        batch.append(doc)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def create_or_update_index(es_client, index_name):
    mapping = {
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "dense_vector",
                    "dims": 1536  # Dimensions of the embedding vector
                },
                # Include other fields as needed
            }
        }
    }
    if not es_client.indices.exists(index=index_name):
        es_client.indices.create(index=index_name, body=mapping)
    else:
        # Update the existing index mapping if necessary
        es_client.indices.put_mapping(index=index_name, body=mapping["mappings"])

def load_progress(progress_file):
    if os.path.exists(progress_file):
        with open(progress_file, 'r') as f:
            processed_ids = set(json.load(f))
    else:
        processed_ids = set()
    return processed_ids

def save_progress(progress_file, processed_ids):
    with open(progress_file, 'w') as f:
        json.dump(list(processed_ids), f)

def process_documents(es_client, source_index, target_index, batch_size=100, progress_file='progress.json'):
    create_or_update_index(es_client, target_index)
    processed_ids = load_progress(progress_file)
    total_docs = es_client.count(index=source_index)['count']
    
    # Initialize progress bar
    pbar = tqdm(total=total_docs, desc="Processing documents", initial=len(processed_ids))
    
    for batch in fetch_documents_in_batches(source_index, es_client, batch_size):
        actions = []
        texts = []
        doc_ids = []
        doc_sources = []
        for doc in batch:
            doc_id = doc['_id']
            if doc_id in processed_ids:
                pbar.update(1)
                continue  # Skip already processed documents
            doc_source = doc['_source']
            combined_text = prepare_text_for_embedding(doc_source)
            if combined_text.strip():
                texts.append(combined_text)
                doc_ids.append(doc_id)
                doc_sources.append(doc_source)  # Keep the original source to include other fields
            else:
                # Handle empty text
                doc_source['embedding'] = [0] * 1536
                action = {
                    "_index": target_index,
                    "_id": doc_id,
                    "_source": doc_source
                }
                actions.append(action)
                processed_ids.add(doc_id)
                pbar.update(1)
        
        if texts:
            try:
                # Generate embeddings in batch
                embeddings = get_embeddings(texts)
                
                # Prepare actions for bulk indexing
                for doc_id, embedding, doc_source in zip(doc_ids, embeddings, doc_sources):
                    doc_source['embedding'] = embedding
                    action = {
                        "_index": target_index,
                        "_id": doc_id,
                        "_source": doc_source
                    }
                    actions.append(action)
                    processed_ids.add(doc_id)
                    pbar.update(1)
            except Exception as e:
                print(f"\nError during embedding generation: {e}")
                # Save progress before exiting
                save_progress(progress_file, processed_ids)
                pbar.close()
                return
        
        # Bulk index the documents
        if actions:
            try:
                helpers.bulk(es_client, actions)
            except Exception as e:
                print(f"\nError during bulk indexing: {e}")
                # Save progress before exiting
                save_progress(progress_file, processed_ids)
                pbar.close()
                return
        
        # Save progress after each batch
        save_progress(progress_file, processed_ids)
    
    pbar.close()
    print("Processing completed.")
    
    
def search_index(es_client, index_name, query_text, top_k=5):
    # Generate embedding for the query text
    query_embedding = get_embedding(query_text)

    # Build the search query using script_score and cosineSimilarity
    search_query = {
        "size": top_k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {
                        "query_vector": query_embedding
                    }
                }
            }
        }
    }

    # Perform the search
    response = es_client.search(index=index_name, body=search_query)

    # Display the results
    for hit in response['hits']['hits']:
        score = hit['_score']
        source = hit['_source']
        print(f"Score: {score}")
        print(f"Flight Number: {source.get('FlightNum', 'N/A')}")
        print(f"Carrier: {source.get('Carrier', 'N/A')}")
        print(f"From: {source.get('OriginCityName', 'N/A')} ({source.get('OriginCountry', 'N/A')})")
        print(f"To: {source.get('DestCityName', 'N/A')} ({source.get('DestCountry', 'N/A')})")
        print(f"Distance: {source.get('DistanceKilometers', 0):,.2f} km "
              f"({source.get('DistanceMiles', 0):,.2f} miles)")
        flight_time_hour = source.get('FlightTimeHour', 0)
        print(f"Flight Time: {float(flight_time_hour):.2f} hours" if flight_time_hour else "Flight Time: N/A")
        print(f"Weather at Origin: {source.get('OriginWeather', 'N/A')}")
        print(f"Weather at Destination: {source.get('DestWeather', 'N/A')}")
        print(f"Average Ticket Price: ${source.get('AvgTicketPrice', 0):,.2f}")
        print(f"Delay: {'Yes' if source.get('FlightDelay', False) else 'No'}")
        print(f"Cancellation: {'Yes' if source.get('Cancelled', False) else 'No'}")
        print(f"Date: {source.get('timestamp', 'N/A')}")
        print("-" * 50)

    return response  # Return response for use in the next step

In [11]:
# takes about 2-3 minutes
process_documents(es, source_index, target_index, batch_size=50, progress_file='progress.json')

  if not es_client.indices.exists(index=index_name):
  es_client.indices.put_mapping(index=index_name, body=mapping["mappings"])
  total_docs = es_client.count(index=source_index)['count']
  for doc in scan_response:
Processing documents: 26118it [00:03, 3910.71it/s]                 

Processing completed.





In [None]:
query_text = "show me a cheap flight to mexico from canada"
response = search_index(es, target_index, query_text, top_k=5)

-----

Add ons: if you want to enahnce your response with a more sophisticated model you can serve the response to GPT 4

In [None]:
# Extract relevant information from the top search results
def extract_context_from_hits(hits):
    context = ""
    for hit in hits:
        source = hit['_source']
        flight_info = (
            f"Flight Number: {source.get('FlightNum', 'N/A')}\n"
            f"Carrier: {source.get('Carrier', 'N/A')}\n"
            f"From: {source.get('OriginCityName', 'N/A')} ({source.get('OriginCountry', 'N/A')})\n"
            f"To: {source.get('DestCityName', 'N/A')} ({source.get('DestCountry', 'N/A')})\n"
            f"Distance: {source.get('DistanceKilometers', 0):,.2f} km "
            f"({source.get('DistanceMiles', 0):,.2f} miles)\n"
            f"Flight Time: {float(source.get('FlightTimeHour', 0)):.2f} hours\n"
            f"Weather at Origin: {source.get('OriginWeather', 'N/A')}\n"
            f"Weather at Destination: {source.get('DestWeather', 'N/A')}\n"
            f"Average Ticket Price: ${source.get('AvgTicketPrice', 0):,.2f}\n"
            f"Delay: {'Yes' if source.get('FlightDelay', False) else 'No'}\n"
            f"Cancellation: {'Yes' if source.get('Cancelled', False) else 'No'}\n"
            f"Date: {source.get('timestamp', 'N/A')}\n"
            "----------------------------------------\n"
        )
        context += flight_info
    return context


add on a more sophisticated LLM to make the response more pleasing

In [None]:
# Extract context from search results
context = extract_context_from_hits(response['hits']['hits'])
from openai import AsyncOpenAI

client = AsyncOpenAI()
# Use OpenAI's ChatCompletion API
def generate_response(question, context):
    messages = [
        {"role": "system", "content": "You are a helpful assistant that provides flight information."},
        {"role": "user", "content": f"Question: {question}\n\nContext:\n{context}\n\nAnswer the question using the context provided. If the answer is not in the context, say 'I could not find the information in the provided data.'"}
    ]

    completion = openai.chat.completions.create(
        model="gpt-4",
        messages=messages)

    return completion.choices[0].message.content

# Generate the response
answer = generate_response(query_text, context)

print("Assistant's Response:")
print(answer)


## Notes on how to potentially integrate with Kibana

Overview
To integrate your custom search functionality into Kibana, we'll explore the following approaches:

Create a Custom Kibana Plugin: Develop a plugin that adds new routes and UI components to Kibana.
Use Kibana Vega Visualizations: Utilize Vega or Vega-Lite visualizations to create custom visualizations that can interact with Elasticsearch.
Embed External Applications: If necessary, build an external web application that interfaces with Elasticsearch and OpenAI, and embed it within Kibana using iframes or Canvas.
For your use case, developing a custom Kibana plugin is the most direct way to integrate your Python code into Kibana's UI. Below, I'll provide detailed steps on how to create a Kibana plugin that can:

Accept user queries.
Perform vector searches using Elasticsearch.
Display results within Kibana.
Optionally, integrate with OpenAI's API to generate additional responses.
Prerequisites
Knowledge of JavaScript/TypeScript: Kibana plugins are developed using JavaScript or TypeScript and React.
Development Environment: Set up a development environment for Kibana plugin development.
Elasticsearch and Kibana Version Compatibility: Ensure that your Elasticsearch and Kibana versions are compatible.
Step-by-Step Guide
Step 1: Set Up Your Development Environment
1.1 Clone the Kibana Repository
Clone the Kibana repository from GitHub to get access to the necessary development tools and plugin generator.

bash
Copy code
git clone https://github.com/elastic/kibana.git
cd kibana
1.2 Check Out the Correct Branch
Switch to the branch that matches your Kibana version. For example, if you're using Kibana 8.5.0:

bash
Copy code
git checkout v8.5.0
1.3 Install Dependencies
Install the required dependencies using Yarn (Kibana uses Yarn for package management):

bash
Copy code
yarn kbn bootstrap
This command sets up the development environment and installs all necessary packages.

Step 2: Generate a New Plugin
Kibana provides a plugin generator to scaffold a new plugin.

2.1 Run the Plugin Generator
bash
Copy code
node scripts/generate_plugin.js
2.2 Provide Plugin Details
You'll be prompted to provide details about your plugin:

Plugin Name: e.g., vector_search_plugin
Plugin ID: e.g., vectorSearchPlugin
Description: e.g., A plugin to perform vector searches and display results
Owner Name: Your name or your organization's name
Client-side: Yes (since we'll be building UI components)
Server-side: Yes (if you need to add server routes)
Step 3: Develop the Plugin
Now that you have the plugin scaffolded, you can start developing it.

3.1 Understand the Plugin Structure
Your plugin directory will be located at kibana/plugins/vector_search_plugin/. It will have the following structure:

public/: Contains client-side code (UI components).
server/: Contains server-side code (routes, handlers).
kibana.json: Plugin manifest file.
3.2 Implement the Client-Side UI
In the public/ directory, you'll develop the React components that make up your plugin's UI.

3.2.1 Create a Query Input Component
In public/components/, create a new component called QueryInput.tsx.

tsx
Copy code
import React, { useState } from 'react';
import { EuiFieldText, EuiButton } from '@elastic/eui';

interface QueryInputProps {
  onSearch: (queryText: string) => void;
}

export const QueryInput: React.FC<QueryInputProps> = ({ onSearch }) => {
  const [queryText, setQueryText] = useState('');

  const handleSearch = () => {
    onSearch(queryText);
  };

  return (
    <div>
      <EuiFieldText
        placeholder="Enter your query..."
        value={queryText}
        onChange={(e) => setQueryText(e.target.value)}
      />
      <EuiButton onClick={handleSearch}>Search</EuiButton>
    </div>
  );
};
This component provides an input field and a search button.

3.2.2 Display Search Results
Create a SearchResults.tsx component to display the results.

tsx
Copy code
import React from 'react';

interface SearchResult {
  score: number;
  source: any;
}

interface SearchResultsProps {
  results: SearchResult[];
}

export const SearchResults: React.FC<SearchResultsProps> = ({ results }) => {
  return (
    <div>
      {results.map((hit, index) => {
        const source = hit.source;
        return (
          <div key={index}>
            <h3>Score: {hit.score}</h3>
            <p>Flight Number: {source.FlightNum}</p>
            <p>Carrier: {source.Carrier}</p>
            <p>
              From: {source.OriginCityName} ({source.OriginCountry})
            </p>
            <p>
              To: {source.DestCityName} ({source.DestCountry})
            </p>
            {/* Add more fields as needed */}
            <hr />
          </div>
        );
      })}
    </div>
  );
};
3.2.3 Combine Components in the Main App
In public/application.tsx, import your components and handle the logic.

tsx
Copy code
import React, { useState } from 'react';
import { CoreStart } from '@kbn/core/public';
import { QueryInput } from './components/QueryInput';
import { SearchResults } from './components/SearchResults';

interface AppProps {
  http: CoreStart['http'];
}

export const App: React.FC<AppProps> = ({ http }) => {
  const [results, setResults] = useState([]);

  const handleSearch = async (queryText: string) => {
    try {
      const response = await http.post('/api/vector_search_plugin/search', {
        body: JSON.stringify({ queryText }),
      });
      setResults(response.hits);
    } catch (error) {
      console.error('Error performing search:', error);
    }
  };

  return (
    <div>
      <QueryInput onSearch={handleSearch} />
      <SearchResults results={results} />
    </div>
  );
};
3.3 Implement Server-Side Route
In order to perform the search, you need to define a server-side route that your UI can call.

3.3.1 Define the Route
In server/routes/, create search.ts.

ts
Copy code
import { IRouter } from '@kbn/core/server';
import { schema } from '@kbn/config-schema';

export function defineRoutes(router: IRouter) {
  router.post(
    {
      path: '/api/vector_search_plugin/search',
      validate: {
        body: schema.object({
          queryText: schema.string(),
        }),
      },
    },
    async (context, request, response) => {
      const { queryText } = request.body;

      // Perform vector search using Elasticsearch client
      const esClient = (await context.core).elasticsearch.client.asCurrentUser;

      // Generate embedding for the query text
      // You need to integrate your embedding generation here
      // Since server-side code is in Node.js, you might need to use a Node.js OpenAI client

      // For the purpose of this example, we'll assume you have a function `getEmbedding`
      const queryEmbedding = await getEmbedding(queryText);

      // Build the search query
      const searchQuery = {
        size: 5,
        query: {
          script_score: {
            query: { match_all: {} },
            script: {
              source: "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
              params: {
                query_vector: queryEmbedding,
              },
            },
          },
        },
      };

      // Perform the search
      const esResponse = await esClient.search({
        index: 'flights_with_embeddings',
        body: searchQuery,
      });

      // Return the results
      return response.ok({
        body: {
          hits: esResponse.hits.hits.map((hit: any) => ({
            score: hit._score,
            source: hit._source,
          })),
        },
      });
    }
  );
}
3.3.2 Register the Route
In server/plugin.ts, import and register the route:

ts
Copy code
import { defineRoutes } from './routes/search';

public setup(core: CoreSetup) {
  const router = core.http.createRouter();
  defineRoutes(router);
}
3.4 Integrate OpenAI Embedding Generation
Since the server-side code is in Node.js, you'll need to use the OpenAI Node.js client.

3.4.1 Install OpenAI Node.js SDK
In your plugin directory, install the OpenAI SDK:

bash
Copy code
cd plugins/vector_search_plugin
npm install openai
3.4.2 Implement the Embedding Function
In server/lib/embedding.ts, create the function to generate embeddings:

ts
Copy code
import { Configuration, OpenAIApi } from 'openai';

const configuration = new Configuration({
  apiKey: 'YOUR_OPENAI_API_KEY', // Securely manage your API key
});

const openai = new OpenAIApi(configuration);

export async function getEmbedding(text: string): Promise<number[]> {
  const response = await openai.createEmbedding({
    model: 'text-embedding-ada-002',
    input: text,
  });

  return response.data.data[0].embedding;
}
Security Note: Never hard-code your API key in the codebase. Use environment variables or Kibana's secure settings to manage sensitive information.

3.4.3 Securely Manage API Keys
Use Kibana's config or secrets management to store your OpenAI API key.

In config/kibana.yml, add:
yaml
Copy code
vectorSearchPlugin.openaiApiKey: YOUR_OPENAI_API_KEY
Access the API key in your code:
ts
Copy code
const config = context.config.get<{ openaiApiKey: string }>();
const openaiApiKey = config.openaiApiKey;
Step 4: Build and Run Kibana with Your Plugin
4.1 Build the Plugin
From the Kibana root directory, run:

bash
Copy code
yarn build
4.2 Start Kibana
bash
Copy code
yarn start
This will start Kibana with your plugin included.

Step 5: Test Your Plugin
Navigate to Kibana in your web browser.
Find your plugin in the side navigation bar.
Enter a query in the input field and perform a search.
Verify that the search results are displayed as expected.
Considerations and Limitations
1. API Key Security
Ensure that your OpenAI API key is securely managed and not exposed in the client-side code. Use Kibana's secure settings or environment variables.

2. Rate Limits and Performance
Be mindful of OpenAI's rate limits. Implement caching or request throttling as needed.

3. Error Handling
Add proper error handling in your server and client code to handle exceptions and provide feedback to users.

4. Elasticsearch Version Compatibility
Make sure that the Elasticsearch features you're using (e.g., script_score, cosineSimilarity) are supported in your Elasticsearch version.

5. Plugin Maintenance
Custom plugins need to be maintained and may require updates when upgrading Kibana versions.

Alternative Approaches
Option 1: Use Kibana Vega Visualizations
Vega is a visualization grammar that allows for custom visualizations within Kibana.

Pros:
No need to develop a full plugin.
Can perform Elasticsearch queries and visualize results.
Cons:
Limited in terms of custom logic and external API calls.
Embedding OpenAI API calls within Vega might not be feasible.