## Indexing
Creating the right index — with the right keys, in the right order, and using the right expressions — is critical to query performance in any database system. This is true for Couchbase as well. This topic provides an overview of the types of index that you can create using the Index Service, and explains how they help to query for data efficiently and improve query performance.

### Notes
- In Couchbase, N1QL queries generally need a suitable index. Without one, the query fails. In the case of `travel-sample` data, the indexes are created for you when you import the sample bucket.
- Indexes are created asynchronously and can take a bit of time before the process is completed.
- You can create indexes using any of the following utilities:
    - The Couchbase Query Workbench (in the Web Console)
    - The Command-Line based Query Shell (cbq)
    - Our REST API
    - Any of our Language SDKs, including Python (which we’ll focus on today).


### Configuring the Couchbase Cluster Information for Examples

The configuration is stored in an environment file, `.env` in this folder. 

Note that you might have to check for hidden files to see this file on Unix environments.

This file can be used to update the connection settings.
* DB_HOST: Set to `couchbase://couchbase` by default, which connects to the Couchbase cluster started in the Docker environment via Docker Compose. If you are running Couchbase locally on your machine, either in Docker or as a native installation, you can change the connection string to `couchbase://localhost`.
* DB_USER: Set to `Administrator` by default. If it is different for your cluster, please update the file.
* DB_PASSWORD: Set to `Password` by default. If it is different for your cluster, please update the file.
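
As a minimal sketch of how these settings could be read with a fallback to the defaults listed above: the helper name `read_settings` is hypothetical and not part of the workshop code, and it only relies on a plain mapping such as `os.environ`.

```python
import os

# Defaults taken from the list above; adjust them if your cluster differs.
DEFAULTS = {
    "DB_HOST": "couchbase://couchbase",
    "DB_USER": "Administrator",
    "DB_PASSWORD": "Password",
}


def read_settings(env=os.environ):
    """Return the three connection settings, falling back to the defaults.

    `read_settings` is a hypothetical helper used only for illustration.
    Missing or empty values are replaced by the documented default.
    """
    return {key: env.get(key) or default for key, default in DEFAULTS.items()}


# Example: only the host is overridden, the other values keep their defaults.
settings = read_settings({"DB_HOST": "couchbase://localhost"})
print(settings["DB_HOST"], settings["DB_USER"])
```

Passing the mapping as an argument keeps the helper easy to try out without touching the real environment.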


In [None]:
# Read the Database information from .env file
from dotenv import load_dotenv
import os

load_dotenv()  # take environment variables from .env file.

In [None]:
DB_HOST = os.getenv("DB_HOST")
DB_USER = os.getenv("DB_USER")
DB_PASSWORD = os.getenv("DB_PASSWORD")
print(f"Environment Settings \n{DB_HOST=} \n{DB_USER=} \n{DB_PASSWORD=}")

### Connecting to Couchbase Cluster
- Connection String: `couchbase://couchbase` would connect to the Couchbase instance.
- PasswordAuthenticator: It specifies the username & password used to access the Cluster.

#### Note
If you are running Couchbase locally on your machine, either in Docker or as a native installation, you can change the connection string to `couchbase://localhost` in the `.env` configuration file.

## Types of Indexes
- Primary Index: The primary index is simply an index on the document key on the entire keyspace.
- Secondary Index: A secondary index is an index on any key-value or document-key.
- Composite Secondary Index: A secondary index using multiple keys.
- Partial Index: An index defined on a subset of documents.
- Covering Index: An index that includes the actual values of all the fields specified in the query.
- Array Index: An index on array objects in documents.

## Primary Index
As in relational databases, a primary index contains the full set of document keys in a given keyspace.

Every primary index is maintained asynchronously. A primary index is intended to be used for simple queries, which have no filters or predicates.

Primary indexes are optional and are only required for running ad hoc queries on a keyspace that is not supported by a secondary index. Queries that rely on a primary index are slow because every document has to be fetched and checked against the query predicates, so primary indexes are not recommended for production.
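
The cost difference can be illustrated with a toy, in-memory model. This is not Couchbase code and says nothing about how the Index Service is implemented; the document IDs and field values are made up for the example. It only counts how many documents each strategy has to fetch.

```python
# Toy model (not Couchbase internals): documents keyed by document ID.
# The IDs and values below are hypothetical sample data.
docs = {
    "hotel_1": {"title": "Gillingham", "country": "United Kingdom"},
    "hotel_2": {"title": "Paris", "country": "France"},
    "hotel_3": {"title": "Glasgow", "country": "United Kingdom"},
}

# A primary index only knows the document keys.
primary_index = sorted(docs)


def scan_with_primary(country):
    """Fetch every document, then filter: the cost grows with the keyspace."""
    fetched = 0
    matches = []
    for doc_id in primary_index:
        doc = docs[doc_id]  # stands in for a fetch from the data service
        fetched += 1
        if doc["country"] == country:
            matches.append(doc_id)
    return matches, fetched


# A secondary index on `country` maps each indexed value to document keys.
country_index = {}
for doc_id, doc in docs.items():
    country_index.setdefault(doc["country"], []).append(doc_id)


def scan_with_secondary(country):
    """Only the qualifying documents need to be fetched."""
    matches = sorted(country_index.get(country, []))
    return matches, len(matches)


print(scan_with_primary("France"))    # (['hotel_2'], 3)
print(scan_with_secondary("France"))  # (['hotel_2'], 1)
```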

In [None]:
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from couchbase.management.queries import (
    CreatePrimaryQueryIndexOptions,
    QueryIndexManager,
)

In [None]:
cluster = Cluster.connect(
    DB_HOST, ClusterOptions(PasswordAuthenticator(DB_USER, DB_PASSWORD))
)

In [None]:
primary_idx_query = (
    "CREATE PRIMARY INDEX primary_idx_hotels ON `travel-sample`.inventory.hotel"
)
try:
    result = cluster.query(primary_idx_query).execute()
except Exception as e:
    print(e)

## Check for the Created Index on Indexes menu in the Web Console
This index will be used for all queries on the hotel collection in inventory scope of the travel-sample bucket in case there is no other index on this collection relevant to the query. 

The drawback of this index is that all the documents have to be fetched to check whether they match a query. This can be avoided by using specialized indexes on the relevant fields.

![primary-index-inventory](./img/Primary-Index-Inventory-Hotel.png)

## Checking all Available Indexes
You can list all the available indexes in the cluster by querying `system:indexes`, an internal keyspace that keeps track of every index.

In [None]:
import pprint

pp = pprint.PrettyPrinter(indent=4, depth=6)

In [None]:
all_indexes_query = "SELECT * FROM system:indexes"

try:
    result = cluster.query(all_indexes_query).execute()
    for row in result:
        pp.pprint(row)
except Exception as e:
    print(e)
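
The rows returned by this query are fairly verbose. The sketch below reduces a row to a few fields. The helper name `summarize_index` is hypothetical, and the field names (`name`, `index_key`, `state`, `is_primary`) are assumed from typical `system:indexes` output, so check them against what your cluster returns.

```python
def summarize_index(row):
    """Reduce one `SELECT * FROM system:indexes` row to a short description.

    `summarize_index` is a hypothetical helper. The field names used here
    are assumptions about the row layout, so every lookup is tolerant of
    missing keys.
    """
    # `SELECT *` nests the fields under the keyspace name ("indexes").
    info = row.get("indexes", row)
    return {
        "name": info.get("name"),
        "keys": info.get("index_key", []),
        "state": info.get("state"),
        "primary": bool(info.get("is_primary", False)),
    }


# Hand-written sample row for illustration; real rows carry more fields.
sample_row = {
    "indexes": {
        "name": "idx_hotels_title",
        "index_key": ["`title`"],
        "state": "online",
    }
}
print(summarize_index(sample_row))
```

With a live cluster, the same helper could be applied to each `row` in the loop above instead of pretty-printing the whole row.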

## Explain: Check how the Query is being executed
Couchbase allows you to check how a query is executed using the current indexes.

You can click on `Explain` in the Web interface for the Query Workbench to see the plan for a query.

The query plan for this query indicates that the query:

- Scans the Primary Index.
- Fetches all the Hotel documents.
- Projects the `title` and `country` fields for all the fetched documents.

![Explain-Index](./img/Explain-Index.png)

The Primary Index used here is different from the one created above as there was already a primary index on the same collection that was created when the sample bucket was imported.

Note that the Execution Plans can change based on the indexes available. Couchbase automatically selects the best index for the query. 
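
A plan can also be inspected from Python by prefixing a statement with `EXPLAIN` and running it like any other query. The sketch below assumes that each result row carries the plan under a `plan` key and that operators are described by `#operator` and `index` fields. The helper `indexes_used` is hypothetical, and the sample plan is hand-written and heavily simplified, so the index name in it is only illustrative.

```python
def indexes_used(plan):
    """Collect the names of all indexes referenced anywhere in a plan.

    Hypothetical helper: it walks nested dicts and lists and records the
    `index` field of every operator node that has one.
    """
    found = []
    if isinstance(plan, dict):
        if "#operator" in plan and "index" in plan:
            found.append(plan["index"])
        for value in plan.values():
            found.extend(indexes_used(value))
    elif isinstance(plan, list):
        for item in plan:
            found.extend(indexes_used(item))
    return found


# Simplified, hand-written plan for illustration only.
sample_plan = {
    "#operator": "Sequence",
    "~children": [
        {"#operator": "PrimaryScan3", "index": "def_inventory_hotel_primary"},
        {"#operator": "Fetch", "keyspace": "hotel"},
        {"#operator": "InitialProject"},
    ],
}
print(indexes_used(sample_plan))  # ['def_inventory_hotel_primary']

# With a live cluster (assumed row layout), something like:
#   rows = cluster.query("EXPLAIN SELECT title, country "
#                        "FROM `travel-sample`.inventory.hotel").execute()
#   print(indexes_used(rows[0]["plan"]))
```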

## Secondary Index
A secondary index is an index on any key-value or document-key. This index can use any key within the document and the key can be of any type: scalar, object, or array. 

The query has to use the same type of object for the query engine to use the index.

In [None]:
# This index will be used for queries that work with the hotel titles
secondary_idx_query = (
    "CREATE INDEX idx_hotels_title ON `travel-sample`.inventory.hotel(title)"
)
try:
    result = cluster.query(secondary_idx_query).execute()
except Exception as e:
    print(e)

## Composite Secondary Index
It is common to have queries with multiple filters (predicates). In such cases, you want to use indexes with multiple keys so the indexes can return only the qualified document keys. Additionally, if a query is referencing only the keys in the index, the query engine can simply answer the query from the index scan result without having to fetch from the data nodes. This is commonly used for performance optimization.

We can create an index that will handle the query to get the title and country for each hotel in the inventory scope, making it more efficient than using the primary index.
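
Why the order of keys matters can be sketched with a toy model in which a composite index is a sorted list of `(title, country, document ID)` tuples. This is an illustration only, not how the Index Service stores data, and the hotel documents are made-up sample data.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample documents.
hotels = {
    "hotel_1": {"title": "Glasgow", "country": "United Kingdom"},
    "hotel_2": {"title": "Paris", "country": "France"},
    "hotel_3": {"title": "Paris", "country": "United States"},
}

# Composite index on (title, country): sorted by title first, then country.
entries = sorted((d["title"], d["country"], doc_id) for doc_id, d in hotels.items())

HIGH = "\uffff"  # sentinel that sorts after any value used in this example


def by_title(title):
    """A filter on the leading key maps to one contiguous range."""
    lo = bisect_left(entries, (title,))
    hi = bisect_right(entries, (title, HIGH))
    return [doc_id for _, _, doc_id in entries[lo:hi]]


def by_title_and_country(title, country):
    """Filters on both keys narrow the range further."""
    lo = bisect_left(entries, (title, country))
    hi = bisect_right(entries, (title, country, HIGH))
    return [doc_id for _, _, doc_id in entries[lo:hi]]


print(by_title("Paris"))                        # ['hotel_2', 'hotel_3']
print(by_title_and_country("Paris", "France"))  # ['hotel_2']
```

In this model a filter on `country` alone has no contiguous range to scan, which is the intuition behind putting the most commonly filtered key first.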

In [None]:
# This index will be used for queries that work with the hotel titles & countries
hotel_title_country_idx_query = "CREATE INDEX idx_hotels_title_country ON `travel-sample`.inventory.hotel(title, country)"
try:
    result = cluster.query(hotel_title_country_idx_query).execute()
except Exception as e:
    print(e)

## Partial Index
Unlike relational systems where each type of row is in a distinct table, Couchbase keyspaces can have documents of various types. You can include a distinguishing field in your document to differentiate distinct types.

For example, the landmark keyspace distinguishes types of landmark using the activity field. Couchbase allows you to create indexes for specific activities from them.
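
The only syntactic difference between a regular and a partial index is the trailing `WHERE` clause. A small sketch of a statement builder makes that explicit. The function `create_index_statement` is a hypothetical helper; it uses plain string formatting, so it should only be given trusted identifiers and predicates.

```python
def create_index_statement(name, keyspace, keys, where=None):
    """Build a CREATE INDEX statement; `where` turns it into a partial index.

    Hypothetical helper for illustration. No escaping is performed, so do
    not pass user-supplied input.
    """
    statement = f"CREATE INDEX {name} ON {keyspace}({', '.join(keys)})"
    if where:
        statement += f" WHERE {where}"
    return statement


LANDMARK = "`travel-sample`.inventory.landmark"

# Full index: every landmark document is indexed.
print(create_index_statement("landmarks_all", LANDMARK, ["name", "id", "address"]))

# Partial index: only documents with activity='eat' are indexed.
print(
    create_index_statement(
        "landmarks_eat", LANDMARK, ["name", "id", "address"], where="activity='eat'"
    )
)
```

The index name `landmarks_all` is made up for the comparison; the second statement matches the partial index created in this section.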

In [None]:
activities = "SELECT DISTINCT activity FROM `travel-sample`.inventory.landmark"
try:
    result = cluster.query(activities)
    for row in result:
        print(row)
except Exception as e:
    print(e)

In [None]:
# Create an index for landmarks that are of type 'eat'
restaurants_index_query = "CREATE INDEX landmarks_eat ON `travel-sample`.inventory.landmark(name, id, address) WHERE activity='eat'"
try:
    result = cluster.query(restaurants_index_query).execute()
except Exception as e:
    print(e)

In [None]:
all_indexes_query = "SELECT * FROM system:indexes where name='landmarks_eat'"

try:
    result = cluster.query(all_indexes_query).execute()
    for row in result:
        pp.pprint(row)
except Exception as e:
    print(e)

## Covering Index
When an index includes the actual values of all the fields specified in the query, the index covers the query and does not require an additional step to fetch the actual values from the data service. An index, in this case, is called a covering index and the query is called a covered query. As a result, covered queries are faster and deliver better performance.


In [None]:
hotel_state_index_query = (
    "CREATE INDEX idx_state on `travel-sample`.inventory.hotel (state)"
)
try:
    result = cluster.query(hotel_state_index_query).execute()
except Exception as e:
    print(e)

We can see the query execution plan using the EXPLAIN statement. When a query uses a covering index, the EXPLAIN statement shows that a covering index is used for data access, thus avoiding the overhead associated with key-value document fetches. 

If we select `state` from the hotel keyspace, the values of the `state` field to be returned are already present in the index `idx_state`, which avoids the additional step of fetching the documents. In this case, the index `idx_state` is called a covering index and the query is a covered query.
![Covered-Index](./img/Covered-Index.png)
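
One rough way to check coverage programmatically is to look for a fetch step in the plan. The sketch below assumes that operators appear under `#operator` keys and that an uncovered single-keyspace query contains a `Fetch` operator. Both plans are hand-written and simplified, and `is_covered` is a hypothetical helper rather than an SDK function.

```python
def operators(plan):
    """Yield every `#operator` value found in a nested EXPLAIN plan."""
    if isinstance(plan, dict):
        if "#operator" in plan:
            yield plan["#operator"]
        for value in plan.values():
            yield from operators(value)
    elif isinstance(plan, list):
        for item in plan:
            yield from operators(item)


def is_covered(plan):
    """Treat a plan as covered when it contains no Fetch operator (assumption)."""
    return "Fetch" not in set(operators(plan))


# Simplified, hand-written plans for illustration only.
covered_plan = {
    "#operator": "Sequence",
    "~children": [
        {"#operator": "IndexScan3", "index": "idx_state",
         "covers": ["cover ((`hotel`.`state`))"]},
        {"#operator": "InitialProject"},
    ],
}
uncovered_plan = {
    "#operator": "Sequence",
    "~children": [
        {"#operator": "IndexScan3", "index": "idx_state"},
        {"#operator": "Fetch", "keyspace": "hotel"},
        {"#operator": "InitialProject"},
    ],
}
print(is_covered(covered_plan), is_covered(uncovered_plan))  # True False
```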

## Array Indexing
Array Indexing adds the capability to create global indexes on array elements and optimizes the execution of queries involving array elements.
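
Conceptually, an array index holds one entry per distinct array element instead of one entry per document. The toy model below mimics an index over the distinct `flight` values in each route's `schedule` array. It is an illustration in plain Python with made-up route documents, not a description of the Index Service.

```python
from collections import defaultdict

# Hypothetical sample documents shaped like route documents.
routes = {
    "route_1": {"schedule": [{"day": 0, "flight": "UA123"},
                             {"day": 1, "flight": "UA123"}]},
    "route_2": {"schedule": [{"day": 0, "flight": "FL456"},
                             {"day": 3, "flight": "UA123"}]},
}


def distinct_flights(route):
    """Rough equivalent of DISTINCT ARRAY v.flight FOR v IN schedule END."""
    return sorted({v["flight"] for v in route.get("schedule", []) if "flight" in v})


# One index entry per distinct flight, pointing back to the document IDs.
array_index = defaultdict(list)
for doc_id in sorted(routes):
    for flight in distinct_flights(routes[doc_id]):
        array_index[flight].append(doc_id)


def routes_with_prefix(prefix):
    """Rough equivalent of ANY v IN schedule SATISFIES v.flight LIKE 'UA%' END."""
    hits = set()
    for flight, doc_ids in array_index.items():
        if flight.startswith(prefix):
            hits.update(doc_ids)
    return sorted(hits)


print(distinct_flights(routes["route_1"]))  # ['UA123']
print(routes_with_prefix("UA"))             # ['route_1', 'route_2']
print(routes_with_prefix("FL"))             # ['route_2']
```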



In [None]:
# Create an index on all schedules
# Here, we create an index on all the distinct flight schedules
schedules_index_query = "CREATE INDEX idx_sched ON `travel-sample`.inventory.route ( DISTINCT ARRAY v.flight FOR v IN schedule END )"

try:
    result = cluster.query(schedules_index_query).execute()
except Exception as e:
    print(e)

In [None]:
# Select scheduled flights operated by 'UA'
query_schedules = "SELECT * FROM `travel-sample`.inventory.route WHERE ANY v IN schedule SATISFIES v.flight LIKE 'UA%' END LIMIT 5"

try:
    result = cluster.query(query_schedules)
    for row in result:
        pp.pprint(row)
except Exception as e:
    print(e)

In [None]:
# Index on Flight Stops
flight_stops_index = "CREATE INDEX idx_flight_stops ON `travel-sample`.inventory.route( stops, DISTINCT ARRAY v.flight FOR v IN schedule END )"
try:
    result = cluster.query(flight_stops_index).execute()
except Exception as e:
    print(e)

In [None]:
# Select flights with a stopover
filter_stops_query = "SELECT * FROM `travel-sample`.inventory.route WHERE stops >=1 AND ANY v IN schedule SATISFIES v.flight LIKE 'FL%' END"
try:
    result = cluster.query(filter_stops_query)
    for row in result:
        pp.pprint(row)
except Exception as e:
    print(e)

## Dropping Indexes
The DROP INDEX statement allows you to drop a named primary index or a secondary index. 

You can drop an index by specifying the name of the index and the keyspace (bucket.scope.collection).

In [None]:
# This query will drop the index idx_hotels_title that we created earlier
drop_idx_query = "DROP INDEX idx_hotels_title ON `travel-sample`.inventory.hotel"
try:
    result = cluster.query(drop_idx_query).execute()
except Exception as e:
    print(e)

In [None]:
# This query will drop the primary index primary_idx_hotels that we created earlier
# It is recommended to not have primary indexes on production systems as they scan all the documents in the collection
drop_primary_idx_query = (
    "DROP INDEX primary_idx_hotels ON `travel-sample`.inventory.hotel"
)
try:
    result = cluster.query(drop_primary_idx_query).execute()
except Exception as e:
    print(e)

## Observe (Optional)
Try to observe the performance difference between using the Primary Index and the Secondary Index. If you are working with the travel-sample data, you will have to drop some of the existing indexes for this experiment.
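
A simple way to compare the two is to time the same statement several times before and after changing the indexes. The helper below is a minimal sketch: `time_query` is a hypothetical name, it measures client-side wall-clock time only, and it accepts any zero-argument callable so it can be tried without a cluster.

```python
import time


def time_query(run_query, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` runs.

    `run_query` is any zero-argument callable, for example a lambda that
    executes a query. Taking the minimum reduces noise from other activity.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload so the sketch runs without a cluster.
best = time_query(lambda: sum(range(10_000)), repeats=3)
print(f"best of 3 runs: {best:.6f} s")

# With a live cluster, something like:
#   stmt = "SELECT title FROM `travel-sample`.inventory.hotel WHERE title LIKE 'G%'"
#   print(time_query(lambda: cluster.query(stmt).execute()))
```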

## Query Optimization
Query optimization means tuning queries and their indexes so that they use fewer resources and return results faster. Each optimization is different and yields a different performance benefit.

Tuning is iterative and involves the following basic steps:
- Identify the slow or resource-hungry N1QL statements that account for a large share of the application workload and system resources. Tuning the slowest and most frequently used queries generally yields the biggest gains. Depending on your response-time and SLA needs, you may also have to identify and tune specific queries. The Pareto principle applies to query tuning as well: about 80% of your workload and performance problems are probably caused by 20% of your queries, so focus on tuning that 20%.
- Verify that the execution plans produced by the query optimizer for these statements are reasonable and expected. Note: index selection is rule based unless the cost-based optimizer is in use. The cost-based optimizer is available in Couchbase Server 7.0 and later (Enterprise Edition) and takes key and index cardinality into account only when optimizer statistics are available.
- Implement corrective actions to generate better execution plans for the poorly performing N1QL statements.

The previous steps are repeated until the query performance reaches a satisfactory level or no more statements can be tuned.
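
The first step, finding the statements that dominate the workload, can be sketched as a simple ranking by total time. The statistics below are hypothetical numbers; in practice they might come from a source such as the `system:completed_requests` keyspace or from application metrics. The function name `heaviest_statements` is also an assumption for this example.

```python
def heaviest_statements(stats, share=0.8):
    """Return the statements that together account for `share` of total time.

    Each entry in `stats` needs `statement`, `count`, and `avg_ms` fields.
    Statements are ranked by count * avg_ms, largest first.
    """
    def cost(entry):
        return entry["count"] * entry["avg_ms"]

    total = sum(cost(s) for s in stats)
    picked, running = [], 0.0
    for entry in sorted(stats, key=cost, reverse=True):
        if running >= share * total:
            break
        picked.append(entry["statement"])
        running += cost(entry)
    return picked


# Hypothetical workload statistics for illustration.
workload = [
    {"statement": "Q1", "count": 1000, "avg_ms": 50},  # 50,000 ms in total
    {"statement": "Q2", "count": 10, "avg_ms": 200},   #  2,000 ms in total
    {"statement": "Q3", "count": 500, "avg_ms": 4},    #  2,000 ms in total
]
print(heaviest_statements(workload))  # ['Q1']
```

Here a single statement accounts for more than 80% of the total time, so it is the first candidate for tuning.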

For more details on optimizing your queries, you can check the [Learning Path on our Developer Portal](https://developer.couchbase.com/learn/n1ql-query-performance-guide).

## Exercise 3.1
- Create an index to cover the query: "SELECT name, url, city from \`travel-sample\`.inventory.hotel where country='United Kingdom'"
- Create an index to query airports that are over the altitude of 1000. You can look at the alt field inside geo.

## Solutions

In [None]:
# Covered Index on Hotels


In [None]:
# Explain the query to check the index in the query workbench


In [None]:
# Airports with altitude over 1000


In [None]:
# Explain the query to check the index in the query workbench


## References
- [Indexing in Couchbase](https://docs.couchbase.com/server/current/learn/services-and-indexes/indexes/global-secondary-indexes.html)
- [N1QL Query Performance Guide](https://developer.couchbase.com/learn/n1ql-query-performance-guide)