# Queries in PyTables

> Objectives:
> * Query HDF5 files without loading them in-memory
> * Query normalized and denormalized tables
> * Index columns in tables for accelerating queries

In [1]:
import os
import numpy as np
import pandas as pd
import tables

In [2]:
%ls -lh structuring compression

ls: compression: No such file or directory
ls: structuring: No such file or directory


## Querying in PyTables

### Denormalized tables

In [3]:
h5denorm = "structuring/no-compressed.h5"
h5file = tables.open_file(h5denorm)
h5lens = h5file.root.lens

IOError: ``structuring/no-compressed.h5`` does not exist

In [None]:
h5lens

In [None]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

In [None]:
ratings

In [None]:
h5file.close()

Querying denormalized tables is as easy as pie.  Let's see how to handle normalized ones.
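
The query above relies on PyTables finding the value of `rt` in the calling scope. A minimal sketch of the same counting loop, with the variable passed explicitly through the `condvars` argument of `Table.where`, might look like this. The toy table, its two columns, and the temporary file name are assumptions made for illustration; they are not the real `lens` table.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical miniature stand-in for the `lens` table (only two columns).
dtype = np.dtype([("title", "S32"), ("rating", "i4")])
rows = np.array(
    [
        (b"Tom and Huck (1995)", 3),
        (b"Tom and Huck (1995)", 4),
        (b"Toy Story (1995)", 5),
        (b"Tom and Huck (1995)", 3),
    ],
    dtype=dtype,
)

# Write the toy table into a throwaway file (assumed name).
toy_path = os.path.join(tempfile.mkdtemp(), "toy-lens.h5")
condition = "(title == b'Tom and Huck (1995)') & (rating == rt)"

with tables.open_file(toy_path, mode="w") as toy_file:
    toy_lens = toy_file.create_table(toy_file.root, "lens", obj=rows)
    counts = [0] * 6
    for rt in range(6):
        # `condvars` names the external variable explicitly; column names
        # such as `title` and `rating` are still resolved from the table.
        counts[rt] = sum(1 for _ in toy_lens.where(condition, condvars={"rt": rt}))

print(counts)
```

Passing `condvars` makes the query independent of where it is called from, which can help when the loop is moved into a function or a nested scope.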

### Normalized tables

In [None]:
h5norm = "compression/blosc-zstd-5-shuffle.h5"
h5file = tables.open_file(h5norm)
h5ratings = h5file.root.ratings
h5movies = h5file.root.movies

In [None]:
h5ratings

In [None]:
h5movies

In [None]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

In [None]:
ratings

In [None]:
h5file.close()

So the query against the normalized version is roughly 2-3x faster than against the denormalized file.  However, this is just a simple example; in general, you should experiment to determine the best layout for your data.
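
The normalized query is a two-step lookup: resolve the title to a `movie_id` in the small table, then filter the large table on that key. The sketch below shows the idea with plain Python lists standing in for the two tables. The data values are invented for illustration. It also resolves the key once, outside the counting step, since that lookup does not depend on the rating.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the `movies` and `ratings` tables.
movies = [
    {"movie_id": 1, "title": b"Toy Story (1995)"},
    {"movie_id": 8, "title": b"Tom and Huck (1995)"},
]
ratings_rows = [
    {"movie_id": 8, "rating": 3},
    {"movie_id": 1, "rating": 5},
    {"movie_id": 8, "rating": 4},
    {"movie_id": 8, "rating": 3},
]

# Step 1: resolve the title to its key once, in the small table.
target_id = next(
    m["movie_id"] for m in movies if m["title"] == b"Tom and Huck (1995)"
)

# Step 2: a single pass over the large table, keyed on the integer id.
histogram = Counter(
    r["rating"] for r in ratings_rows if r["movie_id"] == target_id
)
counts = [histogram.get(rt, 0) for rt in range(6)]
print(counts)
```

Comparing small integer keys is typically cheaper than comparing title strings on every row, which is one plausible reason the normalized layout can come out ahead here.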

## Indexing

Indexing is a general technique for adding data structures that can accelerate queries.  Let's see how PyTables makes use of this.
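
As a rough mental model, a sorted index keeps the column values in order together with the row numbers they came from, so a lookup becomes a binary search instead of a full scan. The sketch below illustrates that idea with the standard `bisect` module. It is an assumption-laden simplification and not how PyTables stores its indexes on disk.

```python
from bisect import bisect_left, bisect_right

# Hypothetical column of titles, in table (row) order.
titles = [
    b"Toy Story (1995)",
    b"Tom and Huck (1995)",
    b"Heat (1995)",
    b"Tom and Huck (1995)",
]

# Build the index: row numbers ordered by the value they point to.
order = sorted(range(len(titles)), key=lambda i: titles[i])
sorted_keys = [titles[i] for i in order]


def lookup(value):
    """Return the row numbers whose title equals `value`, via binary search."""
    lo = bisect_left(sorted_keys, value)
    hi = bisect_right(sorted_keys, value)
    return sorted(order[lo:hi])


print(lookup(b"Tom and Huck (1995)"))
print(lookup(b"Missing (2000)"))
```

Building the index costs a sort up front, which is why index creation is timed separately from the queries in the cells that follow.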

### Denormalized case

In [None]:
# Copy the original PyTables table into another file
import shutil
h5idx = "movielens-denorm-indexed.h5"
if os.path.exists(h5idx):
    os.unlink(h5idx)
shutil.copyfile(h5denorm, h5idx)

In [None]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx, mode="a")

In [None]:
# Create an index for the 'title' column
h5lens = h5i.root.lens
blosc_filter = tables.Filters(complevel=9, complib="blosc")
%time h5lens.cols.title.create_csindex(filters=blosc_filter)

In [None]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

OK, so this query takes about 100x less time than it did without an index.  What if we index the `rating` column too?

In [None]:
ratings

In [None]:
# Create an index for the rating column
%time h5lens.cols.rating.create_csindex(filters=blosc_filter)

In [None]:
%%time
ratings = [0] * 6
for rt in range(0,6):
    ratings[rt] = sum(1 for r in h5lens.where("(title == b'Tom and Huck (1995)') & (rating == rt)"))

Although the gain is small, this is another improvement in performance.
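
To see which columns are indexed, and which indexes a given condition could use, PyTables exposes `Column.is_indexed` and `Table.will_query_use_indexing`. The sketch below tries both on a tiny throwaway table. The table contents and file name are assumptions for illustration, and only the `title` column is indexed here.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical toy table; the real `lens` table has more columns and rows.
dtype = np.dtype([("title", "S32"), ("rating", "i4")])
rows = np.array(
    [(b"Tom and Huck (1995)", 3), (b"Toy Story (1995)", 5)], dtype=dtype
)

toy_path = os.path.join(tempfile.mkdtemp(), "toy-indexed.h5")
with tables.open_file(toy_path, mode="w") as toy_file:
    toy_lens = toy_file.create_table(toy_file.root, "lens", obj=rows)

    # Create a completely sorted index on `title` only.
    toy_lens.cols.title.create_csindex()

    title_indexed = toy_lens.cols.title.is_indexed
    rating_indexed = toy_lens.cols.rating.is_indexed

    # Ask which indexed columns this condition could make use of.
    usable = toy_lens.will_query_use_indexing(
        "(title == b'Tom and Huck (1995)') & (rating == rt)",
        condvars={"rt": 3},
    )
    print(title_indexed, rating_indexed, sorted(usable))
```

Running the same check before and after each `create_csindex` call is one way to confirm that a timing change really comes from the index.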

In [None]:
ratings

In [None]:
h5i.close()

### Normalized case

In [None]:
# Copy the original PyTables table into another file
import shutil
h5idx = "movielens-norm-indexed.h5"
if os.path.exists(h5idx):
    os.unlink(h5idx)
shutil.copyfile(h5norm, h5idx)

In [None]:
# Open the new file in 'a'ppend mode
h5i = tables.open_file(h5idx, mode="a")
h5ratings = h5i.root.ratings
h5movies = h5i.root.movies

In [None]:
# Create an index for the rating column
blosc_filter = tables.Filters(complevel=9, complib="blosc")
%time h5ratings.cols.rating.create_csindex(filters=blosc_filter)

In [None]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

Hmm, in this case indexing the `rating` column has not accelerated the query (at first sight, at least).

In [None]:
ratings

In [None]:
# Create an index for the movie_id column
%time h5ratings.cols.movie_id.create_csindex(filters=blosc_filter)

In [None]:
%%time
ratings = [0] * 6
for rt in range(6):
    th_movie_id = [r['movie_id'] for r in h5movies.where("(title == b'Tom and Huck (1995)')")][0]
    ratings[rt] = sum(1 for r in h5ratings.where("(movie_id == th_movie_id) & (rating == rt)"))

This time the query is noticeably faster, but it still cannot compete with the query speed of the denormalized case (which is ~10x faster).

In [None]:
ratings

In [None]:
h5i.close()

In [None]:
%ls -lh movielens*

## Exercise

We have not created an index for the `title` column in the normalized case.  Create such an index and determine whether there is a noticeable speed-up.  Explain why you think that is the case.  Note: the times for a cold query can be **significantly** different from those for a hot query.
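
One simple way to compare a cold query (first run, nothing cached yet) with hot ones (repeated runs) is to time several consecutive calls and look at the first separately. The helper below is a hypothetical sketch: `time_runs` and `toy_query` are illustrative names, and the toy function only imitates caching. For the exercise, you would pass a function wrapping your own PyTables query instead.

```python
import time


def time_runs(query, repeats=3):
    """Return the elapsed seconds of each consecutive call to `query`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        query()
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical stand-in for a query: the first call does the work,
# later calls reuse the cached result.
cache = {}


def toy_query():
    if "result" not in cache:
        cache["result"] = sum(range(200_000))
    return cache["result"]


timings = time_runs(toy_query)
print("cold: %.6fs, hot (best): %.6fs" % (timings[0], min(timings[1:])))
```

Note that rerunning a cell does not make a query cold again: the operating system may still hold the file's pages in memory, so a truly cold measurement usually needs a fresh file or a cleared cache.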