testqueries.md
Test 1: ALTER TABLE and UPDATE

Summary - Fail

    1. Create a table and alter the table or update the data

For updating data in place, you need a table that supports transactions. Start with a plain ORC table and see what works:

CREATE TABLE test4.employee (id int, name varchar, salary int)
WITH (
   format='ORC'
   );

INSERT INTO test4.employee VALUES
(1, 'Jerry', 5000),
(2, 'Tom',   8000),
(3, 'Kate',  6000);

The insert works.

Trying to alter the table - works

ALTER TABLE test4.employee ADD column description varchar;

Trying to update the table - does not work; not supported yet

Note that Presto/Trino cannot create or write to Hive transactional tables yet. You can create and write to Hive transactional tables via Hive, or via Spark with the Hive ACID Data Source plugin, and then use Presto/Trino to read these tables: https://trino.io/blog/2020/06/01/hive-acid.html
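For reference, this is roughly the Hive-side DDL (run from Hive or beeline, not Trino) that creates a full ACID table Trino can then read; a sketch, assuming a Hive 3.x metastore, and the table name employee_acid is hypothetical:

-- Hive DDL (not Trino); full ACID tables must be stored as ORC
CREATE TABLE employee_acid (id int, name string, salary int)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

Back to the update attempt from Trino: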

update test4.employee  set description ='Employee Jerry description' where id=1;

Query 20220622_105039_00034_7tuyy failed: Hive update is only supported for ACID transactional tables

Testing DELETE on a non-transactional table - not working

trino:default> delete from test4.employee2 where id=1;
Query 20220629_195118_00005_q8zcg failed: Deletes must match whole partitions for non-transactional tables
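The error message hints at what is allowed: on a partitioned, non-transactional table, a DELETE whose predicate covers entire partitions does work. A sketch, where employee_part is a hypothetical partitioned table (not part of this test):

CREATE TABLE test4.employee_part (id int, name varchar, dept varchar)
WITH (format = 'ORC', partitioned_by = ARRAY['dept']);

-- The predicate matches a whole partition, so the Hive connector accepts it
DELETE FROM test4.employee_part WHERE dept = 'sales';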

Trying to create a Table with transactional support for testing Updates

trino:default> CREATE TABLE test4.employee5 (id int, name varchar, salary int)
            -> WITH (
            ->    format='ORC',
            ->    transactional=true
            ->    );
CREATE TABLE
trino:default> ALTER TABLE test4.employee5 ADD column description varchar;
ADD COLUMN
    1. Inserting into a transactional table - not working; not possible from Trino now
trino:default> INSERT INTO test4.employee5 VALUES
            -> (1, 'Jerry', 5000,  'test'),
            -> (2, 'Tom',   8000, 'test'),
            -> (3, 'Kate',  6000, 'test');
INSERT: 3 rows

Query 20220623_043713_00027_7tuyy, FAILED, 3 nodes
Splits: 8 total, 8 done (100.00%)
10.69 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20220623_043713_00027_7tuyy failed: Invalid method name: 'alter_table_req'

Bug raised - trinodb/trino#12949 - before I knew this was a restriction. (The alter_table_req Thrift method apparently exists only in Hive Metastore 3.x, which transactional writes require; an older metastore rejects the call.)

Test 2: Load Training Data as CSV and Query with Presto

Followed - https://towardsdatascience.com/load-and-query-csv-file-in-s3-with-presto-b0d50bc773c9

Downloaded part of the data from https://catalog.data.gov/dataset/2018-yellow-taxi-trip-data and uploaded it to S3 (MinIO bucket) at this particular location:

s3a://test/warehouse/nyc_text.db/tlc_yellow_trips_2018

(Screenshot: CSV upload location in the MinIO browser)

Note that this path is where we will create the schema and table in the next step.

Create SCHEMA

CREATE SCHEMA nyc_text;

Since we have given the metastore a default path in S3, this will create the DB in S3 at s3a://test/warehouse/nyc_text.db.
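If the metastore did not have a default S3 path configured, the schema location could also be set explicitly; a sketch of the same CREATE SCHEMA with an explicit location:

CREATE SCHEMA nyc_text WITH (location = 's3a://test/warehouse/nyc_text.db');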

Create the table; the external_location aligns with the path where we uploaded the CSV.

CREATE TABLE nyc_text.tlc_yellow_trips_2018 (
    vendorid VARCHAR,
    tpep_pickup_datetime VARCHAR,
    tpep_dropoff_datetime VARCHAR,
    passenger_count VARCHAR,
    trip_distance VARCHAR,
    ratecodeid VARCHAR,
    store_and_fwd_flag VARCHAR,
    pulocationid VARCHAR,
    dolocationid VARCHAR,
    payment_type VARCHAR,
    fare_amount VARCHAR,
    extra VARCHAR,
    mta_tax VARCHAR,
    tip_amount VARCHAR,
    tolls_amount VARCHAR,
    improvement_surcharge VARCHAR,
    total_amount VARCHAR)
WITH (FORMAT = 'CSV',
    skip_header_line_count = 1,
    EXTERNAL_LOCATION = 's3a://test/warehouse/nyc_text.db/tlc_yellow_trips_2018')
;

Try to query the data out - works

trino:default> select * from hive.nyc_text.tlc_yellow_trips_2018;
vendorid |  tpep_pickup_datetime  | tpep_dropoff_datetime  | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type | >
----------+------------------------+------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+->
2        | 06/29/2018 02:38:33 PM | 06/29/2018 03:29:33 PM | 2               | 10.46         | 1          | N                  | 140          | 61           | 2            | >
2        | 06/29/2018 02:05:23 PM | 06/29/2018 02:15:41 PM | 5               | 2.02          | 1          | N                  | 249          | 13           | 1            | >
2        | 06/29/2018 02:20:52 PM | 06/29/2018 02:51:28 PM | 5               | 4.32          | 1          | N                  | 13           | 4            | 1            | >

Convert this to ORC or Parquet, since each field in CSV is read as VARCHAR. Apache Parquet is a columnar data format that stores and compresses data very efficiently compared with, say, CSV or text. Column-based storage also makes queries much faster.

CREATE SCHEMA hive.nyc_parq;

CREATE TABLE hive.nyc_parq.tlc_yellow_trips_2018
COMMENT '2018 New York City taxi data'
WITH (FORMAT = 'PARQUET')
AS
SELECT 
    cast(vendorid as INTEGER) as vendorid,
    date_parse(tpep_pickup_datetime, '%m/%d/%Y %h:%i:%s %p') as tpep_pickup_datetime,
    date_parse(tpep_dropoff_datetime, '%m/%d/%Y %h:%i:%s %p') as tpep_dropoff_datetime,
    cast(passenger_count as SMALLINT) as passenger_count,
    cast(trip_distance as DECIMAL(8, 2)) as trip_distance,
    cast(ratecodeid as INTEGER) as ratecodeid,
    cast(store_and_fwd_flag as CHAR(1)) as store_and_fwd_flag,
    cast(pulocationid as INTEGER) as pulocationid,
    cast(dolocationid as INTEGER) as dolocationid,
    cast(payment_type as SMALLINT) as payment_type,
    cast(fare_amount as DECIMAL(8, 2)) as fare_amount,
    cast(extra as DECIMAL(8, 2)) as extra,
    cast(mta_tax as DECIMAL(8, 2)) as mta_tax,
    cast(tip_amount as DECIMAL(8, 2)) as tip_amount,
    cast(tolls_amount as DECIMAL(8, 2)) as tolls_amount,
    cast(improvement_surcharge as DECIMAL(8, 2)) as improvement_surcharge,
    cast(total_amount as DECIMAL(8, 2)) as total_amount
FROM hive.nyc_text.tlc_yellow_trips_2018;

The conversion is successful

trino:default> DESCRIBE nyc_parq.tlc_yellow_trips_2018;
        Column         |     Type     | Extra | Comment 
-----------------------+--------------+-------+---------
 vendorid              | integer      |       |         
 tpep_pickup_datetime  | timestamp(3) |       |         
 tpep_dropoff_datetime | timestamp(3) |       |         
 passenger_count       | smallint     |       |         
 trip_distance         | decimal(8,2) |       |         
 ratecodeid            | integer      |       |         
 store_and_fwd_flag    | char(1)      |       |         
 pulocationid          | integer      |       |         
 dolocationid          | integer      |       |         
 payment_type          | smallint     |       |         
 fare_amount           | decimal(8,2) |       |         
 extra                 | decimal(8,2) |       |         
 mta_tax               | decimal(8,2) |       |         
 tip_amount            | decimal(8,2) |       |         
 tolls_amount          | decimal(8,2) |       |         
 improvement_surcharge | decimal(8,2) |       |         
 total_amount          | decimal(8,2) |       |         

Let's see how fast we can query; I am running MinIO on my laptop on an SSD.

select * from nyc_parq.tlc_yellow_trips_2018 ;
vendorid |  tpep_pickup_datetime   |  tpep_dropoff_datetime  | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type >
----------+-------------------------+-------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+-------------->
       2 | 2018-06-28 11:07:27.000 | 2018-06-28 11:28:54.000 |               1 |          2.26 |          1 | N                  |          140 |          161 |            1 >
       2 | 2018-06-28 11:44:03.000 | 2018-06-28 11:48:00.000 |               1 |          0.64 |          1 | N                  |          239 |          238 |            2 >
       2 | 2018-06-28 11:52:42.000 | 2018-06-28 12:07:38.000 |               1 
...
Query 20220623_125710_00093_7tuyy, RUNNING, 2 nodes
Splits: 5 total, 5 done (100.00%)
5:00 [754K rows, 11.8MB] [2.51K rows/s, 40.3KB/s]
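Since Parquet is columnar, a query that touches only the columns it needs scans far fewer bytes than SELECT *; for example, this aggregate reads just the trip_distance column:

SELECT avg(trip_distance) FROM nyc_parq.tlc_yellow_trips_2018;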

Let's estimate the average trip time per pickup and drop-off location pair using this data

SELECT
   pulocationid,
   dolocationid,
   CAST(AVG(date_diff('minute', tpep_pickup_datetime, tpep_dropoff_datetime)) AS integer) AS difference
FROM nyc_parq.tlc_yellow_trips_2018
GROUP BY pulocationid, dolocationid

 pulocationid | dolocationid | difference 
--------------+--------------+------------
           48 |           90 |         15 
          170 |          162 |         14 
          140 |          236 |          5 
          107 |           90 |         10 
          186 |          162 |         14 
          249 |           13 |         12 
           13 |          231 |         20 

Another, more complex query (adapted from another site). As you can see, you need to know SQL well to analyze data. Still, this is easier for most data analysts than, say, coding in Python or R, and definitely so for business analysts.


trino:default> SELECT DISTINCT passenger_count
            ->     , ROUND (SUM (fare_amount),0) as TotalFares
            ->     , ROUND (AVG (fare_amount),0) as AvgFares
            -> FROM nyc_parq.tlc_yellow_trips_2018
            -> GROUP BY passenger_count
            -> ORDER BY  AvgFares DESC
            -> ;
 passenger_count | TotalFares | AvgFares 
-----------------+------------+----------
               9 |     399.00 |    67.00 
               7 |      86.00 |    29.00 
               8 |     111.00 |    28.00 
               4 |  246091.00 |    15.00 
               3 |  466199.00 |    14.00 
               2 | 1585505.00 |    14.00 
               6 |  276720.00 |    13.00 
               0 |   85157.00 |    13.00 
               5 |  458260.00 |    13.00 
               1 | 6997287.00 |    13.00 
(10 rows)

Query 20220623_133806_00111_7tuyy, FINISHED, 2 nodes
Splits: 13 total, 13 done (100.00%)
50.37 [754K rows, 3.61MB] [15K rows/s, 73.5KB/s]

You can see this relation visualized in Redash (screenshot: redash_vi).

Let's now try something more complex: predicting the trip fare based on day of week and time of day.

select date_format(tpep_pickup_datetime, '%Y%m%d %H %i %a'),fare_amount from nyc_parq.tlc_yellow_trips_2018 limit 10;
       _col0        | fare_amount 
--------------------+-------------
 20180630 09 42 Sat |        7.00 
 20180630 08 22 Sat |       52.00 

Building on the above

Select dayofweek, AVG(fare_amount) FROM
 ( select fare_amount, date_format(tpep_pickup_datetime, '%a') as dayofweek 
     from nyc_parq.tlc_yellow_trips_2018)
  group by dayofweek ;
 dayofweek | _col1 
-----------+-------
 Thu       | 13.72 
 Wed       | 17.20 
 Fri       | 14.14 
 Sat       | 12.31 
 Sun       |  9.00 

Note that I do not have the entire dataset, only a truncated version; hence the results are missing certain days, for example Monday.
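The goal above also mentioned time of day; extending the query to group by both day of week and hour is straightforward, for example:

SELECT date_format(tpep_pickup_datetime, '%a') AS dayofweek,
       hour(tpep_pickup_datetime) AS hourofday,
       round(avg(fare_amount), 2) AS avg_fare
FROM nyc_parq.tlc_yellow_trips_2018
GROUP BY date_format(tpep_pickup_datetime, '%a'), hour(tpep_pickup_datetime);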


Upload Data in Parquet and Load as Table

The new data published by NYC for the taxi service is given as Parquet files. Let's see how we can upload the parquet file to S3 and connect it to a table. It turns out that once you figure out the correct data types to use in CREATE TABLE, it is pretty simple.

The data is got from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

We first create a schema, then create the following path in S3 and upload the data there.

CREATE SCHEMA nyc_in_parquet;

Upload yellow_tripdata_2022-01.parquet to

s3a://test/warehouse/nyc_in_parquet.db/tlc_yellow_trip_2022/yellow_tripdata_2022-01.parquet

Let's inspect this file:

$ pip install parquet-tools

$ parquet-tools inspect yellow_tripdata_2022-01.parquet 

############ file meta data ############
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 19
num_rows: 2463931
num_row_groups: 1
format_version: 1.0
serialized_size: 10386


############ Columns ############
VendorID
tpep_pickup_datetime
tpep_dropoff_datetime
passenger_count
trip_distance
RatecodeID
store_and_fwd_flag
PULocationID
DOLocationID
payment_type
fare_amount
extra
mta_tax
tip_amount
tolls_amount
improvement_surcharge
total_amount
congestion_surcharge
airport_fee

############ Column(VendorID) ############
name: VendorID
path: VendorID
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 47%)

############ Column(tpep_pickup_datetime) ############
name: tpep_pickup_datetime
path: tpep_pickup_datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: GZIP (space_saved: 52%)

############ Column(tpep_dropoff_datetime) ############
name: tpep_dropoff_datetime
path: tpep_dropoff_datetime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: GZIP (space_saved: 51%)

############ Column(passenger_count) ############
name: passenger_count
path: passenger_count
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 53%)

############ Column(trip_distance) ############
name: trip_distance
path: trip_distance
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 10%)

############ Column(RatecodeID) ############
name: RatecodeID
path: RatecodeID
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 61%)

############ Column(store_and_fwd_flag) ############
name: store_and_fwd_flag
path: store_and_fwd_flag
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 55%)

############ Column(PULocationID) ############
name: PULocationID
path: PULocationID
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 25%)

############ Column(DOLocationID) ############
name: DOLocationID
path: DOLocationID
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 12%)

############ Column(payment_type) ############
name: payment_type
path: payment_type
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 55%)

############ Column(fare_amount) ############
name: fare_amount
path: fare_amount
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 27%)

############ Column(extra) ############
name: extra
path: extra
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 65%)

############ Column(mta_tax) ############
name: mta_tax
path: mta_tax
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 69%)

############ Column(tip_amount) ############
name: tip_amount
path: tip_amount
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 30%)

############ Column(tolls_amount) ############
name: tolls_amount
path: tolls_amount
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 80%)

############ Column(improvement_surcharge) ############
name: improvement_surcharge
path: improvement_surcharge
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 63%)

############ Column(total_amount) ############
name: total_amount
path: total_amount
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 18%)

############ Column(congestion_surcharge) ############
name: congestion_surcharge
path: congestion_surcharge
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 62%)

############ Column(airport_fee) ############
name: airport_fee
path: airport_fee
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 58%)

We map the physical types above to the corresponding SQL types in the CREATE TABLE below (INT64 to INTEGER, DOUBLE to DOUBLE, Timestamp to TIMESTAMP, String to CHAR/VARCHAR).

CREATE TABLE nyc_in_parquet.tlc_yellow_trip_2022 (
    vendorid INTEGER,
    tpep_pickup_datetime TIMESTAMP,
    tpep_dropoff_datetime TIMESTAMP,
    passenger_count DOUBLE,
    trip_distance DOUBLE,
    ratecodeid DOUBLE,
    store_and_fwd_flag CHAR(1),
    pulocationid INTEGER,
    dolocationid INTEGER,
    payment_type INTEGER,
    fare_amount DOUBLE,
    extra DOUBLE,
    mta_tax DOUBLE,
    tip_amount DOUBLE,
    tolls_amount DOUBLE,
    improvement_surcharge DOUBLE,
    total_amount DOUBLE,
    congestion_surcharge DOUBLE,
    airport_fee DOUBLE)
WITH (FORMAT = 'PARQUET',
    EXTERNAL_LOCATION = 's3a://test/warehouse/nyc_in_parquet.db/tlc_yellow_trip_2022')
;

That's it; let's query the data and see.

SELECT count(*) FROM  nyc_in_parquet.tlc_yellow_trip_2022 ;
  _col0  
---------
 2463931 
(1 row)

Query 20220625_100612_00061_7tuyy, FINISHED, 2 nodes
Splits: 4 total, 4 done (100.00%)
13.60 [2.46M rows, 16KB] [181K rows/s, 1.18KB/s]

Cool: 2.5 million rows in this small 37 MB parquet file.

SELECT * FROM  nyc_in_parquet.tlc_yellow_trip_2022 LIMIT 10;
 vendorid |  tpep_pickup_datetime   |  tpep_dropoff_datetime  | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type >
----------+-------------------------+-------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+-------------->
        1 | 2022-01-01 00:35:40.000 | 2022-01-01 00:53:29.000 |             2.0 |           3.8 |        1.0 | N                  |          142 |          236 |            1 >
        1 | 2022-01-01 00:33:43.000 | 2022-01-01 00:42:07.000 |             1.0 |           2.1 |        1.0 | N                  |          236 |           42 |            1 >
        2 | 2022-01-01 00:53:21.000 | 2022-01-01 01:02:19.000 |             1.0 
        ...

Let's run a complex query

Select dayofweek, AVG(fare_amount) FROM
 ( select fare_amount, date_format(tpep_pickup_datetime, '%a') as dayofweek 
     from nyc_in_parquet.tlc_yellow_trip_2022)
  group by dayofweek ;

 dayofweek |       _col1        
-----------+--------------------
 Wed       | 12.290740850208424 
 Mon       | 13.422371442169597 
 Tue       | 12.531063256070471 
 Thu       | 12.408311232285685 
 Sat       |  12.51759075286025 
 Fri       | 13.581231240928696 
 Sun       | 13.811130821498017 
(7 rows)

Query 20220625_230058_00076_7tuyy, FINISHED, 2 nodes
Splits: 7 total, 7 done (100.00%)
17.92 [2.46M rows, 11.4MB] [138K rows/s, 652KB/s]

The output arrives in about 18 seconds; not bad for 2.4 million rows.
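The unlabeled _col1 column and the long doubles can be tidied with an alias and round(); a variant of the same query:

SELECT dayofweek, round(avg(fare_amount), 2) AS avg_fare
FROM (SELECT fare_amount, date_format(tpep_pickup_datetime, '%a') AS dayofweek
      FROM nyc_in_parquet.tlc_yellow_trip_2022)
GROUP BY dayofweek;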

Here is another

select t.range, count(*) as "Number of Occurrences", ROUND(AVG(fare_amount),2) as "Avg",
  ROUND(MAX(fare_amount),2) as "Max" ,ROUND(MIN(fare_amount),2) as "Min"
from (
  select 
   case 
      when trip_distance between  0 and  9 then ' 0-9 '
      when trip_distance between 10 and 19 then '10-19'
      when trip_distance between 20 and 29 then '20-29'
      when trip_distance between 30 and 39 then '30-39'
      else '> 39' 
   end as range ,fare_amount
  from nyc_in_parquet.tlc_yellow_trip_2022) t
group by t.range;

 range | Number of Occurrences |  Avg   |    Max    |  Min   
-------+-----------------------+--------+-----------+--------
  0-9  |               2274195 |  10.34 | 401092.32 | -480.0 
 30-39 |                  1120 | 102.51 |     280.0 | -159.5 
 10-19 |                126672 |  43.45 |     413.5 | -200.0 
 > 39  |                 42716 |  38.83 |     668.0 | -388.0 
 20-29 |                 19228 |  58.08 |     250.0 | -160.0 
(5 rows)

Note that there are some bad values here, as you can see from the Max and Min columns; adjusting for that:

  select t.range, count(*) as "Number of Occurrences", ROUND(AVG(fare_amount),2) as "Avg",
    ROUND(MAX(fare_amount),2) as "Max" ,ROUND(MIN(fare_amount),2) as "Min" 
  from (
    select 
    case 
        when trip_distance between  0 and  9 then ' 0-9 '
        when trip_distance between 10 and 19 then '10-19'
        when trip_distance between 20 and 29 then '20-29'
        when trip_distance between 30 and 39 then '30-39'
        else '> 39' 
    end as range ,fare_amount 
    from nyc_in_parquet.tlc_yellow_trip_2022) t
    where fare_amount > 1 and fare_amount < 401092
  group by t.range;

 range | Number of Occurrences |  Avg   |  Max  | Min  
-------+-----------------------+--------+-------+------
  0-9  |               2260865 |  10.28 | 720.0 | 1.11 
 30-39 |                  1107 | 104.28 | 280.0 |  5.0 
 10-19 |                126136 |   43.8 | 413.5 |  2.0 
 > 39  |                 42556 |  39.11 | 668.0 | 1.99 
 20-29 |                 19133 |  58.62 | 250.0 |  2.5 
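As an aside, Trino's width_bucket function can express the same 10-mile bucketing more compactly, returning bucket numbers instead of labels (the bucket edges differ slightly from the CASE version); a sketch:

SELECT width_bucket(trip_distance, 0, 40, 4) AS bucket,
       count(*) AS occurrences,
       round(avg(fare_amount), 2) AS avg_fare
FROM nyc_in_parquet.tlc_yellow_trip_2022
WHERE fare_amount > 1 AND fare_amount < 401092
GROUP BY width_bucket(trip_distance, 0, 40, 4);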

Next, let's create, insert, and query from Python using a Python SQL client library, and then see how the query results can be loaded into a Pandas DataFrame.

You can find the code for this in the \python folder.