# Tutorial - QFrame

## What is a QFrame?
QFrame is a class that generates an SQL statement. It stores information about the fields in the `QFrame.data` attribute, which is a dictionary.

## How to create a QFrame?
You can create a QFrame manually, by passing the data directly to QFrame, or automatically, by using the `QFrame.from_table()` method. Before you create a QFrame you have to set up your ODBC connection. If you have set up the `dsn` (data source name) parameter in the grizly configuration file, you only need to pass `dsn` when you create a QFrame. Otherwise you should also specify the `dialect` and `db` parameters.

In [1]:
from grizly import (
    get_path, 
    QFrame
)

In [2]:
qf = QFrame(dsn="redshift_acoe")
qf.from_table(schema="administration", table="table_tutorial")

print(qf)

SELECT "col1",
       "col2",
       "col3",
       "col4"
FROM administration.table_tutorial


You can also specify which columns you want to pull from the database by using the `columns` parameter.

In [3]:
qf_c = QFrame(dsn="redshift_acoe")
qf_c.from_table(schema="administration", table="table_tutorial", columns=["col1", "col3"])

print(qf_c)

SELECT "col1",
       "col3"
FROM administration.table_tutorial
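
To illustrate what automatic creation involves, here is a minimal sketch that reads column names from a database and builds a `SELECT` statement from them. It uses Python's built-in `sqlite3` module and an in-memory table. The helper `select_from_table` and the sample table definition are hypothetical, and this is not grizly's implementation.

```python
import sqlite3


def select_from_table(conn, table, columns=None):
    """Build a SELECT statement from the table's column metadata.

    Hypothetical helper for illustration; not part of grizly.
    """
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    available = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if columns is None:
        columns = available
    missing = [c for c in columns if c not in available]
    if missing:
        raise ValueError(f"Columns not found in {table}: {missing}")
    field_list = ",\n       ".join(f'"{c}"' for c in columns)
    return f"SELECT {field_list}\nFROM {table}"


# Sample in-memory table; the column types are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_tutorial (col1 TEXT, col2 REAL, col3 TEXT, col4 REAL)")

sql_all = select_from_table(conn, "table_tutorial")
sql_some = select_from_table(conn, "table_tutorial", columns=["col1", "col3"])
print(sql_some)
```

Passing `columns` only narrows the list that was read from the database, so a misspelled column name fails early instead of producing invalid SQL.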


## Working with the QFrame
There are many methods that you can use to edit a QFrame. You can look them up in the QFrame docs. This tutorial shows only some of them.

### Checking QFrame size

You can check how many rows the query generated by your QFrame returns by using Python's built-in `len()` function.

In [4]:
len(qf)

2
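
One way an object can support `len()` is to wrap its query in a `COUNT(*)` subquery. The sketch below shows that idea with `sqlite3`. The class `CountableQuery` and the sample rows are hypothetical, and grizly may compute the length differently.

```python
import sqlite3

# Sample in-memory data, made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_tutorial (col1 TEXT, col2 REAL)")
conn.executemany("INSERT INTO table_tutorial VALUES (?, ?)", [("a", 1.0), ("b", 2.0)])


class CountableQuery:
    """Hypothetical stand-in for a query object that supports len()."""

    def __init__(self, conn, sql):
        self.conn = conn
        self.sql = sql

    def __len__(self):
        # Wrap the query in a subquery and let the database count the rows.
        cursor = self.conn.execute(f"SELECT COUNT(*) FROM ({self.sql}) sq")
        return cursor.fetchone()[0]


query = CountableQuery(conn, 'SELECT "col1", "col2" FROM table_tutorial')
row_count = len(query)
```

Counting in the database avoids pulling all rows into Python just to measure the result.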

### Doing some basic SQL stuff
Let's now add a `where` statement, rename some fields, add a calculated field and remove some fields.

In [5]:
qf.query("col2 > 1") # <- where

qf.rename({"col1": "items", "col2": "price"})

qf.assign(calculated_field="col4*2", 
          type='num', 
          custom_type='double precision')

qf.remove(["col3", "col4"])

print(qf)

SELECT "col1" AS "items",
       "col2" AS "price",
       col4*2 AS "calculated_field"
FROM administration.table_tutorial
WHERE col2 > 1
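
To make the effect of these calls concrete, the sketch below applies the same four steps to a plain dictionary shaped like the `fields` structure described later in this tutorial. It only illustrates the idea: how grizly records each change internally is an assumption here, and the `where` variable is a hypothetical stand-in for the `where` key.

```python
# Plain-dictionary sketch; this is not grizly code.
fields = {
    "col1": {"type": "dim"},
    "col2": {"type": "num"},
    "col3": {"type": "dim"},
    "col4": {"type": "num"},
}

# query("col2 > 1"): remember the WHERE condition.
where = "col2 > 1"

# rename(...): store the alias under the "as" key of each field.
for column, alias in {"col1": "items", "col2": "price"}.items():
    fields[column]["as"] = alias

# assign(...): a calculated field keeps its SQL under "expression".
fields["calculated_field"] = {
    "type": "num",
    "expression": "col4*2",
    "custom_type": "double precision",
}

# remove(...): drop the fields from the selection.
for column in ["col3", "col4"]:
    del fields[column]

selected = list(fields)
```

Note that removing `col4` from the selection does not affect the expression `col4*2`, which matches the SQL printed above.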


### Assigning many expressions

You can also assign many expressions at the same time using a dictionary.

In [6]:
new_fields = {f"string_field {i}": f"'{i}'" for i in range(3)}

qf.assign(**new_fields, type="dim", custom_type="VARCHAR(5)")

print(qf)

SELECT "col1" AS "items",
       "col2" AS "price",
       col4*2 AS "calculated_field",
       '0' AS "string_field_0",
       '1' AS "string_field_1",
       '2' AS "string_field_2"
FROM administration.table_tutorial
WHERE col2 > 1
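
The call above relies on Python's `**` dictionary unpacking, which also accepts keys that are not valid identifiers (for example names containing spaces). The sketch below shows the mechanism with a hypothetical stand-in function named `assign`. It is not grizly's method and only separates the options from the field expressions.

```python
def assign(**kwargs):
    """Hypothetical stand-in: split shared options from field expressions."""
    options = {key: kwargs.pop(key) for key in ("type", "custom_type") if key in kwargs}
    return kwargs, options


# Keys with spaces cannot be written as keyword arguments directly,
# but they can be passed through ** unpacking.
new_fields = {f"string_field {i}": f"'{i}'" for i in range(3)}

expressions, options = assign(**new_fields, type="dim", custom_type="VARCHAR(5)")
```

Every key that is not a known option is treated as a new field name, so one call can define any number of fields that share the same `type` and `custom_type`.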


We created three new fields with spaces using a dictionary comprehension. We will remove these fields as we won't need them in the next sections.

In [7]:
qf.remove(new_fields.keys())

<grizly.tools.qframe.QFrame at 0x7ff38c41ca00>

### Forking

Forking a QFrame is useful when your data workflow needs to take the same SQL table and apply different transformations to it.

Sometimes we want to fork, apply some transformations, and then union the QFrames back together, which results in an append operation on the data side.

Let's create two copies of one QFrame.

In [8]:
qf1 = qf.copy()
qf2 = qf.copy()
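
Forking only works if the copies do not share nested state. The sketch below uses plain dictionaries and the standard `copy` module to show the difference between a shallow and a deep copy. Whether `QFrame.copy()` is implemented as a deep copy is an assumption here; the sketch only illustrates why independence matters.

```python
import copy

# A nested dictionary shaped like a small QFrame.data (illustrative only).
original = {
    "select": {
        "table": "table_tutorial",
        "fields": {"col2": {"type": "num", "as": "price"}},
    }
}

shallow = copy.copy(original)   # shares the nested dictionaries
fork = copy.deepcopy(original)  # fully independent copy

# Changing the deep copy leaves the original untouched.
fork["select"]["fields"]["col2"]["as"] = "price_1"
alias_after_fork_edit = original["select"]["fields"]["col2"]["as"]

# Changing the shallow copy leaks into the original.
shallow["select"]["fields"]["col2"]["as"] = "price_2"
alias_after_shallow_edit = original["select"]["fields"]["col2"]["as"]
```

With a deep copy, each fork can be renamed and filtered on its own before the results are combined again.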

## Unioning data

There are two ways of unioning QFrames: by the position of the fields or by the final names of the columns (that is, their aliases).

In [9]:
from grizly import union

qf1.rename({"col2": "price_1", "calculated_field": "price_2"})
qf2.rename({"col2": "price_2", "calculated_field": "price_1"})

<grizly.tools.qframe.QFrame at 0x7ff38c3ef4f0>

#### Union by the position

In [10]:
uqf_pos = union(qframes=[qf1, qf2], union_type="UNION ALL", union_by='position')
print(uqf_pos)

SELECT "col1" AS "items",
       "col2" AS "price_1",
       col4*2 AS "price_2"
FROM administration.table_tutorial
WHERE col2 > 1
UNION ALL
SELECT "col1" AS "items",
       "col2" AS "price_2",
       col4*2 AS "price_1"
FROM administration.table_tutorial
WHERE col2 > 1


#### Union by the column names

In [11]:
uqf_name = union(qframes=[qf1, qf2], union_type="UNION ALL", union_by='name')
print(uqf_name)

SELECT "col1" AS "items",
       "col2" AS "price_1",
       col4*2 AS "price_2"
FROM administration.table_tutorial
WHERE col2 > 1
UNION ALL
SELECT "col1" AS "items",
       col4*2 AS "price_1",
       "col2" AS "price_2"
FROM administration.table_tutorial
WHERE col2 > 1


You can see that in this case `union` changes the order of the columns in the second QFrame so that its aliases line up with those of the first.
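
The reordering for a union by name can be sketched in a few lines of plain Python. The function `align_by_name` is hypothetical and only models fields as `(expression, alias)` pairs; it is not how grizly represents them.

```python
def align_by_name(reference, other):
    """Reorder `other` so its aliases follow the alias order of `reference`.

    Both arguments are lists of (expression, alias) pairs.
    """
    by_alias = {alias: expression for expression, alias in other}
    reference_aliases = [alias for _, alias in reference]
    if set(by_alias) != set(reference_aliases):
        raise ValueError("Both selections must expose the same aliases")
    return [(by_alias[alias], alias) for alias in reference_aliases]


qf1_fields = [('"col1"', "items"), ('"col2"', "price_1"), ("col4*2", "price_2")]
qf2_fields = [('"col1"', "items"), ('"col2"', "price_2"), ("col4*2", "price_1")]

aligned = align_by_name(qf1_fields, qf2_fields)
```

A union by position would use `qf2_fields` unchanged, which stacks `price_2` values under the `price_1` column of the first query.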

## Joining data

In [12]:
from grizly import join

We will be using `Chinook.sqlite` to visualize data.

In [13]:
dsn_sqlite = get_path("grizly_dev", "tests", "Chinook.sqlite")

### Simple join

The first table is the `Track` table.

In [14]:
tracks = {  'select': {
                'fields': {
                    'TrackId': { 'type': 'dim'},
                    'Name': {'type': 'dim'},
                    'AlbumId': {'type': 'dim'},
                    'Composer': {'type': 'dim'},
                    'UnitPrice': {'type': 'num'}
                },
                'table': 'Track'
            }
}
tracks_qf = QFrame(dsn=dsn_sqlite, db="sqlite", dialect="mysql").from_dict(tracks)
print(tracks_qf)

SELECT "TrackId",
       "Name",
       "AlbumId",
       "Composer",
       "UnitPrice"
FROM Track


In [15]:
tracks_qf.to_df().sample(5)

Unnamed: 0,TrackId,Name,AlbumId,Composer,UnitPrice
825,826,Pour Some Sugar On Me,67,,0.99
2544,2545,No Memory,206,Dean Deleo,0.99
504,505,Sangrando,41,Gonzaga Jr/Gonzaguinha,0.99
1557,1558,Ram It Down,125,,0.99
1614,1615,Four Sticks,131,"Jimmy Page, Robert Plant",0.99


The second table is the `PlaylistTrack` table.

In [16]:
playlist_track_qf = QFrame(dsn=dsn_sqlite, db="sqlite", dialect="mysql").from_table(table="PlaylistTrack")

print(playlist_track_qf)

SELECT "PlaylistId",
       "TrackId"
FROM PlaylistTrack


In [17]:
playlist_track_qf.to_df().sample(5)

Unnamed: 0,PlaylistId,TrackId
6641,8,1665
8323,10,2846
7353,8,423
3169,1,3099
314,1,2711


Now let's join them on `TrackId`.

In [18]:
joined_qf = join([tracks_qf,playlist_track_qf], 
                 join_type="left join", 
                 on="sq1.TrackId=sq2.TrackId")

print(joined_qf)

SELECT sq1."TrackId" AS "TrackId",
       sq1."Name" AS "Name",
       sq1."AlbumId" AS "AlbumId",
       sq1."Composer" AS "Composer",
       sq1."UnitPrice" AS "UnitPrice",
       sq2."PlaylistId" AS "PlaylistId"
FROM
  (SELECT "TrackId",
          "Name",
          "AlbumId",
          "Composer",
          "UnitPrice"
   FROM Track) sq1
LEFT JOIN
  (SELECT "PlaylistId",
          "TrackId"
   FROM PlaylistTrack) sq2 ON sq1.TrackId=sq2.TrackId


In [19]:
joined_qf.to_df().sample(5)

Unnamed: 0,TrackId,Name,AlbumId,Composer,UnitPrice,PlaylistId
1561,626,Drum Boogie,51,,0.99,8
4144,1678,Soul Parsifal,139,Renato Russo - Marisa Monte,0.99,5
3813,1530,Sem Sentido,123,,0.99,5
4140,1677,Aloha,139,Renato Russo,0.99,1
1488,592,Sombras Do Meu Destino,47,,0.99,8


As you can see, in this example the shared column `TrackId` is taken from the first table (`sq1`). By default, the `join` function takes all fields from the first QFrame, then all fields from the second QFrame that are not in the first, and so on. If you want to keep all fields from each QFrame, you have to set `unique_col=False`. We will see how it works in the next example.
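
The default field selection described above can be sketched as follows. The function `select_join_fields` is a hypothetical illustration of the rule, with the `sq1`, `sq2`, ... prefixes taken from the generated SQL; it is not grizly's implementation.

```python
def select_join_fields(qframes_fields, unique_col=True):
    """Pick the fields of a join, qualified with sq1, sq2, ...

    With unique_col=True a field name is taken only from the first
    QFrame that contains it; with unique_col=False every field is kept.
    """
    selected = []
    seen = set()
    for position, fields in enumerate(qframes_fields, start=1):
        for name in fields:
            if unique_col and name in seen:
                continue
            seen.add(name)
            selected.append(f"sq{position}.{name}")
    return selected


track = ["TrackId", "Name", "AlbumId", "Composer", "UnitPrice"]
playlist_track = ["PlaylistId", "TrackId"]

unique_fields = select_join_fields([track, playlist_track])
all_fields = select_join_fields([track, playlist_track], unique_col=False)
```

With the default setting `sq2.TrackId` is skipped because `TrackId` was already taken from `sq1`, which matches the SQL printed above.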

### Multiple join

Now let's use one more table to see what a multiple join looks like.

In [20]:
playlists_qf = QFrame(dsn=dsn_sqlite, db="sqlite", dialect="mysql").from_table(table="Playlist")

print(playlists_qf)

SELECT "PlaylistId",
       "Name"
FROM Playlist


In [21]:
playlists_qf.to_df().sample(5)

Unnamed: 0,PlaylistId,Name
8,9,Music Videos
6,7,Movies
2,3,TV Shows
10,11,Brazilian Music
3,4,Audiobooks


Now if we want to join the `Track`, `PlaylistTrack` and `Playlist` tables, we can use `TrackId` and `PlaylistId`. Both the `Track` and `Playlist` tables have a column called `Name`. Let's try the option `unique_col=False` and analyse the duplicated columns.

In [22]:
joined_qf = join(qframes=[tracks_qf, playlist_track_qf, playlists_qf], 
                 join_type=['left join', 'left join'], 
                 on=['sq1.TrackId=sq2.TrackId', 'sq2.PlaylistId=sq3.PlaylistId'], 
                 unique_col=False)

Please remove or rename duplicated columns. Use your_qframe.show_duplicated_columns() to check duplicates.


In [23]:
joined_qf.show_duplicated_columns()

DUPLICATED COLUMNS:
TrackId:	 ['sq1.TrackId', 'sq2.TrackId']

Name:	 ['sq1.Name', 'sq3.Name']

PlaylistId:	 ['sq2.PlaylistId', 'sq3.PlaylistId']

Use your_qframe.remove() to remove or your_qframe.rename() to rename columns.
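
A duplicate check like the one above can be sketched by grouping the qualified field names by their unqualified part. The function `find_duplicated_columns` is hypothetical and works on plain strings; it only mirrors the shape of the report shown above.

```python
from collections import defaultdict


def find_duplicated_columns(qualified_fields):
    """Group 'sqN.column' names by column and keep those seen more than once."""
    by_name = defaultdict(list)
    for qualified in qualified_fields:
        _, name = qualified.split(".", 1)
        by_name[name].append(qualified)
    return {name: sources for name, sources in by_name.items() if len(sources) > 1}


joined_fields = [
    "sq1.TrackId", "sq1.Name", "sq1.AlbumId", "sq1.Composer", "sq1.UnitPrice",
    "sq2.PlaylistId", "sq2.TrackId",
    "sq3.PlaylistId", "sq3.Name",
]

duplicates = find_duplicated_columns(joined_fields)
```

Each entry lists the candidates for one output column, so you can decide which one to remove or how to rename them.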


We can see that three columns occur in two different tables. We will remove the `sq2.TrackId` and `sq2.PlaylistId` fields and rename both `Name` columns.

In [24]:
joined_qf.remove(['sq2.TrackId', 
                  'sq2.PlaylistId']).rename({'sq1.Name': 'TrackName', 
                                             'sq3.Name': 'PlaylistType'})
print(joined_qf)

SELECT sq1."TrackId" AS "TrackId",
       sq1."Name" AS "TrackName",
       sq1."AlbumId" AS "AlbumId",
       sq1."Composer" AS "Composer",
       sq1."UnitPrice" AS "UnitPrice",
       sq3."PlaylistId" AS "PlaylistId",
       sq3."Name" AS "PlaylistType"
FROM
  (SELECT "TrackId",
          "Name",
          "AlbumId",
          "Composer",
          "UnitPrice"
   FROM Track) sq1
LEFT JOIN
  (SELECT "PlaylistId",
          "TrackId"
   FROM PlaylistTrack) sq2 ON sq1.TrackId=sq2.TrackId
LEFT JOIN
  (SELECT "PlaylistId",
          "Name"
   FROM Playlist) sq3 ON sq2.PlaylistId=sq3.PlaylistId


In [25]:
joined_qf.to_df().sample(5)

Unnamed: 0,TrackId,TrackName,AlbumId,Composer,UnitPrice,PlaylistId,PlaylistType
1819,754,Speed King,59,"Blackmore, Gillan, Glover, Lord, Paice",0.99,8,Music
6635,2689,Out Of Control,217,Jagger/Richards,0.99,8,Music
6541,2654,Don't Stand so Close to Me,215,G M Sumner,0.99,5,90’s Music
7912,3222,The Job,251,,1.99,3,TV Shows
5295,2154,Untitled,178,Pearl Jam,0.99,8,Music


## Pivot

Again we will use `Chinook.sqlite` and the `Track` table to visualize data.

In [26]:
qf = QFrame(dsn=dsn_sqlite, db="sqlite", dialect="mysql").from_table(table="Track")

len(qf)

3503

Our table has `3503` rows, so we will limit the data to `15` rows to get a better view. We will use the `QFrame.window()` method with `order_by` to make sure that the result is deterministic.

In [27]:
qf.window(offset=90, limit=15, order_by=["TrackId"])
qf.to_df()

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,91,Shadow on the Sun,10,1,1,Audioslave/Chris Cornell,343457,8245793,0.99
1,92,I am the Highway,10,1,1,Audioslave/Chris Cornell,334942,8041411,0.99
2,93,Exploder,10,1,1,Audioslave/Chris Cornell,206053,4948095,0.99
3,94,Hypnotize,10,1,1,Audioslave/Chris Cornell,206628,4961887,0.99
4,95,Bring'em Back Alive,10,1,1,Audioslave/Chris Cornell,329534,7911634,0.99
5,96,Light My Way,10,1,1,Audioslave/Chris Cornell,303595,7289084,0.99
6,97,Getaway Car,10,1,1,Audioslave/Chris Cornell,299598,7193162,0.99
7,98,The Last Remaining Light,10,1,1,Audioslave/Chris Cornell,317492,7622615,0.99
8,99,Your Time Has Come,11,1,4,"Cornell, Commerford, Morello, Wilk",255529,8273592,0.99
9,100,Out Of Exile,11,1,4,"Cornell, Commerford, Morello, Wilk",291291,9506571,0.99
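
The reason for combining `order_by` with `limit` and `offset` can be shown with plain SQL: without an explicit ordering, the database is free to return any 15 rows. The sketch below uses `sqlite3` with made-up rows. The helper `window_sql` is hypothetical and orders by column name, whereas the SQL generated later in this tutorial orders by column position.

```python
import sqlite3

# Sample in-memory table with 200 made-up tracks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Track (TrackId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Track VALUES (?, ?)",
    [(i, f"track {i}") for i in range(1, 201)],
)


def window_sql(sql, offset, limit, order_by):
    """Append ORDER BY, LIMIT and OFFSET to a query (illustrative helper)."""
    order = ", ".join(f'"{column}"' for column in order_by)
    return f"{sql}\nORDER BY {order}\nLIMIT {limit}\nOFFSET {offset}"


sql = window_sql('SELECT "TrackId", "Name" FROM Track', offset=90, limit=15, order_by=["TrackId"])
track_ids = [row[0] for row in conn.execute(sql)]
```

Skipping 90 ordered rows and taking the next 15 always yields the same slice, here the tracks with ids 91 to 105.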


In [28]:
qf.pivot(rows=["Composer"], columns=["AlbumId", "GenreId"], values="UnitPrice", aggtype="sum")
qf.to_df()

Unnamed: 0,Composer,10_1,11_4
0,Audioslave/Chris Cornell,7.92,0.0
1,"Cornell, Commerford, Morello, Wilk",0.0,6.93


As you can see, all value combinations of `AlbumId` and `GenreId` became separate columns, the `Composer` column has been grouped and `UnitPrice` has been summed up.
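
A pivot like this can be expressed in SQL as one conditional aggregate per value combination. The sketch below builds such a query and runs it with `sqlite3` on a few made-up rows. The function `pivot_sql` and its `combinations` parameter are hypothetical; the sketch follows the `sum(CASE WHEN ... END)` pattern but is not grizly's implementation.

```python
import sqlite3


def pivot_sql(table, rows, columns, values, combinations, aggtype="sum"):
    """Build a pivot query with one conditional aggregate per value combination."""
    row_list = ", ".join(f'"{r}"' for r in rows)
    select = [row_list]
    for combination in combinations:
        condition = " AND ".join(
            f"\"{column}\"='{value}'" for column, value in zip(columns, combination)
        )
        alias = "_".join(str(value) for value in combination)
        select.append(
            f'{aggtype}(CASE WHEN {condition} THEN "{values}" ELSE 0 END) AS "{alias}"'
        )
    return f"SELECT {', '.join(select)} FROM {table} GROUP BY {row_list}"


# Made-up sample rows, not the Chinook data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Track (Composer TEXT, AlbumId INTEGER, GenreId INTEGER, UnitPrice REAL)")
conn.executemany(
    "INSERT INTO Track VALUES (?, ?, ?, ?)",
    [("Composer A", 10, 1, 0.99), ("Composer A", 10, 1, 0.99), ("Composer B", 11, 4, 0.99)],
)

# The pivoted columns come from the distinct value combinations in the data.
combinations = conn.execute(
    "SELECT DISTINCT AlbumId, GenreId FROM Track ORDER BY 1, 2"
).fetchall()
sql = pivot_sql("Track", ["Composer"], ["AlbumId", "GenreId"], "UnitPrice", combinations)
result = {row[0]: row[1:] for row in conn.execute(sql)}
```

Because the column list depends on the data, the distinct combinations have to be read before the final query can be written.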

## Going into QFrame data details

### QFrame data structure

`QFrame.data` has a `select` key, under which it stores the `fields` that we want to have in our SQL statement. Each field has to have a specified `type`, which can be 'dim' if the variable is a dimension variable or 'num' if the variable is a numeric variable. Let's take a look at all the options that we can have under the `select` and `fields` keys.

```json
{
  "select": {
    "table": "table",
    "schema": "schema",
    "fields": {
      "column": {
        "type": "dim",
        "as": "",
        "group_by": "",
        "order_by": "",
        "expression": "",
        "select": "",
        "custom_type": ""
      }
    },
    "where": "",
    "distinct": "",
    "having": "",
    "limit": ""
  }
}
```

- `table` - Name of the table.
- `schema` - Name of the schema.
- `fields`, in each field:
    - `type` - Type of the column. Options:

        - 'dim' - VARCHAR(500)  
        - 'num' - FLOAT
     
     Every column has to have a specified type. If you want to specify another type, see `custom_type`.
    - `as` - Column alias (name).

    - `group_by` - Aggregation type. Possibilities:

        - 'group' - This field will go to the GROUP BY statement.
        - {'sum', 'count', 'min', 'max', 'avg'} - This field will be aggregated in the specified way.

     If you don't want to aggregate fields, leave `group_by` empty in each field.
    - `order_by` - Put the field in the ORDER BY statement. Options:
    
        - 'ASC'
        - 'DESC'
        
    - `expression` - Expression, e.g. a CASE statement, a column operation or a CONCAT statement.
    - `select` - Set to 0 if you don't want to put this field in the SELECT statement.
    - `custom_type` - Custom SQL data type, e.g. DATE.
- `where` - WHERE condition, e.g. 'sales>100'.
- `distinct` - Set to 1 to add DISTINCT to the SELECT statement.
- `having` - HAVING condition, e.g. 'sum(sales)>100'.
- `limit` - Row limit, e.g. 100.
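
To show how these keys fit together, here is a minimal sketch that renders a dictionary of this shape into an SQL string. The function `build_sql` and the sample `shop.orders` table are hypothetical. The sketch covers only the keys listed above and makes no claim about how grizly generates or formats its SQL.

```python
def build_sql(data):
    """Render a SELECT statement from a `select` dictionary (illustrative only)."""
    select = data["select"]
    columns, group_by, order_by = [], [], []
    for name, spec in select["fields"].items():
        if spec.get("type") not in ("dim", "num"):
            raise ValueError(f"Field {name!r} needs type 'dim' or 'num'")
        expr = spec.get("expression") or f'"{name}"'
        aggregation = spec.get("group_by", "")
        if aggregation == "group":
            group_by.append(expr)
        elif aggregation:
            expr = f"{aggregation}({expr})"
        if spec.get("order_by"):
            order_by.append(f"{expr} {spec['order_by']}")
        if spec.get("select") != 0:
            columns.append(f'{expr} AS "{spec.get("as") or name}"')

    table = select["table"]
    if select.get("schema"):
        table = f"{select['schema']}.{table}"
    parts = ["SELECT DISTINCT" if select.get("distinct") == 1 else "SELECT"]
    parts.append(", ".join(columns))
    parts.append(f"FROM {table}")
    if select.get("where"):
        parts.append(f"WHERE {select['where']}")
    if group_by:
        parts.append("GROUP BY " + ", ".join(group_by))
    if select.get("having"):
        parts.append(f"HAVING {select['having']}")
    if order_by:
        parts.append("ORDER BY " + ", ".join(order_by))
    if select.get("limit"):
        parts.append(f"LIMIT {select['limit']}")
    return " ".join(parts)


# Hypothetical table and fields, used only to exercise the keys.
data = {
    "select": {
        "table": "orders",
        "schema": "shop",
        "fields": {
            "customer": {"type": "dim", "group_by": "group"},
            "sales": {"type": "num", "group_by": "sum", "as": "total_sales", "order_by": "DESC"},
        },
        "where": "sales>100",
        "having": "sum(sales)>100",
        "limit": 100,
    }
}
sql = build_sql(data)
print(sql)
```

Keeping the whole query in one dictionary is what makes it possible to store it as JSON and edit it outside of Python.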

### Generating and saving a QFrame in a JSON file

We use a `.json` file to conveniently manipulate information about columns, renames and other things that would be very verbose to manipulate in Python code. We can edit the JSON file in a JSON editor such as http://jsoneditoronline.org/ more conveniently than in Python code.

After editing `store.json` we can read it back into a QFrame using `from_json()`.

This means we can use our JSON file as our main `store` of verbose information and Python as our main way to manipulate that information.

In [29]:
json_path = get_path("grizly_dev", "notebooks", "store.json")
qf.save_json(json_path=json_path, subquery="my_query_1")

qf = QFrame(dsn=dsn_sqlite, db="sqlite", dialect="mysql").from_json(json_path=json_path, subquery="my_query_1")
print(qf)

SELECT sq."Composer" AS "Composer",
       sum(CASE
               WHEN "AlbumId"='10'
                    AND "GenreId"='1' THEN "UnitPrice"
               ELSE 0
           END) AS "10_1",
       sum(CASE
               WHEN "AlbumId"='11'
                    AND "GenreId"='4' THEN "UnitPrice"
               ELSE 0
           END) AS "11_4"
FROM
  (SELECT "TrackId",
          "Name",
          "AlbumId",
          "MediaTypeId",
          "GenreId",
          "Composer",
          "Milliseconds",
          "Bytes",
          "UnitPrice"
   FROM Track
   ORDER BY 1
   LIMIT 15
   OFFSET 90) sq
GROUP BY 1
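
The save and load round trip can be sketched with the standard `json` module. The helpers `save_json` and `load_json` below are hypothetical stand-ins, and the file layout (one top-level key per `subquery` name) is an assumption made for illustration, not a description of grizly's file format.

```python
import json
import os
import tempfile


def save_json(data, json_path, subquery):
    """Store `data` under the `subquery` key, keeping other entries in the file."""
    store = {}
    if os.path.exists(json_path):
        with open(json_path) as f:
            store = json.load(f)
    store[subquery] = data
    with open(json_path, "w") as f:
        json.dump(store, f, indent=2)


def load_json(json_path, subquery):
    """Read back the entry stored under the `subquery` key."""
    with open(json_path) as f:
        return json.load(f)[subquery]


data = {"select": {"table": "Track", "fields": {"Composer": {"type": "dim"}}}}
other = {"select": {"table": "Playlist", "fields": {"Name": {"type": "dim"}}}}

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "store.json")
    save_json(data, json_path, subquery="my_query_1")
    save_json(other, json_path, subquery="my_query_2")
    restored = load_json(json_path, subquery="my_query_1")
```

Under this assumed layout, a single file can hold several named queries, and editing one entry by hand does not touch the others.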
