# Data Definition Language - DDL
Let us create a database called `post` and make it the current database.

In [None]:
%load_ext sql
%sql hive://hadoop@localhost:10000/

In [None]:
%%sql
CREATE DATABASE IF NOT EXISTS post

In [None]:
%sql USE post

## PLZ Verzeichnis (Postcode Directory)

Let's first look at the PLZ (postcode) dataset.

In [None]:
!head -n3 /data/dataset/post/plz_verzeichnis_v2.csv

### We see that

|REC_ART|ONRP|BFSNR|PLZ_TYP|POSTLEITZAHL|PLZ_ZZ|GPLZ|ORTBEZ18|ORTBEZ27|KANTON|SPRACHCODE|SPRACHCODE_ABW|BRIEFZ_DURCH|GILT_AB_DAT|PLZ_BRIEFZUST|PLZ_COFF|Geo Shape|Geokoordinaten|
| ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- | ------- |
|01|111|5586|80|1000|07|1000|Lausanne St-Paul|Lausanne St-Paul|VD|2||130|1993-09-28|100060||||


1. The field separator is `;`.
2. `KANTON` is a good candidate for a partition column, since it takes only a small set of values. We will partition on it when we convert the table to Parquet.
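As a quick sanity check on the first observation, the following minimal sketch parses one semicolon-separated record with Python's standard `csv` module. The `sample` string is an assumption: it is reconstructed by joining the values of the rendered row above with `;`, not copied from the file. The column names follow the header above, spelled as in the table definition.

```python
import csv
import io

# Column names following the header above (Geo columns included).
columns = [
    "REC_ART", "ONRP", "BFSNR", "PLZ_TYP", "POSTLEITZAHL", "PLZ_ZZ", "GPLZ",
    "ORTBEZ18", "ORTBEZ27", "KANTON", "SPRACHCODE", "SPRACHCODE_ABW",
    "BRIEFZ_DURCH", "GILT_AB_DAT", "PLZ_BRIEFZUST", "PLZ_COFF",
    "Geo Shape", "Geokoordinaten",
]

# Assumed raw form of the sample row: the rendered values joined with ';'.
sample = (
    "01;111;5586;80;1000;07;1000;Lausanne St-Paul;Lausanne St-Paul;"
    "VD;2;;130;1993-09-28;100060;;;"
)

# Split the line on ';' and pair each value with its column name.
reader = csv.reader(io.StringIO(sample), delimiter=";")
record = dict(zip(columns, next(reader)))

print(record["KANTON"], record["POSTLEITZAHL"], record["ORTBEZ18"])
```

Empty fields such as `SPRACHCODE_ABW` come back as empty strings here; in the Hive text table they would be read as `NULL` for the numeric columns.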


As a reminder, the data types that Hive supports are listed [here](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82706456#LanguageManualTypes-date).

Swiss Post (Die Post) provides additional documentation for the entire dataset [here](https://swisspost.opendatasoft.com/api/datasets/1.0/plz_verzeichnis_v2/attachments/strassenverzeichnis_mit_sortierdaten_de_pdf/).

For example, the field list for the PLZ dataset starts as follows:

| Field name | Field type (length) | Mandatory field | Source | Observations |
| ------ | ------- | ------ | ------- | ------ |
| REC_ART | VARCHAR(2) | YES | “01” |Record type: Designates the record type. |
| ONRP | NUMBER(5) | YES | ASDP | Swiss Post classification number: This number (ONRP) is the primary key designating a postcode/location in accordance with the Swiss Post postcode database and the unique, unalterable key term of the postcode. |
| BFSNR | NUMBER(5) | YES | ASDP | Foreign key for BFSNR (refers to NEW_COM)|

etc.
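One way to turn the documented field types into Hive column types is sketched below. The helper `to_hive_type` and its digit-count thresholds are assumptions made for illustration; they are not taken from the Swiss Post documentation or from Hive. The idea is simply to pick the smallest Hive integer type that can hold the stated number of decimal digits.

```python
import re


def to_hive_type(source_type: str) -> str:
    """Map a documented field type to a Hive column type (hypothetical helper)."""
    match = re.fullmatch(r"\s*(VARCHAR|NUMBER)\((\d+)\)\s*", source_type.upper())
    if match is None:
        raise ValueError(f"unsupported type: {source_type!r}")
    kind, length = match.group(1), int(match.group(2))
    if kind == "VARCHAR":
        return f"VARCHAR({length})"
    # Pick the smallest Hive integer type that holds `length` decimal digits.
    if length <= 2:
        return "TINYINT"   # max 127
    if length <= 4:
        return "SMALLINT"  # max 32,767
    if length <= 9:
        return "INT"       # max 2,147,483,647
    return "BIGINT"


# The three fields listed in the table above.
fields = [("REC_ART", "VARCHAR(2)"), ("ONRP", "NUMBER(5)"), ("BFSNR", "NUMBER(5)")]
for name, source_type in fields:
    print(f"{name} {to_hive_type(source_type)}")
```

Codes where a leading zero matters (the sample row has `07` in `PLZ_ZZ`) are better kept as `VARCHAR`, whatever their digit count.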

In [None]:
%%sql
CREATE TABLE IF NOT EXISTS plz_csv (
    REC_ART VARCHAR(2),
    ONRP INT,
    BFSNR INT,
    PLZ_TYP SMALLINT,
    POSTLEITZAHL SMALLINT,
    PLZ_ZZ VARCHAR(2), 
    GPLZ SMALLINT,
    ORTBEZ18 VARCHAR(18),
    ORTBEZ27 VARCHAR(27),
    KANTON VARCHAR(2),
    SPRACHCODE TINYINT,
    SPRACHCODE_ABW TINYINT,
    BRIEFZ_DURCH INT,
    GILT_AB_DAT DATE,
    PLZ_BRIEFZUST INT,
    PLZ_COFF VARCHAR(1),
    Geo_Shape STRING,
    Geokoordinaten STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1")

In [None]:
%%sql
LOAD DATA LOCAL INPATH '/data/dataset/post/plz_verzeichnis_v2.csv' INTO TABLE plz_csv

In [None]:
%%sql 
SELECT
    REC_ART,
    ONRP,
    BFSNR,
    PLZ_TYP,
    POSTLEITZAHL,
    PLZ_ZZ, 
    GPLZ,
    ORTBEZ18,
    ORTBEZ27,
    SPRACHCODE,
    SPRACHCODE_ABW,
    BRIEFZ_DURCH,
    GILT_AB_DAT,
    PLZ_BRIEFZUST,
    PLZ_COFF,
    KANTON
FROM plz_csv LIMIT 10

### Converting `plz_csv` to a Partitioned Parquet Table

1. We drop the `Geo_Shape` and `Geokoordinaten` columns.
2. We compress the data with Snappy.
3. We partition by `KANTON`, declared as `STRING`. The partition column is declared in `PARTITIONED BY`, not in the regular column list.
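The three decisions above determine which columns the Parquet table keeps and in which order they are later selected. A minimal sketch, assuming the column order of `plz_csv` as defined earlier (the variable names are illustrative only):

```python
# Columns of plz_csv, in table order.
csv_columns = [
    "REC_ART", "ONRP", "BFSNR", "PLZ_TYP", "POSTLEITZAHL", "PLZ_ZZ", "GPLZ",
    "ORTBEZ18", "ORTBEZ27", "KANTON", "SPRACHCODE", "SPRACHCODE_ABW",
    "BRIEFZ_DURCH", "GILT_AB_DAT", "PLZ_BRIEFZUST", "PLZ_COFF",
    "Geo_Shape", "Geokoordinaten",
]

dropped = {"Geo_Shape", "Geokoordinaten"}  # columns we do not carry over
partition_column = "KANTON"                # declared in PARTITIONED BY

# Regular columns of the Parquet table: everything except dropped and partition columns.
data_columns = [c for c in csv_columns if c not in dropped and c != partition_column]

# Column order for the SELECT that fills the table: partition column goes last.
select_list = data_columns + [partition_column]

print(",\n".join(select_list))
```

The printed list has 16 entries: the 15 regular columns followed by `KANTON`.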

In [None]:
%%sql
CREATE TABLE IF NOT EXISTS plz (
    REC_ART VARCHAR(2),
    ONRP INT,
    BFSNR INT,
    PLZ_TYP SMALLINT,
    POSTLEITZAHL SMALLINT,
    PLZ_ZZ VARCHAR(2), 
    GPLZ SMALLINT,
    ORTBEZ18 VARCHAR(18),
    ORTBEZ27 VARCHAR(27),
    SPRACHCODE TINYINT,
    SPRACHCODE_ABW TINYINT,
    BRIEFZ_DURCH INT,
    GILT_AB_DAT DATE,
    PLZ_BRIEFZUST INT,
    PLZ_COFF VARCHAR(1)
)
PARTITIONED BY (KANTON STRING)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")

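Hive takes the partition value from the last column of the `SELECT`, matched by position rather than by name, so `KANTON` must come last in the `SELECT` list of the `INSERT` statement. Depending on the Hive version and configuration, a dynamic-partition insert like this one may first require `SET hive.exec.dynamic.partition.mode=nonstrict`.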
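The following toy simulation illustrates why the position matters. It is not Hive's implementation: `route_rows` is a hypothetical helper that groups rows by their last value, and the directory names only mimic Hive's `column=value` partition layout. The first row uses values from the sample record above; the other rows are made up for illustration.

```python
from collections import defaultdict


def route_rows(rows):
    """Group rows by their last value, mimicking positional dynamic partitioning."""
    partitions = defaultdict(list)
    for row in rows:
        *data, partition_value = row  # the trailing value selects the partition
        partitions[f"kanton={partition_value}"].append(tuple(data))
    return dict(partitions)


rows = [
    (1000, "Lausanne St-Paul", "VD"),  # values from the sample record
    (3000, "Bern", "BE"),              # illustrative row
    (1003, "Lausanne", "VD"),          # illustrative row
]

layout = route_rows(rows)
for directory, contents in sorted(layout.items()):
    print(directory, contents)

# If KANTON is not last, some other column's value ends up as the partition value.
misordered = route_rows([("VD", 1000, "Lausanne St-Paul")])
print(list(misordered))
```

In the misordered case the place name is used as the partition key, which is the kind of silent mistake the ordering rule prevents.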

In [None]:
%%sql
INSERT INTO TABLE plz PARTITION (KANTON)
    SELECT
        REC_ART,
        ONRP,
        BFSNR,
        PLZ_TYP,
        POSTLEITZAHL,
        PLZ_ZZ, 
        GPLZ,
        ORTBEZ18,
        ORTBEZ27,
        SPRACHCODE,
        SPRACHCODE_ABW,
        BRIEFZ_DURCH,
        GILT_AB_DAT,
        PLZ_BRIEFZUST,
        PLZ_COFF,
        Kanton 
    FROM plz_csv

In [None]:
%sql select * from plz limit 2

## Strassenbezeichnungen (Street Names)

### Can you do the same for `strassenbezeichnungen_v2`?

In [None]:
!head -n5 /data/dataset/post/strassenbezeichnungen_v2.csv

In [None]:
%%sql
CREATE TABLE IF NOT EXISTS streets_csv (
    REC_ART VARCHAR(2),
    ...

In [None]:
%%sql
LOAD DATA LOCAL ...

In [None]:
%%sql
CREATE TABLE IF NOT EXISTS streets (
    REC_ART VARCHAR(2),
    ...

In [None]:
%%sql
INSERT INTO TABLE ...

In [None]:
%sql select * from streets limit 2

## Bevölkerung (Population)

### Load `bevoelkerung_proplz.csv` into a text table `bevoelkerung_csv`, then convert it to a Parquet table `bevoelkerung`.

In [None]:
...

In [None]:
%sql select * from bevoelkerung limit 3

## Nachnamen (Surnames)

### Finally, load `nachnamen_proplz.csv` into a text table `nachnamen_csv`, then convert it to a Parquet table `nachnamen`.
Partition the Parquet table `nachnamen` by gender: `PARTITIONED BY(Geschlecht string)`

In [None]:
...