[![AWS Data Wrangler](_static/logo.png "AWS Data Wrangler")](https://github.com/awslabs/aws-data-wrangler)

# 5 - Glue Catalog

[Wrangler](https://github.com/awslabs/aws-data-wrangler) makes heavy use of the [Glue Catalog](https://aws.amazon.com/glue/) to store metadata for tables and connections.

In [1]:
import awswrangler as wr
import pandas as pd
import numpy as np
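# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an older scikit-learn release.
from sklearn.datasets import load_boston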

## Enter your bucket name:

In [2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/data/"



### Creating a DataFrame from the scikit-learn Boston housing samples

In [3]:
boston = load_boston()  # load the dataset once and reuse it
df = pd.DataFrame(
    data=np.c_[boston["data"], boston["target"]],
    columns=np.append(boston["feature_names"], ["target"])
)
df.head(3)

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7


## Checking Glue Catalog Databases

In [4]:
databases = wr.catalog.databases()
print(databases)

            Database                                   Description
0  aws_data_wrangler  AWS Data Wrangler Test Arena - Glue Database
1     aws_dataframes     AWS DataFrames Test Arena - Glue Database
2           covid-19                                              
3            default                         Default Hive database


### Create the database awswrangler_test if it does not exist

In [5]:
if "awswrangler_test" not in databases.values:
    wr.catalog.create_database("awswrangler_test")
    print(wr.catalog.databases())
else:
    print("Database awswrangler_test already exists")

            Database                                   Description
0  aws_data_wrangler  AWS Data Wrangler Test Arena - Glue Database
1     aws_dataframes     AWS DataFrames Test Arena - Glue Database
2   awswrangler_test                                              
3           covid-19                                              
4            default                         Default Hive database
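
The membership test above uses `databases.values`, which scans every cell of the DataFrame, descriptions included. The following is a minimal sketch of restricting the check to the `Database` column. It uses a hand-built DataFrame that only mimics the shape of the listing above, so no AWS call is made, and `database_exists` is a hypothetical helper, not part of awswrangler.

```python
import pandas as pd

# Hand-built stand-in for the output of wr.catalog.databases() (assumed shape).
# The first Description is deliberately set to a database-like name.
databases = pd.DataFrame(
    {
        "Database": ["aws_data_wrangler", "default"],
        "Description": ["awswrangler_test", "Default Hive database"],
    }
)


def database_exists(catalog: pd.DataFrame, name: str) -> bool:
    """Hypothetical helper: True only if `name` is in the Database column."""
    return name in catalog["Database"].tolist()


# The whole-frame check also looks at the Description cells.
found_anywhere = "awswrangler_test" in databases.values
found_as_database = database_exists(databases, "awswrangler_test")
print(found_anywhere, found_as_database)
```

In practice a description is unlikely to equal a database name exactly, so the original check works for this tutorial; the column-based version just makes the intent explicit.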


## Checking the empty database

In [6]:
wr.catalog.tables(database="awswrangler_test")

,Database,Table,Description,Columns,Partitions


### Writing DataFrames to Data Lake (S3 + Parquet + Glue Catalog)

In [7]:
desc = """This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics …’, Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression problems.
"""

param = {
    "source": "scikit-learn",
    "class": "cities"
}

comments = {
    "crim": "per capita crime rate by town",
    "zn": "proportion of residential land zoned for lots over 25,000 sq.ft.",
    "indus": "proportion of non-retail business acres per town",
    "chas": "Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)",
    "nox": "nitric oxides concentration (parts per 10 million)",
    "rm": "average number of rooms per dwelling",
    "age": "proportion of owner-occupied units built prior to 1940",
    "dis": "weighted distances to five Boston employment centres",
    "rad": "index of accessibility to radial highways",
    "tax": "full-value property-tax rate per $10,000",
    "ptratio": "pupil-teacher ratio by town",
    "b": "1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town",
    "lstat": "lower status of the population",
}

res = wr.s3.to_parquet(
    df=df,
    path=f"s3://{bucket}/boston",
    dataset=True,
    database="awswrangler_test",
    table="boston",
    mode="overwrite",
    description=desc,
    parameters=param,
    columns_comments=comments
)
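
The keys of `comments` are lowercase even though the DataFrame columns are uppercase (`CRIM`, `ZN`, ...), which matches the lowercase column names the catalog reports for this table. As a rough sketch, a small pre-flight check can flag comment keys that would not line up with any column. `unmatched_comment_keys` is a hypothetical helper, and plain lowercasing is only an assumption about how names are normalized; the library may do more than that.

```python
from typing import Dict, Iterable, List


def unmatched_comment_keys(columns: Iterable[str], comments: Dict[str, str]) -> List[str]:
    """Return comment keys that match no column name after lowercasing.

    Assumption: lowercasing approximates how column names end up in the
    catalog; the real normalization may apply further rules.
    """
    normalized = {str(col).lower() for col in columns}
    return [key for key in comments if key not in normalized]


# A few columns from the DataFrame above, plus one deliberately wrong key.
columns = ["CRIM", "ZN", "INDUS", "target"]
comments = {
    "crim": "per capita crime rate by town",
    "ZN": "proportion of residential land zoned for lots over 25,000 sq.ft.",
}

# Keys listed here would likely be ignored or rejected when writing the table.
print(unmatched_comment_keys(columns, comments))
```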

### Checking Glue Catalog (AWS Console)

![Glue Console](_static/glue_catalog_table_boston.png "Glue Console")

### Looking up the new table

In [8]:
wr.catalog.tables(name_contains="osto")

,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",


In [9]:
wr.catalog.tables(name_prefix="bos")

,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",


In [10]:
wr.catalog.tables(name_suffix="ton")

,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",


In [11]:
wr.catalog.tables(search_text="UCI ML housing dataset")

,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",
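
The three name filters above read like substring, prefix, and suffix matches on the table name. The sketch below illustrates those matching semantics locally on a list of hypothetical table names. `filter_table_names` is an illustrative helper, not the library's implementation, and it does not reproduce how awswrangler validates or combines these arguments.

```python
from typing import Iterable, List, Optional


def filter_table_names(
    names: Iterable[str],
    name_contains: Optional[str] = None,
    name_prefix: Optional[str] = None,
    name_suffix: Optional[str] = None,
) -> List[str]:
    """Keep the names that satisfy every filter that was provided."""
    selected = []
    for name in names:
        if name_contains is not None and name_contains not in name:
            continue
        if name_prefix is not None and not name.startswith(name_prefix):
            continue
        if name_suffix is not None and not name.endswith(name_suffix):
            continue
        selected.append(name)
    return selected


tables = ["boston", "iris", "bookings"]  # hypothetical table names
print(filter_table_names(tables, name_contains="osto"))
print(filter_table_names(tables, name_prefix="bo"))
print(filter_table_names(tables, name_suffix="ton"))
```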


### Getting table details

In [12]:
wr.catalog.table(database="awswrangler_test", table="boston")

,Column Name,Type,Partition,Comment
0,crim,double,False,per capita crime rate by town
1,zn,double,False,proportion of residential land zoned for lots ...
2,indus,double,False,proportion of non-retail business acres per town
3,chas,double,False,Charles River dummy variable (= 1 if tract bou...
4,nox,double,False,nitric oxides concentration (parts per 10 mill...
5,rm,double,False,average number of rooms per dwelling
6,age,double,False,proportion of owner-occupied units built prior...
7,dis,double,False,weighted distances to five Boston employment c...
8,rad,double,False,index of accessibility to radial highways
9,tax,double,False,"full-value property-tax rate per $10,000"
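
Because the table details come back as a regular DataFrame, ordinary pandas operations apply. The sketch below uses a hand-copied subset of the rows above (so it runs without AWS access) to build a column-to-comment lookup and to list partition columns. The variable names are illustrative only.

```python
import pandas as pd

# Hand-copied rows in the shape of the wr.catalog.table() output above.
details = pd.DataFrame(
    {
        "Column Name": ["crim", "rm", "rad"],
        "Type": ["double", "double", "double"],
        "Partition": [False, False, False],
        "Comment": [
            "per capita crime rate by town",
            "average number of rooms per dwelling",
            "index of accessibility to radial highways",
        ],
    }
)

# Map each column name to its comment.
comment_by_column = dict(zip(details["Column Name"], details["Comment"]))

# Names of partition columns (this table has none).
partition_columns = details.loc[details["Partition"], "Column Name"].tolist()

print(comment_by_column["rm"])
print(partition_columns)
```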


## Cleaning Up the Database

In [13]:
for table in wr.catalog.get_tables(database="awswrangler_test"):
    wr.catalog.delete_table_if_exists(database="awswrangler_test", table=table["Name"])

### Delete Database

In [14]:
wr.catalog.delete_database("awswrangler_test")