# Catalog & Metadata

[AWS Data Wrangler](https://github.com/awslabs/aws-data-wrangler) makes heavy use of the [Glue Catalog](https://aws.amazon.com/glue/) to store metadata for tables and connections.

This tutorial walks through some useful features for that purpose.

In [1]:
import awswrangler as wr
import pandas as pd
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older release.
from sklearn.datasets import load_boston

S3_BUCKET = "BUCKET_NAME"

### Creating a mock DataFrame

In [2]:
dataset = load_boston()
# Stack the feature matrix and the target vector into one frame with named columns.
df_boston = pd.DataFrame(
    np.c_[dataset["data"], dataset["target"]],
    columns=np.append(dataset["feature_names"], ["target"]),
)
df_boston

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48,22.0
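
`load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell above may fail on a recent installation. The catalog features in this tutorial depend only on column names and types, so a stand-in DataFrame with the same layout should be enough to follow along. The sketch below is one way to build it. It makes two assumptions: the values are random placeholders rather than the real dataset, and `make_mock_boston` is a hypothetical helper name, not part of any library.

```python
import numpy as np
import pandas as pd

# Column names copied from the output above; the values will be random placeholders.
BOSTON_COLUMNS = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "target",
]


def make_mock_boston(rows: int = 506, seed: int = 0) -> pd.DataFrame:
    """Return random floats laid out like the Boston dataset (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = rng.random((rows, len(BOSTON_COLUMNS)))
    return pd.DataFrame(data, columns=BOSTON_COLUMNS)


df_mock = make_mock_boston()
print(df_mock.shape)
print(df_mock.dtypes.unique())
```

Assigning `df_boston = make_mock_boston()` would let the remaining cells run unchanged, although the numbers written to S3 would then be meaningless.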


### Checking a brand new Glue Catalog (empty)

In [3]:
wr.glue.databases()

Unnamed: 0,Database,Description
0,awswrangler_test,AWS Data Wrangler Test Arena - Glue Database
1,default,


In [4]:
wr.glue.tables(database="awswrangler_test")

Unnamed: 0,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",
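
Judging by the outputs above, the listing calls return ordinary pandas DataFrames, so the usual pandas tools apply to them. The sketch below checks whether a database is present before writing to it. To stay runnable without AWS credentials, it uses a stand-in DataFrame copied from the `wr.glue.databases()` output above. `database_exists` is a hypothetical helper, not a library function.

```python
import pandas as pd

# Stand-in for the result of wr.glue.databases(), copied from the output above.
databases = pd.DataFrame(
    {
        "Database": ["awswrangler_test", "default"],
        "Description": ["AWS Data Wrangler Test Arena - Glue Database", ""],
    }
)


def database_exists(listing: pd.DataFrame, name: str) -> bool:
    """Return True if `name` appears in the Database column (hypothetical helper)."""
    return name in set(listing["Database"])


print(database_exists(databases, "awswrangler_test"))
print(database_exists(databases, "missing_db"))
```

With a live session, the stand-in could be replaced by the actual return value of `wr.glue.databases()`, assuming it keeps the column layout shown above.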


### Loading DataFrames to Data Lake (S3 + Parquet + Glue Catalog)

In [5]:
desc = """This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics …’, Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.
"""

param = {
    "source": "scikit-learn",
    "class": "cities"
}

comments = {
    "crim": "per capita crime rate by town",
    "zn": "proportion of residential land zoned for lots over 25,000 sq.ft.",
    "indus": "proportion of non-retail business acres per town",
    "chas": "Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)",
    "nox": "nitric oxides concentration (parts per 10 million)",
    "rm": "average number of rooms per dwelling",
    "age": "proportion of owner-occupied units built prior to 1940",
    "dis": "weighted distances to five Boston employment centres",
    "rad": "index of accessibility to radial highways",
    "tax": "full-value property-tax rate per $10,000",
    "ptratio": "pupil-teacher ratio by town",
    "b": "1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town",
    "lstat": "lower status of the population",
}

paths = wr.pandas.to_parquet(
    dataframe=df_boston,
    path=f"s3://{S3_BUCKET}/boston",
    database="awswrangler_test",
    mode="overwrite",
    description=desc,
    parameters=param,
    columns_comments=comments
)
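
The DataFrame columns are upper case (`CRIM`, `ZN`, ...), while the `comments` keys and the column names that end up in the catalog are lower case. A quick check before writing can catch comment keys that match no column, as well as columns left without a comment. The sketch below assumes that lower-casing is the only normalization involved, which may not cover names with special characters. `check_comments` is a hypothetical helper, and the comment texts are shortened to `"..."` for brevity.

```python
from typing import Dict, Iterable, List, Tuple


def check_comments(
    columns: Iterable[str], comments: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Compare lower-cased column names with comment keys (hypothetical helper).

    Returns (columns without a comment, comment keys matching no column).
    """
    normalized = [c.lower() for c in columns]
    uncommented = [c for c in normalized if c not in comments]
    unmatched = [k for k in comments if k not in normalized]
    return uncommented, unmatched


# Column names as they appear in df_boston.
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "target"]

# Same keys as the comments dict above; the texts are placeholders.
comment_keys = ["crim", "zn", "indus", "chas", "nox", "rm", "age",
                "dis", "rad", "tax", "ptratio", "b", "lstat"]
comments = {key: "..." for key in comment_keys}

uncommented, unmatched = check_comments(columns, comments)
print("columns without a comment:", uncommented)
print("comment keys without a column:", unmatched)
```

Here the check reports only `target` as uncommented, which matches the `comments` dictionary above: it documents the thirteen features but not the target column.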

### Checking Glue Catalog (AWS Console)

![AWS Data Wrangler](_static/glue_catalog_table_boston.png "AWS Data Wrangler")

### Looking up the new table

In [6]:
wr.glue.tables(name_contains="osto")

Unnamed: 0,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",


In [7]:
wr.glue.tables(name_prefix="bos")

Unnamed: 0,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",


In [8]:
wr.glue.tables(name_suffix="ton")

Unnamed: 0,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",


In [9]:
wr.glue.tables(search_text="UCI ML housing dataset")

Unnamed: 0,Database,Table,Description,Columns,Partitions
0,awswrangler_test,boston,This is a copy of UCI ML housing dataset. http...,"crim, zn, indus, chas, nox, rm, age, dis, rad,...",
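
The first three lookups above filter table names by substring, prefix and suffix. The sketch below reproduces those three filters locally with pandas string methods, which can help when reasoning about what a filter should match. It is only an emulation on a stand-in DataFrame, not how the library queries Glue, and it does not cover `search_text`. The `iris` row is a made-up second table added for contrast.

```python
import pandas as pd

# Stand-in for a table listing; "iris" is a made-up table added for contrast.
tables = pd.DataFrame(
    {
        "Database": ["awswrangler_test", "awswrangler_test"],
        "Table": ["boston", "iris"],
    }
)

names = tables["Table"]

# Local equivalents of name_contains, name_prefix and name_suffix.
contains = tables[names.str.contains("osto", regex=False)]
prefix = tables[names.str.startswith("bos")]
suffix = tables[names.str.endswith("ton")]

for label, result in [("contains", contains), ("prefix", prefix), ("suffix", suffix)]:
    print(label, list(result["Table"]))
```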


### Getting table details

In [10]:
wr.glue.table(database="awswrangler_test", name="boston")

Unnamed: 0,Column Name,Type,Partition,Comment
0,crim,double,False,per capita crime rate by town
1,zn,double,False,proportion of residential land zoned for lots ...
2,indus,double,False,proportion of non-retail business acres per town
3,chas,double,False,Charles River dummy variable (= 1 if tract bou...
4,nox,double,False,nitric oxides concentration (parts per 10 mill...
5,rm,double,False,average number of rooms per dwelling
6,age,double,False,proportion of owner-occupied units built prior...
7,dis,double,False,weighted distances to five Boston employment c...
8,rad,double,False,index of accessibility to radial highways
9,tax,double,False,"full-value property-tax rate per $10,000"
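
The table details also come back in tabular form, so they can be reshaped for programmatic use. The sketch below builds a column-to-comment mapping and lists the partition columns. It runs on a stand-in DataFrame holding three rows copied from the output above, chosen because their comments are not truncated there. The variable names are illustrative only.

```python
import pandas as pd

# Stand-in for the table details; three rows copied from the output above.
details = pd.DataFrame(
    {
        "Column Name": ["crim", "rm", "rad"],
        "Type": ["double", "double", "double"],
        "Partition": [False, False, False],
        "Comment": [
            "per capita crime rate by town",
            "average number of rooms per dwelling",
            "index of accessibility to radial highways",
        ],
    }
)

# Map each column name to its comment.
comment_by_column = dict(zip(details["Column Name"], details["Comment"]))

# Select the names of columns flagged as partitions (none in this table).
partition_columns = details.loc[details["Partition"], "Column Name"].tolist()

print(comment_by_column["rm"])
print(partition_columns)
```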
