# 🔋 Dataset Download

In this notebook, we perform the following steps:

1. **Environment Configuration**  
   We select the `100cep_gateway` catalog and the `staging` schema to ensure that all data and objects created are organized in the correct environment.

2. **Unity Catalog Volume Creation**  
   We create a volume called `imdb` to store data files securely and centrally in Unity Catalog.

3. **Library Installation and Import**  
   We install the `kagglehub` library to facilitate downloading datasets directly from Kaggle.

4. **Dataset Download and Copy**  
   We use `kagglehub` to download the "olistbr/brazilian-ecommerce" dataset from Kaggle. Then we copy the downloaded files to the volume created in Unity Catalog, ensuring that the data is available for future analyses.

These steps prepare the environment for advanced analyses using real Brazilian e-commerce data, with governance and security provided by Unity Catalog.

In [0]:
%sql
USE CATALOG `100cep_gateway`

In [0]:
%sql
USE SCHEMA staging

In [0]:
%sql
CREATE VOLUME IF NOT EXISTS imdb;

In [0]:
%pip install kagglehub
import kagglehub

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
import shutil
import os

# Download the latest version of the dataset; returns the local directory path
path = kagglehub.dataset_download(
    "olistbr/brazilian-ecommerce"
)

uc_volume_path = "/Volumes/100cep_gateway/staging/imdb"

os.makedirs(uc_volume_path, exist_ok=True)

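for file_name in os.listdir(path):
    src_file = os.path.join(path, file_name)
    dst_file = os.path.join(uc_volume_path, file_name)
    # Skip subdirectories: shutil.copy2 only handles regular files
    if not os.path.isfile(src_file):
        continue
    # Copy the file, overwriting any existing copy in the volume
    shutil.copy2(src_file, dst_file)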

print(
    f"Files copied to Unity Catalog volume at: {uc_volume_path}"
)

Files copied to Unity Catalog volume at: /Volumes/100cep_gateway/staging/imdb
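
As an optional follow-up, the copy could be checked by comparing the downloaded files against what landed in the volume. The sketch below is not part of the original notebook. `find_copy_mismatches` is a hypothetical helper name, and it assumes that comparing file names and sizes of top-level files is a sufficient check (it does not compare contents or checksums).

```python
import os


def find_copy_mismatches(src_dir: str, dst_dir: str) -> list[str]:
    """Return names of regular files in src_dir that are missing from
    dst_dir or whose size differs there (hypothetical helper)."""
    mismatches = []
    for name in sorted(os.listdir(src_dir)):
        src_file = os.path.join(src_dir, name)
        if not os.path.isfile(src_file):
            continue  # only top-level regular files are compared
        dst_file = os.path.join(dst_dir, name)
        # A file counts as a mismatch if it is absent or its size differs
        if (not os.path.isfile(dst_file)
                or os.path.getsize(dst_file) != os.path.getsize(src_file)):
            mismatches.append(name)
    return mismatches
```

In this notebook it would be called as `find_copy_mismatches(path, uc_volume_path)`; an empty list suggests every downloaded file is present in the volume with a matching size.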
