# Partition data 
**This notebook partitions the `orders` data into monthly batches, which the other notebooks in this module use for batch ingestion into the feature store.**

**Note:** Set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Setup](#Setup)
1. [Load data](#Load-data)
1. [Groupby and partition](#Groupby-and-partition)
1. [Copy partitions from local to S3](#Copy-partitions-from-local-to-S3)

# Setup

#### Imports 

In [1]:
import pandas as pd
import sagemaker
import shutil
import os

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


#### Essentials

In [2]:
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-feature-store'
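
As a rough sketch of how these two values are used later in the notebook, the S3 destination for the partition files is composed from the bucket name and the prefix. The helper name `partitions_uri` and the bucket name `example-bucket` are hypothetical and appear here only for illustration.

```python
def partitions_uri(bucket: str, prefix: str) -> str:
    """Compose the S3 URI under which the partition folders are stored."""
    # Strip stray slashes so the parts join with exactly one '/' between them.
    return f"s3://{bucket.strip('/')}/{prefix.strip('/')}/partitions/"


# 'example-bucket' is a placeholder; the notebook uses sagemaker_session.default_bucket().
print(partitions_uri("example-bucket", "sagemaker-feature-store"))
```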

# Load data

#### Read `orders` data

In [5]:
df = pd.read_csv('./data/raw/orders.csv')

In [6]:
# Derive a 'year-month' partition key (e.g. '2021-2') from the purchase timestamp
purchased_on = pd.to_datetime(df['purchased_on'])
df['year_month'] = purchased_on.dt.year.astype(str) + '-' + purchased_on.dt.month.astype(str)
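
The key built above is not zero-padded (for example `2020-8`), so partition names sort lexicographically rather than chronologically: `2020-10` comes before `2020-3`. If the ordering of folder names matters, a zero-padded key is one option. The sketch below compares both forms on a toy dataframe that reuses three timestamps from the `orders` data. The `year_month_padded` column is an assumption for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "purchased_on": [
        "2021-02-16 16:07:28",
        "2020-08-07 06:57:31",
        "2020-07-17 07:42:27",
    ]
})

ts = pd.to_datetime(sample["purchased_on"])

# Key in the same form as the notebook: no zero-padding on the month.
sample["year_month"] = ts.dt.year.astype(str) + "-" + ts.dt.month.astype(str)

# Zero-padded alternative (an assumption, not what the notebook uses).
sample["year_month_padded"] = ts.dt.strftime("%Y-%m")

print(sample)
```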

In [7]:
df.head(5)

| Unnamed: 0 | order_id | customer_id | product_id | purchase_amount | is_reordered | purchased_on | event_time | year_month |
|---|---|---|---|---|---|---|---|---|
| 0 | O1 | C5731 | P16 | 93.26 | 1 | 2021-02-16 16:07:28 | 2024-07-14T04:30:55.982Z | 2021-2 |
| 1 | O2 | C3541 | P12802 | 67.98 | 1 | 2020-08-07 06:57:31 | 2024-07-14T04:30:55.982Z | 2020-8 |
| 2 | O3 | C7402 | P8320 | 64.59 | 1 | 2021-04-23 17:07:09 | 2024-07-14T04:30:55.982Z | 2021-4 |
| 3 | O4 | C7356 | P5165 | 63.51 | 0 | 2020-08-19 14:04:09 | 2024-07-14T04:30:55.982Z | 2020-8 |
| 4 | O5 | C5806 | P12940 | 6.37 | 1 | 2020-07-17 07:42:27 | 2024-07-14T04:30:55.982Z | 2020-7 |


# Groupby and partition 
Group the `orders` dataframe by `year_month` and write each group to its own partition, one per year and month.

In [8]:
grouped = df.groupby('year_month')

In [9]:
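for name, group in grouped:
    partition_dir = f'../data/partitions/{name}'
    # Start from a clean directory so files from earlier runs are removed
    if os.path.exists(partition_dir):
        shutil.rmtree(partition_dir)
    os.makedirs(partition_dir)
    # Drop the helper column before writing; it is only needed for grouping
    partition_df = group.drop('year_month', axis=1)
    partition_df.to_csv(f'{partition_dir}/partition.csv', index=False)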
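
One way to sanity-check a partitioning step like the one above is to read the written files back and confirm that no rows were lost. The following is a minimal sketch under stated assumptions: it uses a toy dataframe with made-up rows and a temporary scratch directory instead of `../data/partitions`, so it can run without the real `orders` data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the orders dataframe (values are illustrative).
df = pd.DataFrame({
    "order_id": ["O1", "O2", "O3", "O4"],
    "year_month": ["2021-2", "2020-8", "2021-2", "2020-7"],
})

root = tempfile.mkdtemp()  # scratch directory instead of ../data/partitions

# Write one CSV per group, mirroring the partitioning loop.
for name, group in df.groupby("year_month"):
    partition_dir = os.path.join(root, name)
    os.makedirs(partition_dir, exist_ok=True)
    group.drop(columns="year_month").to_csv(
        os.path.join(partition_dir, "partition.csv"), index=False
    )

# Read every partition back and compare the total row count with the source.
written_rows = sum(
    len(pd.read_csv(os.path.join(root, name, "partition.csv")))
    for name in os.listdir(root)
)
print(written_rows == len(df))
```

The same comparison could be run against the real partition directory by pointing `root` at it and using the full `orders` dataframe.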

In [12]:
prefix

'sagemaker-feature-store'

# Copy partitions from local to S3 

In [10]:
!aws s3 cp ../data/partitions/ s3://{default_bucket}/{prefix}/partitions/ --recursive

upload: ../data/partitions/2020-1/partition.csv to s3://sagemaker-us-east-1-419974056037/sagemaker-feature-store/partitions/2020-1/partition.csv
upload: ../data/partitions/2020-11/partition.csv to s3://sagemaker-us-east-1-419974056037/sagemaker-feature-store/partitions/2020-11/partition.csv
upload: ../data/partitions/2020-12/partition.csv to s3://sagemaker-us-east-1-419974056037/sagemaker-feature-store/partitions/2020-12/partition.csv
upload: ../data/partitions/2020-3/partition.csv to s3://sagemaker-us-east-1-419974056037/sagemaker-feature-store/partitions/2020-3/partition.csv
upload: ../data/partitions/2020-4/partition.csv to s3://sagemaker-us-east-1-419974056037/sagemaker-feature-store/partitions/2020-4/partition.csv
upload: ../data/partitions/2020-6/partition.csv to s3://sagemaker-us-east-1-419974056037/sagemaker-feature-store/partitions/2020-6/partition.csv
upload: ../data/partitions/2020-10/partition.csv to s3://sagemaker-us-east-1-419974056037/sagemaker-feature-store/partitions/2