<!--
#  Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
#    Licensed under the Apache License, Version 2.0 (the "License").
#    You may not use this file except in compliance with the License.
#    You may obtain a copy of the License at
#
#        http://www.apache.org/licenses/LICENSE-2.0
#
#    Unless required by applicable law or agreed to in writing, software
#    distributed under the License is distributed on an "AS IS" BASIS,
#    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#    See the License for the specific language governing permissions and
#    limitations under the License.
-->

# Extracting files in parallel using notebooks
***Extracting Data from Zipped Files and Migrating Output Files to S3 Buckets***

---
## Contents

1. [Introduction](#Introduction)
2. [Setup](#Setup)
    1. [Define Parameters](#Define-Parameters)
    2. [Copy Zip Files Locally to Handle Extraction](#Copy-Zip-Files-Locally-to-Handle-Extraction)
3. [Extract Zip Files](#Extract-Zip-Files)
4. [Checking for Errors and Clean Up](#Checking-for-Errors-and-Clean-Up)
  
---

## Introduction
This notebook walks through the process of extracting zip files and migrating their unzipped content to S3. We will go through the following steps to extract our files:
1. Migrate the files to our local environment.
2. Use shell commands in our notebook, via IPython, to unzip the files.
3. Use AWS CLI commands to move these unzipped files to a remote target S3 bucket.

When this notebook is called from **Example 1: "Orchestration Notebook for Building the Lake"**, we will concurrently run one notebook for each of the zipped files on AWS Fargate, which provides a serverless container execution environment. This way we can reduce the overall run time by extracting our zip files in parallel.
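
The orchestration itself lives in the other notebook and runs on AWS Fargate, so it is not reproduced here. As a purely local analogy of the same fan-out idea, the sketch below runs one task per zip file on a thread pool. The names `extract_one` and `extract_all` are hypothetical, and `extract_one` is only a placeholder that computes a target prefix instead of running a notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_one(zip_key):
    """Hypothetical stand-in for one notebook run; returns a target prefix."""
    file_name = zip_key.rsplit("/", 1)[-1]  # drop the folder part of the key
    return "extracted/" + file_name.rsplit(".", 1)[0]  # drop the extension


def extract_all(zip_keys, max_workers=4):
    """Run extract_one for every key concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, zip_keys))
```

Because each zip file is handled independently of the others, the work parallelizes without any shared state between tasks.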


***
## Setup

#### Define Parameters 

First, let's define the source folder, S3 bucket name, and zip file name for the zip file we wish to extract. This lets us format the extract path, which is the location where the zip file sits in our remote environment.

We will also specify an S3 path for the target folder, where the extracted content will be placed, and a `use_subdirs` flag that controls whether each archive is extracted into its own subfolder of that target.

In [None]:
sourceFolder = "landing/"
bucketName = "orbit-test-base-accoun-testlakebucketfa111111-1111111111"
zipFileName = "landing/cms/DE1_0_2008_Beneficiary_Summary_File_Sample_1.zip"
targetFolder = "s3://orbit-test-base-accoun-testlakebucketfa111111-1111111111/extracted/"
use_subdirs = True


In [None]:
toExtractPath = "s3://{}/{}".format(bucketName,zipFileName)

toExtractPath
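
If the bucket name or key ever arrives with a stray leading or trailing slash, the formatted path ends up with a doubled `/`. A minimal sketch of a more defensive join, assuming a hypothetical helper named `build_s3_uri` that is not part of the original notebook:

```python
def build_s3_uri(bucket_name, key):
    """Join a bucket name and an object key into an s3:// URI."""
    if not bucket_name:
        raise ValueError("bucket_name must not be empty")
    # Strip slashes at the join point so exactly one separator remains.
    return "s3://{}/{}".format(bucket_name.strip("/"), key.lstrip("/"))
```

With the parameters above, `build_s3_uri(bucketName, zipFileName)` would produce the same string as the `format` call in the cell.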


#### Copy Zip Files Locally to Handle Extraction
Once we have defined our parameters, we can copy the zip file from its S3 location (`toExtractPath`) to our local environment. This will allow us to use shell commands to unzip the file and then move its contents back to cloud storage in S3:

In [None]:
!aws s3 ls --recursive $toExtractPath

In [None]:
!aws s3 cp $toExtractPath ./$zipFileName

**Note:** Here we are just removing the filename extension so we can store the unzipped content in a local directory with the same name as the zip file:

In [None]:
baseName = zipFileName.split(".")[0]
baseName
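
Splitting on the first `.` works for the sample file name, but it would cut the path short if a folder or file name contained another dot. A small sketch of an alternative that removes only the final extension, using `os.path.splitext` from the standard library (the helper name `strip_extension` is hypothetical):

```python
import os


def strip_extension(path):
    """Return the path without its final extension, keeping any earlier dots."""
    return os.path.splitext(path)[0]
```

For a path such as `landing/v1.2/data.zip`, this keeps the `v1.2` folder intact, whereas `split(".")[0]` would return `landing/v1`.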

***
## Extract Zip Files
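Now, let's call the **unzip shell command** to unzip our file from its local source location into the target directory `baseName`. We first remove any existing directory with that name so that stale files from a previous run are not picked up.

We will then build the target folder name in S3. When `use_subdirs` is set, the base file name is appended to `targetFolder`, so each archive's content lands in its own subfolder when it is moved back to cloud storage: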


In [None]:
!rm -fR ./$baseName

In [None]:
!unzip ./$zipFileName -d ./$baseName
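
The notebook relies on the `unzip` shell command. Where that binary is not installed, the same step could be done with Python's standard `zipfile` module. This is an alternative sketch, not what the notebook runs, and the helper name `extract_zip` is hypothetical:

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir; return member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target directory if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Calling `extract_zip("./" + zipFileName, "./" + baseName)` would mirror the `unzip ... -d ...` cell.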

In [None]:
if use_subdirs:
    filename = baseName.split("/")[-1]
    targetFolder += filename
targetFolder
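
The concatenation above depends on `targetFolder` ending with a `/`. A minimal sketch of the same logic that tolerates a missing trailing slash, assuming a hypothetical helper named `target_prefix`:

```python
def target_prefix(target_folder, base_name, use_subdirs=True):
    """Return the S3 prefix that the extracted files should be copied to."""
    if not use_subdirs:
        return target_folder
    file_name = base_name.split("/")[-1]  # last path component, as in the cell above
    return target_folder.rstrip("/") + "/" + file_name
```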

#### Move Output and Error Files to Target s3 Bucket(s)
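Lastly, let's use the "**%%bash cell magic**" to run a cell with bash in a subprocess. The base name and target folder are passed in as the positional arguments `$1` and `$2`, and the cell's standard output and standard error are captured in the Python variables `output` and `error`. The cell copies all of the unzipped content to our target folder in S3, which completes the extraction process for our zip file: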

In [None]:
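%%bash --out output --err error -s "$baseName" "$targetFolder"
# $1 is the local base name, $2 is the S3 target folder
echo "aws s3 cp --recursive ./$1 $2"
# Quote the arguments so paths containing spaces are passed intact
aws s3 cp --recursive "./$1" "$2"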
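
Outside of a notebook, where the `%%bash` magic is not available, the same copy could be issued from plain Python. The sketch below is one possible equivalent, assuming the AWS CLI is on the `PATH`. The helper names `build_copy_command` and `copy_to_s3` are hypothetical.

```python
import subprocess


def build_copy_command(local_dir, target_folder):
    """Build the argument list for a recursive `aws s3 cp` of local_dir."""
    return ["aws", "s3", "cp", "--recursive", "./" + local_dir, target_folder]


def copy_to_s3(local_dir, target_folder):
    """Run the copy and return (stdout, stderr) as text, like --out / --err."""
    result = subprocess.run(
        build_copy_command(local_dir, target_folder),
        capture_output=True,  # collect both streams instead of printing them
        text=True,
    )
    return result.stdout, result.stderr
```

Passing the command as a list avoids shell quoting issues, since each argument reaches the CLI unchanged.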

***
## Checking for Errors and Clean Up
Let's double check that we did not run into any errors while uploading the unzipped files. We print the captured output and errors, then assert that at least one upload was reported and that nothing was written to the error stream.

Next, we can remove the local zip file and the local directory holding its unzipped content, and continue building out our Data Lake with the unzipped data securely stored in S3:


In [None]:
print(output)
print(error)
assert "upload" in output
assert len(error) == 0
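
The bare assertions stop the notebook on failure but do not say what went wrong. A sketch of a slightly more descriptive check, assuming that the AWS CLI reports each copied file on a line containing `upload:` (the helper name `count_uploads` is hypothetical):

```python
def count_uploads(output, error):
    """Return the number of reported uploads, or raise with a clear message."""
    if error.strip():
        raise RuntimeError("aws s3 cp reported errors:\n" + error)
    # splitlines() also splits on carriage returns from progress updates.
    uploads = [line for line in output.splitlines() if "upload:" in line]
    if not uploads:
        raise RuntimeError("no files were uploaded")
    return len(uploads)
```

Calling `count_uploads(output, error)` after the copy cell would fail under the same conditions as the two assertions, while including the captured error text in the exception.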

In [None]:
!rm -fR ./$baseName
!rm -f ./$zipFileName