# Datalake Setup Automation 
## Overview

**Requirements**
- Download a large dataset using a virtual machine
- Create a data lake to upload the dataset to in a cloud platform of your choice
- No local downloads; use a cloud VM and document the process via screen share
- Walk through the process in a video under 5 minutes and upload it

**Rubric**
- Video is <5 minutes
- Video walks through process clearly 
- Team uses a virtual machine
- Team does not download any data locally
- Team creates a data lake in a cloud platform
- Team downloads/uploads dataset to the data lake using the VM

## Suggested Approach 

Host this Jupyter notebook in AWS SageMaker Studio and use the boto3 library to interact with AWS EC2 and S3 to satisfy the assignment requirements. This has the benefits of:
- minimizing the clicking around in the web UI
- providing code that can be reused later in the class with other datasets, where repeating these steps in the GUI would be tedious

**Steps**
1. Load this Jupyter notebook into an AWS SageMaker lab instance
2. Ensure [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html) is installed 
3. Configure boto3 with the necessary credentials for S3 access
these steps over and over again may become tedious
5. Create the S3 bucket that will serve as a data lake for our unstructured data
6. Use Requests or similar to download the data, writing it to disk if we need an intermediate stage (this could be essential if downloads are lengthy or get interrupted)
7. Upload the data to S3 and validate its presence
8. Destroy local data and release SageMaker resources

In [6]:
# boto3 will be preinstalled in the env if we run in SageMaker Studio
import boto3

## Configuration

Check whether an AWS credentials file exists and write one to disk if it doesn't. This might not be necessary when running inside a SageMaker instance.

In [None]:
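import os

# Placeholder credentials; replace the YOUR_* values before use
default_config = """\
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
"""

# Expand "~" so the path resolves to the current user's home directory
aws_creds_file = os.path.expanduser("~/.aws/credentials")

# Write the placeholder file only if no credentials file exists yet
if not os.path.exists(aws_creds_file):
    os.makedirs(os.path.dirname(aws_creds_file), exist_ok=True)
    with open(aws_creds_file, "w") as f:
        f.write(default_config)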
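
Before writing a credentials file at all, it may be worth checking whether boto3 can already resolve credentials, since inside SageMaker the notebook's execution role typically supplies them. The helper below is a minimal sketch of that check; the name `credentials_available` is hypothetical, and it assumes the caller passes in a `boto3.Session()`.

```python
def credentials_available(session):
    """Return True if the given boto3 session can resolve AWS credentials.

    `session.get_credentials()` returns None when no credentials are found
    in the environment, config files, or an attached role.
    """
    return session.get_credentials() is not None
```

In the notebook this would be called as `credentials_available(boto3.Session())`; if it returns `True`, the credentials file step can probably be skipped.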

## Import data

Go fetch the dataset(s) that will be uploaded to S3 to create our data lake. 

In [None]:
# requests will be preinstalled in the env if we run in SageMaker Studio
import requests
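
The download step could look something like the sketch below, which streams the response to disk in chunks so a large dataset never has to fit in memory. The function name `download_to_disk`, the `.part` temporary suffix, the 60-second timeout, and the 1 MiB chunk size are all assumptions made for illustration; the dataset URL is left to the caller.

```python
import os


def download_to_disk(url, dest_path, session=None, chunk_size=1024 * 1024):
    """Stream `url` to `dest_path` and return the number of bytes written."""
    if session is None:
        # Imported lazily so a stand-in session can be supplied instead
        import requests
        session = requests.Session()

    # Write to a temporary name first so an interrupted download
    # never leaves a partial file at the final path
    tmp_path = dest_path + ".part"
    bytes_written = 0
    with session.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()  # fail early on 4xx/5xx responses
        with open(tmp_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
                    bytes_written += len(chunk)

    # Move the completed file into place
    os.replace(tmp_path, dest_path)
    return bytes_written
```

Writing to a temporary file and renaming it afterwards is one way to get the intermediate stage mentioned in the steps: if the download is interrupted, only the `.part` file is left behind.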

## Upload to S3

In [None]:
s3 = boto3.resource('s3') 
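
The remaining steps (create the bucket, upload, validate, clean up) might be sketched as follows. The helpers take a low-level S3 client, which is available from the resource above as `s3.meta.client`. The function names, the `us-east-1` default region, and the size-comparison check are assumptions for illustration, not part of the assignment.

```python
import os


def create_data_lake_bucket(s3_client, bucket, region="us-east-1"):
    """Create the bucket that will act as the data lake."""
    if region == "us-east-1":
        # us-east-1 is the default and rejects an explicit LocationConstraint
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )


def upload_and_verify(s3_client, local_path, bucket, key, delete_local=False):
    """Upload a file, then confirm the object exists with the same size.

    Returns True when the remote size matches the local size. If
    `delete_local` is set, the local copy is removed only after a match.
    """
    local_size = os.path.getsize(local_path)
    s3_client.upload_file(local_path, bucket, key)

    # head_object raises if the key is missing; otherwise compare sizes
    remote_size = s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    verified = remote_size == local_size

    if verified and delete_local:
        os.remove(local_path)
    return verified
```

A call such as `upload_and_verify(s3.meta.client, "data.bin", "my-bucket", "raw/data.bin", delete_local=True)` would cover the upload, validation, and local cleanup steps in one go; the file, bucket, and key names here are placeholders.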