# Create S3 Bucket and Upload CSV files to S3

Run the following cells to create an S3 bucket and upload this project's CSV data files to it.

Make sure to go to the **Edit** menu and click **Clear all outputs** before and after running this notebook.

## Set up Spark Session

In [0]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Make the Spark installation importable from Python (the session itself is started in a later cell)
import findspark
findspark.init()

In [0]:
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

In [0]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("S3Setup").config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar").getOrCreate()

## Mount Google Drive into this runtime

To upload the CSV data files to an S3 bucket, you first need to mount your Google Drive into this runtime. To do that, run the following cells.

The first cell displays a URL for authorization. Open the URL, copy the authentication code, and paste it into the provided field to mount your Google Drive.

In [0]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
# Check the contents of the current folder in the runtime.
! ls

In [0]:
# If the drive is mounted correctly, the current folder contains a directory called gdrive.
# This is where you can find your Google Drive contents.
%cd /content/gdrive/My Drive

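## Extract the CSV data files from the zip file

Running the following cells extracts the CSV data files from the .zip file inside the Resources folder of the repository. The first cell changes into the root directory of the repository on Google Drive; adjust the path if your copy is stored elsewhere.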

In [0]:
%cd /content/gdrive/My Drive/data_final_project/mental_health_ML

In [0]:
import os
import zipfile

# Extract every .zip archive in the Resources folder into that same folder.
# This cell must be run from the root directory of the repo.
extension = ".zip"
repo_root = os.getcwd()
print(f"The current working directory is {repo_root}.")

# Build the path to the "Resources" folder instead of changing into it,
# so the working directory stays at the repo root even if extraction fails.
resources_dir = os.path.join(repo_root, "Resources")
print(f"Looking for zip files in the following directory: {resources_dir}.")

# Loop through the items in the Resources directory.
for item in os.listdir(resources_dir):
  # Only process files with the ".zip" extension.
  if item.endswith(extension):
    try:
      # Build the full path of the zip file.
      file_name = os.path.join(resources_dir, item)
      # The context manager closes the archive even if extraction fails.
      with zipfile.ZipFile(file_name) as zip_ref:
        zip_ref.extractall(resources_dir)
      print(f"Successfully unzipped {item} into the following folder: {resources_dir}.")
    except Exception as e:
      print(f"Error trying to unzip {item}.")
      print(e)

# Confirm that the working directory is still the repo root.
print(f"The current working directory is {os.getcwd()}.")
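
If an archive holds files other than the data, a more selective variant may be useful. The following is a minimal sketch that extracts only the `.csv` members of a single archive. The function name `extract_csvs` and its parameters are hypothetical and not part of this notebook.

```python
import os
import tempfile
import zipfile


def extract_csvs(zip_path, dest_dir):
    """Extract only the .csv members of zip_path into dest_dir.

    Hypothetical helper: returns the list of member names that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            # Skip anything that is not a CSV file.
            if member.lower().endswith(".csv"):
                zf.extract(member, dest_dir)
                extracted.append(member)
    return extracted
```

In this notebook it could be called as `extract_csvs(file_name, "Resources")` for each zip file found, assuming the working directory is the repo root.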

## Install Boto3

Boto3 is the AWS SDK for Python. It lets you create, configure, and manage AWS services such as S3.

In [0]:
! pip install boto3

## Configuration

To use Boto3, you need to set up authentication credentials. Credentials for your AWS account can be found in the IAM Console.

Go through the following steps to create an AWS user and generate a new set of keys.

1. To create a new AWS user, go to your [AWS Console](https://console.aws.amazon.com/).

2. In the top navigation bar, click **Services** and then under **Security, Identity, and Compliance**, click **IAM**.

3. Then, from the left navigation, click **Users** > **Add user**.

4. Give the user a name (for example, *boto3user*).

5. Enable **Programmatic access** to be able to work with the AWS SDK.

6. Click **Next:Permissions**.

7. For permissions, select **Attach existing policies directly** and choose the **AmazonS3FullAccess** policy.

8. Click **Next:Tags**.

9. Click **Next:Review**.

10. Confirm the user details and click **Create user**.

11. Click **Download.csv** to make a copy of your credentials. You will need these later.

12. Install the AWS CLI from here: <https://aws.amazon.com/cli/>.

13. Run the following command from a terminal window (for example, Git Bash): `aws configure`

  Note: You might need to restart the terminal after installing the AWS CLI.

14. When prompted, enter your AWS Access Key ID, which can be found in the csv file you downloaded in step 11.

15. When prompted, enter your AWS Secret Access Key, which can be found in the csv file you downloaded in step 11.

16. When prompted for a default region name, enter `us-east-1`. Pressing Enter without typing a value leaves the region unset.

17. When prompted for a default output format, press Enter to use the default of None.

This sets up credentials for the default profile as well as a default region to use when creating connections.
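
`aws configure` stores the keys in an INI-style file, normally `~/.aws/credentials`, under a `[default]` section. The following is a minimal sketch of how you might confirm that the default profile contains both keys. The helper name `profile_has_keys` is hypothetical, and the file path is the standard location, which is assumed here rather than confirmed by this notebook.

```python
import configparser
import os


def profile_has_keys(credentials_text, profile="default"):
    """Return True if the given profile defines both access key fields.

    Hypothetical helper; it only checks that the fields are present and
    non-empty, not that the keys are valid.
    """
    # Disable interpolation so unusual characters in a key are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    has_id = bool(section.get("aws_access_key_id"))
    has_secret = bool(section.get("aws_secret_access_key"))
    return has_id and has_secret


# Usage on the machine where `aws configure` was run (assumed default path).
credentials_path = os.path.expanduser("~/.aws/credentials")
if os.path.exists(credentials_path):
    with open(credentials_path) as f:
        print(profile_has_keys(f.read()))
```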

## Set up S3 Resource

Run the following cell to import Boto3 and tell it to use the S3 service.

The cell also imports your AWS access key ID and secret access key from a `config.py` file. This file must define `ACCESS_ID` and `ACCESS_KEY` and be importable from the current working directory.

In [0]:
import boto3
from config import ACCESS_ID, ACCESS_KEY

# Use Amazon S3
s3 = boto3.resource(
    's3',
    aws_access_key_id=ACCESS_ID,
    aws_secret_access_key=ACCESS_KEY,
)
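
The contents of `config.py` are not shown in this notebook. One way it might be written, so that the keys are never typed into a file, is to read them from environment variables. The sketch below assumes the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; the helper `load_aws_keys` is a hypothetical name.

```python
import os


def load_aws_keys(env=None):
    """Return (access_id, access_key) read from environment variables.

    Hypothetical helper for config.py; raises if either value is missing.
    """
    env = os.environ if env is None else env
    access_id = env.get("AWS_ACCESS_KEY_ID")
    access_key = env.get("AWS_SECRET_ACCESS_KEY")
    if not access_id or not access_key:
        raise RuntimeError(
            "Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY before importing config."
        )
    return access_id, access_key


# In config.py, the names imported by the notebook would then be defined as:
# ACCESS_ID, ACCESS_KEY = load_aws_keys()
```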

## Create an S3 Bucket

The name of an S3 bucket must be globally unique across AWS, not just within your account or region. The `uuid` module helps ensure uniqueness by appending a randomly generated UUID to the bucket name prefix you choose.

In [0]:
import uuid

def create_bucket_name(bucket_prefix):
    # A bucket name must be between 3 and 63 characters long. str(uuid.uuid4())
    # is 36 characters, so the prefix can be at most 27 characters.
    return ''.join([bucket_prefix, str(uuid.uuid4())])
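
Checking the generated name locally can catch a bad prefix before any request reaches AWS. The following is a minimal sketch, assuming only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit). It does not cover every rule, such as the ban on names formatted like IP addresses. The names `is_valid_bucket_name` and `make_bucket_name` are hypothetical.

```python
import re
import uuid

# Simplified pattern for the basic naming rules described above.
BUCKET_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name):
    """Return True if name satisfies the simplified naming rules."""
    return BUCKET_NAME_PATTERN.fullmatch(name) is not None


def make_bucket_name(prefix):
    """Append a random UUID to prefix and validate the result."""
    name = prefix + str(uuid.uuid4())
    if not is_valid_bucket_name(name):
        raise ValueError(f"Invalid bucket name: {name}")
    return name
```

Because the UUID suffix is always 36 characters, a prefix longer than 27 characters or one containing uppercase letters is rejected here rather than by the S3 API.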

In [0]:
def create_bucket(bucket_prefix, s3_connection):
  bucket_name = create_bucket_name(bucket_prefix)
  bucket_response = s3_connection.create_bucket(
    Bucket=bucket_name)
  print(bucket_name)
  return bucket_name, bucket_response
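
The call above passes only `Bucket`, which works when the bucket is created in `us-east-1`. For any other region, S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`. The following is a minimal sketch of how the arguments might be built; the helper name `create_bucket_kwargs` is hypothetical.

```python
def create_bucket_kwargs(bucket_name, region=None):
    """Build keyword arguments for create_bucket (hypothetical helper).

    us-east-1 (or no region) needs only the bucket name; every other
    region also needs a location constraint.
    """
    kwargs = {"Bucket": bucket_name}
    if region and region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs
```

The result could then be passed as `s3_connection.create_bucket(**create_bucket_kwargs(bucket_name, region))`, where `region` matches the region the connection was created with.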

In [0]:
# Create bucket.
bucket_name, response = create_bucket('mentalhealthml', s3)

print(bucket_name)
print(response)

In [0]:
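# Grant public read access to the bucket that was just created.
# Note: buckets created since April 2023 have ACLs disabled and Block Public
# Access enabled by default, so this call fails unless those settings are
# changed on the bucket first.
bucket = s3.Bucket(bucket_name)
bucket.Acl().put(ACL='public-read')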

In [0]:
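# Verify that the bucket was created by printing the names of all buckets
# in the account. The loop variable is named so that it does not overwrite
# the `bucket` object created earlier.
for existing_bucket in s3.buckets.all():
    print(existing_bucket.name)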

## Upload the csv data files to the S3 Bucket

Run the following cells to upload the csv data files to the S3 bucket you just created.

In [0]:
import glob
import os

bucket = s3.Bucket(bucket_name)

# Collect every CSV file in the Resources folder of the repo.
path_to_csvs = os.path.join(".", "Resources")
all_files = glob.glob(os.path.join(path_to_csvs, "*.csv"))
for f in all_files:
  # Use the file's base name as the S3 object key.
  filename = os.path.basename(f)
  # Upload the file from the runtime to the S3 bucket and allow public reads
  # of the resulting object.
  bucket.upload_file(
    Filename=f, Key=filename, ExtraArgs={'ACL': 'public-read'})
  print(f"Uploaded {f} to s3://{bucket_name}/{filename}")

In [0]:
# Verify that the files were successfully uploaded to S3 by listing the
# object keys in the bucket.
for my_bucket_object in s3.Bucket(bucket_name).objects.all():
    print(my_bucket_object.key)
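
Reading through a printed list works for a handful of files. For a larger set, the comparison could be automated. The following is a minimal sketch that reports which local files have no matching object key, assuming each key is the file's base name. The helper name `missing_uploads` is hypothetical.

```python
import os


def missing_uploads(local_paths, uploaded_keys):
    """Return the sorted base names of local files absent from uploaded_keys."""
    uploaded = set(uploaded_keys)
    names = (os.path.basename(path) for path in local_paths)
    return sorted(name for name in names if name not in uploaded)
```

Here `uploaded_keys` could come from `[obj.key for obj in s3.Bucket(bucket_name).objects.all()]`, and `local_paths` from a `glob` over the Resources folder. An empty list means every local CSV file has a matching object.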