# AWS Cloud Setup

Run the following cells to set up an S3 bucket that stores the CSV data files for this project.

## Set up Spark session


In [0]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Make the Spark installation importable from Python
import findspark
findspark.init()

In [0]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CloudSetup").getOrCreate()

## Mount Google Drive into this runtime

The CSV data files are staged in Google Drive before they are uploaded to an S3 bucket, so you need to mount your Google Drive into this runtime. To do that, run the following cells.

Mounting displays a URL. Open it in a browser, authorize access, and paste the authorization code into the input field. Your Google Drive is then mounted at `/content/gdrive`.

In [3]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [4]:
# Check the contents of the current folder in the runtime.
! ls

gdrive	sample_data  spark-2.4.5-bin-hadoop2.7	spark-2.4.5-bin-hadoop2.7.tgz


In [5]:
# If the drive is mounted correctly, the current folder contains a directory called gdrive.
# Your Google Drive contents are under gdrive/My Drive.
%cd /content/gdrive/My Drive

/content/gdrive/My Drive


In [6]:
# Create a folder in Google Drive to temporarily store the CSV data files.
# exist_ok=True makes this a no-op when the folder already exists.
os.makedirs('data_final_project', exist_ok=True)

# Change into that directory.
%cd data_final_project

/content/gdrive/My Drive/data_final_project


## Clone GitHub Repository

Running this cell clones the repository into Google Drive if it is not already there. If the repository already exists in Google Drive, the cell pulls the latest changes from `master`.

As part of this step, you are asked for your GitHub email and GitHub username, so have those ready.

This cell can take a few minutes to run.

In [7]:
repo_exists = os.path.isdir('mental_health_ML')

email = input('GitHub Email: ')
username = input('GitHub Username: ')

# Curly braces make IPython substitute the Python variables into the shell command.
!git config --global user.email "{email}"
!git config --global user.name "{username}"

if not repo_exists:
  ! git clone https://github.com/abbylemon/mental_health_ML.git

%cd mental_health_ML/

! git add .
! git stash
! git pull origin master
! git stash pop

GitHub Email: philipstubbs13@gmail.com
GitHub Username: philipstubbs13
/content/gdrive/My Drive/data_final_project/mental_health_ML
Saved working directory and index state WIP on aws_cloud_setup: 2c5b85d start
From https://github.com/abbylemon/mental_health_ML
 * branch            master     -> FETCH_HEAD
Already up to date.
^C
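
The clone-or-pull logic described above can also be sketched in plain Python, outside of IPython shell magics. This is a minimal sketch, not the notebook's own code: the helper name `sync_commands` is hypothetical, and it only builds the git command lists so they can be inspected before being run.

```python
import os

REPO_URL = "https://github.com/abbylemon/mental_health_ML.git"


def sync_commands(repo_dir, repo_url=REPO_URL, branch="master"):
    """Return the git commands needed to bring repo_dir up to date.

    Hypothetical helper: clones when repo_dir is missing, then always
    stashes local changes, pulls the branch, and restores the stash.
    """
    commands = []
    if not os.path.isdir(repo_dir):
        commands.append(["git", "clone", repo_url, repo_dir])
    commands += [
        ["git", "-C", repo_dir, "add", "."],
        ["git", "-C", repo_dir, "stash"],
        ["git", "-C", repo_dir, "pull", "origin", branch],
        ["git", "-C", repo_dir, "stash", "pop"],
    ]
    return commands
```

Each list can be passed to `subprocess.run`. Note that `git stash pop` exits with an error when nothing was stashed, so that step should be treated as optional.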


## Extract the csv data files from the zip file

Running this cell will extract the csv data files from the .zip file inside the Resources folder of the repository.

In [9]:
import zipfile

# unzip files in Resources folder.
extension = ".zip"
extracted_dir_name = "."

# Get the current working directory.
# Need to be in root directory of repo for this to work.
cwd_dir_name = os.getcwd()
print(f"The current working directory is {cwd_dir_name}.")

# change directory from working dir to dir with zip file.
os.chdir("Resources")
# This should be the "Resources" folder.
dir_name = os.getcwd()
print(f"You are now in the following directory: {dir_name}.")

# Loop through the items in the directory.
for item in os.listdir(dir_name):
  # Check for the ".zip" extension.
  if item.endswith(extension):
    try:
      # Get the full path of the archive.
      file_name = os.path.abspath(item)
      # Directory where the archive contents will be extracted.
      unzipped_directory = os.path.join(extracted_dir_name)
      # Open the archive and extract it; the with block closes it afterwards.
      with zipfile.ZipFile(file_name) as zip_ref:
        zip_ref.extractall(unzipped_directory)
      print(f"Successfully unzipped {item} into the following folder:{dir_name}.")
    except Exception as e:
      print("Error trying to unzip data file(s).")
      print(e)
            
# Go up one directory, back into the repo root directory.
os.chdir(os.pardir)
# Print the parent of the repo root (the project folder in Google Drive).
print(os.path.dirname(os.getcwd()))

The current working directory is /content/gdrive/My Drive/data_final_project/mental_health_ML.
You are now in the following directory: /content/gdrive/My Drive/data_final_project/mental_health_ML/Resources.
Successfully unzipped osmi_mental_health_in_tech_survey_results.zip into the following folder:/content/gdrive/My Drive/data_final_project/mental_health_ML/Resources.
/content/gdrive/My Drive/data_final_project
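
For comparison, the same extraction step can be written more compactly with `pathlib`. This is a sketch under stated assumptions rather than a replacement for the cell above: the function name `extract_zips` is hypothetical, and it assumes every archive should be unpacked into the folder that contains it.

```python
import zipfile
from pathlib import Path


def extract_zips(folder):
    """Extract every .zip archive in folder into that same folder.

    Returns the sorted names of the CSV files present afterwards.
    """
    folder = Path(folder)
    for archive in sorted(folder.glob("*.zip")):
        # The context manager closes the archive even if extraction fails.
        with zipfile.ZipFile(archive) as zip_ref:
            zip_ref.extractall(folder)
    return sorted(p.name for p in folder.glob("*.csv"))
```

From the repo root, `extract_zips("Resources")` would unpack the survey archive without changing the working directory.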


## Install Boto3

Boto3 is the AWS SDK for Python. It lets you create, configure, and manage AWS services, such as S3, from Python code.

In [10]:
! pip install boto3



## Configuration

To use Boto3, you need to set up authentication credentials. Credentials for your AWS account can be found in the IAM Console.

Go through the following steps to create an AWS user and generate a new set of keys.

1. To create a new AWS user, go to your [AWS Console](https://console.aws.amazon.com/).

2. In the top navigation bar, click **Services** and then under **Security, Identity, and Compliance**, click **IAM**.

3. Then, from the left navigation, click **Users** > **Add user**.

4. Give the user a name (for example, *boto3user*).

5. Enable **Programmatic access** to be able to work with the AWS SDK.

6. Click **Next:Permissions**.

7. For permissions, select **Attach existing policies directly** and choose the **AmazonS3FullAccess** policy.

8. Click **Next:Tags**.

9. Click **Next:Review**.

10. Confirm the user details and click **Create user**.

11. Click **Download.csv** to make a copy of your credentials. You will need these later.

12. Install the AWS CLI from here: <https://aws.amazon.com/cli/>.

13. Run the following command from a terminal window (for example, Git Bash): `aws configure`

    Note: You might need to restart the terminal after installing the AWS CLI.

14. When prompted, enter your AWS Access Key ID, which can be found in the csv file you downloaded in step 11.

15. When prompted, enter your AWS Secret Access Key, which can be found in the csv file you downloaded in step 11.

16. When prompted for a default region name, press Enter to leave it unset. With no region configured, S3 buckets are created in us-east-1.

17. When prompted for a default output format, press Enter to use the default of None.

This sets up credentials for the default profile as well as a default region to use when creating connections.
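
`aws configure` stores these answers in plain-text INI files under `~/.aws` (`credentials` for the keys, `config` for the region and output format). The sketch below parses an example `credentials` file with the standard library to show that layout. The key values are placeholders, and `read_profile` is a hypothetical helper, not part of Boto3 or the AWS CLI.

```python
import configparser

# Example contents of ~/.aws/credentials after running `aws configure`.
# The values are placeholders for illustration only.
SAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""


def read_profile(text, profile="default"):
    """Return the key/value pairs stored for one profile."""
    # interpolation=None keeps any "%" characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser[profile])


profile = read_profile(SAMPLE_CREDENTIALS)
print(sorted(profile))
```

Because the file is plain text, treat it like a password file and keep it out of version control.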

## Set up S3 Resource

Run the following cell to import Boto3 and tell it to use the S3 service.

You will also be prompted for your AWS Access Key ID and your AWS Secret Access Key. Both can be found in the CSV file that you downloaded when you created an AWS user in the previous section.

In [11]:
import boto3
from getpass import getpass

# getpass hides the typed values so the keys are not saved in the notebook output.
ACCESS_ID = getpass('AWS Access Key ID: ')
ACCESS_KEY = getpass('AWS Secret Access Key: ')

# Use Amazon S3
s3 = boto3.resource('s3', aws_access_key_id=ACCESS_ID, aws_secret_access_key=ACCESS_KEY)

AWS Access Key ID: ··········
AWS Secret Access Key: ··········


## Create an S3 Bucket

The name of an S3 bucket must be unique across all AWS accounts and regions. To help ensure that the bucket name is unique, the `uuid` module is used to append a randomly generated UUID to the bucket name prefix you choose.

In [0]:
import uuid
def create_bucket_name(bucket_prefix):
    # The generated bucket name must be between 3 and 63 chars long
    return ''.join([bucket_prefix, str(uuid.uuid4())])
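
Because `str(uuid.uuid4())` is always 36 characters long, the prefix can be at most 27 characters before the name exceeds the 63-character limit. Below is a minimal sketch of a stricter variant that fails early instead of waiting for S3 to reject the name. `create_checked_bucket_name` is a hypothetical name, and the pattern covers only the basic S3 naming rules.

```python
import re
import uuid

# Basic S3 naming rules: 3-63 characters, lowercase letters, digits, dots and
# hyphens, beginning and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def create_checked_bucket_name(bucket_prefix):
    """Append a random UUID to the prefix and validate the result."""
    bucket_name = f"{bucket_prefix}{uuid.uuid4()}"
    if not _BUCKET_NAME.match(bucket_name):
        raise ValueError(f"Invalid S3 bucket name: {bucket_name!r}")
    return bucket_name
```

With the prefix used in this notebook, `create_checked_bucket_name('mentalhealthml')` returns a 50-character name, well inside the limit.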

In [0]:
def create_bucket(bucket_prefix, s3_connection):
  bucket_name = create_bucket_name(bucket_prefix)
  bucket_response = s3_connection.create_bucket(
    Bucket=bucket_name)
  print(bucket_name)
  return bucket_name, bucket_response
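
The `create_bucket` helper above sends no location constraint, which works when the bucket is meant to live in `us-east-1`. For any other region, S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`. The sketch below builds the keyword arguments for either case. `bucket_kwargs` is a hypothetical helper, and the region and bucket name shown are only examples.

```python
def bucket_kwargs(bucket_name, region="us-east-1"):
    """Build keyword arguments for s3.create_bucket(**kwargs)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(bucket_kwargs("example-bucket", "eu-west-1"))
```

The result could then be passed along as `s3_connection.create_bucket(**bucket_kwargs(bucket_name, region))`.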

In [35]:
# Create bucket.
bucket_name, response = create_bucket('mentalhealthml', s3)

print(bucket_name)
print(response)

mentalhealthml24fc4443-f38a-474b-af39-069ee7fd1932
mentalhealthml24fc4443-f38a-474b-af39-069ee7fd1932
s3.Bucket(name='mentalhealthml24fc4443-f38a-474b-af39-069ee7fd1932')


In [36]:
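# Grant public read access to the bucket that was just created.
# Note: buckets created since April 2023 have ACLs disabled and Block Public
# Access enabled by default, so this call fails unless those settings are changed.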
bucket = s3.Bucket(bucket_name)
bucket.Acl().put(ACL='public-read')

{'ResponseMetadata': {'HTTPHeaders': {'content-length': '0',
   'date': 'Sun, 10 May 2020 13:38:03 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': 'Q1iAFtTYgtfg6t9Y092wnhZaKMTeQeVCGewmU26yya2HlaDY55UMlz3HPEUclIUCNBvj9HsTeP0=',
   'x-amz-request-id': 'FAD5B5A54EEFE6EB'},
  'HTTPStatusCode': 200,
  'HostId': 'Q1iAFtTYgtfg6t9Y092wnhZaKMTeQeVCGewmU26yya2HlaDY55UMlz3HPEUclIUCNBvj9HsTeP0=',
  'RequestId': 'FAD5B5A54EEFE6EB',
  'RetryAttempts': 0}}

In [51]:
# Verify that bucket was created.
# Print out bucket names.
for bucket in s3.buckets.all():
    print(bucket.name)

amplifyexample-20181111091510-deployment
artowl.co
client-deployments-mobilehub-1930970617
client-deployments-mobilehub-237277109
client-hosting-mobilehub-1930970617
client-hosting-mobilehub-237277109
grudges-deployments-mobilehub-1653245828
grudges-deployments-mobilehub-661855754
grudges-hosting-mobilehub-1653245828
grudges-hosting-mobilehub-661855754
grudgesappsync-deployments-mobilehub-891706388
grudgesappsync-hosting-mobilehub-891706388
mentalhealthml24fc4443-f38a-474b-af39-069ee7fd1932
phil-data-bootcamp
photoalbums-20181117161541-deployment
photoalbums71cecdc2603a42998e406bac6622dda7
trapperkeepermaster-deployments-mobilehub-1512747905
trapperkeepermaster-hosting-mobilehub-1512747905
trapperkeepermaster-userfiles-mobilehub-1512747905
www.artowl.co


## Upload the csv data files to the S3 Bucket

Run the following cells to upload the csv data files to the S3 bucket you just created.

In [0]:
import glob
bucket = s3.Bucket(bucket_name)

path_to_csvs = os.path.join(".", "Resources")
all_files = glob.glob(os.path.join(path_to_csvs, "*.csv"))
for f in all_files:
  filename = os.path.basename(f)
  # Upload the file from the local machine to the S3 bucket and make it
  # publicly readable in the same request. Calling put(ACL=...) on the object
  # afterwards would overwrite the upload with an empty object.
  s3.Object(bucket_name, filename).upload_file(f, ExtraArgs={'ACL': 'public-read'})
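
Once an object is publicly readable, it can be fetched over HTTPS without credentials. Below is a minimal sketch of building the virtual-hosted-style URL, assuming the bucket lives in `us-east-1` and its name contains no dots. `public_object_url` is a hypothetical helper, and the bucket and file names are placeholders.

```python
from urllib.parse import quote


def public_object_url(bucket_name, key):
    """Return the virtual-hosted-style HTTPS URL for a public S3 object."""
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    return f"https://{bucket_name}.s3.amazonaws.com/{quote(key)}"


print(public_object_url("example-bucket", "survey results.csv"))
# → https://example-bucket.s3.amazonaws.com/survey%20results.csv
```

A URL built this way can be handed to any tool that reads CSV over HTTP, provided the object's ACL really is `public-read`.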