# AWS and Google Drive Setup

Run through the following cells to set up AWS and Google Drive for the project.

Make sure to go to the **Edit** menu and click **Clear all outputs** before and after running this notebook.

## Set up Spark session


In [0]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Make the Spark installation importable from Python
import findspark
findspark.init()

In [0]:
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

In [0]:
# Start Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CloudSetup").config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar").getOrCreate()
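
If the session fails to start, the usual cause is that one of the downloads above did not complete. The following is a minimal sketch, not part of the original notebook, that checks the paths used in the cells above before you build the session. The helper name `missing_spark_prerequisites` is hypothetical.

```python
import os

def missing_spark_prerequisites(paths):
    """Return the labels whose path does not exist on disk."""
    return [label for label, path in paths.items() if not os.path.exists(path)]

# Paths copied from the cells above; adjust them if your versions differ.
prerequisites = {
    "JAVA_HOME": "/usr/lib/jvm/java-8-openjdk-amd64",
    "SPARK_HOME": "/content/spark-2.4.5-bin-hadoop2.7",
    "PostgreSQL JDBC jar": "/content/postgresql-42.2.9.jar",
}

missing = missing_spark_prerequisites(prerequisites)
if missing:
    print("Missing before starting Spark:", ", ".join(missing))
else:
    print("All Spark prerequisites found.")
```

If anything is reported missing, rerun the corresponding install or download cell first.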

## Mount Google Drive into this runtime

To store and access the CSV data files from an S3 bucket, you need to mount your Google Drive into this runtime. To do that, run the following cells.

Running the mount cell displays a URL. Open it, sign in, copy the authentication code, and paste it into the field provided. Your Google Drive is then mounted.

In [0]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
# Check the contents of the current folder in the runtime.
! ls

In [0]:
# If the drive is mounted correctly, the current folder contains a directory called gdrive.
# This is where you can find your Google Drive contents.
%cd /content/gdrive/My Drive

In [0]:
# Create folder in Google Drive to store project files.
project_folder_exists = os.path.isdir('data_final_project')

# If the project folder doesn't exist, create it.
if not project_folder_exists:
  %mkdir data_final_project/

# Change into the project folder (it exists at this point either way).
%cd data_final_project
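
The check-then-create step above can also be written in plain Python, which avoids the IPython magics and makes the cell safe to rerun. This is a sketch under the assumption that the drive is mounted at `/content/gdrive`, as in the mount cell. The function name `ensure_project_dir` is hypothetical.

```python
import os

def ensure_project_dir(drive_root, folder_name="data_final_project"):
    """Create folder_name under drive_root if needed and return its path."""
    if not os.path.isdir(drive_root):
        # A missing root usually means the drive was never mounted.
        raise FileNotFoundError(f"{drive_root} not found; is Google Drive mounted?")
    project_dir = os.path.join(drive_root, folder_name)
    # exist_ok=True makes this a no-op when the folder already exists.
    os.makedirs(project_dir, exist_ok=True)
    return project_dir

# In Colab, after mounting the drive:
#   project_dir = ensure_project_dir("/content/gdrive/My Drive")
#   os.chdir(project_dir)
```

Raising an error when the drive root is missing stops the notebook early, rather than silently creating the project folder inside the temporary runtime.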

## Clone GitHub Repository

Running this cell clones the repository into Google Drive, if it is not already there, so that the project files are accessible from the Colab notebook. If the repository already exists in Google Drive, the cell simply pulls the latest changes from master.

As part of this step, you will need to enter your GitHub email and GitHub username.

This cell can take a few minutes to run.

In [0]:
repo_exists = os.path.isdir('mental_health_ML')

email = input('GitHub Email: ')
username = input('GitHub Username: ')

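# Use {…} so IPython substitutes the Python variables instead of passing the literal words.
!git config --global user.email "{email}"
!git config --global user.name "{username}"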

if not repo_exists:
  ! git clone https://github.com/abbylemon/mental_health_ML.git

%cd mental_health_ML/

! git stash
! git pull origin master
! git stash pop
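
The clone-or-pull logic described above can be sketched in plain Python with `subprocess`, which passes the email and username as real arguments rather than relying on shell substitution. This is an illustrative sketch, not the notebook's own cell: the function names are hypothetical, and it leaves out the stash and stash pop steps, so it assumes there are no local changes to preserve.

```python
import os
import subprocess

REPO_URL = "https://github.com/abbylemon/mental_health_ML.git"

def git_setup_commands(email, username, repo_dir="mental_health_ML"):
    """Return the git commands, as argument lists, to configure and sync the repo."""
    commands = [
        ["git", "config", "--global", "user.email", email],
        ["git", "config", "--global", "user.name", username],
    ]
    # Clone only when the repository folder is not there yet.
    if not os.path.isdir(repo_dir):
        commands.append(["git", "clone", REPO_URL, repo_dir])
    # Pull in both cases; `-C` runs git inside repo_dir without changing directory.
    commands.append(["git", "-C", repo_dir, "pull", "origin", "master"])
    return commands

def run_git_setup(email, username, repo_dir="mental_health_ML"):
    """Run each command in order and stop at the first failure."""
    for command in git_setup_commands(email, username, repo_dir):
        subprocess.run(command, check=True)
```

Building the command list separately from running it lets you print and inspect the commands before anything touches your Drive.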

## Create config.py file

In the **mental_health_ML** directory, create a file called **config.py** and add the following contents:

```python
ACCESS_ID='AWS_ACCESS_KEY_ID'
ACCESS_KEY='AWS_SECRET_ACCESS_KEY'
BUCKET_NAME='S3_BUCKET_NAME'
```

Replace AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and S3_BUCKET_NAME with their actual values.
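
A common slip is to leave one of the placeholders unchanged. The sketch below, which is an addition and not part of the project code, shows one way to catch that before the values are used. The helper `unfilled_settings` is hypothetical, and the example values are made up.

```python
# Placeholder text for each setting, as shown in the config.py template.
PLACEHOLDERS = {
    "ACCESS_ID": "AWS_ACCESS_KEY_ID",
    "ACCESS_KEY": "AWS_SECRET_ACCESS_KEY",
    "BUCKET_NAME": "S3_BUCKET_NAME",
}

def unfilled_settings(settings):
    """Return names that are missing, empty, or still set to their placeholder."""
    return [
        name for name, placeholder in PLACEHOLDERS.items()
        if not settings.get(name) or settings[name] == placeholder
    ]

# In the notebook you would pass the real module's values, for example:
#   import config
#   problems = unfilled_settings(vars(config))
# Made-up example: one value filled in, one placeholder left, one name missing.
example = {"ACCESS_ID": "AKIAEXAMPLE", "ACCESS_KEY": "AWS_SECRET_ACCESS_KEY"}
print(unfilled_settings(example))
```

An empty list means all three settings have been filled in.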