Make sure to go to the **Edit** menu and click **Clear all outputs** before and after running this notebook.

# Load CSV files from S3 bucket into Spark dataframes

## Start Spark Session

In [0]:
# Install Java, Spark, and Findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Initialize findspark so that pyspark can be imported
import findspark
findspark.init()

In [0]:
# Download the PostgreSQL JDBC driver
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MentalHealthETL").config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar").getOrCreate()

## Mount Google Drive into this runtime

The AWS credentials used to access the CSV data files in the S3 bucket are read from a **config.py** file stored on Google Drive, so you need to mount your Google Drive into this runtime. To do that, run the following cells.

Running the first cell displays a URL. Open it, authorize access, and paste the authentication code it gives you into the input field that appears. Your Google Drive is then mounted at `/content/gdrive`.

In [0]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
%cd /content/gdrive/My Drive/data_final_project/mental_health_ML

## Create config.py file

In the **mental_health_ML** directory, create a file called **config.py** and add the following contents:

```python
ACCESS_ID='AWS_ACCESS_KEY_ID'
ACCESS_KEY='AWS_SECRET_ACCESS_KEY'
BUCKET_NAME='S3_BUCKET_NAME'
```

Replace `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `S3_BUCKET_NAME` with your actual values. The file is listed in `.gitignore` so that it is not committed to GitHub.
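A forgotten placeholder only shows up later as an authentication error, so it can help to check the settings first. The following is a minimal sketch of such a check. The helper name `find_unset_settings` and the example values are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: not part of the original notebook.
# Maps each setting name in config.py to its placeholder text.
PLACEHOLDERS = {
    "ACCESS_ID": "AWS_ACCESS_KEY_ID",
    "ACCESS_KEY": "AWS_SECRET_ACCESS_KEY",
    "BUCKET_NAME": "S3_BUCKET_NAME",
}


def find_unset_settings(settings):
    """Return the names of settings that are empty or still hold a placeholder."""
    unset = []
    for name, placeholder in PLACEHOLDERS.items():
        value = settings.get(name, "")
        if not value or value == placeholder:
            unset.append(name)
    return unset


# Example with made-up values: only BUCKET_NAME has been filled in.
example = {
    "ACCESS_ID": "AWS_ACCESS_KEY_ID",
    "ACCESS_KEY": "",
    "BUCKET_NAME": "my-example-bucket",
}
print(find_unset_settings(example))  # ['ACCESS_ID', 'ACCESS_KEY']
```

In the notebook, the same check could be run against the real file with `import config` followed by `find_unset_settings(vars(config))`.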

## Access S3 bucket where csv files are stored

In [0]:
# Install the AWS SDK for Python (boto3)
! pip install boto3

In [0]:
import boto3
from config import ACCESS_ID, ACCESS_KEY, BUCKET_NAME

# Create an S3 resource authenticated with the credentials from config.py
s3 = boto3.resource('s3', aws_access_key_id=ACCESS_ID, aws_secret_access_key=ACCESS_KEY)
bucket_name = BUCKET_NAME

# Bucket where csv files are stored.
bucket = s3.Bucket(bucket_name)

## Read in data from S3 bucket and load into Spark dataframes

In [0]:
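# Read in data from S3 bucket
import os
from pyspark import SparkFiles

bucket_name = BUCKET_NAME
original_dataframes = {}

for obj in bucket.objects.all():
  key = obj.key
  # Skip anything that is not a CSV file (e.g. folder placeholder keys)
  if not key.lower().endswith(".csv"):
    continue
  # File names are assumed to end in a four-digit year, e.g. "..._2019.csv"
  file_name = os.path.basename(key)
  year = os.path.splitext(file_name)[0][-4:]
  # The file is fetched over plain HTTPS without credentials, so the object must be publicly readable
  url = f"https://{bucket_name}.s3.amazonaws.com/{key}"
  spark.sparkContext.addFile(url)
  # addFile stores the download under its base name, so look it up by file_name
  original_dataframes[year] = spark.read.csv(SparkFiles.get(file_name), sep=",", header=True, inferSchema=True)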

# Show DataFrame
original_dataframes["2019"].show(n=5)
# original_dataframes["2018"].show(n=5)
# original_dataframes["2017"].show(n=5)
# original_dataframes["2016"].show(n=5)
# original_dataframes["2014"].show(n=5)
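The loop above uses the last four characters before the file extension as the year. If a file name does not end in a year, this silently produces a wrong dictionary key. A more defensive variant is sketched below. It assumes that every data file name ends in a four-digit year. The function name `year_from_key` and the sample keys are hypothetical.

```python
import posixpath
import re


def year_from_key(key):
    """Return the four-digit year that ends the file name in an S3 key, or None."""
    # S3 keys always use "/" as the separator, so posixpath is used on every platform
    stem = posixpath.splitext(posixpath.basename(key))[0]
    match = re.search(r"(\d{4})$", stem)
    return match.group(1) if match else None


# Hypothetical keys, for illustration only
print(year_from_key("survey_2019.csv"))      # 2019
print(year_from_key("raw/survey_2016.csv"))  # 2016
print(year_from_key("README.txt"))           # None
```

Keys for which the function returns `None` could then be skipped instead of being loaded under a meaningless name.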