The purpose of this repo is to:
- Conduct Data Science experiments to build a model that predicts viral Reddit posts.
- Build a reusable ETL pipeline for both data science development and model deployment.
- House the Docker files needed to build an image of the model.
- Deploy infrastructure and model on AWS ECS.
- First ensure the DynamoDB tables are set up via DynamoDB-Setup.
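
  If you want to confirm the tables are there before moving on, a quick boto3 check might look like the following sketch (the table names are hypothetical placeholders, not necessarily what DynamoDB-Setup creates):

  ```python
  import boto3

  # Sketch: confirm the expected DynamoDB tables exist.
  # The names below are placeholders -- use the ones DynamoDB-Setup actually creates.
  client = boto3.client("dynamodb")
  existing = set(client.list_tables()["TableNames"])
  for table in ["reddit-posts", "reddit-predictions"]:
      print(f"{table}: {'OK' if table in existing else 'MISSING'}")
  ```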
- Installs - see the prerequisites section on this page for additional information; the steps are essentially:
  - Install the Terraform CLI.
  - Install the AWS CLI, run `aws configure`, and enter your AWS credentials.
  - Install JDK 17 (8, 11, or 17 are compatible with Spark 3.4.0).
  - Add `export JAVA_HOME=$(/usr/libexec/java_home)` to your `.zshrc` so pyspark can find the JDK; a quick way to verify this is sketched after this list.
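
  A small Python check (just a sketch) to confirm the Java setup before launching pyspark:

  ```python
  import os
  import subprocess

  # Sketch: fail fast if JAVA_HOME is missing or java won't run.
  java_home = os.environ.get("JAVA_HOME")
  assert java_home, "JAVA_HOME is not set -- did you reload your .zshrc?"
  # `java -version` prints the runtime version; any of 8, 11, or 17 works for Spark 3.4.0.
  subprocess.run([os.path.join(java_home, "bin", "java"), "-version"], check=True)
  ```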
- Clone this repository.
- You can run the tests locally yourself by doing the following (it is recommended that you manage your Python environments with something like asdf and use python==3.12.3 as your local runtime):

  ```shell
  python -m venv venv          # set up a local virtual env using the current python runtime
  source ./venv/bin/activate   # activate the virtual env
  pip install -e ."[dev]"      # install this package in the local env with its dependencies
  pytest . -r f -s             # -r f shows extra info for failures, -s disables capturing
  ```
- If everything installed without issue, test that pyspark works: open a fresh terminal, type `pyspark`, and hit enter. This depends on having set `JAVA_HOME` in the earlier step. `exit()` out of the shell if it worked; a quick sanity check you can run first is sketched below.
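
  Inside the pyspark shell, any trivial job will confirm the JVM side is wired up; for example (a sketch using the `spark` session the shell pre-defines):

  ```python
  # Run inside the pyspark shell, which pre-defines `spark` (a SparkSession).
  spark.range(10).count()  # a trivial job; should print 10
  ```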
- You need to follow the steps in the Getting Started section for connecting to S3; see also StackOverflow posts like this one for clarifications. The important thing is that you install these 2 JARs in the pyspark classpath and that their versions match each other:
  - The hadoop-aws JAR must match the version of Hadoop required by this version of Spark. Spark 3.4.0 requires Hadoop 3.3.4.
  - The AWS SDK for Java Bundle JAR - for this one you need to find the version that hadoop-aws was built against by looking at its dependencies. For hadoop-aws 3.3.4 this is 1.12.262.

  These are installed by navigating to something like the following:
  ```shell
  cd venv/lib/python3.12/site-packages/pyspark/jars/
  curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
  curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
  ```
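
  With the JARs in place, a minimal read over the `s3a://` scheme confirms the connector works. A sketch follows; the bucket and prefix are hypothetical placeholders, so point it at any small object you can read:

  ```python
  from pyspark.sql import SparkSession

  # Sketch: smoke-test the hadoop-aws (s3a) connector. Credentials come from the
  # default AWS chain (env vars, the ~/.aws files written by `aws configure`, etc.).
  spark = (
      SparkSession.builder
      .appName("s3a-smoke-test")
      .getOrCreate()
  )

  # Hypothetical bucket/prefix -- replace with something you own.
  df = spark.read.text("s3a://your-bucket/some-prefix/")
  df.show(5)
  ```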
- From within this repository run the following:

  ```shell
  terraform init
  terraform workspace new dev   # this should switch you to the dev workspace
  terraform plan -var-file="dev.tfvars" -out=dev-plan.out
  terraform apply dev-plan.out  # applies the saved plan; -var-file is not allowed with a plan file
  ```
  For deploying to prd:

  ```shell
  terraform workspace new prd   # or `terraform workspace select prd` if already created
  terraform plan -var-file="prd.tfvars" -out=prd-plan.out
  terraform apply prd-plan.out
  ```
  On subsequent updates you don't need to `init` or make a new workspace again.