The purpose of this repo is to:
- Conduct Data Science experiments to build a model that predicts viral Reddit posts.
- Build a reusable ETL pipeline for both data science development and model deployment.
- House the Docker files needed to build an image of the model.
- Deploy infrastructure and model on AWS ECS.
- First ensure the DynamoDB tables are set up via DynamoDB-Setup.
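
  If you want to confirm the tables are there before moving on, a quick boto3 check might look like the following sketch (the table names are hypothetical placeholders, not necessarily what DynamoDB-Setup creates):

  ```python
  import boto3

  # Sketch: confirm the expected DynamoDB tables exist.
  # The names below are placeholders -- use the ones DynamoDB-Setup actually creates.
  client = boto3.client("dynamodb")
  existing = set(client.list_tables()["TableNames"])
  for table in ["reddit-posts", "reddit-predictions"]:
      print(f"{table}: {'OK' if table in existing else 'MISSING'}")
  ```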
- Installs - see the prerequisites section on this page for additional information; the steps are essentially:
  - Install the Terraform CLI.
  - Install the AWS CLI, run `aws configure`, and enter your AWS credentials.
  - Install JDK 17 (8, 11, or 17 are compatible with Spark 3.4.0).
  - Add `export JAVA_HOME=$(/usr/libexec/java_home)` to your `.zshrc` so pyspark can find the JDK; a quick way to verify this is sketched after this list.
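
  A small Python check (just a sketch) to confirm the Java setup before launching pyspark:

  ```python
  import os
  import subprocess

  # Sketch: fail fast if JAVA_HOME is missing or java won't run.
  java_home = os.environ.get("JAVA_HOME")
  assert java_home, "JAVA_HOME is not set -- did you reload your .zshrc?"
  # `java -version` prints the runtime version; any of 8, 11, or 17 works for Spark 3.4.0.
  subprocess.run([os.path.join(java_home, "bin", "java"), "-version"], check=True)
  ```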
- Clone this repository.
- You can run the tests locally yourself by doing the following (it is recommended that you manage your Python environments with something like asdf and use python==3.12.3 as your local runtime):

  ```shell
  python -m venv venv          # set up a local virtual env using the current python runtime
  source ./venv/bin/activate   # activate the virtual env
  pip install -e ."[dev]"      # install this package in the local env with its dependencies
  pytest . -r f -s             # -r f shows extra info for failures, -s disables capturing
  ```
- If everything installed without issue, test that pyspark works: open a fresh terminal, type `pyspark`, and hit enter. This depends on having set `JAVA_HOME` in the earlier step. `exit()` out of the shell if it worked; a quick sanity check you can run first is sketched below.
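
  Inside the pyspark shell, any trivial job will confirm the JVM side is wired up; for example (a sketch using the `spark` session the shell pre-defines):

  ```python
  # Run inside the pyspark shell, which pre-defines `spark` (a SparkSession).
  spark.range(10).count()  # a trivial job; should print 10
  ```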
- You need to follow the steps in the Getting Started section for connecting to S3; see also StackOverflow posts like this one for clarifications. The important thing is that you install these 2 JARs in the pyspark classpath and that their versions match each other:
  - The hadoop-aws JAR must match the version of Hadoop required by this version of Spark. Spark 3.4.0 requires Hadoop 3.3.4.
  - The AWS SDK for Java Bundle JAR - for this one you need to find the version that hadoop-aws was built against by looking at its dependencies. For hadoop-aws 3.3.4 this is 1.12.262.

  These are installed by navigating to something like the following:
  ```shell
  cd venv/lib/python3.12/site-packages/pyspark/jars/
  curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
  curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar
  ```
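
  With the JARs in place, a minimal read over the `s3a://` scheme confirms the connector works. A sketch follows; the bucket and prefix are hypothetical placeholders, so point it at any small object you can read:

  ```python
  from pyspark.sql import SparkSession

  # Sketch: smoke-test the hadoop-aws (s3a) connector. Credentials come from the
  # default AWS chain (env vars, the ~/.aws files written by `aws configure`, etc.).
  spark = (
      SparkSession.builder
      .appName("s3a-smoke-test")
      .getOrCreate()
  )

  # Hypothetical bucket/prefix -- replace with something you own.
  df = spark.read.text("s3a://your-bucket/some-prefix/")
  df.show(5)
  ```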
- From within this repository run the following:

  ```shell
  terraform init
  terraform workspace new dev   # this should switch you to the dev workspace
  terraform plan -var-file="dev.tfvars" -out=dev-plan.out
  terraform apply dev-plan.out  # applies the saved plan; -var-file is not allowed with a plan file
  ```
  For deploying to prd:

  ```shell
  terraform workspace new prd   # or `terraform workspace select prd` if already created
  terraform plan -var-file="prd.tfvars" -out=prd-plan.out
  terraform apply prd-plan.out
  ```
  On subsequent updates you don't need to `init` or make a new workspace again.